* [PATCHv2 00/29] TDX Guest: TDX core support
@ 2022-01-24 15:01 Kirill A. Shutemov
  2022-01-24 15:01 ` [PATCHv2 01/29] x86/tdx: Detect running as a TDX guest in early boot Kirill A. Shutemov
                   ` (29 more replies)
  0 siblings, 30 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:01 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

Hi All,

Intel's Trust Domain Extensions (TDX) protects confidential guest VMs
from the host and physical attacks by isolating the guest register
state and by encrypting the guest memory. In TDX, a special TDX module
sits between the host and the guest, and runs in a special mode and
manages the guest/host separation.

Please review and consider applying.

More details of TDX guests can be found in Documentation/x86/tdx.rst.

All dependencies of the patchset are in Linus' tree now.

SEV/TDX comparison:
-------------------

TDX has a lot of similarities to SEV. It enhances confidentiality
of guest memory and state (like registers) and includes a new exception
(#VE) for the same basic reasons as SEV-ES. Like SEV-SNP (not merged
yet), TDX limits the host's ability to make changes in the guest
physical address space.

TDX/VM comparison:
------------------

Some of the key differences between a TD and a regular VM are:

1. Multi-CPU bring-up is done using the ACPI MADT wake-up table.
2. A new #VE exception handler is added. The TDX module injects a #VE
   exception into the guest TD for instructions that need to be emulated,
   disallowed MSR accesses, etc.
3. By default, memory is marked as private, and the TD selectively shares
   it with the VMM as needed.

TDX-related documents can be found at the following link:

https://software.intel.com/content/www/br/pt/develop/articles/intel-trust-domain-extensions.html

Git tree:

https://github.com/intel/tdx.git guest-upstream

Previous version:

https://lore.kernel.org/r/20211214150304.62613-1-kirill.shutemov@linux.intel.com

Changes from v1:
  - Rebased to tip/master (94985da003a4).
  - Address feedback from Borislav and Josh.
  - Wire up KVM hypercalls. Needed to send IPIs.

Andi Kleen (1):
  x86/tdx: Early boot handling of port I/O

Isaku Yamahata (1):
  x86/tdx: ioapic: Add shared bit for IOAPIC base address

Kirill A. Shutemov (16):
  x86/traps: Add #VE support for TDX guest
  x86/tdx: Add HLT support for TDX guests
  x86/tdx: Add MSR support for TDX guests
  x86/tdx: Handle CPUID via #VE
  x86/tdx: Handle in-kernel MMIO
  x86: Consolidate port I/O helpers
  x86/boot: Allow to hook up alternative port I/O helpers
  x86/boot/compressed: Support TDX guest port I/O at decompression time
  x86/tdx: Get page shared bit info from the TDX module
  x86/tdx: Exclude shared bit from __PHYSICAL_MASK
  x86/tdx: Make pages shared in ioremap()
  x86/tdx: Add helper to convert memory between shared and private
  x86/mm/cpa: Add support for TDX shared memory
  x86/kvm: Use bounce buffers for TD guest
  ACPICA: Avoid cache flush on TDX guest
  x86/tdx: Warn about unexpected WBINVD

Kuppuswamy Sathyanarayanan (9):
  x86/tdx: Detect running as a TDX guest in early boot
  x86/tdx: Extend the cc_platform_has() API to support TDX guests
  x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper
    functions
  x86/tdx: Detect TDX at early kernel decompression time
  x86/tdx: Add port I/O emulation
  x86/tdx: Wire up KVM hypercalls
  x86/acpi, x86/boot: Add multiprocessor wake-up support
  x86/topology: Disable CPU online/offline control for TDX guests
  Documentation/x86: Document TDX kernel architecture

Sean Christopherson (2):
  x86/boot: Add a trampoline for booting APs via firmware handoff
  x86/boot: Avoid #VE during boot for TDX platforms

 Documentation/x86/index.rst              |   1 +
 Documentation/x86/tdx.rst                | 194 ++++++++
 arch/x86/Kconfig                         |  15 +
 arch/x86/boot/a20.c                      |  14 +-
 arch/x86/boot/boot.h                     |  35 +-
 arch/x86/boot/compressed/Makefile        |   1 +
 arch/x86/boot/compressed/head_64.S       |  25 +-
 arch/x86/boot/compressed/misc.c          |  26 +-
 arch/x86/boot/compressed/misc.h          |   4 +-
 arch/x86/boot/compressed/pgtable.h       |   2 +-
 arch/x86/boot/compressed/tdcall.S        |   3 +
 arch/x86/boot/compressed/tdx.c           |  88 ++++
 arch/x86/boot/compressed/tdx.h           |  16 +
 arch/x86/boot/cpuflags.c                 |   3 +-
 arch/x86/boot/cpuflags.h                 |   1 +
 arch/x86/boot/early_serial_console.c     |  28 +-
 arch/x86/boot/io.h                       |  28 ++
 arch/x86/boot/main.c                     |   4 +
 arch/x86/boot/pm.c                       |  10 +-
 arch/x86/boot/tty.c                      |   4 +-
 arch/x86/boot/video-vga.c                |   6 +-
 arch/x86/boot/video.h                    |   8 +-
 arch/x86/include/asm/acenv.h             |  16 +-
 arch/x86/include/asm/apic.h              |   7 +
 arch/x86/include/asm/cpufeatures.h       |   1 +
 arch/x86/include/asm/disabled-features.h |   8 +-
 arch/x86/include/asm/idtentry.h          |   4 +
 arch/x86/include/asm/io.h                |  22 +-
 arch/x86/include/asm/kvm_para.h          |  22 +
 arch/x86/include/asm/mem_encrypt.h       |   8 +
 arch/x86/include/asm/pgtable.h           |  19 +-
 arch/x86/include/asm/realmode.h          |   1 +
 arch/x86/include/asm/set_memory.h        |   1 -
 arch/x86/include/asm/shared/io.h         |  32 ++
 arch/x86/include/asm/shared/tdx.h        |  30 ++
 arch/x86/include/asm/tdx.h               |  92 ++++
 arch/x86/kernel/Makefile                 |   4 +
 arch/x86/kernel/acpi/boot.c              | 114 +++++
 arch/x86/kernel/apic/apic.c              |  10 +
 arch/x86/kernel/apic/io_apic.c           |  18 +-
 arch/x86/kernel/asm-offsets.c            |  20 +
 arch/x86/kernel/cc_platform.c            |  43 +-
 arch/x86/kernel/head64.c                 |   7 +
 arch/x86/kernel/head_64.S                |  24 +-
 arch/x86/kernel/idt.c                    |   3 +
 arch/x86/kernel/process.c                |   5 +
 arch/x86/kernel/smpboot.c                |  12 +-
 arch/x86/kernel/tdcall.S                 | 300 ++++++++++++
 arch/x86/kernel/tdx.c                    | 592 +++++++++++++++++++++++
 arch/x86/kernel/traps.c                  | 110 +++++
 arch/x86/mm/ioremap.c                    |   5 +
 arch/x86/mm/mem_encrypt.c                |   9 +-
 arch/x86/mm/mem_encrypt_amd.c            |  10 +-
 arch/x86/mm/pat/set_memory.c             |  44 +-
 arch/x86/realmode/rm/header.S            |   1 +
 arch/x86/realmode/rm/trampoline_64.S     |  63 ++-
 arch/x86/realmode/rm/trampoline_common.S |  12 +-
 arch/x86/realmode/rm/wakemain.c          |  14 +-
 include/linux/cc_platform.h              |  19 +
 kernel/cpu.c                             |   3 +
 60 files changed, 2079 insertions(+), 142 deletions(-)
 create mode 100644 Documentation/x86/tdx.rst
 create mode 100644 arch/x86/boot/compressed/tdcall.S
 create mode 100644 arch/x86/boot/compressed/tdx.c
 create mode 100644 arch/x86/boot/compressed/tdx.h
 create mode 100644 arch/x86/boot/io.h
 create mode 100644 arch/x86/include/asm/shared/io.h
 create mode 100644 arch/x86/include/asm/shared/tdx.h
 create mode 100644 arch/x86/include/asm/tdx.h
 create mode 100644 arch/x86/kernel/tdcall.S
 create mode 100644 arch/x86/kernel/tdx.c

-- 
2.34.1



* [PATCHv2 01/29] x86/tdx: Detect running as a TDX guest in early boot
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
@ 2022-01-24 15:01 ` Kirill A. Shutemov
  2022-02-01 19:29   ` Thomas Gleixner
  2022-01-24 15:01 ` [PATCHv2 02/29] x86/tdx: Extend the cc_platform_has() API to support TDX guests Kirill A. Shutemov
                   ` (28 subsequent siblings)
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:01 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

The cc_platform_has() API is used in the kernel to enable confidential
computing features. Since a TDX guest is a confidential computing
platform, it also needs to use this API.

In preparation for extending the cc_platform_has() API to support TDX
guests, use the CPUID instruction to detect a TDX guest in the early
boot code (via tdx_early_init()). Since copy_bootdata() is the first
user of the cc_platform_has() API, detect the TDX guest status before it.

Since the cc_platform_has() API will be used frequently across the boot
code, detect the TDX guest status once and cache the result instead of
repeatedly executing the CPUID instruction. Add a function
(is_tdx_guest()) to read the cached TDX guest status in CC APIs.
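
As an illustration of the detect-once-and-cache scheme, here is a minimal
user-space sketch in C. It assumes the signature layout used by
tdx_early_init() in this patch (EBX, EDX, ECX of CPUID leaf 0x21
concatenated into "IntelTDX    "). The names pack4(), tdx_sig_matches(),
tdx_detect_once() and tdx_guest_status() are hypothetical and exist only
in this sketch; register values are passed in rather than read with CPUID
so that it can run anywhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TDX_IDENT "IntelTDX    "	/* 12-byte signature, as in the patch */

/* Hypothetical helper: pack 4 ASCII bytes the way CPUID returns them. */
static uint32_t pack4(const char *s)
{
	uint32_t v;

	assert(strlen(s) >= sizeof(v));
	memcpy(&v, s, sizeof(v));
	return v;
}

/* Compare EBX:EDX:ECX against the signature (same order as the patch). */
static bool tdx_sig_matches(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
	uint32_t sig[3] = { ebx, edx, ecx };

	return memcmp(TDX_IDENT, sig, 12) == 0;
}

/* Cached result: set at most once by tdx_detect_once(), read afterwards. */
static bool tdx_guest_cached;

static void tdx_detect_once(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
	if (tdx_sig_matches(ebx, ecx, edx))
		tdx_guest_cached = true;
}

/* Cheap accessor, playing the role of is_tdx_guest() in this sketch. */
static bool tdx_guest_status(void)
{
	return tdx_guest_cached;
}
```

Callers only ever read the cached flag; the signature comparison runs a
single time during early initialization.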

Define a synthetic feature flag (X86_FEATURE_TDX_GUEST) and set this
bit on a valid TDX guest platform. This feature bit will be used for
TDX-specific handling in areas of the arch code where a function call
to check for TDX guest status is not cost-effective (for example, TDX
hypercall support).
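
To make the feature-flag encoding concrete: the patch defines
X86_FEATURE_TDX_GUEST as (8*32+22), i.e. bit 22 of capability word 8, and
the disabled-features mask uses (1 << (X86_FEATURE_TDX_GUEST & 31)). The
user-space C sketch below only reproduces this arithmetic; feature_word(),
feature_mask() and disabled_mask8() are hypothetical helper names, not
kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

#define X86_FEATURE_TDX_GUEST	(8 * 32 + 22)	/* value from the patch */

/* Hypothetical helper: index of the 32-bit capability word. */
static unsigned int feature_word(unsigned int feature)
{
	return feature / 32;
}

/* Hypothetical helper: bit mask of the feature within its word. */
static uint32_t feature_mask(unsigned int feature)
{
	return UINT32_C(1) << (feature & 31);
}

/* Models DISABLED_MASK8 for both settings of CONFIG_INTEL_TDX_GUEST. */
static uint32_t disabled_mask8(int tdx_guest_enabled)
{
	assert(tdx_guest_enabled == 0 || tdx_guest_enabled == 1);
	return tdx_guest_enabled ? 0 : feature_mask(X86_FEATURE_TDX_GUEST);
}
```

With the config option disabled, the bit is part of the word-8 disabled
mask, so checks against the flag can be resolved at compile time.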

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                         | 12 +++++++++
 arch/x86/include/asm/cpufeatures.h       |  1 +
 arch/x86/include/asm/disabled-features.h |  8 +++++-
 arch/x86/include/asm/tdx.h               | 23 ++++++++++++++++++
 arch/x86/kernel/Makefile                 |  1 +
 arch/x86/kernel/head64.c                 |  4 +++
 arch/x86/kernel/tdx.c                    | 31 ++++++++++++++++++++++++
 7 files changed, 79 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/asm/tdx.h
 create mode 100644 arch/x86/kernel/tdx.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6fddb63271d9..09e6744af3f8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -880,6 +880,18 @@ config ACRN_GUEST
 	  IOT with small footprint and real-time features. More details can be
 	  found in https://projectacrn.org/.
 
+config INTEL_TDX_GUEST
+	bool "Intel TDX (Trust Domain Extensions) - Guest Support"
+	depends on X86_64 && CPU_SUP_INTEL
+	depends on X86_X2APIC
+	help
+	  Support running as a guest under Intel TDX.  Without this support,
+	  the guest kernel can not boot or run under TDX.
+	  TDX includes memory encryption and integrity capabilities
+	  which protect the confidentiality and integrity of guest
+	  memory contents and CPU state. TDX guests are protected from
+	  potential attacks from the VMM.
+
 endif #HYPERVISOR_GUEST
 
 source "arch/x86/Kconfig.cpu"
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 6db4e2932b3d..defed3bd543b 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -238,6 +238,7 @@
 #define X86_FEATURE_VMW_VMMCALL		( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
 #define X86_FEATURE_PVUNLOCK		( 8*32+20) /* "" PV unlock function */
 #define X86_FEATURE_VCPUPREEMPT		( 8*32+21) /* "" PV vcpu_is_preempted function */
+#define X86_FEATURE_TDX_GUEST		( 8*32+22) /* Intel Trust Domain Extensions Guest */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE		( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 8f28fafa98b3..f556086e6093 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -65,6 +65,12 @@
 # define DISABLE_SGX	(1 << (X86_FEATURE_SGX & 31))
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+# define DISABLE_TDX_GUEST	0
+#else
+# define DISABLE_TDX_GUEST	(1 << (X86_FEATURE_TDX_GUEST & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -76,7 +82,7 @@
 #define DISABLED_MASK5	0
 #define DISABLED_MASK6	0
 #define DISABLED_MASK7	(DISABLE_PTI)
-#define DISABLED_MASK8	0
+#define DISABLED_MASK8	(DISABLE_TDX_GUEST)
 #define DISABLED_MASK9	(DISABLE_SMAP|DISABLE_SGX)
 #define DISABLED_MASK10	0
 #define DISABLED_MASK11	0
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
new file mode 100644
index 000000000000..e375a950a033
--- /dev/null
+++ b/arch/x86/include/asm/tdx.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2021-2022 Intel Corporation */
+#ifndef _ASM_X86_TDX_H
+#define _ASM_X86_TDX_H
+
+#include <linux/init.h>
+
+#define TDX_CPUID_LEAF_ID	0x21
+#define TDX_IDENT		"IntelTDX    "
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+
+void __init tdx_early_init(void);
+bool is_tdx_guest(void);
+
+#else
+
+static inline void tdx_early_init(void) { };
+static inline bool is_tdx_guest(void) { return false; }
+
+#endif /* CONFIG_INTEL_TDX_GUEST */
+
+#endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 6aef9ee28a39..211d9fcdd729 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -130,6 +130,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index de563db9cdcd..1cb6346ec3d1 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -40,6 +40,7 @@
 #include <asm/extable.h>
 #include <asm/trapnr.h>
 #include <asm/sev.h>
+#include <asm/tdx.h>
 
 /*
  * Manage page tables very early on.
@@ -516,6 +517,9 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 
 	copy_bootdata(__va(real_mode_data));
 
+	/* Needed before cc_platform_has() can be used for TDX */
+	tdx_early_init();
+
 	/*
 	 * Load microcode early on BSP.
 	 */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
new file mode 100644
index 000000000000..1ef6979a6434
--- /dev/null
+++ b/arch/x86/kernel/tdx.c
@@ -0,0 +1,31 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2021-2022 Intel Corporation */
+
+#undef pr_fmt
+#define pr_fmt(fmt)     "tdx: " fmt
+
+#include <linux/cpufeature.h>
+#include <asm/tdx.h>
+
+static bool tdx_guest_detected __ro_after_init;
+
+bool is_tdx_guest(void)
+{
+	return tdx_guest_detected;
+}
+
+void __init tdx_early_init(void)
+{
+	u32 eax, sig[3];
+
+	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2],  &sig[1]);
+
+	if (memcmp(TDX_IDENT, sig, 12))
+		return;
+
+	tdx_guest_detected = true;
+
+	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
+
+	pr_info("Guest detected\n");
+}
-- 
2.34.1



* [PATCHv2 02/29] x86/tdx: Extend the cc_platform_has() API to support TDX guests
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
  2022-01-24 15:01 ` [PATCHv2 01/29] x86/tdx: Detect running as a TDX guest in early boot Kirill A. Shutemov
@ 2022-01-24 15:01 ` Kirill A. Shutemov
  2022-02-01 19:31   ` Thomas Gleixner
  2022-01-24 15:01 ` [PATCHv2 03/29] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions Kirill A. Shutemov
                   ` (27 subsequent siblings)
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:01 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Confidential Computing (CC) features (like string I/O unroll support,
memory encryption/decryption support, etc.) are conditionally enabled
in the kernel using the cc_platform_has() API. Since TDX guests also
need to use these CC features, extend the cc_platform_has() API and add
support for TDX guest-specific CC attributes.

Use the is_tdx_guest() API to detect the TDX guest status and return
TDX-specific CC attributes. To enable use of CC APIs in the TDX guest,
select ARCH_HAS_CC_PLATFORM in the CONFIG_INTEL_TDX_GUEST case.

This is a preparatory patch and just creates the framework for adding
TDX guest specific CC attributes.

Since the is_tdx_guest() function (through the cc_platform_has() API) is
used in the early boot code, disable the instrumentation flags and
function tracer for tdx.o. This is similar to what is done for AMD SEV
and cc_platform.c.

Since intel_cc_platform_has() function only gets called when
is_tdx_guest() is true (valid CONFIG_INTEL_TDX_GUEST case), remove the
redundant #ifdef in intel_cc_platform_has().

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig              | 1 +
 arch/x86/kernel/Makefile      | 3 +++
 arch/x86/kernel/cc_platform.c | 9 ++++-----
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 09e6744af3f8..1491f25c844e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
 	bool "Intel TDX (Trust Domain Extensions) - Guest Support"
 	depends on X86_64 && CPU_SUP_INTEL
 	depends on X86_X2APIC
+	select ARCH_HAS_CC_PLATFORM
 	help
 	  Support running as a guest under Intel TDX.  Without this support,
 	  the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 211d9fcdd729..67415037c33c 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -22,6 +22,7 @@ CFLAGS_REMOVE_early_printk.o = -pg
 CFLAGS_REMOVE_head64.o = -pg
 CFLAGS_REMOVE_sev.o = -pg
 CFLAGS_REMOVE_cc_platform.o = -pg
+CFLAGS_REMOVE_tdx.o = -pg
 endif
 
 KASAN_SANITIZE_head$(BITS).o				:= n
@@ -31,6 +32,7 @@ KASAN_SANITIZE_stacktrace.o				:= n
 KASAN_SANITIZE_paravirt.o				:= n
 KASAN_SANITIZE_sev.o					:= n
 KASAN_SANITIZE_cc_platform.o				:= n
+KASAN_SANITIZE_tdx.o					:= n
 
 # With some compiler versions the generated code results in boot hangs, caused
 # by several compilation units. To be safe, disable all instrumentation.
@@ -50,6 +52,7 @@ KCOV_INSTRUMENT		:= n
 
 CFLAGS_head$(BITS).o	+= -fno-stack-protector
 CFLAGS_cc_platform.o	+= -fno-stack-protector
+CFLAGS_tdx.o		+= -fno-stack-protector
 
 CFLAGS_irq.o := -I $(srctree)/$(src)/../include/asm/trace
 
diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index 6a6ffcd978f6..c72b3919bca9 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -13,14 +13,11 @@
 
 #include <asm/mshyperv.h>
 #include <asm/processor.h>
+#include <asm/tdx.h>
 
-static bool __maybe_unused intel_cc_platform_has(enum cc_attr attr)
+static bool intel_cc_platform_has(enum cc_attr attr)
 {
-#ifdef CONFIG_INTEL_TDX_GUEST
 	return false;
-#else
-	return false;
-#endif
 }
 
 /*
@@ -76,6 +73,8 @@ bool cc_platform_has(enum cc_attr attr)
 {
 	if (sme_me_mask)
 		return amd_cc_platform_has(attr);
+	else if (is_tdx_guest())
+		return intel_cc_platform_has(attr);
 
 	if (hv_is_isolation_supported())
 		return hyperv_cc_platform_has(attr);
-- 
2.34.1



* [PATCHv2 03/29] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
  2022-01-24 15:01 ` [PATCHv2 01/29] x86/tdx: Detect running as a TDX guest in early boot Kirill A. Shutemov
  2022-01-24 15:01 ` [PATCHv2 02/29] x86/tdx: Extend the cc_platform_has() API to support TDX guests Kirill A. Shutemov
@ 2022-01-24 15:01 ` Kirill A. Shutemov
  2022-02-01 19:58   ` Thomas Gleixner
  2022-01-24 15:01 ` [PATCHv2 04/29] x86/traps: Add #VE support for TDX guest Kirill A. Shutemov
                   ` (26 subsequent siblings)
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:01 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Guests communicate with VMMs using hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs,
like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
expose the guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with the VMM, the TDX
specification defines a new instruction called TDCALL.

In a TDX-based VM, since the VMM is an untrusted entity, an intermediary
layer -- the TDX module -- facilitates secure communication between the
host and the guest. The TDX module is loaded like firmware into a special
CPU mode called SEAM. TDX guests communicate with the TDX module using
the TDCALL instruction.

A guest uses TDCALL to communicate with both the TDX module and VMM.
The value of the RAX register when executing the TDCALL instruction is
used to determine the TDCALL type. A variant of TDCALL used to communicate
with the VMM is called TDVMCALL.

Add generic interfaces to communicate with the TDX module and VMM
(using the TDCALL instruction).

__tdx_hypercall()    - Used by the guest to request services from the
		       VMM (via TDVMCALL).
__tdx_module_call()  - Used to communicate with the TDX module (via
		       TDCALL).

Also define an additional wrapper, _tdx_hypercall(), which adds error
handling for TDCALL failures.

The __tdx_module_call() and __tdx_hypercall() helper functions are
implemented in assembly in a .S file.  The TDCALL ABI requires
shuffling arguments in and out of registers, which proved to be
awkward with inline assembly.
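
The register shuffling for __tdx_module_call() can be modelled in plain
C. This is only a sketch of the mapping documented in tdcall.S in this
patch (function-call arguments in RDI, RSI, RDX, RCX, R8 become TDCALL
inputs in RAX, RCX, RDX, R8, R9); struct sysv_args, struct tdcall_regs
and to_tdcall_abi() are hypothetical names used for illustration, and no
TDCALL is executed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Function-call view: registers holding the first five C arguments. */
struct sysv_args {
	uint64_t rdi, rsi, rdx, rcx, r8;
};

/* TDCALL view: input registers consumed by the TDX module. */
struct tdcall_regs {
	uint64_t rax, rcx, rdx, r8, r9;
};

/* Mirror the mov sequence of __tdx_module_call(). */
static void to_tdcall_abi(const struct sysv_args *in, struct tdcall_regs *out)
{
	assert(in != NULL && out != NULL);

	out->rax = in->rdi;	/* TDCALL leaf ID */
	out->r9  = in->r8;	/* input 4 */
	out->r8  = in->rcx;	/* input 3 */
	out->rcx = in->rsi;	/* input 1 */
	out->rdx = in->rdx;	/* input 2 stays in RDX */
}
```

In the assembly the moves happen in place, so their order matters: R8 has
to be copied to R9 before it is overwritten with input 3, and RCX has to
be copied to R8 before it is overwritten with input 1.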

Just like syscalls, not all TDVMCALL use cases need to use the same
number of argument registers. The implementation here picks the current
worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer
than 4 arguments, there will end up being a few superfluous (cheap)
instructions. But, this approach maximizes code reuse.

For the registers used by the TDCALL instruction, see the TDX GHCI
specification, sections titled "TDCALL instruction" and "TDG.VP.VMCALL
Interface".
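
For orientation, the sketch below models the input side of a TDVMCALL as
described in the ABI comment in tdcall.S: RAX is 0 (TDG.VP.VMCALL), RCX
carries the bitmap of registers exposed to the VMM, R10 the type, R11 the
sub-function, and R12-R15 the arguments. Loading RCX with the R10-R15
mask is an assumption based on that comment and on
TDVMCALL_EXPOSE_REGS_MASK; struct tdvmcall_regs and tdvmcall_setup() are
hypothetical names, and nothing here issues a TDCALL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BIT(n)		(UINT64_C(1) << (n))

/* Registers exposed to the VMM: R10..R15, one bit per register ID. */
#define TDVMCALL_EXPOSE_REGS_MASK \
	(BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15))

/* Hypothetical model of the TDVMCALL input registers. */
struct tdvmcall_regs {
	uint64_t rax, rcx, r10, r11, r12, r13, r14, r15;
};

static void tdvmcall_setup(struct tdvmcall_regs *regs, uint64_t type,
			   uint64_t fn, uint64_t a1, uint64_t a2,
			   uint64_t a3, uint64_t a4)
{
	assert(regs != NULL);

	regs->rax = 0;				/* TDCALL leaf 0: TDG.VP.VMCALL */
	regs->rcx = TDVMCALL_EXPOSE_REGS_MASK;	/* assumption, see above */
	regs->r10 = type;			/* 0: standard TDX ABI */
	regs->r11 = fn;				/* VMCALL sub-function */
	regs->r12 = a1;
	regs->r13 = a2;
	regs->r14 = a3;
	regs->r15 = a4;
}
```

The mask evaluates to 0xfc00, i.e. only R10-R15 are passed to the VMM and
back; all other guest registers stay hidden from the host.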

Based on a previous patch by Sean Christopherson.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/tdx.h    |  40 +++++
 arch/x86/kernel/Makefile      |   2 +-
 arch/x86/kernel/asm-offsets.c |  20 +++
 arch/x86/kernel/tdcall.S      | 269 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/tdx.c         |  23 +++
 5 files changed, 353 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/tdcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index e375a950a033..5107a4d9ba8f 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,11 +8,51 @@
 #define TDX_CPUID_LEAF_ID	0x21
 #define TDX_IDENT		"IntelTDX    "
 
+#define TDX_HYPERCALL_STANDARD  0
+
+/*
+ * Used in __tdx_module_call() to gather the output registers'
+ * values of the TDCALL instruction when requesting services from
+ * the TDX module. This is a software only structure and not part
+ * of the TDX module/VMM ABI
+ */
+struct tdx_module_output {
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+};
+
+/*
+ * Used in __tdx_hypercall() to gather the output registers' values
+ * of the TDCALL instruction when requesting services from the VMM.
+ * This is a software only structure and not part of the TDX
+ * module/VMM ABI.
+ */
+struct tdx_hypercall_output {
+	u64 r10;
+	u64 r11;
+	u64 r12;
+	u64 r13;
+	u64 r14;
+	u64 r15;
+};
+
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 void __init tdx_early_init(void);
 bool is_tdx_guest(void);
 
+/* Used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		      struct tdx_module_output *out);
+
+/* Used to request services from the VMM */
+u64 __tdx_hypercall(u64 type, u64 fn, u64 r12, u64 r13, u64 r14,
+		    u64 r15, struct tdx_hypercall_output *out);
+
 #else
 
 static inline void tdx_early_init(void) { };
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 67415037c33c..ce3e044f7f12 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -133,7 +133,7 @@ obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_JAILHOUSE_GUEST)	+= jailhouse.o
-obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdx.o
+obj-$(CONFIG_INTEL_TDX_GUEST)	+= tdcall.o tdx.o
 
 obj-$(CONFIG_EISA)		+= eisa.o
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 9fb0a2f8b62a..8a3c6b34be7d 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -18,6 +18,7 @@
 #include <asm/bootparam.h>
 #include <asm/suspend.h>
 #include <asm/tlbflush.h>
+#include <asm/tdx.h>
 
 #ifdef CONFIG_XEN
 #include <xen/interface/xen.h>
@@ -65,6 +66,25 @@ static void __used common(void)
 	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+	BLANK();
+	/* Offset for fields in tdx_module_output */
+	OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
+	OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
+	OFFSET(TDX_MODULE_r8,  tdx_module_output, r8);
+	OFFSET(TDX_MODULE_r9,  tdx_module_output, r9);
+	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
+	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
+
+	/* Offset for fields in tdx_hypercall_output */
+	OFFSET(TDX_HYPERCALL_r10, tdx_hypercall_output, r10);
+	OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_output, r11);
+	OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_output, r12);
+	OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_output, r13);
+	OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_output, r14);
+	OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_output, r15);
+#endif
+
 	BLANK();
 	OFFSET(BP_scratch, boot_params, scratch);
 	OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
new file mode 100644
index 000000000000..46a49a96cf6c
--- /dev/null
+++ b/arch/x86/kernel/tdcall.S
@@ -0,0 +1,269 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/asm-offsets.h>
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/unwind_hints.h>
+
+#include <linux/linkage.h>
+#include <linux/bits.h>
+#include <linux/errno.h>
+
+/*
+ * Bitmasks of exposed registers (with VMM).
+ */
+#define TDX_R10		BIT(10)
+#define TDX_R11		BIT(11)
+#define TDX_R12		BIT(12)
+#define TDX_R13		BIT(13)
+#define TDX_R14		BIT(14)
+#define TDX_R15		BIT(15)
+
+/* Frame offset + 8 (for arg1) */
+#define ARG7_SP_OFFSET		(FRAME_OFFSET + 0x08)
+
+/*
+ * These registers are clobbered to hold arguments for each
+ * TDVMCALL. They are safe to expose to the VMM.
+ * Each bit in this mask represents a register ID. Bit field
+ * details can be found in TDX GHCI specification, section
+ * titled "TDCALL [TDG.VP.VMCALL] leaf".
+ */
+#define TDVMCALL_EXPOSE_REGS_MASK	( TDX_R10 | TDX_R11 | \
+					  TDX_R12 | TDX_R13 | \
+					  TDX_R14 | TDX_R15 )
+
+/*
+ * TDX guests use the TDCALL instruction to make requests to the
+ * TDX module and hypercalls to the VMM. It is supported in
+ * Binutils >= 2.36.
+ */
+#define tdcall .byte 0x66,0x0f,0x01,0xcc
+
+/*
+ * __tdx_module_call()  - Used by TDX guests to request services from
+ * the TDX module (does not include VMM services).
+ *
+ * Transforms function call register arguments into the TDCALL
+ * register ABI.  After TDCALL operation, TDX module output is saved
+ * in @out (if it is provided by the user)
+ *
+ *-------------------------------------------------------------------------
+ * TDCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX                 - TDCALL Leaf number.
+ * RCX,RDX,R8-R9       - TDCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX                 - TDCALL instruction error code.
+ * RCX,RDX,R8-R11      - TDCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __tdx_module_call() function ABI:
+ *
+ * @fn  (RDI)          - TDCALL Leaf ID,    moved to RAX
+ * @rcx (RSI)          - Input parameter 1, moved to RCX
+ * @rdx (RDX)          - Input parameter 2, moved to RDX
+ * @r8  (RCX)          - Input parameter 3, moved to R8
+ * @r9  (R8)           - Input parameter 4, moved to R9
+ *
+ * @out (R9)           - struct tdx_module_output pointer
+ *                       stored temporarily in R12 (not
+ *                       shared with the TDX module). It
+ *                       can be NULL.
+ *
+ * Return status of TDCALL via RAX.
+ */
+SYM_FUNC_START(__tdx_module_call)
+	FRAME_BEGIN
+
+	/*
+	 * R12 will be used as temporary storage for
+	 * struct tdx_module_output pointer. Since R12-R15
+	 * registers are not used by TDCALL services supported
+	 * by this function, it can be reused.
+	 */
+
+	/* Callee saved, so preserve it */
+	push %r12
+
+	/*
+	 * Push output pointer to stack.
+	 * After the TDCALL operation, it will be fetched
+	 * into R12 register.
+	 */
+	push %r9
+
+	/* Mangle function call ABI into TDCALL ABI: */
+	/* Move TDCALL Leaf ID to RAX */
+	mov %rdi, %rax
+	/* Move input 4 to R9 */
+	mov %r8,  %r9
+	/* Move input 3 to R8 */
+	mov %rcx, %r8
+	/* Move input 1 to RCX */
+	mov %rsi, %rcx
+	/* Leave input param 2 in RDX */
+
+	tdcall
+
+	/*
+	 * Fetch output pointer from stack to R12 (It is used
+	 * as temporary storage)
+	 */
+	pop %r12
+
+	/* Check for TDCALL success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz .Lno_output_struct
+
+	/*
+	 * Since this function can be initiated without an output pointer,
+	 * check if caller provided an output struct before storing
+	 * output registers.
+	 */
+	test %r12, %r12
+	jz .Lno_output_struct
+
+	/* Copy TDCALL result registers to output struct: */
+	movq %rcx, TDX_MODULE_rcx(%r12)
+	movq %rdx, TDX_MODULE_rdx(%r12)
+	movq %r8,  TDX_MODULE_r8(%r12)
+	movq %r9,  TDX_MODULE_r9(%r12)
+	movq %r10, TDX_MODULE_r10(%r12)
+	movq %r11, TDX_MODULE_r11(%r12)
+
+.Lno_output_struct:
+	/* Restore the state of R12 register */
+	pop %r12
+
+	FRAME_END
+	ret
+SYM_FUNC_END(__tdx_module_call)
+
+/*
+ * __tdx_hypercall() - Make hypercalls to a TDX VMM.
+ *
+ * Transforms function call register arguments into the TDCALL
+ * register ABI.  After TDCALL operation, VMM output is saved in @out.
+ *
+ *-------------------------------------------------------------------------
+ * TD VMCALL ABI:
+ *-------------------------------------------------------------------------
+ *
+ * Input Registers:
+ *
+ * RAX                 - TDCALL instruction leaf number (0 - TDG.VP.VMCALL)
+ * RCX                 - BITMAP which controls which part of TD Guest GPR
+ *                       is passed as-is to the VMM and back.
+ * R10                 - Set 0 to indicate TDCALL follows standard TDX ABI
+ *                       specification. Non zero value indicates vendor
+ *                       specific ABI.
+ * R11                 - VMCALL sub function number
+ * RBX, RBP, RDI, RSI  - Used to pass VMCALL sub function specific arguments.
+ * R8-R9, R12-R15      - Same as above.
+ *
+ * Output Registers:
+ *
+ * RAX                 - TDCALL instruction status (Not related to hypercall
+ *                        output).
+ * R10                 - Hypercall output error code.
+ * R11-R15             - Hypercall sub function specific output values.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __tdx_hypercall() function ABI:
+ *
+ * @type  (RDI)        - TD VMCALL type, moved to R10
+ * @fn    (RSI)        - TD VMCALL sub function, moved to R11
+ * @r12   (RDX)        - Input parameter 1, moved to R12
+ * @r13   (RCX)        - Input parameter 2, moved to R13
+ * @r14   (R8)         - Input parameter 3, moved to R14
+ * @r15   (R9)         - Input parameter 4, moved to R15
+ *
+ * @out   (stack)      - struct tdx_hypercall_output pointer (cannot be NULL)
+ *
+ * Return the TDCALL status on completion, or -EINVAL if the @out
+ * pointer is NULL.
+ */
+SYM_FUNC_START(__tdx_hypercall)
+	FRAME_BEGIN
+
+	/* Move argument 7 from caller stack to RAX */
+	movq ARG7_SP_OFFSET(%rsp), %rax
+
+	/* Check if caller provided an output struct */
+	test %rax, %rax
+	/* If out pointer is NULL, return -EINVAL */
+	jz .Lret_err
+
+	/* Save callee-saved GPRs as mandated by the x86_64 ABI */
+	push %r15
+	push %r14
+	push %r13
+	push %r12
+
+	/*
+	 * Save output pointer (rax) on the stack, it will be used again
+	 * when storing the output registers after the TDCALL operation.
+	 */
+	push %rax
+
+	/* Mangle function call ABI into TDCALL ABI: */
+	/* Set TDCALL leaf ID (TDVMCALL (0)) in RAX */
+	xor %eax, %eax
+	/* Move TDVMCALL type (standard vs vendor) in R10 */
+	mov %rdi, %r10
+	/* Move TDVMCALL sub function id to R11 */
+	mov %rsi, %r11
+	/* Move input 1 to R12 */
+	mov %rdx, %r12
+	/* Move input 2 to R13 */
+	mov %rcx, %r13
+	/* Move input 3 to R14 */
+	mov %r8,  %r14
+	/* Move input 4 to R15 */
+	mov %r9,  %r15
+
+	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
+
+	tdcall
+
+	/* Restore output pointer to R9 */
+	pop  %r9
+
+	/* Copy hypercall result registers to output struct: */
+	movq %r10, TDX_HYPERCALL_r10(%r9)
+	movq %r11, TDX_HYPERCALL_r11(%r9)
+	movq %r12, TDX_HYPERCALL_r12(%r9)
+	movq %r13, TDX_HYPERCALL_r13(%r9)
+	movq %r14, TDX_HYPERCALL_r14(%r9)
+	movq %r15, TDX_HYPERCALL_r15(%r9)
+
+	/*
+	 * Zero out registers exposed to the VMM to avoid
+	 * speculative execution with VMM-controlled values.
+	 * This needs to include all registers present in
+	 * TDVMCALL_EXPOSE_REGS_MASK (except R12-R15).
+	 * R12-R15 context will be restored.
+	 */
+	xor %r10d, %r10d
+	xor %r11d, %r11d
+
+	/* Restore callee-saved GPRs as mandated by the x86_64 ABI */
+	pop %r12
+	pop %r13
+	pop %r14
+	pop %r15
+
+	jmp .Lhcall_done
+.Lret_err:
+	movq $-EINVAL, %rax
+.Lhcall_done:
+	FRAME_END
+
+	retq
+SYM_FUNC_END(__tdx_hypercall)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 1ef6979a6434..d40b6df51e26 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -9,6 +9,29 @@
 
 static bool tdx_guest_detected __ro_after_init;
 
+/*
+ * Wrapper for standard use of __tdx_hypercall with panic report
+ * for TDCALL error.
+ */
+static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
+				 u64 r15, struct tdx_hypercall_output *out)
+{
+	struct tdx_hypercall_output dummy_out;
+	u64 err;
+
+	/* __tdx_hypercall() does not accept NULL output pointer */
+	if (!out)
+		out = &dummy_out;
+
+	/* Non zero return value indicates buggy TDX module, so panic */
+	err = __tdx_hypercall(TDX_HYPERCALL_STANDARD, fn, r12, r13, r14,
+			      r15, out);
+	if (err)
+		panic("Hypercall fn %llu failed (Buggy TDX module!)\n", fn);
+
+	return out->r10;
+}
+
 bool is_tdx_guest(void)
 {
 	return tdx_guest_detected;
-- 
2.34.1



* [PATCHv2 04/29] x86/traps: Add #VE support for TDX guest
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2022-01-24 15:01 ` [PATCHv2 03/29] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions Kirill A. Shutemov
@ 2022-01-24 15:01 ` Kirill A. Shutemov
  2022-02-01 21:02   ` Thomas Gleixner
  2022-01-24 15:01 ` [PATCHv2 05/29] x86/tdx: Add HLT support for TDX guests Kirill A. Shutemov
                   ` (25 subsequent siblings)
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:01 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov, Sean Christopherson

Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:

 * Specific instructions (WBINVD, for example)
 * Specific MSR accesses
 * Specific CPUID leaf accesses
 * Access to unmapped pages (EPT violation)

In the settings that Linux will run in, virtualization exceptions are
never generated on accesses to normal, TD-private memory that has been
accepted.

Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard to debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause issues. Returning from the exception handler with
IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.

For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory
or MMIO regions, and do not use MSRs, instructions, or CPUID leaves
that might generate #VE. The VMM can remove memory from the TD at any
point, but access to unaccepted (or missing) private memory leads to
VM termination, not to #VE.

Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.

During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
the VE info.

If a guest kernel action which would normally cause a #VE occurs in
the interrupt-disabled region before TDGETVEINFO, a #DF (double fault
exception) is delivered to the guest, which will result in an oops.

Add basic infrastructure to handle any #VE which occurs in the kernel
or userspace. Later patches will add handling for specific #VE
scenarios.

For now, convert unhandled #VEs (everything, until later in this
series) so that they appear just like a #GP by calling
ve_raise_fault() directly. The ve_raise_fault() function is similar
to the #GP handler: it sends SIGSEGV to user space, and for kernel
faults it notifies debuggers and other die chain users before dying.
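
As a rough illustration of the control flow described above (not the
kernel code itself), the handler can be modelled in plain user-space C.
The model_* names, the simplified structures and the callback parameter
are assumptions made for this sketch. Only the ordering is taken from
the patch: fetch the VE info, dispatch, advance the IP on success, and
otherwise fall back to the #GP-like path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for pt_regs and struct ve_info. */
struct model_regs {
	uint64_t ip;
};

struct model_ve_info {
	uint64_t exit_reason;
	uint32_t instr_len;
};

/* Outcome of one modelled #VE; the names are illustrative only. */
enum model_ve_outcome {
	MODEL_VE_HANDLED,	/* emulated, IP moved past the instruction */
	MODEL_VE_RAISED_FAULT,	/* fell back to the ve_raise_fault()-like path */
};

typedef bool (*model_ve_handler_t)(struct model_regs *regs,
				   const struct model_ve_info *ve);

/* Mirrors this patch: no exit reason is handled yet. */
static bool model_handle_none(struct model_regs *regs,
			      const struct model_ve_info *ve)
{
	(void)regs;
	(void)ve;
	return false;
}

/* Stand-in for a later patch that emulates the instruction. */
static bool model_handle_all(struct model_regs *regs,
			     const struct model_ve_info *ve)
{
	(void)regs;
	(void)ve;
	return true;
}

static enum model_ve_outcome
model_virtualization_exception(struct model_regs *regs,
			       const struct model_ve_info *ve,
			       bool got_ve_info, model_ve_handler_t handler)
{
	bool ok = got_ve_info;

	assert(regs && ve && handler);

	/* Only dispatch if the VE info could be fetched. */
	if (ok)
		ok = handler(regs, ve);

	if (!ok)
		return MODEL_VE_RAISED_FAULT;

	/* After successful handling, skip the faulting instruction. */
	regs->ip += ve->instr_len;
	return MODEL_VE_HANDLED;
}
```

With model_handle_none() every #VE ends in the fault path and the IP
is left untouched, which matches the behaviour of this patch until the
later patches in the series add real handlers.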

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/idtentry.h |   4 ++
 arch/x86/include/asm/tdx.h      |  21 ++++++
 arch/x86/kernel/idt.c           |   3 +
 arch/x86/kernel/tdx.c           |  63 ++++++++++++++++++
 arch/x86/kernel/traps.c         | 110 ++++++++++++++++++++++++++++++++
 5 files changed, 201 insertions(+)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1345088e9902..8ccc81d653b3 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -625,6 +625,10 @@ DECLARE_IDTENTRY_XENCB(X86_TRAP_OTHER,	exc_xen_hypervisor_callback);
 DECLARE_IDTENTRY_RAW(X86_TRAP_OTHER,	exc_xen_unknown_trap);
 #endif
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+DECLARE_IDTENTRY(X86_TRAP_VE,		exc_virtualization_exception);
+#endif
+
 /* Device interrupts common/spurious */
 DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	common_interrupt);
 #ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 5107a4d9ba8f..d17143290f0a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -4,6 +4,7 @@
 #define _ASM_X86_TDX_H
 
 #include <linux/init.h>
+#include <asm/ptrace.h>
 
 #define TDX_CPUID_LEAF_ID	0x21
 #define TDX_IDENT		"IntelTDX    "
@@ -40,6 +41,22 @@ struct tdx_hypercall_output {
 	u64 r15;
 };
 
+/*
+ * Used by the #VE exception handler to gather the #VE exception
+ * info from the TDX module. This is a software only structure
+ * and not part of the TDX module/VMM ABI.
+ */
+struct ve_info {
+	u64 exit_reason;
+	u64 exit_qual;
+	/* Guest Linear (virtual) Address */
+	u64 gla;
+	/* Guest Physical Address */
+	u64 gpa;
+	u32 instr_len;
+	u32 instr_info;
+};
+
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 void __init tdx_early_init(void);
@@ -53,6 +70,10 @@ u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 u64 __tdx_hypercall(u64 type, u64 fn, u64 r12, u64 r13, u64 r14,
 		    u64 r15, struct tdx_hypercall_output *out);
 
+bool tdx_get_ve_info(struct ve_info *ve);
+
+bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
+
 #else
 
 static inline void tdx_early_init(void) { };
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index df0fa695bb09..1da074123c16 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -68,6 +68,9 @@ static const __initconst struct idt_data early_idts[] = {
 	 */
 	INTG(X86_TRAP_PF,		asm_exc_page_fault),
 #endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
+#endif
 };
 
 /*
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index d40b6df51e26..5a5b25f9c4d3 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -7,6 +7,9 @@
 #include <linux/cpufeature.h>
 #include <asm/tdx.h>
 
+/* TDX module Call Leaf IDs */
+#define TDX_GET_VEINFO			3
+
 static bool tdx_guest_detected __ro_after_init;
 
 /*
@@ -32,6 +35,66 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
 	return out->r10;
 }
 
+bool tdx_get_ve_info(struct ve_info *ve)
+{
+	struct tdx_module_output out;
+
+	/*
+	 * NMIs and machine checks are suppressed. Before this point any
+	 * #VE is fatal. After this point (TDGETVEINFO call), NMIs and
+	 * additional #VEs are permitted (but they are not expected to
+	 * happen unless the kernel panics).
+	 */
+	if (__tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out))
+		return false;
+
+	ve->exit_reason = out.rcx;
+	ve->exit_qual   = out.rdx;
+	ve->gla         = out.r8;
+	ve->gpa         = out.r9;
+	ve->instr_len   = lower_32_bits(out.r10);
+	ve->instr_info  = upper_32_bits(out.r10);
+
+	return true;
+}
+
+/*
+ * Handle the user initiated #VE.
+ *
+ * For example, executing the CPUID instruction from user space
+ * is a valid case and hence the resulting #VE has to be handled.
+ *
+ * For disallowed or invalid #VEs, just return failure.
+ */
+static bool tdx_virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
+{
+	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+	return false;
+}
+
+/* Handle the kernel #VE */
+static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
+{
+	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+	return false;
+}
+
+bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
+{
+	bool ret;
+
+	if (user_mode(regs))
+		ret = tdx_virt_exception_user(regs, ve);
+	else
+		ret = tdx_virt_exception_kernel(regs, ve);
+
+	/* After successful #VE handling, move the IP */
+	if (ret)
+		regs->ip += ve->instr_len;
+
+	return ret;
+}
+
 bool is_tdx_guest(void)
 {
 	return tdx_guest_detected;
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c9d566dcf89a..428504535912 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -61,6 +61,7 @@
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/vdso.h>
+#include <asm/tdx.h>
 
 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
@@ -1212,6 +1213,115 @@ DEFINE_IDTENTRY(exc_device_not_available)
 	}
 }
 
+#ifdef CONFIG_INTEL_TDX_GUEST
+
+#define VE_FAULT_STR "VE fault"
+
+static void ve_raise_fault(struct pt_regs *regs, long error_code)
+{
+	struct task_struct *tsk = current;
+
+	if (user_mode(regs)) {
+		tsk->thread.error_code = error_code;
+		tsk->thread.trap_nr = X86_TRAP_VE;
+		show_signal(tsk, SIGSEGV, "", VE_FAULT_STR, regs, error_code);
+		force_sig(SIGSEGV);
+		return;
+	}
+
+	/*
+	 * Attempt to recover from #VE exception failure without
+	 * triggering OOPS (useful for MSR read/write failures)
+	 */
+	if (fixup_exception(regs, X86_TRAP_VE, error_code, 0))
+		return;
+
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_VE;
+
+	/*
+	 * To be potentially processing a kprobe fault and to trust the result
+	 * from kprobe_running(), it should be non-preemptible.
+	 */
+	if (!preemptible() && kprobe_running() &&
+	    kprobe_fault_handler(regs, X86_TRAP_VE))
+		return;
+
+	/* Notify about #VE handling failure, useful for debugger hooks */
+	if (notify_die(DIE_GPF, VE_FAULT_STR, regs, error_code,
+		       X86_TRAP_VE, SIGSEGV) == NOTIFY_STOP)
+		return;
+
+	/* Trigger OOPS and panic */
+	die_addr(VE_FAULT_STR, regs, error_code, 0);
+}
+
+/*
+ * Virtualization Exceptions (#VE) are delivered to TDX guests due to
+ * specific guest actions which may happen in either user space or the
+ * kernel:
+ *
+ *  * Specific instructions (WBINVD, for example)
+ *  * Specific MSR accesses
+ *  * Specific CPUID leaf accesses
+ *  * Access to unmapped pages (EPT violation)
+ *
+ * In the settings that Linux will run in, virtualization exceptions are
+ * never generated on accesses to normal, TD-private memory that has been
+ * accepted.
+ *
+ * Syscall entry code has a critical window where the kernel stack is not
+ * yet set up. Any exception in this window leads to hard to debug issues
+ * and can be exploited for privilege escalation. Exceptions in the NMI
+ * entry code also cause issues. Returning from the exception handler with
+ * IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
+ *
+ * For these reasons, the kernel avoids #VEs during the syscall gap and
+ * the NMI entry code. Entry code paths do not access TD-shared memory
+ * or MMIO regions, and do not use MSRs, instructions, or CPUID leaves
+ * that might generate #VE. The VMM can remove memory from the TD at any
+ * point, but access to unaccepted (or missing) private memory leads to
+ * VM termination, not to #VE.
+ *
+ * Similarly to page faults and breakpoints, #VEs are allowed in NMI
+ * handlers once the kernel is ready to deal with nested NMIs.
+ *
+ * During #VE delivery, all interrupts, including NMIs, are blocked until
+ * TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
+ * the VE info.
+ *
+ * If a guest kernel action which would normally cause a #VE occurs in
+ * the interrupt-disabled region before TDGETVEINFO, a #DF (double fault
+ * exception) is delivered to the guest, which will result in an oops.
+ */
+DEFINE_IDTENTRY(exc_virtualization_exception)
+{
+	struct ve_info ve;
+	bool ret;
+
+	/*
+	 * NMIs/Machine-checks/Interrupts will be in a disabled state
+	 * till TDGETVEINFO TDCALL is executed. This ensures that VE
+	 * info cannot be overwritten by a nested #VE.
+	 */
+	ret = tdx_get_ve_info(&ve);
+
+	cond_local_irq_enable(regs);
+
+	if (ret)
+		ret = tdx_handle_virt_exception(regs, &ve);
+	/*
+	 * If tdx_handle_virt_exception() could not process
+	 * it successfully, treat it as #GP(0) and handle it.
+	 */
+	if (!ret)
+		ve_raise_fault(regs, 0);
+
+	cond_local_irq_disable(regs);
+}
+
+#endif
+
 #ifdef CONFIG_X86_32
 DEFINE_IDTENTRY_SW(iret_error)
 {
-- 
2.34.1



* [PATCHv2 05/29] x86/tdx: Add HLT support for TDX guests
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2022-01-24 15:01 ` [PATCHv2 04/29] x86/traps: Add #VE support for TDX guest Kirill A. Shutemov
@ 2022-01-24 15:01 ` Kirill A. Shutemov
  2022-01-29 14:53   ` Borislav Petkov
  2022-01-24 15:01 ` [PATCHv2 06/29] x86/tdx: Add MSR " Kirill A. Shutemov
                   ` (24 subsequent siblings)
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:01 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

The HLT instruction is a privileged instruction. Executing it stops
instruction execution and places the processor in a HALT state. It
is used in the kernel for cases like reboot, the idle loop and exception
fixup handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).

To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].

In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.

To avoid the *safe_halt() call in the idle function, define
tdx_safe_halt() and use it to override the "x86_idle" function pointer
for a valid TDX guest.

Alternative choices like PV ops were considered for adding
safe_halt() support, but they were rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in a TDX guest just for
the safe_halt() use case is not worth the cost.
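
To make the R15 convention easier to follow, here is a small user-space
C model of the check that __tdx_hypercall() performs before TDCALL in
this patch. It is a sketch only: the model_* names and the register
struct are made up for illustration, and the real code is the assembly
in tdcall.S. EXIT_REASON_HLT is 12 in uapi/asm/vmx.h, and the flag
value matches ENABLE_IRQS_BEFORE_HLT from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_EXIT_REASON_HLT		12	/* EXIT_REASON_HLT */
#define MODEL_ENABLE_IRQS_BEFORE_HLT	0x01	/* ENABLE_IRQS_BEFORE_HLT */

/* Hypothetical view of the two registers the check looks at. */
struct model_hcall_regs {
	uint64_t r11;	/* TDVMCALL sub function */
	uint64_t r15;	/* input 4, reused as the software-only STI flag */
};

/*
 * Returns true if STI would be executed right before TDCALL.
 * In that case R15 is cleared first, since the flag must not
 * reach the VMM.
 */
static bool model_sti_before_tdcall(struct model_hcall_regs *regs)
{
	assert(regs);

	/* The assembly uses cmpl, so only the low 32 bits are compared. */
	if ((uint32_t)regs->r11 != MODEL_EXIT_REASON_HLT)
		return false;
	if ((uint32_t)regs->r15 != MODEL_ENABLE_IRQS_BEFORE_HLT)
		return false;

	regs->r15 = 0;
	return true;
}
```

Only the combination "HLT sub function plus flag set" takes the STI
path. A plain tdx_halt() (flag clear) and every other hypercall go
straight to TDCALL with their registers unchanged.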

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/tdx.h |  3 ++
 arch/x86/kernel/process.c  |  5 +++
 arch/x86/kernel/tdcall.S   | 31 +++++++++++++++++
 arch/x86/kernel/tdx.c      | 70 ++++++++++++++++++++++++++++++++++++--
 4 files changed, 107 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d17143290f0a..9b4714a45bb9 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -74,10 +74,13 @@ bool tdx_get_ve_info(struct ve_info *ve);
 
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
 
+void tdx_safe_halt(void);
+
 #else
 
 static inline void tdx_early_init(void) { };
 static inline bool is_tdx_guest(void) { return false; }
+static inline void tdx_safe_halt(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 81d8ef036637..d48afc69ebfa 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -46,6 +46,7 @@
 #include <asm/proto.h>
 #include <asm/frame.h>
 #include <asm/unwind.h>
+#include <asm/tdx.h>
 
 #include "process.h"
 
@@ -870,6 +871,10 @@ void select_idle_routine(const struct cpuinfo_x86 *c)
 	} else if (prefer_mwait_c1_over_halt(c)) {
 		pr_info("using mwait in idle threads\n");
 		x86_idle = mwait_idle;
+	} else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+		pr_info("using TDX aware idle routine\n");
+		x86_idle = tdx_safe_halt;
+		return;
 	} else
 		x86_idle = default_idle;
 }
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index 46a49a96cf6c..ae74da33ccc6 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -3,6 +3,7 @@
 #include <asm/asm.h>
 #include <asm/frame.h>
 #include <asm/unwind_hints.h>
+#include <uapi/asm/vmx.h>
 
 #include <linux/linkage.h>
 #include <linux/bits.h>
@@ -39,6 +40,12 @@
  */
 #define tdcall .byte 0x66,0x0f,0x01,0xcc
 
+/*
+ * Used in __tdx_hypercall() to determine whether to enable interrupts
+ * before issuing TDCALL for the EXIT_REASON_HLT case.
+ */
+#define ENABLE_IRQS_BEFORE_HLT 0x01
+
 /*
  * __tdx_module_call()  - Used by TDX guests to request services from
  * the TDX module (does not include VMM services).
@@ -230,6 +237,30 @@ SYM_FUNC_START(__tdx_hypercall)
 
 	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
 
+	/*
+	 * For the idle loop STI needs to be called directly before
+	 * the TDCALL that enters idle (EXIT_REASON_HLT case). STI
+	 * instruction enables interrupts only one instruction later.
+	 * If there is a window between STI and the instruction that
+	 * emulates the HALT state, there is a chance for interrupts to
+	 * happen in this window, which can delay the HLT operation
+	 * indefinitely. Since this is not the desired result,
+	 * conditionally call STI before TDCALL.
+	 *
+	 * Since STI instruction is only required for the idle case
+	 * (a special case of EXIT_REASON_HLT), use the r15 register
+	 * value to identify it. Since the R15 register is not used
+	 * by the VMM as per EXIT_REASON_HLT ABI, re-use it in
+	 * software to identify the STI case.
+	 */
+	cmpl $EXIT_REASON_HLT, %r11d
+	jne .Lskip_sti
+	cmpl $ENABLE_IRQS_BEFORE_HLT, %r15d
+	jne .Lskip_sti
+	/* Set R15 register to 0, it is unused in EXIT_REASON_HLT case */
+	xor %r15, %r15
+	sti
+.Lskip_sti:
 	tdcall
 
 	/* Restore output pointer to R9 */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 5a5b25f9c4d3..eeb456631a65 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -6,6 +6,7 @@
 
 #include <linux/cpufeature.h>
 #include <asm/tdx.h>
+#include <asm/vmx.h>
 
 /* TDX module Call Leaf IDs */
 #define TDX_GET_VEINFO			3
@@ -35,6 +36,61 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
 	return out->r10;
 }
 
+static u64 __cpuidle _tdx_halt(const bool irq_disabled, const bool do_sti)
+{
+	/*
+	 * Emulate HLT operation via hypercall. More info about ABI
+	 * can be found in TDX Guest-Host-Communication Interface
+	 * (GHCI), sec 3.8 TDG.VP.VMCALL<Instruction.HLT>.
+	 *
+	 * The VMM uses the "IRQ disabled" param to understand IRQ
+	 * enabled status (RFLAGS.IF) of the TD guest and to determine
+	 * whether or not it should schedule the halted vCPU if an
+	 * IRQ becomes pending. E.g. if IRQs are disabled, the VMM
+	 * can keep the vCPU in virtual HLT, even if an IRQ is
+	 * pending, without hanging/breaking the guest.
+	 *
+	 * do_sti parameter is used by the __tdx_hypercall() to decide
+	 * whether to call the STI instruction before executing the
+	 * TDCALL instruction.
+	 */
+	return _tdx_hypercall(EXIT_REASON_HLT, irq_disabled, 0, 0,
+			      do_sti, NULL);
+}
+
+static bool tdx_halt(void)
+{
+	/*
+	 * Since non safe halt is mainly used in CPU offlining
+	 * and the guest will always stay in the halt state, don't
+	 * call the STI instruction (set do_sti as false).
+	 */
+	const bool irq_disabled = irqs_disabled();
+	const bool do_sti = false;
+
+	if (_tdx_halt(irq_disabled, do_sti))
+		return false;
+
+	return true;
+}
+
+void __cpuidle tdx_safe_halt(void)
+{
+	/*
+	 * For do_sti=true case, __tdx_hypercall() function enables
+	 * interrupts using the STI instruction before the TDCALL. So
+	 * set irq_disabled as false.
+	 */
+	const bool irq_disabled = false;
+	const bool do_sti = true;
+
+	/*
+	 * Use WARN_ONCE() to report the failure.
+	 */
+	if (_tdx_halt(irq_disabled, do_sti))
+		WARN_ONCE(1, "HLT instruction emulation failed\n");
+}
+
 bool tdx_get_ve_info(struct ve_info *ve)
 {
 	struct tdx_module_output out;
@@ -75,8 +131,18 @@ static bool tdx_virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
 /* Handle the kernel #VE */
 static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 {
-	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
-	return false;
+	bool ret = false;
+
+	switch (ve->exit_reason) {
+	case EXIT_REASON_HLT:
+		ret = tdx_halt();
+		break;
+	default:
+		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+		break;
+	}
+
+	return ret;
 }
 
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
-- 
2.34.1



* [PATCHv2 06/29] x86/tdx: Add MSR support for TDX guests
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2022-01-24 15:01 ` [PATCHv2 05/29] x86/tdx: Add HLT support for TDX guests Kirill A. Shutemov
@ 2022-01-24 15:01 ` Kirill A. Shutemov
  2022-02-01 21:38   ` Thomas Gleixner
  2022-01-24 15:01 ` [PATCHv2 07/29] x86/tdx: Handle CPUID via #VE Kirill A. Shutemov
                   ` (23 subsequent siblings)
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:01 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

Use hypercall to emulate MSR read/write for the TDX platform.

There are two viable approaches for doing MSRs in a TD guest:

1. Execute the RDMSR/WRMSR instructions like most VMs and bare metal
   do. Some will succeed, others will cause a #VE. All of those that
   cause a #VE will be handled with a TDCALL.
2. Use paravirt infrastructure.  The paravirt hook has to keep a list
   of which MSRs would cause a #VE and use a TDCALL.  All other MSRs
   execute RDMSR/WRMSR instructions directly.

The second option can be ruled out because the list of MSRs is
challenging to maintain. That leaves option #1 as the only viable
solution for the minimal TDX support.

For performance-critical MSR writes (like TSC_DEADLINE), future patches
will replace the WRMSR/#VE sequence with the direct TDCALL.

RDMSR and WRMSR specification details can be found in the
Guest-Host-Communication Interface (GHCI) for Intel Trust Domain
Extensions (Intel TDX) specification, in the sections titled
"TDG.VP.VMCALL<Instruction.RDMSR>" and "TDG.VP.VMCALL<Instruction.WRMSR>".
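
For readers less familiar with the EDX:EAX convention, the value
handling in the #VE path can be sketched in user-space C as below. The
helper names are invented for this example. The arithmetic is the same
as in the patch: WRMSR combines the two 32-bit halves into the single
64-bit hypercall argument, and RDMSR splits the 64-bit result back into
EAX and EDX.

```c
#include <assert.h>
#include <stdint.h>

/* WRMSR side: build the 64-bit hypercall argument from EDX:EAX. */
static uint64_t model_msr_pack(uint32_t low, uint32_t high)
{
	return ((uint64_t)high << 32) | low;
}

/* RDMSR side: split the 64-bit result back into EAX (low) and EDX (high). */
static void model_msr_unpack(uint64_t val, uint32_t *eax, uint32_t *edx)
{
	assert(eax && edx);

	*eax = (uint32_t)val;		/* lower_32_bits() equivalent */
	*edx = (uint32_t)(val >> 32);	/* upper_32_bits() equivalent */
}
```

Packing and then unpacking a value round-trips, which is the property
the guest relies on when the VMM simply stores and returns the MSR.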

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 44 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index eeb456631a65..29a03a4bdb53 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -91,6 +91,39 @@ void __cpuidle tdx_safe_halt(void)
 		WARN_ONCE(1, "HLT instruction emulation failed\n");
 }
 
+static bool tdx_read_msr(unsigned int msr, u64 *val)
+{
+	struct tdx_hypercall_output out;
+
+	/*
+	 * Emulate the MSR read via hypercall. More info about ABI
+	 * can be found in TDX Guest-Host-Communication Interface
+	 * (GHCI), sec titled "TDG.VP.VMCALL<Instruction.RDMSR>".
+	 */
+	if (_tdx_hypercall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out))
+		return false;
+
+	*val = out.r11;
+
+	return true;
+}
+
+static bool tdx_write_msr(unsigned int msr, unsigned int low,
+			       unsigned int high)
+{
+	u64 ret;
+
+	/*
+	 * Emulate the MSR write via hypercall. More info about ABI
+	 * can be found in TDX Guest-Host-Communication Interface
+	 * (GHCI) sec titled "TDG.VP.VMCALL<Instruction.WRMSR>".
+	 */
+	ret = _tdx_hypercall(EXIT_REASON_MSR_WRITE, msr, (u64)high << 32 | low,
+			     0, 0, NULL);
+
+	return ret ? false : true;
+}
+
 bool tdx_get_ve_info(struct ve_info *ve)
 {
 	struct tdx_module_output out;
@@ -132,11 +165,22 @@ static bool tdx_virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
 static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 {
 	bool ret = false;
+	u64 val;
 
 	switch (ve->exit_reason) {
 	case EXIT_REASON_HLT:
 		ret = tdx_halt();
 		break;
+	case EXIT_REASON_MSR_READ:
+		ret = tdx_read_msr(regs->cx, &val);
+		if (ret) {
+			regs->ax = lower_32_bits(val);
+			regs->dx = upper_32_bits(val);
+		}
+		break;
+	case EXIT_REASON_MSR_WRITE:
+		ret = tdx_write_msr(regs->cx, regs->ax, regs->dx);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		break;
-- 
2.34.1



* [PATCHv2 07/29] x86/tdx: Handle CPUID via #VE
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2022-01-24 15:01 ` [PATCHv2 06/29] x86/tdx: Add MSR " Kirill A. Shutemov
@ 2022-01-24 15:01 ` Kirill A. Shutemov
  2022-02-01 21:39   ` Thomas Gleixner
  2022-01-24 15:01 ` [PATCHv2 08/29] x86/tdx: Handle in-kernel MMIO Kirill A. Shutemov
                   ` (22 subsequent siblings)
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:01 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

In TDX guests, most CPUID leaf/sub-leaf combinations are virtualized
by the TDX module while some trigger #VE.

Implement #VE handling for EXIT_REASON_CPUID by passing the request
through a hypercall, which lets the TDX module service it by invoking
the host VMM.

More details on CPUID Virtualization can be found in the TDX module
specification, the section titled "CPUID Virtualization".
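
The register shuffling done by the handler can be sketched in
user-space C as follows. The structures and the function name are
assumptions for illustration only. What is taken from the patch is the
mapping: on success R12-R15 of the hypercall output carry EAX, EBX, ECX
and EDX, and on failure the saved registers are left alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of struct tdx_hypercall_output. */
struct model_hcall_out {
	uint64_t r12, r13, r14, r15;
};

/* Hypothetical subset of pt_regs touched by CPUID. */
struct model_cpuid_regs {
	uint64_t ax, bx, cx, dx;
};

/*
 * hcall_status models the value returned by the hypercall wrapper
 * (R10): zero means the VMM serviced the request.
 */
static bool model_cpuid_copy_out(uint64_t hcall_status,
				 const struct model_hcall_out *out,
				 struct model_cpuid_regs *regs)
{
	assert(out && regs);

	/* On failure do not touch the saved registers. */
	if (hcall_status)
		return false;

	regs->ax = out->r12;
	regs->bx = out->r13;
	regs->cx = out->r14;
	regs->dx = out->r15;

	return true;
}
```

Returning false here corresponds to the #VE handler reporting failure,
after which the exception is treated like a #GP.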

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 42 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 29a03a4bdb53..f213c67b4ecc 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -124,6 +124,31 @@ static bool tdx_write_msr(unsigned int msr, unsigned int low,
 	return ret ? false : true;
 }
 
+static bool tdx_handle_cpuid(struct pt_regs *regs)
+{
+	struct tdx_hypercall_output out;
+
+	/*
+	 * Emulate the CPUID instruction via a hypercall. More info about
+	 * ABI can be found in TDX Guest-Host-Communication Interface
+	 * (GHCI), section titled "TDG.VP.VMCALL<Instruction.CPUID>".
+	 */
+	if (_tdx_hypercall(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0, &out))
+		return false;
+
+	/*
+	 * As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of
+	 * EAX, EBX, ECX, EDX registers after the CPUID instruction execution.
+	 * So copy the register contents back to pt_regs.
+	 */
+	regs->ax = out.r12;
+	regs->bx = out.r13;
+	regs->cx = out.r14;
+	regs->dx = out.r15;
+
+	return true;
+}
+
 bool tdx_get_ve_info(struct ve_info *ve)
 {
 	struct tdx_module_output out;
@@ -157,8 +182,18 @@ bool tdx_get_ve_info(struct ve_info *ve)
  */
 static bool tdx_virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
 {
-	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
-	return false;
+	bool ret = false;
+
+	switch (ve->exit_reason) {
+	case EXIT_REASON_CPUID:
+		ret = tdx_handle_cpuid(regs);
+		break;
+	default:
+		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+		break;
+	}
+
+	return ret;
 }
 
 /* Handle the kernel #VE */
@@ -181,6 +216,9 @@ static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 	case EXIT_REASON_MSR_WRITE:
 		ret = tdx_write_msr(regs->cx, regs->ax, regs->dx);
 		break;
+	case EXIT_REASON_CPUID:
+		ret = tdx_handle_cpuid(regs);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		break;
-- 
2.34.1


* [PATCHv2 08/29] x86/tdx: Handle in-kernel MMIO
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2022-01-24 15:01 ` [PATCHv2 07/29] x86/tdx: Handle CPUID via #VE Kirill A. Shutemov
@ 2022-01-24 15:01 ` Kirill A. Shutemov
  2022-01-24 19:30   ` Josh Poimboeuf
                     ` (2 more replies)
  2022-01-24 15:01 ` [PATCHv2 09/29] x86/tdx: Detect TDX at early kernel decompression time Kirill A. Shutemov
                   ` (21 subsequent siblings)
  29 siblings, 3 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:01 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which causes a VMEXIT on access; the VMM then emulates the
instruction that caused the VMEXIT. That's not possible for a TDX VM.

To emulate an instruction an emulator needs two things:

  - R/W access to the register file to read/modify instruction arguments
    and see RIP of the faulted instruction.

  - Read access to memory where instruction is placed to see what to
    emulate. In this case it is guest kernel text.

Neither of them is available to the VMM in a TDX environment:

  - Register file is never exposed to VMM. When a TD exits to the module,
    it saves registers into the state-save area allocated for that TD.
    The module then scrubs these registers before returning execution
    control to the VMM, to help prevent leakage of TD state.

  - Memory is encrypted with a TD-private key. The CPU disallows software
    other than the TDX module and TDs from making memory accesses using
    the private key.

In TDX the MMIO regions are instead configured to trigger a #VE
exception in the guest. The guest #VE handler then emulates the MMIO
instruction inside the guest and converts it into a controlled hypercall
to the host.

MMIO addresses can be used with any CPU instruction that accesses
memory. This patch, however, covers only MMIO accesses done via io.h
helpers, such as 'readl()' or 'writeq()'.

readX()/writeX() helpers limit the range of instructions which can trigger
MMIO, which makes MMIO instruction emulation feasible. Raw access to an MMIO
region allows the compiler to generate whatever instruction it wants.
Supporting all possible instructions is a task of a different scope.

MMIO access with anything other than helpers from io.h may result in
MMIO_DECODE_FAILED and an oops.

AMD SEV has the same limitations in its MMIO handling.
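
For illustration only, the register write-back that follows an emulated
MMIO read can be modelled in user space. The sketch assumes a
little-endian host and uses hypothetical names (demo_ext,
demo_mmio_writeback()). Unlike the patch, it derives the sign bit from
the access size rather than from a fixed bit 7 or bit 15.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical classification of the decoded load. */
enum demo_ext { DEMO_PLAIN, DEMO_ZERO_EXTEND, DEMO_SIGN_EXTEND };

/*
 * Store an MMIO read result of 'size' bytes into a 64-bit register image.
 * 'opnd_bytes' is the destination operand size of the decoded instruction.
 * Assumes a little-endian host, as on x86.
 */
static void demo_mmio_writeback(uint64_t *reg, uint64_t val, int size,
				int opnd_bytes, enum demo_ext ext)
{
	uint8_t fill = 0;

	assert(size >= 1 && size <= 8);
	assert(opnd_bytes >= size && opnd_bytes <= 8);

	if (ext == DEMO_PLAIN) {
		/* A 32-bit load clears the upper half of the register. */
		if (size == 4)
			*reg = 0;
	} else {
		/* Pick the fill byte from the top bit of the value read. */
		if (ext == DEMO_SIGN_EXTEND &&
		    (val & (1ULL << (size * 8 - 1))))
			fill = 0xff;
		memset(reg, fill, opnd_bytes);
	}

	/* Low 'size' bytes always come from the device. */
	memcpy(reg, &val, size);
}
```

A one-byte plain load leaves the other seven bytes of the register
unchanged, while a sign-extending load of 0x80 into an 8-byte operand
yields 0xffffffffffffff80.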

=== Potential alternative approaches ===

== Paravirtualizing all MMIO ==

An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.

Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.

However, any paravirtual approach would require patching approximately
120k call sites. With a conservative overhead estimate of 5 bytes per
call site (a CALL instruction), that bloats the code by roughly 600k bytes.

Many drivers will never be used in the TDX environment and the bloat
cannot be justified.

== Patching TDX drivers ==

Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests.  Right now, that's
limited only to virtio and some x86-specific drivers.

All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch. This will be implemented in the
future, removing the bulk of MMIO #VEs.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 114 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index f213c67b4ecc..8e630eeb765d 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -7,6 +7,8 @@
 #include <linux/cpufeature.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>
 
 /* TDX module Call Leaf IDs */
 #define TDX_GET_VEINFO			3
@@ -149,6 +151,112 @@ static bool tdx_handle_cpuid(struct pt_regs *regs)
 	return true;
 }
 
+static bool tdx_mmio(int size, bool write, unsigned long addr,
+		     unsigned long *val)
+{
+	struct tdx_hypercall_output out;
+	u64 err;
+
+	err = _tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
+			     addr, *val, &out);
+	if (err)
+		return true;
+
+	*val = out.r11;
+	return false;
+}
+
+static bool tdx_mmio_read(int size, unsigned long addr, unsigned long *val)
+{
+	return tdx_mmio(size, false, addr, val);
+}
+
+static bool tdx_mmio_write(int size, unsigned long addr, unsigned long *val)
+{
+	return tdx_mmio(size, true, addr, val);
+}
+
+static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+	char buffer[MAX_INSN_SIZE];
+	unsigned long *reg, val = 0;
+	struct insn insn = {};
+	enum mmio_type mmio;
+	int size;
+	bool err;
+
+	if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
+		return -EFAULT;
+
+	if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
+		return -EFAULT;
+
+	mmio = insn_decode_mmio(&insn, &size);
+	if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
+		return -EFAULT;
+
+	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
+		reg = insn_get_modrm_reg_ptr(&insn, regs);
+		if (!reg)
+			return -EFAULT;
+	}
+
+	switch (mmio) {
+	case MMIO_WRITE:
+		memcpy(&val, reg, size);
+		err = tdx_mmio_write(size, ve->gpa, &val);
+		break;
+	case MMIO_WRITE_IMM:
+		val = insn.immediate.value;
+		err = tdx_mmio_write(size, ve->gpa, &val);
+		break;
+	case MMIO_READ:
+		err = tdx_mmio_read(size, ve->gpa, &val);
+		if (err)
+			break;
+		/* Zero-extend for 32-bit operation */
+		if (size == 4)
+			*reg = 0;
+		memcpy(reg, &val, size);
+		break;
+	case MMIO_READ_ZERO_EXTEND:
+		err = tdx_mmio_read(size, ve->gpa, &val);
+		if (err)
+			break;
+
+		/* Zero extend based on operand size */
+		memset(reg, 0, insn.opnd_bytes);
+		memcpy(reg, &val, size);
+		break;
+	case MMIO_READ_SIGN_EXTEND: {
+		u8 sign_byte = 0, msb = 7;
+
+		err = tdx_mmio_read(size, ve->gpa, &val);
+		if (err)
+			break;
+
+		if (size > 1)
+			msb = 15;
+
+		if (val & BIT(msb))
+			sign_byte = -1;
+
+		/* Sign extend based on operand size */
+		memset(reg, sign_byte, insn.opnd_bytes);
+		memcpy(reg, &val, size);
+		break;
+	}
+	case MMIO_MOVS:
+	case MMIO_DECODE_FAILED:
+		return -EFAULT;
+	}
+
+	if (err)
+		return -EFAULT;
+
+	return insn.length;
+}
+
 bool tdx_get_ve_info(struct ve_info *ve)
 {
 	struct tdx_module_output out;
@@ -219,6 +327,12 @@ static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 	case EXIT_REASON_CPUID:
 		ret = tdx_handle_cpuid(regs);
 		break;
+	case EXIT_REASON_EPT_VIOLATION:
+		ve->instr_len = tdx_handle_mmio(regs, ve);
+		ret = ve->instr_len > 0;
+		if (!ret)
+			pr_warn_once("MMIO failed\n");
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		break;
-- 
2.34.1


* [PATCHv2 09/29] x86/tdx: Detect TDX at early kernel decompression time
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2022-01-24 15:01 ` [PATCHv2 08/29] x86/tdx: Handle in-kernel MMIO Kirill A. Shutemov
@ 2022-01-24 15:01 ` Kirill A. Shutemov
  2022-02-01 18:30   ` Borislav Petkov
  2022-02-01 22:33   ` Thomas Gleixner
  2022-01-24 15:01 ` [PATCHv2 10/29] x86: Consolidate port I/O helpers Kirill A. Shutemov
                   ` (20 subsequent siblings)
  29 siblings, 2 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:01 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

The early decompression code does port I/O for its console output. But
handling the decompression-time port I/O demands a different approach
from normal runtime because the IDT required to support #VE-based port
I/O emulation is not yet set up. Paravirtualizing I/O calls during
the decompression step is acceptable because the decompression code is
small enough that patching it will not bloat the image size much.

To support port I/O in decompression code, TDX must be detected before
the decompression code might do port I/O. Detect the TDX guest
environment before console_init() in extract_kernel(). Detecting it
before console_init() is early enough for patching port I/O.

Add an early_is_tdx_guest() interface to get the cached TDX guest
status in the decompression code.
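
For illustration only, the signature check can be modelled in user space
without executing CPUID. The sketch assumes a little-endian host and a
hypothetical helper demo_is_tdx_signature() that takes the register
values as arguments; the identification string is copied from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Same 12-byte string as TDX_IDENT in this patch. */
#define DEMO_TDX_IDENT "IntelTDX    "

/*
 * Hypothetical helper: the caller supplies EBX, ECX and EDX as they
 * would be returned by the TDX CPUID leaf.
 */
static bool demo_is_tdx_signature(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
	uint32_t sig[3];

	/* The signature bytes are laid out in EBX, EDX, ECX order. */
	sig[0] = ebx;
	sig[1] = edx;
	sig[2] = ecx;

	/* The string (minus its NUL) must be exactly three registers wide. */
	assert(sizeof(DEMO_TDX_IDENT) - 1 == sizeof(sig));

	return memcmp(DEMO_TDX_IDENT, sig, sizeof(sig)) == 0;
}
```

With EBX = 0x65746e49 ("Inte"), EDX = 0x5844546c ("lTDX") and
ECX = 0x20202020 (four spaces) the helper returns true; any other
vendor string returns false.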

The actual port I/O paravirtualization will come later in the series.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/Makefile |  1 +
 arch/x86/boot/compressed/misc.c   |  8 ++++++++
 arch/x86/boot/compressed/misc.h   |  2 ++
 arch/x86/boot/compressed/tdx.c    | 29 +++++++++++++++++++++++++++++
 arch/x86/boot/compressed/tdx.h    | 16 ++++++++++++++++
 arch/x86/boot/cpuflags.c          |  3 +--
 arch/x86/boot/cpuflags.h          |  1 +
 arch/x86/include/asm/shared/tdx.h |  7 +++++++
 arch/x86/include/asm/tdx.h        |  4 +---
 9 files changed, 66 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdx.c
 create mode 100644 arch/x86/boot/compressed/tdx.h
 create mode 100644 arch/x86/include/asm/shared/tdx.h

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 6115274fe10f..732f6b21ecbd 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -101,6 +101,7 @@ ifdef CONFIG_X86_64
 endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index a4339cb2d247..d8373d766672 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -370,6 +370,14 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 	lines = boot_params->screen_info.orig_video_lines;
 	cols = boot_params->screen_info.orig_video_cols;
 
+	/*
+	 * Detect TDX guest environment.
+	 *
+	 * It has to be done before console_init() in order to use
+	 * paravirtualized port I/O operations if needed.
+	 */
+	early_tdx_detect();
+
 	console_init();
 
 	/*
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 16ed360b6692..0d8e275a9d96 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -28,6 +28,8 @@
 #include <asm/bootparam.h>
 #include <asm/desc_defs.h>
 
+#include "tdx.h"
+
 #define BOOT_CTYPE_H
 #include <linux/acpi.h>
 
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
new file mode 100644
index 000000000000..6853376fe69a
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx.c
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * tdx.c - Early boot code for TDX
+ */
+
+#include "../cpuflags.h"
+#include "../string.h"
+
+#include <asm/shared/tdx.h>
+
+static bool tdx_guest_detected;
+
+bool early_is_tdx_guest(void)
+{
+	return tdx_guest_detected;
+}
+
+void early_tdx_detect(void)
+{
+	u32 eax, sig[3];
+
+	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2],  &sig[1]);
+
+	if (memcmp(TDX_IDENT, sig, 12))
+		return;
+
+	/* Cache TDX guest feature status */
+	tdx_guest_detected = true;
+}
diff --git a/arch/x86/boot/compressed/tdx.h b/arch/x86/boot/compressed/tdx.h
new file mode 100644
index 000000000000..18970c09512e
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2021 Intel Corporation */
+#ifndef BOOT_COMPRESSED_TDX_H
+#define BOOT_COMPRESSED_TDX_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_INTEL_TDX_GUEST
+void early_tdx_detect(void);
+bool early_is_tdx_guest(void);
+#else
+static inline void early_tdx_detect(void) { }
+static inline bool early_is_tdx_guest(void) { return false; }
+#endif
+
+#endif
diff --git a/arch/x86/boot/cpuflags.c b/arch/x86/boot/cpuflags.c
index a0b75f73dc63..a83d67ec627d 100644
--- a/arch/x86/boot/cpuflags.c
+++ b/arch/x86/boot/cpuflags.c
@@ -71,8 +71,7 @@ int has_eflag(unsigned long mask)
 # define EBX_REG "=b"
 #endif
 
-static inline void cpuid_count(u32 id, u32 count,
-		u32 *a, u32 *b, u32 *c, u32 *d)
+void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d)
 {
 	asm volatile(".ifnc %%ebx,%3 ; movl  %%ebx,%3 ; .endif	\n\t"
 		     "cpuid					\n\t"
diff --git a/arch/x86/boot/cpuflags.h b/arch/x86/boot/cpuflags.h
index 2e20814d3ce3..475b8fde90f7 100644
--- a/arch/x86/boot/cpuflags.h
+++ b/arch/x86/boot/cpuflags.h
@@ -17,5 +17,6 @@ extern u32 cpu_vendor[3];
 
 int has_eflag(unsigned long mask);
 void get_cpuflags(void);
+void cpuid_count(u32 id, u32 count, u32 *a, u32 *b, u32 *c, u32 *d);
 
 #endif
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
new file mode 100644
index 000000000000..12bede46d048
--- /dev/null
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -0,0 +1,7 @@
+#ifndef _ASM_X86_SHARED_TDX_H
+#define _ASM_X86_SHARED_TDX_H
+
+#define TDX_CPUID_LEAF_ID	0x21
+#define TDX_IDENT		"IntelTDX    "
+
+#endif
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 9b4714a45bb9..53f7dd0fbe58 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -5,9 +5,7 @@
 
 #include <linux/init.h>
 #include <asm/ptrace.h>
-
-#define TDX_CPUID_LEAF_ID	0x21
-#define TDX_IDENT		"IntelTDX    "
+#include <asm/shared/tdx.h>
 
 #define TDX_HYPERCALL_STANDARD  0
 
-- 
2.34.1


* [PATCHv2 10/29] x86: Consolidate port I/O helpers
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2022-01-24 15:01 ` [PATCHv2 09/29] x86/tdx: Detect TDX at early kernel decompression time Kirill A. Shutemov
@ 2022-01-24 15:01 ` Kirill A. Shutemov
  2022-02-01 22:36   ` Thomas Gleixner
  2022-01-24 15:01 ` [PATCHv2 11/29] x86/boot: Allow to hook up alternative " Kirill A. Shutemov
                   ` (19 subsequent siblings)
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:01 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

There are two implementations of port I/O helpers: one in the kernel and
one in the boot stub.

Move the helpers required for both to <asm/shared/io.h> and use the one
implementation everywhere.
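
For illustration only, the macro-generated helper pattern can be modelled
in user space. The real helpers issue IN/OUT instructions; this sketch
swaps in a hypothetical fake port array (demo_ports) so that it can run
unprivileged, and all demo_* names are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_PORTS 0x10000

/* Hypothetical fake port space standing in for real I/O ports. */
static uint32_t demo_ports[DEMO_NR_PORTS];

/* Generate one out/in pair per access width, as BUILDIO() does. */
#define DEMO_BUILDIO(bwl, type)						\
static inline void demo_out##bwl(type value, int port)			\
{									\
	assert(port >= 0 && port < DEMO_NR_PORTS);			\
	demo_ports[port] = value;					\
}									\
									\
static inline type demo_in##bwl(int port)				\
{									\
	assert(port >= 0 && port < DEMO_NR_PORTS);			\
	return (type)demo_ports[port];					\
}

DEMO_BUILDIO(b, uint8_t)
DEMO_BUILDIO(w, uint16_t)
DEMO_BUILDIO(l, uint32_t)
#undef DEMO_BUILDIO
```

Defining the helpers once in a shared header and including it from both
users is what lets the boot stub and the kernel proper agree on a single
implementation.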

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/boot.h             | 35 +-------------------------------
 arch/x86/boot/compressed/misc.h  |  2 +-
 arch/x86/include/asm/io.h        | 22 ++------------------
 arch/x86/include/asm/shared/io.h | 32 +++++++++++++++++++++++++++++
 4 files changed, 36 insertions(+), 55 deletions(-)
 create mode 100644 arch/x86/include/asm/shared/io.h

diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index 34c9dbb6a47d..22a474c5b3e8 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -23,6 +23,7 @@
 #include <linux/edd.h>
 #include <asm/setup.h>
 #include <asm/asm.h>
+#include <asm/shared/io.h>
 #include "bitops.h"
 #include "ctype.h"
 #include "cpuflags.h"
@@ -35,40 +36,6 @@ extern struct boot_params boot_params;
 
 #define cpu_relax()	asm volatile("rep; nop")
 
-/* Basic port I/O */
-static inline void outb(u8 v, u16 port)
-{
-	asm volatile("outb %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u8 inb(u16 port)
-{
-	u8 v;
-	asm volatile("inb %1,%0" : "=a" (v) : "dN" (port));
-	return v;
-}
-
-static inline void outw(u16 v, u16 port)
-{
-	asm volatile("outw %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u16 inw(u16 port)
-{
-	u16 v;
-	asm volatile("inw %1,%0" : "=a" (v) : "dN" (port));
-	return v;
-}
-
-static inline void outl(u32 v, u16 port)
-{
-	asm volatile("outl %0,%1" : : "a" (v), "dN" (port));
-}
-static inline u32 inl(u16 port)
-{
-	u32 v;
-	asm volatile("inl %1,%0" : "=a" (v) : "dN" (port));
-	return v;
-}
-
 static inline void io_delay(void)
 {
 	const u16 DELAY_PORT = 0x80;
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 0d8e275a9d96..8a253e85f990 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -22,11 +22,11 @@
 #include <linux/linkage.h>
 #include <linux/screen_info.h>
 #include <linux/elf.h>
-#include <linux/io.h>
 #include <asm/page.h>
 #include <asm/boot.h>
 #include <asm/bootparam.h>
 #include <asm/desc_defs.h>
+#include <asm/shared/io.h>
 
 #include "tdx.h"
 
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index f6d91ecb8026..8ce0a40379de 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -44,6 +44,7 @@
 #include <asm/page.h>
 #include <asm/early_ioremap.h>
 #include <asm/pgtable_types.h>
+#include <asm/shared/io.h>
 
 #define build_mmio_read(name, size, type, reg, barrier) \
 static inline type name(const volatile void __iomem *addr) \
@@ -258,20 +259,6 @@ static inline void slow_down_io(void)
 #endif
 
 #define BUILDIO(bwl, bw, type)						\
-static inline void out##bwl(unsigned type value, int port)		\
-{									\
-	asm volatile("out" #bwl " %" #bw "0, %w1"			\
-		     : : "a"(value), "Nd"(port));			\
-}									\
-									\
-static inline unsigned type in##bwl(int port)				\
-{									\
-	unsigned type value;						\
-	asm volatile("in" #bwl " %w1, %" #bw "0"			\
-		     : "=a"(value) : "Nd"(port));			\
-	return value;							\
-}									\
-									\
 static inline void out##bwl##_p(unsigned type value, int port)		\
 {									\
 	out##bwl(value, port);						\
@@ -320,10 +307,8 @@ static inline void ins##bwl(int port, void *addr, unsigned long count)	\
 BUILDIO(b, b, char)
 BUILDIO(w, w, short)
 BUILDIO(l, , int)
+#undef BUILDIO
 
-#define inb inb
-#define inw inw
-#define inl inl
 #define inb_p inb_p
 #define inw_p inw_p
 #define inl_p inl_p
@@ -331,9 +316,6 @@ BUILDIO(l, , int)
 #define insw insw
 #define insl insl
 
-#define outb outb
-#define outw outw
-#define outl outl
 #define outb_p outb_p
 #define outw_p outw_p
 #define outl_p outl_p
diff --git a/arch/x86/include/asm/shared/io.h b/arch/x86/include/asm/shared/io.h
new file mode 100644
index 000000000000..f17247f6c471
--- /dev/null
+++ b/arch/x86/include/asm/shared/io.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_SHARED_IO_H
+#define _ASM_X86_SHARED_IO_H
+
+#define BUILDIO(bwl, bw, type)						\
+static inline void out##bwl(unsigned type value, int port)		\
+{									\
+	asm volatile("out" #bwl " %" #bw "0, %w1"			\
+		     : : "a"(value), "Nd"(port));			\
+}									\
+									\
+static inline unsigned type in##bwl(int port)				\
+{									\
+	unsigned type value;						\
+	asm volatile("in" #bwl " %w1, %" #bw "0"			\
+		     : "=a"(value) : "Nd"(port));			\
+	return value;							\
+}
+
+BUILDIO(b, b, char)
+BUILDIO(w, w, short)
+BUILDIO(l, , int)
+#undef BUILDIO
+
+#define inb inb
+#define inw inw
+#define inl inl
+#define outb outb
+#define outw outw
+#define outl outl
+
+#endif
-- 
2.34.1


* [PATCHv2 11/29] x86/boot: Allow to hook up alternative port I/O helpers
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  2022-01-24 15:01 ` [PATCHv2 10/29] x86: Consolidate port I/O helpers Kirill A. Shutemov
@ 2022-01-24 15:01 ` Kirill A. Shutemov
  2022-02-01 19:02   ` Borislav Petkov
  2022-02-01 22:39   ` Thomas Gleixner
  2022-01-24 15:01 ` [PATCHv2 12/29] x86/boot/compressed: Support TDX guest port I/O at decompression time Kirill A. Shutemov
                   ` (18 subsequent siblings)
  29 siblings, 2 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:01 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

Port I/O instructions trigger #VE in the TDX environment. In response to
the exception, the kernel emulates these instructions using hypercalls.

But during early boot, at the decompression stage, it is cumbersome to
deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE
handling.

Add a way to hook up alternative port I/O helpers in the boot stub.
All port I/O operations are routed via 'pio_ops'. By default, 'pio_ops'
is initialized with the native port I/O implementations.

This is a preparation patch. The next patch will override 'pio_ops' if
the kernel is booted in the TDX environment.
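
For illustration only, the ops-table indirection can be modelled in user
space. The sketch uses hypothetical names throughout (demo_port_io_ops,
demo_native_*, demo_pv_*) and backends that merely record what they were
asked to do instead of touching hardware or issuing hypercalls.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced, hypothetical version of the ops table: byte accessors only. */
struct demo_port_io_ops {
	unsigned char (*inb)(int port);
	void (*outb)(unsigned char v, int port);
};

/* Bookkeeping so that the effect of each backend can be observed. */
static int demo_backend;		/* 1 = native, 2 = paravirtual */
static unsigned char demo_last_val;
static int demo_last_port;

static unsigned char demo_native_inb(int port) { (void)port; return 0x11; }
static void demo_native_outb(unsigned char v, int port)
{
	demo_backend = 1;
	demo_last_val = v;
	demo_last_port = port;
}

static unsigned char demo_pv_inb(int port) { (void)port; return 0x22; }
static void demo_pv_outb(unsigned char v, int port)
{
	demo_backend = 2;
	demo_last_val = v;
	demo_last_port = port;
}

static struct demo_port_io_ops demo_pio_ops;

/* Default wiring: native helpers. */
static void demo_init_io_ops(void)
{
	demo_pio_ops.inb = demo_native_inb;
	demo_pio_ops.outb = demo_native_outb;
}

/* What a later override might look like once a TDX guest is detected. */
static void demo_use_paravirt_io_ops(void)
{
	demo_pio_ops.inb = demo_pv_inb;
	demo_pio_ops.outb = demo_pv_outb;
}

/* A caller never names a backend; it only goes through the table. */
static void demo_putchar(int ch)
{
	assert(demo_pio_ops.outb != NULL);
	demo_pio_ops.outb((unsigned char)ch, 0x3f8);
}
```

The table must be initialized before the first caller runs, which is why
the patch calls init_io_ops() at the top of main() and early in
extract_kernel().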

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/a20.c                  | 14 +++++++-------
 arch/x86/boot/boot.h                 |  2 +-
 arch/x86/boot/compressed/misc.c      | 18 ++++++++++++------
 arch/x86/boot/compressed/misc.h      |  2 +-
 arch/x86/boot/early_serial_console.c | 28 ++++++++++++++--------------
 arch/x86/boot/io.h                   | 28 ++++++++++++++++++++++++++++
 arch/x86/boot/main.c                 |  4 ++++
 arch/x86/boot/pm.c                   | 10 +++++-----
 arch/x86/boot/tty.c                  |  4 ++--
 arch/x86/boot/video-vga.c            |  6 +++---
 arch/x86/boot/video.h                |  8 +++++---
 arch/x86/realmode/rm/wakemain.c      | 14 +++++++++-----
 12 files changed, 91 insertions(+), 47 deletions(-)
 create mode 100644 arch/x86/boot/io.h

diff --git a/arch/x86/boot/a20.c b/arch/x86/boot/a20.c
index a2b6b428922a..7f6dd5cc4670 100644
--- a/arch/x86/boot/a20.c
+++ b/arch/x86/boot/a20.c
@@ -25,7 +25,7 @@ static int empty_8042(void)
 	while (loops--) {
 		io_delay();
 
-		status = inb(0x64);
+		status = pio_ops.inb(0x64);
 		if (status == 0xff) {
 			/* FF is a plausible, but very unlikely status */
 			if (!--ffs)
@@ -34,7 +34,7 @@ static int empty_8042(void)
 		if (status & 1) {
 			/* Read and discard input data */
 			io_delay();
-			(void)inb(0x60);
+			(void)pio_ops.inb(0x60);
 		} else if (!(status & 2)) {
 			/* Buffers empty, finished! */
 			return 0;
@@ -99,13 +99,13 @@ static void enable_a20_kbc(void)
 {
 	empty_8042();
 
-	outb(0xd1, 0x64);	/* Command write */
+	pio_ops.outb(0xd1, 0x64);	/* Command write */
 	empty_8042();
 
-	outb(0xdf, 0x60);	/* A20 on */
+	pio_ops.outb(0xdf, 0x60);	/* A20 on */
 	empty_8042();
 
-	outb(0xff, 0x64);	/* Null command, but UHCI wants it */
+	pio_ops.outb(0xff, 0x64);	/* Null command, but UHCI wants it */
 	empty_8042();
 }
 
@@ -113,10 +113,10 @@ static void enable_a20_fast(void)
 {
 	u8 port_a;
 
-	port_a = inb(0x92);	/* Configuration port A */
+	port_a = pio_ops.inb(0x92);	/* Configuration port A */
 	port_a |=  0x02;	/* Enable A20 */
 	port_a &= ~0x01;	/* Do not reset machine */
-	outb(port_a, 0x92);
+	pio_ops.outb(port_a, 0x92);
 }
 
 /*
diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index 22a474c5b3e8..bd8f640ca15f 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -23,10 +23,10 @@
 #include <linux/edd.h>
 #include <asm/setup.h>
 #include <asm/asm.h>
-#include <asm/shared/io.h>
 #include "bitops.h"
 #include "ctype.h"
 #include "cpuflags.h"
+#include "io.h"
 
 /* Useful macros */
 #define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index d8373d766672..cc47cf239c67 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -47,6 +47,8 @@ void *memmove(void *dest, const void *src, size_t n);
  */
 struct boot_params *boot_params;
 
+struct port_io_ops pio_ops;
+
 memptr free_mem_ptr;
 memptr free_mem_end_ptr;
 
@@ -103,10 +105,12 @@ static void serial_putchar(int ch)
 {
 	unsigned timeout = 0xffff;
 
-	while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
+	while ((pio_ops.inb(early_serial_base + LSR) & XMTRDY) == 0 &&
+	       --timeout) {
 		cpu_relax();
+	}
 
-	outb(ch, early_serial_base + TXR);
+	pio_ops.outb(ch, early_serial_base + TXR);
 }
 
 void __putstr(const char *s)
@@ -152,10 +156,10 @@ void __putstr(const char *s)
 	boot_params->screen_info.orig_y = y;
 
 	pos = (x + cols * y) * 2;	/* Update cursor position */
-	outb(14, vidport);
-	outb(0xff & (pos >> 9), vidport+1);
-	outb(15, vidport);
-	outb(0xff & (pos >> 1), vidport+1);
+	pio_ops.outb(14, vidport);
+	pio_ops.outb(0xff & (pos >> 9), vidport+1);
+	pio_ops.outb(15, vidport);
+	pio_ops.outb(0xff & (pos >> 1), vidport+1);
 }
 
 void __puthex(unsigned long value)
@@ -370,6 +374,8 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 	lines = boot_params->screen_info.orig_video_lines;
 	cols = boot_params->screen_info.orig_video_cols;
 
+	init_io_ops();
+
 	/*
 	 * Detect TDX guest environment.
 	 *
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 8a253e85f990..ea71cf3d64e1 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -26,7 +26,6 @@
 #include <asm/boot.h>
 #include <asm/bootparam.h>
 #include <asm/desc_defs.h>
-#include <asm/shared/io.h>
 
 #include "tdx.h"
 
@@ -35,6 +34,7 @@
 
 #define BOOT_BOOT_H
 #include "../ctype.h"
+#include "../io.h"
 
 #ifdef CONFIG_X86_64
 #define memptr long
diff --git a/arch/x86/boot/early_serial_console.c b/arch/x86/boot/early_serial_console.c
index 023bf1c3de8b..03e43d770571 100644
--- a/arch/x86/boot/early_serial_console.c
+++ b/arch/x86/boot/early_serial_console.c
@@ -28,17 +28,17 @@ static void early_serial_init(int port, int baud)
 	unsigned char c;
 	unsigned divisor;
 
-	outb(0x3, port + LCR);	/* 8n1 */
-	outb(0, port + IER);	/* no interrupt */
-	outb(0, port + FCR);	/* no fifo */
-	outb(0x3, port + MCR);	/* DTR + RTS */
+	pio_ops.outb(0x3, port + LCR);	/* 8n1 */
+	pio_ops.outb(0, port + IER);	/* no interrupt */
+	pio_ops.outb(0, port + FCR);	/* no fifo */
+	pio_ops.outb(0x3, port + MCR);	/* DTR + RTS */
 
 	divisor	= 115200 / baud;
-	c = inb(port + LCR);
-	outb(c | DLAB, port + LCR);
-	outb(divisor & 0xff, port + DLL);
-	outb((divisor >> 8) & 0xff, port + DLH);
-	outb(c & ~DLAB, port + LCR);
+	c = pio_ops.inb(port + LCR);
+	pio_ops.outb(c | DLAB, port + LCR);
+	pio_ops.outb(divisor & 0xff, port + DLL);
+	pio_ops.outb((divisor >> 8) & 0xff, port + DLH);
+	pio_ops.outb(c & ~DLAB, port + LCR);
 
 	early_serial_base = port;
 }
@@ -104,11 +104,11 @@ static unsigned int probe_baud(int port)
 	unsigned char lcr, dll, dlh;
 	unsigned int quot;
 
-	lcr = inb(port + LCR);
-	outb(lcr | DLAB, port + LCR);
-	dll = inb(port + DLL);
-	dlh = inb(port + DLH);
-	outb(lcr, port + LCR);
+	lcr = pio_ops.inb(port + LCR);
+	pio_ops.outb(lcr | DLAB, port + LCR);
+	dll = pio_ops.inb(port + DLL);
+	dlh = pio_ops.inb(port + DLH);
+	pio_ops.outb(lcr, port + LCR);
 	quot = (dlh << 8) | dll;
 
 	return BASE_BAUD / quot;
diff --git a/arch/x86/boot/io.h b/arch/x86/boot/io.h
new file mode 100644
index 000000000000..2659180e3210
--- /dev/null
+++ b/arch/x86/boot/io.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BOOT_IO_H
+#define BOOT_IO_H
+
+#include <asm/shared/io.h>
+
+struct port_io_ops {
+	unsigned char (*inb)(int port);
+	unsigned short (*inw)(int port);
+	unsigned int (*inl)(int port);
+	void (*outb)(unsigned char v, int port);
+	void (*outw)(unsigned short v, int port);
+	void (*outl)(unsigned int v, int port);
+};
+
+extern struct port_io_ops pio_ops;
+
+static inline void init_io_ops(void)
+{
+	pio_ops.inb = inb;
+	pio_ops.inw = inw;
+	pio_ops.inl = inl;
+	pio_ops.outb = outb;
+	pio_ops.outw = outw;
+	pio_ops.outl = outl;
+}
+
+#endif
diff --git a/arch/x86/boot/main.c b/arch/x86/boot/main.c
index e3add857c2c9..447a797891be 100644
--- a/arch/x86/boot/main.c
+++ b/arch/x86/boot/main.c
@@ -17,6 +17,8 @@
 
 struct boot_params boot_params __attribute__((aligned(16)));
 
+struct port_io_ops pio_ops;
+
 char *HEAP = _end;
 char *heap_end = _end;		/* Default end of heap = no heap */
 
@@ -133,6 +135,8 @@ static void init_heap(void)
 
 void main(void)
 {
+	init_io_ops();
+
 	/* First, copy the boot header into the "zeropage" */
 	copy_boot_params();
 
diff --git a/arch/x86/boot/pm.c b/arch/x86/boot/pm.c
index 40031a614712..4180b6a264c9 100644
--- a/arch/x86/boot/pm.c
+++ b/arch/x86/boot/pm.c
@@ -25,7 +25,7 @@ static void realmode_switch_hook(void)
 			     : "eax", "ebx", "ecx", "edx");
 	} else {
 		asm volatile("cli");
-		outb(0x80, 0x70); /* Disable NMI */
+		pio_ops.outb(0x80, 0x70); /* Disable NMI */
 		io_delay();
 	}
 }
@@ -35,9 +35,9 @@ static void realmode_switch_hook(void)
  */
 static void mask_all_interrupts(void)
 {
-	outb(0xff, 0xa1);	/* Mask all interrupts on the secondary PIC */
+	pio_ops.outb(0xff, 0xa1);	/* Mask all interrupts on the secondary PIC */
 	io_delay();
-	outb(0xfb, 0x21);	/* Mask all but cascade on the primary PIC */
+	pio_ops.outb(0xfb, 0x21);	/* Mask all but cascade on the primary PIC */
 	io_delay();
 }
 
@@ -46,9 +46,9 @@ static void mask_all_interrupts(void)
  */
 static void reset_coprocessor(void)
 {
-	outb(0, 0xf0);
+	pio_ops.outb(0, 0xf0);
 	io_delay();
-	outb(0, 0xf1);
+	pio_ops.outb(0, 0xf1);
 	io_delay();
 }
 
diff --git a/arch/x86/boot/tty.c b/arch/x86/boot/tty.c
index f7eb976b0a4b..ee8700682801 100644
--- a/arch/x86/boot/tty.c
+++ b/arch/x86/boot/tty.c
@@ -29,10 +29,10 @@ static void __section(".inittext") serial_putchar(int ch)
 {
 	unsigned timeout = 0xffff;
 
-	while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
+	while ((pio_ops.inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
 		cpu_relax();
 
-	outb(ch, early_serial_base + TXR);
+	pio_ops.outb(ch, early_serial_base + TXR);
 }
 
 static void __section(".inittext") bios_putchar(int ch)
diff --git a/arch/x86/boot/video-vga.c b/arch/x86/boot/video-vga.c
index 4816cb9cf996..17baac542ee7 100644
--- a/arch/x86/boot/video-vga.c
+++ b/arch/x86/boot/video-vga.c
@@ -131,7 +131,7 @@ static void vga_set_80x43(void)
 /* I/O address of the VGA CRTC */
 u16 vga_crtc(void)
 {
-	return (inb(0x3cc) & 1) ? 0x3d4 : 0x3b4;
+	return (pio_ops.inb(0x3cc) & 1) ? 0x3d4 : 0x3b4;
 }
 
 static void vga_set_480_scanlines(void)
@@ -148,10 +148,10 @@ static void vga_set_480_scanlines(void)
 	out_idx(0xdf, crtc, 0x12); /* Vertical display end */
 	out_idx(0xe7, crtc, 0x15); /* Vertical blank start */
 	out_idx(0x04, crtc, 0x16); /* Vertical blank end */
-	csel = inb(0x3cc);
+	csel = pio_ops.inb(0x3cc);
 	csel &= 0x0d;
 	csel |= 0xe2;
-	outb(csel, 0x3c2);
+	pio_ops.outb(csel, 0x3c2);
 }
 
 static void vga_set_vertical_end(int lines)
diff --git a/arch/x86/boot/video.h b/arch/x86/boot/video.h
index 04bde0bb2003..87a5f726e731 100644
--- a/arch/x86/boot/video.h
+++ b/arch/x86/boot/video.h
@@ -15,6 +15,8 @@
 
 #include <linux/types.h>
 
+#include "boot.h"
+
 /*
  * This code uses an extended set of video mode numbers. These include:
  * Aliases for standard modes
@@ -96,13 +98,13 @@ extern int graphic_mode;	/* Graphics mode with linear frame buffer */
 /* Accessing VGA indexed registers */
 static inline u8 in_idx(u16 port, u8 index)
 {
-	outb(index, port);
-	return inb(port+1);
+	pio_ops.outb(index, port);
+	return pio_ops.inb(port+1);
 }
 
 static inline void out_idx(u8 v, u16 port, u8 index)
 {
-	outw(index+(v << 8), port);
+	pio_ops.outw(index+(v << 8), port);
 }
 
 /* Writes a value to an indexed port and then reads the port again */
diff --git a/arch/x86/realmode/rm/wakemain.c b/arch/x86/realmode/rm/wakemain.c
index 1d6437e6d2ba..b49404d0d63c 100644
--- a/arch/x86/realmode/rm/wakemain.c
+++ b/arch/x86/realmode/rm/wakemain.c
@@ -17,18 +17,18 @@ static void beep(unsigned int hz)
 	} else {
 		u16 div = 1193181/hz;
 
-		outb(0xb6, 0x43);	/* Ctr 2, squarewave, load, binary */
+		pio_ops.outb(0xb6, 0x43);	/* Ctr 2, squarewave, load, binary */
 		io_delay();
-		outb(div, 0x42);	/* LSB of counter */
+		pio_ops.outb(div, 0x42);	/* LSB of counter */
 		io_delay();
-		outb(div >> 8, 0x42);	/* MSB of counter */
+		pio_ops.outb(div >> 8, 0x42);	/* MSB of counter */
 		io_delay();
 
 		enable = 0x03;		/* Turn on speaker */
 	}
-	inb(0x61);		/* Dummy read of System Control Port B */
+	pio_ops.inb(0x61);		/* Dummy read of System Control Port B */
 	io_delay();
-	outb(enable, 0x61);	/* Enable timer 2 output to speaker */
+	pio_ops.outb(enable, 0x61);	/* Enable timer 2 output to speaker */
 	io_delay();
 }
 
@@ -62,8 +62,12 @@ static void send_morse(const char *pattern)
 	}
 }
 
+struct port_io_ops pio_ops;
+
 void main(void)
 {
+	init_io_ops();
+
 	/* Kill machine if structures are wrong */
 	if (wakeup_header.real_magic != 0x12345678)
 		while (1)
-- 
2.34.1



* [PATCHv2 12/29] x86/boot/compressed: Support TDX guest port I/O at decompression time
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (10 preceding siblings ...)
  2022-01-24 15:01 ` [PATCHv2 11/29] x86/boot: Allow to hook up alternative " Kirill A. Shutemov
@ 2022-01-24 15:01 ` Kirill A. Shutemov
  2022-02-01 22:55   ` Thomas Gleixner
  2022-01-24 15:01 ` [PATCHv2 13/29] x86/tdx: Add port I/O emulation Kirill A. Shutemov
                   ` (17 subsequent siblings)
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:01 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

Port I/O instructions trigger #VE in the TDX environment. In response to
the exception, the kernel emulates these instructions using hypercalls.

But during early boot, in the decompression stage, it is cumbersome to
deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE
handling.

Hook up TDX-specific port I/O helpers when booting in a TDX environment.
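
As a rough illustration of the hook-up described above, the sketch below
models the ops table in user space. It is not the boot code itself: the
two port arrays stand in for real IN/OUT instructions and for the
hypercall path, the table is reduced to inb/outb, and init_io_ops() is
assumed to install the native helpers (its body is not part of this
patch).

```c
#include <assert.h>
#include <stdint.h>

/* Reduced stand-in for the boot code's port I/O ops table. */
struct port_io_ops {
	uint8_t (*inb)(uint16_t port);
	void (*outb)(uint8_t value, uint16_t port);
};

/* Simulated port spaces: one reached natively, one via "hypercall". */
static uint8_t native_ports[65536];
static uint8_t hypercall_ports[65536];

static uint8_t native_inb(uint16_t port) { return native_ports[port]; }
static void native_outb(uint8_t value, uint16_t port) { native_ports[port] = value; }

static uint8_t tdx_inb(uint16_t port) { return hypercall_ports[port]; }
static void tdx_outb(uint8_t value, uint16_t port) { hypercall_ports[port] = value; }

static struct port_io_ops pio_ops;

/* Assumed behaviour of init_io_ops(): install the native helpers. */
static void init_io_ops(void)
{
	pio_ops.inb = native_inb;
	pio_ops.outb = native_outb;
}

/* Swap in the hypercall-backed helpers once a TDX guest is detected. */
static void early_tdx_detect(int is_tdx_guest)
{
	if (!is_tdx_guest)
		return;

	pio_ops.inb = tdx_inb;
	pio_ops.outb = tdx_outb;
}
```

Callers keep using pio_ops.outb()/pio_ops.inb() unchanged; only the
table contents differ after detection, so no call site needs a TDX
check of its own.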

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/Makefile |  2 +-
 arch/x86/boot/compressed/tdcall.S |  3 ++
 arch/x86/boot/compressed/tdx.c    | 59 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/shared/tdx.h | 23 ++++++++++++
 arch/x86/include/asm/tdx.h        | 21 -----------
 5 files changed, 86 insertions(+), 22 deletions(-)
 create mode 100644 arch/x86/boot/compressed/tdcall.S

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 732f6b21ecbd..8fd0e6ae2e1f 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -101,7 +101,7 @@ ifdef CONFIG_X86_64
 endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
-vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/tdcall.S b/arch/x86/boot/compressed/tdcall.S
new file mode 100644
index 000000000000..aafadc136c88
--- /dev/null
+++ b/arch/x86/boot/compressed/tdcall.S
@@ -0,0 +1,3 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include "../../kernel/tdcall.S"
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
index 6853376fe69a..f2e1449c74cd 100644
--- a/arch/x86/boot/compressed/tdx.c
+++ b/arch/x86/boot/compressed/tdx.c
@@ -5,6 +5,10 @@
 
 #include "../cpuflags.h"
 #include "../string.h"
+#include "../io.h"
+
+#include <vdso/limits.h>
+#include <uapi/asm/vmx.h>
 
 #include <asm/shared/tdx.h>
 
@@ -15,6 +19,54 @@ bool early_is_tdx_guest(void)
 	return tdx_guest_detected;
 }
 
+static inline unsigned int tdx_io_in(int size, int port)
+{
+	struct tdx_hypercall_output out;
+
+	__tdx_hypercall(TDX_HYPERCALL_STANDARD, EXIT_REASON_IO_INSTRUCTION,
+			size, 0, port, 0, &out);
+
+	return out.r10 ? UINT_MAX : out.r11;
+}
+
+static inline void tdx_io_out(int size, int port, u64 value)
+{
+	struct tdx_hypercall_output out;
+
+	__tdx_hypercall(TDX_HYPERCALL_STANDARD, EXIT_REASON_IO_INSTRUCTION,
+			size, 1, port, value, &out);
+}
+
+static inline unsigned char tdx_inb(int port)
+{
+	return tdx_io_in(1, port);
+}
+
+static inline unsigned short tdx_inw(int port)
+{
+	return tdx_io_in(2, port);
+}
+
+static inline unsigned int tdx_inl(int port)
+{
+	return tdx_io_in(4, port);
+}
+
+static inline void tdx_outb(unsigned char value, int port)
+{
+	tdx_io_out(1, port, value);
+}
+
+static inline void tdx_outw(unsigned short value, int port)
+{
+	tdx_io_out(2, port, value);
+}
+
+static inline void tdx_outl(unsigned int value, int port)
+{
+	tdx_io_out(4, port, value);
+}
+
 void early_tdx_detect(void)
 {
 	u32 eax, sig[3];
@@ -26,4 +78,11 @@ void early_tdx_detect(void)
 
 	/* Cache TDX guest feature status */
 	tdx_guest_detected = true;
+
+	pio_ops.inb = tdx_inb;
+	pio_ops.inw = tdx_inw;
+	pio_ops.inl = tdx_inl;
+	pio_ops.outb = tdx_outb;
+	pio_ops.outw = tdx_outw;
+	pio_ops.outl = tdx_outl;
 }
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 12bede46d048..4a0218bedc75 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -1,7 +1,30 @@
 #ifndef _ASM_X86_SHARED_TDX_H
 #define _ASM_X86_SHARED_TDX_H
 
+#include <linux/types.h>
+
+/*
+ * Used in __tdx_hypercall() to gather the output registers' values
+ * of the TDCALL instruction when requesting services from the VMM.
+ * This is a software only structure and not part of the TDX
+ * module/VMM ABI.
+ */
+struct tdx_hypercall_output {
+	u64 r10;
+	u64 r11;
+	u64 r12;
+	u64 r13;
+	u64 r14;
+	u64 r15;
+};
+
+#define TDX_HYPERCALL_STANDARD  0
+
 #define TDX_CPUID_LEAF_ID	0x21
 #define TDX_IDENT		"IntelTDX    "
 
+/* Used to request services from the VMM */
+u64 __tdx_hypercall(u64 type, u64 fn, u64 r12, u64 r13, u64 r14,
+		    u64 r15, struct tdx_hypercall_output *out);
+
 #endif
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 53f7dd0fbe58..27eb4ab2fdd2 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -7,8 +7,6 @@
 #include <asm/ptrace.h>
 #include <asm/shared/tdx.h>
 
-#define TDX_HYPERCALL_STANDARD  0
-
 /*
  * Used in __tdx_module_call() to gather the output registers'
  * values of the TDCALL instruction when requesting services from
@@ -24,21 +22,6 @@ struct tdx_module_output {
 	u64 r11;
 };
 
-/*
- * Used in __tdx_hypercall() to gather the output registers' values
- * of the TDCALL instruction when requesting services from the VMM.
- * This is a software only structure and not part of the TDX
- * module/VMM ABI.
- */
-struct tdx_hypercall_output {
-	u64 r10;
-	u64 r11;
-	u64 r12;
-	u64 r13;
-	u64 r14;
-	u64 r15;
-};
-
 /*
  * Used by the #VE exception handler to gather the #VE exception
  * info from the TDX module. This is a software only structure
@@ -64,10 +47,6 @@ bool is_tdx_guest(void);
 u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 		      struct tdx_module_output *out);
 
-/* Used to request services from the VMM */
-u64 __tdx_hypercall(u64 type, u64 fn, u64 r12, u64 r13, u64 r14,
-		    u64 r15, struct tdx_hypercall_output *out);
-
 bool tdx_get_ve_info(struct ve_info *ve);
 
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
-- 
2.34.1



* [PATCHv2 13/29] x86/tdx: Add port I/O emulation
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (11 preceding siblings ...)
  2022-01-24 15:01 ` [PATCHv2 12/29] x86/boot/compressed: Support TDX guest port I/O at decompression time Kirill A. Shutemov
@ 2022-01-24 15:01 ` Kirill A. Shutemov
  2022-02-01 23:01   ` Thomas Gleixner
  2022-02-02  6:22   ` Borislav Petkov
  2022-01-24 15:02 ` [PATCHv2 14/29] x86/tdx: Early boot handling of port I/O Kirill A. Shutemov
                   ` (16 subsequent siblings)
  29 siblings, 2 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:01 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

TDX hypervisors cannot emulate instructions directly. This includes
port I/O which is normally emulated in the hypervisor. All port I/O
instructions inside TDX trigger the #VE exception in the guest and
would be normally emulated there.

Use a hypercall to emulate port I/O. Extend
tdx_handle_virt_exception() to handle the #VE caused by port I/O
instructions.
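
To make the decoding step concrete, here is a small user-space sketch
of how the exit qualification of an I/O #VE can be split into its
fields and how an IN result can be merged into ax. The bit positions
follow the VMX "Exit Qualification for I/O Instructions" layout that
the patch refers to; struct io_exit, decode_io_exit() and
merge_in_result() are names made up for this illustration, and the
merge mirrors the masking idea in tdx_handle_io() rather than
reproducing it.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of the exit qualification for an I/O instruction. */
struct io_exit {
	int size;	/* access width in bytes: 1, 2 or 4 */
	int in;		/* 1 for IN, 0 for OUT */
	int string;	/* 1 for INS/OUTS */
	uint16_t port;
};

static struct io_exit decode_io_exit(uint32_t exit_qual)
{
	struct io_exit io;

	io.size   = (int)(exit_qual & 7) + 1;	/* bits 2:0 hold size - 1 */
	io.in     = (exit_qual >> 3) & 1;	/* bit 3: direction */
	io.string = (exit_qual >> 4) & 1;	/* bit 4: string instruction */
	io.port   = (uint16_t)(exit_qual >> 16);	/* bits 31:16: port */

	return io;
}

/* Merge an IN result into the low 'size' bytes of ax, keeping the rest. */
static uint64_t merge_in_result(uint64_t ax, uint64_t value, int size)
{
	uint64_t mask;

	assert(size == 1 || size == 2 || size == 4);
	mask = (1ULL << (8 * size)) - 1;

	return (ax & ~mask) | (value & mask);
}
```

For example, a one-byte IN from port 0x3f8 would be reported with an
exit qualification of 0x03f80008 under this layout.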

String I/O operations are not supported in TDX. Unroll them by declaring
the CC_ATTR_GUEST_UNROLL_STRING_IO confidential computing attribute.
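
What "unrolling" means here can be sketched as follows: instead of a
single string instruction, the buffer is walked and one single-byte
access is issued per element, each of which can then be emulated on its
own. This is a user-space model only; sim_outb() and outsb_unrolled()
are hypothetical names, and the log array stands in for the device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated single-byte port write; records what reached the "device". */
static uint8_t port_log[64];
static size_t port_log_len;

static void sim_outb(uint8_t value, uint16_t port)
{
	(void)port;
	if (port_log_len < sizeof(port_log))
		port_log[port_log_len++] = value;
}

/* Unrolled replacement for a string write: one sim_outb() per byte. */
static void outsb_unrolled(uint16_t port, const void *addr, size_t count)
{
	const uint8_t *p = addr;

	assert(addr != NULL || count == 0);

	while (count--)
		sim_outb(*p++, port);
}
```

The cost is one trap (or hypercall) per element rather than one per
string operation, which is acceptable for the low-volume users of
port I/O.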

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/cc_platform.c |  3 +++
 arch/x86/kernel/tdx.c         | 48 +++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index c72b3919bca9..8da246ab4339 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -17,6 +17,9 @@
 
 static bool intel_cc_platform_has(enum cc_attr attr)
 {
+	if (attr == CC_ATTR_GUEST_UNROLL_STRING_IO)
+		return true;
+
 	return false;
 }
 
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 8e630eeb765d..e73af22a4c11 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -13,6 +13,12 @@
 /* TDX module Call Leaf IDs */
 #define TDX_GET_VEINFO			3
 
+/* See Exit Qualification for I/O Instructions in VMX documentation */
+#define VE_IS_IO_IN(exit_qual)		(((exit_qual) & 8) ? 1 : 0)
+#define VE_GET_IO_SIZE(exit_qual)	(((exit_qual) & 7) + 1)
+#define VE_GET_PORT_NUM(exit_qual)	((exit_qual) >> 16)
+#define VE_IS_IO_STRING(exit_qual)	(((exit_qual) & 16) ? 1 : 0)
+
 static bool tdx_guest_detected __ro_after_init;
 
 /*
@@ -257,6 +263,45 @@ static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
 	return insn.length;
 }
 
+/*
+ * Emulate I/O using hypercall.
+ *
+ * Assumes the IO instruction was using ax, which is enforced
+ * by the standard io.h macros.
+ *
+ * Return true on success or false on failure.
+ */
+static bool tdx_handle_io(struct pt_regs *regs, u32 exit_qual)
+{
+	struct tdx_hypercall_output out;
+	int size, port, ret;
+	u64 mask;
+	bool in;
+
+	if (VE_IS_IO_STRING(exit_qual))
+		return false;
+
+	in   = VE_IS_IO_IN(exit_qual);
+	size = VE_GET_IO_SIZE(exit_qual);
+	port = VE_GET_PORT_NUM(exit_qual);
+	mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
+
+	/*
+	 * Emulate the I/O read/write via hypercall. More info about
+	 * ABI can be found in TDX Guest-Host-Communication Interface
+	 * (GHCI) sec titled "TDG.VP.VMCALL<Instruction.IO>".
+	 */
+	ret = _tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, !in, port,
+			     in ? 0 : regs->ax, &out);
+	if (!in)
+		return !ret;
+
+	regs->ax &= ~mask;
+	regs->ax |= ret ? UINT_MAX : out.r11 & mask;
+
+	return !ret;
+}
+
 bool tdx_get_ve_info(struct ve_info *ve)
 {
 	struct tdx_module_output out;
@@ -333,6 +378,9 @@ static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 		if (!ret)
 			pr_warn_once("MMIO failed\n");
 		break;
+	case EXIT_REASON_IO_INSTRUCTION:
+		ret = tdx_handle_io(regs, ve->exit_qual);
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		break;
-- 
2.34.1



* [PATCHv2 14/29] x86/tdx: Early boot handling of port I/O
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (12 preceding siblings ...)
  2022-01-24 15:01 ` [PATCHv2 13/29] x86/tdx: Add port I/O emulation Kirill A. Shutemov
@ 2022-01-24 15:02 ` Kirill A. Shutemov
  2022-02-01 23:02   ` Thomas Gleixner
  2022-02-02 10:09   ` Borislav Petkov
  2022-01-24 15:02 ` [PATCHv2 15/29] x86/tdx: Wire up KVM hypercalls Kirill A. Shutemov
                   ` (15 subsequent siblings)
  29 siblings, 2 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov

From: Andi Kleen <ak@linux.intel.com>

TDX guests cannot do port I/O directly. The TDX module triggers a #VE
exception to let the guest kernel emulate port I/O by converting it
into TDCALLs to call the host.

But before IDT handlers are set up, port I/O cannot be emulated using
normal kernel #VE handlers. To support #VE-based emulation during
this boot window, add minimal early #VE handling to the early
exception handlers. This is similar to what AMD SEV does. This is
mainly to support earlyprintk's serial driver, as well as potentially
the VGA driver (although it is expected not to be used).

The early handler only supports I/O-related #VE exceptions. Unhandled or
failed exceptions will be handled via early_fixup_exception() (like
normal exception failures).
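
The control flow described above can be modelled in a few lines. This
is a simplified user-space sketch, not the kernel's
do_early_exception(): the early handler is reduced to a check of the
exit reason, and the enum and function names are invented for the
example. The numeric values are the x86 #VE vector (20) and the VMX
exit reason for I/O instructions (30).

```c
#include <assert.h>
#include <stdbool.h>

#define TRAP_VE			20	/* #VE vector number */
#define EXIT_REASON_IO		30	/* VMX exit reason: I/O instruction */

enum outcome { HANDLED_EARLY, FELL_BACK_TO_FIXUP };

/* Hypothetical early handler: only I/O-related #VEs are supported. */
static bool early_handle_ve(int exit_reason)
{
	return exit_reason == EXIT_REASON_IO;
}

static enum outcome early_exception(int trapnr, int exit_reason)
{
	if (trapnr == TRAP_VE && early_handle_ve(exit_reason))
		return HANDLED_EARLY;

	/* Everything else takes the generic early fixup path. */
	return FELL_BACK_TO_FIXUP;
}
```

A #VE with any other exit reason, or any other trap, falls through to
the fixup path exactly as it would without the TDX hook.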

This early handler enables the use of normal in*/out* macros without
patching them for every driver. Since there is no expectation that
early port I/O is performance-critical, the #VE emulation cost is worth
the simplicity benefit of not patching the port I/O usage in early
code. There are also no concerns with nesting, since there should be
no NMIs or interrupts this early.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/tdx.h |  4 ++++
 arch/x86/kernel/head64.c   |  3 +++
 arch/x86/kernel/tdx.c      | 17 +++++++++++++++++
 3 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 27eb4ab2fdd2..8013686192fd 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -53,12 +53,16 @@ bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
 
 void tdx_safe_halt(void);
 
+bool tdx_early_handle_ve(struct pt_regs *regs);
+
 #else
 
 static inline void tdx_early_init(void) { };
 static inline bool is_tdx_guest(void) { return false; }
 static inline void tdx_safe_halt(void) { };
 
+static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
+
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 1cb6346ec3d1..76d298ddfe75 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -417,6 +417,9 @@ void __init do_early_exception(struct pt_regs *regs, int trapnr)
 	    trapnr == X86_TRAP_VC && handle_vc_boot_ghcb(regs))
 		return;
 
+	if (trapnr == X86_TRAP_VE && tdx_early_handle_ve(regs))
+		return;
+
 	early_fixup_exception(regs, trapnr);
 }
 
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index e73af22a4c11..ebb29dfb3ad4 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -302,6 +302,23 @@ static bool tdx_handle_io(struct pt_regs *regs, u32 exit_qual)
 	return !ret;
 }
 
+/*
+ * Early #VE exception handler. Only handles a subset of port I/O.
+ * Intended only for earlyprintk. Returns false on failure.
+ */
+__init bool tdx_early_handle_ve(struct pt_regs *regs)
+{
+	struct ve_info ve;
+
+	if (tdx_get_ve_info(&ve))
+		return false;
+
+	if (ve.exit_reason != EXIT_REASON_IO_INSTRUCTION)
+		return false;
+
+	return tdx_handle_io(regs, ve.exit_qual);
+}
+
 bool tdx_get_ve_info(struct ve_info *ve)
 {
 	struct tdx_module_output out;
-- 
2.34.1



* [PATCHv2 15/29] x86/tdx: Wire up KVM hypercalls
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (13 preceding siblings ...)
  2022-01-24 15:02 ` [PATCHv2 14/29] x86/tdx: Early boot handling of port I/O Kirill A. Shutemov
@ 2022-01-24 15:02 ` Kirill A. Shutemov
  2022-02-01 23:05   ` Thomas Gleixner
  2022-01-24 15:02 ` [PATCHv2 16/29] x86/boot: Add a trampoline for booting APs via firmware handoff Kirill A. Shutemov
                   ` (14 subsequent siblings)
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

KVM hypercalls use the VMCALL or VMMCALL instructions. Although the ABI
is similar, those instructions no longer function for TDX guests.

Make vendor-specific TDVMCALLs instead of VMCALL. This enables TDX
guests to run with KVM acting as the hypervisor.
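
A sketch of the argument mapping may help when reading the diff. It
only models how the KVM hypercall number and parameters line up with
the positional arguments of __tdx_hypercall() as called in this patch
(nr in the "type" slot, p1 in the "fn" slot, and so on); struct
hypercall_args and the two helpers are illustrative names, and no
TDCALL is issued. The assert reflects the assumption that KVM
hypercall numbers are non-zero, since 0 is TDX_HYPERCALL_STANDARD.

```c
#include <assert.h>
#include <stdint.h>

/* Positional arguments of __tdx_hypercall(), minus the output pointer. */
struct hypercall_args {
	uint64_t type, fn, r12, r13, r14, r15;
};

/* Map a KVM hypercall onto the vendor-specific TDVMCALL arguments. */
static struct hypercall_args kvm_to_tdx_args(unsigned int nr,
					     unsigned long p1, unsigned long p2,
					     unsigned long p3, unsigned long p4)
{
	struct hypercall_args a = { nr, p1, p2, p3, p4, 0 };

	assert(nr != 0);	/* 0 would denote a standard TDX hypercall */

	return a;
}

/* Mirrors kvm_hypercall2(): unused parameters are passed as zero. */
static struct hypercall_args kvm_hypercall2_args(unsigned int nr,
						 unsigned long p1,
						 unsigned long p2)
{
	return kvm_to_tdx_args(nr, p1, p2, 0, 0);
}
```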

Among other things, KVM hypercalls are used to send IPIs.

Since the KVM driver can be built as a kernel module, export
tdx_kvm_hypercall() to make the symbol visible to kvm.ko.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/kvm_para.h | 22 ++++++++++++++++++++++
 arch/x86/include/asm/tdx.h      | 11 +++++++++++
 arch/x86/kernel/tdx.c           | 15 +++++++++++++++
 3 files changed, 48 insertions(+)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 56935ebb1dfe..57bc74e112f2 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -7,6 +7,8 @@
 #include <linux/interrupt.h>
 #include <uapi/asm/kvm_para.h>
 
+#include <asm/tdx.h>
+
 #ifdef CONFIG_KVM_GUEST
 bool kvm_check_and_clear_guest_paused(void);
 #else
@@ -32,6 +34,10 @@ static inline bool kvm_check_and_clear_guest_paused(void)
 static inline long kvm_hypercall0(unsigned int nr)
 {
 	long ret;
+
+	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+		return tdx_kvm_hypercall(nr, 0, 0, 0, 0);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr)
@@ -42,6 +48,10 @@ static inline long kvm_hypercall0(unsigned int nr)
 static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
 {
 	long ret;
+
+	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+		return tdx_kvm_hypercall(nr, p1, 0, 0, 0);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1)
@@ -53,6 +63,10 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
 				  unsigned long p2)
 {
 	long ret;
+
+	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+		return tdx_kvm_hypercall(nr, p1, p2, 0, 0);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2)
@@ -64,6 +78,10 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
 				  unsigned long p2, unsigned long p3)
 {
 	long ret;
+
+	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+		return tdx_kvm_hypercall(nr, p1, p2, p3, 0);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
@@ -76,6 +94,10 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 				  unsigned long p4)
 {
 	long ret;
+
+	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+		return tdx_kvm_hypercall(nr, p1, p2, p3, p4);
+
 	asm volatile(KVM_HYPERCALL
 		     : "=a"(ret)
 		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 8013686192fd..4bcaadf21dc6 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -65,4 +65,15 @@ static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
+#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_INTEL_TDX_GUEST)
+long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
+		       unsigned long p3, unsigned long p4);
+#else
+static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
+				     unsigned long p2, unsigned long p3,
+				     unsigned long p4)
+{
+	return -ENODEV;
+}
+#endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ebb29dfb3ad4..a4e696f12666 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -44,6 +44,21 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
 	return out->r10;
 }
 
+#ifdef CONFIG_KVM_GUEST
+long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
+		       unsigned long p3, unsigned long p4)
+{
+	struct tdx_hypercall_output out;
+
+	/* A non-zero return value indicates a buggy TDX module, so panic */
+	if (__tdx_hypercall(nr, p1, p2, p3, p4, 0, &out))
+		panic("KVM hypercall %u failed. Buggy TDX module?\n", nr);
+
+	return out.r10;
+}
+EXPORT_SYMBOL_GPL(tdx_kvm_hypercall);
+#endif
+
 static u64 __cpuidle _tdx_halt(const bool irq_disabled, const bool do_sti)
 {
 	/*
-- 
2.34.1



* [PATCHv2 16/29] x86/boot: Add a trampoline for booting APs via firmware handoff
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (14 preceding siblings ...)
  2022-01-24 15:02 ` [PATCHv2 15/29] x86/tdx: Wire up KVM hypercalls Kirill A. Shutemov
@ 2022-01-24 15:02 ` Kirill A. Shutemov
  2022-02-01 23:06   ` Thomas Gleixner
  2022-02-02 11:27   ` Borislav Petkov
  2022-01-24 15:02 ` [PATCHv2 17/29] x86/acpi, x86/boot: Add multiprocessor wake-up support Kirill A. Shutemov
                   ` (13 subsequent siblings)
  29 siblings, 2 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Sean Christopherson, Kai Huang, Kirill A . Shutemov

From: Sean Christopherson <sean.j.christopherson@intel.com>

Historically, x86 platforms have booted secondary processors (APs)
using INIT followed by Startup IPI (SIPI) messages. In regular
VMs, this boot sequence is supported by VMM emulation. But such a
wakeup model is fatal for secure VMs like TDX, in which the VMM is an
untrusted entity. To address this issue, a new wakeup model was added
in ACPI v6.4, in which firmware (like TDX virtual BIOS) will help boot
the APs. More details about this wakeup model can be found in ACPI
specification v6.4, the section titled "Multiprocessor Wakeup Structure".

Since the existing trampoline code requires processors to boot in real
mode with 16-bit addressing, it will not work for this wakeup model
(because it boots the AP in 64-bit mode). To handle it, extend the
trampoline code to support 64-bit mode firmware handoff. Also, extend
the IDT and GDT pointers to support 64-bit mode handoff.
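
The reason the pointers need extending is the size of the operand that
LIDT/LGDT read: a 2-byte limit followed by a 4-byte base outside 64-bit
mode, and by an 8-byte base in 64-bit mode. The sketch below just
spells out those two layouts; the struct names are made up for the
example, and the packed attribute is the GCC/Clang spelling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Descriptor-table pointer as read by LIDT/LGDT outside 64-bit mode. */
struct desc_ptr32 {
	uint16_t limit;
	uint32_t base;
} __attribute__((packed));

/* The same operand in 64-bit mode: the base grows to 8 bytes. */
struct desc_ptr64 {
	uint16_t limit;
	uint64_t base;
} __attribute__((packed));

_Static_assert(sizeof(struct desc_ptr32) == 6, "2-byte limit + 4-byte base");
_Static_assert(sizeof(struct desc_ptr64) == 10, "2-byte limit + 8-byte base");

/* Space to reserve so that either form fits in the same slot. */
static size_t desc_ptr_reserve(void)
{
	return sizeof(struct desc_ptr64) > sizeof(struct desc_ptr32) ?
	       sizeof(struct desc_ptr64) : sizeof(struct desc_ptr32);
}
```

This matches the sizes visible in the diff: tr_idt grows from a 6-byte
fill to a .short plus a .quad, and tr_gdt64 is a .short followed by two
.long values.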

There is no TDX-specific detection for this new boot method. The kernel
will rely on it as the sole boot method whenever the new ACPI structure
is present.
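
The resulting selection order can be summarised with a small model of
the logic in do_boot_cpu(). It is a sketch under simplifying
assumptions: struct apic_ops carries only the two hooks, and the enum
and helper names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*wakeup_fn)(int apicid, unsigned long start_ip);

/* Reduced stand-in for struct apic: only the two wakeup hooks. */
struct apic_ops {
	wakeup_fn wakeup_secondary_cpu;
	wakeup_fn wakeup_secondary_cpu_64;
};

enum wake_path { WAKE_64BIT_HOOK, WAKE_REALMODE_HOOK, WAKE_INIT_SIPI };

/* The 64-bit hook wins, then the real-mode hook, then INIT/SIPI. */
static enum wake_path pick_wake_path(const struct apic_ops *apic)
{
	assert(apic != NULL);

	if (apic->wakeup_secondary_cpu_64)
		return WAKE_64BIT_HOOK;
	if (apic->wakeup_secondary_cpu)
		return WAKE_REALMODE_HOOK;

	return WAKE_INIT_SIPI;
}

/* The start address follows the same preference. */
static unsigned long pick_start_ip(const struct apic_ops *apic,
				   unsigned long trampoline_start,
				   unsigned long trampoline_start64)
{
	return apic->wakeup_secondary_cpu_64 ? trampoline_start64
					     : trampoline_start;
}

/* Placeholder hook used only to populate the table. */
static int dummy_wakeup(int apicid, unsigned long start_ip)
{
	(void)apicid;
	(void)start_ip;
	return 0;
}
```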

The ACPI table parser for the MADT multiprocessor wakeup structure and
the wakeup method that uses this structure will be added by the following
patch in this series.

Reported-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/apic.h              |  2 ++
 arch/x86/include/asm/realmode.h          |  1 +
 arch/x86/kernel/smpboot.c                | 12 ++++++--
 arch/x86/realmode/rm/header.S            |  1 +
 arch/x86/realmode/rm/trampoline_64.S     | 38 ++++++++++++++++++++++++
 arch/x86/realmode/rm/trampoline_common.S | 12 +++++++-
 6 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 48067af94678..35006e151774 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -328,6 +328,8 @@ struct apic {
 
 	/* wakeup_secondary_cpu */
 	int	(*wakeup_secondary_cpu)(int apicid, unsigned long start_eip);
+	/* wakeup secondary CPU using 64-bit wakeup point */
+	int	(*wakeup_secondary_cpu_64)(int apicid, unsigned long start_eip);
 
 	void	(*inquire_remote_apic)(int apicid);
 
diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index 331474b150f1..fd6f6e5b755a 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -25,6 +25,7 @@ struct real_mode_header {
 	u32	sev_es_trampoline_start;
 #endif
 #ifdef CONFIG_X86_64
+	u32	trampoline_start64;
 	u32	trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 617012f4619f..6269dd126dba 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1088,6 +1088,11 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
 	unsigned long boot_error = 0;
 	unsigned long timeout;
 
+#ifdef CONFIG_X86_64
+	/* If 64-bit wakeup method exists, use the 64-bit mode trampoline IP */
+	if (apic->wakeup_secondary_cpu_64)
+		start_ip = real_mode_header->trampoline_start64;
+#endif
 	idle->thread.sp = (unsigned long)task_pt_regs(idle);
 	early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
 	initial_code = (unsigned long)start_secondary;
@@ -1129,11 +1134,14 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
 
 	/*
 	 * Wake up a CPU in difference cases:
-	 * - Use the method in the APIC driver if it's defined
+	 * - Use a method from the APIC driver if one is defined, with wakeup
+	 *   straight to 64-bit mode preferred over wakeup to RM.
 	 * Otherwise,
 	 * - Use an INIT boot APIC message for APs or NMI for BSP.
 	 */
-	if (apic->wakeup_secondary_cpu)
+	if (apic->wakeup_secondary_cpu_64)
+		boot_error = apic->wakeup_secondary_cpu_64(apicid, start_ip);
+	else if (apic->wakeup_secondary_cpu)
 		boot_error = apic->wakeup_secondary_cpu(apicid, start_ip);
 	else
 		boot_error = wakeup_cpu_via_init_nmi(cpu, start_ip, apicid,
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index 8c1db5bf5d78..2eb62be6d256 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -24,6 +24,7 @@ SYM_DATA_START(real_mode_header)
 	.long	pa_sev_es_trampoline_start
 #endif
 #ifdef CONFIG_X86_64
+	.long	pa_trampoline_start64
 	.long	pa_trampoline_pgd;
 #endif
 	/* ACPI S3 wakeup */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index cc8391f86cdb..ae112a91592f 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -161,6 +161,19 @@ SYM_CODE_START(startup_32)
 	ljmpl	$__KERNEL_CS, $pa_startup_64
 SYM_CODE_END(startup_32)
 
+SYM_CODE_START(pa_trampoline_compat)
+	/*
+	 * In compatibility mode.  Prep ESP and DX for startup_32, then disable
+	 * paging and complete the switch to legacy 32-bit mode.
+	 */
+	movl	$rm_stack_end, %esp
+	movw	$__KERNEL_DS, %dx
+
+	movl	$X86_CR0_PE, %eax
+	movl	%eax, %cr0
+	ljmpl   $__KERNEL32_CS, $pa_startup_32
+SYM_CODE_END(pa_trampoline_compat)
+
 	.section ".text64","ax"
 	.code64
 	.balign 4
@@ -169,6 +182,20 @@ SYM_CODE_START(startup_64)
 	jmpq	*tr_start(%rip)
 SYM_CODE_END(startup_64)
 
+SYM_CODE_START(trampoline_start64)
+	/*
+	 * APs start here on a direct transfer from 64-bit BIOS with identity
+	 * mapped page tables.  Load the kernel's GDT in order to gear down to
+	 * 32-bit mode (to handle 4-level vs. 5-level paging), and to (re)load
+	 * segment registers.  Load the zero IDT so any fault triggers a
+	 * shutdown instead of jumping back into BIOS.
+	 */
+	lidt	tr_idt(%rip)
+	lgdt	tr_gdt64(%rip)
+
+	ljmpl	*tr_compat(%rip)
+SYM_CODE_END(trampoline_start64)
+
 	.section ".rodata","a"
 	# Duplicate the global descriptor table
 	# so the kernel can live anywhere
@@ -182,6 +209,17 @@ SYM_DATA_START(tr_gdt)
 	.quad	0x00cf93000000ffff	# __KERNEL_DS
 SYM_DATA_END_LABEL(tr_gdt, SYM_L_LOCAL, tr_gdt_end)
 
+SYM_DATA_START(tr_gdt64)
+	.short	tr_gdt_end - tr_gdt - 1	# gdt limit
+	.long	pa_tr_gdt
+	.long	0
+SYM_DATA_END(tr_gdt64)
+
+SYM_DATA_START(tr_compat)
+	.long	pa_trampoline_compat
+	.short	__KERNEL32_CS
+SYM_DATA_END(tr_compat)
+
 	.bss
 	.balign	PAGE_SIZE
 SYM_DATA(trampoline_pgd, .space PAGE_SIZE)
diff --git a/arch/x86/realmode/rm/trampoline_common.S b/arch/x86/realmode/rm/trampoline_common.S
index 5033e640f957..4331c32c47f8 100644
--- a/arch/x86/realmode/rm/trampoline_common.S
+++ b/arch/x86/realmode/rm/trampoline_common.S
@@ -1,4 +1,14 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 	.section ".rodata","a"
 	.balign	16
-SYM_DATA_LOCAL(tr_idt, .fill 1, 6, 0)
+
+/*
+ * When a bootloader hands off to the kernel in 32-bit mode, an
+ * IDT with a 2-byte limit and 4-byte base is needed. When a
+ * bootloader hands off to the kernel in 64-bit mode, the base
+ * address extends to 8 bytes. Reserve enough space for either scenario.
+ */
+SYM_DATA_START_LOCAL(tr_idt)
+	.short  0
+	.quad   0
+SYM_DATA_END(tr_idt)
-- 
2.34.1



* [PATCHv2 17/29] x86/acpi, x86/boot: Add multiprocessor wake-up support
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (15 preceding siblings ...)
  2022-01-24 15:02 ` [PATCHv2 16/29] x86/boot: Add a trampoline for booting APs via firmware handoff Kirill A. Shutemov
@ 2022-01-24 15:02 ` Kirill A. Shutemov
  2022-02-01 23:27   ` Thomas Gleixner
  2022-01-24 15:02 ` [PATCHv2 18/29] x86/boot: Avoid #VE during boot for TDX platforms Kirill A. Shutemov
                   ` (12 subsequent siblings)
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Sean Christopherson, Rafael J . Wysocki, Kirill A . Shutemov

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

TDX cannot use the INIT/SIPI protocol to bring up secondary CPUs because
it requires assistance from the untrusted VMM.

For platforms that do not support INIT/SIPI, ACPI defines a mailbox-based
wakeup model via the MADT Multiprocessor Wakeup Structure. More details
can be found in ACPI specification v6.4, in the section titled
"Multiprocessor Wakeup Structure". If platform firmware produces the
Multiprocessor Wakeup Structure, the OS may use this mailbox-based
mechanism to wake up the APs.

Add parsing support for the ACPI MADT wakeup structure on x86. If the
structure is present, update apic->wakeup_secondary_cpu_64 with a
handler that uses the MADT wakeup mailbox to wake up the CPU.
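
As a rough userspace illustration of the mailbox handshake described
above (not the kernel code in this patch), the sketch below posts a
wakeup request and polls for the firmware acknowledgement. The names
mp_wake_mailbox, mp_wake_post() and mp_wake_wait_ack() are hypothetical,
the structure layout is simplified, and C11 atomics stand in for
smp_store_release() and READ_ONCE().

```c
#include <stdatomic.h>
#include <stdint.h>

/* Simplified, assumed layout; the real mailbox is defined by ACPI 6.4. */
struct mp_wake_mailbox {
	_Atomic uint16_t command;
	uint32_t apic_id;
	uint64_t wakeup_vector;
};

#define MP_WAKE_CMD_NOOP	0
#define MP_WAKE_CMD_WAKEUP	1

/* Publish the target CPU and entry point first, then the command. */
void mp_wake_post(struct mp_wake_mailbox *mb, uint32_t apic_id,
		  uint64_t start_ip)
{
	mb->apic_id = apic_id;
	mb->wakeup_vector = start_ip;
	/* Release ordering: both stores above are visible before the command. */
	atomic_store_explicit(&mb->command, MP_WAKE_CMD_WAKEUP,
			      memory_order_release);
}

/* Poll until the command is cleared; 0 on ack, -1 once spins run out. */
int mp_wake_wait_ack(struct mp_wake_mailbox *mb, unsigned int spins)
{
	while (spins--) {
		if (atomic_load_explicit(&mb->command, memory_order_acquire) ==
		    MP_WAKE_CMD_NOOP)
			return 0;
	}
	return -1;
}
```

A caller would invoke mp_wake_post() once per AP and treat a -1 from
mp_wake_wait_ack() as a failed wakeup, which is roughly what the -EIO
path in the patch does.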

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/apic.h |   5 ++
 arch/x86/kernel/acpi/boot.c | 114 ++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/apic/apic.c |  10 ++++
 3 files changed, 129 insertions(+)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 35006e151774..bd8ae0a7010a 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -490,6 +490,11 @@ static inline unsigned int read_apic_id(void)
 	return apic->get_apic_id(reg);
 }
 
+#ifdef CONFIG_X86_64
+typedef int (*wakeup_cpu_handler)(int apicid, unsigned long start_eip);
+extern void acpi_wake_cpu_handler_update(wakeup_cpu_handler handler);
+#endif
+
 extern int default_apic_id_valid(u32 apicid);
 extern int default_acpi_madt_oem_check(char *, char *);
 extern void default_setup_apic_routing(void);
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 5b6d1a95776f..af204a217575 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -65,6 +65,15 @@ static u64 acpi_lapic_addr __initdata = APIC_DEFAULT_PHYS_BASE;
 static bool acpi_support_online_capable;
 #endif
 
+#ifdef CONFIG_X86_64
+/* Physical address of the Multiprocessor Wakeup Structure mailbox */
+static u64 acpi_mp_wake_mailbox_paddr;
+/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
+static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
+/* Lock to protect mailbox (acpi_mp_wake_mailbox) from parallel access */
+static DEFINE_SPINLOCK(mailbox_lock);
+#endif
+
 #ifdef CONFIG_X86_IO_APIC
 /*
  * Locks related to IOAPIC hotplug
@@ -336,6 +345,80 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e
 	return 0;
 }
 
+#ifdef CONFIG_X86_64
+/* Wake up the CPU with the given APIC ID via the ACPI MP wakeup mailbox */
+static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
+{
+	static physid_mask_t apic_id_wakemap = PHYSID_MASK_NONE;
+	unsigned long flags;
+	u8 timeout;
+
+	/* Remap mailbox memory only for the first call to acpi_wakeup_cpu() */
+	if (physids_empty(apic_id_wakemap)) {
+		acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
+						sizeof(*acpi_mp_wake_mailbox),
+						MEMREMAP_WB);
+	}
+
+	/*
+	 * According to the ACPI specification r6.4, section titled
+	 * "Multiprocessor Wakeup Structure" the mailbox-based wakeup
+	 * mechanism cannot be used more than once for the same CPU.
+	 * Skip wakeups if they are attempted more than once.
+	 */
+	if (physid_isset(apicid, apic_id_wakemap)) {
+		pr_err("CPU already awake (APIC ID %x), skipping wakeup\n",
+		       apicid);
+		return -EINVAL;
+	}
+
+	spin_lock_irqsave(&mailbox_lock, flags);
+
+	/*
+	 * Mailbox memory is shared between firmware and OS. Firmware will
+	 * listen on mailbox command address, and once it receives the wakeup
+	 * command, CPU associated with the given apicid will be booted.
+	 *
+	 * The values of apic_id and wakeup_vector have to be set before
+	 * updating the wakeup command. To make the compiler preserve the
+	 * order of writes, use smp_store_release().
+	 */
+	smp_store_release(&acpi_mp_wake_mailbox->apic_id, apicid);
+	smp_store_release(&acpi_mp_wake_mailbox->wakeup_vector, start_ip);
+	smp_store_release(&acpi_mp_wake_mailbox->command,
+			  ACPI_MP_WAKE_COMMAND_WAKEUP);
+
+	/*
+	 * After writing the wakeup command, wait for maximum timeout of 0xFF
+	 * for firmware to reset the command address back to zero to indicate
+	 * the successful reception of command.
+	 * NOTE: 0xFF as timeout value is decided based on our experiments.
+	 *
+	 * XXX: Change the timeout once ACPI specification comes up with
+	 *      standard maximum timeout value.
+	 */
+	timeout = 0xFF;
+	while (READ_ONCE(acpi_mp_wake_mailbox->command) && --timeout)
+		cpu_relax();
+
+	/* If timed out (timeout == 0), return error */
+	if (!timeout) {
+		spin_unlock_irqrestore(&mailbox_lock, flags);
+		return -EIO;
+	}
+
+	/*
+	 * If the CPU wakeup process is successful, store the
+	 * status in apic_id_wakemap to prevent re-wakeup
+	 * requests.
+	 */
+	physid_set(apicid, apic_id_wakemap);
+
+	spin_unlock_irqrestore(&mailbox_lock, flags);
+
+	return 0;
+}
+#endif
 #endif				/*CONFIG_X86_LOCAL_APIC */
 
 #ifdef CONFIG_X86_IO_APIC
@@ -1083,6 +1166,29 @@ static int __init acpi_parse_madt_lapic_entries(void)
 	}
 	return 0;
 }
+
+#ifdef CONFIG_X86_64
+static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+				     const unsigned long end)
+{
+	struct acpi_madt_multiproc_wakeup *mp_wake;
+
+	if (!IS_ENABLED(CONFIG_SMP))
+		return -ENODEV;
+
+	mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
+	if (BAD_MADT_ENTRY(mp_wake, end))
+		return -EINVAL;
+
+	acpi_table_print_madt_entry(&header->common);
+
+	acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
+
+	acpi_wake_cpu_handler_update(acpi_wakeup_cpu);
+
+	return 0;
+}
+#endif				/* CONFIG_X86_64 */
 #endif				/* CONFIG_X86_LOCAL_APIC */
 
 #ifdef	CONFIG_X86_IO_APIC
@@ -1278,6 +1384,14 @@ static void __init acpi_process_madt(void)
 
 				smp_found_config = 1;
 			}
+
+#ifdef CONFIG_X86_64
+			/*
+			 * Parse MADT MP Wake entry.
+			 */
+			acpi_table_parse_madt(ACPI_MADT_TYPE_MULTIPROC_WAKEUP,
+					      acpi_parse_mp_wake, 1);
+#endif
 		}
 		if (error == -EINVAL) {
 			/*
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index b70344bf6600..3c8f2c797a98 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2551,6 +2551,16 @@ u32 x86_msi_msg_get_destid(struct msi_msg *msg, bool extid)
 }
 EXPORT_SYMBOL_GPL(x86_msi_msg_get_destid);
 
+#ifdef CONFIG_X86_64
+void __init acpi_wake_cpu_handler_update(wakeup_cpu_handler handler)
+{
+	struct apic **drv;
+
+	for (drv = __apicdrivers; drv < __apicdrivers_end; drv++)
+		(*drv)->wakeup_secondary_cpu_64 = handler;
+}
+#endif
+
 /*
  * Override the generic EOI implementation with an optimized version.
  * Only called during early boot when only one CPU is active and with
-- 
2.34.1



* [PATCHv2 18/29] x86/boot: Avoid #VE during boot for TDX platforms
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (16 preceding siblings ...)
  2022-01-24 15:02 ` [PATCHv2 17/29] x86/acpi, x86/boot: Add multiprocessor wake-up support Kirill A. Shutemov
@ 2022-01-24 15:02 ` Kirill A. Shutemov
  2022-02-02  0:04   ` Thomas Gleixner
  2022-01-24 15:02 ` [PATCHv2 19/29] x86/topology: Disable CPU online/offline control for TDX guests Kirill A. Shutemov
                   ` (11 subsequent siblings)
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov

From: Sean Christopherson <seanjc@google.com>

There are a few MSRs and control register bits that the kernel
normally needs to modify during boot. But, TDX disallows
modification of these registers to help provide consistent security
guarantees. Fortunately, TDX ensures that these are all in the correct
state before the kernel loads, which means the kernel does not need to
modify them.

The conditions to avoid are:

 * Any writes to the EFER MSR
 * Clearing CR0.NE
 * Clearing CR4.MCE
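
A minimal sketch of how the three conditions translate into register
values, written as plain C rather than the boot assembly this patch
touches. The helper names efer_write_needed(), boot_cr4() and boot_cr0()
are made up for illustration; the bit positions are the architectural
ones. The sketch assumes CONFIG_X86_MCE is enabled and mirrors the
compressed-boot trampoline (head_64.S additionally sets CR4.PGE).

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR0_PE	(1u << 0)
#define X86_CR0_NE	(1u << 5)
#define X86_CR0_PG	(1u << 31)
#define X86_CR4_PAE	(1u << 5)
#define X86_CR4_MCE	(1u << 6)
#define X86_CR4_LA57	(1u << 12)
#define EFER_LME	(1u << 8)

/* Only issue WRMSR when setting the wanted bits changes the low word. */
bool efer_write_needed(uint32_t cur_lo, uint32_t want_bits)
{
	return (cur_lo | want_bits) != cur_lo;
}

/* Keep MCE from the current CR4, drop everything else, then add paging bits. */
uint32_t boot_cr4(uint32_t cur, bool la57)
{
	uint32_t val = cur & X86_CR4_MCE;

	val |= X86_CR4_PAE;
	if (la57)
		val |= X86_CR4_LA57;
	return val;
}

/* CR0 value used when re-enabling paging: NE stays set. */
uint32_t boot_cr0(void)
{
	return X86_CR0_PG | X86_CR0_NE | X86_CR0_PE;
}
```

On a TDX guest EFER.LME is already set when the trampoline runs, so
efer_write_needed() would return false and the WRMSR would be skipped.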

This theoretically makes the guest boot more fragile. If, for instance,
EFER was set up incorrectly and a WRMSR was performed, it would trigger
an early exception panic, or a triple fault if it happened before early
exception handling was set up. However, this is likely to trip up the
guest BIOS long before control reaches the kernel. In any case, these
kinds of problems are unlikely to occur in production environments, and
developers have good debug tools to fix them quickly.

Change the common boot code to work on TDX and non-TDX systems.
This should have no functional effect on non-TDX systems.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                     |  1 +
 arch/x86/boot/compressed/head_64.S   | 25 +++++++++++++++++++++----
 arch/x86/boot/compressed/pgtable.h   |  2 +-
 arch/x86/kernel/head_64.S            | 24 ++++++++++++++++++++++--
 arch/x86/realmode/rm/trampoline_64.S | 27 +++++++++++++++++++++++----
 5 files changed, 68 insertions(+), 11 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1491f25c844e..1c59e02792e4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -885,6 +885,7 @@ config INTEL_TDX_GUEST
 	depends on X86_64 && CPU_SUP_INTEL
 	depends on X86_X2APIC
 	select ARCH_HAS_CC_PLATFORM
+	select X86_MCE
 	help
 	  Support running as a guest under Intel TDX.  Without this support,
 	  the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index fd9441f40457..b576d23d37cb 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -643,12 +643,25 @@ SYM_CODE_START(trampoline_32bit_src)
 	movl	$MSR_EFER, %ecx
 	rdmsr
 	btsl	$_EFER_LME, %eax
+	/* Avoid writing EFER if no change was made (for TDX guest) */
+	jc	1f
 	wrmsr
-	popl	%edx
+1:	popl	%edx
 	popl	%ecx
 
 	/* Enable PAE and LA57 (if required) paging modes */
-	movl	$X86_CR4_PAE, %eax
+	movl	%cr4, %eax
+
+#ifdef CONFIG_X86_MCE
+	/*
+	 * Preserve CR4.MCE if the kernel will enable #MC support.  Clearing
+	 * MCE may fault in some environments (that also force #MC support).
+	 * Any machine check that occurs before #MC support is fully configured
+	 * will crash the system regardless of the CR4.MCE value set here.
+	 */
+	andl	$X86_CR4_MCE, %eax
+#endif
+	orl	$X86_CR4_PAE, %eax
 	testl	%edx, %edx
 	jz	1f
 	orl	$X86_CR4_LA57, %eax
@@ -662,8 +675,12 @@ SYM_CODE_START(trampoline_32bit_src)
 	pushl	$__KERNEL_CS
 	pushl	%eax
 
-	/* Enable paging again */
-	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
+	/*
+	 * Enable paging again.  Keep CR0.NE set, FERR# is no longer used
+	 * to handle x87 FPU errors and clearing NE may fault in some
+	 * environments.
+	 */
+	movl	$(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	lret
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 6ff7e81b5628..cc9b2529a086 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
 #define TRAMPOLINE_32BIT_PGTABLE_OFFSET	0
 
 #define TRAMPOLINE_32BIT_CODE_OFFSET	PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE	0x70
+#define TRAMPOLINE_32BIT_CODE_SIZE	0x80
 
 #define TRAMPOLINE_32BIT_STACK_END	TRAMPOLINE_32BIT_SIZE
 
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 9c63fc5988cd..652845cc527e 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -141,7 +141,17 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 1:
 
 	/* Enable PAE mode, PGE and LA57 */
-	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+	movq	%cr4, %rcx
+#ifdef CONFIG_X86_MCE
+	/*
+	 * Preserve CR4.MCE if the kernel will enable #MC support.  Clearing
+	 * MCE may fault in some environments (that also force #MC support).
+	 * Any machine check that occurs before #MC support is fully configured
+	 * will crash the system regardless of the CR4.MCE value set here.
+	 */
+	andl	$X86_CR4_MCE, %ecx
+#endif
+	orl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
 #ifdef CONFIG_X86_5LEVEL
 	testl	$1, __pgtable_l5_enabled(%rip)
 	jz	1f
@@ -246,13 +256,23 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	/* Setup EFER (Extended Feature Enable Register) */
 	movl	$MSR_EFER, %ecx
 	rdmsr
+	/*
+	 * Preserve current value of EFER for comparison and to skip
+	 * EFER writes if no change was made (for TDX guest)
+	 */
+	movl    %eax, %edx
 	btsl	$_EFER_SCE, %eax	/* Enable System Call */
 	btl	$20,%edi		/* No Execute supported? */
 	jnc     1f
 	btsl	$_EFER_NX, %eax
 	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
-1:	wrmsr				/* Make changes effective */
 
+	/* Avoid writing EFER if no change was made (for TDX guest) */
+1:	cmpl	%edx, %eax
+	je	1f
+	xor	%edx, %edx
+	wrmsr				/* Make changes effective */
+1:
 	/* Setup cr0 */
 	movl	$CR0_STATE, %eax
 	/* Make changes effective */
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index ae112a91592f..170f248d5769 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -143,13 +143,28 @@ SYM_CODE_START(startup_32)
 	movl	%eax, %cr3
 
 	# Set up EFER
+	movl	$MSR_EFER, %ecx
+	rdmsr
+	/*
+	 * Skip writing to EFER if the register already has desired
+	 * value (to avoid #VE for the TDX guest).
+	 */
+	cmp	pa_tr_efer, %eax
+	jne	.Lwrite_efer
+	cmp	pa_tr_efer + 4, %edx
+	je	.Ldone_efer
+.Lwrite_efer:
 	movl	pa_tr_efer, %eax
 	movl	pa_tr_efer + 4, %edx
-	movl	$MSR_EFER, %ecx
 	wrmsr
 
-	# Enable paging and in turn activate Long Mode
-	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+.Ldone_efer:
+	/*
+	 * Enable paging and in turn activate Long Mode. Keep CR0.NE set, FERR#
+	 * is no longer used to handle x87 FPU errors and clearing NE may fault
+	 * in some environments.
+	 */
+	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 
 	/*
@@ -169,7 +184,11 @@ SYM_CODE_START(pa_trampoline_compat)
 	movl	$rm_stack_end, %esp
 	movw	$__KERNEL_DS, %dx
 
-	movl	$X86_CR0_PE, %eax
+	/*
+	 * Keep CR0.NE set, FERR# is no longer used to handle x87 FPU errors
+	 * and clearing NE may fault in some environments.
+	 */
+	movl	$(X86_CR0_NE | X86_CR0_PE), %eax
 	movl	%eax, %cr0
 	ljmpl   $__KERNEL32_CS, $pa_startup_32
 SYM_CODE_END(pa_trampoline_compat)
-- 
2.34.1



* [PATCHv2 19/29] x86/topology: Disable CPU online/offline control for TDX guests
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (17 preceding siblings ...)
  2022-01-24 15:02 ` [PATCHv2 18/29] x86/boot: Avoid #VE during boot for TDX platforms Kirill A. Shutemov
@ 2022-01-24 15:02 ` Kirill A. Shutemov
  2022-02-02  0:09   ` Thomas Gleixner
  2022-01-24 15:02 ` [PATCHv2 20/29] x86/tdx: Get page shared bit info from the TDX module Kirill A. Shutemov
                   ` (10 subsequent siblings)
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Unlike regular VMs, TDX guests use the firmware hand-off wakeup method
to wake up the APs during the boot process. This wakeup model uses a
mailbox to communicate with firmware to bring up the APs. As per the
design, this mailbox can only be used once for the given AP, which means
after the APs are booted, the same mailbox cannot be used to
offline/online the given AP. More details about this requirement can be
found in the Intel TDX Virtual Firmware Design Guide, in the sections
titled "AP initialization in OS" and "Hotplug Device".

Since the architecture does not support any method of offlining the
CPUs, disable CPU hotplug support in the kernel.

Since this hotplug disable feature can be re-used by other VM guests,
add a new CC attribute CC_ATTR_HOTPLUG_DISABLED and use it to disable
the hotplug support.

With hotplug disabled, /sys/devices/system/cpu/cpuX/online sysfs option
will not exist for TDX guests.
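
The intended control flow can be sketched in standalone C as follows.
This is an illustration only: cc_attr_sketch, tdx_guest_has() and
cpu_down_check() are hypothetical stand-ins for the kernel's enum
cc_attr, intel_cc_platform_has() and cpu_down_maps_locked(), and the
booleans replace the real platform and hotplug state.

```c
#include <errno.h>
#include <stdbool.h>

enum cc_attr_sketch {
	ATTR_GUEST_UNROLL_STRING_IO,
	ATTR_HOTPLUG_DISABLED,
	ATTR_OTHER,
};

/* Attributes a TDX guest reports as present in this sketch. */
bool tdx_guest_has(enum cc_attr_sketch attr)
{
	switch (attr) {
	case ATTR_GUEST_UNROLL_STRING_IO:
	case ATTR_HOTPLUG_DISABLED:
		return true;
	default:
		return false;
	}
}

/* Reject offlining up front when the platform forbids hotplug. */
int cpu_down_check(bool is_tdx_guest, bool hotplug_busy)
{
	if (is_tdx_guest && tdx_guest_has(ATTR_HOTPLUG_DISABLED))
		return -EOPNOTSUPP;
	if (hotplug_busy)
		return -EBUSY;
	return 0;
}
```

The platform check comes first so that a TDX guest always sees
-EOPNOTSUPP, regardless of any transient hotplug-disabled state.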

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/cc_platform.c |  7 ++++++-
 include/linux/cc_platform.h   | 10 ++++++++++
 kernel/cpu.c                  |  3 +++
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index 8da246ab4339..dcb31d6a7554 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -17,8 +17,13 @@
 
 static bool intel_cc_platform_has(enum cc_attr attr)
 {
-	if (attr == CC_ATTR_GUEST_UNROLL_STRING_IO)
+	switch (attr) {
+	case CC_ATTR_GUEST_UNROLL_STRING_IO:
+	case CC_ATTR_HOTPLUG_DISABLED:
 		return true;
+	default:
+		return false;
+	}
 
 	return false;
 }
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index efd8205282da..691494bbaf5a 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -72,6 +72,16 @@ enum cc_attr {
 	 * Examples include TDX guest & SEV.
 	 */
 	CC_ATTR_GUEST_UNROLL_STRING_IO,
+
+	/**
+	 * @CC_ATTR_HOTPLUG_DISABLED: Hotplug is not supported or disabled.
+	 *
+	 * The platform/OS is running as a guest/virtual machine that does
+	 * not support the CPU hotplug feature.
+	 *
+	 * Examples include TDX Guest.
+	 */
+	CC_ATTR_HOTPLUG_DISABLED,
 };
 
 #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 407a2568f35e..58fd06ebc2c8 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -34,6 +34,7 @@
 #include <linux/scs.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/cpuset.h>
+#include <linux/cc_platform.h>
 
 #include <trace/events/power.h>
 #define CREATE_TRACE_POINTS
@@ -1185,6 +1186,8 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
 
 static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
 {
+	if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED))
+		return -EOPNOTSUPP;
 	if (cpu_hotplug_disabled)
 		return -EBUSY;
 	return _cpu_down(cpu, 0, target);
-- 
2.34.1



* [PATCHv2 20/29] x86/tdx: Get page shared bit info from the TDX module
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (18 preceding siblings ...)
  2022-01-24 15:02 ` [PATCHv2 19/29] x86/topology: Disable CPU online/offline control for TDX guests Kirill A. Shutemov
@ 2022-01-24 15:02 ` Kirill A. Shutemov
  2022-02-02  0:14   ` Thomas Gleixner
  2022-02-07 10:44   ` Borislav Petkov
  2022-01-24 15:02 ` [PATCHv2 21/29] x86/tdx: Exclude shared bit from __PHYSICAL_MASK Kirill A. Shutemov
                   ` (9 subsequent siblings)
  29 siblings, 2 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

Intel TDX doesn't allow the VMM to access guest private memory. Any
memory that is required for communication with the VMM must be shared
explicitly by setting a bit in the page table entry. Which bit in the
page table entry indicates the shared/private state can be determined
by using the TDINFO TDCALL (a call to the TDX module).

Fetch and save the guest TD execution environment information at
initialization time. The next patch will use the information.

More details about the TDINFO TDCALL can be found in
Guest-Host-Communication Interface (GHCI) for Intel Trust Domain
Extensions (Intel TDX) specification, sec titled "TDCALL[TDINFO]".
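
For illustration, decoding the two values this patch keeps from the
TDINFO output could look like the standalone sketch below. The names
td_info_sketch and td_info_decode() are hypothetical; the only facts
taken from the patch are that the GPA width lives in the low six bits of
RCX and that the attributes are returned in RDX.

```c
#include <stdint.h>

struct td_info_sketch {
	unsigned int gpa_width;
	uint64_t attributes;
};

/* GPA width is bits 5:0 of RCX; RDX carries the TD attributes as-is. */
struct td_info_sketch td_info_decode(uint64_t rcx, uint64_t rdx)
{
	struct td_info_sketch info;

	info.gpa_width = (unsigned int)(rcx & 0x3fULL);
	info.attributes = rdx;
	return info;
}
```

Masking with 0x3f matches GENMASK(5, 0) in the patch, so any upper bits
of RCX are ignored.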

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index a4e696f12666..b27c4261bfd2 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -11,6 +11,7 @@
 #include <asm/insn-eval.h>
 
 /* TDX module Call Leaf IDs */
+#define TDX_GET_INFO			1
 #define TDX_GET_VEINFO			3
 
 /* See Exit Qualification for I/O Instructions in VMX documentation */
@@ -19,6 +20,12 @@
 #define VE_GET_PORT_NUM(exit_qual)	((exit_qual) >> 16)
 #define VE_IS_IO_STRING(exit_qual)	((exit_qual) & 16 ? 1 : 0)
 
+/* Guest TD execution environment information */
+static struct {
+	unsigned int gpa_width;
+	unsigned long attributes;
+} td_info __ro_after_init;
+
 static bool tdx_guest_detected __ro_after_init;
 
 /*
@@ -59,6 +66,28 @@ long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
 EXPORT_SYMBOL_GPL(tdx_kvm_hypercall);
 #endif
 
+static void tdx_get_info(void)
+{
+	struct tdx_module_output out;
+	u64 ret;
+
+	/*
+	 * TDINFO TDX module call is used to get the TD execution environment
+	 * information like GPA width, number of available vcpus, debug mode
+	 * information, etc. More details about the ABI can be found in TDX
+	 * Guest-Host-Communication Interface (GHCI), sec 2.4.2 TDCALL
+	 * [TDG.VP.INFO].
+	 */
+	ret = __tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);
+
+	/* Non zero return value indicates buggy TDX module, so panic */
+	if (ret)
+		panic("TDINFO TDCALL failed (Buggy TDX module!)\n");
+
+	td_info.gpa_width = out.rcx & GENMASK(5, 0);
+	td_info.attributes = out.rdx;
+}
+
 static u64 __cpuidle _tdx_halt(const bool irq_disabled, const bool do_sti)
 {
 	/*
@@ -455,5 +484,7 @@ void __init tdx_early_init(void)
 
 	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
+	tdx_get_info();
+
 	pr_info("Guest detected\n");
 }
-- 
2.34.1



* [PATCHv2 21/29] x86/tdx: Exclude shared bit from __PHYSICAL_MASK
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (19 preceding siblings ...)
  2022-01-24 15:02 ` [PATCHv2 20/29] x86/tdx: Get page shared bit info from the TDX module Kirill A. Shutemov
@ 2022-01-24 15:02 ` Kirill A. Shutemov
  2022-02-02  0:18   ` Thomas Gleixner
  2022-01-24 15:02 ` [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap() Kirill A. Shutemov
                   ` (8 subsequent siblings)
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

In TDX guests, by default memory is protected from host access. If a
guest needs to communicate with the VMM (like the I/O use case), it uses
a single bit in the physical address to communicate the protected/shared
attribute of the given page.

In the x86 arch code, the __PHYSICAL_MASK macro represents the width of
the physical address on the given architecture. It is used to create
the physical PAGE_MASK for address bits in the kernel. Since a TDX guest
uses a single bit as metadata, that bit needs to be excluded from the
valid physical address bits to avoid using incorrect address bits in
the kernel.

Enable DYNAMIC_PHYSICAL_MASK to support updating the __PHYSICAL_MASK.
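
To make the arithmetic concrete, here is a small standalone sketch of
the mask adjustment. genmask_ull() is a userspace stand-in for the
kernel's GENMASK_ULL(), and tdx_physical_mask() is a hypothetical helper;
both assume 2 <= gpa_width <= 64.

```c
#include <stdint.h>

/* Bits high..low set, inclusive; requires low <= high <= 63. */
uint64_t genmask_ull(unsigned int high, unsigned int low)
{
	return (~0ULL >> (63 - high)) & (~0ULL << low);
}

/*
 * The shared bit is bit (gpa_width - 1), so valid address bits end at
 * (gpa_width - 2).
 */
uint64_t tdx_physical_mask(uint64_t mask, unsigned int gpa_width)
{
	return mask & genmask_ull(gpa_width - 2, 0);
}
```

With a GPA width of 52, the shared bit is bit 51 and the resulting mask
covers bits 50:0.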

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig      | 1 +
 arch/x86/kernel/tdx.c | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1c59e02792e4..680c3cad9422 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -886,6 +886,7 @@ config INTEL_TDX_GUEST
 	depends on X86_X2APIC
 	select ARCH_HAS_CC_PLATFORM
 	select X86_MCE
+	select DYNAMIC_PHYSICAL_MASK
 	help
 	  Support running as a guest under Intel TDX.  Without this support,
 	  the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index b27c4261bfd2..beeaf61934bc 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -486,5 +486,13 @@ void __init tdx_early_init(void)
 
 	tdx_get_info();
 
+	/*
+	 * All bits above GPA width are reserved and kernel treats shared bit
+	 * as flag, not as part of physical address.
+	 *
+	 * Adjust physical mask to only cover valid GPA bits.
+	 */
+	physical_mask &= GENMASK_ULL(td_info.gpa_width - 2, 0);
+
 	pr_info("Guest detected\n");
 }
-- 
2.34.1



* [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (20 preceding siblings ...)
  2022-01-24 15:02 ` [PATCHv2 21/29] x86/tdx: Exclude shared bit from __PHYSICAL_MASK Kirill A. Shutemov
@ 2022-01-24 15:02 ` Kirill A. Shutemov
  2022-02-02  0:25   ` Thomas Gleixner
  2022-02-07 16:27   ` Borislav Petkov
  2022-01-24 15:02 ` [PATCHv2 23/29] x86/tdx: Add helper to convert memory between shared and private Kirill A. Shutemov
                   ` (7 subsequent siblings)
  29 siblings, 2 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

In TDX guests, guest memory is protected from host access. If a guest
performs I/O, it needs to explicitly share the I/O memory with the host.

Map all ioremap()ed pages that are not backed by normal memory
(IORES_DESC_NONE or IORES_DESC_RESERVED) as shared.

Since TDX memory encryption support is similar to AMD SEV architecture,
reuse the infrastructure from AMD SEV code.

Add tdx_shared_mask() interface to get the TDX guest shared bitmask.

pgprot_decrypted() is used by drivers (i915, virtio_gpu, vfio). Export
both pgprot_encrypted() and pgprot_decrypted().
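
The TDX branch of the two helpers boils down to setting or clearing one
bit in the protection value. The sketch below shows that in isolation,
using plain 64-bit integers in place of pgprot_t. The function names are
hypothetical and the SME/SEV branch is left out.

```c
#include <stdint.h>

/* The shared bit is the highest GPA bit; assumes 1 <= gpa_width <= 64. */
uint64_t shared_mask_sketch(unsigned int gpa_width)
{
	return 1ULL << (gpa_width - 1);
}

/* "Encrypted" on TDX means private: clear the shared bit. */
uint64_t prot_private(uint64_t prot, uint64_t shared_mask)
{
	return prot & ~shared_mask;
}

/* "Decrypted" on TDX means shared with the VMM: set the shared bit. */
uint64_t prot_shared(uint64_t prot, uint64_t shared_mask)
{
	return prot | shared_mask;
}
```

Note the inversion relative to SME, where the encryption mask is set for
private memory; on TDX the bit is set for shared memory.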

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h | 19 +++++++++++++------
 arch/x86/include/asm/tdx.h     |  4 ++++
 arch/x86/kernel/cc_platform.c  | 23 +++++++++++++++++++++++
 arch/x86/kernel/tdx.c          |  9 +++++++++
 arch/x86/mm/ioremap.c          |  5 +++++
 5 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 8a9432fb3802..40e22db48319 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -15,12 +15,6 @@
 		     cachemode2protval(_PAGE_CACHE_MODE_UC_MINUS)))	\
 	 : (prot))
 
-/*
- * Macros to add or remove encryption attribute
- */
-#define pgprot_encrypted(prot)	__pgprot(__sme_set(pgprot_val(prot)))
-#define pgprot_decrypted(prot)	__pgprot(__sme_clr(pgprot_val(prot)))
-
 #ifndef __ASSEMBLY__
 #include <linux/spinlock.h>
 #include <asm/x86_init.h>
@@ -38,6 +32,19 @@ void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
 void ptdump_walk_pgd_level_checkwx(void);
 void ptdump_walk_user_pgd_level_checkwx(void);
 
+/*
+ * Macros to add or remove encryption attribute
+ */
+#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
+pgprot_t pgprot_encrypted(pgprot_t prot);
+pgprot_t pgprot_decrypted(pgprot_t prot);
+#define pgprot_encrypted(prot)	pgprot_encrypted(prot)
+#define pgprot_decrypted(prot)	pgprot_decrypted(prot)
+#else
+#define pgprot_encrypted(prot) (prot)
+#define pgprot_decrypted(prot) (prot)
+#endif
+
 #ifdef CONFIG_DEBUG_WX
 #define debug_checkwx()		ptdump_walk_pgd_level_checkwx()
 #define debug_checkwx_user()	ptdump_walk_user_pgd_level_checkwx()
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 4bcaadf21dc6..c6a279e67dff 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -55,6 +55,8 @@ void tdx_safe_halt(void);
 
 bool tdx_early_handle_ve(struct pt_regs *regs);
 
+phys_addr_t tdx_shared_mask(void);
+
 #else
 
 static inline void tdx_early_init(void) { };
@@ -63,6 +65,8 @@ static inline void tdx_safe_halt(void) { };
 
 static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
 
+static inline phys_addr_t tdx_shared_mask(void) { return 0; }
+
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
 #if defined(CONFIG_KVM_GUEST) && defined(CONFIG_INTEL_TDX_GUEST)
diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index dcb31d6a7554..be8722ad4792 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -12,6 +12,7 @@
 #include <linux/mem_encrypt.h>
 
 #include <asm/mshyperv.h>
+#include <asm/pgtable.h>
 #include <asm/processor.h>
 #include <asm/tdx.h>
 
@@ -90,3 +91,25 @@ bool cc_platform_has(enum cc_attr attr)
 	return false;
 }
 EXPORT_SYMBOL_GPL(cc_platform_has);
+
+pgprot_t pgprot_encrypted(pgprot_t prot)
+{
+	if (sme_me_mask)
+		return __pgprot(__sme_set(pgprot_val(prot)));
+	else if (is_tdx_guest())
+		return __pgprot(pgprot_val(prot) & ~tdx_shared_mask());
+
+	return prot;
+}
+EXPORT_SYMBOL_GPL(pgprot_encrypted);
+
+pgprot_t pgprot_decrypted(pgprot_t prot)
+{
+	if (sme_me_mask)
+		return __pgprot(__sme_clr(pgprot_val(prot)));
+	else if (is_tdx_guest())
+		return __pgprot(pgprot_val(prot) | tdx_shared_mask());
+
+	return prot;
+}
+EXPORT_SYMBOL_GPL(pgprot_decrypted);
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index beeaf61934bc..3bf6621eae7d 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -66,6 +66,15 @@ long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
 EXPORT_SYMBOL_GPL(tdx_kvm_hypercall);
 #endif
 
+/*
+ * The highest bit of a guest physical address is the "sharing" bit.
+ * Set it for shared pages and clear it for private pages.
+ */
+phys_addr_t tdx_shared_mask(void)
+{
+	return BIT_ULL(td_info.gpa_width - 1);
+}
+
 static void tdx_get_info(void)
 {
 	struct tdx_module_output out;
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 026031b3b782..a5d4ec1afca2 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -242,10 +242,15 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
 	 * If the page being mapped is in memory and SEV is active then
 	 * make sure the memory encryption attribute is enabled in the
 	 * resulting mapping.
+	 * In TDX guests, memory is marked private by default. If encryption
+	 * is not requested (using encrypted), explicitly set decrypt
+	 * attribute in all IOREMAPPED memory.
 	 */
 	prot = PAGE_KERNEL_IO;
 	if ((io_desc.flags & IORES_MAP_ENCRYPTED) || encrypted)
 		prot = pgprot_encrypted(prot);
+	else
+		prot = pgprot_decrypted(prot);
 
 	switch (pcm) {
 	case _PAGE_CACHE_MODE_UC:
-- 
2.34.1



* [PATCHv2 23/29] x86/tdx: Add helper to convert memory between shared and private
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (21 preceding siblings ...)
  2022-01-24 15:02 ` [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap() Kirill A. Shutemov
@ 2022-01-24 15:02 ` Kirill A. Shutemov
  2022-02-02  0:35   ` Thomas Gleixner
  2022-02-08 12:12   ` Borislav Petkov
  2022-01-24 15:02 ` [PATCHv2 24/29] x86/mm/cpa: Add support for TDX shared memory Kirill A. Shutemov
                   ` (6 subsequent siblings)
  29 siblings, 2 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

Intel TDX protects guest memory from VMM access. Any memory that is
required for communication with the VMM must be explicitly shared.

It is a two-step process: the guest sets the shared bit in the page
table entry and notifies the VMM about the change. The notification
happens using the MapGPA hypercall.

Conversion back to private memory requires clearing the shared bit,
notifying the VMM with the MapGPA hypercall, and then accepting the
memory with the AcceptPage TDX module call.

Provide a helper to do conversion between shared and private memory.
It is going to be used by the following patch.
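
For illustration only, the accept step that follows a shared->private
MapGPA request can be modeled in user space. This is a sketch under
stated assumptions, not the kernel code: fake_accept(), struct
accept_stats and the allow_2m switch are hypothetical stand-ins for the
TDX module call, so that the "try 2M first, fall back to 4K" walk can be
exercised outside a TD.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_4K	0x1000ULL
#define SZ_2M	0x200000ULL

/* Hypothetical bookkeeping: how many accepts of each size were issued. */
struct accept_stats {
	unsigned int nr_4k;
	unsigned int nr_2m;
};

/*
 * Hypothetical stand-in for the accept-page module call.
 * Returns true on success. 2M accepts fail when allow_2m is false,
 * which lets the caller's 4K fallback path be tested.
 */
static bool fake_accept(struct accept_stats *st, uint64_t gpa, uint64_t size,
			bool allow_2m)
{
	/* The address must be aligned to the page size being accepted. */
	assert((gpa & (size - 1)) == 0);

	if (size == SZ_2M) {
		if (!allow_2m)
			return false;
		st->nr_2m++;
	} else {
		st->nr_4k++;
	}
	return true;
}

/* Accept every page in [start, end); returns 0 on success, -1 on error. */
static int accept_range(struct accept_stats *st, uint64_t start, uint64_t end,
			bool allow_2m)
{
	if (end <= start)
		return -1;

	while (start < end) {
		/* Try a 2M accept when aligned and enough range remains. */
		if (!(start & (SZ_2M - 1)) && end - start >= SZ_2M &&
		    fake_accept(st, start, SZ_2M, allow_2m)) {
			start += SZ_2M;
			continue;
		}

		/* Otherwise, or if the 2M accept failed, accept one 4K page. */
		if (!fake_accept(st, start, SZ_4K, allow_2m))
			return -1;
		start += SZ_4K;
	}

	return 0;
}
```

In this model, a range that starts one 4K page below a 2M boundary and
ends one 4K page above the next boundary is covered by one 4K accept,
one 2M accept and one more 4K accept.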

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/tdx.h |  9 +++++
 arch/x86/kernel/tdx.c      | 78 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index c6a279e67dff..f6a5fb4bf72c 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -57,6 +57,8 @@ bool tdx_early_handle_ve(struct pt_regs *regs);
 
 phys_addr_t tdx_shared_mask(void);
 
+int tdx_hcall_request_gpa_type(phys_addr_t start, phys_addr_t end, bool enc);
+
 #else
 
 static inline void tdx_early_init(void) { };
@@ -67,6 +69,13 @@ static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
 
 static inline phys_addr_t tdx_shared_mask(void) { return 0; }
 
+
+static inline int tdx_hcall_request_gpa_type(phys_addr_t start,
+					     phys_addr_t end, bool enc)
+{
+	return -ENODEV;
+}
+
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
 #if defined(CONFIG_KVM_GUEST) && defined(CONFIG_INTEL_TDX_GUEST)
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 3bf6621eae7d..ea638c6ecb92 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -13,6 +13,10 @@
 /* TDX module Call Leaf IDs */
 #define TDX_GET_INFO			1
 #define TDX_GET_VEINFO			3
+#define TDX_ACCEPT_PAGE			6
+
+/* TDX hypercall Leaf IDs */
+#define TDVMCALL_MAP_GPA		0x10001
 
 /* See Exit Qualification for I/O Instructions in VMX documentation */
 #define VE_IS_IO_IN(exit_qual)		(((exit_qual) & 8) ? 1 : 0)
@@ -97,6 +101,80 @@ static void tdx_get_info(void)
 	td_info.attributes = out.rdx;
 }
 
+static bool tdx_accept_page(phys_addr_t gpa, enum pg_level pg_level)
+{
+	/*
+	 * Pass the page physical address to the TDX module to accept the
+	 * pending, private page.
+	 *
+	 * Bits 2:0 of GPA encode the page size: 0 - 4K, 1 - 2M, 2 - 1G.
+	 */
+	switch (pg_level) {
+	case PG_LEVEL_4K:
+		break;
+	case PG_LEVEL_2M:
+		gpa |= 1;
+		break;
+	case PG_LEVEL_1G:
+		gpa |= 2;
+		break;
+	default:
+		return true;
+	}
+
+	return __tdx_module_call(TDX_ACCEPT_PAGE, gpa, 0, 0, 0, NULL);
+}
+
+/*
+ * Inform the VMM of the guest's intent for this physical page: shared with
+ * the VMM or private to the guest.  The VMM is expected to change its mapping
+ * of the page in response.
+ */
+int tdx_hcall_request_gpa_type(phys_addr_t start, phys_addr_t end, bool enc)
+{
+	u64 ret;
+
+	if (end <= start)
+		return -EINVAL;
+
+	if (!enc) {
+		start |= tdx_shared_mask();
+		end |= tdx_shared_mask();
+	}
+
+	/*
+	 * Notify the VMM about page mapping conversion. More info about ABI
+	 * can be found in TDX Guest-Host-Communication Interface (GHCI),
+	 * sec "TDG.VP.VMCALL<MapGPA>"
+	 */
+	ret = _tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0, NULL);
+
+	if (ret)
+		ret = -EIO;
+
+	if (ret || !enc)
+		return ret;
+
+	/*
+	 * For shared->private conversion, accept the page using
+	 * TDX_ACCEPT_PAGE TDX module call.
+	 */
+	while (start < end) {
+		/* Try 2M page accept first if possible */
+		if (!(start & ~PMD_MASK) && end - start >= PMD_SIZE &&
+		    !tdx_accept_page(start, PG_LEVEL_2M)) {
+			start += PMD_SIZE;
+			continue;
+		}
+
+		if (tdx_accept_page(start, PG_LEVEL_4K))
+			return -EIO;
+		start += PAGE_SIZE;
+	}
+
+	return 0;
+}
+
 static u64 __cpuidle _tdx_halt(const bool irq_disabled, const bool do_sti)
 {
 	/*
-- 
2.34.1



* [PATCHv2 24/29] x86/mm/cpa: Add support for TDX shared memory
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (22 preceding siblings ...)
  2022-01-24 15:02 ` [PATCHv2 23/29] x86/tdx: Add helper to convert memory between shared and private Kirill A. Shutemov
@ 2022-01-24 15:02 ` Kirill A. Shutemov
  2022-02-02  1:27   ` Thomas Gleixner
  2022-01-24 15:02 ` [PATCHv2 25/29] x86/kvm: Use bounce buffers for TD guest Kirill A. Shutemov
                   ` (5 subsequent siblings)
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov, Sean Christopherson, Kai Huang

TDX steals a bit from the physical address and uses it to indicate
whether the page is private to the guest (bit cleared) or unprotected
and shared with the VMM (bit set).

AMD SEV uses a similar scheme, repurposing a bit from the physical address
to indicate encrypted or decrypted pages.
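
The bit manipulation can be sketched in plain C as follows. This is only
an illustration that assumes the shared bit is the top bit of the guest
physical address width, as tdx_shared_mask() computes it earlier in the
series; shared_mask(), gpa_set_type() and gpa_is_shared() are
hypothetical names, and the width is passed in as a parameter instead of
being read from the TDX module.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask with only the top bit of a gpa_width-bit address set. */
static uint64_t shared_mask(unsigned int gpa_width)
{
	assert(gpa_width >= 2 && gpa_width <= 64);
	return 1ULL << (gpa_width - 1);
}

/* Clear the shared bit for a private page, set it for a shared one. */
static uint64_t gpa_set_type(uint64_t gpa, unsigned int gpa_width,
			     bool is_private)
{
	uint64_t mask = shared_mask(gpa_width);

	return is_private ? (gpa & ~mask) : (gpa | mask);
}

/* Report whether the shared bit is set in the given address. */
static bool gpa_is_shared(uint64_t gpa, unsigned int gpa_width)
{
	return (gpa & shared_mask(gpa_width)) != 0;
}
```

Converting an address to shared and back to private with the same width
returns the original value, since only the single mask bit is touched.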

The kernel already has the infrastructure to deal with encrypted/decrypted
pages for AMD SEV. Modify __set_memory_enc_pgtable() to make it aware
of TDX.

After modifying page table entries, the kernel needs to notify the VMM
about the change with tdx_hcall_request_gpa_type().

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Tested-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                   |  2 +-
 arch/x86/include/asm/mem_encrypt.h |  8 ++++++
 arch/x86/include/asm/set_memory.h  |  1 -
 arch/x86/kernel/cc_platform.c      |  2 ++
 arch/x86/mm/mem_encrypt_amd.c      | 10 ++++---
 arch/x86/mm/pat/set_memory.c       | 44 ++++++++++++++++++++++++++----
 include/linux/cc_platform.h        |  9 ++++++
 7 files changed, 64 insertions(+), 12 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 680c3cad9422..33e6ec6fd89f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -886,7 +886,7 @@ config INTEL_TDX_GUEST
 	depends on X86_X2APIC
 	select ARCH_HAS_CC_PLATFORM
 	select X86_MCE
-	select DYNAMIC_PHYSICAL_MASK
+	select X86_MEM_ENCRYPT
 	help
 	  Support running as a guest under Intel TDX.  Without this support,
 	  the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index e2c6f433ed10..f45a9ea2dec9 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -52,6 +52,8 @@ void __init mem_encrypt_free_decrypted_mem(void);
 /* Architecture __weak replacement functions */
 void __init mem_encrypt_init(void);
 
+int amd_notify_range_enc_status_changed(unsigned long vaddr, int npages, bool enc);
+
 void __init sev_es_init_vc_handling(void);
 
 #define __bss_decrypted __section(".bss..decrypted")
@@ -85,6 +87,12 @@ early_set_mem_enc_dec_hypercall(unsigned long vaddr, int npages, bool enc) {}
 
 static inline void mem_encrypt_free_decrypted_mem(void) { }
 
+static inline int amd_notify_range_enc_status_changed(unsigned long vaddr,
+						      int npages, bool enc)
+{
+	return 0;
+}
+
 #define __bss_decrypted
 
 #endif	/* CONFIG_AMD_MEM_ENCRYPT */
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index ff0f2d90338a..ce8dd215f5b3 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -84,7 +84,6 @@ int set_pages_rw(struct page *page, int numpages);
 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
 bool kernel_page_present(struct page *page);
-void notify_range_enc_status_changed(unsigned long vaddr, int npages, bool enc);
 
 extern int kernel_set_to_readonly;
 
diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index be8722ad4792..1fbcf19fa20d 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -21,6 +21,8 @@ static bool intel_cc_platform_has(enum cc_attr attr)
 	switch (attr) {
 	case CC_ATTR_GUEST_UNROLL_STRING_IO:
 	case CC_ATTR_HOTPLUG_DISABLED:
+	case CC_ATTR_GUEST_TDX:
+	case CC_ATTR_GUEST_MEM_ENCRYPT:
 		return true;
 	default:
 		return false;
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index 2b2d018ea345..6aa4e0c27368 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -256,7 +256,8 @@ static unsigned long pg_level_to_pfn(int level, pte_t *kpte, pgprot_t *ret_prot)
 	return pfn;
 }
 
-void notify_range_enc_status_changed(unsigned long vaddr, int npages, bool enc)
+int amd_notify_range_enc_status_changed(unsigned long vaddr, int npages,
+					 bool enc)
 {
 #ifdef CONFIG_PARAVIRT
 	unsigned long sz = npages << PAGE_SHIFT;
@@ -270,7 +271,7 @@ void notify_range_enc_status_changed(unsigned long vaddr, int npages, bool enc)
 		kpte = lookup_address(vaddr, &level);
 		if (!kpte || pte_none(*kpte)) {
 			WARN_ONCE(1, "kpte lookup for vaddr\n");
-			return;
+			return 0;
 		}
 
 		pfn = pg_level_to_pfn(level, kpte, NULL);
@@ -285,6 +286,7 @@ void notify_range_enc_status_changed(unsigned long vaddr, int npages, bool enc)
 		vaddr = (vaddr & pmask) + psize;
 	}
 #endif
+	return 0;
 }
 
 static void __init __set_clr_pte_enc(pte_t *kpte, int level, bool enc)
@@ -392,7 +394,7 @@ static int __init early_set_memory_enc_dec(unsigned long vaddr,
 
 	ret = 0;
 
-	notify_range_enc_status_changed(start, PAGE_ALIGN(size) >> PAGE_SHIFT, enc);
+	amd_notify_range_enc_status_changed(start, PAGE_ALIGN(size) >> PAGE_SHIFT, enc);
 out:
 	__flush_tlb_all();
 	return ret;
@@ -410,7 +412,7 @@ int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size)
 
 void __init early_set_mem_enc_dec_hypercall(unsigned long vaddr, int npages, bool enc)
 {
-	notify_range_enc_status_changed(vaddr, npages, enc);
+	amd_notify_range_enc_status_changed(vaddr, npages, enc);
 }
 
 void __init mem_encrypt_free_decrypted_mem(void)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index b4072115c8ef..06c65689d6fb 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -32,6 +32,7 @@
 #include <asm/set_memory.h>
 #include <asm/hyperv-tlfs.h>
 #include <asm/mshyperv.h>
+#include <asm/tdx.h>
 
 #include "../mm_internal.h"
 
@@ -1983,6 +1984,27 @@ int set_memory_global(unsigned long addr, int numpages)
 				    __pgprot(_PAGE_GLOBAL), 0);
 }
 
+static pgprot_t pgprot_cc_mask(bool enc)
+{
+	if (enc)
+		return pgprot_encrypted(__pgprot(0));
+	else
+		return pgprot_decrypted(__pgprot(0));
+}
+
+static int notify_range_enc_status_changed(unsigned long vaddr, int npages,
+					   bool enc)
+{
+	if (cc_platform_has(CC_ATTR_GUEST_TDX)) {
+		phys_addr_t start = __pa(vaddr);
+		phys_addr_t end = __pa(vaddr + npages * PAGE_SIZE);
+
+		return tdx_hcall_request_gpa_type(start, end, enc);
+	} else {
+		return amd_notify_range_enc_status_changed(vaddr, npages, enc);
+	}
+}
+
 /*
  * __set_memory_enc_pgtable() is used for the hypervisors that get
  * informed about "encryption" status via page tables.
@@ -1999,8 +2021,10 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 	memset(&cpa, 0, sizeof(cpa));
 	cpa.vaddr = &addr;
 	cpa.numpages = numpages;
-	cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
-	cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
+
+	cpa.mask_set = pgprot_cc_mask(enc);
+	cpa.mask_clr = pgprot_cc_mask(!enc);
+
 	cpa.pgd = init_mm.pgd;
 
 	/* Must avoid aliasing mappings in the highmem code */
@@ -2008,9 +2032,17 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 	vm_unmap_aliases();
 
 	/*
-	 * Before changing the encryption attribute, we need to flush caches.
+	 * Before changing the encryption attribute, flush caches.
+	 *
+	 * For TDX, guest is responsible for flushing caches on private->shared
+	 * transition. VMM is responsible for flushing on shared->private.
 	 */
-	cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+	if (cc_platform_has(CC_ATTR_GUEST_TDX)) {
+		if (!enc)
+			cpa_flush(&cpa, 1);
+	} else {
+		cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+	}
 
 	ret = __change_page_attr_set_clr(&cpa, 1);
 
@@ -2027,8 +2059,8 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 	 * Notify hypervisor that a given memory range is mapped encrypted
 	 * or decrypted.
 	 */
-	notify_range_enc_status_changed(addr, numpages, enc);
-
+	if (!ret)
+		ret = notify_range_enc_status_changed(addr, numpages, enc);
 	return ret;
 }
 
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index 691494bbaf5a..16c0ad925bf0 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -82,6 +82,15 @@ enum cc_attr {
 	 * Examples include TDX Guest.
 	 */
 	CC_ATTR_HOTPLUG_DISABLED,
+
+	/**
+	 * @CC_ATTR_GUEST_TDX: Trust Domain Extensions Support
+	 *
+	 * The platform/OS is running as a TDX guest/virtual machine.
+	 *
+	 * Examples include Intel TDX.
+	 */
+	CC_ATTR_GUEST_TDX = 0x100,
 };
 
 #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
-- 
2.34.1



* [PATCHv2 25/29] x86/kvm: Use bounce buffers for TD guest
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (23 preceding siblings ...)
  2022-01-24 15:02 ` [PATCHv2 24/29] x86/mm/cpa: Add support for TDX shared memory Kirill A. Shutemov
@ 2022-01-24 15:02 ` Kirill A. Shutemov
  2022-01-24 15:02 ` [PATCHv2 26/29] x86/tdx: ioapic: Add shared bit for IOAPIC base address Kirill A. Shutemov
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

Intel TDX doesn't allow the VMM to directly access guest private memory.
Any memory that is required for communication with the VMM must be
shared explicitly. The same rule applies to any DMA to and from the
TDX guest. All DMA pages have to be marked as shared pages. A generic way
to achieve this without any changes to device drivers is to use the
SWIOTLB framework.

Force SWIOTLB in TD guests and make the SWIOTLB buffer shared by
generalizing mem_encrypt_init() to cover TDX.
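
The bounce-buffer idea itself can be shown with a small user-space
model. This is a conceptual sketch only and does not use the kernel's
SWIOTLB API: shared_slot, bounce_map_to_device() and
bounce_unmap_from_device() are hypothetical names, and the static array
merely stands in for memory that the guest has converted to shared.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_SLOT_SIZE 2048

/* Hypothetical stand-in for a buffer that is shared with the VMM. */
static unsigned char shared_slot[BOUNCE_SLOT_SIZE];

/*
 * Outbound path: copy data from private memory into the shared slot and
 * hand the device the shared address instead of the private one.
 */
static void *bounce_map_to_device(const void *priv, size_t len)
{
	assert(len <= BOUNCE_SLOT_SIZE);
	memcpy(shared_slot, priv, len);
	return shared_slot;
}

/*
 * Inbound path: after the device has written to the shared slot, copy
 * the result back into private memory.
 */
static void bounce_unmap_from_device(void *priv, size_t len)
{
	assert(len <= BOUNCE_SLOT_SIZE);
	memcpy(priv, shared_slot, len);
}
```

The driver-visible buffers stay private throughout; only the copy in the
shared slot is ever exposed, which is why no driver changes are needed.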

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/cc_platform.c | 1 +
 arch/x86/kernel/tdx.c         | 3 +++
 arch/x86/mm/mem_encrypt.c     | 9 ++++++++-
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index 1fbcf19fa20d..62c89c077cdd 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -23,6 +23,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
 	case CC_ATTR_HOTPLUG_DISABLED:
 	case CC_ATTR_GUEST_TDX:
 	case CC_ATTR_GUEST_MEM_ENCRYPT:
+	case CC_ATTR_MEM_ENCRYPT:
 		return true;
 	default:
 		return false;
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index ea638c6ecb92..6048887ac846 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -5,6 +5,7 @@
 #define pr_fmt(fmt)     "tdx: " fmt
 
 #include <linux/cpufeature.h>
+#include <linux/swiotlb.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
 #include <asm/insn.h>
@@ -581,5 +582,7 @@ void __init tdx_early_init(void)
 	 */
 	physical_mask &= GENMASK_ULL(td_info.gpa_width - 2, 0);
 
+	swiotlb_force = SWIOTLB_FORCE;
+
 	pr_info("Guest detected\n");
 }
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 50d209939c66..194ace3a748a 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -42,7 +42,14 @@ bool force_dma_unencrypted(struct device *dev)
 
 static void print_mem_encrypt_feature_info(void)
 {
-	pr_info("AMD Memory Encryption Features active:");
+	pr_info("Memory Encryption Features active:");
+
+	if (cc_platform_has(CC_ATTR_GUEST_TDX)) {
+		pr_cont(" Intel TDX\n");
+		return;
+	}
+
+	pr_cont("AMD ");
 
 	/* Secure Memory Encryption */
 	if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)) {
-- 
2.34.1



* [PATCHv2 26/29] x86/tdx: ioapic: Add shared bit for IOAPIC base address
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (24 preceding siblings ...)
  2022-01-24 15:02 ` [PATCHv2 25/29] x86/kvm: Use bounce buffers for TD guest Kirill A. Shutemov
@ 2022-01-24 15:02 ` Kirill A. Shutemov
  2022-02-02  1:33   ` Thomas Gleixner
  2022-01-24 15:02 ` [PATCHv2 27/29] ACPICA: Avoid cache flush on TDX guest Kirill A. Shutemov
                   ` (3 subsequent siblings)
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Isaku Yamahata, Kirill A . Shutemov

From: Isaku Yamahata <isaku.yamahata@intel.com>

The kernel interacts with each bare-metal IOAPIC with a special
MMIO page. When running under KVM, the guest's IOAPICs are
emulated by KVM.

When running as a TDX guest, the guest needs to mark each IOAPIC
mapping as "shared" with the host.  This ensures that TDX private
protections are not applied to the page, which allows the TDX host
emulation to work.

Earlier patches in this series modified ioremap() so that
ioremap()-created mappings such as virtio will be marked as
shared. However, the IOAPIC code does not use ioremap() and instead
uses the fixmap mechanism.

Introduce a special fixmap helper just for the IOAPIC code.  Ensure
that it marks IOAPIC pages as "shared".  This replaces
set_fixmap_nocache() with __set_fixmap() since __set_fixmap()
allows custom 'prot' values.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/apic/io_apic.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index c1bb384935b0..d2fef5893e41 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -49,6 +49,7 @@
 #include <linux/slab.h>
 #include <linux/memblock.h>
 #include <linux/msi.h>
+#include <linux/cc_platform.h>
 
 #include <asm/irqdomain.h>
 #include <asm/io.h>
@@ -65,6 +66,7 @@
 #include <asm/irq_remapping.h>
 #include <asm/hw_irq.h>
 #include <asm/apic.h>
+#include <asm/pgtable.h>
 
 #define	for_each_ioapic(idx)		\
 	for ((idx) = 0; (idx) < nr_ioapics; (idx)++)
@@ -2677,6 +2679,18 @@ static struct resource * __init ioapic_setup_resources(void)
 	return res;
 }
 
+static void io_apic_set_fixmap_nocache(enum fixed_addresses idx,
+				       phys_addr_t phys)
+{
+	pgprot_t flags = FIXMAP_PAGE_NOCACHE;
+
+	/* Set TDX guest shared bit in pgprot flags */
+	if (cc_platform_has(CC_ATTR_GUEST_TDX))
+		flags = pgprot_decrypted(flags);
+
+	__set_fixmap(idx, phys, flags);
+}
+
 void __init io_apic_init_mappings(void)
 {
 	unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
@@ -2709,7 +2723,7 @@ void __init io_apic_init_mappings(void)
 				      __func__, PAGE_SIZE, PAGE_SIZE);
 			ioapic_phys = __pa(ioapic_phys);
 		}
-		set_fixmap_nocache(idx, ioapic_phys);
+		io_apic_set_fixmap_nocache(idx, ioapic_phys);
 		apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
 			__fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
 			ioapic_phys);
@@ -2838,7 +2852,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base,
 	ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
 	ioapics[idx].mp_config.apicaddr = address;
 
-	set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
+	io_apic_set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
 	if (bad_ioapic_register(idx)) {
 		clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
 		return -ENODEV;
-- 
2.34.1



* [PATCHv2 27/29] ACPICA: Avoid cache flush on TDX guest
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (25 preceding siblings ...)
  2022-01-24 15:02 ` [PATCHv2 26/29] x86/tdx: ioapic: Add shared bit for IOAPIC base address Kirill A. Shutemov
@ 2022-01-24 15:02 ` Kirill A. Shutemov
  2022-01-24 15:02 ` [PATCHv2 28/29] x86/tdx: Warn about unexpected WBINVD Kirill A. Shutemov
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

ACPI_FLUSH_CPU_CACHE() flushes caches on entering sleep states. It is
required to prevent data loss.

While running inside a TDX guest, the kernel can bypass cache flushing.
Changing sleep state in a virtual machine doesn't affect the host system
sleep state and cannot lead to data loss.

The approach can be generalized to all guest kernels, but, to be
cautious, let's limit it to TDX for now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/acenv.h | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/acenv.h b/arch/x86/include/asm/acenv.h
index 9aff97f0de7f..d19deca6dd27 100644
--- a/arch/x86/include/asm/acenv.h
+++ b/arch/x86/include/asm/acenv.h
@@ -13,7 +13,21 @@
 
 /* Asm macros */
 
-#define ACPI_FLUSH_CPU_CACHE()	wbinvd()
+/*
+ * ACPI_FLUSH_CPU_CACHE() flushes caches on entering sleep states.
+ * It is required to prevent data loss.
+ *
+ * While running inside a TDX guest, the kernel can bypass cache flushing.
+ * Changing sleep state in a virtual machine doesn't affect the host system
+ * sleep state and cannot lead to data loss.
+ *
+ * TODO: Is it safe to generalize this from TDX guests to all guest kernels?
+ */
+#define ACPI_FLUSH_CPU_CACHE()					\
+do {								\
+	if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))	\
+		wbinvd();					\
+} while (0)
 
 int __acpi_acquire_global_lock(unsigned int *lock);
 int __acpi_release_global_lock(unsigned int *lock);
-- 
2.34.1



* [PATCHv2 28/29] x86/tdx: Warn about unexpected WBINVD
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (26 preceding siblings ...)
  2022-01-24 15:02 ` [PATCHv2 27/29] ACPICA: Avoid cache flush on TDX guest Kirill A. Shutemov
@ 2022-01-24 15:02 ` Kirill A. Shutemov
  2022-02-02  1:46   ` Thomas Gleixner
  2022-01-24 15:02 ` [PATCHv2 29/29] Documentation/x86: Document TDX kernel architecture Kirill A. Shutemov
  2022-02-09 10:56 ` [PATCHv2 00/29] TDX Guest: TDX core support Kai Huang
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

WBINVD causes #VE in TDX guests. There's no reliable way to emulate it.
The kernel can ask for VMM assistance, but VMM is untrusted and can ignore
the request.

Fortunately, there is no use case for WBINVD inside TDX guests.

Warn about any unexpected WBINVD.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 6048887ac846..22c785c2059c 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -530,6 +530,10 @@ static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 	case EXIT_REASON_IO_INSTRUCTION:
 		ret = tdx_handle_io(regs, ve->exit_qual);
 		break;
+	case EXIT_REASON_WBINVD:
+		WARN_ONCE(1, "Unexpected WBINVD\n");
+		ret = true;
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		break;
-- 
2.34.1



* [PATCHv2 29/29] Documentation/x86: Document TDX kernel architecture
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (27 preceding siblings ...)
  2022-01-24 15:02 ` [PATCHv2 28/29] x86/tdx: Warn about unexpected WBINVD Kirill A. Shutemov
@ 2022-01-24 15:02 ` Kirill A. Shutemov
  2022-02-24  9:08   ` Xiaoyao Li
  2022-02-09 10:56 ` [PATCHv2 00/29] TDX Guest: TDX core support Kai Huang
  29 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 15:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Document the TDX guest architecture details like #VE support,
shared memory, etc.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/x86/index.rst |   1 +
 Documentation/x86/tdx.rst   | 194 ++++++++++++++++++++++++++++++++++++
 2 files changed, 195 insertions(+)
 create mode 100644 Documentation/x86/tdx.rst

diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index f498f1d36cd3..382e53ca850a 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -24,6 +24,7 @@ x86-specific Documentation
    intel-iommu
    intel_txt
    amd-memory-encryption
+   tdx
    pti
    mds
    microcode
diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
new file mode 100644
index 000000000000..903c9cecccbd
--- /dev/null
+++ b/Documentation/x86/tdx.rst
@@ -0,0 +1,194 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+Intel Trust Domain Extensions (TDX)
+=====================================
+
+Intel's Trust Domain Extensions (TDX) protect confidential guest VMs
+from the host and physical attacks by isolating the guest register
+state and by encrypting the guest memory. In TDX, a special TDX module
+sits between the host and the guest, and runs in a special mode and
+manages the guest/host separation.
+
+Since the host cannot directly access guest registers or memory, much
+of the normal functionality of a hypervisor (such as trapping MMIO, some
+MSRs, some CPUIDs, and some other instructions) has to be moved into the
+guest. This is implemented using a Virtualization Exception (#VE) that
+is handled by the guest kernel. Some #VEs are handled entirely inside the
+guest kernel, but some require the hypervisor (VMM) to be involved. The
+TD hypercall mechanism allows TD guests to call TDX module or hypervisor
+functions.
+
+#VE Exceptions:
+===============
+
+#VE exceptions are delivered to TDX guests in the following
+scenarios:
+
+* Execution of certain instructions (see list below)
+* Certain MSR accesses
+* CPUID usage (only for certain leaves)
+* Shared memory access (including MMIO)
+
+#VE due to instruction execution
+---------------------------------
+
+Intel TDX disallows execution of certain instructions in non-root
+mode. Execution of these instructions leads to a #VE or a #GP.
+
+The details are as follows.
+
+Instructions that can cause a #VE:
+
+* String I/O (INS, OUTS), IN, OUT
+* HLT
+* MONITOR, MWAIT
+* WBINVD, INVD
+* VMCALL
+
+Instructions that can cause a #GP:
+
+* All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
+  VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
+* ENCLS, ENCLV
+* GETSEC
+* RSM
+* ENQCMD
+
+#VE due to MSR access
+----------------------
+
+In TDX guests, MSR access behavior falls into three categories:
+
+* Natively supported (also called "context switched MSR")
+  No special handling is required for these MSRs in TDX guests.
+* #GP triggered
+  Disallowed MSR reads/writes lead to #GP.
+* #VE triggered
+  All MSRs that are neither natively supported nor disallowed
+  (#GP) trigger #VE. To support access to these MSRs, the
+  access needs to be emulated using TDCALL.
+
+See the Intel TDX Module Specification, section "MSR Virtualization", for the
+complete list of MSRs that fall under the categories above.
+
+#VE due to CPUID instruction
+----------------------------
+
+In TDX guests, most CPUID leaf/sub-leaf combinations are virtualized by
+the TDX module while some trigger #VE. The CPUID leaf/sub-leaf combinations
+that trigger #VE are configured by the VMM at TD initialization time
+(using TDH.MNG.INIT).
+
+#VE on Memory Accesses
+----------------------
+
+A TD guest is in control of whether its memory accesses are treated as
+private or shared.  It selects the behavior with a bit in its page table
+entries.
+
+#VE on Shared Pages
+-------------------
+
+Access to shared mappings can cause a #VE. The hypervisor controls whether
+access to a shared mapping causes a #VE, so the guest must be careful to only
+reference shared pages where it can safely handle a #VE and avoid nested #VEs.
+
+The content of shared mappings is not trusted since shared memory is writable
+by the hypervisor. Shared mappings are never used for sensitive memory content
+like stacks or kernel text, only for I/O buffers and MMIO regions. The kernel
+will not encounter shared mappings in sensitive contexts like syscall entry
+or NMIs.
+
+#VE on Private Pages
+--------------------
+
+Some accesses to private mappings may cause #VEs.  Before a mapping is
+accepted (AKA in the SEPT_PENDING state), a reference would cause a #VE.
+But, after acceptance, references typically succeed.
+
+The hypervisor can cause a private page reference to fail if it chooses
+to move an accepted page to a "blocked" state.  However, if it does
+this, page access will not generate a #VE.  It will, instead, cause a
+"TD Exit" where the hypervisor is required to handle the exception.
+
+Linux #VE handler
+-----------------
+
+Both user and kernel #VE exceptions are handled by the
+tdx_handle_virt_exception() handler. If the #VE is handled successfully, the
+instruction pointer is advanced to complete the handling. If handling
+fails, the #VE is treated as a regular exception and handled via fixup handlers.
+
+In TD guests, the #VE nesting problem (a #VE triggered before the current one
+is handled, also known as the syscall gap issue) is handled by the TDX module
+ensuring that interrupts, including NMIs, are blocked. The hardware blocks
+interrupts starting with #VE delivery until TDGETVEINFO is called.
+
+The kernel must avoid triggering #VE in entry paths: do not touch TD-shared
+memory, including MMIO regions, and do not use MSRs, instructions, or CPUID
+leaves that might generate #VE.
+
+MMIO handling:
+==============
+
+In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
+mapping which will cause a VMEXIT on access, and then the VMM emulates the
+access. That's not possible in TDX guests because VMEXIT will expose the
+register state to the host. TDX guests don't trust the host and can't have
+their state exposed to the host.
+
+In TDX the MMIO regions are instead configured to trigger a #VE
+exception in the guest. The guest #VE handler then emulates the MMIO
+instructions inside the guest and converts them into a controlled TDCALL
+to the host, rather than completely exposing the state to the host.
+
+MMIO addresses on x86 are just special physical addresses. They can be
+accessed with any instruction that accesses memory. However, the
+introduced instruction decoding method is limited. It is only designed
+to decode instructions like those generated by io.h macros.
+
+MMIO access via other means (like structure overlays) may result in
+MMIO_DECODE_FAILED and an oops.
+
+Shared memory:
+==============
+
+Intel TDX doesn't allow the VMM to access guest private memory. Any
+memory that is required for communication with the VMM must be shared
+explicitly by setting the shared bit in the page table entry. The shared
+bit can be enumerated with TDX_GET_INFO.
+
+After setting the shared bit, the conversion must be completed with the
+MapGPA hypercall. The call informs the VMM about the conversion between
+private/shared mappings.
+
+set_memory_decrypted() converts a range of pages to shared.
+set_memory_encrypted() converts memory back to private.
+
+Device drivers are the primary users of shared memory, but there's no
+need to touch every driver. DMA buffers and ioremap()'ed regions are
+converted to shared automatically.
+
+TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
+converted to shared on boot.
+
+For coherent DMA allocations, the DMA buffer gets converted at
+allocation time. Check force_dma_unencrypted() for details.
+
+References
+==========
+
+More details about the TDX module (and its response to MSR, memory access,
+IO, CPUID etc.) can be found at,
+
+https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
+
+More details about TDX hypercall and TDX module call ABI can be found
+at,
+
+https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface-1.0-344426-002.pdf
+
+More details about TDVF requirements can be found at,
+
+https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
-- 
2.34.1



* Re: [PATCHv2 08/29] x86/tdx: Handle in-kernel MMIO
  2022-01-24 15:01 ` [PATCHv2 08/29] x86/tdx: Handle in-kernel MMIO Kirill A. Shutemov
@ 2022-01-24 19:30   ` Josh Poimboeuf
  2022-01-24 22:08     ` Kirill A. Shutemov
  2022-01-24 22:40   ` Dave Hansen
  2022-02-01 22:30   ` [PATCHv2 " Thomas Gleixner
  2 siblings, 1 reply; 154+ messages in thread
From: Josh Poimboeuf @ 2022-01-24 19:30 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, knsathya, pbonzini, sdeep, seanjc,
	tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Mon, Jan 24, 2022 at 06:01:54PM +0300, Kirill A. Shutemov wrote:
> +static bool tdx_mmio(int size, bool write, unsigned long addr,
> +		     unsigned long *val)
> +{
> +	struct tdx_hypercall_output out;
> +	u64 err;
> +
> +	err = _tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
> +			     addr, *val, &out);
> +	if (err)
> +		return true;
> +
> +	*val = out.r11;
> +	return false;
> +}
> +
> +static bool tdx_mmio_read(int size, unsigned long addr, unsigned long *val)
> +{
> +	return tdx_mmio(size, false, addr, val);
> +}
> +
> +static bool tdx_mmio_write(int size, unsigned long addr, unsigned long *val)
> +{
> +	return tdx_mmio(size, true, addr, val);
> +}
> +
> +static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> +{
> +	char buffer[MAX_INSN_SIZE];
> +	unsigned long *reg, val = 0;
> +	struct insn insn = {};
> +	enum mmio_type mmio;
> +	int size;
> +	bool err;
> +
> +	if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
> +		return -EFAULT;
> +
> +	if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
> +		return -EFAULT;
> +
> +	mmio = insn_decode_mmio(&insn, &size);
> +	if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
> +		return -EFAULT;
> +
> +	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
> +		reg = insn_get_modrm_reg_ptr(&insn, regs);
> +		if (!reg)
> +			return -EFAULT;
> +	}
> +
> +	switch (mmio) {
> +	case MMIO_WRITE:
> +		memcpy(&val, reg, size);
> +		err = tdx_mmio_write(size, ve->gpa, &val);
> +		break;

The return code conventions are still all mismatched and confusing:

- Most tdx_handle_*() handlers return bool (success == true)

- tdx_handle_mmio() returns int (success > 0)

- tdx_mmio*() helpers return bool (success == false)

I still don't see any benefit in arbitrarily mixing three different
return conventions, none of which matches the typical kernel style for
returning errors, unless the goal is to confuse the reader and invite
bugs.

There is precedent in traps.c for some handle_*() functions to return
bool (success == true), so if the goal is to align with that
semi-convention, that's ok.  But at the very least, please do it
consistently:

  - change tdx_mmio*() to return true on success;

  - change tdx_handle_mmio() to return bool, with 'len' passed as an
    argument.

Or, even better, just change them all to return 0 on success like 99+%
of error-returning kernel functions.

-- 
Josh



* Re: [PATCHv2 08/29] x86/tdx: Handle in-kernel MMIO
  2022-01-24 19:30   ` Josh Poimboeuf
@ 2022-01-24 22:08     ` Kirill A. Shutemov
  2022-01-24 23:04       ` Josh Poimboeuf
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 22:08 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: tglx, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, knsathya, pbonzini, sdeep, seanjc,
	tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Mon, Jan 24, 2022 at 11:30:08AM -0800, Josh Poimboeuf wrote:
> On Mon, Jan 24, 2022 at 06:01:54PM +0300, Kirill A. Shutemov wrote:
> > +static bool tdx_mmio(int size, bool write, unsigned long addr,
> > +		     unsigned long *val)
> > +{
> > +	struct tdx_hypercall_output out;
> > +	u64 err;
> > +
> > +	err = _tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
> > +			     addr, *val, &out);
> > +	if (err)
> > +		return true;
> > +
> > +	*val = out.r11;
> > +	return false;
> > +}
> > +
> > +static bool tdx_mmio_read(int size, unsigned long addr, unsigned long *val)
> > +{
> > +	return tdx_mmio(size, false, addr, val);
> > +}
> > +
> > +static bool tdx_mmio_write(int size, unsigned long addr, unsigned long *val)
> > +{
> > +	return tdx_mmio(size, true, addr, val);
> > +}
> > +
> > +static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> > +{
> > +	char buffer[MAX_INSN_SIZE];
> > +	unsigned long *reg, val = 0;
> > +	struct insn insn = {};
> > +	enum mmio_type mmio;
> > +	int size;
> > +	bool err;
> > +
> > +	if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
> > +		return -EFAULT;
> > +
> > +	if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
> > +		return -EFAULT;
> > +
> > +	mmio = insn_decode_mmio(&insn, &size);
> > +	if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
> > +		return -EFAULT;
> > +
> > +	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
> > +		reg = insn_get_modrm_reg_ptr(&insn, regs);
> > +		if (!reg)
> > +			return -EFAULT;
> > +	}
> > +
> > +	switch (mmio) {
> > +	case MMIO_WRITE:
> > +		memcpy(&val, reg, size);
> > +		err = tdx_mmio_write(size, ve->gpa, &val);
> > +		break;
> 
> The return code conventions are still all mismatched and confusing:
> 
> - Most tdx_handle_*() handlers return bool (success == true)
> 
> - tdx_handle_mmio() returns int (success > 0)

Right, all tdx_handle_* are consistent: success > 0.

> - tdx_mmio*() helpers return bool (success == false)

And what is wrong with that? Why do you mix functions that are called in
different contexts and expect them to have matching semantics?

> I still don't see any benefit in arbitrarily mixing three different
> return conventions, none of which matches the typical kernel style for
> returning errors, unless the goal is to confuse the reader and invite
> bugs.

Okay, we have a disagreement here.

I picked a way to communicate function result as I see best fits the
situation. It is a judgement call.

I will adjust code if maintainers see it differently from me. But until
then I don't see anything wrong here.

> There is precedent in traps.c for some handle_*() functions to return
> bool (success == true), so if the goal is to align with that
> semi-convention, that's ok.  But at the very least, please do it
> consistently:
> 
>   - change tdx_mmio*() to return true on success;
> 
>   - change tdx_handle_mmio() to return bool, with 'len' passed as an
>     argument.

Hard no.

Returning a value via a passed argument is the last resort for cases when
more than one value has to be returned. In this case the function is
perfectly capable of communicating the result via a single return value.

I don't see a reason to complicate the code to satisfy some "typical
kernel style".

> Or, even better, just change them all to return 0 on success like 99+%
> of error-returning kernel functions.

Citation needed. 99+% looks like an overstatement to me.

-- 
 Kirill A. Shutemov


* Re: [PATCHv2 08/29] x86/tdx: Handle in-kernel MMIO
  2022-01-24 15:01 ` [PATCHv2 08/29] x86/tdx: Handle in-kernel MMIO Kirill A. Shutemov
  2022-01-24 19:30   ` Josh Poimboeuf
@ 2022-01-24 22:40   ` Dave Hansen
  2022-01-24 23:04     ` [PATCHv2.1 " Kirill A. Shutemov
  2022-02-01 22:30   ` [PATCHv2 " Thomas Gleixner
  2 siblings, 1 reply; 154+ messages in thread
From: Dave Hansen @ 2022-01-24 22:40 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

> +static bool tdx_mmio(int size, bool write, unsigned long addr,
> +		     unsigned long *val)
> +{
> +	struct tdx_hypercall_output out;
> +	u64 err;
> +
> +	err = _tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
> +			     addr, *val, &out);
> +	if (err)
> +		return true;
> +
> +	*val = out.r11;
> +	return false;
> +}
> +
> +static bool tdx_mmio_read(int size, unsigned long addr, unsigned long *val)
> +{
> +	return tdx_mmio(size, false, addr, val);
> +}
> +
> +static bool tdx_mmio_write(int size, unsigned long addr, unsigned long *val)
> +{
> +	return tdx_mmio(size, true, addr, val);
> +}
> +
> +static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> +{
...
> +	bool err;

I'll agree with Josh on one point: "bool err" _is_ weird.

Things tend to either return int with 0 for success or bool with true 
for success.

The tdx_handle*() ones seem OK to me.  It's pretty normal to have a 
literal "handler" return true if things were handled.

I'd probably just make tdx_mmio() return an int.  It seems to only be able
to return -EFAULT anyway, so changing the return from bool->int and doing:

-	return true;
+	return -EFAULT;

isn't exactly a heavy lift.


* [PATCHv2.1 08/29] x86/tdx: Handle in-kernel MMIO
  2022-01-24 22:40   ` Dave Hansen
@ 2022-01-24 23:04     ` Kirill A. Shutemov
  2022-02-01 16:14       ` Borislav Petkov
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-24 23:04 UTC (permalink / raw)
  To: dave.hansen
  Cc: jpoimboe, aarcange, ak, bp, dan.j.williams, david, hpa, jgross,
	jmattson, joro, kirill.shutemov, knsathya, linux-kernel, luto,
	mingo, pbonzini, peterz, sathyanarayanan.kuppuswamy, sdeep,
	seanjc, tglx, tony.luck, vkuznets, wanpengli, x86

In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for a TDX VM.

To emulate an instruction an emulator needs two things:

  - R/W access to the register file to read/modify instruction arguments
    and see RIP of the faulted instruction.

  - Read access to memory where instruction is placed to see what to
    emulate. In this case it is guest kernel text.

Neither of them is available to the VMM in a TDX environment:

  - Register file is never exposed to VMM. When a TD exits to the module,
    it saves registers into the state-save area allocated for that TD.
    The module then scrubs these registers before returning execution
    control to the VMM, to help prevent leakage of TD state.

  - Memory is encrypted with a TD-private key. The CPU disallows software
    other than the TDX module and TDs from making memory accesses using
    the private key.

In TDX the MMIO regions are instead configured to trigger a #VE
exception in the guest. The guest #VE handler then emulates the MMIO
instruction inside the guest and converts it into a controlled hypercall
to the host.
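
As a rough user-space illustration of what "emulating the instruction
inside the guest" involves for reads, the sketch below models how a value
returned by the hypercall could be merged into a 64-bit destination
register with no, zero or sign extension. It is not the kernel code: the
enum, the function name and its parameters are hypothetical, it assumes a
little-endian host, and it derives the sign bit from the access size
rather than copying the handler in this patch line by line.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the read flavours a handler has to distinguish. */
enum ext_kind { EXT_NONE, EXT_ZERO, EXT_SIGN };

/*
 * Merge 'val' (an MMIO read result, 'size' bytes wide) into the 64-bit
 * register image '*reg' for an instruction whose operand size is
 * 'opnd_bytes'. Assumes a little-endian host, as on x86.
 */
static void merge_mmio_read(uint64_t *reg, uint64_t val, int size,
			    int opnd_bytes, enum ext_kind kind)
{
	uint8_t fill = 0;

	switch (kind) {
	case EXT_NONE:
		/* A 32-bit write to a register clears the upper half. */
		if (size == 4)
			*reg = 0;
		break;
	case EXT_ZERO:
		/* Clear the whole destination operand first. */
		memset(reg, 0, opnd_bytes);
		break;
	case EXT_SIGN:
		/* Replicate the top bit of the value that was read. */
		if (val & (1ULL << (size * 8 - 1)))
			fill = 0xff;
		memset(reg, fill, opnd_bytes);
		break;
	}

	/* Copy the low 'size' bytes over the prepared register image. */
	memcpy(reg, &val, size);
}
```

For example, a 1-byte read of 0x80 that is sign-extended into an 8-byte
operand leaves 0xffffffffffffff80 in the modelled register, while a plain
2-byte read only replaces the low 16 bits.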

MMIO addresses can be used with any CPU instruction that accesses
memory. This patch, however, covers only MMIO accesses done via io.h
helpers, such as 'readl()' or 'writeq()'.

readX()/writeX() helpers limit the range of instructions which can trigger
MMIO. This makes MMIO instruction emulation feasible. Raw access to an MMIO
region allows the compiler to generate whatever instruction it wants.
Supporting all possible instructions is a task of a different scope.

MMIO access with anything other than helpers from io.h may result in
MMIO_DECODE_FAILED and an oops.

AMD SEV has the same limitations to MMIO handling.

=== Potential alternative approaches ===

== Paravirtualizing all MMIO ==

An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.

Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.

However, any paravirtual approach would be patching approximately
120k call sites. With a conservative overhead estimate of 5 bytes per
call site (CALL instruction), that bloats the code by about 600k bytes.

Many drivers will never be used in the TDX environment and the bloat
cannot be justified.

== Patching TDX drivers ==

Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests.  Right now, that's
limited only to virtio and some x86-specific drivers.

All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch. This will be implemented in the
future, removing the bulk of MMIO #VEs.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/tdx.c | 113 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)

diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index f213c67b4ecc..c5367e331bf6 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -7,6 +7,8 @@
 #include <linux/cpufeature.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>
 
 /* TDX module Call Leaf IDs */
 #define TDX_GET_VEINFO			3
@@ -149,6 +151,111 @@ static bool tdx_handle_cpuid(struct pt_regs *regs)
 	return true;
 }
 
+static int tdx_mmio(int size, bool write, unsigned long addr,
+		     unsigned long *val)
+{
+	struct tdx_hypercall_output out;
+	u64 err;
+
+	err = _tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
+			     addr, *val, &out);
+	if (err)
+		return -EFAULT;
+
+	*val = out.r11;
+	return 0;
+}
+
+static int tdx_mmio_read(int size, unsigned long addr, unsigned long *val)
+{
+	return tdx_mmio(size, false, addr, val);
+}
+
+static int tdx_mmio_write(int size, unsigned long addr, unsigned long *val)
+{
+	return tdx_mmio(size, true, addr, val);
+}
+
+static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+{
+	char buffer[MAX_INSN_SIZE];
+	unsigned long *reg, val = 0;
+	struct insn insn = {};
+	enum mmio_type mmio;
+	int size, err;
+
+	if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
+		return -EFAULT;
+
+	if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
+		return -EFAULT;
+
+	mmio = insn_decode_mmio(&insn, &size);
+	if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
+		return -EFAULT;
+
+	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
+		reg = insn_get_modrm_reg_ptr(&insn, regs);
+		if (!reg)
+			return -EFAULT;
+	}
+
+	switch (mmio) {
+	case MMIO_WRITE:
+		memcpy(&val, reg, size);
+		err = tdx_mmio_write(size, ve->gpa, &val);
+		break;
+	case MMIO_WRITE_IMM:
+		val = insn.immediate.value;
+		err = tdx_mmio_write(size, ve->gpa, &val);
+		break;
+	case MMIO_READ:
+		err = tdx_mmio_read(size, ve->gpa, &val);
+		if (err)
+			break;
+		/* Zero-extend for 32-bit operation */
+		if (size == 4)
+			*reg = 0;
+		memcpy(reg, &val, size);
+		break;
+	case MMIO_READ_ZERO_EXTEND:
+		err = tdx_mmio_read(size, ve->gpa, &val);
+		if (err)
+			break;
+
+		/* Zero extend based on operand size */
+		memset(reg, 0, insn.opnd_bytes);
+		memcpy(reg, &val, size);
+		break;
+	case MMIO_READ_SIGN_EXTEND: {
+		u8 sign_byte = 0, msb = 7;
+
+		err = tdx_mmio_read(size, ve->gpa, &val);
+		if (err)
+			break;
+
+		if (size > 1)
+			msb = 15;
+
+		if (val & BIT(msb))
+			sign_byte = -1;
+
+		/* Sign extend based on operand size */
+		memset(reg, sign_byte, insn.opnd_bytes);
+		memcpy(reg, &val, size);
+		break;
+	}
+	case MMIO_MOVS:
+	case MMIO_DECODE_FAILED:
+		return -EFAULT;
+	}
+
+	if (err)
+		return err;
+
+	return insn.length;
+}
+
 bool tdx_get_ve_info(struct ve_info *ve)
 {
 	struct tdx_module_output out;
@@ -219,6 +326,12 @@ static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 	case EXIT_REASON_CPUID:
 		ret = tdx_handle_cpuid(regs);
 		break;
+	case EXIT_REASON_EPT_VIOLATION:
+		ve->instr_len = tdx_handle_mmio(regs, ve);
+		ret = ve->instr_len > 0;
+		if (!ret)
+			pr_warn_once("MMIO failed\n");
+		break;
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
 		break;
-- 
2.34.1



* Re: [PATCHv2 08/29] x86/tdx: Handle in-kernel MMIO
  2022-01-24 22:08     ` Kirill A. Shutemov
@ 2022-01-24 23:04       ` Josh Poimboeuf
  0 siblings, 0 replies; 154+ messages in thread
From: Josh Poimboeuf @ 2022-01-24 23:04 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, knsathya, pbonzini, sdeep, seanjc,
	tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Tue, Jan 25, 2022 at 01:08:21AM +0300, Kirill A. Shutemov wrote:
> > The return code conventions are still all mismatched and confusing:
> > 
> > - Most tdx_handle_*() handlers return bool (success == true)
> > 
> > - tdx_handle_mmio() returns int (success > 0)
> 
> Right, all tdx_handle_* are consistent: success > 0.

Non-zero success is not the same as above-zero success.  The behavior is
not interchangeable.
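
To make the difference concrete, here is a small stand-alone sketch
(user-space C with made-up names, not the TDX code): a function that
returns a positive length on success and a negative errno on failure
looks "successful" to any caller that only tests for non-zero.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical handler: returns an instruction length (> 0) or -errno. */
static int handle_len(bool ok)
{
	return ok ? 3 : -EFAULT;
}

/* Caller that treats any non-zero value as success: wrong for -EFAULT. */
static bool succeeded_nonzero(int ret)
{
	return ret;	/* negative values convert to true as well */
}

/* Caller that checks for above-zero success: matches handle_len(). */
static bool succeeded_positive(int ret)
{
	return ret > 0;
}
```

With these definitions the two checks agree on the success path but
disagree on the failure path, which is exactly the kind of mix-up a
reader can make when conventions differ between neighbouring functions.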

> > - tdx_mmio*() helpers return bool (success == false)
> 
> And what is wrong with that? Why do you mix functions that are called in
> different contexts and expect them to have matching semantics?

Why would you expect the reader of the code to go investigate the weird
return semantics of every called function?

And "success == false" is just plain confusing, I haven't seen that one.

> > I still don't see any benefit in arbitrarily mixing three different
> > return conventions, none of which matches the typical kernel style for
> > returning errors, unless the goal is to confuse the reader and invite
> > bugs.
> 
> Okay, we have a disagreement here.
> 
> I picked a way to communicate function result as I see best fits the
> situation. It is a judgement call.
> 
> I will adjust code if maintainers see it differently from me. But until
> then I don't see anything wrong here.
> 
> > There is precedent in traps.c for some handle_*() functions to return
> > bool (success == true), so if the goal is to align with that
> > semi-convention, that's ok.  But at the very least, please do it
> > consistently:
> > 
> >   - change tdx_mmio*() to return true on success;
> > 
> >   - change tdx_handle_mmio() to return bool, with 'len' passed as an
> >     argument.
> 
> Hard no.
> 
> Returning a value via a passed argument is the last resort for cases when
> more than one value has to be returned. In this case the function is
> perfectly capable of communicating the result via a single return value.
> 
> I don't see a reason to complicate the code to satisfy some "typical
> kernel style".

It's a convention for a reason.

> > Or, even better, just change them all to return 0 on success like 99+%
> > of error-returning kernel functions.
> 
> Citation needed. 99+% looks like an overstatement to me.

From Documentation/process/coding-style.rst:

16) Function return values and names
------------------------------------

Functions can return values of many different kinds, and one of the
most common is a value indicating whether the function succeeded or
failed.  Such a value can be represented as an error-code integer
(-Exxx = failure, 0 = success) or a ``succeeded`` boolean (0 = failure,
non-zero = success).

Mixing up these two sorts of representations is a fertile source of
difficult-to-find bugs.  If the C language included a strong distinction
between integers and booleans then the compiler would find these mistakes
for us... but it doesn't.  To help prevent such bugs, always follow this
convention::

	If the name of a function is an action or an imperative command,
	the function should return an error-code integer.  If the name
	is a predicate, the function should return a "succeeded" boolean.

For example, ``add work`` is a command, and the add_work() function returns 0
for success or -EBUSY for failure.  In the same way, ``PCI device present`` is
a predicate, and the pci_dev_present() function returns 1 if it succeeds in
finding a matching device or 0 if it doesn't.
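
A minimal stand-alone sketch of the two conventions described in that
section (user-space C; queue_add_work() and queue_is_full() are made-up
names for illustration, not kernel APIs):

```c
#include <errno.h>
#include <stdbool.h>

#define QUEUE_CAP 2

struct work_queue {
	int items[QUEUE_CAP];
	int count;
};

/* Predicate name: returns a boolean answer to the question asked. */
static bool queue_is_full(const struct work_queue *q)
{
	return q->count == QUEUE_CAP;
}

/* Action name: returns 0 on success or a negative error code. */
static int queue_add_work(struct work_queue *q, int item)
{
	if (queue_is_full(q))
		return -EBUSY;

	q->items[q->count++] = item;
	return 0;
}
```

A caller can then write "if (queue_is_full(q))" and
"err = queue_add_work(q, item); if (err) ..." without having to look up
which way round each function reports success.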

-- 
Josh



* Re: [PATCHv2 05/29] x86/tdx: Add HLT support for TDX guests
  2022-01-24 15:01 ` [PATCHv2 05/29] x86/tdx: Add HLT support for TDX guests Kirill A. Shutemov
@ 2022-01-29 14:53   ` Borislav Petkov
  2022-01-29 22:30     ` [PATCHv2.1 " Kirill A. Shutemov
  0 siblings, 1 reply; 154+ messages in thread
From: Borislav Petkov @ 2022-01-29 14:53 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Mon, Jan 24, 2022 at 06:01:51PM +0300, Kirill A. Shutemov wrote:
> @@ -870,6 +871,10 @@ void select_idle_routine(const struct cpuinfo_x86 *c)
>  	} else if (prefer_mwait_c1_over_halt(c)) {
>  		pr_info("using mwait in idle threads\n");
>  		x86_idle = mwait_idle;
> +	} else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
> +		pr_info("using TDX aware idle routine\n");
> +		x86_idle = tdx_safe_halt;
> +		return;

Forgot to remove that "return".

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* [PATCHv2.1 05/29] x86/tdx: Add HLT support for TDX guests
  2022-01-29 14:53   ` Borislav Petkov
@ 2022-01-29 22:30     ` Kirill A. Shutemov
  2022-02-01 21:21       ` Thomas Gleixner
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-01-29 22:30 UTC (permalink / raw)
  To: bp
  Cc: aarcange, ak, dan.j.williams, dave.hansen, david, hpa, jgross,
	jmattson, joro, jpoimboe, kirill.shutemov, knsathya,
	linux-kernel, luto, mingo, pbonzini, peterz,
	sathyanarayanan.kuppuswamy, sdeep, seanjc, tglx, tony.luck,
	vkuznets, wanpengli, x86

The HLT instruction is a privileged instruction; executing it stops
instruction execution and places the processor in a HALT state. It
is used in the kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).

To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].

In TDX guests, executing the HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires the STI instruction to
be executed just before the TDCALL. Since the idle loop is the only user of
the safe_halt() variant, handle it as a special case.

To avoid the *safe_halt() call in the idle function, define
tdx_safe_halt() and use it to override the "x86_idle" function pointer
for a valid TDX guest.

Alternative choices like PV ops were considered for adding
safe_halt() support. But they were rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in a TDX guest just for
the safe_halt() use case is not worth the cost.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/tdx.h |  3 ++
 arch/x86/kernel/process.c  |  4 +++
 arch/x86/kernel/tdcall.S   | 31 +++++++++++++++++
 arch/x86/kernel/tdx.c      | 70 ++++++++++++++++++++++++++++++++++++--
 4 files changed, 106 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d17143290f0a..9b4714a45bb9 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -74,10 +74,13 @@ bool tdx_get_ve_info(struct ve_info *ve);
 
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
 
+void tdx_safe_halt(void);
+
 #else
 
 static inline void tdx_early_init(void) { };
 static inline bool is_tdx_guest(void) { return false; }
+static inline void tdx_safe_halt(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 81d8ef036637..71aa12082370 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -46,6 +46,7 @@
 #include <asm/proto.h>
 #include <asm/frame.h>
 #include <asm/unwind.h>
+#include <asm/tdx.h>
 
 #include "process.h"
 
@@ -870,6 +871,9 @@ void select_idle_routine(const struct cpuinfo_x86 *c)
 	} else if (prefer_mwait_c1_over_halt(c)) {
 		pr_info("using mwait in idle threads\n");
 		x86_idle = mwait_idle;
+	} else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+		pr_info("using TDX aware idle routine\n");
+		x86_idle = tdx_safe_halt;
 	} else
 		x86_idle = default_idle;
 }
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index 46a49a96cf6c..ae74da33ccc6 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -3,6 +3,7 @@
 #include <asm/asm.h>
 #include <asm/frame.h>
 #include <asm/unwind_hints.h>
+#include <uapi/asm/vmx.h>
 
 #include <linux/linkage.h>
 #include <linux/bits.h>
@@ -39,6 +40,12 @@
  */
 #define tdcall .byte 0x66,0x0f,0x01,0xcc
 
+/*
+ * Used in __tdx_hypercall() to determine whether to enable interrupts
+ * before issuing TDCALL for the EXIT_REASON_HLT case.
+ */
+#define ENABLE_IRQS_BEFORE_HLT 0x01
+
 /*
  * __tdx_module_call()  - Used by TDX guests to request services from
  * the TDX module (does not include VMM services).
@@ -230,6 +237,30 @@ SYM_FUNC_START(__tdx_hypercall)
 
 	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
 
+	/*
+	 * For the idle loop STI needs to be called directly before
+	 * the TDCALL that enters idle (EXIT_REASON_HLT case). STI
+	 * instruction enables interrupts only one instruction later.
+	 * If there is a window between STI and the instruction that
+	 * emulates the HALT state, there is a chance for interrupts to
+	 * happen in this window, which can delay the HLT operation
+	 * indefinitely. Since this is not the desired result,
+	 * conditionally call STI before TDCALL.
+	 *
+	 * Since STI instruction is only required for the idle case
+	 * (a special case of EXIT_REASON_HLT), use the r15 register
+	 * value to identify it. Since the R15 register is not used
+	 * by the VMM as per EXIT_REASON_HLT ABI, re-use it in
+	 * software to identify the STI case.
+	 */
+	cmpl $EXIT_REASON_HLT, %r11d
+	jne .Lskip_sti
+	cmpl $ENABLE_IRQS_BEFORE_HLT, %r15d
+	jne .Lskip_sti
+	/* Set R15 register to 0, it is unused in EXIT_REASON_HLT case */
+	xor %r15, %r15
+	sti
+.Lskip_sti:
 	tdcall
 
 	/* Restore output pointer to R9 */
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index 5a5b25f9c4d3..eeb456631a65 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -6,6 +6,7 @@
 
 #include <linux/cpufeature.h>
 #include <asm/tdx.h>
+#include <asm/vmx.h>
 
 /* TDX module Call Leaf IDs */
 #define TDX_GET_VEINFO			3
@@ -35,6 +36,61 @@ static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
 	return out->r10;
 }
 
+static u64 __cpuidle _tdx_halt(const bool irq_disabled, const bool do_sti)
+{
+	/*
+	 * Emulate HLT operation via hypercall. More info about ABI
+	 * can be found in TDX Guest-Host-Communication Interface
+	 * (GHCI), sec 3.8 TDG.VP.VMCALL<Instruction.HLT>.
+	 *
+	 * The VMM uses the "IRQ disabled" param to understand IRQ
+	 * enabled status (RFLAGS.IF) of the TD guest and to determine
+	 * whether or not it should schedule the halted vCPU if an
+	 * IRQ becomes pending. E.g. if IRQs are disabled, the VMM
+	 * can keep the vCPU in virtual HLT, even if an IRQ is
+	 * pending, without hanging/breaking the guest.
+	 *
+	 * do_sti parameter is used by the __tdx_hypercall() to decide
+	 * whether to call the STI instruction before executing the
+	 * TDCALL instruction.
+	 */
+	return _tdx_hypercall(EXIT_REASON_HLT, irq_disabled, 0, 0,
+			      do_sti, NULL);
+}
+
+static bool tdx_halt(void)
+{
+	/*
+	 * Since non safe halt is mainly used in CPU offlining
+	 * and the guest will always stay in the halt state, don't
+	 * call the STI instruction (set do_sti as false).
+	 */
+	const bool irq_disabled = irqs_disabled();
+	const bool do_sti = false;
+
+	if (_tdx_halt(irq_disabled, do_sti))
+		return false;
+
+	return true;
+}
+
+void __cpuidle tdx_safe_halt(void)
+{
+	 /*
+	  * For do_sti=true case, __tdx_hypercall() function enables
+	  * interrupts using the STI instruction before the TDCALL. So
+	  * set irq_disabled as false.
+	  */
+	const bool irq_disabled = false;
+	const bool do_sti = true;
+
+	/*
+	 * Use WARN_ONCE() to report the failure.
+	 */
+	if (_tdx_halt(irq_disabled, do_sti))
+		WARN_ONCE(1, "HLT instruction emulation failed\n");
+}
+
 bool tdx_get_ve_info(struct ve_info *ve)
 {
 	struct tdx_module_output out;
@@ -75,8 +131,18 @@ static bool tdx_virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
 /* Handle the kernel #VE */
 static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 {
-	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
-	return false;
+	bool ret = false;
+
+	switch (ve->exit_reason) {
+	case EXIT_REASON_HLT:
+		ret = tdx_halt();
+		break;
+	default:
+		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
+		break;
+	}
+
+	return ret;
 }
 
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
-- 
2.34.1



* Re: [PATCHv2.1 08/29] x86/tdx: Handle in-kernel MMIO
  2022-01-24 23:04     ` [PATCHv2.1 " Kirill A. Shutemov
@ 2022-02-01 16:14       ` Borislav Petkov
  0 siblings, 0 replies; 154+ messages in thread
From: Borislav Petkov @ 2022-02-01 16:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: dave.hansen, jpoimboe, aarcange, ak, dan.j.williams, david, hpa,
	jgross, jmattson, joro, knsathya, linux-kernel, luto, mingo,
	pbonzini, peterz, sathyanarayanan.kuppuswamy, sdeep, seanjc,
	tglx, tony.luck, vkuznets, wanpengli, x86

On Tue, Jan 25, 2022 at 02:04:32AM +0300, Kirill A. Shutemov wrote:
> MMIO addresses can be used with any CPU instruction that accesses
> memory. This patch, however, covers only MMIO accesses done via io.h

Just like the last time:

s/This patch, however, covers only/Address only/

Avoid having "This patch" or "This commit" in the commit message. It is
tautologically useless.

Also, do

$ git grep 'This patch' Documentation/process

for more details.

> helpers, such as 'readl()' or 'writeq()'.
> 
> readX()/writeX() helpers limit the range of instructions which can trigger
> MMIO. It makes MMIO instruction emulation feasible. Raw access to MMIO

"Raw access to an MMIO region allows the compiler to ..."

> region allows compiler to generate whatever instruction it wants.
> Supporting all possible instructions is a task of a different scope
								     ^
								     . Fullstop


...

> @@ -149,6 +151,111 @@ static bool tdx_handle_cpuid(struct pt_regs *regs)
>  	return true;
>  }
>  
> +static int tdx_mmio(int size, bool write, unsigned long addr,
> +		     unsigned long *val)

You don't need to break that line.

Rest LGTM.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCHv2 09/29] x86/tdx: Detect TDX at early kernel decompression time
  2022-01-24 15:01 ` [PATCHv2 09/29] x86/tdx: Detect TDX at early kernel decompression time Kirill A. Shutemov
@ 2022-02-01 18:30   ` Borislav Petkov
  2022-02-01 22:33   ` Thomas Gleixner
  1 sibling, 0 replies; 154+ messages in thread
From: Borislav Petkov @ 2022-02-01 18:30 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Mon, Jan 24, 2022 at 06:01:55PM +0300, Kirill A. Shutemov wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> The early decompression code does port I/O for its console output. But,
> handling the decompression-time port I/O demands a different approach
> from normal runtime because the IDT required to support #VE based port
> I/O emulation is not yet set up. Paravirtualizing I/O calls during
> the decompression step is acceptable because the decompression code size is
> small enough and hence patching it will not bloat the image size a lot.
> 
> To support port I/O in decompression code, TDX must be detected before
> the decompression code might do port I/O. Add support to detect for
> TDX guest support before console_init() in the extract_kernel().

s/Add support to detect for TDX guest support before console_init() in the extract_kernel()./Detect whether the kernel runs in a TDX guest./

Simple.

> Detecting it above the console_init() is early enough for patching
> port I/O.

No need for that sentence - there's already a comment above the call
below.

> 
> Add an early_is_tdx_guest() interface to get the cached TDX guest

s/get/query/

> status in the decompression code.

...

> diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> index a4339cb2d247..d8373d766672 100644
> --- a/arch/x86/boot/compressed/misc.c
> +++ b/arch/x86/boot/compressed/misc.c
> @@ -370,6 +370,14 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
>  	lines = boot_params->screen_info.orig_video_lines;
>  	cols = boot_params->screen_info.orig_video_cols;
>  
> +	/*
> +	 * Detect TDX guest environment.
> +	 *
> +	 * It has to be done before console_init() in order to use
> +	 * paravirtualized port I/O oprations if needed.

Unknown word [oprations] in comment.
Suggestions: ['orations', 'operations', 'op rations', 'op-rations', 'preparations', 'reparations', 'inspirations', 'operation']

> +	 */
> +	early_tdx_detect();
> +
>  	console_init();
>  
>  	/*

...

> diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
> new file mode 100644
> index 000000000000..12bede46d048
> --- /dev/null
> +++ b/arch/x86/include/asm/shared/tdx.h
> @@ -0,0 +1,7 @@
> +#ifndef _ASM_X86_SHARED_TDX_H
> +#define _ASM_X86_SHARED_TDX_H

WARNING: Missing or malformed SPDX-License-Identifier tag in line 1
#232: FILE: arch/x86/include/asm/shared/tdx.h:1:
+#ifndef _ASM_X86_SHARED_TDX_H

Why isn't checkpatch part of your patch creation workflow?

> +
> +#define TDX_CPUID_LEAF_ID	0x21
> +#define TDX_IDENT		"IntelTDX    "
> +
> +#endif
	  ^
	 /* _ASM_X86_SHARED_TDX_H */

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCHv2 11/29] x86/boot: Allow to hook up alternative port I/O helpers
  2022-01-24 15:01 ` [PATCHv2 11/29] x86/boot: Allow to hook up alternative " Kirill A. Shutemov
@ 2022-02-01 19:02   ` Borislav Petkov
  2022-02-01 22:39   ` Thomas Gleixner
  1 sibling, 0 replies; 154+ messages in thread
From: Borislav Petkov @ 2022-02-01 19:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Mon, Jan 24, 2022 at 06:01:57PM +0300, Kirill A. Shutemov wrote:
> Port I/O instructions trigger #VE in the TDX environment. In response to
> the exception, kernel emulates these instructions using hypercalls.
> 
> But during early boot, on the decompression stage, it is cumbersome to
> deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE
> handling.
> 
> Add a way to hook up alternative port I/O helpers in the boot stub.
> All port I/O operations are routed via 'pio_ops'. By default 'pio_ops'
> initialized with native port I/O implementations.

I see that you like to talk about what a patch is doing in the commit
message but there's really no need for it - that's visible from the diff
itself, hopefully.

So talk about the *why* instead, pls.

> This is a preparation patch. The next patch will override 'pio_ops' if
> the kernel booted in the TDX environment.

That's also not needed.

The rest LGTM.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCHv2 01/29] x86/tdx: Detect running as a TDX guest in early boot
  2022-01-24 15:01 ` [PATCHv2 01/29] x86/tdx: Detect running as a TDX guest in early boot Kirill A. Shutemov
@ 2022-02-01 19:29   ` Thomas Gleixner
  2022-02-01 23:14     ` Kirill A. Shutemov
  0 siblings, 1 reply; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 19:29 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov

Kirill,

On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:

Just a nitpick...

> +static bool tdx_guest_detected __ro_after_init;
> +
> +bool is_tdx_guest(void)
> +{
> +	return tdx_guest_detected;
> +}
> +
> +void __init tdx_early_init(void)
> +{
> +	u32 eax, sig[3];
> +
> +	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2],  &sig[1]);
> +
> +	if (memcmp(TDX_IDENT, sig, 12))
> +		return;
> +
> +	tdx_guest_detected = true;
> +
> +	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);

So with that we have two ways to detect a TDX guest:

   - tdx_guest_detected

   - X86_FEATURE_TDX_GUEST

Shouldn't X86_FEATURE_TDX_GUEST be good enough?

Thanks,

        tglx
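
Note on the signature check quoted above: cpuid_count() takes its outputs
in EAX, EBX, ECX, EDX order, so passing &sig[0], &sig[2], &sig[1] stores
EBX, EDX, ECX contiguously, the same register order used for the CPUID
vendor string. A minimal userspace sketch of that packing follows. It does
not execute CPUID; the helper names sig_matches_tdx() and reg4() are
hypothetical, the register values are hand-built, and a little-endian
host is assumed.

```c
#include <stdint.h>
#include <string.h>

#define TDX_IDENT "IntelTDX    "	/* 12 bytes, as in the patch */

/*
 * Hypothetical helper: lay out the three signature registers the way
 * the quoted tdx_early_init() does (EBX, EDX, ECX) and compare the
 * result with TDX_IDENT. Assumes a little-endian host, as on x86.
 */
static int sig_matches_tdx(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
	uint32_t sig[3];

	sig[0] = ebx;	/* bytes 0..3:  "Inte" */
	sig[1] = edx;	/* bytes 4..7:  "lTDX" */
	sig[2] = ecx;	/* bytes 8..11: "    " */

	return memcmp(TDX_IDENT, sig, 12) == 0;
}

/* Build a register value from four ASCII bytes, lowest byte first. */
static uint32_t reg4(const char *s)
{
	return (uint32_t)(unsigned char)s[0] |
	       (uint32_t)(unsigned char)s[1] << 8 |
	       (uint32_t)(unsigned char)s[2] << 16 |
	       (uint32_t)(unsigned char)s[3] << 24;
}
```

Swapping the ECX and EDX values makes the comparison fail, which is why
the argument order in the cpuid_count() call matters.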


* Re: [PATCHv2 02/29] x86/tdx: Extend the cc_platform_has() API to support TDX guests
  2022-01-24 15:01 ` [PATCHv2 02/29] x86/tdx: Extend the cc_platform_has() API to support TDX guests Kirill A. Shutemov
@ 2022-02-01 19:31   ` Thomas Gleixner
  0 siblings, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 19:31 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov

On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
> Since is_tdx_guest() function (through cc_platform_has() API) is used in
> the early boot code, disable the instrumentation flags and function
> tracer. This is similar to AMD SEV and cc_platform.c.
>
> Since intel_cc_platform_has() function only gets called when
> is_tdx_guest() is true (valid CONFIG_INTEL_TDX_GUEST case), remove the
> redundant #ifdef in intel_cc_platform_has().
>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>


* Re: [PATCHv2 03/29] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-01-24 15:01 ` [PATCHv2 03/29] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions Kirill A. Shutemov
@ 2022-02-01 19:58   ` Thomas Gleixner
  2022-02-02  2:55     ` Kirill A. Shutemov
  0 siblings, 1 reply; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 19:58 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov, Kai Huang

Kirill,

On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -8,11 +8,51 @@
>  #define TDX_CPUID_LEAF_ID	0x21
>  #define TDX_IDENT		"IntelTDX    "
>  
> +#define TDX_HYPERCALL_STANDARD  0
> +
> +/*
> + * Used in __tdx_module_call() to gather the output registers'
> + * values of the TDCALL instruction when requesting services from
> + * the TDX module. This is a software only structure and not part
> + * of the TDX module/VMM ABI
> + */
> +struct tdx_module_output {
> +	u64 rcx;
> +	u64 rdx;
> +	u64 r8;
> +	u64 r9;
> +	u64 r10;
> +	u64 r11;
> +};

I've seen exactly the same struct named seamcall_regs_out in the TDX
host series. I assume that's not a coincidence, which raises the question
of why this is required twice with different names.

> +#ifdef CONFIG_INTEL_TDX_GUEST
> +	BLANK();
> +	/* Offset for fields in tdx_module_output */
> +	OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
> +	OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
> +	OFFSET(TDX_MODULE_r8,  tdx_module_output, r8);
> +	OFFSET(TDX_MODULE_r9,  tdx_module_output, r9);
> +	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
> +	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);

Which obviously duplicates the above part as well.

> + *-------------------------------------------------------------------------
> + * TDCALL ABI:
> + *-------------------------------------------------------------------------
> + * Input Registers:
> + *
> + * RAX                 - TDCALL Leaf number.
> + * RCX,RDX,R8-R9       - TDCALL Leaf specific input registers.
> + *
> + * Output Registers:
> + *
> + * RAX                 - TDCALL instruction error code.
> + * RCX,RDX,R8-R11      - TDCALL Leaf specific output registers.
> + *
> + *-------------------------------------------------------------------------
> + *
> + * __tdx_module_call() function ABI:
> + *
> + * @fn  (RDI)          - TDCALL Leaf ID,    moved to RAX
> + * @rcx (RSI)          - Input parameter 1, moved to RCX
> + * @rdx (RDX)          - Input parameter 2, moved to RDX
> + * @r8  (RCX)          - Input parameter 3, moved to R8
> + * @r9  (R8)           - Input parameter 4, moved to R9
> + *
> + * @out (R9)           - struct tdx_module_output pointer
> + *                       stored temporarily in R12 (not
> + *                       shared with the TDX module). It
> + *                       can be NULL.
> + *
> + * Return status of TDCALL via RAX.
> + */

And unsurprisingly this function and __seamcall of the other patch set
are very similar aside from the calling convention (__seamcall has a
struct for the input parameters) and the obvious difference that one
issues TDCALL and the other SEAMCALL.

So can we please have _one_ implementation and the same struct(s) for
the module call which is exactly the same for host and guest except for
the instruction used.

IOW, this begs a macro implementation

.macro TDX_MODULE_CALL host:req

       ....

        .if \host
        seamcall
        .else
	tdcall
        .endif

       ....

So the actual functions become:

SYM_FUNC_START(__tdx_module_call)
        FRAME_BEGIN
        TDX_MODULE_CALL host=0
        FRAME_END
        ret
SYM_FUNC_END(__tdx_module_call)

SYM_FUNC_START(__tdx_seam_call)
        FRAME_BEGIN
        TDX_MODULE_CALL host=1
        FRAME_END
        ret
SYM_FUNC_END(__tdx_seam_call)

Hmm?

> +/*
> + * Wrapper for standard use of __tdx_hypercall with panic report
> + * for TDCALL error.
> + */
> +static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
> +				 u64 r15, struct tdx_hypercall_output *out)

This raises the question of whether having a struct hypercall_input,
similar to the way seamcall input parameters are implemented, makes more
sense than 7 function arguments. Hmm?

Thanks,

        tglx
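
Note on the struct-based input suggested above: a rough userspace sketch
of what bundling the hypercall inputs could look like. Everything here is
an assumption made for illustration: the struct name tdx_hypercall_args,
its field layout (named after the R10..R15 registers the series talks
about), the layout of the output struct, and fake_tdx_hypercall(), which
stands in for the assembly routine and issues no TDCALL at all.

```c
#include <stddef.h>
#include <stdint.h>

#define TDX_HYPERCALL_STANDARD	0

/* Hypothetical input bundle replacing the long argument list. */
struct tdx_hypercall_args {
	uint64_t type;	/* would be loaded into R10 */
	uint64_t fn;	/* would be loaded into R11 */
	uint64_t r12;
	uint64_t r13;
	uint64_t r14;
	uint64_t r15;
};

/* Assumed output layout; the real struct is not quoted in this mail. */
struct tdx_hypercall_output {
	uint64_t r10;
	uint64_t r11;
	uint64_t r12;
	uint64_t r13;
	uint64_t r14;
	uint64_t r15;
};

/*
 * Stand-in for the assembly helper: echoes the inputs into the output
 * struct and reports success. @out may be NULL, mirroring the
 * "It can be NULL" note in the quoted __tdx_module_call() comment.
 */
static uint64_t fake_tdx_hypercall(const struct tdx_hypercall_args *args,
				   struct tdx_hypercall_output *out)
{
	if (out) {
		out->r10 = 0;		/* pretend the VMM reported success */
		out->r11 = args->fn;
		out->r12 = args->r12;
		out->r13 = args->r13;
		out->r14 = args->r14;
		out->r15 = args->r15;
	}
	return 0;
}
```

With designated initializers a call site only names the registers it
uses, e.g. { .type = TDX_HYPERCALL_STANDARD, .fn = 12, .r12 = 1 }, and
the remaining fields default to zero.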


* Re: [PATCHv2 04/29] x86/traps: Add #VE support for TDX guest
  2022-01-24 15:01 ` [PATCHv2 04/29] x86/traps: Add #VE support for TDX guest Kirill A. Shutemov
@ 2022-02-01 21:02   ` Thomas Gleixner
  2022-02-01 21:26     ` Sean Christopherson
  2022-02-12  1:42     ` Kirill A. Shutemov
  0 siblings, 2 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 21:02 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov, Sean Christopherson

On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> index df0fa695bb09..1da074123c16 100644
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -68,6 +68,9 @@ static const __initconst struct idt_data early_idts[] = {
>  	 */
>  	INTG(X86_TRAP_PF,		asm_exc_page_fault),
>  #endif
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
> +#endif
>  
> +bool tdx_get_ve_info(struct ve_info *ve)
> +{
> +	struct tdx_module_output out;
> +
> +	/*
> +	 * NMIs and machine checks are suppressed. Before this point any
> +	 * #VE is fatal. After this point (TDGETVEINFO call), NMIs and
> +	 * additional #VEs are permitted (but it is expected not to
> +	 * happen unless kernel panics).

I really do not understand that comment. #NMI and #MC are suppressed
according to the above. How long are they suppressed and what's the
mechanism? Are they unblocked on return from __tdx_module_call() ?

What prevents a nested #VE? If it happens what makes it fatal? Is it
converted to a #DF or detected by software?

Also I do not understand what the last sentence tries to tell me. If the
suppression of #NMI and #MC is lifted on return from tdcall then both
can be delivered immediately afterwards, right?

I assume the additional #VE is triggered by software or a bug in the
kernel.

Confused.

> +	 */
> +	if (__tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out))
> +		return false;
> +
> +	ve->exit_reason = out.rcx;
> +	ve->exit_qual   = out.rdx;
> +	ve->gla         = out.r8;
> +	ve->gpa         = out.r9;
> +	ve->instr_len   = lower_32_bits(out.r10);
> +	ve->instr_info  = upper_32_bits(out.r10);
> +
> +	return true;
> +}
> +
> +/*
> + * Handle the user initiated #VE.
> + *
> + * For example, executing the CPUID instruction from user space
> + * is a valid case and hence the resulting #VE has to be handled.
> + *
> + * For dis-allowed or invalid #VE just return failure.
> + */
> +static bool tdx_virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
> +{
> +	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> +	return false;
> +}
> +
> +/* Handle the kernel #VE */
> +static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> +{
> +	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> +	return false;
> +}
> +
> +bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
> +{
> +	bool ret;
> +
> +	if (user_mode(regs))
> +		ret = tdx_virt_exception_user(regs, ve);
> +	else
> +		ret = tdx_virt_exception_kernel(regs, ve);
> +
> +	/* After successful #VE handling, move the IP */
> +	if (ret)
> +		regs->ip += ve->instr_len;
> +
> +	return ret;
> +}
> +
>  bool is_tdx_guest(void)
>  {
>  	return tdx_guest_detected;
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index c9d566dcf89a..428504535912 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -61,6 +61,7 @@
>  #include <asm/insn.h>
>  #include <asm/insn-eval.h>
>  #include <asm/vdso.h>
> +#include <asm/tdx.h>
>  
>  #ifdef CONFIG_X86_64
>  #include <asm/x86_init.h>
> @@ -1212,6 +1213,115 @@ DEFINE_IDTENTRY(exc_device_not_available)
>  	}
>  }
>  
> +#ifdef CONFIG_INTEL_TDX_GUEST
> +
> +#define VE_FAULT_STR "VE fault"
> +
> +static void ve_raise_fault(struct pt_regs *regs, long error_code)
> +{
> +	struct task_struct *tsk = current;
> +
> +	if (user_mode(regs)) {
> +		tsk->thread.error_code = error_code;
> +		tsk->thread.trap_nr = X86_TRAP_VE;
> +		show_signal(tsk, SIGSEGV, "", VE_FAULT_STR, regs, error_code);
> +		force_sig(SIGSEGV);
> +		return;
> +	}
> +
> +	/*
> +	 * Attempt to recover from #VE exception failure without
> +	 * triggering OOPS (useful for MSR read/write failures)
> +	 */
> +	if (fixup_exception(regs, X86_TRAP_VE, error_code, 0))
> +		return;
> +
> +	tsk->thread.error_code = error_code;
> +	tsk->thread.trap_nr = X86_TRAP_VE;
> +
> +	/*
> +	 * To be potentially processing a kprobe fault and to trust the result
> +	 * from kprobe_running(), it should be non-preemptible.
> +	 */
> +	if (!preemptible() && kprobe_running() &&
> +	    kprobe_fault_handler(regs, X86_TRAP_VE))
> +		return;
> +
> +	/* Notify about #VE handling failure, useful for debugger hooks */
> +	if (notify_die(DIE_GPF, VE_FAULT_STR, regs, error_code,
> +		       X86_TRAP_VE, SIGSEGV) == NOTIFY_STOP)
> +		return;
> +
> +	/* Trigger OOPS and panic */
> +	die_addr(VE_FAULT_STR, regs, error_code, 0);

This is pretty much a copy of the #GP handling. So why not consolidate
this properly?

--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -559,6 +559,36 @@ static bool fixup_iopl_exception(struct
 	return true;
 }
 
+static bool gp_try_fixup_and_notify(struct pt_regs *regs, int trapnr, long error_code,
+				    const char *str)
+{
+	if (fixup_exception(regs, trapnr, error_code, 0))
+		return true;
+
+	current->thread.error_code = error_code;
+	current->thread.trap_nr = trapnr;
+
+	/*
+	 * To be potentially processing a kprobe fault and to trust the result
+	 * from kprobe_running(), we have to be non-preemptible.
+	 */
+	if (!preemptible() && kprobe_running() &&
+	    kprobe_fault_handler(regs, trapnr))
+		return true;
+
+	return notify_die(DIE_GPF, str, regs, error_code, trapnr,
+			  SIGSEGV) == NOTIFY_STOP;
+}
+
+static void gp_user_force_sig_segv(struct pt_regs *regs, int trapnr, long error_code,
+				   const char *str)
+{
+	current->thread.error_code = error_code;
+	current->thread.trap_nr = trapnr;
+	show_signal(current, SIGSEGV, "", str, regs, error_code);
+	force_sig(SIGSEGV);
+}
+
 DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 {
 	char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
@@ -587,34 +617,14 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_pr
 		if (fixup_iopl_exception(regs))
 			goto exit;
 
-		tsk->thread.error_code = error_code;
-		tsk->thread.trap_nr = X86_TRAP_GP;
-
 		if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0))
 			goto exit;
 
-		show_signal(tsk, SIGSEGV, "", desc, regs, error_code);
-		force_sig(SIGSEGV);
+		gp_user_force_sig_segv(regs, X86_TRAP_GP, error_code, desc);
 		goto exit;
 	}
 
-	if (fixup_exception(regs, X86_TRAP_GP, error_code, 0))
-		goto exit;
-
-	tsk->thread.error_code = error_code;
-	tsk->thread.trap_nr = X86_TRAP_GP;
-
-	/*
-	 * To be potentially processing a kprobe fault and to trust the result
-	 * from kprobe_running(), we have to be non-preemptible.
-	 */
-	if (!preemptible() &&
-	    kprobe_running() &&
-	    kprobe_fault_handler(regs, X86_TRAP_GP))
-		goto exit;
-
-	ret = notify_die(DIE_GPF, desc, regs, error_code, X86_TRAP_GP, SIGSEGV);
-	if (ret == NOTIFY_STOP)
+	if (gp_try_fixup_and_notify(regs, X86_TRAP_GP, error_code, desc))
 		goto exit;
 
 	if (error_code)

which makes this:

static void ve_raise_fault(struct pt_regs *regs, long error_code)
{
	if (user_mode(regs)) {
		gp_user_force_sig_segv(regs, X86_TRAP_VE, error_code, VE_FAULT_STR);
		return;
	}

	if (gp_try_fixup_and_notify(regs, X86_TRAP_VE, error_code, VE_FAULT_STR))
		return;

	die_addr(VE_FAULT_STR, regs, error_code, 0);
}

Hmm?

> +/*
> + * Virtualization Exceptions (#VE) are delivered to TDX guests due to
> + * specific guest actions which may happen in either user space or the
> + * kernel:
> + *
> + *  * Specific instructions (WBINVD, for example)
> + *  * Specific MSR accesses
> + *  * Specific CPUID leaf accesses
> + *  * Access to unmapped pages (EPT violation)
> + *
> + * In the settings that Linux will run in, virtualization exceptions are
> + * never generated on accesses to normal, TD-private memory that has been
> + * accepted.
> + *
> + * Syscall entry code has a critical window where the kernel stack is not
> + * yet set up. Any exception in this window leads to hard to debug issues
> + * and can be exploited for privilege escalation. Exceptions in the NMI
> + * entry code also cause issues. Returning from the exception handler with
> + * IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
> + *
> + * For these reasons, the kernel avoids #VEs during the syscall gap and
> + * the NMI entry code. Entry code paths do not access TD-shared memory,
> + * MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
> + * that might generate #VE.

How is that enforced or validated? What checks for a violation of that
assumption?

Thanks,

        tglx


* Re: [PATCHv2.1 05/29] x86/tdx: Add HLT support for TDX guests
  2022-01-29 22:30     ` [PATCHv2.1 " Kirill A. Shutemov
@ 2022-02-01 21:21       ` Thomas Gleixner
  2022-02-02 12:48         ` Kirill A. Shutemov
  0 siblings, 1 reply; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 21:21 UTC (permalink / raw)
  To: Kirill A. Shutemov, bp
  Cc: aarcange, ak, dan.j.williams, dave.hansen, david, hpa, jgross,
	jmattson, joro, jpoimboe, kirill.shutemov, knsathya,
	linux-kernel, luto, mingo, pbonzini, peterz,
	sathyanarayanan.kuppuswamy, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86

On Sun, Jan 30 2022 at 01:30, Kirill A. Shutemov wrote:
> +/*
> + * Used in __tdx_hypercall() to determine whether to enable interrupts
> + * before issuing TDCALL for the EXIT_REASON_HLT case.
> + */
> +#define ENABLE_IRQS_BEFORE_HLT 0x01
> +
>  /*
>   * __tdx_module_call()  - Used by TDX guests to request services from
>   * the TDX module (does not include VMM services).
> @@ -230,6 +237,30 @@ SYM_FUNC_START(__tdx_hypercall)
>  
>  	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
>  
> +	/*
> +	 * For the idle loop STI needs to be called directly before
> +	 * the TDCALL that enters idle (EXIT_REASON_HLT case). STI
> +	 * instruction enables interrupts only one instruction later.
> +	 * If there is a window between STI and the instruction that
> +	 * emulates the HALT state, there is a chance for interrupts to
> +	 * happen in this window, which can delay the HLT operation
> +	 * indefinitely. Since this is not the desired result,
> +	 * conditionally call STI before TDCALL.
> +	 *
> +	 * Since STI instruction is only required for the idle case
> +	 * (a special case of EXIT_REASON_HLT), use the r15 register
> +	 * value to identify it. Since the R15 register is not used
> +	 * by the VMM as per EXIT_REASON_HLT ABI, re-use it in
> +	 * software to identify the STI case.
> +	 */
> +	cmpl $EXIT_REASON_HLT, %r11d
> +	jne .Lskip_sti
> +	cmpl $ENABLE_IRQS_BEFORE_HLT, %r15d
> +	jne .Lskip_sti
> +	/* Set R15 register to 0, it is unused in EXIT_REASON_HLT case */
> +	xor %r15, %r15
> +	sti
> +.Lskip_sti:
>  	tdcall

This really can be simplified:

        cmpl	$EXIT_REASON_SAFE_HLT, %r11d
        jne	.Lnohalt
        movl	$EXIT_REASON_HLT, %r11d
        sti
.Lnohalt:
	tdcall

and the below becomes:

static bool tdx_halt(void)
{
	return !__tdx_hypercall(EXIT_REASON_HLT, !!irqs_disabled(), 0, 0, 0, NULL);
}

void __cpuidle tdx_safe_halt(void)
{
	if (__tdx_hypercall(EXIT_REASON_SAFE_HLT, 0, 0, 0, 0, NULL))
		WARN_ONCE(1, "HLT instruction emulation failed\n");
}

Hmm?

Thanks,

        tglx
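
Note on the simplification proposed above: the idea is that a software-only
pseudo exit reason replaces the R15 flag, and the entry path rewrites it to
the real HLT reason right before enabling interrupts. A small C model of
that control flow follows. It is only a sketch: the real check would live
in the assembly of __tdx_hypercall(), the numeric value chosen for
EXIT_REASON_SAFE_HLT is invented for this example, and struct vcpu_model
and model_hypercall_entry() are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXIT_REASON_HLT		12	/* value from uapi/asm/vmx.h */
#define EXIT_REASON_SAFE_HLT	0x10012	/* made-up software-only value */

/* Minimal model of the state the entry path touches. */
struct vcpu_model {
	bool irqs_enabled;	/* stands in for RFLAGS.IF */
	uint64_t last_reason;	/* reason the VMM would see in R11 */
};

/*
 * Model of the proposed entry sequence:
 *
 *	cmpl	$EXIT_REASON_SAFE_HLT, %r11d
 *	jne	.Lnohalt
 *	movl	$EXIT_REASON_HLT, %r11d
 *	sti
 * .Lnohalt:
 *	tdcall
 */
static void model_hypercall_entry(struct vcpu_model *v, uint64_t reason)
{
	if (reason == EXIT_REASON_SAFE_HLT) {
		reason = EXIT_REASON_HLT;	/* VMM never sees the pseudo reason */
		v->irqs_enabled = true;		/* sti, directly before tdcall */
	}
	v->last_reason = reason;		/* tdcall */
}
```

The VMM only ever observes EXIT_REASON_HLT; the pseudo reason exists purely
to tell the entry code whether STI has to be issued.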


* Re: [PATCHv2 04/29] x86/traps: Add #VE support for TDX guest
  2022-02-01 21:02   ` Thomas Gleixner
@ 2022-02-01 21:26     ` Sean Christopherson
  2022-02-12  1:42     ` Kirill A. Shutemov
  1 sibling, 0 replies; 154+ messages in thread
From: Sean Christopherson @ 2022-02-01 21:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Sean Christopherson

On Tue, Feb 01, 2022, Thomas Gleixner wrote:
> On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
> > diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> > index df0fa695bb09..1da074123c16 100644
> > --- a/arch/x86/kernel/idt.c
> > +++ b/arch/x86/kernel/idt.c
> > @@ -68,6 +68,9 @@ static const __initconst struct idt_data early_idts[] = {
> >  	 */
> >  	INTG(X86_TRAP_PF,		asm_exc_page_fault),
> >  #endif
> > +#ifdef CONFIG_INTEL_TDX_GUEST
> > +	INTG(X86_TRAP_VE,		asm_exc_virtualization_exception),
> > +#endif
> >  
> > +bool tdx_get_ve_info(struct ve_info *ve)
> > +{
> > +	struct tdx_module_output out;
> > +
> > +	/*
> > +	 * NMIs and machine checks are suppressed. Before this point any
> > +	 * #VE is fatal. After this point (TDGETVEINFO call), NMIs and
> > +	 * additional #VEs are permitted (but it is expected not to
> > +	 * happen unless kernel panics).
> 
> I really do not understand that comment. #NMI and #MC are suppressed
> according to the above. How long are they suppressed and what's the
> mechanism? Are they unblocked on return from __tdx_module_call() ?

TDX_GET_VEINFO is a call into the TDX module to get the data from the #VE info
struct pointed at by the VMCS.  Doing TDX_GET_VEINFO also clears the "valid" flag
in the struct.  It's basically a CMPXCHG on the #VE info struct, except that it
routes through the TDX module.

The TDX module treats virtual NMIs as blocked if the #VE valid flag is set, i.e.
refuses to inject NMI until the guest does TDX_GET_VEINFO to retrieve the info for
the last #VE.

I don't understand the blurb about #MC.  Unless things have changed, the TDX module
doesn't support injecting #MC into the guest.

> What prevents a nested #VE? If it happens what makes it fatal? Is it
> converted to a #DF or detected by software?

A #VE that would occur is morphed to a #DF by the TDX module if the #VE info valid
flag is already set.  But nested #VE should work, so long as the nested #VE happens
after TDX_GET_VEINFO.

> Also I do not understand what the last sentence tries to tell me. If the
> suppression of #NMI and #MC is lifted on return from tdcall then both
> can be delivered immediately afterwards, right?

Yep, NMI can be injected on the instruction following the TDCALL.  

Something like this?
	
	/*
	 * Retrieve the #VE info from the TDX module, which also clears the "#VE
	 * valid" flag.  This must be done before anything else as any #VE that
	 * occurs while the valid flag is set, i.e. before the previous #VE info
	 * was consumed, is morphed to a #DF by the TDX module.  Note, the TDX
	 * module also treats virtual NMIs as inhibited if the #VE valid flag is
	 * set, e.g. so that NMI=>#VE will not result in a #DF.
	 */
 
> I assume the additional #VE is triggered by software or a bug in the
> kernel.

I'm curious if that will even hold true, there's sooo much stuff that can happen
from NMI context.  I don't see much value in speculating what will/won't happen
after retrieving the #VE info.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 06/29] x86/tdx: Add MSR support for TDX guests
  2022-01-24 15:01 ` [PATCHv2 06/29] x86/tdx: Add MSR " Kirill A. Shutemov
@ 2022-02-01 21:38   ` Thomas Gleixner
  2022-02-02 13:06     ` Kirill A. Shutemov
  0 siblings, 1 reply; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 21:38 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
> +static bool tdx_read_msr(unsigned int msr, u64 *val)
> +{
> +	struct tdx_hypercall_output out;
> +
> +	/*
> +	 * Emulate the MSR read via hypercall. More info about ABI
> +	 * can be found in TDX Guest-Host-Communication Interface
> +	 * (GHCI), sec titled "TDG.VP.VMCALL<Instruction.RDMSR>".
> +	 */
> +	if (_tdx_hypercall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out))
> +		return false;
> +
> +	*val = out.r11;
> +
> +	return true;
> +}
> +
> +static bool tdx_write_msr(unsigned int msr, unsigned int low,
> +			       unsigned int high)
> +{
> +	u64 ret;
> +
> +	/*
> +	 * Emulate the MSR write via hypercall. More info about ABI
> +	 * can be found in TDX Guest-Host-Communication Interface
> +	 * (GHCI) sec titled "TDG.VP.VMCALL<Instruction.WRMSR>".
> +	 */
> +	ret = _tdx_hypercall(EXIT_REASON_MSR_WRITE, msr, (u64)high << 32 | low,
> +			     0, 0, NULL);
> +
> +	return ret ? false : true;
> +}
> +
>  bool tdx_get_ve_info(struct ve_info *ve)
>  {
>  	struct tdx_module_output out;
> @@ -132,11 +165,22 @@ static bool tdx_virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
>  static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
>  {
>  	bool ret = false;
> +	u64 val;
>  
>  	switch (ve->exit_reason) {
>  	case EXIT_REASON_HLT:
>  		ret = tdx_halt();
>  		break;
> +	case EXIT_REASON_MSR_READ:
> +		ret = tdx_read_msr(regs->cx, &val);
> +		if (ret) {
> +			regs->ax = lower_32_bits(val);
> +			regs->dx = upper_32_bits(val);
> +		}
> +		break;

Why here?

static bool tdx_read_msr(struct pt_regs *regs)
{
	struct tdx_hypercall_output out;

	/*
	 * Emulate the MSR read via hypercall. More info about ABI
	 * can be found in TDX Guest-Host-Communication Interface
	 * (GHCI), sec titled "TDG.VP.VMCALL<Instruction.RDMSR>".
	 */
	if (_tdx_hypercall(EXIT_REASON_MSR_READ, regs->cx, 0, 0, 0, &out))
		return false;

	regs->ax = lower_32_bits(out.r11);
	regs->dx = upper_32_bits(out.r11);
	return true;
}

and

static bool tdx_write_msr(struct pt_regs *regs)
{
	/*
	 * Emulate the MSR write via hypercall. More info about ABI
	 * can be found in TDX Guest-Host-Communication Interface
	 * (GHCI) sec titled "TDG.VP.VMCALL<Instruction.WRMSR>".
	 */
	return !_tdx_hypercall(EXIT_REASON_MSR_WRITE, regs->cx,
			       (u64)regs->dx << 32 | regs->ax,
			       0, 0, NULL);
}

Also the switch case can be simplified as the only action after 'break;'
is 'return ret':

	switch (ve->exit_reason) {
	case EXIT_REASON_HLT:
		return tdx_halt();
	case EXIT_REASON_MSR_READ:
		return tdx_read_msr(regs);
	case EXIT_REASON_MSR_WRITE:
		return tdx_write_msr(regs);
        default:
                ....

Hmm?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 07/29] x86/tdx: Handle CPUID via #VE
  2022-01-24 15:01 ` [PATCHv2 07/29] x86/tdx: Handle CPUID via #VE Kirill A. Shutemov
@ 2022-02-01 21:39   ` Thomas Gleixner
  0 siblings, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 21:39 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
> +static bool tdx_handle_cpuid(struct pt_regs *regs)
> +{
> +	struct tdx_hypercall_output out;
> +
> +	/*
> +	 * Emulate the CPUID instruction via a hypercall. More info about
> +	 * ABI can be found in TDX Guest-Host-Communication Interface
> +	 * (GHCI), section titled "VP.VMCALL<Instruction.CPUID>".
> +	 */
> +	if (_tdx_hypercall(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0, &out))
> +		return false;
> +
> +	/*
> +	 * As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of
> +	 * EAX, EBX, ECX, EDX registers after the CPUID instruction execution.
> +	 * So copy the register contents back to pt_regs.
> +	 */
> +	regs->ax = out.r12;
> +	regs->bx = out.r13;
> +	regs->cx = out.r14;
> +	regs->dx = out.r15;
> +
> +	return true;
> +}

Ack.

>  bool tdx_get_ve_info(struct ve_info *ve)
>  {
>  	struct tdx_module_output out;
> @@ -157,8 +182,18 @@ bool tdx_get_ve_info(struct ve_info *ve)
>   */
>  static bool tdx_virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
>  {
> -	pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> -	return false;
> +	bool ret = false;
> +
> +	switch (ve->exit_reason) {
> +	case EXIT_REASON_CPUID:
> +		ret = tdx_handle_cpuid(regs);
> +		break;

Comment about ret and break applies accordingly.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 08/29] x86/tdx: Handle in-kernel MMIO
  2022-01-24 15:01 ` [PATCHv2 08/29] x86/tdx: Handle in-kernel MMIO Kirill A. Shutemov
  2022-01-24 19:30   ` Josh Poimboeuf
  2022-01-24 22:40   ` Dave Hansen
@ 2022-02-01 22:30   ` Thomas Gleixner
  2 siblings, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 22:30 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
>  
> +static bool tdx_mmio(int size, bool write, unsigned long addr,
> +		     unsigned long *val)
> +{
> +	struct tdx_hypercall_output out;
> +	u64 err;
> +
> +	err = _tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
> +			     addr, *val, &out);

What's the purpose of storing *val as an argument for reads?

> +	if (err)
> +		return true;
> +
> +	*val = out.r11;
> +	return false;

Why is this writing back unconditionally for writes?

> +
>  bool tdx_get_ve_info(struct ve_info *ve)
>  {
>  	struct tdx_module_output out;
> @@ -219,6 +327,12 @@ static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
>  	case EXIT_REASON_CPUID:
>  		ret = tdx_handle_cpuid(regs);
>  		break;
> +	case EXIT_REASON_EPT_VIOLATION:
> +		ve->instr_len = tdx_handle_mmio(regs, ve);
> +		ret = ve->instr_len > 0;

I agree with Josh here. This is just wrong. Why return the instr_len
as an error/success indicator? That's a horrible idea, simply
because the "error value", which is <= 0, gets converted to a boolean
return value.

So what's wrong with doing the obvious here

	case EXIT_REASON_EPT_VIOLATION:
		return tdx_handle_mmio(regs, ve);

and have the handler function set ve->instr_len?

Also, instead of having this not really helpful tdx_mmio() helper, just
implement read and write separately:

static bool tdx_mmio_read(int size, unsigned long addr, unsigned long *val)
{
	struct tdx_hypercall_output out;

	if (_tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, EPT_READ,
			   addr, 0, &out))
		return false;

	*val = out.r11;
	return true;
}

static bool tdx_mmio_write(int size, unsigned long addr, unsigned long val)
{
	return !_tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, EPT_WRITE,
			       addr, val, NULL);
}

The return value is consistent with all the other handling functions
here: they return a boolean true for success, which makes the main
handler consistent with the rest.

static bool tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
{
	char buffer[MAX_INSN_SIZE];
	unsigned long *reg, val;
	struct insn insn = {};
	int size, extend_size;
	enum mmio_type mmio;
        u8 extend_val = 0;
	bool ret;

	if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
		return false;

	if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
		return false;

	mmio = insn_decode_mmio(&insn, &size);
	if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
		return false;

	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
		reg = insn_get_modrm_reg_ptr(&insn, regs);
		if (!reg)
			return false;
	}

	ve->instr_len = insn.length;

	switch (mmio) {
	case MMIO_WRITE:
		memcpy(&val, reg, size);
                return tdx_mmio_write(size, ve->gpa, val);
	case MMIO_WRITE_IMM:
		val = insn.immediate.value;
                return tdx_mmio_write(size, ve->gpa, val);
	case MMIO_READ:
	case MMIO_READ_ZERO_EXTEND:
	case MMIO_READ_SIGN_EXTEND:
        	break;
	case MMIO_MOVS:
	case MMIO_DECODE_FAILED:
		return false;
	}

        /* Handle reads */
	if (!tdx_mmio_read(size, ve->gpa, &val))
		return false;

	switch (mmio) {
	case MMIO_READ:
		/* Zero-extend for 32-bit operation */
		extend_size = size == 4 ? sizeof(*reg) : 0;
                break;
	case MMIO_READ_ZERO_EXTEND:
		/* Zero extend based on operand size */
		extend_size = insn.opnd_bytes;
                break;
	case MMIO_READ_SIGN_EXTEND:
		/* Sign extend based on operand size */
		extend_size = insn.opnd_bytes;
                if (size == 1 && val & BIT(7))
                	extend_val = 0xFF;
                else if (size > 1 && val & BIT(15))
                	extend_val = 0xFF;
		break;
	default:
        	BUG();
	}

        if (extend_size)
		memset(reg, extend_val, extend_size);
        memcpy(reg, &val, size);
	return true;
}
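
The final memset()/memcpy() step of the handler above can be exercised on its
own in user space. This is a sketch under stated assumptions: the helper name
mmio_store_result() is made up, the host is assumed to be little-endian (as
x86 is), and the sign bit is taken from the top bit of the size-byte value,
which generalises the BIT(7)/BIT(15) checks rather than copying them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Store a 'size'-byte MMIO read result into a 64-bit register image.
 * 'extend_size' is the number of low register bytes that get overwritten
 * with the fill pattern first; 0 means only 'size' bytes are touched.
 */
static void mmio_store_result(uint64_t *reg, uint64_t val, int size,
			      int extend_size, bool sign_extend)
{
	uint8_t fill = 0;

	assert(size >= 1 && size <= 8);
	assert(extend_size >= 0 && extend_size <= 8);

	/* Sign extension replicates the top bit of the 'size'-byte value. */
	if (sign_extend && ((val >> (size * 8 - 1)) & 1))
		fill = 0xFF;

	if (extend_size)
		memset(reg, fill, (size_t)extend_size);
	memcpy(reg, &val, (size_t)size);
}
```

Filling first and copying the payload second keeps the helper free of
per-size special cases: the payload simply overwrites the low bytes of
whatever fill pattern was laid down.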

Hmm?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 09/29] x86/tdx: Detect TDX at early kernel decompression time
  2022-01-24 15:01 ` [PATCHv2 09/29] x86/tdx: Detect TDX at early kernel decompression time Kirill A. Shutemov
  2022-02-01 18:30   ` Borislav Petkov
@ 2022-02-01 22:33   ` Thomas Gleixner
  1 sibling, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 22:33 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov

On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
> +++ b/arch/x86/boot/compressed/tdx.c
> @@ -0,0 +1,29 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tdx.c - Early boot code for TDX

Please get rid of this filename reference here. It's pointless and stale
once this file is renamed.

> index 000000000000..12bede46d048
> --- /dev/null
> +++ b/arch/x86/include/asm/shared/tdx.h
> @@ -0,0 +1,7 @@

Lacks an SPDX identifier.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 10/29] x86: Consolidate port I/O helpers
  2022-01-24 15:01 ` [PATCHv2 10/29] x86: Consolidate port I/O helpers Kirill A. Shutemov
@ 2022-02-01 22:36   ` Thomas Gleixner
  0 siblings, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 22:36 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
> There are two implementations of port I/O helpers: one in the kernel and
> one in the boot stub.
>
> Move the helpers required for both to <asm/shared/io.h> and use the one
> implementation everywhere.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 11/29] x86/boot: Allow to hook up alternative port I/O helpers
  2022-01-24 15:01 ` [PATCHv2 11/29] x86/boot: Allow to hook up alternative " Kirill A. Shutemov
  2022-02-01 19:02   ` Borislav Petkov
@ 2022-02-01 22:39   ` Thomas Gleixner
  2022-02-01 22:53     ` Thomas Gleixner
  1 sibling, 1 reply; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 22:39 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:

> Port I/O instructions trigger #VE in the TDX environment. In response to
> the exception, kernel emulates these instructions using hypercalls.
>
> But during early boot, on the decompression stage, it is cumbersome to
> deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE
> handling.
>
> Add a way to hook up alternative port I/O helpers in the boot stub.
> All port I/O operations are routed via 'pio_ops'. By default 'pio_ops'
> initialized with native port I/O implementations.
>
> This is a preparation patch. The next patch will override 'pio_ops' if
> the kernel booted in the TDX environment.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Aside from Borislav's comments:

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 11/29] x86/boot: Allow to hook up alternative port I/O helpers
  2022-02-01 22:39   ` Thomas Gleixner
@ 2022-02-01 22:53     ` Thomas Gleixner
  2022-02-02 17:20       ` Kirill A. Shutemov
  0 siblings, 1 reply; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 22:53 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

On Tue, Feb 01 2022 at 23:39, Thomas Gleixner wrote:

> On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
>
>> Port I/O instructions trigger #VE in the TDX environment. In response to
>> the exception, kernel emulates these instructions using hypercalls.
>>
>> But during early boot, on the decompression stage, it is cumbersome to
>> deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE
>> handling.
>>
>> Add a way to hook up alternative port I/O helpers in the boot stub.
>> All port I/O operations are routed via 'pio_ops'. By default 'pio_ops'
>> initialized with native port I/O implementations.
>>
>> This is a preparation patch. The next patch will override 'pio_ops' if
>> the kernel booted in the TDX environment.
>>
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>
> Aside from Borislav's comments:
>
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

Second thoughts.

> +#include <asm/shared/io.h>
> +
> +struct port_io_ops {
> +	unsigned char (*inb)(int port);
> +	unsigned short (*inw)(int port);
> +	unsigned int (*inl)(int port);
> +	void (*outb)(unsigned char v, int port);
> +	void (*outw)(unsigned short v, int port);
> +	void (*outl)(unsigned int v, int port);
> +};

Can we please make that u8, u16, u32 instead of unsigned char, short, int?

That has been the kernel convention for hardware-related functions for
many years now.
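
For illustration, here is a user-space sketch of what the ops table could look
like with fixed-width value types, together with the "hook up an alternative
backend" idea from the changelog. Everything except the struct layout is an
assumption: port_space[], the native_*() and alt_*() helpers and
use_alt_port_io() are invented for the example, and an array stands in for
real port I/O.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t  u8;
typedef uint16_t u16;
typedef uint32_t u32;

/* Same shape as the quoted struct, with fixed-width value types. */
struct port_io_ops {
	u8   (*inb)(int port);
	u16  (*inw)(int port);
	u32  (*inl)(int port);
	void (*outb)(u8 v, int port);
	void (*outw)(u16 v, int port);
	void (*outl)(u32 v, int port);
};

static u32 port_space[8];	/* fake "hardware", indexed by port & 7 */
static unsigned int alt_calls;	/* trips through the alternative backend */

/* Default backend: stands in for the native in/out instructions. */
static u8   native_inb(int port)         { return (u8)port_space[port & 7]; }
static u16  native_inw(int port)         { return (u16)port_space[port & 7]; }
static u32  native_inl(int port)         { return port_space[port & 7]; }
static void native_outb(u8 v, int port)  { port_space[port & 7] = v; }
static void native_outw(u16 v, int port) { port_space[port & 7] = v; }
static void native_outl(u32 v, int port) { port_space[port & 7] = v; }

/* Alternative backend: every access funnels through one counted helper. */
static u32 alt_io(int write, int port, u32 v)
{
	alt_calls++;
	if (write)
		port_space[port & 7] = v;
	return port_space[port & 7];
}

static u8   alt_inb(int port)         { return (u8)alt_io(0, port, 0); }
static u16  alt_inw(int port)         { return (u16)alt_io(0, port, 0); }
static u32  alt_inl(int port)         { return alt_io(0, port, 0); }
static void alt_outb(u8 v, int port)  { alt_io(1, port, v); }
static void alt_outw(u16 v, int port) { alt_io(1, port, v); }
static void alt_outl(u32 v, int port) { alt_io(1, port, v); }

/* By default the table points at the native helpers. */
static struct port_io_ops pio_ops = {
	.inb  = native_inb,  .inw  = native_inw,  .inl  = native_inl,
	.outb = native_outb, .outw = native_outw, .outl = native_outl,
};

/* Swap in the alternative helpers in one assignment. */
static void use_alt_port_io(void)
{
	pio_ops = (struct port_io_ops){
		.inb  = alt_inb,  .inw  = alt_inw,  .inl  = alt_inl,
		.outb = alt_outb, .outw = alt_outw, .outl = alt_outl,
	};
	assert(pio_ops.inb && pio_ops.outl);
}
```

Callers only ever go through pio_ops, so switching backends does not touch any
call site.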

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 12/29] x86/boot/compressed: Support TDX guest port I/O at decompression time
  2022-01-24 15:01 ` [PATCHv2 12/29] x86/boot/compressed: Support TDX guest port I/O at decompression time Kirill A. Shutemov
@ 2022-02-01 22:55   ` Thomas Gleixner
  0 siblings, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 22:55 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
> +static inline unsigned int tdx_io_in(int size, int port)
> +{
> +	struct tdx_hypercall_output out;
> +
> +	__tdx_hypercall(TDX_HYPERCALL_STANDARD, EXIT_REASON_IO_INSTRUCTION,
> +			size, 0, port, 0, &out);
> +
> +	return out.r10 ? UINT_MAX : out.r11;
> +}
> +
> +static inline void tdx_io_out(int size, int port, u64 value)
> +{
> +	struct tdx_hypercall_output out;
> +
> +	__tdx_hypercall(TDX_HYPERCALL_STANDARD, EXIT_REASON_IO_INSTRUCTION,
> +			size, 1, port, value, &out);
> +}
> +
> +static inline unsigned char tdx_inb(int port)
> +{
> +	return tdx_io_in(1, port);
> +}
> +
> +static inline unsigned short tdx_inw(int port)
> +{
> +	return tdx_io_in(2, port);
> +}
> +
> +static inline unsigned int tdx_inl(int port)
> +{
> +	return tdx_io_in(4, port);
> +}
> +
> +static inline void tdx_outb(unsigned char value, int port)
> +{
> +	tdx_io_out(1, port, value);
> +}
> +
> +static inline void tdx_outw(unsigned short value, int port)
> +{
> +	tdx_io_out(2, port, value);
> +}
> +
> +static inline void tdx_outl(unsigned int value, int port)
> +{
> +	tdx_io_out(4, port, value);
> +}

Looks good, but the u8, u16, u32 comment obviously applies here as well.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 13/29] x86/tdx: Add port I/O emulation
  2022-01-24 15:01 ` [PATCHv2 13/29] x86/tdx: Add port I/O emulation Kirill A. Shutemov
@ 2022-02-01 23:01   ` Thomas Gleixner
  2022-02-02  6:22   ` Borislav Petkov
  1 sibling, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 23:01 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov

On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
>  static bool intel_cc_platform_has(enum cc_attr attr)
>  {
> +	if (attr == CC_ATTR_GUEST_UNROLL_STRING_IO)
> +		return true;
> +

switch (attr) perhaps as there are more coming, right?

>  	return false;
>  }

> +/*
> + * Emulate I/O using hypercall.
> + *
> + * Assumes the IO instruction was using ax, which is enforced
> + * by the standard io.h macros.
> + *
> + * Return True on success or False on failure.
> + */
> +static bool tdx_handle_io(struct pt_regs *regs, u32 exit_qual)
> +{
> +	struct tdx_hypercall_output out;
> +	int size, port, ret;
> +	u64 mask;
> +	bool in;
> +
> +	if (VE_IS_IO_STRING(exit_qual))
> +		return false;
> +
> +	in   = VE_IS_IO_IN(exit_qual);
> +	size = VE_GET_IO_SIZE(exit_qual);
> +	port = VE_GET_PORT_NUM(exit_qual);
> +	mask = GENMASK(BITS_PER_BYTE * size, 0);
> +
> +	/*
> +	 * Emulate the I/O read/write via hypercall. More info about
> +	 * ABI can be found in TDX Guest-Host-Communication Interface
> +	 * (GHCI) sec titled "TDG.VP.VMCALL<Instruction.IO>".
> +	 */
> +	ret = _tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, !in, port,
> +			     in ? 0 : regs->ax, &out);
> +	if (!in)
> +		return !ret;
> +
> +	regs->ax &= ~mask;
> +	regs->ax |= ret ? UINT_MAX : out.r11 & mask;
> +
> +	return !ret;
> +}
> +
>  bool tdx_get_ve_info(struct ve_info *ve)
>  {
>  	struct tdx_module_output out;
> @@ -333,6 +378,9 @@ static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
>  		if (!ret)
>  			pr_warn_once("MMIO failed\n");
>  		break;
> +	case EXIT_REASON_IO_INSTRUCTION:
> +		ret = tdx_handle_io(regs, ve->exit_qual);

                return ...

Other than that LGTM.
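
The register update for the input case in the quoted tdx_handle_io() can be
modelled in user space as follows. This is a sketch, not the kernel code:
io_mask() and io_merge_in() are made-up names, and masking the all-ones
failure value to the access size is a choice of the sketch. The mask covers
exactly 8 * size bits; since the kernel's GENMASK(h, l) includes bit h, the
equivalent there would be GENMASK(BITS_PER_BYTE * size - 1, 0).

```c
#include <assert.h>
#include <stdint.h>

/* Mask covering exactly 'size' bytes; port I/O sizes are 1, 2 or 4. */
static uint64_t io_mask(int size)
{
	assert(size == 1 || size == 2 || size == 4);
	return (1ULL << (8 * size)) - 1;
}

/*
 * Merge the result of an emulated IN into the old ax image. Only the low
 * 'size' bytes change; on failure those bytes read back as all ones.
 */
static uint64_t io_merge_in(uint64_t ax, uint64_t result, int size, int failed)
{
	uint64_t mask = io_mask(size);

	ax &= ~mask;
	ax |= (failed ? UINT32_MAX : result) & mask;
	return ax;
}
```

For a one-byte read, for example, only the lowest byte of the ax image is
replaced and the remaining seven bytes are preserved.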

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 14/29] x86/tdx: Early boot handling of port I/O
  2022-01-24 15:02 ` [PATCHv2 14/29] x86/tdx: Early boot handling of port I/O Kirill A. Shutemov
@ 2022-02-01 23:02   ` Thomas Gleixner
  2022-02-02 10:09   ` Borislav Petkov
  1 sibling, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 23:02 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov

On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
> This early handler enables the use of normal in*/out* macros without
> patching them for every driver. Since there is no expectation that
> early port I/O is performance-critical, the #VE emulation cost is worth
> the simplicity benefit of not patching the port I/O usage in early
> code. There are also no concerns with nesting, since there should be
> no NMIs or interrupts this early.
>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 15/29] x86/tdx: Wire up KVM hypercalls
  2022-01-24 15:02 ` [PATCHv2 15/29] x86/tdx: Wire up KVM hypercalls Kirill A. Shutemov
@ 2022-02-01 23:05   ` Thomas Gleixner
  0 siblings, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 23:05 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov

On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
> Among other things, KVM hypercall is used to send IPIs.
>
> Since the KVM driver can be built as a kernel module, export
> tdx_kvm_hypercall() to make the symbols visible to kvm.ko.
>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 16/29] x86/boot: Add a trampoline for booting APs via firmware handoff
  2022-01-24 15:02 ` [PATCHv2 16/29] x86/boot: Add a trampoline for booting APs via firmware handoff Kirill A. Shutemov
@ 2022-02-01 23:06   ` Thomas Gleixner
  2022-02-02 11:27   ` Borislav Petkov
  1 sibling, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 23:06 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Sean Christopherson, Kai Huang, Kirill A . Shutemov

On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:

> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Historically, x86 platforms have booted secondary processors (APs)
> using INIT followed by the start up IPI (SIPI) messages. In regular
> VMs, this boot sequence is supported by the VMM emulation. But such a
> wakeup model is fatal for secure VMs like TDX in which VMM is an
> untrusted entity. To address this issue, a new wakeup model was added
> in ACPI v6.4, in which firmware (like TDX virtual BIOS) will help boot
> the APs. More details about this wakeup model can be found in ACPI
> specification v6.4, the section titled "Multiprocessor Wakeup Structure".
>
> Since the existing trampoline code requires processors to boot in real
> mode with 16-bit addressing, it will not work for this wakeup model
> (because it boots the AP in 64-bit mode). To handle it, extend the
> trampoline code to support 64-bit mode firmware handoff. Also, extend
> IDT and GDT pointers to support 64-bit mode hand off.
>
> There is no TDX-specific detection for this new boot method. The kernel
> will rely on it as the sole boot method whenever the new ACPI structure
> is present.
>
> The ACPI table parser for the MADT multiprocessor wake up structure and
> the wakeup method that uses this structure will be added by the following
> patch in this series.
>
> Reported-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 01/29] x86/tdx: Detect running as a TDX guest in early boot
  2022-02-01 19:29   ` Thomas Gleixner
@ 2022-02-01 23:14     ` Kirill A. Shutemov
  2022-02-03  0:32       ` Josh Poimboeuf
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-01 23:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, bp, dave.hansen, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86, linux-kernel

On Tue, Feb 01, 2022 at 08:29:55PM +0100, Thomas Gleixner wrote:
> Kirill,
> 
> On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
> 
> Just a nitpick...
> 
> > +static bool tdx_guest_detected __ro_after_init;
> > +
> > +bool is_tdx_guest(void)
> > +{
> > +	return tdx_guest_detected;
> > +}
> > +
> > +void __init tdx_early_init(void)
> > +{
> > +	u32 eax, sig[3];
> > +
> > +	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2],  &sig[1]);
> > +
> > +	if (memcmp(TDX_IDENT, sig, 12))
> > +		return;
> > +
> > +	tdx_guest_detected = true;
> > +
> > +	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
> 
> So with that we have two ways to detect a TDX guest:
> 
>    - tdx_guest_detected
> 
>    - X86_FEATURE_TDX_GUEST
> 
> Shouldn't X86_FEATURE_TDX_GUEST be good enough?

Right. We have only 3 callers of is_tdx_guest() in cc_platform.c
I will replace them with cpu_feature_enabled(X86_FEATURE_TDX_GUEST).
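
As a side note on the detection itself, the signature comparison in the quoted
tdx_early_init() can be reproduced in user space. This is a sketch with
assumptions: the identifier string below is what TDX_IDENT is assumed to
expand to (its definition is not part of the quoted hunk), the function name
is made up, and a little-endian host is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed value of TDX_IDENT: 12 characters, padded with spaces. */
#define MODEL_TDX_IDENT "IntelTDX    "

static_assert(sizeof(MODEL_TDX_IDENT) - 1 == 12, "signature is 12 bytes");

/* Compare the CPUID register values against the 12-byte signature. */
static bool model_is_tdx_signature(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
	uint32_t sig[3];

	/* Same slot order as the quoted cpuid_count() call: EBX, EDX, ECX. */
	sig[0] = ebx;
	sig[1] = edx;
	sig[2] = ecx;

	return memcmp(MODEL_TDX_IDENT, sig, sizeof(sig)) == 0;
}
```

The EBX, EDX, ECX ordering is why the quoted code passes &sig[2] for ECX and
&sig[1] for EDX.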

Thanks.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 17/29] x86/acpi, x86/boot: Add multiprocessor wake-up support
  2022-01-24 15:02 ` [PATCHv2 17/29] x86/acpi, x86/boot: Add multiprocessor wake-up support Kirill A. Shutemov
@ 2022-02-01 23:27   ` Thomas Gleixner
  2022-02-05 12:37     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-01 23:27 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Sean Christopherson, Rafael J . Wysocki, Kirill A . Shutemov

On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
> +#ifdef CONFIG_X86_64
> +/* Physical address of the Multiprocessor Wakeup Structure mailbox */
> +static u64 acpi_mp_wake_mailbox_paddr;
> +/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
> +static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
> +/* Lock to protect mailbox (acpi_mp_wake_mailbox) from parallel access */
> +static DEFINE_SPINLOCK(mailbox_lock);
> +#endif
> +
>  #ifdef CONFIG_X86_IO_APIC
>  /*
>   * Locks related to IOAPIC hotplug
> @@ -336,6 +345,80 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e
>  	return 0;
>  }
>  
> +#ifdef CONFIG_X86_64
> +/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
> +static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
> +{
> +	static physid_mask_t apic_id_wakemap = PHYSID_MASK_NONE;
> +	unsigned long flags;
> +	u8 timeout;
> +
> +	/* Remap mailbox memory only for the first call to acpi_wakeup_cpu() */
> +	if (physids_empty(apic_id_wakemap)) {
> +		acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
> +						sizeof(*acpi_mp_wake_mailbox),
> +						MEMREMAP_WB);
> +	}
> +
> +	/*
> +	 * According to the ACPI specification r6.4, section titled
> +	 * "Multiprocessor Wakeup Structure" the mailbox-based wakeup
> +	 * mechanism cannot be used more than once for the same CPU.
> +	 * Skip wakeups if they are attempted more than once.
> +	 */
> +	if (physid_isset(apicid, apic_id_wakemap)) {
> +		pr_err("CPU already awake (APIC ID %x), skipping wakeup\n",
> +		       apicid);
> +		return -EINVAL;
> +	}
> +
> +	spin_lock_irqsave(&mailbox_lock, flags);

What's the reason that interrupts need to be disabled here? The comment
above this invocation is not really informative ...

> +	/*
> +	 * Mailbox memory is shared between firmware and OS. Firmware will
> +	 * listen on mailbox command address, and once it receives the wakeup
> +	 * command, CPU associated with the given apicid will be booted.
> +	 *
> +	 * The value of apic_id and wakeup_vector has to be set before updating
> +	 * the wakeup command. To let compiler preserve order of writes, use
> +	 * smp_store_release.

What? If the only purpose is to tell the compiler to preserve code
ordering then why are you using smp_store_release() here?
smp_store_release() is way more than that...

> +	 */
> +	smp_store_release(&acpi_mp_wake_mailbox->apic_id, apicid);
> +	smp_store_release(&acpi_mp_wake_mailbox->wakeup_vector, start_ip);
> +	smp_store_release(&acpi_mp_wake_mailbox->command,
> +			  ACPI_MP_WAKE_COMMAND_WAKEUP);
> +
> +	/*
> +	 * After writing the wakeup command, wait for maximum timeout of 0xFF
> +	 * for firmware to reset the command address back to zero to indicate
> +	 * the successful reception of command.
> +	 * NOTE: 0xFF as timeout value is decided based on our experiments.
> +	 *
> +	 * XXX: Change the timeout once ACPI specification comes up with
> +	 *      standard maximum timeout value.
> +	 */
> +	timeout = 0xFF;
> +	while (READ_ONCE(acpi_mp_wake_mailbox->command) && --timeout)
> +		cpu_relax();
> +
> +	/* If timed out (timeout == 0), return error */
> +	if (!timeout) {

So this leaves a stale acpi_mp_wake_mailbox->command. What checks that
acpi_mp_wake_mailbox->command is 0 on the next invocation?

Aside from that, assume the timeout happens and the firmware acts after
this has returned. Then you have inconsistent state as well. Error
handling is not trivial, but making it hope based is the worst kind.
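
One possible shape of a less hope-based variant, as a userspace sketch
only: check that the command field is idle before posting, and never
reuse the mailbox once a wakeup has timed out. Everything here
(fake_mailbox, fake_wakeup(), mailbox_poisoned, the spin count) is
hypothetical, and memory ordering is deliberately ignored.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the mailbox; NOT the ACPI-defined layout. */
struct fake_mailbox {
	volatile uint16_t command;
	uint32_t apic_id;
	uint64_t wakeup_vector;
};

#define FAKE_CMD_NOOP   0
#define FAKE_CMD_WAKEUP 1

/* Hypothetical flag: set once a wakeup timed out, never cleared. */
static int mailbox_poisoned;

static int fake_wakeup(struct fake_mailbox *mb, uint32_t apicid,
		       uint64_t start_ip, unsigned int spins)
{
	/* Refuse to post on top of a stale or possibly in-flight command. */
	if (mailbox_poisoned || mb->command != FAKE_CMD_NOOP)
		return -EBUSY;

	mb->apic_id = apicid;
	mb->wakeup_vector = start_ip;
	mb->command = FAKE_CMD_WAKEUP;

	/* Wait for the other side to consume the command. */
	while (mb->command != FAKE_CMD_NOOP && spins)
		spins--;

	if (mb->command != FAKE_CMD_NOOP) {
		/* The other side may still act later: stop trusting the mailbox. */
		mailbox_poisoned = 1;
		return -EIO;
	}
	return 0;
}
```

With nobody clearing the command, the first call returns -EIO and every
later call returns -EBUSY instead of overwriting a pending request.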

> +		spin_unlock_irqrestore(&mailbox_lock, flags);
> +		return -EIO;
> +	}
> +
> +	/*
> +	 * If the CPU wakeup process is successful, store the
> +	 * status in apic_id_wakemap to prevent re-wakeup
> +	 * requests.
> +	 */
> +	physid_set(apicid, apic_id_wakemap);
> +
> +	spin_unlock_irqrestore(&mailbox_lock, flags);
> +
> +	return 0;
> +}
> +#endif
>  #endif				/*CONFIG_X86_LOCAL_APIC */

Thanks,

        tglx


* Re: [PATCHv2 18/29] x86/boot: Avoid #VE during boot for TDX platforms
  2022-01-24 15:02 ` [PATCHv2 18/29] x86/boot: Avoid #VE during boot for TDX platforms Kirill A. Shutemov
@ 2022-02-02  0:04   ` Thomas Gleixner
  2022-02-11 16:13     ` Kirill A. Shutemov
  0 siblings, 1 reply; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-02  0:04 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov

On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
>
> Change the common boot code to work on TDX and non-TDX systems.
> This should have no functional effect on non-TDX systems.

Emphasis on should? :)

> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -643,12 +643,25 @@ SYM_CODE_START(trampoline_32bit_src)
>  	movl	$MSR_EFER, %ecx
>  	rdmsr
>  	btsl	$_EFER_LME, %eax
> +	/* Avoid writing EFER if no change was made (for TDX guest) */
> +	jc	1f
>  	wrmsr
> -	popl	%edx
> +1:	popl	%edx
>  	popl	%ecx
>  
>  	/* Enable PAE and LA57 (if required) paging modes */

This comment should move after the #endif below and here it wants a
comment which explains why reading cr4 and the following code sequence
is correct. If you write up that comment then you'll figure out that it
is incorrect.

> -	movl	$X86_CR4_PAE, %eax
> +	movl	%cr4, %eax

Assume CR4 has X86_CR4_MCE set, then how is the below correct when
CONFIG_X86_MCE=n? Not to mention any other bits which might be set in
CR4 and are only cleared by the CONFIG_X86_MCE dependent 'andl'.

> +#ifdef CONFIG_X86_MCE
> +	/*
> +	 * Preserve CR4.MCE if the kernel will enable #MC support.  Clearing
> +	 * MCE may fault in some environments (that also force #MC support).
> +	 * Any machine check that occurs before #MC support is fully configured
> +	 * will crash the system regardless of the CR4.MCE value set here.
> +	 */
> +	andl	$X86_CR4_MCE, %eax
> +#endif

So this wants to be

#ifdef CONFIG_X86_MCE
	movl	%cr4, %eax
	andl	$X86_CR4_MCE, %eax
#else
	movl	$0, %eax
#endif

No?
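
The difference can be modelled in plain C, as a sketch only. The
compile-time CONFIG_X86_MCE switch is replaced by a runtime flag
(mce_enabled) and the function names are made up. The CR4 bit positions
are the architectural ones.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_PAE	(1u << 5)
#define X86_CR4_MCE	(1u << 6)
#define X86_CR4_LA57	(1u << 12)

/* Models the posted code: the mask is applied only if MCE is configured. */
static uint32_t cr4_as_posted(uint32_t cur, bool mce_enabled, bool la57)
{
	uint32_t v = cur;

	if (mce_enabled)
		v &= X86_CR4_MCE;
	v |= X86_CR4_PAE;
	if (la57)
		v |= X86_CR4_LA57;
	return v;
}

/* Models the suggestion: start from zero if MCE is not configured. */
static uint32_t cr4_as_suggested(uint32_t cur, bool mce_enabled, bool la57)
{
	uint32_t v = mce_enabled ? (cur & X86_CR4_MCE) : 0;

	v |= X86_CR4_PAE;
	if (la57)
		v |= X86_CR4_LA57;
	return v;
}
```

With mce_enabled == false, cr4_as_posted() carries every bit of the
incoming CR4 value into the result, while cr4_as_suggested() yields only
PAE (plus LA57 when requested).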

> +	orl	$X86_CR4_PAE, %eax
>  	testl	%edx, %edx
>  	jz	1f
>  	orl	$X86_CR4_LA57, %eax
> @@ -662,8 +675,12 @@ SYM_CODE_START(trampoline_32bit_src)
>  	pushl	$__KERNEL_CS
>  	pushl	%eax
>  
> -	/* Enable paging again */
> -	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
> +	/*
> +	 * Enable paging again.  Keep CR0.NE set, FERR# is no longer used
> +	 * to handle x87 FPU errors and clearing NE may fault in some
> +	 * environments.

"FERR# is no longer used" is really not informative here. The point is
that any x86 CPU which is supported by the kernel requires CR0_NE to be
set. This code was wrong from the very beginning because 64bit CPUs
never supported FERR#. The reason why it exists is Copy&Pasta without
brain applied and the sad fact that the hardware does not enforce it in
native mode for whatever reason. So this wants to be a separate patch
with a coherent comment and changelog.

> +	 */
> +	movl	$(X86_CR0_PG | X86_CR0_NE | X86_CR0_PE), %eax
>  	movl	%eax, %cr0
>  
>  	/* Enable PAE mode, PGE and LA57 */
> -	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
> +	movq	%cr4, %rcx

See above...

> +#ifdef CONFIG_X86_MCE
> +	/*
> +	 * Preserve CR4.MCE if the kernel will enable #MC support.  Clearing
> +	 * MCE may fault in some environments (that also force #MC support).
> +	 * Any machine check that occurs before #MC support is fully configured
> +	 * will crash the system regardless of the CR4.MCE value set here.
> +	 */
> +	andl	$X86_CR4_MCE, %ecx
> +#endif
> +	orl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
>  #ifdef CONFIG_X86_5LEVEL
>  	testl	$1, __pgtable_l5_enabled(%rip)
>  	jz	1f
> @@ -246,13 +256,23 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
>  	/* Setup EFER (Extended Feature Enable Register) */
>  	movl	$MSR_EFER, %ecx
>  	rdmsr
> +	/*
> +	 * Preserve current value of EFER for comparison and to skip
> +	 * EFER writes if no change was made (for TDX guest)
> +	 */
> +	movl    %eax, %edx
>  	btsl	$_EFER_SCE, %eax	/* Enable System Call */
>  	btl	$20,%edi		/* No Execute supported? */
>  	jnc     1f
>  	btsl	$_EFER_NX, %eax
>  	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
> -1:	wrmsr				/* Make changes effective */
>  
> +	/* Avoid writing EFER if no change was made (for TDX guest) */
> +1:	cmpl	%edx, %eax
> +	je	1f
> +	xor	%edx, %edx
> +	wrmsr				/* Make changes effective */
> +1:
>  	/* Setup cr0 */
>  	movl	$CR0_STATE, %eax
>  	/* Make changes effective */
> diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
> index ae112a91592f..170f248d5769 100644
> --- a/arch/x86/realmode/rm/trampoline_64.S
> +++ b/arch/x86/realmode/rm/trampoline_64.S
> @@ -143,13 +143,28 @@ SYM_CODE_START(startup_32)
>  	movl	%eax, %cr3
>  
>  	# Set up EFER
> +	movl	$MSR_EFER, %ecx
> +	rdmsr
> +	/*
> +	 * Skip writing to EFER if the register already has desired
> +	 * value (to avoid #VE for the TDX guest).
> +	 */
> +	cmp	pa_tr_efer, %eax
> +	jne	.Lwrite_efer
> +	cmp	pa_tr_efer + 4, %edx
> +	je	.Ldone_efer
> +.Lwrite_efer:
>  	movl	pa_tr_efer, %eax
>  	movl	pa_tr_efer + 4, %edx
> -	movl	$MSR_EFER, %ecx
>  	wrmsr
>  
> -	# Enable paging and in turn activate Long Mode
> -	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
> +.Ldone_efer:
> +	/*
> +	 * Enable paging and in turn activate Long Mode. Keep CR0.NE set, FERR#
> +	 * is no longer used to handle x87 FPU errors and clearing NE may fault
> +	 * in some environments.
> +	 */
> +	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_NE | X86_CR0_PE), %eax

See above.

>  	movl	%eax, %cr0
>  
>  	/*
> @@ -169,7 +184,11 @@ SYM_CODE_START(pa_trampoline_compat)
>  	movl	$rm_stack_end, %esp
>  	movw	$__KERNEL_DS, %dx
>  
> -	movl	$X86_CR0_PE, %eax
> +	/*
> +	 * Keep CR0.NE set, FERR# is no longer used to handle x87 FPU errors
> +	 * and clearing NE may fault in some environments.
> +	 */
> +	movl	$(X86_CR0_NE | X86_CR0_PE), %eax

Ditto.

>  	movl	%eax, %cr0
>  	ljmpl   $__KERNEL32_CS, $pa_startup_32
>  SYM_CODE_END(pa_trampoline_compat)

Thanks,

        tglx


* Re: [PATCHv2 19/29] x86/topology: Disable CPU online/offline control for TDX guests
  2022-01-24 15:02 ` [PATCHv2 19/29] x86/topology: Disable CPU online/offline control for TDX guests Kirill A. Shutemov
@ 2022-02-02  0:09   ` Thomas Gleixner
  2022-02-02  0:11     ` Thomas Gleixner
  0 siblings, 1 reply; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-02  0:09 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov

On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
>  static bool intel_cc_platform_has(enum cc_attr attr)
>  {
> -	if (attr == CC_ATTR_GUEST_UNROLL_STRING_IO)
> +	switch (attr) {
> +	case CC_ATTR_GUEST_UNROLL_STRING_IO:
> +	case CC_ATTR_HOTPLUG_DISABLED:
>  		return true;
> +	default:
> +		return false;
> +	}
>  
>  	return false;

Sigh. If 'default:' returns false then this final return cannot be
reached, no?
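
For reference, a standalone sketch of the switch without the dead
trailing return. The enum is a made-up subset for illustration, not the
kernel's enum cc_attr.

```c
#include <stdbool.h>

/* Hypothetical subset of attributes, for illustration only. */
enum fake_cc_attr {
	FAKE_ATTR_GUEST_UNROLL_STRING_IO,
	FAKE_ATTR_HOTPLUG_DISABLED,
	FAKE_ATTR_OTHER,
};

static bool fake_cc_platform_has(enum fake_cc_attr attr)
{
	/* Every path returns inside the switch; nothing follows it. */
	switch (attr) {
	case FAKE_ATTR_GUEST_UNROLL_STRING_IO:
	case FAKE_ATTR_HOTPLUG_DISABLED:
		return true;
	default:
		return false;
	}
}
```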

Thanks,

        tglx


* Re: [PATCHv2 19/29] x86/topology: Disable CPU online/offline control for TDX guests
  2022-02-02  0:09   ` Thomas Gleixner
@ 2022-02-02  0:11     ` Thomas Gleixner
  2022-02-03 15:00       ` Borislav Petkov
  0 siblings, 1 reply; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-02  0:11 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov

On Wed, Feb 02 2022 at 01:09, Thomas Gleixner wrote:

> On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
>>  static bool intel_cc_platform_has(enum cc_attr attr)
>>  {
>> -	if (attr == CC_ATTR_GUEST_UNROLL_STRING_IO)
>> +	switch (attr) {
>> +	case CC_ATTR_GUEST_UNROLL_STRING_IO:
>> +	case CC_ATTR_HOTPLUG_DISABLED:

Not that I care much, but I faintly remember that I suggested that in
one of the gazillion of threads.


* Re: [PATCHv2 20/29] x86/tdx: Get page shared bit info from the TDX module
  2022-01-24 15:02 ` [PATCHv2 20/29] x86/tdx: Get page shared bit info from the TDX module Kirill A. Shutemov
@ 2022-02-02  0:14   ` Thomas Gleixner
  2022-02-07 22:27     ` Sean Christopherson
  2022-02-07 10:44   ` Borislav Petkov
  1 sibling, 1 reply; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-02  0:14 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
> +static void tdx_get_info(void)
> +{
> +	struct tdx_module_output out;
> +	u64 ret;
> +
> +	/*
> +	 * TDINFO TDX module call is used to get the TD execution environment
> +	 * information like GPA width, number of available vcpus, debug mode
> +	 * information, etc. More details about the ABI can be found in TDX
> +	 * Guest-Host-Communication Interface (GHCI), sec 2.4.2 TDCALL
> +	 * [TDG.VP.INFO].
> +	 */
> +	ret = __tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);
> +
> +	/* Non zero return value indicates buggy TDX module, so panic */

Can you please get rid of these useless comments all over the place. The
panic() message tells the same story. Please document the non-obvious
things.

Thanks,

        tglx


* Re: [PATCHv2 21/29] x86/tdx: Exclude shared bit from __PHYSICAL_MASK
  2022-01-24 15:02 ` [PATCHv2 21/29] x86/tdx: Exclude shared bit from __PHYSICAL_MASK Kirill A. Shutemov
@ 2022-02-02  0:18   ` Thomas Gleixner
  0 siblings, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-02  0:18 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:

> In TDX guests, by default memory is protected from host access. If a
> guest needs to communicate with the VMM (like the I/O use case), it uses
> a single bit in the physical address to communicate the protected/shared
> attribute of the given page.
>
> In the x86 ARCH code, __PHYSICAL_MASK macro represents the width of the
> physical address in the given architecture. It is used in creating
> physical PAGE_MASK for address bits in the kernel. Since in TDX guest,
> a single bit is used as metadata, it needs to be excluded from valid
> physical address bits to avoid using incorrect addresses bits in the
> kernel.
>
> Enable DYNAMIC_PHYSICAL_MASK to support updating the __PHYSICAL_MASK.
>
> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Impressive....

> ---
>  arch/x86/Kconfig      | 1 +
>  arch/x86/kernel/tdx.c | 8 ++++++++
>  2 files changed, 9 insertions(+)

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>


* Re: [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-01-24 15:02 ` [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap() Kirill A. Shutemov
@ 2022-02-02  0:25   ` Thomas Gleixner
  2022-02-02 19:27     ` Kirill A. Shutemov
  2022-02-07 16:27   ` Borislav Petkov
  1 sibling, 1 reply; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-02  0:25 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:

> In TDX guests, guest memory is protected from host access. If a guest
> performs I/O, it needs to explicitly share the I/O memory with the host.
>
> Make all ioremap()ed pages that are not backed by normal memory
> (IORES_DESC_NONE or IORES_DESC_RESERVED) mapped as shared.
>
> Since TDX memory encryption support is similar to AMD SEV architecture,
> reuse the infrastructure from AMD SEV code.
>
> Add tdx_shared_mask() interface to get the TDX guest shared bitmask.
>
> pgprot_decrypted() is used by drivers (i915, virtio_gpu, vfio). Export
> both pgprot_encrypted() and pgprot_decrypted().

How so?

# git grep pgprot_encrypted
arch/x86/include/asm/pgtable.h:#define pgprot_encrypted(prot)   __pgprot(__sme_set(pgprot_val(prot)))
arch/x86/mm/ioremap.c:          prot = pgprot_encrypted(prot);
arch/x86/mm/ioremap.c:  return encrypted_prot ? pgprot_encrypted(prot)
arch/x86/mm/mem_encrypt_amd.c:          protection_map[i] = pgprot_encrypted(protection_map[i]);
arch/x86/mm/pat/set_memory.c:           cpa.mask_clr = pgprot_encrypted(cpa.mask_clr);
arch/x86/platform/efi/quirks.c:                           pgprot_val(pgprot_encrypted(FIXMAP_PAGE_NORMAL)));
fs/proc/vmcore.c:       prot = pgprot_encrypted(prot);
include/linux/pgtable.h:#ifndef pgprot_encrypted
include/linux/pgtable.h:#define pgprot_encrypted(prot)  (prot)

I cannot find any of the above mentioned subsystems in this grep
output. Neither does this patch add any users which require those
exports.

Thanks,

        tglx


* Re: [PATCHv2 23/29] x86/tdx: Add helper to convert memory between shared and private
  2022-01-24 15:02 ` [PATCHv2 23/29] x86/tdx: Add helper to convert memory between shared and private Kirill A. Shutemov
@ 2022-02-02  0:35   ` Thomas Gleixner
  2022-02-08 12:12   ` Borislav Petkov
  1 sibling, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-02  0:35 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
> Intel TDX protects guest memory from VMM access. Any memory that is
> required for communication with the VMM must be explicitly shared.
>
> It is a two-step process: the guest sets the shared bit in the page
> table entry and notifies VMM about the change. The notification happens
> using MapGPA hypercall.
>
> Conversion back to private memory requires clearing the shared bit,
> notifying VMM with MapGPA hypercall following with accepting the memory
> with AcceptPage hypercall.
>
> Provide a helper to do conversion between shared and private memory.
> It is going to be used by the following patch.

Strike that last sentence...

> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -13,6 +13,10 @@
>  /* TDX module Call Leaf IDs */
>  #define TDX_GET_INFO			1
>  #define TDX_GET_VEINFO			3
> +#define TDX_ACCEPT_PAGE			6
> +
> +/* TDX hypercall Leaf IDs */
> +#define TDVMCALL_MAP_GPA		0x10001
>  
>  /* See Exit Qualification for I/O Instructions in VMX documentation */
>  #define VE_IS_IO_IN(exit_qual)		(((exit_qual) & 8) ? 1 : 0)
> @@ -97,6 +101,80 @@ static void tdx_get_info(void)
>  	td_info.attributes = out.rdx;
>  }
>  
> +static bool tdx_accept_page(phys_addr_t gpa, enum pg_level pg_level)
> +{
> +	/*
> +	 * Pass the page physical address to the TDX module to accept the
> +	 * pending, private page.
> +	 *
> +	 * Bits 2:0 of GPA encode page size: 0 - 4K, 1 - 2M, 2 - 1G.
> +	 */
> +	switch (pg_level) {
> +	case PG_LEVEL_4K:
> +		break;
> +	case PG_LEVEL_2M:
> +		gpa |= 1;
> +		break;
> +	case PG_LEVEL_1G:
> +		gpa |= 2;
> +		break;
> +	default:
> +		return true;

Crack. boolean return true means success. Can we please keep this
convention straight throughout the code and not as you see fit?

This random choice of return code meanings is just a recipe for
disaster. Consistency matters.
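
A sketch of the same helper with "true means success" kept throughout,
in userspace C. The module call is mocked (fake_module_call() always
returns 0) and all names are hypothetical. Only the bits 2:0 page size
encoding is taken from the quoted comment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical page levels; the last one has no valid encoding. */
enum fake_pg_level {
	FAKE_PG_LEVEL_4K = 1,
	FAKE_PG_LEVEL_2M,
	FAKE_PG_LEVEL_1G,
	FAKE_PG_LEVEL_512G,
};

/* Mock of the module call: returns 0 on success. */
static uint64_t fake_module_call(uint64_t arg)
{
	(void)arg;
	return 0;
}

/* Bits 2:0 of the argument encode the page size: 0 - 4K, 1 - 2M, 2 - 1G. */
static bool fake_encode_gpa(uint64_t gpa, enum fake_pg_level level,
			    uint64_t *out)
{
	switch (level) {
	case FAKE_PG_LEVEL_4K:
		*out = gpa;
		return true;
	case FAKE_PG_LEVEL_2M:
		*out = gpa | 1;
		return true;
	case FAKE_PG_LEVEL_1G:
		*out = gpa | 2;
		return true;
	default:
		return false;	/* unsupported level is a failure */
	}
}

/* Returns true on success, false on any failure. */
static bool fake_accept_page(uint64_t gpa, enum fake_pg_level level)
{
	uint64_t arg;

	if (!fake_encode_gpa(gpa, level, &arg))
		return false;
	return fake_module_call(arg) == 0;
}
```

Callers can then write "if (!fake_accept_page(...)) return -EIO;" and the
unsupported-level case fails the same way as a failed module call.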

> +	}
> +
> +	return __tdx_module_call(TDX_ACCEPT_PAGE, gpa, 0, 0, 0, NULL);
> +}
> +
> +/*
> + * Inform the VMM of the guest's intent for this physical page: shared with
> + * the VMM or private to the guest.  The VMM is expected to change its mapping
> + * of the page in response.
> + */
> +int tdx_hcall_request_gpa_type(phys_addr_t start, phys_addr_t end, bool enc)
> +{
> +	u64 ret;
> +
> +	if (end <= start)
> +		return -EINVAL;
> +
> +	if (!enc) {
> +		start |= tdx_shared_mask();
> +		end |= tdx_shared_mask();
> +	}
> +
> +	/*
> +	 * Notify the VMM about page mapping conversion. More info about ABI
> +	 * can be found in TDX Guest-Host-Communication Interface (GHCI),
> +	 * sec "TDG.VP.VMCALL<MapGPA>"
> +	 */
> +	ret = _tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0, NULL);
> +
> +	if (ret)
> +		ret = -EIO;
> +
> +	if (ret || !enc)
> +		return ret;
> +
> +	/*
> +	 * For shared->private conversion, accept the page using
> +	 * TDX_ACCEPT_PAGE TDX module call.
> +	 */
> +	while (start < end) {
> +		/* Try 2M page accept first if possible */

Talking about consistency:

tdx_accept_page() implements 1G maps, but they are not required to be
handled here for some random reason, right?

> +		if (!(start & ~PMD_MASK) && end - start >= PMD_SIZE &&
> +		    !tdx_accept_page(start, PG_LEVEL_2M)) {
> +			start += PMD_SIZE;
> +			continue;
> +		}
> +
> +		if (tdx_accept_page(start, PG_LEVEL_4K))
> +			return -EIO;
> +		start += PAGE_SIZE;
> +	}
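
For comparison, a userspace sketch of an accept loop that tries every
page size the helper can encode, largest first, and falls back to the
next smaller one. mock_accept() always succeeds and just counts calls.
It and accept_range() are hypothetical, and whether 1G accepts are
worthwhile here is not something this sketch can decide.

```c
#include <stdbool.h>
#include <stdint.h>

#define FAKE_SZ_4K	(1ULL << 12)
#define FAKE_SZ_2M	(1ULL << 21)
#define FAKE_SZ_1G	(1ULL << 30)

/* Calls per size index: 0 - 1G, 1 - 2M, 2 - 4K. */
static unsigned int accepted[3];

/* Mock accept: always succeeds, records which size was used. */
static bool mock_accept(uint64_t gpa, unsigned int idx)
{
	(void)gpa;
	accepted[idx]++;
	return true;
}

/* Returns true if the whole range [start, end) was accepted. */
static bool accept_range(uint64_t start, uint64_t end)
{
	static const uint64_t sizes[] = { FAKE_SZ_1G, FAKE_SZ_2M, FAKE_SZ_4K };

	while (start < end) {
		bool done = false;
		unsigned int i;

		for (i = 0; i < 3; i++) {
			uint64_t sz = sizes[i];

			/* Skip sizes that are misaligned or do not fit. */
			if ((start & (sz - 1)) || end - start < sz)
				continue;
			if (mock_accept(start, i)) {
				start += sz;
				done = true;
				break;
			}
		}
		if (!done)
			return false;
	}
	return true;
}
```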

Thanks,

        tglx


* Re: [PATCHv2 24/29] x86/mm/cpa: Add support for TDX shared memory
  2022-01-24 15:02 ` [PATCHv2 24/29] x86/mm/cpa: Add support for TDX shared memory Kirill A. Shutemov
@ 2022-02-02  1:27   ` Thomas Gleixner
  0 siblings, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-02  1:27 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov, Sean Christopherson, Kai Huang

On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
> -void notify_range_enc_status_changed(unsigned long vaddr, int npages, bool enc)
> +int amd_notify_range_enc_status_changed(unsigned long vaddr, int npages,
> +					 bool enc)
>  {
>  #ifdef CONFIG_PARAVIRT
>  	unsigned long sz = npages << PAGE_SHIFT;
> @@ -270,7 +271,7 @@ void notify_range_enc_status_changed(unsigned long vaddr, int npages, bool enc)
>  		kpte = lookup_address(vaddr, &level);
>  		if (!kpte || pte_none(*kpte)) {
>  			WARN_ONCE(1, "kpte lookup for vaddr\n");
> -			return;
> +			return 0;
>  		}
>  
>  		pfn = pg_level_to_pfn(level, kpte, NULL);
> @@ -285,6 +286,7 @@ void notify_range_enc_status_changed(unsigned long vaddr, int npages, bool enc)
>  		vaddr = (vaddr & pmask) + psize;
>  	}
>  #endif
> +	return 0;
>  }

This is obviously a preparatory change, so please split it out into a
separate patch. You know the drill already, right?

> +static pgprot_t pgprot_cc_mask(bool enc)
> +{
> +	if (enc)
> +		return pgprot_encrypted(__pgprot(0));
> +	else
> +		return pgprot_decrypted(__pgprot(0));
> +}

How is this relevant to the scope of this TDX patch? Why is this not
part of the previous change which consolidated __pgprot(_PAGE_ENC)?

Just because it is too obvious to fixup the usage sites first?

> +static int notify_range_enc_status_changed(unsigned long vaddr, int npages,
> +					   bool enc)
> +{
> +	if (cc_platform_has(CC_ATTR_GUEST_TDX)) {
> +		phys_addr_t start = __pa(vaddr);
> +		phys_addr_t end = __pa(vaddr + npages * PAGE_SIZE);
> +
> +		return tdx_hcall_request_gpa_type(start, end, enc);
> +	} else {
> +		return amd_notify_range_enc_status_changed(vaddr, npages, enc);
> +	}

This is more than lame, really. The existing SEV specific
notify_range_enc_status_changed() function has been called
unconditionally, but for TDX you add a cc_platform_has() check and still
call the AMD part unconditionally if !TDX.

Aside of that the two functions have different calling conventions. Why?

Just because the TDX function which you defined requires physical
addresses this needs to be part of the PAT code?

Make both functions share the same calling conventions and think hard
about whether cc_platform_has() is the proper mechanism. There are other
means to handle such things. Hint: x86_platform_ops

> +}
> +
>  /*
>   * __set_memory_enc_pgtable() is used for the hypervisors that get
>   * informed about "encryption" status via page tables.
> @@ -1999,8 +2021,10 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>  	memset(&cpa, 0, sizeof(cpa));
>  	cpa.vaddr = &addr;
>  	cpa.numpages = numpages;
> -	cpa.mask_set = enc ? __pgprot(_PAGE_ENC) : __pgprot(0);
> -	cpa.mask_clr = enc ? __pgprot(0) : __pgprot(_PAGE_ENC);
> +
> +	cpa.mask_set = pgprot_cc_mask(enc);
> +	cpa.mask_clr = pgprot_cc_mask(!enc);
> +
>  	cpa.pgd = init_mm.pgd;
>  
>  	/* Must avoid aliasing mappings in the highmem code */
> @@ -2008,9 +2032,17 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>  	vm_unmap_aliases();
>  
>  	/*
> -	 * Before changing the encryption attribute, we need to flush caches.
> +	 * Before changing the encryption attribute, flush caches.
> +	 *
> +	 * For TDX, guest is responsible for flushing caches on private->shared
> +	 * transition. VMM is responsible for flushing on shared->private.
>  	 */
> -	cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
> +	if (cc_platform_has(CC_ATTR_GUEST_TDX)) {
> +		if (!enc)
> +			cpa_flush(&cpa, 1);
> +	} else {
> +		cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
> +	}

This is the point where my disgust tolerance ends. Seriously. Is that
all you can come up with? Slapping this kind of conditionals all over
the place?

Again. Think hard about the right abstraction for this and not about how
to duct tape it into the existing code. Just because cc_platform_has()
exists does not mean it's the only tool which can be used. Not
everything is a nail...

This screams for an indirect branch, e.g. some extension to the existing
x86_platform_ops.

It's trivial enough to add a (encrypted) guest specific data structure
with relevant operations to x86_platform_ops and fill that in on
detection like we do for other things. Then the whole muck here boils
down to:

-	notify_range_enc_status_changed(addr, numpages, enc);
-
+	if (!ret)
+		ret = x86_platform.guest.enc_status_changed(addr, numpages, enc);

and

-	cpa_flush(&cpa, !this_cpu_has(X86_FEATURE_SME_COHERENT));
+	cpa_flush(&cpa, x86_platform.guest.enc_flush_required(enc));

Hmm?

Feel free to come up with better names...
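
A userspace sketch of that indirection, to make the shape concrete. The
struct, the default and TDX-flavoured callbacks and fake_tdx_init() are
all hypothetical. Only the two callback names are taken from the
snippets above.

```c
#include <stdbool.h>

/* Hypothetical guest ops, modelled on the two callbacks named above. */
struct fake_guest_ops {
	int  (*enc_status_changed)(unsigned long vaddr, int npages, bool enc);
	bool (*enc_flush_required)(bool enc);
};

/* Defaults: nothing to notify, no flush needed. */
static int default_status_changed(unsigned long vaddr, int npages, bool enc)
{
	(void)vaddr; (void)npages; (void)enc;
	return 0;
}

static bool default_flush_required(bool enc)
{
	(void)enc;
	return false;
}

/* TDX flavour: the guest flushes only on private->shared (enc == false). */
static bool tdx_flush_required(bool enc)
{
	return !enc;
}

static struct fake_guest_ops fake_guest = {
	.enc_status_changed = default_status_changed,
	.enc_flush_required = default_flush_required,
};

/* Filled in once, at detection time. */
static void fake_tdx_init(void)
{
	fake_guest.enc_flush_required = tdx_flush_required;
}

/* The call site needs no platform conditionals. */
static int fake_set_memory_enc(unsigned long addr, int numpages, bool enc,
			       bool *flushed)
{
	*flushed = fake_guest.enc_flush_required(enc);
	return fake_guest.enc_status_changed(addr, numpages, enc);
}
```

The platform-specific decisions live in the callbacks, so adding another
guest type means filling in the struct, not touching the call site.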

Thanks,

        tglx


* Re: [PATCHv2 26/29] x86/tdx: ioapic: Add shared bit for IOAPIC base address
  2022-01-24 15:02 ` [PATCHv2 26/29] x86/tdx: ioapic: Add shared bit for IOAPIC base address Kirill A. Shutemov
@ 2022-02-02  1:33   ` Thomas Gleixner
  2022-02-04 22:09     ` Yamahata, Isaku
  2022-02-04 22:31     ` Kirill A. Shutemov
  0 siblings, 2 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-02  1:33 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Isaku Yamahata, Kirill A . Shutemov

On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:

> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> The kernel interacts with each bare-metal IOAPIC with a special
> MMIO page. When running under KVM, the guest's IOAPICs are
> emulated by KVM.
>
> When running as a TDX guest, the guest needs to mark each IOAPIC
> mapping as "shared" with the host.  This ensures that TDX private
> protections are not applied to the page, which allows the TDX host
> emulation to work.
>
> Earlier patches in this series modified ioremap() so that

The concept of earlier patches does not exist.

> ioremap()-created mappings such as virtio will be marked as
> shared. However, the IOAPIC code does not use ioremap() and instead
> uses the fixmap mechanism.
>
> Introduce a special fixmap helper just for the IOAPIC code.  Ensure
> that it marks IOAPIC pages as "shared".  This replaces
> set_fixmap_nocache() with __set_fixmap() since __set_fixmap()
> allows custom 'prot' values.

Why is this a TDX only issue and SEV does not suffer from that?

Thanks,

        tglx


* Re: [PATCHv2 28/29] x86/tdx: Warn about unexpected WBINVD
  2022-01-24 15:02 ` [PATCHv2 28/29] x86/tdx: Warn about unexpected WBINVD Kirill A. Shutemov
@ 2022-02-02  1:46   ` Thomas Gleixner
  2022-02-04 21:35     ` Kirill A. Shutemov
  0 siblings, 1 reply; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-02  1:46 UTC (permalink / raw)
  To: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A. Shutemov

On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:

> WBINVD causes #VE in TDX guests. There's no reliable way to emulate it.
> The kernel can ask for VMM assistance, but VMM is untrusted and can ignore
> the request.
>
> Fortunately, there is no use case for WBINVD inside TDX guests.

If there is no use case, then why

> Warn about any unexpected WBINVD.

instead of terminating the whole thing?

I'm tired of the 'let us emit a warning in the hope it gets fixed'
thinking.

That's just wrong. Any code which has an assumption that it relies on
WBINVD to work correctly has to be analysed and not ignored on the
assumption that there is no use case for WBINVD inside TDX guests.

It's simply wishful thinking that stuff gets fixed because of a
WARN_ONCE(). This has never worked. The only thing which works is to
make stuff fail hard or slow it down in a way which makes it annoying
enough to users to complain.

This is new technology. Anything which wants to use it has to obey
the rules of this new technology. Just define it to be: WBINVD is
forbidden. End of story.

The Intel approach of 'Let us tolerate all sins of the past' has been
proven to be wrong, broken and outright dangerous in the past. So why
are you insisting to proliferate that?

Thanks,

        tglx




* Re: [PATCHv2 03/29] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-02-01 19:58   ` Thomas Gleixner
@ 2022-02-02  2:55     ` Kirill A. Shutemov
  2022-02-02 10:59       ` Kai Huang
  2022-02-02 17:08       ` Thomas Gleixner
  0 siblings, 2 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-02  2:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, bp, dave.hansen, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86, linux-kernel, Kai Huang

On Tue, Feb 01, 2022 at 08:58:59PM +0100, Thomas Gleixner wrote:
> And unsurprisingly this function and __seamcall of the other patch set
> are very similar aside of the calling convention (__seamcall has a
> struct for the input parameters) and the obvious difference that one
> issues TDCALL and the other SEAMCALL.
> 
> So can we please have _one_ implementation and the same struct(s) for
> the module call which is exactly the same for host and guest except for
> the instruction used.
> 
> IOW, this begs a macro implementation
> 
> .macro TDX_MODULE_CALL host:req
> 
>        ....
> 
>         .if \host
>         seamcall
>         .else
> 	tdcall
>         .endif
> 
>        ....
> 
> So the actual functions become:
> 
> SYM_FUNC_START(__tdx_module_call)
>         FRAME_BEGIN
>         TDX_MODULE_CALL host=0
>         FRAME_END
>         ret
> SYM_FUNC_END(__tdx_module_call)
> 
> SYM_FUNC_START(__tdx_seam_call)
>         FRAME_BEGIN
>         TDX_MODULE_CALL host=1
>         FRAME_END
>         ret
> SYM_FUNC_END(__tdx_seam_call)
> 
> Hmm?
> 
> > +/*
> > + * Wrapper for standard use of __tdx_hypercall with panic report
> > + * for TDCALL error.
> > + */
> > +static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
> > +				 u64 r15, struct tdx_hypercall_output *out)
> 
> This begs the question whether having a struct hypercall_input similar
> to the way how seamcall input parameters are implemented makes more
> sense than 7 function arguments. Hmm?

Okay, below is my take on addressing feedback for both __tdx_module_call()
and __tdx_hypercall().

It is a fixup for the whole patchset. It has to be folded accordingly. I
wanted to check if it works and see if I understand your request correctly.

__tdx_module_call() is now implemented by including tdxcall.S and using
the macro defined there. Host side of TDX can do the same on their side.
TDX_MODULE_* offsets are now outside of CONFIG_INTEL_TDX_GUEST and can be
used by both host and guest.

I changed __tdx_hypercall() to take a single argument with a struct pointer
that is used for both input and output.

Is it the right direction? Or did I misunderstand something?
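
The calling convention, reduced to a userspace sketch: one struct carries
the input registers in and the output registers back. fake_hypercall() is
a mock that answers every port read with 0x5a, and the FAKE_* constants
are arbitrary placeholder values, not the real ABI numbers.

```c
#include <limits.h>
#include <stdint.h>

/* Same idea as struct tdx_hypercall_args: used for input and output. */
struct fake_hypercall_args {
	uint64_t r10;
	uint64_t r11;
	uint64_t r12;
	uint64_t r13;
	uint64_t r14;
	uint64_t r15;
};

#define FAKE_HYPERCALL_STANDARD	0	/* placeholder value */
#define FAKE_EXIT_REASON_IO	30	/* placeholder value */

/* Mock VMM: status goes back in r10, read data in r11. */
static uint64_t fake_hypercall(struct fake_hypercall_args *args)
{
	if (args->r11 != FAKE_EXIT_REASON_IO) {
		args->r10 = 1;		/* unknown request: report failure */
		return 1;
	}
	if (args->r13 == 0)		/* direction 0 is a port read */
		args->r11 = 0x5a;
	args->r10 = 0;
	return 0;
}

static unsigned int fake_io_in(int size, int port)
{
	struct fake_hypercall_args args = {
		.r10 = FAKE_HYPERCALL_STANDARD,
		.r11 = FAKE_EXIT_REASON_IO,
		.r12 = size,
		.r13 = 0,
		.r14 = port,
	};

	fake_hypercall(&args);

	/* After the call the same struct holds the results. */
	return args.r10 ? UINT_MAX : (unsigned int)args.r11;
}
```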

diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
index f2e1449c74cd..3ff676379947 100644
--- a/arch/x86/boot/compressed/tdx.c
+++ b/arch/x86/boot/compressed/tdx.c
@@ -21,20 +21,31 @@ bool early_is_tdx_guest(void)
 
 static inline unsigned int tdx_io_in(int size, int port)
 {
-	struct tdx_hypercall_output out;
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = EXIT_REASON_IO_INSTRUCTION,
+		.r12 = size,
+		.r13  = 0,
+		.r14 = port,
+	};
 
-	__tdx_hypercall(TDX_HYPERCALL_STANDARD, EXIT_REASON_IO_INSTRUCTION,
-			size, 0, port, 0, &out);
+	__tdx_hypercall(&args);
 
-	return out.r10 ? UINT_MAX : out.r11;
+	return args.r10 ? UINT_MAX : args.r11;
 }
 
 static inline void tdx_io_out(int size, int port, u64 value)
 {
-	struct tdx_hypercall_output out;
-
-	__tdx_hypercall(TDX_HYPERCALL_STANDARD, EXIT_REASON_IO_INSTRUCTION,
-			size, 1, port, value, &out);
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = EXIT_REASON_IO_INSTRUCTION,
+		.r12 = size,
+		.r13  = 1,
+		.r14 = port,
+		.r15 = value,
+	};
+
+	__tdx_hypercall(&args);
 }
 
 static inline unsigned char tdx_inb(int port)
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 4a0218bedc75..ce06002346a3 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -9,7 +9,7 @@
  * This is a software only structure and not part of the TDX
  * module/VMM ABI.
  */
-struct tdx_hypercall_output {
+struct tdx_hypercall_args {
 	u64 r10;
 	u64 r11;
 	u64 r12;
@@ -24,7 +24,6 @@ struct tdx_hypercall_output {
 #define TDX_IDENT		"IntelTDX    "
 
 /* Used to request services from the VMM */
-u64 __tdx_hypercall(u64 type, u64 fn, u64 r12, u64 r13, u64 r14,
-		    u64 r15, struct tdx_hypercall_output *out);
+u64 __tdx_hypercall(struct tdx_hypercall_args *out);
 
 #endif
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 8a3c6b34be7d..0b465e7d0a2f 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -66,9 +66,7 @@ static void __used common(void)
 	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
 #endif
 
-#ifdef CONFIG_INTEL_TDX_GUEST
 	BLANK();
-	/* Offset for fields in tdx_module_output */
 	OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
 	OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
 	OFFSET(TDX_MODULE_r8,  tdx_module_output, r8);
@@ -76,13 +74,14 @@ static void __used common(void)
 	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
 	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
 
-	/* Offset for fields in tdx_hypercall_output */
-	OFFSET(TDX_HYPERCALL_r10, tdx_hypercall_output, r10);
-	OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_output, r11);
-	OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_output, r12);
-	OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_output, r13);
-	OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_output, r14);
-	OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_output, r15);
+#ifdef CONFIG_INTEL_TDX_GUEST
+	BLANK();
+	OFFSET(TDX_HYPERCALL_r10, tdx_hypercall_args, r10);
+	OFFSET(TDX_HYPERCALL_r11, tdx_hypercall_args, r11);
+	OFFSET(TDX_HYPERCALL_r12, tdx_hypercall_args, r12);
+	OFFSET(TDX_HYPERCALL_r13, tdx_hypercall_args, r13);
+	OFFSET(TDX_HYPERCALL_r14, tdx_hypercall_args, r14);
+	OFFSET(TDX_HYPERCALL_r15, tdx_hypercall_args, r15);
 #endif
 
 	BLANK();
diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index ae74da33ccc6..4db79fbbb857 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -9,6 +9,8 @@
 #include <linux/bits.h>
 #include <linux/errno.h>
 
+#include "tdxcall.S"
+
 /*
  * Bitmasks of exposed registers (with VMM).
  */
@@ -19,9 +21,6 @@
 #define TDX_R14		BIT(14)
 #define TDX_R15		BIT(15)
 
-/* Frame offset + 8 (for arg1) */
-#define ARG7_SP_OFFSET		(FRAME_OFFSET + 0x08)
-
 /*
  * These registers are clobbered to hold arguments for each
  * TDVMCALL. They are safe to expose to the VMM.
@@ -33,13 +32,6 @@
 					  TDX_R12 | TDX_R13 | \
 					  TDX_R14 | TDX_R15 )
 
-/*
- * TDX guests use the TDCALL instruction to make requests to the
- * TDX module and hypercalls to the VMM. It is supported in
- * Binutils >= 2.36.
- */
-#define tdcall .byte 0x66,0x0f,0x01,0xcc
-
 /*
  * Used in __tdx_hypercall() to determine whether to enable interrupts
  * before issuing TDCALL for the EXIT_REASON_HLT case.
@@ -86,67 +78,7 @@
  */
 SYM_FUNC_START(__tdx_module_call)
 	FRAME_BEGIN
-
-	/*
-	 * R12 will be used as temporary storage for
-	 * struct tdx_module_output pointer. Since R12-R15
-	 * registers are not used by TDCALL services supported
-	 * by this function, it can be reused.
-	 */
-
-	/* Callee saved, so preserve it */
-	push %r12
-
-	/*
-	 * Push output pointer to stack.
-	 * After the TDCALL operation, it will be fetched
-	 * into R12 register.
-	 */
-	push %r9
-
-	/* Mangle function call ABI into TDCALL ABI: */
-	/* Move TDCALL Leaf ID to RAX */
-	mov %rdi, %rax
-	/* Move input 4 to R9 */
-	mov %r8,  %r9
-	/* Move input 3 to R8 */
-	mov %rcx, %r8
-	/* Move input 1 to RCX */
-	mov %rsi, %rcx
-	/* Leave input param 2 in RDX */
-
-	tdcall
-
-	/*
-	 * Fetch output pointer from stack to R12 (It is used
-	 * as temporary storage)
-	 */
-	pop %r12
-
-	/* Check for TDCALL success: 0 - Successful, otherwise failed */
-	test %rax, %rax
-	jnz .Lno_output_struct
-
-	/*
-	 * Since this function can be initiated without an output pointer,
-	 * check if caller provided an output struct before storing
-	 * output registers.
-	 */
-	test %r12, %r12
-	jz .Lno_output_struct
-
-	/* Copy TDCALL result registers to output struct: */
-	movq %rcx, TDX_MODULE_rcx(%r12)
-	movq %rdx, TDX_MODULE_rdx(%r12)
-	movq %r8,  TDX_MODULE_r8(%r12)
-	movq %r9,  TDX_MODULE_r9(%r12)
-	movq %r10, TDX_MODULE_r10(%r12)
-	movq %r11, TDX_MODULE_r11(%r12)
-
-.Lno_output_struct:
-	/* Restore the state of R12 register */
-	pop %r12
-
+	TDX_MODULE_CALL host=0
 	FRAME_END
 	ret
 SYM_FUNC_END(__tdx_module_call)
@@ -184,14 +116,7 @@ SYM_FUNC_END(__tdx_module_call)
  *
  * __tdx_hypercall() function ABI:
  *
- * @type  (RDI)        - TD VMCALL type, moved to R10
- * @fn    (RSI)        - TD VMCALL sub function, moved to R11
- * @r12   (RDX)        - Input parameter 1, moved to R12
- * @r13   (RCX)        - Input parameter 2, moved to R13
- * @r14   (R8)         - Input parameter 3, moved to R14
- * @r15   (R9)         - Input parameter 4, moved to R15
- *
- * @out   (stack)      - struct tdx_hypercall_output pointer (cannot be NULL)
+ * @args  (RDI)        - struct tdx_hypercall_args pointer
  *
  * On successful completion, return TDCALL status or -EINVAL for invalid
  * inputs.
@@ -199,41 +124,23 @@ SYM_FUNC_END(__tdx_module_call)
 SYM_FUNC_START(__tdx_hypercall)
 	FRAME_BEGIN
 
-	/* Move argument 7 from caller stack to RAX */
-	movq ARG7_SP_OFFSET(%rsp), %rax
-
-	/* Check if caller provided an output struct */
-	test %rax, %rax
-	/* If out pointer is NULL, return -EINVAL */
-	jz .Lret_err
-
 	/* Save callee-saved GPRs as mandated by the x86_64 ABI */
 	push %r15
 	push %r14
 	push %r13
 	push %r12
 
-	/*
-	 * Save output pointer (rax) on the stack, it will be used again
-	 * when storing the output registers after the TDCALL operation.
-	 */
-	push %rax
-
 	/* Mangle function call ABI into TDCALL ABI: */
 	/* Set TDCALL leaf ID (TDVMCALL (0)) in RAX */
 	xor %eax, %eax
-	/* Move TDVMCALL type (standard vs vendor) in R10 */
-	mov %rdi, %r10
-	/* Move TDVMCALL sub function id to R11 */
-	mov %rsi, %r11
-	/* Move input 1 to R12 */
-	mov %rdx, %r12
-	/* Move input 2 to R13 */
-	mov %rcx, %r13
-	/* Move input 3 to R14 */
-	mov %r8,  %r14
-	/* Move input 4 to R15 */
-	mov %r9,  %r15
+
+	/* Copy hypercall registers from arg struct: */
+	movq TDX_HYPERCALL_r10(%rdi), %r10
+	movq TDX_HYPERCALL_r11(%rdi), %r11
+	movq TDX_HYPERCALL_r12(%rdi), %r12
+	movq TDX_HYPERCALL_r13(%rdi), %r13
+	movq TDX_HYPERCALL_r14(%rdi), %r14
+	movq TDX_HYPERCALL_r15(%rdi), %r15
 
 	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
 
@@ -263,16 +170,13 @@ SYM_FUNC_START(__tdx_hypercall)
 .Lskip_sti:
 	tdcall
 
-	/* Restore output pointer to R9 */
-	pop  %r9
-
-	/* Copy hypercall result registers to output struct: */
-	movq %r10, TDX_HYPERCALL_r10(%r9)
-	movq %r11, TDX_HYPERCALL_r11(%r9)
-	movq %r12, TDX_HYPERCALL_r12(%r9)
-	movq %r13, TDX_HYPERCALL_r13(%r9)
-	movq %r14, TDX_HYPERCALL_r14(%r9)
-	movq %r15, TDX_HYPERCALL_r15(%r9)
+	/* Copy hypercall result registers to arg struct: */
+	movq %r10, TDX_HYPERCALL_r10(%rdi)
+	movq %r11, TDX_HYPERCALL_r11(%rdi)
+	movq %r12, TDX_HYPERCALL_r12(%rdi)
+	movq %r13, TDX_HYPERCALL_r13(%rdi)
+	movq %r14, TDX_HYPERCALL_r14(%rdi)
+	movq %r15, TDX_HYPERCALL_r15(%rdi)
 
 	/*
 	 * Zero out registers exposed to the VMM to avoid
@@ -290,10 +194,6 @@ SYM_FUNC_START(__tdx_hypercall)
 	pop %r14
 	pop %r15
 
-	jmp .Lhcall_done
-.Lret_err:
-       movq $-EINVAL, %rax
-.Lhcall_done:
        FRAME_END
 
        retq
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index be5465cb81c3..4924c7a1a002 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -35,36 +35,39 @@ static struct {
  * Wrapper for standard use of __tdx_hypercall with panic report
  * for TDCALL error.
  */
-static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14,
-				 u64 r15, struct tdx_hypercall_output *out)
+static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
 {
-	struct tdx_hypercall_output dummy_out;
-	u64 err;
-
-	/* __tdx_hypercall() does not accept NULL output pointer */
-	if (!out)
-		out = &dummy_out;
-
-	/* Non zero return value indicates buggy TDX module, so panic */
-	err = __tdx_hypercall(TDX_HYPERCALL_STANDARD, fn, r12, r13, r14,
-			      r15, out);
-	if (err)
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = fn,
+		.r12 = r12,
+		.r13 = r13,
+		.r14 = r14,
+		.r15 = r15,
+	};
+
+	if (__tdx_hypercall(&args))
 		panic("Hypercall fn %llu failed (Buggy TDX module!)\n", fn);
 
-	return out->r10;
+	return args.r10;
 }
 
 #ifdef CONFIG_KVM_GUEST
 long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
 		       unsigned long p3, unsigned long p4)
 {
-	struct tdx_hypercall_output out;
-
-	/* Non zero return value indicates buggy TDX module, so panic */
-	if (__tdx_hypercall(nr, p1, p2, p3, p4, 0, &out))
+	struct tdx_hypercall_args args = {
+		.r10 = nr,
+		.r11 = p1,
+		.r12 = p2,
+		.r13 = p3,
+		.r14 = p4,
+	};
+
+	if (__tdx_hypercall(&args))
 		panic("KVM hypercall %u failed. Buggy TDX module?\n", nr);
 
-	return out.r10;
+	return args.r10;
 }
 EXPORT_SYMBOL_GPL(tdx_kvm_hypercall);
 #endif
@@ -146,7 +149,7 @@ int tdx_hcall_request_gpa_type(phys_addr_t start, phys_addr_t end, bool enc)
 	 * can be found in TDX Guest-Host-Communication Interface (GHCI),
 	 * sec "TDG.VP.VMCALL<MapGPA>"
 	 */
-	ret = _tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0, NULL);
+	ret = _tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0);
 
 	if (ret)
 		ret = -EIO;
@@ -192,8 +195,7 @@ static u64 __cpuidle _tdx_halt(const bool irq_disabled, const bool do_sti)
 	 * whether to call the STI instruction before executing the
 	 * TDCALL instruction.
 	 */
-	return _tdx_hypercall(EXIT_REASON_HLT, irq_disabled, 0, 0,
-			      do_sti, NULL);
+	return _tdx_hypercall(EXIT_REASON_HLT, irq_disabled, 0, 0, do_sti);
 }
 
 static bool tdx_halt(void)
@@ -231,47 +233,65 @@ void __cpuidle tdx_safe_halt(void)
 
 static bool tdx_read_msr(unsigned int msr, u64 *val)
 {
-	struct tdx_hypercall_output out;
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = EXIT_REASON_MSR_READ,
+		.r12 = msr,
+	};
 
 	/*
 	 * Emulate the MSR read via hypercall. More info about ABI
 	 * can be found in TDX Guest-Host-Communication Interface
 	 * (GHCI), sec titled "TDG.VP.VMCALL<Instruction.RDMSR>".
 	 */
-	if (_tdx_hypercall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out))
-		return false;
+	if (__tdx_hypercall(&args))
+		panic("Hypercall failed (Buggy TDX module!)\n");
 
-	*val = out.r11;
+	if (args.r10)
+		return false;
 
+	*val = args.r11;
 	return true;
 }
 
 static bool tdx_write_msr(unsigned int msr, unsigned int low,
 			       unsigned int high)
 {
-	u64 ret;
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = EXIT_REASON_MSR_WRITE,
+		.r12 = msr,
+		.r13 = (u64)high << 32 | low,
+	};
 
 	/*
 	 * Emulate the MSR write via hypercall. More info about ABI
 	 * can be found in TDX Guest-Host-Communication Interface
 	 * (GHCI) sec titled "TDG.VP.VMCALL<Instruction.WRMSR>".
 	 */
-	ret = _tdx_hypercall(EXIT_REASON_MSR_WRITE, msr, (u64)high << 32 | low,
-			     0, 0, NULL);
+	if (__tdx_hypercall(&args))
+		panic("Hypercall failed (Buggy TDX module!)\n");
 
-	return ret ? false : true;
+	return args.r10 ? false : true;
 }
 
 static bool tdx_handle_cpuid(struct pt_regs *regs)
 {
-	struct tdx_hypercall_output out;
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = EXIT_REASON_CPUID,
+		.r12 = regs->ax,
+		.r13 = regs->cx,
+	};
 
 	/*
 	 * Emulate the CPUID instruction via a hypercall. More info about
 	 * ABI can be found in TDX Guest-Host-Communication Interface
 	 * (GHCI), section titled "VP.VMCALL<Instruction.CPUID>".
 	 */
-	if (_tdx_hypercall(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0, &out))
+	if (__tdx_hypercall(&args))
+		panic("Hypercall failed (Buggy TDX module!)\n");
+	if (args.r10)
 		return false;
 
 	/*
@@ -279,10 +299,10 @@ static bool tdx_handle_cpuid(struct pt_regs *regs)
 	 * EAX, EBX, ECX, EDX registers after the CPUID instruction execution.
 	 * So copy the register contents back to pt_regs.
 	 */
-	regs->ax = out.r12;
-	regs->bx = out.r13;
-	regs->cx = out.r14;
-	regs->dx = out.r15;
+	regs->ax = args.r12;
+	regs->bx = args.r13;
+	regs->cx = args.r14;
+	regs->dx = args.r15;
 
 	return true;
 }
@@ -290,15 +310,21 @@ static bool tdx_handle_cpuid(struct pt_regs *regs)
 static int tdx_mmio(int size, bool write, unsigned long addr,
 		     unsigned long *val)
 {
-	struct tdx_hypercall_output out;
-	u64 err;
-
-	err = _tdx_hypercall(EXIT_REASON_EPT_VIOLATION, size, write,
-			     addr, *val, &out);
-	if (err)
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = EXIT_REASON_EPT_VIOLATION,
+		.r12 = size,
+		.r13 = write,
+		.r14 = addr,
+		.r15 = *val,
+	};
+
+	if (__tdx_hypercall(&args))
+		panic("Hypercall failed (Buggy TDX module!)\n");
+	if (args.r10)
 		return -EFAULT;
 
-	*val = out.r11;
+	*val = args.r11;
 	return 0;
 }
 
@@ -402,8 +428,11 @@ static int tdx_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
  */
 static bool tdx_handle_io(struct pt_regs *regs, u32 exit_qual)
 {
-	struct tdx_hypercall_output out;
-	int size, port, ret;
+	struct tdx_hypercall_args args = {
+		.r10 = TDX_HYPERCALL_STANDARD,
+		.r11 = EXIT_REASON_IO_INSTRUCTION,
+	};
+	int size, port;
 	u64 mask;
 	bool in;
 
@@ -415,20 +444,25 @@ static bool tdx_handle_io(struct pt_regs *regs, u32 exit_qual)
 	port = VE_GET_PORT_NUM(exit_qual);
 	mask = GENMASK(BITS_PER_BYTE * size, 0);
 
+	args.r12 = size;
+	args.r13 = !in;
+	args.r14 = port;
+	args.r15 = in ? 0 : regs->ax;
+
 	/*
 	 * Emulate the I/O read/write via hypercall. More info about
 	 * ABI can be found in TDX Guest-Host-Communication Interface
 	 * (GHCI) sec titled "TDG.VP.VMCALL<Instruction.IO>".
 	 */
-	ret = _tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, !in, port,
-			     in ? 0 : regs->ax, &out);
+	if (__tdx_hypercall(&args))
+		panic("Hypercall failed (Buggy TDX module!)\n");
 	if (!in)
-		return !ret;
+		return !args.r10;
 
 	regs->ax &= ~mask;
-	regs->ax |= ret ? UINT_MAX : out.r11 & mask;
+	regs->ax |= args.r10 ? UINT_MAX : args.r11 & mask;
 
-	return !ret;
+	return !args.r10;
 }
 
 /*
diff --git a/arch/x86/kernel/tdxcall.S b/arch/x86/kernel/tdxcall.S
new file mode 100644
index 000000000000..ba5e8e35de36
--- /dev/null
+++ b/arch/x86/kernel/tdxcall.S
@@ -0,0 +1,76 @@
+#include <asm/asm-offsets.h>
+
+/*
+ * TDX guests use the TDCALL instruction to make requests to the
+ * TDX module and hypercalls to the VMM.
+ *
+ * TDX hosts use the SEAMCALL instruction to make requests to the TDX module.
+ *
+ * They are supported in Binutils >= 2.36.
+ */
+#define tdcall		.byte 0x66,0x0f,0x01,0xcc
+#define seamcall	.byte 0x66,0x0f,0x01,0xcf
+
+.macro TDX_MODULE_CALL host:req
+	/*
+	 * R12 will be used as temporary storage for struct tdx_module_output
+	 * pointer. Since R12-R15 registers are not used by TDCALL/SEAMCALL
+	 * services supported by this function, it can be reused.
+	 */
+
+	/* Callee saved, so preserve it */
+	push %r12
+
+	/*
+	 * Push output pointer to stack.
+	 * After the operation, it will be fetched into R12 register.
+	 */
+	push %r9
+
+	/* Mangle function call ABI into TDCALL/SEAMCALL ABI: */
+	/* Move Leaf ID to RAX */
+	mov %rdi, %rax
+	/* Move input 4 to R9 */
+	mov %r8,  %r9
+	/* Move input 3 to R8 */
+	mov %rcx, %r8
+	/* Move input 1 to RCX */
+	mov %rsi, %rcx
+	/* Leave input param 2 in RDX */
+
+	.if \host
+	seamcall
+	.else
+	tdcall
+	.endif
+
+	/*
+	 * Fetch output pointer from stack to R12 (It is used
+	 * as temporary storage)
+	 */
+	pop %r12
+
+	/* Check for success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz .Lno_output_struct
+
+	/*
+	 * Since this function can be initiated without an output pointer,
+	 * check if caller provided an output struct before storing
+	 * output registers.
+	 */
+	test %r12, %r12
+	jz .Lno_output_struct
+
+	/* Copy result registers to output struct: */
+	movq %rcx, TDX_MODULE_rcx(%r12)
+	movq %rdx, TDX_MODULE_rdx(%r12)
+	movq %r8,  TDX_MODULE_r8(%r12)
+	movq %r9,  TDX_MODULE_r9(%r12)
+	movq %r10, TDX_MODULE_r10(%r12)
+	movq %r11, TDX_MODULE_r11(%r12)
+
+.Lno_output_struct:
+	/* Restore the state of R12 register */
+	pop %r12
+.endm
-- 
 Kirill A. Shutemov

* Re: [PATCHv2 13/29] x86/tdx: Add port I/O emulation
  2022-01-24 15:01 ` [PATCHv2 13/29] x86/tdx: Add port I/O emulation Kirill A. Shutemov
  2022-02-01 23:01   ` Thomas Gleixner
@ 2022-02-02  6:22   ` Borislav Petkov
  1 sibling, 0 replies; 154+ messages in thread
From: Borislav Petkov @ 2022-02-02  6:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Mon, Jan 24, 2022 at 06:01:59PM +0300, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index 8e630eeb765d..e73af22a4c11 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -13,6 +13,12 @@
>  /* TDX module Call Leaf IDs */
>  #define TDX_GET_VEINFO			3
>  
> +/* See Exit Qualification for I/O Instructions in VMX documentation */
> +#define VE_IS_IO_IN(exit_qual)		(((exit_qual) & 8) ? 1 : 0)
> +#define VE_GET_IO_SIZE(exit_qual)	(((exit_qual) & 7) + 1)
> +#define VE_GET_PORT_NUM(exit_qual)	((exit_qual) >> 16)
> +#define VE_IS_IO_STRING(exit_qual)	((exit_qual) & 16 ? 1 : 0)

Use BIT() and masks here. For example:

#define VE_IS_IO_STRING(e)	((e) & BIT(4))

You don't need the ternary ?: either as you're using them all in a
boolean context.
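
Spelled out for all four macros, the suggestion could look like the
userspace sketch below. It is an illustration, not the patch: BIT() and
GENMASK() are redefined locally to stand in for the kernel helpers, and the
VE_IO_* mask names and the ve_decode_io() helper are hypothetical. The bit
positions are the ones used by the quoted macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's BIT() and GENMASK() helpers. */
#define BIT(n)		(1ULL << (n))
#define GENMASK(h, l)	(((~0ULL) << (l)) & (~0ULL >> (63 - (h))))

/* Hypothetical field names for the I/O exit qualification. */
#define VE_IO_SIZE_MASK		GENMASK(2, 0)
#define VE_IO_IN		BIT(3)
#define VE_IO_STRING		BIT(4)
#define VE_IO_PORT_MASK		GENMASK(31, 16)

/* No ternary: these are only ever used in a boolean context. */
#define VE_IS_IO_IN(e)		((e) & VE_IO_IN)
#define VE_IS_IO_STRING(e)	((e) & VE_IO_STRING)
#define VE_GET_IO_SIZE(e)	(((e) & VE_IO_SIZE_MASK) + 1)
#define VE_GET_PORT_NUM(e)	(((e) & VE_IO_PORT_MASK) >> 16)

struct ve_io {
	bool in;
	bool string;
	int size;
	int port;
};

/* Decode one exit qualification into its fields. */
static struct ve_io ve_decode_io(uint32_t exit_qual)
{
	struct ve_io io = {
		.in	= VE_IS_IO_IN(exit_qual),	/* non-zero becomes true */
		.string	= VE_IS_IO_STRING(exit_qual),
		.size	= (int)VE_GET_IO_SIZE(exit_qual),
		.port	= (int)VE_GET_PORT_NUM(exit_qual),
	};

	return io;
}
```

Note that dropping the ternary is only safe because the results land in a
bool or an if (); assigning VE_IS_IO_IN() to an int would yield 8, not 1.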

> +static bool tdx_handle_io(struct pt_regs *regs, u32 exit_qual)
> +{
> +	struct tdx_hypercall_output out;
> +	int size, port, ret;
> +	u64 mask;
> +	bool in;
> +
> +	if (VE_IS_IO_STRING(exit_qual))
> +		return false;
> +
> +	in   = VE_IS_IO_IN(exit_qual);
> +	size = VE_GET_IO_SIZE(exit_qual);
> +	port = VE_GET_PORT_NUM(exit_qual);
> +	mask = GENMASK(BITS_PER_BYTE * size, 0);
> +
> +	/*
> +	 * Emulate the I/O read/write via hypercall. More info about
> +	 * ABI can be found in TDX Guest-Host-Communication Interface
> +	 * (GHCI) sec titled "TDG.VP.VMCALL<Instruction.IO>".

"section"

> +	 */
> +	ret = _tdx_hypercall(EXIT_REASON_IO_INSTRUCTION, size, !in, port,
> +			     in ? 0 : regs->ax, &out);
> +	if (!in)
> +		return !ret;
> +
> +	regs->ax &= ~mask;
> +	regs->ax |= ret ? UINT_MAX : out.r11 & mask;
> +
> +	return !ret;
> +}

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv2 14/29] x86/tdx: Early boot handling of port I/O
  2022-01-24 15:02 ` [PATCHv2 14/29] x86/tdx: Early boot handling of port I/O Kirill A. Shutemov
  2022-02-01 23:02   ` Thomas Gleixner
@ 2022-02-02 10:09   ` Borislav Petkov
  1 sibling, 0 replies; 154+ messages in thread
From: Borislav Petkov @ 2022-02-02 10:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Mon, Jan 24, 2022 at 06:02:00PM +0300, Kirill A. Shutemov wrote:
> Subject: Re: [PATCHv2 14/29] x86/tdx: Early boot handling of port I/O

The condensed patch description in the subject line should start with an
uppercase letter and should be written in imperative tone:

	... Handle early boot port I/O

or so.

> From: Andi Kleen <ak@linux.intel.com>
> 
> TDX guests cannot do port I/O directly. The TDX module triggers a #VE
> exception to let the guest kernel emulate port I/O, by converting them

s/,//

> into TDCALLs to call the host.
> 
> But before IDT handlers are set up, port I/O cannot be emulated using
> normal kernel #VE handlers. To support the #VE-based emulation during
> this boot window, add a minimal early #VE handler support in early
> exception handlers. This is similar to what AMD SEV does. This is
> mainly to support earlyprintk's serial driver, as well as potentially
> the VGA driver (although it is expected not to be used).

expectations, shmexpectations...

...

> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> index 1cb6346ec3d1..76d298ddfe75 100644
> --- a/arch/x86/kernel/head64.c
> +++ b/arch/x86/kernel/head64.c
> @@ -417,6 +417,9 @@ void __init do_early_exception(struct pt_regs *regs, int trapnr)
>  	    trapnr == X86_TRAP_VC && handle_vc_boot_ghcb(regs))
>  		return;
>  


	if (IS_ENABLED(CONFIG_INTEL_TDX_GUEST) &&

> +	if (trapnr == X86_TRAP_VE && tdx_early_handle_ve(regs))
> +		return;
> +
>  	early_fixup_exception(regs, trapnr);

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv2 03/29] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-02-02  2:55     ` Kirill A. Shutemov
@ 2022-02-02 10:59       ` Kai Huang
  2022-02-03 14:44         ` Kirill A. Shutemov
  2022-02-02 17:08       ` Thomas Gleixner
  1 sibling, 1 reply; 154+ messages in thread
From: Kai Huang @ 2022-02-02 10:59 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kai Huang


>  /*
> diff --git a/arch/x86/kernel/tdxcall.S b/arch/x86/kernel/tdxcall.S
> new file mode 100644
> index 000000000000..ba5e8e35de36
> --- /dev/null
> +++ b/arch/x86/kernel/tdxcall.S
> @@ -0,0 +1,76 @@
> +#include <asm/asm-offsets.h>
> +
> +/*
> + * TDX guests use the TDCALL instruction to make requests to the
> + * TDX module and hypercalls to the VMM.
> + *
> + * TDX hosts use the SEAMCALL instruction to make requests to the TDX module.
> + *
> + * They are supported in Binutils >= 2.36.
> + */
> +#define tdcall		.byte 0x66,0x0f,0x01,0xcc
> +#define seamcall	.byte 0x66,0x0f,0x01,0xcf
> +
> +.macro TDX_MODULE_CALL host:req
> +	/*
> +	 * R12 will be used as temporary storage for struct tdx_module_output
> +	 * pointer. Since R12-R15 registers are not used by TDCALL/SEAMCALL
> +	 * services supported by this function, it can be reused.
> +	 */
> +
> +	/* Callee saved, so preserve it */
> +	push %r12
> +
> +	/*
> +	 * Push output pointer to stack.
> +	 * After the operation, it will be fetched into R12 register.
> +	 */
> +	push %r9
> +
> +	/* Mangle function call ABI into TDCALL/SEAMCALL ABI: */
> +	/* Move Leaf ID to RAX */
> +	mov %rdi, %rax
> +	/* Move input 4 to R9 */
> +	mov %r8,  %r9
> +	/* Move input 3 to R8 */
> +	mov %rcx, %r8
> +	/* Move input 1 to RCX */
> +	mov %rsi, %rcx
> +	/* Leave input param 2 in RDX */

Currently the host SEAMCALL code also takes a data structure holding all
possible input registers as its input argument, rather than one argument per
register:

	struct seamcall_regs_in {
		u64 rcx;
		u64 rdx;
		u64 r8;
		u64 r9;
	};

Which way is better (the struct name above can be changed, of course)?

Or should we rename tdx_module_output to tdx_module_regs, and use it as both
input and output (similar to __tdx_hypercall())?

> +
> +	.if \host
> +	seamcall
> +	.else
> +	tdcall
> +	.endif
> +
> +	/*
> +	 * Fetch output pointer from stack to R12 (It is used
> +	 * as temporary storage)
> +	 */
> +	pop %r12

For host SEAMCALL, one additional check against VMfailInvalid needs to be
done after this point.  This is because on the host side, both the P-SEAMLDR
and the TDX module are expected to be loaded by UEFI (i.e. a UEFI shell tool)
before booting the kernel, so the host kernel needs to detect whether they
have been loaded by issuing a SEAMCALL.

When SEAM software (P-SEAMLDR or TDX module) is not loaded, the SEAMCALL fails
with VMfailInvalid.  VMfailInvalid is indicated via a combination of RFLAGS
bits rather than via %rax.  In practice, RFLAGS.CF=1 can be used to check
whether VMfailInvalid happened, and we need to have something like the below:

	/*
	 * SEAMCALL instruction is essentially a VMExit from VMX root
	 * mode to SEAM VMX root mode.  VMfailInvalid (CF=1) indicates
	 * that the targeted SEAM firmware is not loaded or disabled,
	 * or P-SEAMLDR is busy with another SEAMCALL.  %rax is not
	 * changed in this case.
	 *
	 * Set %rax to VMFAILINVALID for VMfailInvalid.  This value
	 * will never be used as actual SEAMCALL error code.
	 */
	jnb     .Lno_vmfailinvalid
	mov     $(VMFAILINVALID), %rax
	jmp     .Lno_output_struct

.Lno_vmfailinvalid:
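
For readers who prefer C, the control flow of that check together with the
existing success test can be modelled as in the sketch below. It is a model
under assumptions, not proposed code: the VMFAILINVALID value chosen here is
a hypothetical sentinel (the real constant is not given in this mail), and
struct seamcall_result, seamcall_status() and seamcall_copy_output() are
names invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sentinel: the only stated requirement is that it is never
 * used as an actual SEAMCALL error code.
 */
#define VMFAILINVALID	UINT64_MAX

/* What the CPU leaves behind after SEAMCALL, as far as this sketch cares. */
struct seamcall_result {
	bool cf;	/* RFLAGS.CF: set on VMfailInvalid */
	uint64_t rax;	/* status from the SEAM software; unchanged if CF=1 */
};

/* C rendering of the "jnb / mov / jmp" sequence. */
static uint64_t seamcall_status(struct seamcall_result res)
{
	if (res.cf)
		return VMFAILINVALID;	/* SEAM software not loaded, or busy */
	return res.rax;			/* 0 on success, error code otherwise */
}

/* Output registers are copied only on success and with a non-NULL struct. */
static bool seamcall_copy_output(uint64_t status, const void *out)
{
	return status == 0 && out != NULL;
}
```

Because %rax is left untouched on VMfailInvalid, the carry flag has to be
tested before the "test %rax, %rax" step; otherwise a stale zero in %rax
could be mistaken for success.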

> +
> +	/* Check for success: 0 - Successful, otherwise failed */
> +	test %rax, %rax
> +	jnz .Lno_output_struct
> +
> +	/*
> +	 * Since this function can be initiated without an output pointer,
> +	 * check if caller provided an output struct before storing
> +	 * output registers.
> +	 */
> +	test %r12, %r12
> +	jz .Lno_output_struct
> +
> +	/* Copy result registers to output struct: */
> +	movq %rcx, TDX_MODULE_rcx(%r12)
> +	movq %rdx, TDX_MODULE_rdx(%r12)
> +	movq %r8,  TDX_MODULE_r8(%r12)
> +	movq %r9,  TDX_MODULE_r9(%r12)
> +	movq %r10, TDX_MODULE_r10(%r12)
> +	movq %r11, TDX_MODULE_r11(%r12)
> +
> +.Lno_output_struct:
> +	/* Restore the state of R12 register */
> +	pop %r12
> +.endm
> -- 
>  Kirill A. Shutemov

* Re: [PATCHv2 16/29] x86/boot: Add a trampoline for booting APs via firmware handoff
  2022-01-24 15:02 ` [PATCHv2 16/29] x86/boot: Add a trampoline for booting APs via firmware handoff Kirill A. Shutemov
  2022-02-01 23:06   ` Thomas Gleixner
@ 2022-02-02 11:27   ` Borislav Petkov
  2022-02-04 11:27     ` Kuppuswamy, Sathyanarayanan
  1 sibling, 1 reply; 154+ messages in thread
From: Borislav Petkov @ 2022-02-02 11:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Sean Christopherson, Kai Huang

On Mon, Jan 24, 2022 at 06:02:02PM +0300, Kirill A. Shutemov wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Historically, x86 platforms have booted secondary processors (APs)
> using INIT followed by the start up IPI (SIPI) messages. In regular
> VMs, this boot sequence is supported by the VMM emulation. But such a
> wakeup model is fatal for secure VMs like TDX in which VMM is an
> untrusted entity. To address this issue, a new wakeup model was added
> in ACPI v6.4, in which firmware (like TDX virtual BIOS) will help boot
> the APs. More details about this wakeup model can be found in ACPI
> specification v6.4, the section titled "Multiprocessor Wakeup Structure".
> 
> Since the existing trampoline code requires processors to boot in real
> mode with 16-bit addressing, it will not work for this wakeup model
> (because it boots the AP in 64-bit mode). To handle it, extend the
> trampoline code to support 64-bit mode firmware handoff. Also, extend
> IDT and GDT pointers to support 64-bit mode hand off.
> 
> There is no TDX-specific detection for this new boot method. The kernel
> will rely on it as the sole boot method whenever the new ACPI structure
> is present.
> 
> The ACPI table parser for the MADT multiprocessor wake up structure and
> the wakeup method that uses this structure will be added by the following
> patch in this series.
> 
> Reported-by: Kai Huang <kai.huang@intel.com>

I wonder what that Reported-by tag means here, since this is a feature
patch and not a bug fix...

> diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
> index 331474b150f1..fd6f6e5b755a 100644
> --- a/arch/x86/include/asm/realmode.h
> +++ b/arch/x86/include/asm/realmode.h
> @@ -25,6 +25,7 @@ struct real_mode_header {
>  	u32	sev_es_trampoline_start;
>  #endif
>  #ifdef CONFIG_X86_64
> +	u32	trampoline_start64;
>  	u32	trampoline_pgd;
>  #endif

Hmm, so there's trampoline_start, sev_es_trampoline_start and
trampoline_start64. If those are mutually exclusive, can we merge them
all into a single trampoline_start?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv2.1 05/29] x86/tdx: Add HLT support for TDX guests
  2022-02-01 21:21       ` Thomas Gleixner
@ 2022-02-02 12:48         ` Kirill A. Shutemov
  2022-02-02 17:17           ` Thomas Gleixner
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-02 12:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: bp, aarcange, ak, dan.j.williams, dave.hansen, david, hpa,
	jgross, jmattson, joro, jpoimboe, knsathya, linux-kernel, luto,
	mingo, pbonzini, peterz, sathyanarayanan.kuppuswamy, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86

On Tue, Feb 01, 2022 at 10:21:58PM +0100, Thomas Gleixner wrote:
> On Sun, Jan 30 2022 at 01:30, Kirill A. Shutemov wrote:
> > +/*
> > + * Used in __tdx_hypercall() to determine whether to enable interrupts
> > + * before issuing TDCALL for the EXIT_REASON_HLT case.
> > + */
> > +#define ENABLE_IRQS_BEFORE_HLT 0x01
> > +
> >  /*
> >   * __tdx_module_call()  - Used by TDX guests to request services from
> >   * the TDX module (does not include VMM services).
> > @@ -230,6 +237,30 @@ SYM_FUNC_START(__tdx_hypercall)
> >  
> >  	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
> >  
> > +	/*
> > +	 * For the idle loop STI needs to be called directly before
> > +	 * the TDCALL that enters idle (EXIT_REASON_HLT case). STI
> > +	 * instruction enables interrupts only one instruction later.
> > +	 * If there is a window between STI and the instruction that
> > +	 * emulates the HALT state, there is a chance for interrupts to
> > +	 * happen in this window, which can delay the HLT operation
> > +	 * indefinitely. Since this is the not the desired result,
> > +	 * conditionally call STI before TDCALL.
> > +	 *
> > +	 * Since STI instruction is only required for the idle case
> > +	 * (a special case of EXIT_REASON_HLT), use the r15 register
> > +	 * value to identify it. Since the R15 register is not used
> > +	 * by the VMM as per EXIT_REASON_HLT ABI, re-use it in
> > +	 * software to identify the STI case.
> > +	 */
> > +	cmpl $EXIT_REASON_HLT, %r11d
> > +	jne .Lskip_sti
> > +	cmpl $ENABLE_IRQS_BEFORE_HLT, %r15d
> > +	jne .Lskip_sti
> > +	/* Set R15 register to 0, it is unused in EXIT_REASON_HLT case */
> > +	xor %r15, %r15
> > +	sti
> > +.Lskip_sti:
> >  	tdcall
> 
> This really can be simplified:
> 
>         cmpl	$EXIT_REASON_SAFE_HLT, %r11d
>         jne	.Lnohalt
>         movl	$EXIT_REASON_HLT, %r11d
>         sti
> .Lnohalt:
> 	tdcall
> 
> and the below becomes:
> 
> static bool tdx_halt(void)
> {
> 	return !!__tdx_hypercall(EXIT_REASON_HLT, !!irqs_disabled(), 0, 0, 0, NULL);
> }
> 
> void __cpuidle tdx_safe_halt(void)
> {
>         if (__tdx_hypercall(EXIT_REASON_SAFE_HLT, 0, 0, 0, 0, NULL))
>         	WARN_ONCE(1, "HLT instruction emulation failed\n");
> }
> 
> Hmm?

EXIT_REASON_* are architectural, see SDM vol 3D, appendix C. There's no
EXIT_REASON_SAFE_HLT.

Do you want to define a synthetic one? Like

#define EXIT_REASON_SAFE_HLT	0x10000

?

Looks dubious to me, I donno. I worry about possible conflicts with the
spec in the future.

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 06/29] x86/tdx: Add MSR support for TDX guests
  2022-02-01 21:38   ` Thomas Gleixner
@ 2022-02-02 13:06     ` Kirill A. Shutemov
  2022-02-02 17:18       ` Thomas Gleixner
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-02 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, bp, dave.hansen, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86, linux-kernel

On Tue, Feb 01, 2022 at 10:38:05PM +0100, Thomas Gleixner wrote:
> On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
> > +static bool tdx_read_msr(unsigned int msr, u64 *val)
> > +{
> > +	struct tdx_hypercall_output out;
> > +
> > +	/*
> > +	 * Emulate the MSR read via hypercall. More info about ABI
> > +	 * can be found in TDX Guest-Host-Communication Interface
> > +	 * (GHCI), sec titled "TDG.VP.VMCALL<Instruction.RDMSR>".
> > +	 */
> > +	if (_tdx_hypercall(EXIT_REASON_MSR_READ, msr, 0, 0, 0, &out))
> > +		return false;
> > +
> > +	*val = out.r11;
> > +
> > +	return true;
> > +}
> > +
> > +static bool tdx_write_msr(unsigned int msr, unsigned int low,
> > +			       unsigned int high)
> > +{
> > +	u64 ret;
> > +
> > +	/*
> > +	 * Emulate the MSR write via hypercall. More info about ABI
> > +	 * can be found in TDX Guest-Host-Communication Interface
> > +	 * (GHCI) sec titled "TDG.VP.VMCALL<Instruction.WRMSR>".
> > +	 */
> > +	ret = _tdx_hypercall(EXIT_REASON_MSR_WRITE, msr, (u64)high << 32 | low,
> > +			     0, 0, NULL);
> > +
> > +	return ret ? false : true;
> > +}
> > +
> >  bool tdx_get_ve_info(struct ve_info *ve)
> >  {
> >  	struct tdx_module_output out;
> > @@ -132,11 +165,22 @@ static bool tdx_virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
> >  static bool tdx_virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> >  {
> >  	bool ret = false;
> > +	u64 val;
> >  
> >  	switch (ve->exit_reason) {
> >  	case EXIT_REASON_HLT:
> >  		ret = tdx_halt();
> >  		break;
> > +	case EXIT_REASON_MSR_READ:
> > +		ret = tdx_read_msr(regs->cx, &val);
> > +		if (ret) {
> > +			regs->ax = lower_32_bits(val);
> > +			regs->dx = upper_32_bits(val);
> > +		}
> > +		break;
> 
> Why here?
> 
> static bool tdx_read_msr(struct pt_regs *regs)
> {
> 	struct tdx_hypercall_output out;
> 
> 	/*
> 	 * Emulate the MSR read via hypercall. More info about ABI
> 	 * can be found in TDX Guest-Host-Communication Interface
> 	 * (GHCI), sec titled "TDG.VP.VMCALL<Instruction.RDMSR>".
> 	 */
> 	if (_tdx_hypercall(EXIT_REASON_MSR_READ, regs->cx, 0, 0, 0, &out))
> 		return false;
> 
> 	regs->ax = lower_32_bits(out.r11);
> 	regs->dx = upper_32_bits(out.r11);
> 	return true;
> }
> 
> and
> 
> static bool tdx_write_msr(struct pt_regs *regs)
> {
> 	/*
> 	 * Emulate the MSR write via hypercall. More info about ABI
> 	 * can be found in TDX Guest-Host-Communication Interface
> 	 * (GHCI) sec titled "TDG.VP.VMCALL<Instruction.WRMSR>".
> 	 */
> 	return !_tdx_hypercall(EXIT_REASON_MSR_WRITE, regs->cx,
>         			(u64)regs->dx << 32 | regs->ax,
> 			     	0, 0, NULL);
> }
> 
> Also the switch case can be simplified as the only action after 'break;'
> is 'return ret':
> 
> 	switch (ve->exit_reason) {
> 	case EXIT_REASON_HLT:
> 		return tdx_halt();
> 	case EXIT_REASON_MSR_READ:
> 		return tdx_read_msr(regs);
> 	case EXIT_REASON_MSR_WRITE:
> 		return tdx_write_msr(regs);
>         default:
>                 ....
> 
> Hmm?

No problem with this approach on read side.

On the write side there's one important optimization (outside of the
initial TDX enabling):

https://github.com/intel/tdx/commit/2cea8becaa5a287c93266c01fc7f2a4ed53c509d

It will require rework, maybe use separate __tdx_hypercall() for the
paravirt call implementation.
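
For reference, a minimal user-space sketch of the register packing discussed
above. It is not the patch's code: struct msr_regs and the helper names are
hypothetical stand-ins for pt_regs and the kernel's lower_32_bits() /
upper_32_bits(). One detail worth keeping in mind: pt_regs fields are 64-bit,
so the sketch truncates AX and DX to 32 bits before combining them. The
original tdx_write_msr() got that for free from its 'unsigned int' parameters.

```c
#include <stdint.h>

/* User-space stand-ins for the kernel's lower_32_bits()/upper_32_bits(). */
static inline uint32_t lower_32(uint64_t v) { return (uint32_t)v; }
static inline uint32_t upper_32(uint64_t v) { return (uint32_t)(v >> 32); }

/* Hypothetical subset of pt_regs: only the registers RDMSR/WRMSR touch. */
struct msr_regs {
	uint64_t ax;	/* low 32 bits of the MSR value */
	uint64_t cx;	/* MSR index */
	uint64_t dx;	/* high 32 bits of the MSR value */
};

/* Build the 64-bit value a WRMSR hypercall would pass to the VMM. */
static uint64_t msr_write_value(const struct msr_regs *regs)
{
	/* Truncate first: the upper halves of RAX/RDX are not part of the value. */
	return ((uint64_t)lower_32(regs->dx) << 32) | lower_32(regs->ax);
}

/* Split a 64-bit RDMSR result back into EDX:EAX. */
static void msr_read_result(struct msr_regs *regs, uint64_t val)
{
	regs->ax = lower_32(val);
	regs->dx = upper_32(val);
}
```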

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 03/29] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-02-02  2:55     ` Kirill A. Shutemov
  2022-02-02 10:59       ` Kai Huang
@ 2022-02-02 17:08       ` Thomas Gleixner
  1 sibling, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-02 17:08 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: mingo, bp, dave.hansen, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86, linux-kernel, Kai Huang

Kirill,

On Wed, Feb 02 2022 at 05:55, Kirill A. Shutemov wrote:
> On Tue, Feb 01, 2022 at 08:58:59PM +0100, Thomas Gleixner wrote:
> Okay, below is my take on addressing feedback for both __tdx_module_call()
> and __tdx_hypercall().
>
> It is a fixup for the whole patchset. It has to be folded accordingly. I wanted
> to check if it works and see if I understand your request correctly.
>
> __tdx_module_call() is now implemented by including tdxcall.S and using
> the macro defined there. Host side of TDX can do the same on their side.
> TDX_MODULE_* offsets are now outside of CONFIG_INTEL_TDX_GUEST and can be
> used by both host and guest.
>
> I changed __tdx_hypercall() to take a single argument with a struct pointer
> that is used for both input and output.
>
> Is it the right direction? Or did I misunderstand something?

Looks good. Nice consolidation and I like the idea of using one
structure for in and out.

Thanks,

        tglx

* Re: [PATCHv2.1 05/29] x86/tdx: Add HLT support for TDX guests
  2022-02-02 12:48         ` Kirill A. Shutemov
@ 2022-02-02 17:17           ` Thomas Gleixner
  2022-02-04 16:55             ` Kirill A. Shutemov
  0 siblings, 1 reply; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-02 17:17 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: bp, aarcange, ak, dan.j.williams, dave.hansen, david, hpa,
	jgross, jmattson, joro, jpoimboe, knsathya, linux-kernel, luto,
	mingo, pbonzini, peterz, sathyanarayanan.kuppuswamy, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86

On Wed, Feb 02 2022 at 15:48, Kirill A. Shutemov wrote:

> On Tue, Feb 01, 2022 at 10:21:58PM +0100, Thomas Gleixner wrote:
>> On Sun, Jan 30 2022 at 01:30, Kirill A. Shutemov wrote:
>> This really can be simplified:
>> 
>>         cmpl	$EXIT_REASON_SAFE_HLT, %r11d
>>         jne	.Lnohalt
>>         movl	$EXIT_REASON_HLT, %r11d
>>         sti
>> .Lnohalt:
>> 	tdcall
>> 
>> and the below becomes:
>> 
>> static bool tdx_halt(void)
>> {
>> 	return !!__tdx_hypercall(EXIT_REASON_HLT, !!irqs_disabled(), 0, 0, 0, NULL);
>> }
>> 
>> void __cpuidle tdx_safe_halt(void)
>> {
>>         if (__tdx_hypercall(EXIT_REASON_SAFE_HLT, 0, 0, 0, 0, NULL))
>>         	WARN_ONCE(1, "HLT instruction emulation failed\n");
>> }
>> 
>> Hmm?
>
> EXIT_REASON_* are architectural, see SDM vol 3D, appendix C. There's no
> EXIT_REASON_SAFE_HLT.
>
> Do you want to define a synthetic one? Like
>
> #define EXIT_REASON_SAFE_HLT	0x10000
> ?

That was my idea, yes.

> Looks dubious to me, I donno. I worry about possible conflicts with the
> spec in the future.

The spec should have a reserved space for such things :)

But you might think about having an in/out struct similar to the module
call or just an array of u64.

and the signature would become:

__tdx_hypercall(u64 op, u64 flags, struct inout *args)
__tdx_hypercall(u64 op, u64 flags, u64 *args)

and have flag bits:

    HCALL_ISSUE_STI
    HCALL_HAS_OUTPUT

Hmm?
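
A rough user-space model of how such flag bits could drive the wrapper, to
make the idea concrete. Everything here is an assumption for illustration:
the bit positions, struct hcall_args, FAKE_VMM_RESULT and hcall_model() are
hypothetical, and the real STI + TDCALL pair has to stay back to back in the
assembly stub. The model only records the decision the stub would make.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural exit reasons (values as in the SDM / asm/vmx.h). */
#define EXIT_REASON_HLT		12
#define EXIT_REASON_MSR_READ	31

/* Flag names from the proposal above; the bit positions are assumptions. */
#define HCALL_ISSUE_STI		(1ULL << 0)
#define HCALL_HAS_OUTPUT	(1ULL << 1)

/* Made-up value the fake VMM "returns" in this model. */
#define FAKE_VMM_RESULT		0x1234ULL

/* Hypothetical in/out block for one hypercall. */
struct hcall_args {
	uint64_t r12, r13, r14, r15;	/* inputs */
	uint64_t r11;			/* output, written only with HCALL_HAS_OUTPUT */
};

/* Records what the real assembly stub would have done. */
static struct {
	uint64_t op;
	bool sti_issued;
} last_call;

static uint64_t hcall_model(uint64_t op, uint64_t flags, struct hcall_args *args)
{
	last_call.op = op;

	/* The real stub would execute STI immediately before TDCALL here. */
	last_call.sti_issued = (flags & HCALL_ISSUE_STI) != 0;

	/* Copy results back only when the caller asked for them. */
	if (flags & HCALL_HAS_OUTPUT)
		args->r11 = FAKE_VMM_RESULT;

	return 0;	/* 0 means success in this model */
}
```

With this shape the exit reason stays purely architectural: a safe-halt caller
would pass EXIT_REASON_HLT together with HCALL_ISSUE_STI instead of a
synthetic exit reason.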

Thanks,

        tglx

* Re: [PATCHv2 06/29] x86/tdx: Add MSR support for TDX guests
  2022-02-02 13:06     ` Kirill A. Shutemov
@ 2022-02-02 17:18       ` Thomas Gleixner
  0 siblings, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-02 17:18 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: mingo, bp, dave.hansen, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86, linux-kernel

On Wed, Feb 02 2022 at 16:06, Kirill A. Shutemov wrote:
> On Tue, Feb 01, 2022 at 10:38:05PM +0100, Thomas Gleixner wrote:
>> Hmm?
>
> No problem with this approach on read side.
>
> On the write side there's one important optimization (outside of the
> initial TDX enabling):
>
> https://github.com/intel/tdx/commit/2cea8becaa5a287c93266c01fc7f2a4ed53c509d
>
> It will require rework, maybe use separate __tdx_hypercall() for the
> paravirt call implementation.

Yes, but that's not the end of the world.

Thanks,

        tglx

* Re: [PATCHv2 11/29] x86/boot: Allow to hook up alternative port I/O helpers
  2022-02-01 22:53     ` Thomas Gleixner
@ 2022-02-02 17:20       ` Kirill A. Shutemov
  2022-02-02 19:05         ` Thomas Gleixner
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-02 17:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, bp, dave.hansen, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86, linux-kernel

On Tue, Feb 01, 2022 at 11:53:28PM +0100, Thomas Gleixner wrote:
> On Tue, Feb 01 2022 at 23:39, Thomas Gleixner wrote:
> 
> > On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
> >
> >> Port I/O instructions trigger #VE in the TDX environment. In response to
> >> the exception, kernel emulates these instructions using hypercalls.
> >>
> >> But during early boot, on the decompression stage, it is cumbersome to
> >> deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE
> >> handling.
> >>
> >> Add a way to hook up alternative port I/O helpers in the boot stub.
> >> All port I/O operations are routed via 'pio_ops'. By default 'pio_ops' is
> >> initialized with native port I/O implementations.
> >>
> >> This is a preparation patch. The next patch will override 'pio_ops' if
> >> the kernel booted in the TDX environment.
> >>
> >> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >
> > Aside of Borislav's comments:
> >
> > Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> 
> Second thoughts.
> 
> > +#include <asm/shared/io.h>
> > +
> > +struct port_io_ops {
> > +	unsigned char (*inb)(int port);
> > +	unsigned short (*inw)(int port);
> > +	unsigned int (*inl)(int port);
> > +	void (*outb)(unsigned char v, int port);
> > +	void (*outw)(unsigned short v, int port);
> > +	void (*outl)(unsigned int v, int port);
> > +};
> 
> Can we please make that u8, u16, u32 instead of unsigned char,short,int?
> 
> That's the kernel convention for hardware related functions for many
> years now.

I inherited these prototypes from the main kernel I/O helpers. See patch
10/29.

Do you want 10/29 to be changed to use u8/16/32?

Maybe a separate patch to convert main kernel to u8/16/32 before
consolidation with boot stub?

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 11/29] x86/boot: Allow to hook up alternative port I/O helpers
  2022-02-02 17:20       ` Kirill A. Shutemov
@ 2022-02-02 19:05         ` Thomas Gleixner
  0 siblings, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-02 19:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: mingo, bp, dave.hansen, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86, linux-kernel

On Wed, Feb 02 2022 at 20:20, Kirill A. Shutemov wrote:
> On Tue, Feb 01, 2022 at 11:53:28PM +0100, Thomas Gleixner wrote:
>> Can we please make that u8, u16, u32 instead of unsigned char,short,int?
>> 
>> That's the kernel convention for hardware related functions for many
>> years now.
>
> I inherited these prototypes from the main kernel I/O helpers. See patch
> 10/29.
>
> Do you want 10/29 to be changed to use u8/16/32?
>
> Maybe a separate patch to convert main kernel to u8/16/32 before
> consolidation with boot stub?

Yes, please. That aligns them also with the asm-generic/io.h
implementation.
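
To illustrate what the converted ops table could look like, here is a small
user-space sketch. It is an assumption-laden model, not the patch: only the
byte-wide helpers are shown, the u16 port type is my choice for the sketch,
and fake_ports[], hcall_count and use_hypercall_pio() are hypothetical names
that stand in for real port access and for the TDX override in the next patch.

```c
#include <stdint.h>

typedef uint8_t  u8;
typedef uint16_t u16;

/* Ops table with fixed-width types (byte-wide helpers only, for brevity). */
struct port_io_ops {
	u8   (*inb)(u16 port);
	void (*outb)(u8 v, u16 port);
};

/* Backing store standing in for real I/O ports. */
static u8 fake_ports[4];

/* Counts how often the "hypercall" backend was used. */
static unsigned int hcall_count;

static u8 native_inb(u16 port)
{
	return fake_ports[port & 3];
}

static void native_outb(u8 v, u16 port)
{
	fake_ports[port & 3] = v;
}

static u8 hcall_inb(u16 port)
{
	hcall_count++;
	return fake_ports[port & 3];
}

static void hcall_outb(u8 v, u16 port)
{
	hcall_count++;
	fake_ports[port & 3] = v;
}

/* By default the table points at the native helpers. */
static struct port_io_ops pio_ops = {
	.inb  = native_inb,
	.outb = native_outb,
};

/* What a TDX-aware boot stub would do once it detects the environment. */
static void use_hypercall_pio(void)
{
	pio_ops.inb  = hcall_inb;
	pio_ops.outb = hcall_outb;
}
```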

Thanks,

        tglx

* Re: [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-02-02  0:25   ` Thomas Gleixner
@ 2022-02-02 19:27     ` Kirill A. Shutemov
  2022-02-02 19:47       ` Thomas Gleixner
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-02 19:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, bp, dave.hansen, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86, linux-kernel

On Wed, Feb 02, 2022 at 01:25:28AM +0100, Thomas Gleixner wrote:
> On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
> 
> > In TDX guests, guest memory is protected from host access. If a guest
> > performs I/O, it needs to explicitly share the I/O memory with the host.
> >
> > Make all ioremap()ed pages that are not backed by normal memory
> > (IORES_DESC_NONE or IORES_DESC_RESERVED) mapped as shared.
> >
> > Since TDX memory encryption support is similar to AMD SEV architecture,
> > reuse the infrastructure from AMD SEV code.
> >
> > Add tdx_shared_mask() interface to get the TDX guest shared bitmask.
> >
> > pgprot_decrypted() is used by drivers (i915, virtio_gpu, vfio). Export
> > both pgprot_encrypted() and pgprot_decrypted().
> 
> How so?
> 
> # git grep pgprot_encrypted
> arch/x86/include/asm/pgtable.h:#define pgprot_encrypted(prot)   __pgprot(__sme_set(pgprot_val(prot)))
> arch/x86/mm/ioremap.c:          prot = pgprot_encrypted(prot);
> arch/x86/mm/ioremap.c:  return encrypted_prot ? pgprot_encrypted(prot)
> arch/x86/mm/mem_encrypt_amd.c:          protection_map[i] = pgprot_encrypted(protection_map[i]);
> arch/x86/mm/pat/set_memory.c:           cpa.mask_clr = pgprot_encrypted(cpa.mask_clr);
> arch/x86/platform/efi/quirks.c:                           pgprot_val(pgprot_encrypted(FIXMAP_PAGE_NORMAL)));
> fs/proc/vmcore.c:       prot = pgprot_encrypted(prot);
> include/linux/pgtable.h:#ifndef pgprot_encrypted
> include/linux/pgtable.h:#define pgprot_encrypted(prot)  (prot)
> 
> I cannot find any of the above mentioned subsystems in this grep
> output. Neither does this patch add any users which require those
> exports.

Try to grep pgprot_decrypted().

I guess we can get away not exporting pgprot_encrypted(), but this
asymmetry bothers me :)

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-02-02 19:27     ` Kirill A. Shutemov
@ 2022-02-02 19:47       ` Thomas Gleixner
  0 siblings, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-02 19:47 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: mingo, bp, dave.hansen, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86, linux-kernel

On Wed, Feb 02 2022 at 22:27, Kirill A. Shutemov wrote:
> On Wed, Feb 02, 2022 at 01:25:28AM +0100, Thomas Gleixner wrote:
>> On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
>> I cannot find any of the above mentioned subsystems in this grep
>> output. Neither does this patch add any users which require those
>> exports.
>
> Try to grep pgprot_decrypted().

Bah.

> I guess we can get away not exporting pgprot_encrypted(), but this
> asymmetry bothers me :)

Well, no. We export only stuff which is needed. Exporting just because
is a NONO.

Thanks,

        tglx

* Re: [PATCHv2 01/29] x86/tdx: Detect running as a TDX guest in early boot
  2022-02-01 23:14     ` Kirill A. Shutemov
@ 2022-02-03  0:32       ` Josh Poimboeuf
  2022-02-03 14:09         ` Kirill A. Shutemov
  0 siblings, 1 reply; 154+ messages in thread
From: Josh Poimboeuf @ 2022-02-03  0:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, knsathya, pbonzini, sdeep, seanjc,
	tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Wed, Feb 02, 2022 at 02:14:59AM +0300, Kirill A. Shutemov wrote:
> On Tue, Feb 01, 2022 at 08:29:55PM +0100, Thomas Gleixner wrote:
> > Kirill,
> > 
> > On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
> > 
> > Just a nitpick...
> > 
> > > +static bool tdx_guest_detected __ro_after_init;
> > > +
> > > +bool is_tdx_guest(void)
> > > +{
> > > +	return tdx_guest_detected;
> > > +}
> > > +
> > > +void __init tdx_early_init(void)
> > > +{
> > > +	u32 eax, sig[3];
> > > +
> > > +	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2],  &sig[1]);
> > > +
> > > +	if (memcmp(TDX_IDENT, sig, 12))
> > > +		return;
> > > +
> > > +	tdx_guest_detected = true;
> > > +
> > > +	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
> > 
> > So with that we have two ways to detect a TDX guest:
> > 
> >    - tdx_guest_detected
> > 
> >    - X86_FEATURE_TDX_GUEST
> > 
> > Shouldn't X86_FEATURE_TDX_GUEST be good enough?
> 
> Right. We have only 3 callers of is_tdx_guest() in cc_platform.c
> I will replace them with cpu_feature_enabled(X86_FEATURE_TDX_GUEST).

I had the same review comment before.  I was told that cc_platform_has()
was called before caps have been set up properly, so caps can't be
relied upon this early.

I'm not really convinced that's true.  Yes, early_identify_cpu() --
which runs after tdx_early_init() -- clears all boot_cpu_data's
capability bits to zero [*].

But shortly after that, early_identify_cpu() restores any "forced" caps
with a call to get_cpu_cap() -> apply_forced_caps().

So as far as I can tell, while it's subtle, it should work.  However, it
should be tested carefully ;-)


[ *] The memset() of boot_cpu_data seems unnecessary since it should
     already be cleared by the compiler when it gets placed in the
     .data..read_mostly section.
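
The sequence described above can be modelled in a few lines of user-space C.
This is a simplified sketch, not the kernel code: the arrays, NCAPS, the
FEATURE_TDX_GUEST index and all function names are hypothetical, and the real
apply_forced_caps() also handles cleared caps, which is left out here.

```c
#include <stdbool.h>
#include <string.h>

#define NCAPS			4
#define FEATURE_TDX_GUEST	2	/* hypothetical bit index */

static bool cpu_caps[NCAPS];		/* models boot_cpu_data capability bits */
static bool cpu_caps_forced[NCAPS];	/* models the separate "forced set" mask */

/* Models setup_force_cpu_cap(): set the cap now and remember it as forced. */
static void setup_force_cap(int bit)
{
	cpu_caps[bit] = true;
	cpu_caps_forced[bit] = true;
}

/* Models the "set" half of apply_forced_caps(). */
static void apply_forced(void)
{
	for (int i = 0; i < NCAPS; i++) {
		if (cpu_caps_forced[i])
			cpu_caps[i] = true;
	}
}

/* Models early_identify_cpu(): wipe the caps, then restore forced ones. */
static void early_identify_model(void)
{
	memset(cpu_caps, 0, sizeof(cpu_caps));
	/* Between these two steps the forced cap is temporarily not visible. */
	apply_forced();
}

static bool feature_enabled(int bit)
{
	return cpu_caps[bit];
}
```

The model shows why the forced bit survives the wipe. It says nothing about
callers that run before the bit is first set, which still needs testing on
real boot paths.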

-- 
Josh


* Re: [PATCHv2 01/29] x86/tdx: Detect running as a TDX guest in early boot
  2022-02-03  0:32       ` Josh Poimboeuf
@ 2022-02-03 14:09         ` Kirill A. Shutemov
  2022-02-03 15:13           ` Dave Hansen
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-03 14:09 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Thomas Gleixner, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, knsathya, pbonzini, sdeep, seanjc,
	tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Wed, Feb 02, 2022 at 04:32:09PM -0800, Josh Poimboeuf wrote:
> On Wed, Feb 02, 2022 at 02:14:59AM +0300, Kirill A. Shutemov wrote:
> > On Tue, Feb 01, 2022 at 08:29:55PM +0100, Thomas Gleixner wrote:
> > > Kirill,
> > > 
> > > On Mon, Jan 24 2022 at 18:01, Kirill A. Shutemov wrote:
> > > 
> > > Just a nitpick...
> > > 
> > > > +static bool tdx_guest_detected __ro_after_init;
> > > > +
> > > > +bool is_tdx_guest(void)
> > > > +{
> > > > +	return tdx_guest_detected;
> > > > +}
> > > > +
> > > > +void __init tdx_early_init(void)
> > > > +{
> > > > +	u32 eax, sig[3];
> > > > +
> > > > +	cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2],  &sig[1]);
> > > > +
> > > > +	if (memcmp(TDX_IDENT, sig, 12))
> > > > +		return;
> > > > +
> > > > +	tdx_guest_detected = true;
> > > > +
> > > > +	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
> > > 
> > > So with that we have two ways to detect a TDX guest:
> > > 
> > >    - tdx_guest_detected
> > > 
> > >    - X86_FEATURE_TDX_GUEST
> > > 
> > > Shouldn't X86_FEATURE_TDX_GUEST be good enough?
> > 
> > Right. We have only 3 callers of is_tdx_guest() in cc_platform.c
> > I will replace them with cpu_feature_enabled(X86_FEATURE_TDX_GUEST).
> 
> I had the same review comment before.  I was told that cc_platform_has()
> was called before caps have been set up properly, so caps can't be
> relied upon this early.
> 
> I'm not really convinced that's true.  Yes, early_identify_cpu() --
> which runs after tdx_early_init() -- clears all boot_cpu_data's
> capability bits to zero [*].
> 
> But shortly after that, early_identify_cpu() restores any "forced" caps
> with a call to get_cpu_cap() -> apply_forced_caps().
> 
> So as far as I can tell, while it's subtle, it should work.  However, it
> should be tested carefully ;-)
> 
> 
> [ *] The memset() of boot_cpu_data seems unnecessary since it should
>      already be cleared by the compiler when it gets placed in the
>      .data..read_mostly section.
> 

There are a couple of uses of cc_platform_has() before tdx_early_init() is
called: in sme_map_bootdata() and sme_unmap_bootdata(). Both are called from
copy_bootdata().

We can move tdx_early_init() before copy_bootdata(), but in this case
tdx_early_init() won't be able to parse the kernel command line. This
capability is used by patches outside the initial TDX submission.

I just realized that we have moved tdx_early_init() back and forth a few
times for this reason. Ughh..

I will rework (or drop) patches that parse command line options from
tdx_early_init() and move tdx_early_init() before copy_bootdata().

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 03/29] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-02-02 10:59       ` Kai Huang
@ 2022-02-03 14:44         ` Kirill A. Shutemov
  2022-02-03 23:47           ` Kai Huang
  2022-02-04  3:43           ` Kirill A. Shutemov
  0 siblings, 2 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-03 14:44 UTC (permalink / raw)
  To: Kai Huang
  Cc: Thomas Gleixner, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kai Huang

On Wed, Feb 02, 2022 at 11:59:10PM +1300, Kai Huang wrote:
> 
> >  /*
> > diff --git a/arch/x86/kernel/tdxcall.S b/arch/x86/kernel/tdxcall.S
> > new file mode 100644
> > index 000000000000..ba5e8e35de36
> > --- /dev/null
> > +++ b/arch/x86/kernel/tdxcall.S
> > @@ -0,0 +1,76 @@
> > +#include <asm/asm-offsets.h>
> > +
> > +/*
> > + * TDX guests use the TDCALL instruction to make requests to the
> > + * TDX module and hypercalls to the VMM.
> > + *
> > + * TDX hosts use the SEAMCALL instruction to make requests to the TDX module.
> > + *
> > + * They are supported in Binutils >= 2.36.
> > + */
> > +#define tdcall		.byte 0x66,0x0f,0x01,0xcc
> > +#define seamcall	.byte 0x66,0x0f,0x01,0xcf
> > +
> > +.macro TDX_MODULE_CALL host:req
> > +	/*
> > +	 * R12 will be used as temporary storage for struct tdx_module_output
> > +	 * pointer. Since R12-R15 registers are not used by TDCALL/SEAMCALL
> > +	 * services supported by this function, it can be reused.
> > +	 */
> > +
> > +	/* Callee saved, so preserve it */
> > +	push %r12
> > +
> > +	/*
> > +	 * Push output pointer to stack.
> > +	 * After the operation, it will be fetched into R12 register.
> > +	 */
> > +	push %r9
> > +
> > +	/* Mangle function call ABI into TDCALL/SEAMCALL ABI: */
> > +	/* Move Leaf ID to RAX */
> > +	mov %rdi, %rax
> > +	/* Move input 4 to R9 */
> > +	mov %r8,  %r9
> > +	/* Move input 3 to R8 */
> > +	mov %rcx, %r8
> > +	/* Move input 1 to RCX */
> > +	mov %rsi, %rcx
> > +	/* Leave input param 2 in RDX */
> 
> Currently host seamcall also uses a data structure which has all possible input
> registers as input argument, rather than having one argument for each register:
> 
> 	struct seamcall_regs_in {
> 		u64 rcx;
> 		u64 rdx;
> 		u64 r8;
> 		u64 r9;
> 	};
> 
> Which way is better (above struct name can be changed of course)?
> 
> Or should we rename tdx_module_output to tdx_module_regs, and use it as both
> input and output (similar to __tdx_hypercall())?

Unlike the hypercall case, here we have a more manageable number of arguments.
I would rather keep input arguments outside of any structs. It is easier
for callers, IMO.

Any objections?

> > +
> > +	.if \host
> > +	seamcall
> > +	.else
> > +	tdcall
> > +	.endif
> > +
> > +	/*
> > +	 * Fetch output pointer from stack to R12 (It is used
> > +	 * as temporary storage)
> > +	 */
> > +	pop %r12
> 
> For host SEAMCALL, one additional check against VMfailInvalid needs to be
> done after here.  This is because at host side, both P-SEAMLDR and TDX module
> are expected to be loaded by UEFI (i.e. UEFI shell tool) before booting to the
> kernel, therefore host kernel needs to detect whether they have been loaded, by
> issuing SEAMCALL.
> 
> When SEAM software (P-SEAMLDR or TDX module) is not loaded, the SEAMCALL fails
> with VMfailInvalid.  VMfailInvalid is indicated via a combination of RFLAGS
> rather than using %rax.  In practice, RFLAGS.CF=1 can be used to check whether
> VMfailInvalid happened, and we need to have something like below:
> 
> 	/*
> 	 * SEAMCALL instruction is essentially a VMExit from VMX root
> 	 * mode to SEAM VMX root mode.  VMfailInvalid (CF=1) indicates
> 	 * that the targeted SEAM firmware is not loaded or disabled,
> 	 * or P-SEAMLDR is busy with another SEAMCALL.  %rax is not
> 	 * changed in this case.
> 	 *
> 	 * Set %rax to VMFAILINVALID for VMfailInvalid.  This value
> 	 * will never be used as actual SEAMCALL error code.
> 	 */
> 	jnb     .Lno_vmfailinvalid
> 	mov     $(VMFAILINVALID), %rax
> 	jmp     .Lno_output_struct
> 
> .Lno_vmfailinvalid:

Okay, I will add it under .if \host.

But maybe use JNC instead of JNB? Since we check for CF flag,
Jump-if-Not-Carry sounds more reasonable than Jump-if-Not-Below.
Not-Below of what?
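
JNB and JNC are two mnemonics for the same opcode, so the choice only affects
readability. What the check has to achieve can be sketched in C as below. This
is a model only: seamcall_status() is a hypothetical name, the carry flag is
passed in as a plain bool, and the sentinel value is an assumption picked so
that it cannot be mistaken for a real SEAMCALL status.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel; it must never collide with a real SEAMCALL status. */
#define VMFAILINVALID_MODEL	0x8000FF00FFFF0000ULL

/*
 * Model of the post-SEAMCALL check: CF=1 means VMfailInvalid and RAX was
 * left untouched by the instruction, so report the sentinel instead.
 */
static uint64_t seamcall_status(bool carry_flag, uint64_t rax)
{
	if (carry_flag)
		return VMFAILINVALID_MODEL;

	return rax;
}
```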

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 19/29] x86/topology: Disable CPU online/offline control for TDX guests
  2022-02-02  0:11     ` Thomas Gleixner
@ 2022-02-03 15:00       ` Borislav Petkov
  2022-02-03 21:26         ` Thomas Gleixner
  0 siblings, 1 reply; 154+ messages in thread
From: Borislav Petkov @ 2022-02-03 15:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kirill A. Shutemov, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Wed, Feb 02, 2022 at 01:11:56AM +0100, Thomas Gleixner wrote:
> On Wed, Feb 02 2022 at 01:09, Thomas Gleixner wrote:
> 
> > On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
> >>  static bool intel_cc_platform_has(enum cc_attr attr)
> >>  {
> >> -	if (attr == CC_ATTR_GUEST_UNROLL_STRING_IO)
> >> +	switch (attr) {
> >> +	case CC_ATTR_GUEST_UNROLL_STRING_IO:
> >> +	case CC_ATTR_HOTPLUG_DISABLED:
> 
> Not that I care much, but I faintly remember that I suggested that in
> one of the gazillion of threads.

Right, and yeah, adding a separate attribute is ok too but we already
have a hotplug disable method. Why can't this call

	cpu_hotplug_disable()

on the TDX init path somewhere and have this be even simpler?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv2 01/29] x86/tdx: Detect running as a TDX guest in early boot
  2022-02-03 14:09         ` Kirill A. Shutemov
@ 2022-02-03 15:13           ` Dave Hansen
  0 siblings, 0 replies; 154+ messages in thread
From: Dave Hansen @ 2022-02-03 15:13 UTC (permalink / raw)
  To: Kirill A. Shutemov, Josh Poimboeuf
  Cc: Thomas Gleixner, mingo, bp, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, knsathya, pbonzini, sdeep, seanjc,
	tony.luck, vkuznets, wanpengli, x86, linux-kernel

On 2/3/22 06:09, Kirill A. Shutemov wrote:
> I just realized that we have moved tdx_early_init() back and forth a few
> times for this reason. Ughh..

Can you flesh out the changelogs so we don't continue to repeat history,
please?

* Re: [PATCHv2 19/29] x86/topology: Disable CPU online/offline control for TDX guests
  2022-02-03 15:00       ` Borislav Petkov
@ 2022-02-03 21:26         ` Thomas Gleixner
  0 siblings, 0 replies; 154+ messages in thread
From: Thomas Gleixner @ 2022-02-03 21:26 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Thu, Feb 03 2022 at 16:00, Borislav Petkov wrote:
> On Wed, Feb 02, 2022 at 01:11:56AM +0100, Thomas Gleixner wrote:
>> On Wed, Feb 02 2022 at 01:09, Thomas Gleixner wrote:
>> 
>> > On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
>> >>  static bool intel_cc_platform_has(enum cc_attr attr)
>> >>  {
>> >> -	if (attr == CC_ATTR_GUEST_UNROLL_STRING_IO)
>> >> +	switch (attr) {
>> >> +	case CC_ATTR_GUEST_UNROLL_STRING_IO:
>> >> +	case CC_ATTR_HOTPLUG_DISABLED:
>> 
>> Not that I care much, but I faintly remember that I suggested that in
>> one of the gazillion of threads.
>
> Right, and yeah, adding a separate attribute is ok too but we already
> have a hotplug disable method. Why can't this call
>
> 	cpu_hotplug_disable()
>
> on the TDX init path somewhere and have this be even simpler?

That's daft. I rather have this explicit control which makes it obvious
what's going on.

Thanks,

        tglx

* Re: [PATCHv2 03/29] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-02-03 14:44         ` Kirill A. Shutemov
@ 2022-02-03 23:47           ` Kai Huang
  2022-02-04  3:43           ` Kirill A. Shutemov
  1 sibling, 0 replies; 154+ messages in thread
From: Kai Huang @ 2022-02-03 23:47 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kai Huang


> > 	/*
> > 	 * SEAMCALL instruction is essentially a VMExit from VMX root
> > 	 * mode to SEAM VMX root mode.  VMfailInvalid (CF=1) indicates
> > 	 * that the targeted SEAM firmware is not loaded or disabled,
> > 	 * or P-SEAMLDR is busy with another SEAMCALL.  %rax is not
> > 	 * changed in this case.
> > 	 *
> > 	 * Set %rax to VMFAILINVALID for VMfailInvalid.  This value
> > 	 * will never be used as actual SEAMCALL error code.
> > 	 */
> > 	jnb     .Lno_vmfailinvalid
> > 	mov     $(VMFAILINVALID), %rax
> > 	jmp     .Lno_output_struct
> > 
> > .Lno_vmfailinvalid:
> 
> Okay, I will add it under .if \host.
> 
> But maybe use JNC instead of JNB? Since we check for CF flag,
> Jump-if-Not-Carry sounds more reasonable than Jump-if-Not-Below.
> Not-Below of what?

Fine with JNC :)

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 03/29] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-02-03 14:44         ` Kirill A. Shutemov
  2022-02-03 23:47           ` Kai Huang
@ 2022-02-04  3:43           ` Kirill A. Shutemov
  2022-02-04  9:51             ` Kai Huang
  2022-02-04 10:12             ` Kai Huang
  1 sibling, 2 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-04  3:43 UTC (permalink / raw)
  To: Kai Huang
  Cc: Thomas Gleixner, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Thu, Feb 03, 2022 at 05:44:03PM +0300, Kirill A. Shutemov wrote:
> Any objections?

Below is a proper patch of the idea. It can be used to implement both
SEAMCALL and TDCALL wrappers.

It works for TDCALL. Kai, could you check if it is fine for SEAMCALL?

----------------------------------8<-----------------------------------

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Fri, 4 Feb 2022 02:03:21 +0300
Subject: [PATCH] x86/tdx: Provide common base for SEAMCALL and TDCALL C
 wrappers

Secure Arbitration Mode (SEAM) is an extension of VMX architecture.  It
defines a new VMX root operation (SEAM VMX root) and a new VMX non-root
operation (SEAM VMX non-root) which are both isolated from the legacy
VMX operation where the host kernel runs.

A CPU-attested software module (called 'TDX module') runs in SEAM VMX
root to manage and protect VMs running in SEAM VMX non-root.  SEAM VMX
root is also used to host another CPU-attested software module (called
'P-SEAMLDR') to load and update the TDX module.

Host kernel transits to either P-SEAMLDR or TDX module via the new
SEAMCALL instruction, which is essentially a VMExit from VMX root mode
to SEAM VMX root mode.  SEAMCALLs are leaf functions defined by
P-SEAMLDR and TDX module around the new SEAMCALL instruction.

A guest kernel can also communicate with TDX module via TDCALL
instruction.

TDCALLs and SEAMCALLs use an ABI different from the x86-64 system-v ABI.
RAX is used to carry both the SEAMCALL leaf function number (input) and
the completion status (output).  Additional GPRs (RCX, RDX, R8-R11) may
be further used as both input and output operands in individual leaf.

TDCALL and SEAMCALL share the same ABI and require largely the same
code to pass down arguments and retrieve results.

Define an assembly macro that can be used to implement C wrappers for
both TDCALL and SEAMCALL.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/tdx.h    | 20 ++++++++
 arch/x86/kernel/asm-offsets.c |  9 ++++
 arch/x86/kernel/tdxcall.S     | 91 +++++++++++++++++++++++++++++++++++
 3 files changed, 120 insertions(+)
 create mode 100644 arch/x86/kernel/tdxcall.S

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index ba8042ce61c2..2f8cb1e53e77 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,6 +8,25 @@
 #define TDX_CPUID_LEAF_ID	0x21
 #define TDX_IDENT		"IntelTDX    "
 
+#define TDX_SEAMCALL_VMFAILINVALID     0x8000FF00FFFF0000ULL
+
+#ifndef __ASSEMBLY__
+
+/*
+ * Used to gather the output registers values of the TDCALL and SEAMCALL
+ * instructions when requesting services from the TDX module.
+ *
+ * This is a software only structure and not part of the TDX module/VMM ABI.
+ */
+struct tdx_module_output {
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+};
+
 #ifdef CONFIG_INTEL_TDX_GUEST
 
 void __init tdx_early_init(void);
@@ -18,4 +37,5 @@ static inline void tdx_early_init(void) { };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
 
+#endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 9fb0a2f8b62a..7dca52f5cfc6 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -18,6 +18,7 @@
 #include <asm/bootparam.h>
 #include <asm/suspend.h>
 #include <asm/tlbflush.h>
+#include <asm/tdx.h>
 
 #ifdef CONFIG_XEN
 #include <xen/interface/xen.h>
@@ -65,6 +66,14 @@ static void __used common(void)
 	OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
 #endif
 
+	BLANK();
+	OFFSET(TDX_MODULE_rcx, tdx_module_output, rcx);
+	OFFSET(TDX_MODULE_rdx, tdx_module_output, rdx);
+	OFFSET(TDX_MODULE_r8,  tdx_module_output, r8);
+	OFFSET(TDX_MODULE_r9,  tdx_module_output, r9);
+	OFFSET(TDX_MODULE_r10, tdx_module_output, r10);
+	OFFSET(TDX_MODULE_r11, tdx_module_output, r11);
+
 	BLANK();
 	OFFSET(BP_scratch, boot_params, scratch);
 	OFFSET(BP_secure_boot, boot_params, secure_boot);
diff --git a/arch/x86/kernel/tdxcall.S b/arch/x86/kernel/tdxcall.S
new file mode 100644
index 000000000000..27d6fcc8e44c
--- /dev/null
+++ b/arch/x86/kernel/tdxcall.S
@@ -0,0 +1,91 @@
+#include <asm/asm-offsets.h>
+#include <asm/tdx.h>
+
+/*
+ * TDX guests use the TDCALL instruction to make requests to the
+ * TDX module and hypercalls to the VMM.
+ *
+ * TDX hosts use the SEAMCALL instruction to make requests to the TDX module.
+ *
+ * They are supported in Binutils >= 2.36.
+ */
+#define tdcall		.byte 0x66,0x0f,0x01,0xcc
+#define seamcall	.byte 0x66,0x0f,0x01,0xcf
+
+.macro TDX_MODULE_CALL host:req
+	/*
+	 * R12 will be used as temporary storage for struct tdx_module_output
+	 * pointer. Since R12-R15 registers are not used by TDCALL/SEAMCALL
+	 * services supported by this function, it can be reused.
+	 */
+
+	/* Callee saved, so preserve it */
+	push %r12
+
+	/*
+	 * Push output pointer to stack.
+	 * After the operation, it will be fetched into R12 register.
+	 */
+	push %r9
+
+	/* Mangle function call ABI into TDCALL/SEAMCALL ABI: */
+	/* Move Leaf ID to RAX */
+	mov %rdi, %rax
+	/* Move input 4 to R9 */
+	mov %r8,  %r9
+	/* Move input 3 to R8 */
+	mov %rcx, %r8
+	/* Move input 1 to RCX */
+	mov %rsi, %rcx
+	/* Leave input param 2 in RDX */
+
+	.if \host
+	seamcall
+	/*
+	 * SEAMCALL instruction is essentially a VMExit from VMX root
+	 * mode to SEAM VMX root mode.  VMfailInvalid (CF=1) indicates
+	 * that the targeted SEAM firmware is not loaded or disabled,
+	 * or P-SEAMLDR is busy with another SEAMCALL.  %rax is not
+	 * changed in this case.
+	 *
+	 * Set %rax to TDX_SEAMCALL_VMFAILINVALID for VMfailInvalid.
+	 * This value will never be used as actual SEAMCALL error code.
+	 */
+	jnc     .Lno_vmfailinvalid
+	mov     $TDX_SEAMCALL_VMFAILINVALID, %rax
+	jmp     .Lno_output_struct
+.Lno_vmfailinvalid:
+	.else
+	tdcall
+	.endif
+
+	/*
+	 * Fetch output pointer from stack to R12 (It is used
+	 * as temporary storage)
+	 */
+	pop %r12
+
+	/* Check for success: 0 - Successful, otherwise failed */
+	test %rax, %rax
+	jnz .Lno_output_struct
+
+	/*
+	 * Since this function can be initiated without an output pointer,
+	 * check if caller provided an output struct before storing
+	 * output registers.
+	 */
+	test %r12, %r12
+	jz .Lno_output_struct
+
+	/* Copy result registers to output struct: */
+	movq %rcx, TDX_MODULE_rcx(%r12)
+	movq %rdx, TDX_MODULE_rdx(%r12)
+	movq %r8,  TDX_MODULE_r8(%r12)
+	movq %r9,  TDX_MODULE_r9(%r12)
+	movq %r10, TDX_MODULE_r10(%r12)
+	movq %r11, TDX_MODULE_r11(%r12)
+
+.Lno_output_struct:
+	/* Restore the state of R12 register */
+	pop %r12
+.endm
-- 
 Kirill A. Shutemov
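
As an illustration only (not part of the patch), the register shuffling
done by TDX_MODULE_CALL can be modelled in plain C. This is a sketch
under assumptions: struct model_regs, model_insn_t and the two fake
instructions are hypothetical stand-ins for the real TDCALL/SEAMCALL
instruction, and only struct tdx_module_output is taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Same layout as struct tdx_module_output in the patch. */
struct tdx_module_output {
	uint64_t rcx;
	uint64_t rdx;
	uint64_t r8;
	uint64_t r9;
	uint64_t r10;
	uint64_t r11;
};

/* Hypothetical: the registers the instruction reads and writes. */
struct model_regs {
	uint64_t rax, rcx, rdx, r8, r9, r10, r11;
};

/* Hypothetical stand-in for the TDCALL/SEAMCALL instruction. */
typedef void (*model_insn_t)(struct model_regs *regs);

/* C model of the macro: function-call ABI in, TDCALL/SEAMCALL ABI out. */
uint64_t model_tdx_module_call(uint64_t fn, uint64_t in1, uint64_t in2,
			       uint64_t in3, uint64_t in4,
			       struct tdx_module_output *out,
			       model_insn_t insn)
{
	struct model_regs regs = {
		.rax = fn,	/* mov %rdi, %rax */
		.rcx = in1,	/* mov %rsi, %rcx */
		.rdx = in2,	/* input 2 is already in RDX */
		.r8  = in3,	/* mov %rcx, %r8 */
		.r9  = in4,	/* mov %r8, %r9 */
	};

	insn(&regs);

	/* Copy outputs only on success (RAX == 0) and if a struct was given. */
	if (regs.rax == 0 && out != NULL) {
		out->rcx = regs.rcx;
		out->rdx = regs.rdx;
		out->r8  = regs.r8;
		out->r9  = regs.r9;
		out->r10 = regs.r10;
		out->r11 = regs.r11;
	}

	/* The completion status is whatever the instruction left in RAX. */
	return regs.rax;
}

/* Hypothetical fake instruction: succeeds and reports what it received. */
void model_insn_ok(struct model_regs *regs)
{
	regs->r10 = regs->rax;	/* echo the leaf number */
	regs->r11 = regs->rcx + regs->rdx + regs->r8 + regs->r9;
	regs->rax = 0;
}

/* Hypothetical fake instruction: fails with a non-zero status. */
void model_insn_fail(struct model_regs *regs)
{
	regs->rax = 1;
}
```

Note that the assembly performs the moves in the order R8->R9, RCX->R8,
RSI->RCX so that no input is overwritten before it has been read; the C
model does not need that care because every input has its own variable.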

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 03/29] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-02-04  3:43           ` Kirill A. Shutemov
@ 2022-02-04  9:51             ` Kai Huang
  2022-02-04 13:20               ` Kirill A. Shutemov
  2022-02-04 10:12             ` Kai Huang
  1 sibling, 1 reply; 154+ messages in thread
From: Kai Huang @ 2022-02-04  9:51 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel


> +
> +.macro TDX_MODULE_CALL host:req
> +	/*
> +	 * R12 will be used as temporary storage for struct tdx_module_output
> +	 * pointer. Since R12-R15 registers are not used by TDCALL/SEAMCALL
> +	 * services supported by this function, it can be reused.
> +	 */
> +
> +	/* Callee saved, so preserve it */
> +	push %r12
> +
> +	/*
> +	 * Push output pointer to stack.
> +	 * After the operation, it will be fetched into R12 register.
> +	 */
> +	push %r9
> +
> +	/* Mangle function call ABI into TDCALL/SEAMCALL ABI: */
> +	/* Move Leaf ID to RAX */
> +	mov %rdi, %rax
> +	/* Move input 4 to R9 */
> +	mov %r8,  %r9
> +	/* Move input 3 to R8 */
> +	mov %rcx, %r8
> +	/* Move input 1 to RCX */
> +	mov %rsi, %rcx
> +	/* Leave input param 2 in RDX */
> +
> +	.if \host
> +	seamcall
> +	/*
> +	 * SEAMCALL instruction is essentially a VMExit from VMX root
> +	 * mode to SEAM VMX root mode.  VMfailInvalid (CF=1) indicates
> +	 * that the targeted SEAM firmware is not loaded or disabled,
> +	 * or P-SEAMLDR is busy with another SEAMCALL.  %rax is not
> +	 * changed in this case.
> +	 *
> +	 * Set %rax to TDX_SEAMCALL_VMFAILINVALID for VMfailInvalid.
> +	 * This value will never be used as actual SEAMCALL error code.
> +	 */
> +	jnc     .Lno_vmfailinvalid
> +	mov     $TDX_SEAMCALL_VMFAILINVALID, %rax
> +	jmp     .Lno_output_struct

If I read correctly, in case of VMfailInvalid, another "pop %r12" is needed
before jmp to .Lno_output_struct, otherwise it doesn't match the stack (pushed
twice).

However, since "test %rax, %rax" will also catch TDX_SEAMCALL_VMFAILINVALID, it
seems we can just delete above "jmp .Lno_output_struct"?

> +.Lno_vmfailinvalid:
> +	.else
> +	tdcall
> +	.endif
> +
> +	/*
> +	 * Fetch output pointer from stack to R12 (It is used
> +	 * as temporary storage)
> +	 */
> +	pop %r12
> +
> +	/* Check for success: 0 - Successful, otherwise failed */
> +	test %rax, %rax
> +	jnz .Lno_output_struct
> +
> +	/*
> +	 * Since this function can be initiated without an output pointer,
> +	 * check if caller provided an output struct before storing
> +	 * output registers.
> +	 */
> +	test %r12, %r12
> +	jz .Lno_output_struct
> +
> +	/* Copy result registers to output struct: */
> +	movq %rcx, TDX_MODULE_rcx(%r12)
> +	movq %rdx, TDX_MODULE_rdx(%r12)
> +	movq %r8,  TDX_MODULE_r8(%r12)
> +	movq %r9,  TDX_MODULE_r9(%r12)
> +	movq %r10, TDX_MODULE_r10(%r12)
> +	movq %r11, TDX_MODULE_r11(%r12)
> +
> +.Lno_output_struct:
> +	/* Restore the state of R12 register */
> +	pop %r12
> +.endm
> -- 
>  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 03/29] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-02-04  3:43           ` Kirill A. Shutemov
  2022-02-04  9:51             ` Kai Huang
@ 2022-02-04 10:12             ` Kai Huang
  2022-02-04 13:18               ` Kirill A. Shutemov
  1 sibling, 1 reply; 154+ messages in thread
From: Kai Huang @ 2022-02-04 10:12 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel


> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -8,6 +8,25 @@
>  #define TDX_CPUID_LEAF_ID	0x21
>  #define TDX_IDENT		"IntelTDX    "

Seems above two are not required by assembly file?  If so also move them to
within #ifndef __ASSEMBLY__?

>  
> +#define TDX_SEAMCALL_VMFAILINVALID     0x8000FF00FFFF0000ULL
> +
> +#ifndef __ASSEMBLY__
> +
> +/*
> + * Used to gather the output registers values of the TDCALL and SEAMCALL
> + * instructions when requesting services from the TDX module.
> + *
> + * This is a software only structure and not part of the TDX module/VMM ABI.
> + */
> +struct tdx_module_output {
> +	u64 rcx;
> +	u64 rdx;
> +	u64 r8;
> +	u64 r9;
> +	u64 r10;
> +	u64 r11;
> +};
> +

Is declaration of __tdx_module_call() outside of CONFIG_INTEL_TDX_GUEST?

>  #ifdef CONFIG_INTEL_TDX_GUEST
>  
>  void __init tdx_early_init(void);
> @@ -18,4 +37,5 @@ static inline void tdx_early_init(void) { };
>  
>  #endif /* CONFIG_INTEL_TDX_GUEST */
>  
> +#endif /* !__ASSEMBLY__ */
>  #endif /* _ASM_X86_TDX_H */

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 16/29] x86/boot: Add a trampoline for booting APs via firmware handoff
  2022-02-02 11:27   ` Borislav Petkov
@ 2022-02-04 11:27     ` Kuppuswamy, Sathyanarayanan
  2022-02-04 13:49       ` Borislav Petkov
  2022-02-10  0:25       ` Kai Huang
  0 siblings, 2 replies; 154+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2022-02-04 11:27 UTC (permalink / raw)
  To: Borislav Petkov, Kirill A. Shutemov, Sean Christopherson
  Cc: tglx, mingo, dave.hansen, luto, peterz, aarcange, ak,
	dan.j.williams, david, hpa, jgross, jmattson, joro, jpoimboe,
	knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86, linux-kernel, Kai Huang


On 2/2/2022 3:27 AM, Borislav Petkov wrote:
> On Mon, Jan 24, 2022 at 06:02:02PM +0300, Kirill A. Shutemov wrote:
>> From: Sean Christopherson <sean.j.christopherson@intel.com>
>>
>> Historically, x86 platforms have booted secondary processors (APs)
>> using INIT followed by the start up IPI (SIPI) messages. In regular
>> VMs, this boot sequence is supported by the VMM emulation. But such a
>> wakeup model is fatal for secure VMs like TDX in which VMM is an
>> untrusted entity. To address this issue, a new wakeup model was added
>> in ACPI v6.4, in which firmware (like TDX virtual BIOS) will help boot
>> the APs. More details about this wakeup model can be found in ACPI
>> specification v6.4, the section titled "Multiprocessor Wakeup Structure".
>>
>> Since the existing trampoline code requires processors to boot in real
>> mode with 16-bit addressing, it will not work for this wakeup model
>> (because it boots the AP in 64-bit mode). To handle it, extend the
>> trampoline code to support 64-bit mode firmware handoff. Also, extend
>> IDT and GDT pointers to support 64-bit mode hand off.
>>
>> There is no TDX-specific detection for this new boot method. The kernel
>> will rely on it as the sole boot method whenever the new ACPI structure
>> is present.
>>
>> The ACPI table parser for the MADT multiprocessor wake up structure and
>> the wakeup method that uses this structure will be added by the following
>> patch in this series.
>>
>> Reported-by: Kai Huang <kai.huang@intel.com>
> I wonder what that Reported-by tag means here for this is a feature
> patch, not a bug fix or so...

I think it was added when Sean created the original patch. I don't have the
full history.

Sean, since this is not a bug fix, shall we remove the Reported-by tag?

>
>> diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
>> index 331474b150f1..fd6f6e5b755a 100644
>> --- a/arch/x86/include/asm/realmode.h
>> +++ b/arch/x86/include/asm/realmode.h
>> @@ -25,6 +25,7 @@ struct real_mode_header {
>>   	u32	sev_es_trampoline_start;
>>   #endif
>>   #ifdef CONFIG_X86_64
>> +	u32	trampoline_start64;
>>   	u32	trampoline_pgd;
>>   #endif
> Hmm, so there's trampoline_start, sev_es_trampoline_start and
> trampoline_start64. If those are mutually exclusive, can we merge them
> all into a single trampoline_start?

trampoline_start and sev_es_trampoline_start are not mutually exclusive.
Both are used in arch/x86/kernel/sev.c.

arch/x86/kernel/sev.c:560:      startup_ip = (u16)(rmh->sev_es_trampoline_start -
arch/x86/kernel/sev.c:561: rmh->trampoline_start);

But trampoline_start64 can be removed and replaced with trampoline_start.
However, the _*64 suffix makes it clear that it is used for 64-bit mode
(CONFIG_X86_64).

Adding it for clarity seems fine to me. But if you would prefer a single
variable, we can remove it. Please let me know.

>
-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 03/29] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-02-04 10:12             ` Kai Huang
@ 2022-02-04 13:18               ` Kirill A. Shutemov
  2022-02-05  0:06                 ` Kai Huang
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-04 13:18 UTC (permalink / raw)
  To: Kai Huang
  Cc: Thomas Gleixner, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Fri, Feb 04, 2022 at 11:12:39PM +1300, Kai Huang wrote:
> 
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -8,6 +8,25 @@
> >  #define TDX_CPUID_LEAF_ID	0x21
> >  #define TDX_IDENT		"IntelTDX    "
> 
> Seems above two are not required by assembly file?  If so also move them to
> within #ifndef __ASSEMBLY__?

Why? It is harmless.

> >  
> > +#define TDX_SEAMCALL_VMFAILINVALID     0x8000FF00FFFF0000ULL
> > +
> > +#ifndef __ASSEMBLY__
> > +
> > +/*
> > + * Used to gather the output registers values of the TDCALL and SEAMCALL
> > + * instructions when requesting services from the TDX module.
> > + *
> > + * This is a software only structure and not part of the TDX module/VMM ABI.
> > + */
> > +struct tdx_module_output {
> > +	u64 rcx;
> > +	u64 rdx;
> > +	u64 r8;
> > +	u64 r9;
> > +	u64 r10;
> > +	u64 r11;
> > +};
> > +
> 
> Is declaration of __tdx_module_call() outside of CONFIG_INTEL_TDX_GUEST?

No, it is defined within CONFIG_INTEL_TDX_GUEST. Why? Host side has to
build their helper anyway.

> >  #ifdef CONFIG_INTEL_TDX_GUEST
> >  
> >  void __init tdx_early_init(void);
> > @@ -18,4 +37,5 @@ static inline void tdx_early_init(void) { };
> >  
> >  #endif /* CONFIG_INTEL_TDX_GUEST */
> >  
> > +#endif /* !__ASSEMBLY__ */
> >  #endif /* _ASM_X86_TDX_H */

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 03/29] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-02-04  9:51             ` Kai Huang
@ 2022-02-04 13:20               ` Kirill A. Shutemov
  0 siblings, 0 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-04 13:20 UTC (permalink / raw)
  To: Kai Huang
  Cc: Thomas Gleixner, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Fri, Feb 04, 2022 at 10:51:38PM +1300, Kai Huang wrote:
> > +	.if \host
> > +	seamcall
> > +	/*
> > +	 * SEAMCALL instruction is essentially a VMExit from VMX root
> > +	 * mode to SEAM VMX root mode.  VMfailInvalid (CF=1) indicates
> > +	 * that the targeted SEAM firmware is not loaded or disabled,
> > +	 * or P-SEAMLDR is busy with another SEAMCALL.  %rax is not
> > +	 * changed in this case.
> > +	 *
> > +	 * Set %rax to TDX_SEAMCALL_VMFAILINVALID for VMfailInvalid.
> > +	 * This value will never be used as actual SEAMCALL error code.
> > +	 */
> > +	jnc     .Lno_vmfailinvalid
> > +	mov     $TDX_SEAMCALL_VMFAILINVALID, %rax
> > +	jmp     .Lno_output_struct
> 
> If I read correctly, in case of VMfailInvalid, another "pop %r12" is needed
> before jmp to .Lno_output_struct, otherwise it doesn't match the stack (pushed
> twice).

Oopsie. Thanks for catching it.

> However, since "test %rax, %rax" will also catch TDX_SEAMCALL_VMFAILINVALID, it
> seems we can just delete above "jmp .Lno_output_struct"?

Good point. Will do it this way.

-- 
 Kirill A. Shutemov
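
For readers following along, the imbalance Kai points out can be checked
with a small C model that only counts pushes and pops along each path of
the macro. This is a sketch, not kernel code: model_stack_depth() and
its two parameters are hypothetical names used for illustration.

```c
#include <stdbool.h>

/*
 * Returns the net stack depth (pushes minus pops) at the end of the
 * macro. A correct macro must always end at depth 0.
 *
 * vmfailinvalid: SEAMCALL returned with CF=1.
 * keep_jmp:      the "jmp .Lno_output_struct" after the VMfailInvalid
 *                fixup is still present, as in the posted patch.
 */
int model_stack_depth(bool vmfailinvalid, bool keep_jmp)
{
	int depth = 0;

	depth++;	/* push %r12 (callee-saved register) */
	depth++;	/* push %r9  (output pointer) */

	/* seamcall; on CF=1 the posted patch jumps straight to the end. */
	if (vmfailinvalid && keep_jmp)
		goto no_output_struct;

	depth--;	/* pop %r12 (fetch output pointer) */

	/*
	 * test %rax, %rax / jnz and the output stores do not touch the
	 * stack. A non-zero RAX, including TDX_SEAMCALL_VMFAILINVALID,
	 * skips the stores.
	 */

no_output_struct:
	depth--;	/* pop %r12 (restore callee-saved register) */

	return depth;
}
```

With the jmp kept, the VMfailInvalid path ends at depth 1: the final pop
loads the output pointer into R12 and the saved R12 is left on the
stack. Dropping the jmp makes every path end at depth 0.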

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 16/29] x86/boot: Add a trampoline for booting APs via firmware handoff
  2022-02-04 11:27     ` Kuppuswamy, Sathyanarayanan
@ 2022-02-04 13:49       ` Borislav Petkov
  2022-02-15 21:36         ` Kirill A. Shutemov
  2022-02-10  0:25       ` Kai Huang
  1 sibling, 1 reply; 154+ messages in thread
From: Borislav Petkov @ 2022-02-04 13:49 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Kirill A. Shutemov, Sean Christopherson, tglx, mingo,
	dave.hansen, luto, peterz, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kai Huang

On Fri, Feb 04, 2022 at 03:27:19AM -0800, Kuppuswamy, Sathyanarayanan wrote:
> trampoline_start and sev_es_trampoline_start are not mutually exclusive.
> Both are used in arch/x86/kernel/sev.c.

I know - I've asked Jörg to have a look here.

> But trampoline_start64 can be removed and replaced with trampoline_start.
> But using _*64 suffix makes it clear that is used for 64 bit(CONFIG_X86_64).
> 
> Adding it for clarity seems to be fine to me.

Does it matter if the start IP is the same for all APs? Or will there
be a case where you have some APs starting from the 32-bit trampoline
and some from the 64-bit one, on the same system? (that would be weird
but what do I know...)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2.1 05/29] x86/tdx: Add HLT support for TDX guests
  2022-02-02 17:17           ` Thomas Gleixner
@ 2022-02-04 16:55             ` Kirill A. Shutemov
  2022-02-07 22:52               ` Sean Christopherson
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-04 16:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: bp, aarcange, ak, dan.j.williams, dave.hansen, david, hpa,
	jgross, jmattson, joro, jpoimboe, knsathya, linux-kernel, luto,
	mingo, pbonzini, peterz, sathyanarayanan.kuppuswamy, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86

On Wed, Feb 02, 2022 at 06:17:08PM +0100, Thomas Gleixner wrote:
> On Wed, Feb 02 2022 at 15:48, Kirill A. Shutemov wrote:
> 
> > On Tue, Feb 01, 2022 at 10:21:58PM +0100, Thomas Gleixner wrote:
> >> On Sun, Jan 30 2022 at 01:30, Kirill A. Shutemov wrote:
> >> This really can be simplified:
> >> 
> >>         cmpl	$EXIT_REASON_SAFE_HLT, %r11d
> >>         jne	.Lnohalt
> >>         movl	$EXIT_REASON_HLT, %r11d
> >>         sti
> >> .Lnohalt:
> >> 	tdcall
> >> 
> >> and the below becomes:
> >> 
> >> static bool tdx_halt(void)
> >> {
> >> 	return !!__tdx_hypercall(EXIT_REASON_HLT, !!irqs_disabled(), 0, 0, 0, NULL);
> >> }
> >> 
> >> void __cpuidle tdx_safe_halt(void)
> >> {
> >>         if (__tdx_hypercall(EXIT_REASON_SAFE_HLT, 0, 0, 0, 0, NULL)
> >>         	WARN_ONCE(1, "HLT instruction emulation failed\n");
> >> }
> >> 
> >> Hmm?
> >
> > EXIT_REASON_* are architectural, see SDM vol 3D, appendix C. There's no
> > EXIT_REASON_SAFE_HLT.
> >
> > Do you want to define a synthetic one? Like
> >
> > #define EXIT_REASON_SAFE_HLT	0x10000
> > ?
> 
> That was my idea, yes.
> 
> > Looks dubious to me, I donno. I worry about possible conflicts with the
> > spec in the future.
> 
> The spec should have a reserved space for such things :)
> 
> But you might think about having a in/out struct similar to the module
> call or just an array of u64.
> 
> and the signature would become:
> 
> __tdx_hypercall(u64 op, u64 flags, struct inout *args)
> __tdx_hypercall(u64 op, u64 flags, u64 *args)
> 
> and have flag bits:
> 
>     HCALL_ISSUE_STI
>     HCALL_HAS_OUTPUT
> 
> Hmm?

We have two distinct cases: standard hypercalls (defined in GHCI) and KVM
hypercalls. In the first case R10 is 0 (indicating standard TDVMCALL) and
R11 defines the operation. For KVM hypercalls R10 encodes the operation
(KVM hypercalls indexed from 1) and R11 is the first argument. So we
cannot get away with simple "RDI is op" interface.

And we need to return two values: RAX indicates if TDCALL itself was
successful and R10 is result of the hypercall. So we cannot easily get
away without output struct. HCALL_HAS_OUTPUT is not needed.

I would rather keep assembly side simple: shuffle values from the struct
to registers and back. C side is responsible for making sense of the
registers.

With all this in mind the __tdx_hypercall() will boil down to

u64 __tdx_hypercall(struct tdx_hypercall_args *args, u64 flags);

with the only flag HCALL_ISSUE_STI. Is it what you want to see?

I personally don't see why flag is better than synthetic argument as we
have now.

-- 
 Kirill A. Shutemov
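
To make the two calling conventions described above concrete, here is a
minimal C sketch. It rests on assumptions: the layout of the argument
struct is not given in this thread, so struct model_hypercall_args
(reduced to R10 and R11) and all helper names are hypothetical. Only the
roles of RAX, R10 and R11 come from the message above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical argument block; further argument registers are omitted. */
struct model_hypercall_args {
	uint64_t r10;
	uint64_t r11;
};

/* Standard (GHCI) hypercall: R10 is 0, R11 selects the operation. */
void model_fill_standard(struct model_hypercall_args *args, uint64_t op)
{
	args->r10 = 0;
	args->r11 = op;
}

/* KVM hypercall: R10 is the operation (starting at 1), R11 the first argument. */
void model_fill_kvm(struct model_hypercall_args *args, uint64_t nr,
		    uint64_t arg0)
{
	args->r10 = nr;
	args->r11 = arg0;
}

/* R10 == 0 is what marks a standard TDVMCALL. */
bool model_is_standard(const struct model_hypercall_args *args)
{
	return args->r10 == 0;
}

/*
 * Two values come back: RAX says whether TDCALL itself succeeded, and
 * R10 carries the result of the hypercall. The hypercall result is
 * only meaningful if TDCALL succeeded.
 */
bool model_hypercall_result(uint64_t rax, uint64_t r10, uint64_t *result)
{
	if (rax != 0)
		return false;

	*result = r10;
	return true;
}
```

Because the meaning of R10 and R11 differs between the two cases, a
single "op" argument cannot describe both, which is the reason given
above for passing a struct and letting the C side interpret it.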

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 28/29] x86/tdx: Warn about unexpected WBINVD
  2022-02-02  1:46   ` Thomas Gleixner
@ 2022-02-04 21:35     ` Kirill A. Shutemov
  0 siblings, 0 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-04 21:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, bp, dave.hansen, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86, linux-kernel

On Wed, Feb 02, 2022 at 02:46:17AM +0100, Thomas Gleixner wrote:
> On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
> 
> > WBINVD causes #VE in TDX guests. There's no reliable way to emulate it.
> > The kernel can ask for VMM assistance, but VMM is untrusted and can ignore
> > the request.
> >
> > Fortunately, there is no use case for WBINVD inside TDX guests.
> 
> If there is not usecase, then why
> 
> > Warn about any unexpected WBINVD.
> 
> instead of terminating the whole thing?
> 
> I'm tired of the "let us emit a warning in the hope it gets fixed'
> thinking.

I probably misunderstood what you meant in the previous WBINVD thread[1] by:

	Then you have the #VE handler which just acts on any other wbinvd
	invocation via warn, panic, whatever, no?

I went the warn path because I think it is consistent with BUG() vs. WARN()
policy: "Use WARN() and WARN_ON() instead, and handle the "impossible"
error condition as gracefully as possible."[2]

IMO, an ignored WBINVD is less likely to lead to user data loss than panic().

Anyway, I'm okay dropping the patch. It will bring us to "terminate whole
thing" solution. I just wanted to explain why the patch is present in the
series.

[1] https://lore.kernel.org/all/87lf126010.ffs@tglx/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/deprecated.rst

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [PATCHv2 26/29] x86/tdx: ioapic: Add shared bit for IOAPIC base address
  2022-02-02  1:33   ` Thomas Gleixner
@ 2022-02-04 22:09     ` Yamahata, Isaku
  2022-02-04 22:31     ` Kirill A. Shutemov
  1 sibling, 0 replies; 154+ messages in thread
From: Yamahata, Isaku @ 2022-02-04 22:09 UTC (permalink / raw)
  To: Thomas Gleixner, Kirill A. Shutemov, mingo, bp, Hansen, Dave,
	Lutomirski, Andy, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, Williams, Dan J, david,
	hpa, Gross, Jurgen, jmattson, joro, Poimboe, Josh, knsathya,
	pbonzini, sdeep, Christopherson,,
	Sean, Luck, Tony, vkuznets, wanpengli, x86, linux-kernel,
	Kirill A . Shutemov, Yamahata, Isaku

> > ioremap()-created mappings such as virtio will be marked as
> > shared. However, the IOAPIC code does not use ioremap() and instead
> > uses the fixmap mechanism.
> >
> > Introduce a special fixmap helper just for the IOAPIC code.  Ensure
> > that it marks IOAPIC pages as "shared".  This replaces
> > set_fixmap_nocache() with __set_fixmap() since __set_fixmap()
> > allows custom 'prot' values.
> 
> Why is this a TDX only issue and SEV does not suffer from that?

The bit meaning is opposite. 
TDX: set: shared, cleared: private
SEV: set: private, cleared: shared

Without this patch, it happens to work for SEV. (or any emulated MMIO can work)
But for TDX, IOAPIC emulation doesn't work.

Thanks,
Isaku Yamahata
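
The opposite polarity can be illustrated with a short C sketch. It is
only a model: MODEL_CC_BIT is a hypothetical bit position, and the enum
and helper names are made up for illustration; the real bit positions
are platform-specific and are not stated in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

enum model_platform {
	MODEL_SEV,	/* bit set: private, bit clear: shared */
	MODEL_TDX,	/* bit set: shared,  bit clear: private */
};

/* Hypothetical position of the encryption/shared bit in a PTE. */
#define MODEL_CC_BIT	(1ULL << 51)

/* Is a mapping with this PTE value accessible to the host? */
bool model_is_shared(enum model_platform p, uint64_t pte)
{
	bool bit_set = (pte & MODEL_CC_BIT) != 0;

	return p == MODEL_TDX ? bit_set : !bit_set;
}

/* Make a PTE value shared on the given platform. */
uint64_t model_mark_shared(enum model_platform p, uint64_t pte)
{
	return p == MODEL_TDX ? (pte | MODEL_CC_BIT) : (pte & ~MODEL_CC_BIT);
}
```

A fixmap protection value that leaves the bit clear is therefore already
shared in the SEV model but private in the TDX model, which matches the
observation that the IOAPIC mapping happens to work on SEV without the
patch.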

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 26/29] x86/tdx: ioapic: Add shared bit for IOAPIC base address
  2022-02-02  1:33   ` Thomas Gleixner
  2022-02-04 22:09     ` Yamahata, Isaku
@ 2022-02-04 22:31     ` Kirill A. Shutemov
  2022-02-07 14:08       ` Tom Lendacky
  1 sibling, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-04 22:31 UTC (permalink / raw)
  To: Thomas Gleixner, Tom Lendacky
  Cc: mingo, bp, dave.hansen, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86, linux-kernel, Isaku Yamahata

On Wed, Feb 02, 2022 at 02:33:16AM +0100, Thomas Gleixner wrote:
> On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
> > ioremap()-created mappings such as virtio will be marked as
> > shared. However, the IOAPIC code does not use ioremap() and instead
> > uses the fixmap mechanism.
> >
> > Introduce a special fixmap helper just for the IOAPIC code.  Ensure
> > that it marks IOAPIC pages as "shared".  This replaces
> > set_fixmap_nocache() with __set_fixmap() since __set_fixmap()
> > allows custom 'prot' values.
> 
> Why is this a TDX only issue and SEV does not suffer from that?

Hm. Good question.

I think it is because FIXMAP_PAGE_NOCACHE does not have __ENC bit set so
the mapping is accessible to host. With TDX the logic is opposite:
everything is private if the bit is not set.

Tom, does it sound right?

BTW, I will drop 'if (cc_platform_has(CC_ATTR_GUEST_TDX))'.
pgprot_decrypted() is nop on AMD in this case.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 03/29] x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
  2022-02-04 13:18               ` Kirill A. Shutemov
@ 2022-02-05  0:06                 ` Kai Huang
  0 siblings, 0 replies; 154+ messages in thread
From: Kai Huang @ 2022-02-05  0:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel


> > Is declaration of __tdx_module_call() outside of CONFIG_INTEL_TDX_GUEST?
> 
> No, it is defined within CONFIG_INTEL_TDX_GUEST. Why? Host side has to
> build their helper anyway.
> 

Right. Fine to me.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 17/29] x86/acpi, x86/boot: Add multiprocessor wake-up support
  2022-02-01 23:27   ` Thomas Gleixner
@ 2022-02-05 12:37     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 154+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2022-02-05 12:37 UTC (permalink / raw)
  To: Thomas Gleixner, Kirill A. Shutemov, mingo, bp, dave.hansen,
	luto, peterz
  Cc: aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86, linux-kernel, Sean Christopherson,
	Rafael J . Wysocki


On 2/1/2022 3:27 PM, Thomas Gleixner wrote:
> On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
>> +#ifdef CONFIG_X86_64
>> +/* Physical address of the Multiprocessor Wakeup Structure mailbox */
>> +static u64 acpi_mp_wake_mailbox_paddr;
>> +/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
>> +static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
>> +/* Lock to protect mailbox (acpi_mp_wake_mailbox) from parallel access */
>> +static DEFINE_SPINLOCK(mailbox_lock);
>> +#endif
>> +
>>   #ifdef CONFIG_X86_IO_APIC
>>   /*
>>    * Locks related to IOAPIC hotplug
>> @@ -336,6 +345,80 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e
>>   	return 0;
>>   }
>>   
>> +#ifdef CONFIG_X86_64
>> +/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
>> +static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
>> +{
>> +	static physid_mask_t apic_id_wakemap = PHYSID_MASK_NONE;
>> +	unsigned long flags;
>> +	u8 timeout;
>> +
>> +	/* Remap mailbox memory only for the first call to acpi_wakeup_cpu() */
>> +	if (physids_empty(apic_id_wakemap)) {
>> +		acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
>> +						sizeof(*acpi_mp_wake_mailbox),
>> +						MEMREMAP_WB);
>> +	}
>> +
>> +	/*
>> +	 * According to the ACPI specification r6.4, section titled
>> +	 * "Multiprocessor Wakeup Structure" the mailbox-based wakeup
>> +	 * mechanism cannot be used more than once for the same CPU.
>> +	 * Skip wakeups if they are attempted more than once.
>> +	 */
>> +	if (physid_isset(apicid, apic_id_wakemap)) {
>> +		pr_err("CPU already awake (APIC ID %x), skipping wakeup\n",
>> +		       apicid);
>> +		return -EINVAL;
>> +	}
>> +
>> +	spin_lock_irqsave(&mailbox_lock, flags);
> What's the reason that interrupts need to be disabled here? The comment
> above this invocation is not really informative ...

I initially thought this routine was called from interrupt context, but
after re-investigation that does not seem to be the case. So we don't need
to disable IRQs here; the regular spin_lock()/spin_unlock() variant
suffices. Sorry for the mistake.

>
>> +	/*
>> +	 * Mailbox memory is shared between firmware and OS. Firmware will
>> +	 * listen on mailbox command address, and once it receives the wakeup
>> +	 * command, CPU associated with the given apicid will be booted.
>> +	 *
>> +	 * The value of apic_id and wakeup_vector has to be set before updating
>> +	 * the wakeup command. To let compiler preserve order of writes, use
>> +	 * smp_store_release.
> What? If the only purpose is to tell the compiler to preserve code
> ordering then why are you using smp_store_release() here?
> smp_store_release() is way more than that...

Order of execution is not the only reason. Since this memory is shared with
firmware, I thought it better to treat it as volatile, so I used the
smp_store_release() variant. I initially used only WRITE_ONCE(), but the
suggestion to use smp_store_release() came out of community review.

>
>> +	 */
>> +	smp_store_release(&acpi_mp_wake_mailbox->apic_id, apicid);
>> +	smp_store_release(&acpi_mp_wake_mailbox->wakeup_vector, start_ip);
>> +	smp_store_release(&acpi_mp_wake_mailbox->command,
>> +			  ACPI_MP_WAKE_COMMAND_WAKEUP);
>> +
>> +	/*
>> +	 * After writing the wakeup command, wait for maximum timeout of 0xFF
>> +	 * for firmware to reset the command address back zero to indicate
>> +	 * the successful reception of command.
>> +	 * NOTE: 0xFF as timeout value is decided based on our experiments.
>> +	 *
>> +	 * XXX: Change the timeout once ACPI specification comes up with
>> +	 *      standard maximum timeout value.
>> +	 */
>> +	timeout = 0xFF;
>> +	while (READ_ONCE(acpi_mp_wake_mailbox->command) && --timeout)
>> +		cpu_relax();
>> +
>> +	/* If timed out (timeout == 0), return error */
>> +	if (!timeout) {
> So this leaves a stale acpi_mp_wake_mailbox->command. What checks that
> acpi_mp_wake_mailbox->command is 0 on the next invocation?


For each invocation, the acpi_mp_wake_mailbox->command value is set to 1, so I
think we don't have to worry about the previous state. Please correct me
if I misunderstood your query.

>
> Aside of that assume timeout happens and the firmware acts after this
> returned. Then you have inconsistent state as well. Error handling is
> not trivial, but making it hope based is the worst kind.

The current assumption is that once the timeout happens, the current wakeup
request is considered failed and firmware will not update the command
address. But the current ACPI spec does not document this assumption. I will
check with the spec owner about this issue and get back to you.
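
The two concerns raised in this review (ordering of the mailbox writes and
the stale command left behind after a timeout) can be illustrated with a
minimal userspace sketch. It is not the kernel code: struct mailbox,
mailbox_wakeup() and the -EBUSY check are hypothetical, and it models the
firmware as another C11 thread, which is an assumption that may not hold
for real firmware.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MB_CMD_NOOP   0
#define MB_CMD_WAKEUP 1

/* Hypothetical stand-in for the wakeup mailbox; not the ACPI layout. */
struct mailbox {
	_Atomic uint16_t command;	/* MB_CMD_NOOP when idle */
	uint32_t apic_id;
	uint64_t wakeup_vector;
};

/*
 * Post a wakeup request and poll until the consumer clears the command.
 * Returns 0 on success, -EBUSY if an earlier request is still pending,
 * -EIO if the consumer did not react within 'spins' polls.
 */
static int mailbox_wakeup(struct mailbox *mb, uint32_t apic_id,
			  uint64_t start_ip, unsigned int spins)
{
	assert(mb != NULL);

	/* Refuse to overwrite a request that was never consumed. */
	if (atomic_load_explicit(&mb->command, memory_order_acquire) != MB_CMD_NOOP)
		return -EBUSY;

	/* The payload can use plain stores ... */
	mb->apic_id = apic_id;
	mb->wakeup_vector = start_ip;

	/* ... because one release store of the command publishes it. */
	atomic_store_explicit(&mb->command, MB_CMD_WAKEUP, memory_order_release);

	while (spins--) {
		if (atomic_load_explicit(&mb->command, memory_order_acquire) == MB_CMD_NOOP)
			return 0;
	}

	/* Timed out: the command is left set, so the next call sees -EBUSY. */
	return -EIO;
}
```

In this sketch only the final store needs release semantics, and a timed-out
request is detected on the next call instead of being silently overwritten.
Whether firmware may still act on the request after the timeout is exactly
the question the spec leaves open.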

>
>> +		spin_unlock_irqrestore(&mailbox_lock, flags);
>> +		return -EIO;
>> +	}
>> +
>> +	/*
>> +	 * If the CPU wakeup process is successful, store the
>> +	 * status in apic_id_wakemap to prevent re-wakeup
>> +	 * requests.
>> +	 */
>> +	physid_set(apicid, apic_id_wakemap);
>> +
>> +	spin_unlock_irqrestore(&mailbox_lock, flags);
>> +
>> +	return 0;
>> +}
>> +#endif
>>   #endif				/*CONFIG_X86_LOCAL_APIC */
> Thanks,
>
>          tglx

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 20/29] x86/tdx: Get page shared bit info from the TDX module
  2022-01-24 15:02 ` [PATCHv2 20/29] x86/tdx: Get page shared bit info from the TDX module Kirill A. Shutemov
  2022-02-02  0:14   ` Thomas Gleixner
@ 2022-02-07 10:44   ` Borislav Petkov
  1 sibling, 0 replies; 154+ messages in thread
From: Borislav Petkov @ 2022-02-07 10:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Mon, Jan 24, 2022 at 06:02:06PM +0300, Kirill A. Shutemov wrote:
> @@ -59,6 +66,28 @@ long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
>  EXPORT_SYMBOL_GPL(tdx_kvm_hypercall);
>  #endif
>  
> +static void tdx_get_info(void)

Btw, can we strip the "tdx_" prefix off from all those static
functions... there's an overload of "tdx" prefixes when looking at
the code and it would be easier on the eyes if they're only with the
external interfaces... it also helps differentiate which is which.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 26/29] x86/tdx: ioapic: Add shared bit for IOAPIC base address
  2022-02-04 22:31     ` Kirill A. Shutemov
@ 2022-02-07 14:08       ` Tom Lendacky
  0 siblings, 0 replies; 154+ messages in thread
From: Tom Lendacky @ 2022-02-07 14:08 UTC (permalink / raw)
  To: Kirill A. Shutemov, Thomas Gleixner
  Cc: mingo, bp, dave.hansen, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86, linux-kernel, Isaku Yamahata

On 2/4/22 16:31, Kirill A. Shutemov wrote:
> On Wed, Feb 02, 2022 at 02:33:16AM +0100, Thomas Gleixner wrote:
>> On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
>>> ioremap()-created mappings such as virtio will be marked as
>>> shared. However, the IOAPIC code does not use ioremap() and instead
>>> uses the fixmap mechanism.
>>>
>>> Introduce a special fixmap helper just for the IOAPIC code.  Ensure
>>> that it marks IOAPIC pages as "shared".  This replaces
>>> set_fixmap_nocache() with __set_fixmap() since __set_fixmap()
>>> allows custom 'prot' values.
>>
>> Why is this a TDX only issue and SEV does not suffer from that?
> 
> Hm. Good question.
> 
> I think it is because FIXMAP_PAGE_NOCACHE does not have __ENC bit set so
> the mapping is accessible to host. With TDX the logic is opposite:
> everything is private if the bit is not set.
> 
> Tom, does it sound right?

Correct, FIXMAP_PAGE_NOCACHE => PAGE_KERNEL_IO_NOCACHE, which does not 
have the encryption bit set, so it is mapped as shared under SEV.

Thanks,
Tom

> 
> BTW, I will drop 'if (cc_platform_has(CC_ATTR_GUEST_TDX))'.
> pgprot_decrypted() is nop on AMD in this case.
> 

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-01-24 15:02 ` [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap() Kirill A. Shutemov
  2022-02-02  0:25   ` Thomas Gleixner
@ 2022-02-07 16:27   ` Borislav Petkov
  2022-02-07 16:57     ` Dave Hansen
  1 sibling, 1 reply; 154+ messages in thread
From: Borislav Petkov @ 2022-02-07 16:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Mon, Jan 24, 2022 at 06:02:08PM +0300, Kirill A. Shutemov wrote:
> -/*
> - * Macros to add or remove encryption attribute
> - */
> -#define pgprot_encrypted(prot)	__pgprot(__sme_set(pgprot_val(prot)))
> -#define pgprot_decrypted(prot)	__pgprot(__sme_clr(pgprot_val(prot)))

Why can't you simply define

cc_set() and cc_clear()

helpers which either call the __sme variants or __tdx variants, the
latter you can define the same way, respectively, as the __sme ones.

And then you do:

#define pgprot_encrypted(prot)       __pgprot(cc_set(pgprot_val(prot)))
#define pgprot_decrypted(prot)       __pgprot(cc_clear(pgprot_val(prot)))

And just so that it works as early as possible, you can define a global
tdx_shared_mask or so which gets initialized the moment you have
td_info.gpa_width.

And then you don't need to export anything or other ifdefferies - you
just make sure you have that mask defined as early as needed.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-02-07 16:27   ` Borislav Petkov
@ 2022-02-07 16:57     ` Dave Hansen
  2022-02-07 17:28       ` Borislav Petkov
  0 siblings, 1 reply; 154+ messages in thread
From: Dave Hansen @ 2022-02-07 16:57 UTC (permalink / raw)
  To: Borislav Petkov, Kirill A. Shutemov
  Cc: tglx, mingo, luto, peterz, sathyanarayanan.kuppuswamy, aarcange,
	ak, dan.j.williams, david, hpa, jgross, jmattson, joro, jpoimboe,
	knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86, linux-kernel

On 2/7/22 08:27, Borislav Petkov wrote:
> On Mon, Jan 24, 2022 at 06:02:08PM +0300, Kirill A. Shutemov wrote:
>> -/*
>> - * Macros to add or remove encryption attribute
>> - */
>> -#define pgprot_encrypted(prot)	__pgprot(__sme_set(pgprot_val(prot)))
>> -#define pgprot_decrypted(prot)	__pgprot(__sme_clr(pgprot_val(prot)))
> Why can't you simply define
> 
> cc_set() and cc_clear()
> 
> helpers which either call the __sme variants or __tdx variants, the
> latter you can define the same way, respectively, as the __sme ones.

I think your basic point here is valid: Let's have a single function to
take a pgprot and turn it into an "encrypted" or "decrypted" pgprot.

But, we can't do it with functions like cc_set() and cc_clear() because
the polarity is different:

> +pgprot_t pgprot_encrypted(pgprot_t prot)
> +{
> +        if (sme_me_mask)
> +                return __pgprot(__sme_set(pgprot_val(prot)));
> +        else if (is_tdx_guest())
> +		return __pgprot(pgprot_val(prot) & ~tdx_shared_mask());
> +
> +        return prot;
> +}

For "encrypted", SME sets bits and TDX clears bits.

For "decrypted", SME clears bits and TDX sets bits.

We can surely *do* this with cc_something() helpers.  It's just not as
easy as making cc_set/cc_clear().

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-02-07 16:57     ` Dave Hansen
@ 2022-02-07 17:28       ` Borislav Petkov
  2022-02-14 22:09         ` Kirill A. Shutemov
  0 siblings, 1 reply; 154+ messages in thread
From: Borislav Petkov @ 2022-02-07 17:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, tglx, mingo, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Mon, Feb 07, 2022 at 08:57:39AM -0800, Dave Hansen wrote:
> We can surely *do* this with cc_something() helpers.  It's just not as
> easy as making cc_set/cc_clear().

Sure, that's easy: cc_pgprot_{enc,dec}() or so.
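
A minimal userspace sketch of what such helpers might look like, assuming a
single mask whose meaning depends on the vendor. The enum, struct cc_state
and the bit positions used in the usage note are hypothetical; only the
polarity (SME sets the bit for "encrypted", TDX clears it) comes from the
discussion above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical vendor tag; the thread does not define such an enum. */
enum cc_vendor { CC_NONE, CC_AMD, CC_INTEL };

struct cc_state {
	enum cc_vendor vendor;
	uint64_t mask;	/* AMD: encryption bit; Intel: shared bit */
};

/* Return 'prot' adjusted so the mapping is encrypted/private. */
static uint64_t cc_pgprot_enc(const struct cc_state *cc, uint64_t prot)
{
	assert(cc);

	switch (cc->vendor) {
	case CC_AMD:
		return prot | cc->mask;		/* set the encryption bit */
	case CC_INTEL:
		return prot & ~cc->mask;	/* clear the shared bit */
	default:
		return prot;
	}
}

/* Return 'prot' adjusted so the mapping is decrypted/shared. */
static uint64_t cc_pgprot_dec(const struct cc_state *cc, uint64_t prot)
{
	assert(cc);

	switch (cc->vendor) {
	case CC_AMD:
		return prot & ~cc->mask;	/* clear the encryption bit */
	case CC_INTEL:
		return prot | cc->mask;		/* set the shared bit */
	default:
		return prot;
	}
}
```

With, say, an AMD mask of 1ULL << 47 and an Intel mask of 1ULL << 51 (both
made up for illustration), cc_pgprot_dec() clears bit 47 in the first case
and sets bit 51 in the second, so callers never need to know the polarity.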

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 20/29] x86/tdx: Get page shared bit info from the TDX module
  2022-02-02  0:14   ` Thomas Gleixner
@ 2022-02-07 22:27     ` Sean Christopherson
  0 siblings, 0 replies; 154+ messages in thread
From: Sean Christopherson @ 2022-02-07 22:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kirill A. Shutemov, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Wed, Feb 02, 2022, Thomas Gleixner wrote:
> On Mon, Jan 24 2022 at 18:02, Kirill A. Shutemov wrote:
> > +static void tdx_get_info(void)
> > +{
> > +	struct tdx_module_output out;
> > +	u64 ret;
> > +
> > +	/*
> > +	 * TDINFO TDX module call is used to get the TD execution environment
> > +	 * information like GPA width, number of available vcpus, debug mode
> > +	 * information, etc. More details about the ABI can be found in TDX
> > +	 * Guest-Host-Communication Interface (GHCI), sec 2.4.2 TDCALL
> > +	 * [TDG.VP.INFO].
> > +	 */
> > +	ret = __tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);
> > +
> > +	/* Non zero return value indicates buggy TDX module, so panic */
> 
> Can you please get rid of these useless comments all over the place. The
> panic() message tells the same story. Please document the non-obvious
> things.

And why isn't there a tdx_module_call() wrapper to panic() on failure?  IIRC,
that's why the asm routines had the double underscore, but that detail appears
to have been lost.  E.g. __tdx_module_call(TDX_GET_VEINFO, ...) in patch 04 should
also panic, but it currently morphs the #VE into a #GP if it can't retrieve the
info, which will lead to weird "#GPs" on things like vanilla MOV instructions if
something does go wrong.  TDX_ACCEPT_PAGE is the only call into the TDX Module
for which failure is not fatal.
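
The wrapper idea can be sketched in userspace as follows. Everything here
is hypothetical (the struct, the function-pointer type, module_call_or_die()
and the fake backend); abort() merely stands in for panic(). Callers for
which failure is not fatal, such as the page-accept path mentioned above,
would keep using the raw call.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical output block; not the layout used by the patch. */
struct module_output {
	uint64_t rcx, rdx, r8;
};

/* Hypothetical raw call: returns 0 on success, non-zero on failure. */
typedef uint64_t (*raw_module_call_t)(uint64_t fn, struct module_output *out);

/* Wrapper for calls whose failure indicates a broken module. */
static void module_call_or_die(raw_module_call_t raw, uint64_t fn,
			       struct module_output *out)
{
	uint64_t ret;

	assert(raw);

	ret = raw(fn, out);
	if (ret) {
		fprintf(stderr, "module call %" PRIu64 " failed: 0x%" PRIx64 "\n",
			fn, ret);
		abort();	/* stands in for panic() */
	}
}

/* Fake backend for demonstration; the value is made up. */
static uint64_t fake_get_info(uint64_t fn, struct module_output *out)
{
	(void)fn;
	out->rcx = 52;
	return 0;
}
```

With this split, call sites that cannot sensibly recover shrink to a single
line and the double-underscore variant stays available for the rare caller
that handles errors itself.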

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2.1 05/29] x86/tdx: Add HLT support for TDX guests
  2022-02-04 16:55             ` Kirill A. Shutemov
@ 2022-02-07 22:52               ` Sean Christopherson
  2022-02-09 14:34                 ` Kirill A. Shutemov
  0 siblings, 1 reply; 154+ messages in thread
From: Sean Christopherson @ 2022-02-07 22:52 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, bp, aarcange, ak, dan.j.williams, dave.hansen,
	david, hpa, jgross, jmattson, joro, jpoimboe, knsathya,
	linux-kernel, luto, mingo, pbonzini, peterz,
	sathyanarayanan.kuppuswamy, sdeep, tony.luck, vkuznets,
	wanpengli, x86

On Fri, Feb 04, 2022, Kirill A. Shutemov wrote:
> > > Looks dubious to me, I donno. I worry about possible conflicts with the
> > > spec in the future.
> > 
> > The spec should have a reserved space for such things :)

Heh, the problem is someone has to deal with munging the two things together.
E.g. if there's an EXIT_REASON_SAFE_HLT then the hypervisor would need a handler
that's identical to EXIT_REASON_HLT, except with guest.EFLAGS.IF forced to '1'.
The guest gets the short end of the stick because EXIT_REASON_HLT is already an
established VM-Exit reason.

> > But you might think about having a in/out struct similar to the module
> > call or just an array of u64.
> > 
> > and the signature would become:
> > 
> > __tdx_hypercall(u64 op, u64 flags, struct inout *args)
> > __tdx_hypercall(u64 op, u64 flags, u64 *args)
> > 
> > and have flag bits:
> > 
> >     HCALL_ISSUE_STI
> >     HCALL_HAS_OUTPUT
> > 
> > Hmm?
> 
> We have two distinct cases: standard hypercalls (defined in GHCI) and KVM
> hypercalls. In the first case R10 is 0 (indicating standard TDVMCALL) and
> R11 defines the operation. For KVM hypercalls R10 encodes the operation
> (KVM hypercalls indexed from 1) and R11 is the first argument. So we
> cannot get away with simple "RDI is op" interface.
> 
> And we need to return two values: RAX indicates if TDCALL itself was
> successful and R10 is result of the hypercall. So we cannot easily get
> away without output struct. HCALL_HAS_OUTPUT is not needed.

But __tdx_hypercall() should never fail TDCALL.  The TDX spec even says:

  RAX TDCALL instruction return code. Always returns Intel TDX_SUCCESS (0).

IIRC, the original PoC went straight to a ud2 if tdcall failed.  Why not do
something similar?  That would get rid of the bajillion instances of:

	if (__tdx_hypercall(...))
		panic("Hypercall fn %llu failed (Buggy TDX module!)\n", fn);

E.g.

diff --git a/arch/x86/kernel/tdcall.S b/arch/x86/kernel/tdcall.S
index fde628791100..04284f0c198e 100644
--- a/arch/x86/kernel/tdcall.S
+++ b/arch/x86/kernel/tdcall.S
@@ -170,8 +170,10 @@ SYM_FUNC_START(__tdx_hypercall)
 .Lskip_sti:
        tdcall

+       test %rax, %rax
+       jnz .Lerror
+
        /* Copy hypercall result registers to arg struct: */
-       movq %r10, TDX_HYPERCALL_r10(%rdi)
        movq %r11, TDX_HYPERCALL_r11(%rdi)
        movq %r12, TDX_HYPERCALL_r12(%rdi)
        movq %r13, TDX_HYPERCALL_r13(%rdi)
@@ -194,7 +196,13 @@ SYM_FUNC_START(__tdx_hypercall)
        pop %r14
        pop %r15

-       FRAME_END
+       FRAME_END
+
+       movq %r10, %rax
+       retq
+
+.Lerror:
+       <move stuff into correct registers if necessary>
+       call tdx_hypercall_error

-       retq
 SYM_FUNC_END(__tdx_hypercall)

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 23/29] x86/tdx: Add helper to convert memory between shared and private
  2022-01-24 15:02 ` [PATCHv2 23/29] x86/tdx: Add helper to convert memory between shared and private Kirill A. Shutemov
  2022-02-02  0:35   ` Thomas Gleixner
@ 2022-02-08 12:12   ` Borislav Petkov
  2022-02-09 23:21     ` Kirill A. Shutemov
  1 sibling, 1 reply; 154+ messages in thread
From: Borislav Petkov @ 2022-02-08 12:12 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Mon, Jan 24, 2022 at 06:02:09PM +0300, Kirill A. Shutemov wrote:
> +static bool tdx_accept_page(phys_addr_t gpa, enum pg_level pg_level)

accept_page() as it is a static function.

> +int tdx_hcall_request_gpa_type(phys_addr_t start, phys_addr_t end, bool enc)
> +{
> +	u64 ret;
> +
> +	if (end <= start)
> +		return -EINVAL;
> +
> +	if (!enc) {
> +		start |= tdx_shared_mask();
> +		end |= tdx_shared_mask();
> +	}
> +
> +	/*
> +	 * Notify the VMM about page mapping conversion. More info about ABI
> +	 * can be found in TDX Guest-Host-Communication Interface (GHCI),
> +	 * sec "TDG.VP.VMCALL<MapGPA>"
> +	 */
> +	ret = _tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0, NULL);
> +


^ Superfluous newline.

> +	if (ret)
> +		ret = -EIO;
> +
> +	if (ret || !enc)

Is the second case here after the "||" the conversion-to-shared where it
only needs to notify with MapGPA and return?

Of all the places, this one needs a comment.

> +		return ret;
> +
> +	/*
> +	 * For shared->private conversion, accept the page using
> +	 * TDX_ACCEPT_PAGE TDX module call.
> +	 */
> +	while (start < end) {
> +		/* Try 2M page accept first if possible */
> +		if (!(start & ~PMD_MASK) && end - start >= PMD_SIZE &&
> +		    !tdx_accept_page(start, PG_LEVEL_2M)) {

What happens here if the module doesn't accept the page? No error
reporting, no error handling, no warning, nada?
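
For illustration, one way the accept loop could report failures is sketched
below in userspace C. The callback type, accept_range() and the three test
doubles are hypothetical, and the callback returns true on success, which
may differ from the convention in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_4K 0x1000ULL
#define SZ_2M 0x200000ULL

/* Hypothetical accept callback: true when the page was accepted. */
typedef bool (*accept_fn)(uint64_t gpa, uint64_t size);

/* Accept [start, end): prefer 2M pages, fall back to 4K, report errors. */
static int accept_range(uint64_t start, uint64_t end, accept_fn accept)
{
	assert(accept);

	if (end <= start)
		return -EINVAL;

	while (start < end) {
		/* Try a 2M accept when aligned and enough range remains. */
		if (!(start & (SZ_2M - 1)) && end - start >= SZ_2M &&
		    accept(start, SZ_2M)) {
			start += SZ_2M;
			continue;
		}

		/* A 2M failure is tolerated; a 4K failure is an error. */
		if (!accept(start, SZ_4K))
			return -EIO;
		start += SZ_4K;
	}

	return 0;
}

/* Test doubles: accept everything, only 4K requests, or nothing. */
static bool accept_all(uint64_t gpa, uint64_t size)
{
	(void)gpa; (void)size;
	return true;
}

static bool accept_4k_only(uint64_t gpa, uint64_t size)
{
	(void)gpa;
	return size == SZ_4K;
}

static bool accept_none(uint64_t gpa, uint64_t size)
{
	(void)gpa; (void)size;
	return false;
}
```

Here a failed 2M attempt silently falls back to 4K pages, while a failed 4K
accept propagates -EIO to the caller, so the "nothing happens on failure"
case is limited to the retryable step.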

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-01-24 15:01 [PATCHv2 00/29] TDX Guest: TDX core support Kirill A. Shutemov
                   ` (28 preceding siblings ...)
  2022-01-24 15:02 ` [PATCHv2 29/29] Documentation/x86: Document TDX kernel architecture Kirill A. Shutemov
@ 2022-02-09 10:56 ` Kai Huang
  2022-02-09 11:08   ` Borislav Petkov
  29 siblings, 1 reply; 154+ messages in thread
From: Kai Huang @ 2022-02-09 10:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel


>  60 files changed, 2079 insertions(+), 142 deletions(-)
>  create mode 100644 Documentation/x86/tdx.rst
>  create mode 100644 arch/x86/boot/compressed/tdcall.S
>  create mode 100644 arch/x86/boot/compressed/tdx.c
>  create mode 100644 arch/x86/boot/compressed/tdx.h
>  create mode 100644 arch/x86/boot/io.h
>  create mode 100644 arch/x86/include/asm/shared/io.h
>  create mode 100644 arch/x86/include/asm/shared/tdx.h
>  create mode 100644 arch/x86/include/asm/tdx.h
>  create mode 100644 arch/x86/kernel/tdcall.S
>  create mode 100644 arch/x86/kernel/tdx.c
> 

Hi,

Would it be better to rename the file(s) to reflect that they are for TDX
guest support, especially the last one, arch/x86/kernel/tdx.c?

TDX host support basically detects SEAM, TDX KeyIDs and P-SEAMLDR and
initializes the TDX module, so TDX host support will likely introduce a
couple of new files to do those things respectively, and the majority of the
code could be self-contained under some directory (currently under
arch/x86/kernel/cpu/tdx/, but that can be changed of course).  Could we have
some suggestions on how to organize this?

Thanks,
-Kai

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-09 10:56 ` [PATCHv2 00/29] TDX Guest: TDX core support Kai Huang
@ 2022-02-09 11:08   ` Borislav Petkov
  2022-02-09 11:30     ` Kai Huang
  0 siblings, 1 reply; 154+ messages in thread
From: Borislav Petkov @ 2022-02-09 11:08 UTC (permalink / raw)
  To: Kai Huang
  Cc: Kirill A. Shutemov, tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Wed, Feb 09, 2022 at 11:56:13PM +1300, Kai Huang wrote:
> TDX host support basically does detection of SEAM, TDX KeyIDs, P-SEAMLDR and
> initialize the TDX module, so likely TDX host support will introduce couple of
> new files to do above things respectively,

Why a couple of new files? How much code is that?

> and the majority of the code could be self-contained under some
> directory (currently under arch/x86/kernel/cpu/tdx/, but can be
> changed of course). Could we have some suggestions on how to organize?

So we slowly try to move stuff away from arch/x86/kernel/ as that is a
dumping ground for everything and everything there is "kernel" so that
part of the path is kinda redundant.

That's why, for example, we stuck the entry code under arch/x86/entry/.

I'm thinking long term we probably should stick all confidential
computing stuff under its own folder:

arch/x86/coco/

for example. The "coco" being COnfidential COmputing, for lack of a
better idea.

And there you'll have

arch/x86/coco/tdx and
arch/x86/coco/sev

where we'll eventually start migrating the AMD stuff too.

Methinks.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-09 11:08   ` Borislav Petkov
@ 2022-02-09 11:30     ` Kai Huang
  2022-02-09 11:40       ` Borislav Petkov
  0 siblings, 1 reply; 154+ messages in thread
From: Kai Huang @ 2022-02-09 11:30 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Wed, 9 Feb 2022 12:08:18 +0100 Borislav Petkov wrote:
> On Wed, Feb 09, 2022 at 11:56:13PM +1300, Kai Huang wrote:
> > TDX host support basically does detection of SEAM, TDX KeyIDs, P-SEAMLDR and
> > initialize the TDX module, so likely TDX host support will introduce couple of
> > new files to do above things respectively,
> 
> Why a couple of new files? How much code is that?

These are the file names and code sizes of the current internal version that I have:

 .../admin-guide/kernel-parameters.txt         |   6 +
 Documentation/x86/index.rst                   |   1 +
 Documentation/x86/intel-tdx.rst               | 259 +++++++
 arch/x86/Kconfig                              |  14 +
 arch/x86/include/asm/seam.h                   | 213 ++++++
 arch/x86/include/asm/tdx_host.h               |  20 +
 arch/x86/kernel/asm-offsets_64.c              |  18 +
 arch/x86/kernel/cpu/Makefile                  |   1 +
 arch/x86/kernel/cpu/intel.c                   |   6 +
 arch/x86/kernel/cpu/tdx/Makefile              |   1 +
 arch/x86/kernel/cpu/tdx/p-seamldr.c           | 109 +++
 arch/x86/kernel/cpu/tdx/p-seamldr.h           |  14 +
 arch/x86/kernel/cpu/tdx/seam.c                | 105 +++
 arch/x86/kernel/cpu/tdx/seamcall.S            |  80 ++
 arch/x86/kernel/cpu/tdx/tdmr.c                | 581 ++++++++++++++
 arch/x86/kernel/cpu/tdx/tdmr.h                |  28 +
 arch/x86/kernel/cpu/tdx/tdx.c                 | 707 ++++++++++++++++++
 arch/x86/kernel/cpu/tdx/tdx_arch.h            |  88 +++
 arch/x86/kernel/cpu/tdx/tdx_seamcall.h        | 138 ++++
 19 files changed, 2389 insertions(+)
 create mode 100644 Documentation/x86/intel-tdx.rst
 create mode 100644 arch/x86/include/asm/seam.h
 create mode 100644 arch/x86/include/asm/tdx_host.h
 create mode 100644 arch/x86/kernel/cpu/tdx/Makefile
 create mode 100644 arch/x86/kernel/cpu/tdx/p-seamldr.c
 create mode 100644 arch/x86/kernel/cpu/tdx/p-seamldr.h
 create mode 100644 arch/x86/kernel/cpu/tdx/seam.c
 create mode 100644 arch/x86/kernel/cpu/tdx/seamcall.S
 create mode 100644 arch/x86/kernel/cpu/tdx/tdmr.c
 create mode 100644 arch/x86/kernel/cpu/tdx/tdmr.h
 create mode 100644 arch/x86/kernel/cpu/tdx/tdx.c
 create mode 100644 arch/x86/kernel/cpu/tdx/tdx_arch.h
 create mode 100644 arch/x86/kernel/cpu/tdx/tdx_seamcall.h

Because SEAM and P-SEAMLDR can logically be independent, I feel it's better to
have separate C files for them.  TDMR (TD Memory Region, the structure
defined by the TDX architecture to manage usable TDX memory) is split out into
a separate file because the logic to deal with it requires non-trivial LOC too.

> 
> > and the majority of the code could be self-contained under some
> > directory (currently under arch/x86/kernel/cpu/tdx/, but can be
> > changed of course). Could we have some suggestions on how to organize?
> 
> So we slowly try to move stuff away from arch/x86/kernel/ as that is a
> dumping ground for everything and everything there is "kernel" so that
> part of the path is kinda redundant.
> 
> That's why, for example, we stuck the entry code under arch/x86/entry/.
> 
> I'm thinking long term we probably should stick all confidential
> computing stuff under its own folder:
> 
> arch/x86/coco/
> 
> for example. The "coco" being COnfidential COmputing, for lack of a
> better idea.
> 
> And there you'll have
> 
> arch/x86/coco/tdx and
> arch/x86/coco/sev
> 
> where we'll eventually start migrating the AMD stuff too.

Thanks for the information.  However, for now does it make sense to also put
TDX host files under arch/x86/kernel/, or maybe arch/x86/kernel/tdx_host/?

As suggested by Thomas, host SEAMCALL can share TDX guest's __tdx_module_call()
implementation.  Kirill will have an arch/x86/kernel/tdxcall.S which implements
the core body of __tdx_module_call() and is supposed to be included by the new
assembly file to implement the host SEAMCALL function.  From this perspective,
it seems more reasonable to just put all TDX host files under arch/x86/kernel/?

Thanks in advance.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-09 11:30     ` Kai Huang
@ 2022-02-09 11:40       ` Borislav Petkov
  2022-02-09 11:48         ` Kai Huang
  0 siblings, 1 reply; 154+ messages in thread
From: Borislav Petkov @ 2022-02-09 11:40 UTC (permalink / raw)
  To: Kai Huang
  Cc: Kirill A. Shutemov, tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Thu, Feb 10, 2022 at 12:30:33AM +1300, Kai Huang wrote:
> Because SEAM, P-SEAMLDR can logically be independent, so I feel it's better to
> have separate C files for them.

Most of those look like small files. I don't see the point of having it
all in separate files - you can just as well put them in tdx.c and carve
out only then when the file becomes too unwieldy to handle.

> Thanks for the information.  However, for now does it make sense to also put
> TDX host files under arch/x86/kernel/, or maybe arch/x86/kernel/tdx_host/?

Didn't you just read what I wrote about "kernel"?

> As suggested by Thomas, host SEAMCALL can share TDX guest's __tdx_module_call()
> implementation.  Kirill will have a arch/x86/kernel/tdxcall.S which implements
> the core body of __tdx_module_call() and is supposed to be included by the new
> assembly file to implement the host SEAMCALL function.  From this perspective,
> it seems more reasonable to just put all TDX host files under arch/x86/kernel/?

It would be a lot harder to move them to a different location later,
when they're upstream already. I'm talking from past experience here.

But let's see what the others think first.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-09 11:40       ` Borislav Petkov
@ 2022-02-09 11:48         ` Kai Huang
  2022-02-09 11:56           ` Borislav Petkov
  0 siblings, 1 reply; 154+ messages in thread
From: Kai Huang @ 2022-02-09 11:48 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Wed, 9 Feb 2022 12:40:17 +0100 Borislav Petkov wrote:
> On Thu, Feb 10, 2022 at 12:30:33AM +1300, Kai Huang wrote:
> > Because SEAM, P-SEAMLDR can logically be independent, so I feel it's better to
> > have separate C files for them.
> 
> Most of those look like small files. I don't see the point of having it
> all in separate files - you can just as well put them in tdx.c and carve
> out only then when the file becomes too unwieldy to handle.

arch/x86/kernel/tdx.c is already taken by this series.  This is the reason that
I think perhaps it's better to rename it to reflect it is for TDX guest support.

> 
> > Thanks for the information.  However, for now does it make sense to also put
> > TDX host files under arch/x86/kernel/, or maybe arch/x86/kernel/tdx_host/?
> 
> Didn't you just read what I wrote about "kernel"?
> 
> > As suggested by Thomas, host SEAMCALL can share TDX guest's __tdx_module_call()
> > implementation.  Kirill will have a arch/x86/kernel/tdxcall.S which implements
> > the core body of __tdx_module_call() and is supposed to be included by the new
> > assembly file to implement the host SEAMCALL function.  From this perspective,
> > it seems more reasonable to just put all TDX host files under arch/x86/kernel/?
> 
> It would be a lot harder to move them to a different location later,
> when they're upstream already. I'm talking from past experience here.

Are you suggesting even for now we can start to put TDX host support to
arch/x86/coco/tdx/ ?

> 
> But let's see what the others think first.

Sure, thanks for the comments.

> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-09 11:48         ` Kai Huang
@ 2022-02-09 11:56           ` Borislav Petkov
  2022-02-09 11:58             ` Kai Huang
  2022-02-09 16:50             ` Sean Christopherson
  0 siblings, 2 replies; 154+ messages in thread
From: Borislav Petkov @ 2022-02-09 11:56 UTC (permalink / raw)
  To: Kai Huang
  Cc: Kirill A. Shutemov, tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Thu, Feb 10, 2022 at 12:48:31AM +1300, Kai Huang wrote:
> Are you suggesting even for now we can start to put TDX host support to
> arch/x86/coco/tdx/ ?

That's exactly what I'm suggesting. The TDX stuff is not upstream so
nothing's cast in stone yet. This way there won't be any unpleasant code
movements later.

But let's wait to see what the bikeshed discussion will bring first and
then start moving files.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-09 11:56           ` Borislav Petkov
@ 2022-02-09 11:58             ` Kai Huang
  2022-02-09 16:50             ` Sean Christopherson
  1 sibling, 0 replies; 154+ messages in thread
From: Kai Huang @ 2022-02-09 11:58 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Wed, 9 Feb 2022 12:56:26 +0100 Borislav Petkov wrote:
> On Thu, Feb 10, 2022 at 12:48:31AM +1300, Kai Huang wrote:
> > Are you suggesting even for now we can start to put TDX host support to
> > arch/x86/coco/tdx/ ?
> 
> That's exactly what I'm suggesting. The TDX stuff is not upstream so
> nothing's cast in stone yet. This way there won't be any unpleasant code
> movements later.
> 
> But let's wait to see what the bikeshed discussion will bring first and
> then start moving files.
> 

Sure.  Thanks.

* Re: [PATCHv2.1 05/29] x86/tdx: Add HLT support for TDX guests
  2022-02-07 22:52               ` Sean Christopherson
@ 2022-02-09 14:34                 ` Kirill A. Shutemov
  2022-02-09 18:05                   ` Sean Christopherson
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-09 14:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, bp, aarcange, ak, dan.j.williams, dave.hansen,
	david, hpa, jgross, jmattson, joro, jpoimboe, knsathya,
	linux-kernel, luto, mingo, pbonzini, peterz,
	sathyanarayanan.kuppuswamy, sdeep, tony.luck, vkuznets,
	wanpengli, x86

On Mon, Feb 07, 2022 at 10:52:19PM +0000, Sean Christopherson wrote:
> On Fri, Feb 04, 2022, Kirill A. Shutemov wrote:
> > > > Looks dubious to me, I donno. I worry about possible conflicts with the
> > > > spec in the future.
> > > 
> > > The spec should have a reserved space for such things :)
> 
> Heh, the problem is someone has to deal with munging the two things together.
> E.g. if there's a EXIT_REASON_SAFE_HLT then the hypervisor would need a handler
> that's identical to EXIT_REASON_HLT, except with guest.EFLAGS.IF forced to '1'.
> The guest gets the short end of the stick because EXIT_REASON_HLT is already an
> established VM-Exit reason.
> 
> > > But you might think about having a in/out struct similar to the module
> > > call or just an array of u64.
> > > 
> > > and the signature would become:
> > > 
> > > __tdx_hypercall(u64 op, u64 flags, struct inout *args)
> > > __tdx_hypercall(u64 op, u64 flags, u64 *args)
> > > 
> > > and have flag bits:
> > > 
> > >     HCALL_ISSUE_STI
> > >     HCALL_HAS_OUTPUT
> > > 
> > > Hmm?
> > 
> > We have two distinct cases: standard hypercalls (defined in GHCI) and KVM
> > hypercalls. In the first case R10 is 0 (indicating standard TDVMCALL) and
> > R11 defines the operation. For KVM hypercalls R10 encodes the operation
> > (KVM hypercalls indexed from 1) and R11 is the first argument. So we
> > cannot get away with simple "RDI is op" interface.
> > 
> > And we need to return two values: RAX indicates if TDCALL itself was
> > successful and R10 is result of the hypercall. So we cannot easily get
> > away without output struct. HCALL_HAS_OUTPUT is not needed.
> 
> But __tdx_hypercall() should never fail TDCALL.  The TDX spec even says:
> 
>   RAX TDCALL instruction return code. Always returns Intel TDX_SUCCESS (0).
> 
> IIRC, the original PoC went straight to a ud2 if tdcall failed.  Why not do
> something similar?  That would get rid of the bajillion instances of:
> 
> 	if (__tdx_hypercall(...))
> 		panic("Hypercall fn %llu failed (Buggy TDX module!)\n", fn);

Okay, below is my take on it. Boot tested.

I used UD2 as a way to deal with the it-never-happens condition. Not sure if
it is the right way, but the stack trace looks useful and I see it being used
for CONFIG_DEBUG_ENTRY.

Any comments?

/*
 * __tdx_hypercall() - Make hypercalls to a TDX VMM.
 *
 * Transforms the values in struct tdx_hypercall_args @args into the TDCALL
 * register ABI. After the TDCALL operation, the VMM output is saved back in
 * @args.
 *
 *-------------------------------------------------------------------------
 * TD VMCALL ABI:
 *-------------------------------------------------------------------------
 *
 * Input Registers:
 *
 * RAX                 - TDCALL instruction leaf number (0 - TDG.VP.VMCALL)
 * RCX                 - BITMAP which controls which part of TD Guest GPR
 *                       is passed as-is to the VMM and back.
 * R10                 - Set 0 to indicate TDCALL follows standard TDX ABI
 *                       specification. Non zero value indicates vendor
 *                       specific ABI.
 * R11                 - VMCALL sub function number
 * RBX, RBP, RDI, RSI  - Used to pass VMCALL sub function specific arguments.
 * R8-R9, R12-R15      - Same as above.
 *
 * Output Registers:
 *
 * RAX                 - TDCALL instruction status (Not related to hypercall
 *                        output).
 * R10                 - Hypercall output error code.
 * R11-R15             - Hypercall sub function specific output values.
 *
 *-------------------------------------------------------------------------
 *
 * __tdx_hypercall() function ABI:
 *
 * @args  (RDI)        - struct tdx_hypercall_args for input and output
 * @flags (RSI)        - TDX_HCALL_* flags
 *
 * On successful completion, return the hypercall error code.
 */
SYM_FUNC_START(__tdx_hypercall)
	FRAME_BEGIN

	/* Save callee-saved GPRs as mandated by the x86_64 ABI */
	push %r15
	push %r14
	push %r13
	push %r12

	/* Mangle function call ABI into TDCALL ABI: */
	/* Set TDCALL leaf ID (TDVMCALL (0)) in RAX */
	xor %eax, %eax

	/* Copy hypercall registers from arg struct: */
	movq TDX_HYPERCALL_r10(%rdi), %r10
	movq TDX_HYPERCALL_r11(%rdi), %r11
	movq TDX_HYPERCALL_r12(%rdi), %r12
	movq TDX_HYPERCALL_r13(%rdi), %r13
	movq TDX_HYPERCALL_r14(%rdi), %r14
	movq TDX_HYPERCALL_r15(%rdi), %r15

	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx

	/*
	 * For the idle loop STI needs to be called directly before
	 * the TDCALL that enters idle (EXIT_REASON_HLT case). STI
	 * instruction enables interrupts only one instruction later.
	 * If there is a window between STI and the instruction that
	 * emulates the HALT state, there is a chance for interrupts to
	 * happen in this window, which can delay the HLT operation
	 * indefinitely. Since this is not the desired result,
	 * conditionally call STI before TDCALL.
	 */
	testq $TDX_HCALL_ISSUE_STI, %rsi
	jz .Lskip_sti
	sti
.Lskip_sti:
	tdcall

	/*
	 * The TDVMCALL leaf is not supposed to fail. If it fails, something
	 * is horribly wrong with the TDX module. Stop the world.
	 */
	test %rax, %rax
	je .Lsuccess
	ud2
.Lsuccess:
	/* TDVMCALL leaf return code is in R10 */
	movq %r10, %rax

	/* Copy hypercall result registers to arg struct if needed */
	testq $TDX_HCALL_HAS_OUTPUT, %rsi
	jz .Lout

	movq %r10, TDX_HYPERCALL_r10(%rdi)
	movq %r11, TDX_HYPERCALL_r11(%rdi)
	movq %r12, TDX_HYPERCALL_r12(%rdi)
	movq %r13, TDX_HYPERCALL_r13(%rdi)
	movq %r14, TDX_HYPERCALL_r14(%rdi)
	movq %r15, TDX_HYPERCALL_r15(%rdi)
.Lout:
	/*
	 * Zero out registers exposed to the VMM to avoid
	 * speculative execution with VMM-controlled values.
	 * This needs to include all registers present in
	 * TDVMCALL_EXPOSE_REGS_MASK (except R12-R15).
	 * R12-R15 context will be restored.
	 */
	xor %r10d, %r10d
	xor %r11d, %r11d

	/* Restore callee-saved GPRs as mandated by the x86_64 ABI */
	pop %r12
	pop %r13
	pop %r14
	pop %r15

	FRAME_END

	retq
SYM_FUNC_END(__tdx_hypercall)
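
For readers who prefer C, the control flow of the routine above can be modelled
in user space. This is only a sketch under stated assumptions: the numeric flag
values, the fake_tdcall_fn callback that stands in for the TDCALL instruction,
and the model_hit_ud2 variable that stands in for UD2 are all hypothetical and
are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit values; the message above only names the flags. */
#define TDX_HCALL_HAS_OUTPUT	(1UL << 0)
#define TDX_HCALL_ISSUE_STI	(1UL << 1)

/* Mirrors the registers the assembly loads from and stores to @args. */
struct tdx_hypercall_args {
	uint64_t r10, r11, r12, r13, r14, r15;
};

/* Hypothetical stand-in for TDCALL: the "VMM" edits regs and returns RAX. */
typedef uint64_t (*fake_tdcall_fn)(struct tdx_hypercall_args *regs);

/* Set instead of executing UD2 so that the failure path can be observed. */
static int model_hit_ud2;

static uint64_t model_tdx_hypercall(struct tdx_hypercall_args *args,
				    unsigned long flags, fake_tdcall_fn tdcall)
{
	struct tdx_hypercall_args regs;

	assert(args && tdcall);

	/* Copy hypercall registers from the arg struct. */
	regs = *args;

	/* The real code executes STI here if TDX_HCALL_ISSUE_STI is set. */

	/* A non-zero RAX means TDCALL itself failed: the UD2 path. */
	if (tdcall(&regs) != 0) {
		model_hit_ud2 = 1;
		return UINT64_MAX;	/* the real code never returns here */
	}

	/* Copy result registers back only when the caller asked for them. */
	if (flags & TDX_HCALL_HAS_OUTPUT)
		*args = regs;

	/* The hypercall return code lives in R10. */
	return regs.r10;
}

/* Fake VMM that succeeds and returns R12 + 1 in R11. */
static uint64_t fake_vmm_ok(struct tdx_hypercall_args *regs)
{
	regs->r10 = 0;
	regs->r11 = regs->r12 + 1;
	return 0;
}

/* Fake "buggy TDX module" that fails the TDCALL itself. */
static uint64_t fake_vmm_broken(struct tdx_hypercall_args *regs)
{
	(void)regs;
	return 1;
}
```

With this model, a call without TDX_HCALL_HAS_OUTPUT leaves @args untouched and
only the R10 status is returned, which is the behaviour the flag is meant to
select.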
-- 
 Kirill A. Shutemov

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-09 11:56           ` Borislav Petkov
  2022-02-09 11:58             ` Kai Huang
@ 2022-02-09 16:50             ` Sean Christopherson
  2022-02-09 19:11               ` Borislav Petkov
  1 sibling, 1 reply; 154+ messages in thread
From: Sean Christopherson @ 2022-02-09 16:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kai Huang, Kirill A. Shutemov, tglx, mingo, dave.hansen, luto,
	peterz, sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams,
	david, hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini,
	sdeep, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Wed, Feb 09, 2022, Borislav Petkov wrote:
> On Thu, Feb 10, 2022 at 12:48:31AM +1300, Kai Huang wrote:
> > Are you suggesting even for now we can start to put TDX host support to
> > arch/x86/coco/tdx/ ?
> 
> That's exactly what I'm suggesting. The TDX stuff is not upstream so
> nothing's cast in stone yet. This way there won't be any unpleasant code
> movements later.

I strongly prefer we put the guest and host code in separate directories.  Both
TDX and SEV are big enough that they'll benefit from splitting up files, and
having to fight over file names or tag all files with guest/host will get
annoying.

I do like the idea of arch/x86/coco though.  The most straightforward approach
would be:

  arch/x86/coco/guest/
  arch/x86/coco/host/

but that doesn't provide any extensibility on the host virtualization side, e.g.
to land non-coco, non-KVM-specific host virtualization code (we have a potential
use case for this).  If that happens, we'd end up with x86 KVM having code and
dependencies split across:

  arch/x86/coco/host
  arch/x86/kvm/
  arch/x86/???/

An alternative idea would be to mirror what generic KVM does (virt/kvm/), and do:

  arch/x86/coco/<guest stuff>
  arch/x86/virt/<"generic" x86 host virtualization stuff>
  arch/x86/virt/coco/<host coco stuff>
  arch/x86/virt/kvm/

Though I can already hear the stable trees and downstream kernels crying out in
horror at moving arch/x86/kvm :-)

* Re: [PATCHv2.1 05/29] x86/tdx: Add HLT support for TDX guests
  2022-02-09 14:34                 ` Kirill A. Shutemov
@ 2022-02-09 18:05                   ` Sean Christopherson
  2022-02-09 22:23                     ` Kirill A. Shutemov
  0 siblings, 1 reply; 154+ messages in thread
From: Sean Christopherson @ 2022-02-09 18:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, bp, aarcange, ak, dan.j.williams, dave.hansen,
	david, hpa, jgross, jmattson, joro, jpoimboe, knsathya,
	linux-kernel, luto, mingo, pbonzini, peterz,
	sathyanarayanan.kuppuswamy, sdeep, tony.luck, vkuznets,
	wanpengli, x86

On Wed, Feb 09, 2022, Kirill A. Shutemov wrote:
> On Mon, Feb 07, 2022 at 10:52:19PM +0000, Sean Christopherson wrote:
> .Lskip_sti:
> 	tdcall
> 
> 	/*
> 	 * The TDVMCALL leaf is not supposed to fail. If it fails, something
> 	 * is horribly wrong with the TDX module. Stop the world.
> 	 */
> 	test %rax, %rax
> 	je .Lsuccess
> 	ud2

If the ud2 or call to an external "do panic" helper is out-of-line, then the happy
path avoids a taken branch.  Not a big deal, but it's also trivial to do.

> .Lsuccess:
> 	/* TDVMCALL leaf return code is in R10 */
> 	movq %r10, %rax
> 
> 	/* Copy hypercall result registers to arg struct if needed */
> 	testq $TDX_HCALL_HAS_OUTPUT, %rsi
> 	jz .Lout
> 
> 	movq %r10, TDX_HYPERCALL_r10(%rdi)
> 	movq %r11, TDX_HYPERCALL_r11(%rdi)
> 	movq %r12, TDX_HYPERCALL_r12(%rdi)
> 	movq %r13, TDX_HYPERCALL_r13(%rdi)
> 	movq %r14, TDX_HYPERCALL_r14(%rdi)
> 	movq %r15, TDX_HYPERCALL_r15(%rdi)
> .Lout:
> 	/*
> 	 * Zero out registers exposed to the VMM to avoid
> 	 * speculative execution with VMM-controlled values.
> 	 * This needs to include all registers present in
> 	 * TDVMCALL_EXPOSE_REGS_MASK (except R12-R15).
> 	 * R12-R15 context will be restored.

This comment block should use the "full" 80 chars.

> 	 */
> 	xor %r10d, %r10d
> 	xor %r11d, %r11d
> 
> 	/* Restore callee-saved GPRs as mandated by the x86_64 ABI */
> 	pop %r12
> 	pop %r13
> 	pop %r14
> 	pop %r15
> 
> 	FRAME_END
> 
> 	retq
> SYM_FUNC_END(__tdx_hypercall)
> -- 
>  Kirill A. Shutemov

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-09 16:50             ` Sean Christopherson
@ 2022-02-09 19:11               ` Borislav Petkov
  2022-02-09 20:07                 ` Sean Christopherson
  0 siblings, 1 reply; 154+ messages in thread
From: Borislav Petkov @ 2022-02-09 19:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kai Huang, Kirill A. Shutemov, tglx, mingo, dave.hansen, luto,
	peterz, sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams,
	david, hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini,
	sdeep, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Wed, Feb 09, 2022 at 04:50:08PM +0000, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Borislav Petkov wrote:
> > On Thu, Feb 10, 2022 at 12:48:31AM +1300, Kai Huang wrote:
> > > Are you suggesting even for now we can start to put TDX host support to
> > > arch/x86/coco/tdx/ ?
> > 
> > That's exactly what I'm suggesting. The TDX stuff is not upstream so
> > nothing's cast in stone yet. This way there won't be any unpleasant code
> > movements later.
> 
> I strongly prefer we put the guest and host code in separate directories.  Both
> TDX and SEV are big enough that they'll benefit from splitting up files, having
> to fight over file names or tag all files with guest/host will get annoying.
> 
> I do like the idea of arch/x86/coco though.  The most straightforward approach
> would be:
> 
>   arch/x86/coco/guest/
>   arch/x86/coco/host/
> 
> but that doesn't provide any extensibility on the host virtualization side, e.g.
> to land non-coco, non-KVM-specific host virtualization code (we have a potential
> use case for this).  If that happens, we'd end up with x86 KVM having code and
> dependencies split across:
> 
>   arch/x86/coco/host
>   arch/x86/kvm/
>   arch/x86/???/
> 
> An alternative idea would be to mirror what generic KVM does (virt/kvm/), and do:
> 
>   arch/x86/coco/<guest stuff>
>   arch/x86/virt/<"generic" x86 host virtualization stuff>
>   arch/x86/virt/coco/<host coco stuff>
>   arch/x86/virt/kvm/
> 
> Though I can already hear the stable trees and downstream kernels crying out in
> horror at moving arch/x86/kvm :-)

Hmmm, so I am still thinking about guest-only when we're talking about
arch/x86/coco/.

Lemme look at the other virt things:

the kvm host virt stuff is in:

arch/x86/kvm/

 (btw, this is where the SEV host stuff is: arch/x86/kvm/svm/sev.c)

arch/x86/hyperv/ - looks like hyperv guest stuff

arch/x86/xen/ - xen guest stuff

arch/x86/kernel/cpu/vmware.c - vmware guest stuff

arch/x86/kernel/cpu/acrn.c - ACRN guest stuff

So we have a real mess. :-(

Not surprised though. So that last thing you're suggesting kinda makes
sense but lemme tweak it a bit:

arch/x86/coco/<guest stuff>
arch/x86/virt/<"generic" x86 host virtualization stuff>
arch/x86/virt/tdx/ - TDX host stuff; no need for the "coco" thing - TDX is
nothing but coco.

arch/x86/virt/sev/ - ditto

and we'll keep arch/x86/kvm because of previous precedents with other
things I've enumerated above.

Hmmm?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-09 19:11               ` Borislav Petkov
@ 2022-02-09 20:07                 ` Sean Christopherson
  2022-02-09 20:36                   ` Borislav Petkov
  0 siblings, 1 reply; 154+ messages in thread
From: Sean Christopherson @ 2022-02-09 20:07 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kai Huang, Kirill A. Shutemov, tglx, mingo, dave.hansen, luto,
	peterz, sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams,
	david, hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini,
	sdeep, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Wed, Feb 09, 2022, Borislav Petkov wrote:
> On Wed, Feb 09, 2022 at 04:50:08PM +0000, Sean Christopherson wrote:
> > An alternative idea would be to mirror what generic KVM does (virt/kvm/), and do:
> > 
> >   arch/x86/coco/<guest stuff>
> >   arch/x86/virt/<"generic" x86 host virtualization stuff>
> >   arch/x86/virt/coco/<host coco stuff>
> >   arch/x86/virt/kvm/
> > 
> > Though I can already hear the stable trees and downstream kernels crying out in
> > horror at moving arch/x86/kvm :-)
> 
> Hmmm, so I am still thinking about guest-only when we're talking about
> arch/x86/coco/.
> 
> Lemme look at the other virt things:
> 
> the kvm host virt stuff is in:
> 
> arch/x86/kvm/
> 
>  (btw, this is where the SEV host stuff is: arch/x86/kvm/svm/sev.c)
> 
> arch/x86/hyperv/ - looks like hyperv guest stuff
> 
> arch/x86/xen/ - xen guest stuff
> 
> arch/x86/kernel/cpu/vmware.c - vmware guest stuff
> 
> > arch/x86/kernel/cpu/acrn.c - ACRN guest stuff
> 
> So we have a real mess. :-(

Don't forget :-)

  arch/x86/kernel/kvm.c - KVM guest stuff

> Not surprised though. So that last thing you're suggesting kinda makes
> sense but lemme tweak it a bit:
> 
> arch/x86/coco/<guest stuff>
> arch/x86/virt/<"generic" x86 host virtualization stuff>
> arch/x86/virt/tdx/ - no need for the "coco" thing - TDX is nothing but coco. TDX host
> stuff
> 
> arch/x86/virt/sev/ - ditto
> 
> and we'll keep arch/x86/kvm because of previous precedents with other
> things I've enumerated above.
> 
> Hmmm?

No objection to omitting "coco".  Though what about using "vmx" and "svm" instead
of "tdx" and "sev".  We lose the more explicit tie to coco, but it would mirror the
sub-directories in arch/x86/kvm/ and would avoid a mess in the scenario where tdx
or sev needs to share code with the non-coco side, e.g. I'm guessing TDX will need
to do VMXON.

  arch/x86/virt/vmx/
  	tdx.c
	vmx.c

  arch/x86/virt/svm/
  	sev.c
	sev-es.c
	sev-snp.c
  	svm.c


* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-09 20:07                 ` Sean Christopherson
@ 2022-02-09 20:36                   ` Borislav Petkov
  2022-02-10  0:05                     ` Kai Huang
  2022-02-16 15:48                     ` Kirill A. Shutemov
  0 siblings, 2 replies; 154+ messages in thread
From: Borislav Petkov @ 2022-02-09 20:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kai Huang, Kirill A. Shutemov, tglx, mingo, dave.hansen, luto,
	peterz, sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams,
	david, hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini,
	sdeep, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Brijesh Singh, Tom Lendacky

+ SEV guys. You can scroll upthread to read up on the context.

On Wed, Feb 09, 2022 at 08:07:52PM +0000, Sean Christopherson wrote:
> Don't forget :-)
> 
>   arch/x86/kernel/kvm.c - KVM guest stuff

I knew I'd miss something, ofc.

> No objection to omitting "coco".  Though what about using "vmx" and "svm" instead
> of "tdx" and "sev".

I'm not dead-set on this but ...

> We lose the more explicit tie to coco, but it would mirror the
> sub-directories in arch/x86/kvm/

... having them too close in naming to the non-coco stuff, might cause
confusion when looking at:

arch/x86/kvm/vmx/vmx.c

vs

arch/x86/virt/vmx/vmx.c

Instead of having

arch/x86/kvm/vmx/vmx.c

and

arch/x86/virt/tdx/vmx.c

That second version differs just the right amount. :-)

> and would avoid a mess in the scenario where tdx
> or sev needs to share code with the non-coco side, e.g. I'm guessing TDX will need
> to do VMXON.
> 
>   arch/x86/virt/vmx/
>   	tdx.c
> 	vmx.c
> 
>   arch/x86/virt/svm/
>   	sev.c
> 	sev-es.c
> 	sev-snp.c
>   	svm.c

That will probably be two files too: sev.c and svm.c

But let's see what the other folks think first...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv2.1 05/29] x86/tdx: Add HLT support for TDX guests
  2022-02-09 18:05                   ` Sean Christopherson
@ 2022-02-09 22:23                     ` Kirill A. Shutemov
  2022-02-10  1:21                       ` Sean Christopherson
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-09 22:23 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, bp, aarcange, ak, dan.j.williams, dave.hansen,
	david, hpa, jgross, jmattson, joro, jpoimboe, knsathya,
	linux-kernel, luto, mingo, pbonzini, peterz,
	sathyanarayanan.kuppuswamy, sdeep, tony.luck, vkuznets,
	wanpengli, x86

On Wed, Feb 09, 2022 at 06:05:52PM +0000, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Kirill A. Shutemov wrote:
> > On Mon, Feb 07, 2022 at 10:52:19PM +0000, Sean Christopherson wrote:
> > .Lskip_sti:
> > 	tdcall
> > 
> > 	/*
> > 	 * The TDVMCALL leaf is not supposed to fail. If it fails, something
> > 	 * is horribly wrong with the TDX module. Stop the world.
> > 	 */
> > 	test %rax, %rax
> > 	je .Lsuccess
> > 	ud2
> 
> If the ud2 or call to an external "do panic" helper is out-of-line, then the happy
> path avoids a taken branch.  Not a big deal, but it's also trivial to do.

Something like this?

I assume FRAME_END is irrelevant after UD2.

SYM_FUNC_START(__tdx_hypercall)
	FRAME_BEGIN

	/* Save callee-saved GPRs as mandated by the x86_64 ABI */
	push %r15
	push %r14
	push %r13
	push %r12

	/* Mangle function call ABI into TDCALL ABI: */
	/* Set TDCALL leaf ID (TDVMCALL (0)) in RAX */
	xor %eax, %eax

	/* Copy hypercall registers from arg struct: */
	movq TDX_HYPERCALL_r10(%rdi), %r10
	movq TDX_HYPERCALL_r11(%rdi), %r11
	movq TDX_HYPERCALL_r12(%rdi), %r12
	movq TDX_HYPERCALL_r13(%rdi), %r13
	movq TDX_HYPERCALL_r14(%rdi), %r14
	movq TDX_HYPERCALL_r15(%rdi), %r15

	movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx

	/*
	 * For the idle loop STI needs to be called directly before the TDCALL
	 * that enters idle (EXIT_REASON_HLT case). STI instruction enables
	 * interrupts only one instruction later. If there is a window between
	 * STI and the instruction that emulates the HALT state, there is a
	 * chance for interrupts to happen in this window, which can delay the
	 * HLT operation indefinitely. Since this is not the desired
	 * result, conditionally call STI before TDCALL.
	 */
	testq $TDX_HCALL_ISSUE_STI, %rsi
	jz .Lskip_sti
	sti
.Lskip_sti:
	tdcall

	/*
	 * The TDVMCALL leaf is not supposed to fail. If it fails, something
	 * is horribly wrong with the TDX module. Stop the world.
	 */
	testq %rax, %rax
	jne .Lpanic

	/* TDVMCALL leaf return code is in R10 */
	movq %r10, %rax

	/* Copy hypercall result registers to arg struct if needed */
	testq $TDX_HCALL_HAS_OUTPUT, %rsi
	jz .Lout

	movq %r10, TDX_HYPERCALL_r10(%rdi)
	movq %r11, TDX_HYPERCALL_r11(%rdi)
	movq %r12, TDX_HYPERCALL_r12(%rdi)
	movq %r13, TDX_HYPERCALL_r13(%rdi)
	movq %r14, TDX_HYPERCALL_r14(%rdi)
	movq %r15, TDX_HYPERCALL_r15(%rdi)
.Lout:
	/*
	 * Zero out registers exposed to the VMM to avoid speculative execution
	 * with VMM-controlled values. This needs to include all registers
	 * present in TDVMCALL_EXPOSE_REGS_MASK (except R12-R15). R12-R15
	 * context will be restored.
	 */
	xor %r10d, %r10d
	xor %r11d, %r11d

	/* Restore callee-saved GPRs as mandated by the x86_64 ABI */
	pop %r12
	pop %r13
	pop %r14
	pop %r15

	FRAME_END

	retq
.Lpanic:
	ud2
SYM_FUNC_END(__tdx_hypercall)
-- 
 Kirill A. Shutemov

* Re: [PATCHv2 23/29] x86/tdx: Add helper to convert memory between shared and private
  2022-02-08 12:12   ` Borislav Petkov
@ 2022-02-09 23:21     ` Kirill A. Shutemov
  0 siblings, 0 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-09 23:21 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: tglx, mingo, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Tue, Feb 08, 2022 at 01:12:45PM +0100, Borislav Petkov wrote:
> On Mon, Jan 24, 2022 at 06:02:09PM +0300, Kirill A. Shutemov wrote:
> > +	if (ret)
> > +		ret = -EIO;
> > +
> > +	if (ret || !enc)
> 
> Is the second case here after the "||" the conversion-to-shared where it
> only needs to notify with MapGPA and return?

Right. Memory accepting is required on the way to private.

I will rewrite and comment this code to make it more readable.

> > +		return ret;
> > +
> > +	/*
> > +	 * For shared->private conversion, accept the page using
> > +	 * TDX_ACCEPT_PAGE TDX module call.
> > +	 */
> > +	while (start < end) {
> > +		/* Try 2M page accept first if possible */
> > +		if (!(start & ~PMD_MASK) && end - start >= PMD_SIZE &&
> > +		    !tdx_accept_page(start, PG_LEVEL_2M)) {
> 
> What happens here if the module doesn't accept the page? No error
> reporting, no error handling, no warning, nada?

If it fails we fallback to 4k accept below.

We only report error if 4k accept fails.
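
The fallback described above can be sketched as a small user-space model in C.
It is an illustration only: accept_range(), the accept_fn callback standing in
for tdx_accept_page(), the pg_level enum and the two stub callbacks are
hypothetical names made up for this sketch, and -EIO is assumed as the error
value because the quoted hunk uses it for the preceding step.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SZ_4K	0x1000ULL
#define SZ_2M	0x200000ULL

enum pg_level { LEVEL_4K, LEVEL_2M };

/* Hypothetical stand-in for tdx_accept_page(); returns 0 on success. */
typedef int (*accept_fn)(uint64_t gpa, enum pg_level level);

/*
 * Accept [start, end). A failed 2M accept is not an error: the loop simply
 * falls back to 4K for that address. Only a failed 4K accept is reported.
 */
static int accept_range(uint64_t start, uint64_t end, accept_fn accept)
{
	assert(accept && !(start & (SZ_4K - 1)) && !(end & (SZ_4K - 1)));

	while (start < end) {
		/* Try a 2M accept first if alignment and size allow it. */
		if (!(start & (SZ_2M - 1)) && end - start >= SZ_2M &&
		    !accept(start, LEVEL_2M)) {
			start += SZ_2M;
			continue;
		}

		if (accept(start, LEVEL_4K))
			return -EIO;
		start += SZ_4K;
	}

	return 0;
}

/* Counters used by the stub callbacks below. */
static unsigned int calls_2m, calls_4k;

/* Stub that accepts every request and counts it by size. */
static int accept_any(uint64_t gpa, enum pg_level level)
{
	(void)gpa;
	if (level == LEVEL_2M)
		calls_2m++;
	else
		calls_4k++;
	return 0;
}

/* Stub that refuses every 2M request but accepts 4K ones. */
static int reject_2m(uint64_t gpa, enum pg_level level)
{
	(void)gpa;
	if (level == LEVEL_2M)
		return 1;
	calls_4k++;
	return 0;
}
```

With reject_2m as the callback, a 2M-aligned, 2M-sized range is still accepted
successfully, just as 512 separate 4K pages.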

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-09 20:36                   ` Borislav Petkov
@ 2022-02-10  0:05                     ` Kai Huang
  2022-02-16 16:08                       ` Sean Christopherson
  2022-02-16 15:48                     ` Kirill A. Shutemov
  1 sibling, 1 reply; 154+ messages in thread
From: Kai Huang @ 2022-02-10  0:05 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Sean Christopherson, Kirill A. Shutemov, tglx, mingo,
	dave.hansen, luto, peterz, sathyanarayanan.kuppuswamy, aarcange,
	ak, dan.j.williams, david, hpa, jgross, jmattson, joro, jpoimboe,
	knsathya, pbonzini, sdeep, tony.luck, vkuznets, wanpengli, x86,
	linux-kernel, Brijesh Singh, Tom Lendacky


> > No objection to omitting "coco".  Though what about using "vmx" and "svm" instead
> > of "tdx" and "sev".
> 
> I'm not dead-set on this but ...
> 
> > We lose the more explicit tie to coco, but it would mirror the
> > sub-directories in arch/x86/kvm/
> 
> ... having them too close in naming to the non-coco stuff, might cause
> confusion when looking at:
> 
> arch/x86/kvm/vmx/vmx.c
> 
> vs
> 
> arch/x86/virt/vmx/vmx.c
> 
> Instead of having
> 
> arch/x86/kvm/vmx/vmx.c
> 
> and
> 
> arch/x86/virt/tdx/vmx.c
> 
> That second version differs just the right amount. :-)

Having vmx.c under the tdx/ directory looks a little bit strange.

vmx.c seems more like "generic non-KVM host virtualization stuff".

> 
> > and would avoid a mess in the scenario where tdx
> > or sev needs to share code with the non-coco side, e.g. I'm guessing TDX will need
> > to do VMXON.
> > 
> >   arch/x86/virt/vmx/
> >   	tdx.c
> > 	vmx.c
> > 
> >   arch/x86/virt/svm/
> >   	sev.c
> > 	sev-es.c
> > 	sev-snp.c
> >   	svm.c
> 
> That will probably be two files too: sev.c and svm.c
> 
> But let's see what the other folks think first...
> 

So if I understand you guys correctly, I am currently heading towards:

	arch/x86/virt/vmx/
		tdx.c

("vmx/" can be changed later if you guys prefer something else).

And I am aiming to use a single tdx.c to hold ~2k LoC, since it looks like a
single file is preferred.

* Re: [PATCHv2 16/29] x86/boot: Add a trampoline for booting APs via firmware handoff
  2022-02-04 11:27     ` Kuppuswamy, Sathyanarayanan
  2022-02-04 13:49       ` Borislav Petkov
@ 2022-02-10  0:25       ` Kai Huang
  1 sibling, 0 replies; 154+ messages in thread
From: Kai Huang @ 2022-02-10  0:25 UTC (permalink / raw)
  To: sathyanarayanan.kuppuswamy
  Cc: Borislav Petkov, Kirill A. Shutemov, Sean Christopherson, tglx,
	mingo, dave.hansen, luto, peterz, aarcange, ak, dan.j.williams,
	david, hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini,
	sdeep, seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel


> >> Reported-by: Kai Huang <kai.huang@intel.com>
> > I wonder what that Reported-by tag means here for this is a feature
> > patch, not a bug fix or so...
> 
> I think it was added when Sean created the original patch. I don't have the
> full history.
> 
> Sean, since this is not a bug fix, shall we remove the Reported-by tag?

Sorry, just saw this.  Please remove it :)

* Re: [PATCHv2.1 05/29] x86/tdx: Add HLT support for TDX guests
  2022-02-09 22:23                     ` Kirill A. Shutemov
@ 2022-02-10  1:21                       ` Sean Christopherson
  0 siblings, 0 replies; 154+ messages in thread
From: Sean Christopherson @ 2022-02-10  1:21 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, bp, aarcange, ak, dan.j.williams, dave.hansen,
	david, hpa, jgross, jmattson, joro, jpoimboe, knsathya,
	linux-kernel, luto, mingo, pbonzini, peterz,
	sathyanarayanan.kuppuswamy, sdeep, tony.luck, vkuznets,
	wanpengli, x86

On Thu, Feb 10, 2022, Kirill A. Shutemov wrote:
> On Wed, Feb 09, 2022 at 06:05:52PM +0000, Sean Christopherson wrote:
> > On Wed, Feb 09, 2022, Kirill A. Shutemov wrote:
> > > On Mon, Feb 07, 2022 at 10:52:19PM +0000, Sean Christopherson wrote:
> > > .Lskip_sti:
> > > 	tdcall
> > > 
> > > 	/*
> > > > 	 * The TDVMCALL leaf is not supposed to fail. If it fails, something
> > > > 	 * is horribly wrong with the TDX module. Stop the world.
> > > 	 */
> > > 	test %rax, %rax
> > > 	je .Lsuccess
> > > 	ud2
> > 
> > If the ud2 or call to an external "do panic" helper is out-of-line, then the happy
> > path avoids a taken branch.  Not a big deal, but it's also trivial to do.
> 
> Something like this?

Yep.

> I assume FRAME_END is irrelevant after UD2.

Not irrelevant, but we don't want to do FRAME_END in this case.  Keeping the current
frame pointer (set up by FRAME_BEGIN, torn down by FRAME_END) will let the unwinder
do its thing when it's using frame pointers.

* Re: [PATCHv2 18/29] x86/boot: Avoid #VE during boot for TDX platforms
  2022-02-02  0:04   ` Thomas Gleixner
@ 2022-02-11 16:13     ` Kirill A. Shutemov
  0 siblings, 0 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-11 16:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, bp, dave.hansen, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86, linux-kernel

On Wed, Feb 02, 2022 at 01:04:34AM +0100, Thomas Gleixner wrote:
> > +	orl	$X86_CR4_PAE, %eax
> >  	testl	%edx, %edx
> >  	jz	1f
> >  	orl	$X86_CR4_LA57, %eax
> > @@ -662,8 +675,12 @@ SYM_CODE_START(trampoline_32bit_src)
> >  	pushl	$__KERNEL_CS
> >  	pushl	%eax
> >  
> > -	/* Enable paging again */
> > -	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
> > +	/*
> > +	 * Enable paging again.  Keep CR0.NE set, FERR# is no longer used
> > +	 * to handle x87 FPU errors and clearing NE may fault in some
> > +	 * environments.
> 
> FERR# is no longer used is really not informative here. The point is
> that any x86 CPU which is supported by the kernel requires CR0_NE to be
> set. This code was wrong from the very beginning because 64bit CPUs
> never supported FERR#. The reason why it exists is Copy&Pasta without
> brain applied and the sad fact that the hardware does not enforce it in
> native mode for whatever reason. So this wants to be a separate patch
> with a coherent comment and changelog.

What about the patch below?

Instead of adding CR0.NE there, I used CR0_STATE or kept the existing
value, modifying only the required bit.

I'm not familiar with floating-point exception handling. I tried to read up
on it in an attempt to write a coherent commit message. Please correct me if
I wrote something wrong.
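
The bit arithmetic behind this approach can be sketched in plain C. This is
a user-space model only: the bit positions are the architectural CR0 bits,
while the composition of CR0_STATE is assumed to match the kernel's
definition and should be checked against the kernel headers. The helper
names cr0_before_paging() and cr0_enable_paging() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural CR0 bits; names mirror the kernel's constants. */
#define X86_CR0_PG_BIT	31
#define X86_CR0_PE	(UINT32_C(1) << 0)	/* Protection Enable */
#define X86_CR0_MP	(UINT32_C(1) << 1)	/* Monitor Coprocessor */
#define X86_CR0_ET	(UINT32_C(1) << 4)	/* Extension Type */
#define X86_CR0_NE	(UINT32_C(1) << 5)	/* Numeric Error */
#define X86_CR0_WP	(UINT32_C(1) << 16)	/* Write Protect */
#define X86_CR0_AM	(UINT32_C(1) << 18)	/* Alignment Mask */
#define X86_CR0_PG	(UINT32_C(1) << X86_CR0_PG_BIT)	/* Paging */

/* Assumed to match the kernel's CR0_STATE. */
#define CR0_STATE	(X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
			 X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
			 X86_CR0_PG)

/* Value loaded before paging is on: CR0_STATE without PG, NE stays set. */
static uint32_t cr0_before_paging(void)
{
	return CR0_STATE & ~X86_CR0_PG;
}

/* Models "movl %cr0, %eax; btsl $X86_CR0_PG_BIT, %eax": set PG only. */
static uint32_t cr0_enable_paging(uint32_t cr0)
{
	return cr0 | (UINT32_C(1) << X86_CR0_PG_BIT);
}
```

With these definitions, cr0_enable_paging(cr0_before_paging()) yields
CR0_STATE again, so NE is never cleared at any step.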

---------------------------------8<----------------------------------------

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Fri, 11 Feb 2022 14:25:10 +0300
Subject: [PATCH] x86/boot: Set CR0.NE early and keep it set during the boot
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

TDX guest requires CR0.NE to be set. Clearing the bit triggers #GP(0).

If CR0.NE is 0, the MS-DOS compatibility mode for handling floating-point
exceptions is selected. In this mode, the software exception handler for
floating-point exceptions is invoked externally using the processor’s
FERR#, INTR, and IGNNE# pins.

Using FERR# and IGNNE# to handle floating-point exceptions is deprecated.
CR0.NE=0 also limits newer processors to operate with one logical
processor active.

The kernel uses the CR0_STATE constant to initialize CR0. It has the NE
bit set. But early boot code takes a more ad-hoc approach to setting bits
in the register.

Make CR0 initialization consistent, deriving the initial value from
CR0_STATE.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S   | 7 ++++---
 arch/x86/realmode/rm/trampoline_64.S | 8 ++++----
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index fd9441f40457..d0c3d33f3542 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -289,7 +289,7 @@ SYM_FUNC_START(startup_32)
 	pushl	%eax
 
 	/* Enter paged protected Mode, activating Long Mode */
-	movl	$(X86_CR0_PG | X86_CR0_PE), %eax /* Enable Paging and Protected mode */
+	movl	$CR0_STATE, %eax
 	movl	%eax, %cr0
 
 	/* Jump from 32bit compatibility mode into 64bit mode. */
@@ -662,8 +662,9 @@ SYM_CODE_START(trampoline_32bit_src)
 	pushl	$__KERNEL_CS
 	pushl	%eax
 
-	/* Enable paging again */
-	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
+	/* Enable paging again. */
+	movl	%cr0, %eax
+	btsl	$X86_CR0_PG_BIT, %eax
 	movl	%eax, %cr0
 
 	lret
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index ae112a91592f..d380f2d1fd23 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -70,7 +70,7 @@ SYM_CODE_START(trampoline_start)
 	movw	$__KERNEL_DS, %dx	# Data segment descriptor
 
 	# Enable protected mode
-	movl	$X86_CR0_PE, %eax	# protected mode (PE) bit
+	movl	$(CR0_STATE & ~X86_CR0_PG), %eax
 	movl	%eax, %cr0		# into protected mode
 
 	# flush prefetch and jump to startup_32
@@ -148,8 +148,8 @@ SYM_CODE_START(startup_32)
 	movl	$MSR_EFER, %ecx
 	wrmsr
 
-	# Enable paging and in turn activate Long Mode
-	movl	$(X86_CR0_PG | X86_CR0_WP | X86_CR0_PE), %eax
+	# Enable paging and in turn activate Long Mode.
+	movl	$CR0_STATE, %eax
 	movl	%eax, %cr0
 
 	/*
@@ -169,7 +169,7 @@ SYM_CODE_START(pa_trampoline_compat)
 	movl	$rm_stack_end, %esp
 	movw	$__KERNEL_DS, %dx
 
-	movl	$X86_CR0_PE, %eax
+	movl	$(CR0_STATE & ~X86_CR0_PG), %eax
 	movl	%eax, %cr0
 	ljmpl   $__KERNEL32_CS, $pa_startup_32
 SYM_CODE_END(pa_trampoline_compat)
-- 
 Kirill A. Shutemov

* Re: [PATCHv2 04/29] x86/traps: Add #VE support for TDX guest
  2022-02-01 21:02   ` Thomas Gleixner
  2022-02-01 21:26     ` Sean Christopherson
@ 2022-02-12  1:42     ` Kirill A. Shutemov
  1 sibling, 0 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-12  1:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, bp, dave.hansen, luto, peterz, sathyanarayanan.kuppuswamy,
	aarcange, ak, dan.j.williams, david, hpa, jgross, jmattson, joro,
	jpoimboe, knsathya, pbonzini, sdeep, seanjc, tony.luck, vkuznets,
	wanpengli, x86, linux-kernel, Sean Christopherson

On Tue, Feb 01, 2022 at 10:02:41PM +0100, Thomas Gleixner wrote:
> > +/*
> > + * Virtualization Exceptions (#VE) are delivered to TDX guests due to
> > + * specific guest actions which may happen in either user space or the
> > + * kernel:
> > + *
> > + *  * Specific instructions (WBINVD, for example)
> > + *  * Specific MSR accesses
> > + *  * Specific CPUID leaf accesses
> > + *  * Access to unmapped pages (EPT violation)
> > + *
> > + * In the settings that Linux will run in, virtualization exceptions are
> > + * never generated on accesses to normal, TD-private memory that has been
> > + * accepted.
> > + *
> > + * Syscall entry code has a critical window where the kernel stack is not
> > + * yet set up. Any exception in this window leads to hard to debug issues
> > + * and can be exploited for privilege escalation. Exceptions in the NMI
> > + * entry code also cause issues. Returning from the exception handler with
> > + * IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
> > + *
> > + * For these reasons, the kernel avoids #VEs during the syscall gap and
> > + * the NMI entry code. Entry code paths do not access TD-shared memory,
> > + * MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
> > + * that might generate #VE.
> 
> How is that enforced or validated? What checks for a violation of that
> assumption?

Hm. I think we would have to rely on code audit for it.

Entry code has no #VE-inducing things: no port I/O, CPUID, HLT,
MONITOR/MWAIT, WBINVD/INVD, or VMCALL.

There's a single MSR read, of MSR_GS_BASE in paranoid_entry(), but it
doesn't trigger #VE either.

The other possible source of #VE is shared memory. If somebody tricks the
kernel into accessing shared memory from entry code, we have a bigger
problem to deal with than a #VE in the syscall gap.

Or do you have something more strict than code audit in mind? I don't see
it.

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-02-07 17:28       ` Borislav Petkov
@ 2022-02-14 22:09         ` Kirill A. Shutemov
  2022-02-15 10:50           ` Borislav Petkov
  2022-02-15 14:49           ` Tom Lendacky
  0 siblings, 2 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-14 22:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, tglx, mingo, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Mon, Feb 07, 2022 at 06:28:04PM +0100, Borislav Petkov wrote:
> On Mon, Feb 07, 2022 at 08:57:39AM -0800, Dave Hansen wrote:
> > We can surely *do* this with cc_something() helpers.  It's just not as
> > easy as making cc_set/cc_clear().
> 
> Sure, that's easy: cc_pgprot_{enc,dec}() or so.

So, I've ended up with this in <asm/pgtable.h>

/*
 * Macros to add or remove encryption attribute
 */
#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
pgprotval_t cc_enc(pgprotval_t protval);
pgprotval_t cc_dec(pgprotval_t protval);
#define pgprot_encrypted(prot)	__pgprot(cc_enc(pgprot_val(prot)))
#define pgprot_decrypted(prot)	__pgprot(cc_dec(pgprot_val(prot)))
#else
#define pgprot_encrypted(prot) (prot)
#define pgprot_decrypted(prot) (prot)
#endif

And cc_platform.c:

pgprotval_t cc_enc(pgprotval_t protval)
{
	if (sme_me_mask)
		return __sme_set(protval);
	else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
		return protval & ~tdx_shared_mask();
	else
		return protval;
}

pgprotval_t cc_dec(pgprotval_t protval)
{
	if (sme_me_mask)
		return __sme_clr(protval);
	else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
		return protval | tdx_shared_mask();
	else
		return protval;
}
EXPORT_SYMBOL_GPL(cc_dec);
-- 
 Kirill A. Shutemov

* Re: [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-02-14 22:09         ` Kirill A. Shutemov
@ 2022-02-15 10:50           ` Borislav Petkov
  2022-02-15 14:49           ` Tom Lendacky
  1 sibling, 0 replies; 154+ messages in thread
From: Borislav Petkov @ 2022-02-15 10:50 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, tglx, mingo, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Tue, Feb 15, 2022 at 01:09:26AM +0300, Kirill A. Shutemov wrote:
> pgprotval_t cc_enc(pgprotval_t protval)
> {
> 	if (sme_me_mask)
> 		return __sme_set(protval);
> 	else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
> 		return protval & ~tdx_shared_mask();
				 ^^^^^^^^^^^^^^^^^^^

LGTM.

Btw, what about sticking the mask tdx_shared_mask() returns into a
proper u64 variable and using it everywhere, just like sme_me_mask?

We could unify it later into a common encryption mask, see thread
starting here:

https://lore.kernel.org/r/YgZ427v95xcdOKSC@zn.tnic

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-02-14 22:09         ` Kirill A. Shutemov
  2022-02-15 10:50           ` Borislav Petkov
@ 2022-02-15 14:49           ` Tom Lendacky
  2022-02-15 15:41             ` Kirill A. Shutemov
  1 sibling, 1 reply; 154+ messages in thread
From: Tom Lendacky @ 2022-02-15 14:49 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov
  Cc: Dave Hansen, tglx, mingo, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On 2/14/22 16:09, Kirill A. Shutemov wrote:
> On Mon, Feb 07, 2022 at 06:28:04PM +0100, Borislav Petkov wrote:
>> On Mon, Feb 07, 2022 at 08:57:39AM -0800, Dave Hansen wrote:
>>> We can surely *do* this with cc_something() helpers.  It's just not as
>>> easy as making cc_set/cc_clear().
>>
>> Sure, that's easy: cc_pgprot_{enc,dec}() or so.
> 
> So, I've ended up with this in <asm/pgtable.h>
> 
> /*
>   * Macros to add or remove encryption attribute
>   */
> #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
> pgprotval_t cc_enc(pgprotval_t protval);
> pgprotval_t cc_dec(pgprotval_t protval);
> #define pgprot_encrypted(prot)	__pgprot(cc_enc(pgprot_val(prot)))
> #define pgprot_decrypted(prot)	__pgprot(cc_dec(pgprot_val(prot)))
> #else
> #define pgprot_encrypted(prot) (prot)
> #define pgprot_decrypted(prot) (prot)
> #endif

A couple of things. I think cc_pgprot_enc() and cc_pgprot_dec() would be 
more descriptive/better names to use here.

Also, can they be defined in include/linux/cc_platform.h (with two
versions based on CONFIG_ARCH_HAS_CC_PLATFORM) and have that included
here? Or are there header file include issues when trying to include
it? That would clean this block up into just two lines.

Thanks,
Tom

> 
> And cc_platform.c:
> 
> pgprotval_t cc_enc(pgprotval_t protval)
> {
> 	if (sme_me_mask)
> 		return __sme_set(protval);
> 	else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
> 		return protval & ~tdx_shared_mask();
> 	else
> 		return protval;
> }
> 
> pgprotval_t cc_dec(pgprotval_t protval)
> {
> 	if (sme_me_mask)
> 		return __sme_clr(protval);
> 	else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
> 		return protval | tdx_shared_mask();
> 	else
> 		return protval;
> }
> EXPORT_SYMBOL_GPL(cc_dec);

* Re: [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-02-15 14:49           ` Tom Lendacky
@ 2022-02-15 15:41             ` Kirill A. Shutemov
  2022-02-15 15:55               ` Tom Lendacky
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-15 15:41 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Borislav Petkov, Dave Hansen, tglx, mingo, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Tue, Feb 15, 2022 at 08:49:34AM -0600, Tom Lendacky wrote:
> On 2/14/22 16:09, Kirill A. Shutemov wrote:
> > On Mon, Feb 07, 2022 at 06:28:04PM +0100, Borislav Petkov wrote:
> > > On Mon, Feb 07, 2022 at 08:57:39AM -0800, Dave Hansen wrote:
> > > > We can surely *do* this with cc_something() helpers.  It's just not as
> > > > easy as making cc_set/cc_clear().
> > > 
> > > Sure, that's easy: cc_pgprot_{enc,dec}() or so.
> > 
> > So, I've ended up with this in <asm/pgtable.h>
> > 
> > /*
> >   * Macros to add or remove encryption attribute
> >   */
> > #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
> > pgprotval_t cc_enc(pgprotval_t protval);
> > pgprotval_t cc_dec(pgprotval_t protval);
> > #define pgprot_encrypted(prot)	__pgprot(cc_enc(pgprot_val(prot)))
> > #define pgprot_decrypted(prot)	__pgprot(cc_dec(pgprot_val(prot)))
> > #else
> > #define pgprot_encrypted(prot) (prot)
> > #define pgprot_decrypted(prot) (prot)
> > #endif
> 
> A couple of things. I think cc_pgprot_enc() and cc_pgprot_dec() would be
> more descriptive/better names to use here.
> 
> Also, can they be defined in include/linux/cc_platform.h (with two versions
> based on CONFIG_ARCH_HAS_CC_PLATFORM) and have that included here? Or is
> there some header file include issues when trying to include it? That would
> clean this block up into just two lines.

Well, pgprotval_t is x86-specific. It cannot be used in generic headers.
We can use u64 here instead. It is wider than pgprotval_t on 2-level
paging x86, but should work.

But with u64 as type, I'm not sure 'pgprot' in the name is justified.

Hm?

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-02-15 15:41             ` Kirill A. Shutemov
@ 2022-02-15 15:55               ` Tom Lendacky
  2022-02-15 16:27                 ` Kirill A. Shutemov
  0 siblings, 1 reply; 154+ messages in thread
From: Tom Lendacky @ 2022-02-15 15:55 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Borislav Petkov, Dave Hansen, tglx, mingo, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On 2/15/22 09:41, Kirill A. Shutemov wrote:
> On Tue, Feb 15, 2022 at 08:49:34AM -0600, Tom Lendacky wrote:
>> On 2/14/22 16:09, Kirill A. Shutemov wrote:
>>> On Mon, Feb 07, 2022 at 06:28:04PM +0100, Borislav Petkov wrote:
>>>> On Mon, Feb 07, 2022 at 08:57:39AM -0800, Dave Hansen wrote:
>>>>> We can surely *do* this with cc_something() helpers.  It's just not as
>>>>> easy as making cc_set/cc_clear().
>>>>
>>>> Sure, that's easy: cc_pgprot_{enc,dec}() or so.
>>>
>>> So, I've ended up with this in <asm/pgtable.h>
>>>
>>> /*
>>>    * Macros to add or remove encryption attribute
>>>    */
>>> #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
>>> pgprotval_t cc_enc(pgprotval_t protval);
>>> pgprotval_t cc_dec(pgprotval_t protval);
>>> #define pgprot_encrypted(prot)	__pgprot(cc_enc(pgprot_val(prot)))
>>> #define pgprot_decrypted(prot)	__pgprot(cc_dec(pgprot_val(prot)))
>>> #else
>>> #define pgprot_encrypted(prot) (prot)
>>> #define pgprot_decrypted(prot) (prot)
>>> #endif
>>
>> A couple of things. I think cc_pgprot_enc() and cc_pgprot_dec() would be
>> more descriptive/better names to use here.
>>
>> Also, can they be defined in include/linux/cc_platform.h (with two versions
>> based on CONFIG_ARCH_HAS_CC_PLATFORM) and have that included here? Or is
>> there some header file include issues when trying to include it? That would
>> clean this block up into just two lines.
> 
> Well, pgprotval_t is x86-specific. It cannot be used in generic headers.

Ah, right.

> We can use u64 here instead. It is wider than pgprotval_t on 2-level
> paging x86, but should work.

Hmm..., yeah. Maybe unsigned long? CONFIG_ARCH_HAS_CC_PLATFORM is X86_64 
only, so 2-level paging wouldn't be applicable when an unsigned long is 
64-bits?

I'll let the maintainers weigh in on that.

> 
> But with u64 as type, I'm not sure 'pgprot' in the name is justified.

Maybe cc_mask_{enc,dec}()? It just sounds like cc_{enc,dec}() is actually 
performing encryption or decryption and can be confusing.

Thanks,
Tom

> 
> Hm?
> 

* Re: [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-02-15 15:55               ` Tom Lendacky
@ 2022-02-15 16:27                 ` Kirill A. Shutemov
  2022-02-15 16:34                   ` Dave Hansen
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-15 16:27 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Borislav Petkov, Dave Hansen, tglx, mingo, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Tue, Feb 15, 2022 at 09:55:16AM -0600, Tom Lendacky wrote:
> On 2/15/22 09:41, Kirill A. Shutemov wrote:
> > On Tue, Feb 15, 2022 at 08:49:34AM -0600, Tom Lendacky wrote:
> > > On 2/14/22 16:09, Kirill A. Shutemov wrote:
> > > > On Mon, Feb 07, 2022 at 06:28:04PM +0100, Borislav Petkov wrote:
> > > > > On Mon, Feb 07, 2022 at 08:57:39AM -0800, Dave Hansen wrote:
> > > > > > We can surely *do* this with cc_something() helpers.  It's just not as
> > > > > > easy as making cc_set/cc_clear().
> > > > > 
> > > > > Sure, that's easy: cc_pgprot_{enc,dec}() or so.
> > > > 
> > > > So, I've ended up with this in <asm/pgtable.h>
> > > > 
> > > > /*
> > > >    * Macros to add or remove encryption attribute
> > > >    */
> > > > #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
> > > > pgprotval_t cc_enc(pgprotval_t protval);
> > > > pgprotval_t cc_dec(pgprotval_t protval);
> > > > #define pgprot_encrypted(prot)	__pgprot(cc_enc(pgprot_val(prot)))
> > > > #define pgprot_decrypted(prot)	__pgprot(cc_dec(pgprot_val(prot)))
> > > > #else
> > > > #define pgprot_encrypted(prot) (prot)
> > > > #define pgprot_decrypted(prot) (prot)
> > > > #endif
> > > 
> > > A couple of things. I think cc_pgprot_enc() and cc_pgprot_dec() would be
> > > more descriptive/better names to use here.
> > > 
> > > Also, can they be defined in include/linux/cc_platform.h (with two versions
> > > based on CONFIG_ARCH_HAS_CC_PLATFORM) and have that included here? Or is
> > > there some header file include issues when trying to include it? That would
> > > clean this block up into just two lines.
> > 
> > Well, pgprotval_t is x86-specific. It cannot be used in generic headers.
> 
> Ah, right.
> 
> > We can use u64 here instead. It is wider than pgprotval_t on 2-level
> > paging x86, but should work.
> 
> Hmm..., yeah. Maybe unsigned long? CONFIG_ARCH_HAS_CC_PLATFORM is X86_64
> only, so 2-level paging wouldn't be applicable when an unsigned long is
> 64-bits?

Hm. So for !CONFIG_ARCH_HAS_CC_PLATFORM it has to be a define. If we tried
a static inline dummy instead, it would break x86 PAE, as the upper bits
get truncated when passed via the helper.
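
The truncation concern can be reproduced in user space. A minimal sketch,
assuming a 32-bit unsigned long (modelled here with uint32_t) and a 64-bit
PAE protection value with NX in bit 63; stub_inline(), STUB_MACRO() and the
through_*() wrappers are hypothetical stand-ins, not kernel code:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pae_prot_t;	/* models pgprotval_t on x86 PAE (64-bit) */
typedef uint32_t ulong32_t;	/* models unsigned long on 32-bit x86 */

#define MODEL_PAGE_NX	(UINT64_C(1) << 63)	/* NX lives in bit 63 */

/* Hypothetical static inline stub taking an unsigned long. */
static inline ulong32_t stub_inline(ulong32_t val)
{
	return val;
}

/* Hypothetical macro stub: the value is never converted. */
#define STUB_MACRO(val)	(val)

/*
 * Route a 64-bit value through the inline stub. The cast spells out the
 * narrowing the compiler would otherwise do implicitly at the call site.
 */
static pae_prot_t through_inline(pae_prot_t prot)
{
	return stub_inline((ulong32_t)prot);
}

/* Route the same value through the macro stub: all 64 bits survive. */
static pae_prot_t through_macro(pae_prot_t prot)
{
	return STUB_MACRO(prot);
}
```

Passing MODEL_PAGE_NX | 0x63 through through_macro() returns the value
unchanged, while through_inline() returns only the low 32 bits and so loses
NX.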

I donno.

> I'll let the maintainers weigh in on that.
> 
> > 
> > But with u64 as type, I'm not sure 'pgprot' in the name is justified.
> 
> Maybe cc_mask_{enc,dec}()? It just sounds like cc_{enc,dec}() is actually
> performing encryption or decryption and can be confusing.

cc_{enc,dec}_mask() sounds better to me.

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-02-15 16:27                 ` Kirill A. Shutemov
@ 2022-02-15 16:34                   ` Dave Hansen
  2022-02-15 17:33                     ` Kirill A. Shutemov
  0 siblings, 1 reply; 154+ messages in thread
From: Dave Hansen @ 2022-02-15 16:34 UTC (permalink / raw)
  To: Kirill A. Shutemov, Tom Lendacky
  Cc: Borislav Petkov, tglx, mingo, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On 2/15/22 08:27, Kirill A. Shutemov wrote:
>>> But with u64 as type, I'm not sure 'pgprot' in the name is justified.
>> Maybe cc_mask_{enc,dec}()? It just sounds like cc_{enc,dec}() is actually
>> performing encryption or decryption and can be confusing.
> cc_{enc,dec}_mask() sounds better to me.

The pte_mk*() functions might be a good naming model here.  Some of them
clear bits and some set them, but they all "make" a PTE.

* Re: [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-02-15 16:34                   ` Dave Hansen
@ 2022-02-15 17:33                     ` Kirill A. Shutemov
  2022-02-16  9:58                       ` Borislav Petkov
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-15 17:33 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Tom Lendacky, Borislav Petkov, tglx, mingo, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Tue, Feb 15, 2022 at 08:34:53AM -0800, Dave Hansen wrote:
> On 2/15/22 08:27, Kirill A. Shutemov wrote:
> >>> But with u64 as type, I'm not sure 'pgprot' in the name is justified.
> >> Maybe cc_mask_{enc,dec}()? It just sounds like cc_{enc,dec}() is actually
> >> performing encryption or decryption and can be confusing.
> > cc_{enc,dec}_mask() sounds better to me.
> 
> The pte_mk*() functions might be a good naming model here.  Some of them
> clear bits and some set them, but they all "make" a PTE.

Like cc_mkencrypted()/cc_mkdecrypted()? I donno. Looks strange.

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 16/29] x86/boot: Add a trampoline for booting APs via firmware handoff
  2022-02-04 13:49       ` Borislav Petkov
@ 2022-02-15 21:36         ` Kirill A. Shutemov
  2022-02-16 10:07           ` Borislav Petkov
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-15 21:36 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kuppuswamy, Sathyanarayanan, Sean Christopherson, tglx, mingo,
	dave.hansen, luto, peterz, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kai Huang

On Fri, Feb 04, 2022 at 02:49:59PM +0100, Borislav Petkov wrote:
> On Fri, Feb 04, 2022 at 03:27:19AM -0800, Kuppuswamy, Sathyanarayanan wrote:
> > trampoline_start and sev_es_trampoline_start are not mutually exclusive.
> > Both are
> > used in arch/x86/kernel/sev.c.
> 
> I know - I've asked Jörg to have a look here.
> 
> > But trampoline_start64 can be removed and replaced with trampoline_start.
> > But using
> > _*64 suffix makes it clear that is used for 64 bit(CONFIG_X86_64).
> > 
> > Adding it for clarity seems to be fine to me.
> 
> Does it matter if the start IP is the same for all APs? Or will there
> be a case where you have some APs starting from the 32-bit trampoline
> and some from the 64-bit one, on the same system? (that would be weird
> but what do I know...)

I'm not sure I follow. SMP bring-up is a new topic for me.

We want a single kernel binary that boots everywhere, so we cannot know at
build time if a secondary CPU will start in 32- or 64-bit mode.

How can a single trampoline_start cover all cases?

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-02-15 17:33                     ` Kirill A. Shutemov
@ 2022-02-16  9:58                       ` Borislav Petkov
  2022-02-16 15:37                         ` Kirill A. Shutemov
  0 siblings, 1 reply; 154+ messages in thread
From: Borislav Petkov @ 2022-02-16  9:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Tom Lendacky, tglx, mingo, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Tue, Feb 15, 2022 at 08:33:21PM +0300, Kirill A. Shutemov wrote:
> Like cc_mkencrypted()/cc_mkdecrypted()? I donno. Looks strange.

cc_mkenc/cc_mkdec probably.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv2 16/29] x86/boot: Add a trampoline for booting APs via firmware handoff
  2022-02-15 21:36         ` Kirill A. Shutemov
@ 2022-02-16 10:07           ` Borislav Petkov
  2022-02-16 14:10             ` Kirill A. Shutemov
  0 siblings, 1 reply; 154+ messages in thread
From: Borislav Petkov @ 2022-02-16 10:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kuppuswamy, Sathyanarayanan, Sean Christopherson, tglx, mingo,
	dave.hansen, luto, peterz, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kai Huang

On Wed, Feb 16, 2022 at 12:36:24AM +0300, Kirill A. Shutemov wrote:
> How can a single trampoline_start cover all cases?

All I'm saying is that the real mode header should have a single

	u32     trampoline_start;

instead of:

	u32     trampoline_start;
	u32     sev_es_trampoline_start;
	u32     trampoline_start64;

which all are the same thing on a single system.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv2 16/29] x86/boot: Add a trampoline for booting APs via firmware handoff
  2022-02-16 10:07           ` Borislav Petkov
@ 2022-02-16 14:10             ` Kirill A. Shutemov
  0 siblings, 0 replies; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-16 14:10 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kuppuswamy, Sathyanarayanan, Sean Christopherson, tglx, mingo,
	dave.hansen, luto, peterz, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Kai Huang

On Wed, Feb 16, 2022 at 11:07:15AM +0100, Borislav Petkov wrote:
> On Wed, Feb 16, 2022 at 12:36:24AM +0300, Kirill A. Shutemov wrote:
> > How can a single trampoline_start cover all cases?
> 
> All I'm saying is that the real mode header should have a single
> 
> 	u32     trampoline_start;
> 
> instead of:
> 
> 	u32     trampoline_start;
> 	u32     sev_es_trampoline_start;
> 	u32     trampoline_start64;
> 
> which all are the same thing on a single system.

But these are generated at build time, no?

As far as I can see, it is initialized in arch/x86/realmode/rm/header.S by
the linker.

I'm confused.

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-02-16  9:58                       ` Borislav Petkov
@ 2022-02-16 15:37                         ` Kirill A. Shutemov
  2022-02-17 15:24                           ` Borislav Petkov
  0 siblings, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-16 15:37 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, Tom Lendacky, tglx, mingo, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Wed, Feb 16, 2022 at 10:58:34AM +0100, Borislav Petkov wrote:
> On Tue, Feb 15, 2022 at 08:33:21PM +0300, Kirill A. Shutemov wrote:
> > Like cc_mkencrypted()/cc_mkdecrypted()? I donno. Looks strange.
> 
> cc_mkenc/cc_mkdec probably.

Okay, what about this:

	static u64 cc_mask;

	pgprotval_t cc_mkenc(pgprotval_t protval)
	{
		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
			return protval & ~cc_mask;
		else
			return protval | cc_mask;
	}

	pgprotval_t cc_mkdec(pgprotval_t protval)
	{
		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
			return protval | cc_mask;
		else
			return protval & ~cc_mask;
	}
	EXPORT_SYMBOL_GPL(cc_mkdec);

	__init void cc_init(void)
	{
		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
			cc_mask = tdx_shared_mask();
		else
			cc_mask = sme_me_mask;
	}

I did not introduce an explicit vendor variable but opted to check
X86_FEATURE_TDX_GUEST instead. There's no X86_FEATURE counterpart on the
AMD side, presumably because it gets used too early to be functional.

TDX needs cc_platform_has() later, when the X86_FEATURE infrastructure is
already functional (and we can benefit from the static branch).

cc_init() gets called from sme_enable() for AMD and from tdx_early_init()
for TDX.
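
The polarity difference between the two vendors can be modelled in user
space. A minimal sketch, assuming a boolean in place of
X86_FEATURE_TDX_GUEST; the *_model names are hypothetical, and any mask
values passed in are illustrative rather than real hardware bit positions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for X86_FEATURE_TDX_GUEST and cc_mask. */
static bool model_is_tdx_guest;
static uint64_t model_cc_mask;

/* Models cc_init(): record the vendor and its mask once, early on. */
static void cc_init_model(bool tdx_guest, uint64_t mask)
{
	model_is_tdx_guest = tdx_guest;
	model_cc_mask = mask;
}

/*
 * "Encrypted" (private) mapping:
 *   TDX: the mask bit means "shared", so clear it.
 *   SME/SEV: the mask bit means "encrypted", so set it.
 */
static uint64_t cc_mkenc_model(uint64_t protval)
{
	if (model_is_tdx_guest)
		return protval & ~model_cc_mask;
	return protval | model_cc_mask;
}

/* "Decrypted" (shared) mapping: the inverse operation for each vendor. */
static uint64_t cc_mkdec_model(uint64_t protval)
{
	if (model_is_tdx_guest)
		return protval | model_cc_mask;
	return protval & ~model_cc_mask;
}
```

With a zero mask (no memory encryption active) both helpers return the
value unchanged, which matches the fall-through behaviour of the earlier
cc_enc()/cc_dec() version.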

I also reworked cc_platform_has() to use a combination of
X86_FEATURE_TDX_GUEST and cc_mask to route to the right helper:

	bool cc_platform_has(enum cc_attr attr)
	{
		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
			return intel_cc_platform_has(attr);
		else if (cc_mask)
			return amd_cc_platform_has(attr);
		else if (hv_is_isolation_supported())
			return hyperv_cc_platform_has(attr);
		else
			return false;
	}

Any opinions?

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-09 20:36                   ` Borislav Petkov
  2022-02-10  0:05                     ` Kai Huang
@ 2022-02-16 15:48                     ` Kirill A. Shutemov
  2022-02-17 15:19                       ` Borislav Petkov
  1 sibling, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-16 15:48 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Sean Christopherson, Kai Huang, tglx, mingo, dave.hansen, luto,
	peterz, sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams,
	david, hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini,
	sdeep, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Brijesh Singh, Tom Lendacky

On Wed, Feb 09, 2022 at 09:36:57PM +0100, Borislav Petkov wrote:
> + SEV guys. You can scroll upthread to read up on the context.
> 
> On Wed, Feb 09, 2022 at 08:07:52PM +0000, Sean Christopherson wrote:
> > Don't forget :-)
> > 
> >   arch/x86/kernel/kvm.c - KVM guest stuff
> 
> I knew I'd miss something, ofc.
> 
> > No objection to omitting "coco".  Though what about using "vmx" and "svm" instead
> > of "tdx" and "sev".
> 
> I'm not dead-set on this but ...
> 
> > We lose the more explicit tie to coco, but it would mirror the
> > sub-directories in arch/x86/kvm/
> 
> ... having them too close in naming to the non-coco stuff, might cause
> confusion when looking at:
> 
> arch/x86/kvm/vmx/vmx.c
> 
> vs
> 
> arch/x86/virt/vmx/vmx.c
> 
> Instead of having
> 
> arch/x86/kvm/vmx/vmx.c
> 
> and
> 
> arch/x86/virt/tdx/vmx.c
> 
> That second version differs just the right amount. :-)
> 
> > and would avoid a mess in the scenario where tdx
> > or sev needs to share code with the non-coco side, e.g. I'm guessing TDX will need
> > to do VMXON.
> > 
> >   arch/x86/virt/vmx/
> >   	tdx.c
> > 	vmx.c
> > 
> >   arch/x86/virt/svm/
> >   	sev.c
> > 	sev-es.c
> > 	sev-snp.c
> >   	svm.c
> 
> That will probably be two files too: sev.c and svm.c
> 
> But let's see what the other folks think first...

So, any conclusion?

I want to understand where to land TDX guest code and host-guest shared TDX code.
Host-guest shared code doesn't seem to fit anywhere nicely.

Or should I leave it under arch/x86/kernel until a decision is made?

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-10  0:05                     ` Kai Huang
@ 2022-02-16 16:08                       ` Sean Christopherson
  0 siblings, 0 replies; 154+ messages in thread
From: Sean Christopherson @ 2022-02-16 16:08 UTC (permalink / raw)
  To: Kai Huang
  Cc: Borislav Petkov, Kirill A. Shutemov, tglx, mingo, dave.hansen,
	luto, peterz, sathyanarayanan.kuppuswamy, aarcange, ak,
	dan.j.williams, david, hpa, jgross, jmattson, joro, jpoimboe,
	knsathya, pbonzini, sdeep, tony.luck, vkuznets, wanpengli, x86,
	linux-kernel, Brijesh Singh, Tom Lendacky

On Thu, Feb 10, 2022, Kai Huang wrote:
> 
> > > No objection to omitting "coco".  Though what about using "vmx" and "svm" instead
> > > of "tdx" and "sev".
> > 
> > I'm not dead-set on this but ...
> > 
> > > We lose the more explicit tie to coco, but it would mirror the
> > > sub-directories in arch/x86/kvm/
> > 
> > ... having them too close in naming to the non-coco stuff, might cause
> > confusion when looking at:
> > 
> > arch/x86/kvm/vmx/vmx.c
> > 
> > vs
> > 
> > arch/x86/virt/vmx/vmx.c
> > 
> > Instead of having
> > 
> > arch/x86/kvm/vmx/vmx.c
> > 
> > and
> > 
> > arch/x86/virt/tdx/vmx.c
> > 
> > That second version differs just the right amount. :-)
> 
> Having vmx.c under tdx/ directory looks a little bit strange.

Yeah, it's inverted.  TDX is built on top of VMX.  If/when we end up with stuff
that is relevant to VMX but not TDX, then we'll be referencing tdx/ for things
that don't care at all about TDX.

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-16 15:48                     ` Kirill A. Shutemov
@ 2022-02-17 15:19                       ` Borislav Petkov
  2022-02-17 15:26                         ` Kirill A. Shutemov
  2022-02-17 15:29                         ` Sean Christopherson
  0 siblings, 2 replies; 154+ messages in thread
From: Borislav Petkov @ 2022-02-17 15:19 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Sean Christopherson, Kai Huang, tglx, mingo, dave.hansen, luto,
	peterz, sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams,
	david, hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini,
	sdeep, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Brijesh Singh, Tom Lendacky

On Wed, Feb 16, 2022 at 06:48:09PM +0300, Kirill A. Shutemov wrote:
> So, any conclusion?

Lemme type the whole thing here again so that we have it all summed up
in one place - I think we all agree by now:

- confidential computing guest stuff: arch/x86/coco/{sev,tdx}
- generic host virtualization stuff: arch/x86/virt/
- coco host stuff: arch/x86/virt/vmx/{tdx,vmx}.c and arch/x86/virt/svm/sev*.c

New stuff goes to the new paths - i.e., TDX guest, host, etc - old stuff
- AMD SEV/SNP will get moved gradually so that development doesn't get
disrupted. Or we can do a flag day, right before -rc1 or so, and move it
all in one go. We'll see.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv2 22/29] x86/tdx: Make pages shared in ioremap()
  2022-02-16 15:37                         ` Kirill A. Shutemov
@ 2022-02-17 15:24                           ` Borislav Petkov
  0 siblings, 0 replies; 154+ messages in thread
From: Borislav Petkov @ 2022-02-17 15:24 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Tom Lendacky, tglx, mingo, luto, peterz,
	sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On Wed, Feb 16, 2022 at 06:37:03PM +0300, Kirill A. Shutemov wrote:
> 	bool cc_platform_has(enum cc_attr attr)
> 	{
> 		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
> 			return intel_cc_platform_has(attr);
> 		else if (cc_mask)
> 			return amd_cc_platform_has(attr);

It is exactly stuff like that I'd like to avoid because that is
dependent on the order the test happens.

It would be a lot more robust if this did:

	switch (cc_vendor) {
	case INTEL:  return intel_cc_platform_has(attr);
	case AMD:    return amd_cc_platform_has(attr);
	case HYPERV: return hyperv_cc_platform_has(attr);
	default: return false;
	}
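
To make the ordering concern concrete, here is a minimal user-space
sketch, not kernel code. Everything in it is an assumption made for
illustration: the enum cc_vendor_id values, the struct platform_state
fields, and the stub helpers with their made-up answers are not taken
from the patch set. It only contrasts an if/else chain over independent
flags with a switch over a single vendor value:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in definitions for this sketch only (hypothetical). */
enum cc_attr { CC_ATTR_GUEST_MEM_ENCRYPT, CC_ATTR_GUEST_STATE_ENCRYPT };
enum cc_vendor_id { CC_VENDOR_NONE, CC_VENDOR_INTEL, CC_VENDOR_AMD, CC_VENDOR_HYPERV };

/* Independent flags, as tested by the if/else version (hypothetical). */
struct platform_state {
	bool tdx_guest;      /* models cpu_feature_enabled(X86_FEATURE_TDX_GUEST) */
	uint64_t cc_mask;    /* models a non-zero encryption mask */
	bool hv_isolation;   /* models hv_is_isolation_supported() */
};

/* Stub helpers: the answers are invented so the callers can be told apart. */
static bool intel_cc_platform_has(enum cc_attr attr)
{
	return attr == CC_ATTR_GUEST_MEM_ENCRYPT;
}

static bool amd_cc_platform_has(enum cc_attr attr)
{
	(void)attr;
	return true;
}

static bool hyperv_cc_platform_has(enum cc_attr attr)
{
	return attr == CC_ATTR_GUEST_MEM_ENCRYPT;
}

/* if/else chain: the first matching flag wins, so branch order matters. */
bool chain_cc_platform_has(const struct platform_state *s, enum cc_attr attr)
{
	assert(s != NULL);
	if (s->tdx_guest)
		return intel_cc_platform_has(attr);
	else if (s->cc_mask)
		return amd_cc_platform_has(attr);
	else if (s->hv_isolation)
		return hyperv_cc_platform_has(attr);
	return false;
}

/* switch on one vendor value: exactly one case can match, in any order. */
bool switch_cc_platform_has(enum cc_vendor_id vendor, enum cc_attr attr)
{
	switch (vendor) {
	case CC_VENDOR_INTEL:  return intel_cc_platform_has(attr);
	case CC_VENDOR_AMD:    return amd_cc_platform_has(attr);
	case CC_VENDOR_HYPERV: return hyperv_cc_platform_has(attr);
	default:               return false;
	}
}
```

In this sketch, a state with both cc_mask and hv_isolation set is routed
to the AMD stub by the chain only because that branch is tested first;
swapping the two branches would change the answer. The switch version
has no such dependency because the vendor is decided once, up front.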

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-17 15:19                       ` Borislav Petkov
@ 2022-02-17 15:26                         ` Kirill A. Shutemov
  2022-02-17 15:34                           ` Borislav Petkov
  2022-02-17 15:29                         ` Sean Christopherson
  1 sibling, 1 reply; 154+ messages in thread
From: Kirill A. Shutemov @ 2022-02-17 15:26 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Sean Christopherson, Kai Huang, tglx, mingo, dave.hansen, luto,
	peterz, sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams,
	david, hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini,
	sdeep, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Brijesh Singh, Tom Lendacky

On Thu, Feb 17, 2022 at 04:19:35PM +0100, Borislav Petkov wrote:
> On Wed, Feb 16, 2022 at 06:48:09PM +0300, Kirill A. Shutemov wrote:
> > So, any conclusion?
> 
> Lemme type the whole thing here again so that we have it all summed up
> in one place - I think we all agree by now:
> 
> - confidential computing guest stuff: arch/x86/coco/{sev,tdx}
> - generic host virtualization stuff: arch/x86/virt/
> - coco host stuff: arch/x86/virt/vmx/{tdx,vmx}.c and arch/x86/virt/svm/sev*.c
> 
> New stuff goes to the new paths - i.e., TDX guest, host, etc - old stuff
> - AMD SEV/SNP will get moved gradually so that development doesn't get
> disrupted. Or we can do a flag day, right before -rc1 or so, and move it
> all in one go. We'll see.

Okay, so on TDX guest side I would have

arch/x86/kernel/tdx.c => arch/x86/coco/tdx.c
arch/x86/kernel/tdcall.S => arch/x86/coco/tdcall.S
arch/x86/kernel/tdxcall.S => arch/x86/virt/tdxcall.S

The last one is going to be used by the TDX host as well to define the
SEAMCALL helper.

Looks good?

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-17 15:19                       ` Borislav Petkov
  2022-02-17 15:26                         ` Kirill A. Shutemov
@ 2022-02-17 15:29                         ` Sean Christopherson
  2022-02-17 15:31                           ` Borislav Petkov
  1 sibling, 1 reply; 154+ messages in thread
From: Sean Christopherson @ 2022-02-17 15:29 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Kai Huang, tglx, mingo, dave.hansen, luto,
	peterz, sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams,
	david, hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini,
	sdeep, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Brijesh Singh, Tom Lendacky

On Thu, Feb 17, 2022, Borislav Petkov wrote:
> On Wed, Feb 16, 2022 at 06:48:09PM +0300, Kirill A. Shutemov wrote:
> > So, any conclusion?
> 
> Lemme type the whole thing here again so that we have it all summed up
> in one place - I think we all agree by now:
> 
> - confidential computing guest stuff: arch/x86/coco/{sev,tdx}
> - generic host virtualization stuff: arch/x86/virt/
> - coco host stuff: arch/x86/virt/vmx/{tdx,vmx}.c and arch/x86/virt/svm/sev*.c

LGTM

> New stuff goes to the new paths - i.e., TDX guest, host, etc - old stuff
> - AMD SEV/SNP will get moved gradually so that development doesn't get
> disrupted. Or we can do a flag day, right before -rc1 or so, and move it
> all in one go. We'll see.

FWIW, I don't think there's much existing SEV host virtualization stuff that can
be moved without first extracting and decoupling it from KVM, which will be
non-trivial.  I do want to do that some day, but it definitely shouldn't hold up
merging SNP.

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-17 15:29                         ` Sean Christopherson
@ 2022-02-17 15:31                           ` Borislav Petkov
  0 siblings, 0 replies; 154+ messages in thread
From: Borislav Petkov @ 2022-02-17 15:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kirill A. Shutemov, Kai Huang, tglx, mingo, dave.hansen, luto,
	peterz, sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams,
	david, hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini,
	sdeep, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Brijesh Singh, Tom Lendacky

On Thu, Feb 17, 2022 at 03:29:00PM +0000, Sean Christopherson wrote:
> FWIW, I don't think there's much existing SEV host virtualization stuff that can
> be moved without first extracting and decoupling it from KVM, which will be
> non-trivial.  I do want to do that some day, but it definitely shouldn't hold up
> merging SNP.

Oh sure, code movement is lowest prio so whenever you feel like it.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv2 00/29] TDX Guest: TDX core support
  2022-02-17 15:26                         ` Kirill A. Shutemov
@ 2022-02-17 15:34                           ` Borislav Petkov
  0 siblings, 0 replies; 154+ messages in thread
From: Borislav Petkov @ 2022-02-17 15:34 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Sean Christopherson, Kai Huang, tglx, mingo, dave.hansen, luto,
	peterz, sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams,
	david, hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini,
	sdeep, tony.luck, vkuznets, wanpengli, x86, linux-kernel,
	Brijesh Singh, Tom Lendacky

On Thu, Feb 17, 2022 at 06:26:13PM +0300, Kirill A. Shutemov wrote:
> Okay, so on TDX guest side I would have
> 
> arch/x86/kernel/tdx.c => arch/x86/coco/tdx.c

Right, and to answer a previous question: if that file starts becoming
too huge and unwieldy then we should split it, by all means. What I
don't like is getting a bunch of small files with no good reason why.

> arch/x86/kernel/tdcall.S => arch/x86/coco/tdcall.S
> arch/x86/kernel/tdxcall.S => arch/x86/virt/tdxcall.S
> 
> The last one is going to be used by the TDX host as well to define the
> SEAMCALL helper.
> 
> Looks good?

Right, that looks neat.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCHv2 29/29] Documentation/x86: Document TDX kernel architecture
  2022-01-24 15:02 ` [PATCHv2 29/29] Documentation/x86: Document TDX kernel architecture Kirill A. Shutemov
@ 2022-02-24  9:08   ` Xiaoyao Li
  0 siblings, 0 replies; 154+ messages in thread
From: Xiaoyao Li @ 2022-02-24  9:08 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, aarcange, ak, dan.j.williams, david,
	hpa, jgross, jmattson, joro, jpoimboe, knsathya, pbonzini, sdeep,
	seanjc, tony.luck, vkuznets, wanpengli, x86, linux-kernel

On 1/24/2022 11:02 PM, Kirill A. Shutemov wrote:

> +#VE due to CPUID instruction
> +----------------------------
> +
> +In TDX guests, most of CPUID leaf/sub-leaf combinations are virtualized by
> +the TDX module while some trigger #VE. Combinations of CPUID leaf/sub-leaf
> +which triggers #VE are configured by the VMM during the TD initialization
> +time (using TDH.MNG.INIT).
> +

The description is incorrect.

TDH.MNG.INIT does not configure whether a CPUID leaf/sub-leaf triggers
#VE. It configures whether some feature bits in a specific leaf/sub-leaf
are exposed to the TD guest.

Whether CPUID(leaf, sub-leaf) causes #VE is defined in the TDX spec and
is not configurable by the user or the VMM.
