LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RFC PATCH v3 00/59] KVM: X86: TDX support
@ 2021-11-25  0:19 isaku.yamahata
  2021-11-25  0:19 ` [RFC PATCH v3 01/59] x86/mktme: move out MKTME related constatnts/macro to msr-index.h isaku.yamahata
                   ` (60 more replies)
  0 siblings, 61 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata

From: Isaku Yamahata <isaku.yamahata@intel.com>

Changes from v2:
- update based on patch review
- support TDP MMU
- drop non-essential fetures (ftrace etc.) to reduce patch size

TODO:
- integrate vm type patch
- integrate unmapping user space mapping

--- 
* What's TDX?
TDX stands for Trust Domain Extensions which isolates VMs from the
virtual-machine manager (VMM)/hypervisor and any other software on the
platform. [1] For details, the specifications, [2], [3], [4], [5], [6], [7], are
available.

* Patch organization
The patch 66 is main change.  The preceding patches(1-65) The preceding
patches(01-61) are refactoring the code and introducing additional hooks.

- 01-13: They are preparations. introduce architecture constants, code
         refactoring, export symbols for following patches.
- 14-30: start to introduce the new type of VM and allow the coexistence of
         multiple type of VM. allow/disallow KVM ioctl where
         appropriate. Especially make per-system ioctl to per-VM ioctl.
- 31-38: refactoring KVM VMX/MMU and adding new hooks for Secure EPT.
- 39-54: refactoring KVM
- 55:    main patch to add "basic" support for building/running TDX.
- 56-57: TDP MMU support
- 58:    support TDX hypercall, GetQuote and SetupEventNotifyInterrupt, that
         requires qemu help
- 59:    Documentation

* Missing features
Those major features are intentionally missing from this patch series to keep
this patch series small.  They are addressed as independent patch series.

- qemu gdb stub support
- Large page support
- guest PMU support
- and more

Changes from v1:
- rebase to v5.13
- drop load/initialization of TDX module
- catch up the update of related specifications.
- rework on C-wrapper function to invoke seamcall
- various code clean up

[1] TDX specification
   https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
[2] Intel Trust Domain Extensions (Intel TDX)
   https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
[3] Intel CPU Architectural Extensions Specification
   https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
[4] Intel TDX Module 1.0 EAS
   https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf
[5] Intel TDX Loader Interface Specification
  https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
[6] Intel TDX Guest-Hypervisor Communication Interface
   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
[7] Intel TDX Virtual Firmware Design Guide
   https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.pdf
[8] intel public github
   kvm TDX branch: https://github.com/intel/tdx/tree/kvm
   TDX guest branch: https://github.com/intel/tdx/tree/guest
   qemu TDX https://github.com/intel/qemu-tdx
[9] TDVF
    https://github.com/tianocore/edk2-staging/tree/TDVF

Chao Gao (1):
  KVM: x86: Add a helper function to restore 4 host MSRs on exit to user
    space

Isaku Yamahata (9):
  x86/mktme: move out MKTME related constatnts/macro to msr-index.h
  x86/mtrr: mask out keyid bits from variable mtrr mask register
  KVM: TDX: Define TDX architectural definitions
  KVM: TDX: add a helper function for kvm to call seamcall
  KVM: TDX: Add helper functions to print TDX SEAMCALL error
  KVM: Add per-VM flag to mark read-only memory as unsupported
  KVM: x86: add per-VM flags to disable SMI/INIT/SIPI
  KVM: TDX: exit to user space on GET_QUOTE,
    SETUP_EVENT_NOTIFY_INTERRUPT
  Documentation/virtual/kvm: Add Trust Domain Extensions(TDX)

Kai Huang (3):
  KVM: x86: Add per-VM flag to disable in-kernel I/O APIC and level
    routes
  KVM: TDX: Protect private mapping related SEAMCALLs with spinlock
  KVM, x86/mmu: Support TDX private mapping for TDP MMU

Rick Edgecombe (1):
  KVM: x86: Add infrastructure for stolen GPA bits

Sean Christopherson (44):
  KVM: TDX: Add TDX "architectural" error codes
  KVM: TDX: Add C wrapper functions for TDX SEAMCALLs
  KVM: Export kvm_io_bus_read for use by TDX for PV MMIO
  KVM: Enable hardware before doing arch VM initialization
  KVM: x86: Split core of hypercall emulation to helper function
  KVM: x86: Export kvm_mmio tracepoint for use by TDX for PV MMIO
  KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot by default
  KVM: Add max_vcpus field in common 'struct kvm'
  KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs
  KVM: x86: Introduce "protected guest" concept and block disallowed
    ioctls
  KVM: x86: Add per-VM flag to disable direct IRQ injection
  KVM: x86: Add flag to disallow #MC injection / KVM_X86_SETUP_MCE
  KVM: x86: Add flag to mark TSC as immutable (for TDX)
  KVM: Add per-VM flag to disable dirty logging of memslots for TDs
  KVM: x86: Allow host-initiated WRMSR to set X2APIC regardless of CPUID
  KVM: x86: Add kvm_x86_ops .cache_gprs() and .flush_gprs()
  KVM: x86: Add support for vCPU and device-scoped KVM_MEMORY_ENCRYPT_OP
  KVM: x86: Introduce vm_teardown() hook in kvm_arch_vm_destroy()
  KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
    behavior
  KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events()
  KVM: x86: Add option to force LAPIC expiration wait
  KVM: x86: Add guest_supported_xss placholder
  KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
  KVM: x86/mmu: Ignore bits 63 and 62 when checking for "present" SPTEs
  KVM: x86/mmu: Allow non-zero init value for shadow PTE
  KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits()
  KVM: x86/mmu: Frame in support for private/inaccessible shadow pages
  KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
  KVM: x86/mmu: Allow per-VM override of the TDP max page level
  KVM: VMX: Modify NMI and INTR handlers to take intr_info as param
  KVM: VMX: Move NMI/exception handler to common helper
  KVM: VMX: Split out guts of EPT violation to common/exposed function
  KVM: VMX: Define EPT Violation architectural bits
  KVM: VMX: Define VMCS encodings for shared EPT pointer
  KVM: VMX: Add 'main.c' to wrap VMX and TDX
  KVM: VMX: Move setting of EPT MMU masks to common VT-x code
  KVM: VMX: Move register caching logic to common code
  KVM: TDX: Define TDCALL exit reason
  KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
  KVM: VMX: Add macro framework to read/write VMCS for VMs and TDs
  KVM: VMX: Move AR_BYTES encoder/decoder helpers to common.h
  KVM: VMX: MOVE GDT and IDT accessors to common code
  KVM: VMX: Move .get_interrupt_shadow() implementation to common VMX
    code
  KVM: TDX: Add "basic" support for building and running Trust Domains

Xiaoyao Li (1):
  KVM: X86: Introduce initial_tsc_khz in struct kvm_arch

 Documentation/virt/kvm/api.rst        |    9 +-
 Documentation/virt/kvm/intel-tdx.rst  |  359 ++++
 arch/arm64/include/asm/kvm_host.h     |    3 -
 arch/arm64/kvm/arm.c                  |    7 +-
 arch/arm64/kvm/vgic/vgic-init.c       |    6 +-
 arch/x86/events/intel/ds.c            |    1 +
 arch/x86/include/asm/kvm-x86-ops.h    |   11 +
 arch/x86/include/asm/kvm_host.h       |   63 +-
 arch/x86/include/asm/msr-index.h      |   16 +
 arch/x86/include/asm/vmx.h            |    6 +
 arch/x86/include/uapi/asm/kvm.h       |   60 +
 arch/x86/include/uapi/asm/vmx.h       |    7 +-
 arch/x86/kernel/cpu/intel.c           |   14 -
 arch/x86/kernel/cpu/mtrr/mtrr.c       |    9 +
 arch/x86/kvm/Makefile                 |    6 +-
 arch/x86/kvm/ioapic.c                 |    4 +
 arch/x86/kvm/irq_comm.c               |   13 +-
 arch/x86/kvm/lapic.c                  |    7 +-
 arch/x86/kvm/lapic.h                  |    2 +-
 arch/x86/kvm/mmu.h                    |   29 +-
 arch/x86/kvm/mmu/mmu.c                |  667 ++++++-
 arch/x86/kvm/mmu/mmu_internal.h       |   12 +
 arch/x86/kvm/mmu/paging_tmpl.h        |   32 +-
 arch/x86/kvm/mmu/spte.c               |   15 +-
 arch/x86/kvm/mmu/spte.h               |   51 +-
 arch/x86/kvm/mmu/tdp_iter.h           |    2 +-
 arch/x86/kvm/mmu/tdp_mmu.c            |  544 +++++-
 arch/x86/kvm/mmu/tdp_mmu.h            |   15 +-
 arch/x86/kvm/svm/svm.c                |   13 +-
 arch/x86/kvm/vmx/common.h             |  178 ++
 arch/x86/kvm/vmx/main.c               | 1152 ++++++++++++
 arch/x86/kvm/vmx/posted_intr.c        |    6 +
 arch/x86/kvm/vmx/seamcall.h           |  116 ++
 arch/x86/kvm/vmx/tdx.c                | 2437 +++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h                |  290 +++
 arch/x86/kvm/vmx/tdx_arch.h           |  239 +++
 arch/x86/kvm/vmx/tdx_errno.h          |  111 ++
 arch/x86/kvm/vmx/tdx_error.c          |   53 +
 arch/x86/kvm/vmx/tdx_ops.h            |  224 +++
 arch/x86/kvm/vmx/tdx_stubs.c          |   50 +
 arch/x86/kvm/vmx/vmenter.S            |  146 ++
 arch/x86/kvm/vmx/vmx.c                |  689 ++-----
 arch/x86/kvm/vmx/x86_ops.h            |  203 ++
 arch/x86/kvm/x86.c                    |  276 ++-
 include/linux/kvm_host.h              |    5 +
 include/uapi/linux/kvm.h              |   59 +
 tools/arch/x86/include/uapi/asm/kvm.h |   55 +
 tools/include/uapi/linux/kvm.h        |    2 +
 virt/kvm/kvm_main.c                   |   34 +-
 49 files changed, 7469 insertions(+), 839 deletions(-)
 create mode 100644 Documentation/virt/kvm/intel-tdx.rst
 create mode 100644 arch/x86/kvm/vmx/common.h
 create mode 100644 arch/x86/kvm/vmx/main.c
 create mode 100644 arch/x86/kvm/vmx/seamcall.h
 create mode 100644 arch/x86/kvm/vmx/tdx.c
 create mode 100644 arch/x86/kvm/vmx/tdx.h
 create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
 create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
 create mode 100644 arch/x86/kvm/vmx/tdx_error.c
 create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
 create mode 100644 arch/x86/kvm/vmx/tdx_stubs.c
 create mode 100644 arch/x86/kvm/vmx/x86_ops.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 01/59] x86/mktme: move out MKTME related constatnts/macro to msr-index.h
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
@ 2021-11-25  0:19 ` isaku.yamahata
  2021-11-25  0:19 ` [RFC PATCH v3 02/59] x86/mtrr: mask out keyid bits from variable mtrr mask register isaku.yamahata
                   ` (59 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata

From: Isaku Yamahata <isaku.yamahata@intel.com>

Later the constants/macro will be used by MTRR logic to reduce physical
bits to calculate MTRR mask.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/msr-index.h | 16 ++++++++++++++++
 arch/x86/kernel/cpu/intel.c      | 14 --------------
 2 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 28b207a786b1..97a41d6238f5 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -905,6 +905,22 @@
 /* Geode defined MSRs */
 #define MSR_GEODE_BUSCONT_CONF0		0x00001900
 
+/* MKTME MSRs */
+#define MSR_IA32_TME_ACTIVATE		0x00000982
+
+/* Helpers to access TME_ACTIVATE MSR */
+#define TME_ACTIVATE_LOCKED(x)		(x & 0x1)
+#define TME_ACTIVATE_ENABLED(x)		(x & 0x2)
+
+#define TME_ACTIVATE_POLICY(x)		((x >> 4) & 0xf)	/* Bits 7:4 */
+#define TME_ACTIVATE_KEYID_BITS(x)	((x >> 32) & 0xf)	/* Bits 35:32 */
+
+#define TME_ACTIVATE_POLICY_AES_XTS_128	0
+
+#define TME_ACTIVATE_CRYPTO_ALGS(x)	((x >> 48) & 0xffff)	/* Bits 63:48 */
+#define TME_ACTIVATE_CRYPTO_AES_XTS_128	1
+
+
 /* Intel VT MSRs */
 #define MSR_IA32_VMX_BASIC              0x00000480
 #define MSR_IA32_VMX_PINBASED_CTLS      0x00000481
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 8321c43554a1..df605787137b 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -492,20 +492,6 @@ static void srat_detect_node(struct cpuinfo_x86 *c)
 #endif
 }
 
-#define MSR_IA32_TME_ACTIVATE		0x982
-
-/* Helpers to access TME_ACTIVATE MSR */
-#define TME_ACTIVATE_LOCKED(x)		(x & 0x1)
-#define TME_ACTIVATE_ENABLED(x)		(x & 0x2)
-
-#define TME_ACTIVATE_POLICY(x)		((x >> 4) & 0xf)	/* Bits 7:4 */
-#define TME_ACTIVATE_POLICY_AES_XTS_128	0
-
-#define TME_ACTIVATE_KEYID_BITS(x)	((x >> 32) & 0xf)	/* Bits 35:32 */
-
-#define TME_ACTIVATE_CRYPTO_ALGS(x)	((x >> 48) & 0xffff)	/* Bits 63:48 */
-#define TME_ACTIVATE_CRYPTO_AES_XTS_128	1
-
 /* Values for mktme_status (SW only construct) */
 #define MKTME_ENABLED			0
 #define MKTME_DISABLED			1
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 02/59] x86/mtrr: mask out keyid bits from variable mtrr mask register
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
  2021-11-25  0:19 ` [RFC PATCH v3 01/59] x86/mktme: move out MKTME related constatnts/macro to msr-index.h isaku.yamahata
@ 2021-11-25  0:19 ` isaku.yamahata
  2021-11-25  0:19 ` [RFC PATCH v3 03/59] KVM: TDX: Define TDX architectural definitions isaku.yamahata
                   ` (58 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Xiaoyao Li

From: Isaku Yamahata <isaku.yamahata@intel.com>

This is a preparation for TDX support. TDX repurposes high bits of physcial
address to private key ID similarly to MKTME.
IA32_TME_ACTIVATE.MK_TME_KEYID_BITS has same meaning for both TDX
disabled/enable for compatibility.

MTRR calculates mask based on available physical address bits.  MKTME
repurpose high bit of physical address to key id for key id.  CPUID MAX_PA
remains same and the bits stolen for key id is controlled IA32_TME_ACTIVATE
MSR bit 35:32.  Because Key ID bits shouldn't affects memory cachability,
MTRR mask should exclude bits repourposed for Key ID.  It's OS
responsibility to maintain cache coherency.  detect_tme @
arch/x86/kernel/cpu/intel.c detects tme and destract it from total usable
physical bits. This patch adds same logic needed for MTRR.

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kernel/cpu/mtrr/mtrr.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
index 2746cac9d8a9..79eaf6ed20a6 100644
--- a/arch/x86/kernel/cpu/mtrr/mtrr.c
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
@@ -713,6 +713,15 @@ void __init mtrr_bp_init(void)
 			     boot_cpu_data.x86_stepping == 0x4))
 				phys_addr = 36;
 
+			if (boot_cpu_has(X86_FEATURE_TME)) {
+				u64 tme_activate;
+
+				rdmsrl(MSR_IA32_TME_ACTIVATE, tme_activate);
+				if (TME_ACTIVATE_LOCKED(tme_activate) &&
+				    TME_ACTIVATE_ENABLED(tme_activate)) {
+					phys_addr -= TME_ACTIVATE_KEYID_BITS(tme_activate);
+				}
+			}
 			size_or_mask = SIZE_OR_MASK_BITS(phys_addr);
 			size_and_mask = ~size_or_mask & 0xfffff00000ULL;
 		} else if (boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR &&
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 03/59] KVM: TDX: Define TDX architectural definitions
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
  2021-11-25  0:19 ` [RFC PATCH v3 01/59] x86/mktme: move out MKTME related constatnts/macro to msr-index.h isaku.yamahata
  2021-11-25  0:19 ` [RFC PATCH v3 02/59] x86/mtrr: mask out keyid bits from variable mtrr mask register isaku.yamahata
@ 2021-11-25  0:19 ` isaku.yamahata
  2021-11-25  0:19 ` [RFC PATCH v3 04/59] KVM: TDX: Add TDX "architectural" error codes isaku.yamahata
                   ` (57 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Kai Huang, Xiaoyao Li,
	Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Define architectural definitions for KVM to issue TDX SEAMCALLs.

Structures and values that are architecturally defined in the TDX module
specifications the chapter of ABI Reference

Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx_arch.h | 219 ++++++++++++++++++++++++++++++++++++
 1 file changed, 219 insertions(+)
 create mode 100644 arch/x86/kvm/vmx/tdx_arch.h

diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
new file mode 100644
index 000000000000..f57f9bfb7007
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -0,0 +1,219 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* architectural constants/data definitions for TDX SEAMCALLs */
+
+#ifndef __KVM_X86_TDX_ARCH_H
+#define __KVM_X86_TDX_ARCH_H
+
+#include <linux/types.h>
+
+/*
+ * TDX SEAMCALL API function leaves
+ */
+#define TDH_VP_ENTER			0
+#define TDH_MNG_ADDCX			1
+#define TDH_MEM_PAGE_ADD		2
+#define TDH_MEM_SEPT_ADD		3
+#define TDH_VP_ADDCX			4
+#define TDH_MEM_PAGE_AUG		6
+#define TDH_MEM_RANGE_BLOCK		7
+#define TDH_MNG_KEY_CONFIG		8
+#define TDH_MNG_CREATE			9
+#define TDH_VP_CREATE			10
+#define TDH_MNG_RD			11
+#define TDH_MEM_RD			12
+#define TDH_MNG_WR			13
+#define TDH_MEM_WR			14
+#define TDH_MEM_PAGE_DEMOTE		15
+#define TDH_MR_EXTEND			16
+#define TDH_MR_FINALIZE			17
+#define TDH_VP_FLUSH			18
+#define TDH_MNG_VPFLUSHDONE		19
+#define TDH_MNG_KEY_FREEID		20
+#define TDH_MNG_INIT			21
+#define TDH_VP_INIT			22
+#define TDH_MEM_PAGE_PROMOTE		23
+#define TDH_PHYMEM_PAGE_RDMD		24
+#define TDH_MEM_SEPT_RD			25
+#define TDH_VP_RD			26
+#define TDH_MNG_KEY_RECLAIMID		27
+#define TDH_PHYMEM_PAGE_RECLAIM		28
+#define TDH_MEM_PAGE_REMOVE		29
+#define TDH_MEM_SEPT_REMOVE		30
+#define TDH_MEM_TRACK			38
+#define TDH_MEM_RANGE_UNBLOCK		39
+#define TDH_PHYMEM_CACHE_WB		40
+#define TDH_PHYMEM_PAGE_WBINVD		41
+#define TDH_MEM_SEPT_WR			42
+#define TDH_VP_WR			43
+#define TDH_SYS_LP_SHUTDOWN		44
+
+#define TDG_VP_VMCALL_GET_TD_VM_CALL_INFO		0x10000
+#define TDG_VP_VMCALL_MAP_GPA				0x10001
+#define TDG_VP_VMCALL_GET_QUOTE				0x10002
+#define TDG_VP_VMCALL_REPORT_FATAL_ERROR		0x10003
+#define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT	0x10004
+
+/* TDX control structure (TDR/TDCS/TDVPS) field access codes */
+#define TDX_CLASS_SHIFT			56
+#define TDX_FIELD_MASK			GENMASK_ULL(31, 0)
+
+#define BUILD_TDX_FIELD(class, field)	\
+	(((u64)(class) << TDX_CLASS_SHIFT) | ((u64)(field) & TDX_FIELD_MASK))
+
+/* @field is the VMCS field encoding */
+#define TDVPS_VMCS(field)		BUILD_TDX_FIELD(0, (field))
+
+/*
+ * @offset is the offset (in bytes) from the beginning of the architectural
+ * virtual APIC page.
+ */
+#define TDVPS_APIC(offset)		BUILD_TDX_FIELD(1, (offset))
+
+/* @gpr is the index of a general purpose register, e.g. eax=0 */
+#define TDVPS_GPR(gpr)			BUILD_TDX_FIELD(16, (gpr))
+
+#define TDVPS_DR(dr)			BUILD_TDX_FIELD(17, (0 + (dr)))
+
+enum tdx_guest_other_state {
+	TD_VCPU_XCR0 = 32,
+	TD_VCPU_IWK_ENCKEY0 = 64,
+	TD_VCPU_IWK_ENCKEY1,
+	TD_VCPU_IWK_ENCKEY2,
+	TD_VCPU_IWK_ENCKEY3,
+	TD_VCPU_IWK_INTKEY0 = 68,
+	TD_VCPU_IWK_INTKEY1,
+	TD_VCPU_IWK_FLAGS = 70,
+};
+
+/* @field is any of enum tdx_guest_other_state */
+#define TDVPS_STATE(field)		BUILD_TDX_FIELD(17, (field))
+
+/* @msr is the MSR index */
+#define TDVPS_MSR(msr)			BUILD_TDX_FIELD(19, (msr))
+
+/* Management class fields */
+enum tdx_guest_management {
+	TD_VCPU_PEND_NMI = 11,
+};
+
+/* @field is any of enum tdx_guest_management */
+#define TDVPS_MANAGEMENT(field)		BUILD_TDX_FIELD(32, (field))
+
+enum tdx_tdcs_execution_control {
+	TD_TDCS_EXEC_TSC_OFFSET = 10,
+};
+
+/* @field is any of enum tdx_tdcs_execution_control */
+#define TDCS_EXEC(field)		BUILD_TDX_FIELD(17, (field))
+
+/*
+ * Hard code those values for simplicity and efficiency.  They are constant for
+ * the TDX 1.0.  TDH.SYS.INIT enumerates those values.  On kvm initialization,
+ * do sanity checks so that they are correct values.
+ *
+ * TODO: If they are bumped in future, increase those value.
+ *       (or make them full runtime values)
+ */
+#define TDX_NR_TDCX_PAGES		4
+#define TDX_NR_TDVPX_PAGES		5
+#define TDX_MAX_NR_CPUID_CONFIGS	6
+
+#define TDX_EXTENDMR_CHUNKSIZE		256
+
+struct tdx_cpuid_value {
+	u32 eax;
+	u32 ebx;
+	u32 ecx;
+	u32 edx;
+} __packed;
+
+struct tdx_cpuid_config {
+	u32 leaf;
+	u32 sub_leaf;
+	u32 eax;
+	u32 ebx;
+	u32 ecx;
+	u32 edx;
+} __packed;
+
+struct tdsysinfo_struct {
+	/* TDX-SEAM Module Info */
+	u32 attributes;
+	u32 vendor_id;
+	u32 build_date;
+	u16 build_num;
+	u16 minor_version;
+	u16 major_version;
+	u8 reserved0[14];
+	/* Memory Info */
+	u16 max_tdmrs;
+	u16 max_reserved_per_tdmr;
+	u16 pamt_entry_size;
+	u8 reserved1[10];
+	/* Control Struct Info */
+	u16 tdcs_base_size;
+	u8 reserved2[2];
+	u16 tdvps_base_size;
+	u8 tdvps_xfam_dependent_size;
+	u8 reserved3[9];
+	/* TD Capabilities */
+	u64 attributes_fixed0;
+	u64 attributes_fixed1;
+	u64 xfam_fixed0;
+	u64 xfam_fixed1;
+	u8 reserved4[32];
+	u32 num_cpuid_config;
+	union {
+		struct tdx_cpuid_config cpuid_configs[0];
+		u8 reserved5[892];
+	};
+} __packed __aligned(1024);
+
+#define TDX_TD_ATTRIBUTE_DEBUG		BIT_ULL(0)
+#define TDX_TD_ATTRIBUTE_PKS		BIT_ULL(30)
+#define TDX_TD_ATTRIBUTE_KL		BIT_ULL(31)
+#define TDX_TD_ATTRIBUTE_PERFMON	BIT_ULL(63)
+
+#define TDX_TD_XFAM_LBR			BIT_ULL(15)
+#define TDX_TD_XFAM_AMX			(BIT_ULL(17) | BIT_ULL(18))
+
+/*
+ * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
+ */
+struct td_params {
+	u64 attributes;
+	u64 xfam;
+	u32 max_vcpus;
+	u32 reserved0;
+
+	u64 eptp_controls;
+	u64 exec_controls;
+	u16 tsc_frequency;
+	u8  reserved1[38];
+
+	u64 mrconfigid[6];
+	u64 mrowner[6];
+	u64 mrownerconfig[6];
+	u64 reserved2[4];
+
+	union {
+		struct tdx_cpuid_value cpuid_values[0];
+		u8 reserved3[768];
+	};
+} __packed __aligned(1024);
+
+/* Guest uses MAX_PA for GPAW when set. */
+#define TDX_EXEC_CONTROL_MAX_GPAW      BIT_ULL(0)
+
+/*
+ * TDX requires the frequency to be defined in units of 25MHz, which is the
+ * frequency of the core crystal clock on TDX-capable platforms, i.e. TDX-SEAM
+ * can only program frequencies that are multiples of 25MHz.  The frequency
+ * must be between 1ghz and 10ghz (inclusive).
+ */
+#define TDX_TSC_KHZ_TO_25MHZ(tsc_in_khz)	((tsc_in_khz) / (25 * 1000))
+#define TDX_TSC_25MHZ_TO_KHZ(tsc_in_25mhz)	((tsc_in_25mhz) * (25 * 1000))
+#define TDX_MIN_TSC_FREQUENCY_KHZ		(100 * 1000)
+#define TDX_MAX_TSC_FREQUENCY_KHZ		(10 * 1000 * 1000)
+
+#endif /* __KVM_X86_TDX_ARCH_H */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 04/59] KVM: TDX: Add TDX "architectural" error codes
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (2 preceding siblings ...)
  2021-11-25  0:19 ` [RFC PATCH v3 03/59] KVM: TDX: Define TDX architectural definitions isaku.yamahata
@ 2021-11-25  0:19 ` isaku.yamahata
  2021-11-25  0:19 ` [RFC PATCH v3 05/59] KVM: TDX: add a helper function for kvm to call seamcall isaku.yamahata
                   ` (56 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add error codes for the TDX SEAMCALLs both for TDX VMM side and TDX guest
side.

TDX SEAMCALL uses bits 31:0 to return more information, so these error
codes will only exactly match RAX[63:32].  Error codes for TDG.VP.VMCALL is
defined by TDX Guest-Host-Communication interface spec.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx_errno.h | 111 +++++++++++++++++++++++++++++++++++
 1 file changed, 111 insertions(+)
 create mode 100644 arch/x86/kvm/vmx/tdx_errno.h

diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
new file mode 100644
index 000000000000..395dd3099254
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -0,0 +1,111 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* architectural status code for SEAMCALL */
+
+#ifndef __KVM_X86_TDX_ERRNO_H
+#define __KVM_X86_TDX_ERRNO_H
+
+#define TDX_SEAMCALL_STATUS_MASK		0xFFFFFFFF00000000ULL
+
+/*
+ * TDX SEAMCALL Status Codes (returned in RAX)
+ */
+#define TDX_SUCCESS				0x0000000000000000ULL
+#define TDX_NON_RECOVERABLE_VCPU		0x4000000100000000ULL
+#define TDX_NON_RECOVERABLE_TD			0x4000000200000000ULL
+#define TDX_INTERRUPTED_RESUMABLE		0x8000000300000000ULL
+#define TDX_INTERRUPTED_RESTARTABLE		0x8000000400000000ULL
+#define TDX_NON_RECOVERABLE_TD_FATAL		0x4000000500000000ULL
+#define TDX_INVALID_RESUMPTION			0xC000000600000000ULL
+#define TDX_NON_RECOVERABLE_TD_NO_APIC		0xC000000700000000ULL
+#define TDX_OPERAND_INVALID			0xC000010000000000ULL
+#define TDX_OPERAND_ADDR_RANGE_ERROR		0xC000010100000000ULL
+#define TDX_OPERAND_BUSY			0x8000020000000000ULL
+#define TDX_PREVIOUS_TLB_EPOCH_BUSY		0x8000020100000000ULL
+#define TDX_SYS_BUSY				0x8000020200000000ULL
+#define TDX_PAGE_METADATA_INCORRECT		0xC000030000000000ULL
+#define TDX_PAGE_ALREADY_FREE			0x0000030100000000ULL
+#define TDX_PAGE_NOT_OWNED_BY_TD		0xC000030200000000ULL
+#define TDX_PAGE_NOT_FREE			0xC000030300000000ULL
+#define TDX_TD_ASSOCIATED_PAGES_EXIST		0xC000040000000000ULL
+#define TDX_SYSINIT_NOT_PENDING			0xC000050000000000ULL
+#define TDX_SYSINIT_NOT_DONE			0xC000050100000000ULL
+#define TDX_SYSINITLP_NOT_DONE			0xC000050200000000ULL
+#define TDX_SYSINITLP_DONE			0xC000050300000000ULL
+#define TDX_SYS_NOT_READY			0xC000050500000000ULL
+#define TDX_SYS_SHUTDOWN			0xC000050600000000ULL
+#define TDX_SYSCONFIG_NOT_DONE			0xC000050700000000ULL
+#define TDX_TD_NOT_INITIALIZED			0xC000060000000000ULL
+#define TDX_TD_INITIALIZED			0xC000060100000000ULL
+#define TDX_TD_NOT_FINALIZED			0xC000060200000000ULL
+#define TDX_TD_FINALIZED			0xC000060300000000ULL
+#define TDX_TD_FATAL				0xC000060400000000ULL
+#define TDX_TD_NON_DEBUG			0xC000060500000000ULL
+#define TDX_LIFECYCLE_STATE_INCORRECT		0xC000060700000000ULL
+#define TDX_TDCX_NUM_INCORRECT			0xC000061000000000ULL
+#define TDX_VCPU_STATE_INCORRECT		0xC000070000000000ULL
+#define TDX_VCPU_ASSOCIATED			0x8000070100000000ULL
+#define TDX_VCPU_NOT_ASSOCIATED			0x8000070200000000ULL
+#define TDX_TDVPX_NUM_INCORRECT			0xC000070300000000ULL
+#define TDX_NO_VALID_VE_INFO			0xC000070400000000ULL
+#define TDX_MAX_VCPUS_EXCEEDED			0xC000070500000000ULL
+#define TDX_TSC_ROLLBACK			0xC000070600000000ULL
+#define TDX_FIELD_NOT_WRITABLE			0xC000072000000000ULL
+#define TDX_FIELD_NOT_READABLE			0xC000072100000000ULL
+#define TDX_TD_VMCS_FIELD_NOT_INITIALIZED	0xC000073000000000ULL
+#define TDX_KEY_GENERATION_FAILED		0x8000080000000000ULL
+#define TDX_TD_KEYS_NOT_CONFIGURED		0x8000081000000000ULL
+#define TDX_KEY_STATE_INCORRECT			0xC000081100000000ULL
+#define TDX_KEY_CONFIGURED			0x0000081500000000ULL
+#define TDX_WBCACHE_NOT_COMPLETE		0x8000081700000000ULL
+#define TDX_HKID_NOT_FREE			0xC000082000000000ULL
+#define TDX_NO_HKID_READY_TO_WBCACHE		0x0000082100000000ULL
+#define TDX_WBCACHE_RESUME_ERROR		0xC000082300000000ULL
+#define TDX_FLUSHVP_NOT_DONE			0x8000082400000000ULL
+#define TDX_NUM_ACTIVATED_HKIDS_NOT_SUPPORTED	0xC000082500000000ULL
+#define TDX_INCORRECT_CPUID_VALUE		0xC000090000000000ULL
+#define TDX_BOOT_NT4_SET			0xC000090100000000ULL
+#define TDX_INCONSISTENT_CPUID_FIELD		0xC000090200000000ULL
+#define TDX_CPUID_LEAF_1F_FORMAT_UNRECOGNIZED	0xC000090400000000ULL
+#define TDX_INVALID_WBINVD_SCOPE		0xC000090500000000ULL
+#define TDX_INVALID_PKG_ID			0xC000090600000000ULL
+#define TDX_CPUID_LEAF_NOT_SUPPORTED		0xC000090800000000ULL
+#define TDX_SMRR_NOT_LOCKED			0xC000091000000000ULL
+#define TDX_INVALID_SMRR_CONFIGURATION		0xC000091100000000ULL
+#define TDX_SMRR_OVERLAPS_CMR			0xC000091200000000ULL
+#define TDX_SMRR_LOCK_NOT_SUPPORTED		0xC000091300000000ULL
+#define TDX_SMRR_NOT_SUPPORTED			0xC000091400000000ULL
+#define TDX_INCONSISTENT_MSR			0xC000092000000000ULL
+#define TDX_INCORRECT_MSR_VALUE			0xC000092100000000ULL
+#define TDX_SEAMREPORT_NOT_AVAILABLE		0xC000093000000000ULL
+#define TDX_PERF_COUNTERS_ARE_PEBS_ENABLED	0x8000094000000000ULL
+#define TDX_INVALID_TDMR			0xC0000A0000000000ULL
+#define TDX_NON_ORDERED_TDMR			0xC0000A0100000000ULL
+#define TDX_TDMR_OUTSIDE_CMRS			0xC0000A0200000000ULL
+#define TDX_TDMR_ALREADY_INITIALIZED		0x00000A0300000000ULL
+#define TDX_INVALID_PAMT			0xC0000A1000000000ULL
+#define TDX_PAMT_OUTSIDE_CMRS			0xC0000A1100000000ULL
+#define TDX_PAMT_OVERLAP			0xC0000A1200000000ULL
+#define TDX_INVALID_RESERVED_IN_TDMR		0xC0000A2000000000ULL
+#define TDX_NON_ORDERED_RESERVED_IN_TDMR	0xC0000A2100000000ULL
+#define TDX_CMR_LIST_INVALID			0xC0000A2200000000ULL
+#define TDX_EPT_WALK_FAILED			0xC0000B0000000000ULL
+#define TDX_EPT_ENTRY_FREE			0xC0000B0100000000ULL
+#define TDX_EPT_ENTRY_NOT_FREE			0xC0000B0200000000ULL
+#define TDX_EPT_ENTRY_NOT_PRESENT		0xC0000B0300000000ULL
+#define TDX_EPT_ENTRY_NOT_LEAF			0xC0000B0400000000ULL
+#define TDX_EPT_ENTRY_LEAF			0xC0000B0500000000ULL
+#define TDX_GPA_RANGE_NOT_BLOCKED		0xC0000B0600000000ULL
+#define TDX_GPA_RANGE_ALREADY_BLOCKED		0x00000B0700000000ULL
+#define TDX_TLB_TRACKING_NOT_DONE		0xC0000B0800000000ULL
+#define TDX_EPT_INVALID_PROMOTE_CONDITIONS	0xC0000B0900000000ULL
+#define TDX_PAGE_ALREADY_ACCEPTED		0x00000B0A00000000ULL
+#define TDX_PAGE_SIZE_MISMATCH			0xC0000B0B00000000ULL
+
+/*
+ * TDG.VP.VMCALL Status Codes (returned in R10)
+ */
+#define TDG_VP_VMCALL_SUCCESS			0x0000000000000000ULL
+#define TDG_VP_VMCALL_INVALID_OPERAND		0x8000000000000000ULL
+#define TDG_VP_VMCALL_TDREPORT_FAILED		0x8000000000000001ULL
+
+#endif /* __KVM_X86_TDX_ERRNO_H */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 05/59] KVM: TDX: add a helper function for kvm to call seamcall
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (3 preceding siblings ...)
  2021-11-25  0:19 ` [RFC PATCH v3 04/59] KVM: TDX: Add TDX "architectural" error codes isaku.yamahata
@ 2021-11-25  0:19 ` isaku.yamahata
  2021-11-25  0:19 ` [RFC PATCH v3 06/59] KVM: TDX: Add C wrapper functions for TDX SEAMCALLs isaku.yamahata
                   ` (55 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Xiaoyao Li

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add a helper function for kvm to call seamcall and a helper macro to check
its return value.  The later patches will use them.

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/seamcall.h | 113 ++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)
 create mode 100644 arch/x86/kvm/vmx/seamcall.h

diff --git a/arch/x86/kvm/vmx/seamcall.h b/arch/x86/kvm/vmx/seamcall.h
new file mode 100644
index 000000000000..f27e9d27137d
--- /dev/null
+++ b/arch/x86/kvm/vmx/seamcall.h
@@ -0,0 +1,113 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_VMX_SEAMCALL_H
+#define __KVM_VMX_SEAMCALL_H
+
+#include <asm/asm.h>
+
+#ifdef CONFIG_INTEL_TDX_HOST
+
+#ifdef __ASSEMBLER__
+
+.macro seamcall
+	.byte 0x66, 0x0f, 0x01, 0xcf
+.endm
+
+#else
+
+/*
+ * TDX extended return:
+ * Some of The "TDX module" SEAMCALLs return extended values (which are function
+ * leaf specific) in registers in addition to the completion status code in
+ %rax.
+ */
+struct tdx_ex_ret {
+	union {
+		struct {
+			u64 rcx;
+			u64 rdx;
+			u64 r8;
+			u64 r9;
+			u64 r10;
+			u64 r11;
+		} regs;
+		/* TDH_MNG_INIT returns CPUID info on error. */
+		struct {
+			u32 leaf;
+			u32 subleaf;
+		} mng_init;
+		/* Functions that walk SEPT */
+		struct {
+			u64 septe;
+			struct {
+				u64 level		:3;
+				u64 sept_reserved_0	:5;
+				u64 state		:8;
+				u64 sept_reserved_1	:48;
+			};
+		} sept_walk;
+		/* TDH_MNG_{RD,WR} return the field value. */
+		struct {
+			u64 field_val;
+		} mng_rdwr;
+		/* TDH_MEM_{RD,WR} return the error info and value. */
+		struct {
+			u64 ext_err_info_1;
+			u64 ext_err_info_2;
+			u64 mem_val;
+		} mem_rdwr;
+		/* TDH_PHYMEM_PAGE_RDMD and TDH_PHYMEM_PAGE_RECLAIM return page metadata. */
+		struct {
+			u64 page_type;
+			u64 owner;
+			u64 page_size;
+		} phymem_page_md;
+	};
+};
+
+static inline u64 seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 r10,
+			struct tdx_ex_ret *ex)
+{
+	register unsigned long r8_in asm("r8");
+	register unsigned long r9_in asm("r9");
+	register unsigned long r10_in asm("r10");
+	register unsigned long r8_out asm("r8");
+	register unsigned long r9_out asm("r9");
+	register unsigned long r10_out asm("r10");
+	register unsigned long r11_out asm("r11");
+	struct tdx_ex_ret dummy;
+	u64 ret;
+
+	if (!ex)
+		/* The following inline assembly requires non-NULL ex. */
+		ex = &dummy;
+
+	/*
+	 * Because the TDX module is known to be already initialized, seamcall
+	 * instruction should always succeed without exceptions.  Don't check
+	 * the instruction error with CF=1 for the availability of the TDX
+	 * module.
+	 */
+	r8_in = r8;
+	r9_in = r9;
+	r10_in = r10;
+	asm volatile (
+		".byte 0x66, 0x0f, 0x01, 0xcf\n\t"	/* seamcall instruction */
+		: ASM_CALL_CONSTRAINT, "=a"(ret),
+		  "=c"(ex->regs.rcx), "=d"(ex->regs.rdx),
+		  "=r"(r8_out), "=r"(r9_out), "=r"(r10_out), "=r"(r11_out)
+		: "a"(op), "c"(rcx), "d"(rdx),
+		  "r"(r8_in), "r"(r9_in), "r"(r10_in)
+		: "cc", "memory");
+	ex->regs.r8 = r8_out;
+	ex->regs.r9 = r9_out;
+	ex->regs.r10 = r10_out;
+	ex->regs.r11 = r11_out;
+
+	return ret;
+}
+
+#endif /* !__ASSEMBLER__ */
+
+#endif	/* CONFIG_INTEL_TDX_HOST */
+
+#endif /* __KVM_VMX_SEAMCALL_H */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 06/59] KVM: TDX: Add C wrapper functions for TDX SEAMCALLs
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (4 preceding siblings ...)
  2021-11-25  0:19 ` [RFC PATCH v3 05/59] KVM: TDX: add a helper function for kvm to call seamcall isaku.yamahata
@ 2021-11-25  0:19 ` isaku.yamahata
  2021-11-25  0:19 ` [RFC PATCH v3 07/59] KVM: TDX: Add helper functions to print TDX SEAMCALL error isaku.yamahata
                   ` (54 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson, Kai Huang,
	Xiaoyao Li

From: Sean Christopherson <sean.j.christopherson@intel.com>

TDX SEAMCALL interface is defined in the TDX module specification.  Define
C wrapper functions for SEAMCALLs which the later patches will use.

Co-developed-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx_ops.h | 210 +++++++++++++++++++++++++++++++++++++
 1 file changed, 210 insertions(+)
 create mode 100644 arch/x86/kvm/vmx/tdx_ops.h

diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
new file mode 100644
index 000000000000..87ed67fd2715
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -0,0 +1,210 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* constants/data definitions for TDX SEAMCALLs */
+
+#ifndef __KVM_X86_TDX_OPS_H
+#define __KVM_X86_TDX_OPS_H
+
+#include <linux/compiler.h>
+
+#include <asm/asm.h>
+#include <asm/kvm_host.h>
+
+#include "seamcall.h"
+
+#ifdef CONFIG_INTEL_TDX_HOST
+
+static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
+{
+	return seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
+			    struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, 0, ex);
+}
+
+static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
+			    struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0, 0, ex);
+}
+
+static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
+{
+	return seamcall(TDH_VP_ADDCX, addr, tdvpr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
+			    struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, 0, ex);
+}
+
+static inline u64 tdh_mem_range_block(hpa_t tdr, gpa_t gpa, int level,
+			  struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_MEM_RANGE_BLOCK, gpa | level, tdr, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_mng_key_config(hpa_t tdr)
+{
+	return seamcall(TDH_MNG_KEY_CONFIG, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
+{
+	return seamcall(TDH_MNG_CREATE, tdr, hkid, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_create(hpa_t tdr, hpa_t tdvpr)
+{
+	return seamcall(TDH_VP_CREATE, tdvpr, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_rd(hpa_t tdr, u64 field, struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_MNG_RD, tdr, field, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_mng_wr(hpa_t tdr, u64 field, u64 val, u64 mask,
+			  struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_MNG_WR, tdr, field, val, mask, 0, ex);
+}
+
+static inline u64 tdh_mem_rd(hpa_t addr, struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_MEM_RD, addr, 0, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_mem_wr(hpa_t addr, u64 val, struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_MEM_WR, addr, val, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_mem_page_demote(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
+			       struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_MEM_PAGE_DEMOTE, gpa | level, tdr, page, 0, 0, ex);
+}
+
+static inline u64 tdh_mr_extend(hpa_t tdr, gpa_t gpa, struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_MR_EXTEND, gpa, tdr, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_mr_finalize(hpa_t tdr)
+{
+	return seamcall(TDH_MR_FINALIZE, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_flush(hpa_t tdvpr)
+{
+	return seamcall(TDH_VP_FLUSH, tdvpr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_vpflushdone(hpa_t tdr)
+{
+	return seamcall(TDH_MNG_VPFLUSHDONE, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_key_freeid(hpa_t tdr)
+{
+	return seamcall(TDH_MNG_KEY_FREEID, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_init(hpa_t tdr, hpa_t td_params, struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_MNG_INIT, tdr, td_params, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_vp_init(hpa_t tdvpr, u64 rcx)
+{
+	return seamcall(TDH_VP_INIT, tdvpr, rcx, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_page_promote(hpa_t tdr, gpa_t gpa, int level,
+				struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_MEM_PAGE_PROMOTE, gpa | level, tdr, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_phymem_page_rdmd(hpa_t page, struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_PHYMEM_PAGE_RDMD, page, 0, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_mem_sept_rd(hpa_t tdr, gpa_t gpa, int level,
+			   struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_MEM_SEPT_RD, gpa | level, tdr, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_vp_rd(hpa_t tdvpr, u64 field, struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_VP_RD, tdvpr, field, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_mng_key_reclaimid(hpa_t tdr)
+{
+	return seamcall(TDH_MNG_KEY_RECLAIMID, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_phymem_page_reclaim(hpa_t page, struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_PHYMEM_PAGE_RECLAIM, page, 0, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
+				struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_MEM_PAGE_REMOVE, gpa | level, tdr, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_mem_sept_remove(hpa_t tdr, gpa_t gpa, int level,
+			       struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_MEM_SEPT_REMOVE, gpa | level, tdr, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_sys_lp_shutdown(void)
+{
+	return seamcall(TDH_SYS_LP_SHUTDOWN, 0, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_track(hpa_t tdr)
+{
+	return seamcall(TDH_MEM_TRACK, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_range_unblock(hpa_t tdr, gpa_t gpa, int level,
+			    struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_MEM_RANGE_UNBLOCK, gpa | level, tdr, 0, 0, 0, ex);
+}
+
+static inline u64 tdh_phymem_cache_wb(bool resume)
+{
+	return seamcall(TDH_PHYMEM_CACHE_WB, resume ? 1 : 0, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_phymem_page_wbinvd(hpa_t page)
+{
+	return seamcall(TDH_PHYMEM_PAGE_WBINVD, page, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_sept_wr(hpa_t tdr, gpa_t gpa, int level, u64 val,
+			   struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_MEM_SEPT_WR, gpa | level, tdr, val, 0, 0, ex);
+}
+
+static inline u64 tdh_vp_wr(hpa_t tdvpr, u64 field, u64 val, u64 mask,
+			  struct tdx_ex_ret *ex)
+{
+	return seamcall(TDH_VP_WR, tdvpr, field, val, mask, 0, ex);
+}
+#endif /* CONFIG_INTEL_TDX_HOST */
+
+#endif /* __KVM_X86_TDX_OPS_H */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 07/59] KVM: TDX: Add helper functions to print TDX SEAMCALL error
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (5 preceding siblings ...)
  2021-11-25  0:19 ` [RFC PATCH v3 06/59] KVM: TDX: Add C wrapper functions for TDX SEAMCALLs isaku.yamahata
@ 2021-11-25  0:19 ` isaku.yamahata
  2021-11-25  0:19 ` [RFC PATCH v3 08/59] KVM: Export kvm_io_bus_read for use by TDX for PV MMIO isaku.yamahata
                   ` (53 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add helper functions to print out errors from the TDX module in an uniform
way.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/Makefile        |  1 +
 arch/x86/kvm/vmx/seamcall.h  |  3 ++
 arch/x86/kvm/vmx/tdx_error.c | 53 ++++++++++++++++++++++++++++++++++++
 3 files changed, 57 insertions(+)
 create mode 100644 arch/x86/kvm/vmx/tdx_error.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 75dfd27b6e8a..c7eceff49e67 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -29,6 +29,7 @@ kvm-$(CONFIG_KVM_XEN)	+= xen.o
 kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
 			   vmx/evmcs.o vmx/nested.o vmx/posted_intr.o
 kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
+kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx_error.o
 
 kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
 
diff --git a/arch/x86/kvm/vmx/seamcall.h b/arch/x86/kvm/vmx/seamcall.h
index f27e9d27137d..5fa948054be6 100644
--- a/arch/x86/kvm/vmx/seamcall.h
+++ b/arch/x86/kvm/vmx/seamcall.h
@@ -106,6 +106,9 @@ static inline u64 seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 r10,
 	return ret;
 }
 
+void pr_tdx_ex_ret_info(u64 op, u64 error_code, const struct tdx_ex_ret *ex_ret);
+void pr_tdx_error(u64 op, u64 error_code, const struct tdx_ex_ret *ex_ret);
+
 #endif /* !__ASSEMBLER__ */
 
 #endif	/* CONFIG_INTEL_TDX_HOST */
diff --git a/arch/x86/kvm/vmx/tdx_error.c b/arch/x86/kvm/vmx/tdx_error.c
new file mode 100644
index 000000000000..1a4f91397758
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_error.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0
+/* functions to record TDX SEAMCALL error */
+
+#include <linux/kernel.h>
+#include <linux/bug.h>
+
+#include "tdx_errno.h"
+#include "tdx_arch.h"
+#include "tdx_ops.h"
+
+static const char * const TDX_SEPT_ENTRY_STATES[] = {
+	"SEPT_FREE",
+	"SEPT_BLOCKED",
+	"SEPT_PENDING",
+	"SEPT_PENDING_BLOCKED",
+	"SEPT_PRESENT"
+};
+
+void pr_tdx_ex_ret_info(u64 op, u64 error_code, const struct tdx_ex_ret *ex_ret)
+{
+	if (WARN_ON(!ex_ret))
+		return;
+
+	switch (error_code & TDX_SEAMCALL_STATUS_MASK) {
+	case TDX_EPT_WALK_FAILED: {
+		const char *state;
+
+		if (ex_ret->sept_walk.state >= ARRAY_SIZE(TDX_SEPT_ENTRY_STATES))
+			state = "Invalid";
+		else
+			state = TDX_SEPT_ENTRY_STATES[ex_ret->sept_walk.state];
+
+		pr_err("Secure EPT walk error: SEPTE 0x%llx, level %d, %s\n",
+			ex_ret->sept_walk.septe, ex_ret->sept_walk.level, state);
+		break;
+	}
+	default:
+		/* TODO: print only meaningful registers depending on op */
+		pr_err("RCX 0x%llx, RDX 0x%llx, R8 0x%llx, R9 0x%llx, "
+		       "R10 0x%llx, R11 0x%llx\n",
+			ex_ret->regs.rcx, ex_ret->regs.rdx, ex_ret->regs.r8,
+			ex_ret->regs.r9, ex_ret->regs.r10, ex_ret->regs.r11);
+		break;
+	}
+}
+
+void pr_tdx_error(u64 op, u64 error_code, const struct tdx_ex_ret *ex_ret)
+{
+	pr_err_ratelimited("SEAMCALL[0x%llx] failed: 0x%llx\n",
+			op, error_code);
+	if (ex_ret)
+		pr_tdx_ex_ret_info(op, error_code, ex_ret);
+}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 08/59] KVM: Export kvm_io_bus_read for use by TDX for PV MMIO
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (6 preceding siblings ...)
  2021-11-25  0:19 ` [RFC PATCH v3 07/59] KVM: TDX: Add helper functions to print TDX SEAMCALL error isaku.yamahata
@ 2021-11-25  0:19 ` isaku.yamahata
  2021-11-25 17:14   ` Thomas Gleixner
  2021-11-25  0:19 ` [RFC PATCH v3 09/59] KVM: Enable hardware before doing arch VM initialization isaku.yamahata
                   ` (52 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Later kvm_intel.ko will use it.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 virt/kvm/kvm_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d31724500501..d8e799eeef8d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4994,6 +4994,7 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
 	r = __kvm_io_bus_read(vcpu, bus, &range, val);
 	return r < 0 ? r : 0;
 }
+EXPORT_SYMBOL_GPL(kvm_io_bus_read);
 
 /* Caller must hold slots_lock. */
 int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 09/59] KVM: Enable hardware before doing arch VM initialization
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (7 preceding siblings ...)
  2021-11-25  0:19 ` [RFC PATCH v3 08/59] KVM: Export kvm_io_bus_read for use by TDX for PV MMIO isaku.yamahata
@ 2021-11-25  0:19 ` isaku.yamahata
  2021-11-25 19:02   ` Thomas Gleixner
  2021-11-25  0:19 ` [RFC PATCH v3 10/59] KVM: x86: Split core of hypercall emulation to helper function isaku.yamahata
                   ` (51 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Swap the order of hardware_enable_all() and kvm_arch_init_vm() to
accommodate Intel's TDX, which needs VMX to be enabled during VM init in
order to make SEAMCALLs.

This also provides consistent ordering between kvm_create_vm() and
kvm_destroy_vm() with respect to calling kvm_arch_destroy_vm() and
hardware_disable_all().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 virt/kvm/kvm_main.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d8e799eeef8d..8544e92db2da 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1065,7 +1065,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
 		struct kvm_memslots *slots = kvm_alloc_memslots();
 
 		if (!slots)
-			goto out_err_no_arch_destroy_vm;
+			goto out_err_no_disable;
 		/* Generations must be different for each address space. */
 		slots->generation = i;
 		rcu_assign_pointer(kvm->memslots[i], slots);
@@ -1075,19 +1075,19 @@ static struct kvm *kvm_create_vm(unsigned long type)
 		rcu_assign_pointer(kvm->buses[i],
 			kzalloc(sizeof(struct kvm_io_bus), GFP_KERNEL_ACCOUNT));
 		if (!kvm->buses[i])
-			goto out_err_no_arch_destroy_vm;
+			goto out_err_no_disable;
 	}
 
 	kvm->max_halt_poll_ns = halt_poll_ns;
 
-	r = kvm_arch_init_vm(kvm, type);
-	if (r)
-		goto out_err_no_arch_destroy_vm;
-
 	r = hardware_enable_all();
 	if (r)
 		goto out_err_no_disable;
 
+	r = kvm_arch_init_vm(kvm, type);
+	if (r)
+		goto out_err_no_arch_destroy_vm;
+
 #ifdef CONFIG_HAVE_KVM_IRQFD
 	INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list);
 #endif
@@ -1115,10 +1115,10 @@ static struct kvm *kvm_create_vm(unsigned long type)
 		mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
 #endif
 out_err_no_mmu_notifier:
-	hardware_disable_all();
-out_err_no_disable:
 	kvm_arch_destroy_vm(kvm);
 out_err_no_arch_destroy_vm:
+	hardware_disable_all();
+out_err_no_disable:
 	WARN_ON_ONCE(!refcount_dec_and_test(&kvm->users_count));
 	for (i = 0; i < KVM_NR_BUSES; i++)
 		kfree(kvm_get_bus(kvm, i));
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 10/59] KVM: x86: Split core of hypercall emulation to helper function
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (8 preceding siblings ...)
  2021-11-25  0:19 ` [RFC PATCH v3 09/59] KVM: Enable hardware before doing arch VM initialization isaku.yamahata
@ 2021-11-25  0:19 ` isaku.yamahata
  2021-11-25  0:19 ` [RFC PATCH v3 11/59] KVM: x86: Export kvm_mmio tracepoint for use by TDX for PV MMIO isaku.yamahata
                   ` (50 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

By necessity, TDX will use a different register ABI for hypercalls.
Break out the core functionality so that it may be reused for TDX.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  4 +++
 arch/x86/kvm/x86.c              | 55 ++++++++++++++++++++-------------
 2 files changed, 38 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e5d8700319cc..5b31d8e5fcbf 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1778,6 +1778,10 @@ void kvm_request_apicv_update(struct kvm *kvm, bool activate,
 void __kvm_request_apicv_update(struct kvm *kvm, bool activate,
 				unsigned long bit);
 
+unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+				      unsigned long a0, unsigned long a1,
+				      unsigned long a2, unsigned long a3,
+				      int op_64_bit);
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
 
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dc7eb5fddfd3..8a4774c5c85d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8829,26 +8829,15 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
 	return kvm_skip_emulated_instruction(vcpu);
 }
 
-int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
+unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+				      unsigned long a0, unsigned long a1,
+				      unsigned long a2, unsigned long a3,
+				      int op_64_bit)
 {
-	unsigned long nr, a0, a1, a2, a3, ret;
-	int op_64_bit;
-
-	if (kvm_xen_hypercall_enabled(vcpu->kvm))
-		return kvm_xen_hypercall(vcpu);
-
-	if (kvm_hv_hypercall_enabled(vcpu))
-		return kvm_hv_hypercall(vcpu);
-
-	nr = kvm_rax_read(vcpu);
-	a0 = kvm_rbx_read(vcpu);
-	a1 = kvm_rcx_read(vcpu);
-	a2 = kvm_rdx_read(vcpu);
-	a3 = kvm_rsi_read(vcpu);
+	unsigned long ret;
 
 	trace_kvm_hypercall(nr, a0, a1, a2, a3);
 
-	op_64_bit = is_64_bit_mode(vcpu);
 	if (!op_64_bit) {
 		nr &= 0xFFFFFFFF;
 		a0 &= 0xFFFFFFFF;
@@ -8857,11 +8846,6 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		a3 &= 0xFFFFFFFF;
 	}
 
-	if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
-		ret = -KVM_EPERM;
-		goto out;
-	}
-
 	ret = -KVM_ENOSYS;
 
 	switch (nr) {
@@ -8920,6 +8904,35 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		ret = -KVM_ENOSYS;
 		break;
 	}
+	return ret;
+}
+EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
+
+int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
+{
+	unsigned long nr, a0, a1, a2, a3, ret;
+	int op_64_bit;
+
+	if (kvm_xen_hypercall_enabled(vcpu->kvm))
+		return kvm_xen_hypercall(vcpu);
+
+	if (kvm_hv_hypercall_enabled(vcpu))
+		return kvm_hv_hypercall(vcpu);
+
+	nr = kvm_rax_read(vcpu);
+	a0 = kvm_rbx_read(vcpu);
+	a1 = kvm_rcx_read(vcpu);
+	a2 = kvm_rdx_read(vcpu);
+	a3 = kvm_rsi_read(vcpu);
+
+	op_64_bit = is_64_bit_mode(vcpu);
+
+	if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
+		ret = -KVM_EPERM;
+		goto out;
+	}
+
+	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit);
 out:
 	if (!op_64_bit)
 		ret = (u32)ret;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 11/59] KVM: x86: Export kvm_mmio tracepoint for use by TDX for PV MMIO
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (9 preceding siblings ...)
  2021-11-25  0:19 ` [RFC PATCH v3 10/59] KVM: x86: Split core of hypercall emulation to helper function isaku.yamahata
@ 2021-11-25  0:19 ` isaku.yamahata
  2021-11-25  0:19 ` [RFC PATCH v3 12/59] KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot by default isaku.yamahata
                   ` (49 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Later kvm_intel.ko will use it.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/x86.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8a4774c5c85d..6f3c77524426 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12584,6 +12584,7 @@ EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 12/59] KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot by default
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (10 preceding siblings ...)
  2021-11-25  0:19 ` [RFC PATCH v3 11/59] KVM: x86: Export kvm_mmio tracepoint for use by TDX for PV MMIO isaku.yamahata
@ 2021-11-25  0:19 ` isaku.yamahata
  2021-11-25 19:04   ` Thomas Gleixner
  2021-11-25  0:19 ` [RFC PATCH v3 13/59] KVM: Add max_vcpus field in common 'struct kvm' isaku.yamahata
                   ` (48 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Zap only leaf SPTEs when deleting/moving a memslot by default, and add a
module param to allow reverting to the old behavior of zapping all SPTEs
at all levels and memslots when any memslot is updated.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 33794379949e..1e11b14f4f82 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -91,6 +91,9 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint");
 static bool __read_mostly force_flush_and_sync_on_reuse;
 module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
 
+static bool __read_mostly memslot_update_zap_all;
+module_param(memslot_update_zap_all, bool, 0444);
+
 /*
  * When setting this variable to true it enables Two-Dimensional-Paging
  * where the hardware walks 2 page tables:
@@ -5681,11 +5684,27 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
 	return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
 }
 
+static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+	/*
+	 * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
+	 * case scenario we'll have unused shadow pages lying around until they
+	 * are recycled due to age or when the VM is destroyed.
+	 */
+	write_lock(&kvm->mmu_lock);
+	slot_handle_level(kvm, slot, kvm_zap_rmapp, PG_LEVEL_4K,
+			  KVM_MAX_HUGEPAGE_LEVEL, true);
+	write_unlock(&kvm->mmu_lock);
+}
+
 static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 			struct kvm_memory_slot *slot,
 			struct kvm_page_track_notifier_node *node)
 {
-	kvm_mmu_zap_all_fast(kvm);
+	if (memslot_update_zap_all)
+		kvm_mmu_zap_all_fast(kvm);
+	else
+		kvm_mmu_zap_memslot(kvm, slot);
 }
 
 void kvm_mmu_init_vm(struct kvm *kvm)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 13/59] KVM: Add max_vcpus field in common 'struct kvm'
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (11 preceding siblings ...)
  2021-11-25  0:19 ` [RFC PATCH v3 12/59] KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot by default isaku.yamahata
@ 2021-11-25  0:19 ` isaku.yamahata
  2021-11-25 19:06   ` Thomas Gleixner
  2021-11-25  0:19 ` [RFC PATCH v3 14/59] KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs isaku.yamahata
                   ` (47 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/arm64/include/asm/kvm_host.h | 3 ---
 arch/arm64/kvm/arm.c              | 7 ++-----
 arch/arm64/kvm/vgic/vgic-init.c   | 6 +++---
 include/linux/kvm_host.h          | 1 +
 virt/kvm/kvm_main.c               | 3 ++-
 5 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 2a5f7f38006f..ff3f4ab57eba 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -108,9 +108,6 @@ struct kvm_arch {
 	/* VTCR_EL2 value for this VM */
 	u64    vtcr;
 
-	/* The maximum number of vCPUs depends on the used GIC model */
-	int max_vcpus;
-
 	/* Interrupt controller */
 	struct vgic_dist	vgic;
 
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 2f03cbfefe67..14df7ee047e4 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -153,7 +153,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm_vgic_early_init(kvm);
 
 	/* The maximum number of VCPUs is limited by the host's GIC model */
-	kvm->arch.max_vcpus = kvm_arm_default_max_vcpus();
+	kvm->max_vcpus = kvm_arm_default_max_vcpus();
 
 	set_default_spectre(kvm);
 
@@ -228,7 +228,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_MAX_VCPUS:
 	case KVM_CAP_MAX_VCPU_ID:
 		if (kvm)
-			r = kvm->arch.max_vcpus;
+			r = kvm->max_vcpus;
 		else
 			r = kvm_arm_default_max_vcpus();
 		break;
@@ -304,9 +304,6 @@ int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
 	if (irqchip_in_kernel(kvm) && vgic_initialized(kvm))
 		return -EBUSY;
 
-	if (id >= kvm->arch.max_vcpus)
-		return -EINVAL;
-
 	return 0;
 }
 
diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index 0a06d0648970..906aee52f2bc 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -97,11 +97,11 @@ int kvm_vgic_create(struct kvm *kvm, u32 type)
 	ret = 0;
 
 	if (type == KVM_DEV_TYPE_ARM_VGIC_V2)
-		kvm->arch.max_vcpus = VGIC_V2_MAX_CPUS;
+		kvm->max_vcpus = VGIC_V2_MAX_CPUS;
 	else
-		kvm->arch.max_vcpus = VGIC_V3_MAX_CPUS;
+		kvm->max_vcpus = VGIC_V3_MAX_CPUS;
 
-	if (atomic_read(&kvm->online_vcpus) > kvm->arch.max_vcpus) {
+	if (atomic_read(&kvm->online_vcpus) > kvm->max_vcpus) {
 		ret = -E2BIG;
 		goto out_unlock;
 	}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9e0667e3723e..625de21ceb43 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -566,6 +566,7 @@ struct kvm {
 	 * and is accessed atomically.
 	 */
 	atomic_t online_vcpus;
+	int max_vcpus;
 	int created_vcpus;
 	int last_boosted_vcpu;
 	struct list_head vm_list;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8544e92db2da..4a1dbf0b1d3d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1052,6 +1052,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
 
 	INIT_LIST_HEAD(&kvm->devices);
+	kvm->max_vcpus = KVM_MAX_VCPUS;
 
 	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
 
@@ -3599,7 +3600,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 		return -EINVAL;
 
 	mutex_lock(&kvm->lock);
-	if (kvm->created_vcpus == KVM_MAX_VCPUS) {
+	if (kvm->created_vcpus >= kvm->max_vcpus) {
 		mutex_unlock(&kvm->lock);
 		return -EINVAL;
 	}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 14/59] KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (12 preceding siblings ...)
  2021-11-25  0:19 ` [RFC PATCH v3 13/59] KVM: Add max_vcpus field in common 'struct kvm' isaku.yamahata
@ 2021-11-25  0:19 ` isaku.yamahata
  2021-11-25 19:08   ` Thomas Gleixner
  2021-11-25  0:19 ` [RFC PATCH v3 15/59] KVM: x86: Introduce "protected guest" concept and block disallowed ioctls isaku.yamahata
                   ` (46 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson, Xiaoyao Li

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a capability to effectively allow userspace to query what VM types
are supported by KVM.

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h    | 1 +
 arch/x86/include/asm/kvm_host.h       | 2 ++
 arch/x86/include/uapi/asm/kvm.h       | 4 ++++
 arch/x86/kvm/svm/svm.c                | 6 ++++++
 arch/x86/kvm/vmx/vmx.c                | 6 ++++++
 arch/x86/kvm/x86.c                    | 9 ++++++++-
 include/uapi/linux/kvm.h              | 2 ++
 tools/arch/x86/include/uapi/asm/kvm.h | 4 ++++
 tools/include/uapi/linux/kvm.h        | 2 ++
 9 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index cefe1d81e2e8..4aa0a1c10b84 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -18,6 +18,7 @@ KVM_X86_OP_NULL(hardware_unsetup)
 KVM_X86_OP_NULL(cpu_has_accelerated_tpr)
 KVM_X86_OP(has_emulated_msr)
 KVM_X86_OP(vcpu_after_set_cpuid)
+KVM_X86_OP(is_vm_type_supported)
 KVM_X86_OP(vm_init)
 KVM_X86_OP_NULL(vm_destroy)
 KVM_X86_OP(vcpu_create)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5b31d8e5fcbf..cdb908ed7d5b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1037,6 +1037,7 @@ struct kvm_x86_msr_filter {
 #define APICV_INHIBIT_REASON_BLOCKIRQ	6
 
 struct kvm_arch {
+	unsigned long vm_type;
 	unsigned long n_used_mmu_pages;
 	unsigned long n_requested_mmu_pages;
 	unsigned long n_max_mmu_pages;
@@ -1311,6 +1312,7 @@ struct kvm_x86_ops {
 	bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
 	void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
 
+	bool (*is_vm_type_supported)(unsigned long vm_type);
 	unsigned int vm_size;
 	int (*vm_init)(struct kvm *kvm);
 	void (*vm_destroy)(struct kvm *kvm);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 5a776a08f78c..a0805a2a81f8 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -508,4 +508,8 @@ struct kvm_pmu_event_filter {
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
 
+#define KVM_X86_LEGACY_VM	0
+#define KVM_X86_SEV_ES_VM	1
+#define KVM_X86_TDX_VM		2
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 5630c241d5f6..2ef77d4566a9 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4562,6 +4562,11 @@ static void svm_vm_destroy(struct kvm *kvm)
 	sev_vm_destroy(kvm);
 }
 
+static bool svm_is_vm_type_supported(unsigned long type)
+{
+	return type == KVM_X86_LEGACY_VM;
+}
+
 static int svm_vm_init(struct kvm *kvm)
 {
 	if (!pause_filter_count || !pause_filter_thresh)
@@ -4589,6 +4594,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.vcpu_free = svm_free_vcpu,
 	.vcpu_reset = svm_vcpu_reset,
 
+	.is_vm_type_supported = svm_is_vm_type_supported,
 	.vm_size = sizeof(struct kvm_svm),
 	.vm_init = svm_vm_init,
 	.vm_destroy = svm_vm_destroy,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fe4f3ee61db3..f2905e00b063 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6851,6 +6851,11 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
 	return err;
 }
 
+static bool vmx_is_vm_type_supported(unsigned long type)
+{
+	return type == KVM_X86_LEGACY_VM;
+}
+
 #define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 #define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 
@@ -7505,6 +7510,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.cpu_has_accelerated_tpr = report_flexpriority,
 	.has_emulated_msr = vmx_has_emulated_msr,
 
+	.is_vm_type_supported = vmx_is_vm_type_supported,
 	.vm_size = sizeof(struct kvm_vmx),
 	.vm_init = vmx_vm_init,
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6f3c77524426..5f4b6f70489b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4225,6 +4225,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		else
 			r = 0;
 		break;
+	case KVM_CAP_VM_TYPES:
+		r = BIT(KVM_X86_LEGACY_VM);
+		if (static_call(kvm_x86_is_vm_type_supported)(KVM_X86_TDX_VM))
+			r |= BIT(KVM_X86_TDX_VM);
+		break;
 	default:
 		break;
 	}
@@ -11320,9 +11325,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	int ret;
 	unsigned long flags;
 
-	if (type)
+	if (!static_call(kvm_x86_is_vm_type_supported)(type))
 		return -EINVAL;
 
+	kvm->arch.vm_type = type;
+
 	ret = kvm_page_track_init(kvm);
 	if (ret)
 		return ret;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1daa45268de2..bb49e095867e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1132,6 +1132,8 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_MTE 205
 #define KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM 206
 
+#define KVM_CAP_VM_TYPES 1000
+
 #ifdef KVM_CAP_IRQ_ROUTING
 
 struct kvm_irq_routing_irqchip {
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 2ef1f6513c68..96b0064cff5c 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -504,4 +504,8 @@ struct kvm_pmu_event_filter {
 #define KVM_PMU_EVENT_ALLOW 0
 #define KVM_PMU_EVENT_DENY 1
 
+#define KVM_X86_LEGACY_VM	0
+#define KVM_X86_SEV_ES_VM	1
+#define KVM_X86_TDX_VM		2
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index a067410ebea5..e5aadad54ced 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -1113,6 +1113,8 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_EXIT_ON_EMULATION_FAILURE 204
 #define KVM_CAP_ARM_MTE 205
 
+#define KVM_CAP_VM_TYPES 1000
+
 #ifdef KVM_CAP_IRQ_ROUTING
 
 struct kvm_irq_routing_irqchip {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 15/59] KVM: x86: Introduce "protected guest" concept and block disallowed ioctls
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (13 preceding siblings ...)
  2021-11-25  0:19 ` [RFC PATCH v3 14/59] KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs isaku.yamahata
@ 2021-11-25  0:19 ` isaku.yamahata
  2021-11-25 19:26   ` Thomas Gleixner
  2021-11-25  0:19 ` [RFC PATCH v3 16/59] KVM: x86: Add per-VM flag to disable direct IRQ injection isaku.yamahata
                   ` (45 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson, Xiaoyao Li

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add 'guest_state_protected' to mark a VM's state as being protected by
hardware/firmware, e.g. SEV-ES or TDX-SEAM.  Use the flag to disallow
ioctls() and/or flows that attempt to access protected state.

Return an error if userspace attempts to get/set register state for a
protected VM, e.g. a non-debug TDX guest.  KVM can't provide sane data,
it's userspace's responsibility to avoid attempting to read guest state
when it's known to be inaccessible.

Retrieving vCPU events is the one exception, as the userspace VMM is
allowed to inject NMIs.

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/x86.c | 80 ++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 66 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5f4b6f70489b..b21fcf3c0cc8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4539,6 +4539,10 @@ static int kvm_vcpu_ioctl_nmi(struct kvm_vcpu *vcpu)
 
 static int kvm_vcpu_ioctl_smi(struct kvm_vcpu *vcpu)
 {
+	/* TODO: use more precise flag */
+	if (vcpu->arch.guest_state_protected)
+		return -EINVAL;
+
 	kvm_make_request(KVM_REQ_SMI, vcpu);
 
 	return 0;
@@ -4585,6 +4589,10 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
 	unsigned bank_num = mcg_cap & 0xff;
 	u64 *banks = vcpu->arch.mce_banks;
 
+	/* TODO: use more precise flag */
+	if (vcpu->arch.guest_state_protected)
+		return -EINVAL;
+
 	if (mce->bank >= bank_num || !(mce->status & MCI_STATUS_VAL))
 		return -EINVAL;
 	/*
@@ -4680,7 +4688,8 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 		vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
 	events->interrupt.nr = vcpu->arch.interrupt.nr;
 	events->interrupt.soft = 0;
-	events->interrupt.shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
+	if (!vcpu->arch.guest_state_protected)
+		events->interrupt.shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
 
 	events->nmi.injected = vcpu->arch.nmi_injected;
 	events->nmi.pending = vcpu->arch.nmi_pending != 0;
@@ -4709,11 +4718,17 @@ static void kvm_smm_changed(struct kvm_vcpu *vcpu, bool entering_smm);
 static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
 					      struct kvm_vcpu_events *events)
 {
-	if (events->flags & ~(KVM_VCPUEVENT_VALID_NMI_PENDING
-			      | KVM_VCPUEVENT_VALID_SIPI_VECTOR
-			      | KVM_VCPUEVENT_VALID_SHADOW
-			      | KVM_VCPUEVENT_VALID_SMM
-			      | KVM_VCPUEVENT_VALID_PAYLOAD))
+	u32 allowed_flags = KVM_VCPUEVENT_VALID_NMI_PENDING |
+			    KVM_VCPUEVENT_VALID_SIPI_VECTOR |
+			    KVM_VCPUEVENT_VALID_SHADOW |
+			    KVM_VCPUEVENT_VALID_SMM |
+			    KVM_VCPUEVENT_VALID_PAYLOAD;
+
+	/* TODO: introduce more precise flag */
+	if (vcpu->arch.guest_state_protected)
+		allowed_flags = KVM_VCPUEVENT_VALID_NMI_PENDING;
+
+	if (events->flags & ~allowed_flags)
 		return -EINVAL;
 
 	if (events->flags & KVM_VCPUEVENT_VALID_PAYLOAD) {
@@ -4789,17 +4804,22 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
 	return 0;
 }
 
-static void kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
-					     struct kvm_debugregs *dbgregs)
+static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
+					    struct kvm_debugregs *dbgregs)
 {
 	unsigned long val;
 
+	if (vcpu->arch.guest_state_protected)
+		return -EINVAL;
+
 	memcpy(dbgregs->db, vcpu->arch.db, sizeof(vcpu->arch.db));
 	kvm_get_dr(vcpu, 6, &val);
 	dbgregs->dr6 = val;
 	dbgregs->dr7 = vcpu->arch.dr7;
 	dbgregs->flags = 0;
 	memset(&dbgregs->reserved, 0, sizeof(dbgregs->reserved));
+
+	return 0;
 }
 
 static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
@@ -4813,6 +4833,9 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
 	if (!kvm_dr7_valid(dbgregs->dr7))
 		return -EINVAL;
 
+	if (vcpu->arch.guest_state_protected)
+		return -EINVAL;
+
 	memcpy(vcpu->arch.db, dbgregs->db, sizeof(vcpu->arch.db));
 	kvm_update_dr0123(vcpu);
 	vcpu->arch.dr6 = dbgregs->dr6;
@@ -4845,18 +4868,22 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
 					      supported_xcr0, &vcpu->arch.pkru);
 }
 
-static void kvm_vcpu_ioctl_x86_get_xcrs(struct kvm_vcpu *vcpu,
-					struct kvm_xcrs *guest_xcrs)
+static int kvm_vcpu_ioctl_x86_get_xcrs(struct kvm_vcpu *vcpu,
+				       struct kvm_xcrs *guest_xcrs)
 {
+	if (vcpu->arch.guest_state_protected)
+		return -EINVAL;
+
 	if (!boot_cpu_has(X86_FEATURE_XSAVE)) {
 		guest_xcrs->nr_xcrs = 0;
-		return;
+		return 0;
 	}
 
 	guest_xcrs->nr_xcrs = 1;
 	guest_xcrs->flags = 0;
 	guest_xcrs->xcrs[0].xcr = XCR_XFEATURE_ENABLED_MASK;
 	guest_xcrs->xcrs[0].value = vcpu->arch.xcr0;
+	return 0;
 }
 
 static int kvm_vcpu_ioctl_x86_set_xcrs(struct kvm_vcpu *vcpu,
@@ -4864,6 +4891,9 @@ static int kvm_vcpu_ioctl_x86_set_xcrs(struct kvm_vcpu *vcpu,
 {
 	int i, r = 0;
 
+	if (vcpu->arch.guest_state_protected)
+		return -EINVAL;
+
 	if (!boot_cpu_has(X86_FEATURE_XSAVE))
 		return -EINVAL;
 
@@ -5247,7 +5277,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 	case KVM_GET_DEBUGREGS: {
 		struct kvm_debugregs dbgregs;
 
-		kvm_vcpu_ioctl_x86_get_debugregs(vcpu, &dbgregs);
+		r = kvm_vcpu_ioctl_x86_get_debugregs(vcpu, &dbgregs);
+		if (r)
+			break;
 
 		r = -EFAULT;
 		if (copy_to_user(argp, &dbgregs,
@@ -5297,7 +5329,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		if (!u.xcrs)
 			break;
 
-		kvm_vcpu_ioctl_x86_get_xcrs(vcpu, u.xcrs);
+		r = kvm_vcpu_ioctl_x86_get_xcrs(vcpu, u.xcrs);
+		if (r)
+			break;
 
 		r = -EFAULT;
 		if (copy_to_user(argp, u.xcrs,
@@ -10188,6 +10222,12 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		goto out;
 	}
 
+	if (vcpu->arch.guest_state_protected &&
+	    (kvm_run->kvm_valid_regs || kvm_run->kvm_dirty_regs)) {
+		r = -EINVAL;
+		goto out;
+	}
+
 	if (kvm_run->kvm_dirty_regs) {
 		r = sync_regs(vcpu);
 		if (r != 0)
@@ -10218,7 +10258,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 
 out:
 	kvm_put_guest_fpu(vcpu);
-	if (kvm_run->kvm_valid_regs)
+	if (kvm_run->kvm_valid_regs && !vcpu->arch.guest_state_protected)
 		store_regs(vcpu);
 	post_kvm_run_save(vcpu);
 	kvm_sigset_deactivate(vcpu);
@@ -10265,6 +10305,9 @@ static void __get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 
 int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 {
+	if (vcpu->arch.guest_state_protected)
+		return -EINVAL;
+
 	vcpu_load(vcpu);
 	__get_regs(vcpu, regs);
 	vcpu_put(vcpu);
@@ -10305,6 +10348,9 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 
 int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 {
+	if (vcpu->arch.guest_state_protected)
+		return -EINVAL;
+
 	vcpu_load(vcpu);
 	__set_regs(vcpu, regs);
 	vcpu_put(vcpu);
@@ -10387,6 +10433,9 @@ static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2)
 int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
 				  struct kvm_sregs *sregs)
 {
+	if (vcpu->arch.guest_state_protected)
+		return -EINVAL;
+
 	vcpu_load(vcpu);
 	__get_sregs(vcpu, sregs);
 	vcpu_put(vcpu);
@@ -10635,6 +10684,9 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 {
 	int ret;
 
+	if (vcpu->arch.guest_state_protected)
+		return -EINVAL;
+
 	vcpu_load(vcpu);
 	ret = __set_sregs(vcpu, sregs);
 	vcpu_put(vcpu);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 16/59] KVM: x86: Add per-VM flag to disable direct IRQ injection
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (14 preceding siblings ...)
  2021-11-25  0:19 ` [RFC PATCH v3 15/59] KVM: x86: Introduce "protected guest" concept and block disallowed ioctls isaku.yamahata
@ 2021-11-25  0:19 ` isaku.yamahata
  2021-11-25 19:31   ` Thomas Gleixner
  2021-11-29  2:49   ` Lai Jiangshan
  2021-11-25  0:20 ` [RFC PATCH v3 17/59] KVM: x86: Add flag to disallow #MC injection / KVM_X86_SETUP_MCE isaku.yamahata
                   ` (44 subsequent siblings)
  60 siblings, 2 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:19 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a flag to disable IRQ injection, which is not supported by TDX.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/x86.c              | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cdb908ed7d5b..ebc4de32bf0e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1141,6 +1141,7 @@ struct kvm_arch {
 	bool exception_payload_enabled;
 
 	bool bus_lock_detection_enabled;
+	bool irq_injection_disallowed;
 	/*
 	 * If exit_on_emulation_error is set, and the in-kernel instruction
 	 * emulator fails to emulate an instruction, allow userspace
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b21fcf3c0cc8..b399e64e8863 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4506,7 +4506,8 @@ static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
 static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
 				    struct kvm_interrupt *irq)
 {
-	if (irq->irq >= KVM_NR_INTERRUPTS)
+	if (irq->irq >= KVM_NR_INTERRUPTS ||
+	    vcpu->kvm->arch.irq_injection_disallowed)
 		return -EINVAL;
 
 	if (!irqchip_in_kernel(vcpu->kvm)) {
@@ -8997,6 +8998,7 @@ static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
 static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu)
 {
 	return vcpu->run->request_interrupt_window &&
+	       !vcpu->kvm->arch.irq_injection_disallowed &&
 		likely(!pic_in_kernel(vcpu->kvm));
 }
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 17/59] KVM: x86: Add flag to disallow #MC injection / KVM_X86_SETUP_MCE
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (15 preceding siblings ...)
  2021-11-25  0:19 ` [RFC PATCH v3 16/59] KVM: x86: Add per-VM flag to disable direct IRQ injection isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 19:33   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 18/59] KVM: x86: Add flag to mark TSC as immutable (for TDX) isaku.yamahata
                   ` (43 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson, Kai Huang

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a flag to disallow MCE injection and reject KVM_X86_SETUP_MCE with
-EINVAL when set.  TDX doesn't support injecting exceptions, including
(virtual) #MCs.

Co-developed-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 14 +++++++-------
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ebc4de32bf0e..e912e1e853ef 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1142,6 +1142,7 @@ struct kvm_arch {
 
 	bool bus_lock_detection_enabled;
 	bool irq_injection_disallowed;
+	bool mce_injection_disallowed;
 	/*
 	 * If exit_on_emulation_error is set, and the in-kernel instruction
 	 * emulator fails to emulate an instruction, allow userspace
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b399e64e8863..fefa4602e879 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4561,15 +4561,16 @@ static int vcpu_ioctl_tpr_access_reporting(struct kvm_vcpu *vcpu,
 static int kvm_vcpu_ioctl_x86_setup_mce(struct kvm_vcpu *vcpu,
 					u64 mcg_cap)
 {
-	int r;
 	unsigned bank_num = mcg_cap & 0xff, bank;
 
-	r = -EINVAL;
+	if (vcpu->kvm->arch.mce_injection_disallowed)
+		return -EINVAL;
+
 	if (!bank_num || bank_num > KVM_MAX_MCE_BANKS)
-		goto out;
+		return -EINVAL;
 	if (mcg_cap & ~(kvm_mce_cap_supported | 0xff | 0xff0000))
-		goto out;
-	r = 0;
+		return -EINVAL;
+
 	vcpu->arch.mcg_cap = mcg_cap;
 	/* Init IA32_MCG_CTL to all 1s */
 	if (mcg_cap & MCG_CTL_P)
@@ -4579,8 +4580,7 @@ static int kvm_vcpu_ioctl_x86_setup_mce(struct kvm_vcpu *vcpu,
 		vcpu->arch.mce_banks[bank*4] = ~(u64)0;
 
 	static_call(kvm_x86_setup_mce)(vcpu);
-out:
-	return r;
+	return 0;
 }
 
 static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 18/59] KVM: x86: Add flag to mark TSC as immutable (for TDX)
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (16 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 17/59] KVM: x86: Add flag to disallow #MC injection / KVM_X86_SETUP_MCE isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 19:40   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 19/59] KVM: Add per-VM flag to mark read-only memory as unsupported isaku.yamahata
                   ` (42 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson, Kai Huang

From: Sean Christopherson <sean.j.christopherson@intel.com>

The TSC for TDX1 guests is fixed at TD creation time.  Add tsc_immutable
to reflect that the TSC of the guest cannot be changed in any way, and
use it to short circuit all paths that lead to one of the myriad TSC
adjustment flows.

Suggested-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 35 +++++++++++++++++++++++++--------
 2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e912e1e853ef..f3808672c720 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1122,6 +1122,7 @@ struct kvm_arch {
 	int audit_point;
 	#endif
 
+	bool tsc_immutable;
 	bool backwards_tsc_observed;
 	bool boot_vcpu_runs_old_kvmclock;
 	u32 bsp_vcpu_id;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fefa4602e879..0ebd60846079 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2247,7 +2247,9 @@ static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
 	u64 ratio;
 
 	/* Guest TSC same frequency as host TSC? */
-	if (!scale) {
+	if (!scale || vcpu->kvm->arch.tsc_immutable) {
+		if (scale)
+			pr_warn_ratelimited("Guest TSC immutable, scaling not supported\n");
 		kvm_vcpu_write_tsc_multiplier(vcpu, kvm_default_tsc_scaling_ratio);
 		return 0;
 	}
@@ -2534,6 +2536,9 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	bool matched = false;
 	bool synchronizing = false;
 
+	if (WARN_ON_ONCE(vcpu->kvm->arch.tsc_immutable))
+		return;
+
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	ns = get_kvmclock_base_ns();
@@ -2960,6 +2965,10 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	u8 pvclock_flags;
 	bool use_master_clock;
 
+	/* Unable to update guest time if the TSC is immutable. */
+	if (ka->tsc_immutable)
+		return 0;
+
 	kernel_ns = 0;
 	host_tsc = 0;
 
@@ -4372,7 +4381,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		if (tsc_delta < 0)
 			mark_tsc_unstable("KVM discovered backwards TSC");
 
-		if (kvm_check_tsc_unstable()) {
+		if (kvm_check_tsc_unstable() &&
+		    !vcpu->kvm->arch.tsc_immutable) {
 			u64 offset = kvm_compute_l1_tsc_offset(vcpu,
 						vcpu->arch.last_guest_tsc);
 			kvm_vcpu_write_tsc_offset(vcpu, offset);
@@ -4386,7 +4396,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		 * On a host with synchronized TSC, there is no need to update
 		 * kvmclock on vcpu->cpu migration
 		 */
-		if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
+		if ((!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1) &&
+		    !vcpu->kvm->arch.tsc_immutable)
 			kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
 		if (vcpu->cpu != cpu)
 			kvm_make_request(KVM_REQ_MIGRATE_TIMER, vcpu);
@@ -5352,10 +5363,11 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		break;
 	}
 	case KVM_SET_TSC_KHZ: {
-		u32 user_tsc_khz;
+		u32 user_tsc_khz = (u32)arg;
 
 		r = -EINVAL;
-		user_tsc_khz = (u32)arg;
+		if (vcpu->kvm->arch.tsc_immutable)
+			goto out;
 
 		if (kvm_has_tsc_control &&
 		    user_tsc_khz >= kvm_max_guest_tsc_khz)
@@ -10994,9 +11006,12 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 
 	if (mutex_lock_killable(&vcpu->mutex))
 		return;
-	vcpu_load(vcpu);
-	kvm_synchronize_tsc(vcpu, 0);
-	vcpu_put(vcpu);
+
+	if (!kvm->arch.tsc_immutable) {
+		vcpu_load(vcpu);
+		kvm_synchronize_tsc(vcpu, 0);
+		vcpu_put(vcpu);
+	}
 
 	/* poll control enabled by default */
 	vcpu->arch.msr_kvm_poll_control = 1;
@@ -11253,6 +11268,10 @@ int kvm_arch_hardware_enable(void)
 	if (backwards_tsc) {
 		u64 delta_cyc = max_tsc - local_tsc;
 		list_for_each_entry(kvm, &vm_list, vm_list) {
+			if (vcpu->kvm->arch.tsc_immutable) {
+				pr_warn_ratelimited("Backwards TSC observed and guest with immutable TSC active\n");
+				continue;
+			}
 			kvm->arch.backwards_tsc_observed = true;
 			kvm_for_each_vcpu(i, vcpu, kvm) {
 				vcpu->arch.tsc_offset_adjustment += delta_cyc;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 19/59] KVM: Add per-VM flag to mark read-only memory as unsupported
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (17 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 18/59] KVM: x86: Add flag to mark TSC as immutable (for TDX) isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 20/59] KVM: Add per-VM flag to disable dirty logging of memslots for TDs isaku.yamahata
                   ` (41 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add a flag for TDX to flag RO memory as unsupported and propagate it to
KVM_MEM_READONLY to allow reporting RO memory as unsupported on a per-VM
basis.  TDX1 doesn't expose permission bits to the VMM in the SEPT
tables, i.e. doesn't support read-only private memory.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/x86.c       | 4 +++-
 include/linux/kvm_host.h | 3 +++
 virt/kvm/kvm_main.c      | 8 +++++---
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0ebd60846079..535f65b0915d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4121,7 +4121,6 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ASYNC_PF_INT:
 	case KVM_CAP_GET_TSC_KHZ:
 	case KVM_CAP_KVMCLOCK_CTRL:
-	case KVM_CAP_READONLY_MEM:
 	case KVM_CAP_HYPERV_TIME:
 	case KVM_CAP_IOAPIC_POLARITY_IGNORED:
 	case KVM_CAP_TSC_DEADLINE_TIMER:
@@ -4239,6 +4238,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		if (static_call(kvm_x86_is_vm_type_supported)(KVM_X86_TDX_VM))
 			r |= BIT(KVM_X86_TDX_VM);
 		break;
+	case KVM_CAP_READONLY_MEM:
+		r = kvm && kvm->readonly_mem_unsupported ? 0 : 1;
+		break;
 	default:
 		break;
 	}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 625de21ceb43..d59debe99212 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -624,6 +624,9 @@ struct kvm {
 	struct notifier_block pm_notifier;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
+#ifdef __KVM_HAVE_READONLY_MEM
+	bool readonly_mem_unsupported;
+#endif
 };
 
 #define kvm_err(fmt, ...) \
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4a1dbf0b1d3d..aaa6d69265ea 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1422,12 +1422,14 @@ static void update_memslots(struct kvm_memslots *slots,
 	}
 }
 
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+static int check_memory_region_flags(struct kvm *kvm,
+				     const struct kvm_userspace_memory_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
 #ifdef __KVM_HAVE_READONLY_MEM
-	valid_flags |= KVM_MEM_READONLY;
+	if (!kvm->readonly_mem_unsupported)
+		valid_flags |= KVM_MEM_READONLY;
 #endif
 
 	if (mem->flags & ~valid_flags)
@@ -1665,7 +1667,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	int as_id, id;
 	int r;
 
-	r = check_memory_region_flags(mem);
+	r = check_memory_region_flags(kvm, mem);
 	if (r)
 		return r;
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 20/59] KVM: Add per-VM flag to disable dirty logging of memslots for TDs
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (18 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 19/59] KVM: Add per-VM flag to mark read-only memory as unsupported isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 21/59] KVM: x86: Add per-VM flag to disable in-kernel I/O APIC and level routes isaku.yamahata
                   ` (40 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson, Kai Huang

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a flag for TDX to mark dirty logging as unsupported.

Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/kvm_main.c      | 5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d59debe99212..08284807d48e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -624,6 +624,7 @@ struct kvm {
 	struct notifier_block pm_notifier;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
+	bool dirty_log_unsupported;
 #ifdef __KVM_HAVE_READONLY_MEM
 	bool readonly_mem_unsupported;
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index aaa6d69265ea..5bf456734e7d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1425,7 +1425,10 @@ static void update_memslots(struct kvm_memslots *slots,
 static int check_memory_region_flags(struct kvm *kvm,
 				     const struct kvm_userspace_memory_region *mem)
 {
-	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
+	u32 valid_flags = 0;
+
+	if (!kvm->dirty_log_unsupported)
+		valid_flags |= KVM_MEM_LOG_DIRTY_PAGES;
 
 #ifdef __KVM_HAVE_READONLY_MEM
 	if (!kvm->readonly_mem_unsupported)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 21/59] KVM: x86: Add per-VM flag to disable in-kernel I/O APIC and level routes
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (19 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 20/59] KVM: Add per-VM flag to disable dirty logging of memslots for TDs isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 22/59] KVM: x86: add per-VM flags to disable SMI/INIT/SIPI isaku.yamahata
                   ` (39 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Kai Huang, Sean Christopherson

From: Kai Huang <kai.huang@linux.intel.com>

Add a flag to let TDX disallow the in-kernel I/O APIC, level triggered
routes for a userspace I/O APIC, and anything else that relies on being
able to intercept EOIs.  TDX-SEAM does not allow intercepting EOI.

Note, technically KVM could partially emulate the I/O APIC by allowing
only edge triggered interrupts, but that adds a lot of complexity for
basically zero benefit.  Ideally KVM wouldn't even allow I/O APIC route
reservation, but disabling that is a train wreck for Qemu.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/ioapic.c           | 4 ++++
 arch/x86/kvm/irq_comm.c         | 9 +++++++--
 arch/x86/kvm/lapic.c            | 3 ++-
 arch/x86/kvm/x86.c              | 6 ++++++
 5 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f3808672c720..545b556e420c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1132,6 +1132,7 @@ struct kvm_arch {
 
 	enum kvm_irqchip_mode irqchip_mode;
 	u8 nr_reserved_ioapic_pins;
+	bool eoi_intercept_unsupported;
 
 	bool disabled_lapic_found;
 
diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
index 816a82515dcd..39a9031e11b1 100644
--- a/arch/x86/kvm/ioapic.c
+++ b/arch/x86/kvm/ioapic.c
@@ -311,6 +311,10 @@ void kvm_arch_post_irq_ack_notifier_list_update(struct kvm *kvm)
 {
 	if (!ioapic_in_kernel(kvm))
 		return;
+
+	if (WARN_ON_ONCE(kvm->arch.eoi_intercept_unsupported))
+		return;
+
 	kvm_make_scan_ioapic_request(kvm);
 }
 
diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index d5b72a08e566..bcfac99db579 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -123,7 +123,12 @@ EXPORT_SYMBOL_GPL(kvm_set_msi_irq);
 static inline bool kvm_msi_route_invalid(struct kvm *kvm,
 		struct kvm_kernel_irq_routing_entry *e)
 {
-	return kvm->arch.x2apic_format && (e->msi.address_hi & 0xff);
+	struct msi_msg msg = { .address_lo = e->msi.address_lo,
+			       .address_hi = e->msi.address_hi,
+			       .data = e->msi.data };
+	return  (kvm->arch.eoi_intercept_unsupported &&
+		 msg.arch_data.is_level) ||
+		(kvm->arch.x2apic_format && (msg.address_hi & 0xff));
 }
 
 int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
@@ -385,7 +390,7 @@ int kvm_setup_empty_irq_routing(struct kvm *kvm)
 
 void kvm_arch_post_irq_routing_update(struct kvm *kvm)
 {
-	if (!irqchip_split(kvm))
+	if (!irqchip_split(kvm) || kvm->arch.eoi_intercept_unsupported)
 		return;
 	kvm_make_scan_ioapic_request(kvm);
 }
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 759952dd1222..1bfcd325d0d2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -281,7 +281,8 @@ void kvm_recalculate_apic_map(struct kvm *kvm)
 	if (old)
 		call_rcu(&old->rcu, kvm_apic_map_free);
 
-	kvm_make_scan_ioapic_request(kvm);
+	if (!kvm->arch.eoi_intercept_unsupported)
+		kvm_make_scan_ioapic_request(kvm);
 }
 
 static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 535f65b0915d..1573dddd1e43 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6110,6 +6110,9 @@ long kvm_arch_vm_ioctl(struct file *filp,
 			goto create_irqchip_unlock;
 
 		r = -EINVAL;
+		if (kvm->arch.eoi_intercept_unsupported)
+			goto create_irqchip_unlock;
+
 		if (kvm->created_vcpus)
 			goto create_irqchip_unlock;
 
@@ -6140,6 +6143,9 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		u.pit_config.flags = KVM_PIT_SPEAKER_DUMMY;
 		goto create_pit;
 	case KVM_CREATE_PIT2:
+		r = -EINVAL;
+		if (kvm->arch.eoi_intercept_unsupported)
+			goto out;
 		r = -EFAULT;
 		if (copy_from_user(&u.pit_config, argp,
 				   sizeof(struct kvm_pit_config)))
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 22/59] KVM: x86: add per-VM flags to disable SMI/INIT/SIPI
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (20 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 21/59] KVM: x86: Add per-VM flag to disable in-kernel I/O APIC and level routes isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 23/59] KVM: x86: Allow host-initiated WRMSR to set X2APIC regardless of CPUID isaku.yamahata
                   ` (38 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add a flag to let TDX disallow to inject interrupt with delivery
mode of SMI/INIT/SIPI. add a check to reject SMI/INIT interrupt
delivery mode.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/kvm/irq_comm.c         | 4 ++++
 arch/x86/kvm/x86.c              | 3 +--
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 545b556e420c..5ed07f31459e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1133,6 +1133,8 @@ struct kvm_arch {
 	enum kvm_irqchip_mode irqchip_mode;
 	u8 nr_reserved_ioapic_pins;
 	bool eoi_intercept_unsupported;
+	bool smm_unsupported;
+	bool init_sipi_unsupported;
 
 	bool disabled_lapic_found;
 
diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index bcfac99db579..396ccf086bdd 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -128,6 +128,10 @@ static inline bool kvm_msi_route_invalid(struct kvm *kvm,
 			       .data = e->msi.data };
 	return  (kvm->arch.eoi_intercept_unsupported &&
 		 msg.arch_data.is_level) ||
+		(kvm->arch.smm_unsupported &&
+		 msg.arch_data.delivery_mode == APIC_DELIVERY_MODE_SMI) ||
+		(kvm->arch.init_sipi_unsupported &&
+		 msg.arch_data.delivery_mode == APIC_DELIVERY_MODE_INIT) ||
 		(kvm->arch.x2apic_format && (msg.address_hi & 0xff));
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1573dddd1e43..f2b6a3f89e9e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4553,8 +4553,7 @@ static int kvm_vcpu_ioctl_nmi(struct kvm_vcpu *vcpu)
 
 static int kvm_vcpu_ioctl_smi(struct kvm_vcpu *vcpu)
 {
-	/* TODO: use more precise flag */
-	if (vcpu->arch.guest_state_protected)
+	if (vcpu->kvm->arch.smm_unsupported)
 		return -EINVAL;
 
 	kvm_make_request(KVM_REQ_SMI, vcpu);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 23/59] KVM: x86: Allow host-initiated WRMSR to set X2APIC regardless of CPUID
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (21 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 22/59] KVM: x86: add per-VM flags to disable SMI/INIT/SIPI isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 19:41   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 24/59] KVM: x86: Add kvm_x86_ops .cache_gprs() and .flush_gprs() isaku.yamahata
                   ` (37 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Let userspace, or in the case of TDX, KVM itself, enable X2APIC even if
X2APIC is not reported as supported in the guest's CPU model.  KVM
generally does not force specific ordering between ioctls(), e.g. this
forces userspace to configure CPUID before MSRs.  And for TDX, vCPUs
will always run with X2APIC enabled, e.g. KVM will want/need to enable
X2APIC from time zero.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/x86.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f2b6a3f89e9e..0221ef691a15 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -466,8 +466,11 @@ int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	enum lapic_mode old_mode = kvm_get_apic_mode(vcpu);
 	enum lapic_mode new_mode = kvm_apic_mode(msr_info->data);
-	u64 reserved_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu) | 0x2ff |
-		(guest_cpuid_has(vcpu, X86_FEATURE_X2APIC) ? 0 : X2APIC_ENABLE);
+	u64 reserved_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu) | 0x2ff;
+
+	if (!msr_info->host_initiated &&
+	    !guest_cpuid_has(vcpu, X86_FEATURE_X2APIC))
+		reserved_bits |= X2APIC_ENABLE;
 
 	if ((msr_info->data & reserved_bits) != 0 || new_mode == LAPIC_MODE_INVALID)
 		return 1;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 24/59] KVM: x86: Add kvm_x86_ops .cache_gprs() and .flush_gprs()
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (22 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 23/59] KVM: x86: Allow host-initiated WRMSR to set X2APIC regardless of CPUID isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 25/59] KVM: x86: Add support for vCPU and device-scoped KVM_MEMORY_ENCRYPT_OP isaku.yamahata
                   ` (36 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson, Tom Lendacky

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add hooks to cache and flush GPRs and invoke them from KVM_GET_REGS and
KVM_SET_REGS respecitively.  TDX will use the hooks to read/write GPRs
from TDX-SEAM on-demand (for debug TDs).

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h | 2 ++
 arch/x86/include/asm/kvm_host.h    | 2 ++
 arch/x86/kvm/x86.c                 | 4 ++++
 3 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 4aa0a1c10b84..30754f2b8a99 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -45,6 +45,8 @@ KVM_X86_OP(get_gdt)
 KVM_X86_OP(set_gdt)
 KVM_X86_OP(sync_dirty_debug_regs)
 KVM_X86_OP(set_dr7)
+KVM_X86_OP_NULL(cache_gprs)
+KVM_X86_OP_NULL(flush_gprs)
 KVM_X86_OP(cache_reg)
 KVM_X86_OP(get_rflags)
 KVM_X86_OP(set_rflags)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5ed07f31459e..f12eec3f05ab 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1352,6 +1352,8 @@ struct kvm_x86_ops {
 	void (*set_gdt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
 	void (*sync_dirty_debug_regs)(struct kvm_vcpu *vcpu);
 	void (*set_dr7)(struct kvm_vcpu *vcpu, unsigned long value);
+	void (*cache_gprs)(struct kvm_vcpu *vcpu);
+	void (*flush_gprs)(struct kvm_vcpu *vcpu);
 	void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg);
 	unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
 	void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0221ef691a15..2e0bccbd3c44 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10293,6 +10293,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 
 static void __get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 {
+	static_call_cond(kvm_x86_cache_gprs)(vcpu);
+
 	if (vcpu->arch.emulate_regs_need_sync_to_vcpu) {
 		/*
 		 * We are here if userspace calls get_regs() in the middle of
@@ -10367,6 +10369,8 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 
 	vcpu->arch.exception.pending = false;
 
+	static_call_cond(kvm_x86_flush_gprs);
+
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 }
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 25/59] KVM: x86: Add support for vCPU and device-scoped KVM_MEMORY_ENCRYPT_OP
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (23 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 24/59] KVM: x86: Add kvm_x86_ops .cache_gprs() and .flush_gprs() isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 19:42   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 26/59] KVM: x86: Introduce vm_teardown() hook in kvm_arch_vm_destroy() isaku.yamahata
                   ` (35 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/x86.c              | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f12eec3f05ab..74c3c1629563 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1483,7 +1483,9 @@ struct kvm_x86_ops {
 	int (*leave_smm)(struct kvm_vcpu *vcpu, const char *smstate);
 	void (*enable_smi_window)(struct kvm_vcpu *vcpu);
 
+	int (*mem_enc_op_dev)(void __user *argp);
 	int (*mem_enc_op)(struct kvm *kvm, void __user *argp);
+	int (*mem_enc_op_vcpu)(struct kvm_vcpu *vcpu, void __user *argp);
 	int (*mem_enc_reg_region)(struct kvm *kvm, struct kvm_enc_region *argp);
 	int (*mem_enc_unreg_region)(struct kvm *kvm, struct kvm_enc_region *argp);
 	int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2e0bccbd3c44..f2091c4b928a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4339,6 +4339,12 @@ long kvm_arch_dev_ioctl(struct file *filp,
 	case KVM_GET_SUPPORTED_HV_CPUID:
 		r = kvm_ioctl_get_supported_hv_cpuid(NULL, argp);
 		break;
+	case KVM_MEMORY_ENCRYPT_OP:
+		r = -EINVAL;
+		if (!kvm_x86_ops.mem_enc_op_dev)
+			goto out;
+		r = kvm_x86_ops.mem_enc_op_dev(argp);
+		break;
 	default:
 		r = -EINVAL;
 		break;
@@ -5516,6 +5522,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 	case KVM_SET_DEVICE_ATTR:
 		r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
 		break;
+	case KVM_MEMORY_ENCRYPT_OP:
+		r = -EINVAL;
+		if (!kvm_x86_ops.mem_enc_op_vcpu)
+			goto out;
+		r = kvm_x86_ops.mem_enc_op_vcpu(vcpu, argp);
+		break;
 	default:
 		r = -EINVAL;
 	}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 26/59] KVM: x86: Introduce vm_teardown() hook in kvm_arch_vm_destroy()
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (24 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 25/59] KVM: x86: Add support for vCPU and device-scoped KVM_MEMORY_ENCRYPT_OP isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 19:46   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 27/59] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior isaku.yamahata
                   ` (34 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a second kvm_x86_ops hook in kvm_arch_vm_destroy() to support TDX's
destruction path, which needs to first put the VM into a teardown state,
then free per-vCPU resource, and finally free per-VM resources.

Note, this knowingly creates a discrepancy in nomenclature for SVM as
svm_vm_teardown() invokes avic_vm_destroy() and sev_vm_destroy().
Moving the now-misnamed functions or renaming them is left to a future
patch so as not to introduce a functional change for SVM.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h | 1 +
 arch/x86/include/asm/kvm_host.h    | 1 +
 arch/x86/kvm/svm/svm.c             | 5 +++--
 arch/x86/kvm/vmx/vmx.c             | 7 +++++++
 arch/x86/kvm/x86.c                 | 3 ++-
 5 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 30754f2b8a99..1009541fd6c2 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -20,6 +20,7 @@ KVM_X86_OP(has_emulated_msr)
 KVM_X86_OP(vcpu_after_set_cpuid)
 KVM_X86_OP(is_vm_type_supported)
 KVM_X86_OP(vm_init)
+KVM_X86_OP_NULL(vm_teardown)
 KVM_X86_OP_NULL(vm_destroy)
 KVM_X86_OP(vcpu_create)
 KVM_X86_OP(vcpu_free)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 74c3c1629563..3bc5417f94ec 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1321,6 +1321,7 @@ struct kvm_x86_ops {
 	bool (*is_vm_type_supported)(unsigned long vm_type);
 	unsigned int vm_size;
 	int (*vm_init)(struct kvm *kvm);
+	void (*vm_teardown)(struct kvm *kvm);
 	void (*vm_destroy)(struct kvm *kvm);
 
 	/* Create, but do not attach this VCPU */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 2ef77d4566a9..1bf410add13b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4556,7 +4556,7 @@ static void svm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
 	sev_vcpu_deliver_sipi_vector(vcpu, vector);
 }
 
-static void svm_vm_destroy(struct kvm *kvm)
+static void svm_vm_teardown(struct kvm *kvm)
 {
 	avic_vm_destroy(kvm);
 	sev_vm_destroy(kvm);
@@ -4597,7 +4597,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.is_vm_type_supported = svm_is_vm_type_supported,
 	.vm_size = sizeof(struct kvm_svm),
 	.vm_init = svm_vm_init,
-	.vm_destroy = svm_vm_destroy,
+	.vm_teardown = svm_vm_teardown,
+	.vm_destroy = NULL,
 
 	.prepare_guest_switch = svm_prepare_guest_switch,
 	.vcpu_load = svm_vcpu_load,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f2905e00b063..5e4c6ac9fe69 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6890,6 +6890,11 @@ static int vmx_vm_init(struct kvm *kvm)
 	return 0;
 }
 
+static void vmx_vm_destroy(struct kvm *kvm)
+{
+
+}
+
 static int __init vmx_check_processor_compat(void)
 {
 	struct vmcs_config vmcs_conf;
@@ -7513,6 +7518,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.is_vm_type_supported = vmx_is_vm_type_supported,
 	.vm_size = sizeof(struct kvm_vmx),
 	.vm_init = vmx_vm_init,
+	.vm_teardown = NULL,
+	.vm_destroy = vmx_vm_destroy,
 
 	.vcpu_create = vmx_create_vcpu,
 	.vcpu_free = vmx_free_vcpu,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f2091c4b928a..88b55ddb22a7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11613,7 +11613,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 		__x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0);
 		mutex_unlock(&kvm->slots_lock);
 	}
-	static_call_cond(kvm_x86_vm_destroy)(kvm);
+	static_call_cond(kvm_x86_vm_teardown)(kvm);
 	kvm_free_msr_filter(srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu, 1));
 	kvm_pic_destroy(kvm);
 	kvm_ioapic_destroy(kvm);
@@ -11624,6 +11624,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_page_track_cleanup(kvm);
 	kvm_xen_destroy_vm(kvm);
 	kvm_hv_destroy_vm(kvm);
+	static_call_cond(kvm_x86_vm_destroy)(kvm);
 }
 
 static void memslot_rmap_free(struct kvm_memory_slot *slot)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 27/59] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (25 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 26/59] KVM: x86: Introduce vm_teardown() hook in kvm_arch_vm_destroy() isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 28/59] KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events() isaku.yamahata
                   ` (33 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson, Xiaoyao Li,
	Chao Gao

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a flag, KVM_DEBUGREG_AUTO_SWITCHED_GUEST, to skip saving/restoring DRs
irrespective of any other flags.  TDX-SEAM unconditionally saves and
restores guest DRs and reset to architectural INIT state on TD exit.
So, KVM needs to save host DRs before TD enter without restoring guest DRs
and restore host DRs after TD exit.

Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().

Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 10 ++++++++--
 arch/x86/kvm/x86.c              |  3 ++-
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3bc5417f94ec..a2bc4d9caaef 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -526,8 +526,14 @@ struct kvm_pmu {
 struct kvm_pmu_ops;
 
 enum {
-	KVM_DEBUGREG_BP_ENABLED = 1,
-	KVM_DEBUGREG_WONT_EXIT = 2,
+	KVM_DEBUGREG_BP_ENABLED		= BIT(0),
+	KVM_DEBUGREG_WONT_EXIT		= BIT(1),
+	KVM_DEBUGREG_RELOAD		= BIT(2),
+	/*
+	 * Guest debug registers are saved/restored by hardware on exit from
+	 * or enter guest. KVM needn't switch them.
+	 */
+	KVM_DEBUGREG_AUTO_SWITCH	= BIT(3),
 };
 
 struct kvm_mtrr_range {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 88b55ddb22a7..8161475082a7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9910,7 +9910,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
 		switch_fpu_return();
 
-	if (unlikely(vcpu->arch.switch_db_regs)) {
+	if (unlikely(vcpu->arch.switch_db_regs & ~KVM_DEBUGREG_AUTO_SWITCH)) {
 		set_debugreg(0, 7);
 		set_debugreg(vcpu->arch.eff_db[0], 0);
 		set_debugreg(vcpu->arch.eff_db[1], 1);
@@ -9950,6 +9950,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	 */
 	if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) {
 		WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP);
+		WARN_ON(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH);
 		static_call(kvm_x86_sync_dirty_debug_regs)(vcpu);
 		kvm_update_dr0123(vcpu);
 		kvm_update_dr7(vcpu);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 28/59] KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events()
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (26 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 27/59] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 20:50   ` Paolo Bonzini
  2021-11-25  0:20 ` [RFC PATCH v3 29/59] KVM: x86: Add option to force LAPIC expiration wait isaku.yamahata
                   ` (32 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Return true for kvm_vcpu_has_events() if the vCPU has a pending APICv
interrupt to support TDX's usage of APICv.  Unlike VMX, TDX doesn't have
access to vmcs.GUEST_INTR_STATUS and so can't emulate posted interrupts,
i.e. needs to generate a posted interrupt and more importantly can't
manually move requested interrupts into the vIRR (which it also doesn't
have access to).

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/x86.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8161475082a7..c6e56f105673 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11914,7 +11914,9 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
 
 	if (kvm_arch_interrupt_allowed(vcpu) &&
 	    (kvm_cpu_has_interrupt(vcpu) ||
-	    kvm_guest_apic_has_interrupt(vcpu)))
+	     kvm_guest_apic_has_interrupt(vcpu) ||
+	     (vcpu->arch.apicv_active &&
+	      kvm_x86_ops.dy_apicv_has_pending_interrupt(vcpu))))
 		return true;
 
 	if (kvm_hv_has_stimer_pending(vcpu))
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 29/59] KVM: x86: Add option to force LAPIC expiration wait
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (27 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 28/59] KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events() isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 19:53   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 30/59] KVM: x86: Add guest_supported_xss placholder isaku.yamahata
                   ` (31 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add an option to skip the IRR check in kvm_wait_lapic_expire().  This
will be used by TDX to wait if there is an outstanding notification for
a TD, i.e. a virtual interrupt is being triggered via posted interrupt
processing.  KVM TDX doesn't emulate PI processing, i.e. there will
never be a bit set in IRR/ISR, so the default behavior for APICv of
querying the IRR doesn't work as intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/lapic.c   | 4 ++--
 arch/x86/kvm/lapic.h   | 2 +-
 arch/x86/kvm/svm/svm.c | 2 +-
 arch/x86/kvm/vmx/vmx.c | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 1bfcd325d0d2..beef195b9d15 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1625,12 +1625,12 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
 		__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
 }
 
-void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
+void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu, bool force_wait)
 {
 	if (lapic_in_kernel(vcpu) &&
 	    vcpu->arch.apic->lapic_timer.expired_tscdeadline &&
 	    vcpu->arch.apic->lapic_timer.timer_advance_ns &&
-	    lapic_timer_int_injected(vcpu))
+	    (force_wait || lapic_timer_int_injected(vcpu)))
 		__kvm_wait_lapic_expire(vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_wait_lapic_expire);
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 2b44e533fc8d..2a0119ef9e96 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -233,7 +233,7 @@ static inline int kvm_lapic_latched_init(struct kvm_vcpu *vcpu)
 
 bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);
 
-void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu);
+void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu, bool force_wait);
 
 void kvm_bitmap_or_dest_vcpus(struct kvm *kvm, struct kvm_lapic_irq *irq,
 			      unsigned long *vcpu_bitmap);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 1bf410add13b..e33cbf9613d9 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3885,7 +3885,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
 	clgi();
 	kvm_load_guest_xsave_state(vcpu);
 
-	kvm_wait_lapic_expire(vcpu);
+	kvm_wait_lapic_expire(vcpu, false);
 
 	/*
 	 * If this vCPU has touched SPEC_CTRL, restore the guest's value if
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5e4c6ac9fe69..1e20e85a958f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6647,7 +6647,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	if (enable_preemption_timer)
 		vmx_update_hv_timer(vcpu);
 
-	kvm_wait_lapic_expire(vcpu);
+	kvm_wait_lapic_expire(vcpu, false);
 
 	/*
 	 * If this vCPU has touched SPEC_CTRL, restore the guest's value if
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 30/59] KVM: x86: Add guest_supported_xss placholder
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (28 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 29/59] KVM: x86: Add option to force LAPIC expiration wait isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 19:55   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 31/59] KVM: x86: Add infrastructure for stolen GPA bits isaku.yamahata
                   ` (30 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a per-vcpu placeholder for the support XSS of the guest so that the
TDX configuration code doesn't need to hack in manual computation of the
supported XSS.  KVM XSS enabling is currently being upstreamed, i.e.
guest_supported_xss will no longer be a placeholder by the time TDX is
ready for upstreaming (hopefully).

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a2bc4d9caaef..2dab0899d82c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -701,6 +701,7 @@ struct kvm_vcpu_arch {
 
 	u64 xcr0;
 	u64 guest_supported_xcr0;
+	u64 guest_supported_xss;
 
 	struct kvm_pio_request pio;
 	void *pio_data;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 31/59] KVM: x86: Add infrastructure for stolen GPA bits
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (29 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 30/59] KVM: x86: Add guest_supported_xss placholder isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 20:00   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 32/59] KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault isaku.yamahata
                   ` (29 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Rick Edgecombe

From: Rick Edgecombe <rick.p.edgecombe@intel.com>

Add support in KVM's MMU for aliasing multiple GPAs (from a hardware
perspective) to a single GPA (from a memslot perspective). GPA alising
will be used to repurpose GPA bits as attribute bits, e.g. to expose an
execute-only permission bit to the guest. To keep the implementation
simple (relatively speaking), GPA aliasing is only supported via TDP.

Today KVM assumes two things that are broken by GPA aliasing.
  1. GPAs coming from hardware can be simply shifted to get the GFNs.
  2. GPA bits 51:MAXPHYADDR are reserved to zero.

With GPA aliasing, translating a GPA to GFN requires masking off the
repurposed bit, and a repurposed bit may reside in 51:MAXPHYADDR.

To support GPA aliasing, introduce the concept of per-VM GPA stolen bits,
that is, bits stolen from the GPA to act as new virtualized attribute
bits. A bit in the mask will cause the MMU code to create aliases of the
GPA. It can also be used to find the GFN out of a GPA coming from a tdp
fault.

To handle case (1) from above, retain any stolen bits when passing a GPA
in KVM's MMU code, but strip them when converting to a GFN so that the
GFN contains only the "real" GFN, i.e. never has repurposed bits set.

GFNs (without stolen bits) continue to be used to:
	-Specify physical memory by userspace via memslots
	-Map GPAs to TDP PTEs via RMAP
	-Specify dirty tracking and write protection
	-Look up MTRR types
	-Inject async page faults

Since there are now multiple aliases for the same aliased GPA, when
userspace memory backing the memslots is paged out, both aliases need to be
modified. Fortunately this happens automatically. Since rmap supports
multiple mappings for the same GFN for PTE shadowing based paging, by
adding/removing each alias PTE with its GFN, kvm_handle_hva() based
operations will be applied to both aliases.

In the case of the rmap being removed in the future, the needed
information could be recovered by iterating over the stolen bits and
walking the TDP page tables.

For TLB flushes that are address based, make sure to flush both aliases
in the stolen bits case.

Only support stolen bits in 64 bit guest paging modes (long, PAE).
Features that use this infrastructure should restrict the stolen bits to
exclude the other paging modes. Don't support stolen bits for shadow EPT.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu.h              | 27 +++++++++++
 arch/x86/kvm/mmu/mmu.c          | 82 +++++++++++++++++++++++----------
 arch/x86/kvm/mmu/mmu_internal.h |  1 +
 arch/x86/kvm/mmu/paging_tmpl.h  | 25 ++++++----
 4 files changed, 100 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 9ae6168d381e..583483bb6f71 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -351,4 +351,31 @@ static inline void kvm_update_page_stats(struct kvm *kvm, int level, int count)
 {
 	atomic64_add(count, &kvm->stat.pages[level - 1]);
 }
+
+static inline gfn_t kvm_gfn_stolen_mask(struct kvm *kvm)
+{
+	/* Currently there are no stolen bits in KVM */
+	return 0;
+}
+
+static inline gfn_t vcpu_gfn_stolen_mask(struct kvm_vcpu *vcpu)
+{
+	return kvm_gfn_stolen_mask(vcpu->kvm);
+}
+
+static inline gpa_t kvm_gpa_stolen_mask(struct kvm *kvm)
+{
+	return kvm_gfn_stolen_mask(kvm) << PAGE_SHIFT;
+}
+
+static inline gpa_t vcpu_gpa_stolen_mask(struct kvm_vcpu *vcpu)
+{
+	return kvm_gpa_stolen_mask(vcpu->kvm);
+}
+
+static inline gfn_t vcpu_gpa_to_gfn_unalias(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+	return (gpa >> PAGE_SHIFT) & ~vcpu_gfn_stolen_mask(vcpu);
+}
+
 #endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1e11b14f4f82..5305d1d9a976 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -276,27 +276,37 @@ static inline bool kvm_available_flush_tlb_with_range(void)
 	return kvm_x86_ops.tlb_remote_flush_with_range;
 }
 
-static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
-		struct kvm_tlb_range *range)
-{
-	int ret = -ENOTSUPP;
-
-	if (range && kvm_x86_ops.tlb_remote_flush_with_range)
-		ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range);
-
-	if (ret)
-		kvm_flush_remote_tlbs(kvm);
-}
-
 void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
 		u64 start_gfn, u64 pages)
 {
 	struct kvm_tlb_range range;
+	u64 gfn_stolen_mask;
+
+	if (!kvm_available_flush_tlb_with_range())
+		goto generic_flush;
+
+	/*
+	 * Fall back to the big hammer flush if there is more than one
+	 * GPA alias that needs to be flushed.
+	 */
+	gfn_stolen_mask = kvm_gfn_stolen_mask(kvm);
+	if (hweight64(gfn_stolen_mask) > 1)
+		goto generic_flush;
 
 	range.start_gfn = start_gfn;
 	range.pages = pages;
+	if (static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, &range))
+		goto generic_flush;
 
-	kvm_flush_remote_tlbs_with_range(kvm, &range);
+	if (!gfn_stolen_mask)
+		return;
+
+	range.start_gfn |= gfn_stolen_mask;
+	static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, &range);
+	return;
+
+generic_flush:
+	kvm_flush_remote_tlbs(kvm);
 }
 
 static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
@@ -2067,14 +2077,16 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
-					     gfn_t gfn,
-					     gva_t gaddr,
-					     unsigned level,
-					     int direct,
-					     unsigned int access)
+static struct kvm_mmu_page *__kvm_mmu_get_page(struct kvm_vcpu *vcpu,
+					       gfn_t gfn,
+					       gfn_t gfn_stolen_bits,
+					       gva_t gaddr,
+					       unsigned int level,
+					       int direct,
+					       unsigned int access)
 {
 	bool direct_mmu = vcpu->arch.mmu->direct_map;
+	gpa_t gfn_and_stolen = gfn | gfn_stolen_bits;
 	union kvm_mmu_page_role role;
 	struct hlist_head *sp_list;
 	unsigned quadrant;
@@ -2094,9 +2106,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		role.quadrant = quadrant;
 	}
 
-	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
-		if (sp->gfn != gfn) {
+		if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
 			collisions++;
 			continue;
 		}
@@ -2152,6 +2164,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	sp = kvm_mmu_alloc_page(vcpu, direct);
 
 	sp->gfn = gfn;
+	sp->gfn_stolen_bits = gfn_stolen_bits;
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
 	if (!direct) {
@@ -2168,6 +2181,13 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+					     gva_t gaddr, unsigned int level,
+					     int direct, unsigned int access)
+{
+	return __kvm_mmu_get_page(vcpu, gfn, 0, gaddr, level, direct, access);
+}
+
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
 					struct kvm_vcpu *vcpu, hpa_t root,
 					u64 addr)
@@ -2768,7 +2788,9 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
 
 	gfn = kvm_mmu_page_get_gfn(sp, start - sp->spt);
 	slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, access & ACC_WRITE_MASK);
-	if (!slot)
+
+	/* Don't map private memslots for stolen bits */
+	if (!slot || (sp->gfn_stolen_bits && slot->id >= KVM_USER_MEM_SLOTS))
 		return -1;
 
 	ret = gfn_to_page_many_atomic(slot, gfn, pages, end - start);
@@ -2945,6 +2967,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct kvm_shadow_walk_iterator it;
 	struct kvm_mmu_page *sp;
 	int ret;
+	gpa_t gpa = fault->addr;
+	gpa_t gpa_stolen_mask = vcpu_gpa_stolen_mask(vcpu);
+	gfn_t gfn_stolen_bits = (gpa & gpa_stolen_mask) >> PAGE_SHIFT;
 	gfn_t base_gfn = fault->gfn;
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -2966,8 +2991,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (is_shadow_present_pte(*it.sptep))
 			continue;
 
-		sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
-				      it.level - 1, true, ACC_ALL);
+		sp = __kvm_mmu_get_page(vcpu, base_gfn, gfn_stolen_bits,
+					it.addr, it.level - 1, true, ACC_ALL);
 
 		link_shadow_page(vcpu, it.sptep, sp);
 		if (fault->is_tdp && fault->huge_page_disallowed &&
@@ -3923,6 +3948,13 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	struct kvm_memory_slot *slot = fault->slot;
 	bool async;
 
+	/* Don't expose aliases for no slot GFNs or private memslots */
+	if ((fault->addr & vcpu_gpa_stolen_mask(vcpu)) &&
+	    !kvm_is_visible_memslot(slot)) {
+		fault->pfn = KVM_PFN_NOSLOT;
+		return false;
+	}
+
 	/*
 	 * Retry the page fault if the gfn hit a memslot that is being deleted
 	 * or moved.  This ensures any existing SPTEs for the old memslot will
@@ -3986,7 +4018,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	unsigned long mmu_seq;
 	int r;
 
-	fault->gfn = fault->addr >> PAGE_SHIFT;
+	fault->gfn = vcpu_gpa_to_gfn_unalias(vcpu, fault->addr);
 	fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
 
 	if (page_fault_handle_page_track(vcpu, fault))
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 52c6527b1a06..90de2d2ebfff 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -49,6 +49,7 @@ struct kvm_mmu_page {
 	 */
 	union kvm_mmu_page_role role;
 	gfn_t gfn;
+	gfn_t gfn_stolen_bits;
 
 	u64 *spt;
 	/* hold the gfn of each spte inside spt */
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index f87d36898c44..3a515c71e09c 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -25,7 +25,8 @@
 	#define guest_walker guest_walker64
 	#define FNAME(name) paging##64_##name
 	#define PT_BASE_ADDR_MASK GUEST_PT64_BASE_ADDR_MASK
-	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_ADDR_MASK(vcpu, lvl) (~vcpu_gpa_stolen_mask(vcpu) & \
+					     PT64_LVL_ADDR_MASK(lvl))
 	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
 	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
 	#define PT_LEVEL_BITS PT64_LEVEL_BITS
@@ -44,7 +45,7 @@
 	#define guest_walker guest_walker32
 	#define FNAME(name) paging##32_##name
 	#define PT_BASE_ADDR_MASK PT32_BASE_ADDR_MASK
-	#define PT_LVL_ADDR_MASK(lvl) PT32_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_ADDR_MASK(vcpu, lvl) PT32_LVL_ADDR_MASK(lvl)
 	#define PT_LVL_OFFSET_MASK(lvl) PT32_LVL_OFFSET_MASK(lvl)
 	#define PT_INDEX(addr, level) PT32_INDEX(addr, level)
 	#define PT_LEVEL_BITS PT32_LEVEL_BITS
@@ -58,7 +59,7 @@
 	#define guest_walker guest_walkerEPT
 	#define FNAME(name) ept_##name
 	#define PT_BASE_ADDR_MASK GUEST_PT64_BASE_ADDR_MASK
-	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_ADDR_MASK(vcpu, lvl) PT64_LVL_ADDR_MASK(lvl)
 	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
 	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
 	#define PT_LEVEL_BITS PT64_LEVEL_BITS
@@ -75,7 +76,7 @@
 #define PT_GUEST_ACCESSED_MASK (1 << PT_GUEST_ACCESSED_SHIFT)
 
 #define gpte_to_gfn_lvl FNAME(gpte_to_gfn_lvl)
-#define gpte_to_gfn(pte) gpte_to_gfn_lvl((pte), PG_LEVEL_4K)
+#define gpte_to_gfn(vcpu, pte) gpte_to_gfn_lvl(vcpu, pte, PG_LEVEL_4K)
 
 /*
  * The guest_walker structure emulates the behavior of the hardware page
@@ -96,9 +97,9 @@ struct guest_walker {
 	struct x86_exception fault;
 };
 
-static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
+static gfn_t gpte_to_gfn_lvl(struct kvm_vcpu *vcpu, pt_element_t gpte, int lvl)
 {
-	return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
+	return (gpte & PT_LVL_ADDR_MASK(vcpu, lvl)) >> PAGE_SHIFT;
 }
 
 static inline void FNAME(protect_clean_gpte)(struct kvm_mmu *mmu, unsigned *access,
@@ -395,7 +396,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 		--walker->level;
 
 		index = PT_INDEX(addr, walker->level);
-		table_gfn = gpte_to_gfn(pte);
+		table_gfn = gpte_to_gfn(vcpu, pte);
 		offset    = index * sizeof(pt_element_t);
 		pte_gpa   = gfn_to_gpa(table_gfn) + offset;
 
@@ -461,7 +462,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	if (unlikely(errcode))
 		goto error;
 
-	gfn = gpte_to_gfn_lvl(pte, walker->level);
+	gfn = gpte_to_gfn_lvl(vcpu, pte, walker->level);
 	gfn += (addr & PT_LVL_OFFSET_MASK(walker->level)) >> PAGE_SHIFT;
 
 	if (PTTYPE == 32 && walker->level > PG_LEVEL_4K && is_cpuid_PSE36())
@@ -566,12 +567,14 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	gfn_t gfn;
 	kvm_pfn_t pfn;
 
+	WARN_ON(gpte & vcpu_gpa_stolen_mask(vcpu));
+
 	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
 		return false;
 
 	pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
 
-	gfn = gpte_to_gfn(gpte);
+	gfn = gpte_to_gfn(vcpu, gpte);
 	pte_access = sp->role.access & FNAME(gpte_access)(gpte);
 	FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
 
@@ -667,6 +670,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	WARN_ON_ONCE(gw->gfn != base_gfn);
 	direct_access = gw->pte_access;
 
+	WARN_ON(fault->addr & vcpu_gpa_stolen_mask(vcpu));
+
 	top_level = vcpu->arch.mmu->root_level;
 	if (top_level == PT32E_ROOT_LEVEL)
 		top_level = PT32_ROOT_LEVEL;
@@ -1111,7 +1116,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 			continue;
 		}
 
-		gfn = gpte_to_gfn(gpte);
+		gfn = gpte_to_gfn(vcpu, gpte);
 		pte_access = sp->role.access;
 		pte_access &= FNAME(gpte_access)(gpte);
 		FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 32/59] KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (30 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 31/59] KVM: x86: Add infrastructure for stolen GPA bits isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 33/59] KVM: x86/mmu: Ignore bits 63 and 62 when checking for "present" SPTEs isaku.yamahata
                   ` (28 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Explicity check for an MMIO spte in the fast page fault flow.  TDX will
use a not-present entry for MMIO sptes, which can be mistaken for an
access-tracked spte since both have SPTE_SPECIAL_MASK set.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5305d1d9a976..01569913f1f0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3186,7 +3186,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			break;
 
 		sp = sptep_to_sp(sptep);
-		if (!is_last_spte(spte, sp->role.level))
+		if (!is_last_spte(spte, sp->role.level) || is_mmio_spte(spte))
 			break;
 
 		/*
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 33/59] KVM: x86/mmu: Ignore bits 63 and 62 when checking for "present" SPTEs
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (31 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 32/59] KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 34/59] KVM: x86/mmu: Allow non-zero init value for shadow PTE isaku.yamahata
                   ` (27 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Ignore bits 63 and 62 when checking for present SPTEs to allow setting
said bits in not-present SPTEs.  TDX will set bit 63 in "zero" SPTEs to
suppress #VEs (TDX-SEAM unconditionally enables EPT Violation #VE), and
will use bit 62 to track zapped private SPTEs.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/paging_tmpl.h |  2 +-
 arch/x86/kvm/mmu/spte.h        | 13 +++++++++++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 3a515c71e09c..80e821996728 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1102,7 +1102,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		gpa_t pte_gpa;
 		gfn_t gfn;
 
-		if (!sp->spt[i])
+		if (!__is_shadow_present_pte(sp->spt[i]))
 			continue;
 
 		pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index cc432f9a966b..56b6dd750fb1 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -208,6 +208,19 @@ static inline bool is_mmio_spte(u64 spte)
 	       likely(shadow_mmio_value);
 }
 
+static inline bool __is_shadow_present_pte(u64 pte)
+{
+	/*
+	 * Ignore bits 63 and 62 so that they can be set in SPTEs that are well
+	 * and truly not present.  We can't use the sane/obvious approach of
+	 * querying bits 2:0 (RWX or P) because EPT without A/D bits will clear
+	 * RWX of a "present" SPTE to do access tracking.  Tracking updates can
+	 * be done out of mmu_lock, so even the flushing logic needs to treat
+	 * such SPTEs as present.
+	 */
+	return !!(pte << 2);
+}
+
 static inline bool is_shadow_present_pte(u64 pte)
 {
 	return !!(pte & SPTE_MMU_PRESENT_MASK);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 34/59] KVM: x86/mmu: Allow non-zero init value for shadow PTE
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (32 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 33/59] KVM: x86/mmu: Ignore bits 63 and 62 when checking for "present" SPTEs isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 35/59] KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits() isaku.yamahata
                   ` (26 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

TDX will run with EPT violation #VEs enabled, which means KVM needs to
set the "suppress #VE" bit in unused PTEs to avoid unintentionally
reflecting not-present EPT violations into the guest.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu.h      |  1 +
 arch/x86/kvm/mmu/mmu.c  | 50 +++++++++++++++++++++++++++++++++++------
 arch/x86/kvm/mmu/spte.c | 10 +++++++++
 arch/x86/kvm/mmu/spte.h |  2 ++
 4 files changed, 56 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 583483bb6f71..79ccee8bbc38 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -66,6 +66,7 @@ static __always_inline u64 rsvd_bits(int s, int e)
 
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
+void kvm_mmu_set_spte_init_value(u64 init_value);
 
 void kvm_init_mmu(struct kvm_vcpu *vcpu);
 void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 01569913f1f0..510991c2e94e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -623,9 +623,9 @@ static int mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
 	int level = sptep_to_sp(sptep)->role.level;
 
 	if (!spte_has_volatile_bits(old_spte))
-		__update_clear_spte_fast(sptep, 0ull);
+		__update_clear_spte_fast(sptep, shadow_init_value);
 	else
-		old_spte = __update_clear_spte_slow(sptep, 0ull);
+		old_spte = __update_clear_spte_slow(sptep, shadow_init_value);
 
 	if (!is_shadow_present_pte(old_spte))
 		return old_spte;
@@ -657,7 +657,7 @@ static int mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
  */
 static void mmu_spte_clear_no_track(u64 *sptep)
 {
-	__update_clear_spte_fast(sptep, 0ull);
+	__update_clear_spte_fast(sptep, shadow_init_value);
 }
 
 static u64 mmu_spte_get_lockless(u64 *sptep)
@@ -743,6 +743,42 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
 	}
 }
 
+static inline void kvm_init_shadow_page(void *page)
+{
+#ifdef CONFIG_X86_64
+	int ign;
+
+	asm volatile (
+		"rep stosq\n\t"
+		: "=c"(ign), "=D"(page)
+		: "a"(shadow_init_value), "c"(4096/8), "D"(page)
+		: "memory"
+	);
+#else
+	BUG();
+#endif
+}
+
+static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
+{
+	struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
+	int start, end, i, r;
+
+	if (shadow_init_value)
+		start = kvm_mmu_memory_cache_nr_free_objects(mc);
+
+	r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
+	if (r)
+		return r;
+
+	if (shadow_init_value) {
+		end = kvm_mmu_memory_cache_nr_free_objects(mc);
+		for (i = start; i < end; i++)
+			kvm_init_shadow_page(mc->objects[i]);
+	}
+	return 0;
+}
+
 static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 {
 	int r;
@@ -752,8 +788,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 				       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
 	if (r)
 		return r;
-	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
-				       PT64_ROOT_MAX_LEVEL);
+	r = mmu_topup_shadow_page_cache(vcpu);
 	if (r)
 		return r;
 	if (maybe_indirect) {
@@ -3165,7 +3200,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_mmu_page *sp;
 	int ret = RET_PF_INVALID;
-	u64 spte = 0ull;
+	u64 spte = shadow_init_value;
 	u64 *sptep = NULL;
 	uint retry_count = 0;
 
@@ -5584,7 +5619,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
 	vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
 
-	vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+	if (!shadow_init_value)
+		vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
 
 	vcpu->arch.mmu = &vcpu->arch.root_mmu;
 	vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 0c76c45fdb68..bb45e71eb105 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -34,6 +34,7 @@ u64 __read_mostly shadow_mmio_access_mask;
 u64 __read_mostly shadow_present_mask;
 u64 __read_mostly shadow_me_mask;
 u64 __read_mostly shadow_acc_track_mask;
+u64 __read_mostly shadow_init_value;
 
 u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
 u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
@@ -221,6 +222,14 @@ u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn)
 	return new_spte;
 }
 
+void kvm_mmu_set_spte_init_value(u64 init_value)
+{
+	if (WARN_ON(!IS_ENABLED(CONFIG_X86_64) && init_value))
+		init_value = 0;
+	shadow_init_value = init_value;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_set_spte_init_value);
+
 static u8 kvm_get_shadow_phys_bits(void)
 {
 	/*
@@ -365,6 +374,7 @@ void kvm_mmu_reset_all_pte_masks(void)
 	shadow_present_mask	= PT_PRESENT_MASK;
 	shadow_acc_track_mask	= 0;
 	shadow_me_mask		= sme_me_mask;
+	shadow_init_value	= 0;
 
 	shadow_host_writable_mask = DEFAULT_SPTE_HOST_WRITEABLE;
 	shadow_mmu_writable_mask  = DEFAULT_SPTE_MMU_WRITEABLE;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 56b6dd750fb1..b53b301301dc 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -146,6 +146,8 @@ extern u64 __read_mostly shadow_mmio_access_mask;
 extern u64 __read_mostly shadow_present_mask;
 extern u64 __read_mostly shadow_me_mask;
 
+extern u64 __read_mostly shadow_init_value;
+
 /*
  * SPTEs in MMUs without A/D bits are marked with SPTE_TDP_AD_DISABLED_MASK;
  * shadow_acc_track_mask is the set of bits to be cleared in non-accessed
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 35/59] KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits()
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (33 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 34/59] KVM: x86/mmu: Allow non-zero init value for shadow PTE isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 36/59] KVM: x86/mmu: Frame in support for private/inaccessible shadow pages isaku.yamahata
                   ` (25 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Return the old SPTE when clearing a SPTE and push the "old SPTE present"
check to the caller.  Private shadow page support will use the old SPTE
in rmap_remove() to determine whether or not there is a linked private
shadow page.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 510991c2e94e..f1bd7d952bfe 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -616,7 +616,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
  * state bits, it is used to clear the last level sptep.
  * Returns the old PTE.
  */
-static int mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
+static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
 {
 	kvm_pfn_t pfn;
 	u64 old_spte = *sptep;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 36/59] KVM: x86/mmu: Frame in support for private/inaccessible shadow pages
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (34 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 35/59] KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits() isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 37/59] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX isaku.yamahata
                   ` (24 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson, Kai Huang

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add kvm_x86_ops hooks to set/clear private SPTEs, i.e. SEPT entries, and
to link/free private shadow pages, i.e. non-leaf SEPT pages.

Because SEAMCALLs are bloody expensive, and because KVM's MMU is already
complex enough, TDX's SEPT will mirror KVM's shadow pages instead of
replacing them outright.  This costs extra memory, but is simpler and
far more performant.

Add a separate list for tracking active private shadow pages.  Zapping
and freeing SEPT entries is subject to very different rules than normal
pages/memory, and need to be preserved (along with their shadow page
counterparts) when KVM gets trigger happy, e.g. zaps everything during a
memslot update.

Zap any aliases of a GPA when mapping in a guest that supports guest
private GPAs.  This is necessary to avoid integrity failures with TDX
due to pointing shared and private GPAs at the same HPA.

Do not prefetch private pages (this should probably be a property of the
VM).

TDX needs one more bits in spte for SPTE_PRIVATE_ZAPPED. Steal 1 bit
from MMIO generation as 62 bit and re-purpose it for non-present
SPTE, SPTE_PRIVATE_ZAPPED.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |   7 +
 arch/x86/include/asm/kvm_host.h    |  32 +-
 arch/x86/kvm/mmu.h                 |  16 +-
 arch/x86/kvm/mmu/mmu.c             | 478 +++++++++++++++++++++++------
 arch/x86/kvm/mmu/mmu_internal.h    |  13 +-
 arch/x86/kvm/mmu/paging_tmpl.h     |  11 +-
 arch/x86/kvm/mmu/spte.h            |  16 +-
 arch/x86/kvm/x86.c                 |  10 +-
 virt/kvm/kvm_main.c                |   1 +
 9 files changed, 463 insertions(+), 121 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 1009541fd6c2..ef94a535f452 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -20,6 +20,7 @@ KVM_X86_OP(has_emulated_msr)
 KVM_X86_OP(vcpu_after_set_cpuid)
 KVM_X86_OP(is_vm_type_supported)
 KVM_X86_OP(vm_init)
+KVM_X86_OP_NULL(mmu_prezap)
 KVM_X86_OP_NULL(vm_teardown)
 KVM_X86_OP_NULL(vm_destroy)
 KVM_X86_OP(vcpu_create)
@@ -89,6 +90,12 @@ KVM_X86_OP(set_tss_addr)
 KVM_X86_OP(set_identity_map_addr)
 KVM_X86_OP(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP(set_private_spte)
+KVM_X86_OP(drop_private_spte)
+KVM_X86_OP(zap_private_spte)
+KVM_X86_OP(unzap_private_spte)
+KVM_X86_OP(link_private_sp)
+KVM_X86_OP(free_private_sp)
 KVM_X86_OP_NULL(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2dab0899d82c..7ce0641d54dd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -320,7 +320,8 @@ union kvm_mmu_page_role {
 		unsigned smap_andnot_wp:1;
 		unsigned ad_disabled:1;
 		unsigned guest_mode:1;
-		unsigned :6;
+		unsigned private:1;
+		unsigned :5;
 
 		/*
 		 * This is left at the top of the word so that
@@ -427,6 +428,7 @@ struct kvm_mmu {
 			 struct kvm_mmu_page *sp);
 	void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa);
 	hpa_t root_hpa;
+	hpa_t private_root_hpa;
 	gpa_t root_pgd;
 	union kvm_mmu_role mmu_role;
 	u8 root_level;
@@ -463,6 +465,8 @@ struct kvm_mmu {
 
 	struct rsvd_bits_validate guest_rsvd_check;
 
+	bool no_prefetch;
+
 	u64 pdptrs[4]; /* pae */
 };
 
@@ -685,6 +689,7 @@ struct kvm_vcpu_arch {
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
 	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
+	struct kvm_mmu_memory_cache mmu_private_sp_cache;
 
 	/*
 	 * QEMU userspace and the guest each have their own FPU state.
@@ -1052,6 +1057,7 @@ struct kvm_arch {
 	u8 mmu_valid_gen;
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 	struct list_head active_mmu_pages;
+	struct list_head private_mmu_pages;
 	struct list_head zapped_obsolete_pages;
 	struct list_head lpage_disallowed_mmu_pages;
 	struct kvm_page_track_notifier_node mmu_sp_tracker;
@@ -1236,6 +1242,8 @@ struct kvm_arch {
 	hpa_t	hv_root_tdp;
 	spinlock_t hv_root_tdp_lock;
 #endif
+
+	gfn_t gfn_shared_mask;
 };
 
 struct kvm_vm_stat {
@@ -1328,6 +1336,7 @@ struct kvm_x86_ops {
 	bool (*is_vm_type_supported)(unsigned long vm_type);
 	unsigned int vm_size;
 	int (*vm_init)(struct kvm *kvm);
+	void (*mmu_prezap)(struct kvm *kvm);
 	void (*vm_teardown)(struct kvm *kvm);
 	void (*vm_destroy)(struct kvm *kvm);
 
@@ -1423,6 +1432,17 @@ struct kvm_x86_ops {
 	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			     int root_level);
 
+	void (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				 kvm_pfn_t pfn);
+	void (*drop_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				  kvm_pfn_t pfn);
+	void (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level);
+	void (*unzap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level);
+	int (*link_private_sp)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+			       void *private_sp);
+	int (*free_private_sp)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+			       void *private_sp);
+
 	bool (*has_wbinvd_exit)(void);
 
 	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
@@ -1602,7 +1622,8 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot);
-void kvm_mmu_zap_all(struct kvm *kvm);
+void kvm_mmu_zap_all_active(struct kvm *kvm);
+void kvm_mmu_zap_all_private(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
 unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
@@ -1765,7 +1786,9 @@ static inline int __kvm_irq_line_state(unsigned long *irq_state,
 
 #define KVM_MMU_ROOT_CURRENT		BIT(0)
 #define KVM_MMU_ROOT_PREVIOUS(i)	BIT(1+i)
-#define KVM_MMU_ROOTS_ALL		(~0UL)
+#define KVM_MMU_ROOT_PRIVATE		BIT(1+KVM_MMU_NUM_PREV_ROOTS)
+#define KVM_MMU_ROOTS_ALL		((u32)(~KVM_MMU_ROOT_PRIVATE))
+#define KVM_MMU_ROOTS_ALL_INC_PRIVATE	(KVM_MMU_ROOTS_ALL | KVM_MMU_ROOT_PRIVATE)
 
 int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int level);
 void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id);
@@ -1973,4 +1996,7 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
 #define KVM_CLOCK_VALID_FLAGS						\
 	(KVM_CLOCK_TSC_STABLE | KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC)
 
+bool kvm_is_private_gfn(struct kvm *kvm, gfn_t gfn);
+bool kvm_vcpu_is_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 79ccee8bbc38..9c22dedbb228 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -355,13 +355,7 @@ static inline void kvm_update_page_stats(struct kvm *kvm, int level, int count)
 
 static inline gfn_t kvm_gfn_stolen_mask(struct kvm *kvm)
 {
-	/* Currently there are no stolen bits in KVM */
-	return 0;
-}
-
-static inline gfn_t vcpu_gfn_stolen_mask(struct kvm_vcpu *vcpu)
-{
-	return kvm_gfn_stolen_mask(vcpu->kvm);
+	return kvm->arch.gfn_shared_mask;
 }
 
 static inline gpa_t kvm_gpa_stolen_mask(struct kvm *kvm)
@@ -369,14 +363,14 @@ static inline gpa_t kvm_gpa_stolen_mask(struct kvm *kvm)
 	return kvm_gfn_stolen_mask(kvm) << PAGE_SHIFT;
 }
 
-static inline gpa_t vcpu_gpa_stolen_mask(struct kvm_vcpu *vcpu)
+static inline gpa_t kvm_gpa_unalias(struct kvm *kvm, gpa_t gpa)
 {
-	return kvm_gpa_stolen_mask(vcpu->kvm);
+	return gpa & ~kvm_gpa_stolen_mask(kvm);
 }
 
-static inline gfn_t vcpu_gpa_to_gfn_unalias(struct kvm_vcpu *vcpu, gpa_t gpa)
+static inline gfn_t kvm_gfn_unalias(struct kvm *kvm, gpa_t gpa)
 {
-	return (gpa >> PAGE_SHIFT) & ~vcpu_gfn_stolen_mask(vcpu);
+	return kvm_gpa_unalias(kvm, gpa) >> PAGE_SHIFT;
 }
 
 #endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f1bd7d952bfe..c3effcbb726e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -616,16 +616,16 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
  * state bits, it is used to clear the last level sptep.
  * Returns the old PTE.
  */
-static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
+static u64 __mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep, u64 clear_value)
 {
 	kvm_pfn_t pfn;
 	u64 old_spte = *sptep;
 	int level = sptep_to_sp(sptep)->role.level;
 
 	if (!spte_has_volatile_bits(old_spte))
-		__update_clear_spte_fast(sptep, shadow_init_value);
+		__update_clear_spte_fast(sptep, clear_value);
 	else
-		old_spte = __update_clear_spte_slow(sptep, shadow_init_value);
+		old_spte = __update_clear_spte_slow(sptep, clear_value);
 
 	if (!is_shadow_present_pte(old_spte))
 		return old_spte;
@@ -650,6 +650,11 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
 	return old_spte;
 }
 
+static inline u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
+{
+	return __mmu_spte_clear_track_bits(kvm, sptep, shadow_init_value);
+}
+
 /*
  * Rules for using mmu_spte_clear_no_track:
  * Directly clear spte without caring the state bits of sptep,
@@ -764,6 +769,13 @@ static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
 	struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
 	int start, end, i, r;
 
+	if (vcpu->kvm->arch.gfn_shared_mask) {
+		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_sp_cache,
+					       PT64_ROOT_MAX_LEVEL);
+		if (r)
+			return r;
+	}
+
 	if (shadow_init_value)
 		start = kvm_mmu_memory_cache_nr_free_objects(mc);
 
@@ -805,6 +817,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_sp_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
@@ -1058,34 +1071,6 @@ static void pte_list_remove(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 	__pte_list_remove(sptep, rmap_head);
 }
 
-/* Return true if rmap existed, false otherwise */
-static bool pte_list_destroy(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
-{
-	struct pte_list_desc *desc, *next;
-	int i;
-
-	if (!rmap_head->val)
-		return false;
-
-	if (!(rmap_head->val & 1)) {
-		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_head->val);
-		goto out;
-	}
-
-	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
-
-	for (; desc; desc = next) {
-		for (i = 0; i < desc->spte_count; i++)
-			mmu_spte_clear_track_bits(kvm, desc->sptes[i]);
-		next = desc->more;
-		mmu_free_pte_list_desc(desc);
-	}
-out:
-	/* rmap_head is meaningless now, remember to reset it */
-	rmap_head->val = 0;
-	return true;
-}
-
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
@@ -1123,7 +1108,7 @@ static bool rmap_can_add(struct kvm_vcpu *vcpu)
 	return kvm_mmu_memory_cache_nr_free_objects(mc);
 }
 
-static void rmap_remove(struct kvm *kvm, u64 *spte)
+static void rmap_remove(struct kvm *kvm, u64 *spte, u64 old_spte)
 {
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *slot;
@@ -1145,6 +1130,10 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
 	rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
 
 	__pte_list_remove(spte, rmap_head);
+
+	if (is_private_sp(sp))
+		static_call(kvm_x86_drop_private_spte)(
+			kvm, gfn, sp->role.level, spte_to_pfn(old_spte));
 }
 
 /*
@@ -1182,7 +1171,8 @@ static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
 	iter->pos = 0;
 	sptep = iter->desc->sptes[iter->pos];
 out:
-	BUG_ON(!is_shadow_present_pte(*sptep));
+	WARN_ON(!is_shadow_present_pte(*sptep) &&
+		!is_zapped_private_pte(*sptep));
 	return sptep;
 }
 
@@ -1227,8 +1217,9 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
 {
 	u64 old_spte = mmu_spte_clear_track_bits(kvm, sptep);
 
-	if (is_shadow_present_pte(old_spte))
-		rmap_remove(kvm, sptep);
+	if (is_shadow_present_pte(old_spte) ||
+	    is_zapped_private_pte(old_spte))
+		rmap_remove(kvm, sptep, old_spte);
 }
 
 
@@ -1483,17 +1474,153 @@ static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
 	return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn, PG_LEVEL_4K);
 }
 
+static bool kvm_mmu_zap_private_spte(struct kvm *kvm, u64 *sptep)
+{
+	struct kvm_mmu_page *sp;
+	kvm_pfn_t pfn;
+	gfn_t gfn;
+
+	/* Skip the lookup if the VM doesn't support private memory. */
+	if (likely(!kvm->arch.gfn_shared_mask))
+		return false;
+
+	if (!is_private_spte(sptep))
+		return false;
+
+	sp = sptep_to_sp(sptep);
+	gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
+	pfn = spte_to_pfn(*sptep);
+
+	static_call(kvm_x86_zap_private_spte)(kvm, gfn, sp->role.level);
+
+	__mmu_spte_clear_track_bits(kvm, sptep,
+				    SPTE_PRIVATE_ZAPPED | pfn << PAGE_SHIFT);
+	return true;
+}
+
+/* Return true if rmap existed, false otherwise */
+static bool __kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
+{
+#if 0
+	/* Non-optimized version. */
+	u64 *sptep;
+	struct rmap_iterator iter;
+	bool flush = false;
+
+restart:
+	for_each_rmap_spte(rmap_head, &iter, sptep) {
+		rmap_printk("%s: spte %p %llx.\n", __func__, sptep, *sptep);
+
+		if (is_zapped_private_pte(*sptep))
+			continue;
+
+		flush = true;
+
+		/* Keep the rmap if the private SPTE couldn't be zapped. */
+		if (kvm_mmu_zap_private_spte(kvm, sptep))
+			continue;
+
+		pte_list_remove(kvm, rmap_head, sptep);
+		goto restart;
+	}
+
+	return flush;
+#else
+	/*
+	 * optimized version following the commint.
+	 *
+	 * commit a75b540451d20ef1aebaa09d183ddc5c44c6f86a
+	 * Author: Peter Xu <peterx@redhat.com>
+	 * Date:   Fri Jul 30 18:06:05 2021 -0400
+	 *
+	 * KVM: X86: Optimize zapping rmap
+	 *
+	 * Using rmap_get_first() and rmap_remove() for zapping a huge rmap list could be
+	 * slow.  The easy way is to travers the rmap list, collecting the a/d bits and
+	 * free the slots along the way.
+	 *
+	 * Provide a pte_list_destroy() and do exactly that.
+	 */
+	struct pte_list_desc *desc, *prev, *next;
+	bool flush = false;
+	u64 *sptep;
+	int i;
+
+	if (!rmap_head->val)
+		return false;
+
+	if (!(rmap_head->val & 1)) {
+retry_head:
+		sptep = (u64 *)rmap_head->val;
+		if (is_zapped_private_pte(*sptep))
+			return flush;
+
+		flush = true;
+		/* Keep the rmap if the private SPTE couldn't be zapped. */
+		if (kvm_mmu_zap_private_spte(kvm, sptep))
+			goto retry_head;
+
+		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_head->val);
+		rmap_head->val = 0;
+		return true;
+	}
+
+retry:
+	prev = NULL;
+	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+
+	for (; desc; desc = next) {
+		for (i = 0; i < desc->spte_count; i++) {
+			sptep = desc->sptes[i];
+			if (is_zapped_private_pte(*sptep))
+				continue;
+
+			flush = true;
+			/*
+			 * Keep the rmap if the private SPTE couldn't be
+			 * zapped.
+			 */
+			if (kvm_mmu_zap_private_spte(kvm, sptep))
+				goto retry;
+
+			mmu_spte_clear_track_bits(kvm, desc->sptes[i]);
+
+			desc->spte_count--;
+			desc->sptes[i] = desc->sptes[desc->spte_count];
+			desc->sptes[desc->spte_count] = NULL;
+			i--;	/* start from same index. */
+		}
+
+		next = desc->more;
+		if (desc->spte_count) {
+			prev = desc;
+		} else {
+			if (!prev && !desc->more)
+				rmap_head->val = 0;
+			else
+				if (prev)
+					prev->more = next;
+				else
+					rmap_head->val = (unsigned long)desc->more | 1;
+			mmu_free_pte_list_desc(desc);
+		}
+	}
+
+	return flush;
+#endif
+}
+
 static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-			  const struct kvm_memory_slot *slot)
+			const struct kvm_memory_slot *slot)
 {
-	return pte_list_destroy(kvm, rmap_head);
+	return __kvm_zap_rmapp(kvm, rmap_head);
 }
 
 static bool kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 			    struct kvm_memory_slot *slot, gfn_t gfn, int level,
 			    pte_t unused)
 {
-	return kvm_zap_rmapp(kvm, rmap_head, slot);
+	return __kvm_zap_rmapp(kvm, rmap_head);
 }
 
 static bool kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
@@ -1516,6 +1643,9 @@ static bool kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
 		need_flush = 1;
 
+		/* Private page relocation is not yet supported. */
+		KVM_BUG_ON(is_private_spte(sptep), kvm);
+
 		if (pte_write(pte)) {
 			pte_list_remove(kvm, rmap_head, sptep);
 			goto restart;
@@ -1750,7 +1880,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
 	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
-static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
+static void kvm_mmu_free_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
 	hlist_del(&sp->hash_link);
@@ -1758,6 +1888,11 @@ static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct)
 		free_page((unsigned long)sp->gfns);
+	if (sp->private_sp &&
+	    !static_call(kvm_x86_free_private_sp)(kvm, sp->gfn, sp->role.level,
+						  sp->private_sp))
+		free_page((unsigned long)sp->private_sp);
+
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -1788,7 +1923,8 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 	mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
+static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
+					       int direct, bool private)
 {
 	struct kvm_mmu_page *sp;
 
@@ -1804,7 +1940,10 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct
 	 * comments in kvm_zap_obsolete_pages().
 	 */
 	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
-	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
+	if (private)
+		list_add(&sp->link, &vcpu->kvm->arch.private_mmu_pages);
+	else
+		list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
 	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
 	return sp;
 }
@@ -2112,16 +2251,15 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *__kvm_mmu_get_page(struct kvm_vcpu *vcpu,
-					       gfn_t gfn,
-					       gfn_t gfn_stolen_bits,
-					       gva_t gaddr,
-					       unsigned int level,
-					       int direct,
-					       unsigned int access)
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
+					     gfn_t gfn,
+					     gva_t gaddr,
+					     unsigned int level,
+					     int direct,
+					     unsigned int access,
+					     unsigned int private)
 {
 	bool direct_mmu = vcpu->arch.mmu->direct_map;
-	gpa_t gfn_and_stolen = gfn | gfn_stolen_bits;
 	union kvm_mmu_page_role role;
 	struct hlist_head *sp_list;
 	unsigned quadrant;
@@ -2140,10 +2278,11 @@ static struct kvm_mmu_page *__kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
 		role.quadrant = quadrant;
 	}
+	role.private = private;
 
-	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
+	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
-		if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
+		if (sp->gfn != gfn) {
 			collisions++;
 			continue;
 		}
@@ -2196,10 +2335,9 @@ static struct kvm_mmu_page *__kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 
 	++vcpu->kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_page(vcpu, direct);
+	sp = kvm_mmu_alloc_page(vcpu, direct, private);
 
 	sp->gfn = gfn;
-	sp->gfn_stolen_bits = gfn_stolen_bits;
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
 	if (!direct) {
@@ -2216,13 +2354,6 @@ static struct kvm_mmu_page *__kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					     gva_t gaddr, unsigned int level,
-					     int direct, unsigned int access)
-{
-	return __kvm_mmu_get_page(vcpu, gfn, 0, gaddr, level, direct, access);
-}
-
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
 					struct kvm_vcpu *vcpu, hpa_t root,
 					u64 addr)
@@ -2255,8 +2386,13 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
 static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
 			     struct kvm_vcpu *vcpu, u64 addr)
 {
-	shadow_walk_init_using_root(iterator, vcpu, vcpu->arch.mmu->root_hpa,
-				    addr);
+	hpa_t root;
+
+	if (tdp_enabled && kvm_vcpu_is_private_gfn(vcpu, addr >> PAGE_SHIFT))
+		root = vcpu->arch.mmu->private_root_hpa;
+	else
+		root = vcpu->arch.mmu->root_hpa;
+	shadow_walk_init_using_root(iterator, vcpu, root, addr);
 }
 
 static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
@@ -2333,13 +2469,16 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 	struct kvm_mmu_page *child;
 
 	pte = *spte;
-	if (is_shadow_present_pte(pte)) {
+	if (is_shadow_present_pte(pte) || is_zapped_private_pte(pte)) {
 		if (is_last_spte(pte, sp->role.level)) {
 			drop_spte(kvm, spte);
 		} else {
 			child = to_shadow_page(pte & PT64_BASE_ADDR_MASK);
 			drop_parent_pte(child, spte);
 
+			if (!is_shadow_present_pte(pte))
+				return 0;
+
 			/*
 			 * Recursively zap nested TDP SPs, parentless SPs are
 			 * unlikely to be used again in the near future.  This
@@ -2490,7 +2629,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 
 	list_for_each_entry_safe(sp, nsp, invalid_list, link) {
 		WARN_ON(!sp->role.invalid || sp->root_count);
-		kvm_mmu_free_page(sp);
+		kvm_mmu_free_page(kvm, sp);
 	}
 }
 
@@ -2751,6 +2890,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	bool host_writable = !fault || fault->map_writable;
 	bool prefetch = !fault || fault->prefetch;
 	bool write_fault = fault && fault->write;
+	u64 pte = *sptep;
 
 	pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
 		 *sptep, write_fault, gfn);
@@ -2760,25 +2900,27 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 		return RET_PF_EMULATE;
 	}
 
-	if (is_shadow_present_pte(*sptep)) {
+	if (is_shadow_present_pte(pte)) {
 		/*
 		 * If we overwrite a PTE page pointer with a 2MB PMD, unlink
 		 * the parent of the now unreachable PTE.
 		 */
-		if (level > PG_LEVEL_4K && !is_large_pte(*sptep)) {
+		if (level > PG_LEVEL_4K && !is_large_pte(pte)) {
 			struct kvm_mmu_page *child;
-			u64 pte = *sptep;
 
 			child = to_shadow_page(pte & PT64_BASE_ADDR_MASK);
 			drop_parent_pte(child, sptep);
 			flush = true;
-		} else if (pfn != spte_to_pfn(*sptep)) {
+		} else if (pfn != spte_to_pfn(pte)) {
 			pgprintk("hfn old %llx new %llx\n",
-				 spte_to_pfn(*sptep), pfn);
+				 spte_to_pfn(pte), pfn);
 			drop_spte(vcpu->kvm, sptep);
 			flush = true;
 		} else
 			was_rmapped = 1;
+	} else if (is_zapped_private_pte(pte)) {
+		WARN_ON(pfn != spte_to_pfn(pte));
+		was_rmapped = 1;
 	}
 
 	wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
@@ -2824,8 +2966,7 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
 	gfn = kvm_mmu_page_get_gfn(sp, start - sp->spt);
 	slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, access & ACC_WRITE_MASK);
 
-	/* Don't map private memslots for stolen bits */
-	if (!slot || (sp->gfn_stolen_bits && slot->id >= KVM_USER_MEM_SLOTS))
+	if (!slot || (is_private_sp(sp) && slot->id > KVM_USER_MEM_SLOTS))
 		return -1;
 
 	ret = gfn_to_page_many_atomic(slot, gfn, pages, end - start);
@@ -2997,15 +3138,105 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
 	}
 }
 
+bool kvm_is_private_gfn(struct kvm *kvm, gfn_t gfn)
+{
+	gfn_t gfn_shared_mask = kvm->arch.gfn_shared_mask;
+
+	return gfn_shared_mask && !(gfn_shared_mask & gfn);
+}
+
+bool kvm_vcpu_is_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	return kvm_is_private_gfn(vcpu->kvm, gfn);
+}
+
+static void kvm_mmu_link_private_sp(struct kvm_vcpu *vcpu,
+				    struct kvm_mmu_page *sp)
+{
+	void *p = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_sp_cache);
+
+	if (!static_call(kvm_x86_link_private_sp)(vcpu->kvm, sp->gfn,
+						  sp->role.level + 1, p))
+		sp->private_sp = p;
+	else
+		free_page((unsigned long)p);
+}
+
+static void kvm_mmu_zap_alias_spte(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+	struct kvm_shadow_walk_iterator it;
+	struct kvm_rmap_head *rmap_head;
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_memslots *slots;
+	struct kvm_memory_slot *slot;
+	struct rmap_iterator iter;
+	struct kvm_mmu_page *sp;
+	gpa_t gpa_alias = fault->addr ^
+		(vcpu->kvm->arch.gfn_shared_mask << PAGE_SHIFT);
+	u64 *sptep;
+	u64 spte;
+
+	for_each_shadow_entry(vcpu, gpa_alias, it) {
+		if (!is_shadow_present_pte(*it.sptep))
+			break;
+	}
+
+	spte = *it.sptep;
+	sp = sptep_to_sp(it.sptep);
+
+	if (!is_last_spte(spte, sp->role.level))
+		return;
+
+	/*
+	 * multiple vcpus can race to zap same alias spte when vcpus caused EPT
+	 * violation on same gpa and come to __direct_map() at the same time.
+	 * In such case, __direct_map() handles it as spurious.
+	 *
+	 * rmap (or __kvm_zap_rmapp()) doesn't distinguish private/shared gpa.
+	 * And rmap is not supposed to co-exit with both shared and private
+	 * spte.  Check if other vcpu already zapped alias and established rmap
+	 * for same gpa to avoid zapping faulting gpa.
+	 *
+	 * shared  gpa_alias: !is_shadow_present_pte(spte)
+	 *                    is_zapped_private_pte(spte) is always false
+	 * private gpa_alias: !is_shadow_present_pte(spte) &&
+	 *                    !is_zapped_private_pte(spte)
+	 */
+	if (!is_shadow_present_pte(spte) && !is_zapped_private_pte(spte))
+		return;
+
+	slots = kvm_memslots_for_spte_role(kvm, sp->role);
+	slot = __gfn_to_memslot(slots, fault->gfn);
+	rmap_head = gfn_to_rmap(fault->gfn, sp->role.level, slot);
+	if (__kvm_zap_rmapp(kvm, rmap_head))
+		kvm_flush_remote_tlbs_with_address(kvm, fault->gfn, 1);
+
+	if (!is_private_sp(sp))
+		return;
+
+	for_each_rmap_spte(rmap_head, &iter, sptep) {
+		if (!is_zapped_private_pte(*sptep))
+			continue;
+
+		drop_spte(kvm, sptep);
+	}
+}
+
 static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_shadow_walk_iterator it;
 	struct kvm_mmu_page *sp;
 	int ret;
-	gpa_t gpa = fault->addr;
-	gpa_t gpa_stolen_mask = vcpu_gpa_stolen_mask(vcpu);
-	gfn_t gfn_stolen_bits = (gpa & gpa_stolen_mask) >> PAGE_SHIFT;
 	gfn_t base_gfn = fault->gfn;
+	bool is_private = kvm_vcpu_is_private_gfn(vcpu, fault->addr >> PAGE_SHIFT);
+	bool is_zapped_pte;
+
+	if (is_error_noslot_pfn(fault->pfn) || kvm_is_reserved_pfn(fault->pfn)) {
+		if (is_private)
+			return -EFAULT;
+	} else if (vcpu->kvm->arch.gfn_shared_mask) {
+		kvm_mmu_zap_alias_spte(vcpu, fault);
+	}
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
 
@@ -3026,24 +3257,39 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (is_shadow_present_pte(*it.sptep))
 			continue;
 
-		sp = __kvm_mmu_get_page(vcpu, base_gfn, gfn_stolen_bits,
-					it.addr, it.level - 1, true, ACC_ALL);
+		sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr, it.level - 1,
+				true, ACC_ALL, is_private);
 
 		link_shadow_page(vcpu, it.sptep, sp);
 		if (fault->is_tdp && fault->huge_page_disallowed &&
 		    fault->req_level >= it.level)
 			account_huge_nx_page(vcpu->kvm, sp);
+		if (is_private)
+			kvm_mmu_link_private_sp(vcpu, sp);
 	}
 
 	if (WARN_ON_ONCE(it.level != fault->goal_level))
 		return -EFAULT;
 
+	is_zapped_pte = is_zapped_private_pte(*it.sptep);
+
 	ret = mmu_set_spte(vcpu, fault->slot, it.sptep, ACC_ALL,
 			   base_gfn, fault->pfn, fault);
 	if (ret == RET_PF_SPURIOUS)
 		return ret;
 
-	direct_pte_prefetch(vcpu, it.sptep);
+	if (!is_private) {
+		if (!vcpu->arch.mmu->no_prefetch)
+			direct_pte_prefetch(vcpu, it.sptep);
+	} else if (!WARN_ON_ONCE(ret != RET_PF_FIXED)) {
+		if (is_zapped_pte)
+			static_call(kvm_x86_unzap_private_spte)(
+				vcpu->kvm, base_gfn, it.level);
+		else
+			static_call(kvm_x86_set_private_spte)(
+				vcpu->kvm, base_gfn, it.level, fault->pfn);
+	}
+
 	++vcpu->stat.pf_fixed;
 	return ret;
 }
@@ -3204,6 +3450,14 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	u64 *sptep = NULL;
 	uint retry_count = 0;
 
+	/*
+	 * TDX private mapping doesn't support fast page fault, since there's
+	 * no secure EPT API to support it.
+	 */
+	if (fault->is_tdp &&
+		kvm_is_private_gfn(vcpu->kvm, fault->addr >> PAGE_SHIFT))
+		return ret;
+
 	if (!page_fault_can_be_fast(fault))
 		return ret;
 
@@ -3333,7 +3587,9 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 			    VALID_PAGE(mmu->prev_roots[i].hpa))
 				break;
 
-		if (i == KVM_MMU_NUM_PREV_ROOTS)
+		if (i == KVM_MMU_NUM_PREV_ROOTS &&
+		    (!(roots_to_free & KVM_MMU_ROOT_PRIVATE) ||
+		     !VALID_PAGE(mmu->private_root_hpa)))
 			return;
 	}
 
@@ -3362,6 +3618,9 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 		mmu->root_pgd = 0;
 	}
 
+	if (roots_to_free & KVM_MMU_ROOT_PRIVATE)
+		mmu_free_root_page(kvm, &mmu->private_root_hpa, &invalid_list);
+
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
 	write_unlock(&kvm->mmu_lock);
 }
@@ -3407,11 +3666,12 @@ static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
 }
 
 static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
-			    u8 level, bool direct)
+			    u8 level, bool direct, bool private)
 {
 	struct kvm_mmu_page *sp;
 
-	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
+	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL,
+			private);
 	++sp->root_count;
 
 	return __pa(sp->spt);
@@ -3421,6 +3681,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	u8 shadow_root_level = mmu->shadow_root_level;
+	gfn_t gfn_shared = vcpu->kvm->arch.gfn_shared_mask;
 	hpa_t root;
 	unsigned i;
 	int r;
@@ -3434,9 +3695,17 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 		root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
 		mmu->root_hpa = root;
 	} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
-		root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
-		mmu->root_hpa = root;
+		if (gfn_shared && !VALID_PAGE(vcpu->arch.mmu->private_root_hpa)) {
+			root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level,
+					true, true);
+			vcpu->arch.mmu->private_root_hpa = root;
+		}
+		root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true,
+				false);
+		vcpu->arch.mmu->root_hpa = root;
 	} else if (shadow_root_level == PT32E_ROOT_LEVEL) {
+		WARN_ON_ONCE(gfn_shared);
+
 		if (WARN_ON_ONCE(!mmu->pae_root)) {
 			r = -EIO;
 			goto out_unlock;
@@ -3446,7 +3715,8 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 			WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
 
 			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
-					      i << 30, PT32_ROOT_LEVEL, true);
+					      i << 30, PT32_ROOT_LEVEL, true,
+					      false);
 			mmu->pae_root[i] = root | PT_PRESENT_MASK |
 					   shadow_me_mask;
 		}
@@ -3569,8 +3839,8 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	 * write-protect the guests page table root.
 	 */
 	if (mmu->root_level >= PT64_ROOT_4LEVEL) {
-		root = mmu_alloc_root(vcpu, root_gfn, 0,
-				      mmu->shadow_root_level, false);
+		root = mmu_alloc_root(vcpu, root_gfn, 0, 0,
+				      vcpu->arch.mmu->shadow_root_level, false);
 		mmu->root_hpa = root;
 		goto set_root_pgd;
 	}
@@ -3615,7 +3885,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 			root_gfn = pdptrs[i] >> PAGE_SHIFT;
 		}
 
-		root = mmu_alloc_root(vcpu, root_gfn, i << 30,
+		root = mmu_alloc_root(vcpu, root_gfn, 0, i << 30,
 				      PT32_ROOT_LEVEL, false);
 		mmu->pae_root[i] = root | pm_mask;
 	}
@@ -3984,7 +4254,7 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	bool async;
 
 	/* Don't expose aliases for no slot GFNs or private memslots */
-	if ((fault->addr & vcpu_gpa_stolen_mask(vcpu)) &&
+	if ((fault->addr & kvm_gpa_stolen_mask(vcpu->kvm)) &&
 	    !kvm_is_visible_memslot(slot)) {
 		fault->pfn = KVM_PFN_NOSLOT;
 		return false;
@@ -4053,7 +4323,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	unsigned long mmu_seq;
 	int r;
 
-	fault->gfn = vcpu_gpa_to_gfn_unalias(vcpu, fault->addr);
+	fault->gfn = kvm_gfn_unalias(vcpu->kvm, fault->addr);
 	fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
 
 	if (page_fault_handle_page_track(vcpu, fault))
@@ -4150,7 +4420,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	while (fault->max_level > PG_LEVEL_4K) {
 		int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
-		gfn_t base = (fault->addr >> PAGE_SHIFT) & ~(page_num - 1);
+		gfn_t base = kvm_gfn_unalias(vcpu->kvm, fault->addr) &
+			~(page_num - 1);
 
 		if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
 			break;
@@ -5153,14 +5424,19 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 	return r;
 }
 
-void kvm_mmu_unload(struct kvm_vcpu *vcpu)
+static void __kvm_mmu_unload(struct kvm_vcpu *vcpu, u32 roots_to_free)
 {
-	kvm_mmu_free_roots(vcpu, &vcpu->arch.root_mmu, KVM_MMU_ROOTS_ALL);
+	kvm_mmu_free_roots(vcpu, &vcpu->arch.root_mmu, roots_to_free);
 	WARN_ON(VALID_PAGE(vcpu->arch.root_mmu.root_hpa));
-	kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
+	kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu, roots_to_free);
 	WARN_ON(VALID_PAGE(vcpu->arch.guest_mmu.root_hpa));
 }
 
+void kvm_mmu_unload(struct kvm_vcpu *vcpu)
+{
+	__kvm_mmu_unload(vcpu, KVM_MMU_ROOTS_ALL);
+}
+
 static bool need_remote_flush(u64 old, u64 new)
 {
 	if (!is_shadow_present_pte(old))
@@ -5565,8 +5841,10 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
 	int i;
 
 	mmu->root_hpa = INVALID_PAGE;
+	mmu->private_root_hpa = INVALID_PAGE;
 	mmu->root_pgd = 0;
 	mmu->translate_gpa = translate_gpa;
+	mmu->no_prefetch = false;
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
 		mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
 
@@ -5912,6 +6190,9 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 		sp = sptep_to_sp(sptep);
 		pfn = spte_to_pfn(*sptep);
 
+		/* Private page dirty logging is not supported. */
+		KVM_BUG_ON(is_private_spte(sptep), kvm);
+
 		/*
 		 * We cannot do huge page mapping for indirect shadow pages,
 		 * which are found on the last rmap (level = 1) when not using
@@ -6010,7 +6291,7 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
-void kvm_mmu_zap_all(struct kvm *kvm)
+static void __kvm_mmu_zap_all(struct kvm *kvm, struct list_head *mmu_pages)
 {
 	struct kvm_mmu_page *sp, *node;
 	LIST_HEAD(invalid_list);
@@ -6018,7 +6299,7 @@ void kvm_mmu_zap_all(struct kvm *kvm)
 
 	write_lock(&kvm->mmu_lock);
 restart:
-	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
+	list_for_each_entry_safe(sp, node, mmu_pages, link) {
 		if (WARN_ON(sp->role.invalid))
 			continue;
 		if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign))
@@ -6026,7 +6307,6 @@ void kvm_mmu_zap_all(struct kvm *kvm)
 		if (cond_resched_rwlock_write(&kvm->mmu_lock))
 			goto restart;
 	}
-
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
 
 	if (is_tdp_mmu_enabled(kvm))
@@ -6035,6 +6315,12 @@ void kvm_mmu_zap_all(struct kvm *kvm)
 	write_unlock(&kvm->mmu_lock);
 }
 
+void kvm_mmu_zap_all_active(struct kvm *kvm)
+{
+	__kvm_mmu_zap_all(kvm, &kvm->arch.active_mmu_pages);
+	__kvm_mmu_zap_all(kvm, &kvm->arch.private_mmu_pages);
+}
+
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
 {
 	WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
@@ -6254,7 +6540,7 @@ unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm)
 
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
 {
-	kvm_mmu_unload(vcpu);
+	__kvm_mmu_unload(vcpu, KVM_MMU_ROOTS_ALL_INC_PRIVATE);
 	free_mmu_pages(&vcpu->arch.root_mmu);
 	free_mmu_pages(&vcpu->arch.guest_mmu);
 	mmu_free_memory_caches(vcpu);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 90de2d2ebfff..d718a1783112 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -49,11 +49,12 @@ struct kvm_mmu_page {
 	 */
 	union kvm_mmu_page_role role;
 	gfn_t gfn;
-	gfn_t gfn_stolen_bits;
 
 	u64 *spt;
 	/* hold the gfn of each spte inside spt */
 	gfn_t *gfns;
+	/* associated private shadow page, e.g. SEPT page */
+	void *private_sp;
 	/* Currently serving as active root */
 	union {
 		int root_count;
@@ -105,6 +106,16 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
 	return kvm_mmu_role_as_id(sp->role);
 }
 
+static inline bool is_private_sp(struct kvm_mmu_page *sp)
+{
+	return sp->role.private;
+}
+
+static inline bool is_private_spte(u64 *sptep)
+{
+	return is_private_sp(sptep_to_sp(sptep));
+}
+
 static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
 {
 	/*
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 80e821996728..3e1759c68912 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -25,7 +25,7 @@
 	#define guest_walker guest_walker64
 	#define FNAME(name) paging##64_##name
 	#define PT_BASE_ADDR_MASK GUEST_PT64_BASE_ADDR_MASK
-	#define PT_LVL_ADDR_MASK(vcpu, lvl) (~vcpu_gpa_stolen_mask(vcpu) & \
+	#define PT_LVL_ADDR_MASK(vcpu, lvl) (~kvm_gpa_stolen_mask(vcpu->kvm) & \
 					     PT64_LVL_ADDR_MASK(lvl))
 	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
 	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
@@ -567,7 +567,7 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	gfn_t gfn;
 	kvm_pfn_t pfn;
 
-	WARN_ON(gpte & vcpu_gpa_stolen_mask(vcpu));
+	WARN_ON(gpte & kvm_gpa_stolen_mask(vcpu->kvm));
 
 	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
 		return false;
@@ -670,7 +670,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	WARN_ON_ONCE(gw->gfn != base_gfn);
 	direct_access = gw->pte_access;
 
-	WARN_ON(fault->addr & vcpu_gpa_stolen_mask(vcpu));
+	WARN_ON(fault->addr & kvm_gpa_stolen_mask(vcpu->kvm));
 
 	top_level = vcpu->arch.mmu->root_level;
 	if (top_level == PT32E_ROOT_LEVEL)
@@ -700,7 +700,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 			table_gfn = gw->table_gfn[it.level - 2];
 			access = gw->pt_access[it.level - 2];
 			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
-					      it.level-1, false, access);
+					      it.level-1, false, access, false);
 			/*
 			 * We must synchronize the pagetable before linking it
 			 * because the guest doesn't need to flush tlb when
@@ -757,7 +757,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 
 		if (!is_shadow_present_pte(*it.sptep)) {
 			sp = kvm_mmu_get_page(vcpu, base_gfn, fault->addr,
-					      it.level - 1, true, direct_access);
+					      it.level - 1, true, direct_access,
+					      false);
 			link_shadow_page(vcpu, it.sptep, sp);
 			if (fault->huge_page_disallowed &&
 			    fault->req_level >= it.level)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index b53b301301dc..43f0d2a773f7 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -14,6 +14,9 @@
  */
 #define SPTE_MMU_PRESENT_MASK		BIT_ULL(11)
 
+/* Masks that used to track metadata for not-present SPTEs. */
+#define SPTE_PRIVATE_ZAPPED	BIT_ULL(62)
+
 /*
  * TDP SPTES (more specifically, EPT SPTEs) may not have A/D bits, and may also
  * be restricted to using write-protection (for L2 when CPU dirty logging, i.e.
@@ -95,11 +98,11 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
 #undef SHADOW_ACC_TRACK_SAVED_MASK
 
 /*
- * Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
+ * Due to limited space in PTEs, the MMIO generation is a 18 bit subset of
  * the memslots generation and is derived as follows:
  *
  * Bits 0-7 of the MMIO generation are propagated to spte bits 3-10
- * Bits 8-18 of the MMIO generation are propagated to spte bits 52-62
+ * Bits 8-17 of the MMIO generation are propagated to spte bits 52-61
  *
  * The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
  * the MMIO generation number, as doing so would require stealing a bit from
@@ -113,7 +116,7 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
 #define MMIO_SPTE_GEN_LOW_END		10
 
 #define MMIO_SPTE_GEN_HIGH_START	52
-#define MMIO_SPTE_GEN_HIGH_END		62
+#define MMIO_SPTE_GEN_HIGH_END		61
 
 #define MMIO_SPTE_GEN_LOW_MASK		GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
 						    MMIO_SPTE_GEN_LOW_START)
@@ -126,7 +129,7 @@ static_assert(!(SPTE_MMU_PRESENT_MASK &
 #define MMIO_SPTE_GEN_HIGH_BITS		(MMIO_SPTE_GEN_HIGH_END - MMIO_SPTE_GEN_HIGH_START + 1)
 
 /* remember to adjust the comment above as well if you change these */
-static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
+static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 10);
 
 #define MMIO_SPTE_GEN_LOW_SHIFT		(MMIO_SPTE_GEN_LOW_START - 0)
 #define MMIO_SPTE_GEN_HIGH_SHIFT	(MMIO_SPTE_GEN_HIGH_START - MMIO_SPTE_GEN_LOW_BITS)
@@ -267,6 +270,11 @@ static inline bool is_access_track_spte(u64 spte)
 	return !spte_ad_enabled(spte) && (spte & shadow_acc_track_mask) == 0;
 }
 
+static inline bool is_zapped_private_pte(u64 pte)
+{
+	return !!(pte & SPTE_PRIVATE_ZAPPED);
+}
+
 static inline bool is_large_pte(u64 pte)
 {
 	return pte & PT_PAGE_SIZE_MASK;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c6e56f105673..4dd8ec2641a2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11436,6 +11436,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 	INIT_HLIST_HEAD(&kvm->arch.mask_notifier_list);
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+	INIT_LIST_HEAD(&kvm->arch.private_mmu_pages);
 	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
 	INIT_LIST_HEAD(&kvm->arch.lpage_disallowed_mmu_pages);
 	INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);
@@ -11872,7 +11873,14 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
-	kvm_mmu_zap_all(kvm);
+	/*
+	 * kvm_mmu_zap_all_active() zaps both private and shared page tables.
+	 * Before tearing down private page tables, TDX requires TD has started
+	 * to be destroyed (i.e. keyID must have been reclaimed, etc).  Invoke
+	 * kvm_x86_mmu_prezap() for this.
+	 */
+	static_call_cond(kvm_x86_mmu_prezap)(kvm);
+	kvm_mmu_zap_all_active(kvm);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5bf456734e7d..b56d97d43ad9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -190,6 +190,7 @@ bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 
 	return true;
 }
+EXPORT_SYMBOL_GPL(kvm_is_reserved_pfn);
 
 /*
  * Switches to specified vcpu, until a matching vcpu_put()
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 37/59] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (35 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 36/59] KVM: x86/mmu: Frame in support for private/inaccessible shadow pages isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 38/59] KVM: x86/mmu: Allow per-VM override of the TDP max page level isaku.yamahata
                   ` (23 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Introduce a helper to directly (pun intended) fault-in a TDP page
without having to go through the full page fault path.  This allows
TDX to get the resulting pfn and also allows the RET_PF_* enums to
stay in mmu.c where they belong.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu.h     |  3 +++
 arch/x86/kvm/mmu/mmu.c | 38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 9c22dedbb228..168dcd4e0102 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -202,6 +202,9 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	return vcpu->arch.mmu->page_fault(vcpu, &fault);
 }
 
+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+			       u32 error_code, int max_level);
+
 /*
  * Currently, we have two sorts of write-protection, a) the first one
  * write-protects guest page to sync the guest modification, b) another one is
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c3effcbb726e..63d2e3e85c08 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4432,6 +4432,44 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	return direct_page_fault(vcpu, fault);
 }
 
+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+			       u32 error_code, int max_level)
+{
+	int r;
+	struct kvm_page_fault fault = (struct kvm_page_fault) {
+		.addr = gpa,
+		.error_code = error_code,
+		.exec = error_code & PFERR_FETCH_MASK,
+		.write = error_code & PFERR_WRITE_MASK,
+		.present = error_code & PFERR_PRESENT_MASK,
+		.rsvd = error_code & PFERR_RSVD_MASK,
+		.user = error_code & PFERR_USER_MASK,
+		.prefetch = false,
+		.is_tdp = true,
+		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(),
+	};
+
+	if (mmu_topup_memory_caches(vcpu, false))
+		return KVM_PFN_ERR_FAULT;
+
+	/*
+	 * Loop on the page fault path to handle the case where an mmu_notifier
+	 * invalidation triggers RET_PF_RETRY.  In the normal page fault path,
+	 * KVM needs to resume the guest in case the invalidation changed any
+	 * of the page fault properties, i.e. the gpa or error code.  For this
+	 * path, the gpa and error code are fixed by the caller, and the caller
+	 * expects failure if and only if the page fault can't be fixed.
+	 */
+	do {
+		fault.max_level = max_level;
+		fault.req_level = PG_LEVEL_4K;
+		fault.goal_level = PG_LEVEL_4K;
+		r = direct_page_fault(vcpu, &fault);
+	} while (r == RET_PF_RETRY && !is_error_noslot_pfn(fault.pfn));
+	return fault.pfn;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_map_tdp_page);
+
 static void nonpaging_init_context(struct kvm_mmu *context)
 {
 	context->page_fault = nonpaging_page_fault;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 38/59] KVM: x86/mmu: Allow per-VM override of the TDP max page level
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (36 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 37/59] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 39/59] KVM: VMX: Modify NMI and INTR handlers to take intr_info as param isaku.yamahata
                   ` (22 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

TODO: This is tentative patch.  Support large page and delete this patch.

Allow TDX to effectively disable large pages, as SEPT will initially
support only 4k pages.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/mmu.h              | 2 +-
 arch/x86/kvm/mmu/mmu.c          | 2 ++
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7ce0641d54dd..a2027bc14e41 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1054,6 +1054,7 @@ struct kvm_arch {
 	unsigned long n_requested_mmu_pages;
 	unsigned long n_max_mmu_pages;
 	unsigned int indirect_shadow_pages;
+	int tdp_max_page_level;
 	u8 mmu_valid_gen;
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 	struct list_head active_mmu_pages;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 168dcd4e0102..2c4b8fde66d9 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -191,7 +191,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		.is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
 		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(),
 
-		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
+		.max_level = vcpu->kvm->arch.tdp_max_page_level,
 		.req_level = PG_LEVEL_4K,
 		.goal_level = PG_LEVEL_4K,
 	};
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 63d2e3e85c08..7d3830508a44 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6102,6 +6102,8 @@ void kvm_mmu_init_vm(struct kvm *kvm)
 	node->track_write = kvm_mmu_pte_write;
 	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
 	kvm_page_track_register_notifier(kvm, node);
+
+	kvm->arch.tdp_max_page_level = KVM_MAX_HUGEPAGE_LEVEL;
 }
 
 void kvm_mmu_uninit_vm(struct kvm *kvm)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 39/59] KVM: VMX: Modify NMI and INTR handlers to take intr_info as param
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (37 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 38/59] KVM: x86/mmu: Allow per-VM override of the TDP max page level isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 20:06   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 40/59] KVM: VMX: Move NMI/exception handler to common helper isaku.yamahata
                   ` (21 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Pass intr_info to the NMI and INTR handlers instead of pulling it from
vcpu_vmx in preparation for sharing the bulk of the handlers with TDX.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1e20e85a958f..f243d5ecf543 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6302,25 +6302,24 @@ static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
 	kvm_after_interrupt(vcpu);
 }
 
-static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
+static void handle_exception_nmi_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
 {
 	const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
-	u32 intr_info = vmx_get_intr_info(&vmx->vcpu);
 
 	/* if exit due to PF check for async PF */
 	if (is_page_fault(intr_info))
-		vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
+		vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
 	/* Handle machine checks before interrupts are enabled */
 	else if (is_machine_check(intr_info))
 		kvm_machine_check();
 	/* We need to handle NMIs before interrupts are enabled */
 	else if (is_nmi(intr_info))
-		handle_interrupt_nmi_irqoff(&vmx->vcpu, nmi_entry);
+		handle_interrupt_nmi_irqoff(vcpu, nmi_entry);
 }
 
-static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
+static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
+					     u32 intr_info)
 {
-	u32 intr_info = vmx_get_intr_info(vcpu);
 	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
 	gate_desc *desc = (gate_desc *)host_idt_base + vector;
 
@@ -6339,9 +6338,9 @@ static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 		return;
 
 	if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
-		handle_external_interrupt_irqoff(vcpu);
+		handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
 	else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
-		handle_exception_nmi_irqoff(vmx);
+		handle_exception_nmi_irqoff(vcpu, vmx_get_intr_info(vcpu));
 }
 
 /*
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 40/59] KVM: VMX: Move NMI/exception handler to common helper
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (38 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 39/59] KVM: VMX: Modify NMI and INTR handlers to take intr_info as param isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 20:06   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 41/59] KVM: VMX: Split out guts of EPT violation to common/exposed function isaku.yamahata
                   ` (20 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/common.h | 52 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c    | 52 ++++++---------------------------------
 2 files changed, 60 insertions(+), 44 deletions(-)
 create mode 100644 arch/x86/kvm/vmx/common.h

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
new file mode 100644
index 000000000000..81c73f30d01d
--- /dev/null
+++ b/arch/x86/kvm/vmx/common.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_X86_VMX_COMMON_H
+#define __KVM_X86_VMX_COMMON_H
+
+#include <linux/kvm_host.h>
+
+#include <asm/traps.h>
+
+#include "vmcs.h"
+#include "x86.h"
+
+extern unsigned long vmx_host_idt_base;
+void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
+
+static inline void vmx_handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
+				     unsigned long entry)
+{
+	kvm_before_interrupt(vcpu);
+	vmx_do_interrupt_nmi_irqoff(entry);
+	kvm_after_interrupt(vcpu);
+}
+
+static inline void vmx_handle_exception_nmi_irqoff(struct kvm_vcpu *vcpu,
+						   u32 intr_info)
+{
+	const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
+
+	/* if exit due to PF check for async PF */
+	if (is_page_fault(intr_info))
+		vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
+	/* Handle machine checks before interrupts are enabled */
+	else if (is_machine_check(intr_info))
+		kvm_machine_check();
+	/* We need to handle NMIs before interrupts are enabled */
+	else if (is_nmi(intr_info))
+		vmx_handle_interrupt_nmi_irqoff(vcpu, nmi_entry);
+}
+
+static inline void vmx_handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
+							u32 intr_info)
+{
+	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
+	gate_desc *desc = (gate_desc *)vmx_host_idt_base + vector;
+
+	if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
+	    "KVM: unexpected VM-Exit interrupt info: 0x%x", intr_info))
+		return;
+
+	vmx_handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
+}
+
+#endif /* __KVM_X86_VMX_COMMON_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f243d5ecf543..4730fd15bea6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -49,6 +49,7 @@
 #include <asm/vmx.h>
 
 #include "capabilities.h"
+#include "common.h"
 #include "cpuid.h"
 #include "evmcs.h"
 #include "hyperv.h"
@@ -452,7 +453,7 @@ static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
 	vmx->segment_cache.bitmask = 0;
 }
 
-static unsigned long host_idt_base;
+unsigned long vmx_host_idt_base;
 
 #if IS_ENABLED(CONFIG_HYPERV)
 static bool __read_mostly enlightened_vmcs = true;
@@ -3995,7 +3996,7 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
 	vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
 	vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8);  /* 22.2.4 */
 
-	vmcs_writel(HOST_IDTR_BASE, host_idt_base);   /* 22.2.4 */
+	vmcs_writel(HOST_IDTR_BASE, vmx_host_idt_base);   /* 22.2.4 */
 
 	vmcs_writel(HOST_RIP, (unsigned long)vmx_vmexit); /* 22.2.5 */
 
@@ -4726,7 +4727,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
 	intr_info = vmx_get_intr_info(vcpu);
 
 	if (is_machine_check(intr_info) || is_nmi(intr_info))
-		return 1; /* handled by handle_exception_nmi_irqoff() */
+		return 1; /* handled by vmx_handle_exception_nmi_irqoff() */
 
 	if (is_invalid_opcode(intr_info))
 		return handle_ud(vcpu);
@@ -6292,44 +6293,6 @@ static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 	memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir));
 }
 
-void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
-
-static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
-					unsigned long entry)
-{
-	kvm_before_interrupt(vcpu);
-	vmx_do_interrupt_nmi_irqoff(entry);
-	kvm_after_interrupt(vcpu);
-}
-
-static void handle_exception_nmi_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
-{
-	const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
-
-	/* if exit due to PF check for async PF */
-	if (is_page_fault(intr_info))
-		vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
-	/* Handle machine checks before interrupts are enabled */
-	else if (is_machine_check(intr_info))
-		kvm_machine_check();
-	/* We need to handle NMIs before interrupts are enabled */
-	else if (is_nmi(intr_info))
-		handle_interrupt_nmi_irqoff(vcpu, nmi_entry);
-}
-
-static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
-					     u32 intr_info)
-{
-	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
-	gate_desc *desc = (gate_desc *)host_idt_base + vector;
-
-	if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
-	    "KVM: unexpected VM-Exit interrupt info: 0x%x", intr_info))
-		return;
-
-	handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
-}
-
 static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6338,9 +6301,10 @@ static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 		return;
 
 	if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
-		handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
+		vmx_handle_external_interrupt_irqoff(vcpu,
+						     vmx_get_intr_info(vcpu));
 	else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
-		handle_exception_nmi_irqoff(vcpu, vmx_get_intr_info(vcpu));
+		vmx_handle_exception_nmi_irqoff(vcpu, vmx_get_intr_info(vcpu));
 }
 
 /*
@@ -7678,7 +7642,7 @@ static __init int hardware_setup(void)
 	int r, ept_lpage_level;
 
 	store_idt(&dt);
-	host_idt_base = dt.address;
+	vmx_host_idt_base = dt.address;
 
 	vmx_setup_user_return_msrs();
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 41/59] KVM: VMX: Split out guts of EPT violation to common/exposed function
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (39 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 40/59] KVM: VMX: Move NMI/exception handler to common helper isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 20:07   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 42/59] KVM: VMX: Define EPT Violation architectural bits isaku.yamahata
                   ` (19 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/common.h | 29 +++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c    | 33 +++++----------------------------
 2 files changed, 34 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 81c73f30d01d..9e5865b05d47 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -5,8 +5,11 @@
 #include <linux/kvm_host.h>
 
 #include <asm/traps.h>
+#include <asm/vmx.h>
 
+#include "mmu.h"
 #include "vmcs.h"
+#include "vmx.h"
 #include "x86.h"
 
 extern unsigned long vmx_host_idt_base;
@@ -49,4 +52,30 @@ static inline void vmx_handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
 	vmx_handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
 }
 
+static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
+					     unsigned long exit_qualification)
+{
+	u64 error_code;
+
+	/* Is it a read fault? */
+	error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
+		     ? PFERR_USER_MASK : 0;
+	/* Is it a write fault? */
+	error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
+		      ? PFERR_WRITE_MASK : 0;
+	/* Is it a fetch fault? */
+	error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
+		      ? PFERR_FETCH_MASK : 0;
+	/* ept page table entry is present? */
+	error_code |= (exit_qualification &
+		       (EPT_VIOLATION_READABLE | EPT_VIOLATION_WRITABLE |
+			EPT_VIOLATION_EXECUTABLE))
+		      ? PFERR_PRESENT_MASK : 0;
+
+	error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
+	       PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
+
+	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+}
+
 #endif /* __KVM_X86_VMX_COMMON_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4730fd15bea6..92b2c4ac627c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5230,11 +5230,10 @@ static int handle_task_switch(struct kvm_vcpu *vcpu)
 
 static int handle_ept_violation(struct kvm_vcpu *vcpu)
 {
-	unsigned long exit_qualification;
-	gpa_t gpa;
-	u64 error_code;
+	unsigned long exit_qualification = vmx_get_exit_qual(vcpu);
+	gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
 
-	exit_qualification = vmx_get_exit_qual(vcpu);
+	trace_kvm_page_fault(gpa, exit_qualification);
 
 	/*
 	 * EPT violation happened while executing iret from NMI,
@@ -5243,31 +5242,9 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	 * AAK134, BY25.
 	 */
 	if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
-			enable_vnmi &&
-			(exit_qualification & INTR_INFO_UNBLOCK_NMI))
+	    enable_vnmi && (exit_qualification & INTR_INFO_UNBLOCK_NMI))
 		vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO, GUEST_INTR_STATE_NMI);
 
-	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
-	trace_kvm_page_fault(gpa, exit_qualification);
-
-	/* Is it a read fault? */
-	error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
-		     ? PFERR_USER_MASK : 0;
-	/* Is it a write fault? */
-	error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
-		      ? PFERR_WRITE_MASK : 0;
-	/* Is it a fetch fault? */
-	error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
-		      ? PFERR_FETCH_MASK : 0;
-	/* ept page table entry is present? */
-	error_code |= (exit_qualification &
-		       (EPT_VIOLATION_READABLE | EPT_VIOLATION_WRITABLE |
-			EPT_VIOLATION_EXECUTABLE))
-		      ? PFERR_PRESENT_MASK : 0;
-
-	error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
-	       PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
-
 	vcpu->arch.exit_qualification = exit_qualification;
 
 	/*
@@ -5281,7 +5258,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
 		return kvm_emulate_instruction(vcpu, 0);
 
-	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+	return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
 }
 
 static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 42/59] KVM: VMX: Define EPT Violation architectural bits
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (40 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 41/59] KVM: VMX: Split out guts of EPT violation to common/exposed function isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 43/59] KVM: VMX: Define VMCS encodings for shared EPT pointer isaku.yamahata
                   ` (18 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Define the EPT Violation #VE control bit, #VE info VMCS fields, and the
suppress #VE bit for EPT entries.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/vmx.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 035dfdafa2c1..132981276a2f 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -78,6 +78,7 @@ struct vmcs {
 #define SECONDARY_EXEC_ENCLS_EXITING		VMCS_CONTROL_BIT(ENCLS_EXITING)
 #define SECONDARY_EXEC_RDSEED_EXITING		VMCS_CONTROL_BIT(RDSEED_EXITING)
 #define SECONDARY_EXEC_ENABLE_PML               VMCS_CONTROL_BIT(PAGE_MOD_LOGGING)
+#define SECONDARY_EXEC_EPT_VIOLATION_VE		VMCS_CONTROL_BIT(EPT_VIOLATION_VE)
 #define SECONDARY_EXEC_PT_CONCEAL_VMX		VMCS_CONTROL_BIT(PT_CONCEAL_VMX)
 #define SECONDARY_EXEC_XSAVES			VMCS_CONTROL_BIT(XSAVES)
 #define SECONDARY_EXEC_MODE_BASED_EPT_EXEC	VMCS_CONTROL_BIT(MODE_BASED_EPT_EXEC)
@@ -226,6 +227,8 @@ enum vmcs_field {
 	VMREAD_BITMAP_HIGH              = 0x00002027,
 	VMWRITE_BITMAP                  = 0x00002028,
 	VMWRITE_BITMAP_HIGH             = 0x00002029,
+	VE_INFO_ADDRESS                 = 0x0000202A,
+	VE_INFO_ADDRESS_HIGH            = 0x0000202B,
 	XSS_EXIT_BITMAP                 = 0x0000202C,
 	XSS_EXIT_BITMAP_HIGH            = 0x0000202D,
 	ENCLS_EXITING_BITMAP		= 0x0000202E,
@@ -509,6 +512,7 @@ enum vmcs_field {
 #define VMX_EPT_IPAT_BIT    			(1ull << 6)
 #define VMX_EPT_ACCESS_BIT			(1ull << 8)
 #define VMX_EPT_DIRTY_BIT			(1ull << 9)
+#define VMX_EPT_SUPPRESS_VE_BIT			(1ull << 63)
 #define VMX_EPT_RWX_MASK                        (VMX_EPT_READABLE_MASK |       \
 						 VMX_EPT_WRITABLE_MASK |       \
 						 VMX_EPT_EXECUTABLE_MASK)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 43/59] KVM: VMX: Define VMCS encodings for shared EPT pointer
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (41 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 42/59] KVM: VMX: Define EPT Violation architectural bits isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 44/59] KVM: VMX: Add 'main.c' to wrap VMX and TDX isaku.yamahata
                   ` (17 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add the VMCS field encoding for the shared EPTP, which will be used by
TDX to have separate EPT walks for private GPAs (existing EPTP) versus
shared GPAs (new shared EPTP).

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/vmx.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 132981276a2f..56b3d32941fd 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -235,6 +235,8 @@ enum vmcs_field {
 	ENCLS_EXITING_BITMAP_HIGH	= 0x0000202F,
 	TSC_MULTIPLIER                  = 0x00002032,
 	TSC_MULTIPLIER_HIGH             = 0x00002033,
+	SHARED_EPT_POINTER		= 0x0000203C,
+	SHARED_EPT_POINTER_HIGH		= 0x0000203D,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
 	VMCS_LINK_POINTER               = 0x00002800,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 44/59] KVM: VMX: Add 'main.c' to wrap VMX and TDX
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (42 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 43/59] KVM: VMX: Define VMCS encodings for shared EPT pointer isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 45/59] KVM: VMX: Move setting of EPT MMU masks to common VT-x code isaku.yamahata
                   ` (16 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson,
	kernel test robot, Xiaoyao Li

From: Sean Christopherson <sean.j.christopherson@intel.com>

Wrap the VMX kvm_x86_ops hooks in preparation of adding TDX, which can
coexist with VMX, i.e. KVM can run both VMs and TDs.  Use 'vt' for the
naming scheme as a nod to VT-x and as a concatenation of VmxTdx.

Reported-by: kernel test robot <lkp@intel.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/Makefile      |   2 +-
 arch/x86/kvm/vmx/main.c    | 729 +++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c     | 463 ++++++++---------------
 arch/x86/kvm/vmx/x86_ops.h | 124 +++++++
 4 files changed, 1000 insertions(+), 318 deletions(-)
 create mode 100644 arch/x86/kvm/vmx/main.c
 create mode 100644 arch/x86/kvm/vmx/x86_ops.h

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index c7eceff49e67..d28f990bd81d 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -27,7 +27,7 @@ kvm-$(CONFIG_X86_64) += mmu/tdp_iter.o mmu/tdp_mmu.o
 kvm-$(CONFIG_KVM_XEN)	+= xen.o
 
 kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
-			   vmx/evmcs.o vmx/nested.o vmx/posted_intr.o
+			   vmx/evmcs.o vmx/nested.o vmx/posted_intr.o vmx/main.o
 kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
 kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx_error.o
 
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
new file mode 100644
index 000000000000..146dd6a317e5
--- /dev/null
+++ b/arch/x86/kvm/vmx/main.c
@@ -0,0 +1,729 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/moduleparam.h>
+
+#include "x86_ops.h"
+#include "vmx.h"
+#include "nested.h"
+#include "pmu.h"
+
+static int __init vt_cpu_has_kvm_support(void)
+{
+	return cpu_has_vmx();
+}
+
+static int __init vt_disabled_by_bios(void)
+{
+	return vmx_disabled_by_bios();
+}
+
+static int __init vt_check_processor_compatibility(void)
+{
+	int ret;
+
+	ret = vmx_check_processor_compat();
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static __init int vt_hardware_setup(void)
+{
+	int ret;
+
+	ret = hardware_setup();
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int vt_hardware_enable(void)
+{
+	return hardware_enable();
+}
+
+static void vt_hardware_disable(void)
+{
+	hardware_disable();
+}
+
+static bool vt_cpu_has_accelerated_tpr(void)
+{
+	return report_flexpriority();
+}
+
+static bool vt_is_vm_type_supported(unsigned long type)
+{
+	return type == KVM_X86_LEGACY_VM;
+}
+
+static int vt_vm_init(struct kvm *kvm)
+{
+	return vmx_vm_init(kvm);
+}
+
+static void vt_mmu_prezap(struct kvm *kvm)
+{
+}
+
+static void vt_vm_destroy(struct kvm *kvm)
+{
+}
+
+static int vt_vcpu_create(struct kvm_vcpu *vcpu)
+{
+	return vmx_create_vcpu(vcpu);
+}
+
+static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
+{
+	return vmx_vcpu_run(vcpu);
+}
+
+static void vt_vcpu_free(struct kvm_vcpu *vcpu)
+{
+	return vmx_free_vcpu(vcpu);
+}
+
+static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+	return vmx_vcpu_reset(vcpu, init_event);
+}
+
+static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+	return vmx_vcpu_load(vcpu, cpu);
+}
+
+static void vt_vcpu_put(struct kvm_vcpu *vcpu)
+{
+	return vmx_vcpu_put(vcpu);
+}
+
+static int vt_handle_exit(struct kvm_vcpu *vcpu,
+			     enum exit_fastpath_completion fastpath)
+{
+	return vmx_handle_exit(vcpu, fastpath);
+}
+
+static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+{
+	vmx_handle_exit_irqoff(vcpu);
+}
+
+static int vt_skip_emulated_instruction(struct kvm_vcpu *vcpu)
+{
+	return vmx_skip_emulated_instruction(vcpu);
+}
+
+static void vt_update_emulated_instruction(struct kvm_vcpu *vcpu)
+{
+	vmx_update_emulated_instruction(vcpu);
+}
+
+static int vt_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+	return vmx_set_msr(vcpu, msr_info);
+}
+
+static int vt_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+	return vmx_smi_allowed(vcpu, for_injection);
+}
+
+static int vt_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
+{
+	return vmx_enter_smm(vcpu, smstate);
+}
+
+static int vt_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
+{
+	return vmx_leave_smm(vcpu, smstate);
+}
+
+static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
+{
+	/* RSM will cause a vmexit anyway.  */
+	vmx_enable_smi_window(vcpu);
+}
+
+static bool vt_can_emulate_instruction(struct kvm_vcpu *vcpu, void *insn,
+				       int insn_len)
+{
+	return vmx_can_emulate_instruction(vcpu, insn, insn_len);
+}
+
+static int vt_check_intercept(struct kvm_vcpu *vcpu,
+				 struct x86_instruction_info *info,
+				 enum x86_intercept_stage stage,
+				 struct x86_exception *exception)
+{
+	return vmx_check_intercept(vcpu, info, stage, exception);
+}
+
+static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+{
+	return vmx_apic_init_signal_blocked(vcpu);
+}
+
+static void vt_migrate_timers(struct kvm_vcpu *vcpu)
+{
+	vmx_migrate_timers(vcpu);
+}
+
+static void vt_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
+{
+	return vmx_set_virtual_apic_mode(vcpu);
+}
+
+static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
+{
+	return vmx_apicv_post_state_restore(vcpu);
+}
+
+static bool vt_check_apicv_inhibit_reasons(ulong bit)
+{
+	return vmx_check_apicv_inhibit_reasons(bit);
+}
+
+static void vt_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
+{
+	return vmx_hwapic_irr_update(vcpu, max_irr);
+}
+
+static void vt_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
+{
+	return vmx_hwapic_isr_update(vcpu, max_isr);
+}
+
+static bool vt_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
+{
+	return vmx_guest_apic_has_interrupt(vcpu);
+}
+
+static int vt_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+{
+	return vmx_sync_pir_to_irr(vcpu);
+}
+
+static int vt_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
+{
+	return vmx_deliver_posted_interrupt(vcpu, vector);
+}
+
+static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+{
+	return vmx_vcpu_after_set_cpuid(vcpu);
+}
+
+/*
+ * The kvm parameter can be NULL (module initialization, or invocation before
+ * VM creation). Be sure to check the kvm parameter before using it.
+ */
+static bool vt_has_emulated_msr(struct kvm *kvm, u32 index)
+{
+	return vmx_has_emulated_msr(kvm, index);
+}
+
+static void vt_msr_filter_changed(struct kvm_vcpu *vcpu)
+{
+	vmx_msr_filter_changed(vcpu);
+}
+
+static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+	vmx_prepare_switch_to_guest(vcpu);
+}
+
+static void vt_update_exception_bitmap(struct kvm_vcpu *vcpu)
+{
+	vmx_update_exception_bitmap(vcpu);
+}
+
+static int vt_get_msr_feature(struct kvm_msr_entry *msr)
+{
+	return vmx_get_msr_feature(msr);
+}
+
+static int vt_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+	return vmx_get_msr(vcpu, msr_info);
+}
+
+static u64 vt_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+	return vmx_get_segment_base(vcpu, seg);
+}
+
+static void vt_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+			      int seg)
+{
+	vmx_get_segment(vcpu, var, seg);
+}
+
+static void vt_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+			      int seg)
+{
+	vmx_set_segment(vcpu, var, seg);
+}
+
+static int vt_get_cpl(struct kvm_vcpu *vcpu)
+{
+	return vmx_get_cpl(vcpu);
+}
+
+static void vt_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+{
+	vmx_get_cs_db_l_bits(vcpu, db, l);
+}
+
+static void vt_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+	vmx_set_cr0(vcpu, cr0);
+}
+
+static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
+			    int pgd_level)
+{
+	vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
+}
+
+static void vt_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+	vmx_set_cr4(vcpu, cr4);
+}
+
+static bool vt_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+	return vmx_is_valid_cr4(vcpu, cr4);
+}
+
+static int vt_set_efer(struct kvm_vcpu *vcpu, u64 efer)
+{
+	return vmx_set_efer(vcpu, efer);
+}
+
+static void vt_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+	vmx_get_idt(vcpu, dt);
+}
+
+static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+	vmx_set_idt(vcpu, dt);
+}
+
+static void vt_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+	vmx_get_gdt(vcpu, dt);
+}
+
+static void vt_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+	vmx_set_gdt(vcpu, dt);
+}
+
+static void vt_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+{
+	vmx_set_dr7(vcpu, val);
+}
+
+static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+{
+	vmx_sync_dirty_debug_regs(vcpu);
+}
+
+static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+{
+	vmx_cache_reg(vcpu, reg);
+}
+
+static unsigned long vt_get_rflags(struct kvm_vcpu *vcpu)
+{
+	return vmx_get_rflags(vcpu);
+}
+
+static void vt_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
+{
+	vmx_set_rflags(vcpu, rflags);
+}
+
+static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
+{
+	vmx_flush_tlb_all(vcpu);
+}
+
+static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+	vmx_flush_tlb_current(vcpu);
+}
+
+static void vt_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
+{
+	vmx_flush_tlb_gva(vcpu, addr);
+}
+
+static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
+{
+	vmx_flush_tlb_guest(vcpu);
+}
+
+static void vt_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
+{
+	vmx_set_interrupt_shadow(vcpu, mask);
+}
+
+static u32 vt_get_interrupt_shadow(struct kvm_vcpu *vcpu)
+{
+	return vmx_get_interrupt_shadow(vcpu);
+}
+
+static void vt_patch_hypercall(struct kvm_vcpu *vcpu,
+				  unsigned char *hypercall)
+{
+	vmx_patch_hypercall(vcpu, hypercall);
+}
+
+static void vt_inject_irq(struct kvm_vcpu *vcpu)
+{
+	vmx_inject_irq(vcpu);
+}
+
+static void vt_inject_nmi(struct kvm_vcpu *vcpu)
+{
+	vmx_inject_nmi(vcpu);
+}
+
+static void vt_queue_exception(struct kvm_vcpu *vcpu)
+{
+	vmx_queue_exception(vcpu);
+}
+
+static void vt_cancel_injection(struct kvm_vcpu *vcpu)
+{
+	vmx_cancel_injection(vcpu);
+}
+
+static int vt_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+	return vmx_interrupt_allowed(vcpu, for_injection);
+}
+
+static int vt_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+	return vmx_nmi_allowed(vcpu, for_injection);
+}
+
+static bool vt_get_nmi_mask(struct kvm_vcpu *vcpu)
+{
+	return vmx_get_nmi_mask(vcpu);
+}
+
+static void vt_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked)
+{
+	vmx_set_nmi_mask(vcpu, masked);
+}
+
+static void vt_enable_nmi_window(struct kvm_vcpu *vcpu)
+{
+	vmx_enable_nmi_window(vcpu);
+}
+
+static void vt_enable_irq_window(struct kvm_vcpu *vcpu)
+{
+	vmx_enable_irq_window(vcpu);
+}
+
+static void vt_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
+{
+	vmx_update_cr8_intercept(vcpu, tpr, irr);
+}
+
+static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+{
+	vmx_set_apic_access_page_addr(vcpu);
+}
+
+static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
+{
+	vmx_refresh_apicv_exec_ctrl(vcpu);
+}
+
+static void vt_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
+{
+	vmx_load_eoi_exitmap(vcpu, eoi_exit_bitmap);
+}
+
+static int vt_set_tss_addr(struct kvm *kvm, unsigned int addr)
+{
+	return vmx_set_tss_addr(kvm, addr);
+}
+
+static int vt_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
+{
+	return vmx_set_identity_map_addr(kvm, ident_addr);
+}
+
+static u64 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+{
+	return vmx_get_mt_mask(vcpu, gfn, is_mmio);
+}
+
+static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+			u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
+{
+	vmx_get_exit_info(vcpu, reason, info1, info2, intr_info, error_code);
+}
+
+static u64 vt_get_l2_tsc_offset(struct kvm_vcpu *vcpu)
+{
+	return vmx_get_l2_tsc_offset(vcpu);
+}
+
+static u64 vt_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
+{
+	return vmx_get_l2_tsc_multiplier(vcpu);
+}
+
+static void vt_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
+{
+	vmx_write_tsc_offset(vcpu, offset);
+}
+
+static void vt_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
+{
+	vmx_write_tsc_multiplier(vcpu, multiplier);
+}
+
+static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
+{
+	vmx_request_immediate_exit(vcpu);
+}
+
+static void vt_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+	vmx_sched_in(vcpu, cpu);
+}
+
+static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
+{
+	vmx_update_cpu_dirty_logging(vcpu);
+}
+
+static int vt_pre_block(struct kvm_vcpu *vcpu)
+{
+	if (pi_pre_block(vcpu))
+		return 1;
+
+	return vmx_pre_block(vcpu);
+}
+
+static void vt_post_block(struct kvm_vcpu *vcpu)
+{
+	vmx_post_block(vcpu);
+
+	pi_post_block(vcpu);
+}
+
+
+#ifdef CONFIG_X86_64
+static int vt_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+			      bool *expired)
+{
+	return vmx_set_hv_timer(vcpu, guest_deadline_tsc, expired);
+}
+
+static void vt_cancel_hv_timer(struct kvm_vcpu *vcpu)
+{
+	vmx_cancel_hv_timer(vcpu);
+}
+#endif
+
+static void vt_setup_mce(struct kvm_vcpu *vcpu)
+{
+	vmx_setup_mce(vcpu);
+}
+
+struct kvm_x86_ops vt_x86_ops __initdata = {
+	.name = "kvm_intel",
+
+	.hardware_unsetup = hardware_unsetup,
+
+	.hardware_enable = vt_hardware_enable,
+	.hardware_disable = vt_hardware_disable,
+	.cpu_has_accelerated_tpr = vt_cpu_has_accelerated_tpr,
+	.has_emulated_msr = vt_has_emulated_msr,
+
+	.is_vm_type_supported = vt_is_vm_type_supported,
+	.vm_size = sizeof(struct kvm_vmx),
+	.vm_init = vt_vm_init,
+	.mmu_prezap = vt_mmu_prezap,
+	.vm_teardown = NULL,
+	.vm_destroy = vt_vm_destroy,
+
+	.vcpu_create = vt_vcpu_create,
+	.vcpu_free = vt_vcpu_free,
+	.vcpu_reset = vt_vcpu_reset,
+
+	.prepare_guest_switch = vt_prepare_switch_to_guest,
+	.vcpu_load = vt_vcpu_load,
+	.vcpu_put = vt_vcpu_put,
+
+	.update_exception_bitmap = vt_update_exception_bitmap,
+	.get_msr_feature = vt_get_msr_feature,
+	.get_msr = vt_get_msr,
+	.set_msr = vt_set_msr,
+	.get_segment_base = vt_get_segment_base,
+	.get_segment = vt_get_segment,
+	.set_segment = vt_set_segment,
+	.get_cpl = vt_get_cpl,
+	.get_cs_db_l_bits = vt_get_cs_db_l_bits,
+	.set_cr0 = vt_set_cr0,
+	.is_valid_cr4 = vt_is_valid_cr4,
+	.set_cr4 = vt_set_cr4,
+	.set_efer = vt_set_efer,
+	.get_idt = vt_get_idt,
+	.set_idt = vt_set_idt,
+	.get_gdt = vt_get_gdt,
+	.set_gdt = vt_set_gdt,
+	.set_dr7 = vt_set_dr7,
+	.sync_dirty_debug_regs = vt_sync_dirty_debug_regs,
+	.cache_reg = vt_cache_reg,
+	.get_rflags = vt_get_rflags,
+	.set_rflags = vt_set_rflags,
+
+	.tlb_flush_all = vt_flush_tlb_all,
+	.tlb_flush_current = vt_flush_tlb_current,
+	.tlb_flush_gva = vt_flush_tlb_gva,
+	.tlb_flush_guest = vt_flush_tlb_guest,
+
+	.run = vt_vcpu_run,
+	.handle_exit = vt_handle_exit,
+	.skip_emulated_instruction = vt_skip_emulated_instruction,
+	.update_emulated_instruction = vt_update_emulated_instruction,
+	.set_interrupt_shadow = vt_set_interrupt_shadow,
+	.get_interrupt_shadow = vt_get_interrupt_shadow,
+	.patch_hypercall = vt_patch_hypercall,
+	.set_irq = vt_inject_irq,
+	.set_nmi = vt_inject_nmi,
+	.queue_exception = vt_queue_exception,
+	.cancel_injection = vt_cancel_injection,
+	.interrupt_allowed = vt_interrupt_allowed,
+	.nmi_allowed = vt_nmi_allowed,
+	.get_nmi_mask = vt_get_nmi_mask,
+	.set_nmi_mask = vt_set_nmi_mask,
+	.enable_nmi_window = vt_enable_nmi_window,
+	.enable_irq_window = vt_enable_irq_window,
+	.update_cr8_intercept = vt_update_cr8_intercept,
+	.set_virtual_apic_mode = vt_set_virtual_apic_mode,
+	.set_apic_access_page_addr = vt_set_apic_access_page_addr,
+	.refresh_apicv_exec_ctrl = vt_refresh_apicv_exec_ctrl,
+	.load_eoi_exitmap = vt_load_eoi_exitmap,
+	.apicv_post_state_restore = vt_apicv_post_state_restore,
+	.check_apicv_inhibit_reasons = vt_check_apicv_inhibit_reasons,
+	.hwapic_irr_update = vt_hwapic_irr_update,
+	.hwapic_isr_update = vt_hwapic_isr_update,
+	.guest_apic_has_interrupt = vt_guest_apic_has_interrupt,
+	.sync_pir_to_irr = vt_sync_pir_to_irr,
+	.deliver_posted_interrupt = vt_deliver_posted_interrupt,
+	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
+
+	.set_tss_addr = vt_set_tss_addr,
+	.set_identity_map_addr = vt_set_identity_map_addr,
+	.get_mt_mask = vt_get_mt_mask,
+
+	.get_exit_info = vt_get_exit_info,
+
+	.vcpu_after_set_cpuid = vt_vcpu_after_set_cpuid,
+
+	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
+
+	.get_l2_tsc_offset = vt_get_l2_tsc_offset,
+	.get_l2_tsc_multiplier = vt_get_l2_tsc_multiplier,
+	.write_tsc_offset = vt_write_tsc_offset,
+	.write_tsc_multiplier = vt_write_tsc_multiplier,
+
+	.load_mmu_pgd = vt_load_mmu_pgd,
+
+	.check_intercept = vt_check_intercept,
+	.handle_exit_irqoff = vt_handle_exit_irqoff,
+
+	.request_immediate_exit = vt_request_immediate_exit,
+
+	.sched_in = vt_sched_in,
+
+	.cpu_dirty_log_size = PML_ENTITY_NUM,
+	.update_cpu_dirty_logging = vt_update_cpu_dirty_logging,
+
+	.pre_block = vt_pre_block,
+	.post_block = vt_post_block,
+
+	.pmu_ops = &intel_pmu_ops,
+	.nested_ops = &vmx_nested_ops,
+
+	.update_pi_irte = pi_update_irte,
+
+#ifdef CONFIG_X86_64
+	.set_hv_timer = vt_set_hv_timer,
+	.cancel_hv_timer = vt_cancel_hv_timer,
+#endif
+
+	.setup_mce = vt_setup_mce,
+
+	.smi_allowed = vt_smi_allowed,
+	.enter_smm = vt_enter_smm,
+	.leave_smm = vt_leave_smm,
+	.enable_smi_window = vt_enable_smi_window,
+
+	.can_emulate_instruction = vt_can_emulate_instruction,
+	.apic_init_signal_blocked = vt_apic_init_signal_blocked,
+	.migrate_timers = vt_migrate_timers,
+
+	.msr_filter_changed = vt_msr_filter_changed,
+	.complete_emulated_msr = kvm_complete_insn_gp,
+
+	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+};
+
+static struct kvm_x86_init_ops vt_init_ops __initdata = {
+	.cpu_has_kvm_support = vt_cpu_has_kvm_support,
+	.disabled_by_bios = vt_disabled_by_bios,
+	.check_processor_compatibility = vt_check_processor_compatibility,
+	.hardware_setup = vt_hardware_setup,
+
+	.runtime_ops = &vt_x86_ops,
+};
+
+static int __init vt_init(void)
+{
+	unsigned int vcpu_size = 0, vcpu_align = 0;
+	int r;
+
+	vmx_pre_kvm_init(&vcpu_size, &vcpu_align);
+
+	r = kvm_init(&vt_init_ops, vcpu_size, vcpu_align, THIS_MODULE);
+	if (r)
+		goto err_vmx_post_exit;
+
+	r = vmx_init();
+	if (r)
+		goto err_kvm_exit;
+
+	return 0;
+
+err_kvm_exit:
+	kvm_exit();
+err_vmx_post_exit:
+	vmx_post_kvm_exit();
+	return r;
+}
+module_init(vt_init);
+
+static void vt_exit(void)
+{
+	vmx_exit();
+	kvm_exit();
+	vmx_post_kvm_exit();
+}
+module_exit(vt_exit);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 92b2c4ac627c..404afac897bc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -66,6 +66,7 @@
 #include "vmcs12.h"
 #include "vmx.h"
 #include "x86.h"
+#include "x86_ops.h"
 
 MODULE_AUTHOR("Qumranet");
 MODULE_LICENSE("GPL");
@@ -539,7 +540,7 @@ static inline bool cpu_need_virtualize_apic_accesses(struct kvm_vcpu *vcpu)
 	return flexpriority_enabled && lapic_in_kernel(vcpu);
 }
 
-static inline bool report_flexpriority(void)
+bool report_flexpriority(void)
 {
 	return flexpriority_enabled;
 }
@@ -1299,7 +1300,7 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
  * vcpu mutex is already taken.
  */
-static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -1310,7 +1311,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	vmx->host_debugctlmsr = get_debugctlmsr();
 }
 
-static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
+void vmx_vcpu_put(struct kvm_vcpu *vcpu)
 {
 	vmx_vcpu_pi_put(vcpu);
 
@@ -1465,7 +1466,7 @@ static int vmx_rtit_ctl_check(struct kvm_vcpu *vcpu, u64 data)
 	return 0;
 }
 
-static bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, void *insn, int insn_len)
+bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, void *insn, int insn_len)
 {
 	/*
 	 * Emulation of instructions in SGX enclaves is impossible as RIP does
@@ -1549,7 +1550,7 @@ static int skip_emulated_instruction(struct kvm_vcpu *vcpu)
  * Recognizes a pending MTF VM-exit and records the nested state for later
  * delivery.
  */
-static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
+void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -1572,7 +1573,7 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
 		vmx->nested.mtf_pending = false;
 }
 
-static int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
+int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
 {
 	vmx_update_emulated_instruction(vcpu);
 	return skip_emulated_instruction(vcpu);
@@ -1591,7 +1592,7 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
 		vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
 }
 
-static void vmx_queue_exception(struct kvm_vcpu *vcpu)
+void vmx_queue_exception(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	unsigned nr = vcpu->arch.exception.nr;
@@ -1704,12 +1705,12 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
 	return kvm_default_tsc_scaling_ratio;
 }
 
-static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
+void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
 	vmcs_write64(TSC_OFFSET, offset);
 }
 
-static void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
+void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
 {
 	vmcs_write64(TSC_MULTIPLIER, multiplier);
 }
@@ -1733,7 +1734,7 @@ static inline bool vmx_feature_control_msr_valid(struct kvm_vcpu *vcpu,
 	return !(val & ~valid_bits);
 }
 
-static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
+int vmx_get_msr_feature(struct kvm_msr_entry *msr)
 {
 	switch (msr->index) {
 	case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC:
@@ -1753,7 +1754,7 @@ static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
  * Returns 0 on success, non-0 otherwise.
  * Assumes vcpu_load() was already called.
  */
-static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct vmx_uret_msr *msr;
@@ -1931,7 +1932,7 @@ static u64 vcpu_supported_debugctl(struct kvm_vcpu *vcpu)
  * Returns 0 on success, non-0 otherwise.
  * Assumes vcpu_load() was already called.
  */
-static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct vmx_uret_msr *msr;
@@ -2229,7 +2230,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	return ret;
 }
 
-static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
 {
 	unsigned long guest_owned_bits;
 
@@ -2272,18 +2273,13 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
 	}
 }
 
-static __init int cpu_has_kvm_support(void)
-{
-	return cpu_has_vmx();
-}
-
-static __init int vmx_disabled_by_bios(void)
+__init int vmx_disabled_by_bios(void)
 {
 	return !boot_cpu_has(X86_FEATURE_MSR_IA32_FEAT_CTL) ||
 	       !boot_cpu_has(X86_FEATURE_VMX);
 }
 
-static int hardware_enable(void)
+int hardware_enable(void)
 {
 	int cpu = raw_smp_processor_id();
 	u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
@@ -2324,7 +2320,7 @@ static void vmclear_local_loaded_vmcss(void)
 		__loaded_vmcs_clear(v);
 }
 
-static void hardware_disable(void)
+void hardware_disable(void)
 {
 	vmclear_local_loaded_vmcss();
 
@@ -2876,7 +2872,7 @@ static void exit_lmode(struct kvm_vcpu *vcpu)
 
 #endif
 
-static void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -2899,7 +2895,7 @@ static void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
 	}
 }
 
-static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	u64 root_hpa = mmu->root_hpa;
@@ -2917,7 +2913,7 @@ static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
 		vpid_sync_context(nested_get_vpid02(vcpu));
 }
 
-static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
+void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
 {
 	/*
 	 * vpid_sync_vcpu_addr() is a nop if vmx->vpid==0, see the comment in
@@ -2926,7 +2922,7 @@ static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
 	vpid_sync_vcpu_addr(to_vmx(vcpu)->vpid, addr);
 }
 
-static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
 {
 	/*
 	 * vpid_sync_context() is a nop if vmx->vpid==0, e.g. if enable_vpid==0
@@ -3074,8 +3070,7 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
 	return eptp;
 }
 
-static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
-			     int root_level)
+void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
 {
 	struct kvm *kvm = vcpu->kvm;
 	bool update_guest_cr3 = true;
@@ -3103,7 +3098,7 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 		vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
-static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	/*
 	 * We operate under the default treatment of SMM, so VMX cannot be
@@ -3219,7 +3214,7 @@ void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
 	var->g = (ar >> 15) & 1;
 }
 
-static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
 {
 	struct kvm_segment s;
 
@@ -3299,14 +3294,14 @@ void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
 	vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
 }
 
-static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
 {
 	__vmx_set_segment(vcpu, var, seg);
 
 	to_vmx(vcpu)->emulation_required = vmx_emulation_required(vcpu);
 }
 
-static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
 {
 	u32 ar = vmx_read_guest_seg_ar(to_vmx(vcpu), VCPU_SREG_CS);
 
@@ -3314,25 +3309,25 @@ static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
 	*l = (ar >> 13) & 1;
 }
 
-static void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
 	dt->size = vmcs_read32(GUEST_IDTR_LIMIT);
 	dt->address = vmcs_readl(GUEST_IDTR_BASE);
 }
 
-static void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
 	vmcs_write32(GUEST_IDTR_LIMIT, dt->size);
 	vmcs_writel(GUEST_IDTR_BASE, dt->address);
 }
 
-static void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
 	dt->size = vmcs_read32(GUEST_GDTR_LIMIT);
 	dt->address = vmcs_readl(GUEST_GDTR_BASE);
 }
 
-static void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
 	vmcs_write32(GUEST_GDTR_LIMIT, dt->size);
 	vmcs_writel(GUEST_GDTR_BASE, dt->address);
@@ -3817,7 +3812,7 @@ void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu)
 	}
 }
 
-static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
+bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	void *vapic_page;
@@ -3837,7 +3832,7 @@ static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
 	return ((rvi & 0xf0) > (vppr & 0xf0));
 }
 
-static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
+void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 i;
@@ -3925,7 +3920,7 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
  * 2. If target vcpu isn't running(root mode), kick it to pick up the
  * interrupt from PIR in next vmentry.
  */
-static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
+int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	int r;
@@ -4068,7 +4063,7 @@ static u32 vmx_vmexit_ctrl(void)
 		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
 }
 
-static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
+void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -4391,7 +4386,7 @@ static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 	vmx->pi_desc.sn = 1;
 }
 
-static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -4448,12 +4443,12 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vpid_sync_context(vmx->vpid);
 }
 
-static void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
+void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
 {
 	exec_controls_setbit(to_vmx(vcpu), CPU_BASED_INTR_WINDOW_EXITING);
 }
 
-static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
+void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
 {
 	if (!enable_vnmi ||
 	    vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & GUEST_INTR_STATE_STI) {
@@ -4464,7 +4459,7 @@ static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
 	exec_controls_setbit(to_vmx(vcpu), CPU_BASED_NMI_WINDOW_EXITING);
 }
 
-static void vmx_inject_irq(struct kvm_vcpu *vcpu)
+void vmx_inject_irq(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	uint32_t intr;
@@ -4492,7 +4487,7 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu)
 	vmx_clear_hlt(vcpu);
 }
 
-static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
+void vmx_inject_nmi(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -4570,7 +4565,7 @@ bool vmx_nmi_blocked(struct kvm_vcpu *vcpu)
 		 GUEST_INTR_STATE_NMI));
 }
 
-static int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
 	if (to_vmx(vcpu)->nested.nested_run_pending)
 		return -EBUSY;
@@ -4592,7 +4587,7 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
 		(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
 }
 
-static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
 	if (to_vmx(vcpu)->nested.nested_run_pending)
 		return -EBUSY;
@@ -4607,7 +4602,7 @@ static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 	return !vmx_interrupt_blocked(vcpu);
 }
 
-static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
+int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
 {
 	void __user *ret;
 
@@ -4627,7 +4622,7 @@ static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
 	return init_rmode_tss(kvm, ret);
 }
 
-static int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
+int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
 {
 	to_kvm_vmx(kvm)->ept_identity_map_addr = ident_addr;
 	return 0;
@@ -4870,8 +4865,7 @@ static int handle_io(struct kvm_vcpu *vcpu)
 	return kvm_fast_pio(vcpu, size, port, in);
 }
 
-static void
-vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
+void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
 {
 	/*
 	 * Patch in the VMCALL instruction:
@@ -5081,7 +5075,7 @@ static int handle_dr(struct kvm_vcpu *vcpu)
 	return kvm_complete_insn_gp(vcpu, err);
 }
 
-static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
 {
 	get_debugreg(vcpu->arch.db[0], 0);
 	get_debugreg(vcpu->arch.db[1], 1);
@@ -5100,7 +5094,7 @@ static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
 	set_debugreg(DR6_RESERVED, 6);
 }
 
-static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
 {
 	vmcs_writel(GUEST_DR7, val);
 }
@@ -5563,9 +5557,8 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 static const int kvm_vmx_max_exit_handlers =
 	ARRAY_SIZE(kvm_vmx_exit_handlers);
 
-static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
-			      u64 *info1, u64 *info2,
-			      u32 *intr_info, u32 *error_code)
+void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -5982,7 +5975,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 	return 0;
 }
 
-static int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
+int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 {
 	int ret = __vmx_handle_exit(vcpu, exit_fastpath);
 
@@ -6070,7 +6063,7 @@ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
 		: "eax", "ebx", "ecx", "edx");
 }
 
-static void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
+void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 	int tpr_threshold;
@@ -6140,7 +6133,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
 	vmx_update_msr_bitmap_x2apic(vcpu);
 }
 
-static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 {
 	struct page *page;
 
@@ -6168,7 +6161,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 	put_page(page);
 }
 
-static void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
+void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
 {
 	u16 status;
 	u8 old;
@@ -6202,7 +6195,7 @@ static void vmx_set_rvi(int vector)
 	}
 }
 
-static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
+void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
 {
 	/*
 	 * When running L2, updating RVI is only relevant when
@@ -6216,7 +6209,7 @@ static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
 		vmx_set_rvi(max_irr);
 }
 
-static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	int max_irr;
@@ -6251,7 +6244,7 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 	return max_irr;
 }
 
-static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
+void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
 {
 	if (!kvm_vcpu_apicv_active(vcpu))
 		return;
@@ -6262,7 +6255,7 @@ static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
 	vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
 }
 
-static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
+void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -6270,7 +6263,7 @@ static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 	memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir));
 }
 
-static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -6288,7 +6281,7 @@ static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
  * The kvm parameter can be NULL (module initialization, or invocation before
  * VM creation). Be sure to check the kvm parameter before using it.
  */
-static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
+bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
 {
 	switch (index) {
 	case MSR_IA32_SMBASE:
@@ -6409,7 +6402,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 				  IDT_VECTORING_ERROR_CODE);
 }
 
-static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
+void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 {
 	__vmx_complete_interrupts(vcpu,
 				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
@@ -6505,7 +6498,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 	kvm_guest_exit_irqoff();
 }
 
-static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	unsigned long cr3, cr4;
@@ -6693,7 +6686,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	return vmx_exit_handlers_fastpath(vcpu);
 }
 
-static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
+void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -6704,7 +6697,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 	free_loaded_vmcs(vmx->loaded_vmcs);
 }
 
-static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
+int vmx_create_vcpu(struct kvm_vcpu *vcpu)
 {
 	struct vmx_uret_msr *tsx_ctrl;
 	struct vcpu_vmx *vmx;
@@ -6791,15 +6784,10 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
 	return err;
 }
 
-static bool vmx_is_vm_type_supported(unsigned long type)
-{
-	return type == KVM_X86_LEGACY_VM;
-}
-
 #define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 #define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 
-static int vmx_vm_init(struct kvm *kvm)
+int vmx_vm_init(struct kvm *kvm)
 {
 	if (!ple_gap)
 		kvm->arch.pause_in_guest = true;
@@ -6830,12 +6818,7 @@ static int vmx_vm_init(struct kvm *kvm)
 	return 0;
 }
 
-static void vmx_vm_destroy(struct kvm *kvm)
-{
-
-}
-
-static int __init vmx_check_processor_compat(void)
+int __init vmx_check_processor_compat(void)
 {
 	struct vmcs_config vmcs_conf;
 	struct vmx_capability vmx_cap;
@@ -6858,7 +6841,7 @@ static int __init vmx_check_processor_compat(void)
 	return 0;
 }
 
-static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
 	u8 cache;
 	u64 ipat = 0;
@@ -7056,7 +7039,7 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
 		vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
 }
 
-static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -7156,7 +7139,7 @@ static __init void vmx_set_cpu_caps(void)
 		kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
 }
 
-static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
+void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
 {
 	to_vmx(vcpu)->req_immediate_exit = true;
 }
@@ -7195,10 +7178,10 @@ static int vmx_check_intercept_io(struct kvm_vcpu *vcpu,
 	return intercept ? X86EMUL_UNHANDLEABLE : X86EMUL_CONTINUE;
 }
 
-static int vmx_check_intercept(struct kvm_vcpu *vcpu,
-			       struct x86_instruction_info *info,
-			       enum x86_intercept_stage stage,
-			       struct x86_exception *exception)
+int vmx_check_intercept(struct kvm_vcpu *vcpu,
+		       struct x86_instruction_info *info,
+		       enum x86_intercept_stage stage,
+		       struct x86_exception *exception)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 
@@ -7263,8 +7246,8 @@ static inline int u64_shl_div_u64(u64 a, unsigned int shift,
 	return 0;
 }
 
-static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
-			    bool *expired)
+int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+		bool *expired)
 {
 	struct vcpu_vmx *vmx;
 	u64 tscl, guest_tscl, delta_tsc, lapic_timer_advance_cycles;
@@ -7303,13 +7286,13 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 	return 0;
 }
 
-static void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
+void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
 {
 	to_vmx(vcpu)->hv_deadline_tsc = -1;
 }
 #endif
 
-static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
 	if (!kvm_pause_in_guest(vcpu->kvm))
 		shrink_ple_window(vcpu);
@@ -7335,26 +7318,21 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
 		secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_ENABLE_PML);
 }
 
-static int vmx_pre_block(struct kvm_vcpu *vcpu)
+int vmx_pre_block(struct kvm_vcpu *vcpu)
 {
-	if (pi_pre_block(vcpu))
-		return 1;
-
 	if (kvm_lapic_hv_timer_in_use(vcpu))
 		kvm_lapic_switch_to_sw_timer(vcpu);
 
 	return 0;
 }
 
-static void vmx_post_block(struct kvm_vcpu *vcpu)
+void vmx_post_block(struct kvm_vcpu *vcpu)
 {
 	if (kvm_x86_ops.set_hv_timer)
 		kvm_lapic_switch_to_hv_timer(vcpu);
-
-	pi_post_block(vcpu);
 }
 
-static void vmx_setup_mce(struct kvm_vcpu *vcpu)
+void vmx_setup_mce(struct kvm_vcpu *vcpu)
 {
 	if (vcpu->arch.mcg_cap & MCG_LMCE_P)
 		to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
@@ -7364,7 +7342,7 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
 			~FEAT_CTL_LMCE_ENABLED;
 }
 
-static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
 	/* we need a nested vmexit to enter SMM, postpone if run is pending */
 	if (to_vmx(vcpu)->nested.nested_run_pending)
@@ -7372,7 +7350,7 @@ static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 	return !is_smm(vcpu);
 }
 
-static int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
+int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -7386,7 +7364,7 @@ static int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
 	return 0;
 }
 
-static int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
+int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	int ret;
@@ -7406,17 +7384,17 @@ static int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
 	return 0;
 }
 
-static void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
+void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
 {
 	/* RSM will cause a vmexit anyway.  */
 }
 
-static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
 {
 	return to_vmx(vcpu)->nested.vmxon && !is_guest_mode(vcpu);
 }
 
-static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
+void vmx_migrate_timers(struct kvm_vcpu *vcpu)
 {
 	if (is_guest_mode(vcpu)) {
 		struct hrtimer *timer = &to_vmx(vcpu)->nested.preemption_timer;
@@ -7426,7 +7404,7 @@ static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
 	}
 }
 
-static void hardware_unsetup(void)
+void hardware_unsetup(void)
 {
 	kvm_set_posted_intr_wakeup_handler(NULL);
 
@@ -7436,7 +7414,7 @@ static void hardware_unsetup(void)
 	free_kvm_area();
 }
 
-static bool vmx_check_apicv_inhibit_reasons(ulong bit)
+bool vmx_check_apicv_inhibit_reasons(ulong bit)
 {
 	ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
 			  BIT(APICV_INHIBIT_REASON_HYPERV) |
@@ -7445,147 +7423,6 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
 	return supported & BIT(bit);
 }
 
-static struct kvm_x86_ops vmx_x86_ops __initdata = {
-	.name = "kvm_intel",
-
-	.hardware_unsetup = hardware_unsetup,
-
-	.hardware_enable = hardware_enable,
-	.hardware_disable = hardware_disable,
-	.cpu_has_accelerated_tpr = report_flexpriority,
-	.has_emulated_msr = vmx_has_emulated_msr,
-
-	.is_vm_type_supported = vmx_is_vm_type_supported,
-	.vm_size = sizeof(struct kvm_vmx),
-	.vm_init = vmx_vm_init,
-	.vm_teardown = NULL,
-	.vm_destroy = vmx_vm_destroy,
-
-	.vcpu_create = vmx_create_vcpu,
-	.vcpu_free = vmx_free_vcpu,
-	.vcpu_reset = vmx_vcpu_reset,
-
-	.prepare_guest_switch = vmx_prepare_switch_to_guest,
-	.vcpu_load = vmx_vcpu_load,
-	.vcpu_put = vmx_vcpu_put,
-
-	.update_exception_bitmap = vmx_update_exception_bitmap,
-	.get_msr_feature = vmx_get_msr_feature,
-	.get_msr = vmx_get_msr,
-	.set_msr = vmx_set_msr,
-	.get_segment_base = vmx_get_segment_base,
-	.get_segment = vmx_get_segment,
-	.set_segment = vmx_set_segment,
-	.get_cpl = vmx_get_cpl,
-	.get_cs_db_l_bits = vmx_get_cs_db_l_bits,
-	.set_cr0 = vmx_set_cr0,
-	.is_valid_cr4 = vmx_is_valid_cr4,
-	.set_cr4 = vmx_set_cr4,
-	.set_efer = vmx_set_efer,
-	.get_idt = vmx_get_idt,
-	.set_idt = vmx_set_idt,
-	.get_gdt = vmx_get_gdt,
-	.set_gdt = vmx_set_gdt,
-	.set_dr7 = vmx_set_dr7,
-	.sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
-	.cache_reg = vmx_cache_reg,
-	.get_rflags = vmx_get_rflags,
-	.set_rflags = vmx_set_rflags,
-
-	.tlb_flush_all = vmx_flush_tlb_all,
-	.tlb_flush_current = vmx_flush_tlb_current,
-	.tlb_flush_gva = vmx_flush_tlb_gva,
-	.tlb_flush_guest = vmx_flush_tlb_guest,
-
-	.run = vmx_vcpu_run,
-	.handle_exit = vmx_handle_exit,
-	.skip_emulated_instruction = vmx_skip_emulated_instruction,
-	.update_emulated_instruction = vmx_update_emulated_instruction,
-	.set_interrupt_shadow = vmx_set_interrupt_shadow,
-	.get_interrupt_shadow = vmx_get_interrupt_shadow,
-	.patch_hypercall = vmx_patch_hypercall,
-	.set_irq = vmx_inject_irq,
-	.set_nmi = vmx_inject_nmi,
-	.queue_exception = vmx_queue_exception,
-	.cancel_injection = vmx_cancel_injection,
-	.interrupt_allowed = vmx_interrupt_allowed,
-	.nmi_allowed = vmx_nmi_allowed,
-	.get_nmi_mask = vmx_get_nmi_mask,
-	.set_nmi_mask = vmx_set_nmi_mask,
-	.enable_nmi_window = vmx_enable_nmi_window,
-	.enable_irq_window = vmx_enable_irq_window,
-	.update_cr8_intercept = vmx_update_cr8_intercept,
-	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
-	.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
-	.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
-	.load_eoi_exitmap = vmx_load_eoi_exitmap,
-	.apicv_post_state_restore = vmx_apicv_post_state_restore,
-	.check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
-	.hwapic_irr_update = vmx_hwapic_irr_update,
-	.hwapic_isr_update = vmx_hwapic_isr_update,
-	.guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
-	.sync_pir_to_irr = vmx_sync_pir_to_irr,
-	.deliver_posted_interrupt = vmx_deliver_posted_interrupt,
-	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
-
-	.set_tss_addr = vmx_set_tss_addr,
-	.set_identity_map_addr = vmx_set_identity_map_addr,
-	.get_mt_mask = vmx_get_mt_mask,
-
-	.get_exit_info = vmx_get_exit_info,
-
-	.vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
-
-	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
-
-	.get_l2_tsc_offset = vmx_get_l2_tsc_offset,
-	.get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
-	.write_tsc_offset = vmx_write_tsc_offset,
-	.write_tsc_multiplier = vmx_write_tsc_multiplier,
-
-	.load_mmu_pgd = vmx_load_mmu_pgd,
-
-	.check_intercept = vmx_check_intercept,
-	.handle_exit_irqoff = vmx_handle_exit_irqoff,
-
-	.request_immediate_exit = vmx_request_immediate_exit,
-
-	.sched_in = vmx_sched_in,
-
-	.cpu_dirty_log_size = PML_ENTITY_NUM,
-	.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
-
-	.pre_block = vmx_pre_block,
-	.post_block = vmx_post_block,
-
-	.pmu_ops = &intel_pmu_ops,
-	.nested_ops = &vmx_nested_ops,
-
-	.update_pi_irte = pi_update_irte,
-	.start_assignment = vmx_pi_start_assignment,
-
-#ifdef CONFIG_X86_64
-	.set_hv_timer = vmx_set_hv_timer,
-	.cancel_hv_timer = vmx_cancel_hv_timer,
-#endif
-
-	.setup_mce = vmx_setup_mce,
-
-	.smi_allowed = vmx_smi_allowed,
-	.enter_smm = vmx_enter_smm,
-	.leave_smm = vmx_leave_smm,
-	.enable_smi_window = vmx_enable_smi_window,
-
-	.can_emulate_instruction = vmx_can_emulate_instruction,
-	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
-	.migrate_timers = vmx_migrate_timers,
-
-	.msr_filter_changed = vmx_msr_filter_changed,
-	.complete_emulated_msr = kvm_complete_insn_gp,
-
-	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
-};
-
 static __init void vmx_setup_user_return_msrs(void)
 {
 
@@ -7612,7 +7449,7 @@ static __init void vmx_setup_user_return_msrs(void)
 		kvm_add_user_return_msr(vmx_uret_msrs_list[i]);
 }
 
-static __init int hardware_setup(void)
+__init int hardware_setup(void)
 {
 	unsigned long host_bndcfgs;
 	struct desc_ptr dt;
@@ -7672,16 +7509,16 @@ static __init int hardware_setup(void)
 	 * using the APIC_ACCESS_ADDR VMCS field.
 	 */
 	if (!flexpriority_enabled)
-		vmx_x86_ops.set_apic_access_page_addr = NULL;
+		vt_x86_ops.set_apic_access_page_addr = NULL;
 
 	if (!cpu_has_vmx_tpr_shadow())
-		vmx_x86_ops.update_cr8_intercept = NULL;
+		vt_x86_ops.update_cr8_intercept = NULL;
 
 #if IS_ENABLED(CONFIG_HYPERV)
 	if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH
 	    && enable_ept) {
-		vmx_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
-		vmx_x86_ops.tlb_remote_flush_with_range =
+		vt_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
+		vt_x86_ops.tlb_remote_flush_with_range =
 				hv_remote_flush_tlb_with_range;
 	}
 #endif
@@ -7696,7 +7533,7 @@ static __init int hardware_setup(void)
 
 	if (!cpu_has_vmx_apicv()) {
 		enable_apicv = 0;
-		vmx_x86_ops.sync_pir_to_irr = NULL;
+		vt_x86_ops.sync_pir_to_irr = NULL;
 	}
 
 	if (cpu_has_vmx_tsc_scaling()) {
@@ -7732,7 +7569,7 @@ static __init int hardware_setup(void)
 		enable_pml = 0;
 
 	if (!enable_pml)
-		vmx_x86_ops.cpu_dirty_log_size = 0;
+		vt_x86_ops.cpu_dirty_log_size = 0;
 
 	if (!cpu_has_vmx_preemption_timer())
 		enable_preemption_timer = false;
@@ -7759,9 +7596,9 @@ static __init int hardware_setup(void)
 	}
 
 	if (!enable_preemption_timer) {
-		vmx_x86_ops.set_hv_timer = NULL;
-		vmx_x86_ops.cancel_hv_timer = NULL;
-		vmx_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
+		vt_x86_ops.set_hv_timer = NULL;
+		vt_x86_ops.cancel_hv_timer = NULL;
+		vt_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
 	}
 
 	kvm_mce_cap_supported |= MCG_LMCE_P;
@@ -7793,15 +7630,6 @@ static __init int hardware_setup(void)
 	return r;
 }
 
-static struct kvm_x86_init_ops vmx_init_ops __initdata = {
-	.cpu_has_kvm_support = cpu_has_kvm_support,
-	.disabled_by_bios = vmx_disabled_by_bios,
-	.check_processor_compatibility = vmx_check_processor_compat,
-	.hardware_setup = hardware_setup,
-
-	.runtime_ops = &vmx_x86_ops,
-};
-
 static void vmx_cleanup_l1d_flush(void)
 {
 	if (vmx_l1d_flush_pages) {
@@ -7812,47 +7640,12 @@ static void vmx_cleanup_l1d_flush(void)
 	l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
 }
 
-static void vmx_exit(void)
+void __init vmx_pre_kvm_init(unsigned int *vcpu_size, unsigned int *vcpu_align)
 {
-#ifdef CONFIG_KEXEC_CORE
-	RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
-	synchronize_rcu();
-#endif
-
-	kvm_exit();
-
-#if IS_ENABLED(CONFIG_HYPERV)
-	if (static_branch_unlikely(&enable_evmcs)) {
-		int cpu;
-		struct hv_vp_assist_page *vp_ap;
-		/*
-		 * Reset everything to support using non-enlightened VMCS
-		 * access later (e.g. when we reload the module with
-		 * enlightened_vmcs=0)
-		 */
-		for_each_online_cpu(cpu) {
-			vp_ap =	hv_get_vp_assist_page(cpu);
-
-			if (!vp_ap)
-				continue;
-
-			vp_ap->nested_control.features.directhypercall = 0;
-			vp_ap->current_nested_vmcs = 0;
-			vp_ap->enlighten_vmentry = 0;
-		}
-
-		static_branch_disable(&enable_evmcs);
-	}
-#endif
-	vmx_cleanup_l1d_flush();
-
-	allow_smaller_maxphyaddr = false;
-}
-module_exit(vmx_exit);
-
-static int __init vmx_init(void)
-{
-	int r, cpu;
+	if (sizeof(struct vcpu_vmx) > *vcpu_size)
+		*vcpu_size = sizeof(struct vcpu_vmx);
+	if (__alignof__(struct vcpu_vmx) > *vcpu_align)
+		*vcpu_align = __alignof__(struct vcpu_vmx);
 
 #if IS_ENABLED(CONFIG_HYPERV)
 	/*
@@ -7880,18 +7673,45 @@ static int __init vmx_init(void)
 		}
 
 		if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH)
-			vmx_x86_ops.enable_direct_tlbflush
+			vt_x86_ops.enable_direct_tlbflush
 				= hv_enable_direct_tlbflush;
 
 	} else {
 		enlightened_vmcs = false;
 	}
 #endif
+}
 
-	r = kvm_init(&vmx_init_ops, sizeof(struct vcpu_vmx),
-		     __alignof__(struct vcpu_vmx), THIS_MODULE);
-	if (r)
-		return r;
+void vmx_post_kvm_exit(void)
+{
+#if IS_ENABLED(CONFIG_HYPERV)
+	if (static_branch_unlikely(&enable_evmcs)) {
+		int cpu;
+		struct hv_vp_assist_page *vp_ap;
+		/*
+		 * Reset everything to support using non-enlightened VMCS
+		 * access later (e.g. when we reload the module with
+		 * enlightened_vmcs=0)
+		 */
+		for_each_online_cpu(cpu) {
+			vp_ap =	hv_get_vp_assist_page(cpu);
+
+			if (!vp_ap)
+				continue;
+
+			vp_ap->nested_control.features.directhypercall = 0;
+			vp_ap->current_nested_vmcs = 0;
+			vp_ap->enlighten_vmentry = 0;
+		}
+
+		static_branch_disable(&enable_evmcs);
+	}
+#endif
+}
+
+int __init vmx_init(void)
+{
+	int r, cpu;
 
 	/*
 	 * Must be called after kvm_init() so enable_ept is properly set
@@ -7901,10 +7721,8 @@ static int __init vmx_init(void)
 	 * mitigation mode.
 	 */
 	r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
-	if (r) {
-		vmx_exit();
+	if (r)
 		return r;
-	}
 
 	for_each_possible_cpu(cpu) {
 		INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
@@ -7928,4 +7746,15 @@ static int __init vmx_init(void)
 
 	return 0;
 }
-module_init(vmx_init);
+
+void vmx_exit(void)
+{
+#ifdef CONFIG_KEXEC_CORE
+	RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
+	synchronize_rcu();
+#endif
+
+	vmx_cleanup_l1d_flush();
+
+	allow_smaller_maxphyaddr = false;
+}
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
new file mode 100644
index 000000000000..f04b30f1b72d
--- /dev/null
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -0,0 +1,124 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_VMX_X86_OPS_H
+#define __KVM_X86_VMX_X86_OPS_H
+
+#include <linux/kvm_host.h>
+
+#include <asm/virtext.h>
+
+#include "x86.h"
+
+extern struct kvm_x86_ops vt_x86_ops __initdata;
+
+__init int vmx_disabled_by_bios(void);
+int __init vmx_check_processor_compat(void);
+__init int hardware_setup(void);
+void hardware_unsetup(void);
+int hardware_enable(void);
+void hardware_disable(void);
+bool report_flexpriority(void);
+int vmx_vm_init(struct kvm *kvm);
+int vmx_create_vcpu(struct kvm_vcpu *vcpu);
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu);
+void vmx_free_vcpu(struct kvm_vcpu *vcpu);
+void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+void vmx_vcpu_put(struct kvm_vcpu *vcpu);
+int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath);
+void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
+int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu);
+void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu);
+int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate);
+int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate);
+void vmx_enable_smi_window(struct kvm_vcpu *vcpu);
+bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, void *insn, int insn_len);
+int vmx_check_intercept(struct kvm_vcpu *vcpu,
+			struct x86_instruction_info *info,
+			enum x86_intercept_stage stage,
+			struct x86_exception *exception);
+bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu);
+void vmx_migrate_timers(struct kvm_vcpu *vcpu);
+void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
+void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu);
+bool vmx_check_apicv_inhibit_reasons(ulong bit);
+void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr);
+void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr);
+bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu);
+int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu);
+int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector);
+void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
+bool vmx_has_emulated_msr(struct kvm *kvm, u32 index);
+void vmx_msr_filter_changed(struct kvm_vcpu *vcpu);
+void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
+void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu);
+int vmx_get_msr_feature(struct kvm_msr_entry *msr);
+int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
+void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+int vmx_get_cpl(struct kvm_vcpu *vcpu);
+void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l);
+void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
+void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
+void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer);
+void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val);
+void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu);
+void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
+unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu);
+void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
+void vmx_flush_tlb_all(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_current(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr);
+void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu);
+void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask);
+u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu);
+void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall);
+void vmx_inject_irq(struct kvm_vcpu *vcpu);
+void vmx_inject_nmi(struct kvm_vcpu *vcpu);
+void vmx_queue_exception(struct kvm_vcpu *vcpu);
+void vmx_cancel_injection(struct kvm_vcpu *vcpu);
+int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
+void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
+void vmx_enable_nmi_window(struct kvm_vcpu *vcpu);
+void vmx_enable_irq_window(struct kvm_vcpu *vcpu);
+void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr);
+void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu);
+void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
+void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
+int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
+int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr);
+u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
+u64 vmx_get_l2_tsc_offset(struct kvm_vcpu *vcpu);
+u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
+void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset);
+void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier);
+void vmx_request_immediate_exit(struct kvm_vcpu *vcpu);
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu);
+void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu);
+int vmx_pre_block(struct kvm_vcpu *vcpu);
+void vmx_post_block(struct kvm_vcpu *vcpu);
+#ifdef CONFIG_X86_64
+int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+		bool *expired);
+void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
+#endif
+void vmx_setup_mce(struct kvm_vcpu *vcpu);
+
+void __init vmx_pre_kvm_init(unsigned int *vcpu_size, unsigned int *vcpu_align);
+int __init vmx_init(void);
+void vmx_exit(void);
+void vmx_post_kvm_exit(void);
+
+#endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 45/59] KVM: VMX: Move setting of EPT MMU masks to common VT-x code
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (43 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 44/59] KVM: VMX: Add 'main.c' to wrap VMX and TDX isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 20:08   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 46/59] KVM: VMX: Move register caching logic to common code isaku.yamahata
                   ` (15 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c | 5 +++++
 arch/x86/kvm/vmx/vmx.c  | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 146dd6a317e5..ef89ed0457d5 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -4,6 +4,7 @@
 #include "x86_ops.h"
 #include "vmx.h"
 #include "nested.h"
+#include "mmu.h"
 #include "pmu.h"
 
 static int __init vt_cpu_has_kvm_support(void)
@@ -35,6 +36,10 @@ static __init int vt_hardware_setup(void)
 	if (ret)
 		return ret;
 
+	if (enable_ept)
+		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
+				      cpu_has_vmx_ept_execute_only());
+
 	return 0;
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 404afac897bc..8b2e57de6627 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7546,10 +7546,6 @@ __init int hardware_setup(void)
 
 	set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */
 
-	if (enable_ept)
-		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
-				      cpu_has_vmx_ept_execute_only());
-
 	if (!enable_ept)
 		ept_lpage_level = 0;
 	else if (cpu_has_vmx_ept_1g_page())
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 46/59] KVM: VMX: Move register caching logic to common code
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (44 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 45/59] KVM: VMX: Move setting of EPT MMU masks to common VT-x code isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 20:11   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 47/59] KVM: TDX: Define TDCALL exit reason isaku.yamahata
                   ` (14 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Move the guts of vmx_cache_reg() to vt_cache_reg() in preparation for
reusing the bulk of the code for TDX, which can access guest state for
debug TDs.

Use kvm_x86_ops.cache_reg() in ept_update_paging_mode_cr0() rather than
trying to expose vt_cache_reg() to vmx.c, even though it means taking a
retpoline.  The code runs if and only if EPT is enabled but unrestricted
guest.  Only one generation of CPU, Nehalem, supports EPT but not
unrestricted guest, and disabling unrestricted guest without also
disabling EPT is, to put it bluntly, dumb.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 44 ++++++++++++++++++++++++++++++++++---
 arch/x86/kvm/vmx/vmx.c     | 45 +-------------------------------------
 arch/x86/kvm/vmx/x86_ops.h |  1 +
 3 files changed, 43 insertions(+), 47 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index ef89ed0457d5..a0a8cc2fd600 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -339,9 +339,47 @@ static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
 	vmx_sync_dirty_debug_regs(vcpu);
 }
 
-static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
-{
-	vmx_cache_reg(vcpu, reg);
+void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+{
+	unsigned long guest_owned_bits;
+
+	kvm_register_mark_available(vcpu, reg);
+
+	switch (reg) {
+	case VCPU_REGS_RSP:
+		vcpu->arch.regs[VCPU_REGS_RSP] = vmcs_readl(GUEST_RSP);
+		break;
+	case VCPU_REGS_RIP:
+		vcpu->arch.regs[VCPU_REGS_RIP] = vmcs_readl(GUEST_RIP);
+		break;
+	case VCPU_EXREG_PDPTR:
+		if (enable_ept)
+			ept_save_pdptrs(vcpu);
+		break;
+	case VCPU_EXREG_CR0:
+		guest_owned_bits = vcpu->arch.cr0_guest_owned_bits;
+
+		vcpu->arch.cr0 &= ~guest_owned_bits;
+		vcpu->arch.cr0 |= vmcs_readl(GUEST_CR0) & guest_owned_bits;
+		break;
+	case VCPU_EXREG_CR3:
+		/*
+		 * When intercepting CR3 loads, e.g. for shadowing paging, KVM's
+		 * CR3 is loaded into hardware, not the guest's CR3.
+		 */
+		if (!(exec_controls_get(to_vmx(vcpu)) & CPU_BASED_CR3_LOAD_EXITING))
+			vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
+		break;
+	case VCPU_EXREG_CR4:
+		guest_owned_bits = vcpu->arch.cr4_guest_owned_bits;
+
+		vcpu->arch.cr4 &= ~guest_owned_bits;
+		vcpu->arch.cr4 |= vmcs_readl(GUEST_CR4) & guest_owned_bits;
+		break;
+	default:
+		KVM_BUG_ON(1, vcpu->kvm);
+		break;
+	}
 }
 
 static unsigned long vt_get_rflags(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8b2e57de6627..98710b578b28 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2230,49 +2230,6 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	return ret;
 }
 
-void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
-{
-	unsigned long guest_owned_bits;
-
-	kvm_register_mark_available(vcpu, reg);
-
-	switch (reg) {
-	case VCPU_REGS_RSP:
-		vcpu->arch.regs[VCPU_REGS_RSP] = vmcs_readl(GUEST_RSP);
-		break;
-	case VCPU_REGS_RIP:
-		vcpu->arch.regs[VCPU_REGS_RIP] = vmcs_readl(GUEST_RIP);
-		break;
-	case VCPU_EXREG_PDPTR:
-		if (enable_ept)
-			ept_save_pdptrs(vcpu);
-		break;
-	case VCPU_EXREG_CR0:
-		guest_owned_bits = vcpu->arch.cr0_guest_owned_bits;
-
-		vcpu->arch.cr0 &= ~guest_owned_bits;
-		vcpu->arch.cr0 |= vmcs_readl(GUEST_CR0) & guest_owned_bits;
-		break;
-	case VCPU_EXREG_CR3:
-		/*
-		 * When intercepting CR3 loads, e.g. for shadowing paging, KVM's
-		 * CR3 is loaded into hardware, not the guest's CR3.
-		 */
-		if (!(exec_controls_get(to_vmx(vcpu)) & CPU_BASED_CR3_LOAD_EXITING))
-			vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
-		break;
-	case VCPU_EXREG_CR4:
-		guest_owned_bits = vcpu->arch.cr4_guest_owned_bits;
-
-		vcpu->arch.cr4 &= ~guest_owned_bits;
-		vcpu->arch.cr4 |= vmcs_readl(GUEST_CR4) & guest_owned_bits;
-		break;
-	default:
-		KVM_BUG_ON(1, vcpu->kvm);
-		break;
-	}
-}
-
 __init int vmx_disabled_by_bios(void)
 {
 	return !boot_cpu_has(X86_FEATURE_MSR_IA32_FEAT_CTL) ||
@@ -3012,7 +2969,7 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 		 * KVM's CR3 is installed.
 		 */
 		if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
-			vmx_cache_reg(vcpu, VCPU_EXREG_CR3);
+			vt_cache_reg(vcpu, VCPU_EXREG_CR3);
 
 		/*
 		 * When running with EPT but not unrestricted guest, KVM must
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index f04b30f1b72d..c49d6f9f36fd 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -9,6 +9,7 @@
 #include "x86.h"
 
 extern struct kvm_x86_ops vt_x86_ops __initdata;
+void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
 
 __init int vmx_disabled_by_bios(void);
 int __init vmx_check_processor_compat(void);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 47/59] KVM: TDX: Define TDCALL exit reason
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (45 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 46/59] KVM: VMX: Move register caching logic to common code isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 20:19   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 48/59] KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers isaku.yamahata
                   ` (13 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson, Xiaoyao Li

From: Sean Christopherson <sean.j.christopherson@intel.com>

Define the TDCALL exit reason, which is carved out from the VMX exit
reason namespace as the TDCALL exit from TDX guest to TDX-SEAM is really
just a VM-Exit.

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/uapi/asm/vmx.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index 946d761adbd3..ba5908dfc7c0 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -91,6 +91,7 @@
 #define EXIT_REASON_UMWAIT              67
 #define EXIT_REASON_TPAUSE              68
 #define EXIT_REASON_BUS_LOCK            74
+#define EXIT_REASON_TDCALL              77
 
 #define VMX_EXIT_REASONS \
 	{ EXIT_REASON_EXCEPTION_NMI,         "EXCEPTION_NMI" }, \
@@ -153,7 +154,8 @@
 	{ EXIT_REASON_XRSTORS,               "XRSTORS" }, \
 	{ EXIT_REASON_UMWAIT,                "UMWAIT" }, \
 	{ EXIT_REASON_TPAUSE,                "TPAUSE" }, \
-	{ EXIT_REASON_BUS_LOCK,              "BUS_LOCK" }
+	{ EXIT_REASON_BUS_LOCK,              "BUS_LOCK" }, \
+	{ EXIT_REASON_TDCALL,                "TDCALL" }
 
 #define VMX_EXIT_REASON_FLAGS \
 	{ VMX_EXIT_REASONS_FAILED_VMENTRY,	"FAILED_VMENTRY" }
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 48/59] KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (46 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 47/59] KVM: TDX: Define TDCALL exit reason isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 49/59] KVM: VMX: Add macro framework to read/write VMCS for VMs and TDs isaku.yamahata
                   ` (12 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Stub in kvm_tdx, vcpu_tdx, their various accessors, and VMCS helpers.
The VMCS helpers, which rely on the stubs, will be used by preparatory
patches to move VMX functions for accessing VMCS state to common code.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.h | 169 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 169 insertions(+)
 create mode 100644 arch/x86/kvm/vmx/tdx.h

diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
new file mode 100644
index 000000000000..31412ed8049f
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -0,0 +1,169 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_TDX_H
+#define __KVM_X86_TDX_H
+
+#include <linux/list.h>
+#include <linux/kvm_host.h>
+
+#include "tdx_errno.h"
+#include "tdx_arch.h"
+#include "tdx_ops.h"
+
+#ifdef CONFIG_INTEL_TDX_HOST
+
+struct tdx_td_page {
+	unsigned long va;
+	hpa_t pa;
+	bool added;
+};
+
+struct kvm_tdx {
+	struct kvm kvm;
+
+	struct tdx_td_page tdr;
+	struct tdx_td_page tdcs[TDX_NR_TDCX_PAGES];
+};
+
+struct vcpu_tdx {
+	struct kvm_vcpu	vcpu;
+
+	struct tdx_td_page tdvpr;
+	struct tdx_td_page tdvpx[TDX_NR_TDVPX_PAGES];
+};
+
+static inline bool is_td(struct kvm *kvm)
+{
+	return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
+{
+	return is_td(vcpu->kvm);
+}
+
+static inline bool is_debug_td(struct kvm_vcpu *vcpu)
+{
+	return !vcpu->arch.guest_state_protected;
+}
+
+static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm)
+{
+	return container_of(kvm, struct kvm_tdx, kvm);
+}
+
+static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
+{
+	return container_of(vcpu, struct vcpu_tdx, vcpu);
+}
+
+static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
+{
+	BUILD_BUG_ON_MSG(__builtin_constant_p(field) && (field) & 0x1,
+			 "Read/Write to TD VMCS *_HIGH fields not supported");
+
+	BUILD_BUG_ON(bits != 16 && bits != 32 && bits != 64);
+
+	BUILD_BUG_ON_MSG(bits != 64 && __builtin_constant_p(field) &&
+			 (((field) & 0x6000) == 0x2000 ||
+			  ((field) & 0x6000) == 0x6000),
+			 "Invalid TD VMCS access for 64-bit field");
+	BUILD_BUG_ON_MSG(bits != 32 && __builtin_constant_p(field) &&
+			 ((field) & 0x6000) == 0x4000,
+			 "Invalid TD VMCS access for 32-bit field");
+	BUILD_BUG_ON_MSG(bits != 16 && __builtin_constant_p(field) &&
+			 ((field) & 0x6000) == 0x0000,
+			 "Invalid TD VMCS access for 16-bit field");
+}
+
+static __always_inline void tdvps_gpr_check(u64 field, u8 bits)
+{
+	BUILD_BUG_ON_MSG(__builtin_constant_p(field) && (field) >= NR_VCPU_REGS,
+			 "Invalid TD guest GPR index");
+}
+
+static __always_inline void tdvps_apic_check(u64 field, u8 bits) {}
+static __always_inline void tdvps_dr_check(u64 field, u8 bits) {}
+static __always_inline void tdvps_state_check(u64 field, u8 bits) {}
+static __always_inline void tdvps_msr_check(u64 field, u8 bits) {}
+static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
+
+#define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass)			       \
+static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx,  \
+							u32 field)	       \
+{									       \
+	struct tdx_ex_ret ex_ret;					       \
+	u64 err;							       \
+									       \
+	tdvps_##lclass##_check(field, bits);				       \
+	err = tdh_vp_rd(tdx->tdvpr.pa, TDVPS_##uclass(field), &ex_ret);        \
+	if (unlikely(err)) {						       \
+		pr_err("TDH_VP_RD["#uclass".0x%x] failed: 0x%llx\n",	       \
+		       field, err);					       \
+		return 0;						       \
+	}								       \
+	return (u##bits)ex_ret.regs.r8;					       \
+}									       \
+static __always_inline void td_##lclass##_write##bits(struct vcpu_tdx *tdx,    \
+						      u32 field, u##bits val)  \
+{									       \
+	struct tdx_ex_ret ex_ret;					       \
+	u64 err;							       \
+									       \
+	tdvps_##lclass##_check(field, bits);				       \
+	err = tdh_vp_wr(tdx->tdvpr.pa, TDVPS_##uclass(field), val,	       \
+		      GENMASK_ULL(bits - 1, 0), &ex_ret);		       \
+	if (unlikely(err))						       \
+		pr_err("TDH_VP_WR["#uclass".0x%x] = 0x%llx failed: 0x%llx\n",  \
+		       field, (u64)val, err);				       \
+}									       \
+static __always_inline void td_##lclass##_setbit##bits(struct vcpu_tdx *tdx,   \
+						       u32 field, u64 bit)     \
+{									       \
+	struct tdx_ex_ret ex_ret;					       \
+	u64 err;							       \
+									       \
+	tdvps_##lclass##_check(field, bits);				       \
+	err = tdh_vp_wr(tdx->tdvpr.pa, TDVPS_##uclass(field), bit, bit,        \
+			&ex_ret);					       \
+	if (unlikely(err))						       \
+		pr_err("TDH_VP_WR["#uclass".0x%x] |= 0x%llx failed: 0x%llx\n", \
+		       field, bit, err);				       \
+}									       \
+static __always_inline void td_##lclass##_clearbit##bits(struct vcpu_tdx *tdx, \
+							 u32 field, u64 bit)   \
+{									       \
+	struct tdx_ex_ret ex_ret;					       \
+	u64 err;							       \
+									       \
+	tdvps_##lclass##_check(field, bits);				       \
+	err = tdh_vp_wr(tdx->tdvpr.pa, TDVPS_##uclass(field), 0, bit,	       \
+			&ex_ret);					       \
+	if (unlikely(err))						       \
+		pr_err("TDH_VP_WR["#uclass".0x%x] &= ~0x%llx failed: 0x%llx\n", \
+		       field, bit, err);				       \
+}
+
+TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
+TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
+TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
+
+TDX_BUILD_TDVPS_ACCESSORS(64, APIC, apic);
+TDX_BUILD_TDVPS_ACCESSORS(64, GPR, gpr);
+TDX_BUILD_TDVPS_ACCESSORS(64, DR, dr);
+TDX_BUILD_TDVPS_ACCESSORS(64, STATE, state);
+TDX_BUILD_TDVPS_ACCESSORS(64, MSR, msr);
+TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
+
+#else
+struct kvm_tdx;
+struct vcpu_tdx;
+
+static inline bool is_td(struct kvm *kvm) { return false; }
+static inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
+static inline bool is_debug_td(struct kvm_vcpu *vcpu) { return false; }
+static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm) { return NULL; }
+static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu) { return NULL; }
+
+#endif /* CONFIG_INTEL_TDX_HOST */
+
+#endif /* __KVM_X86_TDX_H */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 49/59] KVM: VMX: Add macro framework to read/write VMCS for VMs and TDs
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (47 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 48/59] KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 20:24   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 50/59] KVM: VMX: Move AR_BYTES encoder/decoder helpers to common.h isaku.yamahata
                   ` (11 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a macro framework to hide VMX vs. TDX details of VMREAD and VMWRITE
so the VMX and TDX can shared common flows, e.g. accessing DTs.

Note, the TDX paths are dead code at this time.  There is no great way
to deal with the chicken-and-egg scenario of having things in place for
TDX without first having TDX.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/common.h | 41 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 9e5865b05d47..d37ef4dd9d90 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -11,6 +11,47 @@
 #include "vmcs.h"
 #include "vmx.h"
 #include "x86.h"
+#include "tdx.h"
+
+#ifdef CONFIG_INTEL_TDX_HOST
+#define VT_BUILD_VMCS_HELPERS(type, bits, tdbits)			   \
+static __always_inline type vmread##bits(struct kvm_vcpu *vcpu,		   \
+					 unsigned long field)		   \
+{									   \
+	if (unlikely(is_td_vcpu(vcpu))) {				   \
+		if (KVM_BUG_ON(!is_debug_td(vcpu), vcpu->kvm))		   \
+			return 0;					   \
+		return td_vmcs_read##tdbits(to_tdx(vcpu), field);	   \
+	}								   \
+	return vmcs_read##bits(field);					   \
+}									   \
+static __always_inline void vmwrite##bits(struct kvm_vcpu *vcpu,	   \
+					  unsigned long field, type value) \
+{									   \
+	if (unlikely(is_td_vcpu(vcpu))) {				   \
+		if (KVM_BUG_ON(!is_debug_td(vcpu), vcpu->kvm))		   \
+			return;						   \
+		return td_vmcs_write##tdbits(to_tdx(vcpu), field, value);  \
+	}								   \
+	vmcs_write##bits(field, value);					   \
+}
+#else
+#define VT_BUILD_VMCS_HELPERS(type, bits, tdbits)			   \
+static __always_inline type vmread##bits(struct kvm_vcpu *vcpu,		   \
+					 unsigned long field)		   \
+{									   \
+	return vmcs_read##bits(field);					   \
+}									   \
+static __always_inline void vmwrite##bits(struct kvm_vcpu *vcpu,	   \
+					  unsigned long field, type value) \
+{									   \
+	vmcs_write##bits(field, value);					   \
+}
+#endif /* CONFIG_INTEL_TDX_HOST */
+VT_BUILD_VMCS_HELPERS(u16, 16, 16);
+VT_BUILD_VMCS_HELPERS(u32, 32, 32);
+VT_BUILD_VMCS_HELPERS(u64, 64, 64);
+VT_BUILD_VMCS_HELPERS(unsigned long, l, 64);
 
 extern unsigned long vmx_host_idt_base;
 void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 50/59] KVM: VMX: Move AR_BYTES encoder/decoder helpers to common.h
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (48 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 49/59] KVM: VMX: Add macro framework to read/write VMCS for VMs and TDs isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 51/59] KVM: VMX: MOVE GDT and IDT accessors to common code isaku.yamahata
                   ` (10 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Move the AR_BYTES helpers to common.h so that future patches can reuse
them to decode/encode AR for TDX.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/common.h | 41 ++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c    | 47 ++++-----------------------------------
 2 files changed, 45 insertions(+), 43 deletions(-)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index d37ef4dd9d90..e45d2d222168 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -4,6 +4,7 @@
 
 #include <linux/kvm_host.h>
 
+#include <asm/kvm.h>
 #include <asm/traps.h>
 #include <asm/vmx.h>
 
@@ -119,4 +120,44 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
+static inline u32 vmx_encode_ar_bytes(struct kvm_segment *var)
+{
+	u32 ar;
+
+	if (var->unusable || !var->present)
+		ar = 1 << 16;
+	else {
+		ar = var->type & 15;
+		ar |= (var->s & 1) << 4;
+		ar |= (var->dpl & 3) << 5;
+		ar |= (var->present & 1) << 7;
+		ar |= (var->avl & 1) << 12;
+		ar |= (var->l & 1) << 13;
+		ar |= (var->db & 1) << 14;
+		ar |= (var->g & 1) << 15;
+	}
+
+	return ar;
+}
+
+static inline void vmx_decode_ar_bytes(u32 ar, struct kvm_segment *var)
+{
+	var->unusable = (ar >> 16) & 1;
+	var->type = ar & 15;
+	var->s = (ar >> 4) & 1;
+	var->dpl = (ar >> 5) & 3;
+	/*
+	 * Some userspaces do not preserve unusable property. Since usable
+	 * segment has to be present according to VMX spec we can use present
+	 * property to amend userspace bug by making unusable segment always
+	 * nonpresent. vmx_encode_ar_bytes() already marks nonpresent
+	 * segment as unusable.
+	 */
+	var->present = !var->unusable;
+	var->avl = (ar >> 12) & 1;
+	var->l = (ar >> 13) & 1;
+	var->db = (ar >> 14) & 1;
+	var->g = (ar >> 15) & 1;
+}
+
 #endif /* __KVM_X86_VMX_COMMON_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 98710b578b28..6f98d1b2a498 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -365,8 +365,6 @@ static const struct kernel_param_ops vmentry_l1d_flush_ops = {
 };
 module_param_cb(vmentry_l1d_flush, &vmentry_l1d_flush_ops, NULL, 0644);
 
-static u32 vmx_segment_access_rights(struct kvm_segment *var);
-
 void vmx_vmexit(void);
 
 #define vmx_insn_failed(fmt...)		\
@@ -2730,7 +2728,7 @@ static void fix_rmode_seg(int seg, struct kvm_segment *save)
 	vmcs_write16(sf->selector, var.selector);
 	vmcs_writel(sf->base, var.base);
 	vmcs_write32(sf->limit, var.limit);
-	vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(&var));
+	vmcs_write32(sf->ar_bytes, vmx_encode_ar_bytes(&var));
 }
 
 static void enter_rmode(struct kvm_vcpu *vcpu)
@@ -3138,7 +3136,6 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	u32 ar;
 
 	if (vmx->rmode.vm86_active && seg != VCPU_SREG_LDTR) {
 		*var = vmx->rmode.segs[seg];
@@ -3152,23 +3149,7 @@ void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
 	var->base = vmx_read_guest_seg_base(vmx, seg);
 	var->limit = vmx_read_guest_seg_limit(vmx, seg);
 	var->selector = vmx_read_guest_seg_selector(vmx, seg);
-	ar = vmx_read_guest_seg_ar(vmx, seg);
-	var->unusable = (ar >> 16) & 1;
-	var->type = ar & 15;
-	var->s = (ar >> 4) & 1;
-	var->dpl = (ar >> 5) & 3;
-	/*
-	 * Some userspaces do not preserve unusable property. Since usable
-	 * segment has to be present according to VMX spec we can use present
-	 * property to amend userspace bug by making unusable segment always
-	 * nonpresent. vmx_segment_access_rights() already marks nonpresent
-	 * segment as unusable.
-	 */
-	var->present = !var->unusable;
-	var->avl = (ar >> 12) & 1;
-	var->l = (ar >> 13) & 1;
-	var->db = (ar >> 14) & 1;
-	var->g = (ar >> 15) & 1;
+	vmx_decode_ar_bytes(vmx_read_guest_seg_ar(vmx, seg), var);
 }
 
 u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
@@ -3194,26 +3175,6 @@ int vmx_get_cpl(struct kvm_vcpu *vcpu)
 	}
 }
 
-static u32 vmx_segment_access_rights(struct kvm_segment *var)
-{
-	u32 ar;
-
-	if (var->unusable || !var->present)
-		ar = 1 << 16;
-	else {
-		ar = var->type & 15;
-		ar |= (var->s & 1) << 4;
-		ar |= (var->dpl & 3) << 5;
-		ar |= (var->present & 1) << 7;
-		ar |= (var->avl & 1) << 12;
-		ar |= (var->l & 1) << 13;
-		ar |= (var->db & 1) << 14;
-		ar |= (var->g & 1) << 15;
-	}
-
-	return ar;
-}
-
 void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -3248,7 +3209,7 @@ void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
 	if (is_unrestricted_guest(vcpu) && (seg != VCPU_SREG_LDTR))
 		var->type |= 0x1; /* Accessed */
 
-	vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
+	vmcs_write32(sf->ar_bytes, vmx_encode_ar_bytes(var));
 }
 
 void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
@@ -3299,7 +3260,7 @@ static bool rmode_segment_valid(struct kvm_vcpu *vcpu, int seg)
 	var.dpl = 0x3;
 	if (seg == VCPU_SREG_CS)
 		var.type = 0x3;
-	ar = vmx_segment_access_rights(&var);
+	ar = vmx_encode_ar_bytes(&var);
 
 	if (var.base != (var.selector << 4))
 		return false;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 51/59] KVM: VMX: MOVE GDT and IDT accessors to common code
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (49 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 50/59] KVM: VMX: Move AR_BYTES encoder/decoder helpers to common.h isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 20:25   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 52/59] KVM: VMX: Move .get_interrupt_shadow() implementation to common VMX code isaku.yamahata
                   ` (9 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    |  7 +++++--
 arch/x86/kvm/vmx/vmx.c     | 12 ------------
 arch/x86/kvm/vmx/x86_ops.h |  2 --
 3 files changed, 5 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index a0a8cc2fd600..4d6bf1f56641 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -3,6 +3,7 @@
 
 #include "x86_ops.h"
 #include "vmx.h"
+#include "common.h"
 #include "nested.h"
 #include "mmu.h"
 #include "pmu.h"
@@ -311,7 +312,8 @@ static int vt_set_efer(struct kvm_vcpu *vcpu, u64 efer)
 
 static void vt_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
-	vmx_get_idt(vcpu, dt);
+	dt->size = vmread32(vcpu, GUEST_IDTR_LIMIT);
+	dt->address = vmreadl(vcpu, GUEST_IDTR_BASE);
 }
 
 static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
@@ -321,7 +323,8 @@ static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 
 static void vt_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
-	vmx_get_gdt(vcpu, dt);
+	dt->size = vmread32(vcpu, GUEST_GDTR_LIMIT);
+	dt->address = vmreadl(vcpu, GUEST_GDTR_BASE);
 }
 
 static void vt_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6f98d1b2a498..a644b9627f9d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3227,24 +3227,12 @@ void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
 	*l = (ar >> 13) & 1;
 }
 
-void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
-{
-	dt->size = vmcs_read32(GUEST_IDTR_LIMIT);
-	dt->address = vmcs_readl(GUEST_IDTR_BASE);
-}
-
 void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
 	vmcs_write32(GUEST_IDTR_LIMIT, dt->size);
 	vmcs_writel(GUEST_IDTR_BASE, dt->address);
 }
 
-void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
-{
-	dt->size = vmcs_read32(GUEST_GDTR_LIMIT);
-	dt->address = vmcs_readl(GUEST_GDTR_BASE);
-}
-
 void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
 	vmcs_write32(GUEST_GDTR_LIMIT, dt->size);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index c49d6f9f36fd..fec22bef05b7 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -66,9 +66,7 @@ void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
 void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
 bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
 int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer);
-void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
 void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
-void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
 void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
 void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val);
 void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 52/59] KVM: VMX: Move .get_interrupt_shadow() implementation to common VMX code
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (50 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 51/59] KVM: VMX: MOVE GDT and IDT accessors to common code isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 20:26   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 53/59] KVM: x86: Add a helper function to restore 4 host MSRs on exit to user space isaku.yamahata
                   ` (8 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/common.h | 14 ++++++++++++++
 arch/x86/kvm/vmx/vmx.c    | 10 +---------
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index e45d2d222168..684cd3add46b 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -120,6 +120,20 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
+static inline u32 __vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu)
+{
+	u32 interruptibility;
+	int ret = 0;
+
+	interruptibility = vmread32(vcpu, GUEST_INTERRUPTIBILITY_INFO);
+	if (interruptibility & GUEST_INTR_STATE_STI)
+		ret |= KVM_X86_SHADOW_INT_STI;
+	if (interruptibility & GUEST_INTR_STATE_MOV_SS)
+		ret |= KVM_X86_SHADOW_INT_MOV_SS;
+
+	return ret;
+}
+
 static inline u32 vmx_encode_ar_bytes(struct kvm_segment *var)
 {
 	u32 ar;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a644b9627f9d..6f38e0d2e1b6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1365,15 +1365,7 @@ void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 
 u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu)
 {
-	u32 interruptibility = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
-	int ret = 0;
-
-	if (interruptibility & GUEST_INTR_STATE_STI)
-		ret |= KVM_X86_SHADOW_INT_STI;
-	if (interruptibility & GUEST_INTR_STATE_MOV_SS)
-		ret |= KVM_X86_SHADOW_INT_MOV_SS;
-
-	return ret;
+	return __vmx_get_interrupt_shadow(vcpu);
 }
 
 void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 53/59] KVM: x86: Add a helper function to restore 4 host MSRs on exit to user space
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (51 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 52/59] KVM: VMX: Move .get_interrupt_shadow() implementation to common VMX code isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 20:34   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 54/59] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch isaku.yamahata
                   ` (7 subsequent siblings)
  60 siblings, 1 reply; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Chao Gao

From: Chao Gao <chao.gao@intel.com>

The TDX module unconditionally reset 4 host MSRs (MSR_SYSCALL_MASK,
MSR_START, MSR_LSTAR, MSR_TSC_AUX) to architectural INIT state on exit from
TDX VM to KVM.  KVM needs to save their values before TD enter and restore
them on exit to userspace.

Reuse current kvm_user_return mechanism and introduce a function to update
cached values and register the user return notifier in this new function.

The later patch will use the helper function to save/restore 4 host MSRs.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 25 ++++++++++++++++++++-----
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a2027bc14e41..17d6e4bcf84b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1910,6 +1910,7 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
 int kvm_add_user_return_msr(u32 msr);
 int kvm_find_user_return_msr(u32 msr);
 int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
+void kvm_user_return_update_cache(unsigned int index, u64 val);
 
 static inline bool kvm_is_supported_user_return_msr(u32 msr)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4dd8ec2641a2..09da7cbedb5f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -418,6 +418,15 @@ static void kvm_user_return_msr_cpu_online(void)
 	}
 }
 
+static void kvm_user_return_register_notifier(struct kvm_user_return_msrs *msrs)
+{
+	if (!msrs->registered) {
+		msrs->urn.on_user_return = kvm_on_user_return;
+		user_return_notifier_register(&msrs->urn);
+		msrs->registered = true;
+	}
+}
+
 int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
 {
 	unsigned int cpu = smp_processor_id();
@@ -432,15 +441,21 @@ int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
 		return 1;
 
 	msrs->values[slot].curr = value;
-	if (!msrs->registered) {
-		msrs->urn.on_user_return = kvm_on_user_return;
-		user_return_notifier_register(&msrs->urn);
-		msrs->registered = true;
-	}
+	kvm_user_return_register_notifier(msrs);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(kvm_set_user_return_msr);
 
+/* Update the cache, "curr", and register the notifier */
+void kvm_user_return_update_cache(unsigned int slot, u64 value)
+{
+	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
+
+	msrs->values[slot].curr = value;
+	kvm_user_return_register_notifier(msrs);
+}
+EXPORT_SYMBOL_GPL(kvm_user_return_update_cache);
+
 static void drop_user_return_notifiers(void)
 {
 	unsigned int cpu = smp_processor_id();
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 54/59] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (52 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 53/59] KVM: x86: Add a helper function to restore 4 host MSRs on exit to user space isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25 20:48   ` Paolo Bonzini
  2021-11-25 21:05   ` Thomas Gleixner
  2021-11-25  0:20 ` [RFC PATCH v3 55/59] KVM: TDX: Add "basic" support for building and running Trust Domains isaku.yamahata
                   ` (6 subsequent siblings)
  60 siblings, 2 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Xiaoyao Li

From: Xiaoyao Li <xiaoyao.li@intel.com>

Introduce a per-vm variable initial_tsc_khz to hold the default tsc_khz
for kvm_arch_vcpu_create().

This field is going to be used by TDX since TSC frequency for TD guest
is configured at TD VM initialization phase.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/vmx/tdx.c          | 2233 +++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              |    3 +-
 3 files changed, 2236 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kvm/vmx/tdx.c

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 17d6e4bcf84b..f10c7c2830e5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1111,6 +1111,7 @@ struct kvm_arch {
 	u64 last_tsc_write;
 	u32 last_tsc_khz;
 	u64 last_tsc_offset;
+	u32 initial_tsc_khz;
 	u64 cur_tsc_nsec;
 	u64 cur_tsc_write;
 	u64 cur_tsc_offset;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
new file mode 100644
index 000000000000..64b2841064c4
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -0,0 +1,2233 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/cpu.h>
+#include <linux/kvm_host.h>
+#include <linux/jump_label.h>
+#include <linux/trace_events.h>
+#include <linux/mmu_context.h>
+#include <linux/pagemap.h>
+#include <linux/perf_event.h>
+#include <linux/freelist.h>
+
+#include <asm/fpu/xcr.h>
+#include <asm/virtext.h>
+
+#include "tdx_errno.h"
+#include "tdx_ops.h"
+#include "x86_ops.h"
+#include "common.h"
+#include "cpuid.h"
+#include "lapic.h"
+#include "tdx.h"
+
+#include <trace/events/kvm.h>
+#include "trace.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "tdx: " fmt
+
+/*
+ * workaround to compile.
+ * TODO: once the TDX module initiation code in x86 host is merged, remove this.
+ * The function returns struct tdsysinfo_struct from TDX module provides.  It
+ * provides the system wide information about the TDX module.
+ */
+#if __has_include(<asm/tdx_host.h>)
+#include <asm/tdx_host.h>
+#else
+static inline const struct tdsysinfo_struct *tdx_get_sysinfo(void)
+{
+	return NULL;
+}
+#endif
+
+/* KeyID range reserved to TDX by BIOS */
+static u32 tdx_keyids_start __read_mostly;
+static u32 tdx_nr_keyids __read_mostly;
+static u32 tdx_seam_keyid __read_mostly;
+
+static void __init tdx_keyids_init(void)
+{
+	u32 nr_mktme_ids;
+
+	rdmsr(MSR_IA32_MKTME_KEYID_PART, nr_mktme_ids, tdx_nr_keyids);
+
+	/* KeyID 0 is reserved, i.e. KeyIDs are 1-based. */
+	tdx_keyids_start = nr_mktme_ids + 1;
+	tdx_seam_keyid = tdx_keyids_start;
+}
+
+/* TDX KeyID pool */
+static DEFINE_IDA(tdx_keyid_pool);
+
+static int tdx_keyid_alloc(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_TDX))
+		return -EINVAL;
+
+	if (WARN_ON_ONCE(!tdx_keyids_start || !tdx_nr_keyids))
+		return -EINVAL;
+
+	/* The first keyID is reserved for the global key. */
+	return ida_alloc_range(&tdx_keyid_pool, tdx_keyids_start + 1,
+			       tdx_keyids_start + tdx_nr_keyids - 1,
+			       GFP_KERNEL);
+}
+
+static void tdx_keyid_free(int keyid)
+{
+	if (!keyid || keyid < 0)
+		return;
+
+	ida_free(&tdx_keyid_pool, keyid);
+}
+
+/* Capabilities of KVM + TDX-SEAM. */
+struct tdx_capabilities tdx_caps;
+
+static DEFINE_MUTEX(tdx_lock);
+static struct mutex *tdx_mng_key_config_lock;
+
+/*
+ * A per-CPU list of TD vCPUs associated with a given CPU.  Used when a CPU
+ * is brought down to invoke TDH_VP_FLUSH on the approapriate TD vCPUS.
+ * Protected by interrupt mask.  This list is manipulated in process context
+ * of vcpu and IPI callback.  See tdx_flush_vp_on_cpu().
+ */
+static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
+
+static u64 hkid_mask __ro_after_init;
+static u8 hkid_start_pos __ro_after_init;
+
+struct tdx_uret_msr {
+	u32 msr;
+	unsigned int slot;
+	u64 defval;
+};
+
+static struct tdx_uret_msr tdx_uret_msrs[] = {
+	{.msr = MSR_SYSCALL_MASK,},
+	{.msr = MSR_STAR,},
+	{.msr = MSR_LSTAR,},
+	{.msr = MSR_TSC_AUX,},
+};
+
+static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
+{
+	pa &= ~hkid_mask;
+	pa |= (u64)hkid << hkid_start_pos;
+
+	return pa;
+}
+
+static __always_inline unsigned long tdexit_exit_qual(struct kvm_vcpu *vcpu)
+{
+	return kvm_rcx_read(vcpu);
+}
+static __always_inline unsigned long tdexit_ext_exit_qual(struct kvm_vcpu *vcpu)
+{
+	return kvm_rdx_read(vcpu);
+}
+static __always_inline unsigned long tdexit_gpa(struct kvm_vcpu *vcpu)
+{
+	return kvm_r8_read(vcpu);
+}
+static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
+{
+	return kvm_r9_read(vcpu);
+}
+
+#define BUILD_TDVMCALL_ACCESSORS(param, gpr)				    \
+static __always_inline							    \
+unsigned long tdvmcall_##param##_read(struct kvm_vcpu *vcpu)		    \
+{									    \
+	return kvm_##gpr##_read(vcpu);					    \
+}									    \
+static __always_inline void tdvmcall_##param##_write(struct kvm_vcpu *vcpu, \
+						     unsigned long val)	    \
+{									    \
+	kvm_##gpr##_write(vcpu, val);					    \
+}
+BUILD_TDVMCALL_ACCESSORS(p1, r12);
+BUILD_TDVMCALL_ACCESSORS(p2, r13);
+BUILD_TDVMCALL_ACCESSORS(p3, r14);
+BUILD_TDVMCALL_ACCESSORS(p4, r15);
+
+static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu)
+{
+	return kvm_r10_read(vcpu);
+}
+static __always_inline unsigned long tdvmcall_exit_reason(struct kvm_vcpu *vcpu)
+{
+	return kvm_r11_read(vcpu);
+}
+static __always_inline void tdvmcall_set_return_code(struct kvm_vcpu *vcpu,
+						     long val)
+{
+	kvm_r10_write(vcpu, val);
+}
+static __always_inline void tdvmcall_set_return_val(struct kvm_vcpu *vcpu,
+						    unsigned long val)
+{
+	kvm_r11_write(vcpu, val);
+}
+
+static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
+{
+	return tdx->tdvpr.added;
+}
+
+static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
+{
+	return kvm_tdx->tdr.added;
+}
+
+static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
+{
+	return kvm_tdx->hkid >= 0;
+}
+
+static inline bool is_td_initialized(struct kvm *kvm)
+{
+	return !!kvm->max_vcpus;
+}
+
+static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
+{
+	return kvm_tdx->finalized;
+}
+
+static void tdx_clear_page(unsigned long page)
+{
+	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
+	unsigned long i;
+
+	/* Zeroing the page is only necessary for systems with MKTME-i. */
+	if (!static_cpu_has(X86_FEATURE_MOVDIR64B))
+		return;
+
+	for (i = 0; i < 4096; i += 64)
+		/* MOVDIR64B [rdx], es:rdi */
+		asm (".byte 0x66, 0x0f, 0x38, 0xf8, 0x3a"
+		     : : "d" (zero_page), "D" (page + i) : "memory");
+}
+
+static int __tdx_reclaim_page(unsigned long va, hpa_t pa, bool do_wb, u16 hkid)
+{
+	struct tdx_ex_ret ex_ret;
+	u64 err;
+
+	err = tdh_phymem_page_reclaim(pa, &ex_ret);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_PHYMEM_PAGE_RECLAIM, err, &ex_ret);
+		return -EIO;
+	}
+
+	if (do_wb) {
+		err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(pa, hkid));
+		if (WARN_ON_ONCE(err)) {
+			pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+			return -EIO;
+		}
+	}
+
+	tdx_clear_page(va);
+	return 0;
+}
+
+static int tdx_reclaim_page(unsigned long va, hpa_t pa)
+{
+	return __tdx_reclaim_page(va, pa, false, 0);
+}
+
+static int tdx_alloc_td_page(struct tdx_td_page *page)
+{
+	page->va = __get_free_page(GFP_KERNEL_ACCOUNT);
+	if (!page->va)
+		return -ENOMEM;
+
+	page->pa = __pa(page->va);
+	return 0;
+}
+
+static void tdx_add_td_page(struct tdx_td_page *page)
+{
+	WARN_ON_ONCE(page->added);
+	page->added = true;
+}
+
+static void tdx_reclaim_td_page(struct tdx_td_page *page)
+{
+	if (page->added) {
+		if (tdx_reclaim_page(page->va, page->pa))
+			return;
+
+		page->added = false;
+	}
+	free_page(page->va);
+}
+
+static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
+{
+	list_del(&to_tdx(vcpu)->cpu_list);
+
+	/*
+	 * Ensure tdx->cpu_list is updated is before setting vcpu->cpu to -1,
+	 * otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU
+	 * to its list before its deleted from this CPUs list.
+	 */
+	smp_wmb();
+
+	vcpu->cpu = -1;
+}
+
+static void tdx_flush_vp(void *arg)
+{
+	struct kvm_vcpu *vcpu = arg;
+	u64 err;
+
+	/* Task migration can race with CPU offlining. */
+	if (vcpu->cpu != raw_smp_processor_id())
+		return;
+
+	/*
+	 * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized.  The
+	 * list tracking still needs to be updated so that it's correct if/when
+	 * the vCPU does get initialized.
+	 */
+	if (is_td_vcpu_created(to_tdx(vcpu))) {
+		err = tdh_vp_flush(to_tdx(vcpu)->tdvpr.pa);
+		if (unlikely(err && err != TDX_VCPU_NOT_ASSOCIATED)) {
+			if (WARN_ON_ONCE(err))
+				pr_tdx_error(TDH_VP_FLUSH, err, NULL);
+		}
+	}
+
+	tdx_disassociate_vp(vcpu);
+}
+
+static void tdx_flush_vp_on_cpu(struct kvm_vcpu *vcpu)
+{
+	if (unlikely(vcpu->cpu == -1))
+		return;
+
+	smp_call_function_single(vcpu->cpu, tdx_flush_vp, vcpu, 1);
+}
+
+static int tdx_do_tdh_phymem_cache_wb(void *param)
+{
+	u64 err = 0;
+
+	mutex_lock(&tdx_lock);
+	do {
+		err = tdh_phymem_cache_wb(!!err);
+	} while (err == TDX_INTERRUPTED_RESUMABLE);
+	mutex_unlock(&tdx_lock);
+
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+void tdx_vm_teardown(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct kvm_vcpu *vcpu;
+	u64 err;
+	int ret;
+	int i;
+
+	if (!is_hkid_assigned(kvm_tdx))
+		return;
+
+	if (!is_td_created(kvm_tdx))
+		goto free_hkid;
+
+	mutex_lock(&tdx_lock);
+	err = tdh_mng_key_reclaimid(kvm_tdx->tdr.pa);
+	mutex_unlock(&tdx_lock);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_KEY_RECLAIMID, err, NULL);
+		return;
+	}
+
+	kvm_for_each_vcpu(i, vcpu, (&kvm_tdx->kvm))
+		tdx_flush_vp_on_cpu(vcpu);
+
+	mutex_lock(&tdx_lock);
+	err = tdh_mng_vpflushdone(kvm_tdx->tdr.pa);
+	mutex_unlock(&tdx_lock);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_VPFLUSHDONE, err, NULL);
+		return;
+	}
+
+	/*
+	 * TODO: optimize to invoke the callback only once per CPU package
+	 * instead of all CPUS because TDH.PHYMEM.CACHE.WB is per CPU package
+	 * operation.
+	 *
+	 * Invoke the callback one-by-one to avoid contention.
+	 * TDH.PHYMEM.CACHE.WB competes for key ownership table lock.
+	 */
+	ret = 0;
+	for_each_online_cpu(i) {
+		ret = smp_call_on_cpu(i, tdx_do_tdh_phymem_cache_wb, NULL, 1);
+		if (ret)
+			break;
+	}
+	if (unlikely(ret))
+		return;
+
+	mutex_lock(&tdx_lock);
+	err = tdh_mng_key_freeid(kvm_tdx->tdr.pa);
+	mutex_unlock(&tdx_lock);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
+		return;
+	}
+
+free_hkid:
+	tdx_keyid_free(kvm_tdx->hkid);
+	kvm_tdx->hkid = -1;
+}
+
+void tdx_vm_destroy(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	int i;
+
+	/* Can't reclaim or free TD pages if teardown failed. */
+	if (is_hkid_assigned(kvm_tdx))
+		return;
+
+	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++)
+		tdx_reclaim_td_page(&kvm_tdx->tdcs[i]);
+
+	if (kvm_tdx->tdr.added &&
+	    __tdx_reclaim_page(kvm_tdx->tdr.va, kvm_tdx->tdr.pa, true, tdx_seam_keyid))
+		return;
+
+	free_page(kvm_tdx->tdr.va);
+}
+
+static int tdx_do_tdh_mng_key_config(void *param)
+{
+	hpa_t *tdr_p = param;
+	int cpu, cur_pkg;
+	u64 err;
+
+	cpu = raw_smp_processor_id();
+	cur_pkg = topology_physical_package_id(cpu);
+
+	mutex_lock(&tdx_mng_key_config_lock[cur_pkg]);
+	do {
+		err = tdh_mng_key_config(*tdr_p);
+	} while (err == TDX_KEY_GENERATION_FAILED);
+	mutex_unlock(&tdx_mng_key_config_lock[cur_pkg]);
+
+	/*
+	 * TDH.MNG.KEY.CONFIG is per CPU package operation.  Other CPU on the
+	 * same package did it for us.
+	 */
+	if (err == TDX_KEY_CONFIGURED)
+		err = 0;
+
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_KEY_CONFIG, err, NULL);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+int tdx_vm_init(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	int ret, i;
+	u64 err;
+
+	kvm->dirty_log_unsupported = true;
+	kvm->readonly_mem_unsupported = true;
+
+	kvm->arch.tsc_immutable = true;
+	kvm->arch.eoi_intercept_unsupported = true;
+	kvm->arch.smm_unsupported = true;
+	kvm->arch.init_sipi_unsupported = true;
+	kvm->arch.irq_injection_disallowed = true;
+	kvm->arch.mce_injection_disallowed = true;
+
+	/* TODO: Enable 2mb and 1gb large page support. */
+	kvm->arch.tdp_max_page_level = PG_LEVEL_4K;
+
+	/* vCPUs can't be created until after KVM_TDX_INIT_VM. */
+	kvm->max_vcpus = 0;
+
+	kvm_tdx->hkid = tdx_keyid_alloc();
+	if (kvm_tdx->hkid < 0)
+		return -EBUSY;
+	if (WARN_ON_ONCE(kvm_tdx->hkid >> 16)) {
+		ret = -EIO;
+		goto free_hkid;
+	}
+
+	ret = tdx_alloc_td_page(&kvm_tdx->tdr);
+	if (ret)
+		goto free_hkid;
+
+	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++) {
+		ret = tdx_alloc_td_page(&kvm_tdx->tdcs[i]);
+		if (ret)
+			goto free_tdcs;
+	}
+
+	ret = -EIO;
+	mutex_lock(&tdx_lock);
+	err = tdh_mng_create(kvm_tdx->tdr.pa, kvm_tdx->hkid);
+	mutex_unlock(&tdx_lock);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_CREATE, err, NULL);
+		goto free_tdcs;
+	}
+	tdx_add_td_page(&kvm_tdx->tdr);
+
+	/*
+	 * TODO: optimize to invoke the callback only once per CPU package
+	 * instead of all CPUS because TDH.MNG.KEY.CONFIG is per CPU package
+	 * operation.
+	 *
+	 * Invoke callback one-by-one to avoid contention because
+	 * TDH.MNG.KEY.CONFIG competes for TDR lock.
+	 */
+	for_each_online_cpu(i) {
+		ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
+				&kvm_tdx->tdr.pa, 1);
+		if (ret)
+			break;
+	}
+	if (ret)
+		goto teardown;
+
+	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++) {
+		err = tdh_mng_addcx(kvm_tdx->tdr.pa, kvm_tdx->tdcs[i].pa);
+		if (WARN_ON_ONCE(err)) {
+			pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
+			goto teardown;
+		}
+		tdx_add_td_page(&kvm_tdx->tdcs[i]);
+	}
+
+	/*
+	 * Note, TDH_MNG_INIT cannot be invoked here.  TDH_MNG_INIT requires a dedicated
+	 * ioctl() to define the configure CPUID values for the TD.
+	 */
+	return 0;
+
+	/*
+	 * The sequence for freeing resources from a partially initialized TD
+	 * varies based on where in the initialization flow failure occurred.
+	 * Simply use the full teardown and destroy, which naturally play nice
+	 * with partial initialization.
+	 */
+teardown:
+	tdx_vm_teardown(kvm);
+	tdx_vm_destroy(kvm);
+	return ret;
+
+free_tdcs:
+	/* @i points at the TDCS page that failed allocation. */
+	for (--i; i >= 0; i--)
+		free_page(kvm_tdx->tdcs[i].va);
+
+	free_page(kvm_tdx->tdr.va);
+free_hkid:
+	tdx_keyid_free(kvm_tdx->hkid);
+	return ret;
+}
+
+int tdx_vcpu_create(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	int ret, i;
+
+	ret = tdx_alloc_td_page(&tdx->tdvpr);
+	if (ret)
+		return ret;
+
+	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
+		ret = tdx_alloc_td_page(&tdx->tdvpx[i]);
+		if (ret)
+			goto free_tdvpx;
+	}
+
+	vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
+
+	vcpu->arch.switch_db_regs = KVM_DEBUGREG_AUTO_SWITCH;
+	vcpu->arch.cr0_guest_owned_bits = -1ul;
+	vcpu->arch.cr4_guest_owned_bits = -1ul;
+
+	vcpu->arch.tsc_offset = to_kvm_tdx(vcpu->kvm)->tsc_offset;
+	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
+	vcpu->arch.guest_state_protected =
+		!(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
+	vcpu->arch.root_mmu.no_prefetch = true;
+
+	tdx->pi_desc.nv = POSTED_INTR_VECTOR;
+	tdx->pi_desc.sn = 1;
+	tdx->host_state_need_save = true;
+	tdx->host_state_need_restore = false;
+
+	return 0;
+
+free_tdvpx:
+	/* @i points at the TDVPX page that failed allocation. */
+	for (--i; i >= 0; i--)
+		free_page(tdx->tdvpx[i].va);
+
+	free_page(tdx->tdvpr.va);
+
+	return ret;
+}
+
+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (vcpu->cpu != cpu) {
+		tdx_flush_vp_on_cpu(vcpu);
+
+		local_irq_disable();
+		/*
+		 * Pairs with the smp_wmb() in tdx_disassociate_vp() to ensure
+		 * vcpu->cpu is read before tdx->cpu_list.
+		 */
+		smp_rmb();
+
+		list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
+		local_irq_enable();
+	}
+
+	vmx_vcpu_pi_load(vcpu, cpu);
+}
+
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (!tdx->host_state_need_save)
+		return;
+
+	if (likely(is_64bit_mm(current->mm)))
+		tdx->msr_host_kernel_gs_base = current->thread.gsbase;
+	else
+		tdx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
+
+	tdx->host_state_need_save = false;
+}
+
+static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	tdx->host_state_need_save = true;
+	if (!tdx->host_state_need_restore)
+		return;
+
+	wrmsrl(MSR_KERNEL_GS_BASE, tdx->msr_host_kernel_gs_base);
+	tdx->host_state_need_restore = false;
+}
+
+void tdx_vcpu_put(struct kvm_vcpu *vcpu)
+{
+	vmx_vcpu_pi_put(vcpu);
+
+	tdx_prepare_switch_to_host(vcpu);
+}
+
+void tdx_vcpu_free(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	int i;
+
+	/* Can't reclaim or free pages if teardown failed. */
+	if (is_hkid_assigned(to_kvm_tdx(vcpu->kvm)))
+		return;
+
+	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++)
+		tdx_reclaim_td_page(&tdx->tdvpx[i]);
+
+	tdx_reclaim_td_page(&tdx->tdvpr);
+
+	/*
+	 * kvm_free_vcpus()
+	 *   -> kvm_unload_vcpu_mmu()
+	 *
+	 * does vcpu_load() for every vcpu after they already disassociated
+	 * from the per cpu list when tdx_vm_teardown(). So we need to
+	 * disassociate them again, otherwise the freed vcpu data will be
+	 * accessed when do list_{del,add}() on associated_tdvcpus list
+	 * later.
+	 */
+	tdx_flush_vp_on_cpu(vcpu);
+}
+
+void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	struct msr_data apic_base_msr;
+	u64 err;
+	int i;
+
+	if (WARN_ON(init_event) || !vcpu->arch.apic)
+		goto td_bugged;
+
+	if (WARN_ON(is_td_vcpu_created(tdx)))
+		goto td_bugged;
+
+	err = tdh_vp_create(kvm_tdx->tdr.pa, tdx->tdvpr.pa);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_VP_CREATE, err, NULL);
+		goto td_bugged;
+	}
+	tdx_add_td_page(&tdx->tdvpr);
+
+	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
+		err = tdh_vp_addcx(tdx->tdvpr.pa, tdx->tdvpx[i].pa);
+		if (WARN_ON_ONCE(err)) {
+			pr_tdx_error(TDH_VP_ADDCX, err, NULL);
+			goto td_bugged;
+		}
+		tdx_add_td_page(&tdx->tdvpx[i]);
+	}
+
+	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
+	if (kvm_vcpu_is_reset_bsp(vcpu))
+		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
+	apic_base_msr.host_initiated = true;
+	if (WARN_ON(kvm_set_apic_base(vcpu, &apic_base_msr)))
+		goto td_bugged;
+
+	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+
+	return;
+
+td_bugged:
+	vcpu->kvm->vm_bugged = true;
+}
+
+void tdx_inject_nmi(struct kvm_vcpu *vcpu)
+{
+	td_management_write8(to_tdx(vcpu), TD_VCPU_PEND_NMI, 1);
+}
+
+static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
+{
+	/* Avoid costly SEAMCALL if no nmi was injected */
+	if (vcpu->arch.nmi_injected)
+		vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
+							      TD_VCPU_PEND_NMI);
+}
+
+static void tdx_user_return_update_cache(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
+		kvm_user_return_update_cache(tdx_uret_msrs[i].slot,
+					     tdx_uret_msrs[i].defval);
+}
+
+static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+
+	if (static_cpu_has(X86_FEATURE_XSAVE) &&
+	    host_xcr0 != (kvm_tdx->xfam & supported_xcr0))
+		xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
+	if (static_cpu_has(X86_FEATURE_XSAVES) &&
+	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
+	    host_xss != (kvm_tdx->xfam & (supported_xss | XFEATURE_MASK_PT)))
+		wrmsrl(MSR_IA32_XSS, host_xss);
+	if (static_cpu_has(X86_FEATURE_PKU) &&
+	    (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
+		write_pkru(vcpu->arch.host_pkru);
+}
+
+u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);
+
+static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
+					struct vcpu_tdx *tdx)
+{
+	kvm_guest_enter_irqoff();
+
+	tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs,
+					       tdx->tdvmcall.regs_mask);
+
+	kvm_guest_exit_irqoff();
+}
+
+fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (unlikely(vcpu->kvm->vm_bugged)) {
+		tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
+		return EXIT_FASTPATH_NONE;
+	}
+
+	trace_kvm_entry(vcpu);
+
+	if (pi_test_on(&tdx->pi_desc)) {
+		apic->send_IPI_self(POSTED_INTR_VECTOR);
+
+		kvm_wait_lapic_expire(vcpu, true);
+	}
+
+	tdx_vcpu_enter_exit(vcpu, tdx);
+
+	tdx_user_return_update_cache();
+	perf_restore_debug_store();
+	tdx_restore_host_xsave_state(vcpu);
+	tdx->host_state_need_restore = true;
+
+	vmx_register_cache_reset(vcpu);
+
+	trace_kvm_exit(vcpu, KVM_ISA_VMX);
+
+	tdx_complete_interrupts(vcpu);
+
+	if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
+		return EXIT_FASTPATH_NONE;
+
+	if (tdx->exit_reason.basic == EXIT_REASON_TDCALL)
+		tdx->tdvmcall.rcx = vcpu->arch.regs[VCPU_REGS_RCX];
+	else
+		tdx->tdvmcall.rcx = 0;
+
+	return EXIT_FASTPATH_NONE;
+}
+
+void tdx_hardware_enable(void)
+{
+	INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, raw_smp_processor_id()));
+}
+
+void tdx_hardware_disable(void)
+{
+	int cpu = raw_smp_processor_id();
+	struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
+	struct vcpu_tdx *tdx, *tmp;
+
+	/* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
+	list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list)
+		tdx_disassociate_vp(&tdx->vcpu);
+}
+
+void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+{
+	u16 exit_reason = to_tdx(vcpu)->exit_reason.basic;
+
+	if (exit_reason == EXIT_REASON_EXCEPTION_NMI)
+		vmx_handle_exception_nmi_irqoff(vcpu, tdexit_intr_info(vcpu));
+	else if (exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
+		vmx_handle_external_interrupt_irqoff(vcpu,
+						     tdexit_intr_info(vcpu));
+}
+
+static int tdx_handle_exception(struct kvm_vcpu *vcpu)
+{
+	u32 intr_info = tdexit_intr_info(vcpu);
+
+	if (is_nmi(intr_info) || is_machine_check(intr_info))
+		return 1;
+
+	kvm_pr_unimpl("unexpected exception 0x%x\n", intr_info);
+	return -EFAULT;
+}
+
+static int tdx_handle_external_interrupt(struct kvm_vcpu *vcpu)
+{
+	++vcpu->stat.irq_exits;
+	return 1;
+}
+
+static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
+{
+	vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
+	vcpu->mmio_needed = 0;
+	return 0;
+}
+
+static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
+{
+	u32 eax, ebx, ecx, edx;
+
+	eax = tdvmcall_p1_read(vcpu);
+	ecx = tdvmcall_p2_read(vcpu);
+
+	kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, true);
+
+	tdvmcall_p1_write(vcpu, eax);
+	tdvmcall_p2_write(vcpu, ebx);
+	tdvmcall_p3_write(vcpu, ecx);
+	tdvmcall_p4_write(vcpu, edx);
+
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+
+	return 1;
+}
+
+static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
+{
+	bool interrupt_disabled = tdvmcall_p1_read(vcpu);
+	union tdx_vcpu_state_details details;
+
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+
+	if (!interrupt_disabled) {
+		details.full = td_state_non_arch_read64(
+			to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);
+		if (details.vmxip)
+			return 1;
+	}
+
+	return kvm_vcpu_halt(vcpu);
+}
+
+static int tdx_complete_pio_in(struct kvm_vcpu *vcpu)
+{
+	struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+	unsigned long val = 0;
+	int ret;
+
+	WARN_ON(vcpu->arch.pio.count != 1);
+
+	ret = ctxt->ops->pio_in_emulated(ctxt, vcpu->arch.pio.size,
+					 vcpu->arch.pio.port, &val, 1);
+	WARN_ON(!ret);
+
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+	tdvmcall_set_return_val(vcpu, val);
+
+	return 1;
+}
+
+static int tdx_emulate_io(struct kvm_vcpu *vcpu)
+{
+	struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+	unsigned long val = 0;
+	unsigned int port;
+	int size, ret;
+
+	++vcpu->stat.io_exits;
+
+	size = tdvmcall_p1_read(vcpu);
+	port = tdvmcall_p3_read(vcpu);
+
+	if (size != 1 && size != 2 && size != 4) {
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+		return 1;
+	}
+
+	if (!tdvmcall_p2_read(vcpu)) {
+		ret = ctxt->ops->pio_in_emulated(ctxt, size, port, &val, 1);
+		if (!ret)
+			vcpu->arch.complete_userspace_io = tdx_complete_pio_in;
+		else
+			tdvmcall_set_return_val(vcpu, val);
+	} else {
+		val = tdvmcall_p4_read(vcpu);
+		ret = ctxt->ops->pio_out_emulated(ctxt, size, port, &val, 1);
+
+		// No need for a complete_userspace_io callback.
+		vcpu->arch.pio.count = 0;
+	}
+	if (ret)
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+	return ret;
+}
+
+static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
+{
+	unsigned long nr, a0, a1, a2, a3, ret;
+
+	nr = tdvmcall_exit_reason(vcpu);
+	a0 = tdvmcall_p1_read(vcpu);
+	a1 = tdvmcall_p2_read(vcpu);
+	a2 = tdvmcall_p3_read(vcpu);
+	a3 = tdvmcall_p4_read(vcpu);
+
+	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, true);
+
+	tdvmcall_set_return_code(vcpu, ret);
+
+	return 1;
+}
+
+static int tdx_complete_mmio(struct kvm_vcpu *vcpu)
+{
+	unsigned long val = 0;
+	gpa_t gpa;
+	int size;
+
+	WARN_ON(vcpu->mmio_needed != 1);
+	vcpu->mmio_needed = 0;
+
+	if (!vcpu->mmio_is_write) {
+		gpa = vcpu->mmio_fragments[0].gpa;
+		size = vcpu->mmio_fragments[0].len;
+
+		memcpy(&val, vcpu->run->mmio.data, size);
+		tdvmcall_set_return_val(vcpu, val);
+		trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+	}
+	return 1;
+}
+
+static inline int tdx_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, int size,
+				 unsigned long val)
+{
+	if (kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
+	    kvm_io_bus_write(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+		return -EOPNOTSUPP;
+
+	trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, size, gpa, &val);
+	return 0;
+}
+
+static inline int tdx_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, int size)
+{
+	unsigned long val;
+
+	if (kvm_iodevice_read(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
+	    kvm_io_bus_read(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+		return -EOPNOTSUPP;
+
+	tdvmcall_set_return_val(vcpu, val);
+	trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+	return 0;
+}
+
+static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
+{
+	struct kvm_memory_slot *slot;
+	int size, write, r;
+	unsigned long val;
+	gpa_t gpa;
+
+	WARN_ON(vcpu->mmio_needed);
+
+	size = tdvmcall_p1_read(vcpu);
+	write = tdvmcall_p2_read(vcpu);
+	gpa = tdvmcall_p3_read(vcpu);
+	val = write ? tdvmcall_p4_read(vcpu) : 0;
+
+	/* Strip the shared bit, allow MMIO with and without it set. */
+	gpa &= ~(vcpu->kvm->arch.gfn_shared_mask << PAGE_SHIFT);
+
+	if (size > 8u || ((gpa + size - 1) ^ gpa) & PAGE_MASK) {
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+		return 1;
+	}
+
+	slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa >> PAGE_SHIFT);
+	if (slot && !(slot->flags & KVM_MEMSLOT_INVALID)) {
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+		return 1;
+	}
+
+	if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
+		trace_kvm_fast_mmio(gpa);
+		return 1;
+	}
+
+	if (write)
+		r = tdx_mmio_write(vcpu, gpa, size, val);
+	else
+		r = tdx_mmio_read(vcpu, gpa, size);
+	if (!r) {
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+		return 1;
+	}
+
+	vcpu->mmio_needed = 1;
+	vcpu->mmio_is_write = write;
+	vcpu->arch.complete_userspace_io = tdx_complete_mmio;
+
+	vcpu->run->mmio.phys_addr = gpa;
+	vcpu->run->mmio.len = size;
+	vcpu->run->mmio.is_write = write;
+	vcpu->run->exit_reason = KVM_EXIT_MMIO;
+
+	if (write) {
+		memcpy(vcpu->run->mmio.data, &val, size);
+	} else {
+		vcpu->mmio_fragments[0].gpa = gpa;
+		vcpu->mmio_fragments[0].len = size;
+		trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, size, gpa, NULL);
+	}
+	return 0;
+}
+
+static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
+{
+	u32 index = tdvmcall_p1_read(vcpu);
+	u64 data;
+
+	if (kvm_get_msr(vcpu, index, &data)) {
+		trace_kvm_msr_read_ex(index);
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+		return 1;
+	}
+	trace_kvm_msr_read(index, data);
+
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+	tdvmcall_set_return_val(vcpu, data);
+	return 1;
+}
+
+static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
+{
+	u32 index = tdvmcall_p1_read(vcpu);
+	u64 data = tdvmcall_p2_read(vcpu);
+
+	if (kvm_set_msr(vcpu, index, data)) {
+		trace_kvm_msr_write_ex(index, data);
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+		return 1;
+	}
+
+	trace_kvm_msr_write(index, data);
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+	return 1;
+}
+
+static int tdx_map_gpa(struct kvm_vcpu *vcpu)
+{
+	gpa_t gpa = tdvmcall_p1_read(vcpu);
+	gpa_t size = tdvmcall_p2_read(vcpu);
+
+	if (!IS_ALIGNED(gpa, 4096) || !IS_ALIGNED(size, 4096) ||
+	    (gpa + size) < gpa ||
+	    (gpa + size) > vcpu->kvm->arch.gfn_shared_mask << (PAGE_SHIFT + 1))
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+	else
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+
+	return 1;
+}
+
+static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
+{
+	vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
+	vcpu->run->system_event.type = KVM_SYSTEM_EVENT_CRASH;
+	vcpu->run->system_event.flags = tdvmcall_p1_read(vcpu);
+	return 0;
+}
+
+static int handle_tdvmcall(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	unsigned long exit_reason;
+
+	if (unlikely(tdx->tdvmcall.xmm_mask))
+		goto unsupported;
+
+	if (tdvmcall_exit_type(vcpu))
+		return tdx_emulate_vmcall(vcpu);
+
+	exit_reason = tdvmcall_exit_reason(vcpu);
+
+	switch (exit_reason) {
+	case EXIT_REASON_CPUID:
+		return tdx_emulate_cpuid(vcpu);
+	case EXIT_REASON_HLT:
+		return tdx_emulate_hlt(vcpu);
+	case EXIT_REASON_IO_INSTRUCTION:
+		return tdx_emulate_io(vcpu);
+	case EXIT_REASON_MSR_READ:
+		return tdx_emulate_rdmsr(vcpu);
+	case EXIT_REASON_MSR_WRITE:
+		return tdx_emulate_wrmsr(vcpu);
+	case EXIT_REASON_EPT_VIOLATION:
+		return tdx_emulate_mmio(vcpu);
+	case TDG_VP_VMCALL_MAP_GPA:
+		return tdx_map_gpa(vcpu);
+	case TDG_VP_VMCALL_REPORT_FATAL_ERROR:
+		return tdx_report_fatal_error(vcpu);
+	default:
+		break;
+	}
+
+unsupported:
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+	return 1;
+}
+
+void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
+{
+	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
+}
+
+static void tdx_measure_page(struct kvm_tdx *kvm_tdx, hpa_t gpa)
+{
+	struct tdx_ex_ret ex_ret;
+	u64 err;
+	int i;
+
+	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
+		err = tdh_mr_extend(kvm_tdx->tdr.pa, gpa + i, &ex_ret);
+		if (KVM_BUG_ON(err, &kvm_tdx->kvm)) {
+			pr_tdx_error(TDH_MR_EXTEND, err, &ex_ret);
+			break;
+		}
+	}
+}
+
+static void tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+				      enum pg_level level, kvm_pfn_t pfn)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	hpa_t hpa = pfn << PAGE_SHIFT;
+	gpa_t gpa = gfn << PAGE_SHIFT;
+	struct tdx_ex_ret ex_ret;
+	hpa_t source_pa;
+	u64 err;
+
+	if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
+		return;
+
+	/* TODO: handle large pages. */
+	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+		return;
+
+	/* Pin the page, KVM doesn't yet support page migration. */
+	get_page(pfn_to_page(pfn));
+
+	/* Build-time faults are induced and handled via TDH_MEM_PAGE_ADD. */
+	if (is_td_finalized(kvm_tdx)) {
+		err = tdh_mem_page_aug(kvm_tdx->tdr.pa, gpa, hpa, &ex_ret);
+		if (KVM_BUG_ON(err, kvm))
+			pr_tdx_error(TDH_MEM_PAGE_AUG, err, &ex_ret);
+		return;
+	}
+
+	WARN_ON(kvm_tdx->source_pa == INVALID_PAGE);
+	source_pa = kvm_tdx->source_pa & ~KVM_TDX_MEASURE_MEMORY_REGION;
+
+	err = tdh_mem_page_add(kvm_tdx->tdr.pa, gpa, hpa, source_pa, &ex_ret);
+	if (KVM_BUG_ON(err, kvm))
+		pr_tdx_error(TDH_MEM_PAGE_ADD, err, &ex_ret);
+	else if ((kvm_tdx->source_pa & KVM_TDX_MEASURE_MEMORY_REGION))
+		tdx_measure_page(kvm_tdx, gpa);
+
+	kvm_tdx->source_pa = INVALID_PAGE;
+}
+
+static void tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				       kvm_pfn_t pfn)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn << PAGE_SHIFT;
+	hpa_t hpa = pfn << PAGE_SHIFT;
+	hpa_t hpa_with_hkid;
+	struct tdx_ex_ret ex_ret;
+	u64 err;
+
+	/* TODO: handle large pages. */
+	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+		return;
+
+	if (is_hkid_assigned(kvm_tdx)) {
+		err = tdh_mem_page_remove(kvm_tdx->tdr.pa, gpa, tdx_level, &ex_ret);
+		if (KVM_BUG_ON(err, kvm)) {
+			pr_tdx_error(TDH_MEM_PAGE_REMOVE, err, &ex_ret);
+			return;
+		}
+
+		hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
+		err = tdh_phymem_page_wbinvd(hpa_with_hkid);
+		if (WARN_ON_ONCE(err)) {
+			pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+			return;
+		}
+	} else if (tdx_reclaim_page((unsigned long)__va(hpa), hpa)) {
+		return;
+	}
+
+	put_page(pfn_to_page(pfn));
+}
+
+static int tdx_sept_link_private_sp(struct kvm *kvm, gfn_t gfn,
+				    enum pg_level level, void *sept_page)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn << PAGE_SHIFT;
+	hpa_t hpa = __pa(sept_page);
+	struct tdx_ex_ret ex_ret;
+	u64 err;
+
+	err = tdh_mem_sept_add(kvm_tdx->tdr.pa, gpa, tdx_level, hpa, &ex_ret);
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_MEM_SEPT_ADD, err, &ex_ret);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static void tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn << PAGE_SHIFT;
+	struct tdx_ex_ret ex_ret;
+	u64 err;
+
+	err = tdh_mem_range_block(kvm_tdx->tdr.pa, gpa, tdx_level, &ex_ret);
+	if (KVM_BUG_ON(err, kvm))
+		pr_tdx_error(TDH_MEM_RANGE_BLOCK, err, &ex_ret);
+}
+
+static void tdx_sept_unzap_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn << PAGE_SHIFT;
+	struct tdx_ex_ret ex_ret;
+	u64 err;
+
+	err = tdh_mem_range_unblock(kvm_tdx->tdr.pa, gpa, tdx_level, &ex_ret);
+	if (KVM_BUG_ON(err, kvm))
+		pr_tdx_error(TDH_MEM_RANGE_UNBLOCK, err, &ex_ret);
+}
+
+static int tdx_sept_free_private_sp(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				    void *sept_page)
+{
+	/*
+	 * free_private_sp() is (obviously) called when a shadow page is being
+	 * zapped.  KVM doesn't (yet) zap private SPs while the TD is active.
+	 */
+	if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
+		return -EINVAL;
+
+	return tdx_reclaim_page((unsigned long)sept_page, __pa(sept_page));
+}
+
+static int tdx_sept_tlb_remote_flush(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx;
+	u64 err;
+
+	if (!is_td(kvm))
+		return -EOPNOTSUPP;
+
+	kvm_tdx = to_kvm_tdx(kvm);
+	kvm_tdx->tdh_mem_track = true;
+
+	kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
+
+	if (is_hkid_assigned(kvm_tdx) && is_td_finalized(kvm_tdx)) {
+		err = tdh_mem_track(to_kvm_tdx(kvm)->tdr.pa);
+		if (KVM_BUG_ON(err, kvm))
+			pr_tdx_error(TDH_MEM_TRACK, err, NULL);
+	}
+
+	WRITE_ONCE(kvm_tdx->tdh_mem_track, false);
+
+	return 0;
+}
+
+void tdx_flush_tlb(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	struct kvm_mmu *mmu = vcpu->arch.mmu;
+	u64 root_hpa = mmu->root_hpa;
+
+	/* Flush the shared EPTP, if it's valid. */
+	if (VALID_PAGE(root_hpa))
+		ept_sync_context(construct_eptp(vcpu, root_hpa,
+						mmu->shadow_root_level));
+
+	while (READ_ONCE(kvm_tdx->tdh_mem_track))
+		cpu_relax();
+}
+
+static inline bool tdx_is_private_gpa(struct kvm *kvm, gpa_t gpa)
+{
+	return !((gpa >> PAGE_SHIFT) & kvm->arch.gfn_shared_mask);
+}
+
+#define TDX_SEPT_PFERR (PFERR_WRITE_MASK | PFERR_USER_MASK)
+
+static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qual;
+
+	if (tdx_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu)))
+		exit_qual = TDX_SEPT_PFERR;
+	else
+		exit_qual = tdexit_exit_qual(vcpu);
+	trace_kvm_page_fault(tdexit_gpa(vcpu), exit_qual);
+	return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
+}
+
+static int tdx_handle_ept_misconfig(struct kvm_vcpu *vcpu)
+{
+	WARN_ON(1);
+
+	vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+	vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_MISCONFIG;
+
+	return 0;
+}
+
+static int tdx_handle_bus_lock_vmexit(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * When EXIT_REASON_BUS_LOCK, bus_lock_detected bit is not necessarily
+	 * set.  Enforce the bit set so that tdx_handle_exit() will handle it
+	 * uniformly.
+	 */
+	to_tdx(vcpu)->exit_reason.bus_lock_detected = true;
+	return 1;
+}
+
+static int __tdx_handle_exit(struct kvm_vcpu *vcpu,
+			     enum exit_fastpath_completion fastpath)
+{
+	union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
+
+	if (unlikely(exit_reason.non_recoverable || exit_reason.error)) {
+		kvm_pr_unimpl("TD exit 0x%llx, %d qual 0x%lx ext 0x%lx gpa 0x%lx intr 0x%lx\n",
+			      exit_reason.full, exit_reason.basic,
+			      tdexit_exit_qual(vcpu),
+			      tdexit_ext_exit_qual(vcpu),
+			      tdexit_gpa(vcpu),
+			      tdexit_intr_info(vcpu));
+		if (exit_reason.basic == EXIT_REASON_TRIPLE_FAULT)
+			return tdx_handle_triple_fault(vcpu);
+
+		goto unhandled_exit;
+	}
+
+	WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
+
+	switch (exit_reason.basic) {
+	case EXIT_REASON_EXCEPTION_NMI:
+		return tdx_handle_exception(vcpu);
+	case EXIT_REASON_EXTERNAL_INTERRUPT:
+		return tdx_handle_external_interrupt(vcpu);
+	case EXIT_REASON_TDCALL:
+		return handle_tdvmcall(vcpu);
+	case EXIT_REASON_EPT_VIOLATION:
+		return tdx_handle_ept_violation(vcpu);
+	case EXIT_REASON_EPT_MISCONFIG:
+		return tdx_handle_ept_misconfig(vcpu);
+	case EXIT_REASON_OTHER_SMI:
+		/*
+		 * If reach here, it's not a MSMI.
+		 * #SMI is delivered and handled right after SEAMRET, nothing
+		 * needs to be done in KVM.
+		 */
+		return 1;
+	case EXIT_REASON_BUS_LOCK:
+		tdx_handle_bus_lock_vmexit(vcpu);
+		return 1;
+	default:
+		break;
+	}
+
+unhandled_exit:
+	vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+	vcpu->run->hw.hardware_exit_reason = exit_reason.full;
+	return 0;
+}
+
+int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
+{
+	int ret = __tdx_handle_exit(vcpu, exit_fastpath);
+
+	/*
+	 * Exit to user space when bus lock detected to inform that there is
+	 * a bus lock in guest.
+	 */
+	if (to_tdx(vcpu)->exit_reason.bus_lock_detected) {
+		if (ret > 0)
+			vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK;
+
+		vcpu->run->flags |= KVM_RUN_X86_BUS_LOCK;
+		return 0;
+	}
+	return ret;
+}
+
+void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	*reason = tdx->exit_reason.full;
+
+	*info1 = tdexit_exit_qual(vcpu);
+	*info2 = tdexit_ext_exit_qual(vcpu);
+
+	*intr_info = tdexit_intr_info(vcpu);
+	*error_code = 0;
+}
+
+int __init tdx_check_processor_compatibility(void)
+{
+	/* TDX-SEAM itself verifies compatibility on all CPUs. */
+	return 0;
+}
+
+void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
+{
+	/* Only x2APIC mode is supported for TD. */
+	WARN_ON_ONCE(kvm_get_apic_mode(vcpu) != LAPIC_MODE_X2APIC);
+}
+
+void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	pi_clear_on(&tdx->pi_desc);
+	memset(tdx->pi_desc.pir, 0, sizeof(tdx->pi_desc.pir));
+}
+
+/*
+ * Send interrupt to vcpu via posted interrupt way.
+ * 1. If target vcpu is running(non-root mode), send posted interrupt
+ * notification to vcpu and hardware will sync PIR to vIRR atomically.
+ * 2. If target vcpu isn't running(root mode), kick it to pick up the
+ * interrupt from PIR in next vmentry.
+ */
+int tdx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (pi_test_and_set_pir(vector, &tdx->pi_desc))
+		return 0;
+
+	/* If a previous notification has sent the IPI, nothing to do. */
+	if (pi_test_and_set_on(&tdx->pi_desc))
+		return 0;
+
+	if (vcpu != kvm_get_running_vcpu() &&
+	    !kvm_vcpu_trigger_posted_interrupt(vcpu, false))
+		kvm_vcpu_kick(vcpu);
+
+	return 0;
+}
+
+int tdx_dev_ioctl(void __user *argp)
+{
+	struct kvm_tdx_capabilities __user *user_caps;
+	struct kvm_tdx_capabilities caps;
+	struct kvm_tdx_cmd cmd;
+
+	BUILD_BUG_ON(sizeof(struct kvm_tdx_cpuid_config) !=
+		     sizeof(struct tdx_cpuid_config));
+
+	if (copy_from_user(&cmd, argp, sizeof(cmd)))
+		return -EFAULT;
+
+	if (cmd.metadata || cmd.id != KVM_TDX_CAPABILITIES)
+		return -EINVAL;
+
+	user_caps = (void __user *)cmd.data;
+	if (copy_from_user(&caps, user_caps, sizeof(caps)))
+		return -EFAULT;
+
+	if (caps.nr_cpuid_configs < tdx_caps.nr_cpuid_configs)
+		return -E2BIG;
+	caps.nr_cpuid_configs = tdx_caps.nr_cpuid_configs;
+
+	if (copy_to_user(user_caps->cpuid_configs, &tdx_caps.cpuid_configs,
+			 tdx_caps.nr_cpuid_configs * sizeof(struct tdx_cpuid_config)))
+		return -EFAULT;
+
+	caps.attrs_fixed0 = tdx_caps.attrs_fixed0;
+	caps.attrs_fixed1 = tdx_caps.attrs_fixed1;
+	caps.xfam_fixed0 = tdx_caps.xfam_fixed0;
+	caps.xfam_fixed1 = tdx_caps.xfam_fixed1;
+
+	if (copy_to_user((void __user *)cmd.data, &caps, sizeof(caps)))
+		return -EFAULT;
+
+	return 0;
+}
+
+/*
+ * TDX-SEAM definitions for fixed{0,1} are inverted relative to VMX.  The TDX
+ * definitions are sane, the VMX definitions are backwards.
+ *
+ * if fixed0[i] == 0: val[i] must be 0
+ * if fixed1[i] == 1: val[i] must be 1
+ */
+static inline bool tdx_fixed_bits_valid(u64 val, u64 fixed0, u64 fixed1)
+{
+	return ((val & fixed0) | fixed1) == val;
+}
+
+static struct kvm_cpuid_entry2 *tdx_find_cpuid_entry(struct kvm_tdx *kvm_tdx,
+						     u32 function, u32 index)
+{
+	struct kvm_cpuid_entry2 *e;
+	int i;
+
+	for (i = 0; i < kvm_tdx->cpuid_nent; i++) {
+		e = &kvm_tdx->cpuid_entries[i];
+
+		if (e->function == function && (e->index == index ||
+		    !(e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX)))
+			return e;
+	}
+	return NULL;
+}
+
+static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
+			  struct kvm_tdx_init_vm *init_vm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct tdx_cpuid_config *config;
+	struct kvm_cpuid_entry2 *entry;
+	struct tdx_cpuid_value *value;
+	u64 guest_supported_xcr0;
+	u64 guest_supported_xss;
+	u32 guest_tsc_khz;
+	int max_pa;
+	int i;
+
+	/* init_vm->reserved must be zero */
+	if (find_first_bit((unsigned long *)init_vm->reserved,
+			   sizeof(init_vm->reserved) * 8) !=
+	    sizeof(init_vm->reserved) * 8)
+		return -EINVAL;
+
+	td_params->attributes = init_vm->attributes;
+	td_params->max_vcpus = init_vm->max_vcpus;
+
+	/* TODO: Enforce consistent CPUID features for all vCPUs. */
+	for (i = 0; i < tdx_caps.nr_cpuid_configs; i++) {
+		config = &tdx_caps.cpuid_configs[i];
+
+		entry = tdx_find_cpuid_entry(kvm_tdx, config->leaf,
+					     config->sub_leaf);
+		if (!entry)
+			continue;
+
+		/*
+		 * Non-configurable bits must be '0', even if they are fixed to
+		 * '1' by TDX-SEAM, i.e. mask off non-configurable bits.
+		 */
+		value = &td_params->cpuid_values[i];
+		value->eax = entry->eax & config->eax;
+		value->ebx = entry->ebx & config->ebx;
+		value->ecx = entry->ecx & config->ecx;
+		value->edx = entry->edx & config->edx;
+	}
+
+	entry = tdx_find_cpuid_entry(kvm_tdx, 0xd, 0);
+	if (entry)
+		guest_supported_xcr0 = (entry->eax | ((u64)entry->edx << 32));
+	else
+		guest_supported_xcr0 = 0;
+	guest_supported_xcr0 &= supported_xcr0;
+
+	entry = tdx_find_cpuid_entry(kvm_tdx, 0xd, 1);
+	if (entry)
+		guest_supported_xss = (entry->ecx | ((u64)entry->edx << 32));
+	else
+		guest_supported_xss = 0;
+	/* PT can be exposed to TD guest regardless of KVM's XSS support */
+	guest_supported_xss &= (supported_xss | XFEATURE_MASK_PT);
+
+	max_pa = 36;
+	entry = tdx_find_cpuid_entry(kvm_tdx, 0x80000008, 0);
+	if (entry)
+		max_pa = entry->eax & 0xff;
+
+	td_params->eptp_controls = VMX_EPTP_MT_WB;
+
+	if (cpu_has_vmx_ept_5levels() && max_pa > 48) {
+		td_params->eptp_controls |= VMX_EPTP_PWL_5;
+		td_params->exec_controls |= TDX_EXEC_CONTROL_MAX_GPAW;
+	} else {
+		td_params->eptp_controls |= VMX_EPTP_PWL_4;
+	}
+
+	if (!tdx_fixed_bits_valid(td_params->attributes,
+				  tdx_caps.attrs_fixed0,
+				  tdx_caps.attrs_fixed1))
+		return -EINVAL;
+
+	if (td_params->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
+		pr_warn("TD doesn't support perfmon. KVM needs to save/restore "
+			"host perf registers properly.\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* Setup td_params.xfam */
+	td_params->xfam = guest_supported_xcr0 | guest_supported_xss;
+	if (!tdx_fixed_bits_valid(td_params->xfam,
+				  tdx_caps.xfam_fixed0,
+				  tdx_caps.xfam_fixed1))
+		return -EINVAL;
+
+	if (td_params->xfam & TDX_TD_XFAM_LBR) {
+		pr_warn("TD doesn't support LBR. KVM needs to save/restore "
+			"IA32_LBR_DEPTH properly.\n");
+		return -EOPNOTSUPP;
+	}
+
+	if (td_params->xfam & TDX_TD_XFAM_AMX) {
+		pr_warn("TD doesn't support AMX. KVM needs to save/restore "
+			"IA32_XFD, IA32_XFD_ERR properly.\n");
+		return -EOPNOTSUPP;
+	}
+
+	if (init_vm->tsc_khz)
+		guest_tsc_khz = init_vm->tsc_khz;
+	else
+		guest_tsc_khz = kvm->arch.initial_tsc_khz;
+
+	if (guest_tsc_khz < TDX_MIN_TSC_FREQUENCY_KHZ ||
+	    guest_tsc_khz > TDX_MAX_TSC_FREQUENCY_KHZ) {
+		pr_warn_ratelimited("Illegal TD TSC %d Khz, it must be between [%d, %d] Khz\n",
+		guest_tsc_khz, TDX_MIN_TSC_FREQUENCY_KHZ, TDX_MAX_TSC_FREQUENCY_KHZ);
+		return -EINVAL;
+	}
+
+	td_params->tsc_frequency = TDX_TSC_KHZ_TO_25MHZ(guest_tsc_khz);
+	if (TDX_TSC_25MHZ_TO_KHZ(td_params->tsc_frequency) != guest_tsc_khz) {
+		pr_warn_ratelimited("TD TSC %d Khz not a multiple of 25Mhz\n", guest_tsc_khz);
+		if (init_vm->tsc_khz)
+			return -EINVAL;
+	}
+
+	BUILD_BUG_ON(sizeof(td_params->mrconfigid) !=
+		     sizeof(init_vm->mrconfigid));
+	memcpy(td_params->mrconfigid, init_vm->mrconfigid,
+	       sizeof(td_params->mrconfigid));
+	BUILD_BUG_ON(sizeof(td_params->mrowner) !=
+		     sizeof(init_vm->mrowner));
+	memcpy(td_params->mrowner, init_vm->mrowner, sizeof(td_params->mrowner));
+	BUILD_BUG_ON(sizeof(td_params->mrownerconfig) !=
+		     sizeof(init_vm->mrownerconfig));
+	memcpy(td_params->mrownerconfig, init_vm->mrownerconfig,
+	       sizeof(td_params->mrownerconfig));
+
+	return 0;
+}
+
+static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct kvm_cpuid2 __user *user_cpuid;
+	struct kvm_tdx_init_vm init_vm;
+	struct td_params *td_params;
+	struct tdx_ex_ret ex_ret;
+	struct kvm_cpuid2 cpuid;
+	int ret;
+	u64 err;
+
+	if (is_td_initialized(kvm))
+		return -EINVAL;
+
+	if (cmd->metadata)
+		return -EINVAL;
+
+	if (copy_from_user(&init_vm, (void __user *)cmd->data, sizeof(init_vm)))
+		return -EFAULT;
+
+	if (init_vm.max_vcpus > KVM_MAX_VCPUS)
+		return -EINVAL;
+
+	user_cpuid = (void *)init_vm.cpuid;
+	if (copy_from_user(&cpuid, user_cpuid, sizeof(cpuid)))
+		return -EFAULT;
+
+	if (cpuid.nent > KVM_MAX_CPUID_ENTRIES)
+		return -E2BIG;
+
+	if (copy_from_user(&kvm_tdx->cpuid_entries, user_cpuid->entries,
+			   cpuid.nent * sizeof(struct kvm_cpuid_entry2)))
+		return -EFAULT;
+
+	BUILD_BUG_ON(sizeof(struct td_params) != 1024);
+
+	td_params = kzalloc(sizeof(struct td_params), GFP_KERNEL_ACCOUNT);
+	if (!td_params)
+		return -ENOMEM;
+
+	kvm_tdx->cpuid_nent = cpuid.nent;
+
+	ret = setup_tdparams(kvm, td_params, &init_vm);
+	if (ret)
+		goto free_tdparams;
+
+	err = tdh_mng_init(kvm_tdx->tdr.pa, __pa(td_params), &ex_ret);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_INIT, err, &ex_ret);
+		ret = -EIO;
+		goto free_tdparams;
+	}
+
+	kvm_tdx->tsc_offset = td_tdcs_exec_read64(kvm_tdx, TD_TDCS_EXEC_TSC_OFFSET);
+	kvm_tdx->attributes = td_params->attributes;
+	kvm_tdx->xfam = td_params->xfam;
+	kvm->max_vcpus = td_params->max_vcpus;
+	kvm->arch.initial_tsc_khz = TDX_TSC_25MHZ_TO_KHZ(td_params->tsc_frequency);
+
+	if (td_params->exec_controls & TDX_EXEC_CONTROL_MAX_GPAW)
+		kvm->arch.gfn_shared_mask = BIT_ULL(51) >> PAGE_SHIFT;
+	else
+		kvm->arch.gfn_shared_mask = BIT_ULL(47) >> PAGE_SHIFT;
+
+free_tdparams:
+	kfree(td_params);
+	if (ret)
+		kvm_tdx->cpuid_nent = 0;
+	return ret;
+}
+
+static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct kvm_tdx_init_mem_region region;
+	struct kvm_vcpu *vcpu;
+	struct page *page;
+	kvm_pfn_t pfn;
+	int idx, ret = 0;
+
+	/* The BSP vCPU must be created before initializing memory regions. */
+	if (!atomic_read(&kvm->online_vcpus))
+		return -EINVAL;
+
+	if (cmd->metadata & ~KVM_TDX_MEASURE_MEMORY_REGION)
+		return -EINVAL;
+
+	if (copy_from_user(&region, (void __user *)cmd->data, sizeof(region)))
+		return -EFAULT;
+
+	/* Sanity check */
+	if (!IS_ALIGNED(region.source_addr, PAGE_SIZE))
+		return -EINVAL;
+	if (!IS_ALIGNED(region.gpa, PAGE_SIZE))
+		return -EINVAL;
+	if (!region.nr_pages)
+		return -EINVAL;
+	if (region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa)
+		return -EINVAL;
+	if (!tdx_is_private_gpa(kvm, region.gpa))
+		return -EINVAL;
+
+	vcpu = kvm_get_vcpu(kvm, 0);
+	if (mutex_lock_killable(&vcpu->mutex))
+		return -EINTR;
+
+	vcpu_load(vcpu);
+	idx = srcu_read_lock(&kvm->srcu);
+
+	kvm_mmu_reload(vcpu);
+
+	while (region.nr_pages) {
+		if (signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+
+		if (need_resched())
+			cond_resched();
+
+
+		/* Pin the source page. */
+		ret = get_user_pages_fast(region.source_addr, 1, 0, &page);
+		if (ret < 0)
+			break;
+		if (ret != 1) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		kvm_tdx->source_pa = pfn_to_hpa(page_to_pfn(page)) |
+				     (cmd->metadata & KVM_TDX_MEASURE_MEMORY_REGION);
+
+		pfn = kvm_mmu_map_tdp_page(vcpu, region.gpa, TDX_SEPT_PFERR,
+					   PG_LEVEL_4K);
+		if (is_error_noslot_pfn(pfn) || kvm->vm_bugged)
+			ret = -EFAULT;
+		else
+			ret = 0;
+
+		put_page(page);
+		if (ret)
+			break;
+
+		region.source_addr += PAGE_SIZE;
+		region.gpa += PAGE_SIZE;
+		region.nr_pages--;
+	}
+
+	srcu_read_unlock(&kvm->srcu, idx);
+	vcpu_put(vcpu);
+
+	mutex_unlock(&vcpu->mutex);
+
+	if (copy_to_user((void __user *)cmd->data, &region, sizeof(region)))
+		ret = -EFAULT;
+
+	return ret;
+}
+
+static int tdx_td_finalizemr(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	u64 err;
+
+	if (!is_td_initialized(kvm) || is_td_finalized(kvm_tdx))
+		return -EINVAL;
+
+	err = tdh_mr_finalize(kvm_tdx->tdr.pa);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MR_FINALIZE, err, NULL);
+		return -EIO;
+	}
+
+	kvm_tdx->finalized = true;
+	return 0;
+}
+
+int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
+{
+	struct kvm_tdx_cmd tdx_cmd;
+	int r;
+
+	if (copy_from_user(&tdx_cmd, argp, sizeof(struct kvm_tdx_cmd)))
+		return -EFAULT;
+
+	mutex_lock(&kvm->lock);
+
+	switch (tdx_cmd.id) {
+	case KVM_TDX_INIT_VM:
+		r = tdx_td_init(kvm, &tdx_cmd);
+		break;
+	case KVM_TDX_INIT_MEM_REGION:
+		r = tdx_init_mem_region(kvm, &tdx_cmd);
+		break;
+	case KVM_TDX_FINALIZE_VM:
+		r = tdx_td_finalizemr(kvm);
+		break;
+	default:
+		r = -EINVAL;
+		goto out;
+	}
+
+	if (copy_to_user(argp, &tdx_cmd, sizeof(struct kvm_tdx_cmd)))
+		r = -EFAULT;
+
+out:
+	mutex_unlock(&kvm->lock);
+	return r;
+}
+
+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	struct kvm_tdx_cmd cmd;
+	u64 err;
+
+	if (tdx->initialized)
+		return -EINVAL;
+
+	if (!is_td_initialized(vcpu->kvm) || is_td_finalized(kvm_tdx))
+		return -EINVAL;
+
+	if (copy_from_user(&cmd, argp, sizeof(cmd)))
+		return -EFAULT;
+
+	if (cmd.metadata || cmd.id != KVM_TDX_INIT_VCPU)
+		return -EINVAL;
+
+	err = tdh_vp_init(tdx->tdvpr.pa, cmd.data);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_VP_INIT, err, NULL);
+		return -EIO;
+	}
+
+	tdx->initialized = true;
+
+	td_vmcs_write16(tdx, POSTED_INTR_NV, POSTED_INTR_VECTOR);
+	td_vmcs_write64(tdx, POSTED_INTR_DESC_ADDR, __pa(&tdx->pi_desc));
+	td_vmcs_setbit32(tdx, PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_POSTED_INTR);
+
+	if (vcpu->kvm->arch.bus_lock_detection_enabled)
+		td_vmcs_setbit32(tdx,
+				 SECONDARY_VM_EXEC_CONTROL,
+				 SECONDARY_EXEC_BUS_LOCK_DETECTION);
+
+	return 0;
+}
+
+void tdx_update_exception_bitmap(struct kvm_vcpu *vcpu)
+{
+	/* TODO: Figure out exception bitmap for debug TD. */
+}
+
+void tdx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+{
+	/* TODO: Add TDH_VP_WR(GUEST_DR7) for debug TDs. */
+	if (is_debug_td(vcpu))
+		return;
+
+	KVM_BUG_ON(val != DR7_FIXED_1, vcpu->kvm);
+}
+
+int tdx_get_cpl(struct kvm_vcpu *vcpu)
+{
+	if (KVM_BUG_ON(!is_debug_td(vcpu), vcpu->kvm))
+		return 0;
+
+	/*
+	 * For debug TDs, tdx_get_cpl() may be called before the vCPU is
+	 * initialized, i.e. before TDH_VP_RD is legal, if the vCPU is scheduled
+	 * out.  If this happens, simply return CPL0 to avoid TDH_VP_RD failure.
+	 */
+	if (!to_tdx(vcpu)->initialized)
+		return 0;
+
+	return VMX_AR_DPL(td_vmcs_read32(to_tdx(vcpu), GUEST_SS_AR_BYTES));
+}
+
+unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu)
+{
+	if (KVM_BUG_ON(!is_debug_td(vcpu), vcpu->kvm))
+		return 0;
+
+	return td_vmcs_read64(to_tdx(vcpu), GUEST_RFLAGS);
+}
+
+void tdx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
+{
+	if (KVM_BUG_ON(!is_debug_td(vcpu), vcpu->kvm))
+		return;
+
+	/*
+	 * TODO: This is currently disallowed by TDX-SEAM, which breaks single-
+	 * step debug.
+	 */
+	td_vmcs_write64(to_tdx(vcpu), GUEST_RFLAGS, rflags);
+}
+
+bool tdx_is_emulated_msr(u32 index, bool write)
+{
+	switch (index) {
+	case MSR_IA32_UCODE_REV:
+	case MSR_IA32_ARCH_CAPABILITIES:
+	case MSR_IA32_POWER_CTL:
+	case MSR_MTRRcap:
+	case 0x200 ... 0x2ff:
+	case MSR_IA32_TSC_DEADLINE:
+	case MSR_IA32_MISC_ENABLE:
+	case MSR_KVM_STEAL_TIME:
+	case MSR_KVM_POLL_CONTROL:
+	case MSR_PLATFORM_INFO:
+	case MSR_MISC_FEATURES_ENABLES:
+	case MSR_IA32_MCG_CTL:
+	case MSR_IA32_MCG_STATUS:
+	case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(32) - 1:
+		return true;
+	case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff:
+		/*
+		 * x2APIC registers that are virtualized by the CPU can't be
+		 * emulated, KVM doesn't have access to the virtual APIC page.
+		 */
+		switch (index) {
+		case X2APIC_MSR(APIC_TASKPRI):
+		case X2APIC_MSR(APIC_PROCPRI):
+		case X2APIC_MSR(APIC_EOI):
+		case X2APIC_MSR(APIC_ISR) ... X2APIC_MSR(APIC_ISR + APIC_ISR_NR):
+		case X2APIC_MSR(APIC_TMR) ... X2APIC_MSR(APIC_TMR + APIC_ISR_NR):
+		case X2APIC_MSR(APIC_IRR) ... X2APIC_MSR(APIC_IRR + APIC_ISR_NR):
+			return false;
+		default:
+			return true;
+		}
+	case MSR_IA32_APICBASE:
+	case MSR_EFER:
+		return !write;
+	default:
+		return false;
+	}
+}
+
+int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+	if (tdx_is_emulated_msr(msr->index, false))
+		return kvm_get_msr_common(vcpu, msr);
+	return 1;
+}
+
+int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+	if (tdx_is_emulated_msr(msr->index, true))
+		return kvm_set_msr_common(vcpu, msr);
+	return 1;
+}
+
+u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+	if (!is_debug_td(vcpu))
+		return 0;
+
+	return td_vmcs_read64(to_tdx(vcpu), GUEST_ES_BASE + seg * 2);
+}
+
+void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (!is_debug_td(vcpu)) {
+		memset(var, 0, sizeof(*var));
+		return;
+	}
+
+	seg *= 2;
+	var->base = td_vmcs_read64(tdx, GUEST_ES_BASE + seg);
+	var->limit = td_vmcs_read32(tdx, GUEST_ES_LIMIT + seg);
+	var->selector = td_vmcs_read16(tdx, GUEST_ES_SELECTOR + seg);
+	vmx_decode_ar_bytes(td_vmcs_read32(tdx, GUEST_ES_AR_BYTES + seg), var);
+}
+
+static void tdx_cache_gprs(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	int i;
+
+	if (!is_td_vcpu(vcpu) || !is_debug_td(vcpu))
+		return;
+
+	for (i = 0; i < NR_VCPU_REGS; i++) {
+		if (i == VCPU_REGS_RSP || i == VCPU_REGS_RIP)
+			continue;
+
+		vcpu->arch.regs[i] = td_gpr_read64(tdx, i);
+	}
+}
+
+static void tdx_flush_gprs(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	int i;
+
+	if (!is_td_vcpu(vcpu) || KVM_BUG_ON(!is_debug_td(vcpu), vcpu->kvm))
+		return;
+
+	for (i = 0; i < NR_VCPU_REGS; i++)
+		td_gpr_write64(tdx, i, vcpu->arch.regs[i]);
+}
+
+void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
+			unsigned int *vcpu_align, unsigned int *vm_size)
+{
+	*vcpu_size = sizeof(struct vcpu_tdx);
+	*vcpu_align = __alignof__(struct vcpu_tdx);
+
+	if (sizeof(struct kvm_tdx) > *vm_size)
+		*vm_size = sizeof(struct kvm_tdx);
+}
+
+int __init tdx_init(void)
+{
+	return 0;
+}
+
+int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
+{
+	int i, max_pkgs;
+	u32 max_pa;
+	const struct tdsysinfo_struct *tdsysinfo = tdx_get_sysinfo();
+
+	if (!enable_ept) {
+		pr_warn("Cannot enable TDX with EPT disabled\n");
+		return -EINVAL;
+	}
+	tdx_keyids_init();
+
+	if (tdsysinfo == NULL) {
+		WARN_ON_ONCE(boot_cpu_has(X86_FEATURE_TDX));
+		return -ENODEV;
+	}
+
+	if (WARN_ON_ONCE(x86_ops->tlb_remote_flush))
+		return -EIO;
+
+	tdx_caps.tdcs_nr_pages = tdsysinfo->tdcs_base_size / PAGE_SIZE;
+	if (tdx_caps.tdcs_nr_pages != TDX_NR_TDCX_PAGES)
+		return -EIO;
+
+	tdx_caps.tdvpx_nr_pages = tdsysinfo->tdvps_base_size / PAGE_SIZE - 1;
+	if (tdx_caps.tdvpx_nr_pages != TDX_NR_TDVPX_PAGES)
+		return -EIO;
+
+	tdx_caps.attrs_fixed0 = tdsysinfo->attributes_fixed0;
+	tdx_caps.attrs_fixed1 = tdsysinfo->attributes_fixed1;
+	tdx_caps.xfam_fixed0 =	tdsysinfo->xfam_fixed0;
+	tdx_caps.xfam_fixed1 = tdsysinfo->xfam_fixed1;
+
+	tdx_caps.nr_cpuid_configs = tdsysinfo->num_cpuid_config;
+	if (tdx_caps.nr_cpuid_configs > TDX_MAX_NR_CPUID_CONFIGS)
+		return -EIO;
+
+	if (!memcpy(tdx_caps.cpuid_configs, tdsysinfo->cpuid_configs,
+		    tdsysinfo->num_cpuid_config * sizeof(struct tdx_cpuid_config)))
+		return -EIO;
+
+	x86_ops->cache_gprs = tdx_cache_gprs;
+	x86_ops->flush_gprs = tdx_flush_gprs;
+
+	x86_ops->tlb_remote_flush = tdx_sept_tlb_remote_flush;
+	x86_ops->set_private_spte = tdx_sept_set_private_spte;
+	x86_ops->drop_private_spte = tdx_sept_drop_private_spte;
+	x86_ops->zap_private_spte = tdx_sept_zap_private_spte;
+	x86_ops->unzap_private_spte = tdx_sept_unzap_private_spte;
+	x86_ops->link_private_sp = tdx_sept_link_private_sp;
+	x86_ops->free_private_sp = tdx_sept_free_private_sp;
+
+	max_pkgs = topology_max_packages();
+
+	tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
+				   GFP_KERNEL);
+	if (!tdx_mng_key_config_lock) {
+		kfree(tdx_mng_key_config_lock);
+		return -ENOMEM;
+	}
+	for (i = 0; i < max_pkgs; i++)
+		mutex_init(&tdx_mng_key_config_lock[i]);
+
+	max_pa = cpuid_eax(0x80000008) & 0xff;
+	hkid_start_pos = boot_cpu_data.x86_phys_bits;
+	hkid_mask = GENMASK_ULL(max_pa - 1, hkid_start_pos);
+
+	for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
+		tdx_uret_msrs[i].slot = kvm_find_user_return_msr(tdx_uret_msrs[i].msr);
+		if (tdx_uret_msrs[i].slot == -1) {
+			/* If any MSR isn't supported, it is a KVM bug */
+			pr_err("MSR %x isn't included by kvm_find_user_return_msr\n",
+				tdx_uret_msrs[i].msr);
+			return -EIO;
+		}
+	}
+
+	return 0;
+}
+
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 09da7cbedb5f..447f0d5b53c7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11019,7 +11019,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
 	kvm_vcpu_mtrr_init(vcpu);
 	vcpu_load(vcpu);
-	kvm_set_tsc_khz(vcpu, max_tsc_khz);
+	kvm_set_tsc_khz(vcpu, vcpu->kvm->arch.initial_tsc_khz);
 	kvm_vcpu_reset(vcpu, false);
 	kvm_init_mmu(vcpu);
 	vcpu_put(vcpu);
@@ -11473,6 +11473,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 
 	kvm->arch.guest_can_read_msr_platform_info = true;
+	kvm->arch.initial_tsc_khz = max_tsc_khz;
 
 #if IS_ENABLED(CONFIG_HYPERV)
 	spin_lock_init(&kvm->arch.hv_root_tdp_lock);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 55/59] KVM: TDX: Add "basic" support for building and running Trust Domains
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (53 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 54/59] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 56/59] KVM: TDX: Protect private mapping related SEAMCALLs with spinlock isaku.yamahata
                   ` (5 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson, Xiaoyao Li,
	Kai Huang, Chao Gao, Isaku Yamahata, Yuan Yao

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add what is effectively a TDX-specific ioctl for initializing the guest
Trust Domain.  Implement the functionality as a subcommand of
KVM_MEMORY_ENCRYPT_OP, analogous to how the ioctl is used by SVM to
manage SEV guests.

For easy compatibility with future versions of TDX-SEAM, add a
KVM-defined struct, tdx_capabilities, to track requirements/capabilities
for the overall system, and define a global instance to serve as the
canonical reference.

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Yuan Yao <yuan.yao@intel.com>
Signed-off-by: Yuan Yao <yuan.yao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/events/intel/ds.c            |   1 +
 arch/x86/include/uapi/asm/kvm.h       |  56 ++++
 arch/x86/include/uapi/asm/vmx.h       |   3 +-
 arch/x86/kvm/Makefile                 |   5 +-
 arch/x86/kvm/mmu.h                    |   2 +-
 arch/x86/kvm/mmu/mmu.c                |   1 +
 arch/x86/kvm/mmu/spte.c               |   5 +-
 arch/x86/kvm/vmx/common.h             |   1 +
 arch/x86/kvm/vmx/main.c               | 403 +++++++++++++++++++++++++-
 arch/x86/kvm/vmx/posted_intr.c        |   6 +
 arch/x86/kvm/vmx/tdx.h                | 114 ++++++++
 arch/x86/kvm/vmx/tdx_arch.h           |  24 +-
 arch/x86/kvm/vmx/tdx_ops.h            |  14 +
 arch/x86/kvm/vmx/tdx_stubs.c          |  50 ++++
 arch/x86/kvm/vmx/vmenter.S            | 146 ++++++++++
 arch/x86/kvm/vmx/vmx.c                |  39 ---
 arch/x86/kvm/vmx/x86_ops.h            |  80 +++++
 arch/x86/kvm/x86.c                    |   8 +-
 tools/arch/x86/include/uapi/asm/kvm.h |  51 ++++
 19 files changed, 946 insertions(+), 63 deletions(-)
 create mode 100644 arch/x86/kvm/vmx/tdx_stubs.c

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 2e215369df4a..b6c556225fd2 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2248,3 +2248,4 @@ void perf_restore_debug_store(void)
 
 	wrmsrl(MSR_IA32_DS_AREA, (unsigned long)ds);
 }
+EXPORT_SYMBOL_GPL(perf_restore_debug_store);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index a0805a2a81f8..6f93a3d345d7 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -512,4 +512,60 @@ struct kvm_pmu_event_filter {
 #define KVM_X86_SEV_ES_VM	1
 #define KVM_X86_TDX_VM		2
 
+/* Trust Domain eXtension sub-ioctl() commands. */
+enum kvm_tdx_cmd_id {
+	KVM_TDX_CAPABILITIES = 0,
+	KVM_TDX_INIT_VM,
+	KVM_TDX_INIT_VCPU,
+	KVM_TDX_INIT_MEM_REGION,
+	KVM_TDX_FINALIZE_VM,
+
+	KVM_TDX_CMD_NR_MAX,
+};
+
+struct kvm_tdx_cmd {
+	__u32 id;
+	__u32 metadata;
+	__u64 data;
+};
+
+struct kvm_tdx_cpuid_config {
+	__u32 leaf;
+	__u32 sub_leaf;
+	__u32 eax;
+	__u32 ebx;
+	__u32 ecx;
+	__u32 edx;
+};
+
+struct kvm_tdx_capabilities {
+	__u64 attrs_fixed0;
+	__u64 attrs_fixed1;
+	__u64 xfam_fixed0;
+	__u64 xfam_fixed1;
+
+	__u32 nr_cpuid_configs;
+	__u32 padding;
+	struct kvm_tdx_cpuid_config cpuid_configs[0];
+};
+
+struct kvm_tdx_init_vm {
+	__u32 max_vcpus;
+	__u32 tsc_khz;
+	__u64 attributes;
+	__u64 cpuid;
+	__u64 mrconfigid[6];	/* sha384 digest */
+	__u64 mrowner[6];	/* sha384 digest */
+	__u64 mrownerconfig[6];	/* sha348 digest */
+	__u64 reserved[43];	/* must be zero for future extensibility */
+};
+
+#define KVM_TDX_MEASURE_MEMORY_REGION	(1UL << 0)
+
+struct kvm_tdx_init_mem_region {
+	__u64 source_addr;
+	__u64 gpa;
+	__u64 nr_pages;
+};
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index ba5908dfc7c0..79843a0143d2 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -32,8 +32,9 @@
 #define EXIT_REASON_EXCEPTION_NMI       0
 #define EXIT_REASON_EXTERNAL_INTERRUPT  1
 #define EXIT_REASON_TRIPLE_FAULT        2
-#define EXIT_REASON_INIT_SIGNAL			3
+#define EXIT_REASON_INIT_SIGNAL         3
 #define EXIT_REASON_SIPI_SIGNAL         4
+#define EXIT_REASON_OTHER_SMI           6
 
 #define EXIT_REASON_INTERRUPT_WINDOW    7
 #define EXIT_REASON_NMI_WINDOW          8
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index d28f990bd81d..51b2d5fdaeed 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -29,7 +29,10 @@ kvm-$(CONFIG_KVM_XEN)	+= xen.o
 kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
 			   vmx/evmcs.o vmx/nested.o vmx/posted_intr.o vmx/main.o
 kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
-kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx_error.o
+kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx_error.o vmx/tdx.o
+ifneq ($(CONFIG_INTEL_TDX_HOST),y)
+kvm-intel-y				+= vmx/tdx_stubs.o
+endif
 
 kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
 
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2c4b8fde66d9..0bcf583216a1 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -65,7 +65,7 @@ static __always_inline u64 rsvd_bits(int s, int e)
 }
 
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
-void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
+void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only, u64 init_value);
 void kvm_mmu_set_spte_init_value(u64 init_value);
 
 void kvm_init_mmu(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7d3830508a44..f8109002da0d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5461,6 +5461,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 out:
 	return r;
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_load);
 
 static void __kvm_mmu_unload(struct kvm_vcpu *vcpu, u32 roots_to_free)
 {
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index bb45e71eb105..75784ac9d91a 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -312,14 +312,15 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
 
-void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
+void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only, u64 init_value)
 {
 	shadow_user_mask	= VMX_EPT_READABLE_MASK;
 	shadow_accessed_mask	= has_ad_bits ? VMX_EPT_ACCESS_BIT : 0ull;
 	shadow_dirty_mask	= has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
 	shadow_nx_mask		= 0ull;
 	shadow_x_mask		= VMX_EPT_EXECUTABLE_MASK;
-	shadow_present_mask	= has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
+	shadow_present_mask	=
+		(has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | init_value;
 	shadow_acc_track_mask	= VMX_EPT_RWX_MASK;
 	shadow_me_mask		= 0ull;
 
diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 684cd3add46b..18b939e363e2 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -9,6 +9,7 @@
 #include <asm/vmx.h>
 
 #include "mmu.h"
+#include "tdx.h"
 #include "vmcs.h"
 #include "vmx.h"
 #include "x86.h"
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 4d6bf1f56641..958d4805eda4 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -3,11 +3,19 @@
 
 #include "x86_ops.h"
 #include "vmx.h"
+#include "tdx.h"
 #include "common.h"
 #include "nested.h"
 #include "mmu.h"
 #include "pmu.h"
 
+#ifdef CONFIG_INTEL_TDX_HOST
+static bool __read_mostly enable_tdx = 1;
+module_param_named(tdx, enable_tdx, bool, 0444);
+#else
+#define enable_tdx 0
+#endif
+
 static int __init vt_cpu_has_kvm_support(void)
 {
 	return cpu_has_vmx();
@@ -26,6 +34,16 @@ static int __init vt_check_processor_compatibility(void)
 	if (ret)
 		return ret;
 
+	if (enable_tdx) {
+		/*
+		 * Reject the entire module load if the per-cpu check fails, it
+		 * likely indicates a hardware or system configuration issue.
+		 */
+		ret = tdx_check_processor_compatibility();
+		if (ret)
+			return ret;
+	}
+
 	return 0;
 }
 
@@ -37,20 +55,40 @@ static __init int vt_hardware_setup(void)
 	if (ret)
 		return ret;
 
-	if (enable_ept)
+#ifdef CONFIG_INTEL_TDX_HOST
+	if (enable_tdx && tdx_hardware_setup(&vt_x86_ops))
+		enable_tdx = false;
+#endif
+
+	if (enable_ept) {
+		const u64 init_value = enable_tdx ? VMX_EPT_SUPPRESS_VE_BIT : 0ull;
 		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
-				      cpu_has_vmx_ept_execute_only());
+				      cpu_has_vmx_ept_execute_only(), init_value);
+		kvm_mmu_set_spte_init_value(init_value);
+	}
 
 	return 0;
 }
 
 static int vt_hardware_enable(void)
 {
-	return hardware_enable();
+	int ret;
+
+	ret = hardware_enable();
+	if (ret)
+		return ret;
+
+	if (enable_tdx)
+		tdx_hardware_enable();
+	return 0;
 }
 
 static void vt_hardware_disable(void)
 {
+	/* Note, TDX *and* VMX need to be disabled if TDX is enabled. */
+	if (enable_tdx)
+		tdx_hardware_disable();
+
 	hardware_disable();
 }
 
@@ -61,60 +99,92 @@ static bool vt_cpu_has_accelerated_tpr(void)
 
 static bool vt_is_vm_type_supported(unsigned long type)
 {
-	return type == KVM_X86_LEGACY_VM;
+	return type == KVM_X86_LEGACY_VM ||
+	       (type == KVM_X86_TDX_VM && enable_tdx);
 }
 
 static int vt_vm_init(struct kvm *kvm)
 {
+	if (kvm->arch.vm_type == KVM_X86_TDX_VM)
+		return tdx_vm_init(kvm);
+
 	return vmx_vm_init(kvm);
 }
 
 static void vt_mmu_prezap(struct kvm *kvm)
 {
+	if (is_td(kvm))
+		return tdx_vm_teardown(kvm);
 }
 
 static void vt_vm_destroy(struct kvm *kvm)
 {
+	if (is_td(kvm))
+		return tdx_vm_destroy(kvm);
 }
 
 static int vt_vcpu_create(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_create(vcpu);
+
 	return vmx_create_vcpu(vcpu);
 }
 
 static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_run(vcpu);
+
 	return vmx_vcpu_run(vcpu);
 }
 
 static void vt_vcpu_free(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_free(vcpu);
+
 	return vmx_free_vcpu(vcpu);
 }
 
 static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_reset(vcpu, init_event);
+
 	return vmx_vcpu_reset(vcpu, init_event);
 }
 
 static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_load(vcpu, cpu);
+
 	return vmx_vcpu_load(vcpu, cpu);
 }
 
 static void vt_vcpu_put(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_put(vcpu);
+
 	return vmx_vcpu_put(vcpu);
 }
 
 static int vt_handle_exit(struct kvm_vcpu *vcpu,
 			     enum exit_fastpath_completion fastpath)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_handle_exit(vcpu, fastpath);
+
 	return vmx_handle_exit(vcpu, fastpath);
 }
 
 static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_handle_exit_irqoff(vcpu);
+
 	vmx_handle_exit_irqoff(vcpu);
 }
 
@@ -130,21 +200,33 @@ static void vt_update_emulated_instruction(struct kvm_vcpu *vcpu)
 
 static int vt_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
+	if (unlikely(is_td_vcpu(vcpu)))
+		return tdx_set_msr(vcpu, msr_info);
+
 	return vmx_set_msr(vcpu, msr_info);
 }
 
 static int vt_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
+	if (is_td_vcpu(vcpu))
+		return false;
+
 	return vmx_smi_allowed(vcpu, for_injection);
 }
 
 static int vt_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
 {
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return 0;
+
 	return vmx_enter_smm(vcpu, smstate);
 }
 
 static int vt_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
 {
+	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+		return 0;
+
 	return vmx_leave_smm(vcpu, smstate);
 }
 
@@ -157,6 +239,9 @@ static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
 static bool vt_can_emulate_instruction(struct kvm_vcpu *vcpu, void *insn,
 				       int insn_len)
 {
+	if (is_td_vcpu(vcpu))
+		return false;
+
 	return vmx_can_emulate_instruction(vcpu, insn, insn_len);
 }
 
@@ -165,11 +250,17 @@ static int vt_check_intercept(struct kvm_vcpu *vcpu,
 				 enum x86_intercept_stage stage,
 				 struct x86_exception *exception)
 {
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return X86EMUL_UNHANDLEABLE;
+
 	return vmx_check_intercept(vcpu, info, stage, exception);
 }
 
 static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return true;
+
 	return vmx_apic_init_signal_blocked(vcpu);
 }
 
@@ -178,13 +269,43 @@ static void vt_migrate_timers(struct kvm_vcpu *vcpu)
 	vmx_migrate_timers(vcpu);
 }
 
+static int vt_mem_enc_op_dev(void __user *argp)
+{
+	if (!enable_tdx)
+		return -EINVAL;
+
+	return tdx_dev_ioctl(argp);
+}
+
+static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
+{
+	if (!is_td(kvm))
+		return -ENOTTY;
+
+	return tdx_vm_ioctl(kvm, argp);
+}
+
+static int vt_mem_enc_op_vcpu(struct kvm_vcpu *vcpu, void __user *argp)
+{
+	if (!is_td_vcpu(vcpu))
+		return -EINVAL;
+
+	return tdx_vcpu_ioctl(vcpu, argp);
+}
+
 static void vt_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_set_virtual_apic_mode(vcpu);
+
 	return vmx_set_virtual_apic_mode(vcpu);
 }
 
 static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_apicv_post_state_restore(vcpu);
+
 	return vmx_apicv_post_state_restore(vcpu);
 }
 
@@ -195,31 +316,49 @@ static bool vt_check_apicv_inhibit_reasons(ulong bit)
 
 static void vt_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
 {
+	if (is_td_vcpu(vcpu))
+		return;
+
 	return vmx_hwapic_irr_update(vcpu, max_irr);
 }
 
 static void vt_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
 {
+	if (is_td_vcpu(vcpu))
+		return;
+
 	return vmx_hwapic_isr_update(vcpu, max_isr);
 }
 
 static bool vt_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
 {
+	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+		return false;
+
 	return vmx_guest_apic_has_interrupt(vcpu);
 }
 
 static int vt_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return -1;
+
 	return vmx_sync_pir_to_irr(vcpu);
 }
 
 static int vt_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_deliver_posted_interrupt(vcpu, vector);
+
 	return vmx_deliver_posted_interrupt(vcpu, vector);
 }
 
 static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return;
+
 	return vmx_vcpu_after_set_cpuid(vcpu);
 }
 
@@ -229,6 +368,9 @@ static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
  */
 static bool vt_has_emulated_msr(struct kvm *kvm, u32 index)
 {
+	if (kvm && is_td(kvm))
+		return tdx_is_emulated_msr(index, true);
+
 	return vmx_has_emulated_msr(kvm, index);
 }
 
@@ -239,11 +381,25 @@ static void vt_msr_filter_changed(struct kvm_vcpu *vcpu)
 
 static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 {
+	/*
+	 * All host state is saved/restored across SEAMCALL/SEAMRET, and the
+	 * guest state of a TD is obviously off limits.  Deferring MSRs and DRs
+	 * is pointless because TDX-SEAM needs to load *something* so as not to
+	 * expose guest state.
+	 */
+	if (is_td_vcpu(vcpu)) {
+		tdx_prepare_switch_to_guest(vcpu);
+		return;
+	}
+
 	vmx_prepare_switch_to_guest(vcpu);
 }
 
 static void vt_update_exception_bitmap(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_update_exception_bitmap(vcpu);
+
 	vmx_update_exception_bitmap(vcpu);
 }
 
@@ -254,49 +410,84 @@ static int vt_get_msr_feature(struct kvm_msr_entry *msr)
 
 static int vt_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
+	if (unlikely(is_td_vcpu(vcpu)))
+		return tdx_get_msr(vcpu, msr_info);
+
 	return vmx_get_msr(vcpu, msr_info);
 }
 
 static u64 vt_get_segment_base(struct kvm_vcpu *vcpu, int seg)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_get_segment_base(vcpu, seg);
+
 	return vmx_get_segment_base(vcpu, seg);
 }
 
 static void vt_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
 			      int seg)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_get_segment(vcpu, var, seg);
+
 	vmx_get_segment(vcpu, var, seg);
 }
 
 static void vt_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
 			      int seg)
 {
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
 	vmx_set_segment(vcpu, var, seg);
 }
 
 static int vt_get_cpl(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_get_cpl(vcpu);
+
 	return vmx_get_cpl(vcpu);
 }
 
 static void vt_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
 {
+	if (KVM_BUG_ON(is_td_vcpu(vcpu) && !is_debug_td(vcpu), vcpu->kvm))
+		return;
+
 	vmx_get_cs_db_l_bits(vcpu, db, l);
 }
 
 static void vt_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 {
+	if (is_td_vcpu(vcpu) && !is_td_vcpu_initialized(vcpu))
+		/* ignore reset on vcpu creation. */
+		return;
+
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
 	vmx_set_cr0(vcpu, cr0);
 }
 
 static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			    int pgd_level)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
+
 	vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
 }
 
 static void vt_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
+	if (is_td_vcpu(vcpu) && !is_td_vcpu_initialized(vcpu))
+		/* ignore reset on vcpu creation. */
+		return;
+
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
 	vmx_set_cr4(vcpu, cr4);
 }
 
@@ -307,6 +498,13 @@ static bool vt_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 
 static int vt_set_efer(struct kvm_vcpu *vcpu, u64 efer)
 {
+	if (is_td_vcpu(vcpu) && !is_td_vcpu_initialized(vcpu))
+		/* ignore reset on vcpu creation. */
+		return 0;
+
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return -EIO;
+
 	return vmx_set_efer(vcpu, efer);
 }
 
@@ -318,6 +516,9 @@ static void vt_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 
 static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
 	vmx_set_idt(vcpu, dt);
 }
 
@@ -329,16 +530,30 @@ static void vt_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 
 static void vt_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
 	vmx_set_gdt(vcpu, dt);
 }
 
 static void vt_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_set_dr7(vcpu, val);
+
 	vmx_set_dr7(vcpu, val);
 }
 
 static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
 {
+	/*
+	 * MOV-DR exiting is always cleared for TD guest, even in debug mode.
+	 * Thus KVM_DEBUGREG_WONT_EXIT can never be set and it should never
+	 * reach here for TD vcpu.
+	 */
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
 	vmx_sync_dirty_debug_regs(vcpu);
 }
 
@@ -350,34 +565,47 @@ void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
 
 	switch (reg) {
 	case VCPU_REGS_RSP:
-		vcpu->arch.regs[VCPU_REGS_RSP] = vmcs_readl(GUEST_RSP);
+		vcpu->arch.regs[VCPU_REGS_RSP] = vmreadl(vcpu, GUEST_RSP);
 		break;
 	case VCPU_REGS_RIP:
-		vcpu->arch.regs[VCPU_REGS_RIP] = vmcs_readl(GUEST_RIP);
+#ifdef CONFIG_INTEL_TDX_HOST
+		/*
+		 * RIP can be read by tracepoints, stuff a bogus value and
+		 * avoid a WARN/error.
+		 */
+		if (unlikely(is_td_vcpu(vcpu) && !is_debug_td(vcpu))) {
+			vcpu->arch.regs[VCPU_REGS_RIP] = 0xdeadul << 48;
+			break;
+		}
+#endif
+		vcpu->arch.regs[VCPU_REGS_RIP] = vmreadl(vcpu, GUEST_RIP);
 		break;
 	case VCPU_EXREG_PDPTR:
-		if (enable_ept)
+		if (enable_ept && !KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
 			ept_save_pdptrs(vcpu);
 		break;
 	case VCPU_EXREG_CR0:
 		guest_owned_bits = vcpu->arch.cr0_guest_owned_bits;
 
 		vcpu->arch.cr0 &= ~guest_owned_bits;
-		vcpu->arch.cr0 |= vmcs_readl(GUEST_CR0) & guest_owned_bits;
+		vcpu->arch.cr0 |= vmreadl(vcpu, GUEST_CR0) & guest_owned_bits;
 		break;
 	case VCPU_EXREG_CR3:
 		/*
 		 * When intercepting CR3 loads, e.g. for shadowing paging, KVM's
 		 * CR3 is loaded into hardware, not the guest's CR3.
 		 */
-		if (!(exec_controls_get(to_vmx(vcpu)) & CPU_BASED_CR3_LOAD_EXITING))
-			vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
+		if ((!is_td_vcpu(vcpu) /* to use to_vmx() */ &&
+			(!(exec_controls_get(to_vmx(vcpu)) &
+				CPU_BASED_CR3_LOAD_EXITING))) ||
+			is_debug_td(vcpu))
+			vcpu->arch.cr3 = vmreadl(vcpu, GUEST_CR3);
 		break;
 	case VCPU_EXREG_CR4:
 		guest_owned_bits = vcpu->arch.cr4_guest_owned_bits;
 
 		vcpu->arch.cr4 &= ~guest_owned_bits;
-		vcpu->arch.cr4 |= vmcs_readl(GUEST_CR4) & guest_owned_bits;
+		vcpu->arch.cr4 |= vmreadl(vcpu, GUEST_CR4) & guest_owned_bits;
 		break;
 	default:
 		KVM_BUG_ON(1, vcpu->kvm);
@@ -387,173 +615,296 @@ void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
 
 static unsigned long vt_get_rflags(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_get_rflags(vcpu);
+
 	return vmx_get_rflags(vcpu);
 }
 
 static void vt_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_set_rflags(vcpu, rflags);
+
 	vmx_set_rflags(vcpu, rflags);
 }
 
 static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_flush_tlb(vcpu);
+
 	vmx_flush_tlb_all(vcpu);
 }
 
 static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_flush_tlb(vcpu);
+
 	vmx_flush_tlb_current(vcpu);
 }
 
 static void vt_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
 {
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
 	vmx_flush_tlb_gva(vcpu, addr);
 }
 
 static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return;
+
 	vmx_flush_tlb_guest(vcpu);
 }
 
 static void vt_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
 {
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
 	vmx_set_interrupt_shadow(vcpu, mask);
 }
 
 static u32 vt_get_interrupt_shadow(struct kvm_vcpu *vcpu)
 {
-	return vmx_get_interrupt_shadow(vcpu);
+	return __vmx_get_interrupt_shadow(vcpu);
 }
 
 static void vt_patch_hypercall(struct kvm_vcpu *vcpu,
 				  unsigned char *hypercall)
 {
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
 	vmx_patch_hypercall(vcpu, hypercall);
 }
 
 static void vt_inject_irq(struct kvm_vcpu *vcpu)
 {
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
 	vmx_inject_irq(vcpu);
 }
 
 static void vt_inject_nmi(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return tdx_inject_nmi(vcpu);
+
 	vmx_inject_nmi(vcpu);
 }
 
 static void vt_queue_exception(struct kvm_vcpu *vcpu)
 {
+	if (KVM_BUG_ON(is_td_vcpu(vcpu) && !is_debug_td(vcpu), vcpu->kvm))
+		return;
+
 	vmx_queue_exception(vcpu);
 }
 
 static void vt_cancel_injection(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return;
+
 	vmx_cancel_injection(vcpu);
 }
 
 static int vt_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
+	if (is_td_vcpu(vcpu))
+		return true;
+
 	return vmx_interrupt_allowed(vcpu, for_injection);
 }
 
 static int vt_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
+	/*
+	 * TDX-SEAM manages NMI windows and NMI reinjection, and hides NMI
+	 * blocking, all KVM can do is throw an NMI over the wall.
+	 */
+	if (is_td_vcpu(vcpu))
+		return true;
+
 	return vmx_nmi_allowed(vcpu, for_injection);
 }
 
 static bool vt_get_nmi_mask(struct kvm_vcpu *vcpu)
 {
+	/*
+	 * Assume NMIs are always unmasked.  KVM could query PEND_NMI and treat
+	 * NMIs as masked if a previous NMI is still pending, but SEAMCALLs are
+	 * expensive and the end result is unchanged as the only relevant usage
+	 * of get_nmi_mask() is to limit the number of pending NMIs, i.e. it
+	 * only changes whether KVM or TDX-SEAM drops an NMI.
+	 */
+	if (is_td_vcpu(vcpu))
+		return false;
+
 	return vmx_get_nmi_mask(vcpu);
 }
 
 static void vt_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked)
 {
+	if (is_td_vcpu(vcpu))
+		return;
+
 	vmx_set_nmi_mask(vcpu, masked);
 }
 
 static void vt_enable_nmi_window(struct kvm_vcpu *vcpu)
 {
+	/* TDX-SEAM handles NMI windows, KVM always reports NMIs as unblocked. */
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
 	vmx_enable_nmi_window(vcpu);
 }
 
 static void vt_enable_irq_window(struct kvm_vcpu *vcpu)
 {
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
 	vmx_enable_irq_window(vcpu);
 }
 
 static void vt_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 {
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
 	vmx_update_cr8_intercept(vcpu, tpr, irr);
 }
 
 static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 {
+	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+		return;
+
 	vmx_set_apic_access_page_addr(vcpu);
 }
 
 static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 {
+	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+		return;
+
 	vmx_refresh_apicv_exec_ctrl(vcpu);
 }
 
 static void vt_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
 {
+	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+		return;
+
 	vmx_load_eoi_exitmap(vcpu, eoi_exit_bitmap);
 }
 
 static int vt_set_tss_addr(struct kvm *kvm, unsigned int addr)
 {
+	/* TODO: Reject this and update Qemu, or eat it? */
+	if (is_td(kvm))
+		return 0;
+
 	return vmx_set_tss_addr(kvm, addr);
 }
 
 static int vt_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
 {
+	/* TODO: Reject this and update Qemu, or eat it? */
+	if (is_td(kvm))
+		return 0;
+
 	return vmx_set_identity_map_addr(kvm, ident_addr);
 }
 
 static u64 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
+	if (is_td_vcpu(vcpu)) {
+		if (is_mmio)
+			return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
+		return  MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
+	}
+
 	return vmx_get_mt_mask(vcpu, gfn, is_mmio);
 }
 
 static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
 			u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
 {
+	if (is_td_vcpu(vcpu)) {
+		tdx_get_exit_info(vcpu, reason, info1, info2, intr_info,
+				error_code);
+		return;
+	}
+
 	vmx_get_exit_info(vcpu, reason, info1, info2, intr_info, error_code);
 }
 
 static u64 vt_get_l2_tsc_offset(struct kvm_vcpu *vcpu)
 {
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return 0;
+
 	return vmx_get_l2_tsc_offset(vcpu);
 }
 
 static u64 vt_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
 {
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return 0;
+
 	return vmx_get_l2_tsc_multiplier(vcpu);
 }
 
 static void vt_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
 	vmx_write_tsc_offset(vcpu, offset);
 }
 
 static void vt_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
 {
+	if (is_td_vcpu(vcpu)) {
+		if (kvm_scale_tsc(vcpu, tsc_khz, multiplier) !=
+		    vcpu->kvm->arch.initial_tsc_khz)
+			KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm);
+		return;
+	}
+
 	vmx_write_tsc_multiplier(vcpu, multiplier);
 }
 
 static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return __kvm_request_immediate_exit(vcpu);
+
 	vmx_request_immediate_exit(vcpu);
 }
 
 static void vt_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	if (is_td_vcpu(vcpu))
+		return;
+
 	vmx_sched_in(vcpu, cpu);
 }
 
 static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return;
+
 	vmx_update_cpu_dirty_logging(vcpu);
 }
 
@@ -562,12 +913,16 @@ static int vt_pre_block(struct kvm_vcpu *vcpu)
 	if (pi_pre_block(vcpu))
 		return 1;
 
+	if (is_td_vcpu(vcpu))
+		return 0;
+
 	return vmx_pre_block(vcpu);
 }
 
 static void vt_post_block(struct kvm_vcpu *vcpu)
 {
-	vmx_post_block(vcpu);
+	if (!is_td_vcpu(vcpu))
+		vmx_post_block(vcpu);
 
 	pi_post_block(vcpu);
 }
@@ -577,17 +932,26 @@ static void vt_post_block(struct kvm_vcpu *vcpu)
 static int vt_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 			      bool *expired)
 {
+	if (is_td_vcpu(vcpu))
+		return -EINVAL;
+
 	return vmx_set_hv_timer(vcpu, guest_deadline_tsc, expired);
 }
 
 static void vt_cancel_hv_timer(struct kvm_vcpu *vcpu)
 {
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
 	vmx_cancel_hv_timer(vcpu);
 }
 #endif
 
 static void vt_setup_mce(struct kvm_vcpu *vcpu)
 {
+	if (is_td_vcpu(vcpu))
+		return;
+
 	vmx_setup_mce(vcpu);
 }
 
@@ -730,6 +1094,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.complete_emulated_msr = kvm_complete_insn_gp,
 
 	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+	.mem_enc_op_dev = vt_mem_enc_op_dev,
+	.mem_enc_op = vt_mem_enc_op,
+	.mem_enc_op_vcpu = vt_mem_enc_op_vcpu,
 };
 
 static struct kvm_x86_init_ops vt_init_ops __initdata = {
@@ -746,6 +1114,9 @@ static int __init vt_init(void)
 	unsigned int vcpu_size = 0, vcpu_align = 0;
 	int r;
 
+	/* tdx_pre_kvm_init must be called before vmx_pre_kvm_init(). */
+	tdx_pre_kvm_init(&vcpu_size, &vcpu_align, &vt_x86_ops.vm_size);
+
 	vmx_pre_kvm_init(&vcpu_size, &vcpu_align);
 
 	r = kvm_init(&vt_init_ops, vcpu_size, vcpu_align, THIS_MODULE);
@@ -756,8 +1127,14 @@ static int __init vt_init(void)
 	if (r)
 		goto err_kvm_exit;
 
+	r = tdx_init();
+	if (r)
+		goto err_vmx_exit;
+
 	return 0;
 
+err_vmx_exit:
+	vmx_exit();
 err_kvm_exit:
 	kvm_exit();
 err_vmx_post_exit:
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 5f81ef092bd4..d71808358feb 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -6,6 +6,7 @@
 
 #include "lapic.h"
 #include "posted_intr.h"
+#include "tdx.h"
 #include "trace.h"
 #include "vmx.h"
 
@@ -18,6 +19,11 @@ static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
 
 static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
 {
+#ifdef CONFIG_INTEL_TDX_HOST
+	if (is_td_vcpu(vcpu))
+		return &(to_tdx(vcpu)->pi_desc);
+#endif
+
 	return &(to_vmx(vcpu)->pi_desc);
 }
 
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 31412ed8049f..39853a260e06 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -8,6 +8,7 @@
 #include "tdx_errno.h"
 #include "tdx_arch.h"
 #include "tdx_ops.h"
+#include "posted_intr.h"
 
 #ifdef CONFIG_INTEL_TDX_HOST
 
@@ -22,6 +23,51 @@ struct kvm_tdx {
 
 	struct tdx_td_page tdr;
 	struct tdx_td_page tdcs[TDX_NR_TDCX_PAGES];
+
+	u64 attributes;
+	u64 xfam;
+	int hkid;
+
+	int cpuid_nent;
+	struct kvm_cpuid_entry2 cpuid_entries[KVM_MAX_CPUID_ENTRIES];
+
+	bool finalized;
+	bool tdh_mem_track;
+
+	hpa_t source_pa;
+
+	u64 tsc_offset;
+};
+
+union tdx_exit_reason {
+	struct {
+		/* 31:0 mirror the VMX Exit Reason format */
+		u64 basic		: 16;
+		u64 reserved16		: 1;
+		u64 reserved17		: 1;
+		u64 reserved18		: 1;
+		u64 reserved19		: 1;
+		u64 reserved20		: 1;
+		u64 reserved21		: 1;
+		u64 reserved22		: 1;
+		u64 reserved23		: 1;
+		u64 reserved24		: 1;
+		u64 reserved25		: 1;
+		u64 bus_lock_detected	: 1;
+		u64 enclave_mode	: 1;
+		u64 smi_pending_mtf	: 1;
+		u64 smi_from_vmx_root	: 1;
+		u64 reserved30		: 1;
+		u64 failed_vmentry	: 1;
+
+		/* 63:32 are TDX specific */
+		u64 details_l1		: 8;
+		u64 class		: 8;
+		u64 reserved61_48	: 14;
+		u64 non_recoverable	: 1;
+		u64 error		: 1;
+	};
+	u64 full;
 };
 
 struct vcpu_tdx {
@@ -29,6 +75,46 @@ struct vcpu_tdx {
 
 	struct tdx_td_page tdvpr;
 	struct tdx_td_page tdvpx[TDX_NR_TDVPX_PAGES];
+
+	struct list_head cpu_list;
+
+	/* Posted interrupt descriptor */
+	struct pi_desc pi_desc;
+
+	union {
+		struct {
+			union {
+				struct {
+					u16 gpr_mask;
+					u16 xmm_mask;
+				};
+				u32 regs_mask;
+			};
+			u32 reserved;
+		};
+		u64 rcx;
+	} tdvmcall;
+
+	union tdx_exit_reason exit_reason;
+
+	bool initialized;
+
+	bool host_state_need_save;
+	bool host_state_need_restore;
+	u64 msr_host_kernel_gs_base;
+};
+
+struct tdx_capabilities {
+	u8 tdcs_nr_pages;
+	u8 tdvpx_nr_pages;
+
+	u64 attrs_fixed0;
+	u64 attrs_fixed1;
+	u64 xfam_fixed0;
+	u64 xfam_fixed1;
+
+	u32 nr_cpuid_configs;
+	struct tdx_cpuid_config cpuid_configs[TDX_MAX_NR_CPUID_CONFIGS];
 };
 
 static inline bool is_td(struct kvm *kvm)
@@ -56,6 +142,11 @@ static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
 	return container_of(vcpu, struct vcpu_tdx, vcpu);
 }
 
+static inline bool is_td_vcpu_initialized(struct kvm_vcpu *vcpu)
+{
+	return to_tdx(vcpu)->initialized;
+}
+
 static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
 {
 	BUILD_BUG_ON_MSG(__builtin_constant_p(field) && (field) & 0x1,
@@ -84,6 +175,7 @@ static __always_inline void tdvps_gpr_check(u64 field, u8 bits)
 static __always_inline void tdvps_apic_check(u64 field, u8 bits) {}
 static __always_inline void tdvps_dr_check(u64 field, u8 bits) {}
 static __always_inline void tdvps_state_check(u64 field, u8 bits) {}
+static __always_inline void tdvps_state_non_arch_check(u64 field, u8 bits) {}
 static __always_inline void tdvps_msr_check(u64 field, u8 bits) {}
 static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
 
@@ -151,9 +243,30 @@ TDX_BUILD_TDVPS_ACCESSORS(64, APIC, apic);
 TDX_BUILD_TDVPS_ACCESSORS(64, GPR, gpr);
 TDX_BUILD_TDVPS_ACCESSORS(64, DR, dr);
 TDX_BUILD_TDVPS_ACCESSORS(64, STATE, state);
+TDX_BUILD_TDVPS_ACCESSORS(64, STATE_NON_ARCH, state_non_arch);
 TDX_BUILD_TDVPS_ACCESSORS(64, MSR, msr);
 TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
 
+static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
+{
+	struct tdx_ex_ret ex_ret;
+	u64 err;
+
+	err = tdh_mng_rd(kvm_tdx->tdr.pa, TDCS_EXEC(field), &ex_ret);
+	if (unlikely(err)) {
+		pr_err("TDH_MNG_RD[EXEC.0x%x] failed: 0x%llx\n", field, err);
+		WARN_ON(1);
+		return 0;
+	}
+	return ex_ret.regs.r8;
+}
+
+static __always_inline int pg_level_to_tdx_sept_level(enum pg_level level)
+{
+	WARN_ON(level == PG_LEVEL_NONE);
+	return level - 1;
+}
+
 #else
 struct kvm_tdx;
 struct vcpu_tdx;
@@ -163,6 +276,7 @@ static inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
 static inline bool is_debug_td(struct kvm_vcpu *vcpu) { return false; }
 static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm) { return NULL; }
 static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu) { return NULL; }
+static inline bool is_td_vcpu_initialized(struct kvm_vcpu *vcpu) { return false; }
 
 #endif /* CONFIG_INTEL_TDX_HOST */
 
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index f57f9bfb7007..7d1483a23714 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -54,11 +54,21 @@
 #define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT	0x10004
 
 /* TDX control structure (TDR/TDCS/TDVPS) field access codes */
+#define TDX_NON_ARCH			BIT_ULL(63)
 #define TDX_CLASS_SHIFT			56
 #define TDX_FIELD_MASK			GENMASK_ULL(31, 0)
 
-#define BUILD_TDX_FIELD(class, field)	\
-	(((u64)(class) << TDX_CLASS_SHIFT) | ((u64)(field) & TDX_FIELD_MASK))
+#define __BUILD_TDX_FIELD(non_arch, class, field)	\
+	(((non_arch) ? TDX_NON_ARCH : 0) |		\
+	 ((u64)(class) << TDX_CLASS_SHIFT) |		\
+	 ((u64)(field) & TDX_FIELD_MASK))
+
+#define BUILD_TDX_FIELD(class, field)			\
+	__BUILD_TDX_FIELD(false, (class), (field))
+
+#define BUILD_TDX_FIELD_NON_ARCH(class, field)		\
+	__BUILD_TDX_FIELD(true, (class), (field))
+
 
 /* @field is the VMCS field encoding */
 #define TDVPS_VMCS(field)		BUILD_TDX_FIELD(0, (field))
@@ -83,10 +93,20 @@ enum tdx_guest_other_state {
 	TD_VCPU_IWK_INTKEY0 = 68,
 	TD_VCPU_IWK_INTKEY1,
 	TD_VCPU_IWK_FLAGS = 70,
+	TD_VCPU_STATE_DETAILS_NON_ARCH = 0x100,
+};
+
+union tdx_vcpu_state_details {
+	struct {
+		u64 vmxip	: 1;
+		u64 reserved	: 63;
+	};
+	u64 full;
 };
 
 /* @field is any of enum tdx_guest_other_state */
 #define TDVPS_STATE(field)		BUILD_TDX_FIELD(17, (field))
+#define TDVPS_STATE_NON_ARCH(field)	BUILD_TDX_FIELD_NON_ARCH(17, field)
 
 /* @msr is the MSR index */
 #define TDVPS_MSR(msr)			BUILD_TDX_FIELD(19, (msr))
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index 87ed67fd2715..f40c46eaff4c 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -8,36 +8,48 @@
 
 #include <asm/asm.h>
 #include <asm/kvm_host.h>
+#include <asm/cacheflush.h>
 
 #include "seamcall.h"
+#include "tdx_arch.h"
 
 #ifdef CONFIG_INTEL_TDX_HOST
 
+static inline void tdx_clflush_page(hpa_t addr)
+{
+	clflush_cache_range(__va(addr), PAGE_SIZE);
+}
+
 static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
 {
+	tdx_clflush_page(addr);
 	return seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, 0, NULL);
 }
 
 static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
 			    struct tdx_ex_ret *ex)
 {
+	tdx_clflush_page(hpa);
 	return seamcall(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, 0, ex);
 }
 
 static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
 			    struct tdx_ex_ret *ex)
 {
+	tdx_clflush_page(page);
 	return seamcall(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0, 0, ex);
 }
 
 static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
 {
+	tdx_clflush_page(addr);
 	return seamcall(TDH_VP_ADDCX, addr, tdvpr, 0, 0, 0, NULL);
 }
 
 static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
 			    struct tdx_ex_ret *ex)
 {
+	tdx_clflush_page(hpa);
 	return seamcall(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, 0, ex);
 }
 
@@ -54,11 +66,13 @@ static inline u64 tdh_mng_key_config(hpa_t tdr)
 
 static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
 {
+	tdx_clflush_page(tdr);
 	return seamcall(TDH_MNG_CREATE, tdr, hkid, 0, 0, 0, NULL);
 }
 
 static inline u64 tdh_vp_create(hpa_t tdr, hpa_t tdvpr)
 {
+	tdx_clflush_page(tdvpr);
 	return seamcall(TDH_VP_CREATE, tdvpr, tdr, 0, 0, 0, NULL);
 }
 
diff --git a/arch/x86/kvm/vmx/tdx_stubs.c b/arch/x86/kvm/vmx/tdx_stubs.c
new file mode 100644
index 000000000000..5417d778e6c0
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_stubs.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kvm_host.h>
+
+int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
+int tdx_vm_init(struct kvm *kvm) { return 0; }
+void tdx_vm_teardown(struct kvm *kvm) {}
+void tdx_vm_destroy(struct kvm *kvm) {}
+int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return 0; }
+void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
+void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
+void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
+fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
+void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
+void tdx_hardware_enable(void) {}
+void tdx_hardware_disable(void) {}
+void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu) {}
+int tdx_handle_exit(struct kvm_vcpu *vcpu, enum exit_fastpath_completion fastpath) { return 0; }
+int tdx_dev_ioctl(void __user *argp) { return -EINVAL; }
+int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EINVAL; }
+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EINVAL; }
+void tdx_flush_tlb(struct kvm_vcpu *vcpu) {}
+void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t pgd, int pgd_level) {}
+void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu) {}
+void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu) {}
+int tdx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector) { return -1; }
+
+void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason, u64 *info1,
+		u64 *info2, u32 *intr_info, u32 *error_code)
+{
+}
+
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
+int __init tdx_check_processor_compatibility(void) { return 0; }
+void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
+			unsigned int *vcpu_align, unsigned int *vm_size)
+{
+}
+
+int __init tdx_init(void) { return 0; }
+void tdx_update_exception_bitmap(struct kvm_vcpu *vcpu) {}
+void tdx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val) {}
+int tdx_get_cpl(struct kvm_vcpu *vcpu) { return 0; }
+unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu) { return 0; }
+void tdx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) {}
+bool tdx_is_emulated_msr(u32 index, bool write) { return false; }
+int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
+int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
+u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg) { return 0; }
+void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg) {}
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 3a6461694fc2..79fa88b30f4d 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -2,6 +2,7 @@
 #include <linux/linkage.h>
 #include <asm/asm.h>
 #include <asm/bitsperlong.h>
+#include <asm/errno.h>
 #include <asm/kvm_vcpu_regs.h>
 #include <asm/nospec-branch.h>
 #include <asm/segment.h>
@@ -28,6 +29,13 @@
 #define VCPU_R15	__VCPU_REGS_R15 * WORD_SIZE
 #endif
 
+#ifdef CONFIG_INTEL_TDX_HOST
+#define TDENTER 		0
+#define EXIT_REASON_TDCALL	77
+#define TDENTER_ERROR_BIT	63
+#include "seamcall.h"
+#endif
+
 .section .noinstr.text, "ax"
 
 /**
@@ -328,3 +336,141 @@ SYM_FUNC_START(vmx_do_interrupt_nmi_irqoff)
 	pop %_ASM_BP
 	ret
 SYM_FUNC_END(vmx_do_interrupt_nmi_irqoff)
+
+#ifdef CONFIG_INTEL_TDX_HOST
+
+.pushsection .noinstr.text, "ax"
+
+/**
+ * __tdx_vcpu_run - Call SEAMCALL(TDENTER) to run a TD vcpu
+ * @tdvpr:	physical address of TDVPR
+ * @regs:	void * (to registers of TDVCPU)
+ * @gpr_mask:	non-zero if guest registers need to be loaded prior to TDENTER
+ *
+ * Returns:
+ *	TD-Exit Reason
+ *
+ * Note: KVM doesn't support using XMM in its hypercalls, it's the HyperV
+ *	 code's responsibility to save/restore XMM registers on TDVMCALL.
+ */
+SYM_FUNC_START(__tdx_vcpu_run)
+	push %rbp
+	mov  %rsp, %rbp
+
+	push %r15
+	push %r14
+	push %r13
+	push %r12
+	push %rbx
+
+	/* Save @regs, which is needed after TDENTER to capture output. */
+	push %rsi
+
+	/* Load @tdvpr to RCX */
+	mov %rdi, %rcx
+
+	/* No need to load guest GPRs if the last exit wasn't a TDVMCALL. */
+	test %dx, %dx
+	je 1f
+
+	/* Load @regs to RAX, which will be clobbered with $TDENTER anyways. */
+	mov %rsi, %rax
+
+	mov VCPU_RBX(%rax), %rbx
+	mov VCPU_RDX(%rax), %rdx
+	mov VCPU_RBP(%rax), %rbp
+	mov VCPU_RSI(%rax), %rsi
+	mov VCPU_RDI(%rax), %rdi
+
+	mov VCPU_R8 (%rax),  %r8
+	mov VCPU_R9 (%rax),  %r9
+	mov VCPU_R10(%rax), %r10
+	mov VCPU_R11(%rax), %r11
+	mov VCPU_R12(%rax), %r12
+	mov VCPU_R13(%rax), %r13
+	mov VCPU_R14(%rax), %r14
+	mov VCPU_R15(%rax), %r15
+
+	/*  Load TDENTER to RAX.  This kills the @regs pointer! */
+1:	mov $TDENTER, %rax
+
+2:	seamcall
+
+	/* Skip to the exit path if TDENTER failed. */
+	bt $TDENTER_ERROR_BIT, %rax
+	jc 4f
+
+	/* Temporarily save the TD-Exit reason. */
+	push %rax
+
+	/* check if TD-exit due to TDVMCALL */
+	cmp $EXIT_REASON_TDCALL, %ax
+
+	/* Reload @regs to RAX. */
+	mov 8(%rsp), %rax
+
+	/* Jump on non-TDVMCALL */
+	jne 3f
+
+	/* Save all output from SEAMCALL(TDENTER) */
+	mov %rbx, VCPU_RBX(%rax)
+	mov %rbp, VCPU_RBP(%rax)
+	mov %rsi, VCPU_RSI(%rax)
+	mov %rdi, VCPU_RDI(%rax)
+	mov %r10, VCPU_R10(%rax)
+	mov %r11, VCPU_R11(%rax)
+	mov %r12, VCPU_R12(%rax)
+	mov %r13, VCPU_R13(%rax)
+	mov %r14, VCPU_R14(%rax)
+	mov %r15, VCPU_R15(%rax)
+
+3:	mov %rcx, VCPU_RCX(%rax)
+	mov %rdx, VCPU_RDX(%rax)
+	mov %r8,  VCPU_R8 (%rax)
+	mov %r9,  VCPU_R9 (%rax)
+
+	/*
+	 * Clear all general purpose registers except RSP and RAX to prevent
+	 * speculative use of the guest's values.
+	 */
+	xor %rbx, %rbx
+	xor %rcx, %rcx
+	xor %rdx, %rdx
+	xor %rsi, %rsi
+	xor %rdi, %rdi
+	xor %rbp, %rbp
+	xor %r8,  %r8
+	xor %r9,  %r9
+	xor %r10, %r10
+	xor %r11, %r11
+	xor %r12, %r12
+	xor %r13, %r13
+	xor %r14, %r14
+	xor %r15, %r15
+
+	/* Restore the TD-Exit reason to RAX for return. */
+	pop %rax
+
+	/* "POP" @regs. */
+4:	add $8, %rsp
+	pop %rbx
+	pop %r12
+	pop %r13
+	pop %r14
+	pop %r15
+
+	pop %rbp
+	ret
+
+5:	cmpb $0, kvm_rebooting
+	je 6f
+	mov $-EFAULT, %rax
+	jmp 4b
+6:	ud2
+	_ASM_EXTABLE(2b, 5b)
+
+SYM_FUNC_END(__tdx_vcpu_run)
+
+.popsection
+
+#endif
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6f38e0d2e1b6..4b7d6fe63d58 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3752,45 +3752,6 @@ void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
 	pt_update_intercept_for_msr(vcpu);
 }
 
-static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
-						     bool nested)
-{
-#ifdef CONFIG_SMP
-	int pi_vec = nested ? POSTED_INTR_NESTED_VECTOR : POSTED_INTR_VECTOR;
-
-	if (vcpu->mode == IN_GUEST_MODE) {
-		/*
-		 * The vector of interrupt to be delivered to vcpu had
-		 * been set in PIR before this function.
-		 *
-		 * Following cases will be reached in this block, and
-		 * we always send a notification event in all cases as
-		 * explained below.
-		 *
-		 * Case 1: vcpu keeps in non-root mode. Sending a
-		 * notification event posts the interrupt to vcpu.
-		 *
-		 * Case 2: vcpu exits to root mode and is still
-		 * runnable. PIR will be synced to vIRR before the
-		 * next vcpu entry. Sending a notification event in
-		 * this case has no effect, as vcpu is not in root
-		 * mode.
-		 *
-		 * Case 3: vcpu exits to root mode and is blocked.
-		 * vcpu_block() has already synced PIR to vIRR and
-		 * never blocks vcpu if vIRR is not cleared. Therefore,
-		 * a blocked vcpu here does not wait for any requested
-		 * interrupts in PIR, and sending a notification event
-		 * which has no effect is safe here.
-		 */
-
-		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
-		return true;
-	}
-#endif
-	return false;
-}
-
 static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
 						int vector)
 {
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index fec22bef05b7..3325dfa5bf52 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -114,10 +114,90 @@ int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
 #endif
 void vmx_setup_mce(struct kvm_vcpu *vcpu);
+static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
+						     bool nested)
+{
+#ifdef CONFIG_SMP
+	int pi_vec = nested ? POSTED_INTR_NESTED_VECTOR : POSTED_INTR_VECTOR;
 
+	if (vcpu->mode == IN_GUEST_MODE) {
+		/*
+		 * The vector of interrupt to be delivered to vcpu had
+		 * been set in PIR before this function.
+		 *
+		 * Following cases will be reached in this block, and
+		 * we always send a notification event in all cases as
+		 * explained below.
+		 *
+		 * Case 1: vcpu keeps in non-root mode. Sending a
+		 * notification event posts the interrupt to vcpu.
+		 *
+		 * Case 2: vcpu exits to root mode and is still
+		 * runnable. PIR will be synced to vIRR before the
+		 * next vcpu entry. Sending a notification event in
+		 * this case has no effect, as vcpu is not in root
+		 * mode.
+		 *
+		 * Case 3: vcpu exits to root mode and is blocked.
+		 * vcpu_block() has already synced PIR to vIRR and
+		 * never blocks vcpu if vIRR is not cleared. Therefore,
+		 * a blocked vcpu here does not wait for any requested
+		 * interrupts in PIR, and sending a notification event
+		 * which has no effect is safe here.
+		 */
+
+		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
+		return true;
+	}
+#endif
+	return false;
+}
+
+int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
 void __init vmx_pre_kvm_init(unsigned int *vcpu_size, unsigned int *vcpu_align);
 int __init vmx_init(void);
 void vmx_exit(void);
 void vmx_post_kvm_exit(void);
 
+int tdx_vm_init(struct kvm *kvm);
+void tdx_vm_teardown(struct kvm *kvm);
+void tdx_vm_destroy(struct kvm *kvm);
+int tdx_vcpu_create(struct kvm_vcpu *vcpu);
+void tdx_vcpu_free(struct kvm_vcpu *vcpu);
+void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+void tdx_inject_nmi(struct kvm_vcpu *vcpu);
+fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+void tdx_vcpu_put(struct kvm_vcpu *vcpu);
+void tdx_hardware_enable(void);
+void tdx_hardware_disable(void);
+void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
+int tdx_handle_exit(struct kvm_vcpu *vcpu,
+		enum exit_fastpath_completion fastpath);
+int tdx_dev_ioctl(void __user *argp);
+int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
+void tdx_flush_tlb(struct kvm_vcpu *vcpu);
+void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t pgd, int pgd_level);
+void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
+void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu);
+int tdx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector);
+void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
+int __init tdx_check_processor_compatibility(void);
+void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
+			unsigned int *vcpu_align, unsigned int *vm_size);
+int __init tdx_init(void);
+void tdx_update_exception_bitmap(struct kvm_vcpu *vcpu);
+void tdx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val);
+int tdx_get_cpl(struct kvm_vcpu *vcpu);
+unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu);
+void tdx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
+bool tdx_is_emulated_msr(u32 index, bool write);
+int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
+int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
+u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
+void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+
 #endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 447f0d5b53c7..da75530d75e9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -292,6 +292,7 @@ const struct kvm_stats_header kvm_vcpu_stats_header = {
 };
 
 u64 __read_mostly host_xcr0;
+EXPORT_SYMBOL_GPL(host_xcr0);
 u64 __read_mostly supported_xcr0;
 EXPORT_SYMBOL_GPL(supported_xcr0);
 
@@ -2265,9 +2266,7 @@ static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
 	u64 ratio;
 
 	/* Guest TSC same frequency as host TSC? */
-	if (!scale || vcpu->kvm->arch.tsc_immutable) {
-		if (scale)
-			pr_warn_ratelimited("Guest TSC immutable, scaling not supported\n");
+	if (!scale) {
 		kvm_vcpu_write_tsc_multiplier(vcpu, kvm_default_tsc_scaling_ratio);
 		return 0;
 	}
@@ -10740,7 +10739,8 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 {
 	int ret;
 
-	if (vcpu->arch.guest_state_protected)
+	if (vcpu->arch.guest_state_protected ||
+	    vcpu->kvm->arch.vm_type == KVM_X86_TDX_VM)
 		return -EINVAL;
 
 	vcpu_load(vcpu);
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 96b0064cff5c..16dd7bf57ac9 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -508,4 +508,55 @@ struct kvm_pmu_event_filter {
 #define KVM_X86_SEV_ES_VM	1
 #define KVM_X86_TDX_VM		2
 
+/* Trust Domain eXtension sub-ioctl() commands. */
+enum kvm_tdx_cmd_id {
+	KVM_TDX_CAPABILITIES = 0,
+	KVM_TDX_INIT_VM,
+	KVM_TDX_INIT_VCPU,
+	KVM_TDX_INIT_MEM_REGION,
+	KVM_TDX_FINALIZE_VM,
+
+	KVM_TDX_CMD_NR_MAX,
+};
+
+struct kvm_tdx_cmd {
+	__u32 id;
+	__u32 metadata;
+	__u64 data;
+};
+
+struct kvm_tdx_cpuid_config {
+	__u32 leaf;
+	__u32 sub_leaf;
+	__u32 eax;
+	__u32 ebx;
+	__u32 ecx;
+	__u32 edx;
+};
+
+struct kvm_tdx_capabilities {
+	__u64 attrs_fixed0;
+	__u64 attrs_fixed1;
+	__u64 xfam_fixed0;
+	__u64 xfam_fixed1;
+
+	__u32 nr_cpuid_configs;
+	struct kvm_tdx_cpuid_config cpuid_configs[0];
+};
+
+struct kvm_tdx_init_vm {
+	__u32 max_vcpus;
+	__u32 reserved;
+	__u64 attributes;
+	__u64 cpuid;
+};
+
+#define KVM_TDX_MEASURE_MEMORY_REGION	(1UL << 0)
+
+struct kvm_tdx_init_mem_region {
+	__u64 source_addr;
+	__u64 gpa;
+	__u64 nr_pages;
+};
+
 #endif /* _ASM_X86_KVM_H */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 56/59] KVM: TDX: Protect private mapping related SEAMCALLs with spinlock
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (54 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 55/59] KVM: TDX: Add "basic" support for building and running Trust Domains isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 57/59] KVM, x86/mmu: Support TDX private mapping for TDP MMU isaku.yamahata
                   ` (4 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Kai Huang

From: Kai Huang <kai.huang@intel.com>

Unlike legacy MMU, TDP MMU runs page fault handler concurrently.
However, TDX private mapping related SEAMCALLs cannot run concurrently
even they operate on different pages, because they need to acquire
exclusive access to secure EPT table.  Protect those SEAMCALLs with
spinlock, so they won't run concurrently even under TDP MMU.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 111 +++++++++++++++++++++++++++++++++++++----
 arch/x86/kvm/vmx/tdx.h |   7 +++
 2 files changed, 108 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 64b2841064c4..53fc01f3bab1 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -519,6 +519,8 @@ int tdx_vm_init(struct kvm *kvm)
 		tdx_add_td_page(&kvm_tdx->tdcs[i]);
 	}
 
+	spin_lock_init(&kvm_tdx->seamcall_lock);
+
 	/*
 	 * Note, TDH_MNG_INIT cannot be invoked here.  TDH_MNG_INIT requires a dedicated
 	 * ioctl() to define the configure CPUID values for the TD.
@@ -1187,8 +1189,8 @@ static void tdx_measure_page(struct kvm_tdx *kvm_tdx, hpa_t gpa)
 	}
 }
 
-static void tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
-				      enum pg_level level, kvm_pfn_t pfn)
+static void __tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+					enum pg_level level, kvm_pfn_t pfn)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	hpa_t hpa = pfn << PAGE_SHIFT;
@@ -1215,6 +1217,15 @@ static void tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 		return;
 	}
 
+	/*
+	 * In case of TDP MMU, fault handler can run concurrently.  Note
+	 * 'source_pa' is a TD scope variable, meaning if there are multiple
+	 * threads reaching here with all needing to access 'source_pa', it
+	 * will break.  However fortunately this won't happen, because below
+	 * TDH_MEM_PAGE_ADD code path is only used when VM is being created
+	 * before it is running, using KVM_TDX_INIT_MEM_REGION ioctl (which
+	 * always uses vcpu 0's page table and protected by vcpu->mutex).
+	 */
 	WARN_ON(kvm_tdx->source_pa == INVALID_PAGE);
 	source_pa = kvm_tdx->source_pa & ~KVM_TDX_MEASURE_MEMORY_REGION;
 
@@ -1227,8 +1238,25 @@ static void tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	kvm_tdx->source_pa = INVALID_PAGE;
 }
 
-static void tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
-				       kvm_pfn_t pfn)
+static void tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+				      enum pg_level level, kvm_pfn_t pfn)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	/*
+	 * Only TDP MMU needs to use spinlock, however for simplicity,
+	 * just always use spinlock for seamcall, regardless whether
+	 * legacy MMU or TDP MMU is being used.  For legacy MMU it
+	 * should not have noticeable performance impact since taking
+	 * spinlock w/o needing to wait should be fast.
+	 */
+	spin_lock(&kvm_tdx->seamcall_lock);
+	__tdx_sept_set_private_spte(kvm, gfn, level, pfn);
+	spin_unlock(&kvm_tdx->seamcall_lock);
+}
+
+static void __tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+					kvm_pfn_t pfn)
 {
 	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -1262,8 +1290,19 @@ static void tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level
 	put_page(pfn_to_page(pfn));
 }
 
-static int tdx_sept_link_private_sp(struct kvm *kvm, gfn_t gfn,
-				    enum pg_level level, void *sept_page)
+static void tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				       kvm_pfn_t pfn)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	/* See comment in tdx_sept_set_private_spte() */
+	spin_lock(&kvm_tdx->seamcall_lock);
+	__tdx_sept_drop_private_spte(kvm, gfn, level, pfn);
+	spin_unlock(&kvm_tdx->seamcall_lock);
+}
+
+static int __tdx_sept_link_private_sp(struct kvm *kvm, gfn_t gfn,
+				      enum pg_level level, void *sept_page)
 {
 	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -1281,7 +1320,22 @@ static int tdx_sept_link_private_sp(struct kvm *kvm, gfn_t gfn,
 	return 0;
 }
 
-static void tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level)
+static int tdx_sept_link_private_sp(struct kvm *kvm, gfn_t gfn,
+				    enum pg_level level, void *sept_page)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	int ret;
+
+	/* See comment in tdx_sept_set_private_spte() */
+	spin_lock(&kvm_tdx->seamcall_lock);
+	ret = __tdx_sept_link_private_sp(kvm, gfn, level, sept_page);
+	spin_unlock(&kvm_tdx->seamcall_lock);
+
+	return ret;
+}
+
+static void __tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
+					enum pg_level level)
 {
 	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -1294,7 +1348,19 @@ static void tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level
 		pr_tdx_error(TDH_MEM_RANGE_BLOCK, err, &ex_ret);
 }
 
-static void tdx_sept_unzap_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level)
+static void tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
+				      enum pg_level level)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	/* See comment in tdx_sept_set_private_spte() */
+	spin_lock(&kvm_tdx->seamcall_lock);
+	__tdx_sept_zap_private_spte(kvm, gfn, level);
+	spin_unlock(&kvm_tdx->seamcall_lock);
+}
+
+static void __tdx_sept_unzap_private_spte(struct kvm *kvm, gfn_t gfn,
+					  enum pg_level level)
 {
 	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -1307,8 +1373,19 @@ static void tdx_sept_unzap_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_leve
 		pr_tdx_error(TDH_MEM_RANGE_UNBLOCK, err, &ex_ret);
 }
 
-static int tdx_sept_free_private_sp(struct kvm *kvm, gfn_t gfn, enum pg_level level,
-				    void *sept_page)
+static void tdx_sept_unzap_private_spte(struct kvm *kvm, gfn_t gfn,
+					enum pg_level level)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	/* See comment in tdx_sept_set_private_spte() */
+	spin_lock(&kvm_tdx->seamcall_lock);
+	__tdx_sept_unzap_private_spte(kvm, gfn, level);
+	spin_unlock(&kvm_tdx->seamcall_lock);
+}
+
+static int __tdx_sept_free_private_sp(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				      void *sept_page)
 {
 	/*
 	 * free_private_sp() is (obviously) called when a shadow page is being
@@ -1320,6 +1397,20 @@ static int tdx_sept_free_private_sp(struct kvm *kvm, gfn_t gfn, enum pg_level le
 	return tdx_reclaim_page((unsigned long)sept_page, __pa(sept_page));
 }
 
+static int tdx_sept_free_private_sp(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				    void *sept_page)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	int ret;
+
+	/* See comment in tdx_sept_set_private_spte() */
+	spin_lock(&kvm_tdx->seamcall_lock);
+	ret = __tdx_sept_free_private_sp(kvm, gfn, level, sept_page);
+	spin_unlock(&kvm_tdx->seamcall_lock);
+
+	return ret;
+}
+
 static int tdx_sept_tlb_remote_flush(struct kvm *kvm)
 {
 	struct kvm_tdx *kvm_tdx;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 39853a260e06..3b1a1e2878c2 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -37,6 +37,13 @@ struct kvm_tdx {
 	hpa_t source_pa;
 
 	u64 tsc_offset;
+
+	/*
+	 * Lock to prevent seamcalls from running concurrently
+	 * when TDP MMU is enabled, because TDP fault handler
+	 * runs concurrently.
+	 */
+	spinlock_t seamcall_lock;
 };
 
 union tdx_exit_reason {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 57/59] KVM, x86/mmu: Support TDX private mapping for TDP MMU
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (55 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 56/59] KVM: TDX: Protect private mapping related SEAMCALLs with spinlock isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 58/59] KVM: TDX: exit to user space on GET_QUOTE, SETUP_EVENT_NOTIFY_INTERRUPT isaku.yamahata
                   ` (3 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Kai Huang

From: Kai Huang <kai.huang@intel.com>

Frame in TDX private mapping support to support running TD with TDP MMU.

Similar to legacy MMU, use private mapping related kvm_x86_ops hooks in
__handle_changed_spte()/handle_removed_tdp_mmu_page() to support
creating/removing private mapping.  And support temporarily blocking
private mapping upon receiving MMU notifier and later on unblocking it
upon EPT violation, rather than completely removing the private page,
because currently page migration for private page is not supported,
therefore the page cannot be really removed from TD upon MMU notifier.

Similar to legacy MMU, zap aliasing mapping (truly remove the page)
upon EPT violation.  And only leaf page is zapped, but not intermediate
page tables.

Support 4K page only at this stage.  2M page support can be done in
future patches.

A key change to TDP MMU to support TD guest is, the read_lock() is
changed to write_lock() for page fault on private GPA in
direct_page_fault(), while for fault on shared GPA, read_lock() is
still used.  This is because for TD guest, at given time for given GFN,
only one type of mapping (either private, or shared) can be valid, but
not both, otherwise it may cause machine check and data lose.  As a
result, aliasing mapping is zapped in TDP MMU fault handler.  In this
case, running multiple fault threads with both private and shared
address concurrently may end up with having both private and shared
mapping for given GPN.  Consider below case: vcpu 0 is accessing using
private GPA, and vcpu 1 is accessing the shared GPA (i.e. right after
MAP_GPA).

	vcpu 0 				vcpu 1

  (fault with private GPA)	(fault with shared GPA)
	zap shared mapping
				zap private mapping
				setup shared mapping
	setup private mapping

This may end up with having both private and shared mappings.  Perhaps
it is arguable whether above case is valid, but for security, just don't
allow running multiple fault threads concurrently.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/mmu.c      |  63 ++++-
 arch/x86/kvm/mmu/spte.h     |  20 +-
 arch/x86/kvm/mmu/tdp_iter.h |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.c  | 544 ++++++++++++++++++++++++++++++++----
 arch/x86/kvm/mmu/tdp_mmu.h  |  15 +-
 5 files changed, 567 insertions(+), 77 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f8109002da0d..1e13949ac62a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3692,7 +3692,11 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 		goto out_unlock;
 
 	if (is_tdp_mmu_enabled(vcpu->kvm)) {
-		root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
+		if (gfn_shared && !VALID_PAGE(mmu->private_root_hpa)) {
+			root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, true);
+			mmu->private_root_hpa = root;
+		}
+		root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, false);
 		mmu->root_hpa = root;
 	} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
 		if (gfn_shared && !VALID_PAGE(vcpu->arch.mmu->private_root_hpa)) {
@@ -4348,7 +4352,33 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 
 	r = RET_PF_RETRY;
 
-	if (is_tdp_mmu_fault)
+	/*
+	 * Unfortunately, when running TD, TDP MMU cannot always run multiple
+	 * fault threads concurrently.  TDX has two page tables supporting
+	 * private and shared mapping simultaneously, but at given time, only
+	 * one mapping can be valid for given GFN, otherwise it may cause
+	 * machine check and data lose.  Therefore, TDP MMU fault handler zaps
+	 * aliasing mapping.
+	 *
+	 * Running fault threads for both private and shared GPA concurrently
+	 * can potentially end up with having both private and shared mapping
+	 * for one GPA.  For instance, vcpu 0 is accessing using private GPA,
+	 * and vcpu 1 is accessing using shared GPA (i.e. right after MAP_GPA):
+	 *
+	 *		vcpu 0				vcpu 1
+	 *	(fault with private GPA)	(fault with shared GPA)
+	 *
+	 *	zap shared mapping
+	 *					zap priavte mapping
+	 *					setup shared mapping
+	 *	setup private mapping
+	 *
+	 * This can be prevented by only allowing one type of fault (private
+	 * or shared) to run concurrently.  Choose to let private fault to run
+	 * concurrently, because for TD, most pages should be private.
+	 */
+	if (is_tdp_mmu_fault && kvm_is_private_gfn(vcpu->kvm,
+				fault->addr >> PAGE_SHIFT))
 		read_lock(&vcpu->kvm->mmu_lock);
 	else
 		write_lock(&vcpu->kvm->mmu_lock);
@@ -4365,7 +4395,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 		r = __direct_map(vcpu, fault);
 
 out_unlock:
-	if (is_tdp_mmu_fault)
+	if (is_tdp_mmu_fault && kvm_is_private_gfn(vcpu->kvm,
+				fault->addr >> PAGE_SHIFT))
 		read_unlock(&vcpu->kvm->mmu_lock);
 	else
 		write_unlock(&vcpu->kvm->mmu_lock);
@@ -6057,6 +6088,10 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 
 	write_unlock(&kvm->mmu_lock);
 
+	/*
+	 * For now private root is never invalidate during VM is running,
+	 * so this can only happen for shared roots.
+	 */
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
 		kvm_tdp_mmu_zap_invalidated_roots(kvm);
@@ -6162,7 +6197,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 	if (is_tdp_mmu_enabled(kvm)) {
 		for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
 			flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
-							  gfn_end, flush);
+							  gfn_end, flush,
+							  false);
 	}
 
 	if (flush)
@@ -6195,6 +6231,11 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		write_unlock(&kvm->mmu_lock);
 	}
 
+	/*
+	 * For now this can only happen for non-TD VM, because TD private
+	 * mapping doesn't support write protection.  kvm_tdp_mmu_wrprot_slot()
+	 * will give a WARN() if it hits for TD.
+	 */
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
 		flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
@@ -6277,6 +6318,11 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 		write_unlock(&kvm->mmu_lock);
 	}
 
+	/*
+	 * This should only be reachable in case of log-dirty, wihch TD private
+	 * mapping doesn't support so far.  kvm_tdp_mmu_zap_collapsible_sptes()
+	 * internally gives a WARN() when it hits.
+	 */
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
 		flush = kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot, flush);
@@ -6316,6 +6362,7 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 		write_unlock(&kvm->mmu_lock);
 	}
 
+	/* See comments in kvm_mmu_slot_remove_write_access() */
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
 		flush |= kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
@@ -6350,8 +6397,12 @@ static void __kvm_mmu_zap_all(struct kvm *kvm, struct list_head *mmu_pages)
 	}
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
 
-	if (is_tdp_mmu_enabled(kvm))
-		kvm_tdp_mmu_zap_all(kvm);
+	if (is_tdp_mmu_enabled(kvm)) {
+		bool zap_private =
+			(mmu_pages == &kvm->arch.private_mmu_pages) ?
+			true : false;
+		kvm_tdp_mmu_zap_all(kvm, zap_private);
+	}
 
 	write_unlock(&kvm->mmu_lock);
 }
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 43f0d2a773f7..3108d46e0dd3 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -173,7 +173,9 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
  * If a thread running without exclusive control of the MMU lock must perform a
  * multi-part operation on an SPTE, it can set the SPTE to REMOVED_SPTE as a
  * non-present intermediate value. Other threads which encounter this value
- * should not modify the SPTE.
+ * should not modify the SPTE.  When TDX is enabled, shadow_init_value, which
+ * is "suppress #VE" bit set, is also set to removed SPTE, because TDX module
+ * always enables "EPT violation #VE".
  *
  * Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
  * bot AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
@@ -183,12 +185,24 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
  */
 #define REMOVED_SPTE	0x5a0ULL
 
-/* Removed SPTEs must not be misconstrued as shadow present PTEs. */
+/*
+ * Removed SPTEs must not be misconstrued as shadow present PTEs, and
+ * temporarily blocked private PTEs.
+ */
 static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
+static_assert(!(REMOVED_SPTE & SPTE_PRIVATE_ZAPPED));
+
+/*
+ * See above comment around REMOVED_SPTE.  SHADOW_REMOVED_SPTE is the actual
+ * intermediate value set to the removed SPET.  When TDX is enabled, it sets
+ * the "suppress #VE" bit, otherwise it's REMOVED_SPTE.
+ */
+extern u64 __read_mostly shadow_init_value;
+#define SHADOW_REMOVED_SPTE	(shadow_init_value | REMOVED_SPTE)
 
 static inline bool is_removed_spte(u64 spte)
 {
-	return spte == REMOVED_SPTE;
+	return spte == SHADOW_REMOVED_SPTE;
 }
 
 /*
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index b1748b988d3a..52089728d583 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -28,7 +28,7 @@ struct tdp_iter {
 	tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
 	/* A pointer to the current SPTE */
 	tdp_ptep_t sptep;
-	/* The lowest GFN mapped by the current SPTE */
+	/* The lowest GFN (stolen bits included) mapped by the current SPTE */
 	gfn_t gfn;
 	/* The level of the root page given to the iterator */
 	int root_level;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a54c3491af42..69c71e696510 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -53,6 +53,11 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 	rcu_barrier();
 }
 
+static gfn_t tdp_iter_gfn_unalias(struct kvm *kvm, struct tdp_iter *iter)
+{
+	return iter->gfn & ~kvm_gfn_stolen_mask(kvm);
+}
+
 static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			  gfn_t start, gfn_t end, bool can_yield, bool flush,
 			  bool shared);
@@ -158,7 +163,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 		} else
 
 static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
-						   int level)
+						   int level, bool private)
 {
 	union kvm_mmu_page_role role;
 
@@ -168,12 +173,13 @@ static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
 	role.gpte_is_8_bytes = true;
 	role.access = ACC_ALL;
 	role.ad_disabled = !shadow_accessed_mask;
+	role.private = private;
 
 	return role;
 }
 
 static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					       int level)
+					       int level, bool private)
 {
 	struct kvm_mmu_page *sp;
 
@@ -181,7 +187,18 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
-	sp->role.word = page_role_for_level(vcpu, level).word;
+	/*
+	 * Unlike kvm_mmu_link_private_sp(), which is used by legacy MMU,
+	 * allocate private_sp here since __handle_changed_spte() takes
+	 * 'kvm' as parameter rather than 'vcpu'.
+	 */
+	if (private) {
+		sp->private_sp =
+			kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_sp_cache);
+		WARN_ON_ONCE(!sp->private_sp);
+	}
+
+	sp->role.word = page_role_for_level(vcpu, level, private).word;
 	sp->gfn = gfn;
 	sp->tdp_mmu_page = true;
 
@@ -190,7 +207,8 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return sp;
 }
 
-hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
+static struct kvm_mmu_page *kvm_tdp_mmu_get_vcpu_root(struct kvm_vcpu *vcpu,
+						      bool private)
 {
 	union kvm_mmu_page_role role;
 	struct kvm *kvm = vcpu->kvm;
@@ -198,7 +216,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
-	role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level);
+	role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level,
+			private);
 
 	/* Check for an existing root before allocating a new one. */
 	for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
@@ -207,7 +226,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 			goto out;
 	}
 
-	root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
+	root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level,
+			private);
 	refcount_set(&root->tdp_mmu_root_count, 1);
 
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
@@ -215,12 +235,17 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 
 out:
-	return __pa(root->spt);
+	return root;
+}
+
+hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private)
+{
+	return __pa(kvm_tdp_mmu_get_vcpu_root(vcpu, private)->spt);
 }
 
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level,
-				bool shared);
+				tdp_ptep_t sptep, u64 old_spte,
+				u64 new_spte, int level, bool shared);
 
 static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level)
 {
@@ -239,6 +264,14 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
 	bool pfn_changed;
 	struct kvm_memory_slot *slot;
 
+	/*
+	 * TDX doesn't support live migration.  Never mark private page as
+	 * dirty in log-dirty bitmap, since it's not possible for userspace
+	 * hypervisor to live migrate private page anyway.
+	 */
+	if (kvm_is_private_gfn(kvm, gfn))
+		return;
+
 	if (level > PG_LEVEL_4K)
 		return;
 
@@ -246,7 +279,9 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
 
 	if ((!is_writable_pte(old_spte) || pfn_changed) &&
 	    is_writable_pte(new_spte)) {
-		slot = __gfn_to_memslot(__kvm_memslots(kvm, as_id), gfn);
+		/* For memory slot operations, use GFN without aliasing */
+		slot = __gfn_to_memslot(__kvm_memslots(kvm, as_id),
+				gfn & ~kvm_gfn_stolen_mask(kvm));
 		mark_page_dirty_in_slot(kvm, slot, gfn);
 	}
 }
@@ -298,6 +333,7 @@ static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp,
  * handle_removed_tdp_mmu_page - handle a pt removed from the TDP structure
  *
  * @kvm: kvm instance
+ * @parent_sptep: pointer to the parent SPTE which points to the @pt.
  * @pt: the page removed from the paging structure
  * @shared: This operation may not be running under the exclusive use
  *	    of the MMU lock and the operation must synchronize with other
@@ -311,9 +347,11 @@ static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp,
  * this thread will be responsible for ensuring the page is freed. Hence the
  * early rcu_dereferences in the function.
  */
-static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
-					bool shared)
+static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t parent_sptep,
+					tdp_ptep_t pt, bool shared)
 {
+	struct kvm_mmu_page *parent_sp =
+		sptep_to_sp(rcu_dereference(parent_sptep));
 	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(pt));
 	int level = sp->role.level;
 	gfn_t base_gfn = sp->gfn;
@@ -322,6 +360,8 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
 	gfn_t gfn;
 	int i;
 
+	WARN_ON(!is_private_sp(parent_sp) != !is_private_sp(sp));
+
 	trace_kvm_mmu_prepare_zap_page(sp);
 
 	tdp_mmu_unlink_page(kvm, sp, shared);
@@ -340,7 +380,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
 			 * value to the removed SPTE value.
 			 */
 			for (;;) {
-				old_child_spte = xchg(sptep, REMOVED_SPTE);
+				old_child_spte = xchg(sptep, SHADOW_REMOVED_SPTE);
 				if (!is_removed_spte(old_child_spte))
 					break;
 				cpu_relax();
@@ -356,7 +396,8 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
 			 * unreachable.
 			 */
 			old_child_spte = READ_ONCE(*sptep);
-			if (!is_shadow_present_pte(old_child_spte))
+			if (!is_shadow_present_pte(old_child_spte) &&
+					!is_zapped_private_pte(old_child_spte))
 				continue;
 
 			/*
@@ -367,13 +408,35 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
 			 * the two branches consistent and simplifies
 			 * the function.
 			 */
-			WRITE_ONCE(*sptep, REMOVED_SPTE);
+			WRITE_ONCE(*sptep, SHADOW_REMOVED_SPTE);
 		}
-		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
-				    old_child_spte, REMOVED_SPTE, level,
+		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn, sptep,
+				    old_child_spte, SHADOW_REMOVED_SPTE, level,
 				    shared);
 	}
 
+	if (sp->private_sp) {
+
+		/*
+		 * Currently prviate page table (not the leaf page) can only be
+		 * zapped when VM is being destroyed, because currently
+		 * kvm_x86_ops->free_private_sp() can only be called after TD
+		 * has been torn down (after tdx_vm_teardown()).  To make sure
+		 * this code path can only be reached when the whole page table
+		 * is being torn down when TD is being destroyed, zapping
+		 * aliasing only zaps the leaf pages, but not the intermediate
+		 * page tables.
+		 */
+		WARN_ON(!is_private_sp(sp));
+		/*
+		 * The level used in kvm_x86_ops->free_private_sp() doesn't
+		 * matter since PG_LEVEL_4K is always used internally.
+		 */
+		if (!static_call(kvm_x86_free_private_sp)(kvm, sp->gfn,
+				sp->role.level + 1, sp->private_sp))
+			free_page((unsigned long)sp->private_sp);
+	}
+
 	kvm_flush_remote_tlbs_with_address(kvm, gfn,
 					   KVM_PAGES_PER_HPAGE(level + 1));
 
@@ -384,6 +447,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
  * __handle_changed_spte - handle bookkeeping associated with an SPTE change
  * @kvm: kvm instance
  * @as_id: the address space of the paging structure the SPTE was a part of
+ * @sptep: the pointer to the SEPT being modified
  * @gfn: the base GFN that was mapped by the SPTE
  * @old_spte: The value of the SPTE before the change
  * @new_spte: The value of the SPTE after the change
@@ -396,19 +460,23 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
  * This function must be called for all TDP SPTE modifications.
  */
 static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				  u64 old_spte, u64 new_spte, int level,
-				  bool shared)
+				  tdp_ptep_t sptep, u64 old_spte,
+				  u64 new_spte, int level, bool shared)
 {
 	bool was_present = is_shadow_present_pte(old_spte);
 	bool is_present = is_shadow_present_pte(new_spte);
 	bool was_leaf = was_present && is_last_spte(old_spte, level);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+	bool was_zapped_private = is_zapped_private_pte(old_spte);
+	bool is_private = is_private_spte(sptep);
 
 	WARN_ON(level > PT64_ROOT_MAX_LEVEL);
 	WARN_ON(level < PG_LEVEL_4K);
 	WARN_ON(gfn & (KVM_PAGES_PER_HPAGE(level) - 1));
 
+	WARN_ON(was_zapped_private && !is_private);
+
 	/*
 	 * If this warning were to trigger it would indicate that there was a
 	 * missing MMU notifier or a race with some notifier handler.
@@ -437,6 +505,29 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 
 	trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
 
+	/*
+	 * Handle special case of old_spte being temporarily blocked private
+	 * SPTE.  There are two cases: 1) Need to restore the original mapping
+	 * (unblock) when guest accesses the private page; 2) Need to truly
+	 * zap the SPTE because of zapping aliasing in fault handler, or when
+	 * VM is being destroyed.
+	 *
+	 * Do this before handling "!was_present && !is_present" case below,
+	 * because blocked private SPTE is also non-present.
+	 */
+	if (was_zapped_private) {
+		if (is_present)
+			static_call(kvm_x86_unzap_private_spte)(kvm, gfn,
+					level);
+		else
+			static_call(kvm_x86_drop_private_spte)(kvm, gfn,
+					level, spte_to_pfn(old_spte));
+
+		/* Temporarily blocked private SPTE can only be leaf. */
+		WARN_ON(!is_last_spte(old_spte, level));
+		return;
+	}
+
 	/*
 	 * The only times a SPTE should be changed from a non-present to
 	 * non-present state is when an MMIO entry is installed/modified/
@@ -469,20 +560,64 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	    (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
 		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
 
+	/*
+	 * Special handling for the private mapping.  We are either
+	 * setting up new mapping at middle level page table, or leaf,
+	 * or tearing down existing mapping.
+	 */
+	if (is_private) {
+		if (is_present) {
+			/*
+			 * Use different call to either set up middle level
+			 * private page table, or leaf.
+			 */
+			if (is_leaf)
+				static_call(kvm_x86_set_private_spte)(kvm, gfn,
+						level, spte_to_pfn(new_spte));
+			else {
+				kvm_pfn_t pfn = spte_to_pfn(new_spte);
+				struct page *page = pfn_to_page(pfn);
+				struct kvm_mmu_page *sp =
+					(struct kvm_mmu_page *)page_private(page);
+
+				WARN_ON(!sp->private_sp);
+				WARN_ON((sp->role.level + 1) != level);
+
+				static_call(kvm_x86_link_private_sp)(kvm, gfn,
+						level, sp->private_sp);
+			}
+		} else if (was_leaf) {
+			/*
+			 * Zap private leaf SPTE.  Zapping private table is done
+			 * below in handle_removed_tdp_mmu_page().
+			 */
+			static_call(kvm_x86_zap_private_spte)(kvm, gfn, level);
+			/*
+			 * TDX requires TLB tracking before dropping private
+			 * page.  Do it here, although it is also done later.
+			 */
+			kvm_flush_remote_tlbs_with_address(kvm, gfn,
+					KVM_PAGES_PER_HPAGE(level));
+
+			static_call(kvm_x86_drop_private_spte)(kvm, gfn, level,
+					spte_to_pfn(old_spte));
+		}
+	}
+
 	/*
 	 * Recursively handle child PTs if the change removed a subtree from
 	 * the paging structure.
 	 */
 	if (was_present && !was_leaf && (pfn_changed || !is_present))
-		handle_removed_tdp_mmu_page(kvm,
+		handle_removed_tdp_mmu_page(kvm, sptep,
 				spte_to_child_pt(old_spte, level), shared);
 }
 
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level,
-				bool shared)
+				tdp_ptep_t sptep, u64 old_spte, u64 new_spte,
+				int level, bool shared)
 {
-	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
+	__handle_changed_spte(kvm, as_id, gfn, sptep, old_spte, new_spte, level,
 			      shared);
 	handle_changed_spte_acc_track(old_spte, new_spte, level);
 	handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
@@ -521,8 +656,8 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 		      new_spte) != iter->old_spte)
 		return false;
 
-	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
-			      new_spte, iter->level, true);
+	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->sptep,
+			      iter->old_spte, new_spte, iter->level, true);
 	handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
 
 	return true;
@@ -537,7 +672,7 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 	 * immediately installing a present entry in its place
 	 * before the TLBs are flushed.
 	 */
-	if (!tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE))
+	if (!tdp_mmu_set_spte_atomic(kvm, iter, SHADOW_REMOVED_SPTE))
 		return false;
 
 	kvm_flush_remote_tlbs_with_address(kvm, iter->gfn,
@@ -550,8 +685,16 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 	 * special removed SPTE value. No bookkeeping is needed
 	 * here since the SPTE is going from non-present
 	 * to non-present.
+	 *
+	 * Set non-present value to shadow_init_value, rather than 0.
+	 * It is because when TDX is enabled, TDX module always
+	 * enables "EPT-violation #VE", so KVM needs to set
+	 * "suppress #VE" bit in EPT table entries, in order to get
+	 * real EPT violation, rather than TDVMCALL.  KVM sets
+	 * shadow_init_value (which sets "suppress #VE" bit) so it
+	 * can be set when EPT table entries are zapped.
 	 */
-	WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
+	WRITE_ONCE(*rcu_dereference(iter->sptep), shadow_init_value);
 
 	return true;
 }
@@ -590,8 +733,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 
 	WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
 
-	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
-			      new_spte, iter->level, false);
+	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->sptep,
+			      iter->old_spte, new_spte, iter->level, false);
 	if (record_acc_track)
 		handle_changed_spte_acc_track(iter->old_spte, new_spte,
 					      iter->level);
@@ -621,19 +764,55 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 	__tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
 }
 
+static inline bool tdp_mmu_zap_spte_flush_excl_or_shared(struct kvm *kvm,
+							 struct tdp_iter *iter,
+							 bool shared)
+{
+	if (!shared) {
+		/* See tdp_mmu_zap_spte_atomic() for shadow_init_value */
+		tdp_mmu_set_spte(kvm, iter, shadow_init_value);
+		kvm_flush_remote_tlbs_with_address(kvm, iter->gfn,
+				KVM_PAGES_PER_HPAGE(iter->level));
+		return true;
+	}
+
+	/* tdp_mmu_zap_spte_atomic() does TLB flush internally */
+	return tdp_mmu_zap_spte_atomic(kvm, iter);
+}
+
+static inline bool tdp_mmu_set_spte_excl_or_shared(struct kvm *kvm,
+						   struct tdp_iter *iter,
+						   u64 new_spte,
+						   bool shared)
+{
+	if (!shared) {
+		tdp_mmu_set_spte(kvm, iter, new_spte);
+		return true;
+	}
+
+	return tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
+}
+
 #define tdp_root_for_each_pte(_iter, _root, _start, _end) \
 	for_each_tdp_pte(_iter, _root->spt, _root->role.level, _start, _end)
 
+/*
+ * Note temporarily blocked private SPTE is consider as valid leaf,
+ * although !is_shadow_present_pte() returns true for it, since the
+ * target page (which the mapping maps to ) is still there.
+ */
 #define tdp_root_for_each_leaf_pte(_iter, _root, _start, _end)	\
 	tdp_root_for_each_pte(_iter, _root, _start, _end)		\
-		if (!is_shadow_present_pte(_iter.old_spte) ||		\
+		if ((!is_shadow_present_pte(_iter.old_spte) &&		\
+			!is_zapped_private_pte(_iter.old_spte)) ||	\
 		    !is_last_spte(_iter.old_spte, _iter.level))		\
 			continue;					\
 		else
 
-#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end)		\
-	for_each_tdp_pte(_iter, __va(_mmu->root_hpa),		\
-			 _mmu->shadow_root_level, _start, _end)
+#define tdp_mmu_for_each_pte(_iter, _mmu, _private, _start, _end)	\
+	for_each_tdp_pte(_iter,						\
+		__va((_private) ? _mmu->private_root_hpa : _mmu->root_hpa),	\
+		 _mmu->shadow_root_level, _start, _end)
 
 /*
  * Yield if the MMU lock is contended or this thread needs to return control
@@ -719,6 +898,15 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	 */
 	end = min(end, max_gfn_host);
 
+	/*
+	 * Extend [start, end) to include GFN shared bit when TDX is enabled,
+	 * and for shared mapping range.
+	 */
+	if (kvm->arch.gfn_shared_mask && !is_private_sp(root)) {
+		start |= kvm->arch.gfn_shared_mask;
+		end |= kvm->arch.gfn_shared_mask;
+	}
+
 	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
 
 	rcu_read_lock();
@@ -732,7 +920,12 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			continue;
 		}
 
-		if (!is_shadow_present_pte(iter.old_spte))
+		/*
+		 * Skip non-present SPTE, with exception of temporarily
+		 * blocked private SPTE, which also needs to be zapped.
+		 */
+		if (!is_shadow_present_pte(iter.old_spte) &&
+				!is_zapped_private_pte(iter.old_spte))
 			continue;
 
 		/*
@@ -747,7 +940,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			continue;
 
 		if (!shared) {
-			tdp_mmu_set_spte(kvm, &iter, 0);
+			/* see comments in tdp_mmu_zap_spte_atomic() */
+			tdp_mmu_set_spte(kvm, &iter, shadow_init_value);
 			flush = true;
 		} else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
 			/*
@@ -770,24 +964,30 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
  * MMU lock.
  */
 bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
-				 gfn_t end, bool can_yield, bool flush)
+				 gfn_t end, bool can_yield, bool flush,
+				 bool zap_private)
 {
 	struct kvm_mmu_page *root;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
+	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false) {
+		/* Skip private page table if not requested */
+		if (!zap_private && is_private_sp(root))
+			continue;
 		flush = zap_gfn_range(kvm, root, start, end, can_yield, flush,
 				      false);
+	}
 
 	return flush;
 }
 
-void kvm_tdp_mmu_zap_all(struct kvm *kvm)
+void kvm_tdp_mmu_zap_all(struct kvm *kvm, bool zap_private)
 {
 	bool flush = false;
 	int i;
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
-		flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush);
+		flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush,
+				zap_private);
 
 	if (flush)
 		kvm_flush_remote_tlbs(kvm);
@@ -838,6 +1038,13 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
 	while (root) {
 		next_root = next_invalidated_root(kvm, root);
 
+		/*
+		 * Private table is only torn down when VM is destroyed.
+		 * It is a bug to zap private table here.
+		 */
+		if (WARN_ON(is_private_sp(root)))
+			goto out;
+
 		rcu_read_unlock();
 
 		flush = zap_gfn_range(kvm, root, 0, -1ull, true, flush, true);
@@ -852,7 +1059,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
 
 		rcu_read_lock();
 	}
-
+out:
 	rcu_read_unlock();
 
 	if (flush)
@@ -884,9 +1091,16 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
 	struct kvm_mmu_page *root;
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
-	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link)
+	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
+		/*
+		 * Skip private root since private page table
+		 * is only torn down when VM is destroyed.
+		 */
+		if (is_private_sp(root))
+			continue;
 		if (refcount_inc_not_zero(&root->tdp_mmu_root_count))
 			root->role.invalid = true;
+	}
 }
 
 /*
@@ -895,24 +1109,36 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
  */
 static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 					  struct kvm_page_fault *fault,
-					  struct tdp_iter *iter)
+					  struct tdp_iter *iter,
+					  bool shared)
 {
 	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
 	u64 new_spte;
 	int ret = RET_PF_FIXED;
 	bool wrprot = false;
+	unsigned long pte_access = ACC_ALL;
 
 	WARN_ON(sp->role.level != fault->goal_level);
+
+	/* TDX shared GPAs are no executable, enforce this for the SDV. */
+	if (vcpu->kvm->arch.gfn_shared_mask &&
+			!kvm_is_private_gfn(vcpu->kvm, iter->gfn))
+		pte_access &= ~ACC_EXEC_MASK;
+
 	if (unlikely(!fault->slot))
-		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
+		new_spte = make_mmio_spte(vcpu,
+				tdp_iter_gfn_unalias(vcpu->kvm, iter),
+				pte_access);
 	else
-		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
-					 fault->pfn, iter->old_spte, fault->prefetch, true,
-					 fault->map_writable, &new_spte);
+		wrprot = make_spte(vcpu, sp, fault->slot, pte_access,
+				tdp_iter_gfn_unalias(vcpu->kvm, iter),
+				fault->pfn, iter->old_spte, fault->prefetch,
+				true, fault->map_writable, &new_spte);
 
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
-	else if (!tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
+	else if (!tdp_mmu_set_spte_excl_or_shared(vcpu->kvm, iter,
+				new_spte, shared))
 		return RET_PF_RETRY;
 
 	/*
@@ -945,6 +1171,53 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	return ret;
 }
 
+static void tdp_mmu_zap_alias_spte(struct kvm_vcpu *vcpu, gfn_t gfn_alias,
+		bool shared)
+{
+	struct tdp_iter iter;
+	struct kvm_mmu *mmu = vcpu->arch.mmu;
+	bool is_private = kvm_is_private_gfn(vcpu->kvm, gfn_alias);
+
+	tdp_mmu_for_each_pte(iter, mmu, is_private, gfn_alias, gfn_alias + 1) {
+retry:
+		/*
+		 * Skip non-present SPTEs, except zapped (temporarily
+		 * blocked) private SPTE, because it needs to be truly
+		 * zapped too (note is_shadow_present_pte() also returns
+		 * false for zapped private SPTE).
+		 */
+		if (!is_shadow_present_pte(iter.old_spte) &&
+				!is_zapped_private_pte(iter.old_spte))
+			continue;
+
+		/*
+		 * Only zap leaf for aliasing mapping, because there's
+		 * no need to zap page table too.  And currently
+		 * tdx_sept_free_private_sp() cannot be called when TD
+		 * is still running, so cannot really zap private page
+		 * table now.
+		 */
+		if (!is_last_spte(iter.old_spte, iter.level))
+			continue;
+
+		/*
+		 * Large page is not supported by TDP MMU for TDX now.
+		 * TODO: large page support.
+		 */
+		WARN_ON(iter.level != PG_LEVEL_4K);
+
+		if (!tdp_mmu_zap_spte_flush_excl_or_shared(vcpu->kvm, &iter,
+					shared)) {
+			/*
+			 * The iter must explicitly re-read the SPTE because
+			 * the atomic cmpxchg failed.
+			 */
+			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+			goto retry;
+		}
+	}
+}
+
 /*
  * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
  * page tables and SPTEs to translate the faulting guest physical address.
@@ -956,6 +1229,9 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct kvm_mmu_page *sp;
 	u64 *child_pt;
 	u64 new_spte;
+	gfn_t raw_gfn;
+	bool is_private;
+	bool shared;
 	int ret;
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -964,7 +1240,26 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 	rcu_read_lock();
 
-	tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
+	raw_gfn = fault->addr >> PAGE_SHIFT;
+	is_private = kvm_is_private_gfn(vcpu->kvm, raw_gfn);
+
+	if (is_error_noslot_pfn(fault->pfn) ||
+			kvm_is_reserved_pfn(fault->pfn)) {
+		if (is_private) {
+			rcu_read_unlock();
+			return -EFAULT;
+		}
+	} else if (vcpu->kvm->arch.gfn_shared_mask) {
+		gfn_t gfn_alias = raw_gfn ^ vcpu->kvm->arch.gfn_shared_mask;
+		/*
+		 * At given time, for one GPA, only one mapping can be valid
+		 * (either private, or shared).  Zap aliasing mapping to make
+		 * sure of it.  Also see comment in direct_page_fault().
+		 */
+		tdp_mmu_zap_alias_spte(vcpu, gfn_alias, shared);
+	}
+
+	tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
 		if (fault->nx_huge_page_workaround_enabled)
 			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
 
@@ -978,8 +1273,15 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 */
 		if (is_shadow_present_pte(iter.old_spte) &&
 		    is_large_pte(iter.old_spte)) {
-			if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
+			if (!tdp_mmu_zap_spte_flush_excl_or_shared(vcpu->kvm,
+						&iter, shared))
 				break;
+			/*
+			 * TODO: large page support.
+			 * Doesn't support large page for TDX now
+			 */
+			WARN_ON(is_private_spte(&iter.old_spte));
+
 
 			/*
 			 * The iter must explicitly re-read the spte here
@@ -990,6 +1292,15 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		}
 
 		if (!is_shadow_present_pte(iter.old_spte)) {
+
+			/*
+			 * TODO: large page support.
+			 * Not expecting blocked private SPTE points to a
+			 * large page now.
+			 */
+			WARN_ON(is_zapped_private_pte(iter.old_spte) &&
+					is_large_pte(iter.old_spte));
+
 			/*
 			 * If SPTE has been frozen by another thread, just
 			 * give up and retry, avoiding unnecessary page table
@@ -998,13 +1309,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			if (is_removed_spte(iter.old_spte))
 				break;
 
-			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level - 1);
+			sp = alloc_tdp_mmu_page(vcpu,
+					tdp_iter_gfn_unalias(vcpu->kvm, &iter),
+					iter.level - 1, is_private);
 			child_pt = sp->spt;
 
 			new_spte = make_nonleaf_spte(child_pt,
 						     !shadow_accessed_mask);
 
-			if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
+			if (tdp_mmu_set_spte_excl_or_shared(vcpu->kvm, &iter,
+						new_spte, shared)) {
 				tdp_mmu_link_page(vcpu->kvm, sp,
 						  fault->huge_page_disallowed &&
 						  fault->req_level >= iter.level);
@@ -1022,20 +1336,27 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		return RET_PF_RETRY;
 	}
 
-	ret = tdp_mmu_map_handle_target_level(vcpu, fault, &iter);
+	ret = tdp_mmu_map_handle_target_level(vcpu, fault, &iter, shared);
 	rcu_read_unlock();
 
 	return ret;
 }
 
+static bool block_private_gfn_range(struct kvm *kvm,
+				    struct kvm_gfn_range *range);
+
 bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
 				 bool flush)
 {
 	struct kvm_mmu_page *root;
 
-	for_each_tdp_mmu_root(kvm, root, range->slot->as_id)
-		flush |= zap_gfn_range(kvm, root, range->start, range->end,
+	for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
+		if (is_private_sp(root))
+			flush |= block_private_gfn_range(kvm, range);
+		else
+			flush |= zap_gfn_range(kvm, root, range->start, range->end,
 				       range->may_block, flush, false);
+	}
 
 	return flush;
 }
@@ -1058,6 +1379,15 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
 	 * into this helper allow blocking; it'd be dead, wasteful code.
 	 */
 	for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
+		/*
+		 * For TDX shared mapping, set GFN shared bit to the range,
+		 * so the handler() doesn't need to set it, to avoid duplicated
+		 * code in multiple handler()s.
+		 */
+		if (kvm->arch.gfn_shared_mask && !is_private_sp(root)) {
+			range->start |= kvm->arch.gfn_shared_mask;
+			range->end |= kvm->arch.gfn_shared_mask;
+		}
 		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
 			ret |= handler(kvm, &iter, range);
 	}
@@ -1067,6 +1397,31 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
 	return ret;
 }
 
+static bool block_private_spte(struct kvm *kvm, struct tdp_iter *iter,
+				    struct kvm_gfn_range *range)
+{
+	u64 new_spte;
+
+	if (WARN_ON(!is_private_spte(iter->sptep)))
+		return false;
+
+	new_spte = SPTE_PRIVATE_ZAPPED |
+		(spte_to_pfn(iter->old_spte) << PAGE_SHIFT) |
+		is_large_pte(iter->old_spte) ? PT_PAGE_SIZE_MASK : 0;
+
+	WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
+
+	static_call(kvm_x86_zap_private_spte)(kvm, iter->gfn, iter->level);
+
+	return true;
+}
+
+static bool block_private_gfn_range(struct kvm *kvm,
+				    struct kvm_gfn_range *range)
+{
+	return kvm_tdp_mmu_handle_gfn(kvm, range, block_private_spte);
+}
+
 /*
  * Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero
  * if any of the GFNs in the range have been accessed.
@@ -1080,6 +1435,15 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 	if (!is_accessed_spte(iter->old_spte))
 		return false;
 
+	/*
+	 * First TDX generation doesn't support clearing A bit for private
+	 * mapping, since there's no secure EPT API to support it.  However
+	 * it's a legitimate request for TDX guest, so just return w/o a
+	 * WARN().
+	 */
+	if (is_private_spte(iter->sptep))
+		return false;
+
 	new_spte = iter->old_spte;
 
 	if (spte_ad_enabled(new_spte)) {
@@ -1124,6 +1488,13 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
 	/* Huge pages aren't expected to be modified without first being zapped. */
 	WARN_ON(pte_huge(range->pte) || range->start + 1 != range->end);
 
+	/*
+	 * .change_pte() callback should not happen for private page, because
+	 * for now TDX private pages are pinned during VM's life time.
+	 */
+	if (WARN_ON(is_private_spte(iter->sptep)))
+		return false;
+
 	if (iter->level != PG_LEVEL_4K ||
 	    !is_shadow_present_pte(iter->old_spte))
 		return false;
@@ -1175,6 +1546,14 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	u64 new_spte;
 	bool spte_set = false;
 
+	/*
+	 * First TDX generation doesn't support write protecting private
+	 * mappings, since there's no secure EPT API to support it.  It
+	 * is a bug to reach here for TDX guest.
+	 */
+	if (WARN_ON(is_private_sp(root)))
+		return spte_set;
+
 	rcu_read_lock();
 
 	BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
@@ -1241,6 +1620,14 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	u64 new_spte;
 	bool spte_set = false;
 
+	/*
+	 * First TDX generation doesn't support clearing dirty bit,
+	 * since there's no secure EPT API to support it.  It is a
+	 * bug to reach here for TDX guest.
+	 */
+	if (WARN_ON(is_private_sp(root)))
+		return spte_set;
+
 	rcu_read_lock();
 
 	tdp_root_for_each_leaf_pte(iter, root, start, end) {
@@ -1288,6 +1675,7 @@ bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm,
 	struct kvm_mmu_page *root;
 	bool spte_set = false;
 
+	/* See comment in caller */
 	lockdep_assert_held_read(&kvm->mmu_lock);
 
 	for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
@@ -1310,6 +1698,14 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 	struct tdp_iter iter;
 	u64 new_spte;
 
+	/*
+	 * First TDX generation doesn't support clearing dirty bit,
+	 * since there's no secure EPT API to support it.  It is a
+	 * bug to reach here for TDX guest.
+	 */
+	if (WARN_ON(is_private_sp(root)))
+		return;
+
 	rcu_read_lock();
 
 	tdp_root_for_each_leaf_pte(iter, root, gfn + __ffs(mask),
@@ -1318,10 +1714,10 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 			break;
 
 		if (iter.level > PG_LEVEL_4K ||
-		    !(mask & (1UL << (iter.gfn - gfn))))
+		    !(mask & (1UL << (tdp_iter_gfn_unalias(kvm, &iter) - gfn))))
 			continue;
 
-		mask &= ~(1UL << (iter.gfn - gfn));
+		mask &= ~(1UL << (tdp_iter_gfn_unalias(kvm, &iter) - gfn));
 
 		if (wrprot || spte_ad_need_write_protect(iter.old_spte)) {
 			if (is_writable_pte(iter.old_spte))
@@ -1374,6 +1770,14 @@ static bool zap_collapsible_spte_range(struct kvm *kvm,
 	struct tdp_iter iter;
 	kvm_pfn_t pfn;
 
+	/*
+	 * This should only be reachable in case of log-dirty, which TD
+	 * private mapping doesn't support so far.  Give a WARN() if it
+	 * hits private mapping.
+	 */
+	if (WARN_ON(is_private_sp(root)))
+		return flush;
+
 	rcu_read_lock();
 
 	tdp_root_for_each_pte(iter, root, start, end) {
@@ -1389,8 +1793,9 @@ static bool zap_collapsible_spte_range(struct kvm *kvm,
 
 		pfn = spte_to_pfn(iter.old_spte);
 		if (kvm_is_reserved_pfn(pfn) ||
-		    iter.level >= kvm_mmu_max_mapping_level(kvm, slot, iter.gfn,
-							    pfn, PG_LEVEL_NUM))
+		    iter.level >= kvm_mmu_max_mapping_level(kvm, slot,
+			    tdp_iter_gfn_unalias(kvm, &iter), pfn,
+			    PG_LEVEL_NUM))
 			continue;
 
 		if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
@@ -1419,6 +1824,7 @@ bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 {
 	struct kvm_mmu_page *root;
 
+	/* See comment in caller */
 	lockdep_assert_held_read(&kvm->mmu_lock);
 
 	for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
@@ -1441,6 +1847,14 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
 
+	/*
+	 * First TDX generation doesn't support write protecting private
+	 * mappings, since there's no secure EPT API to support it.  It
+	 * is a bug to reach here for TDX guest.
+	 */
+	if (WARN_ON(is_private_sp(root)))
+		return spte_set;
+
 	rcu_read_lock();
 
 	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
@@ -1496,10 +1910,14 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	gfn_t gfn = addr >> PAGE_SHIFT;
 	int leaf = -1;
+	bool is_private = kvm_is_private_gfn(vcpu->kvm, gfn);
 
 	*root_level = vcpu->arch.mmu->shadow_root_level;
 
-	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+	if (WARN_ON(is_private))
+		return leaf;
+
+	tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
 		leaf = iter.level;
 		sptes[leaf] = iter.old_spte;
 	}
@@ -1525,12 +1943,16 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	gfn_t gfn = addr >> PAGE_SHIFT;
 	tdp_ptep_t sptep = NULL;
+	bool is_private = kvm_is_private_gfn(vcpu->kvm, gfn);
+
+	if (is_private)
+		goto out;
 
-	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+	tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
 		*spte = iter.old_spte;
 		sptep = iter.sptep;
 	}
-
+out:
 	/*
 	 * Perform the rcu_dereference to get the raw spte pointer value since
 	 * we are passing it up to fast_page_fault, which is shared with the
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 476b133544dd..0a975d30909e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -5,7 +5,7 @@
 
 #include <linux/kvm_host.h>
 
-hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
+hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private);
 
 __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm *kvm,
 						     struct kvm_mmu_page *root)
@@ -20,11 +20,14 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			  bool shared);
 
 bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
-				 gfn_t end, bool can_yield, bool flush);
+				 gfn_t end, bool can_yield, bool flush,
+				 bool zap_private);
 static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
-					     gfn_t start, gfn_t end, bool flush)
+					     gfn_t start, gfn_t end, bool flush,
+					     bool zap_private)
 {
-	return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
+	return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush,
+			zap_private);
 }
 static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
@@ -41,10 +44,10 @@ static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 	 */
 	lockdep_assert_held_write(&kvm->mmu_lock);
 	return __kvm_tdp_mmu_zap_gfn_range(kvm, kvm_mmu_page_as_id(sp),
-					   sp->gfn, end, false, false);
+					   sp->gfn, end, false, false, false);
 }
 
-void kvm_tdp_mmu_zap_all(struct kvm *kvm);
+void kvm_tdp_mmu_zap_all(struct kvm *kvm, bool zap_private);
 void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
 void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 58/59] KVM: TDX: exit to user space on GET_QUOTE, SETUP_EVENT_NOTIFY_INTERRUPT
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (56 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 57/59] KVM, x86/mmu: Support TDX private mapping for TDP MMU isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  0:20 ` [RFC PATCH v3 59/59] Documentation/virtual/kvm: Add Trust Domain Extensions(TDX) isaku.yamahata
                   ` (2 subsequent siblings)
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata

From: Isaku Yamahata <isaku.yamahata@intel.com>

GET_QUOTE, SETUP_EVENT_NOTIFY_INTERRUPT TDG.VP.VMCALL requires user space
to handle them on behalf of kvm kernel module.

Introduce new kvm exit, KVM_EXIT_TDX, and when GET_QUOTE and
SETUP_EVENT_NOTIFY_INTERRUPT is called by TD guest, transfer the execution
to user space as KVM exit to the user space.  TDG_VP_VMCALL_INVALID_OPERAND
is set as default return value to avoid random value. User space should
update R10 if necessary.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c   | 113 +++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h |  57 ++++++++++++++++++++
 2 files changed, 170 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 53fc01f3bab1..a87db46477cc 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -152,6 +152,18 @@ BUILD_TDVMCALL_ACCESSORS(p2, r13);
 BUILD_TDVMCALL_ACCESSORS(p3, r14);
 BUILD_TDVMCALL_ACCESSORS(p4, r15);
 
+#define TDX_VMCALL_REG_MASK_RBX	BIT_ULL(2)
+#define TDX_VMCALL_REG_MASK_RDX	BIT_ULL(3)
+#define TDX_VMCALL_REG_MASK_RBP	BIT_ULL(5)
+#define TDX_VMCALL_REG_MASK_RSI	BIT_ULL(6)
+#define TDX_VMCALL_REG_MASK_RDI	BIT_ULL(7)
+#define TDX_VMCALL_REG_MASK_R8	BIT_ULL(8)
+#define TDX_VMCALL_REG_MASK_R9	BIT_ULL(9)
+#define TDX_VMCALL_REG_MASK_R12	BIT_ULL(12)
+#define TDX_VMCALL_REG_MASK_R13	BIT_ULL(13)
+#define TDX_VMCALL_REG_MASK_R14	BIT_ULL(14)
+#define TDX_VMCALL_REG_MASK_R15	BIT_ULL(15)
+
 static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu)
 {
 	return kvm_r10_read(vcpu);
@@ -1122,6 +1134,91 @@ static int tdx_map_gpa(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int tdx_complete_vp_vmcall(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx_vmcall *tdx_vmcall = &vcpu->run->tdx.u.vmcall;
+	__u64 reg_mask;
+
+	tdvmcall_set_return_code(vcpu, tdx_vmcall->status_code);
+	tdvmcall_set_return_val(vcpu, tdx_vmcall->out_r11);
+
+	reg_mask = kvm_rcx_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R12)
+		kvm_r12_write(vcpu, tdx_vmcall->out_r12);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R13)
+		kvm_r13_write(vcpu, tdx_vmcall->out_r13);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R14)
+		kvm_r14_write(vcpu, tdx_vmcall->out_r14);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R15)
+		kvm_r15_write(vcpu, tdx_vmcall->out_r15);
+	if (reg_mask & TDX_VMCALL_REG_MASK_RBX)
+		kvm_rbx_write(vcpu, tdx_vmcall->out_rbx);
+	if (reg_mask & TDX_VMCALL_REG_MASK_RDI)
+		kvm_rdi_write(vcpu, tdx_vmcall->out_rdi);
+	if (reg_mask & TDX_VMCALL_REG_MASK_RSI)
+		kvm_rsi_write(vcpu, tdx_vmcall->out_rsi);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R8)
+		kvm_r8_write(vcpu, tdx_vmcall->out_r8);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R9)
+		kvm_r9_write(vcpu, tdx_vmcall->out_r9);
+	if (reg_mask & TDX_VMCALL_REG_MASK_RDX)
+		kvm_rdx_write(vcpu, tdx_vmcall->out_rdx);
+
+	return 1;
+}
+
+static int tdx_vp_vmcall_to_user(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx_vmcall *tdx_vmcall = &vcpu->run->tdx.u.vmcall;
+	__u64 reg_mask;
+
+	vcpu->arch.complete_userspace_io = tdx_complete_vp_vmcall;
+	memset(tdx_vmcall, 0, sizeof(*tdx_vmcall));
+
+	vcpu->run->exit_reason = KVM_EXIT_TDX;
+	vcpu->run->tdx.type = KVM_EXIT_TDX_VMCALL;
+	tdx_vmcall->type = tdvmcall_exit_type(vcpu);
+	tdx_vmcall->subfunction = tdvmcall_exit_reason(vcpu);
+
+	reg_mask = kvm_rcx_read(vcpu);
+	tdx_vmcall->reg_mask = reg_mask;
+	if (reg_mask & TDX_VMCALL_REG_MASK_R12)
+		tdx_vmcall->in_r12 = kvm_r12_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R13)
+		tdx_vmcall->in_r13 = kvm_r13_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R14)
+		tdx_vmcall->in_r14 = kvm_r14_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R15)
+		tdx_vmcall->in_r15 = kvm_r15_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_RBX)
+		tdx_vmcall->in_rbx = kvm_rbx_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_RDI)
+		tdx_vmcall->in_rdi = kvm_rdi_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_RSI)
+		tdx_vmcall->in_rsi = kvm_rsi_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R8)
+		tdx_vmcall->in_r8 = kvm_r8_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R9)
+		tdx_vmcall->in_r9 = kvm_r9_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_RDX)
+		tdx_vmcall->in_rdx = kvm_rdx_read(vcpu);
+
+	/* notify userspace to handle the request */
+	return 0;
+}
+
+static int tdx_get_quote(struct kvm_vcpu *vcpu)
+{
+	gpa_t gpa = tdvmcall_p1_read(vcpu);
+
+	if (!IS_ALIGNED(gpa, PAGE_SIZE)) {
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+		return 1;
+	}
+
+	return tdx_vp_vmcall_to_user(vcpu);
+}
+
 static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
 {
 	vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
@@ -1130,6 +1227,18 @@ static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static int tdx_setup_event_notify_interrupt(struct kvm_vcpu *vcpu)
+{
+	u64 vector = tdvmcall_p1_read(vcpu);
+
+	if (!(vector >= 32 && vector <= 255)) {
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+		return 1;
+	}
+
+	return tdx_vp_vmcall_to_user(vcpu);
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -1158,8 +1267,12 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 		return tdx_emulate_mmio(vcpu);
 	case TDG_VP_VMCALL_MAP_GPA:
 		return tdx_map_gpa(vcpu);
+	case TDG_VP_VMCALL_GET_QUOTE:
+		return tdx_get_quote(vcpu);
 	case TDG_VP_VMCALL_REPORT_FATAL_ERROR:
 		return tdx_report_fatal_error(vcpu);
+	case TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT:
+		return tdx_setup_event_notify_interrupt(vcpu);
 	default:
 		break;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index bb49e095867e..6d036b3ccd25 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -231,6 +231,60 @@ struct kvm_xen_exit {
 	} u;
 };
 
+struct kvm_tdx_exit {
+#define KVM_EXIT_TDX_VMCALL	1
+	__u32 type;
+	__u32 pad;
+
+	union {
+		struct kvm_tdx_vmcall {
+			/*
+			 * Guest-Host-Communication Interface for TDX spec
+			 * defines the ABI for TDG.VP.VMCALL.
+			 */
+
+			/* Input parameters: guest -> VMM */
+			__u64 type;		/* r10 */
+			__u64 subfunction;	/* r11 */
+			__u64 reg_mask;		/* rcx */
+			/*
+			 * Subfunction specific.
+			 * Registers are used in this order to pass input
+			 * arguments.  r12=arg0, r13=arg1, etc.
+			 */
+			__u64 in_r12;
+			__u64 in_r13;
+			__u64 in_r14;
+			__u64 in_r15;
+			__u64 in_rbx;
+			__u64 in_rdi;
+			__u64 in_rsi;
+			__u64 in_r8;
+			__u64 in_r9;
+			__u64 in_rdx;
+
+			/* Output parameters: VMM -> guest */
+			__u64 status_code;	/* r10 */
+			/*
+			 * Subfunction specific.
+			 * Registers are used in this order to output return
+			 * values.  r11=ret0, r12=ret1, etc.
+			 */
+			__u64 out_r11;
+			__u64 out_r12;
+			__u64 out_r13;
+			__u64 out_r14;
+			__u64 out_r15;
+			__u64 out_rbx;
+			__u64 out_rdi;
+			__u64 out_rsi;
+			__u64 out_r8;
+			__u64 out_r9;
+			__u64 out_rdx;
+		} vmcall;
+	} u;
+};
+
 #define KVM_S390_GET_SKEYS_NONE   1
 #define KVM_S390_SKEYS_MAX        1048576
 
@@ -270,6 +324,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_X86_BUS_LOCK     33
 #define KVM_EXIT_XEN              34
 #define KVM_EXIT_RISCV_SBI        35
+#define KVM_EXIT_TDX              50	/* dump number to avoid conflict. */
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -487,6 +542,8 @@ struct kvm_run {
 			unsigned long args[6];
 			unsigned long ret[2];
 		} riscv_sbi;
+		/* KVM_EXIT_TDX_VMCALL */
+		struct kvm_tdx_exit tdx;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [RFC PATCH v3 59/59] Documentation/virtual/kvm: Add Trust Domain Extensions(TDX)
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (57 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 58/59] KVM: TDX: exit to user space on GET_QUOTE, SETUP_EVENT_NOTIFY_INTERRUPT isaku.yamahata
@ 2021-11-25  0:20 ` isaku.yamahata
  2021-11-25  2:12 ` [RFC PATCH v3 00/59] KVM: X86: TDX support Xiaoyao Li
  2021-11-30 18:51 ` Sean Christopherson
  60 siblings, 0 replies; 123+ messages in thread
From: isaku.yamahata @ 2021-11-25  0:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add a documentation to Intel Trusted Docmain Extensions(TDX) support.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/api.rst       |   9 +-
 Documentation/virt/kvm/intel-tdx.rst | 359 +++++++++++++++++++++++++++
 2 files changed, 367 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/virt/kvm/intel-tdx.rst

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index aeeb071c7688..72341707d98f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1366,6 +1366,9 @@ It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl.
 The KVM_SET_MEMORY_REGION does not allow fine grained control over memory
 allocation and is deprecated.
 
+For TDX guest, deleting/moving memory region loses guest memory contents.
+Read only region isn't supported.  Only as-id 0 is supported.
+
 
 4.36 KVM_SET_TSS_ADDR
 ---------------------
@@ -4489,7 +4492,7 @@ H_GET_CPU_CHARACTERISTICS hypercall.
 
 :Capability: basic
 :Architectures: x86
-:Type: vm
+:Type: vm ioctl, vcpu ioctl
 :Parameters: an opaque platform specific structure (in/out)
 :Returns: 0 on success; -1 on error
 
@@ -4501,6 +4504,10 @@ Currently, this ioctl is used for issuing Secure Encrypted Virtualization
 (SEV) commands on AMD Processors. The SEV commands are defined in
 Documentation/virt/kvm/amd-memory-encryption.rst.
 
+Currently, this ioctl is used for issuing Trusted Domain Extensions
+(TDX) commands on Intel Processors. The TDX commands are defined in
+Documentation/virt/kvm/intel-tdx.rst.
+
 4.111 KVM_MEMORY_ENCRYPT_REG_REGION
 -----------------------------------
 
diff --git a/Documentation/virt/kvm/intel-tdx.rst b/Documentation/virt/kvm/intel-tdx.rst
new file mode 100644
index 000000000000..2b4d6cd852d4
--- /dev/null
+++ b/Documentation/virt/kvm/intel-tdx.rst
@@ -0,0 +1,359 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Intel Trust Dodmain Extensions(TDX)
+===================================
+
+Overview
+========
+TDX stands for Trust Domain Extensions which isolates VMs from
+the virtual-machine manager (VMM)/hypervisor and any other software on
+the platform. [1]
+For details, the specifications, [2], [3], [4], [5], [6], [7], are
+available.
+
+
+API description
+===============
+
+KVM_MEMORY_ENCRYPT_OP
+---------------------
+:Type: system ioctl, vm ioctl, vcpu ioctl
+
+For TDX operations, KVM_MEMORY_ENCRYPT_OP is re-purposed to be generic
+ioctl with TDX specific sub ioctl command.
+
+::
+
+  /* Trust Domain eXtension sub-ioctl() commands. */
+  enum kvm_tdx_cmd_id {
+          KVM_TDX_CAPABILITIES = 0,
+          KVM_TDX_INIT_VM,
+          KVM_TDX_INIT_VCPU,
+          KVM_TDX_INIT_MEM_REGION,
+          KVM_TDX_FINALIZE_VM,
+
+          KVM_TDX_CMD_NR_MAX,
+  };
+
+  struct kvm_tdx_cmd {
+          __u32 id;             /* tdx_cmd_id */
+          __u32 metadata;       /* sub comamnd specific */
+          __u64 data;           /* sub command specific */
+  };
+
+
+KVM_TDX_CAPABILITIES
+--------------------
+:Type: system ioctl
+
+subset of TDSYSINFO_STRCUCT retrieved by TDH.SYS.INFO TDX SEAM call will be
+returned. which describes about Intel TDX module.
+
+- id: KVM_TDX_CAPABILITIES
+- metadata: must be 0
+- data: pointer to struct kvm_tdx_capabilities
+
+::
+
+  struct kvm_tdx_cpuid_config {
+          __u32 leaf;
+          __u32 sub_leaf;
+          __u32 eax;
+          __u32 ebx;
+          __u32 ecx;
+          __u32 edx;
+  };
+
+  struct kvm_tdx_capabilities {
+          __u64 attrs_fixed0;
+          __u64 attrs_fixed1;
+          __u64 xfam_fixed0;
+          __u64 xfam_fixed1;
+
+          __u32 nr_cpuid_configs;
+          struct kvm_tdx_cpuid_config cpuid_configs[0];
+  };
+
+
+KVM_TDX_INIT_VM
+---------------
+:Type: vm ioctl
+
+Does additional VM initialization specific to TDX which corresponds to
+TDH.MNG.INIT TDX SEAM call.
+
+- id: KVM_TDX_INIT_VM
+- metadata: must be 0
+- data: pointer to struct kvm_tdx_init_vm
+- reserved: must be 0
+
+::
+
+  struct kvm_tdx_init_vm {
+          __u32 max_vcpus;
+          __u32 reserved;
+          __u64 attributes;
+          __u64 cpuid;  /* pointer to struct kvm_cpuid2 */
+          __u64 mrconfigid[6];          /* sha384 digest */
+          __u64 mrowner[6];             /* sha384 digest */
+          __u64 mrownerconfig[6];       /* sha348 digest */
+          __u64 reserved[43];           /* must be zero for future extensibility */
+  };
+
+
+KVM_TDX_INIT_VCPU
+-----------------
+:Type: vcpu ioctl
+
+Does additional VCPU initialization specific to TDX which corresponds to
+TDH.VP.INIT TDX SEAM call.
+
+- id: KVM_TDX_INIT_VCPU
+- metadata: must be 0
+- data: initial value of the guest TD VCPU RCX
+
+
+KVM_TDX_INIT_MEM_REGION
+-----------------------
+:Type: vm ioctl
+
+Encrypt a memory continuous region which corresponding to TDH.MEM.PAGE.ADD
+TDX SEAM call.
+If KVM_TDX_MEASURE_MEMORY_REGION flag is specified, it also extends measurement
+which corresponds to TDH.MR.EXTEND TDX SEAM call.
+
+- id: KVM_TDX_INIT_VCPU
+- metadata: flags
+            currently only KVM_TDX_MEASURE_MEMORY_REGION is defined
+- data: pointer to struct kvm_tdx_init_mem_region
+
+::
+
+  #define KVM_TDX_MEASURE_MEMORY_REGION   (1UL << 0)
+
+  struct kvm_tdx_init_mem_region {
+          __u64 source_addr;
+          __u64 gpa;
+          __u64 nr_pages;
+  };
+
+
+KVM_TDX_FINALIZE_VM
+-------------------
+:Type: vm ioctl
+
+Complete measurement of the initial TD contents and mark it ready to run
+which corresponds to TDH.MR.FINALIZE
+
+- id: KVM_TDX_FINALIZE_VM
+- metadata: ignored
+- data: ignored
+
+
+KVM TDX creation flow
+=====================
+In addition to KVM normal flow, new TDX ioctls need to be called.  The control flow
+looks like as follows.
+
+#. system wide capability check
+  * KVM_TDX_CAPABILITIES: query if TDX is supported on the platform.
+  * KVM_CAP_xxx: check other KVM extensions same to normal KVM case.
+
+#. creating VM
+  * KVM_CREATE_VM
+  * KVM_TDX_INIT_VM: pass TDX specific VM parameters.
+
+#. creating VCPU
+  * KVM_CREATE_VCPU
+  * KVM_TDX_INIT_VCPU: pass TDX specific VCPU parameters.
+
+#. initializing guest memory
+  * allocate guest memory and initialize page same to normal KVM case
+    In TDX case, parse and load TDVF into guest memory in addition.
+  * KVM_TDX_INIT_MEM_REGION to add and measure guest pages.
+    If the pages has contents above, those pages need to be added.
+    Otherwise the contents will be lost and guest sees zero pages.
+  * KVM_TDX_FINALIAZE_VM: Finalize VM and measurement
+    This must be after KVM_TDX_INIT_MEM_REGION.
+
+#. run vcpu
+
+Design discussion
+=================
+
+Coexistence of normal(VMX) VM and TD VM
+---------------------------------------
+It's required to allow both legacy(normal VMX) VMs and new TD VMs to
+coexist. Otherwise the benefits of VM flexibility would be eliminated.
+The main issue for it is that the logic of kvm_x86_ops callbacks for
+TDX is different from VMX. On the other hand, the variable,
+kvm_x86_ops, is global single variable. Not per-VM, not per-vcpu.
+
+Several points to be considered.
+  . No or minimal overhead when TDX is disabled(CONFIG_INTEL_TDX_HOST=n).
+  . Avoid overhead of indirect call via function pointers.
+  . Contain the changes under arch/x86/kvm/vmx directory and share logic
+    with VMX for maintenance.
+    Even though the ways to operation on VM (VMX instruction vs TDX
+    SEAM call) is different, the basic idea remains same. So, many
+    logic can be shared.
+  . Future maintenance
+    The huge change of kvm_x86_ops in (near) future isn't expected.
+    a centralized file is acceptable.
+
+- Wrapping kvm x86_ops: The current choice
+  Introduce dedicated file for arch/x86/kvm/vmx/main.c (the name,
+  main.c, is just chosen to show main entry points for callbacks.) and
+  wrapper functions around all the callbacks with
+  "if (is-tdx) tdx-callback() else vmx-callback()".
+
+  Pros:
+  - No major change in common x86 KVM code. The change is (mostly)
+    contained under arch/x86/kvm/vmx/.
+  - When TDX is disabled(CONFIG_INTEL_TDX_HOST=n), the overhead is
+    optimized out.
+  - Micro optimization by avoiding function pointer.
+  Cons:
+  - Many boiler plates in arch/x86/kvm/vmx/main.c.
+
+Alternative:
+- Introduce another callback layer under arch/x86/kvm/vmx.
+  Pros:
+  - No major change in common x86 KVM code. The change is (mostly)
+    contained under arch/x86/kvm/vmx/.
+  - clear separation on callbacks.
+  Cons:
+  - overhead in VMX even when TDX is disabled(CONFIG_INTEL_TDX_HOST=n).
+
+- Allow per-VM kvm_x86_ops callbacks instead of global kvm_x86_ops
+  Pros:
+  - clear separation on callbacks.
+  Cons:
+  - Big change in common x86 code.
+  - overhead in common code even when TDX is
+    disabled(CONFIG_INTEL_TDX_HOST=n).
+
+- Introduce new directory arch/x86/kvm/tdx
+  Pros:
+  - It clarifies that TDX is different from VMX.
+  Cons:
+  - Given the level of code sharing, it complicates code sharing.
+
+KVM MMU Changes
+---------------
+KVM MMU needs to be enhanced to handle Secure/Shared-EPT. The
+high-level execution flow is mostly same to normal EPT case.
+EPT violation/misconfiguration -> invoke TDP fault handler ->
+resolve TDP fault -> resume execution. (or emulate MMIO)
+The difference is, that S-EPT is operated(read/write) via TDX SEAM
+call which is expensive instead of direct read/write EPT entry.
+One bit of GPA (51 or 47 bit) is repurposed so that it means shared
+with host(if set to 1) or private to TD(if cleared to 0).
+
+- The current implementation
+  . Reuse the existing MMU code with minimal update.  Because the
+    execution flow is mostly same. But additional operation, TDX call
+    for S-EPT, is needed. So add hooks for it to kvm_x86_ops.
+  . For performance, minimize TDX SEAM call to operate on S-EPT. When
+    getting corresponding S-EPT pages/entry from faulting GPA, don't
+    use TDX SEAM call to read S-EPT entry. Instead create shadow copy
+    in host memory.
+    Repurpose the existing kvm_mmu_page as shadow copy of S-EPT and
+    associate S-EPT to it.
+  . Treats share bit as attributes. mask/unmask the bit where
+    necessary to keep the existing traversing code works.
+    Introduce kvm.arch.gfn_shared_mask and use "if (gfn_share_mask)"
+    for special case.
+    = 0 : for non-TDX case
+    = 51 or 47 bit set for TDX case.
+
+  Pros:
+  - Large code reuse with minimal new hooks.
+  - Execution path is same.
+  Cons:
+  - Complicates the existing code.
+  - Repurpose kvm_mmu_page as shadow of Secure-EPT can be confusing.
+
+Alternative:
+- Replace direct read/write on EPT entry with TDX-SEAM call by
+  introducing callbacks on EPT entry.
+  Pros:
+  - Straightforward.
+  Cons:
+  - Too many touching point.
+  - Too slow due to TDX-SEAM call.
+  - Overhead even when TDX is disabled(CONFIG_INTEL_TDX_HOST=n).
+
+- Sprinkle "if (is-tdx)" for TDX special case
+  Pros:
+  - Straightforward.
+  Cons:
+  - The result is non-generic and ugly.
+  - Put TDX specific logic into common KVM MMU code.
+
+New KVM API, ioctl (sub)command, to manage TD VMs
+-------------------------------------------------
+Additional KVM API are needed to control TD VMs. The operations on TD
+VMs are specific to TDX.
+
+- Piggyback and repurpose KVM_MEMORY_ENCRYPT_OP
+  Although not all operation isn't memory encryption, repupose to get
+  TDX specific ioctls.
+  Pros:
+  - No major change in common x86 KVM code.
+  Cons:
+  - The operations aren't actually memory encryption, but operations
+    on TD VMs.
+
+Alternative:
+- Introduce new ioctl for guest protection like
+  KVM_GUEST_PROTECTION_OP and introduce subcommand for TDX.
+  Pros:
+  - Clean name.
+  Cons:
+  - One more new ioctl for guest protection.
+  - Confusion with KVM_MEMORY_ENCRYPT_OP with KVM_GUEST_PROTECTION_OP.
+
+- Rename KVM_MEMORY_ENCRYPT_OP to KVM_GUEST_PROTECTION_OP and keep
+  KVM_MEMORY_ENCRYPT_OP as same value for user API for compatibility.
+  "#define KVM_MEMORY_ENCRYPT_OP KVM_GUEST_PROTECTION_OP" for uapi
+  compatibility.
+  Pros:
+  - No new ioctl with more suitable name.
+  Cons:
+  - May cause confusion to the existing user program.
+
+
+References
+==========
+
+.. [1] TDX specification
+   https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
+.. [2] Intel Trust Domain Extensions (Intel TDX)
+   https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
+.. [3] Intel CPU Architectural Extensions Specification
+   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-cpu-architectural-specification.pdf
+.. [4] Intel TDX Module 1.0 EAS
+   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf
+.. [5] Intel TDX Loader Interface Specification
+   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-seamldr-interface-specification.pdf
+.. [6] Intel TDX Guest-Hypervisor Communication Interface
+   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
+.. [7] Intel TDX Virtual Firmware Design Guide
+   https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.
+.. [8] intel public github
+   kvm TDX branch: https://github.com/intel/tdx/tree/kvm
+   TDX guest branch: https://github.com/intel/tdx/tree/guest
+.. [9] tdvf
+    https://github.com/tianocore/edk2-staging/tree/TDVF
+.. [10] KVM forum 2020: Intel Virtualization Technology Extensions to
+     Enable Hardware Isolated VMs
+     https://osseu2020.sched.com/event/eDzm/intel-virtualization-technology-extensions-to-enable-hardware-isolated-vms-sean-christopherson-intel
+.. [11] Linux Security Summit EU 2020:
+     Architectural Extensions for Hardware Virtual Machine Isolation
+     to Advance Confidential Computing in Public Clouds - Ravi Sahita
+     & Jun Nakajima, Intel Corporation
+     https://osseu2020.sched.com/event/eDOx/architectural-extensions-for-hardware-virtual-machine-isolation-to-advance-confidential-computing-in-public-clouds-ravi-sahita-jun-nakajima-intel-corporation
+.. [12] [RFCv2,00/16] KVM protected memory extension
+     https://lkml.org/lkml/2020/10/20/66
-- 
2.25.1


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 00/59] KVM: X86: TDX support
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (58 preceding siblings ...)
  2021-11-25  0:20 ` [RFC PATCH v3 59/59] Documentation/virtual/kvm: Add Trust Domain Extensions(TDX) isaku.yamahata
@ 2021-11-25  2:12 ` Xiaoyao Li
  2021-11-30 18:51 ` Sean Christopherson
  60 siblings, 0 replies; 123+ messages in thread
From: Xiaoyao Li @ 2021-11-25  2:12 UTC (permalink / raw)
  To: isaku.yamahata, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, erdemaktas, Connor Kuehl,
	Sean Christopherson, linux-kernel, kvm
  Cc: isaku.yamahata

On 11/25/2021 8:19 AM, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Changes from v2:
> - update based on patch review
> - support TDP MMU
> - drop non-essential fetures (ftrace etc.) to reduce patch size
> 
> TODO:
> - integrate vm type patch

Hi, everyone plans to review this series,

Please skip patch 14-22, which aer old and a separate series can be 
found at 
https://lore.kernel.org/all/20211112153733.2767561-1-xiaoyao.li@intel.com/

I will post v2 with all the comments from Sean addressed.

thanks,
-Xiaoyao

> - integrate unmapping user space mapping



^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 08/59] KVM: Export kvm_io_bus_read for use by TDX for PV MMIO
  2021-11-25  0:19 ` [RFC PATCH v3 08/59] KVM: Export kvm_io_bus_read for use by TDX for PV MMIO isaku.yamahata
@ 2021-11-25 17:14   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 17:14 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:19, isaku yamahata wrote:

What is PV MMIO? This has nothing to do with PV=, aka paravirt because
it's used in the tdx exit handler for emulation. Please stop confusing
concepts.

> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Later kvm_intel.ko will use it.

That sentence is useless, as are the other 'later patches' phrases all
over the place. At submission time it's obvious and once this is merged
it's not helpful at all.

What's even worse is that this happens right at the beginning of the
series and the actual use case is introduced 50 patches later. That's
not how exports are done. Exports are next to the use case and in the
best case they can be just part of the use case patch.

This is not how fine granular patching works. Just splitting out stupid
things into separate patches does not make a proper patch series. It
creates an illusion, that's it.

And if I look at the patch which makes actual use of that export then
it's just the proof. I'll come to that later.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 09/59] KVM: Enable hardware before doing arch VM initialization
  2021-11-25  0:19 ` [RFC PATCH v3 09/59] KVM: Enable hardware before doing arch VM initialization isaku.yamahata
@ 2021-11-25 19:02   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 19:02 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:19, isaku yamahata wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Swap the order of hardware_enable_all() and kvm_arch_init_vm() to
> accommodate Intel's TDX, which needs VMX to be enabled during VM init in
> order to make SEAMCALLs.
>
> This also provides consistent ordering between kvm_create_vm() and
> kvm_destroy_vm() with respect to calling kvm_arch_destroy_vm() and
> hardware_disable_all().

So far so good. This should also explain that there is no downside with
that order change especially as this order has been that way for a long
time.

Thanks,

      tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 12/59] KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot by default
  2021-11-25  0:19 ` [RFC PATCH v3 12/59] KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot by default isaku.yamahata
@ 2021-11-25 19:04   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 19:04 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:19, isaku yamahata wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Zap only leaf SPTEs when deleting/moving a memslot by default, and add a
> module param to allow reverting to the old behavior of zapping all SPTEs
> at all levels and memslots when any memslot is updated.

This changelog describes WHAT the change is doing, which can be seen in
the patch itself, but it gives zero justification neither for the change
nor for the existance of the module parameter.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 13/59] KVM: Add max_vcpus field in common 'struct kvm'
  2021-11-25  0:19 ` [RFC PATCH v3 13/59] KVM: Add max_vcpus field in common 'struct kvm' isaku.yamahata
@ 2021-11-25 19:06   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 19:06 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:19, isaku yamahata wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>

Why?

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 14/59] KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs
  2021-11-25  0:19 ` [RFC PATCH v3 14/59] KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs isaku.yamahata
@ 2021-11-25 19:08   ` Thomas Gleixner
  2021-11-29 17:35     ` Sean Christopherson
  0 siblings, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 19:08 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson, Xiaoyao Li

On Wed, Nov 24 2021 at 16:19, isaku yamahata wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Add a capability to effectively allow userspace to query what VM types
> are supported by KVM.

I really don't see why this has to be named legacy. There are enough
reasonable use cases which are perfectly fine using the non-encrypted
muck. Just because there is a new hyped feature does not make anything
else legacy.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 15/59] KVM: x86: Introduce "protected guest" concept and block disallowed ioctls
  2021-11-25  0:19 ` [RFC PATCH v3 15/59] KVM: x86: Introduce "protected guest" concept and block disallowed ioctls isaku.yamahata
@ 2021-11-25 19:26   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 19:26 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson, Xiaoyao Li

On Wed, Nov 24 2021 at 16:19, isaku yamahata wrote:
>  
>  static int kvm_vcpu_ioctl_smi(struct kvm_vcpu *vcpu)
>  {
> +	/* TODO: use more precise flag */

Why is this still a todo and not implemented properly from the very beginning?

And then you have this:

> +	if (vcpu->arch.guest_state_protected)
> +		return -EINVAL;

...

> +	/* TODO: use more precise flag */
> +	if (vcpu->arch.guest_state_protected)
> +		return -EINVAL;

and a gazillion of other places. That's beyond lame.

The obvious place to do such a decision is kvm_arch_vcpu_ioctl(), no?

kvm_arch_vcpu_ioctl(.., unsigneg int ioctl, ...)

     if (vcpu->arch.guest_state_protected) {
     	  if (!(test_bit(_IOC_NR(ioctl), vcpu->ioctl_allowed))
          	return -EINVAL;
     }

is way too simple and obvious, right?

Even if you want more fine grained control, then having an array of
flags per ioctl number is way better than sprinkling this protected muck
conditions all over the place.

Thanks,

        tglx

    


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 16/59] KVM: x86: Add per-VM flag to disable direct IRQ injection
  2021-11-25  0:19 ` [RFC PATCH v3 16/59] KVM: x86: Add per-VM flag to disable direct IRQ injection isaku.yamahata
@ 2021-11-25 19:31   ` Thomas Gleixner
  2021-11-29  2:49   ` Lai Jiangshan
  1 sibling, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 19:31 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:19, isaku yamahata wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Add a flag to disable IRQ injection, which is not supported by TDX.
...
> @@ -4506,7 +4506,8 @@ static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
>  static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
>  				    struct kvm_interrupt *irq)
>  {
> -	if (irq->irq >= KVM_NR_INTERRUPTS)
> +	if (irq->irq >= KVM_NR_INTERRUPTS ||
> +	    vcpu->kvm->arch.irq_injection_disallowed)
>  		return -EINVAL;

That's required here because you forgot to copy & pasta the protect
guest condition muck into that ioctl, right?


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 17/59] KVM: x86: Add flag to disallow #MC injection / KVM_X86_SETUP_MCE
  2021-11-25  0:20 ` [RFC PATCH v3 17/59] KVM: x86: Add flag to disallow #MC injection / KVM_X86_SETUP_MCE isaku.yamahata
@ 2021-11-25 19:33   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 19:33 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson, Kai Huang

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
>  static int kvm_vcpu_ioctl_x86_setup_mce(struct kvm_vcpu *vcpu,
>  					u64 mcg_cap)
>  {
> -	int r;
>  	unsigned bank_num = mcg_cap & 0xff, bank;
>  
> -	r = -EINVAL;
> +	if (vcpu->kvm->arch.mce_injection_disallowed)
> +		return -EINVAL;

Yet another flag because the copy & pasta orgy did not cover that
ioctl. What for?

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 18/59] KVM: x86: Add flag to mark TSC as immutable (for TDX)
  2021-11-25  0:20 ` [RFC PATCH v3 18/59] KVM: x86: Add flag to mark TSC as immutable (for TDX) isaku.yamahata
@ 2021-11-25 19:40   ` Thomas Gleixner
  2021-11-29 18:05     ` Sean Christopherson
  0 siblings, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 19:40 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson, Kai Huang

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> The TSC for TDX1 guests is fixed at TD creation time.  Add tsc_immutable

What's a TDX1 guest?

> to reflect that the TSC of the guest cannot be changed in any way, and
> use it to short circuit all paths that lead to one of the myriad TSC
> adjustment flows.

I can kinda see the reason for this being valuable on it's own, but in
general why does TDX need a gazillion flags to disable tons of different
things if _ALL_ these flags are going to be set by for TDX guests
anyway?

Seperate flags make only sense when they have a value on their own,
i.e. are useful for things outside of TDX. If not they are just useless
ballast.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 23/59] KVM: x86: Allow host-initiated WRMSR to set X2APIC regardless of CPUID
  2021-11-25  0:20 ` [RFC PATCH v3 23/59] KVM: x86: Allow host-initiated WRMSR to set X2APIC regardless of CPUID isaku.yamahata
@ 2021-11-25 19:41   ` Thomas Gleixner
  2021-11-26  8:18     ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 19:41 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> Let userspace, or in the case of TDX, KVM itself, enable X2APIC even if
> X2APIC is not reported as supported in the guest's CPU model.  KVM
> generally does not force specific ordering between ioctls(), e.g. this
> forces userspace to configure CPUID before MSRs.  And for TDX, vCPUs
> will always run with X2APIC enabled, e.g. KVM will want/need to enable
> X2APIC from time zero.

This is complete crap. Fix the broken user space and do not add
horrible hacks to the kernel.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 25/59] KVM: x86: Add support for vCPU and device-scoped KVM_MEMORY_ENCRYPT_OP
  2021-11-25  0:20 ` [RFC PATCH v3 25/59] KVM: x86: Add support for vCPU and device-scoped KVM_MEMORY_ENCRYPT_OP isaku.yamahata
@ 2021-11-25 19:42   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 19:42 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>

Yet another changelog with a significant amount of void content.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 26/59] KVM: x86: Introduce vm_teardown() hook in kvm_arch_vm_destroy()
  2021-11-25  0:20 ` [RFC PATCH v3 26/59] KVM: x86: Introduce vm_teardown() hook in kvm_arch_vm_destroy() isaku.yamahata
@ 2021-11-25 19:46   ` Thomas Gleixner
  2021-11-25 20:54     ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 19:46 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> Add a second kvm_x86_ops hook in kvm_arch_vm_destroy() to support TDX's
> destruction path, which needs to first put the VM into a teardown state,
> then free per-vCPU resource, and finally free per-VM resources.
>
> Note, this knowingly creates a discrepancy in nomenclature for SVM as
> svm_vm_teardown() invokes avic_vm_destroy() and sev_vm_destroy().
> Moving the now-misnamed functions or renaming them is left to a future
> patch so as not to introduce a functional change for SVM.

That's just the wrong way around. Fixup SVM first and then add the TDX
muck on top. Stop this 'left to a future patch' nonsense. I know for
sure that those future patches never materialize.

The way it works is:

    1) Refactor the code to make room for your new functionality in a
       way that the existing code still works.

    2) Add your new muck on top.

Anything else is not acceptable at all.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 29/59] KVM: x86: Add option to force LAPIC expiration wait
  2021-11-25  0:20 ` [RFC PATCH v3 29/59] KVM: x86: Add option to force LAPIC expiration wait isaku.yamahata
@ 2021-11-25 19:53   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 19:53 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> Add an option to skip the IRR check in kvm_wait_lapic_expire().

Yay for consistency. $Subject says 'Add option to force wait'. Changelog
says 'Add option to skip ... check'. Can you make your mind up?

Also the change at hand is not adding an option. It's adding a function
argument. You surely can spot the difference between option and function
argument, right?

Changelogs are meant to be readable by mere mortals.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 30/59] KVM: x86: Add guest_supported_xss placholder
  2021-11-25  0:20 ` [RFC PATCH v3 30/59] KVM: x86: Add guest_supported_xss placholder isaku.yamahata
@ 2021-11-25 19:55   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 19:55 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:

> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Add a per-vcpu placeholder for the support XSS of the guest so that the
> TDX configuration code doesn't need to hack in manual computation of the
> supported XSS.  KVM XSS enabling is currently being upstreamed, i.e.
> guest_supported_xss will no longer be a placeholder by the time TDX is
> ready for upstreaming (hopefully).

Yes, hope dies last... Definitely technical useful information for a
changelog. There is a reason why notes should go below the --- separator
line.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 31/59] KVM: x86: Add infrastructure for stolen GPA bits
  2021-11-25  0:20 ` [RFC PATCH v3 31/59] KVM: x86: Add infrastructure for stolen GPA bits isaku.yamahata
@ 2021-11-25 20:00   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 20:00 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Rick Edgecombe

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> Add support in KVM's MMU for aliasing multiple GPAs (from a hardware
> perspective) to a single GPA (from a memslot perspective). GPA alising
> will be used to repurpose GPA bits as attribute bits, e.g. to expose an
> execute-only permission bit to the guest. To keep the implementation
> simple (relatively speaking), GPA aliasing is only supported via TDP.
>
> Today KVM assumes two things that are broken by GPA aliasing.
>   1. GPAs coming from hardware can be simply shifted to get the GFNs.
>   2. GPA bits 51:MAXPHYADDR are reserved to zero.
>
> With GPA aliasing, translating a GPA to GFN requires masking off the
> repurposed bit, and a repurposed bit may reside in 51:MAXPHYADDR.
>
> To support GPA aliasing, introduce the concept of per-VM GPA stolen bits,
> that is, bits stolen from the GPA to act as new virtualized attribute
> bits. A bit in the mask will cause the MMU code to create aliases of the
> GPA. It can also be used to find the GFN out of a GPA coming from a tdp
> fault.
>
> To handle case (1) from above, retain any stolen bits when passing a GPA
> in KVM's MMU code, but strip them when converting to a GFN so that the
> GFN contains only the "real" GFN, i.e. never has repurposed bits set.
>
> GFNs (without stolen bits) continue to be used to:
> 	-Specify physical memory by userspace via memslots
> 	-Map GPAs to TDP PTEs via RMAP
> 	-Specify dirty tracking and write protection
> 	-Look up MTRR types
> 	-Inject async page faults
>
> Since there are now multiple aliases for the same aliased GPA, when
> userspace memory backing the memslots is paged out, both aliases need to be
> modified. Fortunately this happens automatically. Since rmap supports
> multiple mappings for the same GFN for PTE shadowing based paging, by
> adding/removing each alias PTE with its GFN, kvm_handle_hva() based
> operations will be applied to both aliases.
>
> In the case of the rmap being removed in the future, the needed
> information could be recovered by iterating over the stolen bits and
> walking the TDP page tables.
>
> For TLB flushes that are address based, make sure to flush both aliases
> in the stolen bits case.
>
> Only support stolen bits in 64 bit guest paging modes (long, PAE).
> Features that use this infrastructure should restrict the stolen bits to
> exclude the other paging modes. Don't support stolen bits for shadow EPT.

This is a real reasonable and informative changelog. Thanks to Rick for
writing this up!

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 39/59] KVM: VMX: Modify NMI and INTR handlers to take intr_info as param
  2021-11-25  0:20 ` [RFC PATCH v3 39/59] KVM: VMX: Modify NMI and INTR handlers to take intr_info as param isaku.yamahata
@ 2021-11-25 20:06   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 20:06 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>

$subject:    s/as param/as function arguments/

    param != function argument

It's not rocket science to use the proper terms and to write out words
in changelogs and comments instead of using half baken abbreviations.

Precise language matters and especially so for the benefit on non-native
speakers and people who are not familiar with a particular corporate
slang.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 40/59] KVM: VMX: Move NMI/exception handler to common helper
  2021-11-25  0:20 ` [RFC PATCH v3 40/59] KVM: VMX: Move NMI/exception handler to common helper isaku.yamahata
@ 2021-11-25 20:06   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 20:06 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>

Why?


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 41/59] KVM: VMX: Split out guts of EPT violation to common/exposed function
  2021-11-25  0:20 ` [RFC PATCH v3 41/59] KVM: VMX: Split out guts of EPT violation to common/exposed function isaku.yamahata
@ 2021-11-25 20:07   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 20:07 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>

Moar why?

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 45/59] KVM: VMX: Move setting of EPT MMU masks to common VT-x code
  2021-11-25  0:20 ` [RFC PATCH v3 45/59] KVM: VMX: Move setting of EPT MMU masks to common VT-x code isaku.yamahata
@ 2021-11-25 20:08   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 20:08 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>

Yet another void ...


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 46/59] KVM: VMX: Move register caching logic to common code
  2021-11-25  0:20 ` [RFC PATCH v3 46/59] KVM: VMX: Move register caching logic to common code isaku.yamahata
@ 2021-11-25 20:11   ` Thomas Gleixner
  2021-11-25 20:17     ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 20:11 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:

> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Move the guts of vmx_cache_reg() to vt_cache_reg() in preparation for
> reusing the bulk of the code for TDX, which can access guest state for
> debug TDs.
>
> Use kvm_x86_ops.cache_reg() in ept_update_paging_mode_cr0() rather than
> trying to expose vt_cache_reg() to vmx.c, even though it means taking a
> retpoline.  The code runs if and only if EPT is enabled but unrestricted
> guest.

This sentence does not parse because it's not a proper sentence.

> Only one generation of CPU, Nehalem, supports EPT but not
> unrestricted guest, and disabling unrestricted guest without also
> disabling EPT is, to put it bluntly, dumb.

This one is only significantly better and lacks an explanation what this
means for the dumb case.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 46/59] KVM: VMX: Move register caching logic to common code
  2021-11-25 20:11   ` Thomas Gleixner
@ 2021-11-25 20:17     ` Paolo Bonzini
  2021-11-29 18:23       ` Sean Christopherson
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2021-11-25 20:17 UTC (permalink / raw)
  To: Thomas Gleixner, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, Sean Christopherson

On 11/25/21 21:11, Thomas Gleixner wrote:
>>
>> Use kvm_x86_ops.cache_reg() in ept_update_paging_mode_cr0() rather than
>> trying to expose vt_cache_reg() to vmx.c, even though it means taking a
>> retpoline.  The code runs if and only if EPT is enabled but unrestricted
>> guest.
> This sentence does not parse because it's not a proper sentence.
> 
>> Only one generation of CPU, Nehalem, supports EPT but not
>> unrestricted guest, and disabling unrestricted guest without also
>> disabling EPT is, to put it bluntly, dumb.
> This one is only significantly better and lacks an explanation what this
> means for the dumb case.

Well, it means a retpoline (see paragraph before).  However, I'm not 
sure why it one wouldn't create a vt.h header with all vt_* functions.

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 47/59] KVM: TDX: Define TDCALL exit reason
  2021-11-25  0:20 ` [RFC PATCH v3 47/59] KVM: TDX: Define TDCALL exit reason isaku.yamahata
@ 2021-11-25 20:19   ` Thomas Gleixner
  2021-11-29 18:36     ` Sean Christopherson
  0 siblings, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 20:19 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson, Xiaoyao Li

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Define the TDCALL exit reason, which is carved out from the VMX exit
> reason namespace as the TDCALL exit from TDX guest to TDX-SEAM is really
> just a VM-Exit.

How is this carved out? What's the value of this word salad?

It's simply a new exit reason. Not more, not less. So what?

> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

I'm pretty sure that it does not take two engineers to add a new exit
reason define, but it takes at least two engineers to come up with a
convoluted explanation for it.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 49/59] KVM: VMX: Add macro framework to read/write VMCS for VMs and TDs
  2021-11-25  0:20 ` [RFC PATCH v3 49/59] KVM: VMX: Add macro framework to read/write VMCS for VMs and TDs isaku.yamahata
@ 2021-11-25 20:24   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 20:24 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:

> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Add a macro framework to hide VMX vs. TDX details of VMREAD and VMWRITE
> so the VMX and TDX can shared common flows, e.g. accessing DTs.

s/shared/share/

> Note, the TDX paths are dead code at this time.  There is no great way
> to deal with the chicken-and-egg scenario of having things in place for
> TDX without first having TDX.

That's more than obvious and the whole point of building infrastructure
in the first place, isn't it?

> +#ifdef CONFIG_INTEL_TDX_HOST
> +#define VT_BUILD_VMCS_HELPERS(type, bits, tdbits)			   \
> +static __always_inline type vmread##bits(struct kvm_vcpu *vcpu,		   \
> +					 unsigned long field)		   \
> +{									   \
> +	if (unlikely(is_td_vcpu(vcpu))) {				   \
> +		if (KVM_BUG_ON(!is_debug_td(vcpu), vcpu->kvm))		   \
> +			return 0;					   \
> +		return td_vmcs_read##tdbits(to_tdx(vcpu), field);	   \
> +	}								   \
> +	return vmcs_read##bits(field);					   \
> +}									   \

New lines exist for a reason to visually separate things and are even
possible in macro blocks. This includes the defines.

Aside of that is there any reason why the end of the macro block has to
be 3 spaces instead of a tab?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 51/59] KVM: VMX: MOVE GDT and IDT accessors to common code
  2021-11-25  0:20 ` [RFC PATCH v3 51/59] KVM: VMX: MOVE GDT and IDT accessors to common code isaku.yamahata
@ 2021-11-25 20:25   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 20:25 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>

Again: Why?


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 52/59] KVM: VMX: Move .get_interrupt_shadow() implementation to common VMX code
  2021-11-25  0:20 ` [RFC PATCH v3 52/59] KVM: VMX: Move .get_interrupt_shadow() implementation to common VMX code isaku.yamahata
@ 2021-11-25 20:26   ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 20:26 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Sean Christopherson

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>

Moar missing explanation of the why.


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 53/59] KVM: x86: Add a helper function to restore 4 host MSRs on exit to user space
  2021-11-25  0:20 ` [RFC PATCH v3 53/59] KVM: x86: Add a helper function to restore 4 host MSRs on exit to user space isaku.yamahata
@ 2021-11-25 20:34   ` Thomas Gleixner
  2021-11-26  9:19     ` Chao Gao
  0 siblings, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 20:34 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Chao Gao

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> From: Chao Gao <chao.gao@intel.com>

> $Subject: KVM: x86: Add a helper function to restore 4 host MSRs on exit to user space

Which user space are you talking about? This subject line is misleading
at best. The unconditional reset is happening when a TDX VM exits
because the SEAM firmware enforces this to prevent unformation leaks.

It also does not matter whether this are four or ten MSR. Fact is that
the SEAM firmware is buggy because it does not save/restore those MSRs.

So the proper subject line is:

   KVM: x86: Add infrastructure to handle MSR corruption by broken TDX firmware 

> The TDX module unconditionally reset 4 host MSRs (MSR_SYSCALL_MASK,
> MSR_START, MSR_LSTAR, MSR_TSC_AUX) to architectural INIT state on exit from
> TDX VM to KVM.  KVM needs to save their values before TD enter and restore
> them on exit to userspace.
>
> Reuse current kvm_user_return mechanism and introduce a function to update
> cached values and register the user return notifier in this new function.
>
> The later patch will use the helper function to save/restore 4 host
> MSRs.

'The later patch ...' is useless information. Of course there will be a
later patch to make use of this which is implied by 'Add infrastructure
...'. Can we please get rid of these useless phrases which have no value
at patch submission time and are even more confusing once the pile is
merged?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 54/59] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch
  2021-11-25  0:20 ` [RFC PATCH v3 54/59] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch isaku.yamahata
@ 2021-11-25 20:48   ` Paolo Bonzini
  2021-11-25 21:05   ` Thomas Gleixner
  1 sibling, 0 replies; 123+ messages in thread
From: Paolo Bonzini @ 2021-11-25 20:48 UTC (permalink / raw)
  To: isaku.yamahata, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, Xiaoyao Li

On 11/25/21 01:20, isaku.yamahata@intel.com wrote:
> From: Xiaoyao Li<xiaoyao.li@intel.com>
> 
> Introduce a per-vm variable initial_tsc_khz to hold the default tsc_khz
> for kvm_arch_vcpu_create().
> 
> This field is going to be used by TDX since TSC frequency for TD guest
> is configured at TD VM initialization phase.
> 
> Signed-off-by: Xiaoyao Li<xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata<isaku.yamahata@intel.com>

Probably some incorrect squashing happened here, since...

>  arch/x86/include/asm/kvm_host.h |    1 +
>  arch/x86/kvm/vmx/tdx.c          | 2233 +++++++++++++++++++++++++++++++
>  arch/x86/kvm/x86.c              |    3 +-
>  3 files changed, 2236 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/kvm/vmx/tdx.c

This patch does a bit more than that. :)

Paolo

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 28/59] KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events()
  2021-11-25  0:20 ` [RFC PATCH v3 28/59] KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events() isaku.yamahata
@ 2021-11-25 20:50   ` Paolo Bonzini
  2021-11-29 19:20     ` Sean Christopherson
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2021-11-25 20:50 UTC (permalink / raw)
  To: isaku.yamahata, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, Sean Christopherson

On 11/25/21 01:20, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson<sean.j.christopherson@intel.com>
> 
> Return true for kvm_vcpu_has_events() if the vCPU has a pending APICv
> interrupt to support TDX's usage of APICv.  Unlike VMX, TDX doesn't have
> access to vmcs.GUEST_INTR_STATUS and so can't emulate posted interrupts,
> i.e. needs to generate a posted interrupt and more importantly can't
> manually move requested interrupts into the vIRR (which it also doesn't
> have access to).

Does this mean it is impossible to disable APICv on TDX?  If so, please 
add a WARN.

Paolo

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 26/59] KVM: x86: Introduce vm_teardown() hook in kvm_arch_vm_destroy()
  2021-11-25 19:46   ` Thomas Gleixner
@ 2021-11-25 20:54     ` Paolo Bonzini
  2021-11-25 21:11       ` Thomas Gleixner
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2021-11-25 20:54 UTC (permalink / raw)
  To: Thomas Gleixner, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, Sean Christopherson

On 11/25/21 20:46, Thomas Gleixner wrote:
> On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
>> Add a second kvm_x86_ops hook in kvm_arch_vm_destroy() to support TDX's
>> destruction path, which needs to first put the VM into a teardown state,
>> then free per-vCPU resource, and finally free per-VM resources.
>>
>> Note, this knowingly creates a discrepancy in nomenclature for SVM as
>> svm_vm_teardown() invokes avic_vm_destroy() and sev_vm_destroy().
>> Moving the now-misnamed functions or renaming them is left to a future
>> patch so as not to introduce a functional change for SVM.
> That's just the wrong way around. Fixup SVM first and then add the TDX
> muck on top. Stop this 'left to a future patch' nonsense. I know for
> sure that those future patches never materialize.

Or just keep vm_destroy for the "early" destruction, and give a new name 
to the new hook.  It is used to give back the TDCS memory, so perhaps 
you can call it vm_free?

Paolo

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 54/59] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch
  2021-11-25  0:20 ` [RFC PATCH v3 54/59] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch isaku.yamahata
  2021-11-25 20:48   ` Paolo Bonzini
@ 2021-11-25 21:05   ` Thomas Gleixner
  2021-11-25 22:13     ` Paolo Bonzini
  1 sibling, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 21:05 UTC (permalink / raw)
  To: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, isaku.yamahata, Xiaoyao Li

On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> From: Xiaoyao Li <xiaoyao.li@intel.com>
>
> Introduce a per-vm variable initial_tsc_khz to hold the default tsc_khz
> for kvm_arch_vcpu_create().
>
> This field is going to be used by TDX since TSC frequency for TD guest
> is configured at TD VM initialization phase.

So now almost 50 patches after exporting kvm_io_bus_read() I have
finally reached the place which makes use of it. But that's just a minor
detail compared to this:

>  arch/x86/include/asm/kvm_host.h |    1 +
>  arch/x86/kvm/vmx/tdx.c          | 2233 +++++++++++++++++++++++++++++++

Can you pretty please explain how this massive pile of code is related
to the subject line and the change log of this patch?

It takes more than 2000 lines of code to add a new member to struct
kvm_arch?

Seriously? This definitely earns an award for the most disconnected
changelog ever.

>  arch/x86/kvm/x86.c              |    3 +-
>  3 files changed, 2236 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/kvm/vmx/tdx.c
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 17d6e4bcf84b..f10c7c2830e5 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1111,6 +1111,7 @@ struct kvm_arch {
>  	u64 last_tsc_write;
>  	u32 last_tsc_khz;
>  	u64 last_tsc_offset;
> +	u32 initial_tsc_khz;
>  	u64 cur_tsc_nsec;
>  	u64 cur_tsc_write;
>  	u64 cur_tsc_offset;
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> new file mode 100644
> index 000000000000..64b2841064c4

< SNIP>

Yes, the above would be the patch which would be expected according to
the changelog and the subject line.

And no. Throwing a 2000+ lines patch over the fence which builds up the
complete infrastructure for TDX in KVM is not how that works. This patch
(aside of the silly changelog) is an unreviewable maze.

So far in this series, there was a continuous build up of infrastructure
in reviewable chunks and then you expect a reviewer to digest 2000+
lines at once for no reason and with the most disconneted changelog
ever?

This can be done structured and gradually, really.

     1) Add the basic infrastructure

     2) Add functionality piece by piece

There is no fundamental dependency up to the point where you enable TDX,
but there is a fundamental difference of reviewing 2000+ lines of code
at once or reviewing a gradual build up of 2000+ lines of code in small
pieces with proper changelogs for each of them.

You can argue that my request is unreasonable until you are blue in
your face, it's not going to lift my NAK on this.

I've mopped up enough half baken crap in x86/kvm over the years and I
have absolutely no interest at all to mop up after you again.

x86/kvm is not a special part of the kernel and neither exempt from
general kernel process nor from x86 specific scrunity rules.

That said, I'm stopping the review right here simply because looking at
any further changes does not make any sense at all.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 26/59] KVM: x86: Introduce vm_teardown() hook in kvm_arch_vm_destroy()
  2021-11-25 20:54     ` Paolo Bonzini
@ 2021-11-25 21:11       ` Thomas Gleixner
  2021-11-29 18:16         ` Sean Christopherson
  0 siblings, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 21:11 UTC (permalink / raw)
  To: Paolo Bonzini, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, Sean Christopherson

On Thu, Nov 25 2021 at 21:54, Paolo Bonzini wrote:
> On 11/25/21 20:46, Thomas Gleixner wrote:
>> On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
>>> Add a second kvm_x86_ops hook in kvm_arch_vm_destroy() to support TDX's
>>> destruction path, which needs to first put the VM into a teardown state,
>>> then free per-vCPU resource, and finally free per-VM resources.
>>>
>>> Note, this knowingly creates a discrepancy in nomenclature for SVM as
>>> svm_vm_teardown() invokes avic_vm_destroy() and sev_vm_destroy().
>>> Moving the now-misnamed functions or renaming them is left to a future
>>> patch so as not to introduce a functional change for SVM.
>> That's just the wrong way around. Fixup SVM first and then add the TDX
>> muck on top. Stop this 'left to a future patch' nonsense. I know for
>> sure that those future patches never materialize.
>
> Or just keep vm_destroy for the "early" destruction, and give a new name 
> to the new hook.  It is used to give back the TDCS memory, so perhaps 
> you can call it vm_free?

Up to you, but the current approach is bogus. I rather go for a fully
symmetric interface and let the various incarnations opt in at the right
place. Similar to what cpu hotplug states are implementing.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 54/59] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch
  2021-11-25 21:05   ` Thomas Gleixner
@ 2021-11-25 22:13     ` Paolo Bonzini
  2021-11-25 22:59       ` Thomas Gleixner
                         ` (2 more replies)
  0 siblings, 3 replies; 123+ messages in thread
From: Paolo Bonzini @ 2021-11-25 22:13 UTC (permalink / raw)
  To: Thomas Gleixner, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, Xiaoyao Li

On 11/25/21 22:05, Thomas Gleixner wrote:
> You can argue that my request is unreasonable until you are blue in
> your face, it's not going to lift my NAK on this.

There's no need for that.  I'd be saying the same, and I don't think 
it's particularly helpful that you made it almost a personal issue.

While in this series there is a separation of changes to existing code 
vs. new code, what's not clear is _why_ you have all those changes. 
These are not code cleanups or refactorings that can stand on their own 
feet; lots of the early patches are actually part of the new 
functionality.  And being in the form of "add an argument here" or 
"export a function there", it's not really easy (or feasible) to review 
them without seeing how the new functionality is used, which requires a 
constant back and forth between early patches and the final 2000 line file.

In some sense, the poor commit messages at the beginning of the series 
are just a symptom of not having any meat until too late, and then 
dropping it all at once.  There's only so much that you can say about an 
EXPORT_SYMBOL_GPL, the real thing to talk about is probably the thing 
that refers to that symbol.

If there are some patches that are actually independent, go ahead and 
submit them early.  But more practically, for the bulk of the changes 
what you need to do is:

1) incorporate into patch 55 a version of tdx.c that essentially does 
KVM_BUG_ON or WARN_ON for each function.  Temporarily keep the same huge 
patch that adds the remaining 2000 lines of tdx.c

2) squash the tdx.c stub with patch 44.

3) gather a strace of QEMU starting up a TDX domain.

4) figure out which parts of the code are needed to run until the first 
ioctl.  Make that a first patch.

5) repeat step 4 until you have covered all the code

5) Move the new "KVM: VMX: Add 'main.c' to wrap VMX and TDX" (which also 
adds the tdx.c stub) as possible in the series.

6) Move each of the new patches as early as possible in the series.

7) Look for candidates for squashing (e.g. commit messages that say it's 
"used later"; now the use should be very close and the two can be 
merged).  Add to the commit message a note about changes outside VMX.

The resulting series may not be perfect, but it would be a much better 
starting point for review.

Paolo

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 54/59] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch
  2021-11-25 22:13     ` Paolo Bonzini
@ 2021-11-25 22:59       ` Thomas Gleixner
  2021-11-25 23:26       ` Thomas Gleixner
  2021-11-29 23:38       ` Sean Christopherson
  2 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 22:59 UTC (permalink / raw)
  To: Paolo Bonzini, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, Xiaoyao Li

Paolo,

On Thu, Nov 25 2021 at 23:13, Paolo Bonzini wrote:

> On 11/25/21 22:05, Thomas Gleixner wrote:
>> You can argue that my request is unreasonable until you are blue in
>> your face, it's not going to lift my NAK on this.
>
> There's no need for that.  I'd be saying the same, and I don't think 
> it's particularly helpful that you made it almost a personal issue.

To be entirely clear. It's a personal issue.

Being on the receiving end of a firehose which spills liquid manure for
more than 15 years is not a pleasant experience at all.

Being accused to have unreasonable, unactionable and even bizarre review
requests (mostly behind the scenes) is not helping either.

I'm dead tired of sitting down for hours explaining basics in lengthy
emails over and over just to get yet another experience of being ignored
after a couple of months filled with silence.

While I did not do that in this particular context, I read back on the
previous incarnations of this crud and found out that you are in the
exact same situation.

So what's the difference? Nothing at all.

It's about time that someone speaks up in unambiguous words to make it
entirely clear that this is not acceptable at all in the kernel
community as a whole and not just because a single person being
unreasonable or whatever.

Both of us wasted enough time already to explain how to do it
right. There is no point anymore to proliferate any of these political
correctness games further.

Thanks,

        Thomas

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 54/59] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch
  2021-11-25 22:13     ` Paolo Bonzini
  2021-11-25 22:59       ` Thomas Gleixner
@ 2021-11-25 23:26       ` Thomas Gleixner
  2021-11-26  7:56         ` Paolo Bonzini
  2021-11-29 23:38       ` Sean Christopherson
  2 siblings, 1 reply; 123+ messages in thread
From: Thomas Gleixner @ 2021-11-25 23:26 UTC (permalink / raw)
  To: Paolo Bonzini, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, Xiaoyao Li

Paolo,

On Thu, Nov 25 2021 at 23:13, Paolo Bonzini wrote:
> On 11/25/21 22:05, Thomas Gleixner wrote:
> If there are some patches that are actually independent, go ahead and 
> submit them early.  But more practically, for the bulk of the changes 
> what you need to do is:
>
> 1) incorporate into patch 55 a version of tdx.c that essentially does 
> KVM_BUG_ON or WARN_ON for each function.  Temporarily keep the same huge 
> patch that adds the remaining 2000 lines of tdx.c

There is absolutely no reason to populate anything upfront at all.

Why?

Simply because that whole muck cannot work until all pieces are in place.

That's the classic PoC vs. usable code situation. I.e. you know what
needs to be done from a PoC point of view and you have to get there by
building up the infrastructure piece by piece.

So why would you provide:

handle_A(...) { BUG(); }
..
handle_Z(...) { BUG(); }

with all the invocations in the common emulation dispatcher:

  handle_stuff(reason)
  {
        switch(reason) {
        case A: return handle_A();
        ...
        case Z: return handle_Z();
        default: BUG();
        }
  };

in the first place instead of providing:

 1)
        handle_stuff(reason)
        {
                switch(reason) {
                default: BUG();
                };
        }

 2)
        handle_A(...)
        {
                return do_something_understandable()
        }
        
        handle_stuff(reason)
        {
                switch(reason) {
                case A: return handle_A();
                default: BUG();
        }

 $Z)
        handle_Z(...)
        {
                return do_something_understandable()
        }
        
        handle_stuff(reason)
        {
                switch(reason) {
                case A: return handle_A();
                ...
                case Z: return handle_Z();
                default: BUG();
        }

where each of (1..$Z) is a single patch doing exactly _ONE_ thing at a
time and if there is some extra magic required like an additional export
for handle_Y() then this export patch is either part of the patch
dealing with $Y or just before that.

In both scenarious you cannot boot a TDX guest until you reached $Z, but
in the gradual one you and the reviewers have the pleasure of looking at
one thing at a time.

No?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 54/59] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch
  2021-11-25 23:26       ` Thomas Gleixner
@ 2021-11-26  7:56         ` Paolo Bonzini
  0 siblings, 0 replies; 123+ messages in thread
From: Paolo Bonzini @ 2021-11-26  7:56 UTC (permalink / raw)
  To: Thomas Gleixner, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, Xiaoyao Li

On 11/26/21 00:26, Thomas Gleixner wrote:
> Paolo,
> 
> On Thu, Nov 25 2021 at 23:13, Paolo Bonzini wrote:
>> On 11/25/21 22:05, Thomas Gleixner wrote:
>> If there are some patches that are actually independent, go ahead and
>> submit them early.  But more practically, for the bulk of the changes
>> what you need to do is:
>>
>> 1) incorporate into patch 55 a version of tdx.c that essentially does
>> KVM_BUG_ON or WARN_ON for each function.  Temporarily keep the same huge
>> patch that adds the remaining 2000 lines of tdx.c
> 
> There is absolutely no reason to populate anything upfront at all.
> Why?
> 
> Simply because that whole muck cannot work until all pieces are in place.

It can, sort of.  It cannot run a complete guest, but it could in 
principle run a toy guest with a custom userspace, like the ones that 
make up tools/testing/selftests/kvm.  (Note that KVM_BUG_ON marks the VM 
as bugged but doesn't hang the whole machine).

AMD was working on infrastructure to do this for SEV and SEV-ES.

> So why would you provide:
> 
> handle_A(...) { BUG(); }
> ..
> handle_Z(...) { BUG(); }
> 
> with all the invocations in the common emulation dispatcher:
> 
>    handle_stuff(reason)
>    {
>          switch(reason) {
>          case A: return handle_A();
>          ...
>          case Z: return handle_Z();
>          default: BUG();
>          }
>    };

If it's a switch statement that's good, but the common case is more 
similar to this:

  vmx_handle_A(...) { ... }
+tdx_handle_A(...) { ... }
+
+vt_handle_A(...) {
+    if (is_tdx(vcpu->kvm))
+        tdx_handle_A(...);
+    else
+        vmx_handle_A(...);
+}

...

-     .handle_A = vmx_handle_A,
+     .handle_A = vt_handle_A,

And you could indeed do it in a single patch, without adding the stub 
tdx_handle_A upfront.  But you would have code that is broken and who 
knows what the effects would be of calling vmx_handle_A on a TDX virtual 
machine.  It could be an error, or it could be memory corruption.


> In both scenarious you cannot boot a TDX guest until you reached $Z, but
> in the gradual one you and the reviewers have the pleasure of looking at
> one thing at a time.

I think both of them are gradual.  Not having the stubs might be a 
little more gradual, but it is a very minor issue for the reviewability 
of the whole thing.

Paolo

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 23/59] KVM: x86: Allow host-initiated WRMSR to set X2APIC regardless of CPUID
  2021-11-25 19:41   ` Thomas Gleixner
@ 2021-11-26  8:18     ` Paolo Bonzini
  2021-11-29 21:21       ` Sean Christopherson
  0 siblings, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2021-11-26  8:18 UTC (permalink / raw)
  To: Thomas Gleixner, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm
  Cc: isaku.yamahata, Sean Christopherson

On 11/25/21 20:41, Thomas Gleixner wrote:
> On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
>> Let userspace, or in the case of TDX, KVM itself, enable X2APIC even if
>> X2APIC is not reported as supported in the guest's CPU model.  KVM
>> generally does not force specific ordering between ioctls(), e.g. this
>> forces userspace to configure CPUID before MSRs.  And for TDX, vCPUs
>> will always run with X2APIC enabled, e.g. KVM will want/need to enable
>> X2APIC from time zero.
> 
> This is complete crap. Fix the broken user space and do not add
> horrible hacks to the kernel.

tl;dr: I agree that it's a userspace issue but "configure CPUID before 
MSR" is not the issue (in fact QEMU calls KVM_SET_CPUID2 before any call 
to KVM_SET_MSRS).

We have quite a few other cases in which KVM_GET/SET_MSR is allowed to 
get/set MSRs in ways that the guests are not allowed to do.

In general, there are several reasons for this:

- simplifying userspace so that it can use the same list of MSRs for all 
guests (likely, the list that KVM provides with KVM_GET_MSR_INDEX_LIST). 
  For example MSR_TSC_AUX is only exposed to the guest if RDTSCP or 
RDPID are available, but the host can always access it.  This is usually 
the reason why host accesses to MSRs override CPUID.

- simplifying userspace so that it does not have to go through the 
various steps of a state machine; for example, it's okay if userspace 
goes DISABLED->X2APIC instead of having to do DISABLED->XAPIC->X2APIC.

- allowing userspace to set a reset value, for example overriding the 
lock bit in MSR_IA32_FEAT_CTL.

- read-only MSRs that are really "CPUID-like", i.e. they give the guest 
information about processor features (for example the VMX feature MSRs)

- MSRs had some weird limitations that were lifted later by introducing 
additional MSRs; for example KVM always allows the host to write to the 
full-width MSR_IA32_PMC0 counters, because they are a saner version of 
MSR_IA32_PERFCTR0 and there's no reason for userspace to inflict 
MSR_IA32_PERFCTR0 on userspace.

So the host_initiated check doesn't _necessarily_ count as a horrible 
hack in the kernel.  However, in this case we have a trusted domain 
without X2APIC.  I'm not sure this configuration is clearly bogus.  One 
could imagine special-purpose VMs that don't need interrupts at all. 
For full guests such as the ones that QEMU runs, I agree with Thomas 
that userspace must be fixed to enforce x2apic for TDX guests.

Paolo

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 53/59] KVM: x86: Add a helper function to restore 4 host MSRs on exit to user space
  2021-11-25 20:34   ` Thomas Gleixner
@ 2021-11-26  9:19     ` Chao Gao
  2021-11-26  9:40       ` Paolo Bonzini
  2021-11-29  7:08       ` Lai Jiangshan
  0 siblings, 2 replies; 123+ messages in thread
From: Chao Gao @ 2021-11-26  9:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, Sean Christopherson,
	linux-kernel, kvm, isaku.yamahata

On Thu, Nov 25, 2021 at 09:34:59PM +0100, Thomas Gleixner wrote:
>On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
>> From: Chao Gao <chao.gao@intel.com>
>
>> $Subject: KVM: x86: Add a helper function to restore 4 host MSRs on exit to user space
>
>Which user space are you talking about? This subject line is misleading

Host Ring3.

>at best. The unconditional reset is happening when a TDX VM exits
>because the SEAM firmware enforces this to prevent unformation leaks.

Yes.

>
>It also does not matter whether this are four or ten MSR.

Indeed, the number of MSRs doesn't matter.

>Fact is that
>the SEAM firmware is buggy because it does not save/restore those MSRs.

It is done deliberately. It gives host a chance to do "lazy" restoration.
"lazy" means don't save/restore them on each TD entry/exit but defer
restoration to when it is neccesary e.g., when vCPU is scheduled out or
when kernel is about to return to Ring3.

>
>So the proper subject line is:
>
>   KVM: x86: Add infrastructure to handle MSR corruption by broken TDX firmware

I rewrote the commit message:

    KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr

    Several MSRs are constant and only used in userspace. But VMs may have
    different values. KVM uses kvm_set_user_return_msr() to switch to guest's
    values and leverages user return notifier to restore them when kernel is
    to return to userspace. In order to save unnecessary wrmsr, KVM also caches
    the value it wrote to a MSR last time.

    TDX module unconditionally resets some of these MSRs to architectural INIT
    state on TD exit. It makes the cached values in kvm_user_return_msrs are
    inconsistent with values in hardware. This inconsistency needs to be fixed
    otherwise, it may mislead kvm_on_user_return() to skip restoring some MSRs
    to host's values. kvm_set_user_return_msr() can help to correct this case
    but it is not optimal as it always does a wrmsr. So, introduce a variation
    of kvm_set_user_return_msr() to update the cached value but skip the wrmsr.

>
>> The TDX module unconditionally reset 4 host MSRs (MSR_SYSCALL_MASK,
>> MSR_START, MSR_LSTAR, MSR_TSC_AUX) to architectural INIT state on exit from
>> TDX VM to KVM.  KVM needs to save their values before TD enter and restore
>> them on exit to userspace.
>>
>> Reuse current kvm_user_return mechanism and introduce a function to update
>> cached values and register the user return notifier in this new function.
>>
>> The later patch will use the helper function to save/restore 4 host
>> MSRs.
>
>'The later patch ...' is useless information. Of course there will be a
>later patch to make use of this which is implied by 'Add infrastructure
>...'. Can we please get rid of these useless phrases which have no value
>at patch submission time and are even more confusing once the pile is
>merged?

Of course. Will remove all "later patch" phrases.

Thanks
Chao

>
>Thanks,
>
>        tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 53/59] KVM: x86: Add a helper function to restore 4 host MSRs on exit to user space
  2021-11-26  9:19     ` Chao Gao
@ 2021-11-26  9:40       ` Paolo Bonzini
  2021-11-29  7:08       ` Lai Jiangshan
  1 sibling, 0 replies; 123+ messages in thread
From: Paolo Bonzini @ 2021-11-26  9:40 UTC (permalink / raw)
  To: Chao Gao, Thomas Gleixner
  Cc: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	erdemaktas, Connor Kuehl, Sean Christopherson, linux-kernel, kvm,
	isaku.yamahata

On 11/26/21 10:19, Chao Gao wrote:
> Of course. Will remove all "later patch" phrases.

Chao, see my comment on patch 54, for how to do this.  In this case, 
this patch would be right before the one that enables KVM_RUN on a TDX 
module.

Paolo


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 16/59] KVM: x86: Add per-VM flag to disable direct IRQ injection
  2021-11-25  0:19 ` [RFC PATCH v3 16/59] KVM: x86: Add per-VM flag to disable direct IRQ injection isaku.yamahata
  2021-11-25 19:31   ` Thomas Gleixner
@ 2021-11-29  2:49   ` Lai Jiangshan
  1 sibling, 0 replies; 123+ messages in thread
From: Lai Jiangshan @ 2021-11-29  2:49 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Erdem Aktas, Connor Kuehl, Sean Christopherson,
	LKML, kvm, isaku.yamahata, Sean Christopherson

On Fri, Nov 26, 2021 at 3:44 PM <isaku.yamahata@intel.com> wrote:

>  static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu)
>  {
>         return vcpu->run->request_interrupt_window &&
> +              !vcpu->kvm->arch.irq_injection_disallowed &&
>                 likely(!pic_in_kernel(vcpu->kvm));
>  }


Just judged superficially by the function names, it seems that the
logic is better to be put in kvm_cpu_accept_dm_intr() or some deeper
function nested in kvm_cpu_accept_dm_intr().

The function name will tell us that the interrupt is not injected
because the CPU doesn't accept it.  And it will also have an effect
that vcpu->run->ready_for_interrupt_injection will always be false
which I think is better to have for TDX.

>
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 53/59] KVM: x86: Add a helper function to restore 4 host MSRs on exit to user space
  2021-11-26  9:19     ` Chao Gao
  2021-11-26  9:40       ` Paolo Bonzini
@ 2021-11-29  7:08       ` Lai Jiangshan
  2021-11-29  9:26         ` Chao Gao
  1 sibling, 1 reply; 123+ messages in thread
From: Lai Jiangshan @ 2021-11-29  7:08 UTC (permalink / raw)
  To: Chao Gao
  Cc: Thomas Gleixner, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Erdem Aktas, Connor Kuehl,
	Sean Christopherson, LKML, kvm, isaku.yamahata

On Mon, Nov 29, 2021 at 2:00 AM Chao Gao <chao.gao@intel.com> wrote:
>
> On Thu, Nov 25, 2021 at 09:34:59PM +0100, Thomas Gleixner wrote:
> >On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> >> From: Chao Gao <chao.gao@intel.com>
> >
> >> $Subject: KVM: x86: Add a helper function to restore 4 host MSRs on exit to user space
> >
> >Which user space are you talking about? This subject line is misleading
>
> Host Ring3.
>
> >at best. The unconditional reset is happening when a TDX VM exits
> >because the SEAM firmware enforces this to prevent unformation leaks.
>
> Yes.
>
> >
> >It also does not matter whether this are four or ten MSR.
>
> Indeed, the number of MSRs doesn't matter.
>
> >Fact is that
> >the SEAM firmware is buggy because it does not save/restore those MSRs.
>
> It is done deliberately. It gives host a chance to do "lazy" restoration.
> "lazy" means don't save/restore them on each TD entry/exit but defer
> restoration to when it is neccesary e.g., when vCPU is scheduled out or
> when kernel is about to return to Ring3.
>
> The TDX module unconditionally reset 4 host MSRs (MSR_SYSCALL_MASK,
> MSR_START, MSR_LSTAR, MSR_TSC_AUX) to architectural INIT state on exit from
> TDX VM to KVM.

I did not find the information in intel-tdx-module-1eas.pdf nor
intel-tdx-cpu-architectural-specification.pdf.

Maybe the version I downloaded is outdated.

I guess that the "lazy" restoration mode is not a valid optimization.
The SEAM module should restore it to the original value when it tries
to reset it to architectural INIT state on exit from TDX VM to KVM
since the SEAM module also does it via wrmsr (correct me if not).

If the SEAM module doesn't know "the original value" of the these
MSRs, it would be mere an optimization to save an rdmsr in SEAM.
But there are a lot of other ways for the host to share the values
to SEAM in zero overhead.

Could you provide more information?

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 53/59] KVM: x86: Add a helper function to restore 4 host MSRs on exit to user space
  2021-11-29  7:08       ` Lai Jiangshan
@ 2021-11-29  9:26         ` Chao Gao
  2021-11-30  4:58           ` Lai Jiangshan
  0 siblings, 1 reply; 123+ messages in thread
From: Chao Gao @ 2021-11-29  9:26 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Thomas Gleixner, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Erdem Aktas, Connor Kuehl,
	Sean Christopherson, LKML, kvm, isaku.yamahata

On Mon, Nov 29, 2021 at 03:08:39PM +0800, Lai Jiangshan wrote:
>On Mon, Nov 29, 2021 at 2:00 AM Chao Gao <chao.gao@intel.com> wrote:
>>
>> On Thu, Nov 25, 2021 at 09:34:59PM +0100, Thomas Gleixner wrote:
>> >On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
>> >> From: Chao Gao <chao.gao@intel.com>
>> >
>> >> $Subject: KVM: x86: Add a helper function to restore 4 host MSRs on exit to user space
>> >
>> >Which user space are you talking about? This subject line is misleading
>>
>> Host Ring3.
>>
>> >at best. The unconditional reset is happening when a TDX VM exits
>> >because the SEAM firmware enforces this to prevent unformation leaks.
>>
>> Yes.
>>
>> >
>> >It also does not matter whether this are four or ten MSR.
>>
>> Indeed, the number of MSRs doesn't matter.
>>
>> >Fact is that
>> >the SEAM firmware is buggy because it does not save/restore those MSRs.
>>
>> It is done deliberately. It gives host a chance to do "lazy" restoration.
>> "lazy" means don't save/restore them on each TD entry/exit but defer
>> restoration to when it is neccesary e.g., when vCPU is scheduled out or
>> when kernel is about to return to Ring3.
>>
>> The TDX module unconditionally reset 4 host MSRs (MSR_SYSCALL_MASK,
>> MSR_START, MSR_LSTAR, MSR_TSC_AUX) to architectural INIT state on exit from
>> TDX VM to KVM.
>
>I did not find the information in intel-tdx-module-1eas.pdf nor
>intel-tdx-cpu-architectural-specification.pdf.
>
>Maybe the version I downloaded is outdated.

Hi Jiangshan,

Please refer to Table 22.162 MSRs that may be Modified by TDH.VP.ENTER,
in section 22.2.40 TDH.VP.ENTER leaf.

>
>I guess that the "lazy" restoration mode is not a valid optimization.
>The SEAM module should restore it to the original value when it tries
>to reset it to architectural INIT state on exit from TDX VM to KVM
>since the SEAM module also does it via wrmsr (correct me if not).

Correct.

>
>If the SEAM module doesn't know "the original value" of the these
>MSRs, it would be mere an optimization to save an rdmsr in SEAM.

Yes. Just a rdmsr is saved in TDX module at the cost of host's
restoring a MSR. If restoration (wrmsr) can be done in a lazy fashion
or even the MSR isn't used by host, some CPU cycles can be saved.

>But there are a lot of other ways for the host to share the values
>to SEAM in zero overhead.

I am not sure. Looks it requests a new interface between host and TDX
module. I guess one problem is how/when to verify host's inputs in case
they are invalid.

Thanks
Chao

>
>Could you provide more information?

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 14/59] KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs
  2021-11-25 19:08   ` Thomas Gleixner
@ 2021-11-29 17:35     ` Sean Christopherson
  2021-12-01 19:37       ` Isaku Yamahata
  0 siblings, 1 reply; 123+ messages in thread
From: Sean Christopherson @ 2021-11-29 17:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, linux-kernel, kvm,
	isaku.yamahata, Sean Christopherson, Xiaoyao Li

On Thu, Nov 25, 2021, Thomas Gleixner wrote:
> On Wed, Nov 24 2021 at 16:19, isaku yamahata wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> >
> > Add a capability to effectively allow userspace to query what VM types
> > are supported by KVM.
> 
> I really don't see why this has to be named legacy. There are enough
> reasonable use cases which are perfectly fine using the non-encrypted
> muck. Just because there is a new hyped feature does not make anything
> else legacy.

Yeah, this was brought up in the past.  The current proposal is to use
KVM_X86_DEFAULT_VM[1], though at one point the plan was to use a generic
KVM_VM_TYPE_DEFAULT for all architectures[2], not sure what happened to that idea.

[1] https://lore.kernel.org/all/YY6aqVkHNEfEp990@google.com/
[2] https://lore.kernel.org/all/YQsjQ5aJokV1HZ8N@google.com/

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 18/59] KVM: x86: Add flag to mark TSC as immutable (for TDX)
  2021-11-25 19:40   ` Thomas Gleixner
@ 2021-11-29 18:05     ` Sean Christopherson
  0 siblings, 0 replies; 123+ messages in thread
From: Sean Christopherson @ 2021-11-29 18:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, linux-kernel, kvm,
	isaku.yamahata, Sean Christopherson, Kai Huang

On Thu, Nov 25, 2021, Thomas Gleixner wrote:
> On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> >
> > The TSC for TDX1 guests is fixed at TD creation time.  Add tsc_immutable
> 
> What's a TDX1 guest?

The "revision 1.0" version of TDX.  Some of these patches use "TDX1" to identify
behaviors that may not hold true in future iterations of TDX, and also to highlight
things that are dictated by the spec, e.g. some of the guest TSC frequency values.
For this patch in particular, there's probably no need to differentiate TDX1 vs. TDX,
the qualification was more for cases where KVM needs to define magic values to adhere
to the spec, e.g. to make it clear that the magic values aren't made up by KVM.

> > to reflect that the TSC of the guest cannot be changed in any way, and
> > use it to short circuit all paths that lead to one of the myriad TSC
> > adjustment flows.
> 
> I can kinda see the reason for this being valuable on it's own, but in
> general why does TDX need a gazillion flags to disable tons of different
> things if _ALL_ these flags are going to be set by for TDX guests
> anyway?
> 
> Seperate flags make only sense when they have a value on their own,
> i.e. are useful for things outside of TDX. If not they are just useless
> ballast.

SEV-SNP and TDX have different, but overlapping, restrictions.  And SEV-ES also
shares most SEV-SNP's restrictions.  TDX guests that can be debugged and/or profiled
also have different restrictions, though I forget if any of these flags would be
affected.

The goal with individual flags is to avoid seemingly arbitrary is_snp_guest() and
is_tdx_guest() checks throughout common x86 code, e.g. to avoid confusion over why
KVM does X for TDX but Y for SNP.  And I personally find it easer to audit KVM
behavior with respect to the SNP/TDX specs if the non-obvious restrictions are
explicitly set when the VM is created.

For some of the flags, there's also hope that future iterations of TDX will remove
some of the restrictions, though that's more of a bonus than a direct justification
for adding individual flags.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 26/59] KVM: x86: Introduce vm_teardown() hook in kvm_arch_vm_destroy()
  2021-11-25 21:11       ` Thomas Gleixner
@ 2021-11-29 18:16         ` Sean Christopherson
  0 siblings, 0 replies; 123+ messages in thread
From: Sean Christopherson @ 2021-11-29 18:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paolo Bonzini, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, linux-kernel, kvm,
	isaku.yamahata, Sean Christopherson

On Thu, Nov 25, 2021, Thomas Gleixner wrote:
> On Thu, Nov 25 2021 at 21:54, Paolo Bonzini wrote:
> > On 11/25/21 20:46, Thomas Gleixner wrote:
> >> On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> >>> Add a second kvm_x86_ops hook in kvm_arch_vm_destroy() to support TDX's
> >>> destruction path, which needs to first put the VM into a teardown state,
> >>> then free per-vCPU resource, and finally free per-VM resources.
> >>>
> >>> Note, this knowingly creates a discrepancy in nomenclature for SVM as
> >>> svm_vm_teardown() invokes avic_vm_destroy() and sev_vm_destroy().
> >>> Moving the now-misnamed functions or renaming them is left to a future
> >>> patch so as not to introduce a functional change for SVM.
> >> That's just the wrong way around. Fixup SVM first and then add the TDX
> >> muck on top. Stop this 'left to a future patch' nonsense. I know for
> >> sure that those future patches never materialize.
> >
> > Or just keep vm_destroy for the "early" destruction, and give a new name 
> > to the new hook.  It is used to give back the TDCS memory, so perhaps 
> > you can call it vm_free?
> 
> Up to you, but the current approach is bogus. I rather go for a fully
> symmetric interface and let the various incarnations opt in at the right
> place. Similar to what cpu hotplug states are implementing.

Naming aside, that's what is being done, TDX simply needs two hooks instead of one
due to the way KVM handles VM and vCPU destruction.  The alternative would be to
shove and duplicate what is currently common x86 code into VMX/SVM, which IMO is
far worse.

Regarding the naming, I 100% agree SVM should be refactored prior to adding TDX
stuff if we choose to go with vm_teardown() and vm_destroy() instead of Paolo's
suggestion of vm_destroy() and vm_free().  When this patch/code was originally
written, letting SVM become stale was a deliberate choice to reduce conflicts with
upstream as we knew the code would live out of tree for quite some time.  But that
was purely meant to be development "hack", not upstream behavior.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 46/59] KVM: VMX: Move register caching logic to common code
  2021-11-25 20:17     ` Paolo Bonzini
@ 2021-11-29 18:23       ` Sean Christopherson
  2021-11-29 18:28         ` Paolo Bonzini
  0 siblings, 1 reply; 123+ messages in thread
From: Sean Christopherson @ 2021-11-29 18:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Thomas Gleixner, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, linux-kernel, kvm,
	isaku.yamahata, Sean Christopherson

On Thu, Nov 25, 2021, Paolo Bonzini wrote:
> On 11/25/21 21:11, Thomas Gleixner wrote:
> > > 
> > > Use kvm_x86_ops.cache_reg() in ept_update_paging_mode_cr0() rather than
> > > trying to expose vt_cache_reg() to vmx.c, even though it means taking a
> > > retpoline.  The code runs if and only if EPT is enabled but unrestricted
> > > guest.
> > This sentence does not parse because it's not a proper sentence.

Heh, supposed to be "... but unrestricted guest is disabled".

> > > Only one generation of CPU, Nehalem, supports EPT but not
> > > unrestricted guest, and disabling unrestricted guest without also
> > > disabling EPT is, to put it bluntly, dumb.
> > This one is only significantly better and lacks an explanation what this
> > means for the dumb case.
> 
> Well, it means a retpoline (see paragraph before).

No, the point being made is that, on a CPU that supports Unrestricted Guest (UG),
disabling UG without disabling EPT is really, really stupid.  UG requires EPT, so
disabling EPT _and_ UG is reasonable as there are scenarios where using shadow
paging is desirable.  But inentionally disabling UG and enabling EPT makes no
sense.  It forces KVM to emulate non-trivial amounts of guest code and has zero
benefits for anything other than testing KVM itself.

> why it one wouldn't create a vt.h header with all vt_* functions.
> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 46/59] KVM: VMX: Move register caching logic to common code
  2021-11-29 18:23       ` Sean Christopherson
@ 2021-11-29 18:28         ` Paolo Bonzini
  0 siblings, 0 replies; 123+ messages in thread
From: Paolo Bonzini @ 2021-11-29 18:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, linux-kernel, kvm,
	isaku.yamahata, Sean Christopherson

On 11/29/21 19:23, Sean Christopherson wrote:
>>>> Only one generation of CPU, Nehalem, supports EPT but not
>>>> unrestricted guest, and disabling unrestricted guest without also
>>>> disabling EPT is, to put it bluntly, dumb.
>>> This one is only significantly better and lacks an explanation what this
>>> means for the dumb case.
>> Well, it means a retpoline (see paragraph before).
>
> No, the point being made is that, on a CPU that supports Unrestricted Guest (UG),
> disabling UG without disabling EPT is really, really stupid.

Yes, I understand that.

Thomas was asking what it means to "Move register caching logic to 
common code", i.e. what the consequences are.  The missing words at the 
end of the first paragraph didn't make the connection obvious between 
the extra retpoline and the "dumb case".

Paolo

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 47/59] KVM: TDX: Define TDCALL exit reason
  2021-11-25 20:19   ` Thomas Gleixner
@ 2021-11-29 18:36     ` Sean Christopherson
  0 siblings, 0 replies; 123+ messages in thread
From: Sean Christopherson @ 2021-11-29 18:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: isaku.yamahata, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, linux-kernel, kvm,
	isaku.yamahata, Sean Christopherson, Xiaoyao Li

On Thu, Nov 25, 2021, Thomas Gleixner wrote:
> On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> >
> > Define the TDCALL exit reason, which is carved out from the VMX exit
> > reason namespace as the TDCALL exit from TDX guest to TDX-SEAM is really
> > just a VM-Exit.
> 
> How is this carved out? What's the value of this word salad?
> 
> It's simply a new exit reason. Not more, not less. So what?

The changelog is alluding to the fact that KVM should never directly see a TDCALL
VM-Exit.  For TDX, KVM deals only with "returns" from the TDX-Module.  The "carved
out" bit is calling out that the transition from SEAM Non-Root (the TDX guest) to
SEAM Root (the TDX Module) is actually a VT-x/VMX VM-Exit, e.g. if TDX were somehow
implemented without relying on VT-x/VMX, then the TDCALL exit reason wouldn't exist.

> > Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> I'm pretty sure that it does not take two engineers to add a new exit
> reason define, but it takes at least two engineers to come up with a
> convoluted explanation for it.

Nah, just one ;-)

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 28/59] KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events()
  2021-11-25 20:50   ` Paolo Bonzini
@ 2021-11-29 19:20     ` Sean Christopherson
  0 siblings, 0 replies; 123+ messages in thread
From: Sean Christopherson @ 2021-11-29 19:20 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: isaku.yamahata, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, linux-kernel, kvm,
	isaku.yamahata, Sean Christopherson

On Thu, Nov 25, 2021, Paolo Bonzini wrote:
> On 11/25/21 01:20, isaku.yamahata@intel.com wrote:
> > From: Sean Christopherson<sean.j.christopherson@intel.com>
> > 
> > Return true for kvm_vcpu_has_events() if the vCPU has a pending APICv
> > interrupt to support TDX's usage of APICv.  Unlike VMX, TDX doesn't have
> > access to vmcs.GUEST_INTR_STATUS and so can't emulate posted interrupts,
> > i.e. needs to generate a posted interrupt and more importantly can't
> > manually move requested interrupts into the vIRR (which it also doesn't
> > have access to).
> 
> Does this mean it is impossible to disable APICv on TDX?  If so, please add
> a WARN.

Yes, APICv is forced.

Rereading this patch, checking only for a pending posted interrupt isn't correct,
a pending interrupt that's below the PPR shouldn't be considered a wake event.

A much better approach would be to have vt_sync_pir_to_irr() redirect to a TDX
implementation to read the PIR but not update the vIRR, that way common x86 doesn't
need to be touched.  Hopefully that can be done in a race-free way.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 23/59] KVM: x86: Allow host-initiated WRMSR to set X2APIC regardless of CPUID
  2021-11-26  8:18     ` Paolo Bonzini
@ 2021-11-29 21:21       ` Sean Christopherson
  0 siblings, 0 replies; 123+ messages in thread
From: Sean Christopherson @ 2021-11-29 21:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Thomas Gleixner, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, linux-kernel, kvm,
	isaku.yamahata, Sean Christopherson

On Fri, Nov 26, 2021, Paolo Bonzini wrote:
> On 11/25/21 20:41, Thomas Gleixner wrote:
> > On Wed, Nov 24 2021 at 16:20, isaku yamahata wrote:
> > > Let userspace, or in the case of TDX, KVM itself, enable X2APIC even if
                                            ^^^^^^^^^^

> > > X2APIC is not reported as supported in the guest's CPU model.  KVM
> > > generally does not force specific ordering between ioctls(), e.g. this
> > > forces userspace to configure CPUID before MSRs.  And for TDX, vCPUs
> > > will always run with X2APIC enabled, e.g. KVM will want/need to enable
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > X2APIC from time zero.
      ^^^^^^^^^^^^^^^^^^^^^
> > 
> > This is complete crap. Fix the broken user space and do not add
> > horrible hacks to the kernel.
> 
> tl;dr: I agree that it's a userspace issue but "configure CPUID before MSR"
> is not the issue (in fact QEMU calls KVM_SET_CPUID2 before any call to
> KVM_SET_MSRS).

Specifically for TDX, it's not a userspace issue.  To simplify other checks and
to report sane values for KVM_GET_MSRS, KVM forces X2APIC for TDX guests when the
vCPU is created, before its exposed to usersepace.  The bit about not forcing
specific ordering is justification for making the change independent of TDX,
i.e. to call out that APIC_BASE is different from every other MSR, and is even
inconsistent in its own behavior since illegal transitions are allowed when
userspace is stuffing the MSR.

IMO, this patch is valid irrespective of TDX.  It's included in the TDX series
because TDX support forces the issue.

That said, an alternative for TDX would be do handle this in kvm_lapic_reset()
now that the lAPIC RESET flows are consolidated.  Back when this patch was first
written, that wasn't really an option.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 54/59] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch
  2021-11-25 22:13     ` Paolo Bonzini
  2021-11-25 22:59       ` Thomas Gleixner
  2021-11-25 23:26       ` Thomas Gleixner
@ 2021-11-29 23:38       ` Sean Christopherson
  2 siblings, 0 replies; 123+ messages in thread
From: Sean Christopherson @ 2021-11-29 23:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Thomas Gleixner, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, linux-kernel, kvm,
	isaku.yamahata, Xiaoyao Li

On Thu, Nov 25, 2021, Paolo Bonzini wrote:
> On 11/25/21 22:05, Thomas Gleixner wrote:
> > You can argue that my request is unreasonable until you are blue in
> > your face, it's not going to lift my NAK on this.
> 
> There's no need for that.  I'd be saying the same, and I don't think it's
> particularly helpful that you made it almost a personal issue.
> 
> While in this series there is a separation of changes to existing code vs.
> new code, what's not clear is _why_ you have all those changes. These are
> not code cleanups or refactorings that can stand on their own feet; lots of
> the early patches are actually part of the new functionality.  And being in
> the form of "add an argument here" or "export a function there", it's not
> really easy (or feasible) to review them without seeing how the new
> functionality is used, which requires a constant back and forth between
> early patches and the final 2000 line file.
> 
> In some sense, the poor commit messages at the beginning of the series are
> just a symptom of not having any meat until too late, and then dropping it
> all at once.  There's only so much that you can say about an
> EXPORT_SYMBOL_GPL, the real thing to talk about is probably the thing that
> refers to that symbol.
> 
> If there are some patches that are actually independent, go ahead and submit
> them early.  But more practically, for the bulk of the changes what you need
> to do is:
> 
> 1) incorporate into patch 55 a version of tdx.c that essentially does
> KVM_BUG_ON or WARN_ON for each function.  Temporarily keep the same huge
> patch that adds the remaining 2000 lines of tdx.c
> 
> 2) squash the tdx.c stub with patch 44.
> 
> 3) gather a strace of QEMU starting up a TDX domain.
> 
> 4) figure out which parts of the code are needed to run until the first
> ioctl.  Make that a first patch.

Hmm, I don't think this approach will work as well as it did for SEV when applied
at a per-ioctl granuarity, I suspect several patches will end up quite large.   I
completely agree with the overall idea, but I'd encourage the TDX folks to have a
finer granularity where it makes sense, e.g. things like the x2APIC behavior,
immutable TSC, memory management, etc... can probably be sliced up into separate
patches.

> 5) repeat step 4 until you have covered all the code
> 
> 5) Move the new "KVM: VMX: Add 'main.c' to wrap VMX and TDX" (which also
> adds the tdx.c stub) as possible in the series.
> 
> 6) Move each of the new patches as early as possible in the series.
> 
> 7) Look for candidates for squashing (e.g. commit messages that say it's
> "used later"; now the use should be very close and the two can be merged).
> Add to the commit message a note about changes outside VMX.

Generally speaking, I agree.  For the flag exposion, I 100% agree that setting
the flag in TDX, adding it in x86 is best done in a signal patch, and handling
all side effects is best done in a single patch.  

But for things like letting debug TDs access registers, I would prefer not to
actually squash the two (or more) patches.  I agree that two related patches need
to be contiguous in the series, but I'd prefer that things with non-trivial changes,
especially in common code, are kept separate.

> The resulting series may not be perfect, but it would be a much better
> starting point for review.
> 
> Paolo

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 53/59] KVM: x86: Add a helper function to restore 4 host MSRs on exit to user space
  2021-11-29  9:26         ` Chao Gao
@ 2021-11-30  4:58           ` Lai Jiangshan
  2021-11-30  8:19             ` Chao Gao
  0 siblings, 1 reply; 123+ messages in thread
From: Lai Jiangshan @ 2021-11-30  4:58 UTC (permalink / raw)
  To: Chao Gao
  Cc: Thomas Gleixner, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Erdem Aktas, Connor Kuehl,
	Sean Christopherson, LKML, kvm, isaku.yamahata

On Mon, Nov 29, 2021 at 5:16 PM Chao Gao <chao.gao@intel.com> wrote:

> >I did not find the information in intel-tdx-module-1eas.pdf nor
> >intel-tdx-cpu-architectural-specification.pdf.
> >
> >Maybe the version I downloaded is outdated.
>
> Hi Jiangshan,
>
> Please refer to Table 22.162 MSRs that may be Modified by TDH.VP.ENTER,
> in section 22.2.40 TDH.VP.ENTER leaf.

No file in this link:
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
has chapter 22.

>
> >
> >I guess that the "lazy" restoration mode is not a valid optimization.
> >The SEAM module should restore it to the original value when it tries
> >to reset it to architectural INIT state on exit from TDX VM to KVM
> >since the SEAM module also does it via wrmsr (correct me if not).
>
> Correct.
>
> >
> >If the SEAM module doesn't know "the original value" of the these
> >MSRs, it would be mere an optimization to save an rdmsr in SEAM.
>
> Yes. Just a rdmsr is saved in TDX module at the cost of host's
> restoring a MSR. If restoration (wrmsr) can be done in a lazy fashion
> or even the MSR isn't used by host, some CPU cycles can be saved.

But it adds overall overhead because the wrmsr in TDX module
can't be skipped while the unneeded potential overhead of
wrmsr is added in user return path.

If TDX module restores the original MSR value, the host hypervisor
doesn't need to step in.

I think I'm reviewing the code without the code.  It is definitely
wrong design to (ab)use the host's user-return-msr mechanism.

>
> >But there are a lot of other ways for the host to share the values
> >to SEAM in zero overhead.
>
> I am not sure. Looks it requests a new interface between host and TDX
> module. I guess one problem is how/when to verify host's inputs in case
> they are invalid.
>

If the requirement of "lazy restoration" is being added (not seen in the
published document yet), you are changing the ABI between the host and
the TDX module.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 53/59] KVM: x86: Add a helper function to restore 4 host MSRs on exit to user space
  2021-11-30  4:58           ` Lai Jiangshan
@ 2021-11-30  8:19             ` Chao Gao
  2021-11-30 11:18               ` Lai Jiangshan
  0 siblings, 1 reply; 123+ messages in thread
From: Chao Gao @ 2021-11-30  8:19 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Thomas Gleixner, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Erdem Aktas, Connor Kuehl,
	Sean Christopherson, LKML, kvm, isaku.yamahata

On Tue, Nov 30, 2021 at 12:58:47PM +0800, Lai Jiangshan wrote:
>On Mon, Nov 29, 2021 at 5:16 PM Chao Gao <chao.gao@intel.com> wrote:
>
>> >I did not find the information in intel-tdx-module-1eas.pdf nor
>> >intel-tdx-cpu-architectural-specification.pdf.
>> >
>> >Maybe the version I downloaded is outdated.
>>
>> Hi Jiangshan,
>>
>> Please refer to Table 22.162 MSRs that may be Modified by TDH.VP.ENTER,
>> in section 22.2.40 TDH.VP.ENTER leaf.
>
>No file in this link:
>https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
>has chapter 22.

https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf

>
>>
>> >
>> >I guess that the "lazy" restoration mode is not a valid optimization.
>> >The SEAM module should restore it to the original value when it tries
>> >to reset it to architectural INIT state on exit from TDX VM to KVM
>> >since the SEAM module also does it via wrmsr (correct me if not).
>>
>> Correct.
>>
>> >
>> >If the SEAM module doesn't know "the original value" of the these
>> >MSRs, it would be mere an optimization to save an rdmsr in SEAM.
>>
>> Yes. Just a rdmsr is saved in TDX module at the cost of host's
>> restoring a MSR. If restoration (wrmsr) can be done in a lazy fashion
>> or even the MSR isn't used by host, some CPU cycles can be saved.
>
>But it adds overall overhead because the wrmsr in TDX module
>can't be skipped while the unneeded potential overhead of
>wrmsr is added in user return path.

It is a trade-off. In this way, TDX module needn't save some host MSRs.
Of course, host has to restore those MSRs. the overall impact depends
on the frequency of restoring: on every exit from TD VM, on vCPU being
scheduled out or on exit to userspace. In the last two cases, it is
supposed to have less overhead than saving host MSRs and restoring them
on every exit from TD VM.

>
>If TDX module restores the original MSR value, the host hypervisor
>doesn't need to step in.
>
>I think I'm reviewing the code without the code.  It is definitely
>wrong design to (ab)use the host's user-return-msr mechanism.

I cannot follow.

>
>>
>> >But there are a lot of other ways for the host to share the values
>> >to SEAM in zero overhead.
>>
>> I am not sure. Looks it requests a new interface between host and TDX
>> module. I guess one problem is how/when to verify host's inputs in case
>> they are invalid.
>>
>
>If the requirement of "lazy restoration" is being added (not seen in the
>published document yet), you are changing the ABI between the host and
>the TDX module.

No, it is already documented in public spec.

TDX module spec just says some MSRs are reset to INIT state by TDX module
(un)conditionally during TD exit. When to restore these MSRs to host's
values is decided by host.


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 53/59] KVM: x86: Add a helper function to restore 4 host MSRs on exit to user space
  2021-11-30  8:19             ` Chao Gao
@ 2021-11-30 11:18               ` Lai Jiangshan
  0 siblings, 0 replies; 123+ messages in thread
From: Lai Jiangshan @ 2021-11-30 11:18 UTC (permalink / raw)
  To: Chao Gao
  Cc: Thomas Gleixner, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Erdem Aktas, Connor Kuehl,
	Sean Christopherson, LKML, kvm, isaku.yamahata

On Tue, Nov 30, 2021 at 4:10 PM Chao Gao <chao.gao@intel.com> wrote:

>
> No, it is already documented in public spec.

In intel-tdx-module-1.5-base-spec-348549001.pdf, page 79
"recoverability hist",  I think it is a "hint".

>
> TDX module spec just says some MSRs are reset to INIT state by TDX module
> (un)conditionally during TD exit. When to restore these MSRs to host's
> values is decided by host.
>


Sigh, it is quite a common solution to reset a register to a default
value here, for example in VMX, the host GDT.limit and host TSS.limit
is reset after VMEXIT, so load_fixmap_gdt() and invalidate_tss_limit()
have to be called in host in vmx_vcpu_put().

Off-topic: Is it a good idea to also put load_fixmap_gdt() and
invalidate_tss_limit() in user-return-notfier?

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 00/59] KVM: X86: TDX support
  2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
                   ` (59 preceding siblings ...)
  2021-11-25  2:12 ` [RFC PATCH v3 00/59] KVM: X86: TDX support Xiaoyao Li
@ 2021-11-30 18:51 ` Sean Christopherson
  2021-12-01 13:22   ` Kai Huang
  2021-12-01 15:05   ` Paolo Bonzini
  60 siblings, 2 replies; 123+ messages in thread
From: Sean Christopherson @ 2021-11-30 18:51 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, erdemaktas, Connor Kuehl, linux-kernel, kvm,
	isaku.yamahata

On Wed, Nov 24, 2021, isaku.yamahata@intel.com wrote:
> - drop load/initialization of TDX module

So what's the plan for loading and initializing TDX modules?

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 00/59] KVM: X86: TDX support
  2021-11-30 18:51 ` Sean Christopherson
@ 2021-12-01 13:22   ` Kai Huang
  2021-12-01 19:08     ` Isaku Yamahata
  2021-12-01 15:05   ` Paolo Bonzini
  1 sibling, 1 reply; 123+ messages in thread
From: Kai Huang @ 2021-12-01 13:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: isaku.yamahata, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, erdemaktas, Connor Kuehl,
	linux-kernel, kvm, isaku.yamahata

On Tue, 30 Nov 2021 18:51:52 +0000 Sean Christopherson wrote:
> On Wed, Nov 24, 2021, isaku.yamahata@intel.com wrote:
> > - drop load/initialization of TDX module
> 
> So what's the plan for loading and initializing TDX modules?

Hi Sean,

Although I don't quite understand what does Isaku mean here (I thought
loading/initializing TDX module was never part of TDX KVM series), for this part
we are working internally to improve the quality and finalize the code, but
currently I don't have ETA of being able to send patches out, but we are trying
to send out asap.  Sorry this is what I can say for now : (



^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 00/59] KVM: X86: TDX support
  2021-11-30 18:51 ` Sean Christopherson
  2021-12-01 13:22   ` Kai Huang
@ 2021-12-01 15:05   ` Paolo Bonzini
  2021-12-01 20:16     ` Kai Huang
  1 sibling, 1 reply; 123+ messages in thread
From: Paolo Bonzini @ 2021-12-01 15:05 UTC (permalink / raw)
  To: Sean Christopherson, isaku.yamahata
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	erdemaktas, Connor Kuehl, linux-kernel, kvm, isaku.yamahata

On 11/30/21 19:51, Sean Christopherson wrote:
> On Wed, Nov 24, 2021,isaku.yamahata@intel.com  wrote:
>> - drop load/initialization of TDX module
> So what's the plan for loading and initializing TDX modules?
> 

The latest news I got are that Intel has an EFI application that loads 
it, so loading it from Linux and updating it at runtime can be punted to 
later.

Paolo

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 00/59] KVM: X86: TDX support
  2021-12-01 13:22   ` Kai Huang
@ 2021-12-01 19:08     ` Isaku Yamahata
  2021-12-01 19:32       ` Sean Christopherson
  0 siblings, 1 reply; 123+ messages in thread
From: Isaku Yamahata @ 2021-12-01 19:08 UTC (permalink / raw)
  To: Kai Huang
  Cc: Sean Christopherson, isaku.yamahata, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	erdemaktas, Connor Kuehl, linux-kernel, kvm, isaku.yamahata

On Thu, Dec 02, 2021 at 02:22:27AM +1300,
Kai Huang <kai.huang@intel.com> wrote:

> On Tue, 30 Nov 2021 18:51:52 +0000 Sean Christopherson wrote:
> > On Wed, Nov 24, 2021, isaku.yamahata@intel.com wrote:
> > > - drop load/initialization of TDX module
> > 
> > So what's the plan for loading and initializing TDX modules?
> 
> Although I don't quite understand what does Isaku mean here (I thought
> loading/initializing TDX module was never part of TDX KVM series), for this part
> we are working internally to improve the quality and finalize the code, but
> currently I don't have ETA of being able to send patches out, but we are trying
> to send out asap.  Sorry this is what I can say for now : (

v1/v2 has it. Anyway The plan is what Kai said.  The code will reside in the x86
common directory instead of kvm.

The only dependency between the part of loading/initializing TDX module and
TDX KVM is only single function to get the info about the TDX module.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 00/59] KVM: X86: TDX support
  2021-12-01 19:08     ` Isaku Yamahata
@ 2021-12-01 19:32       ` Sean Christopherson
  2021-12-01 20:28         ` Kai Huang
  0 siblings, 1 reply; 123+ messages in thread
From: Sean Christopherson @ 2021-12-01 19:32 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Kai Huang, isaku.yamahata, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	erdemaktas, Connor Kuehl, linux-kernel, kvm

On Wed, Dec 01, 2021, Isaku Yamahata wrote:
> On Thu, Dec 02, 2021 at 02:22:27AM +1300,
> Kai Huang <kai.huang@intel.com> wrote:
> 
> > On Tue, 30 Nov 2021 18:51:52 +0000 Sean Christopherson wrote:
> > > On Wed, Nov 24, 2021, isaku.yamahata@intel.com wrote:
> > > > - drop load/initialization of TDX module
> > > 
> > > So what's the plan for loading and initializing TDX modules?
> > 
> > Although I don't quite understand what does Isaku mean here (I thought
> > loading/initializing TDX module was never part of TDX KVM series), for this part
> > we are working internally to improve the quality and finalize the code, but
> > currently I don't have ETA of being able to send patches out, but we are trying
> > to send out asap.  Sorry this is what I can say for now : (
> 
> v1/v2 has it.

No, v1 had support for the old architecture where SEAMLDR was a single ACM, it
did not support the new split persistent/non-persistent architcture.  v2 didn't
have support for either architecture.

> Anyway The plan is what Kai said.  The code will reside in the x86 common
> directory instead of kvm.

But what's the plan at a higher level?  Will the kernel load the ACM or is that
done by firmware?  If it's done by firmware, which entity is responsibile for
loading the TDX module?  If firmware loads the module, what's the plan for
upgrading the module without a reboot?  When will the kernel initialize the
module, regardless of who loads it?

All of those unanswered questions make it nigh impossible to review the KVM
support because the code organization and APIs provided will differ based on how
the kernel handles loading and initializing the TDX module.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 14/59] KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs
  2021-11-29 17:35     ` Sean Christopherson
@ 2021-12-01 19:37       ` Isaku Yamahata
  2021-12-03 16:14         ` Sean Christopherson
  0 siblings, 1 reply; 123+ messages in thread
From: Isaku Yamahata @ 2021-12-01 19:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, erdemaktas, Connor Kuehl,
	linux-kernel, kvm, isaku.yamahata, Sean Christopherson,
	Xiaoyao Li

On Mon, Nov 29, 2021 at 05:35:34PM +0000,
Sean Christopherson <seanjc@google.com> wrote:

> On Thu, Nov 25, 2021, Thomas Gleixner wrote:
> > On Wed, Nov 24 2021 at 16:19, isaku yamahata wrote:
> > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > >
> > > Add a capability to effectively allow userspace to query what VM types
> > > are supported by KVM.
> > 
> > I really don't see why this has to be named legacy. There are enough
> > reasonable use cases which are perfectly fine using the non-encrypted
> > muck. Just because there is a new hyped feature does not make anything
> > else legacy.
> 
> Yeah, this was brought up in the past.  The current proposal is to use
> KVM_X86_DEFAULT_VM[1], though at one point the plan was to use a generic
> KVM_VM_TYPE_DEFAULT for all architectures[2], not sure what happened to that idea.
> 
> [1] https://lore.kernel.org/all/YY6aqVkHNEfEp990@google.com/
> [2] https://lore.kernel.org/all/YQsjQ5aJokV1HZ8N@google.com/

Currently <feature>_{unsupported, disallowed} are added and the check is
 sprinkled and warn in the corresponding low level tdx code.  It helped to
 detect dubious behavior of guest or qemu.

The other approach is to silently ignore them (SMI, INIT, IRQ etc) without
such check.  The pros is, the code would be simpler and it's what SEV does today.
the cons is, it would bes hard to track down such cases and the user would
be confused.  For example, when user requests reset/SMI, it's silently ignored.
The some check would still be needed.
Any thoughts?

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 00/59] KVM: X86: TDX support
  2021-12-01 15:05   ` Paolo Bonzini
@ 2021-12-01 20:16     ` Kai Huang
  0 siblings, 0 replies; 123+ messages in thread
From: Kai Huang @ 2021-12-01 20:16 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, isaku.yamahata, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, erdemaktas, Connor Kuehl,
	linux-kernel, kvm, isaku.yamahata

On Wed, 1 Dec 2021 16:05:32 +0100 Paolo Bonzini wrote:
> On 11/30/21 19:51, Sean Christopherson wrote:
> > On Wed, Nov 24, 2021,isaku.yamahata@intel.com  wrote:
> >> - drop load/initialization of TDX module
> > So what's the plan for loading and initializing TDX modules?
> > 
> 
> The latest news I got are that Intel has an EFI application that loads 
> it, so loading it from Linux and updating it at runtime can be punted to 
> later.
> 

Yes we are heading this approach.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 00/59] KVM: X86: TDX support
  2021-12-01 19:32       ` Sean Christopherson
@ 2021-12-01 20:28         ` Kai Huang
  0 siblings, 0 replies; 123+ messages in thread
From: Kai Huang @ 2021-12-01 20:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, isaku.yamahata, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	erdemaktas, Connor Kuehl, linux-kernel, kvm


> 
> > Anyway The plan is what Kai said.  The code will reside in the x86 common
> > directory instead of kvm.
> 
> But what's the plan at a higher level?  Will the kernel load the ACM or is that
> done by firmware?  If it's done by firmware, which entity is responsibile for
> loading the TDX module?  If firmware loads the module, what's the plan for
> upgrading the module without a reboot?  When will the kernel initialize the
> module, regardless of who loads it?

The UEFI loads both ACM and TDX module before booting into kernel by using UEFI
tool.  The runtime update is pushed out for future support.  One goal of
this is to reduce the code size so that it can be reviewed more easily and
quickly.

And yes kernel will initialize the TDX module.  The direction we are heading is
to allow to defer TDX module initialization when TDX is truly needed, i.e.
When KVM is loaded with TDX support, or first TD is created.  The code will
basically still reside in host kernel, provided as functions, etc.  And at first
stage, KVM will call those functions to initialize TDX when needed.

The advantage of this approach is it provides more flexibility: the TDX module
initialization code can be reused by future TDX runtime update, etc.  And with
only initializing TDX in KVM, the host kernel doesn't need to handle entering
VMX operation, etc.  It can be introduced later when needed.

> 
> All of those unanswered questions make it nigh impossible to review the KVM
> support because the code organization and APIs provided will differ based on how
> the kernel handles loading and initializing the TDX module.

I think theoretically loading/initializing module should be quite independent
from KVM series, but yes in practice the APIs matter, but I also don't expect
this will reduce the ability to review KVM series a lot as RFC.

Anyway sending out host kernel patches is our top priority now and we are
trying to do asap.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [RFC PATCH v3 14/59] KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs
  2021-12-01 19:37       ` Isaku Yamahata
@ 2021-12-03 16:14         ` Sean Christopherson
  0 siblings, 0 replies; 123+ messages in thread
From: Sean Christopherson @ 2021-12-03 16:14 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Thomas Gleixner, isaku.yamahata, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, erdemaktas, Connor Kuehl,
	linux-kernel, kvm, Sean Christopherson, Xiaoyao Li

On Wed, Dec 01, 2021, Isaku Yamahata wrote:
> On Mon, Nov 29, 2021 at 05:35:34PM +0000,
> Sean Christopherson <seanjc@google.com> wrote:
> 
> > On Thu, Nov 25, 2021, Thomas Gleixner wrote:
> > > On Wed, Nov 24 2021 at 16:19, isaku yamahata wrote:
> > > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > >
> > > > Add a capability to effectively allow userspace to query what VM types
> > > > are supported by KVM.
> > > 
> > > I really don't see why this has to be named legacy. There are enough
> > > reasonable use cases which are perfectly fine using the non-encrypted
> > > muck. Just because there is a new hyped feature does not make anything
> > > else legacy.
> > 
> > Yeah, this was brought up in the past.  The current proposal is to use
> > KVM_X86_DEFAULT_VM[1], though at one point the plan was to use a generic
> > KVM_VM_TYPE_DEFAULT for all architectures[2], not sure what happened to that idea.
> > 
> > [1] https://lore.kernel.org/all/YY6aqVkHNEfEp990@google.com/
> > [2] https://lore.kernel.org/all/YQsjQ5aJokV1HZ8N@google.com/
> 
> Currently <feature>_{unsupported, disallowed} are added and the check is
>  sprinkled and warn in the corresponding low level tdx code.  It helped to
>  detect dubious behavior of guest or qemu.

KVM shouldn't log a message or WARN unless the issue is detected at a late sanity
check, i.e. where failure indicates a KVM bug.  Other than that, I agree that KVM
should reject ioctls() that directly violate the rules of a confidential VM with
an appropriate error code.  I don't think KVM should reject everything though,
e.g. if the guest attempts to send an SMI, dropping the request on the floor is
the least awful option because we can't communicate an error to the guest without
making up our own architecture, and exiting to userspace with -EINVAL from deep
in KVM would be both painful to implement and an overreaction since doing so would
likely kill the guest.

> The other approach is to silently ignore them (SMI, INIT, IRQ etc) without
> such check.  The pros is, the code would be simpler and it's what SEV does today.
> the cons is, it would bes hard to track down such cases and the user would
> be confused.  For example, when user requests reset/SMI, it's silently ignored.
> The some check would still be needed.
> Any thoughts?
> 
> -- 
> Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 123+ messages in thread

end of thread, other threads:[~2021-12-03 16:14 UTC | newest]

Thread overview: 123+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-25  0:19 [RFC PATCH v3 00/59] KVM: X86: TDX support isaku.yamahata
2021-11-25  0:19 ` [RFC PATCH v3 01/59] x86/mktme: move out MKTME related constatnts/macro to msr-index.h isaku.yamahata
2021-11-25  0:19 ` [RFC PATCH v3 02/59] x86/mtrr: mask out keyid bits from variable mtrr mask register isaku.yamahata
2021-11-25  0:19 ` [RFC PATCH v3 03/59] KVM: TDX: Define TDX architectural definitions isaku.yamahata
2021-11-25  0:19 ` [RFC PATCH v3 04/59] KVM: TDX: Add TDX "architectural" error codes isaku.yamahata
2021-11-25  0:19 ` [RFC PATCH v3 05/59] KVM: TDX: add a helper function for kvm to call seamcall isaku.yamahata
2021-11-25  0:19 ` [RFC PATCH v3 06/59] KVM: TDX: Add C wrapper functions for TDX SEAMCALLs isaku.yamahata
2021-11-25  0:19 ` [RFC PATCH v3 07/59] KVM: TDX: Add helper functions to print TDX SEAMCALL error isaku.yamahata
2021-11-25  0:19 ` [RFC PATCH v3 08/59] KVM: Export kvm_io_bus_read for use by TDX for PV MMIO isaku.yamahata
2021-11-25 17:14   ` Thomas Gleixner
2021-11-25  0:19 ` [RFC PATCH v3 09/59] KVM: Enable hardware before doing arch VM initialization isaku.yamahata
2021-11-25 19:02   ` Thomas Gleixner
2021-11-25  0:19 ` [RFC PATCH v3 10/59] KVM: x86: Split core of hypercall emulation to helper function isaku.yamahata
2021-11-25  0:19 ` [RFC PATCH v3 11/59] KVM: x86: Export kvm_mmio tracepoint for use by TDX for PV MMIO isaku.yamahata
2021-11-25  0:19 ` [RFC PATCH v3 12/59] KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot by default isaku.yamahata
2021-11-25 19:04   ` Thomas Gleixner
2021-11-25  0:19 ` [RFC PATCH v3 13/59] KVM: Add max_vcpus field in common 'struct kvm' isaku.yamahata
2021-11-25 19:06   ` Thomas Gleixner
2021-11-25  0:19 ` [RFC PATCH v3 14/59] KVM: x86: Add vm_type to differentiate legacy VMs from protected VMs isaku.yamahata
2021-11-25 19:08   ` Thomas Gleixner
2021-11-29 17:35     ` Sean Christopherson
2021-12-01 19:37       ` Isaku Yamahata
2021-12-03 16:14         ` Sean Christopherson
2021-11-25  0:19 ` [RFC PATCH v3 15/59] KVM: x86: Introduce "protected guest" concept and block disallowed ioctls isaku.yamahata
2021-11-25 19:26   ` Thomas Gleixner
2021-11-25  0:19 ` [RFC PATCH v3 16/59] KVM: x86: Add per-VM flag to disable direct IRQ injection isaku.yamahata
2021-11-25 19:31   ` Thomas Gleixner
2021-11-29  2:49   ` Lai Jiangshan
2021-11-25  0:20 ` [RFC PATCH v3 17/59] KVM: x86: Add flag to disallow #MC injection / KVM_X86_SETUP_MCE isaku.yamahata
2021-11-25 19:33   ` Thomas Gleixner
2021-11-25  0:20 ` [RFC PATCH v3 18/59] KVM: x86: Add flag to mark TSC as immutable (for TDX) isaku.yamahata
2021-11-25 19:40   ` Thomas Gleixner
2021-11-29 18:05     ` Sean Christopherson
2021-11-25  0:20 ` [RFC PATCH v3 19/59] KVM: Add per-VM flag to mark read-only memory as unsupported isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 20/59] KVM: Add per-VM flag to disable dirty logging of memslots for TDs isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 21/59] KVM: x86: Add per-VM flag to disable in-kernel I/O APIC and level routes isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 22/59] KVM: x86: add per-VM flags to disable SMI/INIT/SIPI isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 23/59] KVM: x86: Allow host-initiated WRMSR to set X2APIC regardless of CPUID isaku.yamahata
2021-11-25 19:41   ` Thomas Gleixner
2021-11-26  8:18     ` Paolo Bonzini
2021-11-29 21:21       ` Sean Christopherson
2021-11-25  0:20 ` [RFC PATCH v3 24/59] KVM: x86: Add kvm_x86_ops .cache_gprs() and .flush_gprs() isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 25/59] KVM: x86: Add support for vCPU and device-scoped KVM_MEMORY_ENCRYPT_OP isaku.yamahata
2021-11-25 19:42   ` Thomas Gleixner
2021-11-25  0:20 ` [RFC PATCH v3 26/59] KVM: x86: Introduce vm_teardown() hook in kvm_arch_vm_destroy() isaku.yamahata
2021-11-25 19:46   ` Thomas Gleixner
2021-11-25 20:54     ` Paolo Bonzini
2021-11-25 21:11       ` Thomas Gleixner
2021-11-29 18:16         ` Sean Christopherson
2021-11-25  0:20 ` [RFC PATCH v3 27/59] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 28/59] KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events() isaku.yamahata
2021-11-25 20:50   ` Paolo Bonzini
2021-11-29 19:20     ` Sean Christopherson
2021-11-25  0:20 ` [RFC PATCH v3 29/59] KVM: x86: Add option to force LAPIC expiration wait isaku.yamahata
2021-11-25 19:53   ` Thomas Gleixner
2021-11-25  0:20 ` [RFC PATCH v3 30/59] KVM: x86: Add guest_supported_xss placholder isaku.yamahata
2021-11-25 19:55   ` Thomas Gleixner
2021-11-25  0:20 ` [RFC PATCH v3 31/59] KVM: x86: Add infrastructure for stolen GPA bits isaku.yamahata
2021-11-25 20:00   ` Thomas Gleixner
2021-11-25  0:20 ` [RFC PATCH v3 32/59] KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 33/59] KVM: x86/mmu: Ignore bits 63 and 62 when checking for "present" SPTEs isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 34/59] KVM: x86/mmu: Allow non-zero init value for shadow PTE isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 35/59] KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits() isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 36/59] KVM: x86/mmu: Frame in support for private/inaccessible shadow pages isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 37/59] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 38/59] KVM: x86/mmu: Allow per-VM override of the TDP max page level isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 39/59] KVM: VMX: Modify NMI and INTR handlers to take intr_info as param isaku.yamahata
2021-11-25 20:06   ` Thomas Gleixner
2021-11-25  0:20 ` [RFC PATCH v3 40/59] KVM: VMX: Move NMI/exception handler to common helper isaku.yamahata
2021-11-25 20:06   ` Thomas Gleixner
2021-11-25  0:20 ` [RFC PATCH v3 41/59] KVM: VMX: Split out guts of EPT violation to common/exposed function isaku.yamahata
2021-11-25 20:07   ` Thomas Gleixner
2021-11-25  0:20 ` [RFC PATCH v3 42/59] KVM: VMX: Define EPT Violation architectural bits isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 43/59] KVM: VMX: Define VMCS encodings for shared EPT pointer isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 44/59] KVM: VMX: Add 'main.c' to wrap VMX and TDX isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 45/59] KVM: VMX: Move setting of EPT MMU masks to common VT-x code isaku.yamahata
2021-11-25 20:08   ` Thomas Gleixner
2021-11-25  0:20 ` [RFC PATCH v3 46/59] KVM: VMX: Move register caching logic to common code isaku.yamahata
2021-11-25 20:11   ` Thomas Gleixner
2021-11-25 20:17     ` Paolo Bonzini
2021-11-29 18:23       ` Sean Christopherson
2021-11-29 18:28         ` Paolo Bonzini
2021-11-25  0:20 ` [RFC PATCH v3 47/59] KVM: TDX: Define TDCALL exit reason isaku.yamahata
2021-11-25 20:19   ` Thomas Gleixner
2021-11-29 18:36     ` Sean Christopherson
2021-11-25  0:20 ` [RFC PATCH v3 48/59] KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 49/59] KVM: VMX: Add macro framework to read/write VMCS for VMs and TDs isaku.yamahata
2021-11-25 20:24   ` Thomas Gleixner
2021-11-25  0:20 ` [RFC PATCH v3 50/59] KVM: VMX: Move AR_BYTES encoder/decoder helpers to common.h isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 51/59] KVM: VMX: MOVE GDT and IDT accessors to common code isaku.yamahata
2021-11-25 20:25   ` Thomas Gleixner
2021-11-25  0:20 ` [RFC PATCH v3 52/59] KVM: VMX: Move .get_interrupt_shadow() implementation to common VMX code isaku.yamahata
2021-11-25 20:26   ` Thomas Gleixner
2021-11-25  0:20 ` [RFC PATCH v3 53/59] KVM: x86: Add a helper function to restore 4 host MSRs on exit to user space isaku.yamahata
2021-11-25 20:34   ` Thomas Gleixner
2021-11-26  9:19     ` Chao Gao
2021-11-26  9:40       ` Paolo Bonzini
2021-11-29  7:08       ` Lai Jiangshan
2021-11-29  9:26         ` Chao Gao
2021-11-30  4:58           ` Lai Jiangshan
2021-11-30  8:19             ` Chao Gao
2021-11-30 11:18               ` Lai Jiangshan
2021-11-25  0:20 ` [RFC PATCH v3 54/59] KVM: X86: Introduce initial_tsc_khz in struct kvm_arch isaku.yamahata
2021-11-25 20:48   ` Paolo Bonzini
2021-11-25 21:05   ` Thomas Gleixner
2021-11-25 22:13     ` Paolo Bonzini
2021-11-25 22:59       ` Thomas Gleixner
2021-11-25 23:26       ` Thomas Gleixner
2021-11-26  7:56         ` Paolo Bonzini
2021-11-29 23:38       ` Sean Christopherson
2021-11-25  0:20 ` [RFC PATCH v3 55/59] KVM: TDX: Add "basic" support for building and running Trust Domains isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 56/59] KVM: TDX: Protect private mapping related SEAMCALLs with spinlock isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 57/59] KVM, x86/mmu: Support TDX private mapping for TDP MMU isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 58/59] KVM: TDX: exit to user space on GET_QUOTE, SETUP_EVENT_NOTIFY_INTERRUPT isaku.yamahata
2021-11-25  0:20 ` [RFC PATCH v3 59/59] Documentation/virtual/kvm: Add Trust Domain Extensions(TDX) isaku.yamahata
2021-11-25  2:12 ` [RFC PATCH v3 00/59] KVM: X86: TDX support Xiaoyao Li
2021-11-30 18:51 ` Sean Christopherson
2021-12-01 13:22   ` Kai Huang
2021-12-01 19:08     ` Isaku Yamahata
2021-12-01 19:32       ` Sean Christopherson
2021-12-01 20:28         ` Kai Huang
2021-12-01 15:05   ` Paolo Bonzini
2021-12-01 20:16     ` Kai Huang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).