LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v5 0/7] KVM: PKS Virtualization support
@ 2021-08-11 10:11 Chenyi Qiang
  2021-08-11 10:11 ` [PATCH v5 1/7] KVM: VMX: Introduce PKS VMCS fields Chenyi Qiang
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Chenyi Qiang @ 2021-08-11 10:11 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: Chenyi Qiang, kvm, linux-kernel

This patch series is based on top of kernel patchset:
https://lore.kernel.org/lkml/20210804043231.2655537-1-ira.weiny@intel.com/

To help patches review, one missing info in SDM is that PKSR will be
cleared on Powerup/INIT/RESET, which should be listed in Table 9.1
"IA-32 and Intel 64 Processor States Following Power-up, Reset, or INIT"

---

Protection Keys for Supervisor Pages(PKS) is a feature that extends the
Protection Keys architecture to support thread-specific permission
restrictions on supervisor pages.

PKS works similar to an existing feature named PKU(protecting user pages).
They both perform an additional check after normal paging permission
checks are done. Access or Writes can be disabled via a MSR update
without TLB flushes when permissions changes. If violating this
addional check, #PF occurs and PFEC.PK bit will be set.

PKS introduces MSR IA32_PKRS to manage supervisor protection key
rights. The MSR contains 16 pairs of ADi and WDi bits. Each pair
advertises on a group of pages with the same key which is set in the
leaf paging-structure entries(bits[62:59]). Currently, IA32_PKRS is not
supported by XSAVES architecture.

This patchset aims to add the virtualization of PKS in KVM. It
implemented PKS CPUID enumeration, vmentry/vmexit configuration, MSR
exposure, nested supported etc. Currently, PKS is not yet supported for
shadow paging. 

Detailed information about PKS can be found in the latest Intel 64 and
IA-32 Architectures Software Developer's Manual.


---

Changelogs:

v4->v5
- Make setting of MSR intercept/vmcs control bits not dependent on guest.CR4.PKS.
  And set them if PKS is exposed to guest. (Suggested by Sean)
- Add pkrs to standard register caching mechanism to help update
  vcpu->arch.pkrs on demand. Add related helper functions. (Suggested by Sean)
- Do the real pkrs update in VMCS field in vmx_vcpu_reset and
  vmx_sync_vmcs_host_state(). (Sean)
- Add a new mmu_role cr4_pks instead of smushing PKU and PKS together.
  (Sean & Paolo)
- v4: https://lore.kernel.org/lkml/20210205083706.14146-1-chenyi.qiang@intel.com/

v3->v4
- Make the MSR intercept and load-controls setting depend on CR4.PKS value
- shadow the guest pkrs and make it usable in PKS emultion
- add the cr4_pke and cr4_pks check in pkr_mask update
- squash PATCH 2 and PATCH 5 to make the dependencies read more clear
- v3: https://lore.kernel.org/lkml/20201105081805.5674-1-chenyi.qiang@intel.com/

v2->v3:
- No function changes since last submit
- rebase on the latest PKS kernel support:
  https://lore.kernel.org/lkml/20201102205320.1458656-1-ira.weiny@intel.com/
- add MSR_IA32_PKRS to the vmx_possible_passthrough_msrs[]
- RFC v2: https://lore.kernel.org/lkml/20201014021157.18022-1-chenyi.qiang@intel.com/

v1->v2:
- rebase on the latest PKS kernel support:
  https://github.com/weiny2/linux-kernel/tree/pks-rfc-v3
- add a kvm-unit-tests for PKS
- add the check in kvm_init_msr_list for PKRS
- place the X86_CR4_PKS in mmu_role_bits in kvm_set_cr4
- add the support to expose VM_{ENTRY, EXIT}_LOAD_IA32_PKRS in nested
  VMX MSR
- RFC v1: https://lore.kernel.org/lkml/20200807084841.7112-1-chenyi.qiang@intel.com/

---

Chenyi Qiang (7):
  KVM: VMX: Introduce PKS VMCS fields
  KVM: VMX: Add proper cache tracking for PKRS
  KVM: X86: Expose IA32_PKRS MSR
  KVM: MMU: Rename the pkru to pkr
  KVM: MMU: Add support for PKS emulation
  KVM: VMX: Expose PKS to guest
  KVM: VMX: Enable PKS for nested VM

 arch/x86/include/asm/kvm_host.h | 17 ++++---
 arch/x86/include/asm/vmx.h      |  6 +++
 arch/x86/kvm/cpuid.c            |  2 +-
 arch/x86/kvm/kvm_cache_regs.h   |  7 +++
 arch/x86/kvm/mmu.h              | 25 +++++----
 arch/x86/kvm/mmu/mmu.c          | 68 ++++++++++++++-----------
 arch/x86/kvm/vmx/capabilities.h |  6 +++
 arch/x86/kvm/vmx/nested.c       | 41 ++++++++++++++-
 arch/x86/kvm/vmx/vmcs.h         |  1 +
 arch/x86/kvm/vmx/vmcs12.c       |  2 +
 arch/x86/kvm/vmx/vmcs12.h       |  4 ++
 arch/x86/kvm/vmx/vmx.c          | 89 ++++++++++++++++++++++++++++++---
 arch/x86/kvm/vmx/vmx.h          |  7 ++-
 arch/x86/kvm/x86.c              |  6 ++-
 arch/x86/kvm/x86.h              |  8 +++
 arch/x86/mm/pkeys.c             |  6 +++
 include/linux/pkeys.h           |  5 ++
 17 files changed, 243 insertions(+), 57 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v5 1/7] KVM: VMX: Introduce PKS VMCS fields
  2021-08-11 10:11 [PATCH v5 0/7] KVM: PKS Virtualization support Chenyi Qiang
@ 2021-08-11 10:11 ` Chenyi Qiang
  2021-08-11 10:11 ` [PATCH v5 2/7] KVM: VMX: Add proper cache tracking for PKRS Chenyi Qiang
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Chenyi Qiang @ 2021-08-11 10:11 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: Chenyi Qiang, kvm, linux-kernel

PKS(Protection Keys for Supervisor Pages) is a feature that extends the
Protection Key architecture to support thread-specific permission
restrictions on supervisor pages.

A new PKS MSR(PKRS) is defined in kernel to support PKS, which holds a
set of permissions associated with each protection domain.

Two VMCS fields {HOST,GUEST}_IA32_PKRS are introduced in
{host,guest}-state area to store the respective values of PKRS.

Every VM exit saves PKRS into guest-state area.
If VM_EXIT_LOAD_IA32_PKRS = 1, VM exit loads PKRS from the host-state
area.
If VM_ENTRY_LOAD_IA32_PKRS = 1, VM entry loads PKRS from the guest-state
area.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
---
 arch/x86/include/asm/vmx.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0ffaa3156a4e..7962d506ba91 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -95,6 +95,7 @@
 #define VM_EXIT_CLEAR_BNDCFGS                   0x00800000
 #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
 #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
+#define VM_EXIT_LOAD_IA32_PKRS			0x20000000
 
 #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
 
@@ -108,6 +109,7 @@
 #define VM_ENTRY_LOAD_BNDCFGS                   0x00010000
 #define VM_ENTRY_PT_CONCEAL_PIP			0x00020000
 #define VM_ENTRY_LOAD_IA32_RTIT_CTL		0x00040000
+#define VM_ENTRY_LOAD_IA32_PKRS			0x00400000
 
 #define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR	0x000011ff
 
@@ -245,12 +247,16 @@ enum vmcs_field {
 	GUEST_BNDCFGS_HIGH              = 0x00002813,
 	GUEST_IA32_RTIT_CTL		= 0x00002814,
 	GUEST_IA32_RTIT_CTL_HIGH	= 0x00002815,
+	GUEST_IA32_PKRS			= 0x00002818,
+	GUEST_IA32_PKRS_HIGH		= 0x00002819,
 	HOST_IA32_PAT			= 0x00002c00,
 	HOST_IA32_PAT_HIGH		= 0x00002c01,
 	HOST_IA32_EFER			= 0x00002c02,
 	HOST_IA32_EFER_HIGH		= 0x00002c03,
 	HOST_IA32_PERF_GLOBAL_CTRL	= 0x00002c04,
 	HOST_IA32_PERF_GLOBAL_CTRL_HIGH	= 0x00002c05,
+	HOST_IA32_PKRS			= 0x00002c06,
+	HOST_IA32_PKRS_HIGH		= 0x00002c07,
 	PIN_BASED_VM_EXEC_CONTROL       = 0x00004000,
 	CPU_BASED_VM_EXEC_CONTROL       = 0x00004002,
 	EXCEPTION_BITMAP                = 0x00004004,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v5 2/7] KVM: VMX: Add proper cache tracking for PKRS
  2021-08-11 10:11 [PATCH v5 0/7] KVM: PKS Virtualization support Chenyi Qiang
  2021-08-11 10:11 ` [PATCH v5 1/7] KVM: VMX: Introduce PKS VMCS fields Chenyi Qiang
@ 2021-08-11 10:11 ` Chenyi Qiang
  2021-11-08 17:13   ` Sean Christopherson
  2021-08-11 10:11 ` [PATCH v5 3/7] KVM: X86: Expose IA32_PKRS MSR Chenyi Qiang
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Chenyi Qiang @ 2021-08-11 10:11 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: Chenyi Qiang, kvm, linux-kernel

Add PKRS caching into the standard register caching mechanism in order
to take advantage of the availability checks provided by regs_avail.

This is because vcpu->arch.pkrs will be rarely acceesed by KVM, only in
the case of host userspace MSR reads and GVA->GPA translation in
following patches. It is unnecessary to keep it up-to-date at all times.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/kvm/kvm_cache_regs.h   | 7 +++++++
 arch/x86/kvm/vmx/vmx.c          | 4 ++++
 arch/x86/kvm/vmx/vmx.h          | 3 ++-
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 974cbfb1eefe..c2bcb88781b3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -173,6 +173,7 @@ enum kvm_reg {
 	VCPU_EXREG_SEGMENTS,
 	VCPU_EXREG_EXIT_INFO_1,
 	VCPU_EXREG_EXIT_INFO_2,
+	VCPU_EXREG_PKRS,
 };
 
 enum {
@@ -620,6 +621,7 @@ struct kvm_vcpu_arch {
 	unsigned long cr8;
 	u32 host_pkru;
 	u32 pkru;
+	u32 pkrs;
 	u32 hflags;
 	u64 efer;
 	u64 apic_base;
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 90e1ffdc05b7..da014b1be874 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -171,6 +171,13 @@ static inline u64 kvm_read_edx_eax(struct kvm_vcpu *vcpu)
 		| ((u64)(kvm_rdx_read(vcpu) & -1u) << 32);
 }
 
+static inline ulong kvm_read_pkrs(struct kvm_vcpu *vcpu)
+{
+	if (!kvm_register_is_available(vcpu, VCPU_EXREG_PKRS))
+		static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_PKRS);
+	return vcpu->arch.pkrs;
+}
+
 static inline void enter_guest_mode(struct kvm_vcpu *vcpu)
 {
 	vcpu->arch.hflags |= HF_GUEST_MASK;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 927a552393b9..bf911029aa35 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2273,6 +2273,10 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
 		vcpu->arch.cr4 &= ~guest_owned_bits;
 		vcpu->arch.cr4 |= vmcs_readl(GUEST_CR4) & guest_owned_bits;
 		break;
+	case VCPU_EXREG_PKRS:
+		if (kvm_cpu_cap_has(X86_FEATURE_PKS))
+			vcpu->arch.pkrs = vmcs_read64(GUEST_IA32_PKRS);
+		break;
 	default:
 		WARN_ON_ONCE(1);
 		break;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index db88ed4f2121..18039eb6efb0 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -447,7 +447,8 @@ static inline void vmx_register_cache_reset(struct kvm_vcpu *vcpu)
 				  | (1 << VCPU_EXREG_CR3)
 				  | (1 << VCPU_EXREG_CR4)
 				  | (1 << VCPU_EXREG_EXIT_INFO_1)
-				  | (1 << VCPU_EXREG_EXIT_INFO_2));
+				  | (1 << VCPU_EXREG_EXIT_INFO_2)
+				  | (1 << VCPU_EXREG_PKRS));
 	vcpu->arch.regs_dirty = 0;
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v5 3/7] KVM: X86: Expose IA32_PKRS MSR
  2021-08-11 10:11 [PATCH v5 0/7] KVM: PKS Virtualization support Chenyi Qiang
  2021-08-11 10:11 ` [PATCH v5 1/7] KVM: VMX: Introduce PKS VMCS fields Chenyi Qiang
  2021-08-11 10:11 ` [PATCH v5 2/7] KVM: VMX: Add proper cache tracking for PKRS Chenyi Qiang
@ 2021-08-11 10:11 ` Chenyi Qiang
  2021-11-08 17:44   ` Sean Christopherson
  2021-11-08 20:18   ` Sean Christopherson
  2021-08-11 10:11 ` [PATCH v5 4/7] KVM: MMU: Rename the pkru to pkr Chenyi Qiang
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 22+ messages in thread
From: Chenyi Qiang @ 2021-08-11 10:11 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: Chenyi Qiang, kvm, linux-kernel

Protection Key for Superviosr Pages (PKS) uses IA32_PKRS MSR (PKRS) at
index 0x6E1 to allow software to manage superviosr key rights, i.e. it
can enforce additional permissions checks besides normal paging
protections via a MSR update without TLB flushes when permissions
change.

For performance consideration, PKRS intercept in KVM will be disabled
when PKS is supported in guest so that PKRS can be accessed without VM
exit.

PKS introduces dedicated control fields in VMCS to switch PKRS, which
only does the retore part. In addition, every VM exit saves PKRS into
the guest-state area in VMCS, while VM enter won't save the host value
due to the expectation that the host won't change the MSR often. Update
the host's value in VMCS manually if the MSR has been changed by the
kernel since the last time the VMCS was run.

Introduce a function get_current_pkrs() in arch/x86/mm/pkeys.c exports
the per-cpu variable pkrs_cache to avoid frequent rdmsr of PKRS.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/kvm/vmx/vmcs.h |  1 +
 arch/x86/kvm/vmx/vmx.c  | 69 +++++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/vmx.h  |  2 +-
 arch/x86/kvm/x86.c      |  6 +++-
 arch/x86/kvm/x86.h      |  6 ++++
 arch/x86/mm/pkeys.c     |  6 ++++
 include/linux/pkeys.h   |  5 +++
 7 files changed, 90 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index 4b9957e2bf5b..05312af594cd 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -40,6 +40,7 @@ struct vmcs_host_state {
 #ifdef CONFIG_X86_64
 	u16           ds_sel, es_sel;
 #endif
+	u32           pkrs;
 };
 
 struct vmcs_controls_shadow {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index bf911029aa35..0f3ca6a07a21 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -28,6 +28,7 @@
 #include <linux/tboot.h>
 #include <linux/trace_events.h>
 #include <linux/entry-kvm.h>
+#include <linux/pkeys.h>
 
 #include <asm/apic.h>
 #include <asm/asm.h>
@@ -170,6 +171,7 @@ static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
 	MSR_CORE_C3_RESIDENCY,
 	MSR_CORE_C6_RESIDENCY,
 	MSR_CORE_C7_RESIDENCY,
+	MSR_IA32_PKRS,
 };
 
 /*
@@ -1115,6 +1117,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 #endif
 	unsigned long fs_base, gs_base;
 	u16 fs_sel, gs_sel;
+	u32 host_pkrs;
 	int i;
 
 	vmx->req_immediate_exit = false;
@@ -1150,6 +1153,20 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 	 */
 	host_state->ldt_sel = kvm_read_ldt();
 
+	/*
+	 * Update the host pkrs vmcs field before vcpu runs.
+	 * The setting of VM_EXIT_LOAD_IA32_PKRS can ensure
+	 * kvm_cpu_cap_has(X86_FEATURE_PKS) &&
+	 * guest_cpuid_has(vcpu, X86_FEATURE_PKS)
+	 */
+	if (vm_exit_controls_get(vmx) & VM_EXIT_LOAD_IA32_PKRS) {
+		host_pkrs = get_current_pkrs();
+		if (unlikely(host_pkrs != host_state->pkrs)) {
+			vmcs_write64(HOST_IA32_PKRS, host_pkrs);
+			host_state->pkrs = host_pkrs;
+		}
+	}
+
 #ifdef CONFIG_X86_64
 	savesegment(ds, host_state->ds_sel);
 	savesegment(es, host_state->es_sel);
@@ -1371,6 +1388,15 @@ void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 		vmx->emulation_required = emulation_required(vcpu);
 }
 
+static void vmx_set_pkrs(struct kvm_vcpu *vcpu, u64 pkrs)
+{
+	if (kvm_cpu_cap_has(X86_FEATURE_PKS)) {
+		vcpu->arch.pkrs = pkrs;
+		kvm_register_mark_available(vcpu, VCPU_EXREG_PKRS);
+		vmcs_write64(GUEST_IA32_PKRS, pkrs);
+	}
+}
+
 u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu)
 {
 	u32 interruptibility = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
@@ -1899,6 +1925,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_DEBUGCTLMSR:
 		msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
 		break;
+	case MSR_IA32_PKRS:
+		if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
+		    (!msr_info->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))
+			return 1;
+		msr_info->data = kvm_read_pkrs(vcpu);
+		break;
 	default:
 	find_uret_msr:
 		msr = vmx_find_uret_msr(vmx, msr_info->index);
@@ -2226,7 +2259,15 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		}
 		ret = kvm_set_msr_common(vcpu, msr_info);
 		break;
-
+	case MSR_IA32_PKRS:
+		if (!kvm_pkrs_valid(data))
+			return 1;
+		if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
+		    (!msr_info->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))
+			return 1;
+		vmx_set_pkrs(vcpu, data);
+		break;
 	default:
 	find_uret_msr:
 		msr = vmx_find_uret_msr(vmx, msr_index);
@@ -2507,7 +2548,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	      VM_EXIT_LOAD_IA32_EFER |
 	      VM_EXIT_CLEAR_BNDCFGS |
 	      VM_EXIT_PT_CONCEAL_PIP |
-	      VM_EXIT_CLEAR_IA32_RTIT_CTL;
+	      VM_EXIT_CLEAR_IA32_RTIT_CTL |
+	      VM_EXIT_LOAD_IA32_PKRS;
 	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
 				&_vmexit_control) < 0)
 		return -EIO;
@@ -2531,7 +2573,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	      VM_ENTRY_LOAD_IA32_EFER |
 	      VM_ENTRY_LOAD_BNDCFGS |
 	      VM_ENTRY_PT_CONCEAL_PIP |
-	      VM_ENTRY_LOAD_IA32_RTIT_CTL;
+	      VM_ENTRY_LOAD_IA32_RTIT_CTL |
+	      VM_ENTRY_LOAD_IA32_PKRS;
 	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS,
 				&_vmentry_control) < 0)
 		return -EIO;
@@ -4462,6 +4505,9 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	if (kvm_mpx_supported())
 		vmcs_write64(GUEST_BNDCFGS, 0);
 
+	/* PKRS is cleared on INIT/RESET */
+	vmx_set_pkrs(vcpu, 0);
+
 	setup_msrs(vmx);
 
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);  /* 22.2.1 */
@@ -5783,6 +5829,8 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 		       vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL));
 	if (vmentry_ctl & VM_ENTRY_LOAD_BNDCFGS)
 		pr_err("BndCfgS = 0x%016llx\n", vmcs_read64(GUEST_BNDCFGS));
+	if (vmentry_ctl & VM_ENTRY_LOAD_IA32_PKRS)
+		pr_err("PKRS = 0x%016llx\n", vmcs_read64(GUEST_IA32_PKRS));
 	pr_err("Interruptibility = %08x  ActivityState = %08x\n",
 	       vmcs_read32(GUEST_INTERRUPTIBILITY_INFO),
 	       vmcs_read32(GUEST_ACTIVITY_STATE));
@@ -5824,6 +5872,8 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 		       vmcs_read64(HOST_IA32_PERF_GLOBAL_CTRL));
 	if (vmcs_read32(VM_EXIT_MSR_LOAD_COUNT) > 0)
 		vmx_dump_msrs("host autoload", &vmx->msr_autoload.host);
+	if (vmexit_ctl & VM_EXIT_LOAD_IA32_PKRS)
+		pr_err("PKRS = 0x%016llx\n", vmcs_read64(HOST_IA32_PKRS));
 
 	pr_err("*** Control State ***\n");
 	pr_err("PinBased=%08x CPUBased=%08x SecondaryExec=%08x\n",
@@ -7207,6 +7257,19 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 
 	/* Refresh #PF interception to account for MAXPHYADDR changes. */
 	vmx_update_exception_bitmap(vcpu);
+
+	if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
+	    guest_cpuid_has(vcpu, X86_FEATURE_PKS)) {
+		vmx_disable_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW);
+	} else {
+		/*
+		 * Remove VM control in case guest VM doesn't support PKS to mitigate
+		 * overhead during VM-{exit,entry}. They are present by default
+		 * if supported.
+		 */
+		vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
+		vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
+	}
 }
 
 static __init void vmx_set_cpu_caps(void)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 18039eb6efb0..858d9a959c72 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -336,7 +336,7 @@ struct vcpu_vmx {
 	struct lbr_desc lbr_desc;
 
 	/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS	13
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS	14
 	struct {
 		DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
 		DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a4fd10604f72..16cd5f98d3f9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1303,7 +1303,7 @@ static const u32 msrs_to_save_all[] = {
 	MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
 	MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
 	MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
-	MSR_IA32_UMWAIT_CONTROL,
+	MSR_IA32_UMWAIT_CONTROL, MSR_IA32_PKRS,
 
 	MSR_ARCH_PERFMON_FIXED_CTR0, MSR_ARCH_PERFMON_FIXED_CTR1,
 	MSR_ARCH_PERFMON_FIXED_CTR0 + 2, MSR_ARCH_PERFMON_FIXED_CTR0 + 3,
@@ -6219,6 +6219,10 @@ static void kvm_init_msr_list(void)
 				intel_pt_validate_hw_cap(PT_CAP_num_address_ranges) * 2)
 				continue;
 			break;
+		case MSR_IA32_PKRS:
+			if (!kvm_cpu_cap_has(X86_FEATURE_PKS))
+				continue;
+			break;
 		case MSR_ARCH_PERFMON_PERFCTR0 ... MSR_ARCH_PERFMON_PERFCTR0 + 17:
 			if (msrs_to_save_all[i] - MSR_ARCH_PERFMON_PERFCTR0 >=
 			    min(INTEL_PMC_MAX_GENERIC, x86_pmu.num_counters_gp))
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 44ae10312740..f8aaf89e6dc5 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -436,6 +436,12 @@ static inline void kvm_machine_check(void)
 #endif
 }
 
+static inline bool kvm_pkrs_valid(u64 data)
+{
+	/* bit[63,32] must be zero */
+	return !(data >> 32);
+}
+
 void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
 void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
 int kvm_spec_ctrl_test_value(u64 value);
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 201004586c2b..d1a58df2663e 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -452,4 +452,10 @@ void pks_abandon_protections(int pkey)
 }
 EXPORT_SYMBOL_GPL(pks_abandon_protections);
 
+u32 get_current_pkrs(void)
+{
+	return this_cpu_read(pkrs_cache);
+}
+EXPORT_SYMBOL_GPL(get_current_pkrs);
+
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index c06b47264c5d..5389318e3aa3 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -62,6 +62,7 @@ void pks_mk_noaccess(int pkey);
 void pks_mk_readonly(int pkey);
 void pks_mk_readwrite(int pkey);
 void pks_abandon_protections(int pkey);
+u32 get_current_pkrs(void);
 
 typedef bool (*pks_key_callback)(unsigned long address, bool write);
 
@@ -79,6 +80,10 @@ static inline void pks_mk_noaccess(int pkey) {}
 static inline void pks_mk_readonly(int pkey) {}
 static inline void pks_mk_readwrite(int pkey) {}
 static inline void pks_abandon_protections(int pkey) {}
+static inline u32 get_current_pkrs(void)
+{
+	return 0;
+}
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v5 4/7] KVM: MMU: Rename the pkru to pkr
  2021-08-11 10:11 [PATCH v5 0/7] KVM: PKS Virtualization support Chenyi Qiang
                   ` (2 preceding siblings ...)
  2021-08-11 10:11 ` [PATCH v5 3/7] KVM: X86: Expose IA32_PKRS MSR Chenyi Qiang
@ 2021-08-11 10:11 ` Chenyi Qiang
  2021-08-11 10:11 ` [PATCH v5 5/7] KVM: MMU: Add support for PKS emulation Chenyi Qiang
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Chenyi Qiang @ 2021-08-11 10:11 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: Chenyi Qiang, kvm, linux-kernel

PKRU represents the PKU register utilized in the protection key rights
check for user pages. Protection Keys for Superviosr Pages (PKS) extends
the protection key architecture to cover supervisor pages.

Rename the *pkru* related variables and functions to *pkr* which stands
for both of the PKRU and PKRS. It makes sense because PKS and PKU each
have:
 - a single control register (PKRU and PKRS)
 - the same number of keys (16 in total)
 - the same format in control registers (Access and Write disable bits)

PKS and PKU can also share the same bitmap pkr_mask cache conditions
where protection key checks are neede, because they can share almost the
same requirements for PK restrictions to cause a fault, except they
focus on different pages (supervisor and user pages).

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu.h              | 12 ++++++------
 arch/x86/kvm/mmu/mmu.c          | 10 +++++-----
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c2bcb88781b3..3d55aca9167b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -444,7 +444,7 @@ struct kvm_mmu {
 	* with PFEC.RSVD replaced by ACC_USER_MASK from the page tables.
 	* Each domain has 2 bits which are ANDed with AD and WD from PKRU.
 	*/
-	u32 pkru_mask;
+	u32 pkr_mask;
 
 	u64 *pae_root;
 	u64 *pml4_root;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 83e6c6965f1e..5e94f6a90e80 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -200,8 +200,8 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	u32 errcode = PFERR_PRESENT_MASK;
 
 	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
-	if (unlikely(mmu->pkru_mask)) {
-		u32 pkru_bits, offset;
+	if (unlikely(mmu->pkr_mask)) {
+		u32 pkr_bits, offset;
 
 		/*
 		* PKRU defines 32 bits, there are 16 domains and 2
@@ -209,15 +209,15 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 		* index of the protection domain, so pte_pkey * 2 is
 		* is the index of the first bit for the domain.
 		*/
-		pkru_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
+		pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
 
 		/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
 		offset = (pfec & ~1) +
 			((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));
 
-		pkru_bits &= mmu->pkru_mask >> offset;
-		errcode |= -pkru_bits & PFERR_PK_MASK;
-		fault |= (pkru_bits != 0);
+		pkr_bits &= mmu->pkr_mask >> offset;
+		errcode |= -pkr_bits & PFERR_PK_MASK;
+		fault |= (pkr_bits != 0);
 	}
 
 	return -(u32)fault & errcode;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 66f7f5bc3482..49fd2dc98cc6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4443,13 +4443,13 @@ static void update_permission_bitmask(struct kvm_mmu *mmu, bool ept)
 * away both AD and WD.  For all reads or if the last condition holds, WD
 * only will be masked away.
 */
-static void update_pkru_bitmask(struct kvm_mmu *mmu)
+static void update_pkr_bitmask(struct kvm_mmu *mmu)
 {
 	unsigned bit;
 	bool wp;
 
 	if (!is_cr4_pke(mmu)) {
-		mmu->pkru_mask = 0;
+		mmu->pkr_mask = 0;
 		return;
 	}
 
@@ -4483,7 +4483,7 @@ static void update_pkru_bitmask(struct kvm_mmu *mmu)
 		/* PKRU.WD stops write access. */
 		pkey_bits |= (!!check_write) << 1;
 
-		mmu->pkru_mask |= (pkey_bits & 3) << pfec;
+		mmu->pkr_mask |= (pkey_bits & 3) << pfec;
 	}
 }
 
@@ -4495,7 +4495,7 @@ static void reset_guest_paging_metadata(struct kvm_vcpu *vcpu,
 
 	reset_rsvds_bits_mask(vcpu, mmu);
 	update_permission_bitmask(mmu, false);
-	update_pkru_bitmask(mmu);
+	update_pkr_bitmask(mmu);
 }
 
 static void paging64_init_context(struct kvm_mmu *context)
@@ -4763,7 +4763,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 	context->direct_map = false;
 
 	update_permission_bitmask(context, true);
-	update_pkru_bitmask(context);
+	update_pkr_bitmask(context);
 	reset_rsvds_bits_mask_ept(vcpu, context, execonly);
 	reset_ept_shadow_zero_bits_mask(vcpu, context, execonly);
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v5 5/7] KVM: MMU: Add support for PKS emulation
  2021-08-11 10:11 [PATCH v5 0/7] KVM: PKS Virtualization support Chenyi Qiang
                   ` (3 preceding siblings ...)
  2021-08-11 10:11 ` [PATCH v5 4/7] KVM: MMU: Rename the pkru to pkr Chenyi Qiang
@ 2021-08-11 10:11 ` Chenyi Qiang
  2021-11-08 19:46   ` Sean Christopherson
  2021-11-08 19:52   ` Sean Christopherson
  2021-08-11 10:11 ` [PATCH v5 6/7] KVM: VMX: Expose PKS to guest Chenyi Qiang
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 22+ messages in thread
From: Chenyi Qiang @ 2021-08-11 10:11 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: Chenyi Qiang, kvm, linux-kernel

Up until now, pkr_mask had 0 bits for supervisor pages (the U/S bit in
page tables replaces the PFEC.RSVD in page fault error code).
For PKS support, fill in the bits using the same algorithm used for user
mode pages, but with CR4.PKE replaced by CR4.PKS. Because of this
change, CR4.PKS must also be included in the MMU role.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 10 +++---
 arch/x86/kvm/mmu.h              | 15 +++++----
 arch/x86/kvm/mmu/mmu.c          | 58 ++++++++++++++++++++-------------
 3 files changed, 50 insertions(+), 33 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3d55aca9167b..f31d19e851de 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -359,6 +359,7 @@ union kvm_mmu_extended_role {
 		unsigned int cr4_smap:1;
 		unsigned int cr4_smep:1;
 		unsigned int cr4_la57:1;
+		unsigned int cr4_pks:1;
 	};
 };
 
@@ -439,10 +440,11 @@ struct kvm_mmu {
 	u8 permissions[16];
 
 	/*
-	* The pkru_mask indicates if protection key checks are needed.  It
-	* consists of 16 domains indexed by page fault error code bits [4:1],
-	* with PFEC.RSVD replaced by ACC_USER_MASK from the page tables.
-	* Each domain has 2 bits which are ANDed with AD and WD from PKRU.
+	* The pkr_mask indicates if protection key checks are needed.
+	* It consists of 16 domains indexed by page fault error code
+	* bits[4:1] with PFEC.RSVD replaced by ACC_USER_MASK from the
+	* page tables. Each domain has 2 bits which are ANDed with AD
+	* and WD from PKRU/PKRS.
 	*/
 	u32 pkr_mask;
 
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 5e94f6a90e80..5586c0341d28 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -46,7 +46,7 @@
 
 #define KVM_MMU_CR4_ROLE_BITS (X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE | \
 			       X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE | \
-			       X86_CR4_LA57)
+			       X86_CR4_LA57 | X86_CR4_PKS)
 
 #define KVM_MMU_CR0_ROLE_BITS (X86_CR0_PG | X86_CR0_WP)
 
@@ -202,14 +202,17 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
 	if (unlikely(mmu->pkr_mask)) {
 		u32 pkr_bits, offset;
+		u64 pkr;
 
 		/*
-		* PKRU defines 32 bits, there are 16 domains and 2
-		* attribute bits per domain in pkru.  pte_pkey is the
-		* index of the protection domain, so pte_pkey * 2 is
-		* is the index of the first bit for the domain.
+		* PKRU and PKRS both define 32 bits. There are 16 domains
+		* and 2 attribute bits per domain in them. pte_key is the
+		* index of the protection domain, so pte_pkey * 2 is the
+		* index of the first bit for the domain. The choice of
+		* PKRU and PKRS is determined by the accessed pages.
 		*/
-		pkr_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
+		pkr = pte_access & PT_USER_MASK ? vcpu->arch.pkru : kvm_read_pkrs(vcpu);
+		pkr_bits = (pkr >> pte_pkey * 2) & 3;
 
 		/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
 		offset = (pfec & ~1) +
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 49fd2dc98cc6..ca83ad5f5716 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -205,6 +205,7 @@ BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smep, X86_CR4_SMEP);
 BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smap, X86_CR4_SMAP);
 BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pke, X86_CR4_PKE);
 BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, la57, X86_CR4_LA57);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pks, X86_CR4_PKS);
 BUILD_MMU_ROLE_REGS_ACCESSOR(efer, nx, EFER_NX);
 BUILD_MMU_ROLE_REGS_ACCESSOR(efer, lma, EFER_LMA);
 
@@ -227,6 +228,7 @@ BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, smep);
 BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, smap);
 BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pke);
 BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, la57);
+BUILD_MMU_ROLE_ACCESSOR(ext,  cr4, pks);
 BUILD_MMU_ROLE_ACCESSOR(base, efer, nx);
 
 static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
@@ -4420,35 +4422,41 @@ static void update_permission_bitmask(struct kvm_mmu *mmu, bool ept)
 }
 
 /*
-* PKU is an additional mechanism by which the paging controls access to
-* user-mode addresses based on the value in the PKRU register.  Protection
-* key violations are reported through a bit in the page fault error code.
+* Protection Keys (PKEY) is an additional mechanism by which
+* the paging controls access to user-mode/supervisor-mode address
+* based on the values in PKEY registers (PKRU/PKRS). Protection key
+* violations are reported through a bit in the page fault error code.
 * Unlike other bits of the error code, the PK bit is not known at the
 * call site of e.g. gva_to_gpa; it must be computed directly in
-* permission_fault based on two bits of PKRU, on some machine state (CR4,
-* CR0, EFER, CPL), and on other bits of the error code and the page tables.
+* permission_fault based on two bits of PKRU/PKRS, on some machine
+* state (CR4, CR0, EFER, CPL), and on other bits of the error code
+* and the page tables.
 *
 * In particular the following conditions come from the error code, the
 * page tables and the machine state:
-* - PK is always zero unless CR4.PKE=1 and EFER.LMA=1
+* - PK is always zero unless CR4.PKE=1/CR4.PKS=1 and EFER.LMA=1
 * - PK is always zero if RSVD=1 (reserved bit set) or F=1 (instruction fetch)
-* - PK is always zero if U=0 in the page tables
-* - PKRU.WD is ignored if CR0.WP=0 and the access is a supervisor access.
+* - PK is always zero if
+*       - U=0 in the page tables and CR4.PKS=0
+*       - U=1 in the page tables and CR4.PKU=0
+* - (PKRU/PKRS).WD is ignored if CR0.WP=0 and the access is a supervisor access.
 *
-* The PKRU bitmask caches the result of these four conditions.  The error
-* code (minus the P bit) and the page table's U bit form an index into the
-* PKRU bitmask.  Two bits of the PKRU bitmask are then extracted and ANDed
-* with the two bits of the PKRU register corresponding to the protection key.
-* For the first three conditions above the bits will be 00, thus masking
-* away both AD and WD.  For all reads or if the last condition holds, WD
-* only will be masked away.
+* The pkr_mask caches the result of these three conditions. The error
+* code (minus the P bit) and the page table's U bit form an index into
+* the pkr_mask. Two bits of the pkr_mask are then extracted and ANDed with
+* the two bits of the PKEY register corresponding to the protection key.
+* For the first three conditions above the bits will be 00, thus masking away
+* both AD and WD. For all reads or if the last condition holds, WD only will be
+* masked away.
 */
 static void update_pkr_bitmask(struct kvm_mmu *mmu)
 {
 	unsigned bit;
 	bool wp;
+	bool cr4_pke = is_cr4_pke(mmu);
+	bool cr4_pks = is_cr4_pks(mmu);
 
-	if (!is_cr4_pke(mmu)) {
+	if (!cr4_pke && !cr4_pks) {
 		mmu->pkr_mask = 0;
 		return;
 	}
@@ -4468,19 +4476,22 @@ static void update_pkr_bitmask(struct kvm_mmu *mmu)
 		pte_user = pfec & PFERR_RSVD_MASK;
 
 		/*
-		 * Only need to check the access which is not an
-		 * instruction fetch and is to a user page.
+		 * need to check the access which is not an
+		 * instruction fetch and
+		 * - if cr4_pke 1-setting when accessing a user page.
+		 * - if cr4_pks 1-setting when accessing a supervisor page.
 		 */
-		check_pkey = (!ff && pte_user);
+		check_pkey = !ff && (pte_user ? cr4_pke : cr4_pks);
+
 		/*
-		 * write access is controlled by PKRU if it is a
-		 * user access or CR0.WP = 1.
+		 * write access is controlled by PKRU/PKRS if
+		 * it is a user access or CR0.WP = 1.
 		 */
 		check_write = check_pkey && wf && (uf || wp);
 
-		/* PKRU.AD stops both read and write access. */
+		/* PKRU/PKRS.AD stops both read and write access. */
 		pkey_bits = !!check_pkey;
-		/* PKRU.WD stops write access. */
+		/* PKRU/PKRS.WD stops write access. */
 		pkey_bits |= (!!check_write) << 1;
 
 		mmu->pkr_mask |= (pkey_bits & 3) << pfec;
@@ -4531,6 +4542,7 @@ static union kvm_mmu_extended_role kvm_calc_mmu_role_ext(struct kvm_vcpu *vcpu,
 		/* PKEY and LA57 are active iff long mode is active. */
 		ext.cr4_pke = ____is_efer_lma(regs) && ____is_cr4_pke(regs);
 		ext.cr4_la57 = ____is_efer_lma(regs) && ____is_cr4_la57(regs);
+		ext.cr4_pks = ____is_efer_lma(regs) && ____is_cr4_pks(regs);
 	}
 
 	ext.valid = 1;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v5 6/7] KVM: VMX: Expose PKS to guest
  2021-08-11 10:11 [PATCH v5 0/7] KVM: PKS Virtualization support Chenyi Qiang
                   ` (4 preceding siblings ...)
  2021-08-11 10:11 ` [PATCH v5 5/7] KVM: MMU: Add support for PKS emulation Chenyi Qiang
@ 2021-08-11 10:11 ` Chenyi Qiang
  2021-11-08 21:31   ` Sean Christopherson
  2021-08-11 10:11 ` [PATCH v5 7/7] KVM: VMX: Enable PKS for nested VM Chenyi Qiang
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Chenyi Qiang @ 2021-08-11 10:11 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: Chenyi Qiang, kvm, linux-kernel

Existence of PKS is enumerated via CPUID.(EAX=7,ECX=0):ECX[31]. It is
enabled by setting CR4.PKS when long mode is active. PKS is only
implemented when EPT is enabled and requires the support of
VM_{ENTRY,EXIT}_LOAD_IA32_PKRS VMCS controls currently.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/cpuid.c            |  2 +-
 arch/x86/kvm/vmx/capabilities.h |  6 ++++++
 arch/x86/kvm/vmx/vmx.c          | 15 ++++++++++++---
 arch/x86/kvm/x86.h              |  2 ++
 5 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f31d19e851de..9abd9a4c2174 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -103,7 +103,8 @@
 			  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
 			  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
 			  | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
-			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP))
+			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
+			  | X86_CR4_PKS))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 739be5da3bca..dbee0d639db3 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -458,7 +458,7 @@ void kvm_set_cpu_caps(void)
 		F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
 		F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
 		F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
-		F(SGX_LC) | F(BUS_LOCK_DETECT)
+		F(SGX_LC) | F(BUS_LOCK_DETECT) | 0 /*PKS*/
 	);
 	/* Set LA57 based on hardware capability. */
 	if (cpuid_ecx(7) & F(LA57))
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 4705ad55abb5..3f6122fd8f65 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -104,6 +104,12 @@ static inline bool cpu_has_load_perf_global_ctrl(void)
 	       (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL);
 }
 
+static inline bool cpu_has_load_ia32_pkrs(void)
+{
+	return (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PKRS) &&
+	       (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PKRS);
+}
+
 static inline bool cpu_has_vmx_mpx(void)
 {
 	return (vmcs_config.vmexit_ctrl & VM_EXIT_CLEAR_BNDCFGS) &&
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0f3ca6a07a21..71f2aefd6454 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3218,7 +3218,7 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 		}
 
 		/*
-		 * SMEP/SMAP/PKU is disabled if CPU is in non-paging mode in
+		 * SMEP/SMAP/PKU/PKS is disabled if CPU is in non-paging mode in
 		 * hardware.  To emulate this behavior, SMEP/SMAP/PKU needs
 		 * to be manually disabled when guest switches to non-paging
 		 * mode.
@@ -3226,10 +3226,11 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 		 * If !enable_unrestricted_guest, the CPU is always running
 		 * with CR0.PG=1 and CR4 needs to be modified.
 		 * If enable_unrestricted_guest, the CPU automatically
-		 * disables SMEP/SMAP/PKU when the guest sets CR0.PG=0.
+		 * disables SMEP/SMAP/PKU/PKS when the guest sets CR0.PG=0.
 		 */
 		if (!is_paging(vcpu))
-			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
+			hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
+				    X86_CR4_PKS);
 	}
 
 	vmcs_writel(CR4_READ_SHADOW, cr4);
@@ -7311,6 +7312,14 @@ static __init void vmx_set_cpu_caps(void)
 
 	if (cpu_has_vmx_waitpkg())
 		kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
+
+	/*
+	 * PKS is not yet implemented for shadow paging.
+	 * If not support VM_{ENTRY, EXIT}_LOAD_IA32_PKRS,
+	 * don't expose the PKS as well.
+	 */
+	if (enable_ept && cpu_has_load_ia32_pkrs())
+		kvm_cpu_cap_check_and_set(X86_FEATURE_PKS);
 }
 
 static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f8aaf89e6dc5..a7040c6ef524 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -481,6 +481,8 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
 		__reserved_bits |= X86_CR4_VMXE;        \
 	if (!__cpu_has(__c, X86_FEATURE_PCID))          \
 		__reserved_bits |= X86_CR4_PCIDE;       \
+	if (!__cpu_has(__c, X86_FEATURE_PKS))		\
+		__reserved_bits |= X86_CR4_PKS;		\
 	__reserved_bits;                                \
 })
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v5 7/7] KVM: VMX: Enable PKS for nested VM
  2021-08-11 10:11 [PATCH v5 0/7] KVM: PKS Virtualization support Chenyi Qiang
                   ` (5 preceding siblings ...)
  2021-08-11 10:11 ` [PATCH v5 6/7] KVM: VMX: Expose PKS to guest Chenyi Qiang
@ 2021-08-11 10:11 ` Chenyi Qiang
  2021-08-26  2:04 ` [PATCH v5 0/7] KVM: PKS Virtualization support Chenyi Qiang
  2021-10-25 15:12 ` Paolo Bonzini
  8 siblings, 0 replies; 22+ messages in thread
From: Chenyi Qiang @ 2021-08-11 10:11 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: Chenyi Qiang, kvm, linux-kernel

PKS MSR passes through guest directly. Configure the MSR to match the
L0/L1 settings so that nested VM runs PKS properly.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/kvm/vmx/nested.c | 41 +++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/vmcs12.c |  2 ++
 arch/x86/kvm/vmx/vmcs12.h |  4 ++++
 arch/x86/kvm/vmx/vmx.c    |  1 +
 arch/x86/kvm/vmx/vmx.h    |  2 ++
 5 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 1a52134b0c42..b551e6ec7db5 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -251,6 +251,10 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx,
 	dest->ds_sel = src->ds_sel;
 	dest->es_sel = src->es_sel;
 #endif
+	if (unlikely(src->pkrs != dest->pkrs)) {
+		vmcs_write64(HOST_IA32_PKRS, src->pkrs);
+		dest->pkrs = src->pkrs;
+	}
 }
 
 static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
@@ -660,6 +664,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 					MSR_IA32_PRED_CMD,
 					MSR_TYPE_W);
 
+	if (!msr_write_intercepted_l01(vcpu, MSR_IA32_PKRS))
+		nested_vmx_disable_intercept_for_msr(
+					msr_bitmap_l1, msr_bitmap_l0,
+					MSR_IA32_PKRS,
+					MSR_TYPE_R | MSR_TYPE_W);
+
 	kvm_vcpu_unmap(vcpu, &to_vmx(vcpu)->nested.msr_bitmap_map, false);
 
 	return true;
@@ -2394,6 +2404,10 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 		if (kvm_mpx_supported() && vmx->nested.nested_run_pending &&
 		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
 			vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
+
+		if (vmx->nested.nested_run_pending &&
+		    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS))
+			vmcs_write64(GUEST_IA32_PKRS, vmcs12->guest_ia32_pkrs);
 	}
 
 	if (nested_cpu_has_xsaves(vmcs12))
@@ -2482,6 +2496,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 	if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
 	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
 		vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs);
+	if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
+	    (!vmx->nested.nested_run_pending ||
+	     !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
+		vmcs_write64(GUEST_IA32_PKRS, vmx->nested.vmcs01_guest_pkrs);
+
 	vmx_set_rflags(vcpu, vmcs12->guest_rflags);
 
 	/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
@@ -2840,6 +2859,10 @@ static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
 					   vmcs12->host_ia32_perf_global_ctrl)))
 		return -EINVAL;
 
+	if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PKRS) &&
+	    CC(!kvm_pkrs_valid(vmcs12->host_ia32_pkrs)))
+		return -EINVAL;
+
 #ifdef CONFIG_X86_64
 	ia32e = !!(vcpu->arch.efer & EFER_LMA);
 #else
@@ -2990,6 +3013,10 @@ static int nested_vmx_check_guest_state(struct kvm_vcpu *vcpu,
 	if (nested_check_guest_non_reg_state(vmcs12))
 		return -EINVAL;
 
+	if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS) &&
+	    CC(!kvm_pkrs_valid(vmcs12->guest_ia32_pkrs)))
+		return -EINVAL;
+
 	return 0;
 }
 
@@ -3325,6 +3352,9 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 	if (kvm_mpx_supported() &&
 		!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
 		vmx->nested.vmcs01_guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
+	if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
+	    !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS))
+		vmx->nested.vmcs01_guest_pkrs = vmcs_read64(GUEST_IA32_PKRS);
 
 	/*
 	 * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled *and*
@@ -3964,6 +3994,7 @@ static bool is_vmcs12_ext_field(unsigned long field)
 	case GUEST_IDTR_BASE:
 	case GUEST_PENDING_DBG_EXCEPTIONS:
 	case GUEST_BNDCFGS:
+	case GUEST_IA32_PKRS:
 		return true;
 	default:
 		break;
@@ -4015,6 +4046,8 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
 		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
 	if (kvm_mpx_supported())
 		vmcs12->guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
+	if (guest_cpuid_has(vcpu, X86_FEATURE_PKS))
+		vmcs12->guest_ia32_pkrs = vmcs_read64(GUEST_IA32_PKRS);
 
 	vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
 }
@@ -4252,6 +4285,9 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 		WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
 					 vmcs12->host_ia32_perf_global_ctrl));
 
+	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PKRS)
+		vmcs_write64(GUEST_IA32_PKRS, vmcs12->host_ia32_pkrs);
+
 	/* Set L1 segment info according to Intel SDM
 	    27.5.2 Loading Host Segment and Descriptor-Table Registers */
 	seg = (struct kvm_segment) {
@@ -6466,7 +6502,8 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps)
 		VM_EXIT_HOST_ADDR_SPACE_SIZE |
 #endif
 		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
-		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
+		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
+		VM_EXIT_LOAD_IA32_PKRS;
 	msrs->exit_ctls_high |=
 		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
@@ -6486,7 +6523,7 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps)
 		VM_ENTRY_IA32E_MODE |
 #endif
 		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
-		VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
+		VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | VM_ENTRY_LOAD_IA32_PKRS;
 	msrs->entry_ctls_high |=
 		(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER);
 
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index d9f5d7c56ae3..cf088dfa69c6 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -63,9 +63,11 @@ const unsigned short vmcs_field_to_offset_table[] = {
 	FIELD64(GUEST_PDPTR2, guest_pdptr2),
 	FIELD64(GUEST_PDPTR3, guest_pdptr3),
 	FIELD64(GUEST_BNDCFGS, guest_bndcfgs),
+	FIELD64(GUEST_IA32_PKRS, guest_ia32_pkrs),
 	FIELD64(HOST_IA32_PAT, host_ia32_pat),
 	FIELD64(HOST_IA32_EFER, host_ia32_efer),
 	FIELD64(HOST_IA32_PERF_GLOBAL_CTRL, host_ia32_perf_global_ctrl),
+	FIELD64(HOST_IA32_PKRS, host_ia32_pkrs),
 	FIELD(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control),
 	FIELD(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control),
 	FIELD(EXCEPTION_BITMAP, exception_bitmap),
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 5e0e1b39f495..2cf201f19155 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -185,6 +185,8 @@ struct __packed vmcs12 {
 	u16 host_gs_selector;
 	u16 host_tr_selector;
 	u16 guest_pml_index;
+	u64 host_ia32_pkrs;
+	u64 guest_ia32_pkrs;
 };
 
 /*
@@ -359,6 +361,8 @@ static inline void vmx_check_vmcs12_offsets(void)
 	CHECK_OFFSET(host_gs_selector, 992);
 	CHECK_OFFSET(host_tr_selector, 994);
 	CHECK_OFFSET(guest_pml_index, 996);
+	CHECK_OFFSET(host_ia32_pkrs, 998);
+	CHECK_OFFSET(guest_ia32_pkrs, 1006);
 }
 
 extern const unsigned short vmcs_field_to_offset_table[];
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 71f2aefd6454..a910d9b423eb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7112,6 +7112,7 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
 	cr4_fixed1_update(X86_CR4_PKE,        ecx, feature_bit(PKU));
 	cr4_fixed1_update(X86_CR4_UMIP,       ecx, feature_bit(UMIP));
 	cr4_fixed1_update(X86_CR4_LA57,       ecx, feature_bit(LA57));
+	cr4_fixed1_update(X86_CR4_PKS,        ecx, feature_bit(PKS));
 
 #undef cr4_fixed1_update
 }
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 858d9a959c72..fff314dc8356 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -203,6 +203,8 @@ struct nested_vmx {
 	u64 vmcs01_debugctl;
 	u64 vmcs01_guest_bndcfgs;
 
+	u64 vmcs01_guest_pkrs;
+
 	/* to migrate it to L1 if L2 writes to L1's CR8 directly */
 	int l1_tpr_threshold;
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/7] KVM: PKS Virtualization support
  2021-08-11 10:11 [PATCH v5 0/7] KVM: PKS Virtualization support Chenyi Qiang
                   ` (6 preceding siblings ...)
  2021-08-11 10:11 ` [PATCH v5 7/7] KVM: VMX: Enable PKS for nested VM Chenyi Qiang
@ 2021-08-26  2:04 ` Chenyi Qiang
  2021-10-25 15:12 ` Paolo Bonzini
  8 siblings, 0 replies; 22+ messages in thread
From: Chenyi Qiang @ 2021-08-26  2:04 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

Ping for comments.

One thing to pay attention:
The previous statement was incorrect – PKRS should be preserved on INIT, 
not cleared.


On 8/11/2021 6:11 PM, Chenyi Qiang wrote:
> This patch series is based on top of kernel patchset:
> https://lore.kernel.org/lkml/20210804043231.2655537-1-ira.weiny@intel.com/
> 
> To help patches review, one missing info in SDM is that PKSR will be
> cleared on Powerup/INIT/RESET, which should be listed in Table 9.1
> "IA-32 and Intel 64 Processor States Following Power-up, Reset, or INIT"
> 
> ---
> 
> Protection Keys for Supervisor Pages(PKS) is a feature that extends the
> Protection Keys architecture to support thread-specific permission
> restrictions on supervisor pages.
> 
> PKS works similar to an existing feature named PKU(protecting user pages).
> They both perform an additional check after normal paging permission
> checks are done. Access or Writes can be disabled via a MSR update
> without TLB flushes when permissions changes. If violating this
> addional check, #PF occurs and PFEC.PK bit will be set.
> 
> PKS introduces MSR IA32_PKRS to manage supervisor protection key
> rights. The MSR contains 16 pairs of ADi and WDi bits. Each pair
> advertises on a group of pages with the same key which is set in the
> leaf paging-structure entries(bits[62:59]). Currently, IA32_PKRS is not
> supported by XSAVES architecture.
> 
> This patchset aims to add the virtualization of PKS in KVM. It
> implemented PKS CPUID enumeration, vmentry/vmexit configuration, MSR
> exposure, nested supported etc. Currently, PKS is not yet supported for
> shadow paging.
> 
> Detailed information about PKS can be found in the latest Intel 64 and
> IA-32 Architectures Software Developer's Manual.
> 
> 
> ---
> 
> Changelogs:
> 
> v4->v5
> - Make setting of MSR intercept/vmcs control bits not dependent on guest.CR4.PKS.
>    And set them if PKS is exposed to guest. (Suggested by Sean)
> - Add pkrs to standard register caching mechanism to help update
>    vcpu->arch.pkrs on demand. Add related helper functions. (Suggested by Sean)
> - Do the real pkrs update in VMCS field in vmx_vcpu_reset and
>    vmx_sync_vmcs_host_state(). (Sean)
> - Add a new mmu_role cr4_pks instead of smushing PKU and PKS together.
>    (Sean & Paolo)
> - v4: https://lore.kernel.org/lkml/20210205083706.14146-1-chenyi.qiang@intel.com/
> 
> v3->v4
> - Make the MSR intercept and load-controls setting depend on CR4.PKS value
> - shadow the guest pkrs and make it usable in PKS emultion
> - add the cr4_pke and cr4_pks check in pkr_mask update
> - squash PATCH 2 and PATCH 5 to make the dependencies read more clear
> - v3: https://lore.kernel.org/lkml/20201105081805.5674-1-chenyi.qiang@intel.com/
> 
> v2->v3:
> - No function changes since last submit
> - rebase on the latest PKS kernel support:
>    https://lore.kernel.org/lkml/20201102205320.1458656-1-ira.weiny@intel.com/
> - add MSR_IA32_PKRS to the vmx_possible_passthrough_msrs[]
> - RFC v2: https://lore.kernel.org/lkml/20201014021157.18022-1-chenyi.qiang@intel.com/
> 
> v1->v2:
> - rebase on the latest PKS kernel support:
>    https://github.com/weiny2/linux-kernel/tree/pks-rfc-v3
> - add a kvm-unit-tests for PKS
> - add the check in kvm_init_msr_list for PKRS
> - place the X86_CR4_PKS in mmu_role_bits in kvm_set_cr4
> - add the support to expose VM_{ENTRY, EXIT}_LOAD_IA32_PKRS in nested
>    VMX MSR
> - RFC v1: https://lore.kernel.org/lkml/20200807084841.7112-1-chenyi.qiang@intel.com/
> 
> ---
> 
> Chenyi Qiang (7):
>    KVM: VMX: Introduce PKS VMCS fields
>    KVM: VMX: Add proper cache tracking for PKRS
>    KVM: X86: Expose IA32_PKRS MSR
>    KVM: MMU: Rename the pkru to pkr
>    KVM: MMU: Add support for PKS emulation
>    KVM: VMX: Expose PKS to guest
>    KVM: VMX: Enable PKS for nested VM
> 
>   arch/x86/include/asm/kvm_host.h | 17 ++++---
>   arch/x86/include/asm/vmx.h      |  6 +++
>   arch/x86/kvm/cpuid.c            |  2 +-
>   arch/x86/kvm/kvm_cache_regs.h   |  7 +++
>   arch/x86/kvm/mmu.h              | 25 +++++----
>   arch/x86/kvm/mmu/mmu.c          | 68 ++++++++++++++-----------
>   arch/x86/kvm/vmx/capabilities.h |  6 +++
>   arch/x86/kvm/vmx/nested.c       | 41 ++++++++++++++-
>   arch/x86/kvm/vmx/vmcs.h         |  1 +
>   arch/x86/kvm/vmx/vmcs12.c       |  2 +
>   arch/x86/kvm/vmx/vmcs12.h       |  4 ++
>   arch/x86/kvm/vmx/vmx.c          | 89 ++++++++++++++++++++++++++++++---
>   arch/x86/kvm/vmx/vmx.h          |  7 ++-
>   arch/x86/kvm/x86.c              |  6 ++-
>   arch/x86/kvm/x86.h              |  8 +++
>   arch/x86/mm/pkeys.c             |  6 +++
>   include/linux/pkeys.h           |  5 ++
>   17 files changed, 243 insertions(+), 57 deletions(-)
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/7] KVM: PKS Virtualization support
  2021-08-11 10:11 [PATCH v5 0/7] KVM: PKS Virtualization support Chenyi Qiang
                   ` (7 preceding siblings ...)
  2021-08-26  2:04 ` [PATCH v5 0/7] KVM: PKS Virtualization support Chenyi Qiang
@ 2021-10-25 15:12 ` Paolo Bonzini
  2021-10-26  3:14   ` Chenyi Qiang
  8 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2021-10-25 15:12 UTC (permalink / raw)
  To: Chenyi Qiang, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

On 11/08/21 12:11, Chenyi Qiang wrote:
> This patch series is based on top of kernel patchset:
> https://lore.kernel.org/lkml/20210804043231.2655537-1-ira.weiny@intel.com/
> 
> To help patches review, one missing info in SDM is that PKSR will be
> cleared on Powerup/INIT/RESET, which should be listed in Table 9.1
> "IA-32 and Intel 64 Processor States Following Power-up, Reset, or INIT"
> 
> ---
> 
> Protection Keys for Supervisor Pages(PKS) is a feature that extends the
> Protection Keys architecture to support thread-specific permission
> restrictions on supervisor pages.
> 
> PKS works similar to an existing feature named PKU(protecting user pages).
> They both perform an additional check after normal paging permission
> checks are done. Access or Writes can be disabled via a MSR update
> without TLB flushes when permissions changes. If violating this
> addional check, #PF occurs and PFEC.PK bit will be set.
> 
> PKS introduces MSR IA32_PKRS to manage supervisor protection key
> rights. The MSR contains 16 pairs of ADi and WDi bits. Each pair
> advertises on a group of pages with the same key which is set in the
> leaf paging-structure entries(bits[62:59]). Currently, IA32_PKRS is not
> supported by XSAVES architecture.
> 
> This patchset aims to add the virtualization of PKS in KVM. It
> implemented PKS CPUID enumeration, vmentry/vmexit configuration, MSR
> exposure, nested supported etc. Currently, PKS is not yet supported for
> shadow paging.
> 
> Detailed information about PKS can be found in the latest Intel 64 and
> IA-32 Architectures Software Developer's Manual.

Hi Chenyi,

pkrs_cache does not yet exist in Linux 5.15.  What is the state of the 
bare-metal support for PKS?

Thanks,

Paolo

> 
> ---
> 
> Changelogs:
> 
> v4->v5
> - Make setting of MSR intercept/vmcs control bits not dependent on guest.CR4.PKS.
>    And set them if PKS is exposed to guest. (Suggested by Sean)
> - Add pkrs to standard register caching mechanism to help update
>    vcpu->arch.pkrs on demand. Add related helper functions. (Suggested by Sean)
> - Do the real pkrs update in VMCS field in vmx_vcpu_reset and
>    vmx_sync_vmcs_host_state(). (Sean)
> - Add a new mmu_role cr4_pks instead of smushing PKU and PKS together.
>    (Sean & Paolo)
> - v4: https://lore.kernel.org/lkml/20210205083706.14146-1-chenyi.qiang@intel.com/
> 
> v3->v4
> - Make the MSR intercept and load-controls setting depend on CR4.PKS value
> - shadow the guest pkrs and make it usable in PKS emultion
> - add the cr4_pke and cr4_pks check in pkr_mask update
> - squash PATCH 2 and PATCH 5 to make the dependencies read more clear
> - v3: https://lore.kernel.org/lkml/20201105081805.5674-1-chenyi.qiang@intel.com/
> 
> v2->v3:
> - No function changes since last submit
> - rebase on the latest PKS kernel support:
>    https://lore.kernel.org/lkml/20201102205320.1458656-1-ira.weiny@intel.com/
> - add MSR_IA32_PKRS to the vmx_possible_passthrough_msrs[]
> - RFC v2: https://lore.kernel.org/lkml/20201014021157.18022-1-chenyi.qiang@intel.com/
> 
> v1->v2:
> - rebase on the latest PKS kernel support:
>    https://github.com/weiny2/linux-kernel/tree/pks-rfc-v3
> - add a kvm-unit-tests for PKS
> - add the check in kvm_init_msr_list for PKRS
> - place the X86_CR4_PKS in mmu_role_bits in kvm_set_cr4
> - add the support to expose VM_{ENTRY, EXIT}_LOAD_IA32_PKRS in nested
>    VMX MSR
> - RFC v1: https://lore.kernel.org/lkml/20200807084841.7112-1-chenyi.qiang@intel.com/
> 
> ---
> 
> Chenyi Qiang (7):
>    KVM: VMX: Introduce PKS VMCS fields
>    KVM: VMX: Add proper cache tracking for PKRS
>    KVM: X86: Expose IA32_PKRS MSR
>    KVM: MMU: Rename the pkru to pkr
>    KVM: MMU: Add support for PKS emulation
>    KVM: VMX: Expose PKS to guest
>    KVM: VMX: Enable PKS for nested VM
> 
>   arch/x86/include/asm/kvm_host.h | 17 ++++---
>   arch/x86/include/asm/vmx.h      |  6 +++
>   arch/x86/kvm/cpuid.c            |  2 +-
>   arch/x86/kvm/kvm_cache_regs.h   |  7 +++
>   arch/x86/kvm/mmu.h              | 25 +++++----
>   arch/x86/kvm/mmu/mmu.c          | 68 ++++++++++++++-----------
>   arch/x86/kvm/vmx/capabilities.h |  6 +++
>   arch/x86/kvm/vmx/nested.c       | 41 ++++++++++++++-
>   arch/x86/kvm/vmx/vmcs.h         |  1 +
>   arch/x86/kvm/vmx/vmcs12.c       |  2 +
>   arch/x86/kvm/vmx/vmcs12.h       |  4 ++
>   arch/x86/kvm/vmx/vmx.c          | 89 ++++++++++++++++++++++++++++++---
>   arch/x86/kvm/vmx/vmx.h          |  7 ++-
>   arch/x86/kvm/x86.c              |  6 ++-
>   arch/x86/kvm/x86.h              |  8 +++
>   arch/x86/mm/pkeys.c             |  6 +++
>   include/linux/pkeys.h           |  5 ++
>   17 files changed, 243 insertions(+), 57 deletions(-)
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/7] KVM: PKS Virtualization support
  2021-10-25 15:12 ` Paolo Bonzini
@ 2021-10-26  3:14   ` Chenyi Qiang
  0 siblings, 0 replies; 22+ messages in thread
From: Chenyi Qiang @ 2021-10-26  3:14 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel



On 10/25/2021 11:12 PM, Paolo Bonzini wrote:
> On 11/08/21 12:11, Chenyi Qiang wrote:
>> This patch series is based on top of kernel patchset:
>> https://lore.kernel.org/lkml/20210804043231.2655537-1-ira.weiny@intel.com/ 
>>
>>
>> To help patches review, one missing info in SDM is that PKSR will be
>> cleared on Powerup/INIT/RESET, which should be listed in Table 9.1
>> "IA-32 and Intel 64 Processor States Following Power-up, Reset, or INIT"
>>
>> ---
>>
>> Protection Keys for Supervisor Pages(PKS) is a feature that extends the
>> Protection Keys architecture to support thread-specific permission
>> restrictions on supervisor pages.
>>
>> PKS works similar to an existing feature named PKU(protecting user 
>> pages).
>> They both perform an additional check after normal paging permission
>> checks are done. Access or Writes can be disabled via a MSR update
>> without TLB flushes when permissions changes. If violating this
>> addional check, #PF occurs and PFEC.PK bit will be set.
>>
>> PKS introduces MSR IA32_PKRS to manage supervisor protection key
>> rights. The MSR contains 16 pairs of ADi and WDi bits. Each pair
>> advertises on a group of pages with the same key which is set in the
>> leaf paging-structure entries(bits[62:59]). Currently, IA32_PKRS is not
>> supported by XSAVES architecture.
>>
>> This patchset aims to add the virtualization of PKS in KVM. It
>> implemented PKS CPUID enumeration, vmentry/vmexit configuration, MSR
>> exposure, nested supported etc. Currently, PKS is not yet supported for
>> shadow paging.
>>
>> Detailed information about PKS can be found in the latest Intel 64 and
>> IA-32 Architectures Software Developer's Manual.
> 
> Hi Chenyi,
> 
> pkrs_cache does not yet exist in Linux 5.15.  What is the state of the 
> bare-metal support for PKS?
> 
> Thanks,
> 
> Paolo
> 

Hi Paolo,

The latest version is still at
https://lore.kernel.org/lkml/20210804043231.2655537-1-ira.weiny@intel.com/

Ira is working on the next version but doesn't have concrete schedule.

Thanks
Chenyi

>>
>> ---
>>
>> Changelogs:
>>
>> v4->v5
>> - Make setting of MSR intercept/vmcs control bits not dependent on 
>> guest.CR4.PKS.
>>    And set them if PKS is exposed to guest. (Suggested by Sean)
>> - Add pkrs to standard register caching mechanism to help update
>>    vcpu->arch.pkrs on demand. Add related helper functions. (Suggested 
>> by Sean)
>> - Do the real pkrs update in VMCS field in vmx_vcpu_reset and
>>    vmx_sync_vmcs_host_state(). (Sean)
>> - Add a new mmu_role cr4_pks instead of smushing PKU and PKS together.
>>    (Sean & Paolo)
>> - v4: 
>> https://lore.kernel.org/lkml/20210205083706.14146-1-chenyi.qiang@intel.com/ 
>>
>>
>> v3->v4
>> - Make the MSR intercept and load-controls setting depend on CR4.PKS 
>> value
>> - shadow the guest pkrs and make it usable in PKS emultion
>> - add the cr4_pke and cr4_pks check in pkr_mask update
>> - squash PATCH 2 and PATCH 5 to make the dependencies read more clear
>> - v3: 
>> https://lore.kernel.org/lkml/20201105081805.5674-1-chenyi.qiang@intel.com/ 
>>
>>
>> v2->v3:
>> - No function changes since last submit
>> - rebase on the latest PKS kernel support:
>>    
>> https://lore.kernel.org/lkml/20201102205320.1458656-1-ira.weiny@intel.com/ 
>>
>> - add MSR_IA32_PKRS to the vmx_possible_passthrough_msrs[]
>> - RFC v2: 
>> https://lore.kernel.org/lkml/20201014021157.18022-1-chenyi.qiang@intel.com/ 
>>
>>
>> v1->v2:
>> - rebase on the latest PKS kernel support:
>>    https://github.com/weiny2/linux-kernel/tree/pks-rfc-v3
>> - add a kvm-unit-tests for PKS
>> - add the check in kvm_init_msr_list for PKRS
>> - place the X86_CR4_PKS in mmu_role_bits in kvm_set_cr4
>> - add the support to expose VM_{ENTRY, EXIT}_LOAD_IA32_PKRS in nested
>>    VMX MSR
>> - RFC v1: 
>> https://lore.kernel.org/lkml/20200807084841.7112-1-chenyi.qiang@intel.com/ 
>>
>>
>> ---
>>
>> Chenyi Qiang (7):
>>    KVM: VMX: Introduce PKS VMCS fields
>>    KVM: VMX: Add proper cache tracking for PKRS
>>    KVM: X86: Expose IA32_PKRS MSR
>>    KVM: MMU: Rename the pkru to pkr
>>    KVM: MMU: Add support for PKS emulation
>>    KVM: VMX: Expose PKS to guest
>>    KVM: VMX: Enable PKS for nested VM
>>
>>   arch/x86/include/asm/kvm_host.h | 17 ++++---
>>   arch/x86/include/asm/vmx.h      |  6 +++
>>   arch/x86/kvm/cpuid.c            |  2 +-
>>   arch/x86/kvm/kvm_cache_regs.h   |  7 +++
>>   arch/x86/kvm/mmu.h              | 25 +++++----
>>   arch/x86/kvm/mmu/mmu.c          | 68 ++++++++++++++-----------
>>   arch/x86/kvm/vmx/capabilities.h |  6 +++
>>   arch/x86/kvm/vmx/nested.c       | 41 ++++++++++++++-
>>   arch/x86/kvm/vmx/vmcs.h         |  1 +
>>   arch/x86/kvm/vmx/vmcs12.c       |  2 +
>>   arch/x86/kvm/vmx/vmcs12.h       |  4 ++
>>   arch/x86/kvm/vmx/vmx.c          | 89 ++++++++++++++++++++++++++++++---
>>   arch/x86/kvm/vmx/vmx.h          |  7 ++-
>>   arch/x86/kvm/x86.c              |  6 ++-
>>   arch/x86/kvm/x86.h              |  8 +++
>>   arch/x86/mm/pkeys.c             |  6 +++
>>   include/linux/pkeys.h           |  5 ++
>>   17 files changed, 243 insertions(+), 57 deletions(-)
>>
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/7] KVM: VMX: Add proper cache tracking for PKRS
  2021-08-11 10:11 ` [PATCH v5 2/7] KVM: VMX: Add proper cache tracking for PKRS Chenyi Qiang
@ 2021-11-08 17:13   ` Sean Christopherson
  2021-11-08 18:07     ` Sean Christopherson
  0 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2021-11-08 17:13 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel

On Wed, Aug 11, 2021, Chenyi Qiang wrote:
> Add PKRS caching into the standard register caching mechanism in order
> to take advantage of the availability checks provided by regs_avail.
> 
> This is because vcpu->arch.pkrs will be rarely acceesed by KVM, only in
> the case of host userspace MSR reads and GVA->GPA translation in
> following patches. It is unnecessary to keep it up-to-date at all times.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 2 ++
>  arch/x86/kvm/kvm_cache_regs.h   | 7 +++++++
>  arch/x86/kvm/vmx/vmx.c          | 4 ++++
>  arch/x86/kvm/vmx/vmx.h          | 3 ++-
>  4 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 974cbfb1eefe..c2bcb88781b3 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -173,6 +173,7 @@ enum kvm_reg {
>  	VCPU_EXREG_SEGMENTS,
>  	VCPU_EXREG_EXIT_INFO_1,
>  	VCPU_EXREG_EXIT_INFO_2,
> +	VCPU_EXREG_PKRS,
>  };
>  
>  enum {
> @@ -620,6 +621,7 @@ struct kvm_vcpu_arch {
>  	unsigned long cr8;
>  	u32 host_pkru;
>  	u32 pkru;
> +	u32 pkrs;
>  	u32 hflags;
>  	u64 efer;
>  	u64 apic_base;
> diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
> index 90e1ffdc05b7..da014b1be874 100644
> --- a/arch/x86/kvm/kvm_cache_regs.h
> +++ b/arch/x86/kvm/kvm_cache_regs.h
> @@ -171,6 +171,13 @@ static inline u64 kvm_read_edx_eax(struct kvm_vcpu *vcpu)
>  		| ((u64)(kvm_rdx_read(vcpu) & -1u) << 32);
>  }
>  
> +static inline ulong kvm_read_pkrs(struct kvm_vcpu *vcpu)

Return value should be u32 (or u64 if we decide to track PKRS as a 64-bit value).

> +{
> +	if (!kvm_register_is_available(vcpu, VCPU_EXREG_PKRS))
> +		static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_PKRS);
> +	return vcpu->arch.pkrs;
> +}
> +
>  static inline void enter_guest_mode(struct kvm_vcpu *vcpu)
>  {
>  	vcpu->arch.hflags |= HF_GUEST_MASK;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 927a552393b9..bf911029aa35 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2273,6 +2273,10 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
>  		vcpu->arch.cr4 &= ~guest_owned_bits;
>  		vcpu->arch.cr4 |= vmcs_readl(GUEST_CR4) & guest_owned_bits;
>  		break;
> +	case VCPU_EXREG_PKRS:
> +		if (kvm_cpu_cap_has(X86_FEATURE_PKS))

Peeking ahead, the next patch rejects RDMSR(MSR_IA32_PKRS) if X86_FEATURE_PKS isn't
supported in KVM, i.e. this is WARN-worthy as KVM should PKRS if and only if PKS is
supported.  Since KVM will WARN if VMREAD fails, just omit this check and let VMREAD
handle any errors.  That won't detect the scenario where PKRS is supported in hardware
but disabled by the kernel/KVM, but that's an acceptable risk since any buggy path is
all but guaranteed to be reachable if PKRS isn't supported at all, i.e. the WARN will
fire and detect any bug in the more common case.

> +			vcpu->arch.pkrs = vmcs_read64(GUEST_IA32_PKRS);

Hrm.  I agree that it's extremely unlikely that IA32_PKRS will ever allow software
to set bits 63:32, but at the same time there's no real advantage to KVM it as a u32,
e.g. the extra 4 bytes per vCPU is a non-issue, and could be avoided by shuffling
kvm_vcpu_arch to get a more efficient layout.  On the flip side, using a 32 means
code like this _looks_ buggy because it's silently dropping bits 63:32, and if the
architecture ever does get updated, we'll have to modify a bunch of KVM code.

TL;DR: I vote to track PRKS as a u64 even though the kernel tracks it as a u32.

> +		break;
>  	default:
>  		WARN_ON_ONCE(1);
>  		break;
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index db88ed4f2121..18039eb6efb0 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -447,7 +447,8 @@ static inline void vmx_register_cache_reset(struct kvm_vcpu *vcpu)
>  				  | (1 << VCPU_EXREG_CR3)
>  				  | (1 << VCPU_EXREG_CR4)
>  				  | (1 << VCPU_EXREG_EXIT_INFO_1)
> -				  | (1 << VCPU_EXREG_EXIT_INFO_2));
> +				  | (1 << VCPU_EXREG_EXIT_INFO_2)
> +				  | (1 << VCPU_EXREG_PKRS));
>  	vcpu->arch.regs_dirty = 0;
>  }
>  
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 3/7] KVM: X86: Expose IA32_PKRS MSR
  2021-08-11 10:11 ` [PATCH v5 3/7] KVM: X86: Expose IA32_PKRS MSR Chenyi Qiang
@ 2021-11-08 17:44   ` Sean Christopherson
  2021-11-09  5:54     ` Chenyi Qiang
  2021-11-08 20:18   ` Sean Christopherson
  1 sibling, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2021-11-08 17:44 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel

On Wed, Aug 11, 2021, Chenyi Qiang wrote:
> +	u32           pkrs;

...

> @@ -1115,6 +1117,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
>  #endif
>  	unsigned long fs_base, gs_base;
>  	u16 fs_sel, gs_sel;
> +	u32 host_pkrs;

As mentioned in the previosu patch, I think it makes sense to track this as a u64
so that the only place in KVM that deas with the u64<=>u32 conversion is the below

	host_pkrs = get_current_pkrs();

>  	int i;
>  
>  	vmx->req_immediate_exit = false;
> @@ -1150,6 +1153,20 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
>  	 */
>  	host_state->ldt_sel = kvm_read_ldt();
>  
> +	/*
> +	 * Update the host pkrs vmcs field before vcpu runs.
> +	 * The setting of VM_EXIT_LOAD_IA32_PKRS can ensure
> +	 * kvm_cpu_cap_has(X86_FEATURE_PKS) &&
> +	 * guest_cpuid_has(vcpu, X86_FEATURE_PKS)
> +	 */
> +	if (vm_exit_controls_get(vmx) & VM_EXIT_LOAD_IA32_PKRS) {
> +		host_pkrs = get_current_pkrs();
> +		if (unlikely(host_pkrs != host_state->pkrs)) {
> +			vmcs_write64(HOST_IA32_PKRS, host_pkrs);
> +			host_state->pkrs = host_pkrs;
> +		}
> +	}
> +
>  #ifdef CONFIG_X86_64
>  	savesegment(ds, host_state->ds_sel);
>  	savesegment(es, host_state->es_sel);
> @@ -1371,6 +1388,15 @@ void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
>  		vmx->emulation_required = emulation_required(vcpu);
>  }
>  
> +static void vmx_set_pkrs(struct kvm_vcpu *vcpu, u64 pkrs)
> +{

Hrm.  Ideally this would be open coded in vmx_set_msr().  Long term, the RESET/INIT
paths should really treat MSR updates as "normal" host_initiated writes instead of
having to manually handle every MSR.

That would be a bit gross to handle in vmx_vcpu_reset() since it would have to
create a struct msr_data (because __kvm_set_msr() isn't exposed to vendor code),
but since vcpu->arch.pkrs is relevant to the MMU I think it makes sense to
initiate the write from common x86.

E.g. this way there's not out-of-band special code, vmx_vcpu_reset() is kept clean,
and if/when SVM gains support for PKRS this particular path Just Works.  And it would
be an easy conversion for my pipe dream plan of handling MSRs at RESET/INIT via a
list of MSRs+values.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ac83d873d65b..55881d13620f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11147,6 +11147,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
        kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
        kvm_rip_write(vcpu, 0xfff0);

+       if (kvm_cpu_cap_has(X86_FEATURE_PKS))
+               __kvm_set_msr(vcpu, MSR_IA32_PKRS, 0, true);
+
        vcpu->arch.cr3 = 0;
        kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);

> +	if (kvm_cpu_cap_has(X86_FEATURE_PKS)) {
> +		vcpu->arch.pkrs = pkrs;
> +		kvm_register_mark_available(vcpu, VCPU_EXREG_PKRS);
> +		vmcs_write64(GUEST_IA32_PKRS, pkrs);
> +	}
> +}

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/7] KVM: VMX: Add proper cache tracking for PKRS
  2021-11-08 17:13   ` Sean Christopherson
@ 2021-11-08 18:07     ` Sean Christopherson
  0 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2021-11-08 18:07 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel

On Mon, Nov 08, 2021, Sean Christopherson wrote:
> On Wed, Aug 11, 2021, Chenyi Qiang wrote:
> > +			vcpu->arch.pkrs = vmcs_read64(GUEST_IA32_PKRS);
> 
> Hrm.  I agree that it's extremely unlikely that IA32_PKRS will ever allow software
> to set bits 63:32, but at the same time there's no real advantage to KVM it as a u32,
> e.g. the extra 4 bytes per vCPU is a non-issue, and could be avoided by shuffling
> kvm_vcpu_arch to get a more efficient layout.  On the flip side, using a 32 means
> code like this _looks_ buggy because it's silently dropping bits 63:32, and if the
> architecture ever does get updated, we'll have to modify a bunch of KVM code.
> 
> TL;DR: I vote to track PRKS as a u64 even though the kernel tracks it as a u32.

Rats, I forgot that the MMU code for PKRU is going to be reused for PKRS.  I withdraw
my vote :-)

Maybe have this code WARN on bits 63:32 being set in the VMCS field?  E.g. to
detect if hardware ever changes and KVM fails to update this path, and to document
that KVM intentionally drops those bits.

	case VCPU_EXREG_PKRS: {
		u64 ia32_pkrs;

		ia32_pkrs = vmcs_read64(GUEST_IA32_PKRS);
		WARN_ON_ONCE(ia32_pkrs >> 32);
		vcpu->arch.pkrs = (u32)ia32_pkrs;
	}

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 5/7] KVM: MMU: Add support for PKS emulation
  2021-08-11 10:11 ` [PATCH v5 5/7] KVM: MMU: Add support for PKS emulation Chenyi Qiang
@ 2021-11-08 19:46   ` Sean Christopherson
  2021-11-09  6:42     ` Chenyi Qiang
  2021-11-08 19:52   ` Sean Christopherson
  1 sibling, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2021-11-08 19:46 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel

On Wed, Aug 11, 2021, Chenyi Qiang wrote:
>  * In particular the following conditions come from the error code, the
>  * page tables and the machine state:
> -* - PK is always zero unless CR4.PKE=1 and EFER.LMA=1
> +* - PK is always zero unless CR4.PKE=1/CR4.PKS=1 and EFER.LMA=1
>  * - PK is always zero if RSVD=1 (reserved bit set) or F=1 (instruction fetch)
> -* - PK is always zero if U=0 in the page tables
> -* - PKRU.WD is ignored if CR0.WP=0 and the access is a supervisor access.
> +* - PK is always zero if
> +*       - U=0 in the page tables and CR4.PKS=0
> +*       - U=1 in the page tables and CR4.PKU=0

I think it makes sense to completely rewrite this "table" or drop it altogether.
The "always zero" wording is nonsensical when there are multiple conditions for
"always".  And IMO the whole "PK is ... zero" thing is a bit awkward because it
leaves the uninitiated wondering what PK=0 even means ('1' == disabled is not the
most intuitive thing since most PTE bits are '1' = allowed).  Ugh, and re-reading
with context, that's not even what "PK" means here, this is actually referring to
PFEC.PK, which is all kinds of confusing because PFEC.PK is merely a "symptom" of
a #PF to due a protection key violation, not the other way 'round.

IMO this entire comment could use a good overhaul.  It never explicitly documents
the "access-disable" and "write-disable" behavior.  More below.

> +* - (PKRU/PKRS).WD is ignored if CR0.WP=0 and the access is a supervisor access.

Hrm.  The SDM contradicts itself.

Section 4.6.1 "Determination of Access Rights" says this for supervisor-mode accesses:

  If CR0.WP = 0, data may be written to any supervisor-mode address with a protection
  key for which write access is permitted.

but section 4.6.2 "Protection Keys" says:

  If WDi = 1, write accesses are not permitted if CR0.WP = 1. (If CR0.WP = 0,
  IA32_PKRS.WDi does not affect write accesses to supervisor-mode addresses with
  protection key i.)

I believe 4.6.1 is subtly wrong and should be "data access", not "write access".

  If CR0.WP = 0, data may be written to any supervisor-mode address with a protection
  key for which data access is permitted.
                ^^^^

Can you follow-up with someone to get the SDM fixed?  This stuff is subtle and
confusing enough as it is :-)

And on a very related topic, it would be helpful to clarify user-mode vs. supervisor-mode
and access vs. address.

How about this for a comment?

/*
 * Protection Key Rights (PKR) is an additional mechanism by which data accesses
 * with 4-level or 5-level paging (EFER.LMA=1) may be disabled based on the
 * Protection Key Rights Userspace (PRKU) or Protection Key Rights Supervisor
 * (PKRS) registers.  The Protection Key (PK) used for an access is a 4-bit
 * value specified in bits 62:59 of the leaf PTE used to translate the address.
 *
 * PKRU and PKRS are 32-bit registers, with 16 2-bit entries consisting of an
 * access-disable (AD) and write-disable (WD) bit.  The PK from the leaf PTE is
 * used to index the approriate PKR (see below), e.g. PK=1 would consume bits
 * 3:2 (bit 3 == write-disable, bit 2 == access-disable).
 *
 * The PK register (PKRU vs. PKRS) indexed by the PK depends on the type of
 * _address_ (not access type!).  For a user-mode address, PKRU is used; for a
 * supervisor-mode address, PKRS is used.  An address is supervisor-mode if the
 * U/S flag (bit 2) is 0 in at least one of the paging-structure entries, i.e.
 * an address is user-mode if the U/S flag is 0 in _all_ entries.  Again, this
 * is the address type, not the the access type, e.g. a supervisor-mode _access_
 * will consume PKRU if the _address_ is a user-mode address.
 *
 * As alluded to above, PKR checks are only performed for data accesses; code
 * fetches are not subject to PKR checks.  Terminal page faults (!PRESENT or
 * PFEC.RSVD=1) are also not subject to PKR checks.
 *
 * PKR write-disable checks for superivsor-mode _accesses_ are performed if and
 * only if CR0.WP=1 (though access-disable checks still apply).
 *
 * In summary, PKR checks are based on (a) EFER.LMA, (b) CR4.PKE or CR4.PKS,
 * (c) CR4.WP, (d) the PK in the leaf PTE, (e) two bits from the corresponding
 * PKR{S,U} entry, (f) the access type (derived from the other PFEC bits), and
 * (g) the address type (retrieved from the paging-structure entries).
 *
 * To avoid conditional branches in permission_fault(), the PKR bitmask caches
 * the above inputs, except for (e) the PKR{S,U} entry.  The FETCH, USER, and
 * WRITE bits of the PFEC and the effective value of the paging-structures' U/S
 * bit (slotted into the PFEC.RSVD position, bit 3) are used to index into the
 * PKR bitmask (similar to the 4-bit Protection Key itself).  The two bits of
 * the PKR bitmask "entry" are then extracted and ANDed with the two bits of
 * the PKR{S,U{} register corresponding to the address type and protection key.
 *
 * E.g. for all values where PFEC.FETCH=1, the corresponding pkr_bitmask bits
 * will be 00b, thus masking away the AD and WD bits from the PKR{S,U} register
 * to suppress PKR checks on code fetches.
*/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 5/7] KVM: MMU: Add support for PKS emulation
  2021-08-11 10:11 ` [PATCH v5 5/7] KVM: MMU: Add support for PKS emulation Chenyi Qiang
  2021-11-08 19:46   ` Sean Christopherson
@ 2021-11-08 19:52   ` Sean Christopherson
  1 sibling, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2021-11-08 19:52 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel

On Wed, Aug 11, 2021, Chenyi Qiang wrote:
> @@ -202,14 +202,17 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
>  	if (unlikely(mmu->pkr_mask)) {
>  		u32 pkr_bits, offset;
> +		u64 pkr;

Heh, MSR_IA32_PKRS strikes again.  This should be a u32.

>  
>  		/*
> -		* PKRU defines 32 bits, there are 16 domains and 2
> -		* attribute bits per domain in pkru.  pte_pkey is the
> -		* index of the protection domain, so pte_pkey * 2 is
> -		* is the index of the first bit for the domain.
> +		* PKRU and PKRS both define 32 bits. There are 16 domains
> +		* and 2 attribute bits per domain in them. pte_key is the
> +		* index of the protection domain, so pte_pkey * 2 is the
> +		* index of the first bit for the domain. The choice of
> +		* PKRU and PKRS is determined by the accessed pages.

Please replace "accessed pages" with something along the lines of

  The use of PKRU versus PKRS is selected by the address type, as determined by
  the U/S bit in the paging-structure entries.

I.e. try to avoid "access" in favor of "address" to follow the SDM's wording.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 3/7] KVM: X86: Expose IA32_PKRS MSR
  2021-08-11 10:11 ` [PATCH v5 3/7] KVM: X86: Expose IA32_PKRS MSR Chenyi Qiang
  2021-11-08 17:44   ` Sean Christopherson
@ 2021-11-08 20:18   ` Sean Christopherson
  1 sibling, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2021-11-08 20:18 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel

On Wed, Aug 11, 2021, Chenyi Qiang wrote:
> @@ -7207,6 +7257,19 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  
>  	/* Refresh #PF interception to account for MAXPHYADDR changes. */
>  	vmx_update_exception_bitmap(vcpu);
> +
> +	if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&
> +	    guest_cpuid_has(vcpu, X86_FEATURE_PKS)) {

Ah, this confused me for a second.  It's not wrong to clear the entry/exit controls
in the "else" path, but it's surprisingly hard to follow because it reads as if the
entry/exit controls are paired with the MSR behavior.

Oh, and more importantly, it's "hiding" a bug: the MSR bitmap needs to be _set_
if userspace disables X86_FEATURE_PKS in guest CPUID, e.g. if for some reason
userspace exposed PKS and then yanked it away.

Oof, two bugs actually.  This will fail to re-enable the entry/exit bits if
userspace hides PKS and then re-enables PKS.

Heh, make that three bugs.  If userspace never sets CPUID, KVM will run with
the entry/exit bits set.  That's arguably not a bug since functionally it's fine,
but it's a bug in the sense that KVM loads an MSR when it doesn't inted to do so.

So this should be:

	if (kvm_vcpu_cap_has(X86_FEATURE_PKS) {
		if (guest_cpuid_has(vcpu, X86_FEATURE_PKS)) {
			vmx_disable_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW);

			vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
			vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_IA32_PKRS)

		} else {
			vmx_enable_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW);

			vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
			vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS)
		}
	}

and then the bits need to be masked in vmx_vmexit_ctrl() and vmx_vmentry_ctrl(),
a la EFER and PERF_GLOBAL_CTRL.

> +		vmx_disable_intercept_for_msr(vcpu, MSR_IA32_PKRS, MSR_TYPE_RW);
> +	} else {
> +		/*
> +		 * Remove VM control in case guest VM doesn't support PKS to mitigate
> +		 * overhead during VM-{exit,entry}. They are present by default
> +		 * if supported.
> +		 */
> +		vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
> +		vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
> +	}
>  }

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 6/7] KVM: VMX: Expose PKS to guest
  2021-08-11 10:11 ` [PATCH v5 6/7] KVM: VMX: Expose PKS to guest Chenyi Qiang
@ 2021-11-08 21:31   ` Sean Christopherson
  0 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2021-11-08 21:31 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel

On Wed, Aug 11, 2021, Chenyi Qiang wrote:
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 739be5da3bca..dbee0d639db3 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -458,7 +458,7 @@ void kvm_set_cpu_caps(void)
>  		F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
>  		F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
>  		F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
> -		F(SGX_LC) | F(BUS_LOCK_DETECT)
> +		F(SGX_LC) | F(BUS_LOCK_DETECT) | 0 /*PKS*/

...

>  	);
>  	/* Set LA57 based on hardware capability. */
>  	if (cpuid_ecx(7) & F(LA57))

...

> @@ -7311,6 +7312,14 @@ static __init void vmx_set_cpu_caps(void)
>  
>  	if (cpu_has_vmx_waitpkg())
>  		kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
> +
> +	/*
> +	 * PKS is not yet implemented for shadow paging.
> +	 * If not support VM_{ENTRY, EXIT}_LOAD_IA32_PKRS,
> +	 * don't expose the PKS as well.
> +	 */
> +	if (enable_ept && cpu_has_load_ia32_pkrs())
> +		kvm_cpu_cap_check_and_set(X86_FEATURE_PKS);

I would rather handle the !TDP case in cpuid.c alongside the PKU.  The decision
to not support Protection Keys with legacy shadow paging is an x86 decision, not
a VMX decision.

And VMX's extra restriction on the VMCS support should not bleed into common x86.

Can you also opportunistically update the comment (see below) to explain _why_
OSPKE needs to be enabled in order to advertise PKU?

Thanks!

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 2d70edb0f323..c4ed6881857c 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -439,18 +439,23 @@ void kvm_set_cpu_caps(void)
                F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
                F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
                F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
-               F(SGX_LC) | F(BUS_LOCK_DETECT)
+               F(SGX_LC) | F(BUS_LOCK_DETECT) | F(PKS)
        );
        /* Set LA57 based on hardware capability. */
        if (cpuid_ecx(7) & F(LA57))
                kvm_cpu_cap_set(X86_FEATURE_LA57);

        /*
-        * PKU not yet implemented for shadow paging and requires OSPKE
-        * to be set on the host. Clear it if that is not the case
+        * Protection Keys are not supported for shadow paging.  PKU further
+        * requires OSPKE to be set on the host in order to use {RD,WR}PKRU to
+        * save/restore the guests PKRU.
         */
-       if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
+       if (!tdp_enabled) {
                kvm_cpu_cap_clear(X86_FEATURE_PKU);
+               kvm_cpu_cap_clear(X86_FEATURE_PKS);
+       } else if (!boot_cpu_has(X86_FEATURE_OSPKE)) {
+               kvm_cpu_cap_clear(X86_FEATURE_PKU);
+       }

        kvm_cpu_cap_mask(CPUID_7_EDX,
                F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |


and then vmx.c only needs to handle clearing PKS when the VMCS controls aren't
available.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 3/7] KVM: X86: Expose IA32_PKRS MSR
  2021-11-08 17:44   ` Sean Christopherson
@ 2021-11-09  5:54     ` Chenyi Qiang
  2021-11-09 15:30       ` Sean Christopherson
  0 siblings, 1 reply; 22+ messages in thread
From: Chenyi Qiang @ 2021-11-09  5:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel



On 11/9/2021 1:44 AM, Sean Christopherson wrote:
> On Wed, Aug 11, 2021, Chenyi Qiang wrote:
>> +	u32           pkrs;
> 
> ...
> 
>> @@ -1115,6 +1117,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
>>   #endif
>>   	unsigned long fs_base, gs_base;
>>   	u16 fs_sel, gs_sel;
>> +	u32 host_pkrs;
> 
> As mentioned in the previosu patch, I think it makes sense to track this as a u64
> so that the only place in KVM that deas with the u64<=>u32 conversion is the below
> 
> 	host_pkrs = get_current_pkrs();
> 
>>   	int i;
>>   
>>   	vmx->req_immediate_exit = false;
>> @@ -1150,6 +1153,20 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
>>   	 */
>>   	host_state->ldt_sel = kvm_read_ldt();
>>   
>> +	/*
>> +	 * Update the host pkrs vmcs field before vcpu runs.
>> +	 * The setting of VM_EXIT_LOAD_IA32_PKRS can ensure
>> +	 * kvm_cpu_cap_has(X86_FEATURE_PKS) &&
>> +	 * guest_cpuid_has(vcpu, X86_FEATURE_PKS)
>> +	 */
>> +	if (vm_exit_controls_get(vmx) & VM_EXIT_LOAD_IA32_PKRS) {
>> +		host_pkrs = get_current_pkrs();
>> +		if (unlikely(host_pkrs != host_state->pkrs)) {
>> +			vmcs_write64(HOST_IA32_PKRS, host_pkrs);
>> +			host_state->pkrs = host_pkrs;
>> +		}
>> +	}
>> +
>>   #ifdef CONFIG_X86_64
>>   	savesegment(ds, host_state->ds_sel);
>>   	savesegment(es, host_state->es_sel);
>> @@ -1371,6 +1388,15 @@ void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
>>   		vmx->emulation_required = emulation_required(vcpu);
>>   }
>>   
>> +static void vmx_set_pkrs(struct kvm_vcpu *vcpu, u64 pkrs)
>> +{
> 
> Hrm.  Ideally this would be open coded in vmx_set_msr().  Long term, the RESET/INIT
> paths should really treat MSR updates as "normal" host_initiated writes instead of
> having to manually handle every MSR.
> 
> That would be a bit gross to handle in vmx_vcpu_reset() since it would have to
> create a struct msr_data (because __kvm_set_msr() isn't exposed to vendor code),
> but since vcpu->arch.pkrs is relevant to the MMU I think it makes sense to
> initiate the write from common x86.
> 
> E.g. this way there's not out-of-band special code, vmx_vcpu_reset() is kept clean,
> and if/when SVM gains support for PKRS this particular path Just Works.  And it would
> be an easy conversion for my pipe dream plan of handling MSRs at RESET/INIT via a
> list of MSRs+values.
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ac83d873d65b..55881d13620f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11147,6 +11147,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>          kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
>          kvm_rip_write(vcpu, 0xfff0);
> 
> +       if (kvm_cpu_cap_has(X86_FEATURE_PKS))
> +               __kvm_set_msr(vcpu, MSR_IA32_PKRS, 0, true);
> +

Got it. In addition, is it necessary to add on-INIT check? like:

if (kvm_cpu_cap_has(X86_FEATURE_PKS) && !init_event)
	__kvm_set_msr(vcpu, MSR_IA32_PKRS, 0, true);

PKRS should be preserved on INIT, not cleared. The SDM doesn't make this 
clear either.

>          vcpu->arch.cr3 = 0;
>          kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
> 
>> +	if (kvm_cpu_cap_has(X86_FEATURE_PKS)) {
>> +		vcpu->arch.pkrs = pkrs;
>> +		kvm_register_mark_available(vcpu, VCPU_EXREG_PKRS);
>> +		vmcs_write64(GUEST_IA32_PKRS, pkrs);
>> +	}
>> +}

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 5/7] KVM: MMU: Add support for PKS emulation
  2021-11-08 19:46   ` Sean Christopherson
@ 2021-11-09  6:42     ` Chenyi Qiang
  0 siblings, 0 replies; 22+ messages in thread
From: Chenyi Qiang @ 2021-11-09  6:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel



On 11/9/2021 3:46 AM, Sean Christopherson wrote:
> On Wed, Aug 11, 2021, Chenyi Qiang wrote:
>>   * In particular the following conditions come from the error code, the
>>   * page tables and the machine state:
>> -* - PK is always zero unless CR4.PKE=1 and EFER.LMA=1
>> +* - PK is always zero unless CR4.PKE=1/CR4.PKS=1 and EFER.LMA=1
>>   * - PK is always zero if RSVD=1 (reserved bit set) or F=1 (instruction fetch)
>> -* - PK is always zero if U=0 in the page tables
>> -* - PKRU.WD is ignored if CR0.WP=0 and the access is a supervisor access.
>> +* - PK is always zero if
>> +*       - U=0 in the page tables and CR4.PKS=0
>> +*       - U=1 in the page tables and CR4.PKU=0
> 
> I think it makes sense to completely rewrite this "table" or drop it altogether.
> The "always zero" wording is nonsensical when there are multiple conditions for
> "always".  And IMO the whole "PK is ... zero" thing is a bit awkward because it
> leaves the uninitiated wondering what PK=0 even means ('1' == disabled is not the
> most intuitive thing since most PTE bits are '1' = allowed).  Ugh, and re-reading
> with context, that's not even what "PK" means here, this is actually referring to
> PFEC.PK, which is all kinds of confusing because PFEC.PK is merely a "symptom" of
> a #PF to due a protection key violation, not the other way 'round.
> 
> IMO this entire comment could use a good overhaul.  It never explicitly documents
> the "access-disable" and "write-disable" behavior.  More below.
> 
>> +* - (PKRU/PKRS).WD is ignored if CR0.WP=0 and the access is a supervisor access.
> 
> Hrm.  The SDM contradicts itself.
> 
> Section 4.6.1 "Determination of Access Rights" says this for supervisor-mode accesses:
> 
>    If CR0.WP = 0, data may be written to any supervisor-mode address with a protection
>    key for which write access is permitted.
> 
> but section 4.6.2 "Protection Keys" says:
> 
>    If WDi = 1, write accesses are not permitted if CR0.WP = 1. (If CR0.WP = 0,
>    IA32_PKRS.WDi does not affect write accesses to supervisor-mode addresses with
>    protection key i.)
> 
> I believe 4.6.1 is subtly wrong and should be "data access", not "write access".
> 
>    If CR0.WP = 0, data may be written to any supervisor-mode address with a protection
>    key for which data access is permitted.
>                  ^^^^
> 
> Can you follow-up with someone to get the SDM fixed?  This stuff is subtle and
> confusing enough as it is :-)
> 

Nice catch. I'll mention it internally to fix it.

> And on a very related topic, it would be helpful to clarify user-mode vs. supervisor-mode
> and access vs. address.
> 
> How about this for a comment?
> 
> /*
>   * Protection Key Rights (PKR) is an additional mechanism by which data accesses
>   * with 4-level or 5-level paging (EFER.LMA=1) may be disabled based on the
>   * Protection Key Rights Userspace (PRKU) or Protection Key Rights Supervisor
>   * (PKRS) registers.  The Protection Key (PK) used for an access is a 4-bit
>   * value specified in bits 62:59 of the leaf PTE used to translate the address.
>   *
>   * PKRU and PKRS are 32-bit registers, with 16 2-bit entries consisting of an
>   * access-disable (AD) and write-disable (WD) bit.  The PK from the leaf PTE is
>   * used to index the approriate PKR (see below), e.g. PK=1 would consume bits
>   * 3:2 (bit 3 == write-disable, bit 2 == access-disable).
>   *
>   * The PK register (PKRU vs. PKRS) indexed by the PK depends on the type of
>   * _address_ (not access type!).  For a user-mode address, PKRU is used; for a
>   * supervisor-mode address, PKRS is used.  An address is supervisor-mode if the
>   * U/S flag (bit 2) is 0 in at least one of the paging-structure entries, i.e.
>   * an address is user-mode if the U/S flag is 0 in _all_ entries.  Again, this
>   * is the address type, not the the access type, e.g. a supervisor-mode _access_
>   * will consume PKRU if the _address_ is a user-mode address.
>   *
>   * As alluded to above, PKR checks are only performed for data accesses; code
>   * fetches are not subject to PKR checks.  Terminal page faults (!PRESENT or
>   * PFEC.RSVD=1) are also not subject to PKR checks.
>   *
>   * PKR write-disable checks for superivsor-mode _accesses_ are performed if and
>   * only if CR0.WP=1 (though access-disable checks still apply).
>   *
>   * In summary, PKR checks are based on (a) EFER.LMA, (b) CR4.PKE or CR4.PKS,
>   * (c) CR4.WP, (d) the PK in the leaf PTE, (e) two bits from the corresponding
>   * PKR{S,U} entry, (f) the access type (derived from the other PFEC bits), and
>   * (g) the address type (retrieved from the paging-structure entries).
>   *
>   * To avoid conditional branches in permission_fault(), the PKR bitmask caches
>   * the above inputs, except for (e) the PKR{S,U} entry.  The FETCH, USER, and
>   * WRITE bits of the PFEC and the effective value of the paging-structures' U/S
>   * bit (slotted into the PFEC.RSVD position, bit 3) are used to index into the
>   * PKR bitmask (similar to the 4-bit Protection Key itself).  The two bits of
>   * the PKR bitmask "entry" are then extracted and ANDed with the two bits of
>   * the PKR{S,U{} register corresponding to the address type and protection key.
>   *
>   * E.g. for all values where PFEC.FETCH=1, the corresponding pkr_bitmask bits
>   * will be 00b, thus masking away the AD and WD bits from the PKR{S,U} register
>   * to suppress PKR checks on code fetches.
> */

Very clear comment. I'll clean it up and change in next version.

> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 3/7] KVM: X86: Expose IA32_PKRS MSR
  2021-11-09  5:54     ` Chenyi Qiang
@ 2021-11-09 15:30       ` Sean Christopherson
  2021-11-10  0:56         ` Chenyi Qiang
  0 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2021-11-09 15:30 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel

On Tue, Nov 09, 2021, Chenyi Qiang wrote:
> 
> On 11/9/2021 1:44 AM, Sean Christopherson wrote:
> > Hrm.  Ideally this would be open coded in vmx_set_msr().  Long term, the RESET/INIT
> > paths should really treat MSR updates as "normal" host_initiated writes instead of
> > having to manually handle every MSR.
> > 
> > That would be a bit gross to handle in vmx_vcpu_reset() since it would have to
> > create a struct msr_data (because __kvm_set_msr() isn't exposed to vendor code),
> > but since vcpu->arch.pkrs is relevant to the MMU I think it makes sense to
> > initiate the write from common x86.
> > 
> > E.g. this way there's not out-of-band special code, vmx_vcpu_reset() is kept clean,
> > and if/when SVM gains support for PKRS this particular path Just Works.  And it would
> > be an easy conversion for my pipe dream plan of handling MSRs at RESET/INIT via a
> > list of MSRs+values.
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index ac83d873d65b..55881d13620f 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -11147,6 +11147,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> >          kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
> >          kvm_rip_write(vcpu, 0xfff0);
> > 
> > +       if (kvm_cpu_cap_has(X86_FEATURE_PKS))
> > +               __kvm_set_msr(vcpu, MSR_IA32_PKRS, 0, true);
> > +
> 
> Got it. In addition, is it necessary to add on-INIT check? like:
> 
> if (kvm_cpu_cap_has(X86_FEATURE_PKS) && !init_event)
> 	__kvm_set_msr(vcpu, MSR_IA32_PKRS, 0, true);
> 
> PKRS should be preserved on INIT, not cleared. The SDM doesn't make this
> clear either.

Hmm, but your cover letter says:

  To help patches review, one missing info in SDM is that PKSR will be
  cleared on Powerup/INIT/RESET, which should be listed in Table 9.1
  "IA-32 and Intel 64 Processor States Following Power-up, Reset, or INIT"

Which honestly makes me a little happy because I thought I was making stuff up
for a minute :-)

So which is it?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 3/7] KVM: X86: Expose IA32_PKRS MSR
  2021-11-09 15:30       ` Sean Christopherson
@ 2021-11-10  0:56         ` Chenyi Qiang
  0 siblings, 0 replies; 22+ messages in thread
From: Chenyi Qiang @ 2021-11-10  0:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel



On 11/9/2021 11:30 PM, Sean Christopherson wrote:
> On Tue, Nov 09, 2021, Chenyi Qiang wrote:
>>
>> On 11/9/2021 1:44 AM, Sean Christopherson wrote:
>>> Hrm.  Ideally this would be open coded in vmx_set_msr().  Long term, the RESET/INIT
>>> paths should really treat MSR updates as "normal" host_initiated writes instead of
>>> having to manually handle every MSR.
>>>
>>> That would be a bit gross to handle in vmx_vcpu_reset() since it would have to
>>> create a struct msr_data (because __kvm_set_msr() isn't exposed to vendor code),
>>> but since vcpu->arch.pkrs is relevant to the MMU I think it makes sense to
>>> initiate the write from common x86.
>>>
>>> E.g. this way there's not out-of-band special code, vmx_vcpu_reset() is kept clean,
>>> and if/when SVM gains support for PKRS this particular path Just Works.  And it would
>>> be an easy conversion for my pipe dream plan of handling MSRs at RESET/INIT via a
>>> list of MSRs+values.
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index ac83d873d65b..55881d13620f 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -11147,6 +11147,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>>>           kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
>>>           kvm_rip_write(vcpu, 0xfff0);
>>>
>>> +       if (kvm_cpu_cap_has(X86_FEATURE_PKS))
>>> +               __kvm_set_msr(vcpu, MSR_IA32_PKRS, 0, true);
>>> +
>>
>> Got it. In addition, is it necessary to add on-INIT check? like:
>>
>> if (kvm_cpu_cap_has(X86_FEATURE_PKS) && !init_event)
>> 	__kvm_set_msr(vcpu, MSR_IA32_PKRS, 0, true);
>>
>> PKRS should be preserved on INIT, not cleared. The SDM doesn't make this
>> clear either.
> 
> Hmm, but your cover letter says:
> 
>    To help patches review, one missing info in SDM is that PKSR will be
>    cleared on Powerup/INIT/RESET, which should be listed in Table 9.1
>    "IA-32 and Intel 64 Processor States Following Power-up, Reset, or INIT"
> 
> Which honestly makes me a little happy because I thought I was making stuff up
> for a minute :-)
> 
> So which is it?

Sorry about the confusion. PKRS is preserved on INIT. I tried to correct 
my statement in previous ping mail but seems not so obvious.

> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-11-10  0:56 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-11 10:11 [PATCH v5 0/7] KVM: PKS Virtualization support Chenyi Qiang
2021-08-11 10:11 ` [PATCH v5 1/7] KVM: VMX: Introduce PKS VMCS fields Chenyi Qiang
2021-08-11 10:11 ` [PATCH v5 2/7] KVM: VMX: Add proper cache tracking for PKRS Chenyi Qiang
2021-11-08 17:13   ` Sean Christopherson
2021-11-08 18:07     ` Sean Christopherson
2021-08-11 10:11 ` [PATCH v5 3/7] KVM: X86: Expose IA32_PKRS MSR Chenyi Qiang
2021-11-08 17:44   ` Sean Christopherson
2021-11-09  5:54     ` Chenyi Qiang
2021-11-09 15:30       ` Sean Christopherson
2021-11-10  0:56         ` Chenyi Qiang
2021-11-08 20:18   ` Sean Christopherson
2021-08-11 10:11 ` [PATCH v5 4/7] KVM: MMU: Rename the pkru to pkr Chenyi Qiang
2021-08-11 10:11 ` [PATCH v5 5/7] KVM: MMU: Add support for PKS emulation Chenyi Qiang
2021-11-08 19:46   ` Sean Christopherson
2021-11-09  6:42     ` Chenyi Qiang
2021-11-08 19:52   ` Sean Christopherson
2021-08-11 10:11 ` [PATCH v5 6/7] KVM: VMX: Expose PKS to guest Chenyi Qiang
2021-11-08 21:31   ` Sean Christopherson
2021-08-11 10:11 ` [PATCH v5 7/7] KVM: VMX: Enable PKS for nested VM Chenyi Qiang
2021-08-26  2:04 ` [PATCH v5 0/7] KVM: PKS Virtualization support Chenyi Qiang
2021-10-25 15:12 ` Paolo Bonzini
2021-10-26  3:14   ` Chenyi Qiang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).