LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 0/8] Re-enable ENQCMD and PASID MSR
@ 2021-09-20 19:23 Fenghua Yu
  2021-09-20 19:23 ` [PATCH 1/8] iommu/vt-d: Clean up unused PASID updating functions Fenghua Yu
                   ` (7 more replies)
  0 siblings, 8 replies; 77+ messages in thread
From: Fenghua Yu @ 2021-09-20 19:23 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar
  Cc: iommu, x86, linux-kernel, Fenghua Yu

Since updating PASID (Process Address Space ID) MSR through IPI has a few
issues that are beyond repair, Thomas disables ENQCMD [1].

Please check Documentation/x86/sva.rst for various concepts and terms
related to PASID, ENQCMD, SVM (Shared Virtual Memory), etc.

This series re-enables ENQCMD and IA32_PASID MSR by using a #GP fix up
method previously published in [2]. A PASID is allocated to a mm once
a SVM is bound to the mm via intel_svm_bind() API. The #GP fix up method
updates the PASID MSR from the mm's PASID in #GP handler when one thread
in a process executes ENQCMD for the first time and one reference is taken
to the PASID. Once the MSR is uploaded, the thread keeps and can use it
for the rest life time of the thread. In exit(2) or unbind, the PASID's
reference is dropped and the PASID is freed if there is no reference.

References:
1. ENQCMD was disabled in upstream due to serious issues:
https://lore.kernel.org/linux-iommu/87mtsd6gr9.ffs@nanos.tec.linutronix.de/

2. #GP fix up PASID MSR:
https://lore.kernel.org/linux-iommu/1594684087-61184-1-git-send-email-fenghua.yu@intel.com/

Fenghua Yu (7):
  iommu/vt-d: Clean up unused PASID updating functions
  x86/process: Clear PASID state for a newly forked/cloned thread
  x86/traps: Demand-populate PASID MSR via #GP
  x86/mmu: Add mm-based PASID refcounting
  x86/cpufeatures: Re-enable ENQCMD
  tools/objtool: Check for use of the ENQCMD instruction in the kernel
  docs: x86: Change documentation for SVA (Shared Virtual Addressing)

Peter Zijlstra (1):
  sched: Define and initialize a flag to identify valid PASID in the
    task

 Documentation/x86/sva.rst                | 81 ++++++++++++++++++--
 arch/x86/include/asm/disabled-features.h |  7 +-
 arch/x86/include/asm/fpu/api.h           |  6 +-
 arch/x86/include/asm/iommu.h             |  8 ++
 arch/x86/include/asm/mmu_context.h       |  2 +
 arch/x86/kernel/fpu/xstate.c             | 59 +++++++++++++++
 arch/x86/kernel/process.c                |  8 ++
 arch/x86/kernel/traps.c                  | 12 +++
 drivers/iommu/intel/svm.c                | 95 ++++++++++++++++++------
 include/linux/sched.h                    |  4 +
 kernel/fork.c                            |  4 +
 tools/objtool/arch/x86/decode.c          | 10 ++-
 tools/objtool/check.c                    | 20 +++++
 tools/objtool/include/objtool/arch.h     |  1 +
 14 files changed, 283 insertions(+), 34 deletions(-)

-- 
2.33.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH 1/8] iommu/vt-d: Clean up unused PASID updating functions
  2021-09-20 19:23 [PATCH 0/8] Re-enable ENQCMD and PASID MSR Fenghua Yu
@ 2021-09-20 19:23 ` Fenghua Yu
  2021-09-29  7:34   ` Lu Baolu
  2021-09-20 19:23 ` [PATCH 2/8] x86/process: Clear PASID state for a newly forked/cloned thread Fenghua Yu
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 77+ messages in thread
From: Fenghua Yu @ 2021-09-20 19:23 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar
  Cc: iommu, x86, linux-kernel, Fenghua Yu

update_pasid() and its call chain are currently unused in the tree because
Thomas disabled the ENQCMD feature. The feature will be re-enabled shortly
using a different approach and update_pasid() and its call chain will not
be used in the new approach.

Remove the useless functions.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/fpu/api.h |  2 --
 drivers/iommu/intel/svm.c      | 24 +-----------------------
 2 files changed, 1 insertion(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 23bef08a8388..ca4d0dee1ecd 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -106,6 +106,4 @@ extern int cpu_has_xfeatures(u64 xfeatures_mask, const char **feature_name);
  */
 #define PASID_DISABLED	0
 
-static inline void update_pasid(void) { }
-
 #endif /* _ASM_X86_FPU_API_H */
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 0c228787704f..5b5d69b04fcc 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -505,21 +505,6 @@ int intel_svm_unbind_gpasid(struct device *dev, u32 pasid)
 	return ret;
 }
 
-static void _load_pasid(void *unused)
-{
-	update_pasid();
-}
-
-static void load_pasid(struct mm_struct *mm, u32 pasid)
-{
-	mutex_lock(&mm->context.lock);
-
-	/* Update PASID MSR on all CPUs running the mm's tasks. */
-	on_each_cpu_mask(mm_cpumask(mm), _load_pasid, NULL, true);
-
-	mutex_unlock(&mm->context.lock);
-}
-
 static int intel_svm_alloc_pasid(struct device *dev, struct mm_struct *mm,
 				 unsigned int flags)
 {
@@ -614,10 +599,6 @@ static struct iommu_sva *intel_svm_bind_mm(struct intel_iommu *iommu,
 	if (ret)
 		goto free_sdev;
 
-	/* The newly allocated pasid is loaded to the mm. */
-	if (!(flags & SVM_FLAG_SUPERVISOR_MODE) && list_empty(&svm->devs))
-		load_pasid(mm, svm->pasid);
-
 	list_add_rcu(&sdev->list, &svm->devs);
 success:
 	return &sdev->sva;
@@ -670,11 +651,8 @@ static int intel_svm_unbind_mm(struct device *dev, u32 pasid)
 			kfree_rcu(sdev, rcu);
 
 			if (list_empty(&svm->devs)) {
-				if (svm->notifier.ops) {
+				if (svm->notifier.ops)
 					mmu_notifier_unregister(&svm->notifier, mm);
-					/* Clear mm's pasid. */
-					load_pasid(mm, PASID_DISABLED);
-				}
 				pasid_private_remove(svm->pasid);
 				/* We mandate that no page faults may be outstanding
 				 * for the PASID when intel_svm_unbind_mm() is called.
-- 
2.33.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH 2/8] x86/process: Clear PASID state for a newly forked/cloned thread
  2021-09-20 19:23 [PATCH 0/8] Re-enable ENQCMD and PASID MSR Fenghua Yu
  2021-09-20 19:23 ` [PATCH 1/8] iommu/vt-d: Clean up unused PASID updating functions Fenghua Yu
@ 2021-09-20 19:23 ` Fenghua Yu
  2021-09-20 19:23 ` [PATCH 3/8] sched: Define and initialize a flag to identify valid PASID in the task Fenghua Yu
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 77+ messages in thread
From: Fenghua Yu @ 2021-09-20 19:23 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar
  Cc: iommu, x86, linux-kernel, Fenghua Yu

The PASID state has to be cleared on forks, since the child has a
different address space. The PASID is also cleared for thread clone. While
it would be correct to inherit the PASID in this case, it is unknown
whether the new task will use ENQCMD. Giving it the PASID "just in case"
would have the downside of increased context switch overhead to setting
the PASID MSR and losing init optimization.

Since #GP faults have to be handled on any threads that were created before
the PASID was assigned to the mm of the process, newly created threads
might as well be treated in a consistent way.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/process.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 1d9463e3096b..c713986ef7d7 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -154,6 +154,14 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	frame->flags = X86_EFLAGS_FIXED;
 #endif
 
+	if (static_cpu_has(X86_FEATURE_ENQCMD)) {
+		/*
+		 * Clear the PASID bit in xfeatures so that the PASID MSR
+		 * will be initialized as init state (0).
+		 */
+		p->thread.fpu.state.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID;
+	}
+
 	/* Kernel thread ? */
 	if (unlikely(p->flags & PF_KTHREAD)) {
 		p->thread.pkru = pkru_get_init_value();
-- 
2.33.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH 3/8] sched: Define and initialize a flag to identify valid PASID in the task
  2021-09-20 19:23 [PATCH 0/8] Re-enable ENQCMD and PASID MSR Fenghua Yu
  2021-09-20 19:23 ` [PATCH 1/8] iommu/vt-d: Clean up unused PASID updating functions Fenghua Yu
  2021-09-20 19:23 ` [PATCH 2/8] x86/process: Clear PASID state for a newly forked/cloned thread Fenghua Yu
@ 2021-09-20 19:23 ` Fenghua Yu
  2021-09-20 19:23 ` [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP Fenghua Yu
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 77+ messages in thread
From: Fenghua Yu @ 2021-09-20 19:23 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar
  Cc: iommu, x86, linux-kernel, Fenghua Yu

From: Peter Zijlstra <peterz@infradead.org>

Add a new field to the task structure to track whether this task
has initialized the IA32_PASID MSR (and thus holds a reference
count on the PASID for this process).

Initialize the field to zero when creating a new task with fork/clone.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 include/linux/sched.h | 4 ++++
 kernel/fork.c         | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 39039ce8ac4c..21a8cff9155c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -936,6 +936,10 @@ struct task_struct {
 	unsigned			in_eventfd_signal:1;
 #endif
 
+#ifdef CONFIG_IOMMU_SUPPORT
+	unsigned			has_valid_pasid:1;
+#endif
+
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
 	struct restart_block		restart_block;
diff --git a/kernel/fork.c b/kernel/fork.c
index 38681ad44c76..e379f88260eb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -967,6 +967,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->use_memdelay = 0;
 #endif
 
+#ifdef CONFIG_IOMMU_SUPPORT
+	tsk->has_valid_pasid = 0;
+#endif
+
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
 #endif
-- 
2.33.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-20 19:23 [PATCH 0/8] Re-enable ENQCMD and PASID MSR Fenghua Yu
                   ` (2 preceding siblings ...)
  2021-09-20 19:23 ` [PATCH 3/8] sched: Define and initialize a flag to identify valid PASID in the task Fenghua Yu
@ 2021-09-20 19:23 ` Fenghua Yu
  2021-09-22 21:07   ` Peter Zijlstra
                     ` (2 more replies)
  2021-09-20 19:23 ` [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting Fenghua Yu
                   ` (3 subsequent siblings)
  7 siblings, 3 replies; 77+ messages in thread
From: Fenghua Yu @ 2021-09-20 19:23 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar
  Cc: iommu, x86, linux-kernel, Fenghua Yu

ENQCMD requires the IA32_PASID MSR has a valid PASID value which was
allocated to the process during bind. The MSR could be populated eagerly
by an IPI after the PASID is allocated in bind. But the method was
disabled in commit 9bfecd058339 ("x86/cpufeatures: Force disable
X86_FEATURE_ENQCMD and remove update_pasid()")' due to locking and other
issues.

Since the MSR was cleared in fork()/clone(), the first ENQCMD will
generate a #GP fault. The #GP fault handler will initialize the MSR
if a PASID has been allocated for this process.

The lazy enabling of the PASID MSR in the #GP handler is not an elegant
solution. But it has the least complexity that fits with h/w behavior.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/fpu/api.h |  6 ++++
 arch/x86/include/asm/iommu.h   |  2 ++
 arch/x86/kernel/fpu/xstate.c   | 59 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/traps.c        | 12 +++++++
 drivers/iommu/intel/svm.c      | 32 ++++++++++++++++++
 5 files changed, 111 insertions(+)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index ca4d0dee1ecd..f146849e5c8c 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -106,4 +106,10 @@ extern int cpu_has_xfeatures(u64 xfeatures_mask, const char **feature_name);
  */
 #define PASID_DISABLED	0
 
+#ifdef CONFIG_INTEL_IOMMU_SVM
+void fpu__pasid_write(u32 pasid);
+#else
+static inline void fpu__pasid_write(u32 pasid) { }
+#endif
+
 #endif /* _ASM_X86_FPU_API_H */
diff --git a/arch/x86/include/asm/iommu.h b/arch/x86/include/asm/iommu.h
index bf1ed2ddc74b..9c4bf9b0702f 100644
--- a/arch/x86/include/asm/iommu.h
+++ b/arch/x86/include/asm/iommu.h
@@ -26,4 +26,6 @@ arch_rmrr_sanity_check(struct acpi_dmar_reserved_memory *rmrr)
 	return -EINVAL;
 }
 
+bool __fixup_pasid_exception(void);
+
 #endif /* _ASM_X86_IOMMU_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c8def1b7f8fb..8a89b2cecd77 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1289,3 +1289,62 @@ int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
 	return 0;
 }
 #endif /* CONFIG_PROC_PID_ARCH_STATUS */
+
+#ifdef CONFIG_INTEL_IOMMU_SVM
+/**
+ * fpu__pasid_write - Write the current task's PASID state/MSR.
+ * @pasid:	the PASID
+ *
+ * The PASID is written to the IA32_PASID MSR directly if the MSR is active.
+ * Otherwise it's written to the PASID. The IA32_PASID MSR should contain
+ * the PASID after returning to the user.
+ *
+ * This is called only when ENQCMD is enabled.
+ */
+void fpu__pasid_write(u32 pasid)
+{
+	struct xregs_state *xsave = &current->thread.fpu.state.xsave;
+	u64 msr_val = pasid | MSR_IA32_PASID_VALID;
+	struct fpu *fpu = &current->thread.fpu;
+
+	/*
+	 * ENQCMD always uses the compacted XSAVE format. Ensure the buffer
+	 * has space for the PASID.
+	 */
+	BUG_ON(!(xsave->header.xcomp_bv & XFEATURE_MASK_PASID));
+
+	fpregs_lock();
+
+	/*
+	 * If the task's FPU doesn't need to be loaded or is valid, directly
+	 * write the IA32_PASID MSR. Otherwise, write the PASID state and
+	 * the MSR will be loaded from the PASID state before returning to
+	 * the user.
+	 */
+	if (!test_thread_flag(TIF_NEED_FPU_LOAD) ||
+	    fpregs_state_valid(fpu, smp_processor_id())) {
+		wrmsrl(MSR_IA32_PASID, msr_val);
+	} else {
+		struct ia32_pasid_state *ppasid_state;
+		/*
+		 * Mark XFEATURE_PASID as non-init in the XSAVE buffer.
+		 * This ensures that a subsequent XRSTOR will see the new
+		 * value instead of writing the init value to the MSR.
+		 */
+		xsave->header.xfeatures |= XFEATURE_MASK_PASID;
+		ppasid_state = get_xsave_addr(xsave, XFEATURE_PASID);
+		/*
+		 * ppasid_state shouldn't be NULL because XFEATURE_PASID
+		 * was set just now.
+		 *
+		 * Please note that the following operation is a "write only"
+		 * operation on the PASID state and it writes the *ENTIRE*
+		 * state component. Please don't blindly copy this code to
+		 * modify other XSAVE states.
+		 */
+		ppasid_state->pasid = msr_val;
+	}
+
+	fpregs_unlock();
+}
+#endif /* CONFIG_INTEL_IOMMU_SVM */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..a25d738ae839 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -61,6 +61,7 @@
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/vdso.h>
+#include <asm/iommu.h>
 
 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
@@ -526,6 +527,14 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,
 	return GP_CANONICAL;
 }
 
+static bool fixup_pasid_exception(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
+		return false;
+
+	return __fixup_pasid_exception();
+}
+
 #define GPFSTR "general protection fault"
 
 DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
@@ -538,6 +547,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 
 	cond_local_irq_enable(regs);
 
+	if (user_mode(regs) && fixup_pasid_exception())
+		goto exit;
+
 	if (static_cpu_has(X86_FEATURE_UMIP)) {
 		if (user_mode(regs) && fixup_umip_exception(regs))
 			goto exit;
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 5b5d69b04fcc..ab65020019b6 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -1179,3 +1179,35 @@ int intel_svm_page_response(struct device *dev,
 	mutex_unlock(&pasid_mutex);
 	return ret;
 }
+
+/*
+ * Try to figure out if there is a PASID MSR value to propagate to the
+ * thread taking the #GP.
+ */
+bool __fixup_pasid_exception(void)
+{
+	u32 pasid;
+
+	/*
+	 * This function is called only when this #GP was triggered from user
+	 * space. So the mm cannot be NULL.
+	 */
+	pasid = current->mm->pasid;
+
+	/* If no PASID is allocated, there is nothing to propagate. */
+	if (pasid == PASID_DISABLED)
+		return false;
+
+	/*
+	 * If the current task already has a valid PASID MSR, then the #GP
+	 * fault must be for some non-ENQCMD related reason.
+	 */
+	if (current->has_valid_pasid)
+		return false;
+
+	/* Fix up the MSR by the PASID in the mm. */
+	fpu__pasid_write(pasid);
+	current->has_valid_pasid = 1;
+
+	return true;
+}
-- 
2.33.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-20 19:23 [PATCH 0/8] Re-enable ENQCMD and PASID MSR Fenghua Yu
                   ` (3 preceding siblings ...)
  2021-09-20 19:23 ` [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP Fenghua Yu
@ 2021-09-20 19:23 ` Fenghua Yu
  2021-09-23  5:43   ` Lu Baolu
                     ` (2 more replies)
  2021-09-20 19:23 ` [PATCH 6/8] x86/cpufeatures: Re-enable ENQCMD Fenghua Yu
                   ` (2 subsequent siblings)
  7 siblings, 3 replies; 77+ messages in thread
From: Fenghua Yu @ 2021-09-20 19:23 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar
  Cc: iommu, x86, linux-kernel, Fenghua Yu

PASIDs are fundamentally hardware resources in a shared address space.
There is a limited number of them to use ENQCMD on shared workqueue.
They must be shared and managed. They can not, for instance, be
statically allocated to processes.

Free PASID eagerly by sending IPIs in unbind was disabled due to locking
and other issues in commit 9bfecd058339 ("x86/cpufeatures: Force disable
X86_FEATURE_ENQCMD and remove update_pasid()").

Lazy PASID free is implemented in order to re-enable the ENQCMD feature.
PASIDs are currently reference counted and are centered around device
usage. To support lazy PASID free, reference counts are tracked in the
following scenarios:

1. The PASID's reference count is initialized as 1 when the PASID is first
   allocated in bind. This is already implemented.
2. A reference is taken when a device is bound to the mm and dropped
   when the device is unbound from the mm. This reference tracks device
   usage of the PASID. This is already implemented.
3. A reference is taken when a task's IA32_PASID MSR is initialized in
   #GP fix up and dropped when the task exits. This reference tracks
   the task usage of the PASID. It is implemented here.

Once a PASID is allocated to an mm in bind, it's associated to the mm until
it's freed lazily when its reference count is dropped to zero in unbind or
exit(2).

ENQCMD requires a valid IA32_PASID MSR with the PASID value and a valid
PASID table entry for the PASID. Lazy PASID free may cause the process
still has the valid PASID but the PASID table entry is removed in unbind.
In this case, workqueue submitted by ENQCMD cannot find the PASID table
entry and will generate a DMAR fault.

Here is a more detailed explanation of the life cycle of a PASID:

All processes start out without a PASID allocated (because fork(2)
clears the PASID in the child).

A PASID is allocated on the first open of an accelerator device by
a call to:
iommu_sva_bind_device()
-> intel_svm_bind()
   -> intel_svm_alloc_pasid()
      -> iommu_sva_alloc_pasid()
        -> ioasid_alloc()

At this point mm->pasid for the process is initialized, the reference
count on that PASID is 1, but as yet no tasks within the process have
set up their MSR_IA32_PASID to be able to execute the ENQCMD instruction.

When a task in the process does execute ENQCMD there is a #GP fault.
The Linux handler notes that the process has a PASID allocated, and
attempts to fix the #GP fault by initializing MSR_IA32_PASID for this
task. It also increments the reference count for the PASID.

Additional threads in the task may also execute ENQCMD, and each
will add to the reference count of the PASID.

Tasks within the process may open additional accelerator devices.
In this case the call to iommu_sva_bind_device() merely increments
the reference count for the PASID. Since all devices use the same
PASID (all are accessing the same address space).

So the reference count on a PASID is the sum of the number of open
accelerator devices plus the number of threads that have tried to
execute ENQCMD.

The reverse happens as a process gives up resources. Each call to
iommu_sva_unbind_device() will reduce the reference count on the
PASID. Each task in the process that had set up MSR_IA32_PASID will
reduce the reference count as it exits.

When the reference count is dropped to 0 in either task exit or
unbind, the PASID will be freed.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/iommu.h       |  6 +++++
 arch/x86/include/asm/mmu_context.h |  2 ++
 drivers/iommu/intel/svm.c          | 39 ++++++++++++++++++++++++++++++
 3 files changed, 47 insertions(+)

diff --git a/arch/x86/include/asm/iommu.h b/arch/x86/include/asm/iommu.h
index 9c4bf9b0702f..d00f0a3f32fb 100644
--- a/arch/x86/include/asm/iommu.h
+++ b/arch/x86/include/asm/iommu.h
@@ -28,4 +28,10 @@ arch_rmrr_sanity_check(struct acpi_dmar_reserved_memory *rmrr)
 
 bool __fixup_pasid_exception(void);
 
+#ifdef CONFIG_INTEL_IOMMU_SVM
+void pasid_put(struct task_struct *tsk, struct mm_struct *mm);
+#else
+static inline void pasid_put(struct task_struct *tsk, struct mm_struct *mm) { }
+#endif
+
 #endif /* _ASM_X86_IOMMU_H */
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 27516046117a..3a2de87e98a9 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -12,6 +12,7 @@
 #include <asm/tlbflush.h>
 #include <asm/paravirt.h>
 #include <asm/debugreg.h>
+#include <asm/iommu.h>
 
 extern atomic64_t last_mm_ctx_id;
 
@@ -146,6 +147,7 @@ do {						\
 #else
 #define deactivate_mm(tsk, mm)			\
 do {						\
+	pasid_put(tsk, mm);			\
 	load_gs_index(0);			\
 	loadsegment(fs, 0);			\
 } while (0)
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index ab65020019b6..8b6b8007ba2c 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -1187,6 +1187,7 @@ int intel_svm_page_response(struct device *dev,
 bool __fixup_pasid_exception(void)
 {
 	u32 pasid;
+	int ret;
 
 	/*
 	 * This function is called only when this #GP was triggered from user
@@ -1205,9 +1206,47 @@ bool __fixup_pasid_exception(void)
 	if (current->has_valid_pasid)
 		return false;
 
+	mutex_lock(&pasid_mutex);
+	/* The mm's pasid has been allocated. Take a reference to it. */
+	ret = iommu_sva_alloc_pasid(current->mm, PASID_MIN,
+				    intel_pasid_max_id - 1);
+	mutex_unlock(&pasid_mutex);
+	if (ret)
+		return false;
+
 	/* Fix up the MSR by the PASID in the mm. */
 	fpu__pasid_write(pasid);
 	current->has_valid_pasid = 1;
 
 	return true;
 }
+
+/*
+ * pasid_put - On task exit release a reference to the mm's PASID
+ *	       and free the PASID if no more reference
+ * @mm: the mm
+ *
+ * When the task exits, release a reference to the mm's PASID if it was
+ * allocated and the IA32_PASID MSR was fixed up.
+ *
+ * If there is no reference, the PASID is freed and can be allocated to
+ * any process later.
+ */
+void pasid_put(struct task_struct *tsk, struct mm_struct *mm)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
+		return;
+
+	/*
+	 * Nothing to do if this task doesn't have a reference to the PASID.
+	 */
+	if (tsk->has_valid_pasid) {
+		mutex_lock(&pasid_mutex);
+		/*
+		 * The PASID's reference was taken during fix up. Release it
+		 * now. If the reference count is 0, the PASID is freed.
+		 */
+		iommu_sva_free_pasid(mm);
+		mutex_unlock(&pasid_mutex);
+	}
+}
-- 
2.33.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH 6/8] x86/cpufeatures: Re-enable ENQCMD
  2021-09-20 19:23 [PATCH 0/8] Re-enable ENQCMD and PASID MSR Fenghua Yu
                   ` (4 preceding siblings ...)
  2021-09-20 19:23 ` [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting Fenghua Yu
@ 2021-09-20 19:23 ` Fenghua Yu
  2021-09-20 19:23 ` [PATCH 7/8] tools/objtool: Check for use of the ENQCMD instruction in the kernel Fenghua Yu
  2021-09-20 19:23 ` [PATCH 8/8] docs: x86: Change documentation for SVA (Shared Virtual Addressing) Fenghua Yu
  7 siblings, 0 replies; 77+ messages in thread
From: Fenghua Yu @ 2021-09-20 19:23 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar
  Cc: iommu, x86, linux-kernel, Fenghua Yu

Since ENQCMD is handled by #GP fix up, it can be re-enabled.

The ENQCMD feature cannot be used if CONFIG_INTEL_IOMMU_SVM
is not set. Add X86_FEATURE_ENQCMD to the disabled features mask as
appropriate and use cpu_feature_enabled() to check the feature.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/disabled-features.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 8f28fafa98b3..1231d63f836d 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -56,8 +56,11 @@
 # define DISABLE_PTI		(1 << (X86_FEATURE_PTI & 31))
 #endif
 
-/* Force disable because it's broken beyond repair */
-#define DISABLE_ENQCMD		(1 << (X86_FEATURE_ENQCMD & 31))
+#ifdef CONFIG_INTEL_IOMMU_SVM
+# define DISABLE_ENQCMD		0
+#else
+# define DISABLE_ENQCMD		(1 << (X86_FEATURE_ENQCMD & 31))
+#endif
 
 #ifdef CONFIG_X86_SGX
 # define DISABLE_SGX	0
-- 
2.33.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH 7/8] tools/objtool: Check for use of the ENQCMD instruction in the kernel
  2021-09-20 19:23 [PATCH 0/8] Re-enable ENQCMD and PASID MSR Fenghua Yu
                   ` (5 preceding siblings ...)
  2021-09-20 19:23 ` [PATCH 6/8] x86/cpufeatures: Re-enable ENQCMD Fenghua Yu
@ 2021-09-20 19:23 ` Fenghua Yu
  2021-09-22 21:03   ` Peter Zijlstra
  2021-09-20 19:23 ` [PATCH 8/8] docs: x86: Change documentation for SVA (Shared Virtual Addressing) Fenghua Yu
  7 siblings, 1 reply; 77+ messages in thread
From: Fenghua Yu @ 2021-09-20 19:23 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar
  Cc: iommu, x86, linux-kernel, Fenghua Yu

The ENQCMD implicitly accesses the PASID_MSR to fill in the pasid field
of the descriptor being submitted to an accelerator. But there is no
precise (and stable across kernel changes) point at which the PASID_MSR
is updated from the value for one task to the next.

Kernel code that uses accelerators must always use the ENQCMDS instruction
which does not access the PASID_MSR.

Check for use of the ENQCMD instruction in the kernel and warn on its
usage.

Checking the invalid instruction is a relatively new use of objtool and
I'm open to feedback about the approach.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 tools/objtool/arch/x86/decode.c      | 10 +++++++++-
 tools/objtool/check.c                | 20 ++++++++++++++++++++
 tools/objtool/include/objtool/arch.h |  1 +
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/tools/objtool/arch/x86/decode.c b/tools/objtool/arch/x86/decode.c
index bc821056aba9..e63b44ba3bd4 100644
--- a/tools/objtool/arch/x86/decode.c
+++ b/tools/objtool/arch/x86/decode.c
@@ -110,7 +110,7 @@ int arch_decode_instruction(const struct elf *elf, const struct section *sec,
 {
 	struct insn insn;
 	int x86_64, ret;
-	unsigned char op1, op2,
+	unsigned char op1, op2, op3,
 		      rex = 0, rex_b = 0, rex_r = 0, rex_w = 0, rex_x = 0,
 		      modrm = 0, modrm_mod = 0, modrm_rm = 0, modrm_reg = 0,
 		      sib = 0, /* sib_scale = 0, */ sib_index = 0, sib_base = 0;
@@ -137,6 +137,7 @@ int arch_decode_instruction(const struct elf *elf, const struct section *sec,
 
 	op1 = insn.opcode.bytes[0];
 	op2 = insn.opcode.bytes[1];
+	op3 = insn.opcode.bytes[2];
 
 	if (insn.rex_prefix.nbytes) {
 		rex = insn.rex_prefix.bytes[0];
@@ -489,6 +490,13 @@ int arch_decode_instruction(const struct elf *elf, const struct section *sec,
 			/* nopl/nopw */
 			*type = INSN_NOP;
 
+		} else if (op2 == 0x38 && op3 == 0xf8) {
+			if (insn.prefixes.nbytes == 1 &&
+			    insn.prefixes.bytes[0] == 0xf2) {
+				/* enqcmd */
+				*type = INSN_ENQCMD;
+			}
+
 		} else if (op2 == 0xa0 || op2 == 0xa8) {
 
 			/* push fs/gs */
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index e5947fbb9e7a..91d13521d9d6 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3133,6 +3133,21 @@ static int validate_reachable_instructions(struct objtool_file *file)
 	return 0;
 }
 
+static int validate_enqcmd(struct objtool_file *file)
+{
+	struct instruction *insn;
+
+	for_each_insn(file, insn) {
+		if (insn->type == INSN_ENQCMD) {
+			WARN_FUNC("enqcmd instruction", insn->sec,
+				  insn->offset);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
 int check(struct objtool_file *file)
 {
 	int ret, warnings = 0;
@@ -3147,6 +3162,11 @@ int check(struct objtool_file *file)
 	if (list_empty(&file->insn_list))
 		goto out;
 
+	ret = validate_enqcmd(file);
+	if (ret < 0)
+		goto out;
+	warnings += ret;
+
 	if (vmlinux && !validate_dup) {
 		ret = validate_vmlinux_functions(file);
 		if (ret < 0)
diff --git a/tools/objtool/include/objtool/arch.h b/tools/objtool/include/objtool/arch.h
index 062bb6e9b865..5147c0762b49 100644
--- a/tools/objtool/include/objtool/arch.h
+++ b/tools/objtool/include/objtool/arch.h
@@ -26,6 +26,7 @@ enum insn_type {
 	INSN_CLAC,
 	INSN_STD,
 	INSN_CLD,
+	INSN_ENQCMD,
 	INSN_OTHER,
 };
 
-- 
2.33.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH 8/8] docs: x86: Change documentation for SVA (Shared Virtual Addressing)
  2021-09-20 19:23 [PATCH 0/8] Re-enable ENQCMD and PASID MSR Fenghua Yu
                   ` (6 preceding siblings ...)
  2021-09-20 19:23 ` [PATCH 7/8] tools/objtool: Check for use of the ENQCMD instruction in the kernel Fenghua Yu
@ 2021-09-20 19:23 ` Fenghua Yu
  7 siblings, 0 replies; 77+ messages in thread
From: Fenghua Yu @ 2021-09-20 19:23 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar
  Cc: iommu, x86, linux-kernel, Fenghua Yu

Since allocating, freeing and fixing up PASID are changed, the
documentation is updated to reflect the changes.

Originally-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 Documentation/x86/sva.rst | 81 +++++++++++++++++++++++++++++++++++----
 1 file changed, 74 insertions(+), 7 deletions(-)

diff --git a/Documentation/x86/sva.rst b/Documentation/x86/sva.rst
index 076efd51ef1f..868ed4b25002 100644
--- a/Documentation/x86/sva.rst
+++ b/Documentation/x86/sva.rst
@@ -106,16 +106,83 @@ process share the same page tables, thus the same MSR value.
 
 PASID is cleared when a process is created. The PASID allocation and MSR
 programming may occur long after a process and its threads have been created.
-One thread must call iommu_sva_bind_device() to allocate the PASID for the
-process. If a thread uses ENQCMD without the MSR first being populated, a #GP
-will be raised. The kernel will update the PASID MSR with the PASID for all
-threads in the process. A single process PASID can be used simultaneously
+One thread must call iommu_sva_bind(_device) to allocate the PASID for the process.
+If a thread uses ENQCMD without the MSR first being populated, it will cause #GP.
+The kernel will fix up the #GP by writing the process-wide PASID into the
+thread that took the #GP. A single process PASID can be used simultaneously
 with multiple devices since they all share the same address space.
 
-One thread can call iommu_sva_unbind_device() to free the allocated PASID.
-The kernel will clear the PASID MSR for all threads belonging to the process.
+The PASID MSR value is cleared at thread creation and is never inherited from a
+parent. This ensures consistent child behavior no matter whether the thread is
+created first or the PASID is allocated (and the MSR written).
 
-New threads inherit the MSR value from the parent.
+
+PASID Lifecycle Management
+==========================
+
+Only processes that access SVA-capable devices need to have a PASID
+allocated. This allocation happens when a process opens/binds an SVA-capable
+device but finds no PASID for this process. Subsequent binds of the same, or
+other devices will share the same PASID.
+
+Although the PASID is allocated to the process by opening a device,
+it is not active in any of the threads of that process. It's loaded to the
+IA32_PASID MSR lazily when a thread tries to submit a work descriptor
+to a device using the ENQCMD.
+
+That first access will trigger a #GP fault because the IA32_PASID MSR
+has not been initialized with the PASID value assigned to the process
+when the device was opened. The Linux #GP handler notes that a PASID has
+been allocated for the process, and so initializes the IA32_PASID MSR, takes
+a reference to the PASID and returns so that the ENQCMD instruction is
+re-executed.
+
+On fork(2) or exec(2) the PASID is removed from the process as it no
+longer has the same address space that it had when the device was opened.
+
+On clone(2) the new task shares the same address space, so will be
+able to use the PASID allocated to the process. The IA32_PASID is not
+preemptively initialized as the PASID value might not be allocated yet or
+the kernel does not know whether this thread is going to access the device
+and the cleared IA32_PASID MSR reduces context switch overhead by xstate
+init optimization. Since #GP faults have to be handled on any threads that
+were created before the PASID was assigned to the mm of the process, newly
+created threads might as well be treated in a consistent way.
+
+Due to complexity of freeing the PASID and clearing all IA32_PASID MSRs in
+all threads in unbind, free the PASID lazily when there is no PASID user.
+Track the PASID's reference count in the following way:
+
+- Track device usage of the PASID: The PASID's reference count is initialized
+  as 1 when the PASID is allocated in the first bind. Bind takes a reference
+  and unbind drops the reference.
+- Track task usage of the PASID: Fixing up the IA32_PASID MSR in #GP takes
+  reference and exit(2) drops the reference. Once the MSR is fixed up, the
+  PASID value stays in the MSR stays for the rest life of the task.
+
+The PASID is freed lazily in exit(2) or unbind when there is no reference
+to the PASID. After freed, the PASID can be allocated to any process.
+
+ENQCMD needs at least two requirements: a valid IA32_PASID MSR with the
+PASID value of the process and a valid PASID table entry for the PASID.
+To execute ENQCMD, the user must ensure the device is bound to the
+process so that the kernel can guarantee to meet the above two requirements.
+
+Lazy PASID free may cause the task still has the PASID value in IA32_PASID
+while there is no PASID table entry for the PASID. The workqueue submitted
+by ENQCMD in this scenario cannot find the PASID table entry and generates
+a DMAR fault. Currently DMAR fault handler just prints a fault reason.
+Future DMAR fault handler might notify the user the workqueue failure.
+Here are two detailed cases:
+
+- Unbind removes the PASID table entry but the process still owns the PASID
+  and the task's IA32_PASID MSR still keeps the PASID value. The workqueue
+  submitted by ENQCMD in this task will generate a DMAR fault.
+- Unbind removes the PASID table entry but the process still owns the PASID
+  because some task took one reference during fix up. ENQCMD executed in a
+  task that doesn't fix up the IA32_PASID MSR will generate #GP first to get
+  its IA32_PASID MSR fixed up and then the submitted workqueue will generate
+  a DMAR fault.
 
 Relationships
 =============
-- 
2.33.0


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 7/8] tools/objtool: Check for use of the ENQCMD instruction in the kernel
  2021-09-20 19:23 ` [PATCH 7/8] tools/objtool: Check for use of the ENQCMD instruction in the kernel Fenghua Yu
@ 2021-09-22 21:03   ` Peter Zijlstra
  2021-09-22 23:44     ` Fenghua Yu
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2021-09-22 21:03 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel, Josh Poimboeuf,
	Dave Jiang, Jacob Jun Pan, Ashok Raj, Ravi V Shankar, iommu, x86,
	linux-kernel

On Mon, Sep 20, 2021 at 07:23:48PM +0000, Fenghua Yu wrote:


> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> index e5947fbb9e7a..91d13521d9d6 100644
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -3133,6 +3133,21 @@ static int validate_reachable_instructions(struct objtool_file *file)
>  	return 0;
>  }
>  
> +static int validate_enqcmd(struct objtool_file *file)
> +{
> +	struct instruction *insn;
> +
> +	for_each_insn(file, insn) {
> +		if (insn->type == INSN_ENQCMD) {
> +			WARN_FUNC("enqcmd instruction", insn->sec,
> +				  insn->offset);
> +			return 1;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
>  int check(struct objtool_file *file)
>  {
>  	int ret, warnings = 0;
> @@ -3147,6 +3162,11 @@ int check(struct objtool_file *file)
>  	if (list_empty(&file->insn_list))
>  		goto out;
>  
> +	ret = validate_enqcmd(file);
> +	if (ret < 0)
> +		goto out;
> +	warnings += ret;
> +
>  	if (vmlinux && !validate_dup) {
>  		ret = validate_vmlinux_functions(file);
>  		if (ret < 0)

Since you're making it a fatal error, before doing much of anything
else, you might at well fail decode and keep it all in the x86/decode.c
file, no need to spread this 'knowledge' any further.

There's no actual state associated with it, you just want to avoid the
instruction being present.

Much simpler patch too.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-20 19:23 ` [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP Fenghua Yu
@ 2021-09-22 21:07   ` Peter Zijlstra
  2021-09-22 21:11     ` Peter Zijlstra
                       ` (2 more replies)
  2021-09-23 11:31   ` Thomas Gleixner
  2021-09-23 23:17   ` Andy Lutomirski
  2 siblings, 3 replies; 77+ messages in thread
From: Peter Zijlstra @ 2021-09-22 21:07 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel, Josh Poimboeuf,
	Dave Jiang, Jacob Jun Pan, Ashok Raj, Ravi V Shankar, iommu, x86,
	linux-kernel

On Mon, Sep 20, 2021 at 07:23:45PM +0000, Fenghua Yu wrote:
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index a58800973aed..a25d738ae839 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -61,6 +61,7 @@
>  #include <asm/insn.h>
>  #include <asm/insn-eval.h>
>  #include <asm/vdso.h>
> +#include <asm/iommu.h>
>  
>  #ifdef CONFIG_X86_64
>  #include <asm/x86_init.h>
> @@ -526,6 +527,14 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,
>  	return GP_CANONICAL;
>  }
>  
> +static bool fixup_pasid_exception(void)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
> +		return false;
> +
> +	return __fixup_pasid_exception();
> +}
> +
>  #define GPFSTR "general protection fault"
>  
>  DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> @@ -538,6 +547,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
>  
>  	cond_local_irq_enable(regs);
>  
> +	if (user_mode(regs) && fixup_pasid_exception())
> +		goto exit;
> +
>  	if (static_cpu_has(X86_FEATURE_UMIP)) {
>  		if (user_mode(regs) && fixup_umip_exception(regs))
>  			goto exit;

So you're eating any random #GP that might or might not be PASID
related. And all that witout a comment... Enlighten?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-22 21:07   ` Peter Zijlstra
@ 2021-09-22 21:11     ` Peter Zijlstra
  2021-09-22 21:26       ` Luck, Tony
                         ` (2 more replies)
  2021-09-22 23:39     ` Fenghua Yu
  2021-09-23 17:14     ` Luck, Tony
  2 siblings, 3 replies; 77+ messages in thread
From: Peter Zijlstra @ 2021-09-22 21:11 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel, Josh Poimboeuf,
	Dave Jiang, Jacob Jun Pan, Ashok Raj, Ravi V Shankar, iommu, x86,
	linux-kernel

On Wed, Sep 22, 2021 at 11:07:22PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 20, 2021 at 07:23:45PM +0000, Fenghua Yu wrote:
> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> > index a58800973aed..a25d738ae839 100644
> > --- a/arch/x86/kernel/traps.c
> > +++ b/arch/x86/kernel/traps.c
> > @@ -61,6 +61,7 @@
> >  #include <asm/insn.h>
> >  #include <asm/insn-eval.h>
> >  #include <asm/vdso.h>
> > +#include <asm/iommu.h>
> >  
> >  #ifdef CONFIG_X86_64
> >  #include <asm/x86_init.h>
> > @@ -526,6 +527,14 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,
> >  	return GP_CANONICAL;
> >  }
> >  
> > +static bool fixup_pasid_exception(void)
> > +{
> > +	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
> > +		return false;
> > +
> > +	return __fixup_pasid_exception();
> > +}

That is, shouldn't the above at the very least decode the instruction
causing the #GP and check it's this ENQCMD thing?

> >  #define GPFSTR "general protection fault"
> >  
> >  DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> > @@ -538,6 +547,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> >  
> >  	cond_local_irq_enable(regs);
> >  
> > +	if (user_mode(regs) && fixup_pasid_exception())
> > +		goto exit;
> > +
> >  	if (static_cpu_has(X86_FEATURE_UMIP)) {
> >  		if (user_mode(regs) && fixup_umip_exception(regs))
> >  			goto exit;
> 
> So you're eating any random #GP that might or might not be PASID
> related. And all that witout a comment... Enlighten?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* RE: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-22 21:11     ` Peter Zijlstra
@ 2021-09-22 21:26       ` Luck, Tony
  2021-09-23  7:03         ` Peter Zijlstra
  2021-09-22 21:33       ` Dave Hansen
  2021-09-22 21:36       ` Fenghua Yu
  2 siblings, 1 reply; 77+ messages in thread
From: Luck, Tony @ 2021-09-22 21:26 UTC (permalink / raw)
  To: Peter Zijlstra, Yu, Fenghua
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Hansen, Dave, Lu Baolu, Joerg Roedel, Josh Poimboeuf, Jiang,
	Dave, Pan, Jacob jun, Raj, Ashok, Shankar, Ravi V, iommu, x86,
	linux-kernel

>> > +static bool fixup_pasid_exception(void)
>> > +{
>> > +	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
>> > +		return false;
>> > +
>> > +	return __fixup_pasid_exception();
>> > +}
>
> That is, shouldn't the above at the very least decode the instruction
> causing the #GP and check it's this ENQCMD thing?

It can't reliably do that because some other thread in the process may
have re-written the memory that regs->ip points at (bizarre case, but
I think Dave Hansen brought it up).

So it would just add extra code, and still only be a hint.

Without the check this sequence is possible:

1) Process binds an accelerator (so mm->pasid is set)
2) Task in the process executes something other than ENQCMD that gets a #GP
3) Kernel says "Oh, mm->pasid is set, I'll initialize the IA32_PASID MSR to see if that fixes it"
4) Nope. Re-executing the instruction at step #2 just gives another #GP
5) Kernel says "I already set IA32_PASID, so this must be something else ... do regular #GP actions"

Now if the task catches the signal that results from step #5 and avoids termination, it will have
IA32_PASID set ... but to the right value should it go on to actually execute ENQCMD at some
future point.

So the corner case from not knowing whether this #GP was from ENQCMD or not is harmless.

-Tony



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-22 21:11     ` Peter Zijlstra
  2021-09-22 21:26       ` Luck, Tony
@ 2021-09-22 21:33       ` Dave Hansen
  2021-09-23  7:05         ` Peter Zijlstra
  2021-09-22 21:36       ` Fenghua Yu
  2 siblings, 1 reply; 77+ messages in thread
From: Dave Hansen @ 2021-09-22 21:33 UTC (permalink / raw)
  To: Peter Zijlstra, Fenghua Yu
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Tony Luck, Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Ashok Raj, Ravi V Shankar, iommu, x86,
	linux-kernel

On 9/22/21 2:11 PM, Peter Zijlstra wrote:
>>> +static bool fixup_pasid_exception(void)
>>> +{
>>> +	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
>>> +		return false;
>>> +
>>> +	return __fixup_pasid_exception();
>>> +}
> That is, shouldn't the above at the very least decode the instruction
> causing the #GP and check it's this ENQCMD thing?

To reiterate: on systems with no X86_FEATURE_ENQCMD, there is basically
no additional overhead.  It isn't worth doing decoding there.

On systems with X86_FEATURE_ENQCMD, but where it is unused, the #GP
handler gets some new overhead on every #GP.  Basically:

> +	pasid = current->mm->pasid;
> +	if (pasid == PASID_DISABLED)
> +		return false;

That's still pretty cheap.  Probably not worth doing decoding there either.

So, that leaves us with if you are:
1. On system with X86_FEATURE_ENQCMD
2. In a process/mm that has an allocated pasid
3. Your *task* does not have the MSR set
4. You get a #GP for some other reason

Then, you'll do this double-#GP dance.

So, instruction decoding could absolutely be added between steps 3 and
4.  It would absolutely save doing the double-#GP in cases where 1/2/3
are met.  But, I wouldn't move it up above and of the 1/2/3 checks
because they're way cheaper than instruction decoding.

In the end, it didn't seem worth it to me to be optimizing a relatively
rare path which 99% of the time ends up in a crash.

If you want instruction decoding in here, though, just say the word. :)

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-22 21:11     ` Peter Zijlstra
  2021-09-22 21:26       ` Luck, Tony
  2021-09-22 21:33       ` Dave Hansen
@ 2021-09-22 21:36       ` Fenghua Yu
  2 siblings, 0 replies; 77+ messages in thread
From: Fenghua Yu @ 2021-09-22 21:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel, Josh Poimboeuf,
	Dave Jiang, Jacob Jun Pan, Ashok Raj, Ravi V Shankar, iommu, x86,
	linux-kernel

Hi, Peter,

On Wed, Sep 22, 2021 at 11:11:45PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 22, 2021 at 11:07:22PM +0200, Peter Zijlstra wrote:
> > On Mon, Sep 20, 2021 at 07:23:45PM +0000, Fenghua Yu wrote:
> > > +static bool fixup_pasid_exception(void)
> > > +{
> > > +	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
> > > +		return false;
> > > +
> > > +	return __fixup_pasid_exception();
> > > +}
> 
> That is, shouldn't the above at the very least decode the instruction
> causing the #GP and check it's this ENQCMD thing?

There were comments on a previous version when we used #GP fixup method:
https://lore.kernel.org/linux-iommu/f6d34d59-e6eb-ee9f-d247-8fb2f0e37549@intel.com/

There are three reasons for not decoding the instruction:

1. Parsing the instruction sets bad architectural precedent and is ugly.
2. The instruction could be modified (e.g. JVM) while decoding the
   instruction. It's.
3. Decoding is more complex than this patch and doesn't worth it.

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-22 21:07   ` Peter Zijlstra
  2021-09-22 21:11     ` Peter Zijlstra
@ 2021-09-22 23:39     ` Fenghua Yu
  2021-09-23 17:14     ` Luck, Tony
  2 siblings, 0 replies; 77+ messages in thread
From: Fenghua Yu @ 2021-09-22 23:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel, Josh Poimboeuf,
	Dave Jiang, Jacob Jun Pan, Ashok Raj, Ravi V Shankar, iommu, x86,
	linux-kernel

Hi, Peter,

On Wed, Sep 22, 2021 at 11:07:22PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 20, 2021 at 07:23:45PM +0000, Fenghua Yu wrote:
> >  
> > +	if (user_mode(regs) && fixup_pasid_exception())
> > +		goto exit;
> > +
> >  	if (static_cpu_has(X86_FEATURE_UMIP)) {
> >  		if (user_mode(regs) && fixup_umip_exception(regs))
> >  			goto exit;
> 
> So you're eating any random #GP that might or might not be PASID
> related. And all that witout a comment... Enlighten?

I will add a comment here.

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 7/8] tools/objtool: Check for use of the ENQCMD instruction in the kernel
  2021-09-22 21:03   ` Peter Zijlstra
@ 2021-09-22 23:44     ` Fenghua Yu
  2021-09-23  7:17       ` Peter Zijlstra
  0 siblings, 1 reply; 77+ messages in thread
From: Fenghua Yu @ 2021-09-22 23:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel, Josh Poimboeuf,
	Dave Jiang, Jacob Jun Pan, Ashok Raj, Ravi V Shankar, iommu, x86,
	linux-kernel

Hi, Peter,

On Wed, Sep 22, 2021 at 11:03:43PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 20, 2021 at 07:23:48PM +0000, Fenghua Yu wrote:
> > +	ret = validate_enqcmd(file);
> > +	if (ret < 0)
> > +		goto out;
> > +	warnings += ret;
> > +
> >  	if (vmlinux && !validate_dup) {
> >  		ret = validate_vmlinux_functions(file);
> >  		if (ret < 0)
> 
> Since you're making it a fatal error, before doing much of anything
> else, you might at well fail decode and keep it all in the x86/decode.c
> file, no need to spread this 'knowledge' any further.
> 
> There's no actual state associated with it, you just want to avoid the
> instruction being present.
> 
> Much simpler patch too.

Is the following updated patch a right one?

Thanks.

-Fenghua

diff --git a/tools/objtool/arch/x86/decode.c b/tools/objtool/arch/x86/decode.c
index bc821056aba9..3e0f928e28a5 100644
--- a/tools/objtool/arch/x86/decode.c
+++ b/tools/objtool/arch/x86/decode.c
@@ -110,7 +110,7 @@ int arch_decode_instruction(const struct elf *elf, const struct section *sec,
 {
 	struct insn insn;
 	int x86_64, ret;
-	unsigned char op1, op2,
+	unsigned char op1, op2, op3,
 		      rex = 0, rex_b = 0, rex_r = 0, rex_w = 0, rex_x = 0,
 		      modrm = 0, modrm_mod = 0, modrm_rm = 0, modrm_reg = 0,
 		      sib = 0, /* sib_scale = 0, */ sib_index = 0, sib_base = 0;
@@ -137,6 +137,7 @@ int arch_decode_instruction(const struct elf *elf, const struct section *sec,
 
 	op1 = insn.opcode.bytes[0];
 	op2 = insn.opcode.bytes[1];
+	op3 = insn.opcode.bytes[2];
 
 	if (insn.rex_prefix.nbytes) {
 		rex = insn.rex_prefix.bytes[0];
@@ -489,6 +490,16 @@ int arch_decode_instruction(const struct elf *elf, const struct section *sec,
 			/* nopl/nopw */
 			*type = INSN_NOP;
 
+		} else if (op2 == 0x38 && op3 == 0xf8) {
+			if (insn.prefixes.nbytes == 1 &&
+			    insn.prefixes.bytes[0] == 0xf2) {
+				/* ENQCMD cannot be used in the kernel. */
+				WARN("ENQCMD instruction at %s:%lx", sec->name,
+				     offset);
+
+				return -1;
+			}
+
 		} else if (op2 == 0xa0 || op2 == 0xa8) {
 
 			/* push fs/gs */

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-20 19:23 ` [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting Fenghua Yu
@ 2021-09-23  5:43   ` Lu Baolu
  2021-09-30  0:44     ` Fenghua Yu
  2021-09-23 14:36   ` Thomas Gleixner
  2021-09-23 23:09   ` Andy Lutomirski
  2 siblings, 1 reply; 77+ messages in thread
From: Lu Baolu @ 2021-09-23  5:43 UTC (permalink / raw)
  To: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Joerg Roedel, Josh Poimboeuf, Dave Jiang, Jacob Jun Pan,
	Ashok Raj, Ravi V Shankar
  Cc: baolu.lu, iommu, x86, linux-kernel

Hi Fenghua,

On 9/21/21 3:23 AM, Fenghua Yu wrote:
> PASIDs are fundamentally hardware resources in a shared address space.
> There is a limited number of them to use ENQCMD on shared workqueue.
> They must be shared and managed. They can not, for instance, be
> statically allocated to processes.
> 
> Free PASID eagerly by sending IPIs in unbind was disabled due to locking
> and other issues in commit 9bfecd058339 ("x86/cpufeatures: Force disable
> X86_FEATURE_ENQCMD and remove update_pasid()").
> 
> Lazy PASID free is implemented in order to re-enable the ENQCMD feature.
> PASIDs are currently reference counted and are centered around device
> usage. To support lazy PASID free, reference counts are tracked in the
> following scenarios:
> 
> 1. The PASID's reference count is initialized as 1 when the PASID is first
>     allocated in bind. This is already implemented.
> 2. A reference is taken when a device is bound to the mm and dropped
>     when the device is unbound from the mm. This reference tracks device
>     usage of the PASID. This is already implemented.
> 3. A reference is taken when a task's IA32_PASID MSR is initialized in
>     #GP fix up and dropped when the task exits. This reference tracks
>     the task usage of the PASID. It is implemented here.
> 
> Once a PASID is allocated to an mm in bind, it's associated to the mm until
> it's freed lazily when its reference count is dropped to zero in unbind or
> exit(2).
> 
> ENQCMD requires a valid IA32_PASID MSR with the PASID value and a valid
> PASID table entry for the PASID. Lazy PASID free may cause the process
> still has the valid PASID but the PASID table entry is removed in unbind.
> In this case, workqueue submitted by ENQCMD cannot find the PASID table
> entry and will generate a DMAR fault.
> 
> Here is a more detailed explanation of the life cycle of a PASID:
> 
> All processes start out without a PASID allocated (because fork(2)
> clears the PASID in the child).
> 
> A PASID is allocated on the first open of an accelerator device by
> a call to:
> iommu_sva_bind_device()
> -> intel_svm_bind()
>     -> intel_svm_alloc_pasid()
>        -> iommu_sva_alloc_pasid()
>          -> ioasid_alloc()
> 
> At this point mm->pasid for the process is initialized, the reference
> count on that PASID is 1, but as yet no tasks within the process have
> set up their MSR_IA32_PASID to be able to execute the ENQCMD instruction.
> 
> When a task in the process does execute ENQCMD there is a #GP fault.
> The Linux handler notes that the process has a PASID allocated, and
> attempts to fix the #GP fault by initializing MSR_IA32_PASID for this
> task. It also increments the reference count for the PASID.
> 
> Additional threads in the task may also execute ENQCMD, and each
> will add to the reference count of the PASID.
> 
> Tasks within the process may open additional accelerator devices.
> In this case the call to iommu_sva_bind_device() merely increments
> the reference count for the PASID. Since all devices use the same
> PASID (all are accessing the same address space).
> 
> So the reference count on a PASID is the sum of the number of open
> accelerator devices plus the number of threads that have tried to
> execute ENQCMD.
> 
> The reverse happens as a process gives up resources. Each call to
> iommu_sva_unbind_device() will reduce the reference count on the
> PASID. Each task in the process that had set up MSR_IA32_PASID will
> reduce the reference count as it exits.
> 
> When the reference count is dropped to 0 in either task exit or
> unbind, the PASID will be freed.
> 
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> ---
>   arch/x86/include/asm/iommu.h       |  6 +++++
>   arch/x86/include/asm/mmu_context.h |  2 ++
>   drivers/iommu/intel/svm.c          | 39 ++++++++++++++++++++++++++++++
>   3 files changed, 47 insertions(+)
> 
> diff --git a/arch/x86/include/asm/iommu.h b/arch/x86/include/asm/iommu.h
> index 9c4bf9b0702f..d00f0a3f32fb 100644
> --- a/arch/x86/include/asm/iommu.h
> +++ b/arch/x86/include/asm/iommu.h
> @@ -28,4 +28,10 @@ arch_rmrr_sanity_check(struct acpi_dmar_reserved_memory *rmrr)
>   
>   bool __fixup_pasid_exception(void);
>   
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +void pasid_put(struct task_struct *tsk, struct mm_struct *mm);
> +#else
> +static inline void pasid_put(struct task_struct *tsk, struct mm_struct *mm) { }
> +#endif
> +
>   #endif /* _ASM_X86_IOMMU_H */
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 27516046117a..3a2de87e98a9 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -12,6 +12,7 @@
>   #include <asm/tlbflush.h>
>   #include <asm/paravirt.h>
>   #include <asm/debugreg.h>
> +#include <asm/iommu.h>
>   
>   extern atomic64_t last_mm_ctx_id;
>   
> @@ -146,6 +147,7 @@ do {						\
>   #else
>   #define deactivate_mm(tsk, mm)			\
>   do {						\
> +	pasid_put(tsk, mm);			\
>   	load_gs_index(0);			\
>   	loadsegment(fs, 0);			\
>   } while (0)
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index ab65020019b6..8b6b8007ba2c 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -1187,6 +1187,7 @@ int intel_svm_page_response(struct device *dev,
>   bool __fixup_pasid_exception(void)
>   {
>   	u32 pasid;
> +	int ret;
>   
>   	/*
>   	 * This function is called only when this #GP was triggered from user
> @@ -1205,9 +1206,47 @@ bool __fixup_pasid_exception(void)
>   	if (current->has_valid_pasid)
>   		return false;
>   
> +	mutex_lock(&pasid_mutex);
> +	/* The mm's pasid has been allocated. Take a reference to it. */
> +	ret = iommu_sva_alloc_pasid(current->mm, PASID_MIN,
> +				    intel_pasid_max_id - 1);
> +	mutex_unlock(&pasid_mutex);
> +	if (ret)
> +		return false;
> +
>   	/* Fix up the MSR by the PASID in the mm. */
>   	fpu__pasid_write(pasid);
>   	current->has_valid_pasid = 1;
>   
>   	return true;
>   }
> +
> +/*
> + * pasid_put - On task exit release a reference to the mm's PASID
> + *	       and free the PASID if no more reference
> + * @mm: the mm
> + *
> + * When the task exits, release a reference to the mm's PASID if it was
> + * allocated and the IA32_PASID MSR was fixed up.
> + *
> + * If there is no reference, the PASID is freed and can be allocated to
> + * any process later.
> + */
> +void pasid_put(struct task_struct *tsk, struct mm_struct *mm)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
> +		return;
> +
> +	/*
> +	 * Nothing to do if this task doesn't have a reference to the PASID.
> +	 */
> +	if (tsk->has_valid_pasid) {
> +		mutex_lock(&pasid_mutex);
> +		/*
> +		 * The PASID's reference was taken during fix up. Release it
> +		 * now. If the reference count is 0, the PASID is freed.
> +		 */
> +		iommu_sva_free_pasid(mm);
> +		mutex_unlock(&pasid_mutex);
> +	}
> +}
> 

It looks odd that both __fixup_pasid_exception() and pasid_put() are
defined in the vendor IOMMU driver, but get called in the arch/x86
code.

Is it feasible to move these two helpers to the files where they are
called? The IA32_PASID MSR fixup and release are not part of the IOMMU
implementation.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-22 21:26       ` Luck, Tony
@ 2021-09-23  7:03         ` Peter Zijlstra
  0 siblings, 0 replies; 77+ messages in thread
From: Peter Zijlstra @ 2021-09-23  7:03 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Yu, Fenghua, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Hansen, Dave, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Jiang, Dave, Pan, Jacob jun, Raj, Ashok, Shankar,
	Ravi V, iommu, x86, linux-kernel

On Wed, Sep 22, 2021 at 09:26:10PM +0000, Luck, Tony wrote:
> >> > +static bool fixup_pasid_exception(void)
> >> > +{
> >> > +	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
> >> > +		return false;
> >> > +
> >> > +	return __fixup_pasid_exception();
> >> > +}
> >
> > That is, shouldn't the above at the very least decode the instruction
> > causing the #GP and check it's this ENQCMD thing?
> 
> It can't reliably do that because some other thread in the process may
> have re-written the memory that regs->ip points at (bizarre case, but
> I think Dave Hansen brought it up).

I don't buy that argument, any cross modifying code gets to keep the
pieces in that case.

> So it would just add extra code, and still only be a hint.
> 
> Without the check this sequence is possible:
> 
> 1) Process binds an accelerator (so mm->pasid is set)
> 2) Task in the process executes something other than ENQCMD that gets a #GP
> 3) Kernel says "Oh, mm->pasid is set, I'll initialize the IA32_PASID MSR to see if that fixes it"
> 4) Nope. Re-executing the instruction at step #2 just gives another #GP
> 5) Kernel says "I already set IA32_PASID, so this must be something else ... do regular #GP actions"
> 
> Now if the task catches the signal that results from step #5 and avoids termination, it will have
> IA32_PASID set ... but to the right value should it go on to actually execute ENQCMD at some
> future point.
> 
> So the corner case from not knowing whether this #GP was from ENQCMD or not is harmless.

And all that *really* should be a in a comment near there, because I'm
100% sure I'll get confused and wonder about that very same thing the
next time I see that code.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-22 21:33       ` Dave Hansen
@ 2021-09-23  7:05         ` Peter Zijlstra
  0 siblings, 0 replies; 77+ messages in thread
From: Peter Zijlstra @ 2021-09-23  7:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Tony Luck, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar, iommu, x86, linux-kernel

On Wed, Sep 22, 2021 at 02:33:09PM -0700, Dave Hansen wrote:
> On 9/22/21 2:11 PM, Peter Zijlstra wrote:
> >>> +static bool fixup_pasid_exception(void)
> >>> +{
> >>> +	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
> >>> +		return false;
> >>> +
> >>> +	return __fixup_pasid_exception();
> >>> +}
> > That is, shouldn't the above at the very least decode the instruction
> > causing the #GP and check it's this ENQCMD thing?
> 
> To reiterate: on systems with no X86_FEATURE_ENQCMD, there is basically
> no additional overhead.  It isn't worth doing decoding there.

Well, they won't get past the X86_FEATURE check anyway, so who cares.

> On systems with X86_FEATURE_ENQCMD, but where it is unused, the #GP
> handler gets some new overhead on every #GP.  Basically:
> 
> > +	pasid = current->mm->pasid;
> > +	if (pasid == PASID_DISABLED)
> > +		return false;
> 
> That's still pretty cheap.  Probably not worth doing decoding there either.
> 
> So, that leaves us with if you are:
> 1. On system with X86_FEATURE_ENQCMD
> 2. In a process/mm that has an allocated pasid
> 3. Your *task* does not have the MSR set
> 4. You get a #GP for some other reason
> 
> Then, you'll do this double-#GP dance.
> 
> So, instruction decoding could absolutely be added between steps 3 and
> 4.  It would absolutely save doing the double-#GP in cases where 1/2/3
> are met.  But, I wouldn't move it up above and of the 1/2/3 checks
> because they're way cheaper than instruction decoding.
> 
> In the end, it didn't seem worth it to me to be optimizing a relatively
> rare path which 99% of the time ends up in a crash.
> 
> If you want instruction decoding in here, though, just say the word. :)

Instruction deoding makes it obvious you only consume your own #GP, the
alternative is a comment that explains this reasoning. Having neither
gets you confusion as per this thread.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 7/8] tools/objtool: Check for use of the ENQCMD instruction in the kernel
  2021-09-22 23:44     ` Fenghua Yu
@ 2021-09-23  7:17       ` Peter Zijlstra
  2021-09-23 15:26         ` Fenghua Yu
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2021-09-23  7:17 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel, Josh Poimboeuf,
	Dave Jiang, Jacob Jun Pan, Ashok Raj, Ravi V Shankar, iommu, x86,
	linux-kernel

On Wed, Sep 22, 2021 at 11:44:41PM +0000, Fenghua Yu wrote:

> > Since you're making it a fatal error, before doing much of anything
> > else, you might at well fail decode and keep it all in the x86/decode.c
> > file, no need to spread this 'knowledge' any further.

> Is the following updated patch a right one?

Yes, that's what I was thinking of.

> diff --git a/tools/objtool/arch/x86/decode.c b/tools/objtool/arch/x86/decode.c
> index bc821056aba9..3e0f928e28a5 100644
> --- a/tools/objtool/arch/x86/decode.c
> +++ b/tools/objtool/arch/x86/decode.c
> @@ -110,7 +110,7 @@ int arch_decode_instruction(const struct elf *elf, const struct section *sec,
>  {
>  	struct insn insn;
>  	int x86_64, ret;
> -	unsigned char op1, op2,
> +	unsigned char op1, op2, op3,
>  		      rex = 0, rex_b = 0, rex_r = 0, rex_w = 0, rex_x = 0,
>  		      modrm = 0, modrm_mod = 0, modrm_rm = 0, modrm_reg = 0,
>  		      sib = 0, /* sib_scale = 0, */ sib_index = 0, sib_base = 0;
> @@ -137,6 +137,7 @@ int arch_decode_instruction(const struct elf *elf, const struct section *sec,
>  
>  	op1 = insn.opcode.bytes[0];
>  	op2 = insn.opcode.bytes[1];
> +	op3 = insn.opcode.bytes[2];
>  
>  	if (insn.rex_prefix.nbytes) {
>  		rex = insn.rex_prefix.bytes[0];
> @@ -489,6 +490,16 @@ int arch_decode_instruction(const struct elf *elf, const struct section *sec,
>  			/* nopl/nopw */
>  			*type = INSN_NOP;
>  
> +		} else if (op2 == 0x38 && op3 == 0xf8) {
> +			if (insn.prefixes.nbytes == 1 &&
> +			    insn.prefixes.bytes[0] == 0xf2) {
> +				/* ENQCMD cannot be used in the kernel. */
> +				WARN("ENQCMD instruction at %s:%lx", sec->name,
> +				     offset);
> +
> +				return -1;
> +			}

The only concern here is if we want it to be fatal or not. But otherwise
this seems to be all that's required.

> +
>  		} else if (op2 == 0xa0 || op2 == 0xa8) {
>  
>  			/* push fs/gs */

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-20 19:23 ` [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP Fenghua Yu
  2021-09-22 21:07   ` Peter Zijlstra
@ 2021-09-23 11:31   ` Thomas Gleixner
  2021-09-23 23:17   ` Andy Lutomirski
  2 siblings, 0 replies; 77+ messages in thread
From: Thomas Gleixner @ 2021-09-23 11:31 UTC (permalink / raw)
  To: Fenghua Yu, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar
  Cc: iommu, x86, linux-kernel, Fenghua Yu

On Mon, Sep 20 2021 at 19:23, Fenghua Yu wrote:
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index c8def1b7f8fb..8a89b2cecd77 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1289,3 +1289,62 @@ int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
>  	return 0;
>  }
>  #endif /* CONFIG_PROC_PID_ARCH_STATUS */
> +
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +/**
> + * fpu__pasid_write - Write the current task's PASID state/MSR.
> + * @pasid:	the PASID
> + *
> + * The PASID is written to the IA32_PASID MSR directly if the MSR is active.
> + * Otherwise it's written to the PASID. The IA32_PASID MSR should contain

written to the PASID? What's 'the PASID' ?

> + * the PASID after returning to the user.
> + *
> + * This is called only when ENQCMD is enabled.

Well, yes, but it does not explain why it is called and from which context.

> + */
> +void fpu__pasid_write(u32 pasid)
> +{
> +	struct xregs_state *xsave = &current->thread.fpu.state.xsave;
> +	u64 msr_val = pasid | MSR_IA32_PASID_VALID;
> +	struct fpu *fpu = &current->thread.fpu;
> +
> +	/*
> +	 * ENQCMD always uses the compacted XSAVE format. Ensure the buffer
> +	 * has space for the PASID.
> +	 */
> +	BUG_ON(!(xsave->header.xcomp_bv & XFEATURE_MASK_PASID));
> +
> +	fpregs_lock();
> +
> +	/*
> +	 * If the task's FPU doesn't need to be loaded or is valid, directly
> +	 * write the IA32_PASID MSR. Otherwise, write the PASID state and
> +	 * the MSR will be loaded from the PASID state before returning to
> +	 * the user.
> +	 */
> +	if (!test_thread_flag(TIF_NEED_FPU_LOAD) ||
> +	    fpregs_state_valid(fpu, smp_processor_id())) {
> +		wrmsrl(MSR_IA32_PASID, msr_val);
> +	} else {
> +		struct ia32_pasid_state *ppasid_state;
> +		/*
> +		 * Mark XFEATURE_PASID as non-init in the XSAVE buffer.
> +		 * This ensures that a subsequent XRSTOR will see the new
> +		 * value instead of writing the init value to the MSR.
> +		 */

This and the above wrmsrl() assume that @pasid is valid which might be
correct, but I don't see any information about pasid lifetime. I assume
that there is no way to drop a PASID, right?

> +		xsave->header.xfeatures |= XFEATURE_MASK_PASID;
> +		ppasid_state = get_xsave_addr(xsave, XFEATURE_PASID);
> +		/*
> +		 * ppasid_state shouldn't be NULL because XFEATURE_PASID
> +		 * was set just now.
> +		 *
> +		 * Please note that the following operation is a "write only"
> +		 * operation on the PASID state and it writes the *ENTIRE*
> +		 * state component. Please don't blindly copy this code to
> +		 * modify other XSAVE states.
> +		 */
> +		ppasid_state->pasid = msr_val;
> +	}
> +
> +	fpregs_unlock();
> +}
> +#endif /* CONFIG_INTEL_IOMMU_SVM */

> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index a58800973aed..a25d738ae839 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
>  
> +static bool fixup_pasid_exception(void)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
> +		return false;
> +
> +	return __fixup_pasid_exception();
> +}

Ok, so here is the hook into #GP which then calls out into:

> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -1179,3 +1179,35 @@ int intel_svm_page_response(struct device *dev,
>  	mutex_unlock(&pasid_mutex);
>  	return ret;
>  }
> +
> +/*
> + * Try to figure out if there is a PASID MSR value to propagate to the
> + * thread taking the #GP.
> + */
> +bool __fixup_pasid_exception(void)
> +{
> +	u32 pasid;
> +
> +	/*
> +	 * This function is called only when this #GP was triggered from user
> +	 * space. So the mm cannot be NULL.
> +	 */
> +	pasid = current->mm->pasid;
> +
> +	/* If no PASID is allocated, there is nothing to propagate. */
> +	if (pasid == PASID_DISABLED)
> +		return false;
> +
> +	/*
> +	 * If the current task already has a valid PASID MSR, then the #GP
> +	 * fault must be for some non-ENQCMD related reason.
> +	 */
> +	if (current->has_valid_pasid)
> +		return false;
> +
> +	/* Fix up the MSR by the PASID in the mm. */
> +	fpu__pasid_write(pasid);
> +	current->has_valid_pasid = 1;
> +
> +	return true;
> +}

What is INTEL SVM specific on this? Nothing at all.

If there is a valid PASID in current->mm and the MSR has not been
updated yet, write it. Otherwise bail.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-20 19:23 ` [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting Fenghua Yu
  2021-09-23  5:43   ` Lu Baolu
@ 2021-09-23 14:36   ` Thomas Gleixner
  2021-09-23 16:40     ` Luck, Tony
  2021-09-23 23:09   ` Andy Lutomirski
  2 siblings, 1 reply; 77+ messages in thread
From: Thomas Gleixner @ 2021-09-23 14:36 UTC (permalink / raw)
  To: Fenghua Yu, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar
  Cc: iommu, x86, linux-kernel, Fenghua Yu

On Mon, Sep 20 2021 at 19:23, Fenghua Yu wrote:
>  
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +void pasid_put(struct task_struct *tsk, struct mm_struct *mm);
> +#else
> +static inline void pasid_put(struct task_struct *tsk, struct mm_struct *mm) { }
> +#endif

This code is again defining that PASID is entirely restricted to
INTEL. It's true, that no other vendor supports this, but PASID is
a non-vendor specific concept.

Sticking this into INTEL code means that any other PASID implementation
has to rip it out again from INTEL code and make it a run time property.

The refcounting issue should be the same for all PASID mechanisms which
attach PASID to a mm. What's INTEL specific about that?

So can we pretty please do that correct right away?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 7/8] tools/objtool: Check for use of the ENQCMD instruction in the kernel
  2021-09-23  7:17       ` Peter Zijlstra
@ 2021-09-23 15:26         ` Fenghua Yu
  2021-09-24  0:55           ` Josh Poimboeuf
  0 siblings, 1 reply; 77+ messages in thread
From: Fenghua Yu @ 2021-09-23 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel, Josh Poimboeuf,
	Dave Jiang, Jacob Jun Pan, Ashok Raj, Ravi V Shankar, iommu, x86,
	linux-kernel

Hi, Peter,

On Thu, Sep 23, 2021 at 09:17:01AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 22, 2021 at 11:44:41PM +0000, Fenghua Yu wrote:
> 
> > > Since you're making it a fatal error, before doing much of anything
> > > else, you might at well fail decode and keep it all in the x86/decode.c
> > > file, no need to spread this 'knowledge' any further.
> 
> > Is the following updated patch a right one?
> 
> Yes, that's what I was thinking of.
> 
> > diff --git a/tools/objtool/arch/x86/decode.c b/tools/objtool/arch/x86/decode.c
> > index bc821056aba9..3e0f928e28a5 100644
> > --- a/tools/objtool/arch/x86/decode.c
> > +++ b/tools/objtool/arch/x86/decode.c
> > @@ -110,7 +110,7 @@ int arch_decode_instruction(const struct elf *elf, const struct section *sec,
> >  {
> >  	struct insn insn;
> >  	int x86_64, ret;
> > -	unsigned char op1, op2,
> > +	unsigned char op1, op2, op3,
> >  		      rex = 0, rex_b = 0, rex_r = 0, rex_w = 0, rex_x = 0,
> >  		      modrm = 0, modrm_mod = 0, modrm_rm = 0, modrm_reg = 0,
> >  		      sib = 0, /* sib_scale = 0, */ sib_index = 0, sib_base = 0;
> > @@ -137,6 +137,7 @@ int arch_decode_instruction(const struct elf *elf, const struct section *sec,
> >  
> >  	op1 = insn.opcode.bytes[0];
> >  	op2 = insn.opcode.bytes[1];
> > +	op3 = insn.opcode.bytes[2];
> >  
> >  	if (insn.rex_prefix.nbytes) {
> >  		rex = insn.rex_prefix.bytes[0];
> > @@ -489,6 +490,16 @@ int arch_decode_instruction(const struct elf *elf, const struct section *sec,
> >  			/* nopl/nopw */
> >  			*type = INSN_NOP;
> >  
> > +		} else if (op2 == 0x38 && op3 == 0xf8) {
> > +			if (insn.prefixes.nbytes == 1 &&
> > +			    insn.prefixes.bytes[0] == 0xf2) {
> > +				/* ENQCMD cannot be used in the kernel. */
> > +				WARN("ENQCMD instruction at %s:%lx", sec->name,
> > +				     offset);
> > +
> > +				return -1;
> > +			}
> 
> The only concern here is if we want it to be fatal or not. But otherwise
> this seems to be all that's required.

objtool doesn't fail kernel build on this fatal warning.

Returning -1 here stops checking the rest of the file and won't report any
further warnings unless this ENQCMD warning is fixed. Not returning -1
continues checking the rest of the file and may report more warnings.
Seems that's the only difference b/w them.

Should I keep this "return -1" or not? Please advice.

-Fenghua

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-23 14:36   ` Thomas Gleixner
@ 2021-09-23 16:40     ` Luck, Tony
  2021-09-23 17:48       ` Thomas Gleixner
  0 siblings, 1 reply; 77+ messages in thread
From: Luck, Tony @ 2021-09-23 16:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Fenghua Yu, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar, iommu, x86, linux-kernel

On Thu, Sep 23, 2021 at 04:36:50PM +0200, Thomas Gleixner wrote:
> On Mon, Sep 20 2021 at 19:23, Fenghua Yu wrote:
> >  
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +void pasid_put(struct task_struct *tsk, struct mm_struct *mm);
> > +#else
> > +static inline void pasid_put(struct task_struct *tsk, struct mm_struct *mm) { }
> > +#endif
> 
> This code is again defining that PASID is entirely restricted to
> INTEL. It's true, that no other vendor supports this, but PASID is
> a non-vendor specific concept.
> 
> Sticking this into INTEL code means that any other PASID implementation
> has to rip it out again from INTEL code and make it a run time property.
> 
> The refcounting issue should be the same for all PASID mechanisms which
> attach PASID to a mm. What's INTEL specific about that?
> 
> So can we pretty please do that correct right away?

It's a bit messy (surprise).

There are two reasons to hold a refcount on a PASID

1) The process has done a bind on a device that uses PASIDs

	This one isn't dependent on Intel.

2) A task within a process is using ENQCMD (and thus holds
   a reference on the PASID because IA32_PASID MSR for this
   task has the PASID value loaded with the enable bit set).

	This is (currently) Intel specific (until others
	implement an ENQCMD-like feature to allow apps to
	access PASID enabled devices without going through
	the OS).

Perhaps some better function naming might help?  E.g. have
a task_pasid_put() function that handles the process exit
case separatley from the device unbind case.

void task_pasid_put(void)
{
	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
		return;

	if (current->has_valid_pasid) {
		mutex_lock(&pasid_mutex);
		iommu_sva_free_pasid(mm);
		mutex_unlock(&pasid_mutex);
	}
}

-Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-22 21:07   ` Peter Zijlstra
  2021-09-22 21:11     ` Peter Zijlstra
  2021-09-22 23:39     ` Fenghua Yu
@ 2021-09-23 17:14     ` Luck, Tony
  2021-09-24 13:37       ` Peter Zijlstra
  2 siblings, 1 reply; 77+ messages in thread
From: Luck, Tony @ 2021-09-23 17:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar, iommu, x86, linux-kernel

On Wed, Sep 22, 2021 at 11:07:22PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 20, 2021 at 07:23:45PM +0000, Fenghua Yu wrote:
> > @@ -538,6 +547,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> >  
> >  	cond_local_irq_enable(regs);
> >  
> > +	if (user_mode(regs) && fixup_pasid_exception())
> > +		goto exit;
> > +

> So you're eating any random #GP that might or might not be PASID
> related. And all that witout a comment... Enlighten?

This is moderately well commented inside the fixup_pasid_exception()
function. Another copy of the comments here at the call-site seems
overkill.

Would it help to change the name to try_fixup_pasid_exception()
to make it clearer that this is just a heuristic that may or may
not fix this particular #GP?

-Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-23 16:40     ` Luck, Tony
@ 2021-09-23 17:48       ` Thomas Gleixner
  2021-09-24 13:18         ` Thomas Gleixner
  0 siblings, 1 reply; 77+ messages in thread
From: Thomas Gleixner @ 2021-09-23 17:48 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Fenghua Yu, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar, iommu, x86, linux-kernel

On Thu, Sep 23 2021 at 09:40, Tony Luck wrote:
> On Thu, Sep 23, 2021 at 04:36:50PM +0200, Thomas Gleixner wrote:
>> This code is again defining that PASID is entirely restricted to
>> INTEL. It's true, that no other vendor supports this, but PASID is
>> a non-vendor specific concept.
>> 
>> Sticking this into INTEL code means that any other PASID implementation
>> has to rip it out again from INTEL code and make it a run time property.
>> 
>> The refcounting issue should be the same for all PASID mechanisms which
>> attach PASID to a mm. What's INTEL specific about that?
>> 
>> So can we pretty please do that correct right away?
>
> It's a bit messy (surprise).
>
> There are two reasons to hold a refcount on a PASID
>
> 1) The process has done a bind on a device that uses PASIDs
>
> 	This one isn't dependent on Intel.

Yep.

> 2) A task within a process is using ENQCMD (and thus holds
>    a reference on the PASID because IA32_PASID MSR for this
>    task has the PASID value loaded with the enable bit set).
>
> 	This is (currently) Intel specific (until others
> 	implement an ENQCMD-like feature to allow apps to
> 	access PASID enabled devices without going through
> 	the OS).

Right, but all it does is to grab another reference on the PASID and if
the task exits it needs to be dropped, right?

So you already added 'has_valid_pasid' to task_struct, which is a
misnomer btw. The meaning of this flag is that the task is actually
using the PASID (in the ENQCMD case written to the MSR) beyond the
purposes of the PASID which is attached to current->mm.

But the information which is important from a pasid resource management
POV is whether the task actually holds a seperate refcount on the PASID
or not. That's completely generic. It does not matter whether the task
uses it to populate the IA32_PASID_MSR or to just keeps it in memory
just because it can. The point is that is holds a refcount.

So we can have a generic interface:

int iommu_pasid_get_task_ref()
{
        if (current->holds_pasid_ref)
        	return -EYOUGOTONEALREADY;

	if (!has_pasid(current->mm)
		return -EWHATAREYOUASKINGFOR;

	if (!iommu->pasid_get_ref)
		return -EHOWDIDYOUMAKEUPAPASID;

	if (iommu->pasid_get_ref())
        	return -ETHEIOMMUDOESNOTLIKEYOU;
        
        current->holds_pasid_ref = true;
        return 0;
}

Actually letting this return a bool should be good enough, but you get
the idea.

void iommu_pasid_put_task_ref()
{
        if (!current->holds_pasid_ref)
        	return;
        current->holds_pasid_ref = false;
	iommu->pasid_put_ref();
}

Which makes the exception handling in traps.c the real x86/intel
specific muck:

bool fixup_pasid_exception(...)
{
        /* ENCQMD and future muck */
	if (!per_task_pasid_usage_enabled())
        	return false;
        if (iommu_pasid_get_ref())
        	return false;
        fpu_write_task_pasid();
        return true;
}

fpu_write_task_pasid() can just grab the pasid from current->mm->pasid
and be done with it.

The task exit code can just call iommu_pasid_put_task_ref() from the
generic code and not from within x86.

That means you want in Kconfig:

     PASID_TASK_REFS
     bool

and select that when a IOMMU supporting it is enabled and provide either
the proper protypes or stub functions depending on that.

Hmm?

Thanks,

        tglx





^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-20 19:23 ` [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting Fenghua Yu
  2021-09-23  5:43   ` Lu Baolu
  2021-09-23 14:36   ` Thomas Gleixner
@ 2021-09-23 23:09   ` Andy Lutomirski
  2021-09-23 23:22     ` Luck, Tony
  2 siblings, 1 reply; 77+ messages in thread
From: Andy Lutomirski @ 2021-09-23 23:09 UTC (permalink / raw)
  To: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra (Intel),
	Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel, Josh Poimboeuf,
	Dave Jiang, Jacob Jun Pan, Raj Ashok, Shankar, Ravi V
  Cc: iommu, the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Sep 20, 2021, at 12:23 PM, Fenghua Yu wrote:
> PASIDs are fundamentally hardware resources in a shared address space.
> There is a limited number of them to use ENQCMD on shared workqueue.
> They must be shared and managed. They can not, for instance, be
> statically allocated to processes.
>
> Free PASID eagerly by sending IPIs in unbind was disabled due to locking
> and other issues in commit 9bfecd058339 ("x86/cpufeatures: Force disable
> X86_FEATURE_ENQCMD and remove update_pasid()").
>
> Lazy PASID free is implemented in order to re-enable the ENQCMD feature.
> PASIDs are currently reference counted and are centered around device
> usage. To support lazy PASID free, reference counts are tracked in the
> following scenarios:
>
> 1. The PASID's reference count is initialized as 1 when the PASID is first
>    allocated in bind. This is already implemented.
> 2. A reference is taken when a device is bound to the mm and dropped
>    when the device is unbound from the mm. This reference tracks device
>    usage of the PASID. This is already implemented.
> 3. A reference is taken when a task's IA32_PASID MSR is initialized in
>    #GP fix up and dropped when the task exits. This reference tracks
>    the task usage of the PASID. It is implemented here.

I think this is unnecessarily complicated because it's buying in to the existing ISA misconception that PASID has anything to do with a task.  A PASID belongs to an mm, full stop.  Now the ISA is nasty and we have tasks that have *noticed* that their mm has a PASID and tasks that have not noticed this fact, but that should be irrelevant to essentially everything except the fault handler.

So just refcount the thing the obvious way: take a reference when you stick the PASID in the mm_struct and drop the reference in __mmdrop().  Problem solved.  You could probably drop it more aggressively in __mmput(), and the comment explaining why is left as an exercise to the reader -- if a kernel thread starts doing ENQCMD, we have worse things to worry about :)

--Andy

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-20 19:23 ` [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP Fenghua Yu
  2021-09-22 21:07   ` Peter Zijlstra
  2021-09-23 11:31   ` Thomas Gleixner
@ 2021-09-23 23:17   ` Andy Lutomirski
  2021-09-24  2:56     ` Fenghua Yu
  2021-09-27 21:02     ` Luck, Tony
  2 siblings, 2 replies; 77+ messages in thread
From: Andy Lutomirski @ 2021-09-23 23:17 UTC (permalink / raw)
  To: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra (Intel),
	Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel, Josh Poimboeuf,
	Dave Jiang, Jacob Jun Pan, Raj Ashok, Shankar, Ravi V
  Cc: iommu, the arch/x86 maintainers, Linux Kernel Mailing List



On Mon, Sep 20, 2021, at 12:23 PM, Fenghua Yu wrote:
> ENQCMD requires the IA32_PASID MSR has a valid PASID value which was
> allocated to the process during bind. The MSR could be populated eagerly
> by an IPI after the PASID is allocated in bind. But the method was
> disabled in commit 9bfecd058339 ("x86/cpufeatures: Force disable
> X86_FEATURE_ENQCMD and remove update_pasid()")' due to locking and other
> issues.
>
> Since the MSR was cleared in fork()/clone(), the first ENQCMD will
> generate a #GP fault. The #GP fault handler will initialize the MSR
> if a PASID has been allocated for this process.
>
> The lazy enabling of the PASID MSR in the #GP handler is not an elegant
> solution. But it has the least complexity that fits with h/w behavior.
>
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/include/asm/fpu/api.h |  6 ++++
>  arch/x86/include/asm/iommu.h   |  2 ++
>  arch/x86/kernel/fpu/xstate.c   | 59 ++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/traps.c        | 12 +++++++
>  drivers/iommu/intel/svm.c      | 32 ++++++++++++++++++
>  5 files changed, 111 insertions(+)
>
> diff --git a/arch/x86/include/asm/fpu/api.h 
> b/arch/x86/include/asm/fpu/api.h
> index ca4d0dee1ecd..f146849e5c8c 100644
> --- a/arch/x86/include/asm/fpu/api.h
> +++ b/arch/x86/include/asm/fpu/api.h
> @@ -106,4 +106,10 @@ extern int cpu_has_xfeatures(u64 xfeatures_mask, 
> const char **feature_name);
>   */
>  #define PASID_DISABLED	0
> 
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +void fpu__pasid_write(u32 pasid);
> +#else
> +static inline void fpu__pasid_write(u32 pasid) { }
> +#endif
> +
>  #endif /* _ASM_X86_FPU_API_H */
> diff --git a/arch/x86/include/asm/iommu.h b/arch/x86/include/asm/iommu.h
> index bf1ed2ddc74b..9c4bf9b0702f 100644
> --- a/arch/x86/include/asm/iommu.h
> +++ b/arch/x86/include/asm/iommu.h
> @@ -26,4 +26,6 @@ arch_rmrr_sanity_check(struct acpi_dmar_reserved_memory *rmrr)
>  	return -EINVAL;
>  }
> 
> +bool __fixup_pasid_exception(void);
> +
>  #endif /* _ASM_X86_IOMMU_H */
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index c8def1b7f8fb..8a89b2cecd77 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1289,3 +1289,62 @@ int proc_pid_arch_status(struct seq_file *m, 
> struct pid_namespace *ns,
>  	return 0;
>  }
>  #endif /* CONFIG_PROC_PID_ARCH_STATUS */
> +
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +/**
> + * fpu__pasid_write - Write the current task's PASID state/MSR.
> + * @pasid:	the PASID
> + *
> + * The PASID is written to the IA32_PASID MSR directly if the MSR is 
> active.
> + * Otherwise it's written to the PASID. The IA32_PASID MSR should 
> contain
> + * the PASID after returning to the user.
> + *
> + * This is called only when ENQCMD is enabled.
> + */
> +void fpu__pasid_write(u32 pasid)
> +{
> +	struct xregs_state *xsave = &current->thread.fpu.state.xsave;
> +	u64 msr_val = pasid | MSR_IA32_PASID_VALID;
> +	struct fpu *fpu = &current->thread.fpu;
> +
> +	/*
> +	 * ENQCMD always uses the compacted XSAVE format. Ensure the buffer
> +	 * has space for the PASID.
> +	 */
> +	BUG_ON(!(xsave->header.xcomp_bv & XFEATURE_MASK_PASID));
> +
> +	fpregs_lock();
> +
> +	/*
> +	 * If the task's FPU doesn't need to be loaded or is valid, directly
> +	 * write the IA32_PASID MSR. Otherwise, write the PASID state and
> +	 * the MSR will be loaded from the PASID state before returning to
> +	 * the user.
> +	 */
> +	if (!test_thread_flag(TIF_NEED_FPU_LOAD) ||
> +	    fpregs_state_valid(fpu, smp_processor_id())) {
> +		wrmsrl(MSR_IA32_PASID, msr_val);

Let me try to decode this.

If the current task's FPU state is live or if the state is in memory but the CPU regs just happen to match the copy in memory, then write the MSR.  Else write the value to memory.

This is wrong.  If !TIF_NEED_FPU_LOAD && fpregs_state_valid, you MUST NOT MODIFY FPU STATE.  This is not negotiable -- you will break coherence between CPU regs and the memory image.  The way you modify the current task's state is either you modify it in CPU regs (if the kernel knows that the CPU regs are the one and only source of truth) OR you modify it in memory and invalidate any preserved copies (by zapping last_cpu). 

In any event, that particular bit of logic really doesn't belong in here -- it belongs in some helper that gets it right, once.

> +
> +/*
> + * Try to figure out if there is a PASID MSR value to propagate to the
> + * thread taking the #GP.
> + */
> +bool __fixup_pasid_exception(void)
> +{
> +	u32 pasid;
> +
> +	/*
> +	 * This function is called only when this #GP was triggered from user
> +	 * space. So the mm cannot be NULL.
> +	 */
> +	pasid = current->mm->pasid;
> +
> +	/* If no PASID is allocated, there is nothing to propagate. */
> +	if (pasid == PASID_DISABLED)
> +		return false;
> +
> +	/*
> +	 * If the current task already has a valid PASID MSR, then the #GP
> +	 * fault must be for some non-ENQCMD related reason.
> +	 */
> +	if (current->has_valid_pasid)
> +		return false;

IMO "has_valid_pasid" is a poor name.  An mm can have a PASID.  A task has noticed or it hasn't.

If you really need an in-memory cache, call it "pasid_msr_is_loaded".  Or just read the field out of FPU state, but that might be painfully slow.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-23 23:09   ` Andy Lutomirski
@ 2021-09-23 23:22     ` Luck, Tony
  2021-09-24  5:17       ` Andy Lutomirski
  0 siblings, 1 reply; 77+ messages in thread
From: Luck, Tony @ 2021-09-23 23:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra (Intel),
	Dave Hansen, Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Thu, Sep 23, 2021 at 04:09:18PM -0700, Andy Lutomirski wrote:
> On Mon, Sep 20, 2021, at 12:23 PM, Fenghua Yu wrote:

> I think this is unnecessarily complicated because it's buying in to the
> existing ISA misconception that PASID has anything to do with a task.
> A PASID belongs to an mm, full stop.  Now the ISA is nasty and we have
> tasks that have *noticed* that their mm has a PASID and tasks that have
> not noticed this fact, but that should be irrelevant to essentially
> everything except the fault handler.
> 
> So just refcount the thing the obvious way: take a reference when you
> stick the PASID in the mm_struct and drop the reference in __mmdrop().
> Problem solved.  You could probably drop it more aggressively in
> __mmput(), and the comment explaining why is left as an exercise to the
> reader -- if a kernel thread starts doing ENQCMD, we have worse things
> to worry about :)

That doesn't match well with the non-x86 usage of PASIDs. The code there
bumps the reference count on each device bind, and decrements on each
device unbind.

If we don't keep a reference count for each task that has IA32_PASID
set up we could have this sequence

1) Process binds to a PASID capable device
2) Task uses ENQCMD, so PASID MSR is set up.
3) Process unbinds the device, reference count on PASID
   goes to zero. PASID is freed. PASID is reallocated to
   another task.
4) Task from step #2 uses ENQCMD to submit a descriptor
   and device now processes virtual addresses based on mappings
   in the new task.

Now you might say that at step 3 we need to hunt down all the
tasks that have PASID enabled and disabled ... but that's the
same painful code that we avoided when we said that we would
not make Linux hand out a PASID to all existing tasks in a
process on the first bind operation.

-Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 7/8] tools/objtool: Check for use of the ENQCMD instruction in the kernel
  2021-09-23 15:26         ` Fenghua Yu
@ 2021-09-24  0:55           ` Josh Poimboeuf
  2021-09-24  0:57             ` Fenghua Yu
  0 siblings, 1 reply; 77+ messages in thread
From: Josh Poimboeuf @ 2021-09-24  0:55 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel,
	Dave Jiang, Jacob Jun Pan, Ashok Raj, Ravi V Shankar, iommu, x86,
	linux-kernel

On Thu, Sep 23, 2021 at 03:26:14PM +0000, Fenghua Yu wrote:
> > > +		} else if (op2 == 0x38 && op3 == 0xf8) {
> > > +			if (insn.prefixes.nbytes == 1 &&
> > > +			    insn.prefixes.bytes[0] == 0xf2) {
> > > +				/* ENQCMD cannot be used in the kernel. */
> > > +				WARN("ENQCMD instruction at %s:%lx", sec->name,
> > > +				     offset);
> > > +
> > > +				return -1;
> > > +			}
> > 
> > The only concern here is if we want it to be fatal or not. But otherwise
> > this seems to be all that's required.
> 
> objtool doesn't fail kernel build on this fatal warning.
> 
> Returning -1 here stops checking the rest of the file and won't report any
> further warnings unless this ENQCMD warning is fixed. Not returning -1
> continues checking the rest of the file and may report more warnings.
> Seems that's the only difference b/w them.
> 
> Should I keep this "return -1" or not? Please advice.

I'd say remove the "return -1" since it's not a fatal-type analysis
error and there's nothing to prevent objtool from analyzing the rest of
the file.

-- 
Josh


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 7/8] tools/objtool: Check for use of the ENQCMD instruction in the kernel
  2021-09-24  0:55           ` Josh Poimboeuf
@ 2021-09-24  0:57             ` Fenghua Yu
  0 siblings, 0 replies; 77+ messages in thread
From: Fenghua Yu @ 2021-09-24  0:57 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel,
	Dave Jiang, Jacob Jun Pan, Ashok Raj, Ravi V Shankar, iommu, x86,
	linux-kernel

Hi, Josh,

On Thu, Sep 23, 2021 at 05:55:40PM -0700, Josh Poimboeuf wrote:
> On Thu, Sep 23, 2021 at 03:26:14PM +0000, Fenghua Yu wrote:
> > > > +		} else if (op2 == 0x38 && op3 == 0xf8) {
> > > > +			if (insn.prefixes.nbytes == 1 &&
> > > > +			    insn.prefixes.bytes[0] == 0xf2) {
> > > > +				/* ENQCMD cannot be used in the kernel. */
> > > > +				WARN("ENQCMD instruction at %s:%lx", sec->name,
> > > > +				     offset);
> > > > +
> > > > +				return -1;
> > > > +			}
> > > 
> > > The only concern here is if we want it to be fatal or not. But otherwise
> > > this seems to be all that's required.
> > 
> > objtool doesn't fail kernel build on this fatal warning.
> > 
> > Returning -1 here stops checking the rest of the file and won't report any
> > further warnings unless this ENQCMD warning is fixed. Not returning -1
> > continues checking the rest of the file and may report more warnings.
> > Seems that's the only difference b/w them.
> > 
> > Should I keep this "return -1" or not? Please advice.
> 
> I'd say remove the "return -1" since it's not a fatal-type analysis
> error and there's nothing to prevent objtool from analyzing the rest of
> the file.

Sure. It does make sense to remove "return -1". I will remove it.

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-23 23:17   ` Andy Lutomirski
@ 2021-09-24  2:56     ` Fenghua Yu
  2021-09-24  5:12       ` Andy Lutomirski
  2021-09-27 21:02     ` Luck, Tony
  1 sibling, 1 reply; 77+ messages in thread
From: Fenghua Yu @ 2021-09-24  2:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra (Intel),
	Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel, Josh Poimboeuf,
	Dave Jiang, Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

Hi, Andy,

On Thu, Sep 23, 2021 at 04:17:05PM -0700, Andy Lutomirski wrote:
> On Mon, Sep 20, 2021, at 12:23 PM, Fenghua Yu wrote:
> > ENQCMD requires the IA32_PASID MSR has a valid PASID value which was
> > allocated to the process during bind. The MSR could be populated eagerly
> > by an IPI after the PASID is allocated in bind. But the method was
> > disabled in commit 9bfecd058339 ("x86/cpufeatures: Force disable
> > X86_FEATURE_ENQCMD and remove update_pasid()")' due to locking and other
> > issues.
> >
> > Since the MSR was cleared in fork()/clone(), the first ENQCMD will
> > generate a #GP fault. The #GP fault handler will initialize the MSR
> > if a PASID has been allocated for this process.
> >
> > The lazy enabling of the PASID MSR in the #GP handler is not an elegant
> > solution. But it has the least complexity that fits with h/w behavior.
> >
> > Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> > Reviewed-by: Tony Luck <tony.luck@intel.com>
> > ---
> >  arch/x86/include/asm/fpu/api.h |  6 ++++
> >  arch/x86/include/asm/iommu.h   |  2 ++
> >  arch/x86/kernel/fpu/xstate.c   | 59 ++++++++++++++++++++++++++++++++++
> >  arch/x86/kernel/traps.c        | 12 +++++++
> >  drivers/iommu/intel/svm.c      | 32 ++++++++++++++++++
> >  5 files changed, 111 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/fpu/api.h 
> > b/arch/x86/include/asm/fpu/api.h
> > index ca4d0dee1ecd..f146849e5c8c 100644
> > --- a/arch/x86/include/asm/fpu/api.h
> > +++ b/arch/x86/include/asm/fpu/api.h
> > @@ -106,4 +106,10 @@ extern int cpu_has_xfeatures(u64 xfeatures_mask, 
> > const char **feature_name);
> >   */
> >  #define PASID_DISABLED	0
> > 
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +void fpu__pasid_write(u32 pasid);
> > +#else
> > +static inline void fpu__pasid_write(u32 pasid) { }
> > +#endif
> > +
> >  #endif /* _ASM_X86_FPU_API_H */
> > diff --git a/arch/x86/include/asm/iommu.h b/arch/x86/include/asm/iommu.h
> > index bf1ed2ddc74b..9c4bf9b0702f 100644
> > --- a/arch/x86/include/asm/iommu.h
> > +++ b/arch/x86/include/asm/iommu.h
> > @@ -26,4 +26,6 @@ arch_rmrr_sanity_check(struct acpi_dmar_reserved_memory *rmrr)
> >  	return -EINVAL;
> >  }
> > 
> > +bool __fixup_pasid_exception(void);
> > +
> >  #endif /* _ASM_X86_IOMMU_H */
> > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> > index c8def1b7f8fb..8a89b2cecd77 100644
> > --- a/arch/x86/kernel/fpu/xstate.c
> > +++ b/arch/x86/kernel/fpu/xstate.c
> > @@ -1289,3 +1289,62 @@ int proc_pid_arch_status(struct seq_file *m, 
> > struct pid_namespace *ns,
> >  	return 0;
> >  }
> >  #endif /* CONFIG_PROC_PID_ARCH_STATUS */
> > +
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +/**
> > + * fpu__pasid_write - Write the current task's PASID state/MSR.
> > + * @pasid:	the PASID
> > + *
> > + * The PASID is written to the IA32_PASID MSR directly if the MSR is 
> > active.
> > + * Otherwise it's written to the PASID. The IA32_PASID MSR should 
> > contain
> > + * the PASID after returning to the user.
> > + *
> > + * This is called only when ENQCMD is enabled.
> > + */
> > +void fpu__pasid_write(u32 pasid)
> > +{
> > +	struct xregs_state *xsave = &current->thread.fpu.state.xsave;
> > +	u64 msr_val = pasid | MSR_IA32_PASID_VALID;
> > +	struct fpu *fpu = &current->thread.fpu;
> > +
> > +	/*
> > +	 * ENQCMD always uses the compacted XSAVE format. Ensure the buffer
> > +	 * has space for the PASID.
> > +	 */
> > +	BUG_ON(!(xsave->header.xcomp_bv & XFEATURE_MASK_PASID));
> > +
> > +	fpregs_lock();
> > +
> > +	/*
> > +	 * If the task's FPU doesn't need to be loaded or is valid, directly
> > +	 * write the IA32_PASID MSR. Otherwise, write the PASID state and
> > +	 * the MSR will be loaded from the PASID state before returning to
> > +	 * the user.
> > +	 */
> > +	if (!test_thread_flag(TIF_NEED_FPU_LOAD) ||
> > +	    fpregs_state_valid(fpu, smp_processor_id())) {
> > +		wrmsrl(MSR_IA32_PASID, msr_val);
> 
> Let me try to decode this.
> 
> If the current task's FPU state is live or if the state is in memory but the CPU regs just happen to match the copy in memory, then write the MSR.  Else write the value to memory.
> 
> This is wrong.  If !TIF_NEED_FPU_LOAD && fpregs_state_valid, you MUST NOT MODIFY FPU STATE.

But the FPU state is not modified if !TIF_NEED_FPU_LOAD && fpregs_state_valid.

The FPU state is modified only if TIF_NEED_FPU_LOAD && !fpregs_state_valid.
In this case, the FPU state will be restored to the IA32_PASID MSR when
exiting to the user. In all other cases, the FPU state will be not be
restored on exiting to the user and thus the IA32_PASID MSR is directly
written here.

Is it right?

>  This is not negotiable -- you will break coherence between CPU regs and the memory image.

fpregs_assert_state_consistent() warns on !TIF_NEED_FPU_LOAD &&
!fpregs_state_valid. Is that breaking coherence you are talking?

>  The way you modify the current task's state is either you modify it in CPU regs (if the kernel knows that the CPU regs are the one and only source of truth) OR you modify it in memory and invalidate any preserved copies (by zapping last_cpu). 
> 
> In any event, that particular bit of logic really doesn't belong in here -- it belongs in some helper that gets it right, once.
> 
> > +
> > +/*
> > + * Try to figure out if there is a PASID MSR value to propagate to the
> > + * thread taking the #GP.
> > + */
> > +bool __fixup_pasid_exception(void)
> > +{
> > +	u32 pasid;
> > +
> > +	/*
> > +	 * This function is called only when this #GP was triggered from user
> > +	 * space. So the mm cannot be NULL.
> > +	 */
> > +	pasid = current->mm->pasid;
> > +
> > +	/* If no PASID is allocated, there is nothing to propagate. */
> > +	if (pasid == PASID_DISABLED)
> > +		return false;
> > +
> > +	/*
> > +	 * If the current task already has a valid PASID MSR, then the #GP
> > +	 * fault must be for some non-ENQCMD related reason.
> > +	 */
> > +	if (current->has_valid_pasid)
> > +		return false;
> 
> IMO "has_valid_pasid" is a poor name.  An mm can have a PASID.  A task has noticed or it hasn't.
> 
> If you really need an in-memory cache, call it "pasid_msr_is_loaded".  Or just read the field out of FPU state, but that might be painfully slow.

Agree. Thomas wants to change "has_valid_pasid" to "holds_pasid_ref" to
represents if the task takes a reference to the PASID.

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-24  2:56     ` Fenghua Yu
@ 2021-09-24  5:12       ` Andy Lutomirski
  0 siblings, 0 replies; 77+ messages in thread
From: Andy Lutomirski @ 2021-09-24  5:12 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra (Intel),
	Dave Hansen, Tony Luck, Lu Baolu, Joerg Roedel, Josh Poimboeuf,
	Dave Jiang, Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List



On Thu, Sep 23, 2021, at 7:56 PM, Fenghua Yu wrote:
> Hi, Andy,
>
> On Thu, Sep 23, 2021 at 04:17:05PM -0700, Andy Lutomirski wrote:
>> On Mon, Sep 20, 2021, at 12:23 PM, Fenghua Yu wrote:
>> > ENQCMD requires the IA32_PASID MSR has a valid PASID value which was
>> > allocated to the process during bind. The MSR could be populated eagerly
>> > by an IPI after the PASID is allocated in bind. But the method was
>> > disabled in commit 9bfecd058339 ("x86/cpufeatures: Force disable
>> > X86_FEATURE_ENQCMD and remove update_pasid()")' due to locking and other
>> > issues.
>> >
>> > Since the MSR was cleared in fork()/clone(), the first ENQCMD will
>> > generate a #GP fault. The #GP fault handler will initialize the MSR
>> > if a PASID has been allocated for this process.
>> >
>> > The lazy enabling of the PASID MSR in the #GP handler is not an elegant
>> > solution. But it has the least complexity that fits with h/w behavior.
>> >
>> > Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
>> > Reviewed-by: Tony Luck <tony.luck@intel.com>
>> > ---
>> >  arch/x86/include/asm/fpu/api.h |  6 ++++
>> >  arch/x86/include/asm/iommu.h   |  2 ++
>> >  arch/x86/kernel/fpu/xstate.c   | 59 ++++++++++++++++++++++++++++++++++
>> >  arch/x86/kernel/traps.c        | 12 +++++++
>> >  drivers/iommu/intel/svm.c      | 32 ++++++++++++++++++
>> >  5 files changed, 111 insertions(+)
>> >
>> > diff --git a/arch/x86/include/asm/fpu/api.h 
>> > b/arch/x86/include/asm/fpu/api.h
>> > index ca4d0dee1ecd..f146849e5c8c 100644
>> > --- a/arch/x86/include/asm/fpu/api.h
>> > +++ b/arch/x86/include/asm/fpu/api.h
>> > @@ -106,4 +106,10 @@ extern int cpu_has_xfeatures(u64 xfeatures_mask, 
>> > const char **feature_name);
>> >   */
>> >  #define PASID_DISABLED	0
>> > 
>> > +#ifdef CONFIG_INTEL_IOMMU_SVM
>> > +void fpu__pasid_write(u32 pasid);
>> > +#else
>> > +static inline void fpu__pasid_write(u32 pasid) { }
>> > +#endif
>> > +
>> >  #endif /* _ASM_X86_FPU_API_H */
>> > diff --git a/arch/x86/include/asm/iommu.h b/arch/x86/include/asm/iommu.h
>> > index bf1ed2ddc74b..9c4bf9b0702f 100644
>> > --- a/arch/x86/include/asm/iommu.h
>> > +++ b/arch/x86/include/asm/iommu.h
>> > @@ -26,4 +26,6 @@ arch_rmrr_sanity_check(struct acpi_dmar_reserved_memory *rmrr)
>> >  	return -EINVAL;
>> >  }
>> > 
>> > +bool __fixup_pasid_exception(void);
>> > +
>> >  #endif /* _ASM_X86_IOMMU_H */
>> > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>> > index c8def1b7f8fb..8a89b2cecd77 100644
>> > --- a/arch/x86/kernel/fpu/xstate.c
>> > +++ b/arch/x86/kernel/fpu/xstate.c
>> > @@ -1289,3 +1289,62 @@ int proc_pid_arch_status(struct seq_file *m, 
>> > struct pid_namespace *ns,
>> >  	return 0;
>> >  }
>> >  #endif /* CONFIG_PROC_PID_ARCH_STATUS */
>> > +
>> > +#ifdef CONFIG_INTEL_IOMMU_SVM
>> > +/**
>> > + * fpu__pasid_write - Write the current task's PASID state/MSR.
>> > + * @pasid:	the PASID
>> > + *
>> > + * The PASID is written to the IA32_PASID MSR directly if the MSR is 
>> > active.
>> > + * Otherwise it's written to the PASID. The IA32_PASID MSR should 
>> > contain
>> > + * the PASID after returning to the user.
>> > + *
>> > + * This is called only when ENQCMD is enabled.
>> > + */
>> > +void fpu__pasid_write(u32 pasid)
>> > +{
>> > +	struct xregs_state *xsave = &current->thread.fpu.state.xsave;
>> > +	u64 msr_val = pasid | MSR_IA32_PASID_VALID;
>> > +	struct fpu *fpu = &current->thread.fpu;
>> > +
>> > +	/*
>> > +	 * ENQCMD always uses the compacted XSAVE format. Ensure the buffer
>> > +	 * has space for the PASID.
>> > +	 */
>> > +	BUG_ON(!(xsave->header.xcomp_bv & XFEATURE_MASK_PASID));
>> > +
>> > +	fpregs_lock();
>> > +
>> > +	/*
>> > +	 * If the task's FPU doesn't need to be loaded or is valid, directly
>> > +	 * write the IA32_PASID MSR. Otherwise, write the PASID state and
>> > +	 * the MSR will be loaded from the PASID state before returning to
>> > +	 * the user.
>> > +	 */
>> > +	if (!test_thread_flag(TIF_NEED_FPU_LOAD) ||
>> > +	    fpregs_state_valid(fpu, smp_processor_id())) {
>> > +		wrmsrl(MSR_IA32_PASID, msr_val);
>> 
>> Let me try to decode this.
>> 
>> If the current task's FPU state is live or if the state is in memory but the CPU regs just happen to match the copy in memory, then write the MSR.  Else write the value to memory.
>> 
>> This is wrong.  If !TIF_NEED_FPU_LOAD && fpregs_state_valid, you MUST NOT MODIFY FPU STATE.

Sorry, I typoed that.  I meant TIF_NEED_FPU_LOAD && fpregs_state_valid, which is in the case that does wrmsr.

>
> But the FPU state is not modified if !TIF_NEED_FPU_LOAD && fpregs_state_valid.
>
> The FPU state is modified only if TIF_NEED_FPU_LOAD && !fpregs_state_valid.

The MSR is FPU state.  If TIF_NEED_FPU_LOAD && fpregs_state_valid, then the authoritative copy of the FPU state is in memory, but the CPU regs are known to match memory.  You MUST NOT modify the in-memory state or the regs.

> In this case, the FPU state will be restored to the IA32_PASID MSR when
> exiting to the user. In all other cases, the FPU state will be not be
> restored on exiting to the user and thus the IA32_PASID MSR is directly
> written here.


>
> Is it right?
>
>>  This is not negotiable -- you will break coherence between CPU regs and the memory image.
>
> fpregs_assert_state_consistent() warns on !TIF_NEED_FPU_LOAD &&
> !fpregs_state_valid. Is that breaking coherence you are talking?

No.  I mean that you broke coherence between memory and registers.  If the task resumes on the current CPU without scheduling, the MSR write will take effect.  If the task resumes on a different CPU or after something else takes over the current CPU's regs, the MSR write will be lost.

>
>>  The way you modify the current task's state is either you modify it in CPU regs (if the kernel knows that the CPU regs are the one and only source of truth) OR you modify it in memory and invalidate any preserved copies (by zapping last_cpu). 
>> 
>> In any event, that particular bit of logic really doesn't belong in here -- it belongs in some helper that gets it right, once.
>> 
>> > +
>> > +/*
>> > + * Try to figure out if there is a PASID MSR value to propagate to the
>> > + * thread taking the #GP.
>> > + */
>> > +bool __fixup_pasid_exception(void)
>> > +{
>> > +	u32 pasid;
>> > +
>> > +	/*
>> > +	 * This function is called only when this #GP was triggered from user
>> > +	 * space. So the mm cannot be NULL.
>> > +	 */
>> > +	pasid = current->mm->pasid;
>> > +
>> > +	/* If no PASID is allocated, there is nothing to propagate. */
>> > +	if (pasid == PASID_DISABLED)
>> > +		return false;
>> > +
>> > +	/*
>> > +	 * If the current task already has a valid PASID MSR, then the #GP
>> > +	 * fault must be for some non-ENQCMD related reason.
>> > +	 */
>> > +	if (current->has_valid_pasid)
>> > +		return false;
>> 
>> IMO "has_valid_pasid" is a poor name.  An mm can have a PASID.  A task has noticed or it hasn't.
>> 
>> If you really need an in-memory cache, call it "pasid_msr_is_loaded".  Or just read the field out of FPU state, but that might be painfully slow.
>
> Agree. Thomas wants to change "has_valid_pasid" to "holds_pasid_ref" to
> represents if the task takes a reference to the PASID.

That name will stop making sense when tasks stop holding references.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-23 23:22     ` Luck, Tony
@ 2021-09-24  5:17       ` Andy Lutomirski
  0 siblings, 0 replies; 77+ messages in thread
From: Andy Lutomirski @ 2021-09-24  5:17 UTC (permalink / raw)
  To: Tony Luck
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra (Intel),
	Dave Hansen, Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List



On Thu, Sep 23, 2021, at 4:22 PM, Luck, Tony wrote:
> On Thu, Sep 23, 2021 at 04:09:18PM -0700, Andy Lutomirski wrote:
>> On Mon, Sep 20, 2021, at 12:23 PM, Fenghua Yu wrote:
>
>> I think this is unnecessarily complicated because it's buying in to the
>> existing ISA misconception that PASID has anything to do with a task.
>> A PASID belongs to an mm, full stop.  Now the ISA is nasty and we have
>> tasks that have *noticed* that their mm has a PASID and tasks that have
>> not noticed this fact, but that should be irrelevant to essentially
>> everything except the fault handler.
>> 
>> So just refcount the thing the obvious way: take a reference when you
>> stick the PASID in the mm_struct and drop the reference in __mmdrop().
>> Problem solved.  You could probably drop it more aggressively in
>> __mmput(), and the comment explaining why is left as an exercise to the
>> reader -- if a kernel thread starts doing ENQCMD, we have worse things
>> to worry about :)
>
> That doesn't match well with the non-x86 usage of PASIDs. The code there
> bumps the reference count on each device bind, and decrements on each
> device unbind.

Can you elaborate on how that works?  Is there an architecture where there is a bona fide per task PASID?

>
> If we don't keep a reference count for each task that has IA32_PASID
> set up we could have this sequence
>
> 1) Process binds to a PASID capable device

Okay, so the mm has that PASID set up and a reference is taken.

> 2) Task uses ENQCMD, so PASID MSR is set up.

Yep.

> 3) Process unbinds the device, reference count on PASID
>    goes to zero. PASID is freed. PASID is reallocated to
>    another task.

It had better not.  We had an entire phone call in which we agreed that the entire lazy-MSR-setup approach only makes any sense if everyone pinky swears that an mm will *never* change its PASID once it has a PASID.

> 4) Task from step #2 uses ENQCMD to submit a descriptor
>    and device now processes virtual addresses based on mappings
>    in the new task.
>
> Now you might say that at step 3 we need to hunt down all the
> tasks that have PASID enabled and disabled ... but that's the
> same painful code that we avoided when we said that we would
> not make Linux hand out a PASID to all existing tasks in a
> process on the first bind operation.
>

Exactly.  Which means that the mm ought to pin that PASID for as long as it exists.  What am I missing?

Sure, one can invent a situation in which you start two threads, and one of those threads binds a device, does ENQCMD, unbinds the device, and exits.  Then the other thread *in the same mm* binds another device and gets a new PASID.  And it all works.  But I really don't think this special case is worth optimizing for.

> -Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-23 17:48       ` Thomas Gleixner
@ 2021-09-24 13:18         ` Thomas Gleixner
  2021-09-24 16:12           ` Luck, Tony
  2021-09-24 16:12           ` Fenghua Yu
  0 siblings, 2 replies; 77+ messages in thread
From: Thomas Gleixner @ 2021-09-24 13:18 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Fenghua Yu, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar, iommu, x86, linux-kernel

On Thu, Sep 23 2021 at 19:48, Thomas Gleixner wrote:
> On Thu, Sep 23 2021 at 09:40, Tony Luck wrote:
>
> fpu_write_task_pasid() can just grab the pasid from current->mm->pasid
> and be done with it.
>
> The task exit code can just call iommu_pasid_put_task_ref() from the
> generic code and not from within x86.

But OTOH why do you need a per task reference count on the PASID at all?

The PASID is fundamentaly tied to the mm and the mm can't go away before
the threads have gone away unless this magically changed after I checked
that ~20 years ago.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-23 17:14     ` Luck, Tony
@ 2021-09-24 13:37       ` Peter Zijlstra
  2021-09-24 15:39         ` Luck, Tony
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2021-09-24 13:37 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar, iommu, x86, linux-kernel

On Thu, Sep 23, 2021 at 10:14:42AM -0700, Luck, Tony wrote:
> On Wed, Sep 22, 2021 at 11:07:22PM +0200, Peter Zijlstra wrote:
> > On Mon, Sep 20, 2021 at 07:23:45PM +0000, Fenghua Yu wrote:
> > > @@ -538,6 +547,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> > >  
> > >  	cond_local_irq_enable(regs);
> > >  
> > > +	if (user_mode(regs) && fixup_pasid_exception())
> > > +		goto exit;
> > > +
> 
> > So you're eating any random #GP that might or might not be PASID
> > related. And all that witout a comment... Enlighten?
> 
> This is moderately well commented inside the fixup_pasid_exception()
> function. Another copy of the comments here at the call-site seems
> overkill.

+static bool fixup_pasid_exception(void)
+{
+       if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
+               return false;
+
+       return __fixup_pasid_exception();
+}

/me goes looking for comments in that function, lemme get out the
electron microscope, because I can't seem to spot them with the naked
eye.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-24 13:37       ` Peter Zijlstra
@ 2021-09-24 15:39         ` Luck, Tony
  2021-09-29  9:00           ` Peter Zijlstra
  0 siblings, 1 reply; 77+ messages in thread
From: Luck, Tony @ 2021-09-24 15:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar, iommu, x86, linux-kernel

On Fri, Sep 24, 2021 at 03:37:22PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 23, 2021 at 10:14:42AM -0700, Luck, Tony wrote:
> > On Wed, Sep 22, 2021 at 11:07:22PM +0200, Peter Zijlstra wrote:
> > > On Mon, Sep 20, 2021 at 07:23:45PM +0000, Fenghua Yu wrote:
> > > > @@ -538,6 +547,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> > > >  
> > > >  	cond_local_irq_enable(regs);
> > > >  
> > > > +	if (user_mode(regs) && fixup_pasid_exception())
> > > > +		goto exit;
> > > > +
> > 
> > > So you're eating any random #GP that might or might not be PASID
> > > related. And all that witout a comment... Enlighten?
> > 
> > This is moderately well commented inside the fixup_pasid_exception()
> > function. Another copy of the comments here at the call-site seems
> > overkill.
> 
> +static bool fixup_pasid_exception(void)
> +{
> +       if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
> +               return false;
> +
> +       return __fixup_pasid_exception();
> +}
> 
> /me goes looking for comments in that function, lemme get out the
> electron microscope, because I can't seem to spot them with the naked
> eye.

If you have ctags installed then a ctrl-] on that
__fixup_pasid_exception() will take you to the function with
the comments. No electron microscope needed.

+
+/*
+ * Try to figure out if there is a PASID MSR value to propagate to the
+ * thread taking the #GP.
+ */
+bool __fixup_pasid_exception(void)
+{
+       u32 pasid;
+
+       /*
+        * This function is called only when this #GP was triggered from user
+        * space. So the mm cannot be NULL.
+        */
+       pasid = current->mm->pasid;
+
+       /* If no PASID is allocated, there is nothing to propagate. */
+       if (pasid == PASID_DISABLED)
+               return false;
+
+       /*
+        * If the current task already has a valid PASID MSR, then the #GP
+        * fault must be for some non-ENQCMD related reason.
+        */
+       if (current->has_valid_pasid)
+               return false;
+
+       /* Fix up the MSR by the PASID in the mm. */
+       fpu__pasid_write(pasid);
+       current->has_valid_pasid = 1;
+
+       return true;
+}

-Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-24 13:18         ` Thomas Gleixner
@ 2021-09-24 16:12           ` Luck, Tony
  2021-09-24 23:03             ` Andy Lutomirski
  2021-09-24 16:12           ` Fenghua Yu
  1 sibling, 1 reply; 77+ messages in thread
From: Luck, Tony @ 2021-09-24 16:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Fenghua Yu, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar, iommu, x86, linux-kernel

On Fri, Sep 24, 2021 at 03:18:12PM +0200, Thomas Gleixner wrote:
> On Thu, Sep 23 2021 at 19:48, Thomas Gleixner wrote:
> > On Thu, Sep 23 2021 at 09:40, Tony Luck wrote:
> >
> > fpu_write_task_pasid() can just grab the pasid from current->mm->pasid
> > and be done with it.
> >
> > The task exit code can just call iommu_pasid_put_task_ref() from the
> > generic code and not from within x86.
> 
> But OTOH why do you need a per task reference count on the PASID at all?
> 
> The PASID is fundamentaly tied to the mm and the mm can't go away before
> the threads have gone away unless this magically changed after I checked
> that ~20 years ago.

It would be possible to avoid a per-task reference to the PASID by
taking an extra reference when mm->pasid is first allocated using
the CONFIG_PASID_TASK_REFS you proposed yesterday to define a function
to take the extra reference, and another to drop it when the mm is
finally freed ... with stubs to do nothing on architectures that
don't create per-task PASID context.

This solution works, but is functionally different from Fenghua's
proposal for this case:

	Process clones a task
	task binds a device
	task accesses device using PASID
	task unbinds device
	task exits (but process lives on)

Fenghua will free the PASID because the reference count goes
back to zero. The "take an extra reference and keep until the
mm is freed" option would needlessly hold onto the PASID.

This seems like an odd usage case ... even if it does exist, a process
that does this may spawn another task that does the same thing.

If you think it is sufficiently simpler to use the "extra reference"
option (invoking the "perfect is the enemy of good enough" rule) then we
can change.

-Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-24 13:18         ` Thomas Gleixner
  2021-09-24 16:12           ` Luck, Tony
@ 2021-09-24 16:12           ` Fenghua Yu
  2021-09-25 23:13             ` Thomas Gleixner
  1 sibling, 1 reply; 77+ messages in thread
From: Fenghua Yu @ 2021-09-24 16:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Luck, Tony, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar, iommu, x86, linux-kernel

Hi, Thomas,

On Fri, Sep 24, 2021 at 03:18:12PM +0200, Thomas Gleixner wrote:
> On Thu, Sep 23 2021 at 19:48, Thomas Gleixner wrote:
> > On Thu, Sep 23 2021 at 09:40, Tony Luck wrote:
> >
> > fpu_write_task_pasid() can just grab the pasid from current->mm->pasid
> > and be done with it.
> >
> > The task exit code can just call iommu_pasid_put_task_ref() from the
> > generic code and not from within x86.
> 
> But OTOH why do you need a per task reference count on the PASID at all?
> 
> The PASID is fundamentaly tied to the mm and the mm can't go away before
> the threads have gone away unless this magically changed after I checked
> that ~20 years ago.

There are up to 1M PASIDs because PASID is 20-bit. I think there are a few ways
to allocate and free PASID:

1. Statically allocate a PASID once a mm is created and free it in mm
   exit. No PASID allocation/free during the mm's lifetime. Then
   up to 1M processes can be created due to 1M PASIDs limitation.
   We don't want this method because the 1M processes limitation.

2. A PASID is allocated to the mm in open(dev)->bind(dev, mm). There
   are three ways to free it:
   (a) Actively free it in close(fd)->unbind(dev, mm) by sending
       IPIs to tell all tasks using the PASID to clear the IA32_PASID
       MSR. This has locking issues similar to the actively loading
       IA32_PASID MSR which was force disabled in upstream. So won't work.
   (b) Passively free the PASID in destroy_context(mm) in mm exit. Once
       the PASID is allocated, it stays with the process for the lifetime. It's
       better than #1 because the PASID is allocated only on demand.
   (c) Passively free the PASID in deactive_mm(mm) or unbind() whenever there
       is no usage as implemented in this series. Tracking the PASID usage
       per task provides a chance to free the PASID on task exit. The
       PASID has a better chance to be freed earlier than mm exit in #(b).

This series uses #2 and #(c) to allocate and free the PASID for a better
chance to ease the 1M PASIDs limitation pressure. For example, a thread
doing open(dev)->ENQCMD->close(fd)->exit(2) will not occupy a PASID while
its sibling threads are still running.

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-24 16:12           ` Luck, Tony
@ 2021-09-24 23:03             ` Andy Lutomirski
  2021-09-24 23:11               ` Luck, Tony
  2021-09-29  9:54               ` Peter Zijlstra
  0 siblings, 2 replies; 77+ messages in thread
From: Andy Lutomirski @ 2021-09-24 23:03 UTC (permalink / raw)
  To: Tony Luck, Thomas Gleixner
  Cc: Fenghua Yu, Ingo Molnar, Borislav Petkov, Peter Zijlstra (Intel),
	Dave Hansen, Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List



On Fri, Sep 24, 2021, at 9:12 AM, Luck, Tony wrote:
> On Fri, Sep 24, 2021 at 03:18:12PM +0200, Thomas Gleixner wrote:
>> On Thu, Sep 23 2021 at 19:48, Thomas Gleixner wrote:
>> > On Thu, Sep 23 2021 at 09:40, Tony Luck wrote:
>> >
>> > fpu_write_task_pasid() can just grab the pasid from current->mm->pasid
>> > and be done with it.
>> >
>> > The task exit code can just call iommu_pasid_put_task_ref() from the
>> > generic code and not from within x86.
>> 
>> But OTOH why do you need a per task reference count on the PASID at all?
>> 
>> The PASID is fundamentaly tied to the mm and the mm can't go away before
>> the threads have gone away unless this magically changed after I checked
>> that ~20 years ago.
>
> It would be possible to avoid a per-task reference to the PASID by
> taking an extra reference when mm->pasid is first allocated using
> the CONFIG_PASID_TASK_REFS you proposed yesterday to define a function
> to take the extra reference, and another to drop it when the mm is
> finally freed ... with stubs to do nothing on architectures that
> don't create per-task PASID context.
>
> This solution works, but is functionally different from Fenghua's
> proposal for this case:
>
> 	Process clones a task
> 	task binds a device
> 	task accesses device using PASID
> 	task unbinds device
> 	task exits (but process lives on)
>
> Fenghua will free the PASID because the reference count goes
> back to zero. The "take an extra reference and keep until the
> mm is freed" option would needlessly hold onto the PASID.
>
> This seems like an odd usage case ... even if it does exist, a process
> that does this may spawn another task that does the same thing.
>
> If you think it is sufficiently simpler to use the "extra reference"
> option (invoking the "perfect is the enemy of good enough" rule) then we
> can change.

I think the perfect and the good are a bit confused here. If we go for "good", then we have an mm owning a PASID for its entire lifetime.  If we want "perfect", then we should actually do it right: teach the kernel to update an entire mm's PASID setting all at once.  This isn't *that* hard -- it involves two things:

1. The context switch code needs to resync PASID.  Unfortunately, this adds some overhead to every context switch, although a static_branch could minimize it for non-PASID users.
2. A change to an mm's PASID needs to sent an IPI, but that IPI can't touch FPU state.  So instead the IPI should use task_work_add() to make sure PASID gets resynced.

And there is still no per-task refcounting.

After all, the not so perfect attempt at perfect in this patch set won't actually work if a thread pool is involved.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-24 23:03             ` Andy Lutomirski
@ 2021-09-24 23:11               ` Luck, Tony
  2021-09-29  9:54               ` Peter Zijlstra
  1 sibling, 0 replies; 77+ messages in thread
From: Luck, Tony @ 2021-09-24 23:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Fenghua Yu, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra (Intel),
	Dave Hansen, Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Fri, Sep 24, 2021 at 04:03:53PM -0700, Andy Lutomirski wrote:
> 1. The context switch code needs to resync PASID.  Unfortunately, this adds some overhead to every context switch, although a static_branch could minimize it for non-PASID users.

Any solution that adds to context switch time isn't going to meet
the definition of "perfect" either.

-Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-24 16:12           ` Fenghua Yu
@ 2021-09-25 23:13             ` Thomas Gleixner
  2021-09-28 16:36               ` Fenghua Yu
  0 siblings, 1 reply; 77+ messages in thread
From: Thomas Gleixner @ 2021-09-25 23:13 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Luck, Tony, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar, iommu, x86, linux-kernel

Fenghua,

On Fri, Sep 24 2021 at 16:12, Fenghua Yu wrote:
> On Fri, Sep 24, 2021 at 03:18:12PM +0200, Thomas Gleixner wrote:
>> But OTOH why do you need a per task reference count on the PASID at all?
>> 
>> The PASID is fundamentaly tied to the mm and the mm can't go away before
>> the threads have gone away unless this magically changed after I checked
>> that ~20 years ago.
>
> There are up to 1M PASIDs because PASID is 20-bit. I think there are a few ways
> to allocate and free PASID:
>
> 1. Statically allocate a PASID once a mm is created and free it in mm
>    exit. No PASID allocation/free during the mm's lifetime. Then
>    up to 1M processes can be created due to 1M PASIDs limitation.
>    We don't want this method because the 1M processes limitation.

I'm not so worried about the 1M limitation, but it obviously makes sense
to avoid that because allocating stuff which is not used is pointless in
general.

> 2. A PASID is allocated to the mm in open(dev)->bind(dev, mm). There
>    are three ways to free it:
>    (a) Actively free it in close(fd)->unbind(dev, mm) by sending
>        IPIs to tell all tasks using the PASID to clear the IA32_PASID
>        MSR. This has locking issues similar to the actively loading
>        IA32_PASID MSR which was force disabled in upstream. So won't work.

Exactly.

>    (b) Passively free the PASID in destroy_context(mm) in mm exit. Once
>        the PASID is allocated, it stays with the process for the lifetime. It's
>        better than #1 because the PASID is allocated only on demand.

Which is simple and makes a lot of sense. See below.

>    (c) Passively free the PASID in deactive_mm(mm) or unbind() whenever there
>        is no usage as implemented in this series. Tracking the PASID usage
>        per task provides a chance to free the PASID on task exit. The
>        PASID has a better chance to be freed earlier than mm exit in #(b).
>
> This series uses #2 and #(c) to allocate and free the PASID for a better
> chance to ease the 1M PASIDs limitation pressure. For example, a thread
> doing open(dev)->ENQCMD->close(fd)->exit(2) will not occupy a PASID while
> its sibling threads are still running.

I'm not seeing that as a realistic problem. Applications which use this
kind of devices are unlikely to behave exactly that way.

2^20 PASIDs are really plenty and just adding code for the theoretical
case of PASID pressure is a pointless exercise IMO. It just adds
complexity for no reason.

IMO reality will be that either you have long lived processes with tons
of threads which use such devices over and over or short lived forked
processes which open the device, do the job, close and exit. Both
scenarios are fine with allocate on first use and drop on process exit.

I think with your approach you create overhead for applications which
use thread pools where the threads get work thrown at them and do open()
-> do_stuff() -> close() and then go back to wait for the next job which
will do exactly the same thing. So you add the overhead of refcounts in
general and in the worst case if the refcount drops to zero then the
next worker has to allocate a new PASID instead of just moving on.

So unless you have a really compelling real world usecase argument, I'm
arguing that the PASID pressure problem is a purely academic exercise.

I think you are conflating two things here:

  1) PASID lifetime
  2) PASID MSR overhead

Which is not correct: You still can and have to optimize the per thread
behaviour vs. the PASID MSR: Track per thread whether it ever needed the
PASID and act upon that.

If the thread just does EMQCMD once in it's lifetime, then so be
it. That's not a realistic use case, really.

And if someone does this then this does not mean we have to optimize for
that. Optimizing for possible stupid implementations is the wrong
approach. There is no technial measure against stupidity. If that would
exist the world would be a much better place.

You really have to think about the problem space you are working
on. There are problems which need a 'get it right at the first shot'
solution because they create user space ABI or otheer hard to fix
dependencies.

That's absolutely not the case here.

Get the basic simple support correct and work from there. Trying to
solve all possible theoretical problems upfront is simply not possible
and a guarantee for not making progress.

"Keep it simple" and "correctness first" are still the best working
engineering principles.

They do not prevent us from revisiting this _if_ there is a real world
problem which makes enough sense to implement a finer grained solution.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-23 23:17   ` Andy Lutomirski
  2021-09-24  2:56     ` Fenghua Yu
@ 2021-09-27 21:02     ` Luck, Tony
  2021-09-27 23:51       ` Dave Hansen
  1 sibling, 1 reply; 77+ messages in thread
From: Luck, Tony @ 2021-09-27 21:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra (Intel),
	Dave Hansen, Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Thu, Sep 23, 2021 at 04:17:05PM -0700, Andy Lutomirski wrote:
> > +	fpregs_lock();
> > +
> > +	/*
> > +	 * If the task's FPU doesn't need to be loaded or is valid, directly
> > +	 * write the IA32_PASID MSR. Otherwise, write the PASID state and
> > +	 * the MSR will be loaded from the PASID state before returning to
> > +	 * the user.
> > +	 */
> > +	if (!test_thread_flag(TIF_NEED_FPU_LOAD) ||
> > +	    fpregs_state_valid(fpu, smp_processor_id())) {
> > +		wrmsrl(MSR_IA32_PASID, msr_val);
> 
> Let me try to decode this.
> 
> If the current task's FPU state is live or if the state is in memory but the CPU regs just happen to match the copy in memory, then write the MSR.  Else write the value to memory.
> 
> This is wrong.  If !TIF_NEED_FPU_LOAD && fpregs_state_valid, you MUST NOT MODIFY FPU STATE.  This is not negotiable -- you will break coherence between CPU regs and the memory image.  The way you modify the current task's state is either you modify it in CPU regs (if the kernel knows that the CPU regs are the one and only source of truth) OR you modify it in memory and invalidate any preserved copies (by zapping last_cpu). 
> 
> In any event, that particular bit of logic really doesn't belong in here -- it belongs in some helper that gets it right, once.

Andy,

A helper sounds like a good idea. Can you flesh out what
you would like that to look like?

Is it just the "where is the live register state?" so the
above could be written:

	if (xsave_state_in_memory(args ...))
		update pasid bit of xsave state in memory
	else
		wrmsrl(MSR_IA32_PASID, msr_val);

Or are you thinking of a helper that does both the check
and the update ... so the code here could be:

	update_one_xsave_feature(XFEATURE_PASID, &msr_val, sizeof(msr_val));

With the helper being something like:

void update_one_xsave_feature(enum xfeature xfeature, void *data, size_t size)
{
	if (xsave_state_in_memory(args ...)) {
		addr = get_xsave_addr(xsave, xfeature);
		memcpy(addr, data, size);
		xsave->header.xfeatures |= (1 << xfeature);
		return;
	}

	switch (xfeature) {
	case XFEATURE_PASID:
		wrmsrl(MSR_IA32_PASID, *(u64 *)data);
		break;

	case each_of_the_other_XFEATURE_enums:
		code to update registers for that XFEATURE
	}
}

either way needs the definitive correct coding for xsave_state_in_memory()

-Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-27 21:02     ` Luck, Tony
@ 2021-09-27 23:51       ` Dave Hansen
  2021-09-28 18:50         ` Luck, Tony
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Hansen @ 2021-09-27 23:51 UTC (permalink / raw)
  To: Luck, Tony, Andy Lutomirski
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra (Intel),
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

On 9/27/21 2:02 PM, Luck, Tony wrote:
> Or are you thinking of a helper that does both the check
> and the update ... so the code here could be:
> 
> 	update_one_xsave_feature(XFEATURE_PASID, &msr_val, sizeof(msr_val));
> 
> With the helper being something like:
> 
> void update_one_xsave_feature(enum xfeature xfeature, void *data, size_t size)
> {
> 	if (xsave_state_in_memory(args ...)) {
> 		addr = get_xsave_addr(xsave, xfeature);
> 		memcpy(addr, data, size);
> 		xsave->header.xfeatures |= (1 << xfeature);
> 		return;
> 	}
> 
> 	switch (xfeature) {
> 	case XFEATURE_PASID:
> 		wrmsrl(MSR_IA32_PASID, *(u64 *)data);
> 		break;
> 
> 	case each_of_the_other_XFEATURE_enums:
> 		code to update registers for that XFEATURE
> 	}
> }
> 
> either way needs the definitive correct coding for xsave_state_in_memory()

That's close to what we want.

The size should probably be implicit.  If it isn't implicit, it needs to
at least be double-checked against the state sizes.

Not to get too fancy, but I think we also want to have a "replace"
operation which is separate from the "update".  Think of a case where
you are trying to set a bit:

	struct pkru_state *pk = start_update_xstate(tsk, XSTATE_PKRU);
	pk->pkru |= 0x100;
	finish_update_xstate(tsk, XSTATE_PKRU, pk);

versus setting a whole new value:

	struct pkru_state *pk = start_replace_xstate(tsk, XSTATE_PKRU);
	memset(pkru, sizeof(*pk), 0);
	pk->pkru = 0x1234;
	finish_replace_xstate(tsk, XSTATE_PKRU, pk);

They look similar, but they actually might have very different costs for
big states like AMX.  We can also do some very different debugging for
them.  In normal operation, these _can_ just return pointers directly to
the fpu...->xstate in some case.  But, if we're debugging, we could
kmalloc() a buffer and do sanity checking before it goes into the task
buffer.

We don't have to do any of that fancy stuff now.  We don't even need to
do an "update" if all we need for now for XFEATURE_PASID is a "replace".

1. Hide whether we need to write to real registers
2. Hide whether we need to update the in-memory image
3. Hide other FPU infrastructure like the TIF flag.
4. Make the users deal with a *whole* state in the replace API

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-25 23:13             ` Thomas Gleixner
@ 2021-09-28 16:36               ` Fenghua Yu
  0 siblings, 0 replies; 77+ messages in thread
From: Fenghua Yu @ 2021-09-28 16:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Luck, Tony, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar, iommu, x86, linux-kernel

Hi, Thomas,

On Sun, Sep 26, 2021 at 01:13:50AM +0200, Thomas Gleixner wrote:
> Fenghua,
> 
> On Fri, Sep 24 2021 at 16:12, Fenghua Yu wrote:
> > On Fri, Sep 24, 2021 at 03:18:12PM +0200, Thomas Gleixner wrote:
> >> But OTOH why do you need a per task reference count on the PASID at all?
> >> 
> >> The PASID is fundamentaly tied to the mm and the mm can't go away before
> >> the threads have gone away unless this magically changed after I checked
> >> that ~20 years ago.
> >
> > There are up to 1M PASIDs because PASID is 20-bit. I think there are a few ways
> > to allocate and free PASID:
> >
> > 1. Statically allocate a PASID once a mm is created and free it in mm
> >    exit. No PASID allocation/free during the mm's lifetime. Then
> >    up to 1M processes can be created due to 1M PASIDs limitation.
> >    We don't want this method because the 1M processes limitation.
> 
> I'm not so worried about the 1M limitation, but it obviously makes sense
> to avoid that because allocating stuff which is not used is pointless in
> general.
> 
> > 2. A PASID is allocated to the mm in open(dev)->bind(dev, mm). There
> >    are three ways to free it:
> >    (a) Actively free it in close(fd)->unbind(dev, mm) by sending
> >        IPIs to tell all tasks using the PASID to clear the IA32_PASID
> >        MSR. This has locking issues similar to the actively loading
> >        IA32_PASID MSR which was force disabled in upstream. So won't work.
> 
> Exactly.
> 
> >    (b) Passively free the PASID in destroy_context(mm) in mm exit. Once
> >        the PASID is allocated, it stays with the process for the lifetime. It's
> >        better than #1 because the PASID is allocated only on demand.
> 
> Which is simple and makes a lot of sense. See below.
> 
> >    (c) Passively free the PASID in deactive_mm(mm) or unbind() whenever there
> >        is no usage as implemented in this series. Tracking the PASID usage
> >        per task provides a chance to free the PASID on task exit. The
> >        PASID has a better chance to be freed earlier than mm exit in #(b).
> >
> > This series uses #2 and #(c) to allocate and free the PASID for a better
> > chance to ease the 1M PASIDs limitation pressure. For example, a thread
> > doing open(dev)->ENQCMD->close(fd)->exit(2) will not occupy a PASID while
> > its sibling threads are still running.
> 
> I'm not seeing that as a realistic problem. Applications which use this
> kind of devices are unlikely to behave exactly that way.
> 
> 2^20 PASIDs are really plenty and just adding code for the theoretical
> case of PASID pressure is a pointless exercise IMO. It just adds
> complexity for no reason.
> 
> IMO reality will be that either you have long lived processes with tons
> of threads which use such devices over and over or short lived forked
> processes which open the device, do the job, close and exit. Both
> scenarios are fine with allocate on first use and drop on process exit.
> 
> I think with your approach you create overhead for applications which
> use thread pools where the threads get work thrown at them and do open()
> -> do_stuff() -> close() and then go back to wait for the next job which
> will do exactly the same thing. So you add the overhead of refcounts in
> general and in the worst case if the refcount drops to zero then the
> next worker has to allocate a new PASID instead of just moving on.
> 
> So unless you have a really compelling real world usecase argument, I'm
> arguing that the PASID pressure problem is a purely academic exercise.
> 
> I think you are conflating two things here:
> 
>   1) PASID lifetime
>   2) PASID MSR overhead
> 
> Which is not correct: You still can and have to optimize the per thread
> behaviour vs. the PASID MSR: Track per thread whether it ever needed the
> PASID and act upon that.
> 
> If the thread just does EMQCMD once in it's lifetime, then so be
> it. That's not a realistic use case, really.
> 
> And if someone does this then this does not mean we have to optimize for
> that. Optimizing for possible stupid implementations is the wrong
> approach. There is no technial measure against stupidity. If that would
> exist the world would be a much better place.
> 
> You really have to think about the problem space you are working
> on. There are problems which need a 'get it right at the first shot'
> solution because they create user space ABI or otheer hard to fix
> dependencies.
> 
> That's absolutely not the case here.
> 
> Get the basic simple support correct and work from there. Trying to
> solve all possible theoretical problems upfront is simply not possible
> and a guarantee for not making progress.
> 
> "Keep it simple" and "correctness first" are still the best working
> engineering principles.
> 
> They do not prevent us from revisiting this _if_ there is a real world
> problem which makes enough sense to implement a finer grained solution.

Sure. Will free the PASID in destroy_context() on mm exit and won't track
the PASID usage per task. The code will be simpler and clearer.

Thank you very much for your insight!

-Fenghua

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-27 23:51       ` Dave Hansen
@ 2021-09-28 18:50         ` Luck, Tony
  2021-09-28 19:19           ` Dave Hansen
  0 siblings, 1 reply; 77+ messages in thread
From: Luck, Tony @ 2021-09-28 18:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Fenghua Yu, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra (Intel),
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Sep 27, 2021 at 04:51:25PM -0700, Dave Hansen wrote:
> That's close to what we want.
> 
> The size should probably be implicit.  If it isn't implicit, it needs to
> at least be double-checked against the state sizes.
> 
> Not to get too fancy, but I think we also want to have a "replace"
> operation which is separate from the "update".  Think of a case where
> you are trying to set a bit:
> 
> 	struct pkru_state *pk = start_update_xstate(tsk, XSTATE_PKRU);
> 	pk->pkru |= 0x100;
> 	finish_update_xstate(tsk, XSTATE_PKRU, pk);
> 
> versus setting a whole new value:
> 
> 	struct pkru_state *pk = start_replace_xstate(tsk, XSTATE_PKRU);
> 	memset(pkru, sizeof(*pk), 0);
> 	pk->pkru = 0x1234;
> 	finish_replace_xstate(tsk, XSTATE_PKRU, pk);
> 
> They look similar, but they actually might have very different costs for
> big states like AMX.  We can also do some very different debugging for
> them.  In normal operation, these _can_ just return pointers directly to
> the fpu...->xstate in some case.  But, if we're debugging, we could
> kmalloc() a buffer and do sanity checking before it goes into the task
> buffer.
> 
> We don't have to do any of that fancy stuff now.  We don't even need to
> do an "update" if all we need for now for XFEATURE_PASID is a "replace".
> 
> 1. Hide whether we need to write to real registers
> 2. Hide whether we need to update the in-memory image
> 3. Hide other FPU infrastructure like the TIF flag.
> 4. Make the users deal with a *whole* state in the replace API

Is that difference just whether you need to save the
state from registers to memory (for the "update" case)
or not (for the "replace" case ... where you can ignore
the current register, overwrite the whole per-feature
xsave area and mark it to be restored to registers).

If so, just a "bool full" argument might do the trick?

Also - you have a "tsk" argument in your pseudo code. Is
this needed? Are there places where we need to perform
these operations on something other than "current"?

pseudo-code:

void *begin_update_one_xsave_feature(enum xfeature xfeature, bool full)
{
	void *addr;

	BUG_ON(!(xsave->header.xcomp_bv & xfeature));

	addr = __raw_xsave_addr(xsave, xfeature);

	fpregs_lock();

	if (full)
		return addr;

	if (xfeature registers are "live")
		xsaves(xstate, 1 << xfeature);

	return addr;
}

void finish_update_one_xsave_feature(enum xfeature xfeature)
{
	mark feature modified
	set TIF bit
	fpregs_unlock();
}

-Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-28 18:50         ` Luck, Tony
@ 2021-09-28 19:19           ` Dave Hansen
  2021-09-28 20:28             ` Luck, Tony
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Hansen @ 2021-09-28 19:19 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andy Lutomirski, Fenghua Yu, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra (Intel),
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

On 9/28/21 11:50 AM, Luck, Tony wrote:
> On Mon, Sep 27, 2021 at 04:51:25PM -0700, Dave Hansen wrote:
...
>> 1. Hide whether we need to write to real registers
>> 2. Hide whether we need to update the in-memory image
>> 3. Hide other FPU infrastructure like the TIF flag.
>> 4. Make the users deal with a *whole* state in the replace API
> 
> Is that difference just whether you need to save the
> state from registers to memory (for the "update" case)
> or not (for the "replace" case ... where you can ignore
> the current register, overwrite the whole per-feature
> xsave area and mark it to be restored to registers).
> 
> If so, just a "bool full" argument might do the trick?

I want to be able to hide the complexity of where the old state comes
from.  It might be in registers or it might be in memory or it might be
*neither*.  It's possible we're running with stale register state and a
current->...->xsave buffer that has XFEATURES&XFEATURE_FOO 0.

In that case, the "old" copy might be memcpy'd out of the init_task.
Or, for pkeys, we might build it ourselves with init_pkru_val.

> Also - you have a "tsk" argument in your pseudo code. Is
> this needed? Are there places where we need to perform
> these operations on something other than "current"?

Two cases come to mind:
1. Fork/clone where we are doing things to our child's XSAVE buffer
2. ptrace() where we are poking into another task's state

ptrace() goes for the *whole* buffer now.  I'm not sure it would need
this per-feature API.  I just call it out as something that we might
need in the future.

> pseudo-code:
> 
> void *begin_update_one_xsave_feature(enum xfeature xfeature, bool full)
> {
> 	void *addr;
> 
> 	BUG_ON(!(xsave->header.xcomp_bv & xfeature));
> 
> 	addr = __raw_xsave_addr(xsave, xfeature);
> 
> 	fpregs_lock();
> 
> 	if (full)
> 		return addr;

If the feature is marked as in the init state in the buffer
(XSTATE_BV[feature]==0), this addr *could* contain total garbage.  So,
we'd want to make sure that the memory contents have the init state
written before handing them back to the caller.  That's not strictly
required if the user is writing the whole thing, but it's the nice thing
to do.

> 	if (xfeature registers are "live")
> 		xsaves(xstate, 1 << xfeature);

One little note: I don't think we would necessarily need to do an XSAVES
here.  For PKRU, for instance, we could just do a rdpkru.

> 	return addr;
> }
> 
> void finish_update_one_xsave_feature(enum xfeature xfeature)
> {
> 	mark feature modified

I think we'd want to do this at the "begin" time.  Also, do you mean we
should set XSTATE_BV[feature]?

> 	set TIF bit

Since the XSAVE buffer was updated, it now contains the canonical FPU
state.  It may have diverged from the register state, thus we need to
set TIF_NEED_FPU_LOAD.

It's also worth noting that we *could*:

	xrstors(xstate, 1<<xfeature);

as well.  That would bring the registers back up to day and we could
keep TIF_NEED_FPU_LOAD==0.

> 	fpregs_unlock();
> }
> 
> -Tony
> 


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-28 19:19           ` Dave Hansen
@ 2021-09-28 20:28             ` Luck, Tony
  2021-09-28 20:55               ` Dave Hansen
  0 siblings, 1 reply; 77+ messages in thread
From: Luck, Tony @ 2021-09-28 20:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Fenghua Yu, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra (Intel),
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Tue, Sep 28, 2021 at 12:19:22PM -0700, Dave Hansen wrote:
> On 9/28/21 11:50 AM, Luck, Tony wrote:
> > On Mon, Sep 27, 2021 at 04:51:25PM -0700, Dave Hansen wrote:
> ...
> >> 1. Hide whether we need to write to real registers
> >> 2. Hide whether we need to update the in-memory image
> >> 3. Hide other FPU infrastructure like the TIF flag.
> >> 4. Make the users deal with a *whole* state in the replace API
> > 
> > Is that difference just whether you need to save the
> > state from registers to memory (for the "update" case)
> > or not (for the "replace" case ... where you can ignore
> > the current register, overwrite the whole per-feature
> > xsave area and mark it to be restored to registers).
> > 
> > If so, just a "bool full" argument might do the trick?
> 
> I want to be able to hide the complexity of where the old state comes
> from.  It might be in registers or it might be in memory or it might be
> *neither*.  It's possible we're running with stale register state and a
> current->...->xsave buffer that has XFEATURES&XFEATURE_FOO 0.
> 
> In that case, the "old" copy might be memcpy'd out of the init_task.
> Or, for pkeys, we might build it ourselves with init_pkru_val.

So should there be an error case if there isn't an "old" state, and
the user calls:

	p = begin_update_one_xsave_feature(XFEATURE_something, false);

Maybe instead of an error, just fill it in with the init state for the feature?

> > Also - you have a "tsk" argument in your pseudo code. Is
> > this needed? Are there places where we need to perform
> > these operations on something other than "current"?
> 
> Two cases come to mind:
> 1. Fork/clone where we are doing things to our child's XSAVE buffer
> 2. ptrace() where we are poking into another task's state
> 
> ptrace() goes for the *whole* buffer now.  I'm not sure it would need
> this per-feature API.  I just call it out as something that we might
> need in the future.

Ok - those seem ok ... it is up to the caller to make sure that the
target task is in some "not running, and can't suddenly start running"
state before calling these functions.

> 
> > pseudo-code:
> > 
> > void *begin_update_one_xsave_feature(enum xfeature xfeature, bool full)
> > {
> > 	void *addr;
> > 
> > 	BUG_ON(!(xsave->header.xcomp_bv & xfeature));
> > 
> > 	addr = __raw_xsave_addr(xsave, xfeature);
> > 
> > 	fpregs_lock();
> > 
> > 	if (full)
> > 		return addr;
> 
> If the feature is marked as in the init state in the buffer
> (XSTATE_BV[feature]==0), this addr *could* contain total garbage.  So,
> we'd want to make sure that the memory contents have the init state
> written before handing them back to the caller.  That's not strictly
> required if the user is writing the whole thing, but it's the nice thing
> to do.

Nice guys waste CPU cycles writing to memory that is just going to get
written again.

> 
> > 	if (xfeature registers are "live")
> > 		xsaves(xstate, 1 << xfeature);
> 
> One little note: I don't think we would necessarily need to do an XSAVES
> here.  For PKRU, for instance, we could just do a rdpkru.

Like this?

	if (tsk == current) {
		switch (xfeature) {
		case XFEATURE_PKRU:
			*(u32 *)addr = rdpkru();
			break;
		case XFEATURE_PASID:
			rdmsrl(MSR_IA32_PASID, msr);
			*(u64 *)addr = msr;
			break;
		... any other "easy" states ...
		default:
			xsaves(xstate, 1 << xfeature);
			break;
		}
	}

> 
> > 	return addr;
> > }
> > 
> > void finish_update_one_xsave_feature(enum xfeature xfeature)
> > {
> > 	mark feature modified
> 
> I think we'd want to do this at the "begin" time.  Also, do you mean we
> should set XSTATE_BV[feature]?

Begin? End? It's all inside fpregs_lock(). But whatever seems best.

Yes, I think that this means set XSTATE_BV[feature] ... but I'm
relying on you as the xsave expert to help get the subtle bits right so
the Andy Lutomirski can smile at this code.

> > 	set TIF bit
> 
> Since the XSAVE buffer was updated, it now contains the canonical FPU
> state.  It may have diverged from the register state, thus we need to
> set TIF_NEED_FPU_LOAD.

Yes, that's the TIF bit my pseudo-code intended.

> It's also worth noting that we *could*:
> 
> 	xrstors(xstate, 1<<xfeature);
> 
> as well.  That would bring the registers back up to day and we could
> keep TIF_NEED_FPU_LOAD==0.

Only makes sense if "tsk == current". But does this help. The work seems
to be the same whether we do it now, or later. We don't know for sure
that we will directly return to the task. We might context switch to
another task, so loading the state into registers now would just be
wasted time.

> 
> > 	fpregs_unlock();
> > }

-Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-28 20:28             ` Luck, Tony
@ 2021-09-28 20:55               ` Dave Hansen
  2021-09-28 23:10                 ` Luck, Tony
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Hansen @ 2021-09-28 20:55 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andy Lutomirski, Fenghua Yu, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra (Intel),
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

On 9/28/21 1:28 PM, Luck, Tony wrote:
> On Tue, Sep 28, 2021 at 12:19:22PM -0700, Dave Hansen wrote:
>> On 9/28/21 11:50 AM, Luck, Tony wrote:
>>> On Mon, Sep 27, 2021 at 04:51:25PM -0700, Dave Hansen wrote:
>> ...
>>>> 1. Hide whether we need to write to real registers
>>>> 2. Hide whether we need to update the in-memory image
>>>> 3. Hide other FPU infrastructure like the TIF flag.
>>>> 4. Make the users deal with a *whole* state in the replace API
>>>
>>> Is that difference just whether you need to save the
>>> state from registers to memory (for the "update" case)
>>> or not (for the "replace" case ... where you can ignore
>>> the current register, overwrite the whole per-feature
>>> xsave area and mark it to be restored to registers).
>>>
>>> If so, just a "bool full" argument might do the trick?
>>
>> I want to be able to hide the complexity of where the old state comes
>> from.  It might be in registers or it might be in memory or it might be
>> *neither*.  It's possible we're running with stale register state and a
>> current->...->xsave buffer that has XFEATURES&XFEATURE_FOO 0.
>>
>> In that case, the "old" copy might be memcpy'd out of the init_task.
>> Or, for pkeys, we might build it ourselves with init_pkru_val.
> 
> So should there be an error case if there isn't an "old" state, and
> the user calls:
> 
> 	p = begin_update_one_xsave_feature(XFEATURE_something, false);
> 
> Maybe instead of an error, just fill it in with the init state for the feature?

Yes, please.  Let's not generate an error.  Let's populate the init
state and let them move on with life.

>>> pseudo-code:
>>>
>>> void *begin_update_one_xsave_feature(enum xfeature xfeature, bool full)
>>> {
>>> 	void *addr;
>>>
>>> 	BUG_ON(!(xsave->header.xcomp_bv & xfeature));
>>>
>>> 	addr = __raw_xsave_addr(xsave, xfeature);
>>>
>>> 	fpregs_lock();
>>>
>>> 	if (full)
>>> 		return addr;
>>
>> If the feature is marked as in the init state in the buffer
>> (XSTATE_BV[feature]==0), this addr *could* contain total garbage.  So,
>> we'd want to make sure that the memory contents have the init state
>> written before handing them back to the caller.  That's not strictly
>> required if the user is writing the whole thing, but it's the nice thing
>> to do.
> 
> Nice guys waste CPU cycles writing to memory that is just going to get
> written again.

Let's skip the "replace" operation for now and focus on "update".  A
full replace *can* be faster because it doesn't require the state to be
written out.  But, we don't need that for now.

Let's just do the "update" thing, and let's ensure that we reflect the
init state into the buffer that gets returned if the feature was in its
init state.

Sound good?

>>> 	if (xfeature registers are "live")
>>> 		xsaves(xstate, 1 << xfeature);
>>
>> One little note: I don't think we would necessarily need to do an XSAVES
>> here.  For PKRU, for instance, we could just do a rdpkru.
> 
> Like this?
> 
> 	if (tsk == current) {
> 		switch (xfeature) {
> 		case XFEATURE_PKRU:
> 			*(u32 *)addr = rdpkru();
> 			break;
> 		case XFEATURE_PASID:
> 			rdmsrl(MSR_IA32_PASID, msr);
> 			*(u64 *)addr = msr;
> 			break;
> 		... any other "easy" states ...
> 		default:
> 			xsaves(xstate, 1 << xfeature);
> 			break;
> 		}
> 	}

Yep.

>>> 	return addr;
>>> }
>>>
>>> void finish_update_one_xsave_feature(enum xfeature xfeature)
>>> {
>>> 	mark feature modified
>>
>> I think we'd want to do this at the "begin" time.  Also, do you mean we
>> should set XSTATE_BV[feature]?
> 
> Begin? End? It's all inside fpregs_lock(). But whatever seems best.

I'm actually waffling about it.

Does XSTATE_BV[feature] mean that state *is* there or that state *may*
be there?  Either way works.

>> It's also worth noting that we *could*:
>>
>> 	xrstors(xstate, 1<<xfeature);
>>
>> as well.  That would bring the registers back up to day and we could
>> keep TIF_NEED_FPU_LOAD==0.
> 
> Only makes sense if "tsk == current". But does this help. The work seems
> to be the same whether we do it now, or later. We don't know for sure
> that we will directly return to the task. We might context switch to
> another task, so loading the state into registers now would just be
> wasted time.

True, but the other side of the coin is that setting TIF_NEED_FPU_LOAD
subjects us to an XRSTOR on the way out to userspace.  That XRSTOR might
just be updating one state component in practice.

Either way, sorry for the distraction.  We (me) should really be
focusing on getting something that is functional but slow.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-28 20:55               ` Dave Hansen
@ 2021-09-28 23:10                 ` Luck, Tony
  2021-09-28 23:50                   ` Fenghua Yu
  2021-09-29 16:58                   ` Andy Lutomirski
  0 siblings, 2 replies; 77+ messages in thread
From: Luck, Tony @ 2021-09-28 23:10 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Fenghua Yu, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra (Intel),
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

Moving beyond pseudo-code and into compiles-but-probably-broken-code.


The intent of the functions below is that Fenghua should be able to
do:

void fpu__pasid_write(u32 pasid)
{
	u64 msr_val = pasid | MSR_IA32_PASID_VALID;
	struct ia32_pasid_state *addr;

	addr = begin_update_one_xsave_feature(current, XFEATURE_PASID, true);
	addr->pasid = msr_val;
	finish_update_one_xsave_feature(current);
}

So here's the two new functions that would be added to
arch/x86/kernel/fpu/xstate.c

----

void *begin_update_one_xsave_feature(struct task_struct *tsk,
                                     enum xfeature xfeature, bool full)
{
        struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
        struct xregs_state *xinit = &init_fpstate.xsave;
        u64 fmask = 1ull << xfeature;
        void *addr;

        BUG_ON(!(xsave->header.xcomp_bv & fmask));

        fpregs_lock();

        addr = __raw_xsave_addr(xsave, xfeature);

        if (full || tsk != current) {
                memcpy(addr, __raw_xsave_addr(xinit, xfeature), xstate_sizes[xfeature]);
                goto out;
        }

	/* could optimize some cases where xsaves() isn't fastest option */
        if (!(xsave->header.xfeatures & fmask))
                xsaves(xsave, fmask);

out:
        xsave->header.xfeatures |= fmask;
        return addr;
}

void finish_update_one_xsave_feature(struct task_struct *tsk)
{
        set_ti_thread_flag(task_thread_info(tsk), TIF_NEED_FPU_LOAD);
        fpregs_unlock();
}

----

-Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-28 23:10                 ` Luck, Tony
@ 2021-09-28 23:50                   ` Fenghua Yu
  2021-09-29  0:08                     ` Luck, Tony
  2021-09-29 16:58                   ` Andy Lutomirski
  1 sibling, 1 reply; 77+ messages in thread
From: Fenghua Yu @ 2021-09-28 23:50 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra (Intel),
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

Hi, Tony,

On Tue, Sep 28, 2021 at 04:10:39PM -0700, Luck, Tony wrote:
> Moving beyond pseudo-code and into compiles-but-probably-broken-code.
> 
> 
> The intent of the functions below is that Fenghua should be able to
> do:
> 
> void fpu__pasid_write(u32 pasid)
> {
> 	u64 msr_val = pasid | MSR_IA32_PASID_VALID;
> 	struct ia32_pasid_state *addr;
> 
> 	addr = begin_update_one_xsave_feature(current, XFEATURE_PASID, true);
> 	addr->pasid = msr_val;
> 	finish_update_one_xsave_feature(current);
> }
> 
> So here's the two new functions that would be added to
> arch/x86/kernel/fpu/xstate.c
> 
> ----
> 
> void *begin_update_one_xsave_feature(struct task_struct *tsk,
>                                      enum xfeature xfeature, bool full)
> {
>         struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
>         struct xregs_state *xinit = &init_fpstate.xsave;
>         u64 fmask = 1ull << xfeature;
>         void *addr;
> 
>         BUG_ON(!(xsave->header.xcomp_bv & fmask));
> 
>         fpregs_lock();
> 
>         addr = __raw_xsave_addr(xsave, xfeature);
> 
>         if (full || tsk != current) {
>                 memcpy(addr, __raw_xsave_addr(xinit, xfeature), xstate_sizes[xfeature]);
>                 goto out;
>         }
> 
> 	/* could optimize some cases where xsaves() isn't fastest option */
>         if (!(xsave->header.xfeatures & fmask))
>                 xsaves(xsave, fmask);

If xfeatures's feature bit is 0, xsaves will not write its init value to the
memory due to init optimization. So the xsaves will do nothing and the
state is not initialized and may have random data.

> 
> out:
>         xsave->header.xfeatures |= fmask;
>         return addr;
> }
> 
> void finish_update_one_xsave_feature(struct task_struct *tsk)
> {
>         set_ti_thread_flag(task_thread_info(tsk), TIF_NEED_FPU_LOAD);

Setting TIF_NEED_FPU_LOAD cannot guaranteed to execute XRSTORS on exiting
to user. In fpregs_restore_userregs():
	if (!fpregs_state_valid(fpu, cpu)) {
		...
                __restore_fpregs_from_fpstate(&fpu->state, mask);
		...
	}

fpregs state should be invalid to get the XRSTROS executed.

So setting TIF_NEED_FPU_LOAD may get the FPU register unchanged on exiting
to user.


>         fpregs_unlock();
> }

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-28 23:50                   ` Fenghua Yu
@ 2021-09-29  0:08                     ` Luck, Tony
  2021-09-29  0:26                       ` Yu, Fenghua
  2021-09-29  1:56                       ` Yu, Fenghua
  0 siblings, 2 replies; 77+ messages in thread
From: Luck, Tony @ 2021-09-29  0:08 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra (Intel),
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Tue, Sep 28, 2021 at 11:50:37PM +0000, Fenghua Yu wrote:
> If xfeatures's feature bit is 0, xsaves will not write its init value to the
> memory due to init optimization. So the xsaves will do nothing and the
> state is not initialized and may have random data.

> Setting TIF_NEED_FPU_LOAD cannot guaranteed to execute XRSTORS on exiting
> to user. In fpregs_restore_userregs():
> 	if (!fpregs_state_valid(fpu, cpu)) {
> 		...
>                 __restore_fpregs_from_fpstate(&fpu->state, mask);
> 		...
> 	}
> 
> fpregs state should be invalid to get the XRSTROS executed.
> 
> So setting TIF_NEED_FPU_LOAD may get the FPU register unchanged on exiting
> to user.

Does this help?
Changed lines marked with //<<<<<

-Tony

void *begin_update_one_xsave_feature(struct task_struct *tsk,
				     enum xfeature xfeature, bool full)
{
	struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
	struct xregs_state *xinit = &init_fpstate.xsave;
	u64 fmask = 1ull << xfeature;
	void *addr;

	BUG_ON(!(xsave->header.xcomp_bv & fmask));

	fpregs_lock();

	addr = __raw_xsave_addr(xsave, xfeature);

	if (full || tsk != current) {
		memcpy(addr, __raw_xsave_addr(xinit, xfeature), xstate_sizes[xfeature]);
		goto out;
	}

	if (!(xsave->header.xfeatures & fmask)) {
		xsave->header.xfeatures |= fmask;	//<<<<<
		xsaves(xsave, fmask);
	}

out:
	xsave->header.xfeatures |= fmask;
	return addr;
}

void finish_update_one_xsave_feature(struct task_struct *tsk)
{
	set_ti_thread_flag(task_thread_info(tsk), TIF_NEED_FPU_LOAD);
	if (tsk == current)				//<<<<<
		__cpu_invalidate_fpregs_state();	//<<<<<
	fpregs_unlock();
}

^ permalink raw reply	[flat|nested] 77+ messages in thread

* RE: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-29  0:08                     ` Luck, Tony
@ 2021-09-29  0:26                       ` Yu, Fenghua
  2021-09-29  1:06                         ` Luck, Tony
  2021-09-29  1:56                       ` Yu, Fenghua
  1 sibling, 1 reply; 77+ messages in thread
From: Yu, Fenghua @ 2021-09-29  0:26 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Hansen, Dave, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra (Intel),
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Jiang, Dave, Pan,
	Jacob jun, Raj, Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

Hi, Tony,

> void *begin_update_one_xsave_feature(struct task_struct *tsk,
> 				     enum xfeature xfeature, bool full) {
> 	struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
> 	struct xregs_state *xinit = &init_fpstate.xsave;
> 	u64 fmask = 1ull << xfeature;
> 	void *addr;
> 
> 	BUG_ON(!(xsave->header.xcomp_bv & fmask));
> 
> 	fpregs_lock();

I'm afraid we may hit the same locking issue when we send IPI to notify another task to modify its
PASID state. Here the API is called to modify another running task's PASID state as well without a right lock.
fpregs_lock() is not enough to deal with this, I'm afraid.

Quote from Thomas in https://lore.kernel.org/linux-iommu/87mtsd6gr9.ffs@nanos.tec.linutronix.de/
"FPU state of a running task is protected by fregs_lock() which is
   nothing else than a local_bh_disable(). As BH disabled regions run
   usually with interrupts enabled the IPI can hit a code section which
   modifies FPU state and there is absolutely no guarantee that any of the
   assumptions which are made for the IPI case is true."

Maybe restrict the API's scope: don't modify another task's FPU state, just the current task's state?

> 	addr = __raw_xsave_addr(xsave, xfeature);
> 
> 	if (full || tsk != current) {
> 		memcpy(addr, __raw_xsave_addr(xinit, xfeature),
> xstate_sizes[xfeature]);
> 		goto out;
> 	}
> 
> 	if (!(xsave->header.xfeatures & fmask)) {
> 		xsave->header.xfeatures |= fmask;	//<<<<<
> 		xsaves(xsave, fmask);
> 	}
> 
> out:
> 	xsave->header.xfeatures |= fmask;
> 	return addr;
> }
> 
> void finish_update_one_xsave_feature(struct task_struct *tsk) {
> 	set_ti_thread_flag(task_thread_info(tsk), TIF_NEED_FPU_LOAD);
> 	if (tsk == current)				//<<<<<
> 		__cpu_invalidate_fpregs_state();	//<<<<<
> 	fpregs_unlock();
> }

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 77+ messages in thread

* RE: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-29  0:26                       ` Yu, Fenghua
@ 2021-09-29  1:06                         ` Luck, Tony
  2021-09-29  1:16                           ` Fenghua Yu
  0 siblings, 1 reply; 77+ messages in thread
From: Luck, Tony @ 2021-09-29  1:06 UTC (permalink / raw)
  To: Yu, Fenghua
  Cc: Hansen, Dave, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra (Intel),
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Jiang, Dave, Pan,
	Jacob jun, Raj, Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

>> 	fpregs_lock();
>
> I'm afraid we may hit the same locking issue when we send IPI to notify another task to modify its
> PASID state. Here the API is called to modify another running task's PASID state as well without a right lock.
> fpregs_lock() is not enough to deal with this, I'm afraid.

We don't send IPI any more to change PASID state. The only place that the
current patch series touches the PASID MSR is in the #GP fault handler.

-Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-29  1:06                         ` Luck, Tony
@ 2021-09-29  1:16                           ` Fenghua Yu
  2021-09-29  2:11                             ` Luck, Tony
  0 siblings, 1 reply; 77+ messages in thread
From: Fenghua Yu @ 2021-09-29  1:16 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Hansen, Dave, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra (Intel),
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Jiang, Dave, Pan,
	Jacob jun, Raj, Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

Hi, Tony,

On Tue, Sep 28, 2021 at 06:06:52PM -0700, Luck, Tony wrote:
> >> 	fpregs_lock();
> >
> > I'm afraid we may hit the same locking issue when we send IPI to notify another task to modify its
> > PASID state. Here the API is called to modify another running task's PASID state as well without a right lock.
> > fpregs_lock() is not enough to deal with this, I'm afraid.
> 
> We don't send IPI any more to change PASID state. The only place that the
> current patch series touches the PASID MSR is in the #GP fault handler.

It's safe for the helpers to handle the PASID case (modifying the current task's
PASID state in #GP).

But the helpers seem to be generic. They take "task" as a parameter and
handle the task as non-current case. So the helpers are not for PASID only.
They may be used by others to modify a running task's FPU state. But
It's not safe to do so.

At least need some comments/restriction for the helpers to be used on
a running task?

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 77+ messages in thread

* RE: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-29  0:08                     ` Luck, Tony
  2021-09-29  0:26                       ` Yu, Fenghua
@ 2021-09-29  1:56                       ` Yu, Fenghua
  2021-09-29  2:15                         ` Luck, Tony
  1 sibling, 1 reply; 77+ messages in thread
From: Yu, Fenghua @ 2021-09-29  1:56 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Hansen, Dave, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra (Intel),
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Jiang, Dave, Pan,
	Jacob jun, Raj, Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

Hi, Tony,

> void *begin_update_one_xsave_feature(struct task_struct *tsk,
> 				     enum xfeature xfeature, bool full) {
> 	struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
> 	struct xregs_state *xinit = &init_fpstate.xsave;
> 	u64 fmask = 1ull << xfeature;
> 	void *addr;
> 
> 	BUG_ON(!(xsave->header.xcomp_bv & fmask));
> 
> 	fpregs_lock();
> 
> 	addr = __raw_xsave_addr(xsave, xfeature);
> 
> 	if (full || tsk != current) {
> 		memcpy(addr, __raw_xsave_addr(xinit, xfeature),
> xstate_sizes[xfeature]);
> 		goto out;
> 	}
> 
> 	if (!(xsave->header.xfeatures & fmask)) {
> 		xsave->header.xfeatures |= fmask;	//<<<<<
> 		xsaves(xsave, fmask);
> 	}

I'm not sure why the FPU state is initialized here.

For updating the PASID state, it's unnecessary to init the PASID state.

Maybe it is necessary in other cases?

> 
> out:
> 	xsave->header.xfeatures |= fmask;

Setting the xfeatures bit plus updating the PASID state is enough
to restore the PASID state to the IA32_PASID MSR.

> 	return addr;
> }
> 
> void finish_update_one_xsave_feature(struct task_struct *tsk) {
> 	set_ti_thread_flag(task_thread_info(tsk), TIF_NEED_FPU_LOAD);
> 	if (tsk == current)				//<<<<<
> 		__cpu_invalidate_fpregs_state();	//<<<<<
> 	fpregs_unlock();
> }

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 77+ messages in thread

* RE: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-29  1:16                           ` Fenghua Yu
@ 2021-09-29  2:11                             ` Luck, Tony
  0 siblings, 0 replies; 77+ messages in thread
From: Luck, Tony @ 2021-09-29  2:11 UTC (permalink / raw)
  To: Yu, Fenghua
  Cc: Hansen, Dave, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra (Intel),
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Jiang, Dave, Pan,
	Jacob jun, Raj, Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

> But the helpers seem to be generic. They take "task" as a parameter and
> handle the task as non-current case. So the helpers are not for PASID only.
> They may be used by others to modify a running task's FPU state. But
> It's not safe to do so.
>
> At least need some comments/restriction for the helpers to be used on
> a running task?

Fenghua,

Correct. When I add some comments I'll make it very clear that it is the
responsibility of the caller to make sure that the task that is targeted
cannot run.

Earlier in this thread Dave suggested there are two cases where these
helpers might be useful:

1) Fork/clone - to set up some xsave state in the child ... but this would be
done before the child is allowed to run.

2) ptrace - this is a "maybe" because ptrace already has code to handle all
the xsave state as a single entity. Perhaps someone might want to change it
to only modify a single feature ... but this seems unlikely. In any case the
ptrace code already "stops" the target process while it is reading/writing state.

-Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* RE: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-29  1:56                       ` Yu, Fenghua
@ 2021-09-29  2:15                         ` Luck, Tony
  0 siblings, 0 replies; 77+ messages in thread
From: Luck, Tony @ 2021-09-29  2:15 UTC (permalink / raw)
  To: Yu, Fenghua
  Cc: Hansen, Dave, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra (Intel),
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Jiang, Dave, Pan,
	Jacob jun, Raj, Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

>> 	if (!(xsave->header.xfeatures & fmask)) {
>> 		xsave->header.xfeatures |= fmask;	//<<<<<
>> 		xsaves(xsave, fmask);
>> 	}
>
> I'm not sure why the FPU state is initialized here.
>
> For updating the PASID state, it's unnecessary to init the PASID state.
>
> Maybe it is necessary in other cases?

Dave had suggested initializing feature state when it is unknown (could
be garbage).  This is my attempt to follow that guidance. I'm not confident
that my tests for "is the state in registers, in memory, or is garbage"
really capture all the cases.

-Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 1/8] iommu/vt-d: Clean up unused PASID updating functions
  2021-09-20 19:23 ` [PATCH 1/8] iommu/vt-d: Clean up unused PASID updating functions Fenghua Yu
@ 2021-09-29  7:34   ` Lu Baolu
  2021-09-30  0:40     ` Fenghua Yu
  0 siblings, 1 reply; 77+ messages in thread
From: Lu Baolu @ 2021-09-29  7:34 UTC (permalink / raw)
  To: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Andy Lutomirski, Dave Hansen, Tony Luck,
	Joerg Roedel, Josh Poimboeuf, Dave Jiang, Jacob Jun Pan,
	Ashok Raj, Ravi V Shankar
  Cc: baolu.lu, iommu, x86, linux-kernel

Hi Fenghua,

On 2021/9/21 3:23, Fenghua Yu wrote:
> update_pasid() and its call chain are currently unused in the tree because
> Thomas disabled the ENQCMD feature. The feature will be re-enabled shortly
> using a different approach and update_pasid() and its call chain will not
> be used in the new approach.
> 
> Remove the useless functions.
> 
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>

Thanks for this cleanup. I have queued it for v5.16.

Best regards,
baolu

> ---
>   arch/x86/include/asm/fpu/api.h |  2 --
>   drivers/iommu/intel/svm.c      | 24 +-----------------------
>   2 files changed, 1 insertion(+), 25 deletions(-)
> 
> diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
> index 23bef08a8388..ca4d0dee1ecd 100644
> --- a/arch/x86/include/asm/fpu/api.h
> +++ b/arch/x86/include/asm/fpu/api.h
> @@ -106,6 +106,4 @@ extern int cpu_has_xfeatures(u64 xfeatures_mask, const char **feature_name);
>    */
>   #define PASID_DISABLED	0
>   
> -static inline void update_pasid(void) { }
> -
>   #endif /* _ASM_X86_FPU_API_H */
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index 0c228787704f..5b5d69b04fcc 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -505,21 +505,6 @@ int intel_svm_unbind_gpasid(struct device *dev, u32 pasid)
>   	return ret;
>   }
>   
> -static void _load_pasid(void *unused)
> -{
> -	update_pasid();
> -}
> -
> -static void load_pasid(struct mm_struct *mm, u32 pasid)
> -{
> -	mutex_lock(&mm->context.lock);
> -
> -	/* Update PASID MSR on all CPUs running the mm's tasks. */
> -	on_each_cpu_mask(mm_cpumask(mm), _load_pasid, NULL, true);
> -
> -	mutex_unlock(&mm->context.lock);
> -}
> -
>   static int intel_svm_alloc_pasid(struct device *dev, struct mm_struct *mm,
>   				 unsigned int flags)
>   {
> @@ -614,10 +599,6 @@ static struct iommu_sva *intel_svm_bind_mm(struct intel_iommu *iommu,
>   	if (ret)
>   		goto free_sdev;
>   
> -	/* The newly allocated pasid is loaded to the mm. */
> -	if (!(flags & SVM_FLAG_SUPERVISOR_MODE) && list_empty(&svm->devs))
> -		load_pasid(mm, svm->pasid);
> -
>   	list_add_rcu(&sdev->list, &svm->devs);
>   success:
>   	return &sdev->sva;
> @@ -670,11 +651,8 @@ static int intel_svm_unbind_mm(struct device *dev, u32 pasid)
>   			kfree_rcu(sdev, rcu);
>   
>   			if (list_empty(&svm->devs)) {
> -				if (svm->notifier.ops) {
> +				if (svm->notifier.ops)
>   					mmu_notifier_unregister(&svm->notifier, mm);
> -					/* Clear mm's pasid. */
> -					load_pasid(mm, PASID_DISABLED);
> -				}
>   				pasid_private_remove(svm->pasid);
>   				/* We mandate that no page faults may be outstanding
>   				 * for the PASID when intel_svm_unbind_mm() is called.
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-24 15:39         ` Luck, Tony
@ 2021-09-29  9:00           ` Peter Zijlstra
  0 siblings, 0 replies; 77+ messages in thread
From: Peter Zijlstra @ 2021-09-29  9:00 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski, Dave Hansen, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar, iommu, x86, linux-kernel

On Fri, Sep 24, 2021 at 08:39:24AM -0700, Luck, Tony wrote:

> If you have ctags installed then a ctrl-] on that
> __fixup_pasid_exception() will take you to the function with
> the comments. No electron microscope needed.

I to use ctags, but when reading the #GP handler, this is a whole
different file. Also, I don't find any of those comments explaining the
not-our-#GP-but-harmless-cycle issue.

The current->has_valid_pasid one comes close, but just misses it. But
really the place to put this is in the #GP handler itself so we don't
have to dig through every call there to figure out how it's supposed to
work.

> +
> +/*
> + * Try to figure out if there is a PASID MSR value to propagate to the
> + * thread taking the #GP.
> + */
> +bool __fixup_pasid_exception(void)
> +{
> +       u32 pasid;
> +
> +       /*
> +        * This function is called only when this #GP was triggered from user
> +        * space. So the mm cannot be NULL.
> +        */
> +       pasid = current->mm->pasid;
> +
> +       /* If no PASID is allocated, there is nothing to propagate. */
> +       if (pasid == PASID_DISABLED)
> +               return false;
> +
> +       /*
> +        * If the current task already has a valid PASID MSR, then the #GP
> +        * fault must be for some non-ENQCMD related reason.
> +        */
> +       if (current->has_valid_pasid)
> +               return false;
> +
> +       /* Fix up the MSR by the PASID in the mm. */
> +       fpu__pasid_write(pasid);
> +       current->has_valid_pasid = 1;
> +
> +       return true;
> +}
> 
> -Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-24 23:03             ` Andy Lutomirski
  2021-09-24 23:11               ` Luck, Tony
@ 2021-09-29  9:54               ` Peter Zijlstra
  2021-09-29 12:28                 ` Thomas Gleixner
  1 sibling, 1 reply; 77+ messages in thread
From: Peter Zijlstra @ 2021-09-29  9:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tony Luck, Thomas Gleixner, Fenghua Yu, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Raj Ashok, Shankar,
	Ravi V, iommu, the arch/x86 maintainers,
	Linux Kernel Mailing List

On Fri, Sep 24, 2021 at 04:03:53PM -0700, Andy Lutomirski wrote:
> I think the perfect and the good are a bit confused here. If we go for
> "good", then we have an mm owning a PASID for its entire lifetime.  If
> we want "perfect", then we should actually do it right: teach the
> kernel to update an entire mm's PASID setting all at once.  This isn't
> *that* hard -- it involves two things:
> 
> 1. The context switch code needs to resync PASID.  Unfortunately, this
> adds some overhead to every context switch, although a static_branch
> could minimize it for non-PASID users.

> 2. A change to an mm's PASID needs to sent an IPI, but that IPI can't
> touch FPU state.  So instead the IPI should use task_work_add() to
> make sure PASID gets resynced.

What do we need 1 for? Any PASID change can be achieved using 2 no?

Basically, call task_work_add() on all relevant tasks [1], then IPI
spray the current running of those and presto.

[1] it is nigh on impossible to find all tasks sharing an mm in any sane
way due to CLONE_MM && !CLONE_THREAD.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-29  9:54               ` Peter Zijlstra
@ 2021-09-29 12:28                 ` Thomas Gleixner
  2021-09-29 16:51                   ` Luck, Tony
  2021-09-29 16:59                   ` Andy Lutomirski
  0 siblings, 2 replies; 77+ messages in thread
From: Thomas Gleixner @ 2021-09-29 12:28 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Tony Luck, Fenghua Yu, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Wed, Sep 29 2021 at 11:54, Peter Zijlstra wrote:
> On Fri, Sep 24, 2021 at 04:03:53PM -0700, Andy Lutomirski wrote:
>> I think the perfect and the good are a bit confused here. If we go for
>> "good", then we have an mm owning a PASID for its entire lifetime.  If
>> we want "perfect", then we should actually do it right: teach the
>> kernel to update an entire mm's PASID setting all at once.  This isn't
>> *that* hard -- it involves two things:
>> 
>> 1. The context switch code needs to resync PASID.  Unfortunately, this
>> adds some overhead to every context switch, although a static_branch
>> could minimize it for non-PASID users.
>
>> 2. A change to an mm's PASID needs to sent an IPI, but that IPI can't
>> touch FPU state.  So instead the IPI should use task_work_add() to
>> make sure PASID gets resynced.
>
> What do we need 1 for? Any PASID change can be achieved using 2 no?
>
> Basically, call task_work_add() on all relevant tasks [1], then IPI
> spray the current running of those and presto.
>
> [1] it is nigh on impossible to find all tasks sharing an mm in any sane
> way due to CLONE_MM && !CLONE_THREAD.

Why would we want any of that at all?

  Process starts, no PASID assigned.

  bind to device -> PASID is allocated and assigned to the mm

  some task of the process issues ENQCMD -> #GP -> write PASID MSR

  After that the PASID is saved and restored as part of the XSTATE and
  there is no extra overhead in context switch or return to user space.

  All tasks of the process which did never use ENQCMD don't care and their
  PASID xstate is in init state.

There is absolutely no point in enforcing that all tasks of the process
have the PASID activated immediately when it is assigned. If they need
it they get it via the #GP fixup and everything just works.

Looking at that patch again, none of this muck in fpu__pasid_write() is
required at all. The whole exception fixup is:

    if (!user_mode(regs))
             return false;

    if (!current->mm->pasid)
             return false;

    if (current->pasid_activated)
    	     return false;

    wrmsrl(MSR_IA32_PASID, current->mm->pasid);
    current->pasid_activated = true;
    return true;

There is zero requirement to look at TIF_NEED_FPU_LOAD or
fpregs_state_valid() simply because the #GP comes straight from user
space which means the FPU registers contain the current tasks user space
state.

If TIF_NEED_FPU_LOAD would be set or fpregs_state_valid() would be false
after the user_mode() check then this would simply be a bug somewhere
else and has nothing to do with this PASID fixup.

So no need for magic update_one_xstate_feature() wrappers, no
concurrency concerns, nothing.

It's that simple, really. Anything more complex is just a purely
academic exercise which creates more problems than it solves.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 77+ messages in thread

* RE: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-29 12:28                 ` Thomas Gleixner
@ 2021-09-29 16:51                   ` Luck, Tony
  2021-09-29 17:07                     ` Fenghua Yu
  2021-09-29 16:59                   ` Andy Lutomirski
  1 sibling, 1 reply; 77+ messages in thread
From: Luck, Tony @ 2021-09-29 16:51 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra, Andy Lutomirski
  Cc: Yu, Fenghua, Ingo Molnar, Borislav Petkov, Hansen, Dave,
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Jiang, Dave, Pan,
	Jacob jun, Raj, Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

> There is zero requirement to look at TIF_NEED_FPU_LOAD or
> fpregs_state_valid() simply because the #GP comes straight from user
> space which means the FPU registers contain the current tasks user space
> state.

Just to double confirm ... there is no point in the #GP handler up to this point
where pre-emption can occur?

-Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-28 23:10                 ` Luck, Tony
  2021-09-28 23:50                   ` Fenghua Yu
@ 2021-09-29 16:58                   ` Andy Lutomirski
  2021-09-29 17:07                     ` Luck, Tony
  1 sibling, 1 reply; 77+ messages in thread
From: Andy Lutomirski @ 2021-09-29 16:58 UTC (permalink / raw)
  To: Luck, Tony, Dave Hansen
  Cc: Fenghua Yu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra (Intel),
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

On 9/28/21 16:10, Luck, Tony wrote:
> Moving beyond pseudo-code and into compiles-but-probably-broken-code.
> 
> 
> The intent of the functions below is that Fenghua should be able to
> do:
> 
> void fpu__pasid_write(u32 pasid)
> {
> 	u64 msr_val = pasid | MSR_IA32_PASID_VALID;
> 	struct ia32_pasid_state *addr;
> 
> 	addr = begin_update_one_xsave_feature(current, XFEATURE_PASID, true);
> 	addr->pasid = msr_val;
> 	finish_update_one_xsave_feature(current);
> }
> 

This gets gnarly because we would presumably like to optimize the case 
where we can do the update directly in registers.  I wonder if we can do 
it with a bit of macro magic in a somewhat generic way:

typedef fpu__pasid_type u32;

static inline void fpu__set_pasid_in_register(const u32 *value)
{
	wrmsr(...);
}

#define DEFINE_FPU_HELPER(name) \
static inline void fpu__set_##name(const fpu__##name##_type *val) \
{ \
	fpregs_lock(); \
	if (should write in memory) { \
		->xfeatures |= XFEATURE_##name; \
		ptr = get_xsave_addr(...); \
		memcpy(ptr, val, sizeof(*val)); \
		__fpu_invalidate_fpregs_state(...); \
	} else { \
		fpu__set_##name##_in_register(val); \
	} \
}

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-29 12:28                 ` Thomas Gleixner
  2021-09-29 16:51                   ` Luck, Tony
@ 2021-09-29 16:59                   ` Andy Lutomirski
  2021-09-29 17:15                     ` Thomas Gleixner
  1 sibling, 1 reply; 77+ messages in thread
From: Andy Lutomirski @ 2021-09-29 16:59 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra
  Cc: Tony Luck, Fenghua Yu, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

On 9/29/21 05:28, Thomas Gleixner wrote:
> On Wed, Sep 29 2021 at 11:54, Peter Zijlstra wrote:
>> On Fri, Sep 24, 2021 at 04:03:53PM -0700, Andy Lutomirski wrote:
>>> I think the perfect and the good are a bit confused here. If we go for
>>> "good", then we have an mm owning a PASID for its entire lifetime.  If
>>> we want "perfect", then we should actually do it right: teach the
>>> kernel to update an entire mm's PASID setting all at once.  This isn't
>>> *that* hard -- it involves two things:
>>>
>>> 1. The context switch code needs to resync PASID.  Unfortunately, this
>>> adds some overhead to every context switch, although a static_branch
>>> could minimize it for non-PASID users.
>>
>>> 2. A change to an mm's PASID needs to sent an IPI, but that IPI can't
>>> touch FPU state.  So instead the IPI should use task_work_add() to
>>> make sure PASID gets resynced.
>>
>> What do we need 1 for? Any PASID change can be achieved using 2 no?
>>
>> Basically, call task_work_add() on all relevant tasks [1], then IPI
>> spray the current running of those and presto.
>>
>> [1] it is nigh on impossible to find all tasks sharing an mm in any sane
>> way due to CLONE_MM && !CLONE_THREAD.
> 
> Why would we want any of that at all?
> 
>    Process starts, no PASID assigned.
> 
>    bind to device -> PASID is allocated and assigned to the mm
> 
>    some task of the process issues ENQCMD -> #GP -> write PASID MSR
> 
>    After that the PASID is saved and restored as part of the XSTATE and
>    there is no extra overhead in context switch or return to user space.
> 
>    All tasks of the process which did never use ENQCMD don't care and their
>    PASID xstate is in init state.
> 
> There is absolutely no point in enforcing that all tasks of the process
> have the PASID activated immediately when it is assigned. If they need
> it they get it via the #GP fixup and everything just works.
> 
> Looking at that patch again, none of this muck in fpu__pasid_write() is
> required at all. The whole exception fixup is:
> 
>      if (!user_mode(regs))
>               return false;
> 
>      if (!current->mm->pasid)
>               return false;
> 
>      if (current->pasid_activated)
>      	     return false;

<-- preemption or BH here: kaboom.

> 
>      wrmsrl(MSR_IA32_PASID, current->mm->pasid);

This needs the actual sane fpstate writing helper -- see other email.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-29 16:51                   ` Luck, Tony
@ 2021-09-29 17:07                     ` Fenghua Yu
  0 siblings, 0 replies; 77+ messages in thread
From: Fenghua Yu @ 2021-09-29 17:07 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Thomas Gleixner, Peter Zijlstra, Andy Lutomirski, Ingo Molnar,
	Borislav Petkov, Hansen, Dave, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Jiang, Dave, Pan, Jacob jun, Raj, Ashok, Shankar,
	Ravi V, iommu, the arch/x86 maintainers,
	Linux Kernel Mailing List

Hi, Thomas,

On Wed, Sep 29, 2021 at 09:51:15AM -0700, Luck, Tony wrote:
> > There is zero requirement to look at TIF_NEED_FPU_LOAD or
> > fpregs_state_valid() simply because the #GP comes straight from user
> > space which means the FPU registers contain the current tasks user space
> > state.
> 
> Just to double confirm ... there is no point in the #GP handler up to this point
> where pre-emption can occur?

Same question here. The fixup function is called after cond_local_irq_enable().
If an interrupt comes before fixup_pasid_exception(), the interrupt may
use FPU and call kernel_fpu_begin_mask()->set(TIF_NEED_FPU_LOAD)->
__cpu_invalidate_fpregs_state(). Then writing to the IA32_PASID MSR. When
exiting to user, the FPU states will be restored to the FPU regs including
the IA32_PASID MSR. So the MSR could be different from the value written in
fixup_pasid_execption(). Is it possible?

Or should fixup_pasid_exception() be called before cond_local_irq_enable()?

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-29 16:58                   ` Andy Lutomirski
@ 2021-09-29 17:07                     ` Luck, Tony
  2021-09-29 17:48                       ` Andy Lutomirski
  0 siblings, 1 reply; 77+ messages in thread
From: Luck, Tony @ 2021-09-29 17:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Fenghua Yu, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra (Intel),
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Wed, Sep 29, 2021 at 09:58:22AM -0700, Andy Lutomirski wrote:
> On 9/28/21 16:10, Luck, Tony wrote:
> > Moving beyond pseudo-code and into compiles-but-probably-broken-code.
> > 
> > 
> > The intent of the functions below is that Fenghua should be able to
> > do:
> > 
> > void fpu__pasid_write(u32 pasid)
> > {
> > 	u64 msr_val = pasid | MSR_IA32_PASID_VALID;
> > 	struct ia32_pasid_state *addr;
> > 
> > 	addr = begin_update_one_xsave_feature(current, XFEATURE_PASID, true);
> > 	addr->pasid = msr_val;
> > 	finish_update_one_xsave_feature(current);
> > }
> > 
> 
> This gets gnarly because we would presumably like to optimize the case where
> we can do the update directly in registers.  I wonder if we can do it with a
> bit of macro magic in a somewhat generic way:

Can we defere the optimizations until there is a use case that
cares? The existing use case (fixing up the #GP fault by setting
the PASID MSR) isn't performance critical.

Let's just have something that is clear (or as clear as any xsave
code can be) and correct.

-Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-29 16:59                   ` Andy Lutomirski
@ 2021-09-29 17:15                     ` Thomas Gleixner
  2021-09-29 17:41                       ` Luck, Tony
  0 siblings, 1 reply; 77+ messages in thread
From: Thomas Gleixner @ 2021-09-29 17:15 UTC (permalink / raw)
  To: Andy Lutomirski, Peter Zijlstra
  Cc: Tony Luck, Fenghua Yu, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Wed, Sep 29 2021 at 09:59, Andy Lutomirski wrote:
> On 9/29/21 05:28, Thomas Gleixner wrote:
>> Looking at that patch again, none of this muck in fpu__pasid_write() is
>> required at all. The whole exception fixup is:
>> 
>>      if (!user_mode(regs))
>>               return false;
>> 
>>      if (!current->mm->pasid)
>>               return false;
>> 
>>      if (current->pasid_activated)
>>      	     return false;
>
> <-- preemption or BH here: kaboom.

Sigh, this had obviously to run in the early portion of #GP, i.e. before
enabling interrupts.

For me that's more than obvious and I apologize that I failed to mention
it.

>> 
>>      wrmsrl(MSR_IA32_PASID, current->mm->pasid);
>
> This needs the actual sane fpstate writing helper -- see other email.

And with that there is no fpstate write helper required at all.

Stop this overengineering madness already.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-29 17:15                     ` Thomas Gleixner
@ 2021-09-29 17:41                       ` Luck, Tony
  2021-09-29 17:46                         ` Andy Lutomirski
  2021-09-29 18:07                         ` Fenghua Yu
  0 siblings, 2 replies; 77+ messages in thread
From: Luck, Tony @ 2021-09-29 17:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Peter Zijlstra, Fenghua Yu, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Raj Ashok, Shankar,
	Ravi V, iommu, the arch/x86 maintainers,
	Linux Kernel Mailing List

On Wed, Sep 29, 2021 at 07:15:53PM +0200, Thomas Gleixner wrote:
> On Wed, Sep 29 2021 at 09:59, Andy Lutomirski wrote:
> > On 9/29/21 05:28, Thomas Gleixner wrote:
> >> Looking at that patch again, none of this muck in fpu__pasid_write() is
> >> required at all. The whole exception fixup is:
> >> 
> >>      if (!user_mode(regs))
> >>               return false;
> >> 
> >>      if (!current->mm->pasid)
> >>               return false;
> >> 
> >>      if (current->pasid_activated)
> >>      	     return false;
> >
> > <-- preemption or BH here: kaboom.
> 
> Sigh, this had obviously to run in the early portion of #GP, i.e. before
> enabling interrupts.

Like this? Obviously with some comment about why this is being done.

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..a848a59291e7 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -536,6 +536,12 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	unsigned long gp_addr;
 	int ret;
 
+	if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) {
+		current->pasid_activated = 1;
+		wrmsrl(MSR_IA32_PASID, current->mm->pasid | MSR_IA32_PASID_VALID);
+		return;
+	}
+
 	cond_local_irq_enable(regs);
 
 	if (static_cpu_has(X86_FEATURE_UMIP)) {

-Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-29 17:41                       ` Luck, Tony
@ 2021-09-29 17:46                         ` Andy Lutomirski
  2021-09-29 18:07                         ` Fenghua Yu
  1 sibling, 0 replies; 77+ messages in thread
From: Andy Lutomirski @ 2021-09-29 17:46 UTC (permalink / raw)
  To: Tony Luck, Thomas Gleixner
  Cc: Peter Zijlstra (Intel),
	Fenghua Yu, Ingo Molnar, Borislav Petkov, Dave Hansen, Lu Baolu,
	Joerg Roedel, Josh Poimboeuf, Dave Jiang, Jacob Jun Pan,
	Raj Ashok, Shankar, Ravi V, iommu, the arch/x86 maintainers,
	Linux Kernel Mailing List



On Wed, Sep 29, 2021, at 10:41 AM, Luck, Tony wrote:
> On Wed, Sep 29, 2021 at 07:15:53PM +0200, Thomas Gleixner wrote:
>> On Wed, Sep 29 2021 at 09:59, Andy Lutomirski wrote:
>> > On 9/29/21 05:28, Thomas Gleixner wrote:
>> >> Looking at that patch again, none of this muck in fpu__pasid_write() is
>> >> required at all. The whole exception fixup is:
>> >> 
>> >>      if (!user_mode(regs))
>> >>               return false;
>> >> 
>> >>      if (!current->mm->pasid)
>> >>               return false;
>> >> 
>> >>      if (current->pasid_activated)
>> >>      	     return false;
>> >
>> > <-- preemption or BH here: kaboom.
>> 
>> Sigh, this had obviously to run in the early portion of #GP, i.e. before
>> enabling interrupts.
>
> Like this? Obviously with some comment about why this is being done.
>
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index a58800973aed..a848a59291e7 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -536,6 +536,12 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
>  	unsigned long gp_addr;
>  	int ret;
> 
> +	if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) {
> +		current->pasid_activated = 1;
> +		wrmsrl(MSR_IA32_PASID, current->mm->pasid | MSR_IA32_PASID_VALID);
> +		return;
> +	}
> +

This could do with a WARN_ON_ONCE(TIF_NEED_LOAD_FPU) imo.

Is instrumentation allowed to touch FPU state?

>  	cond_local_irq_enable(regs);
> 
>  	if (static_cpu_has(X86_FEATURE_UMIP)) {
>
> -Tony

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP
  2021-09-29 17:07                     ` Luck, Tony
@ 2021-09-29 17:48                       ` Andy Lutomirski
  0 siblings, 0 replies; 77+ messages in thread
From: Andy Lutomirski @ 2021-09-29 17:48 UTC (permalink / raw)
  To: Tony Luck
  Cc: Dave Hansen, Fenghua Yu, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra (Intel),
	Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List



On Wed, Sep 29, 2021, at 10:07 AM, Luck, Tony wrote:
> On Wed, Sep 29, 2021 at 09:58:22AM -0700, Andy Lutomirski wrote:
>> On 9/28/21 16:10, Luck, Tony wrote:
>> > Moving beyond pseudo-code and into compiles-but-probably-broken-code.
>> > 
>> > 
>> > The intent of the functions below is that Fenghua should be able to
>> > do:
>> > 
>> > void fpu__pasid_write(u32 pasid)
>> > {
>> > 	u64 msr_val = pasid | MSR_IA32_PASID_VALID;
>> > 	struct ia32_pasid_state *addr;
>> > 
>> > 	addr = begin_update_one_xsave_feature(current, XFEATURE_PASID, true);
>> > 	addr->pasid = msr_val;
>> > 	finish_update_one_xsave_feature(current);
>> > }
>> > 
>> 
>> This gets gnarly because we would presumably like to optimize the case where
>> we can do the update directly in registers.  I wonder if we can do it with a
>> bit of macro magic in a somewhat generic way:
>
> Can we defere the optimizations until there is a use case that
> cares? The existing use case (fixing up the #GP fault by setting
> the PASID MSR) isn't performance critical.
>
> Let's just have something that is clear (or as clear as any xsave
> code can be) and correct.
>
> 

The goal would be to use the same code for CET and PKRU, I think.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-29 17:41                       ` Luck, Tony
  2021-09-29 17:46                         ` Andy Lutomirski
@ 2021-09-29 18:07                         ` Fenghua Yu
  2021-09-29 18:31                           ` Luck, Tony
  1 sibling, 1 reply; 77+ messages in thread
From: Fenghua Yu @ 2021-09-29 18:07 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Thomas Gleixner, Andy Lutomirski, Peter Zijlstra, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Raj Ashok, Shankar,
	Ravi V, iommu, the arch/x86 maintainers,
	Linux Kernel Mailing List

Hi, Tony,

On Wed, Sep 29, 2021 at 10:41:42AM -0700, Luck, Tony wrote:
> On Wed, Sep 29, 2021 at 07:15:53PM +0200, Thomas Gleixner wrote:
> > On Wed, Sep 29 2021 at 09:59, Andy Lutomirski wrote:
> > > On 9/29/21 05:28, Thomas Gleixner wrote:
> > >> Looking at that patch again, none of this muck in fpu__pasid_write() is
> > >> required at all. The whole exception fixup is:
> > >> 
> > >>      if (!user_mode(regs))
> > >>               return false;
> > >> 
> > >>      if (!current->mm->pasid)
> > >>               return false;
> > >> 
> > >>      if (current->pasid_activated)
> > >>      	     return false;
> > >
> > > <-- preemption or BH here: kaboom.
> > 
> > Sigh, this had obviously to run in the early portion of #GP, i.e. before
> > enabling interrupts.
> 
> Like this? Obviously with some comment about why this is being done.
> 
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index a58800973aed..a848a59291e7 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -536,6 +536,12 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
>  	unsigned long gp_addr;
>  	int ret;
>  

Add
+#ifdef CONFIG_IOMMU_SUPPORT
because mm->pasid and current-pasid_activated are defined only if
CONFIG_IOMMU_SUPPORT is defined.

> +	if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) {

Maybe need to add "&& cpu_feature_enabled(X86_FEATURE_ENQCMD)" because
the IA32_PASID MSR is only used when ENQCMD is enabled?

> +		current->pasid_activated = 1;
> +		wrmsrl(MSR_IA32_PASID, current->mm->pasid | MSR_IA32_PASID_VALID);
> +		return;
> +	}
> +

+endif

>  	cond_local_irq_enable(regs);
>  
>  	if (static_cpu_has(X86_FEATURE_UMIP)) {

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-29 18:07                         ` Fenghua Yu
@ 2021-09-29 18:31                           ` Luck, Tony
  2021-09-29 20:07                             ` Thomas Gleixner
  0 siblings, 1 reply; 77+ messages in thread
From: Luck, Tony @ 2021-09-29 18:31 UTC (permalink / raw)
  To: Fenghua Yu
  Cc: Thomas Gleixner, Andy Lutomirski, Peter Zijlstra, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Lu Baolu, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Raj Ashok, Shankar,
	Ravi V, iommu, the arch/x86 maintainers,
	Linux Kernel Mailing List

On Wed, Sep 29, 2021 at 06:07:22PM +0000, Fenghua Yu wrote:
> Add
> +#ifdef CONFIG_IOMMU_SUPPORT
> because mm->pasid and current-pasid_activated are defined only if
> CONFIG_IOMMU_SUPPORT is defined.
> 
> > +	if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) {
> 
> Maybe need to add "&& cpu_feature_enabled(X86_FEATURE_ENQCMD)" because
> the IA32_PASID MSR is only used when ENQCMD is enabled?
> 
> > +		current->pasid_activated = 1;
> > +		wrmsrl(MSR_IA32_PASID, current->mm->pasid | MSR_IA32_PASID_VALID);
> > +		return;
> > +	}
> > +
> 
> +endif

New version that addresses those issues.  Has ugly #ifdef in C
code :-(  If that's unacceptable, then could create some stub
functions, or add a call to __try_fixup_pasid() that's in a
file in the iommu code that is only built when CONFIG_IOMMU_SUPPORT
is set.  But either of those move the details far away from the
#GP handler so make extra work for anyone trying to follow along
with what is happening here.

-Tony

---

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..5a3c87fd65de 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -528,6 +528,32 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,
 
 #define GPFSTR "general protection fault"
 
+/*
+ * When a user executes the ENQCMD instruction it will #GP
+ * fault if the IA32_PASID MSR has not been set up with a
+ * valid PASID.
+ * So if the process has been allocated a PASID (mm->pasid)
+ * AND the IA32_PASID MSR has not been initialized, try to
+ * fix this #GP by initializing the IA32_PASID MSR.
+ * If the #GP was for some other reason, it will trigger
+ * again, but this routine will return false and the #GP
+ * will be processed.
+ */
+static void try_fixup_pasid(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
+		return false;
+
+#ifdef CONFIG_IOMMU_SUPPORT
+	if (current->mm->pasid && !current->pasid_activated) {
+		current->pasid_activated = 1;
+		wrmsrl(MSR_IA32_PASID, current->mm->pasid);
+		return true;
+	}
+#endif
+	return false;
+}
+
 DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 {
 	char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
@@ -536,6 +562,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	unsigned long gp_addr;
 	int ret;
 
+	if (user_mode(regs) && try_fixup_pasid())
+		return;
+
 	cond_local_irq_enable(regs);
 
 	if (static_cpu_has(X86_FEATURE_UMIP)) {

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-29 18:31                           ` Luck, Tony
@ 2021-09-29 20:07                             ` Thomas Gleixner
  0 siblings, 0 replies; 77+ messages in thread
From: Thomas Gleixner @ 2021-09-29 20:07 UTC (permalink / raw)
  To: Luck, Tony, Fenghua Yu
  Cc: Andy Lutomirski, Peter Zijlstra, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Lu Baolu, Joerg Roedel, Josh Poimboeuf, Dave Jiang,
	Jacob Jun Pan, Raj Ashok, Shankar, Ravi V, iommu,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Wed, Sep 29 2021 at 11:31, Tony Luck wrote:
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index a58800973aed..5a3c87fd65de 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -528,6 +528,32 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,
>  
>  #define GPFSTR "general protection fault"
>  
> +/*
> + * When a user executes the ENQCMD instruction it will #GP
> + * fault if the IA32_PASID MSR has not been set up with a
> + * valid PASID.
> + * So if the process has been allocated a PASID (mm->pasid)
> + * AND the IA32_PASID MSR has not been initialized, try to
> + * fix this #GP by initializing the IA32_PASID MSR.
> + * If the #GP was for some other reason, it will trigger
> + * again, but this routine will return false and the #GP
> + * will be processed.
> + */
> +static void try_fixup_pasid(void)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
> +		return false;
> +
> +#ifdef CONFIG_IOMMU_SUPPORT
> +	if (current->mm->pasid && !current->pasid_activated) {
> +		current->pasid_activated = 1;
> +		wrmsrl(MSR_IA32_PASID, current->mm->pasid);
> +		return true;
> +	}
> +#endif
> +	return false;
> +}
> +
>  DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
>  {
>  	char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
> @@ -536,6 +562,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
>  	unsigned long gp_addr;
>  	int ret;
>  
> +	if (user_mode(regs) && try_fixup_pasid())
> +		return;
> +
>  	cond_local_irq_enable(regs);
>  
>  	if (static_cpu_has(X86_FEATURE_UMIP)) {

Amen.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 1/8] iommu/vt-d: Clean up unused PASID updating functions
  2021-09-29  7:34   ` Lu Baolu
@ 2021-09-30  0:40     ` Fenghua Yu
  0 siblings, 0 replies; 77+ messages in thread
From: Fenghua Yu @ 2021-09-30  0:40 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Tony Luck, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar, iommu, x86, linux-kernel

Hi, Baolu,

On Wed, Sep 29, 2021 at 03:34:51PM +0800, Lu Baolu wrote:
> On 2021/9/21 3:23, Fenghua Yu wrote:
> > update_pasid() and its call chain are currently unused in the tree because
> > Thomas disabled the ENQCMD feature. The feature will be re-enabled shortly
> > using a different approach and update_pasid() and its call chain will not
> > be used in the new approach.
> > 
> > Remove the useless functions.
> > 
> > Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> > Reviewed-by: Tony Luck <tony.luck@intel.com>
> 
> Thanks for this cleanup. I have queued it for v5.16.

Thank you for pushing this patch to 5.16!

-Fenghua

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting
  2021-09-23  5:43   ` Lu Baolu
@ 2021-09-30  0:44     ` Fenghua Yu
  0 siblings, 0 replies; 77+ messages in thread
From: Fenghua Yu @ 2021-09-30  0:44 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Andy Lutomirski, Dave Hansen, Tony Luck, Joerg Roedel,
	Josh Poimboeuf, Dave Jiang, Jacob Jun Pan, Ashok Raj,
	Ravi V Shankar, iommu, x86, linux-kernel

Hi, Baolu,

On Thu, Sep 23, 2021 at 01:43:32PM +0800, Lu Baolu wrote:
> On 9/21/21 3:23 AM, Fenghua Yu wrote:
> > +void pasid_put(struct task_struct *tsk, struct mm_struct *mm)
> > +{
> > +	if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
> > +		return;
> > +
> > +	/*
> > +	 * Nothing to do if this task doesn't have a reference to the PASID.
> > +	 */
> > +	if (tsk->has_valid_pasid) {
> > +		mutex_lock(&pasid_mutex);
> > +		/*
> > +		 * The PASID's reference was taken during fix up. Release it
> > +		 * now. If the reference count is 0, the PASID is freed.
> > +		 */
> > +		iommu_sva_free_pasid(mm);
> > +		mutex_unlock(&pasid_mutex);
> > +	}
> > +}
> > 
> 
> It looks odd that both __fixup_pasid_exception() and pasid_put() are
> defined in the vendor IOMMU driver, but get called in the arch/x86
> code.
> 
> Is it feasible to move these two helpers to the files where they are
> called? The IA32_PASID MSR fixup and release are not part of the IOMMU
> implementation.

Sure. The functions will be removed in the next version. And new functions
will not be called in IOMMU driver.

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 77+ messages in thread

end of thread, other threads:[~2021-09-30  0:44 UTC | newest]

Thread overview: 77+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-20 19:23 [PATCH 0/8] Re-enable ENQCMD and PASID MSR Fenghua Yu
2021-09-20 19:23 ` [PATCH 1/8] iommu/vt-d: Clean up unused PASID updating functions Fenghua Yu
2021-09-29  7:34   ` Lu Baolu
2021-09-30  0:40     ` Fenghua Yu
2021-09-20 19:23 ` [PATCH 2/8] x86/process: Clear PASID state for a newly forked/cloned thread Fenghua Yu
2021-09-20 19:23 ` [PATCH 3/8] sched: Define and initialize a flag to identify valid PASID in the task Fenghua Yu
2021-09-20 19:23 ` [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP Fenghua Yu
2021-09-22 21:07   ` Peter Zijlstra
2021-09-22 21:11     ` Peter Zijlstra
2021-09-22 21:26       ` Luck, Tony
2021-09-23  7:03         ` Peter Zijlstra
2021-09-22 21:33       ` Dave Hansen
2021-09-23  7:05         ` Peter Zijlstra
2021-09-22 21:36       ` Fenghua Yu
2021-09-22 23:39     ` Fenghua Yu
2021-09-23 17:14     ` Luck, Tony
2021-09-24 13:37       ` Peter Zijlstra
2021-09-24 15:39         ` Luck, Tony
2021-09-29  9:00           ` Peter Zijlstra
2021-09-23 11:31   ` Thomas Gleixner
2021-09-23 23:17   ` Andy Lutomirski
2021-09-24  2:56     ` Fenghua Yu
2021-09-24  5:12       ` Andy Lutomirski
2021-09-27 21:02     ` Luck, Tony
2021-09-27 23:51       ` Dave Hansen
2021-09-28 18:50         ` Luck, Tony
2021-09-28 19:19           ` Dave Hansen
2021-09-28 20:28             ` Luck, Tony
2021-09-28 20:55               ` Dave Hansen
2021-09-28 23:10                 ` Luck, Tony
2021-09-28 23:50                   ` Fenghua Yu
2021-09-29  0:08                     ` Luck, Tony
2021-09-29  0:26                       ` Yu, Fenghua
2021-09-29  1:06                         ` Luck, Tony
2021-09-29  1:16                           ` Fenghua Yu
2021-09-29  2:11                             ` Luck, Tony
2021-09-29  1:56                       ` Yu, Fenghua
2021-09-29  2:15                         ` Luck, Tony
2021-09-29 16:58                   ` Andy Lutomirski
2021-09-29 17:07                     ` Luck, Tony
2021-09-29 17:48                       ` Andy Lutomirski
2021-09-20 19:23 ` [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting Fenghua Yu
2021-09-23  5:43   ` Lu Baolu
2021-09-30  0:44     ` Fenghua Yu
2021-09-23 14:36   ` Thomas Gleixner
2021-09-23 16:40     ` Luck, Tony
2021-09-23 17:48       ` Thomas Gleixner
2021-09-24 13:18         ` Thomas Gleixner
2021-09-24 16:12           ` Luck, Tony
2021-09-24 23:03             ` Andy Lutomirski
2021-09-24 23:11               ` Luck, Tony
2021-09-29  9:54               ` Peter Zijlstra
2021-09-29 12:28                 ` Thomas Gleixner
2021-09-29 16:51                   ` Luck, Tony
2021-09-29 17:07                     ` Fenghua Yu
2021-09-29 16:59                   ` Andy Lutomirski
2021-09-29 17:15                     ` Thomas Gleixner
2021-09-29 17:41                       ` Luck, Tony
2021-09-29 17:46                         ` Andy Lutomirski
2021-09-29 18:07                         ` Fenghua Yu
2021-09-29 18:31                           ` Luck, Tony
2021-09-29 20:07                             ` Thomas Gleixner
2021-09-24 16:12           ` Fenghua Yu
2021-09-25 23:13             ` Thomas Gleixner
2021-09-28 16:36               ` Fenghua Yu
2021-09-23 23:09   ` Andy Lutomirski
2021-09-23 23:22     ` Luck, Tony
2021-09-24  5:17       ` Andy Lutomirski
2021-09-20 19:23 ` [PATCH 6/8] x86/cpufeatures: Re-enable ENQCMD Fenghua Yu
2021-09-20 19:23 ` [PATCH 7/8] tools/objtool: Check for use of the ENQCMD instruction in the kernel Fenghua Yu
2021-09-22 21:03   ` Peter Zijlstra
2021-09-22 23:44     ` Fenghua Yu
2021-09-23  7:17       ` Peter Zijlstra
2021-09-23 15:26         ` Fenghua Yu
2021-09-24  0:55           ` Josh Poimboeuf
2021-09-24  0:57             ` Fenghua Yu
2021-09-20 19:23 ` [PATCH 8/8] docs: x86: Change documentation for SVA (Shared Virtual Addressing) Fenghua Yu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).