LKML Archive on lore.kernel.org
* [PATCH 0/2] KVM: x86/mmu: Fix a TDP MMU leak and optimize zap all
@ 2021-08-12  5:07 Sean Christopherson
  2021-08-12  5:07 ` [PATCH 1/2] KVM: x86/mmu: Don't skip non-leaf SPTEs when zapping all SPTEs Sean Christopherson
  2021-08-12  5:07 ` [PATCH 2/2] KVM: x86/mmu: Don't step down in the TDP iterator " Sean Christopherson
  0 siblings, 2 replies; 15+ messages in thread
From: Sean Christopherson @ 2021-08-12  5:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon

Patch 1 fixes a leak of root-1 shadow pages.  Patch 2 is a minor
optimization to the "zap all" flow that avoids re-reading and
re-checking the root-1 SPTEs after they've been zapped.

Sean Christopherson (2):
  KVM: x86/mmu: Don't skip non-leaf SPTEs when zapping all SPTEs
  KVM: x86/mmu: Don't step down in the TDP iterator when zapping all
    SPTEs

 arch/x86/kvm/mmu/tdp_mmu.c | 49 +++++++++++++++++++++++++++++---------
 1 file changed, 38 insertions(+), 11 deletions(-)

-- 
2.33.0.rc1.237.g0d66db33f3-goog


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/2] KVM: x86/mmu: Don't skip non-leaf SPTEs when zapping all SPTEs
  2021-08-12  5:07 [PATCH 0/2] KVM: x86/mmu: Fix a TDP MMU leak and optimize zap all Sean Christopherson
@ 2021-08-12  5:07 ` Sean Christopherson
  2021-08-12 16:18   ` Paolo Bonzini
  2021-08-12  5:07 ` [PATCH 2/2] KVM: x86/mmu: Don't step down in the TDP iterator " Sean Christopherson
  1 sibling, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2021-08-12  5:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon

Use a magic number for the end gfn to signal "zap all" for the TDP MMU
and really zap all SPTEs in this case.  As is, zap_gfn_range() skips
non-leaf SPTEs whose range exceeds the range to be zapped.  If
shadow_phys_bits is not aligned to the range size of top-level SPTEs,
e.g. 512GB with 4-level paging, the "zap all" flows will skip top-level
SPTEs whose range extends beyond shadow_phys_bits and leak their SPs
when the VM is destroyed.

Use a magic number and the current upper bound (based on host.MAXPHYADDR)
instead of the max theoretical gfn, 1 << (52 - 12), even though the latter
would be functionally ok, too.  Bounding based on host.MAXPHYADDR allows
the TDP iterator to terminate its walk when the gfn of the iterator is
out of bounds.

Add a WARN on kvm->arch.tdp_mmu_pages when the TDP MMU is destroyed to
help future debuggers should KVM decide to leak SPTEs again.

The bug is most easily reproduced by running (and unloading!) KVM in a
VM whose host.MAXPHYADDR < 39, as the SPTE for gfn=0 will be skipped.

  =============================================================================
  BUG kvm_mmu_page_header (Not tainted): Objects remaining in kvm_mmu_page_header on __kmem_cache_shutdown()
  -----------------------------------------------------------------------------
  Slab 0x000000004d8f7af1 objects=22 used=2 fp=0x00000000624d29ac flags=0x4000000000000200(slab|zone=1)
  CPU: 0 PID: 1582 Comm: rmmod Not tainted 5.14.0-rc2+ #420
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   dump_stack_lvl+0x45/0x59
   slab_err+0x95/0xc9
   __kmem_cache_shutdown.cold+0x3c/0x158
   kmem_cache_destroy+0x3d/0xf0
   kvm_mmu_module_exit+0xa/0x30 [kvm]
   kvm_arch_exit+0x5d/0x90 [kvm]
   kvm_exit+0x78/0x90 [kvm]
   vmx_exit+0x1a/0x50 [kvm_intel]
   __x64_sys_delete_module+0x13f/0x220
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 47ec9f968406..6566f70a31c1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -13,6 +13,17 @@
 static bool __read_mostly tdp_mmu_enabled = true;
 module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
 
+/*
+ * Magic numbers for 'start' and 'end' used when zapping all SPTEs.  The values
+ * are not completely arbitrary; start must be 0, and end must be greater than
+ * the max theoretical GFN that KVM can fault into the TDP MMU.
+ */
+#define ZAP_ALL_START	0
+#define ZAP_ALL_END	-1ull
+
+/* start+end pair for zapping SPTEs for all possible GFNs. */
+#define ZAP_ALL ZAP_ALL_START, ZAP_ALL_END
+
 /* Initializes the TDP MMU for the VM, if enabled. */
 bool kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 {
@@ -43,6 +54,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 	if (!kvm->arch.tdp_mmu_enabled)
 		return;
 
+	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_pages));
 	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
 
 	/*
@@ -81,8 +93,6 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
 void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			  bool shared)
 {
-	gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
-
 	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
 
 	if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
@@ -94,7 +104,7 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 	list_del_rcu(&root->link);
 	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 
-	zap_gfn_range(kvm, root, 0, max_gfn, false, false, shared);
+	zap_gfn_range(kvm, root, ZAP_ALL, false, false, shared);
 
 	call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
@@ -739,8 +749,16 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			  gfn_t start, gfn_t end, bool can_yield, bool flush,
 			  bool shared)
 {
+	bool zap_all = (end == ZAP_ALL_END);
 	struct tdp_iter iter;
 
+	/*
+	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
+	 * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
+	 * and so KVM will never install a SPTE for such addresses.
+	 */
+	end = min(end, 1ULL << (shadow_phys_bits - PAGE_SHIFT));
+
 	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
 
 	rcu_read_lock();
@@ -759,9 +777,10 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		/*
 		 * If this is a non-last-level SPTE that covers a larger range
 		 * than should be zapped, continue, and zap the mappings at a
-		 * lower level.
+		 * lower level, except when zapping all SPTEs.
 		 */
-		if ((iter.gfn < start ||
+		if (!zap_all &&
+		    (iter.gfn < start ||
 		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
@@ -803,12 +822,11 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
 
 void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 {
-	gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
 	bool flush = false;
 	int i;
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
-		flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, max_gfn, flush);
+		flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, ZAP_ALL, flush);
 
 	if (flush)
 		kvm_flush_remote_tlbs(kvm);
@@ -846,7 +864,6 @@ static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
  */
 void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
 {
-	gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
 	struct kvm_mmu_page *next_root;
 	struct kvm_mmu_page *root;
 	bool flush = false;
@@ -862,8 +879,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
 
 		rcu_read_unlock();
 
-		flush = zap_gfn_range(kvm, root, 0, max_gfn, true, flush,
-				      true);
+		flush = zap_gfn_range(kvm, root, ZAP_ALL, true, flush, true);
 
 		/*
 		 * Put the reference acquired in
-- 
2.33.0.rc1.237.g0d66db33f3-goog



* [PATCH 2/2] KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs
  2021-08-12  5:07 [PATCH 0/2] KVM: x86/mmu: Fix a TDP MMU leak and optimize zap all Sean Christopherson
  2021-08-12  5:07 ` [PATCH 1/2] KVM: x86/mmu: Don't skip non-leaf SPTEs when zapping all SPTEs Sean Christopherson
@ 2021-08-12  5:07 ` Sean Christopherson
  2021-08-12 16:47   ` Ben Gardon
  1 sibling, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2021-08-12  5:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon

Set the min_level for the TDP iterator at the root level when zapping all
SPTEs so that the _iterator_ only processes top-level SPTEs.  Zapping a
non-leaf SPTE will recursively zap all its children, thus there is no
need for the iterator to attempt to step down.  This avoids rereading all
the top-level SPTEs after they are zapped by causing try_step_down() to
short-circuit.

Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6566f70a31c1..aec069c18d82 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -751,6 +751,16 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 {
 	bool zap_all = (end == ZAP_ALL_END);
 	struct tdp_iter iter;
+	int min_level;
+
+	/*
+	 * No need to step down in the iterator when zapping all SPTEs, zapping
+	 * the top-level non-leaf SPTEs will recurse on all their children.
+	 */
+	if (zap_all)
+		min_level = root->role.level;
+	else
+		min_level = PG_LEVEL_4K;
 
 	/*
 	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
@@ -763,7 +773,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	rcu_read_lock();
 
-	tdp_root_for_each_pte(iter, root, start, end) {
+	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
+				   min_level, start, end) {
 retry:
 		if (can_yield &&
 		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
-- 
2.33.0.rc1.237.g0d66db33f3-goog



* Re: [PATCH 1/2] KVM: x86/mmu: Don't skip non-leaf SPTEs when zapping all SPTEs
  2021-08-12  5:07 ` [PATCH 1/2] KVM: x86/mmu: Don't skip non-leaf SPTEs when zapping all SPTEs Sean Christopherson
@ 2021-08-12 16:18   ` Paolo Bonzini
  2021-08-12 16:43     ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2021-08-12 16:18 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Ben Gardon

On 12/08/21 07:07, Sean Christopherson wrote:
> @@ -739,8 +749,16 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>   			  gfn_t start, gfn_t end, bool can_yield, bool flush,
>   			  bool shared)
>   {
> +	bool zap_all = (end == ZAP_ALL_END);
>   	struct tdp_iter iter;
>   
> +	/*
> +	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> +	 * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> +	 * and so KVM will never install a SPTE for such addresses.
> +	 */
> +	end = min(end, 1ULL << (shadow_phys_bits - PAGE_SHIFT));

Then zap_all need not have any magic value.  You can use 0/-1ull, it's 
readable enough.  ZAP_ALL_END is also unnecessary here if you do:

	gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
	bool zap_all = (start == 0 && end >= max_gfn_host);

	end = min(end, max_gfn_host);

And as a small commit message nit, I would say "don't leak" instead of 
"don't skip", since that's really the effect.

Paolo


>   	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
>   
>   	rcu_read_lock();
> @@ -759,9 +777,10 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>   		/*
>   		 * If this is a non-last-level SPTE that covers a larger range
>   		 * than should be zapped, continue, and zap the mappings at a
> -		 * lower level.
> +		 * lower level, except when zapping all SPTEs.
>   		 */
> -		if ((iter.gfn < start ||
> +		if (!zap_all &&
> +		    (iter.gfn < start ||
>   		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&



* Re: [PATCH 1/2] KVM: x86/mmu: Don't skip non-leaf SPTEs when zapping all SPTEs
  2021-08-12 16:18   ` Paolo Bonzini
@ 2021-08-12 16:43     ` Sean Christopherson
  2021-08-12 18:09       ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2021-08-12 16:43 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Ben Gardon

On Thu, Aug 12, 2021, Paolo Bonzini wrote:
> On 12/08/21 07:07, Sean Christopherson wrote:
> > @@ -739,8 +749,16 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >   			  gfn_t start, gfn_t end, bool can_yield, bool flush,
> >   			  bool shared)
> >   {
> > +	bool zap_all = (end == ZAP_ALL_END);
> >   	struct tdp_iter iter;
> > +	/*
> > +	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> > +	 * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> > +	 * and so KVM will never install a SPTE for such addresses.
> > +	 */
> > +	end = min(end, 1ULL << (shadow_phys_bits - PAGE_SHIFT));
> 
> Then zap_all need not have any magic value.  You can use 0/-1ull, it's
> readable enough.  ZAP_ALL_END is also unnecessary here if you do:
> 
> 	gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> 	bool zap_all = (start == 0 && end >= max_gfn_host);

Aha!  Nice.  I was both too clever and yet not clever enough.

> 	end = min(end, max_gfn_host);
> 
> And as a small commit message nit, I would say "don't leak" instead of
> "don't skip", since that's really the effect.

Hrm, yeah, I can see how "skip" doesn't raise alarm bells like it should.


* Re: [PATCH 2/2] KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs
  2021-08-12  5:07 ` [PATCH 2/2] KVM: x86/mmu: Don't step down in the TDP iterator " Sean Christopherson
@ 2021-08-12 16:47   ` Ben Gardon
  2021-08-12 17:07     ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Ben Gardon @ 2021-08-12 16:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On Wed, Aug 11, 2021 at 10:07 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Set the min_level for the TDP iterator at the root level when zapping all
> SPTEs so that the _iterator_ only processes top-level SPTEs.  Zapping a
> non-leaf SPTE will recursively zap all its children, thus there is no
> need for the iterator to attempt to step down.  This avoids rereading all
> the top-level SPTEs after they are zapped by causing try_step_down() to
> short-circuit.
>
> Cc: Ben Gardon <bgardon@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

This change looks functionally correct, but I'm not sure it's worth
adding more code special cased on zap-all for what seems like a small
performance improvement in a context which shouldn't be particularly
performance sensitive.
Change is a correct optimization though and it's not much extra code,
so I'm happy to give a:
Reviewed-by: Ben Gardon <bgardon@google.com>

> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 6566f70a31c1..aec069c18d82 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -751,6 +751,16 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  {
>         bool zap_all = (end == ZAP_ALL_END);
>         struct tdp_iter iter;
> +       int min_level;
> +
> +       /*
> +        * No need to step down in the iterator when zapping all SPTEs, zapping
> +        * the top-level non-leaf SPTEs will recurse on all their children.
> +        */
> +       if (zap_all)
> +               min_level = root->role.level;
> +       else
> +               min_level = PG_LEVEL_4K;
>
>         /*
>          * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> @@ -763,7 +773,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>
>         rcu_read_lock();
>
> -       tdp_root_for_each_pte(iter, root, start, end) {
> +       for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
> +                                  min_level, start, end) {
>  retry:
>                 if (can_yield &&
>                     tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
> --
> 2.33.0.rc1.237.g0d66db33f3-goog
>


* Re: [PATCH 2/2] KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs
  2021-08-12 16:47   ` Ben Gardon
@ 2021-08-12 17:07     ` Sean Christopherson
  2021-08-12 17:15       ` Paolo Bonzini
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2021-08-12 17:07 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On Thu, Aug 12, 2021, Ben Gardon wrote:
> On Wed, Aug 11, 2021 at 10:07 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Set the min_level for the TDP iterator at the root level when zapping all
> > SPTEs so that the _iterator_ only processes top-level SPTEs.  Zapping a
> > non-leaf SPTE will recursively zap all its children, thus there is no
> > need for the iterator to attempt to step down.  This avoids rereading all
> > the top-level SPTEs after they are zapped by causing try_step_down() to
> > short-circuit.
> >
> > Cc: Ben Gardon <bgardon@google.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> This change looks functionally correct, but I'm not sure it's worth
> adding more code special cased on zap-all for what seems like a small
> performance improvement in a context which shouldn't be particularly
> performance sensitive.

Yeah, I was/am on the fence too, I almost included a blurb in the cover letter
saying as much.  I'll do that for v2 and let Paolo decide.

Thanks!

> Change is a correct optimization though and it's not much extra code,
> so I'm happy to give a:
> Reviewed-by: Ben Gardon <bgardon@google.com>


* Re: [PATCH 2/2] KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs
  2021-08-12 17:07     ` Sean Christopherson
@ 2021-08-12 17:15       ` Paolo Bonzini
  2021-08-12 17:33         ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2021-08-12 17:15 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm, LKML

On 12/08/21 19:07, Sean Christopherson wrote:
> Yeah, I was/am on the fence too, I almost included a blurb in the cover letter
> saying as much.  I'll do that for v2 and let Paolo decide.

I think it makes sense to have it.  You can even use the ternary operator

	/*
	 * When zapping everything, all entries at the top level
	 * ultimately go away, and the levels below go down with them.
	 * So do not bother iterating all the way down to the leaves.
	 */
	int min_level = zap_all ? root->role.level : PG_LEVEL_4K;

Paolo



* Re: [PATCH 2/2] KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs
  2021-08-12 17:15       ` Paolo Bonzini
@ 2021-08-12 17:33         ` Sean Christopherson
  2021-08-12 17:38           ` Paolo Bonzini
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2021-08-12 17:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ben Gardon, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On Thu, Aug 12, 2021, Paolo Bonzini wrote:
> On 12/08/21 19:07, Sean Christopherson wrote:
> > Yeah, I was/am on the fence too, I almost included a blurb in the cover letter
> > saying as much.  I'll do that for v2 and let Paolo decide.
> 
> I think it makes sense to have it.  You can even use the ternary operator

Hah, yeah, I almost used a ternary op.  Honestly don't know why I didn't, guess
my brain flipped a coin.

> 
> 	/*
> 	 * When zapping everything, all entries at the top level
> 	 * ultimately go away, and the levels below go down with them.
> 	 * So do not bother iterating all the way down to the leaves.

The subtle part is that try_step_down() won't actually iterate down because it
explicitly rereads and rechecks the SPTE.  

	if (iter->level == iter->min_level)
		return false;

	/*
	 * Reread the SPTE before stepping down to avoid traversing into page
	 * tables that are no longer linked from this entry.
	 */
	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));  \
                                                                     ---> this is the code that is avoided
	child_pt = spte_to_child_pt(iter->old_spte, iter->level);   /
	if (!child_pt)
		return false;


My comment wasn't all that accurate either.  Maybe this?

	/*
	 * No need to try to step down in the iterator when zapping all SPTEs,
	 * zapping the top-level non-leaf SPTEs will recurse on their children.
	 */
	 int min_level = zap_all ? root->role.level : PG_LEVEL_4K;

> 	 */
> 	int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
> 
> Paolo
> 


* Re: [PATCH 2/2] KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs
  2021-08-12 17:33         ` Sean Christopherson
@ 2021-08-12 17:38           ` Paolo Bonzini
  2021-08-12 17:46             ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2021-08-12 17:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ben Gardon, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On 12/08/21 19:33, Sean Christopherson wrote:
> On Thu, Aug 12, 2021, Paolo Bonzini wrote:
>> On 12/08/21 19:07, Sean Christopherson wrote:
>>> Yeah, I was/am on the fence too, I almost included a blurb in the cover letter
>>> saying as much.  I'll do that for v2 and let Paolo decide.
>>
>> I think it makes sense to have it.  You can even use the ternary operator
> 
> Hah, yeah, I almost used a ternary op.  Honestly don't know why I didn't, guess
> my brain flipped a coin.
> 
>>
>> 	/*
>> 	 * When zapping everything, all entries at the top level
>> 	 * ultimately go away, and the levels below go down with them.
>> 	 * So do not bother iterating all the way down to the leaves.
> 
> The subtle part is that try_step_down() won't actually iterate down because it
> explicitly rereads and rechecks the SPTE.
> 
> 	if (iter->level == iter->min_level)
> 		return false;
> 
> 	/*
> 	 * Reread the SPTE before stepping down to avoid traversing into page
> 	 * tables that are no longer linked from this entry.
> 	 */
> 	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));  \
>                                                                       ---> this is the code that is avoided
> 	child_pt = spte_to_child_pt(iter->old_spte, iter->level);   /
> 	if (!child_pt)
> 		return false;

Ah, right - so I agree with Ben that it's not too important.

Paolo



* Re: [PATCH 2/2] KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs
  2021-08-12 17:38           ` Paolo Bonzini
@ 2021-08-12 17:46             ` Sean Christopherson
  2021-08-13  7:27               ` Paolo Bonzini
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2021-08-12 17:46 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ben Gardon, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On Thu, Aug 12, 2021, Paolo Bonzini wrote:
> On 12/08/21 19:33, Sean Christopherson wrote:
> > On Thu, Aug 12, 2021, Paolo Bonzini wrote:
> > > On 12/08/21 19:07, Sean Christopherson wrote:
> > > > Yeah, I was/am on the fence too, I almost included a blurb in the cover letter
> > > > saying as much.  I'll do that for v2 and let Paolo decide.
> > > 
> > > I think it makes sense to have it.  You can even use the ternary operator
> > 
> > Hah, yeah, I almost used a ternary op.  Honestly don't know why I didn't, guess
> > my brain flipped a coin.
> > 
> > > 
> > > 	/*
> > > 	 * When zapping everything, all entries at the top level
> > > 	 * ultimately go away, and the levels below go down with them.
> > > 	 * So do not bother iterating all the way down to the leaves.
> > 
> > The subtle part is that try_step_down() won't actually iterate down because it
> > explicitly rereads and rechecks the SPTE.
> > 
> > 	if (iter->level == iter->min_level)
> > 		return false;
> > 
> > 	/*
> > 	 * Reread the SPTE before stepping down to avoid traversing into page
> > 	 * tables that are no longer linked from this entry.
> > 	 */
> > 	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));  \
> >                                                                       ---> this is the code that is avoided
> > 	child_pt = spte_to_child_pt(iter->old_spte, iter->level);   /
> > 	if (!child_pt)
> > 		return false;
> 
> Ah, right - so I agree with Ben that it's not too important.

Ya.  There is a measurable performance improvement, but it's really only
meaningful when there aren't many SPTEs to zap, otherwise the cost of zapping
completely dominates the time.

The one thing that makes me want to include the optimization is that it will kick
in if userspace is constantly modifying memslots, e.g. for option ROMs, in which
case many of the memslot-induced zaps will run with relatively few SPTEs.  The
thread doing the zapping isn't a vCPU thread, but it still holds mmu_lock for
read and thus can be a noisy neighbor of sorts.


* Re: [PATCH 1/2] KVM: x86/mmu: Don't skip non-leaf SPTEs when zapping all SPTEs
  2021-08-12 16:43     ` Sean Christopherson
@ 2021-08-12 18:09       ` Sean Christopherson
  0 siblings, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2021-08-12 18:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Ben Gardon

On Thu, Aug 12, 2021, Sean Christopherson wrote:
> On Thu, Aug 12, 2021, Paolo Bonzini wrote:
> > On 12/08/21 07:07, Sean Christopherson wrote:
> > > @@ -739,8 +749,16 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > >   			  gfn_t start, gfn_t end, bool can_yield, bool flush,
> > >   			  bool shared)
> > >   {
> > > +	bool zap_all = (end == ZAP_ALL_END);
> > >   	struct tdp_iter iter;
> > > +	/*
> > > +	 * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> > > +	 * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> > > +	 * and so KVM will never install a SPTE for such addresses.
> > > +	 */
> > > +	end = min(end, 1ULL << (shadow_phys_bits - PAGE_SHIFT));
> > 
> > Then zap_all need not have any magic value.  You can use 0/-1ull, it's
> > readable enough.  ZAP_ALL_END is also unnecessary here if you do:
> > 
> > 	gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> > 	bool zap_all = (start == 0 && end >= max_gfn_host);
> 
> Aha!  Nice.  I was both too clever and yet not clever enough.

And as a bonus, this also works for kvm_post_set_cr0(), which calls the common
kvm_zap_gfn_range() with 0 - ALL_ONES.


* Re: [PATCH 2/2] KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs
  2021-08-12 17:46             ` Sean Christopherson
@ 2021-08-13  7:27               ` Paolo Bonzini
  2021-08-13 16:13                 ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2021-08-13  7:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ben Gardon, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On 12/08/21 19:46, Sean Christopherson wrote:
>>> 	if (iter->level == iter->min_level)
>>> 		return false;
>>>
>>> 	/*
>>> 	 * Reread the SPTE before stepping down to avoid traversing into page
>>> 	 * tables that are no longer linked from this entry.
>>> 	 */
>>> 	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));  \
>>>                                                                        ---> this is the code that is avoided
>>> 	child_pt = spte_to_child_pt(iter->old_spte, iter->level);   /
>>> 	if (!child_pt)
>>> 		return false;
>> Ah, right - so I agree with Ben that it's not too important.
> Ya.  There is a measurable performance improvement, but it's really only
> meaningful when there aren't many SPTEs to zap, otherwise the cost of zapping
> completely dominates the time.

I don't understand.  When try_step_down is called by tdp_iter_next, all 
it does is really just the READ_ONCE, because spte_to_child_pt will see 
a non-present PTE and return immediately.  Why do two, presumably cache 
hot, reads cause a measurable performance improvement?

Paolo



* Re: [PATCH 2/2] KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs
  2021-08-13  7:27               ` Paolo Bonzini
@ 2021-08-13 16:13                 ` Sean Christopherson
  2021-08-13 16:38                   ` Paolo Bonzini
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2021-08-13 16:13 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ben Gardon, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On Fri, Aug 13, 2021, Paolo Bonzini wrote:
> On 12/08/21 19:46, Sean Christopherson wrote:
> > > > 	if (iter->level == iter->min_level)
> > > > 		return false;
> > > > 
> > > > 	/*
> > > > 	 * Reread the SPTE before stepping down to avoid traversing into page
> > > > 	 * tables that are no longer linked from this entry.
> > > > 	 */
> > > > 	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));  \
> > > >                                                                        ---> this is the code that is avoided
> > > > 	child_pt = spte_to_child_pt(iter->old_spte, iter->level);   /
> > > > 	if (!child_pt)
> > > > 		return false;
> > > Ah, right - so I agree with Ben that it's not too important.
> > Ya.  There is a measurable performance improvement, but it's really only
> > meaningful when there aren't many SPTEs to zap, otherwise the cost of zapping
> > completely dominates the time.
> 
> I don't understand.  When try_step_down is called by tdp_iter_next, all it
> does is really just the READ_ONCE, because spte_to_child_pt will see a
> non-present PTE and return immediately.  Why do two, presumably cache hot,
> reads cause a measurable performance improvement?

It's entirely possible my measurements were bad and/or noisy.  Ah, and my kernel
was running with CONFIG_PROVE_RCU=y, which makes the rcu_dereference() quite a bit
more expensive.


* Re: [PATCH 2/2] KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs
  2021-08-13 16:13                 ` Sean Christopherson
@ 2021-08-13 16:38                   ` Paolo Bonzini
  0 siblings, 0 replies; 15+ messages in thread
From: Paolo Bonzini @ 2021-08-13 16:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ben Gardon, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML

On 13/08/21 18:13, Sean Christopherson wrote:
> On Fri, Aug 13, 2021, Paolo Bonzini wrote:
>> On 12/08/21 19:46, Sean Christopherson wrote:
>>>>> 	if (iter->level == iter->min_level)
>>>>> 		return false;
>>>>>
>>>>> 	/*
>>>>> 	 * Reread the SPTE before stepping down to avoid traversing into page
>>>>> 	 * tables that are no longer linked from this entry.
>>>>> 	 */
>>>>> 	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));  \
>>>>>                                                                         ---> this is the code that is avoided
>>>>> 	child_pt = spte_to_child_pt(iter->old_spte, iter->level);   /
>>>>> 	if (!child_pt)
>>>>> 		return false;
>>>> Ah, right - so I agree with Ben that it's not too important.
>>> Ya.  There is a measurable performance improvement, but it's really only
>>> meaningful when there aren't many SPTEs to zap, otherwise the cost of zapping
>>> completely dominates the time.
>>
>> I don't understand.  When try_step_down is called by tdp_iter_next, all it
>> does is really just the READ_ONCE, because spte_to_child_pt will see a
>> non-present PTE and return immediately.  Why do two, presumably cache hot,
>> reads cause a measurable performance improvement?
> 
> It's entirely possible my measurements were bad and/or noisy.  Ah, and my kernel
> was running with CONFIG_PROVE_RCU=y, which makes the rcu_dereference() quite a bit
> more expensive.

It's one line of code and it makes sense, so I can certainly include the 
patch.  I was just a bit confused.

Paolo



