LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 0/3] KVM: LAPIC: Optimize timer latency further
@ 2019-05-09 11:29 Wanpeng Li
  2019-05-09 11:29 ` [PATCH 1/3] KVM: LAPIC: Extract adaptive tune timer advancement logic Wanpeng Li
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Wanpeng Li @ 2019-05-09 11:29 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: Paolo Bonzini, Radim Krčmář

Advance lapic timer tries to hidden the hypervisor overhead between host 
timer fires and the guest awares the timer is fired. However, it just hidden 
the time between apic_timer_fn/handle_preemption_timer -> wait_lapic_expire, 
instead of the real position of vmentry which is mentioned in the orignial 
commit d0659d946be0 ("KVM: x86: add option to advance tscdeadline hrtimer 
expiration"). There is 700+ cpu cycles between the end of wait_lapic_expire 
and before world switch on my haswell desktop, it will be 2400+ cycles if 
vmentry_l1d_flush is tuned to always. 

This patchset tries to narrow the last gap, it measures the time between 
the end of wait_lapic_expire and before world switch, we take this 
time into consideration when busy waiting, otherwise, the guest still 
awares the latency between wait_lapic_expire and world switch, we also 
consider this when adaptively tuning the timer advancement. The patch 
can reduce 50% latency (~1600+ cycles to ~800+ cycles on a haswell 
desktop) for kvm-unit-tests/tscdeadline_latency when testing busy waits.

Wanpeng Li (3):
  KVM: LAPIC: Extract adaptive tune timer advancement logic
  KVM: LAPIC: Fix lapic_timer_advance_ns parameter overflow
  KVM: LAPIC: Optimize timer latency further

 arch/x86/kvm/lapic.c   | 78 ++++++++++++++++++++++++++++++++++----------------
 arch/x86/kvm/lapic.h   |  8 ++++++
 arch/x86/kvm/vmx/vmx.c |  2 ++
 arch/x86/kvm/x86.c     |  2 +-
 4 files changed, 64 insertions(+), 26 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/3] KVM: LAPIC: Extract adaptive tune timer advancement logic
  2019-05-09 11:29 [PATCH 0/3] KVM: LAPIC: Optimize timer latency further Wanpeng Li
@ 2019-05-09 11:29 ` Wanpeng Li
  2019-05-13 19:39   ` Sean Christopherson
  2019-05-09 11:29 ` [PATCH 2/3] KVM: LAPIC: Fix lapic_timer_advance_ns parameter overflow Wanpeng Li
  2019-05-09 11:29 ` [PATCH 3/3] KVM: LAPIC: Optimize timer latency further Wanpeng Li
  2 siblings, 1 reply; 11+ messages in thread
From: Wanpeng Li @ 2019-05-09 11:29 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Radim Krčmář,
	Sean Christopherson, Liran Alon

From: Wanpeng Li <wanpengli@tencent.com>

Extract adaptive tune timer advancement logic to a single function.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/kvm/lapic.c | 57 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 33 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index bd13fdd..e7a0660 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1501,11 +1501,41 @@ static inline void __wait_lapic_expire(struct kvm_vcpu *vcpu, u64 guest_cycles)
 	}
 }
 
-void wait_lapic_expire(struct kvm_vcpu *vcpu)
+static inline void adaptive_tune_timer_advancement(struct kvm_vcpu *vcpu,
+				u64 guest_tsc, u64 tsc_deadline)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
 	u32 timer_advance_ns = apic->lapic_timer.timer_advance_ns;
-	u64 guest_tsc, tsc_deadline, ns;
+	u64 ns;
+
+	if (!apic->lapic_timer.timer_advance_adjust_done) {
+			/* too early */
+			if (guest_tsc < tsc_deadline) {
+				ns = (tsc_deadline - guest_tsc) * 1000000ULL;
+				do_div(ns, vcpu->arch.virtual_tsc_khz);
+				timer_advance_ns -= min((u32)ns,
+					timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
+			} else {
+			/* too late */
+				ns = (guest_tsc - tsc_deadline) * 1000000ULL;
+				do_div(ns, vcpu->arch.virtual_tsc_khz);
+				timer_advance_ns += min((u32)ns,
+					timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
+			}
+			if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
+				apic->lapic_timer.timer_advance_adjust_done = true;
+			if (unlikely(timer_advance_ns > 5000)) {
+				timer_advance_ns = 0;
+				apic->lapic_timer.timer_advance_adjust_done = true;
+			}
+			apic->lapic_timer.timer_advance_ns = timer_advance_ns;
+		}
+}
+
+void wait_lapic_expire(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+	u64 guest_tsc, tsc_deadline;
 
 	if (apic->lapic_timer.expired_tscdeadline == 0)
 		return;
@@ -1521,28 +1551,7 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
 	if (guest_tsc < tsc_deadline)
 		__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
 
-	if (!apic->lapic_timer.timer_advance_adjust_done) {
-		/* too early */
-		if (guest_tsc < tsc_deadline) {
-			ns = (tsc_deadline - guest_tsc) * 1000000ULL;
-			do_div(ns, vcpu->arch.virtual_tsc_khz);
-			timer_advance_ns -= min((u32)ns,
-				timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
-		} else {
-		/* too late */
-			ns = (guest_tsc - tsc_deadline) * 1000000ULL;
-			do_div(ns, vcpu->arch.virtual_tsc_khz);
-			timer_advance_ns += min((u32)ns,
-				timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
-		}
-		if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
-			apic->lapic_timer.timer_advance_adjust_done = true;
-		if (unlikely(timer_advance_ns > 5000)) {
-			timer_advance_ns = 0;
-			apic->lapic_timer.timer_advance_adjust_done = true;
-		}
-		apic->lapic_timer.timer_advance_ns = timer_advance_ns;
-	}
+	adaptive_tune_timer_advancement(vcpu, guest_tsc, tsc_deadline);
 }
 
 static void start_sw_tscdeadline(struct kvm_lapic *apic)
-- 
2.7.4


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2/3] KVM: LAPIC: Fix lapic_timer_advance_ns parameter overflow
  2019-05-09 11:29 [PATCH 0/3] KVM: LAPIC: Optimize timer latency further Wanpeng Li
  2019-05-09 11:29 ` [PATCH 1/3] KVM: LAPIC: Extract adaptive tune timer advancement logic Wanpeng Li
@ 2019-05-09 11:29 ` Wanpeng Li
  2019-05-13 19:14   ` Sean Christopherson
  2019-05-09 11:29 ` [PATCH 3/3] KVM: LAPIC: Optimize timer latency further Wanpeng Li
  2 siblings, 1 reply; 11+ messages in thread
From: Wanpeng Li @ 2019-05-09 11:29 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Radim Krčmář,
	Sean Christopherson, Liran Alon

From: Wanpeng Li <wanpengli@tencent.com>

After commit c3941d9e0 (KVM: lapic: Allow user to disable adaptive tuning of 
timer advancement), '-1' enables adaptive tuning starting from default 
advancment of 1000ns. However, we should expose an int instead of an overflow 
uint module parameter.

Before patch:

/sys/module/kvm/parameters/lapic_timer_advance_ns:4294967295

After patch:

/sys/module/kvm/parameters/lapic_timer_advance_ns:-1

Fixes: c3941d9e0 (KVM: lapic: Allow user to disable adaptive tuning of timer advancement)
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/kvm/x86.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d75bb97..1d89cb9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -143,7 +143,7 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
  * tuning, i.e. allows priveleged userspace to set an exact advancement time.
  */
 static int __read_mostly lapic_timer_advance_ns = -1;
-module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
+module_param(lapic_timer_advance_ns, int, S_IRUGO | S_IWUSR);
 
 static bool __read_mostly vector_hashing = true;
 module_param(vector_hashing, bool, S_IRUGO);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 3/3] KVM: LAPIC: Optimize timer latency further
  2019-05-09 11:29 [PATCH 0/3] KVM: LAPIC: Optimize timer latency further Wanpeng Li
  2019-05-09 11:29 ` [PATCH 1/3] KVM: LAPIC: Extract adaptive tune timer advancement logic Wanpeng Li
  2019-05-09 11:29 ` [PATCH 2/3] KVM: LAPIC: Fix lapic_timer_advance_ns parameter overflow Wanpeng Li
@ 2019-05-09 11:29 ` Wanpeng Li
  2019-05-13 19:54   ` Sean Christopherson
  2 siblings, 1 reply; 11+ messages in thread
From: Wanpeng Li @ 2019-05-09 11:29 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Radim Krčmář,
	Sean Christopherson, Liran Alon

From: Wanpeng Li <wanpengli@tencent.com>

Advance lapic timer tries to hidden the hypervisor overhead between host 
timer fires and the guest awares the timer is fired. However, it just hidden 
the time between apic_timer_fn/handle_preemption_timer -> wait_lapic_expire, 
instead of the real position of vmentry which is mentioned in the orignial 
commit d0659d946be0 ("KVM: x86: add option to advance tscdeadline hrtimer 
expiration"). There is 700+ cpu cycles between the end of wait_lapic_expire 
and before world switch on my haswell desktop, it will be 2400+ cycles if 
vmentry_l1d_flush is tuned to always. 

This patch tries to narrow the last gap, it measures the time between 
the end of wait_lapic_expire and before world switch, we take this 
time into consideration when busy waiting, otherwise, the guest still 
awares the latency between wait_lapic_expire and world switch, we also 
consider this when adaptively tuning the timer advancement. The patch 
can reduce 50% latency (~1600+ cycles to ~800+ cycles on a haswell 
desktop) for kvm-unit-tests/tscdeadline_latency when testing busy waits.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/kvm/lapic.c   | 23 +++++++++++++++++++++--
 arch/x86/kvm/lapic.h   |  8 ++++++++
 arch/x86/kvm/vmx/vmx.c |  2 ++
 3 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e7a0660..01d3a87 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1545,13 +1545,19 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
 
 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
 	apic->lapic_timer.expired_tscdeadline = 0;
-	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+	guest_tsc = kvm_read_l1_tsc(vcpu, (apic->lapic_timer.measure_delay_done == 2) ?
+		rdtsc() + apic->lapic_timer.vmentry_delay : rdtsc());
 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
 
 	if (guest_tsc < tsc_deadline)
 		__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
 
 	adaptive_tune_timer_advancement(vcpu, guest_tsc, tsc_deadline);
+
+	if (!apic->lapic_timer.measure_delay_done) {
+		apic->lapic_timer.measure_delay_done = 1;
+		apic->lapic_timer.vmentry_delay = rdtsc();
+	}
 }
 
 static void start_sw_tscdeadline(struct kvm_lapic *apic)
@@ -1837,6 +1843,18 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
 	}
 }
 
+void kvm_lapic_measure_vmentry_delay(struct kvm_vcpu *vcpu)
+{
+	struct kvm_timer *ktimer = &vcpu->arch.apic->lapic_timer;
+
+	if (ktimer->measure_delay_done == 1) {
+		ktimer->vmentry_delay = rdtsc() -
+			ktimer->vmentry_delay;
+		ktimer->measure_delay_done = 2;
+	}
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_measure_vmentry_delay);
+
 int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 {
 	int ret = 0;
@@ -2318,7 +2336,8 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
 		apic->lapic_timer.timer_advance_ns = timer_advance_ns;
 		apic->lapic_timer.timer_advance_adjust_done = true;
 	}
-
+	apic->lapic_timer.vmentry_delay = 0;
+	apic->lapic_timer.measure_delay_done = 0;
 
 	/*
 	 * APIC is created enabled. This will prevent kvm_lapic_set_base from
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index d6d049b..f1d037b 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -35,6 +35,13 @@ struct kvm_timer {
 	atomic_t pending;			/* accumulated triggered timers */
 	bool hv_timer_in_use;
 	bool timer_advance_adjust_done;
+	/**
+	 * 0 unstart measure
+	 * 1 start record
+	 * 2 get delta
+	 */
+	u32 measure_delay_done;
+	u64 vmentry_delay;
 };
 
 struct kvm_lapic {
@@ -230,6 +237,7 @@ void kvm_lapic_switch_to_hv_timer(struct kvm_vcpu *vcpu);
 void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu);
 bool kvm_lapic_hv_timer_in_use(struct kvm_vcpu *vcpu);
 void kvm_lapic_restart_hv_timer(struct kvm_vcpu *vcpu);
+void kvm_lapic_measure_vmentry_delay(struct kvm_vcpu *vcpu);
 
 static inline enum lapic_mode kvm_apic_mode(u64 apic_base)
 {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9663d41..a939bf5 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6437,6 +6437,8 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.cr2 != read_cr2())
 		write_cr2(vcpu->arch.cr2);
 
+	kvm_lapic_measure_vmentry_delay(vcpu);
+
 	vmx->fail = __vmx_vcpu_run(vmx, (unsigned long *)&vcpu->arch.regs,
 				   vmx->loaded_vmcs->launched);
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/3] KVM: LAPIC: Fix lapic_timer_advance_ns parameter overflow
  2019-05-09 11:29 ` [PATCH 2/3] KVM: LAPIC: Fix lapic_timer_advance_ns parameter overflow Wanpeng Li
@ 2019-05-13 19:14   ` Sean Christopherson
  0 siblings, 0 replies; 11+ messages in thread
From: Sean Christopherson @ 2019-05-13 19:14 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: linux-kernel, kvm, Paolo Bonzini, Radim Krčmář,
	Liran Alon

On Thu, May 09, 2019 at 07:29:20PM +0800, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> After commit c3941d9e0 (KVM: lapic: Allow user to disable adaptive tuning of 
> timer advancement), '-1' enables adaptive tuning starting from default 
> advancment of 1000ns. However, we should expose an int instead of an overflow 
> uint module parameter.
> 
> Before patch:
> 
> /sys/module/kvm/parameters/lapic_timer_advance_ns:4294967295
> 
> After patch:
> 
> /sys/module/kvm/parameters/lapic_timer_advance_ns:-1
> 
> Fixes: c3941d9e0 (KVM: lapic: Allow user to disable adaptive tuning of timer advancement)
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> Cc: Liran Alon <liran.alon@oracle.com>
> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> ---

Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] KVM: LAPIC: Extract adaptive tune timer advancement logic
  2019-05-09 11:29 ` [PATCH 1/3] KVM: LAPIC: Extract adaptive tune timer advancement logic Wanpeng Li
@ 2019-05-13 19:39   ` Sean Christopherson
  2019-05-14  0:59     ` Wanpeng Li
  0 siblings, 1 reply; 11+ messages in thread
From: Sean Christopherson @ 2019-05-13 19:39 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: linux-kernel, kvm, Paolo Bonzini, Radim Krčmář,
	Liran Alon

On Thu, May 09, 2019 at 07:29:19PM +0800, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> Extract adaptive tune timer advancement logic to a single function.

Why?

> 
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> Cc: Liran Alon <liran.alon@oracle.com>
> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> ---
>  arch/x86/kvm/lapic.c | 57 ++++++++++++++++++++++++++++++----------------------
>  1 file changed, 33 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index bd13fdd..e7a0660 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1501,11 +1501,41 @@ static inline void __wait_lapic_expire(struct kvm_vcpu *vcpu, u64 guest_cycles)
>  	}
>  }
>  
> -void wait_lapic_expire(struct kvm_vcpu *vcpu)
> +static inline void adaptive_tune_timer_advancement(struct kvm_vcpu *vcpu,
> +				u64 guest_tsc, u64 tsc_deadline)
>  {
>  	struct kvm_lapic *apic = vcpu->arch.apic;
>  	u32 timer_advance_ns = apic->lapic_timer.timer_advance_ns;
> -	u64 guest_tsc, tsc_deadline, ns;
> +	u64 ns;
> +
> +	if (!apic->lapic_timer.timer_advance_adjust_done) {
> +			/* too early */
> +			if (guest_tsc < tsc_deadline) {
> +				ns = (tsc_deadline - guest_tsc) * 1000000ULL;
> +				do_div(ns, vcpu->arch.virtual_tsc_khz);
> +				timer_advance_ns -= min((u32)ns,
> +					timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
> +			} else {
> +			/* too late */
> +				ns = (guest_tsc - tsc_deadline) * 1000000ULL;
> +				do_div(ns, vcpu->arch.virtual_tsc_khz);
> +				timer_advance_ns += min((u32)ns,
> +					timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
> +			}
> +			if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
> +				apic->lapic_timer.timer_advance_adjust_done = true;
> +			if (unlikely(timer_advance_ns > 5000)) {
> +				timer_advance_ns = 0;
> +				apic->lapic_timer.timer_advance_adjust_done = true;
> +			}
> +			apic->lapic_timer.timer_advance_ns = timer_advance_ns;
> +		}

This whole block is indented too far.

> +}
> +
> +void wait_lapic_expire(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +	u64 guest_tsc, tsc_deadline;
>  
>  	if (apic->lapic_timer.expired_tscdeadline == 0)
>  		return;
> @@ -1521,28 +1551,7 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>  	if (guest_tsc < tsc_deadline)
>  		__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
>  
> -	if (!apic->lapic_timer.timer_advance_adjust_done) {
> -		/* too early */
> -		if (guest_tsc < tsc_deadline) {
> -			ns = (tsc_deadline - guest_tsc) * 1000000ULL;
> -			do_div(ns, vcpu->arch.virtual_tsc_khz);
> -			timer_advance_ns -= min((u32)ns,
> -				timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
> -		} else {
> -		/* too late */
> -			ns = (guest_tsc - tsc_deadline) * 1000000ULL;
> -			do_div(ns, vcpu->arch.virtual_tsc_khz);
> -			timer_advance_ns += min((u32)ns,
> -				timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
> -		}
> -		if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
> -			apic->lapic_timer.timer_advance_adjust_done = true;
> -		if (unlikely(timer_advance_ns > 5000)) {
> -			timer_advance_ns = 0;
> -			apic->lapic_timer.timer_advance_adjust_done = true;
> -		}
> -		apic->lapic_timer.timer_advance_ns = timer_advance_ns;
> -	}
> +	adaptive_tune_timer_advancement(vcpu, guest_tsc, tsc_deadline);
>  }
>  
>  static void start_sw_tscdeadline(struct kvm_lapic *apic)
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] KVM: LAPIC: Optimize timer latency further
  2019-05-09 11:29 ` [PATCH 3/3] KVM: LAPIC: Optimize timer latency further Wanpeng Li
@ 2019-05-13 19:54   ` Sean Christopherson
  2019-05-14  1:45     ` Wanpeng Li
  0 siblings, 1 reply; 11+ messages in thread
From: Sean Christopherson @ 2019-05-13 19:54 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: linux-kernel, kvm, Paolo Bonzini, Radim Krčmář,
	Liran Alon

On Thu, May 09, 2019 at 07:29:21PM +0800, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> Advance lapic timer tries to hidden the hypervisor overhead between host 
> timer fires and the guest awares the timer is fired. However, it just hidden 
> the time between apic_timer_fn/handle_preemption_timer -> wait_lapic_expire, 
> instead of the real position of vmentry which is mentioned in the orignial 
> commit d0659d946be0 ("KVM: x86: add option to advance tscdeadline hrtimer 
> expiration"). There is 700+ cpu cycles between the end of wait_lapic_expire 
> and before world switch on my haswell desktop, it will be 2400+ cycles if 
> vmentry_l1d_flush is tuned to always. 
> 
> This patch tries to narrow the last gap, it measures the time between 
> the end of wait_lapic_expire and before world switch, we take this 
> time into consideration when busy waiting, otherwise, the guest still 
> awares the latency between wait_lapic_expire and world switch, we also 
> consider this when adaptively tuning the timer advancement. The patch 
> can reduce 50% latency (~1600+ cycles to ~800+ cycles on a haswell 
> desktop) for kvm-unit-tests/tscdeadline_latency when testing busy waits.
> 
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> Cc: Liran Alon <liran.alon@oracle.com>
> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> ---
>  arch/x86/kvm/lapic.c   | 23 +++++++++++++++++++++--
>  arch/x86/kvm/lapic.h   |  8 ++++++++
>  arch/x86/kvm/vmx/vmx.c |  2 ++
>  3 files changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index e7a0660..01d3a87 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1545,13 +1545,19 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>  
>  	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
>  	apic->lapic_timer.expired_tscdeadline = 0;
> -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +	guest_tsc = kvm_read_l1_tsc(vcpu, (apic->lapic_timer.measure_delay_done == 2) ?
> +		rdtsc() + apic->lapic_timer.vmentry_delay : rdtsc());
>  	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
>  
>  	if (guest_tsc < tsc_deadline)
>  		__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
>  
>  	adaptive_tune_timer_advancement(vcpu, guest_tsc, tsc_deadline);
> +
> +	if (!apic->lapic_timer.measure_delay_done) {
> +		apic->lapic_timer.measure_delay_done = 1;
> +		apic->lapic_timer.vmentry_delay = rdtsc();
> +	}
>  }
>  
>  static void start_sw_tscdeadline(struct kvm_lapic *apic)
> @@ -1837,6 +1843,18 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
>  	}
>  }
>  
> +void kvm_lapic_measure_vmentry_delay(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_timer *ktimer = &vcpu->arch.apic->lapic_timer;

This will #GP if the APIC is not in-kernel, i.e. @apic is NULL.

> +
> +	if (ktimer->measure_delay_done == 1) {
> +		ktimer->vmentry_delay = rdtsc() -
> +			ktimer->vmentry_delay;
> +		ktimer->measure_delay_done = 2;

Measuring the delay a single time is bound to result in random outliers,
e.g. if an NMI happens to occur after wait_lapic_expire().

Rather than reinvent the wheel, can we simply move the call to
wait_lapic_expire() into vmx.c and svm.c?  For VMX we'd probably want to
support the advancement if enable_unrestricted_guest=true so that we avoid
the emulation_required case, but other than that I don't see anything that
requires wait_lapic_expire() to be called where it is.

> +	}
> +}
> +EXPORT_SYMBOL_GPL(kvm_lapic_measure_vmentry_delay);
> +
>  int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>  {
>  	int ret = 0;
> @@ -2318,7 +2336,8 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
>  		apic->lapic_timer.timer_advance_ns = timer_advance_ns;
>  		apic->lapic_timer.timer_advance_adjust_done = true;
>  	}
> -
> +	apic->lapic_timer.vmentry_delay = 0;
> +	apic->lapic_timer.measure_delay_done = 0;
>  
>  	/*
>  	 * APIC is created enabled. This will prevent kvm_lapic_set_base from
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index d6d049b..f1d037b 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -35,6 +35,13 @@ struct kvm_timer {
>  	atomic_t pending;			/* accumulated triggered timers */
>  	bool hv_timer_in_use;
>  	bool timer_advance_adjust_done;
> +	/**
> +	 * 0 unstart measure
> +	 * 1 start record
> +	 * 2 get delta
> +	 */
> +	u32 measure_delay_done;
> +	u64 vmentry_delay;
>  };
>  
>  struct kvm_lapic {
> @@ -230,6 +237,7 @@ void kvm_lapic_switch_to_hv_timer(struct kvm_vcpu *vcpu);
>  void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu);
>  bool kvm_lapic_hv_timer_in_use(struct kvm_vcpu *vcpu);
>  void kvm_lapic_restart_hv_timer(struct kvm_vcpu *vcpu);
> +void kvm_lapic_measure_vmentry_delay(struct kvm_vcpu *vcpu);
>  
>  static inline enum lapic_mode kvm_apic_mode(u64 apic_base)
>  {
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 9663d41..a939bf5 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6437,6 +6437,8 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  	if (vcpu->arch.cr2 != read_cr2())
>  		write_cr2(vcpu->arch.cr2);
>  
> +	kvm_lapic_measure_vmentry_delay(vcpu);

This should be wrapped in an unlikely of some form given that it happens
literally once out of thousands/millions runs.

> +
>  	vmx->fail = __vmx_vcpu_run(vmx, (unsigned long *)&vcpu->arch.regs,
>  				   vmx->loaded_vmcs->launched);
>  
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] KVM: LAPIC: Extract adaptive tune timer advancement logic
  2019-05-13 19:39   ` Sean Christopherson
@ 2019-05-14  0:59     ` Wanpeng Li
  0 siblings, 0 replies; 11+ messages in thread
From: Wanpeng Li @ 2019-05-14  0:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Paolo Bonzini, Radim Krčmář, Liran Alon

On Tue, 14 May 2019 at 03:39, Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, May 09, 2019 at 07:29:19PM +0800, Wanpeng Li wrote:
> > From: Wanpeng Li <wanpengli@tencent.com>
> >
> > Extract adaptive tune timer advancement logic to a single function.
>
> Why?

Just because the function wait_lapic_expire() is too complex now.

Regards,
Wanpeng Li

>
> >
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Radim Krčmář <rkrcmar@redhat.com>
> > Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> > Cc: Liran Alon <liran.alon@oracle.com>
> > Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> > ---
> >  arch/x86/kvm/lapic.c | 57 ++++++++++++++++++++++++++++++----------------------
> >  1 file changed, 33 insertions(+), 24 deletions(-)
> >
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index bd13fdd..e7a0660 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -1501,11 +1501,41 @@ static inline void __wait_lapic_expire(struct kvm_vcpu *vcpu, u64 guest_cycles)
> >       }
> >  }
> >
> > -void wait_lapic_expire(struct kvm_vcpu *vcpu)
> > +static inline void adaptive_tune_timer_advancement(struct kvm_vcpu *vcpu,
> > +                             u64 guest_tsc, u64 tsc_deadline)
> >  {
> >       struct kvm_lapic *apic = vcpu->arch.apic;
> >       u32 timer_advance_ns = apic->lapic_timer.timer_advance_ns;
> > -     u64 guest_tsc, tsc_deadline, ns;
> > +     u64 ns;
> > +
> > +     if (!apic->lapic_timer.timer_advance_adjust_done) {
> > +                     /* too early */
> > +                     if (guest_tsc < tsc_deadline) {
> > +                             ns = (tsc_deadline - guest_tsc) * 1000000ULL;
> > +                             do_div(ns, vcpu->arch.virtual_tsc_khz);
> > +                             timer_advance_ns -= min((u32)ns,
> > +                                     timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
> > +                     } else {
> > +                     /* too late */
> > +                             ns = (guest_tsc - tsc_deadline) * 1000000ULL;
> > +                             do_div(ns, vcpu->arch.virtual_tsc_khz);
> > +                             timer_advance_ns += min((u32)ns,
> > +                                     timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
> > +                     }
> > +                     if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
> > +                             apic->lapic_timer.timer_advance_adjust_done = true;
> > +                     if (unlikely(timer_advance_ns > 5000)) {
> > +                             timer_advance_ns = 0;
> > +                             apic->lapic_timer.timer_advance_adjust_done = true;
> > +                     }
> > +                     apic->lapic_timer.timer_advance_ns = timer_advance_ns;
> > +             }
>
> This whole block is indented too far.
>
> > +}
> > +
> > +void wait_lapic_expire(struct kvm_vcpu *vcpu)
> > +{
> > +     struct kvm_lapic *apic = vcpu->arch.apic;
> > +     u64 guest_tsc, tsc_deadline;
> >
> >       if (apic->lapic_timer.expired_tscdeadline == 0)
> >               return;
> > @@ -1521,28 +1551,7 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> >       if (guest_tsc < tsc_deadline)
> >               __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
> >
> > -     if (!apic->lapic_timer.timer_advance_adjust_done) {
> > -             /* too early */
> > -             if (guest_tsc < tsc_deadline) {
> > -                     ns = (tsc_deadline - guest_tsc) * 1000000ULL;
> > -                     do_div(ns, vcpu->arch.virtual_tsc_khz);
> > -                     timer_advance_ns -= min((u32)ns,
> > -                             timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
> > -             } else {
> > -             /* too late */
> > -                     ns = (guest_tsc - tsc_deadline) * 1000000ULL;
> > -                     do_div(ns, vcpu->arch.virtual_tsc_khz);
> > -                     timer_advance_ns += min((u32)ns,
> > -                             timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
> > -             }
> > -             if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
> > -                     apic->lapic_timer.timer_advance_adjust_done = true;
> > -             if (unlikely(timer_advance_ns > 5000)) {
> > -                     timer_advance_ns = 0;
> > -                     apic->lapic_timer.timer_advance_adjust_done = true;
> > -             }
> > -             apic->lapic_timer.timer_advance_ns = timer_advance_ns;
> > -     }
> > +     adaptive_tune_timer_advancement(vcpu, guest_tsc, tsc_deadline);
> >  }
> >
> >  static void start_sw_tscdeadline(struct kvm_lapic *apic)
> > --
> > 2.7.4
> >

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] KVM: LAPIC: Optimize timer latency further
  2019-05-13 19:54   ` Sean Christopherson
@ 2019-05-14  1:45     ` Wanpeng Li
  2019-05-14 10:56       ` Wanpeng Li
  0 siblings, 1 reply; 11+ messages in thread
From: Wanpeng Li @ 2019-05-14  1:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Paolo Bonzini, Radim Krčmář, Liran Alon

On Tue, 14 May 2019 at 03:54, Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, May 09, 2019 at 07:29:21PM +0800, Wanpeng Li wrote:
> > From: Wanpeng Li <wanpengli@tencent.com>
> >
> > Advance lapic timer tries to hidden the hypervisor overhead between host
> > timer fires and the guest awares the timer is fired. However, it just hidden
> > the time between apic_timer_fn/handle_preemption_timer -> wait_lapic_expire,
> > instead of the real position of vmentry which is mentioned in the orignial
> > commit d0659d946be0 ("KVM: x86: add option to advance tscdeadline hrtimer
> > expiration"). There is 700+ cpu cycles between the end of wait_lapic_expire
> > and before world switch on my haswell desktop, it will be 2400+ cycles if
> > vmentry_l1d_flush is tuned to always.
> >
> > This patch tries to narrow the last gap, it measures the time between
> > the end of wait_lapic_expire and before world switch, we take this
> > time into consideration when busy waiting, otherwise, the guest still
> > awares the latency between wait_lapic_expire and world switch, we also
> > consider this when adaptively tuning the timer advancement. The patch
> > can reduce 50% latency (~1600+ cycles to ~800+ cycles on a haswell
> > desktop) for kvm-unit-tests/tscdeadline_latency when testing busy waits.
> >
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Radim Krčmář <rkrcmar@redhat.com>
> > Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> > Cc: Liran Alon <liran.alon@oracle.com>
> > Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> > ---
> >  arch/x86/kvm/lapic.c   | 23 +++++++++++++++++++++--
> >  arch/x86/kvm/lapic.h   |  8 ++++++++
> >  arch/x86/kvm/vmx/vmx.c |  2 ++
> >  3 files changed, 31 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index e7a0660..01d3a87 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -1545,13 +1545,19 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> >
> >       tsc_deadline = apic->lapic_timer.expired_tscdeadline;
> >       apic->lapic_timer.expired_tscdeadline = 0;
> > -     guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > +     guest_tsc = kvm_read_l1_tsc(vcpu, (apic->lapic_timer.measure_delay_done == 2) ?
> > +             rdtsc() + apic->lapic_timer.vmentry_delay : rdtsc());
> >       trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
> >
> >       if (guest_tsc < tsc_deadline)
> >               __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
> >
> >       adaptive_tune_timer_advancement(vcpu, guest_tsc, tsc_deadline);
> > +
> > +     if (!apic->lapic_timer.measure_delay_done) {
> > +             apic->lapic_timer.measure_delay_done = 1;
> > +             apic->lapic_timer.vmentry_delay = rdtsc();
> > +     }
> >  }
> >
> >  static void start_sw_tscdeadline(struct kvm_lapic *apic)
> > @@ -1837,6 +1843,18 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
> >       }
> >  }
> >
> > +void kvm_lapic_measure_vmentry_delay(struct kvm_vcpu *vcpu)
> > +{
> > +     struct kvm_timer *ktimer = &vcpu->arch.apic->lapic_timer;
>
> This will #GP if the APIC is not in-kernel, i.e. @apic is NULL.
>
> > +
> > +     if (ktimer->measure_delay_done == 1) {
> > +             ktimer->vmentry_delay = rdtsc() -
> > +                     ktimer->vmentry_delay;
> > +             ktimer->measure_delay_done = 2;
>
> Measuring the delay a single time is bound to result in random outliers,
> e.g. if an NMI happens to occur after wait_lapic_expire().
>
> Rather than reinvent the wheel, can we simply move the call to
> wait_lapic_expire() into vmx.c and svm.c?  For VMX we'd probably want to
> support the advancement if enable_unrestricted_guest=true so that we avoid
> the emulation_required case, but other than that I don't see anything that
> requires wait_lapic_expire() to be called where it is.

I also considered to move wait_lapic_expire() into vmx.c and svm.c
before, what do you think, Paolo, Radim?

Regards,
Wanpeng Li

>
> > +     }
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_lapic_measure_vmentry_delay);
> > +
> >  int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> >  {
> >       int ret = 0;
> > @@ -2318,7 +2336,8 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
> >               apic->lapic_timer.timer_advance_ns = timer_advance_ns;
> >               apic->lapic_timer.timer_advance_adjust_done = true;
> >       }
> > -
> > +     apic->lapic_timer.vmentry_delay = 0;
> > +     apic->lapic_timer.measure_delay_done = 0;
> >
> >       /*
> >        * APIC is created enabled. This will prevent kvm_lapic_set_base from
> > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > index d6d049b..f1d037b 100644
> > --- a/arch/x86/kvm/lapic.h
> > +++ b/arch/x86/kvm/lapic.h
> > @@ -35,6 +35,13 @@ struct kvm_timer {
> >       atomic_t pending;                       /* accumulated triggered timers */
> >       bool hv_timer_in_use;
> >       bool timer_advance_adjust_done;
> > +     /**
> > +      * 0 unstart measure
> > +      * 1 start record
> > +      * 2 get delta
> > +      */
> > +     u32 measure_delay_done;
> > +     u64 vmentry_delay;
> >  };
> >
> >  struct kvm_lapic {
> > @@ -230,6 +237,7 @@ void kvm_lapic_switch_to_hv_timer(struct kvm_vcpu *vcpu);
> >  void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu);
> >  bool kvm_lapic_hv_timer_in_use(struct kvm_vcpu *vcpu);
> >  void kvm_lapic_restart_hv_timer(struct kvm_vcpu *vcpu);
> > +void kvm_lapic_measure_vmentry_delay(struct kvm_vcpu *vcpu);
> >
> >  static inline enum lapic_mode kvm_apic_mode(u64 apic_base)
> >  {
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 9663d41..a939bf5 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -6437,6 +6437,8 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
> >       if (vcpu->arch.cr2 != read_cr2())
> >               write_cr2(vcpu->arch.cr2);
> >
> > +     kvm_lapic_measure_vmentry_delay(vcpu);
>
> This should be wrapped in an unlikely of some form given that it happens
> literally once out of thousands/millions runs.
>
> > +
> >       vmx->fail = __vmx_vcpu_run(vmx, (unsigned long *)&vcpu->arch.regs,
> >                                  vmx->loaded_vmcs->launched);
> >
> > --
> > 2.7.4
> >

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] KVM: LAPIC: Optimize timer latency further
  2019-05-14  1:45     ` Wanpeng Li
@ 2019-05-14 10:56       ` Wanpeng Li
  2019-05-14 16:45         ` Sean Christopherson
  0 siblings, 1 reply; 11+ messages in thread
From: Wanpeng Li @ 2019-05-14 10:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Paolo Bonzini, Radim Krčmář, Liran Alon

On Tue, 14 May 2019 at 09:45, Wanpeng Li <kernellwp@gmail.com> wrote:
>
> On Tue, 14 May 2019 at 03:54, Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > On Thu, May 09, 2019 at 07:29:21PM +0800, Wanpeng Li wrote:
> > > From: Wanpeng Li <wanpengli@tencent.com>
> > >
> > > Advance lapic timer tries to hidden the hypervisor overhead between host
> > > timer fires and the guest awares the timer is fired. However, it just hidden
> > > the time between apic_timer_fn/handle_preemption_timer -> wait_lapic_expire,
> > > instead of the real position of vmentry which is mentioned in the orignial
> > > commit d0659d946be0 ("KVM: x86: add option to advance tscdeadline hrtimer
> > > expiration"). There is 700+ cpu cycles between the end of wait_lapic_expire
> > > and before world switch on my haswell desktop, it will be 2400+ cycles if
> > > vmentry_l1d_flush is tuned to always.
> > >
> > > This patch tries to narrow the last gap, it measures the time between
> > > the end of wait_lapic_expire and before world switch, we take this
> > > time into consideration when busy waiting, otherwise, the guest still
> > > awares the latency between wait_lapic_expire and world switch, we also
> > > consider this when adaptively tuning the timer advancement. The patch
> > > can reduce 50% latency (~1600+ cycles to ~800+ cycles on a haswell
> > > desktop) for kvm-unit-tests/tscdeadline_latency when testing busy waits.
> > >
> > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > Cc: Radim Krčmář <rkrcmar@redhat.com>
> > > Cc: Sean Christopherson <sean.j.christopherson@intel.com>
> > > Cc: Liran Alon <liran.alon@oracle.com>
> > > Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> > > ---
> > >  arch/x86/kvm/lapic.c   | 23 +++++++++++++++++++++--
> > >  arch/x86/kvm/lapic.h   |  8 ++++++++
> > >  arch/x86/kvm/vmx/vmx.c |  2 ++
> > >  3 files changed, 31 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > > index e7a0660..01d3a87 100644
> > > --- a/arch/x86/kvm/lapic.c
> > > +++ b/arch/x86/kvm/lapic.c
> > > @@ -1545,13 +1545,19 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> > >
> > >       tsc_deadline = apic->lapic_timer.expired_tscdeadline;
> > >       apic->lapic_timer.expired_tscdeadline = 0;
> > > -     guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > > +     guest_tsc = kvm_read_l1_tsc(vcpu, (apic->lapic_timer.measure_delay_done == 2) ?
> > > +             rdtsc() + apic->lapic_timer.vmentry_delay : rdtsc());
> > >       trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
> > >
> > >       if (guest_tsc < tsc_deadline)
> > >               __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
> > >
> > >       adaptive_tune_timer_advancement(vcpu, guest_tsc, tsc_deadline);
> > > +
> > > +     if (!apic->lapic_timer.measure_delay_done) {
> > > +             apic->lapic_timer.measure_delay_done = 1;
> > > +             apic->lapic_timer.vmentry_delay = rdtsc();
> > > +     }
> > >  }
> > >
> > >  static void start_sw_tscdeadline(struct kvm_lapic *apic)
> > > @@ -1837,6 +1843,18 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
> > >       }
> > >  }
> > >
> > > +void kvm_lapic_measure_vmentry_delay(struct kvm_vcpu *vcpu)
> > > +{
> > > +     struct kvm_timer *ktimer = &vcpu->arch.apic->lapic_timer;
> >
> > This will #GP if the APIC is not in-kernel, i.e. @apic is NULL.
> >
> > > +
> > > +     if (ktimer->measure_delay_done == 1) {
> > > +             ktimer->vmentry_delay = rdtsc() -
> > > +                     ktimer->vmentry_delay;
> > > +             ktimer->measure_delay_done = 2;
> >
> > Measuring the delay a single time is bound to result in random outliers,
> > e.g. if an NMI happens to occur after wait_lapic_expire().
> >
> > Rather than reinvent the wheel, can we simply move the call to
> > wait_lapic_expire() into vmx.c and svm.c?  For VMX we'd probably want to
> > support the advancement if enable_unrestricted_guest=true so that we avoid
> > the emulation_required case, but other than that I don't see anything that
> > requires wait_lapic_expire() to be called where it is.
>
> I also considered to move wait_lapic_expire() into vmx.c and svm.c
> before, what do you think, Paolo, Radim?

However, guest_enter_irqoff() also prevents this. Otherwise, we will
account busy wait time as guest time. How about sampling several times
and get the average value or conservative min value to handle Sean's
concern?

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] KVM: LAPIC: Optimize timer latency further
  2019-05-14 10:56       ` Wanpeng Li
@ 2019-05-14 16:45         ` Sean Christopherson
  0 siblings, 0 replies; 11+ messages in thread
From: Sean Christopherson @ 2019-05-14 16:45 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: LKML, kvm, Paolo Bonzini, Radim Krčmář, Liran Alon

On Tue, May 14, 2019 at 06:56:04PM +0800, Wanpeng Li wrote:
> On Tue, 14 May 2019 at 09:45, Wanpeng Li <kernellwp@gmail.com> wrote:
> >
> > On Tue, 14 May 2019 at 03:54, Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> > > Rather than reinvent the wheel, can we simply move the call to
> > > wait_lapic_expire() into vmx.c and svm.c?  For VMX we'd probably want to
> > > support the advancement if enable_unrestricted_guest=true so that we avoid
> > > the emulation_required case, but other than that I don't see anything that
> > > requires wait_lapic_expire() to be called where it is.
> >
> > I also considered to move wait_lapic_expire() into vmx.c and svm.c
> > before, what do you think, Paolo, Radim?
> 
> However, guest_enter_irqoff() also prevents this. Otherwise, we will
> account busy wait time as guest time. How about sampling several times
> and get the average value or conservative min value to handle Sean's
> concern?

Hmm, looking at the history, wait_lapic_expire() was originally called
immediately before kvm_x86_ops->run()[1].  The call was moved above
guest_enter_irqoff() because of its tracepoint, which violated the RCU
extended quiescent state invoked by guest_enter_irqoff()[2][3].  In
other words, I don't think there is a fundamental issue with accounting
the busy wait time to the guest rather than the host.

Assuming the tracepoint was added to help tune the advancement time, I
think we can simply remove the tracepoint, which would allow moving
wait_lapic_expire().  Now that the advancement time is tracked per-vCPU,
realizing a change in the advancement time requires creating a new VM.
For all intents and purposes this makes it impractical to hand tune the
advancement in real time using the tracepoint as the feedback mechanism.

If we want to expose the per-vCPU advancement time to the user, a debugfs
entry is likely sufficient given that the advancement time is
automatically adjusted.

[1] Commit d0659d946be0 ("KVM: x86: add option to advance tscdeadline hrtimer expiration")
[2] Commit 8b89fe1f6c43 ("kvm: x86: move tracepoints outside extended quiescent state")
[3] https://patchwork.kernel.org/patch/7821111/

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-05-14 16:45 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-09 11:29 [PATCH 0/3] KVM: LAPIC: Optimize timer latency further Wanpeng Li
2019-05-09 11:29 ` [PATCH 1/3] KVM: LAPIC: Extract adaptive tune timer advancement logic Wanpeng Li
2019-05-13 19:39   ` Sean Christopherson
2019-05-14  0:59     ` Wanpeng Li
2019-05-09 11:29 ` [PATCH 2/3] KVM: LAPIC: Fix lapic_timer_advance_ns parameter overflow Wanpeng Li
2019-05-13 19:14   ` Sean Christopherson
2019-05-09 11:29 ` [PATCH 3/3] KVM: LAPIC: Optimize timer latency further Wanpeng Li
2019-05-13 19:54   ` Sean Christopherson
2019-05-14  1:45     ` Wanpeng Li
2019-05-14 10:56       ` Wanpeng Li
2019-05-14 16:45         ` Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).