LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RFC] Question about enable doorbell irq and halt_poll process
@ 2019-03-19 13:25 Tangnianyao (ICT)
  2019-03-20 14:31 ` Heyi Guo
  2019-03-20 17:02 ` Marc Zyngier
  0 siblings, 2 replies; 9+ messages in thread
From: Tangnianyao (ICT) @ 2019-03-19 13:25 UTC (permalink / raw)
  To: kvm, kvmarm, linux-kernel, Linuxarm; +Cc: guoheyi, wanghaibin.wang

Hi, all

Using gicv4, when guest is waiting for irq, it sends wfi and traps to kvm.
When vlpi is forwarded to PE after its_vpe_deschedule, before halt_poll in
kvm_vcpu_block, halt_poll may increase latency for this vlpi getting to guest.
In halt_poll process, it checks if there's pending irq for vcpu using pending_last.
However, doorbell is not enable at this moment and vlpi or doorbell can not set
pending_last true, to stop halt_poll. It will run until halt_poll time ends, if
there's no other physical irq coming in the meantime. And then vcpu is scheduled out.
This pending vlpi has to wait for vcpu getting schedule in next time.

Should we enable doorbell before halt_poll process ?


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Question about enable doorbell irq and halt_poll process
  2019-03-19 13:25 [RFC] Question about enable doorbell irq and halt_poll process Tangnianyao (ICT)
@ 2019-03-20 14:31 ` Heyi Guo
  2019-03-20 17:02 ` Marc Zyngier
  1 sibling, 0 replies; 9+ messages in thread
From: Heyi Guo @ 2019-03-20 14:31 UTC (permalink / raw)
  To: Tangnianyao (ICT), kvm, kvmarm, linux-kernel, Linuxarm
  Cc: wanghaibin.wang, Marc Zyngier, Christoffer Dall

+cc Marc and Christoffer...


On 2019/3/19 21:25, Tangnianyao (ICT) wrote:
> Hi, all
>
> Using gicv4, when guest is waiting for irq, it sends wfi and traps to kvm.
> When vlpi is forwarded to PE after its_vpe_deschedule, before halt_poll in
> kvm_vcpu_block, halt_poll may increase latency for this vlpi getting to guest.
> In halt_poll process, it checks if there's pending irq for vcpu using pending_last.
> However, doorbell is not enable at this moment and vlpi or doorbell can not set
> pending_last true, to stop halt_poll. It will run until halt_poll time ends, if
> there's no other physical irq coming in the meantime. And then vcpu is scheduled out.
> This pending vlpi has to wait for vcpu getting schedule in next time.
>
> Should we enable doorbell before halt_poll process ?
>
>
> .
>



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Question about enable doorbell irq and halt_poll process
  2019-03-19 13:25 [RFC] Question about enable doorbell irq and halt_poll process Tangnianyao (ICT)
  2019-03-20 14:31 ` Heyi Guo
@ 2019-03-20 17:02 ` Marc Zyngier
       [not found]   ` <5df934fd-06d5-55f2-68a5-6f4985e4ac1b@huawei.com>
  1 sibling, 1 reply; 9+ messages in thread
From: Marc Zyngier @ 2019-03-20 17:02 UTC (permalink / raw)
  To: Tangnianyao (ICT)
  Cc: kvm, kvmarm, linux-kernel, Linuxarm, guoheyi, Christoffer Dall

On Tue, 19 Mar 2019 21:25:47 +0800
"Tangnianyao (ICT)" <tangnianyao@huawei.com> wrote:

> Hi, all
> 
> Using gicv4, when guest is waiting for irq, it sends wfi and traps to kvm.
> When vlpi is forwarded to PE after its_vpe_deschedule, before halt_poll in
> kvm_vcpu_block, halt_poll may increase latency for this vlpi getting to guest.
> In halt_poll process, it checks if there's pending irq for vcpu using pending_last.
> However, doorbell is not enable at this moment and vlpi or doorbell can not set
> pending_last true, to stop halt_poll. It will run until halt_poll time ends, if
> there's no other physical irq coming in the meantime. And then vcpu is scheduled out.
> This pending vlpi has to wait for vcpu getting schedule in next time.
> 
> Should we enable doorbell before halt_poll process ?

Enabling doorbells can be quite expensive. Depending on the HW, this is
either:

- a write to memory (+DSB, potential cache maintenance), a write to the
  INVLPI register, and a poll of the SYNC register
- a write to memory (+DSB, potential cache maintenance), potentially
  a string of DISCARD+SYNC+MAPI+SYNC commands, and an INV+SYNC command

Frankly, you want to be careful with that. I'd rather enable them late
and have a chance of not blocking because of another (virtual)
interrupt, which saves us the doorbell business.

I wonder if you wouldn't be in a better position by drastically
reducing halt_poll_ns for vcpu that can have directly injected
interrupts.

In any case, this is something that we should measure, not guess.

Thanks,

	M.
-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Question about enable doorbell irq and halt_poll process
       [not found]     ` <86zhpc66jl.wl-marc.zyngier@arm.com>
@ 2019-04-04 10:07       ` Tangnianyao (ICT)
  2019-04-04 10:59         ` Marc Zyngier
  0 siblings, 1 reply; 9+ messages in thread
From: Tangnianyao (ICT) @ 2019-04-04 10:07 UTC (permalink / raw)
  To: Marc Zyngier, kvmarm, linux-kernel, linuxarm, guoheyi, kvm, xuwei5



On 2019/3/30 17:43, Marc Zyngier wrote:

Hi, Marc

> On Sat, 30 Mar 2019 08:42:58 +0000,
> "Tangnianyao (ICT)" <tangnianyao@huawei.com> wrote:
>>
>> Hi, Marc
>>
>> On 2019/3/21 1:02, Marc Zyngier Wrote:
>>> On Tue, 19 Mar 2019 21:25:47 +0800
>>> "Tangnianyao (ICT)" <tangnianyao@huawei.com> wrote:
>>>
>>>> Hi, all
>>>>
>>>> Using gicv4, when guest is waiting for irq, it sends wfi and traps to kvm.
>>>> When vlpi is forwarded to PE after its_vpe_deschedule, before halt_poll in
>>>> kvm_vcpu_block, halt_poll may increase latency for this vlpi getting to guest.
>>>> In halt_poll process, it checks if there's pending irq for vcpu using pending_last.
>>>> However, doorbell is not enable at this moment and vlpi or doorbell can not set
>>>> pending_last true, to stop halt_poll. It will run until halt_poll time ends, if
>>>> there's no other physical irq coming in the meantime. And then vcpu is scheduled out.
>>>> This pending vlpi has to wait for vcpu getting schedule in next time.
>>>>
>>>> Should we enable doorbell before halt_poll process ?
>>>
>>> Enabling doorbells can be quite expensive. Depending on the HW, this is
>>> either:
>>>
>>> - a write to memory (+DSB, potential cache maintenance), a write to the
>>>   INVLPI register, and a poll of the SYNC register
>>> - a write to memory (+DSB, potential cache maintenance), potentially
>>>   a string of DISCARD+SYNC+MAPI+SYNC commands, and an INV+SYNC command
>>>
>> I have tested average cost of kvm_vgic_v4_enable_doorbell in our machine.
>> When gic_rdists->has_direct_lpi is 1, it costs 0.35 us.
>> When gic_rdists->has_direct_lpi is 0, it costs 1.4 us.
> 
> This looks pretty low. Which HW is that on? How about on something
> like D05?

I tested it on D06.
D05 doesn't not support gicv4 and I haven't tested on D05.

> 
>> Compared to default halt_poll_ns, 500000ns, enabling doorbells is not
>> large at all.
> 
> Sure. But I'm not sure this is a universal figure.
> 
>>
>>> Frankly, you want to be careful with that. I'd rather enable them late
>>> and have a chance of not blocking because of another (virtual)
>>> interrupt, which saves us the doorbell business.
>>>
>>> I wonder if you wouldn't be in a better position by drastically
>>> reducing halt_poll_ns for vcpu that can have directly injected
>>> interrupts.
>>>
>>
>> If we set halt_poll_ns to a small value, the chance of
>> not blocking and vcpu scheduled out becomes larger. The cost
>> of vcpu scheduled out is quite expensive.
>> In many cases, one pcpu is assigned to only
>> one vcpu, and halt_poll_ns is set quite large, to increase
>> chance of halt_poll process got terminated. This is the
>> reason we want to set halt_poll_ns large. And We hope vlpi
>> stop halt_poll process in time.
> 
> Fair enough. It is certainly realistic to strictly partition the
> system when GICv4 is in use, so I can see some potential benefit.
> 
>> Though it will check whether vcpu is runnable again by
>> kvm_vcpu_check_block. Vlpi can prevent scheduling vcpu out.
>> However it's somewhat later if halt_poll_ns is larger.
>>
>>> In any case, this is something that we should measure, not guess.
> 
> Please post results of realistic benchmarks that we can reproduce,
> with and without this change. I'm willing to entertain the idea, but I
> need more than just a vague number.
> 
> Thanks,
> 
> 	M.
> 

I tested it with and without change (patch attached in the last).
halt_poll_ns is keep default, 500000ns.
I have merged the patch "arm64: KVM: Always set ICH_HCR_EL2.EN if GICv4 is enabled"
to my test kernel, to make sure ICH_HCR_EL2.EN=1 in guest.

netperf result:
D06 as server, intel 8180 server as client
with change:
package 512 bytes - 5400 Mbits/s
package 64 bytes - 740 Mbits/s
without change:
package 512 bytes - 5000 Mbits/s
package 64 bytes - 710 Mbits/s

Also I have tested D06 as client, intel machine as server,
with the change, the result remains the same.


diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 55fe8e2..0f56904 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2256,6 +2256,16 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 	if (vcpu->halt_poll_ns) {
 		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);

+#ifdef CONFIG_ARM64
+		/*
+		 * When using gicv4, enable doorbell before halt pool wait
+		 * so that direct vlpi can stop halt poll.
+		 */
+		if (vcpu->arch.vgic_cpu.vgic_v3.its_vpe.its_vm) {
+			kvm_vgic_v4_enable_doorbell(vcpu);
+		}
+#endif
+
 		++vcpu->stat.halt_attempted_poll;
 		do {
 			/*


Thanks,
Nianyao Tang		





^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Question about enable doorbell irq and halt_poll process
  2019-04-04 10:07       ` Tangnianyao (ICT)
@ 2019-04-04 10:59         ` Marc Zyngier
  2019-04-23  7:44           ` Tangnianyao (ICT)
  0 siblings, 1 reply; 9+ messages in thread
From: Marc Zyngier @ 2019-04-04 10:59 UTC (permalink / raw)
  To: Tangnianyao (ICT); +Cc: kvmarm, linux-kernel, linuxarm, guoheyi, kvm, xuwei5

On Thu, 04 Apr 2019 11:07:59 +0100,
"Tangnianyao (ICT)" <tangnianyao@huawei.com> wrote:
> 
> 
> 
> On 2019/3/30 17:43, Marc Zyngier wrote:
> 
> Hi, Marc
> 
> > On Sat, 30 Mar 2019 08:42:58 +0000,
> > "Tangnianyao (ICT)" <tangnianyao@huawei.com> wrote:
> >>
> >> Hi, Marc
> >>
> >> On 2019/3/21 1:02, Marc Zyngier Wrote:
> >>> On Tue, 19 Mar 2019 21:25:47 +0800
> >>> "Tangnianyao (ICT)" <tangnianyao@huawei.com> wrote:
> >>>
> >>>> Hi, all
> >>>>
> >>>> Using gicv4, when guest is waiting for irq, it sends wfi and traps to kvm.
> >>>> When vlpi is forwarded to PE after its_vpe_deschedule, before halt_poll in
> >>>> kvm_vcpu_block, halt_poll may increase latency for this vlpi getting to guest.
> >>>> In halt_poll process, it checks if there's pending irq for vcpu using pending_last.
> >>>> However, doorbell is not enable at this moment and vlpi or doorbell can not set
> >>>> pending_last true, to stop halt_poll. It will run until halt_poll time ends, if
> >>>> there's no other physical irq coming in the meantime. And then vcpu is scheduled out.
> >>>> This pending vlpi has to wait for vcpu getting schedule in next time.
> >>>>
> >>>> Should we enable doorbell before halt_poll process ?
> >>>
> >>> Enabling doorbells can be quite expensive. Depending on the HW, this is
> >>> either:
> >>>
> >>> - a write to memory (+DSB, potential cache maintenance), a write to the
> >>>   INVLPI register, and a poll of the SYNC register
> >>> - a write to memory (+DSB, potential cache maintenance), potentially
> >>>   a string of DISCARD+SYNC+MAPI+SYNC commands, and an INV+SYNC command
> >>>
> >> I have tested average cost of kvm_vgic_v4_enable_doorbell in our machine.
> >> When gic_rdists->has_direct_lpi is 1, it costs 0.35 us.
> >> When gic_rdists->has_direct_lpi is 0, it costs 1.4 us.
> > 
> > This looks pretty low. Which HW is that on? How about on something
> > like D05?
> 
> I tested it on D06.
> D05 doesn't not support gicv4 and I haven't tested on D05.

I'm afraid you're mistaken. D05 does have GICv4. I have a pretty good
idea it does because that's the system I used to write the SW
support. So please dig one out of the cupboard and test on it. I'm
pretty sure you're at the right place to do so.

> 
> > 
> >> Compared to default halt_poll_ns, 500000ns, enabling doorbells is not
> >> large at all.
> > 
> > Sure. But I'm not sure this is a universal figure.
> > 
> >>
> >>> Frankly, you want to be careful with that. I'd rather enable them late
> >>> and have a chance of not blocking because of another (virtual)
> >>> interrupt, which saves us the doorbell business.
> >>>
> >>> I wonder if you wouldn't be in a better position by drastically
> >>> reducing halt_poll_ns for vcpu that can have directly injected
> >>> interrupts.
> >>>
> >>
> >> If we set halt_poll_ns to a small value, the chance of
> >> not blocking and vcpu scheduled out becomes larger. The cost
> >> of vcpu scheduled out is quite expensive.
> >> In many cases, one pcpu is assigned to only
> >> one vcpu, and halt_poll_ns is set quite large, to increase
> >> chance of halt_poll process got terminated. This is the
> >> reason we want to set halt_poll_ns large. And We hope vlpi
> >> stop halt_poll process in time.
> > 
> > Fair enough. It is certainly realistic to strictly partition the
> > system when GICv4 is in use, so I can see some potential benefit.
> > 
> >> Though it will check whether vcpu is runnable again by
> >> kvm_vcpu_check_block. Vlpi can prevent scheduling vcpu out.
> >> However it's somewhat later if halt_poll_ns is larger.
> >>
> >>> In any case, this is something that we should measure, not guess.
> > 
> > Please post results of realistic benchmarks that we can reproduce,
> > with and without this change. I'm willing to entertain the idea, but I
> > need more than just a vague number.
> > 
> > Thanks,
> > 
> > 	M.
> > 
> 
> I tested it with and without change (patch attached in the last).
> halt_poll_ns is keep default, 500000ns.
> I have merged the patch "arm64: KVM: Always set ICH_HCR_EL2.EN if GICv4 is enabled"
> to my test kernel, to make sure ICH_HCR_EL2.EN=1 in guest.
> 
> netperf result:
> D06 as server, intel 8180 server as client
> with change:
> package 512 bytes - 5400 Mbits/s
> package 64 bytes - 740 Mbits/s
> without change:
> package 512 bytes - 5000 Mbits/s
> package 64 bytes - 710 Mbits/s
> 
> Also I have tested D06 as client, intel machine as server,
> with the change, the result remains the same.

So anywhere between 4% and 8% improvement. Please post results for D05
once you've found one.

> 
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 55fe8e2..0f56904 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2256,6 +2256,16 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  	if (vcpu->halt_poll_ns) {
>  		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
> 
> +#ifdef CONFIG_ARM64
> +		/*
> +		 * When using gicv4, enable doorbell before halt pool wait
> +		 * so that direct vlpi can stop halt poll.
> +		 */
> +		if (vcpu->arch.vgic_cpu.vgic_v3.its_vpe.its_vm) {
> +			kvm_vgic_v4_enable_doorbell(vcpu);
> +		}
> +#endif
> +

Irk. No. You're now leaving doorbells enabled when going back to the
guest, and that's pretty bad as the whole logic relies on doorbells
being disabled if the guest can run.

So please try this instead on top of mainline. And it has to be on top
of mainline as we've changed the way timer interrupts work in 5.1 --
see accb99bcd0ca ("KVM: arm/arm64: Simplify bg_timer programming").

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f25aa98a94df..0ae4ad5dcb12 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2252,6 +2252,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 	bool waited = false;
 	u64 block_ns;
 
+	kvm_arch_vcpu_blocking(vcpu);
+
 	start = cur = ktime_get();
 	if (vcpu->halt_poll_ns) {
 		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
@@ -2272,8 +2274,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 		} while (single_task_running() && ktime_before(cur, stop));
 	}
 
-	kvm_arch_vcpu_blocking(vcpu);
-
 	for (;;) {
 		prepare_to_swait_exclusive(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
 
@@ -2287,8 +2287,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 	finish_swait(&vcpu->wq, &wait);
 	cur = ktime_get();
 
-	kvm_arch_vcpu_unblocking(vcpu);
 out:
+	kvm_arch_vcpu_unblocking(vcpu);
 	block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
 
 	if (!vcpu_valid_wakeup(vcpu))

Thanks,

	M.

-- 
Jazz is not dead, it just smell funny.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Question about enable doorbell irq and halt_poll process
  2019-04-04 10:59         ` Marc Zyngier
@ 2019-04-23  7:44           ` Tangnianyao (ICT)
  2019-04-23 10:00             ` Marc Zyngier
  0 siblings, 1 reply; 9+ messages in thread
From: Tangnianyao (ICT) @ 2019-04-23  7:44 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-kernel, linuxarm, guoheyi, kvm, xuwei5,
	tangnianyao, wanghaibin.wang

Hi, Marc

On 2019/4/4 18:59, Marc Zyngier wrote:
> On Thu, 04 Apr 2019 11:07:59 +0100,
> "Tangnianyao (ICT)" <tangnianyao@huawei.com> wrote:
>>
>>
>>
>> On 2019/3/30 17:43, Marc Zyngier wrote:
>>
>> Hi, Marc
>>
>>> On Sat, 30 Mar 2019 08:42:58 +0000,
>>> "Tangnianyao (ICT)" <tangnianyao@huawei.com> wrote:
>>>>
>>>> Hi, Marc
>>>>
>>>> On 2019/3/21 1:02, Marc Zyngier Wrote:
>>>>> On Tue, 19 Mar 2019 21:25:47 +0800
>>>>> "Tangnianyao (ICT)" <tangnianyao@huawei.com> wrote:
>>>>>
>>>>>> Hi, all
>>>>>>
>>>>>> Using gicv4, when guest is waiting for irq, it sends wfi and traps to kvm.
>>>>>> When vlpi is forwarded to PE after its_vpe_deschedule, before halt_poll in
>>>>>> kvm_vcpu_block, halt_poll may increase latency for this vlpi getting to guest.
>>>>>> In halt_poll process, it checks if there's pending irq for vcpu using pending_last.
>>>>>> However, doorbell is not enable at this moment and vlpi or doorbell can not set
>>>>>> pending_last true, to stop halt_poll. It will run until halt_poll time ends, if
>>>>>> there's no other physical irq coming in the meantime. And then vcpu is scheduled out.
>>>>>> This pending vlpi has to wait for vcpu getting schedule in next time.
>>>>>>
>>>>>> Should we enable doorbell before halt_poll process ?
>>>>>
>>>>> Enabling doorbells can be quite expensive. Depending on the HW, this is
>>>>> either:
>>>>>
>>>>> - a write to memory (+DSB, potential cache maintenance), a write to the
>>>>>   INVLPI register, and a poll of the SYNC register
>>>>> - a write to memory (+DSB, potential cache maintenance), potentially
>>>>>   a string of DISCARD+SYNC+MAPI+SYNC commands, and an INV+SYNC command
>>>>>
>>>> I have tested average cost of kvm_vgic_v4_enable_doorbell in our machine.
>>>> When gic_rdists->has_direct_lpi is 1, it costs 0.35 us.
>>>> When gic_rdists->has_direct_lpi is 0, it costs 1.4 us.
>>>
>>> This looks pretty low. Which HW is that on? How about on something
>>> like D05?
>>
>> I tested it on D06.
>> D05 doesn't not support gicv4 and I haven't tested on D05.
> 
> I'm afraid you're mistaken. D05 does have GICv4. I have a pretty good
> idea it does because that's the system I used to write the SW
> support. So please dig one out of the cupboard and test on it. I'm
> pretty sure you're at the right place to do so.

I've learned that there's some implementation problem for the PCIe controller
of Hi1616, the processor of D05. The PCIe ACS was not implemented properly
and D05 doesn't support VM using pcie vf.

> 
>>
>>>
>>>> Compared to default halt_poll_ns, 500000ns, enabling doorbells is not
>>>> large at all.
>>>
>>> Sure. But I'm not sure this is a universal figure.
>>>
>>>>
>>>>> Frankly, you want to be careful with that. I'd rather enable them late
>>>>> and have a chance of not blocking because of another (virtual)
>>>>> interrupt, which saves us the doorbell business.
>>>>>
>>>>> I wonder if you wouldn't be in a better position by drastically
>>>>> reducing halt_poll_ns for vcpu that can have directly injected
>>>>> interrupts.
>>>>>
>>>>
>>>> If we set halt_poll_ns to a small value, the chance of
>>>> not blocking and vcpu scheduled out becomes larger. The cost
>>>> of vcpu scheduled out is quite expensive.
>>>> In many cases, one pcpu is assigned to only
>>>> one vcpu, and halt_poll_ns is set quite large, to increase
>>>> chance of halt_poll process got terminated. This is the
>>>> reason we want to set halt_poll_ns large. And We hope vlpi
>>>> stop halt_poll process in time.
>>>
>>> Fair enough. It is certainly realistic to strictly partition the
>>> system when GICv4 is in use, so I can see some potential benefit.
>>>
>>>> Though it will check whether vcpu is runnable again by
>>>> kvm_vcpu_check_block. Vlpi can prevent scheduling vcpu out.
>>>> However it's somewhat later if halt_poll_ns is larger.
>>>>
>>>>> In any case, this is something that we should measure, not guess.
>>>
>>> Please post results of realistic benchmarks that we can reproduce,
>>> with and without this change. I'm willing to entertain the idea, but I
>>> need more than just a vague number.
>>>
>>> Thanks,
>>>
>>> 	M.
>>>
>>
>> I tested it with and without change (patch attached in the last).
>> halt_poll_ns is keep default, 500000ns.
>> I have merged the patch "arm64: KVM: Always set ICH_HCR_EL2.EN if GICv4 is enabled"
>> to my test kernel, to make sure ICH_HCR_EL2.EN=1 in guest.
>>
>> netperf result:
>> D06 as server, intel 8180 server as client
>> with change:
>> package 512 bytes - 5400 Mbits/s
>> package 64 bytes - 740 Mbits/s
>> without change:
>> package 512 bytes - 5000 Mbits/s
>> package 64 bytes - 710 Mbits/s
>>
>> Also I have tested D06 as client, intel machine as server,
>> with the change, the result remains the same.
> 
> So anywhere between 4% and 8% improvement. Please post results for D05
> once you've found one.
> 
>>
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 55fe8e2..0f56904 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -2256,6 +2256,16 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>  	if (vcpu->halt_poll_ns) {
>>  		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
>>
>> +#ifdef CONFIG_ARM64
>> +		/*
>> +		 * When using gicv4, enable doorbell before halt pool wait
>> +		 * so that direct vlpi can stop halt poll.
>> +		 */
>> +		if (vcpu->arch.vgic_cpu.vgic_v3.its_vpe.its_vm) {
>> +			kvm_vgic_v4_enable_doorbell(vcpu);
>> +		}
>> +#endif
>> +
> 
> Irk. No. You're now leaving doorbells enabled when going back to the
> guest, and that's pretty bad as the whole logic relies on doorbells
> being disabled if the guest can run.
> 
> So please try this instead on top of mainline. And it has to be on top
> of mainline as we've changed the way timer interrupts work in 5.1 --
> see accb99bcd0ca ("KVM: arm/arm64: Simplify bg_timer programming").
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index f25aa98a94df..0ae4ad5dcb12 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2252,6 +2252,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  	bool waited = false;
>  	u64 block_ns;
>  
> +	kvm_arch_vcpu_blocking(vcpu);
> +
>  	start = cur = ktime_get();
>  	if (vcpu->halt_poll_ns) {
>  		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
> @@ -2272,8 +2274,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  		} while (single_task_running() && ktime_before(cur, stop));
>  	}
>  
> -	kvm_arch_vcpu_blocking(vcpu);
> -
>  	for (;;) {
>  		prepare_to_swait_exclusive(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
>  
> @@ -2287,8 +2287,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  	finish_swait(&vcpu->wq, &wait);
>  	cur = ktime_get();
>  
> -	kvm_arch_vcpu_unblocking(vcpu);
>  out:
> +	kvm_arch_vcpu_unblocking(vcpu);
>  	block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
>  
>  	if (!vcpu_valid_wakeup(vcpu))
> 
> Thanks,
> 
> 	M.
> 

I've tested your patch and the results showed as follow:

netperf result:
D06 as server, intel 8180 server as client
with change:
package 512 bytes - 5500 Mbits/s
package 64 bytes - 760 Mbits/s
without change:
package 512 bytes - 5000 Mbits/s
package 64 bytes - 710 Mbits/s

Thanks,
Tang


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Question about enable doorbell irq and halt_poll process
  2019-04-23  7:44           ` Tangnianyao (ICT)
@ 2019-04-23 10:00             ` Marc Zyngier
  2019-04-29  2:29               ` Tangnianyao (ICT)
  0 siblings, 1 reply; 9+ messages in thread
From: Marc Zyngier @ 2019-04-23 10:00 UTC (permalink / raw)
  To: Tangnianyao (ICT)
  Cc: kvmarm, linux-kernel, linuxarm, guoheyi, kvm, xuwei5, wanghaibin.wang

On Tue, 23 Apr 2019 08:44:17 +0100,
"Tangnianyao (ICT)" <tangnianyao@huawei.com> wrote:
> 
> Hi, Marc

[...]

> I've learned that there's some implementation problem for the PCIe
> controller of Hi1616, the processor of D05. The PCIe ACS was not
> implemented properly and D05 doesn't support VM using pcie vf.

My D05 completely disagrees with you:

root@unassigned-hostname:~# lspci -v
00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge
	Subsystem: Red Hat, Inc QEMU PCIe Host bridge
	Flags: fast devsel
lspci: Unable to load libkmod resources: error -12

00:01.0 Ethernet controller: Red Hat, Inc Virtio network device (rev 01)
	Subsystem: Red Hat, Inc Virtio network device
	Flags: bus master, fast devsel, latency 0, IRQ 40
	Memory at 10040000 (32-bit, non-prefetchable) [size=4K]
	Memory at 8000000000 (64-bit, prefetchable) [size=16K]
	Expansion ROM at 10000000 [disabled] [size=256K]
	Capabilities: [98] MSI-X: Enable+ Count=3 Masked-
	Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
	Capabilities: [70] Vendor Specific Information: VirtIO: Notify
	Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
	Capabilities: [50] Vendor Specific Information: VirtIO: ISR
	Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
	Kernel driver in use: virtio-pci

00:02.0 SCSI storage controller: Red Hat, Inc Virtio block device (rev 01)
	Subsystem: Red Hat, Inc Virtio block device
	Flags: bus master, fast devsel, latency 0, IRQ 41
	Memory at 10041000 (32-bit, non-prefetchable) [size=4K]
	Memory at 8000004000 (64-bit, prefetchable) [size=16K]
	Capabilities: [98] MSI-X: Enable+ Count=2 Masked-
	Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
	Capabilities: [70] Vendor Specific Information: VirtIO: Notify
	Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
	Capabilities: [50] Vendor Specific Information: VirtIO: ISR
	Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
	Kernel driver in use: virtio-pci

00:03.0 SCSI storage controller: Red Hat, Inc Virtio SCSI (rev 01)
	Subsystem: Red Hat, Inc Virtio SCSI
	Flags: bus master, fast devsel, latency 0, IRQ 42
	Memory at 10042000 (32-bit, non-prefetchable) [size=4K]
	Memory at 8000008000 (64-bit, prefetchable) [size=16K]
	Capabilities: [98] MSI-X: Enable+ Count=4 Masked-
	Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
	Capabilities: [70] Vendor Specific Information: VirtIO: Notify
	Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
	Capabilities: [50] Vendor Specific Information: VirtIO: ISR
	Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
	Kernel driver in use: virtio-pci

00:04.0 Ethernet controller: Intel Corporation I350 Ethernet Controller Virtual Function (rev 01)
	Subsystem: Intel Corporation I350 Ethernet Controller Virtual Function
	Flags: bus master, fast devsel, latency 0
	Memory at 800000c000 (64-bit, prefetchable) [size=16K]
	Memory at 8000010000 (64-bit, prefetchable) [size=16K]
	Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
	Capabilities: [a0] Express Root Complex Integrated Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [1a0] Transaction Processing Hints
	Capabilities: [1d0] Access Control Services
	Kernel driver in use: igbvf

root@unassigned-hostname:~# uptime
 05:40:45 up 27 days, 17:16,  1 user,  load average: 4.10, 4.05, 4.01

For something that isn't supposed to work, it is pretty good. So
please test it and let me know how it fares. At this stage, not
regressing deployed HW is more important than improving the situation
on HW that nobody has.

> 
> > 
> >>
> >>>
> >>>> Compared to default halt_poll_ns, 500000ns, enabling doorbells is not
> >>>> large at all.
> >>>
> >>> Sure. But I'm not sure this is a universal figure.
> >>>
> >>>>
> >>>>> Frankly, you want to be careful with that. I'd rather enable them late
> >>>>> and have a chance of not blocking because of another (virtual)
> >>>>> interrupt, which saves us the doorbell business.
> >>>>>
> >>>>> I wonder if you wouldn't be in a better position by drastically
> >>>>> reducing halt_poll_ns for vcpu that can have directly injected
> >>>>> interrupts.
> >>>>>
> >>>>
> >>>> If we set halt_poll_ns to a small value, the chance of
> >>>> not blocking and vcpu scheduled out becomes larger. The cost
> >>>> of vcpu scheduled out is quite expensive.
> >>>> In many cases, one pcpu is assigned to only
> >>>> one vcpu, and halt_poll_ns is set quite large, to increase
> >>>> chance of halt_poll process got terminated. This is the
> >>>> reason we want to set halt_poll_ns large. And We hope vlpi
> >>>> stop halt_poll process in time.
> >>>
> >>> Fair enough. It is certainly realistic to strictly partition the
> >>> system when GICv4 is in use, so I can see some potential benefit.
> >>>
> >>>> Though it will check whether vcpu is runnable again by
> >>>> kvm_vcpu_check_block. Vlpi can prevent scheduling vcpu out.
> >>>> However it's somewhat later if halt_poll_ns is larger.
> >>>>
> >>>>> In any case, this is something that we should measure, not guess.
> >>>
> >>> Please post results of realistic benchmarks that we can reproduce,
> >>> with and without this change. I'm willing to entertain the idea, but I
> >>> need more than just a vague number.
> >>>
> >>> Thanks,
> >>>
> >>> 	M.
> >>>
> >>
> >> I tested it with and without change (patch attached in the last).
> >> halt_poll_ns is keep default, 500000ns.
> >> I have merged the patch "arm64: KVM: Always set ICH_HCR_EL2.EN if GICv4 is enabled"
> >> to my test kernel, to make sure ICH_HCR_EL2.EN=1 in guest.
> >>
> >> netperf result:
> >> D06 as server, intel 8180 server as client
> >> with change:
> >> package 512 bytes - 5400 Mbits/s
> >> package 64 bytes - 740 Mbits/s
> >> without change:
> >> package 512 bytes - 5000 Mbits/s
> >> package 64 bytes - 710 Mbits/s
> >>
> >> Also I have tested D06 as client, intel machine as server,
> >> with the change, the result remains the same.
> > 
> > So anywhere between 4% and 8% improvement. Please post results for D05
> > once you've found one.
> > 
> >>
> >>
> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >> index 55fe8e2..0f56904 100644
> >> --- a/virt/kvm/kvm_main.c
> >> +++ b/virt/kvm/kvm_main.c
> >> @@ -2256,6 +2256,16 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> >>  	if (vcpu->halt_poll_ns) {
> >>  		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
> >>
> >> +#ifdef CONFIG_ARM64
> >> +		/*
> >> +		 * When using gicv4, enable doorbell before halt pool wait
> >> +		 * so that direct vlpi can stop halt poll.
> >> +		 */
> >> +		if (vcpu->arch.vgic_cpu.vgic_v3.its_vpe.its_vm) {
> >> +			kvm_vgic_v4_enable_doorbell(vcpu);
> >> +		}
> >> +#endif
> >> +
> > 
> > Irk. No. You're now leaving doorbells enabled when going back to the
> > guest, and that's pretty bad as the whole logic relies on doorbells
> > being disabled if the guest can run.
> > 
> > So please try this instead on top of mainline. And it has to be on top
> > of mainline as we've changed the way timer interrupts work in 5.1 --
> > see accb99bcd0ca ("KVM: arm/arm64: Simplify bg_timer programming").
> > 

[...]

> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index f25aa98a94df..0ae4ad5dcb12 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2252,6 +2252,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> >  	bool waited = false;
> >  	u64 block_ns;
> >  
> > +	kvm_arch_vcpu_blocking(vcpu);
> > +
> >  	start = cur = ktime_get();
> >  	if (vcpu->halt_poll_ns) {
> >  		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
> > @@ -2272,8 +2274,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> >  		} while (single_task_running() && ktime_before(cur, stop));
> >  	}
> >  
> > -	kvm_arch_vcpu_blocking(vcpu);
> > -
> >  	for (;;) {
> >  		prepare_to_swait_exclusive(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
> >  
> > @@ -2287,8 +2287,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> >  	finish_swait(&vcpu->wq, &wait);
> >  	cur = ktime_get();
> >  
> > -	kvm_arch_vcpu_unblocking(vcpu);
> >  out:
> > +	kvm_arch_vcpu_unblocking(vcpu);
> >  	block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
> >  
> >  	if (!vcpu_valid_wakeup(vcpu))
> > 
> > Thanks,
> > 
> > 	M.
> > 
> 
> I've tested your patch and the results showed as follow:
> 
> netperf result:
> D06 as server, intel 8180 server as client
> with change:
> package 512 bytes - 5500 Mbits/s
> package 64 bytes - 760 Mbits/s
> without change:
> package 512 bytes - 5000 Mbits/s
> package 64 bytes - 710 Mbits/s

OK, that's pretty good. Let me know once you've tested it on D05.

Thanks,

	M.

-- 
Jazz is not dead, it just smell funny.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Question about enable doorbell irq and halt_poll process
  2019-04-23 10:00             ` Marc Zyngier
@ 2019-04-29  2:29               ` Tangnianyao (ICT)
  2019-05-14  9:02                 ` Tangnianyao (ICT)
  0 siblings, 1 reply; 9+ messages in thread
From: Tangnianyao (ICT) @ 2019-04-29  2:29 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-kernel, linuxarm, guoheyi, kvm, xuwei5, wanghaibin.wang



On 2019/4/23 18:00, Marc Zyngier wrote:

Hi, Marc

> On Tue, 23 Apr 2019 08:44:17 +0100,
> "Tangnianyao (ICT)" <tangnianyao@huawei.com> wrote:
>>
>> Hi, Marc
> 
> [...]
> 
>> I've learned that there's some implementation problem for the PCIe
>> controller of Hi1616, the processor of D05. The PCIe ACS was not
>> implemented properly and D05 doesn't support VM using pcie vf.
> 
> My D05 completely disagrees with you:
> 
> root@unassigned-hostname:~# lspci -v
> 00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge
> 	Subsystem: Red Hat, Inc QEMU PCIe Host bridge
> 	Flags: fast devsel
> lspci: Unable to load libkmod resources: error -12
> 
> 00:01.0 Ethernet controller: Red Hat, Inc Virtio network device (rev 01)
> 	Subsystem: Red Hat, Inc Virtio network device
> 	Flags: bus master, fast devsel, latency 0, IRQ 40
> 	Memory at 10040000 (32-bit, non-prefetchable) [size=4K]
> 	Memory at 8000000000 (64-bit, prefetchable) [size=16K]
> 	Expansion ROM at 10000000 [disabled] [size=256K]
> 	Capabilities: [98] MSI-X: Enable+ Count=3 Masked-
> 	Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
> 	Capabilities: [70] Vendor Specific Information: VirtIO: Notify
> 	Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
> 	Capabilities: [50] Vendor Specific Information: VirtIO: ISR
> 	Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
> 	Kernel driver in use: virtio-pci
> 
> 00:02.0 SCSI storage controller: Red Hat, Inc Virtio block device (rev 01)
> 	Subsystem: Red Hat, Inc Virtio block device
> 	Flags: bus master, fast devsel, latency 0, IRQ 41
> 	Memory at 10041000 (32-bit, non-prefetchable) [size=4K]
> 	Memory at 8000004000 (64-bit, prefetchable) [size=16K]
> 	Capabilities: [98] MSI-X: Enable+ Count=2 Masked-
> 	Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
> 	Capabilities: [70] Vendor Specific Information: VirtIO: Notify
> 	Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
> 	Capabilities: [50] Vendor Specific Information: VirtIO: ISR
> 	Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
> 	Kernel driver in use: virtio-pci
> 
> 00:03.0 SCSI storage controller: Red Hat, Inc Virtio SCSI (rev 01)
> 	Subsystem: Red Hat, Inc Virtio SCSI
> 	Flags: bus master, fast devsel, latency 0, IRQ 42
> 	Memory at 10042000 (32-bit, non-prefetchable) [size=4K]
> 	Memory at 8000008000 (64-bit, prefetchable) [size=16K]
> 	Capabilities: [98] MSI-X: Enable+ Count=4 Masked-
> 	Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
> 	Capabilities: [70] Vendor Specific Information: VirtIO: Notify
> 	Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
> 	Capabilities: [50] Vendor Specific Information: VirtIO: ISR
> 	Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
> 	Kernel driver in use: virtio-pci
> 
> 00:04.0 Ethernet controller: Intel Corporation I350 Ethernet Controller Virtual Function (rev 01)
> 	Subsystem: Intel Corporation I350 Ethernet Controller Virtual Function
> 	Flags: bus master, fast devsel, latency 0
> 	Memory at 800000c000 (64-bit, prefetchable) [size=16K]
> 	Memory at 8000010000 (64-bit, prefetchable) [size=16K]
> 	Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
> 	Capabilities: [a0] Express Root Complex Integrated Endpoint, MSI 00
> 	Capabilities: [100] Advanced Error Reporting
> 	Capabilities: [1a0] Transaction Processing Hints
> 	Capabilities: [1d0] Access Control Services
> 	Kernel driver in use: igbvf
> 
> root@unassigned-hostname:~# uptime
>  05:40:45 up 27 days, 17:16,  1 user,  load average: 4.10, 4.05, 4.01
> 

I have make a new quirk to fix ACS problem on Hi1616, to enable VM using
pcie vf.

> For something that isn't supposed to work, it is pretty good. So
> please test it and let me know how it fares. At this stage, not
> regressing deployed HW is more important than improving the situation
> on HW that nobody has.
> 
>>
>>>
>>>>
>>>>>
>>>>>> Compared to default halt_poll_ns, 500000ns, enabling doorbells is not
>>>>>> large at all.
>>>>>
>>>>> Sure. But I'm not sure this is a universal figure.
>>>>>
>>>>>>
>>>>>>> Frankly, you want to be careful with that. I'd rather enable them late
>>>>>>> and have a chance of not blocking because of another (virtual)
>>>>>>> interrupt, which saves us the doorbell business.
>>>>>>>
>>>>>>> I wonder if you wouldn't be in a better position by drastically
>>>>>>> reducing halt_poll_ns for vcpu that can have directly injected
>>>>>>> interrupts.
>>>>>>>
>>>>>>
>>>>>> If we set halt_poll_ns to a small value, the chance of
>>>>>> not blocking and vcpu scheduled out becomes larger. The cost
>>>>>> of vcpu scheduled out is quite expensive.
>>>>>> In many cases, one pcpu is assigned to only
>>>>>> one vcpu, and halt_poll_ns is set quite large, to increase
>>>>>> chance of halt_poll process got terminated. This is the
>>>>>> reason we want to set halt_poll_ns large. And We hope vlpi
>>>>>> stop halt_poll process in time.
>>>>>
>>>>> Fair enough. It is certainly realistic to strictly partition the
>>>>> system when GICv4 is in use, so I can see some potential benefit.
>>>>>
>>>>>> Though it will check whether vcpu is runnable again by
>>>>>> kvm_vcpu_check_block. Vlpi can prevent scheduling vcpu out.
>>>>>> However it's somewhat later if halt_poll_ns is larger.
>>>>>>
>>>>>>> In any case, this is something that we should measure, not guess.
>>>>>
>>>>> Please post results of realistic benchmarks that we can reproduce,
>>>>> with and without this change. I'm willing to entertain the idea, but I
>>>>> need more than just a vague number.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> 	M.
>>>>>
>>>>
>>>> I tested it with and without change (patch attached in the last).
>>>> halt_poll_ns is keep default, 500000ns.
>>>> I have merged the patch "arm64: KVM: Always set ICH_HCR_EL2.EN if GICv4 is enabled"
>>>> to my test kernel, to make sure ICH_HCR_EL2.EN=1 in guest.
>>>>
>>>> netperf result:
>>>> D06 as server, intel 8180 server as client
>>>> with change:
>>>> package 512 bytes - 5400 Mbits/s
>>>> package 64 bytes - 740 Mbits/s
>>>> without change:
>>>> package 512 bytes - 5000 Mbits/s
>>>> package 64 bytes - 710 Mbits/s
>>>>
>>>> Also I have tested D06 as client, intel machine as server,
>>>> with the change, the result remains the same.
>>>
>>> So anywhere between 4% and 8% improvement. Please post results for D05
>>> once you've found one.
>>>
>>>>
>>>>
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index 55fe8e2..0f56904 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -2256,6 +2256,16 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>>>  	if (vcpu->halt_poll_ns) {
>>>>  		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
>>>>
>>>> +#ifdef CONFIG_ARM64
>>>> +		/*
>>>> +		 * When using gicv4, enable doorbell before halt pool wait
>>>> +		 * so that direct vlpi can stop halt poll.
>>>> +		 */
>>>> +		if (vcpu->arch.vgic_cpu.vgic_v3.its_vpe.its_vm) {
>>>> +			kvm_vgic_v4_enable_doorbell(vcpu);
>>>> +		}
>>>> +#endif
>>>> +
>>>
>>> Irk. No. You're now leaving doorbells enabled when going back to the
>>> guest, and that's pretty bad as the whole logic relies on doorbells
>>> being disabled if the guest can run.
>>>
>>> So please try this instead on top of mainline. And it has to be on top
>>> of mainline as we've changed the way timer interrupts work in 5.1 --
>>> see accb99bcd0ca ("KVM: arm/arm64: Simplify bg_timer programming").
>>>
> 
> [...]
> 
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index f25aa98a94df..0ae4ad5dcb12 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -2252,6 +2252,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>>  	bool waited = false;
>>>  	u64 block_ns;
>>>  
>>> +	kvm_arch_vcpu_blocking(vcpu);
>>> +
>>>  	start = cur = ktime_get();
>>>  	if (vcpu->halt_poll_ns) {
>>>  		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
>>> @@ -2272,8 +2274,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>>  		} while (single_task_running() && ktime_before(cur, stop));
>>>  	}
>>>  
>>> -	kvm_arch_vcpu_blocking(vcpu);
>>> -
>>>  	for (;;) {
>>>  		prepare_to_swait_exclusive(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
>>>  
>>> @@ -2287,8 +2287,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>>  	finish_swait(&vcpu->wq, &wait);
>>>  	cur = ktime_get();
>>>  
>>> -	kvm_arch_vcpu_unblocking(vcpu);
>>>  out:
>>> +	kvm_arch_vcpu_unblocking(vcpu);
>>>  	block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
>>>  
>>>  	if (!vcpu_valid_wakeup(vcpu))
>>>
>>> Thanks,
>>>
>>> 	M.
>>>
>>
>> I've tested your patch and the results showed as follow:
>>
>> netperf result:
>> D06 as server, intel 8180 server as client
>> with change:
>> package 512 bytes - 5500 Mbits/s
>> package 64 bytes - 760 Mbits/s
>> without change:
>> package 512 bytes - 5000 Mbits/s
>> package 64 bytes - 710 Mbits/s
> 
> OK, that's pretty good. Let me know once you've tested it on D05.
> 
> Thanks,
>
> 	M.
>

The average cost of kvm_vgic_v4_enable_doorbell on D05 is 0.74us,
while it is 0.35us on D06.

netperf result:
server: D05 vm using 82599 vf,
client: intel 8180
5.0.0-rc3, have merged the patch "arm64: KVM: Always set ICH_HCR_EL2.EN
if GICv4 is enabled"

with change:
package 512 bytes - 7150 Mbits/s
package 64 bytes - 1080 Mbits/s
without change:
package 512 bytes - 7050 Mbits/s
package 64 bytes - 1080 Mbits/s

It seems not work on D05, as the doorbell enable process costs more than
that on D06.

Thanks,
Tang


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Question about enable doorbell irq and halt_poll process
  2019-04-29  2:29               ` Tangnianyao (ICT)
@ 2019-05-14  9:02                 ` Tangnianyao (ICT)
  0 siblings, 0 replies; 9+ messages in thread
From: Tangnianyao (ICT) @ 2019-05-14  9:02 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-kernel, linuxarm, guoheyi, kvm, xuwei5, wanghaibin.wang



On 2019/4/29 10:29, Tangnianyao (ICT) wrote:

Hi, Marc

> On 2019/4/23 18:00, Marc Zyngier wrote:
> 
> Hi, Marc
> 
>> On Tue, 23 Apr 2019 08:44:17 +0100,
>> "Tangnianyao (ICT)" <tangnianyao@huawei.com> wrote:
>>>
>>> Hi, Marc
>>
>> [...]
>>
>>> I've learned that there's some implementation problem for the PCIe
>>> controller of Hi1616, the processor of D05. The PCIe ACS was not
>>> implemented properly and D05 doesn't support VM using pcie vf.
>>
>> My D05 completely disagrees with you:
>>
>> root@unassigned-hostname:~# lspci -v
>> 00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge
>> 	Subsystem: Red Hat, Inc QEMU PCIe Host bridge
>> 	Flags: fast devsel
>> lspci: Unable to load libkmod resources: error -12
>>
>> 00:01.0 Ethernet controller: Red Hat, Inc Virtio network device (rev 01)
>> 	Subsystem: Red Hat, Inc Virtio network device
>> 	Flags: bus master, fast devsel, latency 0, IRQ 40
>> 	Memory at 10040000 (32-bit, non-prefetchable) [size=4K]
>> 	Memory at 8000000000 (64-bit, prefetchable) [size=16K]
>> 	Expansion ROM at 10000000 [disabled] [size=256K]
>> 	Capabilities: [98] MSI-X: Enable+ Count=3 Masked-
>> 	Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
>> 	Capabilities: [70] Vendor Specific Information: VirtIO: Notify
>> 	Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
>> 	Capabilities: [50] Vendor Specific Information: VirtIO: ISR
>> 	Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
>> 	Kernel driver in use: virtio-pci
>>
>> 00:02.0 SCSI storage controller: Red Hat, Inc Virtio block device (rev 01)
>> 	Subsystem: Red Hat, Inc Virtio block device
>> 	Flags: bus master, fast devsel, latency 0, IRQ 41
>> 	Memory at 10041000 (32-bit, non-prefetchable) [size=4K]
>> 	Memory at 8000004000 (64-bit, prefetchable) [size=16K]
>> 	Capabilities: [98] MSI-X: Enable+ Count=2 Masked-
>> 	Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
>> 	Capabilities: [70] Vendor Specific Information: VirtIO: Notify
>> 	Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
>> 	Capabilities: [50] Vendor Specific Information: VirtIO: ISR
>> 	Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
>> 	Kernel driver in use: virtio-pci
>>
>> 00:03.0 SCSI storage controller: Red Hat, Inc Virtio SCSI (rev 01)
>> 	Subsystem: Red Hat, Inc Virtio SCSI
>> 	Flags: bus master, fast devsel, latency 0, IRQ 42
>> 	Memory at 10042000 (32-bit, non-prefetchable) [size=4K]
>> 	Memory at 8000008000 (64-bit, prefetchable) [size=16K]
>> 	Capabilities: [98] MSI-X: Enable+ Count=4 Masked-
>> 	Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
>> 	Capabilities: [70] Vendor Specific Information: VirtIO: Notify
>> 	Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
>> 	Capabilities: [50] Vendor Specific Information: VirtIO: ISR
>> 	Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
>> 	Kernel driver in use: virtio-pci
>>
>> 00:04.0 Ethernet controller: Intel Corporation I350 Ethernet Controller Virtual Function (rev 01)
>> 	Subsystem: Intel Corporation I350 Ethernet Controller Virtual Function
>> 	Flags: bus master, fast devsel, latency 0
>> 	Memory at 800000c000 (64-bit, prefetchable) [size=16K]
>> 	Memory at 8000010000 (64-bit, prefetchable) [size=16K]
>> 	Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
>> 	Capabilities: [a0] Express Root Complex Integrated Endpoint, MSI 00
>> 	Capabilities: [100] Advanced Error Reporting
>> 	Capabilities: [1a0] Transaction Processing Hints
>> 	Capabilities: [1d0] Access Control Services
>> 	Kernel driver in use: igbvf
>>
>> root@unassigned-hostname:~# uptime
>>  05:40:45 up 27 days, 17:16,  1 user,  load average: 4.10, 4.05, 4.01
>>
> 
> I have make a new quirk to fix ACS problem on Hi1616, to enable VM using
> pcie vf.
> 
>> For something that isn't supposed to work, it is pretty good. So
>> please test it and let me know how it fares. At this stage, not
>> regressing deployed HW is more important than improving the situation
>> on HW that nobody has.
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>> Compared to default halt_poll_ns, 500000ns, enabling doorbells is not
>>>>>>> large at all.
>>>>>>
>>>>>> Sure. But I'm not sure this is a universal figure.
>>>>>>
>>>>>>>
>>>>>>>> Frankly, you want to be careful with that. I'd rather enable them late
>>>>>>>> and have a chance of not blocking because of another (virtual)
>>>>>>>> interrupt, which saves us the doorbell business.
>>>>>>>>
>>>>>>>> I wonder if you wouldn't be in a better position by drastically
>>>>>>>> reducing halt_poll_ns for vcpu that can have directly injected
>>>>>>>> interrupts.
>>>>>>>>
>>>>>>>
>>>>>>> If we set halt_poll_ns to a small value, the chance of
>>>>>>> not blocking and vcpu scheduled out becomes larger. The cost
>>>>>>> of vcpu scheduled out is quite expensive.
>>>>>>> In many cases, one pcpu is assigned to only
>>>>>>> one vcpu, and halt_poll_ns is set quite large, to increase
>>>>>>> chance of halt_poll process got terminated. This is the
>>>>>>> reason we want to set halt_poll_ns large. And We hope vlpi
>>>>>>> stop halt_poll process in time.
>>>>>>
>>>>>> Fair enough. It is certainly realistic to strictly partition the
>>>>>> system when GICv4 is in use, so I can see some potential benefit.
>>>>>>
>>>>>>> Though it will check whether vcpu is runnable again by
>>>>>>> kvm_vcpu_check_block. Vlpi can prevent scheduling vcpu out.
>>>>>>> However it's somewhat later if halt_poll_ns is larger.
>>>>>>>
>>>>>>>> In any case, this is something that we should measure, not guess.
>>>>>>
>>>>>> Please post results of realistic benchmarks that we can reproduce,
>>>>>> with and without this change. I'm willing to entertain the idea, but I
>>>>>> need more than just a vague number.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> 	M.
>>>>>>
>>>>>
>>>>> I tested it with and without change (patch attached in the last).
>>>>> halt_poll_ns is keep default, 500000ns.
>>>>> I have merged the patch "arm64: KVM: Always set ICH_HCR_EL2.EN if GICv4 is enabled"
>>>>> to my test kernel, to make sure ICH_HCR_EL2.EN=1 in guest.
>>>>>
>>>>> netperf result:
>>>>> D06 as server, intel 8180 server as client
>>>>> with change:
>>>>> package 512 bytes - 5400 Mbits/s
>>>>> package 64 bytes - 740 Mbits/s
>>>>> without change:
>>>>> package 512 bytes - 5000 Mbits/s
>>>>> package 64 bytes - 710 Mbits/s
>>>>>
>>>>> Also I have tested D06 as client, intel machine as server,
>>>>> with the change, the result remains the same.
>>>>
>>>> So anywhere between 4% and 8% improvement. Please post results for D05
>>>> once you've found one.
>>>>
>>>>>
>>>>>
>>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>>> index 55fe8e2..0f56904 100644
>>>>> --- a/virt/kvm/kvm_main.c
>>>>> +++ b/virt/kvm/kvm_main.c
>>>>> @@ -2256,6 +2256,16 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>>>>  	if (vcpu->halt_poll_ns) {
>>>>>  		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
>>>>>
>>>>> +#ifdef CONFIG_ARM64
>>>>> +		/*
>>>>> +		 * When using gicv4, enable doorbell before halt pool wait
>>>>> +		 * so that direct vlpi can stop halt poll.
>>>>> +		 */
>>>>> +		if (vcpu->arch.vgic_cpu.vgic_v3.its_vpe.its_vm) {
>>>>> +			kvm_vgic_v4_enable_doorbell(vcpu);
>>>>> +		}
>>>>> +#endif
>>>>> +
>>>>
>>>> Irk. No. You're now leaving doorbells enabled when going back to the
>>>> guest, and that's pretty bad as the whole logic relies on doorbells
>>>> being disabled if the guest can run.
>>>>
>>>> So please try this instead on top of mainline. And it has to be on top
>>>> of mainline as we've changed the way timer interrupts work in 5.1 --
>>>> see accb99bcd0ca ("KVM: arm/arm64: Simplify bg_timer programming").
>>>>
>>
>> [...]
>>
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index f25aa98a94df..0ae4ad5dcb12 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -2252,6 +2252,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>>>  	bool waited = false;
>>>>  	u64 block_ns;
>>>>  
>>>> +	kvm_arch_vcpu_blocking(vcpu);
>>>> +
>>>>  	start = cur = ktime_get();
>>>>  	if (vcpu->halt_poll_ns) {
>>>>  		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
>>>> @@ -2272,8 +2274,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>>>  		} while (single_task_running() && ktime_before(cur, stop));
>>>>  	}
>>>>  
>>>> -	kvm_arch_vcpu_blocking(vcpu);
>>>> -
>>>>  	for (;;) {
>>>>  		prepare_to_swait_exclusive(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
>>>>  
>>>> @@ -2287,8 +2287,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>>>  	finish_swait(&vcpu->wq, &wait);
>>>>  	cur = ktime_get();
>>>>  
>>>> -	kvm_arch_vcpu_unblocking(vcpu);
>>>>  out:
>>>> +	kvm_arch_vcpu_unblocking(vcpu);
>>>>  	block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
>>>>  
>>>>  	if (!vcpu_valid_wakeup(vcpu))
>>>>
>>>> Thanks,
>>>>
>>>> 	M.
>>>>
>>>
>>> I've tested your patch and the results showed as follow:
>>>
>>> netperf result:
>>> D06 as server, intel 8180 server as client
>>> with change:
>>> package 512 bytes - 5500 Mbits/s
>>> package 64 bytes - 760 Mbits/s
>>> without change:
>>> package 512 bytes - 5000 Mbits/s
>>> package 64 bytes - 710 Mbits/s
>>
>> OK, that's pretty good. Let me know once you've tested it on D05.
>>
>> Thanks,
>>
>> 	M.
>>
> 
> The average cost of kvm_vgic_v4_enable_doorbell on D05 is 0.74us,
> while it is 0.35us on D06.
> 
> netperf result:
> server: D05 vm using 82599 vf,
> client: intel 8180
> 5.0.0-rc3, have merged the patch "arm64: KVM: Always set ICH_HCR_EL2.EN
> if GICv4 is enabled"
> 
> with change:
> package 512 bytes - 7150 Mbits/s
> package 64 bytes - 1080 Mbits/s
> without change:
> package 512 bytes - 7050 Mbits/s
> package 64 bytes - 1080 Mbits/s
> 
> It seems not work on D05, as the doorbell enable process costs more than
> that on D06.
> 
> Thanks,
> Tang
> 
> 
> .
> 

On D05, Hi1616, ACS extended capability was not implemented standardly, it
has to fix it by making a new quirk. And this change results to about 1.5%
performance drop on D05, while it improves 5% to 8% on D06. D06 will be
more commercially used than D05. We think it may be reasonable to change
this process.
If you need to verify this process,  you may use our D06 machine. I will
shows how you can use it with another email. As I know, James.Morse@arm.com
has ever used it too.

Thanks,
Tang



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-05-14  9:02 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-19 13:25 [RFC] Question about enable doorbell irq and halt_poll process Tangnianyao (ICT)
2019-03-20 14:31 ` Heyi Guo
2019-03-20 17:02 ` Marc Zyngier
     [not found]   ` <5df934fd-06d5-55f2-68a5-6f4985e4ac1b@huawei.com>
     [not found]     ` <86zhpc66jl.wl-marc.zyngier@arm.com>
2019-04-04 10:07       ` Tangnianyao (ICT)
2019-04-04 10:59         ` Marc Zyngier
2019-04-23  7:44           ` Tangnianyao (ICT)
2019-04-23 10:00             ` Marc Zyngier
2019-04-29  2:29               ` Tangnianyao (ICT)
2019-05-14  9:02                 ` Tangnianyao (ICT)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).