LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
@ 2015-02-18  9:32 Bogdan Purcareata
  2015-02-18  9:32 ` [PATCH 1/2] powerpc/kvm: Convert openpic lock to raw_spinlock Bogdan Purcareata
                   ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Bogdan Purcareata @ 2015-02-18  9:32 UTC (permalink / raw)
  To: linuxppc-dev, linux-rt-users, bigeasy, agraf, pbonzini
  Cc: linux-kernel, scottwood, mihai.caraman, Bogdan Purcareata

This patchset enables running KVM SMP guests with external interrupts on an
underlying RT-enabled Linux. Previous to this patch, a guest with in-kernel MPIC
emulation could easily panic the kernel due to preemption when delivering IPIs
and external interrupts, because of the openpic spinlock becoming a sleeping
mutex on PREEMPT_RT_FULL Linux.

0001: converts the openpic spinlock to a raw spinlock, in order to circumvent
this behavior. While this change is targeted for a RT enabled Linux, it has no
effect on upstream kvm-ppc, so send it upstream for better future maintenance.

0002: introduces a limit on the maximum VCPUs a guest can have, in order to
prevent potential DoS attack due to large system latencies. This patch is
targeted to RT (due to CONFIG_PREEMPT_RT_FULL), but it can also be applied on
upstream Linux, with no effect. Not sure if it's best to send it upstream and
have a hanging CONFIG_PREEMPT_RT_FULL check there, with no effect, or send it
against linux-stable-rt. Please apply as you consider appropriate.

- applied & compiled against upstream 3.19
- applied & compiled against stable-rt 3.14-rt (0002 with minor fuzz)

Bogdan Purcareata (2):
  powerpc/kvm: Convert openpic lock to raw_spinlock
  powerpc/kvm: Limit MAX_VCPUS for guests running on RT Linux

 arch/powerpc/include/asm/kvm_host.h |  6 +++++
 arch/powerpc/kvm/mpic.c             | 44 ++++++++++++++++++-------------------
 2 files changed, 28 insertions(+), 22 deletions(-)

-- 
2.1.4


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 1/2] powerpc/kvm: Convert openpic lock to raw_spinlock
  2015-02-18  9:32 [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux Bogdan Purcareata
@ 2015-02-18  9:32 ` Bogdan Purcareata
  2015-02-23 22:43   ` Scott Wood
  2015-02-18  9:32 ` [PATCH 2/2] powerpc/kvm: Limit MAX_VCPUS for guests running on RT Linux Bogdan Purcareata
  2015-02-20 13:45 ` [PATCH 0/2] powerpc/kvm: Enable running guests " Alexander Graf
  2 siblings, 1 reply; 37+ messages in thread
From: Bogdan Purcareata @ 2015-02-18  9:32 UTC (permalink / raw)
  To: linuxppc-dev, linux-rt-users, bigeasy, agraf, pbonzini
  Cc: linux-kernel, scottwood, mihai.caraman, Bogdan Purcareata

This patch enables running intensive I/O workloads, e.g. netperf, in a guest
deployed on a RT host. It also enable guests to be SMP.

The openpic spinlock becomes a sleeping mutex on a RT system. This no longer
guarantees that EPR is atomic with exception delivery. The guest VCPU thread
fails due to a BUG_ON(preemptible()) when running netperf.

In order to make the kvmppc_mpic_set_epr() call safe on RT from non-atomic
context, convert the openpic lock to a raw_spinlock. A similar approach can
be seen for x86 platforms in the following commit [1].

Here are some comparative cyclitest measurements run inside a high priority RT
guest run on a RT host. The guest has 1 VCPU and the test has been run for 15
minutes. The guest runs ~750 hackbench processes as background stress.

                  spinlock  raw_spinlock
Min latency (us)  4         4
Avg latency (us)  15        19
Max latency (us)  70        62

[1] https://lkml.org/lkml/2010/1/11/289

Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com>
Reviewed-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/mpic.c | 44 ++++++++++++++++++++++----------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 39b3a8f..9fad0aa 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -196,7 +196,7 @@ struct openpic {
 	int num_mmio_regions;
 
 	gpa_t reg_base;
-	spinlock_t lock;
+	raw_spinlock_t lock;
 
 	/* Behavior control */
 	struct fsl_mpic_info *fsl;
@@ -1108,9 +1108,9 @@ static int openpic_cpu_write_internal(void *opaque, gpa_t addr,
 			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 
-		spin_unlock(&opp->lock);
+		raw_spin_unlock(&opp->lock);
 		kvm_notify_acked_irq(opp->kvm, 0, notify_eoi);
-		spin_lock(&opp->lock);
+		raw_spin_lock(&opp->lock);
 
 		break;
 	}
@@ -1185,12 +1185,12 @@ void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
 	int cpu = vcpu->arch.irq_cpu_id;
 	unsigned long flags;
 
-	spin_lock_irqsave(&opp->lock, flags);
+	raw_spin_lock_irqsave(&opp->lock, flags);
 
 	if ((opp->gcr & opp->mpic_mode_mask) == GCR_MODE_PROXY)
 		kvmppc_set_epr(vcpu, openpic_iack(opp, &opp->dst[cpu], cpu));
 
-	spin_unlock_irqrestore(&opp->lock, flags);
+	raw_spin_unlock_irqrestore(&opp->lock, flags);
 }
 
 static int openpic_cpu_read_internal(void *opaque, gpa_t addr,
@@ -1390,9 +1390,9 @@ static int kvm_mpic_read(struct kvm_io_device *this, gpa_t addr,
 		return -EINVAL;
 	}
 
-	spin_lock_irq(&opp->lock);
+	raw_spin_lock_irq(&opp->lock);
 	ret = kvm_mpic_read_internal(opp, addr - opp->reg_base, &u.val);
-	spin_unlock_irq(&opp->lock);
+	raw_spin_unlock_irq(&opp->lock);
 
 	/*
 	 * Technically only 32-bit accesses are allowed, but be nice to
@@ -1430,10 +1430,10 @@ static int kvm_mpic_write(struct kvm_io_device *this, gpa_t addr,
 		return -EOPNOTSUPP;
 	}
 
-	spin_lock_irq(&opp->lock);
+	raw_spin_lock_irq(&opp->lock);
 	ret = kvm_mpic_write_internal(opp, addr - opp->reg_base,
 				      *(const u32 *)ptr);
-	spin_unlock_irq(&opp->lock);
+	raw_spin_unlock_irq(&opp->lock);
 
 	pr_debug("%s: addr %llx ret %d val %x\n",
 		 __func__, addr, ret, *(const u32 *)ptr);
@@ -1504,14 +1504,14 @@ static int access_reg(struct openpic *opp, gpa_t addr, u32 *val, int type)
 	if (addr & 3)
 		return -ENXIO;
 
-	spin_lock_irq(&opp->lock);
+	raw_spin_lock_irq(&opp->lock);
 
 	if (type == ATTR_SET)
 		ret = kvm_mpic_write_internal(opp, addr, *val);
 	else
 		ret = kvm_mpic_read_internal(opp, addr, val);
 
-	spin_unlock_irq(&opp->lock);
+	raw_spin_unlock_irq(&opp->lock);
 
 	pr_debug("%s: type %d addr %llx val %x\n", __func__, type, addr, *val);
 
@@ -1548,9 +1548,9 @@ static int mpic_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
 		if (attr32 != 0 && attr32 != 1)
 			return -EINVAL;
 
-		spin_lock_irq(&opp->lock);
+		raw_spin_lock_irq(&opp->lock);
 		openpic_set_irq(opp, attr->attr, attr32);
-		spin_unlock_irq(&opp->lock);
+		raw_spin_unlock_irq(&opp->lock);
 		return 0;
 	}
 
@@ -1595,9 +1595,9 @@ static int mpic_get_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
 		if (attr->attr > MAX_SRC)
 			return -EINVAL;
 
-		spin_lock_irq(&opp->lock);
+		raw_spin_lock_irq(&opp->lock);
 		attr32 = opp->src[attr->attr].pending;
-		spin_unlock_irq(&opp->lock);
+		raw_spin_unlock_irq(&opp->lock);
 
 		if (put_user(attr32, (u32 __user *)(long)attr->addr))
 			return -EFAULT;
@@ -1673,7 +1673,7 @@ static int mpic_create(struct kvm_device *dev, u32 type)
 	opp->kvm = dev->kvm;
 	opp->dev = dev;
 	opp->model = type;
-	spin_lock_init(&opp->lock);
+	raw_spin_lock_init(&opp->lock);
 
 	add_mmio_region(opp, &openpic_gbl_mmio);
 	add_mmio_region(opp, &openpic_tmr_mmio);
@@ -1746,7 +1746,7 @@ int kvmppc_mpic_connect_vcpu(struct kvm_device *dev, struct kvm_vcpu *vcpu,
 	if (cpu < 0 || cpu >= MAX_CPU)
 		return -EPERM;
 
-	spin_lock_irq(&opp->lock);
+	raw_spin_lock_irq(&opp->lock);
 
 	if (opp->dst[cpu].vcpu) {
 		ret = -EEXIST;
@@ -1769,7 +1769,7 @@ int kvmppc_mpic_connect_vcpu(struct kvm_device *dev, struct kvm_vcpu *vcpu,
 		vcpu->arch.epr_flags |= KVMPPC_EPR_KERNEL;
 
 out:
-	spin_unlock_irq(&opp->lock);
+	raw_spin_unlock_irq(&opp->lock);
 	return ret;
 }
 
@@ -1799,9 +1799,9 @@ static int mpic_set_irq(struct kvm_kernel_irq_routing_entry *e,
 	struct openpic *opp = kvm->arch.mpic;
 	unsigned long flags;
 
-	spin_lock_irqsave(&opp->lock, flags);
+	raw_spin_lock_irqsave(&opp->lock, flags);
 	openpic_set_irq(opp, irq, level);
-	spin_unlock_irqrestore(&opp->lock, flags);
+	raw_spin_unlock_irqrestore(&opp->lock, flags);
 
 	/* All code paths we care about don't check for the return value */
 	return 0;
@@ -1813,14 +1813,14 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
 	struct openpic *opp = kvm->arch.mpic;
 	unsigned long flags;
 
-	spin_lock_irqsave(&opp->lock, flags);
+	raw_spin_lock_irqsave(&opp->lock, flags);
 
 	/*
 	 * XXX We ignore the target address for now, as we only support
 	 *     a single MSI bank.
 	 */
 	openpic_msi_write(kvm->arch.mpic, MSIIR_OFFSET, e->msi.data);
-	spin_unlock_irqrestore(&opp->lock, flags);
+	raw_spin_unlock_irqrestore(&opp->lock, flags);
 
 	/* All code paths we care about don't check for the return value */
 	return 0;
-- 
2.1.4


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 2/2] powerpc/kvm: Limit MAX_VCPUS for guests running on RT Linux
  2015-02-18  9:32 [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux Bogdan Purcareata
  2015-02-18  9:32 ` [PATCH 1/2] powerpc/kvm: Convert openpic lock to raw_spinlock Bogdan Purcareata
@ 2015-02-18  9:32 ` Bogdan Purcareata
  2015-02-18  9:36   ` Sebastian Andrzej Siewior
  2015-02-20 13:45   ` Alexander Graf
  2015-02-20 13:45 ` [PATCH 0/2] powerpc/kvm: Enable running guests " Alexander Graf
  2 siblings, 2 replies; 37+ messages in thread
From: Bogdan Purcareata @ 2015-02-18  9:32 UTC (permalink / raw)
  To: linuxppc-dev, linux-rt-users, bigeasy, agraf, pbonzini
  Cc: linux-kernel, scottwood, mihai.caraman, Bogdan Purcareata

Due to the introduction of the raw_spinlock for the KVM openpic, guests with a
high number of VCPUs may induce great latencies on the underlying RT Linux
system (e.g. cyclictest reports latencies of ~15ms for guests with 24 VCPUs).
This can be further aggravated by sending a lot of external interrupts to the
guest.

A malicious app can abuse this scenario, causing a DoS of the host Linux.
Until the KVM openpic code is refactored to use finer lock granularity, impose
a limitation on the number of VCPUs a guest can have when running on a
PREEMPT_RT_FULL system with KVM_MPIC emulation.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com>
Reviewed-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/include/asm/kvm_host.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 8ef0512..6f6b928 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -36,8 +36,14 @@
 #include <asm/cacheflush.h>
 #include <asm/hvcall.h>
 
+#if defined(CONFIG_PREEMPT_RT_FULL) && defined(CONFIG_KVM_MPIC)
+/* Limit the number of vcpus due to in-kernel mpic concurrency */
+#define KVM_MAX_VCPUS		4
+#define KVM_MAX_VCORES		4
+#else
 #define KVM_MAX_VCPUS		NR_CPUS
 #define KVM_MAX_VCORES		NR_CPUS
+#endif
 #define KVM_USER_MEM_SLOTS 32
 #define KVM_MEM_SLOTS_NUM KVM_USER_MEM_SLOTS
 
-- 
2.1.4


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] powerpc/kvm: Limit MAX_VCPUS for guests running on RT Linux
  2015-02-18  9:32 ` [PATCH 2/2] powerpc/kvm: Limit MAX_VCPUS for guests running on RT Linux Bogdan Purcareata
@ 2015-02-18  9:36   ` Sebastian Andrzej Siewior
  2015-02-20 13:45   ` Alexander Graf
  1 sibling, 0 replies; 37+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-02-18  9:36 UTC (permalink / raw)
  To: Bogdan Purcareata, linuxppc-dev, linux-rt-users, agraf, pbonzini
  Cc: linux-kernel, scottwood, mihai.caraman

On 02/18/2015 10:32 AM, Bogdan Purcareata wrote:
> Due to the introduction of the raw_spinlock for the KVM openpic, guests with a
> high number of VCPUs may induce great latencies on the underlying RT Linux
> system (e.g. cyclictest reports latencies of ~15ms for guests with 24 VCPUs).
> This can be further aggravated by sending a lot of external interrupts to the
> guest.
> 
> A malicious app can abuse this scenario, causing a DoS of the host Linux.
> Until the KVM openpic code is refactored to use finer lock granularity, impose
> a limitation on the number of VCPUs a guest can have when running on a
> PREEMPT_RT_FULL system with KVM_MPIC emulation.

How is this possible? You take the raw lock, write a register, release
the raw lock. How can the guest lockup the host? Is this write blocking
in guest?

Sebastian

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] powerpc/kvm: Limit MAX_VCPUS for guests running on RT Linux
  2015-02-18  9:32 ` [PATCH 2/2] powerpc/kvm: Limit MAX_VCPUS for guests running on RT Linux Bogdan Purcareata
  2015-02-18  9:36   ` Sebastian Andrzej Siewior
@ 2015-02-20 13:45   ` Alexander Graf
  2015-02-23 22:48     ` Scott Wood
  1 sibling, 1 reply; 37+ messages in thread
From: Alexander Graf @ 2015-02-20 13:45 UTC (permalink / raw)
  To: Bogdan Purcareata, linuxppc-dev, linux-rt-users, bigeasy, pbonzini
  Cc: linux-kernel, scottwood, mihai.caraman



On 18.02.15 10:32, Bogdan Purcareata wrote:
> Due to the introduction of the raw_spinlock for the KVM openpic, guests with a
> high number of VCPUs may induce great latencies on the underlying RT Linux
> system (e.g. cyclictest reports latencies of ~15ms for guests with 24 VCPUs).
> This can be further aggravated by sending a lot of external interrupts to the
> guest.
> 
> A malicious app can abuse this scenario, causing a DoS of the host Linux.
> Until the KVM openpic code is refactored to use finer lock granularity, impose
> a limitation on the number of VCPUs a guest can have when running on a
> PREEMPT_RT_FULL system with KVM_MPIC emulation.
> 
> Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
> Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com>
> Reviewed-by: Scott Wood <scottwood@freescale.com>

I don't think this patch is reasonable to take upstream. If we have a
latency issue, whoever spawned KVM VMs made a decision to spawn such big
VMs.

I'd say let's leave it at MAX_VCPUS == NR_CPUS for now and rather get
someone to resolve any locking problems for real ;).


Alex

> ---
>  arch/powerpc/include/asm/kvm_host.h | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 8ef0512..6f6b928 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -36,8 +36,14 @@
>  #include <asm/cacheflush.h>
>  #include <asm/hvcall.h>
>  
> +#if defined(CONFIG_PREEMPT_RT_FULL) && defined(CONFIG_KVM_MPIC)
> +/* Limit the number of vcpus due to in-kernel mpic concurrency */
> +#define KVM_MAX_VCPUS		4
> +#define KVM_MAX_VCORES		4
> +#else
>  #define KVM_MAX_VCPUS		NR_CPUS
>  #define KVM_MAX_VCORES		NR_CPUS
> +#endif
>  #define KVM_USER_MEM_SLOTS 32
>  #define KVM_MEM_SLOTS_NUM KVM_USER_MEM_SLOTS
>  
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-18  9:32 [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux Bogdan Purcareata
  2015-02-18  9:32 ` [PATCH 1/2] powerpc/kvm: Convert openpic lock to raw_spinlock Bogdan Purcareata
  2015-02-18  9:32 ` [PATCH 2/2] powerpc/kvm: Limit MAX_VCPUS for guests running on RT Linux Bogdan Purcareata
@ 2015-02-20 13:45 ` Alexander Graf
  2015-02-20 14:12   ` Paolo Bonzini
  2 siblings, 1 reply; 37+ messages in thread
From: Alexander Graf @ 2015-02-20 13:45 UTC (permalink / raw)
  To: Bogdan Purcareata, linuxppc-dev, linux-rt-users, bigeasy, pbonzini
  Cc: linux-kernel, scottwood, mihai.caraman, Thomas Gleixner



On 18.02.15 10:32, Bogdan Purcareata wrote:
> This patchset enables running KVM SMP guests with external interrupts on an
> underlying RT-enabled Linux. Previous to this patch, a guest with in-kernel MPIC
> emulation could easily panic the kernel due to preemption when delivering IPIs
> and external interrupts, because of the openpic spinlock becoming a sleeping
> mutex on PREEMPT_RT_FULL Linux.
> 
> 0001: converts the openpic spinlock to a raw spinlock, in order to circumvent
> this behavior. While this change is targeted for a RT enabled Linux, it has no
> effect on upstream kvm-ppc, so send it upstream for better future maintenance.
> 
> 0002: introduces a limit on the maximum VCPUs a guest can have, in order to
> prevent potential DoS attack due to large system latencies. This patch is
> targeted to RT (due to CONFIG_PREEMPT_RT_FULL), but it can also be applied on
> upstream Linux, with no effect. Not sure if it's best to send it upstream and
> have a hanging CONFIG_PREEMPT_RT_FULL check there, with no effect, or send it
> against linux-stable-rt. Please apply as you consider appropriate.

Thomas, what is the usual approach for patches like this? Do you take
them into your rt tree or should they get integrated to upstream?


Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-20 13:45 ` [PATCH 0/2] powerpc/kvm: Enable running guests " Alexander Graf
@ 2015-02-20 14:12   ` Paolo Bonzini
  2015-02-20 14:16     ` Alexander Graf
  2015-02-20 14:54     ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 37+ messages in thread
From: Paolo Bonzini @ 2015-02-20 14:12 UTC (permalink / raw)
  To: Alexander Graf, Bogdan Purcareata, linuxppc-dev, linux-rt-users, bigeasy
  Cc: linux-kernel, scottwood, mihai.caraman, Thomas Gleixner



On 20/02/2015 14:45, Alexander Graf wrote:
> 
> 
> On 18.02.15 10:32, Bogdan Purcareata wrote:
>> This patchset enables running KVM SMP guests with external interrupts on an
>> underlying RT-enabled Linux. Previous to this patch, a guest with in-kernel MPIC
>> emulation could easily panic the kernel due to preemption when delivering IPIs
>> and external interrupts, because of the openpic spinlock becoming a sleeping
>> mutex on PREEMPT_RT_FULL Linux.
>>
>> 0001: converts the openpic spinlock to a raw spinlock, in order to circumvent
>> this behavior. While this change is targeted for a RT enabled Linux, it has no
>> effect on upstream kvm-ppc, so send it upstream for better future maintenance.
>>
>> 0002: introduces a limit on the maximum VCPUs a guest can have, in order to
>> prevent potential DoS attack due to large system latencies. This patch is
>> targeted to RT (due to CONFIG_PREEMPT_RT_FULL), but it can also be applied on
>> upstream Linux, with no effect. Not sure if it's best to send it upstream and
>> have a hanging CONFIG_PREEMPT_RT_FULL check there, with no effect, or send it
>> against linux-stable-rt. Please apply as you consider appropriate.
> 
> Thomas, what is the usual approach for patches like this? Do you take
> them into your rt tree or should they get integrated to upstream?

Patch 1 is definitely suitable for upstream, that's the reason why we
have raw_spin_lock vs. raw_spin_unlock.

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-20 14:12   ` Paolo Bonzini
@ 2015-02-20 14:16     ` Alexander Graf
  2015-02-20 14:54     ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 37+ messages in thread
From: Alexander Graf @ 2015-02-20 14:16 UTC (permalink / raw)
  To: Paolo Bonzini, Bogdan Purcareata, linuxppc-dev, linux-rt-users, bigeasy
  Cc: linux-kernel, scottwood, mihai.caraman, Thomas Gleixner



On 20.02.15 15:12, Paolo Bonzini wrote:
> 
> 
> On 20/02/2015 14:45, Alexander Graf wrote:
>>
>>
>> On 18.02.15 10:32, Bogdan Purcareata wrote:
>>> This patchset enables running KVM SMP guests with external interrupts on an
>>> underlying RT-enabled Linux. Previous to this patch, a guest with in-kernel MPIC
>>> emulation could easily panic the kernel due to preemption when delivering IPIs
>>> and external interrupts, because of the openpic spinlock becoming a sleeping
>>> mutex on PREEMPT_RT_FULL Linux.
>>>
>>> 0001: converts the openpic spinlock to a raw spinlock, in order to circumvent
>>> this behavior. While this change is targeted for a RT enabled Linux, it has no
>>> effect on upstream kvm-ppc, so send it upstream for better future maintenance.
>>>
>>> 0002: introduces a limit on the maximum VCPUs a guest can have, in order to
>>> prevent potential DoS attack due to large system latencies. This patch is
>>> targeted to RT (due to CONFIG_PREEMPT_RT_FULL), but it can also be applied on
>>> upstream Linux, with no effect. Not sure if it's best to send it upstream and
>>> have a hanging CONFIG_PREEMPT_RT_FULL check there, with no effect, or send it
>>> against linux-stable-rt. Please apply as you consider appropriate.
>>
>> Thomas, what is the usual approach for patches like this? Do you take
>> them into your rt tree or should they get integrated to upstream?
> 
> Patch 1 is definitely suitable for upstream, that's the reason why we
> have raw_spin_lock vs. raw_spin_unlock.

I see, perfect :).

Bogdan, please resend patch 1 with CC to kvm-ppc@vger so that I can pick
it up from patchworks.


Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-20 14:12   ` Paolo Bonzini
  2015-02-20 14:16     ` Alexander Graf
@ 2015-02-20 14:54     ` Sebastian Andrzej Siewior
  2015-02-20 14:57       ` Paolo Bonzini
                         ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-02-20 14:54 UTC (permalink / raw)
  To: Paolo Bonzini, Alexander Graf, Bogdan Purcareata, linuxppc-dev,
	linux-rt-users
  Cc: linux-kernel, scottwood, mihai.caraman, Thomas Gleixner

On 02/20/2015 03:12 PM, Paolo Bonzini wrote:
>> Thomas, what is the usual approach for patches like this? Do you take
>> them into your rt tree or should they get integrated to upstream?
> 
> Patch 1 is definitely suitable for upstream, that's the reason why we
> have raw_spin_lock vs. raw_spin_unlock.

raw_spin_lock were introduced in c2f21ce2e31286a0a32 ("locking:
Implement new raw_spinlock). They are used in context which runs with
IRQs off - especially on -RT. This includes usually interrupt
controllers and related core-code pieces.

Usually you see "scheduling while atomic" on -RT and convert them to
raw locks if it is appropriate.

Bogdan wrote in 2/2 that he needs to limit the number of CPUs in oder
not cause a DoS and large latencies in the host. I haven't seen an
answer to my why question. Because if the conversation leads to
large latencies in the host then it does not look right.

Each host PIC has a rawlock and does mostly just mask/unmask and the
raw lock makes sure the value written is not mixed up due to
preemption.
This hardly increase latencies because the "locked" path is very short.
If this conversation leads to higher latencies then the locked path is
too long and hardly suitable to become a rawlock.

> Paolo
> 

Sebastian

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-20 14:54     ` Sebastian Andrzej Siewior
@ 2015-02-20 14:57       ` Paolo Bonzini
  2015-02-20 15:06         ` Sebastian Andrzej Siewior
  2015-02-23  7:29       ` Purcareata Bogdan
  2015-02-23 23:27       ` Scott Wood
  2 siblings, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2015-02-20 14:57 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Alexander Graf, Bogdan Purcareata,
	linuxppc-dev, linux-rt-users
  Cc: linux-kernel, scottwood, mihai.caraman, Thomas Gleixner



On 20/02/2015 15:54, Sebastian Andrzej Siewior wrote:
> Usually you see "scheduling while atomic" on -RT and convert them to
> raw locks if it is appropriate.
> 
> Bogdan wrote in 2/2 that he needs to limit the number of CPUs in oder
> not cause a DoS and large latencies in the host. I haven't seen an
> answer to my why question. Because if the conversation leads to
> large latencies in the host then it does not look right.
> 
> Each host PIC has a rawlock and does mostly just mask/unmask and the
> raw lock makes sure the value written is not mixed up due to
> preemption.
> This hardly increase latencies because the "locked" path is very short.
> If this conversation leads to higher latencies then the locked path is
> too long and hardly suitable to become a rawlock.

Yes, but large latencies just mean the code has to be rewritten (x86
doesn't anymore do event injection in an atomic regions for example).
Until it is, using raw_spin_lock is correct.

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-20 14:57       ` Paolo Bonzini
@ 2015-02-20 15:06         ` Sebastian Andrzej Siewior
  2015-02-20 15:10           ` Paolo Bonzini
  2015-02-23  7:50           ` Purcareata Bogdan
  0 siblings, 2 replies; 37+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-02-20 15:06 UTC (permalink / raw)
  To: Paolo Bonzini, Alexander Graf, Bogdan Purcareata, linuxppc-dev,
	linux-rt-users
  Cc: linux-kernel, scottwood, mihai.caraman, Thomas Gleixner

On 02/20/2015 03:57 PM, Paolo Bonzini wrote:
> 
> 
> On 20/02/2015 15:54, Sebastian Andrzej Siewior wrote:
>> Usually you see "scheduling while atomic" on -RT and convert them to
>> raw locks if it is appropriate.
>>
>> Bogdan wrote in 2/2 that he needs to limit the number of CPUs in oder
>> not cause a DoS and large latencies in the host. I haven't seen an
>> answer to my why question. Because if the conversation leads to
>> large latencies in the host then it does not look right.
>>
>> Each host PIC has a rawlock and does mostly just mask/unmask and the
>> raw lock makes sure the value written is not mixed up due to
>> preemption.
>> This hardly increase latencies because the "locked" path is very short.
>> If this conversation leads to higher latencies then the locked path is
>> too long and hardly suitable to become a rawlock.
> 
> Yes, but large latencies just mean the code has to be rewritten (x86
> doesn't anymore do event injection in an atomic regions for example).
> Until it is, using raw_spin_lock is correct.

It does not sound like it. It sounds more like disabling interrupts to
get things run faster and then limit it on a different corner to not
blow up everything.
Max latencies was decreased "Max latency (us)  70        62" and that
is why this is done? For 8 us and possible DoS in case there are too
many cpus?

> Paolo
> 

Sebastian

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-20 15:06         ` Sebastian Andrzej Siewior
@ 2015-02-20 15:10           ` Paolo Bonzini
  2015-02-20 15:17             ` Sebastian Andrzej Siewior
  2015-02-23  7:50           ` Purcareata Bogdan
  1 sibling, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2015-02-20 15:10 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Alexander Graf, Bogdan Purcareata,
	linuxppc-dev, linux-rt-users
  Cc: linux-kernel, scottwood, mihai.caraman, Thomas Gleixner



On 20/02/2015 16:06, Sebastian Andrzej Siewior wrote:
> On 02/20/2015 03:57 PM, Paolo Bonzini wrote:

>> Yes, but large latencies just mean the code has to be rewritten (x86
>> doesn't anymore do event injection in an atomic regions for example).
>> Until it is, using raw_spin_lock is correct.
> 
> It does not sound like it. It sounds more like disabling interrupts to
> get things run faster and then limit it on a different corner to not
> blow up everything.

"This patchset enables running KVM SMP guests with external interrupts
on an underlying RT-enabled Linux. Previous to this patch, a guest with
in-kernel MPIC emulation could easily panic the kernel due to preemption
when delivering IPIs and external interrupts, because of the openpic
spinlock becoming a sleeping mutex on PREEMPT_RT_FULL Linux".

> Max latencies was decreased "Max latency (us)  70        62" and that
> is why this is done? For 8 us and possible DoS in case there are too
> many cpus?

My understanding is that:

1) netperf can get you a BUG KVM, and raw_spinlock fixes that

2) cyclictest did not trigger the BUG, and you can also get reduced
latency from using raw_spinlock.

I think we agree that (2) is not a factor in accepting the patch.

Paolo

>> Paolo
>>
> 
> Sebastian
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-20 15:10           ` Paolo Bonzini
@ 2015-02-20 15:17             ` Sebastian Andrzej Siewior
  2015-02-23  8:12               ` Purcareata Bogdan
  0 siblings, 1 reply; 37+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-02-20 15:17 UTC (permalink / raw)
  To: Paolo Bonzini, Alexander Graf, Bogdan Purcareata, linuxppc-dev,
	linux-rt-users
  Cc: linux-kernel, scottwood, mihai.caraman, Thomas Gleixner

On 02/20/2015 04:10 PM, Paolo Bonzini wrote:
> On 20/02/2015 16:06, Sebastian Andrzej Siewior wrote:
>> On 02/20/2015 03:57 PM, Paolo Bonzini wrote:
> 
>>> Yes, but large latencies just mean the code has to be rewritten (x86
>>> doesn't anymore do event injection in an atomic regions for example).
>>> Until it is, using raw_spin_lock is correct.
>>
>> It does not sound like it. It sounds more like disabling interrupts to
>> get things run faster and then limit it on a different corner to not
>> blow up everything.
> 
> "This patchset enables running KVM SMP guests with external interrupts
> on an underlying RT-enabled Linux. Previous to this patch, a guest with
> in-kernel MPIC emulation could easily panic the kernel due to preemption
> when delivering IPIs and external interrupts, because of the openpic
> spinlock becoming a sleeping mutex on PREEMPT_RT_FULL Linux".
> 
>> Max latencies was decreased "Max latency (us)  70        62" and that
>> is why this is done? For 8 us and possible DoS in case there are too
>> many cpus?
> 
> My understanding is that:
> 
> 1) netperf can get you a BUG KVM, and raw_spinlock fixes that

May I please see a backtrace with context tracking which states where
the interrupts / preemption gets disabled and where the lock was taken?

I'm not totally against this patch I just want to make sure this is not
a blind raw conversation to shup up the warning the kernel throws.

> 2) cyclictest did not trigger the BUG, and you can also get reduced
> latency from using raw_spinlock.
> 
> I think we agree that (2) is not a factor in accepting the patch.
good :)

> 
> Paolo
> 
Sebastian

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-20 14:54     ` Sebastian Andrzej Siewior
  2015-02-20 14:57       ` Paolo Bonzini
@ 2015-02-23  7:29       ` Purcareata Bogdan
  2015-02-23 23:27       ` Scott Wood
  2 siblings, 0 replies; 37+ messages in thread
From: Purcareata Bogdan @ 2015-02-23  7:29 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Paolo Bonzini, Alexander Graf,
	Bogdan Purcareata, linuxppc-dev, linux-rt-users
  Cc: linux-kernel, scottwood, mihai.caraman, Thomas Gleixner

On 20.02.2015 16:54, Sebastian Andrzej Siewior wrote:
> On 02/20/2015 03:12 PM, Paolo Bonzini wrote:
>>> Thomas, what is the usual approach for patches like this? Do you take
>>> them into your rt tree or should they get integrated to upstream?
>>
>> Patch 1 is definitely suitable for upstream, that's the reason why we
>> have raw_spin_lock vs. raw_spin_unlock.
>
> raw_spin_lock were introduced in c2f21ce2e31286a0a32 ("locking:
> Implement new raw_spinlock). They are used in context which runs with
> IRQs off - especially on -RT. This includes usually interrupt
> controllers and related core-code pieces.
>
> Usually you see "scheduling while atomic" on -RT and convert them to
> raw locks if it is appropriate.
>
> Bogdan wrote in 2/2 that he needs to limit the number of CPUs in oder
> not cause a DoS and large latencies in the host. I haven't seen an
> answer to my why question. Because if the conversation leads to
> large latencies in the host then it does not look right.

What I did notice were bad cyclictest results, when run in a guest with 
24 VCPUs. There were 24 netperf flows running in the guest. The max 
cyclictest latencies got up to 15ms in the guest, however I haven't 
captured any host side information related to preemptirqs off statistics.

What I was planning to do in the past days was to rerun the test and 
come up with the host preemptirqs off disabled statistics (mainly the 
max latency), so I could have a more reliable argument. I haven't had 
the time nor the setup to do that yet, and will come back with this as 
soon as I have them available.

> Each host PIC has a rawlock and does mostly just mask/unmask and the
> raw lock makes sure the value written is not mixed up due to
> preemption.
> This hardly increase latencies because the "locked" path is very short.
> If this conversation leads to higher latencies then the locked path is
> too long and hardly suitable to become a rawlock.

 From my understanding, the kvm openpic emulation code does more than 
just that - it requires to be atomic with interrupt delivery. This might 
mean the bad cyclictest max latencies visible from the guest side 
(15ms), may also have a correspondent to how much time that raw spinlock 
is taken, leading to an unresponsive host.

Best regards,
Bogdan P.

>> Paolo
>>
>
> Sebastian
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-20 15:06         ` Sebastian Andrzej Siewior
  2015-02-20 15:10           ` Paolo Bonzini
@ 2015-02-23  7:50           ` Purcareata Bogdan
  1 sibling, 0 replies; 37+ messages in thread
From: Purcareata Bogdan @ 2015-02-23  7:50 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Paolo Bonzini, Alexander Graf,
	Bogdan Purcareata, linuxppc-dev, linux-rt-users
  Cc: linux-kernel, scottwood, mihai.caraman, Thomas Gleixner

On 20.02.2015 17:06, Sebastian Andrzej Siewior wrote:
> On 02/20/2015 03:57 PM, Paolo Bonzini wrote:
>>
>>
>> On 20/02/2015 15:54, Sebastian Andrzej Siewior wrote:
>>> Usually you see "scheduling while atomic" on -RT and convert them to
>>> raw locks if it is appropriate.
>>>
>>> Bogdan wrote in 2/2 that he needs to limit the number of CPUs in oder
>>> not cause a DoS and large latencies in the host. I haven't seen an
>>> answer to my why question. Because if the conversation leads to
>>> large latencies in the host then it does not look right.
>>>
>>> Each host PIC has a rawlock and does mostly just mask/unmask and the
>>> raw lock makes sure the value written is not mixed up due to
>>> preemption.
>>> This hardly increase latencies because the "locked" path is very short.
>>> If this conversation leads to higher latencies then the locked path is
>>> too long and hardly suitable to become a rawlock.
>>
>> Yes, but large latencies just mean the code has to be rewritten (x86
>> doesn't anymore do event injection in an atomic regions for example).
>> Until it is, using raw_spin_lock is correct.
>
> It does not sound like it. It sounds more like disabling interrupts to
> get things run faster and then limit it on a different corner to not
> blow up everything.
> Max latencies was decreased "Max latency (us)  70        62" and that
> is why this is done? For 8 us and possible DoS in case there are too
> many cpus?

The main reason for this patch was to enable KVM guests to run on RT 
hosts in certain scenarios, such as delivering external interrupts to 
the guest and the guest being SMP. The cyclictest measurements were just 
a "sanity check" to make sure the latencies don't get messed up too bad, 
albeit in a light scenario (guest with 1 VCPU), for a use case when the 
guest is not SMP and doesn't have any external interrupts delivered. 
This latter scenario works even without the kvm openpic being a 
raw_spinlock.

Previous to this patch, KVM was indeed blowing up on guest_enter [1], 
and making the openpic lock a raw_spinlock fixes that, without causing 
too much cyclictest damage when the guest doesn't have many VCPUs. I had 
a discussion with Scott Wood a while ago regarding delivering external 
interrupts to the guest, and he mentioned that the correct solution was 
to rework the entire interrupt delivery mechanism into multiple lock 
domains, minimize the code on the EPR path and the locking involved. 
Until that can be achieved, converting the openpic lock to a 
raw_spinlock would be acceptable, as long as we keep the number of guest 
VCPUs small, so as to not cause big host latencies.

[1] http://lxr.free-electrons.com/source/include/linux/kvm_host.h#L762

Best regards,
Bogdan P.

>> Paolo
>>
>
> Sebastian
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-20 15:17             ` Sebastian Andrzej Siewior
@ 2015-02-23  8:12               ` Purcareata Bogdan
  0 siblings, 0 replies; 37+ messages in thread
From: Purcareata Bogdan @ 2015-02-23  8:12 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Paolo Bonzini, Alexander Graf,
	Bogdan Purcareata, linuxppc-dev, linux-rt-users
  Cc: linux-kernel, scottwood, mihai.caraman, Thomas Gleixner

On 20.02.2015 17:17, Sebastian Andrzej Siewior wrote:
> On 02/20/2015 04:10 PM, Paolo Bonzini wrote:
>> On 20/02/2015 16:06, Sebastian Andrzej Siewior wrote:
>>> On 02/20/2015 03:57 PM, Paolo Bonzini wrote:
>>
>>>> Yes, but large latencies just mean the code has to be rewritten (x86
>>>> doesn't anymore do event injection in an atomic regions for example).
>>>> Until it is, using raw_spin_lock is correct.
>>>
>>> It does not sound like it. It sounds more like disabling interrupts to
>>> get things run faster and then limit it on a different corner to not
>>> blow up everything.
>>
>> "This patchset enables running KVM SMP guests with external interrupts
>> on an underlying RT-enabled Linux. Previous to this patch, a guest with
>> in-kernel MPIC emulation could easily panic the kernel due to preemption
>> when delivering IPIs and external interrupts, because of the openpic
>> spinlock becoming a sleeping mutex on PREEMPT_RT_FULL Linux".
>>
>>> Max latencies was decreased "Max latency (us)  70        62" and that
>>> is why this is done? For 8 us and possible DoS in case there are too
>>> many cpus?
>>
>> My understanding is that:
>>
>> 1) netperf can get you a BUG KVM, and raw_spinlock fixes that

Actually, it's not just netperf. The bug triggers in the following 
scenarios:
- running CPU intensive task (while true; do date; done) in SMP guest 
(even with 2 VCPUs)
- running netperf in guest
- running cyclictest in SMP guest

> May I please see a backtrace with context tracking which states where
> the interrupts / preemption gets disabled and where the lock was taken?

Will do, I will get back to you as soon as I have it available. I will 
try and capture it using function trace.

> I'm not totally against this patch I just want to make sure this is not
> a blind raw conversation to shup up the warning the kernel throws.
>
>> 2) cyclictest did not trigger the BUG, and you can also get reduced
>> latency from using raw_spinlock.
>>
>> I think we agree that (2) is not a factor in accepting the patch.
> good :)
>
>>
>> Paolo
>>
> Sebastian
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] powerpc/kvm: Convert openpic lock to raw_spinlock
  2015-02-18  9:32 ` [PATCH 1/2] powerpc/kvm: Convert openpic lock to raw_spinlock Bogdan Purcareata
@ 2015-02-23 22:43   ` Scott Wood
  0 siblings, 0 replies; 37+ messages in thread
From: Scott Wood @ 2015-02-23 22:43 UTC (permalink / raw)
  To: Bogdan Purcareata
  Cc: linuxppc-dev, linux-rt-users, bigeasy, agraf, pbonzini,
	linux-kernel, mihai.caraman

On Wed, 2015-02-18 at 09:32 +0000, Bogdan Purcareata wrote:
> This patch enables running intensive I/O workloads, e.g. netperf, in a guest
> deployed on a RT host. It also enable guests to be SMP.
> 
> The openpic spinlock becomes a sleeping mutex on a RT system. This no longer
> guarantees that EPR is atomic with exception delivery. The guest VCPU thread
> fails due to a BUG_ON(preemptible()) when running netperf.
> 
> In order to make the kvmppc_mpic_set_epr() call safe on RT from non-atomic
> context, convert the openpic lock to a raw_spinlock. A similar approach can
> be seen for x86 platforms in the following commit [1].
> 
> Here are some comparative cyclitest measurements run inside a high priority RT
> guest run on a RT host. The guest has 1 VCPU and the test has been run for 15
> minutes. The guest runs ~750 hackbench processes as background stress.
> 
>                   spinlock  raw_spinlock
> Min latency (us)  4         4
> Avg latency (us)  15        19
> Max latency (us)  70        62
> 
> [1] https://lkml.org/lkml/2010/1/11/289
> 
> Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com>
> Reviewed-by: Scott Wood <scottwood@freescale.com>

Where did that Reviewed-by: come from?

A +1 in Gerrit on an internal tree does not translate into an upstream
Reviewed-by.

-Scott



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] powerpc/kvm: Limit MAX_VCPUS for guests running on RT Linux
  2015-02-20 13:45   ` Alexander Graf
@ 2015-02-23 22:48     ` Scott Wood
  0 siblings, 0 replies; 37+ messages in thread
From: Scott Wood @ 2015-02-23 22:48 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Bogdan Purcareata, linuxppc-dev, linux-rt-users, bigeasy,
	pbonzini, linux-kernel, mihai.caraman

On Fri, 2015-02-20 at 14:45 +0100, Alexander Graf wrote:
> 
> On 18.02.15 10:32, Bogdan Purcareata wrote:
> > Due to the introduction of the raw_spinlock for the KVM openpic, guests with a
> > high number of VCPUs may induce great latencies on the underlying RT Linux
> > system (e.g. cyclictest reports latencies of ~15ms for guests with 24 VCPUs).
> > This can be further aggravated by sending a lot of external interrupts to the
> > guest.
> > 
> > A malicious app can abuse this scenario, causing a DoS of the host Linux.
> > Until the KVM openpic code is refactored to use finer lock granularity, impose
> > a limitation on the number of VCPUs a guest can have when running on a
> > PREEMPT_RT_FULL system with KVM_MPIC emulation.
> > 
> > Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
> > Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com>
> > Reviewed-by: Scott Wood <scottwood@freescale.com>
> 
> I don't think this patch is reasonable to take upstream.

I agree (or at least, I don't think the raw lock conversion should be
separated from the vcpu limitation that makes it clear that it's a
temporary hack), because it ought to be fixed properly.

>  If we have a
> latency issue, whoever spawned KVM VMs made a decision to spawn such big
> VMs.

I disagree.  The point of PREEMPT_RT is to prevent the majority of
kernel code from excessively impacting latency.  When you start using
raw locks you're stepping outside those bounds and need to ensure that
you don't hand things within those bounds (which includes userspace) the
ability to excessively impact latency.

-Scott



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-20 14:54     ` Sebastian Andrzej Siewior
  2015-02-20 14:57       ` Paolo Bonzini
  2015-02-23  7:29       ` Purcareata Bogdan
@ 2015-02-23 23:27       ` Scott Wood
  2015-02-25 16:36         ` Sebastian Andrzej Siewior
  2015-02-26 13:02         ` Paolo Bonzini
  2 siblings, 2 replies; 37+ messages in thread
From: Scott Wood @ 2015-02-23 23:27 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Paolo Bonzini, Alexander Graf, Bogdan Purcareata, linuxppc-dev,
	linux-rt-users, linux-kernel, mihai.caraman, Thomas Gleixner

On Fri, 2015-02-20 at 15:54 +0100, Sebastian Andrzej Siewior wrote:
> On 02/20/2015 03:12 PM, Paolo Bonzini wrote:
> >> Thomas, what is the usual approach for patches like this? Do you take
> >> them into your rt tree or should they get integrated to upstream?
> > 
> > Patch 1 is definitely suitable for upstream, that's the reason why we
> > have raw_spin_lock vs. raw_spin_unlock.
> 
> raw_spin_lock were introduced in c2f21ce2e31286a0a32 ("locking:
> Implement new raw_spinlock). They are used in context which runs with
> IRQs off - especially on -RT. This includes usually interrupt
> controllers and related core-code pieces.
> 
> Usually you see "scheduling while atomic" on -RT and convert them to
> raw locks if it is appropriate.
> 
> Bogdan wrote in 2/2 that he needs to limit the number of CPUs in oder
> not cause a DoS and large latencies in the host. I haven't seen an
> answer to my why question. Because if the conversation leads to
> large latencies in the host then it does not look right.
> 
> Each host PIC has a rawlock and does mostly just mask/unmask and the
> raw lock makes sure the value written is not mixed up due to
> preemption.
> This hardly increase latencies because the "locked" path is very short.
> If this conversation leads to higher latencies then the locked path is
> too long and hardly suitable to become a rawlock.

This isn't a host PIC driver.  It's guest PIC emulation, some of which
is indeed not suitable for a rawlock (in particular, openpic_update_irq
which loops on the number of vcpus, with a loop body that calls
IRQ_check() which loops over all pending IRQs).  The vcpu limits are a
temporary bandaid to avoid the worst latencies, but I'm still skeptical
about this being upstream material.

-Scott



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-23 23:27       ` Scott Wood
@ 2015-02-25 16:36         ` Sebastian Andrzej Siewior
  2015-02-26 13:02         ` Paolo Bonzini
  1 sibling, 0 replies; 37+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-02-25 16:36 UTC (permalink / raw)
  To: Scott Wood
  Cc: Paolo Bonzini, Alexander Graf, Bogdan Purcareata, linuxppc-dev,
	linux-rt-users, linux-kernel, mihai.caraman, Thomas Gleixner

* Scott Wood | 2015-02-23 17:27:31 [-0600]:

>This isn't a host PIC driver.  It's guest PIC emulation, some of which
>is indeed not suitable for a rawlock (in particular, openpic_update_irq
>which loops on the number of vcpus, with a loop body that calls
>IRQ_check() which loops over all pending IRQs).  The vcpu limits are a
>temporary bandaid to avoid the worst latencies, but I'm still skeptical
>about this being upstream material.

Okay.

>-Scott

Sebastian

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-23 23:27       ` Scott Wood
  2015-02-25 16:36         ` Sebastian Andrzej Siewior
@ 2015-02-26 13:02         ` Paolo Bonzini
  2015-02-26 13:31           ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2015-02-26 13:02 UTC (permalink / raw)
  To: Scott Wood, Sebastian Andrzej Siewior
  Cc: Alexander Graf, Bogdan Purcareata, linuxppc-dev, linux-rt-users,
	linux-kernel, mihai.caraman, Thomas Gleixner



On 24/02/2015 00:27, Scott Wood wrote:
> This isn't a host PIC driver.  It's guest PIC emulation, some of which
> is indeed not suitable for a rawlock (in particular, openpic_update_irq
> which loops on the number of vcpus, with a loop body that calls
> IRQ_check() which loops over all pending IRQs).

The question is what behavior is wanted of code that isn't quite
RT-ready.  What is preferred, bugs or bad latency?

If the answer is bad latency (which can be avoided simply by not running
KVM on a RT kernel in production), patch 1 can be applied.  If the
answer is bugs, patch 1 is not upstream material.

I myself prefer to have bad latency; if something takes a spinlock in
atomic context, that spinlock should be raw.  If it hurts (latency),
don't do it (use the affected code).

Paolo

> The vcpu limits are a
> temporary bandaid to avoid the worst latencies, but I'm still skeptical
> about this being upstream material.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-26 13:02         ` Paolo Bonzini
@ 2015-02-26 13:31           ` Sebastian Andrzej Siewior
  2015-02-27  1:05             ` Scott Wood
  0 siblings, 1 reply; 37+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-02-26 13:31 UTC (permalink / raw)
  To: Paolo Bonzini, Scott Wood
  Cc: Alexander Graf, Bogdan Purcareata, linuxppc-dev, linux-rt-users,
	linux-kernel, mihai.caraman, Thomas Gleixner

On 02/26/2015 02:02 PM, Paolo Bonzini wrote:
> 
> 
> On 24/02/2015 00:27, Scott Wood wrote:
>> This isn't a host PIC driver.  It's guest PIC emulation, some of which
>> is indeed not suitable for a rawlock (in particular, openpic_update_irq
>> which loops on the number of vcpus, with a loop body that calls
>> IRQ_check() which loops over all pending IRQs).
> 
> The question is what behavior is wanted of code that isn't quite
> RT-ready.  What is preferred, bugs or bad latency?
> 
> If the answer is bad latency (which can be avoided simply by not running
> KVM on a RT kernel in production), patch 1 can be applied.  If the
can be applied *but* makes no difference if applied or not.

> answer is bugs, patch 1 is not upstream material.
> 
> I myself prefer to have bad latency; if something takes a spinlock in
> atomic context, that spinlock should be raw.  If it hurts (latency),
> don't do it (use the affected code).

The problem, that is fixed by this s/spin_lock/raw_spin_lock/, exists
only in -RT. There is no change upstream. In general we fix such things
in -RT first and forward the patches upstream if possible. This convert
thingy would be possible.
Bug fixing comes before latency no matter if RT or not. Converting
every lock into a rawlock is not always the answer.
Last thing I read from Scott is that he is not entirely sure if this is
the right approach or not and patch #1 was not acked-by him either.

So for now I wait for Scott's feedback and maybe a backtrace :)

> 
> Paolo

Sebastian

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-26 13:31           ` Sebastian Andrzej Siewior
@ 2015-02-27  1:05             ` Scott Wood
  2015-02-27 13:06               ` Paolo Bonzini
  2015-03-27 17:07               ` Purcareata Bogdan
  0 siblings, 2 replies; 37+ messages in thread
From: Scott Wood @ 2015-02-27  1:05 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Paolo Bonzini, Alexander Graf, Bogdan Purcareata, linuxppc-dev,
	linux-rt-users, linux-kernel, mihai.caraman, Thomas Gleixner

On Thu, 2015-02-26 at 14:31 +0100, Sebastian Andrzej Siewior wrote:
> On 02/26/2015 02:02 PM, Paolo Bonzini wrote:
> > 
> > 
> > On 24/02/2015 00:27, Scott Wood wrote:
> >> This isn't a host PIC driver.  It's guest PIC emulation, some of which
> >> is indeed not suitable for a rawlock (in particular, openpic_update_irq
> >> which loops on the number of vcpus, with a loop body that calls
> >> IRQ_check() which loops over all pending IRQs).
> > 
> > The question is what behavior is wanted of code that isn't quite
> > RT-ready.  What is preferred, bugs or bad latency?
> > 
> > If the answer is bad latency (which can be avoided simply by not running
> > KVM on a RT kernel in production), patch 1 can be applied.  If the
> can be applied *but* makes no difference if applied or not.
> 
> > answer is bugs, patch 1 is not upstream material.
> > 
> > I myself prefer to have bad latency; if something takes a spinlock in
> > atomic context, that spinlock should be raw.  If it hurts (latency),
> > don't do it (use the affected code).
> 
> The problem, that is fixed by this s/spin_lock/raw_spin_lock/, exists
> only in -RT. There is no change upstream. In general we fix such things
> in -RT first and forward the patches upstream if possible. This convert
> thingy would be possible.
> Bug fixing comes before latency no matter if RT or not. Converting
> every lock into a rawlock is not always the answer.
> Last thing I read from Scott is that he is not entirely sure if this is
> the right approach or not and patch #1 was not acked-by him either.
> 
> So for now I wait for Scott's feedback and maybe a backtrace :)

Obviously leaving it in a buggy state is not what we want -- but I lean
towards a short term "fix" of putting "depends on !PREEMPT_RT" on the
in-kernel MPIC emulation (which is itself just an optimization -- you
can still use KVM without it).  This way people don't enable it with RT
without being aware of the issue, and there's more of an incentive to
fix it properly.

I'll let Bogdan supply the backtrace.

-Scott



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-27  1:05             ` Scott Wood
@ 2015-02-27 13:06               ` Paolo Bonzini
  2015-03-27 17:07               ` Purcareata Bogdan
  1 sibling, 0 replies; 37+ messages in thread
From: Paolo Bonzini @ 2015-02-27 13:06 UTC (permalink / raw)
  To: Scott Wood, Sebastian Andrzej Siewior
  Cc: Alexander Graf, Bogdan Purcareata, linuxppc-dev, linux-rt-users,
	linux-kernel, mihai.caraman, Thomas Gleixner



On 27/02/2015 02:05, Scott Wood wrote:
> Obviously leaving it in a buggy state is not what we want -- but I lean
> towards a short term "fix" of putting "depends on !PREEMPT_RT" on the
> in-kernel MPIC emulation (which is itself just an optimization -- you
> can still use KVM without it).  This way people don't enable it with RT
> without being aware of the issue, and there's more of an incentive to
> fix it properly.

That would indeed work for me.

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-02-27  1:05             ` Scott Wood
  2015-02-27 13:06               ` Paolo Bonzini
@ 2015-03-27 17:07               ` Purcareata Bogdan
  2015-04-02 23:11                 ` Scott Wood
  1 sibling, 1 reply; 37+ messages in thread
From: Purcareata Bogdan @ 2015-03-27 17:07 UTC (permalink / raw)
  To: Scott Wood, Sebastian Andrzej Siewior
  Cc: Paolo Bonzini, Alexander Graf, Bogdan Purcareata, linuxppc-dev,
	linux-rt-users, linux-kernel, mihai.caraman, Thomas Gleixner

On 27.02.2015 03:05, Scott Wood wrote:
> On Thu, 2015-02-26 at 14:31 +0100, Sebastian Andrzej Siewior wrote:
>> On 02/26/2015 02:02 PM, Paolo Bonzini wrote:
>>>
>>>
>>> On 24/02/2015 00:27, Scott Wood wrote:
>>>> This isn't a host PIC driver.  It's guest PIC emulation, some of which
>>>> is indeed not suitable for a rawlock (in particular, openpic_update_irq
>>>> which loops on the number of vcpus, with a loop body that calls
>>>> IRQ_check() which loops over all pending IRQs).
>>>
>>> The question is what behavior is wanted of code that isn't quite
>>> RT-ready.  What is preferred, bugs or bad latency?
>>>
>>> If the answer is bad latency (which can be avoided simply by not running
>>> KVM on a RT kernel in production), patch 1 can be applied.  If the
>> can be applied *but* makes no difference if applied or not.
>>
>>> answer is bugs, patch 1 is not upstream material.
>>>
>>> I myself prefer to have bad latency; if something takes a spinlock in
>>> atomic context, that spinlock should be raw.  If it hurts (latency),
>>> don't do it (use the affected code).
>>
>> The problem, that is fixed by this s/spin_lock/raw_spin_lock/, exists
>> only in -RT. There is no change upstream. In general we fix such things
>> in -RT first and forward the patches upstream if possible. This convert
>> thingy would be possible.
>> Bug fixing comes before latency no matter if RT or not. Converting
>> every lock into a rawlock is not always the answer.
>> Last thing I read from Scott is that he is not entirely sure if this is
>> the right approach or not and patch #1 was not acked-by him either.
>>
>> So for now I wait for Scott's feedback and maybe a backtrace :)
>
> Obviously leaving it in a buggy state is not what we want -- but I lean
> towards a short term "fix" of putting "depends on !PREEMPT_RT" on the
> in-kernel MPIC emulation (which is itself just an optimization -- you
> can still use KVM without it).  This way people don't enable it with RT
> without being aware of the issue, and there's more of an incentive to
> fix it properly.
>
> I'll let Bogdan supply the backtrace.

So about the backtrace. Wasn't really sure how to "catch" this, so what 
I did was to start a 24 VCPUs guest on a 24 CPU board, and in the guest 
run 24 netperf flows with an external back to back board of the same 
kind. I assumed this would provide the sufficient VCPUs and external 
interrupt to expose an alleged culprit.

With regards to measuring the latency, I thought of using ftrace, 
specifically the preemptirqsoff latency histogram. Unfortunately, I 
wasn't able to capture any major differences between running a guest 
with in-kernel MPIC emulation (with the openpic raw_spinlock_conversion 
applied) vs. no in-kernel MPIC emulation. Function profiling 
(trace_stat) shows that in the second case there's a far greater time 
spent in kvm_handle_exit (100x), but overall, the maximum latencies for 
preemptirqsoff don't look that much different.

Here are the max numbers (preemptirqsoff) for the 24 CPUs, on the host 
RT Linux, sorted in descending order, expressed in microseconds:

In-kernel MPIC		QEMU MPIC
3975			5105
2079			3972
1303			3557
1106			1725
447			907
423			853
362			723
343			182
260			121
133			116
131			116
118			115
116			114
114			114
114			114
114			99
113			99
103			98
98			98
95			97
87			96
83			83
83			82
80			81

I'm not sure if this captures openpic behavior or just scheduler behavior.

Anyways, I'm pro adding the openpic raw_spinlock conversion along with 
disabling the in-kernel MPIC emulation for upstream. But just wanted to 
catch up with this last request from a while ago.

Do you think it would be better to just submit the new patch or should I 
do some further testing? Do you have any suggestions regarding what else 
I should look at / how to test?

Thank you,
Bogdan P.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-03-27 17:07               ` Purcareata Bogdan
@ 2015-04-02 23:11                 ` Scott Wood
  2015-04-03  8:07                   ` Purcareata Bogdan
  0 siblings, 1 reply; 37+ messages in thread
From: Scott Wood @ 2015-04-02 23:11 UTC (permalink / raw)
  To: Purcareata Bogdan
  Cc: Sebastian Andrzej Siewior, Paolo Bonzini, Alexander Graf,
	Bogdan Purcareata, linuxppc-dev, linux-rt-users, linux-kernel,
	mihai.caraman, Thomas Gleixner

On Fri, 2015-03-27 at 19:07 +0200, Purcareata Bogdan wrote:
> On 27.02.2015 03:05, Scott Wood wrote:
> > On Thu, 2015-02-26 at 14:31 +0100, Sebastian Andrzej Siewior wrote:
> >> On 02/26/2015 02:02 PM, Paolo Bonzini wrote:
> >>>
> >>>
> >>> On 24/02/2015 00:27, Scott Wood wrote:
> >>>> This isn't a host PIC driver.  It's guest PIC emulation, some of which
> >>>> is indeed not suitable for a rawlock (in particular, openpic_update_irq
> >>>> which loops on the number of vcpus, with a loop body that calls
> >>>> IRQ_check() which loops over all pending IRQs).
> >>>
> >>> The question is what behavior is wanted of code that isn't quite
> >>> RT-ready.  What is preferred, bugs or bad latency?
> >>>
> >>> If the answer is bad latency (which can be avoided simply by not running
> >>> KVM on a RT kernel in production), patch 1 can be applied.  If the
> >> can be applied *but* makes no difference if applied or not.
> >>
> >>> answer is bugs, patch 1 is not upstream material.
> >>>
> >>> I myself prefer to have bad latency; if something takes a spinlock in
> >>> atomic context, that spinlock should be raw.  If it hurts (latency),
> >>> don't do it (use the affected code).
> >>
> >> The problem, that is fixed by this s/spin_lock/raw_spin_lock/, exists
> >> only in -RT. There is no change upstream. In general we fix such things
> >> in -RT first and forward the patches upstream if possible. This convert
> >> thingy would be possible.
> >> Bug fixing comes before latency no matter if RT or not. Converting
> >> every lock into a rawlock is not always the answer.
> >> Last thing I read from Scott is that he is not entirely sure if this is
> >> the right approach or not and patch #1 was not acked-by him either.
> >>
> >> So for now I wait for Scott's feedback and maybe a backtrace :)
> >
> > Obviously leaving it in a buggy state is not what we want -- but I lean
> > towards a short term "fix" of putting "depends on !PREEMPT_RT" on the
> > in-kernel MPIC emulation (which is itself just an optimization -- you
> > can still use KVM without it).  This way people don't enable it with RT
> > without being aware of the issue, and there's more of an incentive to
> > fix it properly.
> >
> > I'll let Bogdan supply the backtrace.
> 
> So about the backtrace. Wasn't really sure how to "catch" this, so what 
> I did was to start a 24 VCPUs guest on a 24 CPU board, and in the guest 
> run 24 netperf flows with an external back to back board of the same 
> kind. I assumed this would provide the sufficient VCPUs and external 
> interrupt to expose an alleged culprit.
> 
> With regards to measuring the latency, I thought of using ftrace, 
> specifically the preemptirqsoff latency histogram. Unfortunately, I 
> wasn't able to capture any major differences between running a guest 
> with in-kernel MPIC emulation (with the openpic raw_spinlock_conversion 
> applied) vs. no in-kernel MPIC emulation. Function profiling 
> (trace_stat) shows that in the second case there's a far greater time 
> spent in kvm_handle_exit (100x), but overall, the maximum latencies for 
> preemptirqsoff don't look that much different.
> 
> Here are the max numbers (preemptirqsoff) for the 24 CPUs, on the host 
> RT Linux, sorted in descending order, expressed in microseconds:
> 
> In-kernel MPIC		QEMU MPIC
> 3975			5105

What are you measuring?  Latency in the host, or in the guest?

-Scott



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-04-02 23:11                 ` Scott Wood
@ 2015-04-03  8:07                   ` Purcareata Bogdan
  2015-04-03 21:26                     ` Scott Wood
  0 siblings, 1 reply; 37+ messages in thread
From: Purcareata Bogdan @ 2015-04-03  8:07 UTC (permalink / raw)
  To: Scott Wood
  Cc: Sebastian Andrzej Siewior, Paolo Bonzini, Alexander Graf,
	Bogdan Purcareata, linuxppc-dev, linux-rt-users, linux-kernel,
	mihai.caraman, Thomas Gleixner

On 03.04.2015 02:11, Scott Wood wrote:
> On Fri, 2015-03-27 at 19:07 +0200, Purcareata Bogdan wrote:
>> On 27.02.2015 03:05, Scott Wood wrote:
>>> On Thu, 2015-02-26 at 14:31 +0100, Sebastian Andrzej Siewior wrote:
>>>> On 02/26/2015 02:02 PM, Paolo Bonzini wrote:
>>>>>
>>>>>
>>>>> On 24/02/2015 00:27, Scott Wood wrote:
>>>>>> This isn't a host PIC driver.  It's guest PIC emulation, some of which
>>>>>> is indeed not suitable for a rawlock (in particular, openpic_update_irq
>>>>>> which loops on the number of vcpus, with a loop body that calls
>>>>>> IRQ_check() which loops over all pending IRQs).
>>>>>
>>>>> The question is what behavior is wanted of code that isn't quite
>>>>> RT-ready.  What is preferred, bugs or bad latency?
>>>>>
>>>>> If the answer is bad latency (which can be avoided simply by not running
>>>>> KVM on a RT kernel in production), patch 1 can be applied.  If the
>>>> can be applied *but* makes no difference if applied or not.
>>>>
>>>>> answer is bugs, patch 1 is not upstream material.
>>>>>
>>>>> I myself prefer to have bad latency; if something takes a spinlock in
>>>>> atomic context, that spinlock should be raw.  If it hurts (latency),
>>>>> don't do it (use the affected code).
>>>>
>>>> The problem, that is fixed by this s/spin_lock/raw_spin_lock/, exists
>>>> only in -RT. There is no change upstream. In general we fix such things
>>>> in -RT first and forward the patches upstream if possible. This convert
>>>> thingy would be possible.
>>>> Bug fixing comes before latency no matter if RT or not. Converting
>>>> every lock into a rawlock is not always the answer.
>>>> Last thing I read from Scott is that he is not entirely sure if this is
>>>> the right approach or not and patch #1 was not acked-by him either.
>>>>
>>>> So for now I wait for Scott's feedback and maybe a backtrace :)
>>>
>>> Obviously leaving it in a buggy state is not what we want -- but I lean
>>> towards a short term "fix" of putting "depends on !PREEMPT_RT" on the
>>> in-kernel MPIC emulation (which is itself just an optimization -- you
>>> can still use KVM without it).  This way people don't enable it with RT
>>> without being aware of the issue, and there's more of an incentive to
>>> fix it properly.
>>>
>>> I'll let Bogdan supply the backtrace.
>>
>> So about the backtrace. Wasn't really sure how to "catch" this, so what
>> I did was to start a 24 VCPUs guest on a 24 CPU board, and in the guest
>> run 24 netperf flows with an external back to back board of the same
>> kind. I assumed this would provide the sufficient VCPUs and external
>> interrupt to expose an alleged culprit.
>>
>> With regards to measuring the latency, I thought of using ftrace,
>> specifically the preemptirqsoff latency histogram. Unfortunately, I
>> wasn't able to capture any major differences between running a guest
>> with in-kernel MPIC emulation (with the openpic raw_spinlock_conversion
>> applied) vs. no in-kernel MPIC emulation. Function profiling
>> (trace_stat) shows that in the second case there's a far greater time
>> spent in kvm_handle_exit (100x), but overall, the maximum latencies for
>> preemptirqsoff don't look that much different.
>>
>> Here are the max numbers (preemptirqsoff) for the 24 CPUs, on the host
>> RT Linux, sorted in descending order, expressed in microseconds:
>>
>> In-kernel MPIC		QEMU MPIC
>> 3975			5105
>
> What are you measuring?  Latency in the host, or in the guest?

This is in the host kernel. It's the maximum continuous period of time 
when both interrupts and preemption were disabled on the host kernel 
(basically making it unresponsive). This has been tracked while the 
guest was running with high prio, with 24 VCPUs, and in the guest there 
were 24 netperf flows - so a lot of VCPUs and a lot of external 
interrupts - for about 15 minutes.

Bogdan P.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-04-03  8:07                   ` Purcareata Bogdan
@ 2015-04-03 21:26                     ` Scott Wood
  2015-04-09  7:44                       ` Purcareata Bogdan
  0 siblings, 1 reply; 37+ messages in thread
From: Scott Wood @ 2015-04-03 21:26 UTC (permalink / raw)
  To: Purcareata Bogdan
  Cc: Sebastian Andrzej Siewior, Paolo Bonzini, Alexander Graf,
	Bogdan Purcareata, linuxppc-dev, linux-rt-users, linux-kernel,
	mihai.caraman, Thomas Gleixner

On Fri, 2015-04-03 at 11:07 +0300, Purcareata Bogdan wrote:
> On 03.04.2015 02:11, Scott Wood wrote:
> > On Fri, 2015-03-27 at 19:07 +0200, Purcareata Bogdan wrote:
> >> On 27.02.2015 03:05, Scott Wood wrote:
> >>> On Thu, 2015-02-26 at 14:31 +0100, Sebastian Andrzej Siewior wrote:
> >>>> On 02/26/2015 02:02 PM, Paolo Bonzini wrote:
> >>>>>
> >>>>>
> >>>>> On 24/02/2015 00:27, Scott Wood wrote:
> >>>>>> This isn't a host PIC driver.  It's guest PIC emulation, some of which
> >>>>>> is indeed not suitable for a rawlock (in particular, openpic_update_irq
> >>>>>> which loops on the number of vcpus, with a loop body that calls
> >>>>>> IRQ_check() which loops over all pending IRQs).
> >>>>>
> >>>>> The question is what behavior is wanted of code that isn't quite
> >>>>> RT-ready.  What is preferred, bugs or bad latency?
> >>>>>
> >>>>> If the answer is bad latency (which can be avoided simply by not running
> >>>>> KVM on a RT kernel in production), patch 1 can be applied.  If the
> >>>> can be applied *but* makes no difference if applied or not.
> >>>>
> >>>>> answer is bugs, patch 1 is not upstream material.
> >>>>>
> >>>>> I myself prefer to have bad latency; if something takes a spinlock in
> >>>>> atomic context, that spinlock should be raw.  If it hurts (latency),
> >>>>> don't do it (use the affected code).
> >>>>
> >>>> The problem, that is fixed by this s/spin_lock/raw_spin_lock/, exists
> >>>> only in -RT. There is no change upstream. In general we fix such things
> >>>> in -RT first and forward the patches upstream if possible. This convert
> >>>> thingy would be possible.
> >>>> Bug fixing comes before latency no matter if RT or not. Converting
> >>>> every lock into a rawlock is not always the answer.
> >>>> Last thing I read from Scott is that he is not entirely sure if this is
> >>>> the right approach or not and patch #1 was not acked-by him either.
> >>>>
> >>>> So for now I wait for Scott's feedback and maybe a backtrace :)
> >>>
> >>> Obviously leaving it in a buggy state is not what we want -- but I lean
> >>> towards a short term "fix" of putting "depends on !PREEMPT_RT" on the
> >>> in-kernel MPIC emulation (which is itself just an optimization -- you
> >>> can still use KVM without it).  This way people don't enable it with RT
> >>> without being aware of the issue, and there's more of an incentive to
> >>> fix it properly.
> >>>
> >>> I'll let Bogdan supply the backtrace.
> >>
> >> So about the backtrace. Wasn't really sure how to "catch" this, so what
> >> I did was to start a 24 VCPUs guest on a 24 CPU board, and in the guest
> >> run 24 netperf flows with an external back to back board of the same
> >> kind. I assumed this would provide the sufficient VCPUs and external
> >> interrupt to expose an alleged culprit.
> >>
> >> With regards to measuring the latency, I thought of using ftrace,
> >> specifically the preemptirqsoff latency histogram. Unfortunately, I
> >> wasn't able to capture any major differences between running a guest
> >> with in-kernel MPIC emulation (with the openpic raw_spinlock_conversion
> >> applied) vs. no in-kernel MPIC emulation. Function profiling
> >> (trace_stat) shows that in the second case there's a far greater time
> >> spent in kvm_handle_exit (100x), but overall, the maximum latencies for
> >> preemptirqsoff don't look that much different.
> >>
> >> Here are the max numbers (preemptirqsoff) for the 24 CPUs, on the host
> >> RT Linux, sorted in descending order, expressed in microseconds:
> >>
> >> In-kernel MPIC		QEMU MPIC
> >> 3975			5105
> >
> > What are you measuring?  Latency in the host, or in the guest?
> 
> This is in the host kernel.

Those are terrible numbers in both cases.  Can you use those tracing
tools to find out what the code path is for QEMU MPIC?

-Scott



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-04-03 21:26                     ` Scott Wood
@ 2015-04-09  7:44                       ` Purcareata Bogdan
  2015-04-09 23:53                         ` Scott Wood
  0 siblings, 1 reply; 37+ messages in thread
From: Purcareata Bogdan @ 2015-04-09  7:44 UTC (permalink / raw)
  To: Scott Wood
  Cc: Sebastian Andrzej Siewior, Paolo Bonzini, Alexander Graf,
	Bogdan Purcareata, linuxppc-dev, linux-rt-users, linux-kernel,
	mihai.caraman, Thomas Gleixner

On 04.04.2015 00:26, Scott Wood wrote:
> On Fri, 2015-04-03 at 11:07 +0300, Purcareata Bogdan wrote:
>> On 03.04.2015 02:11, Scott Wood wrote:
>>> On Fri, 2015-03-27 at 19:07 +0200, Purcareata Bogdan wrote:
>>>> On 27.02.2015 03:05, Scott Wood wrote:
>>>>> On Thu, 2015-02-26 at 14:31 +0100, Sebastian Andrzej Siewior wrote:
>>>>>> On 02/26/2015 02:02 PM, Paolo Bonzini wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 24/02/2015 00:27, Scott Wood wrote:
>>>>>>>> This isn't a host PIC driver.  It's guest PIC emulation, some of which
>>>>>>>> is indeed not suitable for a rawlock (in particular, openpic_update_irq
>>>>>>>> which loops on the number of vcpus, with a loop body that calls
>>>>>>>> IRQ_check() which loops over all pending IRQs).
>>>>>>>
>>>>>>> The question is what behavior is wanted of code that isn't quite
>>>>>>> RT-ready.  What is preferred, bugs or bad latency?
>>>>>>>
>>>>>>> If the answer is bad latency (which can be avoided simply by not running
>>>>>>> KVM on a RT kernel in production), patch 1 can be applied.  If the
>>>>>> can be applied *but* makes no difference if applied or not.
>>>>>>
>>>>>>> answer is bugs, patch 1 is not upstream material.
>>>>>>>
>>>>>>> I myself prefer to have bad latency; if something takes a spinlock in
>>>>>>> atomic context, that spinlock should be raw.  If it hurts (latency),
>>>>>>> don't do it (use the affected code).
>>>>>>
>>>>>> The problem, that is fixed by this s/spin_lock/raw_spin_lock/, exists
>>>>>> only in -RT. There is no change upstream. In general we fix such things
>>>>>> in -RT first and forward the patches upstream if possible. This convert
>>>>>> thingy would be possible.
>>>>>> Bug fixing comes before latency no matter if RT or not. Converting
>>>>>> every lock into a rawlock is not always the answer.
>>>>>> Last thing I read from Scott is that he is not entirely sure if this is
>>>>>> the right approach or not and patch #1 was not acked-by him either.
>>>>>>
>>>>>> So for now I wait for Scott's feedback and maybe a backtrace :)
>>>>>
>>>>> Obviously leaving it in a buggy state is not what we want -- but I lean
>>>>> towards a short term "fix" of putting "depends on !PREEMPT_RT" on the
>>>>> in-kernel MPIC emulation (which is itself just an optimization -- you
>>>>> can still use KVM without it).  This way people don't enable it with RT
>>>>> without being aware of the issue, and there's more of an incentive to
>>>>> fix it properly.
>>>>>
>>>>> I'll let Bogdan supply the backtrace.
>>>>
>>>> So about the backtrace. Wasn't really sure how to "catch" this, so what
>>>> I did was to start a 24 VCPUs guest on a 24 CPU board, and in the guest
>>>> run 24 netperf flows with an external back to back board of the same
>>>> kind. I assumed this would provide the sufficient VCPUs and external
>>>> interrupt to expose an alleged culprit.
>>>>
>>>> With regards to measuring the latency, I thought of using ftrace,
>>>> specifically the preemptirqsoff latency histogram. Unfortunately, I
>>>> wasn't able to capture any major differences between running a guest
>>>> with in-kernel MPIC emulation (with the openpic raw_spinlock_conversion
>>>> applied) vs. no in-kernel MPIC emulation. Function profiling
>>>> (trace_stat) shows that in the second case there's a far greater time
>>>> spent in kvm_handle_exit (100x), but overall, the maximum latencies for
>>>> preemptirqsoff don't look that much different.
>>>>
>>>> Here are the max numbers (preemptirqsoff) for the 24 CPUs, on the host
>>>> RT Linux, sorted in descending order, expressed in microseconds:
>>>>
>>>> In-kernel MPIC		QEMU MPIC
>>>> 3975			5105
>>>
>>> What are you measuring?  Latency in the host, or in the guest?
>>
>> This is in the host kernel.
>
> Those are terrible numbers in both cases.  Can you use those tracing
> tools to find out what the code path is for QEMU MPIC?

After more careful inspection, I noticed that those big-big numbers
(couple of milliseconds) are isolated cases, and in fact 99.99% of those
latencies top to somewhere around 800us. I also have a feeling that the
isolated huge latencies might have something to do with
enabling/disabling tracing, since those numbers don't come up at all in
the actual trace output, only in the latency histogram. From what I
know, there are two separate mechanisms - the function tracer and the
latency histogram.

Now, about that max 800us - there are 2 things that are enabled by
default, and can cause bad latency:
1. scheduler load balancing - latency can top to up to 800us (as seen in
the trace output).
2. RT throttling - which calls sched_rt_period_timer, which cycles
through the runqueues of all CPUs - latency can top to 600us.

I'm mentioning these since the trace output for the max preemptirqsoff
period was always "stolen" by these activities, basically hiding
anything related to the kvm in-kernel openpic.

I disabled both of them, and now the max preemptirqsoff trace shows a
transition between a vhost thread and the qemu process, involving a
timer and external interrupt (do_irq), which you can see at the end of
this e-mail. Not much particularly related to the kvm openpic (but
perhaps I'm not able to understand it entirely). The trace for QEMU
MPIC looks pretty much the same.

So at this point I was getting kinda frustrated so I decided to measure
the time spend in kvm_mpic_write and kvm_mpic_read. I assumed these were
the main entry points in the in-kernel MPIC and were basically executed
while holding the spinlock. The scenario was the same - 24 VCPUs guest,
with 24 virtio+vhost interfaces, only this time I ran 24 ping flood
threads to another board instead of netperf. I assumed this would impose
a heavier stress.

The latencies look pretty ok, around 1-2 us on average, with the max
shown below:

.kvm_mpic_read	14.560
.kvm_mpic_write	12.608

Those are also microseconds. This was run for about 15 mins.

 From what it looks like, the in-kernel MPIC isn't all that bad, even for
a guest with a large number of VCPUs and a lot of external interrupts.
Bigger latencies can be caused by scheduling mechanisms, but I guess
this isn't something we can avoid entirely, only optimize.

Looking forward to your feedback and comments.

Thanks,
Bogdan P.

# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 3.12.19-rt30-156196-gbc48ad3f-dirty
# --------------------------------------------------------------------
# latency: 364 us, #322/322, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:24)
#    -----------------
#    | task: qemu-system-ppc-4055 (uid:0 nice:0 policy:2 rt_prio:95)
#    -----------------
#  => started at: .__schedule
#  => ended at:   .__schedule
#
#
#                   _--------=> CPU#
#                  / _-------=> irqs-off
#                 | / _------=> need-resched
#                 || / _-----=> need-resched_lazy
#                 ||| / _----=> hardirq/softirq
#                 |||| / _---=> preempt-depth
#                 ||||| / _--=> preempt-lazy-depth
#                 |||||| / _-=> migrate-disable
#                 ||||||| /     delay
#  cmd     pid    |||||||| time  |   caller
#     \   /      ||||||||  \   |   /
    <...>-4170    0....1..    0us+: .__schedule
    <...>-4170    0d...3..    3us : .deactivate_task <-.__schedule
    <...>-4170    0d...3..    4us : .dequeue_task <-.__schedule
    <...>-4170    0d...3..    5us : .update_rq_clock.part.64 <-.dequeue_task
    <...>-4170    0d...3..    6us : .dequeue_task_fair <-.dequeue_task
    <...>-4170    0d...3..    6us : .update_curr <-.dequeue_task_fair
    <...>-4170    0d...3..    7us : .update_min_vruntime <-.update_curr
    <...>-4170    0d...3..    8us : .cpuacct_charge <-.update_curr
    <...>-4170    0d...3..    9us : .__rcu_read_lock <-.cpuacct_charge
    <...>-4170    0d...3..   10us : .__rcu_read_unlock <-.cpuacct_charge
    <...>-4170    0d...3..   11us : .__compute_runnable_contrib <-.dequeue_task_fair
    <...>-4170    0d...3..   12us : .__update_entity_load_avg_contrib <-.dequeue_task_fair
    <...>-4170    0d...3..   13us : .update_cfs_rq_blocked_load <-.dequeue_task_fair
    <...>-4170    0d...3..   14us : .clear_buddies <-.dequeue_task_fair
    <...>-4170    0d...3..   16us : .account_entity_dequeue <-.dequeue_task_fair
    <...>-4170    0d...3..   17us : .update_min_vruntime <-.dequeue_task_fair
    <...>-4170    0d...3..   18us : .update_cfs_shares <-.dequeue_task_fair
    <...>-4170    0d...3..   19us : .__compute_runnable_contrib <-.dequeue_task_fair
    <...>-4170    0d...3..   20us : .hrtick_update <-.dequeue_task
    <...>-4170    0d...3..   21us : .put_prev_task_fair <-.__schedule
    <...>-4170    0d...3..   22us : .pick_next_task_fair <-.__schedule
    <...>-4170    0d...3..   23us : .clear_buddies <-.pick_next_task_fair
    <...>-4170    0d...3..   24us+: .__dequeue_entity <-.pick_next_task_fair
    <...>-4170    0d...3..   26us : .switch_mmu_context <-.__schedule
    <...>-4170    0d...3..   27us : ._raw_spin_lock <-.switch_mmu_context
    <...>-4170    0d...3..   28us : .__raw_spin_lock <-.switch_mmu_context
    <...>-4170    0d...3..   29us : .add_preempt_count <-.__raw_spin_lock
    <...>-4170    0d...4..   30us : .sub_preempt_count <-.switch_mmu_context
    <...>-4170    0d...3..   31us : .__switch_to <-.__schedule
    <...>-4170    0d...3..   32us+: .switch_booke_debug_regs <-.__switch_to
    <...>-4171    0d...3..   33us : .finish_task_switch <-.__schedule
    <...>-4171    0d...3..   34us : .vtime_common_task_switch <-.finish_task_switch
    <...>-4171    0d...3..   36us : .account_system_time <-.vtime_account_system
    <...>-4171    0d...3..   37us : .in_serving_softirq <-.account_system_time
    <...>-4171    0d...3..   38us : .cpuacct_account_field <-.account_system_time
    <...>-4171    0d...3..   39us : .__rcu_read_lock <-.cpuacct_account_field
    <...>-4171    0d...3..   40us : .__rcu_read_unlock <-.cpuacct_account_field
    <...>-4171    0d...3..   41us : .account_user_time <-.vtime_account_user
    <...>-4171    0d...3..   42us : .cpuacct_account_field <-.account_user_time
    <...>-4171    0d...3..   43us : .__rcu_read_lock <-.cpuacct_account_field
    <...>-4171    0d...3..   44us : .__rcu_read_unlock <-.cpuacct_account_field
    <...>-4171    0d...3..   45us+: ._raw_spin_unlock_irq <-.finish_task_switch
    <...>-4171    0d...3..   46us+: .do_IRQ <-exc_0x500_common
    <...>-4171    0d...3..   48us : .__do_irq <-.call_do_irq
    <...>-4171    0d...3..   49us : .irq_enter <-.__do_irq
    <...>-4171    0d...3..   50us : .rcu_irq_enter <-.irq_enter
    <...>-4171    0d...3..   52us : .vtime_common_account_irq_enter <-.irq_enter
    <...>-4171    0d...3..   53us : .account_system_time <-.vtime_account_system
    <...>-4171    0d...3..   54us : .in_serving_softirq <-.account_system_time
    <...>-4171    0d...3..   55us : .cpuacct_account_field <-.account_system_time
    <...>-4171    0d...3..   56us : .__rcu_read_lock <-.cpuacct_account_field
    <...>-4171    0d...3..   57us : .__rcu_read_unlock <-.cpuacct_account_field
    <...>-4171    0d...3..   58us : .add_preempt_count <-.irq_enter
    <...>-4171    0d..h3..   59us : .mpic_get_coreint_irq <-.__do_irq
    <...>-4171    0d..h3..   60us : .irq_to_desc <-.__do_irq
    <...>-4171    0d..h3..   61us : .handle_fasteoi_irq <-.__do_irq
    <...>-4171    0d..h3..   62us : ._raw_spin_lock <-.handle_fasteoi_irq
    <...>-4171    0d..h3..   63us : .__raw_spin_lock <-.handle_fasteoi_irq
    <...>-4171    0d..h3..   64us : .add_preempt_count <-.__raw_spin_lock
    <...>-4171    0d..h4..   65us+: .mpic_mask_irq <-.handle_fasteoi_irq
    <...>-4171    0d..h4..   66us : .handle_irq_event <-.handle_fasteoi_irq
    <...>-4171    0d..h4..   67us : .sub_preempt_count <-.handle_irq_event
    <...>-4171    0d..h3..   68us : .handle_irq_event_percpu <-.handle_irq_event
    <...>-4171    0d..h3..   70us : .irq_default_primary_handler <-.handle_irq_event_percpu
    <...>-4171    0d..h3..   71us : .wake_up_process <-.handle_irq_event_percpu
    <...>-4171    0d..h3..   72us : .try_to_wake_up <-.handle_irq_event_percpu
    <...>-4171    0d..h3..   73us : ._raw_spin_lock_irqsave <-.try_to_wake_up
    <...>-4171    0d..h3..   74us : .__raw_spin_lock_irqsave <-.try_to_wake_up
    <...>-4171    0d..h3..   75us : .add_preempt_count <-.__raw_spin_lock_irqsave
    <...>-4171    0d..h4..   76us : .select_task_rq_rt <-.try_to_wake_up
    <...>-4171    0d..h4..   77us : ._raw_spin_lock <-.try_to_wake_up
    <...>-4171    0d..h4..   78us : .__raw_spin_lock <-.try_to_wake_up
    <...>-4171    0d..h4..   79us : .add_preempt_count <-.__raw_spin_lock
    <...>-4171    0d..h5..   80us : .ttwu_do_activate.constprop.71 <-.try_to_wake_up
    <...>-4171    0d..h5..   81us : .activate_task <-.ttwu_do_activate.constprop.71
    <...>-4171    0d..h5..   82us : .enqueue_task <-.ttwu_do_activate.constprop.71
    <...>-4171    0d..h5..   83us : .update_rq_clock.part.64 <-.enqueue_task
    <...>-4171    0d..h5..   84us : .enqueue_task_rt <-.enqueue_task
    <...>-4171    0d..h5..   85us : .dequeue_rt_stack <-.enqueue_task_rt
    <...>-4171    0d..h5..   86us+: .cpupri_set <-.enqueue_task_rt
    <...>-4171    0d..h5..   88us : .update_rt_migration <-.enqueue_task_rt
    <...>-4171    0d..h5..   89us : .ttwu_do_wakeup <-.try_to_wake_up
    <...>-4171    0d..h5..   90us : .check_preempt_curr <-.ttwu_do_wakeup
    <...>-4171    0d..h5..   91us : .resched_task <-.check_preempt_curr
    <...>-4171    0d..h5..   92us : .task_woken_rt <-.ttwu_do_wakeup
    <...>-4171    0d..h5..   93us : .sub_preempt_count <-.try_to_wake_up
    <...>-4171    0d..h4..   94us : ._raw_spin_unlock_irqrestore <-.try_to_wake_up
    <...>-4171    0d..h4..   95us : .sub_preempt_count <-._raw_spin_unlock_irqrestore
    <...>-4171    0d..h3..   97us : .note_interrupt <-.handle_irq_event_percpu
    <...>-4171    0d..h3..   98us : ._raw_spin_lock <-.handle_irq_event
    <...>-4171    0d..h3..   99us : .__raw_spin_lock <-.handle_irq_event
    <...>-4171    0d..h3..  100us : .add_preempt_count <-.__raw_spin_lock
    <...>-4171    0d..h4..  101us : .cond_unmask_irq <-.handle_fasteoi_irq
    <...>-4171    0d..h4..  102us+: .mpic_end_irq <-.handle_fasteoi_irq
    <...>-4171    0d..h4..  103us : .sub_preempt_count <-.handle_fasteoi_irq
    <...>-4171    0d..h3..  104us : .irq_exit <-.__do_irq
    <...>-4171    0d..h3..  105us : .account_system_time <-.vtime_account_system
    <...>-4171    0d..h3..  106us : .cpuacct_account_field <-.account_system_time
    <...>-4171    0d..h3..  107us+: .__rcu_read_lock <-.cpuacct_account_field
    <...>-4171    0d..h3..  109us : .__rcu_read_unlock <-.cpuacct_account_field
    <...>-4171    0d..h3..  110us : .sub_preempt_count <-.irq_exit
    <...>-4171    0d...3..  111us+: .rcu_irq_exit <-.irq_exit
    <...>-4171    0dN..3..  113us : .irq_enter <-.timer_interrupt
    <...>-4171    0dN..3..  114us : .rcu_irq_enter <-.irq_enter
    <...>-4171    0dN..3..  115us : .vtime_common_account_irq_enter <-.irq_enter
    <...>-4171    0dN..3..  116us : .account_system_time <-.vtime_account_system
    <...>-4171    0dN..3..  118us : .in_serving_softirq <-.account_system_time
    <...>-4171    0dN..3..  119us : .cpuacct_account_field <-.account_system_time
    <...>-4171    0dN..3..  120us : .__rcu_read_lock <-.cpuacct_account_field
    <...>-4171    0dN..3..  121us : .__rcu_read_unlock <-.cpuacct_account_field
    <...>-4171    0dN..3..  122us : .add_preempt_count <-.irq_enter
    <...>-4171    0dN.h3..  123us : .hrtimer_interrupt <-.timer_interrupt
    <...>-4171    0dN.h3..  124us : ._raw_spin_lock <-.hrtimer_interrupt
    <...>-4171    0dN.h3..  126us : .__raw_spin_lock <-.hrtimer_interrupt
    <...>-4171    0dN.h3..  127us : .add_preempt_count <-.__raw_spin_lock
    <...>-4171    0dN.h4..  128us : .ktime_get_update_offsets <-.hrtimer_interrupt
    <...>-4171    0dN.h4..  129us : .__run_hrtimer <-.hrtimer_interrupt
    <...>-4171    0dN.h4..  130us : .__remove_hrtimer <-.__run_hrtimer
    <...>-4171    0dN.h4..  131us : .sub_preempt_count <-.__run_hrtimer
    <...>-4171    0dN.h3..  133us : .tick_sched_timer <-.__run_hrtimer
    <...>-4171    0dN.h3..  134us : .ktime_get <-.tick_sched_timer
    <...>-4171    0dN.h3..  135us : ._raw_spin_lock <-.tick_sched_timer
    <...>-4171    0dN.h3..  136us : .__raw_spin_lock <-.tick_sched_timer
    <...>-4171    0dN.h3..  137us : .add_preempt_count <-.__raw_spin_lock
    <...>-4171    0dN.h4..  138us : .add_preempt_count <-.tick_sched_timer
    <...>-4171    0dN.h5..  139us : .do_timer <-.tick_sched_timer
    <...>-4171    0dN.h5..  140us : ._raw_spin_lock_irqsave <-.do_timer
    <...>-4171    0dN.h5..  141us : .__raw_spin_lock_irqsave <-.do_timer
    <...>-4171    0dN.h5..  142us : .add_preempt_count <-.__raw_spin_lock_irqsave
    <...>-4171    0dN.h6..  144us : .ntp_tick_length <-.do_timer
    <...>-4171    0dN.h6..  145us : .ntp_tick_length <-.do_timer
    <...>-4171    0dN.h6..  146us : .ntp_tick_length <-.do_timer
    <...>-4171    0dN.h6..  147us : .add_preempt_count <-.do_timer
    <...>-4171    0dN.h7..  148us+: .timekeeping_update.constprop.7 <-.do_timer
    <...>-4171    0dN.h7..  150us : .raw_notifier_call_chain <-.timekeeping_update.constprop.7
    <...>-4171    0dN.h7..  151us : .notifier_call_chain <-.timekeeping_update.constprop.7
    <...>-4171    0dN.h7..  152us : .sub_preempt_count <-.do_timer
    <...>-4171    0dN.h6..  153us : ._raw_spin_unlock_irqrestore <-.do_timer
    <...>-4171    0dN.h6..  154us : .sub_preempt_count <-._raw_spin_unlock_irqrestore
    <...>-4171    0dN.h5..  155us : .calc_global_load <-.do_timer
    <...>-4171    0dN.h5..  157us : .sub_preempt_count <-.tick_sched_timer
    <...>-4171    0dN.h4..  158us : .sub_preempt_count <-.tick_sched_timer
    <...>-4171    0dN.h3..  159us : .update_process_times <-.tick_sched_timer
    <...>-4171    0dN.h3..  160us : .account_user_time <-.vtime_account_user
    <...>-4171    0dN.h3..  161us : .cpuacct_account_field <-.account_user_time
    <...>-4171    0dN.h3..  162us : .__rcu_read_lock <-.cpuacct_account_field
    <...>-4171    0dN.h3..  163us : .__rcu_read_unlock <-.cpuacct_account_field
    <...>-4171    0dN.h3..  164us : .scheduler_tick <-.update_process_times
    <...>-4171    0dN.h3..  166us : ._raw_spin_lock <-.scheduler_tick
    <...>-4171    0dN.h3..  167us : .__raw_spin_lock <-.scheduler_tick
    <...>-4171    0dN.h3..  168us : .add_preempt_count <-.__raw_spin_lock
    <...>-4171    0dN.h4..  169us : .task_tick_fair <-.scheduler_tick
    <...>-4171    0dN.h4..  170us : .update_curr <-.task_tick_fair
    <...>-4171    0dN.h4..  171us : .update_min_vruntime <-.update_curr
    <...>-4171    0dN.h4..  172us : .cpuacct_charge <-.update_curr
    <...>-4171    0dN.h4..  173us : .__rcu_read_lock <-.cpuacct_charge
    <...>-4171    0dN.h4..  174us : .__rcu_read_unlock <-.cpuacct_charge
    <...>-4171    0dN.h4..  176us : .update_cfs_rq_blocked_load <-.task_tick_fair
    <...>-4171    0dN.h4..  177us : .update_cfs_shares <-.task_tick_fair
    <...>-4171    0dN.h4..  178us : .update_cpu_load_active <-.scheduler_tick
    <...>-4171    0dN.h4..  179us : .sched_avg_update <-.update_cpu_load_active
    <...>-4171    0dN.h4..  180us : .sub_preempt_count <-.scheduler_tick
    <...>-4171    0dN.h3..  182us : .trigger_load_balance <-.scheduler_tick
    <...>-4171    0dN.h3..  183us : .run_local_timers <-.update_process_times
    <...>-4171    0dN.h3..  184us : .hrtimer_run_queues <-.run_local_timers
    <...>-4171    0dN.h3..  185us : .rt_spin_trylock <-.run_local_timers
    <...>-4171    0dN.h3..  186us : .rt_mutex_trylock <-.run_local_timers
    <...>-4171    0dN.h3..  187us : .raise_softirq <-.run_local_timers
    <...>-4171    0dN.h3..  188us : .raise_softirq_irqoff <-.raise_softirq
    <...>-4171    0dN.h3..  189us : .do_raise_softirq_irqoff <-.raise_softirq_irqoff
    <...>-4171    0dN.h3..  191us : .rt_spin_unlock_after_trylock_in_irq <-.run_local_timers
    <...>-4171    0dN.h3..  192us : .rcu_check_callbacks <-.update_process_times
    <...>-4171    0dN.h3..  193us : .rcu_bh_qs <-.rcu_check_callbacks
    <...>-4171    0dN.h3..  194us : .cpu_needs_another_gp <-.rcu_check_callbacks
    <...>-4171    0dN.h3..  195us : .invoke_rcu_core <-.rcu_check_callbacks
    <...>-4171    0dN.h3..  197us : .wake_up_process <-.invoke_rcu_core
    <...>-4171    0dN.h3..  198us : .try_to_wake_up <-.invoke_rcu_core
    <...>-4171    0dN.h3..  199us : ._raw_spin_lock_irqsave <-.try_to_wake_up
    <...>-4171    0dN.h3..  200us : .__raw_spin_lock_irqsave <-.try_to_wake_up
    <...>-4171    0dN.h3..  201us : .add_preempt_count <-.__raw_spin_lock_irqsave
    <...>-4171    0dN.h4..  202us : .task_waking_fair <-.try_to_wake_up
    <...>-4171    0dN.h4..  204us : .select_task_rq_fair <-.try_to_wake_up
    <...>-4171    0dN.h4..  205us : ._raw_spin_lock <-.try_to_wake_up
    <...>-4171    0dN.h4..  206us : .__raw_spin_lock <-.try_to_wake_up
    <...>-4171    0dN.h4..  207us : .add_preempt_count <-.__raw_spin_lock
    <...>-4171    0dN.h5..  208us : .ttwu_do_activate.constprop.71 <-.try_to_wake_up
    <...>-4171    0dN.h5..  209us : .activate_task <-.ttwu_do_activate.constprop.71
    <...>-4171    0dN.h5..  210us : .enqueue_task <-.ttwu_do_activate.constprop.71
    <...>-4171    0dN.h5..  211us : .enqueue_task_fair <-.enqueue_task
    <...>-4171    0dN.h5..  213us : .update_curr <-.enqueue_task_fair
    <...>-4171    0dN.h5..  214us : .__compute_runnable_contrib <-.enqueue_task_fair
    <...>-4171    0dN.h5..  215us : .__update_entity_load_avg_contrib <-.enqueue_task_fair
    <...>-4171    0dN.h5..  216us : .update_cfs_rq_blocked_load <-.enqueue_task_fair
    <...>-4171    0dN.h5..  217us : .account_entity_enqueue <-.enqueue_task_fair
    <...>-4171    0dN.h5..  218us : .update_cfs_shares <-.enqueue_task_fair
    <...>-4171    0dN.h5..  220us : .__enqueue_entity <-.enqueue_task_fair
    <...>-4171    0dN.h5..  221us : .hrtick_update <-.enqueue_task
    <...>-4171    0dN.h5..  222us : .ttwu_do_wakeup <-.try_to_wake_up
    <...>-4171    0dN.h5..  223us : .check_preempt_curr <-.ttwu_do_wakeup
    <...>-4171    0dN.h5..  224us : .check_preempt_wakeup <-.check_preempt_curr
    <...>-4171    0dN.h5..  225us : .sub_preempt_count <-.try_to_wake_up
    <...>-4171    0dN.h4..  226us : ._raw_spin_unlock_irqrestore <-.try_to_wake_up
    <...>-4171    0dN.h4..  227us : .sub_preempt_count <-._raw_spin_unlock_irqrestore
    <...>-4171    0dN.h3..  229us : .run_posix_cpu_timers <-.update_process_times
    <...>-4171    0dN.h3..  230us : .hrtimer_forward <-.tick_sched_timer
    <...>-4171    0dN.h3..  231us : .ktime_add_safe <-.hrtimer_forward
    <...>-4171    0dN.h3..  232us : .ktime_add_safe <-.hrtimer_forward
    <...>-4171    0dN.h3..  233us : ._raw_spin_lock <-.__run_hrtimer
    <...>-4171    0dN.h3..  234us : .__raw_spin_lock <-.__run_hrtimer
    <...>-4171    0dN.h3..  235us : .add_preempt_count <-.__raw_spin_lock
    <...>-4171    0dN.h4..  237us : .enqueue_hrtimer <-.__run_hrtimer
    <...>-4171    0dN.h4..  238us+: .sub_preempt_count <-.hrtimer_interrupt
    <...>-4171    0dN.h3..  240us : .tick_program_event <-.hrtimer_interrupt
    <...>-4171    0dN.h3..  241us : .clockevents_program_event <-.tick_program_event
    <...>-4171    0dN.h3..  242us : .ktime_get <-.clockevents_program_event
    <...>-4171    0dN.h3..  243us : ._raw_spin_lock <-.hrtimer_interrupt
    <...>-4171    0dN.h3..  244us : .__raw_spin_lock <-.hrtimer_interrupt
    <...>-4171    0dN.h3..  245us : .add_preempt_count <-.__raw_spin_lock
    <...>-4171    0dN.h4..  246us : .ktime_get_update_offsets <-.hrtimer_interrupt
    <...>-4171    0dN.h4..  248us : .__run_hrtimer <-.hrtimer_interrupt
    <...>-4171    0dN.h4..  249us : .__remove_hrtimer <-.__run_hrtimer
    <...>-4171    0dN.h4..  250us : .sub_preempt_count <-.__run_hrtimer
    <...>-4171    0dN.h3..  251us : .kvmppc_decrementer_wakeup <-.__run_hrtimer
    <...>-4171    0dN.h3..  252us : .kvmppc_decrementer_func <-.kvmppc_decrementer_wakeup
    <...>-4171    0dN.h3..  253us : .kvmppc_set_tsr_bits <-.kvmppc_decrementer_wakeup
    <...>-4171    0dN.h3..  254us : .kvm_vcpu_kick <-.kvmppc_set_tsr_bits
    <...>-4171    0dN.h3..  256us : .__swait_wake <-.kvm_vcpu_kick
    <...>-4171    0dN.h3..  257us : ._raw_spin_lock_irqsave <-.__swait_wake
    <...>-4171    0dN.h3..  258us : .__raw_spin_lock_irqsave <-.__swait_wake
    <...>-4171    0dN.h3..  259us : .add_preempt_count <-.__raw_spin_lock_irqsave
    <...>-4171    0dN.h4..  260us : .__swait_wake_locked <-.__swait_wake
    <...>-4171    0dN.h4..  261us : .wake_up_state <-.__swait_wake_locked
    <...>-4171    0dN.h4..  262us : .try_to_wake_up <-.__swait_wake_locked
    <...>-4171    0dN.h4..  263us : ._raw_spin_lock_irqsave <-.try_to_wake_up
    <...>-4171    0dN.h4..  264us : .__raw_spin_lock_irqsave <-.try_to_wake_up
    <...>-4171    0dN.h4..  265us : .add_preempt_count <-.__raw_spin_lock_irqsave
    <...>-4171    0dN.h5..  267us : .select_task_rq_rt <-.try_to_wake_up
    <...>-4171    0dN.h5..  268us : ._raw_spin_lock <-.try_to_wake_up
    <...>-4171    0dN.h5..  269us : .__raw_spin_lock <-.try_to_wake_up
    <...>-4171    0dN.h5..  270us : .add_preempt_count <-.__raw_spin_lock
    <...>-4171    0dN.h6..  271us : .ttwu_do_activate.constprop.71 <-.try_to_wake_up
    <...>-4171    0dN.h6..  272us : .activate_task <-.ttwu_do_activate.constprop.71
    <...>-4171    0dN.h6..  273us : .enqueue_task <-.ttwu_do_activate.constprop.71
    <...>-4171    0dN.h6..  274us : .enqueue_task_rt <-.enqueue_task
    <...>-4171    0dN.h6..  276us : .dequeue_rt_stack <-.enqueue_task_rt
    <...>-4171    0dN.h6..  277us+: .cpupri_set <-.enqueue_task_rt
    <...>-4171    0dN.h6..  278us : .update_rt_migration <-.enqueue_task_rt
    <...>-4171    0dN.h6..  280us : .ttwu_do_wakeup <-.try_to_wake_up
    <...>-4171    0dN.h6..  281us : .check_preempt_curr <-.ttwu_do_wakeup
    <...>-4171    0dN.h6..  282us : .resched_task <-.check_preempt_curr
    <...>-4171    0dN.h6..  283us : .task_woken_rt <-.ttwu_do_wakeup
    <...>-4171    0dN.h6..  284us : .sub_preempt_count <-.try_to_wake_up
    <...>-4171    0dN.h5..  285us : ._raw_spin_unlock_irqrestore <-.try_to_wake_up
    <...>-4171    0dN.h5..  287us : .sub_preempt_count <-._raw_spin_unlock_irqrestore
    <...>-4171    0dN.h4..  288us : ._raw_spin_unlock_irqrestore <-.__swait_wake
    <...>-4171    0dN.h4..  289us : .sub_preempt_count <-._raw_spin_unlock_irqrestore
    <...>-4171    0dN.h3..  290us : .add_preempt_count <-.kvm_vcpu_kick
    <...>-4171    0dN.h4..  291us : .sub_preempt_count <-.kvm_vcpu_kick
    <...>-4171    0dN.h3..  292us : ._raw_spin_lock <-.__run_hrtimer
    <...>-4171    0dN.h3..  293us : .__raw_spin_lock <-.__run_hrtimer
    <...>-4171    0dN.h3..  294us : .add_preempt_count <-.__raw_spin_lock
    <...>-4171    0dN.h4..  296us : .sub_preempt_count <-.hrtimer_interrupt
    <...>-4171    0dN.h3..  297us : .tick_program_event <-.hrtimer_interrupt
    <...>-4171    0dN.h3..  298us : .clockevents_program_event <-.tick_program_event
    <...>-4171    0dN.h3..  299us : .ktime_get <-.clockevents_program_event
    <...>-4171    0dN.h3..  300us : .irq_exit <-.timer_interrupt
    <...>-4171    0dN.h3..  301us : .account_system_time <-.vtime_account_system
    <...>-4171    0dN.h3..  303us : .cpuacct_account_field <-.account_system_time
    <...>-4171    0dN.h3..  304us : .__rcu_read_lock <-.cpuacct_account_field
    <...>-4171    0dN.h3..  305us : .__rcu_read_unlock <-.cpuacct_account_field
    <...>-4171    0dN.h3..  306us : .sub_preempt_count <-.irq_exit
    <...>-4171    0dN..3..  307us : .wakeup_softirqd <-.irq_exit
    <...>-4171    0dN..3..  308us : .wake_up_process <-.wakeup_softirqd
    <...>-4171    0dN..3..  309us : .try_to_wake_up <-.wakeup_softirqd
    <...>-4171    0dN..3..  310us : ._raw_spin_lock_irqsave <-.try_to_wake_up
    <...>-4171    0dN..3..  312us : .__raw_spin_lock_irqsave <-.try_to_wake_up
    <...>-4171    0dN..3..  313us : .add_preempt_count <-.__raw_spin_lock_irqsave
    <...>-4171    0dN..4..  314us : .select_task_rq_rt <-.try_to_wake_up
    <...>-4171    0dN..4..  315us : ._raw_spin_lock <-.try_to_wake_up
    <...>-4171    0dN..4..  316us : .__raw_spin_lock <-.try_to_wake_up
    <...>-4171    0dN..4..  317us : .add_preempt_count <-.__raw_spin_lock
    <...>-4171    0dN..5..  318us : .ttwu_do_activate.constprop.71 <-.try_to_wake_up
    <...>-4171    0dN..5..  319us : .activate_task <-.ttwu_do_activate.constprop.71
    <...>-4171    0dN..5..  320us : .enqueue_task <-.ttwu_do_activate.constprop.71
    <...>-4171    0dN..5..  321us : .enqueue_task_rt <-.enqueue_task
    <...>-4171    0dN..5..  322us : .dequeue_rt_stack <-.enqueue_task_rt
    <...>-4171    0dN..5..  323us : .update_rt_migration <-.enqueue_task_rt
    <...>-4171    0dN..5..  325us : .ttwu_do_wakeup <-.try_to_wake_up
    <...>-4171    0dN..5..  326us : .check_preempt_curr <-.ttwu_do_wakeup
    <...>-4171    0dN..5..  327us : .resched_task <-.check_preempt_curr
    <...>-4171    0dN..5..  328us : .task_woken_rt <-.ttwu_do_wakeup
    <...>-4171    0dN..5..  329us : .sub_preempt_count <-.try_to_wake_up
    <...>-4171    0dN..4..  330us : ._raw_spin_unlock_irqrestore <-.try_to_wake_up
    <...>-4171    0dN..4..  331us : .sub_preempt_count <-._raw_spin_unlock_irqrestore
    <...>-4171    0dN..3..  332us+: .rcu_irq_exit <-.irq_exit
    <...>-4171    0dN..3..  336us : .put_prev_task_fair <-.__schedule
    <...>-4171    0dN..3..  337us : .update_curr <-.put_prev_task_fair
    <...>-4171    0dN..3..  338us : .__enqueue_entity <-.put_prev_task_fair
    <...>-4171    0dN..3..  339us : .pick_next_task_stop <-.__schedule
    <...>-4171    0dN..3..  341us : .pick_next_task_rt <-.__schedule
    <...>-4171    0d...3..  342us : .switch_mmu_context <-.__schedule
    <...>-4171    0d...3..  343us : ._raw_spin_lock <-.switch_mmu_context
    <...>-4171    0d...3..  344us : .__raw_spin_lock <-.switch_mmu_context
    <...>-4171    0d...3..  345us : .add_preempt_count <-.__raw_spin_lock
    <...>-4171    0d...4..  346us : .sub_preempt_count <-.switch_mmu_context
    <...>-4171    0d...3..  347us : .__switch_to <-.__schedule
    <...>-4171    0d...3..  348us+: .switch_booke_debug_regs <-.__switch_to
qemu-sys-4055    0d...3..  350us : .finish_task_switch <-.__schedule
qemu-sys-4055    0d...3..  352us : .vtime_common_task_switch <-.finish_task_switch
qemu-sys-4055    0d...3..  353us : .account_system_time <-.vtime_account_system
qemu-sys-4055    0d...3..  354us : .in_serving_softirq <-.account_system_time
qemu-sys-4055    0d...3..  355us : .cpuacct_account_field <-.account_system_time
qemu-sys-4055    0d...3..  355us : .__rcu_read_lock <-.cpuacct_account_field
qemu-sys-4055    0d...3..  356us : .__rcu_read_unlock <-.cpuacct_account_field
qemu-sys-4055    0d...3..  357us : .account_user_time <-.vtime_account_user
qemu-sys-4055    0d...3..  358us : .cpuacct_account_field <-.account_user_time
qemu-sys-4055    0d...3..  359us : .__rcu_read_lock <-.cpuacct_account_field
qemu-sys-4055    0d...3..  360us : .__rcu_read_unlock <-.cpuacct_account_field
qemu-sys-4055    0d...3..  361us+: ._raw_spin_unlock_irq <-.finish_task_switch
qemu-sys-4055    0....1..  364us+: .__schedule
qemu-sys-4055    0....1..  366us+: .trace_preempt_on <-.__schedule
qemu-sys-4055    0....1..  369us : <stack trace>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-04-09  7:44                       ` Purcareata Bogdan
@ 2015-04-09 23:53                         ` Scott Wood
  2015-04-20 10:53                           ` Purcareata Bogdan
  0 siblings, 1 reply; 37+ messages in thread
From: Scott Wood @ 2015-04-09 23:53 UTC (permalink / raw)
  To: Purcareata Bogdan
  Cc: Sebastian Andrzej Siewior, Paolo Bonzini, Alexander Graf,
	Bogdan Purcareata, linuxppc-dev, linux-rt-users, linux-kernel,
	mihai.caraman, Thomas Gleixner

On Thu, 2015-04-09 at 10:44 +0300, Purcareata Bogdan wrote:
> So at this point I was getting kinda frustrated so I decided to measure
> the time spend in kvm_mpic_write and kvm_mpic_read. I assumed these were
> the main entry points in the in-kernel MPIC and were basically executed
> while holding the spinlock. The scenario was the same - 24 VCPUs guest,
> with 24 virtio+vhost interfaces, only this time I ran 24 ping flood
> threads to another board instead of netperf. I assumed this would impose
> a heavier stress.
> 
> The latencies look pretty ok, around 1-2 us on average, with the max
> shown below:
> 
> .kvm_mpic_read	14.560
> .kvm_mpic_write	12.608
> 
> Those are also microseconds. This was run for about 15 mins.

What about other entry points such as kvm_set_msi() and
kvmppc_mpic_set_epr()?

-Scott



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-04-09 23:53                         ` Scott Wood
@ 2015-04-20 10:53                           ` Purcareata Bogdan
  2015-04-21  0:52                             ` Scott Wood
  0 siblings, 1 reply; 37+ messages in thread
From: Purcareata Bogdan @ 2015-04-20 10:53 UTC (permalink / raw)
  To: Scott Wood
  Cc: Sebastian Andrzej Siewior, Paolo Bonzini, Alexander Graf,
	Bogdan Purcareata, linuxppc-dev, linux-rt-users, linux-kernel,
	mihai.caraman, Thomas Gleixner

On 10.04.2015 02:53, Scott Wood wrote:
> On Thu, 2015-04-09 at 10:44 +0300, Purcareata Bogdan wrote:
>> So at this point I was getting kinda frustrated so I decided to measure
>> the time spend in kvm_mpic_write and kvm_mpic_read. I assumed these were
>> the main entry points in the in-kernel MPIC and were basically executed
>> while holding the spinlock. The scenario was the same - 24 VCPUs guest,
>> with 24 virtio+vhost interfaces, only this time I ran 24 ping flood
>> threads to another board instead of netperf. I assumed this would impose
>> a heavier stress.
>>
>> The latencies look pretty ok, around 1-2 us on average, with the max
>> shown below:
>>
>> .kvm_mpic_read	14.560
>> .kvm_mpic_write	12.608
>>
>> Those are also microseconds. This was run for about 15 mins.
>
> What about other entry points such as kvm_set_msi() and
> kvmppc_mpic_set_epr()?

Thanks for the pointers! I redid the measurements, this time for the functions 
run with the openpic lock down:

.kvm_mpic_read_internal (.kvm_mpic_read)	1.664
.kvmppc_mpic_set_epr				6.880
.kvm_mpic_write_internal (.kvm_mpic_write)	7.840
.openpic_msi_write (.kvm_set_msi)		10.560

Same scenario, 15 mins, numbers are microseconds.

There was a weird situation for .kvmppc_mpic_set_epr - its corresponding inner 
function is kvmppc_set_epr, which is a static inline. Removing the static inline 
yields a compiler crash (Segmentation fault (core dumped) - 
scripts/Makefile.build:441: recipe for target 'arch/powerpc/kvm/kvm.o' failed), 
but that's a different story, so I just let it be for now. Point is the time may 
include other work after the lock has been released, but before the function 
actually returned. I noticed this was the case for .kvm_set_msi, which could 
work up to 90 ms, not actually under the lock. This made me change what I'm 
looking at.

So far it looks pretty decent. Are there any other MPIC entry points worthy of 
investigation? Or perhaps a different stress scenario involving a lot of VCPUs 
and external interrupts?

Thanks,
Bogdan P.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-04-20 10:53                           ` Purcareata Bogdan
@ 2015-04-21  0:52                             ` Scott Wood
  2015-04-22 12:06                               ` Purcareata Bogdan
  0 siblings, 1 reply; 37+ messages in thread
From: Scott Wood @ 2015-04-21  0:52 UTC (permalink / raw)
  To: Purcareata Bogdan
  Cc: Sebastian Andrzej Siewior, Paolo Bonzini, Alexander Graf,
	Bogdan Purcareata, linuxppc-dev, linux-rt-users, linux-kernel,
	mihai.caraman, Thomas Gleixner

On Mon, 2015-04-20 at 13:53 +0300, Purcareata Bogdan wrote:
> On 10.04.2015 02:53, Scott Wood wrote:
> > On Thu, 2015-04-09 at 10:44 +0300, Purcareata Bogdan wrote:
> >> So at this point I was getting kinda frustrated so I decided to measure
> >> the time spend in kvm_mpic_write and kvm_mpic_read. I assumed these were
> >> the main entry points in the in-kernel MPIC and were basically executed
> >> while holding the spinlock. The scenario was the same - 24 VCPUs guest,
> >> with 24 virtio+vhost interfaces, only this time I ran 24 ping flood
> >> threads to another board instead of netperf. I assumed this would impose
> >> a heavier stress.
> >>
> >> The latencies look pretty ok, around 1-2 us on average, with the max
> >> shown below:
> >>
> >> .kvm_mpic_read	14.560
> >> .kvm_mpic_write	12.608
> >>
> >> Those are also microseconds. This was run for about 15 mins.
> >
> > What about other entry points such as kvm_set_msi() and
> > kvmppc_mpic_set_epr()?
> 
> Thanks for the pointers! I redid the measurements, this time for the functions 
> run with the openpic lock down:
> 
> .kvm_mpic_read_internal (.kvm_mpic_read)	1.664
> .kvmppc_mpic_set_epr				6.880
> .kvm_mpic_write_internal (.kvm_mpic_write)	7.840
> .openpic_msi_write (.kvm_set_msi)		10.560
> 
> Same scenario, 15 mins, numbers are microseconds.
> 
> There was a weird situation for .kvmppc_mpic_set_epr - its corresponding inner 
> function is kvmppc_set_epr, which is a static inline. Removing the static inline 
> yields a compiler crash (Segmentation fault (core dumped) - 
> scripts/Makefile.build:441: recipe for target 'arch/powerpc/kvm/kvm.o' failed), 
> but that's a different story, so I just let it be for now. Point is the time may 
> include other work after the lock has been released, but before the function 
> actually returned. I noticed this was the case for .kvm_set_msi, which could 
> work up to 90 ms, not actually under the lock. This made me change what I'm 
> looking at.

kvm_set_msi does pretty much nothing outside the lock -- I suspect
you're measuring an interrupt that happened as soon as the lock was
released.

> So far it looks pretty decent. Are there any other MPIC entry points worthy of 
> investigation?

I don't think so.

>  Or perhaps a different stress scenario involving a lot of VCPUs 
> and external interrupts?

You could instrument the MPIC code to find out how many loop iterations
you maxed out on, and compare that to the theoretical maximum.

-Scott



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-04-21  0:52                             ` Scott Wood
@ 2015-04-22 12:06                               ` Purcareata Bogdan
  2015-04-23  0:30                                 ` Scott Wood
  0 siblings, 1 reply; 37+ messages in thread
From: Purcareata Bogdan @ 2015-04-22 12:06 UTC (permalink / raw)
  To: Scott Wood
  Cc: Sebastian Andrzej Siewior, Paolo Bonzini, Alexander Graf,
	Bogdan Purcareata, linuxppc-dev, linux-rt-users, linux-kernel,
	mihai.caraman, Thomas Gleixner, Laurentiu Tudor

On 21.04.2015 03:52, Scott Wood wrote:
> On Mon, 2015-04-20 at 13:53 +0300, Purcareata Bogdan wrote:
>> There was a weird situation for .kvmppc_mpic_set_epr - its corresponding inner
>> function is kvmppc_set_epr, which is a static inline. Removing the static inline
>> yields a compiler crash (Segmentation fault (core dumped) -
>> scripts/Makefile.build:441: recipe for target 'arch/powerpc/kvm/kvm.o' failed),
>> but that's a different story, so I just let it be for now. Point is the time may
>> include other work after the lock has been released, but before the function
>> actually returned. I noticed this was the case for .kvm_set_msi, which could
>> work up to 90 ms, not actually under the lock. This made me change what I'm
>> looking at.
>
> kvm_set_msi does pretty much nothing outside the lock -- I suspect
> you're measuring an interrupt that happened as soon as the lock was
> released.

That's exactly right. I've seen things like a timer interrupt occuring right 
after the spinlock_irqrestore, but before kvm_set_msi actually returned.

[...]

>>   Or perhaps a different stress scenario involving a lot of VCPUs
>> and external interrupts?
>
> You could instrument the MPIC code to find out how many loop iterations
> you maxed out on, and compare that to the theoretical maximum.

Numbers are pretty low, and I'll try to explain based on my observations.

The problematic section in openpic_update_irq is this [1], since it loops 
through all VCPUs, and IRQ_local_pipe further calls IRQ_check, which loops 
through all pending interrupts for a VCPU [2].

The guest interfaces are virtio-vhostnet, which are based on MSI 
(/proc/interrupts in guest shows they are MSI). For external interrupts to the 
guest, the irq_source destmask is currently 0, and last_cpu is 0 (unitialized), 
so [1] will go on and deliver the interrupt directly and unicast (no VCPUs loop).

I activated the pr_debugs in arch/powerpc/kvm/mpic.c, to see how many interrupts 
are actually pending for the destination VCPU. At most, there were 3 interrupts 
- n_IRQ = {224,225,226} - even for 24 flows of ping flood. I understand that 
guest virtio interrupts are cascaded over 1 or a couple of shared MSI interrupts.

So worst case, in this scenario, was checking the priorities for 3 pending 
interrupts for 1 VCPU. Something like this (some of my prints included):

[61010.582033] openpic_update_irq: destmask 1 last_cpu 0
[61010.582034] openpic_update_irq: Only one CPU is allowed to receive this IRQ
[61010.582036] IRQ_local_pipe: IRQ 224 active 0 was 1
[61010.582037] IRQ_check: irq 226 set ivpr_pr=8 pr=-1
[61010.582038] IRQ_check: irq 225 set ivpr_pr=8 pr=-1
[61010.582039] IRQ_check: irq 224 set ivpr_pr=8 pr=-1

It would be really helpful to get your comments regarding whether these are 
realistical number for everyday use, or they are relevant only to this 
particular scenario.

- Can these interrupts be used in directed delivery, so that the destination 
mask can include multiple VCPUs? The MPIC manual states that timer and IPI 
interrupts are supported for directed delivery, altough I'm not sure how much of 
this is used in the emulation. I know that kvmppc uses the decrementer outside 
of the MPIC.

- How are virtio interrupts cascaded over the shared MSI interrupts? 
/proc/device-tree/soc@e0000000/msi@41600/interrupts in the guest shows 8 values 
- 224 - 231 - so at most there might be 8 pending interrupts in IRQ_check, is 
that correct?

Looking forward to your feedback.

[1] http://lxr.free-electrons.com/source/arch/powerpc/kvm/mpic.c#L454
[2] http://lxr.free-electrons.com/source/arch/powerpc/kvm/mpic.c#L303
[3] 
https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/F27971551C9EED8E8525774A0048770A/$file/mpic_db_05_16_2011.pdf

Best regards,
Bogdan P.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-04-22 12:06                               ` Purcareata Bogdan
@ 2015-04-23  0:30                                 ` Scott Wood
  2015-04-23 12:31                                   ` Purcareata Bogdan
  0 siblings, 1 reply; 37+ messages in thread
From: Scott Wood @ 2015-04-23  0:30 UTC (permalink / raw)
  To: Purcareata Bogdan
  Cc: Sebastian Andrzej Siewior, Paolo Bonzini, Alexander Graf,
	Bogdan Purcareata, linuxppc-dev, linux-rt-users, linux-kernel,
	mihai.caraman, Thomas Gleixner, Laurentiu Tudor

On Wed, 2015-04-22 at 15:06 +0300, Purcareata Bogdan wrote:
> On 21.04.2015 03:52, Scott Wood wrote:
> > On Mon, 2015-04-20 at 13:53 +0300, Purcareata Bogdan wrote:
> >> There was a weird situation for .kvmppc_mpic_set_epr - its corresponding inner
> >> function is kvmppc_set_epr, which is a static inline. Removing the static inline
> >> yields a compiler crash (Segmentation fault (core dumped) -
> >> scripts/Makefile.build:441: recipe for target 'arch/powerpc/kvm/kvm.o' failed),
> >> but that's a different story, so I just let it be for now. Point is the time may
> >> include other work after the lock has been released, but before the function
> >> actually returned. I noticed this was the case for .kvm_set_msi, which could
> >> work up to 90 ms, not actually under the lock. This made me change what I'm
> >> looking at.
> >
> > kvm_set_msi does pretty much nothing outside the lock -- I suspect
> > you're measuring an interrupt that happened as soon as the lock was
> > released.
> 
> That's exactly right. I've seen things like a timer interrupt occuring right 
> after the spinlock_irqrestore, but before kvm_set_msi actually returned.
> 
> [...]
> 
> >>   Or perhaps a different stress scenario involving a lot of VCPUs
> >> and external interrupts?
> >
> > You could instrument the MPIC code to find out how many loop iterations
> > you maxed out on, and compare that to the theoretical maximum.
> 
> Numbers are pretty low, and I'll try to explain based on my observations.
> 
> The problematic section in openpic_update_irq is this [1], since it loops 
> through all VCPUs, and IRQ_local_pipe further calls IRQ_check, which loops 
> through all pending interrupts for a VCPU [2].
> 
> The guest interfaces are virtio-vhostnet, which are based on MSI 
> (/proc/interrupts in guest shows they are MSI). For external interrupts to the 
> guest, the irq_source destmask is currently 0, and last_cpu is 0 (unitialized), 
> so [1] will go on and deliver the interrupt directly and unicast (no VCPUs loop).
> 
> I activated the pr_debugs in arch/powerpc/kvm/mpic.c, to see how many interrupts 
> are actually pending for the destination VCPU. At most, there were 3 interrupts 
> - n_IRQ = {224,225,226} - even for 24 flows of ping flood. I understand that 
> guest virtio interrupts are cascaded over 1 or a couple of shared MSI interrupts.
> 
> So worst case, in this scenario, was checking the priorities for 3 pending 
> interrupts for 1 VCPU. Something like this (some of my prints included):
> 
> [61010.582033] openpic_update_irq: destmask 1 last_cpu 0
> [61010.582034] openpic_update_irq: Only one CPU is allowed to receive this IRQ
> [61010.582036] IRQ_local_pipe: IRQ 224 active 0 was 1
> [61010.582037] IRQ_check: irq 226 set ivpr_pr=8 pr=-1
> [61010.582038] IRQ_check: irq 225 set ivpr_pr=8 pr=-1
> [61010.582039] IRQ_check: irq 224 set ivpr_pr=8 pr=-1
> 
> It would be really helpful to get your comments regarding whether these are 
> realistical number for everyday use, or they are relevant only to this 
> particular scenario.

RT isn't about "realistic numbers for everyday use".  It's about worst
cases.

> - Can these interrupts be used in directed delivery, so that the destination 
> mask can include multiple VCPUs?

The Freescale MPIC does not support multiple destinations for most
interrupts, but the (non-FSL-specific) emulation code appears to allow
it.

>  The MPIC manual states that timer and IPI 
> interrupts are supported for directed delivery, altough I'm not sure how much of 
> this is used in the emulation. I know that kvmppc uses the decrementer outside 
> of the MPIC.
> 
> - How are virtio interrupts cascaded over the shared MSI interrupts? 
> /proc/device-tree/soc@e0000000/msi@41600/interrupts in the guest shows 8 values 
> - 224 - 231 - so at most there might be 8 pending interrupts in IRQ_check, is 
> that correct?

It looks like that's currently the case, but actual hardware supports
more than that, so it's possible (albeit unlikely any time soon) that
the emulation eventually does as well.

But it's possible to have interrupts other than MSIs...

-Scott



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-04-23  0:30                                 ` Scott Wood
@ 2015-04-23 12:31                                   ` Purcareata Bogdan
  2015-04-23 21:26                                     ` Scott Wood
  0 siblings, 1 reply; 37+ messages in thread
From: Purcareata Bogdan @ 2015-04-23 12:31 UTC (permalink / raw)
  To: Scott Wood
  Cc: Sebastian Andrzej Siewior, Paolo Bonzini, Alexander Graf,
	Bogdan Purcareata, linuxppc-dev, linux-rt-users, linux-kernel,
	mihai.caraman, Thomas Gleixner, Laurentiu Tudor

On 23.04.2015 03:30, Scott Wood wrote:
> On Wed, 2015-04-22 at 15:06 +0300, Purcareata Bogdan wrote:
>> On 21.04.2015 03:52, Scott Wood wrote:
>>> On Mon, 2015-04-20 at 13:53 +0300, Purcareata Bogdan wrote:
>>>> There was a weird situation for .kvmppc_mpic_set_epr - its corresponding inner
>>>> function is kvmppc_set_epr, which is a static inline. Removing the static inline
>>>> yields a compiler crash (Segmentation fault (core dumped) -
>>>> scripts/Makefile.build:441: recipe for target 'arch/powerpc/kvm/kvm.o' failed),
>>>> but that's a different story, so I just let it be for now. Point is the time may
>>>> include other work after the lock has been released, but before the function
>>>> actually returned. I noticed this was the case for .kvm_set_msi, which could
>>>> work up to 90 ms, not actually under the lock. This made me change what I'm
>>>> looking at.
>>>
>>> kvm_set_msi does pretty much nothing outside the lock -- I suspect
>>> you're measuring an interrupt that happened as soon as the lock was
>>> released.
>>
>> That's exactly right. I've seen things like a timer interrupt occuring right
>> after the spinlock_irqrestore, but before kvm_set_msi actually returned.
>>
>> [...]
>>
>>>>    Or perhaps a different stress scenario involving a lot of VCPUs
>>>> and external interrupts?
>>>
>>> You could instrument the MPIC code to find out how many loop iterations
>>> you maxed out on, and compare that to the theoretical maximum.
>>
>> Numbers are pretty low, and I'll try to explain based on my observations.
>>
>> The problematic section in openpic_update_irq is this [1], since it loops
>> through all VCPUs, and IRQ_local_pipe further calls IRQ_check, which loops
>> through all pending interrupts for a VCPU [2].
>>
>> The guest interfaces are virtio-vhostnet, which are based on MSI
>> (/proc/interrupts in guest shows they are MSI). For external interrupts to the
>> guest, the irq_source destmask is currently 0, and last_cpu is 0 (unitialized),
>> so [1] will go on and deliver the interrupt directly and unicast (no VCPUs loop).
>>
>> I activated the pr_debugs in arch/powerpc/kvm/mpic.c, to see how many interrupts
>> are actually pending for the destination VCPU. At most, there were 3 interrupts
>> - n_IRQ = {224,225,226} - even for 24 flows of ping flood. I understand that
>> guest virtio interrupts are cascaded over 1 or a couple of shared MSI interrupts.
>>
>> So worst case, in this scenario, was checking the priorities for 3 pending
>> interrupts for 1 VCPU. Something like this (some of my prints included):
>>
>> [61010.582033] openpic_update_irq: destmask 1 last_cpu 0
>> [61010.582034] openpic_update_irq: Only one CPU is allowed to receive this IRQ
>> [61010.582036] IRQ_local_pipe: IRQ 224 active 0 was 1
>> [61010.582037] IRQ_check: irq 226 set ivpr_pr=8 pr=-1
>> [61010.582038] IRQ_check: irq 225 set ivpr_pr=8 pr=-1
>> [61010.582039] IRQ_check: irq 224 set ivpr_pr=8 pr=-1
>>
>> It would be really helpful to get your comments regarding whether these are
>> realistical number for everyday use, or they are relevant only to this
>> particular scenario.
>
> RT isn't about "realistic numbers for everyday use".  It's about worst
> cases.
>
>> - Can these interrupts be used in directed delivery, so that the destination
>> mask can include multiple VCPUs?
>
> The Freescale MPIC does not support multiple destinations for most
> interrupts, but the (non-FSL-specific) emulation code appears to allow
> it.
>
>>   The MPIC manual states that timer and IPI
>> interrupts are supported for directed delivery, altough I'm not sure how much of
>> this is used in the emulation. I know that kvmppc uses the decrementer outside
>> of the MPIC.
>>
>> - How are virtio interrupts cascaded over the shared MSI interrupts?
>> /proc/device-tree/soc@e0000000/msi@41600/interrupts in the guest shows 8 values
>> - 224 - 231 - so at most there might be 8 pending interrupts in IRQ_check, is
>> that correct?
>
> It looks like that's currently the case, but actual hardware supports
> more than that, so it's possible (albeit unlikely any time soon) that
> the emulation eventually does as well.
>
> But it's possible to have interrupts other than MSIs...

Right.

So given that the raw spinlock conversion is not suitable for all the scenarios 
supported by the OpenPIC emulation, is it ok that my next step would be to send 
a patch containing both the raw spinlock conversion and a mandatory disable of 
the in-kernel MPIC? This is actually the last conclusion we came up with some 
time ago, but I guess it was good to get some more insight on how things 
actually work (at least for me).

Thanks,
Bogdan P.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-04-23 12:31                                   ` Purcareata Bogdan
@ 2015-04-23 21:26                                     ` Scott Wood
  2015-04-27  6:45                                       ` Purcareata Bogdan
  0 siblings, 1 reply; 37+ messages in thread
From: Scott Wood @ 2015-04-23 21:26 UTC (permalink / raw)
  To: Purcareata Bogdan
  Cc: Sebastian Andrzej Siewior, Paolo Bonzini, Alexander Graf,
	Bogdan Purcareata, linuxppc-dev, linux-rt-users, linux-kernel,
	mihai.caraman, Thomas Gleixner, Laurentiu Tudor

On Thu, 2015-04-23 at 15:31 +0300, Purcareata Bogdan wrote:
> On 23.04.2015 03:30, Scott Wood wrote:
> > On Wed, 2015-04-22 at 15:06 +0300, Purcareata Bogdan wrote:
> >> On 21.04.2015 03:52, Scott Wood wrote:
> >>> On Mon, 2015-04-20 at 13:53 +0300, Purcareata Bogdan wrote:
> >>>> There was a weird situation for .kvmppc_mpic_set_epr - its corresponding inner
> >>>> function is kvmppc_set_epr, which is a static inline. Removing the static inline
> >>>> yields a compiler crash (Segmentation fault (core dumped) -
> >>>> scripts/Makefile.build:441: recipe for target 'arch/powerpc/kvm/kvm.o' failed),
> >>>> but that's a different story, so I just let it be for now. Point is the time may
> >>>> include other work after the lock has been released, but before the function
> >>>> actually returned. I noticed this was the case for .kvm_set_msi, which could
> >>>> work up to 90 ms, not actually under the lock. This made me change what I'm
> >>>> looking at.
> >>>
> >>> kvm_set_msi does pretty much nothing outside the lock -- I suspect
> >>> you're measuring an interrupt that happened as soon as the lock was
> >>> released.
> >>
> >> That's exactly right. I've seen things like a timer interrupt occuring right
> >> after the spinlock_irqrestore, but before kvm_set_msi actually returned.
> >>
> >> [...]
> >>
> >>>>    Or perhaps a different stress scenario involving a lot of VCPUs
> >>>> and external interrupts?
> >>>
> >>> You could instrument the MPIC code to find out how many loop iterations
> >>> you maxed out on, and compare that to the theoretical maximum.
> >>
> >> Numbers are pretty low, and I'll try to explain based on my observations.
> >>
> >> The problematic section in openpic_update_irq is this [1], since it loops
> >> through all VCPUs, and IRQ_local_pipe further calls IRQ_check, which loops
> >> through all pending interrupts for a VCPU [2].
> >>
> >> The guest interfaces are virtio-vhostnet, which are based on MSI
> >> (/proc/interrupts in guest shows they are MSI). For external interrupts to the
> >> guest, the irq_source destmask is currently 0, and last_cpu is 0 (unitialized),
> >> so [1] will go on and deliver the interrupt directly and unicast (no VCPUs loop).
> >>
> >> I activated the pr_debugs in arch/powerpc/kvm/mpic.c, to see how many interrupts
> >> are actually pending for the destination VCPU. At most, there were 3 interrupts
> >> - n_IRQ = {224,225,226} - even for 24 flows of ping flood. I understand that
> >> guest virtio interrupts are cascaded over 1 or a couple of shared MSI interrupts.
> >>
> >> So worst case, in this scenario, was checking the priorities for 3 pending
> >> interrupts for 1 VCPU. Something like this (some of my prints included):
> >>
> >> [61010.582033] openpic_update_irq: destmask 1 last_cpu 0
> >> [61010.582034] openpic_update_irq: Only one CPU is allowed to receive this IRQ
> >> [61010.582036] IRQ_local_pipe: IRQ 224 active 0 was 1
> >> [61010.582037] IRQ_check: irq 226 set ivpr_pr=8 pr=-1
> >> [61010.582038] IRQ_check: irq 225 set ivpr_pr=8 pr=-1
> >> [61010.582039] IRQ_check: irq 224 set ivpr_pr=8 pr=-1
> >>
> >> It would be really helpful to get your comments regarding whether these are
> >> realistical number for everyday use, or they are relevant only to this
> >> particular scenario.
> >
> > RT isn't about "realistic numbers for everyday use".  It's about worst
> > cases.
> >
> >> - Can these interrupts be used in directed delivery, so that the destination
> >> mask can include multiple VCPUs?
> >
> > The Freescale MPIC does not support multiple destinations for most
> > interrupts, but the (non-FSL-specific) emulation code appears to allow
> > it.
> >
> >>   The MPIC manual states that timer and IPI
> >> interrupts are supported for directed delivery, altough I'm not sure how much of
> >> this is used in the emulation. I know that kvmppc uses the decrementer outside
> >> of the MPIC.
> >>
> >> - How are virtio interrupts cascaded over the shared MSI interrupts?
> >> /proc/device-tree/soc@e0000000/msi@41600/interrupts in the guest shows 8 values
> >> - 224 - 231 - so at most there might be 8 pending interrupts in IRQ_check, is
> >> that correct?
> >
> > It looks like that's currently the case, but actual hardware supports
> > more than that, so it's possible (albeit unlikely any time soon) that
> > the emulation eventually does as well.
> >
> > But it's possible to have interrupts other than MSIs...
> 
> Right.
> 
> So given that the raw spinlock conversion is not suitable for all the scenarios 
> supported by the OpenPIC emulation, is it ok that my next step would be to send 
> a patch containing both the raw spinlock conversion and a mandatory disable of 
> the in-kernel MPIC? This is actually the last conclusion we came up with some 
> time ago, but I guess it was good to get some more insight on how things 
> actually work (at least for me).

Fine with me.  Have you given any thought to ways to restructure the
code to eliminate the problem?

-Scott



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux
  2015-04-23 21:26                                     ` Scott Wood
@ 2015-04-27  6:45                                       ` Purcareata Bogdan
  0 siblings, 0 replies; 37+ messages in thread
From: Purcareata Bogdan @ 2015-04-27  6:45 UTC (permalink / raw)
  To: Scott Wood
  Cc: Sebastian Andrzej Siewior, Paolo Bonzini, Alexander Graf,
	Bogdan Purcareata, linuxppc-dev, linux-rt-users, linux-kernel,
	mihai.caraman, Thomas Gleixner, Laurentiu Tudor

On 24.04.2015 00:26, Scott Wood wrote:
> On Thu, 2015-04-23 at 15:31 +0300, Purcareata Bogdan wrote:
>> On 23.04.2015 03:30, Scott Wood wrote:
>>> On Wed, 2015-04-22 at 15:06 +0300, Purcareata Bogdan wrote:
>>>> On 21.04.2015 03:52, Scott Wood wrote:
>>>>> On Mon, 2015-04-20 at 13:53 +0300, Purcareata Bogdan wrote:
>>>>>> There was a weird situation for .kvmppc_mpic_set_epr - its corresponding inner
>>>>>> function is kvmppc_set_epr, which is a static inline. Removing the static inline
>>>>>> yields a compiler crash (Segmentation fault (core dumped) -
>>>>>> scripts/Makefile.build:441: recipe for target 'arch/powerpc/kvm/kvm.o' failed),
>>>>>> but that's a different story, so I just let it be for now. Point is the time may
>>>>>> include other work after the lock has been released, but before the function
>>>>>> actually returned. I noticed this was the case for .kvm_set_msi, which could
>>>>>> work up to 90 ms, not actually under the lock. This made me change what I'm
>>>>>> looking at.
>>>>>
>>>>> kvm_set_msi does pretty much nothing outside the lock -- I suspect
>>>>> you're measuring an interrupt that happened as soon as the lock was
>>>>> released.
>>>>
>>>> That's exactly right. I've seen things like a timer interrupt occuring right
>>>> after the spinlock_irqrestore, but before kvm_set_msi actually returned.
>>>>
>>>> [...]
>>>>
>>>>>>     Or perhaps a different stress scenario involving a lot of VCPUs
>>>>>> and external interrupts?
>>>>>
>>>>> You could instrument the MPIC code to find out how many loop iterations
>>>>> you maxed out on, and compare that to the theoretical maximum.
>>>>
>>>> Numbers are pretty low, and I'll try to explain based on my observations.
>>>>
>>>> The problematic section in openpic_update_irq is this [1], since it loops
>>>> through all VCPUs, and IRQ_local_pipe further calls IRQ_check, which loops
>>>> through all pending interrupts for a VCPU [2].
>>>>
>>>> The guest interfaces are virtio-vhostnet, which are based on MSI
>>>> (/proc/interrupts in guest shows they are MSI). For external interrupts to the
>>>> guest, the irq_source destmask is currently 0, and last_cpu is 0 (unitialized),
>>>> so [1] will go on and deliver the interrupt directly and unicast (no VCPUs loop).
>>>>
>>>> I activated the pr_debugs in arch/powerpc/kvm/mpic.c, to see how many interrupts
>>>> are actually pending for the destination VCPU. At most, there were 3 interrupts
>>>> - n_IRQ = {224,225,226} - even for 24 flows of ping flood. I understand that
>>>> guest virtio interrupts are cascaded over 1 or a couple of shared MSI interrupts.
>>>>
>>>> So worst case, in this scenario, was checking the priorities for 3 pending
>>>> interrupts for 1 VCPU. Something like this (some of my prints included):
>>>>
>>>> [61010.582033] openpic_update_irq: destmask 1 last_cpu 0
>>>> [61010.582034] openpic_update_irq: Only one CPU is allowed to receive this IRQ
>>>> [61010.582036] IRQ_local_pipe: IRQ 224 active 0 was 1
>>>> [61010.582037] IRQ_check: irq 226 set ivpr_pr=8 pr=-1
>>>> [61010.582038] IRQ_check: irq 225 set ivpr_pr=8 pr=-1
>>>> [61010.582039] IRQ_check: irq 224 set ivpr_pr=8 pr=-1
>>>>
>>>> It would be really helpful to get your comments regarding whether these are
>>>> realistical number for everyday use, or they are relevant only to this
>>>> particular scenario.
>>>
>>> RT isn't about "realistic numbers for everyday use".  It's about worst
>>> cases.
>>>
>>>> - Can these interrupts be used in directed delivery, so that the destination
>>>> mask can include multiple VCPUs?
>>>
>>> The Freescale MPIC does not support multiple destinations for most
>>> interrupts, but the (non-FSL-specific) emulation code appears to allow
>>> it.
>>>
>>>>    The MPIC manual states that timer and IPI
>>>> interrupts are supported for directed delivery, altough I'm not sure how much of
>>>> this is used in the emulation. I know that kvmppc uses the decrementer outside
>>>> of the MPIC.
>>>>
>>>> - How are virtio interrupts cascaded over the shared MSI interrupts?
>>>> /proc/device-tree/soc@e0000000/msi@41600/interrupts in the guest shows 8 values
>>>> - 224 - 231 - so at most there might be 8 pending interrupts in IRQ_check, is
>>>> that correct?
>>>
>>> It looks like that's currently the case, but actual hardware supports
>>> more than that, so it's possible (albeit unlikely any time soon) that
>>> the emulation eventually does as well.
>>>
>>> But it's possible to have interrupts other than MSIs...
>>
>> Right.
>>
>> So given that the raw spinlock conversion is not suitable for all the scenarios
>> supported by the OpenPIC emulation, is it ok that my next step would be to send
>> a patch containing both the raw spinlock conversion and a mandatory disable of
>> the in-kernel MPIC? This is actually the last conclusion we came up with some
>> time ago, but I guess it was good to get some more insight on how things
>> actually work (at least for me).
>
> Fine with me.  Have you given any thought to ways to restructure the
> code to eliminate the problem?

My first thought would be to create a separate lock for each VCPU pending 
interrupts queue, so that we make the whole openpic_irq_update more granular. 
However, this is just a very preliminary thought. Before I can come up with 
anything worthy of consideration, I must read the OpenPIC specification and the 
current KVM emulated OpenPIC implementation thoroughly. I currently have other 
things on my hands, and will come back to this once I have some time.

Meanwhile, I've sent a v2 on the PPC and RT mailing lists for this raw_spinlock 
conversion, alongside disabling the in-kernel MPIC emulation for PREEMPT_RT. I 
would be grateful to hear your feedback on that, so that it can get applied.

Thank you,
Bogdan P.

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2015-04-27  6:45 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-18  9:32 [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux Bogdan Purcareata
2015-02-18  9:32 ` [PATCH 1/2] powerpc/kvm: Convert openpic lock to raw_spinlock Bogdan Purcareata
2015-02-23 22:43   ` Scott Wood
2015-02-18  9:32 ` [PATCH 2/2] powerpc/kvm: Limit MAX_VCPUS for guests running on RT Linux Bogdan Purcareata
2015-02-18  9:36   ` Sebastian Andrzej Siewior
2015-02-20 13:45   ` Alexander Graf
2015-02-23 22:48     ` Scott Wood
2015-02-20 13:45 ` [PATCH 0/2] powerpc/kvm: Enable running guests " Alexander Graf
2015-02-20 14:12   ` Paolo Bonzini
2015-02-20 14:16     ` Alexander Graf
2015-02-20 14:54     ` Sebastian Andrzej Siewior
2015-02-20 14:57       ` Paolo Bonzini
2015-02-20 15:06         ` Sebastian Andrzej Siewior
2015-02-20 15:10           ` Paolo Bonzini
2015-02-20 15:17             ` Sebastian Andrzej Siewior
2015-02-23  8:12               ` Purcareata Bogdan
2015-02-23  7:50           ` Purcareata Bogdan
2015-02-23  7:29       ` Purcareata Bogdan
2015-02-23 23:27       ` Scott Wood
2015-02-25 16:36         ` Sebastian Andrzej Siewior
2015-02-26 13:02         ` Paolo Bonzini
2015-02-26 13:31           ` Sebastian Andrzej Siewior
2015-02-27  1:05             ` Scott Wood
2015-02-27 13:06               ` Paolo Bonzini
2015-03-27 17:07               ` Purcareata Bogdan
2015-04-02 23:11                 ` Scott Wood
2015-04-03  8:07                   ` Purcareata Bogdan
2015-04-03 21:26                     ` Scott Wood
2015-04-09  7:44                       ` Purcareata Bogdan
2015-04-09 23:53                         ` Scott Wood
2015-04-20 10:53                           ` Purcareata Bogdan
2015-04-21  0:52                             ` Scott Wood
2015-04-22 12:06                               ` Purcareata Bogdan
2015-04-23  0:30                                 ` Scott Wood
2015-04-23 12:31                                   ` Purcareata Bogdan
2015-04-23 21:26                                     ` Scott Wood
2015-04-27  6:45                                       ` Purcareata Bogdan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).