LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v2 0/6] KVM Steal time, new submission
@ 2011-01-28 19:52 Glauber Costa
  2011-01-28 19:52 ` [PATCH v2 1/6] KVM-HDR: KVM Steal time implementation Glauber Costa
                   ` (5 more replies)
  0 siblings, 6 replies; 44+ messages in thread
From: Glauber Costa @ 2011-01-28 19:52 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, aliguori

Gentlemen,

If I may please steal a little bit of your time, for the revision of
this patchset, this is appreciated. This version only includes the steal
time part, using a new MSR+cpuid to register it.

It also includes some documentation, describing the ABI.
Mostly inspired by the kvmclock one.

Comments welcome

Glauber Costa (6):
  KVM-HDR: KVM Steal time implementation
  KVM-HV: KVM Steal time implementation
  KVM-GST: KVM Steal time accounting
  KVM-GST: KVM Steal time registration
  KVM-GST: adjust scheduler cpu power
  Describe KVM_MSR_STEAL_TIME

 Documentation/kvm/msr.txt       |   32 ++++++++++
 arch/x86/Kconfig                |   12 ++++
 arch/x86/include/asm/kvm_host.h |    6 ++
 arch/x86/include/asm/kvm_para.h |   10 +++
 arch/x86/kernel/kvm.c           |   61 +++++++++++++++++++
 arch/x86/kernel/kvmclock.c      |    2 +
 arch/x86/kvm/x86.c              |   48 +++++++++++++--
 include/linux/kvm.h             |    1 +
 include/linux/sched.h           |    1 +
 kernel/sched.c                  |  126 +++++++++++++++++++++++++++++++++++----
 kernel/sched_features.h         |    4 +-
 11 files changed, 283 insertions(+), 20 deletions(-)

-- 
1.7.2.3


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v2 1/6] KVM-HDR: KVM Steal time implementation
  2011-01-28 19:52 [PATCH v2 0/6] KVM Steal time, new submission Glauber Costa
@ 2011-01-28 19:52 ` Glauber Costa
  2011-01-29  1:28   ` Rik van Riel
  2011-01-28 19:52 ` [PATCH v2 2/6] KVM-HV: " Glauber Costa
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-01-28 19:52 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra, Avi Kivity

To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
holds the memory area address containing information about steal time

This patch contains the headers for it. I am keeping it separate to facilitate
backports to people who wants to backport the kernel part but not the
hypervisor, or the other way around.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Rik van Riel <riel@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Avi Kivity <avi@redhat.com>
---
 arch/x86/include/asm/kvm_para.h |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index a427bf7..8ba33ed 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -21,6 +21,7 @@
  */
 #define KVM_FEATURE_CLOCKSOURCE2        3
 #define KVM_FEATURE_ASYNC_PF		4
+#define KVM_FEATURE_STEAL_TIME		5
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -34,6 +35,14 @@
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
+#define MSR_KVM_STEAL_TIME  0x4b564d03
+
+struct kvm_steal_time {
+	__u64 steal;
+	__u32 version;
+	__u32 flags;
+	__u32 pad[6];
+};
 
 #define KVM_MAX_MMU_OP_BATCH           32
 
-- 
1.7.2.3


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v2 2/6] KVM-HV: KVM Steal time implementation
  2011-01-28 19:52 [PATCH v2 0/6] KVM Steal time, new submission Glauber Costa
  2011-01-28 19:52 ` [PATCH v2 1/6] KVM-HDR: KVM Steal time implementation Glauber Costa
@ 2011-01-28 19:52 ` Glauber Costa
  2011-01-29  1:46   ` Rik van Riel
                     ` (2 more replies)
  2011-01-28 19:52 ` [PATCH v2 3/6] KVM-GST: KVM Steal time accounting Glauber Costa
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 44+ messages in thread
From: Glauber Costa @ 2011-01-28 19:52 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra, Avi Kivity

To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
holds the memory area address containing information about steal time

This patch contains the hypervisor part for it. I am keeping it separate from
the headers to facilitate backports to people who wants to backport the kernel
part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Rik van Riel <riel@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Avi Kivity <avi@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    6 +++++
 arch/x86/kvm/x86.c              |   48 ++++++++++++++++++++++++++++++++++-----
 include/linux/kvm.h             |    1 +
 3 files changed, 49 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ffd7f8d..675f7ae 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -377,6 +377,12 @@ struct kvm_vcpu_arch {
 	unsigned int hw_tsc_khz;
 	unsigned int time_offset;
 	struct page *time_page;
+
+	gpa_t stime;
+	u32 sversion;
+	u64 time_out;
+	u64 this_time_out;
+
 	u64 last_host_tsc;
 	u64 last_guest_tsc;
 	u64 last_kernel_ns;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 38b55b3..0f8a529 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -787,12 +787,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN	8
+#define KVM_SAVE_MSRS_BEGIN	9
 static u32 msrs_to_save[] = {
 	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
 	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
 	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
-	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
+	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
 	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
 	MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1528,16 +1528,23 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		vcpu->arch.time_page =
 				gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
 
-		if (is_error_page(vcpu->arch.time_page)) {
-			kvm_release_page_clean(vcpu->arch.time_page);
-			vcpu->arch.time_page = NULL;
-		}
 		break;
 	}
 	case MSR_KVM_ASYNC_PF_EN:
 		if (kvm_pv_enable_async_pf(vcpu, data))
 			return 1;
 		break;
+	case MSR_KVM_STEAL_TIME:
+
+		if (!(data & 1)) {
+			vcpu->arch.stime = 0;
+			break;
+		}
+
+		vcpu->arch.stime = data & ~1;
+		vcpu->arch.sversion = 0;
+		break;
+
 	case MSR_IA32_MCG_CTL:
 	case MSR_IA32_MCG_STATUS:
 	case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1817,6 +1824,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 	case MSR_KVM_ASYNC_PF_EN:
 		data = vcpu->arch.apf.msr_val;
 		break;
+	case MSR_KVM_STEAL_TIME:
+		data = vcpu->arch.stime;
+		break;
 	case MSR_IA32_P5_MC_ADDR:
 	case MSR_IA32_P5_MC_TYPE:
 	case MSR_IA32_MCG_CAP:
@@ -1966,6 +1976,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
 	case KVM_CAP_XSAVE:
 	case KVM_CAP_ASYNC_PF:
+	case KVM_CAP_STEAL_TIME:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -2081,6 +2092,9 @@ static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
+	struct kvm_steal_time *st;
+	st = (struct kvm_steal_time *)vcpu->arch.stime;
+
 	/* Address WBINVD may be executed by guest */
 	if (need_emulate_wbinvd(vcpu)) {
 		if (kvm_x86_ops->has_wbinvd_exit())
@@ -2106,6 +2120,25 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 			kvm_migrate_timers(vcpu);
 		vcpu->cpu = cpu;
 	}
+
+	if (vcpu->arch.this_time_out) {
+		u64 to = (get_kernel_ns() - vcpu->arch.this_time_out);
+		/*
+		 * using nanoseconds introduces noise, which accumulates easily
+		 * leading to big steal time values. We want, however, to keep the
+		 * interface nanosecond-based for future-proofness.
+		 */
+		to /= NSEC_PER_USEC;
+		to *= NSEC_PER_USEC;
+		vcpu->arch.time_out += to;
+		kvm_write_guest(vcpu->kvm, (gpa_t)&st->steal,
+				&vcpu->arch.time_out, sizeof(st->steal));
+		vcpu->arch.sversion += 2;
+		kvm_write_guest(vcpu->kvm, (gpa_t)&st->version,
+				&vcpu->arch.sversion, sizeof(st->version));
+		/* is it possible to have 2 loads in sequence? */
+		vcpu->arch.this_time_out = 0;
+	}
 }
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
@@ -2113,6 +2146,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->vcpu_put(vcpu);
 	kvm_put_guest_fpu(vcpu);
 	vcpu->arch.last_host_tsc = native_read_tsc();
+	vcpu->arch.this_time_out = get_kernel_ns();
 }
 
 static int is_efer_nx(void)
@@ -5878,6 +5912,8 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 	vcpu->arch.apf.msr_val = 0;
 
+	vcpu->arch.stime = 0;
+
 	if (vcpu->arch.time_page) {
 		kvm_release_page_dirty(vcpu->arch.time_page);
 		vcpu->arch.time_page = NULL;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ea2dc1a..1009060 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -541,6 +541,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_GET_PVINFO 57
 #define KVM_CAP_PPC_IRQ_LEVEL 58
 #define KVM_CAP_ASYNC_PF 59
+#define KVM_CAP_STEAL_TIME 60 
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.2.3


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v2 3/6] KVM-GST: KVM Steal time accounting
  2011-01-28 19:52 [PATCH v2 0/6] KVM Steal time, new submission Glauber Costa
  2011-01-28 19:52 ` [PATCH v2 1/6] KVM-HDR: KVM Steal time implementation Glauber Costa
  2011-01-28 19:52 ` [PATCH v2 2/6] KVM-HV: " Glauber Costa
@ 2011-01-28 19:52 ` Glauber Costa
  2011-01-29  1:16   ` Jeremy Fitzhardinge
                     ` (2 more replies)
  2011-01-28 19:52 ` [PATCH v2 4/6] KVM-GST: KVM Steal time registration Glauber Costa
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 44+ messages in thread
From: Glauber Costa @ 2011-01-28 19:52 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra, Avi Kivity

This patch accounts steal time time in kernel/sched.
I kept it from last proposal, because I still see advantages
in it: Doing it here will give us easier access from scheduler
variables such as the cpu rq. The next patch shows an example of
usage for it.

Since functions like account_idle_time() can be called from
multiple places, not only account_process_tick(), steal time
grabbing is repeated in each account function separatedely.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Rik van Riel <riel@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Avi Kivity <avi@redhat.com>
---
 include/linux/sched.h |    1 +
 kernel/sched.c        |   41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d747f94..77e9297 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -302,6 +302,7 @@ long io_schedule_timeout(long timeout);
 extern void cpu_init (void);
 extern void trap_init(void);
 extern void update_process_times(int user);
+extern u64 (*hypervisor_steal_time)(void);
 extern void scheduler_tick(void);
 
 extern void sched_show_task(struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index ea3e5ef..7765e9d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3508,6 +3508,39 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p)
 	return ns;
 }
 
+u64 (*hypervisor_steal_time)(void) = NULL;
+
+/*
+ * We have to at flush steal time information every time something else
+ * is accounted. Since the accounting functions are all visible to the rest
+ * of the kernel, it gets tricky to do them in one place. This helper function
+ * helps us.
+ *
+ * When the system is idle, the concept of steal time does not apply. We just
+ * tell the underlying hypervisor that we grabbed the data, but skip steal time
+ * accounting
+ */
+static int touch_steal_time(int is_idle)
+{
+	u64 steal, st;
+
+	if (!hypervisor_steal_time)
+		return 0;
+
+	steal = hypervisor_steal_time();
+
+	st = usecs_to_cputime(steal / 1000);
+
+	if (is_idle)
+		return 0;
+
+	if (st) {
+		account_steal_time(st);
+		return 1;
+	}
+	return 0;
+}
+
 /*
  * Account user cpu time to a process.
  * @p: the process that the cpu time gets accounted to
@@ -3520,6 +3553,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
 	cputime64_t tmp;
 
+	if (touch_steal_time(0))
+		return;
+
 	/* Add user time to process. */
 	p->utime = cputime_add(p->utime, cputime);
 	p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
@@ -3580,6 +3616,9 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
 	cputime64_t tmp;
 
+	if (touch_steal_time(0))
+		return;
+
 	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
 		account_guest_time(p, cputime, cputime_scaled);
 		return;
@@ -3627,6 +3666,8 @@ void account_idle_time(cputime_t cputime)
 	cputime64_t cputime64 = cputime_to_cputime64(cputime);
 	struct rq *rq = this_rq();
 
+	touch_steal_time(1);
+
 	if (atomic_read(&rq->nr_iowait) > 0)
 		cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
 	else
-- 
1.7.2.3


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v2 4/6] KVM-GST: KVM Steal time registration
  2011-01-28 19:52 [PATCH v2 0/6] KVM Steal time, new submission Glauber Costa
                   ` (2 preceding siblings ...)
  2011-01-28 19:52 ` [PATCH v2 3/6] KVM-GST: KVM Steal time accounting Glauber Costa
@ 2011-01-28 19:52 ` Glauber Costa
  2011-01-29  2:16   ` Rik van Riel
                     ` (2 more replies)
  2011-01-28 19:52 ` [PATCH v2 5/6] KVM-GST: adjust scheduler cpu power Glauber Costa
  2011-01-28 19:52 ` [PATCH v2 6/6] Describe KVM_MSR_STEAL_TIME Glauber Costa
  5 siblings, 3 replies; 44+ messages in thread
From: Glauber Costa @ 2011-01-28 19:52 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra, Avi Kivity

Register steal time within KVM. Everytime we sample the steal time
information, we update a local variable that tells what was the
last time read. We then account the difference.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Rik van Riel <riel@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Avi Kivity <avi@redhat.com>
---
 arch/x86/include/asm/kvm_para.h |    1 +
 arch/x86/kernel/kvm.c           |   61 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/kvmclock.c      |    2 +
 3 files changed, 64 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 8ba33ed..8210122 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -89,6 +89,7 @@ struct kvm_vcpu_pv_apf_data {
 
 extern void kvmclock_init(void);
 extern int kvm_register_clock(char *txt);
+extern int kvm_register_steal_time(void);
 
 
 /* This instruction is vmcall.  On non-VT architectures, it will generate a
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 33c07b0..30c0fa7 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -58,6 +58,8 @@ struct kvm_para_state {
 
 static DEFINE_PER_CPU(struct kvm_para_state, para_state);
 static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct kvm_steal_time, steal_time);
+static DEFINE_PER_CPU(u64, steal_info);
 
 static struct kvm_para_state *kvm_para_state(void)
 {
@@ -489,18 +491,21 @@ static void __init kvm_smp_prepare_boot_cpu(void)
 #ifdef CONFIG_KVM_CLOCK
 	WARN_ON(kvm_register_clock("primary cpu clock"));
 #endif
+	WARN_ON(kvm_register_steal_time());
 	kvm_guest_cpu_init();
 	native_smp_prepare_boot_cpu();
 }
 
 static void __cpuinit kvm_guest_cpu_online(void *dummy)
 {
+	WARN_ON(kvm_register_steal_time());
 	kvm_guest_cpu_init();
 }
 
 static void kvm_guest_cpu_offline(void *dummy)
 {
 	kvm_pv_disable_apf(NULL);
+	native_write_msr(MSR_KVM_STEAL_TIME, 0, 0);
 	apf_task_wake_all();
 }
 
@@ -534,6 +539,59 @@ static void __init kvm_apf_trap_init(void)
 	set_intr_gate(14, &async_page_fault);
 }
 
+static u64 kvm_account_steal_time(void)
+{
+	u64 delta = 0;
+	u64 *last_steal_info, this_steal_info;
+	struct kvm_steal_time *src;
+	int version;
+
+	src = &get_cpu_var(steal_time);
+	do {
+		version = src->version;
+		rmb();
+		this_steal_info = src->steal;
+		rmb();
+	} while ((src->version & 1) || (version != src->version));
+	put_cpu_var(steal_time);
+
+	last_steal_info = &get_cpu_var(steal_info);
+
+	if (likely(*last_steal_info))
+		delta = this_steal_info - *last_steal_info;
+	*last_steal_info = this_steal_info;
+
+	put_cpu_var(steal_info);
+
+	/*
+	 * using nanoseconds introduces noise, which accumulates easily
+	 * leading to big steal time values. We want, however, to keep the
+	 * interface nanosecond-based for future-proofness. The hypervisor may
+	 * adopt a similar strategy, but we can't rely on that.
+	 */
+	delta /= NSEC_PER_MSEC;
+	delta *= NSEC_PER_MSEC;
+
+	return delta;
+}
+
+
+int kvm_register_steal_time(void)
+{
+	int cpu = smp_processor_id();
+	int low, high, ret;
+
+	if (!hypervisor_steal_time)
+		return 0;
+
+	low = (int)__pa(&per_cpu(steal_time, cpu)) | 1;
+	high = ((u64)__pa(&per_cpu(steal_time, cpu)) >> 32);
+	ret = native_write_msr_safe(MSR_KVM_STEAL_TIME, low, high);
+	printk(KERN_INFO "kvm-stealtime: cpu %d, msr %x:%x\n",
+		cpu, high, low);
+	return ret;
+}
+
 void __init kvm_guest_init(void)
 {
 	int i;
@@ -548,6 +606,9 @@ void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
 		x86_init.irqs.trap_init = kvm_apf_trap_init;
 
+	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME))
+		hypervisor_steal_time = kvm_account_steal_time;
+
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
 	register_cpu_notifier(&kvm_cpu_notifier);
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index f98d3ea..08661c6 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -164,6 +164,7 @@ static void __cpuinit kvm_setup_secondary_clock(void)
 static void kvm_crash_shutdown(struct pt_regs *regs)
 {
 	native_write_msr(msr_kvm_system_time, 0, 0);
+	native_write_msr(MSR_KVM_STEAL_TIME, 0, 0);
 	native_machine_crash_shutdown(regs);
 }
 #endif
@@ -171,6 +172,7 @@ static void kvm_crash_shutdown(struct pt_regs *regs)
 static void kvm_shutdown(void)
 {
 	native_write_msr(msr_kvm_system_time, 0, 0);
+	native_write_msr(MSR_KVM_STEAL_TIME, 0, 0);
 	native_machine_shutdown();
 }
 
-- 
1.7.2.3


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v2 5/6] KVM-GST: adjust scheduler cpu power
  2011-01-28 19:52 [PATCH v2 0/6] KVM Steal time, new submission Glauber Costa
                   ` (3 preceding siblings ...)
  2011-01-28 19:52 ` [PATCH v2 4/6] KVM-GST: KVM Steal time registration Glauber Costa
@ 2011-01-28 19:52 ` Glauber Costa
  2011-01-31 11:25   ` Peter Zijlstra
  2011-01-28 19:52 ` [PATCH v2 6/6] Describe KVM_MSR_STEAL_TIME Glauber Costa
  5 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-01-28 19:52 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra, Avi Kivity

This is a first proposal for using steal time information
to influence the scheduler. There are a lot of optimizations
and fine grained adjustments to be done, but it is working reasonably
so far for me (mostly)

With this patch (and some host pinnings to demonstrate the situation),
two vcpus with very different steal time (Say 80 % vs 1 %) will not get
an even distribution of processes. This is a situation that can naturally
arise, specially in overcommited scenarios. Previosly, the guest scheduler
would wrongly think that all cpus have the same ability to run processes,
lowering the overall throughput.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Rik van Riel <riel@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Avi Kivity <avi@redhat.com>
---
 arch/x86/Kconfig        |   12 +++++++
 kernel/sched.c          |   85 ++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched_features.h |    4 +-
 3 files changed, 87 insertions(+), 14 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3ed5ad9..8f7a666 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -515,6 +515,18 @@ menuconfig PARAVIRT_GUEST
 
 if PARAVIRT_GUEST
 
+config PARAVIRT_TIME_ACCOUNTING
+	bool "Paravirtual steal time accounting"
+	select PARAVIRT
+	default n
+	---help---
+	  Select this option to enable fine granularity task steal time 
+	  accounting. Time spent executing other tasks in parallel with
+          the current vCPU is discounted from the vCPU power. To account for
+	  that, there can be a small performance impact.
+
+	  If in doubt, say N here.
+
 source "arch/x86/xen/Kconfig"
 
 config KVM_CLOCK
diff --git a/kernel/sched.c b/kernel/sched.c
index 7765e9d..40df0d8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -524,6 +524,9 @@ struct rq {
 	u64 prev_irq_time;
 #endif
 
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+	u64 prev_steal_time;
+#endif
 	/* calc_load related fields */
 	unsigned long calc_load_update;
 	long calc_load_active;
@@ -1780,6 +1783,54 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 	dec_nr_running(rq);
 }
 
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+static DEFINE_PER_CPU(u64, cpu_steal_time);
+
+#ifndef CONFIG_64BIT
+static DEFINE_PER_CPU(seqcount_t, steal_time_seq);
+
+static inline void steal_time_write_begin(void)
+{
+	__this_cpu_inc(steal_time_seq.sequence);
+	smp_wmb();
+}
+
+static inline void steal_time_write_end(void)
+{
+	smp_wmb();
+	__this_cpu_inc(steal_time_seq.sequence);
+}
+
+static inline u64 steal_time_read(int cpu)
+{
+	u64 steal_time;
+	unsigned seq;
+
+	do {
+		seq = read_seqcount_begin(&per_cpu(steal_time_seq, cpu));
+		steal_time = per_cpu(cpu_steal_time, cpu);
+	} while (read_seqcount_retry(&per_cpu(steal_time_seq, cpu), seq));
+
+	return steal_time;
+}
+#else /* CONFIG_64BIT */
+static inline void steal_time_write_begin(void)
+{
+}
+
+static inline void steal_time_write_end(void)
+{
+}
+
+static inline u64 steal_time_read(int cpu)
+{
+	return per_cpu(cpu_steal_time, cpu);
+}
+
+#endif /* CONFIG_64BIT */
+
+#endif
+
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 
 /*
@@ -1888,10 +1939,13 @@ void account_system_vtime(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(account_system_vtime);
 
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
 static void update_rq_clock_task(struct rq *rq, s64 delta)
 {
-	s64 irq_delta;
+	s64 irq_delta = 0, steal = 0;
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
 
 	/*
@@ -1914,20 +1968,22 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 
 	rq->prev_irq_time += irq_delta;
 	delta -= irq_delta;
-	rq->clock_task += delta;
-
-	if (irq_delta && sched_feat(NONIRQ_POWER))
-		sched_rt_avg_update(rq, irq_delta);
-}
+#endif
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+	steal = steal_time_read(cpu_of(rq)) - rq->prev_steal_time;
+  
+	if (steal > delta)
+		steal = delta;
+	rq->prev_steal_time += steal;
 
-#else /* CONFIG_IRQ_TIME_ACCOUNTING */
+	delta -= steal;
+#endif
 
-static void update_rq_clock_task(struct rq *rq, s64 delta)
-{
 	rq->clock_task += delta;
-}
 
-#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+	if ((irq_delta + steal) && sched_feat(NONTASK_POWER))
+		sched_rt_avg_update(rq, irq_delta + steal);
+}
 
 #include "sched_idletask.c"
 #include "sched_fair.c"
@@ -3536,6 +3592,11 @@ static int touch_steal_time(int is_idle)
 
 	if (st) {
 		account_steal_time(st);
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+		steal_time_write_begin();
+		__this_cpu_add(cpu_steal_time, steal);
+		steal_time_write_end();
+#endif
 		return 1;
 	}
 	return 0;
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 68e69ac..194fc6d 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -61,6 +61,6 @@ SCHED_FEAT(LB_BIAS, 1)
 SCHED_FEAT(OWNER_SPIN, 1)
 
 /*
- * Decrement CPU power based on irq activity
+ * Decrement CPU power based on time not spent running tasks
  */
-SCHED_FEAT(NONIRQ_POWER, 1)
+SCHED_FEAT(NONTASK_POWER, 1)
-- 
1.7.2.3


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v2 6/6] Describe KVM_MSR_STEAL_TIME
  2011-01-28 19:52 [PATCH v2 0/6] KVM Steal time, new submission Glauber Costa
                   ` (4 preceding siblings ...)
  2011-01-28 19:52 ` [PATCH v2 5/6] KVM-GST: adjust scheduler cpu power Glauber Costa
@ 2011-01-28 19:52 ` Glauber Costa
  2011-01-30 13:19   ` Avi Kivity
  5 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-01-28 19:52 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, aliguori

This patch adds documentation about usage of the newly
introduced KVM_MSR_STEAL_TIME.

Signed-off-by: Glauber Costa <glommer@redhat.com>
---
 Documentation/kvm/msr.txt |   32 ++++++++++++++++++++++++++++++++
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/Documentation/kvm/msr.txt b/Documentation/kvm/msr.txt
index d079aed..7a2d223 100644
--- a/Documentation/kvm/msr.txt
+++ b/Documentation/kvm/msr.txt
@@ -185,3 +185,35 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
 
 	Currently type 2 APF will be always delivered on the same vcpu as
 	type 1 was, but guest should not rely on that.
+
+MSR_KVM_STEAL_TIME: 0x4b564d03
+
+	data: 4-byte alignment physical address of a memory area which must be
+	in guest RAM, plus an enable bit in bit 0. This memory is expected to
+	hold a copy of the following structure:
+
+	struct kvm_steal_time {
+	  	__u64 steal;
+ 		__u32 version;
+ 		__u32 flags;
+	 	__u32 pad[6];
+	}
+
+	whose data will be filled in by the hypervisor periodically. Only one
+	write, or registration, is needed for each VCPU. The interval between
+	updates of this structure is arbitrary and implementation-dependent.
+	The hypervisor may update this structure at any time it sees fit until
+	anything with bit0 == 0 is written to it.
+
+	Fields have the following meanings:
+
+		version: guest has to check version before and after grabbing
+		time information and check that they are both equal and even.
+		An odd version indicates an in-progress update.
+
+		flags: At this point, always zero. May be used to indicate
+		changes in this structure in the future.
+
+		steal: the amount of time in which this vCPU did not run, in
+		nanoseconds.
+
-- 
1.7.2.3


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 3/6] KVM-GST: KVM Steal time accounting
  2011-01-28 19:52 ` [PATCH v2 3/6] KVM-GST: KVM Steal time accounting Glauber Costa
@ 2011-01-29  1:16   ` Jeremy Fitzhardinge
  2011-01-29  1:27     ` Glauber Costa
  2011-01-29  1:53   ` Rik van Riel
  2011-01-30 14:04   ` Avi Kivity
  2 siblings, 1 reply; 44+ messages in thread
From: Jeremy Fitzhardinge @ 2011-01-29  1:16 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Peter Zijlstra, Avi Kivity

On 01/28/2011 11:52 AM, Glauber Costa wrote:
> This patch accounts steal time time in kernel/sched.
> I kept it from last proposal, because I still see advantages
> in it: Doing it here will give us easier access from scheduler
> variables such as the cpu rq. The next patch shows an example of
> usage for it.

Is this used on the host or guest side?

    J

> Since functions like account_idle_time() can be called from
> multiple places, not only account_process_tick(), steal time
> grabbing is repeated in each account function separatedely.
>
> Signed-off-by: Glauber Costa <glommer@redhat.com>
> CC: Rik van Riel <riel@redhat.com>
> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: Avi Kivity <avi@redhat.com>
> ---
>  include/linux/sched.h |    1 +
>  kernel/sched.c        |   41 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 42 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d747f94..77e9297 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -302,6 +302,7 @@ long io_schedule_timeout(long timeout);
>  extern void cpu_init (void);
>  extern void trap_init(void);
>  extern void update_process_times(int user);
> +extern u64 (*hypervisor_steal_time)(void);
>  extern void scheduler_tick(void);
>  
>  extern void sched_show_task(struct task_struct *p);
> diff --git a/kernel/sched.c b/kernel/sched.c
> index ea3e5ef..7765e9d 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -3508,6 +3508,39 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p)
>  	return ns;
>  }
>  
> +u64 (*hypervisor_steal_time)(void) = NULL;
> +
> +/*
> + * We have to at flush steal time information every time something else
> + * is accounted. Since the accounting functions are all visible to the rest
> + * of the kernel, it gets tricky to do them in one place. This helper function
> + * helps us.
> + *
> + * When the system is idle, the concept of steal time does not apply. We just
> + * tell the underlying hypervisor that we grabbed the data, but skip steal time
> + * accounting
> + */
> +static int touch_steal_time(int is_idle)
> +{
> +	u64 steal, st;
> +
> +	if (!hypervisor_steal_time)
> +		return 0;
> +
> +	steal = hypervisor_steal_time();
> +
> +	st = usecs_to_cputime(steal / 1000);
> +
> +	if (is_idle)
> +		return 0;
> +
> +	if (st) {
> +		account_steal_time(st);
> +		return 1;
> +	}
> +	return 0;
> +}
> +
>  /*
>   * Account user cpu time to a process.
>   * @p: the process that the cpu time gets accounted to
> @@ -3520,6 +3553,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
>  	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
>  	cputime64_t tmp;
>  
> +	if (touch_steal_time(0))
> +		return;
> +
>  	/* Add user time to process. */
>  	p->utime = cputime_add(p->utime, cputime);
>  	p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
> @@ -3580,6 +3616,9 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
>  	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
>  	cputime64_t tmp;
>  
> +	if (touch_steal_time(0))
> +		return;
> +
>  	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
>  		account_guest_time(p, cputime, cputime_scaled);
>  		return;
> @@ -3627,6 +3666,8 @@ void account_idle_time(cputime_t cputime)
>  	cputime64_t cputime64 = cputime_to_cputime64(cputime);
>  	struct rq *rq = this_rq();
>  
> +	touch_steal_time(1);
> +
>  	if (atomic_read(&rq->nr_iowait) > 0)
>  		cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
>  	else


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 3/6] KVM-GST: KVM Steal time accounting
  2011-01-29  1:16   ` Jeremy Fitzhardinge
@ 2011-01-29  1:27     ` Glauber Costa
  0 siblings, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-01-29  1:27 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Peter Zijlstra, Avi Kivity

On Fri, 2011-01-28 at 17:16 -0800, Jeremy Fitzhardinge wrote:
> On 01/28/2011 11:52 AM, Glauber Costa wrote:
> > This patch accounts steal time time in kernel/sched.
> > I kept it from last proposal, because I still see advantages
> > in it: Doing it here will give us easier access from scheduler
> > variables such as the cpu rq. The next patch shows an example of
> > usage for it.
> 
> Is this used on the host or guest side?

Guest.

>     J
> 
> > Since functions like account_idle_time() can be called from
> > multiple places, not only account_process_tick(), steal time
> > grabbing is repeated in each account function separatedely.
> >
> > Signed-off-by: Glauber Costa <glommer@redhat.com>
> > CC: Rik van Riel <riel@redhat.com>
> > CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> > CC: Peter Zijlstra <peterz@infradead.org>
> > CC: Avi Kivity <avi@redhat.com>
> > ---
> >  include/linux/sched.h |    1 +
> >  kernel/sched.c        |   41 +++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 42 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index d747f94..77e9297 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -302,6 +302,7 @@ long io_schedule_timeout(long timeout);
> >  extern void cpu_init (void);
> >  extern void trap_init(void);
> >  extern void update_process_times(int user);
> > +extern u64 (*hypervisor_steal_time)(void);
> >  extern void scheduler_tick(void);
> >  
> >  extern void sched_show_task(struct task_struct *p);
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index ea3e5ef..7765e9d 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -3508,6 +3508,39 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p)
> >  	return ns;
> >  }
> >  
> > +u64 (*hypervisor_steal_time)(void) = NULL;
> > +
> > +/*
> > + * We have to at flush steal time information every time something else
> > + * is accounted. Since the accounting functions are all visible to the rest
> > + * of the kernel, it gets tricky to do them in one place. This helper function
> > + * helps us.
> > + *
> > + * When the system is idle, the concept of steal time does not apply. We just
> > + * tell the underlying hypervisor that we grabbed the data, but skip steal time
> > + * accounting
> > + */
> > +static int touch_steal_time(int is_idle)
> > +{
> > +	u64 steal, st;
> > +
> > +	if (!hypervisor_steal_time)
> > +		return 0;
> > +
> > +	steal = hypervisor_steal_time();
> > +
> > +	st = usecs_to_cputime(steal / 1000);
> > +
> > +	if (is_idle)
> > +		return 0;
> > +
> > +	if (st) {
> > +		account_steal_time(st);
> > +		return 1;
> > +	}
> > +	return 0;
> > +}
> > +
> >  /*
> >   * Account user cpu time to a process.
> >   * @p: the process that the cpu time gets accounted to
> > @@ -3520,6 +3553,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
> >  	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> >  	cputime64_t tmp;
> >  
> > +	if (touch_steal_time(0))
> > +		return;
> > +
> >  	/* Add user time to process. */
> >  	p->utime = cputime_add(p->utime, cputime);
> >  	p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
> > @@ -3580,6 +3616,9 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
> >  	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> >  	cputime64_t tmp;
> >  
> > +	if (touch_steal_time(0))
> > +		return;
> > +
> >  	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
> >  		account_guest_time(p, cputime, cputime_scaled);
> >  		return;
> > @@ -3627,6 +3666,8 @@ void account_idle_time(cputime_t cputime)
> >  	cputime64_t cputime64 = cputime_to_cputime64(cputime);
> >  	struct rq *rq = this_rq();
> >  
> > +	touch_steal_time(1);
> > +
> >  	if (atomic_read(&rq->nr_iowait) > 0)
> >  		cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
> >  	else
> 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 1/6] KVM-HDR: KVM Steal time implementation
  2011-01-28 19:52 ` [PATCH v2 1/6] KVM-HDR: KVM Steal time implementation Glauber Costa
@ 2011-01-29  1:28   ` Rik van Riel
  0 siblings, 0 replies; 44+ messages in thread
From: Rik van Riel @ 2011-01-29  1:28 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Jeremy Fitzhardinge, Peter Zijlstra,
	Avi Kivity

On 01/28/2011 02:52 PM, Glauber Costa wrote:
> To implement steal time, we need the hypervisor to pass the guest information
> about how much time was spent running other processes outside the VM.
> This is per-vcpu, and using the kvmclock structure for that is an abuse
> we decided not to make.
>
> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> holds the memory area address containing information about steal time
>
> This patch contains the headers for it. I am keeping it separate to facilitate
> backports to people who wants to backport the kernel part but not the
> hypervisor, or the other way around.
>
> Signed-off-by: Glauber Costa<glommer@redhat.com>
> CC: Rik van Riel<riel@redhat.com>
> CC: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
> CC: Peter Zijlstra<peterz@infradead.org>
> CC: Avi Kivity<avi@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/6] KVM-HV: KVM Steal time implementation
  2011-01-28 19:52 ` [PATCH v2 2/6] KVM-HV: " Glauber Costa
@ 2011-01-29  1:46   ` Rik van Riel
  2011-01-30 13:13   ` Avi Kivity
  2011-01-31 11:07   ` Peter Zijlstra
  2 siblings, 0 replies; 44+ messages in thread
From: Rik van Riel @ 2011-01-29  1:46 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Jeremy Fitzhardinge, Peter Zijlstra,
	Avi Kivity

On 01/28/2011 02:52 PM, Glauber Costa wrote:
> To implement steal time, we need the hypervisor to pass the guest information
> about how much time was spent running other processes outside the VM.
> This is per-vcpu, and using the kvmclock structure for that is an abuse
> we decided not to make.
>
> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> holds the memory area address containing information about steal time
>
> This patch contains the hypervisor part for it. I am keeping it separate from
> the headers to facilitate backports to people who wants to backport the kernel
> part but not the hypervisor, or the other way around.
>
> Signed-off-by: Glauber Costa<glommer@redhat.com>
> CC: Rik van Riel<riel@redhat.com>
> CC: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
> CC: Peter Zijlstra<peterz@infradead.org>
> CC: Avi Kivity<avi@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 3/6] KVM-GST: KVM Steal time accounting
  2011-01-28 19:52 ` [PATCH v2 3/6] KVM-GST: KVM Steal time accounting Glauber Costa
  2011-01-29  1:16   ` Jeremy Fitzhardinge
@ 2011-01-29  1:53   ` Rik van Riel
  2011-01-30 14:04   ` Avi Kivity
  2 siblings, 0 replies; 44+ messages in thread
From: Rik van Riel @ 2011-01-29  1:53 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Jeremy Fitzhardinge, Peter Zijlstra,
	Avi Kivity

On 01/28/2011 02:52 PM, Glauber Costa wrote:
> This patch accounts steal time time in kernel/sched.
> I kept it from last proposal, because I still see advantages
> in it: Doing it here will give us easier access from scheduler
> variables such as the cpu rq. The next patch shows an example of
> usage for it.
>
> Since functions like account_idle_time() can be called from
> multiple places, not only account_process_tick(), steal time
> grabbing is repeated in each account function separatedely.
>
> Signed-off-by: Glauber Costa<glommer@redhat.com>
> CC: Rik van Riel<riel@redhat.com>
> CC: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
> CC: Peter Zijlstra<peterz@infradead.org>
> CC: Avi Kivity<avi@redhat.com>

Not the traditional way of doing steal time, but a lot cleaner
than the legacy code that's left over from when each clocksource
had its own interrupt function.

I like it.

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 4/6] KVM-GST: KVM Steal time registration
  2011-01-28 19:52 ` [PATCH v2 4/6] KVM-GST: KVM Steal time registration Glauber Costa
@ 2011-01-29  2:16   ` Rik van Riel
  2011-01-30 13:16   ` Avi Kivity
  2011-01-31 11:11   ` Peter Zijlstra
  2 siblings, 0 replies; 44+ messages in thread
From: Rik van Riel @ 2011-01-29  2:16 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Jeremy Fitzhardinge, Peter Zijlstra,
	Avi Kivity

On 01/28/2011 02:52 PM, Glauber Costa wrote:
> Register steal time within KVM. Everytime we sample the steal time
> information, we update a local variable that tells what was the
> last time read. We then account the difference.
>
> Signed-off-by: Glauber Costa<glommer@redhat.com>
> CC: Rik van Riel<riel@redhat.com>
> CC: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
> CC: Peter Zijlstra<peterz@infradead.org>
> CC: Avi Kivity<avi@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/6] KVM-HV: KVM Steal time implementation
  2011-01-28 19:52 ` [PATCH v2 2/6] KVM-HV: " Glauber Costa
  2011-01-29  1:46   ` Rik van Riel
@ 2011-01-30 13:13   ` Avi Kivity
  2011-02-01 15:48     ` Glauber Costa
  2011-01-31 11:07   ` Peter Zijlstra
  2 siblings, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2011-01-30 13:13 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra

On 01/28/2011 09:52 PM, Glauber Costa wrote:
> To implement steal time, we need the hypervisor to pass the guest information
> about how much time was spent running other processes outside the VM.
> This is per-vcpu, and using the kvmclock structure for that is an abuse
> we decided not to make.
>
> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> holds the memory area address containing information about steal time
>
> This patch contains the hypervisor part for it. I am keeping it separate from
> the headers to facilitate backports to people who wants to backport the kernel
> part but not the hypervisor, or the other way around.
>
>
> @@ -1528,16 +1528,23 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>   		vcpu->arch.time_page =
>   				gfn_to_page(vcpu->kvm, data>>  PAGE_SHIFT);
>
> -		if (is_error_page(vcpu->arch.time_page)) {
> -			kvm_release_page_clean(vcpu->arch.time_page);
> -			vcpu->arch.time_page = NULL;
> -		}
>   		break;
>   	}

Unrelated?

> @@ -2106,6 +2120,25 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>   			kvm_migrate_timers(vcpu);
>   		vcpu->cpu = cpu;
>   	}
> +
> +	if (vcpu->arch.this_time_out) {
> +		u64 to = (get_kernel_ns() - vcpu->arch.this_time_out);
> +		/*
> +		 * using nanoseconds introduces noise, which accumulates easily
> +		 * leading to big steal time values. We want, however, to keep the
> +		 * interface nanosecond-based for future-proofness.
> +		 */
> +		to /= NSEC_PER_USEC;
> +		to *= NSEC_PER_USEC;

Seems there is a real problem and that this is just papering it over.  
I'd like to understand the root cause.

> +		vcpu->arch.time_out += to;
> +		kvm_write_guest(vcpu->kvm, (gpa_t)&st->steal,
> +				&vcpu->arch.time_out, sizeof(st->steal));

Error check.

> +		vcpu->arch.sversion += 2;

Doesn't survive live migration.  You need to use the version from the 
guest area.

> +		kvm_write_guest(vcpu->kvm, (gpa_t)&st->version,
> +				&vcpu->arch.sversion, sizeof(st->version));
> +		/* is it possible to have 2 loads in sequence? */
> +		vcpu->arch.this_time_out = 0;
> +	}
>   }
>

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 4/6] KVM-GST: KVM Steal time registration
  2011-01-28 19:52 ` [PATCH v2 4/6] KVM-GST: KVM Steal time registration Glauber Costa
  2011-01-29  2:16   ` Rik van Riel
@ 2011-01-30 13:16   ` Avi Kivity
  2011-02-01 15:53     ` Glauber Costa
  2011-01-31 11:11   ` Peter Zijlstra
  2 siblings, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2011-01-30 13:16 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra

On 01/28/2011 09:52 PM, Glauber Costa wrote:
> Register steal time within KVM. Everytime we sample the steal time
> information, we update a local variable that tells what was the
> last time read. We then account the difference.
>
>
>
>   static void kvm_guest_cpu_offline(void *dummy)
>   {
>   	kvm_pv_disable_apf(NULL);
> +	native_write_msr(MSR_KVM_STEAL_TIME, 0, 0);
>   	apf_task_wake_all();
>   }

Don't use the native_ versions, they override the pvops implementation.  
It doesn't matter for kvm, but we're not supposed to know this.

> +	/*
> +	 * using nanoseconds introduces noise, which accumulates easily
> +	 * leading to big steal time values. We want, however, to keep the
> +	 * interface nanosecond-based for future-proofness. The hypervisor may
> +	 * adopt a similar strategy, but we can't rely on that.
> +	 */
> +	delta /= NSEC_PER_MSEC;
> +	delta *= NSEC_PER_MSEC;

You're working around this problem both in the guest and host.  So even 
if we fix it in one, it will still be broken in the other.

> +
> +	return delta;
> +}
> +

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 6/6] Describe KVM_MSR_STEAL_TIME
  2011-01-28 19:52 ` [PATCH v2 6/6] Describe KVM_MSR_STEAL_TIME Glauber Costa
@ 2011-01-30 13:19   ` Avi Kivity
  2011-02-01 15:54     ` Glauber Costa
  0 siblings, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2011-01-30 13:19 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, linux-kernel, aliguori

On 01/28/2011 09:52 PM, Glauber Costa wrote:
> This patch adds documentation about usage of the newly
> introduced KVM_MSR_STEAL_TIME.
>
>
> +
> +MSR_KVM_STEAL_TIME: 0x4b564d03
> +
> +	data: 4-byte alignment physical address of a memory area which must be
> +	in guest RAM, plus an enable bit in bit 0.

64-byte aligned:
- avoids wrapping around a page boundary, which may let us optimize 
things later on (see kvm_write_guest_cached()).
- gives us 5 more unused bits to enable more options

>   This memory is expected to
> +	hold a copy of the following structure:
> +
> +	struct kvm_steal_time {
> +	  	__u64 steal;
> + 		__u32 version;
> + 		__u32 flags;
> +	 	__u32 pad[6];
> +	}
> +
> +	whose data will be filled in by the hypervisor periodically. Only one
> +	write, or registration, is needed for each VCPU. The interval between
> +	updates of this structure is arbitrary and implementation-dependent.
> +	The hypervisor may update this structure at any time it sees fit until
> +	anything with bit0 == 0 is written to it.
> +
> +	Fields have the following meanings:
> +
> +		version: guest has to check version before and after grabbing
> +		time information and check that they are both equal and even.
> +		An odd version indicates an in-progress update.
> +
> +		flags: At this point, always zero. May be used to indicate
> +		changes in this structure in the future.
> +
> +		steal: the amount of time in which this vCPU did not run, in
> +		nanoseconds.
> +

The guest must initialize the entire 64-byte structure to zero before 
enabling the feature.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 3/6] KVM-GST: KVM Steal time accounting
  2011-01-28 19:52 ` [PATCH v2 3/6] KVM-GST: KVM Steal time accounting Glauber Costa
  2011-01-29  1:16   ` Jeremy Fitzhardinge
  2011-01-29  1:53   ` Rik van Riel
@ 2011-01-30 14:04   ` Avi Kivity
  2011-01-30 16:45     ` lidong chen
  2011-02-01 15:57     ` Glauber Costa
  2 siblings, 2 replies; 44+ messages in thread
From: Avi Kivity @ 2011-01-30 14:04 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra

On 01/28/2011 09:52 PM, Glauber Costa wrote:
> This patch accounts steal time time in kernel/sched.
> I kept it from last proposal, because I still see advantages
> in it: Doing it here will give us easier access from scheduler
> variables such as the cpu rq. The next patch shows an example of
> usage for it.
>
> Since functions like account_idle_time() can be called from
> multiple places, not only account_process_tick(), steal time
> grabbing is repeated in each account function separatedely.
>

I accept that steal time is worthwhile, but do you have some way to 
demonstrate that the implementation actually works and is beneficial?

Perhaps run two cpu-bound compute processes on one vcpu, overcommit that 
vcpu, and see what happens to the processing rate with and without steal 
time accounting.  I'd expect a fairer response with steal time accounting.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 3/6] KVM-GST: KVM Steal time accounting
  2011-01-30 14:04   ` Avi Kivity
@ 2011-01-30 16:45     ` lidong chen
  2011-02-01 15:58       ` Glauber Costa
  2011-02-01 15:57     ` Glauber Costa
  1 sibling, 1 reply; 44+ messages in thread
From: lidong chen @ 2011-01-30 16:45 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Glauber Costa, kvm, linux-kernel, aliguori, Rik van Riel,
	Jeremy Fitzhardinge, Peter Zijlstra

I think we can use performance counter.
use unhalted core cycles event, in the nmi callback funcation, count
which process is running .
if the vm exit is caused by nmi,discard it.
the system time of qemu process is the time steal by kvm.



2011/1/30 Avi Kivity <avi@redhat.com>:
> On 01/28/2011 09:52 PM, Glauber Costa wrote:
>>
>> This patch accounts steal time time in kernel/sched.
>> I kept it from last proposal, because I still see advantages
>> in it: Doing it here will give us easier access from scheduler
>> variables such as the cpu rq. The next patch shows an example of
>> usage for it.
>>
>> Since functions like account_idle_time() can be called from
>> multiple places, not only account_process_tick(), steal time
>> grabbing is repeated in each account function separatedely.
>>
>
> I accept that steal time is worthwhile, but do you have some way to
> demonstrate that the implementation actually works and is beneficial?
>
> Perhaps run two cpu-bound compute processes on one vcpu, overcommit that
> vcpu, and see what happens to the processing rate with and without steal
> time accounting.  I'd expect a fairer response with steal time accounting.
>
> --
> error compiling committee.c: too many arguments to function
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/6] KVM-HV: KVM Steal time implementation
  2011-01-28 19:52 ` [PATCH v2 2/6] KVM-HV: " Glauber Costa
  2011-01-29  1:46   ` Rik van Riel
  2011-01-30 13:13   ` Avi Kivity
@ 2011-01-31 11:07   ` Peter Zijlstra
  2 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2011-01-31 11:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity, tglx

On Fri, 2011-01-28 at 14:52 -0500, Glauber Costa wrote:
> +               u64 to = (get_kernel_ns() - vcpu->arch.this_time_out);
> +               /*
> +                * using nanoseconds introduces noise, which accumulates easily
> +                * leading to big steal time values. We want, however, to keep the
> +                * interface nanosecond-based for future-proofness.
> +                */
> +               to /= NSEC_PER_USEC;
> +               to *= NSEC_PER_USEC; 

This just doesn't make any sense at all, you use the most expensive and
accurate kernel interface to get ns resolution timestamps (ktime_get),
and then you truncate the stuff to usec for some totally weird and
unexplained reason.

If ktime_get is wrong all our time-keeping is out the window.




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 4/6] KVM-GST: KVM Steal time registration
  2011-01-28 19:52 ` [PATCH v2 4/6] KVM-GST: KVM Steal time registration Glauber Costa
  2011-01-29  2:16   ` Rik van Riel
  2011-01-30 13:16   ` Avi Kivity
@ 2011-01-31 11:11   ` Peter Zijlstra
  2 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2011-01-31 11:11 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Fri, 2011-01-28 at 14:52 -0500, Glauber Costa wrote:
> +       /*
> +        * using nanoseconds introduces noise, which accumulates easily
> +        * leading to big steal time values. We want, however, to keep the
> +        * interface nanosecond-based for future-proofness. The hypervisor may
> +        * adopt a similar strategy, but we can't rely on that.
> +        */
> +       delta /= NSEC_PER_MSEC;
> +       delta *= NSEC_PER_MSEC; 

I really doubt that claim..


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 5/6] KVM-GST: adjust scheduler cpu power
  2011-01-28 19:52 ` [PATCH v2 5/6] KVM-GST: adjust scheduler cpu power Glauber Costa
@ 2011-01-31 11:25   ` Peter Zijlstra
  2011-01-31 11:27     ` Peter Zijlstra
  2011-02-01 15:59     ` Glauber Costa
  0 siblings, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2011-01-31 11:25 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Fri, 2011-01-28 at 14:52 -0500, Glauber Costa wrote:

> +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> +static DEFINE_PER_CPU(u64, cpu_steal_time);
> +
> +#ifndef CONFIG_64BIT
> +static DEFINE_PER_CPU(seqcount_t, steal_time_seq);
> +
> +static inline void steal_time_write_begin(void)
> +{
> +	__this_cpu_inc(steal_time_seq.sequence);
> +	smp_wmb();
> +}
> +
> +static inline void steal_time_write_end(void)
> +{
> +	smp_wmb();
> +	__this_cpu_inc(steal_time_seq.sequence);
> +}
> +
> +static inline u64 steal_time_read(int cpu)
> +{
> +	u64 steal_time;
> +	unsigned seq;
> +
> +	do {
> +		seq = read_seqcount_begin(&per_cpu(steal_time_seq, cpu));
> +		steal_time = per_cpu(cpu_steal_time, cpu);
> +	} while (read_seqcount_retry(&per_cpu(steal_time_seq, cpu), seq));
> +
> +	return steal_time;
> +}
> +#else /* CONFIG_64BIT */
> +static inline void steal_time_write_begin(void)
> +{
> +}
> +
> +static inline void steal_time_write_end(void)
> +{
> +}
> +
> +static inline u64 steal_time_read(int cpu)
> +{
> +	return per_cpu(cpu_steal_time, cpu);
> +}


> @@ -3536,6 +3592,11 @@ static int touch_steal_time(int is_idle)
>  
>  	if (st) {
>  		account_steal_time(st);
> +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> +		steal_time_write_begin();
> +		__this_cpu_add(cpu_steal_time, steal);
> +		steal_time_write_end();
> +#endif
>  		return 1;
>  	}
>  	return 0;


Why replicate all logic you've already got in patch 4? That too is
reading steal time in a loop in kvm_account_steal_time(), why not extend
that interface to take a cpu argument and be done with it?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 5/6] KVM-GST: adjust scheduler cpu power
  2011-01-31 11:25   ` Peter Zijlstra
@ 2011-01-31 11:27     ` Peter Zijlstra
  2011-02-01 15:59     ` Glauber Costa
  1 sibling, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2011-01-31 11:27 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Mon, 2011-01-31 at 12:25 +0100, Peter Zijlstra wrote:
> On Fri, 2011-01-28 at 14:52 -0500, Glauber Costa wrote:
> 
> > +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> > +static DEFINE_PER_CPU(u64, cpu_steal_time);
> > +
> > +#ifndef CONFIG_64BIT
> > +static DEFINE_PER_CPU(seqcount_t, steal_time_seq);
> > +
> > +static inline void steal_time_write_begin(void)
> > +{
> > +	__this_cpu_inc(steal_time_seq.sequence);
> > +	smp_wmb();
> > +}
> > +
> > +static inline void steal_time_write_end(void)
> > +{
> > +	smp_wmb();
> > +	__this_cpu_inc(steal_time_seq.sequence);
> > +}
> > +
> > +static inline u64 steal_time_read(int cpu)
> > +{
> > +	u64 steal_time;
> > +	unsigned seq;
> > +
> > +	do {
> > +		seq = read_seqcount_begin(&per_cpu(steal_time_seq, cpu));
> > +		steal_time = per_cpu(cpu_steal_time, cpu);
> > +	} while (read_seqcount_retry(&per_cpu(steal_time_seq, cpu), seq));
> > +
> > +	return steal_time;
> > +}
> > +#else /* CONFIG_64BIT */
> > +static inline void steal_time_write_begin(void)
> > +{
> > +}
> > +
> > +static inline void steal_time_write_end(void)
> > +{
> > +}
> > +
> > +static inline u64 steal_time_read(int cpu)
> > +{
> > +	return per_cpu(cpu_steal_time, cpu);
> > +}
> 
> 
> > @@ -3536,6 +3592,11 @@ static int touch_steal_time(int is_idle)
> >  
> >  	if (st) {
> >  		account_steal_time(st);
> > +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> > +		steal_time_write_begin();
> > +		__this_cpu_add(cpu_steal_time, steal);
> > +		steal_time_write_end();
> > +#endif
> >  		return 1;
> >  	}
> >  	return 0;
> 
> 
> Why replicate all logic you've already got in patch 4? That too is
> reading steal time in a loop in kvm_account_steal_time(), why not extend
> that interface to take a cpu argument and be done with it?

Also, this relies on touch_steal_time() to have ran, not at all
something that is tied to calling update_rq_clock().

We can call update_rq_clock() many many times between ticks, as can the
stealtime increase because the vcpu can schedule many many times between
ticks.




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/6] KVM-HV: KVM Steal time implementation
  2011-01-30 13:13   ` Avi Kivity
@ 2011-02-01 15:48     ` Glauber Costa
  2011-02-01 17:09       ` Avi Kivity
  0 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-02-01 15:48 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra

On Sun, 2011-01-30 at 15:13 +0200, Avi Kivity wrote:
> On 01/28/2011 09:52 PM, Glauber Costa wrote:
> > To implement steal time, we need the hypervisor to pass the guest information
> > about how much time was spent running other processes outside the VM.
> > This is per-vcpu, and using the kvmclock structure for that is an abuse
> > we decided not to make.
> >
> > In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> > holds the memory area address containing information about steal time
> >
> > This patch contains the hypervisor part for it. I am keeping it separate from
> > the headers to facilitate backports to people who wants to backport the kernel
> > part but not the hypervisor, or the other way around.
> >
> >
> > @@ -1528,16 +1528,23 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> >   		vcpu->arch.time_page =
> >   				gfn_to_page(vcpu->kvm, data>>  PAGE_SHIFT);
> >
> > -		if (is_error_page(vcpu->arch.time_page)) {
> > -			kvm_release_page_clean(vcpu->arch.time_page);
> > -			vcpu->arch.time_page = NULL;
> > -		}
> >   		break;
> >   	}
> 
> Unrelated?

Indeed, leftover from another patch. Thanks

> > @@ -2106,6 +2120,25 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> >   			kvm_migrate_timers(vcpu);
> >   		vcpu->cpu = cpu;
> >   	}
> > +
> > +	if (vcpu->arch.this_time_out) {
> > +		u64 to = (get_kernel_ns() - vcpu->arch.this_time_out);
> > +		/*
> > +		 * using nanoseconds introduces noise, which accumulates easily
> > +		 * leading to big steal time values. We want, however, to keep the
> > +		 * interface nanosecond-based for future-proofness.
> > +		 */
> > +		to /= NSEC_PER_USEC;
> > +		to *= NSEC_PER_USEC;
> 
> Seems there is a real problem and that this is just papering it over.  
> I'd like to understand the root cause.
Okay, my self-explanation seemed reasonable to me, but since both you
and Peter dislike it, I think it is important enough to get a more
thorough investigation before a second round. But in this case,
I keep that using nanoseconds may then not be the best approach here. We
also have to keep in mind that the host and guest clocks may be running
at different resolutions.

> 
> > +		vcpu->arch.time_out += to;
> > +		kvm_write_guest(vcpu->kvm, (gpa_t)&st->steal,
> > +				&vcpu->arch.time_out, sizeof(st->steal));
> 
> Error check.

Fair.

> > +		vcpu->arch.sversion += 2;
> 
> Doesn't survive live migration.  You need to use the version from the 
> guest area.
Why not? Who said versions need to always increase? If current version
is 102324, and we live migrate and it becomes 0, what is the problem?
next update will make it two, that will at most force another read loop
in the guest.
The only properties we have to maintain are: 
1) Value is even (checked)
2) Value does not change between a read (checked).

Reading the value from the guest just slows it down, imho.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 4/6] KVM-GST: KVM Steal time registration
  2011-01-30 13:16   ` Avi Kivity
@ 2011-02-01 15:53     ` Glauber Costa
  2011-02-01 16:17       ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-02-01 15:53 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra

On Sun, 2011-01-30 at 15:16 +0200, Avi Kivity wrote:
> On 01/28/2011 09:52 PM, Glauber Costa wrote:
> > Register steal time within KVM. Everytime we sample the steal time
> > information, we update a local variable that tells what was the
> > last time read. We then account the difference.
> >
> >
> >
> >   static void kvm_guest_cpu_offline(void *dummy)
> >   {
> >   	kvm_pv_disable_apf(NULL);
> > +	native_write_msr(MSR_KVM_STEAL_TIME, 0, 0);
> >   	apf_task_wake_all();
> >   }
> 
> Don't use the native_ versions, they override the pvops implementation.  
> It doesn't matter for kvm, but we're not supposed to know this.

fair.

> > +	/*
> > +	 * using nanoseconds introduces noise, which accumulates easily
> > +	 * leading to big steal time values. We want, however, to keep the
> > +	 * interface nanosecond-based for future-proofness. The hypervisor may
> > +	 * adopt a similar strategy, but we can't rely on that.
> > +	 */
> > +	delta /= NSEC_PER_MSEC;
> > +	delta *= NSEC_PER_MSEC;
> 
> You're working around this problem both in the guest and host.  So even 
> if we fix it in one, it will still be broken in the other.

And if you notice, in two different ways:
I am (was) forcing to usecs in the host, and msecs in the guest.
One of the problems here, is that if we account steal time, we refrain
from accounting user / system time. Reason being, that if we account it,
we'll end up with more than HZ ticks per HZ, since we'll account ticks
as both steal and real.

And since the granularity of the cpu accounting is too coarse, we end up
with much more steal time than we should, because things that are less
than 1 unity of cputime, are often rounded up to 1 unity of cputime.

Now, I've already said that I will investigate further, and I'm ready to
back of from all of this. But assuming my analysis is right so far, what
if we keep things in nsecs or msecs, and only convert to cputime in the
time of read? This would allow us to just subtract steal time from
user/system time, in a more fine grained way.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 6/6] Describe KVM_MSR_STEAL_TIME
  2011-01-30 13:19   ` Avi Kivity
@ 2011-02-01 15:54     ` Glauber Costa
  2011-02-02 10:14       ` Avi Kivity
  0 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-02-01 15:54 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, linux-kernel, aliguori

On Sun, 2011-01-30 at 15:19 +0200, Avi Kivity wrote:
> On 01/28/2011 09:52 PM, Glauber Costa wrote:
> > This patch adds documentation about usage of the newly
> > introduced KVM_MSR_STEAL_TIME.
> >
> >
> > +
> > +MSR_KVM_STEAL_TIME: 0x4b564d03
> > +
> > +	data: 4-byte alignment physical address of a memory area which must be
> > +	in guest RAM, plus an enable bit in bit 0.
> 
> 64-byte aligned:
> - avoids wrapping around a page boundary, which may let us optimize 
> things later on (see kvm_write_guest_cached()).
> - gives us 5 more unused bits to enable more options
> 
> >   This memory is expected to
> > +	hold a copy of the following structure:
> > +
> > +	struct kvm_steal_time {
> > +	  	__u64 steal;
> > + 		__u32 version;
> > + 		__u32 flags;
> > +	 	__u32 pad[6];
> > +	}
> > +
> > +	whose data will be filled in by the hypervisor periodically. Only one
> > +	write, or registration, is needed for each VCPU. The interval between
> > +	updates of this structure is arbitrary and implementation-dependent.
> > +	The hypervisor may update this structure at any time it sees fit until
> > +	anything with bit0 == 0 is written to it.
> > +
> > +	Fields have the following meanings:
> > +
> > +		version: guest has to check version before and after grabbing
> > +		time information and check that they are both equal and even.
> > +		An odd version indicates an in-progress update.
> > +
> > +		flags: At this point, always zero. May be used to indicate
> > +		changes in this structure in the future.
> > +
> > +		steal: the amount of time in which this vCPU did not run, in
> > +		nanoseconds.
> > +
> 
> The guest must initialize the entire 64-byte structure to zero before 
> enabling the feature.

I honestly don't see why. But I also don't see why not... 
Will update it.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 3/6] KVM-GST: KVM Steal time accounting
  2011-01-30 14:04   ` Avi Kivity
  2011-01-30 16:45     ` lidong chen
@ 2011-02-01 15:57     ` Glauber Costa
  2011-02-02 10:11       ` Avi Kivity
  1 sibling, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-02-01 15:57 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra

On Sun, 2011-01-30 at 16:04 +0200, Avi Kivity wrote:
> On 01/28/2011 09:52 PM, Glauber Costa wrote:
> > This patch accounts steal time time in kernel/sched.
> > I kept it from last proposal, because I still see advantages
> > in it: Doing it here will give us easier access from scheduler
> > variables such as the cpu rq. The next patch shows an example of
> > usage for it.
> >
> > Since functions like account_idle_time() can be called from
> > multiple places, not only account_process_tick(), steal time
> > grabbing is repeated in each account function separatedely.
> >
> 
> I accept that steal time is worthwhile, but do you have some way to 
> demonstrate that the implementation actually works and is beneficial?
> 
> Perhaps run two cpu-bound compute processes on one vcpu, overcommit that 
> vcpu, and see what happens to the processing rate with and without steal 
> time accounting.  I'd expect a fairer response with steal time accounting.

Avi,

There are two things here:
One of them, which is solely the accounting of steal time, (patches 1 to
4) has absolutely nothing to do with what you said. Its sole purpose is
to provide the user with information about "why is my process slow if I
am using 100 % of my cpu?")

The last patch is the only one that actually tries to rebalance cpus
according to steal time information. For that, I do have some
experiments I did here to see if it is working, will try to provide more
precise data in the next submission.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 3/6] KVM-GST: KVM Steal time accounting
  2011-01-30 16:45     ` lidong chen
@ 2011-02-01 15:58       ` Glauber Costa
  0 siblings, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-02-01 15:58 UTC (permalink / raw)
  To: lidong chen
  Cc: Avi Kivity, kvm, linux-kernel, aliguori, Rik van Riel,
	Jeremy Fitzhardinge, Peter Zijlstra

On Mon, 2011-01-31 at 00:45 +0800, lidong chen wrote:
> I think we can use performance counter.
> use unhalted core cycles event, in the nmi callback funcation, count
> which process is running .
> if the vm exit is caused by nmi,discard it.
> the system time of qemu process is the time steal by kvm.

Performance counters are a scarce resource, so I'd rather not use them,
since it will mean forcing a context switch from whoever is using it at
the moment. Which is also an expensive operation anyway.

So even though it can be possible, in theory, I don't see why use it
in this particular case.

> 
> 
> 2011/1/30 Avi Kivity <avi@redhat.com>:
> > On 01/28/2011 09:52 PM, Glauber Costa wrote:
> >>
> >> This patch accounts steal time time in kernel/sched.
> >> I kept it from last proposal, because I still see advantages
> >> in it: Doing it here will give us easier access from scheduler
> >> variables such as the cpu rq. The next patch shows an example of
> >> usage for it.
> >>
> >> Since functions like account_idle_time() can be called from
> >> multiple places, not only account_process_tick(), steal time
> >> grabbing is repeated in each account function separatedely.
> >>
> >
> > I accept that steal time is worthwhile, but do you have some way to
> > demonstrate that the implementation actually works and is beneficial?
> >
> > Perhaps run two cpu-bound compute processes on one vcpu, overcommit that
> > vcpu, and see what happens to the processing rate with and without steal
> > time accounting.  I'd expect a fairer response with steal time accounting.
> >
> > --
> > error compiling committee.c: too many arguments to function
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 5/6] KVM-GST: adjust scheduler cpu power
  2011-01-31 11:25   ` Peter Zijlstra
  2011-01-31 11:27     ` Peter Zijlstra
@ 2011-02-01 15:59     ` Glauber Costa
  2011-02-01 16:19       ` Peter Zijlstra
  1 sibling, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-02-01 15:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Mon, 2011-01-31 at 12:25 +0100, Peter Zijlstra wrote:
> On Fri, 2011-01-28 at 14:52 -0500, Glauber Costa wrote:
> 
> > +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> > +static DEFINE_PER_CPU(u64, cpu_steal_time);
> > +
> > +#ifndef CONFIG_64BIT
> > +static DEFINE_PER_CPU(seqcount_t, steal_time_seq);
> > +
> > +static inline void steal_time_write_begin(void)
> > +{
> > +	__this_cpu_inc(steal_time_seq.sequence);
> > +	smp_wmb();
> > +}
> > +
> > +static inline void steal_time_write_end(void)
> > +{
> > +	smp_wmb();
> > +	__this_cpu_inc(steal_time_seq.sequence);
> > +}
> > +
> > +static inline u64 steal_time_read(int cpu)
> > +{
> > +	u64 steal_time;
> > +	unsigned seq;
> > +
> > +	do {
> > +		seq = read_seqcount_begin(&per_cpu(steal_time_seq, cpu));
> > +		steal_time = per_cpu(cpu_steal_time, cpu);
> > +	} while (read_seqcount_retry(&per_cpu(steal_time_seq, cpu), seq));
> > +
> > +	return steal_time;
> > +}
> > +#else /* CONFIG_64BIT */
> > +static inline void steal_time_write_begin(void)
> > +{
> > +}
> > +
> > +static inline void steal_time_write_end(void)
> > +{
> > +}
> > +
> > +static inline u64 steal_time_read(int cpu)
> > +{
> > +	return per_cpu(cpu_steal_time, cpu);
> > +}
> 
> 
> > @@ -3536,6 +3592,11 @@ static int touch_steal_time(int is_idle)
> >  
> >  	if (st) {
> >  		account_steal_time(st);
> > +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> > +		steal_time_write_begin();
> > +		__this_cpu_add(cpu_steal_time, steal);
> > +		steal_time_write_end();
> > +#endif
> >  		return 1;
> >  	}
> >  	return 0;
> 
> 
> Why replicate all logic you've already got in patch 4? That too is
> reading steal time in a loop in kvm_account_steal_time(), why not extend
> that interface to take a cpu argument and be done with it?

Because that part is kvm-specific, and this is scheduler general.
It seemed cleaner to me to do it this way. But I can do it differently,
certainly.

> 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 4/6] KVM-GST: KVM Steal time registration
  2011-02-01 15:53     ` Glauber Costa
@ 2011-02-01 16:17       ` Peter Zijlstra
  2011-02-01 17:00         ` Glauber Costa
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2011-02-01 16:17 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Avi Kivity, kvm, linux-kernel, aliguori, Rik van Riel,
	Jeremy Fitzhardinge, Venkatesh Pallipadi

On Tue, 2011-02-01 at 13:53 -0200, Glauber Costa wrote:
> 
> And since the granularity of the cpu accounting is too coarse, we end up
> with much more steal time than we should, because things that are less
> than 1 unity of cputime, are often rounded up to 1 unity of cputime.

See, that! is the problem, don't round up like that.

What you can do is: steal_ticks = steal_time_clock() / TICK_NSEC, or
simply keep a steal time delta and every time it overflows
cputime_one_jiffy insert a steal-time tick.

Venki might have created some infrastructure for doing this with the
IRQ_TIME accounting mess, but irqtime_account_process_tick() still gives
me a head-ache.




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 5/6] KVM-GST: adjust scheduler cpu power
  2011-02-01 15:59     ` Glauber Costa
@ 2011-02-01 16:19       ` Peter Zijlstra
  2011-02-01 16:22         ` Glauber Costa
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2011-02-01 16:19 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Tue, 2011-02-01 at 13:59 -0200, Glauber Costa wrote:
> 
> Because that part is kvm-specific, and this is scheduler general.
> It seemed cleaner to me to do it this way. But I can do it differently,
> certainly. 

Well, any steal time clock will be hypervisor specific, but if we agree
that anything that enables CONFIG_PARAVIRT_TIME_ACCOUNTING provides a
u64 steal_time_clock(int cpu) function then all should be well, right?

The bit you have in kvm is almost that, except it assumes cpu ==
this_cpu.

You simply cannot rely on the silly tick accounting to drive any clock,
its archaic.




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 5/6] KVM-GST: adjust scheduler cpu power
  2011-02-01 16:19       ` Peter Zijlstra
@ 2011-02-01 16:22         ` Glauber Costa
  2011-02-01 18:59           ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-02-01 16:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Tue, 2011-02-01 at 17:19 +0100, Peter Zijlstra wrote:
> On Tue, 2011-02-01 at 13:59 -0200, Glauber Costa wrote:
> > 
> > Because that part is kvm-specific, and this is scheduler general.
> > It seemed cleaner to me to do it this way. But I can do it differently,
> > certainly. 
> 
> Well, any steal time clock will be hypervisor specific, but if we agree
> that anything that enables CONFIG_PARAVIRT_TIME_ACCOUNTING provides a
> u64 steal_time_clock(int cpu) function then all should be well, right?

Once the hypervisor provided the data, it can all be generic, and have
large parts of it that are generic, living in sched.c.


> The bit you have in kvm is almost that, except it assumes cpu ==
> this_cpu.
> 
> You simply cannot rely on the silly tick accounting to drive any clock,
> its archaic.

Which tick accounting? In your other e-mail , you pointed that this only
runs in touch_steal_time, which is fine, will change. But all the rest
here, that is behind the hypervisor specific vs generic code has nothing
to do with ticks at all.

> 
> 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 4/6] KVM-GST: KVM Steal time registration
  2011-02-01 16:17       ` Peter Zijlstra
@ 2011-02-01 17:00         ` Glauber Costa
  2011-02-01 17:44           ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-02-01 17:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, kvm, linux-kernel, aliguori, Rik van Riel,
	Jeremy Fitzhardinge, Venkatesh Pallipadi


> 
> See, that! is the problem, don't round up like that.

Yeah, I was using usecs_to_cputime(), believing this was the standard interface. By the way, one of the things that also led to better results
were just forcing it to 0 every time we had steal == 1 in the end. But
*that* was a real hack =)

> What you can do is: steal_ticks = steal_time_clock() / TICK_NSEC, or
> simply keep a steal time delta and every time it overflows
> cputime_one_jiffy insert a steal-time tick.

What do you think about keeping accounting in msec/usec resolution and
(thus allowing us to compute half a tick to user/system, other half to
steal time) only change it to cputime in the last minute?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/6] KVM-HV: KVM Steal time implementation
  2011-02-01 15:48     ` Glauber Costa
@ 2011-02-01 17:09       ` Avi Kivity
  2011-02-01 19:58         ` Glauber Costa
  0 siblings, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2011-02-01 17:09 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra

On 02/01/2011 05:48 PM, Glauber Costa wrote:
> >  >  @@ -2106,6 +2120,25 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> >  >    			kvm_migrate_timers(vcpu);
> >  >    		vcpu->cpu = cpu;
> >  >    	}
> >  >  +
> >  >  +	if (vcpu->arch.this_time_out) {
> >  >  +		u64 to = (get_kernel_ns() - vcpu->arch.this_time_out);
> >  >  +		/*
> >  >  +		 * using nanoseconds introduces noise, which accumulates easily
> >  >  +		 * leading to big steal time values. We want, however, to keep the
> >  >  +		 * interface nanosecond-based for future-proofness.
> >  >  +		 */
> >  >  +		to /= NSEC_PER_USEC;
> >  >  +		to *= NSEC_PER_USEC;
> >
> >  Seems there is a real problem and that this is just papering it over.
> >  I'd like to understand the root cause.
> Okay, my self-explanation seemed reasonable to me, but since both you
> and Peter dislike it, I think it is important enough to get a more
> thorough investigation before a second round.

Yes please.

> But in this case,
> I keep that using nanoseconds may then not be the best approach here. We
> also have to keep in mind that the host and guest clocks may be running
> at different resolutions.

We need to choose a resolution for the clock (or negotiate one), an 
nanoseconds seems as good as any from a range and precision 
considerations, and is convenient for the host and Linux guests.  So why 
not pick it?

> >  >  +		vcpu->arch.sversion += 2;
> >
> >  Doesn't survive live migration.  You need to use the version from the
> >  guest area.
> Why not? Who said versions need to always increase? If current version
> is 102324, and we live migrate and it becomes 0, what is the problem?

Guest reads version (result: 2)
Guest starts reading data
Live migration; vcpu->arch.sversion is zeroed
Steal time update; vcpu->arch.sversion += 2; write to guest
Guest continues reading data
Guest reads version (result: 2)

So the guest is unaware that an update has occurred while it was reading 
the data.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 4/6] KVM-GST: KVM Steal time registration
  2011-02-01 17:00         ` Glauber Costa
@ 2011-02-01 17:44           ` Peter Zijlstra
  2011-02-01 20:20             ` Venkatesh Pallipadi
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2011-02-01 17:44 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Avi Kivity, kvm, linux-kernel, aliguori, Rik van Riel,
	Jeremy Fitzhardinge, Venkatesh Pallipadi

On Tue, 2011-02-01 at 15:00 -0200, Glauber Costa wrote:
> 
> > What you can do is: steal_ticks = steal_time_clock() / TICK_NSEC, or
> > simply keep a steal time delta and every time it overflows
> > cputime_one_jiffy insert a steal-time tick.
> 
> What do you think about keeping accounting in msec/usec resolution and
> (thus allowing us to compute half a tick to user/system, other half to
> steal time) only change it to cputime in the last minute?

its only accounting full tick..


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 5/6] KVM-GST: adjust scheduler cpu power
  2011-02-01 16:22         ` Glauber Costa
@ 2011-02-01 18:59           ` Peter Zijlstra
  2011-02-01 19:55             ` Glauber Costa
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2011-02-01 18:59 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Tue, 2011-02-01 at 14:22 -0200, Glauber Costa wrote:
> 
> 
> Which tick accounting? In your other e-mail , you pointed that this only
> runs in touch_steal_time, which is fine, will change.

That tick ;-), all the account_foo muck is per tick.

>  But all the rest
> here, that is behind the hypervisor specific vs generic code has nothing
> to do with ticks at all.

But I don't get it, there is no generic code needed, all that's needed
is u64 steal_time_clock(int cpu), and the first part of your
kvm_account_steal_time() function is exactly that if you add the cpu
argument.

+static u64 steal_time_clock(int cpu)
+{
+       u64 steal_time;
+       struct kvm_steal_time *src;
+       int version;
+
+	preempt_disable();
+       src = &per_cpu_ptr(steal_time, cpu);
+       do {
+               version = src->version;
+               rmb();
+               steal_time = src->steal;
+               rmb();
+       } while ((src->version & 1) || (version != src->version));
+       preempt_enable();
+
+       return steal_time
+}

And you're done.. no need to for any of that steal_time_{read,write} business.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 5/6] KVM-GST: adjust scheduler cpu power
  2011-02-01 18:59           ` Peter Zijlstra
@ 2011-02-01 19:55             ` Glauber Costa
  2011-02-01 20:04               ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-02-01 19:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Tue, 2011-02-01 at 19:59 +0100, Peter Zijlstra wrote:
> On Tue, 2011-02-01 at 14:22 -0200, Glauber Costa wrote:
> > 
> > 
> > Which tick accounting? In your other e-mail , you pointed that this only
> > runs in touch_steal_time, which is fine, will change.
> 
> That tick ;-), all the account_foo muck is per tick.
> 
> >  But all the rest
> > here, that is behind the hypervisor specific vs generic code has nothing
> > to do with ticks at all.
> 
> But I don't get it, there is no generic code needed, all that's needed
> is u64 steal_time_clock(int cpu), and the first part of your
> kvm_account_steal_time() function is exactly that if you add the cpu
> argument.
> 
> +static u64 steal_time_clock(int cpu)
> +{
> +       u64 steal_time;
> +       struct kvm_steal_time *src;
> +       int version;
> +
> +	preempt_disable();
> +       src = &per_cpu_ptr(steal_time, cpu);
> +       do {
> +               version = src->version;
> +               rmb();
> +               steal_time = src->steal;
> +               rmb();
> +       } while ((src->version & 1) || (version != src->version));
> +       preempt_enable();
> +
> +       return steal_time
> +}
> 
> And you're done.. no need to for any of that steal_time_{read,write} business.

update_rq_clock_task still have to keep track of what was the last steal
time value we saw, in the same way it does for irq. One option is to
call update_rq_clock_task from inside kvm-code, but I don't really like
it very much.

But okay, there are many ways to work around it, I'll cook something.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/6] KVM-HV: KVM Steal time implementation
  2011-02-01 17:09       ` Avi Kivity
@ 2011-02-01 19:58         ` Glauber Costa
  2011-02-02 10:09           ` Avi Kivity
  0 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-02-01 19:58 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra

On Tue, 2011-02-01 at 19:09 +0200, Avi Kivity wrote:
> On 02/01/2011 05:48 PM, Glauber Costa wrote:
> > >  >  @@ -2106,6 +2120,25 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > >  >    			kvm_migrate_timers(vcpu);
> > >  >    		vcpu->cpu = cpu;
> > >  >    	}
> > >  >  +
> > >  >  +	if (vcpu->arch.this_time_out) {
> > >  >  +		u64 to = (get_kernel_ns() - vcpu->arch.this_time_out);
> > >  >  +		/*
> > >  >  +		 * using nanoseconds introduces noise, which accumulates easily
> > >  >  +		 * leading to big steal time values. We want, however, to keep the
> > >  >  +		 * interface nanosecond-based for future-proofness.
> > >  >  +		 */
> > >  >  +		to /= NSEC_PER_USEC;
> > >  >  +		to *= NSEC_PER_USEC;
> > >
> > >  Seems there is a real problem and that this is just papering it over.
> > >  I'd like to understand the root cause.
> > Okay, my self-explanation seemed reasonable to me, but since both you
> > and Peter dislike it, I think it is important enough to get a more
> > thorough investigation before a second round.
> 
> Yes please.
> 
> > But in this case,
> > I keep that using nanoseconds may then not be the best approach here. We
> > also have to keep in mind that the host and guest clocks may be running
> > at different resolutions.
> 
> We need to choose a resolution for the clock (or negotiate one), an 
> nanoseconds seems as good as any from a range and precision 
> considerations, and is convenient for the host and Linux guests.  So why 
> not pick it?
> 
> > >  >  +		vcpu->arch.sversion += 2;
> > >
> > >  Doesn't survive live migration.  You need to use the version from the
> > >  guest area.
> > Why not? Who said versions need to always increase? If current version
> > is 102324, and we live migrate and it becomes 0, what is the problem?
> 
> Guest reads version (result: 2)
> Guest starts reading data
> Live migration; vcpu->arch.sversion is zeroed
> Steal time update; vcpu->arch.sversion += 2; write to guest
> Guest continues reading data
> Guest reads version (result: 2)
> 
> So the guest is unaware that an update has occurred while it was reading 
> the data.
Ok, fair.

By the way, kvmclock have the same problem, then
> 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 5/6] KVM-GST: adjust scheduler cpu power
  2011-02-01 19:55             ` Glauber Costa
@ 2011-02-01 20:04               ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2011-02-01 20:04 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Avi Kivity

On Tue, 2011-02-01 at 17:55 -0200, Glauber Costa wrote:
> 
> update_rq_clock_task still have to keep track of what was the last steal
> time value we saw, in the same way it does for irq.

Right, the CONFIG_SCHED_PARAVIRT patch I sent earlier adds a
prev_steal_time member to struct rq for this purpose.

>  One option is to
> call update_rq_clock_task from inside kvm-code, but I don't really like
> it very much.

Why would you need to call anything from the kvm code?

Simply make u64 steal_time_clock(int cpu) a paravirt function with u64
native_steal_time_clock(int cpu) { return 0ULL; }. Possibly avoid the
whole CONFIG_SCHED_PARAVIRT block in update_rq_clock_task() for !
paravirt guests.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 4/6] KVM-GST: KVM Steal time registration
  2011-02-01 17:44           ` Peter Zijlstra
@ 2011-02-01 20:20             ` Venkatesh Pallipadi
  0 siblings, 0 replies; 44+ messages in thread
From: Venkatesh Pallipadi @ 2011-02-01 20:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Glauber Costa, Avi Kivity, kvm, linux-kernel, aliguori,
	Rik van Riel, Jeremy Fitzhardinge

On Tue, Feb 1, 2011 at 9:44 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2011-02-01 at 15:00 -0200, Glauber Costa wrote:
>>
>> > What you can do is: steal_ticks = steal_time_clock() / TICK_NSEC, or
>> > simply keep a steal time delta and every time it overflows
>> > cputime_one_jiffy insert a steal-time tick.
>>
>> What do you think about keeping accounting in msec/usec resolution and
>> (thus allowing us to compute half a tick to user/system, other half to
>> steal time) only change it to cputime in the last minute?
>
> its only accounting full tick..
>

Yes. The way we ended up dealing with this in irq time case is track
it in fine granularity and accumulate over time (internally) but
account it (make it visible externally) only in terms of ticks, only
when the value being accumulated crosses the tick boundary. This does
has a hole when we use 99% of time on tick on irq and use 1% just
before the tick on some system, then whole tick will be system and on
next tick if there is 1% irq and 99% system then that will be
accounted as irq as our accumulated value crosses the tick boundary
then. But, such holes on avg should not be worse than not having fine
granularity.

Thanks,
Venki

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 2/6] KVM-HV: KVM Steal time implementation
  2011-02-01 19:58         ` Glauber Costa
@ 2011-02-02 10:09           ` Avi Kivity
  0 siblings, 0 replies; 44+ messages in thread
From: Avi Kivity @ 2011-02-02 10:09 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra

On 02/01/2011 09:58 PM, Glauber Costa wrote:
> >
> >  Guest reads version (result: 2)
> >  Guest starts reading data
> >  Live migration; vcpu->arch.sversion is zeroed
> >  Steal time update; vcpu->arch.sversion += 2; write to guest
> >  Guest continues reading data
> >  Guest reads version (result: 2)
> >
> >  So the guest is unaware that an update has occurred while it was reading
> >  the data.
> Ok, fair.
>
> By the way, kvmclock have the same problem, then
>

It needs the same fix then.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 3/6] KVM-GST: KVM Steal time accounting
  2011-02-01 15:57     ` Glauber Costa
@ 2011-02-02 10:11       ` Avi Kivity
  2011-02-02 10:51         ` Avi Kivity
  0 siblings, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2011-02-02 10:11 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra

On 02/01/2011 05:57 PM, Glauber Costa wrote:
> On Sun, 2011-01-30 at 16:04 +0200, Avi Kivity wrote:
> >  On 01/28/2011 09:52 PM, Glauber Costa wrote:
> >  >  This patch accounts steal time time in kernel/sched.
> >  >  I kept it from last proposal, because I still see advantages
> >  >  in it: Doing it here will give us easier access from scheduler
> >  >  variables such as the cpu rq. The next patch shows an example of
> >  >  usage for it.
> >  >
> >  >  Since functions like account_idle_time() can be called from
> >  >  multiple places, not only account_process_tick(), steal time
> >  >  grabbing is repeated in each account function separatedely.
> >  >
> >
> >  I accept that steal time is worthwhile, but do you have some way to
> >  demonstrate that the implementation actually works and is beneficial?
> >
> >  Perhaps run two cpu-bound compute processes on one vcpu, overcommit that
> >  vcpu, and see what happens to the processing rate with and without steal
> >  time accounting.  I'd expect a fairer response with steal time accounting.
>
> Avi,
>
> There are two things here:
> One of them, which is solely the accounting of steal time, (patches 1 to
> 4) has absolutely nothing to do with what you said. Its sole purpose is
> to provide the user with information about "why is my process slow if I
> am using 100 % of my cpu?")

Right.  Like irq and softirq time, we need to report this to the user, 
as it's potentially much higher.

> The last patch is the only one that actually tries to rebalance cpus
> according to steal time information. For that, I do have some
> experiments I did here to see if it is working, will try to provide more
> precise data in the next submission.
>

Thanks.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 6/6] Describe KVM_MSR_STEAL_TIME
  2011-02-01 15:54     ` Glauber Costa
@ 2011-02-02 10:14       ` Avi Kivity
  0 siblings, 0 replies; 44+ messages in thread
From: Avi Kivity @ 2011-02-02 10:14 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm, linux-kernel, aliguori

On 02/01/2011 05:54 PM, Glauber Costa wrote:
> >
> >  The guest must initialize the entire 64-byte structure to zero before
> >  enabling the feature.
>
> I honestly don't see why. But I also don't see why not...
> Will update it.

Requiring the guest to initialize it to zero allows us to later assign 
meanings to various fields, as long as a zero value is backwards compatible.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 3/6] KVM-GST: KVM Steal time accounting
  2011-02-02 10:11       ` Avi Kivity
@ 2011-02-02 10:51         ` Avi Kivity
  2011-02-02 11:57           ` Glauber Costa
  0 siblings, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2011-02-02 10:51 UTC (permalink / raw)
  To: Glauber Costa
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra

On 02/02/2011 12:11 PM, Avi Kivity wrote:
> On 02/01/2011 05:57 PM, Glauber Costa wrote:
>> On Sun, 2011-01-30 at 16:04 +0200, Avi Kivity wrote:
>> >  On 01/28/2011 09:52 PM, Glauber Costa wrote:
>> > >  This patch accounts steal time time in kernel/sched.
>> > >  I kept it from last proposal, because I still see advantages
>> > >  in it: Doing it here will give us easier access from scheduler
>> > >  variables such as the cpu rq. The next patch shows an example of
>> > >  usage for it.
>> > >
>> > >  Since functions like account_idle_time() can be called from
>> > >  multiple places, not only account_process_tick(), steal time
>> > >  grabbing is repeated in each account function separatedely.
>> > >
>> >
>> >  I accept that steal time is worthwhile, but do you have some way to
>> >  demonstrate that the implementation actually works and is beneficial?
>> >
>> >  Perhaps run two cpu-bound compute processes on one vcpu, 
>> overcommit that
>> >  vcpu, and see what happens to the processing rate with and without 
>> steal
>> >  time accounting.  I'd expect a fairer response with steal time 
>> accounting.
>>
>> Avi,
>>
>> There are two things here:
>> One of them, which is solely the accounting of steal time, (patches 1 to
>> 4) has absolutely nothing to do with what you said. Its sole purpose is
>> to provide the user with information about "why is my process slow if I
>> am using 100 % of my cpu?")
>
> Right.  Like irq and softirq time, we need to report this to the user, 
> as it's potentially much higher.

Of course, it's not enough to just account for this time, you also have 
to expose it somewhere, and update tools like top(1) to display it.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 3/6] KVM-GST: KVM Steal time accounting
  2011-02-02 10:51         ` Avi Kivity
@ 2011-02-02 11:57           ` Glauber Costa
  0 siblings, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-02-02 11:57 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-kernel, aliguori, Rik van Riel, Jeremy Fitzhardinge,
	Peter Zijlstra

On Wed, 2011-02-02 at 12:51 +0200, Avi Kivity wrote:
> On 02/02/2011 12:11 PM, Avi Kivity wrote:
> > On 02/01/2011 05:57 PM, Glauber Costa wrote:
> >> On Sun, 2011-01-30 at 16:04 +0200, Avi Kivity wrote:
> >> >  On 01/28/2011 09:52 PM, Glauber Costa wrote:
> >> > >  This patch accounts steal time time in kernel/sched.
> >> > >  I kept it from last proposal, because I still see advantages
> >> > >  in it: Doing it here will give us easier access from scheduler
> >> > >  variables such as the cpu rq. The next patch shows an example of
> >> > >  usage for it.
> >> > >
> >> > >  Since functions like account_idle_time() can be called from
> >> > >  multiple places, not only account_process_tick(), steal time
> >> > >  grabbing is repeated in each account function separatedely.
> >> > >
> >> >
> >> >  I accept that steal time is worthwhile, but do you have some way to
> >> >  demonstrate that the implementation actually works and is beneficial?
> >> >
> >> >  Perhaps run two cpu-bound compute processes on one vcpu, 
> >> overcommit that
> >> >  vcpu, and see what happens to the processing rate with and without 
> >> steal
> >> >  time accounting.  I'd expect a fairer response with steal time 
> >> accounting.
> >>
> >> Avi,
> >>
> >> There are two things here:
> >> One of them, which is solely the accounting of steal time, (patches 1 to
> >> 4) has absolutely nothing to do with what you said. Its sole purpose is
> >> to provide the user with information about "why is my process slow if I
> >> am using 100 % of my cpu?")
> >
> > Right.  Like irq and softirq time, we need to report this to the user, 
> > as it's potentially much higher.
> 
> Of course, it's not enough to just account for this time, you also have 
> to expose it somewhere, and update tools like top(1) to display it.
Yes, what I meant is that just the accounting will just expose it to the
tools, won't affect the scheduler in any sense.



^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2011-02-02 11:58 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-28 19:52 [PATCH v2 0/6] KVM Steal time, new submission Glauber Costa
2011-01-28 19:52 ` [PATCH v2 1/6] KVM-HDR: KVM Steal time implementation Glauber Costa
2011-01-29  1:28   ` Rik van Riel
2011-01-28 19:52 ` [PATCH v2 2/6] KVM-HV: " Glauber Costa
2011-01-29  1:46   ` Rik van Riel
2011-01-30 13:13   ` Avi Kivity
2011-02-01 15:48     ` Glauber Costa
2011-02-01 17:09       ` Avi Kivity
2011-02-01 19:58         ` Glauber Costa
2011-02-02 10:09           ` Avi Kivity
2011-01-31 11:07   ` Peter Zijlstra
2011-01-28 19:52 ` [PATCH v2 3/6] KVM-GST: KVM Steal time accounting Glauber Costa
2011-01-29  1:16   ` Jeremy Fitzhardinge
2011-01-29  1:27     ` Glauber Costa
2011-01-29  1:53   ` Rik van Riel
2011-01-30 14:04   ` Avi Kivity
2011-01-30 16:45     ` lidong chen
2011-02-01 15:58       ` Glauber Costa
2011-02-01 15:57     ` Glauber Costa
2011-02-02 10:11       ` Avi Kivity
2011-02-02 10:51         ` Avi Kivity
2011-02-02 11:57           ` Glauber Costa
2011-01-28 19:52 ` [PATCH v2 4/6] KVM-GST: KVM Steal time registration Glauber Costa
2011-01-29  2:16   ` Rik van Riel
2011-01-30 13:16   ` Avi Kivity
2011-02-01 15:53     ` Glauber Costa
2011-02-01 16:17       ` Peter Zijlstra
2011-02-01 17:00         ` Glauber Costa
2011-02-01 17:44           ` Peter Zijlstra
2011-02-01 20:20             ` Venkatesh Pallipadi
2011-01-31 11:11   ` Peter Zijlstra
2011-01-28 19:52 ` [PATCH v2 5/6] KVM-GST: adjust scheduler cpu power Glauber Costa
2011-01-31 11:25   ` Peter Zijlstra
2011-01-31 11:27     ` Peter Zijlstra
2011-02-01 15:59     ` Glauber Costa
2011-02-01 16:19       ` Peter Zijlstra
2011-02-01 16:22         ` Glauber Costa
2011-02-01 18:59           ` Peter Zijlstra
2011-02-01 19:55             ` Glauber Costa
2011-02-01 20:04               ` Peter Zijlstra
2011-01-28 19:52 ` [PATCH v2 6/6] Describe KVM_MSR_STEAL_TIME Glauber Costa
2011-01-30 13:19   ` Avi Kivity
2011-02-01 15:54     ` Glauber Costa
2011-02-02 10:14       ` Avi Kivity

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).