* [PATCH v10 00/11] sched: consolidation of CPU capacity and usage
@ 2015-02-27 15:54 Vincent Guittot
  2015-02-27 15:54 ` [PATCH v10 01/11] sched: add utilization_avg_contrib Vincent Guittot
                   ` (12 more replies)
  0 siblings, 13 replies; 68+ messages in thread
From: Vincent Guittot @ 2015-02-27 15:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, dietmar.eggemann, linaro-kernel,
	Vincent Guittot

This patchset consolidates several changes in the capacity and the usage
tracking of the CPU. It provides a frequency invariant metric of the usage of
CPUs and generally improves the accuracy of load/usage tracking in the
scheduler. The frequency invariant metric is the foundation required for the
consolidation of cpufreq and the implementation of fully invariant load
tracking. These are currently WIP and require several changes to the load
balancer (including how it uses and interprets load and capacity metrics) and
extensive validation. The frequency invariance is done with
arch_scale_freq_capacity, and this patchset doesn't provide the backends of
the function, which are architecture dependent.
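
For readers wondering what such a backend might look like, here is a minimal
user-space sketch. It is an assumption, not part of this series: the name
toy_scale_freq_capacity() and its cur_khz/max_khz parameters are hypothetical,
and a real backend would take the (sd, cpu) arguments used in the patches and
obtain the frequencies from the architecture. The only idea shown is a ratio
of current to maximum frequency expressed in the [0..SCHED_CAPACITY_SCALE]
range.

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/*
 * Hypothetical backend: current frequency relative to the maximum one,
 * scaled so that running at max_khz yields SCHED_CAPACITY_SCALE.
 */
static unsigned long toy_scale_freq_capacity(unsigned long cur_khz,
					     unsigned long max_khz)
{
	assert(max_khz > 0 && cur_khz <= max_khz);

	/* Widen before shifting so large kHz values cannot overflow. */
	return (unsigned long)(((unsigned long long)cur_khz
				<< SCHED_CAPACITY_SHIFT) / max_khz);
}
```

A CPU running at half of its maximum frequency would report 512 with this
sketch, and a CPU at its maximum frequency would report 1024.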

As discussed at LPC14, Morten and I have consolidated our changes into a single
patchset to make it easier to review and merge.

During load balance, the scheduler evaluates the number of tasks that a group
of CPUs can handle. The current method assumes that tasks have a fixed load of
SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_CAPACITY_SCALE.
This assumption leads to wrong decisions by creating ghost cores or by
removing real ones when the original capacity of CPUs is different from the
default SCHED_CAPACITY_SCALE. With this patch set, we no longer try to
evaluate the number of available cores based on the group_capacity; instead,
we evaluate the usage of a group and compare it with its capacity.
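
The difference between the two approaches can be sketched in user space as
follows. This is an illustration under assumptions, not the kernel code: the
helper names old_capacity_factor() and group_has_capacity(), the
round-to-nearest division, and the imbalance_pct margin parameter are all
chosen here for the example. Only SCHED_CAPACITY_SCALE comes from the text
above.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE	1024UL

/*
 * Old idea (illustrative): count "cores" by rounding the group capacity
 * to the nearest multiple of SCHED_CAPACITY_SCALE.
 */
static unsigned long old_capacity_factor(unsigned long group_capacity)
{
	return (group_capacity + SCHED_CAPACITY_SCALE / 2) /
		SCHED_CAPACITY_SCALE;
}

/*
 * New idea (illustrative): compare the usage of the group with its
 * capacity directly. imbalance_pct is a percentage margin, e.g. 125.
 */
static int group_has_capacity(unsigned long group_capacity,
			      unsigned long group_usage,
			      unsigned int imbalance_pct)
{
	assert(imbalance_pct >= 100);

	return group_capacity * 100 > group_usage * imbalance_pct;
}
```

With three CPUs of capacity 640 each, the rounded factor is 2, so one real CPU
disappears; with two CPUs of capacity 1441 each, it is 3, so a ghost core
appears. The usage comparison has no such rounding step.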

This patchset mainly replaces the old capacity_factor method with a new one and
keeps the general policy almost unchanged. These new metrics will also be used
in later patches.

The CPU usage is based on a running time tracking version of the current
implementation of the load average tracking. I also have a version that is
based on the new implementation proposal [1], but I haven't provided the
patches and results as [1] is still under review. I can provide changes on top
of [1] to change how CPU usage is computed and to adapt to the new mechanism.

Change since V9
 - add a dedicated patch for removing unused capacity_orig
 - update some comments and fix typo
 - change the condition for actively migrating task on CPU with higher capacity 

Change since V8
 - reorder patches

Change since V7
 - add freq invariance for usage tracking
 - add freq invariance for scale_rt
 - update comments and commits' message
 - fix init of utilization_avg_contrib
 - fix prefer_sibling

Change since V6
 - add group usage tracking
 - fix some commits' messages
 - minor fix like comments and argument order

Change since V5
 - remove patches that have been merged since v5 : patches 01, 02, 03, 04, 05, 07
 - update commit log and add more details on the purpose of the patches
 - fix/remove useless code with the rebase on patchset [2]
 - remove capacity_orig in sched_group_capacity as it is not used
 - move code in the right patch
 - add some helper functions to factor out code

Change since V4
 - rebase to manage conflicts with changes in selection of busiest group

Change since V3:
 - add usage_avg_contrib statistic which sums the running time of tasks on a rq
 - use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization
 - fix replacement of power by capacity
 - update some comments

Change since V2:
 - rebase on top of capacity renaming
 - fix wake_affine statistic update
 - rework nohz_kick_needed
 - optimize the active migration of a task from CPU with reduced capacity
 - rename group_activity by group_utilization and remove unused total_utilization
 - repair SD_PREFER_SIBLING and use it for SMT level
 - reorder patchset to gather patches with same topics

Change since V1:
 - add 3 fixes
 - correct some commit messages
 - replace capacity computation by activity
 - take into account current cpu capacity

[1] https://lkml.org/lkml/2014/10/10/131
[2] https://lkml.org/lkml/2014/7/25/589

Morten Rasmussen (2):
  sched: Track group sched_entity usage contributions
  sched: Make sched entity usage tracking scale-invariant

Vincent Guittot (9):
  sched: add utilization_avg_contrib
  sched: remove frequency scaling from cpu_capacity
  sched: make scale_rt invariant with frequency
  sched: add per rq cpu_capacity_orig
  sched: get CPU's usage statistic
  sched: replace capacity_factor by usage
  sched; remove unused capacity_orig from
  sched: add SD_PREFER_SIBLING for SMT level
  sched: move cfs task on a CPU with higher capacity

 include/linux/sched.h |  21 ++-
 kernel/sched/core.c   |  15 +--
 kernel/sched/debug.c  |  12 +-
 kernel/sched/fair.c   | 366 +++++++++++++++++++++++++++++++-------------------
 kernel/sched/sched.h  |  15 ++-
 5 files changed, 271 insertions(+), 158 deletions(-)

-- 
1.9.1



* [PATCH v10 01/11] sched: add utilization_avg_contrib
  2015-02-27 15:54 [PATCH v10 00/11] sched: consolidation of CPU capacity and usage Vincent Guittot
@ 2015-02-27 15:54 ` Vincent Guittot
  2015-03-27 11:40   ` [tip:sched/core] sched: Add sched_avg::utilization_avg_contrib tip-bot for Vincent Guittot
  2015-02-27 15:54 ` [PATCH v10 02/11] sched: Track group sched_entity usage contributions Vincent Guittot
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2015-02-27 15:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, dietmar.eggemann, linaro-kernel,
	Vincent Guittot, Paul Turner, Ben Segall

Add new statistics which reflect the average time a task is running on the CPU
and the sum of these running times for the tasks on a runqueue. The latter is
named utilization_load_avg.

This patch is based on the usage metric that was proposed in the first
versions of the per-entity load tracking patchset by Paul Turner
<pjt@google.com> but that has been removed afterwards. This version differs
from the original one in the sense that it's not linked to task_group.

The rq's utilization_load_avg will be used to check if a rq is overloaded or
not instead of trying to compute how many tasks a group of CPUs can handle.
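
As a rough user-space illustration of the two statistics, the sketch below
mirrors the ratio used by __update_task_entity_utilization() in the diff
(running_avg_sum scaled by SCHED_LOAD_SCALE and divided by avg_period + 1) and
then sums the per-entity values as a runqueue would. The function names
utilization_contrib() and rq_utilization() are made up for this example, and
the 64-bit intermediate is a simplification; the kernel code relies on the
bounded range of the sums instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCHED_LOAD_SCALE	1024U

/* Fraction of time spent running, in the range [0..SCHED_LOAD_SCALE]. */
static uint32_t utilization_contrib(uint32_t running_avg_sum,
				    uint32_t avg_period)
{
	uint64_t contrib;

	/* An entity cannot run for longer than the tracked period. */
	assert(running_avg_sum <= avg_period);

	contrib = (uint64_t)running_avg_sum * SCHED_LOAD_SCALE;
	return (uint32_t)(contrib / ((uint64_t)avg_period + 1));
}

/* Runqueue-level value: the sum of the contributions of its entities. */
static unsigned long rq_utilization(const uint32_t *contrib, size_t nr)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		sum += contrib[i];

	return sum;
}
```

A task that ran for the whole tracked period gets a value just below
SCHED_LOAD_SCALE because of the "+ 1" in the divisor, and a runqueue whose sum
approaches the capacity of its CPU can be treated as fully used.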

Rename runnable_avg_period to avg_period as it is now used with both
runnable_avg_sum and running_avg_sum.

Add some descriptions of the variables to explain their differences.

cc: Paul Turner <pjt@google.com>
cc: Ben Segall <bsegall@google.com>

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 include/linux/sched.h | 21 ++++++++++++---
 kernel/sched/debug.c  | 10 ++++---
 kernel/sched/fair.c   | 74 ++++++++++++++++++++++++++++++++++++++++-----------
 kernel/sched/sched.h  |  8 +++++-
 4 files changed, 89 insertions(+), 24 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cb5cdc7..adc6278 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1115,15 +1115,28 @@ struct load_weight {
 };
 
 struct sched_avg {
+	u64 last_runnable_update;
+	s64 decay_count;
+	/*
+	 * utilization_avg_contrib describes the amount of time that a
+	 * sched_entity is running on a CPU. It is based on running_avg_sum
+	 * and is scaled in the range [0..SCHED_LOAD_SCALE].
+	 * load_avg_contrib described the amount of time that a sched_entity
+	 * is runnable on a rq. It is based on both runnable_avg_sum and the
+	 * weight of the task.
+	 */
+	unsigned long load_avg_contrib, utilization_avg_contrib;
 	/*
 	 * These sums represent an infinite geometric series and so are bound
 	 * above by 1024/(1-y).  Thus we only need a u32 to store them for all
 	 * choices of y < 1-2^(-32)*1024.
+	 * running_avg_sum reflects the time that the sched_entity is
+	 * effectively running on the CPU.
+	 * runnable_avg_sum represents the amount of time a sched_entity is on
+	 * a runqueue which includes the running time that is monitored by
+	 * running_avg_sum.
 	 */
-	u32 runnable_avg_sum, runnable_avg_period;
-	u64 last_runnable_update;
-	s64 decay_count;
-	unsigned long load_avg_contrib;
+	u32 runnable_avg_sum, avg_period, running_avg_sum;
 };
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8baaf85..578ff83 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -71,7 +71,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	if (!se) {
 		struct sched_avg *avg = &cpu_rq(cpu)->avg;
 		P(avg->runnable_avg_sum);
-		P(avg->runnable_avg_period);
+		P(avg->avg_period);
 		return;
 	}
 
@@ -94,7 +94,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->load.weight);
 #ifdef CONFIG_SMP
 	P(se->avg.runnable_avg_sum);
-	P(se->avg.runnable_avg_period);
+	P(se->avg.avg_period);
 	P(se->avg.load_avg_contrib);
 	P(se->avg.decay_count);
 #endif
@@ -214,6 +214,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "blocked_load_avg",
 			cfs_rq->blocked_load_avg);
+	SEQ_printf(m, "  .%-30s: %ld\n", "utilization_load_avg",
+			cfs_rq->utilization_load_avg);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_contrib",
 			cfs_rq->tg_load_contrib);
@@ -636,8 +638,10 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.load.weight);
 #ifdef CONFIG_SMP
 	P(se.avg.runnable_avg_sum);
-	P(se.avg.runnable_avg_period);
+	P(se.avg.running_avg_sum);
+	P(se.avg.avg_period);
 	P(se.avg.load_avg_contrib);
+	P(se.avg.utilization_avg_contrib);
 	P(se.avg.decay_count);
 #endif
 	P(policy);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee595ef..414408dd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -670,6 +670,7 @@ static int select_idle_sibling(struct task_struct *p, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
 
 static inline void __update_task_entity_contrib(struct sched_entity *se);
+static inline void __update_task_entity_utilization(struct sched_entity *se);
 
 /* Give new task start runnable values to heavy its load in infant time */
 void init_task_runnable_average(struct task_struct *p)
@@ -677,9 +678,10 @@ void init_task_runnable_average(struct task_struct *p)
 	u32 slice;
 
 	slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
-	p->se.avg.runnable_avg_sum = slice;
-	p->se.avg.runnable_avg_period = slice;
+	p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice;
+	p->se.avg.avg_period = slice;
 	__update_task_entity_contrib(&p->se);
+	__update_task_entity_utilization(&p->se);
 }
 #else
 void init_task_runnable_average(struct task_struct *p)
@@ -1684,7 +1686,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
 		*period = now - p->last_task_numa_placement;
 	} else {
 		delta = p->se.avg.runnable_avg_sum;
-		*period = p->se.avg.runnable_avg_period;
+		*period = p->se.avg.avg_period;
 	}
 
 	p->last_sum_exec_runtime = runtime;
@@ -2512,7 +2514,8 @@ static u32 __compute_runnable_contrib(u64 n)
  */
 static __always_inline int __update_entity_runnable_avg(u64 now,
 							struct sched_avg *sa,
-							int runnable)
+							int runnable,
+							int running)
 {
 	u64 delta, periods;
 	u32 runnable_contrib;
@@ -2538,7 +2541,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	sa->last_runnable_update = now;
 
 	/* delta_w is the amount already accumulated against our next period */
-	delta_w = sa->runnable_avg_period % 1024;
+	delta_w = sa->avg_period % 1024;
 	if (delta + delta_w >= 1024) {
 		/* period roll-over */
 		decayed = 1;
@@ -2551,7 +2554,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		delta_w = 1024 - delta_w;
 		if (runnable)
 			sa->runnable_avg_sum += delta_w;
-		sa->runnable_avg_period += delta_w;
+		if (running)
+			sa->running_avg_sum += delta_w;
+		sa->avg_period += delta_w;
 
 		delta -= delta_w;
 
@@ -2561,20 +2566,26 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 
 		sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
 						  periods + 1);
-		sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
+		sa->running_avg_sum = decay_load(sa->running_avg_sum,
+						  periods + 1);
+		sa->avg_period = decay_load(sa->avg_period,
 						     periods + 1);
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		runnable_contrib = __compute_runnable_contrib(periods);
 		if (runnable)
 			sa->runnable_avg_sum += runnable_contrib;
-		sa->runnable_avg_period += runnable_contrib;
+		if (running)
+			sa->running_avg_sum += runnable_contrib;
+		sa->avg_period += runnable_contrib;
 	}
 
 	/* Remainder of delta accrued against u_0` */
 	if (runnable)
 		sa->runnable_avg_sum += delta;
-	sa->runnable_avg_period += delta;
+	if (running)
+		sa->running_avg_sum += delta;
+	sa->avg_period += delta;
 
 	return decayed;
 }
@@ -2591,6 +2602,8 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
 		return 0;
 
 	se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
+	se->avg.utilization_avg_contrib =
+		decay_load(se->avg.utilization_avg_contrib, decays);
 
 	return decays;
 }
@@ -2626,7 +2639,7 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
 
 	/* The fraction of a cpu used by this cfs_rq */
 	contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
-			  sa->runnable_avg_period + 1);
+			  sa->avg_period + 1);
 	contrib -= cfs_rq->tg_runnable_contrib;
 
 	if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
@@ -2679,7 +2692,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
+	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
+			runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
@@ -2697,7 +2711,7 @@ static inline void __update_task_entity_contrib(struct sched_entity *se)
 
 	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
 	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
-	contrib /= (se->avg.runnable_avg_period + 1);
+	contrib /= (se->avg.avg_period + 1);
 	se->avg.load_avg_contrib = scale_load(contrib);
 }
 
@@ -2716,6 +2730,27 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
 	return se->avg.load_avg_contrib - old_contrib;
 }
 
+
+static inline void __update_task_entity_utilization(struct sched_entity *se)
+{
+	u32 contrib;
+
+	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
+	contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
+	contrib /= (se->avg.avg_period + 1);
+	se->avg.utilization_avg_contrib = scale_load(contrib);
+}
+
+static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
+{
+	long old_contrib = se->avg.utilization_avg_contrib;
+
+	if (entity_is_task(se))
+		__update_task_entity_utilization(se);
+
+	return se->avg.utilization_avg_contrib - old_contrib;
+}
+
 static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
 						 long load_contrib)
 {
@@ -2732,7 +2767,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	long contrib_delta;
+	long contrib_delta, utilization_delta;
 	u64 now;
 
 	/*
@@ -2744,18 +2779,22 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	else
 		now = cfs_rq_clock_task(group_cfs_rq(se));
 
-	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
+	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
+					cfs_rq->curr == se))
 		return;
 
 	contrib_delta = __update_entity_load_avg_contrib(se);
+	utilization_delta = __update_entity_utilization_avg_contrib(se);
 
 	if (!update_cfs_rq)
 		return;
 
-	if (se->on_rq)
+	if (se->on_rq) {
 		cfs_rq->runnable_load_avg += contrib_delta;
-	else
+		cfs_rq->utilization_load_avg += utilization_delta;
+	} else {
 		subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+	}
 }
 
 /*
@@ -2830,6 +2869,7 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 	}
 
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+	cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib;
 	/* we force update consideration on load-balancer moves */
 	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
 }
@@ -2848,6 +2888,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 	update_cfs_rq_blocked_load(cfs_rq, !sleep);
 
 	cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+	cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
 	if (sleep) {
 		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
 		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
@@ -3185,6 +3226,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 */
 		update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
+		update_entity_load_avg(se, 1);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dc0f435..65fa7b5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -362,8 +362,14 @@ struct cfs_rq {
 	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
 	 * This allows for the description of both thread and group usage (in
 	 * the FAIR_GROUP_SCHED case).
+	 * runnable_load_avg is the sum of the load_avg_contrib of the
+	 * sched_entities on the rq.
+	 * blocked_load_avg is similar to runnable_load_avg except that its
+	 * the blocked sched_entities on the rq.
+	 * utilization_load_avg is the sum of the average running time of the
+	 * sched_entities on the rq.
 	 */
-	unsigned long runnable_load_avg, blocked_load_avg;
+	unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
 	atomic64_t decay_counter;
 	u64 last_decay;
 	atomic_long_t removed_load;
-- 
1.9.1



* [PATCH v10 02/11] sched: Track group sched_entity usage contributions
  2015-02-27 15:54 [PATCH v10 00/11] sched: consolidation of CPU capacity and usage Vincent Guittot
  2015-02-27 15:54 ` [PATCH v10 01/11] sched: add utilization_avg_contrib Vincent Guittot
@ 2015-02-27 15:54 ` Vincent Guittot
  2015-03-27 11:40   ` [tip:sched/core] " tip-bot for Morten Rasmussen
  2015-02-27 15:54 ` [PATCH v10 03/11] sched: remove frequency scaling from cpu_capacity Vincent Guittot
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2015-02-27 15:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, dietmar.eggemann, linaro-kernel,
	Morten Rasmussen, Paul Turner, Ben Segall, Vincent Guittot

From: Morten Rasmussen <morten.rasmussen@arm.com>

Add usage contribution tracking for group entities. Unlike
se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
entities is the sum of se->avg.utilization_avg_contrib for all entities on the
group runqueue. It is _not_ influenced in any way by the task group
h_load. Hence it represents the actual cpu usage of the group, not
its intended load contribution, which may differ significantly from the
utilization on lightly utilized systems.
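
The rule can be pictured with a small user-space model. The toy_* structures
and toy_update_utilization() below are stand-ins invented for this sketch;
they only keep the fields needed to show that a group entity copies the
utilization_load_avg of the runqueue it owns, while a task entity uses its own
freshly computed value.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; names are illustrative only. */
struct toy_cfs_rq {
	unsigned long utilization_load_avg;	/* sum over queued entities */
};

struct toy_entity {
	unsigned long utilization_avg_contrib;
	struct toy_cfs_rq *my_q;	/* NULL for a task, set for a group */
};

/*
 * Refresh the contribution of one entity and return the change, which the
 * caller would add to the utilization_load_avg of the parent runqueue.
 * task_contrib is the newly computed value for a task entity and is
 * ignored for a group entity.
 */
static long toy_update_utilization(struct toy_entity *se,
				   unsigned long task_contrib)
{
	long old_contrib;

	assert(se != NULL);
	old_contrib = (long)se->utilization_avg_contrib;

	if (se->my_q == NULL)
		se->utilization_avg_contrib = task_contrib;
	else
		se->utilization_avg_contrib = se->my_q->utilization_load_avg;

	return (long)se->utilization_avg_contrib - old_contrib;
}
```

No weight or h_load term appears anywhere in the group branch, which is the
point made above: the value is a plain sum of running time.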

cc: Paul Turner <pjt@google.com>
cc: Ben Segall <bsegall@google.com>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/debug.c | 2 ++
 kernel/sched/fair.c  | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 578ff83..a245c1f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -94,8 +94,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->load.weight);
 #ifdef CONFIG_SMP
 	P(se->avg.runnable_avg_sum);
+	P(se->avg.running_avg_sum);
 	P(se->avg.avg_period);
 	P(se->avg.load_avg_contrib);
+	P(se->avg.utilization_avg_contrib);
 	P(se->avg.decay_count);
 #endif
 #undef PN
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 414408dd..d94a865 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2747,6 +2747,9 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
 
 	if (entity_is_task(se))
 		__update_task_entity_utilization(se);
+	else
+		se->avg.utilization_avg_contrib =
+					group_cfs_rq(se)->utilization_load_avg;
 
 	return se->avg.utilization_avg_contrib - old_contrib;
 }
-- 
1.9.1



* [PATCH v10 03/11] sched: remove frequency scaling from cpu_capacity
  2015-02-27 15:54 [PATCH v10 00/11] sched: consolidation of CPU capacity and usage Vincent Guittot
  2015-02-27 15:54 ` [PATCH v10 01/11] sched: add utilization_avg_contrib Vincent Guittot
  2015-02-27 15:54 ` [PATCH v10 02/11] sched: Track group sched_entity usage contributions Vincent Guittot
@ 2015-02-27 15:54 ` Vincent Guittot
  2015-03-27 11:40   ` [tip:sched/core] sched: Remove " tip-bot for Vincent Guittot
  2015-02-27 15:54 ` [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant Vincent Guittot
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2015-02-27 15:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, dietmar.eggemann, linaro-kernel,
	Vincent Guittot

Now that arch_scale_cpu_capacity has been introduced to scale the original
capacity, the arch_scale_freq_capacity is no longer used (it was
previously used by ARM arch). Remove arch_scale_freq_capacity from the
computation of cpu_capacity. The frequency invariance will be handled in the
load tracking and not in the CPU capacity. arch_scale_freq_capacity will be
revisited for scaling load with the current frequency of the CPUs in a later
patch.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d94a865..e54231f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6042,13 +6042,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 
 	sdg->sgc->capacity_orig = capacity;
 
-	if (sched_feat(ARCH_CAPACITY))
-		capacity *= arch_scale_freq_capacity(sd, cpu);
-	else
-		capacity *= default_scale_capacity(sd, cpu);
-
-	capacity >>= SCHED_CAPACITY_SHIFT;
-
 	capacity *= scale_rt_capacity(cpu);
 	capacity >>= SCHED_CAPACITY_SHIFT;
 
-- 
1.9.1



* [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant
  2015-02-27 15:54 [PATCH v10 00/11] sched: consolidation of CPU capacity and usage Vincent Guittot
                   ` (2 preceding siblings ...)
  2015-02-27 15:54 ` [PATCH v10 03/11] sched: remove frequency scaling from cpu_capacity Vincent Guittot
@ 2015-02-27 15:54 ` Vincent Guittot
  2015-03-03 12:51   ` Dietmar Eggemann
                     ` (2 more replies)
  2015-02-27 15:54 ` [PATCH v10 05/11] sched: make scale_rt invariant with frequency Vincent Guittot
                   ` (8 subsequent siblings)
  12 siblings, 3 replies; 68+ messages in thread
From: Vincent Guittot @ 2015-02-27 15:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, dietmar.eggemann, linaro-kernel,
	Morten Rasmussen, Paul Turner, Ben Segall, Vincent Guittot

From: Morten Rasmussen <morten.rasmussen@arm.com>

Apply frequency scale-invariance correction factor to usage tracking.
Each segment of the running_avg_sum geometric series is now scaled by the
current frequency so the utilization_avg_contrib of each entity will be
invariant with frequency scaling. As a result, utilization_load_avg, which is
the sum of utilization_avg_contrib, becomes invariant too. So the usage level
that is returned by get_cpu_usage stays relative to the max frequency, as does
the cpu_capacity which it is compared against.

Then, we want to keep the load tracking values in a 32-bit type, which implies
that the max value of {runnable|running}_avg_sum must be lower than
2^32/88761=48388 (88761 is the max weight of a task). As LOAD_AVG_MAX = 47742,
arch_scale_freq_capacity must return a value less than
(48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_CAPACITY_SCALE = 1024).
So we define the range to [0..SCHED_CAPACITY_SCALE] in order to avoid overflow.
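
The arithmetic behind this bound can be checked with a short user-space
sketch. The helper names and the MAX_TASK_WEIGHT macro are chosen here for
illustration; the constants 88761, 47742 and the shift of 10 are the ones
quoted above. The shift is applied before the division so that the integer
result matches the 1037 given in the text.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define LOAD_AVG_MAX		47742U
#define MAX_TASK_WEIGHT		88761U	/* max weight of a task, per the text */

/* Largest sum that still fits in 32 bits once multiplied by the weight. */
static uint32_t max_avg_sum(void)
{
	return (uint32_t)((UINT64_C(1) << 32) / MAX_TASK_WEIGHT);
}

/* Upper bound on the frequency scale factor that avoids overflow. */
static uint32_t max_scale_freq(void)
{
	return (uint32_t)(((uint64_t)max_avg_sum() << SCHED_CAPACITY_SHIFT) /
			  LOAD_AVG_MAX);
}

/* Scale one segment of running time by the current frequency factor. */
static uint32_t scale_segment(uint32_t delta, uint32_t scale_freq)
{
	assert(scale_freq <= (1U << SCHED_CAPACITY_SHIFT));

	return (uint32_t)(((uint64_t)delta * scale_freq) >>
			  SCHED_CAPACITY_SHIFT);
}
```

Restricting the factor to 1024 rather than 1037 keeps a small safety margin
and makes a CPU running at its maximum frequency a no-op for the scaling.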

cc: Paul Turner <pjt@google.com>
cc: Ben Segall <bsegall@google.com>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e54231f..7f031e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2484,6 +2484,8 @@ static u32 __compute_runnable_contrib(u64 n)
 	return contrib + runnable_avg_yN_sum[n];
 }
 
+unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
@@ -2512,7 +2514,7 @@ static u32 __compute_runnable_contrib(u64 n)
  *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
  *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
  */
-static __always_inline int __update_entity_runnable_avg(u64 now,
+static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 							struct sched_avg *sa,
 							int runnable,
 							int running)
@@ -2520,6 +2522,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	u64 delta, periods;
 	u32 runnable_contrib;
 	int delta_w, decayed = 0;
+	unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
 
 	delta = now - sa->last_runnable_update;
 	/*
@@ -2555,7 +2558,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		if (runnable)
 			sa->runnable_avg_sum += delta_w;
 		if (running)
-			sa->running_avg_sum += delta_w;
+			sa->running_avg_sum += delta_w * scale_freq
+				>> SCHED_CAPACITY_SHIFT;
 		sa->avg_period += delta_w;
 
 		delta -= delta_w;
@@ -2576,7 +2580,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		if (runnable)
 			sa->runnable_avg_sum += runnable_contrib;
 		if (running)
-			sa->running_avg_sum += runnable_contrib;
+			sa->running_avg_sum += runnable_contrib * scale_freq
+				>> SCHED_CAPACITY_SHIFT;
 		sa->avg_period += runnable_contrib;
 	}
 
@@ -2584,7 +2589,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	if (runnable)
 		sa->runnable_avg_sum += delta;
 	if (running)
-		sa->running_avg_sum += delta;
+		sa->running_avg_sum += delta * scale_freq
+			>> SCHED_CAPACITY_SHIFT;
 	sa->avg_period += delta;
 
 	return decayed;
@@ -2692,8 +2698,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
-			runnable);
+	__update_entity_runnable_avg(rq_clock_task(rq), cpu_of(rq), &rq->avg,
+			runnable, runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
@@ -2771,6 +2777,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	long contrib_delta, utilization_delta;
+	int cpu = cpu_of(rq_of(cfs_rq));
 	u64 now;
 
 	/*
@@ -2782,7 +2789,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	else
 		now = cfs_rq_clock_task(group_cfs_rq(se));
 
-	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
+	if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq,
 					cfs_rq->curr == se))
 		return;
 
-- 
1.9.1



* [PATCH v10 05/11] sched: make scale_rt invariant with frequency
  2015-02-27 15:54 [PATCH v10 00/11] sched: consolidation of CPU capacity and usage Vincent Guittot
                   ` (3 preceding siblings ...)
  2015-02-27 15:54 ` [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant Vincent Guittot
@ 2015-02-27 15:54 ` Vincent Guittot
  2015-03-27 11:41   ` [tip:sched/core] sched: Make " tip-bot for Vincent Guittot
  2015-02-27 15:54 ` [PATCH v10 06/11] sched: add per rq cpu_capacity_orig Vincent Guittot
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2015-02-27 15:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, dietmar.eggemann, linaro-kernel,
	Vincent Guittot

The average running time of RT tasks is used to estimate the remaining compute
capacity for CFS tasks. This remaining capacity is the original capacity scaled
down by a factor (aka scale_rt_capacity). This estimation of available capacity
must also be invariant with frequency scaling.

A frequency scaling factor is applied on the running time of the RT tasks for
computing scale_rt_capacity.

In sched_rt_avg_update, we now scale the RT execution time like below:
rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT

Then, scale_rt_capacity can be summarized by:
scale_rt_capacity = SCHED_CAPACITY_SCALE * available / total
with available = total - rq->rt_avg

This has been optimized in the current code as
scale_rt_capacity = available / (total >> SCHED_CAPACITY_SHIFT)

But we can also expand the equation as below
scale_rt_capacity = SCHED_CAPACITY_SCALE -
		((rq->rt_avg << SCHED_CAPACITY_SHIFT) / total)

and we can optimize the equation by removing SCHED_CAPACITY_SHIFT shift in
the computation of rq->rt_avg and scale_rt_capacity

so rq->rt_avg += rt_delta * arch_scale_freq_capacity()
and
scale_rt_capacity = SCHED_CAPACITY_SCALE - (rq->rt_avg / total)

arch_scale_freq_capacity will be called in the hot path of the scheduler, so
it must be a short and efficient function. As an example,
arch_scale_freq_capacity should return a cached value that is updated
periodically outside of the hot path.
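
The final form of the two equations can be tried out in user space with the
sketch below. The toy_* names are invented for this example and the frequency
factor is passed in as a plain parameter; the saturation to 1 when the RT
usage reaches SCHED_CAPACITY_SCALE follows the diff in this patch.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/* Accumulate RT running time, pre-multiplied by the frequency factor. */
static void toy_rt_avg_update(uint64_t *rt_avg, uint64_t rt_delta,
			      unsigned long scale_freq)
{
	assert(scale_freq <= SCHED_CAPACITY_SCALE);

	*rt_avg += rt_delta * scale_freq;
}

/* Capacity left for CFS: SCHED_CAPACITY_SCALE - (rt_avg / total). */
static unsigned long toy_scale_rt_capacity(uint64_t rt_avg, uint64_t total)
{
	uint64_t used;

	assert(total > 0);
	used = rt_avg / total;

	if (used < SCHED_CAPACITY_SCALE)
		return SCHED_CAPACITY_SCALE - (unsigned long)used;

	/* Never report zero capacity. */
	return 1;
}
```

Because rt_avg already carries the SCHED_CAPACITY_SCALE factor, a single
division is enough: 250 units of RT time out of 1000 at full frequency leave
768, and the same time at half frequency leaves 896.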

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c  | 17 +++++------------
 kernel/sched/sched.h |  4 +++-
 2 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f031e4..dc7c693 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6004,7 +6004,7 @@ unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	u64 total, available, age_stamp, avg;
+	u64 total, used, age_stamp, avg;
 	s64 delta;
 
 	/*
@@ -6020,19 +6020,12 @@ static unsigned long scale_rt_capacity(int cpu)
 
 	total = sched_avg_period() + delta;
 
-	if (unlikely(total < avg)) {
-		/* Ensures that capacity won't end up being negative */
-		available = 0;
-	} else {
-		available = total - avg;
-	}
+	used = div_u64(avg, total);
 
-	if (unlikely((s64)total < SCHED_CAPACITY_SCALE))
-		total = SCHED_CAPACITY_SCALE;
+	if (likely(used < SCHED_CAPACITY_SCALE))
+		return SCHED_CAPACITY_SCALE - used;
 
-	total >>= SCHED_CAPACITY_SHIFT;
-
-	return div_u64(available, total);
+	return 1;
 }
 
 static void update_cpu_capacity(struct sched_domain *sd, int cpu)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 65fa7b5..23c6dd7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1374,9 +1374,11 @@ static inline int hrtick_enabled(struct rq *rq)
 
 #ifdef CONFIG_SMP
 extern void sched_avg_update(struct rq *rq);
+extern unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
-	rq->rt_avg += rt_delta;
+	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
 	sched_avg_update(rq);
 }
 #else
-- 
1.9.1


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v10 06/11] sched: add per rq cpu_capacity_orig
  2015-02-27 15:54 [PATCH v10 00/11] sched: consolidation of CPU capacity and usage Vincent Guittot
                   ` (4 preceding siblings ...)
  2015-02-27 15:54 ` [PATCH v10 05/11] sched: make scale_rt invariant with frequency Vincent Guittot
@ 2015-02-27 15:54 ` Vincent Guittot
  2015-03-27 11:41   ` [tip:sched/core] sched: Add struct rq::cpu_capacity_orig tip-bot for Vincent Guittot
  2015-02-27 15:54 ` [PATCH v10 07/11] sched: get CPU's usage statistic Vincent Guittot
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2015-02-27 15:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, dietmar.eggemann, linaro-kernel,
	Vincent Guittot

This new field cpu_capacity_orig reflects the original capacity of a CPU
before being altered by RT tasks and/or IRQs.

The cpu_capacity_orig will be used:
- to detect when the capacity of a CPU has been noticeably reduced so we can
  trigger a load balance to look for a CPU with better capacity. As an example,
  we can detect when a CPU handles a significant amount of IRQs
  (with CONFIG_IRQ_TIME_ACCOUNTING) but this CPU is seen as an idle CPU by
  the scheduler whereas CPUs, which are really idle, are available.
- to evaluate the available capacity for CFS tasks.
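
The relation between the two fields can be sketched in user space as below.
This is only an illustration of what update_cpu_capacity does with the new
field: struct rq_sketch and the helper names are invented for the example,
and rt_scale stands for the value returned by scale_rt_capacity.

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1UL << SCHED_CAPACITY_SHIFT)

/* Hypothetical stand-in for the two capacity fields of struct rq. */
struct rq_sketch {
	unsigned long cpu_capacity;		/* left for CFS tasks */
	unsigned long cpu_capacity_orig;	/* before RT/IRQ pressure */
};

/*
 * arch_capacity: capacity of the CPU after architecture scaling.
 * rt_scale: share left by RT tasks and IRQs, in [1..SCHED_CAPACITY_SCALE].
 */
static void sketch_update_cpu_capacity(struct rq_sketch *rq,
				       unsigned long arch_capacity,
				       unsigned long rt_scale)
{
	assert(rt_scale <= SCHED_CAPACITY_SCALE);
	rq->cpu_capacity_orig = arch_capacity;
	rq->cpu_capacity = (arch_capacity * rt_scale) >> SCHED_CAPACITY_SHIFT;
}

/* Capacity taken away from CFS tasks by RT tasks and IRQs. */
static unsigned long sketch_capacity_lost(const struct rq_sketch *rq)
{
	return rq->cpu_capacity_orig - rq->cpu_capacity;
}
```

For a CPU of capacity 1024 where RT tasks and IRQs leave 768/1024 of the
time, cpu_capacity_orig stays at 1024 while cpu_capacity drops to 768.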

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c  | 2 +-
 kernel/sched/fair.c  | 8 +++++++-
 kernel/sched/sched.h | 1 +
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 97fe79c..28e3ec2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7216,7 +7216,7 @@ void __init sched_init(void)
 #ifdef CONFIG_SMP
 		rq->sd = NULL;
 		rq->rd = NULL;
-		rq->cpu_capacity = SCHED_CAPACITY_SCALE;
+		rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
 		rq->post_schedule = 0;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dc7c693..10f84c3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4363,6 +4363,11 @@ static unsigned long capacity_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity;
 }
 
+static unsigned long capacity_orig_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_capacity_orig;
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -6040,6 +6045,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 
 	capacity >>= SCHED_CAPACITY_SHIFT;
 
+	cpu_rq(cpu)->cpu_capacity_orig = capacity;
 	sdg->sgc->capacity_orig = capacity;
 
 	capacity *= scale_rt_capacity(cpu);
@@ -6094,7 +6100,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 			 * Runtime updates will correct capacity_orig.
 			 */
 			if (unlikely(!rq->sd)) {
-				capacity_orig += capacity_of(cpu);
+				capacity_orig += capacity_orig_of(cpu);
 				capacity += capacity_of(cpu);
 				continue;
 			}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 23c6dd7..9f06d24 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -603,6 +603,7 @@ struct rq {
 	struct sched_domain *sd;
 
 	unsigned long cpu_capacity;
+	unsigned long cpu_capacity_orig;
 
 	unsigned char idle_balance;
 	/* For active balancing */
-- 
1.9.1


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v10 07/11] sched: get CPU's usage statistic
  2015-02-27 15:54 [PATCH v10 00/11] sched: consolidation of CPU capacity and usage Vincent Guittot
                   ` (5 preceding siblings ...)
  2015-02-27 15:54 ` [PATCH v10 06/11] sched: add per rq cpu_capacity_orig Vincent Guittot
@ 2015-02-27 15:54 ` Vincent Guittot
  2015-03-03 12:47   ` Dietmar Eggemann
                     ` (2 more replies)
  2015-02-27 15:54 ` [PATCH v10 08/11] sched: replace capacity_factor by usage Vincent Guittot
                   ` (5 subsequent siblings)
  12 siblings, 3 replies; 68+ messages in thread
From: Vincent Guittot @ 2015-02-27 15:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, dietmar.eggemann, linaro-kernel,
	Vincent Guittot

Monitor the usage level of each group of each sched_domain level. The usage is
the portion of cpu_capacity_orig that is currently used on a CPU or group of
CPUs. We use the utilization_load_avg to evaluate the usage level of each
group.

The utilization_load_avg only takes into account the running time of the CFS
tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
utilized. Nevertheless, we must cap utilization_load_avg, which can be
temporarily greater than SCHED_LOAD_SCALE after the migration of a task onto
this CPU and until the metrics are stabilized.

The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the
running load on the CPU whereas the available capacity for the CFS tasks is in
the range [0..cpu_capacity_orig]. In order to test if a CPU is fully utilized
by CFS tasks, we have to scale the utilization into the cpu_capacity_orig range
of the CPU to get the usage of the latter. The usage can then be compared with
the available capacity (i.e. cpu_capacity) to deduce the usage level of a CPU.
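
The scaling and the cap can be tried outside of the kernel with the sketch
below. It follows the arithmetic of get_cpu_usage from this patch but takes
plain values instead of a CPU number; the name sketch_cpu_usage is made up,
and SCHED_LOAD_SHIFT = 10 is an assumption (default load resolution).

```c
#include <assert.h>

/* Assumed value: SCHED_LOAD_SCALE of 1024. */
#define SCHED_LOAD_SHIFT 10
#define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT)

/*
 * utilization: cfs.utilization_load_avg, nominally [0..SCHED_LOAD_SCALE]
 * but possibly above it for a short time after a migration.
 * capacity_orig: cpu_capacity_orig of the CPU.
 * Returns the usage in capacity units, capped to capacity_orig.
 */
static unsigned long sketch_cpu_usage(unsigned long utilization,
				      unsigned long capacity_orig)
{
	assert(capacity_orig > 0);

	if (utilization >= SCHED_LOAD_SCALE)
		return capacity_orig;

	return (utilization * capacity_orig) >> SCHED_LOAD_SHIFT;
}
```

A CPU with cpu_capacity_orig = 640 and a utilization of 512 (50%) has a usage
of 320. A CPU of capacity 1024 with a transient utilization of 1239 (about
121%) is reported as 1024, so that adding a second CPU at 819 (about 80%)
gives 1843, below the 2048 of group capacity, instead of 2058.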

The frequency scaling invariance of the usage is not taken into account in this
patch; it will be solved in another patch which will deal with frequency
scaling invariance on the running_load_avg.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 10f84c3..faf61a2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4781,6 +4781,33 @@ static int select_idle_sibling(struct task_struct *p, int target)
 done:
 	return target;
 }
+/*
+ * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
+ * tasks. The unit of the return value must be capacity so we can compare the
+ * usage with the capacity of the CPU that is available for CFS task (ie
+ * cpu_capacity).
+ * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
+ * CPU. It represents the amount of utilization of a CPU in the range
+ * [0..SCHED_LOAD_SCALE].  The usage of a CPU can't be higher than the full
+ * capacity of the CPU because it's about the running time on this CPU.
+ * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
+ * because of unfortunate rounding in avg_period and running_load_avg or just
+ * after migrating tasks until the average stabilizes with the new running
+ * time. So we need to check that the usage stays into the range
+ * [0..cpu_capacity_orig] and cap if necessary.
+ * Without capping the usage, a group could be seen as overloaded (CPU0 usage
+ * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity.
+ */
+static int get_cpu_usage(int cpu)
+{
+	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+	unsigned long capacity = capacity_orig_of(cpu);
+
+	if (usage >= SCHED_LOAD_SCALE)
+		return capacity;
+
+	return (usage * capacity) >> SCHED_LOAD_SHIFT;
+}
 
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
@@ -5907,6 +5934,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_capacity;
+	unsigned long group_usage; /* Total usage of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int group_capacity_factor;
 	unsigned int idle_cpus;
@@ -6255,6 +6283,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);
 
 		sgs->group_load += load;
+		sgs->group_usage += get_cpu_usage(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
 		if (rq->nr_running > 1)
-- 
1.9.1


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v10 08/11] sched: replace capacity_factor by usage
  2015-02-27 15:54 [PATCH v10 00/11] sched: consolidation of CPU capacity and usage Vincent Guittot
                   ` (6 preceding siblings ...)
  2015-02-27 15:54 ` [PATCH v10 07/11] sched: get CPU's usage statistic Vincent Guittot
@ 2015-02-27 15:54 ` Vincent Guittot
  2015-03-27 11:42   ` [tip:sched/core] sched: Replace " tip-bot for Vincent Guittot
  2015-03-27 14:52   ` [PATCH v10 08/11] sched: replace " Xunlei Pang
  2015-02-27 15:54 ` [PATCH v10 09/11] sched; remove unused capacity_orig from Vincent Guittot
                   ` (4 subsequent siblings)
  12 siblings, 2 replies; 68+ messages in thread
From: Vincent Guittot @ 2015-02-27 15:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, dietmar.eggemann, linaro-kernel,
	Vincent Guittot

The scheduler tries to compute how many tasks a group of CPUs can handle by
assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
SCHED_CAPACITY_SCALE. group_capacity_factor divides the capacity of the group
by SCHED_LOAD_SCALE to estimate how many tasks can run in the group. Then, it
compares this value with the sum of nr_running to decide if the group is
overloaded or not. But the group_capacity_factor hardly works for SMT
systems; it sometimes works for big cores but fails to do the right thing for
little cores.

Below are two examples to illustrate the problem that this patch solves:

1 - If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
(640 as an example), a group of 3 CPUs will have a max capacity_factor of 2
(div_round_closest(3x640/1024) = 2) which means that it will be seen as
overloaded even if we have only one task per CPU.

2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE
(1512 as an example), a group of 4 CPUs will have a capacity_factor of 4
(at max and thanks to the fix [0] for SMT systems that prevents the appearance
of ghost CPUs) but if one CPU is fully used by RT tasks (and its capacity is
reduced to nearly nothing), the capacity factor of the group will still be 4
(div_round_closest(3*1512/1024) = 5 which is capped to 4 with [0]).
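
The arithmetic of the first example can be reproduced with the user-space
sketch below. It only models the rounding term quoted above and the
"nr_running > capacity_factor" test; the SMT correction that the removed
sg_capacity_factor() also applies is left out, and the names
sketch_capacity_factor and sketch_old_overloaded are invented for the
example.

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* Round-to-closest division for unsigned values. */
#define SKETCH_DIV_ROUND_CLOSEST(x, d) (((x) + ((d) / 2)) / (d))

/* Upper bound of the old capacity factor for a given group capacity. */
static unsigned long sketch_capacity_factor(unsigned long group_capacity)
{
	assert(group_capacity > 0);
	return SKETCH_DIV_ROUND_CLOSEST(group_capacity, SCHED_CAPACITY_SCALE);
}

/* Old overload test: more running tasks than the capacity factor. */
static bool sketch_old_overloaded(unsigned int sum_nr_running,
				  unsigned long capacity_factor)
{
	return sum_nr_running > capacity_factor;
}
```

For 3 CPUs of capacity 640, the group capacity is 1920 and the factor is at
most 2, so 3 running tasks (one per CPU) already classify the group as
overloaded.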

So, this patch tries to solve this issue by removing capacity_factor and
replacing it with the 2 following metrics:
- The available CPU's capacity for CFS tasks, which is already used by
  load_balance.
- The usage of the CPU by the CFS tasks. For the latter, utilization_avg_contrib
  has been re-introduced to compute the usage of a CPU by CFS tasks.

group_capacity_factor and group_has_free_capacity have been removed and replaced
by group_no_capacity. We compare the number of tasks with the number of CPUs and
we evaluate the level of utilization of the CPUs to define if a group is
overloaded or if a group has capacity to handle more tasks.
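
The two tests can be exercised in user space with the sketch below. It
copies the comparisons of group_has_capacity and group_is_overloaded from
this patch, but struct sg_stats_sketch and the function names are
hypothetical, and imbalance_pct is passed directly instead of being read
from the sched_domain.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of struct sg_lb_stats used by the two tests. */
struct sg_stats_sketch {
	unsigned long group_capacity;	/* capacity available for CFS tasks */
	unsigned long group_usage;	/* capacity used by CFS tasks */
	unsigned int sum_nr_running;	/* number of running tasks */
	unsigned int group_weight;	/* number of CPUs in the group */
};

/* True if the group could take more tasks. */
static bool sketch_has_capacity(const struct sg_stats_sketch *sgs,
				unsigned int imbalance_pct)
{
	assert(imbalance_pct >= 100);

	if (sgs->sum_nr_running < sgs->group_weight)
		return true;

	return (sgs->group_capacity * 100) >
	       (sgs->group_usage * imbalance_pct);
}

/* True if the group has more tasks than it can handle. */
static bool sketch_is_overloaded(const struct sg_stats_sketch *sgs,
				 unsigned int imbalance_pct)
{
	assert(imbalance_pct >= 100);

	if (sgs->sum_nr_running <= sgs->group_weight)
		return false;

	return (sgs->group_capacity * 100) <
	       (sgs->group_usage * imbalance_pct);
}
```

With 3 CPUs of capacity 640 (group capacity 1920), 3 fully busy tasks and an
imbalance_pct of 110, both functions return false: the group has no spare
capacity but is not overloaded, which the old capacity_factor could not
express.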

For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task
so it will be selected in priority (among the overloaded groups). Since [1],
SD_PREFER_SIBLING is no longer affected by the computation of
load_above_capacity because local is not overloaded.

[1] 9a5d9ba6a363 ("sched/fair: Allow calculate_imbalance() to move idle cpus")

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 139 +++++++++++++++++++++++++++-------------------------
 1 file changed, 72 insertions(+), 67 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index faf61a2..9d7431f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5936,11 +5936,10 @@ struct sg_lb_stats {
 	unsigned long group_capacity;
 	unsigned long group_usage; /* Total usage of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
-	unsigned int group_capacity_factor;
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	enum group_type group_type;
-	int group_has_free_capacity;
+	int group_no_capacity;
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -6156,28 +6155,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 }
 
 /*
- * Try and fix up capacity for tiny siblings, this is needed when
- * things like SD_ASYM_PACKING need f_b_g to select another sibling
- * which on its own isn't powerful enough.
- *
- * See update_sd_pick_busiest() and check_asym_packing().
+ * Check whether the capacity of the rq has been noticeably reduced by side
+ * activity. The imbalance_pct is used for the threshold.
+ * Return true if the capacity is reduced
  */
 static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
+check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
 {
-	/*
-	 * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
-	 */
-	if (!(sd->flags & SD_SHARE_CPUCAPACITY))
-		return 0;
-
-	/*
-	 * If ~90% of the cpu_capacity is still there, we're good.
-	 */
-	if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
-		return 1;
-
-	return 0;
+	return ((rq->cpu_capacity * sd->imbalance_pct) <
+				(rq->cpu_capacity_orig * 100));
 }
 
 /*
@@ -6215,37 +6201,56 @@ static inline int sg_imbalanced(struct sched_group *group)
 }
 
 /*
- * Compute the group capacity factor.
- *
- * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit unit capacity with that.
+ * group_has_capacity returns true if the group has spare capacity that could
+ * be used by some tasks.
+ * We consider that a group has spare capacity if the number of tasks is
+ * smaller than the number of CPUs or if the usage is lower than the available
+ * capacity for CFS tasks.
+ * For the latter, we use a threshold to stabilize the state, to take into
+ * account the variance of the tasks' load and to return true if the available
+ * capacity is meaningful for the load balancer.
+ * As an example, an available capacity of 1% can appear but it doesn't make
+ * any benefit for the load balance.
  */
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
+static inline bool
+group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
 {
-	unsigned int capacity_factor, smt, cpus;
-	unsigned int capacity, capacity_orig;
+	if (sgs->sum_nr_running < sgs->group_weight)
+		return true;
 
-	capacity = group->sgc->capacity;
-	capacity_orig = group->sgc->capacity_orig;
-	cpus = group->group_weight;
+	if ((sgs->group_capacity * 100) >
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
 
-	/* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
-	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
-	capacity_factor = cpus / smt; /* cores */
+	return false;
+}
+
+/*
+ *  group_is_overloaded returns true if the group has more tasks than it can
+ *  handle.
+ *  group_is_overloaded is not equal to !group_has_capacity because a group
+ *  with the exact right number of tasks, has no more spare capacity but is not
+ *  overloaded so both group_has_capacity and group_is_overloaded return
+ *  false.
+ */
+static inline bool
+group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
+{
+	if (sgs->sum_nr_running <= sgs->group_weight)
+		return false;
 
-	capacity_factor = min_t(unsigned,
-		capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
-	if (!capacity_factor)
-		capacity_factor = fix_small_capacity(env->sd, group);
+	if ((sgs->group_capacity * 100) <
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
 
-	return capacity_factor;
+	return false;
 }
 
-static enum group_type
-group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
+static enum group_type group_classify(struct lb_env *env,
+		struct sched_group *group,
+		struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_nr_running > sgs->group_capacity_factor)
+	if (sgs->group_no_capacity)
 		return group_overloaded;
 
 	if (sg_imbalanced(group))
@@ -6306,11 +6311,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
 	sgs->group_weight = group->group_weight;
-	sgs->group_capacity_factor = sg_capacity_factor(env, group);
-	sgs->group_type = group_classify(group, sgs);
 
-	if (sgs->group_capacity_factor > sgs->sum_nr_running)
-		sgs->group_has_free_capacity = 1;
+	sgs->group_no_capacity = group_is_overloaded(env, sgs);
+	sgs->group_type = group_classify(env, group, sgs);
 }
 
 /**
@@ -6432,18 +6435,19 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
-		 * first, lower the sg capacity factor to one so that we'll try
+		 * first, lower the sg capacity so that we'll try
 		 * and move all the excess tasks away. We lower the capacity
 		 * of a group only if the local group has the capacity to fit
-		 * these excess tasks, i.e. nr_running < group_capacity_factor. The
-		 * extra check prevents the case where you always pull from the
-		 * heaviest group when it is already under-utilized (possible
-		 * with a large weight task outweighs the tasks on the system).
+		 * these excess tasks. The extra check prevents the case where
+		 * you always pull from the heaviest group when it is already
+		 * under-utilized (possible with a large weight task outweighs
+		 * the tasks on the system).
 		 */
 		if (prefer_sibling && sds->local &&
-		    sds->local_stat.group_has_free_capacity) {
-			sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
-			sgs->group_type = group_classify(sg, sgs);
+		    group_has_capacity(env, &sds->local_stat) &&
+		    (sgs->sum_nr_running > 1)) {
+			sgs->group_no_capacity = 1;
+			sgs->group_type = group_overloaded;
 		}
 
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
@@ -6623,11 +6627,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity =
-			(busiest->sum_nr_running - busiest->group_capacity_factor);
-
-		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
-		load_above_capacity /= busiest->group_capacity;
+		load_above_capacity = busiest->sum_nr_running *
+					SCHED_LOAD_SCALE;
+		if (load_above_capacity > busiest->group_capacity)
+			load_above_capacity -= busiest->group_capacity;
+		else
+			load_above_capacity = ~0UL;
 	}
 
 	/*
@@ -6690,6 +6695,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
+	/* ASYM feature bypasses nice load balance check */
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
 		return sds.busiest;
@@ -6710,8 +6716,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
-	    !busiest->group_has_free_capacity)
+	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
+	    busiest->group_no_capacity)
 		goto force_balance;
 
 	/*
@@ -6770,7 +6776,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 	int i;
 
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
-		unsigned long capacity, capacity_factor, wl;
+		unsigned long capacity, wl;
 		enum fbq_type rt;
 
 		rq = cpu_rq(i);
@@ -6799,9 +6805,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			continue;
 
 		capacity = capacity_of(i);
-		capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
-		if (!capacity_factor)
-			capacity_factor = fix_small_capacity(env->sd, group);
 
 		wl = weighted_cpuload(i);
 
@@ -6809,7 +6812,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu capacity.
 		 */
-		if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
+
+		if (rq->nr_running == 1 && wl > env->imbalance &&
+		    !check_cpu_capacity(rq, env->sd))
 			continue;
 
 		/*
-- 
1.9.1


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v10 09/11] sched; remove unused capacity_orig from
  2015-02-27 15:54 [PATCH v10 00/11] sched: consolidation of CPU capacity and usage Vincent Guittot
                   ` (7 preceding siblings ...)
  2015-02-27 15:54 ` [PATCH v10 08/11] sched: replace capacity_factor by usage Vincent Guittot
@ 2015-02-27 15:54 ` Vincent Guittot
  2015-03-03 10:18   ` Morten Rasmussen
  2015-03-03 10:35   ` [PATCH v10 09/11] sched; remove unused capacity_orig Vincent Guittot
  2015-02-27 15:54 ` [PATCH v10 10/11] sched: add SD_PREFER_SIBLING for SMT level Vincent Guittot
                   ` (3 subsequent siblings)
  12 siblings, 2 replies; 68+ messages in thread
From: Vincent Guittot @ 2015-02-27 15:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, dietmar.eggemann, linaro-kernel,
	Vincent Guittot

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  | 12 ------------
 kernel/sched/fair.c  | 13 +++----------
 kernel/sched/sched.h |  2 +-
 3 files changed, 4 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 28e3ec2..29f7037 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5446,17 +5446,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 			break;
 		}
 
-		/*
-		 * Even though we initialize ->capacity to something semi-sane,
-		 * we leave capacity_orig unset. This allows us to detect if
-		 * domain iteration is still funny without causing /0 traps.
-		 */
-		if (!group->sgc->capacity_orig) {
-			printk(KERN_CONT "\n");
-			printk(KERN_ERR "ERROR: domain->cpu_capacity not set\n");
-			break;
-		}
-
 		if (!cpumask_weight(sched_group_cpus(group))) {
 			printk(KERN_CONT "\n");
 			printk(KERN_ERR "ERROR: empty group\n");
@@ -5941,7 +5930,6 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		 * die on a /0 trap.
 		 */
 		sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
-		sg->sgc->capacity_orig = sg->sgc->capacity;
 
 		/*
 		 * Make sure the first group of this domain contains the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9d7431f..7420d21 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6073,7 +6073,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 	capacity >>= SCHED_CAPACITY_SHIFT;
 
 	cpu_rq(cpu)->cpu_capacity_orig = capacity;
-	sdg->sgc->capacity_orig = capacity;
 
 	capacity *= scale_rt_capacity(cpu);
 	capacity >>= SCHED_CAPACITY_SHIFT;
@@ -6089,7 +6088,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
-	unsigned long capacity, capacity_orig;
+	unsigned long capacity;
 	unsigned long interval;
 
 	interval = msecs_to_jiffies(sd->balance_interval);
@@ -6101,7 +6100,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 		return;
 	}
 
-	capacity_orig = capacity = 0;
+	capacity = 0;
 
 	if (child->flags & SD_OVERLAP) {
 		/*
@@ -6121,19 +6120,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 			 * Use capacity_of(), which is set irrespective of domains
 			 * in update_cpu_capacity().
 			 *
-			 * This avoids capacity/capacity_orig from being 0 and
+			 * This avoids capacity from being 0 and
 			 * causing divide-by-zero issues on boot.
-			 *
-			 * Runtime updates will correct capacity_orig.
 			 */
 			if (unlikely(!rq->sd)) {
-				capacity_orig += capacity_orig_of(cpu);
 				capacity += capacity_of(cpu);
 				continue;
 			}
 
 			sgc = rq->sd->groups->sgc;
-			capacity_orig += sgc->capacity_orig;
 			capacity += sgc->capacity;
 		}
 	} else  {
@@ -6144,13 +6139,11 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 
 		group = child->groups;
 		do {
-			capacity_orig += group->sgc->capacity_orig;
 			capacity += group->sgc->capacity;
 			group = group->next;
 		} while (group != child->groups);
 	}
 
-	sdg->sgc->capacity_orig = capacity_orig;
 	sdg->sgc->capacity = capacity;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f06d24..24c4aaf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -814,7 +814,7 @@ struct sched_group_capacity {
 	 * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
 	 * for a single CPU.
 	 */
-	unsigned int capacity, capacity_orig;
+	unsigned int capacity;
 	unsigned long next_update;
 	int imbalance; /* XXX unrelated to capacity but shared group state */
 	/*
-- 
1.9.1


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v10 10/11] sched: add SD_PREFER_SIBLING for SMT level
  2015-02-27 15:54 [PATCH v10 00/11] sched: consolidation of CPU capacity and usage Vincent Guittot
                   ` (8 preceding siblings ...)
  2015-02-27 15:54 ` [PATCH v10 09/11] sched; remove unused capacity_orig from Vincent Guittot
@ 2015-02-27 15:54 ` Vincent Guittot
  2015-03-02 11:52   ` Srikar Dronamraju
                     ` (2 more replies)
  2015-02-27 15:54 ` [PATCH v10 11/11] sched: move cfs task on a CPU with higher capacity Vincent Guittot
                   ` (2 subsequent siblings)
  12 siblings, 3 replies; 68+ messages in thread
From: Vincent Guittot @ 2015-02-27 15:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, dietmar.eggemann, linaro-kernel,
	Vincent Guittot

Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
the scheduler will put at least 1 task per core.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 29f7037..753f0a2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6240,6 +6240,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 	 */
 
 	if (sd->flags & SD_SHARE_CPUCAPACITY) {
+		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */
 
-- 
1.9.1


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v10 11/11] sched: move cfs task on a CPU with higher capacity
  2015-02-27 15:54 [PATCH v10 00/11] sched: consolidation of CPU capacity and usage Vincent Guittot
                   ` (9 preceding siblings ...)
  2015-02-27 15:54 ` [PATCH v10 10/11] sched: add SD_PREFER_SIBLING for SMT level Vincent Guittot
@ 2015-02-27 15:54 ` Vincent Guittot
  2015-03-26 14:19   ` Dietmar Eggemann
  2015-03-27 11:42   ` [tip:sched/core] sched: Move CFS tasks to CPUs " tip-bot for Vincent Guittot
  2015-03-11 10:10 ` [PATCH v10 00/11] sched: consolidation of CPU capacity and usage Vincent Guittot
  2015-04-02  1:47 ` Wanpeng Li
  12 siblings, 2 replies; 68+ messages in thread
From: Vincent Guittot @ 2015-02-27 15:54 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, dietmar.eggemann, linaro-kernel,
	Vincent Guittot

When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining
capacity for CFS tasks can be significantly reduced. Once we detect such a
situation by comparing cpu_capacity_orig and cpu_capacity, we trigger an idle
load balance to check if it's worth moving its tasks to an idle CPU.
It's worth trying to move the task before the CPU is fully utilized to
minimize the preemption by IRQs or RT tasks.

Once the idle load_balance has selected the busiest CPU, it will look for an
active load balance in only two cases:
- there is only 1 task on the busiest CPU.
- we haven't been able to move a task of the busiest rq.

A CPU with a reduced capacity is included in the 1st case, and it's worth
actively migrating its task if the idle CPU has more available capacity for
CFS tasks. This test has been added in need_active_balance.
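
The added test can be sketched in user space as below. The comparisons are
the ones of check_cpu_capacity and of the new block in need_active_balance,
but the function names and the flattened parameter list are made up for the
example; the kernel code reads these values from the rq, the lb_env and the
sched_domain.

```c
#include <assert.h>
#include <stdbool.h>

/* True if cpu_capacity is noticeably lower than cpu_capacity_orig. */
static bool sketch_check_cpu_capacity(unsigned long cpu_capacity,
				      unsigned long cpu_capacity_orig,
				      unsigned int imbalance_pct)
{
	assert(imbalance_pct >= 100);
	return (cpu_capacity * imbalance_pct) < (cpu_capacity_orig * 100);
}

/*
 * True if the single CFS task of the source CPU should be actively
 * migrated to the idle destination CPU.
 */
static bool sketch_worth_active_migration(unsigned long src_capacity,
					  unsigned long src_capacity_orig,
					  unsigned long dst_capacity,
					  unsigned int src_cfs_tasks,
					  bool dst_is_idle,
					  unsigned int imbalance_pct)
{
	if (!dst_is_idle || src_cfs_tasks != 1)
		return false;

	return sketch_check_cpu_capacity(src_capacity, src_capacity_orig,
					 imbalance_pct) &&
	       (src_capacity * imbalance_pct) < (dst_capacity * 100);
}
```

With an imbalance_pct of 125, a source CPU reduced from 1024 to 400 and an
idle destination CPU of capacity 1024, the migration is worth it. It is not
if the source CPU still has 900 of capacity, or if the destination CPU only
offers 450.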

As a sidenote, this will not generate more spurious ilb because we already
trigger an ilb if there is more than 1 busy CPU. If this CPU is the only one
that has a task, we will trigger the ilb once for migrating the task.

The nohz_kick_needed function has been cleaned up a bit while adding the new
test.

env.src_cpu and env.src_rq must be set unconditionally because they are used
in need_active_balance, which is called even if busiest->nr_running equals 1.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 69 ++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 47 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7420d21..e70c315 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6855,6 +6855,19 @@ static int need_active_balance(struct lb_env *env)
 			return 1;
 	}
 
+	/*
+	 * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.
+	 * It's worth migrating the task if the src_cpu's capacity is reduced
+	 * because of other sched_class or IRQs if more capacity stays
+	 * available on dst_cpu.
+	 */
+	if ((env->idle != CPU_NOT_IDLE) &&
+	    (env->src_rq->cfs.h_nr_running == 1)) {
+		if ((check_cpu_capacity(env->src_rq, sd)) &&
+		    (capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100))
+			return 1;
+	}
+
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
 
@@ -6954,6 +6967,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 
 	schedstat_add(sd, lb_imbalance[idle], env.imbalance);
 
+	env.src_cpu = busiest->cpu;
+	env.src_rq = busiest;
+
 	ld_moved = 0;
 	if (busiest->nr_running > 1) {
 		/*
@@ -6963,8 +6979,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		 * correctly treated as an imbalance.
 		 */
 		env.flags |= LBF_ALL_PINNED;
-		env.src_cpu   = busiest->cpu;
-		env.src_rq    = busiest;
 		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
 more_balance:
@@ -7664,22 +7678,25 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 
 /*
  * Current heuristic for kicking the idle load balancer in the presence
- * of an idle cpu is the system.
+ * of an idle cpu in the system.
  *   - This rq has more than one task.
- *   - At any scheduler domain level, this cpu's scheduler group has multiple
- *     busy cpu's exceeding the group's capacity.
+ *   - This rq has at least one CFS task and the capacity of the CPU is
+ *     significantly reduced because of RT tasks or IRQs.
+ *   - At parent of LLC scheduler domain level, this cpu's scheduler group has
+ *     multiple busy cpu.
  *   - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
  *     domain span are idle.
  */
-static inline int nohz_kick_needed(struct rq *rq)
+static inline bool nohz_kick_needed(struct rq *rq)
 {
 	unsigned long now = jiffies;
 	struct sched_domain *sd;
 	struct sched_group_capacity *sgc;
 	int nr_busy, cpu = rq->cpu;
+	bool kick = false;
 
 	if (unlikely(rq->idle_balance))
-		return 0;
+		return false;
 
        /*
 	* We may be recently in ticked or tickless idle mode. At the first
@@ -7693,38 +7710,46 @@ static inline int nohz_kick_needed(struct rq *rq)
 	 * balancing.
 	 */
 	if (likely(!atomic_read(&nohz.nr_cpus)))
-		return 0;
+		return false;
 
 	if (time_before(now, nohz.next_balance))
-		return 0;
+		return false;
 
 	if (rq->nr_running >= 2)
-		goto need_kick;
+		return true;
 
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_busy, cpu));
-
 	if (sd) {
 		sgc = sd->groups->sgc;
 		nr_busy = atomic_read(&sgc->nr_busy_cpus);
 
-		if (nr_busy > 1)
-			goto need_kick_unlock;
+		if (nr_busy > 1) {
+			kick = true;
+			goto unlock;
+		}
+
 	}
 
-	sd = rcu_dereference(per_cpu(sd_asym, cpu));
+	sd = rcu_dereference(rq->sd);
+	if (sd) {
+		if ((rq->cfs.h_nr_running >= 1) &&
+				check_cpu_capacity(rq, sd)) {
+			kick = true;
+			goto unlock;
+		}
+	}
 
+	sd = rcu_dereference(per_cpu(sd_asym, cpu));
 	if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
-				  sched_domain_span(sd)) < cpu))
-		goto need_kick_unlock;
-
-	rcu_read_unlock();
-	return 0;
+				  sched_domain_span(sd)) < cpu)) {
+		kick = true;
+		goto unlock;
+	}
 
-need_kick_unlock:
+unlock:
 	rcu_read_unlock();
-need_kick:
-	return 1;
+	return kick;
 }
 #else
 static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
-- 
1.9.1
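
As an illustration of the capacity test added to need_active_balance() in the
hunk above, the comparison can be sketched in standalone C. The helper name and
the sample values are assumptions made for this example and are not part of the
patch; only the inequality itself is taken from the diff.

```c
#include <stdbool.h>

/*
 * Hypothetical standalone version of the comparison added to
 * need_active_balance(): migrating the single CFS task is considered
 * worthwhile when the source CPU's capacity, inflated by imbalance_pct,
 * is still below the destination CPU's capacity.
 */
static bool dst_has_more_capacity(unsigned long src_capacity,
				  unsigned long dst_capacity,
				  unsigned int imbalance_pct)
{
	return src_capacity * imbalance_pct < dst_capacity * 100;
}
```

With an assumed imbalance_pct of 125, a source CPU reduced to 600 by RT tasks
or IRQs loses against an idle destination at 1024 (75000 < 102400), whereas a
source CPU still at 900 does not (112500 >= 102400).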


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v10 10/11] sched: add SD_PREFER_SIBLING for SMT level
  2015-02-27 15:54 ` [PATCH v10 10/11] sched: add SD_PREFER_SIBLING for SMT level Vincent Guittot
@ 2015-03-02 11:52   ` Srikar Dronamraju
  2015-03-03  8:38     ` Vincent Guittot
  2015-03-26 10:55   ` Peter Zijlstra
  2015-03-27 11:42   ` [tip:sched/core] sched: Add " tip-bot for Vincent Guittot
  2 siblings, 1 reply; 68+ messages in thread
From: Srikar Dronamraju @ 2015-03-02 11:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh,
	riel, efault, nicolas.pitre, dietmar.eggemann, linaro-kernel

* Vincent Guittot <vincent.guittot@linaro.org> [2015-02-27 16:54:13]:

> Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
> the scheduler will put at least 1 task per core.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
> ---
>  kernel/sched/core.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 29f7037..753f0a2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6240,6 +6240,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
>  	 */
>  
>  	if (sd->flags & SD_SHARE_CPUCAPACITY) {
> +		sd->flags |= SD_PREFER_SIBLING;
>  		sd->imbalance_pct = 110;
>  		sd->smt_gain = 1178; /* ~15% */
>  

The prefer-siblings logic dates back to https://lkml.org/lkml/2009/8/27/210
and is only used in update_sd_lb_stats(), where we have
 
if (child && child->flags & SD_PREFER_SIBLING)
	 prefer_sibling = 1;

However, what confuses me is why we should even look at a child domain's
flag to balance tasks across the current sched domain. Why can't we just
set and use an sd flag at the current level rather than look at the child
domain's flag?

> -- 
> 1.9.1
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v10 10/11] sched: add SD_PREFER_SIBLING for SMT level
  2015-03-02 11:52   ` Srikar Dronamraju
@ 2015-03-03  8:38     ` Vincent Guittot
  2015-03-23  9:11       ` Peter Zijlstra
  0 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2015-03-03  8:38 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Morten Rasmussen, Kamalesh Babulal, Rik van Riel, Mike Galbraith,
	Nicolas Pitre, Dietmar Eggemann, Linaro Kernel Mailman List

On 2 March 2015 at 12:52, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> * Vincent Guittot <vincent.guittot@linaro.org> [2015-02-27 16:54:13]:
>
>> Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
>> the scheduler will put at least 1 task per core.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
>> ---
>>  kernel/sched/core.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 29f7037..753f0a2 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6240,6 +6240,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
>>        */
>>
>>       if (sd->flags & SD_SHARE_CPUCAPACITY) {
>> +             sd->flags |= SD_PREFER_SIBLING;
>>               sd->imbalance_pct = 110;
>>               sd->smt_gain = 1178; /* ~15% */
>>
>
> Prefer siblings logic dates back to https://lkml.org/lkml/2009/8/27/210
> and only used in update_sd_lb_stats() where we have
>
> if (child && child->flags & SD_PREFER_SIBLING)
>          prefer_sibling = 1;
>
> However what confuses me is why should we even look at a child domain's
> flag to balance tasks across the current sched domain? Why cant we just
> set and use a sd flag at current level than to look at child domain
> flag?

Peter,
do you have any insight into the reason?

Regarding SMT, I see one advantage: the prefer sibling flag only needs
to be set if SMT is enabled and an SMT sched_domain is present, but it
implies some action at the parent level. If the flag were set directly
at the parent level, the behavior linked to the SD_PREFER_SIBLING flag
would be applied whether or not the SMT domain is present.

Regards,
Vincent


>
>> --
>> 1.9.1
>>
>
> --
> Thanks and Regards
> Srikar Dronamraju
>
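
The point discussed above, that the flag lives on the child level but acts on
the level being balanced, can be sketched as follows. This is a minimal model
only: the structure name and the flag value are hypothetical, and the check is
modeled on the two lines quoted from update_sd_lb_stats().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag value and reduced structure, for illustration only. */
#define SD_PREFER_SIBLING 0x1000

struct sched_domain_sketch {
	struct sched_domain_sketch *child;	/* lower level, NULL at the bottom */
	int flags;
};

/*
 * Modeled on the check quoted from update_sd_lb_stats(): the level being
 * balanced looks at its child's flag, so the behavior only triggers when
 * the child level (e.g. SMT) exists and carries the flag.
 */
static bool prefer_sibling(const struct sched_domain_sketch *sd)
{
	return sd->child && (sd->child->flags & SD_PREFER_SIBLING);
}
```

In this model, a parent level whose child is an SMT level with the flag set
reports prefer_sibling, while the same parent level on a system without an SMT
level (no child) does not.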

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v10 09/11] sched; remove unused capacity_orig from
  2015-02-27 15:54 ` [PATCH v10 09/11] sched; remove unused capacity_orig from Vincent Guittot
@ 2015-03-03 10:18   ` Morten Rasmussen
  2015-03-03 10:33     ` Vincent Guittot
  2015-03-03 10:35   ` [PATCH v10 09/11] sched; remove unused capacity_orig Vincent Guittot
  1 sibling, 1 reply; 68+ messages in thread
From: Morten Rasmussen @ 2015-03-03 10:18 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, preeti, kamalesh, riel, efault,
	nicolas.pitre, Dietmar Eggemann, linaro-kernel

On Fri, Feb 27, 2015 at 03:54:12PM +0000, Vincent Guittot wrote:
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

The commit message is empty? And the subject appears truncated?
"...from"?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v10 09/11] sched; remove unused capacity_orig from
  2015-03-03 10:18   ` Morten Rasmussen
@ 2015-03-03 10:33     ` Vincent Guittot
  0 siblings, 0 replies; 68+ messages in thread
From: Vincent Guittot @ 2015-03-03 10:33 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, linux-kernel, preeti, kamalesh, riel, efault,
	nicolas.pitre, Dietmar Eggemann, linaro-kernel

On 3 March 2015 at 11:18, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Fri, Feb 27, 2015 at 03:54:12PM +0000, Vincent Guittot wrote:
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> The commit message is empty? And the subject appears truncated?
> "...from"?

I don't know what happened with this patch. I'm going to resend it with
a complete commit message.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v10 09/11] sched; remove unused capacity_orig
  2015-02-27 15:54 ` [PATCH v10 09/11] sched; remove unused capacity_orig from Vincent Guittot
  2015-03-03 10:18   ` Morten Rasmussen
@ 2015-03-03 10:35   ` Vincent Guittot
  2015-03-27 11:42     ` [tip:sched/core] sched: Remove unused struct sched_group_capacity ::capacity_orig tip-bot for Vincent Guittot
  1 sibling, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2015-03-03 10:35 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, dietmar.eggemann, linaro-kernel,
	Vincent Guittot

The capacity_orig field is no longer used in the scheduler, so we can remove it
from struct sched_group_capacity.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  | 12 ------------
 kernel/sched/fair.c  | 13 +++----------
 kernel/sched/sched.h |  2 +-
 3 files changed, 4 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 28e3ec2..29f7037 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5446,17 +5446,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 			break;
 		}
 
-		/*
-		 * Even though we initialize ->capacity to something semi-sane,
-		 * we leave capacity_orig unset. This allows us to detect if
-		 * domain iteration is still funny without causing /0 traps.
-		 */
-		if (!group->sgc->capacity_orig) {
-			printk(KERN_CONT "\n");
-			printk(KERN_ERR "ERROR: domain->cpu_capacity not set\n");
-			break;
-		}
-
 		if (!cpumask_weight(sched_group_cpus(group))) {
 			printk(KERN_CONT "\n");
 			printk(KERN_ERR "ERROR: empty group\n");
@@ -5941,7 +5930,6 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		 * die on a /0 trap.
 		 */
 		sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
-		sg->sgc->capacity_orig = sg->sgc->capacity;
 
 		/*
 		 * Make sure the first group of this domain contains the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9d7431f..7420d21 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6073,7 +6073,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 	capacity >>= SCHED_CAPACITY_SHIFT;
 
 	cpu_rq(cpu)->cpu_capacity_orig = capacity;
-	sdg->sgc->capacity_orig = capacity;
 
 	capacity *= scale_rt_capacity(cpu);
 	capacity >>= SCHED_CAPACITY_SHIFT;
@@ -6089,7 +6088,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
-	unsigned long capacity, capacity_orig;
+	unsigned long capacity;
 	unsigned long interval;
 
 	interval = msecs_to_jiffies(sd->balance_interval);
@@ -6101,7 +6100,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 		return;
 	}
 
-	capacity_orig = capacity = 0;
+	capacity = 0;
 
 	if (child->flags & SD_OVERLAP) {
 		/*
@@ -6121,19 +6120,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 			 * Use capacity_of(), which is set irrespective of domains
 			 * in update_cpu_capacity().
 			 *
-			 * This avoids capacity/capacity_orig from being 0 and
+			 * This avoids capacity from being 0 and
 			 * causing divide-by-zero issues on boot.
-			 *
-			 * Runtime updates will correct capacity_orig.
 			 */
 			if (unlikely(!rq->sd)) {
-				capacity_orig += capacity_orig_of(cpu);
 				capacity += capacity_of(cpu);
 				continue;
 			}
 
 			sgc = rq->sd->groups->sgc;
-			capacity_orig += sgc->capacity_orig;
 			capacity += sgc->capacity;
 		}
 	} else  {
@@ -6144,13 +6139,11 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 
 		group = child->groups;
 		do {
-			capacity_orig += group->sgc->capacity_orig;
 			capacity += group->sgc->capacity;
 			group = group->next;
 		} while (group != child->groups);
 	}
 
-	sdg->sgc->capacity_orig = capacity_orig;
 	sdg->sgc->capacity = capacity;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f06d24..24c4aaf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -814,7 +814,7 @@ struct sched_group_capacity {
 	 * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
 	 * for a single CPU.
 	 */
-	unsigned int capacity, capacity_orig;
+	unsigned int capacity;
 	unsigned long next_update;
 	int imbalance; /* XXX unrelated to capacity but shared group state */
 	/*
-- 
1.9.1


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v10 07/11] sched: get CPU's usage statistic
  2015-02-27 15:54 ` [PATCH v10 07/11] sched: get CPU's usage statistic Vincent Guittot
@ 2015-03-03 12:47   ` Dietmar Eggemann
  2015-03-04  7:53     ` Vincent Guittot
  2015-03-04  7:48   ` Vincent Guittot
  2015-03-27 15:12   ` [PATCH v10 07/11] sched: get CPU's usage statistic Xunlei Pang
  2 siblings, 1 reply; 68+ messages in thread
From: Dietmar Eggemann @ 2015-03-03 12:47 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, preeti,
	Morten Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, linaro-kernel

On 27/02/15 15:54, Vincent Guittot wrote:
> Monitor the usage level of each group of each sched_domain level. The usage is
> the portion of cpu_capacity_orig that is currently used on a CPU or group of
> CPUs. We use the utilization_load_avg to evaluate the usage level of each
> group.
> 
> The utilization_load_avg only takes into account the running time of the CFS
> tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
> utilized. Nevertheless, we must cap utilization_load_avg which can be temporaly

s/temporaly/temporally

> greater than SCHED_LOAD_SCALE after the migration of a task on this CPU and
> until the metrics are stabilized.
> 
> The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the
> running load on the CPU whereas the available capacity for the CFS task is in
> the range [0..cpu_capacity_orig]. In order to test if a CPU is fully utilized
> by CFS tasks, we have to scale the utilization in the cpu_capacity_orig range
> of the CPU to get the usage of the latter. The usage can then be compared with
> the available capacity (ie cpu_capacity) to deduct the usage level of a CPU.
> 
> The frequency scaling invariance of the usage is not taken into account in this
> patch, it will be solved in another patch which will deal with frequency
> scaling invariance on the running_load_avg.

The use of underscores in running_load_avg implies to me that this is a
data member of struct sched_avg or something similar. But there is no
running_load_avg in the current code. However, I can see that
sched_avg::*running_avg_sum* (and therefore
cfs_rq::*utilization_load_avg*) are frequency scale invariant.

> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 10f84c3..faf61a2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4781,6 +4781,33 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  done:
>  	return target;
>  }
> +/*
> + * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
> + * tasks. The unit of the return value must capacity so we can compare the

s/must capacity/must be the one of capacity

> + * usage with the capacity of the CPU that is available for CFS task (ie
> + * cpu_capacity).
> + * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
> + * CPU. It represents the amount of utilization of a CPU in the range
> + * [0..SCHED_LOAD_SCALE].  The usage of a CPU can't be higher than the full
> + * capacity of the CPU because it's about the running time on this CPU.
> + * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
> + * because of unfortunate rounding in avg_period and running_load_avg or just
> + * after migrating tasks until the average stabilizes with the new running
> + * time. So we need to check that the usage stays into the range
> + * [0..cpu_capacity_orig] and cap if necessary.
> + * Without capping the usage, a group could be seen as overloaded (CPU0 usage
> + * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity/

s/capacity\//capacity.

[...]

-- Dietmar


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant
  2015-02-27 15:54 ` [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant Vincent Guittot
@ 2015-03-03 12:51   ` Dietmar Eggemann
  2015-03-04  7:54     ` Vincent Guittot
  2015-03-04  7:46   ` Vincent Guittot
  2015-03-23 13:19   ` [PATCH v10 04/11] " Peter Zijlstra
  2 siblings, 1 reply; 68+ messages in thread
From: Dietmar Eggemann @ 2015-03-03 12:51 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, preeti,
	Morten Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, linaro-kernel, Paul Turner, Ben Segall

On 27/02/15 15:54, Vincent Guittot wrote:
> From: Morten Rasmussen <morten.rasmussen@arm.com>
> 
> Apply frequency scale-invariance correction factor to usage tracking.
> Each segment of the running_load_avg geometric series is now scaled by the

The same comment I sent out on [PATCH v10 07/11]:

The use of underscores in running_load_avg implies to me that this is a
data member of struct sched_avg or something similar. But there is no
running_load_avg in the current code. However, I can see that
sched_avg::*running_avg_sum* (and therefore
cfs_rq::*utilization_load_avg*) are frequency scale invariant.

-- Dietmar

> current frequency so the utilization_avg_contrib of each entity will be
> invariant with frequency scaling. As a result, utilization_load_avg which is
> the sum of utilization_avg_contrib, becomes invariant too. So the usage level
> that is returned by get_cpu_usage, stays relative to the max frequency as the
> cpu_capacity which is is compared against.
> Then, we want the keep the load tracking values in a 32bits type, which implies
> that the max value of {runnable|running}_avg_sum must be lower than
> 2^32/88761=48388 (88761 is the max weigth of a task). As LOAD_AVG_MAX = 47742,
> arch_scale_freq_capacity must return a value less than
> (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_SCALE_CAPACITY = 1024).
> So we define the range to [0..SCHED_SCALE_CAPACITY] in order to avoid overflow.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant
  2015-02-27 15:54 ` [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant Vincent Guittot
  2015-03-03 12:51   ` Dietmar Eggemann
@ 2015-03-04  7:46   ` Vincent Guittot
  2015-03-27 11:40     ` [tip:sched/core] " tip-bot for Morten Rasmussen
  2015-03-23 13:19   ` [PATCH v10 04/11] " Peter Zijlstra
  2 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2015-03-04  7:46 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, dietmar.eggemann, linaro-kernel,
	Morten Rasmussen, Paul Turner, Ben Segall, Vincent Guittot

From: Morten Rasmussen <morten.rasmussen@arm.com>

Apply frequency scale-invariance correction factor to usage tracking.
Each segment of the running_avg_sum geometric series is now scaled by the
current frequency so the utilization_avg_contrib of each entity will be
invariant with frequency scaling. As a result, utilization_load_avg, which is
the sum of utilization_avg_contrib, becomes invariant too. So the usage level
that is returned by get_cpu_usage stays relative to the max frequency, as does
the cpu_capacity which it is compared against.
We also want to keep the load tracking values in a 32-bit type, which implies
that the max value of {runnable|running}_avg_sum must be lower than
2^32/88761=48388 (88761 is the max weight of a task). As LOAD_AVG_MAX = 47742,
arch_scale_freq_capacity must return a value less than
(48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_CAPACITY_SCALE = 1024).
So we define the range as [0..SCHED_CAPACITY_SCALE] in order to avoid overflow.

cc: Paul Turner <pjt@google.com>
cc: Ben Segall <bsegall@google.com>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e54231f..7f031e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2484,6 +2484,8 @@ static u32 __compute_runnable_contrib(u64 n)
 	return contrib + runnable_avg_yN_sum[n];
 }
 
+unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
@@ -2512,7 +2514,7 @@ static u32 __compute_runnable_contrib(u64 n)
  *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
  *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
  */
-static __always_inline int __update_entity_runnable_avg(u64 now,
+static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 							struct sched_avg *sa,
 							int runnable,
 							int running)
@@ -2520,6 +2522,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	u64 delta, periods;
 	u32 runnable_contrib;
 	int delta_w, decayed = 0;
+	unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
 
 	delta = now - sa->last_runnable_update;
 	/*
@@ -2555,7 +2558,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		if (runnable)
 			sa->runnable_avg_sum += delta_w;
 		if (running)
-			sa->running_avg_sum += delta_w;
+			sa->running_avg_sum += delta_w * scale_freq
+				>> SCHED_CAPACITY_SHIFT;
 		sa->avg_period += delta_w;
 
 		delta -= delta_w;
@@ -2576,7 +2580,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		if (runnable)
 			sa->runnable_avg_sum += runnable_contrib;
 		if (running)
-			sa->running_avg_sum += runnable_contrib;
+			sa->running_avg_sum += runnable_contrib * scale_freq
+				>> SCHED_CAPACITY_SHIFT;
 		sa->avg_period += runnable_contrib;
 	}
 
@@ -2584,7 +2589,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	if (runnable)
 		sa->runnable_avg_sum += delta;
 	if (running)
-		sa->running_avg_sum += delta;
+		sa->running_avg_sum += delta * scale_freq
+			>> SCHED_CAPACITY_SHIFT;
 	sa->avg_period += delta;
 
 	return decayed;
@@ -2692,8 +2698,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
-			runnable);
+	__update_entity_runnable_avg(rq_clock_task(rq), cpu_of(rq), &rq->avg,
+			runnable, runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
@@ -2771,6 +2777,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	long contrib_delta, utilization_delta;
+	int cpu = cpu_of(rq_of(cfs_rq));
 	u64 now;
 
 	/*
@@ -2782,7 +2789,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	else
 		now = cfs_rq_clock_task(group_cfs_rq(se));
 
-	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
+	if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq,
 					cfs_rq->curr == se))
 		return;
 
-- 
1.9.1


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v10 07/11] sched: get CPU's usage statistic
  2015-02-27 15:54 ` [PATCH v10 07/11] sched: get CPU's usage statistic Vincent Guittot
  2015-03-03 12:47   ` Dietmar Eggemann
@ 2015-03-04  7:48   ` Vincent Guittot
  2015-03-27 11:41     ` [tip:sched/core] sched: Calculate CPU' s usage statistic and put it into struct sg_lb_stats::group_usage tip-bot for Vincent Guittot
  2015-03-27 15:12   ` [PATCH v10 07/11] sched: get CPU's usage statistic Xunlei Pang
  2 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2015-03-04  7:48 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, dietmar.eggemann, linaro-kernel,
	Vincent Guittot

Monitor the usage level of each group of each sched_domain level. The usage is
the portion of cpu_capacity_orig that is currently used on a CPU or group of
CPUs. We use the utilization_load_avg to evaluate the usage level of each
group.

The utilization_load_avg only takes into account the running time of the CFS
tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
utilized. Nevertheless, we must cap utilization_load_avg which can be
temporarily greater than SCHED_LOAD_SCALE after the migration of a task on this
CPU and until the metrics are stabilized.

The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the
running load on the CPU whereas the available capacity for the CFS task is in
the range [0..cpu_capacity_orig]. In order to test if a CPU is fully utilized
by CFS tasks, we have to scale the utilization in the cpu_capacity_orig range
of the CPU to get the usage of the latter. The usage can then be compared with
the available capacity (i.e. cpu_capacity) to deduce the usage level of a CPU.

The frequency scaling invariance of the usage is not taken into account in this
patch; it will be solved in another patch which will deal with frequency
scaling invariance on the utilization_load_avg.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 10f84c3..faf61a2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4781,6 +4781,33 @@ static int select_idle_sibling(struct task_struct *p, int target)
 done:
 	return target;
 }
+/*
+ * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
+ * tasks. The unit of the return value must be the one of capacity so we can
+ * compare the usage with the capacity of the CPU that is available for CFS
+ * task (ie cpu_capacity).
+ * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
+ * CPU. It represents the amount of utilization of a CPU in the range
+ * [0..SCHED_LOAD_SCALE].  The usage of a CPU can't be higher than the full
+ * capacity of the CPU because it's about the running time on this CPU.
+ * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
+ * because of unfortunate rounding in avg_period and running_avg_sum or just
+ * after migrating tasks until the average stabilizes with the new running
+ * time. So we need to check that the usage stays in the range
+ * [0..cpu_capacity_orig] and cap if necessary.
+ * Without capping the usage, a group could be seen as overloaded (CPU0 usage
+ * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity.
+ */
+static int get_cpu_usage(int cpu)
+{
+	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+	unsigned long capacity = capacity_orig_of(cpu);
+
+	if (usage >= SCHED_LOAD_SCALE)
+		return capacity;
+
+	return (usage * capacity) >> SCHED_LOAD_SHIFT;
+}
 
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
@@ -5907,6 +5934,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_capacity;
+	unsigned long group_usage; /* Total usage of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int group_capacity_factor;
 	unsigned int idle_cpus;
@@ -6255,6 +6283,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);
 
 		sgs->group_load += load;
+		sgs->group_usage += get_cpu_usage(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
 		if (rq->nr_running > 1)
-- 
1.9.1


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v10 07/11] sched: get CPU's usage statistic
  2015-03-03 12:47   ` Dietmar Eggemann
@ 2015-03-04  7:53     ` Vincent Guittot
  0 siblings, 0 replies; 68+ messages in thread
From: Vincent Guittot @ 2015-03-04  7:53 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, linux-kernel, preeti, Morten Rasmussen, kamalesh,
	riel, efault, nicolas.pitre, linaro-kernel

On 3 March 2015 at 13:47, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 27/02/15 15:54, Vincent Guittot wrote:
>> Monitor the usage level of each group of each sched_domain level. The usage is
>> the portion of cpu_capacity_orig that is currently used on a CPU or group of
>> CPUs. We use the utilization_load_avg to evaluate the usage level of each
>> group.
>>
>> The utilization_load_avg only takes into account the running time of the CFS
>> tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
>> utilized. Nevertheless, we must cap utilization_load_avg which can be temporaly
>
> s/temporaly/temporally
>
>> greater than SCHED_LOAD_SCALE after the migration of a task on this CPU and
>> until the metrics are stabilized.
>>
>> The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the
>> running load on the CPU whereas the available capacity for the CFS task is in
>> the range [0..cpu_capacity_orig]. In order to test if a CPU is fully utilized
>> by CFS tasks, we have to scale the utilization in the cpu_capacity_orig range
>> of the CPU to get the usage of the latter. The usage can then be compared with
>> the available capacity (ie cpu_capacity) to deduct the usage level of a CPU.
>>
>> The frequency scaling invariance of the usage is not taken into account in this
>> patch, it will be solved in another patch which will deal with frequency
>> scaling invariance on the running_load_avg.
>
> The use of underscores in running_load_avg implies to me that this is a
> data member of struct sched_avg or something similar. But there is no
> running_load_avg in the current code. However, I can see that
> sched_avg::*running_avg_sum* (and therefore
> cfs_rq::*utilization_load_avg*) are frequency scale invariant.
>
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
>> ---
>>  kernel/sched/fair.c | 29 +++++++++++++++++++++++++++++
>>  1 file changed, 29 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 10f84c3..faf61a2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4781,6 +4781,33 @@ static int select_idle_sibling(struct task_struct *p, int target)
>>  done:
>>       return target;
>>  }
>> +/*
>> + * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
>> + * tasks. The unit of the return value must capacity so we can compare the
>
> s/must capacity/must be the one of capacity
>
>> + * usage with the capacity of the CPU that is available for CFS task (ie
>> + * cpu_capacity).
>> + * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
>> + * CPU. It represents the amount of utilization of a CPU in the range
>> + * [0..SCHED_LOAD_SCALE].  The usage of a CPU can't be higher than the full
>> + * capacity of the CPU because it's about the running time on this CPU.
>> + * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
>> + * because of unfortunate rounding in avg_period and running_load_avg or just
>> + * after migrating tasks until the average stabilizes with the new running
>> + * time. So we need to check that the usage stays into the range
>> + * [0..cpu_capacity_orig] and cap if necessary.
>> + * Without capping the usage, a group could be seen as overloaded (CPU0 usage
>> + * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity/
>
> s/capacity\//capacity.

I have resent the patch with the typos corrected.

>
> [...]
>
> -- Dietmar
>
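
The scaling and capping described in the quoted comment can be sketched as
below. This is a minimal standalone illustration, not the function from the
patch (whose body is elided above): sketch_cpu_usage() is a hypothetical name,
and SCHED_LOAD_SCALE = 1024 is assumed.

```c
#include <assert.h>

#define SCHED_LOAD_SHIFT 10                        /* assumed value */
#define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT)

/*
 * Hypothetical helper: scale a utilization expressed in the range
 * [0..SCHED_LOAD_SCALE] into the range [0..capacity_orig], and cap the
 * result at capacity_orig when the utilization overshoots.
 */
static unsigned long sketch_cpu_usage(unsigned long utilization,
				      unsigned long capacity_orig)
{
	if (utilization >= SCHED_LOAD_SCALE)
		return capacity_orig;

	return (utilization * capacity_orig) >> SCHED_LOAD_SHIFT;
}
```

With capacity_orig = 1024 on both CPUs, a CPU0 utilization of 1239 (about
121%) is capped to 1024 and a CPU1 utilization of 819 (about 80%) stays at
819, so the group total remains below the group capacity of 2048.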

* Re: [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant
  2015-03-03 12:51   ` Dietmar Eggemann
@ 2015-03-04  7:54     ` Vincent Guittot
  0 siblings, 0 replies; 68+ messages in thread
From: Vincent Guittot @ 2015-03-04  7:54 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, linux-kernel, preeti, Morten Rasmussen, kamalesh,
	riel, efault, nicolas.pitre, linaro-kernel, Paul Turner,
	Ben Segall

On 3 March 2015 at 13:51, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 27/02/15 15:54, Vincent Guittot wrote:
>> From: Morten Rasmussen <morten.rasmussen@arm.com>
>>
>> Apply frequency scale-invariance correction factor to usage tracking.
>> Each segment of the running_load_avg geometric series is now scaled by the
>
> The same comment I sent out on [PATCH v10 07/11]:
>
> The use of underscores in running_load_avg implies to me that this is a
> data member of struct sched_avg or something similar. But there is no
> running_load_avg in the current code. However, I can see that
> sched_avg::*running_avg_sum* (and therefore
> cfs_rq::*utilization_load_avg*) are frequency scale invariant.

I have resent the patch with the typos corrected.

>
> -- Dietmar
>
>> current frequency so the utilization_avg_contrib of each entity will be
>> invariant with frequency scaling. As a result, utilization_load_avg, which is
>> the sum of utilization_avg_contrib, becomes invariant too. So the usage level
>> that is returned by get_cpu_usage stays relative to the max frequency, as does
>> the cpu_capacity which it is compared against.
>> Then, we want to keep the load tracking values in a 32-bit type, which implies
>> that the max value of {runnable|running}_avg_sum must be lower than
>> 2^32/88761=48388 (88761 is the max weight of a task). As LOAD_AVG_MAX = 47742,
>> arch_scale_freq_capacity must return a value less than
>> (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_CAPACITY_SCALE = 1024).
>> So we define the range to [0..SCHED_CAPACITY_SCALE] in order to avoid overflow.
>
>
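
The overflow bound in the quoted commit message can be re-derived with plain
integer arithmetic. The sketch below only restates the numbers given there
(88761, 47742, a shift of 10); the helper names are hypothetical and not part
of the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10
#define MAX_TASK_WEIGHT 88761ULL  /* max task weight quoted in the commit message */
#define LOAD_AVG_MAX    47742ULL  /* value quoted in the commit message */

/* Largest *_avg_sum whose product with the max weight still fits in 32 bits. */
static uint64_t max_avg_sum(void)
{
	return (1ULL << 32) / MAX_TASK_WEIGHT;
}

/*
 * Upper bound on the frequency scale factor, expressed in
 * SCHED_CAPACITY_SHIFT fixed point: (max_avg_sum / LOAD_AVG_MAX) << shift,
 * computed with the shift first to keep the fractional part.
 */
static uint64_t max_scale_factor(void)
{
	return (max_avg_sum() << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
}
```

Since 1024 is below the resulting bound, restricting the return value to
[0..1024] keeps LOAD_AVG_MAX times the max weight within 32 bits.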

* Re: [PATCH v10 00/11] sched: consolidation of CPU capacity and usage
  2015-02-27 15:54 [PATCH v10 00/11] sched: consolidation of CPU capacity and usage Vincent Guittot
                   ` (10 preceding siblings ...)
  2015-02-27 15:54 ` [PATCH v10 11/11] sched: move cfs task on a CPU with higher capacity Vincent Guittot
@ 2015-03-11 10:10 ` Vincent Guittot
  2015-04-02  1:47 ` Wanpeng Li
  12 siblings, 0 replies; 68+ messages in thread
From: Vincent Guittot @ 2015-03-11 10:10 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Morten Rasmussen, Kamalesh Babulal
  Cc: Rik van Riel, Mike Galbraith, Nicolas Pitre, Dietmar Eggemann,
	Linaro Kernel Mailman List, Vincent Guittot

On 27 February 2015 at 16:54, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
> This patchset consolidates several changes in the capacity and the usage
> tracking of the CPU. It provides a frequency invariant metric of the usage of
> CPUs and generally improves the accuracy of load/usage tracking in the
> scheduler. The frequency invariant metric is the foundation required for the
> consolidation of cpufreq and implementation of a fully invariant load tracking.
> These are currently WIP and require several changes to the load balancer
> (including how it will use and interprets load and capacity metrics) and
> extensive validation. The frequency invariance is done with
> arch_scale_freq_capacity and this patchset doesn't provide the backends of
> the function which are architecture dependent.
>
> As discussed at LPC14, Morten and I have consolidated our changes into a single
> patchset to make it easier to review and merge.
>
> During load balance, the scheduler evaluates the number of tasks that a group
> of CPUs can handle. The current method assumes that tasks have a fix load of
> SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_CAPACITY_SCALE.
> This assumption generates wrong decision by creating ghost cores or by
> removing real ones when the original capacity of CPUs is different from the
> default SCHED_CAPACITY_SCALE. With this patch set, we don't try anymore to
> evaluate the number of available cores based on the group_capacity but instead
> we evaluate the usage of a group and compare it with its capacity.
>
> This patchset mainly replaces the old capacity_factor method by a new one and
> keeps the general policy almost unchanged. These new metrics will be also used
> in later patches.
>
> The CPU usage is based on a running time tracking version of the current
> implementation of the load average tracking. I also have a version that is
> based on the new implementation proposal [1] but I haven't provided the patches
> and results as [1] is still under review. I can provide changes on top of [1] to
> change how CPU usage is computed and to adapt to the new mechanism.
>
> Change since V9
>  - add a dedicated patch for removing unused capacity_orig
>  - update some comments and fix typo
>  - change the condition for actively migrating task on CPU with higher capacity
>
> Change since V8
>  - reorder patches
>
> Change since V7
>  - add freq invariance for usage tracking
>  - add freq invariance for scale_rt
>  - update comments and commits' message
>  - fix init of utilization_avg_contrib
>  - fix prefer_sibling
>
> Change since V6
>  - add group usage tracking
>  - fix some commits' messages
>  - minor fix like comments and argument order
>
> Change since V5
>  - remove patches that have been merged since v5 : patches 01, 02, 03, 04, 05, 07
>  - update commit log and add more details on the purpose of the patches
>  - fix/remove useless code with the rebase on patchset [2]
>  - remove capacity_orig in sched_group_capacity as it is not used
>  - move code in the right patch
>  - add some helper function to factorize code
>
> Change since V4
>  - rebase to manage conflicts with changes in selection of busiest group
>
> Change since V3:
>  - add usage_avg_contrib statistic which sums the running time of tasks on a rq
>  - use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization
>  - fix replacement power by capacity
>  - update some comments
>
> Change since V2:
>  - rebase on top of capacity renaming
>  - fix wake_affine statistic update
>  - rework nohz_kick_needed
>  - optimize the active migration of a task from CPU with reduced capacity
>  - rename group_activity by group_utilization and remove unused total_utilization
>  - repair SD_PREFER_SIBLING and use it for SMT level
>  - reorder patchset to gather patches with same topics
>
> Change since V1:
>  - add 3 fixes
>  - correct some commit messages
>  - replace capacity computation by activity
>  - take into account current cpu capacity
>
> [1] https://lkml.org/lkml/2014/10/10/131
> [2] https://lkml.org/lkml/2014/7/25/589
>
> Morten Rasmussen (2):
>   sched: Track group sched_entity usage contributions
>   sched: Make sched entity usage tracking scale-invariant
>
> Vincent Guittot (9):
>   sched: add utilization_avg_contrib
>   sched: remove frequency scaling from cpu_capacity
>   sched: make scale_rt invariant with frequency
>   sched: add per rq cpu_capacity_orig
>   sched: get CPU's usage statistic
>   sched: replace capacity_factor by usage
>   sched; remove unused capacity_orig from
>   sched: add SD_PREFER_SIBLING for SMT level
>   sched: move cfs task on a CPU with higher capacity
>
>  include/linux/sched.h |  21 ++-
>  kernel/sched/core.c   |  15 +--
>  kernel/sched/debug.c  |  12 +-
>  kernel/sched/fair.c   | 366 +++++++++++++++++++++++++++++++-------------------
>  kernel/sched/sched.h  |  15 ++-
>  5 files changed, 271 insertions(+), 158 deletions(-)
>
> --
> 1.9.1
>

Hi Peter,

Gentle reminder ping

* Re: [PATCH v10 10/11] sched: add SD_PREFER_SIBLING for SMT level
  2015-03-03  8:38     ` Vincent Guittot
@ 2015-03-23  9:11       ` Peter Zijlstra
  2015-03-23  9:59         ` Preeti U Murthy
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2015-03-23  9:11 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Srikar Dronamraju, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Morten Rasmussen, Kamalesh Babulal, Rik van Riel, Mike Galbraith,
	Nicolas Pitre, Dietmar Eggemann, Linaro Kernel Mailman List

On Tue, Mar 03, 2015 at 09:38:11AM +0100, Vincent Guittot wrote:

> > Prefer siblings logic dates back to https://lkml.org/lkml/2009/8/27/210
> > and only used in update_sd_lb_stats() where we have
> >
> > if (child && child->flags & SD_PREFER_SIBLING)
> >          prefer_sibling = 1;
> >
> > However what confuses me is why should we even look at a child domain's
> > flag to balance tasks across the current sched domain? Why cant we just
> > set and use a sd flag at current level than to look at child domain
> > flag?
> 
> Peter,
> have you got some insight about the reason ?

Yeah, because it makes sense that way? ;-)

We want to move things to the child's sibling, not the parent's
sibling. We further need to have a child for this to make sense.



* Re: [PATCH v10 10/11] sched: add SD_PREFER_SIBLING for SMT level
  2015-03-23  9:11       ` Peter Zijlstra
@ 2015-03-23  9:59         ` Preeti U Murthy
  0 siblings, 0 replies; 68+ messages in thread
From: Preeti U Murthy @ 2015-03-23  9:59 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot
  Cc: Srikar Dronamraju, Ingo Molnar, linux-kernel, Morten Rasmussen,
	Kamalesh Babulal, Rik van Riel, Mike Galbraith, Nicolas Pitre,
	Dietmar Eggemann, Linaro Kernel Mailman List



On 03/23/2015 02:41 PM, Peter Zijlstra wrote:
> On Tue, Mar 03, 2015 at 09:38:11AM +0100, Vincent Guittot wrote:
> 
>>> Prefer siblings logic dates back to https://lkml.org/lkml/2009/8/27/210
>>> and only used in update_sd_lb_stats() where we have
>>>
>>> if (child && child->flags & SD_PREFER_SIBLING)
>>>          prefer_sibling = 1;
>>>
>>> However what confuses me is why should we even look at a child domain's
>>> flag to balance tasks across the current sched domain? Why cant we just
>>> set and use a sd flag at current level than to look at child domain
>>> flag?
>>
>> Peter,
>> have you got some insight about the reason ?
> 
> Yeah, because it makes sense that way? ;-)
> 
> We want to move things to the child's sibling, not the parent's
> sibling. We further need to have a child for this to make sense.
> 
> 

+1. The above is precisely why we need this patch.

Regards
Preeti U Murthy


* Re: [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant
  2015-02-27 15:54 ` [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant Vincent Guittot
  2015-03-03 12:51   ` Dietmar Eggemann
  2015-03-04  7:46   ` Vincent Guittot
@ 2015-03-23 13:19   ` Peter Zijlstra
  2015-03-24 10:00     ` Vincent Guittot
  2015-03-27 11:43     ` [tip:sched/core] sched: Optimize freq invariant accounting tip-bot for Peter Zijlstra
  2 siblings, 2 replies; 68+ messages in thread
From: Peter Zijlstra @ 2015-03-23 13:19 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh, riel,
	efault, nicolas.pitre, dietmar.eggemann, linaro-kernel,
	Paul Turner, Ben Segall

On Fri, Feb 27, 2015 at 04:54:07PM +0100, Vincent Guittot wrote:

> +	unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);

> +			sa->running_avg_sum += delta_w * scale_freq
> +				>> SCHED_CAPACITY_SHIFT;

so the only thing that could be improved is somehow making this
multiplication go away when the arch doesn't implement the function.

But I'm not sure how to do that without #ifdef.

Maybe a little something like so then... that should make the compiler
get rid of those multiplications unless the arch needs them.


--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2484,8 +2484,6 @@ static u32 __compute_runnable_contrib(u6
 	return contrib + runnable_avg_yN_sum[n];
 }
 
-unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
-
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
@@ -6015,11 +6013,6 @@ static unsigned long default_scale_capac
 	return SCHED_CAPACITY_SCALE;
 }
 
-unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
-{
-	return default_scale_capacity(sd, cpu);
-}
-
 static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
 	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1387,7 +1387,14 @@ static inline int hrtick_enabled(struct
 
 #ifdef CONFIG_SMP
 extern void sched_avg_update(struct rq *rq);
-extern unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
+#ifndef arch_scale_freq_capacity
+static __always_inline
+unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+	return SCHED_CAPACITY_SCALE;
+}
+#endif
 
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {

* Re: [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant
  2015-03-23 13:19   ` [PATCH v10 04/11] " Peter Zijlstra
@ 2015-03-24 10:00     ` Vincent Guittot
  2015-03-25 17:33       ` Peter Zijlstra
  2015-03-27 11:43     ` [tip:sched/core] sched: Optimize freq invariant accounting tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2015-03-24 10:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy, Morten Rasmussen,
	Kamalesh Babulal, Rik van Riel, Mike Galbraith, Nicolas Pitre,
	Dietmar Eggemann, Linaro Kernel Mailman List, Paul Turner,
	Ben Segall

On 23 March 2015 at 14:19, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Feb 27, 2015 at 04:54:07PM +0100, Vincent Guittot wrote:
>
>> +     unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
>
>> +                     sa->running_avg_sum += delta_w * scale_freq
>> +                             >> SCHED_CAPACITY_SHIFT;
>
> so the only thing that could be improved is somehow making this
> multiplication go away when the arch doesn't implement the function.
>
> But I'm not sure how to do that without #ifdef.
>
> Maybe a little something like so then... that should make the compiler
> get rid of those multiplications unless the arch needs them.

Yes, it removes the useless multiplication when an arch doesn't use it.
It also adds a constraint on the arch side, which has to define
arch_scale_freq_capacity like below:

#define arch_scale_freq_capacity xxx_arch_scale_freq_capacity
with xxx_arch_scale_freq_capacity an architecture specific function
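
A minimal standalone sketch of that override pattern, assuming a made-up
DEMO_ARCH_HAS_FREQ_SCALING switch and a made-up demo_arch_scale_freq_capacity
function that stand in for what an architecture header would provide. When
the macro is not defined, the inline default returns a constant and the
compiler is free to fold the multiply and shift away:

```c
#include <assert.h>
#include <stddef.h>

#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1UL << SCHED_CAPACITY_SHIFT)

struct sched_domain;  /* opaque here; only passed through */

/* Hypothetical arch side: enabled by building with -DDEMO_ARCH_HAS_FREQ_SCALING. */
#ifdef DEMO_ARCH_HAS_FREQ_SCALING
static unsigned long demo_arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
{
	(void)sd;
	(void)cpu;
	return SCHED_CAPACITY_SCALE / 2;  /* pretend the CPU runs at half speed */
}
#define arch_scale_freq_capacity demo_arch_scale_freq_capacity
#endif

/* Generic side: the default is used only when no arch defined the macro. */
#ifndef arch_scale_freq_capacity
static inline unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
{
	(void)sd;
	(void)cpu;
	return SCHED_CAPACITY_SCALE;
}
#endif

/* Scale a time delta by the current frequency factor. */
static unsigned long scale_delta(unsigned long delta_w, int cpu)
{
	return (delta_w * arch_scale_freq_capacity(NULL, cpu)) >> SCHED_CAPACITY_SHIFT;
}
```

In the default build, scale_delta() returns its input unchanged; with the
demo switch enabled, it returns half of it.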

If it sounds acceptable, I can update the patch with your proposal?

Vincent
>
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2484,8 +2484,6 @@ static u32 __compute_runnable_contrib(u6
>         return contrib + runnable_avg_yN_sum[n];
>  }
>
> -unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
> -
>  /*
>   * We can represent the historical contribution to runnable average as the
>   * coefficients of a geometric series.  To do this we sub-divide our runnable
> @@ -6015,11 +6013,6 @@ static unsigned long default_scale_capac
>         return SCHED_CAPACITY_SCALE;
>  }
>
> -unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> -{
> -       return default_scale_capacity(sd, cpu);
> -}
> -
>  static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>  {
>         if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1387,7 +1387,14 @@ static inline int hrtick_enabled(struct
>
>  #ifdef CONFIG_SMP
>  extern void sched_avg_update(struct rq *rq);
> -extern unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
> +
> +#ifndef arch_scale_freq_capacity
> +static __always_inline
> +unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> +{
> +       return SCHED_CAPACITY_SCALE;
> +}
> +#endif
>
>  static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
>  {

* Re: [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant
  2015-03-24 10:00     ` Vincent Guittot
@ 2015-03-25 17:33       ` Peter Zijlstra
  2015-03-25 18:08         ` Vincent Guittot
  2015-04-02 16:53         ` Morten Rasmussen
  0 siblings, 2 replies; 68+ messages in thread
From: Peter Zijlstra @ 2015-03-25 17:33 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy, Morten Rasmussen,
	Kamalesh Babulal, Rik van Riel, Mike Galbraith, Nicolas Pitre,
	Dietmar Eggemann, Linaro Kernel Mailman List, Paul Turner,
	Ben Segall

On Tue, Mar 24, 2015 at 11:00:57AM +0100, Vincent Guittot wrote:
> On 23 March 2015 at 14:19, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Feb 27, 2015 at 04:54:07PM +0100, Vincent Guittot wrote:
> >
> >> +     unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
> >
> >> +                     sa->running_avg_sum += delta_w * scale_freq
> >> +                             >> SCHED_CAPACITY_SHIFT;
> >
> > so the only thing that could be improved is somehow making this
> > multiplication go away when the arch doesn't implement the function.
> >
> > But I'm not sure how to do that without #ifdef.
> >
> > Maybe a little something like so then... that should make the compiler
> > get rid of those multiplications unless the arch needs them.
> 
> yes, it removes useless multiplication when not used by an arch.
> It also adds a constraint on the arch side which have to define
> arch_scale_freq_capacity like below:
> 
> #define arch_scale_freq_capacity xxx_arch_scale_freq_capacity
> with xxx_arch_scale_freq_capacity an architecture specific function

Yeah, but it not being weak should make that a compile time warn/fail,
which should be pretty easy to deal with.

> If it sounds acceptable i can update the patch with your proposal ?

I'll stick it at the end; I just wanted to float the patch to see if
people had better solutions.

* Re: [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant
  2015-03-25 17:33       ` Peter Zijlstra
@ 2015-03-25 18:08         ` Vincent Guittot
  2015-03-26 17:38           ` Morten Rasmussen
  2015-04-02 16:53         ` Morten Rasmussen
  1 sibling, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2015-03-25 18:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy, Morten Rasmussen,
	Kamalesh Babulal, Rik van Riel, Mike Galbraith, Nicolas Pitre,
	Dietmar Eggemann, Linaro Kernel Mailman List, Paul Turner,
	Ben Segall

On 25 March 2015 at 18:33, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Mar 24, 2015 at 11:00:57AM +0100, Vincent Guittot wrote:
>> On 23 March 2015 at 14:19, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Fri, Feb 27, 2015 at 04:54:07PM +0100, Vincent Guittot wrote:
>> >
>> >> +     unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
>> >
>> >> +                     sa->running_avg_sum += delta_w * scale_freq
>> >> +                             >> SCHED_CAPACITY_SHIFT;
>> >
>> > so the only thing that could be improved is somehow making this
>> > multiplication go away when the arch doesn't implement the function.
>> >
>> > But I'm not sure how to do that without #ifdef.
>> >
>> > Maybe a little something like so then... that should make the compiler
>> > get rid of those multiplications unless the arch needs them.
>>
>> yes, it removes useless multiplication when not used by an arch.
>> It also adds a constraint on the arch side which have to define
>> arch_scale_freq_capacity like below:
>>
>> #define arch_scale_freq_capacity xxx_arch_scale_freq_capacity
>> with xxx_arch_scale_freq_capacity an architecture specific function
>
> Yeah, but it not being weak should make that a compile time warn/fail,
> which should be pretty easy to deal with.
>
>> If it sounds acceptable i can update the patch with your proposal ?
>
> I'll stick it at the end; I just wanted to float the patch to see if
> people had better solutions.

OK. All the other methods that I have tried lost the optimization
when the default arch_scale_freq_capacity was used.

* Re: [PATCH v10 10/11] sched: add SD_PREFER_SIBLING for SMT level
  2015-02-27 15:54 ` [PATCH v10 10/11] sched: add SD_PREFER_SIBLING for SMT level Vincent Guittot
  2015-03-02 11:52   ` Srikar Dronamraju
@ 2015-03-26 10:55   ` Peter Zijlstra
  2015-03-26 12:03     ` Preeti U Murthy
  2015-03-27 11:42   ` [tip:sched/core] sched: Add " tip-bot for Vincent Guittot
  2 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2015-03-26 10:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh, riel,
	efault, nicolas.pitre, dietmar.eggemann, linaro-kernel

On Fri, Feb 27, 2015 at 04:54:13PM +0100, Vincent Guittot wrote:
> Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
> the scheduler will put at least 1 task per core.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>

Preeti, what benchmarks did you use on power8 smt to verify performance?

I'm seeing a slight but statistically significant regression on my
ivb-ep kernel build when I match the build concurrency to my core count.

/me goes to run (and install, it's a fairly new box) more benches.

* Re: [PATCH v10 10/11] sched: add SD_PREFER_SIBLING for SMT level
  2015-03-26 10:55   ` Peter Zijlstra
@ 2015-03-26 12:03     ` Preeti U Murthy
  0 siblings, 0 replies; 68+ messages in thread
From: Preeti U Murthy @ 2015-03-26 12:03 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot
  Cc: mingo, linux-kernel, Morten.Rasmussen, kamalesh, riel, efault,
	nicolas.pitre, dietmar.eggemann, linaro-kernel

On 03/26/2015 04:25 PM, Peter Zijlstra wrote:
> On Fri, Feb 27, 2015 at 04:54:13PM +0100, Vincent Guittot wrote:
>> Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
>> the scheduler will put at least 1 task per core.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
> 
> Preeti, what benchmarks did you use on power8 smt to verify performance?
> 
> I'm seeing a slight but statistically significant regression on my
> ivb-ep kernel build when I match the build concurrency to my core count.
> 
> /me goes run (and install, its a fairly new box) more benches.

I use the ebizzy benchmark to test for performance regressions. But for this
particular patch, I recollect that I did a code walkthrough to verify
the correctness of the patch and not a performance test. Let me run
ebizzy against this patch and verify.

Regards
Preeti U Murthy
> 


* Re: [PATCH v10 11/11] sched: move cfs task on a CPU with higher capacity
  2015-02-27 15:54 ` [PATCH v10 11/11] sched: move cfs task on a CPU with higher capacity Vincent Guittot
@ 2015-03-26 14:19   ` Dietmar Eggemann
  2015-03-26 15:43     ` Vincent Guittot
  2015-03-27 11:42   ` [tip:sched/core] sched: Move CFS tasks to CPUs " tip-bot for Vincent Guittot
  1 sibling, 1 reply; 68+ messages in thread
From: Dietmar Eggemann @ 2015-03-26 14:19 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, preeti,
	Morten Rasmussen, kamalesh
  Cc: riel, efault, nicolas.pitre, linaro-kernel

On 27/02/15 15:54, Vincent Guittot wrote:
> When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining
> capacity for CFS tasks can be significantly reduced. Once we detect such a
> situation by comparing cpu_capacity_orig and cpu_capacity, we trigger an idle
> load balance to check if it's worth moving its tasks to an idle CPU.
> It's worth trying to move the task before the CPU is fully utilized to
> minimize the preemption by irq or RT tasks.
>
> Once the idle load_balance has selected the busiest CPU, it will look for an
> active load balance for only two cases:
> - there is only 1 task on the busiest CPU.
> - we haven't been able to move a task of the busiest rq.
>
> A CPU with a reduced capacity is included in the 1st case, and it's worth
> actively migrating its task if the idle CPU has got more available capacity for
> CFS tasks. This test has been added in need_active_balance.
>
> As a sidenote, this will not generate more spurious ilb because we already
> trigger an ilb if there is more than 1 busy cpu. If this cpu is the only one that
> has a task, we will trigger the ilb once for migrating the task.
>
> The nohz_kick_needed function has been cleaned up a bit while adding the new
> test.
>
> env.src_cpu and env.src_rq must be set unconditionally because they are used
> in need_active_balance which is called even if busiest->nr_running equals 1.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   kernel/sched/fair.c | 69 ++++++++++++++++++++++++++++++++++++-----------------
>   1 file changed, 47 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7420d21..e70c315 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6855,6 +6855,19 @@ static int need_active_balance(struct lb_env *env)
>   			return 1;
>   	}
>
> +	/*
> +	 * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.
> +	 * It's worth migrating the task if the src_cpu's capacity is reduced
> +	 * because of other sched_class or IRQs if more capacity stays
> +	 * available on dst_cpu.
> +	 */
> +	if ((env->idle != CPU_NOT_IDLE) &&
> +	    (env->src_rq->cfs.h_nr_running == 1)) {
> +		if ((check_cpu_capacity(env->src_rq, sd)) &&
> +		    (capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100))
> +			return 1;
> +	}
> +
>   	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
>   }
>
> @@ -6954,6 +6967,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>
>   	schedstat_add(sd, lb_imbalance[idle], env.imbalance);
>
> +	env.src_cpu = busiest->cpu;

Isn't this 'env.src_cpu = busiest->cpu;' or 'env.src_cpu = 
cpu_of(busiest);' already needed due to the existing ASYM_PACKING check 
in need_active_balance() 'if ( ... && env->src_cpu > env->dst_cpu)' for 
CPU_NEWLY_IDLE? Otherwise like you said, in these 'busiest->nr_running 
equals 1' instances, env->src_cpu is un-initialized.

> +	env.src_rq = busiest;
> +
>   	ld_moved = 0;
>   	if (busiest->nr_running > 1) {
>   		/*
> @@ -6963,8 +6979,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>   		 * correctly treated as an imbalance.
>   		 */
>   		env.flags |= LBF_ALL_PINNED;
> -		env.src_cpu   = busiest->cpu;
> -		env.src_rq    = busiest;
>   		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
>
>   more_balance:

[...]
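
The capacity comparison added to need_active_balance() in the quoted hunk can
be tried out in isolation. A minimal sketch, with a hypothetical helper name;
the check_cpu_capacity() and idle-state conditions are left out, and the
imbalance_pct of 125 used in the note below is only an assumed example value:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: true when the destination CPU offers more capacity
 * for CFS tasks than the source CPU, by a margin of imbalance_pct percent.
 * Mirrors the comparison
 *   capacity_of(src) * imbalance_pct < capacity_of(dst) * 100
 */
static bool dst_has_more_capacity(unsigned long src_capacity,
				  unsigned long dst_capacity,
				  unsigned int imbalance_pct)
{
	return src_capacity * imbalance_pct < dst_capacity * 100;
}
```

With imbalance_pct = 125 and an idle destination at capacity 1024, a source
CPU reduced to 800 passes the test (100000 < 102400), whereas a source CPU at
900 does not (112500 >= 102400).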


* Re: [PATCH v10 11/11] sched: move cfs task on a CPU with higher capacity
  2015-03-26 14:19   ` Dietmar Eggemann
@ 2015-03-26 15:43     ` Vincent Guittot
  0 siblings, 0 replies; 68+ messages in thread
From: Vincent Guittot @ 2015-03-26 15:43 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, linux-kernel, preeti, Morten Rasmussen, kamalesh,
	riel, efault, nicolas.pitre, linaro-kernel

On 26 March 2015 at 15:19, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 27/02/15 15:54, Vincent Guittot wrote:
>>
>> When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining
>> capacity for CFS tasks can be significantly reduced. Once we detect such a
>> situation by comparing cpu_capacity_orig and cpu_capacity, we trigger an idle
>> load balance to check if it's worth moving its tasks to an idle CPU.
>> It's worth trying to move the task before the CPU is fully utilized to
>> minimize the preemption by irq or RT tasks.
>>
>> Once the idle load_balance has selected the busiest CPU, it will look for an
>> active load balance for only two cases:
>> - there is only 1 task on the busiest CPU.
>> - we haven't been able to move a task of the busiest rq.
>>
>> A CPU with a reduced capacity is included in the 1st case, and it's worth
>> actively migrating its task if the idle CPU has got more available capacity for
>> CFS tasks. This test has been added in need_active_balance.
>>
>> As a sidenote, this will not generate more spurious ilb because we already
>> trigger an ilb if there is more than 1 busy cpu. If this cpu is the only one that
>> has a task, we will trigger the ilb once for migrating the task.
>>
>> The nohz_kick_needed function has been cleaned up a bit while adding the new
>> test.
>>
>> env.src_cpu and env.src_rq must be set unconditionally because they are used
>> in need_active_balance which is called even if busiest->nr_running equals 1.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>   kernel/sched/fair.c | 69 ++++++++++++++++++++++++++++++++++++-----------------
>>   1 file changed, 47 insertions(+), 22 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7420d21..e70c315 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6855,6 +6855,19 @@ static int need_active_balance(struct lb_env *env)
>>                         return 1;
>>         }
>>
>> +       /*
>> +        * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.
>> +        * It's worth migrating the task if the src_cpu's capacity is reduced
>> +        * because of other sched_class or IRQs if more capacity stays
>> +        * available on dst_cpu.
>> +        */
>> +       if ((env->idle != CPU_NOT_IDLE) &&
>> +           (env->src_rq->cfs.h_nr_running == 1)) {
>> +               if ((check_cpu_capacity(env->src_rq, sd)) &&
>> +                   (capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100))
>> +                       return 1;
>> +       }
>> +
>>         return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
>>   }
>>
>> @@ -6954,6 +6967,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>
>>         schedstat_add(sd, lb_imbalance[idle], env.imbalance);
>>
>> +       env.src_cpu = busiest->cpu;
>
>
> Isn't this 'env.src_cpu = busiest->cpu;' or 'env.src_cpu = cpu_of(busiest);'
> already needed due to the existing ASYM_PACKING check in
> need_active_balance() 'if ( ... && env->src_cpu > env->dst_cpu)' for
> CPU_NEWLY_IDLE? Otherwise like you said, in these 'busiest->nr_running
> equals 1' instances, env->src_cpu is un-initialized.

Yes, I sent a fix for that purpose some time ago:
https://lkml.org/lkml/2013/2/12/158
but it has not gone further than the mailing list.

AFAICT, SD_ASYM_PACKING can't trig an active load balance on the cpu
with the lowest id without this fix as src_cpu is initialized to 0
which implies that 'env->src_cpu > env->dst_cpu' is always false
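
To illustrate the point, here is a minimal userspace sketch. It assumes a
cut-down stand-in for struct lb_env that holds only the two CPU ids; the
struct name and the helper asym_packing_wants_active_balance() are made up
for the example and model only the 'src_cpu > dst_cpu' part of the check:

```c
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for struct lb_env. */
struct lb_env_sketch {
	int src_cpu;	/* stays 0 when load_balance() skips the assignment */
	int dst_cpu;
};

/* Models only the 'env->src_cpu > env->dst_cpu' part of the test. */
static bool asym_packing_wants_active_balance(const struct lb_env_sketch *env)
{
	return env->src_cpu > env->dst_cpu;
}
```

With src_cpu left at 0, the comparison cannot be true for any valid
(non-negative) dst_cpu, so the ASYM_PACKING path is never taken.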

Vincent

>
>> +       env.src_rq = busiest;
>> +
>>         ld_moved = 0;
>>         if (busiest->nr_running > 1) {
>>                 /*
>> @@ -6963,8 +6979,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>                  * correctly treated as an imbalance.
>>                  */
>>                 env.flags |= LBF_ALL_PINNED;
>> -               env.src_cpu   = busiest->cpu;
>> -               env.src_rq    = busiest;
>>                 env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
>>
>>   more_balance:
>
>
> [...]
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant
  2015-03-25 18:08         ` Vincent Guittot
@ 2015-03-26 17:38           ` Morten Rasmussen
  2015-03-26 17:40             ` Morten Rasmussen
                               ` (3 more replies)
  0 siblings, 4 replies; 68+ messages in thread
From: Morten Rasmussen @ 2015-03-26 17:38 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Kamalesh Babulal, Rik van Riel, Mike Galbraith, nicolas.pitre,
	Dietmar Eggemann, Linaro Kernel Mailman List, Paul Turner,
	Ben Segall

On Wed, Mar 25, 2015 at 06:08:42PM +0000, Vincent Guittot wrote:
> On 25 March 2015 at 18:33, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Mar 24, 2015 at 11:00:57AM +0100, Vincent Guittot wrote:
> >> On 23 March 2015 at 14:19, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > On Fri, Feb 27, 2015 at 04:54:07PM +0100, Vincent Guittot wrote:
> >> >
> >> >> +     unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
> >> >
> >> >> +                     sa->running_avg_sum += delta_w * scale_freq
> >> >> +                             >> SCHED_CAPACITY_SHIFT;
> >> >
> >> > so the only thing that could be improved is somehow making this
> >> > multiplication go away when the arch doesn't implement the function.
> >> >
> >> > But I'm not sure how to do that without #ifdef.
> >> >
> >> > Maybe a little something like so then... that should make the compiler
> >> > get rid of those multiplications unless the arch needs them.
> >>
> >> yes, it removes useless multiplication when not used by an arch.
> >> It also adds a constraint on the arch side which have to define
> >> arch_scale_freq_capacity like below:
> >>
> >> #define arch_scale_freq_capacity xxx_arch_scale_freq_capacity
> >> with xxx_arch_scale_freq_capacity an architecture specific function
> >
> > Yeah, but it not being weak should make that a compile time warn/fail,
> > which should be pretty easy to deal with.
> >
> >> If it sounds acceptable i can update the patch with your proposal ?
> >
> > I'll stick it to the end, I just wanted to float to patch to see if
> > people had better solutions.
> 
> ok. all other methods that i have tried, was removing the optimization
> when default arch_scale_freq_capacity was used

Another potential solution is to stay with weak functions but move the
multiplication and shift into the arch_scale_*() functions by passing
the value we want to scale into the arch_scale_*() function. That way we
can completely avoid multiplication and shift in the default case (no
arch_scale*() implementations), which is better than what we have today.
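
As a rough userspace sketch of the two shapes being compared (all function
names below are invented for the comparison, and the per-CPU factor table is
hypothetical):

```c
#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/* Current shape: return a factor; every caller multiplies and shifts. */
static unsigned long scale_factor_default(int cpu)
{
	(void)cpu;
	return SCHED_CAPACITY_SCALE;
}

static unsigned long caller_scales(int cpu, unsigned long value)
{
	return (value * scale_factor_default(cpu)) >> SCHED_CAPACITY_SHIFT;
}

/* Proposed shape: pass the value in; the default returns it untouched. */
static unsigned long scale_value_default(int cpu, unsigned long value)
{
	(void)cpu;
	return value;
}

/* What an arch override might look like, with a made-up per-CPU factor. */
static unsigned long scale_value_arch(int cpu, unsigned long value)
{
	static const unsigned long factor[] = { 1024, 512 }; /* hypothetical */

	return (value * factor[cpu]) >> SCHED_CAPACITY_SHIFT;
}
```

Both default paths give the same result for ordinary inputs; the second one
just gets there without the multiplication and the shift.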

The only downside is that for frequency invariance we need three
arch_scale_freq_capacity() calls instead of two.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant
  2015-03-26 17:38           ` Morten Rasmussen
@ 2015-03-26 17:40             ` Morten Rasmussen
  2015-03-26 17:46             ` [PATCH 1/2] sched: Change arch_scale_*() functions to scale input factor Morten Rasmussen
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 68+ messages in thread
From: Morten Rasmussen @ 2015-03-26 17:40 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Kamalesh Babulal, Rik van Riel, Mike Galbraith, nicolas.pitre,
	Dietmar Eggemann, Linaro Kernel Mailman List, Paul Turner,
	Ben Segall

On Thu, Mar 26, 2015 at 05:38:45PM +0000, Morten Rasmussen wrote:
> The only downside is that for frequency invariance we need three
> arch_scale_freq_capacity() calls instead of two.

It should have been instead of one...

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 1/2] sched: Change arch_scale_*() functions to scale input factor
  2015-03-26 17:38           ` Morten Rasmussen
  2015-03-26 17:40             ` Morten Rasmussen
@ 2015-03-26 17:46             ` Morten Rasmussen
  2015-03-26 17:46               ` [PATCH 2/2] sched: Make sched entity usage tracking scale-invariant Morten Rasmussen
  2015-03-26 17:47             ` [PATCH v10 04/11] " Peter Zijlstra
  2015-03-27  8:17             ` Vincent Guittot
  3 siblings, 1 reply; 68+ messages in thread
From: Morten Rasmussen @ 2015-03-26 17:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Kamalesh Babulal, Rik van Riel, Mike Galbraith, nicolas.pitre,
	Dietmar Eggemann, Linaro Kernel Mailman List, Paul Turner,
	Ben Segall

The arch_scale_{freq, cpu}_capacity() functions currently return a
scaling factor that needs to be multiplied and shifted by the caller. The
default weak functions don't result in any scaling, but the
multiplication and shift are still done. By moving the multiplication and
shift into the arch_scale*() functions instead, the weak implementation
can just return the input value and avoid the unnecessary multiplication
and shift.

While we are at it, we can remove the sched_domain parameter by moving
the SD_SHARE_CPUCAPACITY check outside the weak arch_scale_cpu_capacity()
function.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 arch/arm/kernel/topology.c |  6 +++---
 kernel/sched/fair.c        | 34 ++++++++++++++--------------------
 2 files changed, 17 insertions(+), 23 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 08b7847..5328f79 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,9 +42,9 @@
  */
 static DEFINE_PER_CPU(unsigned long, cpu_scale);
 
-unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+unsigned long arch_scale_cpu_capacity(int cpu, unsigned long factor)
 {
-	return per_cpu(cpu_scale, cpu);
+	return (factor * per_cpu(cpu_scale, cpu)) >> SCHED_CAPACITY_SHIFT;
 }
 
 static void set_capacity_scale(unsigned int cpu, unsigned long capacity)
@@ -166,7 +166,7 @@ static void update_cpu_capacity(unsigned int cpu)
 	set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
 
 	pr_info("CPU%u: update cpu_capacity %lu\n",
-		cpu, arch_scale_cpu_capacity(NULL, cpu));
+		cpu, arch_scale_cpu_capacity(cpu, SCHED_CAPACITY_SCALE));
 }
 
 #else
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5080c0d..60c3172 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5958,27 +5958,19 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
 	return load_idx;
 }
 
-static unsigned long default_scale_capacity(struct sched_domain *sd, int cpu)
+static unsigned long default_scale_capacity(int cpu, unsigned long factor)
 {
-	return SCHED_CAPACITY_SCALE;
-}
-
-unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
-{
-	return default_scale_capacity(sd, cpu);
+	return factor;
 }
 
-static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+unsigned long __weak arch_scale_freq_capacity(int cpu, unsigned long factor)
 {
-	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
-		return sd->smt_gain / sd->span_weight;
-
-	return SCHED_CAPACITY_SCALE;
+	return default_scale_capacity(cpu, factor);
 }
 
-unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+unsigned long __weak arch_scale_cpu_capacity(int cpu, unsigned long factor)
 {
-	return default_scale_cpu_capacity(sd, cpu);
+	return default_scale_capacity(cpu, factor);
 }
 
 static unsigned long scale_rt_capacity(int cpu)
@@ -6020,12 +6012,14 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 	unsigned long capacity = SCHED_CAPACITY_SCALE;
 	struct sched_group *sdg = sd->groups;
 
-	if (sched_feat(ARCH_CAPACITY))
-		capacity *= arch_scale_cpu_capacity(sd, cpu);
-	else
-		capacity *= default_scale_cpu_capacity(sd, cpu);
-
-	capacity >>= SCHED_CAPACITY_SHIFT;
+	if (sched_feat(ARCH_CAPACITY)) {
+		capacity = arch_scale_cpu_capacity(cpu, capacity);
+	} else {
+		if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+			capacity = sd->smt_gain / sd->span_weight;
+		else
+			capacity = default_scale_capacity(cpu, capacity);
+	}
 
 	sdg->sgc->capacity_orig = capacity;
 
-- 
1.9.1


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 2/2] sched: Make sched entity usage tracking scale-invariant
  2015-03-26 17:46             ` [PATCH 1/2] sched: Change arch_scale_*() functions to scale input factor Morten Rasmussen
@ 2015-03-26 17:46               ` Morten Rasmussen
  0 siblings, 0 replies; 68+ messages in thread
From: Morten Rasmussen @ 2015-03-26 17:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Kamalesh Babulal, Rik van Riel, Mike Galbraith, nicolas.pitre,
	Dietmar Eggemann, Linaro Kernel Mailman List, Paul Turner,
	Ben Segall

Apply frequency scale-invariance correction factor to usage tracking.
Each segment of the running_load_avg geometric series is now scaled by the
current frequency so the utilization_avg_contrib of each entity will be
invariant with frequency scaling. As a result, utilization_load_avg, which is
the sum of utilization_avg_contrib, becomes invariant too. So the usage level
that is returned by get_cpu_usage stays relative to the max frequency, like
the cpu_capacity it is compared against.
We also want to keep the load tracking values in a 32-bit type, which implies
that the max value of {runnable|running}_avg_sum must be lower than
2^32/88761=48388 (88761 is the max weight of a task). As LOAD_AVG_MAX = 47742,
arch_scale_freq_capacity must return a value less than
(48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_CAPACITY_SCALE = 1024).
So we define the range to [0..SCHED_CAPACITY_SCALE] in order to avoid overflow.
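
The bounds above can be re-checked with a small userspace sketch. The helper
names are made up for the example; the constants are the ones quoted in this
message. Note that in integer C the shift has to be applied before the
division, otherwise 48388/47742 truncates to 1:

```c
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define LOAD_AVG_MAX		47742u	/* max of the geometric series */
#define MAX_TASK_WEIGHT		88761u	/* max weight of a task */

/* Largest sum that can be multiplied by the max weight within 32 bits. */
static uint32_t max_avg_sum(void)
{
	return (uint32_t)((UINT64_C(1) << 32) / MAX_TASK_WEIGHT);
}

/* Largest scale factor that keeps a scaled LOAD_AVG_MAX under max_avg_sum(). */
static uint32_t max_scale_factor(void)
{
	uint64_t scaled = (uint64_t)max_avg_sum() << SCHED_CAPACITY_SHIFT;

	return (uint32_t)(scaled / LOAD_AVG_MAX);
}

/* Does (LOAD_AVG_MAX * scale >> shift) * max weight still fit in a u32? */
static int fits_in_u32(uint32_t scale)
{
	uint64_t sum = ((uint64_t)LOAD_AVG_MAX * scale) >> SCHED_CAPACITY_SHIFT;

	return sum * MAX_TASK_WEIGHT <= UINT32_MAX;
}
```

Under these assumptions a factor of 1024 or 1037 still fits, while 1038
already overflows, which is why capping the range at SCHED_CAPACITY_SCALE is
safe.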

cc: Paul Turner <pjt@google.com>
cc: Ben Segall <bsegall@google.com>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 60c3172..c09df87 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2471,6 +2471,8 @@ static u32 __compute_runnable_contrib(u64 n)
 	return contrib + runnable_avg_yN_sum[n];
 }
 
+unsigned long __weak arch_scale_freq_capacity(int cpu, unsigned long factor);
+
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
@@ -2499,7 +2501,7 @@ static u32 __compute_runnable_contrib(u64 n)
  *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
  *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
  */
-static __always_inline int __update_entity_runnable_avg(u64 now,
+static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 							struct sched_avg *sa,
 							int runnable,
 							int running)
@@ -2542,7 +2544,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		if (runnable)
 			sa->runnable_avg_sum += delta_w;
 		if (running)
-			sa->running_avg_sum += delta_w;
+			sa->running_avg_sum +=
+					arch_scale_freq_capacity(cpu, delta_w);
 		sa->avg_period += delta_w;
 
 		delta -= delta_w;
@@ -2563,7 +2566,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		if (runnable)
 			sa->runnable_avg_sum += runnable_contrib;
 		if (running)
-			sa->running_avg_sum += runnable_contrib;
+			sa->running_avg_sum +=
+				arch_scale_freq_capacity(cpu, runnable_contrib);
 		sa->avg_period += runnable_contrib;
 	}
 
@@ -2571,7 +2575,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	if (runnable)
 		sa->runnable_avg_sum += delta;
 	if (running)
-		sa->running_avg_sum += delta;
+		sa->running_avg_sum += arch_scale_freq_capacity(cpu, delta);
 	sa->avg_period += delta;
 
 	return decayed;
@@ -2679,8 +2683,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
-			runnable);
+	__update_entity_runnable_avg(rq_clock_task(rq), cpu_of(rq), &rq->avg,
+			runnable, runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
@@ -2758,6 +2762,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	long contrib_delta, utilization_delta;
+	int cpu = cpu_of(rq_of(cfs_rq));
 	u64 now;
 
 	/*
@@ -2769,7 +2774,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	else
 		now = cfs_rq_clock_task(group_cfs_rq(se));
 
-	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
+	if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq,
 					cfs_rq->curr == se))
 		return;
 
-- 
1.9.1


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant
  2015-03-26 17:38           ` Morten Rasmussen
  2015-03-26 17:40             ` Morten Rasmussen
  2015-03-26 17:46             ` [PATCH 1/2] sched: Change arch_scale_*() functions to scale input factor Morten Rasmussen
@ 2015-03-26 17:47             ` Peter Zijlstra
  2015-03-26 17:51               ` Morten Rasmussen
  2015-03-27  8:17             ` Vincent Guittot
  3 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2015-03-26 17:47 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Kamalesh Babulal, Rik van Riel, Mike Galbraith, nicolas.pitre,
	Dietmar Eggemann, Linaro Kernel Mailman List, Paul Turner,
	Ben Segall

On Thu, Mar 26, 2015 at 05:38:45PM +0000, Morten Rasmussen wrote:
> Another potential solution is to stay with weak functions but move the
> multiplication and shift into the arch_scale_*() functions by passing
> the value we want to scale into the arch_scale_*() function. That way we
> can completely avoid multiplication and shift in the default case (no
> arch_scale*() implementations, which is better than what we have today.
> 
> The only downside is that for frequency invariance we need three
> arch_scale_freq_capacity() calls instead of two.

That would still result in unconditional function calls, which on some
archs are _more_ expensive than 64bit mults.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant
  2015-03-26 17:47             ` [PATCH v10 04/11] " Peter Zijlstra
@ 2015-03-26 17:51               ` Morten Rasmussen
  0 siblings, 0 replies; 68+ messages in thread
From: Morten Rasmussen @ 2015-03-26 17:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Kamalesh Babulal, Rik van Riel, Mike Galbraith, nicolas.pitre,
	Dietmar Eggemann, Linaro Kernel Mailman List, Paul Turner,
	Ben Segall

On Thu, Mar 26, 2015 at 05:47:00PM +0000, Peter Zijlstra wrote:
> On Thu, Mar 26, 2015 at 05:38:45PM +0000, Morten Rasmussen wrote:
> > Another potential solution is to stay with weak functions but move the
> > multiplication and shift into the arch_scale_*() functions by passing
> > the value we want to scale into the arch_scale_*() function. That way we
> > can completely avoid multiplication and shift in the default case (no
> > arch_scale*() implementations, which is better than what we have today.
> > 
> > The only downside is that for frequency invariance we need three
> > arch_scale_freq_capacity() calls instead of two.
> 
> That would still result in unconditional function calls, which on some
> archs are _more_ expensive than 64bit mults.

Right. Then it can only be preprocessor magic I think.
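
A minimal sketch of what such a preprocessor-based fallback could look like,
assuming a simplified one-argument signature (the functions discussed in this
thread also take a sched_domain pointer) and using xxx_ as a placeholder
prefix. scale_delta() is a made-up caller, not a kernel function:

```c
#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/*
 * An arch that wants frequency invariance would provide, in one of its
 * headers (xxx_ is a placeholder prefix):
 *
 *   #define arch_scale_freq_capacity xxx_arch_scale_freq_capacity
 */
#ifndef arch_scale_freq_capacity
/* Generic fallback: an inline constant, so no function call is emitted. */
static inline unsigned long arch_scale_freq_capacity(int cpu)
{
	(void)cpu;
	return SCHED_CAPACITY_SCALE;
}
#endif

/* Caller side: with the fallback, the multiply is by a constant power of two. */
static inline unsigned long scale_delta(int cpu, unsigned long delta)
{
	return (delta * arch_scale_freq_capacity(cpu)) >> SCHED_CAPACITY_SHIFT;
}
```

Because the fallback is visible to the compiler at the call site, it can
replace the multiplication with shifts, which a weak out-of-line function
would not allow.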

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant
  2015-03-26 17:38           ` Morten Rasmussen
                               ` (2 preceding siblings ...)
  2015-03-26 17:47             ` [PATCH v10 04/11] " Peter Zijlstra
@ 2015-03-27  8:17             ` Vincent Guittot
  2015-03-27  9:05               ` Vincent Guittot
  3 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2015-03-27  8:17 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Kamalesh Babulal, Rik van Riel, Mike Galbraith, nicolas.pitre,
	Dietmar Eggemann, Linaro Kernel Mailman List, Paul Turner,
	Ben Segall

On 26 March 2015 at 18:38, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Wed, Mar 25, 2015 at 06:08:42PM +0000, Vincent Guittot wrote:
>> On 25 March 2015 at 18:33, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Tue, Mar 24, 2015 at 11:00:57AM +0100, Vincent Guittot wrote:
>> >> On 23 March 2015 at 14:19, Peter Zijlstra <peterz@infradead.org> wrote:
>> >> > On Fri, Feb 27, 2015 at 04:54:07PM +0100, Vincent Guittot wrote:
>> >> >
>> >> >> +     unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
>> >> >
>> >> >> +                     sa->running_avg_sum += delta_w * scale_freq
>> >> >> +                             >> SCHED_CAPACITY_SHIFT;
>> >> >
>> >> > so the only thing that could be improved is somehow making this
>> >> > multiplication go away when the arch doesn't implement the function.
>> >> >
>> >> > But I'm not sure how to do that without #ifdef.
>> >> >
>> >> > Maybe a little something like so then... that should make the compiler
>> >> > get rid of those multiplications unless the arch needs them.
>> >>
>> >> yes, it removes useless multiplication when not used by an arch.
>> >> It also adds a constraint on the arch side which have to define
>> >> arch_scale_freq_capacity like below:
>> >>
>> >> #define arch_scale_freq_capacity xxx_arch_scale_freq_capacity
>> >> with xxx_arch_scale_freq_capacity an architecture specific function
>> >
>> > Yeah, but it not being weak should make that a compile time warn/fail,
>> > which should be pretty easy to deal with.
>> >
>> >> If it sounds acceptable i can update the patch with your proposal ?
>> >
>> > I'll stick it to the end, I just wanted to float to patch to see if
>> > people had better solutions.
>>
>> ok. all other methods that i have tried, was removing the optimization
>> when default arch_scale_freq_capacity was used
>
> Another potential solution is to stay with weak functions but move the
> multiplication and shift into the arch_scale_*() functions by passing
> the value we want to scale into the arch_scale_*() function. That way we
> can completely avoid multiplication and shift in the default case (no
> arch_scale*() implementations, which is better than what we have today.

sched_rt_avg_update only does the multiplication by
arch_scale_freq_capacity, because the shift by SCHED_CAPACITY_SHIFT has
been factored out into scale_rt_capacity.

>
> The only downside is that for frequency invariance we need three
> arch_scale_freq_capacity() calls instead of two.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant
  2015-03-27  8:17             ` Vincent Guittot
@ 2015-03-27  9:05               ` Vincent Guittot
  0 siblings, 0 replies; 68+ messages in thread
From: Vincent Guittot @ 2015-03-27  9:05 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Kamalesh Babulal, Rik van Riel, Mike Galbraith, nicolas.pitre,
	Dietmar Eggemann, Linaro Kernel Mailman List, Paul Turner,
	Ben Segall

On 27 March 2015 at 09:17, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 26 March 2015 at 18:38, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> On Wed, Mar 25, 2015 at 06:08:42PM +0000, Vincent Guittot wrote:
>>> On 25 March 2015 at 18:33, Peter Zijlstra <peterz@infradead.org> wrote:
>>> > On Tue, Mar 24, 2015 at 11:00:57AM +0100, Vincent Guittot wrote:
>>> >> On 23 March 2015 at 14:19, Peter Zijlstra <peterz@infradead.org> wrote:
>>> >> > On Fri, Feb 27, 2015 at 04:54:07PM +0100, Vincent Guittot wrote:
>>> >> >
>>> >> >> +     unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
>>> >> >
>>> >> >> +                     sa->running_avg_sum += delta_w * scale_freq
>>> >> >> +                             >> SCHED_CAPACITY_SHIFT;
>>> >> >
>>> >> > so the only thing that could be improved is somehow making this
>>> >> > multiplication go away when the arch doesn't implement the function.
>>> >> >
>>> >> > But I'm not sure how to do that without #ifdef.
>>> >> >
>>> >> > Maybe a little something like so then... that should make the compiler
>>> >> > get rid of those multiplications unless the arch needs them.
>>> >>
>>> >> yes, it removes useless multiplication when not used by an arch.
>>> >> It also adds a constraint on the arch side which have to define
>>> >> arch_scale_freq_capacity like below:
>>> >>
>>> >> #define arch_scale_freq_capacity xxx_arch_scale_freq_capacity
>>> >> with xxx_arch_scale_freq_capacity an architecture specific function
>>> >
>>> > Yeah, but it not being weak should make that a compile time warn/fail,
>>> > which should be pretty easy to deal with.
>>> >
>>> >> If it sounds acceptable i can update the patch with your proposal ?
>>> >
>>> > I'll stick it to the end, I just wanted to float to patch to see if
>>> > people had better solutions.
>>>
>>> ok. all other methods that i have tried, was removing the optimization
>>> when default arch_scale_freq_capacity was used
>>
>> Another potential solution is to stay with weak functions but move the
>> multiplication and shift into the arch_scale_*() functions by passing
>> the value we want to scale into the arch_scale_*() function. That way we
>> can completely avoid multiplication and shift in the default case (no
>> arch_scale*() implementations, which is better than what we have today.
>
> the sched_rt_avg_update only uses the mul with
> arch_scale_freq_capacity because the shift by SCHED_CAPACITY_SHIFT has
> been factorized in scale_rt_capacity

When arch_scale_freq_capacity is not defined by an arch, the mul is
optimized into an lsl by 10.
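
A rough userspace sketch of that split, where the write side only multiplies
and the shift is applied once on the read side. The struct and both helpers
are hypothetical stand-ins, not the kernel's rq, sched_rt_avg_update() or
scale_rt_capacity():

```c
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/* Hypothetical accumulator standing in for rq->rt_avg. */
struct rt_avg_sketch {
	uint64_t rt_avg;	/* kept in "scaled by 1024" units */
};

/* Write side: multiply only; a constant factor of 1024 is just a shift. */
static void rt_avg_update(struct rt_avg_sketch *rq, uint64_t delta,
			  unsigned long scale_freq)
{
	rq->rt_avg += delta * scale_freq;
}

/* Read side: the shift is applied once, when the value is consumed. */
static uint64_t rt_avg_read(const struct rt_avg_sketch *rq)
{
	return rq->rt_avg >> SCHED_CAPACITY_SHIFT;
}
```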

>
>>
>> The only downside is that for frequency invariance we need three
>> arch_scale_freq_capacity() calls instead of two.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [tip:sched/core] sched: Add sched_avg::utilization_avg_contrib
  2015-02-27 15:54 ` [PATCH v10 01/11] sched: add utilization_avg_contrib Vincent Guittot
@ 2015-03-27 11:40   ` tip-bot for Vincent Guittot
  0 siblings, 0 replies; 68+ messages in thread
From: tip-bot for Vincent Guittot @ 2015-03-27 11:40 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: pjt, bsegall, vincent.guittot, mingo, peterz, hpa, tglx,
	linux-kernel, morten.rasmussen

Commit-ID:  36ee28e45df50c2c8624b978335516e42d84ae1f
Gitweb:     http://git.kernel.org/tip/36ee28e45df50c2c8624b978335516e42d84ae1f
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Fri, 27 Feb 2015 16:54:04 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 09:35:57 +0100

sched: Add sched_avg::utilization_avg_contrib

Add new statistics which reflect the average time a task is running on the CPU
and the sum of these running times for the tasks on a runqueue. The latter is
named utilization_load_avg.

This patch is based on the usage metric that was proposed in the 1st
versions of the per-entity load tracking patchset by Paul Turner
<pjt@google.com> but that has been removed afterwards. This version differs from
the original one in the sense that it's not linked to task_group.

The rq's utilization_load_avg will be used to check if a rq is overloaded or
not instead of trying to compute how many tasks a group of CPUs can handle.

Rename runnable_avg_period into avg_period as it is now used with both
runnable_avg_sum and running_avg_sum.

Add some descriptions of the variables to explain their differences.
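
As a userspace illustration of how the per-entity value is derived from the
two sums, here is a sketch modelled on the __update_task_entity_utilization()
helper added by this patch. The struct is a hypothetical subset of
struct sched_avg, and SCHED_LOAD_SCALE is assumed to be 1024 (no extra load
resolution):

```c
#include <stdint.h>

#define SCHED_LOAD_SCALE	1024u	/* assumes no extra load resolution */

/* Hypothetical subset of struct sched_avg. */
struct sched_avg_sketch {
	uint32_t running_avg_sum;	/* decayed time spent running */
	uint32_t avg_period;		/* decayed elapsed time */
};

/* Fraction of time spent running, scaled to [0..SCHED_LOAD_SCALE]. */
static uint32_t utilization_contrib(const struct sched_avg_sketch *sa)
{
	uint32_t contrib = sa->running_avg_sum * SCHED_LOAD_SCALE;

	/* +1 avoids dividing by zero for a brand new entity */
	return contrib / (sa->avg_period + 1);
}
```

An entity that ran for about half of the tracked period ends up near 512, and
one that ran the whole time ends up just under 1024.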

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h | 21 ++++++++++++---
 kernel/sched/debug.c  | 10 ++++---
 kernel/sched/fair.c   | 74 ++++++++++++++++++++++++++++++++++++++++-----------
 kernel/sched/sched.h  |  8 +++++-
 4 files changed, 89 insertions(+), 24 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432..fdca05c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1115,15 +1115,28 @@ struct load_weight {
 };
 
 struct sched_avg {
+	u64 last_runnable_update;
+	s64 decay_count;
+	/*
+	 * utilization_avg_contrib describes the amount of time that a
+	 * sched_entity is running on a CPU. It is based on running_avg_sum
+	 * and is scaled in the range [0..SCHED_LOAD_SCALE].
+	 * load_avg_contrib described the amount of time that a sched_entity
+	 * is runnable on a rq. It is based on both runnable_avg_sum and the
+	 * weight of the task.
+	 */
+	unsigned long load_avg_contrib, utilization_avg_contrib;
 	/*
 	 * These sums represent an infinite geometric series and so are bound
 	 * above by 1024/(1-y).  Thus we only need a u32 to store them for all
 	 * choices of y < 1-2^(-32)*1024.
+	 * running_avg_sum reflects the time that the sched_entity is
+	 * effectively running on the CPU.
+	 * runnable_avg_sum represents the amount of time a sched_entity is on
+	 * a runqueue which includes the running time that is monitored by
+	 * running_avg_sum.
 	 */
-	u32 runnable_avg_sum, runnable_avg_period;
-	u64 last_runnable_update;
-	s64 decay_count;
-	unsigned long load_avg_contrib;
+	u32 runnable_avg_sum, avg_period, running_avg_sum;
 };
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8baaf85..578ff83 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -71,7 +71,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	if (!se) {
 		struct sched_avg *avg = &cpu_rq(cpu)->avg;
 		P(avg->runnable_avg_sum);
-		P(avg->runnable_avg_period);
+		P(avg->avg_period);
 		return;
 	}
 
@@ -94,7 +94,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->load.weight);
 #ifdef CONFIG_SMP
 	P(se->avg.runnable_avg_sum);
-	P(se->avg.runnable_avg_period);
+	P(se->avg.avg_period);
 	P(se->avg.load_avg_contrib);
 	P(se->avg.decay_count);
 #endif
@@ -214,6 +214,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "blocked_load_avg",
 			cfs_rq->blocked_load_avg);
+	SEQ_printf(m, "  .%-30s: %ld\n", "utilization_load_avg",
+			cfs_rq->utilization_load_avg);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_contrib",
 			cfs_rq->tg_load_contrib);
@@ -636,8 +638,10 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.load.weight);
 #ifdef CONFIG_SMP
 	P(se.avg.runnable_avg_sum);
-	P(se.avg.runnable_avg_period);
+	P(se.avg.running_avg_sum);
+	P(se.avg.avg_period);
 	P(se.avg.load_avg_contrib);
+	P(se.avg.utilization_avg_contrib);
 	P(se.avg.decay_count);
 #endif
 	P(policy);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee595ef..414408dd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -670,6 +670,7 @@ static int select_idle_sibling(struct task_struct *p, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
 
 static inline void __update_task_entity_contrib(struct sched_entity *se);
+static inline void __update_task_entity_utilization(struct sched_entity *se);
 
 /* Give new task start runnable values to heavy its load in infant time */
 void init_task_runnable_average(struct task_struct *p)
@@ -677,9 +678,10 @@ void init_task_runnable_average(struct task_struct *p)
 	u32 slice;
 
 	slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
-	p->se.avg.runnable_avg_sum = slice;
-	p->se.avg.runnable_avg_period = slice;
+	p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice;
+	p->se.avg.avg_period = slice;
 	__update_task_entity_contrib(&p->se);
+	__update_task_entity_utilization(&p->se);
 }
 #else
 void init_task_runnable_average(struct task_struct *p)
@@ -1684,7 +1686,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
 		*period = now - p->last_task_numa_placement;
 	} else {
 		delta = p->se.avg.runnable_avg_sum;
-		*period = p->se.avg.runnable_avg_period;
+		*period = p->se.avg.avg_period;
 	}
 
 	p->last_sum_exec_runtime = runtime;
@@ -2512,7 +2514,8 @@ static u32 __compute_runnable_contrib(u64 n)
  */
 static __always_inline int __update_entity_runnable_avg(u64 now,
 							struct sched_avg *sa,
-							int runnable)
+							int runnable,
+							int running)
 {
 	u64 delta, periods;
 	u32 runnable_contrib;
@@ -2538,7 +2541,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	sa->last_runnable_update = now;
 
 	/* delta_w is the amount already accumulated against our next period */
-	delta_w = sa->runnable_avg_period % 1024;
+	delta_w = sa->avg_period % 1024;
 	if (delta + delta_w >= 1024) {
 		/* period roll-over */
 		decayed = 1;
@@ -2551,7 +2554,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		delta_w = 1024 - delta_w;
 		if (runnable)
 			sa->runnable_avg_sum += delta_w;
-		sa->runnable_avg_period += delta_w;
+		if (running)
+			sa->running_avg_sum += delta_w;
+		sa->avg_period += delta_w;
 
 		delta -= delta_w;
 
@@ -2561,20 +2566,26 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 
 		sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
 						  periods + 1);
-		sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
+		sa->running_avg_sum = decay_load(sa->running_avg_sum,
+						  periods + 1);
+		sa->avg_period = decay_load(sa->avg_period,
 						     periods + 1);
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		runnable_contrib = __compute_runnable_contrib(periods);
 		if (runnable)
 			sa->runnable_avg_sum += runnable_contrib;
-		sa->runnable_avg_period += runnable_contrib;
+		if (running)
+			sa->running_avg_sum += runnable_contrib;
+		sa->avg_period += runnable_contrib;
 	}
 
 	/* Remainder of delta accrued against u_0` */
 	if (runnable)
 		sa->runnable_avg_sum += delta;
-	sa->runnable_avg_period += delta;
+	if (running)
+		sa->running_avg_sum += delta;
+	sa->avg_period += delta;
 
 	return decayed;
 }
@@ -2591,6 +2602,8 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
 		return 0;
 
 	se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
+	se->avg.utilization_avg_contrib =
+		decay_load(se->avg.utilization_avg_contrib, decays);
 
 	return decays;
 }
@@ -2626,7 +2639,7 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
 
 	/* The fraction of a cpu used by this cfs_rq */
 	contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
-			  sa->runnable_avg_period + 1);
+			  sa->avg_period + 1);
 	contrib -= cfs_rq->tg_runnable_contrib;
 
 	if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
@@ -2679,7 +2692,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
+	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
+			runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
@@ -2697,7 +2711,7 @@ static inline void __update_task_entity_contrib(struct sched_entity *se)
 
 	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
 	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
-	contrib /= (se->avg.runnable_avg_period + 1);
+	contrib /= (se->avg.avg_period + 1);
 	se->avg.load_avg_contrib = scale_load(contrib);
 }
 
@@ -2716,6 +2730,27 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
 	return se->avg.load_avg_contrib - old_contrib;
 }
 
+
+static inline void __update_task_entity_utilization(struct sched_entity *se)
+{
+	u32 contrib;
+
+	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
+	contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
+	contrib /= (se->avg.avg_period + 1);
+	se->avg.utilization_avg_contrib = scale_load(contrib);
+}
+
+static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
+{
+	long old_contrib = se->avg.utilization_avg_contrib;
+
+	if (entity_is_task(se))
+		__update_task_entity_utilization(se);
+
+	return se->avg.utilization_avg_contrib - old_contrib;
+}
+
 static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
 						 long load_contrib)
 {
@@ -2732,7 +2767,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	long contrib_delta;
+	long contrib_delta, utilization_delta;
 	u64 now;
 
 	/*
@@ -2744,18 +2779,22 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	else
 		now = cfs_rq_clock_task(group_cfs_rq(se));
 
-	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
+	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
+					cfs_rq->curr == se))
 		return;
 
 	contrib_delta = __update_entity_load_avg_contrib(se);
+	utilization_delta = __update_entity_utilization_avg_contrib(se);
 
 	if (!update_cfs_rq)
 		return;
 
-	if (se->on_rq)
+	if (se->on_rq) {
 		cfs_rq->runnable_load_avg += contrib_delta;
-	else
+		cfs_rq->utilization_load_avg += utilization_delta;
+	} else {
 		subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+	}
 }
 
 /*
@@ -2830,6 +2869,7 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 	}
 
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+	cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib;
 	/* we force update consideration on load-balancer moves */
 	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
 }
@@ -2848,6 +2888,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 	update_cfs_rq_blocked_load(cfs_rq, !sleep);
 
 	cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+	cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
 	if (sleep) {
 		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
 		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
@@ -3185,6 +3226,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 */
 		update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
+		update_entity_load_avg(se, 1);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c2c0d7b..4c95cc2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -363,8 +363,14 @@ struct cfs_rq {
 	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
 	 * This allows for the description of both thread and group usage (in
 	 * the FAIR_GROUP_SCHED case).
+	 * runnable_load_avg is the sum of the load_avg_contrib of the
+	 * sched_entities on the rq.
+	 * blocked_load_avg is similar to runnable_load_avg except that its
+	 * the blocked sched_entities on the rq.
+	 * utilization_load_avg is the sum of the average running time of the
+	 * sched_entities on the rq.
 	 */
-	unsigned long runnable_load_avg, blocked_load_avg;
+	unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
 	atomic64_t decay_counter;
 	u64 last_decay;
 	atomic_long_t removed_load;

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [tip:sched/core] sched: Track group sched_entity usage contributions
  2015-02-27 15:54 ` [PATCH v10 02/11] sched: Track group sched_entity usage contributions Vincent Guittot
@ 2015-03-27 11:40   ` tip-bot for Morten Rasmussen
  0 siblings, 0 replies; 68+ messages in thread
From: tip-bot for Morten Rasmussen @ 2015-03-27 11:40 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: vincent.guittot, linux-kernel, pjt, morten.rasmussen, mingo, hpa,
	tglx, peterz, bsegall

Commit-ID:  21f4486630b0bd1b6dbcc04f61836987fa54278f
Gitweb:     http://git.kernel.org/tip/21f4486630b0bd1b6dbcc04f61836987fa54278f
Author:     Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate: Fri, 27 Feb 2015 16:54:05 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 09:35:58 +0100

sched: Track group sched_entity usage contributions

Add usage contribution tracking for group entities. Unlike
se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
entities is the sum of se->avg.utilization_avg_contrib for all entities on the
group runqueue.

It is _not_ influenced in any way by the task group h_load. Hence it
represents the actual CPU usage of the group, not its intended load
contribution, which may differ significantly from the utilization on
lightly utilized systems.
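
As an illustration only, the aggregation described above can be sketched in
plain C outside the kernel. The toy_entity type and the function name are
hypothetical; the point is that the group value is a plain sum with no weight
or h_load term.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a scheduling entity. */
struct toy_entity {
	unsigned long utilization_avg_contrib;
};

/*
 * Sum the utilization contributions of the entities queued on a group
 * runqueue. This mirrors the idea that a group entity takes over the
 * utilization_load_avg of its own runqueue; no weight is applied.
 */
unsigned long toy_group_utilization(const struct toy_entity *children,
				    size_t nr)
{
	unsigned long sum = 0;
	size_t i;

	assert(children != NULL || nr == 0);

	for (i = 0; i < nr; i++)
		sum += children[i].utilization_avg_contrib;

	return sum;
}
```

Three children contributing 100, 250 and 50 give a group contribution of 400,
whatever shares the task group has been assigned.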

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/debug.c | 2 ++
 kernel/sched/fair.c  | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 578ff83..a245c1f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -94,8 +94,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->load.weight);
 #ifdef CONFIG_SMP
 	P(se->avg.runnable_avg_sum);
+	P(se->avg.running_avg_sum);
 	P(se->avg.avg_period);
 	P(se->avg.load_avg_contrib);
+	P(se->avg.utilization_avg_contrib);
 	P(se->avg.decay_count);
 #endif
 #undef PN
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 414408dd..d94a865 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2747,6 +2747,9 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
 
 	if (entity_is_task(se))
 		__update_task_entity_utilization(se);
+	else
+		se->avg.utilization_avg_contrib =
+					group_cfs_rq(se)->utilization_load_avg;
 
 	return se->avg.utilization_avg_contrib - old_contrib;
 }

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [tip:sched/core] sched: Remove frequency scaling from cpu_capacity
  2015-02-27 15:54 ` [PATCH v10 03/11] sched: remove frequency scaling from cpu_capacity Vincent Guittot
@ 2015-03-27 11:40   ` tip-bot for Vincent Guittot
  0 siblings, 0 replies; 68+ messages in thread
From: tip-bot for Vincent Guittot @ 2015-03-27 11:40 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, linux-kernel, peterz, vincent.guittot, morten.rasmussen,
	hpa, tglx

Commit-ID:  a8faa8f55d48496f64d96df48298e54fd380f6af
Gitweb:     http://git.kernel.org/tip/a8faa8f55d48496f64d96df48298e54fd380f6af
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Fri, 27 Feb 2015 16:54:06 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 09:35:59 +0100

sched: Remove frequency scaling from cpu_capacity

Now that arch_scale_cpu_capacity has been introduced to scale the original
capacity, the arch_scale_freq_capacity is no longer used (it was
previously used by ARM arch).

Remove arch_scale_freq_capacity from the computation of cpu_capacity.
The frequency invariance will be handled in the load tracking and not in
the CPU capacity. arch_scale_freq_capacity will be revisited for scaling
load with the current frequency of the CPUs in a later patch.
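
A minimal user-space sketch of the capacity computation that remains once the
frequency term is dropped is given below. The toy_* names are hypothetical,
and the floor of 1 on the result is an assumption about the surrounding code,
which is not visible in this hunk.

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

struct toy_capacity {
	unsigned long orig;	/* capacity before RT/IRQ pressure */
	unsigned long cfs;	/* capacity left for CFS tasks */
};

/*
 * arch_cpu_scale: per-CPU capacity in [0..SCHED_CAPACITY_SCALE] or above
 * rt_scale:       fraction left by RT tasks, in [0..SCHED_CAPACITY_SCALE]
 */
struct toy_capacity toy_update_cpu_capacity(unsigned long arch_cpu_scale,
					    unsigned long rt_scale)
{
	struct toy_capacity c;

	assert(rt_scale <= SCHED_CAPACITY_SCALE);

	c.orig = arch_cpu_scale;
	/* Only the RT scaling is applied; there is no frequency term. */
	c.cfs = (arch_cpu_scale * rt_scale) >> SCHED_CAPACITY_SHIFT;
	if (!c.cfs)
		c.cfs = 1;	/* assumed floor, avoids a zero capacity */

	return c;
}
```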

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d94a865..e54231f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6042,13 +6042,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 
 	sdg->sgc->capacity_orig = capacity;
 
-	if (sched_feat(ARCH_CAPACITY))
-		capacity *= arch_scale_freq_capacity(sd, cpu);
-	else
-		capacity *= default_scale_capacity(sd, cpu);
-
-	capacity >>= SCHED_CAPACITY_SHIFT;
-
 	capacity *= scale_rt_capacity(cpu);
 	capacity >>= SCHED_CAPACITY_SHIFT;
 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [tip:sched/core] sched: Make sched entity usage tracking scale-invariant
  2015-03-04  7:46   ` Vincent Guittot
@ 2015-03-27 11:40     ` tip-bot for Morten Rasmussen
  0 siblings, 0 replies; 68+ messages in thread
From: tip-bot for Morten Rasmussen @ 2015-03-27 11:40 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: bsegall, pjt, mingo, peterz, linux-kernel, morten.rasmussen,
	vincent.guittot, hpa, tglx

Commit-ID:  0c1dc6b27dac883ee78392189c8e20e764d79bfa
Gitweb:     http://git.kernel.org/tip/0c1dc6b27dac883ee78392189c8e20e764d79bfa
Author:     Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate: Wed, 4 Mar 2015 08:46:26 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 09:36:00 +0100

sched: Make sched entity usage tracking scale-invariant

Apply frequency scale-invariance correction factor to usage tracking.

Each segment of the running_avg_sum geometric series is now scaled by the
current frequency so the utilization_avg_contrib of each entity will be
invariant with frequency scaling.

As a result, utilization_load_avg, which is the sum of utilization_avg_contrib,
becomes invariant too. So the usage level that is returned by get_cpu_usage()
stays relative to the max frequency, like the cpu_capacity it is compared against.
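
To illustrate the scaling of one segment, here is a small stand-alone sketch.
The function name is hypothetical and the scale factor is assumed to be
expressed relative to SCHED_CAPACITY_SCALE = 1024.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/*
 * Scale a running-time segment by the current frequency factor.
 * scale_freq == SCHED_CAPACITY_SCALE means "running at max frequency".
 */
uint64_t toy_scale_delta(uint64_t delta, uint64_t scale_freq)
{
	assert(scale_freq <= SCHED_CAPACITY_SCALE);

	return (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
}
```

A task that needs 1024 time units at half the max frequency (factor 512)
accrues 512, the same amount it accrues when it does the same work in 512
units at max frequency (factor 1024).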

We also want to keep the load tracking values in a 32-bit type, which implies
that the max value of {runnable|running}_avg_sum must be lower than
2^32/88761=48388 (88761 is the max weight of a task). As LOAD_AVG_MAX = 47742,
arch_scale_freq_capacity() must return a value less than
(48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_CAPACITY_SCALE = 1024).
So we define the range as [0..SCHED_CAPACITY_SCALE] in order to avoid overflow.
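
The arithmetic above can be checked with a short stand-alone program. This is
only a sketch: the toy_* helpers are hypothetical, and the constants are the
ones quoted in this changelog.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define LOAD_AVG_MAX		47742u	/* max of the geometric series */
#define MAX_TASK_WEIGHT		88761u	/* max weight of a task */

/* Largest *_avg_sum whose product with the max weight fits in 32 bits. */
uint32_t toy_max_avg_sum(void)
{
	return (uint32_t)((UINT64_C(1) << 32) / MAX_TASK_WEIGHT);
}

/* Upper bound on the frequency scale factor derived from that limit. */
uint32_t toy_max_freq_scale(void)
{
	return (uint32_t)(((uint64_t)toy_max_avg_sum() << SCHED_CAPACITY_SHIFT)
			  / LOAD_AVG_MAX);
}

/* Does LOAD_AVG_MAX scaled by 'scale', times the max weight, fit in u32? */
int toy_product_fits_u32(uint32_t scale)
{
	uint64_t sum = ((uint64_t)LOAD_AVG_MAX * scale) >> SCHED_CAPACITY_SHIFT;

	return sum * MAX_TASK_WEIGHT <= UINT32_MAX;
}

/* Re-derive the two numbers quoted in the changelog. */
void toy_check_changelog_numbers(void)
{
	assert(toy_max_avg_sum() == 48388);
	assert(toy_max_freq_scale() == 1037);
}
```

With a factor of 1024 the worst-case product still fits in 32 bits, while a
factor of 1038 no longer does, which is why the range is limited to 1024.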

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425455186-13451-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e54231f..7f031e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2484,6 +2484,8 @@ static u32 __compute_runnable_contrib(u64 n)
 	return contrib + runnable_avg_yN_sum[n];
 }
 
+unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
@@ -2512,7 +2514,7 @@ static u32 __compute_runnable_contrib(u64 n)
  *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
  *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
  */
-static __always_inline int __update_entity_runnable_avg(u64 now,
+static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 							struct sched_avg *sa,
 							int runnable,
 							int running)
@@ -2520,6 +2522,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	u64 delta, periods;
 	u32 runnable_contrib;
 	int delta_w, decayed = 0;
+	unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
 
 	delta = now - sa->last_runnable_update;
 	/*
@@ -2555,7 +2558,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		if (runnable)
 			sa->runnable_avg_sum += delta_w;
 		if (running)
-			sa->running_avg_sum += delta_w;
+			sa->running_avg_sum += delta_w * scale_freq
+				>> SCHED_CAPACITY_SHIFT;
 		sa->avg_period += delta_w;
 
 		delta -= delta_w;
@@ -2576,7 +2580,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		if (runnable)
 			sa->runnable_avg_sum += runnable_contrib;
 		if (running)
-			sa->running_avg_sum += runnable_contrib;
+			sa->running_avg_sum += runnable_contrib * scale_freq
+				>> SCHED_CAPACITY_SHIFT;
 		sa->avg_period += runnable_contrib;
 	}
 
@@ -2584,7 +2589,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	if (runnable)
 		sa->runnable_avg_sum += delta;
 	if (running)
-		sa->running_avg_sum += delta;
+		sa->running_avg_sum += delta * scale_freq
+			>> SCHED_CAPACITY_SHIFT;
 	sa->avg_period += delta;
 
 	return decayed;
@@ -2692,8 +2698,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
-			runnable);
+	__update_entity_runnable_avg(rq_clock_task(rq), cpu_of(rq), &rq->avg,
+			runnable, runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
@@ -2771,6 +2777,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	long contrib_delta, utilization_delta;
+	int cpu = cpu_of(rq_of(cfs_rq));
 	u64 now;
 
 	/*
@@ -2782,7 +2789,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	else
 		now = cfs_rq_clock_task(group_cfs_rq(se));
 
-	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
+	if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq,
 					cfs_rq->curr == se))
 		return;
 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [tip:sched/core] sched: Make scale_rt invariant with frequency
  2015-02-27 15:54 ` [PATCH v10 05/11] sched: make scale_rt invariant with frequency Vincent Guittot
@ 2015-03-27 11:41   ` tip-bot for Vincent Guittot
  0 siblings, 0 replies; 68+ messages in thread
From: tip-bot for Vincent Guittot @ 2015-03-27 11:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: morten.rasmussen, tglx, peterz, vincent.guittot, linux-kernel,
	hpa, mingo

Commit-ID:  b5b4860d1d61ddc5308c7d492cbeaa3a6e508d7f
Gitweb:     http://git.kernel.org/tip/b5b4860d1d61ddc5308c7d492cbeaa3a6e508d7f
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Fri, 27 Feb 2015 16:54:08 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 09:36:01 +0100

sched: Make scale_rt invariant with frequency

The average running time of RT tasks is used to estimate the remaining compute
capacity for CFS tasks. This remaining capacity is the original capacity scaled
down by a factor (aka scale_rt_capacity). This estimation of available capacity
must also be invariant with frequency scaling.

A frequency scaling factor is applied on the running time of the RT tasks for
computing scale_rt_capacity.

In sched_rt_avg_update(), we now scale the RT execution time like below:

  rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT

Then, scale_rt_capacity can be summarized by:

  scale_rt_capacity = SCHED_CAPACITY_SCALE * available / total

with available = total - rq->rt_avg

This has been optimized in the current code as:

  scale_rt_capacity = available / (total >> SCHED_CAPACITY_SHIFT)

But we can also expand the equation as follows:

  scale_rt_capacity = SCHED_CAPACITY_SCALE - ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / total)

and we can optimize the equation by removing the SCHED_CAPACITY_SHIFT shift in
the computation of both rq->rt_avg and scale_rt_capacity().

so rq->rt_avg += rt_delta * arch_scale_freq_capacity()
and
scale_rt_capacity = SCHED_CAPACITY_SCALE - (rq->rt_avg / total)
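
The two forms can be compared in a small user-space sketch. The toy_* names
are hypothetical, 'total' stands for the averaging period, and the new form
takes the RT time already multiplied by the frequency factor, with no shift.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(UINT64_C(1) << SCHED_CAPACITY_SHIFT)

/* Previous form: available / (total >> SCHED_CAPACITY_SHIFT). */
uint64_t toy_scale_rt_old(uint64_t rt_time, uint64_t total)
{
	uint64_t available = total > rt_time ? total - rt_time : 0;

	if (total < SCHED_CAPACITY_SCALE)
		total = SCHED_CAPACITY_SCALE;

	return available / (total >> SCHED_CAPACITY_SHIFT);
}

/* New form: SCHED_CAPACITY_SCALE - (rt_avg / total), floored at 1. */
uint64_t toy_scale_rt_new(uint64_t rt_time, uint64_t freq_scale,
			  uint64_t total)
{
	uint64_t used;

	assert(total > 0);

	/* rt_avg accumulates rt_delta * freq_scale, without any shift. */
	used = (rt_time * freq_scale) / total;
	if (used < SCHED_CAPACITY_SCALE)
		return SCHED_CAPACITY_SCALE - used;

	return 1;
}
```

With a period of 1000000 and 250000 of RT time at max frequency, both forms
return 768. At half frequency the new form returns 896, since the RT tasks
consumed only half as much frequency-invariant capacity.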

arch_scale_freq_capacity() will be called in the hot path of the scheduler,
so it must be a short and efficient function.

As an example, arch_scale_freq_capacity() should return a cached value that
is updated periodically outside of the hot path.
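
One possible shape for such a cached value is sketched below in plain C. All
toy_* names, the fixed CPU count and the kHz parameters are assumptions made
for the example; a real backend would use per-CPU storage and whatever
synchronization the architecture requires.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define TOY_NR_CPUS		4

/* Cached factor per CPU, initialised to "max frequency" (1024). */
static uint64_t toy_freq_scale[TOY_NR_CPUS] = { 1024, 1024, 1024, 1024 };

/* Slow path: called when the frequency of a CPU changes. */
void toy_set_freq_scale(int cpu, uint64_t cur_khz, uint64_t max_khz)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	assert(max_khz > 0 && cur_khz <= max_khz);

	toy_freq_scale[cpu] = (cur_khz << SCHED_CAPACITY_SHIFT) / max_khz;
}

/* Hot path: a single array read, no division. */
uint64_t toy_arch_scale_freq_capacity(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);

	return toy_freq_scale[cpu];
}
```

The division is paid once per frequency change, so the scheduler only pays
for a load when it reads the factor.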

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c  | 17 +++++------------
 kernel/sched/sched.h |  4 +++-
 2 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f031e4..dc7c693 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6004,7 +6004,7 @@ unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	u64 total, available, age_stamp, avg;
+	u64 total, used, age_stamp, avg;
 	s64 delta;
 
 	/*
@@ -6020,19 +6020,12 @@ static unsigned long scale_rt_capacity(int cpu)
 
 	total = sched_avg_period() + delta;
 
-	if (unlikely(total < avg)) {
-		/* Ensures that capacity won't end up being negative */
-		available = 0;
-	} else {
-		available = total - avg;
-	}
+	used = div_u64(avg, total);
 
-	if (unlikely((s64)total < SCHED_CAPACITY_SCALE))
-		total = SCHED_CAPACITY_SCALE;
+	if (likely(used < SCHED_CAPACITY_SCALE))
+		return SCHED_CAPACITY_SCALE - used;
 
-	total >>= SCHED_CAPACITY_SHIFT;
-
-	return div_u64(available, total);
+	return 1;
 }
 
 static void update_cpu_capacity(struct sched_domain *sd, int cpu)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4c95cc2..3600002 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1386,9 +1386,11 @@ static inline int hrtick_enabled(struct rq *rq)
 
 #ifdef CONFIG_SMP
 extern void sched_avg_update(struct rq *rq);
+extern unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
-	rq->rt_avg += rt_delta;
+	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
 	sched_avg_update(rq);
 }
 #else

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [tip:sched/core] sched: Add struct rq::cpu_capacity_orig
  2015-02-27 15:54 ` [PATCH v10 06/11] sched: add per rq cpu_capacity_orig Vincent Guittot
@ 2015-03-27 11:41   ` tip-bot for Vincent Guittot
  0 siblings, 0 replies; 68+ messages in thread
From: tip-bot for Vincent Guittot @ 2015-03-27 11:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, morten.rasmussen, tglx, peterz, kamalesh, mingo,
	linux-kernel, vincent.guittot

Commit-ID:  ca6d75e6908efbc350d536e0b496ebdac36b20d2
Gitweb:     http://git.kernel.org/tip/ca6d75e6908efbc350d536e0b496ebdac36b20d2
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Fri, 27 Feb 2015 16:54:09 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 09:36:02 +0100

sched: Add struct rq::cpu_capacity_orig

This new field 'cpu_capacity_orig' reflects the original capacity of a CPU
before being altered by RT tasks and/or IRQs.

The cpu_capacity_orig will be used:

  - to detect when the capacity of a CPU has been noticeably reduced so we can
    trigger load balancing to look for a CPU with better capacity. As an
    example, we can detect when a CPU handles a significant amount of IRQ time
    (with CONFIG_IRQ_TIME_ACCOUNTING) but is still seen as an idle CPU by the
    scheduler, while CPUs that are really idle are available.

  - to evaluate the available capacity for CFS tasks
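
As a rough illustration of the first use, a check of this kind could compare
the two fields with a percentage margin. The function name and the margin
parameter are hypothetical and are not taken from this patch.

```c
#include <assert.h>

/*
 * Return 1 when the capacity left for CFS tasks has dropped noticeably
 * below the original capacity. margin_pct is a hypothetical threshold:
 * 125 means "cpu_capacity is below 80% of cpu_capacity_orig".
 */
int toy_capacity_reduced(unsigned long cpu_capacity,
			 unsigned long cpu_capacity_orig,
			 unsigned int margin_pct)
{
	assert(margin_pct >= 100);

	return cpu_capacity * margin_pct < cpu_capacity_orig * 100;
}
```

With a margin of 125, a CPU left with 700 out of 1024 is flagged, while one
left with 900 out of 1024 is not.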

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  | 2 +-
 kernel/sched/fair.c  | 8 +++++++-
 kernel/sched/sched.h | 1 +
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index feda520..7022e90 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7216,7 +7216,7 @@ void __init sched_init(void)
 #ifdef CONFIG_SMP
 		rq->sd = NULL;
 		rq->rd = NULL;
-		rq->cpu_capacity = SCHED_CAPACITY_SCALE;
+		rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
 		rq->post_schedule = 0;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dc7c693..10f84c3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4363,6 +4363,11 @@ static unsigned long capacity_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity;
 }
 
+static unsigned long capacity_orig_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_capacity_orig;
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -6040,6 +6045,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 
 	capacity >>= SCHED_CAPACITY_SHIFT;
 
+	cpu_rq(cpu)->cpu_capacity_orig = capacity;
 	sdg->sgc->capacity_orig = capacity;
 
 	capacity *= scale_rt_capacity(cpu);
@@ -6094,7 +6100,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 			 * Runtime updates will correct capacity_orig.
 			 */
 			if (unlikely(!rq->sd)) {
-				capacity_orig += capacity_of(cpu);
+				capacity_orig += capacity_orig_of(cpu);
 				capacity += capacity_of(cpu);
 				continue;
 			}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3600002..be56dfd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -615,6 +615,7 @@ struct rq {
 	struct sched_domain *sd;
 
 	unsigned long cpu_capacity;
+	unsigned long cpu_capacity_orig;
 
 	unsigned char idle_balance;
 	/* For active balancing */

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [tip:sched/core] sched: Calculate CPU's usage statistic and put it into struct sg_lb_stats::group_usage
  2015-03-04  7:48   ` Vincent Guittot
@ 2015-03-27 11:41     ` tip-bot for Vincent Guittot
  0 siblings, 0 replies; 68+ messages in thread
From: tip-bot for Vincent Guittot @ 2015-03-27 11:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: morten.rasmussen, hpa, vincent.guittot, tglx, peterz, mingo,
	linux-kernel

Commit-ID:  8bb5b00c2f90100a272b09a9d17ec7875d088aa7
Gitweb:     http://git.kernel.org/tip/8bb5b00c2f90100a272b09a9d17ec7875d088aa7
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Wed, 4 Mar 2015 08:48:47 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 09:36:03 +0100

sched: Calculate CPU's usage statistic and put it into struct sg_lb_stats::group_usage

Monitor the usage level of each group of each sched_domain level. The usage is
the portion of cpu_capacity_orig that is currently used on a CPU or group of
CPUs. We use the utilization_load_avg to evaluate the usage level of each
group.

The utilization_load_avg only takes into account the running time of the CFS
tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
utilized. Nevertheless, we must cap utilization_load_avg, which can be
temporarily greater than SCHED_LOAD_SCALE after the migration of a task to this
CPU and until the metrics are stabilized.

The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the
running load on the CPU whereas the available capacity for the CFS task is in
the range [0..cpu_capacity_orig]. In order to test if a CPU is fully utilized
by CFS tasks, we have to scale the utilization into the cpu_capacity_orig range
of the CPU to get the usage of the latter. The usage can then be compared with
the available capacity (i.e. cpu_capacity) to deduce the usage level of a CPU.

The frequency scaling invariance of the usage is not taken into account in this
patch; it will be solved in another patch, which will deal with frequency
scaling invariance on the utilization_load_avg.
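
The scaling and capping described above can be tried out in user space with
the sketch below. It assumes SCHED_LOAD_SHIFT = 10 (SCHED_LOAD_SCALE = 1024),
and toy_cpu_usage is a hypothetical stand-alone counterpart of
get_cpu_usage() that takes its inputs as parameters.

```c
#include <assert.h>

#define SCHED_LOAD_SHIFT	10	/* assumed value for this sketch */
#define SCHED_LOAD_SCALE	(1UL << SCHED_LOAD_SHIFT)

/*
 * Convert a utilization in [0..SCHED_LOAD_SCALE] into a usage expressed
 * in capacity units, capped at the original capacity of the CPU.
 */
unsigned long toy_cpu_usage(unsigned long utilization_load_avg,
			    unsigned long capacity_orig)
{
	assert(capacity_orig > 0);

	if (utilization_load_avg >= SCHED_LOAD_SCALE)
		return capacity_orig;

	return (utilization_load_avg * capacity_orig) >> SCHED_LOAD_SHIFT;
}
```

For two CPUs of capacity 1024, a transient utilization of 1239 (about 121%)
is capped to 1024 and a utilization of 819 (about 80%) maps to 819, so the
group usage is 1843 out of 2048 and the spare capacity stays visible.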

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425455327-13508-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 10f84c3..471193b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4781,6 +4781,33 @@ next:
 done:
 	return target;
 }
+/*
+ * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
+ * tasks. The unit of the return value must be the one of capacity so we can
+ * compare the usage with the capacity of the CPU that is available for CFS
+ * task (ie cpu_capacity).
+ * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
+ * CPU. It represents the amount of utilization of a CPU in the range
+ * [0..SCHED_LOAD_SCALE].  The usage of a CPU can't be higher than the full
+ * capacity of the CPU because it's about the running time on this CPU.
+ * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
+ * because of unfortunate rounding in avg_period and running_load_avg or just
+ * after migrating tasks until the average stabilizes with the new running
+ * time. So we need to check that the usage stays into the range
+ * [0..cpu_capacity_orig] and cap if necessary.
+ * Without capping the usage, a group could be seen as overloaded (CPU0 usage
+ * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity
+ */
+static int get_cpu_usage(int cpu)
+{
+	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+	unsigned long capacity = capacity_orig_of(cpu);
+
+	if (usage >= SCHED_LOAD_SCALE)
+		return capacity;
+
+	return (usage * capacity) >> SCHED_LOAD_SHIFT;
+}
 
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
@@ -5907,6 +5934,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_capacity;
+	unsigned long group_usage; /* Total usage of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int group_capacity_factor;
 	unsigned int idle_cpus;
@@ -6255,6 +6283,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);
 
 		sgs->group_load += load;
+		sgs->group_usage += get_cpu_usage(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
 		if (rq->nr_running > 1)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [tip:sched/core] sched: Replace capacity_factor by usage
  2015-02-27 15:54 ` [PATCH v10 08/11] sched: replace capacity_factor by usage Vincent Guittot
@ 2015-03-27 11:42   ` tip-bot for Vincent Guittot
  2015-03-27 14:52   ` [PATCH v10 08/11] sched: replace " Xunlei Pang
  1 sibling, 0 replies; 68+ messages in thread
From: tip-bot for Vincent Guittot @ 2015-03-27 11:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: vincent.guittot, mingo, torvalds, peterz, hpa, tglx, linux-kernel

Commit-ID:  ea67821b9a3edadf602b7772a0b2a69657ced746
Gitweb:     http://git.kernel.org/tip/ea67821b9a3edadf602b7772a0b2a69657ced746
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Fri, 27 Feb 2015 16:54:11 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 09:36:04 +0100

sched: Replace capacity_factor by usage

The scheduler tries to compute how many tasks a group of CPUs can handle by
assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
SCHED_CAPACITY_SCALE.

'struct sg_lb_stats::group_capacity_factor' divides the capacity of the group
by SCHED_LOAD_SCALE to estimate how many tasks can run in the group. Then, it
compares this value with the sum of nr_running to decide if the group is
overloaded or not.

But the 'group_capacity_factor' concept hardly works for SMT systems; it
sometimes works for big cores but fails to do the right thing for little cores.

Below are two examples to illustrate the problem that this patch solves:

1- If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
   (640 as an example), a group of 3 CPUS will have a max capacity_factor of 2
   (div_round_closest(3x640/1024) = 2) which means that it will be seen as
   overloaded even if we have only one task per CPU.

2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE
   (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4
   (at max and thanks to the fix [0] for SMT systems that prevents the
   appearance of ghost CPUs) but if one CPU is fully used by rt tasks (and its
   capacity is reduced to nearly nothing), the capacity factor of the group
   will still be 4
   (div_round_closest(3*1512/1024) = 4, which is also the cap imposed by [0]).
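
As an illustration (a userspace sketch, not kernel code), the rounding in
example 1 can be reproduced as follows. div_round_closest() is a simplified
unsigned-only stand-in for the kernel's DIV_ROUND_CLOSEST(), and
old_capacity_factor() and old_group_overloaded() are hypothetical helper
names; the SMT correction from [0] is deliberately left out:

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* Simplified, unsigned-only stand-in for the kernel's DIV_ROUND_CLOSEST(). */
static unsigned long div_round_closest(unsigned long x, unsigned long divisor)
{
	assert(divisor != 0);
	return (x + divisor / 2) / divisor;
}

/* Hypothetical helper: only the rounding step quoted in the examples. */
static unsigned long old_capacity_factor(unsigned long group_capacity)
{
	return div_round_closest(group_capacity, SCHED_CAPACITY_SCALE);
}

/* Old test: more running tasks than capacity_factor means overloaded. */
static int old_group_overloaded(unsigned long nr_running,
				unsigned long group_capacity)
{
	return nr_running > old_capacity_factor(group_capacity);
}
```

With three CPUs of capacity 640 and one task each,
old_capacity_factor(3 * 640) is 2, so old_group_overloaded(3, 3 * 640)
reports the group as overloaded.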

So, this patch tries to solve this issue by removing capacity_factor and
replacing it with the following two metrics:

  - The CPU's available capacity for CFS tasks, which is already used by
    load_balance().

  - The usage of the CPU by the CFS tasks. For the latter, utilization_avg_contrib
    has been re-introduced to compute the usage of a CPU by CFS tasks.

'group_capacity_factor' and 'group_has_free_capacity' have been removed and replaced
by 'group_no_capacity'. We compare the number of tasks with the number of CPUs and
we evaluate the level of utilization of the CPUs to determine whether a group is
overloaded or whether it has capacity to handle more tasks.
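
The two checks described above can be sketched in userspace C as follows.
This is a minimal sketch modelled on the group_has_capacity() and
group_is_overloaded() helpers in the diff below; 'struct sg_stats' is a
hypothetical reduced stand-in for 'struct sg_lb_stats', and imbalance_pct is
passed directly instead of being read from the sched_domain:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduced stand-in for the struct sg_lb_stats fields used here. */
struct sg_stats {
	unsigned long group_capacity;	/* capacity available for CFS tasks */
	unsigned long group_usage;	/* usage of the group by CFS tasks */
	unsigned int sum_nr_running;	/* tasks running in the group */
	unsigned int group_weight;	/* number of CPUs in the group */
};

/* True if the group could take more tasks. */
static bool group_has_capacity(const struct sg_stats *sgs,
			       unsigned int imbalance_pct)
{
	assert(imbalance_pct >= 100);

	if (sgs->sum_nr_running < sgs->group_weight)
		return true;

	return sgs->group_capacity * 100 > sgs->group_usage * imbalance_pct;
}

/* True if the group has more tasks than it can handle. */
static bool group_is_overloaded(const struct sg_stats *sgs,
				unsigned int imbalance_pct)
{
	assert(imbalance_pct >= 100);

	if (sgs->sum_nr_running <= sgs->group_weight)
		return false;

	return sgs->group_capacity * 100 < sgs->group_usage * imbalance_pct;
}
```

For a group of 3 CPUs with a capacity of 3*640, a usage of 1800 and an
imbalance_pct of 110, three tasks give neither spare capacity nor overload,
whereas a fourth task makes the group overloaded.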

For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task
so that it is selected first (among the overloaded groups). Since [1],
SD_PREFER_SIBLING is no longer affected by the computation of 'load_above_capacity'
because local is not overloaded.

[1] 9a5d9ba6a363 ("sched/fair: Allow calculate_imbalance() to move idle cpus")

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1425052454-25797-9-git-send-email-vincent.guittot@linaro.org
[ Tidied up the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 139 +++++++++++++++++++++++++++-------------------------
 1 file changed, 72 insertions(+), 67 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 471193b..7e13dd0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5936,11 +5936,10 @@ struct sg_lb_stats {
 	unsigned long group_capacity;
 	unsigned long group_usage; /* Total usage of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
-	unsigned int group_capacity_factor;
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	enum group_type group_type;
-	int group_has_free_capacity;
+	int group_no_capacity;
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -6156,28 +6155,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 }
 
 /*
- * Try and fix up capacity for tiny siblings, this is needed when
- * things like SD_ASYM_PACKING need f_b_g to select another sibling
- * which on its own isn't powerful enough.
- *
- * See update_sd_pick_busiest() and check_asym_packing().
+ * Check whether the capacity of the rq has been noticeably reduced by side
+ * activity. The imbalance_pct is used for the threshold.
+ * Return true is the capacity is reduced
  */
 static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
+check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
 {
-	/*
-	 * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
-	 */
-	if (!(sd->flags & SD_SHARE_CPUCAPACITY))
-		return 0;
-
-	/*
-	 * If ~90% of the cpu_capacity is still there, we're good.
-	 */
-	if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
-		return 1;
-
-	return 0;
+	return ((rq->cpu_capacity * sd->imbalance_pct) <
+				(rq->cpu_capacity_orig * 100));
 }
 
 /*
@@ -6215,37 +6201,56 @@ static inline int sg_imbalanced(struct sched_group *group)
 }
 
 /*
- * Compute the group capacity factor.
- *
- * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit unit capacity with that.
+ * group_has_capacity returns true if the group has spare capacity that could
+ * be used by some tasks.
+ * We consider that a group has spare capacity if the  * number of task is
+ * smaller than the number of CPUs or if the usage is lower than the available
+ * capacity for CFS tasks.
+ * For the latter, we use a threshold to stabilize the state, to take into
+ * account the variance of the tasks' load and to return true if the available
+ * capacity in meaningful for the load balancer.
+ * As an example, an available capacity of 1% can appear but it doesn't make
+ * any benefit for the load balance.
  */
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
+static inline bool
+group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
 {
-	unsigned int capacity_factor, smt, cpus;
-	unsigned int capacity, capacity_orig;
+	if (sgs->sum_nr_running < sgs->group_weight)
+		return true;
 
-	capacity = group->sgc->capacity;
-	capacity_orig = group->sgc->capacity_orig;
-	cpus = group->group_weight;
+	if ((sgs->group_capacity * 100) >
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
 
-	/* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
-	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
-	capacity_factor = cpus / smt; /* cores */
+	return false;
+}
+
+/*
+ *  group_is_overloaded returns true if the group has more tasks than it can
+ *  handle.
+ *  group_is_overloaded is not equals to !group_has_capacity because a group
+ *  with the exact right number of tasks, has no more spare capacity but is not
+ *  overloaded so both group_has_capacity and group_is_overloaded return
+ *  false.
+ */
+static inline bool
+group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
+{
+	if (sgs->sum_nr_running <= sgs->group_weight)
+		return false;
 
-	capacity_factor = min_t(unsigned,
-		capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
-	if (!capacity_factor)
-		capacity_factor = fix_small_capacity(env->sd, group);
+	if ((sgs->group_capacity * 100) <
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
 
-	return capacity_factor;
+	return false;
 }
 
-static enum group_type
-group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
+static enum group_type group_classify(struct lb_env *env,
+		struct sched_group *group,
+		struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_nr_running > sgs->group_capacity_factor)
+	if (sgs->group_no_capacity)
 		return group_overloaded;
 
 	if (sg_imbalanced(group))
@@ -6306,11 +6311,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
 	sgs->group_weight = group->group_weight;
-	sgs->group_capacity_factor = sg_capacity_factor(env, group);
-	sgs->group_type = group_classify(group, sgs);
 
-	if (sgs->group_capacity_factor > sgs->sum_nr_running)
-		sgs->group_has_free_capacity = 1;
+	sgs->group_no_capacity = group_is_overloaded(env, sgs);
+	sgs->group_type = group_classify(env, group, sgs);
 }
 
 /**
@@ -6432,18 +6435,19 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
-		 * first, lower the sg capacity factor to one so that we'll try
+		 * first, lower the sg capacity so that we'll try
 		 * and move all the excess tasks away. We lower the capacity
 		 * of a group only if the local group has the capacity to fit
-		 * these excess tasks, i.e. nr_running < group_capacity_factor. The
-		 * extra check prevents the case where you always pull from the
-		 * heaviest group when it is already under-utilized (possible
-		 * with a large weight task outweighs the tasks on the system).
+		 * these excess tasks. The extra check prevents the case where
+		 * you always pull from the heaviest group when it is already
+		 * under-utilized (possible with a large weight task outweighs
+		 * the tasks on the system).
 		 */
 		if (prefer_sibling && sds->local &&
-		    sds->local_stat.group_has_free_capacity) {
-			sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
-			sgs->group_type = group_classify(sg, sgs);
+		    group_has_capacity(env, &sds->local_stat) &&
+		    (sgs->sum_nr_running > 1)) {
+			sgs->group_no_capacity = 1;
+			sgs->group_type = group_overloaded;
 		}
 
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
@@ -6623,11 +6627,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity =
-			(busiest->sum_nr_running - busiest->group_capacity_factor);
-
-		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
-		load_above_capacity /= busiest->group_capacity;
+		load_above_capacity = busiest->sum_nr_running *
+					SCHED_LOAD_SCALE;
+		if (load_above_capacity > busiest->group_capacity)
+			load_above_capacity -= busiest->group_capacity;
+		else
+			load_above_capacity = ~0UL;
 	}
 
 	/*
@@ -6690,6 +6695,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
+	/* ASYM feature bypasses nice load balance check */
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
 		return sds.busiest;
@@ -6710,8 +6716,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
-	    !busiest->group_has_free_capacity)
+	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
+	    busiest->group_no_capacity)
 		goto force_balance;
 
 	/*
@@ -6770,7 +6776,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 	int i;
 
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
-		unsigned long capacity, capacity_factor, wl;
+		unsigned long capacity, wl;
 		enum fbq_type rt;
 
 		rq = cpu_rq(i);
@@ -6799,9 +6805,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			continue;
 
 		capacity = capacity_of(i);
-		capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
-		if (!capacity_factor)
-			capacity_factor = fix_small_capacity(env->sd, group);
 
 		wl = weighted_cpuload(i);
 
@@ -6809,7 +6812,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu capacity.
 		 */
-		if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
+
+		if (rq->nr_running == 1 && wl > env->imbalance &&
+		    !check_cpu_capacity(rq, env->sd))
 			continue;
 
 		/*

* [tip:sched/core] sched: Remove unused struct sched_group_capacity::capacity_orig
  2015-03-03 10:35   ` [PATCH v10 09/11] sched; remove unused capacity_orig Vincent Guittot
@ 2015-03-27 11:42     ` tip-bot for Vincent Guittot
  0 siblings, 0 replies; 68+ messages in thread
From: tip-bot for Vincent Guittot @ 2015-03-27 11:42 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: hpa, vincent.guittot, mingo, linux-kernel, tglx, peterz

Commit-ID:  dc7ff76eadb4b89fd39bb466b8f3773e5467c11d
Gitweb:     http://git.kernel.org/tip/dc7ff76eadb4b89fd39bb466b8f3773e5467c11d
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Tue, 3 Mar 2015 11:35:03 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 09:36:05 +0100

sched: Remove unused struct sched_group_capacity::capacity_orig

The 'struct sched_group_capacity::capacity_orig' field is no longer used
in the scheduler so we can remove it.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425378903-5349-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  | 12 ------------
 kernel/sched/fair.c  | 13 +++----------
 kernel/sched/sched.h |  2 +-
 3 files changed, 4 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7022e90..838fc9d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5447,17 +5447,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 			break;
 		}
 
-		/*
-		 * Even though we initialize ->capacity to something semi-sane,
-		 * we leave capacity_orig unset. This allows us to detect if
-		 * domain iteration is still funny without causing /0 traps.
-		 */
-		if (!group->sgc->capacity_orig) {
-			printk(KERN_CONT "\n");
-			printk(KERN_ERR "ERROR: domain->cpu_capacity not set\n");
-			break;
-		}
-
 		if (!cpumask_weight(sched_group_cpus(group))) {
 			printk(KERN_CONT "\n");
 			printk(KERN_ERR "ERROR: empty group\n");
@@ -5941,7 +5930,6 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		 * die on a /0 trap.
 		 */
 		sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
-		sg->sgc->capacity_orig = sg->sgc->capacity;
 
 		/*
 		 * Make sure the first group of this domain contains the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e13dd0..d36f8d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6073,7 +6073,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 	capacity >>= SCHED_CAPACITY_SHIFT;
 
 	cpu_rq(cpu)->cpu_capacity_orig = capacity;
-	sdg->sgc->capacity_orig = capacity;
 
 	capacity *= scale_rt_capacity(cpu);
 	capacity >>= SCHED_CAPACITY_SHIFT;
@@ -6089,7 +6088,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
-	unsigned long capacity, capacity_orig;
+	unsigned long capacity;
 	unsigned long interval;
 
 	interval = msecs_to_jiffies(sd->balance_interval);
@@ -6101,7 +6100,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 		return;
 	}
 
-	capacity_orig = capacity = 0;
+	capacity = 0;
 
 	if (child->flags & SD_OVERLAP) {
 		/*
@@ -6121,19 +6120,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 			 * Use capacity_of(), which is set irrespective of domains
 			 * in update_cpu_capacity().
 			 *
-			 * This avoids capacity/capacity_orig from being 0 and
+			 * This avoids capacity from being 0 and
 			 * causing divide-by-zero issues on boot.
-			 *
-			 * Runtime updates will correct capacity_orig.
 			 */
 			if (unlikely(!rq->sd)) {
-				capacity_orig += capacity_orig_of(cpu);
 				capacity += capacity_of(cpu);
 				continue;
 			}
 
 			sgc = rq->sd->groups->sgc;
-			capacity_orig += sgc->capacity_orig;
 			capacity += sgc->capacity;
 		}
 	} else  {
@@ -6144,13 +6139,11 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 
 		group = child->groups;
 		do {
-			capacity_orig += group->sgc->capacity_orig;
 			capacity += group->sgc->capacity;
 			group = group->next;
 		} while (group != child->groups);
 	}
 
-	sdg->sgc->capacity_orig = capacity_orig;
 	sdg->sgc->capacity = capacity;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be56dfd..dd532c5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -826,7 +826,7 @@ struct sched_group_capacity {
 	 * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
 	 * for a single CPU.
 	 */
-	unsigned int capacity, capacity_orig;
+	unsigned int capacity;
 	unsigned long next_update;
 	int imbalance; /* XXX unrelated to capacity but shared group state */
 	/*

* [tip:sched/core] sched: Add SD_PREFER_SIBLING for SMT level
  2015-02-27 15:54 ` [PATCH v10 10/11] sched: add SD_PREFER_SIBLING for SMT level Vincent Guittot
  2015-03-02 11:52   ` Srikar Dronamraju
  2015-03-26 10:55   ` Peter Zijlstra
@ 2015-03-27 11:42   ` tip-bot for Vincent Guittot
  2 siblings, 0 replies; 68+ messages in thread
From: tip-bot for Vincent Guittot @ 2015-03-27 11:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, preeti, mingo, peterz, linux-kernel, hpa, vincent.guittot

Commit-ID:  caff37ef96eac7fe96a582d032f6958e834e9447
Gitweb:     http://git.kernel.org/tip/caff37ef96eac7fe96a582d032f6958e834e9447
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Fri, 27 Feb 2015 16:54:13 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 09:36:05 +0100

sched: Add SD_PREFER_SIBLING for SMT level

Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
the scheduler will place at least one task per core.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-11-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 838fc9d..043e2a1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6240,6 +6240,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 	 */
 
 	if (sd->flags & SD_SHARE_CPUCAPACITY) {
+		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */
 

* [tip:sched/core] sched: Move CFS tasks to CPUs with higher capacity
  2015-02-27 15:54 ` [PATCH v10 11/11] sched: move cfs task on a CPU with higher capacity Vincent Guittot
  2015-03-26 14:19   ` Dietmar Eggemann
@ 2015-03-27 11:42   ` tip-bot for Vincent Guittot
  1 sibling, 0 replies; 68+ messages in thread
From: tip-bot for Vincent Guittot @ 2015-03-27 11:42 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: tglx, hpa, vincent.guittot, mingo, peterz, linux-kernel

Commit-ID:  1aaf90a4b88aae26a4535ba01dacab520a310d17
Gitweb:     http://git.kernel.org/tip/1aaf90a4b88aae26a4535ba01dacab520a310d17
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Fri, 27 Feb 2015 16:54:14 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 09:36:06 +0100

sched: Move CFS tasks to CPUs with higher capacity

When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining
capacity for CFS tasks can be significantly reduced. Once we detect such a
situation by comparing cpu_capacity_orig and cpu_capacity, we trigger an idle
load balance to check whether it's worth moving its tasks to an idle CPU.

It's worth trying to move the task before the CPU is fully utilized to
minimize the preemption by IRQs or RT tasks.

Once the idle load_balance has selected the busiest CPU, it will consider an
active load balance in only two cases:

  - There is only 1 task on the busiest CPU.

  - We haven't been able to move a task from the busiest rq.

A CPU with a reduced capacity is included in the 1st case, and it's worth
actively migrating its task if the idle CPU has more available capacity for
CFS tasks. This test has been added in need_active_balance().
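
The added test can be sketched in userspace C as follows. This is only an
illustration of the condition in the diff below: capacity_reduced() mirrors
check_cpu_capacity(), worth_active_migration() is a hypothetical name, and
the capacities and imbalance_pct are passed as plain arguments instead of
being read from the rq and sched_domain:

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors check_cpu_capacity(): capacity noticeably below capacity_orig. */
static bool capacity_reduced(unsigned long cpu_capacity,
			     unsigned long cpu_capacity_orig,
			     unsigned int imbalance_pct)
{
	/* Sketch assumption: capacity never exceeds capacity_orig. */
	assert(cpu_capacity <= cpu_capacity_orig);

	return cpu_capacity * imbalance_pct < cpu_capacity_orig * 100;
}

/* Hypothetical helper: the extra case added to need_active_balance(). */
static bool worth_active_migration(bool dst_idle, unsigned int src_cfs_tasks,
				   unsigned long src_capacity,
				   unsigned long src_capacity_orig,
				   unsigned long dst_capacity,
				   unsigned int imbalance_pct)
{
	/* Only an idle destination and a source with a single CFS task. */
	if (!dst_idle || src_cfs_tasks != 1)
		return false;

	/* Source is squeezed by RT/IRQ and destination offers clearly more. */
	return capacity_reduced(src_capacity, src_capacity_orig, imbalance_pct) &&
	       src_capacity * imbalance_pct < dst_capacity * 100;
}
```

For instance, with an imbalance_pct of 125, a source CPU whose capacity has
dropped from 1024 to 512 qualifies when the idle destination offers 1024, but
not when the destination offers only 600.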

As a sidenote, this will not generate more spurious ilbs because we already
trigger an ilb if there is more than 1 busy cpu. If this cpu is the only one
that has a task, we will trigger the ilb once for migrating the task.

The nohz_kick_needed function has been cleaned up a bit while adding the new
test.

env.src_cpu and env.src_rq must be set unconditionally because they are used
in need_active_balance which is called even if busiest->nr_running equals 1.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-12-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 69 ++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 47 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d36f8d2..0576ce0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6855,6 +6855,19 @@ static int need_active_balance(struct lb_env *env)
 			return 1;
 	}
 
+	/*
+	 * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.
+	 * It's worth migrating the task if the src_cpu's capacity is reduced
+	 * because of other sched_class or IRQs if more capacity stays
+	 * available on dst_cpu.
+	 */
+	if ((env->idle != CPU_NOT_IDLE) &&
+	    (env->src_rq->cfs.h_nr_running == 1)) {
+		if ((check_cpu_capacity(env->src_rq, sd)) &&
+		    (capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100))
+			return 1;
+	}
+
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
 
@@ -6954,6 +6967,9 @@ redo:
 
 	schedstat_add(sd, lb_imbalance[idle], env.imbalance);
 
+	env.src_cpu = busiest->cpu;
+	env.src_rq = busiest;
+
 	ld_moved = 0;
 	if (busiest->nr_running > 1) {
 		/*
@@ -6963,8 +6979,6 @@ redo:
 		 * correctly treated as an imbalance.
 		 */
 		env.flags |= LBF_ALL_PINNED;
-		env.src_cpu   = busiest->cpu;
-		env.src_rq    = busiest;
 		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
 more_balance:
@@ -7664,22 +7678,25 @@ end:
 
 /*
  * Current heuristic for kicking the idle load balancer in the presence
- * of an idle cpu is the system.
+ * of an idle cpu in the system.
  *   - This rq has more than one task.
- *   - At any scheduler domain level, this cpu's scheduler group has multiple
- *     busy cpu's exceeding the group's capacity.
+ *   - This rq has at least one CFS task and the capacity of the CPU is
+ *     significantly reduced because of RT tasks or IRQs.
+ *   - At parent of LLC scheduler domain level, this cpu's scheduler group has
+ *     multiple busy cpu.
  *   - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
  *     domain span are idle.
  */
-static inline int nohz_kick_needed(struct rq *rq)
+static inline bool nohz_kick_needed(struct rq *rq)
 {
 	unsigned long now = jiffies;
 	struct sched_domain *sd;
 	struct sched_group_capacity *sgc;
 	int nr_busy, cpu = rq->cpu;
+	bool kick = false;
 
 	if (unlikely(rq->idle_balance))
-		return 0;
+		return false;
 
        /*
 	* We may be recently in ticked or tickless idle mode. At the first
@@ -7693,38 +7710,46 @@ static inline int nohz_kick_needed(struct rq *rq)
 	 * balancing.
 	 */
 	if (likely(!atomic_read(&nohz.nr_cpus)))
-		return 0;
+		return false;
 
 	if (time_before(now, nohz.next_balance))
-		return 0;
+		return false;
 
 	if (rq->nr_running >= 2)
-		goto need_kick;
+		return true;
 
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_busy, cpu));
-
 	if (sd) {
 		sgc = sd->groups->sgc;
 		nr_busy = atomic_read(&sgc->nr_busy_cpus);
 
-		if (nr_busy > 1)
-			goto need_kick_unlock;
+		if (nr_busy > 1) {
+			kick = true;
+			goto unlock;
+		}
+
 	}
 
-	sd = rcu_dereference(per_cpu(sd_asym, cpu));
+	sd = rcu_dereference(rq->sd);
+	if (sd) {
+		if ((rq->cfs.h_nr_running >= 1) &&
+				check_cpu_capacity(rq, sd)) {
+			kick = true;
+			goto unlock;
+		}
+	}
 
+	sd = rcu_dereference(per_cpu(sd_asym, cpu));
 	if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
-				  sched_domain_span(sd)) < cpu))
-		goto need_kick_unlock;
-
-	rcu_read_unlock();
-	return 0;
+				  sched_domain_span(sd)) < cpu)) {
+		kick = true;
+		goto unlock;
+	}
 
-need_kick_unlock:
+unlock:
 	rcu_read_unlock();
-need_kick:
-	return 1;
+	return kick;
 }
 #else
 static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }

* [tip:sched/core] sched: Optimize freq invariant accounting
  2015-03-23 13:19   ` [PATCH v10 04/11] " Peter Zijlstra
  2015-03-24 10:00     ` Vincent Guittot
@ 2015-03-27 11:43     ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 68+ messages in thread
From: tip-bot for Peter Zijlstra @ 2015-03-27 11:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: vincent.guittot, peterz, mingo, linux-kernel, pjt, hpa, bsegall, tglx

Commit-ID:  dfbca41f347997e57048a53755611c8e2d792924
Gitweb:     http://git.kernel.org/tip/dfbca41f347997e57048a53755611c8e2d792924
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 23 Mar 2015 14:19:05 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 09:36:08 +0100

sched: Optimize freq invariant accounting

Currently the freq invariant accounting (in
__update_entity_runnable_avg() and sched_rt_avg_update()) gets the
scale factor from a weak function call. This means that even for archs
that use the default implementation, the compiler cannot see into this
function and optimize the extra scaling math away.

This is sad, esp. since it's a 64-bit multiplication which can be quite
costly on some platforms.

So replace the weak function with #ifdef and __always_inline goo. This
is not quite as nice from an arch support PoV but should at least
result in compile time errors if done wrong.
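
The pattern can be sketched in userspace C as follows. This is a minimal
sketch, assuming GCC or Clang: __attribute__((always_inline)) stands in for
the kernel's __always_inline, scale_delta() is a hypothetical caller, and the
override shown in the comment uses a made-up function name:

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

struct sched_domain;	/* opaque here; only passed through */

/*
 * An architecture with its own scaling would define the macro before this
 * point, e.g. (hypothetical name):
 *
 *   #define arch_scale_freq_capacity my_arch_scale_freq_capacity
 */
#ifndef arch_scale_freq_capacity
static inline __attribute__((always_inline))
unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
{
	(void)sd;
	(void)cpu;
	return SCHED_CAPACITY_SCALE;
}
#endif

/* Hypothetical caller: scale a time delta by the frequency factor. */
static inline unsigned long long scale_delta(unsigned long long delta, int cpu)
{
	assert(cpu >= 0);
	/* With the default above, the factor is a visible constant. */
	return (delta * arch_scale_freq_capacity((struct sched_domain *)0, cpu))
		>> SCHED_CAPACITY_SHIFT;
}
```

Because the default is an inline function rather than a weak symbol, the
compiler sees the constant SCHED_CAPACITY_SCALE at each call site.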

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten.Rasmussen@arm.com
Cc: Paul Turner <pjt@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/20150323131905.GF23123@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c  | 12 ------------
 kernel/sched/sched.h |  9 ++++++++-
 2 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0576ce0..3a798ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2484,8 +2484,6 @@ static u32 __compute_runnable_contrib(u64 n)
 	return contrib + runnable_avg_yN_sum[n];
 }
 
-unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
-
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
@@ -6010,16 +6008,6 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
 	return load_idx;
 }
 
-static unsigned long default_scale_capacity(struct sched_domain *sd, int cpu)
-{
-	return SCHED_CAPACITY_SCALE;
-}
-
-unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
-{
-	return default_scale_capacity(sd, cpu);
-}
-
 static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
 	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dd532c5..91c6736 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1387,7 +1387,14 @@ static inline int hrtick_enabled(struct rq *rq)
 
 #ifdef CONFIG_SMP
 extern void sched_avg_update(struct rq *rq);
-extern unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
+#ifndef arch_scale_freq_capacity
+static __always_inline
+unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+	return SCHED_CAPACITY_SCALE;
+}
+#endif
 
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {

* Re: [PATCH v10 08/11] sched: replace capacity_factor by usage
  2015-02-27 15:54 ` [PATCH v10 08/11] sched: replace capacity_factor by usage Vincent Guittot
  2015-03-27 11:42   ` [tip:sched/core] sched: Replace " tip-bot for Vincent Guittot
@ 2015-03-27 14:52   ` Xunlei Pang
  2015-03-27 15:59     ` Vincent Guittot
  1 sibling, 1 reply; 68+ messages in thread
From: Xunlei Pang @ 2015-03-27 14:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, lkml, preeti, Morten Rasmussen,
	kamalesh, riel, Linaro Kernel Mailman List, efault,
	dietmar.eggemann

Hi Vincent,

On 27 February 2015 at 23:54, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>  /**
> @@ -6432,18 +6435,19 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>
>                 /*
>                  * In case the child domain prefers tasks go to siblings
> -                * first, lower the sg capacity factor to one so that we'll try
> +                * first, lower the sg capacity so that we'll try
>                  * and move all the excess tasks away. We lower the capacity
>                  * of a group only if the local group has the capacity to fit
> -                * these excess tasks, i.e. nr_running < group_capacity_factor. The
> -                * extra check prevents the case where you always pull from the
> -                * heaviest group when it is already under-utilized (possible
> -                * with a large weight task outweighs the tasks on the system).
> +                * these excess tasks. The extra check prevents the case where
> +                * you always pull from the heaviest group when it is already
> +                * under-utilized (possible with a large weight task outweighs
> +                * the tasks on the system).
>                  */
>                 if (prefer_sibling && sds->local &&
> -                   sds->local_stat.group_has_free_capacity) {
> -                       sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
> -                       sgs->group_type = group_classify(sg, sgs);
> +                   group_has_capacity(env, &sds->local_stat) &&
> +                   (sgs->sum_nr_running > 1)) {
> +                       sgs->group_no_capacity = 1;
> +                       sgs->group_type = group_overloaded;
>                 }
>

For SD_PREFER_SIBLING, if local has 1 task and group_has_capacity()
returns true for it (it is not overloaded), and the sgs group has 2
tasks, should we still mark this group overloaded?

-Xunlei

* Re: [PATCH v10 07/11] sched: get CPU's usage statistic
  2015-02-27 15:54 ` [PATCH v10 07/11] sched: get CPU's usage statistic Vincent Guittot
  2015-03-03 12:47   ` Dietmar Eggemann
  2015-03-04  7:48   ` Vincent Guittot
@ 2015-03-27 15:12   ` Xunlei Pang
  2015-03-27 15:37     ` Vincent Guittot
  2 siblings, 1 reply; 68+ messages in thread
From: Xunlei Pang @ 2015-03-27 15:12 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, lkml, preeti, Morten Rasmussen,
	kamalesh, riel, Linaro Kernel Mailman List, efault,
	Dietmar Eggemann

Hi Vincent,

On 27 February 2015 at 23:54, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
> Monitor the usage level of each group of each sched_domain level. The usage is
> the portion of cpu_capacity_orig that is currently used on a CPU or group of
> CPUs. We use the utilization_load_avg to evaluate the usage level of each
> group.
>
> The utilization_load_avg only takes into account the running time of the CFS
> tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
> utilized. Nevertheless, we must cap utilization_load_avg which can be temporaly
> greater than SCHED_LOAD_SCALE after the migration of a task on this CPU and
> until the metrics are stabilized.
>
> + * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity/
> + */
> +static int get_cpu_usage(int cpu)
> +{
> +       unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
> +       unsigned long capacity = capacity_orig_of(cpu);
> +
> +       if (usage >= SCHED_LOAD_SCALE)
> +               return capacity;

Can "capacity" be greater than SCHED_LOAD_SCALE?
Why use SCHED_LOAD_SCALE instead of "capacity" in this judgement?

-Xunlei

> +
> +       return (usage * capacity) >> SCHED_LOAD_SHIFT;
> +}

* Re: [PATCH v10 07/11] sched: get CPU's usage statistic
  2015-03-27 15:12   ` [PATCH v10 07/11] sched: get CPU's usage statistic Xunlei Pang
@ 2015-03-27 15:37     ` Vincent Guittot
  2015-04-01  3:22       ` Xunlei Pang
  0 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2015-03-27 15:37 UTC (permalink / raw)
  To: Xunlei Pang
  Cc: Peter Zijlstra, Ingo Molnar, lkml, Preeti U Murthy,
	Morten Rasmussen, Kamalesh Babulal, Rik van Riel,
	Linaro Kernel Mailman List, Mike Galbraith, Dietmar Eggemann

On 27 March 2015 at 16:12, Xunlei Pang <pang.xunlei@linaro.org> wrote:
> Hi Vincent,
>
> On 27 February 2015 at 23:54, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
>> Monitor the usage level of each group of each sched_domain level. The usage is
>> the portion of cpu_capacity_orig that is currently used on a CPU or group of
>> CPUs. We use the utilization_load_avg to evaluate the usage level of each
>> group.
>>
>> The utilization_load_avg only takes into account the running time of the CFS
>> tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
>> utilized. Nevertheless, we must cap utilization_load_avg which can be temporaly
>> greater than SCHED_LOAD_SCALE after the migration of a task on this CPU and
>> until the metrics are stabilized.
>>
>> + * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity/
>> + */
>> +static int get_cpu_usage(int cpu)
>> +{
>> +       unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
>> +       unsigned long capacity = capacity_orig_of(cpu);
>> +
>> +       if (usage >= SCHED_LOAD_SCALE)
>> +               return capacity;
>
> Can "capacity" be greater than SCHED_LOAD_SCALE?
> Why use SCHED_LOAD_SCALE instead of "capacity" in this judgement?

Yes, SCHED_LOAD_SCALE is the default value, but the capacity can be in
the range [1536:512] on arm, for example.

>
> -Xunlei
>
>> +
>> +       return (usage * capacity) >> SCHED_LOAD_SHIFT;
>> +}
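
The capping and scaling discussed above can be sketched as a standalone
function. This is only an illustration under stated assumptions, not the
kernel code: cpu_usage_sketch() is a hypothetical name, the per-rq
utilization and the original capacity are passed in as plain arguments,
and SCHED_LOAD_SHIFT is assumed to be 10 (so SCHED_LOAD_SCALE is 1024).

```c
/* Assumed values for illustration; the kernel defines its own. */
#define SCHED_LOAD_SHIFT 10
#define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT)

/*
 * Hypothetical stand-in for get_cpu_usage(): 'usage' plays the role of
 * cfs.utilization_load_avg and 'capacity' the role of capacity_orig_of().
 */
static unsigned long cpu_usage_sketch(unsigned long usage,
                                      unsigned long capacity)
{
        /* usage is expressed against SCHED_LOAD_SCALE, so cap it there */
        if (usage >= SCHED_LOAD_SCALE)
                return capacity;

        /* rescale the usage into the capacity range of this CPU */
        return (usage * capacity) >> SCHED_LOAD_SHIFT;
}
```

With these assumptions, a usage of 512 (half of the scale) on a CPU whose
original capacity is 1536 gives 768, and any usage at or above 1024 gives
the original capacity itself, whether that capacity is 512 or 1536.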

* Re: [PATCH v10 08/11] sched: replace capacity_factor by usage
  2015-03-27 14:52   ` [PATCH v10 08/11] sched: replace " Xunlei Pang
@ 2015-03-27 15:59     ` Vincent Guittot
  2015-04-01  3:37       ` Xunlei Pang
  0 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2015-03-27 15:59 UTC (permalink / raw)
  To: Xunlei Pang
  Cc: Peter Zijlstra, Ingo Molnar, lkml, Preeti U Murthy,
	Morten Rasmussen, Kamalesh Babulal, Rik van Riel,
	Linaro Kernel Mailman List, Mike Galbraith, Dietmar Eggemann

On 27 March 2015 at 15:52, Xunlei Pang <pang.xunlei@linaro.org> wrote:
> Hi Vincent,
>
> On 27 February 2015 at 23:54, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
>>  /**
>> @@ -6432,18 +6435,19 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>>
>>                 /*
>>                  * In case the child domain prefers tasks go to siblings
>> -                * first, lower the sg capacity factor to one so that we'll try
>> +                * first, lower the sg capacity so that we'll try
>>                  * and move all the excess tasks away. We lower the capacity
>>                  * of a group only if the local group has the capacity to fit
>> -                * these excess tasks, i.e. nr_running < group_capacity_factor. The
>> -                * extra check prevents the case where you always pull from the
>> -                * heaviest group when it is already under-utilized (possible
>> -                * with a large weight task outweighs the tasks on the system).
>> +                * these excess tasks. The extra check prevents the case where
>> +                * you always pull from the heaviest group when it is already
>> +                * under-utilized (possible with a large weight task outweighs
>> +                * the tasks on the system).
>>                  */
>>                 if (prefer_sibling && sds->local &&
>> -                   sds->local_stat.group_has_free_capacity) {
>> -                       sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
>> -                       sgs->group_type = group_classify(sg, sgs);
>> +                   group_has_capacity(env, &sds->local_stat) &&
>> +                   (sgs->sum_nr_running > 1)) {
>> +                       sgs->group_no_capacity = 1;
>> +                       sgs->group_type = group_overloaded;
>>                 }
>>
>
> For SD_PREFER_SIBLING, if local has 1 task and group_has_capacity()
> returns true(but not overloaded)  for it, and assume sgs group has 2
> tasks, should we still mark this group overloaded?

Yes, the load balancer will then decide whether it's worth pulling a task
or not, depending on the load of each group.

>
> -Xunlei

* Re: [PATCH v10 07/11] sched: get CPU's usage statistic
  2015-03-27 15:37     ` Vincent Guittot
@ 2015-04-01  3:22       ` Xunlei Pang
  0 siblings, 0 replies; 68+ messages in thread
From: Xunlei Pang @ 2015-04-01  3:22 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, lkml, Preeti U Murthy,
	Morten Rasmussen, Kamalesh Babulal, Rik van Riel,
	Linaro Kernel Mailman List, Mike Galbraith, Dietmar Eggemann

Vincent,
On 27 March 2015 at 23:37, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 27 March 2015 at 16:12, Xunlei Pang <pang.xunlei@linaro.org> wrote:
>>> +static int get_cpu_usage(int cpu)
>>> +{
>>> +       unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
>>> +       unsigned long capacity = capacity_orig_of(cpu);
>>> +
>>> +       if (usage >= SCHED_LOAD_SCALE)
>>> +               return capacity;
>>
>> Can "capacity" be greater than SCHED_LOAD_SCALE?
>> Why use SCHED_LOAD_SCALE instead of "capacity" in this judgement?
>
> Yes, SCHED_LOAD_SCALE is the default value but the capacity can be in
> the range [1536:512] for arm as an example

Right, I had confused cpu capacity with arch_scale_freq_capacity() in
"Patch 04". Thanks.

-Xunlei

* Re: [PATCH v10 08/11] sched: replace capacity_factor by usage
  2015-03-27 15:59     ` Vincent Guittot
@ 2015-04-01  3:37       ` Xunlei Pang
  2015-04-01  9:06         ` Vincent Guittot
  0 siblings, 1 reply; 68+ messages in thread
From: Xunlei Pang @ 2015-04-01  3:37 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, lkml, Preeti U Murthy,
	Morten Rasmussen, Kamalesh Babulal, Rik van Riel,
	Linaro Kernel Mailman List, Mike Galbraith, Dietmar Eggemann

Hi Vincent,

On 27 March 2015 at 23:59, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 27 March 2015 at 15:52, Xunlei Pang <pang.xunlei@linaro.org> wrote:
>> Hi Vincent,
>>
>> On 27 February 2015 at 23:54, Vincent Guittot
>> <vincent.guittot@linaro.org> wrote:
>>>  /**
>>> @@ -6432,18 +6435,19 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>>>
>>>                 /*
>>>                  * In case the child domain prefers tasks go to siblings
>>> -                * first, lower the sg capacity factor to one so that we'll try
>>> +                * first, lower the sg capacity so that we'll try
>>>                  * and move all the excess tasks away. We lower the capacity
>>>                  * of a group only if the local group has the capacity to fit
>>> -                * these excess tasks, i.e. nr_running < group_capacity_factor. The
>>> -                * extra check prevents the case where you always pull from the
>>> -                * heaviest group when it is already under-utilized (possible
>>> -                * with a large weight task outweighs the tasks on the system).
>>> +                * these excess tasks. The extra check prevents the case where
>>> +                * you always pull from the heaviest group when it is already
>>> +                * under-utilized (possible with a large weight task outweighs
>>> +                * the tasks on the system).
>>>                  */
>>>                 if (prefer_sibling && sds->local &&
>>> -                   sds->local_stat.group_has_free_capacity) {
>>> -                       sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
>>> -                       sgs->group_type = group_classify(sg, sgs);
>>> +                   group_has_capacity(env, &sds->local_stat) &&
>>> +                   (sgs->sum_nr_running > 1)) {
>>> +                       sgs->group_no_capacity = 1;
>>> +                       sgs->group_type = group_overloaded;
>>>                 }
>>>
>>
>> For SD_PREFER_SIBLING, if local has 1 task and group_has_capacity()
>> returns true(but not overloaded)  for it, and assume sgs group has 2
>> tasks, should we still mark this group overloaded?
>
> yes, the load balance will then choose if it's worth pulling it or not
> depending of the load of each groups

Maybe I didn't make it clear.
For example, CPU0~1 are SMT siblings, and CPU2~CPU3 are another pair.
CPU0 is idle, and the others each have 1 task. Then according to this
patch, CPU2~CPU3 (as one group) will be viewed as overloaded (CPU0~CPU1
being the local group, and group_has_capacity() returning true here), so
the balancer may initiate an active task migration. This differs from
what the current SD_PREFER_SIBLING logic does. Is this problematic?

>
>>
>> -Xunlei

* Re: [PATCH v10 08/11] sched: replace capacity_factor by usage
  2015-04-01  3:37       ` Xunlei Pang
@ 2015-04-01  9:06         ` Vincent Guittot
  2015-04-01 14:54           ` Xunlei Pang
  0 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2015-04-01  9:06 UTC (permalink / raw)
  To: Xunlei Pang
  Cc: Peter Zijlstra, Ingo Molnar, lkml, Preeti U Murthy,
	Morten Rasmussen, Kamalesh Babulal, Rik van Riel,
	Linaro Kernel Mailman List, Mike Galbraith, Dietmar Eggemann

On 1 April 2015 at 05:37, Xunlei Pang <pang.xunlei@linaro.org> wrote:
> Hi Vincent,
>
> On 27 March 2015 at 23:59, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>> On 27 March 2015 at 15:52, Xunlei Pang <pang.xunlei@linaro.org> wrote:
>>> Hi Vincent,
>>>
>>> On 27 February 2015 at 23:54, Vincent Guittot
>>> <vincent.guittot@linaro.org> wrote:
>>>>  /**
>>>> @@ -6432,18 +6435,19 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>>>>
>>>>                 /*
>>>>                  * In case the child domain prefers tasks go to siblings
>>>> -                * first, lower the sg capacity factor to one so that we'll try
>>>> +                * first, lower the sg capacity so that we'll try
>>>>                  * and move all the excess tasks away. We lower the capacity
>>>>                  * of a group only if the local group has the capacity to fit
>>>> -                * these excess tasks, i.e. nr_running < group_capacity_factor. The
>>>> -                * extra check prevents the case where you always pull from the
>>>> -                * heaviest group when it is already under-utilized (possible
>>>> -                * with a large weight task outweighs the tasks on the system).
>>>> +                * these excess tasks. The extra check prevents the case where
>>>> +                * you always pull from the heaviest group when it is already
>>>> +                * under-utilized (possible with a large weight task outweighs
>>>> +                * the tasks on the system).
>>>>                  */
>>>>                 if (prefer_sibling && sds->local &&
>>>> -                   sds->local_stat.group_has_free_capacity) {
>>>> -                       sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
>>>> -                       sgs->group_type = group_classify(sg, sgs);
>>>> +                   group_has_capacity(env, &sds->local_stat) &&
>>>> +                   (sgs->sum_nr_running > 1)) {
>>>> +                       sgs->group_no_capacity = 1;
>>>> +                       sgs->group_type = group_overloaded;
>>>>                 }
>>>>
>>>
>>> For SD_PREFER_SIBLING, if local has 1 task and group_has_capacity()
>>> returns true(but not overloaded)  for it, and assume sgs group has 2
>>> tasks, should we still mark this group overloaded?
>>
>> yes, the load balance will then choose if it's worth pulling it or not
>> depending of the load of each groups
>
> Maybe I didn't make it clearly.
> For example, CPU0~1 are SMT siblings,  CPU2~CPU3 are another pair.
> CPU0 is idle, others each has 1 task. Then according to this patch,
> CPU2~CPU3(as one group) will be viewed as overloaded(CPU0~CPU1 as
> local group, and group_has_capacity() returns true here), so the
> balancer may initiate an active task moving. This is different from
> the current code as SD_PREFER_SIBLING logic does. Is this problematic?

IMHO, it's not problematic. It's worth triggering a load balance if
there is an imbalance between the 2 groups (as an example, CPU0~1 has
one low nice prio task but CPU2~3 have 2 high nice prio tasks), so the
decision will be made when calculating the imbalance.

Vincent

>
>>
>>>
>>> -Xunlei

* Re: [PATCH v10 08/11] sched: replace capacity_factor by usage
  2015-04-01  9:06         ` Vincent Guittot
@ 2015-04-01 14:54           ` Xunlei Pang
  2015-04-01 15:57             ` Vincent Guittot
  0 siblings, 1 reply; 68+ messages in thread
From: Xunlei Pang @ 2015-04-01 14:54 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, lkml, Preeti U Murthy,
	Morten Rasmussen, Kamalesh Babulal, Rik van Riel,
	Linaro Kernel Mailman List, Mike Galbraith, Dietmar Eggemann

Hi Vincent,

On 1 April 2015 at 17:06, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 1 April 2015 at 05:37, Xunlei Pang <pang.xunlei@linaro.org> wrote:
>> Hi Vincent,
>>
>> On 27 March 2015 at 23:59, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>>> On 27 March 2015 at 15:52, Xunlei Pang <pang.xunlei@linaro.org> wrote:
>>>> Hi Vincent,
>>>>
>>>> On 27 February 2015 at 23:54, Vincent Guittot
>>>> <vincent.guittot@linaro.org> wrote:
>>>>>  /**
>>>>> @@ -6432,18 +6435,19 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>>>>>
>>>>>                 /*
>>>>>                  * In case the child domain prefers tasks go to siblings
>>>>> -                * first, lower the sg capacity factor to one so that we'll try
>>>>> +                * first, lower the sg capacity so that we'll try
>>>>>                  * and move all the excess tasks away. We lower the capacity
>>>>>                  * of a group only if the local group has the capacity to fit
>>>>> -                * these excess tasks, i.e. nr_running < group_capacity_factor. The
>>>>> -                * extra check prevents the case where you always pull from the
>>>>> -                * heaviest group when it is already under-utilized (possible
>>>>> -                * with a large weight task outweighs the tasks on the system).
>>>>> +                * these excess tasks. The extra check prevents the case where
>>>>> +                * you always pull from the heaviest group when it is already
>>>>> +                * under-utilized (possible with a large weight task outweighs
>>>>> +                * the tasks on the system).
>>>>>                  */
>>>>>                 if (prefer_sibling && sds->local &&
>>>>> -                   sds->local_stat.group_has_free_capacity) {
>>>>> -                       sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
>>>>> -                       sgs->group_type = group_classify(sg, sgs);
>>>>> +                   group_has_capacity(env, &sds->local_stat) &&
>>>>> +                   (sgs->sum_nr_running > 1)) {
>>>>> +                       sgs->group_no_capacity = 1;
>>>>> +                       sgs->group_type = group_overloaded;
>>>>>                 }
>>>>>
>>>>
>>>> For SD_PREFER_SIBLING, if local has 1 task and group_has_capacity()
>>>> returns true(but not overloaded)  for it, and assume sgs group has 2
>>>> tasks, should we still mark this group overloaded?
>>>
>>> yes, the load balance will then choose if it's worth pulling it or not
>>> depending of the load of each groups
>>
>> Maybe I didn't make it clearly.
>> For example, CPU0~1 are SMT siblings,  CPU2~CPU3 are another pair.
>> CPU0 is idle, others each has 1 task. Then according to this patch,
>> CPU2~CPU3(as one group) will be viewed as overloaded(CPU0~CPU1 as
>> local group, and group_has_capacity() returns true here), so the
>> balancer may initiate an active task moving. This is different from
>> the current code as SD_PREFER_SIBLING logic does. Is this problematic?
>
> IMHO, it's not problematic, It's worth triggering a load balance if
> there is an imbalance between the 2  groups (as an example CPU0~1 has
> one low nice prio task but CPU1~2 have 2 high nice prio tasks) so the
> decision will be done when calculating the imbalance

Yes, but assume the balancer calculated some imbalance: after moving,
CPU0~CPU1 have 1 low prio task and 1 high prio task, and CPU2~CPU3
have 1 high prio task. It seems this does no good, because there's only
1 task per CPU after all.

So, is the code below better (we may have more than 2 SMT siblings, like
the Broadcom XLP processor with 4 SMT threads per core)?
 if (prefer_sibling && sds->local &&
    group_has_capacity(env, &sds->local_stat) &&
    (sgs->sum_nr_running > sds->local_stat.sum_nr_running + 1)) {
        sgs->group_no_capacity = 1;
        sgs->group_type = group_overloaded;
 }

Thanks,
-Xunlei
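
The two conditions can be compared side by side in a small standalone
sketch. Everything here is a simplification for illustration: struct
sg_stats_sketch and both helpers are hypothetical names, the result of
group_has_capacity() on the local group is modelled as a plain flag, and
prefer_sibling and sds->local are assumed to be set.

```c
#include <stdbool.h>

/* Hypothetical, reduced stand-in for the per-group load-balance stats. */
struct sg_stats_sketch {
        unsigned int sum_nr_running;    /* runnable tasks in the group */
        bool has_capacity;              /* models group_has_capacity() */
};

/* Condition as posted in patch 08/11: more than one task in the group. */
static bool overloaded_as_posted(const struct sg_stats_sketch *local,
                                 const struct sg_stats_sketch *sgs)
{
        return local->has_capacity && sgs->sum_nr_running > 1;
}

/* Alternative above: at least two more tasks than the local group. */
static bool overloaded_alternative(const struct sg_stats_sketch *local,
                                   const struct sg_stats_sketch *sgs)
{
        return local->has_capacity &&
               sgs->sum_nr_running > local->sum_nr_running + 1;
}
```

For the example in this thread (local group CPU0~CPU1 with 1 task,
CPU2~CPU3 with 2 tasks), the posted condition marks CPU2~CPU3 as
overloaded while the alternative does not. With 3 tasks in the remote
group, both conditions mark it as overloaded.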

* Re: [PATCH v10 08/11] sched: replace capacity_factor by usage
  2015-04-01 14:54           ` Xunlei Pang
@ 2015-04-01 15:57             ` Vincent Guittot
  0 siblings, 0 replies; 68+ messages in thread
From: Vincent Guittot @ 2015-04-01 15:57 UTC (permalink / raw)
  To: Xunlei Pang
  Cc: Peter Zijlstra, Ingo Molnar, lkml, Preeti U Murthy,
	Morten Rasmussen, Kamalesh Babulal, Rik van Riel,
	Linaro Kernel Mailman List, Mike Galbraith, Dietmar Eggemann

On 1 April 2015 at 16:54, Xunlei Pang <pang.xunlei@linaro.org> wrote:
> Hi Vincent,
>
> On 1 April 2015 at 17:06, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>> On 1 April 2015 at 05:37, Xunlei Pang <pang.xunlei@linaro.org> wrote:
>>> Hi Vincent,
>>>
>>> On 27 March 2015 at 23:59, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>>>> On 27 March 2015 at 15:52, Xunlei Pang <pang.xunlei@linaro.org> wrote:
>>>>> Hi Vincent,
>>>>>
>>>>> On 27 February 2015 at 23:54, Vincent Guittot
>>>>> <vincent.guittot@linaro.org> wrote:
>>>>>>  /**
>>>>>> @@ -6432,18 +6435,19 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>>>>>>
>>>>>>                 /*
>>>>>>                  * In case the child domain prefers tasks go to siblings
>>>>>> -                * first, lower the sg capacity factor to one so that we'll try
>>>>>> +                * first, lower the sg capacity so that we'll try
>>>>>>                  * and move all the excess tasks away. We lower the capacity
>>>>>>                  * of a group only if the local group has the capacity to fit
>>>>>> -                * these excess tasks, i.e. nr_running < group_capacity_factor. The
>>>>>> -                * extra check prevents the case where you always pull from the
>>>>>> -                * heaviest group when it is already under-utilized (possible
>>>>>> -                * with a large weight task outweighs the tasks on the system).
>>>>>> +                * these excess tasks. The extra check prevents the case where
>>>>>> +                * you always pull from the heaviest group when it is already
>>>>>> +                * under-utilized (possible with a large weight task outweighs
>>>>>> +                * the tasks on the system).
>>>>>>                  */
>>>>>>                 if (prefer_sibling && sds->local &&
>>>>>> -                   sds->local_stat.group_has_free_capacity) {
>>>>>> -                       sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
>>>>>> -                       sgs->group_type = group_classify(sg, sgs);
>>>>>> +                   group_has_capacity(env, &sds->local_stat) &&
>>>>>> +                   (sgs->sum_nr_running > 1)) {
>>>>>> +                       sgs->group_no_capacity = 1;
>>>>>> +                       sgs->group_type = group_overloaded;
>>>>>>                 }
>>>>>>
>>>>>
>>>>> For SD_PREFER_SIBLING, if local has 1 task and group_has_capacity()
>>>>> returns true(but not overloaded)  for it, and assume sgs group has 2
>>>>> tasks, should we still mark this group overloaded?
>>>>
>>>> yes, the load balance will then choose if it's worth pulling it or not
>>>> depending of the load of each groups
>>>
>>> Maybe I didn't make it clearly.
>>> For example, CPU0~1 are SMT siblings,  CPU2~CPU3 are another pair.
>>> CPU0 is idle, others each has 1 task. Then according to this patch,
>>> CPU2~CPU3(as one group) will be viewed as overloaded(CPU0~CPU1 as
>>> local group, and group_has_capacity() returns true here), so the
>>> balancer may initiate an active task moving. This is different from
>>> the current code as SD_PREFER_SIBLING logic does. Is this problematic?
>>
>> IMHO, it's not problematic, It's worth triggering a load balance if
>> there is an imbalance between the 2  groups (as an example CPU0~1 has
>> one low nice prio task but CPU1~2 have 2 high nice prio tasks) so the
>> decision will be done when calculating the imbalance
>
> Yes, but assuming the balancer calculated some imbalance, after moving
> like CPU0~CPU1 have 1 low prio task and 1 high prio task, CPU2~CPU3
> have 1 high piro task, seems it does no good because there's only 1
> task per CPU after all.

In this case I agree that it isn't worth moving a task, and the
scheduler should not do so, because the imbalance will be too small to
find a busiest queue.
So the decision is linked more to the weighted load of the tasks than
to the number of tasks.

>
> So, is code below better( we may have more than 2 SMT siblings, like
> Broadcom XLP processor having 4 SMT per core)?
>  if (prefer_sibling && sds->local &&
>     group_has_capacity(env, &sds->local_stat) &&
>     (sgs->sum_nr_running > sds->local_stat.sum_nr_running + 1)) {
>         sgs->group_no_capacity = 1;
>         sgs->group_type = group_overloaded;
>  }

I would say no, because it mainly depends on the weighted load of the
tasks, so calculate_imbalance is the right place.

Vincent
>
> Thanks,
> -Xunlei

* Re: [PATCH v10 00/11] sched: consolidation of CPU capacity and usage
  2015-02-27 15:54 [PATCH v10 00/11] sched: consolidation of CPU capacity and usage Vincent Guittot
                   ` (11 preceding siblings ...)
  2015-03-11 10:10 ` [PATCH v10 00/11] sched: consolidation of CPU capacity and usage Vincent Guittot
@ 2015-04-02  1:47 ` Wanpeng Li
  2015-04-02  7:30   ` Vincent Guittot
  12 siblings, 1 reply; 68+ messages in thread
From: Wanpeng Li @ 2015-04-02  1:47 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, preeti, Morten.Rasmussen, kamalesh,
	riel, efault, nicolas.pitre, dietmar.eggemann, linaro-kernel,
	Wanpeng Li

Hi Vincent,
On Fri, Feb 27, 2015 at 04:54:03PM +0100, Vincent Guittot wrote:
>This patchset consolidates several changes in the capacity and the usage
>tracking of the CPU. It provides a frequency invariant metric of the usage of
>CPUs and generally improves the accuracy of load/usage tracking in the
>scheduler. The frequency invariant metric is the foundation required for the
>consolidation of cpufreq and implementation of a fully invariant load tracking.
>These are currently WIP and require several changes to the load balancer
>(including how it will use and interprets load and capacity metrics) and
>extensive validation. The frequency invariance is done with
>arch_scale_freq_capacity and this patchset doesn't provide the backends of
>the function which are architecture dependent.
>
>As discussed at LPC14, Morten and I have consolidated our changes into a single
>patchset to make it easier to review and merge.
>
>During load balance, the scheduler evaluates the number of tasks that a group
>of CPUs can handle. The current method assumes that tasks have a fix load of
>SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_CAPACITY_SCALE.
>This assumption generates wrong decision by creating ghost cores or by
>removing real ones when the original capacity of CPUs is different from the
>default SCHED_CAPACITY_SCALE. With this patch set, we don't try anymore to
>evaluate the number of available cores based on the group_capacity but instead
>we evaluate the usage of a group and compare it with its capacity.
>
>This patchset mainly replaces the old capacity_factor method by a new one and
>keeps the general policy almost unchanged. These new metrics will be also used
>in later patches.
>
>The CPU usage is based on a running time tracking version of the current
>implementation of the load average tracking. I also have a version that is
>based on the new implementation proposal [1] but I haven't provide the patches
>and results as [1] is still under review. I can provide change above [1] to
>change how CPU usage is computed and to adapt to new mecanism.

Is there performance data for this cpu capacity and usage improvement?

Regards,
Wanpeng Li 

>
>Change since V9
> - add a dedicated patch for removing unused capacity_orig
> - update some comments and fix typo
> - change the condition for actively migrating task on CPU with higher capacity 
>
>Change since V8
> - reorder patches
>
>Change since V7
> - add freq invariance for usage tracking
> - add freq invariance for scale_rt
> - update comments and commits' message
> - fix init of utilization_avg_contrib
> - fix prefer_sibling
>
>Change since V6
> - add group usage tracking
> - fix some commits' messages
> - minor fix like comments and argument order
>
>Change since V5
> - remove patches that have been merged since v5 : patches 01, 02, 03, 04, 05, 07
> - update commit log and add more details on the purpose of the patches
> - fix/remove useless code with the rebase on patchset [2]
> - remove capacity_orig in sched_group_capacity as it is not used
> - move code in the right patch
> - add some helper function to factorize code
>
>Change since V4
> - rebase to manage conflicts with changes in selection of busiest group
>
>Change since V3:
> - add usage_avg_contrib statistic which sums the running time of tasks on a rq
> - use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization
> - fix replacement power by capacity
> - update some comments
>
>Change since V2:
> - rebase on top of capacity renaming
> - fix wake_affine statistic update
> - rework nohz_kick_needed
> - optimize the active migration of a task from CPU with reduced capacity
> - rename group_activity by group_utilization and remove unused total_utilization
> - repair SD_PREFER_SIBLING and use it for SMT level
> - reorder patchset to gather patches with same topics
>
>Change since V1:
> - add 3 fixes
> - correct some commit messages
> - replace capacity computation by activity
> - take into account current cpu capacity
>
>[1] https://lkml.org/lkml/2014/10/10/131
>[2] https://lkml.org/lkml/2014/7/25/589
>
>Morten Rasmussen (2):
>  sched: Track group sched_entity usage contributions
>  sched: Make sched entity usage tracking scale-invariant
>
>Vincent Guittot (9):
>  sched: add utilization_avg_contrib
>  sched: remove frequency scaling from cpu_capacity
>  sched: make scale_rt invariant with frequency
>  sched: add per rq cpu_capacity_orig
>  sched: get CPU's usage statistic
>  sched: replace capacity_factor by usage
>  sched; remove unused capacity_orig from
>  sched: add SD_PREFER_SIBLING for SMT level
>  sched: move cfs task on a CPU with higher capacity
>
> include/linux/sched.h |  21 ++-
> kernel/sched/core.c   |  15 +--
> kernel/sched/debug.c  |  12 +-
> kernel/sched/fair.c   | 366 +++++++++++++++++++++++++++++++-------------------
> kernel/sched/sched.h  |  15 ++-
> 5 files changed, 271 insertions(+), 158 deletions(-)
>
>-- 
>1.9.1
>

* Re: [PATCH v10 00/11] sched: consolidation of CPU capacity and usage
  2015-04-02  1:47 ` Wanpeng Li
@ 2015-04-02  7:30   ` Vincent Guittot
  0 siblings, 0 replies; 68+ messages in thread
From: Vincent Guittot @ 2015-04-02  7:30 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Morten Rasmussen, Kamalesh Babulal, Rik van Riel, Mike Galbraith,
	Nicolas Pitre, Dietmar Eggemann, Linaro Kernel Mailman List

On 2 April 2015 at 03:47, Wanpeng Li <wanpeng.li@linux.intel.com> wrote:
> Hi Vincent,
> On Fri, Feb 27, 2015 at 04:54:03PM +0100, Vincent Guittot wrote:
>>This patchset consolidates several changes in the capacity and the usage
>>tracking of the CPU. It provides a frequency invariant metric of the usage of
>>CPUs and generally improves the accuracy of load/usage tracking in the
>>scheduler. The frequency invariant metric is the foundation required for the
>>consolidation of cpufreq and implementation of a fully invariant load tracking.
>>These are currently WIP and require several changes to the load balancer
>>(including how it will use and interprets load and capacity metrics) and
>>extensive validation. The frequency invariance is done with
>>arch_scale_freq_capacity and this patchset doesn't provide the backends of
>>the function which are architecture dependent.
>>
>>As discussed at LPC14, Morten and I have consolidated our changes into a single
>>patchset to make it easier to review and merge.
>>
>>During load balance, the scheduler evaluates the number of tasks that a group
>>of CPUs can handle. The current method assumes that tasks have a fix load of
>>SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_CAPACITY_SCALE.
>>This assumption generates wrong decision by creating ghost cores or by
>>removing real ones when the original capacity of CPUs is different from the
>>default SCHED_CAPACITY_SCALE. With this patch set, we don't try anymore to
>>evaluate the number of available cores based on the group_capacity but instead
>>we evaluate the usage of a group and compare it with its capacity.
>>
>>This patchset mainly replaces the old capacity_factor method by a new one and
>>keeps the general policy almost unchanged. These new metrics will be also used
>>in later patches.
>>
>>The CPU usage is based on a running time tracking version of the current
>>implementation of the load average tracking. I also have a version that is
>>based on the new implementation proposal [1] but I haven't provide the patches
>>and results as [1] is still under review. I can provide change above [1] to
>>change how CPU usage is computed and to adapt to new mecanism.
>
> Is there performance data for this cpu capacity and usage improvement?

I don't have data for this version, but I have published figures for a
previous one:
https://lkml.org/lkml/2014/8/26/288

This patchset consolidates the tracking of CPU usage and capacity for
all kinds of architectures and use cases by improving the detection of
overloaded CPUs.
Regarding perf bench on an SMP system, where the goal is to use all
available CPUs and computing capacity, we should not see a perf
improvement, but we should not see a perf regression either.
The difference is noticeable in mid-load use cases, or when rt tasks or
irqs are involved.

Regards,
Vincent
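
As a rough illustration of the "ghost cores" problem described in the
cover letter quoted above, the two approaches can be modelled in a few
lines of standalone C. This is only a simplified sketch, not the code
from the patchset: the helper names (capacity_factor_cores,
group_has_capacity), the rounding rule, the 125% margin and the sample
capacities are all assumptions chosen for the example.

```c
#define SCHED_CAPACITY_SCALE 1024UL

/*
 * Old-style estimate (simplified model): how many "full" cores does a
 * group hold, assuming every core is worth SCHED_CAPACITY_SCALE?
 * Rounds to the closest integer.
 */
static unsigned long capacity_factor_cores(unsigned long group_capacity)
{
	return (group_capacity + SCHED_CAPACITY_SCALE / 2) / SCHED_CAPACITY_SCALE;
}

/*
 * New-style check (simplified model): the group has spare capacity when
 * its usage, inflated by a margin in percent, is still below its capacity.
 * margin_pct is a hypothetical parameter for this sketch.
 */
static int group_has_capacity(unsigned long group_capacity,
			      unsigned long group_usage,
			      unsigned int margin_pct)
{
	return group_capacity * 100 > group_usage * margin_pct;
}
```

With these made-up numbers, a group of 2 CPUs of capacity 1441 each is
estimated at 3 cores (one ghost core), and a group of 4 CPUs of capacity
606 each is estimated at 2 cores (two real cores removed). Comparing
usage with capacity avoids counting cores at all: the second group still
has capacity at a usage of 1200, and no longer has any at 2300.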
>
> Regards,
> Wanpeng Li
>
>>
>>Change since V9
>> - add a dedicated patch for removing unused capacity_orig
>> - update some comments and fix typo
>> - change the condition for actively migrating task on CPU with higher capacity
>>
>>Change since V8
>> - reorder patches
>>
>>Change since V7
>> - add freq invariance for usage tracking
>> - add freq invariance for scale_rt
>> - update comments and commits' message
>> - fix init of utilization_avg_contrib
>> - fix prefer_sibling
>>
>>Change since V6
>> - add group usage tracking
>> - fix some commits' messages
>> - minor fix like comments and argument order
>>
>>Change since V5
>> - remove patches that have been merged since v5 : patches 01, 02, 03, 04, 05, 07
>> - update commit log and add more details on the purpose of the patches
>> - fix/remove useless code with the rebase on patchset [2]
>> - remove capacity_orig in sched_group_capacity as it is not used
>> - move code in the right patch
>> - add some helper function to factorize code
>>
>>Change since V4
>> - rebase to manage conflicts with changes in selection of busiest group
>>
>>Change since V3:
>> - add usage_avg_contrib statistic which sums the running time of tasks on a rq
>> - use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization
>> - fix replacement power by capacity
>> - update some comments
>>
>>Change since V2:
>> - rebase on top of capacity renaming
>> - fix wake_affine statistic update
>> - rework nohz_kick_needed
>> - optimize the active migration of a task from CPU with reduced capacity
>> - rename group_activity by group_utilization and remove unused total_utilization
>> - repair SD_PREFER_SIBLING and use it for SMT level
>> - reorder patchset to gather patches with same topics
>>
>>Change since V1:
>> - add 3 fixes
>> - correct some commit messages
>> - replace capacity computation by activity
>> - take into account current cpu capacity
>>
>>[1] https://lkml.org/lkml/2014/10/10/131
>>[2] https://lkml.org/lkml/2014/7/25/589
>>
>>Morten Rasmussen (2):
>>  sched: Track group sched_entity usage contributions
>>  sched: Make sched entity usage tracking scale-invariant
>>
>>Vincent Guittot (9):
>>  sched: add utilization_avg_contrib
>>  sched: remove frequency scaling from cpu_capacity
>>  sched: make scale_rt invariant with frequency
>>  sched: add per rq cpu_capacity_orig
>>  sched: get CPU's usage statistic
>>  sched: replace capacity_factor by usage
>>  sched; remove unused capacity_orig from
>>  sched: add SD_PREFER_SIBLING for SMT level
>>  sched: move cfs task on a CPU with higher capacity
>>
>> include/linux/sched.h |  21 ++-
>> kernel/sched/core.c   |  15 +--
>> kernel/sched/debug.c  |  12 +-
>> kernel/sched/fair.c   | 366 +++++++++++++++++++++++++++++++-------------------
>> kernel/sched/sched.h  |  15 ++-
>> 5 files changed, 271 insertions(+), 158 deletions(-)
>>
>>--
>>1.9.1
>>

* Re: [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant
  2015-03-25 17:33       ` Peter Zijlstra
  2015-03-25 18:08         ` Vincent Guittot
@ 2015-04-02 16:53         ` Morten Rasmussen
  2015-04-02 17:32           ` Peter Zijlstra
  1 sibling, 1 reply; 68+ messages in thread
From: Morten Rasmussen @ 2015-04-02 16:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Kamalesh Babulal, Rik van Riel, Mike Galbraith, nicolas.pitre,
	Dietmar Eggemann, Linaro Kernel Mailman List, Paul Turner,
	Ben Segall

On Wed, Mar 25, 2015 at 05:33:09PM +0000, Peter Zijlstra wrote:
> On Tue, Mar 24, 2015 at 11:00:57AM +0100, Vincent Guittot wrote:
> > On 23 March 2015 at 14:19, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Fri, Feb 27, 2015 at 04:54:07PM +0100, Vincent Guittot wrote:
> > >
> > >> +     unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > >
> > >> +                     sa->running_avg_sum += delta_w * scale_freq
> > >> +                             >> SCHED_CAPACITY_SHIFT;
> > >
> > > so the only thing that could be improved is somehow making this
> > > multiplication go away when the arch doesn't implement the function.
> > >
> > > But I'm not sure how to do that without #ifdef.
> > >
> > > Maybe a little something like so then... that should make the compiler
> > > get rid of those multiplications unless the arch needs them.
> > 
> > yes, it removes useless multiplication when not used by an arch.
> > It also adds a constraint on the arch side which have to define
> > arch_scale_freq_capacity like below:
> > 
> > #define arch_scale_freq_capacity xxx_arch_scale_freq_capacity
> > with xxx_arch_scale_freq_capacity an architecture specific function
> 
> Yeah, but it not being weak should make that a compile time warn/fail,
> which should be pretty easy to deal with.

Could you enlighten me a bit about how to define the arch specific
implementation without getting into trouble? I'm failing miserably :(

I thought the arm arch-specific topology.h file was a good place to put
the define as it gets included in sched.h, so I did a:

#define arch_scale_freq_capacity arm_arch_scale_freq_capacity

However, I have to put a function prototype in the same (or some other
included) header file to avoid an implicit function declaration.
arch_scale_freq_capacity() takes a struct sched_domain pointer, so I
have to include linux/sched.h which leads to circular dependency between
linux/sched.h and topology.h.

The only way out I can think of is to create (or find) a new
arch-specific header file that can include linux/sched.h and be included
in kernel/sched/sched.h, and have the define and prototype in there.

I must be missing something?

We can drop the sched_domain pointer as we don't use it, but I'm going
to do the same trick for arch_scale_cpu_capacity() as well which does
require the sd pointer.

Finally, is introducing an ARCH_HAS_SCALE_FREQ_CAPACITY or similar a
complete no go?

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 91c6736..5707fb7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1388,12 +1388,14 @@ static inline int hrtick_enabled(struct rq *rq)
 #ifdef CONFIG_SMP
 extern void sched_avg_update(struct rq *rq);
 
-#ifndef arch_scale_freq_capacity
+#ifndef ARCH_HAS_SCALE_FREQ_CAPACITY
 static __always_inline
 unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
 {
 	return SCHED_CAPACITY_SCALE;
 }
+#else
+extern unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
 #endif
 
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
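
For reference, the scaling step discussed earlier in this thread
(delta_w * scale_freq >> SCHED_CAPACITY_SHIFT) can be tried in
isolation. The sketch below is standalone C, not kernel code: the helper
name scale_delta and the sample values are made up for illustration, and
only the arithmetic is taken from the quoted hunk.

```c
#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1UL << SCHED_CAPACITY_SHIFT)

/*
 * Hypothetical helper: scale a running-time delta by the frequency
 * factor, where SCHED_CAPACITY_SCALE means "running at full speed".
 */
static inline unsigned long scale_delta(unsigned long delta_w,
					unsigned long scale_freq)
{
	return (delta_w * scale_freq) >> SCHED_CAPACITY_SHIFT;
}
```

With the default factor of SCHED_CAPACITY_SCALE the result equals
delta_w, which is why a constant-returning inline default allows the
compiler to drop the multiply and shift. With a factor of 512 (half of
full speed), a delta of 1000 is accounted as 500.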

* Re: [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant
  2015-04-02 16:53         ` Morten Rasmussen
@ 2015-04-02 17:32           ` Peter Zijlstra
  2015-04-07 13:31             ` Morten Rasmussen
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2015-04-02 17:32 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Kamalesh Babulal, Rik van Riel, Mike Galbraith, nicolas.pitre,
	Dietmar Eggemann, Linaro Kernel Mailman List, Paul Turner,
	Ben Segall

On Thu, Apr 02, 2015 at 05:53:09PM +0100, Morten Rasmussen wrote:
> Could you enlighten me a bit about how to define the arch specific
> implementation without getting into trouble? I'm failing miserably :(

Hmm, this was not supposed to be difficult.. :/

> I thought the arm arch-specific topology.h file was a good place to put
> the define as it get included in sched.h, so I did a:
> 
> #define arch_scale_freq_capacity arm_arch_scale_freq_capacity
> 
> However, I have to put a function prototype in the same (or some other
> included) header file to avoid doing an implicit function definition.
> arch_scale_freq_capacity() takes a struct sched_domain pointer, so I
> have to include linux/sched.h which leads to circular dependency between
> linux/sched.h and topology.h.

Why would you have to include linux/sched.h ?

#define arch_scale_freq_capacity arch_scale_freq_capacity
struct sched_domain;
extern unsigned long arch_scale_freq_capacity(struct sched_domain *, int cpu);

Would work from your asm/topology.h, right?
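
Put together in one translation unit, the pattern can be sketched as
follows. This is a standalone illustration, not the actual arm or
scheduler code: the split into three sections stands in for separate
files, and the body of the arch function is a placeholder policy
invented for the example.

```c
/* --- what an arch header such as asm/topology.h could contain --- */
#define arch_scale_freq_capacity arch_scale_freq_capacity
struct sched_domain;	/* forward declaration, no linux/sched.h needed */
extern unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu);

/* --- what the generic scheduler header could contain --- */
#define SCHED_CAPACITY_SCALE 1024UL

#ifndef arch_scale_freq_capacity
/* Default used only when the arch did not define the macro above. */
static inline
unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
{
	return SCHED_CAPACITY_SCALE;
}
#endif

/* --- what an arch .c file could contain (placeholder body) --- */
unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
{
	(void)sd;	/* unused in this sketch */
	/* made-up policy: pretend CPU 0 runs at half its maximum frequency */
	return cpu == 0 ? SCHED_CAPACITY_SCALE / 2 : SCHED_CAPACITY_SCALE;
}
```

Because the macro expands to its own name, the #ifndef test in the
generic header sees it as defined and skips the inline default, while
every call site still resolves to the arch function. Only a pointer to
struct sched_domain is used, so the incomplete type is enough.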

> We can drop the sched_domain pointer as we don't use it, but I'm going
> to do the same trick for arch_scale_cpu_capacity() as well which does
> require the sd pointer.

Sure, dropping that pointer is fine.

> Finally, is introducing an ARCH_HAS_SCALE_FREQ_CAPACITY or similar a
> complete no go?

It seems out of style, I'd have to go look for the email thread, but
this should more or less be the same no?

* Re: [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant
  2015-04-02 17:32           ` Peter Zijlstra
@ 2015-04-07 13:31             ` Morten Rasmussen
  0 siblings, 0 replies; 68+ messages in thread
From: Morten Rasmussen @ 2015-04-07 13:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Kamalesh Babulal, Rik van Riel, Mike Galbraith, nicolas.pitre,
	Dietmar Eggemann, Linaro Kernel Mailman List, Paul Turner,
	Ben Segall

On Thu, Apr 02, 2015 at 06:32:38PM +0100, Peter Zijlstra wrote:
> On Thu, Apr 02, 2015 at 05:53:09PM +0100, Morten Rasmussen wrote:
> > Could you enlighten me a bit about how to define the arch specific
> > implementation without getting into trouble? I'm failing miserably :(
> 
> Hmm, this was not supposed to be difficult.. :/

I wouldn't have thought so, and it turned out not to be...

> 
> > I thought the arm arch-specific topology.h file was a good place to put
> > the define as it get included in sched.h, so I did a:
> > 
> > #define arch_scale_freq_capacity arm_arch_scale_freq_capacity
> > 
> > However, I have to put a function prototype in the same (or some other
> > included) header file to avoid doing an implicit function definition.
> > arch_scale_freq_capacity() takes a struct sched_domain pointer, so I
> > have to include linux/sched.h which leads to circular dependency between
> > linux/sched.h and topology.h.
> 
> Why would you have to include linux/sched.h ?
> 
> #define arch_scale_freq_capacity arch_scale_freq_capacity
> struct sched_domain;
> extern unsigned long arch_scale_freq_capacity(struct sched_domain *, int cpu);
> 
> Would work from you asm/topology.h, right?

Yes, of course, it works fine. It was just me missing the most obvious
solution. No need to include linux/sched.h, 'struct sched_domain;' was
the bit I was missing. Sorry for the noise.

> > We can drop the sched_domain pointer as we don't use it, but I'm going
> > to do the same trick for arch_scale_cpu_capacity() as well which does
> > require the sd pointer.
> 
> Sure, dropping that pointer is fine.
> 
> > Finally, is introducing an ARCH_HAS_SCALE_FREQ_CAPACITY or similar a
> > complete no go?
> 
> It seems out of style, I'd have to go look for the email thread, but
> this should more or less be the same no?

The above works just fine, so no need for that anyway :)

Thread overview: 68+ messages
2015-02-27 15:54 [PATCH v10 00/11] sched: consolidation of CPU capacity and usage Vincent Guittot
2015-02-27 15:54 ` [PATCH v10 01/11] sched: add utilization_avg_contrib Vincent Guittot
2015-03-27 11:40   ` [tip:sched/core] sched: Add sched_avg::utilization_avg_contrib tip-bot for Vincent Guittot
2015-02-27 15:54 ` [PATCH v10 02/11] sched: Track group sched_entity usage contributions Vincent Guittot
2015-03-27 11:40   ` [tip:sched/core] " tip-bot for Morten Rasmussen
2015-02-27 15:54 ` [PATCH v10 03/11] sched: remove frequency scaling from cpu_capacity Vincent Guittot
2015-03-27 11:40   ` [tip:sched/core] sched: Remove " tip-bot for Vincent Guittot
2015-02-27 15:54 ` [PATCH v10 04/11] sched: Make sched entity usage tracking scale-invariant Vincent Guittot
2015-03-03 12:51   ` Dietmar Eggemann
2015-03-04  7:54     ` Vincent Guittot
2015-03-04  7:46   ` Vincent Guittot
2015-03-27 11:40     ` [tip:sched/core] " tip-bot for Morten Rasmussen
2015-03-23 13:19   ` [PATCH v10 04/11] " Peter Zijlstra
2015-03-24 10:00     ` Vincent Guittot
2015-03-25 17:33       ` Peter Zijlstra
2015-03-25 18:08         ` Vincent Guittot
2015-03-26 17:38           ` Morten Rasmussen
2015-03-26 17:40             ` Morten Rasmussen
2015-03-26 17:46             ` [PATCH 1/2] sched: Change arch_scale_*() functions to scale input factor Morten Rasmussen
2015-03-26 17:46               ` [PATCH 2/2] sched: Make sched entity usage tracking scale-invariant Morten Rasmussen
2015-03-26 17:47             ` [PATCH v10 04/11] " Peter Zijlstra
2015-03-26 17:51               ` Morten Rasmussen
2015-03-27  8:17             ` Vincent Guittot
2015-03-27  9:05               ` Vincent Guittot
2015-04-02 16:53         ` Morten Rasmussen
2015-04-02 17:32           ` Peter Zijlstra
2015-04-07 13:31             ` Morten Rasmussen
2015-03-27 11:43     ` [tip:sched/core] sched: Optimize freq invariant accounting tip-bot for Peter Zijlstra
2015-02-27 15:54 ` [PATCH v10 05/11] sched: make scale_rt invariant with frequency Vincent Guittot
2015-03-27 11:41   ` [tip:sched/core] sched: Make " tip-bot for Vincent Guittot
2015-02-27 15:54 ` [PATCH v10 06/11] sched: add per rq cpu_capacity_orig Vincent Guittot
2015-03-27 11:41   ` [tip:sched/core] sched: Add struct rq::cpu_capacity_orig tip-bot for Vincent Guittot
2015-02-27 15:54 ` [PATCH v10 07/11] sched: get CPU's usage statistic Vincent Guittot
2015-03-03 12:47   ` Dietmar Eggemann
2015-03-04  7:53     ` Vincent Guittot
2015-03-04  7:48   ` Vincent Guittot
2015-03-27 11:41     ` [tip:sched/core] sched: Calculate CPU' s usage statistic and put it into struct sg_lb_stats::group_usage tip-bot for Vincent Guittot
2015-03-27 15:12   ` [PATCH v10 07/11] sched: get CPU's usage statistic Xunlei Pang
2015-03-27 15:37     ` Vincent Guittot
2015-04-01  3:22       ` Xunlei Pang
2015-02-27 15:54 ` [PATCH v10 08/11] sched: replace capacity_factor by usage Vincent Guittot
2015-03-27 11:42   ` [tip:sched/core] sched: Replace " tip-bot for Vincent Guittot
2015-03-27 14:52   ` [PATCH v10 08/11] sched: replace " Xunlei Pang
2015-03-27 15:59     ` Vincent Guittot
2015-04-01  3:37       ` Xunlei Pang
2015-04-01  9:06         ` Vincent Guittot
2015-04-01 14:54           ` Xunlei Pang
2015-04-01 15:57             ` Vincent Guittot
2015-02-27 15:54 ` [PATCH v10 09/11] sched; remove unused capacity_orig from Vincent Guittot
2015-03-03 10:18   ` Morten Rasmussen
2015-03-03 10:33     ` Vincent Guittot
2015-03-03 10:35   ` [PATCH v10 09/11] sched; remove unused capacity_orig Vincent Guittot
2015-03-27 11:42     ` [tip:sched/core] sched: Remove unused struct sched_group_capacity ::capacity_orig tip-bot for Vincent Guittot
2015-02-27 15:54 ` [PATCH v10 10/11] sched: add SD_PREFER_SIBLING for SMT level Vincent Guittot
2015-03-02 11:52   ` Srikar Dronamraju
2015-03-03  8:38     ` Vincent Guittot
2015-03-23  9:11       ` Peter Zijlstra
2015-03-23  9:59         ` Preeti U Murthy
2015-03-26 10:55   ` Peter Zijlstra
2015-03-26 12:03     ` Preeti U Murthy
2015-03-27 11:42   ` [tip:sched/core] sched: Add " tip-bot for Vincent Guittot
2015-02-27 15:54 ` [PATCH v10 11/11] sched: move cfs task on a CPU with higher capacity Vincent Guittot
2015-03-26 14:19   ` Dietmar Eggemann
2015-03-26 15:43     ` Vincent Guittot
2015-03-27 11:42   ` [tip:sched/core] sched: Move CFS tasks to CPUs " tip-bot for Vincent Guittot
2015-03-11 10:10 ` [PATCH v10 00/11] sched: consolidation of CPU capacity and usage Vincent Guittot
2015-04-02  1:47 ` Wanpeng Li
2015-04-02  7:30   ` Vincent Guittot
