LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
@ 2018-03-15 14:46 Morten Rasmussen
  2018-03-15 14:46 ` [PATCHv2 1/7] sched: Add static_key for asymmetric cpu capacity optimizations Morten Rasmussen
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Morten Rasmussen @ 2018-03-15 14:46 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial
for performance that cpu intensive tasks are aggressively migrated to
high capacity cpus as soon as those become available. The capacity
awareness tweaks already in the wake-up path can't handle this as such
tasks might run or be runnable forever. If they happen to be placed on a
low capacity cpu from the beginning they are stuck there forever while
high capacity cpus may have become available in the meantime.

To address this issue this patch set introduces a new "misfit"
load-balancing scenario in periodic/nohz/newly idle balance which tweaks
the load-balance conditions to ignore load per capacity in certain
cases. Since misfit tasks are commonly running alone on a cpu, more
aggressive active load-balancing is needed too.

The fundamental idea of this patch set has been in Android kernels for a
long time and is absolutely essential for consistent performance on
asymmetric cpu capacity systems.

The patches have been tested on:
   1. Arm Juno (r0): 2+4 Cortex A57/A53
   2. Hikey960: 4+4 Cortex A73/A53

Test case:
Big cpus are always kept busy. Pin a shorter running sysbench tasks to
big cpus, while creating a longer running set of unpinned sysbench
tasks.

    REQUESTS=1000
    BIGS="1 2"
    LITTLES="0 3 4 5"
 
    # Don't care about the score for those, just keep the bigs busy
    for i in $BIGS; do
        taskset -c $i sysbench --max-requests=$((REQUESTS / 4)) \
        --test=cpu  run &>/dev/null &
    done
 
    for i in $LITTLES; do
        sysbench --max-requests=$REQUESTS --test=cpu run \
	| grep "total time:" &
    done
 
    wait

Results:
Single runs with completion time of each task
Juno (tip)
    total time:                          1.2608s
    total time:                          1.2995s
    total time:                          1.5954s
    total time:                          1.7463s

Juno (misfit)
    total time:                          1.2575s
    total time:                          1.3004s
    total time:                          1.5860s
    total time:                          1.5871s

Hikey960 (tip)
    total time:                          1.7431s
    total time:                          2.2914s
    total time:                          2.5976s
    total time:                          1.7280s

Hikey960 (misfit)
    total time:                          1.7866s
    total time:                          1.7513s
    total time:                          1.6918s
    total time:                          1.6965s

10 run summary (tracking longest running task for each run)
	Juno		Hikey960
	avg	max	avg	max
tip     1.7465  1.7469  2.5997  2.6131 
misfit  1.6016  1.6192  1.8506  1.9666

Changelog:
v2
- Removed redudant condition in static_key enablement.
- Fixed logic flaw in patch #2 reported by Yi Yao <yi.yao@intel.com>
- Dropped patch #4 as although the patch seems to make sense no benefit
  has been proven.
- Dropped root_domain->overload renaming
- Changed type of root_domain->overload to int
- Wrapped accesses of rq->rd->overload with READ/WRITE_ONCE
- v1 Tested-by: Gaku Inami <gaku.inami.xh@renesas.com>

Morten Rasmussen (3):
  sched: Add static_key for asymmetric cpu capacity optimizations
  sched/fair: Add group_misfit_task load-balance type
  sched/fair: Consider misfit tasks when load-balancing

Valentin Schneider (4):
  sched/fair: Kick nohz balance if rq->misfit_task
  sched: Change root_domain->overload type to int
  sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE
  sched/fair: Set sd->overload when misfit

 kernel/sched/fair.c     | 115 +++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h    |  15 +++++--
 kernel/sched/topology.c |   9 ++++
 3 files changed, 115 insertions(+), 24 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCHv2 1/7] sched: Add static_key for asymmetric cpu capacity optimizations
  2018-03-15 14:46 [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
@ 2018-03-15 14:46 ` Morten Rasmussen
  2018-04-27 14:04   ` [PATCHv2,1/7] " Jiada Wang
  2018-03-15 14:46 ` [PATCHv2 2/7] sched/fair: Add group_misfit_task load-balance type Morten Rasmussen
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Morten Rasmussen @ 2018-03-15 14:46 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

The existing asymmetric cpu capacity code should cause minimal overhead
for others. Putting it behind a static_key, it has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c     | 3 +++
 kernel/sched/sched.h    | 1 +
 kernel/sched/topology.c | 9 +++++++++
 3 files changed, 13 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3582117e1580..e4ce572113a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6385,6 +6385,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 {
 	long min_cap, max_cap;
 
+	if (!static_branch_unlikely(&sched_asym_cpucapacity))
+		return 0;
+
 	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
 	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 22909ffc04fb..017f62069e0b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1140,6 +1140,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+extern struct static_key_false sched_asym_cpucapacity;
 
 struct sched_group_capacity {
 	atomic_t		ref;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 64cc564f5255..b023a5b3665a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -398,6 +398,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
 
 static void update_top_cache_domain(int cpu)
 {
@@ -425,6 +426,12 @@ static void update_top_cache_domain(int cpu)
 	rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
 }
 
+static void update_asym_cpucapacity(int cpu)
+{
+	if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
+		static_branch_enable(&sched_asym_cpucapacity);
+}
+
 /*
  * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
  * hold the hotplug lock.
@@ -1705,6 +1712,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 
 		cpu_attach_domain(sd, d.rd, i);
 	}
+
+	update_asym_cpucapacity(cpumask_first(cpu_map));
 	rcu_read_unlock();
 
 	if (rq && sched_debug_enabled) {
-- 
2.7.4

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCHv2 2/7] sched/fair: Add group_misfit_task load-balance type
  2018-03-15 14:46 [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
  2018-03-15 14:46 ` [PATCHv2 1/7] sched: Add static_key for asymmetric cpu capacity optimizations Morten Rasmussen
@ 2018-03-15 14:46 ` Morten Rasmussen
  2018-04-11 10:45   ` Peter Zijlstra
  2018-03-15 14:47 ` [PATCHv2 3/7] sched/fair: Consider misfit tasks when load-balancing Morten Rasmussen
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Morten Rasmussen @ 2018-03-15 14:46 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

To maximize throughput in systems with asymmetric cpu capacities (e.g.
ARM big.LITTLE) load-balancing has to consider task and cpu utilization
as well as per-cpu compute capacity when load-balancing in addition to
the current average load based load-balancing policy. Tasks with high
utilization that are scheduled on a lower capacity cpu need to be
identified and migrated to a higher capacity cpu if possible to maximize
throughput.

To implement this additional policy an additional group_type
(load-balance scenario) is added: group_misfit_task. This represents
scenarios where a sched_group has one or more tasks that are not
suitable for its per-cpu capacity. group_misfit_task is only considered
if the system is not overloaded or imbalanced (group_imbalanced or
group_overloaded).

Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each cpu is responsible for tracking misfit tasks themselves and update
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on sched_tick.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c  | 57 +++++++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h |  2 ++
 2 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4ce572113a9..1e06c722bc2e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -697,6 +697,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
+static unsigned long capacity_of(int cpu);
 
 /* Give new sched_entity start runnable values to heavy its load in infant time */
 void init_entity_runnable_average(struct sched_entity *se)
@@ -1407,7 +1408,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 static unsigned long weighted_cpuload(struct rq *rq);
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
-static unsigned long capacity_of(int cpu);
 
 /* Cached statistics for all CPUs within a node */
 struct numa_stats {
@@ -3873,6 +3873,30 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
 
 static int idle_balance(struct rq *this_rq, struct rq_flags *rf);
 
+static inline unsigned long task_util(struct task_struct *p);
+static inline int task_fits_capacity(struct task_struct *p, long capacity)
+{
+	return capacity * 1024 > task_util(p) * capacity_margin;
+}
+
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
+{
+	if (!static_branch_unlikely(&sched_asym_cpucapacity))
+		return;
+
+	if (!p) {
+		rq->misfit_task_load = 0;
+		return;
+	}
+
+	if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
+		rq->misfit_task_load = 0;
+		return;
+	}
+
+	rq->misfit_task_load = task_h_load(p);
+}
+
 #else /* CONFIG_SMP */
 
 static inline int
@@ -3902,6 +3926,8 @@ static inline int idle_balance(struct rq *rq, struct rq_flags *rf)
 	return 0;
 }
 
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
+
 #endif /* CONFIG_SMP */
 
 static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -5835,7 +5861,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	return target;
 }
 
-static inline unsigned long task_util(struct task_struct *p);
 static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
 
 static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -6398,7 +6423,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 	/* Bring task utilization in sync with prev_cpu */
 	sync_entity_load_avg(&p->se);
 
-	return min_cap * 1024 < task_util(p) * capacity_margin;
+	return !task_fits_capacity(p, min_cap);
 }
 
 /*
@@ -6829,9 +6854,12 @@ done: __maybe_unused;
 	if (hrtick_enabled(rq))
 		hrtick_start_fair(rq, p);
 
+	update_misfit_status(p, rq);
+
 	return p;
 
 idle:
+	update_misfit_status(NULL, rq);
 	new_tasks = idle_balance(rq, rf);
 
 	/*
@@ -7037,6 +7065,13 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 enum fbq_type { regular, remote, all };
 
+enum group_type {
+	group_other = 0,
+	group_misfit_task,
+	group_imbalanced,
+	group_overloaded,
+};
+
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
 #define LBF_DST_PINNED  0x04
@@ -7578,12 +7613,6 @@ static unsigned long task_h_load(struct task_struct *p)
 
 /********** Helpers for find_busiest_group ************************/
 
-enum group_type {
-	group_other = 0,
-	group_imbalanced,
-	group_overloaded,
-};
-
 /*
  * sg_lb_stats - stats of a sched_group required for load_balancing
  */
@@ -7599,6 +7628,7 @@ struct sg_lb_stats {
 	unsigned int group_weight;
 	enum group_type group_type;
 	int group_no_capacity;
+	int group_misfit_task_load; /* A cpu has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -7898,6 +7928,9 @@ group_type group_classify(struct sched_group *group,
 	if (sg_imbalanced(group))
 		return group_imbalanced;
 
+	if (sgs->group_misfit_task_load)
+		return group_misfit_task;
+
 	return group_other;
 }
 
@@ -7972,6 +8005,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		 */
 		if (!nr_running && idle_cpu(i))
 			sgs->idle_cpus++;
+
+		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+		    !sgs->group_misfit_task_load && rq->misfit_task_load)
+			sgs->group_misfit_task_load = rq->misfit_task_load;
 	}
 
 	/* Adjust by relative CPU capacity of the group */
@@ -9753,6 +9790,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
+
+	update_misfit_status(curr, rq);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 017f62069e0b..22177dfc1f04 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -815,6 +815,8 @@ struct rq {
 
 	unsigned char		idle_balance;
 
+	unsigned int		misfit_task_load;
+
 	/* For active balancing */
 	int			active_balance;
 	int			push_cpu;
-- 
2.7.4

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCHv2 3/7] sched/fair: Consider misfit tasks when load-balancing
  2018-03-15 14:46 [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
  2018-03-15 14:46 ` [PATCHv2 1/7] sched: Add static_key for asymmetric cpu capacity optimizations Morten Rasmussen
  2018-03-15 14:46 ` [PATCHv2 2/7] sched/fair: Add group_misfit_task load-balance type Morten Rasmussen
@ 2018-03-15 14:47 ` Morten Rasmussen
  2018-04-11 10:46   ` Peter Zijlstra
  2018-04-11 10:53   ` Peter Zijlstra
  2018-03-15 14:47 ` [PATCHv2 4/7] sched/fair: Kick nohz balance if rq->misfit_task Morten Rasmussen
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 15+ messages in thread
From: Morten Rasmussen @ 2018-03-15 14:47 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

On asymmetric cpu capacity systems load intensive tasks can end up on
cpus that don't suit their compute demand.  In this scenarios 'misfit'
tasks should be migrated to cpus with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.

Misfit balancing only makes sense between a source group of lower
per-cpu capacity and destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has lowest
priority so any imbalance due to overload is dealt with first.

The modifications are:

1. Only pick a group containing misfit tasks as the busiest group if the
   destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
   load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
   to be pulled ignoring average load.
4. Pick the first cpu with the rq->misfit flag raised as the source cpu.
5. If the misfit task is alone on the source cpu, go for active
   balancing.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e06c722bc2e..496062860733 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7102,6 +7102,7 @@ struct lb_env {
 	unsigned int		loop_max;
 
 	enum fbq_type		fbq_type;
+	enum group_type		src_grp_type;
 	struct list_head	tasks;
 };
 
@@ -8044,6 +8045,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 {
 	struct sg_lb_stats *busiest = &sds->busiest_stat;
 
+	/*
+	 * Don't try to pull misfit tasks we can't help.
+	 */
+	if (sgs->group_type == group_misfit_task &&
+	    (!group_smaller_cpu_capacity(sg, sds->local) ||
+	     !group_has_capacity(env, &sds->local_stat)))
+		return false;
+
 	if (sgs->group_type > busiest->group_type)
 		return true;
 
@@ -8363,8 +8372,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 * factors in sg capacity and sgs with smaller group_type are
 	 * skipped when updating the busiest sg:
 	 */
-	if (busiest->avg_load <= sds->avg_load ||
-	    local->avg_load >= sds->avg_load) {
+	if (busiest->group_type != group_misfit_task &&
+	    (busiest->avg_load <= sds->avg_load ||
+	     local->avg_load >= sds->avg_load)) {
 		env->imbalance = 0;
 		return fix_small_imbalance(env, sds);
 	}
@@ -8398,6 +8408,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		(sds->avg_load - local->avg_load) * local->group_capacity
 	) / SCHED_CAPACITY_SCALE;
 
+	/* Boost imbalance to allow misfit task to be balanced. */
+	if (busiest->group_type == group_misfit_task) {
+		env->imbalance = max_t(long, env->imbalance,
+				       busiest->group_misfit_task_load);
+	}
+
 	/*
 	 * if *imbalance is less than the average load per runnable task
 	 * there is no guarantee that any tasks will be moved so we'll have
@@ -8464,6 +8480,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	    busiest->group_no_capacity)
 		goto force_balance;
 
+	/* Misfit tasks should be dealt with regardless of the avg load */
+	if (busiest->group_type == group_misfit_task)
+		goto force_balance;
+
 	/*
 	 * If the local group is busier than the selected busiest group
 	 * don't try and pull any tasks.
@@ -8501,6 +8521,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
+	env->src_grp_type = busiest->group_type;
 	calculate_imbalance(env, &sds);
 	return sds.busiest;
 
@@ -8548,6 +8569,14 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		if (rt > env->fbq_type)
 			continue;
 
+		/*
+		 * For ASYM_CPUCAPACITY domains with misfit tasks we ignore
+		 * load.
+		 */
+		if (env->src_grp_type == group_misfit_task &&
+		    rq->misfit_task_load)
+			return rq;
+
 		capacity = capacity_of(i);
 
 		wl = weighted_cpuload(rq);
@@ -8617,6 +8646,9 @@ static int need_active_balance(struct lb_env *env)
 			return 1;
 	}
 
+	if (env->src_grp_type == group_misfit_task)
+		return 1;
+
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCHv2 4/7] sched/fair: Kick nohz balance if rq->misfit_task
  2018-03-15 14:46 [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (2 preceding siblings ...)
  2018-03-15 14:47 ` [PATCHv2 3/7] sched/fair: Consider misfit tasks when load-balancing Morten Rasmussen
@ 2018-03-15 14:47 ` Morten Rasmussen
  2018-03-15 14:47 ` [PATCHv2 5/7] sched: Change root_domain->overload type to int Morten Rasmussen
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Morten Rasmussen @ 2018-03-15 14:47 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

From: Valentin Schneider <valentin.schneider@arm.com>

There already are a few conditions in nohz_kick_needed() to ensure
a nohz kick is triggered, but they are not enough for some misfit
task scenarios. Excluding asym packing, those are:

* rq->nr_running >=2: Not relevant here because we are running a
misfit task, it needs to be migrated regardless and potentially through
active balance.
* sds->nr_busy_cpus > 1: If there is only the misfit task being run
on a group of low capacity cpus, this will be evaluated to False.
* rq->cfs.h_nr_running >=1 && check_cpu_capacity(): Not relevant here,
misfit task needs to be migrated regardless of rt/IRQ pressure

As such, this commit adds an rq->misfit_task condition to trigger a
nohz kick.

The idea to kick a nohz balance for misfit tasks originally came from
Leo Yan <leo.yan@linaro.org>, and a similar patch was submitted for
the Android Common Kernel - see [1].

[1]: https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 496062860733..c791dd7ac9a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9291,7 +9291,7 @@ static void nohz_balancer_kick(struct rq *rq)
 	if (time_before(now, nohz.next_balance))
 		goto out;
 
-	if (rq->nr_running >= 2) {
+	if (rq->nr_running >= 2 || rq->misfit_task_load) {
 		flags = NOHZ_KICK_MASK;
 		goto out;
 	}
-- 
2.7.4

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCHv2 5/7] sched: Change root_domain->overload type to int
  2018-03-15 14:46 [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (3 preceding siblings ...)
  2018-03-15 14:47 ` [PATCHv2 4/7] sched/fair: Kick nohz balance if rq->misfit_task Morten Rasmussen
@ 2018-03-15 14:47 ` Morten Rasmussen
  2018-03-15 14:47 ` [PATCHv2 6/7] sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE Morten Rasmussen
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Morten Rasmussen @ 2018-03-15 14:47 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

From: Valentin Schneider <valentin.schneider@arm.com>

sizeof(_Bool) is implementation defined, so let's just go with 'int' as
is done for other structures e.g. sched_domain_shared->has_idle_cores.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c  | 7 +++----
 kernel/sched/sched.h | 4 ++--
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c791dd7ac9a8..fe4824a37302 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7969,7 +7969,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
 			int local_group, struct sg_lb_stats *sgs,
-			bool *overload)
+			int *overload)
 {
 	unsigned long load;
 	int i, nr_running;
@@ -7994,7 +7994,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		nr_running = rq->nr_running;
 		if (nr_running > 1)
-			*overload = true;
+			*overload = 1;
 
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
@@ -8143,8 +8143,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
-	int load_idx, prefer_sibling = 0;
-	bool overload = false;
+	int load_idx, prefer_sibling, overload = 0;
 
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 22177dfc1f04..5379f647016d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -689,7 +689,7 @@ struct root_domain {
 	cpumask_var_t		online;
 
 	/* Indicate more than one runnable task for any CPU */
-	bool			overload;
+	int			overload;
 
 	/*
 	 * The bit corresponding to a CPU gets set here if such CPU has more
@@ -1652,7 +1652,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 	if (prev_nr < 2 && rq->nr_running >= 2) {
 #ifdef CONFIG_SMP
 		if (!rq->rd->overload)
-			rq->rd->overload = true;
+			rq->rd->overload = 1;
 #endif
 	}
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCHv2 6/7] sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE
  2018-03-15 14:46 [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (4 preceding siblings ...)
  2018-03-15 14:47 ` [PATCHv2 5/7] sched: Change root_domain->overload type to int Morten Rasmussen
@ 2018-03-15 14:47 ` Morten Rasmussen
  2018-03-15 14:47 ` [PATCHv2 7/7] sched/fair: Set sd->overload when misfit Morten Rasmussen
  2018-03-20  5:30 ` [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Gaku Inami
  7 siblings, 0 replies; 15+ messages in thread
From: Morten Rasmussen @ 2018-03-15 14:47 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

From: Valentin Schneider <valentin.schneider@arm.com>

This variable can be read and set locklessly within update_sd_lb_stats().
As such, READ/WRITE_ONCE are added to make sure nothing terribly wrong
can happen because of the compiler.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c  | 6 +++---
 kernel/sched/sched.h | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe4824a37302..4e79ec8760be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8220,8 +8220,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 	if (!env->sd->parent) {
 		/* update overload indicator if we are at root domain */
-		if (env->dst_rq->rd->overload != overload)
-			env->dst_rq->rd->overload = overload;
+		if (READ_ONCE(env->dst_rq->rd->overload) != overload)
+			WRITE_ONCE(env->dst_rq->rd->overload, overload);
 	}
 }
 
@@ -9659,7 +9659,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 	rq_unpin_lock(this_rq, rf);
 
 	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-	    !this_rq->rd->overload) {
+	    !READ_ONCE(this_rq->rd->overload)) {
 
 		rcu_read_lock();
 		sd = rcu_dereference_check_sched_domain(this_rq->sd);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5379f647016d..31847b7f7d47 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1651,8 +1651,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 
 	if (prev_nr < 2 && rq->nr_running >= 2) {
 #ifdef CONFIG_SMP
-		if (!rq->rd->overload)
-			rq->rd->overload = 1;
+		if (!READ_ONCE(rq->rd->overload))
+			WRITE_ONCE(rq->rd->overload, 1);
 #endif
 	}
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCHv2 7/7] sched/fair: Set sd->overload when misfit
  2018-03-15 14:46 [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (5 preceding siblings ...)
  2018-03-15 14:47 ` [PATCHv2 6/7] sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE Morten Rasmussen
@ 2018-03-15 14:47 ` Morten Rasmussen
  2018-03-20  5:30 ` [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Gaku Inami
  7 siblings, 0 replies; 15+ messages in thread
From: Morten Rasmussen @ 2018-03-15 14:47 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

From: Valentin Schneider <valentin.schneider@arm.com>

Idle balance is a great opportunity to pull a misfit task. However,
there are scenarios where misfit tasks are present but idle balance is
prevented by the overload flag.

A good example of this is a workload of n identical tasks. Let's suppose
we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly
CPU-intensive tasks - for the sake of simplicity let's say they are just
CPU hogs, even when running on big CPUs.

They are identical tasks, so on an SMP system they should all end at
(roughly) the same time. However, in our case the LITTLE CPUs are less
performing than the big CPUs, so tasks running on the LITTLEs will have
a longer completion time.

This means that the big CPUs will complete their work earlier, at which
point they should pull the tasks from the LITTLEs. What we want to
happen is summarized as follows:

a,b,c,d are our CPU-hogging tasks
_ signifies idling

LITTLE_0 | a a a a _ _
LITTLE_1 | b b b b _ _
---------|-------------
  big_0  | c c c c a a
  big_1  | d d d d b b
                  ^
                  ^
    Tasks end on the big CPUs, idle balance happens
    and the misfit tasks are pulled straight away

This however won't happen, because currently the overload flag is only
set when there is any CPU that has more than one runnable task - which
may very well not be the case here if our CPU-hogging workload is all
there is to run.

As such, this commit sets the overload flag in update_sg_lb_stats when
a group is flagged as having a misfit task.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c  | 6 ++++--
 kernel/sched/sched.h | 6 +++++-
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e79ec8760be..7952e9491713 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7964,7 +7964,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
  * @load_idx: Load index of sched_domain of this_cpu for load calc.
  * @local_group: Does group contain this_cpu.
  * @sgs: variable to hold the statistics for this group.
- * @overload: Indicate more than one runnable task for any CPU.
+ * @overload: Indicate pullable load (e.g. >1 runnable task).
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
@@ -8008,8 +8008,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			sgs->idle_cpus++;
 
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
-		    !sgs->group_misfit_task_load && rq->misfit_task_load)
+		    !sgs->group_misfit_task_load && rq->misfit_task_load) {
 			sgs->group_misfit_task_load = rq->misfit_task_load;
+			*overload = 1;
+		}
 	}
 
 	/* Adjust by relative CPU capacity of the group */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 31847b7f7d47..2981a162d9c0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -688,7 +688,11 @@ struct root_domain {
 	cpumask_var_t		span;
 	cpumask_var_t		online;
 
-	/* Indicate more than one runnable task for any CPU */
+	/*
+	 * Indicate pullable load on at least one CPU, e.g:
+	 * - More than one runnable task
+	 * - Running task is misfit
+	 */
 	int			overload;
 
 	/*
-- 
2.7.4

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
  2018-03-15 14:46 [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (6 preceding siblings ...)
  2018-03-15 14:47 ` [PATCHv2 7/7] sched/fair: Set sd->overload when misfit Morten Rasmussen
@ 2018-03-20  5:30 ` Gaku Inami
  2018-03-20  9:17   ` Morten Rasmussen
  7 siblings, 1 reply; 15+ messages in thread
From: Gaku Inami @ 2018-03-20  5:30 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot, linux-kernel

Hi

> -----Original Message-----
> From: Morten Rasmussen [mailto:morten.rasmussen@arm.com]
> Sent: Thursday, March 15, 2018 11:47 PM
> To: peterz@infradead.org; mingo@redhat.com
> Cc: valentin.schneider@arm.com; dietmar.eggemann@arm.com; vincent.guittot@linaro.org; Gaku Inami
> <gaku.inami.xh@renesas.com>; linux-kernel@vger.kernel.org; Morten Rasmussen <morten.rasmussen@arm.com>
> Subject: [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
[snip]
> 
> The patches have been tested on:
>    1. Arm Juno (r0): 2+4 Cortex A57/A53
>    2. Hikey960: 4+4 Cortex A73/A53
> 

I have tested this on our R-Car again and it works well. In addition,
I confirmed that this patch-set also brings performance improvement to
other benchmarks(e.g. memory load latency in LMbench). You can add:

Tested-by: Gaku Inami <gaku.inami.xh@renesas.com>

The patches have been tested on:
   3. Renesas R-Car H3 : 4+4 Cortex A57/A53

Results:
Single runs with completion time of each task
R-Car H3 (tip)
    total time:                          0.9435s
    total time:                          0.9952s
    total time:                          1.3511s
    total time:                          1.6747s

R-Car H3 (misfit)
    total time:                          0.9387s
    total time:                          0.9280s
    total time:                          0.9616s
    total time:                          0.9934s

10 run summary (tracking longest running task for each run)
	R-Car H3
	avg	max
tip     1.6737  1.6758
misfit  0.9980  1.0409

Regards,
Inami

[snip]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
  2018-03-20  5:30 ` [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Gaku Inami
@ 2018-03-20  9:17   ` Morten Rasmussen
  0 siblings, 0 replies; 15+ messages in thread
From: Morten Rasmussen @ 2018-03-20  9:17 UTC (permalink / raw)
  To: Gaku Inami
  Cc: peterz, mingo, valentin.schneider, dietmar.eggemann,
	vincent.guittot, linux-kernel

Hi,

On Tue, Mar 20, 2018 at 05:30:00AM +0000, Gaku Inami wrote:
> Hi
> 
> > -----Original Message-----
> > From: Morten Rasmussen [mailto:morten.rasmussen@arm.com]
> > Sent: Thursday, March 15, 2018 11:47 PM
> > To: peterz@infradead.org; mingo@redhat.com
> > Cc: valentin.schneider@arm.com; dietmar.eggemann@arm.com; vincent.guittot@linaro.org; Gaku Inami
> > <gaku.inami.xh@renesas.com>; linux-kernel@vger.kernel.org; Morten Rasmussen <morten.rasmussen@arm.com>
> > Subject: [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
> [snip]
> > 
> > The patches have been tested on:
> >    1. Arm Juno (r0): 2+4 Cortex A57/A53
> >    2. Hikey960: 4+4 Cortex A73/A53
> > 
> 
> I have tested this on our R-Car again and it works well. In addition,
> I confirmed that this patch-set also brings performance improvement to
> other benchmarks(e.g. memory load latency in LMbench). You can add:
> 
> Tested-by: Gaku Inami <gaku.inami.xh@renesas.com>
> 
> The patches have been tested on:
>    3. Renesas R-Car H3 : 4+4 Cortex A57/A53
> 
> Results:
> Single runs with completion time of each task
> R-Car H3 (tip)
>     total time:                          0.9435s
>     total time:                          0.9952s
>     total time:                          1.3511s
>     total time:                          1.6747s
> 
> R-Car H3 (misfit)
>     total time:                          0.9387s
>     total time:                          0.9280s
>     total time:                          0.9616s
>     total time:                          0.9934s
> 
> 10 run summary (tracking longest running task for each run)
> 	R-Car H3
> 	avg	max
> tip     1.6737  1.6758
> misfit  0.9980  1.0409

Thanks a lot for testing again.

Morten

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHv2 2/7] sched/fair: Add group_misfit_task load-balance type
  2018-03-15 14:46 ` [PATCHv2 2/7] sched/fair: Add group_misfit_task load-balance type Morten Rasmussen
@ 2018-04-11 10:45   ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2018-04-11 10:45 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel

On Thu, Mar 15, 2018 at 02:46:59PM +0000, Morten Rasmussen wrote:

> +static inline unsigned long task_util(struct task_struct *p);
> +static inline int task_fits_capacity(struct task_struct *p, long capacity)
> +{
> +	return capacity * 1024 > task_util(p) * capacity_margin;
> +}
> +
> +static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> +{
> +	if (!static_branch_unlikely(&sched_asym_cpucapacity))
> +		return;
> +
> +	if (!p) {
> +		rq->misfit_task_load = 0;
> +		return;
> +	}
> +
> +	if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
> +		rq->misfit_task_load = 0;
> +		return;
> +	}
> +
> +	rq->misfit_task_load = task_h_load(p);
> +}

So RT/IRQ pressure can also cause misfit..


> @@ -7972,6 +8005,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  		 */
>  		if (!nr_running && idle_cpu(i))
>  			sgs->idle_cpus++;
> +
> +		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> +		    !sgs->group_misfit_task_load && rq->misfit_task_load)
> +			sgs->group_misfit_task_load = rq->misfit_task_load;
>  	}

Should we not look for the biggest misfit instead of the first?

>  
>  	/* Adjust by relative CPU capacity of the group */

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHv2 3/7] sched/fair: Consider misfit tasks when load-balancing
  2018-03-15 14:47 ` [PATCHv2 3/7] sched/fair: Consider misfit tasks when load-balancing Morten Rasmussen
@ 2018-04-11 10:46   ` Peter Zijlstra
  2018-04-11 10:53   ` Peter Zijlstra
  1 sibling, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2018-04-11 10:46 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel

On Thu, Mar 15, 2018 at 02:47:00PM +0000, Morten Rasmussen wrote:
> @@ -8548,6 +8569,14 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>  		if (rt > env->fbq_type)
>  			continue;
>  
> +		/*
> +		 * For ASYM_CPUCAPACITY domains with misfit tasks we ignore
> +		 * load.
> +		 */
> +		if (env->src_grp_type == group_misfit_task &&
> +		    rq->misfit_task_load)
> +			return rq;
> +
>  		capacity = capacity_of(i);
>  
>  		wl = weighted_cpuload(rq);

Similarly, should we not return worst misfit instead of the first?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHv2 3/7] sched/fair: Consider misfit tasks when load-balancing
  2018-03-15 14:47 ` [PATCHv2 3/7] sched/fair: Consider misfit tasks when load-balancing Morten Rasmussen
  2018-04-11 10:46   ` Peter Zijlstra
@ 2018-04-11 10:53   ` Peter Zijlstra
  1 sibling, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2018-04-11 10:53 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel

On Thu, Mar 15, 2018 at 02:47:00PM +0000, Morten Rasmussen wrote:
> +	/*
> +	 * Don't try to pull misfit tasks we can't help.
> +	 */
> +	if (sgs->group_type == group_misfit_task &&
> +	    (!group_smaller_cpu_capacity(sg, sds->local) ||
> +	     !group_has_capacity(env, &sds->local_stat)))
> +		return false;
> +

I think that has the same 'problem' as last time, right? Since
group_smaller_cpu_capacity() is affected by RT/IRQ pressure.

We should have a comment stating this and preferably explaining why that
is ok.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHv2,1/7] sched: Add static_key for asymmetric cpu capacity optimizations
  2018-03-15 14:46 ` [PATCHv2 1/7] sched: Add static_key for asymmetric cpu capacity optimizations Morten Rasmussen
@ 2018-04-27 14:04   ` Jiada Wang
  2018-04-29 19:09     ` Valentin Schneider
  0 siblings, 1 reply; 15+ messages in thread
From: Jiada Wang @ 2018-04-27 14:04 UTC (permalink / raw)
  To: morten.rasmussen
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, jiada_wang

Hi

with this patch, if enable CONFIG_DEBUG_ATOMIC_SLEEP=y,
then I am getting following BUG report during early startup

Backtrace caused by [1] during early kernel startup:
[ 5.325288] CPU: All CPU(s) started at EL2
[ 5.325700] alternatives: patching kernel code
[ 5.329255] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34
[ 5.329525] in_atomic(): 0, irqs_disabled(): 0, pid: 1, name: swapper/0
[ 5.329657] 2 locks held by swapper/0/1:
[ 5.329744] #0: (sched_domains_mutex){+.+.}, at: [<ffff20000957f244>] sched_init_smp+0x88/0x158
[ 5.329993] #1: (rcu_read_lock){....}, at: [<ffff200008159794>] build_sched_domains+0x9cc/0x2f08
[ 5.330233] Preemption disabled at:
[ 5.330256] [<ffff200008157b5c>] rq_attach_root+0x28/0x1d8
[ 5.330511] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.17+ #123
[ 5.330635] Hardware name: Renesas Salvator-X board based on r8a7795 ES2.0+ (DT)
[ 5.330779] Call trace:
[ 5.330853] [<ffff20000808fe88>] dump_backtrace+0x0/0x364
[ 5.330968] [<ffff200008090200>] show_stack+0x14/0x1c
[ 5.331080] [<ffff200008f365d8>] dump_stack+0x108/0x174
[ 5.331194] [<ffff200008113170>] ___might_sleep+0x43c/0x44c
[ 5.331310] [<ffff2000081132e4>] __might_sleep+0x164/0x178
[ 5.331429] [<ffff2000080b2338>] cpus_read_lock+0x38/0x12c
[ 5.331547] [<ffff2000082d5860>] static_key_enable+0x14/0x2c
[ 5.331665] [<ffff20000815bcac>] build_sched_domains+0x2ee4/0x2f08
[ 5.331789] [<ffff20000815cc9c>] sched_init_domains+0xcc/0xe8
[ 5.331908] [<ffff20000957f250>] sched_init_smp+0x94/0x158
[ 5.332026] [<ffff200009571560>] kernel_init_freeable+0x1ec/0x4c4
[ 5.332153] [<ffff200008f593a8>] kernel_init+0x10/0x128
[ 5.332264] [<ffff2000080865d4>] ret_from_fork+0x10/0x18
[ 5.343400] devtmpfs: initialized

Thanks,
Jiada

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHv2,1/7] sched: Add static_key for asymmetric cpu capacity optimizations
  2018-04-27 14:04   ` [PATCHv2,1/7] " Jiada Wang
@ 2018-04-29 19:09     ` Valentin Schneider
  0 siblings, 0 replies; 15+ messages in thread
From: Valentin Schneider @ 2018-04-29 19:09 UTC (permalink / raw)
  To: Jiada Wang, morten.rasmussen
  Cc: dietmar.eggemann, vincent.guittot, gaku.inami.xh, linux-kernel

Hi,

On 27/04/18 15:04, Jiada Wang wrote:
> Hi
> 
> with this patch, if enable CONFIG_DEBUG_ATOMIC_SLEEP=y,
> then I am getting following BUG report during early startup
> 

Thanks for bringing that up.

> Backtrace caused by [1] during early kernel startup:
> [ 5.325288] CPU: All CPU(s) started at EL2
> [ 5.325700] alternatives: patching kernel code
> [ 5.329255] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34
> [ 5.329525] in_atomic(): 0, irqs_disabled(): 0, pid: 1, name: swapper/0
> [ 5.329657] 2 locks held by swapper/0/1:
> [ 5.329744] #0: (sched_domains_mutex){+.+.}, at: [<ffff20000957f244>] sched_init_smp+0x88/0x158
> [ 5.329993] #1: (rcu_read_lock){....}, at: [<ffff200008159794>] build_sched_domains+0x9cc/0x2f08
> [ 5.330233] Preemption disabled at:
> [ 5.330256] [<ffff200008157b5c>] rq_attach_root+0x28/0x1d8
> [ 5.330511] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.17+ #123
> [ 5.330635] Hardware name: Renesas Salvator-X board based on r8a7795 ES2.0+ (DT)
> [ 5.330779] Call trace:
> [ 5.330853] [<ffff20000808fe88>] dump_backtrace+0x0/0x364
> [ 5.330968] [<ffff200008090200>] show_stack+0x14/0x1c
> [ 5.331080] [<ffff200008f365d8>] dump_stack+0x108/0x174
> [ 5.331194] [<ffff200008113170>] ___might_sleep+0x43c/0x44c
> [ 5.331310] [<ffff2000081132e4>] __might_sleep+0x164/0x178
> [ 5.331429] [<ffff2000080b2338>] cpus_read_lock+0x38/0x12c
> [ 5.331547] [<ffff2000082d5860>] static_key_enable+0x14/0x2c
> [ 5.331665] [<ffff20000815bcac>] build_sched_domains+0x2ee4/0x2f08
> [ 5.331789] [<ffff20000815cc9c>] sched_init_domains+0xcc/0xe8
> [ 5.331908] [<ffff20000957f250>] sched_init_smp+0x94/0x158
> [ 5.332026] [<ffff200009571560>] kernel_init_freeable+0x1ec/0x4c4
> [ 5.332153] [<ffff200008f593a8>] kernel_init+0x10/0x128
> [ 5.332264] [<ffff2000080865d4>] ret_from_fork+0x10/0x18
> [ 5.343400] devtmpfs: initialized
> 

I tried reproducing this on my HiKey960, and I do get a BUG pointing at a 
static_key_enable but at a completely different place...

[    0.158072] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:238
[    0.158074] in_atomic(): 1, irqs_disabled(): 128, pid: 0, name: swapper/4
[    0.158080] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G S               4.16.0-linaro-hikey960 #4
[    0.158081] Hardware name: HiKey960 (DT)
[    0.158083] Call trace:
[    0.158098]  dump_backtrace+0x0/0x188
[    0.158102]  show_stack+0x14/0x20
[    0.158108]  dump_stack+0x98/0xbc
[    0.158113]  ___might_sleep+0xf0/0x118
[    0.158115]  __might_sleep+0x50/0x88
[    0.158118]  mutex_lock+0x24/0x60
[    0.158124]  static_key_enable_cpuslocked+0x50/0xc0
[    0.158130]  arch_timer_check_ool_workaround+0x1ac/0x228
[    0.158133]  arch_timer_starting_cpu+0xfc/0x2e8
[    0.158137]  cpuhp_invoke_callback+0xa0/0x228
[    0.158140]  notify_cpu_starting+0x70/0x90
[    0.158143]  secondary_start_kernel+0x128/0x1c8



I went and had a look at the documentation for the static keys, and it mentions
fun stuff can happen with hotplug. I gave it a try and got this:

root@linaro-developer:~# echo 0 > /sys/devices/system/cpu/cpu4/online 
[ 1893.765366] CPU4: shutdown
[ 1893.768077] psci: CPU4 killed.
[ 1893.771890] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34
[ 1893.773361] crct10dif_ce: Unknown symbol _mcount (err 0)
[ 1893.777754] in_atomic(): 0, irqs_disabled(): 0, pid: 3392, name: kworker/4:0
[ 1893.799136] CPU: 0 PID: 3392 Comm: kworker/4:0 Tainted: G S      W        4.16.0-linaro-hikey960 #4
[ 1893.808180] Hardware name: HiKey960 (DT)
[ 1893.812110] Workqueue: events cpuset_hotplug_workfn
[ 1893.816984] Call trace:
[ 1893.819430]  dump_backtrace+0x0/0x188
[ 1893.823088]  show_stack+0x14/0x20
[ 1893.826401]  dump_stack+0x98/0xbc
[ 1893.829712]  ___might_sleep+0xf0/0x118
[ 1893.833455]  __might_sleep+0x50/0x88
[ 1893.837028]  cpus_read_lock+0x1c/0x90
[ 1893.840689]  static_key_enable+0x14/0x30
[ 1893.844608]  build_sched_domains+0xe4c/0xf00
[ 1893.848874]  partition_sched_domains+0x2c8/0x410
[ 1893.853486]  rebuild_sched_domains_locked+0xe4/0x430
[ 1893.858446]  rebuild_sched_domains+0x20/0x38
[ 1893.862712]  cpuset_hotplug_workfn+0x28c/0x6b8
[ 1893.867153]  process_one_work+0x114/0x330
[ 1893.871158]  worker_thread+0x130/0x470
[ 1893.874903]  kthread+0x104/0x130
[ 1893.878126]  ret_from_fork+0x10/0x18


This seems to be complaining about holding 'sched_domains_mutex' while
taking the hotplug lock before flipping the static key.

Thing is, both callers of build_sched_domains():

- sched_init_domains()
- partition_sched_domains()

mention that they must be called with the hotplug lock held, so I figured
we could use that information to change the static key call (see snippet
below).

It does suppress the warning, and I *think* it's not completely insane - 
assuming the comments about the hotplug lock are still up to date, and with
the exception of sched_init_smp() which doesn't care about hotplugs it seems
to be the case.

SMT also uses a static key but avoids this issue by being enabled outside of
sched_init_domains(), and from what I see it's just set once and for all.
I'm not sure we can use the same approach since we might not always be able
to detect asymmetry this early.

> Thanks,
> Jiada
> 

Cheers,
Valentin

--->8---

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b023a5b..89e502e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -428,8 +428,10 @@ static void update_top_cache_domain(int cpu)
 
 static void update_asym_cpucapacity(int cpu)
 {
-       if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
-               static_branch_enable(&sched_asym_cpucapacity);
+       if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY)) {
+               /* This expects the hotplug lock to be held */
+               static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
+       }
 }
 
 /*

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-04-29 19:09 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-15 14:46 [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
2018-03-15 14:46 ` [PATCHv2 1/7] sched: Add static_key for asymmetric cpu capacity optimizations Morten Rasmussen
2018-04-27 14:04   ` [PATCHv2,1/7] " Jiada Wang
2018-04-29 19:09     ` Valentin Schneider
2018-03-15 14:46 ` [PATCHv2 2/7] sched/fair: Add group_misfit_task load-balance type Morten Rasmussen
2018-04-11 10:45   ` Peter Zijlstra
2018-03-15 14:47 ` [PATCHv2 3/7] sched/fair: Consider misfit tasks when load-balancing Morten Rasmussen
2018-04-11 10:46   ` Peter Zijlstra
2018-04-11 10:53   ` Peter Zijlstra
2018-03-15 14:47 ` [PATCHv2 4/7] sched/fair: Kick nohz balance if rq->misfit_task Morten Rasmussen
2018-03-15 14:47 ` [PATCHv2 5/7] sched: Change root_domain->overload type to int Morten Rasmussen
2018-03-15 14:47 ` [PATCHv2 6/7] sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE Morten Rasmussen
2018-03-15 14:47 ` [PATCHv2 7/7] sched/fair: Set sd->overload when misfit Morten Rasmussen
2018-03-20  5:30 ` [PATCHv2 0/7] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Gaku Inami
2018-03-20  9:17   ` Morten Rasmussen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).