LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RFC 00/11] select_idle_sibling rework
@ 2018-05-30 14:22 Peter Zijlstra
  2018-05-30 14:22 ` [RFC 01/11] sched/fair: Fix select_idle_cpu()s cost accounting Peter Zijlstra
                   ` (11 more replies)
  0 siblings, 12 replies; 14+ messages in thread
From: Peter Zijlstra @ 2018-05-30 14:22 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: subhra.mazumdar, steven.sistare, dhaval.giani, rohit.k.jain,
	umgwanakikbuti, matt, riel, peterz

Hi all,

This is all still very preliminary and could all still go up in flames (it has
only seen hackbench so far). This is mostly the same code I posted yesterday,
but hopefully in a more readable form.

This fixes the SIS_PROP as per the outline here:

  https://lkml.kernel.org/r/20180425153600.GA4043@hirez.programming.kicks-ass.net

and Rohit's suggestion of folding the iteration loops.

For testing I would suggest to ignore the last 3 patches, those are purely
cleanups once the first lot is found to actually work as advertised.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC 01/11] sched/fair: Fix select_idle_cpu()s cost accounting
  2018-05-30 14:22 [RFC 00/11] select_idle_sibling rework Peter Zijlstra
@ 2018-05-30 14:22 ` Peter Zijlstra
  2018-05-30 14:22 ` [RFC 02/11] sched/fair: Age the average idle time Peter Zijlstra
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2018-05-30 14:22 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: subhra.mazumdar, steven.sistare, dhaval.giani, rohit.k.jain,
	umgwanakikbuti, matt, riel, peterz

[-- Attachment #1: peterz-sis-again-1.patch --]
[-- Type: text/plain, Size: 1423 bytes --]

We compute the average cost of the total scan, but then use it as a
per-cpu scan cost when computing the scan proportion. Fix this by
properly computing a per-cpu scan cost.

This also fixes a bug where we would terminate early (!--nr, case) and
not account that cost at all.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6368,11 +6368,11 @@ static inline int select_idle_smt(struct
  */
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
+	int cpu, loops = 0, nr = INT_MAX;
 	struct sched_domain *this_sd;
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6399,8 +6399,10 @@ static int select_idle_cpu(struct task_s
 	time = local_clock();
 
 	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
-		if (!--nr)
-			return -1;
+		if (loops++ >= nr) {
+			cpu = -1;
+			break;
+		}
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
 		if (available_idle_cpu(cpu))
@@ -6408,6 +6410,7 @@ static int select_idle_cpu(struct task_s
 	}
 
 	time = local_clock() - time;
+	time = div_u64(time, loops);
 	cost = this_sd->avg_scan_cost;
 	delta = (s64)(time - cost) / 8;
 	this_sd->avg_scan_cost += delta;

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC 02/11] sched/fair: Age the average idle time
  2018-05-30 14:22 [RFC 00/11] select_idle_sibling rework Peter Zijlstra
  2018-05-30 14:22 ` [RFC 01/11] sched/fair: Fix select_idle_cpu()s cost accounting Peter Zijlstra
@ 2018-05-30 14:22 ` Peter Zijlstra
  2018-05-30 14:22 ` [RFC 03/11] sched/fair: Only use time once Peter Zijlstra
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2018-05-30 14:22 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: subhra.mazumdar, steven.sistare, dhaval.giani, rohit.k.jain,
	umgwanakikbuti, matt, riel, peterz

[-- Attachment #1: peterz-sis-again-2.patch --]
[-- Type: text/plain, Size: 2908 bytes --]

Currently select_idle_cpu()'s proportional scheme uses the average
idle time *for when we are idle*, that is temporally challenged.

When we're not at all idle, we'll happily continue using whatever
value we did see when we did go idle. To fix this, introduce a seprate
average idle and age it (the existing value still makes sense for
things like new-idle balancing, which happens when we do go idle).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c     |    5 +++++
 kernel/sched/fair.c     |   29 ++++++++++++++++++++++++-----
 kernel/sched/features.h |    2 ++
 kernel/sched/sched.h    |    3 +++
 4 files changed, 34 insertions(+), 5 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1674,6 +1674,9 @@ static void ttwu_do_wakeup(struct rq *rq
 		if (rq->avg_idle > max)
 			rq->avg_idle = max;
 
+		rq->wake_stamp = jiffies;
+		rq->wake_avg = rq->avg_idle / 2;
+
 		rq->idle_stamp = 0;
 	}
 #endif
@@ -6051,6 +6054,8 @@ void __init sched_init(void)
 		rq->online = 0;
 		rq->idle_stamp = 0;
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
+		rq->wake_stamp = jiffies;
+		rq->wake_avg = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6378,11 +6378,30 @@ static int select_idle_cpu(struct task_s
 	if (!this_sd)
 		return -1;
 
-	/*
-	 * Due to large variance we need a large fuzz factor; hackbench in
-	 * particularly is sensitive here.
-	 */
-	avg_idle = this_rq()->avg_idle / 512;
+	if (sched_feat(SIS_AGE)) {
+		unsigned long now = jiffies;
+		struct rq *this_rq = this_rq();
+
+		/*
+		 * If we're busy, the assumption that the last idle period
+		 * predicts the future is flawed; age away the remaining
+		 * predicted idle time.
+		 */
+		if (unlikely(this_rq->wake_stamp < now)) {
+			while (this_rq->wake_stamp < now && this_rq->wake_avg) {
+				this_rq->wake_stamp++;
+				this_rq->wake_avg >>= 1;
+			}
+		}
+
+		avg_idle = this_rq->wake_avg;
+	} else {
+		/*
+		 * Due to large variance we need a large fuzz factor; hackbench
+		 * in particularly is sensitive here.
+		 */
+		avg_idle = this_rq()->avg_idle / 512;
+	}
 	avg_cost = this_sd->avg_scan_cost + 1;
 
 	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -58,6 +58,8 @@ SCHED_FEAT(TTWU_QUEUE, true)
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
 
+SCHED_FEAT(SIS_AGE, true)
+
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
  * in a single rq->lock section. Default disabled because the
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -831,6 +831,9 @@ struct rq {
 	u64			idle_stamp;
 	u64			avg_idle;
 
+	unsigned long		wake_stamp;
+	u64			wake_avg;
+
 	/* This is used to determine avg_idle's max value */
 	u64			max_idle_balance_cost;
 #endif

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC 03/11] sched/fair: Only use time once
  2018-05-30 14:22 [RFC 00/11] select_idle_sibling rework Peter Zijlstra
  2018-05-30 14:22 ` [RFC 01/11] sched/fair: Fix select_idle_cpu()s cost accounting Peter Zijlstra
  2018-05-30 14:22 ` [RFC 02/11] sched/fair: Age the average idle time Peter Zijlstra
@ 2018-05-30 14:22 ` Peter Zijlstra
  2018-05-30 14:22 ` [RFC 04/11] sched/topology: Introduce sched_domain_cores() Peter Zijlstra
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2018-05-30 14:22 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: subhra.mazumdar, steven.sistare, dhaval.giani, rohit.k.jain,
	umgwanakikbuti, matt, riel, peterz

[-- Attachment #1: peterz-sis-again-3.patch --]
[-- Type: text/plain, Size: 2026 bytes --]

The goal is to not spend more time scanning for idle CPUs than we're
idle for. Otherwise we're inhibiting work.

This means that we need to consider the cost over all the wakeups
between consequtive idle periods.

Combined AGE+ONCE work better than the old code:

ORIG

1:        0.559639567 seconds time elapsed    ( +-  1.44% )
2:        0.630091207 seconds time elapsed    ( +-  2.93% )
5:        2.329768398 seconds time elapsed    ( +-  1.21% )
10:       3.920248646 seconds time elapsed    ( +-  2.39% )
20:       6.501776759 seconds time elapsed    ( +-  1.02% )
40:      10.482109619 seconds time elapsed    ( +-  2.16% )

AGE+ONCE

1:        0.546238431 seconds time elapsed    ( +-  0.84% )
2:        0.620581405 seconds time elapsed    ( +-  1.26% )
5:        2.161288964 seconds time elapsed    ( +-  1.90% )
10:       3.514636966 seconds time elapsed    ( +-  1.82% )
20:       6.228234657 seconds time elapsed    ( +-  0.67% )
40:       9.755615438 seconds time elapsed    ( +-  2.20% )

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |   15 +++++++++++++++
 kernel/sched/features.h |    1 +
 2 files changed, 16 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6429,6 +6429,21 @@ static int select_idle_cpu(struct task_s
 	}
 
 	time = local_clock() - time;
+
+	if (sched_feat(SIS_ONCE)) {
+		struct rq *this_rq = this_rq();
+
+		/*
+		 * We need to consider the cost of all wakeups between
+		 * consequtive idle periods. We can only use the predicted
+		 * idle time once.
+		 */
+		if (this_rq->wake_avg > time)
+			this_rq->wake_avg -= time;
+		else
+			this_rq->wake_avg = 0;
+	}
+
 	time = div_u64(time, loops);
 	cost = this_sd->avg_scan_cost;
 	delta = (s64)(time - cost) / 8;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -59,6 +59,7 @@ SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
 
 SCHED_FEAT(SIS_AGE, true)
+SCHED_FEAT(SIS_ONCE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC 04/11] sched/topology: Introduce sched_domain_cores()
  2018-05-30 14:22 [RFC 00/11] select_idle_sibling rework Peter Zijlstra
                   ` (2 preceding siblings ...)
  2018-05-30 14:22 ` [RFC 03/11] sched/fair: Only use time once Peter Zijlstra
@ 2018-05-30 14:22 ` Peter Zijlstra
  2018-05-30 14:22 ` [RFC 05/11] sched/fair: Re-arrange select_idle_cpu() Peter Zijlstra
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2018-05-30 14:22 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: subhra.mazumdar, steven.sistare, dhaval.giani, rohit.k.jain,
	umgwanakikbuti, matt, riel, peterz

[-- Attachment #1: peterz-sis-again-4.patch --]
[-- Type: text/plain, Size: 4004 bytes --]

In order to more efficiently iterate cores/smt, we need a cpumask
containing only the first thread of each core (for the LLC domain).

And since we're iterating SMT specific things, move sched_init_smt()
over there. Also track how many threads per core we have.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched/topology.h |    9 +++++++++
 kernel/sched/core.c            |   18 ------------------
 kernel/sched/fair.c            |    3 +++
 kernel/sched/sched.h           |    2 ++
 kernel/sched/topology.c        |   35 +++++++++++++++++++++++++++++++++--
 5 files changed, 47 insertions(+), 20 deletions(-)

--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -72,6 +72,8 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+
+	unsigned long core_mask[0];
 };
 
 struct sched_domain {
@@ -162,6 +164,13 @@ static inline struct cpumask *sched_doma
 	return to_cpumask(sd->span);
 }
 
+#ifdef CONFIG_SCHED_SMT
+static inline struct cpumask *sched_domain_cores(struct sched_domain *sd)
+{
+	return to_cpumask(sd->shared->core_mask);
+}
+#endif
+
 extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 				    struct sched_domain_attr *dattr_new);
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5854,22 +5854,6 @@ int sched_cpu_dying(unsigned int cpu)
 }
 #endif
 
-#ifdef CONFIG_SCHED_SMT
-DEFINE_STATIC_KEY_FALSE(sched_smt_present);
-
-static void sched_init_smt(void)
-{
-	/*
-	 * We've enumerated all CPUs and will assume that if any CPU
-	 * has SMT siblings, CPU0 will too.
-	 */
-	if (cpumask_weight(cpu_smt_mask(0)) > 1)
-		static_branch_enable(&sched_smt_present);
-}
-#else
-static inline void sched_init_smt(void) { }
-#endif
-
 void __init sched_init_smp(void)
 {
 	sched_init_numa();
@@ -5891,8 +5875,6 @@ void __init sched_init_smp(void)
 	init_sched_rt_class();
 	init_sched_dl_class();
 
-	sched_init_smt();
-
 	sched_smp_initialized = true;
 }
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6238,6 +6238,9 @@ static inline int find_idlest_cpu(struct
 }
 
 #ifdef CONFIG_SCHED_SMT
+DEFINE_STATIC_KEY_FALSE(sched_smt_present);
+
+__read_mostly int sched_smt_weight = 1;
 
 static inline void set_idle_cores(int cpu, int val)
 {
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -910,6 +910,8 @@ static inline void update_idle_core(stru
 		__update_idle_core(rq);
 }
 
+extern __read_mostly int sched_smt_weight;
+
 #else
 static inline void update_idle_core(struct rq *rq) { }
 #endif
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1190,8 +1190,39 @@ sd_init(struct sched_domain_topology_lev
 	 */
 	if (sd->flags & SD_SHARE_PKG_RESOURCES) {
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
-		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+		if (atomic_read(&sd->shared->ref)) {
+			atomic_inc(&sd->shared->ref);
+		} else {
+#ifdef CONFIG_SCHED_SMT
+			int core, smt, smt_weight;
+
+			/*
+			 * Set the first SMT sibling of each core present in
+			 * the domain span.
+			 */
+			for_each_cpu(core, sched_domain_span(sd)) {
+				for_each_cpu(smt, cpu_smt_mask(core)) {
+					if (cpumask_test_cpu(smt, sched_domain_span(sd))) {
+						__cpumask_set_cpu(smt, sched_domain_cores(sd));
+						break;
+					}
+				}
+
+				/*
+				 * And track the presence and number of threads per core.
+				 */
+
+				smt_weight = cpumask_weight(cpu_smt_mask(core));
+				if (smt_weight > sched_smt_weight) {
+					sched_smt_weight = smt_weight;
+					static_branch_enable(&sched_smt_present);
+				}
+			}
+#endif
+
+			atomic_set(&sd->shared->ref, 1);
+		}
 	}
 
 	sd->private = sdd;
@@ -1537,7 +1568,7 @@ static int __sdt_alloc(const struct cpum
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
-			sds = kzalloc_node(sizeof(struct sched_domain_shared),
+			sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sds)
 				return -ENOMEM;

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC 05/11] sched/fair: Re-arrange select_idle_cpu()
  2018-05-30 14:22 [RFC 00/11] select_idle_sibling rework Peter Zijlstra
                   ` (3 preceding siblings ...)
  2018-05-30 14:22 ` [RFC 04/11] sched/topology: Introduce sched_domain_cores() Peter Zijlstra
@ 2018-05-30 14:22 ` Peter Zijlstra
  2018-05-30 14:22 ` [RFC 06/11] sched/fair: Make select_idle_cpu() proportional to cores Peter Zijlstra
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2018-05-30 14:22 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: subhra.mazumdar, steven.sistare, dhaval.giani, rohit.k.jain,
	umgwanakikbuti, matt, riel, peterz

[-- Attachment #1: peterz-sis-again-5.patch --]
[-- Type: text/plain, Size: 1448 bytes --]

In preparation of the next patch, move the actual scanning of the LLC
out of the whole proportional/cost metric stuff, so we can change it
out in a next patch.

Should not actually change anything.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6364,6 +6364,23 @@ static inline int select_idle_smt(struct
 
 #endif /* CONFIG_SCHED_SMT */
 
+static int __select_idle_cpu(struct task_struct *p, struct sched_domain *sd,
+			     int target, int nr, int *ploops)
+{
+	int cpu;
+
+	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+		if ((*ploops)++ >= nr)
+			return -1;
+		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
+			continue;
+		if (available_idle_cpu(cpu))
+			break;
+	}
+
+	return cpu;
+}
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6420,16 +6437,7 @@ static int select_idle_cpu(struct task_s
 
 	time = local_clock();
 
-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
-		if (loops++ >= nr) {
-			cpu = -1;
-			break;
-		}
-		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-			continue;
-		if (available_idle_cpu(cpu))
-			break;
-	}
+	cpu = __select_idle_cpu(p, sd, target, nr, &loops);
 
 	time = local_clock() - time;
 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC 06/11] sched/fair: Make select_idle_cpu() proportional to cores
  2018-05-30 14:22 [RFC 00/11] select_idle_sibling rework Peter Zijlstra
                   ` (4 preceding siblings ...)
  2018-05-30 14:22 ` [RFC 05/11] sched/fair: Re-arrange select_idle_cpu() Peter Zijlstra
@ 2018-05-30 14:22 ` Peter Zijlstra
  2018-05-30 14:22 ` [RFC 07/11] sched/fair: Fold the select_idle_sibling() scans Peter Zijlstra
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2018-05-30 14:22 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: subhra.mazumdar, steven.sistare, dhaval.giani, rohit.k.jain,
	umgwanakikbuti, matt, riel, peterz

[-- Attachment #1: peterz-sis-again-7.patch --]
[-- Type: text/plain, Size: 1336 bytes --]

Instead of calculating how many (logical) CPUs to scan, compute how
many cores to scan.

This changes behaviour for anything !SMT2.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6352,6 +6352,8 @@ static int select_idle_smt(struct task_s
 
 #else /* CONFIG_SCHED_SMT */
 
+#define sched_smt_weight	1
+
 static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	return -1;
@@ -6381,6 +6383,8 @@ static int __select_idle_cpu(struct task
 	return cpu;
 }
 
+#define sis_min_cores		2
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6429,15 +6433,15 @@ static int select_idle_cpu(struct task_s
 
 	if (sched_feat(SIS_PROP)) {
 		u64 span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
+		if (span_avg > sis_min_cores * avg_cost)
 			nr = div_u64(span_avg, avg_cost);
 		else
-			nr = 4;
+			nr = sis_min_cores;
 	}
 
 	time = local_clock();
 
-	cpu = __select_idle_cpu(p, sd, target, nr, &loops);
+	cpu = __select_idle_cpu(p, sd, target, nr * sched_smt_weight, &loops);
 
 	time = local_clock() - time;
 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC 07/11] sched/fair: Fold the select_idle_sibling() scans
  2018-05-30 14:22 [RFC 00/11] select_idle_sibling rework Peter Zijlstra
                   ` (5 preceding siblings ...)
  2018-05-30 14:22 ` [RFC 06/11] sched/fair: Make select_idle_cpu() proportional to cores Peter Zijlstra
@ 2018-05-30 14:22 ` Peter Zijlstra
  2018-05-30 14:22 ` [RFC 08/11] sched/fair: Optimize SIS_FOLD Peter Zijlstra
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2018-05-30 14:22 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: subhra.mazumdar, steven.sistare, dhaval.giani, rohit.k.jain,
	umgwanakikbuti, matt, riel, peterz

[-- Attachment #1: peterz-sis-again-6.patch --]
[-- Type: text/plain, Size: 3857 bytes --]

Currently select_idle_sibling() does 3 separate scans:

 - searches for an idle core
 - searches for an idle cpu
 - searches for an idle thread

The core scan is gates by there actually having been an idle core, it
has a worst case issue where we'll always scan the entire LLC to
establish there is no idle core left anymore before gating.

The cpu scan is done proportional to the remaining average idle time.

And since the cpu scan might not actually see our own sibling threads
(if they're enumerated far away in the CPU space), check if there's an
idle sibling thread.

Rohit suggested we could maybe do all 3 in a single proportional
search.

This uses the new SMT topology bits previously introduced to do the
core/smt iteration. And relies on the select_idle_cpu()'s change to nr
cores.

Basically we iterate @nr cores and select the first idle thread of the
core with the least amount of busy threads that first in the task
affinity mask.

ORIG

1:        0.559639567 seconds time elapsed    ( +-  1.44% )
2:        0.630091207 seconds time elapsed    ( +-  2.93% )
5:        2.329768398 seconds time elapsed    ( +-  1.21% )
10:       3.920248646 seconds time elapsed    ( +-  2.39% )
20:       6.501776759 seconds time elapsed    ( +-  1.02% )
40:      10.482109619 seconds time elapsed    ( +-  2.16% )

FOLD

1:        0.568188455 seconds time elapsed    ( +-  0.40% )
2:        0.643264625 seconds time elapsed    ( +-  1.27% )
5:        2.385378263 seconds time elapsed    ( +-  1.12% )
10:       3.808555491 seconds time elapsed    ( +-  1.46% )
20:       6.431994272 seconds time elapsed    ( +-  1.21% )
40:       9.423539507 seconds time elapsed    ( +-  2.07% )

Suggested-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |    1 +
 2 files changed, 49 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6350,6 +6350,41 @@ static int select_idle_smt(struct task_s
 	return -1;
 }
 
+static int __select_idle_core(struct task_struct *p, struct sched_domain *sd,
+			      int target, int nr, int *ploops)
+{
+	int best_busy = INT_MAX, best_cpu = -1;
+	int core, cpu;
+
+	for_each_cpu_wrap(core, sched_domain_cores(sd), target) {
+		int first_idle = -1;
+		int busy = 0;
+
+		if ((*ploops)++ >= nr)
+			break;
+
+		for (cpu = core; cpu < nr_cpumask_bits; cpu = cpumask_next(cpu, cpu_smt_mask(core))) {
+			if (!available_idle_cpu(cpu))
+				busy++;
+			else if (first_idle < 0 && cpumask_test_cpu(cpu, &p->cpus_allowed))
+				first_idle = cpu;
+		}
+
+		if (first_idle < 0)
+			continue;
+
+		if (!busy)
+			return first_idle;
+
+		if (busy < best_busy) {
+			best_busy = busy;
+			best_cpu = first_idle;
+		}
+	}
+
+	return best_cpu;
+}
+
 #else /* CONFIG_SCHED_SMT */
 
 #define sched_smt_weight	1
@@ -6441,6 +6476,11 @@ static int select_idle_cpu(struct task_s
 
 	time = local_clock();
 
+#ifdef CONFIG_SCHED_SMT
+	if (sched_feat(SIS_FOLD) && static_branch_likely(&sched_smt_present))
+		cpu = __select_idle_core(p, sd, target, nr, &loops);
+	else
+#endif
 	cpu = __select_idle_cpu(p, sd, target, nr * sched_smt_weight, &loops);
 
 	time = local_clock() - time;
@@ -6503,6 +6543,14 @@ static int select_idle_sibling(struct ta
 	if (!sd)
 		return target;
 
+	if (sched_feat(SIS_FOLD)) {
+		i = select_idle_cpu(p, sd, target);
+		if ((unsigned)i < nr_cpumask_bits)
+			target = i;
+
+		return target;
+	}
+
 	i = select_idle_core(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -60,6 +60,7 @@ SCHED_FEAT(SIS_PROP, true)
 
 SCHED_FEAT(SIS_AGE, true)
 SCHED_FEAT(SIS_ONCE, true)
+SCHED_FEAT(SIS_FOLD, false)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC 08/11] sched/fair: Optimize SIS_FOLD
  2018-05-30 14:22 [RFC 00/11] select_idle_sibling rework Peter Zijlstra
                   ` (6 preceding siblings ...)
  2018-05-30 14:22 ` [RFC 07/11] sched/fair: Fold the select_idle_sibling() scans Peter Zijlstra
@ 2018-05-30 14:22 ` Peter Zijlstra
  2018-05-30 14:22 ` [RFC 09/11] sched/fair: Remove SIS_AVG_PROP Peter Zijlstra
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2018-05-30 14:22 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: subhra.mazumdar, steven.sistare, dhaval.giani, rohit.k.jain,
	umgwanakikbuti, matt, riel, peterz

[-- Attachment #1: peterz-sis-again-11.patch --]
[-- Type: text/plain, Size: 2492 bytes --]

Tracing showed that the per-cpu scanning cost of __select_idle_core()
(~120ns) was significantly higher than that of __select_idle_cpu()
(~40ns).

This means that, even when reduced to the minimal scan, we're still 3x
more expensive than the simple search.

perf annotate suggested this was mostly due to cache-misses on the
additional cpumasks used.

However, we can mitigate this by only doing the more expensive search
when there is a good chance it is beneficial. After all, when there
are no idle cores to be had, there's no point in looking for any
(SMT>2 might want to try without this).

Clearing has_idle_cores early (without an exhaustive search) should be
fine because we're eager to set it when a core goes idle again.

FOLD

1:        0.568188455 seconds time elapsed    ( +-  0.40% )
2:        0.643264625 seconds time elapsed    ( +-  1.27% )
5:        2.385378263 seconds time elapsed    ( +-  1.12% )
10:       3.808555491 seconds time elapsed    ( +-  1.46% )
20:       6.431994272 seconds time elapsed    ( +-  1.21% )
40:       9.423539507 seconds time elapsed    ( +-  2.07% )

FOLD+

1:        0.554694881 seconds time elapsed    ( +-  0.42% )
2:        0.632730119 seconds time elapsed    ( +-  1.84% )
5:        2.230432464 seconds time elapsed    ( +-  1.17% )
10:       3.549957778 seconds time elapsed    ( +-  1.55% )
20:       6.118364255 seconds time elapsed    ( +-  0.72% )
40:       9.515406550 seconds time elapsed    ( +-  1.74% )

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |    5 ++++-
 kernel/sched/features.h |    2 +-
 2 files changed, 5 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6382,6 +6382,8 @@ static int __select_idle_core(struct tas
 		}
 	}
 
+	set_idle_cores(target, 0);
+
 	return best_cpu;
 }
 
@@ -6477,7 +6479,8 @@ static int select_idle_cpu(struct task_s
 	time = local_clock();
 
 #ifdef CONFIG_SCHED_SMT
-	if (sched_feat(SIS_FOLD) && static_branch_likely(&sched_smt_present))
+	if (sched_feat(SIS_FOLD) && static_branch_likely(&sched_smt_present) &&
+	    test_idle_cores(target, false))
 		cpu = __select_idle_core(p, sd, target, nr, &loops);
 	else
 #endif
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -60,7 +60,7 @@ SCHED_FEAT(SIS_PROP, true)
 
 SCHED_FEAT(SIS_AGE, true)
 SCHED_FEAT(SIS_ONCE, true)
-SCHED_FEAT(SIS_FOLD, false)
+SCHED_FEAT(SIS_FOLD, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC 09/11] sched/fair: Remove SIS_AVG_PROP
  2018-05-30 14:22 [RFC 00/11] select_idle_sibling rework Peter Zijlstra
                   ` (7 preceding siblings ...)
  2018-05-30 14:22 ` [RFC 08/11] sched/fair: Optimize SIS_FOLD Peter Zijlstra
@ 2018-05-30 14:22 ` Peter Zijlstra
  2018-05-30 14:22 ` [RFC 10/11] sched/fair: Remove SIS_AGE/SIS_ONCE Peter Zijlstra
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2018-05-30 14:22 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: subhra.mazumdar, steven.sistare, dhaval.giani, rohit.k.jain,
	umgwanakikbuti, matt, riel, peterz

[-- Attachment #1: peterz-sis-again-9.patch --]
[-- Type: text/plain, Size: 871 bytes --]

Hasn't been used in forever and doesn't give nice numbers anyway.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |    3 ---
 kernel/sched/features.h |    1 -
 2 files changed, 4 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6465,9 +6465,6 @@ static int select_idle_cpu(struct task_s
 	}
 	avg_cost = this_sd->avg_scan_cost + 1;
 
-	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
-		return -1;
-
 	if (sched_feat(SIS_PROP)) {
 		u64 span_avg = sd->span_weight * avg_idle;
 		if (span_avg > sis_min_cores * avg_cost)
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -55,7 +55,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
-SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
 
 SCHED_FEAT(SIS_AGE, true)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC 10/11] sched/fair: Remove SIS_AGE/SIS_ONCE
  2018-05-30 14:22 [RFC 00/11] select_idle_sibling rework Peter Zijlstra
                   ` (8 preceding siblings ...)
  2018-05-30 14:22 ` [RFC 09/11] sched/fair: Remove SIS_AVG_PROP Peter Zijlstra
@ 2018-05-30 14:22 ` Peter Zijlstra
  2018-05-30 14:22 ` [RFC 11/11] sched/fair: Remove SIS_FOLD Peter Zijlstra
  2018-06-19 22:06 ` [RFC 00/11] select_idle_sibling rework Matt Fleming
  11 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2018-05-30 14:22 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: subhra.mazumdar, steven.sistare, dhaval.giani, rohit.k.jain,
	umgwanakikbuti, matt, riel, peterz

[-- Attachment #1: peterz-sis-again-10.patch --]
[-- Type: text/plain, Size: 2984 bytes --]

The new scheme is clearly better (XXX need !hackbench numbers), clean
up the mess.

This leaves everything under SIS_PROP, which I think Facebook still
uses (to disable), Rik?

Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |   43 ++++++++++++++++++-------------------------
 kernel/sched/features.h |    3 ---
 2 files changed, 18 insertions(+), 28 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6432,16 +6432,18 @@ static int select_idle_cpu(struct task_s
 	int cpu, loops = 0, nr = INT_MAX;
 	struct sched_domain *this_sd;
 	u64 avg_cost, avg_idle;
-	u64 time, cost;
-	s64 delta;
+	struct rq *this_rq;
+	u64 time;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
 		return -1;
 
-	if (sched_feat(SIS_AGE)) {
+	if (sched_feat(SIS_PROP)) {
 		unsigned long now = jiffies;
-		struct rq *this_rq = this_rq();
+		u64 span_avg;
+
+		this_rq = this_rq();
 
 		/*
 		 * If we're busy, the assumption that the last idle period
@@ -6456,24 +6458,16 @@ static int select_idle_cpu(struct task_s
 		}
 
 		avg_idle = this_rq->wake_avg;
-	} else {
-		/*
-		 * Due to large variance we need a large fuzz factor; hackbench
-		 * in particularly is sensitive here.
-		 */
-		avg_idle = this_rq()->avg_idle / 512;
-	}
-	avg_cost = this_sd->avg_scan_cost + 1;
+		avg_cost = this_sd->avg_scan_cost + 1;
 
-	if (sched_feat(SIS_PROP)) {
-		u64 span_avg = sd->span_weight * avg_idle;
+		span_avg = sd->span_weight * avg_idle;
 		if (span_avg > sis_min_cores * avg_cost)
 			nr = div_u64(span_avg, avg_cost);
 		else
 			nr = sis_min_cores;
-	}
 
-	time = local_clock();
+		time = local_clock();
+	}
 
 #ifdef CONFIG_SCHED_SMT
 	if (sched_feat(SIS_FOLD) && static_branch_likely(&sched_smt_present) &&
@@ -6483,26 +6477,25 @@ static int select_idle_cpu(struct task_s
 #endif
 	cpu = __select_idle_cpu(p, sd, target, nr * sched_smt_weight, &loops);
 
-	time = local_clock() - time;
+	if (sched_feat(SIS_PROP)) {
+		s64 delta;
 
-	if (sched_feat(SIS_ONCE)) {
-		struct rq *this_rq = this_rq();
+		time = local_clock() - time;
 
 		/*
 		 * We need to consider the cost of all wakeups between
 		 * consequtive idle periods. We can only use the predicted
 		 * idle time once.
 		 */
-		if (this_rq->wake_avg > time)
+		if (avg_idle > time)
 			this_rq->wake_avg -= time;
 		else
 			this_rq->wake_avg = 0;
-	}
 
-	time = div_u64(time, loops);
-	cost = this_sd->avg_scan_cost;
-	delta = (s64)(time - cost) / 8;
-	this_sd->avg_scan_cost += delta;
+		time = div_u64(time, loops);
+		delta = (s64)(time - avg_cost) / 8;
+		this_sd->avg_scan_cost += delta;
+	}
 
 	return cpu;
 }
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -56,9 +56,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
 SCHED_FEAT(SIS_PROP, true)
-
-SCHED_FEAT(SIS_AGE, true)
-SCHED_FEAT(SIS_ONCE, true)
 SCHED_FEAT(SIS_FOLD, true)
 
 /*

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC 11/11] sched/fair: Remove SIS_FOLD
  2018-05-30 14:22 [RFC 00/11] select_idle_sibling rework Peter Zijlstra
                   ` (9 preceding siblings ...)
  2018-05-30 14:22 ` [RFC 10/11] sched/fair: Remove SIS_AGE/SIS_ONCE Peter Zijlstra
@ 2018-05-30 14:22 ` Peter Zijlstra
  2018-06-19 22:06 ` [RFC 00/11] select_idle_sibling rework Matt Fleming
  11 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2018-05-30 14:22 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: subhra.mazumdar, steven.sistare, dhaval.giani, rohit.k.jain,
	umgwanakikbuti, matt, riel, peterz

[-- Attachment #1: peterz-sis-again-8.patch --]
[-- Type: text/plain, Size: 3708 bytes --]

Since the new code works, remove the old.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |   90 +-----------------------------------------------
 kernel/sched/features.h |    4 +-
 2 files changed, 4 insertions(+), 90 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6291,65 +6291,6 @@ void __update_idle_core(struct rq *rq)
 	rcu_read_unlock();
 }
 
-/*
- * Scan the entire LLC domain for idle cores; this dynamically switches off if
- * there are no idle cores left in the system; tracked through
- * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
- */
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
-{
-	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int core, cpu;
-
-	if (!static_branch_likely(&sched_smt_present))
-		return -1;
-
-	if (!test_idle_cores(target, false))
-		return -1;
-
-	cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
-
-	for_each_cpu_wrap(core, cpus, target) {
-		bool idle = true;
-
-		for_each_cpu(cpu, cpu_smt_mask(core)) {
-			cpumask_clear_cpu(cpu, cpus);
-			if (!available_idle_cpu(cpu))
-				idle = false;
-		}
-
-		if (idle)
-			return core;
-	}
-
-	/*
-	 * Failed to find an idle core; stop looking for one.
-	 */
-	set_idle_cores(target, 0);
-
-	return -1;
-}
-
-/*
- * Scan the local SMT mask for idle CPUs.
- */
-static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
-	int cpu;
-
-	if (!static_branch_likely(&sched_smt_present))
-		return -1;
-
-	for_each_cpu(cpu, cpu_smt_mask(target)) {
-		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-			continue;
-		if (available_idle_cpu(cpu))
-			return cpu;
-	}
-
-	return -1;
-}
-
 static int __select_idle_core(struct task_struct *p, struct sched_domain *sd,
 			      int target, int nr, int *ploops)
 {
@@ -6391,16 +6332,6 @@ static int __select_idle_core(struct tas
 
 #define sched_smt_weight	1
 
-static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
-{
-	return -1;
-}
-
-static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
-	return -1;
-}
-
 #endif /* CONFIG_SCHED_SMT */
 
 static int __select_idle_cpu(struct task_struct *p, struct sched_domain *sd,
@@ -6470,8 +6401,7 @@ static int select_idle_cpu(struct task_s
 	}
 
 #ifdef CONFIG_SCHED_SMT
-	if (sched_feat(SIS_FOLD) && static_branch_likely(&sched_smt_present) &&
-	    test_idle_cores(target, false))
+	if (static_branch_likely(&sched_smt_present) && test_idle_cores(target, false))
 		cpu = __select_idle_core(p, sd, target, nr, &loops);
 	else
 #endif
@@ -6536,25 +6466,9 @@ static int select_idle_sibling(struct ta
 	if (!sd)
 		return target;
 
-	if (sched_feat(SIS_FOLD)) {
-		i = select_idle_cpu(p, sd, target);
-		if ((unsigned)i < nr_cpumask_bits)
-			target = i;
-
-		return target;
-	}
-
-	i = select_idle_core(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
-
 	i = select_idle_cpu(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
-		return i;
-
-	i = select_idle_smt(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
+		target = i;
 
 	return target;
 }
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -53,10 +53,10 @@ SCHED_FEAT(NONTASK_CAPACITY, true)
 SCHED_FEAT(TTWU_QUEUE, true)
 
 /*
- * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
+ * When doing wakeups, attempt to limit scanning cost of the LLC proportional
+ * to the average idle time.
  */
 SCHED_FEAT(SIS_PROP, true)
-SCHED_FEAT(SIS_FOLD, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 00/11] select_idle_sibling rework
  2018-05-30 14:22 [RFC 00/11] select_idle_sibling rework Peter Zijlstra
                   ` (10 preceding siblings ...)
  2018-05-30 14:22 ` [RFC 11/11] sched/fair: Remove SIS_FOLD Peter Zijlstra
@ 2018-06-19 22:06 ` Matt Fleming
  2018-06-20 22:20   ` Steven Sistare
  11 siblings, 1 reply; 14+ messages in thread
From: Matt Fleming @ 2018-06-19 22:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, subhra.mazumdar, steven.sistare,
	dhaval.giani, rohit.k.jain, umgwanakikbuti, riel

On Wed, 30 May, at 04:22:36PM, Peter Zijlstra wrote:
> Hi all,
> 
> This is all still very preliminary and could all still go up in flames (it has
> only seen hackbench so far). This is mostly the same code I posted yesterday,
> but hopefully in a more readable form.
> 
> This fixes the SIS_PROP as per the outline here:
> 
>   https://lkml.kernel.org/r/20180425153600.GA4043@hirez.programming.kicks-ass.net
> 
> and Rohit's suggestion of folding the iteration loops.
> 
> For testing I would suggest to ignore the last 3 patches, those are purely
> cleanups once the first lot is found to actually work as advertised.

This series looks pretty good from my testing. I see double-digit
improvements to hackbench results and only one case of a clear
regression (easily offset by all the wins).

Are you aware of any regressions for particular benchmarks I should
take a look at?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 00/11] select_idle_sibling rework
  2018-06-19 22:06 ` [RFC 00/11] select_idle_sibling rework Matt Fleming
@ 2018-06-20 22:20   ` Steven Sistare
  0 siblings, 0 replies; 14+ messages in thread
From: Steven Sistare @ 2018-06-20 22:20 UTC (permalink / raw)
  To: Matt Fleming, Peter Zijlstra, subhra.mazumdar, rohit.k.jain
  Cc: mingo, linux-kernel, dhaval.giani, umgwanakikbuti, riel

On 6/19/2018 6:06 PM, Matt Fleming wrote:
> On Wed, 30 May, at 04:22:36PM, Peter Zijlstra wrote:
>> Hi all,
>>
>> This is all still very preliminary and could all still go up in flames (it has
>> only seen hackbench so far). This is mostly the same code I posted yesterday,
>> but hopefully in a more readable form.
>>
>> This fixes the SIS_PROP as per the outline here:
>>
>>   https://lkml.kernel.org/r/20180425153600.GA4043@hirez.programming.kicks-ass.net
>>
>> and Rohit's suggestion of folding the iteration loops.
>>
>> For testing I would suggest to ignore the last 3 patches, those are purely
>> cleanups once the first lot is found to actually work as advertised.
> 
> This series looks pretty good from my testing. I see double-digit
> improvements to hackbench results and only one case of a clear
> regression (easily offset by all the wins).
> 
> Are you aware of any regressions for particular benchmarks I should
> take a look at?

Hi folks,
  Just a heads up that I am working on a different patch series for
improving the utilization of idle CPUs.  After reviewing Subhra's series
and Peter's series for inspiration (thanks guys), I started from scratch.
My series improves efficiency on both the push side (find idle) and
the pull side (find a task), and is showing substantial speedups: a 10%
to 88% improvement in hackbench depending on load level.  I will send
an RFC soon, but here is a quick summary.

On the push side, I start by extending Rohit's proposal to use an idle
cpu mask.  I define a per-LLC bitmask of idle CPUs and a bitmask of idle
cores, maintained using atomic operations during idle_task transitions
with the help of a per-core busy-cpu counter.  select_idle_core,
select_idle_cpu, and select_idle_smt search the masks and claim bits
atomically.  However, to reduce contention amongst readers and writers
of the masks, I define a new sparsemask type which only uses 8 bits in
the first word of every 64 bytes; the remaining 63*8 bits are not used.
For example, a set that can hold 128 items is 128/8*64 = 1024 bytes
in size.  This reduces contention for the atomic operations that update
the mask, but still allows efficient traversal of the mask when searching
for set bits.  The number of bits per 64 bytes is a creation time parameter.

On the pull side, I define a per-LLC sparsemask of overloaded CPUs, which
are those with 2 or more runable CFS tasks, updated whenever h_nr_running
changes from 1 to greater, or from 2 to less.  Before pick_next_task_fair
calls idle_balance(), it tries to steal a task from an overloaded CPU,
using the mask to efficiently find candidates.  It first looks on
overloaded CPUs on the same core, then looks on all overloaded CPUs.
It steals the first migrateable task it finds, searching each overloaded
CPU starting at the leftmost task on cfs_rq->tasks_timeline for fairness.
The cost to steal is very low compared to the cost of idle_balance.  If
stealing fails to find a task, idle_balance is called, with no changes to 
its algorithm.

To measure overhead, I added a timer around all the code that
updates masks, searches for idle CPUs, searches for overloaded CPUs,
and steals a task.  The sum of this time is exposed as a schedstat,
labelled as search_time below.  This is temporary, for evaluation purposes.
I added timers around select_idle_sibling in the baseline kernel for
comparison.

More testing is needed, but in my limited testing so far using hackbench,
all the metrics look good: throughput is up, CPU utilization is up, and
search time is down.

Test is "hackbench <groups> process 100000"
Configuration is a single node Xeon, 28 cores, 56 CPUs.
In the tables below:
  Time = elapsed time in seconds.  
         Number in parens is improvement vs baseline.  
  %util is the overall CPU utilization.
  %search_time is the fraction of CPU time spent in the scheduler code
    mentioned above.

baseline:

groups  tasks   Time              %util   %search_time
1       40      13.023            46.95   0.0
2       80      16.939            76.97   0.2
3       120     18.866            79.29   0.3
4       160     23.454            85.31   0.4
8       320     44.095            98.54   1.3

new:

groups  tasks   Time              %util   %search_time
1       40      13.171 (- 1.1%)   39.97   0.2
2       80       8.993 (+88.4%)   98.74   0.1
3       120     13.446 (+40.3%)   99.02   0.3
4       160     20.249 (+15.8%)   99.70   0.3
8       320     40.121 (+ 9.9%)   99.91   0.3

For this workload, most of the improvement comes from the pull side changes,
but I still like the push side changes because they guarantee we find an
idle core or cpu if one is available, at low cost.  This is important if
an idle CPU has transitioned to the idle_task and is no longer trying to
steal work.

I only handle CFS tasks, but the same algorithms could be applied for RT.
The pushing and pulling only occurs within the LLC, but it could be
applied across nodes if the LLC search fails.  Those are potential future
patches if all goes well.  I will send a patch series for the CFS LLC work
soon.

- Steve

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2018-06-20 22:20 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-30 14:22 [RFC 00/11] select_idle_sibling rework Peter Zijlstra
2018-05-30 14:22 ` [RFC 01/11] sched/fair: Fix select_idle_cpu()s cost accounting Peter Zijlstra
2018-05-30 14:22 ` [RFC 02/11] sched/fair: Age the average idle time Peter Zijlstra
2018-05-30 14:22 ` [RFC 03/11] sched/fair: Only use time once Peter Zijlstra
2018-05-30 14:22 ` [RFC 04/11] sched/topology: Introduce sched_domain_cores() Peter Zijlstra
2018-05-30 14:22 ` [RFC 05/11] sched/fair: Re-arrange select_idle_cpu() Peter Zijlstra
2018-05-30 14:22 ` [RFC 06/11] sched/fair: Make select_idle_cpu() proportional to cores Peter Zijlstra
2018-05-30 14:22 ` [RFC 07/11] sched/fair: Fold the select_idle_sibling() scans Peter Zijlstra
2018-05-30 14:22 ` [RFC 08/11] sched/fair: Optimize SIS_FOLD Peter Zijlstra
2018-05-30 14:22 ` [RFC 09/11] sched/fair: Remove SIS_AVG_PROP Peter Zijlstra
2018-05-30 14:22 ` [RFC 10/11] sched/fair: Remove SIS_AGE/SIS_ONCE Peter Zijlstra
2018-05-30 14:22 ` [RFC 11/11] sched/fair: Remove SIS_FOLD Peter Zijlstra
2018-06-19 22:06 ` [RFC 00/11] select_idle_sibling rework Matt Fleming
2018-06-20 22:20   ` Steven Sistare

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).