LKML Archive on lore.kernel.org
* [RFC PATCH v3 00/16] Core scheduling v3
@ 2019-05-29 20:36 Vineeth Remanan Pillai
  2019-05-29 20:36 ` [RFC PATCH v3 01/16] stop_machine: Fix stop_cpus_in_progress ordering Vineeth Remanan Pillai
                   ` (17 more replies)
  0 siblings, 18 replies; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-05-29 20:36 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: Vineeth Remanan Pillai, linux-kernel, subhra.mazumdar, fweisbec,
	keescook, kerrnel, Phil Auld, Aaron Lu, Aubrey Li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

Third iteration of the Core-Scheduling feature.

This version mostly fixes correctness issues found in v2 and
addresses its performance problems. It also fixes some crashes
related to cgroups and CPU hotplugging.

We have tested and verified that incompatible processes are not
selected to run together during scheduling. In terms of performance,
the impact depends on the workload:
- on CPU-intensive applications that use all the logical CPUs with
  SMT enabled, enabling core scheduling performs better than nosmt.
- on mixed workloads with considerable I/O compared to CPU usage,
  nosmt seems to perform better than core scheduling.

Changes in v3
-------------
- Fixes the issue of a sibling picking up an incompatible task
  - Aaron Lu
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
-------------
- Fixes for a couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves the priority comparison logic for processes on different CPUs
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for the 32-bit build
  - Aubrey Li

Issues
------
- Comparing process priorities across CPUs is not accurate

TODO
----
- Decide on the API for exposing the feature to userland

---

Peter Zijlstra (16):
  stop_machine: Fix stop_cpus_in_progress ordering
  sched: Fix kerneldoc comment for ia64_set_curr_task
  sched: Wrap rq::lock access
  sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  sched: Add task_struct pointer to sched_class::set_curr_task
  sched/fair: Export newidle_balance()
  sched: Allow put_prev_task() to drop rq->lock
  sched: Rework pick_next_task() slow-path
  sched: Introduce sched_class::pick_task()
  sched: Core-wide rq->lock
  sched: Basic tracking of matching tasks
  sched: A quick and dirty cgroup tagging interface
  sched: Add core wide task selection and scheduling.
  sched/fair: Add a few assertions
  sched: Trivial forced-newidle balancer
  sched: Debug bits...

 include/linux/sched.h    |   9 +-
 kernel/Kconfig.preempt   |   7 +-
 kernel/sched/core.c      | 858 +++++++++++++++++++++++++++++++++++++--
 kernel/sched/cpuacct.c   |  12 +-
 kernel/sched/deadline.c  |  99 +++--
 kernel/sched/debug.c     |   4 +-
 kernel/sched/fair.c      | 180 ++++----
 kernel/sched/idle.c      |  42 +-
 kernel/sched/pelt.h      |   2 +-
 kernel/sched/rt.c        |  96 ++---
 kernel/sched/sched.h     | 237 ++++++++---
 kernel/sched/stop_task.c |  35 +-
 kernel/sched/topology.c  |   4 +-
 kernel/stop_machine.c    |   2 +
 14 files changed, 1250 insertions(+), 337 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [RFC PATCH v3 01/16] stop_machine: Fix stop_cpus_in_progress ordering
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
@ 2019-05-29 20:36 ` Vineeth Remanan Pillai
  2019-08-08 10:54   ` [tip:sched/core] " tip-bot for Peter Zijlstra
  2019-08-26 16:19   ` [RFC PATCH v3 01/16] " mark gross
  2019-05-29 20:36 ` [RFC PATCH v3 02/16] sched: Fix kerneldoc comment for ia64_set_curr_task Vineeth Remanan Pillai
                   ` (16 subsequent siblings)
  17 siblings, 2 replies; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-05-29 20:36 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

Make sure stop_cpus_in_progress is set for the entire for loop.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/stop_machine.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 067cb83f37ea..583119e0c51c 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -375,6 +375,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
 	 */
 	preempt_disable();
 	stop_cpus_in_progress = true;
+	barrier();
 	for_each_cpu(cpu, cpumask) {
 		work = &per_cpu(cpu_stopper.stop_work, cpu);
 		work->fn = fn;
@@ -383,6 +384,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
 		if (cpu_stop_queue_work(cpu, work))
 			queued = true;
 	}
+	barrier();
 	stop_cpus_in_progress = false;
 	preempt_enable();
 
-- 
2.17.1



* [RFC PATCH v3 02/16] sched: Fix kerneldoc comment for ia64_set_curr_task
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
  2019-05-29 20:36 ` [RFC PATCH v3 01/16] stop_machine: Fix stop_cpus_in_progress ordering Vineeth Remanan Pillai
@ 2019-05-29 20:36 ` Vineeth Remanan Pillai
  2019-08-08 10:55   ` [tip:sched/core] " tip-bot for Peter Zijlstra
  2019-08-26 16:20   ` [RFC PATCH v3 02/16] " mark gross
  2019-05-29 20:36 ` [RFC PATCH v3 03/16] sched: Wrap rq::lock access Vineeth Remanan Pillai
                   ` (15 subsequent siblings)
  17 siblings, 2 replies; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-05-29 20:36 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4778c48a7fda..416ea613eda8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6287,7 +6287,7 @@ struct task_struct *curr_task(int cpu)
 
 #ifdef CONFIG_IA64
 /**
- * set_curr_task - set the current task for a given CPU.
+ * ia64_set_curr_task - set the current task for a given CPU.
  * @cpu: the processor in question.
  * @p: the task pointer to set.
  *
-- 
2.17.1



* [RFC PATCH v3 03/16] sched: Wrap rq::lock access
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
  2019-05-29 20:36 ` [RFC PATCH v3 01/16] stop_machine: Fix stop_cpus_in_progress ordering Vineeth Remanan Pillai
  2019-05-29 20:36 ` [RFC PATCH v3 02/16] sched: Fix kerneldoc comment for ia64_set_curr_task Vineeth Remanan Pillai
@ 2019-05-29 20:36 ` Vineeth Remanan Pillai
  2019-05-29 20:36 ` [RFC PATCH v3 04/16] sched/{rt,deadline}: Fix set_next_task vs pick_next_task Vineeth Remanan Pillai
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-05-29 20:36 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Vineeth Remanan Pillai

From: Peter Zijlstra <peterz@infradead.org>

In preparation for playing games with rq->lock, abstract it behind
an accessor.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
---

Changes in v2
-------------
- Fixes a deadlock in double_rq_lock and double_lock_lock
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the 32-bit build.
  - Aubrey Li

---
 kernel/sched/core.c     |  46 ++++++++---------
 kernel/sched/cpuacct.c  |  12 ++---
 kernel/sched/deadline.c |  18 +++----
 kernel/sched/debug.c    |   4 +-
 kernel/sched/fair.c     |  40 +++++++--------
 kernel/sched/idle.c     |   4 +-
 kernel/sched/pelt.h     |   2 +-
 kernel/sched/rt.c       |   8 +--
 kernel/sched/sched.h    | 106 ++++++++++++++++++++--------------------
 kernel/sched/topology.c |   4 +-
 10 files changed, 123 insertions(+), 121 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 416ea613eda8..6f4861ae85dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,12 +72,12 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 
 	for (;;) {
 		rq = task_rq(p);
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
 			rq_pin_lock(rq, rf);
 			return rq;
 		}
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 
 		while (unlikely(task_on_rq_migrating(p)))
 			cpu_relax();
@@ -96,7 +96,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 	for (;;) {
 		raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
 		rq = task_rq(p);
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		/*
 		 *	move_queued_task()		task_rq_lock()
 		 *
@@ -118,7 +118,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 			rq_pin_lock(rq, rf);
 			return rq;
 		}
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 		raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 
 		while (unlikely(task_on_rq_migrating(p)))
@@ -188,7 +188,7 @@ void update_rq_clock(struct rq *rq)
 {
 	s64 delta;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (rq->clock_update_flags & RQCF_ACT_SKIP)
 		return;
@@ -497,7 +497,7 @@ void resched_curr(struct rq *rq)
 	struct task_struct *curr = rq->curr;
 	int cpu;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (test_tsk_need_resched(curr))
 		return;
@@ -521,10 +521,10 @@ void resched_cpu(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 	if (cpu_online(cpu) || cpu == smp_processor_id())
 		resched_curr(rq);
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 }
 
 #ifdef CONFIG_SMP
@@ -956,7 +956,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 				   struct task_struct *p, int new_cpu)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
 	dequeue_task(rq, p, DEQUEUE_NOCLOCK);
@@ -1070,7 +1070,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 		 * Because __kthread_bind() calls this on blocked tasks without
 		 * holding rq->lock.
 		 */
-		lockdep_assert_held(&rq->lock);
+		lockdep_assert_held(rq_lockp(rq));
 		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
 	}
 	if (running)
@@ -1203,7 +1203,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	 * task_rq_lock().
 	 */
 	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
-				      lockdep_is_held(&task_rq(p)->lock)));
+				      lockdep_is_held(rq_lockp(task_rq(p)))));
 #endif
 	/*
 	 * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
@@ -1732,7 +1732,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 {
 	int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 #ifdef CONFIG_SMP
 	if (p->sched_contributes_to_load)
@@ -2123,7 +2123,7 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf)
 	    WARN_ON_ONCE(p == current))
 		return;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (!raw_spin_trylock(&p->pi_lock)) {
 		/*
@@ -2609,10 +2609,10 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
 	 * do an early lockdep release here:
 	 */
 	rq_unpin_lock(rq, rf);
-	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
+	spin_release(&rq_lockp(rq)->dep_map, 1, _THIS_IP_);
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */
-	rq->lock.owner = next;
+	rq_lockp(rq)->owner = next;
 #endif
 }
 
@@ -2623,8 +2623,8 @@ static inline void finish_lock_switch(struct rq *rq)
 	 * fix up the runqueue lock - which gets 'carried over' from
 	 * prev into current:
 	 */
-	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
-	raw_spin_unlock_irq(&rq->lock);
+	spin_acquire(&rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
+	raw_spin_unlock_irq(rq_lockp(rq));
 }
 
 /*
@@ -2698,7 +2698,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 *	schedule()
 	 *	  preempt_disable();			// 1
 	 *	  __schedule()
-	 *	    raw_spin_lock_irq(&rq->lock)	// 2
+	 *	    raw_spin_lock_irq(rq_lockp(rq))	// 2
 	 *
 	 * Also, see FORK_PREEMPT_COUNT.
 	 */
@@ -2774,7 +2774,7 @@ static void __balance_callback(struct rq *rq)
 	void (*func)(struct rq *rq);
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 	head = rq->balance_callback;
 	rq->balance_callback = NULL;
 	while (head) {
@@ -2785,7 +2785,7 @@ static void __balance_callback(struct rq *rq)
 
 		func(rq);
 	}
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 }
 
 static inline void balance_callback(struct rq *rq)
@@ -5414,7 +5414,7 @@ void init_idle(struct task_struct *idle, int cpu)
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&idle->pi_lock, flags);
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 
 	__sched_fork(0, idle);
 	idle->state = TASK_RUNNING;
@@ -5451,7 +5451,7 @@ void init_idle(struct task_struct *idle, int cpu)
 #ifdef CONFIG_SMP
 	idle->on_cpu = 1;
 #endif
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 	raw_spin_unlock_irqrestore(&idle->pi_lock, flags);
 
 	/* Set the preempt count _outside_ the spinlocks! */
@@ -6019,7 +6019,7 @@ void __init sched_init(void)
 		struct rq *rq;
 
 		rq = cpu_rq(i);
-		raw_spin_lock_init(&rq->lock);
+		raw_spin_lock_init(&rq->__lock);
 		rq->nr_running = 0;
 		rq->calc_load_active = 0;
 		rq->calc_load_update = jiffies + LOAD_FREQ;
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 9fbb10383434..78de28ebc45d 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -111,7 +111,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	/*
 	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	if (index == CPUACCT_STAT_NSTATS) {
@@ -125,7 +125,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	}
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	return data;
@@ -140,14 +140,14 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
 	/*
 	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
 		cpuusage->usages[i] = val;
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 }
 
@@ -252,13 +252,13 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 			 * Take rq->lock to make 64-bit read safe on 32-bit
 			 * platforms.
 			 */
-			raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+			raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 			seq_printf(m, " %llu", cpuusage->usages[index]);
 
 #ifndef CONFIG_64BIT
-			raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+			raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 		}
 		seq_puts(m, "\n");
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 43901fa3f269..e683d4c19ab8 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -80,7 +80,7 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->running_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp((rq_of_dl_rq(dl_rq))));
 	dl_rq->running_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
 	SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
@@ -93,7 +93,7 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->running_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp((rq_of_dl_rq(dl_rq))));
 	dl_rq->running_bw -= dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
 	if (dl_rq->running_bw > old)
@@ -107,7 +107,7 @@ void __add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->this_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp((rq_of_dl_rq(dl_rq))));
 	dl_rq->this_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->this_bw < old); /* overflow */
 }
@@ -117,7 +117,7 @@ void __sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->this_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp((rq_of_dl_rq(dl_rq))));
 	dl_rq->this_bw -= dl_bw;
 	SCHED_WARN_ON(dl_rq->this_bw > old); /* underflow */
 	if (dl_rq->this_bw > old)
@@ -892,7 +892,7 @@ static int start_dl_timer(struct task_struct *p)
 	ktime_t now, act;
 	s64 delta;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	/*
 	 * We want the timer to fire at the deadline, but considering
@@ -1002,9 +1002,9 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 		 * If the runqueue is no longer available, migrate the
 		 * task elsewhere. This necessarily changes rq.
 		 */
-		lockdep_unpin_lock(&rq->lock, rf.cookie);
+		lockdep_unpin_lock(rq_lockp(rq), rf.cookie);
 		rq = dl_task_offline_migration(rq, p);
-		rf.cookie = lockdep_pin_lock(&rq->lock);
+		rf.cookie = lockdep_pin_lock(rq_lockp(rq));
 		update_rq_clock(rq);
 
 		/*
@@ -1619,7 +1619,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
 	 * from try_to_wake_up(). Hence, p->pi_lock is locked, but
 	 * rq->lock is not... So, lock it
 	 */
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	if (p->dl.dl_non_contending) {
 		sub_running_bw(&p->dl, &rq->dl);
 		p->dl.dl_non_contending = 0;
@@ -1634,7 +1634,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
 			put_task_struct(p);
 	}
 	sub_rq_bw(&p->dl, &rq->dl);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8039d62ae36e..bfeed9658a83 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -515,7 +515,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "exec_clock",
 			SPLIT_NS(cfs_rq->exec_clock));
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 	if (rb_first_cached(&cfs_rq->tasks_timeline))
 		MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
 	last = __pick_last_entity(cfs_rq);
@@ -523,7 +523,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 		max_vruntime = last->vruntime;
 	min_vruntime = cfs_rq->min_vruntime;
 	rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "MIN_vruntime",
 			SPLIT_NS(MIN_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 35f3ea375084..c17faf672dcf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4996,7 +4996,7 @@ static void __maybe_unused update_runtime_enabled(struct rq *rq)
 {
 	struct task_group *tg;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -5015,7 +5015,7 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 {
 	struct task_group *tg;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -6773,7 +6773,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
 		 * rq->lock and can modify state directly.
 		 */
-		lockdep_assert_held(&task_rq(p)->lock);
+		lockdep_assert_held(rq_lockp(task_rq(p)));
 		detach_entity_cfs_rq(&p->se);
 
 	} else {
@@ -7347,7 +7347,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 {
 	s64 delta;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	if (p->sched_class != &fair_sched_class)
 		return 0;
@@ -7441,7 +7441,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
 	int tsk_cache_hot;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	/*
 	 * We do not migrate tasks that are:
@@ -7519,7 +7519,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
  */
 static void detach_task(struct task_struct *p, struct lb_env *env)
 {
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	p->on_rq = TASK_ON_RQ_MIGRATING;
 	deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
@@ -7536,7 +7536,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
 {
 	struct task_struct *p;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	list_for_each_entry_reverse(p,
 			&env->src_rq->cfs_tasks, se.group_node) {
@@ -7572,7 +7572,7 @@ static int detach_tasks(struct lb_env *env)
 	unsigned long load;
 	int detached = 0;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	if (env->imbalance <= 0)
 		return 0;
@@ -7653,7 +7653,7 @@ static int detach_tasks(struct lb_env *env)
  */
 static void attach_task(struct rq *rq, struct task_struct *p)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	BUG_ON(task_rq(p) != rq);
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
@@ -9206,7 +9206,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		if (need_active_balance(&env)) {
 			unsigned long flags;
 
-			raw_spin_lock_irqsave(&busiest->lock, flags);
+			raw_spin_lock_irqsave(rq_lockp(busiest), flags);
 
 			/*
 			 * Don't kick the active_load_balance_cpu_stop,
@@ -9214,7 +9214,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			 * moved to this_cpu:
 			 */
 			if (!cpumask_test_cpu(this_cpu, &busiest->curr->cpus_allowed)) {
-				raw_spin_unlock_irqrestore(&busiest->lock,
+				raw_spin_unlock_irqrestore(rq_lockp(busiest),
 							    flags);
 				env.flags |= LBF_ALL_PINNED;
 				goto out_one_pinned;
@@ -9230,7 +9230,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 				busiest->push_cpu = this_cpu;
 				active_balance = 1;
 			}
-			raw_spin_unlock_irqrestore(&busiest->lock, flags);
+			raw_spin_unlock_irqrestore(rq_lockp(busiest), flags);
 
 			if (active_balance) {
 				stop_one_cpu_nowait(cpu_of(busiest),
@@ -9969,7 +9969,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
 	    time_before(jiffies, READ_ONCE(nohz.next_blocked)))
 		return;
 
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock(rq_lockp(this_rq));
 	/*
 	 * This CPU is going to be idle and blocked load of idle CPUs
 	 * need to be updated. Run the ilb locally as it is a good
@@ -9978,7 +9978,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
 	 */
 	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
 		kick_ilb(NOHZ_STATS_KICK);
-	raw_spin_lock(&this_rq->lock);
+	raw_spin_lock(rq_lockp(this_rq));
 }
 
 #else /* !CONFIG_NO_HZ_COMMON */
@@ -10038,7 +10038,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 		goto out;
 	}
 
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock(rq_lockp(this_rq));
 
 	update_blocked_averages(this_cpu);
 	rcu_read_lock();
@@ -10079,7 +10079,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 	}
 	rcu_read_unlock();
 
-	raw_spin_lock(&this_rq->lock);
+	raw_spin_lock(rq_lockp(this_rq));
 
 	if (curr_cost > this_rq->max_idle_balance_cost)
 		this_rq->max_idle_balance_cost = curr_cost;
@@ -10515,11 +10515,11 @@ void online_fair_sched_group(struct task_group *tg)
 		rq = cpu_rq(i);
 		se = tg->se[i];
 
-		raw_spin_lock_irq(&rq->lock);
+		raw_spin_lock_irq(rq_lockp(rq));
 		update_rq_clock(rq);
 		attach_entity_cfs_rq(se);
 		sync_throttle(tg, i);
-		raw_spin_unlock_irq(&rq->lock);
+		raw_spin_unlock_irq(rq_lockp(rq));
 	}
 }
 
@@ -10542,9 +10542,9 @@ void unregister_fair_sched_group(struct task_group *tg)
 
 		rq = cpu_rq(cpu);
 
-		raw_spin_lock_irqsave(&rq->lock, flags);
+		raw_spin_lock_irqsave(rq_lockp(rq), flags);
 		list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
-		raw_spin_unlock_irqrestore(&rq->lock, flags);
+		raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 	}
 }
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f5516bae0c1b..39788d3a40ec 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -390,10 +390,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 static void
 dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
 {
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_unlock_irq(rq_lockp(rq));
 	printk(KERN_ERR "bad: scheduling from the idle thread!\n");
 	dump_stack();
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_lock_irq(rq_lockp(rq));
 }
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 7489d5f56960..dd604947e9f8 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -116,7 +116,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
 
 static inline u64 rq_clock_pelt(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock_pelt - rq->lost_idle_time;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 90fa23d36565..3d9db8c75d53 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -845,7 +845,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 		if (skip)
 			continue;
 
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		update_rq_clock(rq);
 
 		if (rt_rq->rt_time) {
@@ -883,7 +883,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 
 		if (enqueue)
 			sched_rt_rq_enqueue(rt_rq);
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 	}
 
 	if (!throttled && (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF))
@@ -2034,9 +2034,9 @@ void rto_push_irq_work_func(struct irq_work *work)
 	 * When it gets updated, a check is made if a push is possible.
 	 */
 	if (has_pushable_tasks(rq)) {
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		push_rt_tasks(rq);
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 	}
 
 	raw_spin_lock(&rd->rto_lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efa686eeff26..c4cd252dba29 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -806,7 +806,7 @@ extern void rto_push_irq_work_func(struct irq_work *work);
  */
 struct rq {
 	/* runqueue lock: */
-	raw_spinlock_t		lock;
+	raw_spinlock_t		__lock;
 
 	/*
 	 * nr_running and cpu_load should be in the same cacheline because
@@ -979,6 +979,10 @@ static inline int cpu_of(struct rq *rq)
 #endif
 }
 
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+	return &rq->__lock;
+}
 
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
@@ -1046,7 +1050,7 @@ static inline void assert_clock_updated(struct rq *rq)
 
 static inline u64 rq_clock(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock;
@@ -1054,7 +1058,7 @@ static inline u64 rq_clock(struct rq *rq)
 
 static inline u64 rq_clock_task(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock_task;
@@ -1062,7 +1066,7 @@ static inline u64 rq_clock_task(struct rq *rq)
 
 static inline void rq_clock_skip_update(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	rq->clock_update_flags |= RQCF_REQ_SKIP;
 }
 
@@ -1072,7 +1076,7 @@ static inline void rq_clock_skip_update(struct rq *rq)
  */
 static inline void rq_clock_cancel_skipupdate(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	rq->clock_update_flags &= ~RQCF_REQ_SKIP;
 }
 
@@ -1091,7 +1095,7 @@ struct rq_flags {
 
 static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	rf->cookie = lockdep_pin_lock(&rq->lock);
+	rf->cookie = lockdep_pin_lock(rq_lockp(rq));
 
 #ifdef CONFIG_SCHED_DEBUG
 	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
@@ -1106,12 +1110,12 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
 		rf->clock_update_flags = RQCF_UPDATED;
 #endif
 
-	lockdep_unpin_lock(&rq->lock, rf->cookie);
+	lockdep_unpin_lock(rq_lockp(rq), rf->cookie);
 }
 
 static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	lockdep_repin_lock(&rq->lock, rf->cookie);
+	lockdep_repin_lock(rq_lockp(rq), rf->cookie);
 
 #ifdef CONFIG_SCHED_DEBUG
 	/*
@@ -1132,7 +1136,7 @@ static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static inline void
@@ -1141,7 +1145,7 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 	__releases(p->pi_lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 }
 
@@ -1149,7 +1153,7 @@ static inline void
 rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock_irqsave(&rq->lock, rf->flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), rf->flags);
 	rq_pin_lock(rq, rf);
 }
 
@@ -1157,7 +1161,7 @@ static inline void
 rq_lock_irq(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_lock_irq(rq_lockp(rq));
 	rq_pin_lock(rq, rf);
 }
 
@@ -1165,7 +1169,7 @@ static inline void
 rq_lock(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	rq_pin_lock(rq, rf);
 }
 
@@ -1173,7 +1177,7 @@ static inline void
 rq_relock(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	rq_repin_lock(rq, rf);
 }
 
@@ -1182,7 +1186,7 @@ rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), rf->flags);
 }
 
 static inline void
@@ -1190,7 +1194,7 @@ rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_unlock_irq(rq_lockp(rq));
 }
 
 static inline void
@@ -1198,7 +1202,7 @@ rq_unlock(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static inline struct rq *
@@ -1261,7 +1265,7 @@ queue_balance_callback(struct rq *rq,
 		       struct callback_head *head,
 		       void (*func)(struct rq *rq))
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (unlikely(head->next))
 		return;
@@ -1917,7 +1921,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock(rq_lockp(this_rq));
 	double_rq_lock(this_rq, busiest);
 
 	return 1;
@@ -1936,20 +1940,22 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	int ret = 0;
-
-	if (unlikely(!raw_spin_trylock(&busiest->lock))) {
-		if (busiest < this_rq) {
-			raw_spin_unlock(&this_rq->lock);
-			raw_spin_lock(&busiest->lock);
-			raw_spin_lock_nested(&this_rq->lock,
-					      SINGLE_DEPTH_NESTING);
-			ret = 1;
-		} else
-			raw_spin_lock_nested(&busiest->lock,
-					      SINGLE_DEPTH_NESTING);
+	if (rq_lockp(this_rq) == rq_lockp(busiest))
+		return 0;
+
+	if (likely(raw_spin_trylock(rq_lockp(busiest))))
+		return 0;
+
+	if (rq_lockp(busiest) >= rq_lockp(this_rq)) {
+		raw_spin_lock_nested(rq_lockp(busiest), SINGLE_DEPTH_NESTING);
+		return 0;
 	}
-	return ret;
+
+	raw_spin_unlock(rq_lockp(this_rq));
+	raw_spin_lock(rq_lockp(busiest));
+	raw_spin_lock_nested(rq_lockp(this_rq), SINGLE_DEPTH_NESTING);
+
+	return 1;
 }
 
 #endif /* CONFIG_PREEMPT */
@@ -1959,20 +1965,16 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
  */
 static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 {
-	if (unlikely(!irqs_disabled())) {
-		/* printk() doesn't work well under rq->lock */
-		raw_spin_unlock(&this_rq->lock);
-		BUG_ON(1);
-	}
-
+	lockdep_assert_irqs_disabled();
 	return _double_lock_balance(this_rq, busiest);
 }
 
 static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
 	__releases(busiest->lock)
 {
-	raw_spin_unlock(&busiest->lock);
-	lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
+	if (rq_lockp(this_rq) != rq_lockp(busiest))
+		raw_spin_unlock(rq_lockp(busiest));
+	lock_set_subclass(&rq_lockp(this_rq)->dep_map, 0, _RET_IP_);
 }
 
 static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
@@ -2013,16 +2015,16 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 	__acquires(rq2->lock)
 {
 	BUG_ON(!irqs_disabled());
-	if (rq1 == rq2) {
-		raw_spin_lock(&rq1->lock);
+	if (rq_lockp(rq1) == rq_lockp(rq2)) {
+		raw_spin_lock(rq_lockp(rq1));
 		__acquire(rq2->lock);	/* Fake it out ;) */
 	} else {
-		if (rq1 < rq2) {
-			raw_spin_lock(&rq1->lock);
-			raw_spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
+		if (rq_lockp(rq1) < rq_lockp(rq2)) {
+			raw_spin_lock(rq_lockp(rq1));
+			raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
 		} else {
-			raw_spin_lock(&rq2->lock);
-			raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
+			raw_spin_lock(rq_lockp(rq2));
+			raw_spin_lock_nested(rq_lockp(rq1), SINGLE_DEPTH_NESTING);
 		}
 	}
 }
@@ -2037,9 +2039,9 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 	__releases(rq1->lock)
 	__releases(rq2->lock)
 {
-	raw_spin_unlock(&rq1->lock);
-	if (rq1 != rq2)
-		raw_spin_unlock(&rq2->lock);
+	raw_spin_unlock(rq_lockp(rq1));
+	if (rq_lockp(rq1) != rq_lockp(rq2))
+		raw_spin_unlock(rq_lockp(rq2));
 	else
 		__release(rq2->lock);
 }
@@ -2062,7 +2064,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 {
 	BUG_ON(!irqs_disabled());
 	BUG_ON(rq1 != rq2);
-	raw_spin_lock(&rq1->lock);
+	raw_spin_lock(rq_lockp(rq1));
 	__acquire(rq2->lock);	/* Fake it out ;) */
 }
 
@@ -2077,7 +2079,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 	__releases(rq2->lock)
 {
 	BUG_ON(rq1 != rq2);
-	raw_spin_unlock(&rq1->lock);
+	raw_spin_unlock(rq_lockp(rq1));
 	__release(rq2->lock);
 }
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ab7f371a3a17..14b8be81dab2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -442,7 +442,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	struct root_domain *old_rd = NULL;
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 
 	if (rq->rd) {
 		old_rd = rq->rd;
@@ -468,7 +468,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
 		set_rq_online(rq);
 
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 
 	if (old_rd)
 		call_rcu(&old_rd->rcu, free_rootdomain);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [RFC PATCH v3 04/16] sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
                   ` (2 preceding siblings ...)
  2019-05-29 20:36 ` [RFC PATCH v3 03/16] sched: Wrap rq::lock access Vineeth Remanan Pillai
@ 2019-05-29 20:36 ` Vineeth Remanan Pillai
  2019-08-08 10:55   ` [tip:sched/core] " tip-bot for Peter Zijlstra
  2019-05-29 20:36 ` [RFC PATCH v3 05/16] sched: Add task_struct pointer to sched_class::set_curr_task Vineeth Remanan Pillai
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-05-29 20:36 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

Because pick_next_task() implies set_curr_task(), and some of the
details haven't mattered much until now, some of what _should_ be in
set_curr_task() ended up in pick_next_task(); correct this.

This prepares the way for a pick_next_task() variant that does not
affect the current state; allowing remote picking.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/deadline.c | 23 ++++++++++++-----------
 kernel/sched/rt.c       | 27 ++++++++++++++-------------
 2 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e683d4c19ab8..0783dfa65150 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1694,12 +1694,21 @@ static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
 }
 #endif
 
-static inline void set_next_task(struct rq *rq, struct task_struct *p)
+static void set_next_task_dl(struct rq *rq, struct task_struct *p)
 {
 	p->se.exec_start = rq_clock_task(rq);
 
 	/* You can't push away the running task */
 	dequeue_pushable_dl_task(rq, p);
+
+	if (hrtick_enabled(rq))
+		start_hrtick_dl(rq, p);
+
+	if (rq->curr->sched_class != &dl_sched_class)
+		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+	if (rq->curr != p)
+		deadline_queue_push_tasks(rq);
 }
 
 static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
@@ -1758,15 +1767,7 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	p = dl_task_of(dl_se);
 
-	set_next_task(rq, p);
-
-	if (hrtick_enabled(rq))
-		start_hrtick_dl(rq, p);
-
-	deadline_queue_push_tasks(rq);
-
-	if (rq->curr->sched_class != &dl_sched_class)
-		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+	set_next_task_dl(rq, p);
 
 	return p;
 }
@@ -1813,7 +1814,7 @@ static void task_fork_dl(struct task_struct *p)
 
 static void set_curr_task_dl(struct rq *rq)
 {
-	set_next_task(rq, rq->curr);
+	set_next_task_dl(rq, rq->curr);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3d9db8c75d53..353ad960691b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1498,12 +1498,23 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flag
 #endif
 }
 
-static inline void set_next_task(struct rq *rq, struct task_struct *p)
+static inline void set_next_task_rt(struct rq *rq, struct task_struct *p)
 {
 	p->se.exec_start = rq_clock_task(rq);
 
 	/* The running task is never eligible for pushing */
 	dequeue_pushable_task(rq, p);
+
+	/*
+	 * If prev task was rt, put_prev_task() has already updated the
+	 * utilization. We only care of the case where we start to schedule a
+	 * rt task
+	 */
+	if (rq->curr->sched_class != &rt_sched_class)
+		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+	if (rq->curr != p)
+		rt_queue_push_tasks(rq);
 }
 
 static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
@@ -1577,17 +1588,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	p = _pick_next_task_rt(rq);
 
-	set_next_task(rq, p);
-
-	rt_queue_push_tasks(rq);
-
-	/*
-	 * If prev task was rt, put_prev_task() has already updated the
-	 * utilization. We only care of the case where we start to schedule a
-	 * rt task
-	 */
-	if (rq->curr->sched_class != &rt_sched_class)
-		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+	set_next_task_rt(rq, p);
 
 	return p;
 }
@@ -2356,7 +2357,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
 
 static void set_curr_task_rt(struct rq *rq)
 {
-	set_next_task(rq, rq->curr);
+	set_next_task_rt(rq, rq->curr);
 }
 
 static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [RFC PATCH v3 05/16] sched: Add task_struct pointer to sched_class::set_curr_task
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
                   ` (3 preceding siblings ...)
  2019-05-29 20:36 ` [RFC PATCH v3 04/16] sched/{rt,deadline}: Fix set_next_task vs pick_next_task Vineeth Remanan Pillai
@ 2019-05-29 20:36 ` Vineeth Remanan Pillai
  2019-08-08 10:57   ` [tip:sched/core] " tip-bot for Peter Zijlstra
  2019-05-29 20:36 ` [RFC PATCH v3 06/16] sched/fair: Export newidle_balance() Vineeth Remanan Pillai
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-05-29 20:36 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

In preparation for further separating pick_next_task() and
set_curr_task(), we have to pass the actual task into the latter;
while there, rename it to better pair with put_prev_task().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c      | 12 ++++++------
 kernel/sched/deadline.c  |  7 +------
 kernel/sched/fair.c      | 17 ++++++++++++++---
 kernel/sched/idle.c      | 27 +++++++++++++++------------
 kernel/sched/rt.c        |  7 +------
 kernel/sched/sched.h     |  8 +++++---
 kernel/sched/stop_task.c | 17 +++++++----------
 7 files changed, 49 insertions(+), 46 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6f4861ae85dc..32ea79fb8d29 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1081,7 +1081,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 }
 
 /*
@@ -3890,7 +3890,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 	if (queued)
 		enqueue_task(rq, p, queue_flag);
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 out_unlock:
@@ -3957,7 +3957,7 @@ void set_user_nice(struct task_struct *p, long nice)
 			resched_curr(rq);
 	}
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 out_unlock:
 	task_rq_unlock(rq, p, &rf);
 }
@@ -4382,7 +4382,7 @@ static int __sched_setscheduler(struct task_struct *p,
 		enqueue_task(rq, p, queue_flags);
 	}
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 
@@ -5555,7 +5555,7 @@ void sched_setnuma(struct task_struct *p, int nid)
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 	task_rq_unlock(rq, p, &rf);
 }
 #endif /* CONFIG_NUMA_BALANCING */
@@ -6438,7 +6438,7 @@ void sched_move_task(struct task_struct *tsk)
 	if (queued)
 		enqueue_task(rq, tsk, queue_flags);
 	if (running)
-		set_curr_task(rq, tsk);
+		set_next_task(rq, tsk);
 
 	task_rq_unlock(rq, tsk, &rf);
 }
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0783dfa65150..c02b3229e2c3 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1812,11 +1812,6 @@ static void task_fork_dl(struct task_struct *p)
 	 */
 }
 
-static void set_curr_task_dl(struct rq *rq)
-{
-	set_next_task_dl(rq, rq->curr);
-}
-
 #ifdef CONFIG_SMP
 
 /* Only try algorithms three times */
@@ -2404,6 +2399,7 @@ const struct sched_class dl_sched_class = {
 
 	.pick_next_task		= pick_next_task_dl,
 	.put_prev_task		= put_prev_task_dl,
+	.set_next_task		= set_next_task_dl,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_dl,
@@ -2414,7 +2410,6 @@ const struct sched_class dl_sched_class = {
 	.task_woken		= task_woken_dl,
 #endif
 
-	.set_curr_task		= set_curr_task_dl,
 	.task_tick		= task_tick_dl,
 	.task_fork              = task_fork_dl,
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c17faf672dcf..56fc2a1aa261 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10388,9 +10388,19 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
  * This routine is mostly called to set cfs_rq->curr field when a task
  * migrates between groups/classes.
  */
-static void set_curr_task_fair(struct rq *rq)
+static void set_next_task_fair(struct rq *rq, struct task_struct *p)
 {
-	struct sched_entity *se = &rq->curr->se;
+	struct sched_entity *se = &p->se;
+
+#ifdef CONFIG_SMP
+	if (task_on_rq_queued(p)) {
+		/*
+		 * Move the next running task to the front of the list, so our
+		 * cfs_tasks list becomes MRU one.
+		 */
+		list_move(&se->group_node, &rq->cfs_tasks);
+	}
+#endif
 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -10661,7 +10671,9 @@ const struct sched_class fair_sched_class = {
 	.check_preempt_curr	= check_preempt_wakeup,
 
 	.pick_next_task		= pick_next_task_fair,
+
 	.put_prev_task		= put_prev_task_fair,
+	.set_next_task          = set_next_task_fair,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
@@ -10674,7 +10686,6 @@ const struct sched_class fair_sched_class = {
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
 
-	.set_curr_task          = set_curr_task_fair,
 	.task_tick		= task_tick_fair,
 	.task_fork		= task_fork_fair,
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 39788d3a40ec..dd64be34881d 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,14 +373,25 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 	resched_curr(rq);
 }
 
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
+{
+}
+
+static void set_next_task_idle(struct rq *rq, struct task_struct *next)
+{
+	update_idle_core(rq);
+	schedstat_inc(rq->sched_goidle);
+}
+
 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
+	struct task_struct *next = rq->idle;
+
 	put_prev_task(rq, prev);
-	update_idle_core(rq);
-	schedstat_inc(rq->sched_goidle);
+	set_next_task_idle(rq, next);
 
-	return rq->idle;
+	return next;
 }
 
 /*
@@ -396,10 +407,6 @@ dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
 	raw_spin_lock_irq(rq_lockp(rq));
 }
 
-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
-{
-}
-
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -412,10 +419,6 @@ static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
 {
 }
 
-static void set_curr_task_idle(struct rq *rq)
-{
-}
-
 static void switched_to_idle(struct rq *rq, struct task_struct *p)
 {
 	BUG();
@@ -450,13 +453,13 @@ const struct sched_class idle_sched_class = {
 
 	.pick_next_task		= pick_next_task_idle,
 	.put_prev_task		= put_prev_task_idle,
+	.set_next_task          = set_next_task_idle,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_idle,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
 
-	.set_curr_task          = set_curr_task_idle,
 	.task_tick		= task_tick_idle,
 
 	.get_rr_interval	= get_rr_interval_idle,
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 353ad960691b..adec98a94f2b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2355,11 +2355,6 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
 	}
 }
 
-static void set_curr_task_rt(struct rq *rq)
-{
-	set_next_task_rt(rq, rq->curr);
-}
-
 static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
 {
 	/*
@@ -2381,6 +2376,7 @@ const struct sched_class rt_sched_class = {
 
 	.pick_next_task		= pick_next_task_rt,
 	.put_prev_task		= put_prev_task_rt,
+	.set_next_task          = set_next_task_rt,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_rt,
@@ -2392,7 +2388,6 @@ const struct sched_class rt_sched_class = {
 	.switched_from		= switched_from_rt,
 #endif
 
-	.set_curr_task          = set_curr_task_rt,
 	.task_tick		= task_tick_rt,
 
 	.get_rr_interval	= get_rr_interval_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c4cd252dba29..fb01c77c16ff 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1672,6 +1672,7 @@ struct sched_class {
 					       struct task_struct *prev,
 					       struct rq_flags *rf);
 	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
+	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
@@ -1686,7 +1687,6 @@ struct sched_class {
 	void (*rq_offline)(struct rq *rq);
 #endif
 
-	void (*set_curr_task)(struct rq *rq);
 	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
 	void (*task_fork)(struct task_struct *p);
 	void (*task_dead)(struct task_struct *p);
@@ -1716,12 +1716,14 @@ struct sched_class {
 
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
+	WARN_ON_ONCE(rq->curr != prev);
 	prev->sched_class->put_prev_task(rq, prev);
 }
 
-static inline void set_curr_task(struct rq *rq, struct task_struct *curr)
+static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
-	curr->sched_class->set_curr_task(rq);
+	WARN_ON_ONCE(rq->curr != next);
+	next->sched_class->set_next_task(rq, next);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index c183b790ca54..47a3d2a18a9a 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -23,6 +23,11 @@ check_preempt_curr_stop(struct rq *rq, struct task_struct *p, int flags)
 	/* we're never preempted */
 }
 
+static void set_next_task_stop(struct rq *rq, struct task_struct *stop)
+{
+	stop->se.exec_start = rq_clock_task(rq);
+}
+
 static struct task_struct *
 pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -32,8 +37,7 @@ pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 		return NULL;
 
 	put_prev_task(rq, prev);
-
-	stop->se.exec_start = rq_clock_task(rq);
+	set_next_task_stop(rq, stop);
 
 	return stop;
 }
@@ -86,13 +90,6 @@ static void task_tick_stop(struct rq *rq, struct task_struct *curr, int queued)
 {
 }
 
-static void set_curr_task_stop(struct rq *rq)
-{
-	struct task_struct *stop = rq->stop;
-
-	stop->se.exec_start = rq_clock_task(rq);
-}
-
 static void switched_to_stop(struct rq *rq, struct task_struct *p)
 {
 	BUG(); /* its impossible to change to this class */
@@ -128,13 +125,13 @@ const struct sched_class stop_sched_class = {
 
 	.pick_next_task		= pick_next_task_stop,
 	.put_prev_task		= put_prev_task_stop,
+	.set_next_task          = set_next_task_stop,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_stop,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
 
-	.set_curr_task          = set_curr_task_stop,
 	.task_tick		= task_tick_stop,
 
 	.get_rr_interval	= get_rr_interval_stop,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [RFC PATCH v3 06/16] sched/fair: Export newidle_balance()
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
                   ` (4 preceding siblings ...)
  2019-05-29 20:36 ` [RFC PATCH v3 05/16] sched: Add task_struct pointer to sched_class::set_curr_task Vineeth Remanan Pillai
@ 2019-05-29 20:36 ` Vineeth Remanan Pillai
  2019-08-08 10:58   ` [tip:sched/core] sched/fair: Expose newidle_balance() tip-bot for Peter Zijlstra
  2019-05-29 20:36 ` [RFC PATCH v3 07/16] sched: Allow put_prev_task() to drop rq->lock Vineeth Remanan Pillai
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-05-29 20:36 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

For pick_next_task_fair() it is the newidle balance that requires
dropping the rq->lock; provided we do put_prev_task() early, we can
also detect the condition for doing newidle early.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c  | 18 ++++++++----------
 kernel/sched/sched.h |  4 ++++
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56fc2a1aa261..49707b4797de 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3615,8 +3615,6 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
 	return cfs_rq->avg.load_avg;
 }
 
-static int idle_balance(struct rq *this_rq, struct rq_flags *rf);
-
 static inline unsigned long task_util(struct task_struct *p)
 {
 	return READ_ONCE(p->se.avg.util_avg);
@@ -7087,11 +7085,10 @@ done: __maybe_unused;
 	return p;
 
 idle:
-	update_misfit_status(NULL, rq);
-	new_tasks = idle_balance(rq, rf);
+	new_tasks = newidle_balance(rq, rf);
 
 	/*
-	 * Because idle_balance() releases (and re-acquires) rq->lock, it is
+	 * Because newidle_balance() releases (and re-acquires) rq->lock, it is
 	 * possible for any higher priority task to appear. In that case we
 	 * must re-start the pick_next_entity() loop.
 	 */
@@ -9286,10 +9283,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	ld_moved = 0;
 
 	/*
-	 * idle_balance() disregards balance intervals, so we could repeatedly
-	 * reach this code, which would lead to balance_interval skyrocketting
-	 * in a short amount of time. Skip the balance_interval increase logic
-	 * to avoid that.
+	 * newidle_balance() disregards balance intervals, so we could
+	 * repeatedly reach this code, which would lead to balance_interval
+	 * skyrocketting in a short amount of time. Skip the balance_interval
+	 * increase logic to avoid that.
 	 */
 	if (env.idle == CPU_NEWLY_IDLE)
 		goto out;
@@ -9996,7 +9993,7 @@ static inline void nohz_newidle_balance(struct rq *this_rq) { }
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
  */
-static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
+int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 {
 	unsigned long next_balance = jiffies + HZ;
 	int this_cpu = this_rq->cpu;
@@ -10004,6 +10001,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 	int pulled_task = 0;
 	u64 curr_cost = 0;
 
+	update_misfit_status(NULL, this_rq);
 	/*
 	 * We must set idle_stamp _before_ calling idle_balance(), such that we
 	 * measure the duration of idle_balance() as idle time.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fb01c77c16ff..bfcbcbb25646 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1414,10 +1414,14 @@ static inline void unregister_sched_domain_sysctl(void)
 }
 #endif
 
+extern int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
+
 #else
 
 static inline void sched_ttwu_pending(void) { }
 
+static inline int newidle_balance(struct rq *this_rq, struct rq_flags *rf) { return 0; }
+
 #endif /* CONFIG_SMP */
 
 #include "stats.h"
-- 
2.17.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [RFC PATCH v3 07/16] sched: Allow put_prev_task() to drop rq->lock
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
                   ` (5 preceding siblings ...)
  2019-05-29 20:36 ` [RFC PATCH v3 06/16] sched/fair: Export newidle_balance() Vineeth Remanan Pillai
@ 2019-05-29 20:36 ` Vineeth Remanan Pillai
  2019-08-08 10:58   ` [tip:sched/core] " tip-bot for Peter Zijlstra
  2019-08-26 16:51   ` [RFC PATCH v3 07/16] " mark gross
  2019-05-29 20:36 ` [RFC PATCH v3 08/16] sched: Rework pick_next_task() slow-path Vineeth Remanan Pillai
                   ` (10 subsequent siblings)
  17 siblings, 2 replies; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-05-29 20:36 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

Currently the pick_next_task() loop is convoluted and ugly because of
how it can drop the rq->lock and needs to restart the picking.

For the RT/Deadline classes, balancing is done in put_prev_task(),
and we could do this before the picking loop. Make this possible.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c      |  2 +-
 kernel/sched/deadline.c  | 14 +++++++++++++-
 kernel/sched/fair.c      |  2 +-
 kernel/sched/idle.c      |  2 +-
 kernel/sched/rt.c        | 14 +++++++++++++-
 kernel/sched/sched.h     |  4 ++--
 kernel/sched/stop_task.c |  2 +-
 7 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 32ea79fb8d29..9dfa0c53deb3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5595,7 +5595,7 @@ static void calc_load_migrate(struct rq *rq)
 		atomic_long_add(delta, &calc_load_tasks);
 }
 
-static void put_prev_task_fake(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_fake(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c02b3229e2c3..45425f971eec 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1772,13 +1772,25 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	return p;
 }
 
-static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 {
 	update_curr_dl(rq);
 
 	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
 	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
+
+	if (rf && !on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
+		/*
+		 * This is OK, because current is on_cpu, which avoids it being
+		 * picked for load-balance and preemption/IRQs are still
+		 * disabled avoiding further scheduler activity on it and we've
+		 * not yet started the picking loop.
+		 */
+		rq_unpin_lock(rq, rf);
+		pull_dl_task(rq);
+		rq_repin_lock(rq, rf);
+	}
 }
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 49707b4797de..8e3eb243fd9f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7110,7 +7110,7 @@ done: __maybe_unused;
 /*
  * Account for a descheduled task:
  */
-static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct sched_entity *se = &prev->se;
 	struct cfs_rq *cfs_rq;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index dd64be34881d..1b65a4c3683e 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,7 +373,7 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 	resched_curr(rq);
 }
 
-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index adec98a94f2b..51ee87c5a28a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1593,7 +1593,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	return p;
 }
 
-static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
+static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 {
 	update_curr_rt(rq);
 
@@ -1605,6 +1605,18 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 	 */
 	if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
+
+	if (rf && !on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
+		/*
+		 * This is OK, because current is on_cpu, which avoids it being
+		 * picked for load-balance and preemption/IRQs are still
+		 * disabled avoiding further scheduler activity on it and we've
+		 * not yet started the picking loop.
+		 */
+		rq_unpin_lock(rq, rf);
+		pull_rt_task(rq);
+		rq_repin_lock(rq, rf);
+	}
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bfcbcbb25646..4cbe2bef92e4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1675,7 +1675,7 @@ struct sched_class {
 	struct task_struct * (*pick_next_task)(struct rq *rq,
 					       struct task_struct *prev,
 					       struct rq_flags *rf);
-	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
+	void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct rq_flags *rf);
 	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
@@ -1721,7 +1721,7 @@ struct sched_class {
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
 	WARN_ON_ONCE(rq->curr != prev);
-	prev->sched_class->put_prev_task(rq, prev);
+	prev->sched_class->put_prev_task(rq, prev, NULL);
 }
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 47a3d2a18a9a..8f414018d5e0 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -59,7 +59,7 @@ static void yield_task_stop(struct rq *rq)
 	BUG(); /* the stop task should never yield, its pointless. */
 }
 
-static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct task_struct *curr = rq->curr;
 	u64 delta_exec;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [RFC PATCH v3 08/16] sched: Rework pick_next_task() slow-path
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
                   ` (6 preceding siblings ...)
  2019-05-29 20:36 ` [RFC PATCH v3 07/16] sched: Allow put_prev_task() to drop rq->lock Vineeth Remanan Pillai
@ 2019-05-29 20:36 ` Vineeth Remanan Pillai
  2019-08-08 10:59   ` [tip:sched/core] " tip-bot for Peter Zijlstra
  2019-08-26 17:01   ` [RFC PATCH v3 08/16] " mark gross
  2019-05-29 20:36 ` [RFC PATCH v3 09/16] sched: Introduce sched_class::pick_task() Vineeth Remanan Pillai
                   ` (9 subsequent siblings)
  17 siblings, 2 replies; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-05-29 20:36 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

Avoid the RETRY_TASK case in the pick_next_task() slow path.

By doing the put_prev_task() early, we get the rt/deadline pull done,
and by testing rq->nr_running we know if we need newidle_balance().

This then gives a stable state to pick a task from.

Since the fast-path is fair-only, the other classes will always see
pick_next_task(.prev=NULL, .rf=NULL) and we can simplify.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c      | 19 ++++++++++++-------
 kernel/sched/deadline.c  | 30 ++----------------------------
 kernel/sched/fair.c      |  9 ++++++---
 kernel/sched/idle.c      |  4 +++-
 kernel/sched/rt.c        | 29 +----------------------------
 kernel/sched/sched.h     | 13 ++++++++-----
 kernel/sched/stop_task.c |  3 ++-
 7 files changed, 34 insertions(+), 73 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9dfa0c53deb3..b883c70674ba 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3363,7 +3363,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 		p = fair_sched_class.pick_next_task(rq, prev, rf);
 		if (unlikely(p == RETRY_TASK))
-			goto again;
+			goto restart;
 
 		/* Assumes fair_sched_class->next == idle_sched_class */
 		if (unlikely(!p))
@@ -3372,14 +3372,19 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		return p;
 	}
 
-again:
+restart:
+	/*
+	 * Ensure that we put DL/RT tasks before the pick loop, such that they
+	 * can PULL higher prio tasks when we lower the RQ 'priority'.
+	 */
+	prev->sched_class->put_prev_task(rq, prev, rf);
+	if (!rq->nr_running)
+		newidle_balance(rq, rf);
+
 	for_each_class(class) {
-		p = class->pick_next_task(rq, prev, rf);
-		if (p) {
-			if (unlikely(p == RETRY_TASK))
-				goto again;
+		p = class->pick_next_task(rq, NULL, NULL);
+		if (p)
 			return p;
-		}
 	}
 
 	/* The idle class should always have a runnable task: */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 45425f971eec..d3904168857a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1729,39 +1729,13 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	struct task_struct *p;
 	struct dl_rq *dl_rq;
 
-	dl_rq = &rq->dl;
-
-	if (need_pull_dl_task(rq, prev)) {
-		/*
-		 * This is OK, because current is on_cpu, which avoids it being
-		 * picked for load-balance and preemption/IRQs are still
-		 * disabled avoiding further scheduler activity on it and we're
-		 * being very careful to re-start the picking loop.
-		 */
-		rq_unpin_lock(rq, rf);
-		pull_dl_task(rq);
-		rq_repin_lock(rq, rf);
-		/*
-		 * pull_dl_task() can drop (and re-acquire) rq->lock; this
-		 * means a stop task can slip in, in which case we need to
-		 * re-start task selection.
-		 */
-		if (rq->stop && task_on_rq_queued(rq->stop))
-			return RETRY_TASK;
-	}
+	WARN_ON_ONCE(prev || rf);
 
-	/*
-	 * When prev is DL, we may throttle it in put_prev_task().
-	 * So, we update time before we check for dl_nr_running.
-	 */
-	if (prev->sched_class == &dl_sched_class)
-		update_curr_dl(rq);
+	dl_rq = &rq->dl;
 
 	if (unlikely(!dl_rq->dl_nr_running))
 		return NULL;
 
-	put_prev_task(rq, prev);
-
 	dl_se = pick_next_dl_entity(rq, dl_rq);
 	BUG_ON(!dl_se);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e3eb243fd9f..e65f2dfda77a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6979,7 +6979,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 		goto idle;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	if (prev->sched_class != &fair_sched_class)
+	if (!prev || prev->sched_class != &fair_sched_class)
 		goto simple;
 
 	/*
@@ -7056,8 +7056,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	goto done;
 simple:
 #endif
-
-	put_prev_task(rq, prev);
+	if (prev)
+		put_prev_task(rq, prev);
 
 	do {
 		se = pick_next_entity(cfs_rq, NULL);
@@ -7085,6 +7085,9 @@ done: __maybe_unused;
 	return p;
 
 idle:
+	if (!rf)
+		return NULL;
+
 	new_tasks = newidle_balance(rq, rf);
 
 	/*
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1b65a4c3683e..7ece8e820b5d 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -388,7 +388,9 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 {
 	struct task_struct *next = rq->idle;
 
-	put_prev_task(rq, prev);
+	if (prev)
+		put_prev_task(rq, prev);
+
 	set_next_task_idle(rq, next);
 
 	return next;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 51ee87c5a28a..79f2e60516ef 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1554,38 +1554,11 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	struct task_struct *p;
 	struct rt_rq *rt_rq = &rq->rt;
 
-	if (need_pull_rt_task(rq, prev)) {
-		/*
-		 * This is OK, because current is on_cpu, which avoids it being
-		 * picked for load-balance and preemption/IRQs are still
-		 * disabled avoiding further scheduler activity on it and we're
-		 * being very careful to re-start the picking loop.
-		 */
-		rq_unpin_lock(rq, rf);
-		pull_rt_task(rq);
-		rq_repin_lock(rq, rf);
-		/*
-		 * pull_rt_task() can drop (and re-acquire) rq->lock; this
-		 * means a dl or stop task can slip in, in which case we need
-		 * to re-start task selection.
-		 */
-		if (unlikely((rq->stop && task_on_rq_queued(rq->stop)) ||
-			     rq->dl.dl_nr_running))
-			return RETRY_TASK;
-	}
-
-	/*
-	 * We may dequeue prev's rt_rq in put_prev_task().
-	 * So, we update time before rt_queued check.
-	 */
-	if (prev->sched_class == &rt_sched_class)
-		update_curr_rt(rq);
+	WARN_ON_ONCE(prev || rf);
 
 	if (!rt_rq->rt_queued)
 		return NULL;
 
-	put_prev_task(rq, prev);
-
 	p = _pick_next_task_rt(rq);
 
 	set_next_task_rt(rq, p);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4cbe2bef92e4..460dd04e76af 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1665,12 +1665,15 @@ struct sched_class {
 	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
 
 	/*
-	 * It is the responsibility of the pick_next_task() method that will
-	 * return the next task to call put_prev_task() on the @prev task or
-	 * something equivalent.
+	 * Both @prev and @rf are optional and may be NULL, in which case the
+	 * caller must already have invoked put_prev_task(rq, prev, rf).
 	 *
-	 * May return RETRY_TASK when it finds a higher prio class has runnable
-	 * tasks.
+	 * Otherwise it is the responsibility of the pick_next_task() to call
+	 * put_prev_task() on the @prev task or something equivalent, IFF it
+	 * returns a next task.
+	 *
+	 * In that case (@rf != NULL) it may return RETRY_TASK when it finds a
+	 * higher prio class has runnable tasks.
 	 */
 	struct task_struct * (*pick_next_task)(struct rq *rq,
 					       struct task_struct *prev,
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 8f414018d5e0..7e1cee4e65b2 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -33,10 +33,11 @@ pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 {
 	struct task_struct *stop = rq->stop;
 
+	WARN_ON_ONCE(prev || rf);
+
 	if (!stop || !task_on_rq_queued(stop))
 		return NULL;
 
-	put_prev_task(rq, prev);
 	set_next_task_stop(rq, stop);
 
 	return stop;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [RFC PATCH v3 09/16] sched: Introduce sched_class::pick_task()
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
                   ` (7 preceding siblings ...)
  2019-05-29 20:36 ` [RFC PATCH v3 08/16] sched: Rework pick_next_task() slow-path Vineeth Remanan Pillai
@ 2019-05-29 20:36 ` Vineeth Remanan Pillai
  2019-08-26 17:14   ` mark gross
  2019-05-29 20:36 ` [RFC PATCH v3 10/16] sched: Core-wide rq->lock Vineeth Remanan Pillai
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-05-29 20:36 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Vineeth Remanan Pillai

From: Peter Zijlstra <peterz@infradead.org>

Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance()) it is not state-invariant. This makes it unsuitable
for remote task selection.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
---

Changes in v3
-------------
- Minor refactor to remove redundant NULL checks

Changes in v2
-------------
- Fixes a NULL pointer dereference crash
  - Subhra Mazumdar
  - Tim Chen

---
 kernel/sched/deadline.c  | 21 ++++++++++++++++-----
 kernel/sched/fair.c      | 36 +++++++++++++++++++++++++++++++++---
 kernel/sched/idle.c      | 10 +++++++++-
 kernel/sched/rt.c        | 21 ++++++++++++++++-----
 kernel/sched/sched.h     |  2 ++
 kernel/sched/stop_task.c | 21 ++++++++++++++++-----
 6 files changed, 92 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d3904168857a..64fc444f44f9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1722,15 +1722,12 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
 	return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
-static struct task_struct *
-pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static struct task_struct *pick_task_dl(struct rq *rq)
 {
 	struct sched_dl_entity *dl_se;
 	struct task_struct *p;
 	struct dl_rq *dl_rq;
 
-	WARN_ON_ONCE(prev || rf);
-
 	dl_rq = &rq->dl;
 
 	if (unlikely(!dl_rq->dl_nr_running))
@@ -1741,7 +1738,19 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	p = dl_task_of(dl_se);
 
-	set_next_task_dl(rq, p);
+	return p;
+}
+
+static struct task_struct *
+pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *p;
+
+	WARN_ON_ONCE(prev || rf);
+
+	p = pick_task_dl(rq);
+	if (p)
+		set_next_task_dl(rq, p);
 
 	return p;
 }
@@ -2388,6 +2397,8 @@ const struct sched_class dl_sched_class = {
 	.set_next_task		= set_next_task_dl,
 
 #ifdef CONFIG_SMP
+	.pick_task		= pick_task_dl,
+
 	.select_task_rq		= select_task_rq_dl,
 	.migrate_task_rq	= migrate_task_rq_dl,
 	.set_cpus_allowed       = set_cpus_allowed_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e65f2dfda77a..02e5dfb85e7d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4136,7 +4136,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	 * Avoid running the skip buddy, if running something else can
 	 * be done without getting too unfair.
 	 */
-	if (cfs_rq->skip == se) {
+	if (cfs_rq->skip && cfs_rq->skip == se) {
 		struct sched_entity *second;
 
 		if (se == curr) {
@@ -4154,13 +4154,13 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	/*
 	 * Prefer last buddy, try to return the CPU to a preempted task.
 	 */
-	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+	if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
 		se = cfs_rq->last;
 
 	/*
 	 * Someone really wants this to run. If it's not unfair, run it.
 	 */
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+	if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
 		se = cfs_rq->next;
 
 	clear_buddies(cfs_rq, se);
@@ -6966,6 +6966,34 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		set_last_buddy(se);
 }
 
+static struct task_struct *
+pick_task_fair(struct rq *rq)
+{
+	struct cfs_rq *cfs_rq = &rq->cfs;
+	struct sched_entity *se;
+
+	if (!cfs_rq->nr_running)
+		return NULL;
+
+	do {
+		struct sched_entity *curr = cfs_rq->curr;
+
+		se = pick_next_entity(cfs_rq, NULL);
+
+		if (curr) {
+			if (se && curr->on_rq)
+				update_curr(cfs_rq);
+
+			if (!se || entity_before(curr, se))
+				se = curr;
+		}
+
+		cfs_rq = group_cfs_rq(se);
+	} while (cfs_rq);
+
+	return task_of(se);
+}
+
 static struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -10677,6 +10705,8 @@ const struct sched_class fair_sched_class = {
 	.set_next_task          = set_next_task_fair,
 
 #ifdef CONFIG_SMP
+	.pick_task		= pick_task_fair,
+
 	.select_task_rq		= select_task_rq_fair,
 	.migrate_task_rq	= migrate_task_rq_fair,
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 7ece8e820b5d..e7f38da60373 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,6 +373,12 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 	resched_curr(rq);
 }
 
+static struct task_struct *
+pick_task_idle(struct rq *rq)
+{
+	return rq->idle;
+}
+
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 }
@@ -386,11 +392,12 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next)
 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
-	struct task_struct *next = rq->idle;
+	struct task_struct *next;
 
 	if (prev)
 		put_prev_task(rq, prev);
 
+	next = pick_task_idle(rq);
 	set_next_task_idle(rq, next);
 
 	return next;
@@ -458,6 +465,7 @@ const struct sched_class idle_sched_class = {
 	.set_next_task          = set_next_task_idle,
 
 #ifdef CONFIG_SMP
+	.pick_task		= pick_task_idle,
 	.select_task_rq		= select_task_rq_idle,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 79f2e60516ef..81557224548c 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1548,20 +1548,29 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 	return rt_task_of(rt_se);
 }
 
-static struct task_struct *
-pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static struct task_struct *pick_task_rt(struct rq *rq)
 {
 	struct task_struct *p;
 	struct rt_rq *rt_rq = &rq->rt;
 
-	WARN_ON_ONCE(prev || rf);
-
 	if (!rt_rq->rt_queued)
 		return NULL;
 
 	p = _pick_next_task_rt(rq);
 
-	set_next_task_rt(rq, p);
+	return p;
+}
+
+static struct task_struct *
+pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *p;
+
+	WARN_ON_ONCE(prev || rf);
+
+	p = pick_task_rt(rq);
+	if (p)
+		set_next_task_rt(rq, p);
 
 	return p;
 }
@@ -2364,6 +2373,8 @@ const struct sched_class rt_sched_class = {
 	.set_next_task          = set_next_task_rt,
 
 #ifdef CONFIG_SMP
+	.pick_task		= pick_task_rt,
+
 	.select_task_rq		= select_task_rq_rt,
 
 	.set_cpus_allowed       = set_cpus_allowed_common,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 460dd04e76af..a024dd80eeb3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1682,6 +1682,8 @@ struct sched_class {
 	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
+	struct task_struct * (*pick_task)(struct rq *rq);
+
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
 	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
 
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 7e1cee4e65b2..fb6c436cba6c 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -29,20 +29,30 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop)
 }
 
 static struct task_struct *
-pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_task_stop(struct rq *rq)
 {
 	struct task_struct *stop = rq->stop;
 
-	WARN_ON_ONCE(prev || rf);
-
 	if (!stop || !task_on_rq_queued(stop))
 		return NULL;
 
-	set_next_task_stop(rq, stop);
-
 	return stop;
 }
 
+static struct task_struct *
+pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *p;
+
+	WARN_ON_ONCE(prev || rf);
+
+	p = pick_task_stop(rq);
+	if (p)
+		set_next_task_stop(rq, p);
+
+	return p;
+}
+
 static void
 enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
@@ -129,6 +139,7 @@ const struct sched_class stop_sched_class = {
 	.set_next_task          = set_next_task_stop,
 
 #ifdef CONFIG_SMP
+	.pick_task		= pick_task_stop,
 	.select_task_rq		= select_task_rq_stop,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
-- 
2.17.1



* [RFC PATCH v3 10/16] sched: Core-wide rq->lock
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
                   ` (8 preceding siblings ...)
  2019-05-29 20:36 ` [RFC PATCH v3 09/16] sched: Introduce sched_class::pick_task() Vineeth Remanan Pillai
@ 2019-05-29 20:36 ` Vineeth Remanan Pillai
  2019-05-31 11:08   ` Peter Zijlstra
  2019-05-29 20:36 ` [RFC PATCH v3 11/16] sched: Basic tracking of matching tasks Vineeth Remanan Pillai
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-05-29 20:36 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Vineeth Remanan Pillai

From: Peter Zijlstra <peterz@infradead.org>

Introduce the basic infrastructure to have a core-wide rq->lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
---

Changes in v3
-------------
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai

 kernel/Kconfig.preempt |   7 ++-
 kernel/sched/core.c    | 113 ++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h   |  31 +++++++++++
 3 files changed, 148 insertions(+), 3 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 0fee5fe6c899..02fe0bf26676 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -57,4 +57,9 @@ config PREEMPT
 endchoice
 
 config PREEMPT_COUNT
-       bool
+	bool
+
+config SCHED_CORE
+	bool
+	default y
+	depends on SCHED_SMT
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b883c70674ba..b1ce33f9b106 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -60,6 +60,70 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * The static-key + stop-machine variable are needed such that:
+ *
+ *	spin_lock(rq_lockp(rq));
+ *	...
+ *	spin_unlock(rq_lockp(rq));
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+static int __sched_core_stopper(void *data)
+{
+	bool enabled = !!(unsigned long)data;
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		cpu_rq(cpu)->core_enabled = enabled;
+
+	return 0;
+}
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+
+static void __sched_core_enable(void)
+{
+	// XXX verify there are no cookie tasks (yet)
+
+	static_branch_enable(&__sched_core_enabled);
+	stop_machine(__sched_core_stopper, (void *)true, NULL);
+}
+
+static void __sched_core_disable(void)
+{
+	// XXX verify there are no cookie tasks (left)
+
+	stop_machine(__sched_core_stopper, (void *)false, NULL);
+	static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!sched_core_count++)
+		__sched_core_enable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!--sched_core_count)
+		__sched_core_disable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */
@@ -5790,8 +5854,15 @@ int sched_cpu_activate(unsigned int cpu)
 	/*
 	 * When going up, increment the number of cores with SMT present.
 	 */
-	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+	if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
 		static_branch_inc_cpuslocked(&sched_smt_present);
+#ifdef CONFIG_SCHED_CORE
+		if (static_branch_unlikely(&__sched_core_enabled)) {
+			rq->core_enabled = true;
+		}
+#endif
+	}
+
 #endif
 	set_cpu_active(cpu, true);
 
@@ -5839,8 +5910,16 @@ int sched_cpu_deactivate(unsigned int cpu)
 	/*
 	 * When going down, decrement the number of cores with SMT present.
 	 */
-	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+	if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
+#ifdef CONFIG_SCHED_CORE
+		struct rq *rq = cpu_rq(cpu);
+		if (static_branch_unlikely(&__sched_core_enabled)) {
+			rq->core_enabled = false;
+		}
+#endif
 		static_branch_dec_cpuslocked(&sched_smt_present);
+
+	}
 #endif
 
 	if (!sched_smp_initialized)
@@ -5865,6 +5944,28 @@ static void sched_rq_cpu_starting(unsigned int cpu)
 
 int sched_cpu_starting(unsigned int cpu)
 {
+#ifdef CONFIG_SCHED_CORE
+	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+	struct rq *rq, *core_rq = NULL;
+	int i;
+
+	for_each_cpu(i, smt_mask) {
+		rq = cpu_rq(i);
+		if (rq->core && rq->core == rq)
+			core_rq = rq;
+	}
+
+	if (!core_rq)
+		core_rq = cpu_rq(cpu);
+
+	for_each_cpu(i, smt_mask) {
+		rq = cpu_rq(i);
+
+		WARN_ON_ONCE(rq->core && rq->core != core_rq);
+		rq->core = core_rq;
+	}
+#endif /* CONFIG_SCHED_CORE */
+
 	sched_rq_cpu_starting(cpu);
 	sched_tick_start(cpu);
 	return 0;
@@ -5893,6 +5994,9 @@ int sched_cpu_dying(unsigned int cpu)
 	update_max_interval();
 	nohz_balance_exit_idle(rq);
 	hrtick_clear(rq);
+#ifdef CONFIG_SCHED_CORE
+	rq->core = NULL;
+#endif
 	return 0;
 }
 #endif
@@ -6091,6 +6195,11 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SCHED_CORE
+		rq->core = NULL;
+		rq->core_enabled = 0;
+#endif
 	}
 
 	set_load_weight(&init_task, false);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a024dd80eeb3..eb38063221d0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -952,6 +952,12 @@ struct rq {
 	/* Must be inspected within a rcu lock section */
 	struct cpuidle_state	*idle_state;
 #endif
+
+#ifdef CONFIG_SCHED_CORE
+	/* per rq */
+	struct rq		*core;
+	unsigned int		core_enabled;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -979,11 +985,36 @@ static inline int cpu_of(struct rq *rq)
 #endif
 }
 
+#ifdef CONFIG_SCHED_CORE
+DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
+}
+
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+	if (sched_core_enabled(rq))
+		return &rq->core->__lock;
+
+	return &rq->__lock;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return false;
+}
+
 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
 	return &rq->__lock;
 }
 
+#endif /* CONFIG_SCHED_CORE */
+
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
 
-- 
2.17.1



* [RFC PATCH v3 11/16] sched: Basic tracking of matching tasks
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
                   ` (9 preceding siblings ...)
  2019-05-29 20:36 ` [RFC PATCH v3 10/16] sched: Core-wide rq->lock Vineeth Remanan Pillai
@ 2019-05-29 20:36 ` Vineeth Remanan Pillai
  2019-08-26 20:59   ` mark gross
  2019-05-29 20:36 ` [RFC PATCH v3 12/16] sched: A quick and dirty cgroup tagging interface Vineeth Remanan Pillai
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-05-29 20:36 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Vineeth Remanan Pillai

From: Peter Zijlstra <peterz@infradead.org>

Introduce task_struct::core_cookie as an opaque identifier for core
scheduling. When enabled, core scheduling will only allow matching
tasks to be on the core, where idle matches everything.

When task_struct::core_cookie is set (and core scheduling is enabled)
these tasks are indexed in a second RB-tree, first on cookie value
then on scheduling function, such that matching task selection always
finds the most eligible match.

NOTE: *shudder* at the overhead...

NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
is per class tracking of cookies and that just duplicates a lot of
stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
---

Changes in v3
-------------
- Refactored priority comparison code
- Fixed a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
-------------
- Improves the priority comparison logic between processes in
  different cpus.
  - Peter Zijlstra
  - Aaron Lu

---
 include/linux/sched.h |   8 ++-
 kernel/sched/core.c   | 146 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c   |  46 -------------
 kernel/sched/sched.h  |  55 ++++++++++++++++
 4 files changed, 208 insertions(+), 47 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1549584a1538..a4b39a28236f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -636,10 +636,16 @@ struct task_struct {
 	const struct sched_class	*sched_class;
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
+	struct sched_dl_entity		dl;
+
+#ifdef CONFIG_SCHED_CORE
+	struct rb_node			core_node;
+	unsigned long			core_cookie;
+#endif
+
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
 #endif
-	struct sched_dl_entity		dl;
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* List of struct preempt_notifier: */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b1ce33f9b106..112d70f2b1e5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -64,6 +64,141 @@ int sysctl_sched_rt_runtime = 950000;
 
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
+/* kernel prio, less is more */
+static inline int __task_prio(struct task_struct *p)
+{
+	if (p->sched_class == &stop_sched_class) /* trumps deadline */
+		return -2;
+
+	if (rt_prio(p->prio)) /* includes deadline */
+		return p->prio; /* [-1, 99] */
+
+	if (p->sched_class == &idle_sched_class)
+		return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
+
+	return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+}
+
+/*
+ * l(a,b)
+ * le(a,b) := !l(b,a)
+ * g(a,b)  := l(b,a)
+ * ge(a,b) := !l(a,b)
+ */
+
+/* real prio, less is less */
+static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+{
+
+	int pa = __task_prio(a), pb = __task_prio(b);
+
+	if (-pa < -pb)
+		return true;
+
+	if (-pb < -pa)
+		return false;
+
+	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
+		return !dl_time_before(a->dl.deadline, b->dl.deadline);
+
+	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
+		u64 vruntime = b->se.vruntime;
+
+		/*
+		 * Normalize the vruntime if tasks are in different cpus.
+		 */
+		if (task_cpu(a) != task_cpu(b)) {
+			vruntime -= task_cfs_rq(b)->min_vruntime;
+			vruntime += task_cfs_rq(a)->min_vruntime;
+		}
+
+		return !((s64)(a->se.vruntime - vruntime) <= 0);
+	}
+
+	return false;
+}
+
+static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
+{
+	if (a->core_cookie < b->core_cookie)
+		return true;
+
+	if (a->core_cookie > b->core_cookie)
+		return false;
+
+	/* flip prio, so high prio is leftmost */
+	if (prio_less(b, a))
+		return true;
+
+	return false;
+}
+
+static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+{
+	struct rb_node *parent, **node;
+	struct task_struct *node_task;
+
+	rq->core->core_task_seq++;
+
+	if (!p->core_cookie)
+		return;
+
+	node = &rq->core_tree.rb_node;
+	parent = *node;
+
+	while (*node) {
+		node_task = container_of(*node, struct task_struct, core_node);
+		parent = *node;
+
+		if (__sched_core_less(p, node_task))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&p->core_node, parent, node);
+	rb_insert_color(&p->core_node, &rq->core_tree);
+}
+
+static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->core->core_task_seq++;
+
+	if (!p->core_cookie)
+		return;
+
+	rb_erase(&p->core_node, &rq->core_tree);
+}
+
+/*
+ * Find left-most (aka, highest priority) task matching @cookie.
+ */
+static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
+{
+	struct rb_node *node = rq->core_tree.rb_node;
+	struct task_struct *node_task, *match;
+
+	/*
+	 * The idle task always matches any cookie!
+	 */
+	match = idle_sched_class.pick_task(rq);
+
+	while (node) {
+		node_task = container_of(node, struct task_struct, core_node);
+
+		if (cookie < node_task->core_cookie) {
+			node = node->rb_left;
+		} else if (cookie > node_task->core_cookie) {
+			node = node->rb_right;
+		} else {
+			match = node_task;
+			node = node->rb_left;
+		}
+	}
+
+	return match;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -122,6 +257,11 @@ void sched_core_put(void)
 	mutex_unlock(&sched_core_mutex);
 }
 
+#else /* !CONFIG_SCHED_CORE */
+
+static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
+static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -826,6 +966,9 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (sched_core_enabled(rq))
+		sched_core_enqueue(rq, p);
+
 	if (!(flags & ENQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
@@ -839,6 +982,9 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 
 static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (sched_core_enabled(rq))
+		sched_core_dequeue(rq, p);
+
 	if (!(flags & DEQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02e5dfb85e7d..d8a107aea69b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -248,33 +248,11 @@ const struct sched_class fair_sched_class;
  */
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static inline struct task_struct *task_of(struct sched_entity *se)
-{
-	SCHED_WARN_ON(!entity_is_task(se));
-	return container_of(se, struct task_struct, se);
-}
 
 /* Walk up scheduling entities hierarchy */
 #define for_each_sched_entity(se) \
 		for (; se; se = se->parent)
 
-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
-	return p->se.cfs_rq;
-}
-
-/* runqueue on which this entity is (to be) queued */
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
-	return se->cfs_rq;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
-	return grp->my_q;
-}
-
 static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
@@ -422,33 +400,9 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #else	/* !CONFIG_FAIR_GROUP_SCHED */
 
-static inline struct task_struct *task_of(struct sched_entity *se)
-{
-	return container_of(se, struct task_struct, se);
-}
-
 #define for_each_sched_entity(se) \
 		for (; se; se = NULL)
 
-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
-	return &task_rq(p)->cfs;
-}
-
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
-	struct task_struct *p = task_of(se);
-	struct rq *rq = task_rq(p);
-
-	return &rq->cfs;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
-	return NULL;
-}
-
 static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	return true;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eb38063221d0..0cbcfb6c8ee4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -957,6 +957,10 @@ struct rq {
 	/* per rq */
 	struct rq		*core;
 	unsigned int		core_enabled;
+	struct rb_root		core_tree;
+
+	/* shared state */
+	unsigned int		core_task_seq;
 #endif
 };
 
@@ -1036,6 +1040,57 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+	SCHED_WARN_ON(!entity_is_task(se));
+	return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+	return p->se.cfs_rq;
+}
+
+/* runqueue on which this entity is (to be) queued */
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+	return se->cfs_rq;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+	return grp->my_q;
+}
+
+#else
+
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+	return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+	return &task_rq(p)->cfs;
+}
+
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+	struct task_struct *p = task_of(se);
+	struct rq *rq = task_rq(p);
+
+	return &rq->cfs;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+	return NULL;
+}
+#endif
+
 extern void update_rq_clock(struct rq *rq);
 
 static inline u64 __rq_clock_broken(struct rq *rq)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [RFC PATCH v3 12/16] sched: A quick and dirty cgroup tagging interface
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
                   ` (10 preceding siblings ...)
  2019-05-29 20:36 ` [RFC PATCH v3 11/16] sched: Basic tracking of matching tasks Vineeth Remanan Pillai
@ 2019-05-29 20:36 ` Vineeth Remanan Pillai
  2019-05-29 20:36 ` [RFC PATCH v3 13/16] sched: Add core wide task selection and scheduling Vineeth Remanan Pillai
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-05-29 20:36 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Vineeth Remanan Pillai

From: Peter Zijlstra <peterz@infradead.org>

Marks all tasks in a cgroup as matching for core-scheduling.
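
The tag maps onto per-task cookies: tagging a group stamps the
task_group pointer into every member task's core_cookie, and two tasks
may share a core only when their cookies are equal (untagged tasks
carry cookie 0). A minimal userspace sketch of that matching rule; the
struct and helper names are illustrative, not the kernel's:

```c
#include <assert.h>

struct task_group {
	int tagged;
};

struct task {
	unsigned long core_cookie;	/* 0 == untagged */
};

/* Tagging a group stamps its address into a member task's cookie. */
static void tag_task(struct task *t, struct task_group *tg)
{
	t->core_cookie = tg->tagged ? (unsigned long)tg : 0UL;
}

/* Two tasks may share a core iff their cookies are equal. */
static int cookies_match(const struct task *a, const struct task *b)
{
	return a->core_cookie == b->core_cookie;
}
```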

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
---

Changes in v3
-------------
- Fixes the refcount management when deleting a tagged cgroup.
  - Julien Desfossez

---
 kernel/sched/core.c  | 78 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  4 +++
 2 files changed, 82 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 112d70f2b1e5..3164c6b33553 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6658,6 +6658,15 @@ static void sched_change_group(struct task_struct *tsk, int type)
 	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
 			  struct task_group, css);
 	tg = autogroup_task_group(tsk, tg);
+
+#ifdef CONFIG_SCHED_CORE
+	if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
+		tsk->core_cookie = 0UL;
+
+	if (tg->tagged /* && !tsk->core_cookie ? */)
+		tsk->core_cookie = (unsigned long)tg;
+#endif
+
 	tsk->sched_task_group = tg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -6737,6 +6746,18 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
 	return 0;
 }
 
+static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+#ifdef CONFIG_SCHED_CORE
+	struct task_group *tg = css_tg(css);
+
+	if (tg->tagged) {
+		sched_core_put();
+		tg->tagged = 0;
+	}
+#endif
+}
+
 static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
@@ -7117,6 +7138,46 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_SCHED_CORE
+static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+
+	return !!tg->tagged;
+}
+
+static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	struct task_group *tg = css_tg(css);
+	struct css_task_iter it;
+	struct task_struct *p;
+
+	if (val > 1)
+		return -ERANGE;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -EINVAL;
+
+	if (tg->tagged == !!val)
+		return 0;
+
+	tg->tagged = !!val;
+
+	if (!!val)
+		sched_core_get();
+
+	css_task_iter_start(css, 0, &it);
+	while ((p = css_task_iter_next(&it)))
+		p->core_cookie = !!val ? (unsigned long)tg : 0UL;
+	css_task_iter_end(&it);
+
+	if (!val)
+		sched_core_put();
+
+	return 0;
+}
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -7152,6 +7213,14 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_rt_period_read_uint,
 		.write_u64 = cpu_rt_period_write_uint,
 	},
+#endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "tag",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_read_u64,
+		.write_u64 = cpu_core_tag_write_u64,
+	},
 #endif
 	{ }	/* Terminate */
 };
@@ -7319,6 +7388,14 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_max_show,
 		.write = cpu_max_write,
 	},
+#endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "tag",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_read_u64,
+		.write_u64 = cpu_core_tag_write_u64,
+	},
 #endif
 	{ }	/* terminate */
 };
@@ -7326,6 +7403,7 @@ static struct cftype cpu_files[] = {
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_online	= cpu_cgroup_css_online,
+	.css_offline	= cpu_cgroup_css_offline,
 	.css_released	= cpu_cgroup_css_released,
 	.css_free	= cpu_cgroup_css_free,
 	.css_extra_stat_show = cpu_extra_stat_show,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0cbcfb6c8ee4..bd9b473ebde2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -363,6 +363,10 @@ struct cfs_bandwidth {
 struct task_group {
 	struct cgroup_subsys_state css;
 
+#ifdef CONFIG_SCHED_CORE
+	int			tagged;
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* schedulable entities of this group on each CPU */
 	struct sched_entity	**se;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [RFC PATCH v3 13/16] sched: Add core wide task selection and scheduling.
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
                   ` (11 preceding siblings ...)
  2019-05-29 20:36 ` [RFC PATCH v3 12/16] sched: A quick and dirty cgroup tagging interface Vineeth Remanan Pillai
@ 2019-05-29 20:36 ` Vineeth Remanan Pillai
  2019-06-07 23:36   ` Pawan Gupta
  2019-05-29 20:36 ` [RFC PATCH v3 14/16] sched/fair: Add a few assertions Vineeth Remanan Pillai
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-05-29 20:36 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Vineeth Remanan Pillai

From: Peter Zijlstra <peterz@infradead.org>

Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective of which logical
CPU does the reschedule).

NOTE: there is still potential for sibling rivalry.
NOTE: this is far too complicated, but thus far I've failed to
      simplify it further.
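
The core-wide invariant can be sketched in a toy model: of all
candidate tasks across the SMT siblings, the highest-priority one (max)
runs, and any sibling whose own candidate carries a different cookie
must not run it concurrently. (The real code is more permissive: such a
sibling first tries a lower-priority task with the matching cookie via
sched_core_find before going idle.) Illustrative sketch, not the kernel
code:

```c
#include <assert.h>

struct cand {
	int prio;		/* larger == higher priority */
	unsigned long cookie;	/* 0 == untagged */
};

/*
 * Pick for each sibling: the globally highest-priority candidate (max)
 * runs; any sibling whose candidate's cookie differs from max's is
 * forced idle. Returns the number of forced-idle siblings.
 */
static int core_pick(const struct cand *c, int n, int *runs)
{
	int i, max = 0, idled = 0;

	for (i = 1; i < n; i++)
		if (c[i].prio > c[max].prio)
			max = i;

	for (i = 0; i < n; i++) {
		runs[i] = (c[i].cookie == c[max].cookie);
		if (!runs[i])
			idled++;
	}
	return idled;
}
```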

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
---

Changes in v3
-------------
- Fixes the issue of a sibling picking an incompatible task.
  - Aaron Lu
  - Peter Zijlstra
  - Vineeth Pillai
  - Julien Desfossez

---
 kernel/sched/core.c  | 271 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   6 +-
 2 files changed, 274 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3164c6b33553..e25811b81562 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3556,7 +3556,7 @@ static inline void schedule_debug(struct task_struct *prev)
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	const struct sched_class *class;
 	struct task_struct *p;
@@ -3601,6 +3601,268 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	BUG();
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
+{
+	return is_idle_task(a) || (a->core_cookie == cookie);
+}
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+	if (is_idle_task(a) || is_idle_task(b))
+		return true;
+
+	return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+/*
+ * Returns
+ * - NULL if there is no runnable task for this class.
+ * - the highest-priority task on this runqueue if it matches
+ *   rq->core->core_cookie or its priority is greater than max.
+ * - the idle task otherwise.
+ */
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+{
+	struct task_struct *class_pick, *cookie_pick;
+	unsigned long cookie = rq->core->core_cookie;
+
+	class_pick = class->pick_task(rq);
+	if (!class_pick)
+		return NULL;
+
+	if (!cookie) {
+		/*
+		 * If class_pick is tagged, return it only if it has
+		 * higher priority than max.
+		 */
+		if (max && class_pick->core_cookie &&
+		    prio_less(class_pick, max))
+			return idle_sched_class.pick_task(rq);
+
+		return class_pick;
+	}
+
+	/*
+	 * If class_pick is idle or matches cookie, return early.
+	 */
+	if (cookie_equals(class_pick, cookie))
+		return class_pick;
+
+	cookie_pick = sched_core_find(rq, cookie);
+
+	/*
+	 * If class_pick has higher priority than both max and cookie_pick, it
+	 * is the highest-priority task on the core (so far) and must be
+	 * selected; otherwise go with cookie_pick to satisfy the constraint.
+	 */
+	if (prio_less(cookie_pick, class_pick) &&
+	    (!max || prio_less(max, class_pick)))
+		return class_pick;
+
+	return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *next, *max = NULL;
+	const struct sched_class *class;
+	const struct cpumask *smt_mask;
+	int i, j, cpu;
+	bool need_sync = false;
+
+	if (!sched_core_enabled(rq))
+		return __pick_next_task(rq, prev, rf);
+
+	/*
+	 * If there were no {en,de}queues since we picked (IOW, the task
+	 * pointers are all still valid), and we haven't scheduled the last
+	 * pick yet, do so now.
+	 */
+	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+	    rq->core->core_pick_seq != rq->core_sched_seq) {
+		WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+		next = rq->core_pick;
+		if (next != prev) {
+			put_prev_task(rq, prev);
+			set_next_task(rq, next);
+		}
+		return next;
+	}
+
+	prev->sched_class->put_prev_task(rq, prev, rf);
+	if (!rq->nr_running)
+		newidle_balance(rq, rf);
+
+	cpu = cpu_of(rq);
+	smt_mask = cpu_smt_mask(cpu);
+
+	/*
+	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+	 *
+	 * @task_seq guards the task state ({en,de}queues)
+	 * @pick_seq is the @task_seq we did a selection on
+	 * @sched_seq is the @pick_seq we scheduled
+	 *
+	 * However, preemptions can cause multiple picks on the same task set.
+	 * 'Fix' this by also increasing @task_seq for every pick.
+	 */
+	rq->core->core_task_seq++;
+	need_sync = !!rq->core->core_cookie;
+
+	/* reset state */
+	rq->core->core_cookie = 0UL;
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		rq_i->core_pick = NULL;
+
+		if (rq_i->core_forceidle) {
+			need_sync = true;
+			rq_i->core_forceidle = false;
+		}
+
+		if (i != cpu)
+			update_rq_clock(rq_i);
+	}
+
+	/*
+	 * Try and select tasks for each sibling in descending sched_class
+	 * order.
+	 */
+	for_each_class(class) {
+again:
+		for_each_cpu_wrap(i, smt_mask, cpu) {
+			struct rq *rq_i = cpu_rq(i);
+			struct task_struct *p;
+
+			if (cpu_is_offline(i))
+				continue;
+
+			if (rq_i->core_pick)
+				continue;
+
+			/*
+			 * If this sibling doesn't yet have a suitable task to
+			 * run; ask for the most eligible task, given the
+			 * highest priority task already selected for this
+			 * core.
+			 */
+			p = pick_task(rq_i, class, max);
+			if (!p) {
+				/*
+				 * If there weren't any cookies, we don't need
+				 * to bother with the other siblings.
+				 */
+				if (i == cpu && !need_sync)
+					goto next_class;
+
+				continue;
+			}
+
+			/*
+			 * Optimize the 'normal' case where there aren't any
+			 * cookies and we don't need to sync up.
+			 */
+			if (i == cpu && !need_sync && !p->core_cookie) {
+				next = p;
+				goto done;
+			}
+
+			rq_i->core_pick = p;
+
+			/*
+			 * If this new candidate is of higher priority than the
+			 * previous, and they're incompatible, we need to wipe
+			 * the slate and start over. pick_task makes sure that
+			 * p's priority is more than max if it doesn't match
+			 * max's cookie.
+			 *
+			 * NOTE: this is a linear max-filter and is thus bounded
+			 * in execution time.
+			 */
+			if (!max || !cookie_match(max, p)) {
+				struct task_struct *old_max = max;
+
+				rq->core->core_cookie = p->core_cookie;
+				max = p;
+
+				if (old_max) {
+					for_each_cpu(j, smt_mask) {
+						if (j == i)
+							continue;
+
+						cpu_rq(j)->core_pick = NULL;
+					}
+					goto again;
+				} else {
+					/*
+					 * Once we select a task for a cpu, we
+					 * should not be doing an unconstrained
+					 * pick because it might starve a task
+					 * on a forced idle cpu.
+					 */
+					need_sync = true;
+				}
+
+			}
+		}
+next_class:;
+	}
+
+	rq->core->core_pick_seq = rq->core->core_task_seq;
+	next = rq->core_pick;
+	rq->core_sched_seq = rq->core->core_pick_seq;
+
+	/*
+	 * Reschedule siblings
+	 *
+	 * NOTE: L1TF -- at this point we're no longer running the old task and
+	 * sending an IPI (below) ensures the sibling will no longer be running
+	 * their task. This ensures there is no inter-sibling overlap between
+	 * non-matching user state.
+	 */
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		if (cpu_is_offline(i))
+			continue;
+
+		WARN_ON_ONCE(!rq_i->core_pick);
+
+		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
+			rq->core_forceidle = true;
+
+		if (i == cpu)
+			continue;
+
+		if (rq_i->curr != rq_i->core_pick)
+			resched_curr(rq_i);
+
+		/* Did we break L1TF mitigation requirements? */
+		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+	}
+
+done:
+	set_next_task(rq, next);
+	return next;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	return __pick_next_task(rq, prev, rf);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -5870,7 +6132,7 @@ static void migrate_tasks(struct rq *dead_rq, struct rq_flags *rf)
 		/*
 		 * pick_next_task() assumes pinned rq->lock:
 		 */
-		next = pick_next_task(rq, &fake_task, rf);
+		next = __pick_next_task(rq, &fake_task, rf);
 		BUG_ON(!next);
 		put_prev_task(rq, next);
 
@@ -6344,7 +6606,12 @@ void __init sched_init(void)
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = NULL;
+		rq->core_pick = NULL;
 		rq->core_enabled = 0;
+		rq->core_tree = RB_ROOT;
+		rq->core_forceidle = false;
+
+		rq->core_cookie = 0UL;
 #endif
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bd9b473ebde2..cd8ced09826f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -960,11 +960,16 @@ struct rq {
 #ifdef CONFIG_SCHED_CORE
 	/* per rq */
 	struct rq		*core;
+	struct task_struct	*core_pick;
 	unsigned int		core_enabled;
+	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
+	bool			core_forceidle;
 
 	/* shared state */
 	unsigned int		core_task_seq;
+	unsigned int		core_pick_seq;
+	unsigned long		core_cookie;
 #endif
 };
 
@@ -1821,7 +1826,6 @@ static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
-	WARN_ON_ONCE(rq->curr != next);
 	next->sched_class->set_next_task(rq, next);
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [RFC PATCH v3 14/16] sched/fair: Add a few assertions
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
                   ` (12 preceding siblings ...)
  2019-05-29 20:36 ` [RFC PATCH v3 13/16] sched: Add core wide task selection and scheduling Vineeth Remanan Pillai
@ 2019-05-29 20:36 ` Vineeth Remanan Pillai
  2019-05-29 20:36 ` [RFC PATCH v3 15/16] sched: Trivial forced-newidle balancer Vineeth Remanan Pillai
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-05-29 20:36 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d8a107aea69b..26d29126d6a5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6192,6 +6192,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i, recent_used_cpu;
 
+	/*
+	 * per-cpu select_idle_mask usage
+	 */
+	lockdep_assert_irqs_disabled();
+
 	if (available_idle_cpu(target))
 		return target;
 
@@ -6619,8 +6624,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
  * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
  *
  * Returns the target CPU number.
- *
- * preempt must be disabled.
  */
 static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
@@ -6631,6 +6634,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	int want_affine = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
+	/*
+	 * required for stable ->cpus_allowed
+	 */
+	lockdep_assert_held(&p->pi_lock);
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [RFC PATCH v3 15/16] sched: Trivial forced-newidle balancer
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
                   ` (13 preceding siblings ...)
  2019-05-29 20:36 ` [RFC PATCH v3 14/16] sched/fair: Add a few assertions Vineeth Remanan Pillai
@ 2019-05-29 20:36 ` Vineeth Remanan Pillai
  2019-05-29 20:36 ` [RFC PATCH v3 16/16] sched: Debug bits Vineeth Remanan Pillai
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-05-29 20:36 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

When a sibling is forced idle to match the core cookie, search for
matching tasks to fill the core.
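
The selection rule can be sketched as: scan a source CPU's tasks for
the first one that matches the forced-idle core's cookie, is allowed to
run here, and is not already running elsewhere. (This ignores the
locking and the core_occupation check done by try_steal_cookie; the
names below are illustrative, not the kernel's.)

```c
#include <assert.h>

struct stask {
	unsigned long cookie;
	int allowed_here;	/* stands in for cpus_allowed */
	int running;		/* already current/picked on its CPU */
};

/*
 * Return the index of the first task on @src that matches @cookie, may
 * run on this CPU and isn't already running; -1 if there is none.
 */
static int steal_candidate(const struct stask *src, int n,
			   unsigned long cookie)
{
	int i;

	for (i = 0; i < n; i++) {
		if (src[i].cookie != cookie)
			continue;
		if (!src[i].allowed_here || src[i].running)
			continue;
		return i;
	}
	return -1;
}
```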

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 131 +++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/idle.c   |   1 +
 kernel/sched/sched.h  |   6 ++
 4 files changed, 138 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a4b39a28236f..1a309e8546cd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -641,6 +641,7 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
 	struct rb_node			core_node;
 	unsigned long			core_cookie;
+	unsigned int			core_occupation;
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e25811b81562..5b8223c9a723 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -199,6 +199,21 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
 	return match;
 }
 
+static struct task_struct *sched_core_next(struct task_struct *p, unsigned long cookie)
+{
+	struct rb_node *node = &p->core_node;
+
+	node = rb_next(node);
+	if (!node)
+		return NULL;
+
+	p = container_of(node, struct task_struct, core_node);
+	if (p->core_cookie != cookie)
+		return NULL;
+
+	return p;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -3672,7 +3687,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	struct task_struct *next, *max = NULL;
 	const struct sched_class *class;
 	const struct cpumask *smt_mask;
-	int i, j, cpu;
+	int i, j, cpu, occ = 0;
 	bool need_sync = false;
 
 	if (!sched_core_enabled(rq))
@@ -3774,6 +3789,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				goto done;
 			}
 
+			if (!is_idle_task(p))
+				occ++;
+
 			rq_i->core_pick = p;
 
 			/*
@@ -3799,6 +3817,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 						cpu_rq(j)->core_pick = NULL;
 					}
+					occ = 1;
 					goto again;
 				} else {
 					/*
@@ -3838,6 +3857,8 @@ next_class:;
 		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
 			rq->core_forceidle = true;
 
+		rq_i->core_pick->core_occupation = occ;
+
 		if (i == cpu)
 			continue;
 
@@ -3853,6 +3874,114 @@ next_class:;
 	return next;
 }
 
+static bool try_steal_cookie(int this, int that)
+{
+	struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
+	struct task_struct *p;
+	unsigned long cookie;
+	bool success = false;
+
+	local_irq_disable();
+	double_rq_lock(dst, src);
+
+	cookie = dst->core->core_cookie;
+	if (!cookie)
+		goto unlock;
+
+	if (dst->curr != dst->idle)
+		goto unlock;
+
+	p = sched_core_find(src, cookie);
+	if (p == src->idle)
+		goto unlock;
+
+	do {
+		if (p == src->core_pick || p == src->curr)
+			goto next;
+
+		if (!cpumask_test_cpu(this, &p->cpus_allowed))
+			goto next;
+
+		if (p->core_occupation > dst->idle->core_occupation)
+			goto next;
+
+		p->on_rq = TASK_ON_RQ_MIGRATING;
+		deactivate_task(src, p, 0);
+		set_task_cpu(p, this);
+		activate_task(dst, p, 0);
+		p->on_rq = TASK_ON_RQ_QUEUED;
+
+		resched_curr(dst);
+
+		success = true;
+		break;
+
+next:
+		p = sched_core_next(p, cookie);
+	} while (p);
+
+unlock:
+	double_rq_unlock(dst, src);
+	local_irq_enable();
+
+	return success;
+}
+
+static bool steal_cookie_task(int cpu, struct sched_domain *sd)
+{
+	int i;
+
+	for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
+		if (i == cpu)
+			continue;
+
+		if (need_resched())
+			break;
+
+		if (try_steal_cookie(cpu, i))
+			return true;
+	}
+
+	return false;
+}
+
+static void sched_core_balance(struct rq *rq)
+{
+	struct sched_domain *sd;
+	int cpu = cpu_of(rq);
+
+	rcu_read_lock();
+	raw_spin_unlock_irq(rq_lockp(rq));
+	for_each_domain(cpu, sd) {
+		if (!(sd->flags & SD_LOAD_BALANCE))
+			break;
+
+		if (need_resched())
+			break;
+
+		if (steal_cookie_task(cpu, sd))
+			break;
+	}
+	raw_spin_lock_irq(rq_lockp(rq));
+	rcu_read_unlock();
+}
+
+static DEFINE_PER_CPU(struct callback_head, core_balance_head);
+
+void queue_core_balance(struct rq *rq)
+{
+	if (!sched_core_enabled(rq))
+		return;
+
+	if (!rq->core->core_cookie)
+		return;
+
+	if (!rq->nr_running) /* not forced idle */
+		return;
+
+	queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance);
+}
+
 #else /* !CONFIG_SCHED_CORE */
 
 static struct task_struct *
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index e7f38da60373..44decdcccba1 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -387,6 +387,7 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next)
 {
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
+	queue_core_balance(rq);
 }
 
 static struct task_struct *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cd8ced09826f..e91c188a452c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1014,6 +1014,8 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+extern void queue_core_balance(struct rq *rq);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -1026,6 +1028,10 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+static inline void queue_core_balance(struct rq *rq)
+{
+}
+
 #endif /* CONFIG_SCHED_CORE */
 
 #ifdef CONFIG_SCHED_SMT
-- 
2.17.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [RFC PATCH v3 16/16] sched: Debug bits...
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
                   ` (14 preceding siblings ...)
  2019-05-29 20:36 ` [RFC PATCH v3 15/16] sched: Trivial forced-newidle balancer Vineeth Remanan Pillai
@ 2019-05-29 20:36 ` Vineeth Remanan Pillai
  2019-05-29 21:02   ` Peter Oskolkov
  2019-05-30 14:04 ` [RFC PATCH v3 00/16] Core scheduling v3 Aubrey Li
  2019-08-27 21:14 ` Matthew Garrett
  17 siblings, 1 reply; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-05-29 20:36 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

Not-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b8223c9a723..90655c9ad937 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -92,6 +92,10 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 
 	int pa = __task_prio(a), pb = __task_prio(b);
 
+	trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+		     a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+		     b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
 	if (-pa < -pb)
 		return true;
 
@@ -246,6 +250,8 @@ static void __sched_core_enable(void)
 
 	static_branch_enable(&__sched_core_enabled);
 	stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+	printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
@@ -254,6 +260,8 @@ static void __sched_core_disable(void)
 
 	stop_machine(__sched_core_stopper, (void *)false, NULL);
 	static_branch_disable(&__sched_core_enabled);
+
+	printk("core sched disabled\n");
 }
 
 void sched_core_get(void)
@@ -3707,6 +3715,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			put_prev_task(rq, prev);
 			set_next_task(rq, next);
 		}
+
+		trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+			     rq->core->core_task_seq,
+			     rq->core->core_pick_seq,
+			     rq->core_sched_seq,
+			     next->comm, next->pid,
+			     next->core_cookie);
+
 		return next;
 	}
 
@@ -3786,6 +3802,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 */
 			if (i == cpu && !need_sync && !p->core_cookie) {
 				next = p;
+				trace_printk("unconstrained pick: %s/%d %lx\n",
+					     next->comm, next->pid, next->core_cookie);
+
 				goto done;
 			}
 
@@ -3794,6 +3813,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 			rq_i->core_pick = p;
 
+			trace_printk("cpu(%d): selected: %s/%d %lx\n",
+				     i, p->comm, p->pid, p->core_cookie);
+
 			/*
 			 * If this new candidate is of higher priority than the
 			 * previous, and they're incompatible, we need to wipe
@@ -3810,6 +3832,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				rq->core->core_cookie = p->core_cookie;
 				max = p;
 
+				trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie);
+
 				if (old_max) {
 					for_each_cpu(j, smt_mask) {
 						if (j == i)
@@ -3837,6 +3861,7 @@ next_class:;
 	rq->core->core_pick_seq = rq->core->core_task_seq;
 	next = rq->core_pick;
 	rq->core_sched_seq = rq->core->core_pick_seq;
+	trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);
 
 	/*
 	 * Reschedule siblings
@@ -3862,11 +3887,20 @@ next_class:;
 		if (i == cpu)
 			continue;
 
-		if (rq_i->curr != rq_i->core_pick)
+		if (rq_i->curr != rq_i->core_pick) {
+			trace_printk("IPI(%d)\n", i);
 			resched_curr(rq_i);
+		}
 
 		/* Did we break L1TF mitigation requirements? */
-		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+		if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+			trace_printk("[%d]: cookie mismatch. %s/%d/0x%lx/0x%lx\n",
+				     rq_i->cpu, rq_i->core_pick->comm,
+				     rq_i->core_pick->pid,
+				     rq_i->core_pick->core_cookie,
+				     rq_i->core->core_cookie);
+			WARN_ON_ONCE(1);
+		}
 	}
 
 done:
@@ -3905,6 +3939,10 @@ static bool try_steal_cookie(int this, int that)
 		if (p->core_occupation > dst->idle->core_occupation)
 			goto next;
 
+		trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+			     p->comm, p->pid, that, this,
+			     p->core_occupation, dst->idle->core_occupation, cookie);
+
 		p->on_rq = TASK_ON_RQ_MIGRATING;
 		deactivate_task(src, p, 0);
 		set_task_cpu(p, this);
@@ -6501,6 +6539,8 @@ int sched_cpu_starting(unsigned int cpu)
 		WARN_ON_ONCE(rq->core && rq->core != core_rq);
 		rq->core = core_rq;
 	}
+
+	printk("core: %d -> %d\n", cpu, cpu_of(core_rq));
 #endif /* CONFIG_SCHED_CORE */
 
 	sched_rq_cpu_starting(cpu);
-- 
2.17.1
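(Editorial note: the trace_printk() calls this debug patch adds write to the ftrace ring buffer, not the kernel log. Assuming a standard tracefs mount, the output can be watched with something like the following; paths vary by distro and this is not part of the patch itself.)

```shell
# trace_printk() output lands in the ftrace ring buffer, not dmesg.
# tracefs is usually mounted at /sys/kernel/tracing, or at
# /sys/kernel/debug/tracing on older setups (needs root).
cat /sys/kernel/tracing/trace_pipe | grep -E 'pick|max:|IPI|cookie'
```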


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 16/16] sched: Debug bits...
  2019-05-29 20:36 ` [RFC PATCH v3 16/16] sched: Debug bits Vineeth Remanan Pillai
@ 2019-05-29 21:02   ` Peter Oskolkov
  0 siblings, 0 replies; 161+ messages in thread
From: Peter Oskolkov @ 2019-05-29 21:02 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, Linux Kernel Mailing List,
	subhra.mazumdar, fweisbec, keescook, kerrnel, Phil Auld,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On Wed, May 29, 2019 at 1:37 PM Vineeth Remanan Pillai
<vpillai@digitalocean.com> wrote:
>
> From: Peter Zijlstra <peterz@infradead.org>
>
> Not-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

No commit message, not-signed-off-by...

> [ remainder of the patch quoted in full; snipped ]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
                   ` (15 preceding siblings ...)
  2019-05-29 20:36 ` [RFC PATCH v3 16/16] sched: Debug bits Vineeth Remanan Pillai
@ 2019-05-30 14:04 ` Aubrey Li
  2019-05-30 14:17   ` Julien Desfossez
  2019-05-31  3:01   ` Aaron Lu
  2019-08-27 21:14 ` Matthew Garrett
  17 siblings, 2 replies; 161+ messages in thread
From: Aubrey Li @ 2019-05-30 14:04 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Subhra Mazumdar,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On Thu, May 30, 2019 at 4:36 AM Vineeth Remanan Pillai
<vpillai@digitalocean.com> wrote:
>
> Third iteration of the Core-Scheduling feature.
>
> This version fixes mostly correctness related issues in v2 and
> addresses performance issues. Also, addressed some crashes related
> to cgroups and cpu hotplugging.
>
> We have tested and verified that incompatible processes are not
> selected during schedule. In terms of performance, the impact
> depends on the workload:
> - on CPU intensive applications that use all the logical CPUs with
>   SMT enabled, enabling core scheduling performs better than nosmt.
> - on mixed workloads with considerable io compared to cpu usage,
>   nosmt seems to perform better than core scheduling.

My testing scripts cannot complete on this version. I found that the
number of cpu utilization report entries didn't reach my minimal requirement.
Then I wrote a simple script to verify.
====================
$ cat test.sh
#!/bin/sh

for i in `seq 1 10`
do
    echo `date`, $i
    sleep 1
done
====================
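(A hypothetical variant of the script above, not from the original report: printing the measured gap between iterations makes the stalls show up as numbers rather than having to be read off wall-clock timestamps.)

```shell
#!/bin/sh
# Hypothetical variant of the test script: report how long each
# iteration actually took, so scheduling delays are explicit.
prev=$(date +%s)
for i in $(seq 1 3)
do
    sleep 1
    now=$(date +%s)
    # With no scheduling delay this prints "N: 1s"; stalls show up
    # as larger gaps.
    echo "$i: $((now - prev))s since last iteration"
    prev=$now
done
```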

Normally it works as below:

Thu May 30 14:13:40 CST 2019, 1
Thu May 30 14:13:41 CST 2019, 2
Thu May 30 14:13:42 CST 2019, 3
Thu May 30 14:13:43 CST 2019, 4
Thu May 30 14:13:44 CST 2019, 5
Thu May 30 14:13:45 CST 2019, 6
Thu May 30 14:13:46 CST 2019, 7
Thu May 30 14:13:47 CST 2019, 8
Thu May 30 14:13:48 CST 2019, 9
Thu May 30 14:13:49 CST 2019, 10

When the system was running 32 sysbench threads and
32 gemmbench threads, it worked as below (the system
had ~38% idle time):
Thu May 30 14:14:20 CST 2019, 1
Thu May 30 14:14:21 CST 2019, 2
Thu May 30 14:14:22 CST 2019, 3
Thu May 30 14:14:24 CST 2019, 4 <=======x=
Thu May 30 14:14:25 CST 2019, 5
Thu May 30 14:14:26 CST 2019, 6
Thu May 30 14:14:28 CST 2019, 7 <=======x=
Thu May 30 14:14:29 CST 2019, 8
Thu May 30 14:14:31 CST 2019, 9 <=======x=
Thu May 30 14:14:34 CST 2019, 10 <=======x=

And it got worse in the 64/64 case, even though the system
still had ~3% idle time:
Thu May 30 14:26:40 CST 2019, 1
Thu May 30 14:26:46 CST 2019, 2
Thu May 30 14:26:53 CST 2019, 3
Thu May 30 14:27:01 CST 2019, 4
Thu May 30 14:27:03 CST 2019, 5
Thu May 30 14:27:11 CST 2019, 6
Thu May 30 14:27:31 CST 2019, 7
Thu May 30 14:27:32 CST 2019, 8
Thu May 30 14:27:41 CST 2019, 9
Thu May 30 14:27:56 CST 2019, 10

Any thoughts?

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-05-30 14:04 ` [RFC PATCH v3 00/16] Core scheduling v3 Aubrey Li
@ 2019-05-30 14:17   ` Julien Desfossez
  2019-05-31  4:55     ` Aubrey Li
  2019-05-31  3:01   ` Aaron Lu
  1 sibling, 1 reply; 161+ messages in thread
From: Julien Desfossez @ 2019-05-30 14:17 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing, Subhra Mazumdar,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On 30-May-2019 10:04:39 PM, Aubrey Li wrote:
> On Thu, May 30, 2019 at 4:36 AM Vineeth Remanan Pillai
> <vpillai@digitalocean.com> wrote:
> >
> > Third iteration of the Core-Scheduling feature.
> >
> > This version fixes mostly correctness related issues in v2 and
> > addresses performance issues. Also, addressed some crashes related
> > to cgroups and cpu hotplugging.
> >
> > We have tested and verified that incompatible processes are not
> > selected during schedule. In terms of performance, the impact
> > depends on the workload:
> > - on CPU intensive applications that use all the logical CPUs with
> >   SMT enabled, enabling core scheduling performs better than nosmt.
> > - on mixed workloads with considerable io compared to cpu usage,
> >   nosmt seems to perform better than core scheduling.
> 
> My testing scripts can not be completed on this version. I figured out the
> number of cpu utilization report entry didn't reach my minimal requirement.
> Then I wrote a simple script to verify.
> ====================
> $ cat test.sh
> #!/bin/sh
> 
> for i in `seq 1 10`
> do
>     echo `date`, $i
>     sleep 1
> done
> ====================
> 
> Normally it works as below:
> 
> Thu May 30 14:13:40 CST 2019, 1
> Thu May 30 14:13:41 CST 2019, 2
> Thu May 30 14:13:42 CST 2019, 3
> Thu May 30 14:13:43 CST 2019, 4
> Thu May 30 14:13:44 CST 2019, 5
> Thu May 30 14:13:45 CST 2019, 6
> Thu May 30 14:13:46 CST 2019, 7
> Thu May 30 14:13:47 CST 2019, 8
> Thu May 30 14:13:48 CST 2019, 9
> Thu May 30 14:13:49 CST 2019, 10
> 
> When the system was running 32 sysbench threads and
> 32 gemmbench threads, it worked as below(the system
> has ~38% idle time)
> Thu May 30 14:14:20 CST 2019, 1
> Thu May 30 14:14:21 CST 2019, 2
> Thu May 30 14:14:22 CST 2019, 3
> Thu May 30 14:14:24 CST 2019, 4 <=======x=
> Thu May 30 14:14:25 CST 2019, 5
> Thu May 30 14:14:26 CST 2019, 6
> Thu May 30 14:14:28 CST 2019, 7 <=======x=
> Thu May 30 14:14:29 CST 2019, 8
> Thu May 30 14:14:31 CST 2019, 9 <=======x=
> Thu May 30 14:14:34 CST 2019, 10 <=======x=
> 
> And it got worse when the system was running 64/64 case,
> the system still had ~3% idle time
> Thu May 30 14:26:40 CST 2019, 1
> Thu May 30 14:26:46 CST 2019, 2
> Thu May 30 14:26:53 CST 2019, 3
> Thu May 30 14:27:01 CST 2019, 4
> Thu May 30 14:27:03 CST 2019, 5
> Thu May 30 14:27:11 CST 2019, 6
> Thu May 30 14:27:31 CST 2019, 7
> Thu May 30 14:27:32 CST 2019, 8
> Thu May 30 14:27:41 CST 2019, 9
> Thu May 30 14:27:56 CST 2019, 10
> 
> Any thoughts?

Interesting, could you give a bit more detail on your test setup (commands
used, type of machine, any cgroup/pinning configuration, etc.)? I would
like to reproduce it and investigate.

Thanks,

Julien

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-05-30 14:04 ` [RFC PATCH v3 00/16] Core scheduling v3 Aubrey Li
  2019-05-30 14:17   ` Julien Desfossez
@ 2019-05-31  3:01   ` Aaron Lu
  2019-05-31  5:12     ` Aubrey Li
  2019-05-31 21:08     ` Julien Desfossez
  1 sibling, 2 replies; 161+ messages in thread
From: Aaron Lu @ 2019-05-31  3:01 UTC (permalink / raw)
  To: Aubrey Li, Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Subhra Mazumdar,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On 2019/5/30 22:04, Aubrey Li wrote:
> On Thu, May 30, 2019 at 4:36 AM Vineeth Remanan Pillai
> <vpillai@digitalocean.com> wrote:
>>
>> Third iteration of the Core-Scheduling feature.
>>
>> This version fixes mostly correctness related issues in v2 and
>> addresses performance issues. Also, addressed some crashes related
>> to cgroups and cpu hotplugging.
>>
>> We have tested and verified that incompatible processes are not
>> selected during schedule. In terms of performance, the impact
>> depends on the workload:
>> - on CPU intensive applications that use all the logical CPUs with
>>   SMT enabled, enabling core scheduling performs better than nosmt.
>> - on mixed workloads with considerable io compared to cpu usage,
>>   nosmt seems to perform better than core scheduling.
> 
> My testing scripts can not be completed on this version. I figured out the
> number of cpu utilization report entry didn't reach my minimal requirement.
> Then I wrote a simple script to verify.
> ====================
> $ cat test.sh
> #!/bin/sh
> 
> for i in `seq 1 10`
> do
>     echo `date`, $i
>     sleep 1
> done
> ====================

Is the shell put into some cgroup and assigned a tag, or simply left untagged?
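(For reference, and assuming the cgroup interface this patch series introduces: tagging is done per-cgroup via a cpu.tag file in the cpu controller. A rough, hypothetical illustration, cgroup names invented, legacy v1 hierarchy, run as root:)

```shell
# Hypothetical example of tagging a workload for core scheduling,
# assuming the cpu.tag knob added by this series (cgroup v1):
mkdir -p /sys/fs/cgroup/cpu/tagged_workload
echo 1 > /sys/fs/cgroup/cpu/tagged_workload/cpu.tag   # assign a core-sched tag
echo $$ > /sys/fs/cgroup/cpu/tagged_workload/tasks    # move current shell in
# An "untagged" task is simply one whose cgroup has cpu.tag == 0.
```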

> 
> Normally it works as below:
> 
> Thu May 30 14:13:40 CST 2019, 1
> Thu May 30 14:13:41 CST 2019, 2
> Thu May 30 14:13:42 CST 2019, 3
> Thu May 30 14:13:43 CST 2019, 4
> Thu May 30 14:13:44 CST 2019, 5
> Thu May 30 14:13:45 CST 2019, 6
> Thu May 30 14:13:46 CST 2019, 7
> Thu May 30 14:13:47 CST 2019, 8
> Thu May 30 14:13:48 CST 2019, 9
> Thu May 30 14:13:49 CST 2019, 10
> 
> When the system was running 32 sysbench threads and
> 32 gemmbench threads, it worked as below(the system
> has ~38% idle time)

Are the two workloads assigned different tags?
And how many cores/threads do you have?

> Thu May 30 14:14:20 CST 2019, 1
> Thu May 30 14:14:21 CST 2019, 2
> Thu May 30 14:14:22 CST 2019, 3
> Thu May 30 14:14:24 CST 2019, 4 <=======x=
> Thu May 30 14:14:25 CST 2019, 5
> Thu May 30 14:14:26 CST 2019, 6
> Thu May 30 14:14:28 CST 2019, 7 <=======x=
> Thu May 30 14:14:29 CST 2019, 8
> Thu May 30 14:14:31 CST 2019, 9 <=======x=
> Thu May 30 14:14:34 CST 2019, 10 <=======x=

This feels like "date" failed to schedule on some CPU
on time.

> And it got worse when the system was running 64/64 case,
> the system still had ~3% idle time
> Thu May 30 14:26:40 CST 2019, 1
> Thu May 30 14:26:46 CST 2019, 2
> Thu May 30 14:26:53 CST 2019, 3
> Thu May 30 14:27:01 CST 2019, 4
> Thu May 30 14:27:03 CST 2019, 5
> Thu May 30 14:27:11 CST 2019, 6
> Thu May 30 14:27:31 CST 2019, 7
> Thu May 30 14:27:32 CST 2019, 8
> Thu May 30 14:27:41 CST 2019, 9
> Thu May 30 14:27:56 CST 2019, 10
> 
> Any thoughts?

My first reaction is: when the shell wakes up from sleep, it will
fork date. If the script is untagged while those workloads are
tagged, and all available cores are already running workload
threads, the forked date can lose to the running workload
threads because __prio_less() can't properly do vruntime comparison
for tasks on different CPUs. So those idle siblings can't run
date and are idled instead. See my previous post on this:

https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
(Now that I re-read my post, I see that I didn't make it clear
that se_bash and se_hog are assigned different tags, e.g. hog is
tagged and bash is untagged.)

Siblings being forced idle is expected due to the nature of core
scheduling, but when two tasks belonging to two siblings are
fighting to be scheduled, we should let the higher-priority one win.

That it used to work in v2 is probably because we mistakenly
allowed differently tagged tasks to schedule on the same core at
the same time, but that is fixed in v3.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-05-30 14:17   ` Julien Desfossez
@ 2019-05-31  4:55     ` Aubrey Li
  0 siblings, 0 replies; 161+ messages in thread
From: Aubrey Li @ 2019-05-31  4:55 UTC (permalink / raw)
  To: Julien Desfossez
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing, Subhra Mazumdar,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On Thu, May 30, 2019 at 10:17 PM Julien Desfossez
<jdesfossez@digitalocean.com> wrote:
>
> Interesting, could you detail a bit more your test setup (commands used,
> type of machine, any cgroup/pinning configuration, etc) ? I would like
> to reproduce it and investigate.

Let me see if I can simplify my test to reproduce it.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-05-31  3:01   ` Aaron Lu
@ 2019-05-31  5:12     ` Aubrey Li
  2019-05-31  6:09       ` Aaron Lu
  2019-05-31 21:08     ` Julien Desfossez
  1 sibling, 1 reply; 161+ messages in thread
From: Aubrey Li @ 2019-05-31  5:12 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Subhra Mazumdar, Frédéric Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Fri, May 31, 2019 at 11:01 AM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
>
> This feels like "date" failed to schedule on some CPU
> on time.
>
> My first reaction is: when shell wakes up from sleep, it will
> fork date. If the script is untagged and those workloads are
> tagged and all available cores are already running workload
> threads, the forked date can lose to the running workload
> threads due to __prio_less() can't properly do vruntime comparison
> for tasks on different CPUs. So those idle siblings can't run
> date and are idled instead. See my previous post on this:
> https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
> (Now that I re-read my post, I see that I didn't make it clear
> that se_bash and se_hog are assigned different tags(e.g. hog is
> tagged and bash is untagged).

Yes, the script is untagged. This looks like exactly the problem in your
previous post. I didn't follow that thread; did the discussion lead to a solution?

>
> Siblings being forced idle is expected due to the nature of core
> scheduling, but when two tasks belonging to two siblings are
> fighting for schedule, we should let the higher priority one win.
>
> It used to work on v2 is probably due to we mistakenly
> allow different tagged tasks to schedule on the same core at
> the same time, but that is fixed in v3.

I have 64 threads running on a 104-CPU server, that is, the
system has ~40% idle time, and "date" still fails to be picked
up onto a CPU on time. This may be the nature of core scheduling,
but it seems far from fair.

Shouldn't we share the core between (sysbench+gemmbench)
and (date)? I mean core-level sharing instead of "date" starvation?

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-05-31  5:12     ` Aubrey Li
@ 2019-05-31  6:09       ` Aaron Lu
  2019-05-31  6:53         ` Aubrey Li
  0 siblings, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-05-31  6:09 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Subhra Mazumdar, Frédéric Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 2019/5/31 13:12, Aubrey Li wrote:
> On Fri, May 31, 2019 at 11:01 AM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
>>
>> This feels like "date" failed to schedule on some CPU
>> on time.
>>
>> My first reaction is: when shell wakes up from sleep, it will
>> fork date. If the script is untagged and those workloads are
>> tagged and all available cores are already running workload
>> threads, the forked date can lose to the running workload
>> threads due to __prio_less() can't properly do vruntime comparison
>> for tasks on different CPUs. So those idle siblings can't run
>> date and are idled instead. See my previous post on this:
>> https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
>> (Now that I re-read my post, I see that I didn't make it clear
>> that se_bash and se_hog are assigned different tags(e.g. hog is
>> tagged and bash is untagged).
> 
> Yes, script is untagged. This looks like exactly the problem in you
> previous post. I didn't follow that, does that discussion lead to a solution?

No immediate solution yet.

>>
>> Siblings being forced idle is expected due to the nature of core
>> scheduling, but when two tasks belonging to two siblings are
>> fighting for schedule, we should let the higher priority one win.
>>
>> It used to work on v2 is probably due to we mistakenly
>> allow different tagged tasks to schedule on the same core at
>> the same time, but that is fixed in v3.
> 
> I have 64 threads running on a 104-CPU server, that is, when the

104 CPUs means 52 cores, I guess.
64 threads may (should?) spread across all 52 cores, and that is enough
to make 'date' suffer.

> system has ~40% idle time, and "date" is still failed to be picked
> up onto CPU on time. This may be the nature of core scheduling,
> but it seems to be far from fairness.

Exactly.

> Shouldn't we share the core between (sysbench+gemmbench)
> and (date)? I mean core level sharing instead of  "date" starvation?

We need to make core scheduling fair, but due to no
immediate solution to vruntime comparison cross CPUs, it's not
done yet.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-05-31  6:09       ` Aaron Lu
@ 2019-05-31  6:53         ` Aubrey Li
  2019-05-31  7:44           ` Aaron Lu
  0 siblings, 1 reply; 161+ messages in thread
From: Aubrey Li @ 2019-05-31  6:53 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Subhra Mazumdar, Frédéric Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Fri, May 31, 2019 at 2:09 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
>
> On 2019/5/31 13:12, Aubrey Li wrote:
> > On Fri, May 31, 2019 at 11:01 AM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
> >>
> >> This feels like "date" failed to schedule on some CPU
> >> on time.
> >>
> >> My first reaction is: when shell wakes up from sleep, it will
> >> fork date. If the script is untagged and those workloads are
> >> tagged and all available cores are already running workload
> >> threads, the forked date can lose to the running workload
> >> threads due to __prio_less() can't properly do vruntime comparison
> >> for tasks on different CPUs. So those idle siblings can't run
> >> date and are idled instead. See my previous post on this:
> >> https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
> >> (Now that I re-read my post, I see that I didn't make it clear
> >> that se_bash and se_hog are assigned different tags(e.g. hog is
> >> tagged and bash is untagged).
> >
> > Yes, script is untagged. This looks like exactly the problem in you
> > previous post. I didn't follow that, does that discussion lead to a solution?
>
> No immediate solution yet.
>
> >>
> >> Siblings being forced idle is expected due to the nature of core
> >> scheduling, but when two tasks belonging to two siblings are
> >> fighting for schedule, we should let the higher priority one win.
> >>
> >> It used to work on v2 is probably due to we mistakenly
> >> allow different tagged tasks to schedule on the same core at
> >> the same time, but that is fixed in v3.
> >
> > I have 64 threads running on a 104-CPU server, that is, when the
>
> 104-CPU means 52 cores I guess.
> 64 threads may(should?) spread on all the 52 cores and that is enough
> to make 'date' suffer.

64 threads should spread onto all 52 cores, but why can they get
scheduled while the untagged "date" cannot? Is it because in the current
implementation a task with a cookie always has higher priority than a
task without a cookie?

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-05-31  6:53         ` Aubrey Li
@ 2019-05-31  7:44           ` Aaron Lu
  2019-05-31  8:26             ` Aubrey Li
  0 siblings, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-05-31  7:44 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Subhra Mazumdar, Frédéric Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Fri, May 31, 2019 at 02:53:21PM +0800, Aubrey Li wrote:
> On Fri, May 31, 2019 at 2:09 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
> >
> > On 2019/5/31 13:12, Aubrey Li wrote:
> > > On Fri, May 31, 2019 at 11:01 AM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
> > >>
> > >> This feels like "date" failed to schedule on some CPU
> > >> on time.
> > >>
> > >> My first reaction is: when shell wakes up from sleep, it will
> > >> fork date. If the script is untagged and those workloads are
> > >> tagged and all available cores are already running workload
> > >> threads, the forked date can lose to the running workload
> > >> threads due to __prio_less() can't properly do vruntime comparison
> > >> for tasks on different CPUs. So those idle siblings can't run
> > >> date and are idled instead. See my previous post on this:
> > >> https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
> > >> (Now that I re-read my post, I see that I didn't make it clear
> > >> that se_bash and se_hog are assigned different tags(e.g. hog is
> > >> tagged and bash is untagged).
> > >
> > > Yes, script is untagged. This looks like exactly the problem in you
> > > previous post. I didn't follow that, does that discussion lead to a solution?
> >
> > No immediate solution yet.
> >
> > >>
> > >> Siblings being forced idle is expected due to the nature of core
> > >> scheduling, but when two tasks belonging to two siblings are
> > >> fighting for schedule, we should let the higher priority one win.
> > >>
> > >> It used to work on v2 is probably due to we mistakenly
> > >> allow different tagged tasks to schedule on the same core at
> > >> the same time, but that is fixed in v3.
> > >
> > > I have 64 threads running on a 104-CPU server, that is, when the
> >
> > 104-CPU means 52 cores I guess.
> > 64 threads may(should?) spread on all the 52 cores and that is enough
> > to make 'date' suffer.
> 
> 64 threads should spread onto all the 52 cores, but why they can get
> scheduled while untagged "date" can not? Is it because in the current

If 'date' never got scheduled, there would be no output at all until
all those workload threads finished :-)

I guess the workload you used is not entirely CPU intensive, or 'date'
could be totally blocked due to START_DEBIT. But note that START_DEBIT
isn't the problem here; cross-CPU vruntime comparison is.

> implementation the task with cookie always has higher priority than the
> task without a cookie?

No.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-05-31  7:44           ` Aaron Lu
@ 2019-05-31  8:26             ` Aubrey Li
  0 siblings, 0 replies; 161+ messages in thread
From: Aubrey Li @ 2019-05-31  8:26 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Subhra Mazumdar, Frédéric Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Fri, May 31, 2019 at 3:45 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
>
> On Fri, May 31, 2019 at 02:53:21PM +0800, Aubrey Li wrote:
> > On Fri, May 31, 2019 at 2:09 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
> > >
> > > On 2019/5/31 13:12, Aubrey Li wrote:
> > > > On Fri, May 31, 2019 at 11:01 AM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
> > > >>
> > > >> This feels like "date" failed to schedule on some CPU
> > > >> on time.
> > > >>
> > > >> My first reaction is: when shell wakes up from sleep, it will
> > > >> fork date. If the script is untagged and those workloads are
> > > >> tagged and all available cores are already running workload
> > > >> threads, the forked date can lose to the running workload
> > > >> threads due to __prio_less() can't properly do vruntime comparison
> > > >> for tasks on different CPUs. So those idle siblings can't run
> > > >> date and are idled instead. See my previous post on this:
> > > >> https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
> > > >> (Now that I re-read my post, I see that I didn't make it clear
> > > >> that se_bash and se_hog are assigned different tags(e.g. hog is
> > > >> tagged and bash is untagged).
> > > >
> > > > Yes, script is untagged. This looks like exactly the problem in you
> > > > previous post. I didn't follow that, does that discussion lead to a solution?
> > >
> > > No immediate solution yet.
> > >
> > > >>
> > > >> Siblings being forced idle is expected due to the nature of core
> > > >> scheduling, but when two tasks belonging to two siblings are
> > > >> fighting for schedule, we should let the higher priority one win.
> > > >>
> > > >> It used to work on v2 is probably due to we mistakenly
> > > >> allow different tagged tasks to schedule on the same core at
> > > >> the same time, but that is fixed in v3.
> > > >
> > > > I have 64 threads running on a 104-CPU server, that is, when the
> > >
> > > 104-CPU means 52 cores I guess.
> > > 64 threads may(should?) spread on all the 52 cores and that is enough
> > > to make 'date' suffer.
> >
> > 64 threads should spread onto all the 52 cores, but why they can get
> > scheduled while untagged "date" can not? Is it because in the current
>
> If 'date' didn't get scheduled, there will be no output at all unless
> all those workload threads finished :-)

To be clear, I meant that untagged "date" cannot be scheduled on time. :)

>
> I guess the workload you used is not entirely CPU intensive, or 'date'
> can be totally blocked due to START_DEBIT. But note that START_DEBIT
> isn't the problem here, cross CPU vruntime comparison is.
>
> > implementation the task with cookie always has higher priority than the
> > task without a cookie?
>
> No.

I checked the benchmark log manually; the data for the two benchmarks
with cookies look acceptable, but the ones without cookies are really bad.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 10/16] sched: Core-wide rq->lock
  2019-05-29 20:36 ` [RFC PATCH v3 10/16] sched: Core-wide rq->lock Vineeth Remanan Pillai
@ 2019-05-31 11:08   ` Peter Zijlstra
  2019-05-31 15:23     ` Vineeth Pillai
  0 siblings, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2019-05-31 11:08 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, mingo, tglx,
	pjt, torvalds, linux-kernel, subhra.mazumdar, fweisbec, keescook,
	kerrnel, Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini

On Wed, May 29, 2019 at 08:36:46PM +0000, Vineeth Remanan Pillai wrote:

> + * The static-key + stop-machine variable are needed such that:
> + *
> + *	spin_lock(rq_lockp(rq));
> + *	...
> + *	spin_unlock(rq_lockp(rq));
> + *
> + * ends up locking and unlocking the _same_ lock, and all CPUs
> + * always agree on what rq has what lock.

> @@ -5790,8 +5854,15 @@ int sched_cpu_activate(unsigned int cpu)
>  	/*
>  	 * When going up, increment the number of cores with SMT present.
>  	 */
> -	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> +	if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
>  		static_branch_inc_cpuslocked(&sched_smt_present);
> +#ifdef CONFIG_SCHED_CORE
> +		if (static_branch_unlikely(&__sched_core_enabled)) {
> +			rq->core_enabled = true;
> +		}
> +#endif
> +	}
> +
>  #endif
>  	set_cpu_active(cpu, true);
>  
> @@ -5839,8 +5910,16 @@ int sched_cpu_deactivate(unsigned int cpu)
>  	/*
>  	 * When going down, decrement the number of cores with SMT present.
>  	 */
> -	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> +	if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
> +#ifdef CONFIG_SCHED_CORE
> +		struct rq *rq = cpu_rq(cpu);
> +		if (static_branch_unlikely(&__sched_core_enabled)) {
> +			rq->core_enabled = false;
> +		}
> +#endif
>  		static_branch_dec_cpuslocked(&sched_smt_present);
> +
> +	}
>  #endif

I'm confused, how doesn't this break the invariant above?

That is, all CPUs must at all times agree on the value of rq_lockp(),
and I'm not seeing how that is true with the above changes.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 10/16] sched: Core-wide rq->lock
  2019-05-31 11:08   ` Peter Zijlstra
@ 2019-05-31 15:23     ` Vineeth Pillai
  0 siblings, 0 replies; 161+ messages in thread
From: Vineeth Pillai @ 2019-05-31 15:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nishanth Aravamudan, Julien Desfossez, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Subhra Mazumdar,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

>
> I'm confused, how doesn't this break the invariant above?
>
> That is, all CPUs must at all times agree on the value of rq_lockp(),
> and I'm not seeing how that is true with the above changes.
>
While fixing the crash in cpu online/offline, I was focusing on
maintaining the invariant that all online cpus agree on the value of
rq_lockp(). Would it be safe to assume that rq and rq_lock would be
used only after a cpu is onlined (sched:active)?

To maintain the strict invariant, the sibling should also disable core
scheduling, but we need to empty the rbtree before disabling it. I am
trying to see how to empty the rbtree safely in the offline context.

Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-05-31  3:01   ` Aaron Lu
  2019-05-31  5:12     ` Aubrey Li
@ 2019-05-31 21:08     ` Julien Desfossez
  2019-06-06 15:26       ` Julien Desfossez
  1 sibling, 1 reply; 161+ messages in thread
From: Julien Desfossez @ 2019-05-31 21:08 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Aubrey Li, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Tim Chen, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Subhra Mazumdar, Frédéric Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

> My first reaction is: when shell wakes up from sleep, it will
> fork date. If the script is untagged and those workloads are
> tagged and all available cores are already running workload
> threads, the forked date can lose to the running workload
> threads due to __prio_less() can't properly do vruntime comparison
> for tasks on different CPUs. So those idle siblings can't run
> date and are idled instead. See my previous post on this:
> 
> https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
> (Now that I re-read my post, I see that I didn't make it clear
> that se_bash and se_hog are assigned different tags(e.g. hog is
> tagged and bash is untagged).
> 
> Siblings being forced idle is expected due to the nature of core
> scheduling, but when two tasks belonging to two siblings are
> fighting for schedule, we should let the higher priority one win.
> 
> It used to work on v2 is probably due to we mistakenly
> allow different tagged tasks to schedule on the same core at
> the same time, but that is fixed in v3.

I confirm this is indeed what is happening; we reproduced it with a
simple script that only uses one core (cpus 2 and 38 are siblings on
this machine):

setup:
cgcreate -g cpu,cpuset:test
cgcreate -g cpu,cpuset:test/set1
cgcreate -g cpu,cpuset:test/set2
echo 2,38 > /sys/fs/cgroup/cpuset/test/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/test/cpuset.mems
echo 2,38 > /sys/fs/cgroup/cpuset/test/set1/cpuset.cpus
echo 2,38 > /sys/fs/cgroup/cpuset/test/set2/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/test/set1/cpuset.mems
echo 0 > /sys/fs/cgroup/cpuset/test/set2/cpuset.mems
echo 1 > /sys/fs/cgroup/cpu,cpuacct/test/set1/cpu.tag

In one terminal:
sudo cgexec -g cpu,cpuset:test/set1 sysbench --threads=1 --time=30
--test=cpu run

In another one:
sudo cgexec -g cpu,cpuset:test/set2 date

It's very clear that 'date' hangs until sysbench is done.

We started experimenting with marking a task on the forced-idle sibling
when the normalized vruntimes are equal. That way, at the next
comparison, if the normalized vruntimes are still equal, it prefers the
task on the forced-idle sibling. It still needs more work, but in our
early tests it helps.

Thanks,

Julien

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-05-31 21:08     ` Julien Desfossez
@ 2019-06-06 15:26       ` Julien Desfossez
  2019-06-12  1:52         ` Li, Aubrey
  2019-06-12 16:33         ` Julien Desfossez
  0 siblings, 2 replies; 161+ messages in thread
From: Julien Desfossez @ 2019-06-06 15:26 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Aubrey Li, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Tim Chen, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Subhra Mazumdar, Frédéric Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 31-May-2019 05:08:16 PM, Julien Desfossez wrote:
> > My first reaction is: when shell wakes up from sleep, it will
> > fork date. If the script is untagged and those workloads are
> > tagged and all available cores are already running workload
> > threads, the forked date can lose to the running workload
> > threads due to __prio_less() can't properly do vruntime comparison
> > for tasks on different CPUs. So those idle siblings can't run
> > date and are idled instead. See my previous post on this:
> > 
> > https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/
> > (Now that I re-read my post, I see that I didn't make it clear
> > that se_bash and se_hog are assigned different tags(e.g. hog is
> > tagged and bash is untagged).
> > 
> > Siblings being forced idle is expected due to the nature of core
> > scheduling, but when two tasks belonging to two siblings are
> > fighting for schedule, we should let the higher priority one win.
> > 
> > It used to work on v2 is probably due to we mistakenly
> > allow different tagged tasks to schedule on the same core at
> > the same time, but that is fixed in v3.
> 
> I confirm this is indeed what is happening, we reproduced it with a
> simple script that only uses one core (cpu 2 and 38 are sibling on this
> machine):
> 
> setup:
> cgcreate -g cpu,cpuset:test
> cgcreate -g cpu,cpuset:test/set1
> cgcreate -g cpu,cpuset:test/set2
> echo 2,38 > /sys/fs/cgroup/cpuset/test/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/test/cpuset.mems
> echo 2,38 > /sys/fs/cgroup/cpuset/test/set1/cpuset.cpus
> echo 2,38 > /sys/fs/cgroup/cpuset/test/set2/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/test/set1/cpuset.mems
> echo 0 > /sys/fs/cgroup/cpuset/test/set2/cpuset.mems
> echo 1 > /sys/fs/cgroup/cpu,cpuacct/test/set1/cpu.tag
> 
> In one terminal:
> sudo cgexec -g cpu,cpuset:test/set1 sysbench --threads=1 --time=30
> --test=cpu run
> 
> In another one:
> sudo cgexec -g cpu,cpuset:test/set2 date
> 
> It's very clear that 'date' hangs until sysbench is done.
> 
> We started experimenting with marking a task on the forced idle sibling
> if normalized vruntimes are equal. That way, at the next compare, if the
> normalized vruntimes are still equal, it prefers the task on the forced
> idle sibling. It still needs more work, but in our early tests it helps.

As mentioned above, we have come up with a fix for the long starvation
of untagged interactive threads competing for the same core with tagged
threads at the same priority. The idea is to detect the stall and boost
the stalling thread's priority so that it gets a chance next time.
Boosting is done through a new per-task counter (core_vruntime_boost)
which we subtract from the vruntime before comparison. The new logic
looks like this:

If we see that the normalized vruntimes are equal, we check the
min_vruntimes of their runqueues and give a chance to the task in the
runqueue with the smaller min_vruntime. That helps it progress its
vruntime. While doing this, we boost the priority of the task on the
sibling so that we don’t starve it until the min_vruntime of this
runqueue catches up.

If the min_vruntimes are also equal, we do as before and consider task
‘a’ to be of higher priority. Here we boost task ‘b’ so that it gets to
run next time.

The core_vruntime_boost counter is reset to zero once the task is on a
cpu, so only waiting tasks will have a non-zero value if they were
starved while matching a task on the other sibling.

The attached patch adds a sched_feature to enable the above behavior so
that you can compare the results with and without it.

What we observe with this patch is that it helps untagged interactive
tasks and fairness in general, but it increases the overhead of core
scheduling when there is contention for the CPU between tasks of
varying cpu usage. The general trend we see is that if there is a
cpu-intensive thread and multiple relatively idle threads in different
tags, the cpu-intensive task continuously yields to be fair to the
relatively idle threads whenever they become runnable. And if the
relatively idle threads make up most of the tasks in the system and are
tagged, the cpu-intensive tasks see a considerable drop in performance.

If you have any feedback or creative ideas to help improve it, let us
know!

Thanks


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1a309e8..56cad0e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -642,6 +642,7 @@ struct task_struct {
 	struct rb_node			core_node;
 	unsigned long			core_cookie;
 	unsigned int			core_occupation;
+	unsigned int			core_vruntime_boost;
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 73329da..c302853 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -92,6 +92,10 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 
 	int pa = __task_prio(a), pb = __task_prio(b);
 
+	trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+		     a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+		     b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
 	if (-pa < -pb)
 		return true;
 
@@ -102,21 +106,36 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
 	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
-		     a->comm, a->pid, pa, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
-		     b->comm, b->pid, pb, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
+		u64 a_vruntime = a->se.vruntime - a->core_vruntime_boost;
+		u64 b_vruntime = b->se.vruntime - b->core_vruntime_boost;
 
 		/*
 		 * Normalize the vruntime if tasks are in different cpus.
 		 */
 		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
+			s64 min_vruntime_diff = task_cfs_rq(a)->min_vruntime -
+						 task_cfs_rq(b)->min_vruntime;
+			b_vruntime += min_vruntime_diff;
+
+			trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
+				     a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
+				     b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
+
+			if (sched_feat(CORESCHED_STALL_FIX) &&
+			    a_vruntime == b_vruntime) {
+				bool less_prio = min_vruntime_diff > 0;
+
+				if (less_prio)
+					a->core_vruntime_boost++;
+				else
+					b->core_vruntime_boost++;
+
+				return less_prio;
+
+			}
 		}
 
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
+		return !((s64)(a_vruntime - b_vruntime) <= 0);
 	}
 
 	return false;
@@ -2456,6 +2475,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_COMPACTION
 	p->capture_control = NULL;
 #endif
+#ifdef CONFIG_SCHED_CORE
+	p->core_vruntime_boost = 0UL;
+#endif
 	init_numa_balancing(clone_flags, p);
 }
 
@@ -3723,6 +3745,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			     next->comm, next->pid,
 			     next->core_cookie);
 
+		next->core_vruntime_boost = 0UL;
 		return next;
 	}
 
@@ -3835,6 +3858,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie);
 
 				if (old_max) {
+					if (old_max->core_vruntime_boost)
+						old_max->core_vruntime_boost--;
+
 					for_each_cpu(j, smt_mask) {
 						if (j == i)
 							continue;
@@ -3905,6 +3931,7 @@ next_class:;
 
 done:
 	set_next_task(rq, next);
+	next->core_vruntime_boost = 0UL;
 	return next;
 }
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..332a092 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -90,3 +90,9 @@ SCHED_FEAT(WA_BIAS, true)
  * UtilEstimation. Use estimated CPU utilization.
  */
 SCHED_FEAT(UTIL_EST, true)
+
+/*
+ * Prevent task stall due to vruntime comparison limitation across
+ * cpus.
+ */
+SCHED_FEAT(CORESCHED_STALL_FIX, false)

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 13/16] sched: Add core wide task selection and scheduling.
  2019-05-29 20:36 ` [RFC PATCH v3 13/16] sched: Add core wide task selection and scheduling Vineeth Remanan Pillai
@ 2019-06-07 23:36   ` Pawan Gupta
  0 siblings, 0 replies; 161+ messages in thread
From: Pawan Gupta @ 2019-06-07 23:36 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, linux-kernel, subhra.mazumdar,
	fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu, Aubrey Li,
	Valentin Schneider, Mel Gorman, Paolo Bonzini

On Wed, May 29, 2019 at 08:36:49PM +0000, Vineeth Remanan Pillai wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Instead of only selecting a local task, select a task for all SMT
> siblings for every reschedule on the core (irrespective which logical
> CPU does the reschedule).
> 
> NOTE: there is still potential for siblings rivalry.
> NOTE: this is far too complicated; but thus far I've failed to
>       simplify it further.

Looks like there are still some race conditions when bringing a cpu
online/offline. I am seeing an easy-to-reproduce panic when turning SMT
on/off in a loop with core scheduling ON. I don't see the panic with
core scheduling OFF.

Steps to reproduce:

mkdir /sys/fs/cgroup/cpu/group1
mkdir /sys/fs/cgroup/cpu/group2
echo 1 > /sys/fs/cgroup/cpu/group1/cpu.tag
echo 1 > /sys/fs/cgroup/cpu/group2/cpu.tag

echo $$ > /sys/fs/cgroup/cpu/group1/tasks

while [ 1 ];  do
	echo on	 > /sys/devices/system/cpu/smt/control
	echo off > /sys/devices/system/cpu/smt/control
done

Panic logs:
[  274.629437] BUG: unable to handle kernel NULL pointer dereference at
0000000000000024
[  274.630366] #PF error: [normal kernel read fault]
[  274.630933] PGD 800000003e52c067 P4D 800000003e52c067 PUD 0
[  274.631613] Oops: 0000 [#1] SMP PTI
[  274.632016] CPU: 0 PID: 1470 Comm: bash Tainted: G        W
5.1.4+ #33
[  274.632854] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29
04/01/2014
[  274.634248] RIP: 0010:__schedule+0x9d4/0x1350
[  274.634699] Code: da 0f 83 21 04 00 00 48 8b 35 70 f3 ab 00 48 c7 c7
51 1c a8 81 e8 4c 4e 6b ff 49 8b 85 b8 0b 00 00 48 85 c0 0f 84 2f 09 00
01
[  274.636648] RSP: 0018:ffffc900008f3ca8 EFLAGS: 00010046
[  274.637197] RAX: 0000000000000000 RBX: 0000000000000001 RCX:
0000000000000040
[  274.637941] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffffffff82544890
[  274.638691] RBP: ffffc900008f3d40 R08: 00000000000004c7 R09:
0000000000000030
[  274.639449] R10: 0000000000000001 R11: ffffc900008f3b28 R12:
ffff88803d2d0e80
[  274.640172] R13: ffff88803eaa0a40 R14: ffff88803ea20a40 R15:
ffff88803d2d0e80
[  274.640915] FS:  0000000000000000(0000) GS:ffff88803ea00000(0063)
knlGS:00000000f7f8b780
[  274.641755] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[  274.642355] CR2: 0000000000000024 CR3: 000000003c01a005 CR4:
0000000000360ef0
[  274.643135] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  274.643995] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[  274.645023] Call Trace:
[  274.645336]  schedule+0x28/0x70
[  274.645621]  native_cpu_up+0x271/0x6d0
[  274.645959]  ? cpus_read_trylock+0x40/0x40
[  274.646324]  bringup_cpu+0x2d/0xe0
[  274.646631]  cpuhp_invoke_callback+0x94/0x550
[  274.647032]  ? ring_buffer_record_is_set_on+0x10/0x10
[  274.647478]  _cpu_up+0xa9/0x140
[  274.647763]  store_smt_control+0x1cb/0x260
[  274.648132]  kernfs_fop_write+0x108/0x190
[  274.648498]  vfs_write+0xa5/0x1a0
[  274.648794]  ksys_write+0x57/0xd0
[  274.649100]  do_fast_syscall_32+0x92/0x220
[  274.649468]  entry_SYSENTER_compat+0x7c/0x8e


The NULL pointer exception is triggered when a sibling is offline during
the core task pick in pick_next_task(), leaving rq_i->core_pick = NULL;
if the sibling comes back online before the "Reschedule siblings" block
in the same function, this causes a panic in is_idle_task(rq_i->core_pick).

Traces for the scenario:
[ 274.599567] bash-1470 0d... 273921815us : __schedule: cpu(0) is online during core_pick 
[ 274.600339] bash-1470 0d... 273921816us : __schedule: cpu(1) is offline during core_pick 
[ 274.601106] bash-1470 0d... 273921816us : __schedule: picked: bash/1470 ffff88803cb9c000 
[ 274.602106] bash-1470 0d... 273921816us : __schedule: cpu(0) is online.. during Reschedule siblings 
[ 274.603219] bash-1470 0d... 273921816us : __schedule: cpu(1) is online.. during Reschedule siblings 
[ 274.604333] <idle>-0 1d... 273921816us : start_secondary: cpu(1) is online now 
[ 274.605239] bash-1470 0d... 273922148us : __schedule: rq_i->core_pick on cpu(1) is NULL

I am not able to reproduce the panic with the change below. Not sure if
this is the right fix; maybe we should not allow cpus to go
online/offline while pick_next_task() is executing.

-------------- 8< ---------------
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 90655c9ad937..b230b095772a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3874,7 +3874,7 @@ next_class:;
        for_each_cpu(i, smt_mask) {
                struct rq *rq_i = cpu_rq(i);
 
-               if (cpu_is_offline(i))
+               if (cpu_is_offline(i) || !rq_i->core_pick)
                        continue;
 
                WARN_ON_ONCE(!rq_i->core_pick);

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-06-06 15:26       ` Julien Desfossez
@ 2019-06-12  1:52         ` Li, Aubrey
  2019-06-12 16:06           ` Julien Desfossez
  2019-06-12 16:33         ` Julien Desfossez
  1 sibling, 1 reply; 161+ messages in thread
From: Li, Aubrey @ 2019-06-12  1:52 UTC (permalink / raw)
  To: Julien Desfossez, Aaron Lu
  Cc: Aubrey Li, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Tim Chen, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Subhra Mazumdar, Frédéric Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 2019/6/6 23:26, Julien Desfossez wrote:
> As mentioned above, we have come up with a fix for the long starvation
> of untagged interactive threads competing for the same core with tagged
> threads at the same priority. The idea is to detect the stall and boost
> the stalling thread's priority so that it gets a chance next time.
> Boosting is done through a new per-task counter (core_vruntime_boost)
> which we subtract from the vruntime before comparison. The new logic
> looks like this:
> 
> If we see that the normalized vruntimes are equal, we check the
> min_vruntimes of their runqueues and give a chance to the task in the
> runqueue with the smaller min_vruntime. That helps it progress its
> vruntime. While doing this, we boost the priority of the task on the
> sibling so that we don’t starve it until the min_vruntime of this
> runqueue catches up.
> 
> If the min_vruntimes are also equal, we do as before and consider task
> ‘a’ to be of higher priority. Here we boost task ‘b’ so that it gets to
> run next time.
> 
> The core_vruntime_boost counter is reset to zero once the task is on a
> cpu, so only waiting tasks will have a non-zero value if they were
> starved while matching a task on the other sibling.
> 
> The attached patch adds a sched_feature to enable the above behavior so
> that you can compare the results with and without it.
> 
> What we observe with this patch is that it helps untagged interactive
> tasks and fairness in general, but it increases the overhead of core
> scheduling when there is contention for the CPU between tasks of
> varying cpu usage. The general trend we see is that if there is a
> cpu-intensive thread and multiple relatively idle threads in different
> tags, the cpu-intensive task continuously yields to be fair to the
> relatively idle threads whenever they become runnable. And if the
> relatively idle threads make up most of the tasks in the system and are
> tagged, the cpu-intensive tasks see a considerable drop in performance.
> 
> If you have any feedback or creative ideas to help improve it, let us
> know!

The data on my side looks good with CORESCHED_STALL_FIX = true.

Environment setup
--------------------------
Skylake 8170 server, 2 numa nodes, 52 cores, 104 CPUs (HT on)
cgroup1 workload, sysbench (CPU mode, non AVX workload)
cgroup2 workload, gemmbench (AVX512 workload)

sysbench throughput result:
.--------------------------------------------------------------------------------------------------------------------------------------.
|NA/AVX	vanilla-SMT	[std% / sem%]	  cpu% |coresched-SMT	[std% / sem%]	  +/-	  cpu% |  no-SMT [std% / sem%]	 +/-	  cpu% |
|--------------------------------------------------------------------------------------------------------------------------------------|
|  1/1	      490.8	[ 0.1%/ 0.0%]	  1.9% |        492.6	[ 0.1%/ 0.0%]	  0.4%	  1.9% |   489.5 [ 0.1%/ 0.0%]  -0.3%	  3.9% |
'--------------------------------------------------------------------------------------------------------------------------------------'
|  2/2	      975.0	[ 0.6%/ 0.1%]	  3.9% |        970.4	[ 0.4%/ 0.0%]	 -0.5%	  3.9% |   975.6 [ 0.2%/ 0.0%]   0.1%	  7.7% |
'--------------------------------------------------------------------------------------------------------------------------------------'
|  4/4	     1856.9	[ 0.2%/ 0.0%]	  7.8% |       1854.5	[ 0.3%/ 0.0%]	 -0.1%	  7.8% |  1849.4 [ 0.8%/ 0.1%]  -0.4%	 14.8% |
'--------------------------------------------------------------------------------------------------------------------------------------'
|  8/8	     3622.8	[ 0.2%/ 0.0%]	 14.6% |       3618.3	[ 0.1%/ 0.0%]	 -0.1%	 14.7% |  3626.6 [ 0.4%/ 0.0%]   0.1%	 30.1% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 16/16	     6976.7	[ 0.2%/ 0.0%]	 30.1% |       6959.3	[ 0.3%/ 0.0%]	 -0.2%	 30.1% |  6964.4 [ 0.9%/ 0.1%]  -0.2%	 60.1% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 32/32	    10347.7	[ 3.8%/ 0.4%]	 60.1% |      11525.4	[ 2.8%/ 0.3%]	 11.4%	 59.5% |  9810.5 [ 9.4%/ 0.8%]  -5.2%	 97.7% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 64/64	    15284.9	[ 9.0%/ 0.9%]	 98.1% |      17022.1	[ 4.5%/ 0.5%]	 11.4%	 98.2% |  9989.7 [19.3%/ 1.1%] -34.6%	100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'
|128/128    16211.3	[18.9%/ 1.9%]	100.0% |      16507.9	[ 6.1%/ 0.6%]	  1.8%	 99.8% | 10379.0 [12.6%/ 0.8%] -36.0%	100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'
|256/256    16667.1	[ 3.1%/ 0.3%]	100.0% |      16499.1	[ 3.2%/ 0.3%]	 -1.0%	100.0% | 10540.9 [16.2%/ 1.0%] -36.8%	100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'

sysbench latency result:
(The reason we care about latency is that some customers reported that
their latency-critical job is affected when a deep learning job (an
AVX512 task) is co-located onto the same core: when a core executes
AVX512 instructions, it automatically reduces its frequency. This can
lead to a significant overall performance loss for the non-AVX512 job
on the same core.

Now that we have the core cookie match mechanism, if we put the AVX512
tasks and the non-AVX512 tasks into different cgroups, they are not
supposed to be co-located. That's why we saw the improvements in the
32/32 and 64/64 cases.)

.--------------------------------------------------------------------------------------------------------------------------------------.
|NA/AVX	vanilla-SMT	[std% / sem%]	  cpu% |coresched-SMT	[std% / sem%]	  +/-	  cpu% |  no-SMT [std% / sem%]	 +/-	  cpu% |
|--------------------------------------------------------------------------------------------------------------------------------------|
|  1/1	        2.1	[ 0.6%/ 0.1%]	  1.9% |          2.0	[ 0.2%/ 0.0%]	  3.8%	  1.9% |     2.1 [ 0.7%/ 0.1%]  -0.8%	  3.9% |
'--------------------------------------------------------------------------------------------------------------------------------------'
|  2/2	        2.1	[ 0.7%/ 0.1%]	  3.9% |          2.1	[ 0.3%/ 0.0%]	  0.2%	  3.9% |     2.1 [ 0.6%/ 0.1%]   0.5%	  7.7% |
'--------------------------------------------------------------------------------------------------------------------------------------'
|  4/4	        2.2	[ 0.6%/ 0.1%]	  7.8% |          2.2	[ 0.4%/ 0.0%]	 -0.2%	  7.8% |     2.2 [ 0.2%/ 0.0%]  -0.3%	 14.8% |
'--------------------------------------------------------------------------------------------------------------------------------------'
|  8/8	        2.2	[ 0.4%/ 0.0%]	 14.6% |          2.2	[ 0.0%/ 0.0%]	  0.1%	 14.7% |     2.2 [ 0.0%/ 0.0%]   0.1%	 30.1% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 16/16	        2.4	[ 1.6%/ 0.2%]	 30.1% |          2.4	[ 1.6%/ 0.2%]	 -0.9%	 30.1% |     2.4 [ 1.9%/ 0.2%]  -0.3%	 60.1% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 32/32	        4.9	[ 6.2%/ 0.6%]	 60.1% |          3.1	[ 5.0%/ 0.5%]	 36.6%	 59.5% |     6.7 [17.3%/ 3.7%] -34.5%	 97.7% |
'--------------------------------------------------------------------------------------------------------------------------------------'
| 64/64	        9.4	[28.3%/ 2.8%]	 98.1% |          3.5	[25.6%/ 2.6%]	 62.4%	 98.2% |    18.5 [ 9.5%/ 5.0%] -97.9%	100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'
|128/128       21.3	[10.1%/ 1.0%]	100.0% |         24.8	[ 8.1%/ 0.8%]	-16.1%	 99.8% |    34.5 [ 4.9%/ 0.7%] -62.0%	100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'
|256/256       35.5	[ 7.8%/ 0.8%]	100.0% |         37.3	[ 5.4%/ 0.5%]	 -5.1%	100.0% |    40.8 [ 5.9%/ 0.6%] -15.0%	100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'

Note:
----
64/64:		64 sysbench threads (in one cgroup) and 64 gemmbench threads (in the other cgroup) run simultaneously.
Vanilla-SMT:	baseline with HT on
coresched-SMT:	core scheduling enabled
no-SMT:		HT off through /sys/devices/system/cpu/smt/control
std%:		standard deviation
sem%:		standard error of the mean
+/-:		improvement/regression against baseline
cpu%:		derived from vmstat.idle and vmstat.iowait

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-06-12  1:52         ` Li, Aubrey
@ 2019-06-12 16:06           ` Julien Desfossez
  0 siblings, 0 replies; 161+ messages in thread
From: Julien Desfossez @ 2019-06-12 16:06 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Aaron Lu, Aubrey Li, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Tim Chen, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Subhra Mazumdar, Frédéric Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

> The data on my side looks good with CORESCHED_STALL_FIX = true.

Thank you for testing this fix, I'm glad it works for this use-case as
well.

We will be posting another (simpler) version today, stay tuned :-)

Julien

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-06-06 15:26       ` Julien Desfossez
  2019-06-12  1:52         ` Li, Aubrey
@ 2019-06-12 16:33         ` Julien Desfossez
  2019-06-13  0:03           ` Subhra Mazumdar
  1 sibling, 1 reply; 161+ messages in thread
From: Julien Desfossez @ 2019-06-12 16:33 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Aubrey Li, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Tim Chen, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Subhra Mazumdar, Frédéric Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

After reading more traces and trying to understand why only untagged
tasks are starving when there are cpu-intensive tasks running on the
same set of CPUs, we noticed a difference in behavior in ‘pick_task’. In
the case where ‘core_cookie’ is 0, we are supposed to prefer the
tagged task only if its priority is higher, but when the priorities are
equal we prefer it as well, which causes the starvation. ‘pick_task’ is
biased toward selecting its first parameter in case of equality, which in
this case was the ‘class_pick’ instead of ‘max’. Reversing the order of
the parameters solves this issue and matches the expected behavior.

So we can get rid of this vruntime_boost concept.

We have tested the fix below and it seems to work well with
tagged/untagged tasks.

Here are our initial test results. When core scheduling is enabled,
each VM (and its associated vhost threads) is in its own cgroup/tag.

1 12-vcpu VM MySQL TPC-C benchmark (IO + CPU) with 96 mostly-idle 1-vcpu
VMs on each NUMA node (72 logical CPUs total with SMT on):
+-------------+----------+--------------+------------+--------+
|             | baseline | coresched    | coresched  | nosmt  |
|             | no tag   | VMs tagged   | VMs tagged | no tag |
|             | v5.1.5   | no stall fix | stall fix  |        |
+-------------+----------+--------------+------------+--------+
|average TPS  | 1474     | 1289         | 1264       | 1339   |
|stdev        | 48       | 12           | 17         | 24     |
|overhead     | N/A      | -12%         | -14%       | -9%    |
+-------------+----------+--------------+------------+--------+

3 12-vcpu VMs running linpack (cpu-intensive), all pinned on the same
NUMA node (36 logical CPUs with SMT enabled on that NUMA node):
+---------------+----------+--------------+-----------+--------+
|               | baseline | coresched    | coresched | nosmt  |
|               | no tag   | VMs tagged   | VMs tagged| no tag |
|               | v5.1.5   | no stall fix | stall fix |        |
+---------------+----------+--------------+-----------+--------+
|average gflops | 177.9    | 171.3        | 172.7     | 81.9   |
|stdev          | 2.6      | 10.6         | 6.4       | 8.1    |
|overhead       | N/A      | -3.7%        | -2.9%     | -53.9% |
+---------------+----------+--------------+-----------+--------+

This fix can be toggled dynamically with the ‘CORESCHED_STALL_FIX’
sched_feature so it’s easy to test before/after (it is disabled by
default).

The up-to-date git tree can also be found here in case it’s easier to
follow:
https://github.com/digitalocean/linux-coresched/commits/vpillai/coresched-v3-v5.1.5-test

Feedback welcome !

Thanks,

Julien

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6e79421..26fea68 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3668,8 +3668,10 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
                 * If class_pick is tagged, return it only if it has
                 * higher priority than max.
                 */
-               if (max && class_pick->core_cookie &&
-                   prio_less(class_pick, max))
+               bool max_is_higher = sched_feat(CORESCHED_STALL_FIX) ?
+                                    max && !prio_less(max, class_pick) :
+                                    max && prio_less(class_pick, max);
+               if (class_pick->core_cookie && max_is_higher)
                        return idle_sched_class.pick_task(rq);
 
                return class_pick;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..332a092 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -90,3 +90,9 @@ SCHED_FEAT(WA_BIAS, true)
  * UtilEstimation. Use estimated CPU utilization.
  */
 SCHED_FEAT(UTIL_EST, true)
+
+/*
+ * Prevent task stall due to vruntime comparison limitation across
+ * cpus.
+ */
+SCHED_FEAT(CORESCHED_STALL_FIX, false)

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-06-12 16:33         ` Julien Desfossez
@ 2019-06-13  0:03           ` Subhra Mazumdar
  2019-06-13  3:22             ` Julien Desfossez
  0 siblings, 1 reply; 161+ messages in thread
From: Subhra Mazumdar @ 2019-06-13  0:03 UTC (permalink / raw)
  To: Julien Desfossez, Aaron Lu
  Cc: Aubrey Li, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Tim Chen, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini


On 6/12/19 9:33 AM, Julien Desfossez wrote:
> After reading more traces and trying to understand why only untagged
> tasks are starving when there are cpu-intensive tasks running on the
> same set of CPUs, we noticed a difference in behavior in ‘pick_task’. In
> the case where ‘core_cookie’ is 0, we are supposed to only prefer the
> tagged task if it’s priority is higher, but when the priorities are
> equal we prefer it as well which causes the starving. ‘pick_task’ is
> biased toward selecting its first parameter in case of equality which in
> this case was the ‘class_pick’ instead of ‘max’. Reversing the order of
> the parameter solves this issue and matches the expected behavior.
>
> So we can get rid of this vruntime_boost concept.
>
> We have tested the fix below and it seems to work well with
> tagged/untagged tasks.
>
My 2 DB instance runs with this patch are better with CORESCHED_STALL_FIX
than NO_CORESCHED_STALL_FIX in terms of performance, standard deviation and
idleness. Maybe enable it by default?

NO_CORESCHED_STALL_FIX:

users     %stdev   %gain %idle
16        25       -42.4 73
24        32       -26.3 67
32        0.2      -48.9 62


CORESCHED_STALL_FIX:

users     %stdev   %gain %idle
16        6.5      -23 70
24        0.6      -17 60
32        1.5      -30.2   52

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-06-13  0:03           ` Subhra Mazumdar
@ 2019-06-13  3:22             ` Julien Desfossez
  2019-06-17  2:51               ` Aubrey Li
  0 siblings, 1 reply; 161+ messages in thread
From: Julien Desfossez @ 2019-06-13  3:22 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: Aaron Lu, Aubrey Li, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Tim Chen, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On 12-Jun-2019 05:03:08 PM, Subhra Mazumdar wrote:
> 
> On 6/12/19 9:33 AM, Julien Desfossez wrote:
> >After reading more traces and trying to understand why only untagged
> >tasks are starving when there are cpu-intensive tasks running on the
> >same set of CPUs, we noticed a difference in behavior in ‘pick_task’. In
> >the case where ‘core_cookie’ is 0, we are supposed to only prefer the
> >tagged task if it’s priority is higher, but when the priorities are
> >equal we prefer it as well which causes the starving. ‘pick_task’ is
> >biased toward selecting its first parameter in case of equality which in
> >this case was the ‘class_pick’ instead of ‘max’. Reversing the order of
> >the parameter solves this issue and matches the expected behavior.
> >
> >So we can get rid of this vruntime_boost concept.
> >
> >We have tested the fix below and it seems to work well with
> >tagged/untagged tasks.
> >
> My 2 DB instance runs with this patch are better with CORESCHED_STALL_FIX
> than NO_CORESCHED_STALL_FIX in terms of performance, std deviation and
> idleness. May be enable it by default?

Yes, if the fix is approved, we will just remove the option and it will
always be enabled.

Thanks,

Julien

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-06-13  3:22             ` Julien Desfossez
@ 2019-06-17  2:51               ` Aubrey Li
  2019-06-19 18:33                 ` Julien Desfossez
  0 siblings, 1 reply; 161+ messages in thread
From: Aubrey Li @ 2019-06-17  2:51 UTC (permalink / raw)
  To: Julien Desfossez
  Cc: Subhra Mazumdar, Aaron Lu, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Thu, Jun 13, 2019 at 11:22 AM Julien Desfossez
<jdesfossez@digitalocean.com> wrote:
>
> On 12-Jun-2019 05:03:08 PM, Subhra Mazumdar wrote:
> >
> > On 6/12/19 9:33 AM, Julien Desfossez wrote:
> > >After reading more traces and trying to understand why only untagged
> > >tasks are starving when there are cpu-intensive tasks running on the
> > >same set of CPUs, we noticed a difference in behavior in ‘pick_task’. In
> > >the case where ‘core_cookie’ is 0, we are supposed to only prefer the
> > >tagged task if it’s priority is higher, but when the priorities are
> > >equal we prefer it as well which causes the starving. ‘pick_task’ is
> > >biased toward selecting its first parameter in case of equality which in
> > >this case was the ‘class_pick’ instead of ‘max’. Reversing the order of
> > >the parameter solves this issue and matches the expected behavior.
> > >
> > >So we can get rid of this vruntime_boost concept.
> > >
> > >We have tested the fix below and it seems to work well with
> > >tagged/untagged tasks.
> > >
> > My 2 DB instance runs with this patch are better with CORESCHED_STALL_FIX
> > than NO_CORESCHED_STALL_FIX in terms of performance, std deviation and
> > idleness. May be enable it by default?
>
> Yes if the fix is approved, we will just remove the option and it will
> always be enabled.
>

sysbench --report-interval option unveiled something.

benchmark setup
-------------------------
two cgroups, cpuset.cpus = 1,53 (one core, two siblings)
sysbench cpu mode, one thread in cgroup1
sysbench memory mode, one thread in cgroup2

no core scheduling
--------------------------
cpu throughput eps: 405.8, std: 0.14%
mem bandwidth MB/s: 5785.7, std: 0.11%

cgroup1 enable core scheduling(cpu mode)
cgroup2 disable core scheduling(memory mode)
-----------------------------------------------------------------
cpu throughput eps: 8.7, std: 519.2%
mem bandwidth MB/s: 6263.2, std: 9.3%

cgroup1 disable core scheduling(cpu mode)
cgroup2 enable core scheduling(memory mode)
-----------------------------------------------------------------
cpu throughput eps: 468.0, std: 8.7%
mem bandwidth MB/s: 311.6, std: 169.1%

cgroup1 enable core scheduling(cpu mode)
cgroup2 enable core scheduling(memory mode)
----------------------------------------------------------------
cpu throughput eps: 76.4, std: 168.0%
mem bandwidth MB/s: 5388.3, std: 30.9%

The result still looks unfair, and in particular the variance is too high:
----sysbench cpu log ----
----snip----
[ 10s ] thds: 1 eps: 296.00 lat (ms,95%): 2.03
[ 11s ] thds: 1 eps: 0.00 lat (ms,95%): 1170.65
[ 12s ] thds: 1 eps: 1.00 lat (ms,95%): 0.00
[ 13s ] thds: 1 eps: 0.00 lat (ms,95%): 0.00
[ 14s ] thds: 1 eps: 295.91 lat (ms,95%): 2.03
[ 15s ] thds: 1 eps: 1.00 lat (ms,95%): 170.48
[ 16s ] thds: 1 eps: 0.00 lat (ms,95%): 2009.23
[ 17s ] thds: 1 eps: 1.00 lat (ms,95%): 995.51
[ 18s ] thds: 1 eps: 296.00 lat (ms,95%): 2.03
[ 19s ] thds: 1 eps: 1.00 lat (ms,95%): 170.48
[ 20s ] thds: 1 eps: 0.00 lat (ms,95%): 2009.23
----snip----

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-06-17  2:51               ` Aubrey Li
@ 2019-06-19 18:33                 ` Julien Desfossez
  2019-07-18 10:07                   ` Aaron Lu
  0 siblings, 1 reply; 161+ messages in thread
From: Julien Desfossez @ 2019-06-19 18:33 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Subhra Mazumdar, Aaron Lu, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 17-Jun-2019 10:51:27 AM, Aubrey Li wrote:
> The result looks still unfair, and particularly, the variance is too high,

I just want to confirm that I am also seeing the same issue with a
similar setup. I also tried with the priority boost fix we previously
posted, the results are slightly better, but we are still seeing a very
high variance.

On average, the results I get for 10 30-second runs are still much
better than nosmt (both sysbench pinned on the same sibling) for the
memory benchmark, and pretty similar for the CPU benchmark, but the high
variance between runs is indeed concerning.

Still digging :-)

Thanks,

Julien

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-06-19 18:33                 ` Julien Desfossez
@ 2019-07-18 10:07                   ` Aaron Lu
  2019-07-18 23:27                     ` Tim Chen
  2019-07-22 10:26                     ` Aubrey Li
  0 siblings, 2 replies; 161+ messages in thread
From: Aaron Lu @ 2019-07-18 10:07 UTC (permalink / raw)
  To: Julien Desfossez
  Cc: Aubrey Li, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Wed, Jun 19, 2019 at 02:33:02PM -0400, Julien Desfossez wrote:
> On 17-Jun-2019 10:51:27 AM, Aubrey Li wrote:
> > The result looks still unfair, and particularly, the variance is too high,
> 
> I just want to confirm that I am also seeing the same issue with a
> similar setup. I also tried with the priority boost fix we previously
> posted, the results are slightly better, but we are still seeing a very
> high variance.
> 
> On average, the results I get for 10 30-seconds runs are still much
> better than nosmt (both sysbench pinned on the same sibling) for the
> memory benchmark, and pretty similar for the CPU benchmark, but the high
> variance between runs is indeed concerning.

I was thinking to use the util_avg signal to decide which task wins in
__prio_less() in the cross-cpu case. The reason util_avg is chosen
is that it represents how cpu-intensive the task is, so the end
result is that the less cpu-intensive task will preempt the more
cpu-intensive task.

Here is the test I have done to see how util_avg works
(on a single node, 16 cores, 32 cpus vm):
1 Start tmux and then start 3 windows with each running bash;
2 Place two shells into two different cgroups and both have cpu.tag set;
3 Switch to the 1st tmux window, start
  will-it-scale/page_fault1_processes -t 16 -s 30
  in the first tagged shell;
4 Switch to the 2nd tmux window;
5 Start
  will-it-scale/page_fault1_processes -t 16 -s 30
  in the 2nd tagged shell;
6 Switch to the 3rd tmux window;
7 Do some simple things in the 3rd untagged shell like ls to see if
  untagged task is able to proceed;
8 Wait for the two page_fault workloads to finish.

With v3 here, I cannot do step 4 and the later steps, i.e. the 16
page_fault1 processes started in step 3 will occupy all 16 cores and
other tasks do not get a chance to run, including tmux, which made
switching tmux windows impossible.

With the below patch on top of v3 that makes use of util_avg to decide
which task wins, I can do all 8 steps and the final scores of the 2
workloads are: 1796191 and 2199586. The score numbers are not close,
suggesting some unfairness, but I can finish the test now...

Here is the diff(consider it as a POC):

---
 kernel/sched/core.c  | 35 ++---------------------------------
 kernel/sched/fair.c  | 36 ++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  2 ++
 3 files changed, 40 insertions(+), 33 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26fea68f7f54..7557a7bbb481 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -105,25 +105,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 a_vruntime = a->se.vruntime;
-		u64 b_vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			b_vruntime -= task_cfs_rq(b)->min_vruntime;
-			b_vruntime += task_cfs_rq(a)->min_vruntime;
-
-			trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
-				     a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
-				     b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
-
-		}
-
-		return !((s64)(a_vruntime - b_vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+		return cfs_prio_less(a, b);
 
 	return false;
 }
@@ -3663,20 +3646,6 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 	if (!class_pick)
 		return NULL;
 
-	if (!cookie) {
-		/*
-		 * If class_pick is tagged, return it only if it has
-		 * higher priority than max.
-		 */
-		bool max_is_higher = sched_feat(CORESCHED_STALL_FIX) ?
-				     max && !prio_less(max, class_pick) :
-				     max && prio_less(class_pick, max);
-		if (class_pick->core_cookie && max_is_higher)
-			return idle_sched_class.pick_task(rq);
-
-		return class_pick;
-	}
-
 	/*
 	 * If class_pick is idle or matches cookie, return early.
 	 */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 26d29126d6a5..06fb00689db1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10740,3 +10740,39 @@ __init void init_sched_fair_class(void)
 #endif /* SMP */
 
 }
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+	bool samecore = task_cpu(a) == task_cpu(b);
+	struct task_struct *p;
+	s64 delta;
+
+	if (samecore) {
+		/* vruntime is per cfs_rq */
+		while (!is_same_group(sea, seb)) {
+			int sea_depth = sea->depth;
+			int seb_depth = seb->depth;
+
+			if (sea_depth >= seb_depth)
+				sea = parent_entity(sea);
+			if (sea_depth <= seb_depth)
+				seb = parent_entity(seb);
+		}
+
+		delta = (s64)(sea->vruntime - seb->vruntime);
+	} else {
+		/* across cpu: use util_avg to decide which task to run */
+		delta = (s64)(sea->avg.util_avg - seb->avg.util_avg);
+	}
+
+	p = delta > 0 ? b : a;
+	trace_printk("picked %s/%d %s: %Lu %Lu %Ld\n", p->comm, p->pid,
+			samecore ? "vruntime" : "util_avg",
+			samecore ? sea->vruntime : sea->avg.util_avg,
+			samecore ? seb->vruntime : seb->avg.util_avg,
+			delta);
+
+	return delta > 0;
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e91c188a452c..02a6d71704f0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2454,3 +2454,5 @@ static inline bool sched_energy_enabled(void)
 static inline bool sched_energy_enabled(void) { return false; }
 
 #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
-- 
2.19.1.3.ge56e4f7


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-07-18 10:07                   ` Aaron Lu
@ 2019-07-18 23:27                     ` Tim Chen
  2019-07-19  5:52                       ` Aaron Lu
  2019-07-22 10:26                     ` Aubrey Li
  1 sibling, 1 reply; 161+ messages in thread
From: Tim Chen @ 2019-07-18 23:27 UTC (permalink / raw)
  To: Aaron Lu, Julien Desfossez
  Cc: Aubrey Li, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini



On 7/18/19 3:07 AM, Aaron Lu wrote:
> On Wed, Jun 19, 2019 at 02:33:02PM -0400, Julien Desfossez wrote:

> 
> With the below patch on top of v3 that makes use of util_avg to decide
> which task win, I can do all 8 steps and the final scores of the 2
> workloads are: 1796191 and 2199586. The score number are not close,
> suggesting some unfairness, but I can finish the test now...

Aaron,

Do you still see high variance in terms of workload throughput that
was a problem with the previous version?

>
>  
>  }
> +
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +	struct sched_entity *sea = &a->se;
> +	struct sched_entity *seb = &b->se;
> +	bool samecore = task_cpu(a) == task_cpu(b);


Probably "samecpu" instead of "samecore" will be more accurate.
I think task_cpu(a) and task_cpu(b)
can be different, but still belong to the same cpu core.

> +	struct task_struct *p;
> +	s64 delta;
> +
> +	if (samecore) {
> +		/* vruntime is per cfs_rq */
> +		while (!is_same_group(sea, seb)) {
> +			int sea_depth = sea->depth;
> +			int seb_depth = seb->depth;
> +
> +			if (sea_depth >= seb_depth)

Should this be strictly ">" instead of ">=" ?

> +				sea = parent_entity(sea);
> +			if (sea_depth <= seb_depth)

Should use "<" ?

> +				seb = parent_entity(seb);
> +		}
> +
> +		delta = (s64)(sea->vruntime - seb->vruntime);
> +	}
> +

Thanks.

Tim


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-07-18 23:27                     ` Tim Chen
@ 2019-07-19  5:52                       ` Aaron Lu
  2019-07-19 11:48                         ` Aubrey Li
  2019-07-19 18:33                         ` Tim Chen
  0 siblings, 2 replies; 161+ messages in thread
From: Aaron Lu @ 2019-07-19  5:52 UTC (permalink / raw)
  To: Tim Chen
  Cc: Julien Desfossez, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Thu, Jul 18, 2019 at 04:27:19PM -0700, Tim Chen wrote:
> 
> 
> On 7/18/19 3:07 AM, Aaron Lu wrote:
> > On Wed, Jun 19, 2019 at 02:33:02PM -0400, Julien Desfossez wrote:
> 
> > 
> > With the below patch on top of v3 that makes use of util_avg to decide
> > which task win, I can do all 8 steps and the final scores of the 2
> > workloads are: 1796191 and 2199586. The score number are not close,
> > suggesting some unfairness, but I can finish the test now...
> 
> Aaron,
> 
> Do you still see high variance in terms of workload throughput that
> was a problem with the previous version?

Any suggestion how to measure this?
It's not clear how Aubrey did his test, will need to take a look at
sysbench.

> >
> >  
> >  }
> > +
> > +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> > +{
> > +	struct sched_entity *sea = &a->se;
> > +	struct sched_entity *seb = &b->se;
> > +	bool samecore = task_cpu(a) == task_cpu(b);
> 
> 
> Probably "samecpu" instead of "samecore" will be more accurate.
> I think task_cpu(a) and task_cpu(b)
> can be different, but still belong to the same cpu core.

Right, definitely, guess I'm brain damaged.

> 
> > +	struct task_struct *p;
> > +	s64 delta;
> > +
> > +	if (samecore) {
> > +		/* vruntime is per cfs_rq */
> > +		while (!is_same_group(sea, seb)) {
> > +			int sea_depth = sea->depth;
> > +			int seb_depth = seb->depth;
> > +
> > +			if (sea_depth >= seb_depth)
> 
> Should this be strictly ">" instead of ">=" ?

Same depth doesn't necessarily mean same group while the purpose here is
to make sure they are in the same cfs_rq. When they are of the same
depth but in different cfs_rqs, we will continue to go up till we reach
rq->cfs.

> 
> > +				sea = parent_entity(sea);
> > +			if (sea_depth <= seb_depth)
> 
> Should use "<" ?

Ditto here.
When they are of the same depth but not in the same cfs_rq, both se will
move up.

> > +				seb = parent_entity(seb);
> > +		}
> > +
> > +		delta = (s64)(sea->vruntime - seb->vruntime);
> > +	}
> > +

Thanks.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-07-19  5:52                       ` Aaron Lu
@ 2019-07-19 11:48                         ` Aubrey Li
  2019-07-19 18:33                         ` Tim Chen
  1 sibling, 0 replies; 161+ messages in thread
From: Aubrey Li @ 2019-07-19 11:48 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Julien Desfossez, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Fri, Jul 19, 2019 at 1:53 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
>
> On Thu, Jul 18, 2019 at 04:27:19PM -0700, Tim Chen wrote:
> >
> >
> > On 7/18/19 3:07 AM, Aaron Lu wrote:
> > > On Wed, Jun 19, 2019 at 02:33:02PM -0400, Julien Desfossez wrote:
> >
> > >
> > > With the below patch on top of v3 that makes use of util_avg to decide
> > > which task win, I can do all 8 steps and the final scores of the 2
> > > workloads are: 1796191 and 2199586. The score number are not close,
> > > suggesting some unfairness, but I can finish the test now...
> >
> > Aaron,
> >
> > Do you still see high variance in terms of workload throughput that
> > was a problem with the previous version?
>
> Any suggestion how to measure this?
> It's not clear how Aubrey did his test, will need to take a look at
> sysbench.
>

Well, thanks for posting this at the end of my vacation ;)
I'll go back to the office next week and give it a shot.
I actually have a new setup co-locating AVX512 tasks with
sysbench MySQL. Both throughput and latency were unacceptable
on top of v3. Looking forward to seeing the difference with this
patch.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-07-19  5:52                       ` Aaron Lu
  2019-07-19 11:48                         ` Aubrey Li
@ 2019-07-19 18:33                         ` Tim Chen
  1 sibling, 0 replies; 161+ messages in thread
From: Tim Chen @ 2019-07-19 18:33 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Julien Desfossez, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 7/18/19 10:52 PM, Aaron Lu wrote:
> On Thu, Jul 18, 2019 at 04:27:19PM -0700, Tim Chen wrote:
>>
>>
>> On 7/18/19 3:07 AM, Aaron Lu wrote:
>>> On Wed, Jun 19, 2019 at 02:33:02PM -0400, Julien Desfossez wrote:
>>
>>>
>>> With the below patch on top of v3 that makes use of util_avg to decide
>>> which task win, I can do all 8 steps and the final scores of the 2
>>> workloads are: 1796191 and 2199586. The score number are not close,
>>> suggesting some unfairness, but I can finish the test now...
>>
>> Aaron,
>>
>> Do you still see high variance in terms of workload throughput that
>> was a problem with the previous version?
> 
> Any suggestion how to measure this?
> It's not clear how Aubrey did his test, will need to take a look at
> sysbench.
> 
>>>
>>>  
>>>  }
>>> +
>>> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
>>> +{
>>> +	struct sched_entity *sea = &a->se;
>>> +	struct sched_entity *seb = &b->se;
>>> +	bool samecore = task_cpu(a) == task_cpu(b);
>>
>>
>> Probably "samecpu" instead of "samecore" will be more accurate.
>> I think task_cpu(a) and task_cpu(b)
>> can be different, but still belong to the same cpu core.
> 
> Right, definitely, guess I'm brain damaged.
> 
>>
>>> +	struct task_struct *p;
>>> +	s64 delta;
>>> +
>>> +	if (samecore) {
>>> +		/* vruntime is per cfs_rq */
>>> +		while (!is_same_group(sea, seb)) {
>>> +			int sea_depth = sea->depth;
>>> +			int seb_depth = seb->depth;
>>> +
>>> +			if (sea_depth >= seb_depth)
>>
>> Should this be strictly ">" instead of ">=" ?
> 
> Same depth doesn't necessarily mean same group while the purpose here is
> to make sure they are in the same cfs_rq. When they are of the same
> depth but in different cfs_rqs, we will continue to go up till we reach
> rq->cfs.

Ah, I see what you are doing now.  Thanks for the clarification.

Tim

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-07-18 10:07                   ` Aaron Lu
  2019-07-18 23:27                     ` Tim Chen
@ 2019-07-22 10:26                     ` Aubrey Li
  2019-07-22 10:43                       ` Aaron Lu
  2019-07-25 14:30                       ` Aaron Lu
  1 sibling, 2 replies; 161+ messages in thread
From: Aubrey Li @ 2019-07-22 10:26 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Julien Desfossez, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Thu, Jul 18, 2019 at 6:07 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
>
> On Wed, Jun 19, 2019 at 02:33:02PM -0400, Julien Desfossez wrote:
> > On 17-Jun-2019 10:51:27 AM, Aubrey Li wrote:
> > > The result looks still unfair, and particularly, the variance is too high,
> >
> > I just want to confirm that I am also seeing the same issue with a
> > similar setup. I also tried with the priority boost fix we previously
> > posted, the results are slightly better, but we are still seeing a very
> > high variance.
> >
> > On average, the results I get for 10 30-seconds runs are still much
> > better than nosmt (both sysbench pinned on the same sibling) for the
> > memory benchmark, and pretty similar for the CPU benchmark, but the high
> > variance between runs is indeed concerning.
>
> I was thinking of using the util_avg signal to decide which task wins
> in __prio_less() in the cross cpu case. util_avg is chosen because it
> represents how cpu intensive the task is, so the end result is that a
> less cpu intensive task will preempt a more cpu intensive task.
>
> Here is the test I have done to see how util_avg works
> (on a single node, 16 cores, 32 cpus vm):
> 1 Start tmux and then start 3 windows with each running bash;
> 2 Place two shells into two different cgroups and both have cpu.tag set;
> 3 Switch to the 1st tmux window, start
>   will-it-scale/page_fault1_processes -t 16 -s 30
>   in the first tagged shell;
> 4 Switch to the 2nd tmux window;
> 5 Start
>   will-it-scale/page_fault1_processes -t 16 -s 30
>   in the 2nd tagged shell;
> 6 Switch to the 3rd tmux window;
> 7 Do some simple things in the 3rd untagged shell like ls to see if
>   untagged task is able to proceed;
> 8 Wait for the two page_fault workloads to finish.
>
> With v3 here, I cannot do step 4 and later steps, i.e. the 16
> page_fault1 processes started in step 3 will occupy all 16 cores and
> other tasks do not get a chance to run, including tmux, which makes
> switching tmux windows impossible.
>
> With the below patch on top of v3, which makes use of util_avg to decide
> which task wins, I can do all 8 steps and the final scores of the 2
> workloads are 1796191 and 2199586. The score numbers are not close,
> suggesting some unfairness, but I can finish the test now...
>
> Here is the diff (consider it a POC):
>
> ---
>  kernel/sched/core.c  | 35 ++---------------------------------
>  kernel/sched/fair.c  | 36 ++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h |  2 ++
>  3 files changed, 40 insertions(+), 33 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26fea68f7f54..7557a7bbb481 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -105,25 +105,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
>         if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
>                 return !dl_time_before(a->dl.deadline, b->dl.deadline);
>
> -       if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
> -               u64 a_vruntime = a->se.vruntime;
> -               u64 b_vruntime = b->se.vruntime;
> -
> -               /*
> -                * Normalize the vruntime if tasks are in different cpus.
> -                */
> -               if (task_cpu(a) != task_cpu(b)) {
> -                       b_vruntime -= task_cfs_rq(b)->min_vruntime;
> -                       b_vruntime += task_cfs_rq(a)->min_vruntime;
> -
> -                       trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
> -                                    a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
> -                                    b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
> -
> -               }
> -
> -               return !((s64)(a_vruntime - b_vruntime) <= 0);
> -       }
> +       if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
> +               return cfs_prio_less(a, b);
>
>         return false;
>  }
> @@ -3663,20 +3646,6 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
>         if (!class_pick)
>                 return NULL;
>
> -       if (!cookie) {
> -               /*
> -                * If class_pick is tagged, return it only if it has
> -                * higher priority than max.
> -                */
> -               bool max_is_higher = sched_feat(CORESCHED_STALL_FIX) ?
> -                                    max && !prio_less(max, class_pick) :
> -                                    max && prio_less(class_pick, max);
> -               if (class_pick->core_cookie && max_is_higher)
> -                       return idle_sched_class.pick_task(rq);
> -
> -               return class_pick;
> -       }
> -
>         /*
>          * If class_pick is idle or matches cookie, return early.
>          */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 26d29126d6a5..06fb00689db1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10740,3 +10740,39 @@ __init void init_sched_fair_class(void)
>  #endif /* SMP */
>
>  }
> +
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +       struct sched_entity *sea = &a->se;
> +       struct sched_entity *seb = &b->se;
> +       bool samecore = task_cpu(a) == task_cpu(b);
> +       struct task_struct *p;
> +       s64 delta;
> +
> +       if (samecore) {
> +               /* vruntime is per cfs_rq */
> +               while (!is_same_group(sea, seb)) {
> +                       int sea_depth = sea->depth;
> +                       int seb_depth = seb->depth;
> +
> +                       if (sea_depth >= seb_depth)
> +                               sea = parent_entity(sea);
> +                       if (sea_depth <= seb_depth)
> +                               seb = parent_entity(seb);
> +               }
> +
> +               delta = (s64)(sea->vruntime - seb->vruntime);
> +       }
> +
> +       /* across cpu: use util_avg to decide which task to run */
> +       delta = (s64)(sea->avg.util_avg - seb->avg.util_avg);

The granularity period of util_avg seems too large to decide task priority
during pick_task(); at least in my case, cfs_prio_less() always picked the
core max task, so pick_task() eventually picked idle, which makes this
change not very helpful for my case.

 <idle>-0     [057] dN..    83.716973: __schedule: max: sysbench/2578
ffff889050f68600
 <idle>-0     [057] dN..    83.716974: __schedule:
(swapper/5/0;140,0,0) ?< (mysqld/2511;119,1042118143,0)
 <idle>-0     [057] dN..    83.716975: __schedule:
(sysbench/2578;119,96449836,0) ?< (mysqld/2511;119,1042118143,0)
 <idle>-0     [057] dN..    83.716975: cfs_prio_less: picked
sysbench/2578 util_avg: 20 527 -507 <======= here===
 <idle>-0     [057] dN..    83.716976: __schedule: pick_task cookie
pick swapper/5/0 ffff889050f68600
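
To put a number on how slowly util_avg reacts: with a PELT half-life of
about 32ms (an assumption here, matching the kernel's default), a task
that just stopped running still looks "heavy" for tens of milliseconds,
so a per-pick comparison keeps choosing the same winner. A toy model of
the decay, purely illustrative and not kernel code:

```c
#include <math.h>

/* Toy model of PELT-style decay: util_avg roughly halves for every
 * 32ms a task stays idle, so a recently-busy task keeps a large
 * util_avg long after it stopped running. */
static double toy_util_decay(double util, int idle_ms)
{
	/* per-millisecond decay factor giving a 32 ms half-life */
	const double d = pow(0.5, 1.0 / 32.0);

	while (idle_ms-- > 0)
		util *= d;	/* idle task contributes nothing */
	return util;
}
```

So a fully busy task (util ~1024) still shows util ~512 after 32ms of
idling, far coarser than the per-schedule decisions pick_task() makes.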

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-07-22 10:26                     ` Aubrey Li
@ 2019-07-22 10:43                       ` Aaron Lu
  2019-07-23  2:52                         ` Aubrey Li
  2019-07-25 14:30                       ` Aaron Lu
  1 sibling, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-07-22 10:43 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Julien Desfossez, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 2019/7/22 18:26, Aubrey Li wrote:
> The granularity period of util_avg seems too large to decide task priority
> during pick_task(), at least it is in my case, cfs_prio_less() always picked
> core max task, so pick_task() eventually picked idle, which causes this change
> not very helpful for my case.
> 
>  <idle>-0     [057] dN..    83.716973: __schedule: max: sysbench/2578
> ffff889050f68600
>  <idle>-0     [057] dN..    83.716974: __schedule:
> (swapper/5/0;140,0,0) ?< (mysqld/2511;119,1042118143,0)
>  <idle>-0     [057] dN..    83.716975: __schedule:
> (sysbench/2578;119,96449836,0) ?< (mysqld/2511;119,1042118143,0)
>  <idle>-0     [057] dN..    83.716975: cfs_prio_less: picked
> sysbench/2578 util_avg: 20 527 -507 <======= here===
>  <idle>-0     [057] dN..    83.716976: __schedule: pick_task cookie
> pick swapper/5/0 ffff889050f68600

Can you share your setup of the test? I would like to try it locally.
Thanks.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-07-22 10:43                       ` Aaron Lu
@ 2019-07-23  2:52                         ` Aubrey Li
  0 siblings, 0 replies; 161+ messages in thread
From: Aubrey Li @ 2019-07-23  2:52 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Julien Desfossez, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Mon, Jul 22, 2019 at 6:43 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
>
> On 2019/7/22 18:26, Aubrey Li wrote:
> > The granularity period of util_avg seems too large to decide task priority
> > during pick_task(), at least it is in my case, cfs_prio_less() always picked
> > core max task, so pick_task() eventually picked idle, which causes this change
> > not very helpful for my case.
> >
> >  <idle>-0     [057] dN..    83.716973: __schedule: max: sysbench/2578
> > ffff889050f68600
> >  <idle>-0     [057] dN..    83.716974: __schedule:
> > (swapper/5/0;140,0,0) ?< (mysqld/2511;119,1042118143,0)
> >  <idle>-0     [057] dN..    83.716975: __schedule:
> > (sysbench/2578;119,96449836,0) ?< (mysqld/2511;119,1042118143,0)
> >  <idle>-0     [057] dN..    83.716975: cfs_prio_less: picked
> > sysbench/2578 util_avg: 20 527 -507 <======= here===
> >  <idle>-0     [057] dN..    83.716976: __schedule: pick_task cookie
> > pick swapper/5/0 ffff889050f68600
>
> Can you share your setup of the test? I would like to try it locally.

My setup is a co-location of AVX512 tasks (gemmbench) and non-AVX512 tasks
(sysbench MySQL). Let me simplify it and send it offline.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-07-22 10:26                     ` Aubrey Li
  2019-07-22 10:43                       ` Aaron Lu
@ 2019-07-25 14:30                       ` Aaron Lu
  2019-07-25 14:31                         ` [RFC PATCH 1/3] wrapper for cfs_rq->min_vruntime Aaron Lu
                                           ` (4 more replies)
  1 sibling, 5 replies; 161+ messages in thread
From: Aaron Lu @ 2019-07-25 14:30 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Julien Desfossez, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Mon, Jul 22, 2019 at 06:26:46PM +0800, Aubrey Li wrote:
> The granularity period of util_avg seems too large to decide task priority
> during pick_task(), at least it is in my case, cfs_prio_less() always picked
> core max task, so pick_task() eventually picked idle, which causes this change
> not very helpful for my case.
> 
>  <idle>-0     [057] dN..    83.716973: __schedule: max: sysbench/2578
> ffff889050f68600
>  <idle>-0     [057] dN..    83.716974: __schedule:
> (swapper/5/0;140,0,0) ?< (mysqld/2511;119,1042118143,0)
>  <idle>-0     [057] dN..    83.716975: __schedule:
> (sysbench/2578;119,96449836,0) ?< (mysqld/2511;119,1042118143,0)
>  <idle>-0     [057] dN..    83.716975: cfs_prio_less: picked
> sysbench/2578 util_avg: 20 527 -507 <======= here===
>  <idle>-0     [057] dN..    83.716976: __schedule: pick_task cookie
> pick swapper/5/0 ffff889050f68600

I tried a different approach based on vruntime with 3 patches following.

When the two tasks are on the same CPU, no change is made; I still route
the two sched entities up until they are in the same group (cfs_rq) and
then do the vruntime comparison.

When the two tasks are on different threads of the same core, the root
level sched_entities to which the two tasks belong will be used to do
the comparison.

An ugly illustration for the cross CPU case:

   cpu0         cpu1
 /   |  \     /   |  \
se1 se2 se3  se4 se5 se6
    /  \            /   \
  se21 se22       se61  se62

Assume CPU0 and CPU1 are smt siblings and task A's se is se21 while
task B's se is se61. To compare the priority of tasks A and B, we compare
the priority of se2 and se6. The smaller vruntime wins.

To make this work, the root level ses on both CPUs should have a common
cfs_rq min vruntime, which I call the core cfs_rq min vruntime.

This is mostly done in patch2/3.
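
The cross-CPU half of the comparison can be sketched as follows. This is
a standalone toy with a simplified entity type and made-up names
(toy_se, toy_prio_less), not the actual kernel structures; the real code
is cfs_prio_less() in patch 2:

```c
#include <stddef.h>

/* Toy stand-in for the kernel's sched_entity, keeping only the fields
 * the comparison needs. Illustrative names, not kernel API. */
struct toy_se {
	unsigned long long vruntime;
	struct toy_se *parent;	/* NULL at the root level */
};

/* Return nonzero when task a has lower priority than task b, i.e.
 * when b's root-level vruntime is smaller. Both entities are walked
 * up to the root, whose vruntimes are assumed to share a common
 * per-core min_vruntime base. The signed subtraction keeps the
 * comparison correct across unsigned 64-bit wraparound. */
static int toy_prio_less(struct toy_se *sea, struct toy_se *seb)
{
	while (sea->parent)
		sea = sea->parent;
	while (seb->parent)
		seb = seb->parent;

	return (long long)(sea->vruntime - seb->vruntime) > 0;
}
```

In the illustration above, comparing se21 against se61 reduces to
comparing the vruntimes of se2 and se6.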

Test:
1 wrote a cpu intensive program that does nothing but while(1) in
  main(), let's call it cpuhog;
2 start 2 cgroups, with one cgroup's cpuset bound to CPU2 and the
  other bound to CPU3. CPU2 and CPU3 are smt siblings on the test VM;
3 enable cpu.tag for the two cgroups;
4 start one cpuhog task in each cgroup;
5 kill both cpuhog tasks after 10 seconds;
6 check each cgroup's cpu usage.

If the tasks are scheduled fairly, then each cgroup's cpu usage should be
around 5s.
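
For completeness, the cpuhog from step 1 is just a busy loop. A minimal
sketch follows; the function form and duration parameter are additions
here so the loop can terminate on its own, whereas the original simply
ran while(1) in main() and was killed externally:

```c
#include <time.h>

/* Minimal "cpuhog": burn CPU in a tight loop for the given number of
 * seconds, returning the iteration count so the work isn't optimized
 * away. */
static long spin_for_seconds(int secs)
{
	time_t end = time(NULL) + secs;
	long iters = 0;

	while (time(NULL) < end)
		iters++;
	return iters;
}
```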

With v3, the cpu usage of the two cgroups is sometimes 3s and 7s,
sometimes 1s and 9s.

With the 3 patches applied, the numbers are mostly around 5s, 5s.

Another test is starting two cgroups simultaneously with cpu.tag set,
with one cgroup running: will-it-scale/page_fault1_processes -t 16 -s 30,
the other running: will-it-scale/page_fault2_processes -t 16 -s 30.
With v3, like I said last time, the later-started page_fault processes
can't start running. With the 3 patches applied, both run at the
same time, with each CPU having a relatively fair score:

output line of 16 page_fault1 processes in 1 second interval:
min:105225 max:131716 total:1872322

output line of 16 page_fault2 processes in 1 second interval:
min:86797 max:110554 total:1581177

Note the values in min and max: the smaller the gap is, the better the
fairness is.

Aubrey,

I haven't been able to run your workload yet...

^ permalink raw reply	[flat|nested] 161+ messages in thread

* [RFC PATCH 1/3] wrapper for cfs_rq->min_vruntime
  2019-07-25 14:30                       ` Aaron Lu
@ 2019-07-25 14:31                         ` Aaron Lu
  2019-07-25 14:32                         ` [PATCH 2/3] core vruntime comparison Aaron Lu
                                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 161+ messages in thread
From: Aaron Lu @ 2019-07-25 14:31 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Julien Desfossez, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

Add a wrapper function cfs_rq_min_vruntime(cfs_rq) to
return cfs_rq->min_vruntime.

It will be used in the following patch; no functional
change.

Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
---
 kernel/sched/fair.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 26d29126d6a5..a7b26c96f46b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -431,6 +431,11 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->min_vruntime;
+}
+
 static __always_inline
 void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
 
@@ -467,7 +472,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 	struct sched_entity *curr = cfs_rq->curr;
 	struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);
 
-	u64 vruntime = cfs_rq->min_vruntime;
+	u64 vruntime = cfs_rq_min_vruntime(cfs_rq);
 
 	if (curr) {
 		if (curr->on_rq)
@@ -487,7 +492,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 	}
 
 	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
+	cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
 #ifndef CONFIG_64BIT
 	smp_wmb();
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
@@ -3742,7 +3747,7 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
 static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 #ifdef CONFIG_SCHED_DEBUG
-	s64 d = se->vruntime - cfs_rq->min_vruntime;
+	s64 d = se->vruntime - cfs_rq_min_vruntime(cfs_rq);
 
 	if (d < 0)
 		d = -d;
@@ -3755,7 +3760,7 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
-	u64 vruntime = cfs_rq->min_vruntime;
+	u64 vruntime = cfs_rq_min_vruntime(cfs_rq);
 
 	/*
 	 * The 'current' period is already promised to the current tasks,
@@ -3848,7 +3853,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * update_curr().
 	 */
 	if (renorm && curr)
-		se->vruntime += cfs_rq->min_vruntime;
+		se->vruntime += cfs_rq_min_vruntime(cfs_rq);
 
 	update_curr(cfs_rq);
 
@@ -3859,7 +3864,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * fairness detriment of existing tasks.
 	 */
 	if (renorm && !curr)
-		se->vruntime += cfs_rq->min_vruntime;
+		se->vruntime += cfs_rq_min_vruntime(cfs_rq);
 
 	/*
 	 * When enqueuing a sched_entity, we must:
@@ -3972,7 +3977,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * can move min_vruntime forward still more.
 	 */
 	if (!(flags & DEQUEUE_SLEEP))
-		se->vruntime -= cfs_rq->min_vruntime;
+		se->vruntime -= cfs_rq_min_vruntime(cfs_rq);
 
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
@@ -6722,7 +6727,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 			min_vruntime = cfs_rq->min_vruntime;
 		} while (min_vruntime != min_vruntime_copy);
 #else
-		min_vruntime = cfs_rq->min_vruntime;
+		min_vruntime = cfs_rq_min_vruntime(cfs_rq);
 #endif
 
 		se->vruntime -= min_vruntime;
@@ -10215,7 +10220,7 @@ static void task_fork_fair(struct task_struct *p)
 		resched_curr(rq);
 	}
 
-	se->vruntime -= cfs_rq->min_vruntime;
+	se->vruntime -= cfs_rq_min_vruntime(cfs_rq);
 	rq_unlock(rq, &rf);
 }
 
@@ -10335,7 +10340,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
 		 * cause 'unlimited' sleep bonus.
 		 */
 		place_entity(cfs_rq, se, 0);
-		se->vruntime -= cfs_rq->min_vruntime;
+		se->vruntime -= cfs_rq_min_vruntime(cfs_rq);
 	}
 
 	detach_entity_cfs_rq(se);
@@ -10349,7 +10354,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
 	attach_entity_cfs_rq(se);
 
 	if (!vruntime_normalized(p))
-		se->vruntime += cfs_rq->min_vruntime;
+		se->vruntime += cfs_rq_min_vruntime(cfs_rq);
 }
 
 static void switched_from_fair(struct rq *rq, struct task_struct *p)
-- 
2.19.1.3.ge56e4f7


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [PATCH 2/3] core vruntime comparison
  2019-07-25 14:30                       ` Aaron Lu
  2019-07-25 14:31                         ` [RFC PATCH 1/3] wrapper for cfs_rq->min_vruntime Aaron Lu
@ 2019-07-25 14:32                         ` Aaron Lu
  2019-08-06 14:17                           ` Peter Zijlstra
  2019-07-25 14:33                         ` [PATCH 3/3] temp hack to make tick based schedule happen Aaron Lu
                                           ` (2 subsequent siblings)
  4 siblings, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-07-25 14:32 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Julien Desfossez, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

This patch provides a vruntime based way to compare two cfs tasks'
priority, be it on the same cpu or on different threads of the same core.

When the two tasks are on the same CPU, we just need to find a common
cfs_rq both sched_entities are on and then do the comparison.

When the two tasks are on different threads of the same core, the root
level sched_entities to which the two tasks belong will be used to do
the comparison.

An ugly illustration for the cross CPU case:

   cpu0         cpu1
 /   |  \     /   |  \
se1 se2 se3  se4 se5 se6
    /  \            /   \
  se21 se22       se61  se62

Assume CPU0 and CPU1 are smt siblings and task A's se is se21 while
task B's se is se61. To compare the priority of tasks A and B, we compare
the priority of se2 and se6. The one with the smaller vruntime wins.

To make this work, the root level se should have a common cfs_rq min
vruntime, which I call the core cfs_rq min vruntime.

Potential issues: when core scheduling is enabled, if there are tasks
already in some CPU's rq, then new tasks will be queued with the per-core
cfs_rq min vruntime while the old tasks are using the original root
level cfs_rq's min_vruntime. The two values can differ greatly and can
cause tasks with a large vruntime to starve. So, for the time being,
enable core scheduling early while the system is still mostly idle to
avoid this problem.

Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
---
 kernel/sched/core.c  | 15 ++-------
 kernel/sched/fair.c  | 79 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  2 ++
 3 files changed, 82 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 90655c9ad937..bc746ea4cc82 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -105,19 +105,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
-		}
-
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+		return cfs_prio_less(a, b);
 
 	return false;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a7b26c96f46b..43babc2a12a5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -431,9 +431,85 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static inline struct cfs_rq *root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->cfs;
+}
+
+static inline bool is_root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq == root_cfs_rq(cfs_rq);
+}
+
+static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->core->cfs;
+}
+
 static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->min_vruntime;
+	if (!sched_core_enabled(rq_of(cfs_rq)))
+		return cfs_rq->min_vruntime;
+
+	if (is_root_cfs_rq(cfs_rq))
+		return core_cfs_rq(cfs_rq)->min_vruntime;
+	else
+		return cfs_rq->min_vruntime;
+}
+
+static void update_core_cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
+{
+	struct cfs_rq *cfs_rq_core;
+
+	if (!sched_core_enabled(rq_of(cfs_rq)))
+		return;
+
+	if (!is_root_cfs_rq(cfs_rq))
+		return;
+
+	cfs_rq_core = core_cfs_rq(cfs_rq);
+	cfs_rq_core->min_vruntime = max(cfs_rq_core->min_vruntime,
+					cfs_rq->min_vruntime);
+}
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+	bool samecpu = task_cpu(a) == task_cpu(b);
+	struct task_struct *p;
+	s64 delta;
+
+	if (samecpu) {
+		/* vruntime is per cfs_rq */
+		while (!is_same_group(sea, seb)) {
+			int sea_depth = sea->depth;
+			int seb_depth = seb->depth;
+
+			if (sea_depth >= seb_depth)
+				sea = parent_entity(sea);
+			if (sea_depth <= seb_depth)
+				seb = parent_entity(seb);
+		}
+
+		delta = (s64)(sea->vruntime - seb->vruntime);
+		goto out;
+	}
+
+	/* crosscpu: compare root level se's vruntime to decide priority */
+	while (sea->parent)
+		sea = sea->parent;
+	while (seb->parent)
+		seb = seb->parent;
+	delta = (s64)(sea->vruntime - seb->vruntime);
+
+out:
+	p = delta > 0 ? b : a;
+	trace_printk("picked %s/%d %s: %Ld %Ld %Ld\n", p->comm, p->pid,
+			samecpu ? "samecpu" : "crosscpu",
+			sea->vruntime, seb->vruntime, delta);
+
+	return delta > 0;
 }
 
 static __always_inline
@@ -493,6 +569,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 
 	/* ensure we never gain time by being placed backwards. */
 	cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
+	update_core_cfs_rq_min_vruntime(cfs_rq);
 #ifndef CONFIG_64BIT
 	smp_wmb();
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e91c188a452c..02a6d71704f0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2454,3 +2454,5 @@ static inline bool sched_energy_enabled(void)
 static inline bool sched_energy_enabled(void) { return false; }
 
 #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
-- 
2.19.1.3.ge56e4f7


^ permalink raw reply	[flat|nested] 161+ messages in thread

* [PATCH 3/3] temp hack to make tick based schedule happen
  2019-07-25 14:30                       ` Aaron Lu
  2019-07-25 14:31                         ` [RFC PATCH 1/3] wrapper for cfs_rq->min_vruntime Aaron Lu
  2019-07-25 14:32                         ` [PATCH 2/3] core vruntime comparison Aaron Lu
@ 2019-07-25 14:33                         ` Aaron Lu
  2019-07-25 21:42                         ` [RFC PATCH v3 00/16] Core scheduling v3 Li, Aubrey
  2019-07-26 15:21                         ` Julien Desfossez
  4 siblings, 0 replies; 161+ messages in thread
From: Aaron Lu @ 2019-07-25 14:33 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Julien Desfossez, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

When a hyperthread is forced idle and the other hyperthread has a single
CPU intensive task running, the running task can occupy the hyperthread
for a long time with no scheduling point and starve the other
hyperthread.

Fix this temporarily by always checking whether the task has exceeded its
timeslice and, if so, doing a schedule.

Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
---
 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43babc2a12a5..730c9359e9c9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4093,6 +4093,9 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 		return;
 	}
 
+	if (cfs_rq->nr_running <= 1)
+		return;
+
 	/*
 	 * Ensure that a task that missed wakeup preemption by a
 	 * narrow margin doesn't have to wait for a full slice.
@@ -4261,8 +4264,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 		return;
 #endif
 
-	if (cfs_rq->nr_running > 1)
-		check_preempt_tick(cfs_rq, curr);
+	check_preempt_tick(cfs_rq, curr);
 }
 
 
-- 
2.19.1.3.ge56e4f7


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-07-25 14:30                       ` Aaron Lu
                                           ` (2 preceding siblings ...)
  2019-07-25 14:33                         ` [PATCH 3/3] temp hack to make tick based schedule happen Aaron Lu
@ 2019-07-25 21:42                         ` Li, Aubrey
  2019-07-26 15:21                         ` Julien Desfossez
  4 siblings, 0 replies; 161+ messages in thread
From: Li, Aubrey @ 2019-07-25 21:42 UTC (permalink / raw)
  To: Aaron Lu, Aubrey Li
  Cc: Julien Desfossez, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 2019/7/25 22:30, Aaron Lu wrote:
> On Mon, Jul 22, 2019 at 06:26:46PM +0800, Aubrey Li wrote:
>> The granularity period of util_avg seems too large to decide task priority
>> during pick_task(), at least it is in my case, cfs_prio_less() always picked
>> core max task, so pick_task() eventually picked idle, which causes this change
>> not very helpful for my case.
>>
>>  <idle>-0     [057] dN..    83.716973: __schedule: max: sysbench/2578
>> ffff889050f68600
>>  <idle>-0     [057] dN..    83.716974: __schedule:
>> (swapper/5/0;140,0,0) ?< (mysqld/2511;119,1042118143,0)
>>  <idle>-0     [057] dN..    83.716975: __schedule:
>> (sysbench/2578;119,96449836,0) ?< (mysqld/2511;119,1042118143,0)
>>  <idle>-0     [057] dN..    83.716975: cfs_prio_less: picked
>> sysbench/2578 util_avg: 20 527 -507 <======= here===
>>  <idle>-0     [057] dN..    83.716976: __schedule: pick_task cookie
>> pick swapper/5/0 ffff889050f68600
> 
> I tried a different approach based on vruntime with 3 patches following.
> 
> When the two tasks are on the same CPU, no change is made, I still route
> the two sched entities up till they are in the same group(cfs_rq) and
> then do the vruntime comparison.
> 
> When the two tasks are on differen threads of the same core, the root
> level sched_entities to which the two tasks belong will be used to do
> the comparison.
> 
> An ugly illustration for the cross CPU case:
> 
>    cpu0         cpu1
>  /   |  \     /   |  \
> se1 se2 se3  se4 se5 se6
>     /  \            /   \
>   se21 se22       se61  se62
> 
> Assume CPU0 and CPU1 are smt siblings and task A's se is se21 while
> task B's se is se61. To compare priority of task A and B, we compare
> priority of se2 and se6. The smaller vruntime wins.
> 
> To make this work, the root level ses on both CPU should have a common
> cfs_rq min vuntime, which I call it the core cfs_rq min vruntime.
> 
> This is mostly done in patch2/3.
> 
> Test:
> 1 wrote an cpu intensive program that does nothing but while(1) in
>   main(), let's call it cpuhog;
> 2 start 2 cgroups, with one cgroup's cpuset binding to CPU2 and the
>   other binding to cpu3. cpu2 and cpu3 are smt siblings on the test VM;
> 3 enable cpu.tag for the two cgroups;
> 4 start one cpuhog task in each cgroup;
> 5 kill both cpuhog tasks after 10 seconds;
> 6 check each cgroup's cpu usage.
> 
> If the task is scheduled fairly, then each cgroup's cpu usage should be
> around 5s.
> 
> With v3, the cpu usage of both cgroups are sometimes 3s, 7s; sometimes
> 1s, 9s.
> 
> With the 3 patches applied, the numbers are mostly around 5s, 5s.
> 
> Another test is starting two cgroups simultaneously with cpu.tag set,
> with one cgroup running: will-it-scale/page_fault1_processes -t 16 -s 30,
> the other running: will-it-scale/page_fault2_processes -t 16 -s 30.
> With v3, like I said last time, the later started page_fault processes
> can't start running. With the 3 patches applied, both running at the
> same time with each CPU having a relatively fair score:
> 
> output line of 16 page_fault1 processes in 1 second interval:
> min:105225 max:131716 total:1872322
> 
> output line of 16 page_fault2 processes in 1 second interval:
> min:86797 max:110554 total:1581177
> 
> Note the value in min and max, the smaller the gap is, the better the
> faireness is.
> 
> Aubrey,
> 
> I haven't been able to run your workload yet...
> 

No worry, let me try to see how it works.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-07-25 14:30                       ` Aaron Lu
                                           ` (3 preceding siblings ...)
  2019-07-25 21:42                         ` [RFC PATCH v3 00/16] Core scheduling v3 Li, Aubrey
@ 2019-07-26 15:21                         ` Julien Desfossez
  2019-07-26 21:29                           ` Tim Chen
  2019-07-31  2:42                           ` Li, Aubrey
  4 siblings, 2 replies; 161+ messages in thread
From: Julien Desfossez @ 2019-07-26 15:21 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Aubrey Li, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 25-Jul-2019 10:30:03 PM, Aaron Lu wrote:
> 
> I tried a different approach based on vruntime with 3 patches following.
[...]

We have experimented with this new patchset and indeed the fairness is
now much better. Interactive tasks with v3 were completely starved when
there were cpu-intensive tasks running; now they can run consistently.
With my initial test of TPC-C running in large VMs with a lot of
background noise VMs, the results are pretty similar to v3. I will run
more thorough tests and report the results back here.

Instead of the 3/3 hack patch, we were already working on a different
approach to solve the same problem. What we have done so far is create a
very low priority per-cpu coresched_idle kernel thread that we use
instead of idle when we can't co-schedule tasks. This gives us more
control and accounting. It still needs some work, but the initial
results are encouraging, I will post more when we have something that
works well.

Thanks,

Julien

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-07-26 15:21                         ` Julien Desfossez
@ 2019-07-26 21:29                           ` Tim Chen
  2019-07-31  2:42                           ` Li, Aubrey
  1 sibling, 0 replies; 161+ messages in thread
From: Tim Chen @ 2019-07-26 21:29 UTC (permalink / raw)
  To: Julien Desfossez, Aaron Lu
  Cc: Aubrey Li, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 7/26/19 8:21 AM, Julien Desfossez wrote:
> On 25-Jul-2019 10:30:03 PM, Aaron Lu wrote:
>>
>> I tried a different approach based on vruntime with 3 patches following.
> [...]
> 
> We have experimented with this new patchset and indeed the fairness is
> now much better. Interactive tasks with v3 were completely starved when
> there were cpu-intensive tasks running; now they can run consistently.
> With my initial test of TPC-C running in large VMs with a lot of
> background noise VMs, the results are pretty similar to v3, I will run
> more thorough tests and report the results back here.

Aaron's patch inspired me to experiment with another approach to tackle
fairness.  The root problem with v3 was that we didn't account for the
forced idle time when comparing the priority of tasks on two sibling
threads.

So what I did here is account the forced idle time in the top cfs_rq's
min_vruntime when we update the runqueue clock.  When comparing two cfs
runqueues, the task on the cpu being forced idle will now be credited
with the forced idle time. The effect should be similar to Aaron's
patches. The logic is a bit simpler, and we don't need to use one of the
siblings' cfs_rq min_vruntime as a time base.

In really limited testing, it seems to have balanced fairness between two
tagged cgroups.
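For anyone trying to follow the idea without wading through the diffs below, here is a toy user-space sketch of the crediting scheme. This is plain Python with invented names, purely illustrative, not kernel code: while a sibling is forced idle, its root cfs_rq's min_vruntime keeps advancing with the rq clock, so a later cross-cpu comparison no longer penalizes the side that was forced idle.

```python
# Toy model of two SMT siblings; all names are invented for illustration.
class RunQueue:
    def __init__(self):
        self.min_vruntime = 0      # root cfs_rq's min_vruntime
        self.core_forceidle = False

    def update_clock(self, delta):
        # The crediting step: forced idle time flows into min_vruntime.
        if self.core_forceidle:
            self.min_vruntime += delta

def prio_less_fair(a_vruntime, a_rq, b_vruntime, b_rq):
    # Normalize b's vruntime into a's timeline for a cross-cpu comparison.
    b_norm = b_vruntime - b_rq.min_vruntime + a_rq.min_vruntime
    return a_vruntime > b_norm     # True: 'a' has lower priority than 'b'

rq_a, rq_b = RunQueue(), RunQueue()
rq_b.core_forceidle = True         # sibling B is being forced idle
rq_b.update_clock(1000)            # ... and gets credited 1000 units
# At equal raw vruntime, B's task now wins the comparison:
print(prio_less_fair(500, rq_a, 500, rq_b))  # → True
```

Without the credit (both runqueues at min_vruntime 0), equal vruntimes compare equal and neither side is preferred.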

Tim

-------patch 1----------
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Wed, 24 Jul 2019 13:58:18 -0700
Subject: [PATCH 1/2] sched: move sched fair prio comparison to fair.c

Move the priority comparison of two tasks in fair class to fair.c.
There is no functional change.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/core.c  | 21 ++-------------------
 kernel/sched/fair.c  | 21 +++++++++++++++++++++
 kernel/sched/sched.h |  1 +
 3 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8ea87be56a1e..f78b8fdfd47c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -105,25 +105,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 a_vruntime = a->se.vruntime;
-		u64 b_vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			b_vruntime -= task_cfs_rq(b)->min_vruntime;
-			b_vruntime += task_cfs_rq(a)->min_vruntime;
-
-			trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
-				     a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
-				     b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
-
-		}
-
-		return !((s64)(a_vruntime - b_vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+		return prio_less_fair(a, b);
 
 	return false;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02bff10237d4..e289b6e1545b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -602,6 +602,27 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
 	return delta;
 }
 
+bool prio_less_fair(struct task_struct *a, struct task_struct *b)
+{
+	u64 a_vruntime = a->se.vruntime;
+	u64 b_vruntime = b->se.vruntime;
+
+	/*
+	 * Normalize the vruntime if tasks are in different cpus.
+	 */
+	if (task_cpu(a) != task_cpu(b)) {
+		b_vruntime -= task_cfs_rq(b)->min_vruntime;
+		b_vruntime += task_cfs_rq(a)->min_vruntime;
+
+		trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
+			     a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
+			     b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
+
+	}
+
+	return !((s64)(a_vruntime - b_vruntime) <= 0);
+}
+
 /*
  * The idea is to set a period in which each task runs once.
  *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e91c188a452c..bdabe7ce1152 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1015,6 +1015,7 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 }
 
 extern void queue_core_balance(struct rq *rq);
+extern bool prio_less_fair(struct task_struct *a, struct task_struct *b);
 
 #else /* !CONFIG_SCHED_CORE */
 
-- 
2.20.1


--------------patch 2------------------
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Thu, 25 Jul 2019 13:09:21 -0700
Subject: [PATCH 2/2] sched: Account the forced idle time

We did not account for the forced idle time when comparing two tasks
from different SMT threads on the same core.

Account it in the root cfs_rq min_vruntime when we update the rq clock.
This allows a fair comparison of which task has higher priority across
the two SMT threads.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/core.c |  6 ++++++
 kernel/sched/fair.c | 22 ++++++++++++++++++----
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f78b8fdfd47c..d8fa74810126 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -393,6 +393,12 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 
 	rq->clock_task += delta;
 
+#ifdef CONFIG_SCHED_CORE
+	/* Account the forced idle time by sibling */
+	if (rq->core_forceidle)
+		rq->cfs.min_vruntime += delta;
+#endif
+
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
 	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
 		update_irq_load_avg(rq, irq_delta + steal);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e289b6e1545b..1b2fd1271c51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -604,20 +604,34 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
 
 bool prio_less_fair(struct task_struct *a, struct task_struct *b)
 {
-	u64 a_vruntime = a->se.vruntime;
-	u64 b_vruntime = b->se.vruntime;
+	u64 a_vruntime;
+	u64 b_vruntime;
 
 	/*
 	 * Normalize the vruntime if tasks are in different cpus.
 	 */
 	if (task_cpu(a) != task_cpu(b)) {
-		b_vruntime -= task_cfs_rq(b)->min_vruntime;
-		b_vruntime += task_cfs_rq(a)->min_vruntime;
+		struct sched_entity *sea = &a->se;
+		struct sched_entity *seb = &b->se;
+
+		while (sea->parent)
+			sea = sea->parent;
+		while (seb->parent)
+			seb = seb->parent;
+
+		a_vruntime = sea->vruntime;
+		b_vruntime = seb->vruntime;
+
+		b_vruntime -= task_rq(b)->cfs.min_vruntime;
+		b_vruntime += task_rq(a)->cfs.min_vruntime;
 
 		trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
 			     a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
 			     b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
 
+	} else {
+		a_vruntime = a->se.vruntime;
+		b_vruntime = b->se.vruntime;
 	}
 
 	return !((s64)(a_vruntime - b_vruntime) <= 0);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-07-26 15:21                         ` Julien Desfossez
  2019-07-26 21:29                           ` Tim Chen
@ 2019-07-31  2:42                           ` Li, Aubrey
  2019-08-02 15:37                             ` Julien Desfossez
  1 sibling, 1 reply; 161+ messages in thread
From: Li, Aubrey @ 2019-07-31  2:42 UTC (permalink / raw)
  To: Julien Desfossez, Aaron Lu
  Cc: Aubrey Li, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 2019/7/26 23:21, Julien Desfossez wrote:
> On 25-Jul-2019 10:30:03 PM, Aaron Lu wrote:
>>
>> I tried a different approach based on vruntime with 3 patches following.
> [...]
> 
> We have experimented with this new patchset and indeed the fairness is
> now much better. Interactive tasks with v3 were completely starved when
> there were cpu-intensive tasks running; now they can run consistently.

Yeah, the fairness is much better now.

For the two cgroups created, I limited both cgroups' cpusets to one core
(two siblings). I still ran gemmbench and sysbench-mysql, and here is
the mysql result:

Latency:
.----------------------------------------------------------------------------------------------.
|NA/AVX	vanilla-SMT	[std% / sem%]	  cpu% |coresched-SMT	[std% / sem%]	  +/-	  cpu% |
|----------------------------------------------------------------------------------------------|
|  1/1	        6.7	[13.8%/ 1.4%]	  2.1% |          6.4	[14.6%/ 1.5%]	  4.0%	  2.0% |
|  2/2	        9.1	[ 5.0%/ 0.5%]	  4.0% |         11.4	[ 6.8%/ 0.7%]	-24.9%	  3.9% |
'----------------------------------------------------------------------------------------------'

Throughput:
.----------------------------------------------------------------------------------------------.
|NA/AVX	vanilla-SMT	[std% / sem%]	  cpu% |coresched-SMT	[std% / sem%]	  +/-	  cpu% |
|----------------------------------------------------------------------------------------------|
|  1/1	      310.2	[ 4.1%/ 0.4%]	  2.1% |        296.2	[ 5.0%/ 0.5%]	 -4.5%	  2.0% | 
|  2/2	      547.7	[ 3.6%/ 0.4%]	  4.0% |        368.3	[ 4.8%/ 0.5%]	-32.8%	  3.9% | 
'----------------------------------------------------------------------------------------------'

Note: the 2/2 case means 4 threads run on one core, which is overloaded (cpu% is the overall system figure).

Though latency/throughput has regressions, the standard deviation is much better now.

> With my initial test of TPC-C running in large VMs with a lot of
> background noise VMs, the results are pretty similar to v3, I will run
> more thorough tests and report the results back here.

I see something similar. I guess task placement could be another problem.
We don't check cookie matching in load balance and task wakeup, so:
- if tasks with different cookies happen to be dispatched onto different
  cores, the result should be good;
- if tasks with different cookies are unfortunately dispatched onto the
  same core, the result should be bad.

This problem is bypassed in my testing setup above, but may be one cause
of my other scenarios; it will take a while to sort out.
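The cookie matching that is missing from load balance and wakeup could, conceptually, look like the following placement heuristic. This is a user-space Python sketch with invented names, not the kernel's actual load-balance code:

```python
def pick_core(cores, cookie):
    # cores[i] is the cookie currently running on core i (None if idle).
    for i, running in enumerate(cores):
        if running == cookie:
            return i          # co-schedule with a matching cookie
    for i, running in enumerate(cores):
        if running is None:
            return i          # otherwise prefer an idle core
    return 0                  # forced mismatch: a sibling gets forced idle

print(pick_core(["A", None], "B"))  # → 1: avoid the core running cookie A
print(pick_core(["A", None], "A"))  # → 0: pair up with the matching cookie
```

Only when every core is busy with a foreign cookie would a task be placed where the sibling must be forced idle, which is the bad case described above.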

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-07-31  2:42                           ` Li, Aubrey
@ 2019-08-02 15:37                             ` Julien Desfossez
  2019-08-05 15:55                               ` Tim Chen
                                                 ` (2 more replies)
  0 siblings, 3 replies; 161+ messages in thread
From: Julien Desfossez @ 2019-08-02 15:37 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Aaron Lu, Aubrey Li, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

We tested both Aaron's and Tim's patches and here are our results.

Test setup:
- 2 1-thread sysbench, one running the cpu benchmark, the other one the
  mem benchmark
- both started at the same time
- both are pinned on the same core (2 hardware threads)
- 10 30-seconds runs
- test script: https://paste.debian.net/plainh/834cf45c
- only showing the CPU events/sec (higher is better)
- tested 4 tag configurations:
  - no tag
  - sysbench mem untagged, sysbench cpu tagged
  - sysbench mem tagged, sysbench cpu untagged
  - both tagged with a different tag
- "Alone" is the sysbench CPU running alone on the core, no tag
- "nosmt" is both sysbench pinned on the same hardware thread, no tag
- "Tim's full patchset + sched" is an experiment with Tim's patchset
  combined with Aaron's "hack patch" to get rid of the remaining deep
  idle cases
- In all test cases, both tasks can run simultaneously (which was not
  the case without those patches), but the standard deviation is a
  pretty good indicator of the fairness/consistency.
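For reference, the Average/Stdev columns below can be reproduced from the per-run events/sec numbers with a few lines of Python. The run values here are made-up placeholders, not the actual measurements:

```python
import statistics

def summarize(runs):
    # runs: CPU events/sec from the repeated 30-second runs; the mean is
    # the throughput, the stdev is the fairness/consistency indicator.
    return round(statistics.mean(runs), 2), round(statistics.stdev(runs), 2)

# Hypothetical per-run numbers, purely for illustration:
print(summarize([828, 830, 827, 829]))  # → (828.5, 1.29)
```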

No tag
------
Test                            Average     Stdev
Alone                           1306.90     0.94
nosmt                           649.95      1.44
Aaron's full patchset:          828.15      32.45
Aaron's first 2 patches:        832.12      36.53
Aaron's 3rd patch alone:        864.21      3.68
Tim's full patchset:            852.50      4.11
Tim's full patchset + sched:    852.59      8.25

Sysbench mem untagged, sysbench cpu tagged
------------------------------------------
Test                            Average     Stdev
Alone                           1306.90     0.94
nosmt                           649.95      1.44
Aaron's full patchset:          586.06      1.77
Aaron's first 2 patches:        630.08      47.30
Aaron's 3rd patch alone:        1086.65     246.54
Tim's full patchset:            852.50      4.11
Tim's full patchset + sched:    390.49      15.76

Sysbench mem tagged, sysbench cpu untagged
------------------------------------------
Test                            Average     Stdev
Alone                           1306.90     0.94
nosmt                           649.95      1.44
Aaron's full patchset:          583.77      3.52
Aaron's first 2 patches:        513.63      63.09
Aaron's 3rd patch alone:        1171.23     3.35
Tim's full patchset:            564.04      58.05
Tim's full patchset + sched:    1026.16     49.43

Both sysbench tagged
--------------------
Test                            Average     Stdev
Alone                           1306.90     0.94
nosmt                           649.95      1.44
Aaron's full patchset:          582.15      3.75
Aaron's first 2 patches:        561.07      91.61
Aaron's 3rd patch alone:        638.49      231.06
Tim's full patchset:            679.43      70.07
Tim's full patchset + sched:    664.34      210.14

So in terms of fairness, Aaron's full patchset is the most consistent, but only
Tim's patchset performs better than nosmt in some conditions.

Of course, this is one of the worst-case scenarios; as soon as we have
multithreaded applications on overcommitted systems, core scheduling
performs better than nosmt.

Thanks,

Julien

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-02 15:37                             ` Julien Desfossez
@ 2019-08-05 15:55                               ` Tim Chen
  2019-08-06  3:24                                 ` Aaron Lu
  2019-08-08 12:55                                 ` Aaron Lu
  2019-08-05 20:09                               ` Phil Auld
  2019-08-07  8:58                               ` Dario Faggioli
  2 siblings, 2 replies; 161+ messages in thread
From: Tim Chen @ 2019-08-05 15:55 UTC (permalink / raw)
  To: Julien Desfossez, Li, Aubrey
  Cc: Aaron Lu, Aubrey Li, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 8/2/19 8:37 AM, Julien Desfossez wrote:
> We tested both Aaron's and Tim's patches and here are our results.
> 
> Test setup:
> - 2 1-thread sysbench, one running the cpu benchmark, the other one the
>   mem benchmark
> - both started at the same time
> - both are pinned on the same core (2 hardware threads)
> - 10 30-seconds runs
> - test script: https://paste.debian.net/plainh/834cf45c
> - only showing the CPU events/sec (higher is better)
> - tested 4 tag configurations:
>   - no tag
>   - sysbench mem untagged, sysbench cpu tagged
>   - sysbench mem tagged, sysbench cpu untagged
>   - both tagged with a different tag
> - "Alone" is the sysbench CPU running alone on the core, no tag
> - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> - "Tim's full patchset + sched" is an experiment with Tim's patchset
>   combined with Aaron's "hack patch" to get rid of the remaining deep
>   idle cases
> - In all test cases, both tasks can run simultaneously (which was not
>   the case without those patches), but the standard deviation is a
>   pretty good indicator of the fairness/consistency.

Thanks for testing the patches and giving such detailed data.

I came to realize that for my scheme, the accumulated deficit of forced
idle could be wiped out in one execution of a task on the forced idle cpu,
with the update of the min_vruntime, even if the execution time is far
less than the accumulated deficit.  That's probably one reason my scheme
didn't achieve fairness.

Tim


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-02 15:37                             ` Julien Desfossez
  2019-08-05 15:55                               ` Tim Chen
@ 2019-08-05 20:09                               ` Phil Auld
  2019-08-06 13:54                                 ` Aaron Lu
  2019-08-07  8:58                               ` Dario Faggioli
  2 siblings, 1 reply; 161+ messages in thread
From: Phil Auld @ 2019-08-05 20:09 UTC (permalink / raw)
  To: Julien Desfossez
  Cc: Li, Aubrey, Aaron Lu, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

Hi,

On Fri, Aug 02, 2019 at 11:37:15AM -0400 Julien Desfossez wrote:
> We tested both Aaron's and Tim's patches and here are our results.
> 
> Test setup:
> - 2 1-thread sysbench, one running the cpu benchmark, the other one the
>   mem benchmark
> - both started at the same time
> - both are pinned on the same core (2 hardware threads)
> - 10 30-seconds runs
> - test script: https://paste.debian.net/plainh/834cf45c
> - only showing the CPU events/sec (higher is better)
> - tested 4 tag configurations:
>   - no tag
>   - sysbench mem untagged, sysbench cpu tagged
>   - sysbench mem tagged, sysbench cpu untagged
>   - both tagged with a different tag
> - "Alone" is the sysbench CPU running alone on the core, no tag
> - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> - "Tim's full patchset + sched" is an experiment with Tim's patchset
>   combined with Aaron's "hack patch" to get rid of the remaining deep
>   idle cases
> - In all test cases, both tasks can run simultaneously (which was not
>   the case without those patches), but the standard deviation is a
>   pretty good indicator of the fairness/consistency.
> 
> No tag
> ------
> Test                            Average     Stdev
> Alone                           1306.90     0.94
> nosmt                           649.95      1.44
> Aaron's full patchset:          828.15      32.45
> Aaron's first 2 patches:        832.12      36.53
> Aaron's 3rd patch alone:        864.21      3.68
> Tim's full patchset:            852.50      4.11
> Tim's full patchset + sched:    852.59      8.25
> 
> Sysbench mem untagged, sysbench cpu tagged
> ------------------------------------------
> Test                            Average     Stdev
> Alone                           1306.90     0.94
> nosmt                           649.95      1.44
> Aaron's full patchset:          586.06      1.77
> Aaron's first 2 patches:        630.08      47.30
> Aaron's 3rd patch alone:        1086.65     246.54
> Tim's full patchset:            852.50      4.11
> Tim's full patchset + sched:    390.49      15.76
> 
> Sysbench mem tagged, sysbench cpu untagged
> ------------------------------------------
> Test                            Average     Stdev
> Alone                           1306.90     0.94
> nosmt                           649.95      1.44
> Aaron's full patchset:          583.77      3.52
> Aaron's first 2 patches:        513.63      63.09
> Aaron's 3rd patch alone:        1171.23     3.35
> Tim's full patchset:            564.04      58.05
> Tim's full patchset + sched:    1026.16     49.43
> 
> Both sysbench tagged
> --------------------
> Test                            Average     Stdev
> Alone                           1306.90     0.94
> nosmt                           649.95      1.44
> Aaron's full patchset:          582.15      3.75
> Aaron's first 2 patches:        561.07      91.61
> Aaron's 3rd patch alone:        638.49      231.06
> Tim's full patchset:            679.43      70.07
> Tim's full patchset + sched:    664.34      210.14
> 

Sorry if I'm missing something obvious here, but with only 2 processes
of interest, shouldn't one tagged and one untagged be about the same
as both tagged?

In both cases the 2 sysbenches should not be running on the core at 
the same time. 

There will be times when other, unrelated threads could share the core
with the untagged one. Is that enough to account for this difference?


Thanks,
Phil


> So in terms of fairness, Aaron's full patchset is the most consistent, but only
> Tim's patchset performs better than nosmt in some conditions.
> 
> Of course, this is one of the worst case scenario, as soon as we have
> multithreaded applications on overcommitted systems, core scheduling performs
> better than nosmt.
> 
> Thanks,
> 
> Julien

-- 

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-05 15:55                               ` Tim Chen
@ 2019-08-06  3:24                                 ` Aaron Lu
  2019-08-06  6:56                                   ` Aubrey Li
  2019-08-06 17:03                                   ` Tim Chen
  2019-08-08 12:55                                 ` Aaron Lu
  1 sibling, 2 replies; 161+ messages in thread
From: Aaron Lu @ 2019-08-06  3:24 UTC (permalink / raw)
  To: Tim Chen
  Cc: Julien Desfossez, Li, Aubrey, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Mon, Aug 05, 2019 at 08:55:28AM -0700, Tim Chen wrote:
> On 8/2/19 8:37 AM, Julien Desfossez wrote:
> > We tested both Aaron's and Tim's patches and here are our results.
> > 
> > Test setup:
> > - 2 1-thread sysbench, one running the cpu benchmark, the other one the
> >   mem benchmark
> > - both started at the same time
> > - both are pinned on the same core (2 hardware threads)
> > - 10 30-seconds runs
> > - test script: https://paste.debian.net/plainh/834cf45c
> > - only showing the CPU events/sec (higher is better)
> > - tested 4 tag configurations:
> >   - no tag
> >   - sysbench mem untagged, sysbench cpu tagged
> >   - sysbench mem tagged, sysbench cpu untagged
> >   - both tagged with a different tag
> > - "Alone" is the sysbench CPU running alone on the core, no tag
> > - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> > - "Tim's full patchset + sched" is an experiment with Tim's patchset
> >   combined with Aaron's "hack patch" to get rid of the remaining deep
> >   idle cases
> > - In all test cases, both tasks can run simultaneously (which was not
> >   the case without those patches), but the standard deviation is a
> >   pretty good indicator of the fairness/consistency.
> 
> Thanks for testing the patches and giving such detailed data.

Thanks Julien.

> I came to realize that for my scheme, the accumulated deficit of forced idle could be wiped
> out in one execution of a task on the forced idle cpu, with the update of the min_vruntime,
> even if the execution time could be far less than the accumulated deficit.
> That's probably one reason my scheme didn't achieve fairness.

I've been thinking about whether we should consider core-wide tenant
fairness.

Let's say there are 3 tasks on the 2 threads' runqueues of the same core:
2 tasks (e.g. A1, A2) belong to tenant A and the 3rd, B1, belongs to
another tenant, B. Assume A1 and B1 are queued on the same thread and A2
on the other thread. When we decide priority for A1 and B1, shall we also
consider A2's vruntime? i.e. shall we treat A1 and A2 as a whole, since
they belong to the same tenant? I tend to think we should make fairness
per core per tenant, instead of per thread (cpu) per task (sched entity).
What do you guys think?

Implementation of the idea is a mess to me, as I feel I'm duplicating the
existing per-cpu per-sched_entity enqueue/update-vruntime/dequeue logic
for the per-core per-tenant stuff.
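To make the proposal concrete, here is a minimal user-space sketch (Python, invented names, purely illustrative) of picking the next tenant by core-wide aggregate vruntime: a per-task comparison would favor A1 (vruntime 300) over B1 (500), but summing per tenant across the core makes B, with less total vruntime, the one to run next.

```python
# All names invented; kernel-side this would need new per-core accounting.
def tenant_vruntime(core_tasks, tenant):
    # Weigh a tenant's tasks on this core as a whole (A1 + A2 vs. B1).
    return sum(v for t, v in core_tasks if t == tenant)

def pick_tenant(core_tasks):
    tenants = {t for t, _ in core_tasks}
    # The tenant that has consumed the least core-wide vruntime runs next.
    return min(tenants, key=lambda t: tenant_vruntime(core_tasks, t))

# A1=300, A2=400 (tenant A, total 700) vs. B1=500 (tenant B, total 500):
print(pick_tenant([("A", 300), ("A", 400), ("B", 500)]))  # → B
```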

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-06  3:24                                 ` Aaron Lu
@ 2019-08-06  6:56                                   ` Aubrey Li
  2019-08-06  7:04                                     ` Aaron Lu
  2019-08-06 17:03                                   ` Tim Chen
  1 sibling, 1 reply; 161+ messages in thread
From: Aubrey Li @ 2019-08-06  6:56 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Julien Desfossez, Li, Aubrey, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Tue, Aug 6, 2019 at 11:24 AM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
>
> On Mon, Aug 05, 2019 at 08:55:28AM -0700, Tim Chen wrote:
> > On 8/2/19 8:37 AM, Julien Desfossez wrote:
> > > We tested both Aaron's and Tim's patches and here are our results.
> > >
> > > Test setup:
> > > - 2 1-thread sysbench, one running the cpu benchmark, the other one the
> > >   mem benchmark
> > > - both started at the same time
> > > - both are pinned on the same core (2 hardware threads)
> > > - 10 30-seconds runs
> > > - test script: https://paste.debian.net/plainh/834cf45c
> > > - only showing the CPU events/sec (higher is better)
> > > - tested 4 tag configurations:
> > >   - no tag
> > >   - sysbench mem untagged, sysbench cpu tagged
> > >   - sysbench mem tagged, sysbench cpu untagged
> > >   - both tagged with a different tag
> > > - "Alone" is the sysbench CPU running alone on the core, no tag
> > > - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> > > - "Tim's full patchset + sched" is an experiment with Tim's patchset
> > >   combined with Aaron's "hack patch" to get rid of the remaining deep
> > >   idle cases
> > > - In all test cases, both tasks can run simultaneously (which was not
> > >   the case without those patches), but the standard deviation is a
> > >   pretty good indicator of the fairness/consistency.
> >
> > Thanks for testing the patches and giving such detailed data.
>
> Thanks Julien.
>
> > I came to realize that for my scheme, the accumulated deficit of forced idle could be wiped
> > out in one execution of a task on the forced idle cpu, with the update of the min_vruntime,
> > even if the execution time could be far less than the accumulated deficit.
> > That's probably one reason my scheme didn't achieve fairness.
>
> I've been thinking if we should consider core wide tenant fairness?
>
> Let's say there are 3 tasks on 2 threads' rq of the same core, 2 tasks
> (e.g. A1, A2) belong to tenant A and the 3rd B1 belong to another tenant
> B. Assume A1 and B1 are queued on the same thread and A2 on the other
> thread, when we decide priority for A1 and B1, shall we also consider
> A2's vruntime? i.e. shall we consider A1 and A2 as a whole since they
> belong to the same tenant? I tend to think we should make fairness per
> core per tenant, instead of per thread(cpu) per task(sched entity). What
> do you guys think?
>

I was also thinking of a way to make fairness per cookie per core; is
this what you want to propose?

Thanks,
-Aubrey

> Implementation of the idea is a mess to me, as I feel I'm duplicating the
> existing per cpu per sched_entity enqueue/update vruntime/dequeue logic
> for the per core per tenant stuff.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-06  6:56                                   ` Aubrey Li
@ 2019-08-06  7:04                                     ` Aaron Lu
  2019-08-06 12:24                                       ` Vineeth Remanan Pillai
  0 siblings, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-08-06  7:04 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Tim Chen, Julien Desfossez, Li, Aubrey, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 2019/8/6 14:56, Aubrey Li wrote:
> On Tue, Aug 6, 2019 at 11:24 AM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
>> I've been thinking if we should consider core wide tenant fairness?
>>
>> Let's say there are 3 tasks on 2 threads' rq of the same core, 2 tasks
>> (e.g. A1, A2) belong to tenant A and the 3rd B1 belong to another tenant
>> B. Assume A1 and B1 are queued on the same thread and A2 on the other
>> thread, when we decide priority for A1 and B1, shall we also consider
>> A2's vruntime? i.e. shall we consider A1 and A2 as a whole since they
>> belong to the same tenant? I tend to think we should make fairness per
>> core per tenant, instead of per thread(cpu) per task(sched entity). What
>> do you guys think?
>>
> 
> I also think a way to make fairness per cookie per core, is this what you
> want to propose?

Yes, that's what I meant.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-06  7:04                                     ` Aaron Lu
@ 2019-08-06 12:24                                       ` Vineeth Remanan Pillai
  2019-08-06 13:49                                         ` Aaron Lu
  2019-08-06 14:16                                         ` Peter Zijlstra
  0 siblings, 2 replies; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-08-06 12:24 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Aubrey Li, Tim Chen, Julien Desfossez, Li, Aubrey,
	Subhra Mazumdar, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

> >
> > I was also thinking of a way to make fairness per cookie per core; is this
> > what you want to propose?
>
> Yes, that's what I meant.

I think that would hurt some kinds of workloads badly, especially if one
tenant has way more tasks than the other. The tenant with more tasks on the
same core might have more immediate requirements from some threads than the
other, and we would fail to take that into account. With some hierarchical
management we can alleviate this, but as Aaron said, it would be a bit messy.

Peter's rebalance logic actually takes care of most of the runq imbalance
caused by cookie tagging. What we have found from our testing is that the
fairness issue is caused mostly by a hyperthread going idle and not waking
up. Aaron's 3rd patch works around that. As Julien mentioned, we are working
on a per-thread coresched idle thread concept. The problem that we found was
that the idle thread causes accounting issues and wakeup issues, as it was
not designed to be used in this context. So if we can have a low-priority
thread which looks like any other task to the scheduler, things become easy
for the scheduler and we achieve security as well. Please share your
thoughts on this idea.

The results are encouraging, but we have not yet gotten the coresched idle
thread to stop spinning at 100%. We will post the patch soon, once it is
stable enough to run the tests that we all have done so far.

Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-06 12:24                                       ` Vineeth Remanan Pillai
@ 2019-08-06 13:49                                         ` Aaron Lu
  2019-08-06 16:14                                           ` Vineeth Remanan Pillai
  2019-08-06 14:16                                         ` Peter Zijlstra
  1 sibling, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-08-06 13:49 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Aubrey Li, Tim Chen, Julien Desfossez, Li, Aubrey,
	Subhra Mazumdar, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Tue, Aug 06, 2019 at 08:24:17AM -0400, Vineeth Remanan Pillai wrote:
> > >
> > > I was also thinking of a way to make fairness per cookie per core; is
> > > this what you want to propose?
> >
> > Yes, that's what I meant.
> 
> I think that would hurt some kinds of workloads badly, especially if one
> tenant has way more tasks than the other. The tenant with more tasks on the
> same core might have more immediate requirements from some threads than the
> other, and we would fail to take that into account. With some hierarchical
> management we can alleviate this, but as Aaron said, it would be a bit messy.

I think each tenant will have a per-core weight, similar to a sched entity's
per-cpu weight. The tenant's per-core weight could be derived from its
corresponding taskgroup's per-cpu sched entities' weights (sum them up,
perhaps). A tenant with higher weight will have its core-wide vruntime
advance more slowly than a tenant with lower weight. Does this address the
issue here?

> Peter's rebalance logic actually takes care of most of the runq imbalance
> caused by cookie tagging. What we have found from our testing is that the
> fairness issue is caused mostly by a hyperthread going idle and not waking
> up. Aaron's 3rd patch works around that. As Julien mentioned, we are working
> on a per-thread coresched idle thread concept. The problem that we found was
> that the idle thread causes accounting issues and wakeup issues, as it was
> not designed to be used in this context. So if we can have a low-priority
> thread which looks like any other task to the scheduler, things become easy
> for the scheduler and we achieve security as well. Please share your
> thoughts on this idea.

Care to elaborate on the coresched idle thread concept?
How does it solve the hyperthread-going-idle problem, and what are the
accounting issues and wakeup issues, etc.?

Thanks,
Aaron

> The results are encouraging, but we have not yet gotten the coresched idle
> thread to stop spinning at 100%. We will post the patch soon, once it is
> stable enough to run the tests that we all have done so far.
> 
> Thanks,
> Vineeth

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-05 20:09                               ` Phil Auld
@ 2019-08-06 13:54                                 ` Aaron Lu
  2019-08-06 14:17                                   ` Phil Auld
  0 siblings, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-08-06 13:54 UTC (permalink / raw)
  To: Phil Auld
  Cc: Julien Desfossez, Li, Aubrey, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Mon, Aug 05, 2019 at 04:09:15PM -0400, Phil Auld wrote:
> Hi,
> 
> On Fri, Aug 02, 2019 at 11:37:15AM -0400 Julien Desfossez wrote:
> > We tested both Aaron's and Tim's patches and here are our results.
> > 
> > Test setup:
> > - 2 1-thread sysbench, one running the cpu benchmark, the other one the
> >   mem benchmark
> > - both started at the same time
> > - both are pinned on the same core (2 hardware threads)
> > - 10 30-seconds runs
> > - test script: https://paste.debian.net/plainh/834cf45c
> > - only showing the CPU events/sec (higher is better)
> > - tested 4 tag configurations:
> >   - no tag
> >   - sysbench mem untagged, sysbench cpu tagged
> >   - sysbench mem tagged, sysbench cpu untagged
> >   - both tagged with a different tag
> > - "Alone" is the sysbench CPU running alone on the core, no tag
> > - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> > - "Tim's full patchset + sched" is an experiment with Tim's patchset
> >   combined with Aaron's "hack patch" to get rid of the remaining deep
> >   idle cases
> > - In all test cases, both tasks can run simultaneously (which was not
> >   the case without those patches), but the standard deviation is a
> >   pretty good indicator of the fairness/consistency.
> > 
> > No tag
> > ------
> > Test                            Average     Stdev
> > Alone                           1306.90     0.94
> > nosmt                           649.95      1.44
> > Aaron's full patchset:          828.15      32.45
> > Aaron's first 2 patches:        832.12      36.53
> > Aaron's 3rd patch alone:        864.21      3.68
> > Tim's full patchset:            852.50      4.11
> > Tim's full patchset + sched:    852.59      8.25
> > 
> > Sysbench mem untagged, sysbench cpu tagged
> > ------------------------------------------
> > Test                            Average     Stdev
> > Alone                           1306.90     0.94
> > nosmt                           649.95      1.44
> > Aaron's full patchset:          586.06      1.77
> > Aaron's first 2 patches:        630.08      47.30
> > Aaron's 3rd patch alone:        1086.65     246.54
> > Tim's full patchset:            852.50      4.11
> > Tim's full patchset + sched:    390.49      15.76
> > 
> > Sysbench mem tagged, sysbench cpu untagged
> > ------------------------------------------
> > Test                            Average     Stdev
> > Alone                           1306.90     0.94
> > nosmt                           649.95      1.44
> > Aaron's full patchset:          583.77      3.52
> > Aaron's first 2 patches:        513.63      63.09
> > Aaron's 3rd patch alone:        1171.23     3.35
> > Tim's full patchset:            564.04      58.05
> > Tim's full patchset + sched:    1026.16     49.43
> > 
> > Both sysbench tagged
> > --------------------
> > Test                            Average     Stdev
> > Alone                           1306.90     0.94
> > nosmt                           649.95      1.44
> > Aaron's full patchset:          582.15      3.75
> > Aaron's first 2 patches:        561.07      91.61
> > Aaron's 3rd patch alone:        638.49      231.06
> > Tim's full patchset:            679.43      70.07
> > Tim's full patchset + sched:    664.34      210.14
> > 
> 
> Sorry if I'm missing something obvious here but with only 2 processes 
> of interest shouldn't one tagged and one untagged be about the same
> as both tagged?  

It should.

> In both cases the 2 sysbenches should not be running on the core at 
> the same time. 

Agree.

> There will be times when other non-related threads could share the core
> with the untagged one. Is that enough to account for this difference?

What difference do you mean?

Thanks,
Aaron

> > So in terms of fairness, Aaron's full patchset is the most consistent, but only
> > Tim's patchset performs better than nosmt in some conditions.
> > 
> > Of course, this is one of the worst case scenario, as soon as we have
> > multithreaded applications on overcommitted systems, core scheduling performs
> > better than nosmt.
> > 
> > Thanks,
> > 
> > Julien
> 
> -- 

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-06 12:24                                       ` Vineeth Remanan Pillai
  2019-08-06 13:49                                         ` Aaron Lu
@ 2019-08-06 14:16                                         ` Peter Zijlstra
  2019-08-06 15:53                                           ` Vineeth Remanan Pillai
  1 sibling, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2019-08-06 14:16 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Aaron Lu, Aubrey Li, Tim Chen, Julien Desfossez, Li, Aubrey,
	Subhra Mazumdar, Nishanth Aravamudan, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Tue, Aug 06, 2019 at 08:24:17AM -0400, Vineeth Remanan Pillai wrote:
> Peter's rebalance logic actually takes care of most of the runq imbalance
> caused by cookie tagging. What we have found from our testing is that the
> fairness issue is caused mostly by a hyperthread going idle and not waking
> up. Aaron's 3rd patch works around that. As Julien mentioned, we are working
> on a per-thread coresched idle thread concept. The problem that we found was
> that the idle thread causes accounting issues and wakeup issues, as it was
> not designed to be used in this context. So if we can have a low-priority
> thread which looks like any other task to the scheduler, things become easy
> for the scheduler and we achieve security as well. Please share your
> thoughts on this idea.

What accounting in particular is upset? Is it things like
select_idle_sibling() that thinks the thread is idle and tries to place
tasks there?

It should be possible to change idle_cpu() to not report a forced-idle
CPU as idle.

(also; it should be possible to optimize select_idle_sibling() for the
core-sched case specifically)

> The results are encouraging, but we have not yet gotten the coresched idle
> thread to stop spinning at 100%. We will post the patch soon, once it is
> stable enough to run the tests that we all have done so far.

There's play_idle(), which is the entry point for idle injection.

In general, I don't particularly like 'fake' idle threads; please be very
specific in describing what issues it works around, so that we can look at
alternatives.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH 2/3] core vruntime comparison
  2019-07-25 14:32                         ` [PATCH 2/3] core vruntime comparison Aaron Lu
@ 2019-08-06 14:17                           ` Peter Zijlstra
  0 siblings, 0 replies; 161+ messages in thread
From: Peter Zijlstra @ 2019-08-06 14:17 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Aubrey Li, Julien Desfossez, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Thu, Jul 25, 2019 at 10:32:49PM +0800, Aaron Lu wrote:
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +	struct sched_entity *sea = &a->se;
> +	struct sched_entity *seb = &b->se;
> +	bool samecpu = task_cpu(a) == task_cpu(b);
> +	struct task_struct *p;
> +	s64 delta;
> +
> +	if (samecpu) {
> +		/* vruntime is per cfs_rq */
> +		while (!is_same_group(sea, seb)) {
> +			int sea_depth = sea->depth;
> +			int seb_depth = seb->depth;
> +
> +			if (sea_depth >= seb_depth)
> +				sea = parent_entity(sea);
> +			if (sea_depth <= seb_depth)
> +				seb = parent_entity(seb);
> +		}
> +
> +		delta = (s64)(sea->vruntime - seb->vruntime);
> +		goto out;
> +	}
> +
> +	/* crosscpu: compare root level se's vruntime to decide priority */
> +	while (sea->parent)
> +		sea = sea->parent;
> +	while (seb->parent)
> +		seb = seb->parent;
> +	delta = (s64)(sea->vruntime - seb->vruntime);
> +
> +out:
> +	p = delta > 0 ? b : a;
> +	trace_printk("picked %s/%d %s: %Ld %Ld %Ld\n", p->comm, p->pid,
> +			samecpu ? "samecpu" : "crosscpu",
> +			sea->vruntime, seb->vruntime, delta);
> +
> +	return delta > 0;
>  }

Heh.. I suppose the good news is that Rik is trying very hard to kill
the nested runqueues, which would make this _much_ easier again.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-06 13:54                                 ` Aaron Lu
@ 2019-08-06 14:17                                   ` Phil Auld
  2019-08-06 14:41                                     ` Aaron Lu
  0 siblings, 1 reply; 161+ messages in thread
From: Phil Auld @ 2019-08-06 14:17 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Julien Desfossez, Li, Aubrey, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Tue, Aug 06, 2019 at 09:54:01PM +0800 Aaron Lu wrote:
> On Mon, Aug 05, 2019 at 04:09:15PM -0400, Phil Auld wrote:
> > Hi,
> > 
> > On Fri, Aug 02, 2019 at 11:37:15AM -0400 Julien Desfossez wrote:
> > > We tested both Aaron's and Tim's patches and here are our results.
> > > 
> > > Test setup:
> > > - 2 1-thread sysbench, one running the cpu benchmark, the other one the
> > >   mem benchmark
> > > - both started at the same time
> > > - both are pinned on the same core (2 hardware threads)
> > > - 10 30-seconds runs
> > > - test script: https://paste.debian.net/plainh/834cf45c
> > > - only showing the CPU events/sec (higher is better)
> > > - tested 4 tag configurations:
> > >   - no tag
> > >   - sysbench mem untagged, sysbench cpu tagged
> > >   - sysbench mem tagged, sysbench cpu untagged
> > >   - both tagged with a different tag
> > > - "Alone" is the sysbench CPU running alone on the core, no tag
> > > - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> > > - "Tim's full patchset + sched" is an experiment with Tim's patchset
> > >   combined with Aaron's "hack patch" to get rid of the remaining deep
> > >   idle cases
> > > - In all test cases, both tasks can run simultaneously (which was not
> > >   the case without those patches), but the standard deviation is a
> > >   pretty good indicator of the fairness/consistency.
> > > 
> > > No tag
> > > ------
> > > Test                            Average     Stdev
> > > Alone                           1306.90     0.94
> > > nosmt                           649.95      1.44
> > > Aaron's full patchset:          828.15      32.45
> > > Aaron's first 2 patches:        832.12      36.53
> > > Aaron's 3rd patch alone:        864.21      3.68
> > > Tim's full patchset:            852.50      4.11
> > > Tim's full patchset + sched:    852.59      8.25
> > > 
> > > Sysbench mem untagged, sysbench cpu tagged
> > > ------------------------------------------
> > > Test                            Average     Stdev
> > > Alone                           1306.90     0.94
> > > nosmt                           649.95      1.44
> > > Aaron's full patchset:          586.06      1.77
> > > Aaron's first 2 patches:        630.08      47.30
> > > Aaron's 3rd patch alone:        1086.65     246.54
> > > Tim's full patchset:            852.50      4.11
> > > Tim's full patchset + sched:    390.49      15.76
> > > 
> > > Sysbench mem tagged, sysbench cpu untagged
> > > ------------------------------------------
> > > Test                            Average     Stdev
> > > Alone                           1306.90     0.94
> > > nosmt                           649.95      1.44
> > > Aaron's full patchset:          583.77      3.52
> > > Aaron's first 2 patches:        513.63      63.09
> > > Aaron's 3rd patch alone:        1171.23     3.35
> > > Tim's full patchset:            564.04      58.05
> > > Tim's full patchset + sched:    1026.16     49.43
> > > 
> > > Both sysbench tagged
> > > --------------------
> > > Test                            Average     Stdev
> > > Alone                           1306.90     0.94
> > > nosmt                           649.95      1.44
> > > Aaron's full patchset:          582.15      3.75
> > > Aaron's first 2 patches:        561.07      91.61
> > > Aaron's 3rd patch alone:        638.49      231.06
> > > Tim's full patchset:            679.43      70.07
> > > Tim's full patchset + sched:    664.34      210.14
> > > 
> > 
> > Sorry if I'm missing something obvious here but with only 2 processes 
> > of interest shouldn't one tagged and one untagged be about the same
> > as both tagged?  
> 
> It should.
> 
> > In both cases the 2 sysbenches should not be running on the core at 
> > the same time. 
> 
> Agree.
> 
> > There will be times when other non-related threads could share the core
> > with the untagged one. Is that enough to account for this difference?
> 
> What difference do you mean?


I was looking at the above posted numbers. For example:

> > > Sysbench mem untagged, sysbench cpu tagged
> > > Aaron's 3rd patch alone:        1086.65     246.54

> > > Sysbench mem tagged, sysbench cpu untagged
> > > Aaron's 3rd patch alone:        1171.23     3.35

> > > Both sysbench tagged
> > > Aaron's 3rd patch alone:        638.49      231.06


Admittedly, there's some high variance on some of those numbers. 


Cheers,
Phil

> 
> Thanks,
> Aaron
> 
> > > So in terms of fairness, Aaron's full patchset is the most consistent, but only
> > > Tim's patchset performs better than nosmt in some conditions.
> > > 
> > > Of course, this is one of the worst case scenario, as soon as we have
> > > multithreaded applications on overcommitted systems, core scheduling performs
> > > better than nosmt.
> > > 
> > > Thanks,
> > > 
> > > Julien
> > 
> > -- 

-- 

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-06 14:17                                   ` Phil Auld
@ 2019-08-06 14:41                                     ` Aaron Lu
  2019-08-06 14:55                                       ` Phil Auld
  0 siblings, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-08-06 14:41 UTC (permalink / raw)
  To: Phil Auld
  Cc: Julien Desfossez, Li, Aubrey, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On 2019/8/6 22:17, Phil Auld wrote:
> On Tue, Aug 06, 2019 at 09:54:01PM +0800 Aaron Lu wrote:
>> On Mon, Aug 05, 2019 at 04:09:15PM -0400, Phil Auld wrote:
>>> Hi,
>>>
>>> On Fri, Aug 02, 2019 at 11:37:15AM -0400 Julien Desfossez wrote:
>>>> We tested both Aaron's and Tim's patches and here are our results.
>>>>
>>>> Test setup:
>>>> - 2 1-thread sysbench, one running the cpu benchmark, the other one the
>>>>   mem benchmark
>>>> - both started at the same time
>>>> - both are pinned on the same core (2 hardware threads)
>>>> - 10 30-seconds runs
>>>> - test script: https://paste.debian.net/plainh/834cf45c
>>>> - only showing the CPU events/sec (higher is better)
>>>> - tested 4 tag configurations:
>>>>   - no tag
>>>>   - sysbench mem untagged, sysbench cpu tagged
>>>>   - sysbench mem tagged, sysbench cpu untagged
>>>>   - both tagged with a different tag
>>>> - "Alone" is the sysbench CPU running alone on the core, no tag
>>>> - "nosmt" is both sysbench pinned on the same hardware thread, no tag
>>>> - "Tim's full patchset + sched" is an experiment with Tim's patchset
>>>>   combined with Aaron's "hack patch" to get rid of the remaining deep
>>>>   idle cases
>>>> - In all test cases, both tasks can run simultaneously (which was not
>>>>   the case without those patches), but the standard deviation is a
>>>>   pretty good indicator of the fairness/consistency.
>>>>
>>>> No tag
>>>> ------
>>>> Test                            Average     Stdev
>>>> Alone                           1306.90     0.94
>>>> nosmt                           649.95      1.44
>>>> Aaron's full patchset:          828.15      32.45
>>>> Aaron's first 2 patches:        832.12      36.53
>>>> Aaron's 3rd patch alone:        864.21      3.68
>>>> Tim's full patchset:            852.50      4.11
>>>> Tim's full patchset + sched:    852.59      8.25
>>>>
>>>> Sysbench mem untagged, sysbench cpu tagged
>>>> ------------------------------------------
>>>> Test                            Average     Stdev
>>>> Alone                           1306.90     0.94
>>>> nosmt                           649.95      1.44
>>>> Aaron's full patchset:          586.06      1.77
>>>> Aaron's first 2 patches:        630.08      47.30
>>>> Aaron's 3rd patch alone:        1086.65     246.54
>>>> Tim's full patchset:            852.50      4.11
>>>> Tim's full patchset + sched:    390.49      15.76
>>>>
>>>> Sysbench mem tagged, sysbench cpu untagged
>>>> ------------------------------------------
>>>> Test                            Average     Stdev
>>>> Alone                           1306.90     0.94
>>>> nosmt                           649.95      1.44
>>>> Aaron's full patchset:          583.77      3.52
>>>> Aaron's first 2 patches:        513.63      63.09
>>>> Aaron's 3rd patch alone:        1171.23     3.35
>>>> Tim's full patchset:            564.04      58.05
>>>> Tim's full patchset + sched:    1026.16     49.43
>>>>
>>>> Both sysbench tagged
>>>> --------------------
>>>> Test                            Average     Stdev
>>>> Alone                           1306.90     0.94
>>>> nosmt                           649.95      1.44
>>>> Aaron's full patchset:          582.15      3.75
>>>> Aaron's first 2 patches:        561.07      91.61
>>>> Aaron's 3rd patch alone:        638.49      231.06
>>>> Tim's full patchset:            679.43      70.07
>>>> Tim's full patchset + sched:    664.34      210.14
>>>>
>>>
>>> Sorry if I'm missing something obvious here but with only 2 processes 
>>> of interest shouldn't one tagged and one untagged be about the same
>>> as both tagged?  
>>
>> It should.
>>
>>> In both cases the 2 sysbenches should not be running on the core at 
>>> the same time. 
>>
>> Agree.
>>
>>> There will be times when other non-related threads could share the core
>>> with the untagged one. Is that enough to account for this difference?
>>
>> What difference do you mean?
> 
> 
> I was looking at the above posted numbers. For example:
> 
>>>> Sysbench mem untagged, sysbench cpu tagged
>>>> Aaron's 3rd patch alone:        1086.65     246.54
> 
>>>> Sysbench mem tagged, sysbench cpu untagged
>>>> Aaron's 3rd patch alone:        1171.23     3.35
> 
>>>> Both sysbench tagged
>>>> Aaron's 3rd patch alone:        638.49      231.06
> 
> 
> Admittedly, there's some high variance on some of those numbers. 

The high variance suggests the code has some fairness issues :-)

For the test here, I didn't expect the 3rd patch to be used alone, since
fairness is solved by patch2 and patch3 together.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-06 14:41                                     ` Aaron Lu
@ 2019-08-06 14:55                                       ` Phil Auld
  0 siblings, 0 replies; 161+ messages in thread
From: Phil Auld @ 2019-08-06 14:55 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Julien Desfossez, Li, Aubrey, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Tue, Aug 06, 2019 at 10:41:25PM +0800 Aaron Lu wrote:
> On 2019/8/6 22:17, Phil Auld wrote:
> > On Tue, Aug 06, 2019 at 09:54:01PM +0800 Aaron Lu wrote:
> >> On Mon, Aug 05, 2019 at 04:09:15PM -0400, Phil Auld wrote:
> >>> Hi,
> >>>
> >>> On Fri, Aug 02, 2019 at 11:37:15AM -0400 Julien Desfossez wrote:
> >>>> We tested both Aaron's and Tim's patches and here are our results.
> >>>>
> >>>> Test setup:
> >>>> - 2 1-thread sysbench, one running the cpu benchmark, the other one the
> >>>>   mem benchmark
> >>>> - both started at the same time
> >>>> - both are pinned on the same core (2 hardware threads)
> >>>> - 10 30-seconds runs
> >>>> - test script: https://paste.debian.net/plainh/834cf45c
> >>>> - only showing the CPU events/sec (higher is better)
> >>>> - tested 4 tag configurations:
> >>>>   - no tag
> >>>>   - sysbench mem untagged, sysbench cpu tagged
> >>>>   - sysbench mem tagged, sysbench cpu untagged
> >>>>   - both tagged with a different tag
> >>>> - "Alone" is the sysbench CPU running alone on the core, no tag
> >>>> - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> >>>> - "Tim's full patchset + sched" is an experiment with Tim's patchset
> >>>>   combined with Aaron's "hack patch" to get rid of the remaining deep
> >>>>   idle cases
> >>>> - In all test cases, both tasks can run simultaneously (which was not
> >>>>   the case without those patches), but the standard deviation is a
> >>>>   pretty good indicator of the fairness/consistency.
> >>>>
> >>>> No tag
> >>>> ------
> >>>> Test                            Average     Stdev
> >>>> Alone                           1306.90     0.94
> >>>> nosmt                           649.95      1.44
> >>>> Aaron's full patchset:          828.15      32.45
> >>>> Aaron's first 2 patches:        832.12      36.53
> >>>> Aaron's 3rd patch alone:        864.21      3.68
> >>>> Tim's full patchset:            852.50      4.11
> >>>> Tim's full patchset + sched:    852.59      8.25
> >>>>
> >>>> Sysbench mem untagged, sysbench cpu tagged
> >>>> ------------------------------------------
> >>>> Test                            Average     Stdev
> >>>> Alone                           1306.90     0.94
> >>>> nosmt                           649.95      1.44
> >>>> Aaron's full patchset:          586.06      1.77
> >>>> Aaron's first 2 patches:        630.08      47.30
> >>>> Aaron's 3rd patch alone:        1086.65     246.54
> >>>> Tim's full patchset:            852.50      4.11
> >>>> Tim's full patchset + sched:    390.49      15.76
> >>>>
> >>>> Sysbench mem tagged, sysbench cpu untagged
> >>>> ------------------------------------------
> >>>> Test                            Average     Stdev
> >>>> Alone                           1306.90     0.94
> >>>> nosmt                           649.95      1.44
> >>>> Aaron's full patchset:          583.77      3.52
> >>>> Aaron's first 2 patches:        513.63      63.09
> >>>> Aaron's 3rd patch alone:        1171.23     3.35
> >>>> Tim's full patchset:            564.04      58.05
> >>>> Tim's full patchset + sched:    1026.16     49.43
> >>>>
> >>>> Both sysbench tagged
> >>>> --------------------
> >>>> Test                            Average     Stdev
> >>>> Alone                           1306.90     0.94
> >>>> nosmt                           649.95      1.44
> >>>> Aaron's full patchset:          582.15      3.75
> >>>> Aaron's first 2 patches:        561.07      91.61
> >>>> Aaron's 3rd patch alone:        638.49      231.06
> >>>> Tim's full patchset:            679.43      70.07
> >>>> Tim's full patchset + sched:    664.34      210.14
> >>>>
> >>>
> >>> Sorry if I'm missing something obvious here but with only 2 processes 
> >>> of interest shouldn't one tagged and one untagged be about the same
> >>> as both tagged?  
> >>
> >> It should.
> >>
> >>> In both cases the 2 sysbenches should not be running on the core at 
> >>> the same time. 
> >>
> >> Agree.
> >>
> >>> There will be times when other non-related threads could share the core
> >>> with the untagged one. Is that enough to account for this difference?
> >>
> >> What difference do you mean?
> > 
> > 
> > I was looking at the above posted numbers. For example:
> > 
> >>>> Sysbench mem untagged, sysbench cpu tagged
> >>>> Aaron's 3rd patch alone:        1086.65     246.54
> > 
> >>>> Sysbench mem tagged, sysbench cpu untagged
> >>>> Aaron's 3rd patch alone:        1171.23     3.35
> > 
> >>>> Both sysbench tagged
> >>>> Aaron's 3rd patch alone:        638.49      231.06
> > 
> > 
> > Admittedly, there's some high variance on some of those numbers. 
> 
> The high variance suggests the code has some fairness issues :-)
> 
> For the test here, I didn't expect the 3rd patch to be used alone, since
> fairness is solved by patch2 and patch3 together.

Makes sense, thanks.


-- 

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-06 14:16                                         ` Peter Zijlstra
@ 2019-08-06 15:53                                           ` Vineeth Remanan Pillai
  0 siblings, 0 replies; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-08-06 15:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Aaron Lu, Aubrey Li, Tim Chen, Julien Desfossez, Li, Aubrey,
	Subhra Mazumdar, Nishanth Aravamudan, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

>
> What accounting in particular is upset? Is it things like
> select_idle_sibling() that thinks the thread is idle and tries to place
> tasks there?
>
The major issue that we saw was, certain work load causes the idle cpu to never
wakeup and schedule again even when there are runnable threads in there. If
I remember correctly, this happened when the sibling had only one cpu intensive
task and did not enter the pick_next_task for a long time. There were other
situations as well which caused this prolonged idle state on the cpu.
One was when
pick_next_task was called on the sibling but it always won there
because vruntime
was not progressing on the idle cpu.

Having a coresched idle thread makes sure that the idle thread is not
overloaded. Also, vruntime moves forward, and task vruntime comparison
across cpus works when we normalize.
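For reference, the min_vruntime normalization we rely on when comparing
tasks across siblings can be modeled like this (a toy sketch, not the
kernel code; `toy_rq`/`toy_task` are made-up names for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of per-cpu runqueues: each has a min_vruntime baseline. */
struct toy_rq { uint64_t min_vruntime; };
struct toy_task { struct toy_rq *rq; uint64_t vruntime; };

/*
 * Normalize b's vruntime into a's runqueue before comparing, the way the
 * core-scheduling prio comparison does for tasks on different cpus.
 * Returns nonzero when a is "less" (i.e. has run more, lower priority).
 */
static int toy_prio_less(const struct toy_task *a, const struct toy_task *b)
{
	uint64_t bv = b->vruntime;

	if (a->rq != b->rq)
		bv = bv - b->rq->min_vruntime + a->rq->min_vruntime;

	/* Signed comparison handles wraparound, like the kernel's (s64) cast. */
	return (int64_t)(a->vruntime - bv) > 0;
}
```

Without the rebasing step the raw vruntimes of tasks on different
runqueues are not comparable, which is exactly why forced idle (where one
rq's min_vruntime stops progressing) breaks the comparison.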

> It should be possible to change idle_cpu() to not report a forced-idle
> CPU as idle.
I agree. If we can identify all the places where the idle thread is
considered special, and also account for the vruntime progress during
forced idle, this should be a better approach than a per-cpu coresched
idle thread.

>
> (also; it should be possible to optimize select_idle_sibling() for the
> core-sched case specifically)
>
We haven't seen this because most of our micro test cases did not have more
threads than cpus. Thanks for pointing this out; we shall cook up some tests
to observe this behavior.

Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-06 13:49                                         ` Aaron Lu
@ 2019-08-06 16:14                                           ` Vineeth Remanan Pillai
  0 siblings, 0 replies; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-08-06 16:14 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Aubrey Li, Tim Chen, Julien Desfossez, Li, Aubrey,
	Subhra Mazumdar, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

> I think tenant will have per core weight, similar to sched entity's per
> cpu weight. The tenant's per core weight could derive from its
> corresponding taskgroup's per cpu sched entities' weight(sum them up
> perhaps). Tenant with higher weight will have its core wide vruntime
> advance slower than tenant with lower weight. Does this address the
> issue here?
>
I think that makes sense and should work. We should also consider how to
classify untagged processes so that they are not starved.

>
> Care to elaborate the idea of coresched idle thread concept?
> How it solved the hyperthread going idle problem and what the accounting
> issues and wakeup issues are, etc.
>
So we have one coresched_idle thread per cpu, and when a sibling
cannot find a match, we schedule this new thread instead of forcing idle.
Ideally this thread would behave like idle, but the scheduler no longer
confuses an idle cpu with a forced-idle state. It also invokes schedule()
as vruntime progresses (an alternative to your 3rd patch), and vruntime
accounting gets more consistent. There are special cases that need to be
handled so that coresched_idle never gets scheduled in the normal
scheduling path (without coresched), etc. Hope this clarifies.

But as Peter suggested, if we can differentiate idle from forced idle in
the idle thread and account for the vruntime progress, that would be a
better approach.
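Peter's suggestion of differentiating forced idle from real idle could be
sketched roughly like this (purely illustrative toy model, not kernel
code; `core_forceidle` mirrors the per-rq flag from the v3 series):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of a runqueue, for illustration only. */
struct toy_rq {
	int nr_running;
	bool core_forceidle;	/* sibling's pick forced this cpu idle */
};

/*
 * A forced-idle cpu is not really available: it only idles because the
 * sibling's selected task is incompatible with its own runnable tasks,
 * so don't report it as idle to placement logic like select_idle_sibling().
 */
static bool toy_idle_cpu(const struct toy_rq *rq)
{
	if (rq->core_forceidle)
		return false;
	return rq->nr_running == 0;
}
```

The point of the sketch is only the extra `core_forceidle` check; the rest
of the real idle_cpu() logic is elided.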

Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-06  3:24                                 ` Aaron Lu
  2019-08-06  6:56                                   ` Aubrey Li
@ 2019-08-06 17:03                                   ` Tim Chen
  2019-08-06 17:12                                     ` Peter Zijlstra
  1 sibling, 1 reply; 161+ messages in thread
From: Tim Chen @ 2019-08-06 17:03 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Julien Desfossez, Li, Aubrey, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 8/5/19 8:24 PM, Aaron Lu wrote:

> I've been thinking if we should consider core wide tenant fairness?
> 
> Let's say there are 3 tasks on 2 threads' rq of the same core, 2 tasks
> (e.g. A1, A2) belong to tenant A and the 3rd, B1, belongs to another tenant
> B. Assume A1 and B1 are queued on the same thread and A2 on the other
> thread. When we decide priority for A1 and B1, shall we also consider
> A2's vruntime? i.e. shall we consider A1 and A2 as a whole since they
> belong to the same tenant? I tend to think we should make fairness per
> core per tenant, instead of per thread(cpu) per task(sched entity). What
> do you guys think?
> 
> Implementation of the idea is a mess to me, as I feel I'm duplicating the
> existing per cpu per sched_entity enqueue/update vruntime/dequeue logic
> for the per core per tenant stuff.

I'm wondering if something simpler will work.  It is easier to maintain fairness
between the CPU threads.  A simple scheme may be: if the force-idle deficit
on a CPU thread exceeds a threshold compared to its sibling, we bias toward
choosing the task on the suppressed CPU thread.
The fairness among the tenants per run queue is balanced out by CFS fairness,
so things should be fair if we maintain fairness in CPU utilization between
the two CPU threads.

Tim

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-06 17:03                                   ` Tim Chen
@ 2019-08-06 17:12                                     ` Peter Zijlstra
  2019-08-06 21:19                                       ` Tim Chen
  0 siblings, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2019-08-06 17:12 UTC (permalink / raw)
  To: Tim Chen
  Cc: Aaron Lu, Julien Desfossez, Li, Aubrey, Aubrey Li,
	Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Tue, Aug 06, 2019 at 10:03:29AM -0700, Tim Chen wrote:
> On 8/5/19 8:24 PM, Aaron Lu wrote:
> 
> > I've been thinking if we should consider core wide tenant fairness?
> > 
> > Let's say there are 3 tasks on 2 threads' rq of the same core, 2 tasks
> > (e.g. A1, A2) belong to tenant A and the 3rd, B1, belongs to another tenant
> > B. Assume A1 and B1 are queued on the same thread and A2 on the other
> > thread. When we decide priority for A1 and B1, shall we also consider
> > A2's vruntime? i.e. shall we consider A1 and A2 as a whole since they
> > belong to the same tenant? I tend to think we should make fairness per
> > core per tenant, instead of per thread(cpu) per task(sched entity). What
> > do you guys think?
> > 
> > Implementation of the idea is a mess to me, as I feel I'm duplicating the
> > existing per cpu per sched_entity enqueue/update vruntime/dequeue logic
> > for the per core per tenant stuff.
> 
> I'm wondering if something simpler will work.  It is easier to maintain fairness
> between the CPU threads.  A simple scheme may be: if the force-idle deficit
> on a CPU thread exceeds a threshold compared to its sibling, we bias toward
> choosing the task on the suppressed CPU thread.
> The fairness among the tenants per run queue is balanced out by CFS fairness,
> so things should be fair if we maintain fairness in CPU utilization between
> the two CPU threads.

IIRC pjt once did a simple 5ms flip flop between siblings.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-06 17:12                                     ` Peter Zijlstra
@ 2019-08-06 21:19                                       ` Tim Chen
  2019-08-08  6:47                                         ` Aaron Lu
  0 siblings, 1 reply; 161+ messages in thread
From: Tim Chen @ 2019-08-06 21:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Aaron Lu, Julien Desfossez, Li, Aubrey, Aubrey Li,
	Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 8/6/19 10:12 AM, Peter Zijlstra wrote:

>> I'm wondering if something simpler will work.  It is easier to maintain fairness
>> between the CPU threads.  A simple scheme may be: if the force-idle deficit
>> on a CPU thread exceeds a threshold compared to its sibling, we bias toward
>> choosing the task on the suppressed CPU thread.
>> The fairness among the tenants per run queue is balanced out by CFS fairness,
>> so things should be fair if we maintain fairness in CPU utilization between
>> the two CPU threads.
> 
> IIRC pjt once did a simple 5ms flip flop between siblings.
> 

Trying out Peter's suggestions in the following two patches on v3 to
provide fairness between the CPU threads.
The changes are in patch 2; patch 1 is simply a code reorg.

It is only minimally tested and seems to provide fairness between
two will-it-scale cgroups. Haven't tried it yet on something that
is less CPU intensive with lots of sleep in between.

Also need to add the idle time accounting for the rt and dl sched classes.
Will do that later if this works for the fair class.

Tim

----------------patch 1-----------------------

From ede10309986a6b1bcc82d317f86a5b06459d76bd Mon Sep 17 00:00:00 2001
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Wed, 24 Jul 2019 13:58:18 -0700
Subject: [PATCH 1/2] sched: Move sched fair prio comparison to fair.c

Consolidate the task priority comparison of the fair class into fair.c.
This is a simple code reorganization; there are no functional changes.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/core.c  | 21 ++-------------------
 kernel/sched/fair.c  | 21 +++++++++++++++++++++
 kernel/sched/sched.h |  1 +
 3 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e3cd9cb17809..567eba50dc38 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -105,25 +105,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 a_vruntime = a->se.vruntime;
-		u64 b_vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			b_vruntime -= task_cfs_rq(b)->min_vruntime;
-			b_vruntime += task_cfs_rq(a)->min_vruntime;
-
-			trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
-				     a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
-				     b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
-
-		}
-
-		return !((s64)(a_vruntime - b_vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+		return prio_less_fair(a, b);
 
 	return false;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02bff10237d4..e289b6e1545b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -602,6 +602,27 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
 	return delta;
 }
 
+bool prio_less_fair(struct task_struct *a, struct task_struct *b)
+{
+	u64 a_vruntime = a->se.vruntime;
+	u64 b_vruntime = b->se.vruntime;
+
+	/*
+	 * Normalize the vruntime if tasks are in different cpus.
+	 */
+	if (task_cpu(a) != task_cpu(b)) {
+		b_vruntime -= task_cfs_rq(b)->min_vruntime;
+		b_vruntime += task_cfs_rq(a)->min_vruntime;
+
+		trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
+			     a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
+			     b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
+
+	}
+
+	return !((s64)(a_vruntime - b_vruntime) <= 0);
+}
+
 /*
  * The idea is to set a period in which each task runs once.
  *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e91c188a452c..bdabe7ce1152 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1015,6 +1015,7 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 }
 
 extern void queue_core_balance(struct rq *rq);
+extern bool prio_less_fair(struct task_struct *a, struct task_struct *b);
 
 #else /* !CONFIG_SCHED_CORE */
 
-- 
2.20.1


--------------------patch 2------------------------

From e27305b0042631382bd4f72260d579bf4c971d2f Mon Sep 17 00:00:00 2001
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Tue, 6 Aug 2019 12:50:45 -0700
Subject: [PATCH 2/2] sched: Enforce fairness between cpu threads

A CPU thread can be suppressed by its sibling for an extended time.
Implement a budget for force idling, so that all CPU threads have an
equal chance to run.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/core.c  | 41 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c  | 12 ++++++++++++
 kernel/sched/sched.h |  4 ++++
 3 files changed, 57 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 567eba50dc38..e22042883723 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -207,6 +207,44 @@ static struct task_struct *sched_core_next(struct task_struct *p, unsigned long
 	return p;
 }
 
+void account_core_idletime(struct task_struct *p, u64 exec)
+{
+	const struct cpumask *smt_mask;
+	struct rq *rq;
+	bool force_idle, refill;
+	int i, cpu;
+
+	rq = task_rq(p);
+	if (!sched_core_enabled(rq) || !p->core_cookie)
+		return;
+
+	cpu = task_cpu(p);
+	force_idle = false;
+	refill = true;
+	smt_mask = cpu_smt_mask(cpu);
+
+	for_each_cpu(i, smt_mask) {
+		if (cpu == i)
+			continue;
+
+		if (cpu_rq(i)->core_forceidle)
+			force_idle = true;
+
+		/* Only refill if everyone has run out of allowance */
+		if (cpu_rq(i)->core_idle_allowance > 0)
+			refill = false;
+	}
+
+	if (force_idle)
+		rq->core_idle_allowance -= (s64) exec;
+
+	if (rq->core_idle_allowance < 0 && refill) {
+		for_each_cpu(i, smt_mask) {
+			cpu_rq(i)->core_idle_allowance += (s64) SCHED_IDLE_ALLOWANCE;
+		}
+	}
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -273,6 +311,7 @@ void sched_core_put(void)
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static inline void account_core_idletime(struct task_struct *p, u64 exec) { }
 
 #endif /* CONFIG_SCHED_CORE */
 
@@ -6773,6 +6813,7 @@ void __init sched_init(void)
 		rq->core_enabled = 0;
 		rq->core_tree = RB_ROOT;
 		rq->core_forceidle = false;
+		rq->core_idle_allowance = (s64) SCHED_IDLE_ALLOWANCE;
 
 		rq->core_cookie = 0UL;
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e289b6e1545b..d4f9ea03296e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -611,6 +611,17 @@ bool prio_less_fair(struct task_struct *a, struct task_struct *b)
 	 * Normalize the vruntime if tasks are in different cpus.
 	 */
 	if (task_cpu(a) != task_cpu(b)) {
+
+		if (a->core_cookie && b->core_cookie &&
+		    a->core_cookie != b->core_cookie) {
+			/*
+			 * Will be force idling one thread,
+			 * pick the thread that has more allowance.
+			 */
+			return task_rq(a)->core_idle_allowance <=
+			       task_rq(b)->core_idle_allowance;
+		}
+
 		b_vruntime -= task_cfs_rq(b)->min_vruntime;
 		b_vruntime += task_cfs_rq(a)->min_vruntime;
 
@@ -817,6 +828,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
 		cgroup_account_cputime(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
+		account_core_idletime(curtask, delta_exec);
 	}
 
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bdabe7ce1152..927334b2078c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -963,6 +963,7 @@ struct rq {
 	struct task_struct	*core_pick;
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
+	s64			core_idle_allowance;
 	struct rb_root		core_tree;
 	bool			core_forceidle;
 
@@ -999,6 +1000,8 @@ static inline int cpu_of(struct rq *rq)
 }
 
 #ifdef CONFIG_SCHED_CORE
+#define SCHED_IDLE_ALLOWANCE	5000000 	/* 5 msec */
+
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -1016,6 +1019,7 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 
 extern void queue_core_balance(struct rq *rq);
 extern bool prio_less_fair(struct task_struct *a, struct task_struct *b);
+extern void  account_core_idletime(struct task_struct *p, u64 exec);
 
 #else /* !CONFIG_SCHED_CORE */
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-02 15:37                             ` Julien Desfossez
  2019-08-05 15:55                               ` Tim Chen
  2019-08-05 20:09                               ` Phil Auld
@ 2019-08-07  8:58                               ` Dario Faggioli
  2019-08-07 17:10                                 ` Tim Chen
  2 siblings, 1 reply; 161+ messages in thread
From: Dario Faggioli @ 2019-08-07  8:58 UTC (permalink / raw)
  To: Julien Desfossez, Li, Aubrey
  Cc: Aaron Lu, Aubrey Li, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 2543 bytes --]

Hello everyone,

This is Dario, from SUSE. I'm also interested in core-scheduling and in
using it in virtualization use cases.

Just for context: I've been working in virtualization for a few years,
mostly on Xen, but I've done Linux stuff before, and I am getting back to it.

For now, I've been looking at the core-scheduling code and have run some
benchmarks myself.

On Fri, 2019-08-02 at 11:37 -0400, Julien Desfossez wrote:
> We tested both Aaron's and Tim's patches and here are our results.
> 
> Test setup:
> - 2 1-thread sysbench, one running the cpu benchmark, the other one
> the
>   mem benchmark
> - both started at the same time
> - both are pinned on the same core (2 hardware threads)
> - 10 30-seconds runs
> - test script: https://paste.debian.net/plainh/834cf45c
> - only showing the CPU events/sec (higher is better)
> - tested 4 tag configurations:
>   - no tag
>   - sysbench mem untagged, sysbench cpu tagged
>   - sysbench mem tagged, sysbench cpu untagged
>   - both tagged with a different tag
> - "Alone" is the sysbench CPU running alone on the core, no tag
> - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> - "Tim's full patchset + sched" is an experiment with Tim's patchset
>   combined with Aaron's "hack patch" to get rid of the remaining deep
>   idle cases
> - In all test cases, both tasks can run simultaneously (which was not
>   the case without those patches), but the standard deviation is a
>   pretty good indicator of the fairness/consistency.
> 
This, and of course the numbers below too, is very interesting.

So, here comes my question: I've done a benchmarking campaign (yes,
I'll post numbers soon) using this branch:

https://github.com/digitalocean/linux-coresched.git  vpillai/coresched-v3-v5.1.5-test
https://github.com/digitalocean/linux-coresched/tree/vpillai/coresched-v3-v5.1.5-test

Last commit:
7feb1007f274 "Fix stalling of untagged processes competing with tagged
processes"

Since I see that, in this thread, there are various patches being
proposed and discussed... should I rerun my benchmarks with them
applied? If yes, which ones? And is there, by any chance, one (or maybe
more than one) updated git branch(es)?

Thanks in advance and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-07  8:58                               ` Dario Faggioli
@ 2019-08-07 17:10                                 ` Tim Chen
  2019-08-15 16:09                                   ` Dario Faggioli
                                                     ` (2 more replies)
  0 siblings, 3 replies; 161+ messages in thread
From: Tim Chen @ 2019-08-07 17:10 UTC (permalink / raw)
  To: Dario Faggioli, Julien Desfossez, Li, Aubrey
  Cc: Aaron Lu, Aubrey Li, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 8/7/19 1:58 AM, Dario Faggioli wrote:

> So, here comes my question: I've done a benchmarking campaign (yes,
> I'll post numbers soon) using this branch:
> 
> https://github.com/digitalocean/linux-coresched.git  vpillai/coresched-v3-v5.1.5-test
> https://github.com/digitalocean/linux-coresched/tree/vpillai/coresched-v3-v5.1.5-test
> 
> Last commit:
> 7feb1007f274 "Fix stalling of untagged processes competing with tagged
> processes"
> 
> Since I see that, in this thread, there are various patches being
> proposed and discussed... should I rerun my benchmarks with them
> applied? If yes, which ones? And is there, by any chance, one (or maybe
> more than one) updated git branch(es)?
> 
> Thanks in advance and Regards
> 

Hi Dario,

Having an extra set of eyes is certainly welcome.
I'll give my 2 cents on the issues with v3.
Others, feel free to chime in if my understanding is
incorrect or I'm missing something.

1) Unfairness between the sibling threads
-----------------------------------------
One sibling thread can suppress and force idle the other
sibling thread disproportionately, resulting in the
force-idled CPU not getting to run and tasks stalling on the
suppressed CPU.

Status:
i) Aaron has proposed a patchset here based on using one
rq as a base reference for vruntime for task priority
comparison between siblings.

https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/
It works well on fairness but has some initialization issues.

ii) Tim has proposed a patchset here to account for forced
idle time in the rq's min_vruntime:
https://lore.kernel.org/lkml/f96350c1-25a9-0564-ff46-6658e96d726c@linux.intel.com/
It improves over v3 with simpler logic compared to
Aaron's patch, but does not work as well on fairness.

iii) Tim has proposed yet another patch to maintain fairness
of forced idle time between CPU threads per Peter's suggestion.
https://lore.kernel.org/lkml/21933a50-f796-3d28-664c-030cb7c98431@linux.intel.com/
Its performance has yet to be tested.

2) Not rescheduling forced idled CPU
------------------------------------
The forced-idled CPU does not get a chance to reschedule
itself and will stall for a long time, even though it
has eligible tasks to run.

Status:
i) Aaron proposed a patch to fix this by checking whether there
are runnable tasks when the scheduling tick comes in.
https://lore.kernel.org/lkml/20190725143344.GD992@aaronlu/

ii) Vineeth has patches for this issue (and also issue 1), based
on scheduling a new "forced idle task" when going forced-idle,
but has yet to post the patches.

3) Load balancing between CPU cores
-----------------------------------
Say one CPU core's sibling threads get force idled
a lot because it has mostly incompatible tasks between the siblings;
moving the incompatible load to other cores and pulling
compatible load to the core could help CPU utilization.

So just considering the load of a task is not enough during
load balancing; task compatibility also needs to be considered.
Peter has put in mechanisms to balance compatible tasks between
CPU thread siblings, but not across cores.

Status:
I have not seen patches for this issue.  It could lead to
large variance in workload performance depending on your luck
in placing the workload among the cores.
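One illustrative shape such cookie-aware balancing could take (purely
hypothetical; no patch in this thread does this, and all names below are
made up): score candidate cores by how many of their queued tasks share
the migrating task's cookie, and prefer the highest-scoring core.

```c
#include <assert.h>

/* Toy model: a core described by the cookies of the tasks queued on it. */
struct toy_core {
	unsigned long cookies[8];
	int nr;
};

/*
 * Score a candidate core for a task with the given cookie: count how many
 * queued tasks it could co-run with on the siblings. Matching cookies
 * (including 0 == 0, i.e. both untagged) are compatible.
 */
static int toy_compat_score(const struct toy_core *c, unsigned long cookie)
{
	int i, score = 0;

	for (i = 0; i < c->nr; i++)
		if (c->cookies[i] == cookie)
			score++;
	return score;
}
```

A real implementation would of course have to weigh this against the
existing load metrics rather than replace them.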

Thanks.

Tim

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-06 21:19                                       ` Tim Chen
@ 2019-08-08  6:47                                         ` Aaron Lu
  2019-08-08 17:27                                           ` Tim Chen
  0 siblings, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-08-08  6:47 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Julien Desfossez, Li, Aubrey, Aubrey Li,
	Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Tue, Aug 06, 2019 at 02:19:57PM -0700, Tim Chen wrote:
> +void account_core_idletime(struct task_struct *p, u64 exec)
> +{
> +	const struct cpumask *smt_mask;
> +	struct rq *rq;
> +	bool force_idle, refill;
> +	int i, cpu;
> +
> +	rq = task_rq(p);
> +	if (!sched_core_enabled(rq) || !p->core_cookie)
> +		return;

I don't see why we return here for an untagged task. An untagged task can
also preempt a tagged task and force a CPU thread to enter the idle state.
Untagged is just another tag to me, unless we want to allow an untagged
task to coschedule with a tagged task.

> +	cpu = task_cpu(p);
> +	force_idle = false;
> +	refill = true;
> +	smt_mask = cpu_smt_mask(cpu);
> +
> +	for_each_cpu(i, smt_mask) {
> +		if (cpu == i)
> +			continue;
> +
> +		if (cpu_rq(i)->core_forceidle)
> +			force_idle = true;
> +
> +		/* Only refill if everyone has run out of allowance */
> +		if (cpu_rq(i)->core_idle_allowance > 0)
> +			refill = false;
> +	}
> +
> +	if (force_idle)
> +		rq->core_idle_allowance -= (s64) exec;
> +
> +	if (rq->core_idle_allowance < 0 && refill) {
> +		for_each_cpu(i, smt_mask) {
> +			cpu_rq(i)->core_idle_allowance += (s64) SCHED_IDLE_ALLOWANCE;
> +		}
> +	}
> +}

^ permalink raw reply	[flat|nested] 161+ messages in thread

* [tip:sched/core] stop_machine: Fix stop_cpus_in_progress ordering
  2019-05-29 20:36 ` [RFC PATCH v3 01/16] stop_machine: Fix stop_cpus_in_progress ordering Vineeth Remanan Pillai
@ 2019-08-08 10:54   ` tip-bot for Peter Zijlstra
  2019-08-26 16:19   ` [RFC PATCH v3 01/16] " mark gross
  1 sibling, 0 replies; 161+ messages in thread
From: tip-bot for Peter Zijlstra @ 2019-08-08 10:54 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, aaron.lwe, naravamudan, valentin.schneider, peterz,
	tglx, hpa, jdesfossez, pauld, mingo

Commit-ID:  99d84bf8c65a7a0dbc9e166ca0a58ed949ac4f37
Gitweb:     https://git.kernel.org/tip/99d84bf8c65a7a0dbc9e166ca0a58ed949ac4f37
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 29 May 2019 20:36:37 +0000
Committer:  Peter Zijlstra <peterz@infradead.org>
CommitDate: Thu, 8 Aug 2019 09:09:30 +0200

stop_machine: Fix stop_cpus_in_progress ordering

Make sure the entire for loop has stop_cpus_in_progress set.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/0fd8fd4b99b9b9aa88d8b2dff897f7fd0d88f72c.1559129225.git.vpillai@digitalocean.com
---
 kernel/stop_machine.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index b4f83f7bdf86..c7031a22aa7b 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -383,6 +383,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
 	 */
 	preempt_disable();
 	stop_cpus_in_progress = true;
+	barrier();
 	for_each_cpu(cpu, cpumask) {
 		work = &per_cpu(cpu_stopper.stop_work, cpu);
 		work->fn = fn;
@@ -391,6 +392,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
 		if (cpu_stop_queue_work(cpu, work))
 			queued = true;
 	}
+	barrier();
 	stop_cpus_in_progress = false;
 	preempt_enable();
 

^ permalink raw reply	[flat|nested] 161+ messages in thread

* [tip:sched/core] sched: Fix kerneldoc comment for ia64_set_curr_task
  2019-05-29 20:36 ` [RFC PATCH v3 02/16] sched: Fix kerneldoc comment for ia64_set_curr_task Vineeth Remanan Pillai
@ 2019-08-08 10:55   ` tip-bot for Peter Zijlstra
  2019-08-26 16:20   ` [RFC PATCH v3 02/16] " mark gross
  1 sibling, 0 replies; 161+ messages in thread
From: tip-bot for Peter Zijlstra @ 2019-08-08 10:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, mingo, hpa, valentin.schneider, naravamudan, aaron.lwe,
	linux-kernel, pauld, peterz, jdesfossez

Commit-ID:  5feeb7837a448f659e0aaa19fb446b1d9a4b323a
Gitweb:     https://git.kernel.org/tip/5feeb7837a448f659e0aaa19fb446b1d9a4b323a
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 29 May 2019 20:36:38 +0000
Committer:  Peter Zijlstra <peterz@infradead.org>
CommitDate: Thu, 8 Aug 2019 09:09:30 +0200

sched: Fix kerneldoc comment for ia64_set_curr_task

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/fde3a65ea3091ec6b84dac3c19639f85f452c5d1.1559129225.git.vpillai@digitalocean.com
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b4a44bc84749..9a821ff68502 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6772,7 +6772,7 @@ struct task_struct *curr_task(int cpu)
 
 #ifdef CONFIG_IA64
 /**
- * set_curr_task - set the current task for a given CPU.
+ * ia64_set_curr_task - set the current task for a given CPU.
  * @cpu: the processor in question.
  * @p: the task pointer to set.
  *

^ permalink raw reply	[flat|nested] 161+ messages in thread

* [tip:sched/core] sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  2019-05-29 20:36 ` [RFC PATCH v3 04/16] sched/{rt,deadline}: Fix set_next_task vs pick_next_task Vineeth Remanan Pillai
@ 2019-08-08 10:55   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 161+ messages in thread
From: tip-bot for Peter Zijlstra @ 2019-08-08 10:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: aaron.lwe, valentin.schneider, pauld, peterz, hpa, jdesfossez,
	linux-kernel, tglx, mingo, naravamudan

Commit-ID:  f95d4eaee6d0207bff2dc93371133d31227d4cfb
Gitweb:     https://git.kernel.org/tip/f95d4eaee6d0207bff2dc93371133d31227d4cfb
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 29 May 2019 20:36:40 +0000
Committer:  Peter Zijlstra <peterz@infradead.org>
CommitDate: Thu, 8 Aug 2019 09:09:30 +0200

sched/{rt,deadline}: Fix set_next_task vs pick_next_task

Because pick_next_task() implies set_curr_task() and some of the
details haven't mattered too much, some of what _should_ be in
set_curr_task() ended up in pick_next_task, correct this.

This prepares the way for a pick_next_task() variant that does not
affect the current state; allowing remote picking.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/38c61d5240553e043c27c5e00b9dd0d184dd6081.1559129225.git.vpillai@digitalocean.com
---
 kernel/sched/deadline.c | 22 +++++++++++-----------
 kernel/sched/rt.c       | 26 +++++++++++++-------------
 2 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 039dde2b1dac..2dc2784b196c 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1727,12 +1727,20 @@ static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
 }
 #endif
 
-static inline void set_next_task(struct rq *rq, struct task_struct *p)
+static void set_next_task_dl(struct rq *rq, struct task_struct *p)
 {
 	p->se.exec_start = rq_clock_task(rq);
 
 	/* You can't push away the running task */
 	dequeue_pushable_dl_task(rq, p);
+
+	if (hrtick_enabled(rq))
+		start_hrtick_dl(rq, p);
+
+	if (rq->curr->sched_class != &dl_sched_class)
+		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+	deadline_queue_push_tasks(rq);
 }
 
 static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
@@ -1791,15 +1799,7 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	p = dl_task_of(dl_se);
 
-	set_next_task(rq, p);
-
-	if (hrtick_enabled(rq))
-		start_hrtick_dl(rq, p);
-
-	deadline_queue_push_tasks(rq);
-
-	if (rq->curr->sched_class != &dl_sched_class)
-		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+	set_next_task_dl(rq, p);
 
 	return p;
 }
@@ -1846,7 +1846,7 @@ static void task_fork_dl(struct task_struct *p)
 
 static void set_curr_task_dl(struct rq *rq)
 {
-	set_next_task(rq, rq->curr);
+	set_next_task_dl(rq, rq->curr);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a532558a5176..40bb71004325 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1498,12 +1498,22 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flag
 #endif
 }
 
-static inline void set_next_task(struct rq *rq, struct task_struct *p)
+static inline void set_next_task_rt(struct rq *rq, struct task_struct *p)
 {
 	p->se.exec_start = rq_clock_task(rq);
 
 	/* The running task is never eligible for pushing */
 	dequeue_pushable_task(rq, p);
+
+	/*
+	 * If prev task was rt, put_prev_task() has already updated the
+	 * utilization. We only care of the case where we start to schedule a
+	 * rt task
+	 */
+	if (rq->curr->sched_class != &rt_sched_class)
+		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+	rt_queue_push_tasks(rq);
 }
 
 static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
@@ -1577,17 +1587,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	p = _pick_next_task_rt(rq);
 
-	set_next_task(rq, p);
-
-	rt_queue_push_tasks(rq);
-
-	/*
-	 * If prev task was rt, put_prev_task() has already updated the
-	 * utilization. We only care of the case where we start to schedule a
-	 * rt task
-	 */
-	if (rq->curr->sched_class != &rt_sched_class)
-		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+	set_next_task_rt(rq, p);
 
 	return p;
 }
@@ -2356,7 +2356,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
 
 static void set_curr_task_rt(struct rq *rq)
 {
-	set_next_task(rq, rq->curr);
+	set_next_task_rt(rq, rq->curr);
 }
 
 static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)



* [tip:sched/core] sched: Add task_struct pointer to sched_class::set_curr_task
  2019-05-29 20:36 ` [RFC PATCH v3 05/16] sched: Add task_struct pointer to sched_class::set_curr_task Vineeth Remanan Pillai
@ 2019-08-08 10:57   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 161+ messages in thread
From: tip-bot for Peter Zijlstra @ 2019-08-08 10:57 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, naravamudan, hpa, jdesfossez, tglx, mingo, linux-kernel,
	valentin.schneider, pauld, aaron.lwe

Commit-ID:  03b7fad167efca3b7abbbb39733933f9df56e79c
Gitweb:     https://git.kernel.org/tip/03b7fad167efca3b7abbbb39733933f9df56e79c
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 29 May 2019 20:36:41 +0000
Committer:  Peter Zijlstra <peterz@infradead.org>
CommitDate: Thu, 8 Aug 2019 09:09:31 +0200

sched: Add task_struct pointer to sched_class::set_curr_task

In preparation for further separating pick_next_task() and
set_curr_task(), we have to pass the actual task into it; while
there, rename it to better pair with put_prev_task().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/a96d1bcdd716db4a4c5da2fece647a1456c0ed78.1559129225.git.vpillai@digitalocean.com
---
 kernel/sched/core.c      | 12 ++++++------
 kernel/sched/deadline.c  |  7 +------
 kernel/sched/fair.c      | 17 ++++++++++++++---
 kernel/sched/idle.c      | 27 +++++++++++++++------------
 kernel/sched/rt.c        |  7 +------
 kernel/sched/sched.h     |  7 ++++---
 kernel/sched/stop_task.c | 17 +++++++----------
 7 files changed, 48 insertions(+), 46 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 364b6d7da2be..0c4220789092 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1494,7 +1494,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 }
 
 /*
@@ -4325,7 +4325,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 	if (queued)
 		enqueue_task(rq, p, queue_flag);
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 out_unlock:
@@ -4392,7 +4392,7 @@ void set_user_nice(struct task_struct *p, long nice)
 			resched_curr(rq);
 	}
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 out_unlock:
 	task_rq_unlock(rq, p, &rf);
 }
@@ -4840,7 +4840,7 @@ change:
 		enqueue_task(rq, p, queue_flags);
 	}
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 
@@ -6042,7 +6042,7 @@ void sched_setnuma(struct task_struct *p, int nid)
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 	task_rq_unlock(rq, p, &rf);
 }
 #endif /* CONFIG_NUMA_BALANCING */
@@ -6919,7 +6919,7 @@ void sched_move_task(struct task_struct *tsk)
 	if (queued)
 		enqueue_task(rq, tsk, queue_flags);
 	if (running)
-		set_curr_task(rq, tsk);
+		set_next_task(rq, tsk);
 
 	task_rq_unlock(rq, tsk, &rf);
 }
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 2dc2784b196c..6eae79350303 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1844,11 +1844,6 @@ static void task_fork_dl(struct task_struct *p)
 	 */
 }
 
-static void set_curr_task_dl(struct rq *rq)
-{
-	set_next_task_dl(rq, rq->curr);
-}
-
 #ifdef CONFIG_SMP
 
 /* Only try algorithms three times */
@@ -2466,6 +2461,7 @@ const struct sched_class dl_sched_class = {
 
 	.pick_next_task		= pick_next_task_dl,
 	.put_prev_task		= put_prev_task_dl,
+	.set_next_task		= set_next_task_dl,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_dl,
@@ -2476,7 +2472,6 @@ const struct sched_class dl_sched_class = {
 	.task_woken		= task_woken_dl,
 #endif
 
-	.set_curr_task		= set_curr_task_dl,
 	.task_tick		= task_tick_dl,
 	.task_fork              = task_fork_dl,
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7d8043fc8317..8ce1b8893947 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10150,9 +10150,19 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
  * This routine is mostly called to set cfs_rq->curr field when a task
  * migrates between groups/classes.
  */
-static void set_curr_task_fair(struct rq *rq)
+static void set_next_task_fair(struct rq *rq, struct task_struct *p)
 {
-	struct sched_entity *se = &rq->curr->se;
+	struct sched_entity *se = &p->se;
+
+#ifdef CONFIG_SMP
+	if (task_on_rq_queued(p)) {
+		/*
+		 * Move the next running task to the front of the list, so our
+		 * cfs_tasks list becomes MRU one.
+		 */
+		list_move(&se->group_node, &rq->cfs_tasks);
+	}
+#endif
 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -10423,7 +10433,9 @@ const struct sched_class fair_sched_class = {
 	.check_preempt_curr	= check_preempt_wakeup,
 
 	.pick_next_task		= pick_next_task_fair,
+
 	.put_prev_task		= put_prev_task_fair,
+	.set_next_task          = set_next_task_fair,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
@@ -10436,7 +10448,6 @@ const struct sched_class fair_sched_class = {
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
 
-	.set_curr_task          = set_curr_task_fair,
 	.task_tick		= task_tick_fair,
 	.task_fork		= task_fork_fair,
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 80940939b733..54194d41035c 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -374,14 +374,25 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 	resched_curr(rq);
 }
 
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
+{
+}
+
+static void set_next_task_idle(struct rq *rq, struct task_struct *next)
+{
+	update_idle_core(rq);
+	schedstat_inc(rq->sched_goidle);
+}
+
 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
+	struct task_struct *next = rq->idle;
+
 	put_prev_task(rq, prev);
-	update_idle_core(rq);
-	schedstat_inc(rq->sched_goidle);
+	set_next_task_idle(rq, next);
 
-	return rq->idle;
+	return next;
 }
 
 /*
@@ -397,10 +408,6 @@ dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
 	raw_spin_lock_irq(&rq->lock);
 }
 
-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
-{
-}
-
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -413,10 +420,6 @@ static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
 {
 }
 
-static void set_curr_task_idle(struct rq *rq)
-{
-}
-
 static void switched_to_idle(struct rq *rq, struct task_struct *p)
 {
 	BUG();
@@ -451,13 +454,13 @@ const struct sched_class idle_sched_class = {
 
 	.pick_next_task		= pick_next_task_idle,
 	.put_prev_task		= put_prev_task_idle,
+	.set_next_task          = set_next_task_idle,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_idle,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
 
-	.set_curr_task          = set_curr_task_idle,
 	.task_tick		= task_tick_idle,
 
 	.get_rr_interval	= get_rr_interval_idle,
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 40bb71004325..f71bcbe1a00c 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2354,11 +2354,6 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
 	}
 }
 
-static void set_curr_task_rt(struct rq *rq)
-{
-	set_next_task_rt(rq, rq->curr);
-}
-
 static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
 {
 	/*
@@ -2380,6 +2375,7 @@ const struct sched_class rt_sched_class = {
 
 	.pick_next_task		= pick_next_task_rt,
 	.put_prev_task		= put_prev_task_rt,
+	.set_next_task          = set_next_task_rt,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_rt,
@@ -2391,7 +2387,6 @@ const struct sched_class rt_sched_class = {
 	.switched_from		= switched_from_rt,
 #endif
 
-	.set_curr_task          = set_curr_task_rt,
 	.task_tick		= task_tick_rt,
 
 	.get_rr_interval	= get_rr_interval_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b3449d0dd7f0..f3c50445bf22 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1707,6 +1707,7 @@ struct sched_class {
 					       struct task_struct *prev,
 					       struct rq_flags *rf);
 	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
+	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
@@ -1721,7 +1722,6 @@ struct sched_class {
 	void (*rq_offline)(struct rq *rq);
 #endif
 
-	void (*set_curr_task)(struct rq *rq);
 	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
 	void (*task_fork)(struct task_struct *p);
 	void (*task_dead)(struct task_struct *p);
@@ -1755,9 +1755,10 @@ static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 	prev->sched_class->put_prev_task(rq, prev);
 }
 
-static inline void set_curr_task(struct rq *rq, struct task_struct *curr)
+static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
-	curr->sched_class->set_curr_task(rq);
+	WARN_ON_ONCE(rq->curr != next);
+	next->sched_class->set_next_task(rq, next);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index c183b790ca54..47a3d2a18a9a 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -23,6 +23,11 @@ check_preempt_curr_stop(struct rq *rq, struct task_struct *p, int flags)
 	/* we're never preempted */
 }
 
+static void set_next_task_stop(struct rq *rq, struct task_struct *stop)
+{
+	stop->se.exec_start = rq_clock_task(rq);
+}
+
 static struct task_struct *
 pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -32,8 +37,7 @@ pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 		return NULL;
 
 	put_prev_task(rq, prev);
-
-	stop->se.exec_start = rq_clock_task(rq);
+	set_next_task_stop(rq, stop);
 
 	return stop;
 }
@@ -86,13 +90,6 @@ static void task_tick_stop(struct rq *rq, struct task_struct *curr, int queued)
 {
 }
 
-static void set_curr_task_stop(struct rq *rq)
-{
-	struct task_struct *stop = rq->stop;
-
-	stop->se.exec_start = rq_clock_task(rq);
-}
-
 static void switched_to_stop(struct rq *rq, struct task_struct *p)
 {
 	BUG(); /* its impossible to change to this class */
@@ -128,13 +125,13 @@ const struct sched_class stop_sched_class = {
 
 	.pick_next_task		= pick_next_task_stop,
 	.put_prev_task		= put_prev_task_stop,
+	.set_next_task          = set_next_task_stop,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_stop,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
 
-	.set_curr_task          = set_curr_task_stop,
 	.task_tick		= task_tick_stop,
 
 	.get_rr_interval	= get_rr_interval_stop,


* [tip:sched/core] sched/fair: Expose newidle_balance()
  2019-05-29 20:36 ` [RFC PATCH v3 06/16] sched/fair: Export newidle_balance() Vineeth Remanan Pillai
@ 2019-08-08 10:58   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 161+ messages in thread
From: tip-bot for Peter Zijlstra @ 2019-08-08 10:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, naravamudan, mingo, valentin.schneider, peterz, tglx, pauld,
	aaron.lwe, jdesfossez, linux-kernel

Commit-ID:  5ba553eff0c3a7c099b1e29a740277a82c0c3314
Gitweb:     https://git.kernel.org/tip/5ba553eff0c3a7c099b1e29a740277a82c0c3314
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 29 May 2019 20:36:42 +0000
Committer:  Peter Zijlstra <peterz@infradead.org>
CommitDate: Thu, 8 Aug 2019 09:09:31 +0200

sched/fair: Expose newidle_balance()

For pick_next_task_fair() it is the newidle balance that requires
dropping the rq->lock; provided we do put_prev_task() early, we can
also detect the condition for doing newidle early.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/9e3eb1859b946f03d7e500453a885725b68957ba.1559129225.git.vpillai@digitalocean.com
---
 kernel/sched/fair.c  | 18 ++++++++----------
 kernel/sched/sched.h |  4 ++++
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ce1b8893947..e7c27eda9f24 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3690,8 +3690,6 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
 	return cfs_rq->avg.load_avg;
 }
 
-static int idle_balance(struct rq *this_rq, struct rq_flags *rf);
-
 static inline unsigned long task_util(struct task_struct *p)
 {
 	return READ_ONCE(p->se.avg.util_avg);
@@ -6878,11 +6876,10 @@ done: __maybe_unused;
 	return p;
 
 idle:
-	update_misfit_status(NULL, rq);
-	new_tasks = idle_balance(rq, rf);
+	new_tasks = newidle_balance(rq, rf);
 
 	/*
-	 * Because idle_balance() releases (and re-acquires) rq->lock, it is
+	 * Because newidle_balance() releases (and re-acquires) rq->lock, it is
 	 * possible for any higher priority task to appear. In that case we
 	 * must re-start the pick_next_entity() loop.
 	 */
@@ -9045,10 +9042,10 @@ out_one_pinned:
 	ld_moved = 0;
 
 	/*
-	 * idle_balance() disregards balance intervals, so we could repeatedly
-	 * reach this code, which would lead to balance_interval skyrocketting
-	 * in a short amount of time. Skip the balance_interval increase logic
-	 * to avoid that.
+	 * newidle_balance() disregards balance intervals, so we could
+	 * repeatedly reach this code, which would lead to balance_interval
+	 * skyrocketting in a short amount of time. Skip the balance_interval
+	 * increase logic to avoid that.
 	 */
 	if (env.idle == CPU_NEWLY_IDLE)
 		goto out;
@@ -9758,7 +9755,7 @@ static inline void nohz_newidle_balance(struct rq *this_rq) { }
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
  */
-static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
+int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 {
 	unsigned long next_balance = jiffies + HZ;
 	int this_cpu = this_rq->cpu;
@@ -9766,6 +9763,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 	int pulled_task = 0;
 	u64 curr_cost = 0;
 
+	update_misfit_status(NULL, this_rq);
 	/*
 	 * We must set idle_stamp _before_ calling idle_balance(), such that we
 	 * measure the duration of idle_balance() as idle time.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f3c50445bf22..304d98e712bf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1445,10 +1445,14 @@ static inline void unregister_sched_domain_sysctl(void)
 }
 #endif
 
+extern int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
+
 #else
 
 static inline void sched_ttwu_pending(void) { }
 
+static inline int newidle_balance(struct rq *this_rq, struct rq_flags *rf) { return 0; }
+
 #endif /* CONFIG_SMP */
 
 #include "stats.h"


* [tip:sched/core] sched: Allow put_prev_task() to drop rq->lock
  2019-05-29 20:36 ` [RFC PATCH v3 07/16] sched: Allow put_prev_task() to drop rq->lock Vineeth Remanan Pillai
@ 2019-08-08 10:58   ` tip-bot for Peter Zijlstra
  2019-08-26 16:51   ` [RFC PATCH v3 07/16] " mark gross
  1 sibling, 0 replies; 161+ messages in thread
From: tip-bot for Peter Zijlstra @ 2019-08-08 10:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: aaron.lwe, naravamudan, pauld, mingo, valentin.schneider,
	linux-kernel, tglx, peterz, hpa, jdesfossez

Commit-ID:  5f2a45fc9e89e022233085e6f0f352eb6ff770bb
Gitweb:     https://git.kernel.org/tip/5f2a45fc9e89e022233085e6f0f352eb6ff770bb
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 29 May 2019 20:36:43 +0000
Committer:  Peter Zijlstra <peterz@infradead.org>
CommitDate: Thu, 8 Aug 2019 09:09:31 +0200

sched: Allow put_prev_task() to drop rq->lock

Currently the pick_next_task() loop is convoluted and ugly because of
how it can drop the rq->lock and needs to restart the picking.

For the RT/Deadline classes, it is in put_prev_task() that we do
balancing, and we could do this before the picking loop. Make this
possible.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/e4519f6850477ab7f3d257062796e6425ee4ba7c.1559129225.git.vpillai@digitalocean.com
---
 kernel/sched/core.c      |  2 +-
 kernel/sched/deadline.c  | 14 +++++++++++++-
 kernel/sched/fair.c      |  2 +-
 kernel/sched/idle.c      |  2 +-
 kernel/sched/rt.c        | 14 +++++++++++++-
 kernel/sched/sched.h     |  4 ++--
 kernel/sched/stop_task.c |  2 +-
 7 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c4220789092..7bbe78a31ba5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6090,7 +6090,7 @@ static struct task_struct *__pick_migrate_task(struct rq *rq)
 	for_each_class(class) {
 		next = class->pick_next_task(rq, NULL, NULL);
 		if (next) {
-			next->sched_class->put_prev_task(rq, next);
+			next->sched_class->put_prev_task(rq, next, NULL);
 			return next;
 		}
 	}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6eae79350303..2872e15a87cd 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1804,13 +1804,25 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	return p;
 }
 
-static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 {
 	update_curr_dl(rq);
 
 	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
 	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
+
+	if (rf && !on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
+		/*
+		 * This is OK, because current is on_cpu, which avoids it being
+		 * picked for load-balance and preemption/IRQs are still
+		 * disabled avoiding further scheduler activity on it and we've
+		 * not yet started the picking loop.
+		 */
+		rq_unpin_lock(rq, rf);
+		pull_dl_task(rq);
+		rq_repin_lock(rq, rf);
+	}
 }
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7c27eda9f24..4418c1998e69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6901,7 +6901,7 @@ idle:
 /*
  * Account for a descheduled task:
  */
-static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct sched_entity *se = &prev->se;
 	struct cfs_rq *cfs_rq;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 54194d41035c..8d59de2e4a6e 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -374,7 +374,7 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 	resched_curr(rq);
 }
 
-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f71bcbe1a00c..dbdabd76f192 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1592,7 +1592,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	return p;
 }
 
-static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
+static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 {
 	update_curr_rt(rq);
 
@@ -1604,6 +1604,18 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 	 */
 	if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
+
+	if (rf && !on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
+		/*
+		 * This is OK, because current is on_cpu, which avoids it being
+		 * picked for load-balance and preemption/IRQs are still
+		 * disabled avoiding further scheduler activity on it and we've
+		 * not yet started the picking loop.
+		 */
+		rq_unpin_lock(rq, rf);
+		pull_rt_task(rq);
+		rq_repin_lock(rq, rf);
+	}
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 304d98e712bf..e085cffb8004 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1710,7 +1710,7 @@ struct sched_class {
 	struct task_struct * (*pick_next_task)(struct rq *rq,
 					       struct task_struct *prev,
 					       struct rq_flags *rf);
-	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
+	void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct rq_flags *rf);
 	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
@@ -1756,7 +1756,7 @@ struct sched_class {
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
 	WARN_ON_ONCE(rq->curr != prev);
-	prev->sched_class->put_prev_task(rq, prev);
+	prev->sched_class->put_prev_task(rq, prev, NULL);
 }
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 47a3d2a18a9a..8f414018d5e0 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -59,7 +59,7 @@ static void yield_task_stop(struct rq *rq)
 	BUG(); /* the stop task should never yield, its pointless. */
 }
 
-static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct task_struct *curr = rq->curr;
 	u64 delta_exec;


* [tip:sched/core] sched: Rework pick_next_task() slow-path
  2019-05-29 20:36 ` [RFC PATCH v3 08/16] sched: Rework pick_next_task() slow-path Vineeth Remanan Pillai
@ 2019-08-08 10:59   ` tip-bot for Peter Zijlstra
  2019-08-26 17:01   ` [RFC PATCH v3 08/16] " mark gross
  1 sibling, 0 replies; 161+ messages in thread
From: tip-bot for Peter Zijlstra @ 2019-08-08 10:59 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, jdesfossez, peterz, naravamudan, linux-kernel, aaron.lwe,
	hpa, valentin.schneider, pauld, mingo

Commit-ID:  67692435c411e5c53a1c588ecca2037aebd81f2e
Gitweb:     https://git.kernel.org/tip/67692435c411e5c53a1c588ecca2037aebd81f2e
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 29 May 2019 20:36:44 +0000
Committer:  Peter Zijlstra <peterz@infradead.org>
CommitDate: Thu, 8 Aug 2019 09:09:31 +0200

sched: Rework pick_next_task() slow-path

Avoid the RETRY_TASK case in the pick_next_task() slow path.

By doing the put_prev_task() early, we get the rt/deadline pull done,
and by testing rq->nr_running we know if we need newidle_balance().

This then gives a stable state to pick a task from.

Since the fast-path is fair-only, the other classes will always see
pick_next_task(.prev=NULL, .rf=NULL), and we can simplify.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/aa34d24b36547139248f32a30138791ac6c02bd6.1559129225.git.vpillai@digitalocean.com
---
 kernel/sched/core.c      | 19 ++++++++++++-------
 kernel/sched/deadline.c  | 30 ++----------------------------
 kernel/sched/fair.c      |  9 ++++++---
 kernel/sched/idle.c      |  4 +++-
 kernel/sched/rt.c        | 29 +----------------------------
 kernel/sched/sched.h     | 13 ++++++++-----
 kernel/sched/stop_task.c |  3 ++-
 7 files changed, 34 insertions(+), 73 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7bbe78a31ba5..a6661852907b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3791,7 +3791,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 		p = fair_sched_class.pick_next_task(rq, prev, rf);
 		if (unlikely(p == RETRY_TASK))
-			goto again;
+			goto restart;
 
 		/* Assumes fair_sched_class->next == idle_sched_class */
 		if (unlikely(!p))
@@ -3800,14 +3800,19 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		return p;
 	}
 
-again:
+restart:
+	/*
+	 * Ensure that we put DL/RT tasks before the pick loop, such that they
+	 * can PULL higher prio tasks when we lower the RQ 'priority'.
+	 */
+	prev->sched_class->put_prev_task(rq, prev, rf);
+	if (!rq->nr_running)
+		newidle_balance(rq, rf);
+
 	for_each_class(class) {
-		p = class->pick_next_task(rq, prev, rf);
-		if (p) {
-			if (unlikely(p == RETRY_TASK))
-				goto again;
+		p = class->pick_next_task(rq, NULL, NULL);
+		if (p)
 			return p;
-		}
 	}
 
 	/* The idle class should always have a runnable task: */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 2872e15a87cd..0b9cbfb2b1d4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1761,39 +1761,13 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	struct task_struct *p;
 	struct dl_rq *dl_rq;
 
-	dl_rq = &rq->dl;
-
-	if (need_pull_dl_task(rq, prev)) {
-		/*
-		 * This is OK, because current is on_cpu, which avoids it being
-		 * picked for load-balance and preemption/IRQs are still
-		 * disabled avoiding further scheduler activity on it and we're
-		 * being very careful to re-start the picking loop.
-		 */
-		rq_unpin_lock(rq, rf);
-		pull_dl_task(rq);
-		rq_repin_lock(rq, rf);
-		/*
-		 * pull_dl_task() can drop (and re-acquire) rq->lock; this
-		 * means a stop task can slip in, in which case we need to
-		 * re-start task selection.
-		 */
-		if (rq->stop && task_on_rq_queued(rq->stop))
-			return RETRY_TASK;
-	}
+	WARN_ON_ONCE(prev || rf);
 
-	/*
-	 * When prev is DL, we may throttle it in put_prev_task().
-	 * So, we update time before we check for dl_nr_running.
-	 */
-	if (prev->sched_class == &dl_sched_class)
-		update_curr_dl(rq);
+	dl_rq = &rq->dl;
 
 	if (unlikely(!dl_rq->dl_nr_running))
 		return NULL;
 
-	put_prev_task(rq, prev);
-
 	dl_se = pick_next_dl_entity(rq, dl_rq);
 	BUG_ON(!dl_se);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4418c1998e69..19c58599e967 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6770,7 +6770,7 @@ again:
 		goto idle;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	if (prev->sched_class != &fair_sched_class)
+	if (!prev || prev->sched_class != &fair_sched_class)
 		goto simple;
 
 	/*
@@ -6847,8 +6847,8 @@ again:
 	goto done;
 simple:
 #endif
-
-	put_prev_task(rq, prev);
+	if (prev)
+		put_prev_task(rq, prev);
 
 	do {
 		se = pick_next_entity(cfs_rq, NULL);
@@ -6876,6 +6876,9 @@ done: __maybe_unused;
 	return p;
 
 idle:
+	if (!rf)
+		return NULL;
+
 	new_tasks = newidle_balance(rq, rf);
 
 	/*
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 8d59de2e4a6e..7c54550dda6a 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -389,7 +389,9 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 {
 	struct task_struct *next = rq->idle;
 
-	put_prev_task(rq, prev);
+	if (prev)
+		put_prev_task(rq, prev);
+
 	set_next_task_idle(rq, next);
 
 	return next;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index dbdabd76f192..858c4cc6f99b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1553,38 +1553,11 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	struct task_struct *p;
 	struct rt_rq *rt_rq = &rq->rt;
 
-	if (need_pull_rt_task(rq, prev)) {
-		/*
-		 * This is OK, because current is on_cpu, which avoids it being
-		 * picked for load-balance and preemption/IRQs are still
-		 * disabled avoiding further scheduler activity on it and we're
-		 * being very careful to re-start the picking loop.
-		 */
-		rq_unpin_lock(rq, rf);
-		pull_rt_task(rq);
-		rq_repin_lock(rq, rf);
-		/*
-		 * pull_rt_task() can drop (and re-acquire) rq->lock; this
-		 * means a dl or stop task can slip in, in which case we need
-		 * to re-start task selection.
-		 */
-		if (unlikely((rq->stop && task_on_rq_queued(rq->stop)) ||
-			     rq->dl.dl_nr_running))
-			return RETRY_TASK;
-	}
-
-	/*
-	 * We may dequeue prev's rt_rq in put_prev_task().
-	 * So, we update time before rt_queued check.
-	 */
-	if (prev->sched_class == &rt_sched_class)
-		update_curr_rt(rq);
+	WARN_ON_ONCE(prev || rf);
 
 	if (!rt_rq->rt_queued)
 		return NULL;
 
-	put_prev_task(rq, prev);
-
 	p = _pick_next_task_rt(rq);
 
 	set_next_task_rt(rq, p);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e085cffb8004..7111e3a1eeb4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1700,12 +1700,15 @@ struct sched_class {
 	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
 
 	/*
-	 * It is the responsibility of the pick_next_task() method that will
-	 * return the next task to call put_prev_task() on the @prev task or
-	 * something equivalent.
+	 * Both @prev and @rf are optional and may be NULL, in which case the
+	 * caller must already have invoked put_prev_task(rq, prev, rf).
 	 *
-	 * May return RETRY_TASK when it finds a higher prio class has runnable
-	 * tasks.
+	 * Otherwise it is the responsibility of the pick_next_task() to call
+	 * put_prev_task() on the @prev task or something equivalent, IFF it
+	 * returns a next task.
+	 *
+	 * In that case (@rf != NULL) it may return RETRY_TASK when it finds a
+	 * higher prio class has runnable tasks.
 	 */
 	struct task_struct * (*pick_next_task)(struct rq *rq,
 					       struct task_struct *prev,
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 8f414018d5e0..7e1cee4e65b2 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -33,10 +33,11 @@ pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 {
 	struct task_struct *stop = rq->stop;
 
+	WARN_ON_ONCE(prev || rf);
+
 	if (!stop || !task_on_rq_queued(stop))
 		return NULL;
 
-	put_prev_task(rq, prev);
 	set_next_task_stop(rq, stop);
 
 	return stop;

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-05 15:55                               ` Tim Chen
  2019-08-06  3:24                                 ` Aaron Lu
@ 2019-08-08 12:55                                 ` Aaron Lu
  2019-08-08 16:39                                   ` Tim Chen
  1 sibling, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-08-08 12:55 UTC (permalink / raw)
  To: Tim Chen
  Cc: Julien Desfossez, Li, Aubrey, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Mon, Aug 05, 2019 at 08:55:28AM -0700, Tim Chen wrote:
> On 8/2/19 8:37 AM, Julien Desfossez wrote:
> > We tested both Aaron's and Tim's patches and here are our results.
> > 
> > Test setup:
> > - 2 1-thread sysbench, one running the cpu benchmark, the other one the
> >   mem benchmark
> > - both started at the same time
> > - both are pinned on the same core (2 hardware threads)
> > - 10 30-seconds runs
> > - test script: https://paste.debian.net/plainh/834cf45c
> > - only showing the CPU events/sec (higher is better)
> > - tested 4 tag configurations:
> >   - no tag
> >   - sysbench mem untagged, sysbench cpu tagged
> >   - sysbench mem tagged, sysbench cpu untagged
> >   - both tagged with a different tag
> > - "Alone" is the sysbench CPU running alone on the core, no tag
> > - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> > - "Tim's full patchset + sched" is an experiment with Tim's patchset
> >   combined with Aaron's "hack patch" to get rid of the remaining deep
> >   idle cases
> > - In all test cases, both tasks can run simultaneously (which was not
> >   the case without those patches), but the standard deviation is a
> >   pretty good indicator of the fairness/consistency.
> 
> Thanks for testing the patches and giving such detailed data.
> 
> I came to realize that for my scheme, the accumulated deficit of forced idle could be wiped
> out in one execution of a task on the forced idle cpu, with the update of the min_vruntime,
> even if the execution time could be far less than the accumulated deficit.
> That's probably one reason my scheme didn't achieve fairness.

Turns out there is a typo in v3 when setting rq's core_forceidle:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26fea68f7f54..542974a8da18 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3888,7 +3888,7 @@ next_class:;
 		WARN_ON_ONCE(!rq_i->core_pick);
 
 		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
-			rq->core_forceidle = true;
+			rq_i->core_forceidle = true;
 
 		rq_i->core_pick->core_occupation = occ;

With this fixed and together with the patch to let schedule always
happen, your latest 2 patches work well for the 10s cpuhog test I
described previously:
https://lore.kernel.org/lkml/20190725143003.GA992@aaronlu/

overloaded workload without any cpu binding doesn't work well though, I
haven't taken a closer look yet.


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-08 12:55                                 ` Aaron Lu
@ 2019-08-08 16:39                                   ` Tim Chen
  2019-08-10 14:18                                     ` Aaron Lu
  0 siblings, 1 reply; 161+ messages in thread
From: Tim Chen @ 2019-08-08 16:39 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Julien Desfossez, Li, Aubrey, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 8/8/19 5:55 AM, Aaron Lu wrote:
> On Mon, Aug 05, 2019 at 08:55:28AM -0700, Tim Chen wrote:
>> On 8/2/19 8:37 AM, Julien Desfossez wrote:
>>> We tested both Aaron's and Tim's patches and here are our results.

> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26fea68f7f54..542974a8da18 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3888,7 +3888,7 @@ next_class:;
>  		WARN_ON_ONCE(!rq_i->core_pick);
>  
>  		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> -			rq->core_forceidle = true;
> +			rq_i->core_forceidle = true;

Good catch!

>  
>  		rq_i->core_pick->core_occupation = occ;
> 
> With this fixed and together with the patch to let schedule always
> happen, your latest 2 patches work well for the 10s cpuhog test I
> described previously:
> https://lore.kernel.org/lkml/20190725143003.GA992@aaronlu/

That's encouraging.  You are talking about my patches
that try to keep the force idle time between sibling threads
balanced, right?

> 
> overloaded workload without any cpu binding doesn't work well though, I
> haven't taken a closer look yet.
> 

I think we need a load balancing scheme among the cores that will try
to minimize force idle.

One possible metric to measure load compatibility imbalance that leads to
force idle is 

Say i, j are sibling threads of a cpu core
imbalance = \sum_tagged_cgroup  abs(Load_cgroup_cpui - Load_cgroup_cpuj)

This gives us a metric to decide if migrating a task will improve
load compatibility imbalance.  As we already track cgroup load on a CPU,
it should be doable without adding too much overhead.
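A userspace sketch of that metric for one core (the `compat_imbalance()` helper and the per-cgroup load arrays are illustrative assumptions, not kernel code):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Toy model of the proposed metric: for each tagged cgroup, sum the
 * absolute difference between its load on sibling thread i and sibling
 * thread j of the same core.  load_i[g]/load_j[g] stand in for the
 * per-CPU cgroup load the kernel already tracks.
 */
static long compat_imbalance(const long *load_i, const long *load_j,
			     int nr_tagged)
{
	long sum = 0;
	int g;

	for (g = 0; g < nr_tagged; g++)
		sum += labs(load_i[g] - load_j[g]);

	return sum;
}
```

A migration candidate would then be evaluated by whether moving it lowers this sum for the affected cores.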

Tim



* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-08  6:47                                         ` Aaron Lu
@ 2019-08-08 17:27                                           ` Tim Chen
  2019-08-08 21:42                                             ` Tim Chen
  0 siblings, 1 reply; 161+ messages in thread
From: Tim Chen @ 2019-08-08 17:27 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Peter Zijlstra, Julien Desfossez, Li, Aubrey, Aubrey Li,
	Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 8/7/19 11:47 PM, Aaron Lu wrote:
> On Tue, Aug 06, 2019 at 02:19:57PM -0700, Tim Chen wrote:
>> +void account_core_idletime(struct task_struct *p, u64 exec)
>> +{
>> +	const struct cpumask *smt_mask;
>> +	struct rq *rq;
>> +	bool force_idle, refill;
>> +	int i, cpu;
>> +
>> +	rq = task_rq(p);
>> +	if (!sched_core_enabled(rq) || !p->core_cookie)
>> +		return;
> 
> I don't see why return here for untagged task. Untagged task can also
> preempt tagged task and force a CPU thread enter idle state.
> Untagged is just another tag to me, unless we want to allow untagged
> task to coschedule with a tagged task.

You are right.  This needs to be fixed.

And the cookie check will also need to be changed in prio_less_fair.

@@ -611,6 +611,17 @@ bool prio_less_fair(struct task_struct *a, struct task_struct *b)
 	 * Normalize the vruntime if tasks are in different cpus.
 	 */
 	if (task_cpu(a) != task_cpu(b)) {
+
+		if (a->core_cookie && b->core_cookie &&
+		    a->core_cookie != b->core_cookie) {

		if (!cookie_match(a, b))

+			/*
+			 * Will be force idling one thread,
+			 * pick the thread that has more allowance.
+			 */
+			return (task_rq(a)->core_idle_allowance <=
+				task_rq(b)->core_idle_allowance) ? true : false;
+		}
+

I'll respin my patches.

Tim


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-08 17:27                                           ` Tim Chen
@ 2019-08-08 21:42                                             ` Tim Chen
  2019-08-10 14:15                                               ` Aaron Lu
  0 siblings, 1 reply; 161+ messages in thread
From: Tim Chen @ 2019-08-08 21:42 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Peter Zijlstra, Julien Desfossez, Li, Aubrey, Aubrey Li,
	Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 8/8/19 10:27 AM, Tim Chen wrote:
> On 8/7/19 11:47 PM, Aaron Lu wrote:
>> On Tue, Aug 06, 2019 at 02:19:57PM -0700, Tim Chen wrote:
>>> +void account_core_idletime(struct task_struct *p, u64 exec)
>>> +{
>>> +	const struct cpumask *smt_mask;
>>> +	struct rq *rq;
>>> +	bool force_idle, refill;
>>> +	int i, cpu;
>>> +
>>> +	rq = task_rq(p);
>>> +	if (!sched_core_enabled(rq) || !p->core_cookie)
>>> +		return;
>>
>> I don't see why return here for untagged task. Untagged task can also
>> preempt tagged task and force a CPU thread enter idle state.
>> Untagged is just another tag to me, unless we want to allow untagged
>> task to coschedule with a tagged task.
> 
> You are right.  This needs to be fixed.
> 

Here's the updated patchset, including Aaron's fix and also
added accounting of force idle time by deadline and rt tasks.

Tim

-----------------patch 1----------------------
From 730dbb125f5f67c75f97f6be154d382767810f8b Mon Sep 17 00:00:00 2001
From: Aaron Lu <aaron.lu@linux.alibaba.com>
Date: Thu, 8 Aug 2019 08:57:46 -0700
Subject: [PATCH 1/3 v2] sched: Fix incorrect rq tagged as forced idle

An incorrect run queue was tagged as forced idle.
Tag the correct one.

Signed-off-by: Aaron Lu <aaron.lu@linux.alibaba.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e3cd9cb17809..50453e1329f3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3903,7 +3903,7 @@ next_class:;
 		WARN_ON_ONCE(!rq_i->core_pick);
 
 		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
-			rq->core_forceidle = true;
+			rq_i->core_forceidle = true;
 
 		rq_i->core_pick->core_occupation = occ;
 
-- 
2.20.1

--------------patch 2------------------------
From 263ceeb40843b8ca3a91f1b268bec2b836d4986b Mon Sep 17 00:00:00 2001
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Wed, 24 Jul 2019 13:58:18 -0700
Subject: [PATCH 2/3 v2] sched: Move sched fair prio comparison to fair.c

Consolidate the task priority comparison of the fair class into fair.c.
This is a simple code reorganization; there are no functional changes.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/core.c  | 21 ++-------------------
 kernel/sched/fair.c  | 21 +++++++++++++++++++++
 kernel/sched/sched.h |  1 +
 3 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 50453e1329f3..0f893853766c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -105,25 +105,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 a_vruntime = a->se.vruntime;
-		u64 b_vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			b_vruntime -= task_cfs_rq(b)->min_vruntime;
-			b_vruntime += task_cfs_rq(a)->min_vruntime;
-
-			trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
-				     a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
-				     b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
-
-		}
-
-		return !((s64)(a_vruntime - b_vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+		return prio_less_fair(a, b);
 
 	return false;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02bff10237d4..e289b6e1545b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -602,6 +602,27 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
 	return delta;
 }
 
+bool prio_less_fair(struct task_struct *a, struct task_struct *b)
+{
+	u64 a_vruntime = a->se.vruntime;
+	u64 b_vruntime = b->se.vruntime;
+
+	/*
+	 * Normalize the vruntime if tasks are in different cpus.
+	 */
+	if (task_cpu(a) != task_cpu(b)) {
+		b_vruntime -= task_cfs_rq(b)->min_vruntime;
+		b_vruntime += task_cfs_rq(a)->min_vruntime;
+
+		trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
+			     a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
+			     b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
+
+	}
+
+	return !((s64)(a_vruntime - b_vruntime) <= 0);
+}
+
 /*
  * The idea is to set a period in which each task runs once.
  *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e91c188a452c..bdabe7ce1152 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1015,6 +1015,7 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 }
 
 extern void queue_core_balance(struct rq *rq);
+extern bool prio_less_fair(struct task_struct *a, struct task_struct *b);
 
 #else /* !CONFIG_SCHED_CORE */
 
-- 
2.20.1
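The comparison being moved can be exercised outside the kernel; the sketch below is a userspace model of prio_less_fair() with `toy_*` stand-ins for the task and cfs_rq fields it reads (assumed names for illustration, not kernel types):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-ins for the fields prio_less_fair() reads. */
struct toy_task {
	int cpu;		/* task_cpu() */
	uint64_t vruntime;	/* se.vruntime */
	uint64_t min_vruntime;	/* task_cfs_rq()->min_vruntime */
};

/*
 * Model of prio_less_fair(): true if @a has lower priority, i.e. a
 * larger normalized vruntime, than @b.
 */
static bool toy_prio_less_fair(const struct toy_task *a,
			       const struct toy_task *b)
{
	uint64_t a_vruntime = a->vruntime;
	uint64_t b_vruntime = b->vruntime;

	/* Normalize b's vruntime into a's cfs_rq when on different CPUs. */
	if (a->cpu != b->cpu) {
		b_vruntime -= b->min_vruntime;
		b_vruntime += a->min_vruntime;
	}

	return !((int64_t)(a_vruntime - b_vruntime) <= 0);
}
```

The signed subtraction keeps the comparison correct across u64 wraparound, the same trick entity_before() uses.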

------------------------patch 3---------------------------
From 5318e23c741a832140effbaf2f79fdf4b08f883c Mon Sep 17 00:00:00 2001
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Tue, 6 Aug 2019 12:50:45 -0700
Subject: [PATCH 3/3 v2] sched: Enforce fairness between cpu threads

A CPU thread can be suppressed by its sibling for an extended time.
Implement a budget for force idling, giving all CPU threads an equal
chance to run.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/core.c     | 43 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/deadline.c |  1 +
 kernel/sched/fair.c     | 11 +++++++++++
 kernel/sched/rt.c       |  1 +
 kernel/sched/sched.h    |  4 ++++
 5 files changed, 60 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0f893853766c..de83dcb84495 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -207,6 +207,46 @@ static struct task_struct *sched_core_next(struct task_struct *p, unsigned long
 	return p;
 }
 
+void account_core_idletime(struct task_struct *p, u64 exec)
+{
+	const struct cpumask *smt_mask;
+	struct rq *rq;
+	bool force_idle, refill;
+	int i, cpu;
+
+	rq = task_rq(p);
+	if (!sched_core_enabled(rq))
+		return;
+
+	cpu = task_cpu(p);
+	force_idle = false;
+	refill = true;
+	smt_mask = cpu_smt_mask(cpu);
+
+	for_each_cpu(i, smt_mask) {
+		if (cpu == i || cpu_is_offline(i))
+			continue;
+
+		if (cpu_rq(i)->core_forceidle)
+			force_idle = true;
+
+		/* Only refill if everyone has run out of allowance */
+		if (cpu_rq(i)->core_idle_allowance > 0)
+			refill = false;
+	}
+
+	if (force_idle)
+		rq->core_idle_allowance -= (s64) exec;
+
+	if (rq->core_idle_allowance < 0 && refill) {
+		for_each_cpu(i, smt_mask) {
+			if (cpu_is_offline(i))
+				continue;
+			cpu_rq(i)->core_idle_allowance += (s64) SCHED_IDLE_ALLOWANCE;
+		}
+	}
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -273,6 +313,7 @@ void sched_core_put(void)
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static inline void account_core_idletime(struct task_struct *p, u64 exec) { }
 
 #endif /* CONFIG_SCHED_CORE */
 
@@ -6773,6 +6815,7 @@ void __init sched_init(void)
 		rq->core_enabled = 0;
 		rq->core_tree = RB_ROOT;
 		rq->core_forceidle = false;
+		rq->core_idle_allowance = (s64) SCHED_IDLE_ALLOWANCE;
 
 		rq->core_cookie = 0UL;
 #endif
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 64fc444f44f9..684c64a95ec7 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1175,6 +1175,7 @@ static void update_curr_dl(struct rq *rq)
 
 	curr->se.sum_exec_runtime += delta_exec;
 	account_group_exec_runtime(curr, delta_exec);
+	account_core_idletime(curr, delta_exec);
 
 	curr->se.exec_start = now;
 	cgroup_account_cputime(curr, delta_exec);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e289b6e1545b..f65270784c28 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -611,6 +611,16 @@ bool prio_less_fair(struct task_struct *a, struct task_struct *b)
 	 * Normalize the vruntime if tasks are in different cpus.
 	 */
 	if (task_cpu(a) != task_cpu(b)) {
+
+		if (a->core_cookie != b->core_cookie) {
+			/*
+			 * Will be force idling one thread,
+			 * pick the thread that has more allowance.
+			 */
+			return (task_rq(a)->core_idle_allowance <
+				task_rq(b)->core_idle_allowance) ? true : false;
+		}
+
 		b_vruntime -= task_cfs_rq(b)->min_vruntime;
 		b_vruntime += task_cfs_rq(a)->min_vruntime;
 
@@ -817,6 +827,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
 		cgroup_account_cputime(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
+		account_core_idletime(curtask, delta_exec);
 	}
 
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 81557224548c..6f18e1455778 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -971,6 +971,7 @@ static void update_curr_rt(struct rq *rq)
 
 	curr->se.sum_exec_runtime += delta_exec;
 	account_group_exec_runtime(curr, delta_exec);
+	account_core_idletime(curr, delta_exec);
 
 	curr->se.exec_start = now;
 	cgroup_account_cputime(curr, delta_exec);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bdabe7ce1152..927334b2078c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -963,6 +963,7 @@ struct rq {
 	struct task_struct	*core_pick;
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
+	s64			core_idle_allowance;
 	struct rb_root		core_tree;
 	bool			core_forceidle;
 
@@ -999,6 +1000,8 @@ static inline int cpu_of(struct rq *rq)
 }
 
 #ifdef CONFIG_SCHED_CORE
+#define SCHED_IDLE_ALLOWANCE	5000000 	/* 5 msec */
+
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -1016,6 +1019,7 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 
 extern void queue_core_balance(struct rq *rq);
 extern bool prio_less_fair(struct task_struct *a, struct task_struct *b);
+extern void  account_core_idletime(struct task_struct *p, u64 exec);
 
 #else /* !CONFIG_SCHED_CORE */
 
-- 
2.20.1
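The budget logic of patch 3 can be modeled in userspace for a 2-thread core; `toy_account()` and the plain arrays below are illustrative stand-ins for the per-rq fields, with the constant mirroring SCHED_IDLE_ALLOWANCE (5 msec):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_ALLOWANCE 5000000LL	/* 5 msec, mirrors SCHED_IDLE_ALLOWANCE */

/*
 * Model of account_core_idletime() for a 2-thread core: charge @exec
 * against @cpu's allowance while its sibling is forced idle, and refill
 * all siblings only once everyone has run out of allowance.
 */
static void toy_account(int64_t allowance[2], const bool forceidle[2],
			int cpu, int64_t exec)
{
	int sib = !cpu;

	if (forceidle[sib])
		allowance[cpu] -= exec;

	/* Only refill if everyone has run out of allowance. */
	if (allowance[cpu] < 0 && allowance[sib] <= 0) {
		allowance[0] += TOY_ALLOWANCE;
		allowance[1] += TOY_ALLOWANCE;
	}
}
```

prio_less_fair() then prefers the thread with the larger remaining allowance when cookies mismatch, which is what bounds how long one sibling can force-idle the other.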



* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-08 21:42                                             ` Tim Chen
@ 2019-08-10 14:15                                               ` Aaron Lu
  2019-08-12 15:38                                                 ` Vineeth Remanan Pillai
  0 siblings, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-08-10 14:15 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Julien Desfossez, Li, Aubrey, Aubrey Li,
	Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Thu, Aug 08, 2019 at 02:42:57PM -0700, Tim Chen wrote:
> On 8/8/19 10:27 AM, Tim Chen wrote:
> > On 8/7/19 11:47 PM, Aaron Lu wrote:
> >> On Tue, Aug 06, 2019 at 02:19:57PM -0700, Tim Chen wrote:
> >>> +void account_core_idletime(struct task_struct *p, u64 exec)
> >>> +{
> >>> +	const struct cpumask *smt_mask;
> >>> +	struct rq *rq;
> >>> +	bool force_idle, refill;
> >>> +	int i, cpu;
> >>> +
> >>> +	rq = task_rq(p);
> >>> +	if (!sched_core_enabled(rq) || !p->core_cookie)
> >>> +		return;
> >>
> >> I don't see why return here for untagged task. Untagged task can also
> >> preempt tagged task and force a CPU thread enter idle state.
> >> Untagged is just another tag to me, unless we want to allow untagged
> >> task to coschedule with a tagged task.
> > 
> > You are right.  This needs to be fixed.
> > 
> 
> Here's the updated patchset, including Aaron's fix and also
> added accounting of force idle time by deadline and rt tasks.

I have two other small changes that I think are worth sending out.

The first simplifies the logic in pick_task() and the second avoids
doing the task pick all over again when max is preempted. I also refined
the previous hack patch to make schedule always happen only for the root
cfs rq. Please see below for details, thanks.

patch1:

From cea56db35fe9f393c357cdb1bdcb2ef9b56cfe97 Mon Sep 17 00:00:00 2001
From: Aaron Lu <ziqian.lzq@antfin.com>
Date: Mon, 5 Aug 2019 21:21:25 +0800
Subject: [PATCH 1/3] sched/core: simplify pick_task()

No need to special-case the !cookie case in pick_task(); we just need to
make it possible to return idle in sched_core_find() for a !cookie query.
Also, cookie_pick will always have lower priority than class_pick, so
remove the redundant check of prio_less(cookie_pick, class_pick).

Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
---
 kernel/sched/core.c | 19 ++++---------------
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 90655c9ad937..84fec9933b74 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -186,6 +186,8 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
 	 * The idle task always matches any cookie!
 	 */
 	match = idle_sched_class.pick_task(rq);
+	if (!cookie)
+		goto out;
 
 	while (node) {
 		node_task = container_of(node, struct task_struct, core_node);
@@ -199,7 +201,7 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
 			node = node->rb_left;
 		}
 	}
-
+out:
 	return match;
 }
 
@@ -3657,18 +3659,6 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 	if (!class_pick)
 		return NULL;
 
-	if (!cookie) {
-		/*
-		 * If class_pick is tagged, return it only if it has
-		 * higher priority than max.
-		 */
-		if (max && class_pick->core_cookie &&
-		    prio_less(class_pick, max))
-			return idle_sched_class.pick_task(rq);
-
-		return class_pick;
-	}
-
 	/*
 	 * If class_pick is idle or matches cookie, return early.
 	 */
@@ -3682,8 +3672,7 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 	 * the core (so far) and it must be selected, otherwise we must go with
 	 * the cookie pick in order to satisfy the constraint.
 	 */
-	if (prio_less(cookie_pick, class_pick) &&
-	    (!max || prio_less(max, class_pick)))
+	if (!max || prio_less(max, class_pick))
 		return class_pick;
 
 	return cookie_pick;
-- 
2.19.1.3.ge56e4f7
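The simplified lookup can be modeled in userspace with an array standing in for the core rb-tree (`toy_*` names and the integer task handles are illustrative, not kernel APIs): with no cookie, the idle stand-in — which matches any cookie — is returned immediately; otherwise the first cookie match wins, falling back to idle:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_IDLE 0	/* stand-in for idle_sched_class.pick_task() */

/*
 * Toy model of the simplified sched_core_find(): tasks[] and cookies[]
 * stand in for the rq's core_tree keyed by core_cookie.
 */
static int toy_core_find(const int tasks[], const unsigned long cookies[],
			 size_t n, unsigned long cookie)
{
	size_t i;

	/* The idle task always matches any cookie; !cookie returns it. */
	if (!cookie)
		return TOY_IDLE;

	for (i = 0; i < n; i++)
		if (cookies[i] == cookie)
			return tasks[i];

	return TOY_IDLE;
}
```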

patch2:

From 487950dc53a40d5c566602f775ce46a0bab7a412 Mon Sep 17 00:00:00 2001
From: Aaron Lu <ziqian.lzq@antfin.com>
Date: Fri, 9 Aug 2019 14:48:01 +0800
Subject: [PATCH 2/3] sched/core: no need to pick again after max is preempted

When a sibling's task preempts the current max, there is no need to do
the pick all over again - the preempted cpu can just pick idle and be done.

Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
---
 kernel/sched/core.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84fec9933b74..e88583860abe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3756,7 +3756,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 * order.
 	 */
 	for_each_class(class) {
-again:
 		for_each_cpu_wrap(i, smt_mask, cpu) {
 			struct rq *rq_i = cpu_rq(i);
 			struct task_struct *p;
@@ -3828,10 +3827,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 						if (j == i)
 							continue;
 
-						cpu_rq(j)->core_pick = NULL;
+						cpu_rq(j)->core_pick = idle_sched_class.pick_task(cpu_rq(j));
 					}
 					occ = 1;
-					goto again;
+					goto out;
 				} else {
 					/*
 					 * Once we select a task for a cpu, we
@@ -3846,7 +3845,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		}
 next_class:;
 	}
-
+out:
 	rq->core->core_pick_seq = rq->core->core_task_seq;
 	next = rq->core_pick;
 	rq->core_sched_seq = rq->core->core_pick_seq;
-- 
2.19.1.3.ge56e4f7

patch3:

From 2d396d99e0dd7157b0b4f7a037c8b84ed135ea56 Mon Sep 17 00:00:00 2001
From: Aaron Lu <ziqian.lzq@antfin.com>
Date: Thu, 25 Jul 2019 19:57:21 +0800
Subject: [PATCH 3/3] sched/fair: make tick based schedule always happen

When a hyperthread is forced idle and the other hyperthread has a single
CPU intensive task running, the running task can occupy the hyperthread
for a long time with no scheduling point and starve the other
hyperthread.

Fix this temporarily by always checking if the task has exceeded its
timeslice and, if so, doing a schedule for the root cfs_rq.

Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
---
 kernel/sched/fair.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 26d29126d6a5..b1f0defdad91 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4011,6 +4011,9 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 		return;
 	}
 
+	if (cfs_rq->nr_running <= 1)
+		return;
+
 	/*
 	 * Ensure that a task that missed wakeup preemption by a
 	 * narrow margin doesn't have to wait for a full slice.
@@ -4179,7 +4182,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 		return;
 #endif
 
-	if (cfs_rq->nr_running > 1)
+	if (cfs_rq->nr_running > 1 || cfs_rq->tg == &root_task_group)
 		check_preempt_tick(cfs_rq, curr);
 }
 
-- 
2.19.1.3.ge56e4f7



* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-08 16:39                                   ` Tim Chen
@ 2019-08-10 14:18                                     ` Aaron Lu
  0 siblings, 0 replies; 161+ messages in thread
From: Aaron Lu @ 2019-08-10 14:18 UTC (permalink / raw)
  To: Tim Chen
  Cc: Julien Desfossez, Li, Aubrey, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Thu, Aug 08, 2019 at 09:39:45AM -0700, Tim Chen wrote:
> On 8/8/19 5:55 AM, Aaron Lu wrote:
> > On Mon, Aug 05, 2019 at 08:55:28AM -0700, Tim Chen wrote:
> >> On 8/2/19 8:37 AM, Julien Desfossez wrote:
> >>> We tested both Aaron's and Tim's patches and here are our results.
> 
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 26fea68f7f54..542974a8da18 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3888,7 +3888,7 @@ next_class:;
> >  		WARN_ON_ONCE(!rq_i->core_pick);
> >  
> >  		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> > -			rq->core_forceidle = true;
> > +			rq_i->core_forceidle = true;
> 
> Good catch!
> 
> >  
> >  		rq_i->core_pick->core_occupation = occ;
> > 
> > With this fixed and together with the patch to let schedule always
> > happen, your latest 2 patches work well for the 10s cpuhog test I
> > described previously:
> > https://lore.kernel.org/lkml/20190725143003.GA992@aaronlu/
> 
> That's encouraging.  You are talking about my patches
> that try to keep the force idle time between sibling threads
> balanced, right?

Yes.

> > 
> > overloaded workload without any cpu binding doesn't work well though, I
> > haven't taken a closer look yet.
> > 
> 
> I think we need a load balancing scheme among the cores that will try
> to minimize force idle.

Agree.

> 
> One possible metric to measure load compatibility imbalance that leads to
> force idle is 
> 
> Say i, j are sibling threads of a cpu core
> imbalance = \sum_tagged_cgroup  abs(Load_cgroup_cpui - Load_cgroup_cpuj)
> 
> This gives us a metric to decide if migrating a task will improve
> load compatibility imbalance.  As we already track cgroup load on a CPU,
> it should be doable without adding too much overhead.


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-10 14:15                                               ` Aaron Lu
@ 2019-08-12 15:38                                                 ` Vineeth Remanan Pillai
  2019-08-13  2:24                                                   ` Aaron Lu
  0 siblings, 1 reply; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-08-12 15:38 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Peter Zijlstra, Julien Desfossez, Li, Aubrey,
	Aubrey Li, Subhra Mazumdar, Nishanth Aravamudan, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

> I have two other small changes that I think are worth sending out.
>
> The first simplifies the logic in pick_task() and the second avoids
> doing the task pick all over again when max is preempted. I also refined
> the previous hack patch to make schedule always happen only for the root
> cfs rq. Please see below for details, thanks.
>
I see a potential issue here. With the simplification in pick_task,
you might introduce a livelock where the match logic spins forever.
But you avoid that with the patch 2, by removing the loop if a pick
preempts max. The potential problem is that, you miss a case where
the newly picked task might have a match in the sibling on which max
was selected before. By selecting idle, you ignore the potential match.
As of now, the potential match check does not really work because,
sched_core_find will always return the same task and we do not check
the whole core_tree for a next match. This is in my TODO list to have
sched_core_find to return the best next match, if match was preempted.
But its a bit complex and needs more thought.

Third patch looks good to me.

Thanks,
Vineeth


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-12 15:38                                                 ` Vineeth Remanan Pillai
@ 2019-08-13  2:24                                                   ` Aaron Lu
  0 siblings, 0 replies; 161+ messages in thread
From: Aaron Lu @ 2019-08-13  2:24 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Tim Chen, Peter Zijlstra, Julien Desfossez, Li, Aubrey,
	Aubrey Li, Subhra Mazumdar, Nishanth Aravamudan, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 2019/8/12 23:38, Vineeth Remanan Pillai wrote:
>> I have two other small changes that I think are worth sending out.
>>
>> The first simplifies the logic in pick_task() and the second avoids picking
>> a task all over again when max is preempted. I also refined the previous hack patch to
>> make schedule always happen only for root cfs rq. Please see below for
>> details, thanks.
>>
> I see a potential issue here. With the simplification in pick_task,
> you might introduce a livelock where the match logic spins forever.
> But you avoid that with patch 2, by removing the loop if a pick
> preempts max. The potential problem is that you miss a case where
> the newly picked task might have a match on the sibling on which max
> was selected before. By selecting idle, you ignore that potential match.

Oh that's right, I missed this.

> As of now, the potential match check does not really work because
> sched_core_find will always return the same task and we do not check
> the whole core_tree for a next match. It is on my TODO list to have
> sched_core_find return the best next match, if the match was preempted.
> But it's a bit complex and needs more thought.

Sounds worth doing :-)


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-07 17:10                                 ` Tim Chen
@ 2019-08-15 16:09                                   ` Dario Faggioli
  2019-08-16  2:33                                     ` Aaron Lu
  2019-09-05  1:44                                   ` Julien Desfossez
  2019-09-06 18:30                                   ` Tim Chen
  2 siblings, 1 reply; 161+ messages in thread
From: Dario Faggioli @ 2019-08-15 16:09 UTC (permalink / raw)
  To: Tim Chen, Julien Desfossez, Li, Aubrey
  Cc: Aaron Lu, Aubrey Li, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini


On Wed, 2019-08-07 at 10:10 -0700, Tim Chen wrote:
> On 8/7/19 1:58 AM, Dario Faggioli wrote:
> 
> > Since I see that, in this thread, there are various patches being
> > proposed and discussed... should I rerun my benchmarks with them
> > applied? If yes, which ones? And is there, by any chance, one (or
> > maybe
> > more than one) updated git branch(es)?
> > 
> Hi Dario,
> 
Hi Tim!

> Having an extra set of eyes is certainly welcome.
> I'll give my 2 cents on the issues with v3.
> 
Ok, and thanks a lot for this.

> 1) Unfairness between the sibling threads
> -----------------------------------------
> One sibling thread could disproportionately suppress and
> force-idle the other sibling thread, resulting in the
> force-idled CPU not getting to run and tasks stalling on the
> suppressed CPU.
> 
> 
> [...]
>
> 2) Not rescheduling forced idled CPU
> ------------------------------------
> The forced idled CPU does not get a chance to re-schedule
> itself, and will stall for a long time even though it
> has eligible tasks to run.
> 
> [...]
> 
> 3) Load balancing between CPU cores
> -----------------------------------
> If one CPU core's sibling threads get force-idled a lot
> because the siblings mostly run incompatible tasks, moving the
> incompatible load to other cores and pulling compatible load
> to the core could improve CPU utilization.
> 
> So just considering the load of a task is not enough during
> load balancing; task compatibility also needs to be considered.
> Peter has put in mechanisms to balance compatible tasks between
> CPU thread siblings, but not across cores.
> 
> [...]
>
Ok. Yes, as said, I've been trying to follow the thread, but thanks a
lot again for this summary.

As said, I'm about to have numbers for the repo/branch I mentioned.

I was considering whether to also re-run the benchmarking campaign with
some of the patches that floated around within this thread. Now, thanks
to your summary, I have an even clearer picture about which patch does
what, and that is indeed very useful.

I'll see about putting something together. I'm thinking of picking:

https://lore.kernel.org/lkml/b7a83fcb-5c34-9794-5688-55c52697fd84@linux.intel.com/
https://lore.kernel.org/lkml/20190725143344.GD992@aaronlu/

And maybe even (part of):
https://lore.kernel.org/lkml/20190810141556.GA73644@aaronlu/#t

If anyone has ideas or suggestions about whether or not this choice
makes sense, feel free to share. :-)

Also, I only have another week before leaving, so let's see what I
manage to actually run, and then share here, by then.

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)




* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-15 16:09                                   ` Dario Faggioli
@ 2019-08-16  2:33                                     ` Aaron Lu
  0 siblings, 0 replies; 161+ messages in thread
From: Aaron Lu @ 2019-08-16  2:33 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tim Chen, Julien Desfossez, Li, Aubrey, Aubrey Li,
	Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Thu, Aug 15, 2019 at 06:09:28PM +0200, Dario Faggioli wrote:
> On Wed, 2019-08-07 at 10:10 -0700, Tim Chen wrote:
> > On 8/7/19 1:58 AM, Dario Faggioli wrote:
> > 
> > > Since I see that, in this thread, there are various patches being
> > > proposed and discussed... should I rerun my benchmarks with them
> > > applied? If yes, which ones? And is there, by any chance, one (or
> > > maybe
> > > more than one) updated git branch(es)?
> > > 
> > Hi Dario,
> > 
> Hi Tim!
> 
> > Having an extra set of eyes are certainly welcomed.
> > I'll give my 2 cents on the issues with v3.
> > 
> Ok, and thanks a lot for this.
> 
> > 1) Unfairness between the sibling threads
> > -----------------------------------------
> > One sibling thread could disproportionately suppress and
> > force-idle the other sibling thread, resulting in the
> > force-idled CPU not getting to run and tasks stalling on the
> > suppressed CPU.
> > 
> > 
> > [...]
> >
> > 2) Not rescheduling forced idled CPU
> > ------------------------------------
> > The forced idled CPU does not get a chance to re-schedule
> > itself, and will stall for a long time even though it
> > has eligible tasks to run.
> > 
> > [...]
> > 
> > 3) Load balancing between CPU cores
> > -----------------------------------
> > If one CPU core's sibling threads get force-idled a lot
> > because the siblings mostly run incompatible tasks, moving the
> > incompatible load to other cores and pulling compatible load
> > to the core could improve CPU utilization.
> > 
> > So just considering the load of a task is not enough during
> > load balancing; task compatibility also needs to be considered.
> > Peter has put in mechanisms to balance compatible tasks between
> > CPU thread siblings, but not across cores.
> > 
> > [...]
> >
> Ok. Yes, as said, I've been trying to follow the thread, but thanks a
> lot again for this summary.
> 
> As said, I'm about to have numbers for the repo/branch I mentioned.
> 
> I was considering whether to also re-run the benchmarking campaign with
> some of the patches that floated around within this thread. Now, thanks
> to your summary, I have an even clearer picture about which patch does
> what, and that is indeed very useful.
> 
> I'll see about putting something together. I'm thinking of picking:
> 
> https://lore.kernel.org/lkml/b7a83fcb-5c34-9794-5688-55c52697fd84@linux.intel.com/
> https://lore.kernel.org/lkml/20190725143344.GD992@aaronlu/
> 
> And maybe even (part of):
> https://lore.kernel.org/lkml/20190810141556.GA73644@aaronlu/#t
> 
> If anyone has ideas or suggestions about whether or not this choice
> makes sense, feel free to share. :-)

Makes sense to me.
Patch 3 in the last link is slightly better than the one in the second link,
so just use that one instead.

Thanks,
Aaron

> Also, I only have another week before leaving, so let's see what I
> manage to actually run, and then share here, by then.
> 
> Thanks and Regards
> -- 
> Dario Faggioli, Ph.D
> http://about.me/dario.faggioli
> Virtualization Software Engineer
> SUSE Labs, SUSE https://www.suse.com/
> -------------------------------------------------------------------
> <<This happens because _I_ choose it to happen!>> (Raistlin Majere)
> 




* Re: [RFC PATCH v3 01/16] stop_machine: Fix stop_cpus_in_progress ordering
  2019-05-29 20:36 ` [RFC PATCH v3 01/16] stop_machine: Fix stop_cpus_in_progress ordering Vineeth Remanan Pillai
  2019-08-08 10:54   ` [tip:sched/core] " tip-bot for Peter Zijlstra
@ 2019-08-26 16:19   ` mark gross
  2019-08-26 16:59     ` Peter Zijlstra
  1 sibling, 1 reply; 161+ messages in thread
From: mark gross @ 2019-08-26 16:19 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, linux-kernel, subhra.mazumdar,
	fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu, Aubrey Li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Wed, May 29, 2019 at 08:36:37PM +0000, Vineeth Remanan Pillai wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Make sure the entire for loop has stop_cpus_in_progress set.
It is not clear how this commit comment matches the change.  Please explain
how adding two barriers makes sure stop_cpus_in_progress is set for the entire
for loop.

--mark

> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/stop_machine.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index 067cb83f37ea..583119e0c51c 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -375,6 +375,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
>  	 */
>  	preempt_disable();
>  	stop_cpus_in_progress = true;
> +	barrier();
>  	for_each_cpu(cpu, cpumask) {
>  		work = &per_cpu(cpu_stopper.stop_work, cpu);
>  		work->fn = fn;
> @@ -383,6 +384,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
>  		if (cpu_stop_queue_work(cpu, work))
>  			queued = true;
>  	}
> +	barrier();
>  	stop_cpus_in_progress = false;
>  	preempt_enable();
>  
> -- 
> 2.17.1
> 


* Re: [RFC PATCH v3 02/16] sched: Fix kerneldoc comment for ia64_set_curr_task
  2019-05-29 20:36 ` [RFC PATCH v3 02/16] sched: Fix kerneldoc comment for ia64_set_curr_task Vineeth Remanan Pillai
  2019-08-08 10:55   ` [tip:sched/core] " tip-bot for Peter Zijlstra
@ 2019-08-26 16:20   ` mark gross
  1 sibling, 0 replies; 161+ messages in thread
From: mark gross @ 2019-08-26 16:20 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, linux-kernel, subhra.mazumdar,
	fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu, Aubrey Li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Wed, May 29, 2019 at 08:36:38PM +0000, Vineeth Remanan Pillai wrote:
> From: Peter Zijlstra <peterz@infradead.org>

NULL commit comment.

--mark

> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4778c48a7fda..416ea613eda8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6287,7 +6287,7 @@ struct task_struct *curr_task(int cpu)
>  
>  #ifdef CONFIG_IA64
>  /**
> - * set_curr_task - set the current task for a given CPU.
> + * ia64_set_curr_task - set the current task for a given CPU.
>   * @cpu: the processor in question.
>   * @p: the task pointer to set.
>   *
> -- 
> 2.17.1
> 


* Re: [RFC PATCH v3 07/16] sched: Allow put_prev_task() to drop rq->lock
  2019-05-29 20:36 ` [RFC PATCH v3 07/16] sched: Allow put_prev_task() to drop rq->lock Vineeth Remanan Pillai
  2019-08-08 10:58   ` [tip:sched/core] " tip-bot for Peter Zijlstra
@ 2019-08-26 16:51   ` mark gross
  1 sibling, 0 replies; 161+ messages in thread
From: mark gross @ 2019-08-26 16:51 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, linux-kernel, subhra.mazumdar,
	fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu, Aubrey Li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Wed, May 29, 2019 at 08:36:43PM +0000, Vineeth Remanan Pillai wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Currently the pick_next_task() loop is convoluted and ugly because of
> how it can drop the rq->lock and needs to restart the picking.
> 
> For the RT/Deadline classes, it is put_prev_task() where we do
> balancing, and we could do this before the picking loop. Make this
> possible.

Maybe explain why adding struct rq_flags pointers to the function calls supports
the above commit comment.

--mark

> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c      |  2 +-
>  kernel/sched/deadline.c  | 14 +++++++++++++-
>  kernel/sched/fair.c      |  2 +-
>  kernel/sched/idle.c      |  2 +-
>  kernel/sched/rt.c        | 14 +++++++++++++-
>  kernel/sched/sched.h     |  4 ++--
>  kernel/sched/stop_task.c |  2 +-
>  7 files changed, 32 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 32ea79fb8d29..9dfa0c53deb3 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5595,7 +5595,7 @@ static void calc_load_migrate(struct rq *rq)
>  		atomic_long_add(delta, &calc_load_tasks);
>  }
>  
> -static void put_prev_task_fake(struct rq *rq, struct task_struct *prev)
> +static void put_prev_task_fake(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>  }
>  
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index c02b3229e2c3..45425f971eec 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1772,13 +1772,25 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  	return p;
>  }
>  
> -static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
> +static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
>  {
>  	update_curr_dl(rq);
>  
>  	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
>  	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
>  		enqueue_pushable_dl_task(rq, p);
> +
> +	if (rf && !on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
> +		/*
> +		 * This is OK, because current is on_cpu, which avoids it being
> +		 * picked for load-balance and preemption/IRQs are still
> +		 * disabled avoiding further scheduler activity on it and we've
> +		 * not yet started the picking loop.
> +		 */
> +		rq_unpin_lock(rq, rf);
> +		pull_dl_task(rq);
> +		rq_repin_lock(rq, rf);
> +	}
>  }
>  
>  /*
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 49707b4797de..8e3eb243fd9f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7110,7 +7110,7 @@ done: __maybe_unused;
>  /*
>   * Account for a descheduled task:
>   */
> -static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
> +static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>  	struct sched_entity *se = &prev->se;
>  	struct cfs_rq *cfs_rq;
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index dd64be34881d..1b65a4c3683e 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -373,7 +373,7 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
>  	resched_curr(rq);
>  }
>  
> -static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
> +static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>  }
>  
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index adec98a94f2b..51ee87c5a28a 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1593,7 +1593,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  	return p;
>  }
>  
> -static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
> +static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
>  {
>  	update_curr_rt(rq);
>  
> @@ -1605,6 +1605,18 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
>  	 */
>  	if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
>  		enqueue_pushable_task(rq, p);
> +
> +	if (rf && !on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
> +		/*
> +		 * This is OK, because current is on_cpu, which avoids it being
> +		 * picked for load-balance and preemption/IRQs are still
> +		 * disabled avoiding further scheduler activity on it and we've
> +		 * not yet started the picking loop.
> +		 */
> +		rq_unpin_lock(rq, rf);
> +		pull_rt_task(rq);
> +		rq_repin_lock(rq, rf);
> +	}
>  }
>  
>  #ifdef CONFIG_SMP
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index bfcbcbb25646..4cbe2bef92e4 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1675,7 +1675,7 @@ struct sched_class {
>  	struct task_struct * (*pick_next_task)(struct rq *rq,
>  					       struct task_struct *prev,
>  					       struct rq_flags *rf);
> -	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
> +	void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct rq_flags *rf);
>  	void (*set_next_task)(struct rq *rq, struct task_struct *p);
>  
>  #ifdef CONFIG_SMP
> @@ -1721,7 +1721,7 @@ struct sched_class {
>  static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
>  {
>  	WARN_ON_ONCE(rq->curr != prev);
> -	prev->sched_class->put_prev_task(rq, prev);
> +	prev->sched_class->put_prev_task(rq, prev, NULL);
>  }
>  
>  static inline void set_next_task(struct rq *rq, struct task_struct *next)
> diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
> index 47a3d2a18a9a..8f414018d5e0 100644
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -59,7 +59,7 @@ static void yield_task_stop(struct rq *rq)
>  	BUG(); /* the stop task should never yield, its pointless. */
>  }
>  
> -static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
> +static void put_prev_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>  	struct task_struct *curr = rq->curr;
>  	u64 delta_exec;
> -- 
> 2.17.1
> 


* Re: [RFC PATCH v3 01/16] stop_machine: Fix stop_cpus_in_progress ordering
  2019-08-26 16:19   ` [RFC PATCH v3 01/16] " mark gross
@ 2019-08-26 16:59     ` Peter Zijlstra
  0 siblings, 0 replies; 161+ messages in thread
From: Peter Zijlstra @ 2019-08-26 16:59 UTC (permalink / raw)
  To: mark gross
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Tim Chen, mingo, tglx, pjt, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel, Phil Auld,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On Mon, Aug 26, 2019 at 09:19:31AM -0700, mark gross wrote:
> On Wed, May 29, 2019 at 08:36:37PM +0000, Vineeth Remanan Pillai wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > Make sure the entire for loop has stop_cpus_in_progress set.
> It is not clear how this commit comment matches the change.  Please explain
> how adding two barriers makes sure stop_cpus_in_progress is set for the entire
> for loop.

Without the barrier the compiler is free to move the stores around. It
probably doesn't do anything bad, but this makes sure it cannot.


* Re: [RFC PATCH v3 08/16] sched: Rework pick_next_task() slow-path
  2019-05-29 20:36 ` [RFC PATCH v3 08/16] sched: Rework pick_next_task() slow-path Vineeth Remanan Pillai
  2019-08-08 10:59   ` [tip:sched/core] " tip-bot for Peter Zijlstra
@ 2019-08-26 17:01   ` mark gross
  1 sibling, 0 replies; 161+ messages in thread
From: mark gross @ 2019-08-26 17:01 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, linux-kernel, subhra.mazumdar,
	fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu, Aubrey Li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Wed, May 29, 2019 at 08:36:44PM +0000, Vineeth Remanan Pillai wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Avoid the RETRY_TASK case in the pick_next_task() slow path.
> 
> By doing the put_prev_task() early, we get the rt/deadline pull done,
> and by testing rq->nr_running we know if we need newidle_balance().
> 
> This then gives a stable state to pick a task from.
> 
> Since the fast-path is fair only; it means the other classes will
> always have pick_next_task(.prev=NULL, .rf=NULL) and we can simplify.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c      | 19 ++++++++++++-------
>  kernel/sched/deadline.c  | 30 ++----------------------------
>  kernel/sched/fair.c      |  9 ++++++---
>  kernel/sched/idle.c      |  4 +++-
>  kernel/sched/rt.c        | 29 +----------------------------
>  kernel/sched/sched.h     | 13 ++++++++-----
>  kernel/sched/stop_task.c |  3 ++-
>  7 files changed, 34 insertions(+), 73 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9dfa0c53deb3..b883c70674ba 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3363,7 +3363,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  
>  		p = fair_sched_class.pick_next_task(rq, prev, rf);
>  		if (unlikely(p == RETRY_TASK))
> -			goto again;
> +			goto restart;
>  
>  		/* Assumes fair_sched_class->next == idle_sched_class */
>  		if (unlikely(!p))
> @@ -3372,14 +3372,19 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  		return p;
>  	}
>  
> -again:
> +restart:
> +	/*
> +	 * Ensure that we put DL/RT tasks before the pick loop, such that they
> +	 * can PULL higher prio tasks when we lower the RQ 'priority'.
> +	 */
> +	prev->sched_class->put_prev_task(rq, prev, rf);
> +	if (!rq->nr_running)
> +		newidle_balance(rq, rf);
> +
>  	for_each_class(class) {
> -		p = class->pick_next_task(rq, prev, rf);
> -		if (p) {
> -			if (unlikely(p == RETRY_TASK))
> -				goto again;
> +		p = class->pick_next_task(rq, NULL, NULL);
> +		if (p)
>  			return p;
> -		}
>  	}
>  
>  	/* The idle class should always have a runnable task: */
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 45425f971eec..d3904168857a 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1729,39 +1729,13 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  	struct task_struct *p;
>  	struct dl_rq *dl_rq;
>  
> -	dl_rq = &rq->dl;
> -
> -	if (need_pull_dl_task(rq, prev)) {
> -		/*
> -		 * This is OK, because current is on_cpu, which avoids it being
> -		 * picked for load-balance and preemption/IRQs are still
> -		 * disabled avoiding further scheduler activity on it and we're
> -		 * being very careful to re-start the picking loop.
> -		 */
> -		rq_unpin_lock(rq, rf);
> -		pull_dl_task(rq);
> -		rq_repin_lock(rq, rf);
> -		/*
> -		 * pull_dl_task() can drop (and re-acquire) rq->lock; this
> -		 * means a stop task can slip in, in which case we need to
> -		 * re-start task selection.
> -		 */
> -		if (rq->stop && task_on_rq_queued(rq->stop))
> -			return RETRY_TASK;
> -	}
> +	WARN_ON_ONCE(prev || rf);
Should there be a helpful message to go with this warning?

>  
> -	/*
> -	 * When prev is DL, we may throttle it in put_prev_task().
> -	 * So, we update time before we check for dl_nr_running.
> -	 */
> -	if (prev->sched_class == &dl_sched_class)
> -		update_curr_dl(rq);
> +	dl_rq = &rq->dl;
>  
>  	if (unlikely(!dl_rq->dl_nr_running))
>  		return NULL;
>  
> -	put_prev_task(rq, prev);
> -
>  	dl_se = pick_next_dl_entity(rq, dl_rq);
>  	BUG_ON(!dl_se);
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8e3eb243fd9f..e65f2dfda77a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6979,7 +6979,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  		goto idle;
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -	if (prev->sched_class != &fair_sched_class)
> +	if (!prev || prev->sched_class != &fair_sched_class)
>  		goto simple;
>  
>  	/*
> @@ -7056,8 +7056,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  	goto done;
>  simple:
>  #endif
> -
> -	put_prev_task(rq, prev);
> +	if (prev)
> +		put_prev_task(rq, prev);
>  
>  	do {
>  		se = pick_next_entity(cfs_rq, NULL);
> @@ -7085,6 +7085,9 @@ done: __maybe_unused;
>  	return p;
>  
>  idle:
> +	if (!rf)
> +		return NULL;
> +
>  	new_tasks = newidle_balance(rq, rf);
>  
>  	/*
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 1b65a4c3683e..7ece8e820b5d 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -388,7 +388,9 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  {
>  	struct task_struct *next = rq->idle;
>  
> -	put_prev_task(rq, prev);
> +	if (prev)
> +		put_prev_task(rq, prev);
> +
>  	set_next_task_idle(rq, next);
>  
>  	return next;
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 51ee87c5a28a..79f2e60516ef 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1554,38 +1554,11 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  	struct task_struct *p;
>  	struct rt_rq *rt_rq = &rq->rt;
>  
> -	if (need_pull_rt_task(rq, prev)) {
> -		/*
> -		 * This is OK, because current is on_cpu, which avoids it being
> -		 * picked for load-balance and preemption/IRQs are still
> -		 * disabled avoiding further scheduler activity on it and we're
> -		 * being very careful to re-start the picking loop.
> -		 */
> -		rq_unpin_lock(rq, rf);
> -		pull_rt_task(rq);
> -		rq_repin_lock(rq, rf);
> -		/*
> -		 * pull_rt_task() can drop (and re-acquire) rq->lock; this
> -		 * means a dl or stop task can slip in, in which case we need
> -		 * to re-start task selection.
> -		 */
> -		if (unlikely((rq->stop && task_on_rq_queued(rq->stop)) ||
> -			     rq->dl.dl_nr_running))
> -			return RETRY_TASK;
> -	}
> -
> -	/*
> -	 * We may dequeue prev's rt_rq in put_prev_task().
> -	 * So, we update time before rt_queued check.
> -	 */
> -	if (prev->sched_class == &rt_sched_class)
> -		update_curr_rt(rq);
> +	WARN_ON_ONCE(prev || rf);
>  
>  	if (!rt_rq->rt_queued)
>  		return NULL;
>  
> -	put_prev_task(rq, prev);
> -
>  	p = _pick_next_task_rt(rq);
>  
>  	set_next_task_rt(rq, p);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 4cbe2bef92e4..460dd04e76af 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1665,12 +1665,15 @@ struct sched_class {
>  	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
>  
>  	/*
> -	 * It is the responsibility of the pick_next_task() method that will
> -	 * return the next task to call put_prev_task() on the @prev task or
> -	 * something equivalent.
> +	 * Both @prev and @rf are optional and may be NULL, in which case the
> +	 * caller must already have invoked put_prev_task(rq, prev, rf).
>  	 *
> -	 * May return RETRY_TASK when it finds a higher prio class has runnable
> -	 * tasks.
> +	 * Otherwise it is the responsibility of the pick_next_task() to call
> +	 * put_prev_task() on the @prev task or something equivalent, IFF it
> +	 * returns a next task.
> +	 *
> +	 * In that case (@rf != NULL) it may return RETRY_TASK when it finds a
> +	 * higher prio class has runnable tasks.
>  	 */
>  	struct task_struct * (*pick_next_task)(struct rq *rq,
>  					       struct task_struct *prev,
> diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
> index 8f414018d5e0..7e1cee4e65b2 100644
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -33,10 +33,11 @@ pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  {
>  	struct task_struct *stop = rq->stop;
>  
> +	WARN_ON_ONCE(prev || rf);
Should there be a helpful message to go with this warning?
--mark

> +
>  	if (!stop || !task_on_rq_queued(stop))
>  		return NULL;
>  
> -	put_prev_task(rq, prev);
>  	set_next_task_stop(rq, stop);
>  
>  	return stop;
> -- 
> 2.17.1
> 


* Re: [RFC PATCH v3 09/16] sched: Introduce sched_class::pick_task()
  2019-05-29 20:36 ` [RFC PATCH v3 09/16] sched: Introduce sched_class::pick_task() Vineeth Remanan Pillai
@ 2019-08-26 17:14   ` mark gross
  0 siblings, 0 replies; 161+ messages in thread
From: mark gross @ 2019-08-26 17:14 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, linux-kernel, subhra.mazumdar,
	fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu, Aubrey Li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Wed, May 29, 2019 at 08:36:45PM +0000, Vineeth Remanan Pillai wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Because sched_class::pick_next_task() also implies
> sched_class::set_next_task() (and possibly put_prev_task() and
> newidle_balance) it is not state invariant. This makes it unsuitable
> for remote task selection.
It would be helpful if the commit comment explained what the change does
about pick_next_task() being unsuitable for remote task selection.
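For illustration, a minimal user-space sketch of the selection/commit split the patch introduces (the names mirror the patch, but the bodies are stand-ins, not kernel code): pick_task() only *selects* a candidate without touching run-queue state, so it can be called on behalf of a remote CPU, while pick_next_task() additionally *commits* via the set_next_task() step.

```c
#include <assert.h>

struct rq {
	int next;	/* candidate task id, stand-in for the real pick */
	int committed;	/* stand-in for set_next_task() side effects */
};

/* Stateless selection: no side effects on rq, safe to call remotely. */
static int pick_task(struct rq *rq)
{
	return rq->next;
}

/* Local selection: also commits the pick, mutating rq state, which is
 * why this one is unsuitable for remote task selection. */
static int pick_next_task(struct rq *rq)
{
	int p = pick_task(rq);

	rq->committed = 1;	/* analogue of set_next_task() */
	return p;
}
```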

> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
> Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
> ---
> 
> Changes in v3
> -------------
> - Minor refactor to remove redundant NULL checks
> 
> Changes in v2
> -------------
> - Fixes a NULL pointer dereference crash
>   - Subhra Mazumdar
>   - Tim Chen
> 
> ---
>  kernel/sched/deadline.c  | 21 ++++++++++++++++-----
>  kernel/sched/fair.c      | 36 +++++++++++++++++++++++++++++++++---
>  kernel/sched/idle.c      | 10 +++++++++-
>  kernel/sched/rt.c        | 21 ++++++++++++++++-----
>  kernel/sched/sched.h     |  2 ++
>  kernel/sched/stop_task.c | 21 ++++++++++++++++-----
>  6 files changed, 92 insertions(+), 19 deletions(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index d3904168857a..64fc444f44f9 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1722,15 +1722,12 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
>  	return rb_entry(left, struct sched_dl_entity, rb_node);
>  }
>  
> -static struct task_struct *
> -pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +static struct task_struct *pick_task_dl(struct rq *rq)
>  {
>  	struct sched_dl_entity *dl_se;
>  	struct task_struct *p;
>  	struct dl_rq *dl_rq;
>  
> -	WARN_ON_ONCE(prev || rf);
> -
>  	dl_rq = &rq->dl;
>  
>  	if (unlikely(!dl_rq->dl_nr_running))
> @@ -1741,7 +1738,19 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  
>  	p = dl_task_of(dl_se);
>  
> -	set_next_task_dl(rq, p);
> +	return p;
> +}
> +
> +static struct task_struct *
> +pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +{
> +	struct task_struct *p;
> +
> +	WARN_ON_ONCE(prev || rf);
What is an admin to do with this warning if it shows up in their logs?
Maybe include some text here to help folks who might hit this WARN_ON.


> +
> +	p = pick_task_dl(rq);
> +	if (p)
> +		set_next_task_dl(rq, p);
>  
>  	return p;
>  }
> @@ -2388,6 +2397,8 @@ const struct sched_class dl_sched_class = {
>  	.set_next_task		= set_next_task_dl,
>  
>  #ifdef CONFIG_SMP
> +	.pick_task		= pick_task_dl,
> +
>  	.select_task_rq		= select_task_rq_dl,
>  	.migrate_task_rq	= migrate_task_rq_dl,
>  	.set_cpus_allowed       = set_cpus_allowed_dl,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e65f2dfda77a..02e5dfb85e7d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4136,7 +4136,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  	 * Avoid running the skip buddy, if running something else can
>  	 * be done without getting too unfair.
>  	 */
> -	if (cfs_rq->skip == se) {
> +	if (cfs_rq->skip && cfs_rq->skip == se) {
>  		struct sched_entity *second;
>  
>  		if (se == curr) {
> @@ -4154,13 +4154,13 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  	/*
>  	 * Prefer last buddy, try to return the CPU to a preempted task.
>  	 */
> -	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
> +	if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
>  		se = cfs_rq->last;
>  
>  	/*
>  	 * Someone really wants this to run. If it's not unfair, run it.
>  	 */
> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
> +	if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
>  		se = cfs_rq->next;
>  
>  	clear_buddies(cfs_rq, se);
> @@ -6966,6 +6966,34 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>  		set_last_buddy(se);
>  }
>  
> +static struct task_struct *
> +pick_task_fair(struct rq *rq)
> +{
> +	struct cfs_rq *cfs_rq = &rq->cfs;
> +	struct sched_entity *se;
> +
> +	if (!cfs_rq->nr_running)
> +		return NULL;
> +
> +	do {
> +		struct sched_entity *curr = cfs_rq->curr;
> +
> +		se = pick_next_entity(cfs_rq, NULL);
> +
> +		if (curr) {
> +			if (se && curr->on_rq)
> +				update_curr(cfs_rq);
> +
> +			if (!se || entity_before(curr, se))
> +				se = curr;
> +		}
> +
> +		cfs_rq = group_cfs_rq(se);
> +	} while (cfs_rq);
> +
> +	return task_of(se);
> +}
> +
>  static struct task_struct *
>  pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
> @@ -10677,6 +10705,8 @@ const struct sched_class fair_sched_class = {
>  	.set_next_task          = set_next_task_fair,
>  
>  #ifdef CONFIG_SMP
> +	.pick_task		= pick_task_fair,
> +
>  	.select_task_rq		= select_task_rq_fair,
>  	.migrate_task_rq	= migrate_task_rq_fair,
>  
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 7ece8e820b5d..e7f38da60373 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -373,6 +373,12 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
>  	resched_curr(rq);
>  }
>  
> +static struct task_struct *
> +pick_task_idle(struct rq *rq)
> +{
> +	return rq->idle;
> +}
> +
>  static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>  }
> @@ -386,11 +392,12 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next)
>  static struct task_struct *
>  pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
> -	struct task_struct *next = rq->idle;
> +	struct task_struct *next;
>  
>  	if (prev)
>  		put_prev_task(rq, prev);
>  
> +	next = pick_task_idle(rq);
>  	set_next_task_idle(rq, next);
>  
>  	return next;
> @@ -458,6 +465,7 @@ const struct sched_class idle_sched_class = {
>  	.set_next_task          = set_next_task_idle,
>  
>  #ifdef CONFIG_SMP
> +	.pick_task		= pick_task_idle,
>  	.select_task_rq		= select_task_rq_idle,
>  	.set_cpus_allowed	= set_cpus_allowed_common,
>  #endif
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 79f2e60516ef..81557224548c 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1548,20 +1548,29 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
>  	return rt_task_of(rt_se);
>  }
>  
> -static struct task_struct *
> -pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +static struct task_struct *pick_task_rt(struct rq *rq)
>  {
>  	struct task_struct *p;
>  	struct rt_rq *rt_rq = &rq->rt;
>  
> -	WARN_ON_ONCE(prev || rf);
> -
>  	if (!rt_rq->rt_queued)
>  		return NULL;
>  
>  	p = _pick_next_task_rt(rq);
>  
> -	set_next_task_rt(rq, p);
> +	return p;
> +}
> +
> +static struct task_struct *
> +pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +{
> +	struct task_struct *p;
> +
> +	WARN_ON_ONCE(prev || rf);
What does it mean to an admin if this WARN_ON fires in their logs?

> +
> +	p = pick_task_rt(rq);
> +	if (p)
> +		set_next_task_rt(rq, p);
>  
>  	return p;
>  }
> @@ -2364,6 +2373,8 @@ const struct sched_class rt_sched_class = {
>  	.set_next_task          = set_next_task_rt,
>  
>  #ifdef CONFIG_SMP
> +	.pick_task		= pick_task_rt,
> +
>  	.select_task_rq		= select_task_rq_rt,
>  
>  	.set_cpus_allowed       = set_cpus_allowed_common,
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 460dd04e76af..a024dd80eeb3 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1682,6 +1682,8 @@ struct sched_class {
>  	void (*set_next_task)(struct rq *rq, struct task_struct *p);
>  
>  #ifdef CONFIG_SMP
> +	struct task_struct * (*pick_task)(struct rq *rq);
> +
>  	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
>  	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
>  
> diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
> index 7e1cee4e65b2..fb6c436cba6c 100644
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -29,20 +29,30 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop)
>  }
>  
>  static struct task_struct *
> -pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +pick_task_stop(struct rq *rq)
>  {
>  	struct task_struct *stop = rq->stop;
>  
> -	WARN_ON_ONCE(prev || rf);
> -
>  	if (!stop || !task_on_rq_queued(stop))
>  		return NULL;
>  
> -	set_next_task_stop(rq, stop);
> -
>  	return stop;
>  }
>  
> +static struct task_struct *
> +pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +{
> +	struct task_struct *p;
> +
> +	WARN_ON_ONCE(prev || rf);
> +
> +	p = pick_task_stop(rq);
> +	if (p)
> +		set_next_task_stop(rq, p);
> +
> +	return p;
> +}
> +
>  static void
>  enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
>  {
> @@ -129,6 +139,7 @@ const struct sched_class stop_sched_class = {
>  	.set_next_task          = set_next_task_stop,
>  
>  #ifdef CONFIG_SMP
> +	.pick_task		= pick_task_stop,
>  	.select_task_rq		= select_task_rq_stop,
>  	.set_cpus_allowed	= set_cpus_allowed_common,
>  #endif
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 11/16] sched: Basic tracking of matching tasks
  2019-05-29 20:36 ` [RFC PATCH v3 11/16] sched: Basic tracking of matching tasks Vineeth Remanan Pillai
@ 2019-08-26 20:59   ` mark gross
  0 siblings, 0 replies; 161+ messages in thread
From: mark gross @ 2019-08-26 20:59 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, linux-kernel, subhra.mazumdar,
	fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu, Aubrey Li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Wed, May 29, 2019 at 08:36:47PM +0000, Vineeth Remanan Pillai wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Introduce task_struct::core_cookie as an opaque identifier for core
> scheduling. When enabled; core scheduling will only allow matching
> task to be on the core; where idle matches everything.
> 
> When task_struct::core_cookie is set (and core scheduling is enabled)
> these tasks are indexed in a second RB-tree, first on cookie value
> then on scheduling function, such that matching task selection always
> finds the most eligible match.
> 
> NOTE: *shudder* at the overhead...
> 
> NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
> is per class tracking of cookies and that just duplicates a lot of
> stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).
 s/raisin/reason/

> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
> Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
> ---
> 
> Changes in v3
> -------------
> - Refactored priority comparison code
> - Fixed a comparison logic issue in sched_core_find
>   - Aaron Lu
> 
> Changes in v2
> -------------
> - Improves the priority comparison logic between processes in
>   different cpus.
>   - Peter Zijlstra
>   - Aaron Lu
> 
> ---
>  include/linux/sched.h |   8 ++-
>  kernel/sched/core.c   | 146 ++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/fair.c   |  46 -------------
>  kernel/sched/sched.h  |  55 ++++++++++++++++
>  4 files changed, 208 insertions(+), 47 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1549584a1538..a4b39a28236f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -636,10 +636,16 @@ struct task_struct {
>  	const struct sched_class	*sched_class;
>  	struct sched_entity		se;
>  	struct sched_rt_entity		rt;
> +	struct sched_dl_entity		dl;
> +
> +#ifdef CONFIG_SCHED_CORE
> +	struct rb_node			core_node;
> +	unsigned long			core_cookie;
> +#endif
> +
>  #ifdef CONFIG_CGROUP_SCHED
>  	struct task_group		*sched_task_group;
>  #endif
> -	struct sched_dl_entity		dl;
>  
>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>  	/* List of struct preempt_notifier: */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b1ce33f9b106..112d70f2b1e5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -64,6 +64,141 @@ int sysctl_sched_rt_runtime = 950000;
>  
>  DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
>  
> +/* kernel prio, less is more */
> +static inline int __task_prio(struct task_struct *p)
> +{
> +	if (p->sched_class == &stop_sched_class) /* trumps deadline */
> +		return -2;
> +
> +	if (rt_prio(p->prio)) /* includes deadline */
> +		return p->prio; /* [-1, 99] */
> +
> +	if (p->sched_class == &idle_sched_class)
> +		return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
> +
> +	return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
> +}
> +
> +/*
> + * l(a,b)
> + * le(a,b) := !l(b,a)
> + * g(a,b)  := l(b,a)
> + * ge(a,b) := !l(a,b)
Why does this truth table comment exist?
Maybe inline comments at the confusing inequalities would be better.
--mark
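For what it's worth, the identities do have a point: they let the code implement a single strict order l(a,b) and derive the other three comparisons from it, so there is only one comparison to get right. A toy model of just those identities (plain C over ints, not the kernel's task comparison):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the identities in the quoted comment: implement one
 * strict order l(a,b); the rest are derived, never hand-written. */
static bool l(int a, int b)  { return a < b; }     /* l(a,b)             */
static bool le(int a, int b) { return !l(b, a); }  /* le(a,b) := !l(b,a) */
static bool g(int a, int b)  { return l(b, a); }   /* g(a,b)  := l(b,a)  */
static bool ge(int a, int b) { return !l(a, b); }  /* ge(a,b) := !l(a,b) */
```

Any bug fix to the ordering then only ever touches l().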



^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
                   ` (16 preceding siblings ...)
  2019-05-30 14:04 ` [RFC PATCH v3 00/16] Core scheduling v3 Aubrey Li
@ 2019-08-27 21:14 ` Matthew Garrett
  2019-08-27 21:50   ` Peter Zijlstra
  2019-08-27 23:24   ` Aubrey Li
  17 siblings, 2 replies; 161+ messages in thread
From: Matthew Garrett @ 2019-08-27 21:14 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, linux-kernel, subhra.mazumdar,
	fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu, Aubrey Li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

Apple have provided a sysctl that allows applications to indicate that 
specific threads should make use of core isolation while allowing 
the rest of the system to make use of SMT, and browsers (Safari, Firefox 
and Chrome, at least) are now making use of this. Trying to do something 
similar using cgroups seems a bit awkward. Would something like this be 
reasonable? Having spoken to the Chrome team, I believe that the 
semantics we want are:

1) A thread to be able to indicate that it should not run on the same 
core as anything not in possession of the same cookie
2) Descendants of that thread to (by default) have the same cookie
3) No other thread be able to obtain the same cookie
4) Threads not be able to rejoin the global group (ie, threads can 
segregate themselves from their parent and peers, but can never rejoin 
that group once segregated)

but don't know if that's what everyone else would want.

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 094bb03b9cc2..5d411246d4d5 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -229,4 +229,5 @@ struct prctl_mm_map {
 # define PR_PAC_APDBKEY			(1UL << 3)
 # define PR_PAC_APGAKEY			(1UL << 4)
 
+#define PR_CORE_ISOLATE			55
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 12df0e5434b8..a054cfcca511 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2486,6 +2486,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			return -EINVAL;
 		error = PAC_RESET_KEYS(me, arg2);
 		break;
+	case PR_CORE_ISOLATE:
+#ifdef CONFIG_SCHED_CORE
+		current->core_cookie = (unsigned long)current;
+#else
+		error = -EINVAL;
+#endif
+		break;
 	default:
 		error = -EINVAL;
 		break;
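From userspace, opting a thread in would then look something like this (illustrative only: the option value 55 is the one proposed in the patch above, and on kernels without the patch the call simply fails):

```c
#include <assert.h>
#include <sys/prctl.h>

#ifndef PR_CORE_ISOLATE
#define PR_CORE_ISOLATE 55	/* value proposed in the patch above */
#endif

/* Request a private core-scheduling cookie for the calling thread.
 * Returns 0 on success, -1 with errno set (EINVAL on kernels that
 * do not know the option). */
static int request_core_isolation(void)
{
	return prctl(PR_CORE_ISOLATE, 0, 0, 0, 0);
}
```

A browser, say, would call this early in each renderer thread; per semantics 2) above, children created afterwards inherit the cookie.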


-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-27 21:14 ` Matthew Garrett
@ 2019-08-27 21:50   ` Peter Zijlstra
  2019-08-28 15:30     ` Phil Auld
  2019-08-28 15:59     ` Tim Chen
  2019-08-27 23:24   ` Aubrey Li
  1 sibling, 2 replies; 161+ messages in thread
From: Peter Zijlstra @ 2019-08-27 21:50 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Tim Chen, mingo, tglx, pjt, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel, Phil Auld,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On Tue, Aug 27, 2019 at 10:14:17PM +0100, Matthew Garrett wrote:
> Apple have provided a sysctl that allows applications to indicate that 
> specific threads should make use of core isolation while allowing 
> the rest of the system to make use of SMT, and browsers (Safari, Firefox 
> and Chrome, at least) are now making use of this. Trying to do something 
> similar using cgroups seems a bit awkward. Would something like this be 
> reasonable? 

Sure; like I wrote earlier; I only did the cgroup thing because I was
lazy and it was the easiest interface to hack on in a hurry.

The rest of the ABI nonsense can 'trivially' be done later; if when we
decide to actually do this.

And given MDS, I'm still not entirely convinced it all makes sense. If
it were just L1TF, then yes, but now...

> Having spoken to the Chrome team, I believe that the 
> semantics we want are:
> 
> 1) A thread to be able to indicate that it should not run on the same 
> core as anything not in posession of the same cookie
> 2) Descendents of that thread to (by default) have the same cookie
> 3) No other thread be able to obtain the same cookie
> 4) Threads not be able to rejoin the global group (ie, threads can 
> segregate themselves from their parent and peers, but can never rejoin 
> that group once segregated)
> 
> but don't know if that's what everyone else would want.
> 
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 094bb03b9cc2..5d411246d4d5 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -229,4 +229,5 @@ struct prctl_mm_map {
>  # define PR_PAC_APDBKEY			(1UL << 3)
>  # define PR_PAC_APGAKEY			(1UL << 4)
>  
> +#define PR_CORE_ISOLATE			55
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 12df0e5434b8..a054cfcca511 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2486,6 +2486,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  			return -EINVAL;
>  		error = PAC_RESET_KEYS(me, arg2);
>  		break;
> +	case PR_CORE_ISOLATE:
> +#ifdef CONFIG_SCHED_CORE
> +		current->core_cookie = (unsigned long)current;

This needs to then also force a reschedule of current. And there's the
little issue of what happens if 'current' dies while its children live
on, and current gets re-used for a new process and does this again.

> +#else
> +		result = -EINVAL;
> +#endif
> +		break;
>  	default:
>  		error = -EINVAL;
>  		break;
> 
> 
> -- 
> Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-27 21:14 ` Matthew Garrett
  2019-08-27 21:50   ` Peter Zijlstra
@ 2019-08-27 23:24   ` Aubrey Li
  1 sibling, 0 replies; 161+ messages in thread
From: Aubrey Li @ 2019-08-27 23:24 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Subhra Mazumdar, Frédéric Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Aaron Lu, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Wed, Aug 28, 2019 at 5:14 AM Matthew Garrett <mjg59@srcf.ucam.org> wrote:
>
> Apple have provided a sysctl that allows applications to indicate that
> specific threads should make use of core isolation while allowing
> the rest of the system to make use of SMT, and browsers (Safari, Firefox
> and Chrome, at least) are now making use of this. Trying to do something
> similar using cgroups seems a bit awkward. Would something like this be
> reasonable? Having spoken to the Chrome team, I believe that the
> semantics we want are:
>
> 1) A thread to be able to indicate that it should not run on the same
> core as anything not in posession of the same cookie
> 2) Descendents of that thread to (by default) have the same cookie
> 3) No other thread be able to obtain the same cookie
> 4) Threads not be able to rejoin the global group (ie, threads can
> segregate themselves from their parent and peers, but can never rejoin
> that group once segregated)
>
> but don't know if that's what everyone else would want.
>
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 094bb03b9cc2..5d411246d4d5 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -229,4 +229,5 @@ struct prctl_mm_map {
>  # define PR_PAC_APDBKEY                        (1UL << 3)
>  # define PR_PAC_APGAKEY                        (1UL << 4)
>
> +#define PR_CORE_ISOLATE                        55
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 12df0e5434b8..a054cfcca511 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2486,6 +2486,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>                         return -EINVAL;
>                 error = PAC_RESET_KEYS(me, arg2);
>                 break;
> +       case PR_CORE_ISOLATE:
> +#ifdef CONFIG_SCHED_CORE
> +               current->core_cookie = (unsigned long)current;

Because AVX512 instructions could pull down the core frequency,
we also want to give a magic cookie number to all AVX512-using
tasks on the system, so they won't affect the performance/latency
of any other tasks.

This could be done by putting all AVX512 tasks into a cgroup, or
by the AVX512 detection that the following patch introduced.

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=2f7726f955572e587d5f50fbe9b2deed5334bd90

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-27 21:50   ` Peter Zijlstra
@ 2019-08-28 15:30     ` Phil Auld
  2019-08-28 16:01       ` Peter Zijlstra
  2019-08-28 15:59     ` Tim Chen
  1 sibling, 1 reply; 161+ messages in thread
From: Phil Auld @ 2019-08-28 15:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matthew Garrett, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Julien Desfossez, Tim Chen, mingo, tglx, pjt, torvalds,
	linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On Tue, Aug 27, 2019 at 11:50:35PM +0200 Peter Zijlstra wrote:
> On Tue, Aug 27, 2019 at 10:14:17PM +0100, Matthew Garrett wrote:
> > Apple have provided a sysctl that allows applications to indicate that 
> > specific threads should make use of core isolation while allowing 
> > the rest of the system to make use of SMT, and browsers (Safari, Firefox 
> > and Chrome, at least) are now making use of this. Trying to do something 
> > similar using cgroups seems a bit awkward. Would something like this be 
> > reasonable? 
> 
> Sure; like I wrote earlier; I only did the cgroup thing because I was
> lazy and it was the easiest interface to hack on in a hurry.
> 
> The rest of the ABI nonsense can 'trivially' be done later; if when we
> decide to actually do this.

I think something that allows the tag to be set may be needed. One of 
the use cases for this is virtualization stacks, where you really want
to be able to keep the higher CPU count and to set up the isolation 
from management processes on the host. 

The current cgroup interface doesn't work for that because it doesn't 
apply the tag to children. We've been unable to fully test it in a virt
setup because our VMs are made of a child cgroup per vcpu. 

> 
> And given MDS, I'm still not entirely convinced it all makes sense. If
> it were just L1TF, then yes, but now...

I was thinking MDS is really the reason for this. L1TF has mitigations but
the only current mitigation for MDS for smt is ... nosmt. 

The current core scheduler implementation, I believe, still has (theoretical?) 
holes involving interrupts; once/if those are closed it may be even less 
attractive.

> 
> > Having spoken to the Chrome team, I believe that the 
> > semantics we want are:
> > 
> > 1) A thread to be able to indicate that it should not run on the same 
> > core as anything not in posession of the same cookie
> > 2) Descendents of that thread to (by default) have the same cookie
> > 3) No other thread be able to obtain the same cookie
> > 4) Threads not be able to rejoin the global group (ie, threads can 
> > segregate themselves from their parent and peers, but can never rejoin 
> > that group once segregated)
> > 
> > but don't know if that's what everyone else would want.
> > 
> > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > index 094bb03b9cc2..5d411246d4d5 100644
> > --- a/include/uapi/linux/prctl.h
> > +++ b/include/uapi/linux/prctl.h
> > @@ -229,4 +229,5 @@ struct prctl_mm_map {
> >  # define PR_PAC_APDBKEY			(1UL << 3)
> >  # define PR_PAC_APGAKEY			(1UL << 4)
> >  
> > +#define PR_CORE_ISOLATE			55
> >  #endif /* _LINUX_PRCTL_H */
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 12df0e5434b8..a054cfcca511 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -2486,6 +2486,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> >  			return -EINVAL;
> >  		error = PAC_RESET_KEYS(me, arg2);
> >  		break;
> > +	case PR_CORE_ISOLATE:
> > +#ifdef CONFIG_SCHED_CORE
> > +		current->core_cookie = (unsigned long)current;
> 
> This needs to then also force a reschedule of current. And there's the
> little issue of what happens if 'current' dies while its children live
> on, and current gets re-used for a new process and does this again.

sched_core_get() too?


Cheers,
Phil

> 
> > +#else
> > +		result = -EINVAL;
> > +#endif
> > +		break;
> >  	default:
> >  		error = -EINVAL;
> >  		break;
> > 
> > 
> > -- 
> > Matthew Garrett | mjg59@srcf.ucam.org

-- 

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-27 21:50   ` Peter Zijlstra
  2019-08-28 15:30     ` Phil Auld
@ 2019-08-28 15:59     ` Tim Chen
  2019-08-28 16:16       ` Peter Zijlstra
  1 sibling, 1 reply; 161+ messages in thread
From: Tim Chen @ 2019-08-28 15:59 UTC (permalink / raw)
  To: Peter Zijlstra, Matthew Garrett
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	mingo, tglx, pjt, torvalds, linux-kernel, subhra.mazumdar,
	fweisbec, keescook, kerrnel, Phil Auld, Aaron Lu, Aubrey Li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On 8/27/19 2:50 PM, Peter Zijlstra wrote:
> On Tue, Aug 27, 2019 at 10:14:17PM +0100, Matthew Garrett wrote:
>> Apple have provided a sysctl that allows applications to indicate that 
>> specific threads should make use of core isolation while allowing 
>> the rest of the system to make use of SMT, and browsers (Safari, Firefox 
>> and Chrome, at least) are now making use of this. Trying to do something 
>> similar using cgroups seems a bit awkward. Would something like this be 
>> reasonable? 
> 
> Sure; like I wrote earlier; I only did the cgroup thing because I was
> lazy and it was the easiest interface to hack on in a hurry.
> 
> The rest of the ABI nonsense can 'trivially' be done later; if when we
> decide to actually do this.
> 
> And given MDS, I'm still not entirely convinced it all makes sense. If
> it were just L1TF, then yes, but now...
> 

For MDS, core scheduling does prevent thread-to-thread
attacks between user space threads running on sibling CPU threads.
True, it doesn't prevent the user-to-kernel attack from a sibling,
which will require additional mitigation measures. However, it does
block a major attack vector for MDS if HT is enabled.

Tim

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-28 15:30     ` Phil Auld
@ 2019-08-28 16:01       ` Peter Zijlstra
  2019-08-28 16:37         ` Tim Chen
  2019-08-29 14:30         ` Phil Auld
  0 siblings, 2 replies; 161+ messages in thread
From: Peter Zijlstra @ 2019-08-28 16:01 UTC (permalink / raw)
  To: Phil Auld
  Cc: Matthew Garrett, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Julien Desfossez, Tim Chen, mingo, tglx, pjt, torvalds,
	linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On Wed, Aug 28, 2019 at 11:30:34AM -0400, Phil Auld wrote:
> On Tue, Aug 27, 2019 at 11:50:35PM +0200 Peter Zijlstra wrote:

> > And given MDS, I'm still not entirely convinced it all makes sense. If
> > it were just L1TF, then yes, but now...
> 
> I was thinking MDS is really the reason for this. L1TF has mitigations but
> the only current mitigation for MDS for smt is ... nosmt. 

L1TF has no known mitigation that is SMT safe. The moment you have
something in your L1, the other sibling can read it using L1TF.

The nice thing about L1TF is that only (malicious) guests can exploit
it, and therefore the synchronization context is the VMM. And it so happens
that VMEXITs are 'rare' (and already expensive and thus lots of effort
has already gone into avoiding them).

If you don't use VMs, you're good and SMT is not a problem.

If you do use VMs (and do/can not trust them), _then_ you need
core-scheduling; and in that case, the implementation under discussion
misses things like synchronization on VMEXITs due to interrupts and
things like that.

But under the assumption that VMs don't generate high scheduling rates,
it can work.

> The current core scheduler implementation, I believe, still has (theoretical?) 
> holes involving interrupts, once/if those are closed it may be even less 
> attractive.

No; so MDS leaks anything the other sibling (currently) does, this makes
_any_ privilege boundary a synchronization context.

Worse still, the exploit doesn't require a VM at all, any other task can
get to it.

That means you get to sync the siblings on lovely things like system
call entry and exit, along with VMM and anything else that one would
consider a privilege boundary. Now, system calls are not rare, they
are really quite common in fact. Trying to sync up siblings at the rate
of system calls is utter madness.

So under MDS, SMT is completely hosed. If you use VMs exclusively, then
it _might_ work because a 'pure' host doesn't schedule that often
(maybe, same assumption as for L1TF).

Now, there have been proposals of moving the privilege boundary further
into the kernel. Just like PTI exposes the entry stack and code to
Meltdown, the thinking is, lets expose more. By moving the priv boundary
the hope is that we can do lots of common system calls without having to
sync up -- lots of details are 'pending'.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-28 15:59     ` Tim Chen
@ 2019-08-28 16:16       ` Peter Zijlstra
  0 siblings, 0 replies; 161+ messages in thread
From: Peter Zijlstra @ 2019-08-28 16:16 UTC (permalink / raw)
  To: Tim Chen
  Cc: Matthew Garrett, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Julien Desfossez, mingo, tglx, pjt, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel, Phil Auld,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On Wed, Aug 28, 2019 at 08:59:21AM -0700, Tim Chen wrote:
> On 8/27/19 2:50 PM, Peter Zijlstra wrote:
> > On Tue, Aug 27, 2019 at 10:14:17PM +0100, Matthew Garrett wrote:
> >> Apple have provided a sysctl that allows applications to indicate that 
> >> specific threads should make use of core isolation while allowing 
> >> the rest of the system to make use of SMT, and browsers (Safari, Firefox 
> >> and Chrome, at least) are now making use of this. Trying to do something 
> >> similar using cgroups seems a bit awkward. Would something like this be 
> >> reasonable? 
> > 
> > Sure; like I wrote earlier; I only did the cgroup thing because I was
> > lazy and it was the easiest interface to hack on in a hurry.
> > 
> > The rest of the ABI nonsense can 'trivially' be done later; if when we
> > decide to actually do this.
> > 
> > And given MDS, I'm still not entirely convinced it all makes sense. If
> > it were just L1TF, then yes, but now...
> > 
> 
> For MDS, core-scheduler does prevent thread to thread
> attack between user space threads running on sibling CPU threads.
> Yes, it doesn't prevent the user to kernel attack from sibling
> which will require additional mitigation measure. However, it does
> block a major attack vector for MDS if HT is enabled.

I'm not sure what your argument is; the dike has two holes; you plug
one, you still drown.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-28 16:01       ` Peter Zijlstra
@ 2019-08-28 16:37         ` Tim Chen
  2019-08-29 14:30         ` Phil Auld
  1 sibling, 0 replies; 161+ messages in thread
From: Tim Chen @ 2019-08-28 16:37 UTC (permalink / raw)
  To: Peter Zijlstra, Phil Auld
  Cc: Matthew Garrett, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Julien Desfossez, mingo, tglx, pjt, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel, Aaron Lu,
	Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On 8/28/19 9:01 AM, Peter Zijlstra wrote:
> On Wed, Aug 28, 2019 at 11:30:34AM -0400, Phil Auld wrote:
>> On Tue, Aug 27, 2019 at 11:50:35PM +0200 Peter Zijlstra wrote:
> 
>> The current core scheduler implementation, I believe, still has (theoretical?)
>> holes involving interrupts; once/if those are closed it may be even less
>> attractive.
> 
> No; so MDS leaks anything the other sibling (currently) does, this makes
> _any_ privilege boundary a synchronization context.
> 
> Worse still, the exploit doesn't require a VM at all, any other task can
> get to it.
> 
> That means you get to sync the siblings on lovely things like system
> call entry and exit, along with VMM and anything else that one would
> consider a privilege boundary. Now, system calls are not rare, they
> are really quite common in fact. Trying to sync up siblings at the rate
> of system calls is utter madness.
> 
> So under MDS, SMT is completely hosed. If you use VMs exclusively, then
> it _might_ work because a 'pure' host doesn't schedule that often
> (maybe, same assumption as for L1TF).
> 
> Now, there have been proposals of moving the privilege boundary further
> into the kernel. Just like PTI exposes the entry stack and code to
> Meltdown, the thinking is, let's expose more. By moving the priv boundary
> the hope is that we can do lots of common system calls without having to
> sync up -- lots of details are 'pending'.
> 

If we are willing to consider the idea that we sync with the sibling
only when we touch potential user data, then a significant portion of
syscalls may not need to sync.  Yeah, it still sucks because of the
complexity added by auditing all the places in the kernel that may touch
privileged data and require synchronization.

I did a prototype (without core sched); the kernel build slowed by 2.5%.
So this use case still seems reasonable.

A worst-case scenario is concurrent SMT FIO writes to an encrypted file,
which involve a lot of synchronization due to the crypto code's extended
access to privileged data; there we slow down by 9%.

Tim


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-28 16:01       ` Peter Zijlstra
  2019-08-28 16:37         ` Tim Chen
@ 2019-08-29 14:30         ` Phil Auld
  2019-08-29 14:38           ` Peter Zijlstra
  1 sibling, 1 reply; 161+ messages in thread
From: Phil Auld @ 2019-08-29 14:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matthew Garrett, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Julien Desfossez, Tim Chen, mingo, tglx, pjt, torvalds,
	linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On Wed, Aug 28, 2019 at 06:01:14PM +0200 Peter Zijlstra wrote:
> On Wed, Aug 28, 2019 at 11:30:34AM -0400, Phil Auld wrote:
> > On Tue, Aug 27, 2019 at 11:50:35PM +0200 Peter Zijlstra wrote:
> 
> > > And given MDS, I'm still not entirely convinced it all makes sense. If
> > > it were just L1TF, then yes, but now...
> > 
> > I was thinking MDS is really the reason for this. L1TF has mitigations but
> > the only current mitigation for MDS for smt is ... nosmt. 
> 
> L1TF has no known mitigation that is SMT safe. The moment you have
> something in your L1, the other sibling can read it using L1TF.
> 
> The nice thing about L1TF is that only (malicious) guests can exploit
> it, and therefore the synchronization context is VMM. And it so happens
> that VMEXITs are 'rare' (and already expensive and thus lots of effort
> has already gone into avoiding them).
> 
> If you don't use VMs, you're good and SMT is not a problem.
> 
> If you do use VMs (and do/can not trust them), _then_ you need
> core-scheduling; and in that case, the implementation under discussion
> misses things like synchronization on VMEXITs due to interrupts and
> things like that.
> 
> But under the assumption that VMs don't generate high scheduling rates,
> it can work.
> 
> > The current core scheduler implementation, I believe, still has (theoretical?)
> > holes involving interrupts; once/if those are closed it may be even less
> > attractive.
> 
> No; so MDS leaks anything the other sibling (currently) does, this makes
> _any_ privilege boundary a synchronization context.
> 
> Worse still, the exploit doesn't require a VM at all, any other task can
> get to it.
> 
> That means you get to sync the siblings on lovely things like system
> call entry and exit, along with VMM and anything else that one would
> consider a privilege boundary. Now, system calls are not rare, they
> are really quite common in fact. Trying to sync up siblings at the rate
> of system calls is utter madness.
> 
> So under MDS, SMT is completely hosed. If you use VMs exclusively, then
> it _might_ work because a 'pure' host doesn't schedule that often
> (maybe, same assumption as for L1TF).
> 
> Now, there have been proposals of moving the privilege boundary further
> into the kernel. Just like PTI exposes the entry stack and code to
> Meltdown, the thinking is, let's expose more. By moving the priv boundary
> the hope is that we can do lots of common system calls without having to
> sync up -- lots of details are 'pending'.


Thanks for clarifying. My understanding is (somewhat) less fuzzy now. :)

I think, though, that you were basically agreeing with me that the current
core scheduler does not close the holes, or am I reading that wrong?


Cheers,
Phil

-- 

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-29 14:30         ` Phil Auld
@ 2019-08-29 14:38           ` Peter Zijlstra
  2019-09-10 14:27             ` Julien Desfossez
  0 siblings, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2019-08-29 14:38 UTC (permalink / raw)
  To: Phil Auld
  Cc: Matthew Garrett, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Julien Desfossez, Tim Chen, mingo, tglx, pjt, torvalds,
	linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On Thu, Aug 29, 2019 at 10:30:51AM -0400, Phil Auld wrote:
> On Wed, Aug 28, 2019 at 06:01:14PM +0200 Peter Zijlstra wrote:
> > On Wed, Aug 28, 2019 at 11:30:34AM -0400, Phil Auld wrote:
> > > On Tue, Aug 27, 2019 at 11:50:35PM +0200 Peter Zijlstra wrote:
> > 
> > > > And given MDS, I'm still not entirely convinced it all makes sense. If
> > > > it were just L1TF, then yes, but now...
> > > 
> > > I was thinking MDS is really the reason for this. L1TF has mitigations but
> > > the only current mitigation for MDS for smt is ... nosmt. 
> > 
> > L1TF has no known mitigation that is SMT safe. The moment you have
> > something in your L1, the other sibling can read it using L1TF.
> > 
> > The nice thing about L1TF is that only (malicious) guests can exploit
> > it, and therefore the synchronization context is VMM. And it so happens
> > that VMEXITs are 'rare' (and already expensive and thus lots of effort
> > has already gone into avoiding them).
> > 
> > If you don't use VMs, you're good and SMT is not a problem.
> > 
> > If you do use VMs (and do/can not trust them), _then_ you need
> > core-scheduling; and in that case, the implementation under discussion
> > misses things like synchronization on VMEXITs due to interrupts and
> > things like that.
> > 
> > But under the assumption that VMs don't generate high scheduling rates,
> > it can work.
> > 
> > > The current core scheduler implementation, I believe, still has (theoretical?)
> > > holes involving interrupts; once/if those are closed it may be even less
> > > attractive.
> > 
> > No; so MDS leaks anything the other sibling (currently) does, this makes
> > _any_ privilege boundary a synchronization context.
> > 
> > Worse still, the exploit doesn't require a VM at all, any other task can
> > get to it.
> > 
> > That means you get to sync the siblings on lovely things like system
> > call entry and exit, along with VMM and anything else that one would
> > consider a privilege boundary. Now, system calls are not rare, they
> > are really quite common in fact. Trying to sync up siblings at the rate
> > of system calls is utter madness.
> > 
> > So under MDS, SMT is completely hosed. If you use VMs exclusively, then
> > it _might_ work because a 'pure' host doesn't schedule that often
> > (maybe, same assumption as for L1TF).
> > 
> > Now, there have been proposals of moving the privilege boundary further
> > into the kernel. Just like PTI exposes the entry stack and code to
> > Meltdown, the thinking is, let's expose more. By moving the priv boundary
> > the hope is that we can do lots of common system calls without having to
> > sync up -- lots of details are 'pending'.
> 
> 
> Thanks for clarifying. My understanding is (somewhat) less fuzzy now. :)
> 
> I think, though, that you were basically agreeing with me that the current
> core scheduler does not close the holes, or am I reading that wrong?

Agreed; the missing bits for L1TF are ugly but doable (I've actually
done them before, Tim has that _somewhere_), but I've not seen a
'workable' solution for MDS yet.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-07 17:10                                 ` Tim Chen
  2019-08-15 16:09                                   ` Dario Faggioli
@ 2019-09-05  1:44                                   ` Julien Desfossez
  2019-09-06 22:17                                     ` Tim Chen
  2019-09-18 21:27                                     ` Tim Chen
  2019-09-06 18:30                                   ` Tim Chen
  2 siblings, 2 replies; 161+ messages in thread
From: Julien Desfossez @ 2019-09-05  1:44 UTC (permalink / raw)
  To: Tim Chen
  Cc: Dario Faggioli, Li, Aubrey, Aaron Lu, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

> 1) Unfairness between the sibling threads
> -----------------------------------------
> One sibling thread could be suppressing and force-idling
> the other sibling thread disproportionately, resulting in
> the force-idled CPU not getting to run and stalling tasks
> on the suppressed CPU.
> 
> Status:
> i) Aaron has proposed a patchset here based on using one
> rq as a base reference for vruntime for task priority
> comparison between siblings.
> 
> https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/
> It works well on fairness but has some initialization issues
> 
> ii) Tim has proposed a patchset here to account for forced
> idle time in rq's min_vruntime
> https://lore.kernel.org/lkml/f96350c1-25a9-0564-ff46-6658e96d726c@linux.intel.com/
> It improves over v3 with simpler logic compared to
> Aaron's patch, but does not work as well on fairness
> 
> iii) Tim has proposed yet another patch to maintain fairness
> of forced idle time between CPU threads per Peter's suggestion.
> https://lore.kernel.org/lkml/21933a50-f796-3d28-664c-030cb7c98431@linux.intel.com/
> Its performance has yet to be tested.
> 
> 2) Not rescheduling forced idled CPU
> ------------------------------------
> The force-idled CPU does not get a chance to reschedule
> itself, and will stall for a long time even though it
> has eligible tasks to run.
> 
> Status:
> i) Aaron proposed a patch to fix this by checking whether there
> are runnable tasks when the scheduling tick comes in.
> https://lore.kernel.org/lkml/20190725143344.GD992@aaronlu/
> 
> ii) Vineeth has patches for this issue and also issue 1, based
> on scheduling a new "forced idle task" when going into forced
> idle, but has yet to post the patches.

We finished writing and debugging the PoC for the coresched_idle task
and here are the results and the code.

Those patches are applied on top of Aaron's patches:
- sched: Fix incorrect rq tagged as forced idle
- wrapper for cfs_rq->min_vruntime
  https://lore.kernel.org/lkml/20190725143127.GB992@aaronlu/
- core vruntime comparison
  https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/

For the testing, we used the same strategy as described in
https://lore.kernel.org/lkml/20190802153715.GA18075@sinkpad/

No tag
------
Test                            Average     Stdev
Alone                           1306.90     0.94
nosmt                           649.95      1.44
Aaron's full patchset:          828.15      32.45
Aaron's first 2 patches:        832.12      36.53
Tim's first patchset:           852.50      4.11
Tim's second patchset:          855.11      9.89
coresched_idle                  985.67      0.83

Sysbench mem untagged, sysbench cpu tagged
------------------------------------------
Test                            Average     Stdev
Alone                           1306.90     0.94
nosmt                           649.95      1.44
Aaron's full patchset:          586.06      1.77
Tim's first patchset:           852.50      4.11
Tim's second patchset:          663.88      44.43
coresched_idle                  653.58      0.49

Sysbench mem tagged, sysbench cpu untagged
------------------------------------------
Test                            Average     Stdev
Alone                           1306.90     0.94
nosmt                           649.95      1.44
Aaron's full patchset:          583.77      3.52
Tim's first patchset:           564.04      58.05
Tim's second patchset:          524.72      55.24
coresched_idle                  653.30      0.81

Both sysbench tagged
--------------------
Test                            Average     Stdev
Alone                           1306.90     0.94
nosmt                           649.95      1.44
Aaron's full patchset:          582.15      3.75
Tim's first patchset:           679.43      70.07
Tim's second patchset:          563.10      34.58
coresched_idle                  653.12      1.68

As we can see from this stress test, with the coresched_idle thread
being a real process, the fairness is more consistent (low stdev). Also,
the performance remains the same regardless of the tagging, and is even
consistently slightly better than nosmt.

Thanks,

Julien

From: vpillai <vpillai@digitalocean.com>
Date: Wed, 4 Sep 2019 17:41:38 +0000
Subject: [RFC PATCH 1/2] coresched_idle thread

---
 kernel/sched/core.c  | 46 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  1 +
 2 files changed, 47 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f7839bf96e8b..fe560739c247 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3639,6 +3639,51 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 	return a->core_cookie == b->core_cookie;
 }
 
+static int coresched_idle_worker(void *data)
+{
+	struct rq *rq = (struct rq *)data;
+
+	/*
+	 * Transition to parked state and dequeue from runqueue.
+	 * pick_task() will select us if needed without enqueueing.
+	 */
+	set_special_state(TASK_PARKED);
+	schedule();
+
+	while (true) {
+		if (kthread_should_stop())
+			break;
+
+		play_idle(1);
+	}
+
+	return 0;
+}
+
+static void coresched_idle_worker_init(struct rq *rq)
+{
+
+	// XXX core_idle_task needs lock protection?
+	if (!rq->core_idle_task) {
+		rq->core_idle_task = kthread_create_on_cpu(coresched_idle_worker,
+				(void *)rq, cpu_of(rq), "coresched_idle");
+		if (rq->core_idle_task) {
+			wake_up_process(rq->core_idle_task);
+		}
+
+	}
+
+	return;
+}
+
+static void coresched_idle_worker_fini(struct rq *rq)
+{
+	if (rq->core_idle_task) {
+		kthread_stop(rq->core_idle_task);
+		rq->core_idle_task = NULL;
+	}
+}
+
 // XXX fairness/fwd progress conditions
 /*
  * Returns
@@ -6774,6 +6819,7 @@ void __init sched_init(void)
 		atomic_set(&rq->nr_iowait, 0);
 
 #ifdef CONFIG_SCHED_CORE
+		rq->core_idle_task = NULL;
 		rq->core = NULL;
 		rq->core_pick = NULL;
 		rq->core_enabled = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e91c188a452c..c3ae0af55b05 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -965,6 +965,7 @@ struct rq {
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
 	bool			core_forceidle;
+	struct task_struct	*core_idle_task;
 
 	/* shared state */
 	unsigned int		core_task_seq;
-- 
2.17.1

From: vpillai <vpillai@digitalocean.com>
Date: Wed, 4 Sep 2019 18:22:55 +0000
Subject: [RFC PATCH 2/2] Use coresched_idle to force idle a sibling

Currently we use the idle thread to force idle a sibling. Let's
use the new coresched_idle thread so that the scheduler sees a valid
task during forced idle.
---
 kernel/sched/core.c | 66 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 56 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe560739c247..e35d69a81adb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -244,23 +244,33 @@ static int __sched_core_stopper(void *data)
 static DEFINE_MUTEX(sched_core_mutex);
 static int sched_core_count;
 
+static void coresched_idle_worker_init(struct rq *rq);
+static void coresched_idle_worker_fini(struct rq *rq);
 static void __sched_core_enable(void)
 {
+	int cpu;
+
 	// XXX verify there are no cookie tasks (yet)
 
 	static_branch_enable(&__sched_core_enabled);
 	stop_machine(__sched_core_stopper, (void *)true, NULL);
 
+	for_each_online_cpu(cpu)
+		coresched_idle_worker_init(cpu_rq(cpu));
 	printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
 {
+	int cpu;
+
 	// XXX verify there are no cookie tasks (left)
 
 	stop_machine(__sched_core_stopper, (void *)false, NULL);
 	static_branch_disable(&__sched_core_enabled);
 
+	for_each_online_cpu(cpu)
+		coresched_idle_worker_fini(cpu_rq(cpu));
 	printk("core sched disabled\n");
 }
 
@@ -3626,14 +3636,25 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 #ifdef CONFIG_SCHED_CORE
 
+static inline bool is_force_idle_task(struct task_struct *p)
+{
+	BUG_ON(task_rq(p)->core_idle_task == NULL);
+	return task_rq(p)->core_idle_task == p;
+}
+
+static inline bool is_core_idle_task(struct task_struct *p)
+{
+	return is_idle_task(p) || is_force_idle_task(p);
+}
+
 static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
 {
-	return is_idle_task(a) || (a->core_cookie == cookie);
+	return is_core_idle_task(a) || (a->core_cookie == cookie);
 }
 
 static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 {
-	if (is_idle_task(a) || is_idle_task(b))
+	if (is_core_idle_task(a) || is_core_idle_task(b))
 		return true;
 
 	return a->core_cookie == b->core_cookie;
@@ -3641,8 +3662,6 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
 
 static int coresched_idle_worker(void *data)
 {
-	struct rq *rq = (struct rq *)data;
-
 	/*
 	 * Transition to parked state and dequeue from runqueue.
 	 * pick_task() will select us if needed without enqueueing.
@@ -3666,7 +3685,7 @@ static void coresched_idle_worker_init(struct rq *rq)
 	// XXX core_idle_task needs lock protection?
 	if (!rq->core_idle_task) {
 		rq->core_idle_task = kthread_create_on_cpu(coresched_idle_worker,
-				(void *)rq, cpu_of(rq), "coresched_idle");
+				NULL, cpu_of(rq), "coresched_idle");
 		if (rq->core_idle_task) {
 			wake_up_process(rq->core_idle_task);
 		}
@@ -3684,6 +3703,14 @@ static void coresched_idle_worker_fini(struct rq *rq)
 	}
 }
 
+static inline struct task_struct *core_idle_task(struct rq *rq)
+{
+	BUG_ON(rq->core_idle_task == NULL);
+
+	return rq->core_idle_task;
+
+}
+
 // XXX fairness/fwd progress conditions
 /*
  * Returns
@@ -3709,7 +3736,7 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 		 */
 		if (max && class_pick->core_cookie &&
 		    prio_less(class_pick, max))
-			return idle_sched_class.pick_task(rq);
+			return core_idle_task(rq);
 
 		return class_pick;
 	}
@@ -3853,7 +3880,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				goto done;
 			}
 
-			if (!is_idle_task(p))
+			if (!is_force_idle_task(p))
 				occ++;
 
 			rq_i->core_pick = p;
@@ -3906,7 +3933,6 @@ next_class:;
 	rq->core->core_pick_seq = rq->core->core_task_seq;
 	next = rq->core_pick;
 	rq->core_sched_seq = rq->core->core_pick_seq;
-	trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);
 
 	/*
 	 * Reschedule siblings
@@ -3924,13 +3950,24 @@ next_class:;
 
 		WARN_ON_ONCE(!rq_i->core_pick);
 
-		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
+		if (is_core_idle_task(rq_i->core_pick) && rq_i->nr_running) {
+			/*
+			 * Matching logic can sometimes select idle_task when
+			 * iterating the sched_classes. If that selection is
+			 * actually a forced idle case, we need to update the
+			 * core_pick to coresched_idle.
+			 */
+			if (is_idle_task(rq_i->core_pick))
+				rq_i->core_pick = core_idle_task(rq_i);
 			rq_i->core_forceidle = true;
+		}
 
 		rq_i->core_pick->core_occupation = occ;
 
-		if (i == cpu)
+		if (i == cpu) {
+			next = rq_i->core_pick;
 			continue;
+		}
 
 		if (rq_i->curr != rq_i->core_pick) {
 			trace_printk("IPI(%d)\n", i);
@@ -3947,6 +3984,7 @@ next_class:;
 			WARN_ON_ONCE(1);
 		}
 	}
+	trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);
 
 done:
 	set_next_task(rq, next);
@@ -4200,6 +4238,12 @@ static void __sched notrace __schedule(bool preempt)
 		 *   is a RELEASE barrier),
 		 */
 		++*switch_count;
+#ifdef CONFIG_SCHED_CORE
+		if (next == rq->core_idle_task)
+			next->state = TASK_RUNNING;
+		else if (prev == rq->core_idle_task)
+			prev->state = TASK_PARKED;
+#endif
 
 		trace_sched_switch(preempt, prev, next);
 
@@ -6479,6 +6523,7 @@ int sched_cpu_activate(unsigned int cpu)
 #ifdef CONFIG_SCHED_CORE
 		if (static_branch_unlikely(&__sched_core_enabled)) {
 			rq->core_enabled = true;
+			coresched_idle_worker_init(rq);
 		}
 #endif
 	}
@@ -6535,6 +6580,7 @@ int sched_cpu_deactivate(unsigned int cpu)
 		struct rq *rq = cpu_rq(cpu);
 		if (static_branch_unlikely(&__sched_core_enabled)) {
 			rq->core_enabled = false;
+			coresched_idle_worker_fini(rq);
 		}
 #endif
 		static_branch_dec_cpuslocked(&sched_smt_present);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-07 17:10                                 ` Tim Chen
  2019-08-15 16:09                                   ` Dario Faggioli
  2019-09-05  1:44                                   ` Julien Desfossez
@ 2019-09-06 18:30                                   ` Tim Chen
  2019-09-11 14:02                                     ` Aaron Lu
                                                       ` (2 more replies)
  2 siblings, 3 replies; 161+ messages in thread
From: Tim Chen @ 2019-09-06 18:30 UTC (permalink / raw)
  To: Dario Faggioli, Julien Desfossez, Li, Aubrey
  Cc: Aaron Lu, Aubrey Li, Subhra Mazumdar, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 8/7/19 10:10 AM, Tim Chen wrote:

> 3) Load balancing between CPU cores
> -----------------------------------
> Say one CPU core's sibling threads get force-idled
> a lot because the siblings mostly run incompatible tasks;
> moving the incompatible load to other cores and pulling
> compatible load to the core could help CPU utilization.
> 
> So just considering the load of a task is not enough during
> load balancing, task compatibility also needs to be considered.
> Peter has put in mechanisms to balance compatible tasks between
> CPU thread siblings, but not across cores.
> 
> Status:
> I have not seen patches on this issue.  This issue could lead to
> large variance in workload performance based on your luck
> in placing the workload among the cores.
> 

I've made an attempt in the following two patches to address
the load balancing of mismatched load between the siblings.

It is applied on top of Aaron's patches:
- sched: Fix incorrect rq tagged as forced idle
- wrapper for cfs_rq->min_vruntime
  https://lore.kernel.org/lkml/20190725143127.GB992@aaronlu/
- core vruntime comparison
  https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/

I would love for Julien, Aaron and others to try it out.  Suggestions
on how to tune it are welcome.

Tim

---

From c7b91fb26d787d020f0795c3fbec82914889dc67 Mon Sep 17 00:00:00 2001
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Wed, 21 Aug 2019 15:48:15 -0700
Subject: [PATCH 1/2] sched: scan core sched load mismatch

Calculate the mismatched load imbalance on a core under the
core scheduler while we are updating the load-balance
statistics.  This will later guide the load balancer to
move load to another CPU that can reduce the mismatched load.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 150 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 149 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 730c9359e9c9..b3d6a6482553 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7507,6 +7507,9 @@ static inline int migrate_degrades_locality(struct task_struct *p,
 }
 #endif
 
+static inline s64 core_sched_imbalance_improvement(int src_cpu, int dst_cpu,
+					      struct task_struct *p);
+
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -7970,6 +7973,11 @@ struct sg_lb_stats {
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
 #endif
+#ifdef CONFIG_SCHED_CORE
+	int			imbl_cpu;
+	struct task_group	*imbl_tg;
+	s64			imbl_load;
+#endif
 };
 
 /*
@@ -8314,6 +8322,145 @@ static bool update_nohz_stats(struct rq *rq, bool force)
 #endif
 }
 
+#ifdef CONFIG_SCHED_CORE
+static inline int cpu_sibling(int cpu)
+{
+	int i;
+
+	for_each_cpu(i, cpu_smt_mask(cpu)) {
+		if (i == cpu)
+			continue;
+		return i;
+	}
+	return -1;
+}
+
+static inline s64 core_sched_imbalance_delta(int src_cpu, int dst_cpu,
+			int src_sibling, int dst_sibling,
+			struct task_group *tg, u64 task_load)
+{
+	struct sched_entity *se, *se_sibling, *dst_se, *dst_se_sibling;
+	s64 excess, deficit, old_mismatch, new_mismatch;
+
+	if (src_cpu == dst_cpu)
+		return -1;
+
+	/* XXX SMT4 will require additional logic */
+
+	se = tg->se[src_cpu];
+	se_sibling = tg->se[src_sibling];
+
+	excess = se->avg.load_avg - se_sibling->avg.load_avg;
+	if (src_sibling == dst_cpu) {
+		old_mismatch = abs(excess);
+		new_mismatch = abs(excess - 2*task_load);
+		return old_mismatch - new_mismatch;
+	}
+
+	dst_se = tg->se[dst_cpu];
+	dst_se_sibling = tg->se[dst_sibling];
+	deficit = dst_se->avg.load_avg - dst_se_sibling->avg.load_avg;
+
+	old_mismatch = abs(excess) + abs(deficit);
+	new_mismatch = abs(excess - (s64) task_load) +
+		       abs(deficit + (s64) task_load);
+
+	if (excess > 0 && deficit < 0)
+		return old_mismatch - new_mismatch;
+	else
+		/* no mismatch improvement */
+		return -1;
+}
+
+static inline s64 core_sched_imbalance_improvement(int src_cpu, int dst_cpu,
+					      struct task_struct *p)
+{
+	int src_sibling, dst_sibling;
+	unsigned long task_load = task_h_load(p);
+	struct task_group *tg;
+
+	if (!p->se.parent)
+		return 0;
+
+	tg = p->se.parent->cfs_rq->tg;
+	if (!tg->tagged)
+		return 0;
+
+	/* XXX SMT4 will require additional logic */
+	src_sibling = cpu_sibling(src_cpu);
+	dst_sibling = cpu_sibling(dst_cpu);
+
+	if (src_sibling == -1 || dst_sibling == -1)
+		return 0;
+
+	return core_sched_imbalance_delta(src_cpu, dst_cpu,
+					  src_sibling, dst_sibling,
+					  tg, task_load);
+}
+
+static inline void core_sched_imbalance_scan(struct sg_lb_stats *sgs,
+					     int src_cpu,
+					     int dst_cpu)
+{
+	struct rq *rq;
+	struct cfs_rq *cfs_rq, *pos;
+	struct task_group *tg;
+	s64 mismatch;
+	int src_sibling, dst_sibling;
+	u64 src_avg_load_task;
+
+	if (!sched_core_enabled(cpu_rq(src_cpu)) ||
+	    !sched_core_enabled(cpu_rq(dst_cpu)) ||
+	    src_cpu == dst_cpu)
+		return;
+
+	rq = cpu_rq(src_cpu);
+
+	src_sibling = cpu_sibling(src_cpu);
+	dst_sibling = cpu_sibling(dst_cpu);
+
+	if (src_sibling == -1 || dst_sibling == -1)
+		return;
+
+	src_avg_load_task = cpu_avg_load_per_task(src_cpu);
+
+	if (src_avg_load_task == 0)
+		return;
+
+	/*
+	 * Imbalance in tagged task group's load causes forced
+	 * idle time in sibling, that will be counted as mismatched load
+	 * on the forced idled cpu.  Record the source cpu in the sched
+	 * group causing the largest mismatched load.
+	 */
+	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
+
+		tg = cfs_rq->tg;
+		if (!tg->tagged)
+			continue;
+
+		mismatch = core_sched_imbalance_delta(src_cpu, dst_cpu,
+						      src_sibling, dst_sibling,
+						      tg, src_avg_load_task);
+
+		if (mismatch > sgs->imbl_load &&
+		    mismatch > src_avg_load_task) {
+			sgs->imbl_load = mismatch;
+			sgs->imbl_tg = tg;
+			sgs->imbl_cpu = src_cpu;
+		}
+	}
+}
+
+#else
+#define core_sched_imbalance_scan(sgs, src_cpu, dst_cpu)
+static inline s64 core_sched_imbalance_improvement(int src_cpu, int dst_cpu,
+					      struct task_struct *p)
+{
+	return 0;
+}
+#endif /* CONFIG_SCHED_CORE */
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -8345,7 +8492,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		else
 			load = source_load(i, load_idx);
 
-		sgs->group_load += load;
+		core_sched_imbalance_scan(sgs, i, env->dst_cpu);
+
 		sgs->group_util += cpu_util(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
-- 
2.20.1


From a11084f84de9c174f36cf2701ba5bbe1546e45f5 Mon Sep 17 00:00:00 2001
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Wed, 28 Aug 2019 11:22:43 -0700
Subject: [PATCH 2/2] sched: load balance core imbalanced load

If moving mismatched core scheduling load can reduce load imbalance
more than regular load balancing, move the mismatched load instead.

On regular load balancing, also skip moving a task that could increase
load mismatch.

Move only one mismatched task at a time to reduce load disturbance.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3d6a6482553..69939c977797 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7412,6 +7412,11 @@ struct lb_env {
 	enum fbq_type		fbq_type;
 	enum group_type		src_grp_type;
 	struct list_head	tasks;
+#ifdef CONFIG_SCHED_CORE
+	int			imbl_cpu;
+	struct task_group	*imbl_tg;
+	s64			imbl_load;
+#endif
 };
 
 /*
@@ -7560,6 +7565,12 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+#ifdef CONFIG_SCHED_CORE
+	/* Don't migrate if we increase core imbalance */
+	if (core_sched_imbalance_improvement(env->src_cpu, env->dst_cpu, p) < 0)
+		return 0;
+#endif
+
 	/* Record that we found atleast one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
@@ -8533,6 +8544,14 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	sgs->group_no_capacity = group_is_overloaded(env, sgs);
 	sgs->group_type = group_classify(group, sgs);
+
+#ifdef CONFIG_SCHED_CORE
+	if (sgs->imbl_load > env->imbl_load) {
+		env->imbl_cpu = sgs->imbl_cpu;
+		env->imbl_tg = sgs->imbl_tg;
+		env->imbl_load = sgs->imbl_load;
+	}
+#endif
 }
 
 /**
@@ -9066,6 +9085,15 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 	unsigned long busiest_load = 0, busiest_capacity = 1;
 	int i;
 
+#ifdef CONFIG_SCHED_CORE
+	if (env->imbl_load > env->imbalance) {
+		env->imbalance = cpu_avg_load_per_task(env->imbl_cpu);
+		return cpu_rq(env->imbl_cpu);
+	} else {
+		env->imbl_load = 0;
+	}
+#endif
+
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
 		unsigned long capacity, wl;
 		enum fbq_type rt;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-05  1:44                                   ` Julien Desfossez
@ 2019-09-06 22:17                                     ` Tim Chen
  2019-09-18 21:27                                     ` Tim Chen
  1 sibling, 0 replies; 161+ messages in thread
From: Tim Chen @ 2019-09-06 22:17 UTC (permalink / raw)
  To: Julien Desfossez
  Cc: Dario Faggioli, Li, Aubrey, Aaron Lu, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 9/4/19 6:44 PM, Julien Desfossez wrote:


>@@ -3853,7 +3880,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> 				goto done;
> 			}
> 
>-			if (!is_idle_task(p))
>+			if (!is_force_idle_task(p))

Should this be 		if (!is_core_idle_task(p))
instead?

> 				occ++;
> 


Tim

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-08-29 14:38           ` Peter Zijlstra
@ 2019-09-10 14:27             ` Julien Desfossez
  2019-09-18 21:12               ` Tim Chen
  0 siblings, 1 reply; 161+ messages in thread
From: Julien Desfossez @ 2019-09-10 14:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Phil Auld, Matthew Garrett, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Tim Chen, mingo, tglx, pjt, torvalds,
	linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On 29-Aug-2019 04:38:21 PM, Peter Zijlstra wrote:
> On Thu, Aug 29, 2019 at 10:30:51AM -0400, Phil Auld wrote:
> > I think, though, that you were basically agreeing with me that the current
> > core scheduler does not close the holes, or am I reading that wrong.
>
> Agreed; the missing bits for L1TF are ugly but doable (I've actually
> done them before, Tim has that _somewhere_), but I've not seen a
> 'workable' solution for MDS yet.

Following the discussion we had yesterday at LPC: once we have agreed
on a solution for fixing the current fairness issue, we will post
v4. We will then work on prototyping the other synchronisation points
(syscalls, interrupts and VMEXIT) to evaluate the overhead in various
use-cases.

Depending on the use-case, we know the performance overhead may be
heavier than just disabling SMT, but the benchmarks we have seen so far
indicate that there are valid cases for core scheduling. Core scheduling
will remain disabled by default, but with it, we will have the
option to tune the system to be both secure and faster than disabling
SMT for those cases.

Thanks,

Julien

P.S: I think the branch that contains the VMEXIT handling is here
https://github.com/pdxChen/gang/commits/sched_1.23-base


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-06 18:30                                   ` Tim Chen
@ 2019-09-11 14:02                                     ` Aaron Lu
  2019-09-11 16:19                                       ` Tim Chen
  2019-09-25  2:40                                     ` Aubrey Li
  2019-09-30 15:22                                     ` Julien Desfossez
  2 siblings, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-09-11 14:02 UTC (permalink / raw)
  To: Tim Chen, Julien Desfossez
  Cc: Dario Faggioli, Li, Aubrey, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

Hi Tim & Julien,

On Fri, Sep 06, 2019 at 11:30:20AM -0700, Tim Chen wrote:
> On 8/7/19 10:10 AM, Tim Chen wrote:
> 
> > 3) Load balancing between CPU cores
> > -----------------------------------
> > Say if one CPU core's sibling threads get forced idled
> > a lot as it has mostly incompatible tasks between the siblings,
> > moving the incompatible load to other cores and pulling
> > compatible load to the core could help CPU utilization.
> > 
> > So just considering the load of a task is not enough during
> > load balancing, task compatibility also needs to be considered.
> > Peter has put in mechanisms to balance compatible tasks between
> > CPU thread siblings, but not across cores.
> > 
> > Status:
> > I have not seen patches on this issue.  This issue could lead to
> > large variance in workload performance based on your luck
> > in placing the workload among the cores.
> > 
> 
> I've made an attempt in the following two patches to address
> the load balancing of mismatched load between the siblings.
> 
> It is applied on top of Aaron's patches:
> - sched: Fix incorrect rq tagged as forced idle
> - wrapper for cfs_rq->min_vruntime
>   https://lore.kernel.org/lkml/20190725143127.GB992@aaronlu/
> - core vruntime comparison
>   https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/

So both of you are working on top of my 2 patches that deal with the
fairness issue, but I had the feeling Tim's alternative patches[1] are
simpler than mine and achieve the same result (after the force idle tag
fix), so unless there is something I missed, I think we should go with
the simpler one?

[1]: https://lore.kernel.org/lkml/b7a83fcb-5c34-9794-5688-55c52697fd84@linux.intel.com/

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-11 14:02                                     ` Aaron Lu
@ 2019-09-11 16:19                                       ` Tim Chen
  2019-09-11 16:47                                         ` Vineeth Remanan Pillai
  2019-09-12 12:04                                         ` Aaron Lu
  0 siblings, 2 replies; 161+ messages in thread
From: Tim Chen @ 2019-09-11 16:19 UTC (permalink / raw)
  To: Aaron Lu, Julien Desfossez
  Cc: Dario Faggioli, Li, Aubrey, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 9/11/19 7:02 AM, Aaron Lu wrote:
> Hi Tim & Julien,
> 
> On Fri, Sep 06, 2019 at 11:30:20AM -0700, Tim Chen wrote:
>> On 8/7/19 10:10 AM, Tim Chen wrote:
>>
>>> 3) Load balancing between CPU cores
>>> -----------------------------------
>>> Say if one CPU core's sibling threads get forced idled
>>> a lot as it has mostly incompatible tasks between the siblings,
>>> moving the incompatible load to other cores and pulling
>>> compatible load to the core could help CPU utilization.
>>>
>>> So just considering the load of a task is not enough during
>>> load balancing, task compatibility also needs to be considered.
>>> Peter has put in mechanisms to balance compatible tasks between
>>> CPU thread siblings, but not across cores.
>>>
>>> Status:
>>> I have not seen patches on this issue.  This issue could lead to
>>> large variance in workload performance based on your luck
>>> in placing the workload among the cores.
>>>
>>
>> I've made an attempt in the following two patches to address
>> the load balancing of mismatched load between the siblings.
>>
>> It is applied on top of Aaron's patches:
>> - sched: Fix incorrect rq tagged as forced idle
>> - wrapper for cfs_rq->min_vruntime
>>   https://lore.kernel.org/lkml/20190725143127.GB992@aaronlu/
>> - core vruntime comparison
>>   https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/
> 
> So both of you are working on top of my 2 patches that deal with the
> fairness issue, but I had the feeling Tim's alternative patches[1] are
> simpler than mine and achieves the same result(after the force idle tag

I think Julien's results show that my patches did not do as well as
your patches for fairness. Aubrey did some other testing with the same
conclusion.  So I think keeping the forced idle time balanced is not
enough for maintaining fairness.

I would love to see whether my load balancing patches help for your
workload.

Tim

> fix), so unless there is something I missed, I think we should go with
> the simpler one?
> 
> [1]: https://lore.kernel.org/lkml/b7a83fcb-5c34-9794-5688-55c52697fd84@linux.intel.com/
> 


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-11 16:19                                       ` Tim Chen
@ 2019-09-11 16:47                                         ` Vineeth Remanan Pillai
  2019-09-12 12:35                                           ` Aaron Lu
  2019-09-12 12:04                                         ` Aaron Lu
  1 sibling, 1 reply; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-09-11 16:47 UTC (permalink / raw)
  To: Tim Chen
  Cc: Aaron Lu, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Aubrey Li, Subhra Mazumdar, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

> > So both of you are working on top of my 2 patches that deal with the
> > fairness issue, but I had the feeling Tim's alternative patches[1] are
> > simpler than mine and achieves the same result(after the force idle tag
>
> I think Julien's result show that my patches did not do as well as
> your patches for fairness. Aubrey did some other testing with the same
> conclusion.  So I think keeping the forced idle time balanced is not
> enough for maintaining fairness.
>
There are two main issues - the vruntime comparison issue and the
forced idle issue.  The coresched_idle thread patch addresses the
forced idle issue, as the scheduler no longer overloads the idle
thread for forcing idle. If I understand correctly, Tim's patch
also tries to fix the forced idle issue. On top of fixing the forced
idle issue, we also need to fix the vruntime comparison issue,
and I think that's where Aaron's patch helps.

I think comparing the parent's runtime will also have issues once
the task group has a lot more threads with different running
patterns. One example is a task group with a lot of active threads
and one thread with fairly little activity. When this less active
thread is competing with a thread in another group, there is a
chance that it loses continuously for a while until the other
group catches up on its vruntime.

As discussed during LPC, should we start thinking along the lines
of a global vruntime or a core-wide vruntime to fix the vruntime
comparison issue?

Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-11 16:19                                       ` Tim Chen
  2019-09-11 16:47                                         ` Vineeth Remanan Pillai
@ 2019-09-12 12:04                                         ` Aaron Lu
  2019-09-12 17:05                                           ` Tim Chen
  2019-09-12 23:12                                           ` Aubrey Li
  1 sibling, 2 replies; 161+ messages in thread
From: Aaron Lu @ 2019-09-12 12:04 UTC (permalink / raw)
  To: Tim Chen
  Cc: Julien Desfossez, Dario Faggioli, Li, Aubrey, Aubrey Li,
	Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Wed, Sep 11, 2019 at 09:19:02AM -0700, Tim Chen wrote:
> On 9/11/19 7:02 AM, Aaron Lu wrote:
> > Hi Tim & Julien,
> > 
> > On Fri, Sep 06, 2019 at 11:30:20AM -0700, Tim Chen wrote:
> >> On 8/7/19 10:10 AM, Tim Chen wrote:
> >>
> >>> 3) Load balancing between CPU cores
> >>> -----------------------------------
> >>> Say if one CPU core's sibling threads get forced idled
> >>> a lot as it has mostly incompatible tasks between the siblings,
> >>> moving the incompatible load to other cores and pulling
> >>> compatible load to the core could help CPU utilization.
> >>>
> >>> So just considering the load of a task is not enough during
> >>> load balancing, task compatibility also needs to be considered.
> >>> Peter has put in mechanisms to balance compatible tasks between
> >>> CPU thread siblings, but not across cores.
> >>>
> >>> Status:
> >>> I have not seen patches on this issue.  This issue could lead to
> >>> large variance in workload performance based on your luck
> >>> in placing the workload among the cores.
> >>>
> >>
> >> I've made an attempt in the following two patches to address
> >> the load balancing of mismatched load between the siblings.
> >>
> >> It is applied on top of Aaron's patches:
> >> - sched: Fix incorrect rq tagged as forced idle
> >> - wrapper for cfs_rq->min_vruntime
> >>   https://lore.kernel.org/lkml/20190725143127.GB992@aaronlu/
> >> - core vruntime comparison
> >>   https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/
> > 
> > So both of you are working on top of my 2 patches that deal with the
> > fairness issue, but I had the feeling Tim's alternative patches[1] are
> > simpler than mine and achieves the same result(after the force idle tag
> 
> I think Julien's result show that my patches did not do as well as
> your patches for fairness. Aubrey did some other testing with the same
> conclusion.  So I think keeping the forced idle time balanced is not
> enough for maintaining fairness.

Well, I have done the following tests:
1 Julien's test script: https://paste.debian.net/plainh/834cf45c
2 start two tagged will-it-scale/page_fault1, see how each performs;
3 Aubrey's mysql test: https://github.com/aubreyli/coresched_bench.git

They all show your patchset performs equally well... And considering
what the patch does, I think they are really doing the same thing in
different ways.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-11 16:47                                         ` Vineeth Remanan Pillai
@ 2019-09-12 12:35                                           ` Aaron Lu
  2019-09-12 17:29                                             ` Tim Chen
  2019-09-30 11:53                                             ` Vineeth Remanan Pillai
  0 siblings, 2 replies; 161+ messages in thread
From: Aaron Lu @ 2019-09-12 12:35 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Tim Chen, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Aubrey Li, Subhra Mazumdar, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Wed, Sep 11, 2019 at 12:47:34PM -0400, Vineeth Remanan Pillai wrote:
> > > So both of you are working on top of my 2 patches that deal with the
> > > fairness issue, but I had the feeling Tim's alternative patches[1] are
> > > simpler than mine and achieves the same result(after the force idle tag
> >
> > I think Julien's result show that my patches did not do as well as
> > your patches for fairness. Aubrey did some other testing with the same
> > conclusion.  So I think keeping the forced idle time balanced is not
> > enough for maintaining fairness.
> >
> There are two main issues - vruntime comparison issue and the
> forced idle issue.  coresched_idle thread patch is addressing
> the forced idle issue as scheduler is no longer overloading idle
> thread for forcing idle. If I understand correctly, Tim's patch
> also tries to fix the forced idle issue. On top of fixing forced

Er... I don't think so. Tim's patch is meant to solve the fairness
issue, as mine is; it doesn't attempt to address the forced idle issue.

> idle issue, we also need to fix that vruntime comparison issue
> and I think thats where Aaron's patch helps.
> 
> I think comparing parent's runtime also will have issues once
> the task group has a lot more threads with different running
> patterns. One example is a task group with lot of active threads
> and a thread with fairly less activity. So when this less active
> thread is competing with a thread in another group, there is a
> chance that it loses continuously for a while until the other
> group catches up on its vruntime.

I actually think this is expected behaviour.

Without core scheduling, when deciding which task to run, we will first
decide which "se" to run from the CPU's root level cfs runqueue and then
go downwards. Let's call the chosen se on the root level cfs runqueue
the winner se. Then, with core scheduling, we also need to compare the
two winner "se"s of each hyperthread and choose the core-wide winner "se".

> 
> As discussed during LPC, probably start thinking along the lines
> of global vruntime or core wide vruntime to fix the vruntime
> comparison issue?

A core-wide vruntime makes sense when there are multiple tasks of
different cgroups queued on the same core, e.g. when two tasks of
cgroupA and one task of cgroupB are queued on the same core. Assume
one of cgroupA's tasks is on one hyperthread and its other task is on
the other hyperthread with cgroupB's task. With my current
implementation or Tim's, cgroupA will get more time than cgroupB. If we
maintain a core-wide vruntime for cgroupA and cgroupB, we should be able
to maintain fairness between the cgroups on this core. Tim proposes to
solve this problem by doing some kind of load balancing, if I'm not
mistaken; I haven't taken a look at it yet.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-12 12:04                                         ` Aaron Lu
@ 2019-09-12 17:05                                           ` Tim Chen
  2019-09-13 13:57                                             ` Aaron Lu
  2019-09-12 23:12                                           ` Aubrey Li
  1 sibling, 1 reply; 161+ messages in thread
From: Tim Chen @ 2019-09-12 17:05 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Julien Desfossez, Dario Faggioli, Li, Aubrey, Aubrey Li,
	Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On 9/12/19 5:04 AM, Aaron Lu wrote:

> Well, I have done following tests:
> 1 Julien's test script: https://paste.debian.net/plainh/834cf45c
> 2 start two tagged will-it-scale/page_fault1, see how each performs;
> 3 Aubrey's mysql test: https://github.com/aubreyli/coresched_bench.git
> 
> They all show your patchset performs equally well...And consider what
> the patch does, I think they are really doing the same thing in
> different ways.
> 

Aaron,

The new feature of my new patches is an attempt to load balance between
cores and remove the imbalance of cgroup load on a core that causes
forced idle, whereas the previous patches attempt cgroup fairness
between sibling threads, so I think the goals are kind of orthogonal
and complementary.

The premise is this: say cgroup1 occupies 50% of cpu on cpu thread 1
and 25% of cpu on cpu thread 2; that means we have a 25% cpu imbalance
and the cpu is force idled 25% of the time.  So ideally we need to move
12.5% of cgroup1's load from cpu thread 1 to sibling thread 2, so they
both run at 37.5% for cgroup1's load without causing
any forced idle time.  Otherwise we will try to move 25% of cgroup1's
load from cpu thread 1 to another core that has cgroup1 load to match.

This load balance is done in the regular load balance paths.

Previously for v3, only sched_core_balance made an attempt to pull a cookie task, and only
in the idle balance path. So if the cpu is kept busy, the cgroup load imbalance
between sibling threads could last a long time.  And the thread fairness
patches for v3 don't help to balance load for such cases.

The new patches take into consideration the actual amount of load
imbalance of the same group between sibling threads when selecting a
task to pull, and they also prevent task migrations that create
more load imbalance. So hopefully this feature will help when we have
more cores and need load balance across the cores.  It tries to even
out cgroup workload between threads to minimize forced idle time, and
also to even out load across cores.

In your test, how many cores are on your machine and how many threads did
each page_fault1 spawn off?

Tim

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-12 12:35                                           ` Aaron Lu
@ 2019-09-12 17:29                                             ` Tim Chen
  2019-09-13 14:15                                               ` Aaron Lu
  2019-09-30 11:53                                             ` Vineeth Remanan Pillai
  1 sibling, 1 reply; 161+ messages in thread
From: Tim Chen @ 2019-09-12 17:29 UTC (permalink / raw)
  To: Aaron Lu, Vineeth Remanan Pillai
  Cc: Julien Desfossez, Dario Faggioli, Li, Aubrey, Aubrey Li,
	Subhra Mazumdar, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 9/12/19 5:35 AM, Aaron Lu wrote:
> On Wed, Sep 11, 2019 at 12:47:34PM -0400, Vineeth Remanan Pillai wrote:

> 
> core wide vruntime makes sense when there are multiple tasks of
> different cgroups queued on the same core. e.g. when there are two
> tasks of cgroupA and one task of cgroupB are queued on the same core,
> assume cgroupA's one task is on one hyperthread and its other task is on
> the other hyperthread with cgroupB's task. With my current
> implementation or Tim's, cgroupA will get more time than cgroupB. 

I think that's expected, because cgroup A has two tasks and cgroup B
has one task, so cgroup A should get twice the cpu time of cgroup B
to maintain fairness.

> If we
> maintain core wide vruntime for cgroupA and cgroupB, we should be able
> to maintain fairness between cgroups on this core. 

I don't think the right thing to do is to give cgroupA and cgroupB equal
time on a core.  The time they get should still depend on their
load weight. The better thing to do is to move one task from cgroupA
to another core that has only one cgroupA task, so it can be paired up
with that lonely cgroupA task.  This will eliminate the forced idle time
for cgroupA both on the current core and on the migrated-to core.

> Tim propose to solve
> this problem by doing some kind of load balancing if I'm not mistaken, I
> haven't taken a look at this yet.
> 

My new patchset is trying to solve a different problem.  It is
not trying to maintain fairness between cgroups on a core, but tries to
even out the load of a cgroup between threads, and to even out general
load between cores. This will minimize the forced idle time.

Fairness between cgroups still relies on
proper vruntime accounting and proper comparison of vruntime between
threads.  So for now, I am still using Aaron's patchset for this
purpose, as it has better fairness properties than my other proposed
patchsets.

With just Aaron's current patchset, we may have a lot of forced idle
time due to the uneven distribution of tasks of different cgroups among
the threads and cores, even though scheduling fairness is maintained.
My new patches try to remove that forced idle time by moving
tasks around, to minimize cgroup unevenness between sibling threads
and general load unevenness between the CPUs.

Tim


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-12 12:04                                         ` Aaron Lu
  2019-09-12 17:05                                           ` Tim Chen
@ 2019-09-12 23:12                                           ` Aubrey Li
  2019-09-15 14:14                                             ` Aaron Lu
  1 sibling, 1 reply; 161+ messages in thread
From: Aubrey Li @ 2019-09-12 23:12 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Thu, Sep 12, 2019 at 8:04 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
>
> On Wed, Sep 11, 2019 at 09:19:02AM -0700, Tim Chen wrote:
> > On 9/11/19 7:02 AM, Aaron Lu wrote:
> > I think Julien's result show that my patches did not do as well as
> > your patches for fairness. Aubrey did some other testing with the same
> > conclusion.  So I think keeping the forced idle time balanced is not
> > enough for maintaining fairness.
>
> Well, I have done following tests:
> 1 Julien's test script: https://paste.debian.net/plainh/834cf45c
> 2 start two tagged will-it-scale/page_fault1, see how each performs;
> 3 Aubrey's mysql test: https://github.com/aubreyli/coresched_bench.git
>
> They all show your patchset performs equally well...And consider what
> the patch does, I think they are really doing the same thing in
> different ways.

It looks like we are not on the same page. If you don't mind, can both
of you rebase your patchsets onto v5.3-rc8 and provide a public branch
so I can fetch and test them, at least with my benchmark?

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-12 17:05                                           ` Tim Chen
@ 2019-09-13 13:57                                             ` Aaron Lu
  0 siblings, 0 replies; 161+ messages in thread
From: Aaron Lu @ 2019-09-13 13:57 UTC (permalink / raw)
  To: Tim Chen
  Cc: Julien Desfossez, Dario Faggioli, Li, Aubrey, Aubrey Li,
	Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Thu, Sep 12, 2019 at 10:05:43AM -0700, Tim Chen wrote:
> On 9/12/19 5:04 AM, Aaron Lu wrote:
> 
> > Well, I have done following tests:
> > 1 Julien's test script: https://paste.debian.net/plainh/834cf45c
> > 2 start two tagged will-it-scale/page_fault1, see how each performs;
> > 3 Aubrey's mysql test: https://github.com/aubreyli/coresched_bench.git
> > 
> > They all show your patchset performs equally well...And consider what
> > the patch does, I think they are really doing the same thing in
> > different ways.
> > 
> 
> Aaron,
> 
> The new feature of my new patches attempt to load balance between cores,
> and remove imbalance of cgroup load on a core that causes forced idle.
> Whereas previous patches attempt for fairness of cgroup between sibling threads,
> so I think the goals are kind of orthogonal and complementary.
> 
> The premise is this, say cgroup1 is occupying 50% of cpu on cpu thread 1
> and 25% of cpu on cpu thread 2, that means we have a 25% cpu imbalance
> and cpu is force idled 25% of the time.  So ideally we need to remove
> 12.5% of cgroup 1 load from cpu thread 1 to sibling thread 2, so they
> both run at 37.5% on both thread for cgroup1 load without causing
> any force idled time.  Otherwise we will try to remove 25% of cgroup1
> load from cpu thread 1 to another core that has cgroup1 load to match.
> 
> This load balance is done in the regular load balance paths.
> 
> Previously for v3, only sched_core_balance made an attempt to pull a cookie task, and only
> in the idle balance path. So if the cpu is kept busy, the cgroup load imbalance
> between sibling threads could last a long time.  And the thread fairness
> patches for v3 don't help to balance load for such cases.
> 
> The new patches take into actual consideration of the amount of load imbalance
> of the same group between sibling threads when selecting task to pull, 
> and it also prevent task migration that creates
> more load imbalance. So hopefully this feature will help when we have
> more cores and need load balance across the cores.  This tries to help
> even cgroup workload between threads to minimize forced idle time, and also
> even out load across cores.

I will take a look at your new patches, thanks for the explanation.

> In your test, how many cores are on your machine and how many threads did
> each page_fault1 spawn off?

The test VM has 16 cores and 32 threads.
I created 2 tagged cgroups to run page_fault1 and each page_fault1 has
16 processes, like this:
$ ./src/will-it-scale/page_fault1_processes -t 16 -s 60

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-12 17:29                                             ` Tim Chen
@ 2019-09-13 14:15                                               ` Aaron Lu
  2019-09-13 17:13                                                 ` Tim Chen
  0 siblings, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-09-13 14:15 UTC (permalink / raw)
  To: Tim Chen
  Cc: Vineeth Remanan Pillai, Julien Desfossez, Dario Faggioli, Li,
	Aubrey, Aubrey Li, Subhra Mazumdar, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Thu, Sep 12, 2019 at 10:29:13AM -0700, Tim Chen wrote:
> On 9/12/19 5:35 AM, Aaron Lu wrote:
> > On Wed, Sep 11, 2019 at 12:47:34PM -0400, Vineeth Remanan Pillai wrote:
> 
> > 
> > core wide vruntime makes sense when there are multiple tasks of
> > different cgroups queued on the same core. e.g. when there are two
> > tasks of cgroupA and one task of cgroupB are queued on the same core,
> > assume cgroupA's one task is on one hyperthread and its other task is on
> > the other hyperthread with cgroupB's task. With my current
> > implementation or Tim's, cgroupA will get more time than cgroupB. 
> 
> I think that's expected because cgroup A has two tasks and cgroup B
> has one task, so cgroup A should get twice the cpu time than cgroup B
> to maintain fairness.

Like you said below, the ideal run time for each cgroup should depend on
their individual weight. The fact that cgroupA has two tasks doesn't
mean it has twice the weight. Both cgroups can have the same cpu.shares
setting, and then the more tasks a cgroup has, the less weight each of
them gets under the cgroup's per-cpu se.

I now realized one thing that differs between your idle_allowance
implementation and my core_vruntime implementation. In your
implementation, the idle_allowance is absolute time, while vruntime can
be adjusted by the se's weight; that's probably one area where your
implementation can make things less fair than mine.

> > If we
> > maintain core wide vruntime for cgroupA and cgroupB, we should be able
> > to maintain fairness between cgroups on this core. 
> 
> I don't think the right thing to do is to give cgroupA and cgroupB equal
> time on a core.  The time they get should still depend on their 
> load weight.

Agree.

> The better thing to do is to move one task from cgroupA to another core,
> that has only one cgroupA task so it can be paired up
> with that lonely cgroupA task.  This will eliminate the forced idle time
> for cgroupA both on the current core and also the migrated core.

I'm not sure if this is always possible.

Say on a 16cores/32threads machine, there are 3 cgroups, each has 16 cpu
intensive tasks, will it be possible to make things perfectly balanced?

Don't get me wrong, I think this kind of load balancing is good and
needed, but I'm not sure we can always make things perfectly balanced.
And if not, do we care about those few cores where cgroup tasks are not
balanced? Do we then need to implement the core-wide cgroup fairness
functionality, or do we not care, since those cores are supposed to be
few and it isn't a big deal?

> > Tim propose to solve
> > this problem by doing some kind of load balancing if I'm not mistaken, I
> > haven't taken a look at this yet.
> > 
> 
> My new patchset is trying to solve a different problem.  It is
> not trying to maintain fairness between cgroup on a core, but tries to
> even out the load of a cgroup between threads, and even out general
> load between cores. This will minimize the forced idle time.

Understood.

> 
> The fairness between cgroup relies still on
> proper vruntime accounting and proper comparison of vruntime between
> threads.  So for now, I am still using Aaron's patchset for this purpose
> as it has better fairness property than my other proposed patchsets
> for fairness purpose.
> 
> With just Aaron's current patchset we may have a lot of forced idle time
> due to the uneven distribution of tasks of different cgroup among the
> threads and cores, even though scheduling fairness is maintained.
> My new patches try to remove those forced idle time by moving the
> tasks around, to minimize cgroup unevenness between sibling threads
> and general load unevenness between the CPUs.

Yes I think this is definitely a good thing to do.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-13 14:15                                               ` Aaron Lu
@ 2019-09-13 17:13                                                 ` Tim Chen
  0 siblings, 0 replies; 161+ messages in thread
From: Tim Chen @ 2019-09-13 17:13 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Vineeth Remanan Pillai, Julien Desfossez, Dario Faggioli, Li,
	Aubrey, Aubrey Li, Subhra Mazumdar, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On 9/13/19 7:15 AM, Aaron Lu wrote:
> On Thu, Sep 12, 2019 at 10:29:13AM -0700, Tim Chen wrote:

> 
>> The better thing to do is to move one task from cgroupA to another core,
>> that has only one cgroupA task so it can be paired up
>> with that lonely cgroupA task.  This will eliminate the forced idle time
>> for cgroupA both on the current core and also the migrated core.
> 
> I'm not sure if this is always possible.

During update_sg_lb_stats, we can scan for opportunities where pulling a task
from a source cpu in the sched group to the target dest cpu can reduce the forced idle imbalance.
And we also prevent task migrations that increase forced idle imbalance.

With those policies in place, we may not achieve perfect balance, but at least
we will load balance in the right direction to lower forced idle imbalance.

> 
> Say on a 16cores/32threads machine, there are 3 cgroups, each has 16 cpu
> intensive tasks, will it be possible to make things perfectly balanced?
> 
> Don't get me wrong, I think this kind of load balancing is good and
> needed, but I'm not sure if we can always make things perfectly
> balanced. And if not, do we care those few cores where cgroup tasks are
> not balanced and then, do we need to implement the core_wide cgoup
> fairness functionality or we don't care since those cores are supposed
> to be few and isn't a big deal?

Yes, we still need core-wide fairness for tasks.  Load balancing moves
tasks around so we have less imbalance of cgroup tasks in a core, which
is what results in forced idle time.  Once those patches are in place, we
still need to maintain fairness within a core.

Tim


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-12 23:12                                           ` Aubrey Li
@ 2019-09-15 14:14                                             ` Aaron Lu
  2019-09-18  1:33                                               ` Aubrey Li
  2019-10-29  9:11                                               ` Dario Faggioli
  0 siblings, 2 replies; 161+ messages in thread
From: Aaron Lu @ 2019-09-15 14:14 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Tim Chen, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Fri, Sep 13, 2019 at 07:12:52AM +0800, Aubrey Li wrote:
> On Thu, Sep 12, 2019 at 8:04 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
> >
> > On Wed, Sep 11, 2019 at 09:19:02AM -0700, Tim Chen wrote:
> > > On 9/11/19 7:02 AM, Aaron Lu wrote:
> > > I think Julien's result show that my patches did not do as well as
> > > your patches for fairness. Aubrey did some other testing with the same
> > > conclusion.  So I think keeping the forced idle time balanced is not
> > > enough for maintaining fairness.
> >
> > Well, I have done following tests:
> > 1 Julien's test script: https://paste.debian.net/plainh/834cf45c
> > 2 start two tagged will-it-scale/page_fault1, see how each performs;
> > 3 Aubrey's mysql test: https://github.com/aubreyli/coresched_bench.git
> >
> > They all show your patchset performs equally well...And consider what
> > the patch does, I think they are really doing the same thing in
> > different ways.
> 
> It looks like we are not on the same page, if you don't mind, can both of
> you rebase your patchset onto v5.3-rc8 and provide a public branch so I
> can fetch and test it at least by my benchmark?

I'm using the following branch as base which is v5.1.5 based:
https://github.com/digitalocean/linux-coresched coresched-v3-v5.1.5-test

And I have pushed Tim's branch to:
https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim

Mine:
https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-core_vruntime

Both branches have the two patches I sent previously:
https://lore.kernel.org/lkml/20190810141556.GA73644@aaronlu/
Although they have some potential performance loss, as pointed out by
Vineeth, I haven't had time to rework them yet.

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-15 14:14                                             ` Aaron Lu
@ 2019-09-18  1:33                                               ` Aubrey Li
  2019-09-18 20:40                                                 ` Tim Chen
  2019-10-29  9:11                                               ` Dario Faggioli
  1 sibling, 1 reply; 161+ messages in thread
From: Aubrey Li @ 2019-09-18  1:33 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Sun, Sep 15, 2019 at 10:14 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
>
> On Fri, Sep 13, 2019 at 07:12:52AM +0800, Aubrey Li wrote:
> > On Thu, Sep 12, 2019 at 8:04 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
> > >
> > > On Wed, Sep 11, 2019 at 09:19:02AM -0700, Tim Chen wrote:
> > > > On 9/11/19 7:02 AM, Aaron Lu wrote:
> > > > I think Julien's result show that my patches did not do as well as
> > > > your patches for fairness. Aubrey did some other testing with the same
> > > > conclusion.  So I think keeping the forced idle time balanced is not
> > > > enough for maintaining fairness.
> > >
> > > Well, I have done following tests:
> > > 1 Julien's test script: https://paste.debian.net/plainh/834cf45c
> > > 2 start two tagged will-it-scale/page_fault1, see how each performs;
> > > 3 Aubrey's mysql test: https://github.com/aubreyli/coresched_bench.git
> > >
> > > They all show your patchset performs equally well...And consider what
> > > the patch does, I think they are really doing the same thing in
> > > different ways.
> >
> > It looks like we are not on the same page, if you don't mind, can both of
> > you rebase your patchset onto v5.3-rc8 and provide a public branch so I
> > can fetch and test it at least by my benchmark?
>
> I'm using the following branch as base which is v5.1.5 based:
> https://github.com/digitalocean/linux-coresched coresched-v3-v5.1.5-test
>
> And I have pushed Tim's branch to:
> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
>
> Mine:
> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-core_vruntime
>
> The two branches both have two patches I have sent previouslly:
> https://lore.kernel.org/lkml/20190810141556.GA73644@aaronlu/
> Although it has some potential performance loss as pointed out by
> Vineeth, I haven't got time to rework it yet.

In terms of these two branches, we tested two cases:

1) 32 AVX threads and 32 mysql threads on one core(2 HT)
2) 192 AVX threads and 192 mysql threads on 96 cores(192 HTs)

For case 1), we saw the two branches are on par:

Branch: coresched-v3-v5.1.5-test-core_vruntime
- Avg throughput: 1865.62 (std: 20.6%)
- Avg latency: 26.43 (std: 8.3%)

Branch: coresched-v3-v5.1.5-test-tim
- Avg throughput: 1804.88 (std: 20.1%)
- Avg latency: 29.78 (std: 11.8%)

For case 2), we saw that core vruntime performs better than counting
forced idle time:

Branch: coresched-v3-v5.1.5-test-core_vruntime
- Avg throughput: 5528.56 (std: 44.2%)
- Avg latency: 165.99 (std: 45.2%)

Branch: coresched-v3-v5.1.5-test-tim
- Avg throughput: 3842.33 (std: 35.1%)
- Avg latency: 306.99 (std: 72.9%)

As Aaron pointed out, vruntime is scaled by the se's weight, which could
be a reason for the difference.

So should we go with core vruntime approach?
Or Tim - do you want to improve forced idle time approach?

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-18  1:33                                               ` Aubrey Li
@ 2019-09-18 20:40                                                 ` Tim Chen
  2019-09-18 22:16                                                   ` Aubrey Li
  2019-10-29 20:40                                                   ` Julien Desfossez
  0 siblings, 2 replies; 161+ messages in thread
From: Tim Chen @ 2019-09-18 20:40 UTC (permalink / raw)
  To: Aubrey Li, Aaron Lu
  Cc: Julien Desfossez, Dario Faggioli, Li, Aubrey, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 9/17/19 6:33 PM, Aubrey Li wrote:
> On Sun, Sep 15, 2019 at 10:14 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:

>>
>> And I have pushed Tim's branch to:
>> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
>>
>> Mine:
>> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-core_vruntime


Aubrey,

Thanks for testing with your set up.

I think the test that's of interest is to see my load balancing added on top
of Aaron's fairness patch, instead of using my previous version of the
forced idle approach in the coresched-v3-v5.1.5-test-tim branch.
 
I've added my two load balance patches on top of Aaron's patches
in coresched-v3-v5.1.5-test-core_vruntime branch and put it in

https://github.com/pdxChen/gang/tree/coresched-v3-v5.1.5-test-core_vruntime-lb

> 
> As Aaron pointed out, vruntime is with se's weight, which could be a reason
> for the difference.
> 
> So should we go with core vruntime approach?
> Or Tim - do you want to improve forced idle time approach?
> 

I hope to improve the forced idle time later.  But for now let's see if
additional load balance logic can help remove cgroup mismatch
and improve performance, on top of Aaron's fairness patches.

Thanks.

Tim



^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-10 14:27             ` Julien Desfossez
@ 2019-09-18 21:12               ` Tim Chen
  0 siblings, 0 replies; 161+ messages in thread
From: Tim Chen @ 2019-09-18 21:12 UTC (permalink / raw)
  To: Julien Desfossez, Peter Zijlstra
  Cc: Phil Auld, Matthew Garrett, Vineeth Remanan Pillai,
	Nishanth Aravamudan, mingo, tglx, pjt, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel, Aaron Lu,
	Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, Dave Stewart

On 9/10/19 7:27 AM, Julien Desfossez wrote:
> On 29-Aug-2019 04:38:21 PM, Peter Zijlstra wrote:
>> On Thu, Aug 29, 2019 at 10:30:51AM -0400, Phil Auld wrote:
>>> I think, though, that you were basically agreeing with me that the current
>>> core scheduler does not close the holes, or am I reading that wrong.
>>
>> Agreed; the missing bits for L1TF are ugly but doable (I've actually
>> done them before, Tim has that _somewhere_), but I've not seen a
>> 'workable' solution for MDS yet.
> 

The L1TF problem is a much bigger one for HT than MDS.  It is relatively
easy for a rogue VM to sniff L1-cached memory locations, while for MDS
it is quite difficult for the attacker to associate data in the CPU
buffers with specific memory, which is needed to make the sniffed data useful.

Even if we don't have a complete solution yet for the MDS HT vulnerability,
it is worthwhile to plug the L1TF hole for HT first with the core scheduler,
as L1TF is much more exploitable.

Tim

> Following the discussion we had yesterday at LPC, after we have agreed
> on a solution for fixing the current fairness issue, we will post the
> v4. We will then work on prototyping the other synchronisation points
> (syscalls, interrupts and VMEXIT) to evaluate the overhead in various
> use-cases.
> 
> Depending on the use-case, we know the performance overhead may be
> heavier than just disabling SMT, but the benchmarks we have seen so far
> indicate that there are valid cases for core scheduling. Core scheduling
> will continue to be unused by default, but with it, we will have the
> option to tune the system to be both secure and faster than disabling
> SMT for those cases.
> 
> Thanks,
> 
> Julien
> 
> P.S: I think the branch that contains the VMEXIT handling is here
> https://github.com/pdxChen/gang/commits/sched_1.23-base
> 


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-05  1:44                                   ` Julien Desfossez
  2019-09-06 22:17                                     ` Tim Chen
@ 2019-09-18 21:27                                     ` Tim Chen
  1 sibling, 0 replies; 161+ messages in thread
From: Tim Chen @ 2019-09-18 21:27 UTC (permalink / raw)
  To: Julien Desfossez
  Cc: Dario Faggioli, Li, Aubrey, Aaron Lu, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 9/4/19 6:44 PM, Julien Desfossez wrote:

> +
> +static void coresched_idle_worker_fini(struct rq *rq)
> +{
> +	if (rq->core_idle_task) {
> +		kthread_stop(rq->core_idle_task);
> +		rq->core_idle_task = NULL;
> +	}

During testing, I have seen rq->core_idle_task accessed as a NULL
pointer from CPUs other than the one executing the stop_machine
function when toggling cpu.tag of the cgroup.
Locking here is tricky because the rq lock is being transitioned
from the core lock to the per-runqueue lock.  As a fix, I made
coresched_idle_worker_fini a null function and did not NULL out
core_idle_task.

Tim


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-18 20:40                                                 ` Tim Chen
@ 2019-09-18 22:16                                                   ` Aubrey Li
  2019-09-30 14:36                                                     ` Vineeth Remanan Pillai
  2019-10-29 20:40                                                   ` Julien Desfossez
  1 sibling, 1 reply; 161+ messages in thread
From: Aubrey Li @ 2019-09-18 22:16 UTC (permalink / raw)
  To: Tim Chen
  Cc: Aaron Lu, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Thu, Sep 19, 2019 at 4:41 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On 9/17/19 6:33 PM, Aubrey Li wrote:
> > On Sun, Sep 15, 2019 at 10:14 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
>
> >>
> >> And I have pushed Tim's branch to:
> >> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> >>
> >> Mine:
> >> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-core_vruntime
>
>
> Aubrey,
>
> Thanks for testing with your set up.
>
> I think the test that's of interest is to see my load balancing added on top
> of Aaron's fairness patch, instead of using my previous version of
> forced idle approach in coresched-v3-v5.1.5-test-tim branch.
>

I'm trying to figure out a way to solve fairness only (not including
task placement).
So @Vineeth - if everyone is okay with Aaron's fairness patch, maybe
we should have a v4?

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-06 18:30                                   ` Tim Chen
  2019-09-11 14:02                                     ` Aaron Lu
@ 2019-09-25  2:40                                     ` Aubrey Li
  2019-09-25 17:24                                       ` Tim Chen
  2019-09-30 15:22                                     ` Julien Desfossez
  2 siblings, 1 reply; 161+ messages in thread
From: Aubrey Li @ 2019-09-25  2:40 UTC (permalink / raw)
  To: Tim Chen
  Cc: Dario Faggioli, Julien Desfossez, Li, Aubrey, Aaron Lu,
	Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Sat, Sep 7, 2019 at 2:30 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> +static inline s64 core_sched_imbalance_delta(int src_cpu, int dst_cpu,
> +                       int src_sibling, int dst_sibling,
> +                       struct task_group *tg, u64 task_load)
> +{
> +       struct sched_entity *se, *se_sibling, *dst_se, *dst_se_sibling;
> +       s64 excess, deficit, old_mismatch, new_mismatch;
> +
> +       if (src_cpu == dst_cpu)
> +               return -1;
> +
> +       /* XXX SMT4 will require additional logic */
> +
> +       se = tg->se[src_cpu];
> +       se_sibling = tg->se[src_sibling];
> +
> +       excess = se->avg.load_avg - se_sibling->avg.load_avg;
> +       if (src_sibling == dst_cpu) {
> +               old_mismatch = abs(excess);
> +               new_mismatch = abs(excess - 2*task_load);
> +               return old_mismatch - new_mismatch;
> +       }
> +
> +       dst_se = tg->se[dst_cpu];
> +       dst_se_sibling = tg->se[dst_sibling];
> +       deficit = dst_se->avg.load_avg - dst_se_sibling->avg.load_avg;
> +
> +       old_mismatch = abs(excess) + abs(deficit);
> +       new_mismatch = abs(excess - (s64) task_load) +
> +                      abs(deficit + (s64) task_load);

If I understood correctly, these formulas make the assumption that the
task being moved to the destination matches the destination's core
cookie. So if the task does not match the dst's core cookie and still
has to stay in the runqueue, the formulas become incorrect.

>  /**
>   * update_sg_lb_stats - Update sched_group's statistics for load balancing.
>   * @env: The load balancing environment.
> @@ -8345,7 +8492,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>                 else
>                         load = source_load(i, load_idx);
>
> -               sgs->group_load += load;

Why is this load update line removed?

> +               core_sched_imbalance_scan(sgs, i, env->dst_cpu);
> +
>                 sgs->group_util += cpu_util(i);
>                 sgs->sum_nr_running += rq->cfs.h_nr_running;
>

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-25  2:40                                     ` Aubrey Li
@ 2019-09-25 17:24                                       ` Tim Chen
  2019-09-25 22:07                                         ` Aubrey Li
  0 siblings, 1 reply; 161+ messages in thread
From: Tim Chen @ 2019-09-25 17:24 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Dario Faggioli, Julien Desfossez, Li, Aubrey, Aaron Lu,
	Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On 9/24/19 7:40 PM, Aubrey Li wrote:
> On Sat, Sep 7, 2019 at 2:30 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>> +static inline s64 core_sched_imbalance_delta(int src_cpu, int dst_cpu,
>> +                       int src_sibling, int dst_sibling,
>> +                       struct task_group *tg, u64 task_load)
>> +{
>> +       struct sched_entity *se, *se_sibling, *dst_se, *dst_se_sibling;
>> +       s64 excess, deficit, old_mismatch, new_mismatch;
>> +
>> +       if (src_cpu == dst_cpu)
>> +               return -1;
>> +
>> +       /* XXX SMT4 will require additional logic */
>> +
>> +       se = tg->se[src_cpu];
>> +       se_sibling = tg->se[src_sibling];
>> +
>> +       excess = se->avg.load_avg - se_sibling->avg.load_avg;
>> +       if (src_sibling == dst_cpu) {
>> +               old_mismatch = abs(excess);
>> +               new_mismatch = abs(excess - 2*task_load);
>> +               return old_mismatch - new_mismatch;
>> +       }
>> +
>> +       dst_se = tg->se[dst_cpu];
>> +       dst_se_sibling = tg->se[dst_sibling];
>> +       deficit = dst_se->avg.load_avg - dst_se_sibling->avg.load_avg;
>> +
>> +       old_mismatch = abs(excess) + abs(deficit);
>> +       new_mismatch = abs(excess - (s64) task_load) +
>> +                      abs(deficit + (s64) task_load);
> 
> If I understood correctly, these formulas made an assumption that the task
> being moved to the destination is matched the destination's core cookie. 

That's not the case.  We do not need to match the destination's core
cookie, as that may change after context switches.  The move needs to
reduce the load mismatch with the destination CPU's sibling for that
cgroup.

> so if
> the task is not matched with dst's core cookie and still have to stay
> in the runqueue
> then the formula becomes not correct.
> 
>>  /**
>>   * update_sg_lb_stats - Update sched_group's statistics for load balancing.
>>   * @env: The load balancing environment.
>> @@ -8345,7 +8492,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>                 else
>>                         load = source_load(i, load_idx);
>>
>> -               sgs->group_load += load;
> 
> Why is this load update line removed?

This was removed accidentally.  Should be restored.

> 
>> +               core_sched_imbalance_scan(sgs, i, env->dst_cpu);
>> +
>>                 sgs->group_util += cpu_util(i);
>>                 sgs->sum_nr_running += rq->cfs.h_nr_running;
>>
> 


Thanks.

Tim


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-25 17:24                                       ` Tim Chen
@ 2019-09-25 22:07                                         ` Aubrey Li
  0 siblings, 0 replies; 161+ messages in thread
From: Aubrey Li @ 2019-09-25 22:07 UTC (permalink / raw)
  To: Tim Chen
  Cc: Dario Faggioli, Julien Desfossez, Li, Aubrey, Aaron Lu,
	Subhra Mazumdar, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Thu, Sep 26, 2019 at 1:24 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On 9/24/19 7:40 PM, Aubrey Li wrote:
> > On Sat, Sep 7, 2019 at 2:30 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >> +static inline s64 core_sched_imbalance_delta(int src_cpu, int dst_cpu,
> >> +                       int src_sibling, int dst_sibling,
> >> +                       struct task_group *tg, u64 task_load)
> >> +{
> >> +       struct sched_entity *se, *se_sibling, *dst_se, *dst_se_sibling;
> >> +       s64 excess, deficit, old_mismatch, new_mismatch;
> >> +
> >> +       if (src_cpu == dst_cpu)
> >> +               return -1;
> >> +
> >> +       /* XXX SMT4 will require additional logic */
> >> +
> >> +       se = tg->se[src_cpu];
> >> +       se_sibling = tg->se[src_sibling];
> >> +
> >> +       excess = se->avg.load_avg - se_sibling->avg.load_avg;
> >> +       if (src_sibling == dst_cpu) {
> >> +               old_mismatch = abs(excess);
> >> +               new_mismatch = abs(excess - 2*task_load);
> >> +               return old_mismatch - new_mismatch;
> >> +       }
> >> +
> >> +       dst_se = tg->se[dst_cpu];
> >> +       dst_se_sibling = tg->se[dst_sibling];
> >> +       deficit = dst_se->avg.load_avg - dst_se_sibling->avg.load_avg;
> >> +
> >> +       old_mismatch = abs(excess) + abs(deficit);
> >> +       new_mismatch = abs(excess - (s64) task_load) +
> >> +                      abs(deficit + (s64) task_load);
> >
> > If I understood correctly, these formulas made an assumption that the task
> > being moved to the destination is matched the destination's core cookie.
>
> That's not the case.  We do not need to match the destination's core cookie,

I actually meant destination core's core cookie.

> as that may change after context switches. It needs to reduce the load mismatch
> with the destination CPU's sibling for that cgroup.

So the new_mismatch is not always correct, especially when there are
more cgroups and more core cookies on the system.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-12 12:35                                           ` Aaron Lu
  2019-09-12 17:29                                             ` Tim Chen
@ 2019-09-30 11:53                                             ` Vineeth Remanan Pillai
  2019-10-02 20:48                                               ` Vineeth Remanan Pillai
  1 sibling, 1 reply; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-09-30 11:53 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Aubrey Li, Subhra Mazumdar, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Thu, Sep 12, 2019 at 8:35 AM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
> >
> > I think comparing parent's runtime also will have issues once
> > the task group has a lot more threads with different running
> > patterns. One example is a task group with lot of active threads
> > and a thread with fairly less activity. So when this less active
> > thread is competing with a thread in another group, there is a
> > chance that it loses continuously for a while until the other
> > group catches up on its vruntime.
>
> I actually think this is expected behaviour.
>
> Without core scheduling, when deciding which task to run, we will first
> decide which "se" to run from the CPU's root level cfs runqueue and then
> go downwards. Let's call the chosen se on the root level cfs runqueue
> the winner se. Then with core scheduling, we will also need compare the
> two winner "se"s of each hyperthread and choose the core wide winner "se".
>
Sorry, I misunderstood the fix and did not initially see the core-wide
min_vruntime that you try to maintain in rq->core. This approach seems
reasonable. I think we can fix the potential starvation that you
mentioned in the comment by adjusting for the difference in all the
children cfs_rq when we set the min_vruntime in rq->core. Since we take
the lock for both queues, it should be doable, and I am trying to see
how best to do that.

> >
> > As discussed during LPC, probably start thinking along the lines
> > of global vruntime or core wide vruntime to fix the vruntime
> > comparison issue?
>
> core wide vruntime makes sense when there are multiple tasks of
> different cgroups queued on the same core. e.g. when there are two
> tasks of cgroupA and one task of cgroupB are queued on the same core,
> assume cgroupA's one task is on one hyperthread and its other task is on
> the other hyperthread with cgroupB's task. With my current
> implementation or Tim's, cgroupA will get more time than cgroupB. If we
> maintain core wide vruntime for cgroupA and cgroupB, we should be able
> to maintain fairness between cgroups on this core. Tim propose to solve
> this problem by doing some kind of load balancing if I'm not mistaken, I
> haven't taken a look at this yet.
I think your fix is close to maintaining a core-wide vruntime, as you
now have a single min_vruntime to compare across the siblings in the
core. To make the fix complete, we might need to adjust the whole tree's
min_vruntime, and I think it's doable.

Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-18 22:16                                                   ` Aubrey Li
@ 2019-09-30 14:36                                                     ` Vineeth Remanan Pillai
  0 siblings, 0 replies; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-09-30 14:36 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Tim Chen, Aaron Lu, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Subhra Mazumdar, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Wed, Sep 18, 2019 at 6:16 PM Aubrey Li <aubrey.intel@gmail.com> wrote:
>
> On Thu, Sep 19, 2019 at 4:41 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >
> > On 9/17/19 6:33 PM, Aubrey Li wrote:
> > > On Sun, Sep 15, 2019 at 10:14 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
> >
> > >>
> > >> And I have pushed Tim's branch to:
> > >> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> > >>
> > >> Mine:
> > >> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-core_vruntime
> >
> >
> > Aubrey,
> >
> > Thanks for testing with your set up.
> >
> > I think the test that's of interest is to see my load balancing added on top
> > of Aaron's fairness patch, instead of using my previous version of
> > forced idle approach in coresched-v3-v5.1.5-test-tim branch.
> >
>
> I'm trying to figure out a way to solve fairness only (not including
> task placement).
> So @Vineeth - if everyone is okay with Aaron's fairness patch, maybe
> we should have a v4?
>
Yes, I think we can move to v4 with Aaron's fairness fix and potentially
Tim's load balancing fixes. I am working on some improvements to Aaron's
fixes and shall post the changes after some testing. Basically, what I am
trying to do is to propagate the min_vruntime change down to all the cfs_rqs
and individual sched entities when we update cfs_rq(rq->core)->min_vruntime,
so we can make sure that the runqueues stay in sync and starvation does not
happen.

If everything goes well, we shall also post the v4 towards the end of this
week. We would be testing Tim's load balancing patches in an
over-committed VM scenario to observe the effect of the fix.

Thanks,
Vineeth


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-06 18:30                                   ` Tim Chen
  2019-09-11 14:02                                     ` Aaron Lu
  2019-09-25  2:40                                     ` Aubrey Li
@ 2019-09-30 15:22                                     ` Julien Desfossez
  2 siblings, 0 replies; 161+ messages in thread
From: Julien Desfossez @ 2019-09-30 15:22 UTC (permalink / raw)
  To: Tim Chen
  Cc: Dario Faggioli, Li, Aubrey, Aaron Lu, Aubrey Li, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

> I've made an attempt in the following two patches to address
> the load balancing of mismatched load between the siblings.
> 
> It is applied on top of Aaron's patches:
> - sched: Fix incorrect rq tagged as forced idle
> - wrapper for cfs_rq->min_vruntime
>   https://lore.kernel.org/lkml/20190725143127.GB992@aaronlu/
> - core vruntime comparison
>   https://lore.kernel.org/lkml/20190725143248.GC992@aaronlu/
> 
> I would love for Julien, Aaron and others to try it out.  Suggestions
> for tuning it are welcome.

Just letting you know that I will be testing your load balancing patches
this week along with the changes Vineeth is currently doing. I didn't
test them before because I was focused on single-threaded and pinned
micro-benchmarks, but I am back on scaling tests, so it will be
interesting to see.

Thanks,

Julien


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-30 11:53                                             ` Vineeth Remanan Pillai
@ 2019-10-02 20:48                                               ` Vineeth Remanan Pillai
  2019-10-10 13:54                                                 ` Aaron Lu
  0 siblings, 1 reply; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-02 20:48 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Aubrey Li, Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Mon, Sep 30, 2019 at 7:53 AM Vineeth Remanan Pillai
<vpillai@digitalocean.com> wrote:
>
> >
> Sorry, I misunderstood the fix and I did not initially see the core wide
> min_vruntime that you tried to maintain in rq->core. This approach
> seems reasonable. I think we can fix the potential starvation that you
> mentioned in the comment by adjusting for the difference in all the child
> cfs_rqs when we set the min_vruntime in rq->core. Since we take the lock for
> both the queues, it should be doable, and I am trying to see how we can best
> do that.
>
Attaching herewith the two patches I was working on in preparation for v4.

Patch 1 is an improvement of Aaron's patch 2, where I am propagating the
vruntime changes to the whole tree.
Patch 2 is an improvement of Aaron's patch 3, where we do resched_curr
only when the sibling is forced idle.

Micro-benchmarks look good. I will be doing a larger set of tests and hopefully
posting v4 by the end of the week. Please let me know what you think of these
patches (patch 1 goes on top of Aaron's patch 2; patch 2 replaces Aaron's patch 3).

Thanks,
Vineeth

[PATCH 1/2] sched/fair: propagate the min_vruntime change to the whole rq tree

When we adjust the min_vruntime of rq->core, we need to propagate
that down the tree so as to not cause starvation of existing tasks
based on previous vruntime.
---
 kernel/sched/fair.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59cb01a1563b..e8dd78a8c54d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -476,6 +476,23 @@ static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
                return cfs_rq->min_vruntime;
 }

+static void coresched_adjust_vruntime(struct cfs_rq *cfs_rq, u64 delta)
+{
+       struct sched_entity *se, *next;
+
+       if (!cfs_rq)
+               return;
+
+       cfs_rq->min_vruntime -= delta;
+       rbtree_postorder_for_each_entry_safe(se, next,
+                       &cfs_rq->tasks_timeline.rb_root, run_node) {
+               if (se->vruntime > delta)
+                       se->vruntime -= delta;
+               if (se->my_q)
+                       coresched_adjust_vruntime(se->my_q, delta);
+       }
+}
+
 static void update_core_cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
 {
        struct cfs_rq *cfs_rq_core;
@@ -487,8 +504,11 @@ static void update_core_cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
                return;

        cfs_rq_core = core_cfs_rq(cfs_rq);
-       cfs_rq_core->min_vruntime = max(cfs_rq_core->min_vruntime,
-                                       cfs_rq->min_vruntime);
+       if (cfs_rq_core != cfs_rq &&
+           cfs_rq->min_vruntime < cfs_rq_core->min_vruntime) {
+               u64 delta = cfs_rq_core->min_vruntime - cfs_rq->min_vruntime;
+               coresched_adjust_vruntime(cfs_rq_core, delta);
+       }
 }

 bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
--
2.17.1

[PATCH 2/2] sched/fair : Wake up forced idle siblings if needed

If a cpu has only one task and if it has used up its timeslice,
then we should try to wake up the sibling to give the forced idle
thread a chance.
We do that by triggering schedule which will IPI the sibling if
the task in the sibling wins the priority check.
---
 kernel/sched/fair.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8dd78a8c54d..ba4d929abae6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4165,6 +4165,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
                update_min_vruntime(cfs_rq);
 }

+static inline bool
+__entity_slice_used(struct sched_entity *se)
+{
+       return (se->sum_exec_runtime - se->prev_sum_exec_runtime) >
+               sched_slice(cfs_rq_of(se), se);
+}
+
 /*
  * Preempt the current task with a newly woken task if needed:
  */
@@ -10052,6 +10059,39 @@ static void rq_offline_fair(struct rq *rq)

 #endif /* CONFIG_SMP */

+#ifdef CONFIG_SCHED_CORE
+/*
+ * If runqueue has only one task which used up its slice and
+ * if the sibling is forced idle, then trigger schedule
+ * to give forced idle task a chance.
+ */
+static void resched_forceidle(struct rq *rq, struct sched_entity *se)
+{
+       int cpu = cpu_of(rq), sibling_cpu;
+       if (rq->cfs.nr_running > 1 || !__entity_slice_used(se))
+               return;
+
+       for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) {
+               struct rq *sibling_rq;
+               if (sibling_cpu == cpu)
+                       continue;
+               if (cpu_is_offline(sibling_cpu))
+                       continue;
+
+               sibling_rq = cpu_rq(sibling_cpu);
+               if (sibling_rq->core_forceidle) {
+                       resched_curr(rq);
+                       break;
+               }
+       }
+}
+#else
+static inline void resched_forceidle(struct rq *rq, struct sched_entity *se)
+{
+}
+#endif
+
+
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -10075,6 +10115,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

        update_misfit_status(curr, rq);
        update_overutilized_status(task_rq(curr));
+
+       if (sched_core_enabled(rq))
+               resched_forceidle(rq, &curr->se);
 }

 /*
--
2.17.1


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-02 20:48                                               ` Vineeth Remanan Pillai
@ 2019-10-10 13:54                                                 ` Aaron Lu
  2019-10-10 14:29                                                   ` Vineeth Remanan Pillai
  0 siblings, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-10-10 13:54 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Tim Chen, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Aubrey Li, Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Wed, Oct 02, 2019 at 04:48:14PM -0400, Vineeth Remanan Pillai wrote:
> On Mon, Sep 30, 2019 at 7:53 AM Vineeth Remanan Pillai
> <vpillai@digitalocean.com> wrote:
> >
> > >
> > Sorry, I misunderstood the fix and I did not initially see the core wide
> > min_vruntime that you tried to maintain in rq->core. This approach
> > seems reasonable. I think we can fix the potential starvation that you
> > mentioned in the comment by adjusting for the difference in all the child
> > cfs_rqs when we set the min_vruntime in rq->core. Since we take the lock for
> > both the queues, it should be doable, and I am trying to see how we can best
> > do that.
> >
> Attaching here with, the 2 patches I was working on in preparation of v4.
> 
> Patch 1 is an improvement of patch 2 of Aaron where I am propagating the
> vruntime changes to the whole tree.

I didn't see why we need to do this.

We only need to have the root level sched entities' vruntime become core
wide, since we will compare vruntime for them across hyperthreads. For
sched entities on sub-cfs_rqs, we never (at least, not now) compare their
vruntime outside their cfs_rqs.

Thanks,
Aaron

> Patch 2 is an improvement for patch 3 of Aaron where we do resched_curr
> only when the sibling is forced idle.


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-10 13:54                                                 ` Aaron Lu
@ 2019-10-10 14:29                                                   ` Vineeth Remanan Pillai
  2019-10-11  7:33                                                     ` Aaron Lu
  0 siblings, 1 reply; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-10 14:29 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Aubrey Li, Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

> I didn't see why we need do this.
>
> We only need to have the root level sched entities' vruntime become core
> wide since we will compare vruntime for them across hyperthreads. For
> sched entities on sub cfs_rqs, we never(at least, not now) compare their
> vruntime outside their cfs_rqs.
>
The reason we need to do this is that newly created tasks will have a
vruntime based on the new min_vruntime while old tasks have theirs based
on the old min_vruntime, and this can cause starvation depending on how
min_vruntime is set. With this new patch, we normalize the whole tree so
that new tasks and old tasks compare against the same min_vruntime.

Thanks,
Vineeth


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-10 14:29                                                   ` Vineeth Remanan Pillai
@ 2019-10-11  7:33                                                     ` Aaron Lu
  2019-10-11 11:32                                                       ` Vineeth Remanan Pillai
  0 siblings, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-10-11  7:33 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Tim Chen, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Aubrey Li, Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Thu, Oct 10, 2019 at 10:29:47AM -0400, Vineeth Remanan Pillai wrote:
> > I didn't see why we need do this.
> >
> > We only need to have the root level sched entities' vruntime become core
> > wide since we will compare vruntime for them across hyperthreads. For
> > sched entities on sub cfs_rqs, we never(at least, not now) compare their
> > vruntime outside their cfs_rqs.
> >
> The reason we need to do this is because, new tasks that gets created will
> have a vruntime based on the new min_vruntime and old tasks will have it
> based on the old min_vruntime

I think this is expected behaviour.

> and it can cause starvation based on how
> you set the min_vruntime.

Care to elaborate on the starvation problem?

> With this new patch, we normalize the whole
> tree so that new tasks and old tasks compare with the same min_vruntime.

Again, what's the point of normalizing sched entities' vruntime in
sub-cfs_rqs? Their vruntime comparisons only happen inside their own
cfs_rq; we don't do cross-CPU vruntime comparison for them.


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-11  7:33                                                     ` Aaron Lu
@ 2019-10-11 11:32                                                       ` Vineeth Remanan Pillai
  2019-10-11 12:01                                                         ` Aaron Lu
  0 siblings, 1 reply; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-11 11:32 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Aubrey Li, Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

> > The reason we need to do this is because, new tasks that gets created will
> > have a vruntime based on the new min_vruntime and old tasks will have it
> > based on the old min_vruntime
>
> I think this is expected behaviour.
>
I don't think this is the expected behaviour. If we hadn't changed the root
cfs->min_vruntime for the core rq, then it would have been the expected
behaviour. But now, we are updating the core rq's root cfs min_vruntime
without propagating that change down the tree. To explain, consider this
example based on your patch. Let cpu 1 and 2 be siblings, and let rq(cpu1)
be the core rq. Let rq1->cfs->min_vruntime=1000 and rq2->cfs->min_vruntime=2000.
In update_core_cfs_rq_min_vruntime(), you update rq1->cfs->min_vruntime
to 2000 because that is the max. So new tasks enqueued on rq1 start with a
vruntime of 2000 while the tasks already in that runqueue are still based on
the old min_vruntime (1000). The new tasks therefore get enqueued somewhere
to the right of the tree and have to wait until the already existing tasks
catch up to a vruntime of 2000. This is what I meant by starvation. It happens
whenever we update the core rq's cfs->min_vruntime. Hope this clarifies.
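The arithmetic of this example can be captured in a small standalone sketch
(plain C, not kernel code; the helper names and the simplified "a fresh task
starts at min_vruntime" placement rule are illustrative assumptions, not the
actual CFS implementation):

```c
#include <assert.h>

typedef unsigned long long u64;

/*
 * Toy model of the scenario above. update_core_cfs_rq_min_vruntime()
 * in the patch under discussion takes the max of the two siblings'
 * min_vruntime for the core runqueue.
 */
static u64 core_min_vruntime(u64 rq1_min, u64 rq2_min)
{
	return rq1_min > rq2_min ? rq1_min : rq2_min;
}

/* Simplified CFS placement rule: a fresh task starts at min_vruntime. */
static u64 place_new_task(u64 min_vruntime)
{
	return min_vruntime;
}

/*
 * Vruntime lag a newly enqueued task has behind tasks that were queued
 * before the bump (those still carry vruntimes near the old minimum).
 */
static u64 starvation_lag(u64 old_rq1_min, u64 rq2_min)
{
	u64 bumped = core_min_vruntime(old_rq1_min, rq2_min);

	return place_new_task(bumped) - old_rq1_min;
}
```

With rq1 at 1000 and rq2 at 2000 as in the example, a task enqueued on rq1
after the update starts 1000 vruntime units to the right of the tasks already
queued there, and has to wait for them to catch up.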

> > and it can cause starvation based on how
> > you set the min_vruntime.
>
> Care to elaborate the starvation problem?

Explained above.

> Again, what's the point of normalizing sched entities' vruntime in
> sub-cfs_rqs? Their vruntime comparisons only happen inside their own
> cfs_rq, we don't do cross CPU vruntime comparison for them.

As I mentioned above, this is to avoid the starvation case. Even though we are
not doing cross-cfs_rq comparison, the whole tree's vruntime is based on the
root cfs->min_vruntime, and we will have an imbalance if we change the root
cfs->min_vruntime without updating down the tree.

Thanks,
Vineeth


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-11 11:32                                                       ` Vineeth Remanan Pillai
@ 2019-10-11 12:01                                                         ` Aaron Lu
  2019-10-11 12:10                                                           ` Vineeth Remanan Pillai
  0 siblings, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-10-11 12:01 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Tim Chen, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Aubrey Li, Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Fri, Oct 11, 2019 at 07:32:48AM -0400, Vineeth Remanan Pillai wrote:
> > > The reason we need to do this is because, new tasks that gets created will
> > > have a vruntime based on the new min_vruntime and old tasks will have it
> > > based on the old min_vruntime
> >
> > I think this is expected behaviour.
> >
> I don't think this is the expected behaviour. If we hadn't changed the root
> cfs->min_vruntime for the core rq, then it would have been the expected
> behaviour. But now, we are updating the core rq's root cfs min_vruntime
> without propagating that change down the tree. To explain, consider this
> example based on your patch. Let cpu 1 and 2 be siblings, and let rq(cpu1)
> be the core rq. Let rq1->cfs->min_vruntime=1000 and rq2->cfs->min_vruntime=2000.
> In update_core_cfs_rq_min_vruntime(), you update rq1->cfs->min_vruntime
> to 2000 because that is the max. So new tasks enqueued on rq1 start with a
> vruntime of 2000 while the tasks already in that runqueue are still based on
> the old min_vruntime (1000). The new tasks therefore get enqueued somewhere
> to the right of the tree and have to wait until the already existing tasks
> catch up to a vruntime of 2000. This is what I meant by starvation. It happens
> whenever we update the core rq's cfs->min_vruntime. Hope this clarifies.

Thanks for the clarification.

Yes, this is the initialization issue I mentioned before: when core
scheduling is initially enabled, rq1's vruntime is bumped the first time
update_core_cfs_rq_min_vruntime() is called, and if there are already
some tasks queued, new tasks queued on rq1 will be starved to some extent.

Agree that this needs a fix, but we shouldn't need to do this afterwards.

So do I understand correctly that patch 1 is meant to solve the
initialization issue?


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-11 12:01                                                         ` Aaron Lu
@ 2019-10-11 12:10                                                           ` Vineeth Remanan Pillai
  2019-10-12  3:55                                                             ` Aaron Lu
  0 siblings, 1 reply; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-11 12:10 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Aubrey Li, Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

> Thanks for the clarification.
>
> Yes, this is the initialization issue I mentioned before when core
> scheduling is initially enabled. rq1's vruntime is bumped the first time
> update_core_cfs_rq_min_vruntime() is called and if there are already
> some tasks queued, new tasks queued on rq1 will be starved to some extent.
>
> Agree that this needs fix. But we shouldn't need do this afterwards.
>
> So do I understand correctly that patch1 is meant to solve the
> initialization issue?

I think we need this update logic even after initialization. I mean, the core
runqueue's min_vruntime can get updated every time it changes with respect
to the sibling's min_vruntime. So, whenever this update happens, we would
need to propagate the changes down the tree, right? Please let me know if
I am visualizing it wrong.

Thanks,
Vineeth


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-11 12:10                                                           ` Vineeth Remanan Pillai
@ 2019-10-12  3:55                                                             ` Aaron Lu
  2019-10-13 12:44                                                               ` Vineeth Remanan Pillai
  0 siblings, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-10-12  3:55 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Tim Chen, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Aubrey Li, Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Fri, Oct 11, 2019 at 08:10:30AM -0400, Vineeth Remanan Pillai wrote:
> > Thanks for the clarification.
> >
> > Yes, this is the initialization issue I mentioned before when core
> > scheduling is initially enabled. rq1's vruntime is bumped the first time
> > update_core_cfs_rq_min_vruntime() is called and if there are already
> > some tasks queued, new tasks queued on rq1 will be starved to some extent.
> >
> > Agree that this needs fix. But we shouldn't need do this afterwards.
> >
> > So do I understand correctly that patch1 is meant to solve the
> > initialization issue?
> 
> I think we need this update logic even after initialization. I mean, core
> runqueue's min_vruntime can get updated every time when the core
> runqueue's min_vruntime changes with respect to the sibling's min_vruntime.
> So, whenever this update happens, we would need to propagate the changes
> down the tree right? Please let me know if I am visualizing it wrong.

I don't think we need to do the normalization afterwards, and it appears
we are on the same page regarding core wide vruntime.

The intent of my patch is to treat all the root level sched entities of
the two siblings as if they are in a single cfs_rq of the core. With a
core wide min_vruntime, the core scheduler can decide which sched entity
to run next. And an individual sched entity's vruntime shouldn't be
changed based on a change of the core wide min_vruntime, or fairness can
be hurt (if we add or reduce the vruntime of a sched entity, its credit
will change).

The weird thing about my patch is that min_vruntime is often increased;
it doesn't point to the smallest value as in a traditional cfs_rq. This
probably can be changed to follow the tradition; I don't quite remember
why I did this and will need to check it some time later.

All those sub-cfs_rq sched entities are not interesting. Because once
we have decided which sched entity in the root level cfs_rq should run next,
we can then pick the final next task from there (in the usual way). In
other words, to make the scheduler choose the correct candidate for the core,
we only need to worry about sched entities on both CPUs' root level cfs_rqs.

Does this make sense?


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-12  3:55                                                             ` Aaron Lu
@ 2019-10-13 12:44                                                               ` Vineeth Remanan Pillai
  2019-10-14  9:57                                                                 ` Aaron Lu
  0 siblings, 1 reply; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-13 12:44 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Aubrey Li, Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Fri, Oct 11, 2019 at 11:55 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:

>
> I don't think we need to do the normalization afterwards, and it appears
> we are on the same page regarding core wide vruntime.
>
> The intent of my patch is to treat all the root level sched entities of
> the two siblings as if they are in a single cfs_rq of the core. With a
> core wide min_vruntime, the core scheduler can decide which sched entity
> to run next. And an individual sched entity's vruntime shouldn't be
> changed based on a change of the core wide min_vruntime, or fairness can
> be hurt (if we add or reduce the vruntime of a sched entity, its credit
> will change).
>
Ok, I think I get it now. I see that your first patch actually wraps all
the places where min_vruntime is accessed. So yes, the tree vruntime update
is needed only one time. From then on, since we use the wrapper
cfs_rq_min_vruntime(), both runqueues would self-adjust based on the core
wide min_vruntime. Also, by virtue of min_vruntime staying minimal from
there on, the tree update logic will not be called more than once. So I
think the changes are safe.
I will do some profiling to make sure that it is actually called only once.

> The weird thing about my patch is that min_vruntime is often increased;
> it doesn't point to the smallest value as in a traditional cfs_rq. This
> probably can be changed to follow the tradition; I don't quite remember
> why I did this and will need to check it some time later.

Yeah, I noticed this. In my patch, I had already accounted for this and
changed to min() instead of max(), which is more logical in that
min_vruntime should be the minimum across both runqueues.

> All those sub cfs_rq's sched entities are not interesting. Because once
> we decided which sched entity in the root level cfs_rq should run next,
> we can then pick the final next task from there(using the usual way). In
> other words, to make scheduler choose the correct candidate for the core,
> we only need worry about sched entities on both CPU's root level cfs_rqs.
>
Understood. The only reason I did the normalization was to get both
runqueues under one min_vruntime always. And as long as we use
cfs_rq_min_vruntime() from then on, we wouldn't be calling the balancing
logic any more.

> Does this make sense?

Sure, thanks for the clarification.

Thanks,
Vineeth


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-13 12:44                                                               ` Vineeth Remanan Pillai
@ 2019-10-14  9:57                                                                 ` Aaron Lu
  2019-10-21 12:30                                                                   ` Vineeth Remanan Pillai
  0 siblings, 1 reply; 161+ messages in thread
From: Aaron Lu @ 2019-10-14  9:57 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Tim Chen, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Aubrey Li, Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Sun, Oct 13, 2019 at 08:44:32AM -0400, Vineeth Remanan Pillai wrote:
> On Fri, Oct 11, 2019 at 11:55 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
> 
> >
> > I don't think we need to do the normalization afterwards, and it appears
> > we are on the same page regarding core wide vruntime.

Should be "we are not on the same page..."

[...]
> > The weird thing about my patch is that min_vruntime is often increased;
> > it doesn't point to the smallest value as in a traditional cfs_rq. This
> > probably can be changed to follow the tradition; I don't quite remember
> > why I did this and will need to check it some time later.
> 
> Yeah, I noticed this. In my patch, I had already accounted for this and changed
> to min() instead of max() which is more logical that min_vruntime should be the
> minimum of both the run queue.

I've now remembered why I used max().

Assume rq1 and rq2's min_vruntime are both at 2000 and the core wide
min_vruntime is also 2000. Also assume both runqueues are empty at the
moment. Then task t1 is queued to rq1 and runs for a long time while rq2
keeps empty. rq1's min_vruntime will be incremented all the time while
the core wide min_vruntime stays at 2000 if min() is used. Then when
another task gets queued to rq2, it will get really large unfair boost
by using a much smaller min_vruntime as its base.

To fix this, either max() is used, as is done in my patch, or rq2's
min_vruntime is adjusted to be the same as rq1's on each
update_core_cfs_min_vruntime() when rq2 is found empty, and then min()
is used to get the core wide min_vruntime. It doesn't look worth the
trouble to use min().
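The busy-sibling/idle-sibling scenario Aaron describes can be sketched the
same way (illustrative C, not kernel code; `idle_sibling_boost` and the
`use_max` flag are hypothetical names I am using to contrast the two
policies discussed):

```c
#include <assert.h>

typedef unsigned long long u64;

/*
 * How far behind the busy sibling's tasks a task enqueued on the
 * long-idle sibling starts. rq1 has been running and its min_vruntime
 * has advanced; rq2 has stayed empty at its stale old value. A new
 * task on rq2 is placed at the core wide min_vruntime.
 */
static u64 idle_sibling_boost(u64 rq1_min, u64 stale_rq2_min, int use_max)
{
	u64 core_min;

	if (use_max)
		core_min = rq1_min > stale_rq2_min ? rq1_min : stale_rq2_min;
	else
		core_min = rq1_min < stale_rq2_min ? rq1_min : stale_rq2_min;

	/* A smaller starting vruntime means an unfairly large CPU share. */
	return rq1_min - core_min;
}
```

If rq1 has advanced to 5000 while the empty rq2 is stuck at 2000, min()
hands a new rq2 task a 3000-unit head start, while max() gives none; that is
the unfair boost being described.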


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-14  9:57                                                                 ` Aaron Lu
@ 2019-10-21 12:30                                                                   ` Vineeth Remanan Pillai
  0 siblings, 0 replies; 161+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-21 12:30 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Julien Desfossez, Dario Faggioli, Li, Aubrey,
	Aubrey Li, Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Mon, Oct 14, 2019 at 5:57 AM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
>
> I now remembered why I used max().
>
> Assume rq1 and rq2's min_vruntime are both at 2000 and the core wide
> min_vruntime is also 2000. Also assume both runqueues are empty at the
> moment. Then task t1 is queued to rq1 and runs for a long time while rq2
> keeps empty. rq1's min_vruntime will be incremented all the time while
> the core wide min_vruntime stays at 2000 if min() is used. Then when
> another task gets queued to rq2, it will get really large unfair boost
> by using a much smaller min_vruntime as its base.
>
> To fix this, either max() is used as is done in my patch, or adjust
> rq2's min_vruntime to be the same as rq1's on each
> update_core_cfs_min_vruntime() when rq2 is found empty and then use
> min() to get the core wide min_vruntime. Looks not worth the trouble to
> use min().

Understood. I think this case is a special case where one runqueue is empty
and hence the min_vruntime of the core should match the progressing vruntime
of the active runqueue. If we use max as the core wide min_vruntime, I think
we may hit starvation elsewhere. One quick example I can think of is during
force idle. When a sibling is forced idle and a new task gets enqueued
in the force-idled runqueue, it would inherit the max vruntime and would starve
until the other tasks on the forced idle sibling catch up. While this might
be okay, we would be deviating from the concept that all new tasks inherit the
min_vruntime of the cpu (core in our case). I have not tested deeply enough to see
if there are any assumptions which may fail if we use max.

The modified patch actually takes care of syncing the min_vruntime across
the siblings so that the core wide min_vruntime and the per cpu min_vruntime
always stay in sync.

Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-15 14:14                                             ` Aaron Lu
  2019-09-18  1:33                                               ` Aubrey Li
@ 2019-10-29  9:11                                               ` Dario Faggioli
  2019-10-29  9:15                                                 ` Dario Faggioli
                                                                   ` (6 more replies)
  1 sibling, 7 replies; 161+ messages in thread
From: Dario Faggioli @ 2019-10-29  9:11 UTC (permalink / raw)
  To: Aaron Lu, Aubrey Li
  Cc: Tim Chen, Julien Desfossez, Li, Aubrey, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Dario Faggioli

[-- Attachment #1: Type: text/plain, Size: 18419 bytes --]

On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> I'm using the following branch as base which is v5.1.5 based:
> https://github.com/digitalocean/linux-coresched coresched-v3-v5.1.5-
> test
> 
> And I have pushed Tim's branch to:
> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> 
> Mine:
> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> core_vruntime
> 
Hello,

As anticipated, I've been trying to follow the development of this
feature and, in the meantime, I have done some benchmarks.

I actually have a lot of data (and am planning for more), so I am
sending a few emails, each one with a subset of the numbers in it,
instead of just one, which would be beyond giant! :-)

I'll put, in this first one, some background and some common
information, e.g., about the benchmarking platform and configurations,
and on how to read and interpret the data that will follow.

It's quite hard to come up with a concise summary, and sometimes it's
even tricky to identify consolidated trends. There are also things that
look weird and, although I double checked my methodology, I can't
exclude that glitches or errors may have occurred. For each of the
benchmarks, I have at least some information about what the
configuration was when it was run, and also some monitoring and perf
data. So, if interested, just ask and we'll see what we can dig out.

And in any case, I have the procedure for running these benchmarks
fairly decently (although not completely) automated. So if we see
things that look really weird, I can rerun (perhaps with a
different configuration, more monitoring, etc).

For each benchmark, I'll "dump" the results, with just some comments
about the things that I find more relevant/interesting. Then, if we
want, we can look at them and analyze them together.
For each experiment, I do have some limited amount of tracing and
debugging information still available, in case it could be useful. And,
as said, I can always rerun.

I can also provide, quite easily, different-looking tables: e.g.,
different sets of columns, different baselines, etc. Just ask for what
you think would be the most interesting to see and, most likely, it
will be possible to produce it.

Oh, and I'll upload text files whose contents will be identical to the
emails in this space:

  http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/

In case tables are rendered better in a browser than in a MUA.

Thanks and Regards,
Dario
---

Code: 
 1) Linux 5.1.5 (commit id 835365932f0dc25468840753e071c05ad6abc76f)
 2) https://github.com/digitalocean/linux-coresched/tree/vpillai/coresched-v3-v5.1.5-test
 3) https://github.com/aaronlu/linux/tree/coresched-v3-v5.1.5-test-core_vruntime
 4) https://github.com/aaronlu/linux/tree/coresched-v3-v5.1.5-test-tim

Benchmarking suite:
 - MMTests: https://github.com/gormanm/mmtests
 - Tweaked to deal with running benchmarks in VMs. Still working on
   upstreaming that to Mel (a WIP is available here:
   https://github.com/dfaggioli/mmtests/tree/bench-virt )

Benchmarking host:
 - CPU: 1 socket, 4 cores, 2 threads
 - RAM: 32 GB
 - distro: openSUSE Tumbleweed
 - HW Bugs Mitigations: fully disabled
 - Filesys: XFS

VMs:
 - vCPUs: either 8 or 4
 - distro: openSUSE Tumbleweed
 - Kernel: 5.1.16
 - HW Bugs Mitigations: fully disabled
 - Filesys: XFS

Benchmarks:
- STREAM         : pure memory benchmark (various kind of mem-ops done
                   in parallel). Parallelism is NR_CPUS/2 tasks
- Kernbench      : builds a kernel, with varying number of compile
                   jobs. HT is, in general, known to help, as it lets
                   us do "more parallel" builds
- Hackbench      : communication (via pipes, in this case) between
                   group of processes. As we deal with _groups_ of
                   tasks, we're already in saturation with 1 group,
                   hence we expect HyperThreading disabled
                   configurations to suffer
- mutilate       : load generator for memcached, with high request
                   rate;
- netperf-unix   : two communicating tasks. Without any pinning
                   (neither at the host nor at the guest level), we
                   expect HT to play a role. In fact, depending on
                   where the two tasks are scheduled (i.e., whether on
                   two threads of the same core, or not) performance may
                   vary
- sysbenchcpu    : the process-based CPU stressing workload of sysbench
- sysbenchthread : the thread-based CPU stressing workload of sysbench
- sysbench       : the database workload

This is kind of a legend for the columns you will see in the tables.

- v-*   : vanilla, i.e., benchmarks were run on code _without_ any
          core-scheduling patch applied (see 1 in 'Code' section above)
- *BM-* : baremetal, i.e., benchmarks were run on the host, without 
          any VM running or anything
- *VM-* : Virtual Machine, i.e., benchmarks were run inside a VM, with
          the following characteristics:
   - *VM-   : benchmarks were run in a VM with 8 vCPUs. That was the
              only VM running in the system
   - *VM-v4 : benchmarks were run in a VM with 4 vCPUs. That was the
              only VM running in the system
   - *VMx2  : benchmarks were run in a VM with 8 vCPUs, and there was
               another VM running, also with 8 vCPUs, generating about
               600% of CPU, memory and IO stress load
- *-csc-*          : benchmarks were run with Core scheduling v3 patch
                     series (see 2 in 'Code' section above)
- *-csc_stallfix-* : benchmarks were run with Core scheduling v3 and
                     the 'stallfix' feature enabled
- *-csc_vruntime-* : benchmarks were run with Core scheduling v3 + the
                     vruntime patches (see 3 in 'Code' section above)
- *-csc_tim-*      : benchmarks were run with Core scheduling v3 +
                     Tim's patches (see 4 in 'Code' section above)
- *-noHT           : benchmarks were run with HyperThreading Disabled
- *-HT             : benchmarks were run with Hyperthreading enabled

So, for instance, the column BM-noHT shows data from a run done on
baremetal, with HyperThreading disabled. The column v-VM-HT shows data
from a run done in an 8 vCPUs VM, with HyperThreading enabled, and no
core-scheduling patches applied. The column VM-csc_vruntime-HT shows
data from a run done in an 8 vCPUs VM with core-scheduling v3 patches +
the vruntime patches applied. The column VM-v4-HT shows data from a run
done in a 4 vCPUs VM, where core-scheduling patches were applied but not
used (the vCPUs of the VM weren't tagged). The column VMx2-csc_vruntime-HT
shows data from a run done in an 8 vCPUs VM, with core-scheduling v3 + the
vruntime patches applied and the vCPUs of the VM tagged, while there was
another (untagged) VM in the system, trying to introduce ~600% load
(CPU, memory and IO, via stress-ng). Etc.

See the 'Appendix' at the bottom of this email, for a comprehensive
list of all the combinations (or, at least I think is comprehensive...
I hope I haven't missed any :-) ).

In all tables, percent increases and decreases are always relative to
the first column. Whether lower or higher values are better is already
taken care of: when we see -x.yz%, it always means performance is worse
than the baseline, and the absolute value of that (i.e., x.yz) tells
you by how much.

If, for instance, we want to compare HT and non HT, on baremetal, we
check the BM-HT and BM-noHT columns.
If we want to compare v3 + vruntime patches against no HyperThreading,
when the system is overloaded, we look at VMx2-noHT and VMx2-
csc_vruntime-HT columns and check by how much they deviate from the
baseline (i.e., which one regresses more). For comparing the various
core scheduling solutions, we can check by how much each one is either
better or worse than baseline. And so on...

The most relevant comparisons, IMO, are:
- the various core scheduling solutions against their respective HT
baseline. This, in fact, tells us what people will experience if they
start using core scheduling on these workloads
- the various core scheduling solutions against their respective noHT
baseline. This, in fact, tells us whether or not core scheduling is
effective, for the given workload, or if it would just be better to
disable HyperThreading
- the overhead introduced by the core scheduling patches, when they are
not used (i.e., v-BM-HT against BM-HT, or v-VM-HT against VM-HT). This,
in fact, tells us what happens to *everyone*, including the ones that
do not want core scheduling and will keep it disabled, if we merge it

Note that the overhead, so far, has been evaluated only for the -csc
case, i.e., when patches from point 2 in 'Code' above are applied, but
tasks/vCPUs are not tagged, and hence core scheduling is not really
used.

Anyway, let's get to the point where I give you some data already! :-D
:-D

STREAM
======

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-1_stream.txt

                                  v                      v                     BM                     BM                     BM                     BM                     BM                     BM
                              BM-HT                BM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
MB/sec copy     33827.50 (   0.00%)    33654.32 (  -0.51%)    33683.34 (  -0.43%)    33819.30 (  -0.02%)    33830.88 (   0.01%)    33731.02 (  -0.29%)    33573.76 (  -0.75%)    33292.76 (  -1.58%)
MB/sec scale    22762.02 (   0.00%)    22524.00 (  -1.05%)    22416.54 (  -1.52%)    22444.16 (  -1.40%)    22652.56 (  -0.48%)    22462.80 (  -1.31%)    22461.90 (  -1.32%)    22670.84 (  -0.40%)
MB/sec add      26141.76 (   0.00%)    26241.42 (   0.38%)    26559.40 (   1.60%)    26365.36 (   0.86%)    26607.10 (   1.78%)    26384.50 (   0.93%)    26117.78 (  -0.09%)    26192.12 (   0.19%)
MB/sec triad    26522.46 (   0.00%)    26555.26 (   0.12%)    26499.62 (  -0.09%)    26373.26 (  -0.56%)    26667.32 (   0.55%)    26642.70 (   0.45%)    26505.38 (  -0.06%)    26409.60 (  -0.43%)
                                  v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                              VM-HT                VM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
MB/sec copy     34559.32 (   0.00%)    34153.30 (  -1.17%)    34236.64 (  -0.93%)    33724.38 (  -2.42%)    33535.60 (  -2.96%)    33534.10 (  -2.97%)    33469.70 (  -3.15%)    33873.18 (  -1.99%)
MB/sec scale    22556.18 (   0.00%)    22834.88 (   1.24%)    22733.12 (   0.78%)    23010.46 (   2.01%)    22480.60 (  -0.34%)    22552.94 (  -0.01%)    22756.50 (   0.89%)    22434.96 (  -0.54%)
MB/sec add      26209.70 (   0.00%)    26640.08 (   1.64%)    26692.54 (   1.84%)    26747.40 (   2.05%)    26358.20 (   0.57%)    26353.50 (   0.55%)    26686.62 (   1.82%)    26256.50 (   0.18%)
MB/sec triad    26521.80 (   0.00%)    26490.26 (  -0.12%)    26598.66 (   0.29%)    26466.30 (  -0.21%)    26560.48 (   0.15%)    26496.30 (  -0.10%)    26609.10 (   0.33%)    26450.68 (  -0.27%)
                                  v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                           VM-v4-HT             VM-v4-noHT                  v4-HT                v4-noHT              v4-csc-HT     v4-csc_stallfix-HT          v4-csc_tim-HT     v4-csc_vruntime-HT
MB/sec copy     32257.48 (   0.00%)    32504.18 (   0.76%)    32375.66 (   0.37%)    32261.98 (   0.01%)    31940.84 (  -0.98%)    32070.88 (  -0.58%)    31926.80 (  -1.03%)    31882.18 (  -1.16%)
MB/sec scale    19806.46 (   0.00%)    20281.18 (   2.40%)    20266.80 (   2.32%)    20075.46 (   1.36%)    19847.66 (   0.21%)    20119.00 (   1.58%)    19899.84 (   0.47%)    20060.48 (   1.28%)
MB/sec add      22178.58 (   0.00%)    22426.92 (   1.12%)    22185.54 (   0.03%)    22153.52 (  -0.11%)    21975.80 (  -0.91%)    22097.72 (  -0.36%)    21827.66 (  -1.58%)    22068.04 (  -0.50%)
MB/sec triad    22149.10 (   0.00%)    22200.54 (   0.23%)    22142.10 (  -0.03%)    21933.04 (  -0.98%)    21898.50 (  -1.13%)    22160.64 (   0.05%)    22003.40 (  -0.66%)    21951.16 (  -0.89%)
                                  v                      v                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2
                            VMx2-HT              VMx2-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
MB/sec copy     33514.96 (   0.00%)    24740.70 ( -26.18%)    30410.96 (  -9.26%)    22157.24 ( -33.89%)    29552.60 ( -11.82%)    29374.78 ( -12.35%)    28717.38 ( -14.31%)    29143.88 ( -13.04%)
MB/sec scale    22605.74 (   0.00%)    15473.56 ( -31.55%)    19051.76 ( -15.72%)    15278.64 ( -32.41%)    19246.98 ( -14.86%)    19081.04 ( -15.59%)    18747.60 ( -17.07%)    18776.02 ( -16.94%)
MB/sec add      26249.56 (   0.00%)    18559.92 ( -29.29%)    21143.90 ( -19.45%)    18664.30 ( -28.90%)    21236.00 ( -19.10%)    21067.40 ( -19.74%)    20878.78 ( -20.46%)    21266.92 ( -18.98%)
MB/sec triad    26290.16 (   0.00%)    19274.10 ( -26.69%)    20573.62 ( -21.74%)    17631.52 ( -32.93%)    21066.94 ( -19.87%)    20975.04 ( -20.22%)    20944.56 ( -20.33%)    20942.18 ( -20.34%)

So STREAM, at least in this configuration, is not (as could have been
expected) really sensitive to HyperThreading. In fact, in most
cases, both when run on baremetal and in VMs, HT and noHT results are
pretty much the same. When core scheduling is used, things do not
look bad at all to me, although results are, most of the time, only
marginally worse.

Do check, however, the overloaded case. There, disabling HT has quite a
big impact, and core scheduling does a rather good job in restoring
good performance.

From the overhead point of view, the situation does not look too bad
either. In fact, in the first three groups of measurements, the overhead
introduced by having the core scheduling patches in is acceptable (there
are actually cases where they seem to do more good than harm! :-P).
However, when the system is overloaded, despite there not being any
tagged task, numbers look pretty bad. It seems that, for instance, of
the 13.04% performance drop between v-VMx2-HT and VMx2-csc_vruntime-HT,
9.26% comes from overhead (as that's already there in VMx2-HT)!!

Something to investigate better, I guess...


Appendix

* v-BM-HT      : no coresched patch applied, baremetal, HyperThreading enabled
* v-BM-noHT    : no coresched patch applied, baremetal, Hyperthreading disabled
* v-VM-HT      : no coresched patch applied, 8 vCPUs VM, HyperThreading enabled
* v-VM-noHT    : no coresched patch applied, 8 vCPUs VM, Hyperthreading disabled
* v-VM-v4-HT   : no coresched patch applied, 4 vCPUs VM, HyperThreading enabled
* v-VM-v4-noHT : no coresched patch applied, 4 vCPUs VM, Hyperthreading disabled
* v-VMx2-HT    : no coresched patch applied, 8 vCPUs VM + 600% stress overhead, HyperThreading enabled
* v-VMx2-noHT  : no coresched patch applied, 8 vCPUs VM + 600% stress overhead, Hyperthreading disabled

* BM-HT              : baremetal, HyperThreading enabled
* BM-noHT            : baremetal, Hyperthreading disabled
* BM-csc-HT          : baremetal, coresched-v3 (Hyperthreading enabled, of course)
* BM-csc_stallfix-HT : baremetal, coresched-v3 + stallfix (Hyperthreading enabled, of course)
* BM-csc_tim-HT      : baremetal, coresched-v3 + Tim's patches (Hyperthreading enabled, of course)
* BM-csc_vruntime-HT : baremetal, coresched-v3 + vruntime patches (Hyperthreading enabled, of course)

* VM-HT              : 8 vCPUs VM, HyperThreading enabled
* VM-noHT            : 8 vCPUs VM, Hyperthreading disabled
* VM-csc-HT          : 8 vCPUs VM, coresched-v3 (Hyperthreading enabled, of course)
* VM-csc_stallfix-HT : 8 vCPUs VM, coresched-v3 + stallfix (Hyperthreading enabled, of course)
* VM-csc_tim-HT      : 8 vCPUs VM, coresched-v3 + Tim's patches (Hyperthreading enabled, of course)
* VM-csc_vruntime-HT : 8 vCPUs VM, coresched-v3 + vruntime patches (Hyperthreading enabled, of course)

* VM-v4-HT              : 4 vCPUs VM, HyperThreading enabled
* VM-v4-noHT            : 4 vCPUs VM, Hyperthreading disabled
* VM-v4-csc-HT          : 4 vCPUs VM, coresched-v3 (Hyperthreading enabled, of course)
* VM-v4-csc_stallfix-HT : 4 vCPUs VM, coresched-v3 + stallfix (Hyperthreading enabled, of course)
* VM-v4-csc_tim-HT      : 4 vCPUs VM, coresched-v3 + Tim's patches (Hyperthreading enabled, of course)
* VM-v4-csc_vruntime-HT : 4 vCPUs VM, coresched-v3 + vruntime patches (Hyperthreading enabled, of course)

* VMx2-HT              : 8 vCPUs VM + 600% stress overhead, HyperThreading enabled
* VMx2-noHT            : 8 vCPUs VM + 600% stress overhead, Hyperthreading disabled
* VMx2-csc-HT          : 8 vCPUs VM + 600% stress overhead, coresched-v3 (Hyperthreading enabled, of course)
* VMx2-csc_stallfix-HT : 8 vCPUs VM + 600% stress overhead, coresched-v3 + stallfix (Hyperthreading enabled, of course)
* VMx2-csc_tim-HT      : 8 vCPUs VM + 600% stress overhead, coresched-v3 + Tim's patches (Hyperthreading enabled, of course)
* VMx2-csc_vruntime-HT : 8 vCPUs VM + 600% stress overhead,
                        coresched-v3 + vruntime patches (Hyperthreading enabled, of course)

-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-29  9:11                                               ` Dario Faggioli
@ 2019-10-29  9:15                                                 ` Dario Faggioli
  2019-10-29  9:16                                                 ` Dario Faggioli
                                                                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 161+ messages in thread
From: Dario Faggioli @ 2019-10-29  9:15 UTC (permalink / raw)
  To: Aaron Lu, Aubrey Li
  Cc: Tim Chen, Julien Desfossez, Li, Aubrey, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 17098 bytes --]

On Tue, 2019-10-29 at 10:11 +0100, Dario Faggioli wrote:
> On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> > I'm using the following branch as base which is v5.1.5 based:
> > https://github.com/digitalocean/linux-coresched coresched-v3-
> > v5.1.5-
> > test
> > 
> > And I have pushed Tim's branch to:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> > 
> > Mine:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> > core_vruntime
> > 
> Hello,
> 
> As anticipated, I've been trying to follow the development of this
> feature and, in the meantime, I have done some benchmarks.
> 
> I actually have a lot of data (and am planning for more), so I am
> sending a few emails, each one with a subset of the numbers in it,
> instead than just one which would be beyond giant! :-)
> 
HACKBENCH-PROCESS-PIPES
=======================

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-2_hackbench.txt

                                  v                      v                     BM                     BM                     BM                     BM                     BM                     BM
                              BM-HT                BM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Amean     1       0.4372 (   0.00%)      0.6714 * -53.57%*      0.4242 (   2.97%)      0.6654 * -52.20%*      1.1748 *-168.71%*      1.1666 *-166.83%*      1.1496 *-162.95%*      1.3696 *-213.27%*
Amean     3       1.1466 (   0.00%)      2.0284 * -76.91%*      1.1700 (  -2.04%)      2.0320 * -77.22%*      6.8484 *-497.28%*      6.7128 *-485.45%*      6.8068 *-493.65%*      7.9702 *-595.12%*
Amean     5       2.0140 (   0.00%)      2.7834 * -38.20%*      2.0116 (   0.12%)      2.6688 * -32.51%*     11.9494 *-493.32%*     11.9450 *-493.10%*     12.0904 *-500.32%*     14.4190 *-615.94%*
Amean     7       2.5396 (   0.00%)      3.3064 * -30.19%*      2.6002 (  -2.39%)      2.9074 * -14.48%*     16.0142 *-530.58%*     16.3418 *-543.48%*     17.0896 *-572.92%*     20.2050 *-695.60%*
Amean     12      3.2930 (   0.00%)      2.9830 (   9.41%)      3.4226 (  -3.94%)      2.8522 (  13.39%)     28.1482 *-754.79%*     27.3916 *-731.81%*     28.5938 *-768.32%*     34.6184 *-951.27%*
Amean     18      3.9452 (   0.00%)      4.0806 (  -3.43%)      3.7950 (   3.81%)      3.8938 (   1.30%)     41.7120 *-957.28%*     42.1062 *-967.28%*     44.2136 *-1020.69%*     51.0200 *-1193.22%*
Amean     24      4.1618 (   0.00%)      4.8466 * -16.45%*      4.2258 (  -1.54%)      5.0720 * -21.87%*     56.8598 *-1266.23%*     57.3568 *-1278.17%*     61.2660 *-1372.10%*     68.2538 *-1540.01%*
Amean     30      4.7790 (   0.00%)      5.9726 * -24.98%*      4.8756 (  -2.02%)      5.9434 * -24.36%*     72.1602 *-1409.94%*     75.8036 *-1486.18%*     80.0066 *-1574.13%*     81.3768 *-1602.80%*
Amean     32      5.1680 (   0.00%)      6.6000 * -27.71%*      5.2004 (  -0.63%)      6.5490 * -26.72%*     78.1974 *-1413.11%*     82.9812 *-1505.67%*     87.7340 *-1597.64%*     85.6876 *-1558.04%*
Stddev    1       0.0173 (   0.00%)      0.0624 (-259.99%)      0.0129 (  25.54%)      0.0451 (-160.36%)      0.0727 (-319.12%)      0.0954 (-450.05%)      0.0545 (-214.16%)      0.0960 (-453.74%)
Stddev    3       0.0471 (   0.00%)      0.3035 (-544.20%)      0.0547 ( -16.12%)      0.2745 (-482.52%)      0.2029 (-330.60%)      0.1102 (-133.97%)      0.2391 (-407.38%)      0.1864 (-295.62%)
Stddev    5       0.1492 (   0.00%)      0.4178 (-180.05%)      0.1419 (   4.88%)      0.2569 ( -72.22%)      0.3878 (-159.96%)      0.2584 ( -73.22%)      0.4259 (-185.47%)      0.3340 (-123.89%)
Stddev    7       0.2077 (   0.00%)      0.2941 ( -41.59%)      0.2281 (  -9.85%)      0.2049 (   1.33%)      0.3922 ( -88.86%)      0.8178 (-293.78%)      0.4064 ( -95.67%)      0.6127 (-195.02%)
Stddev    12      0.5560 (   0.00%)      0.2038 (  63.34%)      0.2113 (  62.00%)      0.2490 (  55.21%)      1.0797 ( -94.21%)      0.7564 ( -36.06%)      1.0225 ( -83.91%)      1.5233 (-174.01%)
Stddev    18      0.3556 (   0.00%)      0.3110 (  12.55%)      0.3054 (  14.11%)      0.1265 (  64.43%)      1.0258 (-188.47%)      1.4386 (-304.53%)      1.1818 (-232.33%)      2.6710 (-651.08%)
Stddev    24      0.1844 (   0.00%)      0.1614 (  12.46%)      0.1135 (  38.46%)      0.3679 ( -99.54%)      2.0997 (-1038.84%)      1.1493 (-523.36%)      0.5214 (-182.79%)      2.9229 (-1485.35%)
Stddev    30      0.1420 (   0.00%)      0.0875 (  38.37%)      0.1799 ( -26.66%)      0.1076 (  24.26%)      4.5079 (-3073.51%)      1.5704 (-1005.58%)      1.4054 (-889.42%)      4.5743 (-3120.26%)
Stddev    32      0.2184 (   0.00%)      0.2427 ( -11.11%)      0.3143 ( -43.92%)      0.3517 ( -61.01%)      3.7345 (-1609.81%)      1.3564 (-521.00%)      2.1822 (-899.11%)      4.0896 (-1772.40%)
                                  v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                              VM-HT                VM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Amean     1       0.6762 (   0.00%)      1.6824 *-148.80%*      0.6630 (   1.95%)      1.5706 *-132.27%*      0.6600 (   2.40%)      0.6514 (   3.67%)      0.6780 (  -0.27%)      0.6964 *  -2.99%*
Amean     3       1.7130 (   0.00%)      2.2806 * -33.13%*      1.8074 *  -5.51%*      2.1882 * -27.74%*      1.7992 (  -5.03%)      1.7896 *  -4.47%*      1.7760 (  -3.68%)      1.8264 *  -6.62%*
Amean     5       2.8688 (   0.00%)      2.8048 (   2.23%)      2.5880 *   9.79%*      2.8474 (   0.75%)      2.5486 *  11.16%*      2.7896 (   2.76%)      2.4402 *  14.94%*      2.6020 *   9.30%*
Amean     7       3.3432 (   0.00%)      3.3434 (  -0.01%)      3.4564 (  -3.39%)      3.4704 (  -3.80%)      3.2424 (   3.02%)      3.5620 (  -6.54%)      3.0352 (   9.21%)      3.0874 (   7.65%)
Amean     12      5.8936 (   0.00%)      4.9968 *  15.22%*      5.1560 *  12.52%*      5.0670 *  14.03%*      4.2722 *  27.51%*      5.6570 (   4.01%)      4.8270 *  18.10%*      4.2514 *  27.86%*
Amean     18      6.3938 (   0.00%)      7.0542 ( -10.33%)      6.4900 (  -1.50%)      6.9682 (  -8.98%)      5.6478 (  11.67%)      7.4324 ( -16.24%)      6.3160 (   1.22%)      6.5124 (  -1.85%)
Amean     24      7.6278 (   0.00%)      9.5096 * -24.67%*      7.0062 *   8.15%*      9.0278 * -18.35%*      7.5650 (   0.82%)      8.2604 (  -8.29%)      8.5372 ( -11.92%)      7.7008 (  -0.96%)
Amean     30     10.5534 (   0.00%)     10.9456 (  -3.72%)     11.4470 (  -8.47%)     11.1330 (  -5.49%)      8.8486 (  16.15%)     10.8508 (  -2.82%)     12.6182 ( -19.57%)     10.5836 (  -0.29%)
Amean     32     11.6024 (   0.00%)     12.6052 (  -8.64%)     10.8236 (   6.71%)     11.2156 (   3.33%)      9.0654 *  21.87%*     11.5074 (   0.82%)     10.3592 (  10.72%)     11.7174 (  -0.99%)
Stddev    1       0.0143 (   0.00%)      0.3700 (-2480.08%)      0.0261 ( -82.29%)      0.1206 (-741.16%)      0.0444 (-209.74%)      0.0270 ( -88.42%)      0.0238 ( -65.73%)      0.0180 ( -25.17%)
Stddev    3       0.0473 (   0.00%)      0.1384 (-192.63%)      0.0739 ( -56.38%)      0.0714 ( -51.01%)      0.0931 ( -96.94%)      0.0517 (  -9.28%)      0.0960 (-103.02%)      0.0553 ( -17.04%)
Stddev    5       0.1236 (   0.00%)      0.2251 ( -82.10%)      0.1607 ( -29.98%)      0.1464 ( -18.41%)      0.1319 (  -6.69%)      0.1842 ( -48.96%)      0.0647 (  47.66%)      0.1977 ( -59.92%)
Stddev    7       0.3597 (   0.00%)      0.1105 (  69.27%)      0.2288 (  36.38%)      0.1335 (  62.90%)      0.3131 (  12.94%)      0.2958 (  17.76%)      0.1581 (  56.05%)      0.2361 (  34.37%)
Stddev    12      0.4677 (   0.00%)      0.0846 (  81.91%)      0.6319 ( -35.11%)      0.1526 (  67.37%)      0.3591 (  23.23%)      0.6898 ( -47.48%)      0.4853 (  -3.76%)      0.2963 (  36.64%)
Stddev    18      1.2289 (   0.00%)      0.1849 (  84.95%)      1.1160 (   9.18%)      0.2497 (  79.68%)      0.9843 (  19.90%)      0.9542 (  22.35%)      0.6621 (  46.12%)      0.7668 (  37.60%)
Stddev    24      0.5202 (   0.00%)      0.9344 ( -79.60%)      0.1940 (  62.71%)      0.4706 (   9.53%)      0.8362 ( -60.73%)      1.0819 (-107.96%)      1.0229 ( -96.63%)      0.8881 ( -70.72%)
Stddev    30      2.1557 (   0.00%)      0.7984 (  62.96%)      1.2499 (  42.02%)      0.9804 (  54.52%)      0.4846 (  77.52%)      1.2901 (  40.15%)      1.5532 (  27.95%)      1.2932 (  40.01%)
Stddev    32      2.2255 (   0.00%)      1.1321 (  49.13%)      2.2380 (  -0.56%)      0.2127 (  90.44%)      0.3654 (  83.58%)      1.5727 (  29.33%)      2.3291 (  -4.66%)      2.0936 (   5.93%)
                                  v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                           VM-v4-HT             VM-v4-noHT                  v4-HT                v4-noHT              v4-csc-HT     v4-csc_stallfix-HT          v4-csc_tim-HT     v4-csc_vruntime-HT
Amean     1       1.2194 (   0.00%)      1.1974 (   1.80%)      1.0308 *  15.47%*      1.1054 *   9.35%*      1.0522 *  13.71%*      1.2290 (  -0.79%)      1.1392 *   6.58%*      1.0524 *  13.70%*
Amean     3       2.2568 (   0.00%)      2.0588 (   8.77%)      2.0708 (   8.24%)      2.2352 (   0.96%)      2.1808 (   3.37%)      2.2820 (  -1.12%)      2.1682 (   3.93%)      2.2598 (  -0.13%)
Amean     5       2.9848 (   0.00%)      2.9912 (  -0.21%)      2.4938 *  16.45%*      2.4634 *  17.47%*      2.6890 (   9.91%)      2.8908 (   3.15%)      2.8636 (   4.06%)      2.4158 *  19.06%*
Amean     7       3.4500 (   0.00%)      3.2538 (   5.69%)      3.3646 (   2.48%)      3.2666 (   5.32%)      3.0800 (  10.72%)      4.2206 ( -22.34%)      3.1016 (  10.10%)      3.2186 (   6.71%)
Amean     12      6.0414 (   0.00%)      5.0624 (  16.20%)      5.1276 (  15.13%)      5.1066 (  15.47%)      4.7728 *  21.00%*      5.5068 (   8.85%)      4.7544 *  21.30%*      5.8920 (   2.47%)
Amean     16      7.5510 (   0.00%)      7.6888 (  -1.82%)      6.9732 (   7.65%)      5.9098 *  21.73%*      6.5542 *  13.20%*      6.9492 (   7.97%)      6.4372 *  14.75%*      6.1968 *  17.93%*
Stddev    1       0.0786 (   0.00%)      0.1166 ( -48.34%)      0.1762 (-124.09%)      0.0712 (   9.45%)      0.1541 ( -95.99%)      0.0814 (  -3.55%)      0.0452 (  42.57%)      0.1817 (-131.05%)
Stddev    3       0.2220 (   0.00%)      0.1887 (  15.03%)      0.2174 (   2.07%)      0.1368 (  38.37%)      0.1928 (  13.17%)      0.4342 ( -95.57%)      0.2353 (  -5.97%)      0.1753 (  21.06%)
Stddev    5       0.4586 (   0.00%)      0.4689 (  -2.23%)      0.3202 (  30.19%)      0.2666 (  41.88%)      0.3108 (  32.24%)      0.2876 (  37.29%)      0.4067 (  11.33%)      0.3010 (  34.37%)
Stddev    7       0.3769 (   0.00%)      0.5242 ( -39.09%)      0.2523 (  33.06%)      0.2498 (  33.71%)      0.4527 ( -20.13%)      1.5089 (-300.36%)      0.3998 (  -6.09%)      0.4018 (  -6.62%)
Stddev    12      1.0604 (   0.00%)      0.5194 (  51.02%)      0.5646 (  46.75%)      0.5255 (  50.45%)      0.4872 (  54.06%)      0.3447 (  67.49%)      0.5492 (  48.21%)      0.5487 (  48.26%)
Stddev    16      0.6245 (   0.00%)      0.5220 (  16.40%)      0.6914 ( -10.71%)      0.6984 ( -11.83%)      0.3543 (  43.27%)      0.5137 (  17.74%)      0.5083 (  18.61%)      1.0744 ( -72.04%)
                                  v                      v                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2
                            VMx2-HT              VMx2-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Amean     1       0.6780 (   0.00%)      2.1522 *-217.43%*      1.6790 *-147.64%*      2.2440 *-230.97%*     15.6090 *-2202.21%*      9.1920 *-1255.75%*      2.0732 *-205.78%*      2.2390 *-230.24%*
Amean     3       1.8452 (   0.00%)      3.2456 * -75.89%*      2.5214 * -36.65%*      3.3944 * -83.96%*     13.7258 *-643.87%*      9.7654 *-429.23%*      2.6668 * -44.53%*      2.9044 * -57.40%*
Amean     5       2.8564 (   0.00%)      4.2548 * -48.96%*      3.2500 * -13.78%*      4.3958 * -53.89%*     11.6808 *-308.93%*      8.4724 *-196.61%*      3.4000 * -19.03%*      3.5298 * -23.58%*
Amean     7       3.3712 (   0.00%)      5.1056 * -51.45%*      4.1636 * -23.50%*      4.9874 * -47.94%*     14.0516 (-316.81%)      9.1454 *-171.28%*      4.1396 * -22.79%*      4.3570 * -29.24%*
Amean     12      4.8134 (   0.00%)      7.4516 * -54.81%*      6.1254 * -27.26%*      7.3488 * -52.67%*     13.7680 *-186.03%*     13.7400 *-185.45%*      6.0424 * -25.53%*      6.3404 * -31.72%*
Amean     18      6.1980 (   0.00%)     10.4126 * -68.00%*      8.5942 * -38.66%*      9.7554 * -57.40%*     10.7234 * -73.01%*     13.5214 *-118.16%*      8.6628 * -39.77%*      8.5910 * -38.61%*
Amean     24      8.2116 (   0.00%)     12.7570 * -55.35%*     10.9386 * -33.21%*     12.9032 * -57.13%*     17.8492 *-117.37%*     14.4432 * -75.89%*     10.7488 * -30.90%*     11.1930 * -36.31%*
Amean     30      9.3264 (   0.00%)     15.3704 * -64.81%*     13.6630 * -46.50%*     15.4806 * -65.99%*     16.9272 * -81.50%*     18.4184 * -97.49%*     13.6174 * -46.01%*     13.5026 * -44.78%*
Amean     32     10.1954 (   0.00%)     16.5494 * -62.32%*     14.7450 * -44.62%*     16.4650 * -61.49%*     13.9146 * -36.48%*     17.5790 * -72.42%*     14.6072 * -43.27%*     14.0346 * -37.66%*
Stddev    1       0.0129 (   0.00%)      0.2665 (-1959.36%)      0.2262 (-1647.54%)      0.0703 (-443.26%)      9.5450 (-73651.40%)      6.3226 (-48752.57%)      0.1736 (-1241.41%)      0.3070 (-2272.09%)
Stddev    3       0.1156 (   0.00%)      0.3653 (-215.89%)      0.1388 ( -20.00%)      0.2067 ( -78.77%)      3.3781 (-2821.39%)      3.2241 (-2688.20%)      0.0970 (  16.10%)      0.1247 (  -7.87%)
Stddev    5       0.0817 (   0.00%)      0.2968 (-263.19%)      0.0938 ( -14.79%)      0.1367 ( -67.25%)      3.1273 (-3726.47%)      3.9175 (-4693.34%)      0.1155 ( -41.31%)      0.1864 (-128.10%)
Stddev    7       0.1190 (   0.00%)      0.1337 ( -12.31%)      0.3136 (-163.48%)      0.2034 ( -70.91%)     17.1856 (-14336.96%)      2.8676 (-2308.94%)      0.0794 (  33.32%)      0.1362 ( -14.42%)
Stddev    12      0.5237 (   0.00%)      0.2929 (  44.08%)      0.1386 (  73.53%)      0.2330 (  55.51%)      7.1869 (-1272.26%)      8.0085 (-1429.13%)      0.2535 (  51.59%)      0.1441 (  72.49%)
Stddev    18      0.4791 (   0.00%)      0.1528 (  68.10%)      0.4254 (  11.21%)      0.2953 (  38.37%)      2.8492 (-494.71%)      4.3767 (-813.54%)      0.3205 (  33.10%)      0.3094 (  35.43%)
Stddev    24      1.1591 (   0.00%)      0.2616 (  77.43%)      0.2446 (  78.90%)      0.4363 (  62.36%)      5.3380 (-360.53%)      3.2433 (-179.81%)      0.2704 (  76.67%)      0.4402 (  62.02%)
Stddev    30      0.6561 (   0.00%)      0.4008 (  38.91%)      0.2670 (  59.31%)      0.1451 (  77.88%)      4.5765 (-597.54%)      4.2380 (-545.95%)      0.4768 (  27.33%)      0.2693 (  58.96%)
Stddev    32      1.2197 (   0.00%)      0.4459 (  63.44%)      0.5496 (  54.94%)      0.4179 (  65.74%)      1.3553 ( -11.12%)      4.3537 (-256.96%)      0.5602 (  54.07%)      0.5312 (  56.45%)

So, the situation looks pretty *terrible* on baremetal. Well, something
to investigate, I guess.

In VMs, on the other hand, things don't look too bad. It is a bit
weird, though, that the benchmark seems to be sensitive to HT when run
on baremetal, but not so much when run in VMs. Among the various core
scheduling implementations/patches, plain v3 seems to have issues with
this workload, while Tim's and vruntime are better.

It is confirmed that, in the virtualization scenario, under
overcommitment, core scheduling is more effective than disabling
HyperThreading, at least with either Tim's or the vruntime patches.

Overhead-wise, it is a hard call. Numbers vary a lot, and I think each
group of measurements needs to be looked at carefully to come up with a
sensible analysis.
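For reference, my understanding is that the Amean / Stddev rows in
these tables are just the arithmetic mean and (sample) standard
deviation over the repeated runs of each configuration. A minimal
sketch of that aggregation, with made-up sample values (the numbers
below are purely illustrative, not from any of the runs above):

```python
import statistics

# Made-up elapsed times (seconds) for the repeated runs of one
# configuration; mmtests aggregates something like this per table row.
runs = [3.20, 3.15, 3.19, 3.18, 3.18]

amean = statistics.mean(runs)    # arithmetic mean of the samples
stddev = statistics.stdev(runs)  # sample standard deviation

print(f"Amean  {amean:.4f}")   # prints: Amean  3.1800
print(f"Stddev {stddev:.4f}")  # prints: Stddev 0.0187
```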

Overhead is there, that's for sure, in the baremetal case. Also, it
looks more severe when HT is disabled (e.g., compare v-BM-noHT with BM-
noHT).

In virt cases, it's really a hard call. E.g., when a VM with 4 vCPUs
is used, core scheduling seems to be able to make the system
significantly faster! :-O
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-29  9:11                                               ` Dario Faggioli
  2019-10-29  9:15                                                 ` Dario Faggioli
@ 2019-10-29  9:16                                                 ` Dario Faggioli
  2019-10-29  9:17                                                 ` Dario Faggioli
                                                                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 161+ messages in thread
From: Dario Faggioli @ 2019-10-29  9:16 UTC (permalink / raw)
  To: Aaron Lu, Aubrey Li
  Cc: Tim Chen, Julien Desfossez, Li, Aubrey, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 9777 bytes --]

On Tue, 2019-10-29 at 10:11 +0100, Dario Faggioli wrote:
> On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> > I'm using the following branch as base which is v5.1.5 based:
> > https://github.com/digitalocean/linux-coresched coresched-v3-
> > v5.1.5-
> > test
> > 
> > And I have pushed Tim's branch to:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> > 
> > Mine:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> > core_vruntime
> > 
> Hello,
> 
> As anticipated, I've been trying to follow the development of this
> feature and, in the meantime, I have done some benchmarks.
> 
> I actually have a lot of data (and am planning for more), so I am
> sending a few emails, each one with a subset of the numbers in it,
> instead than just one which would be beyond giant! :-)
> 
KERNBENCH
=========

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-3_kernbench.txt

                                       v                      v                     BM                     BM                     BM                     BM                     BM                     BM
                                   BM-HT                BM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Amean     elsp-2       200.07 (   0.00%)      196.88 *   1.60%*      199.93 (   0.07%)      196.89 *   1.59%*      200.86 *  -0.39%*      200.83 *  -0.38%*      199.47 *   0.30%*      200.45 *  -0.19%*
Amean     elsp-4       115.56 (   0.00%)      108.64 *   5.99%*      115.12 (   0.39%)      108.72 *   5.92%*      118.17 *  -2.25%*      116.92 *  -1.18%*      115.92 (  -0.31%)      115.86 (  -0.25%)
Amean     elsp-8        84.72 (   0.00%)      110.77 * -30.75%*       84.19 (   0.62%)      111.03 * -31.06%*       84.78 (  -0.07%)       84.63 (   0.11%)       89.09 *  -5.16%*       90.21 *  -6.48%*
Amean     elsp-16       85.06 (   0.00%)      113.63 * -33.59%*       85.33 (  -0.32%)      113.83 * -33.81%*       85.95 *  -1.04%*       85.73 *  -0.78%*       90.20 *  -6.04%*       90.46 *  -6.35%*
Stddev    elsp-2         0.11 (   0.00%)        0.05 (  59.33%)        0.43 (-278.63%)        0.05 (  60.30%)        0.20 ( -75.43%)        0.15 ( -28.90%)        0.16 ( -40.87%)        0.08 (  26.69%)
Stddev    elsp-4         0.54 (   0.00%)        0.37 (  30.80%)        0.02 (  96.11%)        0.09 (  83.05%)        1.10 (-105.52%)        0.24 (  55.54%)        0.10 (  81.29%)        0.26 (  50.67%)
Stddev    elsp-8         0.82 (   0.00%)        0.25 (  69.66%)        0.28 (  65.58%)        0.27 (  66.75%)        0.30 (  63.64%)        0.07 (  92.05%)        0.09 (  88.92%)        0.19 (  77.18%)
Stddev    elsp-16        0.07 (   0.00%)        0.32 (-375.21%)        0.41 (-502.93%)        0.19 (-176.54%)        0.22 (-219.28%)        0.21 (-208.51%)        0.31 (-358.10%)        0.09 ( -32.49%)
                                       v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                                   VM-HT                VM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Amean     elsp-2       229.61 (   0.00%)      202.26 *  11.91%*      205.30 *  10.59%*      201.14 *  12.40%*      207.43 *   9.66%*      207.30 *   9.72%*      205.92 *  10.32%*      206.69 *   9.98%*
Amean     elsp-4       128.32 (   0.00%)      124.33 *   3.11%*      116.84 *   8.95%*      124.66 *   2.85%*      121.86 *   5.04%*      140.87 *  -9.78%*      118.28 *   7.83%*      122.58 *   4.48%*
Amean     elsp-8        87.33 (   0.00%)      118.45 * -35.63%*       85.52 (   2.07%)      118.61 * -35.81%*       96.92 * -10.98%*      110.31 * -26.31%*       88.69 (  -1.55%)       88.49 (  -1.32%)
Amean     elsp-16       87.00 (   0.00%)      116.17 * -33.52%*       86.41 (   0.68%)      116.43 * -33.82%*      103.24 * -18.67%*       90.77 *  -4.33%*       88.89 *  -2.17%*       89.35 *  -2.70%*
Stddev    elsp-2         0.24 (   0.00%)        1.90 (-702.44%)        0.39 ( -63.41%)        0.45 ( -91.41%)        1.78 (-650.49%)        1.31 (-454.67%)        0.22 (   8.97%)        1.09 (-362.56%)
Stddev    elsp-4         0.10 (   0.00%)        0.56 (-484.51%)        0.16 ( -63.75%)        0.60 (-524.50%)        0.47 (-392.73%)        2.53 (-2556.69%)        0.37 (-288.05%)        0.19 ( -96.77%)
Stddev    elsp-8         1.48 (   0.00%)        0.28 (  81.02%)        0.08 (  94.57%)        0.19 (  87.46%)        1.25 (  15.23%)        2.84 ( -92.18%)        1.23 (  16.49%)        0.58 (  60.60%)
Stddev    elsp-16        0.62 (   0.00%)        0.33 (  46.43%)        0.07 (  89.43%)        0.50 (  20.24%)        0.54 (  12.75%)        0.93 ( -49.52%)        0.44 (  29.54%)        0.29 (  53.95%)
                                      v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                               VM-v4-HT             VM-v4-noHT                  v4-HT                v4-noHT              v4-csc-HT     v4-csc_stallfix-HT          v4-csc_tim-HT     v4-csc_vruntime-HT
Amean     elsp-2      227.42 (   0.00%)      202.75 *  10.85%*      201.44 *  11.42%*      201.05 *  11.60%*      207.88 *   8.59%*      210.35 *   7.51%*      204.07 *  10.27%*      204.16 *  10.23%*
Amean     elsp-4      124.46 (   0.00%)      128.74 *  -3.44%*      111.03 *  10.79%*      110.48 *  11.23%*      116.46 *   6.43%*      132.08 *  -6.13%*      112.10 *   9.93%*      112.44 *   9.66%*
Amean     elsp-8      127.12 (   0.00%)      114.00 *  10.32%*      113.37 *  10.82%*      112.50 *  11.50%*      118.62 *   6.69%*      135.04 *  -6.23%*      114.19 *  10.18%*      114.36 *  10.04%*
Stddev    elsp-2        0.16 (   0.00%)        0.05 (  68.44%)        0.23 ( -45.93%)        0.32 (-101.53%)        0.89 (-471.21%)        0.86 (-452.61%)        0.13 (  18.49%)        0.08 (  51.98%)
Stddev    elsp-4        0.09 (   0.00%)        0.90 (-958.30%)        0.17 ( -95.69%)        0.40 (-364.45%)        1.33 (-1462.83%)        0.61 (-621.84%)        0.20 (-134.86%)        0.06 (  28.48%)
Stddev    elsp-8        0.14 (   0.00%)        0.10 (  29.71%)        0.40 (-181.91%)        0.05 (  67.30%)        1.34 (-857.21%)        0.43 (-206.99%)        0.09 (  39.30%)        0.12 (  12.79%)
                                       v                      v                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2
                                 VMx2-HT              VMx2-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Amean     elsp-2       206.25 (   0.00%)      338.78 * -64.26%*      296.38 * -43.70%*      335.66 * -62.75%*      327.38 * -58.73%*      443.46 *-115.01%*      317.41 * -53.90%*      295.89 * -43.47%*
Amean     elsp-4       117.69 (   0.00%)      267.38 *-127.19%*      174.91 * -48.62%*      263.22 *-123.66%*      201.69 * -71.37%*      317.24 *-169.56%*      193.92 * -64.77%*      193.04 * -64.02%*
Amean     elsp-8        85.93 (   0.00%)      225.41 *-162.31%*      146.21 * -70.14%*      221.96 *-158.29%*      188.12 *-118.91%*      203.79 *-137.15%*      154.66 * -79.98%*      162.26 * -88.82%*
Amean     elsp-16       86.82 (   0.00%)      221.32 *-154.92%*      141.53 * -63.02%*      217.27 *-150.27%*      180.06 *-107.41%*      190.36 *-119.27%*      143.76 * -65.59%*      156.84 * -80.65%*
Stddev    elsp-2         1.01 (   0.00%)        1.11 ( -10.37%)        2.24 (-121.88%)        1.36 ( -35.11%)       22.86 (-2164.27%)       24.78 (-2353.59%)        2.84 (-180.79%)        0.91 (  10.16%)
Stddev    elsp-4         0.18 (   0.00%)        4.55 (-2398.38%)        0.62 (-239.73%)        3.15 (-1630.91%)        3.00 (-1551.21%)       10.85 (-5864.08%)        5.87 (-3124.85%)        3.79 (-1984.93%)
Stddev    elsp-8         0.42 (   0.00%)        1.13 (-167.05%)        0.44 (  -4.45%)        2.91 (-590.61%)        4.16 (-885.58%)        6.85 (-1524.29%)        0.78 ( -85.34%)        1.48 (-252.23%)
Stddev    elsp-16        0.41 (   0.00%)        2.40 (-478.92%)        2.21 (-432.16%)        3.27 (-688.15%)        9.50 (-2189.54%)        3.39 (-717.66%)        2.00 (-383.08%)        0.48 ( -15.60%)


So, if building kernels were the only thing that people do with
computers, we could say we're done and go have beers! :-)

In fact, here, core scheduling is doing well on baremetal. E.g., look
at what happens, in the BM- group of measurements, when the number of
jobs is higher than 8, which is how many CPUs we have.

It's actually doing quite fine in Virt too. Among the various variants
(sorry), plain v3, even with stallfix on, is the one performing worst.
Tim's patches seem to me to produce the best-looking set of numbers for
this workload.

Furthermore, there's basically no overhead (actually, there are
speedups) as long as there is no virt-overcommitment. In fact, in the
VMx2 group of measurements, even just applying the core scheduling v3
patches makes the VM

-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-29  9:11                                               ` Dario Faggioli
  2019-10-29  9:15                                                 ` Dario Faggioli
  2019-10-29  9:16                                                 ` Dario Faggioli
@ 2019-10-29  9:17                                                 ` Dario Faggioli
  2019-10-29  9:18                                                 ` Dario Faggioli
                                                                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 161+ messages in thread
From: Dario Faggioli @ 2019-10-29  9:17 UTC (permalink / raw)
  To: Aaron Lu, Aubrey Li
  Cc: Tim Chen, Julien Desfossez, Li, Aubrey, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 12651 bytes --]

On Tue, 2019-10-29 at 10:11 +0100, Dario Faggioli wrote:
> On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> > I'm using the following branch as base which is v5.1.5 based:
> > https://github.com/digitalocean/linux-coresched coresched-v3-
> > v5.1.5-
> > test
> > 
> > And I have pushed Tim's branch to:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> > 
> > Mine:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> > core_vruntime
> > 
> Hello,
> 
> As anticipated, I've been trying to follow the development of this
> feature and, in the meantime, I have done some benchmarks.
> 
> I actually have a lot of data (and am planning for more), so I am
> sending a few emails, each one with a subset of the numbers in it,
> instead than just one which would be beyond giant! :-)
> 
SYSBENCHCPU
===========

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-4_sysbenchcpu.txt

                                  v                      v                     BM                     BM                     BM                     BM                     BM                     BM
                              BM-HT                BM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Amean     1        22.55 (   0.00%)       22.55 *  -0.03%*       22.55 (   0.00%)       22.55 (  -0.02%)       22.56 *  -0.04%*       22.56 *  -0.03%*       22.55 (   0.01%)       22.55 (  -0.03%)
Amean     3         7.68 (   0.00%)        7.69 *  -0.07%*        7.68 (  -0.02%)        7.69 *  -0.09%*        7.69 *  -0.07%*        7.68 (  -0.02%)        7.68 (  -0.02%)        7.68 (  -0.02%)
Amean     5         4.80 (   0.00%)        5.79 * -20.64%*        4.80 (   0.00%)        5.79 * -20.73%*        4.81 *  -0.24%*        4.80 *  -0.18%*        4.81 *  -0.24%*        4.81 *  -0.24%*
Amean     7         3.59 (   0.00%)        5.79 * -61.24%*        3.59 (   0.00%)        5.79 * -61.20%*        3.59 (  -0.08%)        3.60 *  -0.16%*        3.62 *  -0.92%*        3.61 *  -0.52%*
Amean     12        3.18 (   0.00%)        5.79 * -81.79%*        3.19 (  -0.13%)        5.79 * -81.92%*        3.20 (  -0.49%)        3.19 *  -0.27%*        3.27 *  -2.78%*        3.41 *  -7.04%*
Amean     16        3.19 (   0.00%)        5.79 * -81.29%*        3.20 (  -0.13%)        5.79 * -81.38%*        3.24 (  -1.52%)        3.20 (  -0.31%)        3.27 *  -2.51%*        3.40 *  -6.40%*
Stddev    1         0.00 (   0.00%)        0.01 ( -41.42%)        0.01 ( -82.57%)        0.01 (-194.39%)        0.01 ( -82.57%)        0.01 ( -41.42%)        0.01 (-158.20%)        0.01 (-269.68%)
Stddev    3         0.00 (   0.00%)        0.01 ( -99.00%)        0.00 ( -99.00%)        0.00 ( -99.00%)        0.01 ( -99.00%)        0.00 ( -99.00%)        0.00 ( -99.00%)        0.00 ( -99.00%)
Stddev    5         0.01 (   0.00%)        0.02 (-302.08%)        0.01 (   0.00%)        0.02 (-258.24%)        0.00 (   8.71%)        0.01 (  -0.00%)        0.01 ( -41.42%)        0.00 (   8.71%)
Stddev    7         0.00 (   0.00%)        0.02 ( -99.00%)        0.00 (   0.00%)        0.02 ( -99.00%)        0.00 ( -99.00%)        0.01 ( -99.00%)        0.01 ( -99.00%)        0.01 ( -99.00%)
Stddev    12        0.01 (   0.00%)        0.02 (-125.32%)        0.01 ( -54.42%)        0.02 (-103.81%)        0.03 (-321.54%)        0.01 ( -20.89%)        0.03 (-300.00%)        0.03 (-330.56%)
Stddev    16        0.01 (   0.00%)        0.02 (  -3.28%)        0.01 (   4.55%)        0.02 ( -44.53%)        0.10 (-591.05%)        0.02 ( -21.11%)        0.02 ( -44.53%)        0.04 (-145.85%)
                                  v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                              VM-HT                VM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Amean     1        24.38 (   0.00%)       22.57 *   7.41%*       22.56 *   7.45%*       22.56 *   7.44%*       22.56 *   7.44%*       22.94 *   5.91%*       22.56 *   7.44%*       22.57 *   7.43%*
Amean     3         8.12 (   0.00%)        7.70 *   5.21%*        7.68 *   5.45%*        7.68 *   5.40%*        7.69 *   5.36%*        7.72 *   4.99%*        7.68 *   5.43%*        7.68 *   5.40%*
Amean     5         4.96 (   0.00%)        5.85 * -17.88%*        4.79 *   3.43%*        5.85 * -17.91%*        4.81 *   3.02%*        5.09 *  -2.68%*        4.82 *   2.94%*        4.80 *   3.25%*
Amean     7         3.62 (   0.00%)        5.85 * -61.72%*        3.59 *   0.91%*        5.87 * -62.04%*        3.69 (  -1.82%)        3.94 *  -8.76%*        3.62 (   0.12%)        3.60 *   0.59%*
Amean     12        3.18 (   0.00%)        5.89 * -84.93%*        3.19 (  -0.09%)        5.83 * -83.18%*        3.26 *  -2.51%*        3.40 *  -6.64%*        3.28 *  -3.10%*        3.31 *  -3.86%*
Amean     16        3.18 (   0.00%)        5.86 * -83.89%*        3.18 (   0.04%)        5.85 * -83.85%*        3.28 *  -3.10%*        3.45 *  -8.30%*        3.29 *  -3.28%*        3.31 *  -3.90%*
Stddev    1         0.02 (   0.00%)        0.01 (  47.21%)        0.01 (  66.12%)        0.01 (  53.84%)        0.01 (  55.65%)        0.19 (-1012.08%)        0.01 (  53.84%)        0.01 (  53.84%)
Stddev    3         0.00 (   0.00%)        0.02 (-254.96%)        0.00 ( 100.00%)        0.01 ( -61.25%)        0.00 (   0.00%)        0.06 (-1116.55%)        0.00 (  22.54%)        0.01 (  -9.54%)
Stddev    5         0.00 (   0.00%)        0.07 (-1637.81%)        0.00 (  -0.00%)        0.06 (-1587.21%)        0.01 (-255.90%)        0.18 (-4785.35%)        0.01 (-236.65%)        0.01 ( -52.75%)
Stddev    7         0.00 (   0.00%)        0.10 (-20807867647333972.00%)        0.01 (-1575931584692200.00%)        0.08 (-16887732666966734.00%)        0.15 (-30988869061711636.00%)        0.12 (-24303774169159900.00%)        0.01 (-1640281598669292.00%)        0.01 (-2228703820443910.00%)
Stddev    12        0.01 (   0.00%)        0.12 (-2069.49%)        0.01 ( -41.42%)        0.08 (-1441.64%)        0.08 (-1396.11%)        0.09 (-1516.07%)        0.03 (-458.27%)        0.03 (-403.32%)
Stddev    16        0.01 (   0.00%)        0.07 (-1295.23%)        0.00 (   8.71%)        0.07 (-1235.42%)        0.11 (-1867.23%)        0.26 (-4790.98%)        0.04 (-711.38%)        0.05 (-800.00%)
                                 v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                          VM-v4-HT             VM-v4-noHT                  v4-HT                v4-noHT              v4-csc-HT     v4-csc_stallfix-HT          v4-csc_tim-HT     v4-csc_vruntime-HT
Amean     1       24.37 (   0.00%)       22.76 *   6.62%*       22.56 *   7.43%*       22.57 *   7.41%*       22.57 *   7.41%*       22.86 *   6.21%*       22.56 *   7.44%*       22.56 *   7.44%*
Amean     3        8.12 (   0.00%)        7.73 *   4.82%*        7.68 *   5.40%*        7.68 *   5.40%*        7.69 *   5.33%*        7.70 *   5.24%*        7.68 *   5.42%*        7.68 *   5.38%*
Amean     5        6.11 (   0.00%)        5.90 *   3.53%*        5.79 *   5.35%*        5.78 *   5.51%*        5.79 *   5.33%*        6.07 *   0.77%*        5.78 *   5.44%*        5.77 *   5.58%*
Amean     7        6.11 (   0.00%)        5.92 *   3.11%*        5.79 *   5.33%*        5.77 *   5.59%*        5.80 *   5.16%*        6.07 *   0.70%*        5.78 *   5.42%*        5.77 *   5.54%*
Amean     8        6.11 (   0.00%)        5.91 *   3.30%*        5.78 *   5.33%*        5.77 *   5.52%*        5.81 *   4.93%*        6.09 (   0.28%)        5.78 *   5.45%*        5.77 *   5.52%*
Stddev    1        0.01 (   0.00%)        0.01 (  27.45%)        0.01 (  27.45%)        0.01 (  17.28%)        0.01 ( -57.28%)        0.08 (-765.72%)        0.01 (  39.30%)        0.01 (  27.45%)
Stddev    3        0.00 (   0.00%)        0.03 (-616.47%)        0.00 ( -29.10%)        0.00 ( -29.10%)        0.01 (-138.05%)        0.01 (-108.17%)        0.00 (   0.00%)        0.01 ( -41.42%)
Stddev    5        0.01 (   0.00%)        0.03 (-336.77%)        0.02 (-128.71%)        0.01 ( -75.41%)        0.01 ( -70.97%)        0.02 (-128.71%)        0.01 ( -54.42%)        0.01 ( -59.33%)
Stddev    7        0.01 (   0.00%)        0.04 (-214.79%)        0.02 ( -82.57%)        0.01 (   3.08%)        0.01 ( -10.10%)        0.02 ( -52.75%)        0.01 (   3.08%)        0.01 (  -1.50%)
Stddev    8        0.01 (   0.00%)        0.04 (-329.84%)        0.02 (-122.54%)        0.01 ( -11.27%)        0.03 (-228.78%)        0.07 (-649.92%)        0.01 ( -38.01%)        0.01 ( -11.27%)
                                  v                      v                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2
                            VMx2-HT              VMx2-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Amean     1        22.58 (   0.00%)       23.41 *  -3.67%*       25.22 * -11.70%*       23.36 *  -3.48%*       31.57 * -39.84%*       29.30 * -29.78%*       31.82 * -40.93%*       25.60 * -13.38%*
Amean     3         7.69 (   0.00%)       11.03 * -43.47%*        8.81 * -14.66%*       11.19 * -45.60%*        8.64 * -12.34%*       10.68 * -38.88%*       10.10 * -31.43%*        9.05 * -17.75%*
Amean     5         4.80 (   0.00%)       11.04 *-130.24%*        5.66 * -17.96%*       11.35 *-136.61%*        6.66 * -38.81%*        7.92 * -65.18%*        6.27 * -30.65%*        6.61 * -37.83%*
Amean     7         3.58 (   0.00%)       10.94 *-205.34%*        5.44 * -51.79%*       11.53 *-221.93%*        6.39 * -78.35%*        6.23 * -73.96%*        5.65 * -57.70%*        6.07 * -69.30%*
Amean     12        3.18 (   0.00%)       10.95 *-243.74%*        5.52 * -73.44%*       11.31 *-255.09%*        7.19 *-125.93%*        9.43 *-196.05%*        5.62 * -76.54%*        6.09 * -91.16%*
Amean     16        3.18 (   0.00%)       11.10 *-248.65%*        5.43 * -70.56%*       11.35 *-256.55%*        6.84 *-115.04%*        7.19 *-125.90%*        5.50 * -72.76%*        6.13 * -92.64%*
Stddev    1         0.01 (   0.00%)        0.14 (-1403.85%)        0.16 (-1575.99%)        0.09 (-866.82%)        1.41 (-14685.15%)        1.84 (-19279.11%)        1.96 (-20463.33%)        0.38 (-3929.30%)
Stddev    3         0.00 (   0.00%)        0.51 (-10380.94%)        0.14 (-2749.21%)        0.68 (-13757.63%)        0.30 (-6133.94%)        0.80 (-16343.24%)        0.12 (-2332.69%)        0.16 (-3086.85%)
Stddev    5         0.01 (   0.00%)        0.30 (-5443.92%)        0.23 (-4280.45%)        0.58 (-10722.66%)        0.21 (-3888.73%)        2.41 (-45040.69%)        0.12 (-2113.22%)        0.16 (-2867.88%)
Stddev    7         0.00 (   0.00%)        0.40 (-8009.13%)        0.08 (-1480.51%)        0.47 (-9518.94%)        1.04 (-21241.56%)        0.51 (-10409.80%)        0.22 (-4319.28%)        0.25 (-5017.81%)
Stddev    12        0.01 (   0.00%)        0.53 (-9750.38%)        0.16 (-2879.09%)        0.66 (-12313.77%)        1.29 (-24020.36%)        3.97 (-74080.08%)        0.25 (-4570.12%)        0.32 (-5817.77%)
Stddev    16        0.00 (   0.00%)        0.46 (-9428.17%)        0.25 (-5124.17%)        0.75 (-15173.77%)        0.78 (-15851.43%)        1.38 (-28142.45%)        0.35 (-7094.72%)        0.21 (-4230.36%)

This is even better than kernbench! :-)

Basically, both on baremetal and in VMs, noHT costs us up to -81.29%.
With core scheduling, the worst we get is -6.40%.

And that's in the non-virt-overcommitted case. In the overcommitted
one, noHT brings us down to -248.65% (i.e., we are ~250% slower; it's
not that we're going back in time :-P). Core scheduling contains the
damage to either -72.76% (Tim's patches) or -92.64% (vruntime), while
plain v3 is, again, worse than both.
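As a quick sanity check on that arithmetic, the relative-change
percentages in the tables are, as far as I can tell, computed as
(baseline - result) / baseline, so a negative value means slower than
the baseline column. Redoing the 16-thread VMx2 noHT figure from the
rounded Amean values in the table:

```python
# VMx2 sysbench-cpu, 16 threads: HT baseline vs noHT, rounded Amean
# values taken from the table above.
base, noht = 3.18, 11.10

# mmtests-style relative change: negative means a regression.
delta = (base - noht) / base * 100.0

# Prints -249.06%; the table's -248.65% comes from the unrounded means.
print(f"{delta:.2f}%")
```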

Overhead looks similar to the other cases already discussed.

-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-29  9:11                                               ` Dario Faggioli
                                                                   ` (2 preceding siblings ...)
  2019-10-29  9:17                                                 ` Dario Faggioli
@ 2019-10-29  9:18                                                 ` Dario Faggioli
  2019-10-29  9:18                                                 ` Dario Faggioli
                                                                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 161+ messages in thread
From: Dario Faggioli @ 2019-10-29  9:18 UTC (permalink / raw)
  To: Aaron Lu, Aubrey Li
  Cc: Tim Chen, Julien Desfossez, Li, Aubrey, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 12771 bytes --]

On Tue, 2019-10-29 at 10:11 +0100, Dario Faggioli wrote:
> On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> > I'm using the following branch as base which is v5.1.5 based:
> > https://github.com/digitalocean/linux-coresched coresched-v3-
> > v5.1.5-
> > test
> > 
> > And I have pushed Tim's branch to:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> > 
> > Mine:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> > core_vruntime
> > 
> Hello,
> 
> As anticipated, I've been trying to follow the development of this
> feature and, in the meantime, I have done some benchmarks.
> 
> I actually have a lot of data (and am planning for more), so I am
> sending a few emails, each one with a subset of the numbers in it,
> instead than just one which would be beyond giant! :-)
> 
SYSBENCHTHREAD
==============

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-5_sysbenchthread.txt

                                  v                      v                     BM                     BM                     BM                     BM                     BM                     BM
                              BM-HT                BM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Amean     1         1.12 (   0.00%)        1.11 (   0.38%)        1.14 *  -2.43%*        1.15 *  -2.81%*        8.49 *-659.72%*        8.43 *-654.99%*        8.42 *-653.96%*        8.38 *-649.74%*
Amean     3         1.11 (   0.00%)        1.11 (   0.00%)        1.14 *  -2.70%*        1.14 *  -2.96%*        8.35 *-651.29%*        8.42 *-657.20%*        8.34 *-650.77%*        8.41 *-656.56%*
Amean     5         1.11 (   0.00%)        1.11 (   0.00%)        1.14 *  -2.70%*        1.14 *  -2.82%*        8.34 *-649.81%*        8.41 *-655.46%*        8.32 *-647.75%*        8.34 *-649.42%*
Amean     7         1.12 (   0.00%)        1.11 (   0.38%)        1.14 *  -2.43%*        1.15 *  -2.69%*        8.31 *-643.48%*        8.40 *-652.05%*        8.41 *-653.20%*        8.33 *-645.65%*
Amean     12        1.12 (   0.00%)        1.12 (   0.00%)        1.14 *  -2.43%*        1.14 *  -2.56%*        8.38 *-651.34%*        8.41 *-654.16%*        8.32 *-645.71%*        8.38 *-651.09%*
Amean     16        1.11 (   0.00%)        1.11 (  -0.13%)        1.15 *  -3.08%*        1.14 *  -2.57%*        8.42 *-656.61%*        8.36 *-651.60%*        8.31 *-646.73%*        8.39 *-654.30%*
Stddev    1         0.01 (   0.00%)        0.01 (  49.47%)        0.01 (  64.27%)        0.01 (  39.86%)        0.08 (-414.47%)        0.16 (-976.34%)        0.01 (  36.42%)        0.09 (-525.69%)
Stddev    3         0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.01 (-108.17%)        0.13 (-3211.60%)        0.10 (-2531.86%)        0.06 (-1515.55%)        0.10 (-2467.10%)
Stddev    5         0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.01 (  -9.54%)        0.12 (-2271.92%)        0.10 (-1917.42%)        0.09 (-1700.00%)        0.09 (-1725.38%)
Stddev    7         0.01 (   0.00%)        0.01 (  20.53%)        0.01 (  43.80%)        0.01 (   0.00%)        0.13 (-1255.69%)        0.10 (-988.21%)        0.03 (-247.93%)        0.08 (-762.68%)
Stddev    12        0.01 (   0.00%)        0.01 ( -24.03%)        0.00 (  37.98%)        0.01 (   0.00%)        0.13 (-1577.68%)        0.10 (-1210.31%)        0.10 (-1140.97%)        0.06 (-644.73%)
Stddev    16        0.00 (   0.00%)        0.01 (  -9.54%)        0.01 ( -54.92%)        0.00 (  22.54%)        0.15 (-3032.73%)        0.12 (-2401.20%)        0.11 (-2206.51%)        0.07 (-1304.28%)
                                  v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                              VM-HT                VM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Amean     1         1.43 (   0.00%)        1.15 *  19.56%*        1.15 *  19.36%*        1.19 *  16.67%*        1.16 *  19.16%*        1.17 *  18.46%*        1.16 *  18.66%*        1.16 *  18.86%*
Amean     3         1.43 (   0.00%)        1.16 *  19.16%*        1.16 *  19.26%*        1.17 *  18.46%*        1.16 *  19.16%*        1.21 *  15.17%*        1.16 *  18.86%*        1.21 *  15.27%*
Amean     5         1.43 (   0.00%)        1.15 *  19.28%*        1.18 *  17.78%*        1.17 *  18.08%*        1.16 *  19.18%*        1.17 *  18.18%*        1.16 *  19.08%*        1.18 *  17.58%*
Amean     7         1.43 (   0.00%)        1.16 *  19.26%*        1.16 *  18.86%*        1.17 *  18.56%*        1.16 *  19.16%*        1.19 *  16.57%*        1.15 *  19.36%*        1.15 *  19.36%*
Amean     12        1.44 (   0.00%)        1.16 *  19.44%*        1.16 *  19.44%*        1.17 *  18.75%*        1.16 *  19.25%*        1.18 *  18.06%*        1.16 *  19.15%*        1.16 *  19.44%*
Amean     16        1.46 (   0.00%)        1.16 *  20.37%*        1.16 *  20.37%*        1.16 *  20.47%*        1.16 *  20.37%*        1.17 *  19.59%*        1.17 *  20.08%*        1.16 *  20.37%*
Stddev    1         0.00 (   0.00%)        0.00 (   0.00%)        0.01 ( -41.42%)        0.05 (-1314.21%)        0.00 ( -29.10%)        0.00 ( -29.10%)        0.01 (-236.65%)        0.02 (-316.33%)
Stddev    3         0.00 (   0.00%)        0.00 ( -29.10%)        0.01 (-108.17%)        0.03 (-773.69%)        0.01 (-100.00%)        0.09 (-2324.18%)        0.01 (-138.05%)        0.06 (-1378.74%)
Stddev    5         0.00 (   0.00%)        0.01 ( -99.00%)        0.05 ( -99.00%)        0.02 ( -99.00%)        0.01 ( -99.00%)        0.01 ( -99.00%)        0.01 ( -99.00%)        0.04 ( -99.00%)
Stddev    7         0.00 (   0.00%)        0.01 ( -41.42%)        0.01 (-138.05%)        0.02 (-468.62%)        0.01 (-100.00%)        0.06 (-1600.00%)        0.01 (-108.17%)        0.01 ( -41.42%)
Stddev    12        0.00 (   0.00%)        0.00 ( 100.00%)        0.00 ( 100.00%)        0.03 (-11031521092846084.00%)        0.00 (-2034518927425100.00%)        0.01 (-4169523056347680.00%)        0.01 (-2228703820443891.00%)        0.00 ( 100.00%)
Stddev    16        0.05 (   0.00%)        0.00 (  92.31%)        0.00 (  92.31%)        0.00 ( 100.00%)        0.00 (  92.31%)        0.01 (  80.64%)        0.01 (  89.12%)        0.00 (  92.31%)
                                 v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                          VM-v4-HT             VM-v4-noHT                  v4-HT                v4-noHT              v4-csc-HT     v4-csc_stallfix-HT          v4-csc_tim-HT     v4-csc_vruntime-HT
Amean     1        1.43 (   0.00%)        1.17 *  18.15%*        1.15 *  19.64%*        1.15 *  19.44%*        1.16 *  19.14%*        1.17 *  18.34%*        1.16 *  18.94%*        1.16 *  18.94%*
Amean     3        1.43 (   0.00%)        1.17 *  18.33%*        1.17 *  18.73%*        1.15 *  19.52%*        1.15 *  19.52%*        1.19 *  16.93%*        1.16 *  19.12%*        1.16 *  19.32%*
Amean     5        1.43 (   0.00%)        1.18 *  17.45%*        1.15 *  19.44%*        1.15 *  19.54%*        1.17 *  18.15%*        1.18 *  17.55%*        1.15 *  19.44%*        1.16 *  19.04%*
Amean     7        1.43 (   0.00%)        1.18 *  17.53%*        1.16 *  19.32%*        1.15 *  19.72%*        1.17 *  18.33%*        1.17 *  18.13%*        1.16 *  18.92%*        1.15 *  19.62%*
Amean     8        1.43 (   0.00%)        1.17 *  18.46%*        1.16 *  19.16%*        1.16 *  18.96%*        1.17 *  18.46%*        1.18 *  17.86%*        1.15 *  19.36%*        1.16 *  19.06%*
Stddev    1        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (  22.54%)        0.01 (  -9.54%)        0.01 (-119.09%)        0.01 ( -67.33%)        0.01 (-149.00%)        0.01 ( -84.39%)
Stddev    3        0.01 (   0.00%)        0.01 (  12.29%)        0.02 (-105.69%)        0.01 (   0.00%)        0.01 (   0.00%)        0.05 (-515.82%)        0.01 ( -27.10%)        0.01 ( -41.42%)
Stddev    5        0.00 (   0.00%)        0.04 (-700.00%)        0.01 ( -61.25%)        0.01 ( -54.92%)        0.04 (-717.31%)        0.01 ( -84.39%)        0.01 ( -61.25%)        0.01 ( -18.32%)
Stddev    7        0.01 (   0.00%)        0.04 (-390.68%)        0.01 (   3.92%)        0.00 (  51.96%)        0.04 (-366.58%)        0.01 (   0.00%)        0.02 (-151.15%)        0.01 (   3.92%)
Stddev    8        0.00 (   0.00%)        0.00 ( -29.10%)        0.01 (-151.66%)        0.02 (-383.05%)        0.05 (-1100.00%)        0.01 (-108.17%)        0.01 ( -41.42%)        0.01 (-138.05%)
                                  v                      v                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2
                            VMx2-HT              VMx2-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Amean     1         1.16 (   0.00%)        1.22 *  -5.80%*        1.54 * -32.84%*        1.22 *  -5.31%*        2.83 (-144.32%)        1.35 * -16.42%*        1.69 * -46.30%*        1.35 * -16.91%*
Amean     3         1.16 (   0.00%)        1.24 *  -6.90%*        1.56 * -34.11%*        1.24 *  -6.53%*        1.89 * -63.05%*        1.95 * -68.35%*        1.67 * -43.84%*        1.33 * -14.41%*
Amean     5         1.17 (   0.00%)        1.21 *  -3.79%*        1.52 * -30.48%*        1.23 *  -5.75%*        1.43 * -22.89%*        1.85 ( -58.51%)        1.71 * -46.88%*        1.35 * -15.91%*
Amean     7         1.15 (   0.00%)        1.23 *  -6.81%*        1.54 * -33.29%*        1.24 *  -7.05%*        1.68 * -45.79%*        1.39 * -20.79%*        2.09 * -80.69%*        1.30 * -12.62%*
Amean     12        1.16 (   0.00%)        1.25 *  -7.25%*        1.69 * -45.09%*        1.22 *  -5.28%*        1.72 * -47.54%*        1.52 * -31.08%*        2.08 * -78.87%*        1.32 * -13.14%*
Amean     16        1.16 (   0.00%)        1.23 *  -6.14%*        1.68 * -44.23%*        1.22 *  -4.67%*        1.80 * -54.91%*        1.37 * -17.44%*        2.18 * -87.10%*        1.28 *  -9.95%*
Stddev    1         0.01 (   0.00%)        0.05 (-569.58%)        0.14 (-1693.97%)        0.02 (-169.26%)        3.19 (-42121.82%)        0.25 (-3271.57%)        0.26 (-3362.06%)        0.09 (-1072.60%)
Stddev    3         0.01 (   0.00%)        0.06 (-664.85%)        0.06 (-630.95%)        0.03 (-259.56%)        0.84 (-10177.51%)        0.98 (-11852.56%)        0.45 (-5404.28%)        0.08 (-845.29%)
Stddev    5         0.00 (   0.00%)        0.02 (-393.96%)        0.17 (-3371.31%)        0.07 (-1423.81%)        0.32 (-6361.11%)        1.04 (-21129.08%)        0.22 (-4383.53%)        0.06 (-1033.14%)
Stddev    7         0.01 (   0.00%)        0.07 (-1234.79%)        0.20 (-3727.10%)        0.08 (-1483.25%)        0.57 (-10594.70%)        0.19 (-3455.98%)        0.30 (-5454.88%)        0.08 (-1400.56%)
Stddev    12        0.00 (   0.00%)        0.08 (-1641.84%)        0.16 (-3078.68%)        0.03 (-524.50%)        0.50 (-10161.87%)        0.08 (-1634.36%)        0.32 (-6423.80%)        0.07 (-1232.67%)
Stddev    16        0.00 (   0.00%)        0.05 (-996.36%)        0.20 (-3958.82%)        0.06 (-1045.43%)        0.60 (-12116.06%)        0.23 (-4642.99%)        0.23 (-4594.04%)        0.05 (-842.34%)

This is all quite bizarre. Judging from these numbers, the benchmark
seems not to be sensitive to HT at all (while, e.g., sysbenchcpu was)
when on baremetal. In VMs, at least in most cases, things are
significantly *faster* with HT off. I know that is something that can
happen; I just was not expecting it from this workload.

Also, core scheduling is a total disaster on baremetal, and its
behavior in VMs is an anti-pattern too.

So, I guess I'll go and check whether I did something wrong when
configuring or running this particular benchmark. If that is not the
case, then I would say core scheduling has a serious issue when
dealing with threads rather than with processes.


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-29  9:11                                               ` Dario Faggioli
                                                                   ` (3 preceding siblings ...)
  2019-10-29  9:18                                                 ` Dario Faggioli
@ 2019-10-29  9:18                                                 ` Dario Faggioli
  2019-10-29  9:19                                                 ` Dario Faggioli
  2019-10-29  9:20                                                 ` Dario Faggioli
  6 siblings, 0 replies; 161+ messages in thread
From: Dario Faggioli @ 2019-10-29  9:18 UTC (permalink / raw)
  To: Aaron Lu, Aubrey Li
  Cc: Tim Chen, Julien Desfossez, Li, Aubrey, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 9276 bytes --]

On Tue, 2019-10-29 at 10:11 +0100, Dario Faggioli wrote:
> On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> > I'm using the following branch as base which is v5.1.5 based:
> > https://github.com/digitalocean/linux-coresched coresched-v3-
> > v5.1.5-
> > test
> > 
> > And I have pushed Tim's branch to:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> > 
> > Mine:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> > core_vruntime
> > 
> Hello,
> 
> As anticipated, I've been trying to follow the development of this
> feature and, in the meantime, I have done some benchmarks.
> 
> I actually have a lot of data (and am planning for more), so I am
> sending a few emails, each one with a subset of the numbers in it,
> instead than just one which would be beyond giant! :-)
> 
SYSBENCH
========

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-6_sysbench.txt

                                 v                      v                     BM                     BM                     BM                     BM                     BM                     BM
                             BM-HT                BM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Hmean     1      235.81 (   0.00%)      221.49 (  -6.07%)      245.28 (   4.01%)      230.53 (  -2.24%)      241.40 (   2.37%)      225.00 (  -4.58%)      225.50 (  -4.37%)      202.38 ( -14.18%)
Hmean     4      273.77 (   0.00%)      290.01 (   5.93%)      292.47 (   6.83%)      261.76 (  -4.39%)      287.58 (   5.04%)      281.30 (   2.75%)      274.21 (   0.16%)      271.91 (  -0.68%)
Hmean     7      346.60 (   0.00%)      315.58 (  -8.95%)      345.38 (  -0.35%)      349.29 (   0.78%)      363.76 (   4.95%)      349.09 (   0.72%)      355.69 (   2.62%)      336.69 (  -2.86%)
Hmean     8      343.17 (   0.00%)      353.73 (   3.08%)      409.04 (  19.19%)      411.31 (  19.86%)      406.77 (  18.53%)      306.33 ( -10.74%)      393.70 (  14.72%)      342.73 (  -0.13%)
Stddev    1       44.93 (   0.00%)       50.07 ( -11.44%)       25.05 (  44.24%)       39.22 (  12.71%)       26.27 (  41.54%)       42.77 (   4.81%)       43.63 (   2.90%)       62.88 ( -39.95%)
Stddev    4       16.03 (   0.00%)       23.37 ( -45.76%)       23.77 ( -48.25%)       22.40 ( -39.69%)       18.63 ( -16.19%)       14.37 (  10.35%)        9.34 (  41.72%)       25.21 ( -57.23%)
Stddev    7       22.88 (   0.00%)       37.54 ( -64.07%)       26.57 ( -16.16%)       38.50 ( -68.26%)       59.14 (-158.51%)       26.73 ( -16.83%)       24.58 (  -7.43%)       32.93 ( -43.94%)
Stddev    8       36.74 (   0.00%)       36.60 (   0.39%)      102.82 (-179.86%)       93.56 (-154.65%)       77.33 (-110.47%)       36.16 (   1.58%)       44.27 ( -20.50%)       35.15 (   4.33%)
                                 v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                             VM-HT                VM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Hmean     1      215.16 (   0.00%)      225.80 *   4.95%*      205.74 (  -4.38%)      200.61 (  -6.76%)      169.70 ( -21.13%)      168.84 ( -21.53%)      157.27 ( -26.91%)      162.94 ( -24.27%)
Hmean     4      163.44 (   0.00%)      189.82 *  16.14%*      164.54 (   0.67%)      148.47 (  -9.16%)       40.62 * -75.15%*       53.14 * -67.49%*      129.51 * -20.76%*      158.99 (  -2.72%)
Hmean     7      162.74 (   0.00%)      185.79 (  14.17%)      211.79 (  30.14%)      186.92 (  14.86%)       28.02 * -82.78%*       34.32 * -78.91%*      130.01 ( -20.11%)      145.47 ( -10.61%)
Hmean     8      240.19 (   0.00%)      192.24 ( -19.96%)      192.87 * -19.70%*      194.01 ( -19.23%)       16.92 * -92.95%*       30.55 * -87.28%*      150.51 * -37.34%*      147.67 * -38.52%*
Stddev    1        1.80 (   0.00%)        4.14 (-129.80%)        6.04 (-234.90%)       24.54 (-1261.18%)       61.54 (-3313.19%)       62.92 (-3389.48%)       55.91 (-3000.67%)       58.29 (-3132.75%)
Stddev    4        6.33 (   0.00%)       14.22 (-124.51%)        7.04 ( -11.07%)       13.77 (-117.30%)       13.23 (-108.83%)        5.07 (  19.92%)       15.73 (-148.38%)       14.04 (-121.61%)
Stddev    7       24.70 (   0.00%)       37.50 ( -51.78%)       35.59 ( -44.07%)       29.96 ( -21.26%)       20.07 (  18.77%)       21.06 (  14.74%)       24.97 (  -1.07%)       30.52 ( -23.52%)
Stddev    8       23.23 (   0.00%)       41.13 ( -77.03%)       27.30 ( -17.49%)       41.67 ( -79.35%)        3.75 (  83.88%)       12.62 (  45.68%)       55.87 (-140.49%)       37.12 ( -59.76%)
                                 v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                          VM-v4-HT             VM-v4-noHT                  v4-HT                v4-noHT              v4-csc-HT     v4-csc_stallfix-HT          v4-csc_tim-HT     v4-csc_vruntime-HT
Hmean     1      216.12 (   0.00%)      310.43 (  43.63%)      168.80 ( -21.90%)      196.04 (  -9.29%)      168.27 ( -22.14%)      188.53 ( -12.77%)      176.57 ( -18.30%)      180.12 ( -16.66%)
Hmean     3      161.91 (   0.00%)      160.80 (  -0.69%)      160.33 (  -0.97%)      175.36 (   8.31%)       51.27 * -68.34%*       52.18 * -67.77%*      137.95 ( -14.80%)      166.12 (   2.60%)
Hmean     4      156.44 (   0.00%)      196.25 *  25.45%*      165.19 (   5.60%)      199.78 *  27.71%*       50.67 * -67.61%*       40.42 * -74.16%*      175.03 (  11.88%)      172.66 *  10.37%*
Stddev    1        4.67 (   0.00%)      100.18 (-2043.42%)       50.61 (-982.87%)      141.33 (-2923.68%)       69.08 (-1377.96%)       70.09 (-1399.51%)       47.32 (-912.42%)       42.67 (-813.02%)
Stddev    3       12.62 (   0.00%)        8.76 (  30.60%)       13.18 (  -4.38%)        9.42 (  25.41%)        3.57 (  71.72%)       28.47 (-125.51%)       21.08 ( -67.00%)       19.37 ( -53.48%)
Stddev    4        8.52 (   0.00%)        8.54 (  -0.17%)        3.09 (  63.70%)       13.84 ( -62.42%)       24.33 (-185.50%)        9.30 (  -9.10%)       16.60 ( -94.73%)        9.64 ( -13.07%)
                                 v                      v                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2
                           VMx2-HT              VMx2-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Hmean     1      168.87 (   0.00%)      154.79 (  -8.34%)      190.28 (  12.68%)      151.18 ( -10.48%)      136.73 ( -19.03%)       36.63 * -78.31%*       83.21 ( -50.73%)      124.42 ( -26.32%)
Hmean     4      163.65 (   0.00%)       87.90 * -46.29%*      119.37 * -27.06%*       87.94 * -46.26%*       26.96 * -83.53%*       24.08 * -85.29%*       54.15 * -66.91%*       63.80 * -61.01%*
Hmean     7      181.60 (   0.00%)       89.10 * -50.93%*      148.16 ( -18.41%)       75.71 * -58.31%*       16.98 * -90.65%*       23.92 * -86.83%*       57.28 * -68.46%*       66.10 * -63.60%*
Hmean     8      198.98 (   0.00%)       94.24 * -52.64%*      141.96 ( -28.65%)       96.62 * -51.44%*       23.22 * -88.33%*       29.24 * -85.30%*       80.10 * -59.74%*       80.36 * -59.61%*
Stddev    1       61.59 (   0.00%)       44.71 (  27.41%)       52.14 (  15.33%)       44.61 (  27.56%)       90.12 ( -46.32%)       42.53 (  30.94%)       38.32 (  37.79%)       43.03 (  30.12%)
Stddev    4        8.65 (   0.00%)       21.74 (-151.41%)       21.18 (-144.98%)       22.51 (-160.27%)       19.72 (-128.07%)        4.38 (  49.38%)        2.68 (  68.95%)       14.40 ( -66.55%)
Stddev    7       17.94 (   0.00%)       15.14 (  15.62%)       29.94 ( -66.88%)       17.30 (   3.54%)       26.23 ( -46.20%)        5.17 (  71.17%)       15.98 (  10.95%)       18.43 (  -2.72%)
Stddev    8       38.45 (   0.00%)       19.68 (  48.82%)       44.14 ( -14.78%)       28.84 (  25.00%)       10.64 (  72.33%)       11.65 (  69.71%)       22.63 (  41.15%)       16.00 (  58.39%)

Core scheduling does not seem to be able to handle sysbench well yet.
In this case, things are not too bad on baremetal (and the
best-performing coresched variant is again the one with Tim's
patches).

But things go bad when running the benchmark in VMs, where core
scheduling almost always loses against no HyperThreading, even in the
overcommitted case. For the virt. cases, it is also not
straightforward to tell which set of patches is best, as some runs
favour Tim's and others vruntime's.


* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-29  9:11                                               ` Dario Faggioli
                                                                   ` (4 preceding siblings ...)
  2019-10-29  9:18                                                 ` Dario Faggioli
@ 2019-10-29  9:19                                                 ` Dario Faggioli
  2019-10-29  9:20                                                 ` Dario Faggioli
  6 siblings, 0 replies; 161+ messages in thread
From: Dario Faggioli @ 2019-10-29  9:19 UTC (permalink / raw)
  To: Aaron Lu, Aubrey Li
  Cc: Tim Chen, Julien Desfossez, Li, Aubrey, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 10587 bytes --]

On Tue, 2019-10-29 at 10:11 +0100, Dario Faggioli wrote:
> On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> > I'm using the following branch as base which is v5.1.5 based:
> > https://github.com/digitalocean/linux-coresched coresched-v3-
> > v5.1.5-
> > test
> > 
> > And I have pushed Tim's branch to:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> > 
> > Mine:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> > core_vruntime
> > 
> Hello,
> 
> As anticipated, I've been trying to follow the development of this
> feature and, in the meantime, I have done some benchmarks.
> 
> I actually have a lot of data (and am planning for more), so I am
> sending a few emails, each one with a subset of the numbers in it,
> instead than just one which would be beyond giant! :-)
> 
MEMCACHED/MUTILATE
==================

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-7_mutilate.txt

                                 v                      v                     BM                     BM                     BM                     BM                     BM                     BM
                             BM-HT                BM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Hmean     1    60672.24 (   0.00%)    60476.27 (  -0.32%)    61303.65 (   1.04%)    61353.05 (   1.12%)    57026.54 *  -6.01%*    56874.19 *  -6.26%*    56781.04 *  -6.41%*    56642.48 *  -6.64%*
Hmean     3   149695.40 (   0.00%)   131084.79 * -12.43%*   150802.12 (   0.74%)   131058.25 * -12.45%*   138260.22 *  -7.64%*   138002.31 *  -7.81%*   136907.04 *  -8.54%*   138317.36 *  -7.60%*
Hmean     5   198656.98 (   0.00%)   181719.88 *  -8.53%*   196429.59 (  -1.12%)   186468.57 *  -6.14%*   171612.43 * -13.61%*   176464.24 * -11.17%*   162578.64 * -18.16%*   170196.51 * -14.33%*
Hmean     7   180858.07 (   0.00%)   187088.17 *   3.44%*   181549.92 (   0.38%)   189813.97 *   4.95%*   166898.82 *  -7.72%*   164724.43 *  -8.92%*   143102.58 * -20.88%*   128900.19 * -28.73%*
Hmean     8   157176.91 (   0.00%)   190533.25 *  21.22%*   159795.04 (   1.67%)   188249.31 *  19.77%*   168399.58 *   7.14%*   169700.06 *   7.97%*   137152.66 * -12.74%*   123355.03 * -21.52%*
Stddev    1      551.36 (   0.00%)       52.18 (  90.54%)      306.31 (  44.44%)      385.01 (  30.17%)      353.37 (  35.91%)      350.76 (  36.38%)      476.56 (  13.57%)      279.80 (  49.25%)
Stddev    3     1206.26 (   0.00%)     2368.98 ( -96.39%)      652.07 (  45.94%)     1300.67 (  -7.83%)     1849.15 ( -53.30%)     2434.55 (-101.83%)     1063.43 (  11.84%)     1825.98 ( -51.38%)
Stddev    5     3843.30 (   0.00%)      122.92 (  96.80%)     2883.77 (  24.97%)      481.11 (  87.48%)     4099.56 (  -6.67%)     4424.91 ( -15.13%)      596.25 (  84.49%)     1345.78 (  64.98%)
Stddev    7     2990.97 (   0.00%)     1645.75 (  44.98%)     6567.23 (-119.57%)     5422.57 ( -81.30%)     3191.68 (  -6.71%)     2438.38 (  18.48%)     1211.94 (  59.48%)      705.34 (  76.42%)
Stddev    8     3637.12 (   0.00%)     1490.36 (  59.02%)     3386.42 (   6.89%)     2437.43 (  32.98%)     1056.02 (  70.97%)     1391.07 (  61.75%)     1488.57 (  59.07%)      774.85 (  78.70%)
                                 v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                             VM-HT                VM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Hmean     1    42580.67 (   0.00%)    53440.15 *  25.50%*    54293.03 *  27.51%*    52771.89 *  23.93%*    53689.85 *  26.09%*    53804.35 *  26.36%*    53730.74 *  26.19%*    53902.75 *  26.59%*
Hmean     3   115732.63 (   0.00%)    70651.66 * -38.95%*   125537.18 *   8.47%*    70738.43 * -38.88%*   126041.27 *   8.91%*   126387.30 *   9.21%*   127285.87 *   9.98%*   126816.33 *   9.58%*
Hmean     5   176633.72 (   0.00%)    22349.68 * -87.35%*   182113.02 *   3.10%*    19850.93 * -88.76%*   137910.40 * -21.92%*   180934.44 *   2.43%*   175009.60 (  -0.92%)   174758.21 (  -1.06%)
Hmean     7   182512.86 (   0.00%)    48728.14 * -73.30%*   186272.46 *   2.06%*    49015.69 * -73.14%*   157398.75 * -13.76%*   184989.00 (   1.36%)   172307.74 *  -5.59%*   173203.24 *  -5.10%*
Hmean     8   192244.37 (   0.00%)    65283.21 * -66.04%*   195435.08 (   1.66%)    63616.87 * -66.91%*   176288.98 *  -8.30%*   192413.50 (   0.09%)   188699.07 (  -1.84%)   185870.91 *  -3.32%*
Stddev    1      337.79 (   0.00%)      523.30 ( -54.92%)       22.90 (  93.22%)      353.55 (  -4.67%)      196.63 (  41.79%)      470.55 ( -39.30%)      414.86 ( -22.82%)      190.50 (  43.60%)
Stddev    3     1945.52 (   0.00%)      420.36 (  78.39%)     1365.07 (  29.84%)      954.82 (  50.92%)     1136.12 (  41.60%)     1674.98 (  13.91%)     1539.07 (  20.89%)     1129.09 (  41.96%)
Stddev    5      964.02 (   0.00%)     2121.27 (-120.05%)     3232.44 (-235.31%)     2284.37 (-136.96%)     8374.66 (-768.73%)     3156.06 (-227.39%)      948.17 (   1.64%)     2700.91 (-180.17%)
Stddev    7     1028.64 (   0.00%)     1740.95 ( -69.25%)     2860.04 (-178.04%)     3366.68 (-227.29%)     3360.79 (-226.72%)     4762.47 (-362.99%)     2968.56 (-188.59%)     1429.88 ( -39.01%)
Stddev    8     3794.15 (   0.00%)     1893.82 (  50.09%)      565.06 (  85.11%)      893.04 (  76.46%)     3201.43 (  15.62%)     1931.66 (  49.09%)      360.93 (  90.49%)     1945.37 (  48.73%)
                                 v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                          VM-v4-HT             VM-v4-noHT                  v4-HT                v4-noHT              v4-csc-HT     v4-csc_stallfix-HT          v4-csc_tim-HT     v4-csc_vruntime-HT
Hmean     1    44216.52 (   0.00%)    53968.94 *  22.06%*    54078.28 *  22.30%*    54290.57 *  22.78%*    53552.08 *  21.11%*    51342.78 *  16.12%*    53967.09 *  22.05%*    53790.48 *  21.65%*
Hmean     3   105120.19 (   0.00%)   120769.59 *  14.89%*   121460.04 *  15.54%*   117410.09 *  11.69%*   120296.87 *  14.44%*   109089.25 (   3.78%)   123133.70 *  17.14%*   123614.82 *  17.59%*
Hmean     4   144618.05 (   0.00%)   167039.86 *  15.50%*   178922.55 *  23.72%*   175641.36 *  21.45%*   185338.19 *  28.16%*   148152.02 (   2.44%)   179709.39 *  24.26%*   179754.74 *  24.30%*
Stddev    1      144.30 (   0.00%)      101.12 (  29.93%)      420.02 (-191.08%)      294.43 (-104.04%)      300.29 (-108.10%)      330.58 (-129.10%)      353.69 (-145.11%)      142.58 (   1.19%)
Stddev    3     3300.71 (   0.00%)      905.78 (  72.56%)     2537.34 (  23.13%)     3289.78 (   0.33%)     1113.35 (  66.27%)     2685.20 (  18.65%)     8425.99 (-155.28%)     2280.72 (  30.90%)
Stddev    4     1828.49 (   0.00%)     9851.42 (-438.77%)     2779.22 ( -52.00%)     3312.07 ( -81.14%)     7392.50 (-304.30%)     6090.54 (-233.09%)     3320.77 ( -81.61%)     4836.89 (-164.53%)
                                 v                      v                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2
                           VMx2-HT              VMx2-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Hmean     1    53045.85 (   0.00%)    35310.51 * -33.43%*    38949.82 * -26.57%*    35500.55 * -33.08%*    49350.06 *  -6.97%*    35372.44 * -33.32%*    40556.30 * -23.54%*    42238.57 * -20.37%*
Hmean     3   124865.12 (   0.00%)     7997.67 * -93.59%*    57204.70 * -54.19%*     8373.67 * -93.29%*    56781.47 * -54.53%*    38892.18 * -68.85%*    51207.79 * -58.99%*    46151.33 * -63.04%*
Hmean     5   178848.31 (   0.00%)     4984.40 * -97.21%*    31431.45 * -82.43%*     5101.29 * -97.15%*    41204.02 * -76.96%*    24592.26 * -86.25%*    33541.47 * -81.25%*    28762.54 * -83.92%*
Hmean     7   187708.49 (   0.00%)    12945.53 * -93.10%*    57937.51 * -69.13%*    12898.91 * -93.13%*    89180.74 * -52.49%*    48390.34 * -74.22%*    54198.33 * -71.13%*    50261.68 * -73.22%*
Hmean     8   199929.38 (   0.00%)    27516.10 * -86.24%*    77578.88 * -61.20%*    27608.20 * -86.19%*   117871.45 * -41.04%*    65719.85 * -67.13%*    76669.78 * -61.65%*    67983.03 * -66.00%*
Stddev    1      404.57 (   0.00%)      708.20 ( -75.05%)      102.16 (  74.75%)      258.18 (  36.19%)      132.11 (  67.35%)     1145.71 (-183.19%)      740.95 ( -83.14%)       91.67 (  77.34%)
Stddev    3     1867.11 (   0.00%)      729.15 (  60.95%)     2529.52 ( -35.48%)     1212.31 (  35.07%)     1765.46 (   5.44%)     2698.45 ( -44.53%)      376.23 (  79.85%)     2108.90 ( -12.95%)
Stddev    5     1556.91 (   0.00%)      193.16 (  87.59%)     3307.78 (-112.46%)      207.87 (  86.65%)     1401.20 (  10.00%)     1945.52 ( -24.96%)     1821.44 ( -16.99%)     2166.87 ( -39.18%)
Stddev    7     3622.13 (   0.00%)     1690.01 (  53.34%)     5268.81 ( -45.46%)      690.12 (  80.95%)     8337.77 (-130.19%)     3737.07 (  -3.17%)     1900.15 (  47.54%)     3388.45 (   6.45%)
Stddev    8     1913.74 (   0.00%)     1460.07 (  23.71%)     2068.35 (  -8.08%)     1750.51 (   8.53%)     4229.08 (-120.99%)     3477.66 ( -81.72%)     3318.68 ( -73.41%)     1338.76 (  30.04%)

Mutilate brings us back to the pattern where core scheduling does badly
on bare metal, but fine in VMs. The benchmark seems to be much more
sensitive to HyperThreading when run inside virtual machines, and in
such cases core scheduling guarantees better results than disabling
hyperthreading. Interestingly, this time the effect is more evident in
the *non* overcommitted scenario.

In fact, with one 8-vCPU VM, v-VM-noHT is -87.35% while
VM-csc_vruntime-HT is only -1.06%, which is really good. In the
overcommitted case, v-VMx2-noHT reaches -97.21% while
VMx2-csc_vruntime-HT is -83.92%. So, yeah, better, but not as much
better as before.

-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-29  9:11                                               ` Dario Faggioli
                                                                   ` (5 preceding siblings ...)
  2019-10-29  9:19                                                 ` Dario Faggioli
@ 2019-10-29  9:20                                                 ` Dario Faggioli
  2019-10-29 20:34                                                   ` Julien Desfossez
  6 siblings, 1 reply; 161+ messages in thread
From: Dario Faggioli @ 2019-10-29  9:20 UTC (permalink / raw)
  To: Aaron Lu, Aubrey Li
  Cc: Tim Chen, Julien Desfossez, Li, Aubrey, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 9533 bytes --]

On Tue, 2019-10-29 at 10:11 +0100, Dario Faggioli wrote:
> On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> > I'm using the following branch as base which is v5.1.5 based:
> > https://github.com/digitalocean/linux-coresched coresched-v3-
> > v5.1.5-
> > test
> > 
> > And I have pushed Tim's branch to:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> > 
> > Mine:
> > https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> > core_vruntime
> > 
> Hello,
> 
> As anticipated, I've been trying to follow the development of this
> feature and, in the meantime, I have done some benchmarks.
> 
> I actually have a lot of data (and am planning for more), so I am
> sending a few emails, each one with a subset of the numbers in it,
> instead of just one, which would be beyond giant! :-)
> 
NETPERF-UNIX
============

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-7_mutilate.txt

                                    v                      v                     BM                     BM                     BM                     BM                     BM                     BM
                                BM-HT                BM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Hmean     64        984.93 (   0.00%)     1011.24 (   2.67%)     1000.82 (   1.61%)     1001.98 (   1.73%)      947.61 (  -3.79%)      823.10 * -16.43%*      789.89 * -19.80%*      928.11 *  -5.77%*
Hmean     256      3683.78 (   0.00%)     3763.52 (   2.16%)     3788.96 (   2.86%)     3725.17 (   1.12%)     1254.25 * -65.95%*     1261.92 * -65.74%*     1264.02 * -65.69%*     1260.30 * -65.79%*
Hmean     2048     7928.28 (   0.00%)     7845.97 (  -1.04%)     7911.88 (  -0.21%)     7809.11 *  -1.50%*     5334.65 * -32.71%*     5340.97 * -32.63%*     5337.93 * -32.67%*     5394.57 * -31.96%*
Hmean     8192     8134.23 (   0.00%)     8096.88 (  -0.46%)     8258.84 (   1.53%)     8076.36 (  -0.71%)     5374.33 * -33.93%*     5394.12 * -33.69%*     5504.00 * -32.34%*     5447.92 * -33.02%*
Stddev    64         30.46 (   0.00%)       31.07 (  -2.00%)       15.36 (  49.58%)       41.51 ( -36.26%)       54.71 ( -79.62%)      121.84 (-299.98%)       81.75 (-168.35%)       33.75 ( -10.79%)
Stddev    256       102.26 (   0.00%)      104.86 (  -2.55%)      107.90 (  -5.52%)      116.17 ( -13.61%)        3.32 (  96.75%)        5.79 (  94.34%)       14.69 (  85.63%)       19.37 (  81.06%)
Stddev    2048       94.73 (   0.00%)       48.12 (  49.21%)      137.50 ( -45.15%)       43.30 (  54.29%)       58.65 (  38.08%)       50.97 (  46.20%)       57.43 (  39.38%)       36.39 (  61.58%)
Stddev    8192      172.77 (   0.00%)       48.68 (  71.83%)      261.65 ( -51.45%)       76.30 (  55.83%)       40.65 (  76.47%)       27.61 (  84.02%)       65.26 (  62.23%)       31.66 (  81.68%)
                                    v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                                VM-HT                VM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Hmean     64        516.67 (   0.00%)      585.50 (  13.32%)      592.54 (  14.68%)      582.70 (  12.78%)      591.32 (  14.45%)      546.68 (   5.81%)      602.58 (  16.63%)      729.10 *  41.12%*
Hmean     256      1070.01 (   0.00%)     1306.95 *  22.14%*     1193.89 *  11.58%*     1271.17 *  18.80%*     1362.78 *  27.36%*     1171.49 (   9.48%)     1335.98 *  24.86%*     1248.79 *  16.71%*
Hmean     2048     5002.14 (   0.00%)     6865.72 *  37.26%*     5569.42 (  11.34%)     5074.48 (   1.45%)     5849.11 (  16.93%)     4745.97 (  -5.12%)     6330.87 (  26.56%)     6418.51 (  28.32%)
Hmean     8192     5116.24 (   0.00%)     7494.15 *  46.48%*     6960.17 *  36.04%*     6009.31 (  17.46%)     6114.30 (  19.51%)     5226.13 (   2.15%)     6316.40 (  23.46%)     7924.66 *  54.89%*
Stddev    64         81.30 (   0.00%)      139.96 ( -72.15%)      162.38 ( -99.72%)      113.29 ( -39.35%)      150.34 ( -84.91%)      162.05 ( -99.32%)      163.74 (-101.39%)       47.94 (  41.04%)
Stddev    256        64.89 (   0.00%)      130.57 (-101.20%)      115.02 ( -77.23%)      120.49 ( -85.68%)      106.63 ( -64.31%)      140.18 (-116.01%)      118.29 ( -82.28%)      133.66 (-105.96%)
Stddev    2048      779.45 (   0.00%)      767.31 (   1.56%)     1869.81 (-139.89%)     1265.84 ( -62.40%)     1249.59 ( -60.32%)      506.63 (  35.00%)     1427.22 ( -83.11%)     2296.29 (-194.60%)
Stddev    8192      942.88 (   0.00%)     2559.80 (-171.49%)     1207.26 ( -28.04%)     1776.59 ( -88.42%)     1405.17 ( -49.03%)     1379.48 ( -46.30%)     2565.86 (-172.13%)     1066.06 ( -13.06%)
                                    v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                             VM-v4-HT             VM-v4-noHT                  v4-HT                v4-noHT              v4-csc-HT     v4-csc_stallfix-HT          v4-csc_tim-HT     v4-csc_vruntime-HT
Hmean     64        626.51 (   0.00%)      535.18 ( -14.58%)      610.07 (  -2.62%)      509.04 ( -18.75%)      552.16 ( -11.87%)      471.44 * -24.75%*      484.50 * -22.67%*      488.32 * -22.06%*
Hmean     256       999.57 (   0.00%)     1159.65 *  16.02%*     1209.25 *  20.98%*     1217.94 *  21.85%*     1196.13 *  19.66%*     1286.01 *  28.66%*     1154.57 *  15.51%*     1238.40 *  23.89%*
Hmean     2048     3882.52 (   0.00%)     4483.92 (  15.49%)     4969.62 *  28.00%*     4910.98 *  26.49%*     4646.33 *  19.67%*     5247.76 *  35.16%*     4515.47 *  16.30%*     5096.38 *  31.26%*
Hmean     8192     4086.48 (   0.00%)     4935.20 (  20.77%)     4711.62 *  15.30%*     5067.04 (  24.00%)     5887.99 *  44.08%*     5360.18 *  31.17%*     5847.04 *  43.08%*     5990.50 *  46.59%*
Stddev    64        134.26 (   0.00%)      117.95 (  12.15%)       67.29 (  49.88%)       88.91 (  33.78%)       78.73 (  41.36%)       22.34 (  83.36%)       45.36 (  66.22%)       64.62 (  51.87%)
Stddev    256        36.69 (   0.00%)       32.94 (  10.22%)       93.79 (-155.60%)       52.76 ( -43.79%)       72.03 ( -96.31%)       94.69 (-158.06%)       32.05 (  12.65%)       62.31 ( -69.82%)
Stddev    2048       64.82 (   0.00%)      785.16 (-1111.23%)     1086.67 (-1576.36%)      863.76 (-1232.48%)      552.05 (-751.62%)      597.66 (-821.99%)      300.86 (-364.12%)     1057.90 (-1531.98%)
Stddev    8192      248.29 (   0.00%)     1345.78 (-442.02%)      636.92 (-156.53%)     1497.43 (-503.10%)     1528.17 (-515.48%)      204.43 (  17.66%)      788.65 (-217.63%)     1380.18 (-455.88%)
                                    v                      v                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2
                              VMx2-HT              VMx2-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
Hmean     64        575.39 (   0.00%)      230.57 * -59.93%*      525.97 (  -8.59%)      241.09 * -58.10%*      671.83 (  16.76%)      574.64 (  -0.13%)      676.93 (  17.65%)      713.81 (  24.06%)
Hmean     256      1243.12 (   0.00%)      679.95 * -45.30%*     1262.76 (   1.58%)      646.82 * -47.97%*     1607.80 *  29.34%*     1297.86 (   4.40%)     1573.09 *  26.54%*     1244.30 (   0.09%)
Hmean     2048     4448.89 (   0.00%)     3020.71 * -32.10%*     4460.65 (   0.26%)     3342.89 * -24.86%*     7086.92 *  59.30%*     4544.81 (   2.16%)     4209.05 (  -5.39%)     4346.58 (  -2.30%)
Hmean     8192     5539.82 (   0.00%)     3118.35 * -43.71%*     4003.50 * -27.73%*     2931.43 * -47.08%*     6069.77 (   9.57%)     5571.91 (   0.58%)     5143.26 (  -7.16%)     4245.56 * -23.36%*
Stddev    64        128.48 (   0.00%)       33.63 (  73.82%)      111.27 (  13.39%)       64.39 (  49.89%)       79.47 (  38.14%)      189.98 ( -47.87%)       89.67 (  30.20%)      135.17 (  -5.21%)
Stddev    256       191.78 (   0.00%)      252.00 ( -31.40%)      225.33 ( -17.49%)      123.00 (  35.86%)      183.12 (   4.52%)      231.23 ( -20.57%)      161.13 (  15.98%)       66.95 (  65.09%)
Stddev    2048      463.85 (   0.00%)     1364.71 (-194.21%)     1390.20 (-199.71%)      382.49 (  17.54%)     1271.89 (-174.20%)     1058.81 (-128.27%)      602.49 ( -29.89%)      595.07 ( -28.29%)
Stddev    8192     1230.77 (   0.00%)     2402.19 ( -95.18%)      567.52 (  53.89%)      511.30 (  58.46%)     2551.77 (-107.33%)     1065.61 (  13.42%)     1070.69 (  13.01%)      512.77 (  58.34%)

As in many other instances, bare metal suffers with core scheduling,
while VMs do reasonably well, especially when there is overcommit.

---

Ok, that's it for now... Any comments, discussion, feedback, etc., are
more than welcome.

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-29  9:20                                                 ` Dario Faggioli
@ 2019-10-29 20:34                                                   ` Julien Desfossez
  2019-11-15 16:30                                                     ` Dario Faggioli
  0 siblings, 1 reply; 161+ messages in thread
From: Julien Desfossez @ 2019-10-29 20:34 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Aaron Lu, Aubrey Li, Tim Chen, Li, Aubrey, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 29-Oct-2019 10:20:57 AM, Dario Faggioli wrote:
> > Hello,
> > 
> > As anticipated, I've been trying to follow the development of this
> > feature and, in the meantime, I have done some benchmarks.
> > 
> > I actually have a lot of data (and am planning for more), so I am
> > sending a few emails, each one with a subset of the numbers in it,
> > instead of just one, which would be beyond giant! :-)
> > 

Hi Dario,

Thank you for this comprehensive set of tests and analyses!

It confirms the trend we are seeing for the VM cases. Basically, when
the CPUs are overcommitted, core scheduling helps compared to noHT. But
when we have I/O in the mix (sysbench-oltp), it becomes a bit less
clear; it depends on whether the CPU is still overcommitted or not.
About the 2nd VM that is doing the background noise: is it enough to
fill up the disk queues, or is its disk throughput somewhat limited?
Have you compared the results if you disable the disk noise?

Our approach for bare-metal tests is a bit different: we constrain a
set of processes to a limited set of cpus. But I like your approach,
because it pushes the number of processes harder against the whole
system. And I have no explanation for why sysbench thread vs process is
so different.

And it also confirms that core scheduling has trouble scaling with the
number of threads. It works pretty well in VMs because the number of
threads is limited by the number of vcpus, but the bare-metal cases
show a major scaling issue (which is not too surprising).

I am curious: for the tagging in KVM, do you move all the vcpus into
the same cgroup before tagging? Did you leave the emulator threads
untagged at all times?

For the overhead (without tagging), have you tried bisecting the
patchset to see which patch introduces the overhead? It is more than I
had in mind.

And for the cases where core scheduling improves the performance
compared to the baseline numbers, could it be related to frequency
scaling (more work to do means a higher chance of running at a higher
frequency)?

We are almost ready to send the v4 patchset (most likely tomorrow); it
has been rebased on v5.3.5, so stay tuned and ready for another set of
tests ;-)

Thanks,

Julien

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-09-18 20:40                                                 ` Tim Chen
  2019-09-18 22:16                                                   ` Aubrey Li
@ 2019-10-29 20:40                                                   ` Julien Desfossez
  2019-11-01 21:42                                                     ` Tim Chen
  1 sibling, 1 reply; 161+ messages in thread
From: Julien Desfossez @ 2019-10-29 20:40 UTC (permalink / raw)
  To: Tim Chen
  Cc: Aubrey Li, Aaron Lu, Dario Faggioli, Li, Aubrey, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 18-Sep-2019 01:40:58 PM, Tim Chen wrote:
> I think the test that's of interest is to see my load balancing added on top
> of Aaron's fairness patch, instead of using my previous version of
> forced idle approach in coresched-v3-v5.1.5-test-tim branch. 
>  
> I've added my two load balance patches on top of Aaron's patches
> in coresched-v3-v5.1.5-test-core_vruntime branch and put it in
> 
> https://github.com/pdxChen/gang/tree/coresched-v3-v5.1.5-test-core_vruntime-lb

We have been trying to benchmark with the load balancer patches and have
experienced some hard lockups with the saturated test cases, but we
don't have traces for now.

Since we are mostly focused on testing the rebased v4 currently, we will
post it without those patches and then we can try to debug more.

Thanks,

Julien

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-29 20:40                                                   ` Julien Desfossez
@ 2019-11-01 21:42                                                     ` Tim Chen
  0 siblings, 0 replies; 161+ messages in thread
From: Tim Chen @ 2019-11-01 21:42 UTC (permalink / raw)
  To: Julien Desfossez
  Cc: Aubrey Li, Aaron Lu, Dario Faggioli, Li, Aubrey, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 10/29/19 1:40 PM, Julien Desfossez wrote:
> On 18-Sep-2019 01:40:58 PM, Tim Chen wrote:
>> I think the test that's of interest is to see my load balancing added on top
>> of Aaron's fairness patch, instead of using my previous version of
>> forced idle approach in coresched-v3-v5.1.5-test-tim branch. 
>>  
>> I've added my two load balance patches on top of Aaron's patches
>> in coresched-v3-v5.1.5-test-core_vruntime branch and put it in
>>
>> https://github.com/pdxChen/gang/tree/coresched-v3-v5.1.5-test-core_vruntime-lb
> 
> We have been trying to benchmark with the load balancer patches and have
> experienced some hard lockups with the saturated test cases, but we
> don't have traces for now.
> 
> Since we are mostly focused on testing the rebased v4 currently, we will
> post it without those patches and then we can try to debug more.
> 

Aubrey has been experimenting with a couple of patches that try to move
a task to a cpu with a matching cookie on wakeup and load balance. They
are much simpler than mine and got better performance on the workload
he was testing. He'll rebase those patches on v4 and post them.
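The wakeup idea can be illustrated with a toy model (this is not the
kernel code; the CPU/core layout, the cookie values, and the helper
name below are all invented for illustration): among the idle CPUs,
prefer one whose core is already running a task with the same cookie,
so that SMT siblings never mix cookies.

```python
# Toy sketch of cookie-aware wakeup CPU selection. A "cookie" identifies
# a group of mutually trusting tasks; siblings of one core may only run
# tasks sharing a cookie, so placing a waking task on a core that already
# runs its cookie avoids forcing the sibling idle.
def select_cpu(task_cookie, core_cookie, cpu_core, idle_cpus):
    """Return an idle CPU whose core's cookie matches, else a fully idle core.

    core_cookie: cookie currently running on each core (None if core idle)
    cpu_core:    core id for each cpu
    idle_cpus:   iterable of idle cpu ids
    """
    fallback = None
    for cpu in idle_cpus:
        core = cpu_core[cpu]
        if core_cookie[core] == task_cookie:
            return cpu                      # sibling already runs our cookie
        if core_cookie[core] is None and fallback is None:
            fallback = cpu                  # remember a fully idle core
    return fallback

# cpus 0/1 are siblings on core 0; cpus 2/3 on core 1.
cpu_core = {0: 0, 1: 0, 2: 1, 3: 1}
core_cookie = {0: "A", 1: None}
print(select_cpu("A", core_cookie, cpu_core, [1, 2]))  # -> 1 (cookie match)
print(select_cpu("B", core_cookie, cpu_core, [1, 2]))  # -> 2 (idle core)
```

A matching placement means neither sibling is force-idled, which is
exactly where the bare-metal scaling losses discussed here come from.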

Tim

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [RFC PATCH v3 00/16] Core scheduling v3
  2019-10-29 20:34                                                   ` Julien Desfossez
@ 2019-11-15 16:30                                                     ` Dario Faggioli
  0 siblings, 0 replies; 161+ messages in thread
From: Dario Faggioli @ 2019-11-15 16:30 UTC (permalink / raw)
  To: Julien Desfossez
  Cc: Aaron Lu, Aubrey Li, Tim Chen, Li, Aubrey, Subhra Mazumdar,
	Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 3857 bytes --]

On Tue, 2019-10-29 at 16:34 -0400, Julien Desfossez wrote:
> On 29-Oct-2019 10:20:57 AM, Dario Faggioli wrote:
> > > Hello,
> > > 
> > > As anticipated, I've been trying to follow the development of
> > > this
> > > feature and, in the meantime, I have done some benchmarks.
> 
> Hi Dario,
> 
Hi!

> Thank you for this comprehensive set of tests and analyses !
> 
Sure. And sorry for replying so late. I was travelling (speaking about
core scheduling and virtualization at KVMForum :-P) and, after that,
had some catching up to do.

> It confirms the trend we are seeing for the VM cases. Basically when
> the
> CPUs are overcommitted, core scheduling helps compared to noHT. 
>
Yep.

> But when
> we have I/O in the mix (sysbench-oltp), then it becomes a bit less
> clear, it depends if the CPU is still overcommitted or not. About the
> 2nd VM that is doing the background noise, is it enough to fill up
> the
> disk queues or is its disk throughput somewhat limited ? Have you
> compared the results if you disable the disk noise ?
> 
There was some I/O, but it was mostly CPU noise. Anyway, sure, I can
repeat the experiments with different kinds of noise. TBH, I also have
other ideas for different setups. And, of course, I'll work on v4 now.

> Our approach for bare-metal tests is a bit different, we are
> constraining a set of processes only on a limited set of cpus, but I
> like your approach because it pushes more the number of processes
> against the whole system. 
>
Yes, and this time I deliberately chose a small system, to avoid NUMA
effects, etc. But I'm working toward running the evaluation on a bigger
box.

> I am curious, for the tagging in KVM, do you move all the vcpus into
> the
> same cgroup before tagging ?  Did you leave the emulator threads
> untagged at all time ?
> 
So, for this round, yes, all the vcpus of the VM were put in the same
cgroup, and then I set the tag for it.

All the other threads that libvirt creates were left out of that group
(and were, hence, untagged). I did a few manual runs with _all_ the
tasks related to a VM in a tagged cgroup, but I did not see much
difference (that's why the numbers for those runs are not reported).

The VM did not have any virtual topology defined.

And in fact, one thing that I want to try is to put pairs of vcpus in
the same cgroup, tag it, and define a virtual HT topology for the VM
(i.e., mark the two vcpus that will be in the same cgroup with the same
tag as threads of the same core).
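The per-VM tagging flow described above can be sketched as follows. This
is only an illustration under stated assumptions: the v3 cgroup
interface exposes a per-group "cpu.tag" file, a scratch directory stands
in for the real /sys/fs/cgroup/cpu mount, and the vcpu TIDs are
invented:

```python
# Sketch: put all of a VM's vcpu threads into one cpu cgroup, then tag
# the group so its tasks share a core-scheduling cookie. Emulator
# threads are simply never written into the group, so they stay untagged.
import pathlib
import tempfile

def tag_vm_cgroup(cgroup_root, name, vcpu_tids):
    grp = cgroup_root / name
    grp.mkdir(parents=True, exist_ok=True)
    # Move every vcpu thread into the group first (on real cgroupfs this
    # would be one write per TID to the "tasks" file, not a single write).
    (grp / "tasks").write_text("".join(f"{tid}\n" for tid in vcpu_tids))
    # ...then set the tag on the whole group.
    (grp / "cpu.tag").write_text("1\n")
    return grp

root = pathlib.Path(tempfile.mkdtemp())       # stand-in for /sys/fs/cgroup/cpu
vm = tag_vm_cgroup(root, "vm1", [1001, 1002, 1003, 1004])
print((vm / "cpu.tag").read_text().strip())   # -> 1
```

The pairs-of-vcpus experiment mentioned above would be the same flow
with one two-vcpu group (and one tag) per virtual core.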

> For the overhead (without tagging), have you tried bisecting the
> patchset to see which patch introduces the overhead ? it is more than
> I
> had in mind.
> 
Yes, there is definitely something weird. Well, in the meantime, I have
improved my automated procedure for running the benchmarks. I'll rerun
on v4, and I'll do a bisect if the overhead is still that big.

> And for the cases when core scheduling improves the performance
> compared
> to the baseline numbers, could it be related to frequency scaling
> (more
> work to do means a higher chance of running at a higher frequency) ?
> 
The governor was 'performance' during all the experiments. But yes,
since it's intel_pstate that is in charge, the frequency can still
vary, and something like what you suggest may indeed be happening, I
think.

> We are almost ready to send the v4 patchset (most likely tomorrow),
> it
> has been rebased on v5.3.5, so stay tuned and ready for another set
> of
> tests ;-)
> 
Already on it. :-)

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

end of thread, other threads:[~2019-11-15 16:30 UTC | newest]

Thread overview: 161+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-29 20:36 [RFC PATCH v3 00/16] Core scheduling v3 Vineeth Remanan Pillai
2019-05-29 20:36 ` [RFC PATCH v3 01/16] stop_machine: Fix stop_cpus_in_progress ordering Vineeth Remanan Pillai
2019-08-08 10:54   ` [tip:sched/core] " tip-bot for Peter Zijlstra
2019-08-26 16:19   ` [RFC PATCH v3 01/16] " mark gross
2019-08-26 16:59     ` Peter Zijlstra
2019-05-29 20:36 ` [RFC PATCH v3 02/16] sched: Fix kerneldoc comment for ia64_set_curr_task Vineeth Remanan Pillai
2019-08-08 10:55   ` [tip:sched/core] " tip-bot for Peter Zijlstra
2019-08-26 16:20   ` [RFC PATCH v3 02/16] " mark gross
2019-05-29 20:36 ` [RFC PATCH v3 03/16] sched: Wrap rq::lock access Vineeth Remanan Pillai
2019-05-29 20:36 ` [RFC PATCH v3 04/16] sched/{rt,deadline}: Fix set_next_task vs pick_next_task Vineeth Remanan Pillai
2019-08-08 10:55   ` [tip:sched/core] " tip-bot for Peter Zijlstra
2019-05-29 20:36 ` [RFC PATCH v3 05/16] sched: Add task_struct pointer to sched_class::set_curr_task Vineeth Remanan Pillai
2019-08-08 10:57   ` [tip:sched/core] " tip-bot for Peter Zijlstra
2019-05-29 20:36 ` [RFC PATCH v3 06/16] sched/fair: Export newidle_balance() Vineeth Remanan Pillai
2019-08-08 10:58   ` [tip:sched/core] sched/fair: Expose newidle_balance() tip-bot for Peter Zijlstra
2019-05-29 20:36 ` [RFC PATCH v3 07/16] sched: Allow put_prev_task() to drop rq->lock Vineeth Remanan Pillai
2019-08-08 10:58   ` [tip:sched/core] " tip-bot for Peter Zijlstra
2019-08-26 16:51   ` [RFC PATCH v3 07/16] " mark gross
2019-05-29 20:36 ` [RFC PATCH v3 08/16] sched: Rework pick_next_task() slow-path Vineeth Remanan Pillai
2019-08-08 10:59   ` [tip:sched/core] " tip-bot for Peter Zijlstra
2019-08-26 17:01   ` [RFC PATCH v3 08/16] " mark gross
2019-05-29 20:36 ` [RFC PATCH v3 09/16] sched: Introduce sched_class::pick_task() Vineeth Remanan Pillai
2019-08-26 17:14   ` mark gross
2019-05-29 20:36 ` [RFC PATCH v3 10/16] sched: Core-wide rq->lock Vineeth Remanan Pillai
2019-05-31 11:08   ` Peter Zijlstra
2019-05-31 15:23     ` Vineeth Pillai
2019-05-29 20:36 ` [RFC PATCH v3 11/16] sched: Basic tracking of matching tasks Vineeth Remanan Pillai
2019-08-26 20:59   ` mark gross
2019-05-29 20:36 ` [RFC PATCH v3 12/16] sched: A quick and dirty cgroup tagging interface Vineeth Remanan Pillai
2019-05-29 20:36 ` [RFC PATCH v3 13/16] sched: Add core wide task selection and scheduling Vineeth Remanan Pillai
2019-06-07 23:36   ` Pawan Gupta
2019-05-29 20:36 ` [RFC PATCH v3 14/16] sched/fair: Add a few assertions Vineeth Remanan Pillai
2019-05-29 20:36 ` [RFC PATCH v3 15/16] sched: Trivial forced-newidle balancer Vineeth Remanan Pillai
2019-05-29 20:36 ` [RFC PATCH v3 16/16] sched: Debug bits Vineeth Remanan Pillai
2019-05-29 21:02   ` Peter Oskolkov
2019-05-30 14:04 ` [RFC PATCH v3 00/16] Core scheduling v3 Aubrey Li
2019-05-30 14:17   ` Julien Desfossez
2019-05-31  4:55     ` Aubrey Li
2019-05-31  3:01   ` Aaron Lu
2019-05-31  5:12     ` Aubrey Li
2019-05-31  6:09       ` Aaron Lu
2019-05-31  6:53         ` Aubrey Li
2019-05-31  7:44           ` Aaron Lu
2019-05-31  8:26             ` Aubrey Li
2019-05-31 21:08     ` Julien Desfossez
2019-06-06 15:26       ` Julien Desfossez
2019-06-12  1:52         ` Li, Aubrey
2019-06-12 16:06           ` Julien Desfossez
2019-06-12 16:33         ` Julien Desfossez
2019-06-13  0:03           ` Subhra Mazumdar
2019-06-13  3:22             ` Julien Desfossez
2019-06-17  2:51               ` Aubrey Li
2019-06-19 18:33                 ` Julien Desfossez
2019-07-18 10:07                   ` Aaron Lu
2019-07-18 23:27                     ` Tim Chen
2019-07-19  5:52                       ` Aaron Lu
2019-07-19 11:48                         ` Aubrey Li
2019-07-19 18:33                         ` Tim Chen
2019-07-22 10:26                     ` Aubrey Li
2019-07-22 10:43                       ` Aaron Lu
2019-07-23  2:52                         ` Aubrey Li
2019-07-25 14:30                       ` Aaron Lu
2019-07-25 14:31                         ` [RFC PATCH 1/3] wrapper for cfs_rq->min_vruntime Aaron Lu
2019-07-25 14:32                         ` [PATCH 2/3] core vruntime comparison Aaron Lu
2019-08-06 14:17                           ` Peter Zijlstra
2019-07-25 14:33                         ` [PATCH 3/3] temp hack to make tick based schedule happen Aaron Lu
2019-07-25 21:42                         ` [RFC PATCH v3 00/16] Core scheduling v3 Li, Aubrey
2019-07-26 15:21                         ` Julien Desfossez
2019-07-26 21:29                           ` Tim Chen
2019-07-31  2:42                           ` Li, Aubrey
2019-08-02 15:37                             ` Julien Desfossez
2019-08-05 15:55                               ` Tim Chen
2019-08-06  3:24