LKML Archive on lore.kernel.org
* [RFC][PATCH 00/16] sched: Core scheduling
@ 2019-02-18 16:56 Peter Zijlstra
  2019-02-18 16:56 ` [RFC][PATCH 01/16] stop_machine: Fix stop_cpus_in_progress ordering Peter Zijlstra
                   ` (19 more replies)
  0 siblings, 20 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 16:56 UTC (permalink / raw)
  To: mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Peter Zijlstra (Intel)


A much 'demanded' feature: core-scheduling :-(

I still hate it with a passion, and that is part of why it took a little
longer than 'promised'.

While this one doesn't have all the 'features' of the previous (never
published) version and isn't L1TF 'complete', I tend to like the structure
better (relatively speaking: I hate it slightly less).

This one is sched class agnostic and therefore, in principle, doesn't horribly
wreck RT (in fact, RT could 'ab'use this by setting 'task->core_cookie = task'
to force-idle siblings).
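
To make the cookie idea concrete, here is a minimal userspace sketch. It assumes
that SMT siblings may only run tasks whose cookies compare equal; that rule, the
'struct task' type, and the can_share_core() and make_exclusive() helpers are
hypothetical and not taken from these patches. Only the 'core_cookie' name and
the "cookie = task pointer" trick come from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for task_struct; only core_cookie matters here. */
struct task {
	uintptr_t core_cookie;
};

/* Assumed rule: SMT siblings may only run tasks whose cookies are equal. */
static bool can_share_core(const struct task *a, const struct task *b)
{
	return a->core_cookie == b->core_cookie;
}

/*
 * The 'ab'use mentioned above: a cookie equal to the task's own address
 * is unique, so no other task matches and the siblings are forced idle.
 */
static void make_exclusive(struct task *t)
{
	assert(t != NULL);
	t->core_cookie = (uintptr_t)t;
}
```

Under this assumed rule, a task made exclusive this way matches only itself.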

Now, as hinted by that, there are semi-sane reasons for actually having this.
Various hardware features, like Intel RDT Memory Bandwidth Allocation, work
per core (due to SMT fundamentally sharing caches), and therefore grouping
related tasks on a core makes them more reliable.

However, whichever way around you turn this cookie, it is expensive and nasty.

It doesn't help that there are truly bonghit crazy proposals for using this out
there, and I really hope to never see them in code.

These patches are lightly tested and didn't insta explode, but no promises,
they might just set your pets on fire.

'enjoy'

@pjt; I know this isn't quite what we talked about, but this is where I ended
up after I started typing. There are plenty of design decisions to question and my
changelogs don't even get close to beginning to cover them all. Feel free to ask.

---
 include/linux/sched.h    |   9 +-
 kernel/Kconfig.preempt   |   8 +-
 kernel/sched/core.c      | 762 ++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/deadline.c  |  99 +++---
 kernel/sched/debug.c     |   4 +-
 kernel/sched/fair.c      | 129 +++++---
 kernel/sched/idle.c      |  42 ++-
 kernel/sched/pelt.h      |   2 +-
 kernel/sched/rt.c        |  96 +++---
 kernel/sched/sched.h     | 183 ++++++++----
 kernel/sched/stop_task.c |  35 ++-
 kernel/sched/topology.c  |   4 +-
 kernel/stop_machine.c    |   2 +
 13 files changed, 1096 insertions(+), 279 deletions(-)




* [RFC][PATCH 01/16] stop_machine: Fix stop_cpus_in_progress ordering
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
@ 2019-02-18 16:56 ` Peter Zijlstra
  2019-02-18 16:56 ` [RFC][PATCH 02/16] sched: Fix kerneldoc comment for ia64_set_curr_task Peter Zijlstra
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 16:56 UTC (permalink / raw)
  To: mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Peter Zijlstra (Intel)

Make sure the entire for loop has stop_cpus_in_progress set.
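
The intent can be sketched in userspace C, assuming GCC or Clang. The barrier()
macro below is a compiler-only barrier (an empty asm statement with a "memory"
clobber); 'in_progress', queue_one() and queue_all() are hypothetical stand-ins
for stop_cpus_in_progress, cpu_stop_queue_work() and queue_stop_cpus_work().

```c
#include <stdbool.h>

/* Compiler barrier: the compiler may not move memory accesses across it. */
#define barrier() __asm__ __volatile__("" ::: "memory")

static bool in_progress;	/* stand-in for stop_cpus_in_progress */
static int seen_in_progress;	/* counts iterations that observed the flag set */

/* Hypothetical per-CPU queueing step; records whether the flag was visible. */
static bool queue_one(int cpu)
{
	(void)cpu;
	if (in_progress)
		seen_in_progress++;
	return true;
}

/* Returns true if any work was queued, mirroring the 'queued' variable. */
static bool queue_all(int nr_cpus)
{
	bool queued = false;

	in_progress = true;
	barrier();		/* the store above stays before the loop */
	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		if (queue_one(cpu))
			queued = true;
	}
	barrier();		/* the loop stays before the store below */
	in_progress = false;

	return queued;
}
```

Note that this only constrains the compiler; it emits no CPU fence instruction.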

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/stop_machine.c |    2 ++
 1 file changed, 2 insertions(+)

--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -375,6 +375,7 @@ static bool queue_stop_cpus_work(const s
 	 */
 	preempt_disable();
 	stop_cpus_in_progress = true;
+	barrier();
 	for_each_cpu(cpu, cpumask) {
 		work = &per_cpu(cpu_stopper.stop_work, cpu);
 		work->fn = fn;
@@ -383,6 +384,7 @@ static bool queue_stop_cpus_work(const s
 		if (cpu_stop_queue_work(cpu, work))
 			queued = true;
 	}
+	barrier();
 	stop_cpus_in_progress = false;
 	preempt_enable();
 




* [RFC][PATCH 02/16] sched: Fix kerneldoc comment for ia64_set_curr_task
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
  2019-02-18 16:56 ` [RFC][PATCH 01/16] stop_machine: Fix stop_cpus_in_progress ordering Peter Zijlstra
@ 2019-02-18 16:56 ` Peter Zijlstra
  2019-02-18 16:56 ` [RFC][PATCH 03/16] sched: Wrap rq::lock access Peter Zijlstra
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 16:56 UTC (permalink / raw)
  To: mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Peter Zijlstra (Intel)


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6256,7 +6256,7 @@ struct task_struct *curr_task(int cpu)
 
 #ifdef CONFIG_IA64
 /**
- * set_curr_task - set the current task for a given CPU.
+ * ia64_set_curr_task - set the current task for a given CPU.
  * @cpu: the processor in question.
  * @p: the task pointer to set.
  *




* [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
  2019-02-18 16:56 ` [RFC][PATCH 01/16] stop_machine: Fix stop_cpus_in_progress ordering Peter Zijlstra
  2019-02-18 16:56 ` [RFC][PATCH 02/16] sched: Fix kerneldoc comment for ia64_set_curr_task Peter Zijlstra
@ 2019-02-18 16:56 ` Peter Zijlstra
  2019-02-19 16:13   ` Phil Auld
  2019-03-18 15:41   ` Julien Desfossez
  2019-02-18 16:56 ` [RFC][PATCH 04/16] sched/{rt,deadline}: Fix set_next_task vs pick_next_task Peter Zijlstra
                   ` (16 subsequent siblings)
  19 siblings, 2 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 16:56 UTC (permalink / raw)
  To: mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Peter Zijlstra (Intel)

In preparation for playing games with rq->lock, abstract the thing
using an accessor.
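
One reason such an accessor is useful, sketched in userspace C under a loudly
labeled assumption: that a later change may let several runqueues resolve to the
same lock (the diff below already compares rq_lockp() results rather than rq
pointers in the double-lock helpers). The 'fake_lock' type, the 'shared' field,
and the lock_both()/unlock_both() helpers are hypothetical; only the rq_lockp()
and __lock names come from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lock that only tracks ownership, so the sketch runs single-threaded. */
struct fake_lock {
	int held;
};

struct rq {
	struct fake_lock __lock;	/* private lock, named as in the patch */
	struct fake_lock *shared;	/* assumption: optional lock shared between runqueues */
};

/* All users go through the accessor, so the lock's location can change in one place. */
static struct fake_lock *rq_lockp(struct rq *rq)
{
	return rq->shared ? rq->shared : &rq->__lock;
}

static void fake_lock_acquire(struct fake_lock *l)
{
	assert(l->held == 0);	/* taking the same lock twice would deadlock */
	l->held = 1;
}

static void fake_lock_release(struct fake_lock *l)
{
	assert(l->held == 1);
	l->held = 0;
}

/* Compare lock pointers, not rq pointers: two runqueues may share one lock. */
static void lock_both(struct rq *rq1, struct rq *rq2)
{
	if (rq_lockp(rq1) == rq_lockp(rq2)) {
		fake_lock_acquire(rq_lockp(rq1));
	} else if (rq1 < rq2) {
		fake_lock_acquire(rq_lockp(rq1));
		fake_lock_acquire(rq_lockp(rq2));
	} else {
		fake_lock_acquire(rq_lockp(rq2));
		fake_lock_acquire(rq_lockp(rq1));
	}
}

static void unlock_both(struct rq *rq1, struct rq *rq2)
{
	fake_lock_release(rq_lockp(rq1));
	if (rq_lockp(rq1) != rq_lockp(rq2))
		fake_lock_release(rq_lockp(rq2));
}
```

With the raw "rq1 == rq2" test, two distinct runqueues sharing one lock would
try to take that lock twice; comparing the accessor results avoids that.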

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c     |   44 ++++++++++----------
 kernel/sched/deadline.c |   18 ++++----
 kernel/sched/debug.c    |    4 -
 kernel/sched/fair.c     |   41 +++++++++----------
 kernel/sched/idle.c     |    4 -
 kernel/sched/pelt.h     |    2 
 kernel/sched/rt.c       |    8 +--
 kernel/sched/sched.h    |  102 ++++++++++++++++++++++++------------------------
 kernel/sched/topology.c |    4 -
 9 files changed, 114 insertions(+), 113 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,12 +72,12 @@ struct rq *__task_rq_lock(struct task_st
 
 	for (;;) {
 		rq = task_rq(p);
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
 			rq_pin_lock(rq, rf);
 			return rq;
 		}
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 
 		while (unlikely(task_on_rq_migrating(p)))
 			cpu_relax();
@@ -96,7 +96,7 @@ struct rq *task_rq_lock(struct task_stru
 	for (;;) {
 		raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
 		rq = task_rq(p);
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		/*
 		 *	move_queued_task()		task_rq_lock()
 		 *
@@ -118,7 +118,7 @@ struct rq *task_rq_lock(struct task_stru
 			rq_pin_lock(rq, rf);
 			return rq;
 		}
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 		raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 
 		while (unlikely(task_on_rq_migrating(p)))
@@ -188,7 +188,7 @@ void update_rq_clock(struct rq *rq)
 {
 	s64 delta;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (rq->clock_update_flags & RQCF_ACT_SKIP)
 		return;
@@ -497,7 +497,7 @@ void resched_curr(struct rq *rq)
 	struct task_struct *curr = rq->curr;
 	int cpu;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (test_tsk_need_resched(curr))
 		return;
@@ -521,10 +521,10 @@ void resched_cpu(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 	if (cpu_online(cpu) || cpu == smp_processor_id())
 		resched_curr(rq);
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 }
 
 #ifdef CONFIG_SMP
@@ -956,7 +956,7 @@ static inline bool is_cpu_allowed(struct
 static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 				   struct task_struct *p, int new_cpu)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
 	dequeue_task(rq, p, DEQUEUE_NOCLOCK);
@@ -1070,7 +1070,7 @@ void do_set_cpus_allowed(struct task_str
 		 * Because __kthread_bind() calls this on blocked tasks without
 		 * holding rq->lock.
 		 */
-		lockdep_assert_held(&rq->lock);
+		lockdep_assert_held(rq_lockp(rq));
 		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
 	}
 	if (running)
@@ -1203,7 +1203,7 @@ void set_task_cpu(struct task_struct *p,
 	 * task_rq_lock().
 	 */
 	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
-				      lockdep_is_held(&task_rq(p)->lock)));
+				      lockdep_is_held(rq_lockp(task_rq(p)))));
 #endif
 	/*
 	 * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
@@ -1732,7 +1732,7 @@ ttwu_do_activate(struct rq *rq, struct t
 {
 	int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 #ifdef CONFIG_SMP
 	if (p->sched_contributes_to_load)
@@ -2123,7 +2123,7 @@ static void try_to_wake_up_local(struct
 	    WARN_ON_ONCE(p == current))
 		return;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (!raw_spin_trylock(&p->pi_lock)) {
 		/*
@@ -2606,10 +2606,10 @@ prepare_lock_switch(struct rq *rq, struc
 	 * do an early lockdep release here:
 	 */
 	rq_unpin_lock(rq, rf);
-	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
+	spin_release(&rq_lockp(rq)->dep_map, 1, _THIS_IP_);
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */
-	rq->lock.owner = next;
+	rq_lockp(rq)->owner = next;
 #endif
 }
 
@@ -2620,8 +2620,8 @@ static inline void finish_lock_switch(st
 	 * fix up the runqueue lock - which gets 'carried over' from
 	 * prev into current:
 	 */
-	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
-	raw_spin_unlock_irq(&rq->lock);
+	spin_acquire(&rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
+	raw_spin_unlock_irq(rq_lockp(rq));
 }
 
 /*
@@ -2771,7 +2771,7 @@ static void __balance_callback(struct rq
 	void (*func)(struct rq *rq);
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 	head = rq->balance_callback;
 	rq->balance_callback = NULL;
 	while (head) {
@@ -2782,7 +2782,7 @@ static void __balance_callback(struct rq
 
 		func(rq);
 	}
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 }
 
 static inline void balance_callback(struct rq *rq)
@@ -5411,7 +5411,7 @@ void init_idle(struct task_struct *idle,
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&idle->pi_lock, flags);
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 
 	__sched_fork(0, idle);
 	idle->state = TASK_RUNNING;
@@ -5448,7 +5448,7 @@ void init_idle(struct task_struct *idle,
 #ifdef CONFIG_SMP
 	idle->on_cpu = 1;
 #endif
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 	raw_spin_unlock_irqrestore(&idle->pi_lock, flags);
 
 	/* Set the preempt count _outside_ the spinlocks! */
@@ -6016,7 +6016,7 @@ void __init sched_init(void)
 		struct rq *rq;
 
 		rq = cpu_rq(i);
-		raw_spin_lock_init(&rq->lock);
+		raw_spin_lock_init(&rq->__lock);
 		rq->nr_running = 0;
 		rq->calc_load_active = 0;
 		rq->calc_load_update = jiffies + LOAD_FREQ;
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -80,7 +80,7 @@ void __add_running_bw(u64 dl_bw, struct
 {
 	u64 old = dl_rq->running_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp((rq_of_dl_rq(dl_rq))));
 	dl_rq->running_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
 	SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
@@ -93,7 +93,7 @@ void __sub_running_bw(u64 dl_bw, struct
 {
 	u64 old = dl_rq->running_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp((rq_of_dl_rq(dl_rq))));
 	dl_rq->running_bw -= dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
 	if (dl_rq->running_bw > old)
@@ -107,7 +107,7 @@ void __add_rq_bw(u64 dl_bw, struct dl_rq
 {
 	u64 old = dl_rq->this_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp((rq_of_dl_rq(dl_rq))));
 	dl_rq->this_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->this_bw < old); /* overflow */
 }
@@ -117,7 +117,7 @@ void __sub_rq_bw(u64 dl_bw, struct dl_rq
 {
 	u64 old = dl_rq->this_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp((rq_of_dl_rq(dl_rq))));
 	dl_rq->this_bw -= dl_bw;
 	SCHED_WARN_ON(dl_rq->this_bw > old); /* underflow */
 	if (dl_rq->this_bw > old)
@@ -893,7 +893,7 @@ static int start_dl_timer(struct task_st
 	ktime_t now, act;
 	s64 delta;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	/*
 	 * We want the timer to fire at the deadline, but considering
@@ -1003,9 +1003,9 @@ static enum hrtimer_restart dl_task_time
 		 * If the runqueue is no longer available, migrate the
 		 * task elsewhere. This necessarily changes rq.
 		 */
-		lockdep_unpin_lock(&rq->lock, rf.cookie);
+		lockdep_unpin_lock(rq_lockp(rq), rf.cookie);
 		rq = dl_task_offline_migration(rq, p);
-		rf.cookie = lockdep_pin_lock(&rq->lock);
+		rf.cookie = lockdep_pin_lock(rq_lockp(rq));
 		update_rq_clock(rq);
 
 		/*
@@ -1620,7 +1620,7 @@ static void migrate_task_rq_dl(struct ta
 	 * from try_to_wake_up(). Hence, p->pi_lock is locked, but
 	 * rq->lock is not... So, lock it
 	 */
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	if (p->dl.dl_non_contending) {
 		sub_running_bw(&p->dl, &rq->dl);
 		p->dl.dl_non_contending = 0;
@@ -1635,7 +1635,7 @@ static void migrate_task_rq_dl(struct ta
 			put_task_struct(p);
 	}
 	sub_rq_bw(&p->dl, &rq->dl);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -515,7 +515,7 @@ void print_cfs_rq(struct seq_file *m, in
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "exec_clock",
 			SPLIT_NS(cfs_rq->exec_clock));
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 	if (rb_first_cached(&cfs_rq->tasks_timeline))
 		MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
 	last = __pick_last_entity(cfs_rq);
@@ -523,7 +523,7 @@ void print_cfs_rq(struct seq_file *m, in
 		max_vruntime = last->vruntime;
 	min_vruntime = cfs_rq->min_vruntime;
 	rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "MIN_vruntime",
 			SPLIT_NS(MIN_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4966,7 +4966,7 @@ static void __maybe_unused update_runtim
 {
 	struct task_group *tg;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -4985,7 +4985,7 @@ static void __maybe_unused unthrottle_of
 {
 	struct task_group *tg;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -6743,7 +6743,7 @@ static void migrate_task_rq_fair(struct
 		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
 		 * rq->lock and can modify state directly.
 		 */
-		lockdep_assert_held(&task_rq(p)->lock);
+		lockdep_assert_held(rq_lockp(task_rq(p)));
 		detach_entity_cfs_rq(&p->se);
 
 	} else {
@@ -7317,7 +7317,7 @@ static int task_hot(struct task_struct *
 {
 	s64 delta;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	if (p->sched_class != &fair_sched_class)
 		return 0;
@@ -7411,7 +7411,7 @@ int can_migrate_task(struct task_struct
 {
 	int tsk_cache_hot;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	/*
 	 * We do not migrate tasks that are:
@@ -7489,7 +7489,7 @@ int can_migrate_task(struct task_struct
  */
 static void detach_task(struct task_struct *p, struct lb_env *env)
 {
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	p->on_rq = TASK_ON_RQ_MIGRATING;
 	deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
@@ -7506,7 +7506,7 @@ static struct task_struct *detach_one_ta
 {
 	struct task_struct *p;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	list_for_each_entry_reverse(p,
 			&env->src_rq->cfs_tasks, se.group_node) {
@@ -7542,7 +7542,7 @@ static int detach_tasks(struct lb_env *e
 	unsigned long load;
 	int detached = 0;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	if (env->imbalance <= 0)
 		return 0;
@@ -7623,7 +7623,7 @@ static int detach_tasks(struct lb_env *e
  */
 static void attach_task(struct rq *rq, struct task_struct *p)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	BUG_ON(task_rq(p) != rq);
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
@@ -9164,7 +9164,7 @@ static int load_balance(int this_cpu, st
 		if (need_active_balance(&env)) {
 			unsigned long flags;
 
-			raw_spin_lock_irqsave(&busiest->lock, flags);
+			raw_spin_lock_irqsave(rq_lockp(busiest), flags);
 
 			/*
 			 * Don't kick the active_load_balance_cpu_stop,
@@ -9172,8 +9172,7 @@ static int load_balance(int this_cpu, st
 			 * moved to this_cpu:
 			 */
 			if (!cpumask_test_cpu(this_cpu, &busiest->curr->cpus_allowed)) {
-				raw_spin_unlock_irqrestore(&busiest->lock,
-							    flags);
+				raw_spin_unlock_irqrestore(rq_lockp(busiest), flags);
 				env.flags |= LBF_ALL_PINNED;
 				goto out_one_pinned;
 			}
@@ -9188,7 +9187,7 @@ static int load_balance(int this_cpu, st
 				busiest->push_cpu = this_cpu;
 				active_balance = 1;
 			}
-			raw_spin_unlock_irqrestore(&busiest->lock, flags);
+			raw_spin_unlock_irqrestore(rq_lockp(busiest), flags);
 
 			if (active_balance) {
 				stop_one_cpu_nowait(cpu_of(busiest),
@@ -9897,7 +9896,7 @@ static void nohz_newidle_balance(struct
 	    time_before(jiffies, READ_ONCE(nohz.next_blocked)))
 		return;
 
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock(rq_lockp(this_rq));
 	/*
 	 * This CPU is going to be idle and blocked load of idle CPUs
 	 * need to be updated. Run the ilb locally as it is a good
@@ -9906,7 +9905,7 @@ static void nohz_newidle_balance(struct
 	 */
 	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
 		kick_ilb(NOHZ_STATS_KICK);
-	raw_spin_lock(&this_rq->lock);
+	raw_spin_lock(rq_lockp(this_rq));
 }
 
 #else /* !CONFIG_NO_HZ_COMMON */
@@ -9966,7 +9965,7 @@ static int idle_balance(struct rq *this_
 		goto out;
 	}
 
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock(rq_lockp(this_rq));
 
 	update_blocked_averages(this_cpu);
 	rcu_read_lock();
@@ -10007,7 +10006,7 @@ static int idle_balance(struct rq *this_
 	}
 	rcu_read_unlock();
 
-	raw_spin_lock(&this_rq->lock);
+	raw_spin_lock(rq_lockp(this_rq));
 
 	if (curr_cost > this_rq->max_idle_balance_cost)
 		this_rq->max_idle_balance_cost = curr_cost;
@@ -10443,11 +10442,11 @@ void online_fair_sched_group(struct task
 		rq = cpu_rq(i);
 		se = tg->se[i];
 
-		raw_spin_lock_irq(&rq->lock);
+		raw_spin_lock_irq(rq_lockp(rq));
 		update_rq_clock(rq);
 		attach_entity_cfs_rq(se);
 		sync_throttle(tg, i);
-		raw_spin_unlock_irq(&rq->lock);
+		raw_spin_unlock_irq(rq_lockp(rq));
 	}
 }
 
@@ -10470,9 +10469,9 @@ void unregister_fair_sched_group(struct
 
 		rq = cpu_rq(cpu);
 
-		raw_spin_lock_irqsave(&rq->lock, flags);
+		raw_spin_lock_irqsave(rq_lockp(rq), flags);
 		list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
-		raw_spin_unlock_irqrestore(&rq->lock, flags);
+		raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 	}
 }
 
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -390,10 +390,10 @@ pick_next_task_idle(struct rq *rq, struc
 static void
 dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
 {
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_unlock_irq(rq_lockp(rq));
 	printk(KERN_ERR "bad: scheduling from the idle thread!\n");
 	dump_stack();
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_lock_irq(rq_lockp(rq));
 }
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -116,7 +116,7 @@ static inline void update_idle_rq_clock_
 
 static inline u64 rq_clock_pelt(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock_pelt - rq->lost_idle_time;
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -845,7 +845,7 @@ static int do_sched_rt_period_timer(stru
 		if (skip)
 			continue;
 
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		update_rq_clock(rq);
 
 		if (rt_rq->rt_time) {
@@ -883,7 +883,7 @@ static int do_sched_rt_period_timer(stru
 
 		if (enqueue)
 			sched_rt_rq_enqueue(rt_rq);
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 	}
 
 	if (!throttled && (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF))
@@ -2034,9 +2034,9 @@ void rto_push_irq_work_func(struct irq_w
 	 * When it gets updated, a check is made if a push is possible.
 	 */
 	if (has_pushable_tasks(rq)) {
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		push_rt_tasks(rq);
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 	}
 
 	raw_spin_lock(&rd->rto_lock);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -806,7 +806,7 @@ extern void rto_push_irq_work_func(struc
  */
 struct rq {
 	/* runqueue lock: */
-	raw_spinlock_t		lock;
+	raw_spinlock_t		__lock;
 
 	/*
 	 * nr_running and cpu_load should be in the same cacheline because
@@ -979,6 +979,10 @@ static inline int cpu_of(struct rq *rq)
 #endif
 }
 
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+	return &rq->__lock;
+}
 
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
@@ -1046,7 +1050,7 @@ static inline void assert_clock_updated(
 
 static inline u64 rq_clock(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock;
@@ -1054,7 +1058,7 @@ static inline u64 rq_clock(struct rq *rq
 
 static inline u64 rq_clock_task(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock_task;
@@ -1062,7 +1066,7 @@ static inline u64 rq_clock_task(struct r
 
 static inline void rq_clock_skip_update(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	rq->clock_update_flags |= RQCF_REQ_SKIP;
 }
 
@@ -1072,7 +1076,7 @@ static inline void rq_clock_skip_update(
  */
 static inline void rq_clock_cancel_skipupdate(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	rq->clock_update_flags &= ~RQCF_REQ_SKIP;
 }
 
@@ -1091,7 +1095,7 @@ struct rq_flags {
 
 static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	rf->cookie = lockdep_pin_lock(&rq->lock);
+	rf->cookie = lockdep_pin_lock(rq_lockp(rq));
 
 #ifdef CONFIG_SCHED_DEBUG
 	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
@@ -1106,12 +1110,12 @@ static inline void rq_unpin_lock(struct
 		rf->clock_update_flags = RQCF_UPDATED;
 #endif
 
-	lockdep_unpin_lock(&rq->lock, rf->cookie);
+	lockdep_unpin_lock(rq_lockp(rq), rf->cookie);
 }
 
 static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	lockdep_repin_lock(&rq->lock, rf->cookie);
+	lockdep_repin_lock(rq_lockp(rq), rf->cookie);
 
 #ifdef CONFIG_SCHED_DEBUG
 	/*
@@ -1132,7 +1136,7 @@ static inline void __task_rq_unlock(stru
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static inline void
@@ -1141,7 +1145,7 @@ task_rq_unlock(struct rq *rq, struct tas
 	__releases(p->pi_lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 }
 
@@ -1149,7 +1153,7 @@ static inline void
 rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock_irqsave(&rq->lock, rf->flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), rf->flags);
 	rq_pin_lock(rq, rf);
 }
 
@@ -1157,7 +1161,7 @@ static inline void
 rq_lock_irq(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_lock_irq(rq_lockp(rq));
 	rq_pin_lock(rq, rf);
 }
 
@@ -1165,7 +1169,7 @@ static inline void
 rq_lock(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	rq_pin_lock(rq, rf);
 }
 
@@ -1173,7 +1177,7 @@ static inline void
 rq_relock(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	rq_repin_lock(rq, rf);
 }
 
@@ -1182,7 +1186,7 @@ rq_unlock_irqrestore(struct rq *rq, stru
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), rf->flags);
 }
 
 static inline void
@@ -1190,7 +1194,7 @@ rq_unlock_irq(struct rq *rq, struct rq_f
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_unlock_irq(rq_lockp(rq));
 }
 
 static inline void
@@ -1198,7 +1202,7 @@ rq_unlock(struct rq *rq, struct rq_flags
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static inline struct rq *
@@ -1261,7 +1265,7 @@ queue_balance_callback(struct rq *rq,
 		       struct callback_head *head,
 		       void (*func)(struct rq *rq))
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (unlikely(head->next))
 		return;
@@ -1917,7 +1921,7 @@ static inline int _double_lock_balance(s
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock(rq_lockp(this_rq));
 	double_rq_lock(this_rq, busiest);
 
 	return 1;
@@ -1936,20 +1940,22 @@ static inline int _double_lock_balance(s
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	int ret = 0;
+	if (rq_lockp(this_rq) == rq_lockp(busiest))
+		return 0;
 
-	if (unlikely(!raw_spin_trylock(&busiest->lock))) {
-		if (busiest < this_rq) {
-			raw_spin_unlock(&this_rq->lock);
-			raw_spin_lock(&busiest->lock);
-			raw_spin_lock_nested(&this_rq->lock,
-					      SINGLE_DEPTH_NESTING);
-			ret = 1;
-		} else
-			raw_spin_lock_nested(&busiest->lock,
-					      SINGLE_DEPTH_NESTING);
+	if (likely(raw_spin_trylock(rq_lockp(busiest))))
+		return 0;
+
+	if (busiest >= this_rq) {
+		raw_spin_lock_nested(rq_lockp(busiest), SINGLE_DEPTH_NESTING);
+		return 0;
 	}
-	return ret;
+
+	raw_spin_unlock(rq_lockp(this_rq));
+	raw_spin_lock(rq_lockp(busiest));
+	raw_spin_lock_nested(rq_lockp(this_rq), SINGLE_DEPTH_NESTING);
+
+	return 1;
 }
 
 #endif /* CONFIG_PREEMPT */
@@ -1959,20 +1965,16 @@ static inline int _double_lock_balance(s
  */
 static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 {
-	if (unlikely(!irqs_disabled())) {
-		/* printk() doesn't work well under rq->lock */
-		raw_spin_unlock(&this_rq->lock);
-		BUG_ON(1);
-	}
-
+	lockdep_assert_irqs_disabled();
 	return _double_lock_balance(this_rq, busiest);
 }
 
 static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
 	__releases(busiest->lock)
 {
-	raw_spin_unlock(&busiest->lock);
-	lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
+	if (rq_lockp(this_rq) != rq_lockp(busiest))
+		raw_spin_unlock(rq_lockp(busiest));
+	lock_set_subclass(&rq_lockp(this_rq)->dep_map, 0, _RET_IP_);
 }
 
 static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
@@ -2013,16 +2015,16 @@ static inline void double_rq_lock(struct
 	__acquires(rq2->lock)
 {
 	BUG_ON(!irqs_disabled());
-	if (rq1 == rq2) {
-		raw_spin_lock(&rq1->lock);
+	if (rq_lockp(rq1) == rq_lockp(rq2)) {
+		raw_spin_lock(rq_lockp(rq1));
 		__acquire(rq2->lock);	/* Fake it out ;) */
 	} else {
 		if (rq1 < rq2) {
-			raw_spin_lock(&rq1->lock);
-			raw_spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
+			raw_spin_lock(rq_lockp(rq1));
+			raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
 		} else {
-			raw_spin_lock(&rq2->lock);
-			raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
+			raw_spin_lock(rq_lockp(rq2));
+			raw_spin_lock_nested(rq_lockp(rq1), SINGLE_DEPTH_NESTING);
 		}
 	}
 }
@@ -2037,9 +2039,9 @@ static inline void double_rq_unlock(stru
 	__releases(rq1->lock)
 	__releases(rq2->lock)
 {
-	raw_spin_unlock(&rq1->lock);
-	if (rq1 != rq2)
-		raw_spin_unlock(&rq2->lock);
+	raw_spin_unlock(rq_lockp(rq1));
+	if (rq_lockp(rq1) != rq_lockp(rq2))
+		raw_spin_unlock(rq_lockp(rq2));
 	else
 		__release(rq2->lock);
 }
@@ -2062,7 +2064,7 @@ static inline void double_rq_lock(struct
 {
 	BUG_ON(!irqs_disabled());
 	BUG_ON(rq1 != rq2);
-	raw_spin_lock(&rq1->lock);
+	raw_spin_lock(rq_lockp(rq1));
 	__acquire(rq2->lock);	/* Fake it out ;) */
 }
 
@@ -2077,7 +2079,7 @@ static inline void double_rq_unlock(stru
 	__releases(rq2->lock)
 {
 	BUG_ON(rq1 != rq2);
-	raw_spin_unlock(&rq1->lock);
+	raw_spin_unlock(rq_lockp(rq1));
 	__release(rq2->lock);
 }
 
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -442,7 +442,7 @@ void rq_attach_root(struct rq *rq, struc
 	struct root_domain *old_rd = NULL;
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 
 	if (rq->rd) {
 		old_rd = rq->rd;
@@ -468,7 +468,7 @@ void rq_attach_root(struct rq *rq, struc
 	if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
 		set_rq_online(rq);
 
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 
 	if (old_rd)
 		call_rcu(&old_rd->rcu, free_rootdomain);




* [RFC][PATCH 04/16] sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
                   ` (2 preceding siblings ...)
  2019-02-18 16:56 ` [RFC][PATCH 03/16] sched: Wrap rq::lock access Peter Zijlstra
@ 2019-02-18 16:56 ` Peter Zijlstra
  2019-02-18 16:56 ` [RFC][PATCH 05/16] sched: Add task_struct pointer to sched_class::set_curr_task Peter Zijlstra
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 16:56 UTC (permalink / raw)
  To: mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Peter Zijlstra (Intel)

Because pick_next_task() implies set_curr_task() and some of the
details haven't mattered too much, some of what _should_ be in
set_curr_task() ended up in pick_next_task(); correct this.

This prepares the way for a pick_next_task() variant that does not
affect the current state; allowing remote picking.
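
The separation can be illustrated with a small userspace sketch. Everything in
it is hypothetical (the 'struct task' and 'struct rq' layouts, the 'prio' field,
and the pick_task() and set_next_task() helpers); it only shows the shape of the
split: a pick that reads state, and a separate step that commits it.

```c
#include <stddef.h>

struct task {
	int prio;			/* assumption: lower value = higher priority */
	unsigned long exec_start;
};

struct rq {
	struct task *tasks[4];
	size_t nr;
	unsigned long clock;
};

/* Pure selection: reads the runqueue and changes nothing. */
static struct task *pick_task(const struct rq *rq)
{
	struct task *best = NULL;

	for (size_t i = 0; i < rq->nr; i++) {
		if (!best || rq->tasks[i]->prio < best->prio)
			best = rq->tasks[i];
	}
	return best;
}

/* The state changes that come with actually running p live here. */
static void set_next_task(struct rq *rq, struct task *p)
{
	p->exec_start = rq->clock;
}

/* The combined behaviour is then simply pick followed by set. */
static struct task *pick_next_task(struct rq *rq)
{
	struct task *p = pick_task(rq);

	if (p)
		set_next_task(rq, p);
	return p;
}
```

Because pick_task() has no side effects in this sketch, a caller could use it to
inspect another runqueue's choice without disturbing that runqueue's state.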

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/deadline.c |   23 ++++++++++++-----------
 kernel/sched/rt.c       |   27 ++++++++++++++-------------
 2 files changed, 26 insertions(+), 24 deletions(-)

--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1695,12 +1695,21 @@ static void start_hrtick_dl(struct rq *r
 }
 #endif
 
-static inline void set_next_task(struct rq *rq, struct task_struct *p)
+static void set_next_task_dl(struct rq *rq, struct task_struct *p)
 {
 	p->se.exec_start = rq_clock_task(rq);
 
 	/* You can't push away the running task */
 	dequeue_pushable_dl_task(rq, p);
+
+	if (hrtick_enabled(rq))
+		start_hrtick_dl(rq, p);
+
+	if (rq->curr->sched_class != &dl_sched_class)
+		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+	if (rq->curr != p)
+		deadline_queue_push_tasks(rq);
 }
 
 static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
@@ -1759,15 +1768,7 @@ pick_next_task_dl(struct rq *rq, struct
 
 	p = dl_task_of(dl_se);
 
-	set_next_task(rq, p);
-
-	if (hrtick_enabled(rq))
-		start_hrtick_dl(rq, p);
-
-	deadline_queue_push_tasks(rq);
-
-	if (rq->curr->sched_class != &dl_sched_class)
-		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+	set_next_task_dl(rq, p);
 
 	return p;
 }
@@ -1814,7 +1815,7 @@ static void task_fork_dl(struct task_str
 
 static void set_curr_task_dl(struct rq *rq)
 {
-	set_next_task(rq, rq->curr);
+	set_next_task_dl(rq, rq->curr);
 }
 
 #ifdef CONFIG_SMP
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1498,12 +1498,23 @@ static void check_preempt_curr_rt(struct
 #endif
 }
 
-static inline void set_next_task(struct rq *rq, struct task_struct *p)
+static inline void set_next_task_rt(struct rq *rq, struct task_struct *p)
 {
 	p->se.exec_start = rq_clock_task(rq);
 
 	/* The running task is never eligible for pushing */
 	dequeue_pushable_task(rq, p);
+
+	/*
+	 * If prev task was rt, put_prev_task() has already updated the
+	 * utilization. We only care of the case where we start to schedule a
+	 * rt task
+	 */
+	if (rq->curr->sched_class != &rt_sched_class)
+		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+	if (rq->curr != p)
+		rt_queue_push_tasks(rq);
 }
 
 static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
@@ -1577,17 +1588,7 @@ pick_next_task_rt(struct rq *rq, struct
 
 	p = _pick_next_task_rt(rq);
 
-	set_next_task(rq, p);
-
-	rt_queue_push_tasks(rq);
-
-	/*
-	 * If prev task was rt, put_prev_task() has already updated the
-	 * utilization. We only care of the case where we start to schedule a
-	 * rt task
-	 */
-	if (rq->curr->sched_class != &rt_sched_class)
-		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+	set_next_task_rt(rq, p);
 
 	return p;
 }
@@ -2356,7 +2357,7 @@ static void task_tick_rt(struct rq *rq,
 
 static void set_curr_task_rt(struct rq *rq)
 {
-	set_next_task(rq, rq->curr);
+	set_next_task_rt(rq, rq->curr);
 }
 
 static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)




* [RFC][PATCH 05/16] sched: Add task_struct pointer to sched_class::set_curr_task
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
                   ` (3 preceding siblings ...)
  2019-02-18 16:56 ` [RFC][PATCH 04/16] sched/{rt,deadline}: Fix set_next_task vs pick_next_task Peter Zijlstra
@ 2019-02-18 16:56 ` Peter Zijlstra
  2019-02-18 16:56 ` [RFC][PATCH 06/16] sched/fair: Export newidle_balance() Peter Zijlstra
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 16:56 UTC (permalink / raw)
  To: mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Peter Zijlstra (Intel)

In preparation for further separating pick_next_task() and
set_curr_task(), we have to pass the actual task into it. While there,
rename the thing to set_next_task() so that it pairs better with
put_prev_task().
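
A minimal userspace sketch of the resulting callback shape, assuming a
made-up toy_class / toy_task pair (none of these names exist in the
kernel). The wrappers dispatch through the task's class and take the
task explicitly; toy_change_prio() shows the usual "put, modify, set
again" bracket around a change to a running task:

```c
#include <assert.h>
#include <stddef.h>

struct toy_rq;
struct toy_task;

/* Hypothetical per-class callbacks; both now receive the task itself. */
struct toy_class {
	void (*put_prev_task)(struct toy_rq *rq, struct toy_task *p);
	void (*set_next_task)(struct toy_rq *rq, struct toy_task *p);
};

struct toy_task {
	const struct toy_class *sched_class;
	unsigned long exec_start;
	int prio;
	int running;
};

struct toy_rq {
	unsigned long clock;
	struct toy_task *curr;
};

static void toy_put_prev_task_a(struct toy_rq *rq, struct toy_task *p)
{
	(void)rq;
	p->running = 0;
}

static void toy_set_next_task_a(struct toy_rq *rq, struct toy_task *p)
{
	p->exec_start = rq->clock;
	p->running = 1;
}

static const struct toy_class toy_class_a = {
	.put_prev_task = toy_put_prev_task_a,
	.set_next_task = toy_set_next_task_a,
};

/* Generic wrappers: symmetric, and both check the task against rq->curr. */
static void toy_put_prev_task(struct toy_rq *rq, struct toy_task *prev)
{
	assert(rq->curr == prev);
	prev->sched_class->put_prev_task(rq, prev);
}

static void toy_set_next_task(struct toy_rq *rq, struct toy_task *next)
{
	assert(rq->curr == next);
	next->sched_class->set_next_task(rq, next);
}

/* Change an attribute of a possibly running task. */
static void toy_change_prio(struct toy_rq *rq, struct toy_task *p, int prio)
{
	int running = (rq->curr == p);

	if (running)
		toy_put_prev_task(rq, p);

	p->prio = prio;

	if (running)
		toy_set_next_task(rq, p);
}
```

Passing the task explicitly is what later allows the callback to be used
for a task that is not (yet) rq->curr.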

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c      |   12 ++++++------
 kernel/sched/deadline.c  |    7 +------
 kernel/sched/fair.c      |   17 ++++++++++++++---
 kernel/sched/idle.c      |   27 +++++++++++++++------------
 kernel/sched/rt.c        |    7 +------
 kernel/sched/sched.h     |    8 +++++---
 kernel/sched/stop_task.c |   17 +++++++----------
 7 files changed, 49 insertions(+), 46 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1081,7 +1081,7 @@ void do_set_cpus_allowed(struct task_str
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 }
 
 /*
@@ -3887,7 +3887,7 @@ void rt_mutex_setprio(struct task_struct
 	if (queued)
 		enqueue_task(rq, p, queue_flag);
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 out_unlock:
@@ -3954,7 +3954,7 @@ void set_user_nice(struct task_struct *p
 			resched_curr(rq);
 	}
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 out_unlock:
 	task_rq_unlock(rq, p, &rf);
 }
@@ -4379,7 +4379,7 @@ static int __sched_setscheduler(struct t
 		enqueue_task(rq, p, queue_flags);
 	}
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 
@@ -5552,7 +5552,7 @@ void sched_setnuma(struct task_struct *p
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 	task_rq_unlock(rq, p, &rf);
 }
 #endif /* CONFIG_NUMA_BALANCING */
@@ -6407,7 +6407,7 @@ void sched_move_task(struct task_struct
 	if (queued)
 		enqueue_task(rq, tsk, queue_flags);
 	if (running)
-		set_curr_task(rq, tsk);
+		set_next_task(rq, tsk);
 
 	task_rq_unlock(rq, tsk, &rf);
 }
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1813,11 +1813,6 @@ static void task_fork_dl(struct task_str
 	 */
 }
 
-static void set_curr_task_dl(struct rq *rq)
-{
-	set_next_task_dl(rq, rq->curr);
-}
-
 #ifdef CONFIG_SMP
 
 /* Only try algorithms three times */
@@ -2405,6 +2400,7 @@ const struct sched_class dl_sched_class
 
 	.pick_next_task		= pick_next_task_dl,
 	.put_prev_task		= put_prev_task_dl,
+	.set_next_task		= set_next_task_dl,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_dl,
@@ -2415,7 +2411,6 @@ const struct sched_class dl_sched_class
 	.task_woken		= task_woken_dl,
 #endif
 
-	.set_curr_task		= set_curr_task_dl,
 	.task_tick		= task_tick_dl,
 	.task_fork              = task_fork_dl,
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10315,9 +10315,19 @@ static void switched_to_fair(struct rq *
  * This routine is mostly called to set cfs_rq->curr field when a task
  * migrates between groups/classes.
  */
-static void set_curr_task_fair(struct rq *rq)
+static void set_next_task_fair(struct rq *rq, struct task_struct *p)
 {
-	struct sched_entity *se = &rq->curr->se;
+	struct sched_entity *se = &p->se;
+
+#ifdef CONFIG_SMP
+	if (task_on_rq_queued(p)) {
+		/*
+		 * Move the next running task to the front of the list, so our
+		 * cfs_tasks list becomes MRU one.
+		 */
+		list_move(&se->group_node, &rq->cfs_tasks);
+	}
+#endif
 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -10588,7 +10598,9 @@ const struct sched_class fair_sched_clas
 	.check_preempt_curr	= check_preempt_wakeup,
 
 	.pick_next_task		= pick_next_task_fair,
+
 	.put_prev_task		= put_prev_task_fair,
+	.set_next_task          = set_next_task_fair,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
@@ -10601,7 +10613,6 @@ const struct sched_class fair_sched_clas
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
 
-	.set_curr_task          = set_curr_task_fair,
 	.task_tick		= task_tick_fair,
 	.task_fork		= task_fork_fair,
 
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,14 +373,25 @@ static void check_preempt_curr_idle(stru
 	resched_curr(rq);
 }
 
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
+{
+}
+
+static void set_next_task_idle(struct rq *rq, struct task_struct *next)
+{
+	update_idle_core(rq);
+	schedstat_inc(rq->sched_goidle);
+}
+
 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
+	struct task_struct *next = rq->idle;
+
 	put_prev_task(rq, prev);
-	update_idle_core(rq);
-	schedstat_inc(rq->sched_goidle);
+	set_next_task_idle(rq, next);
 
-	return rq->idle;
+	return next;
 }
 
 /*
@@ -396,10 +407,6 @@ dequeue_task_idle(struct rq *rq, struct
 	raw_spin_lock_irq(rq_lockp(rq));
 }
 
-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
-{
-}
-
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -412,10 +419,6 @@ static void task_tick_idle(struct rq *rq
 {
 }
 
-static void set_curr_task_idle(struct rq *rq)
-{
-}
-
 static void switched_to_idle(struct rq *rq, struct task_struct *p)
 {
 	BUG();
@@ -450,13 +453,13 @@ const struct sched_class idle_sched_clas
 
 	.pick_next_task		= pick_next_task_idle,
 	.put_prev_task		= put_prev_task_idle,
+	.set_next_task          = set_next_task_idle,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_idle,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
 
-	.set_curr_task          = set_curr_task_idle,
 	.task_tick		= task_tick_idle,
 
 	.get_rr_interval	= get_rr_interval_idle,
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2355,11 +2355,6 @@ static void task_tick_rt(struct rq *rq,
 	}
 }
 
-static void set_curr_task_rt(struct rq *rq)
-{
-	set_next_task_rt(rq, rq->curr);
-}
-
 static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
 {
 	/*
@@ -2381,6 +2376,7 @@ const struct sched_class rt_sched_class
 
 	.pick_next_task		= pick_next_task_rt,
 	.put_prev_task		= put_prev_task_rt,
+	.set_next_task          = set_next_task_rt,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_rt,
@@ -2392,7 +2388,6 @@ const struct sched_class rt_sched_class
 	.switched_from		= switched_from_rt,
 #endif
 
-	.set_curr_task          = set_curr_task_rt,
 	.task_tick		= task_tick_rt,
 
 	.get_rr_interval	= get_rr_interval_rt,
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1672,6 +1672,7 @@ struct sched_class {
 					       struct task_struct *prev,
 					       struct rq_flags *rf);
 	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
+	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
@@ -1686,7 +1687,6 @@ struct sched_class {
 	void (*rq_offline)(struct rq *rq);
 #endif
 
-	void (*set_curr_task)(struct rq *rq);
 	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
 	void (*task_fork)(struct task_struct *p);
 	void (*task_dead)(struct task_struct *p);
@@ -1716,12 +1716,14 @@ struct sched_class {
 
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
+	WARN_ON_ONCE(rq->curr != prev);
 	prev->sched_class->put_prev_task(rq, prev);
 }
 
-static inline void set_curr_task(struct rq *rq, struct task_struct *curr)
+static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
-	curr->sched_class->set_curr_task(rq);
+	WARN_ON_ONCE(rq->curr != next);
+	next->sched_class->set_next_task(rq, next);
 }
 
 #ifdef CONFIG_SMP
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -23,6 +23,11 @@ check_preempt_curr_stop(struct rq *rq, s
 	/* we're never preempted */
 }
 
+static void set_next_task_stop(struct rq *rq, struct task_struct *stop)
+{
+	stop->se.exec_start = rq_clock_task(rq);
+}
+
 static struct task_struct *
 pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -32,8 +37,7 @@ pick_next_task_stop(struct rq *rq, struc
 		return NULL;
 
 	put_prev_task(rq, prev);
-
-	stop->se.exec_start = rq_clock_task(rq);
+	set_next_task_stop(rq, stop);
 
 	return stop;
 }
@@ -86,13 +90,6 @@ static void task_tick_stop(struct rq *rq
 {
 }
 
-static void set_curr_task_stop(struct rq *rq)
-{
-	struct task_struct *stop = rq->stop;
-
-	stop->se.exec_start = rq_clock_task(rq);
-}
-
 static void switched_to_stop(struct rq *rq, struct task_struct *p)
 {
 	BUG(); /* its impossible to change to this class */
@@ -128,13 +125,13 @@ const struct sched_class stop_sched_clas
 
 	.pick_next_task		= pick_next_task_stop,
 	.put_prev_task		= put_prev_task_stop,
+	.set_next_task          = set_next_task_stop,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_stop,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
 
-	.set_curr_task          = set_curr_task_stop,
 	.task_tick		= task_tick_stop,
 
 	.get_rr_interval	= get_rr_interval_stop,




* [RFC][PATCH 06/16] sched/fair: Export newidle_balance()
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
                   ` (4 preceding siblings ...)
  2019-02-18 16:56 ` [RFC][PATCH 05/16] sched: Add task_struct pointer to sched_class::set_curr_task Peter Zijlstra
@ 2019-02-18 16:56 ` Peter Zijlstra
  2019-02-18 16:56 ` [RFC][PATCH 07/16] sched: Allow put_prev_task() to drop rq->lock Peter Zijlstra
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 16:56 UTC (permalink / raw)
  To: mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Peter Zijlstra (Intel)

For pick_next_task_fair() it is the newidle balance that requires
dropping the rq->lock; provided we do put_prev_task() early, we can
also detect the condition for doing newidle early.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c  |   18 ++++++++----------
 kernel/sched/sched.h |    4 ++++
 2 files changed, 12 insertions(+), 10 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3610,8 +3610,6 @@ static inline unsigned long cfs_rq_load_
 	return cfs_rq->avg.load_avg;
 }
 
-static int idle_balance(struct rq *this_rq, struct rq_flags *rf);
-
 static inline unsigned long task_util(struct task_struct *p)
 {
 	return READ_ONCE(p->se.avg.util_avg);
@@ -7057,11 +7055,10 @@ done: __maybe_unused;
 	return p;
 
 idle:
-	update_misfit_status(NULL, rq);
-	new_tasks = idle_balance(rq, rf);
+	new_tasks = newidle_balance(rq, rf);
 
 	/*
-	 * Because idle_balance() releases (and re-acquires) rq->lock, it is
+	 * Because newidle_balance() releases (and re-acquires) rq->lock, it is
 	 * possible for any higher priority task to appear. In that case we
 	 * must re-start the pick_next_entity() loop.
 	 */
@@ -9243,10 +9240,10 @@ static int load_balance(int this_cpu, st
 	ld_moved = 0;
 
 	/*
-	 * idle_balance() disregards balance intervals, so we could repeatedly
-	 * reach this code, which would lead to balance_interval skyrocketting
-	 * in a short amount of time. Skip the balance_interval increase logic
-	 * to avoid that.
+	 * newidle_balance() disregards balance intervals, so we could
+	 * repeatedly reach this code, which would lead to balance_interval
+	 * skyrocketting in a short amount of time. Skip the balance_interval
+	 * increase logic to avoid that.
 	 */
 	if (env.idle == CPU_NEWLY_IDLE)
 		goto out;
@@ -9923,7 +9920,7 @@ static inline void nohz_newidle_balance(
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
  */
-static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
+int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 {
 	unsigned long next_balance = jiffies + HZ;
 	int this_cpu = this_rq->cpu;
@@ -9931,6 +9928,7 @@ static int idle_balance(struct rq *this_
 	int pulled_task = 0;
 	u64 curr_cost = 0;
 
+	update_misfit_status(NULL, this_rq);
 	/*
 	 * We must set idle_stamp _before_ calling idle_balance(), such that we
 	 * measure the duration of idle_balance() as idle time.
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1414,10 +1414,14 @@ static inline void unregister_sched_doma
 }
 #endif
 
+extern int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
+
 #else
 
 static inline void sched_ttwu_pending(void) { }
 
+static inline int newidle_balance(struct rq *this_rq, struct rq_flags *rf) { return 0; }
+
 #endif /* CONFIG_SMP */
 
 #include "stats.h"




* [RFC][PATCH 07/16] sched: Allow put_prev_task() to drop rq->lock
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
                   ` (5 preceding siblings ...)
  2019-02-18 16:56 ` [RFC][PATCH 06/16] sched/fair: Export newidle_balance() Peter Zijlstra
@ 2019-02-18 16:56 ` Peter Zijlstra
  2019-02-18 16:56 ` [RFC][PATCH 08/16] sched: Rework pick_next_task() slow-path Peter Zijlstra
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 16:56 UTC (permalink / raw)
  To: mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Peter Zijlstra (Intel)

Currently the pick_next_task() loop is convoluted and ugly because of
how it can drop the rq->lock and needs to restart the picking.

For the RT/Deadline classes, it is put_prev_task() where we do
balancing, and we could do this before the picking loop. Make this
possible.
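
To illustrate the optional-argument convention, here is a toy userspace
model (all names hypothetical, not the kernel API). Passing a non-NULL
flags pointer signals that the callee may temporarily give up the lock
pin and pull work; passing NULL forbids it:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rq_flags: tracks only the lock pin. */
struct toy_rq_flags {
	int pinned;
};

struct toy_rq {
	int nr_running;		/* runnable tasks on this queue */
	int remote_waiting;	/* tasks that a pull could bring over */
	int pulls;		/* number of pull attempts, for inspection */
};

/* Stand-in for pull_rt_task()/pull_dl_task(): may add a runnable task. */
static void toy_pull_task(struct toy_rq *rq)
{
	rq->pulls++;
	if (rq->remote_waiting > 0) {
		rq->remote_waiting--;
		rq->nr_running++;
	}
}

/*
 * @rf == NULL: plain accounting only, the lock is never released.
 * @rf != NULL: the caller has not started picking yet, so it is fine
 *              to unpin, pull and repin here.
 */
static void toy_put_prev_task(struct toy_rq *rq, int prev_still_queued,
			      struct toy_rq_flags *rf)
{
	/* ... per-class accounting for the outgoing task would go here ... */

	if (rf && !prev_still_queued && rq->remote_waiting > 0) {
		rf->pinned = 0;
		toy_pull_task(rq);
		rf->pinned = 1;
	}
}
```

In this model the pull happens before any selection, so whatever it
brings over is already visible when the picking loop starts.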

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c      |    2 +-
 kernel/sched/deadline.c  |   14 +++++++++++++-
 kernel/sched/fair.c      |    2 +-
 kernel/sched/idle.c      |    2 +-
 kernel/sched/rt.c        |   14 +++++++++++++-
 kernel/sched/sched.h     |    4 ++--
 kernel/sched/stop_task.c |    2 +-
 7 files changed, 32 insertions(+), 8 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5592,7 +5592,7 @@ static void calc_load_migrate(struct rq
 		atomic_long_add(delta, &calc_load_tasks);
 }
 
-static void put_prev_task_fake(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_fake(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 }
 
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1773,13 +1773,25 @@ pick_next_task_dl(struct rq *rq, struct
 	return p;
 }
 
-static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 {
 	update_curr_dl(rq);
 
 	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
 	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
+
+	if (rf && !on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
+		/*
+		 * This is OK, because current is on_cpu, which avoids it being
+		 * picked for load-balance and preemption/IRQs are still
+		 * disabled avoiding further scheduler activity on it and we've
+		 * not yet started the picking loop.
+		 */
+		rq_unpin_lock(rq, rf);
+		pull_dl_task(rq);
+		rq_repin_lock(rq, rf);
+	}
 }
 
 /*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7080,7 +7080,7 @@ done: __maybe_unused;
 /*
  * Account for a descheduled task:
  */
-static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct sched_entity *se = &prev->se;
 	struct cfs_rq *cfs_rq;
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,7 +373,7 @@ static void check_preempt_curr_idle(stru
 	resched_curr(rq);
 }
 
-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 }
 
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1593,7 +1593,7 @@ pick_next_task_rt(struct rq *rq, struct
 	return p;
 }
 
-static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
+static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 {
 	update_curr_rt(rq);
 
@@ -1605,6 +1605,18 @@ static void put_prev_task_rt(struct rq *
 	 */
 	if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
+
+	if (rf && !on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
+		/*
+		 * This is OK, because current is on_cpu, which avoids it being
+		 * picked for load-balance and preemption/IRQs are still
+		 * disabled avoiding further scheduler activity on it and we've
+		 * not yet started the picking loop.
+		 */
+		rq_unpin_lock(rq, rf);
+		pull_rt_task(rq);
+		rq_repin_lock(rq, rf);
+	}
 }
 
 #ifdef CONFIG_SMP
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1675,7 +1675,7 @@ struct sched_class {
 	struct task_struct * (*pick_next_task)(struct rq *rq,
 					       struct task_struct *prev,
 					       struct rq_flags *rf);
-	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
+	void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct rq_flags *rf);
 	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
@@ -1721,7 +1721,7 @@ struct sched_class {
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
 	WARN_ON_ONCE(rq->curr != prev);
-	prev->sched_class->put_prev_task(rq, prev);
+	prev->sched_class->put_prev_task(rq, prev, NULL);
 }
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -59,7 +59,7 @@ static void yield_task_stop(struct rq *r
 	BUG(); /* the stop task should never yield, its pointless. */
 }
 
-static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct task_struct *curr = rq->curr;
 	u64 delta_exec;




* [RFC][PATCH 08/16] sched: Rework pick_next_task() slow-path
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
                   ` (6 preceding siblings ...)
  2019-02-18 16:56 ` [RFC][PATCH 07/16] sched: Allow put_prev_task() to drop rq->lock Peter Zijlstra
@ 2019-02-18 16:56 ` Peter Zijlstra
  2019-02-18 16:56 ` [RFC][PATCH 09/16] sched: Introduce sched_class::pick_task() Peter Zijlstra
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 16:56 UTC (permalink / raw)
  To: mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Peter Zijlstra (Intel)

Avoid the RETRY_TASK case in the pick_next_task() slow path.

By doing the put_prev_task() early, we get the rt/deadline pull done,
and by testing rq->nr_running we know if we need newidle_balance().

This then gives a stable state to pick a task from.

Since the fast-path is fair only, the other classes will always see
pick_next_task(.prev=NULL, .rf=NULL) and we can simplify.
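
The ordering described above can be modelled in a few lines of userspace
C. This is only a sketch under simplifying assumptions: three toy
classes, one candidate task per class, and a balance pass that can steal
at most one task. All toy_* names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

struct toy_task {
	const char *name;
};

struct toy_rq {
	struct toy_task *rt;		/* runnable "rt" task, or NULL */
	struct toy_task *fair;		/* runnable "fair" task, or NULL */
	struct toy_task *idle;		/* always valid */
	struct toy_task *stealable;	/* task a balance pass could pull in */
	int nr_running;
};

typedef struct toy_task *(*toy_pick_fn)(struct toy_rq *rq);

static struct toy_task *toy_pick_rt(struct toy_rq *rq)   { return rq->rt; }
static struct toy_task *toy_pick_fair(struct toy_rq *rq) { return rq->fair; }
static struct toy_task *toy_pick_idle(struct toy_rq *rq) { return rq->idle; }

/* Highest priority class first; the idle class always yields a task. */
static const toy_pick_fn toy_classes[] = {
	toy_pick_rt, toy_pick_fair, toy_pick_idle,
};

/* Stand-in for put_prev_task(), including any rt/deadline pull. */
static void toy_put_prev_task(struct toy_rq *rq, struct toy_task *prev)
{
	(void)rq;
	(void)prev;
}

/* Stand-in for newidle_balance(): may make a fair task runnable. */
static void toy_newidle_balance(struct toy_rq *rq)
{
	if (rq->stealable) {
		rq->fair = rq->stealable;
		rq->stealable = NULL;
		rq->nr_running++;
	}
}

static struct toy_task *toy_pick_next_task_slow(struct toy_rq *rq,
						struct toy_task *prev)
{
	/* 1) Everything that may drop the lock happens up front. */
	toy_put_prev_task(rq, prev);
	if (!rq->nr_running)
		toy_newidle_balance(rq);

	/* 2) The state is now stable: one pass, no retry. */
	for (size_t i = 0; i < sizeof(toy_classes) / sizeof(toy_classes[0]); i++) {
		struct toy_task *p = toy_classes[i](rq);

		if (p)
			return p;
	}

	return NULL;	/* not reached: the idle class always has a task */
}
```

Because nothing inside the loop can release the lock in this model,
there is no equivalent of RETRY_TASK and no restart label.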

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c      |   19 ++++++++++++-------
 kernel/sched/deadline.c  |   30 ++----------------------------
 kernel/sched/fair.c      |    9 ++++++---
 kernel/sched/idle.c      |    4 +++-
 kernel/sched/rt.c        |   29 +----------------------------
 kernel/sched/sched.h     |   13 ++++++++-----
 kernel/sched/stop_task.c |    3 ++-
 7 files changed, 34 insertions(+), 73 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3360,7 +3360,7 @@ pick_next_task(struct rq *rq, struct tas
 
 		p = fair_sched_class.pick_next_task(rq, prev, rf);
 		if (unlikely(p == RETRY_TASK))
-			goto again;
+			goto restart;
 
 		/* Assumes fair_sched_class->next == idle_sched_class */
 		if (unlikely(!p))
@@ -3369,14 +3369,19 @@ pick_next_task(struct rq *rq, struct tas
 		return p;
 	}
 
-again:
+restart:
+	/*
+	 * Ensure that we put DL/RT tasks before the pick loop, such that they
+	 * can PULL higher prio tasks when we lower the RQ 'priority'.
+	 */
+	prev->sched_class->put_prev_task(rq, prev, rf);
+	if (!rq->nr_running)
+		newidle_balance(rq, rf);
+
 	for_each_class(class) {
-		p = class->pick_next_task(rq, prev, rf);
-		if (p) {
-			if (unlikely(p == RETRY_TASK))
-				goto again;
+		p = class->pick_next_task(rq, NULL, NULL);
+		if (p)
 			return p;
-		}
 	}
 
 	/* The idle class should always have a runnable task: */
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1730,39 +1730,13 @@ pick_next_task_dl(struct rq *rq, struct
 	struct task_struct *p;
 	struct dl_rq *dl_rq;
 
-	dl_rq = &rq->dl;
-
-	if (need_pull_dl_task(rq, prev)) {
-		/*
-		 * This is OK, because current is on_cpu, which avoids it being
-		 * picked for load-balance and preemption/IRQs are still
-		 * disabled avoiding further scheduler activity on it and we're
-		 * being very careful to re-start the picking loop.
-		 */
-		rq_unpin_lock(rq, rf);
-		pull_dl_task(rq);
-		rq_repin_lock(rq, rf);
-		/*
-		 * pull_dl_task() can drop (and re-acquire) rq->lock; this
-		 * means a stop task can slip in, in which case we need to
-		 * re-start task selection.
-		 */
-		if (rq->stop && task_on_rq_queued(rq->stop))
-			return RETRY_TASK;
-	}
+	WARN_ON_ONCE(prev || rf);
 
-	/*
-	 * When prev is DL, we may throttle it in put_prev_task().
-	 * So, we update time before we check for dl_nr_running.
-	 */
-	if (prev->sched_class == &dl_sched_class)
-		update_curr_dl(rq);
+	dl_rq = &rq->dl;
 
 	if (unlikely(!dl_rq->dl_nr_running))
 		return NULL;
 
-	put_prev_task(rq, prev);
-
 	dl_se = pick_next_dl_entity(rq, dl_rq);
 	BUG_ON(!dl_se);
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6949,7 +6949,7 @@ pick_next_task_fair(struct rq *rq, struc
 		goto idle;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	if (prev->sched_class != &fair_sched_class)
+	if (!prev || prev->sched_class != &fair_sched_class)
 		goto simple;
 
 	/*
@@ -7026,8 +7026,8 @@ pick_next_task_fair(struct rq *rq, struc
 	goto done;
 simple:
 #endif
-
-	put_prev_task(rq, prev);
+	if (prev)
+		put_prev_task(rq, prev);
 
 	do {
 		se = pick_next_entity(cfs_rq, NULL);
@@ -7055,6 +7055,9 @@ done: __maybe_unused;
 	return p;
 
 idle:
+	if (!rf)
+		return NULL;
+
 	new_tasks = newidle_balance(rq, rf);
 
 	/*
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -388,7 +388,9 @@ pick_next_task_idle(struct rq *rq, struc
 {
 	struct task_struct *next = rq->idle;
 
-	put_prev_task(rq, prev);
+	if (prev)
+		put_prev_task(rq, prev);
+
 	set_next_task_idle(rq, next);
 
 	return next;
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1554,38 +1554,11 @@ pick_next_task_rt(struct rq *rq, struct
 	struct task_struct *p;
 	struct rt_rq *rt_rq = &rq->rt;
 
-	if (need_pull_rt_task(rq, prev)) {
-		/*
-		 * This is OK, because current is on_cpu, which avoids it being
-		 * picked for load-balance and preemption/IRQs are still
-		 * disabled avoiding further scheduler activity on it and we're
-		 * being very careful to re-start the picking loop.
-		 */
-		rq_unpin_lock(rq, rf);
-		pull_rt_task(rq);
-		rq_repin_lock(rq, rf);
-		/*
-		 * pull_rt_task() can drop (and re-acquire) rq->lock; this
-		 * means a dl or stop task can slip in, in which case we need
-		 * to re-start task selection.
-		 */
-		if (unlikely((rq->stop && task_on_rq_queued(rq->stop)) ||
-			     rq->dl.dl_nr_running))
-			return RETRY_TASK;
-	}
-
-	/*
-	 * We may dequeue prev's rt_rq in put_prev_task().
-	 * So, we update time before rt_queued check.
-	 */
-	if (prev->sched_class == &rt_sched_class)
-		update_curr_rt(rq);
+	WARN_ON_ONCE(prev || rf);
 
 	if (!rt_rq->rt_queued)
 		return NULL;
 
-	put_prev_task(rq, prev);
-
 	p = _pick_next_task_rt(rq);
 
 	set_next_task_rt(rq, p);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1665,12 +1665,15 @@ struct sched_class {
 	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
 
 	/*
-	 * It is the responsibility of the pick_next_task() method that will
-	 * return the next task to call put_prev_task() on the @prev task or
-	 * something equivalent.
+	 * Both @prev and @rf are optional and may be NULL, in which case the
+	 * caller must already have invoked put_prev_task(rq, prev, rf).
 	 *
-	 * May return RETRY_TASK when it finds a higher prio class has runnable
-	 * tasks.
+	 * Otherwise it is the responsibility of the pick_next_task() to call
+	 * put_prev_task() on the @prev task or something equivalent, IFF it
+	 * returns a next task.
+	 *
+	 * In that case (@rf != NULL) it may return RETRY_TASK when it finds a
+	 * higher prio class has runnable tasks.
 	 */
 	struct task_struct * (*pick_next_task)(struct rq *rq,
 					       struct task_struct *prev,
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -33,10 +33,11 @@ pick_next_task_stop(struct rq *rq, struc
 {
 	struct task_struct *stop = rq->stop;
 
+	WARN_ON_ONCE(prev || rf);
+
 	if (!stop || !task_on_rq_queued(stop))
 		return NULL;
 
-	put_prev_task(rq, prev);
 	set_next_task_stop(rq, stop);
 
 	return stop;




* [RFC][PATCH 09/16] sched: Introduce sched_class::pick_task()
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
                   ` (7 preceding siblings ...)
  2019-02-18 16:56 ` [RFC][PATCH 08/16] sched: Rework pick_next_task() slow-path Peter Zijlstra
@ 2019-02-18 16:56 ` Peter Zijlstra
  2019-02-18 16:56 ` [RFC][PATCH 10/16] sched: Core-wide rq->lock Peter Zijlstra
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 16:56 UTC (permalink / raw)
  To: mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Peter Zijlstra (Intel)

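Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance()) it is not state invariant. This makes it unsuitable
for remote task selection.

Introduce sched_class::pick_task(), which only selects a task, without
the set_next_task() side effects.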
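
A toy userspace model of the split (hypothetical names throughout, not
the kernel's data structures): selection takes a const runqueue and can
therefore be applied to somebody else's queue, while the local
pick_next_task() is selection plus commit.

```c
#include <assert.h>
#include <stddef.h>

struct toy_task {
	int prio;			/* lower value means higher priority */
};

struct toy_rq {
	struct toy_task *tasks[4];
	size_t nr;
	unsigned long set_next_calls;	/* counts commits, for inspection */
};

/* Selection only: the const qualifier documents "no state change". */
static struct toy_task *toy_pick_task(const struct toy_rq *rq)
{
	struct toy_task *best = NULL;

	for (size_t i = 0; i < rq->nr; i++) {
		if (!best || rq->tasks[i]->prio < best->prio)
			best = rq->tasks[i];
	}

	return best;
}

/* Commit: the only place where runqueue state is modified. */
static void toy_set_next_task(struct toy_rq *rq, struct toy_task *p)
{
	(void)p;
	rq->set_next_calls++;
}

/* Local pick: selection followed by commit. */
static struct toy_task *toy_pick_next_task(struct toy_rq *rq)
{
	struct toy_task *p = toy_pick_task(rq);

	if (p)
		toy_set_next_task(rq, p);

	return p;
}
```

A caller can run toy_pick_task() on a sibling's queue any number of
times to see what that sibling would choose, and only the owner commits
the result via toy_pick_next_task().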

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/deadline.c  |   21 ++++++++++++++++-----
 kernel/sched/fair.c      |   30 ++++++++++++++++++++++++++++++
 kernel/sched/idle.c      |   10 +++++++++-
 kernel/sched/rt.c        |   21 ++++++++++++++++-----
 kernel/sched/sched.h     |    2 ++
 kernel/sched/stop_task.c |   21 ++++++++++++++++-----
 6 files changed, 89 insertions(+), 16 deletions(-)

--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1723,15 +1723,12 @@ static struct sched_dl_entity *pick_next
 	return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
-static struct task_struct *
-pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static struct task_struct *pick_task_dl(struct rq *rq)
 {
 	struct sched_dl_entity *dl_se;
 	struct task_struct *p;
 	struct dl_rq *dl_rq;
 
-	WARN_ON_ONCE(prev || rf);
-
 	dl_rq = &rq->dl;
 
 	if (unlikely(!dl_rq->dl_nr_running))
@@ -1742,7 +1739,19 @@ pick_next_task_dl(struct rq *rq, struct
 
 	p = dl_task_of(dl_se);
 
-	set_next_task_dl(rq, p);
+	return p;
+}
+
+static struct task_struct *
+pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *p;
+
+	WARN_ON_ONCE(prev || rf);
+
+	p = pick_task_dl(rq);
+	if (p)
+		set_next_task_dl(rq, p);
 
 	return p;
 }
@@ -2389,6 +2398,8 @@ const struct sched_class dl_sched_class
 	.set_next_task		= set_next_task_dl,
 
 #ifdef CONFIG_SMP
+	.pick_task		= pick_task_dl,
+
 	.select_task_rq		= select_task_rq_dl,
 	.migrate_task_rq	= migrate_task_rq_dl,
 	.set_cpus_allowed       = set_cpus_allowed_dl,
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6937,6 +6937,34 @@ static void check_preempt_wakeup(struct
 }
 
 static struct task_struct *
+pick_task_fair(struct rq *rq)
+{
+	struct cfs_rq *cfs_rq = &rq->cfs;
+	struct sched_entity *se;
+
+	if (!cfs_rq->nr_running)
+		return NULL;
+
+	do {
+		struct sched_entity *curr = cfs_rq->curr;
+
+		se = pick_next_entity(cfs_rq, NULL);
+
+		if (curr) {
+			if (se && curr->on_rq)
+				update_curr(cfs_rq);
+
+			if (!se || entity_before(curr, se))
+				se = curr;
+		}
+
+		cfs_rq = group_cfs_rq(se);
+	} while (cfs_rq);
+
+	return task_of(se);
+}
+
+static struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct cfs_rq *cfs_rq = &rq->cfs;
@@ -10604,6 +10632,8 @@ const struct sched_class fair_sched_clas
 	.set_next_task          = set_next_task_fair,
 
 #ifdef CONFIG_SMP
+	.pick_task		= pick_task_fair,
+
 	.select_task_rq		= select_task_rq_fair,
 	.migrate_task_rq	= migrate_task_rq_fair,
 
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,6 +373,12 @@ static void check_preempt_curr_idle(stru
 	resched_curr(rq);
 }
 
+static struct task_struct *
+pick_task_idle(struct rq *rq)
+{
+	return rq->idle;
+}
+
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 }
@@ -386,11 +392,12 @@ static void set_next_task_idle(struct rq
 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
-	struct task_struct *next = rq->idle;
+	struct task_struct *next;
 
 	if (prev)
 		put_prev_task(rq, prev);
 
+	next = pick_task_idle(rq);
 	set_next_task_idle(rq, next);
 
 	return next;
@@ -458,6 +465,7 @@ const struct sched_class idle_sched_clas
 	.set_next_task          = set_next_task_idle,
 
 #ifdef CONFIG_SMP
+	.pick_task		= pick_task_idle,
 	.select_task_rq		= select_task_rq_idle,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1548,20 +1548,29 @@ static struct task_struct *_pick_next_ta
 	return rt_task_of(rt_se);
 }
 
-static struct task_struct *
-pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static struct task_struct *pick_task_rt(struct rq *rq)
 {
 	struct task_struct *p;
 	struct rt_rq *rt_rq = &rq->rt;
 
-	WARN_ON_ONCE(prev || rf);
-
 	if (!rt_rq->rt_queued)
 		return NULL;
 
 	p = _pick_next_task_rt(rq);
 
-	set_next_task_rt(rq, p);
+	return p;
+}
+
+static struct task_struct *
+pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *p;
+
+	WARN_ON_ONCE(prev || rf);
+
+	p = pick_task_rt(rq);
+	if (p)
+		set_next_task_rt(rq, p);
 
 	return p;
 }
@@ -2364,6 +2373,8 @@ const struct sched_class rt_sched_class
 	.set_next_task          = set_next_task_rt,
 
 #ifdef CONFIG_SMP
+	.pick_task		= pick_task_rt,
+
 	.select_task_rq		= select_task_rq_rt,
 
 	.set_cpus_allowed       = set_cpus_allowed_common,
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1682,6 +1682,8 @@ struct sched_class {
 	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
+	struct task_struct * (*pick_task)(struct rq *rq);
+
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
 	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
 
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -29,20 +29,30 @@ static void set_next_task_stop(struct rq
 }
 
 static struct task_struct *
-pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_task_stop(struct rq *rq)
 {
 	struct task_struct *stop = rq->stop;
 
-	WARN_ON_ONCE(prev || rf);
-
 	if (!stop || !task_on_rq_queued(stop))
 		return NULL;
 
-	set_next_task_stop(rq, stop);
-
 	return stop;
 }
 
+static struct task_struct *
+pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *p;
+
+	WARN_ON_ONCE(prev || rf);
+
+	p = pick_task_stop(rq);
+	if (p)
+		set_next_task_stop(rq, p);
+
+	return p;
+}
+
 static void
 enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
@@ -129,6 +139,7 @@ const struct sched_class stop_sched_clas
 	.set_next_task          = set_next_task_stop,
 
 #ifdef CONFIG_SMP
+	.pick_task		= pick_task_stop,
 	.select_task_rq		= select_task_rq_stop,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif




* [RFC][PATCH 10/16] sched: Core-wide rq->lock
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
                   ` (8 preceding siblings ...)
  2019-02-18 16:56 ` [RFC][PATCH 09/16] sched: Introduce sched_class::pick_task() Peter Zijlstra
@ 2019-02-18 16:56 ` Peter Zijlstra
  2019-02-18 16:56 ` [RFC][PATCH 11/16] sched: Basic tracking of matching tasks Peter Zijlstra
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 16:56 UTC (permalink / raw)
  To: mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Peter Zijlstra (Intel)

Introduce the basic infrastructure to have a core wide rq->lock.
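
As a rough userspace illustration of the idea (not the kernel code), the
sketch below models how a lock accessor can hand every SMT sibling the
same lock once core scheduling is switched on. The names toy_rq,
toy_rq_lockp(), toy_link_siblings() and toy_set_enabled() are invented
for this example, a plain int stands in for the raw spinlock, and
sibling 0 is simply assumed to be the core leader.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct rq; only the fields this sketch needs. */
struct toy_rq {
	int lock;		/* placeholder for raw_spinlock_t __lock */
	struct toy_rq *core;	/* the one rq per core whose lock is shared */
	bool core_enabled;
};

/* Same shape as rq_lockp(): the core's lock when enabled, else our own. */
static int *toy_rq_lockp(struct toy_rq *rq)
{
	if (rq->core_enabled) {
		assert(rq->core != NULL);
		return &rq->core->lock;
	}
	return &rq->lock;
}

/* Assumption: sibling 0 is the leader for the whole core. */
static void toy_link_siblings(struct toy_rq *siblings, size_t n)
{
	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		siblings[i].core = &siblings[0];
}

/* Flip all siblings together; the patch does this under stop_machine(). */
static void toy_set_enabled(struct toy_rq *siblings, size_t n, bool on)
{
	for (size_t i = 0; i < n; i++)
		siblings[i].core_enabled = on;
}
```

With the flag clear, each toy_rq resolves to its own lock; once it is
set on every sibling, they all resolve to the leader's lock. That is
why the flag has to change for all CPUs at once: a lock/unlock pair
must never straddle the switch.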

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/Kconfig.preempt |    8 +++-
 kernel/sched/core.c    |   93 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h   |   31 ++++++++++++++++
 3 files changed, 131 insertions(+), 1 deletion(-)

--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -57,4 +57,10 @@ config PREEMPT
 endchoice
 
 config PREEMPT_COUNT
-       bool
+	bool
+
+config SCHED_CORE
+	bool
+	default y
+	depends on SCHED_SMT
+
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -60,6 +60,70 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * The static-key + stop-machine variable are needed such that:
+ *
+ *	spin_lock(rq_lockp(rq));
+ *	...
+ *	spin_unlock(rq_lockp(rq));
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+static int __sched_core_stopper(void *data)
+{
+	bool enabled = !!(unsigned long)data;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		cpu_rq(cpu)->core_enabled = enabled;
+
+	return 0;
+}
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+
+static void __sched_core_enable(void)
+{
+	// XXX verify there are no cookie tasks (yet)
+
+	static_branch_enable(&__sched_core_enabled);
+	stop_machine(__sched_core_stopper, (void *)true, NULL);
+}
+
+static void __sched_core_disable(void)
+{
+	// XXX verify there are no cookie tasks (left)
+
+	stop_machine(__sched_core_stopper, (void *)false, NULL);
+	static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!sched_core_count++)
+		__sched_core_enable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!--sched_core_count)
+		__sched_core_disable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */
@@ -5862,6 +5926,28 @@ static void sched_rq_cpu_starting(unsign
 
 int sched_cpu_starting(unsigned int cpu)
 {
+#ifdef CONFIG_SCHED_CORE
+	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+	struct rq *rq, *core_rq = NULL;
+	int i;
+
+	for_each_cpu(i, smt_mask) {
+		rq = cpu_rq(i);
+		if (rq->core && rq->core == rq)
+			core_rq = rq;
+	}
+
+	if (!core_rq)
+		core_rq = cpu_rq(cpu);
+
+	for_each_cpu(i, smt_mask) {
+		rq = cpu_rq(i);
+
+		WARN_ON_ONCE(rq->core && rq->core != core_rq);
+		rq->core = core_rq;
+	}
+#endif /* CONFIG_SCHED_CORE */
+
 	sched_rq_cpu_starting(cpu);
 	sched_tick_start(cpu);
 	return 0;
@@ -6088,6 +6176,11 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SCHED_CORE
+		rq->core = NULL;
+		rq->core_enabled = 0;
+#endif
 	}
 
 	set_load_weight(&init_task, false);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -952,6 +952,12 @@ struct rq {
 	/* Must be inspected within a rcu lock section */
 	struct cpuidle_state	*idle_state;
 #endif
+
+#ifdef CONFIG_SCHED_CORE
+	/* per rq */
+	struct rq		*core;
+	unsigned int		core_enabled;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -979,11 +985,36 @@ static inline int cpu_of(struct rq *rq)
 #endif
 }
 
+#ifdef CONFIG_SCHED_CORE
+DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
+}
+
 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
+	if (sched_core_enabled(rq))
+		return &rq->core->__lock;
+
 	return &rq->__lock;
 }
 
+#else /* !CONFIG_SCHED_CORE */
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return false;
+}
+
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+	return &rq->__lock;
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
 




* [RFC][PATCH 11/16] sched: Basic tracking of matching tasks
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
                   ` (9 preceding siblings ...)
  2019-02-18 16:56 ` [RFC][PATCH 10/16] sched: Core-wide rq->lock Peter Zijlstra
@ 2019-02-18 16:56 ` Peter Zijlstra
  2019-02-18 16:56 ` [RFC][PATCH 12/16] sched: A quick and dirty cgroup tagging interface Peter Zijlstra
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 16:56 UTC (permalink / raw)
  To: mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Peter Zijlstra (Intel)

Introduce task_struct::core_cookie as an opaque identifier for core
scheduling. When enabled, core scheduling only allows matching tasks to
run on the same core; the idle task matches everything.

When task_struct::core_cookie is set (and core scheduling is enabled)
these tasks are indexed in a second RB-tree, first on cookie value
then on scheduling function, such that matching task selection always
finds the most eligible match.

NOTE: *shudder* at the overhead...

NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
is per class tracking of cookies and that just duplicates a lot of
stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).
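
To make the two-level ordering concrete, here is a small userspace
sketch that uses a sorted array in place of the RB-tree. Everything in
it (toy_task, toy_core_cmp(), toy_core_sort(), toy_core_find()) is a
made-up stand-in, and a single integer priority replaces the real
scheduling function; it only illustrates "cookie first, then priority,
leftmost match wins, idle as the fallback".

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_task {
	unsigned long cookie;	/* untagged (0) tasks would not be indexed */
	int prio;		/* kernel style: lower number, higher priority */
};

/* Cookie first, then priority, so the best task of a cookie sorts first. */
static int toy_core_cmp(const void *pa, const void *pb)
{
	const struct toy_task *a = pa, *b = pb;

	if (a->cookie != b->cookie)
		return a->cookie < b->cookie ? -1 : 1;
	return (a->prio > b->prio) - (a->prio < b->prio);
}

static void toy_core_sort(struct toy_task *tasks, size_t n)
{
	qsort(tasks, n, sizeof(*tasks), toy_core_cmp);
}

/* Leftmost entry with a matching cookie, or @idle when nothing matches. */
static const struct toy_task *
toy_core_find(const struct toy_task *tasks, size_t n, unsigned long cookie,
	      const struct toy_task *idle)
{
	const struct toy_task *match = idle;
	size_t lo = 0, hi = n;

	assert(tasks != NULL || n == 0);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (cookie < tasks[mid].cookie) {
			hi = mid;
		} else if (cookie > tasks[mid].cookie) {
			lo = mid + 1;
		} else {
			match = &tasks[mid];
			hi = mid;	/* keep looking further left */
		}
	}

	return match;
}
```

Sorting the set once and then asking for a cookie returns the
highest-priority task carrying it, without scanning tasks that belong
to other cookies.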

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |    8 ++
 kernel/sched/core.c   |  145 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  |    4 +
 3 files changed, 156 insertions(+), 1 deletion(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -635,10 +635,16 @@ struct task_struct {
 	const struct sched_class	*sched_class;
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
+	struct sched_dl_entity		dl;
+
+#ifdef CONFIG_SCHED_CORE
+	struct rb_node			core_node;
+	unsigned long			core_cookie;
+#endif
+
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
 #endif
-	struct sched_dl_entity		dl;
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* List of struct preempt_notifier: */
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -64,6 +64,140 @@ int sysctl_sched_rt_runtime = 950000;
 
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
+/* kernel prio, less is more */
+static inline int __task_prio(struct task_struct *p)
+{
+	if (p->sched_class == &stop_sched_class) /* trumps deadline */
+		return -2;
+
+	if (rt_prio(p->prio)) /* includes deadline */
+		return p->prio; /* [-1, 99] */
+
+	if (p->sched_class == &idle_sched_class)
+		return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
+
+	return MAX_RT_PRIO + MAX_NICE; /* 119, squash fair */
+}
+
+/*
+ * l(a,b)
+ * le(a,b) := !l(b,a)
+ * g(a,b)  := l(b,a)
+ * ge(a,b) := !l(a,b)
+ */
+
+/* real prio, less is less */
+static inline bool __prio_less(struct task_struct *a, struct task_struct *b, bool runtime)
+{
+	int pa = __task_prio(a), pb = __task_prio(b);
+
+	if (-pa < -pb)
+		return true;
+
+	if (-pb < -pa)
+		return false;
+
+	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
+		return !dl_time_before(a->dl.deadline, b->dl.deadline);
+
+	if (pa == MAX_RT_PRIO + MAX_NICE && runtime) /* fair */
+		return !((s64)(a->se.vruntime - b->se.vruntime) < 0);
+
+	return false;
+}
+
+static inline bool cpu_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	return __prio_less(a, b, true);
+}
+
+static inline bool core_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	/* cannot compare vruntime across CPUs */
+	return __prio_less(a, b, false);
+}
+
+static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
+{
+	if (a->core_cookie < b->core_cookie)
+		return true;
+
+	if (a->core_cookie > b->core_cookie)
+		return false;
+
+	/* flip prio, so high prio is leftmost */
+	if (cpu_prio_less(b, a))
+		return true;
+
+	return false;
+}
+
+void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+{
+	struct rb_node *parent, **node;
+	struct task_struct *node_task;
+
+	rq->core->core_task_seq++;
+
+	if (!p->core_cookie)
+		return;
+
+	node = &rq->core_tree.rb_node;
+	parent = *node;
+
+	while (*node) {
+		node_task = container_of(*node, struct task_struct, core_node);
+		parent = *node;
+
+		if (__sched_core_less(p, node_task))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&p->core_node, parent, node);
+	rb_insert_color(&p->core_node, &rq->core_tree);
+}
+
+void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->core->core_task_seq++;
+
+	if (!p->core_cookie)
+		return;
+
+	rb_erase(&p->core_node, &rq->core_tree);
+}
+
+/*
+ * Find left-most (aka, highest priority) task matching @cookie.
+ */
+struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
+{
+	struct rb_node *node = rq->core_tree.rb_node;
+	struct task_struct *node_task, *match;
+
+	/*
+	 * The idle task always matches any cookie!
+	 */
+	match = idle_sched_class.pick_task(rq);
+
+	while (node) {
+		node_task = container_of(node, struct task_struct, core_node);
+
+		if (cookie < node_task->core_cookie) {
+			node = node->rb_left;
+		} else if (cookie > node_task->core_cookie) {
+			node = node->rb_right;
+		} else {
+			match = node_task;
+			node = node->rb_left;
+		}
+	}
+
+	return match;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -122,6 +256,11 @@ void sched_core_put(void)
 	mutex_unlock(&sched_core_mutex);
 }
 
+#else /* !CONFIG_SCHED_CORE */
+
+static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
+static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -826,6 +965,9 @@ static void set_load_weight(struct task_
 
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (sched_core_enabled(rq))
+		sched_core_enqueue(rq, p);
+
 	if (!(flags & ENQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
@@ -839,6 +981,9 @@ static inline void enqueue_task(struct r
 
 static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (sched_core_enabled(rq))
+		sched_core_dequeue(rq, p);
+
 	if (!(flags & DEQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -957,6 +957,10 @@ struct rq {
 	/* per rq */
 	struct rq		*core;
 	unsigned int		core_enabled;
+	struct rb_root		core_tree;
+
+	/* shared state */
+	unsigned int		core_task_seq;
 #endif
 };
 




* [RFC][PATCH 12/16] sched: A quick and dirty cgroup tagging interface
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
                   ` (10 preceding siblings ...)
  2019-02-18 16:56 ` [RFC][PATCH 11/16] sched: Basic tracking of matching tasks Peter Zijlstra
@ 2019-02-18 16:56 ` Peter Zijlstra
  2019-02-18 16:56 ` [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling Peter Zijlstra
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 16:56 UTC (permalink / raw)
  To: mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Peter Zijlstra (Intel)

Marks all tasks in a cgroup as matching for core-scheduling.
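
A hedged userspace sketch of what the tag write amounts to: every task
in the group gets the group's address as its cookie, and the first
tagged group switches the feature on while the last untagged one
switches it off. The toy_* names are invented for the example, a bool
replaces the static key, there is no locking, and -1 stands in for
-ERANGE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_group { bool tagged; };
struct toy_task { struct toy_group *group; uintptr_t cookie; };

static int toy_core_users;	/* stands in for sched_core_count */
static bool toy_core_on;	/* stands in for the static key */

static void toy_core_get(void)
{
	if (!toy_core_users++)
		toy_core_on = true;
}

static void toy_core_put(void)
{
	assert(toy_core_users > 0);
	if (!--toy_core_users)
		toy_core_on = false;
}

/* Returns 0 on success and -1 for values other than 0 or 1. */
static int toy_tag_write(struct toy_group *tg, struct toy_task *tasks,
			 size_t n, unsigned long val)
{
	if (val > 1)
		return -1;

	if (tg->tagged == !!val)
		return 0;	/* no change, so no refcount movement */

	tg->tagged = !!val;

	if (val)
		toy_core_get();

	/* Retag only the tasks that belong to this group. */
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].group == tg)
			tasks[i].cookie = val ? (uintptr_t)tg : 0;
	}

	if (!val)
		toy_core_put();

	return 0;
}
```

Using the group's address as the cookie gives each tagged group a
distinct, non-zero value for free, so tasks from two tagged groups
never match each other.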

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |    4 +++
 2 files changed, 66 insertions(+)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6608,6 +6608,15 @@ static void sched_change_group(struct ta
 	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
 			  struct task_group, css);
 	tg = autogroup_task_group(tsk, tg);
+
+#ifdef CONFIG_SCHED_CORE
+	if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
+		tsk->core_cookie = 0UL;
+
+	if (tg->tagged /* && !tsk->core_cookie ? */)
+		tsk->core_cookie = (unsigned long)tg;
+#endif
+
 	tsk->sched_task_group = tg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -7067,6 +7076,43 @@ static u64 cpu_rt_period_read_uint(struc
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_SCHED_CORE
+static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+
+	return !!tg->tagged;
+}
+
+static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	struct task_group *tg = css_tg(css);
+	struct css_task_iter it;
+	struct task_struct *p;
+
+	if (val > 1)
+		return -ERANGE;
+
+	if (tg->tagged == !!val)
+		return 0;
+
+	tg->tagged = !!val;
+
+	if (!!val)
+		sched_core_get();
+
+	css_task_iter_start(css, 0, &it);
+	while ((p = css_task_iter_next(&it)))
+		p->core_cookie = !!val ? (unsigned long)tg : 0UL;
+	css_task_iter_end(&it);
+
+	if (!val)
+		sched_core_put();
+
+	return 0;
+}
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -7103,6 +7149,14 @@ static struct cftype cpu_legacy_files[]
 		.write_u64 = cpu_rt_period_write_uint,
 	},
 #endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "tag",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_read_u64,
+		.write_u64 = cpu_core_tag_write_u64,
+	},
+#endif
 	{ }	/* Terminate */
 };
 
@@ -7270,6 +7324,14 @@ static struct cftype cpu_files[] = {
 		.write = cpu_max_write,
 	},
 #endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "tag",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_read_u64,
+		.write_u64 = cpu_core_tag_write_u64,
+	},
+#endif
 	{ }	/* terminate */
 };
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -363,6 +363,10 @@ struct cfs_bandwidth {
 struct task_group {
 	struct cgroup_subsys_state css;
 
+#ifdef CONFIG_SCHED_CORE
+	int			tagged;
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* schedulable entities of this group on each CPU */
 	struct sched_entity	**se;




* [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
                   ` (11 preceding siblings ...)
  2019-02-18 16:56 ` [RFC][PATCH 12/16] sched: A quick and dirty cgroup tagging interface Peter Zijlstra
@ 2019-02-18 16:56 ` Peter Zijlstra
       [not found]   ` <20190402064612.GA46500@aaronlu>
  2019-04-09 18:38   ` Julien Desfossez
  2019-02-18 16:56 ` [RFC][PATCH 14/16] sched/fair: Add a few assertions Peter Zijlstra
                   ` (6 subsequent siblings)
  19 siblings, 2 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 16:56 UTC (permalink / raw)
  To: mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Peter Zijlstra (Intel)

Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective of which logical
CPU does the reschedule).

NOTE: there is still potential for sibling rivalry.
NOTE: this is far too complicated; but thus far I've failed to
      simplify it further.
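
For readers who want the gist before the diff, here is a deliberately
simplified userspace model of core-wide selection. It is a two-pass
approximation, not the class-by-class max-filter with restarts that the
patch implements: first find the highest-priority runnable task on the
whole core, then let every sibling pick its best task carrying that
task's cookie, or go idle. The toy_* names are invented, a NULL pick
means "forced idle", and untagged tasks (cookie 0) are assumed to match
only each other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_task {
	int prio;		/* lower number, higher priority */
	uintptr_t cookie;
};

struct toy_rq {
	const struct toy_task *tasks;	/* runnable tasks on this sibling */
	size_t nr;
	const struct toy_task *pick;	/* result; NULL means idle */
};

/* Best task on @rq, optionally restricted to one cookie. */
static const struct toy_task *
toy_best(const struct toy_rq *rq, int filter, uintptr_t cookie)
{
	const struct toy_task *best = NULL;

	for (size_t i = 0; i < rq->nr; i++) {
		const struct toy_task *p = &rq->tasks[i];

		if (filter && p->cookie != cookie)
			continue;
		if (!best || p->prio < best->prio)
			best = p;
	}

	return best;
}

/* Fill in ->pick for every sibling; returns the core-wide max (or NULL). */
static const struct toy_task *toy_core_pick(struct toy_rq *siblings, size_t n)
{
	const struct toy_task *max = NULL;

	assert(siblings != NULL || n == 0);

	/* Pass 1: the highest-priority task on the core sets the cookie. */
	for (size_t i = 0; i < n; i++) {
		const struct toy_task *p = toy_best(&siblings[i], 0, 0);

		if (p && (!max || p->prio < max->prio))
			max = p;
	}

	/* Pass 2: each sibling runs its best matching task, or idles. */
	for (size_t i = 0; i < n; i++)
		siblings[i].pick = max ? toy_best(&siblings[i], 1, max->cookie) : NULL;

	return max;
}
```

The model shows the cost the cover letter complains about: a sibling
may skip its own best task, or sit idle with runnable work queued,
because the core-wide winner carries a different cookie.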

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |  222 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |    5 -
 2 files changed, 224 insertions(+), 3 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3552,7 +3552,7 @@ static inline void schedule_debug(struct
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	const struct sched_class *class;
 	struct task_struct *p;
@@ -3597,6 +3597,220 @@ pick_next_task(struct rq *rq, struct tas
 	BUG();
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+	if (is_idle_task(a) || is_idle_task(b))
+		return true;
+
+	return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+{
+	struct task_struct *class_pick, *cookie_pick;
+	unsigned long cookie = 0UL;
+
+	/*
+	 * We must not rely on rq->core->core_cookie here, because we fail to reset
+	 * rq->core->core_cookie on new picks, such that we can detect if we need
+	 * to do single vs multi rq task selection.
+	 */
+
+	if (max && max->core_cookie) {
+		WARN_ON_ONCE(rq->core->core_cookie != max->core_cookie);
+		cookie = max->core_cookie;
+	}
+
+	class_pick = class->pick_task(rq);
+	if (!cookie)
+		return class_pick;
+
+	cookie_pick = sched_core_find(rq, cookie);
+	if (!class_pick)
+		return cookie_pick;
+
+	/*
+	 * If class > max && class > cookie, it is the highest priority task on
+	 * the core (so far) and it must be selected, otherwise we must go with
+	 * the cookie pick in order to satisfy the constraint.
+	 */
+	if (cpu_prio_less(cookie_pick, class_pick) && cpu_prio_less(max, class_pick))
+		return class_pick;
+
+	return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *next, *max = NULL;
+	const struct sched_class *class;
+	const struct cpumask *smt_mask;
+	int i, j, cpu;
+
+	if (!sched_core_enabled(rq))
+		return __pick_next_task(rq, prev, rf);
+
+	/*
+	 * If there were no {en,de}queues since we picked (IOW, the task
+	 * pointers are all still valid), and we haven't scheduled the last
+	 * pick yet, do so now.
+	 */
+	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+	    rq->core->core_pick_seq != rq->core_sched_seq) {
+		WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+		next = rq->core_pick;
+		if (next != prev) {
+			put_prev_task(rq, prev);
+			set_next_task(rq, next);
+		}
+		return next;
+	}
+
+	prev->sched_class->put_prev_task(rq, prev, rf);
+	if (!rq->nr_running)
+		newidle_balance(rq, rf);
+
+	cpu = cpu_of(rq);
+	smt_mask = cpu_smt_mask(cpu);
+
+	/*
+	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+	 *
+	 * @task_seq guards the task state ({en,de}queues)
+	 * @pick_seq is the @task_seq we did a selection on
+	 * @sched_seq is the @pick_seq we scheduled
+	 *
+	 * However, preemptions can cause multiple picks on the same task set.
+	 * 'Fix' this by also increasing @task_seq for every pick.
+	 */
+	rq->core->core_task_seq++;
+
+	/* reset state */
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		rq_i->core_pick = NULL;
+
+		if (i != cpu)
+			update_rq_clock(rq_i);
+	}
+
+	/*
+	 * Try and select tasks for each sibling in descending sched_class
+	 * order.
+	 */
+	for_each_class(class) {
+again:
+		for_each_cpu_wrap(i, smt_mask, cpu) {
+			struct rq *rq_i = cpu_rq(i);
+			struct task_struct *p;
+
+			if (rq_i->core_pick)
+				continue;
+
+			/*
+			 * If this sibling doesn't yet have a suitable task to
+			 * run; ask for the most eligible task, given the
+			 * highest priority task already selected for this
+			 * core.
+			 */
+			p = pick_task(rq_i, class, max);
+			if (!p) {
+				/*
+				 * If there weren't any cookies; we don't need
+				 * to bother with the other siblings.
+				 */
+				if (i == cpu && !rq->core->core_cookie)
+					goto next_class;
+
+				continue;
+			}
+
+			/*
+			 * Optimize the 'normal' case where there aren't any
+			 * cookies and we don't need to sync up.
+			 */
+			if (i == cpu && !rq->core->core_cookie && !p->core_cookie) {
+				next = p;
+				goto done;
+			}
+
+			rq_i->core_pick = p;
+
+			/*
+			 * If this new candidate is of higher priority than the
+			 * previous; and they're incompatible; we need to wipe
+			 * the slate and start over.
+			 *
+			 * NOTE: this is a linear max-filter and is thus bounded
+			 * in execution time.
+			 */
+			if (!max || core_prio_less(max, p)) {
+				struct task_struct *old_max = max;
+
+				rq->core->core_cookie = p->core_cookie;
+				max = p;
+
+				if (old_max && !cookie_match(old_max, p)) {
+					for_each_cpu(j, smt_mask) {
+						if (j == i)
+							continue;
+
+						cpu_rq(j)->core_pick = NULL;
+					}
+					goto again;
+				}
+			}
+		}
+next_class:;
+	}
+
+	rq->core->core_pick_seq = rq->core->core_task_seq;
+
+	/*
+	 * Reschedule siblings
+	 *
+	 * NOTE: L1TF -- at this point we're no longer running the old task and
+	 * sending an IPI (below) ensures the sibling will no longer be running
+	 * their task. This ensures there is no inter-sibling overlap between
+	 * non-matching user state.
+	 */
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		WARN_ON_ONCE(!rq_i->core_pick);
+
+		if (i == cpu)
+			continue;
+
+		if (rq_i->curr != rq_i->core_pick)
+			resched_curr(rq_i);
+	}
+
+	rq->core_sched_seq = rq->core->core_pick_seq;
+	next = rq->core_pick;
+
+done:
+	set_next_task(rq, next);
+	return next;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	return __pick_next_task(rq, prev, rf);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -5866,7 +6080,7 @@ static void migrate_tasks(struct rq *dea
 		/*
 		 * pick_next_task() assumes pinned rq->lock:
 		 */
-		next = pick_next_task(rq, &fake_task, rf);
+		next = __pick_next_task(rq, &fake_task, rf);
 		BUG_ON(!next);
 		put_prev_task(rq, next);
 
@@ -6322,7 +6536,11 @@ void __init sched_init(void)
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = NULL;
+		rq->core_pick = NULL;
 		rq->core_enabled = 0;
+		rq->core_tree = RB_ROOT;
+
+		rq->core_cookie = 0UL;
 #endif
 	}
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -960,11 +960,15 @@ struct rq {
 #ifdef CONFIG_SCHED_CORE
 	/* per rq */
 	struct rq		*core;
+	struct task_struct	*core_pick;
 	unsigned int		core_enabled;
+	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
 
 	/* shared state */
 	unsigned int		core_task_seq;
+	unsigned int		core_pick_seq;
+	unsigned long		core_cookie;
 #endif
 };
 
@@ -1770,7 +1774,6 @@ static inline void put_prev_task(struct
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
-	WARN_ON_ONCE(rq->curr != next);
 	next->sched_class->set_next_task(rq, next);
 }
 




* [RFC][PATCH 14/16] sched/fair: Add a few assertions
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
                   ` (12 preceding siblings ...)
  2019-02-18 16:56 ` [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling Peter Zijlstra
@ 2019-02-18 16:56 ` Peter Zijlstra
  2019-02-18 16:56 ` [RFC][PATCH 15/16] sched: Trivial forced-newidle balancer Peter Zijlstra
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 16:56 UTC (permalink / raw)
  To: mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Peter Zijlstra (Intel)


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6208,6 +6208,11 @@ static int select_idle_sibling(struct ta
 	struct sched_domain *sd;
 	int i, recent_used_cpu;
 
+	/*
+	 * per-cpu select_idle_mask usage
+	 */
+	lockdep_assert_irqs_disabled();
+
 	if (available_idle_cpu(target))
 		return target;
 
@@ -6635,8 +6640,6 @@ static int find_energy_efficient_cpu(str
  * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
  *
  * Returns the target CPU number.
- *
- * preempt must be disabled.
  */
 static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
@@ -6647,6 +6650,11 @@ select_task_rq_fair(struct task_struct *
 	int want_affine = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
+	/*
+	 * required for stable ->cpus_allowed
+	 */
+	lockdep_assert_held(&p->pi_lock);
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
 




* [RFC][PATCH 15/16] sched: Trivial forced-newidle balancer
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
                   ` (13 preceding siblings ...)
  2019-02-18 16:56 ` [RFC][PATCH 14/16] sched/fair: Add a few assertions Peter Zijlstra
@ 2019-02-18 16:56 ` Peter Zijlstra
  2019-02-21 16:19   ` Valentin Schneider
  2019-02-18 16:56 ` [RFC][PATCH 16/16] sched: Debug bits Peter Zijlstra
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 16:56 UTC (permalink / raw)
  To: mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Peter Zijlstra (Intel)

When a sibling is forced-idle to match the core-cookie; search for
matching tasks to fill the core.
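
The search can be pictured with the userspace sketch below. It is only
an approximation under stated assumptions: the toy_* names are
invented, run queues are small fixed arrays, affinity is a plain
bitmask, there is no locking, and the core_occupation comparison and
the "destination is currently idle" check from the patch are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_TASKS 8

struct toy_task {
	uintptr_t cookie;
	unsigned long allowed;	/* bit N set: may run on CPU N */
	bool running;		/* currently on a CPU, so not stealable */
};

struct toy_rq {
	struct toy_task *tasks[TOY_MAX_TASKS];
	size_t nr;
};

/*
 * Move one queued task with a matching cookie to CPU @dst, scanning the
 * other CPUs in wrap-around order. Returns the task, or NULL if none fits.
 */
static struct toy_task *
toy_steal_cookie(struct toy_rq *rqs, size_t nr_cpus, size_t dst, uintptr_t cookie)
{
	assert(dst < nr_cpus && nr_cpus <= 32);

	if (!cookie || rqs[dst].nr >= TOY_MAX_TASKS)
		return NULL;

	for (size_t k = 1; k < nr_cpus; k++) {
		struct toy_rq *src = &rqs[(dst + k) % nr_cpus];

		for (size_t i = 0; i < src->nr; i++) {
			struct toy_task *p = src->tasks[i];

			if (p->cookie != cookie || p->running)
				continue;
			if (!(p->allowed & (1UL << dst)))
				continue;

			/* Dequeue from @src (order not preserved), enqueue on @dst. */
			src->tasks[i] = src->tasks[--src->nr];
			rqs[dst].tasks[rqs[dst].nr++] = p;
			return p;
		}
	}

	return NULL;
}
```

Only tasks that are queued but not running are candidates, which
mirrors the patch skipping src->curr and src->core_pick.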

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |    1 
 kernel/sched/core.c   |  131 +++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/idle.c   |    1 
 kernel/sched/sched.h  |    6 ++
 4 files changed, 138 insertions(+), 1 deletion(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -640,6 +640,7 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
 	struct rb_node			core_node;
 	unsigned long			core_cookie;
+	unsigned int			core_occupation;
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -198,6 +198,21 @@ struct task_struct *sched_core_find(stru
 	return match;
 }
 
+struct task_struct *sched_core_next(struct task_struct *p, unsigned long cookie)
+{
+	struct rb_node *node = &p->core_node;
+
+	node = rb_next(node);
+	if (!node)
+		return NULL;
+
+	p = container_of(node, struct task_struct, core_node);
+	if (p->core_cookie != cookie)
+		return NULL;
+
+	return p;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -3650,7 +3665,7 @@ pick_next_task(struct rq *rq, struct tas
 	struct task_struct *next, *max = NULL;
 	const struct sched_class *class;
 	const struct cpumask *smt_mask;
-	int i, j, cpu;
+	int i, j, cpu, occ = 0;
 
 	if (!sched_core_enabled(rq))
 		return __pick_next_task(rq, prev, rf);
@@ -3741,6 +3756,9 @@ pick_next_task(struct rq *rq, struct tas
 				goto done;
 			}
 
+			if (!is_idle_task(p))
+				occ++;
+
 			rq_i->core_pick = p;
 
 			/*
@@ -3764,6 +3782,7 @@ pick_next_task(struct rq *rq, struct tas
 
 						cpu_rq(j)->core_pick = NULL;
 					}
+					occ = 1;
 					goto again;
 				}
 			}
@@ -3786,6 +3805,8 @@ next_class:;
 
 		WARN_ON_ONCE(!rq_i->core_pick);
 
+		rq_i->core_pick->core_occupation = occ;
+
 		if (i == cpu)
 			continue;
 
@@ -3801,6 +3822,114 @@ next_class:;
 	return next;
 }
 
+static bool try_steal_cookie(int this, int that)
+{
+	struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
+	struct task_struct *p;
+	unsigned long cookie;
+	bool success = false;
+
+	local_irq_disable();
+	double_rq_lock(dst, src);
+
+	cookie = dst->core->core_cookie;
+	if (!cookie)
+		goto unlock;
+
+	if (dst->curr != dst->idle)
+		goto unlock;
+
+	p = sched_core_find(src, cookie);
+	if (p == src->idle)
+		goto unlock;
+
+	do {
+		if (p == src->core_pick || p == src->curr)
+			goto next;
+
+		if (!cpumask_test_cpu(this, &p->cpus_allowed))
+			goto next;
+
+		if (p->core_occupation > dst->idle->core_occupation)
+			goto next;
+
+		p->on_rq = TASK_ON_RQ_MIGRATING;
+		deactivate_task(src, p, 0);
+		set_task_cpu(p, this);
+		activate_task(dst, p, 0);
+		p->on_rq = TASK_ON_RQ_QUEUED;
+
+		resched_curr(dst);
+
+		success = true;
+		break;
+
+next:
+		p = sched_core_next(p, cookie);
+	} while (p);
+
+unlock:
+	double_rq_unlock(dst, src);
+	local_irq_enable();
+
+	return success;
+}
+
+static bool steal_cookie_task(int cpu, struct sched_domain *sd)
+{
+	int i;
+
+	for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
+		if (i == cpu)
+			continue;
+
+		if (need_resched())
+			break;
+
+		if (try_steal_cookie(cpu, i))
+			return true;
+	}
+
+	return false;
+}
+
+static void sched_core_balance(struct rq *rq)
+{
+	struct sched_domain *sd;
+	int cpu = cpu_of(rq);
+
+	rcu_read_lock();
+	raw_spin_unlock_irq(rq_lockp(rq));
+	for_each_domain(cpu, sd) {
+		if (!(sd->flags & SD_LOAD_BALANCE))
+			break;
+
+		if (need_resched())
+			break;
+
+		if (steal_cookie_task(cpu, sd))
+			break;
+	}
+	raw_spin_lock_irq(rq_lockp(rq));
+	rcu_read_unlock();
+}
+
+static DEFINE_PER_CPU(struct callback_head, core_balance_head);
+
+void queue_core_balance(struct rq *rq)
+{
+	if (!sched_core_enabled(rq))
+		return;
+
+	if (!rq->core->core_cookie)
+		return;
+
+	if (!rq->nr_running) /* not forced idle */
+		return;
+
+	queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance);
+}
+
 #else /* !CONFIG_SCHED_CORE */
 
 static struct task_struct *
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -387,6 +387,7 @@ static void set_next_task_idle(struct rq
 {
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
+	queue_core_balance(rq);
 }
 
 static struct task_struct *
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1013,6 +1013,8 @@ static inline raw_spinlock_t *rq_lockp(s
 	return &rq->__lock;
 }
 
+extern void queue_core_balance(struct rq *rq);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -1025,6 +1027,10 @@ static inline raw_spinlock_t *rq_lockp(s
 	return &rq->__lock;
 }
 
+static inline void queue_core_balance(struct rq *rq)
+{
+}
+
 #endif /* CONFIG_SCHED_CORE */
 
 #ifdef CONFIG_SCHED_SMT
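
The candidate walk in try_steal_cookie() above can be sketched in userspace as follows. This is only an illustration under stated assumptions: a flat array sorted by cookie stands in for the rb-tree, and every toy_* name is hypothetical, not part of the patch. One deliberate difference: toy_find() returns NULL when nothing matches, where the patch compares the result of sched_core_find() against src->idle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the task_struct fields the balancer looks at. */
struct toy_task {
	unsigned long cookie;     /* plays the role of core_cookie */
	unsigned int occupation;  /* plays the role of core_occupation */
	bool running;             /* task is src->curr or src->core_pick */
	bool allowed_on_dst;      /* destination CPU is in the task's allowed mask */
};

/* First task with a matching cookie, or NULL; tasks[] is sorted by cookie. */
static const struct toy_task *toy_find(const struct toy_task *tasks, size_t n,
				       unsigned long cookie)
{
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].cookie == cookie)
			return &tasks[i];
	}
	return NULL;
}

/* Successor in sorted order, but only while the cookie still matches. */
static const struct toy_task *toy_next(const struct toy_task *tasks, size_t n,
				       const struct toy_task *p,
				       unsigned long cookie)
{
	size_t i = (size_t)(p - tasks) + 1;

	assert(p >= tasks && i <= n);
	if (i >= n || tasks[i].cookie != cookie)
		return NULL;
	return &tasks[i];
}

/* Walk the run of matching tasks; return the first stealable one or NULL. */
static const struct toy_task *toy_pick_steal(const struct toy_task *tasks,
					     size_t n, unsigned long cookie,
					     unsigned int dst_occupation)
{
	const struct toy_task *p;

	if (!cookie)		/* nothing to match against */
		return NULL;

	for (p = toy_find(tasks, n, cookie); p; p = toy_next(tasks, n, p, cookie)) {
		if (p->running)
			continue;	/* already picked or running on the source */
		if (!p->allowed_on_dst)
			continue;	/* affinity forbids the move */
		if (p->occupation > dst_occupation)
			continue;	/* source core is fuller than the destination */
		return p;
	}
	return NULL;
}
```

The three skip conditions mirror the three "goto next" checks in the patch; the locking, the actual migration and the resched are left out on purpose.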




* [RFC][PATCH 16/16] sched: Debug bits...
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
                   ` (14 preceding siblings ...)
  2019-02-18 16:56 ` [RFC][PATCH 15/16] sched: Trivial forced-newidle balancer Peter Zijlstra
@ 2019-02-18 16:56 ` Peter Zijlstra
  2019-02-18 17:49 ` [RFC][PATCH 00/16] sched: Core scheduling Linus Torvalds
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 16:56 UTC (permalink / raw)
  To: mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Peter Zijlstra (Intel)


Not-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -91,6 +91,10 @@ static inline bool __prio_less(struct ta
 {
 	int pa = __task_prio(a), pb = __task_prio(b);
 
+	trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+			a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+			b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
 	if (-pa < -pb)
 		return true;
 
@@ -245,6 +249,8 @@ static void __sched_core_enable(void)
 
 	static_branch_enable(&__sched_core_enabled);
 	stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+	printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
@@ -253,6 +259,8 @@ static void __sched_core_disable(void)
 
 	stop_machine(__sched_core_stopper, (void *)false, NULL);
 	static_branch_disable(&__sched_core_enabled);
+
+	printk("core sched disabled\n");
 }
 
 void sched_core_get(void)
@@ -3684,6 +3692,14 @@ pick_next_task(struct rq *rq, struct tas
 			put_prev_task(rq, prev);
 			set_next_task(rq, next);
 		}
+
+		trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+				rq->core->core_task_seq,
+				rq->core->core_pick_seq,
+				rq->core_sched_seq,
+				next->comm, next->pid,
+				next->core_cookie);
+
 		return next;
 	}
 
@@ -3753,6 +3769,10 @@ pick_next_task(struct rq *rq, struct tas
 			 */
 			if (i == cpu && !rq->core->core_cookie && !p->core_cookie) {
 				next = p;
+
+				trace_printk("unconstrained pick: %s/%d %lx\n",
+						next->comm, next->pid, next->core_cookie);
+
 				goto done;
 			}
 
@@ -3761,6 +3781,9 @@ pick_next_task(struct rq *rq, struct tas
 
 			rq_i->core_pick = p;
 
+			trace_printk("cpu(%d): selected: %s/%d %lx\n",
+					i, p->comm, p->pid, p->core_cookie);
+
 			/*
 			 * If this new candidate is of higher priority than the
 			 * previous; and they're incompatible; we need to wipe
@@ -3774,6 +3797,7 @@ pick_next_task(struct rq *rq, struct tas
 
 				rq->core->core_cookie = p->core_cookie;
 				max = p;
+				trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie);
 
 				if (old_max && !cookie_match(old_max, p)) {
 					for_each_cpu(j, smt_mask) {
@@ -3810,13 +3834,17 @@ next_class:;
 		if (i == cpu)
 			continue;
 
-		if (rq_i->curr != rq_i->core_pick)
+		if (rq_i->curr != rq_i->core_pick) {
 			resched_curr(rq_i);
+			trace_printk("IPI(%d)\n", i);
+		}
 	}
 
 	rq->core_sched_seq = rq->core->core_pick_seq;
 	next = rq->core_pick;
 
+	trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);
+
 done:
 	set_next_task(rq, next);
 	return next;
@@ -3853,6 +3881,10 @@ static bool try_steal_cookie(int this, i
 		if (p->core_occupation > dst->idle->core_occupation)
 			goto next;
 
+		trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+				p->comm, p->pid, that, this,
+				p->core_occupation, dst->idle->core_occupation, cookie);
+
 		p->on_rq = TASK_ON_RQ_MIGRATING;
 		deactivate_task(src, p, 0);
 		set_task_cpu(p, this);
@@ -6434,6 +6466,8 @@ int sched_cpu_starting(unsigned int cpu)
 		WARN_ON_ONCE(rq->core && rq->core != core_rq);
 		rq->core = core_rq;
 	}
+
+	printk("core: %d -> %d\n", cpu, cpu_of(core_rq));
 #endif /* CONFIG_SCHED_CORE */
 
 	sched_rq_cpu_starting(cpu);




* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
                   ` (15 preceding siblings ...)
  2019-02-18 16:56 ` [RFC][PATCH 16/16] sched: Debug bits Peter Zijlstra
@ 2019-02-18 17:49 ` Linus Torvalds
  2019-02-18 20:40   ` Peter Zijlstra
                     ` (2 more replies)
  2019-02-19 22:07 ` Greg Kerr
                   ` (2 subsequent siblings)
  19 siblings, 3 replies; 99+ messages in thread
From: Linus Torvalds @ 2019-02-18 17:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Paul Turner, Tim Chen,
	Linux List Kernel Mailing, subhra.mazumdar,
	Frédéric Weisbecker, Kees Cook, kerrnel

On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> However; whichever way around you turn this cookie; it is expensive and nasty.

Do you (or anybody else) have numbers for real loads?

Because performance is all that matters. If performance is bad, then
it's pointless, since just turning off SMT is the answer.

                  Linus


* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-18 17:49 ` [RFC][PATCH 00/16] sched: Core scheduling Linus Torvalds
@ 2019-02-18 20:40   ` Peter Zijlstra
  2019-02-19  0:29     ` Linus Torvalds
  2019-02-22 12:17     ` Paolo Bonzini
  2019-02-21  2:53   ` Subhra Mazumdar
  2019-02-22 12:45   ` Mel Gorman
  2 siblings, 2 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-18 20:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Thomas Gleixner, Paul Turner, Tim Chen,
	Linux List Kernel Mailing, subhra.mazumdar,
	Frédéric Weisbecker, Kees Cook, kerrnel

On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > However; whichever way around you turn this cookie; it is expensive and nasty.
> 
> Do you (or anybody else) have numbers for real loads?
> 
> Because performance is all that matters. If performance is bad, then
> it's pointless, since just turning off SMT is the answer.

Not for these patches; they stopped crashing only yesterday and I
cleaned them up and sent them out.

The previous version; which was more horrible; but L1TF complete, was
between OK-ish and horrible depending on the number of VMEXITs a
workload had.

If there were close to no VMEXITs, it beat smt=off, if there were lots
of VMEXITs it was far far worse. Supposedly hosting people try their
very bestest to have no VMEXITs so it mostly works for them (with the
obvious exception of single VCPU guests).

It's just that people have been bugging me for this crap; and I figure
I'd post it now that it's not exploding anymore and let others have at.




* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-18 20:40   ` Peter Zijlstra
@ 2019-02-19  0:29     ` Linus Torvalds
  2019-02-19 15:15       ` Ingo Molnar
  2019-02-22 12:17     ` Paolo Bonzini
  1 sibling, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2019-02-19  0:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Paul Turner, Tim Chen,
	Linux List Kernel Mailing, subhra.mazumdar,
	Frédéric Weisbecker, Kees Cook, kerrnel

On Mon, Feb 18, 2019 at 12:40 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> If there were close to no VMEXITs, it beat smt=off, if there were lots
> of VMEXITs it was far far worse. Supposedly hosting people try their
> very bestest to have no VMEXITs so it mostly works for them (with the
> obvious exception of single VCPU guests).
>
> It's just that people have been bugging me for this crap; and I figure
> I'd post it now that it's not exploding anymore and let others have at.

The patches didn't look disgusting to me, but I admittedly just
scanned through them quickly.

Are there downsides (maintenance and/or performance) when core
scheduling _isn't_ enabled? I guess if it's not a maintenance or
performance nightmare when off, it's ok to just give people the
option.

That all assumes that it works at all for the people who are clamoring
for this feature, but I guess they can run some loads on it
eventually. It's a holiday in the US right now ("Presidents' Day"),
but maybe we can get some numbers this week?

                Linus


* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-19  0:29     ` Linus Torvalds
@ 2019-02-19 15:15       ` Ingo Molnar
  0 siblings, 0 replies; 99+ messages in thread
From: Ingo Molnar @ 2019-02-19 15:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Thomas Gleixner, Paul Turner, Tim Chen,
	Linux List Kernel Mailing, subhra.mazumdar,
	Frédéric Weisbecker, Kees Cook, kerrnel


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Feb 18, 2019 at 12:40 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > If there were close to no VMEXITs, it beat smt=off, if there were lots
> > of VMEXITs it was far far worse. Supposedly hosting people try their
> > very bestest to have no VMEXITs so it mostly works for them (with the
> > obvious exception of single VCPU guests).
> >
> > It's just that people have been bugging me for this crap; and I figure
> > I'd post it now that it's not exploding anymore and let others have at.
> 
> The patches didn't look disgusting to me, but I admittedly just
> scanned through them quickly.
> 
> Are there downsides (maintenance and/or performance) when core
> scheduling _isn't_ enabled? I guess if it's not a maintenance or
> performance nightmare when off, it's ok to just give people the
> option.

So this bit is the main straight-line performance impact when the 
CONFIG_SCHED_CORE Kconfig feature is present (which I expect distros to 
enable broadly):

  +static inline bool sched_core_enabled(struct rq *rq)
  +{
  +       return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
  +}

   static inline raw_spinlock_t *rq_lockp(struct rq *rq)
   {
  +       if (sched_core_enabled(rq))
  +               return &rq->core->__lock;
  +
          return &rq->__lock;


This should at least in principle keep the runtime overhead down to more 
NOPs and a bit bigger instruction cache footprint - modulo compiler 
shenanigans.
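
The lock-selection pattern quoted above can be tried out in userspace with a minimal sketch. Assumptions: a plain bool stands in for the static key (so there is no NOP patching here, only the same decision logic), and all toy_* names are hypothetical rather than taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock; stands in for raw_spinlock_t. */
struct toy_lock {
	int locked;
};

struct toy_rq {
	struct toy_lock lock;	/* per-runqueue lock, like rq->__lock */
	struct toy_rq *core;	/* runqueue whose lock is shared core-wide */
	bool core_enabled;	/* per-rq switch, like rq->core_enabled */
};

/* Stand-in for the static key: an ordinary flag, not a patched branch. */
static bool toy_sched_core_key;

static bool toy_sched_core_enabled(const struct toy_rq *rq)
{
	return toy_sched_core_key && rq->core_enabled;
}

/* Every lock user goes through this accessor instead of touching rq->lock. */
static struct toy_lock *toy_rq_lockp(struct toy_rq *rq)
{
	if (toy_sched_core_enabled(rq)) {
		assert(rq->core != NULL);
		return &rq->core->lock;	/* all siblings share one lock */
	}
	return &rq->lock;		/* default: each rq has its own lock */
}
```

With the flag off, two sibling runqueues resolve to two different locks; with the flag and both per-rq switches on, they resolve to the same lock, which is the property the accessor exists to provide.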

Here's the code generation impact on x86-64 defconfig:

   text	   data	    bss	    dec	    hex	filename
    228	     48	      0	    276	    114	sched.core.n/cpufreq.o (ex sched.core.n/built-in.a)
    228	     48	      0	    276	    114	sched.core.y/cpufreq.o (ex sched.core.y/built-in.a)

   4438	     96	      0	   4534	   11b6	sched.core.n/completion.o (ex sched.core.n/built-in.a)
   4438	     96	      0	   4534	   11b6	sched.core.y/completion.o (ex sched.core.y/built-in.a)

   2167	   2428	      0	   4595	   11f3	sched.core.n/cpuacct.o (ex sched.core.n/built-in.a)
   2167	   2428	      0	   4595	   11f3	sched.core.y/cpuacct.o (ex sched.core.y/built-in.a)

  61099	  22114	    488	  83701	  146f5	sched.core.n/core.o (ex sched.core.n/built-in.a)
  70541	  25370	    508	  96419	  178a3	sched.core.y/core.o (ex sched.core.y/built-in.a)

   3262	   6272	      0	   9534	   253e	sched.core.n/wait_bit.o (ex sched.core.n/built-in.a)
   3262	   6272	      0	   9534	   253e	sched.core.y/wait_bit.o (ex sched.core.y/built-in.a)

  12235	    341	     96	  12672	   3180	sched.core.n/rt.o (ex sched.core.n/built-in.a)
  13073	    917	     96	  14086	   3706	sched.core.y/rt.o (ex sched.core.y/built-in.a)

  10293	    477	   1928	  12698	   319a	sched.core.n/topology.o (ex sched.core.n/built-in.a)
  10363	    509	   1928	  12800	   3200	sched.core.y/topology.o (ex sched.core.y/built-in.a)

    886	     24	      0	    910	    38e	sched.core.n/cpupri.o (ex sched.core.n/built-in.a)
    886	     24	      0	    910	    38e	sched.core.y/cpupri.o (ex sched.core.y/built-in.a)

   1061	     64	      0	   1125	    465	sched.core.n/stop_task.o (ex sched.core.n/built-in.a)
   1077	    128	      0	   1205	    4b5	sched.core.y/stop_task.o (ex sched.core.y/built-in.a)

  18443	    365	     24	  18832	   4990	sched.core.n/deadline.o (ex sched.core.n/built-in.a)
  20019	   2189	     24	  22232	   56d8	sched.core.y/deadline.o (ex sched.core.y/built-in.a)

   1123	      8	     64	   1195	    4ab	sched.core.n/loadavg.o (ex sched.core.n/built-in.a)
   1123	      8	     64	   1195	    4ab	sched.core.y/loadavg.o (ex sched.core.y/built-in.a)

   1323	      8	      0	   1331	    533	sched.core.n/stats.o (ex sched.core.n/built-in.a)
   1323	      8	      0	   1331	    533	sched.core.y/stats.o (ex sched.core.y/built-in.a)

   1282	    164	     32	   1478	    5c6	sched.core.n/isolation.o (ex sched.core.n/built-in.a)
   1282	    164	     32	   1478	    5c6	sched.core.y/isolation.o (ex sched.core.y/built-in.a)

   1564	     36	      0	   1600	    640	sched.core.n/cpudeadline.o (ex sched.core.n/built-in.a)
   1564	     36	      0	   1600	    640	sched.core.y/cpudeadline.o (ex sched.core.y/built-in.a)

   1640	     56	      0	   1696	    6a0	sched.core.n/swait.o (ex sched.core.n/built-in.a)
   1640	     56	      0	   1696	    6a0	sched.core.y/swait.o (ex sched.core.y/built-in.a)

   1859	    244	     32	   2135	    857	sched.core.n/clock.o (ex sched.core.n/built-in.a)
   1859	    244	     32	   2135	    857	sched.core.y/clock.o (ex sched.core.y/built-in.a)

   2339	      8	      0	   2347	    92b	sched.core.n/cputime.o (ex sched.core.n/built-in.a)
   2339	      8	      0	   2347	    92b	sched.core.y/cputime.o (ex sched.core.y/built-in.a)

   3014	     32	      0	   3046	    be6	sched.core.n/membarrier.o (ex sched.core.n/built-in.a)
   3014	     32	      0	   3046	    be6	sched.core.y/membarrier.o (ex sched.core.y/built-in.a)

  50027	    964	     96	  51087	   c78f	sched.core.n/fair.o (ex sched.core.n/built-in.a)
  51537	   2484	     96	  54117	   d365	sched.core.y/fair.o (ex sched.core.y/built-in.a)

   3192	    220	      0	   3412	    d54	sched.core.n/idle.o (ex sched.core.n/built-in.a)
   3276	    252	      0	   3528	    dc8	sched.core.y/idle.o (ex sched.core.y/built-in.a)

   3633	      0	      0	   3633	    e31	sched.core.n/pelt.o (ex sched.core.n/built-in.a)
   3633	      0	      0	   3633	    e31	sched.core.y/pelt.o (ex sched.core.y/built-in.a)

   3794	    160	      0	   3954	    f72	sched.core.n/wait.o (ex sched.core.n/built-in.a)
   3794	    160	      0	   3954	    f72	sched.core.y/wait.o (ex sched.core.y/built-in.a)

I'd say this one is representative:

   text	   data	    bss	    dec	    hex	filename
  12235	    341	     96	  12672	   3180	sched.core.n/rt.o (ex sched.core.n/built-in.a)
  13073	    917	     96	  14086	   3706	sched.core.y/rt.o (ex sched.core.y/built-in.a)

which ~6% bloat is primarily due to the higher rq-lock inlining overhead, 
I believe.

This is roughly what you'd expect from a change wrapping all 350+ inlined 
instantiations of rq->lock uses. I.e. it might make sense to uninline it.
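
For reference, the bloat figure can be recomputed from the size output above: the text column of rt.o grows from 12235 to 13073 bytes, i.e. by 838 bytes or roughly 6.8%. A small helper along these lines (the name growth_permille is hypothetical) does the same arithmetic in integers for any pair of rows:

```c
#include <assert.h>

/*
 * Growth from 'before' to 'after' in tenths of a percent (per mille),
 * rounded down. Integer-only, so 68 means 6.8%.
 */
static unsigned long growth_permille(unsigned long before, unsigned long after)
{
	assert(before > 0 && after >= before);
	return (after - before) * 1000UL / before;
}
```

Applied to the text column of core.o (61099 to 70541 bytes), the same helper gives 154, i.e. about 15.4%, which reflects the core-scheduling code itself being compiled in rather than the lock wrapper alone.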

In terms of long term maintenance overhead, ignoring the overhead of the 
core-scheduling feature itself, the rq-lock wrappery is the biggest 
ugliness, the rest is mostly isolated.

So if this actually *works* and improves the performance of some real 
VMEXIT-poor SMT workloads and allows the enabling of HyperThreading with 
untrusted VMs without inviting thousands of guest roots then I'm 
cautiously in support of it.

> That all assumes that it works at all for the people who are clamoring 
> for this feature, but I guess they can run some loads on it eventually. 
> It's a holiday in the US right now ("Presidents' Day"), but maybe we 
> can get some numbers this week?

Such numbers would be *very* helpful indeed.

Thanks,

	Ingo


* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-02-18 16:56 ` [RFC][PATCH 03/16] sched: Wrap rq::lock access Peter Zijlstra
@ 2019-02-19 16:13   ` Phil Auld
  2019-02-19 16:22     ` Peter Zijlstra
  2019-03-18 15:41   ` Julien Desfossez
  1 sibling, 1 reply; 99+ messages in thread
From: Phil Auld @ 2019-02-19 16:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, pjt, tim.c.chen, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel

On Mon, Feb 18, 2019 at 05:56:23PM +0100 Peter Zijlstra wrote:
> In preparation of playing games with rq->lock, abstract the thing
> using an accessor.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Hi Peter,

Sorry... what tree are these for?  They don't apply to mainline. 
Some branch on tip, I guess?


Thanks,
Phil



> ---
>  kernel/sched/core.c     |   44 ++++++++++----------
>  kernel/sched/deadline.c |   18 ++++----
>  kernel/sched/debug.c    |    4 -
>  kernel/sched/fair.c     |   41 +++++++++----------
>  kernel/sched/idle.c     |    4 -
>  kernel/sched/pelt.h     |    2 
>  kernel/sched/rt.c       |    8 +--
>  kernel/sched/sched.h    |  102 ++++++++++++++++++++++++------------------------
>  kernel/sched/topology.c |    4 -
>  9 files changed, 114 insertions(+), 113 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -72,12 +72,12 @@ struct rq *__task_rq_lock(struct task_st
>  
>  	for (;;) {
>  		rq = task_rq(p);
> -		raw_spin_lock(&rq->lock);
> +		raw_spin_lock(rq_lockp(rq));
>  		if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
>  			rq_pin_lock(rq, rf);
>  			return rq;
>  		}
> -		raw_spin_unlock(&rq->lock);
> +		raw_spin_unlock(rq_lockp(rq));
>  
>  		while (unlikely(task_on_rq_migrating(p)))
>  			cpu_relax();
> @@ -96,7 +96,7 @@ struct rq *task_rq_lock(struct task_stru
>  	for (;;) {
>  		raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
>  		rq = task_rq(p);
> -		raw_spin_lock(&rq->lock);
> +		raw_spin_lock(rq_lockp(rq));
>  		/*
>  		 *	move_queued_task()		task_rq_lock()
>  		 *
> @@ -118,7 +118,7 @@ struct rq *task_rq_lock(struct task_stru
>  			rq_pin_lock(rq, rf);
>  			return rq;
>  		}
> -		raw_spin_unlock(&rq->lock);
> +		raw_spin_unlock(rq_lockp(rq));
>  		raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
>  
>  		while (unlikely(task_on_rq_migrating(p)))
> @@ -188,7 +188,7 @@ void update_rq_clock(struct rq *rq)
>  {
>  	s64 delta;
>  
> -	lockdep_assert_held(&rq->lock);
> +	lockdep_assert_held(rq_lockp(rq));
>  
>  	if (rq->clock_update_flags & RQCF_ACT_SKIP)
>  		return;
> @@ -497,7 +497,7 @@ void resched_curr(struct rq *rq)
>  	struct task_struct *curr = rq->curr;
>  	int cpu;
>  
> -	lockdep_assert_held(&rq->lock);
> +	lockdep_assert_held(rq_lockp(rq));
>  
>  	if (test_tsk_need_resched(curr))
>  		return;
> @@ -521,10 +521,10 @@ void resched_cpu(int cpu)
>  	struct rq *rq = cpu_rq(cpu);
>  	unsigned long flags;
>  
> -	raw_spin_lock_irqsave(&rq->lock, flags);
> +	raw_spin_lock_irqsave(rq_lockp(rq), flags);
>  	if (cpu_online(cpu) || cpu == smp_processor_id())
>  		resched_curr(rq);
> -	raw_spin_unlock_irqrestore(&rq->lock, flags);
> +	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
>  }
>  
>  #ifdef CONFIG_SMP
> @@ -956,7 +956,7 @@ static inline bool is_cpu_allowed(struct
>  static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
>  				   struct task_struct *p, int new_cpu)
>  {
> -	lockdep_assert_held(&rq->lock);
> +	lockdep_assert_held(rq_lockp(rq));
>  
>  	WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
>  	dequeue_task(rq, p, DEQUEUE_NOCLOCK);
> @@ -1070,7 +1070,7 @@ void do_set_cpus_allowed(struct task_str
>  		 * Because __kthread_bind() calls this on blocked tasks without
>  		 * holding rq->lock.
>  		 */
> -		lockdep_assert_held(&rq->lock);
> +		lockdep_assert_held(rq_lockp(rq));
>  		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
>  	}
>  	if (running)
> @@ -1203,7 +1203,7 @@ void set_task_cpu(struct task_struct *p,
>  	 * task_rq_lock().
>  	 */
>  	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
> -				      lockdep_is_held(&task_rq(p)->lock)));
> +				      lockdep_is_held(rq_lockp(task_rq(p)))));
>  #endif
>  	/*
>  	 * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
> @@ -1732,7 +1732,7 @@ ttwu_do_activate(struct rq *rq, struct t
>  {
>  	int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
>  
> -	lockdep_assert_held(&rq->lock);
> +	lockdep_assert_held(rq_lockp(rq));
>  
>  #ifdef CONFIG_SMP
>  	if (p->sched_contributes_to_load)
> @@ -2123,7 +2123,7 @@ static void try_to_wake_up_local(struct
>  	    WARN_ON_ONCE(p == current))
>  		return;
>  
> -	lockdep_assert_held(&rq->lock);
> +	lockdep_assert_held(rq_lockp(rq));
>  
>  	if (!raw_spin_trylock(&p->pi_lock)) {
>  		/*
> @@ -2606,10 +2606,10 @@ prepare_lock_switch(struct rq *rq, struc
>  	 * do an early lockdep release here:
>  	 */
>  	rq_unpin_lock(rq, rf);
> -	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
> +	spin_release(&rq_lockp(rq)->dep_map, 1, _THIS_IP_);
>  #ifdef CONFIG_DEBUG_SPINLOCK
>  	/* this is a valid case when another task releases the spinlock */
> -	rq->lock.owner = next;
> +	rq_lockp(rq)->owner = next;
>  #endif
>  }
>  
> @@ -2620,8 +2620,8 @@ static inline void finish_lock_switch(st
>  	 * fix up the runqueue lock - which gets 'carried over' from
>  	 * prev into current:
>  	 */
> -	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
> -	raw_spin_unlock_irq(&rq->lock);
> +	spin_acquire(&rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
> +	raw_spin_unlock_irq(rq_lockp(rq));
>  }
>  
>  /*
> @@ -2771,7 +2771,7 @@ static void __balance_callback(struct rq
>  	void (*func)(struct rq *rq);
>  	unsigned long flags;
>  
> -	raw_spin_lock_irqsave(&rq->lock, flags);
> +	raw_spin_lock_irqsave(rq_lockp(rq), flags);
>  	head = rq->balance_callback;
>  	rq->balance_callback = NULL;
>  	while (head) {
> @@ -2782,7 +2782,7 @@ static void __balance_callback(struct rq
>  
>  		func(rq);
>  	}
> -	raw_spin_unlock_irqrestore(&rq->lock, flags);
> +	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
>  }
>  
>  static inline void balance_callback(struct rq *rq)
> @@ -5411,7 +5411,7 @@ void init_idle(struct task_struct *idle,
>  	unsigned long flags;
>  
>  	raw_spin_lock_irqsave(&idle->pi_lock, flags);
> -	raw_spin_lock(&rq->lock);
> +	raw_spin_lock(rq_lockp(rq));
>  
>  	__sched_fork(0, idle);
>  	idle->state = TASK_RUNNING;
> @@ -5448,7 +5448,7 @@ void init_idle(struct task_struct *idle,
>  #ifdef CONFIG_SMP
>  	idle->on_cpu = 1;
>  #endif
> -	raw_spin_unlock(&rq->lock);
> +	raw_spin_unlock(rq_lockp(rq));
>  	raw_spin_unlock_irqrestore(&idle->pi_lock, flags);
>  
>  	/* Set the preempt count _outside_ the spinlocks! */
> @@ -6016,7 +6016,7 @@ void __init sched_init(void)
>  		struct rq *rq;
>  
>  		rq = cpu_rq(i);
> -		raw_spin_lock_init(&rq->lock);
> +		raw_spin_lock_init(&rq->__lock);
>  		rq->nr_running = 0;
>  		rq->calc_load_active = 0;
>  		rq->calc_load_update = jiffies + LOAD_FREQ;
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -80,7 +80,7 @@ void __add_running_bw(u64 dl_bw, struct
>  {
>  	u64 old = dl_rq->running_bw;
>  
> -	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
> +	lockdep_assert_held(rq_lockp((rq_of_dl_rq(dl_rq))));
>  	dl_rq->running_bw += dl_bw;
>  	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
>  	SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
> @@ -93,7 +93,7 @@ void __sub_running_bw(u64 dl_bw, struct
>  {
>  	u64 old = dl_rq->running_bw;
>  
> -	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
> +	lockdep_assert_held(rq_lockp((rq_of_dl_rq(dl_rq))));
>  	dl_rq->running_bw -= dl_bw;
>  	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
>  	if (dl_rq->running_bw > old)
> @@ -107,7 +107,7 @@ void __add_rq_bw(u64 dl_bw, struct dl_rq
>  {
>  	u64 old = dl_rq->this_bw;
>  
> -	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
> +	lockdep_assert_held(rq_lockp((rq_of_dl_rq(dl_rq))));
>  	dl_rq->this_bw += dl_bw;
>  	SCHED_WARN_ON(dl_rq->this_bw < old); /* overflow */
>  }
> @@ -117,7 +117,7 @@ void __sub_rq_bw(u64 dl_bw, struct dl_rq
>  {
>  	u64 old = dl_rq->this_bw;
>  
> -	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
> +	lockdep_assert_held(rq_lockp((rq_of_dl_rq(dl_rq))));
>  	dl_rq->this_bw -= dl_bw;
>  	SCHED_WARN_ON(dl_rq->this_bw > old); /* underflow */
>  	if (dl_rq->this_bw > old)
> @@ -893,7 +893,7 @@ static int start_dl_timer(struct task_st
>  	ktime_t now, act;
>  	s64 delta;
>  
> -	lockdep_assert_held(&rq->lock);
> +	lockdep_assert_held(rq_lockp(rq));
>  
>  	/*
>  	 * We want the timer to fire at the deadline, but considering
> @@ -1003,9 +1003,9 @@ static enum hrtimer_restart dl_task_time
>  		 * If the runqueue is no longer available, migrate the
>  		 * task elsewhere. This necessarily changes rq.
>  		 */
> -		lockdep_unpin_lock(&rq->lock, rf.cookie);
> +		lockdep_unpin_lock(rq_lockp(rq), rf.cookie);
>  		rq = dl_task_offline_migration(rq, p);
> -		rf.cookie = lockdep_pin_lock(&rq->lock);
> +		rf.cookie = lockdep_pin_lock(rq_lockp(rq));
>  		update_rq_clock(rq);
>  
>  		/*
> @@ -1620,7 +1620,7 @@ static void migrate_task_rq_dl(struct ta
>  	 * from try_to_wake_up(). Hence, p->pi_lock is locked, but
>  	 * rq->lock is not... So, lock it
>  	 */
> -	raw_spin_lock(&rq->lock);
> +	raw_spin_lock(rq_lockp(rq));
>  	if (p->dl.dl_non_contending) {
>  		sub_running_bw(&p->dl, &rq->dl);
>  		p->dl.dl_non_contending = 0;
> @@ -1635,7 +1635,7 @@ static void migrate_task_rq_dl(struct ta
>  			put_task_struct(p);
>  	}
>  	sub_rq_bw(&p->dl, &rq->dl);
> -	raw_spin_unlock(&rq->lock);
> +	raw_spin_unlock(rq_lockp(rq));
>  }
>  
>  static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -515,7 +515,7 @@ void print_cfs_rq(struct seq_file *m, in
>  	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "exec_clock",
>  			SPLIT_NS(cfs_rq->exec_clock));
>  
> -	raw_spin_lock_irqsave(&rq->lock, flags);
> +	raw_spin_lock_irqsave(rq_lockp(rq), flags);
>  	if (rb_first_cached(&cfs_rq->tasks_timeline))
>  		MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
>  	last = __pick_last_entity(cfs_rq);
> @@ -523,7 +523,7 @@ void print_cfs_rq(struct seq_file *m, in
>  		max_vruntime = last->vruntime;
>  	min_vruntime = cfs_rq->min_vruntime;
>  	rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
> -	raw_spin_unlock_irqrestore(&rq->lock, flags);
> +	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
>  	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "MIN_vruntime",
>  			SPLIT_NS(MIN_vruntime));
>  	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4966,7 +4966,7 @@ static void __maybe_unused update_runtim
>  {
>  	struct task_group *tg;
>  
> -	lockdep_assert_held(&rq->lock);
> +	lockdep_assert_held(rq_lockp(rq));
>  
>  	rcu_read_lock();
>  	list_for_each_entry_rcu(tg, &task_groups, list) {
> @@ -4985,7 +4985,7 @@ static void __maybe_unused unthrottle_of
>  {
>  	struct task_group *tg;
>  
> -	lockdep_assert_held(&rq->lock);
> +	lockdep_assert_held(rq_lockp(rq));
>  
>  	rcu_read_lock();
>  	list_for_each_entry_rcu(tg, &task_groups, list) {
> @@ -6743,7 +6743,7 @@ static void migrate_task_rq_fair(struct
>  		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
>  		 * rq->lock and can modify state directly.
>  		 */
> -		lockdep_assert_held(&task_rq(p)->lock);
> +		lockdep_assert_held(rq_lockp(task_rq(p)));
>  		detach_entity_cfs_rq(&p->se);
>  
>  	} else {
> @@ -7317,7 +7317,7 @@ static int task_hot(struct task_struct *
>  {
>  	s64 delta;
>  
> -	lockdep_assert_held(&env->src_rq->lock);
> +	lockdep_assert_held(rq_lockp(env->src_rq));
>  
>  	if (p->sched_class != &fair_sched_class)
>  		return 0;
> @@ -7411,7 +7411,7 @@ int can_migrate_task(struct task_struct
>  {
>  	int tsk_cache_hot;
>  
> -	lockdep_assert_held(&env->src_rq->lock);
> +	lockdep_assert_held(rq_lockp(env->src_rq));
>  
>  	/*
>  	 * We do not migrate tasks that are:
> @@ -7489,7 +7489,7 @@ int can_migrate_task(struct task_struct
>   */
>  static void detach_task(struct task_struct *p, struct lb_env *env)
>  {
> -	lockdep_assert_held(&env->src_rq->lock);
> +	lockdep_assert_held(rq_lockp(env->src_rq));
>  
>  	p->on_rq = TASK_ON_RQ_MIGRATING;
>  	deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
> @@ -7506,7 +7506,7 @@ static struct task_struct *detach_one_ta
>  {
>  	struct task_struct *p;
>  
> -	lockdep_assert_held(&env->src_rq->lock);
> +	lockdep_assert_held(rq_lockp(env->src_rq));
>  
>  	list_for_each_entry_reverse(p,
>  			&env->src_rq->cfs_tasks, se.group_node) {
> @@ -7542,7 +7542,7 @@ static int detach_tasks(struct lb_env *e
>  	unsigned long load;
>  	int detached = 0;
>  
> -	lockdep_assert_held(&env->src_rq->lock);
> +	lockdep_assert_held(rq_lockp(env->src_rq));
>  
>  	if (env->imbalance <= 0)
>  		return 0;
> @@ -7623,7 +7623,7 @@ static int detach_tasks(struct lb_env *e
>   */
>  static void attach_task(struct rq *rq, struct task_struct *p)
>  {
> -	lockdep_assert_held(&rq->lock);
> +	lockdep_assert_held(rq_lockp(rq));
>  
>  	BUG_ON(task_rq(p) != rq);
>  	activate_task(rq, p, ENQUEUE_NOCLOCK);
> @@ -9164,7 +9164,7 @@ static int load_balance(int this_cpu, st
>  		if (need_active_balance(&env)) {
>  			unsigned long flags;
>  
> -			raw_spin_lock_irqsave(&busiest->lock, flags);
> +			raw_spin_lock_irqsave(rq_lockp(busiest), flags);
>  
>  			/*
>  			 * Don't kick the active_load_balance_cpu_stop,
> @@ -9172,8 +9172,7 @@ static int load_balance(int this_cpu, st
>  			 * moved to this_cpu:
>  			 */
>  			if (!cpumask_test_cpu(this_cpu, &busiest->curr->cpus_allowed)) {
> -				raw_spin_unlock_irqrestore(&busiest->lock,
> -							    flags);
> +				raw_spin_unlock_irqrestore(rq_lockp(busiest), flags);
>  				env.flags |= LBF_ALL_PINNED;
>  				goto out_one_pinned;
>  			}
> @@ -9188,7 +9187,7 @@ static int load_balance(int this_cpu, st
>  				busiest->push_cpu = this_cpu;
>  				active_balance = 1;
>  			}
> -			raw_spin_unlock_irqrestore(&busiest->lock, flags);
> +			raw_spin_unlock_irqrestore(rq_lockp(busiest), flags);
>  
>  			if (active_balance) {
>  				stop_one_cpu_nowait(cpu_of(busiest),
> @@ -9897,7 +9896,7 @@ static void nohz_newidle_balance(struct
>  	    time_before(jiffies, READ_ONCE(nohz.next_blocked)))
>  		return;
>  
> -	raw_spin_unlock(&this_rq->lock);
> +	raw_spin_unlock(rq_lockp(this_rq));
>  	/*
>  	 * This CPU is going to be idle and blocked load of idle CPUs
>  	 * need to be updated. Run the ilb locally as it is a good
> @@ -9906,7 +9905,7 @@ static void nohz_newidle_balance(struct
>  	 */
>  	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
>  		kick_ilb(NOHZ_STATS_KICK);
> -	raw_spin_lock(&this_rq->lock);
> +	raw_spin_lock(rq_lockp(this_rq));
>  }
>  
>  #else /* !CONFIG_NO_HZ_COMMON */
> @@ -9966,7 +9965,7 @@ static int idle_balance(struct rq *this_
>  		goto out;
>  	}
>  
> -	raw_spin_unlock(&this_rq->lock);
> +	raw_spin_unlock(rq_lockp(this_rq));
>  
>  	update_blocked_averages(this_cpu);
>  	rcu_read_lock();
> @@ -10007,7 +10006,7 @@ static int idle_balance(struct rq *this_
>  	}
>  	rcu_read_unlock();
>  
> -	raw_spin_lock(&this_rq->lock);
> +	raw_spin_lock(rq_lockp(this_rq));
>  
>  	if (curr_cost > this_rq->max_idle_balance_cost)
>  		this_rq->max_idle_balance_cost = curr_cost;
> @@ -10443,11 +10442,11 @@ void online_fair_sched_group(struct task
>  		rq = cpu_rq(i);
>  		se = tg->se[i];
>  
> -		raw_spin_lock_irq(&rq->lock);
> +		raw_spin_lock_irq(rq_lockp(rq));
>  		update_rq_clock(rq);
>  		attach_entity_cfs_rq(se);
>  		sync_throttle(tg, i);
> -		raw_spin_unlock_irq(&rq->lock);
> +		raw_spin_unlock_irq(rq_lockp(rq));
>  	}
>  }
>  
> @@ -10470,9 +10469,9 @@ void unregister_fair_sched_group(struct
>  
>  		rq = cpu_rq(cpu);
>  
> -		raw_spin_lock_irqsave(&rq->lock, flags);
> +		raw_spin_lock_irqsave(rq_lockp(rq), flags);
>  		list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
> -		raw_spin_unlock_irqrestore(&rq->lock, flags);
> +		raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
>  	}
>  }
>  
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -390,10 +390,10 @@ pick_next_task_idle(struct rq *rq, struc
>  static void
>  dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
>  {
> -	raw_spin_unlock_irq(&rq->lock);
> +	raw_spin_unlock_irq(rq_lockp(rq));
>  	printk(KERN_ERR "bad: scheduling from the idle thread!\n");
>  	dump_stack();
> -	raw_spin_lock_irq(&rq->lock);
> +	raw_spin_lock_irq(rq_lockp(rq));
>  }
>  
>  static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -116,7 +116,7 @@ static inline void update_idle_rq_clock_
>  
>  static inline u64 rq_clock_pelt(struct rq *rq)
>  {
> -	lockdep_assert_held(&rq->lock);
> +	lockdep_assert_held(rq_lockp(rq));
>  	assert_clock_updated(rq);
>  
>  	return rq->clock_pelt - rq->lost_idle_time;
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -845,7 +845,7 @@ static int do_sched_rt_period_timer(stru
>  		if (skip)
>  			continue;
>  
> -		raw_spin_lock(&rq->lock);
> +		raw_spin_lock(rq_lockp(rq));
>  		update_rq_clock(rq);
>  
>  		if (rt_rq->rt_time) {
> @@ -883,7 +883,7 @@ static int do_sched_rt_period_timer(stru
>  
>  		if (enqueue)
>  			sched_rt_rq_enqueue(rt_rq);
> -		raw_spin_unlock(&rq->lock);
> +		raw_spin_unlock(rq_lockp(rq));
>  	}
>  
>  	if (!throttled && (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF))
> @@ -2034,9 +2034,9 @@ void rto_push_irq_work_func(struct irq_w
>  	 * When it gets updated, a check is made if a push is possible.
>  	 */
>  	if (has_pushable_tasks(rq)) {
> -		raw_spin_lock(&rq->lock);
> +		raw_spin_lock(rq_lockp(rq));
>  		push_rt_tasks(rq);
> -		raw_spin_unlock(&rq->lock);
> +		raw_spin_unlock(rq_lockp(rq));
>  	}
>  
>  	raw_spin_lock(&rd->rto_lock);
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -806,7 +806,7 @@ extern void rto_push_irq_work_func(struc
>   */
>  struct rq {
>  	/* runqueue lock: */
> -	raw_spinlock_t		lock;
> +	raw_spinlock_t		__lock;
>  
>  	/*
>  	 * nr_running and cpu_load should be in the same cacheline because
> @@ -979,6 +979,10 @@ static inline int cpu_of(struct rq *rq)
>  #endif
>  }
>  
> +static inline raw_spinlock_t *rq_lockp(struct rq *rq)
> +{
> +	return &rq->__lock;
> +}
>  
>  #ifdef CONFIG_SCHED_SMT
>  extern void __update_idle_core(struct rq *rq);
> @@ -1046,7 +1050,7 @@ static inline void assert_clock_updated(
>  
>  static inline u64 rq_clock(struct rq *rq)
>  {
> -	lockdep_assert_held(&rq->lock);
> +	lockdep_assert_held(rq_lockp(rq));
>  	assert_clock_updated(rq);
>  
>  	return rq->clock;
> @@ -1054,7 +1058,7 @@ static inline u64 rq_clock(struct rq *rq
>  
>  static inline u64 rq_clock_task(struct rq *rq)
>  {
> -	lockdep_assert_held(&rq->lock);
> +	lockdep_assert_held(rq_lockp(rq));
>  	assert_clock_updated(rq);
>  
>  	return rq->clock_task;
> @@ -1062,7 +1066,7 @@ static inline u64 rq_clock_task(struct r
>  
>  static inline void rq_clock_skip_update(struct rq *rq)
>  {
> -	lockdep_assert_held(&rq->lock);
> +	lockdep_assert_held(rq_lockp(rq));
>  	rq->clock_update_flags |= RQCF_REQ_SKIP;
>  }
>  
> @@ -1072,7 +1076,7 @@ static inline void rq_clock_skip_update(
>   */
>  static inline void rq_clock_cancel_skipupdate(struct rq *rq)
>  {
> -	lockdep_assert_held(&rq->lock);
> +	lockdep_assert_held(rq_lockp(rq));
>  	rq->clock_update_flags &= ~RQCF_REQ_SKIP;
>  }
>  
> @@ -1091,7 +1095,7 @@ struct rq_flags {
>  
>  static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
>  {
> -	rf->cookie = lockdep_pin_lock(&rq->lock);
> +	rf->cookie = lockdep_pin_lock(rq_lockp(rq));
>  
>  #ifdef CONFIG_SCHED_DEBUG
>  	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
> @@ -1106,12 +1110,12 @@ static inline void rq_unpin_lock(struct
>  		rf->clock_update_flags = RQCF_UPDATED;
>  #endif
>  
> -	lockdep_unpin_lock(&rq->lock, rf->cookie);
> +	lockdep_unpin_lock(rq_lockp(rq), rf->cookie);
>  }
>  
>  static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
>  {
> -	lockdep_repin_lock(&rq->lock, rf->cookie);
> +	lockdep_repin_lock(rq_lockp(rq), rf->cookie);
>  
>  #ifdef CONFIG_SCHED_DEBUG
>  	/*
> @@ -1132,7 +1136,7 @@ static inline void __task_rq_unlock(stru
>  	__releases(rq->lock)
>  {
>  	rq_unpin_lock(rq, rf);
> -	raw_spin_unlock(&rq->lock);
> +	raw_spin_unlock(rq_lockp(rq));
>  }
>  
>  static inline void
> @@ -1141,7 +1145,7 @@ task_rq_unlock(struct rq *rq, struct tas
>  	__releases(p->pi_lock)
>  {
>  	rq_unpin_lock(rq, rf);
> -	raw_spin_unlock(&rq->lock);
> +	raw_spin_unlock(rq_lockp(rq));
>  	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
>  }
>  
> @@ -1149,7 +1153,7 @@ static inline void
>  rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
>  	__acquires(rq->lock)
>  {
> -	raw_spin_lock_irqsave(&rq->lock, rf->flags);
> +	raw_spin_lock_irqsave(rq_lockp(rq), rf->flags);
>  	rq_pin_lock(rq, rf);
>  }
>  
> @@ -1157,7 +1161,7 @@ static inline void
>  rq_lock_irq(struct rq *rq, struct rq_flags *rf)
>  	__acquires(rq->lock)
>  {
> -	raw_spin_lock_irq(&rq->lock);
> +	raw_spin_lock_irq(rq_lockp(rq));
>  	rq_pin_lock(rq, rf);
>  }
>  
> @@ -1165,7 +1169,7 @@ static inline void
>  rq_lock(struct rq *rq, struct rq_flags *rf)
>  	__acquires(rq->lock)
>  {
> -	raw_spin_lock(&rq->lock);
> +	raw_spin_lock(rq_lockp(rq));
>  	rq_pin_lock(rq, rf);
>  }
>  
> @@ -1173,7 +1177,7 @@ static inline void
>  rq_relock(struct rq *rq, struct rq_flags *rf)
>  	__acquires(rq->lock)
>  {
> -	raw_spin_lock(&rq->lock);
> +	raw_spin_lock(rq_lockp(rq));
>  	rq_repin_lock(rq, rf);
>  }
>  
> @@ -1182,7 +1186,7 @@ rq_unlock_irqrestore(struct rq *rq, stru
>  	__releases(rq->lock)
>  {
>  	rq_unpin_lock(rq, rf);
> -	raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
> +	raw_spin_unlock_irqrestore(rq_lockp(rq), rf->flags);
>  }
>  
>  static inline void
> @@ -1190,7 +1194,7 @@ rq_unlock_irq(struct rq *rq, struct rq_f
>  	__releases(rq->lock)
>  {
>  	rq_unpin_lock(rq, rf);
> -	raw_spin_unlock_irq(&rq->lock);
> +	raw_spin_unlock_irq(rq_lockp(rq));
>  }
>  
>  static inline void
> @@ -1198,7 +1202,7 @@ rq_unlock(struct rq *rq, struct rq_flags
>  	__releases(rq->lock)
>  {
>  	rq_unpin_lock(rq, rf);
> -	raw_spin_unlock(&rq->lock);
> +	raw_spin_unlock(rq_lockp(rq));
>  }
>  
>  static inline struct rq *
> @@ -1261,7 +1265,7 @@ queue_balance_callback(struct rq *rq,
>  		       struct callback_head *head,
>  		       void (*func)(struct rq *rq))
>  {
> -	lockdep_assert_held(&rq->lock);
> +	lockdep_assert_held(rq_lockp(rq));
>  
>  	if (unlikely(head->next))
>  		return;
> @@ -1917,7 +1921,7 @@ static inline int _double_lock_balance(s
>  	__acquires(busiest->lock)
>  	__acquires(this_rq->lock)
>  {
> -	raw_spin_unlock(&this_rq->lock);
> +	raw_spin_unlock(rq_lockp(this_rq));
>  	double_rq_lock(this_rq, busiest);
>  
>  	return 1;
> @@ -1936,20 +1940,22 @@ static inline int _double_lock_balance(s
>  	__acquires(busiest->lock)
>  	__acquires(this_rq->lock)
>  {
> -	int ret = 0;
> +	if (rq_lockp(this_rq) == rq_lockp(busiest))
> +		return 0;
>  
> -	if (unlikely(!raw_spin_trylock(&busiest->lock))) {
> -		if (busiest < this_rq) {
> -			raw_spin_unlock(&this_rq->lock);
> -			raw_spin_lock(&busiest->lock);
> -			raw_spin_lock_nested(&this_rq->lock,
> -					      SINGLE_DEPTH_NESTING);
> -			ret = 1;
> -		} else
> -			raw_spin_lock_nested(&busiest->lock,
> -					      SINGLE_DEPTH_NESTING);
> +	if (likely(raw_spin_trylock(rq_lockp(busiest))))
> +		return 0;
> +
> +	if (busiest >= this_rq) {
> +		raw_spin_lock_nested(rq_lockp(busiest), SINGLE_DEPTH_NESTING);
> +		return 0;
>  	}
> -	return ret;
> +
> +	raw_spin_unlock(rq_lockp(this_rq));
> +	raw_spin_lock(rq_lockp(busiest));
> +	raw_spin_lock_nested(rq_lockp(this_rq), SINGLE_DEPTH_NESTING);
> +
> +	return 1;
>  }
>  
>  #endif /* CONFIG_PREEMPT */
> @@ -1959,20 +1965,16 @@ static inline int _double_lock_balance(s
>   */
>  static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
>  {
> -	if (unlikely(!irqs_disabled())) {
> -		/* printk() doesn't work well under rq->lock */
> -		raw_spin_unlock(&this_rq->lock);
> -		BUG_ON(1);
> -	}
> -
> +	lockdep_assert_irqs_disabled();
>  	return _double_lock_balance(this_rq, busiest);
>  }
>  
>  static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
>  	__releases(busiest->lock)
>  {
> -	raw_spin_unlock(&busiest->lock);
> -	lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
> +	if (rq_lockp(this_rq) != rq_lockp(busiest))
> +		raw_spin_unlock(rq_lockp(busiest));
> +	lock_set_subclass(&rq_lockp(this_rq)->dep_map, 0, _RET_IP_);
>  }
>  
>  static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
> @@ -2013,16 +2015,16 @@ static inline void double_rq_lock(struct
>  	__acquires(rq2->lock)
>  {
>  	BUG_ON(!irqs_disabled());
> -	if (rq1 == rq2) {
> -		raw_spin_lock(&rq1->lock);
> +	if (rq_lockp(rq1) == rq_lockp(rq2)) {
> +		raw_spin_lock(rq_lockp(rq1));
>  		__acquire(rq2->lock);	/* Fake it out ;) */
>  	} else {
>  		if (rq1 < rq2) {
> -			raw_spin_lock(&rq1->lock);
> -			raw_spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
> +			raw_spin_lock(rq_lockp(rq1));
> +			raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
>  		} else {
> -			raw_spin_lock(&rq2->lock);
> -			raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
> +			raw_spin_lock(rq_lockp(rq2));
> +			raw_spin_lock_nested(rq_lockp(rq1), SINGLE_DEPTH_NESTING);
>  		}
>  	}
>  }
> @@ -2037,9 +2039,9 @@ static inline void double_rq_unlock(stru
>  	__releases(rq1->lock)
>  	__releases(rq2->lock)
>  {
> -	raw_spin_unlock(&rq1->lock);
> -	if (rq1 != rq2)
> -		raw_spin_unlock(&rq2->lock);
> +	raw_spin_unlock(rq_lockp(rq1));
> +	if (rq_lockp(rq1) != rq_lockp(rq2))
> +		raw_spin_unlock(rq_lockp(rq2));
>  	else
>  		__release(rq2->lock);
>  }
> @@ -2062,7 +2064,7 @@ static inline void double_rq_lock(struct
>  {
>  	BUG_ON(!irqs_disabled());
>  	BUG_ON(rq1 != rq2);
> -	raw_spin_lock(&rq1->lock);
> +	raw_spin_lock(rq_lockp(rq1));
>  	__acquire(rq2->lock);	/* Fake it out ;) */
>  }
>  
> @@ -2077,7 +2079,7 @@ static inline void double_rq_unlock(stru
>  	__releases(rq2->lock)
>  {
>  	BUG_ON(rq1 != rq2);
> -	raw_spin_unlock(&rq1->lock);
> +	raw_spin_unlock(rq_lockp(rq1));
>  	__release(rq2->lock);
>  }
>  
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -442,7 +442,7 @@ void rq_attach_root(struct rq *rq, struc
>  	struct root_domain *old_rd = NULL;
>  	unsigned long flags;
>  
> -	raw_spin_lock_irqsave(&rq->lock, flags);
> +	raw_spin_lock_irqsave(rq_lockp(rq), flags);
>  
>  	if (rq->rd) {
>  		old_rd = rq->rd;
> @@ -468,7 +468,7 @@ void rq_attach_root(struct rq *rq, struc
>  	if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
>  		set_rq_online(rq);
>  
> -	raw_spin_unlock_irqrestore(&rq->lock, flags);
> +	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
>  
>  	if (old_rd)
>  		call_rcu(&old_rd->rcu, free_rootdomain);
> 
> 
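
For readers following the locking change: comparing rq_lockp() values
instead of rq pointers in double_rq_lock() and friends presumably matters
once two runqueues can resolve to the same lock. Below is a minimal,
single-threaded userspace sketch of that idea, not kernel code. struct
toy_lock, struct toy_rq, the core_lock field and every toy_* function are
hypothetical names made up for illustration; the toy lock only records
whether it is held. Unlike the patch, which orders by rq address, the
sketch orders by lock address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy lock: only records whether it is held, so a double acquire trips an
 * assert. Not a real spinlock; single-threaded illustration only.
 */
struct toy_lock {
	int held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);	/* taking a held lock would deadlock for real */
	l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Hypothetical stand-in for struct rq. */
struct toy_rq {
	struct toy_lock own_lock;	/* plays the role of rq->__lock */
	struct toy_lock *core_lock;	/* assumption: set when siblings share a lock */
};

/* Same role as rq_lockp(): callers never name the lock field directly. */
static struct toy_lock *toy_rq_lockp(struct toy_rq *rq)
{
	return rq->core_lock ? rq->core_lock : &rq->own_lock;
}

/* Lock both runqueues; returns how many distinct locks were taken. */
static int toy_double_rq_lock(struct toy_rq *rq1, struct toy_rq *rq2)
{
	struct toy_lock *l1 = toy_rq_lockp(rq1);
	struct toy_lock *l2 = toy_rq_lockp(rq2);

	if (l1 == l2) {
		toy_lock_acquire(l1);
		return 1;
	}
	/* Fixed order by lock address, so two callers cannot deadlock ABBA. */
	if ((uintptr_t)l1 < (uintptr_t)l2) {
		toy_lock_acquire(l1);
		toy_lock_acquire(l2);
	} else {
		toy_lock_acquire(l2);
		toy_lock_acquire(l1);
	}
	return 2;
}

/* Release whatever toy_double_rq_lock() took. */
static void toy_double_rq_unlock(struct toy_rq *rq1, struct toy_rq *rq2)
{
	struct toy_lock *l1 = toy_rq_lockp(rq1);
	struct toy_lock *l2 = toy_rq_lockp(rq2);

	toy_lock_release(l1);
	if (l1 != l2)
		toy_lock_release(l2);
}
```

With two runqueues whose core_lock points at the same toy_lock,
toy_double_rq_lock() takes one lock and returns 1; a check on the rq
pointers alone would try to take that lock twice.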

-- 


* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-02-19 16:13   ` Phil Auld
@ 2019-02-19 16:22     ` Peter Zijlstra
  2019-02-19 16:37       ` Phil Auld
  0 siblings, 1 reply; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-19 16:22 UTC (permalink / raw)
  To: Phil Auld
  Cc: mingo, tglx, pjt, tim.c.chen, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel

On Tue, Feb 19, 2019 at 11:13:43AM -0500, Phil Auld wrote:
> On Mon, Feb 18, 2019 at 05:56:23PM +0100 Peter Zijlstra wrote:
> > In preparation of playing games with rq->lock, abstract the thing
> > using an accessor.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> Hi Peter,
> 
> Sorry... what tree are these for?  They don't apply to mainline. 
> Some branch on tip, I guess?

tip/master I think; any rejects should be trivial though.


* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-02-19 16:22     ` Peter Zijlstra
@ 2019-02-19 16:37       ` Phil Auld
  0 siblings, 0 replies; 99+ messages in thread
From: Phil Auld @ 2019-02-19 16:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, pjt, tim.c.chen, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel

On Tue, Feb 19, 2019 at 05:22:50PM +0100 Peter Zijlstra wrote:
> On Tue, Feb 19, 2019 at 11:13:43AM -0500, Phil Auld wrote:
> > On Mon, Feb 18, 2019 at 05:56:23PM +0100 Peter Zijlstra wrote:
> > > In preparation of playing games with rq->lock, abstract the thing
> > > using an accessor.
> > > 
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > 
> > Hi Peter,
> > 
> > Sorry... what tree are these for?  They don't apply to mainline. 
> > Some branch on tip, I guess?
> 
> tip/master I think; any rejects should be trivial though.

Yep... git foo failed. I was on an old branch on tip by mistake, which didn't 
have rq_clock_pelt. 


Thanks,
Phil
-- 


* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
                   ` (16 preceding siblings ...)
  2019-02-18 17:49 ` [RFC][PATCH 00/16] sched: Core scheduling Linus Torvalds
@ 2019-02-19 22:07 ` Greg Kerr
  2019-02-20  9:42   ` Peter Zijlstra
  2019-03-01  2:54 ` Subhra Mazumdar
  2019-03-14 15:28 ` Julien Desfossez
  19 siblings, 1 reply; 99+ messages in thread
From: Greg Kerr @ 2019-02-19 22:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, Paul Turner, tim.c.chen, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook

Thanks for posting this patchset Peter. Based on the patch titled, "sched: A
quick and dirty cgroup tagging interface," I believe cgroups are used to
define co-scheduling groups in this implementation.

Chrome OS engineers (kerrnel@google.com, mpdenton@google.com, and
palmer@google.com) are considering an interface that is usable by unprivileged
userspace apps. cgroups are a global resource that require privileged access.
Have you considered an interface that is akin to namespaces? Consider the
following strawperson API proposal (I understand prctl() is generally used
for process-specific actions, so we aren't married to using prctl()):

# API Properties

The kernel introduces coscheduling groups, which specify which processes may
be executed together. An unprivileged process may use prctl() to create a
coscheduling group. The process may then join the coscheduling group, and
place any of its child processes into the coscheduling group. To provide
flexibility for unrelated processes to join pre-existing groups, an IPC
mechanism could send a coscheduling group handle between processes.

# Strawperson API Proposal
To create a new coscheduling group:
    int coscheduling_group = prctl(PR_CREATE_COSCHEDULING_GROUP);

The return value is >= 0 on success and -1 on failure, with the following
possible values for errno:

    ENOTSUP: This kernel doesn’t support the PR_CREATE_COSCHEDULING_GROUP
             operation.
    EMFILE: The process’ kernel-side coscheduling group table is full.

To join a given process to the group:
    pid_t process = getpid(); /* self, or the PID of a child */
    int status = prctl(PR_JOIN_COSCHEDULING_GROUP, coscheduling_group, process);
    if (status) {
        err(errno, NULL);
    }

The kernel will check and enforce that the given process ID really is the
caller’s own PID or a PID of one of the caller’s children, and that the given
group ID really exists. The return value is 0 on success and -1 on failure,
with the following possible values for errno:

    EPERM: The caller could not join the given process to the coscheduling
           group because it was not the creator of the given coscheduling group.
    EPERM: The caller could not join the given process to the coscheduling
           group because the given process was not the caller or one of the
           caller’s children.
    EINVAL: The given group ID did not exist in the kernel-side coscheduling
            group table associated with the caller.
    ESRCH: The given process did not exist.
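
The checks listed above can be modelled in plain userspace C. This is only
a sketch of the proposed semantics, not a kernel implementation: struct
toy_proc, struct toy_caller, TOY_MAX_GROUPS and the toy_* functions are
hypothetical, the table size is arbitrary, and the order in which the
checks run is an assumption, since the proposal does not specify one.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_GROUPS 4	/* assumption: the proposal gives no table size */

/* Hypothetical view of a process, just enough for the checks. */
struct toy_proc {
	int pid;
	int parent_pid;
	bool alive;
	int group;		/* -1 while not in any coscheduling group */
};

/* Hypothetical per-caller coscheduling group table. */
struct toy_caller {
	int pid;
	struct {
		bool used;
		bool created_here;	/* false if the handle arrived via IPC */
	} groups[TOY_MAX_GROUPS];
};

/* Models PR_CREATE_COSCHEDULING_GROUP: returns a group id, or -1 (EMFILE). */
static int toy_create_group(struct toy_caller *c)
{
	assert(c != NULL);
	for (int i = 0; i < TOY_MAX_GROUPS; i++) {
		if (!c->groups[i].used) {
			c->groups[i].used = true;
			c->groups[i].created_here = true;
			return i;
		}
	}
	errno = EMFILE;		/* table is full */
	return -1;
}

/* Models PR_JOIN_COSCHEDULING_GROUP; target == NULL means "no such PID". */
static int toy_join_group(struct toy_caller *c, int group,
			  struct toy_proc *target)
{
	assert(c != NULL);
	if (group < 0 || group >= TOY_MAX_GROUPS || !c->groups[group].used) {
		errno = EINVAL;	/* group not in the caller's table */
		return -1;
	}
	if (!c->groups[group].created_here) {
		errno = EPERM;	/* caller did not create the group */
		return -1;
	}
	if (target == NULL || !target->alive) {
		errno = ESRCH;	/* process does not exist */
		return -1;
	}
	if (target->pid != c->pid && target->parent_pid != c->pid) {
		errno = EPERM;	/* neither the caller nor one of its children */
		return -1;
	}
	target->group = group;
	return 0;
}
```

In this model a caller with pid 100 can join itself or a child whose
parent_pid is 100, while an unrelated process fails with EPERM.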

Regards,

Greg Kerr (kerrnel@google.com)

On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> A much 'demanded' feature: core-scheduling :-(
>
> I still hate it with a passion, and that is part of why it took a little
> longer than 'promised'.
>
> While this one doesn't have all the 'features' of the previous (never
> published) version and isn't L1TF 'complete', I tend to like the structure
> better (relatively speaking: I hate it slightly less).
>
> This one is sched class agnostic and therefore, in principle, doesn't horribly
> wreck RT (in fact, RT could 'ab'use this by setting 'task->core_cookie = task'
> to force-idle siblings).
>
> Now, as hinted by that, there are semi sane reasons for actually having this.
> Various hardware features like Intel RDT - Memory Bandwidth Allocation, work
> per core (due to SMT fundamentally sharing caches) and therefore grouping
> related tasks on a core makes it more reliable.
>
> However; whichever way around you turn this cookie; it is expensive and nasty.
>
> It doesn't help that there are truly bonghit crazy proposals for using this out
> there, and I really hope to never see them in code.
>
> These patches are lightly tested and didn't insta explode, but no promises,
> they might just set your pets on fire.
>
> 'enjoy'
>
> @pjt; I know this isn't quite what we talked about, but this is where I ended
> up after I started typing. There's plenty design decisions to question and my
> changelogs don't even get close to beginning to cover them all. Feel free to ask.
>
> ---
>  include/linux/sched.h    |   9 +-
>  kernel/Kconfig.preempt   |   8 +-
>  kernel/sched/core.c      | 762 ++++++++++++++++++++++++++++++++++++++++++++---
>  kernel/sched/deadline.c  |  99 +++---
>  kernel/sched/debug.c     |   4 +-
>  kernel/sched/fair.c      | 129 +++++---
>  kernel/sched/idle.c      |  42 ++-
>  kernel/sched/pelt.h      |   2 +-
>  kernel/sched/rt.c        |  96 +++---
>  kernel/sched/sched.h     | 183 ++++++++----
>  kernel/sched/stop_task.c |  35 ++-
>  kernel/sched/topology.c  |   4 +-
>  kernel/stop_machine.c    |   2 +
>  13 files changed, 1096 insertions(+), 279 deletions(-)
>
>


* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-19 22:07 ` Greg Kerr
@ 2019-02-20  9:42   ` Peter Zijlstra
  2019-02-20 18:33     ` Greg Kerr
  2019-02-20 18:43     ` Subhra Mazumdar
  0 siblings, 2 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-20  9:42 UTC (permalink / raw)
  To: Greg Kerr
  Cc: mingo, tglx, Paul Turner, tim.c.chen, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook


A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

On Tue, Feb 19, 2019 at 02:07:01PM -0800, Greg Kerr wrote:
> Thanks for posting this patchset Peter. Based on the patch titled, "sched: A
> quick and dirty cgroup tagging interface," I believe cgroups are used to
> define co-scheduling groups in this implementation.
> 
> Chrome OS engineers (kerrnel@google.com, mpdenton@google.com, and
> palmer@google.com) are considering an interface that is usable by unprivileged
> userspace apps. cgroups are a global resource that require privileged access.
> Have you considered an interface that is akin to namespaces? Consider the
> following strawperson API proposal (I understand prctl() is generally used
> for process-specific actions, so we aren't married to using prctl()):

I don't think we're anywhere near the point where I care about
interfaces with this stuff.

Interfaces are a trivial but tedious matter once the rest works to
satisfaction.

As it happens; there is actually a bug in that very cgroup patch that
can cause undesired scheduling. Try spotting and fixing that.

Another question is if we want to be L1TF complete (and how strict) or
not, and if so, build the missing pieces (for instance we currently
don't kick siblings on IRQ/trap/exception entry -- and yes that's nasty
and horrible code and missing for that reason).

So first; does this provide what we need? If that's sorted we can
bike-shed on uapi/abi.


* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-20  9:42   ` Peter Zijlstra
@ 2019-02-20 18:33     ` Greg Kerr
  2019-02-22 14:10       ` Peter Zijlstra
  2019-02-20 18:43     ` Subhra Mazumdar
  1 sibling, 1 reply; 99+ messages in thread
From: Greg Kerr @ 2019-02-20 18:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Greg Kerr, mingo, tglx, Paul Turner, tim.c.chen, torvalds,
	linux-kernel, subhra.mazumdar, fweisbec, keescook

On Wed, Feb 20, 2019 at 10:42:55AM +0100, Peter Zijlstra wrote:
> 
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing in e-mail?
> 
I am relieved to know that when my mail client embeds HTML tags into raw
text, it will only be the second most annoying thing I've done on
e-mail.

Speaking of annoying things to do, sorry for switching e-mail addresses
but this is easier to do from my personal e-mail.

> On Tue, Feb 19, 2019 at 02:07:01PM -0800, Greg Kerr wrote:
> > Thanks for posting this patchset Peter. Based on the patch titled, "sched: A
> > quick and dirty cgroup tagging interface," I believe cgroups are used to
> > define co-scheduling groups in this implementation.
> > 
> > Chrome OS engineers (kerrnel@google.com, mpdenton@google.com, and
> > palmer@google.com) are considering an interface that is usable by unprivileged
> > userspace apps. cgroups are a global resource that require privileged access.
> > Have you considered an interface that is akin to namespaces? Consider the
> > following strawperson API proposal (I understand prctl() is generally used
> > for process-specific actions, so we aren't married to using prctl()):
> 
> I don't think we're anywhere near the point where I care about
> interfaces with this stuff.
> 
> Interfaces are a trivial but tedious matter once the rest works to
> satisfaction.
> 
I agree that the API itself is a bit of bike shedding and that's why I
provided a strawperson proposal to highlight the desired properties. I
do think the high level semantics are important to agree upon.

Using cgroups could imply that a privileged user is meant to create and
track all the core scheduling groups. It sounds like you picked cgroups
out of ease of prototyping and not the specific behavior?

> As it happens; there is actually a bug in that very cgroup patch that
> can cause undesired scheduling. Try spotting and fixing that.
> 
This is where I think the high level properties of core scheduling are
relevant. I'm not sure what bug is in the existing patch, but it's hard
for me to tell if the existing code behaves correctly without answering
questions, such as, "Should processes from two separate parents be
allowed to co-execute?"

> Another question is if we want to be L1TF complete (and how strict) or
> not, and if so, build the missing pieces (for instance we currently
> don't kick siblings on IRQ/trap/exception entry -- and yes that's nasty
> and horrible code and missing for that reason).
>
I assumed from the beginning that this should be safe across exceptions.
Is there a mitigating reason that it shouldn't?

> 
> So first; does this provide what we need? If that's sorted we can
> bike-shed on uapi/abi.
I agree on not bike shedding about the API, but can we agree on some of
the high level properties? For example, who generates the core
scheduling ids, what properties about them are enforced, etc.?

Regards,

Greg Kerr


* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-20  9:42   ` Peter Zijlstra
  2019-02-20 18:33     ` Greg Kerr
@ 2019-02-20 18:43     ` Subhra Mazumdar
  1 sibling, 0 replies; 99+ messages in thread
From: Subhra Mazumdar @ 2019-02-20 18:43 UTC (permalink / raw)
  To: Peter Zijlstra, Greg Kerr
  Cc: mingo, tglx, Paul Turner, tim.c.chen, torvalds, linux-kernel,
	fweisbec, keescook


On 2/20/19 1:42 AM, Peter Zijlstra wrote:
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing in e-mail?
>
> On Tue, Feb 19, 2019 at 02:07:01PM -0800, Greg Kerr wrote:
>> Thanks for posting this patchset Peter. Based on the patch titled, "sched: A
>> quick and dirty cgroup tagging interface," I believe cgroups are used to
>> define co-scheduling groups in this implementation.
>>
>> Chrome OS engineers (kerrnel@google.com, mpdenton@google.com, and
>> palmer@google.com) are considering an interface that is usable by unprivileged
>> userspace apps. cgroups are a global resource that require privileged access.
>> Have you considered an interface that is akin to namespaces? Consider the
>> following strawperson API proposal (I understand prctl() is generally used
>> for process-specific actions, so we aren't married to using prctl()):
> I don't think we're anywhere near the point where I care about
> interfaces with this stuff.
>
> Interfaces are a trivial but tedious matter once the rest works to
> satisfaction.
>
> As it happens; there is actually a bug in that very cgroup patch that
> can cause undesired scheduling. Try spotting and fixing that.
>
> Another question is if we want to be L1TF complete (and how strict) or
> not, and if so, build the missing pieces (for instance we currently
> don't kick siblings on IRQ/trap/exception entry -- and yes that's nasty
> and horrible code and missing for that reason).
I remember asking Paul about this and he mentioned he has an Address Space
Isolation proposal to cover this. So it seems this is out of scope of
core scheduling?
>
> So first; does this provide what we need? If that's sorted we can
> bike-shed on uapi/abi.


* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-18 17:49 ` [RFC][PATCH 00/16] sched: Core scheduling Linus Torvalds
  2019-02-18 20:40   ` Peter Zijlstra
@ 2019-02-21  2:53   ` Subhra Mazumdar
  2019-02-21 14:03     ` Peter Zijlstra
  2019-02-22 12:45   ` Mel Gorman
  2 siblings, 1 reply; 99+ messages in thread
From: Subhra Mazumdar @ 2019-02-21  2:53 UTC (permalink / raw)
  To: Linus Torvalds, Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Paul Turner, Tim Chen,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, kerrnel


On 2/18/19 9:49 AM, Linus Torvalds wrote:
> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
>> However; whichever way around you turn this cookie; it is expensive and nasty.
> Do you (or anybody else) have numbers for real loads?
>
> Because performance is all that matters. If performance is bad, then
> it's pointless, since just turning off SMT is the answer.
>
>                    Linus
I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system.
This is on baremetal, no virtualization. In all cases I put each DB
instance in a separate cpu cgroup. The following are the average throughput
numbers of the 2 instances. %stdev is the standard deviation between the 2
instances.

Baseline = build w/o CONFIG_SCHED_CORE
core_sched = build w/ CONFIG_SCHED_CORE
HT_disable = offlined sibling HT with baseline

Users  Baseline  %stdev  core_sched      %stdev  HT_disable        %stdev
16     997768    3.28    808193(-19%)    34      1053888(+5.6%)    2.9
24     1157314   9.4     974555(-15.8%)  40.5    1197904(+3.5%)    4.6
32     1693644   6.4     1237195(-27%)   42.8    1308180(-22.8%)   5.3

The regressions are substantial. I also noticed that, with core scheduling,
one of the DB instances had much lower throughput than the other, which
brought down the average and is reflected in the very high %stdev.
Disabling HT hurts at 32 users but is still better than core scheduling in
terms of both average and %stdev. There are some issues with the DB setup,
because of which I couldn't go beyond 32 users.
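
The percentages in parentheses read as relative changes against the
Baseline column. A small C sketch of that arithmetic, with the throughput
numbers copied from the table; struct toy_row, toy_rows and
toy_pct_change() are names made up for this illustration:

```c
#include <assert.h>

/* One row of the table: average throughput per configuration. */
struct toy_row {
	int users;
	double baseline;
	double core_sched;
	double ht_disable;
};

/* Numbers copied from the table above. */
static const struct toy_row toy_rows[] = {
	{ 16,  997768.0,  808193.0, 1053888.0 },
	{ 24, 1157314.0,  974555.0, 1197904.0 },
	{ 32, 1693644.0, 1237195.0, 1308180.0 },
};

/* Relative change of measured against baseline, in percent. */
static double toy_pct_change(double baseline, double measured)
{
	assert(baseline > 0.0);
	return 100.0 * (measured - baseline) / baseline;
}
```

For 16 users, toy_pct_change(997768, 808193) comes out at roughly -19,
matching the core_sched column.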


* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-21  2:53   ` Subhra Mazumdar
@ 2019-02-21 14:03     ` Peter Zijlstra
  2019-02-21 18:44       ` Subhra Mazumdar
  2019-02-22  0:34       ` Subhra Mazumdar
  0 siblings, 2 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-21 14:03 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Tim Chen, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, kerrnel

On Wed, Feb 20, 2019 at 06:53:08PM -0800, Subhra Mazumdar wrote:
> 
> On 2/18/19 9:49 AM, Linus Torvalds wrote:
> > On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > However; whichever way around you turn this cookie; it is expensive and nasty.
> > Do you (or anybody else) have numbers for real loads?
> > 
> > Because performance is all that matters. If performance is bad, then
> > it's pointless, since just turning off SMT is the answer.
> > 
> >                    Linus
> I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system.
> This is on baremetal, no virtualization.  

I'm thinking oracle schedules quite a bit, right? Then you get massive
overhead (as shown).

The thing with virt workloads is that if they don't VMEXIT lots, they
also don't schedule lots (the vCPU stays running, nested scheduler
etc..).

Also; like I wrote, it is quite possible there is some sibling rivalry
here, which can cause excessive rescheduling. Someone would have to
trace a workload and check.

My older patches had a condition that would not preempt a task for a
little while, such that it might make _some_ progress, these patches
don't have that (yet).


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 15/16] sched: Trivial forced-newidle balancer
  2019-02-18 16:56 ` [RFC][PATCH 15/16] sched: Trivial forced-newidle balancer Peter Zijlstra
@ 2019-02-21 16:19   ` Valentin Schneider
  2019-02-21 16:41     ` Peter Zijlstra
  0 siblings, 1 reply; 99+ messages in thread
From: Valentin Schneider @ 2019-02-21 16:19 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel

Hi,

On 18/02/2019 16:56, Peter Zijlstra wrote:
[...]
> +static bool try_steal_cookie(int this, int that)
> +{
> +	struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
> +	struct task_struct *p;
> +	unsigned long cookie;
> +	bool success = false;
> +
> +	local_irq_disable();
> +	double_rq_lock(dst, src);
> +
> +	cookie = dst->core->core_cookie;
> +	if (!cookie)
> +		goto unlock;
> +
> +	if (dst->curr != dst->idle)
> +		goto unlock;
> +
> +	p = sched_core_find(src, cookie);
> +	if (p == src->idle)
> +		goto unlock;
> +
> +	do {
> +		if (p == src->core_pick || p == src->curr)
> +			goto next;
> +
> +		if (!cpumask_test_cpu(this, &p->cpus_allowed))
> +			goto next;
> +
> +		if (p->core_occupation > dst->idle->core_occupation)
> +			goto next;
> +

IIUC, we're trying to find/steal tasks matching the core_cookie from other
rqs because dst has been cookie-forced-idle.

If the p we find isn't running, what's the meaning of core_occupation?
I would have expected it to be 0, but we don't seem to be clearing it when
resetting the state in pick_next_task().

If it is running, we prevent the stealing if the core it's on is running
more matching tasks than the core of the pulling rq. It feels to me as if
that's a balancing tweak to try to cram as many matching tasks as possible
in a single core, so to me this reads as "don't steal my tasks if I'm
running more than you are, but I will steal tasks from you if I'm given
the chance". Is that correct?

[...]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 15/16] sched: Trivial forced-newidle balancer
  2019-02-21 16:19   ` Valentin Schneider
@ 2019-02-21 16:41     ` Peter Zijlstra
  2019-02-21 16:47       ` Peter Zijlstra
  2019-04-04  8:31       ` Aubrey Li
  0 siblings, 2 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-21 16:41 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: mingo, tglx, pjt, tim.c.chen, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel

On Thu, Feb 21, 2019 at 04:19:46PM +0000, Valentin Schneider wrote:
> Hi,
> 
> On 18/02/2019 16:56, Peter Zijlstra wrote:
> [...]
> > +static bool try_steal_cookie(int this, int that)
> > +{
> > +	struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
> > +	struct task_struct *p;
> > +	unsigned long cookie;
> > +	bool success = false;
> > +
> > +	local_irq_disable();
> > +	double_rq_lock(dst, src);
> > +
> > +	cookie = dst->core->core_cookie;
> > +	if (!cookie)
> > +		goto unlock;
> > +
> > +	if (dst->curr != dst->idle)
> > +		goto unlock;
> > +
> > +	p = sched_core_find(src, cookie);
> > +	if (p == src->idle)
> > +		goto unlock;
> > +
> > +	do {
> > +		if (p == src->core_pick || p == src->curr)
> > +			goto next;
> > +
> > +		if (!cpumask_test_cpu(this, &p->cpus_allowed))
> > +			goto next;
> > +
> > +		if (p->core_occupation > dst->idle->core_occupation)
> > +			goto next;
> > +
> 
> IIUC, we're trying to find/steal tasks matching the core_cookie from other
> rqs because dst has been cookie-forced-idle.
> 
> If the p we find isn't running, what's the meaning of core_occupation?
> I would have expected it to be 0, but we don't seem to be clearing it when
> resetting the state in pick_next_task().

Indeed. We preserve the occupation from the last time around; it's not
perfect but it's better than nothing.

Consider there's two groups; and we just happen to run the other group.
Then our occupation, being what it was last, is still accurate. When
next we run, we'll again get that many siblings together.

> If it is running, we prevent the stealing if the core it's on is running
> more matching tasks than the core of the pulling rq. It feels to me as if
> that's a balancing tweak to try to cram as many matching tasks as possible
> in a single core, so to me this reads as "don't steal my tasks if I'm
> running more than you are, but I will steal tasks from you if I'm given
> the chance". Is that correct?

Correct, otherwise an SMT4 with 5 tasks could end up ping-ponging the
one task forever.

Note that a further condition a little up the callchain from here only
does this stealing if the thread was forced-idle -- ie. it had something
to run anyway. So under the condition where there simply aren't enough
tasks to keep all siblings busy, we'll not compact just cause.
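
The three 'goto next' checks quoted above boil down to a small predicate. A
user-space sketch of it, assuming toy types: struct toy_task and
toy_may_steal() are made up for illustration and only mirror the fields the
quoted loop looks at, they are not the kernel's task_struct or any real
scheduler API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the task state the quoted loop inspects (hypothetical). */
struct toy_task {
	bool is_curr;			/* models p == src->curr */
	bool is_core_pick;		/* models p == src->core_pick */
	bool allowed_on_dst;		/* models cpumask_test_cpu(this, ...) */
	unsigned int core_occupation;	/* siblings that ran with p last time */
};

/* True if none of the three 'goto next' checks would skip p. */
bool toy_may_steal(const struct toy_task *p, unsigned int dst_occupation)
{
	assert(p != NULL);

	if (p->is_curr || p->is_core_pick)
		return false;
	if (!p->allowed_on_dst)
		return false;
	/* Don't pull a task away from a core that is busier than ours. */
	if (p->core_occupation > dst_occupation)
		return false;
	return true;
}
```

With dst_occupation = 1, a queued task whose core_occupation is 2 is left
alone, while a task with core_occupation 1 can be pulled by a core whose
occupation is 2.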

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 15/16] sched: Trivial forced-newidle balancer
  2019-02-21 16:41     ` Peter Zijlstra
@ 2019-02-21 16:47       ` Peter Zijlstra
  2019-02-21 18:28         ` Valentin Schneider
  2019-04-04  8:31       ` Aubrey Li
  1 sibling, 1 reply; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-21 16:47 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: mingo, tglx, pjt, tim.c.chen, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel

On Thu, Feb 21, 2019 at 05:41:46PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 21, 2019 at 04:19:46PM +0000, Valentin Schneider wrote:
> > Hi,
> > 
> > On 18/02/2019 16:56, Peter Zijlstra wrote:
> > [...]
> > > +static bool try_steal_cookie(int this, int that)
> > > +{
> > > +	struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
> > > +	struct task_struct *p;
> > > +	unsigned long cookie;
> > > +	bool success = false;
> > > +
> > > +	local_irq_disable();
> > > +	double_rq_lock(dst, src);
> > > +
> > > +	cookie = dst->core->core_cookie;
> > > +	if (!cookie)
> > > +		goto unlock;
> > > +
> > > +	if (dst->curr != dst->idle)
> > > +		goto unlock;
> > > +
> > > +	p = sched_core_find(src, cookie);
> > > +	if (p == src->idle)
> > > +		goto unlock;
> > > +
> > > +	do {
> > > +		if (p == src->core_pick || p == src->curr)
> > > +			goto next;
> > > +
> > > +		if (!cpumask_test_cpu(this, &p->cpus_allowed))
> > > +			goto next;
> > > +
> > > +		if (p->core_occupation > dst->idle->core_occupation)
> > > +			goto next;
> > > +
> > 
> > IIUC, we're trying to find/steal tasks matching the core_cookie from other
> > rqs because dst has been cookie-forced-idle.
> > 
> > If the p we find isn't running, what's the meaning of core_occupation?
> > I would have expected it to be 0, but we don't seem to be clearing it when
> > resetting the state in pick_next_task().
> 
> Indeed. We preserve the occupation from the last time around; it's not
> perfect but it's better than nothing.
> 
> Consider there's two groups; and we just happen to run the other group.
> Then our occupation, being what it was last, is still accurate. When
> next we run, we'll again get that many siblings together.
> 
> > If it is running, we prevent the stealing if the core it's on is running
> > more matching tasks than the core of the pulling rq. It feels to me as if
> > that's a balancing tweak to try to cram as many matching tasks as possible
> > in a single core, so to me this reads as "don't steal my tasks if I'm
> > running more than you are, but I will steal tasks from you if I'm given
> > the chance". Is that correct?
> 
> Correct, otherwise an SMT4 with 5 tasks could end up ping-ponging the
> one task forever.
> 
> Note that a further condition a little up the callchain from here only
> does this stealing if the thread was forced-idle -- ie. it had something
> to run anyway. So under the condition where there simply aren't enough
> tasks to keep all siblings busy, we'll not compact just cause.

Better example; it will avoid stealing a task from a full SMT2 core to
fill another.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 15/16] sched: Trivial forced-newidle balancer
  2019-02-21 16:47       ` Peter Zijlstra
@ 2019-02-21 18:28         ` Valentin Schneider
  0 siblings, 0 replies; 99+ messages in thread
From: Valentin Schneider @ 2019-02-21 18:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, pjt, tim.c.chen, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel

On 21/02/2019 16:47, Peter Zijlstra wrote:
[...]
>>> IIUC, we're trying to find/steal tasks matching the core_cookie from other
>>> rqs because dst has been cookie-forced-idle.
>>>
>>> If the p we find isn't running, what's the meaning of core_occupation?
>>> I would have expected it to be 0, but we don't seem to be clearing it when
>>> resetting the state in pick_next_task().
>>
>> Indeed. We preserve the occupation from the last time around; it's not
>> perfect but it's better than nothing.
>>
>> Consider there's two groups; and we just happen to run the other group.
>> Then our occupation, being what it was last, is still accurate. When
>> next we run, we'll again get that many siblings together.
>>
>>> If it is running, we prevent the stealing if the core it's on is running
>>> more matching tasks than the core of the pulling rq. It feels to me as if
>>> that's a balancing tweak to try to cram as many matching tasks as possible
>>> in a single core, so to me this reads as "don't steal my tasks if I'm
>>> running more than you are, but I will steal tasks from you if I'm given
>>> the chance". Is that correct?
>>
>> Correct, otherwise an SMT4 with 5 tasks could end up ping-ponging the
>> one task forever.
>>

Wouldn't we want to move some tasks in those cases? If we're going newidle
we're guaranteed to have a thread for that extra task.

So

  if (p->core_occupation == cpumask_weight(cpu_smt_mask(that)))

we could want to steal, overriding the occupation comparison
(we already have a (p == src->core_pick) abort before). Kind of feels like
CFS stealing that steals when nr_running > 1.
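
One possible reading of that override, as a user-space sketch. The function
name and the plain integer arguments are assumptions made for illustration
(smt_width stands in for cpumask_weight(cpu_smt_mask(that))); this is not
code from the series:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Occupation part of the steal decision with the suggested override:
 * if the source core was completely full when p last ran, p is surplus
 * there and may be pulled regardless of the occupation comparison.
 * Hypothetical helper, for illustration only.
 */
bool toy_occupation_allows_steal(unsigned int p_occupation,
				 unsigned int dst_occupation,
				 unsigned int smt_width)
{
	assert(smt_width > 0);

	if (p_occupation == smt_width)
		return true;

	/* Otherwise keep the original rule: skip if the source is busier. */
	return p_occupation <= dst_occupation;
}
```

For the SMT4 with 5 tasks case this lets the one queued task (occupation 4)
move to a core with occupation 1, while a task from a core with occupation 3
would still be left where it is.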

>> Note that a further condition a little up the callchain from here only
>> does this stealing if the thread was forced-idle -- ie. it had something
>> to run anyway. So under the condition where there simply aren't enough
>> tasks to keep all siblings busy, we'll not compact just cause.
> 
> Better example; it will avoid stealing a task from a full SMT2 core to
> fill another.
> 

Aye, that's the scenario I was thinking of.

Thanks for clearing things up.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-21 14:03     ` Peter Zijlstra
@ 2019-02-21 18:44       ` Subhra Mazumdar
  2019-02-22  0:34       ` Subhra Mazumdar
  1 sibling, 0 replies; 99+ messages in thread
From: Subhra Mazumdar @ 2019-02-21 18:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Tim Chen, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, kerrnel


On 2/21/19 6:03 AM, Peter Zijlstra wrote:
> On Wed, Feb 20, 2019 at 06:53:08PM -0800, Subhra Mazumdar wrote:
>> On 2/18/19 9:49 AM, Linus Torvalds wrote:
>>> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>> However; whichever way around you turn this cookie; it is expensive and nasty.
>>> Do you (or anybody else) have numbers for real loads?
>>>
>>> Because performance is all that matters. If performance is bad, then
>>> it's pointless, since just turning off SMT is the answer.
>>>
>>>                     Linus
>> I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system.
>> This is on baremetal, no virtualization.
> I'm thinking oracle schedules quite a bit, right? Then you get massive
> overhead (as shown).
Yes. In terms of idleness we have:

Users  baseline  core_sched
16     67%       70%
24     53%       59%
32     41%       49%

So there is more idleness with core sched which is understandable as there
can be forced idleness. The other part contributing to regression is most
likely overhead.
>
> The thing with virt workloads is that if they don't VMEXIT lots, they
> also don't schedule lots (the vCPU stays running, nested scheduler
> etc..).
I plan to run some VM workloads.
>
> Also; like I wrote, it is quite possible there is some sibling rivalry
> here, which can cause excessive rescheduling. Someone would have to
> trace a workload and check.
>
> My older patches had a condition that would not preempt a task for a
> little while, such that it might make _some_ progress, these patches
> don't have that (yet).
>

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-21 14:03     ` Peter Zijlstra
  2019-02-21 18:44       ` Subhra Mazumdar
@ 2019-02-22  0:34       ` Subhra Mazumdar
  1 sibling, 0 replies; 99+ messages in thread
From: Subhra Mazumdar @ 2019-02-22  0:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Tim Chen, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, kerrnel


On 2/21/19 6:03 AM, Peter Zijlstra wrote:
> On Wed, Feb 20, 2019 at 06:53:08PM -0800, Subhra Mazumdar wrote:
>> On 2/18/19 9:49 AM, Linus Torvalds wrote:
>>> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>> However; whichever way around you turn this cookie; it is expensive and nasty.
>>> Do you (or anybody else) have numbers for real loads?
>>>
>>> Because performance is all that matters. If performance is bad, then
>>> it's pointless, since just turning off SMT is the answer.
>>>
>>>                     Linus
>> I tested 2 Oracle DB instances running OLTP on a 2 socket 44 cores system.
>> This is on baremetal, no virtualization.
> I'm thinking oracle schedules quite a bit, right? Then you get massive
> overhead (as shown).
>
Out of curiosity I ran the patchset from Amazon with the same setup to see
if it was any better performance-wise, but it looks equally bad. At 32
users it performed even worse and the idle time increased much more. The only
good thing about it was that it was fair to both instances, as seen in
the low %stdev.

Users  Baseline  %stdev  %idle  cosched      %stdev  %idle
16     1         2.9     66     0.93(-7%)    1.1     69
24     1         11.3    53     0.87(-13%)   11.2    61
32     1         7       41     0.66(-34%)   5.3     54

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-18 20:40   ` Peter Zijlstra
  2019-02-19  0:29     ` Linus Torvalds
@ 2019-02-22 12:17     ` Paolo Bonzini
  2019-02-22 14:20       ` Peter Zijlstra
  1 sibling, 1 reply; 99+ messages in thread
From: Paolo Bonzini @ 2019-02-22 12:17 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Ingo Molnar, Thomas Gleixner, Paul Turner, Tim Chen,
	Linux List Kernel Mailing, subhra.mazumdar,
	Frédéric Weisbecker, Kees Cook, kerrnel

On 18/02/19 21:40, Peter Zijlstra wrote:
> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
>> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>> However; whichever way around you turn this cookie; it is expensive and nasty.
>>
>> Do you (or anybody else) have numbers for real loads?
>>
>> Because performance is all that matters. If performance is bad, then
>> it's pointless, since just turning off SMT is the answer.
> 
> Not for these patches; they stopped crashing only yesterday and I
> cleaned them up and sent them out.
> 
> The previous version; which was more horrible; but L1TF complete, was
> between OK-ish and horrible depending on the number of VMEXITs a
> workload had.
>
> If there were close to no VMEXITs, it beat smt=off, if there were lots
> of VMEXITs it was far far worse. Supposedly hosting people try their
> very bestest to have no VMEXITs so it mostly works for them (with the
> obvious exception of single VCPU guests).

If you are giving access to dedicated cores to guests, you also let them
do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU
bound workload.

In any case, IIUC what you are looking for is:

1) take a benchmark that *is* helped by SMT, this will be something CPU
bound.

2) compare two runs, one without SMT and without core scheduler, and one
with SMT+core scheduler.

3) find out whether performance is helped by SMT despite the increased
overhead of the core scheduler

Do you want some other load in the host, so that the scheduler actually
does do something?  Or is the point just that you show that the
performance isn't affected when the scheduler does not have anything to
do (which should be obvious, but having numbers is always better)?

Paolo

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-18 17:49 ` [RFC][PATCH 00/16] sched: Core scheduling Linus Torvalds
  2019-02-18 20:40   ` Peter Zijlstra
  2019-02-21  2:53   ` Subhra Mazumdar
@ 2019-02-22 12:45   ` Mel Gorman
  2019-02-22 16:10     ` Mel Gorman
  2019-03-08 19:44     ` Subhra Mazumdar
  2 siblings, 2 replies; 99+ messages in thread
From: Mel Gorman @ 2019-02-22 12:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Paul Turner, Tim Chen,
	Linux List Kernel Mailing, subhra.mazumdar, Linus Torvalds,
	Frédéric Weisbecker, Kees Cook, kerrnel

On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > However; whichever way around you turn this cookie; it is expensive and nasty.
> 
> Do you (or anybody else) have numbers for real loads?
> 
> Because performance is all that matters. If performance is bad, then
> it's pointless, since just turning off SMT is the answer.
> 

I tried to do a comparison between tip/master, ht disabled and this series
putting test workloads into a tagged cgroup but unfortunately it failed

[  156.978682] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
[  156.986597] #PF error: [normal kernel read fault]
[  156.991343] PGD 0 P4D 0
[  156.993905] Oops: 0000 [#1] SMP PTI
[  156.997438] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 5.0.0-rc7-schedcore-v1r1 #1
[  157.005161] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
[  157.012896] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
[  157.018613] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
[  157.037544] RSP: 0018:ffffc9000c5bbde8 EFLAGS: 00010086
[  157.042819] RAX: ffff88810f5f6a00 RBX: 00000001547f175c RCX: 0000000000000001
[  157.050015] RDX: ffff88bf3bdb0a40 RSI: 0000000000000000 RDI: 00000001547f175c
[  157.057215] RBP: ffff88bf7fae32c0 R08: 000000000001e358 R09: ffff88810fb9f000
[  157.064410] R10: ffffc9000c5bbe08 R11: ffff88810fb9f5c4 R12: 0000000000000000
[  157.071611] R13: ffff88bf4e3ea0c0 R14: 0000000000000000 R15: ffff88bf4e3ea7a8
[  157.078814] FS:  0000000000000000(0000) GS:ffff88bf7f5c0000(0000) knlGS:0000000000000000
[  157.086977] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  157.092779] CR2: 0000000000000058 CR3: 000000000220e005 CR4: 00000000003606e0
[  157.099979] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  157.109529] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  157.119058] Call Trace:
[  157.123865]  pick_next_entity+0x61/0x110
[  157.130137]  pick_task_fair+0x4b/0x90
[  157.136124]  __schedule+0x365/0x12c0
[  157.141985]  schedule_idle+0x1e/0x40
[  157.147822]  do_idle+0x166/0x280
[  157.153275]  cpu_startup_entry+0x19/0x20
[  157.159420]  start_secondary+0x17a/0x1d0
[  157.165568]  secondary_startup_64+0xa4/0xb0
[  157.171985] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs msr intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crc32_pclmul ghash_clmulni_intel ixgbe aesni_intel xfrm_algo iTCO_wdt joydev iTCO_vendor_support libphy igb aes_x86_64 crypto_simd ptp cryptd mei_me mdio pps_core ioatdma glue_helper pcspkr ipmi_si lpc_ich i2c_i801 mei dca ipmi_devintf ipmi_msghandler acpi_pad pcc_cpufreq button btrfs libcrc32c xor zstd_decompress zstd_compress raid6_pq hid_generic usbhid ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci crc32c_intel ehci_pci ttm xhci_hcd ehci_hcd drm ahci usbcore mpt3sas libahci raid_class scsi_transport_sas wmi sg nbd dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[  157.258990] CR2: 0000000000000058
[  157.264961] ---[ end trace a301ac5e3ee86fde ]---
[  157.283719] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
[  157.291967] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
[  157.316121] RSP: 0018:ffffc9000c5bbde8 EFLAGS: 00010086
[  157.324060] RAX: ffff88810f5f6a00 RBX: 00000001547f175c RCX: 0000000000000001
[  157.333932] RDX: ffff88bf3bdb0a40 RSI: 0000000000000000 RDI: 00000001547f175c
[  157.343795] RBP: ffff88bf7fae32c0 R08: 000000000001e358 R09: ffff88810fb9f000
[  157.353634] R10: ffffc9000c5bbe08 R11: ffff88810fb9f5c4 R12: 0000000000000000
[  157.363506] R13: ffff88bf4e3ea0c0 R14: 0000000000000000 R15: ffff88bf4e3ea7a8
[  157.373395] FS:  0000000000000000(0000) GS:ffff88bf7f5c0000(0000) knlGS:0000000000000000
[  157.384238] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  157.392709] CR2: 0000000000000058 CR3: 000000000220e005 CR4: 00000000003606e0
[  157.402601] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  157.412488] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  157.422334] Kernel panic - not syncing: Attempted to kill the idle task!
[  158.529804] Shutting down cpus with NMI
[  158.573249] Kernel Offset: disabled
[  158.586198] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

RIP translates to kernel/sched/fair.c:6819

static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
        s64 gran, vdiff = curr->vruntime - se->vruntime; /* LINE 6819 */

        if (vdiff <= 0)
                return -1;

        gran = wakeup_gran(se);
        if (vdiff > gran)
                return 1;

        return 0;
}

I haven't tried debugging it yet.
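
The faulting address (CR2) is 0x58 and RSI, the second argument register, is
0, which would be consistent with se being NULL and the load of se->vruntime
landing at offset 0x58. A user-space sketch of that offset arithmetic follows.
The mock struct is an assumption: it only imitates the size and order of the
leading fields of struct sched_entity as they are believed to be laid out on
x86-64 in kernels of this vintage, using fixed-width stand-ins for
unsigned long and pointers. Check it against the real headers before relying
on it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock of the leading part of struct sched_entity (assumed layout). */
struct mock_sched_entity {
	uint64_t load_weight;		/* struct load_weight.weight */
	uint32_t load_inv_weight;	/* struct load_weight.inv_weight */
	uint32_t load_pad;		/* tail padding of load_weight */
	uint64_t runnable_weight;
	uint64_t run_node[3];		/* struct rb_node: three words */
	uint64_t group_node[2];		/* struct list_head: two pointers */
	uint32_t on_rq;
	uint32_t on_rq_pad;		/* alignment padding before the u64s */
	uint64_t exec_start;
	uint64_t sum_exec_runtime;
	uint64_t vruntime;
};

/* Address a load of se->vruntime would touch for a given se value. */
uint64_t vruntime_fault_address(uint64_t se)
{
	/* All padding is explicit, so the mock has no hidden holes. */
	assert(sizeof(struct mock_sched_entity) == 0x60);
	return se + offsetof(struct mock_sched_entity, vruntime);
}
```

Under those assumptions vruntime_fault_address(0) is 0x58, the same value as
CR2 in the oops above.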

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-20 18:33     ` Greg Kerr
@ 2019-02-22 14:10       ` Peter Zijlstra
  2019-03-07 22:06         ` Paolo Bonzini
  0 siblings, 1 reply; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-22 14:10 UTC (permalink / raw)
  To: Greg Kerr
  Cc: Greg Kerr, mingo, tglx, Paul Turner, tim.c.chen, torvalds,
	linux-kernel, subhra.mazumdar, fweisbec, keescook

On Wed, Feb 20, 2019 at 10:33:55AM -0800, Greg Kerr wrote:
> > On Tue, Feb 19, 2019 at 02:07:01PM -0800, Greg Kerr wrote:

> Using cgroups could imply that a privileged user is meant to create and
> track all the core scheduling groups. It sounds like you picked cgroups
> out of ease of prototyping and not the specific behavior?

Yep. Where a prctl() patch would've been similarly simple, the userspace
part would've been more annoying. The cgroup thing I can just echo into.

> > As it happens; there is actually a bug in that very cgroup patch that
> > can cause undesired scheduling. Try spotting and fixing that.
> > 
> This is where I think the high level properties of core scheduling are
> relevant. I'm not sure what bug is in the existing patch, but it's hard
> for me to tell if the existing code behaves correctly without answering
> questions, such as, "Should processes from two separate parents be
> allowed to co-execute?"

Sure, why not.

The bug is that we set the cookie and don't force a reschedule. This
then allows the existing task selection to continue; which might not
adhere to the (new) cookie constraints.

It is a transient state though; as soon as we reschedule this gets
corrected automagically.

A second bug is that we leak the cgroup tag state on destroy.

A third bug would be that it is not hierarchical -- but at this point
meh.

> > Another question is if we want to be L1TF complete (and how strict) or
> > not, and if so, build the missing pieces (for instance we currently
> > don't kick siblings on IRQ/trap/exception entry -- and yes that's nasty
> > and horrible code and missing for that reason).
> >
> I assumed from the beginning that this should be safe across exceptions.
> Is there a mitigating reason that it shouldn't?

I'm not entirely sure what you mean; so let me expound -- L1TF is public
now after all.

So the basic problem is that a malicious guest can read the entire L1,
right? L1 is shared between SMT. So if one sibling takes a host
interrupt and populates L1 with host data, that other thread can read
it from the guest.

This is why my old patches (which Tim has on github _somewhere_) also
have hooks in irq_enter/irq_exit.

The big question is of course; if any data touched by interrupts is
worth the pain.

> > So first; does this provide what we need? If that's sorted we can
> > bike-shed on uapi/abi.

> I agree on not bike shedding about the API, but can we agree on some of
> the high level properties? For example, who generates the core
> scheduling ids, what properties about them are enforced, etc.?

It's an opaque cookie; the scheduler really doesn't care. All it does is
ensure that tasks match or force idle within a core.
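
That rule can be modelled in a few lines of user-space C. This is only a
sketch under simplifying assumptions: each sibling offers exactly one
candidate task, the core-wide cookie has already been chosen, and
toy_apply_core_cookie() is a made-up name, not a function from the series:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * For each sibling, compare the cookie of its candidate task against the
 * cookie selected for the whole core. Siblings whose candidate does not
 * match are marked forced-idle. Returns the number of forced-idle siblings.
 * The cookie values are opaque; only equality matters.
 */
size_t toy_apply_core_cookie(const unsigned long *candidate_cookie,
			     bool *forced_idle, size_t nr_siblings,
			     unsigned long core_cookie)
{
	size_t i, nr_idle = 0;

	assert(candidate_cookie != NULL && forced_idle != NULL);

	for (i = 0; i < nr_siblings; i++) {
		forced_idle[i] = candidate_cookie[i] != core_cookie;
		if (forced_idle[i])
			nr_idle++;
	}

	return nr_idle;
}
```

With candidates { 7, 7, 9, 7 } and core cookie 7, only the third sibling ends
up forced-idle; how the cookies were generated plays no role.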

My previous patches got the cookie from a modified
preempt_notifier_register/unregister() which passed the vcpu->kvm
pointer into it from vcpu_load/put.

This auto-grouped VMs. It was also found to be somewhat annoying because
apparently KVM does a lot of userspace assist for all sorts of nonsense
and it would leave/re-join the cookie group for every single assist.
Causing tons of rescheduling.

I'm fine with having all these interfaces, kvm, prctl and cgroup, and I
don't care about conflict resolution -- that's the tedious part of the
bike-shed :-)

The far more important question is whether there are enough workloads where
this can be made useful or not. If not, none of that interface crud
matters one whit, we can file these here patches in the bit-bucket and
happily go spend our time elsewhere.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-22 12:17     ` Paolo Bonzini
@ 2019-02-22 14:20       ` Peter Zijlstra
  2019-02-22 19:26         ` Tim Chen
  0 siblings, 1 reply; 99+ messages in thread
From: Peter Zijlstra @ 2019-02-22 14:20 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Tim Chen, Linux List Kernel Mailing, subhra.mazumdar,
	Frédéric Weisbecker, Kees Cook, kerrnel

On Fri, Feb 22, 2019 at 01:17:01PM +0100, Paolo Bonzini wrote:
> On 18/02/19 21:40, Peter Zijlstra wrote:
> > On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> >> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >>>
> >>> However; whichever way around you turn this cookie; it is expensive and nasty.
> >>
> >> Do you (or anybody else) have numbers for real loads?
> >>
> >> Because performance is all that matters. If performance is bad, then
> >> it's pointless, since just turning off SMT is the answer.
> > 
> > Not for these patches; they stopped crashing only yesterday and I
> > cleaned them up and sent them out.
> > 
> > The previous version; which was more horrible; but L1TF complete, was
> > between OK-ish and horrible depending on the number of VMEXITs a
> > workload had.
> >
> > If there were close to no VMEXITs, it beat smt=off, if there were lots
> > of VMEXITs it was far far worse. Supposedly hosting people try their
> > very bestest to have no VMEXITs so it mostly works for them (with the
> > obvious exception of single VCPU guests).
> 
> If you are giving access to dedicated cores to guests, you also let them
> do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU
> bound workload.
> 
> In any case, IIUC what you are looking for is:
> 
> 1) take a benchmark that *is* helped by SMT, this will be something CPU
> bound.
> 
> 2) compare two runs, one without SMT and without core scheduler, and one
> with SMT+core scheduler.
> 
> 3) find out whether performance is helped by SMT despite the increased
> overhead of the core scheduler
> 
> Do you want some other load in the host, so that the scheduler actually
> does do something?  Or is the point just that you show that the
> performance isn't affected when the scheduler does not have anything to
> do (which should be obvious, but having numbers is always better)?

Well, what _I_ want is for all this to just go away :-)

Tim did much of the testing last time around; and I don't think he did
core-pinning of VMs much (although I'm sure he did some of that). I'm
still a complete virt noob; I can barely boot a VM to save my life.

(you should be glad to not have heard my cursing at qemu cmdline when
trying to reproduce some of Tim's results -- let's just say that I can
deal with gpg)

I'm sure he tried some oversubscribed scenarios without pinning. But
even there, when all the vCPU threads are runnable, they don't schedule
that much. Sure we take the preemption tick and thus schedule 100-1000
times a second, but that's manageable.

We spent quite some time tracing workloads and fixing funny behaviour --
none of that has been done for these patches yet.

The moment KVM needed user space assist for things (and thus VMEXITs
happened) things came apart real quick.


Anyway, Tim, can you tell these fine folks what you did and for what
scenarios the last incarnation did show promise?

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-22 12:45   ` Mel Gorman
@ 2019-02-22 16:10     ` Mel Gorman
  2019-03-08 19:44     ` Subhra Mazumdar
  1 sibling, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2019-02-22 16:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Paul Turner, Tim Chen,
	Linux List Kernel Mailing, subhra.mazumdar, Linus Torvalds,
	Frédéric Weisbecker, Kees Cook, kerrnel

On Fri, Feb 22, 2019 at 12:45:44PM +0000, Mel Gorman wrote:
> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> > On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > However; whichever way around you turn this cookie; it is expensive and nasty.
> > 
> > Do you (or anybody else) have numbers for real loads?
> > 
> > Because performance is all that matters. If performance is bad, then
> > it's pointless, since just turning off SMT is the answer.
> > 
> 
> I tried to do a comparison between tip/master, ht disabled and this series
> putting test workloads into a tagged cgroup but unfortunately it failed
> 
> [  156.978682] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
> [  156.986597] #PF error: [normal kernel read fault]
> [  156.991343] PGD 0 P4D 0

When bodged around, one test survived (performance was crucified but the
benchmark is very synthetic). pgbench (test 2) panicked with a hard
lockup. Most of the console log was corrupted (unrelated to the patch)
but the relevant part is

[ 4587.419674] Call Trace:
[ 4587.419674]  _raw_spin_lock+0x1b/0x20
[ 4587.419675]  sched_core_balance+0x155/0x520
[ 4587.419675]  ? __switch_to_asm+0x34/0x70
[ 4587.419675]  __balance_callback+0x49/0xa0
[ 4587.419676]  __schedule+0xf15/0x12c0
[ 4587.419676]  schedule_idle+0x1e/0x40
[ 4587.419677]  do_idle+0x166/0x280
[ 4587.419677]  cpu_startup_entry+0x19/0x20
[ 4587.419678]  start_secondary+0x17a/0x1d0
[ 4587.419678]  secondary_startup_64+0xa4/0xb0
[ 4587.419679] Kernel panic - not syncing: Hard LOCKUP

-- 
Mel Gorman
SUSE Labs

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-22 14:20       ` Peter Zijlstra
@ 2019-02-22 19:26         ` Tim Chen
  2019-02-26  8:26           ` Aubrey Li
  0 siblings, 1 reply; 99+ messages in thread
From: Tim Chen @ 2019-02-22 19:26 UTC (permalink / raw)
  To: Peter Zijlstra, Paolo Bonzini
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linux List Kernel Mailing, subhra.mazumdar,
	Frédéric Weisbecker, Kees Cook, kerrnel

On 2/22/19 6:20 AM, Peter Zijlstra wrote:
> On Fri, Feb 22, 2019 at 01:17:01PM +0100, Paolo Bonzini wrote:
>> On 18/02/19 21:40, Peter Zijlstra wrote:
>>> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
>>>> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>
>>>>> However; whichever way around you turn this cookie; it is expensive and nasty.
>>>>
>>>> Do you (or anybody else) have numbers for real loads?
>>>>
>>>> Because performance is all that matters. If performance is bad, then
>>>> it's pointless, since just turning off SMT is the answer.
>>>
>>> Not for these patches; they stopped crashing only yesterday and I
>>> cleaned them up and send them out.
>>>
>>> The previous version; which was more horrible; but L1TF complete, was
>>> between OK-ish and horrible depending on the number of VMEXITs a
>>> workload had.
>>>
>>> If there were close to no VMEXITs, it beat smt=off, if there were lots
>>> of VMEXITs it was far far worse. Supposedly hosting people try their
>>> very bestest to have no VMEXITs so it mostly works for them (with the
>>> obvious exception of single VCPU guests).
>>
>> If you are giving access to dedicated cores to guests, you also let them
>> do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU
>> bound workload.
>>
>> In any case, IIUC what you are looking for is:
>>
>> 1) take a benchmark that *is* helped by SMT, this will be something CPU
>> bound.
>>
>> 2) compare two runs, one without SMT and without core scheduler, and one
>> with SMT+core scheduler.
>>
>> 3) find out whether performance is helped by SMT despite the increased
>> overhead of the core scheduler
>>
>> Do you want some other load in the host, so that the scheduler actually
>> does do something?  Or is the point just that you show that the
>> performance isn't affected when the scheduler does not have anything to
>> do (which should be obvious, but having numbers is always better)?
> 
> Well, what _I_ want is for all this to just go away :-)
> 
> Tim did much of testing last time around; and I don't think he did
> core-pinning of VMs much (although I'm sure he did some of that). I'm

Yes. The last time around I tested basic scenarios like:
1. single VM pinned on a core
2. 2 VMs pinned on a core
3. system oversubscription (no pinning)

In general, CPU-bound benchmarks, and even workloads without enough I/O
to cause lots of VMexits, performed better with HT than without under
Peter's last patchset.
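
For the pinned scenarios (1 and 2), the hyperthread siblings of each core
have to be identified first. A minimal sketch, assuming the standard sysfs
topology files (cpuN/topology/thread_siblings_list); the helper names are
hypothetical and this is not the tooling used for the tests described here:

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as "0,24" or "0-1,8" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def smt_cores(sysfs_root="/sys/devices/system/cpu"):
    """Return the set of cores, each as a tuple of its sibling CPU numbers."""
    pattern = os.path.join(
        sysfs_root, "cpu[0-9]*", "topology", "thread_siblings_list"
    )
    cores = set()
    for path in glob.glob(pattern):
        with open(path) as f:
            # Every sibling reports the same list, so the set dedups cores.
            cores.add(tuple(parse_cpu_list(f.read())))
    return cores
```

Each returned tuple is one core; its CPU numbers can then be handed to
whatever pinning mechanism is in use (for example taskset).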

> still a complete virt noob; I can barely boot a VM to save my life.
> 
> (you should be glad to not have heard my cursing at qemu cmdline when
> trying to reproduce some of Tim's results -- lets just say that I can
> deal with gpg)
> 
> I'm sure he tried some oversubscribed scenarios without pinning. 

We did try some oversubscribed scenarios like SPECvirt, which tried to
squeeze tons of VMs onto a single system in oversubscription mode.

There were two main problems in the last go-around:

1. Workloads with a high rate of VMexits (SPECvirt is one) were a major
source of pain when we tried Peter's previous patchset. The switch from
vcpus to qemu and back in the previous version of Peter's patch required
some coordination between the hyperthread siblings via IPI. For workloads
that do this a lot, the overhead quickly added up.

With Peter's new patch, this overhead will hopefully be reduced, giving
better performance.
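
The way per-VMexit coordination cost can eat the SMT gain can be
illustrated with a toy model. This is only a sketch under assumptions that
are not from the measurements: the cost is taken to be linear in the
VMexit rate, and smt_speedup, vmexits_per_sec and sync_cost_sec are
hypothetical parameters, not measured values.

```python
def core_sched_throughput(smt_speedup, vmexits_per_sec, sync_cost_sec):
    """Throughput relative to smt=off (= 1.0) under a linear-overhead model.

    smt_speedup:     throughput gain of SMT over smt=off with no overhead
    vmexits_per_sec: VMexit rate seen by the core
    sync_cost_sec:   time the core loses per VMexit to sibling coordination
    """
    useful_fraction = max(0.0, 1.0 - vmexits_per_sec * sync_cost_sec)
    return smt_speedup * useful_fraction


def break_even_vmexit_rate(smt_speedup, sync_cost_sec):
    """VMexit rate at which SMT plus core scheduling only matches smt=off."""
    return (1.0 - 1.0 / smt_speedup) / sync_cost_sec
```

Below the break-even rate the model beats smt=off and above it the model
loses, which matches the qualitative behaviour described for the previous
patchset but says nothing about the real numbers.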

2. Load balancing is quite tricky. Peter's last patchset did not have
load balancing for consolidating compatible running threads. I did some
unsophisticated load balancing to pair vcpus up, but the constant vcpu
migration overhead probably ate up any improvement from better load
pairing. So I didn't get much improvement in the over-subscription case
when turning on load balancing to consolidate the VCPUs of the same VM.
We'll probably have to try out this incarnation of Peter's patch and see
how well the load balancing works.
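
The pairing idea can be sketched as a greedy packing of runnable vcpus
onto cores so that siblings only ever run vcpus of the same VM. This is a
hypothetical illustration, not the balancer that was actually tried; the
function names are made up, and it ignores the migration cost that
probably ate the gains.

```python
from collections import defaultdict


def pair_vcpus(vcpus, threads_per_core=2):
    """Greedily pack vcpus onto cores so that siblings share one VM.

    vcpus: iterable of (vm_id, vcpu_id) tuples.
    Returns a list of cores; each core is a list of vcpus from a single VM.
    """
    by_vm = defaultdict(list)
    for vm_id, vcpu_id in vcpus:
        by_vm[vm_id].append((vm_id, vcpu_id))

    cores = []
    for vm_id in sorted(by_vm):
        members = by_vm[vm_id]
        # Fill one core at a time with vcpus of this VM only.
        for i in range(0, len(members), threads_per_core):
            cores.append(members[i:i + threads_per_core])
    return cores


def forced_idle_threads(cores, threads_per_core=2):
    """Count hardware threads left idle for lack of a compatible vcpu."""
    return sum(threads_per_core - len(core) for core in cores)
```

A VM with an odd number of runnable vcpus always leaves one sibling idle
in this scheme, which is the same effect as the single-VCPU guest case.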

I'll try to line up some benchmarking folks to do some tests.

Tim


* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-22 19:26         ` Tim Chen
@ 2019-02-26  8:26           ` Aubrey Li
  2019-02-27  7:54             ` Aubrey Li
  0 siblings, 1 reply; 99+ messages in thread
From: Aubrey Li @ 2019-02-26  8:26 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Paolo Bonzini, Linus Torvalds, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linux List Kernel Mailing,
	subhra.mazumdar, Frédéric Weisbecker, Kees Cook,
	kerrnel

On Sat, Feb 23, 2019 at 3:27 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On 2/22/19 6:20 AM, Peter Zijlstra wrote:
> > On Fri, Feb 22, 2019 at 01:17:01PM +0100, Paolo Bonzini wrote:
> >> On 18/02/19 21:40, Peter Zijlstra wrote:
> >>> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> >>>> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >>>>>
> >>>>> However; whichever way around you turn this cookie; it is expensive and nasty.
> >>>>
> >>>> Do you (or anybody else) have numbers for real loads?
> >>>>
> >>>> Because performance is all that matters. If performance is bad, then
> >>>> it's pointless, since just turning off SMT is the answer.
> >>>
> >>> Not for these patches; they stopped crashing only yesterday and I
> >>> cleaned them up and send them out.
> >>>
> >>> The previous version; which was more horrible; but L1TF complete, was
> >>> between OK-ish and horrible depending on the number of VMEXITs a
> >>> workload had.
> >>>
> >>> If there were close to no VMEXITs, it beat smt=off, if there were lots
> >>> of VMEXITs it was far far worse. Supposedly hosting people try their
> >>> very bestest to have no VMEXITs so it mostly works for them (with the
> >>> obvious exception of single VCPU guests).
> >>
> >> If you are giving access to dedicated cores to guests, you also let them
> >> do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU
> >> bound workload.
> >>
> >> In any case, IIUC what you are looking for is:
> >>
> >> 1) take a benchmark that *is* helped by SMT, this will be something CPU
> >> bound.
> >>
> >> 2) compare two runs, one without SMT and without core scheduler, and one
> >> with SMT+core scheduler.
> >>
> >> 3) find out whether performance is helped by SMT despite the increased
> >> overhead of the core scheduler
> >>
> >> Do you want some other load in the host, so that the scheduler actually
> >> does do something?  Or is the point just that you show that the
> >> performance isn't affected when the scheduler does not have anything to
> >> do (which should be obvious, but having numbers is always better)?
> >
> > Well, what _I_ want is for all this to just go away :-)
> >
> > Tim did much of testing last time around; and I don't think he did
> > core-pinning of VMs much (although I'm sure he did some of that). I'm
>
> Yes. The last time around I tested basic scenarios like:
> 1. single VM pinned on a core
> 2. 2 VMs pinned on a core
> 3. system oversubscription (no pinning)
>
> In general, CPU bound benchmarks and even things without too much I/O
> causing lots of VMexits perform better with HT than without for Peter's
> last patchset.
>
> > still a complete virt noob; I can barely boot a VM to save my life.
> >
> > (you should be glad to not have heard my cursing at qemu cmdline when
> > trying to reproduce some of Tim's results -- lets just say that I can
> > deal with gpg)
> >
> > I'm sure he tried some oversubscribed scenarios without pinning.
>
> We did try some oversubscribed scenarios like SPECVirt, that tried to
> squeeze tons of VMs on a single system in over subscription mode.
>
> There're two main problems in the last go around:
>
> 1. Workload with high rate of Vmexits (SpecVirt is one)
> were a major source of pain when we tried Peter's previous patchset.
> The switch from vcpus to qemu and back in previous version of Peter's patch
> requires some coordination between the hyperthread siblings via IPI.  And for
> workload that does this a lot, the overhead quickly added up.
>
> For Peter's new patch, this overhead hopefully would be reduced and give
> better performance.
>
> 2. Load balancing is quite tricky.  Peter's last patchset did not have
> load balancing for consolidating compatible running threads.
> I did some non-sophisticated load balancing
> to pair vcpus up.  But the constant vcpu migrations overhead probably ate up
> any improvements from better load pairing.  So I didn't get much
> improvement in the over-subscription case when turning on load balancing
> to consolidate the VCPUs of the same VM. We'll probably have to try
> out this incarnation of Peter's patch and see how well the load balancing
> works.
>
> I'll try to line up some benchmarking folks to do some tests.

I can help to do some basic tests.

Cgroup bias looks weird to me. If I have hundreds of cgroups, should I turn
core scheduling (cpu.tag) on one by one? Or is there a global knob I missed?
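
Absent a global knob, turning it on one by one can at least be scripted.
A minimal sketch, assuming that writing "1" to a cgroup's cpu.tag file
enables tagging and that the cpu controller is mounted somewhere like
/sys/fs/cgroup/cpu (both are assumptions; tag_all_cgroups is a
hypothetical helper):

```python
import os


def tag_all_cgroups(root, value="1", filename="cpu.tag"):
    """Write `value` to every `filename` under `root`; return paths written."""
    written = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Only touch cgroups that actually expose the tag file.
        if filename in filenames:
            path = os.path.join(dirpath, filename)
            with open(path, "w") as f:
                f.write(value)
            written.append(path)
    return sorted(written)
```

Calling tag_all_cgroups("/sys/fs/cgroup/cpu") as root would tag every
existing cgroup under that mount; cgroups created later would still need
to be tagged separately.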

Thanks,
-Aubrey

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-26  8:26           ` Aubrey Li
@ 2019-02-27  7:54             ` Aubrey Li
  0 siblings, 0 replies; 99+ messages in thread
From: Aubrey Li @ 2019-02-27  7:54 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra
  Cc: Paolo Bonzini, Linus Torvalds, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linux List Kernel Mailing, subhra.mazumdar,
	Frédéric Weisbecker, Kees Cook, Greg Kerr

On Tue, Feb 26, 2019 at 4:26 PM Aubrey Li <aubrey.intel@gmail.com> wrote:
>
> On Sat, Feb 23, 2019 at 3:27 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >
> > On 2/22/19 6:20 AM, Peter Zijlstra wrote:
> > > On Fri, Feb 22, 2019 at 01:17:01PM +0100, Paolo Bonzini wrote:
> > >> On 18/02/19 21:40, Peter Zijlstra wrote:
> > >>> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> > >>>> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >>>>>
> > >>>>> However; whichever way around you turn this cookie; it is expensive and nasty.
> > >>>>
> > >>>> Do you (or anybody else) have numbers for real loads?
> > >>>>
> > >>>> Because performance is all that matters. If performance is bad, then
> > >>>> it's pointless, since just turning off SMT is the answer.
> > >>>
> > >>> Not for these patches; they stopped crashing only yesterday and I
> > >>> cleaned them up and send them out.
> > >>>
> > >>> The previous version; which was more horrible; but L1TF complete, was
> > >>> between OK-ish and horrible depending on the number of VMEXITs a
> > >>> workload had.
> > >>>
> > >>> If there were close to no VMEXITs, it beat smt=off, if there were lots
> > >>> of VMEXITs it was far far worse. Supposedly hosting people try their
> > >>> very bestest to have no VMEXITs so it mostly works for them (with the
> > >>> obvious exception of single VCPU guests).
> > >>
> > >> If you are giving access to dedicated cores to guests, you also let them
> > >> do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU
> > >> bound workload.
> > >>
> > >> In any case, IIUC what you are looking for is:
> > >>
> > >> 1) take a benchmark that *is* helped by SMT, this will be something CPU
> > >> bound.
> > >>
> > >> 2) compare two runs, one without SMT and without core scheduler, and one
> > >> with SMT+core scheduler.
> > >>
> > >> 3) find out whether performance is helped by SMT despite the increased
> > >> overhead of the core scheduler
> > >>
> > >> Do you want some other load in the host, so that the scheduler actually
> > >> does do something?  Or is the point just that you show that the
> > >> performance isn't affected when the scheduler does not have anything to
> > >> do (which should be obvious, but having numbers is always better)?
> > >
> > > Well, what _I_ want is for all this to just go away :-)
> > >
> > > Tim did much of testing last time around; and I don't think he did
> > > core-pinning of VMs much (although I'm sure he did some of that). I'm
> >
> > Yes. The last time around I tested basic scenarios like:
> > 1. single VM pinned on a core
> > 2. 2 VMs pinned on a core
> > 3. system oversubscription (no pinning)
> >
> > In general, CPU bound benchmarks and even things without too much I/O
> > causing lots of VMexits perform better with HT than without for Peter's
> > last patchset.
> >
> > > still a complete virt noob; I can barely boot a VM to save my life.
> > >
> > > (you should be glad to not have heard my cursing at qemu cmdline when
> > > trying to reproduce some of Tim's results -- lets just say that I can
> > > deal with gpg)
> > >
> > > I'm sure he tried some oversubscribed scenarios without pinning.
> >
> > We did try some oversubscribed scenarios like SPECVirt, that tried to
> > squeeze tons of VMs on a single system in over subscription mode.
> >
> > There're two main problems in the last go around:
> >
> > 1. Workload with high rate of Vmexits (SpecVirt is one)
> > were a major source of pain when we tried Peter's previous patchset.
> > The switch from vcpus to qemu and back in previous version of Peter's patch
> > requires some coordination between the hyperthread siblings via IPI.  And for
> > workload that does this a lot, the overhead quickly added up.
> >
> > For Peter's new patch, this overhead hopefully would be reduced and give
> > better performance.
> >
> > 2. Load balancing is quite tricky.  Peter's last patchset did not have
> > load balancing for consolidating compatible running threads.
> > I did some non-sophisticated load balancing
> > to pair vcpus up.  But the constant vcpu migrations overhead probably ate up
> > any improvements from better load pairing.  So I didn't get much
> > improvement in the over-subscription case when turning on load balancing
> > to consolidate the VCPUs of the same VM. We'll probably have to try
> > out this incarnation of Peter's patch and see how well the load balancing
> > works.
> >
> > I'll try to line up some benchmarking folks to do some tests.
>
> I can help to do some basic tests.
>
> Cgroup bias looks weird to me. If I have hundreds of cgroups, should I turn
> core scheduling(cpu.tag) on one by one? Or Is there a global knob I missed?
>

I encountered the following panic when I turned core sched on in a cgroup
while the cgroup was running a best-effort workload with high CPU
utilization.

Feb 27 01:51:53 aubrey-ivb kernel: [  508.981348] core sched enabled
[  508.990627] BUG: unable to handle kernel NULL pointer dereference
at 000000000000008
[  508.999445] #PF error: [normal kernel read fault]
[  509.004772] PGD 8000001807b7d067 P4D 8000001807b7d067 PUD
18071c9067 PMD 0
[  509.012616] Oops: 0000 [#1] SMP PTI
[  509.016568] CPU: 24 PID: 3503 Comm: schbench Tainted: G          I
     5.0.0-rc8-4
[  509.027918] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.92
[  509.039475] RIP: 0010:rb_insert_color+0x17/0x190
[  509.044707] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 174
[  509.065765] RSP: 0000:ffffc90009203c08 EFLAGS: 00010046
[  509.071671] RAX: 0000000000000000 RBX: ffff889806f91e00 RCX:
ffff889806f91e00
[  509.079715] RDX: ffff889806f83f48 RSI: ffff88980f2238c8 RDI:
ffff889806f92148
[  509.087752] RBP: ffff88980f222cc0 R08: 000000000000026e R09:
ffff88980a099000
[  509.095789] R10: 0000000000000078 R11: ffff88980a099b58 R12:
0000000000000004
[  509.103833] R13: ffffc90009203c68 R14: 0000000000000046 R15:
0000000000022cc0
[  509.111860] FS:  00007f854e7fc700(0000) GS:ffff88980f200000(0000)
knlGS:000000000000
[  509.120957] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  509.127443] CR2: 0000000000000008 CR3: 0000001807b64005 CR4:
00000000000606e0
[  509.135478] Call Trace:
[  509.138285]  enqueue_task+0x6f/0xe0
[  509.142278]  ttwu_do_activate+0x49/0x80
[  509.146654]  try_to_wake_up+0x1dc/0x4c0
[  509.151038]  ? __probe_kernel_read+0x3a/0x70
[  509.155909]  signal_wake_up_state+0x15/0x30
[  509.160683]  zap_process+0x90/0xd0
[  509.164573]  do_coredump+0xdba/0xef0
[  509.168679]  ? _raw_spin_lock+0x1b/0x20
[  509.173045]  ? try_to_wake_up+0x120/0x4c0
[  509.177632]  ? pointer+0x1f9/0x2b0
[  509.181532]  ? sched_clock+0x5/0x10
[  509.185526]  ? sched_clock_cpu+0xc/0xa0
[  509.189911]  ? log_store+0x1b5/0x280
[  509.194002]  get_signal+0x12d/0x6d0
[  509.197998]  ? page_fault+0x8/0x30
[  509.201895]  do_signal+0x30/0x6c0
[  509.205686]  ? signal_wake_up_state+0x15/0x30
[  509.210643]  ? __send_signal+0x306/0x4a0
[  509.215114]  ? show_opcodes+0x93/0xa0
[  509.219286]  ? force_sig_info+0xc7/0xe0
[  509.223653]  ? page_fault+0x8/0x30
[  509.227544]  exit_to_usermode_loop+0x77/0xe0
[  509.232415]  prepare_exit_to_usermode+0x70/0x80
[  509.237569]  retint_user+0x8/0x8
[  509.241273] RIP: 0033:0x7f854e7fbe80
[  509.245357] Code: 00 00 36 2a 0e 00 00 00 00 00 90 be 7f 4e 85 7f
00 00 4c e8 bf a10
[  509.266508] RSP: 002b:00007f854e7fbe50 EFLAGS: 00010246
[  509.272429] RAX: 0000000000000000 RBX: 00000000002dc6c0 RCX:
0000000000000000
[  509.280500] RDX: 00000000000e2a36 RSI: 00007f854e7fbe50 RDI:
0000000000000000
[  509.288563] RBP: 00007f855020a170 R08: 000000005c764199 R09:
00007ffea1bfb0a0
[  509.296624] R10: 00007f854e7fbe30 R11: 000000000002457c R12:
00007f854e7fbed0
[  509.304685] R13: 00007f855e555e6f R14: 0000000000000000 R15:
00007f855020a150
[  509.312738] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nai
[  509.398325] CR2: 0000000000000008
[  509.402116] ---[ end trace f1214a54c044bdb6 ]---
[  509.402118] BUG: unable to handle kernel NULL pointer dereference
at 000000000000008
[  509.402122] #PF error: [normal kernel read fault]
[  509.412727] RIP: 0010:rb_insert_color+0x17/0x190
[  509.416649] PGD 8000001807b7d067 P4D 8000001807b7d067 PUD
18071c9067 PMD 0
[  509.421990] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 174
[  509.427230] Oops: 0000 [#2] SMP PTI
[  509.435096] RSP: 0000:ffffc90009203c08 EFLAGS: 00010046
[  509.456243] CPU: 2 PID: 3498 Comm: schbench Tainted: G      D   I
    5.0.0-rc8-04
[  509.460222] RAX: 0000000000000000 RBX: ffff889806f91e00 RCX:
ffff889806f91e00
[  509.460224] RDX: ffff889806f83f48 RSI: ffff88980f2238c8 RDI:
ffff889806f92148
[  509.466152] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.92
[  509.466159] RIP: 0010:task_tick_fair+0xb3/0x290
[  509.477458] RBP: ffff88980f222cc0 R08: 000000000000026e R09:
ffff88980a099000
[  509.477461] R10: 0000000000000078 R11: ffff88980a099b58 R12:
0000000000000004
[  509.485521] Code: 2b 53 60 48 39 d0 0f 82 a0 00 00 00 8b 0d 29 ab
19 01 48 39 ca 728
[  509.485523] RSP: 0000:ffff888c0f083e60 EFLAGS: 00010046
[  509.493583] R13: ffffc90009203c68 R14: 0000000000000046 R15:
0000000000022cc0
[  509.493586] FS:  00007f854e7fc700(0000) GS:ffff88980f200000(0000)
knlGS:000000000000
[  509.505170] RAX: 0000000000b71aff RBX: ffff888be4df3800 RCX:
0000000000000000
[  509.505173] RDX: 00000525112fc50e RSI: 0000000000000000 RDI:
0000000000000000
[  509.510318] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  509.510320] CR2: 0000000000000008 CR3: 0000001807b64005 CR4:
00000000000606e0
[  509.518381] RBP: ffff888c0f0a2d40 R08: 0000000001ffffff R09:
0000000000000040
[  509.518383] R10: ffff888c0f083e20 R11: 0000000000405f09 R12:
0000000000000000
[  509.617516] R13: ffff889806f81e00 R14: ffff888c0f0a2cc0 R15:
0000000000000000
[  509.625586] FS:  00007f854ffff700(0000) GS:ffff888c0f080000(0000)
knlGS:000000000000
[  509.634742] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  509.641245] CR2: 0000000000000058 CR3: 0000001807b64002 CR4:
00000000000606e0
[  509.649313] Call Trace:
[  509.652131]  <IRQ>
[  509.654462]  ? tick_sched_do_timer+0x60/0x60
[  509.659315]  scheduler_tick+0x84/0x120
[  509.663584]  update_process_times+0x40/0x50
[  509.668345]  tick_sched_handle+0x21/0x70
[  509.672814]  tick_sched_timer+0x37/0x70
[  509.677204]  __hrtimer_run_queues+0x108/0x290
[  509.682163]  hrtimer_interrupt+0xe5/0x240
[  509.686732]  smp_apic_timer_interrupt+0x6a/0x130
[  509.691989]  apic_timer_interrupt+0xf/0x20
[  509.696659]  </IRQ>
[  509.699079] RIP: 0033:0x7ffea1bfe6ac
[  509.703160] Code: 2d 81 e9 ff ff 4c 8b 05 82 e9 ff ff 0f 01 f9 66
90 41 8b 0c 24 39f
[  509.724301] RSP: 002b:00007f854fffedf0 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff13
[  509.732872] RAX: 0000000077e4a044 RBX: 00007f854fffee50 RCX:
0000000000000002
[  509.740941] RDX: 0000000000000166 RSI: 00007f854fffee50 RDI:
0000000000000000
[  509.749001] RBP: 00007f854fffee10 R08: 0000000000000000 R09:
00007ffea1bfb0a0
[  509.757061] R10: 00007ffea1bfb080 R11: 000000000002457e R12:
0000000000000000
[  509.765121] R13: 00007f855e555e6f R14: 0000000000000000 R15:
00007f85500008c0
[  509.773182] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nai
[  509.858758] CR2: 0000000000000058
[  509.862581] ---[ end trace f1214a54c044bdb7 ]---
[  509.862583] BUG: unable to handle kernel NULL pointer dereference
at 000000000000008
[  509.862585] #PF error: [normal kernel read fault]
[  509.873332] RIP: 0010:rb_insert_color+0x17/0x190
[  509.877246] PGD 8000001807b7d067 P4D 8000001807b7d067 PUD
18071c9067 PMD 0
[  509.882592] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 174
[  509.887828] Oops: 0000 [#3] SMP PTI
[  509.895684] RSP: 0000:ffffc90009203c08 EFLAGS: 00010046
[  509.916828] CPU: 26 PID: 3506 Comm: schbench Tainted: G      D   I
     5.0.0-rc8-4
[  509.920802] RAX: 0000000000000000 RBX: ffff889806f91e00 RCX:
ffff889806f91e00
[  509.920804] RDX: ffff889806f83f48 RSI: ffff88980f2238c8 RDI:
ffff889806f92148
[  509.926726] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.92
[  509.926731] RIP: 0010:task_tick_fair+0xb3/0x290
[  509.938120] RBP: ffff88980f222cc0 R08: 000000000000026e R09:
ffff88980a099000
[  509.938122] R10: 0000000000000078 R11: ffff88980a099b58 R12:
0000000000000004
[  509.946183] Code: 2b 53 60 48 39 d0 0f 82 a0 00 00 00 8b 0d 29 ab
19 01 48 39 ca 728
[  509.946186] RSP: 0000:ffff88980f283e60 EFLAGS: 00010046
[  509.954245] R13: ffffc90009203c68 R14: 0000000000000046 R15:
0000000000022cc0
[  509.954248] FS:  00007f854ffff700(0000) GS:ffff888c0f080000(0000)
knlGS:000000000000
[  509.965836] RAX: 0000000000b71aff RBX: ffff888be4df3800 RCX:
0000000000000000
[  509.965839] RDX: 00000525112fc50e RSI: 0000000000000000 RDI:
0000000000000000
[  509.970981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  509.970983] CR2: 0000000000000058 CR3: 0000001807b64002 CR4:
00000000000606e0
[  509.979043] RBP: ffff888c0f0a2d40 R08: 0000000001ffffff R09:
0000000000000040
[  509.979045] R10: ffff88980f283e68 R11: 0000000000000000 R12:
0000000000000000
[  509.987095] Kernel panic - not syncing: Fatal exception in
interrupt
[  510.008237] R13: ffff889807f91e00 R14: ffff88980f2a2cc0 R15:
0000000000000000
[  510.008240] FS:  00007f8547fff700(0000) GS:ffff88980f280000(0000)
knlGS:000000000000
[  510.102589] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  510.109103] CR2: 0000000000000058 CR3: 0000001807b64003 CR4:
00000000000606e0
[  510.117164] Call Trace:
[  510.119977]  <IRQ>
[  510.122316]  ? tick_sched_do_timer+0x60/0x60
[  510.127168]  scheduler_tick+0x84/0x120
[  510.131445]  update_process_times+0x40/0x50
[  510.136203]  tick_sched_handle+0x21/0x70
[  510.140672]  tick_sched_timer+0x37/0x70
[  510.145040]  __hrtimer_run_queues+0x108/0x290
[  510.149990]  hrtimer_interrupt+0xe5/0x240
[  510.154554]  smp_apic_timer_interrupt+0x6a/0x130
[  510.159796]  apic_timer_interrupt+0xf/0x20
[  510.164454]  </IRQ>
[  510.166882] RIP: 0033:0x7ffea1bfe6ac
[  510.170958] Code: 2d 81 e9 ff ff 4c 8b 05 82 e9 ff ff 0f 01 f9 66
90 41 8b 0c 24 39f
[  510.192101] RSP: 002b:00007f8547ffedf0 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff13
[  510.200675] RAX: 0000000078890657 RBX: 00007f8547ffee50 RCX:
000000000000101a
[  510.208736] RDX: 0000000000000166 RSI: 00007f8547ffee50 RDI:
0000000000000000
[  510.216799] RBP: 00007f8547ffee10 R08: 0000000000000000 R09:
00007ffea1bfb0a0
[  510.224861] R10: 00007ffea1bfb080 R11: 000000000002457e R12:
0000000000000000
[  510.234319] R13: 00007f855ed56e6f R14: 0000000000000000 R15:
00007f855830ed98
[  510.242371] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nai
[  510.327929] CR2: 0000000000000058
[  510.331720] ---[ end trace f1214a54c044bdb8 ]---
[  510.342658] RIP: 0010:rb_insert_color+0x17/0x190
[  510.347900] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 174
[  510.369044] RSP: 0000:ffffc90009203c08 EFLAGS: 00010046
[  510.374968] RAX: 0000000000000000 RBX: ffff889806f91e00 RCX:
ffff889806f91e00
[  510.383031] RDX: ffff889806f83f48 RSI: ffff88980f2238c8 RDI:
ffff889806f92148
[  510.391093] RBP: ffff88980f222cc0 R08: 000000000000026e R09:
ffff88980a099000
[  510.399154] R10: 0000000000000078 R11: ffff88980a099b58 R12:
0000000000000004
[  510.407214] R13: ffffc90009203c68 R14: 0000000000000046 R15:
0000000000022cc0
[  510.415278] FS:  00007f8547fff700(0000) GS:ffff88980f280000(0000)
knlGS:000000000000
[  510.424434] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  510.430939] CR2: 0000000000000058 CR3: 0000001807b64003 CR4:
00000000000606e0
[  511.068880] Shutting down cpus with NMI
[  511.075437] Kernel Offset: disabled
[  511.083621] ---[ end Kernel panic - not syncing: Fatal exception in
interrupt ]---

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
                   ` (17 preceding siblings ...)
  2019-02-19 22:07 ` Greg Kerr
@ 2019-03-01  2:54 ` Subhra Mazumdar
  2019-03-14 15:28 ` Julien Desfossez
  19 siblings, 0 replies; 99+ messages in thread
From: Subhra Mazumdar @ 2019-03-01  2:54 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel


On 2/18/19 8:56 AM, Peter Zijlstra wrote:
> A much 'demanded' feature: core-scheduling :-(
>
> I still hate it with a passion, and that is part of why it took a little
> longer than 'promised'.
>
> While this one doesn't have all the 'features' of the previous (never
> published) version and isn't L1TF 'complete', I tend to like the structure
> better (relatively speaking: I hate it slightly less).
>
> This one is sched class agnostic and therefore, in principle, doesn't horribly
> wreck RT (in fact, RT could 'ab'use this by setting 'task->core_cookie = task'
> to force-idle siblings).
>
> Now, as hinted by that, there are semi sane reasons for actually having this.
> Various hardware features like Intel RDT - Memory Bandwidth Allocation, work
> per core (due to SMT fundamentally sharing caches) and therefore grouping
> related tasks on a core makes it more reliable.
>
> However; whichever way around you turn this cookie; it is expensive and nasty.
>
I am seeing the following hard lockup frequently now. The following is the
full kernel output:

[ 5846.412296] drop_caches (8657): drop_caches: 3
[ 5846.624823] drop_caches (8658): drop_caches: 3
[ 5850.604641] hugetlbfs: oracle (8671): Using mlock ulimits for SHM_HUGETLB is deprecated
[ 5962.930812] NMI watchdog: Watchdog detected hard LOCKUP on cpu 32
[ 5962.930814] Modules linked in: drbd lru_cache autofs4 cpufreq_powersave
ipv6 crc_ccitt mxm_wmi iTCO_wdt iTCO_vendor_support btrfs raid6_pq
zstd_compress zstd_decompress xor pcspkr i2c_i801 lpc_ich mfd_core ioatdma
ixgbe dca mdio sg ipmi_ssif i2c_core ipmi_si ipmi_msghandler wmi
pcc_cpufreq acpi_pad ext4 fscrypto jbd2 mbcache sd_mod ahci libahci nvme
nvme_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[ 5962.930828] CPU: 32 PID: 10333 Comm: oracle_10333_tp Not tainted
5.0.0-rc7core_sched #1
[ 5962.930828] Hardware name: Oracle Corporation ORACLE SERVER
X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5962.930829] RIP: 0010:try_to_wake_up+0x98/0x470
[ 5962.930830] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 44 00 00 8b 43
3c 8b 73 60 85 f6 0f 85 a6 01 00 00 8b 43 38 85 c0 74 09 f3 90 8b 43 38
<85> c0 75 f7 48 8b 43 10 a8 02 b8 00 00 00 00 0f 85 d5 01 00 00 0f
[ 5962.930831] RSP: 0018:ffffc9000f4dbcb8 EFLAGS: 00000002
[ 5962.930832] RAX: 0000000000000001 RBX: ffff88dfb4af1680 RCX:
0000000000000041
[ 5962.930832] RDX: 0000000000000001 RSI: 0000000000000000 RDI:
ffff88dfb4af214c
[ 5962.930833] RBP: 0000000000000000 R08: 0000000000000001 R09:
ffffc9000f4dbd80
[ 5962.930833] R10: ffff888000000000 R11: ffffea00f0003d80 R12:
ffff88dfb4af214c
[ 5962.930834] R13: 0000000000000001 R14: 0000000000000046 R15:
0000000000000001
[ 5962.930834] FS:  00007ff4fabd9ae0(0000) GS:ffff88dfbe280000(0000)
knlGS:0000000000000000
[ 5962.930834] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5962.930835] CR2: 0000000f4cc84000 CR3: 0000003b93d36002 CR4:
00000000003606e0
[ 5962.930835] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 5962.930836] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 5962.930836] Call Trace:
[ 5962.930837]  ? __switch_to_asm+0x34/0x70
[ 5962.930837]  ? __switch_to_asm+0x40/0x70
[ 5962.930838]  ? __switch_to_asm+0x34/0x70
[ 5962.930838]  autoremove_wake_function+0x11/0x50
[ 5962.930838]  __wake_up_common+0x8f/0x160
[ 5962.930839]  ? __switch_to_asm+0x40/0x70
[ 5962.930839]  __wake_up_common_lock+0x7c/0xc0
[ 5962.930840]  pipe_write+0x24e/0x3f0
[ 5962.930840]  __vfs_write+0x127/0x1b0
[ 5962.930840]  vfs_write+0xb3/0x1b0
[ 5962.930841]  ksys_write+0x52/0xc0
[ 5962.930841]  do_syscall_64+0x5b/0x170
[ 5962.930842]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 5962.930842] RIP: 0033:0x3b5900e7b0
[ 5962.930843] Code: 97 20 00 31 d2 48 29 c2 64 89 11 48 83 c8 ff eb ea 90
90 90 90 90 90 90 90 90 83 3d f1 db 20 00 00 75 10 b8 01 00 00 00 0f 05
<48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e fa ff ff 48 89 04 24
[ 5962.930843] RSP: 002b:00007ffedbcd93a8 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[ 5962.930844] RAX: ffffffffffffffda RBX: 00007ff4faa86e24 RCX:
0000003b5900e7b0
[ 5962.930845] RDX: 000000000000028f RSI: 00007ff4faa9688e RDI:
000000000000000a
[ 5962.930845] RBP: 00007ffedbcd93c0 R08: 00007ffedbcd9458 R09:
0000000000000020
[ 5962.930846] R10: 0000000000000000 R11: 0000000000000246 R12:
00007ffedbcd9458
[ 5962.930847] R13: 00007ff4faa9688e R14: 00007ff4faa89cc8 R15:
00007ff4faa86bd0
[ 5962.930847] Kernel panic - not syncing: Hard LOCKUP
[ 5962.930848] CPU: 32 PID: 10333 Comm: oracle_10333_tp Not tainted
5.0.0-rc7core_sched #1
[ 5962.930848] Hardware name: Oracle Corporation ORACLE
SERVER X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5962.930849] Call Trace:
[ 5962.930849]  <NMI>
[ 5962.930849]  dump_stack+0x5c/0x7b
[ 5962.930850]  panic+0xfe/0x2b2
[ 5962.930850]  nmi_panic+0x35/0x40
[ 5962.930851]  watchdog_overflow_callback+0xef/0x100
[ 5962.930851]  __perf_event_overflow+0x5a/0xe0
[ 5962.930852]  handle_pmi_common+0x1d1/0x280
[ 5962.930852]  ? __set_pte_vaddr+0x32/0x50
[ 5962.930852]  ? __set_pte_vaddr+0x32/0x50
[ 5962.930853]  ? set_pte_vaddr+0x3c/0x60
[ 5962.930853]  ? intel_pmu_handle_irq+0xad/0x170
[ 5962.930854]  intel_pmu_handle_irq+0xad/0x170
[ 5962.930854]  perf_event_nmi_handler+0x2e/0x50
[ 5962.930854]  nmi_handle+0x6f/0x120
[ 5962.930855]  default_do_nmi+0xee/0x110
[ 5962.930855]  do_nmi+0xe5/0x130
[ 5962.930856]  end_repeat_nmi+0x16/0x50
[ 5962.930856] RIP: 0010:try_to_wake_up+0x98/0x470
[ 5962.930857] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 44 00 00 8b 43
3c 8b 73 60 85 f6 0f 85 a6 01 00 00 8b 43 38 85 c0 74 09 f3 90 8b 43 38
<85> c0 75 f7 48 8b 43 10 a8 02 b8 00 00 00 00 0f 85 d5 01 00 00 0f
[ 5962.930857] RSP: 0018:ffffc9000f4dbcb8 EFLAGS: 00000002
[ 5962.930858] RAX: 0000000000000001 RBX: ffff88dfb4af1680 RCX:
0000000000000041
[ 5962.930859] RDX: 0000000000000001 RSI: 0000000000000000 RDI:
ffff88dfb4af214c
[ 5962.930859] RBP: 0000000000000000 R08: 0000000000000001 R09:
ffffc9000f4dbd80
[ 5962.930859] R10: ffff888000000000 R11: ffffea00f0003d80 R12:
ffff88dfb4af214c
[ 5962.930860] R13: 0000000000000001 R14: 0000000000000046 R15:
0000000000000001
[ 5962.930860]  ? try_to_wake_up+0x98/0x470
[ 5962.930861]  ? try_to_wake_up+0x98/0x470
[ 5962.930861]  </NMI>
[ 5962.930862]  ? __switch_to_asm+0x34/0x70
[ 5962.930862]  ? __switch_to_asm+0x40/0x70
[ 5962.930862]  ? __switch_to_asm+0x34/0x70
[ 5962.930863]  autoremove_wake_function+0x11/0x50
[ 5962.930863]  __wake_up_common+0x8f/0x160
[ 5962.930864]  ? __switch_to_asm+0x40/0x70
[ 5962.930864]  __wake_up_common_lock+0x7c/0xc0
[ 5962.930864]  pipe_write+0x24e/0x3f0
[ 5962.930865]  __vfs_write+0x127/0x1b0
[ 5962.930865]  vfs_write+0xb3/0x1b0
[ 5962.930866]  ksys_write+0x52/0xc0
[ 5962.930866]  do_syscall_64+0x5b/0x170
[ 5962.930866]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 5962.930867] RIP: 0033:0x3b5900e7b0
[ 5962.930868] Code: 97 20 00 31 d2 48 29 c2 64 89 11 48 83 c8 ff eb ea 90
90 90 90 90 90 90 90 90 83 3d f1 db 20 00 00 75 10 b8 01 00 00 00 0f 05
<48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e fa ff ff 48 89 04 24
[ 5962.930868] RSP: 002b:00007ffedbcd93a8 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[ 5962.930869] RAX: ffffffffffffffda RBX: 00007ff4faa86e24 RCX:
0000003b5900e7b0
[ 5962.930869] RDX: 000000000000028f RSI: 00007ff4faa9688e RDI:
000000000000000a
[ 5962.930870] RBP: 00007ffedbcd93c0 R08: 00007ffedbcd9458 R09:
0000000000000020
[ 5962.930870] R10: 0000000000000000 R11: 0000000000000246 R12:
00007ffedbcd9458
[ 5962.930871] R13: 00007ff4faa9688e R14: 00007ff4faa89cc8 R15:
00007ff4faa86bd0
[ 5963.987766] NMI watchdog: Watchdog detected hard LOCKUP on cpu 11
[ 5963.987767] Modules linked in: drbd lru_cache autofs4 cpufreq_powersave
ipv6 crc_ccitt mxm_wmi iTCO_wdt iTCO_vendor_support btrfs raid6_pq
zstd_compress zstd_decompress xor pcspkr i2c_i801 lpc_ich mfd_core ioatdma
ixgbe dca mdio sg ipmi_ssif i2c_core ipmi_si ipmi_msghandler wmi
pcc_cpufreq acpi_pad ext4 fscrypto jbd2 mbcache sd_mod ahci libahci nvme
nvme_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[ 5963.987775] CPU: 11 PID: 8805 Comm: ora_lg02_tpcc1 Not tainted
5.0.0-rc7core_sched #1
[ 5963.987775] Hardware name: Oracle Corporation ORACLE SERVER
X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5963.987776] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1e0
[ 5963.987777] Code: 48 c1 ee 0b 83 e8 01 83 e6 60 48 98 48 81 c6 00 3a 02
00 48 03 34 c5 20 98 13 82 48 89 16 8b 42 08 85 c0 75 09 f3 90 8b 42 08
<85> c0 74 f7 48 8b 32 48 85 f6 74 07 0f 0d 0e eb 02 f3 90 8b 07 66
[ 5963.987777] RSP: 0018:ffffc90023003760 EFLAGS: 00000046
[ 5963.987778] RAX: 0000000000000000 RBX: 0000000000000002 RCX:
0000000000300000
[ 5963.987778] RDX: ffff88afbf2e3a00 RSI: ffff88dfbeae3a00 RDI:
ffff88dfbe1a2d40
[ 5963.987779] RBP: ffff88dfbe1a2d40 R08: 0000000000300000 R09:
00000fffffc00000
[ 5963.987779] R10: ffffc90023003778 R11: ffff88afb77b3340 R12:
000000000000001c
[ 5963.987779] R13: 0000000000022d40 R14: 0000000000000000 R15:
000000000000001c
[ 5963.987780] FS:  00007f4e14e73ae0(0000) GS:ffff88afbf2c0000(0000)
knlGS:0000000000000000
[ 5963.987780] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5963.987780] CR2: 00007fe503647850 CR3: 0000000d1b1ae002 CR4:
00000000003606e0
[ 5963.987781] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 5963.987781] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 5963.987781] Call Trace:
[ 5963.987781]  _raw_spin_lock_irqsave+0x39/0x40
[ 5963.987782]  update_blocked_averages+0x32/0x610
[ 5963.987782]  update_nohz_stats+0x4d/0x60
[ 5963.987782]  update_sd_lb_stats+0x2e5/0x7d0
[ 5963.987783]  find_busiest_group+0x3e/0x5b0
[ 5963.987783]  load_balance+0x18c/0xc00
[ 5963.987783]  newidle_balance+0x278/0x490
[ 5963.987783]  __schedule+0xd16/0x1060
[ 5963.987784]  ? lock_timer_base+0x66/0x80
[ 5963.987784]  schedule+0x32/0x70
[ 5963.987784]  schedule_timeout+0x16d/0x360
[ 5963.987785]  ? __next_timer_interrupt+0xc0/0xc0
[ 5963.987785]  do_semtimedop+0x966/0x1180
[ 5963.987785]  ? xas_load+0x9/0x80
[ 5963.987786]  ? find_get_entry+0x5d/0x1e0
[ 5963.987786]  ? pagecache_get_page+0x1b4/0x2d0
[ 5963.987786]  ? __vfs_getxattr+0x2a/0x70
[ 5963.987786]  ? enqueue_task_rt+0x98/0xb0
[ 5963.987787]  ? check_preempt_curr+0x50/0x90
[ 5963.987787]  ? push_rt_tasks+0x20/0x20
[ 5963.987787]  ? ttwu_do_wakeup+0x5e/0x160
[ 5963.987788]  ? try_to_wake_up+0x54/0x470
[ 5963.987788]  ? wake_up_q+0x2d/0x70
[ 5963.987788]  ? semctl_setval+0x26d/0x400
[ 5963.987788]  ? ksys_semtimedop+0x52/0x80
[ 5963.987789]  ksys_semtimedop+0x52/0x80
[ 5963.987789]  do_syscall_64+0x5b/0x170
[ 5963.987789]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 5963.987789] RIP: 0033:0x3b58ceb28a
[ 5963.987790] Code: 73 01 c3 48 8b 0d 3e 2d 2a 00 31 d2 48 29 c2 64 89 11
48 83 c8 ff eb ea 90 90 90 90 90 90 90 90 49 89 ca b8 dc 00 00 00 0f 05
<48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0e 2d 2a 00 31 d2 48 29 c2 64
[ 5963.987790] RSP: 002b:00007fff79282ed8 EFLAGS: 00000206 ORIG_RAX:
00000000000000dc
[ 5963.987791] RAX: ffffffffffffffda RBX: ffffffffffd23940 RCX:
0000003b58ceb28a
[ 5963.987791] RDX: 0000000000000001 RSI: 00007fff792830b8 RDI:
0000000000058002
[ 5963.987791] RBP: 00007fff792830e0 R08: 0000000000000000 R09:
0000000171327788
[ 5963.987792] R10: 00007fff79283068 R11: 0000000000000206 R12:
00007fff792833c8
[ 5963.987792] R13: 0000000000058002 R14: 0000000168dc0770 R15:
0000000000000000
[ 5963.987796] Shutting down cpus with NMI
[ 5963.987796] Kernel Offset: disabled
[ 5963.987797] NMI watchdog: Watchdog detected hard LOCKUP on cpu 33
[ 5963.987797] Modules linked in: drbd lru_cache autofs4 cpufreq_powersave
ipv6 crc_ccitt mxm_wmi iTCO_wdt iTCO_vendor_support btrfs raid6_pq
zstd_compress zstd_decompress xor pcspkr i2c_i801 lpc_ich mfd_core ioatdma
ixgbe dca mdio sg ipmi_ssif i2c_core ipmi_si ipmi_msghandler wmi
pcc_cpufreq acpi_pad ext4 fscrypto jbd2 mbcache sd_mod ahci libahci nvme
nvme_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[ 5963.987805] CPU: 33 PID: 10303 Comm: oracle_10303_tp Not tainted
5.0.0-rc7core_sched #1
[ 5963.987806] Hardware name: Oracle Corporation ORACLE
SERVER X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5963.987806] RIP: 0010:native_queued_spin_lock_slowpath+0x180/0x1e0
[ 5963.987807] Code: c1 e8 12 48 c1 ee 0b 83 e8 01 83 e6 60 48 98 48 81
c6 00 3a 02 00 48 03 34 c5 20 98 13 82 48 89 16 8b 42 08 85 c0 75 09 f3 90
<8b> 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 07 0f 0d 0e eb 02 f3 90
[ 5963.987807] RSP: 0018:ffffc90024833980 EFLAGS: 00000046
[ 5963.987808] RAX: 0000000000000000 RBX: 0000000000000002 RCX:
0000000000880000
[ 5963.987808] RDX: ffff88dfbe2e3a00 RSI: ffff88dfbe763a00 RDI:
ffff88dfbe3e2d40
[ 5963.987809] RBP: ffff88dfbe3e2d40 R08: 0000000000880000 R09:
0000002000000000
[ 5963.987809] R10: 0000000000000004 R11: ffff88dfb6ffd2c0 R12:
0000000000000025
[ 5963.987809] R13: 0000000000022d40 R14: 0000000000000000 R15:
0000000000000025
[ 5963.987810] FS:  00007f0b7e5feae0(0000) GS:ffff88dfbe2c0000(0000)
knlGS:0000000000000000
[ 5963.987810] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5963.987810] CR2: 000000007564d0e7 CR3: 0000003debf6a001 CR4:
00000000003606e0
[ 5963.987811] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 5963.987811] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 5963.987811] Call Trace:
[ 5963.987812]  _raw_spin_lock_irqsave+0x39/0x40
[ 5963.987812]  update_blocked_averages+0x32/0x610
[ 5963.987812]  update_nohz_stats+0x4d/0x60
[ 5963.987812]  update_sd_lb_stats+0x2e5/0x7d0
[ 5963.987813]  find_busiest_group+0x3e/0x5b0
[ 5963.987813]  load_balance+0x18c/0xc00
[ 5963.987813]  ? __switch_to_asm+0x40/0x70
[ 5963.987813]  ? __switch_to_asm+0x34/0x70
[ 5963.987814]  newidle_balance+0x278/0x490
[ 5963.987814]  __schedule+0xd16/0x1060
[ 5963.987814]  ? enqueue_hrtimer+0x3a/0x90
[ 5963.987814]  schedule+0x32/0x70
[ 5963.987815]  do_nanosleep+0x81/0x180
[ 5963.987815]  hrtimer_nanosleep+0xce/0x1f0
[ 5963.987815]  ? __hrtimer_init+0xb0/0xb0
[ 5963.987816]  __x64_sys_nanosleep+0x8d/0xa0
[ 5963.987816]  do_syscall_64+0x5b/0x170
[ 5963.987816]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 5963.987816] RIP: 0033:0x3b5900eff0
[ 5963.987817] Code: 73 01 c3 48 8b 0d b8 8f 20 00 31 d2 48 29 c2 64 89 11
48 83 c8 ff eb ea 90 90 83 3d b1 d3 20 00 00 75 10 b8 23 00 00 00 0f 05
<48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 1e f2 ff ff 48 89 04 24
[ 5963.987817] RSP: 002b:00007ffe359b5158 EFLAGS: 00000246 ORIG_RAX:
0000000000000023
[ 5963.987818] RAX: ffffffffffffffda RBX: 0000000000169d10 RCX:
0000003b5900eff0
[ 5963.987818] RDX: 0000000000000000 RSI: 00007ffe359b5170 RDI:
00007ffe359b5160
[ 5963.987818] RBP: 00007ffe359b51c0 R08: 0000000000000000 R09:
0000000000000000
[ 5963.987819] R10: 00000001512241f8 R11: 0000000000000246 R12:
0000000000000000
[ 5963.987819] R13: 00007ffe359b5280 R14: 00000000000005ca R15:
0000000000000000
[ 5963.987827] NMI watchdog: Watchdog detected hard LOCKUP on cpu 75
[ 5963.987828] Modules linked in: drbd lru_cache autofs4 cpufreq_powersave
ipv6 crc_ccitt mxm_wmi iTCO_wdt iTCO_vendor_support btrfs raid6_pq
zstd_compress zstd_decompress xor pcspkr i2c_i801 lpc_ich mfd_core ioatdma
ixgbe dca mdio sg ipmi_ssif i2c_core ipmi_si ipmi_msghandler wmi
pcc_cpufreq acpi_pad ext4 fscrypto jbd2 mbcache sd_mod ahci libahci nvme
nvme_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[ 5963.987835] CPU: 75 PID: 0 Comm: swapper/75 Not tainted
5.0.0-rc7core_sched #1
[ 5963.987836] Hardware name: Oracle Corporation
ORACLE SERVER X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5963.987836] RIP: 0010:native_queued_spin_lock_slowpath+0x5e/0x1e0
[ 5963.987837] Code: ff 75 40 f0 0f ba 2f 08 0f 82 e7 00 00 00 8b 07 30 e4
09 c6 f7 c6 00 ff ff ff 75 1b 85 f6 74 0e 8b 07 84 c0 74 08 f3 90 8b 07
<84> c0 75 f8 b8 01 00 00 00 66 89 07 c3 81 e6 00 ff 00 00 75 04 c6
[ 5963.987837] RSP: 0000:ffffc9000c77bd78 EFLAGS: 00000002
[ 5963.987838] RAX: 0000000001240101 RBX: 000000000000004b RCX:
0000000000000001
[ 5963.987838] RDX: ffff88dfbe522d40 RSI: 0000000000000001 RDI:
ffff88dfbe262d40
[ 5963.987838] RBP: 0000000000022d40 R08: 0000000000000000 R09:
0000000000000001
[ 5963.987839] R10: 0000000000000001 R11: 0000000000000001 R12:
ffff88dfb4e40000
[ 5963.987839] R13: ffff88dfbe7e2d40 R14: 000000000000002a R15:
ffff88dfbe522d40
[ 5963.987840] FS:  0000000000000000(0000) GS:ffff88dfbe7c0000(0000)
knlGS:0000000000000000
[ 5963.987840] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5963.987840] CR2: 0000000f4b348000 CR3: 0000003c45ed6001 CR4:
00000000003606e0
[ 5963.987841] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 5963.987841] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 5963.987841] Call Trace:
[ 5963.987842]  _raw_spin_lock+0x24/0x30
[ 5963.987842]  sched_core_balance+0x15c/0x4f0
[ 5963.987842]  __balance_callback+0x49/0xa0
[ 5963.987843]  __schedule+0xdc0/0x1060
[ 5963.987843]  schedule_idle+0x28/0x40
[ 5963.987843]  do_idle+0x164/0x260
[ 5963.987843]  cpu_startup_entry+0x19/0x20
[ 5963.987844]  start_secondary+0x17d/0x1d0
[ 5963.987844]  secondary_startup_64+0xa4/0xb0
[ 5963.987845] NMI watchdog: Watchdog detected hard LOCKUP on cpu 81
[ 5963.987845] Modules linked in: drbd lru_cache autofs4 cpufreq_powersave
ipv6 crc_ccitt mxm_wmi iTCO_wdt iTCO_vendor_support btrfs raid6_pq
zstd_compress zstd_decompress xor pcspkr i2c_i801 lpc_ich mfd_core ioatdma
ixgbe dca mdio sg ipmi_ssif i2c_core ipmi_si ipmi_msghandler wmi
pcc_cpufreq acpi_pad ext4 fscrypto jbd2 mbcache sd_mod ahci libahci nvme
nvme_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[ 5963.987853] CPU: 81 PID: 0 Comm: swapper/81 Not tainted
5.0.0-rc7core_sched #1
[ 5963.987854] Hardware name: Oracle Corporation ORACLE SERVER
X6-2L/ASM,MOBO TRAY,2U, BIOS 39050100 08/30/2016
[ 5963.987854] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1e0
[ 5963.987854] Code: 48 c1 ee 0b 83 e8 01 83 e6 60 48 98 48 81 c6 00 3a 02
00 48 03 34 c5 20 98 13 82 48 89 16 8b 42 08 85 c0 75 09 f3 90 8b 42 08
<85> c0 74 f7 48 8b 32 48 85 f6 74 07 0f 0d 0e eb 02 f3 90 8b 07 66
[ 5963.987855] RSP: 0000:ffffc9000c7abd78 EFLAGS: 00000046
[ 5963.987855] RAX: 0000000000000000 RBX: 0000000000000051 RCX:
0000000001480000
[ 5963.987856] RDX: ffff88dfbe963a00 RSI: ffff88dfbe523a00 RDI:
ffff88dfbe522d40
[ 5963.987856] RBP: 0000000000022d40 R08: 0000000001480000 R09:
0000000000000001
[ 5963.987856] R10: ffff88dfb7a41680 R11: 0000000000000001 R12:
0000000000000001
[ 5963.987857] R13: ffff88dfbe962d40 R14: 0000000000000056 R15:
ffff88dfbeaa2d40
[ 5963.987857] FS:  0000000000000000(0000) GS:ffff88dfbe940000(0000)
knlGS:0000000000000000
[ 5963.987857] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5963.987857] CR2: 0000000f4a7a8000 CR3: 0000003b67e2a002 CR4:
00000000003606e0
[ 5963.987858] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 5963.987858] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 5963.987858] Call Trace:
[ 5963.987859]  _raw_spin_lock+0x24/0x30
[ 5963.987859]  sched_core_balance+0x15c/0x4f0
[ 5963.987859]  __balance_callback+0x49/0xa0
[ 5963.987859]  __schedule+0xdc0/0x1060
[ 5963.987860]  schedule_idle+0x28/0x40
[ 5963.987860]  do_idle+0x164/0x260
[ 5963.987860]  cpu_startup_entry+0x19/0x20
[ 5963.987860]  start_secondary+0x17d/0x1d0
[ 5963.987861]  secondary_startup_64+0xa4/0xb0
[ 5983.129164] ---[ end Kernel panic - not syncing: Hard LOCKUP ]---





^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-22 14:10       ` Peter Zijlstra
@ 2019-03-07 22:06         ` Paolo Bonzini
  0 siblings, 0 replies; 99+ messages in thread
From: Paolo Bonzini @ 2019-03-07 22:06 UTC (permalink / raw)
  To: Peter Zijlstra, Greg Kerr
  Cc: Greg Kerr, mingo, tglx, Paul Turner, tim.c.chen, torvalds,
	linux-kernel, subhra.mazumdar, fweisbec, keescook

On 22/02/19 15:10, Peter Zijlstra wrote:
>> I agree on not bike shedding about the API, but can we agree on some of
>> the high level properties? For example, who generates the core
>> scheduling ids, what properties about them are enforced, etc.?
> It's an opaque cookie; the scheduler really doesn't care. All it does is
> ensure that tasks match or force idle within a core.
> 
> My previous patches got the cookie from a modified
> preempt_notifier_register/unregister() which passed the vcpu->kvm
> pointer into it from vcpu_load/put.
> 
> This auto-grouped VMs. It was also found to be somewhat annoying because
> apparently KVM does a lot of userspace assist for all sorts of nonsense
> and it would leave/re-join the cookie group for every single assist.
> Causing tons of rescheduling.

KVM doesn't do _that much_ userspace exiting in practice when VMs are
properly configured (if they're not, you probably don't care about core
scheduling).

However, note that KVM needs core scheduling groups to be defined at the
thread level; one group per process is not enough.  A VM has a bunch of
I/O threads and vCPU threads, and we want to set up core scheduling like
this:

+--------------------------------------+
| VM 1   iothread1  iothread2          |
| +----------------+-----------------+ |
| | vCPU0  vCPU1   |   vCPU2  vCPU3  | |
| +----------------+-----------------+ |
+--------------------------------------+

+--------------------------------------+
| VM 2   iothread1  iothread2          |
| +----------------+-----------------+ |
| | vCPU0  vCPU1   |   vCPU2  vCPU3  | |
| +----------------+-----------------+ |
| | vCPU4  vCPU5   |   vCPU6  vCPU7  | |
| +----------------+-----------------+ |
+--------------------------------------+

where the iothreads need not be subject to core scheduling but the vCPUs
do.  If you don't place guest-sibling vCPUs in the same core scheduling
group, bad things happen.

The reason is that the guest might also be running a core scheduler, so
you could have:

- guest process 1 registering two threads A and B in the same group

- guest process 2 registering two threads C and D in the same group

- guest core scheduler placing thread A on vCPU0, thread B on vCPU1,
thread C on vCPU2, thread D on vCPU3

- host core scheduler deciding the four threads can be in physical cores
0-1, but physical core 0 gets A+C and physical core 1 gets B+D

- now process 2 shares cache with process 1. :(

Paolo

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-22 12:45   ` Mel Gorman
  2019-02-22 16:10     ` Mel Gorman
@ 2019-03-08 19:44     ` Subhra Mazumdar
  2019-03-11  4:23       ` Aubrey Li
  2019-03-26  7:32       ` Aaron Lu
  1 sibling, 2 replies; 99+ messages in thread
From: Subhra Mazumdar @ 2019-03-08 19:44 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Paul Turner, Tim Chen,
	Linux List Kernel Mailing, Linus Torvalds, Frédéric Weisbecker,
	Kees Cook, kerrnel


On 2/22/19 4:45 AM, Mel Gorman wrote:
> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
>> On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>> However; whichever way around you turn this cookie; it is expensive and nasty.
>> Do you (or anybody else) have numbers for real loads?
>>
>> Because performance is all that matters. If performance is bad, then
>> it's pointless, since just turning off SMT is the answer.
>>
> I tried to do a comparison between tip/master, ht disabled and this series
> putting test workloads into a tagged cgroup but unfortunately it failed
>
> [  156.978682] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
> [  156.986597] #PF error: [normal kernel read fault]
> [  156.991343] PGD 0 P4D 0
> [  156.993905] Oops: 0000 [#1] SMP PTI
> [  156.997438] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 5.0.0-rc7-schedcore-v1r1 #1
> [  157.005161] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
> [  157.012896] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
> [  157.018613] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
> [  157.037544] RSP: 0018:ffffc9000c5bbde8 EFLAGS: 00010086
> [  157.042819] RAX: ffff88810f5f6a00 RBX: 00000001547f175c RCX: 0000000000000001
> [  157.050015] RDX: ffff88bf3bdb0a40 RSI: 0000000000000000 RDI: 00000001547f175c
> [  157.057215] RBP: ffff88bf7fae32c0 R08: 000000000001e358 R09: ffff88810fb9f000
> [  157.064410] R10: ffffc9000c5bbe08 R11: ffff88810fb9f5c4 R12: 0000000000000000
> [  157.071611] R13: ffff88bf4e3ea0c0 R14: 0000000000000000 R15: ffff88bf4e3ea7a8
> [  157.078814] FS:  0000000000000000(0000) GS:ffff88bf7f5c0000(0000) knlGS:0000000000000000
> [  157.086977] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  157.092779] CR2: 0000000000000058 CR3: 000000000220e005 CR4: 00000000003606e0
> [  157.099979] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  157.109529] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  157.119058] Call Trace:
> [  157.123865]  pick_next_entity+0x61/0x110
> [  157.130137]  pick_task_fair+0x4b/0x90
> [  157.136124]  __schedule+0x365/0x12c0
> [  157.141985]  schedule_idle+0x1e/0x40
> [  157.147822]  do_idle+0x166/0x280
> [  157.153275]  cpu_startup_entry+0x19/0x20
> [  157.159420]  start_secondary+0x17a/0x1d0
> [  157.165568]  secondary_startup_64+0xa4/0xb0
> [  157.171985] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs msr intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crc32_pclmul ghash_clmulni_intel ixgbe aesni_intel xfrm_algo iTCO_wdt joydev iTCO_vendor_support libphy igb aes_x86_64 crypto_simd ptp cryptd mei_me mdio pps_core ioatdma glue_helper pcspkr ipmi_si lpc_ich i2c_i801 mei dca ipmi_devintf ipmi_msghandler acpi_pad pcc_cpufreq button btrfs libcrc32c xor zstd_decompress zstd_compress raid6_pq hid_generic usbhid ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci crc32c_intel ehci_pci ttm xhci_hcd ehci_hcd drm ahci usbcore mpt3sas libahci raid_class scsi_transport_sas wmi sg nbd dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
> [  157.258990] CR2: 0000000000000058
> [  157.264961] ---[ end trace a301ac5e3ee86fde ]---
> [  157.283719] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
> [  157.291967] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
> [  157.316121] RSP: 0018:ffffc9000c5bbde8 EFLAGS: 00010086
> [  157.324060] RAX: ffff88810f5f6a00 RBX: 00000001547f175c RCX: 0000000000000001
> [  157.333932] RDX: ffff88bf3bdb0a40 RSI: 0000000000000000 RDI: 00000001547f175c
> [  157.343795] RBP: ffff88bf7fae32c0 R08: 000000000001e358 R09: ffff88810fb9f000
> [  157.353634] R10: ffffc9000c5bbe08 R11: ffff88810fb9f5c4 R12: 0000000000000000
> [  157.363506] R13: ffff88bf4e3ea0c0 R14: 0000000000000000 R15: ffff88bf4e3ea7a8
> [  157.373395] FS:  0000000000000000(0000) GS:ffff88bf7f5c0000(0000) knlGS:0000000000000000
> [  157.384238] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  157.392709] CR2: 0000000000000058 CR3: 000000000220e005 CR4: 00000000003606e0
> [  157.402601] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  157.412488] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  157.422334] Kernel panic - not syncing: Attempted to kill the idle task!
> [  158.529804] Shutting down cpus with NMI
> [  158.573249] Kernel Offset: disabled
> [  158.586198] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
>
> RIP translates to kernel/sched/fair.c:6819
>
> static int
> wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> {
>          s64 gran, vdiff = curr->vruntime - se->vruntime; /* LINE 6819 */
>
>          if (vdiff <= 0)
>                  return -1;
>
>          gran = wakeup_gran(se);
>          if (vdiff > gran)
>                  return 1;
>
>          return 0;
> }
>
> I haven't tried debugging it yet.
>
I think the following fix, while trivial, is the right fix for the NULL
dereference in this case. This bug is reproducible with patch 14. I also did
some performance bisecting, and with patch 14 performance is decimated, which
is expected. Most of the performance recovery happens in patch 15 which,
unfortunately, is also the one that introduces the hard lockup.

-------8<-----------

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d0dac4..ecadf36 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4131,7 +4131,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
          * Avoid running the skip buddy, if running something else can
          * be done without getting too unfair.
          */
-       if (cfs_rq->skip == se) {
+       if (cfs_rq->skip && cfs_rq->skip == se) {
                 struct sched_entity *second;

                 if (se == curr) {
@@ -4149,13 +4149,15 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
         /*
          * Prefer last buddy, try to return the CPU to a preempted task.
          */
-       if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+       if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left)
+           < 1)
                 se = cfs_rq->last;

         /*
          * Someone really wants this to run. If it's not unfair, run it.
          */
-       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+       if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left)
+           < 1)
                 se = cfs_rq->next;

         clear_buddies(cfs_rq, se);


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-03-08 19:44     ` Subhra Mazumdar
@ 2019-03-11  4:23       ` Aubrey Li
  2019-03-11 18:34         ` Subhra Mazumdar
  2019-03-26  7:32       ` Aaron Lu
  1 sibling, 1 reply; 99+ messages in thread
From: Aubrey Li @ 2019-03-11  4:23 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Tim Chen, Linux List Kernel Mailing, Linus Torvalds,
	Frédéric Weisbecker, Kees Cook, Greg Kerr

On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
<subhra.mazumdar@oracle.com> wrote:
>
> expected. Most of the performance recovery happens in patch 15 which,
> unfortunately, is also the one that introduces the hard lockup.
>

After applying Subhra's patch, the following is triggered by enabling
core sched when a cgroup is under heavy load.

Mar 10 22:46:57 aubrey-ivb kernel: [ 2662.973792] core sched enabled
[ 2663.348371] WARNING: CPU: 5 PID: 3087 at kernel/sched/pelt.h:119
update_load_avg+00
[ 2663.357960] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_ni
[ 2663.443269] CPU: 5 PID: 3087 Comm: schbench Tainted: G          I
    5.0.0-rc8-7
[ 2663.454520] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.2
[ 2663.466063] RIP: 0010:update_load_avg+0x52/0x5e0
[ 2663.471286] Code: 8b af 70 01 00 00 8b 3d 14 a6 6e 01 85 ff 74 1c
e9 4c 04 00 00 40
[ 2663.492350] RSP: 0000:ffffc9000a6a3dd8 EFLAGS: 00010046
[ 2663.498276] RAX: 0000000000000000 RBX: ffff888be7937600 RCX: 0000000000000001
[ 2663.506337] RDX: 0000000000000000 RSI: ffff888c09fe4418 RDI: 0000000000000046
[ 2663.514398] RBP: ffff888bdfb8aac0 R08: 0000000000000000 R09: ffff888bdfb9aad8
[ 2663.522459] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 2663.530520] R13: ffff888c09fe4400 R14: 0000000000000001 R15: ffff888bdfb8aa40
[ 2663.538582] FS:  00007f006a7cc700(0000) GS:ffff888c0a600000(0000)
knlGS:00000000000
[ 2663.547739] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2663.554241] CR2: 0000000000604048 CR3: 0000000bfdd64006 CR4: 00000000000606e0
[ 2663.562310] Call Trace:
[ 2663.565128]  ? update_load_avg+0xa6/0x5e0
[ 2663.569690]  ? update_load_avg+0xa6/0x5e0
[ 2663.574252]  set_next_entity+0xd9/0x240
[ 2663.578619]  set_next_task_fair+0x6e/0xa0
[ 2663.583182]  __schedule+0x12af/0x1570
[ 2663.587350]  schedule+0x28/0x70
[ 2663.590937]  exit_to_usermode_loop+0x61/0xf0
[ 2663.595791]  prepare_exit_to_usermode+0xbf/0xd0
[ 2663.600936]  retint_user+0x8/0x18
[ 2663.604719] RIP: 0033:0x402057
[ 2663.608209] Code: 24 10 64 48 8b 04 25 28 00 00 00 48 89 44 24 38
31 c0 e8 2c eb ff
[ 2663.629351] RSP: 002b:00007f006a7cbe50 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff02
[ 2663.637924] RAX: 000000000029778f RBX: 00000000002dc6c0 RCX: 0000000000000002
[ 2663.645985] RDX: 00007f006a7cbe60 RSI: 0000000000000000 RDI: 00007f006a7cbe50
[ 2663.654046] RBP: 0000000000000006 R08: 0000000000000001 R09: 00007ffe965450a0
[ 2663.662108] R10: 00007f006a7cbe30 R11: 000000000003b368 R12: 00007f006a7cbed0
[ 2663.670160] R13: 00007f0098c1ce6f R14: 0000000000000000 R15: 00007f0084a30390
[ 2663.678226] irq event stamp: 27182
[ 2663.682114] hardirqs last  enabled at (27181): [<ffffffff81003f70>]
exit_to_usermo0
[ 2663.692348] hardirqs last disabled at (27182): [<ffffffff81a0affc>]
__schedule+0xd0
[ 2663.701716] softirqs last  enabled at (27004): [<ffffffff81e00359>]
__do_softirq+0a
[ 2663.711268] softirqs last disabled at (26999): [<ffffffff81095be1>]
irq_exit+0xc1/0
[ 2663.720247] ---[ end trace d46e59b84bcde977 ]---
[ 2663.725503] BUG: unable to handle kernel paging request at 00000000005df5f0
[ 2663.733377] #PF error: [WRITE]
[ 2663.736875] PGD 8000000bff037067 P4D 8000000bff037067 PUD bff0b1067
PMD bfbf02067 0
[ 2663.745954] Oops: 0002 [#1] SMP PTI
[ 2663.749931] CPU: 5 PID: 3078 Comm: schbench Tainted: G        W I
    5.0.0-rc8-7
[ 2663.761233] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.2
[ 2663.772836] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1c0
[ 2663.779827] Code: f3 90 48 8b 32 48 85 f6 74 f6 eb e8 c1 ee 12 83
e0 03 83 ee 01 42
[ 2663.800970] RSP: 0000:ffffc9000a633e18 EFLAGS: 00010006
[ 2663.806892] RAX: 00000000005df5f0 RBX: ffff888bdfbf2a40 RCX: 0000000000180000
[ 2663.814954] RDX: ffff888c0a7e5180 RSI: 0000000000001fff RDI: ffff888bdfbf2a40
[ 2663.823015] RBP: ffff888bdfbf2a40 R08: 0000000000180000 R09: 0000000000000001
[ 2663.831068] R10: ffffc9000a633dc0 R11: ffff888bdfbf2a58 R12: 0000000000000046
[ 2663.839129] R13: ffff888bdfb8aa40 R14: ffff888be5b90d80 R15: ffff888be5b90d80
[ 2663.847182] FS:  00007f00797ea700(0000) GS:ffff888c0a600000(0000)
knlGS:00000000000
[ 2663.856330] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2663.862834] CR2: 00000000005df5f0 CR3: 0000000bfdd64006 CR4: 00000000000606e0
[ 2663.870895] Call Trace:
[ 2663.873715]  do_raw_spin_lock+0xab/0xb0
[ 2663.878095]  _raw_spin_lock_irqsave+0x63/0x80
[ 2663.883066]  __balance_callback+0x19/0xa0
[ 2663.887626]  __schedule+0x1113/0x1570
[ 2663.891803]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 2663.897142]  ? apic_timer_interrupt+0xa/0x20
[ 2663.901996]  ? interrupt_entry+0x9a/0xe0
[ 2663.906450]  ? apic_timer_interrupt+0xa/0x20
[ 2663.911307] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_ni
[ 2663.996886] CR2: 00000000005df5f0
[ 2664.000686] ---[ end trace d46e59b84bcde978 ]---
[ 2664.011393] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1c0
[ 2664.018386] Code: f3 90 48 8b 32 48 85 f6 74 f6 eb e8 c1 ee 12 83
e0 03 83 ee 01 42
[ 2664.039529] RSP: 0000:ffffc9000a633e18 EFLAGS: 00010006
[ 2664.045452] RAX: 00000000005df5f0 RBX: ffff888bdfbf2a40 RCX: 0000000000180000
[ 2664.053513] RDX: ffff888c0a7e5180 RSI: 0000000000001fff RDI: ffff888bdfbf2a40
[ 2664.061574] RBP: ffff888bdfbf2a40 R08: 0000000000180000 R09: 0000000000000001
[ 2664.069635] R10: ffffc9000a633dc0 R11: ffff888bdfbf2a58 R12: 0000000000000046
[ 2664.077688] R13: ffff888bdfb8aa40 R14: ffff888be5b90d80 R15: ffff888be5b90d80
[ 2664.085749] FS:  00007f00797ea700(0000) GS:ffff888c0a600000(0000)
knlGS:00000000000
[ 2664.094897] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2664.101402] CR2: 00000000005df5f0 CR3: 0000000bfdd64006 CR4: 00000000000606e0
[ 2664.109481]
[ 2664.109482] ======================================================
[ 2664.109483] WARNING: possible circular locking dependency detected
[ 2664.109483] 5.0.0-rc8-00542-gd697415be692-dirty #7 Tainted: G          I
[ 2664.109484] ------------------------------------------------------
[ 2664.109485] schbench/3087 is trying to acquire lock:
[ 2664.109485] 000000007a0032d4 ((console_sem).lock){-.-.}, at:
down_trylock+0xf/0x30
[ 2664.109488]
[ 2664.109497] but task is already holding lock:
[ 2664.109497] 00000000efdef567 (&rq->__lock){-.-.}, at: __schedule+0xfa/0x1570
[ 2664.109507]
[ 2664.109508] which lock already depends on the new lock.
[ 2664.109509]
[ 2664.109509]
[ 2664.109510] the existing dependency chain (in reverse order) is:
[ 2664.109510]
[ 2664.109511] -> #2 (&rq->__lock){-.-.}:
[ 2664.109513]        task_fork_fair+0x35/0x1c0
[ 2664.109513]        sched_fork+0xf4/0x1f0
[ 2664.109514]        copy_process.part.39+0x7ac/0x21f0
[ 2664.109515]        _do_fork+0xf9/0x6a0
[ 2664.109515]        kernel_thread+0x25/0x30
[ 2664.109516]        rest_init+0x22/0x240
[ 2664.109517]        start_kernel+0x49f/0x4bf
[ 2664.109517]        secondary_startup_64+0xa4/0xb0
[ 2664.109518]
[ 2664.109518] -> #1 (&p->pi_lock){-.-.}:
[ 2664.109520]        try_to_wake_up+0x3d/0x510
[ 2664.109521]        up+0x40/0x60
[ 2664.109521]        __up_console_sem+0x41/0x70
[ 2664.109522]        console_unlock+0x32a/0x610
[ 2664.109522]        vprintk_emit+0x14a/0x350
[ 2664.109523]        dev_vprintk_emit+0x11d/0x230
[ 2664.109524]        dev_printk_emit+0x4a/0x70
[ 2664.109524]        _dev_info+0x64/0x80
[ 2664.109525]        usb_new_device+0x105/0x490
[ 2664.109525]        hub_event+0x81f/0x1730
[ 2664.109526]        process_one_work+0x2a4/0x600
[ 2664.109527]        worker_thread+0x2d/0x3d0
[ 2664.109527]        kthread+0x116/0x130
[ 2664.109528]        ret_from_fork+0x3a/0x50
[ 2664.109528]
[ 2664.109529] -> #0 ((console_sem).lock){-.-.}:
[ 2664.109531]        _raw_spin_lock_irqsave+0x41/0x80
[ 2664.109531]        down_trylock+0xf/0x30
[ 2664.109532]        __down_trylock_console_sem+0x33/0xa0
[ 2664.109533]        console_trylock+0x13/0x60
[ 2664.109533]        vprintk_emit+0x13d/0x350
[ 2664.109534]        printk+0x52/0x6e
[ 2664.109534]        __warn+0x5f/0x110
[ 2664.109535]        report_bug+0xa5/0x110
[ 2664.109536]        fixup_bug.part.15+0x18/0x30
[ 2664.109536]        do_error_trap+0xbb/0x100
[ 2664.109537]        do_invalid_op+0x28/0x30
[ 2664.109537]        invalid_op+0x14/0x20
[ 2664.109538]        update_load_avg+0x52/0x5e0
[ 2664.109538]        set_next_entity+0xd9/0x240
[ 2664.109539]        set_next_task_fair+0x6e/0xa0
[ 2664.109540]        __schedule+0x12af/0x1570
[ 2664.109540]        schedule+0x28/0x70
[ 2664.109541]        exit_to_usermode_loop+0x61/0xf0
[ 2664.109542]        prepare_exit_to_usermode+0xbf/0xd0
[ 2664.109542]        retint_user+0x8/0x18
[ 2664.109542]
[ 2664.109543] other info that might help us debug this:
[ 2664.109544]
[ 2664.109544] Chain exists of:
[ 2664.109544]   (console_sem).lock --> &p->pi_lock --> &rq->__lock
[ 2664.109547]
[ 2664.109548]  Possible unsafe locking scenario:
[ 2664.109548]
[ 2664.109549]        CPU0                    CPU1
[ 2664.109549]        ----                    ----
[ 2664.109550]   lock(&rq->__lock);
[ 2664.109551]                                lock(&p->pi_lock);
[ 2664.109553]                                lock(&rq->__lock);
[ 2664.109554]   lock((console_sem).lock);
[ 2664.109555]
[ 2664.109556]  *** DEADLOCK ***
[ 2664.109556]
[ 2664.109557] 1 lock held by schbench/3087:
[ 2664.109557]  #0: 00000000efdef567 (&rq->__lock){-.-.}, at:
__schedule+0xfa/0x1570
[ 2664.109560]
[ 2664.109560] stack backtrace:
[ 2664.109561] CPU: 5 PID: 3087 Comm: schbench Tainted: G          I
    5.0.0-rc8-7
[ 2664.109562] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.2
[ 2664.109563] Call Trace:
[ 2664.109563]  dump_stack+0x85/0xcb
[ 2664.109564]  print_circular_bug.isra.37+0x1d7/0x1e4
[ 2664.109565]  __lock_acquire+0x139c/0x1430
[ 2664.109565]  ? lock_acquire+0x9e/0x180
[ 2664.109566]  lock_acquire+0x9e/0x180
[ 2664.109566]  ? down_trylock+0xf/0x30
[ 2664.109567]  _raw_spin_lock_irqsave+0x41/0x80
[ 2664.109567]  ? down_trylock+0xf/0x30
[ 2664.109568]  ? vprintk_emit+0x13d/0x350
[ 2664.109569]  down_trylock+0xf/0x30
[ 2664.109569]  __down_trylock_console_sem+0x33/0xa0
[ 2664.109570]  console_trylock+0x13/0x60
[ 2664.109571]  vprintk_emit+0x13d/0x350
[ 2664.109571]  ? update_load_avg+0x52/0x5e0
[ 2664.109572]  printk+0x52/0x6e
[ 2664.109573]  ? update_load_avg+0x52/0x5e0
[ 2664.109573]  __warn+0x5f/0x110
[ 2664.109574]  ? update_load_avg+0x52/0x5e0
[ 2664.109575]  ? update_load_avg+0x52/0x5e0
[ 2664.109575]  report_bug+0xa5/0x110
[ 2664.109576]  fixup_bug.part.15+0x18/0x30
[ 2664.109576]  do_error_trap+0xbb/0x100
[ 2664.109577]  do_invalid_op+0x28/0x30
[ 2664.109578]  ? update_load_avg+0x52/0x5e0
[ 2664.109578]  invalid_op+0x14/0x20
[ 2664.109579] RIP: 0010:update_load_avg+0x52/0x5e0
[ 2664.109580] Code: 8b af 70 01 00 00 8b 3d 14 a6 6e 01 85 ff 74 1c
e9 4c 04 00 00 40
[ 2664.109581] RSP: 0000:ffffc9000a6a3dd8 EFLAGS: 00010046
[ 2664.109582] RAX: 0000000000000000 RBX: ffff888be7937600 RCX: 0000000000000001
[ 2664.109582] RDX: 0000000000000000 RSI: ffff888c09fe4418 RDI: 0000000000000046
[ 2664.109583] RBP: ffff888bdfb8aac0 R08: 0000000000000000 R09: ffff888bdfb9aad8
[ 2664.109584] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 2664.109585] R13: ffff888c09fe4400 R14: 0000000000000001 R15: ffff888bdfb8aa40
[ 2664.109585]  ? update_load_avg+0x4e/0x5e0
[ 2664.109586]  ? update_load_avg+0xa6/0x5e0
[ 2664.109586]  ? update_load_avg+0xa6/0x5e0
[ 2664.109587]  set_next_entity+0xd9/0x240
[ 2664.109588]  set_next_task_fair+0x6e/0xa0
[ 2664.109588]  __schedule+0x12af/0x1570
[ 2664.109589]  schedule+0x28/0x70
[ 2664.109589]  exit_to_usermode_loop+0x61/0xf0
[ 2664.109590]  prepare_exit_to_usermode+0xbf/0xd0
[ 2664.109590]  retint_user+0x8/0x18
[ 2664.109591] RIP: 0033:0x402057
[ 2664.109592] Code: 24 10 64 48 8b 04 25 28 00 00 00 48 89 44 24 38
31 c0 e8 2c eb ff
[ 2664.109593] RSP: 002b:00007f006a7cbe50 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff02
[ 2664.109594] RAX: 000000000029778f RBX: 00000000002dc6c0 RCX: 0000000000000002
[ 2664.109595] RDX: 00007f006a7cbe60 RSI: 0000000000000000 RDI: 00007f006a7cbe50
[ 2664.109596] RBP: 0000000000000006 R08: 0000000000000001 R09: 00007ffe965450a0
[ 2664.109596] R10: 00007f006a7cbe30 R11: 000000000003b368 R12: 00007f006a7cbed0
[ 2664.109597] R13: 00007f0098c1ce6f R14: 0000000000000000 R15: 00007f0084a30390
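
The "Chain exists of" part of the lockdep report above boils down to a cycle in a lock-order graph: (console_sem).lock was taken before &p->pi_lock, &p->pi_lock before &rq->__lock, and now a WARN inside __schedule() calls printk and tries to take (console_sem).lock while already holding &rq->__lock. What follows is a minimal userspace sketch of that check, not lockdep's actual implementation; the enum, the dep[][] matrix and both function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes, named after the three locks in the report. */
enum lock_class { CONSOLE_SEM_LOCK, PI_LOCK, RQ_LOCK, NR_CLASSES };

/* dep[a][b] is true once some path has acquired b while holding a. */
static bool dep[NR_CLASSES][NR_CLASSES];

/* Depth-first search: can 'to' be reached from 'from' via recorded edges? */
static bool reachable(enum lock_class from, enum lock_class to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_CLASSES; next++)
		if (dep[from][next] && !seen[next] &&
		    reachable((enum lock_class)next, to, seen))
			return true;
	return false;
}

/*
 * Record "acquire 'want' while holding 'held'".
 * Returns false and records nothing if 'held' is already reachable from
 * 'want', because the new edge would close a cycle (a possible deadlock).
 */
static bool record_dependency(enum lock_class held, enum lock_class want)
{
	bool seen[NR_CLASSES];

	assert(held < NR_CLASSES && want < NR_CLASSES);
	memset(seen, 0, sizeof(seen));
	if (reachable(want, held, seen))
		return false;
	dep[held][want] = true;
	return true;
}
```

Feeding it the two existing edges (CONSOLE_SEM_LOCK then PI_LOCK, PI_LOCK then RQ_LOCK) succeeds, while the third, RQ_LOCK then CONSOLE_SEM_LOCK, is rejected. That third edge is the acquisition the report flags.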

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-03-11  4:23       ` Aubrey Li
@ 2019-03-11 18:34         ` Subhra Mazumdar
  2019-03-11 23:33           ` Subhra Mazumdar
  2019-03-12 19:07           ` Pawan Gupta
  0 siblings, 2 replies; 99+ messages in thread
From: Subhra Mazumdar @ 2019-03-11 18:34 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Tim Chen, Linux List Kernel Mailing, Linus Torvalds,
	Frédéric Weisbecker, Kees Cook, Greg Kerr


On 3/10/19 9:23 PM, Aubrey Li wrote:
> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> <subhra.mazumdar@oracle.com> wrote:
>> expected. Most of the performance recovery happens in patch 15 which,
>> unfortunately, is also the one that introduces the hard lockup.
>>
> After applied Subhra's patch, the following is triggered by enabling
> core sched when a cgroup is
> under heavy load.
>
It seems you are hitting a different deadlock, one where printk is involved.
Can you drop the last patch (patch 16 sched: Debug bits...) and try again?

Thanks,
Subhra


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-03-11 18:34         ` Subhra Mazumdar
@ 2019-03-11 23:33           ` Subhra Mazumdar
  2019-03-12  0:20             ` Greg Kerr
                               ` (2 more replies)
  2019-03-12 19:07           ` Pawan Gupta
  1 sibling, 3 replies; 99+ messages in thread
From: Subhra Mazumdar @ 2019-03-11 23:33 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Tim Chen, Linux List Kernel Mailing, Linus Torvalds,
	Frédéric Weisbecker, Kees Cook, Greg Kerr


On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
>
> On 3/10/19 9:23 PM, Aubrey Li wrote:
>> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
>> <subhra.mazumdar@oracle.com> wrote:
>>> expected. Most of the performance recovery happens in patch 15 which,
>>> unfortunately, is also the one that introduces the hard lockup.
>>>
>> After applied Subhra's patch, the following is triggered by enabling
>> core sched when a cgroup is
>> under heavy load.
>>
> It seems you are facing some other deadlock where printk is involved. 
> Can you
> drop the last patch (patch 16 sched: Debug bits...) and try?
>
> Thanks,
> Subhra
>
Never mind, I am seeing the same lockdep deadlock output even without
patch 16. Btw, the NULL fix was missing something; the following works.

--------->8------------

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d0dac4..27cbc64 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4131,7 +4131,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
          * Avoid running the skip buddy, if running something else can
          * be done without getting too unfair.
          */
-       if (cfs_rq->skip == se) {
+       if (cfs_rq->skip && cfs_rq->skip == se) {
                 struct sched_entity *second;

                 if (se == curr) {
@@ -4149,13 +4149,15 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
         /*
          * Prefer last buddy, try to return the CPU to a preempted task.
          */
-       if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+       if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left)
+           < 1)
                 se = cfs_rq->last;

         /*
          * Someone really wants this to run. If it's not unfair, run it.
          */
-       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+       if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left)
+           < 1)
                 se = cfs_rq->next;

         clear_buddies(cfs_rq, se);
@@ -6958,6 +6960,9 @@ pick_task_fair(struct rq *rq)

                 se = pick_next_entity(cfs_rq, NULL);

+               if (!(se || curr))
+                       return NULL;
+
                 if (curr) {
                         if (se && curr->on_rq)
                                 update_curr(cfs_rq);


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-03-11 23:33           ` Subhra Mazumdar
@ 2019-03-12  0:20             ` Greg Kerr
  2019-03-12  0:47               ` Subhra Mazumdar
  2019-03-12  7:33               ` Aaron Lu
  2019-03-12  7:45             ` Aubrey Li
  2019-03-18  6:56             ` Aubrey Li
  2 siblings, 2 replies; 99+ messages in thread
From: Greg Kerr @ 2019-03-12  0:20 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: Aubrey Li, Mel Gorman, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Tim Chen,
	Linux List Kernel Mailing, Linus Torvalds, Frédéric Weisbecker,
	Kees Cook, Greg Kerr

On Mon, Mar 11, 2019 at 4:36 PM Subhra Mazumdar
<subhra.mazumdar@oracle.com> wrote:
>
>
> On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> >
> > On 3/10/19 9:23 PM, Aubrey Li wrote:
> >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> >> <subhra.mazumdar@oracle.com> wrote:
> >>> expected. Most of the performance recovery happens in patch 15 which,
> >>> unfortunately, is also the one that introduces the hard lockup.
> >>>
> >> After applied Subhra's patch, the following is triggered by enabling
> >> core sched when a cgroup is
> >> under heavy load.
> >>
> > It seems you are facing some other deadlock where printk is involved.
> > Can you
> > drop the last patch (patch 16 sched: Debug bits...) and try?
> >
> > Thanks,
> > Subhra
> >
> Never Mind, I am seeing the same lockdep deadlock output even w/o patch
> 16. Btw
> the NULL fix had something missing, following works.

Is this panic below, which occurs when I tag the first process,
related or known? If not, I will debug it tomorrow.

[   46.831828] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000000
[   46.831829] core sched enabled
[   46.834261] #PF error: [WRITE]
[   46.834899] PGD 0 P4D 0
[   46.835438] Oops: 0002 [#1] SMP PTI
[   46.836158] CPU: 0 PID: 11 Comm: migration/0 Not tainted
5.0.0everyday-glory-03949-g2d8fdbb66245-dirty #7
[   46.838206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1 04/01/2014
[   46.839844] RIP: 0010:_raw_spin_lock+0x7/0x20
[   46.840448] Code: 00 00 00 65 81 05 25 ca 5c 51 00 02 00 00 31 c0
ba ff 00 00 00 f0 0f b1 17 74 05 e9 93 80 46 ff f3 c3 90 31 c0 ba 01
00 00 00 <f0> 0f b1 17 74 07 89 c6 e9 1c 6e 46 ff f3 c3 66 2e 0f 1f 84
00 00
[   46.843000] RSP: 0018:ffffb9d300cabe38 EFLAGS: 00010046
[   46.843744] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
[   46.844709] RDX: 0000000000000001 RSI: ffffffffaea435ae RDI: 0000000000000000
[   46.845689] RBP: ffffb9d300cabed8 R08: 0000000000000000 R09: 0000000000020800
[   46.846651] R10: ffffffffaf603ea0 R11: 0000000000000001 R12: ffffffffaf6576c0
[   46.847619] R13: ffff9a57366c8000 R14: ffff9a5737401300 R15: ffffffffade868f0
[   46.848584] FS:  0000000000000000(0000) GS:ffff9a5737a00000(0000)
knlGS:0000000000000000
[   46.849680] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   46.850455] CR2: 0000000000000000 CR3: 00000001d36fa000 CR4: 00000000000006f0
[   46.851415] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   46.852371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   46.853326] Call Trace:
[   46.853678]  __schedule+0x139/0x11f0
[   46.854167]  ? cpumask_next+0x16/0x20
[   46.854668]  ? cpu_stop_queue_work+0xc0/0xc0
[   46.855252]  ? sort_range+0x20/0x20
[   46.855742]  schedule+0x4e/0x60
[   46.856171]  smpboot_thread_fn+0x12a/0x160
[   46.856725]  kthread+0x112/0x120
[   46.857164]  ? kthread_stop+0xf0/0xf0
[   46.857661]  ret_from_fork+0x35/0x40
[   46.858146] Modules linked in:
[   46.858562] CR2: 0000000000000000
[   46.859022] ---[ end trace e9fff08f17bfd2be ]---

- Greg

>
> --------->8------------
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d0dac4..27cbc64 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4131,7 +4131,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct
> sched_entity *curr)
>           * Avoid running the skip buddy, if running something else can
>           * be done without getting too unfair.
> */
> -       if (cfs_rq->skip == se) {
> +       if (cfs_rq->skip && cfs_rq->skip == se) {
>                  struct sched_entity *second;
>
>                  if (se == curr) {
> @@ -4149,13 +4149,15 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct
> sched_entity *curr)
> /*
>           * Prefer last buddy, try to return the CPU to a preempted task.
> */
> -       if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
> +       if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last,
> left)
> +           < 1)
>                  se = cfs_rq->last;
>
> /*
>           * Someone really wants this to run. If it's not unfair, run it.
> */
> -       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
> +       if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next,
> left)
> +           < 1)
>                  se = cfs_rq->next;
>
>          clear_buddies(cfs_rq, se);
> @@ -6958,6 +6960,9 @@ pick_task_fair(struct rq *rq)
>
>                  se = pick_next_entity(cfs_rq, NULL);
>
> +               if (!(se || curr))
> +                       return NULL;
> +
>                  if (curr) {
>                          if (se && curr->on_rq)
> update_curr(cfs_rq);
>

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-03-12  0:20             ` Greg Kerr
@ 2019-03-12  0:47               ` Subhra Mazumdar
  2019-03-12  7:33               ` Aaron Lu
  1 sibling, 0 replies; 99+ messages in thread
From: Subhra Mazumdar @ 2019-03-12  0:47 UTC (permalink / raw)
  To: Greg Kerr
  Cc: Aubrey Li, Mel Gorman, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Tim Chen,
	Linux List Kernel Mailing, Linus Torvalds, Frédéric Weisbecker,
	Kees Cook, Greg Kerr


On 3/11/19 5:20 PM, Greg Kerr wrote:
> On Mon, Mar 11, 2019 at 4:36 PM Subhra Mazumdar
> <subhra.mazumdar@oracle.com> wrote:
>>
>> On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
>>> On 3/10/19 9:23 PM, Aubrey Li wrote:
>>>> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
>>>> <subhra.mazumdar@oracle.com> wrote:
>>>>> expected. Most of the performance recovery happens in patch 15 which,
>>>>> unfortunately, is also the one that introduces the hard lockup.
>>>>>
>>>> After applied Subhra's patch, the following is triggered by enabling
>>>> core sched when a cgroup is
>>>> under heavy load.
>>>>
>>> It seems you are facing some other deadlock where printk is involved.
>>> Can you
>>> drop the last patch (patch 16 sched: Debug bits...) and try?
>>>
>>> Thanks,
>>> Subhra
>>>
>> Never Mind, I am seeing the same lockdep deadlock output even w/o patch
>> 16. Btw
>> the NULL fix had something missing, following works.
> Is this panic below, which occurs when I tag the first process,
> related or known? If not, I will debug it tomorrow.
>
> [   46.831828] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000000
> [   46.831829] core sched enabled
> [   46.834261] #PF error: [WRITE]
> [   46.834899] PGD 0 P4D 0
> [   46.835438] Oops: 0002 [#1] SMP PTI
> [   46.836158] CPU: 0 PID: 11 Comm: migration/0 Not tainted
> 5.0.0everyday-glory-03949-g2d8fdbb66245-dirty #7
> [   46.838206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.10.2-1 04/01/2014
> [   46.839844] RIP: 0010:_raw_spin_lock+0x7/0x20
> [   46.840448] Code: 00 00 00 65 81 05 25 ca 5c 51 00 02 00 00 31 c0
> ba ff 00 00 00 f0 0f b1 17 74 05 e9 93 80 46 ff f3 c3 90 31 c0 ba 01
> 00 00 00 <f0> 0f b1 17 74 07 89 c6 e9 1c 6e 46 ff f3 c3 66 2e 0f 1f 84
> 00 00
> [   46.843000] RSP: 0018:ffffb9d300cabe38 EFLAGS: 00010046
> [   46.843744] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
> [   46.844709] RDX: 0000000000000001 RSI: ffffffffaea435ae RDI: 0000000000000000
> [   46.845689] RBP: ffffb9d300cabed8 R08: 0000000000000000 R09: 0000000000020800
> [   46.846651] R10: ffffffffaf603ea0 R11: 0000000000000001 R12: ffffffffaf6576c0
> [   46.847619] R13: ffff9a57366c8000 R14: ffff9a5737401300 R15: ffffffffade868f0
> [   46.848584] FS:  0000000000000000(0000) GS:ffff9a5737a00000(0000)
> knlGS:0000000000000000
> [   46.849680] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   46.850455] CR2: 0000000000000000 CR3: 00000001d36fa000 CR4: 00000000000006f0
> [   46.851415] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   46.852371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   46.853326] Call Trace:
> [   46.853678]  __schedule+0x139/0x11f0
> [   46.854167]  ? cpumask_next+0x16/0x20
> [   46.854668]  ? cpu_stop_queue_work+0xc0/0xc0
> [   46.855252]  ? sort_range+0x20/0x20
> [   46.855742]  schedule+0x4e/0x60
> [   46.856171]  smpboot_thread_fn+0x12a/0x160
> [   46.856725]  kthread+0x112/0x120
> [   46.857164]  ? kthread_stop+0xf0/0xf0
> [   46.857661]  ret_from_fork+0x35/0x40
> [   46.858146] Modules linked in:
> [   46.858562] CR2: 0000000000000000
> [   46.859022] ---[ end trace e9fff08f17bfd2be ]---
>
> - Greg
>
This seems to be a different issue.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-03-12  0:20             ` Greg Kerr
  2019-03-12  0:47               ` Subhra Mazumdar
@ 2019-03-12  7:33               ` Aaron Lu
  1 sibling, 0 replies; 99+ messages in thread
From: Aaron Lu @ 2019-03-12  7:33 UTC (permalink / raw)
  To: Greg Kerr
  Cc: Subhra Mazumdar, Aubrey Li, Mel Gorman, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Tim Chen,
	Linux List Kernel Mailing, Linus Torvalds, Frédéric Weisbecker,
	Kees Cook, Greg Kerr

On Mon, Mar 11, 2019 at 05:20:19PM -0700, Greg Kerr wrote:
> On Mon, Mar 11, 2019 at 4:36 PM Subhra Mazumdar
> <subhra.mazumdar@oracle.com> wrote:
> >
> >
> > On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> > >
> > > On 3/10/19 9:23 PM, Aubrey Li wrote:
> > >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> > >> <subhra.mazumdar@oracle.com> wrote:
> > >>> expected. Most of the performance recovery happens in patch 15 which,
> > >>> unfortunately, is also the one that introduces the hard lockup.
> > >>>
> > >> After applied Subhra's patch, the following is triggered by enabling
> > >> core sched when a cgroup is
> > >> under heavy load.
> > >>
> > > It seems you are facing some other deadlock where printk is involved.
> > > Can you
> > > drop the last patch (patch 16 sched: Debug bits...) and try?
> > >
> > > Thanks,
> > > Subhra
> > >
> > Never Mind, I am seeing the same lockdep deadlock output even w/o patch
> > 16. Btw
> > the NULL fix had something missing, following works.
> 
> Is this panic below, which occurs when I tag the first process,
> related or known? If not, I will debug it tomorrow.
> 
> [   46.831828] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000000
> [   46.831829] core sched enabled
> [   46.834261] #PF error: [WRITE]
> [   46.834899] PGD 0 P4D 0
> [   46.835438] Oops: 0002 [#1] SMP PTI
> [   46.836158] CPU: 0 PID: 11 Comm: migration/0 Not tainted
> 5.0.0everyday-glory-03949-g2d8fdbb66245-dirty #7
> [   46.838206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.10.2-1 04/01/2014

Probably because SMT is not enabled in this QEMU setup.

rq->core can be NULL for CPU0: sched_cpu_starting() is not called for
CPU0 and, since CPU0 has no siblings, its rq->core remains
uninitialized (NULL).
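
To make the failure mode concrete, here is a toy userspace model, not kernel code: struct rq is cut down to the one field under discussion, and toy_sched_init(), toy_cpu_starting() and toy_lock_target() are made-up names. The fallback in toy_lock_target() is only one possible guard and is an assumption on my side, not necessarily how this should be fixed in the series.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4

/* Toy stand-in for the kernel's struct rq; only the field discussed here. */
struct rq {
	int cpu;
	struct rq *core;	/* rq whose lock the whole core shares */
};

static struct rq runqueues[NR_CPUS];

/* Boot-time setup in this model: every rq starts with core == NULL. */
static void toy_sched_init(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		runqueues[cpu].cpu = cpu;
		runqueues[cpu].core = NULL;
	}
}

/* Toy bring-up hook: point the starting CPU at its core leader's rq. */
static void toy_cpu_starting(int cpu, int leader)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	assert(leader >= 0 && leader < NR_CPUS);
	runqueues[cpu].core = &runqueues[leader];
}

/*
 * Pick the rq whose lock should be taken. If the hook never ran for this
 * CPU, core is still NULL, so fall back to the rq itself instead of
 * dereferencing NULL (assumed guard, for illustration only).
 */
static struct rq *toy_lock_target(struct rq *rq)
{
	return rq->core ? rq->core : rq;
}
```

If toy_cpu_starting() is never called for CPU 0, runqueues[0].core stays NULL, which matches the situation described above; without the fallback, locking through rq->core would fault the same way _raw_spin_lock does in the trace.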

> [   46.839844] RIP: 0010:_raw_spin_lock+0x7/0x20
> [   46.840448] Code: 00 00 00 65 81 05 25 ca 5c 51 00 02 00 00 31 c0
> ba ff 00 00 00 f0 0f b1 17 74 05 e9 93 80 46 ff f3 c3 90 31 c0 ba 01
> 00 00 00 <f0> 0f b1 17 74 07 89 c6 e9 1c 6e 46 ff f3 c3 66 2e 0f 1f 84
> 00 00
> [   46.843000] RSP: 0018:ffffb9d300cabe38 EFLAGS: 00010046
> [   46.843744] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
> [   46.844709] RDX: 0000000000000001 RSI: ffffffffaea435ae RDI: 0000000000000000
> [   46.845689] RBP: ffffb9d300cabed8 R08: 0000000000000000 R09: 0000000000020800
> [   46.846651] R10: ffffffffaf603ea0 R11: 0000000000000001 R12: ffffffffaf6576c0
> [   46.847619] R13: ffff9a57366c8000 R14: ffff9a5737401300 R15: ffffffffade868f0
> [   46.848584] FS:  0000000000000000(0000) GS:ffff9a5737a00000(0000)
> knlGS:0000000000000000
> [   46.849680] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   46.850455] CR2: 0000000000000000 CR3: 00000001d36fa000 CR4: 00000000000006f0
> [   46.851415] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   46.852371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   46.853326] Call Trace:
> [   46.853678]  __schedule+0x139/0x11f0
> [   46.854167]  ? cpumask_next+0x16/0x20
> [   46.854668]  ? cpu_stop_queue_work+0xc0/0xc0
> [   46.855252]  ? sort_range+0x20/0x20
> [   46.855742]  schedule+0x4e/0x60
> [   46.856171]  smpboot_thread_fn+0x12a/0x160
> [   46.856725]  kthread+0x112/0x120
> [   46.857164]  ? kthread_stop+0xf0/0xf0
> [   46.857661]  ret_from_fork+0x35/0x40
> [   46.858146] Modules linked in:
> [   46.858562] CR2: 0000000000000000
> [   46.859022] ---[ end trace e9fff08f17bfd2be ]---

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-03-11 23:33           ` Subhra Mazumdar
  2019-03-12  0:20             ` Greg Kerr
@ 2019-03-12  7:45             ` Aubrey Li
  2019-03-13  5:55               ` Aubrey Li
  2019-03-18  6:56             ` Aubrey Li
  2 siblings, 1 reply; 99+ messages in thread
From: Aubrey Li @ 2019-03-12  7:45 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Tim Chen, Linux List Kernel Mailing, Linus Torvalds,
	Frédéric Weisbecker, Kees Cook, Greg Kerr

On Tue, Mar 12, 2019 at 7:36 AM Subhra Mazumdar
<subhra.mazumdar@oracle.com> wrote:
>
>
> On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> >
> > On 3/10/19 9:23 PM, Aubrey Li wrote:
> >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> >> <subhra.mazumdar@oracle.com> wrote:
> >>> expected. Most of the performance recovery happens in patch 15 which,
> >>> unfortunately, is also the one that introduces the hard lockup.
> >>>
> >> After applied Subhra's patch, the following is triggered by enabling
> >> core sched when a cgroup is
> >> under heavy load.
> >>
> > It seems you are facing some other deadlock where printk is involved.
> > Can you
> > drop the last patch (patch 16 sched: Debug bits...) and try?
> >
> > Thanks,
> > Subhra
> >
> Never Mind, I am seeing the same lockdep deadlock output even w/o patch
> 16. Btw
> the NULL fix had something missing,

One more NULL pointer dereference:

Mar 12 02:24:46 aubrey-ivb kernel: [  201.916741] core sched enabled
[  201.950203] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000008
[  201.950254] ------------[ cut here ]------------
[  201.959045] #PF error: [normal kernel read fault]
[  201.964272] !se->on_rq
[  201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849
set_next_buddy+0x52/0x70
[  201.969596] PGD 8000000be9ed7067 P4D 8000000be9ed7067 PUD c00911067 PMD 0
[  201.972300] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  201.981712] Oops: 0000 [#1] SMP PTI
[  201.989463] CPU: 22 PID: 2965 Comm: schbench Tainted: G          I
     5.0.0-rc8-00542-gd697415be692-dirty #13
[  202.074710] CPU: 27 PID: 2947 Comm: schbench Tainted: G          I
     5.0.0-rc8-00542-gd697415be692-dirty #13
[  202.078662] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  202.078674] RIP: 0010:set_next_buddy+0x52/0x70
[  202.090135] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  202.090144] RIP: 0010:rb_insert_color+0x17/0x190
[  202.101623] Code: 48 85 ff 74 10 8b 47 40 85 c0 75 e2 80 3d 9e e5
6a 01 00 74 02 f3 c3 48 c7 c7 5c 05 2c 82 c6 05 8c e5 6a 01 01 e8 2e
bb fb ff <0f> 0b c3 83 bf 04 03 0e
[  202.113216] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 17 48 85 d2 0f 84 4d 01 00 00 48 8b 02 a8 01 0f 85 6d
01 00 00 <48> 8b 48 08 49 89 c0 44
[  202.118263] RSP: 0018:ffffc9000a5cbbb0 EFLAGS: 00010086
[  202.129858] RSP: 0018:ffffc9000a463cc0 EFLAGS: 00010046
[  202.135102] RAX: 0000000000000000 RBX: ffff88980047e800 RCX: 0000000000000000
[  202.135105] RDX: ffff888be28caa40 RSI: 0000000000000001 RDI: ffffffff8110c3fa
[  202.156251] RAX: 0000000000000000 RBX: ffff888bfeb80000 RCX: ffff888bfeb80000
[  202.156255] RDX: ffff888be28c8348 RSI: ffff88980b5e50c8 RDI: ffff888bfeb80348
[  202.177390] RBP: ffff88980047ea00 R08: 0000000000000000 R09: 00000000001e3a80
[  202.177393] R10: ffffc9000a5cbb28 R11: 0000000000000000 R12: ffff888c0b9e4400
[  202.183317] RBP: ffff88980b5e4400 R08: 000000000000014f R09: ffff8898049cf000
[  202.183320] R10: 0000000000000078 R11: ffff8898049cfc5c R12: 0000000000000004
[  202.189241] R13: ffff888be28caa40 R14: 0000000000000009 R15: 0000000000000009
[  202.189245] FS:  00007f05f87f8700(0000) GS:ffff888c0b800000(0000)
knlGS:0000000000000000
[  202.197310] R13: ffffc9000a463d20 R14: 0000000000000246 R15: 000000000000001c
[  202.197314] FS:  00007f0611cca700(0000) GS:ffff88980b200000(0000)
knlGS:0000000000000000
[  202.205373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  202.205377] CR2: 00007f05e9fdb728 CR3: 0000000be4d0e006 CR4: 00000000000606e0
[  202.213441] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  202.213444] CR2: 0000000000000008 CR3: 0000000be4d0e005 CR4: 00000000000606e0
[  202.221509] Call Trace:
[  202.229574] Call Trace:
[  202.237640]  dequeue_task_fair+0x7e/0x1b0
[  202.245700]  enqueue_task+0x6f/0xb0
[  202.253761]  __schedule+0xcc8/0x1570
[  202.261823]  ttwu_do_activate+0x6a/0xc0
[  202.270985]  schedule+0x28/0x70
[  202.279042]  try_to_wake_up+0x20b/0x510
[  202.288206]  futex_wait_queue_me+0xbf/0x130
[  202.294714]  wake_up_q+0x3f/0x80
[  202.302773]  futex_wait+0xeb/0x240
[  202.309282]  futex_wake+0x157/0x180
[  202.317353]  ? __switch_to_asm+0x40/0x70
[  202.320158]  do_futex+0x451/0xad0
[  202.322970]  ? __switch_to_asm+0x34/0x70
[  202.322980]  ? __switch_to_asm+0x40/0x70
[  202.327541]  ? do_nanosleep+0xcc/0x1a0
[  202.331521]  do_futex+0x479/0xad0
[  202.335599]  ? hrtimer_nanosleep+0xe7/0x230
[  202.339954]  ? lockdep_hardirqs_on+0xf0/0x180
[  202.343548]  __x64_sys_futex+0x134/0x180
[  202.347906]  ? _raw_spin_unlock_irq+0x29/0x40
[  202.352660]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  202.356343]  ? finish_task_switch+0x9a/0x2c0
[  202.360228]  do_syscall_64+0x60/0x1b0
[  202.364197]  ? __schedule+0xbcd/0x1570
[  202.368663]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  202.372448]  __x64_sys_futex+0x134/0x180
[  202.376913] RIP: 0033:0x7f06129e14d9
[  202.381380]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  202.385650] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40
00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 08
[  202.389436]  do_syscall_64+0x60/0x1b0
[  202.394190] RSP: 002b:00007f0611cc9e88 EFLAGS: 00000246 ORIG_RAX:
00000000000000ca
[  202.399147]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  202.403612] RAX: ffffffffffffffda RBX: 00007f0604a30390 RCX: 00007f06129e14d9
[  202.403614] RDX: 0000000000000001 RSI: 0000000000000081 RDI: 00007f060461d2a0
[  202.408565] RIP: 0033:0x7f06129e14d9
[  202.413905] RBP: 00007f0604a30390 R08: 0000000000000000 R09: 0000000000000000
[  202.413908] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000f4240
[  202.418760] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40
00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 08
[  202.418763] RSP: 002b:00007f05f87f7e68 EFLAGS: 00000246 ORIG_RAX:
00000000000000ca
[  202.422937] R13: 00007f06125d0c88 R14: 0000000000000010 R15: 00007f06125d0c58
[  202.422945] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  202.427209] RAX: ffffffffffffffda RBX: 00007f060820a180 RCX: 00007f06129e14d9
[  202.427212] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f060820a180
[  202.432944] CR2: 0000000000000008
[  202.437416] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[  202.437419] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f05f87f7ed0
[  202.441506] ---[ end trace 1b953fe9220b3d88 ]---
[  202.441508] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000008
[  202.441510] #PF error: [normal kernel read fault]
[  202.441511] PGD 0 P4D 0
[  202.441514] Oops: 0000 [#2] SMP PTI
[  202.441516] CPU: 24 PID: 0 Comm: swapper/24 Tainted: G      D   I
    5.0.0-rc8-00542-gd697415be692-dirty #13
[  202.441517] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  202.441521] RIP: 0010:rb_insert_color+0x17/0x190
[  202.441522] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 17 48 85 d2 0f 84 4d 01 00 00 48 8b 02 a8 01 0f 85 6d
01 00 00 <48> 8b 48 08 49 89 c0 44
[  202.441523] RSP: 0018:ffff88980ac03e68 EFLAGS: 00010046
[  202.441525] RAX: 0000000000000000 RBX: ffff888bfddf5480 RCX: ffff888bfddf5480
[  202.441526] RDX: ffff888bfeb857c8 RSI: ffff88980ade50c8 RDI: ffff888bfddf57c8
[  202.441527] RBP: ffff88980ade4400 R08: 0000000000000077 R09: 0000000eb31aac68
[  202.441528] R10: 0000000000000078 R11: ffff889809de4418 R12: 0000000000000000
[  202.441529] R13: ffff88980ac03ec8 R14: 0000000000000046 R15: 0000000000000018
[  202.441531] FS:  0000000000000000(0000) GS:ffff88980ac00000(0000)
knlGS:0000000000000000
[  202.441532] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  202.441533] CR2: 0000000000000008 CR3: 0000000002616004 CR4: 00000000000606e0
[  202.441534] Call Trace:
[  202.441536]  <IRQ>
[  202.441538]  enqueue_task+0x6f/0xb0
[  202.441541]  ttwu_do_activate+0x6a/0xc0
[  202.441544]  try_to_wake_up+0x20b/0x510
[  202.441549]  hrtimer_wakeup+0x1e/0x30
[  202.441551]  __hrtimer_run_queues+0x117/0x3d0
[  202.441553]  ? __hrtimer_init+0xb0/0xb0
[  202.441557]  hrtimer_interrupt+0xe5/0x240
[  202.441563]  smp_apic_timer_interrupt+0x81/0x1f0
[  202.441565]  apic_timer_interrupt+0xf/0x20
[  202.441567]  </IRQ>
[  202.441573] RIP: 0010:cpuidle_enter_state+0xbb/0x440
[  202.441574] Code: 0f 8b ff 80 7c 24 07 00 74 17 9c 58 66 66 90 66
90 f6 c4 02 0f 85 59 03 00 00 31 ff e8 5e 2b 92 ff e8 c9 0a 99 ff fb
66 66 90 <66> 66 90 45 85 ed 0f 85
[  202.441575] RSP: 0018:ffffc90006403ea0 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff13
[  202.441577] RAX: 0000000000000000 RBX: ffffffff82700480 RCX: 000000000000001f
[  202.441578] RDX: 0000002f08578463 RSI: 000000002f858b0c RDI: ffffffff8181bca7
[  202.441579] RBP: ffffe8fffae03000 R08: 0000000000000002 R09: 00000000001e3a80
[  202.441580] R10: ffffc90006403e80 R11: 0000000000000000 R12: 0000002f08578463
[  202.441581] R13: 0000000000000005 R14: 0000000000000005 R15: 0000002f05150b84
[  202.441585]  ? cpuidle_enter_state+0xb7/0x440
[  202.441590]  do_idle+0x20f/0x2a0
[  202.441594]  cpu_startup_entry+0x19/0x20
[  202.441599]  start_secondary+0x17f/0x1d0
[  202.441602]  secondary_startup_64+0xa4/0xb0
[  202.441608] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  202.441637] CR2: 0000000000000008
[  202.441684] ---[ end trace 1b953fe9220b3d89 ]---
[  202.441686] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000008
[  202.441689] #PF error: [normal kernel read fault]
[  202.441690] PGD 8000000be9ed7067 P4D 8000000be9ed7067 PUD c00911067 PMD 0
[  202.443007] Oops: 0000 [#3] SMP PTI
[  202.443010] CPU: 0 PID: 3006 Comm: schbench Tainted: G      D   I
    5.0.0-rc8-00542-gd697415be692-dirty #13
[  202.443012] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  202.443016] RIP: 0010:rb_insert_color+0x17/0x190
[  202.443018] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 17 48 85 d2 0f 84 4d 01 00 00 48 8b 02 a8 01 0f 85 6d
01 00 00 <48> 8b 48 08 49 89 c0 44
[  202.443020] RSP: 0000:ffff888c09c03e68 EFLAGS: 00010046
[  202.443022] RAX: 0000000000000000 RBX: ffff888bfddf2a40 RCX: ffff888bfddf2a40
[  202.443024] RDX: ffff888be28cd7c8 RSI: ffff888c0b5e50c8 RDI: ffff888bfddf2d88
[  202.443026] RBP: ffff888c0b5e4400 R08: 0000000000000077 R09: 0000000e7267ab1a
[  202.443027] R10: 0000000000000078 R11: ffff888c0a5e4418 R12: 0000000000000004
[  202.443029] R13: ffff888c09c03ec8 R14: 0000000000000046 R15: 0000000000000014
[  202.443031] FS:  00007f05e37ce700(0000) GS:ffff888c09c00000(0000)
knlGS:0000000000000000
[  202.443033] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  202.443034] CR2: 0000000000000008 CR3: 0000000be4d0e001 CR4: 00000000000606f0
[  202.443035] Call Trace:
[  202.443038]  <IRQ>
[  202.443041]  enqueue_task+0x6f/0xb0
[  202.443046]  ttwu_do_activate+0x6a/0xc0
[  202.443050]  try_to_wake_up+0x20b/0x510
[  202.443057]  hrtimer_wakeup+0x1e/0x30
[  202.443059]  __hrtimer_run_queues+0x117/0x3d0
[  202.443062]  ? __hrtimer_init+0xb0/0xb0
[  202.443067]  hrtimer_interrupt+0xe5/0x240
[  202.443074]  smp_apic_timer_interrupt+0x81/0x1f0
[  202.443077]  apic_timer_interrupt+0xf/0x20
[  202.443079]  </IRQ>
[  202.443081] RIP: 0033:0x7ffcf1fac6ac
[  202.443084] Code: 2d 81 e9 ff ff 4c 8b 05 82 e9 ff ff 0f 01 f9 66
90 41 8b 0c 24 39 cb 0f 84 07 01 00 00 41 8b 1c 24 85 db 75 d9 eb ba
0f 01 f9 <66> 90 48 c1 e2 20 48 0f
[  202.443085] RSP: 002b:00007f05e37cddf0 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff13
[  202.443088] RAX: 0000000067252571 RBX: 00007f05e37cde50 RCX: 0000000000000000
[  202.443089] RDX: 000000000000bddd RSI: 00007f05e37cde50 RDI: 0000000000000000
[  202.443091] RBP: 00007f05e37cde10 R08: 0000000000000000 R09: 00007ffcf1fa90a0
[  202.443092] R10: 00007ffcf1fa9080 R11: 000000000000fd2a R12: 0000000000000000
[  202.443094] R13: 00007f0610cc7e6f R14: 0000000000000000 R15: 00007f05fca30390
[  202.443101] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  202.443146] CR2: 0000000000000008
[  202.443202] ---[ end trace 1b953fe9220b3d8a ]---
[  202.443206] BUG: scheduling while atomic: schbench/3006/0x00010000
[  202.443207] INFO: lockdep is turned off.
[  202.443208] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  202.443253] CPU: 0 PID: 3006 Comm: schbench Tainted: G      D   I
    5.0.0-rc8-00542-gd697415be692-dirty #13
[  202.443254] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  202.443255] Call Trace:
[  202.443257]  <IRQ>
[  202.443260]  dump_stack+0x85/0xcb
[  202.443263]  __schedule_bug+0x62/0x90
[  202.443266]  __schedule+0x118f/0x1570
[  202.443273]  ? down_trylock+0xf/0x30
[  202.443278]  ? is_bpf_text_address+0x5/0xe0
[  202.443282]  schedule+0x28/0x70
[  202.443285]  schedule_timeout+0x221/0x4b0
[  202.443290]  ? vprintk_emit+0x1f9/0x350
[  202.443298]  __down_interruptible+0x86/0x100
[  202.443304]  ? down_interruptible+0x42/0x50
[  202.443307]  down_interruptible+0x42/0x50
[  202.443312]  pstore_dump+0x9e/0x340
[  202.443316]  ? lock_acquire+0x9e/0x180
[  202.443319]  ? kmsg_dump+0xe1/0x1d0
[  202.443325]  kmsg_dump+0x99/0x1d0
[  202.443331]  oops_end+0x6e/0xd0
[  202.443336]  no_context+0x1bd/0x540
[  202.443341]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  202.443347]  page_fault+0x1e/0x30
[  202.443350] RIP: 0010:rb_insert_color+0x17/0x190
[  202.443352] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 17 48 85 d2 0f 84 4d 01 00 00 48 8b 02 a8 01 0f 85 6d
01 00 00 <48> 8b 48 08 49 89 c0 44
[  202.443353] RSP: 0000:ffff888c09c03e68 EFLAGS: 00010046
[  202.443355] RAX: 0000000000000000 RBX: ffff888bfddf2a40 RCX: ffff888bfddf2a40
[  202.443357] RDX: ffff888be28cd7c8 RSI: ffff888c0b5e50c8 RDI: ffff888bfddf2d88
[  202.443358] RBP: ffff888c0b5e4400 R08: 0000000000000077 R09: 0000000e7267ab1a
[  202.443360] R10: 0000000000000078 R11: ffff888c0a5e4418 R12: 0000000000000004
[  202.443361] R13: ffff888c09c03ec8 R14: 0000000000000046 R15: 0000000000000014
[  202.443370]  enqueue_task+0x6f/0xb0
[  202.443374]  ttwu_do_activate+0x6a/0xc0
[  202.443377]  try_to_wake_up+0x20b/0x510
[  202.443383]  hrtimer_wakeup+0x1e/0x30
[  202.443385]  __hrtimer_run_queues+0x117/0x3d0
[  202.443387]  ? __hrtimer_init+0xb0/0xb0
[  202.443393]  hrtimer_interrupt+0xe5/0x240
[  202.443398]  smp_apic_timer_interrupt+0x81/0x1f0
[  202.443401]  apic_timer_interrupt+0xf/0x20
[  202.443402]  </IRQ>
[  202.443404] RIP: 0033:0x7ffcf1fac6ac
[  202.443406] Code: 2d 81 e9 ff ff 4c 8b 05 82 e9 ff ff 0f 01 f9 66
90 41 8b 0c 24 39 cb 0f 84 07 01 00 00 41 8b 1c 24 85 db 75 d9 eb ba
0f 01 f9 <66> 90 48 c1 e2 20 48 0f
[  202.443407] RSP: 002b:00007f05e37cddf0 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff13
[  202.443409] RAX: 0000000067252571 RBX: 00007f05e37cde50 RCX: 0000000000000000
[  202.443411] RDX: 000000000000bddd RSI: 00007f05e37cde50 RDI: 0000000000000000
[  202.443412] RBP: 00007f05e37cde10 R08: 0000000000000000 R09: 00007ffcf1fa90a0
[  202.443414] R10: 00007ffcf1fa9080 R11: 000000000000fd2a R12: 0000000000000000
[  202.443415] R13: 00007f0610cc7e6f R14: 0000000000000000 R15: 00007f05fca30390
[  202.447384] R13: 00007f06114c8e6f R14: 0000000000000000 R15: 00007f060820a150
[  202.447392] irq event stamp: 6766
[  203.839390] hardirqs last  enabled at (6765): [<ffffffff810044f2>]
do_syscall_64+0x12/0x1b0
[  203.848842] hardirqs last disabled at (6766): [<ffffffff81a0affc>]
__schedule+0xdc/0x1570
[  203.858097] softirqs last  enabled at (6760): [<ffffffff81e00359>]
__do_softirq+0x359/0x40a
[  203.867558] softirqs last disabled at (6751): [<ffffffff81095be1>]
irq_exit+0xc1/0xd0
[  203.876423] ---[ end trace 1b953fe9220b3d8b ]---

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-03-11 18:34         ` Subhra Mazumdar
  2019-03-11 23:33           ` Subhra Mazumdar
@ 2019-03-12 19:07           ` Pawan Gupta
  1 sibling, 0 replies; 99+ messages in thread
From: Pawan Gupta @ 2019-03-12 19:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Aubrey Li, Mel Gorman, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Tim Chen, Linux List Kernel Mailing, Linus Torvalds,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Subhra Mazumdar

Hi,

With core scheduling, LTP reports 2 new failures related to cgroups (memcg_stat_rss and memcg_move_charge_at_immigrate). I will try to debug them.

Also, "perf sched map" indicates there might be a small window in which 2 processes in different cgroups run together on one core.
In the case below, B0 and D0 (stress-ng-cpu and sysbench) belong to 2 different cgroups with cpu.tag enabled.

$ perf sched map

  *A0                                 382.266600 secs A0 => kworker/0:1-eve:51
  *B0                                 382.266612 secs B0 => stress-ng-cpu:7956
  *A0                                 382.394597 secs 
  *B0                                 382.394609 secs 
   B0             *C0                 382.494459 secs C0 => i915/signal:0:450
   B0             *D0                 382.494468 secs D0 => sysbench:8088
  *.               D0                 382.494472 secs .  => swapper:0
   .              *C0                 383.095787 secs 
  *B0              C0                 383.095792 secs 
   B0             *D0                 383.095820 secs
  *A0              D0                 383.096587 secs

In some cases I don't see an IPI being sent to the sibling CPU when 2 incompatible processes are picked. For example, in the logs below, at timestamp 382.146250
"stress-ng-cpu" is picked while "sysbench" is running on the sibling CPU.

  kworker/0:1-51    [000] d...   382.146246: __schedule: cpu(0): selected: stress-ng-cpu/7956 ffff9945bad29200
  kworker/0:1-51    [000] d...   382.146246: __schedule: max: stress-ng-cpu/7956 ffff9945bad29200
  kworker/0:1-51    [000] d...   382.146247: __prio_less: (swapper/4/0;140,0,0) ?< (sysbench/8088;140,34783671987,0)
  kworker/0:1-51    [000] d...   382.146248: __prio_less: (stress-ng-cpu/7956;119,34817170203,0) ?< (sysbench/8088;119,34783671987,0)
  kworker/0:1-51    [000] d...   382.146249: __schedule: cpu(4): selected: sysbench/8088 ffff9945a7405200
  kworker/0:1-51    [000] d...   382.146249: __prio_less: (stress-ng-cpu/7956;119,34817170203,0) ?< (sysbench/8088;119,34783671987,0)
  kworker/0:1-51    [000] d...   382.146250: __schedule: picked: stress-ng-cpu/7956 ffff9945bad29200
  kworker/0:1-51    [000] d...   382.146251: __switch_to: Pawan: cpu(0) switching to stress-ng-cpu
  kworker/0:1-51    [000] d...   382.146251: __switch_to: Pawan: cpu(4) running sysbench
stress-ng-cpu-7956  [000] dN..   382.274234: __schedule: cpu(0): selected: kworker/0:1/51 0
stress-ng-cpu-7956  [000] dN..   382.274235: __schedule: max: kworker/0:1/51 0
stress-ng-cpu-7956  [000] dN..   382.274235: __schedule: cpu(4): selected: sysbench/8088 ffff9945a7405200
stress-ng-cpu-7956  [000] dN..   382.274237: __prio_less: (kworker/0:1/51;119,50744489595,0) ?< (sysbench/8088;119,34911643157,0)
stress-ng-cpu-7956  [000] dN..   382.274237: __schedule: picked: kworker/0:1/51 0
stress-ng-cpu-7956  [000] d...   382.274239: __switch_to: Pawan: cpu(0) switching to kworker/0:1
stress-ng-cpu-7956  [000] d...   382.274239: __switch_to: Pawan: cpu(4) running sysbench
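
To make the mismatch in the first block explicit, here is a minimal userspace
sketch. It assumes the trailing hex field on the "selected:" lines is the
task's core cookie, and that SMT siblings should only run together when their
cookies are equal. The struct and helper names are hypothetical, not taken
from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified task descriptor; only the cookie matters here. */
struct task {
	const char *comm;
	uint64_t core_cookie;
};

/*
 * Assumed rule: two tasks may share a core only if their cookies are equal.
 * This ignores special cases such as the idle task.
 */
static bool cookies_match(const struct task *a, const struct task *b)
{
	assert(a != NULL && b != NULL);
	return a->core_cookie == b->core_cookie;
}
```

With the values from the trace, stress-ng-cpu (ffff9945bad29200) and sysbench
(ffff9945a7405200) do not match, so under that assumption they should not be
running on CPU 0 and CPU 4 at the same time.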

-Pawan

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-03-12  7:45             ` Aubrey Li
@ 2019-03-13  5:55               ` Aubrey Li
  2019-03-14  0:35                 ` Tim Chen
  0 siblings, 1 reply; 99+ messages in thread
From: Aubrey Li @ 2019-03-13  5:55 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Tim Chen, Linux List Kernel Mailing, Linus Torvalds,
	Frédéric Weisbecker, Kees Cook, Greg Kerr

On Tue, Mar 12, 2019 at 3:45 PM Aubrey Li <aubrey.intel@gmail.com> wrote:
>
> On Tue, Mar 12, 2019 at 7:36 AM Subhra Mazumdar
> <subhra.mazumdar@oracle.com> wrote:
> >
> >
> > On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> > >
> > > On 3/10/19 9:23 PM, Aubrey Li wrote:
> > >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> > >> <subhra.mazumdar@oracle.com> wrote:
> > >>> expected. Most of the performance recovery happens in patch 15 which,
> > >>> unfortunately, is also the one that introduces the hard lockup.
> > >>>
> > >> After applied Subhra's patch, the following is triggered by enabling
> > >> core sched when a cgroup is
> > >> under heavy load.
> > >>
> > > It seems you are facing some other deadlock where printk is involved.
> > > Can you
> > > drop the last patch (patch 16 sched: Debug bits...) and try?
> > >
> > > Thanks,
> > > Subhra
> > >
> > Never Mind, I am seeing the same lockdep deadlock output even w/o patch
> > 16. Btw
> > the NULL fix had something missing,
>
> One more NULL pointer dereference:
>
> Mar 12 02:24:46 aubrey-ivb kernel: [  201.916741] core sched enabled
> [  201.950203] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000008
> [  201.950254] ------------[ cut here ]------------
> [  201.959045] #PF error: [normal kernel read fault]
> [  201.964272] !se->on_rq
> [  201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849
> set_next_buddy+0x52/0x70

A quick workaround below:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d0dac4fd94f..ef6acfe2cf7d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6834,7 +6834,7 @@ static void set_last_buddy(struct sched_entity *se)
                return;

        for_each_sched_entity(se) {
-               if (SCHED_WARN_ON(!se->on_rq))
+               if (SCHED_WARN_ON(!(se && se->on_rq)))
                        return;
                cfs_rq_of(se)->last = se;
        }
@@ -6846,7 +6846,7 @@ static void set_next_buddy(struct sched_entity *se)
                return;

        for_each_sched_entity(se) {
-               if (SCHED_WARN_ON(!se->on_rq))
+               if (SCHED_WARN_ON(!(se && se->on_rq)))
                        return;
                cfs_rq_of(se)->next = se;
        }

And now I'm running into a hard LOCKUP:

[  326.336279] NMI watchdog: Watchdog detected hard LOCKUP on cpu 31
[  326.336280] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  326.336311] irq event stamp: 164460
[  326.336312] hardirqs last  enabled at (164459):
[<ffffffff810c7a97>] sched_core_balance+0x247/0x470
[  326.336312] hardirqs last disabled at (164460):
[<ffffffff810c7963>] sched_core_balance+0x113/0x470
[  326.336313] softirqs last  enabled at (164250):
[<ffffffff81e00359>] __do_softirq+0x359/0x40a
[  326.336314] softirqs last disabled at (164213):
[<ffffffff81095be1>] irq_exit+0xc1/0xd0
[  326.336315] CPU: 31 PID: 0 Comm: swapper/31 Tainted: G          I
    5.0.0-rc8-00542-gd697415be692-dirty #15
[  326.336316] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  326.336317] RIP: 0010:native_queued_spin_lock_slowpath+0x18f/0x1c0
[  326.336318] Code: c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6
48 05 80 51 1e 00 48 03 04 f5 40 58 39 82 48 89 10 8b 42 08 85 c0 75
09 f3 90 <8b> 42 08 85 c0 74 f7 4b
[  326.336318] RSP: 0000:ffffc9000643bd58 EFLAGS: 00000046
[  326.336319] RAX: 0000000000000000 RBX: ffff888c0ade4400 RCX: 0000000000800000
[  326.336320] RDX: ffff88980bbe5180 RSI: 0000000000000019 RDI: ffff888c0ade4400
[  326.336321] RBP: ffff888c0ade4400 R08: 0000000000800000 R09: 00000000001e3a80
[  326.336321] R10: ffffc9000643bd08 R11: 0000000000000000 R12: 0000000000000000
[  326.336322] R13: 0000000000000000 R14: ffff88980bbe4400 R15: 000000000000001f
[  326.336323] FS:  0000000000000000(0000) GS:ffff88980ba00000(0000)
knlGS:0000000000000000
[  326.336323] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  326.336324] CR2: 00007fdcd7fd7728 CR3: 00000017e821a001 CR4: 00000000000606e0
[  326.336325] Call Trace:
[  326.336325]  do_raw_spin_lock+0xab/0xb0
[  326.336326]  _raw_spin_lock+0x4b/0x60
[  326.336326]  double_rq_lock+0x99/0x140
[  326.336327]  sched_core_balance+0x11e/0x470
[  326.336327]  __balance_callback+0x49/0xa0
[  326.336328]  __schedule+0x1113/0x1570
[  326.336328]  schedule_idle+0x1e/0x40
[  326.336329]  do_idle+0x16b/0x2a0
[  326.336329]  cpu_startup_entry+0x19/0x20
[  326.336330]  start_secondary+0x17f/0x1d0
[  326.336331]  secondary_startup_64+0xa4/0xb0
[  330.959367] ---[ end Kernel panic - not syncing: Hard LOCKUP ]---

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-03-13  5:55               ` Aubrey Li
@ 2019-03-14  0:35                 ` Tim Chen
  2019-03-14  5:30                   ` Aubrey Li
  0 siblings, 1 reply; 99+ messages in thread
From: Tim Chen @ 2019-03-14  0:35 UTC (permalink / raw)
  To: Aubrey Li, Subhra Mazumdar
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linux List Kernel Mailing, Linus Torvalds,
	Frédéric Weisbecker, Kees Cook, Greg Kerr


>>
>> One more NULL pointer dereference:
>>
>> Mar 12 02:24:46 aubrey-ivb kernel: [  201.916741] core sched enabled
>> [  201.950203] BUG: unable to handle kernel NULL pointer dereference
>> at 0000000000000008
>> [  201.950254] ------------[ cut here ]------------
>> [  201.959045] #PF error: [normal kernel read fault]
>> [  201.964272] !se->on_rq
>> [  201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849
>> set_next_buddy+0x52/0x70
> 
> A quick workaround below:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d0dac4fd94f..ef6acfe2cf7d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6834,7 +6834,7 @@ static void set_last_buddy(struct sched_entity *se)
>                 return;
> 
>         for_each_sched_entity(se) {
> -               if (SCHED_WARN_ON(!se->on_rq))
> +               if (SCHED_WARN_ON(!(se && se->on_rq)))
>                         return;
>                 cfs_rq_of(se)->last = se;
>         }
> @@ -6846,7 +6846,7 @@ static void set_next_buddy(struct sched_entity *se)
>                 return;
> 
>         for_each_sched_entity(se) {
> -               if (SCHED_WARN_ON(!se->on_rq))
> +               if (SCHED_WARN_ON(!(se && se->on_rq)))


Shouldn't for_each_sched_entity(se) already skip the code block in the
!se case, and so avoid any NULL pointer access of se?

Since
#define for_each_sched_entity(se) \
                for (; se; se = se->parent)

Scratching my head a bit here on how your changes would have made
a difference.
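
To illustrate, here is a minimal userspace sketch of the loop shape. The
struct is a simplified stand-in for the kernel's sched_entity, and
count_loop_iterations() is a hypothetical helper that mirrors the early
return in set_next_buddy(); neither is the actual kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct sched_entity. */
struct sched_entity {
	struct sched_entity *parent;
	int on_rq;
};

/* Same shape as the macro quoted above. */
#define for_each_sched_entity(se) \
		for (; se; se = se->parent)

/*
 * Hypothetical helper: counts how many times the loop body completes,
 * returning early at the first entity that is not on a runqueue.
 */
static int count_loop_iterations(struct sched_entity *se)
{
	int n = 0;

	for_each_sched_entity(se) {
		/* The loop condition already guarantees this. */
		assert(se != NULL);
		if (!se->on_rq)
			return n;
		n++;
	}
	return n;
}
```

Calling count_loop_iterations(NULL) returns 0 without ever entering the
body, so an extra "se &&" test inside the loop cannot be what avoids a NULL
dereference of se itself.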

In your original log, I wonder if the !se->on_rq warning on CPU 22 is mixed
in with the actual Oops? I also saw rb_insert_color in your original log.
I wonder if that was actually the source of the Oops?


[  202.078674] RIP: 0010:set_next_buddy+0x52/0x70
[  202.090135] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  202.090144] RIP: 0010:rb_insert_color+0x17/0x190
[  202.101623] Code: 48 85 ff 74 10 8b 47 40 85 c0 75 e2 80 3d 9e e5
6a 01 00 74 02 f3 c3 48 c7 c7 5c 05 2c 82 c6 05 8c e5 6a 01 01 e8 2e
bb fb ff <0f> 0b c3 83 bf 04 03 0e
[  202.113216] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
00 00 00 48 8b 17 48 85 d2 0f 84 4d 01 00 00 48 8b 02 a8 01 0f 85 6d
01 00 00 <48> 8b 48 08 49 89 c0 44
[  202.118263] RSP: 0018:ffffc9000a5cbbb0 EFLAGS: 00010086
[  202.129858] RSP: 0018:ffffc9000a463cc0 EFLAGS: 00010046
[  202.135102] RAX: 0000000000000000 RBX: ffff88980047e800 RCX: 0000000000000000
[  202.135105] RDX: ffff888be28caa40 RSI: 0000000000000001 RDI: ffffffff8110c3fa
[  202.156251] RAX: 0000000000000000 RBX: ffff888bfeb80000 RCX: ffff888bfeb80

Thanks.

Tim

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-03-14  0:35                 ` Tim Chen
@ 2019-03-14  5:30                   ` Aubrey Li
  2019-03-14  6:07                     ` Li, Aubrey
  0 siblings, 1 reply; 99+ messages in thread
From: Aubrey Li @ 2019-03-14  5:30 UTC (permalink / raw)
  To: Tim Chen
  Cc: Subhra Mazumdar, Mel Gorman, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linux List Kernel Mailing,
	Linus Torvalds, Frédéric Weisbecker, Kees Cook, Greg Kerr

On Thu, Mar 14, 2019 at 8:35 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >>
> >> One more NULL pointer dereference:
> >>
> >> Mar 12 02:24:46 aubrey-ivb kernel: [  201.916741] core sched enabled
> >> [  201.950203] BUG: unable to handle kernel NULL pointer dereference
> >> at 0000000000000008
> >> [  201.950254] ------------[ cut here ]------------
> >> [  201.959045] #PF error: [normal kernel read fault]
> >> [  201.964272] !se->on_rq
> >> [  201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849
> >> set_next_buddy+0x52/0x70
> >
> Shouldn't for_each_sched_entity(se) already skip the code block in the
> !se case, and so avoid any NULL pointer access of se?
>
> Since
> #define for_each_sched_entity(se) \
>                 for (; se; se = se->parent)
>
> Scratching my head a bit here on how your changes would have made
> a difference.

This NULL pointer dereference is not reproducible, which made me think the
change works...

>
> In your original log, I wonder if the !se->on_rq warning on CPU 22 is mixed
> in with the actual Oops? I also saw rb_insert_color in your original log.
> I wonder if that was actually the source of the Oops?

I have had no chance to figure this out; I only saw it once. The lockup
occurs more frequently.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-03-14  5:30                   ` Aubrey Li
@ 2019-03-14  6:07                     ` Li, Aubrey
  0 siblings, 0 replies; 99+ messages in thread
From: Li, Aubrey @ 2019-03-14  6:07 UTC (permalink / raw)
  To: Aubrey Li, Tim Chen
  Cc: Subhra Mazumdar, Mel Gorman, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linux List Kernel Mailing,
	Linus Torvalds, Frédéric Weisbecker, Kees Cook, Greg Kerr

The original patch seems to be missing the following change for 32-bit.

Thanks,
-Aubrey

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 9fbb10383434..78de28ebc45d 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -111,7 +111,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	/*
 	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	if (index == CPUACCT_STAT_NSTATS) {
@@ -125,7 +125,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	}
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	return data;
@@ -140,14 +140,14 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
 	/*
 	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
 		cpuusage->usages[i] = val;
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 }
 
@@ -252,13 +252,13 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 			 * Take rq->lock to make 64-bit read safe on 32-bit
 			 * platforms.
 			 */
-			raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+			raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 			seq_printf(m, " %llu", cpuusage->usages[index]);
 
 #ifndef CONFIG_64BIT
-			raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+			raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 		}
 		seq_puts(m, "\n");
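
For context, here is a minimal userspace sketch of why callers should go
through the accessor rather than take &rq->lock directly. It assumes that
rq_lockp() returns a lock shared by all SMT siblings of a core when core
scheduling is enabled, and the per-runqueue lock otherwise. The struct
layout, field names and the core_enabled flag are illustrative stand-ins,
not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder for raw_spinlock_t; only its address matters here. */
struct lock { int unused; };

/* Simplified stand-in for struct rq. */
struct rq {
	struct lock lock;   /* per-runqueue lock */
	struct rq *core;    /* runqueue that owns the core-wide lock */
	bool core_enabled;  /* stand-in for "core scheduling is on" */
};

/* Assumed behaviour of the accessor used in the diff above. */
static struct lock *rq_lockp(struct rq *rq)
{
	if (rq->core_enabled) {
		/* An uninitialized rq->core would be a NULL dereference. */
		assert(rq->core != NULL);
		return &rq->core->lock;
	}
	return &rq->lock;
}
```

Under that assumption, two sibling runqueues resolve to the same lock once
core scheduling is on, so code that still locks &cpu_rq(cpu)->lock directly
would be taking a different lock from the rest of the scheduler.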

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
                   ` (18 preceding siblings ...)
  2019-03-01  2:54 ` Subhra Mazumdar
@ 2019-03-14 15:28 ` Julien Desfossez
  19 siblings, 0 replies; 99+ messages in thread
From: Julien Desfossez @ 2019-03-14 15:28 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: Julien Desfossez, linux-kernel, subhra.mazumdar, fweisbec,
	keescook, kerrnel, Vineeth Pillai, Nishanth Aravamudan

On 2/18/19 8:56 AM, Peter Zijlstra wrote:
> A much 'demanded' feature: core-scheduling :-(
>
> I still hate it with a passion, and that is part of why it took a little
> longer than 'promised'.
>
> While this one doesn't have all the 'features' of the previous (never
> published) version and isn't L1TF 'complete', I tend to like the structure
> better (relatively speaking: I hate it slightly less).
>
> This one is sched class agnostic and therefore, in principle, doesn't horribly
> wreck RT (in fact, RT could 'ab'use this by setting 'task->core_cookie = task'
> to force-idle siblings).
>
> Now, as hinted by that, there are semi sane reasons for actually having this.
> Various hardware features like Intel RDT - Memory Bandwidth Allocation, work
> per core (due to SMT fundamentally sharing caches) and therefore grouping
> related tasks on a core makes it more reliable.
>
> However; whichever way around you turn this cookie; it is expensive and nasty.

We are seeing this hard lockup within 1 hour of testing the patchset with 2
VMs using the core scheduler feature. Here is the full dmesg. We have the
kdump as well if more information is necessary.

[ 1989.647539] core sched enabled
[ 3353.211527] NMI: IOCK error (debug interrupt?) for reason 75 on CPU 0.
[ 3353.211528] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted
5.0-0.coresched-generic #1
[ 3353.211530] RIP: 0010:native_queued_spin_lock_slowpath+0x199/0x1e0
[ 3353.211532] Code: eb e8 c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6
48 05 00 3a 02 00 48 03 04 f5 20 48 bb a6 48 89 10 8b 42 08 85 c0 75 09 <f3>
90 8b 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 8e 0f 18 0e eb 8f
[ 3353.211533] RSP: 0018:ffff97ba3f603e18 EFLAGS: 00000046
[ 3353.211535] RAX: 0000000000000000 RBX: 0000000000000202 RCX:
0000000000040000
[ 3353.211535] RDX: ffff97ba3f623a00 RSI: 0000000000000007 RDI:
ffff97dabf822d40
[ 3353.211536] RBP: ffff97ba3f603e18 R08: 0000000000040000 R09:
0000000000018499
[ 3353.211537] R10: 0000000000000001 R11: 0000000000000000 R12:
0000000000000001
[ 3353.211538] R13: ffffffffa7340740 R14: 000000000000000c R15:
000000000000000c
[ 3353.211539] FS:  0000000000000000(0000) GS:ffff97ba3f600000(0000)
knlGS:0000000000000000
[ 3353.211544] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3353.211545] CR2: 00007efeac310004 CR3: 0000001bf4c0e002 CR4:
00000000001626f0
[ 3353.211546] Call Trace:
[ 3353.211546]  <IRQ>
[ 3353.211547]  _raw_spin_lock_irqsave+0x35/0x40
[ 3353.211548]  update_blocked_averages+0x35/0x5d0
[ 3353.211549]  ? rebalance_domains+0x180/0x2c0
[ 3353.211549]  update_nohz_stats+0x48/0x60
[ 3353.211550]  _nohz_idle_balance+0xdf/0x290
[ 3353.211551]  run_rebalance_domains+0x97/0xa0
[ 3353.211551]  __do_softirq+0xe4/0x2f3
[ 3353.211552]  irq_exit+0xb6/0xc0
[ 3353.211553]  scheduler_ipi+0xe4/0x130
[ 3353.211553]  smp_reschedule_interrupt+0x39/0xe0
[ 3353.211554]  reschedule_interrupt+0xf/0x20
[ 3353.211555]  </IRQ>
[ 3353.211556] RIP: 0010:cpuidle_enter_state+0xbc/0x440
[ 3353.211557] Code: ff e8 d8 dd 86 ff 80 7d d3 00 74 17 9c 58 0f 1f 44 00
00 f6 c4 02 0f 85 54 03 00 00 31 ff e8 eb 1d 8d ff fb 66 0f 1f 44 00 00 <45>
85 f6 0f 88 1a 03 00 00 4c 2b 6d c8 48 ba cf f7 53 e3 a5 9b c4
[ 3353.211558] RSP: 0018:ffffffffa6e03df8 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff02
[ 3353.211560] RAX: ffff97ba3f622d40 RBX: ffffffffa6f545e0 RCX:
000000000000001f
[ 3353.211561] RDX: 0000024c9b7d936c RSI: 0000000047318912 RDI:
0000000000000000
[ 3353.211562] RBP: ffffffffa6e03e38 R08: 0000000000000002 R09:
0000000000022600
[ 3353.211562] R10: ffffffffa6e03dc8 R11: 00000000000002dc R12:
ffffd6c67f602968
[ 3353.211563] R13: 0000024c9b7d936c R14: 0000000000000004 R15:
ffffffffa6f54760
[ 3353.211564]  ? cpuidle_enter_state+0x98/0x440
[ 3353.211565]  cpuidle_enter+0x17/0x20
[ 3353.211565]  call_cpuidle+0x23/0x40
[ 3353.211566]  do_idle+0x204/0x280
[ 3353.211567]  cpu_startup_entry+0x1d/0x20
[ 3353.211567]  rest_init+0xae/0xb0
[ 3353.211568]  arch_call_rest_init+0xe/0x1b
[ 3353.211569]  start_kernel+0x4f5/0x516
[ 3353.211569]  x86_64_start_reservations+0x24/0x26
[ 3353.211570]  x86_64_start_kernel+0x74/0x77
[ 3353.211571]  secondary_startup_64+0xa4/0xb0
[ 3353.211571] Kernel panic - not syncing: NMI IOCK error: Not continuing
[ 3353.211572] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted
5.0-0.coresched-generic #1
[ 3353.211574] Call Trace:
[ 3353.211575]  <NMI>
[ 3353.211575]  dump_stack+0x63/0x85
[ 3353.211576]  panic+0xfe/0x2a4
[ 3353.211576]  nmi_panic+0x39/0x40
[ 3353.211577]  io_check_error+0x92/0xa0
[ 3353.211578]  default_do_nmi+0x9e/0x110
[ 3353.211578]  do_nmi+0x119/0x180
[ 3353.211579]  end_repeat_nmi+0x16/0x50
[ 3353.211580] RIP: 0010:native_queued_spin_lock_slowpath+0x199/0x1e0
[ 3353.211581] Code: eb e8 c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6
48 05 00 3a 02 00 48 03 04 f5 20 48 bb a6 48 89 10 8b 42 08 85 c0 75 09 <f3>
90 8b 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 8e 0f 18 0e eb 8f
[ 3353.211582] RSP: 0018:ffff97ba3f603e18 EFLAGS: 00000046
[ 3353.211583] RAX: 0000000000000000 RBX: 0000000000000202 RCX:
0000000000040000
[ 3353.211584] RDX: ffff97ba3f623a00 RSI: 0000000000000007 RDI:
ffff97dabf822d40
[ 3353.211585] RBP: ffff97ba3f603e18 R08: 0000000000040000 R09:
0000000000018499
[ 3353.211586] R10: 0000000000000001 R11: 0000000000000000 R12:
0000000000000001
[ 3353.211587] R13: ffffffffa7340740 R14: 000000000000000c R15:
000000000000000c
[ 3353.211587]  ? native_queued_spin_lock_slowpath+0x199/0x1e0
[ 3353.211588]  ? native_queued_spin_lock_slowpath+0x199/0x1e0
[ 3353.211589]  </NMI>
[ 3353.211589]  <IRQ>
[ 3353.211590]  _raw_spin_lock_irqsave+0x35/0x40
[ 3353.211591]  update_blocked_averages+0x35/0x5d0
[ 3353.211591]  ? rebalance_domains+0x180/0x2c0
[ 3353.211592]  update_nohz_stats+0x48/0x60
[ 3353.211593]  _nohz_idle_balance+0xdf/0x290
[ 3353.211593]  run_rebalance_domains+0x97/0xa0
[ 3353.211594]  __do_softirq+0xe4/0x2f3
[ 3353.211595]  irq_exit+0xb6/0xc0
[ 3353.211595]  scheduler_ipi+0xe4/0x130
[ 3353.211596]  smp_reschedule_interrupt+0x39/0xe0
[ 3353.211597]  reschedule_interrupt+0xf/0x20
[ 3353.211597]  </IRQ>
[ 3353.211598] RIP: 0010:cpuidle_enter_state+0xbc/0x440
[ 3353.211599] Code: ff e8 d8 dd 86 ff 80 7d d3 00 74 17 9c 58 0f 1f 44 00
00 f6 c4 02 0f 85 54 03 00 00 31 ff e8 eb 1d 8d ff fb 66 0f 1f 44 00 00 <45>
85 f6 0f 88 1a 03 00 00 4c 2b 6d c8 48 ba cf f7 53 e3 a5 9b c4
[ 3353.211600] RSP: 0018:ffffffffa6e03df8 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff02
[ 3353.211602] RAX: ffff97ba3f622d40 RBX: ffffffffa6f545e0 RCX:
000000000000001f
[ 3353.211603] RDX: 0000024c9b7d936c RSI: 0000000047318912 RDI:
0000000000000000
[ 3353.211603] RBP: ffffffffa6e03e38 R08: 0000000000000002 R09:
0000000000022600
[ 3353.211604] R10: ffffffffa6e03dc8 R11: 00000000000002dc R12:
ffffd6c67f602968
[ 3353.211605] R13: 0000024c9b7d936c R14: 0000000000000004 R15:
ffffffffa6f54760
[ 3353.211606]  ? cpuidle_enter_state+0x98/0x440
[ 3353.211607]  cpuidle_enter+0x17/0x20
[ 3353.211607]  call_cpuidle+0x23/0x40
[ 3353.211608]  do_idle+0x204/0x280
[ 3353.211609]  cpu_startup_entry+0x1d/0x20
[ 3353.211609]  rest_init+0xae/0xb0
[ 3353.211610]  arch_call_rest_init+0xe/0x1b
[ 3353.211611]  start_kernel+0x4f5/0x516
[ 3353.211611]  x86_64_start_reservations+0x24/0x26
[ 3353.211612]  x86_64_start_kernel+0x74/0x77
[ 3353.211613]  secondary_startup_64+0xa4/0xb0


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-03-11 23:33           ` Subhra Mazumdar
  2019-03-12  0:20             ` Greg Kerr
  2019-03-12  7:45             ` Aubrey Li
@ 2019-03-18  6:56             ` Aubrey Li
  2 siblings, 0 replies; 99+ messages in thread
From: Aubrey Li @ 2019-03-18  6:56 UTC (permalink / raw)
  To: Subhra Mazumdar, Peter Zijlstra, Tim Chen
  Cc: Mel Gorman, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linux List Kernel Mailing, Linus Torvalds, Frédéric Weisbecker,
	Kees Cook, Greg Kerr

On Tue, Mar 12, 2019 at 7:36 AM Subhra Mazumdar
<subhra.mazumdar@oracle.com> wrote:
>
>
> On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> >
> > On 3/10/19 9:23 PM, Aubrey Li wrote:
> >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> >> <subhra.mazumdar@oracle.com> wrote:
> >>> expected. Most of the performance recovery happens in patch 15 which,
> >>> unfortunately, is also the one that introduces the hard lockup.
> >>>
> >> After applied Subhra's patch, the following is triggered by enabling
> >> core sched when a cgroup is
> >> under heavy load.
> >>
> > It seems you are facing some other deadlock where printk is involved.
> > Can you
> > drop the last patch (patch 16 sched: Debug bits...) and try?
> >
> > Thanks,
> > Subhra
> >
> Never Mind, I am seeing the same lockdep deadlock output even w/o patch
> 16. Btw
> the NULL fix had something missing, following works.
>

Okay, here is another one. On my system, the CPUs brought up at boot don't
match the possible CPU map, so rq->core is never initialized for the CPUs that
are not online. That causes a NULL pointer dereference panic in
online_fair_sched_group().

Here is a quick fix:
-----------------------------------------------------------------------------------------------------
@@ -10488,7 +10493,8 @@ void online_fair_sched_group(struct task_group *tg)
        for_each_possible_cpu(i) {
                rq = cpu_rq(i);
                se = tg->se[i];
-
+               if (!rq->core)
+                       continue;
                raw_spin_lock_irq(rq_lockp(rq));
                update_rq_clock(rq);
                attach_entity_cfs_rq(se);

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-02-18 16:56 ` [RFC][PATCH 03/16] sched: Wrap rq::lock access Peter Zijlstra
  2019-02-19 16:13   ` Phil Auld
@ 2019-03-18 15:41   ` Julien Desfossez
  2019-03-20  2:29     ` Subhra Mazumdar
  1 sibling, 1 reply; 99+ messages in thread
From: Julien Desfossez @ 2019-03-18 15:41 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: Julien Desfossez, linux-kernel, subhra.mazumdar, fweisbec,
	keescook, kerrnel, Vineeth Pillai, Nishanth Aravamudan

The case where we try to acquire the lock on 2 runqueues belonging to 2
different cores requires the rq_lockp wrapper as well otherwise we
frequently deadlock in there.

This fixes the crash reported in
1552577311-8218-1-git-send-email-jdesfossez@digitalocean.com

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76fee56..71bb71f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2078,7 +2078,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 		raw_spin_lock(rq_lockp(rq1));
 		__acquire(rq2->lock);	/* Fake it out ;) */
 	} else {
-		if (rq1 < rq2) {
+		if (rq_lockp(rq1) < rq_lockp(rq2)) {
 			raw_spin_lock(rq_lockp(rq1));
 			raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
 		} else {
-- 
2.7.4
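
A minimal user-space sketch of the ordering rule in the patch above, for
readers who want to see why comparing runqueue addresses is not enough once
SMT siblings share a lock. Everything named toy_* is hypothetical and not
kernel code; the "locks" are plain flags, so this only demonstrates the
acquisition order, not real mutual exclusion. It assumes four runqueues laid
out so that siblings of the two cores interleave in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy lock: records whether it is held so misuse trips an assert. */
struct toy_lock {
	int held;
};

/*
 * Hypothetical stand-in for struct rq. With core scheduling enabled,
 * lockp points at the one lock shared by all SMT siblings of a core.
 */
struct toy_rq {
	struct toy_lock own_lock;
	struct toy_lock *lockp;	/* what rq_lockp() would return */
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);	/* taking it twice would self-deadlock */
	l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* The lock to take first is the one at the lower address. */
static struct toy_lock *toy_first_lock(struct toy_rq *rq1, struct toy_rq *rq2)
{
	return (uintptr_t)rq1->lockp < (uintptr_t)rq2->lockp ?
		rq1->lockp : rq2->lockp;
}

static void toy_double_rq_lock(struct toy_rq *rq1, struct toy_rq *rq2)
{
	struct toy_lock *first = toy_first_lock(rq1, rq2);
	struct toy_lock *second = (first == rq1->lockp) ? rq2->lockp : rq1->lockp;

	toy_lock_acquire(first);
	if (second != first)	/* siblings of one core share a single lock */
		toy_lock_acquire(second);
}

static void toy_double_rq_unlock(struct toy_rq *rq1, struct toy_rq *rq2)
{
	toy_lock_release(rq1->lockp);
	if (rq2->lockp != rq1->lockp)
		toy_lock_release(rq2->lockp);
}

/* Returns 1 if both cross-core pairs agree on which lock goes first. */
int toy_demo_consistent_order(void)
{
	struct toy_rq rq[4];
	int i;

	/* rq[0] and rq[2] form core A; rq[1] and rq[3] form core B. */
	for (i = 0; i < 4; i++) {
		rq[i].own_lock.held = 0;
		rq[i].lockp = &rq[i & 1].own_lock;
	}

	/* Ordering by rq address would take B's lock first for this pair ... */
	toy_double_rq_lock(&rq[2], &rq[1]);
	toy_double_rq_unlock(&rq[2], &rq[1]);

	/* ... and A's lock first for this one. Lock addresses agree instead. */
	return toy_first_lock(&rq[2], &rq[1]) == toy_first_lock(&rq[0], &rq[3]);
}
```

In this layout, comparing rq addresses would lock B then A for the pair
(rq[2], rq[1]) but A then B for (rq[0], rq[3]), which is the classic ABBA
pattern. Comparing the lock addresses gives both callers the same order.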


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-03-18 15:41   ` Julien Desfossez
@ 2019-03-20  2:29     ` Subhra Mazumdar
  2019-03-21 21:20       ` Julien Desfossez
  2019-03-22 23:28       ` Tim Chen
  0 siblings, 2 replies; 99+ messages in thread
From: Subhra Mazumdar @ 2019-03-20  2:29 UTC (permalink / raw)
  To: Julien Desfossez, Peter Zijlstra, mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel, Vineeth Pillai,
	Nishanth Aravamudan


On 3/18/19 8:41 AM, Julien Desfossez wrote:
> The case where we try to acquire the lock on 2 runqueues belonging to 2
> different cores requires the rq_lockp wrapper as well otherwise we
> frequently deadlock in there.
>
> This fixes the crash reported in
> 1552577311-8218-1-git-send-email-jdesfossez@digitalocean.com
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 76fee56..71bb71f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2078,7 +2078,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
>   		raw_spin_lock(rq_lockp(rq1));
>   		__acquire(rq2->lock);	/* Fake it out ;) */
>   	} else {
> -		if (rq1 < rq2) {
> +		if (rq_lockp(rq1) < rq_lockp(rq2)) {
>   			raw_spin_lock(rq_lockp(rq1));
>   			raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
>   		} else {
With this fix and my previous NULL pointer fix, my stress tests are surviving.
I re-ran my 2 DB instance setup on a 44-core, 2-socket system, putting each DB
instance in a separate core scheduling group. The numbers look much worse now.

users  baseline  %stdev  %idle  core_sched  %stdev  %idle
16     1         0.3     66     -73.4%      136.8   82
24     1         1.6     54     -95.8%      133.2   81
32     1         1.5     42     -97.5%      124.3   89

I also notice that if I enable a bunch of debug configs related to mutexes,
spin locks, lockdep, etc. (which I did earlier to debug the deadlock), it opens
up a can of worms with multiple crashes.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-03-20  2:29     ` Subhra Mazumdar
@ 2019-03-21 21:20       ` Julien Desfossez
  2019-03-22 13:34         ` Peter Zijlstra
  2019-03-23  0:06         ` Subhra Mazumdar
  2019-03-22 23:28       ` Tim Chen
  1 sibling, 2 replies; 99+ messages in thread
From: Julien Desfossez @ 2019-03-21 21:20 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: Julien Desfossez, linux-kernel, subhra.mazumdar, fweisbec,
	keescook, kerrnel, Vineeth Pillai, Nishanth Aravamudan

On Tue, Mar 19, 2019 at 10:31 PM Subhra Mazumdar <subhra.mazumdar@oracle.com>
wrote:
> On 3/18/19 8:41 AM, Julien Desfossez wrote:
> > The case where we try to acquire the lock on 2 runqueues belonging to 2
> > different cores requires the rq_lockp wrapper as well otherwise we
> > frequently deadlock in there.
> >
> > This fixes the crash reported in
> > 1552577311-8218-1-git-send-email-jdesfossez@digitalocean.com
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 76fee56..71bb71f 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2078,7 +2078,7 @@ static inline void double_rq_lock(struct rq *rq1,
> struct rq *rq2)
> >               raw_spin_lock(rq_lockp(rq1));
> >               __acquire(rq2->lock);   /* Fake it out ;) */
> >       } else {
> > -             if (rq1 < rq2) {
> > +             if (rq_lockp(rq1) < rq_lockp(rq2)) {
> >                       raw_spin_lock(rq_lockp(rq1));
> >                       raw_spin_lock_nested(rq_lockp(rq2),
> SINGLE_DEPTH_NESTING);
> >               } else {
> With this fix and my previous NULL pointer fix my stress tests are
> surviving. I
> re-ran my 2 DB instance setup on 44 core 2 socket system by putting each DB
> instance in separate core scheduling group. The numbers look much worse
> now.
>
> users  baseline  %stdev  %idle  core_sched  %stdev  %idle
> 16     1         0.3     66     -73.4%      136.8   82
> 24     1         1.6     54     -95.8%      133.2   81
> 32     1         1.5     42     -97.5%      124.3   89

We are also seeing a performance degradation of about 83% on the throughput
of 2 MySQL VMs under a stress test (12 vcpus, 32GB of RAM). The server has 2
NUMA nodes, each with 18 cores (so a total of 72 hardware threads). Each
MySQL VM is pinned to a different NUMA node. The clients for the stress
tests are running on a separate physical machine, each client runs 48 query
threads. Only the MySQL VMs use core scheduling (all vcpus and emulator
threads). Overall the server is 90% idle when the 2 VMs use core scheduling,
and 75% when they don’t.

The rate of preemption vs normal “switch out” is about 1% with and without
core scheduling enabled, but the overall rate of sched_switch is 5 times
higher without core scheduling which suggests some heavy contention in the
scheduling path.

On further investigation, we could see that the contention is mostly in the
way rq locks are taken. With this patchset, we lock the whole core if
cpu.tag is set for at least one cgroup. Due to this, __schedule() is more or
less serialized for the core, and that contributes to the performance loss
that we are seeing. We also saw that newidle_balance() spends a considerably
long time in load_balance() due to the rq spinlock contention. Do you think
it would help if the core-wide locking was only performed when absolutely
needed?

In terms of isolation, we measured the time a thread spends co-scheduled
with either a thread from the same group, the idle thread or a thread from
another group. This is what we see for 60 seconds of a specific busy VM
pinned to a whole NUMA node (all its threads):

no core scheduling:
- local neighbors (19.989 % of process runtime)
- idle neighbors (47.197 % of process runtime)
- foreign neighbors (22.811 % of process runtime)

core scheduling enabled:
- local neighbors (6.763 % of process runtime)
- idle neighbors (93.064 % of process runtime)
- foreign neighbors (0.236 % of process runtime)

As a separate test, we tried to pin all the vcpu threads to a set of cores
(6 cores for 12 vcpus):

no core scheduling:
- local neighbors (88.299 % of process runtime)
- idle neighbors (9.334 % of process runtime)
- foreign neighbors (0.197 % of process runtime)

core scheduling enabled:
- local neighbors (84.570 % of process runtime)
- idle neighbors (15.195 % of process runtime)
- foreign neighbors (0.257 % of process runtime)

Thanks,

Julien

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-03-21 21:20       ` Julien Desfossez
@ 2019-03-22 13:34         ` Peter Zijlstra
  2019-03-22 20:59           ` Julien Desfossez
  2019-03-23  0:06         ` Subhra Mazumdar
  1 sibling, 1 reply; 99+ messages in thread
From: Peter Zijlstra @ 2019-03-22 13:34 UTC (permalink / raw)
  To: Julien Desfossez
  Cc: mingo, tglx, pjt, tim.c.chen, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel, Vineeth Pillai,
	Nishanth Aravamudan

On Thu, Mar 21, 2019 at 05:20:17PM -0400, Julien Desfossez wrote:
> On further investigation, we could see that the contention is mostly in the
> way rq locks are taken. With this patchset, we lock the whole core if
> cpu.tag is set for at least one cgroup. Due to this, __schedule() is more or
> less serialized for the core and that attributes to the performance loss
> that we are seeing. We also saw that newidle_balance() takes considerably
> long time in load_balance() due to the rq spinlock contention. Do you think
> it would help if the core-wide locking was only performed when absolutely
> needed ?

Something like that could be done, but then you end up with 2 locks,
something which I was hoping to avoid.

Basically you keep rq->lock as it exists today, but add something like
rq->core->core_lock, you then have to take that second lock (nested
under rq->lock) for every scheduling action involving a tagged task.

It makes things complicated though; because now my head hurts thinking
about pick_next_task().

(this can obviously do away with the whole rq->lock wrappery)

Also, completely untested..

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -146,6 +146,8 @@ void sched_core_enqueue(struct rq *rq, s
 	if (!p->core_cookie)
 		return;
 
+	raw_spin_lock(&rq->core->core_lock);
+
 	node = &rq->core_tree.rb_node;
 	parent = *node;
 
@@ -161,6 +163,8 @@ void sched_core_enqueue(struct rq *rq, s
 
 	rb_link_node(&p->core_node, parent, node);
 	rb_insert_color(&p->core_node, &rq->core_tree);
+
+	raw_spin_unlock(&rq->core->core_lock);
 }
 
 void sched_core_dequeue(struct rq *rq, struct task_struct *p)
@@ -170,7 +174,9 @@ void sched_core_dequeue(struct rq *rq, s
 	if (!p->core_cookie)
 		return;
 
+	raw_spin_lock(&rq->core->core_lock);
 	rb_erase(&p->core_node, &rq->core_tree);
+	raw_spin_unlock(&rq->core->core_lock);
 }
 
 /*
@@ -181,6 +187,8 @@ struct task_struct *sched_core_find(stru
 	struct rb_node *node = rq->core_tree.rb_node;
 	struct task_struct *node_task, *match;
 
+	lockdep_assert_held(&rq->core->core_lock);
+
 	/*
 	 * The idle task always matches any cookie!
 	 */
@@ -206,6 +214,8 @@ struct task_struct *sched_core_next(stru
 {
 	struct rb_node *node = &p->core_node;
 
+	lockdep_assert_held(&rq->core->core_lock);
+
 	node = rb_next(node);
 	if (!node)
 		return NULL;
@@ -3685,6 +3695,8 @@ pick_next_task(struct rq *rq, struct tas
 	 * If there were no {en,de}queues since we picked (IOW, the task
 	 * pointers are all still valid), and we haven't scheduled the last
 	 * pick yet, do so now.
+	 *
+	 * XXX probably OK without ->core_lock
 	 */
 	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
 	    rq->core->core_pick_seq != rq->core_sched_seq) {
@@ -3710,6 +3722,20 @@ pick_next_task(struct rq *rq, struct tas
 	if (!rq->nr_running)
 		newidle_balance(rq, rf);
 
+	if (!rq->core->core_cookie) {
+		for_each_class(class) {
+			next = pick_task(rq, class, NULL);
+			if (next)
+				break;
+		}
+
+		if (!next->core_cookie) {
+			set_next_task(rq, next);
+			return next;
+		}
+	}
+
+	raw_spin_lock(&rq->core->core_lock);
 	cpu = cpu_of(rq);
 	smt_mask = cpu_smt_mask(cpu);
 
@@ -3849,6 +3875,7 @@ next_class:;
 	trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);
 
 done:
+	raw_spin_unlock(&rq->core->core_lock);
 	set_next_task(rq, next);
 	return next;
 }
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -966,6 +966,7 @@ struct rq {
 	struct rb_root		core_tree;
 
 	/* shared state */
+	raw_spinlock_t		core_lock;
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
@@ -1007,9 +1008,6 @@ static inline bool sched_core_enabled(st
 
 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
-	if (sched_core_enabled(rq))
-		return &rq->core->__lock;
-
 	return &rq->__lock;
 }
 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-03-22 13:34         ` Peter Zijlstra
@ 2019-03-22 20:59           ` Julien Desfossez
  0 siblings, 0 replies; 99+ messages in thread
From: Julien Desfossez @ 2019-03-22 20:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Julien Desfossez, mingo, tglx, pjt, tim.c.chen, torvalds,
	linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Vineeth Pillai, Nishanth Aravamudan

On Fri, Mar 22, 2019 at 9:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Mar 21, 2019 at 05:20:17PM -0400, Julien Desfossez wrote:
> > On further investigation, we could see that the contention is mostly in
> the
> > way rq locks are taken. With this patchset, we lock the whole core if
> > cpu.tag is set for at least one cgroup. Due to this, __schedule() is
> more or
> > less serialized for the core and that attributes to the performance loss
> > that we are seeing. We also saw that newidle_balance() takes considerably
> > long time in load_balance() due to the rq spinlock contention. Do you
> think
> > it would help if the core-wide locking was only performed when absolutely
> > needed ?
>
> Something like that could be done, but then you end up with 2 locks,
> something which I was hoping to avoid.
>
> Basically you keep rq->lock as it exists today, but add something like
> rq->core->core_lock, you then have to take that second lock (nested
> under rq->lock) for every scheduling action involving a tagged task.
>
> It makes things complicated though; because now my head hurts thinking
> about pick_next_task().
>
> (this can obviously do away with the whole rq->lock wrappery)
>
> Also, completely untested..

We tried it and it dies within 30ms of enabling the tag on 2 VMs :-)
Now after trying to debug this my head hurts as well !

We'll continue trying to figure this out, but if you want to take a look,
the full dmesg is here: https://paste.debian.net/plainh/0b8f87f3

Thanks,

Julien

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-03-20  2:29     ` Subhra Mazumdar
  2019-03-21 21:20       ` Julien Desfossez
@ 2019-03-22 23:28       ` Tim Chen
  2019-03-22 23:44         ` Tim Chen
  1 sibling, 1 reply; 99+ messages in thread
From: Tim Chen @ 2019-03-22 23:28 UTC (permalink / raw)
  To: Subhra Mazumdar, Julien Desfossez, Peter Zijlstra, mingo, tglx,
	pjt, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel, Vineeth Pillai,
	Nishanth Aravamudan, Pawan Gupta, Aubrey

On 3/19/19 7:29 PM, Subhra Mazumdar wrote:
> 
> On 3/18/19 8:41 AM, Julien Desfossez wrote:
>> The case where we try to acquire the lock on 2 runqueues belonging to 2
>> different cores requires the rq_lockp wrapper as well otherwise we
>> frequently deadlock in there.
>>
>> This fixes the crash reported in
>> 1552577311-8218-1-git-send-email-jdesfossez@digitalocean.com
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 76fee56..71bb71f 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -2078,7 +2078,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
>>           raw_spin_lock(rq_lockp(rq1));
>>           __acquire(rq2->lock);    /* Fake it out ;) */
>>       } else {
>> -        if (rq1 < rq2) {
>> +        if (rq_lockp(rq1) < rq_lockp(rq2)) {
>>               raw_spin_lock(rq_lockp(rq1));
>>               raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
>>           } else {


Pawan was seeing occasional crashes and lockups that are avoided by the
following change. We're adding a little more tracing to see why
pick_next_entity is returning NULL.

Tim

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5349ebedc645..4c7f353b8900 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7031,6 +7031,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
                }
 
                se = pick_next_entity(cfs_rq, curr);
+               if (!se)
+                       return NULL;
                cfs_rq = group_cfs_rq(se);
        } while (cfs_rq);
 
@@ -7070,6 +7072,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 
        do {
                se = pick_next_entity(cfs_rq, NULL);
+               if (!se)
+                       return NULL;
                set_next_entity(cfs_rq, se);
                cfs_rq = group_cfs_rq(se);
        } while (cfs_rq);

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-03-22 23:28       ` Tim Chen
@ 2019-03-22 23:44         ` Tim Chen
  0 siblings, 0 replies; 99+ messages in thread
From: Tim Chen @ 2019-03-22 23:44 UTC (permalink / raw)
  To: Subhra Mazumdar, Julien Desfossez, Peter Zijlstra, mingo, tglx,
	pjt, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel, Vineeth Pillai,
	Nishanth Aravamudan, Pawan Gupta, Aubrey

On 3/22/19 4:28 PM, Tim Chen wrote:
> On 3/19/19 7:29 PM, Subhra Mazumdar wrote:
>>
>> On 3/18/19 8:41 AM, Julien Desfossez wrote:
>>> The case where we try to acquire the lock on 2 runqueues belonging to 2
>>> different cores requires the rq_lockp wrapper as well otherwise we
>>> frequently deadlock in there.
>>>
>>> This fixes the crash reported in
>>> 1552577311-8218-1-git-send-email-jdesfossez@digitalocean.com
>>>
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 76fee56..71bb71f 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -2078,7 +2078,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
>>>           raw_spin_lock(rq_lockp(rq1));
>>>           __acquire(rq2->lock);    /* Fake it out ;) */
>>>       } else {
>>> -        if (rq1 < rq2) {
>>> +        if (rq_lockp(rq1) < rq_lockp(rq2)) {
>>>               raw_spin_lock(rq_lockp(rq1));
>>>               raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
>>>           } else {
> 
> 
> Pawan was seeing occasional crashes and lock up that's avoided by doing the following.
> We're trying to dig a little more tracing to see why pick_next_entity is returning
> NULL.
> 

We found the root cause: a chunk went missing when we ported Subhra's fix to pick_next_entity:

         * Someone really wants this to run. If it's not unfair, run it.
*/
-       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+       if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left)
+           < 1) 

That fixes the problem of pick_next_entity returning NULL.  Sorry for the noise.

Tim

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-03-21 21:20       ` Julien Desfossez
  2019-03-22 13:34         ` Peter Zijlstra
@ 2019-03-23  0:06         ` Subhra Mazumdar
  2019-03-27  1:02           ` Subhra Mazumdar
  2019-03-29 13:35           ` Julien Desfossez
  1 sibling, 2 replies; 99+ messages in thread
From: Subhra Mazumdar @ 2019-03-23  0:06 UTC (permalink / raw)
  To: Julien Desfossez, Peter Zijlstra, mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel, Vineeth Pillai,
	Nishanth Aravamudan


On 3/21/19 2:20 PM, Julien Desfossez wrote:
> On Tue, Mar 19, 2019 at 10:31 PM Subhra Mazumdar <subhra.mazumdar@oracle.com>
> wrote:
>> On 3/18/19 8:41 AM, Julien Desfossez wrote:
>>
> On further investigation, we could see that the contention is mostly in the
> way rq locks are taken. With this patchset, we lock the whole core if
> cpu.tag is set for at least one cgroup. Due to this, __schedule() is more or
> less serialized for the core and that attributes to the performance loss
> that we are seeing. We also saw that newidle_balance() takes considerably
> long time in load_balance() due to the rq spinlock contention. Do you think
> it would help if the core-wide locking was only performed when absolutely
> needed ?
>
Is the core-wide lock primarily responsible for the regression? I ran up to
patch 12, which also has the core-wide lock for tagged cgroups and also calls
newidle_balance() from pick_next_task(). I don't see any regression. Of course
the core sched version of pick_next_task() may be doing more, but compared with
__pick_next_task() it doesn't look too horrible.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-03-08 19:44     ` Subhra Mazumdar
  2019-03-11  4:23       ` Aubrey Li
@ 2019-03-26  7:32       ` Aaron Lu
  2019-03-26  7:56         ` Aaron Lu
  1 sibling, 1 reply; 99+ messages in thread
From: Aaron Lu @ 2019-03-26  7:32 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Tim Chen, Linux List Kernel Mailing, Linus Torvalds,
	Frédéric Weisbecker, Kees Cook, kerrnel

On Fri, Mar 08, 2019 at 11:44:01AM -0800, Subhra Mazumdar wrote:
> 
> On 2/22/19 4:45 AM, Mel Gorman wrote:
> >On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> >>On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >>>However; whichever way around you turn this cookie; it is expensive and nasty.
> >>Do you (or anybody else) have numbers for real loads?
> >>
> >>Because performance is all that matters. If performance is bad, then
> >>it's pointless, since just turning off SMT is the answer.
> >>
> >I tried to do a comparison between tip/master, ht disabled and this series
> >putting test workloads into a tagged cgroup but unfortunately it failed
> >
> >[  156.978682] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
> >[  156.986597] #PF error: [normal kernel read fault]
> >[  156.991343] PGD 0 P4D 0
> >[  156.993905] Oops: 0000 [#1] SMP PTI
> >[  156.997438] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 5.0.0-rc7-schedcore-v1r1 #1
> >[  157.005161] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
> >[  157.012896] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
> >[  157.018613] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
> >[  157.037544] RSP: 0018:ffffc9000c5bbde8 EFLAGS: 00010086
> >[  157.042819] RAX: ffff88810f5f6a00 RBX: 00000001547f175c RCX: 0000000000000001
> >[  157.050015] RDX: ffff88bf3bdb0a40 RSI: 0000000000000000 RDI: 00000001547f175c
> >[  157.057215] RBP: ffff88bf7fae32c0 R08: 000000000001e358 R09: ffff88810fb9f000
> >[  157.064410] R10: ffffc9000c5bbe08 R11: ffff88810fb9f5c4 R12: 0000000000000000
> >[  157.071611] R13: ffff88bf4e3ea0c0 R14: 0000000000000000 R15: ffff88bf4e3ea7a8
> >[  157.078814] FS:  0000000000000000(0000) GS:ffff88bf7f5c0000(0000) knlGS:0000000000000000
> >[  157.086977] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >[  157.092779] CR2: 0000000000000058 CR3: 000000000220e005 CR4: 00000000003606e0
> >[  157.099979] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >[  157.109529] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >[  157.119058] Call Trace:
> >[  157.123865]  pick_next_entity+0x61/0x110
> >[  157.130137]  pick_task_fair+0x4b/0x90
> >[  157.136124]  __schedule+0x365/0x12c0
> >[  157.141985]  schedule_idle+0x1e/0x40
> >[  157.147822]  do_idle+0x166/0x280
> >[  157.153275]  cpu_startup_entry+0x19/0x20
> >[  157.159420]  start_secondary+0x17a/0x1d0
> >[  157.165568]  secondary_startup_64+0xa4/0xb0
> >[  157.171985] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs msr intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crc32_pclmul ghash_clmulni_intel ixgbe aesni_intel xfrm_algo iTCO_wdt joydev iTCO_vendor_support libphy igb aes_x86_64 crypto_simd ptp cryptd mei_me mdio pps_core ioatdma glue_helper pcspkr ipmi_si lpc_ich i2c_i801 mei dca ipmi_devintf ipmi_msghandler acpi_pad pcc_cpufreq button btrfs libcrc32c xor zstd_decompress zstd_compress raid6_pq hid_generic usbhid ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci crc32c_intel ehci_pci ttm xhci_hcd ehci_hcd drm ahci usbcore mpt3sas libahci raid_class scsi_transport_sas wmi sg nbd dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
> >[  157.258990] CR2: 0000000000000058
> >[  157.264961] ---[ end trace a301ac5e3ee86fde ]---
> >[  157.283719] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
> >[  157.291967] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
> >[  157.316121] RSP: 0018:ffffc9000c5bbde8 EFLAGS: 00010086
> >[  157.324060] RAX: ffff88810f5f6a00 RBX: 00000001547f175c RCX: 0000000000000001
> >[  157.333932] RDX: ffff88bf3bdb0a40 RSI: 0000000000000000 RDI: 00000001547f175c
> >[  157.343795] RBP: ffff88bf7fae32c0 R08: 000000000001e358 R09: ffff88810fb9f000
> >[  157.353634] R10: ffffc9000c5bbe08 R11: ffff88810fb9f5c4 R12: 0000000000000000
> >[  157.363506] R13: ffff88bf4e3ea0c0 R14: 0000000000000000 R15: ffff88bf4e3ea7a8
> >[  157.373395] FS:  0000000000000000(0000) GS:ffff88bf7f5c0000(0000) knlGS:0000000000000000
> >[  157.384238] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >[  157.392709] CR2: 0000000000000058 CR3: 000000000220e005 CR4: 00000000003606e0
> >[  157.402601] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >[  157.412488] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >[  157.422334] Kernel panic - not syncing: Attempted to kill the idle task!
> >[  158.529804] Shutting down cpus with NMI
> >[  158.573249] Kernel Offset: disabled
> >[  158.586198] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
> >
> >RIP translates to kernel/sched/fair.c:6819
> >
> >static int
> >wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> >{
> >         s64 gran, vdiff = curr->vruntime - se->vruntime; /* LINE 6819 */
> >
> >         if (vdiff <= 0)
> >                 return -1;
> >
> >         gran = wakeup_gran(se);
> >         if (vdiff > gran)
> >                 return 1;
> >}
> >
> >I haven't tried debugging it yet.
> >
> I think the following fix, while trivial, is the right fix for the NULL
> dereference in this case. This bug is reproducible with patch 14. I

I assume you meant patch 4?

My understanding is, this is due to 'left' being NULL in
pick_next_entity().

With patch 4, in pick_task_fair(), pick_next_entity() can be called with
an empty rbtree of cfs_rq and with a NULL 'curr'. This resulted in a
NULL 'left'. Before patch 4, this can't happen.
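
A minimal user-space sketch of that case, assuming a heavily simplified
pick_next_entity() with only the "next" buddy check. The toy_* names, the
leftmost field standing in for the rbtree lookup, and the TOY_GRAN value are
all hypothetical. It shows how an empty tree plus a NULL 'curr' leaves 'left'
NULL, and how the `left &&` guard from Subhra's fix keeps the buddy check from
dereferencing it.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_GRAN 10	/* made-up wakeup granularity */

struct toy_se {
	long long vruntime;
};

struct toy_cfs_rq {
	struct toy_se *leftmost;	/* NULL when the rbtree is empty */
	struct toy_se *next;		/* "next" buddy, may still be set */
};

static int toy_wakeup_preempt_entity(const struct toy_se *curr,
				     const struct toy_se *se)
{
	long long vdiff;

	/* The reported oops is this dereference with a NULL second argument. */
	assert(curr && se);
	vdiff = curr->vruntime - se->vruntime;

	if (vdiff <= 0)
		return -1;
	return vdiff > TOY_GRAN ? 1 : 0;
}

struct toy_se *toy_pick_next_entity(struct toy_cfs_rq *cfs_rq,
				    struct toy_se *curr)
{
	struct toy_se *left = cfs_rq->leftmost;
	struct toy_se *se;

	/* Stays NULL if the tree is empty and curr is NULL. */
	if (!left || (curr && curr->vruntime < left->vruntime))
		left = curr;

	se = left;

	/* Guarded buddy check: skipped entirely when there is no 'left'. */
	if (left && cfs_rq->next &&
	    toy_wakeup_preempt_entity(cfs_rq->next, left) < 1)
		se = cfs_rq->next;

	return se;
}

/* Empty tree, NULL curr, buddy still set: returns 1 if the pick is NULL. */
int toy_demo_empty_rq(void)
{
	struct toy_se buddy = { .vruntime = 100 };
	struct toy_cfs_rq cfs_rq = { .leftmost = NULL, .next = &buddy };

	return toy_pick_next_entity(&cfs_rq, NULL) == NULL;
}
```

With the guard in place the sketch returns NULL for this input instead of
faulting, so a caller in this model still has to cope with a NULL pick.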

It's not clear to me why NULL is used instead of 'curr' for
pick_next_entity() in pick_task_fair(). My first thought is, 'curr' will
not be considered as next entity, but then 'curr' is checked after
pick_next_entity() returns so this shouldn't be the reason. Guess I
missed something.

Thanks,
Aaron

> also did
> some performance bisecting and with patch 14 performance is
> decimated, that's
> expected. Most of the performance recovery happens in patch 15 which,
> unfortunately, is also the one that introduces the hard lockup.
> 
> -------8<-----------
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d0dac4..ecadf36 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4131,7 +4131,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct
> sched_entity *curr)
>          * Avoid running the skip buddy, if running something else can
>          * be done without getting too unfair.
> */
> -       if (cfs_rq->skip == se) {
> +       if (cfs_rq->skip && cfs_rq->skip == se) {
>                 struct sched_entity *second;
> 
>                 if (se == curr) {
> @@ -4149,13 +4149,15 @@ pick_next_entity(struct cfs_rq *cfs_rq,
> struct sched_entity *curr)
> /*
>          * Prefer last buddy, try to return the CPU to a preempted task.
> */
> -       if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
> +       if (left && cfs_rq->last &&
> wakeup_preempt_entity(cfs_rq->last, left)
> +           < 1)
>                 se = cfs_rq->last;
> 
> /*
>          * Someone really wants this to run. If it's not unfair, run it.
> */
> -       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
> +       if (left && cfs_rq->next &&
> wakeup_preempt_entity(cfs_rq->next, left)
> +           < 1)
>                 se = cfs_rq->next;
> 
>         clear_buddies(cfs_rq, se);
> 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 00/16] sched: Core scheduling
  2019-03-26  7:32       ` Aaron Lu
@ 2019-03-26  7:56         ` Aaron Lu
  0 siblings, 0 replies; 99+ messages in thread
From: Aaron Lu @ 2019-03-26  7:56 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Tim Chen, Linux List Kernel Mailing, Linus Torvalds,
	Frédéric Weisbecker, Kees Cook, kerrnel

On Tue, Mar 26, 2019 at 03:32:12PM +0800, Aaron Lu wrote:
> On Fri, Mar 08, 2019 at 11:44:01AM -0800, Subhra Mazumdar wrote:
> > 
> > On 2/22/19 4:45 AM, Mel Gorman wrote:
> > >On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote:
> > >>On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >>>However; whichever way around you turn this cookie; it is expensive and nasty.
> > >>Do you (or anybody else) have numbers for real loads?
> > >>
> > >>Because performance is all that matters. If performance is bad, then
> > >>it's pointless, since just turning off SMT is the answer.
> > >>
> > >I tried to do a comparison between tip/master, ht disabled and this series
> > >putting test workloads into a tagged cgroup but unfortunately it failed
> > >
> > >[  156.978682] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
> > >[  156.986597] #PF error: [normal kernel read fault]
> > >[  156.991343] PGD 0 P4D 0
> > >[  156.993905] Oops: 0000 [#1] SMP PTI
> > >[  156.997438] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 5.0.0-rc7-schedcore-v1r1 #1
> > >[  157.005161] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
> > >[  157.012896] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
> > >[  157.018613] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00
> > >  53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
> > >[  157.037544] RSP: 0018:ffffc9000c5bbde8 EFLAGS: 00010086
> > >[  157.042819] RAX: ffff88810f5f6a00 RBX: 00000001547f175c RCX: 0000000000000001
> > >[  157.050015] RDX: ffff88bf3bdb0a40 RSI: 0000000000000000 RDI: 00000001547f175c
> > >[  157.057215] RBP: ffff88bf7fae32c0 R08: 000000000001e358 R09: ffff88810fb9f000
> > >[  157.064410] R10: ffffc9000c5bbe08 R11: ffff88810fb9f5c4 R12: 0000000000000000
> > >[  157.071611] R13: ffff88bf4e3ea0c0 R14: 0000000000000000 R15: ffff88bf4e3ea7a8
> > >[  157.078814] FS:  0000000000000000(0000) GS:ffff88bf7f5c0000(0000) knlGS:0000000000000000
> > >[  157.086977] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >[  157.092779] CR2: 0000000000000058 CR3: 000000000220e005 CR4: 00000000003606e0
> > >[  157.099979] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >[  157.109529] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > >[  157.119058] Call Trace:
> > >[  157.123865]  pick_next_entity+0x61/0x110
> > >[  157.130137]  pick_task_fair+0x4b/0x90
> > >[  157.136124]  __schedule+0x365/0x12c0
> > >[  157.141985]  schedule_idle+0x1e/0x40
> > >[  157.147822]  do_idle+0x166/0x280
> > >[  157.153275]  cpu_startup_entry+0x19/0x20
> > >[  157.159420]  start_secondary+0x17a/0x1d0
> > >[  157.165568]  secondary_startup_64+0xa4/0xb0
> > >[  157.171985] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs msr intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crc32_pclmul ghash_clmulni_intel ixgbe aesni_intel xfrm_algo iTCO_wdt joydev iTCO_vendor_support libphy igb aes_x86_64 crypto_simd ptp cryptd mei_me mdio pps_core ioatdma glue_helper pcspkr ipmi_si lpc_ich i2c_i801 mei dca ipmi_devintf ipmi_msghandler acpi_pad pcc_cpufreq button btrfs libcrc32c xor zstd_decompress zstd_compress raid6_pq hid_generic usbhid ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci crc32c_intel ehci_pci ttm xhci_hcd ehci_hcd drm ahci usbcore mpt3sas libahci raid_class scsi_transport_sas wmi sg nbd dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
> > >[  157.258990] CR2: 0000000000000058
> > >[  157.264961] ---[ end trace a301ac5e3ee86fde ]---
> > >[  157.283719] RIP: 0010:wakeup_preempt_entity.isra.70+0x9/0x50
> > >[  157.291967] Code: 00 be c0 82 60 00 e9 86 02 1a 00 66 0f 1f 44 00 00 48 c1 e7 03 be c0 80 60 00 e9 72 02 1a 00 66 90 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 58 48 85 db 7e 2c 48 81 3e 00 00 10 00 8b 05 a9 b7 19 01
> > >[  157.316121] RSP: 0018:ffffc9000c5bbde8 EFLAGS: 00010086
> > >[  157.324060] RAX: ffff88810f5f6a00 RBX: 00000001547f175c RCX: 0000000000000001
> > >[  157.333932] RDX: ffff88bf3bdb0a40 RSI: 0000000000000000 RDI: 00000001547f175c
> > >[  157.343795] RBP: ffff88bf7fae32c0 R08: 000000000001e358 R09: ffff88810fb9f000
> > >[  157.353634] R10: ffffc9000c5bbe08 R11: ffff88810fb9f5c4 R12: 0000000000000000
> > >[  157.363506] R13: ffff88bf4e3ea0c0 R14: 0000000000000000 R15: ffff88bf4e3ea7a8
> > >[  157.373395] FS:  0000000000000000(0000) GS:ffff88bf7f5c0000(0000) knlGS:0000000000000000
> > >[  157.384238] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >[  157.392709] CR2: 0000000000000058 CR3: 000000000220e005 CR4: 00000000003606e0
> > >[  157.402601] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >[  157.412488] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > >[  157.422334] Kernel panic - not syncing: Attempted to kill the idle task!
> > >[  158.529804] Shutting down cpus with NMI
> > >[  158.573249] Kernel Offset: disabled
> > >[  158.586198] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
> > >
> > >RIP translates to kernel/sched/fair.c:6819
> > >
> > >static int
> > >wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> > >{
> > >         s64 gran, vdiff = curr->vruntime - se->vruntime; /* LINE 6819 */
> > >
> > >         if (vdiff <= 0)
> > >                 return -1;
> > >
> > >         gran = wakeup_gran(se);
> > >         if (vdiff > gran)
> > >                 return 1;
> > >
> > >         return 0;
> > >}
> > >
> > >I haven't tried debugging it yet.
> > >
> > I think the following fix, while trivial, is the right fix for the NULL
> > dereference in this case. This bug is reproducible with patch 14. I
> 
> I assume you meant patch 4?

Correction, should be patch 9 where pick_task_fair() is introduced.

Thanks,
Aaron

> 
> My understanding is, this is due to 'left' being NULL in
> pick_next_entity().
> 
> With patch 4, in pick_task_fair(), pick_next_entity() can be called with
> an empty rbtree of cfs_rq and with a NULL 'curr'. This resulted in a
> NULL 'left'. Before patch 4, this can't happen.
> 
> It's not clear to me why NULL is used instead of 'curr' for
> pick_next_entity() in pick_task_fair(). My first thought is, 'curr' will
> not be considered as next entity, but then 'curr' is checked after
> pick_next_entity() returns so this shouldn't be the reason. Guess I
> missed something.
> 
> Thanks,
> Aaron
> 
> > also did
> > some performance bisecting and with patch 14 performance is
> > decimated, that's
> > expected. Most of the performance recovery happens in patch 15 which,
> > unfortunately, is also the one that introduces the hard lockup.
> > 
> > -------8<-----------
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 1d0dac4..ecadf36 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4131,7 +4131,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct
> > sched_entity *curr)
> >          * Avoid running the skip buddy, if running something else can
> >          * be done without getting too unfair.
> > */
> > -       if (cfs_rq->skip == se) {
> > +       if (cfs_rq->skip && cfs_rq->skip == se) {
> >                 struct sched_entity *second;
> > 
> >                 if (se == curr) {
> > @@ -4149,13 +4149,15 @@ pick_next_entity(struct cfs_rq *cfs_rq,
> > struct sched_entity *curr)
> > /*
> >          * Prefer last buddy, try to return the CPU to a preempted task.
> > */
> > -       if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
> > +       if (left && cfs_rq->last &&
> > wakeup_preempt_entity(cfs_rq->last, left)
> > +           < 1)
> >                 se = cfs_rq->last;
> > 
> > /*
> >          * Someone really wants this to run. If it's not unfair, run it.
> > */
> > -       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
> > +       if (left && cfs_rq->next &&
> > wakeup_preempt_entity(cfs_rq->next, left)
> > +           < 1)
> >                 se = cfs_rq->next;
> > 
> >         clear_buddies(cfs_rq, se);
> > 
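
For illustration only (this is not kernel code): a minimal user-space sketch
of the guard in the quoted patch. The toy_ names, the simplified entity struct
and the fixed TOY_WAKEUP_GRAN value are assumptions made for this example; the
kernel derives the granularity per entity via wakeup_gran(). The sketch only
shows why testing 'left' before calling the comparison avoids the NULL
dereference when the rbtree is empty.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct sched_entity. */
struct toy_entity {
	uint64_t vruntime;
};

/* Assumed fixed wakeup granularity; the kernel computes this per entity. */
#define TOY_WAKEUP_GRAN 1000

/* Same return convention as wakeup_preempt_entity(): -1, 0 or 1. */
static int toy_wakeup_preempt_entity(const struct toy_entity *curr,
				     const struct toy_entity *se)
{
	int64_t vdiff = (int64_t)(curr->vruntime - se->vruntime);

	if (vdiff <= 0)
		return -1;
	if (vdiff > TOY_WAKEUP_GRAN)
		return 1;
	return 0;
}

/*
 * Choose between the leftmost entity and a buddy. When the tree is empty
 * (left == NULL) the comparison is skipped and NULL is returned, instead
 * of passing a NULL pointer to the comparison.
 */
static const struct toy_entity *
toy_pick(const struct toy_entity *left, const struct toy_entity *buddy)
{
	const struct toy_entity *se = left;

	if (left && buddy && toy_wakeup_preempt_entity(buddy, left) < 1)
		se = buddy;

	return se;
}
```

With an empty tree toy_pick() returns NULL, so the caller still has to cope
with "nothing to pick"; the guard only keeps the comparison itself safe.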

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-03-23  0:06         ` Subhra Mazumdar
@ 2019-03-27  1:02           ` Subhra Mazumdar
  2019-03-29 13:35           ` Julien Desfossez
  1 sibling, 0 replies; 99+ messages in thread
From: Subhra Mazumdar @ 2019-03-27  1:02 UTC (permalink / raw)
  To: Julien Desfossez, Peter Zijlstra, mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel, Vineeth Pillai,
	Nishanth Aravamudan


On 3/22/19 5:06 PM, Subhra Mazumdar wrote:
>
> On 3/21/19 2:20 PM, Julien Desfossez wrote:
>> On Tue, Mar 19, 2019 at 10:31 PM Subhra Mazumdar 
>> <subhra.mazumdar@oracle.com>
>> wrote:
>>> On 3/18/19 8:41 AM, Julien Desfossez wrote:
>>>
>> On further investigation, we could see that the contention is mostly 
>> in the
>> way rq locks are taken. With this patchset, we lock the whole core if
>> cpu.tag is set for at least one cgroup. Due to this, __schedule() is 
>> more or
>> less serialized for the core and that attributes to the performance loss
>> that we are seeing. We also saw that newidle_balance() takes 
>> considerably
>> long time in load_balance() due to the rq spinlock contention. Do you 
>> think
>> it would help if the core-wide locking was only performed when 
>> absolutely
>> needed ?
>>
> Is the core wide lock primarily responsible for the regression? I ran 
> upto patch
> 12 which also has the core wide lock for tagged cgroups and also calls
> newidle_balance() from pick_next_task(). I don't see any regression.  
> Of course
> the core sched version of pick_next_task() may be doing more but 
> comparing with
> the __pick_next_task() it doesn't look too horrible.
I gathered some data with only 1 DB instance running (which also has a 52%
slowdown). Following are the numbers of pick_next_task() calls and their avg
cost for patch 12 and patch 15. The total number of calls seems to be similar,
but the avg cost (in us) has more than doubled. For both patches I had put
the DB instance into a cpu tagged cgroup.

                              patch12        patch15
count pick_next_task          62317898       58925395
avg cost pick_next_task (us)  0.6566323209   1.4223810108
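
To put these numbers in perspective, a small sketch of the arithmetic, using
only the call counts and average costs reported above. The helper names are
made up for this example. Multiplying count by average cost gives roughly 41
seconds spent in pick_next_task() with patch 12 and roughly 84 seconds with
patch 15, i.e. the per-call cost ratio is a bit above 2.

```c
#include <stdint.h>

/*
 * Total time in seconds spent in a function, given the number of calls
 * and the average cost per call in microseconds.
 */
static double total_seconds(uint64_t calls, double avg_cost_us)
{
	return (double)calls * avg_cost_us / 1e6;
}

/* How many times more expensive a call became. */
static double cost_ratio(double before_us, double after_us)
{
	return after_us / before_us;
}
```

For example, total_seconds(62317898, 0.6566323209) gives the patch 12 total
and cost_ratio(0.6566323209, 1.4223810108) gives the per-call slowdown.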

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-03-23  0:06         ` Subhra Mazumdar
  2019-03-27  1:02           ` Subhra Mazumdar
@ 2019-03-29 13:35           ` Julien Desfossez
  2019-03-29 22:23             ` Subhra Mazumdar
  1 sibling, 1 reply; 99+ messages in thread
From: Julien Desfossez @ 2019-03-29 13:35 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: Julien Desfossez, Peter Zijlstra, mingo, tglx, pjt, tim.c.chen,
	torvalds, linux-kernel, fweisbec, keescook, kerrnel,
	Vineeth Pillai, Nishanth Aravamudan

On Fri, Mar 22, 2019 at 8:09 PM Subhra Mazumdar <subhra.mazumdar@oracle.com>
wrote:
> Is the core wide lock primarily responsible for the regression? I ran
> upto patch
> 12 which also has the core wide lock for tagged cgroups and also calls
> newidle_balance() from pick_next_task(). I don't see any regression.  Of
> course
> the core sched version of pick_next_task() may be doing more but
> comparing with
> the __pick_next_task() it doesn't look too horrible.

On further testing and investigation, we also agree that spinlock contention
is not the major cause for the regression, but we feel that it should be one
of the major contributing factors to this performance loss.

To reduce the scope of the investigation of the performance regression, we
designed a couple of smaller test cases (compared to big VMs running complex
benchmarks) and it turns out the test case that is most impacted is a simple
disk write-intensive case (up to 99% performance drop). CPU-intensive and
scheduler-intensive tests (perf bench sched) behave pretty well.

On the same server we used before (2x18 cores, 72 hardware threads), with
all the non-essential services disabled, we set up a cpuset of 4 cores (8
hardware threads) and ran sysbench fileio on a dedicated drive (no RAID).
With sysbench running with 8 threads in this cpuset without core scheduling,
we get about 155.23 MiB/s in sequential write. If we enable the tag, we drop
to 0.25 MiB/s. Interestingly, even with 4 threads, we see the same kind of
performance drop.

Command used:

sysbench --test=fileio prepare
cgexec -g cpu,cpuset:test sysbench --threads=4 --test=fileio \
--file-test-mode=seqwr run

If we run this with the data in a ramdisk instead of a real drive, we don’t
notice any drop. The size of the performance drop varies somewhat with
the machine, but it’s always significant.

We spent a lot of time in the trace and noticed that a couple of times during
every run, the sysbench worker threads wait for IO, sometimes for up to 4
seconds. All the threads wait for the same duration, and during that time we
don’t see any block-related softirq coming in. As soon as the interrupt is
processed, sysbench gets woken up immediately. This long wait never happens
without core scheduling, so we are trying to see if there is a place
where interrupts are disabled for an extended period of time. The
irqsoff tracer doesn’t seem to pick it up.

Any thoughts about that ?

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-03-29 13:35           ` Julien Desfossez
@ 2019-03-29 22:23             ` Subhra Mazumdar
  2019-04-01 21:35               ` Subhra Mazumdar
  2019-04-02  7:42               ` Peter Zijlstra
  0 siblings, 2 replies; 99+ messages in thread
From: Subhra Mazumdar @ 2019-03-29 22:23 UTC (permalink / raw)
  To: Julien Desfossez
  Cc: Peter Zijlstra, mingo, tglx, pjt, tim.c.chen, torvalds,
	linux-kernel, fweisbec, keescook, kerrnel, Vineeth Pillai,
	Nishanth Aravamudan


On 3/29/19 6:35 AM, Julien Desfossez wrote:
> On Fri, Mar 22, 2019 at 8:09 PM Subhra Mazumdar <subhra.mazumdar@oracle.com>
> wrote:
>> Is the core wide lock primarily responsible for the regression? I ran
>> upto patch
>> 12 which also has the core wide lock for tagged cgroups and also calls
>> newidle_balance() from pick_next_task(). I don't see any regression.  Of
>> course
>> the core sched version of pick_next_task() may be doing more but
>> comparing with
>> the __pick_next_task() it doesn't look too horrible.
> On further testing and investigation, we also agree that spinlock contention
> is not the major cause for the regression, but we feel that it should be one
> of the major contributing factors to this performance loss.
>
>
I finally did some code bisection and found that the following lines are
basically responsible for the regression. With them commented out I don't see
the regressions. Can you confirm? I have yet to figure out whether this is
needed for the correctness of core scheduling and, if so, whether we can do
this better.

-------->8-------------

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe3918c..3b3388a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3741,8 +3741,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
                                  * If there weren't no cookies; we don't need
                                  * to bother with the other siblings.
                                  */
-                               if (i == cpu && !rq->core->core_cookie)
-                                       goto next_class;
+                               //if (i == cpu && !rq->core->core_cookie)
+                                       //goto next_class;

                                 continue;
                         }

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-03-29 22:23             ` Subhra Mazumdar
@ 2019-04-01 21:35               ` Subhra Mazumdar
  2019-04-03 20:16                 ` Julien Desfossez
  2019-04-02  7:42               ` Peter Zijlstra
  1 sibling, 1 reply; 99+ messages in thread
From: Subhra Mazumdar @ 2019-04-01 21:35 UTC (permalink / raw)
  To: Julien Desfossez
  Cc: Peter Zijlstra, mingo, tglx, pjt, tim.c.chen, torvalds,
	linux-kernel, fweisbec, keescook, kerrnel, Vineeth Pillai,
	Nishanth Aravamudan


On 3/29/19 3:23 PM, Subhra Mazumdar wrote:
>
> On 3/29/19 6:35 AM, Julien Desfossez wrote:
>> On Fri, Mar 22, 2019 at 8:09 PM Subhra Mazumdar 
>> <subhra.mazumdar@oracle.com>
>> wrote:
>>> Is the core wide lock primarily responsible for the regression? I ran
>>> upto patch
>>> 12 which also has the core wide lock for tagged cgroups and also calls
>>> newidle_balance() from pick_next_task(). I don't see any 
>>> regression.  Of
>>> course
>>> the core sched version of pick_next_task() may be doing more but
>>> comparing with
>>> the __pick_next_task() it doesn't look too horrible.
>> On further testing and investigation, we also agree that spinlock 
>> contention
>> is not the major cause for the regression, but we feel that it should 
>> be one
>> of the major contributing factors to this performance loss.
>>
>>
> I finally did some code bisection and found the following lines are
> basically responsible for the regression. Commenting them out I don't see
> the regressions. Can you confirm? I am yet to figure if this is needed 
> for
> the correctness of core scheduling and if so can we do this better?
>
> -------->8-------------
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fe3918c..3b3388a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3741,8 +3741,8 @@ pick_next_task(struct rq *rq, struct task_struct 
> *prev, struct rq_flags *rf)
>                                  * If there weren't no cookies; we 
> don't need
>                                  * to bother with the other siblings.
> */
> -                               if (i == cpu && !rq->core->core_cookie)
> -                                       goto next_class;
> +                               //if (i == cpu && !rq->core->core_cookie)
> +                                       //goto next_class;
>
> continue;
>                         }
AFAICT this condition is not needed for correctness as cookie matching will
still be enforced. Peter, any thoughts? I get the following numbers with 1 DB
and 2 DB instances.

1 DB instance
users  baseline  %idle  core_sched  %idle
16     1         84     -5.5%       84
24     1         76     -5%         76
32     1         69     -0.45%      69

2 DB instances
users  baseline  %idle  core_sched  %idle
16     1         66     -23.8%      69
24     1         54     -3.1%       57
32     1         42     -21.1%      48

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-03-29 22:23             ` Subhra Mazumdar
  2019-04-01 21:35               ` Subhra Mazumdar
@ 2019-04-02  7:42               ` Peter Zijlstra
  1 sibling, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-04-02  7:42 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: Julien Desfossez, mingo, tglx, pjt, tim.c.chen, torvalds,
	linux-kernel, fweisbec, keescook, kerrnel, Vineeth Pillai,
	Nishanth Aravamudan

On Fri, Mar 29, 2019 at 03:23:14PM -0700, Subhra Mazumdar wrote:
> 
> On 3/29/19 6:35 AM, Julien Desfossez wrote:
> > On Fri, Mar 22, 2019 at 8:09 PM Subhra Mazumdar <subhra.mazumdar@oracle.com>
> > wrote:
> > > Is the core wide lock primarily responsible for the regression? I ran
> > > upto patch
> > > 12 which also has the core wide lock for tagged cgroups and also calls
> > > newidle_balance() from pick_next_task(). I don't see any regression.  Of
> > > course
> > > the core sched version of pick_next_task() may be doing more but
> > > comparing with
> > > the __pick_next_task() it doesn't look too horrible.
> > On further testing and investigation, we also agree that spinlock contention
> > is not the major cause for the regression, but we feel that it should be one
> > of the major contributing factors to this performance loss.
> > 
> > 
> I finally did some code bisection and found the following lines are
> basically responsible for the regression. Commenting them out I don't see
> the regressions. Can you confirm? I am yet to figure if this is needed for
> the correctness of core scheduling and if so can we do this better?

It was meant to be an optimization; specifically, when no cookie was
set, don't bother to schedule the sibling(s).

> -------->8-------------
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fe3918c..3b3388a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3741,8 +3741,8 @@ pick_next_task(struct rq *rq, struct task_struct
> *prev, struct rq_flags *rf)
>                                  * If there weren't no cookies; we don't
> need
>                                  * to bother with the other siblings.
> */
> -                               if (i == cpu && !rq->core->core_cookie)
> -                                       goto next_class;
> +                               //if (i == cpu && !rq->core->core_cookie)
> +                                       //goto next_class;
> 
> continue;
>                         }

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
       [not found]   ` <20190402064612.GA46500@aaronlu>
@ 2019-04-02  8:28     ` Peter Zijlstra
  2019-04-02 13:20       ` Aaron Lu
                         ` (2 more replies)
  0 siblings, 3 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-04-02  8:28 UTC (permalink / raw)
  To: Aaron Lu
  Cc: mingo, tglx, pjt, tim.c.chen, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel, Aubrey Li

On Tue, Apr 02, 2019 at 02:46:13PM +0800, Aaron Lu wrote:
> On Mon, Feb 18, 2019 at 05:56:33PM +0100, Peter Zijlstra wrote:

> > +static struct task_struct *
> > +pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
> > +{
> > +	struct task_struct *class_pick, *cookie_pick;
> > +	unsigned long cookie = 0UL;
> > +
> > +	/*
> > +	 * We must not rely on rq->core->core_cookie here, because we fail to reset
> > +	 * rq->core->core_cookie on new picks, such that we can detect if we need
> > +	 * to do single vs multi rq task selection.
> > +	 */
> > +
> > +	if (max && max->core_cookie) {
> > +		WARN_ON_ONCE(rq->core->core_cookie != max->core_cookie);
> > +		cookie = max->core_cookie;
> > +	}
> > +
> > +	class_pick = class->pick_task(rq);
> > +	if (!cookie)
> > +		return class_pick;
> > +
> > +	cookie_pick = sched_core_find(rq, cookie);
> > +	if (!class_pick)
> > +		return cookie_pick;
> > +
> > +	/*
> > +	 * If class > max && class > cookie, it is the highest priority task on
> > +	 * the core (so far) and it must be selected, otherwise we must go with
> > +	 * the cookie pick in order to satisfy the constraint.
> > +	 */
> > +	if (cpu_prio_less(cookie_pick, class_pick) && cpu_prio_less(max, class_pick))
> > +		return class_pick;
> 
> I have a question about the use of cpu_prio_less(max, class_pick) here
> and core_prio_less(max, p) below in pick_next_task().
> 
> Assume cpu_prio_less(max, class_pick) thinks class_pick has higher
> priority and class_pick is returned here. Then in pick_next_task(),
> core_prio_less(max, p) is used to decide if max should be replaced.
> Since core_prio_less(max, p) doesn't compare vruntime, it could return
> false for this class_pick and the same max. Then max isn't replaced
> and we could end up scheduling two processes belonging to two different
> cgroups...

> > +
> > +	return cookie_pick;
> > +}
> > +
> > +static struct task_struct *
> > +pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > +{

> > +			/*
> > +			 * If this new candidate is of higher priority than the
> > +			 * previous; and they're incompatible; we need to wipe
> > +			 * the slate and start over.
> > +			 *
> > +			 * NOTE: this is a linear max-filter and is thus bounded
> > +			 * in execution time.
> > +			 */
> > +			if (!max || core_prio_less(max, p)) {
> 
> This is the place to decide if max should be replaced.

Hummm.... very good spotting that. Yes, I'm afraid you're very much
right about this.

> Perhaps we can test if max is on the same cpu as class_pick and then
> use cpu_prio_less() or core_prio_less() accordingly here, or just
> replace core_prio_less(max, p) with cpu_prio_less(max, p) in
> pick_next_task(). The 2nd obviously breaks the comment of
> core_prio_less() though: /* cannot compare vruntime across CPUs */.

Right, so as the comment states, you cannot directly compare vruntime
across CPUs, doing that is completely buggered.

That also means that the cpu_prio_less(max, class_pick) in pick_task()
is buggered, because there is no saying @max is on this CPU to begin
with.

Changing that to core_prio_less() doesn't fix this though.

> I'm still evaluating, your comments are appreciated.

We could change the above condition to:

		if (!max || !cookie_match(max, p))

I suppose. But please double check the thinking.



> > +				struct task_struct *old_max = max;
> > +
> > +				rq->core->core_cookie = p->core_cookie;
> > +				max = p;
> > +
> > +				if (old_max && !cookie_match(old_max, p)) {
> > +					for_each_cpu(j, smt_mask) {
> > +						if (j == i)
> > +							continue;
> > +
> > +						cpu_rq(j)->core_pick = NULL;
> > +					}
> > +					goto again;
> > +				}
> > +			}
> > +		}
> > +next_class:;
> > +	}


Another approach would be something like the below:


--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -87,7 +87,7 @@ static inline int __task_prio(struct tas
  */
 
 /* real prio, less is less */
-static inline bool __prio_less(struct task_struct *a, struct task_struct *b, bool runtime)
+static inline bool __prio_less(struct task_struct *a, struct task_struct *b, u64 vruntime)
 {
 	int pa = __task_prio(a), pb = __task_prio(b);
 
@@ -104,21 +104,25 @@ static inline bool __prio_less(struct ta
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE && runtime) /* fair */
-		return !((s64)(a->se.vruntime - b->se.vruntime) < 0);
+	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+		return !((s64)(a->se.vruntime - vruntime) < 0);
 
 	return false;
 }
 
 static inline bool cpu_prio_less(struct task_struct *a, struct task_struct *b)
 {
-	return __prio_less(a, b, true);
+	return __prio_less(a, b, b->se.vruntime);
 }
 
 static inline bool core_prio_less(struct task_struct *a, struct task_struct *b)
 {
-	/* cannot compare vruntime across CPUs */
-	return __prio_less(a, b, false);
+	u64 vruntime = b->se.vruntime;
+
+	vruntime -= task_rq(b)->cfs.min_vruntime;
+	vruntime += task_rq(a)->cfs.min_vruntime;
+
+	return __prio_less(a, b, vruntime);
 }
 
 static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
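
The idea behind the core_prio_less() change above can be sketched in user
space as follows. This is only an illustration under assumptions: toy_rq,
toy_task and the toy_ functions are hypothetical stand-ins, and the real code
checks priorities and scheduling classes before it ever looks at vruntime.
The sketch rebases b's vruntime from its own runqueue's min_vruntime onto
a's before comparing, so only the offset relative to each queue matters.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the per-CPU runqueue and a fair task. */
struct toy_rq {
	uint64_t min_vruntime;
};

struct toy_task {
	uint64_t vruntime;
	const struct toy_rq *rq;
};

/* Naive comparison: only meaningful when a and b share a runqueue. */
static bool toy_raw_prio_less(const struct toy_task *a,
			      const struct toy_task *b)
{
	return !((int64_t)(a->vruntime - b->vruntime) < 0);
}

/*
 * Cross-runqueue comparison: translate b's vruntime into a's timeline
 * by swapping the min_vruntime base, then compare as usual.
 * Returns true when a has lower priority than b.
 */
static bool toy_core_prio_less(const struct toy_task *a,
			       const struct toy_task *b)
{
	uint64_t vruntime = b->vruntime;

	vruntime -= b->rq->min_vruntime;
	vruntime += a->rq->min_vruntime;

	return !((int64_t)(a->vruntime - vruntime) < 0);
}
```

When the two runqueues have very different min_vruntime values, the raw
comparison mostly reflects the difference between the queues, while the
rebased one compares how far each task is ahead of its own queue.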

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-04-02  8:28     ` Peter Zijlstra
@ 2019-04-02 13:20       ` Aaron Lu
  2019-04-05 14:55       ` Aaron Lu
  2019-04-16 13:43       ` Aaron Lu
  2 siblings, 0 replies; 99+ messages in thread
From: Aaron Lu @ 2019-04-02 13:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Aaron Lu, mingo, tglx, pjt, tim.c.chen, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel, Aubrey Li

On Tue, Apr 02, 2019 at 10:28:12AM +0200, Peter Zijlstra wrote:
> Another approach would be something like the below:
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -87,7 +87,7 @@ static inline int __task_prio(struct tas
>   */
>  
>  /* real prio, less is less */
> -static inline bool __prio_less(struct task_struct *a, struct task_struct *b, bool runtime)
> +static inline bool __prio_less(struct task_struct *a, struct task_struct *b, u64 vruntime)
>  {
>  	int pa = __task_prio(a), pb = __task_prio(b);
>  
> @@ -104,21 +104,25 @@ static inline bool __prio_less(struct ta
>  	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
>  		return !dl_time_before(a->dl.deadline, b->dl.deadline);
>  
> -	if (pa == MAX_RT_PRIO + MAX_NICE && runtime) /* fair */
> -		return !((s64)(a->se.vruntime - b->se.vruntime) < 0);
> +	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
> +		return !((s64)(a->se.vruntime - vruntime) < 0);
>  
>  	return false;
>  }
>  
>  static inline bool cpu_prio_less(struct task_struct *a, struct task_struct *b)
>  {
> -	return __prio_less(a, b, true);
> +	return __prio_less(a, b, b->se.vruntime);
>  }
>  
>  static inline bool core_prio_less(struct task_struct *a, struct task_struct *b)
>  {
> -	/* cannot compare vruntime across CPUs */
> -	return __prio_less(a, b, false);
> +	u64 vruntime = b->se.vruntime;
> +
> +	vruntime -= task_rq(b)->cfs.min_vruntime;
> +	vruntime += task_rq(a)->cfs.min_vruntime;
> +
> +	return __prio_less(a, b, vruntime);
>  }
>  
>  static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)

Brilliant, I like this approach; it makes core_prio_less() work across
CPUs. I tested this together with changing
cpu_prio_less(max, class_pick) to core_prio_less(max, class_pick) in
pick_task(), and this problem is gone :-)

I verified with the debug code below:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cb24a0141e57..50658e79363f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3832,6 +3832,14 @@ next_class:;
 
 		WARN_ON_ONCE(!rq_i->core_pick);
 
+		if (rq->core->core_cookie && rq_i->core_pick->core_cookie &&
+			rq->core->core_cookie != rq_i->core_pick->core_cookie) {
+			trace_printk("expect 0x%lx, cpu%d got 0x%lx\n",
+					rq->core->core_cookie, i,
+					rq_i->core_pick->core_cookie);
+			WARN_ON_ONCE(1);
+		}
+
 		rq_i->core_pick->core_occupation = occ;
 
 		if (i == cpu)
-- 
2.19.1.3.ge56e4f7


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-04-01 21:35               ` Subhra Mazumdar
@ 2019-04-03 20:16                 ` Julien Desfossez
  2019-04-05  1:30                   ` Subhra Mazumdar
  0 siblings, 1 reply; 99+ messages in thread
From: Julien Desfossez @ 2019-04-03 20:16 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: Peter Zijlstra, mingo, tglx, pjt, tim.c.chen, torvalds,
	linux-kernel, fweisbec, keescook, kerrnel, Vineeth Pillai,
	Nishanth Aravamudan

> >>>Is the core wide lock primarily responsible for the regression? I ran
> >>>upto patch
> >>>12 which also has the core wide lock for tagged cgroups and also calls
> >>>newidle_balance() from pick_next_task(). I don't see any regression. 
> >>>Of
> >>>course
> >>>the core sched version of pick_next_task() may be doing more but
> >>>comparing with
> >>>the __pick_next_task() it doesn't look too horrible.
> >>On further testing and investigation, we also agree that spinlock
> >>contention
> >>is not the major cause for the regression, but we feel that it should be
> >>one
> >>of the major contributing factors to this performance loss.
> >>
> >>
> >I finally did some code bisection and found the following lines are
> >basically responsible for the regression. Commenting them out I don't see
> >the regressions. Can you confirm? I am yet to figure if this is needed for
> >the correctness of core scheduling and if so can we do this better?
> >
> >-------->8-------------
> >
> >diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >index fe3918c..3b3388a 100644
> >--- a/kernel/sched/core.c
> >+++ b/kernel/sched/core.c
> >@@ -3741,8 +3741,8 @@ pick_next_task(struct rq *rq, struct task_struct
> >*prev, struct rq_flags *rf)
> >                                 * If there weren't no cookies; we don't
> >need
> >                                 * to bother with the other siblings.
> >*/
> >-                               if (i == cpu && !rq->core->core_cookie)
> >-                                       goto next_class;
> >+                               //if (i == cpu && !rq->core->core_cookie)
> >+                                       //goto next_class;
> >
> >continue;
> >                        }
> AFAICT this condition is not needed for correctness as cookie matching will
> still be enforced. Peter, any thoughts? I get the following numbers with 1
> and 2 DB instances.
> 
> 1 DB instance
> users  baseline  %idle  core_sched  %idle
> 16     1         84     -5.5%       84
> 24     1         76     -5%         76
> 32     1         69     -0.45%      69
> 
> 2 DB instance
> users  baseline  %idle  core_sched  %idle
> 16     1         66     -23.8%      69
> 24     1         54     -3.1%       57
> 32     1         42     -21.1%      48

We tried commenting out those lines, and it doesn’t seem to get rid of the
performance regression we are seeing.
Can you elaborate a bit more on the test you are performing and what kind
of resources it uses?
Can you also try to replicate our test and see if you see the same problem?

cgcreate -g cpu,cpuset:set1
cat /sys/devices/system/cpu/cpu{0,2,4,6}/topology/thread_siblings_list 
0,36
2,38
4,40
6,42

echo "0,2,4,6,36,38,40,42" | sudo tee /sys/fs/cgroup/cpuset/set1/cpuset.cpus
echo 0 | sudo tee /sys/fs/cgroup/cpuset/set1/cpuset.mems

echo 1 | sudo tee /sys/fs/cgroup/cpu,cpuacct/set1/cpu.tag

sysbench --test=fileio prepare
cgexec -g cpu,cpuset:set1 sysbench --threads=4 --test=fileio \
--file-test-mode=seqwr run

The reason we create a cpuset is to narrow down the investigation to just 4
cores on a highly powerful machine. It might not be needed if testing on a
smaller machine.

Julien


* Re: [RFC][PATCH 15/16] sched: Trivial forced-newidle balancer
  2019-02-21 16:41     ` Peter Zijlstra
  2019-02-21 16:47       ` Peter Zijlstra
@ 2019-04-04  8:31       ` Aubrey Li
  2019-04-06  1:36         ` Aubrey Li
  1 sibling, 1 reply; 99+ messages in thread
From: Aubrey Li @ 2019-04-04  8:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Tim Chen, Linus Torvalds, Linux List Kernel Mailing,
	Subhra Mazumdar, Frédéric Weisbecker, Kees Cook,
	Greg Kerr

On Fri, Feb 22, 2019 at 12:42 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Feb 21, 2019 at 04:19:46PM +0000, Valentin Schneider wrote:
> > Hi,
> >
> > On 18/02/2019 16:56, Peter Zijlstra wrote:
> > [...]
> > > +static bool try_steal_cookie(int this, int that)
> > > +{
> > > +   struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
> > > +   struct task_struct *p;
> > > +   unsigned long cookie;
> > > +   bool success = false;
> > > +
> > > +   local_irq_disable();
> > > +   double_rq_lock(dst, src);

Here, should we check the status of the dst and src runqueues before locking
them? If src is idle, it could already be in the middle of load balancing.

Thanks,
-Aubrey

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e3162f..a1e0a6f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3861,6 +3861,13 @@ static bool try_steal_cookie(int this, int that)
        unsigned long cookie;
        bool success = false;

+       /*
+        * Don't steal if src is idle or has only one runnable task,
+        * or dst has more than one runnable task
+        */
+       if (src->nr_running <= 1 || unlikely(dst->nr_running >= 1))
+               return false;
+
        local_irq_disable();
        double_rq_lock(dst, src);
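
The early-out in the hunk above can be sketched in user space as follows.
This is only an illustration under assumptions: struct rq_sketch and
worth_stealing() are hypothetical names, not kernel code, and only
nr_running is modelled. Note that the condition as posted also bails out
when dst has exactly one runnable task (>= 1), which is stricter than the
"more than one" wording of the comment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct rq; only nr_running is modelled. */
struct rq_sketch {
	unsigned int nr_running;
};

/*
 * Mirrors the early-out condition from the posted hunk: report "not worth
 * it" when src has nothing to spare (idle or a single task) or when dst
 * already has something runnable. No lock is taken to make this decision.
 */
bool worth_stealing(const struct rq_sketch *dst, const struct rq_sketch *src)
{
	assert(dst && src);

	if (src->nr_running <= 1 || dst->nr_running >= 1)
		return false;

	return true;
}
```

Because the counters are read without holding either rq lock, the answer
can be stale by the time the locks are taken, so it can only serve as a
hint for skipping the double lock, not as a guarantee about queue state.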


* Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
  2019-04-03 20:16                 ` Julien Desfossez
@ 2019-04-05  1:30                   ` Subhra Mazumdar
  0 siblings, 0 replies; 99+ messages in thread
From: Subhra Mazumdar @ 2019-04-05  1:30 UTC (permalink / raw)
  To: Julien Desfossez
  Cc: Peter Zijlstra, mingo, tglx, pjt, tim.c.chen, torvalds,
	linux-kernel, fweisbec, keescook, kerrnel, Vineeth Pillai,
	Nishanth Aravamudan


> We tried commenting out those lines, and it doesn’t seem to get rid of the
> performance regression we are seeing.
> Can you elaborate a bit more on the test you are performing and what kind
> of resources it uses?
I am running 1 and 2 Oracle DB instances, each running a TPC-C workload. The
clients driving the instances also run on the same node. Each server/client
pair is put in its own cpu group and tagged.
> Can you also try to replicate our test and see if you see the same problem?
>
> cgcreate -g cpu,cpuset:set1
> cat /sys/devices/system/cpu/cpu{0,2,4,6}/topology/thread_siblings_list
> 0,36
> 2,38
> 4,40
> 6,42
>
> echo "0,2,4,6,36,38,40,42" | sudo tee /sys/fs/cgroup/cpuset/set1/cpuset.cpus
> echo 0 | sudo tee /sys/fs/cgroup/cpuset/set1/cpuset.mems
>
> echo 1 | sudo tee /sys/fs/cgroup/cpu,cpuacct/set1/cpu.tag
>
> sysbench --test=fileio prepare
> cgexec -g cpu,cpuset:set1 sysbench --threads=4 --test=fileio \
> --file-test-mode=seqwr run
>
> The reason we create a cpuset is to narrow down the investigation to just 4
> cores on a highly powerful machine. It might not be needed if testing on a
> smaller machine.
With this sysbench test I am not seeing any improvement from removing the
condition. With hackbench I also found it makes no difference, but that has
a much lower regression to begin with (18%).
>
> Julien


* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-04-02  8:28     ` Peter Zijlstra
  2019-04-02 13:20       ` Aaron Lu
@ 2019-04-05 14:55       ` Aaron Lu
  2019-04-09 18:09         ` Tim Chen
  2019-04-16 13:43       ` Aaron Lu
  2 siblings, 1 reply; 99+ messages in thread
From: Aaron Lu @ 2019-04-05 14:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, pjt, tim.c.chen, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel, Aubrey Li,
	Julien Desfossez

On Tue, Apr 02, 2019 at 10:28:12AM +0200, Peter Zijlstra wrote:
> Another approach would be something like the below:
> 
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -87,7 +87,7 @@ static inline int __task_prio(struct tas
>   */
>  
>  /* real prio, less is less */
> -static inline bool __prio_less(struct task_struct *a, struct task_struct *b, bool runtime)
> +static inline bool __prio_less(struct task_struct *a, struct task_struct *b, u64 vruntime)
>  {
>  	int pa = __task_prio(a), pb = __task_prio(b);
>  
> @@ -104,21 +104,25 @@ static inline bool __prio_less(struct ta
>  	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
>  		return !dl_time_before(a->dl.deadline, b->dl.deadline);
>  
> -	if (pa == MAX_RT_PRIO + MAX_NICE && runtime) /* fair */
> -		return !((s64)(a->se.vruntime - b->se.vruntime) < 0);
> +	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
> +		return !((s64)(a->se.vruntime - vruntime) < 0);
		                                         ~~~
I think <= should be used here, so that two tasks with the same vruntime
will return false. Otherwise we could bounce between two tasks having
different tags, with one set to max in the first round and the other set to
max in the next round. The CPU would get stuck in __schedule() with irqs
disabled.
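
To make the tie case concrete, here is a small user-space sketch of the two
variants of the comparison. It is an illustration only: prio_less_lt(),
prio_less_le() and is_asymmetric() are hypothetical helper names, and the
tasks are reduced to bare vruntime values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* "a has lower priority than b", with '<' as in the quoted hunk. */
bool prio_less_lt(uint64_t a_vruntime, uint64_t b_vruntime)
{
	return !((int64_t)(a_vruntime - b_vruntime) < 0);
}

/* The same comparison with '<=' as suggested in the reply. */
bool prio_less_le(uint64_t a_vruntime, uint64_t b_vruntime)
{
	return !((int64_t)(a_vruntime - b_vruntime) <= 0);
}

/* A usable ordering must never claim both a < b and b < a. */
bool is_asymmetric(bool (*less)(uint64_t, uint64_t), uint64_t a, uint64_t b)
{
	assert(less);
	return !(less(a, b) && less(b, a));
}
```

With equal vruntimes the '<' variant reports each task as lower priority
than the other, so whichever one is currently "max" can always be displaced
by the other; the '<=' variant returns false in both directions and the
selection settles.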

>  
>  	return false;
>  }
>  
>  static inline bool cpu_prio_less(struct task_struct *a, struct task_struct *b)
>  {
> -	return __prio_less(a, b, true);
> +	return __prio_less(a, b, b->se.vruntime);
>  }
>  
>  static inline bool core_prio_less(struct task_struct *a, struct task_struct *b)
>  {
> -	/* cannot compare vruntime across CPUs */
> -	return __prio_less(a, b, false);
> +	u64 vruntime = b->se.vruntime;
> +
> +	vruntime -= task_rq(b)->cfs.min_vruntime;
> +	vruntime += task_rq(a)->cfs.min_vruntime;

After some testing, I figured out that task_cfs_rq() should be used instead
of task_rq() :-)
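
The rebasing done in the quoted core_prio_less() hunk can be sketched in
user space like this. It is a simplified illustration: struct cfs_rq_sketch,
vruntime_rebase() and ran_more() are hypothetical names, and the sketch does
not model which runqueue a task's entity actually sits on (presumably the
reason task_cfs_rq() matters for tasks in tagged cgroups).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical miniature of a CFS runqueue: only min_vruntime is modelled. */
struct cfs_rq_sketch {
	uint64_t min_vruntime;
};

/*
 * Translate a vruntime measured on runqueue 'from' into the frame of
 * runqueue 'to': subtract the source queue's min_vruntime and add the
 * destination's. Unsigned arithmetic keeps intermediate wrap-around
 * well defined.
 */
uint64_t vruntime_rebase(uint64_t vruntime,
			 const struct cfs_rq_sketch *from,
			 const struct cfs_rq_sketch *to)
{
	assert(from && to);

	vruntime -= from->min_vruntime;
	vruntime += to->min_vruntime;
	return vruntime;
}

/* True when a's vruntime is strictly ahead of the (already rebased) b. */
bool ran_more(uint64_t a_vruntime, uint64_t b_rebased)
{
	return (int64_t)(a_vruntime - b_rebased) > 0;
}
```

For example, a task 200 units above its queue's minimum is ahead of a task
100 units above another queue's minimum, even if the second queue's raw
vruntime values are numerically far larger; comparing the raw values would
give the opposite answer.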

With the two changes (and some other minor ones that still need more time
to sort out), I'm now able to run 2 full CPU kbuilds in 2 tagged cgroups.
Previously, the system would hang pretty soon after I started a kbuild in
any tagged cgroup (presumably with CPUs stuck in __schedule() with irqs
disabled).

And no warning appeared about two tasks with different tags getting
scheduled on the same CPU.

Thanks,
Aaron

> +
> +	return __prio_less(a, b, vruntime);
>  }
>  
>  static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)


* Re: [RFC][PATCH 15/16] sched: Trivial forced-newidle balancer
  2019-04-04  8:31       ` Aubrey Li
@ 2019-04-06  1:36         ` Aubrey Li
  0 siblings, 0 replies; 99+ messages in thread
From: Aubrey Li @ 2019-04-06  1:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Tim Chen, Linus Torvalds, Linux List Kernel Mailing,
	Subhra Mazumdar, Frédéric Weisbecker, Kees Cook,
	Greg Kerr, Aaron Lu

On Thu, Apr 4, 2019 at 4:31 PM Aubrey Li <aubrey.intel@gmail.com> wrote:
>
> On Fri, Feb 22, 2019 at 12:42 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Feb 21, 2019 at 04:19:46PM +0000, Valentin Schneider wrote:
> > > Hi,
> > >
> > > On 18/02/2019 16:56, Peter Zijlstra wrote:
> > > [...]
> > > > +static bool try_steal_cookie(int this, int that)
> > > > +{
> > > > +   struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
> > > > +   struct task_struct *p;
> > > > +   unsigned long cookie;
> > > > +   bool success = false;
> > > > +
> > > > +   local_irq_disable();
> > > > +   double_rq_lock(dst, src);
>
> Here, should we check the status of the dst and src runqueues before locking
> them? If src is idle, it could already be in the middle of load balancing.
>
> Thanks,
> -Aubrey
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3e3162f..a1e0a6f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3861,6 +3861,13 @@ static bool try_steal_cookie(int this, int that)
>         unsigned long cookie;
>         bool success = false;
>
> +       /*
> +        * Don't steal if src is idle or has only one runnable task,
> +        * or dst has more than one runnable task
> +        */
> +       if (src->nr_running <= 1 || unlikely(dst->nr_running >= 1))
> +               return false;
> +
>         local_irq_disable();
>         double_rq_lock(dst, src);

This seems to eliminate a hard lockup on my side.

Thanks,
-Aubrey

[  122.961909] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
[  122.961910] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  122.961940] irq event stamp: 8200
[  122.961941] hardirqs last  enabled at (8199): [<ffffffff81003829>]
trace_hardirqs_on_thunk+0x1a/0x1c
[  122.961942] hardirqs last disabled at (8200): [<ffffffff81003845>]
trace_hardirqs_off_thunk+0x1a/0x1c
[  122.961942] softirqs last  enabled at (8192): [<ffffffff81e003a3>]
__do_softirq+0x3a3/0x3f2
[  122.961943] softirqs last disabled at (8185): [<ffffffff81095ea1>]
irq_exit+0xc1/0xd0
[  122.961944] CPU: 0 PID: 2704 Comm: schbench Tainted: G          I
    5.0.0-rc8-00544-gf24f5e9-dirty #20
[  122.961945] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  122.961945] RIP: 0010:native_queued_spin_lock_slowpath+0x5c/0x1d0
[  122.961946] Code: ff ff ff 75 40 f0 0f ba 2f 08 0f 82 cd 00 00 00
8b 07 30 e4 09 c6 f7 c6 00 ff ff ff 75 1b 85 f6 74 0e 8b 07 84 c0 74
08 f3 90 <8b> 07 84 c0 75 f8 b8 05
[  122.961947] RSP: 0000:ffff888c09c03e78 EFLAGS: 00000002
[  122.961948] RAX: 0000000000740101 RBX: ffff888c0ade4400 RCX: 0000000000000540
[  122.961948] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff888c0ade4400
[  122.961949] RBP: ffff888c0ade4400 R08: 0000000000000000 R09: 0000000000000001
[  122.961950] R10: ffff888c09c03e20 R11: 0000000000000000 R12: 00000000001e4400
[  122.961950] R13: 0000000000000000 R14: 0000000000000016 R15: ffff888be0340000
[  122.961951] FS:  00007f21e17ea700(0000) GS:ffff888c09c00000(0000)
knlGS:0000000000000000
[  122.961951] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  122.961952] CR2: 00007f21e17ea728 CR3: 0000000be0b6a002 CR4: 00000000000606f0
[  122.961952] Call Trace:
[  122.961953]  <IRQ>
[  122.961953]  do_raw_spin_lock+0xb4/0xc0
[  122.961954]  _raw_spin_lock+0x4b/0x60
[  122.961954]  scheduler_tick+0x48/0x170
[  122.961955]  ? tick_sched_do_timer+0x60/0x60
[  122.961955]  update_process_times+0x40/0x50
[  122.961956]  tick_sched_handle+0x22/0x60
[  122.961956]  tick_sched_timer+0x37/0x70
[  122.961957]  __hrtimer_run_queues+0xed/0x3f0
[  122.961957]  hrtimer_interrupt+0x122/0x270
[  122.961958]  smp_apic_timer_interrupt+0x86/0x210
[  122.961958]  apic_timer_interrupt+0xf/0x20
[  122.961959]  </IRQ>
[  122.961959] RIP: 0033:0x7fff855a2839
[  122.961960] Code: 08 3b 15 6a c8 ff ff 75 df 31 c0 5d c3 b8 e4 00
00 00 5d 0f 05 c3 f3 90 eb ce 0f 1f 80 00 00 00 00 55 48 85 ff 48 89
e5 41 54 <49> 89 f4 53 74 30 48 8b
[  122.961961] RSP: 002b:00007f21e17e9e38 EFLAGS: 00000206 ORIG_RAX:
ffffffffffffff13
[  122.961962] RAX: 000000000002038b RBX: 00000000002dc6c0 RCX: 0000000000000000
[  122.961962] RDX: 00007f21e17e9e60 RSI: 0000000000000000 RDI: 00007f21e17e9e50
[  122.961963] RBP: 00007f21e17e9e40 R08: 0000000000000000 R09: 0000000000007a14
[  122.961963] R10: 00007f21e17e9e30 R11: 0000000000000246 R12: 00007f21e17e9ed0
[  122.961964] R13: 00007f2202716e6f R14: 0000000000000000 R15: 00007f21fc826b00
[  122.961964] Kernel panic - not syncing: Hard LOCKUP
[  122.961965] CPU: 0 PID: 2704 Comm: schbench Tainted: G          I
    5.0.0-rc8-00544-gf24f5e9-dirty #20
[  122.961966] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  122.961966] Call Trace:
[  122.961967]  <NMI>
[  122.961967]  dump_stack+0x7c/0xbb
[  122.961968]  panic+0x103/0x2c8
[  122.961968]  nmi_panic+0x35/0x40
[  122.961969]  watchdog_overflow_callback+0xfd/0x110
[  122.961969]  __perf_event_overflow+0x5a/0xe0
[  122.961970]  handle_pmi_common+0x1d1/0x280
[  122.961970]  ? intel_pmu_handle_irq+0xad/0x170
[  122.961971]  intel_pmu_handle_irq+0xad/0x170
[  122.961971]  perf_event_nmi_handler+0x2e/0x50
[  122.961972]  nmi_handle+0xc6/0x260
[  122.961972]  default_do_nmi+0xca/0x120
[  122.961972]  do_nmi+0x113/0x160
[  122.961973]  end_repeat_nmi+0x16/0x50
[  122.961973] RIP: 0010:native_queued_spin_lock_slowpath+0x5c/0x1d0
[  122.961974] Code: ff ff ff 75 40 f0 0f ba 2f 08 0f 82 cd 00 00 00
8b 07 30 e4 09 c6 f7 c6 00 ff ff ff 75 1b 85 f6 74 0e 8b 07 84 c0 74
08 f3 90 <8b> 07 84 c0 75 f8 b8 05
[  122.961975] RSP: 0000:ffff888c09c03e78 EFLAGS: 00000002
[  122.961976] RAX: 0000000000740101 RBX: ffff888c0ade4400 RCX: 0000000000000540
[  122.961976] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff888c0ade4400
[  122.961977] RBP: ffff888c0ade4400 R08: 0000000000000000 R09: 0000000000000001
[  122.961978] R10: ffff888c09c03e20 R11: 0000000000000000 R12: 00000000001e4400
[  122.961978] R13: 0000000000000000 R14: 0000000000000016 R15: ffff888be0340000
[  122.961979]  ? native_queued_spin_lock_slowpath+0x5c/0x1d0
[  122.961980]  ? native_queued_spin_lock_slowpath+0x5c/0x1d0
[  122.961980]  </NMI>
[  122.961981]  <IRQ>
[  122.961981]  do_raw_spin_lock+0xb4/0xc0
[  122.961982]  _raw_spin_lock+0x4b/0x60
[  122.961982]  scheduler_tick+0x48/0x170
[  122.961983]  ? tick_sched_do_timer+0x60/0x60
[  122.961983]  update_process_times+0x40/0x50
[  122.961984]  tick_sched_handle+0x22/0x60
[  122.961984]  tick_sched_timer+0x37/0x70
[  122.961985]  __hrtimer_run_queues+0xed/0x3f0
[  122.961985]  hrtimer_interrupt+0x122/0x270
[  122.961986]  smp_apic_timer_interrupt+0x86/0x210
[  122.961986]  apic_timer_interrupt+0xf/0x20
[  122.961987]  </IRQ>
[  122.961987] RIP: 0033:0x7fff855a2839
[  122.961988] Code: 08 3b 15 6a c8 ff ff 75 df 31 c0 5d c3 b8 e4 00
00 00 5d 0f 05 c3 f3 90 eb ce 0f 1f 80 00 00 00 00 55 48 85 ff 48 89
e5 41 54 <49> 89 f4 53 74 30 48 8b
[  122.961989] RSP: 002b:00007f21e17e9e38 EFLAGS: 00000206 ORIG_RAX:
ffffffffffffff13
[  122.961990] RAX: 000000000002038b RBX: 00000000002dc6c0 RCX: 0000000000000000
[  122.961990] RDX: 00007f21e17e9e60 RSI: 0000000000000000 RDI: 00007f21e17e9e50
[  122.961991] RBP: 00007f21e17e9e40 R08: 0000000000000000 R09: 0000000000007a14
[  122.961991] R10: 00007f21e17e9e30 R11: 0000000000000246 R12: 00007f21e17e9ed0
[  122.961992] R13: 00007f2202716e6f R14: 0000000000000000 R15: 00007f21fc826b00
[  124.022169] Shutting down cpus with NMI
[  124.022170] Kernel Offset: disabled
[  124.022170] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
[  124.022171] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  124.022200] irq event stamp: 148640
[  124.022200] hardirqs last  enabled at (148639):
[<ffffffff810c7544>] sched_core_balance+0x164/0x5d0
[  124.022209] hardirqs last disabled at (148640):
[<ffffffff810c75b5>] sched_core_balance+0x1d5/0x5d0
[  124.022218] softirqs last  enabled at (148598):
[<ffffffff81e003a3>] __do_softirq+0x3a3/0x3f2
[  124.022226] softirqs last disabled at (148563):
[<ffffffff81095ea1>] irq_exit+0xc1/0xd0
[  124.022227] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G          I
  5.0.0-rc8-00544-gf24f5e9-dirty #20
[  124.022228] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  124.022229] RIP: 0010:native_queued_spin_lock_slowpath+0x17e/0x1d0
[  124.022230] Code: 34 c5 40 48 39 82 48 89 16 8b 42 08 85 c0 75 09
f3 90 8b 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 07 0f 18 0e eb 02 f3
90 8b 07 <66> 85 c0 75 f7 41 89 c7
[  124.022230] RSP: 0000:ffffc9000634bcf8 EFLAGS: 00000002
[  124.022231] RAX: 00000000001c0101 RBX: ffff888c0abe4400 RCX: 0000000000080000
[  124.022232] RDX: ffff888c09fe5180 RSI: 0000000000000000 RDI: ffff888c0abe4400
[  124.022232] RBP: ffff888c0abe4400 R08: 0000000000080000 R09: 0000000000000001
[  124.022233] R10: ffffc9000634bca0 R11: 0000000000000000 R12: 0000000000000001
[  124.022233] R13: 0000000000000000 R14: ffff888c09fe4400 R15: 00000000001e4400
[  124.022234] FS:  0000000000000000(0000) GS:ffff888c09e00000(0000)
knlGS:0000000000000000
[  124.022235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  124.022235] CR2: 00007f21f1ffae00 CR3: 0000000be0b6a002 CR4: 00000000000606e0
[  124.022236] Call Trace:
[  124.022237]  do_raw_spin_lock+0xb4/0xc0
[  124.022237]  _raw_spin_lock_nested+0x49/0x60
[  124.022238]  sched_core_balance+0x106/0x5d0
[  124.022238]  __balance_callback+0x49/0xa0
[  124.022239]  __schedule+0x12eb/0x1660
[  124.022239]  ? _raw_spin_unlock_irqrestore+0x4e/0x60
[  124.022240]  ? hrtimer_start_range_ns+0x1b7/0x340
[  124.022240]  schedule_idle+0x28/0x40
[  124.022241]  do_idle+0x169/0x2a0
[  124.022241]  cpu_startup_entry+0x19/0x20
[  124.022242]  start_secondary+0x17f/0x1d0
[  124.022242]  secondary_startup_64+0xa4/0xb0
[  124.022243] NMI watchdog: Watchdog detected hard LOCKUP on cpu 2
[  124.022243] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  124.022272] irq event stamp: 154392
[  124.022273] hardirqs last  enabled at (154391):
[<ffffffff810c7544>] sched_core_balance+0x164/0x5d0
[  124.022273] hardirqs last disabled at (154392):
[<ffffffff810c75b5>] sched_core_balance+0x1d5/0x5d0
[  124.022274] softirqs last  enabled at (154374):
[<ffffffff81e003a3>] __do_softirq+0x3a3/0x3f2
[  124.022275] softirqs last disabled at (154309):
[<ffffffff81095ea1>] irq_exit+0xc1/0xd0
[  124.022275] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G          I
  5.0.0-rc8-00544-gf24f5e9-dirty #20
[  124.022276] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  124.022276] RIP: 0010:native_queued_spin_lock_slowpath+0x166/0x1d0
[  124.022277] Code: c1 e8 12 48 c1 ee 0b 83 e8 01 83 e6 60 48 98 48
81 c6 80 51 1e 00 48 03 34 c5 40 48 39 82 48 89 16 8b 42 08 85 c0 75
09 f3 90 <8b> 42 08 85 c0 74 f7 40
[  124.022278] RSP: 0000:ffffc90006353cf8 EFLAGS: 00000046
[  124.022279] RAX: 0000000000000000 RBX: ffff888c0a3e4400 RCX: 00000000000c0000
[  124.022279] RDX: ffff888c0a1e5180 RSI: ffff888c0a3e5180 RDI: ffff888c0a3e4400
[  124.022280] RBP: ffff888c0a3e4400 R08: 00000000000c0000 R09: 0000000000000001
[  124.022281] R10: ffffc90006353ca0 R11: 0000000000000000 R12: 0000000000000002
[  124.022281] R13: 0000000000000000 R14: ffff888c0a1e4400 R15: 00000000001e4400
[  124.022282] FS:  0000000000000000(0000) GS:ffff888c0a000000(0000)
knlGS:0000000000000000
[  124.022282] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  124.022283] CR2: 00007f21e1feb728 CR3: 0000000be0b6a005 CR4: 00000000000606e0
[  124.022283] Call Trace:
[  124.022284]  do_raw_spin_lock+0xb4/0xc0
[  124.022284]  _raw_spin_lock_nested+0x49/0x60
[  124.022285]  sched_core_balance+0x106/0x5d0
[  124.022285]  __balance_callback+0x49/0xa0
[  124.022286]  __schedule+0x12eb/0x1660
[  124.022286]  ? enqueue_entity+0x112/0x6f0
[  124.022287]  ? sched_clock+0x5/0x10
[  124.022287]  ? sched_clock_cpu+0xc/0xa0
[  124.022288]  ? lockdep_hardirqs_on+0x11d/0x190
[  124.022288]  schedule_idle+0x28/0x40
[  124.022289]  do_idle+0x169/0x2a0
[  124.022289]  cpu_startup_entry+0x19/0x20
[  124.022290]  start_secondary+0x17f/0x1d0
[  124.022290]  secondary_startup_64+0xa4/0xb0
[  124.022291] NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
[  124.022292] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  124.022321] irq event stamp: 13957
[  124.022321] hardirqs last  enabled at (13956): [<ffffffff81a10892>]
_raw_spin_unlock_irqrestore+0x32/0x60
[  124.022322] hardirqs last disabled at (13957): [<ffffffff81a10f32>]
_raw_spin_lock_irqsave+0x22/0x80
[  124.022323] softirqs last  enabled at (13946): [<ffffffff81e003a3>]
__do_softirq+0x3a3/0x3f2
[  124.022323] softirqs last disabled at (13949): [<ffffffff81095ea1>]
irq_exit+0xc1/0xd0
[  124.022324] CPU: 4 PID: 2698 Comm: schbench Tainted: G          I
    5.0.0-rc8-00544-gf24f5e9-dirty #20
[  124.022325] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  124.022325] RIP: 0010:native_queued_spin_lock_slowpath+0x17e/0x1d0
[  124.022326] Code: 34 c5 40 48 39 82 48 89 16 8b 42 08 85 c0 75 09
f3 90 8b 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 07 0f 18 0e eb 02 f3
90 8b 07 <66> 85 c0 75 f7 41 89 c7
[  124.022327] RSP: 0000:ffff888c0a403da8 EFLAGS: 00000002
[  124.022328] RAX: 0000000000540101 RBX: ffff888c0a1e4400 RCX: 0000000000140000
[  124.022328] RDX: ffff888c0a5e5180 RSI: 0000000000000000 RDI: ffff888c0a1e4400
[  124.022329] RBP: ffff888c0a1e4400 R08: 0000000000140000 R09: 0000000000000001
[  124.022329] R10: ffff888c0a403d50 R11: 0000000000000000 R12: 0000000000000202
[  124.022330] R13: 000000000000003b R14: 0000000000000001 R15: ffff888c0b1e4400
[  124.022331] FS:  00007f21e47f0700(0000) GS:ffff888c0a400000(0000)
knlGS:0000000000000000
[  124.022331] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  124.022332] CR2: 00007f21f17fa728 CR3: 0000000be0b6a004 CR4: 00000000000606e0
[  124.022332] Call Trace:
[  124.022332]  <IRQ>
[  124.022333]  do_raw_spin_lock+0xb4/0xc0
[  124.022333]  _raw_spin_lock_irqsave+0x63/0x80
[  124.022334]  load_balance+0x358/0xde0
[  124.022334]  rebalance_domains+0x239/0x320
[  124.022335]  ? lockdep_hardirqs_on+0xa3/0x190
[  124.022335]  __do_softirq+0xd6/0x3f2
[  124.022336]  irq_exit+0xc1/0xd0
[  124.022336]  smp_apic_timer_interrupt+0xac/0x210
[  124.022337]  apic_timer_interrupt+0xf/0x20
[  124.022337]  </IRQ>
[  124.022338] RIP: 0033:0x7fff855a26dd
[  124.022339] Code: e2 20 48 09 c2 48 85 d2 49 8b 40 08 48 8b 0d c2
c9 ff ff 78 9b 48 39 d1 73 10 48 29 ca 8b 0d c2 c9 ff ff 48 0f af d1
48 01 d0 <8b> 0d b9 c9 ff ff 4d 8f
[  124.022339] RSP: 002b:00007f21e47efe00 EFLAGS: 00000206 ORIG_RAX:
ffffffffffffff13
[  124.022340] RAX: 001c0b86d5d09d9a RBX: 00007f21e47efe50 RCX: 00000000005f0b1d
[  124.022341] RDX: 000063752327ca12 RSI: 00007f21e47efe50 RDI: 0000000000000000
[  124.022341] RBP: 00007f21e47efe10 R08: 00007fff8559f0a0 R09: 0000000000007f96
[  124.022342] R10: 00007f21e47efe30 R11: 0000000000000246 R12: 0000000000000000
[  124.022342] R13: 00007f2202716e6f R14: 0000000000000000 R15: 00007f21fc518628
[  124.022343] NMI watchdog: Watchdog detected hard LOCKUP on cpu 5
[  124.022344] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  124.022372] irq event stamp: 132866
[  124.022373] hardirqs last  enabled at (132865):
[<ffffffff810c7544>] sched_core_balance+0x164/0x5d0
[  124.022374] hardirqs last disabled at (132866):
[<ffffffff810c75b5>] sched_core_balance+0x1d5/0x5d0
[  124.022375] softirqs last  enabled at (132834):
[<ffffffff81e003a3>] __do_softirq+0x3a3/0x3f2
[  124.022375] softirqs last disabled at (132825):
[<ffffffff81095ea1>] irq_exit+0xc1/0xd0
[  124.022376] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G          I
  5.0.0-rc8-00544-gf24f5e9-dirty #20
[  124.022377] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  124.022377] RIP: 0010:native_queued_spin_lock_slowpath+0x169/0x1d0
[  124.022378] Code: 48 c1 ee 0b 83 e8 01 83 e6 60 48 98 48 81 c6 80
51 1e 00 48 03 34 c5 40 48 39 82 48 89 16 8b 42 08 85 c0 75 09 f3 90
8b 42 08 <85> c0 74 f7 48 8b 32 46
[  124.022379] RSP: 0000:ffffc9000636bcf8 EFLAGS: 00000046
[  124.022380] RAX: 0000000000000000 RBX: ffff888c0abe4400 RCX: 0000000000180000
[  124.022380] RDX: ffff888c0a7e5180 RSI: ffff888c0abe5180 RDI: ffff888c0abe4400
[  124.022381] RBP: ffff888c0abe4400 R08: 0000000000180000 R09: 0000000000000001
[  124.022381] R10: ffffc9000636bca0 R11: 0000000000000000 R12: 0000000000000005
[  124.022382] R13: 0000000000000000 R14: ffff888c0a7e4400 R15: 00000000001e4400
[  124.022382] FS:  0000000000000000(0000) GS:ffff888c0a600000(0000)
knlGS:0000000000000000
[  124.022383] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  124.022384] CR2: 00007f7956376889 CR3: 0000000be0b6a001 CR4: 00000000000606e0
[  124.022384] Call Trace:
[  124.022384]  do_raw_spin_lock+0xb4/0xc0
[  124.022385]  _raw_spin_lock_nested+0x49/0x60
[  124.022385]  sched_core_balance+0x106/0x5d0
[  124.022386]  __balance_callback+0x49/0xa0
[  124.022386]  __schedule+0x12eb/0x1660
[  124.022387]  ? enqueue_entity+0x112/0x6f0
[  124.022387]  ? sched_clock+0x5/0x10
[  124.022388]  ? sched_clock_cpu+0xc/0xa0
[  124.022388]  ? lockdep_hardirqs_on+0x11d/0x190
[  124.022389]  schedule_idle+0x28/0x40
[  124.022389]  do_idle+0x169/0x2a0
[  124.022390]  cpu_startup_entry+0x19/0x20
[  124.022390]  start_secondary+0x17f/0x1d0
[  124.022391]  secondary_startup_64+0xa4/0xb0
[  124.022392] NMI watchdog: Watchdog detected hard LOCKUP on cpu 7
[  124.022392] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  124.022421] irq event stamp: 8164
[  124.022422] hardirqs last  enabled at (8163): [<ffffffff81003829>]
trace_hardirqs_on_thunk+0x1a/0x1c
[  124.022422] hardirqs last disabled at (8164): [<ffffffff81003845>]
trace_hardirqs_off_thunk+0x1a/0x1c
[  124.022423] softirqs last  enabled at (8138): [<ffffffff81e003a3>]
__do_softirq+0x3a3/0x3f2
[  124.022424] softirqs last disabled at (8129): [<ffffffff81095ea1>]
irq_exit+0xc1/0xd0
[  124.022424] CPU: 7 PID: 2733 Comm: schbench Tainted: G          I
    5.0.0-rc8-00544-gf24f5e9-dirty #20
[  124.022425] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  124.022426] RIP: 0010:native_queued_spin_lock_slowpath+0x169/0x1d0
[  124.022427] Code: 48 c1 ee 0b 83 e8 01 83 e6 60 48 98 48 81 c6 80
51 1e 00 48 03 34 c5 40 48 39 82 48 89 16 8b 42 08 85 c0 75 09 f3 90
8b 42 08 <85> c0 74 f7 48 8b 32 46
[  124.022427] RSP: 0000:ffff888c0aa03e78 EFLAGS: 00000046
[  124.022428] RAX: 0000000000000000 RBX: ffff888c0abe4400 RCX: 0000000000200000
[  124.022429] RDX: ffff888c0abe5180 RSI: ffff888c09fe5180 RDI: ffff888c0abe4400
[  124.022429] RBP: ffff888c0abe4400 R08: 0000000000200000 R09: 0000000000000001
[  124.022430] R10: ffff888c0aa03e20 R11: 0000000000000000 R12: 00000000001e4400
[  124.022430] R13: 0000000000000007 R14: 0000000000000002 R15: ffff888be038d480
[  124.022431] FS:  00007f21d47d0700(0000) GS:ffff888c0aa00000(0000)
knlGS:0000000000000000
[  124.022432] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  124.022432] CR2: 00007f21e07e8728 CR3: 0000000be0b6a001 CR4: 00000000000606e0
[  124.022433] Call Trace:
[  124.022433]  <IRQ>
[  124.022433]  do_raw_spin_lock+0xb4/0xc0
[  124.022434]  _raw_spin_lock+0x4b/0x60
[  124.022434]  scheduler_tick+0x48/0x170
[  124.022435]  ? tick_sched_do_timer+0x60/0x60
[  124.022435]  update_process_times+0x40/0x50
[  124.022436]  tick_sched_handle+0x22/0x60
[  124.022436]  tick_sched_timer+0x37/0x70
[  124.022437]  __hrtimer_run_queues+0xed/0x3f0
[  124.022437]  hrtimer_interrupt+0x122/0x270
[  124.022438]  smp_apic_timer_interrupt+0x86/0x210
[  124.022438]  apic_timer_interrupt+0xf/0x20
[  124.022439]  </IRQ>
[  124.022439] RIP: 0033:0x7fff855a262b
[  124.022440] Code: 41 0e 10 86 02 48 0d 06 60 c6 0c 07 08 00 00 4c
8d 54 24 08 48 63 ff 48 83 e4 f0 4c 8d 47 02 48 8d 05 59 ca ff ff 41
ff 72 f8 <55> 49 c1 e0 04 48 89 e1
[  124.022441] RSP: 002b:00007f21d47cfe18 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff13
[  124.022442] RAX: 00007fff8559f080 RBX: 00007f21d47cfe50 RCX: 0000000000000000
[  124.022442] RDX: 00007f21d47cfe60 RSI: 00007f21d47cfe50 RDI: 0000000000000000
[  124.022443] RBP: 00007f21d47cfe40 R08: 0000000000000002 R09: 00000000000079f0
[  124.022443] R10: 00007f21d47cfe30 R11: 0000000000000246 R12: 0000000000000000
[  124.022444] R13: 00007f2201714e6f R14: 0000000000000000 R15: 00007f21f8b34fd8
[  124.022446] NMI watchdog: Watchdog detected hard LOCKUP on cpu 16
[  124.022447] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  124.022475] irq event stamp: 134706
[  124.022476] hardirqs last  enabled at (134705):
[<ffffffff810c7544>] sched_core_balance+0x164/0x5d0
[  124.022476] hardirqs last disabled at (134706):
[<ffffffff810c75b5>] sched_core_balance+0x1d5/0x5d0
[  124.022477] softirqs last  enabled at (134694):
[<ffffffff81e003a3>] __do_softirq+0x3a3/0x3f2
[  124.022478] softirqs last disabled at (134683):
[<ffffffff81095ea1>] irq_exit+0xc1/0xd0
[  124.022479] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G          I
    5.0.0-rc8-00544-gf24f5e9-dirty #20
[  124.022479] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  124.022480] RIP: 0010:native_queued_spin_lock_slowpath+0x169/0x1d0
[  124.022481] Code: 48 c1 ee 0b 83 e8 01 83 e6 60 48 98 48 81 c6 80
51 1e 00 48 03 34 c5 40 48 39 82 48 89 16 8b 42 08 85 c0 75 09 f3 90
8b 42 08 <85> c0 74 f7 48 8b 32 46
[  124.022481] RSP: 0000:ffffc900063c3cf8 EFLAGS: 00000046
[  124.022482] RAX: 0000000000000000 RBX: ffff888c09fe4400 RCX: 0000000000440000
[  124.022483] RDX: ffff888c0ade5180 RSI: ffff88980b3e5180 RDI: ffff888c09fe4400
[  124.022483] RBP: ffff888c09fe4400 R08: 0000000000440000 R09: 0000000000000001
[  124.022484] R10: ffffc900063c3ca0 R11: 0000000000000000 R12: 0000000000000010
[  124.022485] R13: 0000000000000000 R14: ffff888c0ade4400 R15: 00000000001e4400
[  124.022485] FS:  0000000000000000(0000) GS:ffff888c0ac00000(0000)
knlGS:0000000000000000
[  124.022486] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  124.022486] CR2: 00007f21e77f5e30 CR3: 0000000be0b6a003 CR4: 00000000000606e0
[  124.022487] Call Trace:
[  124.022487]  do_raw_spin_lock+0xb4/0xc0
[  124.022488]  _raw_spin_lock_nested+0x49/0x60
[  124.022488]  sched_core_balance+0x106/0x5d0
[  124.022489]  __balance_callback+0x49/0xa0
[  124.022489]  __schedule+0x12eb/0x1660
[  124.022490]  ? _raw_spin_unlock_irqrestore+0x4e/0x60
[  124.022490]  ? hrtimer_start_range_ns+0x1b7/0x340
[  124.022491]  schedule_idle+0x28/0x40
[  124.022491]  do_idle+0x169/0x2a0
[  124.022492]  cpu_startup_entry+0x19/0x20
[  124.022492]  start_secondary+0x17f/0x1d0
[  124.022493]  secondary_startup_64+0xa4/0xb0
[  124.022493] NMI watchdog: Watchdog detected hard LOCKUP on cpu 17
[  124.022494] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  124.022522] irq event stamp: 10038
[  124.022523] hardirqs last  enabled at (10037): [<ffffffff81003829>]
trace_hardirqs_on_thunk+0x1a/0x1c
[  124.022524] hardirqs last disabled at (10038): [<ffffffff81003845>]
trace_hardirqs_off_thunk+0x1a/0x1c
[  124.022524] softirqs last  enabled at (10012): [<ffffffff81e003a3>]
__do_softirq+0x3a3/0x3f2
[  124.022525] softirqs last disabled at (10001): [<ffffffff81095ea1>]
irq_exit+0xc1/0xd0
[  124.022525] CPU: 17 PID: 2681 Comm: schbench Tainted: G          I
     5.0.0-rc8-00544-gf24f5e9-dirty #20
[  124.022526] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  124.022527] RIP: 0010:native_queued_spin_lock_slowpath+0x169/0x1d0
[  124.022528] Code: 48 c1 ee 0b 83 e8 01 83 e6 60 48 98 48 81 c6 80
51 1e 00 48 03 34 c5 40 48 39 82 48 89 16 8b 42 08 85 c0 75 09 f3 90
8b 42 08 <85> c0 74 f7 48 8b 32 46
[  124.022528] RSP: 0000:ffff888c0ae03e78 EFLAGS: 00000046
[  124.022529] RAX: 0000000000000000 RBX: ffff888c09fe4400 RCX: 0000000000480000
[  124.022530] RDX: ffff888c0afe5180 RSI: ffff88980a1e5180 RDI: ffff888c09fe4400
[  124.022531] RBP: ffff888c09fe4400 R08: 0000000000480000 R09: 0000000000000001
[  124.022531] R10: ffff888c0ae03e20 R11: 0000000000000000 R12: 00000000001e4400
[  124.022532] R13: 0000000000000011 R14: 0000000000000002 R15: ffff888be006d480
[  124.022533] FS:  00007f21f1ffb700(0000) GS:ffff888c0ae00000(0000)
knlGS:0000000000000000
[  124.022533] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  124.022534] CR2: 00007f21e4ff1728 CR3: 0000000be0b6a004 CR4: 00000000000606e0
[  124.022534] Call Trace:
[  124.022535]  <IRQ>
[  124.022535]  do_raw_spin_lock+0xb4/0xc0
[  124.022536]  _raw_spin_lock+0x4b/0x60
[  124.022536]  scheduler_tick+0x48/0x170
[  124.022537]  ? tick_sched_do_timer+0x60/0x60
[  124.022537]  update_process_times+0x40/0x50
[  124.022538]  tick_sched_handle+0x22/0x60
[  124.022538]  tick_sched_timer+0x37/0x70
[  124.022538]  __hrtimer_run_queues+0xed/0x3f0
[  124.022539]  hrtimer_interrupt+0x122/0x270
[  124.022540]  smp_apic_timer_interrupt+0x86/0x210
[  124.022540]  apic_timer_interrupt+0xf/0x20
[  124.022540]  </IRQ>
[  124.022541] RIP: 0033:0x7fff855a26af
[  124.022542] Code: ff 4c 8b 1d 83 e9 ff ff 0f 01 f9 66 90 8b 0d 68
e9 ff ff 41 39 ca 75 d6 48 c1 e2 20 48 09 d0 48 f7 e3 4c 01 da eb 0c
0f 01 f9 <66> 90 48 c1 e2 20 48 0f
[  124.022542] RSP: 002b:00007f21f1ffae00 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff13
[  124.022543] RAX: 00000000e4479e7b RBX: 00007f21f1ffae50 RCX: 0000000000000011
[  124.022544] RDX: 0000000000000068 RSI: 00007f21f1ffae50 RDI: 0000000000000000
[  124.022545] RBP: 00007f21f1ffae10 R08: 00007fff8559f0a0 R09: 00000000000079f0
[  124.022545] R10: 00007f21f1ffae30 R11: 0000000000000246 R12: 0000000000000000
[  124.022546] R13: 00007f2201f15e6f R14: 0000000000000000 R15: 00007f21f4105508
[  124.022546] NMI watchdog: Watchdog detected hard LOCKUP on cpu 18
[  124.022547] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  124.022575] irq event stamp: 8788
[  124.022576] hardirqs last  enabled at (8787): [<ffffffff81003829>]
trace_hardirqs_on_thunk+0x1a/0x1c
[  124.022577] hardirqs last disabled at (8788): [<ffffffff81003845>]
trace_hardirqs_off_thunk+0x1a/0x1c
[  124.022577] softirqs last  enabled at (8782): [<ffffffff81e003a3>]
__do_softirq+0x3a3/0x3f2
[  124.022578] softirqs last disabled at (8775): [<ffffffff81095ea1>]
irq_exit+0xc1/0xd0
[  124.022579] CPU: 18 PID: 2741 Comm: schbench Tainted: G          I
     5.0.0-rc8-00544-gf24f5e9-dirty #20
[  124.022579] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  124.022580] RIP: 0010:native_queued_spin_lock_slowpath+0x5c/0x1d0
[  124.022581] Code: ff ff ff 75 40 f0 0f ba 2f 08 0f 82 cd 00 00 00
8b 07 30 e4 09 c6 f7 c6 00 ff ff ff 75 1b 85 f6 74 0e 8b 07 84 c0 74
08 f3 90 <8b> 07 84 c0 75 f8 b8 05
[  124.022581] RSP: 0000:ffff888c0b003e78 EFLAGS: 00000002
[  124.022582] RAX: 0000000000540101 RBX: ffff888c0a1e4400 RCX: 0000000000000540
[  124.022583] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff888c0a1e4400
[  124.022584] RBP: ffff888c0a1e4400 R08: 0000000000000000 R09: 0000000000000001
[  124.022584] R10: ffff888c0b003e20 R11: 0000000000000000 R12: 00000000001e4400
[  124.022585] R13: 0000000000000012 R14: 0000000000000002 R15: ffff888be03a2a40
[  124.022585] FS:  00007f21d07c8700(0000) GS:ffff888c0b000000(0000)
knlGS:0000000000000000
[  124.022586] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  124.022586] CR2: 00007f21df7e6728 CR3: 0000000be0b6a003 CR4: 00000000000606e0
[  124.022587] Call Trace:
[  124.022587]  <IRQ>
[  124.022588]  do_raw_spin_lock+0xb4/0xc0
[  124.022588]  _raw_spin_lock+0x4b/0x60
[  124.022589]  scheduler_tick+0x48/0x170
[  124.022589]  ? tick_sched_do_timer+0x60/0x60
[  124.022590]  update_process_times+0x40/0x50
[  124.022590]  tick_sched_handle+0x22/0x60
[  124.022591]  tick_sched_timer+0x37/0x70
[  124.022591]  __hrtimer_run_queues+0xed/0x3f0
[  124.022592]  hrtimer_interrupt+0x122/0x270
[  124.022592]  smp_apic_timer_interrupt+0x86/0x210
[  124.022593]  apic_timer_interrupt+0xf/0x20
[  124.022593]  </IRQ>
[  124.022593] RIP: 0033:0x402050
[  124.022594] Code: 48 83 ec 40 48 8d 7c 24 10 64 48 8b 04 25 28 00
00 00 48 89 44 24 38 31 c0 e8 2c eb ff ff eb 0c 66 2e 0f 1f 84 00 00
00 00 00 <f3> 90 31 f6 48 89 e7 e0
[  124.022595] RSP: 002b:00007f21d07c7e50 EFLAGS: 00000283 ORIG_RAX:
ffffffffffffff13
[  124.022596] RAX: 0000000000051d87 RBX: 00000000002dc6c0 RCX: 0000000000000000
[  124.022596] RDX: 00007f21d07c7e60 RSI: 00007f21d07c7e50 RDI: 00007f21d07c7e70
[  124.022597] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000007a06
[  124.022598] R10: 00007f21d07c7e30 R11: 0000000000000246 R12: 00007f21d07c7ed0
[  124.022598] R13: 00007f2201714e6f R14: 0000000000000000 R15: 00007f21f8f480f8
[  124.022599] NMI watchdog: Watchdog detected hard LOCKUP on cpu 19
[  124.022599] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  124.022628] irq event stamp: 132622
[  124.022628] hardirqs last  enabled at (132621):
[<ffffffff810c7544>] sched_core_balance+0x164/0x5d0
[  124.022629] hardirqs last disabled at (132622):
[<ffffffff810c75b5>] sched_core_balance+0x1d5/0x5d0
[  124.022630] softirqs last  enabled at (132576):
[<ffffffff81095dde>] irq_enter+0x5e/0x60
[  124.022630] softirqs last disabled at (132575):
[<ffffffff81095dc3>] irq_enter+0x43/0x60
[  124.022631] CPU: 19 PID: 0 Comm: swapper/19 Tainted: G          I
    5.0.0-rc8-00544-gf24f5e9-dirty #20
[  124.022632] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  124.022632] RIP: 0010:native_queued_spin_lock_slowpath+0x5e/0x1d0
[  124.022633] Code: ff 75 40 f0 0f ba 2f 08 0f 82 cd 00 00 00 8b 07
30 e4 09 c6 f7 c6 00 ff ff ff 75 1b 85 f6 74 0e 8b 07 84 c0 74 08 f3
90 8b 07 <84> c0 75 f8 b8 01 00 06
[  124.022634] RSP: 0000:ffffc900063dbcf8 EFLAGS: 00000002
[  124.022635] RAX: 00000000000c0101 RBX: ffff888c0a3e4400 RCX: 000000000000c980
[  124.022635] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff888c0a3e4400
[  124.022636] RBP: ffff888c0a3e4400 R08: 0000000000000000 R09: 0000000000000001
[  124.022636] R10: ffffc900063dbca0 R11: 0000000000000000 R12: 0000000000000013
[  124.022637] R13: 0000000000000000 R14: ffff888c0b3e4400 R15: 00000000001e4400
[  124.022638] FS:  0000000000000000(0000) GS:ffff888c0b200000(0000)
knlGS:0000000000000000
[  124.022638] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  124.022639] CR2: 00007f21e7ff7728 CR3: 0000000be0b6a003 CR4: 00000000000606e0
[  124.022639] Call Trace:
[  124.022640]  do_raw_spin_lock+0xb4/0xc0
[  124.022640]  _raw_spin_lock_nested+0x49/0x60
[  124.022641]  sched_core_balance+0x106/0x5d0
[  124.022641]  __balance_callback+0x49/0xa0
[  124.022642]  __schedule+0x12eb/0x1660
[  124.022642]  ? enqueue_entity+0x112/0x6f0
[  124.022643]  ? sched_clock+0x5/0x10
[  124.022643]  ? sched_clock_cpu+0xc/0xa0
[  124.022643]  ? lockdep_hardirqs_on+0x11d/0x190
[  124.022644]  schedule_idle+0x28/0x40
[  124.022644]  do_idle+0x169/0x2a0
[  124.022645]  cpu_startup_entry+0x19/0x20
[  124.022645]  start_secondary+0x17f/0x1d0
[  124.022646]  secondary_startup_64+0xa4/0xb0
[  124.022647] NMI watchdog: Watchdog detected hard LOCKUP on cpu 20
[  124.022647] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  124.022676] irq event stamp: 12904
[  124.022676] hardirqs last  enabled at (12903): [<ffffffff81004552>]
do_syscall_64+0x12/0x1b0
[  124.022677] hardirqs last disabled at (12904): [<ffffffff81a07e92>]
__schedule+0xe2/0x1660
[  124.022677] softirqs last  enabled at (12890): [<ffffffff81e003a3>]
__do_softirq+0x3a3/0x3f2
[  124.022678] softirqs last disabled at (12883): [<ffffffff81095ea1>]
irq_exit+0xc1/0xd0
[  124.022679] CPU: 20 PID: 2688 Comm: schbench Tainted: G          I
     5.0.0-rc8-00544-gf24f5e9-dirty #20
[  124.022679] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  124.022680] RIP: 0010:native_queued_spin_lock_slowpath+0x169/0x1d0
[  124.022681] Code: 48 c1 ee 0b 83 e8 01 83 e6 60 48 98 48 81 c6 80
51 1e 00 48 03 34 c5 40 48 39 82 48 89 16 8b 42 08 85 c0 75 09 f3 90
8b 42 08 <85> c0 74 f7 48 8b 32 46
[  124.022681] RSP: 0018:ffffc9000a1c39b8 EFLAGS: 00000046
[  124.022682] RAX: 0000000000000000 RBX: ffff888c0a1e4400 RCX: 0000000000540000
[  124.022683] RDX: ffff888c0b5e5180 RSI: ffff888c0a5e5180 RDI: ffff888c0a1e4400
[  124.022683] RBP: ffff888c0a1e4400 R08: 0000000000540000 R09: 0000000000000001
[  124.022684] R10: ffffc9000a1c3960 R11: 0000000000000000 R12: 0000000000000002
[  124.022685] R13: 000000000000003b R14: 0000000000000001 R15: ffff888c0b1e4400
[  124.022685] FS:  00007f21ea7fc700(0000) GS:ffff888c0b400000(0000)
knlGS:0000000000000000
[  124.022686] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  124.022686] CR2: 00007f795684d2a0 CR3: 0000000be0b6a003 CR4: 00000000000606e0
[  124.022687] Call Trace:
[  124.022687]  do_raw_spin_lock+0xb4/0xc0
[  124.022688]  _raw_spin_lock_irqsave+0x63/0x80
[  124.022688]  load_balance+0x358/0xde0
[  124.022689]  ? update_dl_rq_load_avg+0x121/0x280
[  124.022689]  newidle_balance+0x1d0/0x580
[  124.022690]  __schedule+0x120f/0x1660
[  124.022690]  ? lock_acquire+0xab/0x180
[  124.022691]  ? attach_entity_load_avg+0x140/0x170
[  124.022691]  ? sched_clock+0x5/0x10
[  124.022692]  ? sched_clock_cpu+0xc/0xa0
[  124.022692]  schedule+0x32/0x70
[  124.022692]  futex_wait_queue_me+0xbf/0x130
[  124.022693]  futex_wait+0xf6/0x250
[  124.022693]  ? __switch_to_asm+0x34/0x70
[  124.022694]  ? __switch_to_asm+0x40/0x70
[  124.022694]  ? __switch_to_asm+0x34/0x70
[  124.022695]  ? __switch_to_asm+0x40/0x70
[  124.022695]  ? __switch_to_asm+0x34/0x70
[  124.022696]  ? __switch_to_asm+0x40/0x70
[  124.022696]  do_futex+0x311/0xb80
[  124.022697]  ? __schedule+0xde3/0x1660
[  124.022697]  __x64_sys_futex+0x88/0x180
[  124.022698]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  124.022698]  do_syscall_64+0x60/0x1b0
[  124.022699]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  124.022699] RIP: 0033:0x7f2202c2d4d9
[  124.022700] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40
00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 08
[  124.022701] RSP: 002b:00007f21ea7fbe68 EFLAGS: 00000246 ORIG_RAX:
00000000000000ca
[  124.022702] RAX: ffffffffffffffda RBX: 00007f21fc20a180 RCX: 00007f2202c2d4d9
[  124.022702] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f21fc20a180
[  124.022703] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[  124.022704] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f21ea7fbed0
[  124.022704] R13: 00007f2202716e6f R14: 0000000000000000 R15: 00007f21fc20a150
[  124.022705] NMI watchdog: Watchdog detected hard LOCKUP on cpu 21
[  124.022705] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  124.022734] irq event stamp: 7710
[  124.022734] hardirqs last  enabled at (7709): [<ffffffff81003829>]
trace_hardirqs_on_thunk+0x1a/0x1c
[  124.022735] hardirqs last disabled at (7710): [<ffffffff81003845>]
trace_hardirqs_off_thunk+0x1a/0x1c
[  124.022736] softirqs last  enabled at (7704): [<ffffffff81e003a3>]
__do_softirq+0x3a3/0x3f2
[  124.022736] softirqs last disabled at (7697): [<ffffffff81095ea1>]
irq_exit+0xc1/0xd0
[  124.022737] CPU: 21 PID: 2719 Comm: schbench Tainted: G          I
     5.0.0-rc8-00544-gf24f5e9-dirty #20
[  124.022737] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  124.022738] RIP: 0010:native_queued_spin_lock_slowpath+0x5c/0x1d0
[  124.022739] Code: ff ff ff 75 40 f0 0f ba 2f 08 0f 82 cd 00 00 00
8b 07 30 e4 09 c6 f7 c6 00 ff ff ff 75 1b 85 f6 74 0e 8b 07 84 c0 74
08 f3 90 <8b> 07 84 c0 75 f8 b8 05
[  124.022739] RSP: 0000:ffff888c0b603e78 EFLAGS: 00000002
[  124.022740] RAX: 00000000006c0101 RBX: ffff888c0a7e4400 RCX: 0000000000000540
[  124.022741] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff888c0a7e4400
[  124.022742] RBP: ffff888c0a7e4400 R08: 0000000000000000 R09: 0000000000000001
[  124.022742] R10: ffff888c0b603e20 R11: 0000000000000000 R12: 00000000001e4400
[  124.022743] R13: 0000000000000015 R14: 0000000000000002 R15: ffff888be0362a40
[  124.022743] FS:  00007f21dafdd700(0000) GS:ffff888c0b600000(0000)
knlGS:0000000000000000
[  124.022744] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  124.022744] CR2: 00007f21dafdd728 CR3: 0000000be0b6a002 CR4: 00000000000606e0
[  124.022745] Call Trace:
[  124.022745]  <IRQ>
[  124.022746]  do_raw_spin_lock+0xb4/0xc0
[  124.022746]  _raw_spin_lock+0x4b/0x60
[  124.022747]  scheduler_tick+0x48/0x170
[  124.022747]  ? tick_sched_do_timer+0x60/0x60
[  124.022748]  update_process_times+0x40/0x50
[  124.022748]  tick_sched_handle+0x22/0x60
[  124.022749]  tick_sched_timer+0x37/0x70
[  124.022749]  __hrtimer_run_queues+0xed/0x3f0
[  124.022750]  hrtimer_interrupt+0x122/0x270
[  124.022750]  smp_apic_timer_interrupt+0x86/0x210
[  124.022751]  apic_timer_interrupt+0xf/0x20
[  124.022751]  </IRQ>
[  124.022752] RIP: 0033:0x400b70
[  124.022752] Code: 68 03 00 00 00 e9 b0 ff ff ff ff 25 e2 34 20 00
68 04 00 00 00 e9 a0 ff ff ff ff 25 da 34 20 00 68 05 00 00 00 e9 90
ff ff ff <ff> 25 d2 34 20 00 68 00
[  124.022753] RSP: 002b:00007f21dafdce48 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff13
[  124.022754] RAX: 000000000001bd7b RBX: 00000000002dc6c0 RCX: 0000000000000000
[  124.022755] RDX: 00007f21dafdce60 RSI: 0000000000000000 RDI: 00007f21dafdce50
[  124.022755] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000007a06
[  124.022756] R10: 00007f21dafdce30 R11: 0000000000000246 R12: 00007f21dafdced0
[  124.022756] R13: 00007f2202716e6f R14: 0000000000000000 R15: 00007f21fcf480f8
[  124.022757] NMI watchdog: Watchdog detected hard LOCKUP on cpu 22
[  124.022758] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  124.022786] irq event stamp: 14310
[  124.022787] hardirqs last  enabled at (14309): [<ffffffff81003829>]
trace_hardirqs_on_thunk+0x1a/0x1c
[  124.022787] hardirqs last disabled at (14310): [<ffffffff81003845>]
trace_hardirqs_off_thunk+0x1a/0x1c
[  124.022788] softirqs last  enabled at (14308): [<ffffffff81e003a3>]
__do_softirq+0x3a3/0x3f2
[  124.022788] softirqs last disabled at (14301): [<ffffffff81095ea1>]
irq_exit+0xc1/0xd0
[  124.022789] CPU: 22 PID: 2702 Comm: schbench Tainted: G          I
     5.0.0-rc8-00544-gf24f5e9-dirty #20
[  124.022790] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  124.022790] RIP: 0010:native_queued_spin_lock_slowpath+0x5e/0x1d0
[  124.022791] Code: ff 75 40 f0 0f ba 2f 08 0f 82 cd 00 00 00 8b 07
30 e4 09 c6 f7 c6 00 ff ff ff 75 1b 85 f6 74 0e 8b 07 84 c0 74 08 f3
90 8b 07 <84> c0 75 f8 b8 01 00 06
[  124.022792] RSP: 0000:ffff888c0b803e78 EFLAGS: 00000002
[  124.022793] RAX: 0000000000000101 RBX: ffff888c0a9e4400 RCX: 0000000000000540
[  124.022793] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff888c0a9e4400
[  124.022794] RBP: ffff888c0a9e4400 R08: 0000000000000000 R09: 0000000000000001
[  124.022794] R10: ffff888c0b803e20 R11: 0000000000000000 R12: 00000000001e4400
[  124.022795] R13: 0000000000000016 R14: 0000000000000006 R15: ffff888be0335480
[  124.022795] FS:  00007f21e27ec700(0000) GS:ffff888c0b800000(0000)
knlGS:0000000000000000
[  124.022796] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  124.022797] CR2: 00007f21e27ec728 CR3: 0000000be0b6a003 CR4: 00000000000606e0
[  124.022797] Call Trace:
[  124.022797]  <IRQ>
[  124.022798]  do_raw_spin_lock+0xb4/0xc0
[  124.022798]  _raw_spin_lock+0x4b/0x60
[  124.022799]  scheduler_tick+0x48/0x170
[  124.022799]  ? tick_sched_do_timer+0x60/0x60
[  124.022800]  update_process_times+0x40/0x50
[  124.022800]  tick_sched_handle+0x22/0x60
[  124.022801]  tick_sched_timer+0x37/0x70
[  124.022801]  __hrtimer_run_queues+0xed/0x3f0
[  124.022802]  hrtimer_interrupt+0x122/0x270
[  124.022802]  smp_apic_timer_interrupt+0x86/0x210
[  124.022803]  apic_timer_interrupt+0xf/0x20
[  124.022804]  </IRQ>
[  124.022804] RIP: 0010:smp_call_function_many+0x20e/0x270
[  124.022805] Code: f5 89 00 3b 05 e7 ed 66 01 0f 83 7e fe ff ff 48
63 d0 48 8b 0b 48 03 0c d5 40 48 39 82 8b 51 18 83 e2 01 74 0a f3 90
8b 51 18 <83> e2 01 75 f6 eb c8 09
[  124.022806] RSP: 0000:ffffc9000a237bc0 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff13
[  124.022807] RAX: 0000000000000000 RBX: ffff888c0b9e5400 RCX: ffff888c09dead60
[  124.022807] RDX: 0000000000000003 RSI: 0000000000000200 RDI: ffff888c0b9e5408
[  124.022808] RBP: ffff888c0b9e5408 R08: 0000000000000001 R09: 0000000000000000
[  124.022808] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8107d070
[  124.022809] R13: ffffc9000a237cc0 R14: 0000000000000001 R15: 0000000000000200
[  124.022810]  ? flush_tlb_func_common+0x300/0x300
[  124.022810]  ? flush_tlb_func_common+0x300/0x300
[  124.022811]  ? flush_tlb_func_common+0x300/0x300
[  124.022811]  on_each_cpu_mask+0x25/0x90
[  124.022812]  ? x86_configure_nx+0x50/0x50
[  124.022812]  on_each_cpu_cond_mask+0x97/0xd0
[  124.022813]  flush_tlb_mm_range+0xd5/0x140
[  124.022813]  ? sched_clock_cpu+0xc/0xa0
[  124.022814]  ? change_protection+0xa7b/0xaf0
[  124.022814]  ? _cond_resched+0x16/0x40
[  124.022814]  change_protection+0xa7b/0xaf0
[  124.022815]  change_prot_numa+0x18/0x30
[  124.022815]  task_numa_work+0x217/0x330
[  124.022816]  task_work_run+0x7e/0xa0
[  124.022816]  exit_to_usermode_loop+0xe0/0xf0
[  124.022817]  prepare_exit_to_usermode+0x9f/0xd0
[  124.022817]  retint_user+0x8/0x18
[  124.022818] RIP: 0033:0x7fff855a2615
[  124.022819] Code: 0c 07 08 00 00 1c 00 00 00 fc 00 00 00 c8 02 00
00 2a 00 00 00 00 41 0e 10 86 02 48 0d 06 60 c6 0c 07 08 00 00 4c 8d
54 24 08 <48> 63 ff 48 83 e4 f0 48
[  124.022819] RSP: 002b:00007f21e27ebe28 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff13
[  124.022820] RAX: 00000000001e2ec1 RBX: 00007f21e27ebe50 RCX: 0000000000000001
[  124.022821] RDX: 00007f21e27ebe60 RSI: 00007f21e27ebe50 RDI: 0000000000000000
[  124.022821] RBP: 00007f21e27ebe40 R08: 0000000000000001 R09: 0000000000007daa
[  124.022822] R10: 00007f21e27ebe30 R11: 0000000000000246 R12: 0000000000000000
[  124.022823] R13: 00007f2202716e6f R14: 0000000000000000 R15: 00007f21fc721eb8
[  124.022823] NMI watchdog: Watchdog detected hard LOCKUP on cpu 23
[  124.022824] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo
iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntracki
[  124.022852] irq event stamp: 141222
[  124.022853] hardirqs last  enabled at (141221):
[<ffffffff810c7544>] sched_core_balance+0x164/0x5d0
[  124.022854] hardirqs last disabled at (141222):
[<ffffffff810c75b5>] sched_core_balance+0x1d5/0x5d0
[  124.022854] softirqs last  enabled at (141112):
[<ffffffff81e003a3>] __do_softirq+0x3a3/0x3f2
[  124.022855] softirqs last disabled at (141101):
[<ffffffff81095ea1>] irq_exit+0xc1/0xd0
[  124.022856] CPU: 23 PID: 0 Comm: swapper/23 Tainted: G          I
    5.0.0-rc8-00544-gf24f5e9-dirty #20
[  124.022857] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
SE5C600.86B.99.99.x058.082120120902 08/21/2012
[  124.022857] RIP: 0010:native_queued_spin_lock_slowpath+0x5e/0x1d0
[  124.022858] Code: ff 75 40 f0 0f ba 2f 08 0f 82 cd 00 00 00 8b 07
30 e4 09 c6 f7 c6 00 ff ff ff 75 1b 85 f6 74 0e 8b 07 84 c0 74 08 f3
90 8b 07 <84> c0 75 f8 b8 01 00 06
[  124.022859] RSP: 0000:ffffc900063fbcf8 EFLAGS: 00000002
[  124.022860] RAX: 00000000001c0101 RBX: ffff888c0abe4400 RCX: 000000000000c980
[  124.022860] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff888c0abe4400
[  124.022861] RBP: ffff888c0abe4400 R08: 0000000000000000 R09: 0000000000000001
[  124.022861] R10: ffffc900063fbca0 R11: 0000000000000000 R12: 0000000000000017
[  124.022862] R13: 0000000000000000 R14: ffff888c0bbe4400 R15: 00000000001e4400
[  124.022863] FS:  0000000000000000(0000) GS:ffff888c0ba00000(0000)
knlGS:0000000000000000
[  124.022863] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  124.022864] CR2: 00007f21eb7fe728 CR3: 0000000be0b6a002 CR4: 00000000000606e0
[  124.022864] Call Trace:
[  124.022865]  do_raw_spin_lock+0xb4/0xc0
[  124.022865]  _raw_spin_lock_nested+0x49/0x60
[  124.022866]  sched_core_balance+0x106/0x5d0
[  124.022866]  __balance_callback+0x49/0xa0
[  124.022867]  __schedule+0x12eb/0x1660
[  124.022867]  ? _raw_spin_unlock_irqrestore+0x4e/0x60
[  124.022868]  ? hrtimer_start_range_ns+0x1b7/0x340
[  124.022868]  schedule_idle+0x28/0x40
[  124.022868]  do_idle+0x169/0x2a0
[  124.022869]  cpu_startup_entry+0x19/0x20
[  124.022869]  start_secondary+0x17f/0x1d0
[  124.022870]  secondary_startup_64+0xa4/0xb0
[  129.616038] ---[ end Kernel panic - not syncing: Hard LOCKUP ]---

* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-04-05 14:55       ` Aaron Lu
@ 2019-04-09 18:09         ` Tim Chen
  2019-04-10  4:36           ` Aaron Lu
  2019-04-10  8:06           ` Peter Zijlstra
  0 siblings, 2 replies; 99+ messages in thread
From: Tim Chen @ 2019-04-09 18:09 UTC (permalink / raw)
  To: Aaron Lu, Peter Zijlstra
  Cc: mingo, tglx, pjt, torvalds, linux-kernel, subhra.mazumdar,
	fweisbec, keescook, kerrnel, Aubrey Li, Julien Desfossez

On 4/5/19 7:55 AM, Aaron Lu wrote:
> On Tue, Apr 02, 2019 at 10:28:12AM +0200, Peter Zijlstra wrote:
>> Another approach would be something like the below:
>>
>>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -87,7 +87,7 @@ static inline int __task_prio(struct tas
>>   */
>>  
>>  /* real prio, less is less */
>> -static inline bool __prio_less(struct task_struct *a, struct task_struct *b, bool runtime)
>> +static inline bool __prio_less(struct task_struct *a, struct task_struct *b, u64 vruntime)
>>  {
>>  	int pa = __task_prio(a), pb = __task_prio(b);
>>  
>> @@ -104,21 +104,25 @@ static inline bool __prio_less(struct ta
>>  	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
>>  		return !dl_time_before(a->dl.deadline, b->dl.deadline);
>>  
>> -	if (pa == MAX_RT_PRIO + MAX_NICE && runtime) /* fair */
>> -		return !((s64)(a->se.vruntime - b->se.vruntime) < 0);
>> +	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
>> +		return !((s64)(a->se.vruntime - vruntime) < 0);
> 		                                         ~~~
> I think <= should be used here, so that comparing two tasks with the same
> vruntime returns false. Otherwise we could bounce between two tasks with
> different tags, with one set to max in the first round and the other set
> to max in the next round. The CPU would get stuck in __schedule() with
> irqs disabled.
> 
>>  
>>  	return false;
>>  }
>>  
>>  static inline bool cpu_prio_less(struct task_struct *a, struct task_struct *b)
>>  {
>> -	return __prio_less(a, b, true);
>> +	return __prio_less(a, b, b->se.vruntime);
>>  }
>>  
>>  static inline bool core_prio_less(struct task_struct *a, struct task_struct *b)
>>  {
>> -	/* cannot compare vruntime across CPUs */
>> -	return __prio_less(a, b, false);
>> +	u64 vruntime = b->se.vruntime;
>> +
>> +	vruntime -= task_rq(b)->cfs.min_vruntime;
>> +	vruntime += task_rq(a)->cfs.min_vruntime
> 
> After some testing, I figured task_cfs_rq() should be used instead of
> task_rq(:-)
> 
> With these two changes (and some other minor ones that still need more time
> to sort out), I'm now able to run 2 full CPU kbuilds in 2 tagged
> cgroups. Previously, the system would hang pretty soon after I started a
> kbuild in any tagged cgroup (presumably with CPUs stuck in __schedule() with
> irqs disabled).
> 
> Also, no warning appeared about two tasks with different tags getting
> scheduled on the same CPU.
> 
> Thanks,
> Aaron
> 

Peter,

Now that we have accumulated quite a number of different fixes to your original
posted patches, would you like to post a v2 of the core scheduler with the fixes?

Thanks.

Tim
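
The tie-breaking concern raised in the quoted review (using `<=` rather than `<` in the fair-class comparison) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct model_task`, `normalize_vruntime()`, `prio_less_lt()` and `prio_less_le()` are hypothetical names, not kernel code, and only the vruntime arithmetic from the quoted diff is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields used by the quoted diff. */
struct model_task {
	uint64_t vruntime;     /* models p->se.vruntime */
	uint64_t min_vruntime; /* models min_vruntime of the task's cfs_rq */
};

/* Translate b's vruntime onto a's runqueue timeline (assumed model). */
static uint64_t normalize_vruntime(const struct model_task *a,
				   const struct model_task *b)
{
	return b->vruntime - b->min_vruntime + a->min_vruntime;
}

/* Comparison as written in the quoted diff: a tie counts as "less". */
static bool prio_less_lt(const struct model_task *a,
			 const struct model_task *b)
{
	int64_t delta = (int64_t)(a->vruntime - normalize_vruntime(a, b));

	return !(delta < 0);
}

/* Comparison with the suggested <=: a tie is not "less" either way. */
static bool prio_less_le(const struct model_task *a,
			 const struct model_task *b)
{
	int64_t delta = (int64_t)(a->vruntime - normalize_vruntime(a, b));

	return !(delta <= 0);
}
```

With two tasks whose normalized vruntimes are equal, the `<` variant reports each task as lower priority than the other, which is the bouncing described in the review; the `<=` variant reports neither.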

* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-02-18 16:56 ` [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling Peter Zijlstra
       [not found]   ` <20190402064612.GA46500@aaronlu>
@ 2019-04-09 18:38   ` Julien Desfossez
  2019-04-10 15:01     ` Peter Zijlstra
  2019-04-11  0:11     ` Subhra Mazumdar
  1 sibling, 2 replies; 99+ messages in thread
From: Julien Desfossez @ 2019-04-09 18:38 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: Julien Desfossez, linux-kernel, subhra.mazumdar, fweisbec,
	keescook, kerrnel, Vineeth Pillai, Nishanth Aravamudan, Aaron Lu

We found the source of the major performance regression we discussed
previously. It turns out there was a pattern where a task (a kworker in this
case) could be woken up, but the core could still end up idle before that
task had a chance to run.

Example sequence, where cpu0 and cpu1 are siblings on the same core and task1
and task2 are in the same cgroup with the tag enabled (the following lines
are in increasing order of time):
- task1 running on cpu0, task2 running on cpu1
- sched_waking(kworker/0, target_cpu=cpu0)
- task1 scheduled out of cpu0
- kworker/0 cannot run on cpu0 because task2 is still running on cpu1;
  cpu0 is idle
- task2 scheduled out of cpu1
- cpu1 doesn’t select kworker/0 for cpu0, because the optimization path ends
  the task selection if core_cookie is NULL for the currently selected process
  and cpu1’s runqueue.
- cpu1 is idle
--> both siblings are idle but kworker/0 is still in the run queue of cpu0.
    cpu0 may stay idle for longer if it goes into deep idle.

With the fix below, we make sure to send an IPI to the sibling if it is idle
and has tasks waiting in its runqueue.
This fixes the performance issue we were seeing.
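
The condition the fix checks can be sketched as a standalone userspace model. This is an illustration only: `struct model_rq` and `kick_stranded_siblings()` are hypothetical names, and setting `resched_pending` merely stands in for what `resched_curr()` does in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU runqueue model; only the fields the check reads. */
struct model_rq {
	bool curr_is_idle;       /* models is_idle_task(rq->curr) */
	unsigned int nr_running; /* runnable tasks queued on this CPU */
	bool resched_pending;    /* stands in for the effect of resched_curr() */
};

/*
 * After an unconstrained pick on `cpu`, flag every other sibling that is
 * idling while it still has runnable tasks. Returns the number of siblings
 * flagged.
 */
static size_t kick_stranded_siblings(struct model_rq *smt, size_t nr_siblings,
				     size_t cpu)
{
	size_t kicked = 0;

	for (size_t j = 0; j < nr_siblings; j++) {
		if (j == cpu)
			continue;
		if (smt[j].curr_is_idle && smt[j].nr_running > 0) {
			smt[j].resched_pending = true;
			kicked++;
		}
	}
	return kicked;
}
```

In the sequence above, cpu0 is idle with kworker/0 queued when cpu1 makes its pick, so the model flags cpu0; a sibling that is idle with an empty runqueue is left alone.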

Now here is what we can measure with a disk write-intensive benchmark:
- no performance impact with enabling core scheduling without any tagged
  task,
- 5% overhead if one tagged task is competing with an untagged task,
- 10% overhead if 2 tasks tagged with a different tag are competing
  against each other.

We are starting more scaling tests, but this is very encouraging!


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e1fa10561279..02c862a5e973 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3779,7 +3779,22 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 				trace_printk("unconstrained pick: %s/%d %lx\n",
 						next->comm, next->pid, next->core_cookie);
+				rq->core_pick = NULL;
 
+				/*
+				 * If the sibling is idling, we might want to wake it
+				 * so that it can check for any runnable but blocked tasks 
+				 * due to previous task matching.
+				 */
+				for_each_cpu(j, smt_mask) {
+					struct rq *rq_j = cpu_rq(j);
+					rq_j->core_pick = NULL;
+					if (j != cpu && is_idle_task(rq_j->curr) && rq_j->nr_running) {
+						resched_curr(rq_j);
+						trace_printk("IPI(%d->%d[%d]) idle preempt\n",
+							     cpu, j, rq_j->nr_running);
+					}
+				}
 				goto done;
 			}
 

* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-04-09 18:09         ` Tim Chen
@ 2019-04-10  4:36           ` Aaron Lu
  2019-04-10 14:18             ` Aubrey Li
  2019-04-10 14:44             ` Peter Zijlstra
  2019-04-10  8:06           ` Peter Zijlstra
  1 sibling, 2 replies; 99+ messages in thread
From: Aaron Lu @ 2019-04-10  4:36 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra
  Cc: mingo, tglx, pjt, torvalds, linux-kernel, subhra.mazumdar,
	fweisbec, keescook, kerrnel, Aubrey Li, Julien Desfossez

On Tue, Apr 09, 2019 at 11:09:45AM -0700, Tim Chen wrote:
> Now that we have accumulated quite a number of different fixes to your original
> posted patches, would you like to post a v2 of the core scheduler with the fixes?

One more question I'm not sure about: should a task with cookie=0, i.e. an
untagged task, be allowed to be scheduled on the same core as a
tagged task?

The current patch seems to be inconsistent on this, e.g. in pick_task(),
if max is already chosen but max->core_cookie == 0, then we don't care
about the cookie and simply use class_pick for the other cpu. This means we
could schedule two tasks with different cookies (one is zero and the
other can be tagged).

But then sched_core_find() only allows the idle task to match with any tagged
task (we don't place untagged tasks in the core tree, of course :-).

Thoughts? Do I understand this correctly? If so, I think we probably
want to make this clear before v2. I personally feel we shouldn't allow
untagged tasks (like kernel threads) to match with tagged tasks.
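
The two possible answers to the question above can be written down as two compatibility predicates. This is a simplified sketch, not the patch's logic: `cookie_t` and both function names are hypothetical, a cookie of 0 models an untagged task, and the idle task (which matches anything) is left out.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long cookie_t; /* 0 models an untagged task */

/* Permissive policy: an untagged task may share a core with any task. */
static bool cookies_compatible_permissive(cookie_t a, cookie_t b)
{
	return a == 0 || b == 0 || a == b;
}

/* Strict policy: only equal cookies share a core, so untagged tasks
 * only pair with other untagged tasks. */
static bool cookies_compatible_strict(cookie_t a, cookie_t b)
{
	return a == b;
}
```

The policies differ only when exactly one of the two tasks is untagged; two tasks with different non-zero cookies are rejected by both.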

* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-04-09 18:09         ` Tim Chen
  2019-04-10  4:36           ` Aaron Lu
@ 2019-04-10  8:06           ` Peter Zijlstra
  2019-04-10 19:58             ` Vineeth Remanan Pillai
  2019-04-15 16:59             ` Julien Desfossez
  1 sibling, 2 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-04-10  8:06 UTC (permalink / raw)
  To: Tim Chen
  Cc: Aaron Lu, mingo, tglx, pjt, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel, Aubrey Li,
	Julien Desfossez

On Tue, Apr 09, 2019 at 11:09:45AM -0700, Tim Chen wrote:
> Now that we have accumulated quite a number of different fixes to your original
> posted patches, would you like to post a v2 of the core scheduler with the fixes?

Well, I was promised someone else was going to carry all this, also,
while you're all having fun playing with this, I've not yet had answers
to the important questions of how L1TF complete we want to be and if all
this crud actually matters one way or the other.

Also, I still don't see this stuff working for high context switch rate
workloads, and that is exactly what some people were aiming for..



* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-04-10  4:36           ` Aaron Lu
@ 2019-04-10 14:18             ` Aubrey Li
  2019-04-11  2:11               ` Aaron Lu
  2019-04-10 14:44             ` Peter Zijlstra
  1 sibling, 1 reply; 99+ messages in thread
From: Aubrey Li @ 2019-04-10 14:18 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Subhra Mazumdar, Frédéric Weisbecker, Kees Cook,
	Greg Kerr, Aubrey Li, Julien Desfossez

On Wed, Apr 10, 2019 at 12:36 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
>
> On Tue, Apr 09, 2019 at 11:09:45AM -0700, Tim Chen wrote:
> > Now that we have accumulated quite a number of different fixes to your original
> > posted patches, would you like to post a v2 of the core scheduler with the fixes?
>
> One more question I'm not sure about: should a task with cookie=0, i.e. a task
> that is untagged, be allowed to be scheduled on the same core as
> a tagged task?
>
> The current patch seems to disagree on this, e.g. in pick_task(),
> if max is already chosen but max->core_cookie == 0, then we didn't care
> about cookie and simply use class_pick for the other cpu. This means we
> could schedule two tasks with different cookies(one is zero and the
> other can be tagged).
>
> But then sched_core_find() only allow idle task to match with any tagged
> tasks(we didn't place untagged tasks to the core tree of course :-).
>
> Thoughts? Do I understand this correctly? If so, I think we probably
> want to make this clear before v2. I personally feel, we shouldn't allow
> untagged tasks(like kernel threads) to match with tagged tasks.

Does it make sense if we take untagged tasks as hypervisor, and different
cookie tasks as different VMs? Isolation is done between VMs, not between
VM and hypervisor.

Did you see anything harmful if an untagged task and a tagged task
run simultaneously on the same core?

Thanks,
-Aubrey

* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-04-10  4:36           ` Aaron Lu
  2019-04-10 14:18             ` Aubrey Li
@ 2019-04-10 14:44             ` Peter Zijlstra
  2019-04-11  3:05               ` Aaron Lu
  1 sibling, 1 reply; 99+ messages in thread
From: Peter Zijlstra @ 2019-04-10 14:44 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, mingo, tglx, pjt, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel, Aubrey Li,
	Julien Desfossez

On Wed, Apr 10, 2019 at 12:36:33PM +0800, Aaron Lu wrote:
> On Tue, Apr 09, 2019 at 11:09:45AM -0700, Tim Chen wrote:
> > Now that we have accumulated quite a number of different fixes to your original
> > posted patches, would you like to post a v2 of the core scheduler with the fixes?
> 
> One more question I'm not sure about: should a task with cookie=0, i.e. a task
> that is untagged, be allowed to be scheduled on the same core as
> a tagged task?

That was not meant to be possible.

> The current patch seems to disagree on this, e.g. in pick_task(),
> if max is already chosen but max->core_cookie == 0, then we didn't care
> about cookie and simply use class_pick for the other cpu. This means we
> could schedule two tasks with different cookies(one is zero and the
> other can be tagged).

When core_cookie==0 we shouldn't schedule the other siblings at all.

> But then sched_core_find() only allow idle task to match with any tagged
> tasks(we didn't place untagged tasks to the core tree of course :-).
> 
> Thoughts? Do I understand this correctly? If so, I think we probably
> want to make this clear before v2. I personally feel, we shouldn't allow
> untagged tasks(like kernel threads) to match with tagged tasks.

Agreed, cookie should always match or idle.
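
As a rough illustration of the rule agreed on here (siblings run tasks
whose cookies match, or else idle), a minimal sketch in plain C follows.
The struct and function names are hypothetical stand-ins, not the
kernel's task_struct or anything from the patch set, and the sketch
assumes that two untagged (cookie 0) tasks count as matching each other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a schedulable task. */
struct demo_task {
	unsigned long core_cookie;	/* 0 means untagged */
	bool is_idle;			/* true for the per-CPU idle task */
};

/*
 * Return true if @a and @b may run at the same time on SMT siblings of
 * one core: either one of them is the idle task, or their cookies are
 * equal. Assumption: two untagged tasks (cookie 0) match each other.
 */
static bool demo_cookie_match(const struct demo_task *a,
			      const struct demo_task *b)
{
	assert(a != NULL && b != NULL);

	if (a->is_idle || b->is_idle)
		return true;

	return a->core_cookie == b->core_cookie;
}
```

Under this predicate an untagged task never shares a core with a tagged
one; the only thing that pairs with everything is the idle task.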

* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-04-09 18:38   ` Julien Desfossez
@ 2019-04-10 15:01     ` Peter Zijlstra
  2019-04-11  0:11     ` Subhra Mazumdar
  1 sibling, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-04-10 15:01 UTC (permalink / raw)
  To: Julien Desfossez
  Cc: mingo, tglx, pjt, tim.c.chen, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel, Vineeth Pillai,
	Nishanth Aravamudan, Aaron Lu

On Tue, Apr 09, 2019 at 02:38:55PM -0400, Julien Desfossez wrote:
> We found the source of the major performance regression we discussed
> previously. It turns out there was a pattern where a task (a kworker in this
> case) could be woken up, but the core could still end up idle before that
> task had a chance to run.
> 
> Example sequence, cpu0 and cpu1 are siblings on the same core, task1 and
> task2 are in the same cgroup with the tag enabled (each following line
> happens in increasing order of time):
> - task1 running on cpu0, task2 running on cpu1
> - sched_waking(kworker/0, target_cpu=cpu0)
> - task1 scheduled out of cpu0
> - kworker/0 cannot run on cpu0 because task2 is still running on cpu1;
>   cpu0 is idle
> - task2 scheduled out of cpu1

But at this point core_cookie is still set; we don't clear it when the
last task goes away.

> - cpu1 doesn’t select kworker/0 for cpu0, because the optimization path ends
>   the task selection if core_cookie is NULL for currently selected process
>   and the cpu1’s runqueue.

But at this point core_cookie is still set, we only (re)set it later to
p->core_cookie.

What I suspect happens is that you hit the 'again' clause due to a
higher prio @max on the second sibling. And at that point we've
destroyed core_cookie.

> - cpu1 is idle
> --> both siblings are idle but kworker/0 is still in the run queue of cpu0.
>     Cpu0 may stay idle for longer if it goes deep idle.
> 
> With the fix below, we ensure to send an IPI to the sibling if it is idle
> and has tasks waiting in its runqueue.
> This fixes the performance issue we were seeing.
> 
> Now here is what we can measure with a disk write-intensive benchmark:
> - no performance impact with enabling core scheduling without any tagged
>   task,
> - 5% overhead if one tagged task is competing with an untagged task,
> - 10% overhead if 2 tasks tagged with a different tag are competing
>   against each other.
> 
> We are starting more scaling tests, but this is very encouraging !
> 
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e1fa10561279..02c862a5e973 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3779,7 +3779,22 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  
>  				trace_printk("unconstrained pick: %s/%d %lx\n",
>  						next->comm, next->pid, next->core_cookie);
> +				rq->core_pick = NULL;
>  
> +				/*
> +				 * If the sibling is idling, we might want to wake it
> +				 * so that it can check for any runnable but blocked tasks 
> +				 * due to previous task matching.
> +				 */
> +				for_each_cpu(j, smt_mask) {
> +					struct rq *rq_j = cpu_rq(j);
> +					rq_j->core_pick = NULL;
> +					if (j != cpu && is_idle_task(rq_j->curr) && rq_j->nr_running) {
> +						resched_curr(rq_j);
> +						trace_printk("IPI(%d->%d[%d]) idle preempt\n",
> +							     cpu, j, rq_j->nr_running);
> +					}
> +				}
>  				goto done;
>  			}

I'm thinking there is a more elegant solution hiding in there; possibly
saving/restoring that core_cookie on the again loop should do, but I've
always had the nagging suspicion that whole selection loop could be done
better.

* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-04-10  8:06           ` Peter Zijlstra
@ 2019-04-10 19:58             ` Vineeth Remanan Pillai
  2019-04-15 16:59             ` Julien Desfossez
  1 sibling, 0 replies; 99+ messages in thread
From: Vineeth Remanan Pillai @ 2019-04-10 19:58 UTC (permalink / raw)
  To: Peter Zijlstra, tim.c.chen
  Cc: Vineeth Pillai, Aaron Lu, mingo, tglx, pjt, torvalds,
	linux-kernel, subhra.mazumdar, fweisbec, keescook, kerrnel,
	Aubrey Li, Julien Desfossez, Nishanth Aravamudan

From: Vineeth Pillai <vpillai@digitalocean.com>

> Well, I was promised someone else was going to carry all this, also

We are interested in this feature and have been actively testing, benchmarking
and working on fixes. If there is no v2 effort currently in progress, we are
willing to help consolidate all the changes discussed here and prepare a v2.
If there are any pending changes in the pipeline, please post them so that
we can include them in v2.

We hope to post the v2 with all the changes here in a week’s time, rebased on
the latest tip.


* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-04-09 18:38   ` Julien Desfossez
  2019-04-10 15:01     ` Peter Zijlstra
@ 2019-04-11  0:11     ` Subhra Mazumdar
  2019-04-19  8:40       ` Ingo Molnar
  1 sibling, 1 reply; 99+ messages in thread
From: Subhra Mazumdar @ 2019-04-11  0:11 UTC (permalink / raw)
  To: Julien Desfossez, Peter Zijlstra, mingo, tglx, pjt, tim.c.chen, torvalds
  Cc: linux-kernel, fweisbec, keescook, kerrnel, Vineeth Pillai,
	Nishanth Aravamudan, Aaron Lu


On 4/9/19 11:38 AM, Julien Desfossez wrote:
> We found the source of the major performance regression we discussed
> previously. It turns out there was a pattern where a task (a kworker in this
> case) could be woken up, but the core could still end up idle before that
> task had a chance to run.
>
> Example sequence, cpu0 and cpu1 are siblings on the same core, task1 and
> task2 are in the same cgroup with the tag enabled (each following line
> happens in increasing order of time):
> - task1 running on cpu0, task2 running on cpu1
> - sched_waking(kworker/0, target_cpu=cpu0)
> - task1 scheduled out of cpu0
> - kworker/0 cannot run on cpu0 because task2 is still running on cpu1;
>    cpu0 is idle
> - task2 scheduled out of cpu1
> - cpu1 doesn’t select kworker/0 for cpu0, because the optimization path ends
>    the task selection if core_cookie is NULL for currently selected process
>    and the cpu1’s runqueue.
> - cpu1 is idle
> --> both siblings are idle but kworker/0 is still in the run queue of cpu0.
>      Cpu0 may stay idle for longer if it goes deep idle.
>
> With the fix below, we ensure to send an IPI to the sibling if it is idle
> and has tasks waiting in its runqueue.
> This fixes the performance issue we were seeing.
>
> Now here is what we can measure with a disk write-intensive benchmark:
> - no performance impact with enabling core scheduling without any tagged
>    task,
> - 5% overhead if one tagged task is competing with an untagged task,
> - 10% overhead if 2 tasks tagged with a different tag are competing
>    against each other.
>
> We are starting more scaling tests, but this is very encouraging !
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e1fa10561279..02c862a5e973 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3779,7 +3779,22 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>   
>   				trace_printk("unconstrained pick: %s/%d %lx\n",
>   						next->comm, next->pid, next->core_cookie);
> +				rq->core_pick = NULL;
>   
> +				/*
> +				 * If the sibling is idling, we might want to wake it
> +				 * so that it can check for any runnable but blocked tasks
> +				 * due to previous task matching.
> +				 */
> +				for_each_cpu(j, smt_mask) {
> +					struct rq *rq_j = cpu_rq(j);
> +					rq_j->core_pick = NULL;
> +					if (j != cpu && is_idle_task(rq_j->curr) && rq_j->nr_running) {
> +						resched_curr(rq_j);
> +						trace_printk("IPI(%d->%d[%d]) idle preempt\n",
> +							     cpu, j, rq_j->nr_running);
> +					}
> +				}
>   				goto done;
>   			}
>   
I see similar improvement with this patch as removing the condition I
earlier mentioned. So that's not needed. I also included the patch for the
priority fix. For 2 DB instances, HT disabling stands at -22% for 32 users
(from earlier emails).


1 DB instance

users  baseline  %idle  core_sched  %idle
16     1         84     -4.9%       84
24     1         76     -6.7%       75
32     1         69     -2.4%       69

2 DB instance

users  baseline  %idle  core_sched  %idle
16     1         66     -19.5%      69
24     1         54     -9.8%       57
32     1         42     -27.2%      48

* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-04-10 14:18             ` Aubrey Li
@ 2019-04-11  2:11               ` Aaron Lu
  0 siblings, 0 replies; 99+ messages in thread
From: Aaron Lu @ 2019-04-11  2:11 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Tim Chen, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Subhra Mazumdar, Frédéric Weisbecker, Kees Cook,
	Greg Kerr, Julien Desfossez

On Wed, Apr 10, 2019 at 10:18:10PM +0800, Aubrey Li wrote:
> On Wed, Apr 10, 2019 at 12:36 PM Aaron Lu <aaron.lu@linux.alibaba.com> wrote:
> >
> > On Tue, Apr 09, 2019 at 11:09:45AM -0700, Tim Chen wrote:
> > > Now that we have accumulated quite a number of different fixes to your original
> > > posted patches, would you like to post a v2 of the core scheduler with the fixes?
> >
> > One more question I'm not sure about: should a task with cookie=0, i.e. a task
> > that is untagged, be allowed to be scheduled on the same core as
> > a tagged task?
> >
> > The current patch seems to disagree on this, e.g. in pick_task(),
> > if max is already chosen but max->core_cookie == 0, then we didn't care
> > about cookie and simply use class_pick for the other cpu. This means we
> > could schedule two tasks with different cookies(one is zero and the
> > other can be tagged).
> >
> > But then sched_core_find() only allow idle task to match with any tagged
> > tasks(we didn't place untagged tasks to the core tree of course :-).
> >
> > Thoughts? Do I understand this correctly? If so, I think we probably
> > want to make this clear before v2. I personally feel, we shouldn't allow
> > untagged tasks(like kernel threads) to match with tagged tasks.
> 
> Does it make sense if we take untagged tasks as hypervisor, and different
> cookie tasks as different VMs? Isolation is done between VMs, not between
> VM and hypervisor.
> 
> Did you see anything harmful if an untagged task and a tagged task
> run simultaneously on the same core?

The VM could then see the hypervisor's data, I think.
We probably do not want that to happen.

* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-04-10 14:44             ` Peter Zijlstra
@ 2019-04-11  3:05               ` Aaron Lu
  2019-04-11  9:19                 ` Peter Zijlstra
  0 siblings, 1 reply; 99+ messages in thread
From: Aaron Lu @ 2019-04-11  3:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tim Chen, mingo, tglx, pjt, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel, Aubrey Li,
	Julien Desfossez

On Wed, Apr 10, 2019 at 04:44:18PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 10, 2019 at 12:36:33PM +0800, Aaron Lu wrote:
> > On Tue, Apr 09, 2019 at 11:09:45AM -0700, Tim Chen wrote:
> > > Now that we have accumulated quite a number of different fixes to your original
> > > posted patches, would you like to post a v2 of the core scheduler with the fixes?
> > 
> > One more question I'm not sure about: should a task with cookie=0, i.e. a task
> > that is untagged, be allowed to be scheduled on the same core as
> > a tagged task?
> 
> That was not meant to be possible.

Good to know this.

> > The current patch seems to disagree on this, e.g. in pick_task(),
> > if max is already chosen but max->core_cookie == 0, then we didn't care
> > about cookie and simply use class_pick for the other cpu. This means we
> > could schedule two tasks with different cookies(one is zero and the
> > other can be tagged).
> 
> When core_cookie==0 we shouldn't schedule the other siblings at all.

Not even with another untagged task?

I was thinking to leave host side tasks untagged, like kernel threads,
init and other system daemons or utilities etc., and tenant tasks tagged.
Then at least two untagged tasks can be scheduled on the same core.

Kindly let me know if you see a problem with this.

> > But then sched_core_find() only allow idle task to match with any tagged
> > tasks(we didn't place untagged tasks to the core tree of course :-).
> > 
> > Thoughts? Do I understand this correctly? If so, I think we probably
> > want to make this clear before v2. I personally feel, we shouldn't allow
> > untagged tasks(like kernel threads) to match with tagged tasks.
> 
> Agreed, cookie should always match or idle.

Thanks a lot for the clarification.

* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-04-11  3:05               ` Aaron Lu
@ 2019-04-11  9:19                 ` Peter Zijlstra
  0 siblings, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2019-04-11  9:19 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, mingo, tglx, pjt, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel, Aubrey Li,
	Julien Desfossez

On Thu, Apr 11, 2019 at 11:05:41AM +0800, Aaron Lu wrote:
> On Wed, Apr 10, 2019 at 04:44:18PM +0200, Peter Zijlstra wrote:
> > When core_cookie==0 we shouldn't schedule the other siblings at all.
> 
> Not even with another untagged task?
> 
> I was thinking to leave host side tasks untagged, like kernel threads,
> init and other system daemons or utilities etc., and tenant tasks tagged.
> Then at least two untagged tasks can be scheduled on the same core.
> 
> Kindly let me know if you see a problem with this.

Let me clarify; when the rq->core->core_cookie == 0, each sibling should
schedule independently.

As Julien found, there were some issues here, but the intent was:

  core_cookie 0, independent scheduling
  core_cookie 0->n, core scheduling
  core_cookie n->0, one last core schedule to kick possibly forced idle siblings
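
The three cases above can be read as a small state machine keyed on the
core-wide cookie. The sketch below is only an illustration under that
reading: demo_core, demo_core_update() and the enum are made-up names,
and treating a change between two non-zero cookies as ordinary core
scheduling is an assumption that the list does not spell out.

```c
#include <assert.h>
#include <stddef.h>

enum demo_core_mode {
	DEMO_INDEPENDENT,	/* cookie stays 0: siblings pick on their own */
	DEMO_CORE_SCHED,	/* cookie is non-zero: core-wide selection */
	DEMO_FINAL_KICK,	/* cookie n->0: one last core-wide schedule */
};

/* Hypothetical stand-in for the per-core state (rq->core in the patches). */
struct demo_core {
	unsigned long core_cookie;
};

/*
 * Record @new_cookie as the core-wide cookie and report which of the
 * three behaviours applies to this scheduling decision.
 */
static enum demo_core_mode demo_core_update(struct demo_core *core,
					    unsigned long new_cookie)
{
	unsigned long old_cookie;

	assert(core != NULL);
	old_cookie = core->core_cookie;
	core->core_cookie = new_cookie;

	if (new_cookie != 0)
		return DEMO_CORE_SCHED;		/* 0->n (assumed: also n->m) */
	if (old_cookie != 0)
		return DEMO_FINAL_KICK;		/* n->0: kick forced-idle siblings */
	return DEMO_INDEPENDENT;		/* 0->0 */
}
```

The n->0 case is the one that matters for the stall Julien reported: it
gives forced-idle siblings one more chance to pick up runnable tasks.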

* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-04-10  8:06           ` Peter Zijlstra
  2019-04-10 19:58             ` Vineeth Remanan Pillai
@ 2019-04-15 16:59             ` Julien Desfossez
  1 sibling, 0 replies; 99+ messages in thread
From: Julien Desfossez @ 2019-04-15 16:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tim Chen, Aaron Lu, mingo, tglx, pjt, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel, Aubrey Li

On 10-Apr-2019 10:06:30 AM, Peter Zijlstra wrote:
> while you're all having fun playing with this, I've not yet had answers
> to the important questions of how L1TF complete we want to be and if all
> this crud actually matters one way or the other.
> 
> Also, I still don't see this stuff working for high context switch rate
> workloads, and that is exactly what some people were aiming for..

We have been running scaling tests on highly loaded systems (with all
the fixes and suggestions applied) and here are the results.

On a system with 2x6 cores (12 hardware threads per NUMA node), with one
12-vcpus-32gb VM per NUMA node running a CPU-intensive workload
(linpack):
- Baseline: 864 gflops
- Core scheduling: 864 gflops
- nosmt (switch to 6 hardware threads per node): 298 gflops (-65%)

In this test, each VM is basically alone on its own NUMA node and only
competes with itself, so for the next test we moved the 2 VMs to the
same node:
- Baseline: 340 gflops, about 586k context switches/sec
- Core scheduling: 322 gflops (-5%), about 575k context switches/sec
- nosmt: 146 gflops (-57%), about 284k context switches/sec
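
For anyone who wants to recheck the percentages quoted above, here is a
tiny helper (hypothetical name) that computes the relative change
against the baseline. Fed with the gflops figures from the two lists, it
gives roughly -65.5%, -5.3% and -57.1%, in line with the whole-number
values in this message.

```c
#include <assert.h>

/* Relative change of @value against @baseline, in percent. */
static double demo_rel_change_pct(double baseline, double value)
{
	assert(baseline != 0.0);
	return (value - baseline) / baseline * 100.0;
}
```

For example, demo_rel_change_pct(340.0, 322.0) covers the core
scheduling run with both VMs on one node.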

In terms of isolation, CPU-intensive VMs share their core with a
"foreign process" (not tagged or tagged with a different tag) less than
2% of the time (sum of the time spent with a lot of different
processes). For reference, this could add up to 60% without core
scheduling and smt on. We are working on identifying the various cases
where there is unwanted co-scheduling so we can address those.

With a more heterogeneous benchmark (MySQL benchmark with a remote
client, one 12-vcpu MySQL VM on each NUMA node), we don’t measure any
performance degradation when there are more hardware threads available
than vcpus (same with nosmt), but when we add noise VMs (sleep(15);
collect metrics; send them over a VPN; repeat) with an overcommit ratio
of 3 vcpus to 1 hardware thread, core scheduling can show up to 25%
performance degradation, whereas nosmt has a 15% impact.

So the performance impact varies depending on the type of workload, but
since the CPU-intensive workloads are the ones most impacted when we
disable SMT, this is very encouraging and is a worthwhile effort.

Thanks,

Julien

* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-04-02  8:28     ` Peter Zijlstra
  2019-04-02 13:20       ` Aaron Lu
  2019-04-05 14:55       ` Aaron Lu
@ 2019-04-16 13:43       ` Aaron Lu
  2 siblings, 0 replies; 99+ messages in thread
From: Aaron Lu @ 2019-04-16 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, pjt, tim.c.chen, torvalds, linux-kernel,
	subhra.mazumdar, fweisbec, keescook, kerrnel, Aubrey Li

On Tue, Apr 02, 2019 at 10:28:12AM +0200, Peter Zijlstra wrote:
> On Tue, Apr 02, 2019 at 02:46:13PM +0800, Aaron Lu wrote:
...
> > Perhaps we can test if max is on the same cpu as class_pick and then
> > use cpu_prio_less() or core_prio_less() accordingly here, or just
> > replace core_prio_less(max, p) with cpu_prio_less(max, p) in
> > pick_next_task(). The 2nd obviously breaks the comment of
> > core_prio_less() though: /* cannot compare vruntime across CPUs */.
> 
> Right, so as the comment states, you cannot directly compare vruntime
> across CPUs, doing that is completely buggered.
> 
> That also means that the cpu_prio_less(max, class_pick) in pick_task()
> is buggered, because there is no saying @max is on this CPU to begin
> with.

I find it difficult to decide which fair_sched_class task has the
higher priority when the two tasks belong to different CPUs.

Please see below.

> Another approach would be something like the below:
> 
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -87,7 +87,7 @@ static inline int __task_prio(struct tas
>   */
>  
>  /* real prio, less is less */
> -static inline bool __prio_less(struct task_struct *a, struct task_struct *b, bool runtime)
> +static inline bool __prio_less(struct task_struct *a, struct task_struct *b, u64 vruntime)
>  {
>  	int pa = __task_prio(a), pb = __task_prio(b);
>  
> @@ -104,21 +104,25 @@ static inline bool __prio_less(struct ta
>  	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
>  		return !dl_time_before(a->dl.deadline, b->dl.deadline);
>  
> -	if (pa == MAX_RT_PRIO + MAX_NICE && runtime) /* fair */
> -		return !((s64)(a->se.vruntime - b->se.vruntime) < 0);
> +	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
> +		return !((s64)(a->se.vruntime - vruntime) < 0);
>  
>  	return false;
>  }
>  
>  static inline bool cpu_prio_less(struct task_struct *a, struct task_struct *b)
>  {
> -	return __prio_less(a, b, true);
> +	return __prio_less(a, b, b->se.vruntime);
>  }
>  
>  static inline bool core_prio_less(struct task_struct *a, struct task_struct *b)
>  {
> -	/* cannot compare vruntime across CPUs */
> -	return __prio_less(a, b, false);
> +	u64 vruntime = b->se.vruntime;
> +
> +	vruntime -= task_rq(b)->cfs.min_vruntime;
> +	vruntime += task_rq(a)->cfs.min_vruntime

(I used task_cfs_rq() instead of task_rq() above.)

Consider the following scenario:
(assume cpu0 and cpu1 are siblings of core0)

1 a cpu-intensive task belonging to cgroupA running on cpu0;
2 launch 'ls' from a shell(bash) which belongs to cgroupB;
3 'ls' blocked for a long time(if not forever).

Per my limited understanding: the launch of 'ls' causes bash to fork,
then the newly forked process' vruntime will be 6ms (probably not
precise) ahead of its cfs_rq due to START_DEBIT. Since there is no other
running task on that cfs_rq, the cfs_rq's min_vruntime doesn't have a
chance to get updated, so the newly forked process will always stay
6ms ahead of its cfs_rq and will always 'lose' to the
cpu-intensive task belonging to cgroupA in core_prio_less().

No idea how to solve this...

> +
> +	return __prio_less(a, b, vruntime);
>  }
>  
>  static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
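
To make the scenario above concrete, here is a stand-alone sketch of the
min_vruntime-based normalization from the quoted diff, reduced to plain
integers. The helper names and struct are invented for illustration;
only the arithmetic mirrors the diff. With a freshly forked task sitting
about 6ms ahead of an otherwise idle cfs_rq, and a CPU hog whose
vruntime is assumed to sit at its own cfs_rq's min_vruntime, the forked
task always compares as the lower-priority one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task snapshot: vruntime plus its cfs_rq's min_vruntime. */
struct demo_se {
	uint64_t vruntime;
	uint64_t rq_min_vruntime;
};

/* Translate @b's vruntime into the timeline of @a's runqueue. */
static uint64_t demo_normalize(const struct demo_se *a,
			       const struct demo_se *b)
{
	assert(a != NULL && b != NULL);
	return b->vruntime - b->rq_min_vruntime + a->rq_min_vruntime;
}

/* "less is less": true when @a has lower priority than @b. */
static bool demo_core_prio_less(const struct demo_se *a,
				const struct demo_se *b)
{
	uint64_t vruntime = demo_normalize(a, b);

	/* Signed difference, so the comparison survives wraparound. */
	return !((int64_t)(a->vruntime - vruntime) < 0);
}
```

Because the blocked task's cfs_rq never advances its min_vruntime, the
6ms offset never shrinks, which is the starvation concern raised here.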

* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-04-11  0:11     ` Subhra Mazumdar
@ 2019-04-19  8:40       ` Ingo Molnar
  2019-04-19 23:16         ` Subhra Mazumdar
  0 siblings, 1 reply; 99+ messages in thread
From: Ingo Molnar @ 2019-04-19  8:40 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: Julien Desfossez, Peter Zijlstra, tglx, pjt, tim.c.chen,
	torvalds, linux-kernel, fweisbec, keescook, kerrnel,
	Vineeth Pillai, Nishanth Aravamudan, Aaron Lu


* Subhra Mazumdar <subhra.mazumdar@oracle.com> wrote:

> I see similar improvement with this patch as removing the condition I
> earlier mentioned. So that's not needed. I also included the patch for the
> priority fix. For 2 DB instances, HT disabling stands at -22% for 32 users
> (from earlier emails).
> 
> 
> 1 DB instance
> 
> users  baseline  %idle  core_sched  %idle
> 16     1         84     -4.9%       84
> 24     1         76     -6.7%       75
> 32     1         69     -2.4%       69
> 
> 2 DB instance
> 
> users  baseline  %idle  core_sched  %idle
> 16     1         66     -19.5%      69
> 24     1         54     -9.8%       57
> 32     1         42     -27.2%      48

So HT disabling slows down the 2DB instance by -22%, while core-sched 
slows it down by -27.2%?

Would it be possible to see all the results in two larger tables (1 DB 
instance and 2 DB instance) so that we can compare the performance of the 
3 kernel variants with each other:

 - "vanilla +HT": Hyperthreading enabled,  vanilla scheduler
 - "vanilla -HT": Hyperthreading disabled, vanilla scheduler
 - "core_sched":  Hyperthreading enabled,  core-scheduling enabled

?

Thanks,

	Ingo

* Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.
  2019-04-19  8:40       ` Ingo Molnar
@ 2019-04-19 23:16         ` Subhra Mazumdar
  0 siblings, 0 replies; 99+ messages in thread
From: Subhra Mazumdar @ 2019-04-19 23:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Julien Desfossez, Peter Zijlstra, tglx, pjt, tim.c.chen,
	torvalds, linux-kernel, fweisbec, keescook, kerrnel,
	Vineeth Pillai, Nishanth Aravamudan, Aaron Lu


On 4/19/19 1:40 AM, Ingo Molnar wrote:
> * Subhra Mazumdar <subhra.mazumdar@oracle.com> wrote:
>
>> I see similar improvement with this patch as removing the condition I
>> earlier mentioned. So that's not needed. I also included the patch for the
>> priority fix. For 2 DB instances, HT disabling stands at -22% for 32 users
>> (from earlier emails).
>>
>>
>> 1 DB instance
>>
>> users  baseline  %idle  core_sched  %idle
>> 16     1         84     -4.9%       84
>> 24     1         76     -6.7%       75
>> 32     1         69     -2.4%       69
>>
>> 2 DB instance
>>
>> users  baseline  %idle  core_sched  %idle
>> 16     1         66     -19.5%      69
>> 24     1         54     -9.8%       57
>> 32     1         42     -27.2%      48
> So HT disabling slows down the 2DB instance by -22%, while core-sched
> slows it down by -27.2%?
>
> Would it be possible to see all the results in two larger tables (1 DB
> instance and 2 DB instance) so that we can compare the performance of the
> 3 kernel variants with each other:
>
>   - "vanilla +HT": Hyperthreading enabled,  vanilla scheduler
>   - "vanilla -HT": Hyperthreading disabled, vanilla scheduler
>   - "core_sched":  Hyperthreading enabled,  core-scheduling enabled
>
> ?
>
> Thanks,
>
> 	Ingo
Following are the numbers. Disabling HT gives improvement in some cases.

1 DB instance

users  vanilla+HT  core_sched  vanilla-HT
16     1           -4.9%       -11.7%
24     1           -6.7%       +13.7%
32     1           -2.4%       +8%

2 DB instance

users  vanilla+HT  core_sched  vanilla-HT
16     1           -19.5%      +5.6%
24     1           -9.8%       +3.5%
32     1           -27.2%      -22.8%

end of thread

Thread overview: 99+ messages
2019-02-18 16:56 [RFC][PATCH 00/16] sched: Core scheduling Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 01/16] stop_machine: Fix stop_cpus_in_progress ordering Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 02/16] sched: Fix kerneldoc comment for ia64_set_curr_task Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 03/16] sched: Wrap rq::lock access Peter Zijlstra
2019-02-19 16:13   ` Phil Auld
2019-02-19 16:22     ` Peter Zijlstra
2019-02-19 16:37       ` Phil Auld
2019-03-18 15:41   ` Julien Desfossez
2019-03-20  2:29     ` Subhra Mazumdar
2019-03-21 21:20       ` Julien Desfossez
2019-03-22 13:34         ` Peter Zijlstra
2019-03-22 20:59           ` Julien Desfossez
2019-03-23  0:06         ` Subhra Mazumdar
2019-03-27  1:02           ` Subhra Mazumdar
2019-03-29 13:35           ` Julien Desfossez
2019-03-29 22:23             ` Subhra Mazumdar
2019-04-01 21:35               ` Subhra Mazumdar
2019-04-03 20:16                 ` Julien Desfossez
2019-04-05  1:30                   ` Subhra Mazumdar
2019-04-02  7:42               ` Peter Zijlstra
2019-03-22 23:28       ` Tim Chen
2019-03-22 23:44         ` Tim Chen
2019-02-18 16:56 ` [RFC][PATCH 04/16] sched/{rt,deadline}: Fix set_next_task vs pick_next_task Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 05/16] sched: Add task_struct pointer to sched_class::set_curr_task Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 06/16] sched/fair: Export newidle_balance() Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 07/16] sched: Allow put_prev_task() to drop rq->lock Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 08/16] sched: Rework pick_next_task() slow-path Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 09/16] sched: Introduce sched_class::pick_task() Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 10/16] sched: Core-wide rq->lock Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 11/16] sched: Basic tracking of matching tasks Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 12/16] sched: A quick and dirty cgroup tagging interface Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling Peter Zijlstra
     [not found]   ` <20190402064612.GA46500@aaronlu>
2019-04-02  8:28     ` Peter Zijlstra
2019-04-02 13:20       ` Aaron Lu
2019-04-05 14:55       ` Aaron Lu
2019-04-09 18:09         ` Tim Chen
2019-04-10  4:36           ` Aaron Lu
2019-04-10 14:18             ` Aubrey Li
2019-04-11  2:11               ` Aaron Lu
2019-04-10 14:44             ` Peter Zijlstra
2019-04-11  3:05               ` Aaron Lu
2019-04-11  9:19                 ` Peter Zijlstra
2019-04-10  8:06           ` Peter Zijlstra
2019-04-10 19:58             ` Vineeth Remanan Pillai
2019-04-15 16:59             ` Julien Desfossez
2019-04-16 13:43       ` Aaron Lu
2019-04-09 18:38   ` Julien Desfossez
2019-04-10 15:01     ` Peter Zijlstra
2019-04-11  0:11     ` Subhra Mazumdar
2019-04-19  8:40       ` Ingo Molnar
2019-04-19 23:16         ` Subhra Mazumdar
2019-02-18 16:56 ` [RFC][PATCH 14/16] sched/fair: Add a few assertions Peter Zijlstra
2019-02-18 16:56 ` [RFC][PATCH 15/16] sched: Trivial forced-newidle balancer Peter Zijlstra
2019-02-21 16:19   ` Valentin Schneider
2019-02-21 16:41     ` Peter Zijlstra
2019-02-21 16:47       ` Peter Zijlstra
2019-02-21 18:28         ` Valentin Schneider
2019-04-04  8:31       ` Aubrey Li
2019-04-06  1:36         ` Aubrey Li
2019-02-18 16:56 ` [RFC][PATCH 16/16] sched: Debug bits Peter Zijlstra
2019-02-18 17:49 ` [RFC][PATCH 00/16] sched: Core scheduling Linus Torvalds
2019-02-18 20:40   ` Peter Zijlstra
2019-02-19  0:29     ` Linus Torvalds
2019-02-19 15:15       ` Ingo Molnar
2019-02-22 12:17     ` Paolo Bonzini
2019-02-22 14:20       ` Peter Zijlstra
2019-02-22 19:26         ` Tim Chen
2019-02-26  8:26           ` Aubrey Li
2019-02-27  7:54             ` Aubrey Li
2019-02-21  2:53   ` Subhra Mazumdar
2019-02-21 14:03     ` Peter Zijlstra
2019-02-21 18:44       ` Subhra Mazumdar
2019-02-22  0:34       ` Subhra Mazumdar
2019-02-22 12:45   ` Mel Gorman
2019-02-22 16:10     ` Mel Gorman
2019-03-08 19:44     ` Subhra Mazumdar
2019-03-11  4:23       ` Aubrey Li
2019-03-11 18:34         ` Subhra Mazumdar
2019-03-11 23:33           ` Subhra Mazumdar
2019-03-12  0:20             ` Greg Kerr
2019-03-12  0:47               ` Subhra Mazumdar
2019-03-12  7:33               ` Aaron Lu
2019-03-12  7:45             ` Aubrey Li
2019-03-13  5:55               ` Aubrey Li
2019-03-14  0:35                 ` Tim Chen
2019-03-14  5:30                   ` Aubrey Li
2019-03-14  6:07                     ` Li, Aubrey
2019-03-18  6:56             ` Aubrey Li
2019-03-12 19:07           ` Pawan Gupta
2019-03-26  7:32       ` Aaron Lu
2019-03-26  7:56         ` Aaron Lu
2019-02-19 22:07 ` Greg Kerr
2019-02-20  9:42   ` Peter Zijlstra
2019-02-20 18:33     ` Greg Kerr
2019-02-22 14:10       ` Peter Zijlstra
2019-03-07 22:06         ` Paolo Bonzini
2019-02-20 18:43     ` Subhra Mazumdar
2019-03-01  2:54 ` Subhra Mazumdar
2019-03-14 15:28 ` Julien Desfossez
