LKML Archive on lore.kernel.org
* [RFC][PATCH] sched/rt: Use IPI to trigger RT task push migration instead of pulling
@ 2015-02-04 19:39 Steven Rostedt
  2015-02-05 14:58 ` Peter Zijlstra
  2015-02-05 15:21 ` Peter Zijlstra
  0 siblings, 2 replies; 9+ messages in thread
From: Steven Rostedt @ 2015-02-04 19:39 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Clark Williams,
	linux-rt-users, Mike Galbraith, Paul E. McKenney



I posted this patch a while back here:

 https://lkml.org/lkml/2012/12/12/354

It solved an issue that is very prevalent on machines with large CPU
counts. Tracing showed it nicely. When we have more than 16 CPUs, the
contention on the run queue lock caused a very noticeable delay. So
much so that Mike Galbraith posted results showing just how much this
patch improved things:

  https://lkml.org/lkml/2012/12/21/220

We're talking about 1ms latencies dropped down to 50us or less. That is
a HUGE impact.

But Thomas Gleixner shot down this approach comparing it to TTWU_QUEUE,
and never commented on it again :-(

  https://lkml.org/lkml/2012/12/11/172

TTWU_QUEUE means that when a task is woken to run on another CPU,
instead of the waker grabbing that CPU's run queue lock and queueing
the task itself, the task is put on a list and an IPI is sent to that
CPU, which then processes the queued wakeups. Now if you have 100
tasks waking up, there could be 100 queued wakeups needing to be
processed, and this, as Thomas pointed out, is extremely
non-deterministic.
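
(For reference, the remote-queueing side of TTWU_QUEUE in kernels of
this era looks roughly like the sketch below; this is paraphrased from
memory, not a verbatim copy. The woken task is put on the target rq's
wake_list and the target CPU finishes the wakeup from its scheduler
IPI.)

	/* simplified sketch of the TTWU_QUEUE remote wakeup path */
	static void ttwu_queue_remote(struct task_struct *p, int cpu)
	{
		struct rq *rq = cpu_rq(cpu);

		/* first entry on the list is the one sending the IPI */
		if (llist_add(&p->wake_entry, &rq->wake_list))
			smp_send_reschedule(cpu);
	}

	/*
	 * On the target CPU, scheduler_ipi() -> sched_ttwu_pending()
	 * then walks rq->wake_list and activates each queued task.
	 */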


Now how is my patch different? For one thing, unlike TTWU_QUEUE, it
only deals with RT tasks, and has nothing to do with SCHED_OTHER.
Another difference is that it has nothing to do with a task waking up,
but instead with a task scheduling out on another CPU. In fact, the
current approach is the non-deterministic one, because, as running
cyclictest -d0 -t -i100 will show, you can have n CPUs all trying to
grab the same lock at the same time.

This patch deals with pull_rt_task(), which is called when a CPU
lowers its priority and sees that there's an rq somewhere that has two
or more RT tasks scheduled on it, where one of those RT tasks can
migrate over to the newly lowered CPU's rq.

The problem is that if you have 100 CPUs having an RT task schedule out
(just like cyclictest -d0 -t -i100 does a lot), and if there's one rq
that has more than one RT task scheduled on it (overloaded), each
of those 100 CPUs are going to try to get that second RT task on that
lonely rq, and each of those 100 CPUs are going to take that lonely
rq's lock! Then what happens if the task running on that rq wants to
schedule? Well, it needs to wait behind 100 CPUs contending for its rq
lock, and we see a HUGE latency (1ms or more).

What does this patch do instead? Instead of trying to fight for the rq,
if it sees that there's an overloaded rq, it simply sends an IPI to
that CPU to have it push the overloaded RT task off to another CPU.

Yes, it interrupts the currently running RT task to push off a lower RT
task (which wasn't able to migrate anywhere when it was scheduled in
the first place, because all the other CPUs had higher priority tasks
waiting).

Can this be wildly non-deterministic? If the one queue had 100 RT
tasks, and there were 100 CPUs that had RT tasks that were higher
priority than those other 100 RT tasks, and they all just scheduled
out, then sure. But really, if you had such a system, it's totally
broken to begin with.

This issue has popped up several times already, and that's why I'm
posting this patch again. This time, I'm actually posting it for
mainline and not just for the RT kernel, because it affects mainline
as well. It probably hasn't been reported as much because RT tasks can
easily suffer 1ms latencies on mainline for other reasons. But I bet
this will help a lot on machines with many CPUs anyway.

From my tests, the rq contention starts to show itself after you reach
16 CPUs, which is why I added this as a sched_feature (RT_PUSH_IPI),
which does not get set by default unless you have more than 16 CPUs.
Otherwise, you can turn it on or off with the sched_feature interface.

I'll let Clark Williams post the results that he's seen (on
PREEMPT_RT), but I feel this is better for mainline.

Below is the original patch change log, but the patch itself was
forward ported to 3.19-rc7.

Comments?

-- Steve


--------
sched/rt: Use IPI to trigger RT task push migration instead of pulling

When debugging the latencies on a 40 core box, where we hit 300 to
500 microsecond latencies, I found there was a huge contention on the
runqueue locks.

Investigating it further, running ftrace, I found that it was due to
the pulling of RT tasks.

The test that was run was the following:

 cyclictest --numa -p95 -m -d0 -i100

This created a thread on each CPU, each of which would set its wakeup
in intervals of 100 microseconds. The -d0 means that all the threads
had the same interval (100us). Each thread sleeps for 100us, wakes up,
and measures its latency.

What happened was that another RT task would be scheduled on one of
the CPUs that was running our test, while the other CPUs' test threads
went to sleep and their CPUs scheduled idle. This caused the "pull"
operation to execute on all these CPUs. Each one of these saw the RT
task that was overloaded on the CPU of the test that was still
running, and each one tried to grab that task in a thundering herd
way.

To grab the task, each CPU would do a double rq lock grab, grabbing
its own lock as well as the rq lock of the overloaded CPU. As the
sched domains on this box were rather flat for its size, I saw up to
12 CPUs block on this lock at once. This caused a ripple effect with
the rq locks. As these locks were blocked, any wakeups or load
balancing on these CPUs would also block on these locks, and the wait
time escalated.
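
(The "double rq lock grab" above is roughly the dance below; this is a
paraphrase of the non-CONFIG_PREEMPT variant of _double_lock_balance()
in kernel/sched/sched.h, not verbatim. Note that to respect lock
ordering the puller may first have to drop its own rq lock, and then
it blocks on the overloaded rq's lock along with everyone else.)

	/* paraphrased: how the pull path takes both rq locks */
	static int double_lock_balance_sketch(struct rq *this_rq,
					      struct rq *busiest)
	{
		int ret = 0;

		if (!raw_spin_trylock(&busiest->lock)) {
			if (busiest < this_rq) {
				/* drop our own lock to keep lock order */
				raw_spin_unlock(&this_rq->lock);
				/* ...and pile up here with everyone else */
				raw_spin_lock(&busiest->lock);
				raw_spin_lock_nested(&this_rq->lock,
						     SINGLE_DEPTH_NESTING);
				/* this_rq was unlocked; caller must recheck */
				ret = 1;
			} else {
				raw_spin_lock_nested(&busiest->lock,
						     SINGLE_DEPTH_NESTING);
			}
		}
		return ret;
	}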

I've tried various methods to lessen the load, but things like an
atomic counter to only let one CPU grab the task won't work, because
the task may have a limited affinity, and we may pick the wrong CPU to
take that lock and do the pull, only to find out that the CPU we
picked isn't in the task's affinity.

Instead of doing the PULL, I now have the CPUs that want the pull
send an IPI to the overloaded CPU, and let that CPU pick which CPU to
push the task to. There is no more need to grab the rq lock, and the
push/pull algorithm still works fine.

With this patch, the latency dropped to just 150us over a 20 hour run.
Without the patch, the huge latencies would trigger in seconds.

Now, this issue only seems to apply to boxes with greater than 16
CPUs. We noticed this on a 24 CPU box, and things got much worse on 40
(and presumably more CPUs would make it worse yet). But running with
16 CPUs and below, the lock contention caused by the pulling of RT
tasks is not noticeable.

I've created a new sched feature called RT_PUSH_IPI, which is disabled
by default on machines with 16 or fewer CPUs, and enabled on machines
with 17 or more. That seems to be the heuristic limit where the
pulling logic causes higher latencies than IPIs do. Of course, as with
all heuristics, things could be different on different architectures.

When RT_PUSH_IPI is not enabled, the old method of grabbing the rq
locks and having the pulling CPU do the work is used. When RT_PUSH_IPI
is enabled, the IPI is sent to the overloaded CPU to do a push.

To enable or disable this at run time:

 # mount -t debugfs nodev /sys/kernel/debug
 # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
or
 # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

---
 kernel/sched/core.c     |   18 ++++++++++++++++++
 kernel/sched/features.h |   14 ++++++++++++++
 kernel/sched/rt.c       |   37 +++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h    |    5 +++++
 4 files changed, 74 insertions(+)

Index: linux-rt.git/kernel/sched/core.c
===================================================================
--- linux-rt.git.orig/kernel/sched/core.c	2015-02-04 14:08:15.688111069 -0500
+++ linux-rt.git/kernel/sched/core.c	2015-02-04 14:08:17.382088074 -0500
@@ -1582,6 +1582,9 @@
 	 */
 	preempt_fold_need_resched();
 
+	if (sched_feat(RT_PUSH_IPI))
+		sched_rt_push_check();
+
 	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
 		return;
 
@@ -7271,6 +7274,21 @@
 		zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);
 	idle_thread_set_boot_cpu();
 	set_cpu_rq_start_time();
+
+	/*
+	 * To avoid heavy contention on large CPU boxes,
+	 * when there is an RT overloaded CPU (two or more RT tasks
+	 * queued to run on a CPU and one of the waiting RT tasks
+	 * can migrate) and another CPU lowers its priority, instead
+	 * of grabbing both rq locks of the CPUS (as many CPUs lowering
+	 * their priority at the same time may create large latencies)
+	 * send an IPI to the CPU that is overloaded so that it can
+	 * do an efficent push.
+	 */
+	if (num_possible_cpus() > 16) {
+		sched_feat_enable(__SCHED_FEAT_RT_PUSH_IPI);
+		sysctl_sched_features |= (1UL << __SCHED_FEAT_RT_PUSH_IPI);
+	}
 #endif
 	init_sched_fair_class();
 
Index: linux-rt.git/kernel/sched/rt.c
===================================================================
--- linux-rt.git.orig/kernel/sched/rt.c	2015-02-04 14:08:15.688111069 -0500
+++ linux-rt.git/kernel/sched/rt.c	2015-02-04 14:08:17.383088061 -0500
@@ -1760,6 +1760,31 @@
 		;
 }
 
+/**
+ * sched_rt_push_check - check if we can push waiting RT tasks
+ *
+ * Called from sched IPI when sched feature RT_PUSH_IPI is enabled.
+ *
+ * Checks if there is an RT task that can migrate and there exists
+ * a CPU in its affinity that only has tasks lower in priority than
+ * the waiting RT task. If so, then it will push the task off to that
+ * CPU.
+ */
+void sched_rt_push_check(void)
+{
+	struct rq *rq = cpu_rq(smp_processor_id());
+
+	if (WARN_ON_ONCE(!irqs_disabled()))
+		return;
+
+	if (!has_pushable_tasks(rq))
+		return;
+
+	raw_spin_lock(&rq->lock);
+	push_rt_tasks(rq);
+	raw_spin_unlock(&rq->lock);
+}
+
 static int pull_rt_task(struct rq *this_rq)
 {
 	int this_cpu = this_rq->cpu, ret = 0, cpu;
@@ -1793,6 +1818,18 @@
 			continue;
 
 		/*
+		 * When the RT_PUSH_IPI sched feature is enabled, instead
+		 * of trying to grab the rq lock of the RT overloaded CPU
+		 * send an IPI to that CPU instead. This prevents heavy
+		 * contention from several CPUs lowering its priority
+		 * and all trying to grab the rq lock of that overloaded CPU.
+		 */
+		if (sched_feat(RT_PUSH_IPI)) {
+			smp_send_reschedule(cpu);
+			continue;
+		}
+
+		/*
 		 * We can potentially drop this_rq's lock in
 		 * double_lock_balance, and another CPU could
 		 * alter this_rq
Index: linux-rt.git/kernel/sched/sched.h
===================================================================
--- linux-rt.git.orig/kernel/sched/sched.h	2015-02-04 14:08:15.688111069 -0500
+++ linux-rt.git/kernel/sched/sched.h	2015-02-04 14:08:17.392087939 -0500
@@ -1507,6 +1507,8 @@
 		__release(rq2->lock);
 }
 
+void sched_rt_push_check(void);
+
 #else /* CONFIG_SMP */
 
 /*
@@ -1540,6 +1542,9 @@
 	__release(rq2->lock);
 }
 
+void sched_rt_push_check(void)
+{
+}
 #endif
 
 extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq);
Index: linux-rt.git/kernel/sched/features.h
===================================================================
--- linux-rt.git.orig/kernel/sched/features.h	2015-02-04 14:08:15.688111069 -0500
+++ linux-rt.git/kernel/sched/features.h	2015-02-04 14:08:17.392087939 -0500
@@ -56,6 +56,20 @@
  */
 SCHED_FEAT(TTWU_QUEUE, true)
 
+/*
+ * In order to avoid a thundering herd attack of CPUS that are
+ * lowering their priorities at the same time, and there being
+ * a single CPU that has an RT task that can migrate and is waiting
+ * to run, where the other CPUs will try to take that CPUs
+ * rq lock and possibly create a large contention, sending an
+ * IPI to that CPU and let that CPU push the RT task to where
+ * it should go may be a better scenario.
+ *
+ * This is default off for machines with <= 16 CPUs, and will
+ * be turned on at boot up for machines with > 16 CPUs.
+ */
+SCHED_FEAT(RT_PUSH_IPI, false)
+
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)


* Re: [RFC][PATCH] sched/rt: Use IPI to trigger RT task push migration instead of pulling
  2015-02-04 19:39 [RFC][PATCH] sched/rt: Use IPI to trigger RT task push migration instead of pulling Steven Rostedt
@ 2015-02-05 14:58 ` Peter Zijlstra
  2015-02-05 16:04   ` Steven Rostedt
  2015-02-05 15:21 ` Peter Zijlstra
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2015-02-05 14:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
	linux-rt-users, Mike Galbraith, Paul E. McKenney

On Wed, Feb 04, 2015 at 02:39:06PM -0500, Steven Rostedt wrote:
> Can this be wildly non-deterministic? 

You're not actually answering the question here; so basically it's doing
push_rt_tasks() - and you 'need' to show that that has bounded behaviour.

Saying the current thing is bad doesn't mean the proposed thing is good.

Now clearly spinlock contention is O(nr_cpus) and while that sucks rock,
it's actually fairly bounded.

> Comments?

A few.. find below.

>  kernel/sched/core.c     |   18 ++++++++++++++++++
>  kernel/sched/features.h |   14 ++++++++++++++
>  kernel/sched/rt.c       |   37 +++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h    |    5 +++++
>  4 files changed, 74 insertions(+)
> 
> Index: linux-rt.git/kernel/sched/core.c
> ===================================================================
> --- linux-rt.git.orig/kernel/sched/core.c	2015-02-04 14:08:15.688111069 -0500
> +++ linux-rt.git/kernel/sched/core.c	2015-02-04 14:08:17.382088074 -0500
> @@ -1582,6 +1582,9 @@
>  	 */
>  	preempt_fold_need_resched();
>  
> +	if (sched_feat(RT_PUSH_IPI))
> +		sched_rt_push_check();

This should be in the irq_enter()/exit() part, but better still, move
this into an smp_call_function_single_async() or queue_irq_work_on();
no reason to make the scheduler IPI do yet more work.
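
(As an illustration of that suggestion, here is a minimal sketch using
a per-CPU call_single_data; the names are hypothetical and this is not
the code that was eventually merged. A csd must not be re-queued while
a previous request is still in flight.)

	static DEFINE_PER_CPU(struct call_single_data, rt_push_csd);

	/* runs on the overloaded CPU, in IPI context, irqs disabled */
	static void rt_push_ipi_func(void *info)
	{
		sched_rt_push_check();
	}

	static void tell_cpu_to_push(int cpu)
	{
		struct call_single_data *csd = &per_cpu(rt_push_csd, cpu);

		csd->func = rt_push_ipi_func;
		/* async: returns immediately, the IPI does the work */
		smp_call_function_single_async(cpu, csd);
	}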

>  	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
>  		return;
>  
> @@ -7271,6 +7274,21 @@
>  		zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);
>  	idle_thread_set_boot_cpu();
>  	set_cpu_rq_start_time();
> +
> +	/*
> +	 * To avoid heavy contention on large CPU boxes,
> +	 * when there is an RT overloaded CPU (two or more RT tasks
> +	 * queued to run on a CPU and one of the waiting RT tasks
> +	 * can migrate) and another CPU lowers its priority, instead
> +	 * of grabbing both rq locks of the CPUS (as many CPUs lowering
> +	 * their priority at the same time may create large latencies)
> +	 * send an IPI to the CPU that is overloaded so that it can
> +	 * do an efficent push.
> +	 */
> +	if (num_possible_cpus() > 16) {
> +		sched_feat_enable(__SCHED_FEAT_RT_PUSH_IPI);
> +		sysctl_sched_features |= (1UL << __SCHED_FEAT_RT_PUSH_IPI);
> +	}

*boom* this won't compile for !CONFIG_SCHED_DEBUG, sysctl_sched_features
is const in that case.

So I'm sure that it doesn't make sense to tie to an arbitrary CPU
number, better tie it to a topology property, like the machine having
two LLC cache domains -- or just always enable the thing.

>  #endif
>  	init_sched_fair_class();
>  
> Index: linux-rt.git/kernel/sched/rt.c
> ===================================================================
> --- linux-rt.git.orig/kernel/sched/rt.c	2015-02-04 14:08:15.688111069 -0500
> +++ linux-rt.git/kernel/sched/rt.c	2015-02-04 14:08:17.383088061 -0500
> @@ -1760,6 +1760,31 @@
>  		;
>  }
>  
> +/**
> + * sched_rt_push_check - check if we can push waiting RT tasks
> + *
> + * Called from sched IPI when sched feature RT_PUSH_IPI is enabled.
> + *
> + * Checks if there is an RT task that can migrate and there exists
> + * a CPU in its affinity that only has tasks lower in priority than
> + * the waiting RT task. If so, then it will push the task off to that
> + * CPU.
> + */
> +void sched_rt_push_check(void)
> +{
> +	struct rq *rq = cpu_rq(smp_processor_id());
> +
> +	if (WARN_ON_ONCE(!irqs_disabled()))
> +		return;
> +
> +	if (!has_pushable_tasks(rq))
> +		return;
> +
> +	raw_spin_lock(&rq->lock);
> +	push_rt_tasks(rq);

So this, on first sight, looks like an unbounded while loop; better
present some runtime bound analysis on it.

At present I find a worst case O(inf * nr_tries * nr_prios * nr_cpus),
which while on average might run faster, is actually quite horrible.

RT isn't won on avg or median execution times :-)

> +	raw_spin_unlock(&rq->lock);
> +}
> +
>  static int pull_rt_task(struct rq *this_rq)
>  {
>  	int this_cpu = this_rq->cpu, ret = 0, cpu;
> Index: linux-rt.git/kernel/sched/sched.h
> ===================================================================
> --- linux-rt.git.orig/kernel/sched/sched.h	2015-02-04 14:08:15.688111069 -0500
> +++ linux-rt.git/kernel/sched/sched.h	2015-02-04 14:08:17.392087939 -0500
> @@ -1540,6 +1542,9 @@
>  	__release(rq2->lock);
>  }
>  
> +void sched_rt_push_check(void)

I'm very sure you mean static inline there.. otherwise the linker will
get upset about multiple instances of the same symbol.

> +{
> +}
>  #endif
>  
>  extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq);


* Re: [RFC][PATCH] sched/rt: Use IPI to trigger RT task push migration instead of pulling
  2015-02-04 19:39 [RFC][PATCH] sched/rt: Use IPI to trigger RT task push migration instead of pulling Steven Rostedt
  2015-02-05 14:58 ` Peter Zijlstra
@ 2015-02-05 15:21 ` Peter Zijlstra
  2015-02-05 16:55   ` Steven Rostedt
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2015-02-05 15:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
	linux-rt-users, Mike Galbraith, Paul E. McKenney

On Wed, Feb 04, 2015 at 02:39:06PM -0500, Steven Rostedt wrote:
> The problem is that if you have 100 CPUs having an RT task schedule out
> (just like cyclictest -d0 -t -i100 does a lot), and if there's one rq
> that has more than one RT task scheduled on it (overloaded), each
> of those 100 CPUs are going to try to get that second RT task on that
> lonely rq, and each of those 100 CPUs are going to take that lonely
> rq's lock! Then what happens if the task running on that rq wants to
> schedule? Well, it needs to wait behind 100 CPUs contending for its rq
> lock, and we see a HUGE latency (1ms or more).
> 
> What does this patch do instead? Instead of trying to fight for the rq,
> if it sees that there's an overloaded rq, it simply sends an IPI to
> that CPU to have it push the overloaded RT task off to another CPU.

So can't we flip the problem around; 99 overloaded cpus and 1 going
'low', then we IPI the 99, and they're all going to try and push their
tasks on the one (now) sad cpu?




* Re: [RFC][PATCH] sched/rt: Use IPI to trigger RT task push migration instead of pulling
  2015-02-05 14:58 ` Peter Zijlstra
@ 2015-02-05 16:04   ` Steven Rostedt
  2015-02-06 12:40     ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2015-02-05 16:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
	linux-rt-users, Mike Galbraith, Paul E. McKenney

On Thu, 5 Feb 2015 15:58:33 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Feb 04, 2015 at 02:39:06PM -0500, Steven Rostedt wrote:
> > Can this be wildly non-deterministic? 
> 
> You're not actually answering the question here; so basically its doing
> push_rt_tasks() - and you 'need' to show that that has bounded behaviour.
> 
> Saying the current thing is bad doesn't mean the proposed thing is good.
> 
> Now clearly spinlock contention is O(nr_cpus) and while that sucks rock,
> its actually fairly bounded.

And so is the solution. The solution is O(nr_cpus), because it only
pushes tasks to CPUs that have lowered their priority. And you also
need to have nr_cpus tasks on that queue to hit this worst case
scenario.

Now we have the current code which is O(nr_cpus) and a new solution
which is also O(nr_cpus). But the current code can easily trigger that
worst case, just by having all cpus drop their priorities. The new
solution needs that, plus you also need to have nr_cpus tasks waiting
for a cpu to go to. (Once there are no more cpus to push to, even if
there are more than nr_cpus tasks, the algorithm stops.)

> 
> > Comments?
> 
> A few.. find below.
> 
> >  kernel/sched/core.c     |   18 ++++++++++++++++++
> >  kernel/sched/features.h |   14 ++++++++++++++
> >  kernel/sched/rt.c       |   37 +++++++++++++++++++++++++++++++++++++
> >  kernel/sched/sched.h    |    5 +++++
> >  4 files changed, 74 insertions(+)
> > 
> > Index: linux-rt.git/kernel/sched/core.c
> > ===================================================================
> > --- linux-rt.git.orig/kernel/sched/core.c	2015-02-04 14:08:15.688111069 -0500
> > +++ linux-rt.git/kernel/sched/core.c	2015-02-04 14:08:17.382088074 -0500
> > @@ -1582,6 +1582,9 @@
> >  	 */
> >  	preempt_fold_need_resched();
> >  
> > +	if (sched_feat(RT_PUSH_IPI))
> > +		sched_rt_push_check();
> 
> This should be in the irq_enter()/exit() part, but better still, move
> this into an smp_call_fuction_single_async() or queue_irq_work_on(), no
> reason to make the scheduler IPI do yet more work.

OK, the smp_call_function_single_async() may be better. Question is, do
they too grab spinlocks to send? I'm trying to avoid the lock
contention that will happen when we have 100 CPUs all seeing that one
CPU and grabbing the same lock. This happens in the scheduler code,
with rq locks held.

If they do not grab any locks, then they would be fine to use.

> 
> >  	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
> >  		return;
> >  
> > @@ -7271,6 +7274,21 @@
> >  		zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);
> >  	idle_thread_set_boot_cpu();
> >  	set_cpu_rq_start_time();
> > +
> > +	/*
> > +	 * To avoid heavy contention on large CPU boxes,
> > +	 * when there is an RT overloaded CPU (two or more RT tasks
> > +	 * queued to run on a CPU and one of the waiting RT tasks
> > +	 * can migrate) and another CPU lowers its priority, instead
> > +	 * of grabbing both rq locks of the CPUS (as many CPUs lowering
> > +	 * their priority at the same time may create large latencies)
> > +	 * send an IPI to the CPU that is overloaded so that it can
> > +	 * do an efficent push.
> > +	 */
> > +	if (num_possible_cpus() > 16) {
> > +		sched_feat_enable(__SCHED_FEAT_RT_PUSH_IPI);
> > +		sysctl_sched_features |= (1UL << __SCHED_FEAT_RT_PUSH_IPI);
> > +	}
> 
> *boom* this won't compile for !CONFIG_SCHED_DEBUG, sysctl_sched_features
> is const in that case.

Heh, well that's solved with an #ifdef.

> 
> So I'm sure that it doesn't make sense to tie to an arbitrary CPU
> number, better tie it to a topology property, like the machine having
> two LLC cache domains -- or just always enable the thing.

Yeah, I thought about this. Probably best to just enable it. I did
notice on a 4 CPU box that it tended to give a little more latency
than with it off. But it wasn't much. I'll have to benchmark this some
more.

> 
> >  #endif
> >  	init_sched_fair_class();
> >  
> > Index: linux-rt.git/kernel/sched/rt.c
> > ===================================================================
> > --- linux-rt.git.orig/kernel/sched/rt.c	2015-02-04 14:08:15.688111069 -0500
> > +++ linux-rt.git/kernel/sched/rt.c	2015-02-04 14:08:17.383088061 -0500
> > @@ -1760,6 +1760,31 @@
> >  		;
> >  }
> >  
> > +/**
> > + * sched_rt_push_check - check if we can push waiting RT tasks
> > + *
> > + * Called from sched IPI when sched feature RT_PUSH_IPI is enabled.
> > + *
> > + * Checks if there is an RT task that can migrate and there exists
> > + * a CPU in its affinity that only has tasks lower in priority than
> > + * the waiting RT task. If so, then it will push the task off to that
> > + * CPU.
> > + */
> > +void sched_rt_push_check(void)
> > +{
> > +	struct rq *rq = cpu_rq(smp_processor_id());
> > +
> > +	if (WARN_ON_ONCE(!irqs_disabled()))
> > +		return;
> > +
> > +	if (!has_pushable_tasks(rq))
> > +		return;
> > +
> > +	raw_spin_lock(&rq->lock);
> > +	push_rt_tasks(rq);
> 
> So this, on first sight, looks like an unbounded while loop; better
> present some runtime bound analysis on it.
> 
> At present I find a worst case O(inf * nr_tries * nr_prios * nr_cpus),
> which while on average might run faster, is actually quite horrible.

Again, the algorithm stops when no task is pushed, so it's not inf.

I could add a flag to not do the retry in this case. And, yeah, the
nr_prios could happen if the waiting task is of high priority (but that
task would need to be of lower priority than all the other CPUs that are
asking for the pull, otherwise it would have pushed to those CPUs in
the first place).

> 
> RT isn't won on avg or median execution times :-)

Totally agree. And I'm willing to write up more analysis, because the
worse case scenario, would require a lot of things to happen that would
take a lot of work to do so, and not something that would randomly
happen.

Even the retry only happens if it could not find a CPU that had a lower
priority than the task it was looking for, and the task either somehow
migrated, or a higher priority task some how got queued instead when
the lock was released.

And this retry case is in today's code, and any problem we have with
it, should be solved today as well.

Let's go back to how to get into a situation where this code is
activated. You need to have a bunch of RT tasks queued on one run
queue, which happens only if, when they were queued, all other CPUs
were running RT tasks of higher priority than the queued tasks.

When one of the other CPUs stops running the high priority task, it
sees that there's a queue that has RT tasks that are waiting and sends
an IPI to that queue to push off its RT tasks. (Note, I do see a flaw
in this patch: it should send an IPI to all CPUs that have overloaded
RT tasks.) Then that queue will try to push off its RT tasks to
available CPUs. If there's no CPU to push to, or when it runs out of
RT tasks, then the algorithm stops.

Now I guess you could argue that if you have > nr_cpus tasks on your
queue, and when you push one task off, and it runs on the other CPU and
finishes, which would mean you could push the next RT task off. If that
task finishes before we get to the next task, it could be nr_tasks
algorithm. But that's more than unlikely that a task could schedule on
another CPU and finish, and schedule out, before we do the next task on
the loop.

And, again, that algorithm exists today, and would cause issues. I bet
I could come up with some scenario that would cause the current code
to have an infinite worst case scenario as well.

This isn't just avg or median times. When the current code is
constantly hitting 1ms latencies in the scheduler, and this patch nukes
it down to 50us max latencies, I think we should seriously think about
it. Because today, the code is not acceptable.

I'm willing to massage this patch and do a full analysis on it. I just
don't want this discussion to end like it did last time.

> 
> > +	raw_spin_unlock(&rq->lock);
> > +}
> > +
> >  static int pull_rt_task(struct rq *this_rq)
> >  {
> >  	int this_cpu = this_rq->cpu, ret = 0, cpu;
> > Index: linux-rt.git/kernel/sched/sched.h
> > ===================================================================
> > --- linux-rt.git.orig/kernel/sched/sched.h	2015-02-04 14:08:15.688111069 -0500
> > +++ linux-rt.git/kernel/sched/sched.h	2015-02-04 14:08:17.392087939 -0500
> > @@ -1540,6 +1542,9 @@
> >  	__release(rq2->lock);
> >  }
> >  
> > +void sched_rt_push_check(void)
> 
> I'm very sure you mean static inline there.. otherwise the linker will
> get upset about multiple instances of the same symbol.

ug, yeah.

Thanks,

-- Steve

> 
> > +{
> > +}
> >  #endif
> >  
> >  extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq);



* Re: [RFC][PATCH] sched/rt: Use IPI to trigger RT task push migration instead of pulling
  2015-02-05 15:21 ` Peter Zijlstra
@ 2015-02-05 16:55   ` Steven Rostedt
  2015-02-06 13:23     ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2015-02-05 16:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
	linux-rt-users, Mike Galbraith, Paul E. McKenney

On Thu, 5 Feb 2015 16:21:44 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> So can't we flip the problem around; 99 overloaded cpus and 1 going
> 'low', then we IPI the 99, and they're all going to try and push their
> tasks on the one (now) sad cpu?
> 

Ug, you're right :-/


OK, then we could do this, because if there's 10 CPUS with overloaded
RT tasks (more than one queued), and 20 CPUs drop prios, if they all
send to one CPU to do a pull, it will miss pulling from the other CPUs.


But we do not want to send to all CPUS with overloaded RT queues,
because, as you say, they could all try to push to the same queue and
we hit the same problem this patch is trying to solve (lots of CPUs
grabbing the same rq lock).

Thus, we could proxy it.

Send an IPI to just one CPU. When that CPU receives it, it pushes off
only one task (as only one CPU told it it lowered its CPU).

If it receives the IPI and there's no tasks to push, it means that a
there was another CPU that lowered its priority and asked this CPU to
push a task to it. But another CPU got there first. Then this CPU could
check to see if there's another rq out there with overloaded CPUs and
send the IPI to that one. This way, only one CPU is pushing its tasks
off at a time, and we only push if it is likely to succeed.

Pass the IPI around!

Something to work on.
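
A rough sketch of that idea (the helper name is hypothetical, and this
is not what eventually got merged):

	void sched_rt_push_check(void)
	{
		struct rq *rq = this_rq();
		int next;

		if (has_pushable_tasks(rq)) {
			raw_spin_lock(&rq->lock);
			/* push just the one task we were asked for */
			push_rt_task(rq);
			raw_spin_unlock(&rq->lock);
			return;
		}

		/*
		 * Nothing here to push; another CPU beat us to it.
		 * Forward the request to the next overloaded CPU.
		 */
		next = find_next_rt_overloaded_cpu(rq);	/* hypothetical */
		if (next < nr_cpu_ids)
			smp_send_reschedule(next);
	}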

-- Steve



* Re: [RFC][PATCH] sched/rt: Use IPI to trigger RT task push migration instead of pulling
  2015-02-05 16:04   ` Steven Rostedt
@ 2015-02-06 12:40     ` Peter Zijlstra
  2015-02-06 15:58       ` Mike Galbraith
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2015-02-06 12:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
	linux-rt-users, Mike Galbraith, Paul E. McKenney

On Thu, Feb 05, 2015 at 11:04:20AM -0500, Steven Rostedt wrote:
> > This should be in the irq_enter()/exit() part, but better still, move
> > this into an smp_call_fuction_single_async() or queue_irq_work_on(), no
> > reason to make the scheduler IPI do yet more work.
> 
> OK, the smp_call_function_single_async() may be better. Question is, do
> they too grab spinlocks to send? 

One is in kernel/smp.c, the other in kernel/irq_work.c; have a look ;-)

But no, they use llist stuff.

> > > +	push_rt_tasks(rq);
> > 
> > So this, on first sight, looks like an unbounded while loop; better
> > present some runtime bound analysis on it.
> > 
> > At present I find a worst case O(inf * nr_tries * nr_prios * nr_cpus),
> > which while on average might run faster, is actually quite horrible.
> 
> Again, the algorithm stops when no task is pushed, so it's not inf.

The inf. came from the retry loop in push_rt_task(), could we not get
stuck there if things keep on shifting back and forth? Sure, that would
probably require inf. (bad) luck as well, but on a quick reading I see
no guaranteed bound.

If there is one, please put a comment in. Moron in a hurry (me) is
failing to see the obvious ;-)

> I could add a flag to not do the retry in this case. And, yeah, the
> nr_prios could happen if the waiting task is of high priority (but that
> task would need to be of lower priority than all the other CPUs that are
> asking for the pull, otherwise it would have pushed to those CPUs in
> the first place).

The O(nr_prio * nr_cpus) is from cpupri_find() and is part of the cpupri
design, nothing you can really do about it. You need to iterate the
priorities and scan for a bit, and the bit scan is nr_cpus because
that's how long the cpumasks are.

Sure you won't actually hit the very worst case every time, but here
even the avg execution time is that same O expression, just maybe with a
constant 1/4 factor in, which we ignore because well, that's what big-Os
do ;-)

Now, luckily O(nr_prio * nr_cpus) is bounded (and constant), so that's
good ;-)
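
(For reference, the shape of that scan, heavily paraphrased from
cpupri_find() and not verbatim kernel code:)

	static int cpupri_find_sketch(struct cpupri *cp,
				      struct task_struct *p,
				      struct cpumask *lowest_mask)
	{
		int idx, task_pri = convert_prio(p->prio);

		/* walk the priority vectors from low to high */
		for (idx = 0; idx < task_pri; idx++) {	/* O(nr_prio) */
			struct cpupri_vec *vec = &cp->pri_to_cpu[idx];

			if (!atomic_read(&vec->count))
				continue;
			/* each mask op scans nr_cpus bits: O(nr_cpus) */
			if (!cpumask_and(lowest_mask, &p->cpus_allowed,
					 vec->mask))
				continue;
			return 1;	/* found CPUs lower in prio than p */
		}
		return 0;
	}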

> > RT isn't won on avg or median execution times :-)
> 
> Totally agree. And I'm willing to write up more analysis, because the
> worse case scenario, would require a lot of things to happen that would
> take a lot of work to do so, and not something that would randomly
> happen.
> 
> Even the retry only happens if it could not find a CPU that had a lower
> priority than the task it was looking for, and the task either somehow
> migrated, or a higher priority task some how got queued instead when
> the lock was released.

As per the above, yes please. Also if we can somehow bound that retry
loop that'd be awesome. Sure it might be unlikely to trigger this side
of the universe's heat death, but... unlikely things do happen.

> And this retry case is in today's code, and any problem we have with
> it, should be solved today as well.

Right; this is an argument you _should_ have made :-)

> Now I guess you could argue that if you have > nr_cpus tasks on your
> queue, and when you push one task off, and it runs on the other CPU and
> finishes, which would mean you could push the next RT task off. If that
> task finishes before we get to the next task, it could be nr_tasks
> algorithm. But that's more than unlikely that a task could schedule on
> another CPU and finish, and schedule out, before we do the next task on
> the loop.

The devil's advocate needs to bring up virt crazies; vcpu stalls make
such races entirely possible :/ They're much _much_ better than the
P4-SMT stuff for finding quaint crap like this.

> This isn't just avg or median times. When the current code is
> constantly hitting 1ms latencies in the scheduler, and this patch nukes
> it down to 50us max latencies, I think we should seriously think about
> it. Because today, the code is not acceptable.

Well, the thing is, with or without this patch, the worst case is still
the very same. The likelihood of actually triggering it goes down
dramatically [*], but any spin_lock() has that O(nr_cpus) worst case, and
that spinlock op isn't going away.

Now, lowering avg/median cases is good, but don't kid yourself that the
worst case is going to be better.

People running -RT on big boxes had better be aware of this.

[*] if we solve that reverse problem from the other email, otherwise we
just shifted it.

> I'm willing to massage this patch and do a full analysis on it. I just
> don't want this discussion to end like it did last time.

Sure; but I think we need to be honest here too. This patch is about
lowering avg case latencies, not absolute worst case.

Now, the argument that we currently already run this code with IRQs
disabled, so running it from an IPI is not strictly worse than what we
already do is a good one and should be highlighted.

I suspect tglx wanted to hear such argument, but I did not go back and
look at the history.


* Re: [RFC][PATCH] sched/rt: Use IPI to trigger RT task push migration instead of pulling
  2015-02-05 16:55   ` Steven Rostedt
@ 2015-02-06 13:23     ` Peter Zijlstra
  2015-02-06 13:59       ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2015-02-06 13:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
	linux-rt-users, Mike Galbraith, Paul E. McKenney

On Thu, Feb 05, 2015 at 11:55:01AM -0500, Steven Rostedt wrote:
> On Thu, 5 Feb 2015 16:21:44 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > So can't we flip the problem around; 99 overloaded cpus and 1 going
> > 'low', then we IPI the 99, and they're all going to try and push their
> > tasks on the one (now) sad cpu?
> > 
> 
> Ug, you're right :-/
> 
> 
> OK, then we could do this, because if there's 10 CPUS with overloaded
> RT tasks (more than one queued), and 20 CPUs drop prios, if they all
> send to one CPU to do a pull, it will miss pulling from the other CPUs.
> 
> 
> But we do not want to send to all CPUS with overloaded RT queues,
> because, as you say, they could all try to push to the same queue and
> we hit the same problem this patch is trying to solve (lots of CPUs
> grabbing the same rq lock).
> 
> Thus, we could proxy it.
> 
> Send an IPI to just one CPU. When that CPU receives it, it pushes off
> only one task (as only one CPU told it it lowered its CPU).
> 
> If it receives the IPI and there's no tasks to push, it means that a
> there was another CPU that lowered its priority and asked this CPU to
> push a task to it. But another CPU got there first. Then this CPU could
> check to see if there's another rq out there with overloaded CPUs and
> send the IPI to that one. This way, only one CPU is pushing its tasks
> off at a time, and we only push if it is likely to succeed.
> 
> Pass the IPI around!

The IPI needs some state; it needs to also check that the condition that
triggered its existence on its dst cpu is still true. If somehow its dst
gained work (say a wakeup) we should stop.

Equally, if the dst cpu drops in prio again and there's already one IPI
out and about looking for work for it, we should not start another.

To further avoid collisions, e.g. two CPUs dropping and ending up
sending an IPI to the same overloaded CPU, one could consider
randomized algorithms for selecting a (next) src cpu.

A trivial option might be to start the rto_mask iteration at the current
cpu number and wrap around until you hit self again.
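
(A rough sketch of that trivial option, using the generic cpumask
iterators; illustration only, the helper name is made up:)

	/* find the next overloaded CPU after this_cpu, wrapping around */
	static int rto_next_cpu_sketch(struct rq *rq, int this_cpu)
	{
		bool wrapped = false;
		int cpu = this_cpu;

		for (;;) {
			cpu = cpumask_next(cpu, rq->rd->rto_mask);
			if (cpu >= nr_cpu_ids) {
				if (wrapped)
					return nr_cpu_ids; /* scanned all */
				wrapped = true;
				cpu = -1;	/* wrap to CPU 0 */
				continue;
			}
			if (wrapped && cpu >= this_cpu)
				return nr_cpu_ids; /* back to the start */
			return cpu;	/* next candidate to IPI */
		}
	}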

A more advanced version might use an LFSR to iterate the mask, where we
give each cpu a different seed (say its cpuid) -- be careful to
artificially insert 0 when we loop.

But again; this will only affect the avg case, people should realize
that that 1ms latency is still entirely possible. Nothing changes that.


* Re: [RFC][PATCH] sched/rt: Use IPI to trigger RT task push migration instead of pulling
  2015-02-06 13:23     ` Peter Zijlstra
@ 2015-02-06 13:59       ` Peter Zijlstra
  0 siblings, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2015-02-06 13:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
	linux-rt-users, Mike Galbraith, Paul E. McKenney

On Fri, Feb 06, 2015 at 02:23:41PM +0100, Peter Zijlstra wrote:
> A more advanced version might use an LFSR to iterate the mask, where we
> give each cpu a different seed (say its cpuid) -- be careful to
> artificially insert 0 when we loop.

And since I already stared at this longer than I should have... it's
got a few quirks and is slower than the regular iteration, but it
should still have bounded behaviour.


static unsigned long lfsr_taps;

void __init lfsr_init(void)
{
	int bits = ilog2(nr_cpu_ids) + 1;

	switch (bits) {
		case  1: lfsr_taps = 0x0001; break;
		case  2: lfsr_taps = 0x0001; break;
		case  3: lfsr_taps = 0x0003; break;
		case  4: lfsr_taps = 0x0009; break;
		case  5: lfsr_taps = 0x0012; break;
		case  6: lfsr_taps = 0x0021; break;
		case  7: lfsr_taps = 0x0041; break;
		case  8: lfsr_taps = 0x008E; break;
		case  9: lfsr_taps = 0x0108; break;
		case 10: lfsr_taps = 0x0204; break;
		case 11: lfsr_taps = 0x0402; break;
		case 12: lfsr_taps = 0x0829; break;
		case 13: lfsr_taps = 0x100D; break;
		case 14: lfsr_taps = 0x2015; break;

		default:
			BUG_ON(1); /* add more numbers */
	}
}

static inline int __lfsr_next(int lfsr)
{
	unsigned int bit = lfsr & 1;
	lfsr >>= 1;
	if (bit)
		lfsr ^= lfsr_taps;
	return lfsr;
}

int cpumask_next_lfsr(int cpu, const struct cpumask *mask, int seed)
{
	seed |= 1; /* must not be 0 */

	/* kick-start, avoid the termination condition */
	if (cpu == -1) {
		cpu = seed;
		goto test;
	}

again:
	/* when we cycle, we're done, test cpu 0 last */
	if (cpu == seed)
		cpu = 0;
	/* if we just tested cpu 0, we're done */
	else if (cpu == 0)
		return nr_cpu_ids;
	/* otherwise, next please! */
	else
		cpu = __lfsr_next(cpu);

test:
	/* cycle through the unused space; <= 2^(n-1) loops */
	if (cpu >= nr_cpu_ids)
		goto again;

	if (!cpumask_test_cpu(cpu, mask))
		goto again;

	return cpu;
}

#define for_each_cpu_lfsr(cpu, mask, seed)				  \
	for ((cpu) = -1; (cpu) = cpumask_next_lfsr((cpu), (mask), (seed)), \
			 (cpu) < nr_cpu_ids;)
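
A hypothetical use, walking the RT-overload mask in a per-CPU
pseudo-random order (illustration only):

	int cpu;

	for_each_cpu_lfsr(cpu, rq->rd->rto_mask, smp_processor_id()) {
		/* 'cpu' is a candidate overloaded CPU to IPI */
	}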


* Re: [RFC][PATCH] sched/rt: Use IPI to trigger RT task push migration instead of pulling
  2015-02-06 12:40     ` Peter Zijlstra
@ 2015-02-06 15:58       ` Mike Galbraith
  0 siblings, 0 replies; 9+ messages in thread
From: Mike Galbraith @ 2015-02-06 15:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, LKML, Ingo Molnar, Thomas Gleixner,
	Clark Williams, linux-rt-users, Paul E. McKenney

On Fri, 2015-02-06 at 13:40 +0100, Peter Zijlstra wrote:
> On Thu, Feb 05, 2015 at 11:04:20AM -0500, Steven Rostedt wrote:

> Well, the thing is, with or without this patch, the worst case is still
> the very same. The likelihood of actually triggering it goes down
> dramatically [*], but any spin_lock() has that O(nr_cpus) worst case, and
> that spinlock op isn't going away.
> 
> Now, lowering avg/median cases is good, but don't kid yourself that the
> worst case is going to be better.
> 
> People running -RT on big boxes had better be aware of this.

The folks I know of running -RT on big boxen isolate cores and nail
absolutely everything to the floor.  The only way to survive in a big
box is to chop the thing into tiny pieces.  A plasma cutter would be
even better.. 'course you could just _buy smaller boxen_ and connect
what you want connected, take the lazy path to big box solution ;-)
  
	-Mike


