LKML Archive on lore.kernel.org
* [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
@ 2015-02-24 18:39 Steven Rostedt
2015-02-24 18:45 ` Steven Rostedt
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Steven Rostedt @ 2015-02-24 18:39 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Clark Williams,
linux-rt-users, Mike Galbraith, Paul E. McKenney,
Jörn Engel
While debugging latencies on a 40 core box, where we hit 300 to
500 microsecond latencies, I found there was huge contention on the
runqueue locks.
Investigating further with ftrace, I found that it was due to
the pulling of RT tasks.
The test that was run was the following:
cyclictest --numa -p95 -m -d0 -i100
This created a thread on each CPU that would set its wakeup in iterations
of 100 microseconds. The -d0 means that all the threads had the same
interval (100us). Each thread sleeps for 100us, wakes up, and measures
its latency.
What happened was that another RT task would be scheduled on one of the
CPUs running our test; when the tests on the other CPUs went to sleep
and those CPUs scheduled idle, the "pull" operation executed on all of
them. Each one saw the overloaded RT task on the CPU where the test was
still running, and each one tried to grab that task in a thundering
herd way.
To grab the task, each thread would do a double rq lock grab, grabbing
its own lock as well as the rq lock of the overloaded CPU. As the sched
domains on this box were rather flat for its size, I saw up to 12 CPUs
block on this lock at once. This caused a ripple effect with the
rq locks. As these locks were blocked, any wakeups or load balancing
on these CPUs would also block on these locks, and the wait time escalated.
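For reference, the old pull path looks roughly like this (a simplified
sketch of what pull_rt_task() ends up doing, not the actual code):

	/*
	 * Sketch only: each CPU that just lowered its priority walks the
	 * rto_mask and double-locks against every overloaded rq it finds,
	 * so many CPUs doing this at once all pile up on the overloaded
	 * CPU's rq lock.
	 */
	static void old_pull_sketch(struct rq *this_rq)
	{
		int cpu;

		/* Called with this_rq->lock held */
		for_each_cpu(cpu, this_rq->rd->rto_mask) {
			struct rq *src_rq = cpu_rq(cpu);

			if (cpu == this_rq->cpu)
				continue;

			/* Takes src_rq->lock (may drop and retake this_rq->lock) */
			double_lock_balance(this_rq, src_rq);
			/* ... look for a pushable RT task and migrate it ... */
			double_unlock_balance(this_rq, src_rq);
		}
	}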
I've tried various methods to lessen the load, but things like an
atomic counter to only let one CPU grab the task won't work, because
the task may have a limited affinity, and we may pick the wrong
CPU to take that lock and do the pull, only to find out that the
CPU we picked isn't in the task's affinity.
Instead of doing the PULL, I now have the CPUs that want the pull
send an IPI to the overloaded CPU, and let that CPU pick which
CPU to push the task to. No more need to grab the rq lock, and the
push/pull algorithm still works fine.
With this patch, the latency dropped to just 150us over a 20 hour run.
Without the patch, the huge latencies would trigger in seconds.
I've created a new sched feature called RT_PUSH_IPI, which is enabled
by default.
When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
is enabled, the IPI is sent to the overloaded CPU to do a push.
To enable or disable this at run time:
# mount -t debugfs nodev /sys/kernel/debug
# echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
or
# echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features
Update: The original patch would send an IPI to all CPUs in the RT overload
list. But that could theoretically cause the reverse issue. That is, there
could be lots of overloaded RT queues while one CPU lowers its priority. It would
then send an IPI to all the overloaded RT queues and they could then all try
to grab the rq lock of the CPU lowering its priority, and then we have the
same problem.
To avoid this, the CPU lowering its priority finds the highest priority rq in
the RT overloaded mask, and sends out a single IPI to that queue. If that rq
cannot push a task, it will then look for other overloaded RT queues and send
an IPI to the next highest one that is lower than its priority (using irq work
queues). It ignores rqs that have a higher priority because that could cause
an infinite ping pong effect if the rqs could not migrate a task due to
affinity. By limiting it to only rqs of lower priority, the IPI should never
hit the same CPU twice (unless the rq itself did change, but still, it is
limited by the priority of the rq that originally lowered its prio, and the
highest priority at the time it was sent). There is no unbounded looping here
even if the state of the system constantly changes.
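Distilled, the rule for which overloaded rq the IPI may be forwarded to
looks roughly like this (a sketch only, with a made up helper name; the
real check is open coded in try_to_push_tasks() below):

	/*
	 * Sketch: may the push IPI be forwarded to the overloaded rq on @cpu?
	 * Lower prio number means higher priority.
	 * @prio:       prio of the next pushable task on the candidate rq
	 * @chain_prio: prio that was used when this link of the chain was picked
	 * @src_prio:   prio of the CPU that lowered its priority
	 */
	static bool may_forward_push_ipi(int prio, int chain_prio, int src_prio,
					 int cpu, int chain_cpu)
	{
		/* Nothing here would preempt what the source CPU now runs */
		if (prio >= src_prio)
			return false;

		/*
		 * Higher priority than where the chain currently is: its own
		 * schedule would do the push or pull, so skip it to avoid
		 * ping ponging.
		 */
		if (prio < chain_prio)
			return false;

		/* Same priority: only forward to higher CPU numbers */
		if (prio == chain_prio && cpu < chain_cpu)
			return false;

		return true;
	}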
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
Changes since V1:
o As already mentioned in the "update", a redesign was done to send
only one IPI, and to the highest rt overloaded queue. If for some
reason that could not push a task, it would look for the next rq and
send an ipi (irq work actually) to the next one. Limits are in place
to prevent any ping pong effect of two rqs sending IPIs back and
forth.
o Made this sched feature enabled by default instead of enabling it
when we have 16 or more CPUs.
kernel/sched/features.h | 11 ++
kernel/sched/rt.c | 202 ++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 6 +
3 files changed, 219 insertions(+)
Index: linux-rt.git/kernel/sched/rt.c
===================================================================
--- linux-rt.git.orig/kernel/sched/rt.c 2015-02-24 10:44:08.798785452 -0500
+++ linux-rt.git/kernel/sched/rt.c 2015-02-24 13:07:20.154735448 -0500
@@ -6,6 +6,7 @@
#include "sched.h"
#include <linux/slab.h>
+#include <linux/irq_work.h>
int sched_rr_timeslice = RR_TIMESLICE;
@@ -59,6 +60,11 @@ static void start_rt_bandwidth(struct rt
raw_spin_unlock(&rt_b->rt_runtime_lock);
}
+#ifdef CONFIG_SMP
+static void try_to_push_tasks(void *arg);
+static void push_irq_work_func(struct irq_work *work);
+#endif
+
void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
{
struct rt_prio_array *array;
@@ -78,6 +84,11 @@ void init_rt_rq(struct rt_rq *rt_rq, str
rt_rq->rt_nr_migratory = 0;
rt_rq->overloaded = 0;
plist_head_init(&rt_rq->pushable_tasks);
+
+ rt_rq->push_csd.flags = 0;
+ rt_rq->push_csd.func = try_to_push_tasks;
+ rt_rq->push_csd.info = rt_rq;
+ init_irq_work(&rt_rq->push_csd_work, push_irq_work_func);
#endif
/* We start is dequeued state, because no RT tasks are queued */
rt_rq->rt_queued = 0;
@@ -1760,11 +1771,171 @@ static void push_rt_tasks(struct rq *rq)
;
}
+static void tell_cpu_to_push(int cpu, struct rq *rq, int prio)
+{
+ /* We should never be here if the IPI is already out. */
+ BUG_ON(rq->rt.push_csd_pending);
+
+ rq->rt.push_csd_pending = 1;
+ rq->rt.push_csd_cpu = cpu;
+ /* Save the prio that we used to find this CPU */
+ rq->rt.push_csd_prio = prio;
+
+ /* Make sure csd_cpu is visible before we send the ipi */
+ smp_mb();
+
+ smp_call_function_single_async(cpu, &rq->rt.push_csd);
+}
+
+static void try_to_push_tasks(void *arg)
+{
+ struct rt_rq *rt_rq = arg;
+ struct rq *rq, *next_rq;
+ int next_cpu = -1;
+ int next_prio = MAX_PRIO + 1;
+ int this_prio;
+ int src_prio;
+ int prio;
+ int this_cpu;
+ int success;
+ int cpu;
+
+ /* Make sure we can see csd_cpu */
+ smp_rmb();
+
+ this_cpu = rt_rq->push_csd_cpu;
+
+ /* Paranoid check */
+ BUG_ON(this_cpu != smp_processor_id());
+
+ rq = cpu_rq(this_cpu);
+
+ /*
+ * If there's nothing to push here, then see if another queue
+ * can push instead.
+ */
+ if (!has_pushable_tasks(rq))
+ goto pass_the_ipi;
+
+ raw_spin_lock(&rq->lock);
+ success = push_rt_task(rq);
+ raw_spin_unlock(&rq->lock);
+
+ if (success)
+ goto done;
+
+ /* Nothing was pushed, try another queue */
+pass_the_ipi:
+
+ /*
+ * We use the priority that determined to send to this CPU
+ * even if the priority for this CPU changed. This is used
+ * to determine what other CPUs to send to, to keep from
+ * doing a ping pong from each CPU.
+ */
+ this_prio = rt_rq->push_csd_prio;
+ src_prio = rt_rq->highest_prio.curr;
+
+ for_each_cpu(cpu, rq->rd->rto_mask) {
+ if (this_cpu == cpu)
+ continue;
+
+ /*
+ * This function was called because some rq lowered its
+ * priority. It then searched for the highest priority
+ * rq that had overloaded tasks and sent an smp function
+ * call to that cpu to call this function to push its
+ * tasks. But when it got here, the task was either
+ * already pushed, or due to affinity, could not move
+ * the overloaded task.
+ *
+ * Now we need to see if there's another overloaded rq that
+ * has an RT task that can migrate to that CPU.
+ *
+ * We need to be careful, we do not want to cause a ping
+ * pong between this CPU and another CPU that has an RT task
+ * that can migrate, but not to the CPU that lowered its
+ * priority. Since the lowering priority CPU finds the highest
+ * priority rq to send to, we will ignore any rq that is of higher
+ * priority than this current one. That is, if a rq scheduled a
+ * task of higher priority, the schedule itself would do the
+ * push or pull then. We can safely ignore higher priority rqs.
+ * And if there's one that is the same priority, since the CPUS
+ * are searched in order we will ignore CPUS of the same priority
+ * unless the CPU number is greater than this CPU's number.
+ */
+ next_rq = cpu_rq(cpu);
+
+ /* Use a single read for the next prio for decision making */
+ prio = READ_ONCE(next_rq->rt.highest_prio.next);
+
+ /* Looking for highest priority */
+ if (prio >= next_prio)
+ continue;
+
+ /* Make sure that the rq can push to the source rq */
+ if (prio >= src_prio)
+ continue;
+
+ /* If the prio is higher than the current prio, ignore it */
+ if (prio < this_prio)
+ continue;
+
+ /*
+ * If the prio is equal to the current prio, only use it
+ * if the cpu number is greater than the current cpu.
+ * This prevents a ping pong effect.
+ */
+ if (prio == this_prio && cpu < this_cpu)
+ continue;
+
+ next_prio = prio;
+ next_cpu = cpu;
+ }
+
+ /* Nothing found, do nothing */
+ if (next_cpu < 0)
+ goto done;
+
+ /*
+ * Now we can not send another smp async function due to locking,
+ * use irq_work instead.
+ */
+
+ rt_rq->push_csd_cpu = next_cpu;
+ rt_rq->push_csd_prio = next_prio;
+
+ /* Make sure the next cpu is seen on remote CPU */
+ smp_mb();
+
+ irq_work_queue_on(&rt_rq->push_csd_work, next_cpu);
+
+ return;
+
+done:
+ rt_rq->push_csd_pending = 0;
+
+ /* Now make sure the src CPU can see this update */
+ smp_wmb();
+}
+
+static void push_irq_work_func(struct irq_work *work)
+{
+ struct rt_rq *rt_rq = container_of(work, struct rt_rq, push_csd_work);
+
+ try_to_push_tasks(rt_rq);
+}
+
static int pull_rt_task(struct rq *this_rq)
{
int this_cpu = this_rq->cpu, ret = 0, cpu;
struct task_struct *p;
struct rq *src_rq;
+ bool push_pending = false;
+ bool use_ipi;
+ int next_cpu = -1;
+ int next_prio = MAX_PRIO + 1;
+ int src_prio;
if (likely(!rt_overloaded(this_rq)))
return 0;
@@ -1775,6 +1946,15 @@ static int pull_rt_task(struct rq *this_
*/
smp_rmb();
+ /* Use local just in case a feature is switched in the middle of this */
+ if ((use_ipi = sched_feat(RT_PUSH_IPI))) {
+ /* Make sure the update to pending is visible here. */
+ smp_rmb();
+
+ /* If a push ipi is out, then we must do the old method */
+ push_pending = READ_ONCE(this_rq->rt.push_csd_pending);
+ }
+
for_each_cpu(cpu, this_rq->rd->rto_mask) {
if (this_cpu == cpu)
continue;
@@ -1793,6 +1973,26 @@ static int pull_rt_task(struct rq *this_
continue;
/*
+ * When the RT_PUSH_IPI sched feature is enabled, instead
+ * of trying to grab the rq lock of the RT overloaded CPU
+ * send an IPI to that CPU instead. This prevents heavy
+ * contention from several CPUs lowering its priority
+ * and all trying to grab the rq lock of that overloaded CPU.
+ */
+ if (use_ipi && !push_pending) {
+ /*
+ * We need to make sure the prio we use for comparisons
+ * is the same throughout.
+ */
+ src_prio = READ_ONCE(src_rq->rt.highest_prio.next);
+ if (src_prio < next_prio) {
+ next_prio = src_prio;
+ next_cpu = cpu;
+ }
+ continue;
+ }
+
+ /*
* We can potentially drop this_rq's lock in
* double_lock_balance, and another CPU could
* alter this_rq
@@ -1839,6 +2039,8 @@ static int pull_rt_task(struct rq *this_
skip:
double_unlock_balance(this_rq, src_rq);
}
+ if (use_ipi && !push_pending && next_cpu >= 0)
+ tell_cpu_to_push(next_cpu, this_rq, next_prio);
return ret;
}
Index: linux-rt.git/kernel/sched/sched.h
===================================================================
--- linux-rt.git.orig/kernel/sched/sched.h 2015-02-24 10:44:08.798785452 -0500
+++ linux-rt.git/kernel/sched/sched.h 2015-02-24 11:58:25.799436421 -0500
@@ -6,6 +6,7 @@
#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/stop_machine.h>
+#include <linux/irq_work.h>
#include <linux/tick.h>
#include <linux/slab.h>
@@ -435,6 +436,11 @@ struct rt_rq {
unsigned long rt_nr_total;
int overloaded;
struct plist_head pushable_tasks;
+ struct call_single_data push_csd;
+ int push_csd_pending;
+ int push_csd_cpu;
+ int push_csd_prio;
+ struct irq_work push_csd_work;
#endif
int rt_queued;
Index: linux-rt.git/kernel/sched/features.h
===================================================================
--- linux-rt.git.orig/kernel/sched/features.h 2015-02-24 10:44:08.798785452 -0500
+++ linux-rt.git/kernel/sched/features.h 2015-02-24 13:06:20.471596597 -0500
@@ -56,6 +56,17 @@ SCHED_FEAT(NONTASK_CAPACITY, true)
*/
SCHED_FEAT(TTWU_QUEUE, true)
+/*
+ * In order to avoid a thundering herd attack of CPUS that are
+ * lowering their priorities at the same time, and there being
+ * a single CPU that has an RT task that can migrate and is waiting
+ * to run, where the other CPUs will try to take that CPUs
+ * rq lock and possibly create a large contention, sending an
+ * IPI to that CPU and let that CPU push the RT task to where
+ * it should go may be a better scenario.
+ */
+SCHED_FEAT(RT_PUSH_IPI, true)
+
SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
* Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2015-02-24 18:39 [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling Steven Rostedt
@ 2015-02-24 18:45 ` Steven Rostedt
2015-02-24 21:23 ` Steven Rostedt
2015-02-25 10:35 ` Peter Zijlstra
2 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2015-02-24 18:45 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Clark Williams,
linux-rt-users, Mike Galbraith, Paul E. McKenney,
Jörn Engel
On Tue, 24 Feb 2015 13:39:46 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> +static void tell_cpu_to_push(int cpu, struct rq *rq, int prio)
> +{
> + /* We should never be here if the IPI is already out. */
> + BUG_ON(rq->rt.push_csd_pending);
> +
> + rq->rt.push_csd_pending = 1;
> + rq->rt.push_csd_cpu = cpu;
> + /* Save the prio that we used to find this CPU */
> + rq->rt.push_csd_prio = prio;
> +
> + /* Make sure csd_cpu is visible before we send the ipi */
> + smp_mb();
> +
> + smp_call_function_single_async(cpu, &rq->rt.push_csd);
> +}
> + rt_rq->push_csd_cpu = next_cpu;
> + rt_rq->push_csd_prio = next_prio;
> +
> + /* Make sure the next cpu is seen on remote CPU */
> + smp_mb();
BTW, would a smp_wmb() be enough here? I was a little paranoid about
the irq work and smp_call_function above somehow leaking before the
barrier. I'm probably just being too paranoid, and these can be
switched to smp_wmb(). But for now I'll just keep being a freak.
-- Steve
> +
> + irq_work_queue_on(&rt_rq->push_csd_work, next_cpu);
> +
> + return;
> +
* Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2015-02-24 18:39 [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling Steven Rostedt
2015-02-24 18:45 ` Steven Rostedt
@ 2015-02-24 21:23 ` Steven Rostedt
2015-02-25 10:35 ` Peter Zijlstra
2 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2015-02-24 21:23 UTC (permalink / raw)
To: LKML
Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Clark Williams,
linux-rt-users, Mike Galbraith, Paul E. McKenney,
Jörn Engel
On Tue, 24 Feb 2015 13:39:46 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> @@ -1775,6 +1946,15 @@ static int pull_rt_task(struct rq *this_
> */
> smp_rmb();
>
> + /* Use local just in case a feature is switched in the middle of this */
> + if ((use_ipi = sched_feat(RT_PUSH_IPI))) {
> + /* Make sure the update to pending is visible here. */
> + smp_rmb();
While porting this to the -rt kernel, I noticed that this rmb is
unneeded. It's already called above for a different reason :-p
-- Steve
> +
> + /* If a push ipi is out, then we must do the old method */
> + push_pending = READ_ONCE(this_rq->rt.push_csd_pending);
> + }
> +
> for_each_cpu(cpu, this_rq->rd->rto_mask) {
> if (this_cpu == cpu)
> continue;
* Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2015-02-24 18:39 [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling Steven Rostedt
2015-02-24 18:45 ` Steven Rostedt
2015-02-24 21:23 ` Steven Rostedt
@ 2015-02-25 10:35 ` Peter Zijlstra
2015-02-25 15:51 ` Steven Rostedt
2 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2015-02-25 10:35 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
linux-rt-users, Mike Galbraith, Paul E. McKenney,
Jörn Engel
On Tue, Feb 24, 2015 at 01:39:46PM -0500, Steven Rostedt wrote:
> Index: linux-rt.git/kernel/sched/rt.c
> ===================================================================
> --- linux-rt.git.orig/kernel/sched/rt.c 2015-02-24 10:44:08.798785452 -0500
> +++ linux-rt.git/kernel/sched/rt.c 2015-02-24 13:07:20.154735448 -0500
> @@ -1760,11 +1771,171 @@ static void push_rt_tasks(struct rq *rq)
> ;
> }
>
> +static void tell_cpu_to_push(int cpu, struct rq *rq, int prio)
> +{
> + /* We should never be here if the IPI is already out. */
> + BUG_ON(rq->rt.push_csd_pending);
> +
> + rq->rt.push_csd_pending = 1;
> + rq->rt.push_csd_cpu = cpu;
> + /* Save the prio that we used to find this CPU */
> + rq->rt.push_csd_prio = prio;
> +
> + /* Make sure csd_cpu is visible before we send the ipi */
> + smp_mb();
We've architecturally defined the raising of interrupts vs the execution
of the handler to be a serializing event, therefore this full barrier is
not required.
I think we documented it in places, but I never can find it.
> +
> + smp_call_function_single_async(cpu, &rq->rt.push_csd);
I'm confused: why are you mixing smp_call_function_single_async() and
irq_work_queue_on() in the same logic?
Pick one and stick to it.
Also: lkml.kernel.org/r/CA+55aFz492bzLFhdbKN-Hygjcreup7CjMEYk3nTSfRWjppz-OA@mail.gmail.com
Now I would suggest you use irq_work_queue_on() for this, consistently,
because irq_works are run after the smp function calls complete and
therefore minimize the latency for people waiting on the (sync) smp
function calls.
> +}
> +
> +static void try_to_push_tasks(void *arg)
> +{
> + struct rt_rq *rt_rq = arg;
> + struct rq *rq, *next_rq;
> + int next_cpu = -1;
> + int next_prio = MAX_PRIO + 1;
> + int this_prio;
> + int src_prio;
> + int prio;
> + int this_cpu;
> + int success;
> + int cpu;
> +
> + /* Make sure we can see csd_cpu */
> + smp_rmb();
As per the above, interrupt are serializing, this is not needed.
> +
> + this_cpu = rt_rq->push_csd_cpu;
> +
> + /* Paranoid check */
> + BUG_ON(this_cpu != smp_processor_id());
> +
> + rq = cpu_rq(this_cpu);
> +
> + /*
> + * If there's nothing to push here, then see if another queue
> + * can push instead.
> + */
> + if (!has_pushable_tasks(rq))
> + goto pass_the_ipi;
> +
> + raw_spin_lock(&rq->lock);
> + success = push_rt_task(rq);
> + raw_spin_unlock(&rq->lock);
> +
> + if (success)
> + goto done;
> +
> + /* Nothing was pushed, try another queue */
> +pass_the_ipi:
> +
> + /*
> + * We use the priority that determined to send to this CPU
> + * even if the priority for this CPU changed. This is used
> + * to determine what other CPUs to send to, to keep from
> + * doing a ping pong from each CPU.
> + */
> + this_prio = rt_rq->push_csd_prio;
> + src_prio = rt_rq->highest_prio.curr;
Should we, at this point, not check if the condition that triggered the
pull request is still valid on our src cpu? No point in continuing our
IPI chain if the CPU we're looking for work for is no longer interested
in it.
> + for_each_cpu(cpu, rq->rd->rto_mask) {
> + if (this_cpu == cpu)
> + continue;
> +
> + /*
> + * This function was called because some rq lowered its
> + * priority. It then searched for the highest priority
> + * rq that had overloaded tasks and sent an smp function
> + * call to that cpu to call this function to push its
> + * tasks. But when it got here, the task was either
> + * already pushed, or due to affinity, could not move
> + * the overloaded task.
> + *
> + * Now we need to see if there's another overloaded rq that
> + * has an RT task that can migrate to that CPU.
> + *
> + * We need to be careful, we do not want to cause a ping
> + * pong between this CPU and another CPU that has an RT task
> + * that can migrate, but not to the CPU that lowered its
> + * priority. Since the lowering priority CPU finds the highest
> + * priority rq to send to, we will ignore any rq that is of higher
> + * priority than this current one.
Maybe start a new paragraph here?
That is, if a rq scheduled a
> + * task of higher priority, the schedule itself would do the
> + * push or pull then. We can safely ignore higher priority rqs.
> + * And if there's one that is the same priority, since the CPUS
> + * are searched in order we will ignore CPUS of the same priority
> + * unless the CPU number is greater than this CPU's number.
> + */
I would s/CPUS/CPUs/; the plural is not part of the acronym, is it?
> + next_rq = cpu_rq(cpu);
> +
> + /* Use a single read for the next prio for decision making */
> + prio = READ_ONCE(next_rq->rt.highest_prio.next);
> +
> + /* Looking for highest priority */
> + if (prio >= next_prio)
> + continue;
> +
> + /* Make sure that the rq can push to the source rq */
> + if (prio >= src_prio)
> + continue;
> +
> + /* If the prio is higher than the current prio, ignore it */
> + if (prio < this_prio)
> + continue;
> +
> + /*
> + * If the prio is equal to the current prio, only use it
> + * if the cpu number is greater than the current cpu.
> + * This prevents a ping pong effect.
> + */
> + if (prio == this_prio && cpu < this_cpu)
> + continue;
> +
> + next_prio = prio;
> + next_cpu = cpu;
> + }
> +
> + /* Nothing found, do nothing */
> + if (next_cpu < 0)
> + goto done;
> +
> + /*
> + * Now we can not send another smp async function due to locking,
> + * use irq_work instead.
> + */
> +
> + rt_rq->push_csd_cpu = next_cpu;
> + rt_rq->push_csd_prio = next_prio;
> +
> + /* Make sure the next cpu is seen on remote CPU */
> + smp_mb();
Not required,
> + irq_work_queue_on(&rt_rq->push_csd_work, next_cpu);
And like I said before: pick one, stick with it.
> + return;
> +
> +done:
> + rt_rq->push_csd_pending = 0;
> +
> + /* Now make sure the src CPU can see this update */
> + smp_wmb();
This barrier does not make sense either, for a wmb to make sense you
need two stores:
x := 0, y := 0
[S] x = 1
WMB
[S] y = 1
And one would typically pair that with some loads like:
[L] r1 = y
RMB
[L] r2 = x
Which gives you the guarantee that if r1 is true, r2 must then also be
true.
Now in this case, there is no subsequent store.. therefore there is no
order and therefore there is no use of barriers.
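Spelled out with kernel primitives, the pairing above would be (generic
illustration, not code from this patch):

	static int data;	/* "x" above */
	static int flag;	/* "y" above */

	static void producer(void)
	{
		WRITE_ONCE(data, 1);	/* [S] x = 1 */
		smp_wmb();
		WRITE_ONCE(flag, 1);	/* [S] y = 1 */
	}

	static void consumer(void)
	{
		if (READ_ONCE(flag)) {	/* [L] r1 = y */
			smp_rmb();
			/* [L] r2 = x; guaranteed to observe x == 1 */
			BUG_ON(READ_ONCE(data) != 1);
		}
	}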
> +}
> +
> +static void push_irq_work_func(struct irq_work *work)
> +{
> + struct rt_rq *rt_rq = container_of(work, struct rt_rq, push_csd_work);
> +
> + try_to_push_tasks(rt_rq);
> +}
> +
> static int pull_rt_task(struct rq *this_rq)
> {
> int this_cpu = this_rq->cpu, ret = 0, cpu;
> struct task_struct *p;
> struct rq *src_rq;
> + bool push_pending = false;
> + bool use_ipi;
> + int next_cpu = -1;
> + int next_prio = MAX_PRIO + 1;
> + int src_prio;
>
> if (likely(!rt_overloaded(this_rq)))
> return 0;
> @@ -1775,6 +1946,15 @@ static int pull_rt_task(struct rq *this_
> */
> smp_rmb();
>
> + /* Use local just in case a feature is switched in the middle of this */
> + if ((use_ipi = sched_feat(RT_PUSH_IPI))) {
> + /* Make sure the update to pending is visible here. */
> + smp_rmb();
Another dubious barrier; the sched_feat() 'LOAD' cannot matter,
therefore this barrier doesn't add any order over the previous rmb.
And again, you need two loads for an RMB to make sense, and a WMB (or
MB) on the other side with some STORE.
I think this refers to the store of push_csd_pending at the tail of
try_to_push_tasks(), but as there's no order there, there cannot be any
here either.
> +
> + /* If a push ipi is out, then we must do the old method */
> + push_pending = READ_ONCE(this_rq->rt.push_csd_pending);
> + }
Hmm, this deserves a wee bit more comment I think.
Ideally you'd be able to 'cancel' the active IPI and re-issue it.
> +
> for_each_cpu(cpu, this_rq->rd->rto_mask) {
> if (this_cpu == cpu)
> continue;
> @@ -1793,6 +1973,26 @@ static int pull_rt_task(struct rq *this_
> continue;
>
> /*
> + * When the RT_PUSH_IPI sched feature is enabled, instead
> + * of trying to grab the rq lock of the RT overloaded CPU
> + * send an IPI to that CPU instead. This prevents heavy
> + * contention from several CPUs lowering its priority
> + * and all trying to grab the rq lock of that overloaded CPU.
> + */
> + if (use_ipi && !push_pending) {
> + /*
> + * We need to make sure the prio we use for comparisons
> + * is the same throughout.
> + */
> + src_prio = READ_ONCE(src_rq->rt.highest_prio.next);
> + if (src_prio < next_prio) {
> + next_prio = src_prio;
> + next_cpu = cpu;
> + }
> + continue;
> + }
> +
> + /*
> * We can potentially drop this_rq's lock in
> * double_lock_balance, and another CPU could
> * alter this_rq
> @@ -1839,6 +2039,8 @@ static int pull_rt_task(struct rq *this_
> skip:
> double_unlock_balance(this_rq, src_rq);
> }
> + if (use_ipi && !push_pending && next_cpu >= 0)
> + tell_cpu_to_push(next_cpu, this_rq, next_prio);
>
> return ret;
> }
Hohumm,.. so there's still a problem here, and you created it by looking
for the highest prio overloaded cpu.
This means that if multiple CPUs need to go pull, they'll end up sending IPIs
to the same highest overloaded CPU.
It also causes you to do this for_each_cpu(rto_mask) walk in the IPI
again (and again) trying to look for the next highest overloaded CPU.
We should really try and avoid for_each_cpu() loops in interrupts (or
with interrupts disabled for that matter).
So when last we talked about this problem; I suggested two alternatives:
- pick the next rto cpu after the current and loop around until you're
past the original cpu again.
- do that 'randomized' walk over rto and again, terminate when you're
back where you started.
* Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2015-02-25 10:35 ` Peter Zijlstra
@ 2015-02-25 15:51 ` Steven Rostedt
2015-02-25 17:11 ` Peter Zijlstra
0 siblings, 1 reply; 15+ messages in thread
From: Steven Rostedt @ 2015-02-25 15:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
linux-rt-users, Mike Galbraith, Paul E. McKenney,
Jörn Engel
On Wed, 25 Feb 2015 11:35:35 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Feb 24, 2015 at 01:39:46PM -0500, Steven Rostedt wrote:
> > Index: linux-rt.git/kernel/sched/rt.c
> > ===================================================================
> > --- linux-rt.git.orig/kernel/sched/rt.c 2015-02-24 10:44:08.798785452 -0500
> > +++ linux-rt.git/kernel/sched/rt.c 2015-02-24 13:07:20.154735448 -0500
>
> > @@ -1760,11 +1771,171 @@ static void push_rt_tasks(struct rq *rq)
> > ;
> > }
> >
> > +static void tell_cpu_to_push(int cpu, struct rq *rq, int prio)
> > +{
> > + /* We should never be here if the IPI is already out. */
> > + BUG_ON(rq->rt.push_csd_pending);
> > +
> > + rq->rt.push_csd_pending = 1;
> > + rq->rt.push_csd_cpu = cpu;
> > + /* Save the prio that we used to find this CPU */
> > + rq->rt.push_csd_prio = prio;
> > +
> > + /* Make sure csd_cpu is visible before we send the ipi */
> > + smp_mb();
>
> We've architecturally defined the raising of interrupts vs the execution
> of the handler to be a serializing event, therefore this full barrier is
> not required.
>
> I think we documented it in places, but I never can find it.
>
> > +
> > + smp_call_function_single_async(cpu, &rq->rt.push_csd);
>
> I'm confused why are you mixing smp_call_function_single_async() and
> irq_work_queue_on() in the same logic?
I originally started with smp_call_function_single_async(), but found
that it takes csd locks and holds them while the functions are
executed. Meaning, you cannot call another
smp_call_function_single_async() from within a previous call. I left
the code that starts the IPI as the smp_call_function but changed the
internal calls to use irq_work_queue_on(). I guess I can switch all to
just use that if you want only one, as the
smp_call_function_single_async() is not even an option.
>
> Pick one and stick to it.
>
> Also: lkml.kernel.org/r/CA+55aFz492bzLFhdbKN-Hygjcreup7CjMEYk3nTSfRWjppz-OA@mail.gmail.com
Ah that would have helped. I talked about doing this too in the past.
I think I mentioned it to you on IRC.
But I believe this still requires the user making sure the original
call is synchronized, which breaks making a change for you below. I'll
discuss that further down.
>
> Now I would suggest you use irq_wok_queue_on() for this, consistently,
> because irq_works are ran after the smp function calls complete and
> therefore minimize the latency for people waiting on the (sync) smp
> function calls.
Well, we still want the pull to happen as fast as possible. But sure,
I get your point.
>
> > +}
> > +
> > +static void try_to_push_tasks(void *arg)
> > +{
> > + struct rt_rq *rt_rq = arg;
> > + struct rq *rq, *next_rq;
> > + int next_cpu = -1;
> > + int next_prio = MAX_PRIO + 1;
> > + int this_prio;
> > + int src_prio;
> > + int prio;
> > + int this_cpu;
> > + int success;
> > + int cpu;
> > +
> > + /* Make sure we can see csd_cpu */
> > + smp_rmb();
>
> As per the above, interrupt are serializing, this is not needed.
Because entering an interrupt is a serializing point with other CPUs?
>
> > +
> > + this_cpu = rt_rq->push_csd_cpu;
> > +
> > + /* Paranoid check */
> > + BUG_ON(this_cpu != smp_processor_id());
> > +
> > + rq = cpu_rq(this_cpu);
> > +
> > + /*
> > + * If there's nothing to push here, then see if another queue
> > + * can push instead.
> > + */
> > + if (!has_pushable_tasks(rq))
> > + goto pass_the_ipi;
> > +
> > + raw_spin_lock(&rq->lock);
> > + success = push_rt_task(rq);
> > + raw_spin_unlock(&rq->lock);
> > +
> > + if (success)
> > + goto done;
> > +
> > + /* Nothing was pushed, try another queue */
> > +pass_the_ipi:
> > +
> > + /*
> > + * We use the priority that determined to send to this CPU
> > + * even if the priority for this CPU changed. This is used
> > + * to determine what other CPUs to send to, to keep from
> > + * doing a ping pong from each CPU.
> > + */
> > + this_prio = rt_rq->push_csd_prio;
> > + src_prio = rt_rq->highest_prio.curr;
>
> Should we, at this point, not check if the condition that triggered the
> pull request is still valid on our src cpu? No point in continuing our
> IPI chain if the CPU we're looking for work for is no longer interested
> in it.
But how do we know that?
I added logic here to do exactly that, but then realized I need to keep
track of too much information. The pull happens when the rq schedules a
task of lower priority. That new task can still be an RT task. To know
we do not need to do the pull, we need to keep track of what the
original priority was.
Hmm, but that said, we can add an optimization here and do this:
if (src_prio <= this_prio)
goto done;
As "this_prio" is what we saw that we could pull. Thus, if the rq that
is asking to do the pull schedules a task that is equal or higher in
priority than the highest rq it sent a pull request for, then we do not
need to continue this IPI.
I'll add that.
>
> > + for_each_cpu(cpu, rq->rd->rto_mask) {
> > + if (this_cpu == cpu)
> > + continue;
> > +
> > + /*
> > + * This function was called because some rq lowered its
> > + * priority. It then searched for the highest priority
> > + * rq that had overloaded tasks and sent an smp function
> > + * call to that cpu to call this function to push its
> > + * tasks. But when it got here, the task was either
> > + * already pushed, or due to affinity, could not move
> > + * the overloaded task.
> > + *
> > + * Now we need to see if there's another overloaded rq that
> > + * has an RT task that can migrate to that CPU.
> > + *
> > + * We need to be careful, we do not want to cause a ping
> > + * pong between this CPU and another CPU that has an RT task
> > + * that can migrate, but not to the CPU that lowered its
> > + * priority. Since the lowering priority CPU finds the highest
> > + * priority rq to send to, we will ignore any rq that is of higher
> > + * priority than this current one.
>
> Maybe start a new paragraph here?
OK.
>
> That is, if a rq scheduled a
> > + * task of higher priority, the schedule itself would do the
> > + * push or pull then. We can safely ignore higher priority rqs.
> > + * And if there's one that is the same priority, since the CPUS
> > + * are searched in order we will ignore CPUS of the same priority
> > + * unless the CPU number is greater than this CPU's number.
> > + */
>
> I would s/CPUS/CPUs/ the plural is not part of the acronym is it?
Heh, I usually do, but then I noticed that I did it consistently (I was
probably tired, and I use capslock to write acronyms), so I just
left it.
>
>
> > + next_rq = cpu_rq(cpu);
> > +
> > + /* Use a single read for the next prio for decision making */
> > + prio = READ_ONCE(next_rq->rt.highest_prio.next);
> > +
> > + /* Looking for highest priority */
> > + if (prio >= next_prio)
> > + continue;
> > +
> > + /* Make sure that the rq can push to the source rq */
> > + if (prio >= src_prio)
> > + continue;
> > +
> > + /* If the prio is higher than the current prio, ignore it */
> > + if (prio < this_prio)
> > + continue;
> > +
> > + /*
> > + * If the prio is equal to the current prio, only use it
> > + * if the cpu number is greater than the current cpu.
> > + * This prevents a ping pong effect.
> > + */
> > + if (prio == this_prio && cpu < this_cpu)
> > + continue;
> > +
> > + next_prio = prio;
> > + next_cpu = cpu;
> > + }
> > +
> > + /* Nothing found, do nothing */
> > + if (next_cpu < 0)
> > + goto done;
> > +
> > + /*
> > + * Now we can not send another smp async function due to locking,
> > + * use irq_work instead.
> > + */
> > +
> > + rt_rq->push_csd_cpu = next_cpu;
> > + rt_rq->push_csd_prio = next_prio;
> > +
> > + /* Make sure the next cpu is seen on remote CPU */
> > + smp_mb();
>
> Not required,
Because irq_work_queue_on() is a full barrier, right?
>
> > + irq_work_queue_on(&rt_rq->push_csd_work, next_cpu);
>
> And like said before; pick one, stick with it.
Yep, it will have to be irq_work_queue_on(), as the other is not
possible to use this way (yet).
>
> > + return;
> > +
> > +done:
> > + rt_rq->push_csd_pending = 0;
> > +
> > + /* Now make sure the src CPU can see this update */
> > + smp_wmb();
>
>
> This barrier does not make sense either, for a wmb to make sense you
> need two stores:
>
> x := 0, y := 0
>
> [S] x = 1
> WMB
> [S] y = 1
>
> And one would typically pair that with some loads like:
>
> [L] r1 = y
> RMB
> [L] r2 = x
>
> Which gives you the guarantee that if r1 is true, l2 must then also be
> true.
>
> How in this case, there is no subsequent store.. therefore there is no
> order and therefore there is no use of barriers.
Actually, this is a speed up, but if exiting an irq handler is also a
full barrier, then it is not needed. Here's what I have:
rmb();
if (!pending)
send ipi;
The above is to make sure we see the pending bit before making the
decision. It would suck if it was already cleared, but because we never
flushed the cache, it always sees it as pending.
I still like that rmb() there, as there's nothing that guarantees that
we have that flushed coming in. Hmm, perhaps the rq lock taken when
entering into the schedule could be considered a serialization point.
>
> > +}
> > +
> > +static void push_irq_work_func(struct irq_work *work)
> > +{
> > + struct rt_rq *rt_rq = container_of(work, struct rt_rq, push_csd_work);
> > +
> > + try_to_push_tasks(rt_rq);
> > +}
> > +
> > static int pull_rt_task(struct rq *this_rq)
> > {
> > int this_cpu = this_rq->cpu, ret = 0, cpu;
> > struct task_struct *p;
> > struct rq *src_rq;
> > + bool push_pending = false;
> > + bool use_ipi;
> > + int next_cpu = -1;
> > + int next_prio = MAX_PRIO + 1;
> > + int src_prio;
> >
> > if (likely(!rt_overloaded(this_rq)))
> > return 0;
> > @@ -1775,6 +1946,15 @@ static int pull_rt_task(struct rq *this_
> > */
> > smp_rmb();
> >
> > + /* Use local just in case a feature is switched in the middle of this */
> > + if ((use_ipi = sched_feat(RT_PUSH_IPI))) {
> > + /* Make sure the update to pending is visible here. */
> > + smp_rmb();
>
> Another dubious barriers; the sched_feat() 'LOAD' cannot matter,
> therefore this barries doesn't add any order over the previous rmb.
I added it before noticing that there was an rmb() above :-)
>
> And again, you need two loads for an RMB to make sense, and a WMB (or
> MB) on the other side with some STORE.
>
> I think this refers to the store of push_csd_pending at the tail of
> try_to_push_task(), but as there's no order there, there cannot be any
> here either.
Right, but it's more of making sure that the state of the system is
correct at that point than a correctness thing.
>
> > +
> > + /* If a push ipi is out, then we must do the old method */
> > + push_pending = READ_ONCE(this_rq->rt.push_csd_pending);
> > + }
>
> Hmm, this deserves a wee bit more comment I think.
>
> Ideally you'd be able to 'cancel' the active IPI and re-issue it.
Right, but 'canceling' is really racy. The thing is, there are no locks
out there to serialize this. And due to the nature of this code, we
don't want any locks either.
Since I was using smp_call_function I also could not start one until
the previous one was finished, as it uses the same data structures to
do the send.
Even if I switch this to irq_work, I still have no good way to
cancel it without introducing a race. How do I cancel it and know that
it is canceled before starting a new ipi? I'm not sure I can use the
same irq_work struct for two calls of irq_work_queue_on().
I tried several ways but always found a new mole to whack. I decided to
punt and just fall back to the old "safe" way if this case happens. It
seldom does even on stress tests.
If you can think of a safe way to do this, please enlighten me :-)
>
> > +
> > for_each_cpu(cpu, this_rq->rd->rto_mask) {
> > if (this_cpu == cpu)
> > continue;
> > @@ -1793,6 +1973,26 @@ static int pull_rt_task(struct rq *this_
> > continue;
> >
> > /*
> > + * When the RT_PUSH_IPI sched feature is enabled, instead
> > + * of trying to grab the rq lock of the RT overloaded CPU
> > + * send an IPI to that CPU instead. This prevents heavy
> > + * contention from several CPUs lowering its priority
> > + * and all trying to grab the rq lock of that overloaded CPU.
> > + */
> > + if (use_ipi && !push_pending) {
> > + /*
> > + * We need to make sure the prio we use for comparisons
> > + * is the same throughout.
> > + */
> > + src_prio = READ_ONCE(src_rq->rt.highest_prio.next);
> > + if (src_prio < next_prio) {
> > + next_prio = src_prio;
> > + next_cpu = cpu;
> > + }
> > + continue;
> > + }
> > +
> > + /*
> > * We can potentially drop this_rq's lock in
> > * double_lock_balance, and another CPU could
> > * alter this_rq
> > @@ -1839,6 +2039,8 @@ static int pull_rt_task(struct rq *this_
> > skip:
> > double_unlock_balance(this_rq, src_rq);
> > }
> > + if (use_ipi && !push_pending && next_cpu >= 0)
> > + tell_cpu_to_push(next_cpu, this_rq, next_prio);
> >
> > return ret;
> > }
>
> Hohumm,.. so there's still a problem here, and you created it by looking
> for the highest prio overloaded cpu.
>
> This means that if multiple CPUs need to go pull, they'll end up sending IPIs
> to the same highest overloaded CPU.
Sure, and if you like, I'll post timings of the cost of this.
I can add tests to try to trigger the worst case scenario, and we can
see what this actually is. So far, it is much better than the current
everyone grabbing that lock, as that blocks everything, and has a
really nasty cascading effect.
>
> It also causes you to do this for_each_cpu(rto_mask) walk in the IPI
> again (and again) trying to look for the next highest overloadead CPU.
Of course this requires all CPUs to have more than one RT task queued
on them.
>
> We should really try and avoid for_each_cpu() loops in interrupts (or
> with interrupts disabled for that matter).
>
> So when last we talked about this problem; I suggested two alternatives:
>
> - pick the next rto cpu after the current and loop around until you're
> past the original cpu again.
OK, I realized that my patch did change something. It sends to the
highest prio CPU first, and stops when it gets a task. This is
different from what the original code does (and IMO better): the
original code pulls all the tasks that it can pull, thus adding a bunch
of RT tasks to this queue. At least this keeps the number of RT overloaded
bits in the rto_mask down.
But, I can keep this nastiness and still do the same. Instead of
stopping after a push, keep sending the IPI around and let all RT
overloaded CPUs try to push their tasks.
>
> - do that 'randomized' walk over rto and again, terminate when you're
> back where you started.
Not sure what you mean by a randomized walk?
I wonder if we can add another cpufind that deals not with the highest
running RT task on a CPU, but with the next highest RT task, where it
quickly returns the CPU with the next highest RT task to run.
-- Steve
* Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2015-02-25 15:51 ` Steven Rostedt
@ 2015-02-25 17:11 ` Peter Zijlstra
2015-02-25 17:41 ` Peter Zijlstra
2015-02-25 17:50 ` Steven Rostedt
0 siblings, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2015-02-25 17:11 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
linux-rt-users, Mike Galbraith, Paul E. McKenney,
Jörn Engel
On Wed, Feb 25, 2015 at 10:51:16AM -0500, Steven Rostedt wrote:
> > > +static void try_to_push_tasks(void *arg)
> > > +{
> > > + struct rt_rq *rt_rq = arg;
> > > + struct rq *rq, *next_rq;
> > > + int next_cpu = -1;
> > > + int next_prio = MAX_PRIO + 1;
> > > + int this_prio;
> > > + int src_prio;
> > > + int prio;
> > > + int this_cpu;
> > > + int success;
> > > + int cpu;
> > > +
> > > + /* Make sure we can see csd_cpu */
> > > + smp_rmb();
> >
> > As per the above, interrupt are serializing, this is not needed.
>
> Because entering an interrupt is a serializing point with other CPUs?
https://lkml.org/lkml/2009/2/18/145
So raising an IPI vs receiving one should be serialized. Also, both APIs
(irq_work_queue_on() and smp_call_function_single()) should guarantee
this even if we didn't require the architecture to provide this.
> > > + /*
> > > + * We use the priority that determined to send to this CPU
> > > + * even if the priority for this CPU changed. This is used
> > > + * to determine what other CPUs to send to, to keep from
> > > + * doing a ping pong from each CPU.
> > > + */
> > > + this_prio = rt_rq->push_csd_prio;
> > > + src_prio = rt_rq->highest_prio.curr;
> >
> > Should we, at this point, not check if the condition that triggered the
> > pull request is still valid on our src cpu? No point in continuing our
> > IPI chain if the CPU we're looking for work for is no longer interested
> > in it.
>
> But how do we know that?
Dunno, I didn't say I really thought about it ;-)
> I added logic here to do exactly that, but then realized I need to keep
> track of too much information. The pull happens when the rq schedules a
> task of lower priority. That new task can still be an RT task. To know
> we do not need to do the pull, we need to keep track of what the
> original priority was.
>
> Hmm, but that said, we can add an optimization here and do this:
>
> if (src_prio <= this_prio)
> goto done;
>
> As "this_prio" is what we saw that we could pull. Thus, if the rq that
> is asking to do the pull schedules a task that is equal or higher in
> priority than the highest rq it sent a pull request for, then we do not
> need to continue this IPI.
>
> I'll add that.
Yeah something like that might work. You could also put in a stop flag,
which the IPI handler would check before propagating.
> > > + /* Make sure the next cpu is seen on remote CPU */
> > > + smp_mb();
> >
> > Not required,
>
> Because irq_work_queue_on() is a full barrier, right?
Let's say yes :-)
> > > +done:
> > > + rt_rq->push_csd_pending = 0;
> > > +
> > > + /* Now make sure the src CPU can see this update */
> > > + smp_wmb();
> >
> >
> > This barrier does not make sense either, for a wmb to make sense you
> > need two stores:
> >
> > x := 0, y := 0
> >
> > [S] x = 1
> > WMB
> > [S] y = 1
> >
> > And one would typically pair that with some loads like:
> >
> > [L] r1 = y
> > RMB
> > [L] r2 = x
> >
> > Which gives you the guarantee that if r1 is true, l2 must then also be
> > true.
> >
> > How in this case, there is no subsequent store.. therefore there is no
> > order and therefore there is no use of barriers.
>
> Actually, this is a speed up, but if exiting a irq handler is also a
> full barrier, then it is not needed.
Barriers are _NEVER_ about speedups, they're either required or not.
> Here's what I have:
>
> rmb();
> if (!pending)
> send ipi;
>
>
> The above is to make sure we see the pending bit before making the
> decision. It would suck if it was already cleared, but because we never
> flushed the cache, that it always sees it as pending.
barriers are not about flushing store buffers (although some
implementations might do so, we cannot rely on that in general).
The above rmb is entirely without function; as stated you need at least
two loads for an rmb to make sense.
> > > @@ -1775,6 +1946,15 @@ static int pull_rt_task(struct rq *this_
> > > */
> > > smp_rmb();
> > >
> > > + /* Use local just in case a feature is switched in the middle of this */
> > > + if ((use_ipi = sched_feat(RT_PUSH_IPI))) {
> > > + /* Make sure the update to pending is visible here. */
> > > + smp_rmb();
> >
> > Another dubious barriers; the sched_feat() 'LOAD' cannot matter,
> > therefore this barries doesn't add any order over the previous rmb.
>
> I added it before noticing that there was an rmb() above :-)
>
> >
> > And again, you need two loads for an RMB to make sense, and a WMB (or
> > MB) on the other side with some STORE.
> >
> > I think this refers to the store of push_csd_pending at the tail of
> > try_to_push_task(), but as there's no order there, there cannot be any
> > here either.
>
> Right, but it's more of making sure that the state of the system is
> correct at that point than a correctness thing.
-ENOPARSE, there appear to be two different definitions of correct going
about in the above sentence.
Nowhere outside of arch code (and rarely there) should we use barriers for
anything other than ordering guarantees. And there is no 'order' in a
single variable.
> > > +
> > > + /* If a push ipi is out, then we must do the old method */
> > > + push_pending = READ_ONCE(this_rq->rt.push_csd_pending);
> > > + }
> >
> > Hmm, this deserves a wee bit more comment I think.
> >
> > Ideally you'd be able to 'cancel' the active IPI and re-issue it.
>
> Right, but 'canceling' is really racy. The thing is, there's no locks
> out there to serialize this. And due to the nature of this code, we
> don't want any locks either.
>
> Since I was using smp_call_function I also could not start one until
> the previous one was finished, as it uses the same data structures to
> do the send.
>
> Even if I switch this to irq_work, I still have no good way to
> cancel it without introducing a race. How do I cancel it and know that
> it is canceled before starting a new ipi? I'm not sure I can use the
> same irq_work struct for two calls of queue_work_on().
>
> I tried several ways but always found a new mole to whack. I decided to
> punt and just fall back to the old "safe" way if this case happens. It
> seldom does even on stress tests.
>
> If you can think of a safe way to do this, please enlighten me :-)
You could put a lock around your IPI state; that is:
#define IPS_BUSY 0x01
#define IPS_LOOPED 0x02
struct ipi_pull_struct {
struct irq_work work;
raw_spinlock_t lock;
int flags;
int dst_cpu; /* CPU that issued this search */
int dst_prio; /* Prio of the originating CPU */
int src_cpu; /* CPU where the IPI is running */
int stop_cpu; /* cpu where we can stop iterating */
...
};
Where you can increase seq every time you change it and the IPI will
continue at least one more rto mask round for every change?
> > Hohumm,.. so there's still a problem here, and you created it by looking
> > for the highest prio overloaded cpu.
> >
> > This means that if multiple CPUs need to go pull, they'll end up sending IPIs
> > to the same highest overloaded CPU.
>
> Sure, and if you like, I'll post timings of the cost of this.
>
> I can add tests to try to trigger the worse case scenario, and we can
> see what this actually is. So far, it is much better than the current
> everyone grabbing that lock, as that blocks everything, and has a
> really nasty cascading effect.
But what is wrong with maintaining the current behaviour -- or something
similar -- where you just try the 'first' one?
The current code of pull_rt_task() does not do the n^2 loop trying to
pull from the highest prio cpu first and then the second, third.. etc
highest cpu.
It just walks the rto_mask and pulls from whatever it finds first.
You're changing that for no real reason afaict.
> > We should really try and avoid for_each_cpu() loops in interrupts (or
> > with interrupts disabled for that matter).
> >
> > So when last we talked about this problem; I suggested two alternatives:
> >
> > - pick the next rto cpu after the current and loop around until you're
> > past the original cpu again.
>
> OK, I realized that my patch did change something. It sends to the
> highest prio CPU first, and stops when it gets a task. This is
> different than what the original code does (and IMO better), where the
> original code pulls all tasks that it can pull, thus adding a bunch of
> RT tasks on this queue. At least this keeps the number of RT overloaded
> bits in the rto_mask down.
>
> But, I can keep this nastiness and still do the same. Instead of
> stopping after a push, keep sending the IPI around and let all RT
> overloaded CPUs try to push their tasks.
Well, the problem with it is one of collisions. So the 'easy' solution I
proposed would be something like:
int ips_next(struct ipi_pull_struct *ips)
{
int cpu = ips->src_cpu;
cpu = cpumask_next(cpu, rto_mask);
if (cpu >= nr_cpu_ids) {
cpu = 0;
ips->flags |= IPS_LOOPED;
cpu = cpumask_next(cpu, rto_mask);
if (cpu >= nr_cpu_ids) /* empty mask */
return cpu;
}
if (ips->flags & IPS_LOOPED && cpu >= ips->stop_cpu)
return nr_cpu_ids;
return cpu;
}
struct ipi_pull_struct *ips = __this_cpu_ptr(ips);
raw_spin_lock(&ips->lock);
if (ips->flags & IPS_BUSY) {
/* there is an IPI active; update state */
ips->dst_prio = current->prio;
ips->stop_cpu = ips->src_cpu;
ips->flags &= ~IPS_LOOPED;
} else {
/* no IPI active, make one go */
ips->dst_cpu = smp_processor_id();
ips->dst_prio = current->prio;
ips->src_cpu = ips->dst_cpu;
ips->stop_cpu = ips->dst_cpu;
ips->flags = IPS_BUSY;
cpu = ips_next(ips);
ips->src_cpu = cpu;
if (cpu < nr_cpu_ids)
irq_work_queue_on(&ips->work, cpu);
}
raw_spin_unlock(&ips->lock);
Where you would simply start walking the RTO mask from the current
position -- it also includes some restart logic, and you'd only take
ips->lock when your ipi handler starts and when it needs to migrate to
another cpu.
This way, on big systems, there's at least some chance different CPUs
find different targets to pull from.
> > - do that 'randomized' walk over rto and again, terminate when you're
> > back where you started.
>
>
> Not sure what you mean by a randomized walk?
lkml.kernel.org/r/20150206135945.GM24151@twins.programming.kicks-ass.net
Where instead of a linear rto walk it does a random walk over the
bitmap -- bit over the top maybe ;-)
* Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2015-02-25 17:11 ` Peter Zijlstra
@ 2015-02-25 17:41 ` Peter Zijlstra
2015-02-25 17:50 ` Steven Rostedt
1 sibling, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2015-02-25 17:41 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
linux-rt-users, Mike Galbraith, Paul E. McKenney,
Jörn Engel
On Wed, Feb 25, 2015 at 06:11:10PM +0100, Peter Zijlstra wrote:
>
> #define IPS_BUSY 0x01
> #define IPS_LOOPED 0x02
>
> struct ipi_pull_struct {
> struct irq_work work;
> raw_spinlock_t lock;
> int flags;
> int dst_cpu; /* CPU that issued this search */
> int dst_prio; /* Prio of the originating CPU */
> int src_cpu; /* CPU where the IPI is running */
> int stop_cpu; /* cpu where we can stop iterating */
> ...
> };
>
> Where you can increase seq every time you change it and the IPI will
> continue at least one more rto mask round for every change?
seq got lost.. ignore that bit, look at the code.
> int ips_next(struct ipi_pull_struct *ips)
> {
> int cpu = ips->src_cpu;
> cpu = cpumask_next(cpu, rto_mask);
> if (cpu >= nr_cpu_ids) {
> cpu = 0;
should be -1
> ips->flags |= IPS_LOOPED;
> cpu = cpumask_next(cpu, rto_mask);
> if (cpu >= nr_cpu_ids) /* empty mask *;
> return cpu;
> }
> if (ips->flags & IPS_LOOPED && cpu >= ips->stop_cpu)
> return nr_cpu_ids;
> return cpu;
> }
>
>
>
> struct ipi_pull_struct *ips = __this_cpu_ptr(ips);
>
> raw_spin_lock(&ips->lock);
> if (ips->flags & IPS_BUSY) {
> /* there is an IPI active; update state */
> ips->dst_prio = current->prio;
> ips->stop_cpu = ips->src_cpu;
> ips->flags &= ~IPS_LOOPED;
> } else {
> /* no IPI active, make one go */
> ips->dst_cpu = smp_processor_id();
> ips->dst_prio = current->prio;
> ips->src_cpu = ips->dst_cpu;
> ips->stop_cpu = ips->dst_cpu;
> ips->flags = IPS_BUSY;
>
> cpu = ips_next(ips);
> ips->src_cpu = cpu;
> if (cpu < nr_cpu_ids)
> irq_work_queue_on(&ips->work, cpu);
> }
> raw_spin_unlock(&ips->lock);
There might be a wee race vs the busy looping, seeing how src_cpu might
be checking against the old dst_prio, so we could check one rto mask
entry too few.
No real problem I think.
* Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2015-02-25 17:11 ` Peter Zijlstra
2015-02-25 17:41 ` Peter Zijlstra
@ 2015-02-25 17:50 ` Steven Rostedt
2015-02-26 7:45 ` Peter Zijlstra
2015-02-26 7:49 ` Peter Zijlstra
1 sibling, 2 replies; 15+ messages in thread
From: Steven Rostedt @ 2015-02-25 17:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
linux-rt-users, Mike Galbraith, Paul E. McKenney,
Jörn Engel
On Wed, 25 Feb 2015 18:11:10 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
>
> > Actually, this is a speed up, but if exiting a irq handler is also a
> > full barrier, then it is not needed.
>
> Barrieres are _NEVER_ about speedups, they're either required or not.
"speed up" was a wrong word. "correct state" is what I should have said.
I meant "speed up" the getting of the "correct state" :-)
>
> > Here's what I have:
> >
> > rmb();
> > if (!pending)
> > send ipi;
> >
> >
> > The above is to make sure we see the pending bit before making the
> > decision. It would suck if it was already cleared, but because we never
> > flushed the cache, that it always sees it as pending.
>
> barriers are not about flushing store buffers (although some
> implementations might do so we cannot rely on that in general).
>
> The above rmb is entirely without function; as stated you need at least
> two loads for an rmb to make sense.
It can't be used for state?
If one CPU writes "zero", and the other CPU wants to decide if the
system is in the state to do something, isn't an rmb() fine to use?
CPU 1:
x = 0;
/* Tell other CPUs they can now do something */
smp_wmb();
CPU 2:
/* Make sure we see current state of x */
smp_rmb();
if (x == 0)
do_something();
The above situation is not acceptable?
Otherwise, we fail to be able to do_something() when it is perfectly
fine to do so.
>
> > > > @@ -1775,6 +1946,15 @@ static int pull_rt_task(struct rq *this_
> > > > */
> > > > smp_rmb();
> > > >
> > > > + /* Use local just in case a feature is switched in the middle of this */
> > > > + if ((use_ipi = sched_feat(RT_PUSH_IPI))) {
> > > > + /* Make sure the update to pending is visible here. */
> > > > + smp_rmb();
> > >
> > > Another dubious barrier; the sched_feat() 'LOAD' cannot matter,
> > > therefore this barrier doesn't add any order over the previous rmb.
> >
> > I added it before noticing that there was an rmb() above :-)
> >
> > >
> > > And again, you need two loads for an RMB to make sense, and a WMB (or
> > > MB) on the other side with some STORE.
> > >
> > > I think this refers to the store of push_csd_pending at the tail of
> > > try_to_push_task(), but as there's no order there, there cannot be any
> > > here either.
> >
> > Right, but it's more of making sure that the state of the system is
> > correct at that point than a correctness thing.
>
> -ENOPARSE, there appear to be two different definitions of correct going
> about in the above sentence.
>
> Nowhere outside of arch code (and rarely even there) should we use
> barriers for anything other than ordering guarantees. And there is no
> 'order' in a single variable.
Can they not be used for passing of state as mentioned above?
>
> > > > +
> > > > + /* If a push ipi is out, then we must do the old method */
> > > > + push_pending = READ_ONCE(this_rq->rt.push_csd_pending);
> > > > + }
> > >
> > > Hmm, this deserves a wee bit more comment I think.
> > >
> > > Ideally you'd be able to 'cancel' the active IPI and re-issue it.
> >
> > Right, but 'canceling' is really racy. The thing is, there are no locks
> > out there to serialize this. And due to the nature of this code, we
> > don't want any locks either.
> >
> > Since I was using smp_call_function I also could not start one until
> > the previous one was finished, as it uses the same data structures to
> > do the send.
> >
> > Even if I switch this to irq_work, I still have no good way to
> > cancel it without introducing a race. How do I cancel it and know that
> > it is canceled before starting a new ipi? I'm not sure I can use the
> > same irq_work struct for two calls of irq_work_queue_on().
> >
> > I tried several ways but always found a new mole to whack. I decided to
> > punt and just fall back to the old "safe" way if this case happens. It
> > seldom does even on stress tests.
> >
> > If you can think of a safe way to do this, please enlighten me :-)
>
> You could put a lock around your IPI state; that is:
>
> #define IPS_BUSY 0x01
> #define IPS_LOOPED 0x02
>
> struct ipi_pull_struct {
> struct irq_work work;
> raw_spinlock_t lock;
> int flags;
> int dst_cpu; /* CPU that issued this search */
> int dst_prio; /* Prio of the originating CPU */
> int src_cpu; /* CPU where the IPI is running */
> int stop_cpu; /* cpu where we can stop iterating */
> ...
> };
>
> Where you can increase seq every time you change it and the IPI will
> continue at least one more rto mask round for every change?
>
> > > Hohumm,.. so there's still a problem here, and you created it by looking
> > > for the highest prio overloaded cpu.
> > >
> > > This means that if multiple CPUs need to go pull, they'll end up sending IPIs
> > > to the same highest overloaded CPU.
> >
> > Sure, and if you like, I'll post timings of the cost of this.
> >
> > I can add tests to try to trigger the worst case scenario, and we can
> > see what this actually is. So far, it is much better than the current
> > everyone grabbing that lock, as that blocks everything, and has a
> > really nasty cascading effect.
>
> But what is wrong with maintaining the current behaviour -- or something
> similar -- where you just try the 'first' one?
Because that first one will be what all CPUs try, and we cause the lock
contention again.
>
> The current code of pull_rt_task() does not do the n^2 loop trying to
> pull from the highest prio cpu first and then the second, third.. etc
> highest cpu.
>
> It just walks the rto_mask and pulls from whatever it finds first.
Actually, it pulls everything it can. It doesn't stop at the first one
it finds as there may be a higher priority task to pull.
Now, I also notice a problem with the current approach. It pulls whatever
can preempt the current process, but it doesn't look like it updates its
comparison point as it pulls. That is, it can pull the highest prio task,
and then pull a lower prio task as long as that task is higher than what
is currently running. In other words, it only checks
this_rq->rt.highest_prio.curr, which may not be the highest priority task
queued; it also needs to check this_rq->rt.highest_prio.next.
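
A hedged sketch of the check being described, using an illustrative helper
(worth_pulling() is not in the patch; lower prio value means higher
priority; the last two parameters stand in for this_rq->rt.highest_prio.curr
and this_rq->rt.highest_prio.next):

	static inline bool worth_pulling(int p_prio, int curr_prio, int next_prio)
	{
		/* existing check: the candidate must beat what is currently running */
		if (p_prio >= curr_prio)
			return false;
		/* suggested extra check: it must also beat the best task already queued */
		if (p_prio >= next_prio)
			return false;
		return true;
	}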
>
> You're changing that for no real reason afaict.
It was an optimization. Now, I did say we can maintain this behavior
with the IPIs as well.
>
> > > We should really try and avoid for_each_cpu() loops in interrupts (or
> > > with interrupts disabled for that matter).
> > >
> > > So when last we talked about this problem; I suggested two alternatives:
> > >
> > > - pick the next rto cpu after the current and loop around until you're
> > > past the original cpu again.
> >
> > OK, I realized that my patch did change something. It sends to the
> > highest prio CPU first, and stops when it gets a task. This is
> > different than what the original code does (and IMO better), where the
> > original code pulls all tasks that it can pull, thus adding a bunch of
> > RT tasks on this queue. At least this keeps the number of RT overloaded
> > bits in the rto_mask down.
> >
> > But, I can keep this nastiness and still do the same. Instead of
> > stopping after a push, keep sending the IPI around and let all RT
> > overloaded CPUs try to push their tasks.
>
> Well, the problem with it is one of collisions. So the 'easy' solution I
> proposed would be something like:
>
> int ips_next(struct ipi_pull_struct *ips)
> {
> int cpu = ips->src_cpu;
> cpu = cpumask_next(cpu, rto_mask);
> if (cpu >= nr_cpu_ids) {
Do we really need to loop? Just start with the first one, and go to the
end.
> cpu = 0;
> ips->flags |= IPS_LOOPED;
> cpu = cpumask_next(cpu, rto_mask);
> if (cpu >= nr_cpu_ids) /* empty mask */
> return cpu;
> }
> if (ips->flags & IPS_LOOPED && cpu >= ips->stop_cpu)
> return nr_cpu_ids;
> return cpu;
> }
>
>
>
> struct ipi_pull_struct *ips = __this_cpu_ptr(ips);
>
> raw_spin_lock(&ips->lock);
> if (ips->flags & IPS_BUSY) {
> /* there is an IPI active; update state */
> ips->dst_prio = current->prio;
> ips->stop_cpu = ips->src_cpu;
> ips->flags &= ~IPS_LOOPED;
I guess the loop is needed for continuing the work, in case the
scheduling changed?
> } else {
> /* no IPI active, make one go */
> ips->dst_cpu = smp_processor_id();
> ips->dst_prio = current->prio;
> ips->src_cpu = ips->dst_cpu;
> ips->stop_cpu = ips->dst_cpu;
> ips->flags = IPS_BUSY;
>
> cpu = ips_next(ips);
> ips->src_cpu = cpu;
> if (cpu < nr_cpu_ids)
> irq_work_queue_on(&ips->work, cpu);
> }
> raw_spin_unlock(&ips->lock);
I'll have to spend some time comprehending this.
>
>
> Where you would simply start walking the RTO mask from the current
> position -- it also includes some restart logic, and you'd only take
> ips->lock when your ipi handler starts and when it needs to migrate to
> another cpu.
>
> This way, on big systems, there's at least some chance different CPUs
> find different targets to pull from.
OK, makes sense. I can try that.
>
> > > - do that 'randomized' walk over rto and again, terminate when you're
> > > back where you started.
> >
> >
> > Not sure what you mean by a randomized walk?
>
> lkml.kernel.org/r/20150206135945.GM24151@twins.programming.kicks-ass.net
>
> Where instead of a linear rto walk it does a random walk over the
> bitmap -- bit over the top maybe ;-)
Yeah, I think just starting from smp_processor_id()+1 is sufficient.
-- Steve
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2015-02-25 17:50 ` Steven Rostedt
@ 2015-02-26 7:45 ` Peter Zijlstra
2015-02-26 12:43 ` Steven Rostedt
2015-02-26 7:49 ` Peter Zijlstra
1 sibling, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2015-02-26 7:45 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
linux-rt-users, Mike Galbraith, Paul E. McKenney,
Jörn Engel, Oleg Nesterov
On Wed, Feb 25, 2015 at 12:50:15PM -0500, Steven Rostedt wrote:
> It can't be used for state?
>
> If one CPU writes "zero", and the other CPU wants to decide if the
> system is in the state to do something, isn't a rmb() fine to use?
>
>
> CPU 1:
>
> x = 0;
> /* Tell other CPUs they can now do something */
> smp_wmb();
>
> CPU 2:
> /* Make sure we see current state of x */
> smp_rmb();
> if (x == 0)
> do_something();
>
> The above situation is not acceptable?
Acceptable is just not the word. It plain doesn't work that way.
> Otherwise, we fail to be able to do_something() when it is perfectly
> fine to do so.
Can't be helped.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2015-02-26 7:45 ` Peter Zijlstra
@ 2015-02-26 12:43 ` Steven Rostedt
2015-02-26 13:47 ` Peter Zijlstra
0 siblings, 1 reply; 15+ messages in thread
From: Steven Rostedt @ 2015-02-26 12:43 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
linux-rt-users, Mike Galbraith, Paul E. McKenney,
Jörn Engel, Oleg Nesterov
On Thu, 26 Feb 2015 08:45:59 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Feb 25, 2015 at 12:50:15PM -0500, Steven Rostedt wrote:
> > It can't be used for state?
> >
> > If one CPU writes "zero", and the other CPU wants to decide if the
> > system is in the state to do something, isn't a rmb() fine to use?
> >
> >
> > CPU 1:
> >
> > x = 0;
> > /* Tell other CPUs they can now do something */
> > smp_wmb();
> >
> > CPU 2:
> > /* Make sure we see current state of x */
> > smp_rmb();
> > if (x == 0)
> > do_something();
> >
> > The above situation is not acceptable?
>
> Acceptable is just not the word. It plain doesn't work that way.
Thinking about this more, is it because a wmb just forces the CPU to
write everything before the barrier before it writes anything after it?
That is, the writes themselves can happen at a much later time. Does a
plain mb() work the same way if there are no reads required?
>
> > Otherwise, we fail to be able to do_something() when it is perfectly
> > fine to do so.
>
> Can't be helped.
What about using atomic_t?
Note, my latest code doesn't have any of this, but I just want to
understand the semantics of these operations a bit better.
-- Steve
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2015-02-26 12:43 ` Steven Rostedt
@ 2015-02-26 13:47 ` Peter Zijlstra
2015-02-26 14:00 ` Steven Rostedt
0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2015-02-26 13:47 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
linux-rt-users, Mike Galbraith, Paul E. McKenney,
Jörn Engel, Oleg Nesterov
On Thu, Feb 26, 2015 at 07:43:01AM -0500, Steven Rostedt wrote:
> On Thu, 26 Feb 2015 08:45:59 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Wed, Feb 25, 2015 at 12:50:15PM -0500, Steven Rostedt wrote:
> > > It can't be used for state?
> > >
> > > If one CPU writes "zero", and the other CPU wants to decide if the
> > > system is in the state to do something, isn't a rmb() fine to use?
> > >
> > >
> > > CPU 1:
> > >
> > > x = 0;
> > > /* Tell other CPUs they can now do something */
> > > smp_wmb();
> > >
> > > CPU 2:
> > > /* Make sure we see current state of x */
> > > smp_rmb();
> > > if (x == 0)
> > > do_something();
> > >
> > > The above situation is not acceptable?
> >
> > Acceptable is just not the word. It plain doesn't work that way.
>
> > Thinking about this more, is it because a wmb just forces the CPU to
> > write everything before the barrier before it writes anything after it?
> > That is, the writes themselves can happen at a much later time. Does a
> > plain mb() work the same way if there are no reads required?
No, neither smp_wmb nor smp_mb are required to flush the store buffers.
The only thing barriers do is guarantee order, this can be done by
flushing store buffers but it can also be done by making sure store
buffers flush writes in the 'right' order.
Nor does an rmb help anything with ordering against a possible store
buffer flush. Again rmb only guarantees two loads are issued in that
particular order, it doesn't disallow the CPU speculating the load at
all.
So that load of X could come out of thin air, or a year ago, or
whatever, definitely before any store buffers were, or were not, flushed
on another cpu.
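
As a reference point, the pattern an smp_wmb()/smp_rmb() pair does order is
the two-variable one; a minimal sketch with hypothetical variable names (the
only guarantee is that a reader that sees the flag also sees the data):

	/* Writer (CPU 1): publish the data, then set the flag. */
	WRITE_ONCE(data, 42);
	smp_wmb();			/* order the data store before the flag store */
	WRITE_ONCE(ready, 1);

	/* Reader (CPU 2): test the flag, then read the data. */
	if (READ_ONCE(ready)) {
		smp_rmb();		/* order the flag load before the data load */
		val = READ_ONCE(data);	/* must observe 42 if ready was seen set */
	}

Each barrier orders two accesses on its own CPU and pairs with the barrier
on the other side; with a single variable there is no second access for the
barrier to order, which is the point being made here.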
> > > Otherwise, we fail to be able to do_something() when it is perfectly
> > > fine to do so.
> >
> > Can't be helped.
>
> What about using atomic_t?
>
> Note, my latest code doesn't have any of this, but I just want to
> understand the semantics of these operations a bit better.
Nope, atomic_t doesn't help here either. Atomics only make sure the RmW
cycle is atomic.
Note that even if wmb or mb did flush the store buffer, you would still
have a race here.
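
For completeness, a hedged note on which atomics do carry ordering in this
era of the kernel (per Documentation/atomic_ops.txt); the names below are
illustrative, and none of this removes the race being discussed:

	static atomic_t push_pending = ATOMIC_INIT(0);

	/*
	 * Plain atomic_inc()/atomic_dec() make only the read-modify-write
	 * atomic and imply no ordering.  Value-returning atomics such as
	 * atomic_cmpxchg() imply a full memory barrier on each side of the
	 * operation, so a successful claim can also act as an ordering point.
	 */
	static bool try_claim_push(void)
	{
		return atomic_cmpxchg(&push_pending, 0, 1) == 0;
	}

	static void release_push(void)
	{
		atomic_set(&push_pending, 0);	/* atomic_set() implies no barrier */
	}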
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2015-02-26 13:47 ` Peter Zijlstra
@ 2015-02-26 14:00 ` Steven Rostedt
0 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2015-02-26 14:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
linux-rt-users, Mike Galbraith, Paul E. McKenney,
Jörn Engel, Oleg Nesterov
On Thu, 26 Feb 2015 14:47:54 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> > Thinking about this more, is it because a wmb just forces the CPU to
> > write everything before the barrier before it writes anything after it?
> > That is, the writes themselves can happen at a much later time. Does a
> > plain mb() work the same way if there are no reads required?
>
> No, neither smp_wmb nor smp_mb are required to flush the store buffers.
Heh, that's what I said :-) "That is, the writes themselves can happen
at a much later time."
>
> The only thing barriers do is guarantee order, this can be done by
> flushing store buffers but it can also be done by making sure store
> buffers flush writes in the 'right' order.
>
> Nor does an rmb help anything with ordering against a possible store
> buffer flush. Again rmb only guarantees two loads are issued in that
> particular order, it doesn't disallow the CPU speculating the load at
> all.
Yep understood.
> > What about using atomic_t?
> >
> > Note, my latest code doesn't have any of this, but I just want to
> > understand the semantics of these operations a bit better.
>
> Nope, atomic_t doesn't help here either. Atomics only make sure the RmW
> cycle is atomic.
Crummy. ;-)
>
> Note that even if wmb or mb did flush the store buffer, you would still
> have a race here.
Oh, it wasn't that I meant to remove the race. I was just trying to
make that race smaller.
But this is all academic now, as my last version doesn't do any of this.
-- Steve
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2015-02-25 17:50 ` Steven Rostedt
2015-02-26 7:45 ` Peter Zijlstra
@ 2015-02-26 7:49 ` Peter Zijlstra
2015-02-26 12:46 ` Steven Rostedt
1 sibling, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2015-02-26 7:49 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
linux-rt-users, Mike Galbraith, Paul E. McKenney,
Jörn Engel
On Wed, Feb 25, 2015 at 12:50:15PM -0500, Steven Rostedt wrote:
> > Well, the problem with it is one of collisions. So the 'easy' solution I
> > proposed would be something like:
> >
> > int ips_next(struct ipi_pull_struct *ips)
> > {
> > int cpu = ips->src_cpu;
> > cpu = cpumask_next(cpu, rto_mask);
> > if (cpu >= nr_cpu_ids) {
>
> Do we really need to loop? Just start with the first one, and go to the
> end.
>
> > cpu = 0;
> > ips->flags |= IPS_LOOPED;
> > cpu = cpumask_next(cpu, rto_mask);
> > if (cpu >= nr_cpu_ids) /* empty mask */
> > return cpu;
> > }
> > if (ips->flags & IPS_LOOPED && cpu >= ips->stop_cpu)
> > return nr_cpu_ids;
> > return cpu;
> > }
Yes, notice that we don't start iterating at the beginning; this is on
purpose. If we start iterating at the beginning, _every_ cpu will again
pile up on the first one.
By starting at the current cpu, each cpu will start iteration some place
else and hopefully, with a big enough system, different CPUs end up on a
different rto cpu.
> >
> >
> > struct ipi_pull_struct *ips = __this_cpu_ptr(ips);
> >
> > raw_spin_lock(&ips->lock);
> > if (ips->flags & IPS_BUSY) {
> > /* there is an IPI active; update state */
> > ips->dst_prio = current->prio;
> > ips->stop_cpu = ips->src_cpu;
> > ips->flags &= ~IPS_LOOPED;
>
> I guess the loop is needed for continuing the work, in case the
> scheduling changed?
That too.
> > } else {
> > /* no IPI active, make one go */
> > ips->dst_cpu = smp_processor_id();
> > ips->dst_prio = current->prio;
> > ips->src_cpu = ips->dst_cpu;
> > ips->stop_cpu = ips->dst_cpu;
> > ips->flags = IPS_BUSY;
> >
> > cpu = ips_next(ips);
> > ips->src_cpu = cpu;
> > if (cpu < nr_cpu_ids)
> > irq_work_queue_on(&ips->work, cpu);
> > }
> > raw_spin_unlock(&ips->lock);
>
> I'll have to spend some time comprehending this.
:-)
> > Where you would simply start walking the RTO mask from the current
> > position -- it also includes some restart logic, and you'd only take
> > ips->lock when your ipi handler starts and when it needs to migrate to
> > another cpu.
> >
> > This way, on big systems, there's at least some chance different CPUs
> > find different targets to pull from.
>
> OK, makes sense. I can try that.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2015-02-26 7:49 ` Peter Zijlstra
@ 2015-02-26 12:46 ` Steven Rostedt
0 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2015-02-26 12:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
linux-rt-users, Mike Galbraith, Paul E. McKenney,
Jörn Engel
On Thu, 26 Feb 2015 08:49:07 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> Yes, notice that we don't start iterating at the beginning; this is on
> purpose. If we start iterating at the beginning, _every_ cpu will again
> pile up on the first one.
>
> By starting at the current cpu, each cpu will start iteration some place
> else and hopefully, with a big enough system, different CPUs end up on a
> different rto cpu.
Note, v3 doesn't start at the current CPU; it starts at the original
CPU again. As the start of every IPI comes from a different CPU, this
still keeps a restart from clobbering the same CPU.
-- Steve
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling
@ 2015-02-25 7:56 Hillf Danton
0 siblings, 0 replies; 15+ messages in thread
From: Hillf Danton @ 2015-02-25 7:56 UTC (permalink / raw)
To: Steven Rostedt
Cc: Ingo Molnar, Peter Zijlstra, 'Thomas Gleixner',
'Clark Williams', 'Mike Galbraith',
linux-kernel
> +static void try_to_push_tasks(void *arg)
> +{
> + struct rt_rq *rt_rq = arg;
> + struct rq *rq, *next_rq;
> + int next_cpu = -1;
> + int next_prio = MAX_PRIO + 1;
> + int this_prio;
> + int src_prio;
> + int prio;
> + int this_cpu;
> + int success;
> + int cpu;
> +
> + /* Make sure we can see csd_cpu */
> + smp_rmb();
> +
> + this_cpu = rt_rq->push_csd_cpu;
> +
> + /* Paranoid check */
> + BUG_ON(this_cpu != smp_processor_id());
> +
> + rq = cpu_rq(this_cpu);
> +
> + /*
> + * If there's nothing to push here, then see if another queue
> + * can push instead.
> + */
> + if (!has_pushable_tasks(rq))
> + goto pass_the_ipi;
> +
> + raw_spin_lock(&rq->lock);
> + success = push_rt_task(rq);
> + raw_spin_unlock(&rq->lock);
> +
> + if (success)
> + goto done;
Does the latency (150us over a 20 hour run) go up if we goto done directly?
Hillf
> +
> + /* Nothing was pushed, try another queue */
> +pass_the_ipi:
> +
> + /*
> + * We use the priority that was used to decide to send to this CPU,
> + * even if this CPU's priority has changed since. This is used
> + * to determine what other CPUs to send to, to keep from
> + * ping ponging between CPUs.
> + */
> + this_prio = rt_rq->push_csd_prio;
> + src_prio = rt_rq->highest_prio.curr;
> +
> + for_each_cpu(cpu, rq->rd->rto_mask) {
> + if (this_cpu == cpu)
> + continue;
> +
> + /*
> + * This function was called because some rq lowered its
> + * priority. It then searched for the highest priority
> + * rq that had overloaded tasks and sent an smp function
> + * call to that cpu to call this function to push its
> + * tasks. But when it got here, the task was either
> + * already pushed, or due to affinity, could not move
> + * the overloaded task.
> + *
> + * Now we need to see if there's another overloaded rq that
> + * has an RT task that can migrate to that CPU.
> + *
> + * We need to be careful, we do not want to cause a ping
> + * pong between this CPU and another CPU that has an RT task
> + * that can migrate, but not to the CPU that lowered its
> + * priority. Since the lowering priority CPU finds the highest
> + * priority rq to send to, we will ignore any rq that is of higher
> + * priority than this current one. That is, if a rq scheduled a
> + * task of higher priority, the schedule itself would do the
> + * push or pull then. We can safely ignore higher priority rqs.
> + * And if there's one that is the same priority, since the CPUs
> + * are searched in order, we will ignore CPUs of the same priority
> + * unless the CPU number is greater than this CPU's number.
> + */
> + next_rq = cpu_rq(cpu);
> +
> + /* Use a single read for the next prio for decision making */
> + prio = READ_ONCE(next_rq->rt.highest_prio.next);
> +
> + /* Looking for highest priority */
> + if (prio >= next_prio)
> + continue;
> +
> + /* Make sure that the rq can push to the source rq */
> + if (prio >= src_prio)
> + continue;
> +
> + /* If the prio is higher than the current prio, ignore it */
> + if (prio < this_prio)
> + continue;
> +
> + /*
> + * If the prio is equal to the current prio, only use it
> + * if the cpu number is greater than the current cpu.
> + * This prevents a ping pong effect.
> + */
> + if (prio == this_prio && cpu < this_cpu)
> + continue;
> +
> + next_prio = prio;
> + next_cpu = cpu;
> + }
> +
> + /* Nothing found, do nothing */
> + if (next_cpu < 0)
> + goto done;
> +
> + /*
> + * Now we cannot send another smp async function due to locking,
> + * use irq_work instead.
> + */
> +
> + rt_rq->push_csd_cpu = next_cpu;
> + rt_rq->push_csd_prio = next_prio;
> +
> + /* Make sure the next cpu is seen on remote CPU */
> + smp_mb();
> +
> + irq_work_queue_on(&rt_rq->push_csd_work, next_cpu);
> +
> + return;
> +
> +done:
> + rt_rq->push_csd_pending = 0;
> +
> + /* Now make sure the src CPU can see this update */
> + smp_wmb();
> +}
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2015-02-26 14:01 UTC | newest]
Thread overview: 15+ messages
2015-02-24 18:39 [RFC][PATCH v2] sched/rt: Use IPI to trigger RT task push migration instead of pulling Steven Rostedt
2015-02-24 18:45 ` Steven Rostedt
2015-02-24 21:23 ` Steven Rostedt
2015-02-25 10:35 ` Peter Zijlstra
2015-02-25 15:51 ` Steven Rostedt
2015-02-25 17:11 ` Peter Zijlstra
2015-02-25 17:41 ` Peter Zijlstra
2015-02-25 17:50 ` Steven Rostedt
2015-02-26 7:45 ` Peter Zijlstra
2015-02-26 12:43 ` Steven Rostedt
2015-02-26 13:47 ` Peter Zijlstra
2015-02-26 14:00 ` Steven Rostedt
2015-02-26 7:49 ` Peter Zijlstra
2015-02-26 12:46 ` Steven Rostedt
2015-02-25 7:56 Hillf Danton