LKML Archive on lore.kernel.org
* [PATCH 0/4] pending scheduler updates
@ 2008-10-17 17:27 Peter Zijlstra
2008-10-17 17:27 ` [PATCH 1/4] sched: optimize group load balancer Peter Zijlstra
` (4 more replies)
0 siblings, 5 replies; 17+ messages in thread
From: Peter Zijlstra @ 2008-10-17 17:27 UTC (permalink / raw)
To: Ingo Molnar, LKML, Mike Galbraith, Srivatsa Vaddagiri
Please apply
--
* [PATCH 1/4] sched: optimize group load balancer
2008-10-17 17:27 [PATCH 0/4] pending scheduler updates Peter Zijlstra
@ 2008-10-17 17:27 ` Peter Zijlstra
2008-10-17 17:27 ` [PATCH 2/4] sched: fair scheduler should not resched rt tasks Peter Zijlstra
` (3 subsequent siblings)
4 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2008-10-17 17:27 UTC (permalink / raw)
To: Ingo Molnar, LKML, Mike Galbraith, Srivatsa Vaddagiri, Chris Friesen
Cc: Peter Zijlstra
[-- Attachment #1: sched-opt-group-balance.patch --]
[-- Type: text/plain, Size: 4441 bytes --]
I noticed that tg_shares_up() unconditionally takes rq-locks for all cpus
in the sched_domain. This hurts.
We need the rq-locks whenever we change the weight of the per-cpu group sched
entities. To alleviate this a little, only change the weight when the new
weight is at least shares_thresh away from the old value.
This avoids the rq-lock for the top level entries, since those will never
be re-weighted, and fuzzes the lower level entries a little to gain performance
in semi-stable situations.
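As a standalone illustration of the gating idea (a userspace sketch, not the
kernel code itself; the helper name and the example weights are made up), a
recomputed share that differs from the current weight by no more than the
threshold is simply ignored, so no remote rq-lock is taken:

#include <stdio.h>

static unsigned int shares_thresh = 4;	/* mirrors the new sysctl's default */

/* only "take the rq-lock" and re-weight when the change is big enough */
static void maybe_reweight(unsigned long *weight, unsigned long shares)
{
	long delta = (long)shares - (long)*weight;

	if (delta < 0)
		delta = -delta;

	if (delta <= shares_thresh) {
		printf("delta %ld <= %u: skip the lock, keep weight %lu\n",
		       delta, shares_thresh, *weight);
		return;
	}

	printf("delta %ld > %u: take the lock, re-weight %lu -> %lu\n",
	       delta, shares_thresh, *weight, shares);
	*weight = shares;
}

int main(void)
{
	unsigned long weight = 512;

	maybe_reweight(&weight, 515);	/* within the fuzz: stays 512, no remote lock */
	maybe_reweight(&weight, 520);	/* big enough: re-weighted under the lock */
	return 0;
}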
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Chris Friesen <cfriesen@nortel.com>
---
include/linux/sched.h | 1 +
kernel/sched.c | 45 +++++++++++++++++++++++++--------------------
kernel/sysctl.c | 10 ++++++++++
3 files changed, 36 insertions(+), 20 deletions(-)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1660,6 +1660,7 @@ extern unsigned int sysctl_sched_feature
extern unsigned int sysctl_sched_migration_cost;
extern unsigned int sysctl_sched_nr_migrate;
extern unsigned int sysctl_sched_shares_ratelimit;
+extern unsigned int sysctl_sched_shares_thresh;
int sched_nr_latency_handler(struct ctl_table *table, int write,
struct file *file, void __user *buffer, size_t *length,
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -818,6 +818,13 @@ const_debug unsigned int sysctl_sched_nr
unsigned int sysctl_sched_shares_ratelimit = 250000;
/*
+ * Inject some fuzzyness into changing the per-cpu group shares
+ * this avoids remote rq-locks at the expense of fairness.
+ * default: 4
+ */
+unsigned int sysctl_sched_shares_thresh = 4;
+
+/*
* period over which we measure -rt task cpu usage in us.
* default: 1s
*/
@@ -1453,8 +1460,8 @@ static void __set_se_shares(struct sched
* Calculate and set the cpu's group shares.
*/
static void
-__update_group_shares_cpu(struct task_group *tg, int cpu,
- unsigned long sd_shares, unsigned long sd_rq_weight)
+update_group_shares_cpu(struct task_group *tg, int cpu,
+ unsigned long sd_shares, unsigned long sd_rq_weight)
{
int boost = 0;
unsigned long shares;
@@ -1485,19 +1492,23 @@ __update_group_shares_cpu(struct task_gr
*
*/
shares = (sd_shares * rq_weight) / (sd_rq_weight + 1);
+ shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);
- /*
- * record the actual number of shares, not the boosted amount.
- */
- tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
- tg->cfs_rq[cpu]->rq_weight = rq_weight;
+ if (abs(shares - tg->se[cpu]->load.weight) >
+ sysctl_sched_shares_thresh) {
+ struct rq *rq = cpu_rq(cpu);
+ unsigned long flags;
- if (shares < MIN_SHARES)
- shares = MIN_SHARES;
- else if (shares > MAX_SHARES)
- shares = MAX_SHARES;
+ spin_lock_irqsave(&rq->lock, flags);
+ /*
+ * record the actual number of shares, not the boosted amount.
+ */
+ tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
+ tg->cfs_rq[cpu]->rq_weight = rq_weight;
- __set_se_shares(tg->se[cpu], shares);
+ __set_se_shares(tg->se[cpu], shares);
+ spin_unlock_irqrestore(&rq->lock, flags);
+ }
}
/*
@@ -1526,14 +1537,8 @@ static int tg_shares_up(struct task_grou
if (!rq_weight)
rq_weight = cpus_weight(sd->span) * NICE_0_LOAD;
- for_each_cpu_mask(i, sd->span) {
- struct rq *rq = cpu_rq(i);
- unsigned long flags;
-
- spin_lock_irqsave(&rq->lock, flags);
- __update_group_shares_cpu(tg, i, shares, rq_weight);
- spin_unlock_irqrestore(&rq->lock, flags);
- }
+ for_each_cpu_mask(i, sd->span)
+ update_group_shares_cpu(tg, i, shares, rq_weight);
return 0;
}
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -277,6 +277,16 @@ static struct ctl_table kern_table[] = {
},
{
.ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_shares_thresh",
+ .data = &sysctl_sched_shares_thresh,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
.procname = "sched_child_runs_first",
.data = &sysctl_sched_child_runs_first,
.maxlen = sizeof(unsigned int),
--
* [PATCH 2/4] sched: fair scheduler should not resched rt tasks
2008-10-17 17:27 [PATCH 0/4] pending scheduler updates Peter Zijlstra
2008-10-17 17:27 ` [PATCH 1/4] sched: optimize group load balancer Peter Zijlstra
@ 2008-10-17 17:27 ` Peter Zijlstra
2008-10-17 17:27 ` [PATCH 3/4] sched: revert back to per-rq vruntime Peter Zijlstra
` (2 subsequent siblings)
4 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2008-10-17 17:27 UTC (permalink / raw)
To: Ingo Molnar, LKML, Mike Galbraith, Srivatsa Vaddagiri
Cc: Peter Zijlstra, Steven Rostedt
[-- Attachment #1: hrtick-rostedt.patch --]
[-- Type: text/plain, Size: 2480 bytes --]
Using ftrace, Steven noticed that some RT tasks got rescheduled due
to sched_fair interaction.
What happens is that we reprogram the hrtick from enqueue/dequeue_task_fair()
because that can change nr_running, and thus the current task's ideal runtime.
However, it's possible the current task isn't a fair_sched_class task, and thus
doesn't have an hrtick set to change.
Fix this by wrapping those hrtick_start_fair() calls in a hrtick_update()
function, which will check for the right conditions.
Reported-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/sched_fair.c | 28 ++++++++++++++++++++++++----
1 file changed, 24 insertions(+), 4 deletions(-)
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -73,6 +73,8 @@ unsigned int sysctl_sched_wakeup_granula
const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
+static const struct sched_class fair_sched_class;
+
/**************************************************************
* CFS operations on generic schedulable entities:
*/
@@ -848,11 +850,31 @@ static void hrtick_start_fair(struct rq
hrtick_start(rq, delta);
}
}
+
+/*
+ * called from enqueue/dequeue and updates the hrtick when the
+ * current task is from our class and nr_running is low enough
+ * to matter.
+ */
+static void hrtick_update(struct rq *rq)
+{
+ struct task_struct *curr = rq->curr;
+
+ if (curr->sched_class != &fair_sched_class)
+ return;
+
+ if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
+ hrtick_start_fair(rq, curr);
+}
#else /* !CONFIG_SCHED_HRTICK */
static inline void
hrtick_start_fair(struct rq *rq, struct task_struct *p)
{
}
+
+static inline void hrtick_update(struct rq *rq)
+{
+}
#endif
/*
@@ -873,7 +895,7 @@ static void enqueue_task_fair(struct rq
wakeup = 1;
}
- hrtick_start_fair(rq, rq->curr);
+ hrtick_update(rq);
}
/*
@@ -895,7 +917,7 @@ static void dequeue_task_fair(struct rq
sleep = 1;
}
- hrtick_start_fair(rq, rq->curr);
+ hrtick_update(rq);
}
/*
@@ -1001,8 +1023,6 @@ static inline int wake_idle(int cpu, str
#ifdef CONFIG_SMP
-static const struct sched_class fair_sched_class;
-
#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* effective_load() calculates the load change as seen from the root_task_group
--
* [PATCH 3/4] sched: revert back to per-rq vruntime
2008-10-17 17:27 [PATCH 0/4] pending scheduler updates Peter Zijlstra
2008-10-17 17:27 ` [PATCH 1/4] sched: optimize group load balancer Peter Zijlstra
2008-10-17 17:27 ` [PATCH 2/4] sched: fair scheduler should not resched rt tasks Peter Zijlstra
@ 2008-10-17 17:27 ` Peter Zijlstra
2008-10-17 17:27 ` [PATCH 4/4] sched: fix wakeup preemption Peter Zijlstra
2008-10-20 12:05 ` [PATCH 0/4] pending scheduler updates Ingo Molnar
4 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2008-10-17 17:27 UTC (permalink / raw)
To: Ingo Molnar, LKML, Mike Galbraith, Srivatsa Vaddagiri; +Cc: Peter Zijlstra
[-- Attachment #1: sched-opt-weight.patch --]
[-- Type: text/plain, Size: 3195 bytes --]
Vatsa rightly points out that having the runqueue weight in the vruntime
calculations can cause unfairness in the face of task joins/leaves.
Suppose: dv = dt * rw / w
Then take 10 tasks t_n, each of similar weight. If the first one runs for 1
unit of time, its vruntime will increase by 10. Now, if the next 8 tasks leave
after having run for their 1 unit each, the last task will see a vruntime
increase of only 2 after having run for 1 unit.
That leaves us with 2 tasks of equal weight and equal runtime, one of which
will not be scheduled for 8/2 = 4 units of time.
Ergo, we cannot do that and must use: dv = dt / w.
This means we cannot have a global vruntime based on effective priority, but
must instead go back to the vruntime per rq model we started out with.
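To make the arithmetic above concrete, here is a small userspace sketch
(illustrative only, not kernel code; the numbers follow the 10-task example
above):

#include <stdio.h>

int main(void)
{
	/* 10 tasks of equal weight w = 1; rw is the runqueue weight in units of w */
	long w = 1, dt = 1;

	/* the first task runs for dt while all 10 are queued: dv = dt * rw / w */
	long dv_first = dt * 10 / w;		/* 10 */

	/* the 8 middle tasks run and leave; the last task runs with rw = 2 */
	long dv_last = dt * 2 / w;		/* 2 */

	/*
	 * Both tasks ran the same wall time, yet their vruntimes now differ
	 * by 8.  With dv = dt * rw / w and rw = 2, the lagging task advances
	 * by 2 per unit of wall time, so it runs alone for 8 / 2 = 4 units
	 * before the first task gets scheduled again.
	 */
	printf("dv_first=%ld dv_last=%ld lag=%ld starvation=%ld\n",
	       dv_first, dv_last, dv_first - dv_last,
	       (dv_first - dv_last) / 2);
	return 0;
}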
This patch was lightly tested by starting while loops on each nice level and
observing their execution time, and with a simple group scenario of 1:2:3
pinned to a single cpu.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched_fair.c | 32 +++++++++++++++-----------------
1 file changed, 15 insertions(+), 17 deletions(-)
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -336,7 +336,7 @@ int sched_nr_latency_handler(struct ctl_
#endif
/*
- * delta *= w / rw
+ * delta *= P[w / rw]
*/
static inline unsigned long
calc_delta_weight(unsigned long delta, struct sched_entity *se)
@@ -350,15 +350,13 @@ calc_delta_weight(unsigned long delta, s
}
/*
- * delta *= rw / w
+ * delta /= w
*/
static inline unsigned long
calc_delta_fair(unsigned long delta, struct sched_entity *se)
{
- for_each_sched_entity(se) {
- delta = calc_delta_mine(delta,
- cfs_rq_of(se)->load.weight, &se->load);
- }
+ if (unlikely(se->load.weight != NICE_0_LOAD))
+ delta = calc_delta_mine(delta, NICE_0_LOAD, &se->load);
return delta;
}
@@ -388,26 +386,26 @@ static u64 __sched_period(unsigned long
* We calculate the wall-time slice from the period by taking a part
* proportional to the weight.
*
- * s = p*w/rw
+ * s = p*P[w/rw]
*/
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- return calc_delta_weight(__sched_period(cfs_rq->nr_running), se);
+ unsigned long nr_running = cfs_rq->nr_running;
+
+ if (unlikely(!se->on_rq))
+ nr_running++;
+
+ return calc_delta_weight(__sched_period(nr_running), se);
}
/*
* We calculate the vruntime slice of a to be inserted task
*
- * vs = s*rw/w = p
+ * vs = s/w
*/
-static u64 sched_vslice_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- unsigned long nr_running = cfs_rq->nr_running;
-
- if (!se->on_rq)
- nr_running++;
-
- return __sched_period(nr_running);
+ return calc_delta_fair(sched_slice(cfs_rq, se), se);
}
/*
@@ -630,7 +628,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
* stays open at the end.
*/
if (initial && sched_feat(START_DEBIT))
- vruntime += sched_vslice_add(cfs_rq, se);
+ vruntime += sched_vslice(cfs_rq, se);
if (!initial) {
/* sleeps upto a single latency don't count. */
--
* [PATCH 4/4] sched: fix wakeup preemption
2008-10-17 17:27 [PATCH 0/4] pending scheduler updates Peter Zijlstra
` (2 preceding siblings ...)
2008-10-17 17:27 ` [PATCH 3/4] sched: revert back to per-rq vruntime Peter Zijlstra
@ 2008-10-17 17:27 ` Peter Zijlstra
2008-10-20 21:57 ` Chris Friesen
2008-10-20 12:05 ` [PATCH 0/4] pending scheduler updates Ingo Molnar
4 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2008-10-17 17:27 UTC (permalink / raw)
To: Ingo Molnar, LKML, Mike Galbraith, Srivatsa Vaddagiri; +Cc: Peter Zijlstra
[-- Attachment #1: sched-fix-wakeup-preempt.patch --]
[-- Type: text/plain, Size: 1109 bytes --]
In my recent wakeup preempt rework I messed up the asym wakeup.
The idea is that it should be easier to preempt lighter tasks
but not harder to preempt heavier tasks.
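With the hunk below, when the ASYM_GRAN feature is enabled the preemption
granularity is driven by the weight of the task that would be preempted: a
lighter (+nice) current task gets a smaller granularity and is preempted
sooner, while a heavier (-nice) one keeps the stock granularity rather than
getting a larger one. A quick userspace sketch of that scaling (illustrative
only; the weights are the kernel's prio_to_weight[] values for nice 0/+5/-5,
and the 10ms figure is just an example granularity):

#include <stdio.h>

#define NICE_0_LOAD	1024UL
#define NICE_0_SHIFT	10

/* mirror of the patched logic: only shrink gran for lighter-than-nice-0 tasks */
static unsigned long wakeup_gran(unsigned long gran, unsigned long weight)
{
	if (weight < NICE_0_LOAD)
		gran = (gran * weight) >> NICE_0_SHIFT;
	return gran;
}

int main(void)
{
	unsigned long gran = 10000000UL;	/* 10ms in ns */

	printf("nice  0 (1024): %lu\n", wakeup_gran(gran, 1024));	/* unchanged */
	printf("nice +5 ( 335): %lu\n", wakeup_gran(gran, 335));	/* ~1/3: preempted sooner */
	printf("nice -5 (3121): %lu\n", wakeup_gran(gran, 3121));	/* unchanged: not harder */
	return 0;
}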
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched_fair.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1243,8 +1243,8 @@ static unsigned long wakeup_gran(struct
* More easily preempt - nice tasks, while not making it harder for
* + nice tasks.
*/
- if (sched_feat(ASYM_GRAN))
- gran = calc_delta_mine(gran, NICE_0_LOAD, &se->load);
+ if (sched_feat(ASYM_GRAN) && se->load.weight < NICE_0_LOAD)
+ gran = (gran * se->load.weight) >> NICE_0_SHIFT;
return gran;
}
@@ -1296,7 +1296,7 @@ static void check_preempt_wakeup(struct
}
delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
- if (delta_exec > wakeup_gran(pse))
+ if (delta_exec > wakeup_gran(se))
resched_task(curr);
}
--
* Re: [PATCH 0/4] pending scheduler updates
2008-10-17 17:27 [PATCH 0/4] pending scheduler updates Peter Zijlstra
` (3 preceding siblings ...)
2008-10-17 17:27 ` [PATCH 4/4] sched: fix wakeup preemption Peter Zijlstra
@ 2008-10-20 12:05 ` Ingo Molnar
2008-10-21 17:35 ` Srivatsa Vaddagiri
4 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2008-10-20 12:05 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: LKML, Mike Galbraith, Srivatsa Vaddagiri
* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Please apply
applied these to tip/sched/urgent:
f9c0b09: sched: revert back to per-rq vruntime
a4c2f00: sched: fair scheduler should not resched rt tasks
ffda12a: sched: optimize group load balancer
(will wait with 4/4 until you and Mike come to a verdict.)
thanks Peter,
Ingo
* Re: [PATCH 4/4] sched: fix wakeup preemption
2008-10-17 17:27 ` [PATCH 4/4] sched: fix wakeup preemption Peter Zijlstra
@ 2008-10-20 21:57 ` Chris Friesen
0 siblings, 0 replies; 17+ messages in thread
From: Chris Friesen @ 2008-10-20 21:57 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML, Mike Galbraith, Srivatsa Vaddagiri
Peter Zijlstra wrote:
> In my recent wakeup preempt rework I messed up the asym wakeup.
> The idea is that it should be easier to preempt lighter tasks
> but not harder to preempt heavier tasks.
> Index: linux-2.6/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -1243,8 +1243,8 @@ static unsigned long wakeup_gran(struct
> * More easily preempt - nice tasks, while not making it harder for
> * + nice tasks.
> */
> - if (sched_feat(ASYM_GRAN))
> - gran = calc_delta_mine(gran, NICE_0_LOAD, &se->load);
> + if (sched_feat(ASYM_GRAN) && se->load.weight < NICE_0_LOAD)
> + gran = (gran * se->load.weight) >> NICE_0_SHIFT;
>
> return gran;
> }
Setting aside whether the asym wakeup is desirable, the code looks
reasonable but I think you need to change the code comments as well.
The proposed code only affects tasks with a weight of less than NICE_0_LOAD,
i.e. +nice tasks. The comment suggests the opposite.
Chris
* Re: [PATCH 0/4] pending scheduler updates
2008-10-20 12:05 ` [PATCH 0/4] pending scheduler updates Ingo Molnar
@ 2008-10-21 17:35 ` Srivatsa Vaddagiri
2008-10-22 9:40 ` Ingo Molnar
0 siblings, 1 reply; 17+ messages in thread
From: Srivatsa Vaddagiri @ 2008-10-21 17:35 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Peter Zijlstra, LKML, Mike Galbraith
On Mon, Oct 20, 2008 at 02:05:38PM +0200, Ingo Molnar wrote:
>
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> > Please apply
>
> applied these to tip/sched/urgent:
>
> f9c0b09: sched: revert back to per-rq vruntime
> a4c2f00: sched: fair scheduler should not resched rt tasks
> ffda12a: sched: optimize group load balancer
>
> (will wait with 4/4 until you and Mike come to a verdict.)
Is there any conclusion on Patch 4/4? It looks sane to me!
--
Regards,
vatsa
* Re: [PATCH 0/4] pending scheduler updates
2008-10-21 17:35 ` Srivatsa Vaddagiri
@ 2008-10-22 9:40 ` Ingo Molnar
2008-10-22 10:03 ` Mike Galbraith
0 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2008-10-22 9:40 UTC (permalink / raw)
To: Srivatsa Vaddagiri; +Cc: Peter Zijlstra, LKML, Mike Galbraith
* Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote:
> On Mon, Oct 20, 2008 at 02:05:38PM +0200, Ingo Molnar wrote:
> >
> > * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > > Please apply
> >
> > applied these to tip/sched/urgent:
> >
> > f9c0b09: sched: revert back to per-rq vruntime
> > a4c2f00: sched: fair scheduler should not resched rt tasks
> > ffda12a: sched: optimize group load balancer
> >
> > (will wait with 4/4 until you and Mike come to a verdict.)
>
> Is there any conclusion on Patch 4/4? It looks sane to me!
Mike?
Ingo
* Re: [PATCH 0/4] pending scheduler updates
2008-10-22 9:40 ` Ingo Molnar
@ 2008-10-22 10:03 ` Mike Galbraith
2008-10-22 10:32 ` Mike Galbraith
0 siblings, 1 reply; 17+ messages in thread
From: Mike Galbraith @ 2008-10-22 10:03 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Srivatsa Vaddagiri, Peter Zijlstra, LKML
[-- Attachment #1: Type: text/plain, Size: 1406 bytes --]
On Wed, 2008-10-22 at 11:40 +0200, Ingo Molnar wrote:
> * Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote:
>
> > On Mon, Oct 20, 2008 at 02:05:38PM +0200, Ingo Molnar wrote:
> > >
> > > * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > >
> > > > Please apply
> > >
> > > applied these to tip/sched/urgent:
> > >
> > > f9c0b09: sched: revert back to per-rq vruntime
> > > a4c2f00: sched: fair scheduler should not resched rt tasks
> > > ffda12a: sched: optimize group load balancer
> > >
> > > (will wait with 4/4 until you and Mike come to a verdict.)
> >
> > Is there any conclusion on Patch 4/4? It looks sane to me!
>
> Mike?
Your call of course, but I don't think it's a good trade. In testing,
it does serious injury to mysql+oltp peak throughput, and slows down
things which need frequent preemption in order to compete effectively
against hogs. (includes desktop, though that's not heavily tested)
The attached starvation testcase, distilled from a real app and posted to
lkml a few years ago, it totally kills. It's a worst case scenario to
be sure, but that worst case is pretty dramatic. I guesstimated it
would take ~12 hours for it to do 10M signals with wakeup preemption so
restricted, vs 43 seconds with preemption as is. To me, that suggests
that anything ultra fast/light will pay through the nose.
It has positive effects too, but IMHO, the bad outweigh the good.
-Mike
[-- Attachment #2: starve.c --]
[-- Type: text/x-csrc, Size: 715 bytes --]
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
volatile unsigned long loop = 10000000;

/* SIGUSR1 handler: count down the number of signals we still expect */
void
handler (int n)
{
	if (loop > 0)
		--loop;
}

/* child: flood the parent with SIGUSR1 as fast as it can */
static int
child (void)
{
	pid_t ppid = getppid ();

	sleep (1);
	while (1)
		kill (ppid, SIGUSR1);
	return 0;
}

int
main (int argc, char **argv)
{
	pid_t child_pid;
	int r;

	loop = argc > 1 ? strtoul (argv[1], NULL, 10) : 10000000;
	printf ("expecting to receive %lu signals\n", loop);

	if ((child_pid = fork ()) == 0)
		exit (child ());

	/* parent: sleep until the handler has counted down all expected signals */
	signal (SIGUSR1, handler);
	while (loop)
		sleep (1);

	r = kill (child_pid, SIGTERM);
	waitpid (child_pid, NULL, 0);
	return 0;
}
* Re: [PATCH 0/4] pending scheduler updates
2008-10-22 10:03 ` Mike Galbraith
@ 2008-10-22 10:32 ` Mike Galbraith
2008-10-22 12:10 ` Ingo Molnar
2008-10-22 17:38 ` Peter Zijlstra
0 siblings, 2 replies; 17+ messages in thread
From: Mike Galbraith @ 2008-10-22 10:32 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Srivatsa Vaddagiri, Peter Zijlstra, LKML
On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:
> It has positive effects too, but IMHO, the bad outweigh the good.
BTW, most dramatic on the other end of the spectrum is pgsql+oltp. With
preemption as is, it collapses as load climbs to heavy with preemption
knobs at stock. Postgres uses user-land spinlocks and _appears_ to wake
others while these are still held. For this load, there is such a thing
as too much short-term fairness, preempting lock holder creates nasty
gaggle of contended lock spinners. It's curable with knobs, and I think
it's postgres's own fault, but may be wrong.
With that patch, pgsql+oltp scales perfectly.
-Mike
* Re: [PATCH 0/4] pending scheduler updates
2008-10-22 10:32 ` Mike Galbraith
@ 2008-10-22 12:10 ` Ingo Molnar
2008-10-22 12:38 ` Mike Galbraith
2008-10-22 17:38 ` Peter Zijlstra
1 sibling, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2008-10-22 12:10 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Srivatsa Vaddagiri, Peter Zijlstra, LKML
* Mike Galbraith <efault@gmx.de> wrote:
> On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:
>
> > It has positive effects too, but IMHO, the bad outweigh the good.
>
> BTW, most dramatic on the other end of the spectrum is pgsql+oltp.
> With preemption as is, it collapses as load climbs to heavy with
> preemption knobs at stock. Postgres uses user-land spinlocks and
> _appears_ to wake others while these are still held. For this load,
> there is such a thing as too much short-term fairness, preempting lock
> holder creates nasty gaggle of contended lock spinners. It's curable
> with knobs, and I think it's postgres's own fault, but may be wrong.
>
> With that patch, pgsql+oltp scales perfectly.
hm, tempting.
Have you tried to hack/fix pgsql to do proper wakeups?
Right now pgsql punishes schedulers that preempt it while it is
holding totally undeclared (to the kernel) user-space spinlocks ...
Hence postgresql is rewarding a _bad_ scheduler policy in essence. And
pgsql scalability seems to fall totally apart above 16 cpus - regardless
of scheduler policy.
Ingo
* Re: [PATCH 0/4] pending scheduler updates
2008-10-22 12:10 ` Ingo Molnar
@ 2008-10-22 12:38 ` Mike Galbraith
2008-10-22 12:42 ` Ingo Molnar
0 siblings, 1 reply; 17+ messages in thread
From: Mike Galbraith @ 2008-10-22 12:38 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Srivatsa Vaddagiri, Peter Zijlstra, LKML
On Wed, 2008-10-22 at 14:10 +0200, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
>
> > On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:
> >
> > > It has positive effects too, but IMHO, the bad outweigh the good.
> >
> > BTW, most dramatic on the other end of the spectrum is pgsql+oltp.
> > With preemption as is, it collapses as load climbs to heavy with
> > preemption knobs at stock. Postgres uses user-land spinlocks and
> > _appears_ to wake others while these are still held. For this load,
> > there is such a thing as too much short-term fairness, preempting lock
> > holder creates nasty gaggle of contended lock spinners. It's curable
> > with knobs, and I think it's postgres's own fault, but may be wrong.
> >
> > With that patch, pgsql+oltp scales perfectly.
>
> hm, tempting.
I disagree. Postgres's scaling problem is trivially corrected by
twiddling knobs (or whatnot). With that patch, you can't twiddle mysql
throughput back, or disk intensive loads for that matter. You can tweak
the preempt number, but it has nothing to do with lag, so anybody can
preempt anybody else as you turn the knob toward zero. Chaos.
> Have you tried to hack/fix pgsql to do proper wakeups?
No, I tried to build without spinlocks to verify, but build croaked.
Never went back to slogging through the code.
> Right now pgsql punishes schedulers that preempt it while it is
> holding totally undeclared (to the kernel) user-space spinlocks ...
>
> Hence postgresql is rewarding a _bad_ scheduler policy in essence. And
> pgsql scalability seems to fall totally apart above 16 cpus - regardless
> of scheduler policy.
If someone gives me that problem, and a credit card for electric
company, I'll do my very extra special best to defeat it ;-)
-Mike
* Re: [PATCH 0/4] pending scheduler updates
2008-10-22 12:38 ` Mike Galbraith
@ 2008-10-22 12:42 ` Ingo Molnar
2008-10-22 13:05 ` Mike Galbraith
0 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2008-10-22 12:42 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Srivatsa Vaddagiri, Peter Zijlstra, LKML
* Mike Galbraith <efault@gmx.de> wrote:
> > > With that patch, pgsql+oltp scales perfectly.
> >
> > hm, tempting.
>
> I disagree. Postgres's scaling problem is trivially corrected by
> twiddling knobs (or whatnot). [...]
okay, then we need to document it a bit more: what knobs need twiddling
to make it scale perfectly?
> [...] With that patch, you can't twiddle mysql throughput back, or
> disk intensive loads for that matter. You can tweak the preempt
> number, but it has nothing to do with lag, so anybody can preempt
> anybody else as you turn the knob toward zero. Chaos.
okay, convinced.
> > Have you tried to hack/fix pgsql to do proper wakeups?
>
> No, I tried to build without spinlocks to verify, but build croaked.
> Never went back to slogging through the code.
if it falls back to IPC semaphores that's a bad trade from a performance
POV. The best would be if it used proper futexes (i.e. pthread_mutex()
and friends) not some home-grown user-space spinlock thing.
Ingo
* Re: [PATCH 0/4] pending scheduler updates
2008-10-22 12:42 ` Ingo Molnar
@ 2008-10-22 13:05 ` Mike Galbraith
0 siblings, 0 replies; 17+ messages in thread
From: Mike Galbraith @ 2008-10-22 13:05 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Srivatsa Vaddagiri, Peter Zijlstra, LKML
On Wed, 2008-10-22 at 14:42 +0200, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
>
> > > > With that patch, pgsql+oltp scales perfectly.
> > >
> > > hm, tempting.
> >
> > I disagree. Postgres's scaling problem is trivially corrected by
> > twiddling knobs (or whatnot). [...]
>
> okay, then we need to document it a bit more: what knobs need twiddling
> to make it scale perfectly?
Twiddle sched_wakeup_granularity_ns or just turn FAIR_SLEEPERS off for
best pgsql+oltp scalability. Pgsql+oltp collapse is delicate. It
doesn't take much to send it one way or the other. Cut preempt a wee
bit, and it can snap right into shape.
I think we need to penalize sleepers a wee bit as load climbs in general,
though; everybody begins to like preemption less as load increases.
Mysql+oltp improves as well, and it loves preempt at modest load.
-Mike
* Re: [PATCH 0/4] pending scheduler updates
2008-10-22 10:32 ` Mike Galbraith
2008-10-22 12:10 ` Ingo Molnar
@ 2008-10-22 17:38 ` Peter Zijlstra
2008-10-22 17:56 ` Mike Galbraith
1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2008-10-22 17:38 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Ingo Molnar, Srivatsa Vaddagiri, LKML
On Wed, 2008-10-22 at 12:32 +0200, Mike Galbraith wrote:
> On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:
>
> > It has positive effects too, but IMHO, the bad outweigh the good.
>
> BTW, most dramatic on the other end of the spectrum is pgsql+oltp. With
> preemption as is, it collapses as load climbs to heavy with preemption
> knobs at stock. Postgres uses user-land spinlocks and _appears_ to wake
> others while these are still held. For this load, there is such a thing
> as too much short-term fairness, preempting lock holder creates nasty
> gaggle of contended lock spinners. It's curable with knobs, and I think
> it's postgres's own fault, but may be wrong.
>
> With that patch, pgsql+oltp scales perfectly.
Are we talking about this patch, which reinstates the vruntime-based
wakeup preemption?
---
Subject: sched: fix wakeup preemption
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched_fair.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 92 insertions(+), 6 deletions(-)
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -143,6 +143,49 @@ static inline struct sched_entity *paren
return se->parent;
}
+/* return depth at which a sched entity is present in the hierarchy */
+static inline int depth_se(struct sched_entity *se)
+{
+ int depth = 0;
+
+ for_each_sched_entity(se)
+ depth++;
+
+ return depth;
+}
+
+static void
+find_matching_se(struct sched_entity **se, struct sched_entity **pse)
+{
+ int se_depth, pse_depth;
+
+ /*
+ * preemption test can be made between sibling entities who are in the
+ * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
+ * both tasks until we find their ancestors who are siblings of common
+ * parent.
+ */
+
+ /* First walk up until both entities are at same depth */
+ se_depth = depth_se(*se);
+ pse_depth = depth_se(*pse);
+
+ while (se_depth > pse_depth) {
+ se_depth--;
+ *se = parent_entity(*se);
+ }
+
+ while (pse_depth > se_depth) {
+ pse_depth--;
+ *pse = parent_entity(*pse);
+ }
+
+ while (!is_same_group(*se, *pse)) {
+ *se = parent_entity(*se);
+ *pse = parent_entity(*pse);
+ }
+}
+
#else /* CONFIG_FAIR_GROUP_SCHED */
static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
@@ -193,6 +236,11 @@ static inline struct sched_entity *paren
return NULL;
}
+static inline void
+find_matching_se(struct sched_entity **se, struct sched_entity **pse)
+{
+}
+
#endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -1244,13 +1292,42 @@ static unsigned long wakeup_gran(struct
* More easily preempt - nice tasks, while not making it harder for
* + nice tasks.
*/
- if (sched_feat(ASYM_GRAN))
- gran = calc_delta_mine(gran, NICE_0_LOAD, &se->load);
+ if (!sched_feat(ASYM_GRAN) || se->load.weight > NICE_0_LOAD)
+ gran = calc_delta_fair(sysctl_sched_wakeup_granularity, se);
return gran;
}
/*
+ * Should 'se' preempt 'curr'.
+ *
+ * |s1
+ * |s2
+ * |s3
+ * g
+ * |<--->|c
+ *
+ * w(c, s1) = -1
+ * w(c, s2) = 0
+ * w(c, s3) = 1
+ *
+ */
+static int
+wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
+{
+ s64 gran, vdiff = curr->vruntime - se->vruntime;
+
+ if (vdiff < 0)
+ return -1;
+
+ gran = wakeup_gran(curr);
+ if (vdiff > gran)
+ return 1;
+
+ return 0;
+}
+
+/*
* Preempt the current task with a newly woken task if needed:
*/
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync)
@@ -1258,7 +1335,6 @@ static void check_preempt_wakeup(struct
struct task_struct *curr = rq->curr;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
struct sched_entity *se = &curr->se, *pse = &p->se;
- s64 delta_exec;
if (unlikely(rt_prio(p->prio))) {
update_rq_clock(rq);
@@ -1296,9 +1372,19 @@ static void check_preempt_wakeup(struct
return;
}
- delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
- if (delta_exec > wakeup_gran(pse))
- resched_task(curr);
+ find_matching_se(&se, &pse);
+
+ while (se) {
+ BUG_ON(!pse);
+
+ if (wakeup_preempt_entity(se, pse) == 1) {
+ resched_task(curr);
+ break;
+ }
+
+ se = parent_entity(se);
+ pse = parent_entity(pse);
+ }
}
static struct task_struct *pick_next_task_fair(struct rq *rq)
* Re: [PATCH 0/4] pending scheduler updates
2008-10-22 17:38 ` Peter Zijlstra
@ 2008-10-22 17:56 ` Mike Galbraith
0 siblings, 0 replies; 17+ messages in thread
From: Mike Galbraith @ 2008-10-22 17:56 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, Srivatsa Vaddagiri, LKML
On Wed, 2008-10-22 at 19:38 +0200, Peter Zijlstra wrote:
> On Wed, 2008-10-22 at 12:32 +0200, Mike Galbraith wrote:
> > On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:
> >
> > > It has positive effects too, but IMHO, the bad outweigh the good.
> >
> > BTW, most dramatic on the other end of the spectrum is pgsql+oltp. With
> > preemption as is, it collapses as load climbs to heavy with preemption
> > knobs at stock. Postgres uses user-land spinlocks and _appears_ to wake
> > others while these are still held. For this load, there is such a thing
> > as too much short-term fairness, preempting lock holder creates nasty
> > gaggle of contended lock spinners. It's curable with knobs, and I think
> > it's postgres's own fault, but may be wrong.
> >
> > With that patch, pgsql+oltp scales perfectly.
>
> Are we talking about this patch, which re-instates the vruntime based
> wakeup-preemption ?
No, if it was that one, I'd be tinkering with mysql+oltp. Everything
else I tested (limited time, but fairly wide spectrum) with that patch
was fine, including interactivity. Caveat: I'm not comfortable with the
tbench/netperf test results; I'd need to backport to 26 to feel at all
confident in those. (fwtw)
-Mike