LKML Archive on lore.kernel.org
* [PATCH 0/4] pending scheduler updates
@ 2008-10-17 17:27 Peter Zijlstra
  2008-10-17 17:27 ` [PATCH 1/4] sched: optimize group load balancer Peter Zijlstra
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Peter Zijlstra @ 2008-10-17 17:27 UTC (permalink / raw)
  To: Ingo Molnar, LKML, Mike Galbraith, Srivatsa Vaddagiri

Please apply
-- 



* [PATCH 1/4] sched: optimize group load balancer
  2008-10-17 17:27 [PATCH 0/4] pending scheduler updates Peter Zijlstra
@ 2008-10-17 17:27 ` Peter Zijlstra
  2008-10-17 17:27 ` [PATCH 2/4] sched: fair scheduler should not resched rt tasks Peter Zijlstra
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2008-10-17 17:27 UTC (permalink / raw)
  To: Ingo Molnar, LKML, Mike Galbraith, Srivatsa Vaddagiri, Chris Friesen
  Cc: Peter Zijlstra

[-- Attachment #1: sched-opt-group-balance.patch --]
[-- Type: text/plain, Size: 4441 bytes --]

I noticed that tg_shares_up() unconditionally takes rq-locks for all cpus
in the sched_domain. This hurts.

We need the rq-locks whenever we change the weight of the per-cpu group sched
entities. To alleviate this a little, only change the weight when the new
weight is at least shares_thresh away from the old value.

This avoids the rq-lock for the top level entries, since those will never
be re-weighted, and fuzzes the lower level entries a little to gain performance
in semi-stable situations.
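
As a rough stand-alone illustration of the threshold (a sketch with made-up
numbers, not the kernel code itself; the real check sits in
update_group_shares_cpu() in the hunk below):

#include <stdio.h>
#include <stdlib.h>

#define SHARES_THRESH   4       /* mirrors the default sysctl_sched_shares_thresh */

/* reweight (and thus take the remote rq->lock) only when the change is big enough */
static int needs_reweight(long old_weight, long new_weight)
{
        return labs(new_weight - old_weight) > SHARES_THRESH;
}

int main(void)
{
        printf("1024 -> 1021: %s\n", needs_reweight(1024, 1021) ? "reweight" : "skip");
        printf("1024 -> 1016: %s\n", needs_reweight(1024, 1016) ? "reweight" : "skip");
        return 0;
}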

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Chris Friesen <cfriesen@nortel.com>
---
 include/linux/sched.h |    1 +
 kernel/sched.c        |   45 +++++++++++++++++++++++++--------------------
 kernel/sysctl.c       |   10 ++++++++++
 3 files changed, 36 insertions(+), 20 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1660,6 +1660,7 @@ extern unsigned int sysctl_sched_feature
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
 extern unsigned int sysctl_sched_shares_ratelimit;
+extern unsigned int sysctl_sched_shares_thresh;
 
 int sched_nr_latency_handler(struct ctl_table *table, int write,
 		struct file *file, void __user *buffer, size_t *length,
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -818,6 +818,13 @@ const_debug unsigned int sysctl_sched_nr
 unsigned int sysctl_sched_shares_ratelimit = 250000;
 
 /*
+ * Inject some fuzzyness into changing the per-cpu group shares
+ * this avoids remote rq-locks at the expense of fairness.
+ * default: 4
+ */
+unsigned int sysctl_sched_shares_thresh = 4;
+
+/*
  * period over which we measure -rt task cpu usage in us.
  * default: 1s
  */
@@ -1453,8 +1460,8 @@ static void __set_se_shares(struct sched
  * Calculate and set the cpu's group shares.
  */
 static void
-__update_group_shares_cpu(struct task_group *tg, int cpu,
-			  unsigned long sd_shares, unsigned long sd_rq_weight)
+update_group_shares_cpu(struct task_group *tg, int cpu,
+			unsigned long sd_shares, unsigned long sd_rq_weight)
 {
 	int boost = 0;
 	unsigned long shares;
@@ -1485,19 +1492,23 @@ __update_group_shares_cpu(struct task_gr
 	 *
 	 */
 	shares = (sd_shares * rq_weight) / (sd_rq_weight + 1);
+	shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);
 
-	/*
-	 * record the actual number of shares, not the boosted amount.
-	 */
-	tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
-	tg->cfs_rq[cpu]->rq_weight = rq_weight;
+	if (abs(shares - tg->se[cpu]->load.weight) >
+			sysctl_sched_shares_thresh) {
+		struct rq *rq = cpu_rq(cpu);
+		unsigned long flags;
 
-	if (shares < MIN_SHARES)
-		shares = MIN_SHARES;
-	else if (shares > MAX_SHARES)
-		shares = MAX_SHARES;
+		spin_lock_irqsave(&rq->lock, flags);
+		/*
+		 * record the actual number of shares, not the boosted amount.
+		 */
+		tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
+		tg->cfs_rq[cpu]->rq_weight = rq_weight;
 
-	__set_se_shares(tg->se[cpu], shares);
+		__set_se_shares(tg->se[cpu], shares);
+		spin_unlock_irqrestore(&rq->lock, flags);
+	}
 }
 
 /*
@@ -1526,14 +1537,8 @@ static int tg_shares_up(struct task_grou
 	if (!rq_weight)
 		rq_weight = cpus_weight(sd->span) * NICE_0_LOAD;
 
-	for_each_cpu_mask(i, sd->span) {
-		struct rq *rq = cpu_rq(i);
-		unsigned long flags;
-
-		spin_lock_irqsave(&rq->lock, flags);
-		__update_group_shares_cpu(tg, i, shares, rq_weight);
-		spin_unlock_irqrestore(&rq->lock, flags);
-	}
+	for_each_cpu_mask(i, sd->span)
+		update_group_shares_cpu(tg, i, shares, rq_weight);
 
 	return 0;
 }
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -277,6 +277,16 @@ static struct ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_shares_thresh",
+		.data		= &sysctl_sched_shares_thresh,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_child_runs_first",
 		.data		= &sysctl_sched_child_runs_first,
 		.maxlen		= sizeof(unsigned int),

-- 



* [PATCH 2/4] sched: fair scheduler should not resched rt tasks
  2008-10-17 17:27 [PATCH 0/4] pending scheduler updates Peter Zijlstra
  2008-10-17 17:27 ` [PATCH 1/4] sched: optimize group load balancer Peter Zijlstra
@ 2008-10-17 17:27 ` Peter Zijlstra
  2008-10-17 17:27 ` [PATCH 3/4] sched: revert back to per-rq vruntime Peter Zijlstra
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2008-10-17 17:27 UTC (permalink / raw)
  To: Ingo Molnar, LKML, Mike Galbraith, Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Steven Rostedt

[-- Attachment #1: hrtick-rostedt.patch --]
[-- Type: text/plain, Size: 2480 bytes --]

Using ftrace, Steven noticed that some RT tasks got rescheduled due
to sched_fair interaction.

What happens is that we reprogram the hrtick from enqueue/dequeue_fair_task()
because that can change nr_running, and thus the current task's ideal runtime.
However, it's possible the current task isn't a fair_sched_class task, and thus
doesn't have a hrtick set to change.

Fix this by wrapping those hrtick_start_fair() calls in a hrtick_update()
function, which will check for the right conditions.

Reported-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/sched_fair.c |   28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -73,6 +73,8 @@ unsigned int sysctl_sched_wakeup_granula
 
 const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
 
+static const struct sched_class fair_sched_class;
+
 /**************************************************************
  * CFS operations on generic schedulable entities:
  */
@@ -848,11 +850,31 @@ static void hrtick_start_fair(struct rq 
 		hrtick_start(rq, delta);
 	}
 }
+
+/*
+ * called from enqueue/dequeue and updates the hrtick when the
+ * current task is from our class and nr_running is low enough
+ * to matter.
+ */
+static void hrtick_update(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+
+	if (curr->sched_class != &fair_sched_class)
+		return;
+
+	if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
+		hrtick_start_fair(rq, curr);
+}
 #else /* !CONFIG_SCHED_HRTICK */
 static inline void
 hrtick_start_fair(struct rq *rq, struct task_struct *p)
 {
 }
+
+static inline void hrtick_update(struct rq *rq)
+{
+}
 #endif
 
 /*
@@ -873,7 +895,7 @@ static void enqueue_task_fair(struct rq 
 		wakeup = 1;
 	}
 
-	hrtick_start_fair(rq, rq->curr);
+	hrtick_update(rq);
 }
 
 /*
@@ -895,7 +917,7 @@ static void dequeue_task_fair(struct rq 
 		sleep = 1;
 	}
 
-	hrtick_start_fair(rq, rq->curr);
+	hrtick_update(rq);
 }
 
 /*
@@ -1001,8 +1023,6 @@ static inline int wake_idle(int cpu, str
 
 #ifdef CONFIG_SMP
 
-static const struct sched_class fair_sched_class;
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /*
  * effective_load() calculates the load change as seen from the root_task_group

-- 



* [PATCH 3/4] sched: revert back to per-rq vruntime
  2008-10-17 17:27 [PATCH 0/4] pending scheduler updates Peter Zijlstra
  2008-10-17 17:27 ` [PATCH 1/4] sched: optimize group load balancer Peter Zijlstra
  2008-10-17 17:27 ` [PATCH 2/4] sched: fair scheduler should not resched rt tasks Peter Zijlstra
@ 2008-10-17 17:27 ` Peter Zijlstra
  2008-10-17 17:27 ` [PATCH 4/4] sched: fix wakeup preemption Peter Zijlstra
  2008-10-20 12:05 ` [PATCH 0/4] pending scheduler updates Ingo Molnar
  4 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2008-10-17 17:27 UTC (permalink / raw)
  To: Ingo Molnar, LKML, Mike Galbraith, Srivatsa Vaddagiri; +Cc: Peter Zijlstra

[-- Attachment #1: sched-opt-weight.patch --]
[-- Type: text/plain, Size: 3195 bytes --]

Vatsa rightly points out that having the runqueue weight in the vruntime
calculations can cause unfairness in the face of task joins/leaves.

Suppose: dv = dt * rw / w

Then take 10 tasks t_n, each of similar weight. If the first one runs for 1
unit, its vruntime will increase by 10. Now, if the next 8 tasks each leave
after having run their 1 unit, the last task will get a vruntime increase of
only 2 after having run its 1 unit.

That leaves us with 2 tasks of equal weight and equal runtime, one of which
will not be scheduled for 8/2=4 units of time.

Ergo, we cannot do that and must use: dv = dt / w.

This means we cannot have a global vruntime based on effective priority, but
must instead go back to the vruntime per rq model we started out with.
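
For completeness, a tiny stand-alone simulation of the example above (demo
code, not part of the patch) that just replays the arithmetic:

#include <stdio.h>

int main(void)
{
        int nr = 10;            /* runnable tasks of equal weight */
        double v_first, v_last;

        /* the first task runs 1 unit while all 10 are runnable: dv = 1 * 10/1 */
        v_first = 1.0 * nr;

        /* tasks 2..9 run their unit and leave; when the last task finally
         * runs its unit only 2 tasks remain, so: dv = 1 * 2/1 */
        v_last = 1.0 * 2;

        /* both ran 1 unit, yet the vruntimes differ by 8; with 2 runnable
         * tasks dv = 2 * dt, so the lagging task runs alone for 8/2 = 4 units */
        printf("vruntime: first=%.0f last=%.0f catch-up=%.0f units\n",
               v_first, v_last, (v_first - v_last) / 2);
        return 0;
}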

This patch was lightly tested by starting while loops on each nice level
and observing their execution time, and with a simple 1:2:3 group scenario
pinned to a single cpu.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |   32 +++++++++++++++-----------------
 1 file changed, 15 insertions(+), 17 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -336,7 +336,7 @@ int sched_nr_latency_handler(struct ctl_
 #endif
 
 /*
- * delta *= w / rw
+ * delta *= P[w / rw]
  */
 static inline unsigned long
 calc_delta_weight(unsigned long delta, struct sched_entity *se)
@@ -350,15 +350,13 @@ calc_delta_weight(unsigned long delta, s
 }
 
 /*
- * delta *= rw / w
+ * delta /= w
  */
 static inline unsigned long
 calc_delta_fair(unsigned long delta, struct sched_entity *se)
 {
-	for_each_sched_entity(se) {
-		delta = calc_delta_mine(delta,
-				cfs_rq_of(se)->load.weight, &se->load);
-	}
+	if (unlikely(se->load.weight != NICE_0_LOAD))
+		delta = calc_delta_mine(delta, NICE_0_LOAD, &se->load);
 
 	return delta;
 }
@@ -388,26 +386,26 @@ static u64 __sched_period(unsigned long 
  * We calculate the wall-time slice from the period by taking a part
  * proportional to the weight.
  *
- * s = p*w/rw
+ * s = p*P[w/rw]
  */
 static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	return calc_delta_weight(__sched_period(cfs_rq->nr_running), se);
+	unsigned long nr_running = cfs_rq->nr_running;
+
+	if (unlikely(!se->on_rq))
+		nr_running++;
+
+	return calc_delta_weight(__sched_period(nr_running), se);
 }
 
 /*
  * We calculate the vruntime slice of a to be inserted task
  *
- * vs = s*rw/w = p
+ * vs = s/w
  */
-static u64 sched_vslice_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	unsigned long nr_running = cfs_rq->nr_running;
-
-	if (!se->on_rq)
-		nr_running++;
-
-	return __sched_period(nr_running);
+	return calc_delta_fair(sched_slice(cfs_rq, se), se);
 }
 
 /*
@@ -630,7 +628,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	 * stays open at the end.
 	 */
 	if (initial && sched_feat(START_DEBIT))
-		vruntime += sched_vslice_add(cfs_rq, se);
+		vruntime += sched_vslice(cfs_rq, se);
 
 	if (!initial) {
 		/* sleeps upto a single latency don't count. */

-- 



* [PATCH 4/4] sched: fix wakeup preemption
  2008-10-17 17:27 [PATCH 0/4] pending scheduler updates Peter Zijlstra
                   ` (2 preceding siblings ...)
  2008-10-17 17:27 ` [PATCH 3/4] sched: revert back to per-rq vruntime Peter Zijlstra
@ 2008-10-17 17:27 ` Peter Zijlstra
  2008-10-20 21:57   ` Chris Friesen
  2008-10-20 12:05 ` [PATCH 0/4] pending scheduler updates Ingo Molnar
  4 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2008-10-17 17:27 UTC (permalink / raw)
  To: Ingo Molnar, LKML, Mike Galbraith, Srivatsa Vaddagiri; +Cc: Peter Zijlstra

[-- Attachment #1: sched-fix-wakeup-preempt.patch --]
[-- Type: text/plain, Size: 1109 bytes --]

In my recent wakeup preempt rework I messed up the asym wakeup.
The idea is that it should be easier to preempt lighter tasks
but not harder to preempt heavier tasks.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1243,8 +1243,8 @@ static unsigned long wakeup_gran(struct 
 	 * More easily preempt - nice tasks, while not making it harder for
 	 * + nice tasks.
 	 */
-	if (sched_feat(ASYM_GRAN))
-		gran = calc_delta_mine(gran, NICE_0_LOAD, &se->load);
+	if (sched_feat(ASYM_GRAN) && se->load.weight < NICE_0_LOAD)
+		gran = (gran * se->load.weight) >> NICE_0_SHIFT;
 
 	return gran;
 }
@@ -1296,7 +1296,7 @@ static void check_preempt_wakeup(struct 
 	}
 
 	delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
-	if (delta_exec > wakeup_gran(pse))
+	if (delta_exec > wakeup_gran(se))
 		resched_task(curr);
 }
 

-- 



* Re: [PATCH 0/4] pending scheduler updates
  2008-10-17 17:27 [PATCH 0/4] pending scheduler updates Peter Zijlstra
                   ` (3 preceding siblings ...)
  2008-10-17 17:27 ` [PATCH 4/4] sched: fix wakeup preemption Peter Zijlstra
@ 2008-10-20 12:05 ` Ingo Molnar
  2008-10-21 17:35   ` Srivatsa Vaddagiri
  4 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2008-10-20 12:05 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Mike Galbraith, Srivatsa Vaddagiri


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Please apply

applied these to tip/sched/urgent:

 f9c0b09: sched: revert back to per-rq vruntime
 a4c2f00: sched: fair scheduler should not resched rt tasks
 ffda12a: sched: optimize group load balancer

(will wait with 4/4 until you and Mike come to a verdict.)

thanks Peter,

	Ingo


* Re: [PATCH 4/4] sched: fix wakeup preemption
  2008-10-17 17:27 ` [PATCH 4/4] sched: fix wakeup preemption Peter Zijlstra
@ 2008-10-20 21:57   ` Chris Friesen
  0 siblings, 0 replies; 17+ messages in thread
From: Chris Friesen @ 2008-10-20 21:57 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, LKML, Mike Galbraith, Srivatsa Vaddagiri

Peter Zijlstra wrote:

 > In my recent wakeup preempt rework I messed up the asym wakeup.
 > The idea is that it should be easier to preempt lighter tasks
 > but not harder to preempt heavier tasks.

 > Index: linux-2.6/kernel/sched_fair.c
 > ===================================================================
 > --- linux-2.6.orig/kernel/sched_fair.c
 > +++ linux-2.6/kernel/sched_fair.c
 > @@ -1243,8 +1243,8 @@ static unsigned long wakeup_gran(struct
 >   	 * More easily preempt - nice tasks, while not making it harder for
 >   	 * + nice tasks.
 >   	 */
 > -	if (sched_feat(ASYM_GRAN))
 > -		gran = calc_delta_mine(gran, NICE_0_LOAD, &se->load);
 > +	if (sched_feat(ASYM_GRAN) && se->load.weight < NICE_0_LOAD)
 > +		gran = (gran * se->load.weight) >> NICE_0_SHIFT;
 >
 >   	return gran;
 >   }

Setting aside whether the asym wakeup is desirable, the code looks 
reasonable but I think you need to change the code comments as well.

The proposed code only affects tasks with a weight of less than NICE_0_LOAD,
i.e. +nice tasks.  The comment suggests the opposite.
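
For reference, a small stand-alone sketch of the arithmetic in the proposed
hunk (the helper name is mine; the weights are the usual prio_to_weight[]
values, nice -5 = 3121, nice 0 = 1024, nice +5 = 335, and NICE_0_SHIFT = 10):

#include <stdio.h>

#define NICE_0_SHIFT    10
#define NICE_0_LOAD     (1UL << NICE_0_SHIFT)

/* the scaling the proposed hunk applies, for a given se->load.weight */
static unsigned long scale_gran(unsigned long gran, unsigned long weight)
{
        if (weight < NICE_0_LOAD)
                gran = (gran * weight) >> NICE_0_SHIFT;
        return gran;
}

int main(void)
{
        unsigned long gran = 5000000UL; /* e.g. a 5ms wakeup granularity */

        printf("nice  0 (1024): %lu\n", scale_gran(gran, 1024)); /* unchanged */
        printf("nice +5 ( 335): %lu\n", scale_gran(gran, 335));  /* ~1/3 of gran */
        printf("nice -5 (3121): %lu\n", scale_gran(gran, 3121)); /* unchanged */
        return 0;
}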

Chris


* Re: [PATCH 0/4] pending scheduler updates
  2008-10-20 12:05 ` [PATCH 0/4] pending scheduler updates Ingo Molnar
@ 2008-10-21 17:35   ` Srivatsa Vaddagiri
  2008-10-22  9:40     ` Ingo Molnar
  0 siblings, 1 reply; 17+ messages in thread
From: Srivatsa Vaddagiri @ 2008-10-21 17:35 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, LKML, Mike Galbraith

On Mon, Oct 20, 2008 at 02:05:38PM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > Please apply
> 
> applied these to tip/sched/urgent:
> 
>  f9c0b09: sched: revert back to per-rq vruntime
>  a4c2f00: sched: fair scheduler should not resched rt tasks
>  ffda12a: sched: optimize group load balancer
> 
> (will wait with 4/4 until you and Mike come to a verdict.)

Is there any conclusion on Patch 4/4? It looks sane to me!

-- 
Regards,
vatsa


* Re: [PATCH 0/4] pending scheduler updates
  2008-10-21 17:35   ` Srivatsa Vaddagiri
@ 2008-10-22  9:40     ` Ingo Molnar
  2008-10-22 10:03       ` Mike Galbraith
  0 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2008-10-22  9:40 UTC (permalink / raw)
  To: Srivatsa Vaddagiri; +Cc: Peter Zijlstra, LKML, Mike Galbraith


* Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote:

> On Mon, Oct 20, 2008 at 02:05:38PM +0200, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > 
> > > Please apply
> > 
> > applied these to tip/sched/urgent:
> > 
> >  f9c0b09: sched: revert back to per-rq vruntime
> >  a4c2f00: sched: fair scheduler should not resched rt tasks
> >  ffda12a: sched: optimize group load balancer
> > 
> > (will wait with 4/4 until you and Mike come to a verdict.)
> 
> Is there any conclusion on Patch 4/4? It looks sane to me!

Mike?

	Ingo


* Re: [PATCH 0/4] pending scheduler updates
  2008-10-22  9:40     ` Ingo Molnar
@ 2008-10-22 10:03       ` Mike Galbraith
  2008-10-22 10:32         ` Mike Galbraith
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Galbraith @ 2008-10-22 10:03 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Srivatsa Vaddagiri, Peter Zijlstra, LKML

[-- Attachment #1: Type: text/plain, Size: 1406 bytes --]

On Wed, 2008-10-22 at 11:40 +0200, Ingo Molnar wrote:
> * Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote:
> 
> > On Mon, Oct 20, 2008 at 02:05:38PM +0200, Ingo Molnar wrote:
> > > 
> > > * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > > 
> > > > Please apply
> > > 
> > > applied these to tip/sched/urgent:
> > > 
> > >  f9c0b09: sched: revert back to per-rq vruntime
> > >  a4c2f00: sched: fair scheduler should not resched rt tasks
> > >  ffda12a: sched: optimize group load balancer
> > > 
> > > (will wait with 4/4 until you and Mike come to a verdict.)
> > 
> > Is there any conclusion on Patch 4/4? It looks sane to me!
> 
> Mike?

Your call of course, but I don't think it's a good trade.  In testing,
it does serious injury to mysql+oltp peak throughput, and slows down
things which need frequent preemption in order to compete effectively
against hogs.  (includes desktop, though that's not heavily tested)

The attached starvation testcase, distilled from a real app and posted to
lkml a few years ago, it totally kills.  It's a worst case scenario to
be sure, but that worst case is pretty dramatic.  I guesstimated it
would take ~12 hours for it to do 10M signals with wakeup preemption so
restricted, vs 43 seconds with preemption as is.  To me, that suggests
that anything ultra fast/light will pay through the nose.

It has positive effects too, but IMHO, the bad outweigh the good.

	-Mike

[-- Attachment #2: starve.c --]
[-- Type: text/x-csrc, Size: 715 bytes --]

#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

#include <sys/types.h>
#include <sys/wait.h>

volatile unsigned long loop = 10000000;

void
handler (int n)
{
  if (loop > 0)
    --loop;
}

static int
child (void)
{
  pid_t ppid = getppid ();

  sleep (1);
  while (1)
    kill (ppid, SIGUSR1);
  return 0;
}

int
main (int argc, char **argv)
{
  pid_t child_pid;
  int r;

  loop = argc > 1 ? strtoul (argv[1], NULL, 10) : 10000000;
  printf ("expecting to receive %lu signals\n", loop);

  if ((child_pid = fork ()) == 0)
    exit (child ());

  signal (SIGUSR1, handler);
  while (loop)
    sleep (1);
  r = kill (child_pid, SIGTERM);
  waitpid (child_pid, NULL, 0);
  return 0;
}


* Re: [PATCH 0/4] pending scheduler updates
  2008-10-22 10:03       ` Mike Galbraith
@ 2008-10-22 10:32         ` Mike Galbraith
  2008-10-22 12:10           ` Ingo Molnar
  2008-10-22 17:38           ` Peter Zijlstra
  0 siblings, 2 replies; 17+ messages in thread
From: Mike Galbraith @ 2008-10-22 10:32 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Srivatsa Vaddagiri, Peter Zijlstra, LKML

On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:

> It has positive effects too, but IMHO, the bad outweigh the good.

BTW, most dramatic on the other end of the spectrum is pgsql+oltp.  With
preemption as is, it collapses as load climbs to heavy with preemption
knobs at stock.  Postgres uses user-land spinlocks and _appears_ to wake
others while these are still held.  For this load, there is such a thing
as too much short-term fairness, preempting lock holder creates nasty
gaggle of contended lock spinners.  It's curable with knobs, and I think
it's postgres's own fault, but may be wrong.

With that patch, pgsql+oltp scales perfectly.

	-Mike 



* Re: [PATCH 0/4] pending scheduler updates
  2008-10-22 10:32         ` Mike Galbraith
@ 2008-10-22 12:10           ` Ingo Molnar
  2008-10-22 12:38             ` Mike Galbraith
  2008-10-22 17:38           ` Peter Zijlstra
  1 sibling, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2008-10-22 12:10 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Srivatsa Vaddagiri, Peter Zijlstra, LKML


* Mike Galbraith <efault@gmx.de> wrote:

> On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:
> 
> > It has positive effects too, but IMHO, the bad outweigh the good.
> 
> BTW, most dramatic on the other end of the spectrum is pgsql+oltp.  
> With preemption as is, it collapses as load climbs to heavy with 
> preemption knobs at stock.  Postgres uses user-land spinlocks and 
> _appears_ to wake others while these are still held.  For this load, 
> there is such a thing as too much short-term fairness, preempting lock 
> holder creates nasty gaggle of contended lock spinners.  It's curable 
> with knobs, and I think it's postgres's own fault, but may be wrong.
> 
> With that patch, pgsql+oltp scales perfectly.

hm, tempting.

Have you tried to hack/fix pgsql to do proper wakeups?

Right now pgsql punishes schedulers that preempt it while it is 
holding totally undeclared (to the kernel) user-space spinlocks ... 

Hence postgresql is rewarding a _bad_ scheduler policy in essence. And 
pgsql scalability seems to fall totally apart above 16 cpus - regardless 
of scheduler policy.

	Ingo


* Re: [PATCH 0/4] pending scheduler updates
  2008-10-22 12:10           ` Ingo Molnar
@ 2008-10-22 12:38             ` Mike Galbraith
  2008-10-22 12:42               ` Ingo Molnar
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Galbraith @ 2008-10-22 12:38 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Srivatsa Vaddagiri, Peter Zijlstra, LKML

On Wed, 2008-10-22 at 14:10 +0200, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:
> > 
> > > It has positive effects too, but IMHO, the bad outweigh the good.
> > 
> > BTW, most dramatic on the other end of the spectrum is pgsql+oltp.  
> > With preemption as is, it collapses as load climbs to heavy with 
> > preemption knobs at stock.  Postgres uses user-land spinlocks and 
> > _appears_ to wake others while these are still held.  For this load, 
> > there is such a thing as too much short-term fairness, preempting lock 
> > holder creates nasty gaggle of contended lock spinners.  It's curable 
> > with knobs, and I think it's postgres's own fault, but may be wrong.
> > 
> > With that patch, pgsql+oltp scales perfectly.
> 
> hm, tempting.

I disagree.  Postgres's scaling problem is trivially corrected by
twiddling knobs (or whatnot).  With that patch, you can't twiddle mysql
throughput back, or disk intensive loads for that matter.  You can tweak
the preempt number, but it has nothing to do with lag, so anybody can
preempt anybody else as you turn the knob toward zero.  Chaos.
  
> Have you tried to hack/fix pgsql to do proper wakeups?

No, I tried to build without spinlocks to verify, but build croaked.
Never went back to slogging through the code.

> Right now pgsql punishes schedulers that preempt it while it is 
> holding totally undeclared (to the kernel) user-space spinlocks ... 
> 
> Hence postgresql is rewarding a _bad_ scheduler policy in essence. And 
> pgsql scalability seems to fall totally apart above 16 cpus - regardless 
> of scheduler policy.

If someone gives me that problem, and a credit card for electric
company, I'll do my very extra special best to defeat it ;-)

	-Mike



* Re: [PATCH 0/4] pending scheduler updates
  2008-10-22 12:38             ` Mike Galbraith
@ 2008-10-22 12:42               ` Ingo Molnar
  2008-10-22 13:05                 ` Mike Galbraith
  0 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2008-10-22 12:42 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Srivatsa Vaddagiri, Peter Zijlstra, LKML


* Mike Galbraith <efault@gmx.de> wrote:

> > > With that patch, pgsql+oltp scales perfectly.
> > 
> > hm, tempting.
> 
> I disagree.  Postgres's scaling problem is trivially corrected by 
> twiddling knobs (or whatnot). [...]

okay, then we need to document it a bit more: what knobs need twiddling 
to make it scale perfectly?

> [...]  With that patch, you can't twiddle mysql throughput back, or 
> disk intensive loads for that matter.  You can tweak the preempt 
> number, but it has nothing to do with lag, so anybody can preempt 
> anybody else as you turn the knob toward zero.  Chaos.

okay, convinced.

> > Have you tried to hack/fix pgsql to do proper wakeups?
> 
> No, I tried to build without spinlocks to verify, but build croaked. 
> Never went back to slogging through the code.

if it falls back to IPC semaphores that's a bad trade from a performance 
POV. The best would be if it used proper futexes (i.e. pthread_mutex() 
and friends) not some home-grown user-space spinlock thing.
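
To make that concrete, here is a minimal sketch of the two locking styles
being contrasted (illustrative only, not postgres code; names and structure
are made up, build with gcc -pthread):

#include <pthread.h>
#include <stdio.h>

/*
 * Home-grown user-space spinlock: the kernel knows nothing about it, so
 * if the holder gets preempted, every contending task just burns its
 * timeslice spinning on the flag.
 */
static volatile int spinlock_flag;

static void spin_lock_user(void)
{
        while (__sync_lock_test_and_set(&spinlock_flag, 1))
                ; /* busy-wait */
}

static void spin_unlock_user(void)
{
        __sync_lock_release(&spinlock_flag);
}

/*
 * futex-backed pthread mutex: contended waiters sleep in the kernel
 * instead of spinning, so preempting the holder does not leave a pile
 * of tasks burning cpu on a lock they cannot take.
 */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static long counter;

static void *worker(void *arg)
{
        int use_mutex = *(int *)arg;
        int i;

        for (i = 0; i < 100000; i++) {
                if (use_mutex)
                        pthread_mutex_lock(&mutex);
                else
                        spin_lock_user();

                counter++; /* critical section */

                if (use_mutex)
                        pthread_mutex_unlock(&mutex);
                else
                        spin_unlock_user();
        }
        return NULL;
}

int main(void)
{
        pthread_t t[4];
        int use_mutex = 1; /* flip to 0 to exercise the spinlock variant */
        int i;

        for (i = 0; i < 4; i++)
                pthread_create(&t[i], NULL, worker, &use_mutex);
        for (i = 0; i < 4; i++)
                pthread_join(t[i], NULL);

        printf("counter = %ld\n", counter);
        return 0;
}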

	Ingo


* Re: [PATCH 0/4] pending scheduler updates
  2008-10-22 12:42               ` Ingo Molnar
@ 2008-10-22 13:05                 ` Mike Galbraith
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Galbraith @ 2008-10-22 13:05 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Srivatsa Vaddagiri, Peter Zijlstra, LKML

On Wed, 2008-10-22 at 14:42 +0200, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > > > With that patch, pgsql+oltp scales perfectly.
> > > 
> > > hm, tempting.
> > 
> > I disagree.  Postgres's scaling problem is trivially corrected by 
> > twiddling knobs (or whatnot). [...]
> 
> okay, then we need to document it a bit more: what knobs need twiddling 
> to make it scale perfectly?

Twiddle sched_wakeup_granularity_ns or just turn FAIR_SLEEPERS off for
best pgsql+oltp scalability.  Pgsql+oltp collapse is delicate.  It
doesn't take much to send it one way or the other.  Cut preempt a wee
bit, and it can snap right into shape.

I think we need to penalize sleepers a wee bit as load climbs in general
though; everybody begins to like preemption less as load increases.
Mysql+oltp improves as well, and it loves preempt at modest load.

	-Mike



* Re: [PATCH 0/4] pending scheduler updates
  2008-10-22 10:32         ` Mike Galbraith
  2008-10-22 12:10           ` Ingo Molnar
@ 2008-10-22 17:38           ` Peter Zijlstra
  2008-10-22 17:56             ` Mike Galbraith
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2008-10-22 17:38 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, Srivatsa Vaddagiri, LKML

On Wed, 2008-10-22 at 12:32 +0200, Mike Galbraith wrote:
> On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:
> 
> > It has positive effects too, but IMHO, the bad outweigh the good.
> 
> BTW, most dramatic on the other end of the spectrum is pgsql+oltp.  With
> preemption as is, it collapses as load climbs to heavy with preemption
> knobs at stock.  Postgres uses user-land spinlocks and _appears_ to wake
> others while these are still held.  For this load, there is such a thing
> as too much short-term fairness, preempting lock holder creates nasty
> gaggle of contended lock spinners.  It's curable with knobs, and I think
> it's postgres's own fault, but may be wrong.
> 
> With that patch, pgsql+oltp scales perfectly.

Are we talking about this patch, which re-instates the vruntime based
wakeup-preemption ?

---
Subject: sched: fix wakeup preemption
From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |   98 ++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 92 insertions(+), 6 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -143,6 +143,49 @@ static inline struct sched_entity *paren
 	return se->parent;
 }
 
+/* return depth at which a sched entity is present in the hierarchy */
+static inline int depth_se(struct sched_entity *se)
+{
+	int depth = 0;
+
+	for_each_sched_entity(se)
+		depth++;
+
+	return depth;
+}
+
+static void
+find_matching_se(struct sched_entity **se, struct sched_entity **pse)
+{
+	int se_depth, pse_depth;
+
+	/*
+	 * preemption test can be made between sibling entities who are in the
+	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
+	 * both tasks until we find their ancestors who are siblings of common
+	 * parent.
+	 */
+
+	/* First walk up until both entities are at same depth */
+	se_depth = depth_se(*se);
+	pse_depth = depth_se(*pse);
+
+	while (se_depth > pse_depth) {
+		se_depth--;
+		*se = parent_entity(*se);
+	}
+
+	while (pse_depth > se_depth) {
+		pse_depth--;
+		*pse = parent_entity(*pse);
+	}
+
+	while (!is_same_group(*se, *pse)) {
+		*se = parent_entity(*se);
+		*pse = parent_entity(*pse);
+	}
+}
+
 #else	/* CONFIG_FAIR_GROUP_SCHED */
 
 static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
@@ -193,6 +236,11 @@ static inline struct sched_entity *paren
 	return NULL;
 }
 
+static inline void
+find_matching_se(struct sched_entity **se, struct sched_entity **pse)
+{
+}
+
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
 
@@ -1244,13 +1292,42 @@ static unsigned long wakeup_gran(struct 
 	 * More easily preempt - nice tasks, while not making it harder for
 	 * + nice tasks.
 	 */
-	if (sched_feat(ASYM_GRAN))
-		gran = calc_delta_mine(gran, NICE_0_LOAD, &se->load);
+	if (!sched_feat(ASYM_GRAN) || se->load.weight > NICE_0_LOAD)
+		gran = calc_delta_fair(sysctl_sched_wakeup_granularity, se);
 
 	return gran;
 }
 
 /*
+ * Should 'se' preempt 'curr'.
+ *
+ *             |s1
+ *        |s2
+ *   |s3
+ *         g
+ *      |<--->|c
+ *
+ *  w(c, s1) = -1
+ *  w(c, s2) =  0
+ *  w(c, s3) =  1
+ *
+ */
+static int
+wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
+{
+	s64 gran, vdiff = curr->vruntime - se->vruntime;
+
+	if (vdiff < 0)
+		return -1;
+
+	gran = wakeup_gran(curr);
+	if (vdiff > gran)
+		return 1;
+
+	return 0;
+}
+
+/*
  * Preempt the current task with a newly woken task if needed:
  */
 static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync)
@@ -1258,7 +1335,6 @@ static void check_preempt_wakeup(struct 
 	struct task_struct *curr = rq->curr;
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	struct sched_entity *se = &curr->se, *pse = &p->se;
-	s64 delta_exec;
 
 	if (unlikely(rt_prio(p->prio))) {
 		update_rq_clock(rq);
@@ -1296,9 +1372,19 @@ static void check_preempt_wakeup(struct 
 		return;
 	}
 
-	delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
-	if (delta_exec > wakeup_gran(pse))
-		resched_task(curr);
+	find_matching_se(&se, &pse);
+
+	while (se) {
+		BUG_ON(!pse);
+
+		if (wakeup_preempt_entity(se, pse) == 1) {
+			resched_task(curr);
+			break;
+		}
+
+		se = parent_entity(se);
+		pse = parent_entity(pse);
+	}
 }
 
 static struct task_struct *pick_next_task_fair(struct rq *rq)




* Re: [PATCH 0/4] pending scheduler updates
  2008-10-22 17:38           ` Peter Zijlstra
@ 2008-10-22 17:56             ` Mike Galbraith
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Galbraith @ 2008-10-22 17:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Srivatsa Vaddagiri, LKML

On Wed, 2008-10-22 at 19:38 +0200, Peter Zijlstra wrote:
> On Wed, 2008-10-22 at 12:32 +0200, Mike Galbraith wrote:
> > On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:
> > 
> > > It has positive effects too, but IMHO, the bad outweigh the good.
> > 
> > BTW, most dramatic on the other end of the spectrum is pgsql+oltp.  With
> > preemption as is, it collapses as load climbs to heavy with preemption
> > knobs at stock.  Postgres uses user-land spinlocks and _appears_ to wake
> > others while these are still held.  For this load, there is such a thing
> > as too much short-term fairness, preempting lock holder creates nasty
> > gaggle of contended lock spinners.  It's curable with knobs, and I think
> > it's postgres's own fault, but may be wrong.
> > 
> > With that patch, pgsql+oltp scales perfectly.
> 
> Are we talking about this patch, which re-instates the vruntime based
> wakeup-preemption ?

No, if it was that one, I'd be tinkering with mysql+oltp.  Everything
else I tested (limited time, but fairly wide spectrum) with that patch
was fine, including interactivity.  Caveat: tbench/netperf test results
I'm not comfortable with, would need to backport to 26 to feel at all
confident with those.  (fwtw)

	-Mike



Thread overview: 17+ messages
2008-10-17 17:27 [PATCH 0/4] pending scheduler updates Peter Zijlstra
2008-10-17 17:27 ` [PATCH 1/4] sched: optimize group load balancer Peter Zijlstra
2008-10-17 17:27 ` [PATCH 2/4] sched: fair scheduler should not resched rt tasks Peter Zijlstra
2008-10-17 17:27 ` [PATCH 3/4] sched: revert back to per-rq vruntime Peter Zijlstra
2008-10-17 17:27 ` [PATCH 4/4] sched: fix wakeup preemption Peter Zijlstra
2008-10-20 21:57   ` Chris Friesen
2008-10-20 12:05 ` [PATCH 0/4] pending scheduler updates Ingo Molnar
2008-10-21 17:35   ` Srivatsa Vaddagiri
2008-10-22  9:40     ` Ingo Molnar
2008-10-22 10:03       ` Mike Galbraith
2008-10-22 10:32         ` Mike Galbraith
2008-10-22 12:10           ` Ingo Molnar
2008-10-22 12:38             ` Mike Galbraith
2008-10-22 12:42               ` Ingo Molnar
2008-10-22 13:05                 ` Mike Galbraith
2008-10-22 17:38           ` Peter Zijlstra
2008-10-22 17:56             ` Mike Galbraith
