LKML Archive on lore.kernel.org
* [patch] restore sched_exec load balance heuristics

From: Ken Chen <kenchen@google.com> @ 2008-11-06 19:40 UTC
To: Ingo Molnar; +Cc: Linux Kernel Mailing List

We've seen a long-standing performance regression on sys_execve for
several upstream kernels, largely on workloads that do heavy execve.
The main reason for the regression was a change in the sched_exec load
balance heuristics.  For example, on the 2.6.11 kernel, the "exec" task
will run on the same cpu if that is the only task running.  However,
2.6.13 and onward kernels go around the sched-domain looking for the
most idle CPU (which doesn't treat the exec'ing task's CPU as idle),
thus bouncing the exec'ing task all over the place, which leads to poor
CPU cache and NUMA locality.  (The workload happens to share common
data between subsequently exec'd programs.)

This execve heuristic was removed from the upstream kernel by this git
commit:

  commit 68767a0ae428801649d510d9a65bb71feed44dd1
  Author: Nick Piggin <nickpiggin@yahoo.com.au>
  Date:   Sat Jun 25 14:57:20 2005 -0700

      [PATCH] sched: schedstats update for balance on fork
      Add SCHEDSTAT statistics for sched-balance-fork.

From the commit description, it appears that deleting the heuristics
was an accident, as the commit is supposedly just for schedstats.

So, restore the sched-exec load balance heuristic: skip balancing when
the exec'ing task is the only task running on that specific CPU.  The
logic makes sense: a newly exec'd program should continue to run on the
current CPU - moving it doesn't change any load imbalance, nor does
bouncing to another idle CPU help anything.  Keeping it on the same CPU
preserves cache and NUMA locality.

Signed-off-by: Ken Chen <kenchen@google.com>

diff --git a/kernel/sched.c b/kernel/sched.c
index e8819bc..4ad1907 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2873,7 +2873,12 @@ out:
  */
 void sched_exec(void)
 {
-	int new_cpu, this_cpu = get_cpu();
+	int new_cpu, this_cpu;
+
+	if (this_rq()->nr_running <= 1)
+		return;
+
+	this_cpu = get_cpu();
 	new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC);
 	put_cpu();
 	if (new_cpu != this_cpu)
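To make the regression scenario concrete, here is a minimal userspace
sketch of an exec-heavy workload in the spirit of the "exec hopper"
testcase mentioned later in this thread - illustrative only, not the
actual testcase; the iteration count and the use of /proc/self/exe are
assumptions:

/* Hypothetical exec-hopper microbenchmark (not the testcase from the
 * report): each iteration re-execs the program and reports which CPU
 * the fresh image landed on.  With the old heuristic removed, the
 * printed CPU tends to change from exec to exec even on an idle box. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int iter = argc > 1 ? atoi(argv[1]) : 0;
	char buf[16];

	printf("iteration %d on cpu %d\n", iter, sched_getcpu());

	if (iter >= 1000)
		return 0;

	/* Replace ourselves with the next iteration. */
	snprintf(buf, sizeof(buf), "%d", iter + 1);
	execl("/proc/self/exe", argv[0], buf, (char *)NULL);
	perror("execl");
	return 1;
}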
* Re: [patch] restore sched_exec load balance heuristics

From: Ingo Molnar @ 2008-11-06 20:07 UTC
To: Ken Chen; +Cc: Linux Kernel Mailing List, Peter Zijlstra, Mike Galbraith

* Ken Chen <kenchen@google.com> wrote:

> We've seen a long-standing performance regression on sys_execve for
> several upstream kernels, largely on workloads that do heavy execve.
> The main reason for the regression was a change in the sched_exec
> load balance heuristics.  For example, on the 2.6.11 kernel, the
> "exec" task will run on the same cpu if that is the only task
> running.  However, 2.6.13 and onward kernels go around the
> sched-domain looking for the most idle CPU (which doesn't treat the
> exec'ing task's CPU as idle), thus bouncing the exec'ing task all
> over the place, which leads to poor CPU cache and NUMA locality.
> (The workload happens to share common data between subsequently
> exec'd programs.)
>
> This execve heuristic was removed from the upstream kernel by this
> git commit:
>
>   commit 68767a0ae428801649d510d9a65bb71feed44dd1
>   Author: Nick Piggin <nickpiggin@yahoo.com.au>
>   Date:   Sat Jun 25 14:57:20 2005 -0700
>
>       [PATCH] sched: schedstats update for balance on fork
>       Add SCHEDSTAT statistics for sched-balance-fork.
>
> From the commit description, it appears that deleting the heuristics
> was an accident, as the commit is supposedly just for schedstats.
>
> So, restore the sched-exec load balance heuristic: skip balancing
> when the exec'ing task is the only task running on that specific
> CPU.  The logic makes sense: a newly exec'd program should continue
> to run on the current CPU - moving it doesn't change any load
> imbalance, nor does bouncing to another idle CPU help anything.
> Keeping it on the same CPU preserves cache and NUMA locality.
>
> Signed-off-by: Ken Chen <kenchen@google.com>
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index e8819bc..4ad1907 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2873,7 +2873,12 @@ out:
>   */
>  void sched_exec(void)
>  {
> -	int new_cpu, this_cpu = get_cpu();
> +	int new_cpu, this_cpu;
> +
> +	if (this_rq()->nr_running <= 1)
> +		return;
> +
> +	this_cpu = get_cpu();
>  	new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC);
>  	put_cpu();
>  	if (new_cpu != this_cpu)

ok, this should be solved - but rather at the level of
sched_balance_self(): it should never migrate this task over to
another cpu, it should take away this task's load from the current
CPU's load when considering migration.

	Ingo
* Re: [patch] restore sched_exec load balance heuristics

From: Ken Chen @ 2008-11-06 20:32 UTC
To: Ingo Molnar; +Cc: Linux Kernel Mailing List, Peter Zijlstra, Mike Galbraith

On Thu, Nov 6, 2008 at 12:07 PM, Ingo Molnar <mingo@elte.hu> wrote:
> ok, this should be solved - but rather at the level of
> sched_balance_self(): it should never migrate this task over to
> another cpu, it should take away this task's load from the current
> CPU's load when considering migration.

There are two callers of sched_balance_self().  In the sched_fork
path, sched_balance_self() will balance the newly forked task.  I
think it is OK to bounce a newly forked task to another CPU, since the
current CPU will be busy when fork returns in the parent process.

And if sched_balance_self() needs to differentiate between fork and
exec load balancing, it has to check a flag passed in as a function
argument; I think it is better to just short-circuit in sched_exec()
directly.

- Ken
* Re: [patch] restore sched_exec load balance heuristics

From: Ingo Molnar @ 2008-11-06 20:38 UTC
To: Ken Chen; +Cc: Linux Kernel Mailing List, Peter Zijlstra, Mike Galbraith

* Ken Chen <kenchen@google.com> wrote:

> On Thu, Nov 6, 2008 at 12:07 PM, Ingo Molnar <mingo@elte.hu> wrote:
> > ok, this should be solved - but rather at the level of
> > sched_balance_self(): it should never migrate this task over to
> > another cpu, it should take away this task's load from the current
> > CPU's load when considering migration.
>
> There are two callers of sched_balance_self().  In the sched_fork
> path, sched_balance_self() will balance the newly forked task.  I
> think it is OK to bounce a newly forked task to another CPU, since
> the current CPU will be busy when fork returns in the parent
> process.
>
> And if sched_balance_self() needs to differentiate between fork and
> exec load balancing, it has to check a flag passed in as a function
> argument; I think it is better to just short-circuit in sched_exec()
> directly.

yes, but the problem is deeper than that, and your fix only addresses
the most obvious case: when a single task is exec()-ing.

But if we exec while there are two tasks on this CPU, and one task on
every other CPU, we bounce around the "new" task unnecessarily just as
much.

So the best solution is to pass in not a flag, but a 'load bias'
offset - which is 0 in the fork case and -self-weight in the exec
case.  Ok?

	Ingo
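The 'load bias' idea can be sketched in a few lines of C - an
illustrative sketch only, where cpu_load() is a hypothetical
placeholder rather than the actual kernel API:

/* Minimal sketch of the 'load bias' idea, assuming a hypothetical
 * cpu_load() helper; not actual kernel code. */
static unsigned long biased_load(int cpu, int this_cpu,
				 unsigned long bias)
{
	unsigned long load = cpu_load(cpu);	/* hypothetical helper */

	/*
	 * Discount the exec'ing task on its own CPU: its old image
	 * vanishes at exec time, so it should not count against
	 * staying put.  bias is 0 for fork, and the task's own
	 * weight for exec.
	 */
	if (cpu == this_cpu)
		load -= (bias < load) ? bias : load;

	return load;
}

In the two-tasks-here, one-task-everywhere example above: without the
bias, the exec'ing task's CPU compares as load 2 against 1 everywhere
else and the new image gets bounced; with a bias equal to the task's
own weight, it compares as 1 against 1 and the task stays put.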
* Re: [patch] restore sched_exec load balance heuristics

From: Chris Friesen @ 2008-11-06 20:49 UTC
To: Ken Chen
Cc: Ingo Molnar, Linux Kernel Mailing List, Peter Zijlstra, Mike Galbraith

Ken Chen wrote:

> There are two callers of sched_balance_self().  In the sched_fork
> path, sched_balance_self() will balance the newly forked task.  I
> think it is OK to bounce a newly forked task to another CPU, since
> the current CPU will be busy when fork returns in the parent
> process.

What about vfork()?

> And if sched_balance_self() needs to differentiate between fork and
> exec load balancing, it has to check a flag passed in as a function
> argument; I think it is better to just short-circuit in sched_exec()
> directly.

From a cleanliness perspective, it makes more sense to me for the
decision as to whether or not to balance to be made in the "balance"
function, not the "exec" function.

Chris
* Re: [patch] restore sched_exec load balance heuristics

From: Peter Zijlstra @ 2008-11-10 8:50 UTC
To: Ingo Molnar; +Cc: Ken Chen, Linux Kernel Mailing List, Mike Galbraith

On Thu, 2008-11-06 at 21:07 +0100, Ingo Molnar wrote:
> ok, this should be solved - but rather at the level of
> sched_balance_self(): it should never migrate this task over to
> another cpu, it should take away this task's load from the current
> CPU's load when considering migration.

How's this? (compile tested only)

---
Subject: sched: fix sched_exec
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon Nov 10 09:47:54 CET 2008

When deciding placement for an execve() task, subtract the effect of
the current task.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2035,7 +2035,8 @@ static unsigned long target_load(int cpu
  * domain.
  */
 static struct sched_group *
-find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
+find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu,
+		  int org_cpu, unsigned long org_weight)
 {
 	struct sched_group *idlest = NULL, *this = NULL, *group = sd->groups;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
@@ -2063,6 +2064,9 @@ find_idlest_group(struct sched_domain *s
 		else
 			load = target_load(i, load_idx);
 
+		if (i == org_cpu)
+			load -= min(org_weight, load);
+
 		avg_load += load;
 	}
 
@@ -2089,7 +2093,7 @@ find_idlest_group(struct sched_domain *s
  */
 static int
 find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu,
-		cpumask_t *tmp)
+		cpumask_t *tmp, int org_cpu, unsigned long org_weight)
 {
 	unsigned long load, min_load = ULONG_MAX;
 	int idlest = -1;
@@ -2101,6 +2105,9 @@ find_idlest_cpu(struct sched_grou
 	for_each_cpu_mask_nr(i, *tmp) {
 		load = weighted_cpuload(i);
 
+		if (i == org_cpu)
+			load -= org_weight;
+
 		if (load < min_load || (load == min_load && i == this_cpu)) {
 			min_load = load;
 			idlest = i;
@@ -2121,10 +2128,11 @@ find_idlest_cpu(struct sched_grou
  *
  * preempt must be disabled.
  */
-static int sched_balance_self(int cpu, int flag)
+static int sched_balance_self(int cpu, int flag, unsigned long org_weight)
 {
 	struct task_struct *t = current;
 	struct sched_domain *tmp, *sd = NULL;
+	int org_cpu = cpu;
 
 	for_each_domain(cpu, tmp) {
 		/*
@@ -2150,13 +2158,14 @@ static int sched_balance_self(int cpu, i
 		}
 
 		span = sd->span;
-		group = find_idlest_group(sd, t, cpu);
+		group = find_idlest_group(sd, t, cpu, org_cpu, org_weight);
 		if (!group) {
 			sd = sd->child;
 			continue;
 		}
 
-		new_cpu = find_idlest_cpu(group, t, cpu, &tmpmask);
+		new_cpu = find_idlest_cpu(group, t, cpu, &tmpmask,
+					  org_cpu, org_weight);
 		if (new_cpu == -1 || new_cpu == cpu) {
 			/* Now try balancing at a lower domain level of cpu */
 			sd = sd->child;
@@ -2365,7 +2374,7 @@ void sched_fork(struct task_struct *p, i
 	__sched_fork(p);
 
 #ifdef CONFIG_SMP
-	cpu = sched_balance_self(cpu, SD_BALANCE_FORK);
+	cpu = sched_balance_self(cpu, SD_BALANCE_FORK, 0);
 #endif
 	set_task_cpu(p, cpu);
 
@@ -2856,7 +2865,14 @@ out:
 void sched_exec(void)
 {
 	int new_cpu, this_cpu = get_cpu();
-	new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC);
+	struct task_group *tg;
+	long weight, eload;
+
+	tg = task_group(current);
+	weight = current->se.load.weight;
+	eload = -effective_load(tg, this_cpu, -weight, -weight);
+
+	new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC, eload);
 	put_cpu();
 	if (new_cpu != this_cpu)
 		sched_migrate_task(current, new_cpu);
* Re: [patch] restore sched_exec load balance heuristics

From: Ingo Molnar @ 2008-11-10 9:29 UTC
To: Peter Zijlstra, Ken Chen; +Cc: Linux Kernel Mailing List, Mike Galbraith

* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

>  void sched_exec(void)
>  {
>  	int new_cpu, this_cpu = get_cpu();
> -	new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC);
> +	struct task_group *tg;
> +	long weight, eload;
> +
> +	tg = task_group(current);
> +	weight = current->se.load.weight;
> +	eload = -effective_load(tg, this_cpu, -weight, -weight);
> +
> +	new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC, eload);

okay, i think this will work.

it feels somewhat backwards though on a conceptual level.

There's nothing particularly special about exec-balancing: the load
picture is in equilibrium - it is in essence a rebalancing pass done
not in the scheduler tick but in a special place in the middle of
exec(), where the old-task / new-task cross section is at a minimum
level.

_fork_ balancing is what is special: there we'll get a new context, so
we have to take the new load into account.  It's a bit like wakeup
balancing.  (just done before the new task is truly woken up)

OTOH, triggering the regular busy-balance at exec() time isn't totally
straightforward either: the 'old' task is the current task, so it
cannot be balanced away.  We have to trigger all the active-migration
logic - which again makes exec() balancing special.

So maybe this patch is the best solution after all.

Ken, does it do the trick for your workload, when applied against
v2.6.28-rc4?

You might even try to confirm that your testcase still works fine if
you elevate the load average by +1.0 on every cpu, by starting
infinite CPU-eater loops on every CPU via this bash one-liner:

  for ((i=0;i<2;i++)); do while :; do :; done & done

(change the '2' to '4' if you test this on a quad, not on a dual-core
box)

the desired behavior would be for your "exec hopper" testcase to not
hop between cpus, but to stick to the same CPU most of the time.

	Ingo
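The same +1.0-per-CPU background load can also be generated from C,
with each spinner pinned to its CPU - the pinning is an assumption
made here for determinism; the bash one-liner above leaves placement
to the balancer:

/* Hypothetical pinned CPU-eater: fork one spinner per online CPU and
 * pin it there, so every CPU carries exactly +1.0 load while the
 * exec-hopper test runs.  Illustrative; not part of the suggestion
 * above. */
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

int main(void)
{
	long i, ncpus = sysconf(_SC_NPROCESSORS_ONLN);

	for (i = 0; i < ncpus; i++) {
		if (fork() == 0) {
			cpu_set_t set;

			CPU_ZERO(&set);
			CPU_SET(i, &set);
			sched_setaffinity(0, sizeof(set), &set);
			for (;;)
				;	/* burn CPU forever */
		}
	}
	pause();	/* parent sleeps; kill the process group to stop */
	return 0;
}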
* Re: [patch] restore sched_exec load balance heuristics

From: Peter Zijlstra @ 2008-11-10 12:54 UTC
To: Ingo Molnar; +Cc: Ken Chen, Linux Kernel Mailing List, Mike Galbraith

On Mon, 2008-11-10 at 10:29 +0100, Ingo Molnar wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> >  void sched_exec(void)
> >  {
> >  	int new_cpu, this_cpu = get_cpu();
> > -	new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC);
> > +	struct task_group *tg;
> > +	long weight, eload;
> > +
> > +	tg = task_group(current);
> > +	weight = current->se.load.weight;
> > +	eload = -effective_load(tg, this_cpu, -weight, -weight);
> > +
> > +	new_cpu = sched_balance_self(this_cpu, SD_BALANCE_EXEC, eload);
>
> okay, i think this will work.
>
> it feels somewhat backwards though on a conceptual level.
>
> There's nothing particularly special about exec-balancing: the load
> picture is in equilibrium - it is in essence a rebalancing pass done
> not in the scheduler tick but in a special place in the middle of
> exec(), where the old-task / new-task cross section is at a minimum
> level.
>
> _fork_ balancing is what is special: there we'll get a new context,
> so we have to take the new load into account.  It's a bit like
> wakeup balancing.  (just done before the new task is truly woken up)
>
> OTOH, triggering the regular busy-balance at exec() time isn't
> totally straightforward either: the 'old' task is the current task,
> so it cannot be balanced away.  We have to trigger all the
> active-migration logic - which again makes exec() balancing special.
>
> So maybe this patch is the best solution after all.

Even worse: you want to balance current, but the generic load balancer
might pick two cpus to balance, neither of which has current on it.

And even if it did pick the queue with current on it as busiest, there
is no guarantee you'd actually end up moving current.

So this specialized form of moving current to a possibly more idle cpu
is, afaics, the best solution for balancing a particular task.