LKML Archive on lore.kernel.org
* [wake_afine fixes/improvements 0/3] Introduction
@ 2011-01-15 1:57 Paul Turner
2011-01-15 1:57 ` [wake_afine fixes/improvements 1/3] sched: update effective_load() to use global share weights Paul Turner
` (4 more replies)
0 siblings, 5 replies; 12+ messages in thread
From: Paul Turner @ 2011-01-15 1:57 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, Mike Galbraith, Nick Piggin,
Srivatsa Vaddagiri
I've been looking at the wake_affine path to improve the group scheduling case
(wake affine performance for fair group sched has historically lagged) as well
as tweaking performance in general.
The current series of patches is attached, the first of which should probably be
considered for 2.6.38 since it fixes a bug/regression in the case of waking up
onto a previously (group) empty cpu; the others are more forward looking.
I've been using an rpc ping-pong workload which is known to be sensitive to poor
affine decisions to benchmark these changes; I'm happy to run these patches against
other workloads. In particular, improvements on reaim have been demonstrated,
but since it's not as stable a benchmark the numbers are harder to present in
a representative fashion. Suggestions/pet benchmarks greatly appreciated
here.
Some other things experimented with (but didn't pan out as a performance win):
- Considering instantaneous load on prev_cpu as well as current_cpu
- Using more gentle wl/wg values to reflect that a task's contribution to
load_contribution is likely less than its weight.
Performance:
(throughput is measured in txn/s across a 5-minute interval, with a 30-second
warmup)
tip (no group scheduling):
throughput=57798.701988 reqs/sec.
throughput=58098.876188 reqs/sec.
tip: (autogroup + current shares code and associated broken effective_load)
throughput=49824.283179 reqs/sec.
throughput=48527.942386 reqs/sec.
tip (autogroup + old tg_shares code): [parity goal post]
throughput=57846.575060 reqs/sec.
throughput=57626.442034 reqs/sec.
tip (autogroup + effective_load rewrite):
throughput=58534.073595 reqs/sec.
throughput=58068.072052 reqs/sec.
tip (autogroup + effective_load + no affine moves for hot tasks):
throughput=60907.794697 reqs/sec.
throughput=61208.305629 reqs/sec.
Thanks,
- Paul
* [wake_afine fixes/improvements 1/3] sched: update effective_load() to use global share weights
2011-01-15 1:57 [wake_afine fixes/improvements 0/3] Introduction Paul Turner
@ 2011-01-15 1:57 ` Paul Turner
2011-01-17 14:11 ` Peter Zijlstra
2011-01-18 19:04 ` [tip:sched/urgent] sched: Update " tip-bot for Paul Turner
2011-01-15 1:57 ` [wake_afine fixes/improvements 2/3] sched: clean up task_hot() Paul Turner
` (3 subsequent siblings)
4 siblings, 2 replies; 12+ messages in thread
From: Paul Turner @ 2011-01-15 1:57 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, Mike Galbraith, Nick Piggin,
Srivatsa Vaddagiri
[-- Attachment #1: fix_wake_affine.patch --]
[-- Type: text/plain, Size: 1907 bytes --]
Previously effective_load would approximate the global load weight present on
a group by taking advantage of:
entity_weight = tg->shares * (lw / global_lw), where entity_weight was provided
by tg_shares_up.
This worked (approximately) for an 'empty' (at tg level) cpu since we would
place boost load representative of what a newly woken task would receive.
However, now that load is instantaneously updated this assumption is no longer
true and the load calculation is rather incorrect in this case.
Fix this (and improve the general case) by re-writing effective_load to take
advantage of the new shares distribution code.
Signed-off-by: Paul Turner <pjt@google.com>
---
kernel/sched_fair.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
Index: tip3/kernel/sched_fair.c
===================================================================
--- tip3.orig/kernel/sched_fair.c
+++ tip3/kernel/sched_fair.c
@@ -1362,27 +1362,27 @@ static long effective_load(struct task_g
return wl;
for_each_sched_entity(se) {
- long S, rw, s, a, b;
+ long lw, w;
- S = se->my_q->tg->shares;
- s = se->load.weight;
- rw = se->my_q->load.weight;
+ tg = se->my_q->tg;
+ w = se->my_q->load.weight;
- a = S*(rw + wl);
- b = S*rw + s*wg;
+ /* use this cpu's instantaneous contribution */
+ lw = atomic_read(&tg->load_weight);
+ lw -= se->my_q->load_contribution;
+ lw += w + wg;
- wl = s*(a-b);
+ wl += w;
- if (likely(b))
- wl /= b;
+ if (lw > 0 && wl < lw)
+ wl = (wl * tg->shares) / lw;
+ else
+ wl = tg->shares;
- /*
- * Assume the group is already running and will
- * thus already be accounted for in the weight.
- *
- * That is, moving shares between CPUs, does not
- * alter the group weight.
- */
+ /* zero point is MIN_SHARES */
+ if (wl < MIN_SHARES)
+ wl = MIN_SHARES;
+ wl -= se->load.weight;
wg = 0;
}
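
To make the new proportion concrete, here is a minimal user-space sketch of the
arithmetic a single iteration of the rewritten loop performs. The struct, the
helper name and the numbers in main() are invented for illustration; only the
formula mirrors the hunk above.

#include <stdio.h>

#define MIN_SHARES 2

/* one level of the group hierarchy, as seen from this cpu */
struct level {
	long shares;            /* tg->shares */
	long tg_load_weight;    /* atomic_read(&tg->load_weight) */
	long load_contribution; /* se->my_q->load_contribution */
	long cfs_weight;        /* se->my_q->load.weight */
	long se_weight;         /* se->load.weight */
};

/* propagate a (wl, wg) weight delta one level up, as the new loop body does */
static long effective_load_step(const struct level *l, long wl, long wg)
{
	/* use this cpu's instantaneous contribution to the group weight */
	long lw = l->tg_load_weight - l->load_contribution + l->cfs_weight + wg;

	wl += l->cfs_weight;

	if (lw > 0 && wl < lw)
		wl = (wl * l->shares) / lw;   /* our proportion of tg->shares */
	else
		wl = l->shares;               /* this cpu would own the whole group */

	if (wl < MIN_SHARES)                  /* zero point is MIN_SHARES */
		wl = MIN_SHARES;

	return wl - l->se_weight;             /* weight delta seen by the parent */
}

int main(void)
{
	/* hypothetical group: 1024 shares, 4096 total load, 1024 of it on this cpu */
	struct level l = {
		.shares            = 1024,
		.tg_load_weight    = 4096,
		.load_contribution = 1024,
		.cfs_weight        = 1024,
		.se_weight         = 256,
	};

	/* effect of waking another nice-0 task (weight 1024) onto this cpu */
	printf("wl seen at the parent level: %ld\n",
	       effective_load_step(&l, 1024, 1024));
	return 0;
}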
* [wake_afine fixes/improvements 2/3] sched: clean up task_hot()
2011-01-15 1:57 [wake_afine fixes/improvements 0/3] Introduction Paul Turner
2011-01-15 1:57 ` [wake_afine fixes/improvements 1/3] sched: update effective_load() to use global share weights Paul Turner
@ 2011-01-15 1:57 ` Paul Turner
2011-01-17 14:14 ` Peter Zijlstra
2011-01-15 1:57 ` [wake_afine fixes/improvements 3/3] sched: introduce sched_feat(NO_HOT_AFFINE) Paul Turner
` (2 subsequent siblings)
4 siblings, 1 reply; 12+ messages in thread
From: Paul Turner @ 2011-01-15 1:57 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, Mike Galbraith, Nick Piggin,
Srivatsa Vaddagiri
[-- Attachment #1: no_hot_sd.patch --]
[-- Type: text/plain, Size: 3081 bytes --]
We no longer compute per-domain migration costs or have use for task_hot()
external to the fair scheduling class.
Signed-off-by: Paul Turner <pjt@google.com>
---
kernel/sched.c | 35 -----------------------------------
kernel/sched_fair.c | 32 +++++++++++++++++++++++++++++++-
2 files changed, 31 insertions(+), 36 deletions(-)
Index: tip3/kernel/sched.c
===================================================================
--- tip3.orig/kernel/sched.c
+++ tip3/kernel/sched.c
@@ -1522,8 +1522,6 @@ static unsigned long power_of(int cpu)
return cpu_rq(cpu)->cpu_power;
}
-static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
-
static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -2061,38 +2059,6 @@ static void check_preempt_curr(struct rq
}
#ifdef CONFIG_SMP
-/*
- * Is this task likely cache-hot:
- */
-static int
-task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
-{
- s64 delta;
-
- if (p->sched_class != &fair_sched_class)
- return 0;
-
- if (unlikely(p->policy == SCHED_IDLE))
- return 0;
-
- /*
- * Buddy candidates are cache hot:
- */
- if (sched_feat(CACHE_HOT_BUDDY) && this_rq()->nr_running &&
- (&p->se == cfs_rq_of(&p->se)->next ||
- &p->se == cfs_rq_of(&p->se)->last))
- return 1;
-
- if (sysctl_sched_migration_cost == -1)
- return 1;
- if (sysctl_sched_migration_cost == 0)
- return 0;
-
- delta = now - p->se.exec_start;
-
- return delta < (s64)sysctl_sched_migration_cost;
-}
-
void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
#ifdef CONFIG_SCHED_DEBUG
@@ -9237,4 +9203,3 @@ struct cgroup_subsys cpuacct_subsys = {
.subsys_id = cpuacct_subsys_id,
};
#endif /* CONFIG_CGROUP_CPUACCT */
-
Index: tip3/kernel/sched_fair.c
===================================================================
--- tip3.orig/kernel/sched_fair.c
+++ tip3/kernel/sched_fair.c
@@ -1346,6 +1346,36 @@ static void task_waking_fair(struct rq *
se->vruntime -= cfs_rq->min_vruntime;
}
+/* is this task likely cache-hot */
+static int
+task_hot(struct task_struct *p, u64 now)
+{
+ s64 delta;
+
+ if (p->sched_class != &fair_sched_class)
+ return 0;
+
+ if (unlikely(p->policy == SCHED_IDLE))
+ return 0;
+
+ /*
+ * Buddy candidates are cache hot:
+ */
+ if (sched_feat(CACHE_HOT_BUDDY) && this_rq()->nr_running &&
+ (&p->se == cfs_rq_of(&p->se)->next ||
+ &p->se == cfs_rq_of(&p->se)->last))
+ return 1;
+
+ if (sysctl_sched_migration_cost == -1)
+ return 1;
+ if (sysctl_sched_migration_cost == 0)
+ return 0;
+
+ delta = now - p->se.exec_start;
+
+ return delta < (s64)sysctl_sched_migration_cost;
+}
+
#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* effective_load() calculates the load change as seen from the root_task_group
@@ -1954,7 +1984,7 @@ int can_migrate_task(struct task_struct
* 2) too many balance attempts have failed.
*/
- tsk_cache_hot = task_hot(p, rq->clock_task, sd);
+ tsk_cache_hot = task_hot(p, rq->clock_task);
if (!tsk_cache_hot ||
sd->nr_balance_failed > sd->cache_nice_tries) {
#ifdef CONFIG_SCHEDSTATS
* [wake_afine fixes/improvements 3/3] sched: introduce sched_feat(NO_HOT_AFFINE)
2011-01-15 1:57 [wake_afine fixes/improvements 0/3] Introduction Paul Turner
2011-01-15 1:57 ` [wake_afine fixes/improvements 1/3] sched: update effective_load() to use global share weights Paul Turner
2011-01-15 1:57 ` [wake_afine fixes/improvements 2/3] sched: clean up task_hot() Paul Turner
@ 2011-01-15 1:57 ` Paul Turner
2011-01-15 14:29 ` [wake_afine fixes/improvements 0/3] Introduction Mike Galbraith
2011-01-15 21:34 ` Nick Piggin
4 siblings, 0 replies; 12+ messages in thread
From: Paul Turner @ 2011-01-15 1:57 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, Mike Galbraith, Nick Piggin,
Srivatsa Vaddagiri
[-- Attachment #1: task_hot_lazy.patch --]
[-- Type: text/plain, Size: 2177 bytes --]
Re-introduce the cache-cold requirement for affine wake-up balancing.
A much more aggressive migration cost (currently 0.5ms) appears to have tilted
the needle towards not performing affine migrations for cache-hot
tasks.
Since the update_rq path is more expensive now (and the 'hot' window so small),
avoid hammering it in the common case where the (possibly slightly stale)
rq->clock_task value has already advanced enough to invalidate hot-ness.
Signed-off-by: Paul Turner <pjt@google.com>
---
kernel/sched_fair.c | 20 +++++++++++++++++++-
kernel/sched_features.h | 5 +++++
2 files changed, 24 insertions(+), 1 deletion(-)
Index: tip3/kernel/sched_fair.c
===================================================================
--- tip3.orig/kernel/sched_fair.c
+++ tip3/kernel/sched_fair.c
@@ -1376,6 +1376,23 @@ task_hot(struct task_struct *p, u64 now)
return delta < (s64)sysctl_sched_migration_cost;
}
+/*
+ * Since sched_migration_cost is (relatively) very small we only need to
+ * actually update the clock in the boundary case when determining whether a
+ * task is hot or not.
+ */
+static int task_hot_lazy(struct task_struct *p)
+{
+ struct rq *rq = task_rq(p);
+
+ if (!task_hot(p, rq->clock_task))
+ return 0;
+
+ update_rq_clock(rq);
+
+ return task_hot(p, rq->clock_task);
+}
+
#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* effective_load() calculates the load change as seen from the root_task_group
@@ -1664,7 +1681,8 @@ select_task_rq_fair(struct rq *rq, struc
int sync = wake_flags & WF_SYNC;
if (sd_flag & SD_BALANCE_WAKE) {
- if (cpumask_test_cpu(cpu, &p->cpus_allowed))
+ if (cpumask_test_cpu(cpu, &p->cpus_allowed) &&
+ (!sched_feat(NO_HOT_AFFINE) || !task_hot_lazy(p)))
want_affine = 1;
new_cpu = prev_cpu;
}
Index: tip3/kernel/sched_features.h
===================================================================
--- tip3.orig/kernel/sched_features.h
+++ tip3/kernel/sched_features.h
@@ -64,3 +64,8 @@ SCHED_FEAT(OWNER_SPIN, 1)
* Decrement CPU power based on irq activity
*/
SCHED_FEAT(NONIRQ_POWER, 1)
+
+/*
+ * Don't consider cache-hot tasks for affine wakeups
+ */
+SCHED_FEAT(NO_HOT_AFFINE, 1)
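
For reference, the "lazy" pattern task_hot_lazy() relies on can be sketched
outside the scheduler roughly as below. The clock source, the 0.5ms constant
and the helper names are stand-ins for illustration, not the kernel's
implementation; only the check-cheaply-then-refresh structure mirrors the patch.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define MIGRATION_COST_NS 500000ULL  /* ~0.5ms, mirroring sysctl_sched_migration_cost */

/* stand-in for update_rq_clock()/rq->clock_task */
static uint64_t read_clock_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static bool hot(uint64_t exec_start, uint64_t now)
{
	return now - exec_start < MIGRATION_COST_NS;
}

/*
 * cached_clock may lag behind the real clock.  If even the stale value says
 * the task has gone cold we are done cheaply; only in the boundary case do we
 * pay for a refresh and re-check.
 */
static bool hot_lazy(uint64_t exec_start, uint64_t *cached_clock)
{
	if (!hot(exec_start, *cached_clock))
		return false;

	*cached_clock = read_clock_ns();  /* the expensive part, done rarely */
	return hot(exec_start, *cached_clock);
}

int main(void)
{
	uint64_t clock = read_clock_ns();
	uint64_t exec_start = clock;      /* "ran" just now: should report hot */

	printf("hot (lazy): %d\n", hot_lazy(exec_start, &clock));
	return 0;
}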
* Re: [wake_afine fixes/improvements 0/3] Introduction
2011-01-15 1:57 [wake_afine fixes/improvements 0/3] Introduction Paul Turner
` (2 preceding siblings ...)
2011-01-15 1:57 ` [wake_afine fixes/improvements 3/3] sched: introduce sched_feat(NO_HOT_AFFINE) Paul Turner
@ 2011-01-15 14:29 ` Mike Galbraith
2011-01-15 19:29 ` Paul Turner
2011-01-15 21:34 ` Nick Piggin
4 siblings, 1 reply; 12+ messages in thread
From: Mike Galbraith @ 2011-01-15 14:29 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Nick Piggin,
Srivatsa Vaddagiri
On Fri, 2011-01-14 at 17:57 -0800, Paul Turner wrote:
> I've been looking at the wake_affine path to improve the group scheduling case
> (wake affine performance for fair group sched has historically lagged) as well
> as tweaking performance in general.
>
> The current series of patches is attached, the first of which should probably be
> considered for 2.6.38 since it fixes a bug/regression in the case of waking up
> onto a previously (group) empty cpu; the others are more forward looking.
>
> I've been using an rpc ping-pong workload which is known to be sensitive to poor
> affine decisions to benchmark these changes; I'm happy to run these patches against
> other workloads. In particular improvements on reaim have been demonstrated,
> but since it's not as stable a benchmark the numbers are harder to present in
> a representative fashion. Suggestions/pet benchmarks greatly appreciated
> here.
>
> Some other things experimented with (but didn't pan out as a performance win):
> - Considering instantaneous load on prev_cpu as well as current_cpu
> - Using more gentle wl/wg values to reflect that a task's contribution to
> load_contribution is likely less than its weight.
>
> Performance:
>
> (throughput is measured in txn/s across a 5-minute interval, with a 30-second
> warmup)
>
> tip (no group scheduling):
> throughput=57798.701988 reqs/sec.
> throughput=58098.876188 reqs/sec.
>
> tip: (autogroup + current shares code and associated broken effective_load)
> throughput=49824.283179 reqs/sec.
> throughput=48527.942386 reqs/sec.
>
> tip (autogroup + old tg_shares code): [parity goal post]
> throughput=57846.575060 reqs/sec.
> throughput=57626.442034 reqs/sec.
>
> tip (autogroup + effective_load rewrite):
> throughput=58534.073595 reqs/sec.
> throughput=58068.072052 reqs/sec.
>
> tip (autogroup + effective_load + no affine moves for hot tasks):
> throughput=60907.794697 reqs/sec.
> throughput=61208.305629 reqs/sec.
The effective_load() change is a humongous improvement for mysql+oltp.
The rest is iffy looking on my box with this load.
Looks like what will happen with NO_HOT_AFFINE is that if, say, two high-frequency
ping-pong players are perturbed such that one lands non-affine, it will
stay that way instead of recovering, because these will always be hot.
I haven't tested that though, pure rumination ;-)
mysql+oltp numbers
unpatched v2.6.37-7185-g52cfd50
clients              1        2        4        8       16       32       64      128      256
noautogroup   11084.37 20904.39 37356.65 36855.64 35395.45 35585.32 33343.44 28259.58 21404.18
              11025.94 20870.93 37272.99 36835.54 35367.92 35448.45 33422.20 28309.88 21285.18
              11076.00 20774.98 36847.44 36881.97 35295.35 35031.19 33490.84 28254.12 21307.13
1 avg         11062.10 20850.10 37159.02 36857.71 35352.90 35354.98 33418.82 28274.52 21332.16

autogroup     10963.27 20058.34 23567.63 29361.08 29111.98 29731.23 28563.18 24151.10 18163.00
              10754.92 19713.71 22983.43 28906.34 28576.12 30809.49 28384.14 24208.99 18057.34
              10990.27 19645.70 22193.71 29247.07 28763.53 30764.55 28912.45 24143.41 18002.07
2 avg         10902.82 19805.91 22914.92 29171.49 28817.21 30435.09 28619.92 24167.83 18074.13
                  .985     .949     .616     .791     .815     .860     .856     .854     .847

patched v2.6.37-7185-g52cfd50

noautogroup   11095.73 20794.49 37062.81 36611.92 35444.55 35468.36 33463.56 28236.18 21255.67
              11035.59 20649.44 37304.91 36878.34 35331.63 35248.05 33424.15 28147.17 21370.39
              11077.88 20653.92 37207.26 37047.54 35441.78 35445.02 33469.31 28050.80 21306.89
avg           11069.73 20699.28 37191.66 36845.93 35405.98 35387.14 33452.34 28144.71 21310.98
vs 1             1.000     .992    1.000     .999    1.001    1.000    1.001     .995     .999

noautogroup   10784.89 20304.49 37482.07 37251.63 35556.21 35116.93 32187.66 27839.60 21023.17
NO_HOT_AFFINE 10627.17 19835.43 37611.04 37168.37 35609.65 35289.32 32331.95 27598.50 21366.97
              10378.76 19998.29 37018.31 36888.67 35633.45 35277.39 32300.37 27896.24 21532.09
avg           10596.94 20046.07 37370.47 37102.89 35599.77 35227.88 32273.32 27778.11 21307.41
vs 1              .957     .961    1.005    1.006    1.006     .996     .965     .982     .998

autogroup     10452.16 19547.57 36082.97 36653.02 35251.51 34099.80 31226.18 27274.91 20927.65
              10586.36 19931.37 36928.99 36640.64 35604.17 34238.38 31528.80 27412.44 20874.03
              10472.72 20143.83 36407.91 36715.85 35481.78 34332.42 31612.57 27357.18 21018.63
3 avg         10503.74 19874.25 36473.29 36669.83 35445.82 34223.53 31455.85 27348.17 20940.10
vs 1              .949     .953     .981     .994    1.002     .967     .941     .967     .981
vs 2              .963    1.003    1.591    1.257    1.230    1.124    1.099    1.131    1.158

autogroup     10276.41 19642.90 36790.86 36575.28 35326.89 34094.66 31626.82 27185.72 21017.51
NO_HOT_AFFINE 10305.91 20027.66 37017.90 36814.35 35452.63 34268.32 31399.49 27353.71 21039.37
              11013.96 19977.08 36984.17 36661.80 35393.99 34141.05 31246.47 26960.48 20873.94
avg           10532.09 19882.54 36930.97 36683.81 35391.17 34168.01 31424.26 27166.63 20976.94
vs 1              .952     .953     .993     .995    1.001     .966     .940     .960     .983
vs 2              .965    1.003    1.611    1.257    1.228    1.122    1.097    1.124    1.160
vs 3             1.002    1.000    1.012    1.000     .998     .998     .998     .993    1.001
* Re: [wake_afine fixes/improvements 0/3] Introduction
2011-01-15 14:29 ` [wake_afine fixes/improvements 0/3] Introduction Mike Galbraith
@ 2011-01-15 19:29 ` Paul Turner
0 siblings, 0 replies; 12+ messages in thread
From: Paul Turner @ 2011-01-15 19:29 UTC (permalink / raw)
To: Mike Galbraith
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Nick Piggin,
Srivatsa Vaddagiri
On Sat, Jan 15, 2011 at 6:29 AM, Mike Galbraith <efault@gmx.de> wrote:
> On Fri, 2011-01-14 at 17:57 -0800, Paul Turner wrote:
>> I've been looking at the wake_affine path to improve the group scheduling case
>> (wake affine performance for fair group sched has historically lagged) as well
>> as tweaking performance in general.
>>
>> The current series of patches is attached, the first of which should probably be
>> considered for 2.6.38 since it fixes a bug/regression in the case of waking up
>> onto a previously (group) empty cpu; the others are more forward looking.
>>
>> I've been using an rpc ping-pong workload which is known to be sensitive to poor
>> affine decisions to benchmark these changes; I'm happy to run these patches against
>> other workloads. In particular improvements on reaim have been demonstrated,
>> but since it's not as stable a benchmark the numbers are harder to present in
>> a representative fashion. Suggestions/pet benchmarks greatly appreciated
>> here.
>>
>> Some other things experimented with (but didn't pan out as a performance win):
>> - Considering instantaneous load on prev_cpu as well as current_cpu
>> - Using more gentle wl/wg values to reflect that a task's contribution to
>> load_contribution is likely less than its weight.
>>
>> Performance:
>>
>> (throughput is measured in txn/s across a 5-minute interval, with a 30-second
>> warmup)
>>
>> tip (no group scheduling):
>> throughput=57798.701988 reqs/sec.
>> throughput=58098.876188 reqs/sec.
>>
>> tip: (autogroup + current shares code and associated broken effective_load)
>> throughput=49824.283179 reqs/sec.
>> throughput=48527.942386 reqs/sec.
>>
>> tip (autogroup + old tg_shares code): [parity goal post]
>> throughput=57846.575060 reqs/sec.
>> throughput=57626.442034 reqs/sec.
>>
>> tip (autogroup + effective_load rewrite):
>> throughput=58534.073595 reqs/sec.
>> throughput=58068.072052 reqs/sec.
>>
>> tip (autogroup + effective_load + no affine moves for hot tasks):
>> throughput=60907.794697 reqs/sec.
>> throughput=61208.305629 reqs/sec.
>
> The effective_load() change is a humongous improvement for mysql+oltp.
> The rest is iffy looking on my box with this load.
>
Yes -- this one is definitely the priority, the other is more forward
looking since we've had some good gains with it internally.
> Looks like what will happen with NO_HOT_AFFINE if say two high frequency
> ping pong players are perturbed such that one lands non-affine, it will
> stay that way instead of recovering, because these will always be hot.
> I haven't tested that though, pure rumination ;-)
>
This is a valid concern; the improvements we've seen have been with
many clients. Thinking about it, I suspect a better option might be to
just increase the imbalance_pct required for a hot task rather than
blocking the move entirely. Will try this.
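
A purely hypothetical sketch of that alternative (nothing below comes from the
posted patches): scale the required imbalance up for cache-hot tasks instead of
vetoing the affine move outright. The helper, its arguments and the 25% bump
are all invented for illustration.

/*
 * Hypothetical illustration only: rather than refusing affine moves for hot
 * tasks, demand a larger imbalance in their favour before allowing one.
 */
int allow_affine_move(unsigned long this_load, unsigned long prev_load,
		      unsigned int imbalance_pct, int task_is_hot)
{
	/* demand a bigger win before moving a cache-hot task */
	if (task_is_hot)
		imbalance_pct += imbalance_pct / 4;  /* e.g. 125 -> ~156 */

	/*
	 * Allow the affine move only when the waking cpu's load, scaled up by
	 * imbalance_pct, still does not exceed prev_cpu's load, i.e. the
	 * waking cpu is clearly the less loaded side.
	 */
	return this_load * imbalance_pct <= prev_load * 100;
}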
> mysql+oltp numbers
>
> unpatched v2.6.37-7185-g52cfd50
>
> clients 1 2 4 8 16 32 64 128 256
> noautogroup 11084.37 20904.39 37356.65 36855.64 35395.45 35585.32 33343.44 28259.58 21404.18
> 11025.94 20870.93 37272.99 36835.54 35367.92 35448.45 33422.20 28309.88 21285.18
> 11076.00 20774.98 36847.44 36881.97 35295.35 35031.19 33490.84 28254.12 21307.13
> 1 avg 11062.10 20850.10 37159.02 36857.71 35352.90 35354.98 33418.82 28274.52 21332.16
>
> autogroup 10963.27 20058.34 23567.63 29361.08 29111.98 29731.23 28563.18 24151.10 18163.00
> 10754.92 19713.71 22983.43 28906.34 28576.12 30809.49 28384.14 24208.99 18057.34
> 10990.27 19645.70 22193.71 29247.07 28763.53 30764.55 28912.45 24143.41 18002.07
> 2 avg 10902.82 19805.91 22914.92 29171.49 28817.21 30435.09 28619.92 24167.83 18074.13
> .985 .949 .616 .791 .815 .860 .856 .854 .847
>
> patched v2.6.37-7185-g52cfd50
>
> noautogroup 11095.73 20794.49 37062.81 36611.92 35444.55 35468.36 33463.56 28236.18 21255.67
> 11035.59 20649.44 37304.91 36878.34 35331.63 35248.05 33424.15 28147.17 21370.39
> 11077.88 20653.92 37207.26 37047.54 35441.78 35445.02 33469.31 28050.80 21306.89
> avg 11069.73 20699.28 37191.66 36845.93 35405.98 35387.14 33452.34 28144.71 21310.98
> vs 1 1.000 .992 1.000 .999 1.001 1.000 1.001 .995 .999
>
> noautogroup 10784.89 20304.49 37482.07 37251.63 35556.21 35116.93 32187.66 27839.60 21023.17
> NO_HOT_AFFINE 10627.17 19835.43 37611.04 37168.37 35609.65 35289.32 32331.95 27598.50 21366.97
> 10378.76 19998.29 37018.31 36888.67 35633.45 35277.39 32300.37 27896.24 21532.09
> avg 10596.94 20046.07 37370.47 37102.89 35599.77 35227.88 32273.32 27778.11 21307.41
> vs 1 .957 .961 1.005 1.006 1.006 .996 .965 .982 .998
>
> autogroup 10452.16 19547.57 36082.97 36653.02 35251.51 34099.80 31226.18 27274.91 20927.65
> 10586.36 19931.37 36928.99 36640.64 35604.17 34238.38 31528.80 27412.44 20874.03
> 10472.72 20143.83 36407.91 36715.85 35481.78 34332.42 31612.57 27357.18 21018.63
> 3 avg 10503.74 19874.25 36473.29 36669.83 35445.82 34223.53 31455.85 27348.17 20940.10
> vs 1 .949 .953 .981 .994 1.002 .967 .941 .967 .981
> vs 2 .963 1.003 1.591 1.257 1.230 1.124 1.099 1.131 1.158
>
> autogroup 10276.41 19642.90 36790.86 36575.28 35326.89 34094.66 31626.82 27185.72 21017.51
> NO_HOT_AFFINE 10305.91 20027.66 37017.90 36814.35 35452.63 34268.32 31399.49 27353.71 21039.37
> 11013.96 19977.08 36984.17 36661.80 35393.99 34141.05 31246.47 26960.48 20873.94
> avg 10532.09 19882.54 36930.97 36683.81 35391.17 34168.01 31424.26 27166.63 20976.94
> vs 1 .952 .953 .993 .995 1.001 .966 .940 .960 .983
> vs 2 .965 1.003 1.611 1.257 1.228 1.122 1.097 1.124 1.160
> vs 3 1.002 1.000 1.012 1.000 .998 .998 .998 .993 1.001
>
>
>
* Re: [wake_afine fixes/improvements 0/3] Introduction
2011-01-15 1:57 [wake_afine fixes/improvements 0/3] Introduction Paul Turner
` (3 preceding siblings ...)
2011-01-15 14:29 ` [wake_afine fixes/improvements 0/3] Introduction Mike Galbraith
@ 2011-01-15 21:34 ` Nick Piggin
4 siblings, 0 replies; 12+ messages in thread
From: Nick Piggin @ 2011-01-15 21:34 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Mike Galbraith,
Nick Piggin, Srivatsa Vaddagiri
On Sat, Jan 15, 2011 at 12:57 PM, Paul Turner <pjt@google.com> wrote:
>
> I've been looking at the wake_affine path to improve the group scheduling case
> (wake affine performance for fair group sched has historically lagged) as well
> as tweaking performance in general.
>
> The current series of patches is attached, the first of which should probably be
> considered for 2.6.38 since it fixes a bug/regression in the case of waking up
> onto a previously (group) empty cpu; the others are more forward looking.
>
> I've been using an rpc ping-pong workload which is known to be sensitive to poor affine
> decisions to benchmark these changes,
Not _necessarily_ the best thing to use :) As a sanity check maybe, but it would
be nice to have at least an improvement on one workload that somebody actually
uses (and then it's a matter of getting a lot more testing to see it does not
cause regressions on others that people use).
* Re: [wake_afine fixes/improvements 1/3] sched: update effective_load() to use global share weights
2011-01-15 1:57 ` [wake_afine fixes/improvements 1/3] sched: update effective_load() to use global share weights Paul Turner
@ 2011-01-17 14:11 ` Peter Zijlstra
2011-01-17 14:20 ` Peter Zijlstra
2011-01-18 19:04 ` [tip:sched/urgent] sched: Update " tip-bot for Paul Turner
1 sibling, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2011-01-17 14:11 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, Ingo Molnar, Mike Galbraith, Nick Piggin,
Srivatsa Vaddagiri
On Fri, 2011-01-14 at 17:57 -0800, Paul Turner wrote:
> plain text document attachment (fix_wake_affine.patch)
> Previously effective_load would approximate the global load weight present on
> a group by taking advantage of:
>
> entity_weight = tg->shares * (lw / global_lw), where entity_weight was provided
> by tg_shares_up.
>
> This worked (approximately) for an 'empty' (at tg level) cpu since we would
> place boost load representative of what a newly woken task would receive.
>
> However, now that load is instantaneously updated this assumption is no longer
> true and the load calculation is rather incorrect in this case.
>
> Fix this (and improve the general case) by re-writing effective_load to take
> advantage of the new shares distribution code.
>
> Signed-off-by: Paul Turner <pjt@google.com>
>
> ---
> kernel/sched_fair.c | 32 ++++++++++++++++----------------
> 1 file changed, 16 insertions(+), 16 deletions(-)
>
> Index: tip3/kernel/sched_fair.c
> ===================================================================
> --- tip3.orig/kernel/sched_fair.c
> +++ tip3/kernel/sched_fair.c
> @@ -1362,27 +1362,27 @@ static long effective_load(struct task_g
> return wl;
>
> for_each_sched_entity(se) {
> + long lw, w;
>
> + tg = se->my_q->tg;
> + w = se->my_q->load.weight;
weight of this cpu's part of the task-group
> + /* use this cpu's instantaneous contribution */
> + lw = atomic_read(&tg->load_weight);
> + lw -= se->my_q->load_contribution;
> + lw += w + wg;
total weight of this task_group + new load
> + wl += w;
this cpu's weight + new load
> + if (lw > 0 && wl < lw)
> + wl = (wl * tg->shares) / lw;
> + else
> + wl = tg->shares;
OK, so this computes the new load for this cpu by taking the
appropriate proportion of tg->shares; it clips on large wl, and does
something funny for !lw -- on purpose?
> + /* zero point is MIN_SHARES */
> + if (wl < MIN_SHARES)
> + wl = MIN_SHARES;
*nod*
> + wl -= se->load.weight;
Take the weight delta up to the next level..
> wg = 0;
And assume all further groups are already enqueued and stay enqueued.
> }
* Re: [wake_afine fixes/improvements 2/3] sched: clean up task_hot()
2011-01-15 1:57 ` [wake_afine fixes/improvements 2/3] sched: clean up task_hot() Paul Turner
@ 2011-01-17 14:14 ` Peter Zijlstra
2011-01-18 21:52 ` Paul Turner
0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2011-01-17 14:14 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, Ingo Molnar, Mike Galbraith, Nick Piggin,
Srivatsa Vaddagiri
On Fri, 2011-01-14 at 17:57 -0800, Paul Turner wrote:
> plain text document attachment (no_hot_sd.patch)
> We no longer compute per-domain migration costs or have use for task_hot()
> external to the fair scheduling class.
Ok, so this is mostly a pure code move (aside from removing the unused sd
argument). I do seem to remember that various folks played around with
bringing the per-sd cache refill cost back... any conclusion on that?
(not really a big point, we can easily add the argument back when
needed)
* Re: [wake_afine fixes/improvements 1/3] sched: update effective_load() to use global share weights
2011-01-17 14:11 ` Peter Zijlstra
@ 2011-01-17 14:20 ` Peter Zijlstra
0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2011-01-17 14:20 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, Ingo Molnar, Mike Galbraith, Nick Piggin,
Srivatsa Vaddagiri
On Mon, 2011-01-17 at 15:11 +0100, Peter Zijlstra wrote:
>
> > + if (lw > 0 && wl < lw)
> > + wl = (wl * tg->shares) / lw;
> > + else
> > + wl = tg->shares;
>
> OK, so this computes the new load for this cpu, by taking the
> appropriate proportion of tg->shares, it clips on large wl, and does
> something funny for !lw -- on purpose?
D'oh, when !lw, the tg is empty and we don't care what happens since it
won't get scheduled anyway...
Ok, very nice, applied!
* [tip:sched/urgent] sched: Update effective_load() to use global share weights
2011-01-15 1:57 ` [wake_afine fixes/improvements 1/3] sched: update effective_load() to use global share weights Paul Turner
2011-01-17 14:11 ` Peter Zijlstra
@ 2011-01-18 19:04 ` tip-bot for Paul Turner
1 sibling, 0 replies; 12+ messages in thread
From: tip-bot for Paul Turner @ 2011-01-18 19:04 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, pjt, tglx, mingo
Commit-ID: 977dda7c9b540f48b228174346d8b31542c1e99f
Gitweb: http://git.kernel.org/tip/977dda7c9b540f48b228174346d8b31542c1e99f
Author: Paul Turner <pjt@google.com>
AuthorDate: Fri, 14 Jan 2011 17:57:50 -0800
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 18 Jan 2011 15:09:38 +0100
sched: Update effective_load() to use global share weights
Previously effective_load would approximate the global load weight present on
a group by taking advantage of:
entity_weight = tg->shares * (lw / global_lw), where entity_weight was provided
by tg_shares_up.
This worked (approximately) for an 'empty' (at tg level) cpu since we would
place boost load representative of what a newly woken task would receive.
However, now that load is instantaneously updated this assumption is no longer
true and the load calculation is rather incorrect in this case.
Fix this (and improve the general case) by re-writing effective_load to take
advantage of the new shares distribution code.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20110115015817.069769529@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched_fair.c | 32 ++++++++++++++++----------------
1 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c62ebae..414145c 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1362,27 +1362,27 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
return wl;
for_each_sched_entity(se) {
- long S, rw, s, a, b;
+ long lw, w;
- S = se->my_q->tg->shares;
- s = se->load.weight;
- rw = se->my_q->load.weight;
+ tg = se->my_q->tg;
+ w = se->my_q->load.weight;
- a = S*(rw + wl);
- b = S*rw + s*wg;
+ /* use this cpu's instantaneous contribution */
+ lw = atomic_read(&tg->load_weight);
+ lw -= se->my_q->load_contribution;
+ lw += w + wg;
- wl = s*(a-b);
+ wl += w;
- if (likely(b))
- wl /= b;
+ if (lw > 0 && wl < lw)
+ wl = (wl * tg->shares) / lw;
+ else
+ wl = tg->shares;
- /*
- * Assume the group is already running and will
- * thus already be accounted for in the weight.
- *
- * That is, moving shares between CPUs, does not
- * alter the group weight.
- */
+ /* zero point is MIN_SHARES */
+ if (wl < MIN_SHARES)
+ wl = MIN_SHARES;
+ wl -= se->load.weight;
wg = 0;
}
* Re: [wake_afine fixes/improvements 2/3] sched: clean up task_hot()
2011-01-17 14:14 ` Peter Zijlstra
@ 2011-01-18 21:52 ` Paul Turner
0 siblings, 0 replies; 12+ messages in thread
From: Paul Turner @ 2011-01-18 21:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Mike Galbraith, Nick Piggin,
Srivatsa Vaddagiri
On Mon, Jan 17, 2011 at 6:14 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Fri, 2011-01-14 at 17:57 -0800, Paul Turner wrote:
>> plain text document attachment (no_hot_sd.patch)
>> We no longer compute per-domain migration costs or have use for task_hot()
>> external to the fair scheduling class.
>
> Ok, so this is mostly a pure code move (aside from removing the unused sd
> argument). I do seem to remember that various folks played around with
> bringing the per-sd cache refill cost back... any conclusion on that?
>
> (not really a big point, we can easily add the argument back when
> needed)
>
Yeah this one's solely housekeeping.
I think there probably is value in a relative notion of what it means
to be hot that's based on the domain distance (especially with the
slightly more exotic topologies we're starting to see), but until some
framework exists I figured I might as well clean it up while I was
there.