LKML Archive on lore.kernel.org
* [PATCH 0/6] Reduce migrations and conflicts with automatic NUMA balancing v2
@ 2018-02-13 13:37 Mel Gorman
2018-02-13 13:37 ` [PATCH 1/6] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine Mel Gorman
` (5 more replies)
0 siblings, 6 replies; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 13:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich,
LKML, Mel Gorman
This is a combination of the "Reduce migrations due to load imbalance and
process exits" and "Stop wake_affine fighting with automatic NUMA balancing"
series.
There was a request for data on NAS OMP and how it compares with hard-binding
of threads to CPUs; that data is included in the changelog of the relevant
patch. I did not update most of the comparisons as a partial rerun of the
workloads shows similar results.
Changelog since v1
o Replace patch 4 with Peter's version
o Correct comments
o Add data on NAS OMP
o Rebase to 4.16-rc1
The following series was motivated by the observation of more migrations
than expected. The first two patches are minor cleanups. The third patch
avoids migrations on wakeup when two CPUs are equally loaded, which is
particularly important in the case where the measured load is 0. The fourth
patch avoids cross-node migrations of a parent process when the child exits.
The fifth patch resists new children being woken for the first time on a
remote node, but does not bar it completely as that may lead to saturation
of a single node. The final patch forces automatic NUMA balancing to back
off if it disagrees with wake_affine about where a task should be running.
kernel/sched/fair.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 82 insertions(+), 7 deletions(-)
--
2.15.1
* [PATCH 1/6] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine
2018-02-13 13:37 [PATCH 0/6] Reduce migrations and conflicts with automatic NUMA balancing v2 Mel Gorman
@ 2018-02-13 13:37 ` Mel Gorman
2018-02-21 10:27 ` [tip:sched/core] " tip-bot for Mel Gorman
2018-02-13 13:37 ` [PATCH 2/6] sched/fair: Defer calculation of prev_eff_load in wake_affine until needed Mel Gorman
` (4 subsequent siblings)
5 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 13:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich,
LKML, Mel Gorman
The only caller of wake_affine() knows the CPU ID. Pass it in instead of
rechecking it.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eb3ffc9be84..4b3e86eb2222 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5751,9 +5751,8 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p,
- int prev_cpu, int sync)
+ int this_cpu, int prev_cpu, int sync)
{
- int this_cpu = smp_processor_id();
int target = nr_cpumask_bits;
if (sched_feat(WA_IDLE))
@@ -6376,7 +6375,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
if (cpu == prev_cpu)
goto pick_cpu;
- new_cpu = wake_affine(affine_sd, p, prev_cpu, sync);
+ new_cpu = wake_affine(affine_sd, p, cpu, prev_cpu, sync);
}
if (sd && !(sd_flag & SD_BALANCE_FORK)) {
--
2.15.1
* [PATCH 2/6] sched/fair: Defer calculation of prev_eff_load in wake_affine until needed
2018-02-13 13:37 [PATCH 0/6] Reduce migrations and conflicts with automatic NUMA balancing v2 Mel Gorman
2018-02-13 13:37 ` [PATCH 1/6] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine Mel Gorman
@ 2018-02-13 13:37 ` Mel Gorman
2018-02-21 10:28 ` [tip:sched/core] sched/fair: Defer calculation of 'prev_eff_load' in wake_affine_weight() " tip-bot for Mel Gorman
2018-02-13 13:37 ` [PATCH 3/6] sched/fair: Do not migrate on wake_affine_weight if weights are equal Mel Gorman
` (3 subsequent siblings)
5 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 13:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich,
LKML, Mel Gorman
On sync wakeups, the previous CPU effective load may not be used so delay
the calculation until it's needed.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4b3e86eb2222..c1091cb023c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5724,7 +5724,6 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
unsigned long task_load;
this_eff_load = target_load(this_cpu, sd->wake_idx);
- prev_eff_load = source_load(prev_cpu, sd->wake_idx);
if (sync) {
unsigned long current_load = task_h_load(current);
@@ -5742,6 +5741,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
this_eff_load *= 100;
this_eff_load *= capacity_of(prev_cpu);
+ prev_eff_load = source_load(prev_cpu, sd->wake_idx);
prev_eff_load -= task_load;
if (sched_feat(WA_BIAS))
prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
--
2.15.1
* [PATCH 3/6] sched/fair: Do not migrate on wake_affine_weight if weights are equal
2018-02-13 13:37 [PATCH 0/6] Reduce migrations and conflicts with automatic NUMA balancing v2 Mel Gorman
2018-02-13 13:37 ` [PATCH 1/6] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine Mel Gorman
2018-02-13 13:37 ` [PATCH 2/6] sched/fair: Defer calculation of prev_eff_load in wake_affine until needed Mel Gorman
@ 2018-02-13 13:37 ` Mel Gorman
2018-02-21 10:28 ` [tip:sched/core] sched/fair: Do not migrate on wake_affine_weight() " tip-bot for Mel Gorman
2018-02-13 13:37 ` [PATCH 4/6] sched/fair: Do not migrate due to a sync wakeup on exit Mel Gorman
` (2 subsequent siblings)
5 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 13:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich,
LKML, Mel Gorman
wake_affine_weight() will consider migrating a task to, or near, the current
CPU if there is a load imbalance. If the CPUs share LLC then either CPU
is valid as a search-for-idle-sibling target and equally appropriate for
stacking two tasks on one CPU if an idle sibling is unavailable. If they do
not share cache then a cross-node migration potentially impacts locality
so while they are equal from a CPU capacity point of view, they are not
equal in terms of memory locality. In either case, it's more appropriate
to migrate only if there is a difference in their effective load.
This patch modifies wake_affine_weight() to only consider migrating a task
if there is a load imbalance on normal wakeups, but allows potential
stacking if the loads are equal and it's a sync wakeup.
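To make the comparison concrete, the following is a minimal userspace sketch
of the decision after this patch. It ignores the task-load adjustments the
real wake_affine_weight() performs, and the load, capacity and imbalance_pct
figures are made-up values for illustration only:

#include <stdio.h>

int main(void)
{
        /* Illustrative values only; not taken from a real machine */
        unsigned long this_load = 0, prev_load = 0;     /* both CPUs idle */
        unsigned long this_cap = 1024, prev_cap = 1024; /* equal capacity */
        unsigned int imbalance_pct = 125;               /* example domain value */
        int sync = 0;                                   /* normal wakeup */

        unsigned long this_eff_load = this_load * 100 * prev_cap;
        unsigned long prev_eff_load = prev_load *
                        (100 + (imbalance_pct - 100) / 2) * this_cap;

        /* A sync wakeup tolerates equal weights and may stack on the waker */
        if (sync)
                prev_eff_load += 1;

        /* The comparison is now strict: equal weights no longer migrate */
        printf("%s\n", this_eff_load < prev_eff_load ?
               "migrate towards the waker" : "stay on prev_cpu");
        return 0;
}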
For the most part, the difference in performance is marginal. For example,
on a 4-socket server running netperf UDP_STREAM on localhost the differences
are as follows:
4.15.0 4.15.0
16rc0 noequal-v1r23
Hmean send-64 355.47 ( 0.00%) 349.50 ( -1.68%)
Hmean send-128 697.98 ( 0.00%) 693.35 ( -0.66%)
Hmean send-256 1328.02 ( 0.00%) 1318.77 ( -0.70%)
Hmean send-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%)
Hmean send-2048 9637.02 ( 0.00%) 9601.34 ( -0.37%)
Hmean send-3312 14355.37 ( 0.00%) 14414.51 ( 0.41%)
Hmean send-4096 16464.97 ( 0.00%) 16301.37 ( -0.99%)
Hmean send-8192 26722.42 ( 0.00%) 26428.95 ( -1.10%)
Hmean send-16384 38137.81 ( 0.00%) 38046.11 ( -0.24%)
Hmean recv-64 355.47 ( 0.00%) 349.50 ( -1.68%)
Hmean recv-128 697.98 ( 0.00%) 693.35 ( -0.66%)
Hmean recv-256 1328.02 ( 0.00%) 1318.77 ( -0.70%)
Hmean recv-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%)
Hmean recv-2048 9636.95 ( 0.00%) 9601.30 ( -0.37%)
Hmean recv-3312 14355.32 ( 0.00%) 14414.48 ( 0.41%)
Hmean recv-4096 16464.74 ( 0.00%) 16301.16 ( -0.99%)
Hmean recv-8192 26721.63 ( 0.00%) 26428.17 ( -1.10%)
Hmean recv-16384 38136.00 ( 0.00%) 38044.88 ( -0.24%)
Stddev send-64 7.30 ( 0.00%) 4.75 ( 34.96%)
Stddev send-128 15.15 ( 0.00%) 22.38 ( -47.66%)
Stddev send-256 13.99 ( 0.00%) 19.14 ( -36.81%)
Stddev send-1024 105.73 ( 0.00%) 67.38 ( 36.27%)
Stddev send-2048 294.57 ( 0.00%) 223.88 ( 24.00%)
Stddev send-3312 302.28 ( 0.00%) 271.74 ( 10.10%)
Stddev send-4096 195.92 ( 0.00%) 121.10 ( 38.19%)
Stddev send-8192 399.71 ( 0.00%) 563.77 ( -41.04%)
Stddev send-16384 1163.47 ( 0.00%) 1103.68 ( 5.14%)
Stddev recv-64 7.30 ( 0.00%) 4.75 ( 34.96%)
Stddev recv-128 15.15 ( 0.00%) 22.38 ( -47.66%)
Stddev recv-256 13.99 ( 0.00%) 19.14 ( -36.81%)
Stddev recv-1024 105.73 ( 0.00%) 67.38 ( 36.27%)
Stddev recv-2048 294.59 ( 0.00%) 223.89 ( 24.00%)
Stddev recv-3312 302.24 ( 0.00%) 271.75 ( 10.09%)
Stddev recv-4096 196.03 ( 0.00%) 121.14 ( 38.20%)
Stddev recv-8192 399.86 ( 0.00%) 563.65 ( -40.96%)
Stddev recv-16384 1163.79 ( 0.00%) 1103.86 ( 5.15%)
The difference in overall performance is marginal but note that most
measurements are less variable. There were similar observations for other
netperf comparisons. hackbench with sockets or threads, using processes or
threads, showed minor differences with some reduction in migrations. tbench
showed only marginal differences that were within the noise. dbench,
regardless of filesystem, showed minor differences, all of which were
within the noise. Multiple machines, both UMA and NUMA, were tested without
any regressions showing up.
The biggest risk with a patch like this is affecting wakeup latencies.
However, the schbench load from Facebook which is very sensitive to wakeup
latency showed a mixed result with mostly improvements in wakeup latency:
4.15.0 4.15.0
16rc0 noequal-v1r23
Lat 50.00th-qrtle-1 38.00 ( 0.00%) 38.00 ( 0.00%)
Lat 75.00th-qrtle-1 49.00 ( 0.00%) 41.00 ( 16.33%)
Lat 90.00th-qrtle-1 52.00 ( 0.00%) 50.00 ( 3.85%)
Lat 95.00th-qrtle-1 54.00 ( 0.00%) 51.00 ( 5.56%)
Lat 99.00th-qrtle-1 63.00 ( 0.00%) 60.00 ( 4.76%)
Lat 99.50th-qrtle-1 66.00 ( 0.00%) 61.00 ( 7.58%)
Lat 99.90th-qrtle-1 78.00 ( 0.00%) 65.00 ( 16.67%)
Lat 50.00th-qrtle-2 38.00 ( 0.00%) 38.00 ( 0.00%)
Lat 75.00th-qrtle-2 42.00 ( 0.00%) 43.00 ( -2.38%)
Lat 90.00th-qrtle-2 46.00 ( 0.00%) 48.00 ( -4.35%)
Lat 95.00th-qrtle-2 49.00 ( 0.00%) 50.00 ( -2.04%)
Lat 99.00th-qrtle-2 55.00 ( 0.00%) 57.00 ( -3.64%)
Lat 99.50th-qrtle-2 58.00 ( 0.00%) 60.00 ( -3.45%)
Lat 99.90th-qrtle-2 65.00 ( 0.00%) 68.00 ( -4.62%)
Lat 50.00th-qrtle-4 41.00 ( 0.00%) 41.00 ( 0.00%)
Lat 75.00th-qrtle-4 45.00 ( 0.00%) 46.00 ( -2.22%)
Lat 90.00th-qrtle-4 50.00 ( 0.00%) 50.00 ( 0.00%)
Lat 95.00th-qrtle-4 54.00 ( 0.00%) 53.00 ( 1.85%)
Lat 99.00th-qrtle-4 61.00 ( 0.00%) 61.00 ( 0.00%)
Lat 99.50th-qrtle-4 65.00 ( 0.00%) 64.00 ( 1.54%)
Lat 99.90th-qrtle-4 76.00 ( 0.00%) 82.00 ( -7.89%)
Lat 50.00th-qrtle-8 48.00 ( 0.00%) 46.00 ( 4.17%)
Lat 75.00th-qrtle-8 55.00 ( 0.00%) 54.00 ( 1.82%)
Lat 90.00th-qrtle-8 60.00 ( 0.00%) 59.00 ( 1.67%)
Lat 95.00th-qrtle-8 63.00 ( 0.00%) 63.00 ( 0.00%)
Lat 99.00th-qrtle-8 71.00 ( 0.00%) 69.00 ( 2.82%)
Lat 99.50th-qrtle-8 74.00 ( 0.00%) 73.00 ( 1.35%)
Lat 99.90th-qrtle-8 98.00 ( 0.00%) 90.00 ( 8.16%)
Lat 50.00th-qrtle-16 56.00 ( 0.00%) 55.00 ( 1.79%)
Lat 75.00th-qrtle-16 68.00 ( 0.00%) 67.00 ( 1.47%)
Lat 90.00th-qrtle-16 77.00 ( 0.00%) 78.00 ( -1.30%)
Lat 95.00th-qrtle-16 82.00 ( 0.00%) 84.00 ( -2.44%)
Lat 99.00th-qrtle-16 90.00 ( 0.00%) 93.00 ( -3.33%)
Lat 99.50th-qrtle-16 93.00 ( 0.00%) 97.00 ( -4.30%)
Lat 99.90th-qrtle-16 110.00 ( 0.00%) 110.00 ( 0.00%)
Lat 50.00th-qrtle-32 68.00 ( 0.00%) 62.00 ( 8.82%)
Lat 75.00th-qrtle-32 90.00 ( 0.00%) 83.00 ( 7.78%)
Lat 90.00th-qrtle-32 110.00 ( 0.00%) 100.00 ( 9.09%)
Lat 95.00th-qrtle-32 122.00 ( 0.00%) 111.00 ( 9.02%)
Lat 99.00th-qrtle-32 145.00 ( 0.00%) 133.00 ( 8.28%)
Lat 99.50th-qrtle-32 154.00 ( 0.00%) 143.00 ( 7.14%)
Lat 99.90th-qrtle-32 2316.00 ( 0.00%) 515.00 ( 77.76%)
Lat 50.00th-qrtle-35 69.00 ( 0.00%) 72.00 ( -4.35%)
Lat 75.00th-qrtle-35 92.00 ( 0.00%) 95.00 ( -3.26%)
Lat 90.00th-qrtle-35 111.00 ( 0.00%) 114.00 ( -2.70%)
Lat 95.00th-qrtle-35 122.00 ( 0.00%) 124.00 ( -1.64%)
Lat 99.00th-qrtle-35 142.00 ( 0.00%) 144.00 ( -1.41%)
Lat 99.50th-qrtle-35 150.00 ( 0.00%) 154.00 ( -2.67%)
Lat 99.90th-qrtle-35 6104.00 ( 0.00%) 5640.00 ( 7.60%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1091cb023c4..28c8d9c91955 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5747,7 +5747,16 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
prev_eff_load *= capacity_of(this_cpu);
- return this_eff_load <= prev_eff_load ? this_cpu : nr_cpumask_bits;
+ /*
+ * If sync, adjust the weight of prev_eff_load such that if
+ * prev_eff == this_eff that select_idle_sibling will consider
+ * stacking the wakee on top of the waker if no other CPU is
+ * idle.
+ */
+ if (sync)
+ prev_eff_load += 1;
+
+ return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p,
--
2.15.1
* [PATCH 4/6] sched/fair: Do not migrate due to a sync wakeup on exit
2018-02-13 13:37 [PATCH 0/6] Reduce migrations and conflicts with automatic NUMA balancing v2 Mel Gorman
` (2 preceding siblings ...)
2018-02-13 13:37 ` [PATCH 3/6] sched/fair: Do not migrate on wake_affine_weight if weights are equal Mel Gorman
@ 2018-02-13 13:37 ` Mel Gorman
2018-02-21 10:29 ` [tip:sched/core] " tip-bot for Peter Zijlstra
2018-02-13 13:37 ` [PATCH 5/6] sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on Mel Gorman
2018-02-13 13:37 ` [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine Mel Gorman
5 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 13:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich,
LKML, Mel Gorman
From: Peter Zijlstra <peterz@infradead.org>
When a task exits, it notifies the parent that it has exited. This is a
sync wakeup and the exiting task may pull the parent towards the waker's
CPU. For simple workloads like using a shell, it was observed that the
shell is pulled across nodes by exiting processes. This is daft as the
parent may be long-lived and properly placed. This patch special cases a
sync wakeup on exit to avoid pulling tasks across nodes. Testing on a range
of workloads and machines showed very little difference in performance
although there was a small 3% boost on some machines running a shellscript
intensive workload (git regression test suite).
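As a rough standalone illustration of the changed sync computation (the flag
values below are stand-ins chosen for this sketch rather than the kernel's
definitions):

#include <stdio.h>

#define WF_SYNC         0x01    /* sync wakeup hint (illustrative value) */
#define PF_EXITING      0x04    /* waker is exiting (illustrative value) */

int main(void)
{
        int wake_flags = WF_SYNC;       /* exit_notify() wakes the parent sync */
        int waker_flags = PF_EXITING;   /* the waker is the exiting child */

        /*
         * The sync hint is ignored when the waker is exiting, so the parent
         * is not pulled towards a CPU the child is about to vacate.
         */
        int sync = (wake_flags & WF_SYNC) && !(waker_flags & PF_EXITING);

        printf("sync wakeup honoured: %s\n", sync ? "yes" : "no");
        return 0;
}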
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 28c8d9c91955..e5bbcbefd01b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6350,7 +6350,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
int cpu = smp_processor_id();
int new_cpu = prev_cpu;
int want_affine = 0;
- int sync = wake_flags & WF_SYNC;
+ int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
--
2.15.1
* [PATCH 5/6] sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on
2018-02-13 13:37 [PATCH 0/6] Reduce migrations and conflicts with automatic NUMA balancing v2 Mel Gorman
` (3 preceding siblings ...)
2018-02-13 13:37 ` [PATCH 4/6] sched/fair: Do not migrate due to a sync wakeup on exit Mel Gorman
@ 2018-02-13 13:37 ` Mel Gorman
2018-02-21 10:29 ` [tip:sched/core] " tip-bot for Mel Gorman
2018-02-13 13:37 ` [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine Mel Gorman
5 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 13:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich,
LKML, Mel Gorman
find_idlest_group() compares a local group with each other group to select
the one that is most idle. When comparing groups in different NUMA domains,
a very slight imbalance is enough to select a remote NUMA node even if the
runnable load on both groups is 0 or close to 0. This ignores the cost of
remote accesses entirely and is a problem when selecting the CPU for a
newly forked task to run on. This is problematic when a forking server
is almost guaranteed to run on a remote node incurring numerous remote
accesses and potentially causing automatic NUMA balancing to try to migrate
the task back or migrate the data to another node. Similar weirdness is
observed if a basic shell command pipes output to another as each process
in the pipeline is likely to start on different nodes and then get adjusted
later by wake_affine.
This patch adds imbalance to remote domains when considering whether to
select CPUs from remote domains. If the local domain is selected, imbalance
will still be used to try to select a CPU from a lower scheduler domain's group
instead of stacking tasks on the same CPU.
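A minimal userspace sketch of the added check, using made-up load figures
(in the real code these are the runnable loads computed by
find_idlest_group()):

#include <stdio.h>

int main(void)
{
        /* Illustrative figures only */
        unsigned long this_runnable_load = 2;   /* local group nearly idle */
        unsigned long min_runnable_load = 0;    /* best remote group is idle */
        unsigned long imbalance = 4;            /* derived from imbalance_pct */
        int sd_numa = 1;                        /* comparing across NUMA nodes */

        /*
         * Previously the tiny imbalance (0 vs 2) was enough to pick the
         * remote group. With the remote load padded by 'imbalance', the
         * nearly idle local group is kept instead.
         */
        if (sd_numa && min_runnable_load + imbalance >= this_runnable_load)
                printf("stay with the local group\n");
        else
                printf("use the remote group\n");
        return 0;
}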
A variety of workloads and machines were tested and as expected, there is no
difference on UMA. The difference on NUMA can be dramatic. This is a comparison
of elapsed times running the git regression test suite. It's fork-intensive with
short-lived processes:
4.15.0 4.15.0
noexit-v1r23 sdnuma-v1r23
Elapsed min 1706.06 ( 0.00%) 1435.94 ( 15.83%)
Elapsed mean 1709.53 ( 0.00%) 1436.98 ( 15.94%)
Elapsed stddev 2.16 ( 0.00%) 1.01 ( 53.38%)
Elapsed coeffvar 0.13 ( 0.00%) 0.07 ( 44.54%)
Elapsed max 1711.59 ( 0.00%) 1438.01 ( 15.98%)
4.15.0 4.15.0
noexit-v1r23 sdnuma-v1r23
User 5434.12 5188.41
System 4878.77 3467.09
Elapsed 10259.06 8624.21
That shows a considerable reduction in elapsed times. It's important to
note that automatic NUMA balancing does not affect this load as processes
are too short-lived.
There is also a noticeable impact on hackbench, such as this example using
processes and pipes:
hackbench-process-pipes
4.15.0 4.15.0
noexit-v1r23 sdnuma-v1r23
Amean 1 1.0973 ( 0.00%) 0.9393 ( 14.40%)
Amean 4 1.3427 ( 0.00%) 1.3730 ( -2.26%)
Amean 7 1.4233 ( 0.00%) 1.6670 ( -17.12%)
Amean 12 3.0250 ( 0.00%) 3.3013 ( -9.13%)
Amean 21 9.0860 ( 0.00%) 9.5343 ( -4.93%)
Amean 30 14.6547 ( 0.00%) 13.2433 ( 9.63%)
Amean 48 22.5447 ( 0.00%) 20.4303 ( 9.38%)
Amean 79 29.2010 ( 0.00%) 26.7853 ( 8.27%)
Amean 110 36.7443 ( 0.00%) 35.8453 ( 2.45%)
Amean 141 45.8533 ( 0.00%) 42.6223 ( 7.05%)
Amean 172 55.1317 ( 0.00%) 50.6473 ( 8.13%)
Amean 203 64.4420 ( 0.00%) 58.3957 ( 9.38%)
Amean 234 73.2293 ( 0.00%) 67.1047 ( 8.36%)
Amean 265 80.5220 ( 0.00%) 75.7330 ( 5.95%)
Amean 296 88.7567 ( 0.00%) 82.1533 ( 7.44%)
It's not a universal win as there are occasions when spreading wide and
quickly is a benefit but it's more of a win than it is a loss. For other
workloads, there is little difference but netperf is interesting. Without
the patch, the server and client start on different nodes but quickly get
migrated due to wake_affine. Hence, the difference in overall performance
is marginal but detectable:
4.15.0 4.15.0
noexit-v1r23 sdnuma-v1r23
Hmean send-64 349.09 ( 0.00%) 354.67 ( 1.60%)
Hmean send-128 699.16 ( 0.00%) 702.91 ( 0.54%)
Hmean send-256 1316.34 ( 0.00%) 1350.07 ( 2.56%)
Hmean send-1024 5063.99 ( 0.00%) 5124.38 ( 1.19%)
Hmean send-2048 9705.19 ( 0.00%) 9687.44 ( -0.18%)
Hmean send-3312 14359.48 ( 0.00%) 14577.64 ( 1.52%)
Hmean send-4096 16324.20 ( 0.00%) 16393.62 ( 0.43%)
Hmean send-8192 26112.61 ( 0.00%) 26877.26 ( 2.93%)
Hmean send-16384 37208.44 ( 0.00%) 38683.43 ( 3.96%)
Hmean recv-64 349.09 ( 0.00%) 354.67 ( 1.60%)
Hmean recv-128 699.16 ( 0.00%) 702.91 ( 0.54%)
Hmean recv-256 1316.34 ( 0.00%) 1350.07 ( 2.56%)
Hmean recv-1024 5063.99 ( 0.00%) 5124.38 ( 1.19%)
Hmean recv-2048 9705.16 ( 0.00%) 9687.43 ( -0.18%)
Hmean recv-3312 14359.42 ( 0.00%) 14577.59 ( 1.52%)
Hmean recv-4096 16323.98 ( 0.00%) 16393.55 ( 0.43%)
Hmean recv-8192 26111.85 ( 0.00%) 26876.96 ( 2.93%)
Hmean recv-16384 37206.99 ( 0.00%) 38682.41 ( 3.97%)
However, what is very interesting is how automatic NUMA balancing behaves.
Each netperf instance runs long enough for balancing to activate.
NUMA base PTE updates 4620 1473
NUMA huge PMD updates 0 0
NUMA page range updates 4620 1473
NUMA hint faults 4301 1383
NUMA hint local faults 1309 451
NUMA hint local percent 30 32
NUMA pages migrated 1335 491
AutoNUMA cost 21% 6%
There is an unfortunate number of remote faults although tracing indicated
that the vast majority are in shared libraries. However, the tendency to
start tasks on the same node if there is capacity means that there were
far fewer PTE updates and faults incurred overall.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e5bbcbefd01b..b1efd7570c88 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5911,6 +5911,18 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
if (!idlest)
return NULL;
+ /*
+ * When comparing groups across NUMA domains, it's possible for the
+ * local domain to be very lightly loaded relative to the remote
+ * domains but "imbalance" skews the comparison making remote CPUs
+ * look much more favourable. When considering cross-domain, add
+ * imbalance to the runnable load on the remote node and consider
+ * staying local.
+ */
+ if ((sd->flags & SD_NUMA) &&
+ min_runnable_load + imbalance >= this_runnable_load)
+ return NULL;
+
if (min_runnable_load > (this_runnable_load + imbalance))
return NULL;
--
2.15.1
* [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine
2018-02-13 13:37 [PATCH 0/6] Reduce migrations and conflicts with automatic NUMA balancing v2 Mel Gorman
` (4 preceding siblings ...)
2018-02-13 13:37 ` [PATCH 5/6] sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on Mel Gorman
@ 2018-02-13 13:37 ` Mel Gorman
2018-02-13 14:01 ` Peter Zijlstra
2018-02-21 10:30 ` [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine() tip-bot for Mel Gorman
5 siblings, 2 replies; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 13:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich,
LKML, Mel Gorman
If wake_affine pulls a task to another node for any reason and the node is
no longer preferred, then temporarily stop automatic NUMA balancing from
pulling the task back. Otherwise, tasks with a strong waker/wakee
relationship may constantly fight automatic NUMA balancing over where a
task should be placed.
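As a back-of-envelope sketch of the back-off window this introduces (the
scan period and scan delay values are illustrative assumptions; the
max(scan_delay, 4 * scan_period) form mirrors the helper added below):

#include <stdio.h>

int main(void)
{
        /* Illustrative values, in milliseconds */
        unsigned int scan_delay = 1000;         /* sysctl_numa_balancing_scan_delay */
        unsigned int scan_period = 1000;        /* p->numa_scan_period */

        /*
         * 4 * scan_period gives the two-pass filter time to migrate hot
         * data to the waker's node before placement is retried.
         */
        unsigned int interval = scan_delay > (scan_period << 2) ?
                                scan_delay : (scan_period << 2);

        printf("NUMA placement retry deferred by %u ms\n", interval);
        return 0;
}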
Once again netperf is interesting here. The performance barely changes
but the behaviour of automatic NUMA balancing does:
Hmean send-64 354.67 ( 0.00%) 352.15 ( -0.71%)
Hmean send-128 702.91 ( 0.00%) 693.84 ( -1.29%)
Hmean send-256 1350.07 ( 0.00%) 1344.19 ( -0.44%)
Hmean send-1024 5124.38 ( 0.00%) 4941.24 ( -3.57%)
Hmean send-2048 9687.44 ( 0.00%) 9624.45 ( -0.65%)
Hmean send-3312 14577.64 ( 0.00%) 14514.35 ( -0.43%)
Hmean send-4096 16393.62 ( 0.00%) 16488.30 ( 0.58%)
Hmean send-8192 26877.26 ( 0.00%) 26431.63 ( -1.66%)
Hmean send-16384 38683.43 ( 0.00%) 38264.91 ( -1.08%)
Hmean recv-64 354.67 ( 0.00%) 352.15 ( -0.71%)
Hmean recv-128 702.91 ( 0.00%) 693.84 ( -1.29%)
Hmean recv-256 1350.07 ( 0.00%) 1344.19 ( -0.44%)
Hmean recv-1024 5124.38 ( 0.00%) 4941.24 ( -3.57%)
Hmean recv-2048 9687.43 ( 0.00%) 9624.45 ( -0.65%)
Hmean recv-3312 14577.59 ( 0.00%) 14514.35 ( -0.43%)
Hmean recv-4096 16393.55 ( 0.00%) 16488.20 ( 0.58%)
Hmean recv-8192 26876.96 ( 0.00%) 26431.29 ( -1.66%)
Hmean recv-16384 38682.41 ( 0.00%) 38263.94 ( -1.08%)
NUMA alloc hit 1465986 1423090
NUMA alloc miss 0 0
NUMA interleave hit 0 0
NUMA alloc local 1465897 1423003
NUMA base PTE updates 1473 1420
NUMA huge PMD updates 0 0
NUMA page range updates 1473 1420
NUMA hint faults 1383 1312
NUMA hint local faults 451 124
NUMA hint local percent 32 9
There is a slight degradation in performance but there are slightly fewer
NUMA faults. There is a large drop in the percentage of local faults, but
the bulk of migrations for netperf are in small shared libraries, so this
reflects the fact that automatic NUMA balancing has backed off. This is a
case where, despite wake_affine and automatic NUMA balancing fighting over
placement, there is a marginal benefit to quickly rescheduling the task
near its local data. However, it should be noted that wake_affine and
automatic NUMA balancing constantly fighting each other is undesirable.
However, the benefit in other cases is large. This is the result for NAS
with the D class sizing on a 4-socket machine:
nas-mpi
4.15.0 4.15.0
sdnuma-v1r23 delayretry-v1r23
Time cg.D 557.00 ( 0.00%) 431.82 ( 22.47%)
Time ep.D 77.83 ( 0.00%) 79.01 ( -1.52%)
Time is.D 26.46 ( 0.00%) 26.64 ( -0.68%)
Time lu.D 727.14 ( 0.00%) 597.94 ( 17.77%)
Time mg.D 191.35 ( 0.00%) 146.85 ( 23.26%)
4.15.0 4.15.0
sdnuma-v1r23 delayretry-v1r23
User 75665.20 70413.30
System 20321.59 8861.67
Elapsed 766.13 634.92
Minor Faults 16528502 7127941
Major Faults 4553 5068
NUMA alloc local 6963197 6749135
NUMA base PTE updates 366409093 107491434
NUMA huge PMD updates 687556 198880
NUMA page range updates 718437765 209317994
NUMA hint faults 13643410 4601187
NUMA hint local faults 9212593 3063996
NUMA hint local percent 67 66
Note the massive reduction in system CPU usage even though the percentage
of local faults is barely affected. There is also a massive reduction in
the number of PTE updates, showing that automatic NUMA balancing has backed
off. A critical observation is that there is a massive reduction in minor
faults, which is due to far fewer NUMA hinting faults being trapped.
There were questions about NAS OMP and how it behaves when threads are
bound to CPUs. First, there are more gains than losses with this patch
applied, along with a reduction in system CPU usage.
nas-omp
4.16.0-rc1 4.16.0-rc1
sdnuma-v2r1 delayretry-v2r1
Time bt.D 436.71 ( 0.00%) 430.05 ( 1.53%)
Time cg.D 201.02 ( 0.00%) 180.87 ( 10.02%)
Time ep.D 32.84 ( 0.00%) 32.68 ( 0.49%)
Time is.D 9.63 ( 0.00%) 9.64 ( -0.10%)
Time lu.D 331.20 ( 0.00%) 304.80 ( 7.97%)
Time mg.D 54.87 ( 0.00%) 52.72 ( 3.92%)
Time sp.D 1108.78 ( 0.00%) 917.10 ( 17.29%)
Time ua.D 378.81 ( 0.00%) 398.83 ( -5.28%)
4.16.0-rc1 4.16.0-rc1
sdnuma-v2r1 delayretry-v2r1
User 305633.08 296751.91
System 451.75 357.80
Elapsed 2595.73 2368.13
However, it does not close the gap between bound and unbound threads. There
is a negligible difference between the performance of the baseline and a
patched kernel when threads are bound, so that data is not presented here.
4.16.0-rc1 4.16.0-rc1
delayretry-bind delayretry-unbound
Time bt.D 385.02 ( 0.00%) 430.05 ( -11.70%)
Time cg.D 144.02 ( 0.00%) 180.87 ( -25.59%)
Time ep.D 32.85 ( 0.00%) 32.68 ( 0.52%)
Time is.D 10.52 ( 0.00%) 9.64 ( 8.37%)
Time lu.D 285.31 ( 0.00%) 304.80 ( -6.83%)
Time mg.D 43.21 ( 0.00%) 52.72 ( -22.01%)
Time sp.D 820.24 ( 0.00%) 917.10 ( -11.81%)
Time ua.D 337.09 ( 0.00%) 398.83 ( -18.32%)
4.16.0-rc1 4.16.0-rc1
delayretry-bind delayretry-unbound
User 277731.25 296751.91
System 261.29 357.80
Elapsed 2100.55 2368.13
Unfortunately, while performance is improved by the patch, there is still
quite a long way to go before it's equivalent to hard binding.
Other workloads like hackbench, tbench, dbench and schbench are barely
affected. dbench shows a mix of gains and losses depending on the machine,
although in general the results are more stable.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 56 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b1efd7570c88..b8778924cef5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1869,6 +1869,7 @@ static int task_numa_migrate(struct task_struct *p)
static void numa_migrate_preferred(struct task_struct *p)
{
unsigned long interval = HZ;
+ unsigned long numa_migrate_retry;
/* This task has no NUMA fault statistics yet */
if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults))
@@ -1876,7 +1877,18 @@ static void numa_migrate_preferred(struct task_struct *p)
/* Periodically retry migrating the task to the preferred node */
interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
- p->numa_migrate_retry = jiffies + interval;
+ numa_migrate_retry = jiffies + interval;
+
+ /*
+ * Check that the new retry threshold is after the current one. If
+ * the retry is in the future, it implies that wake_affine has
+ * temporarily asked NUMA balancing to back off from placement.
+ */
+ if (numa_migrate_retry > p->numa_migrate_retry)
+ return;
+
+ /* Safe to try placing the task on the preferred node */
+ p->numa_migrate_retry = numa_migrate_retry;
/* Success if task is already running on preferred CPU */
if (task_node(p) == p->numa_preferred_nid)
@@ -5759,6 +5771,48 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
}
+#ifdef CONFIG_NUMA_BALANCING
+static void
+update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
+{
+ unsigned long interval;
+
+ if (!static_branch_likely(&sched_numa_balancing))
+ return;
+
+ /* If balancing has no preference then continue gathering data */
+ if (p->numa_preferred_nid == -1)
+ return;
+
+ /*
+ * If the wakeup is not affecting locality then it is neutral from
+ * the perspective of NUMA balancing so continue gathering data.
+ */
+ if (cpus_share_cache(prev_cpu, target))
+ return;
+
+ /*
+ * Temporarily prevent NUMA balancing trying to place waker/wakee after
+ * wakee has been moved by wake_affine. This will potentially allow
+ * related tasks to converge and update their data placement. The
+ * 4 * numa_scan_period is to allow the two-pass filter to migrate
+ * hot data to the wakers node.
+ */
+ interval = max(sysctl_numa_balancing_scan_delay,
+ p->numa_scan_period << 2);
+ p->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
+
+ interval = max(sysctl_numa_balancing_scan_delay,
+ current->numa_scan_period << 2);
+ current->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
+}
+#else
+static void
+update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
+{
+}
+#endif
+
static int wake_affine(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int prev_cpu, int sync)
{
@@ -5774,6 +5828,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
if (target == nr_cpumask_bits)
return prev_cpu;
+ update_wa_numa_placement(p, prev_cpu, target);
schedstat_inc(sd->ttwu_move_affine);
schedstat_inc(p->se.statistics.nr_wakeups_affine);
return target;
--
2.15.1
* Re: [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine
2018-02-13 13:37 ` [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine Mel Gorman
@ 2018-02-13 14:01 ` Peter Zijlstra
2018-02-13 14:18 ` Mel Gorman
2018-02-21 10:30 ` [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine() tip-bot for Mel Gorman
1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2018-02-13 14:01 UTC (permalink / raw)
To: Mel Gorman
Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich, LKML
On Tue, Feb 13, 2018 at 01:37:30PM +0000, Mel Gorman wrote:
> +static void
> +update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
> +{
> + unsigned long interval;
> +
> + if (!static_branch_likely(&sched_numa_balancing))
> + return;
> +
> + /* If balancing has no preference then continue gathering data */
> + if (p->numa_preferred_nid == -1)
> + return;
> +
> + /*
> + * If the wakeup is not affecting locality then it is neutral from
> + * the perspective of NUMA balancing so continue gathering data.
> + */
> + if (cpus_share_cache(prev_cpu, target))
> + return;
Dang, I wanted to mention this before, but it slipped my mind. The
comment and code don't match.
Did you want to write:
if (cpu_to_node(prev_cpu) == cpu_to_node(target))
return;
> + /*
> + * Temporarily prevent NUMA balancing trying to place waker/wakee after
> + * wakee has been moved by wake_affine. This will potentially allow
> + * related tasks to converge and update their data placement. The
> + * 4 * numa_scan_period is to allow the two-pass filter to migrate
> + * hot data to the wakers node.
> + */
> + interval = max(sysctl_numa_balancing_scan_delay,
> + p->numa_scan_period << 2);
> + p->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
> +
> + interval = max(sysctl_numa_balancing_scan_delay,
> + current->numa_scan_period << 2);
> + current->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
> +}
* Re: [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine
2018-02-13 14:01 ` Peter Zijlstra
@ 2018-02-13 14:18 ` Mel Gorman
2018-02-13 14:43 ` Peter Zijlstra
0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 14:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich, LKML
On Tue, Feb 13, 2018 at 03:01:37PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 13, 2018 at 01:37:30PM +0000, Mel Gorman wrote:
> > +static void
> > +update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
> > +{
> > + unsigned long interval;
> > +
> > + if (!static_branch_likely(&sched_numa_balancing))
> > + return;
> > +
> > + /* If balancing has no preference then continue gathering data */
> > + if (p->numa_preferred_nid == -1)
> > + return;
> > +
> > + /*
> > + * If the wakeup is not affecting locality then it is neutral from
> > + * the perspective of NUMA balancing so continue gathering data.
> > + */
> > + if (cpus_share_cache(prev_cpu, target))
> > + return;
>
> Dang, I wanted to mention this before, but it slipped my mind. The
> comment and code don't match.
>
> Did you want to write:
>
> if (cpu_to_node(prev_cpu) == cpu_to_node(target))
> return;
>
Well, it was deliberate. While it's possible to be on the same memory
node and not sharing cache, the scheduler typically is more concerned with
the LLC than NUMA per-se. If they share LLC, then I also assume that they
share memory locality.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine
2018-02-13 14:18 ` Mel Gorman
@ 2018-02-13 14:43 ` Peter Zijlstra
2018-02-13 15:00 ` Mel Gorman
0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2018-02-13 14:43 UTC (permalink / raw)
To: Mel Gorman
Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich, LKML
On Tue, Feb 13, 2018 at 02:18:12PM +0000, Mel Gorman wrote:
> On Tue, Feb 13, 2018 at 03:01:37PM +0100, Peter Zijlstra wrote:
> > On Tue, Feb 13, 2018 at 01:37:30PM +0000, Mel Gorman wrote:
> > > +static void
> > > +update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
> > > +{
> > > + unsigned long interval;
> > > +
> > > + if (!static_branch_likely(&sched_numa_balancing))
> > > + return;
> > > +
> > > + /* If balancing has no preference then continue gathering data */
> > > + if (p->numa_preferred_nid == -1)
> > > + return;
> > > +
> > > + /*
> > > + * If the wakeup is not affecting locality then it is neutral from
> > > + * the perspective of NUMA balancing so continue gathering data.
> > > + */
> > > + if (cpus_share_cache(prev_cpu, target))
> > > + return;
> >
> > Dang, I wanted to mention this before, but it slipped my mind. The
> > comment and code don't match.
> >
> > Did you want to write:
> >
> > if (cpu_to_node(prev_cpu) == cpu_to_node(target))
> > return;
> >
>
> Well, it was deliberate. While it's possible to be on the same memory
> node and not sharing cache, the scheduler typically is more concerned with
> the LLC than NUMA per-se. If they share LLC, then I also assume that they
> share memory locality.
True, but the remaining code only has effect for numa balance, which is
concerned with nodes. So I don't see the point of using something
potentially smaller.
Suppose someone did hardware where a node has 2 cache clusters, then
we'd still set a wake_affine back-off for numa-balance, even though it
remains on the same node.
How would that be useful?
* Re: [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine
2018-02-13 14:43 ` Peter Zijlstra
@ 2018-02-13 15:00 ` Mel Gorman
2018-02-13 15:10 ` Peter Zijlstra
0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 15:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich, LKML
On Tue, Feb 13, 2018 at 03:43:26PM +0100, Peter Zijlstra wrote:
> >
> > Well, it was deliberate. While it's possible to be on the same memory
> > node and not sharing cache, the scheduler typically is more concerned with
> > the LLC than NUMA per-se. If they share LLC, then I also assume that they
> > share memory locality.
>
> True, but the remaining code only has effect for numa balance, which is
> concerned with nodes. So I don't see the point of using something
> potentially smaller.
>
> Suppose someone did hardware where a node has 2 cache clusters, then
> we'd still set a wake_affine back-off for numa-balance, even though it
> remains on the same node.
>
> How would that be useful?
Fair point, it could be unexpected from a NUMA balancing perspective and
sub-numa clustering does exist so it's a potential issue. I'm happy to
change it to cpu_to_node. I can resend the series if you prefer but feel
free to change it in-place if you're picking it up. I do not expect any
change on the machines I tested with as for all of them LLC was equivalent
to checking the node ID.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine
2018-02-13 15:00 ` Mel Gorman
@ 2018-02-13 15:10 ` Peter Zijlstra
0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2018-02-13 15:10 UTC (permalink / raw)
To: Mel Gorman
Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich, LKML
On Tue, Feb 13, 2018 at 03:00:20PM +0000, Mel Gorman wrote:
> On Tue, Feb 13, 2018 at 03:43:26PM +0100, Peter Zijlstra wrote:
> > >
> > > Well, it was deliberate. While it's possible to be on the same memory
> > > node and not sharing cache, the scheduler typically is more concerned with
> > > the LLC than NUMA per-se. If they share LLC, then I also assume that they
> > > share memory locality.
> >
> > True, but the remaining code only has effect for numa balance, which is
> > concerned with nodes. So I don't see the point of using something
> > potentially smaller.
> >
> > Suppose someone did hardware where a node has 2 cache clusters, then
> > we'd still set a wake_affine back-off for numa-balance, even though it
> > remains on the same node.
> >
> > How would that be useful?
>
> Fair point, it could be unexpected from a NUMA balancing perspective and
> sub-numa clustering does exist so it's a potential issue. I'm happy to
> change it to cpu_to_node. I can resend the series if you prefer but feel
> free to change it in-place if you're picking it up. I do not expect any
> change on the machines I tested with as for all of them LLC was equivalent
> to checking the node ID.
OK, changed it locally. Thanks!
* [tip:sched/core] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine
2018-02-13 13:37 ` [PATCH 1/6] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine Mel Gorman
@ 2018-02-21 10:27 ` tip-bot for Mel Gorman
0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Mel Gorman @ 2018-02-21 10:27 UTC (permalink / raw)
To: linux-tip-commits
Cc: peterz, torvalds, ggherdovich, efault, tglx, mgorman, mingo,
matt, hpa, linux-kernel
Commit-ID: 7ebb66a12f85bc375beaf45ca900427fe47aa8f7
Gitweb: https://git.kernel.org/tip/7ebb66a12f85bc375beaf45ca900427fe47aa8f7
Author: Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Tue, 13 Feb 2018 13:37:25 +0000
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 21 Feb 2018 08:49:07 +0100
sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine
The only caller of wake_affine() knows the CPU ID. Pass it in instead of
rechecking it.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 820f94c..0132572 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5751,9 +5751,8 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p,
- int prev_cpu, int sync)
+ int this_cpu, int prev_cpu, int sync)
{
- int this_cpu = smp_processor_id();
int target = nr_cpumask_bits;
if (sched_feat(WA_IDLE))
@@ -6376,7 +6375,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
if (cpu == prev_cpu)
goto pick_cpu;
- new_cpu = wake_affine(affine_sd, p, prev_cpu, sync);
+ new_cpu = wake_affine(affine_sd, p, cpu, prev_cpu, sync);
}
if (sd && !(sd_flag & SD_BALANCE_FORK)) {
* [tip:sched/core] sched/fair: Defer calculation of 'prev_eff_load' in wake_affine_weight() until needed
2018-02-13 13:37 ` [PATCH 2/6] sched/fair: Defer calculation of prev_eff_load in wake_affine until needed Mel Gorman
@ 2018-02-21 10:28 ` tip-bot for Mel Gorman
0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Mel Gorman @ 2018-02-21 10:28 UTC (permalink / raw)
To: linux-tip-commits
Cc: hpa, efault, matt, torvalds, mingo, ggherdovich, mgorman, peterz,
linux-kernel, tglx
Commit-ID: eeb60398639143c11ff2c8b509e3a471411bb5d3
Gitweb: https://git.kernel.org/tip/eeb60398639143c11ff2c8b509e3a471411bb5d3
Author: Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Tue, 13 Feb 2018 13:37:26 +0000
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 21 Feb 2018 08:49:07 +0100
sched/fair: Defer calculation of 'prev_eff_load' in wake_affine_weight() until needed
On sync wakeups, the previous CPU effective load may not be used so delay
the calculation until it's needed.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0132572..ae3e6f8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5724,7 +5724,6 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
unsigned long task_load;
this_eff_load = target_load(this_cpu, sd->wake_idx);
- prev_eff_load = source_load(prev_cpu, sd->wake_idx);
if (sync) {
unsigned long current_load = task_h_load(current);
@@ -5742,6 +5741,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
this_eff_load *= 100;
this_eff_load *= capacity_of(prev_cpu);
+ prev_eff_load = source_load(prev_cpu, sd->wake_idx);
prev_eff_load -= task_load;
if (sched_feat(WA_BIAS))
prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
* [tip:sched/core] sched/fair: Do not migrate on wake_affine_weight() if weights are equal
2018-02-13 13:37 ` [PATCH 3/6] sched/fair: Do not migrate on wake_affine_weight if weights are equal Mel Gorman
@ 2018-02-21 10:28 ` tip-bot for Mel Gorman
0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Mel Gorman @ 2018-02-21 10:28 UTC (permalink / raw)
To: linux-tip-commits
Cc: torvalds, linux-kernel, tglx, matt, peterz, mingo, mgorman, hpa,
efault, ggherdovich
Commit-ID: 082f764a2f3f2968afa1a0b04a1ccb1b70633844
Gitweb: https://git.kernel.org/tip/082f764a2f3f2968afa1a0b04a1ccb1b70633844
Author: Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Tue, 13 Feb 2018 13:37:27 +0000
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 21 Feb 2018 08:49:08 +0100
sched/fair: Do not migrate on wake_affine_weight() if weights are equal
wake_affine_weight() will consider migrating a task to, or near, the current
CPU if there is a load imbalance. If the CPUs share LLC then either CPU
is valid as a search-for-idle-sibling target and equally appropriate for
stacking two tasks on one CPU if an idle sibling is unavailable. If they do
not share cache then a cross-node migration potentially impacts locality
so while they are equal from a CPU capacity point of view, they are not
equal in terms of memory locality. In either case, it's more appropriate
to migrate only if there is a difference in their effective load.
This patch modifies wake_affine_weight() to only consider migrating a task
if there is a load imbalance on normal wakeups, but allows potential
stacking if the loads are equal and it's a sync wakeup.
For the most part, the difference in performance is marginal. For example,
on a 4-socket server running netperf UDP_STREAM on localhost the differences
are as follows:
4.15.0 4.15.0
16rc0 noequal-v1r23
Hmean send-64 355.47 ( 0.00%) 349.50 ( -1.68%)
Hmean send-128 697.98 ( 0.00%) 693.35 ( -0.66%)
Hmean send-256 1328.02 ( 0.00%) 1318.77 ( -0.70%)
Hmean send-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%)
Hmean send-2048 9637.02 ( 0.00%) 9601.34 ( -0.37%)
Hmean send-3312 14355.37 ( 0.00%) 14414.51 ( 0.41%)
Hmean send-4096 16464.97 ( 0.00%) 16301.37 ( -0.99%)
Hmean send-8192 26722.42 ( 0.00%) 26428.95 ( -1.10%)
Hmean send-16384 38137.81 ( 0.00%) 38046.11 ( -0.24%)
Hmean recv-64 355.47 ( 0.00%) 349.50 ( -1.68%)
Hmean recv-128 697.98 ( 0.00%) 693.35 ( -0.66%)
Hmean recv-256 1328.02 ( 0.00%) 1318.77 ( -0.70%)
Hmean recv-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%)
Hmean recv-2048 9636.95 ( 0.00%) 9601.30 ( -0.37%)
Hmean recv-3312 14355.32 ( 0.00%) 14414.48 ( 0.41%)
Hmean recv-4096 16464.74 ( 0.00%) 16301.16 ( -0.99%)
Hmean recv-8192 26721.63 ( 0.00%) 26428.17 ( -1.10%)
Hmean recv-16384 38136.00 ( 0.00%) 38044.88 ( -0.24%)
Stddev send-64 7.30 ( 0.00%) 4.75 ( 34.96%)
Stddev send-128 15.15 ( 0.00%) 22.38 ( -47.66%)
Stddev send-256 13.99 ( 0.00%) 19.14 ( -36.81%)
Stddev send-1024 105.73 ( 0.00%) 67.38 ( 36.27%)
Stddev send-2048 294.57 ( 0.00%) 223.88 ( 24.00%)
Stddev send-3312 302.28 ( 0.00%) 271.74 ( 10.10%)
Stddev send-4096 195.92 ( 0.00%) 121.10 ( 38.19%)
Stddev send-8192 399.71 ( 0.00%) 563.77 ( -41.04%)
Stddev send-16384 1163.47 ( 0.00%) 1103.68 ( 5.14%)
Stddev recv-64 7.30 ( 0.00%) 4.75 ( 34.96%)
Stddev recv-128 15.15 ( 0.00%) 22.38 ( -47.66%)
Stddev recv-256 13.99 ( 0.00%) 19.14 ( -36.81%)
Stddev recv-1024 105.73 ( 0.00%) 67.38 ( 36.27%)
Stddev recv-2048 294.59 ( 0.00%) 223.89 ( 24.00%)
Stddev recv-3312 302.24 ( 0.00%) 271.75 ( 10.09%)
Stddev recv-4096 196.03 ( 0.00%) 121.14 ( 38.20%)
Stddev recv-8192 399.86 ( 0.00%) 563.65 ( -40.96%)
Stddev recv-16384 1163.79 ( 0.00%) 1103.86 ( 5.15%)
The difference in overall performance is marginal but note that most
measurements are less variable. There were similar observations for other
netperf comparisons. hackbench with sockets or threads, using processes or
threads, showed minor differences with some reduction in migrations. tbench
showed only marginal differences that were within the noise. dbench,
regardless of filesystem, showed minor differences, all of which were
within the noise. Multiple machines, both UMA and NUMA, were tested without
any regressions showing up.
The biggest risk with a patch like this is affecting wakeup latencies.
However, the schbench load from Facebook which is very sensitive to wakeup
latency showed a mixed result with mostly improvements in wakeup latency:
4.15.0 4.15.0
16rc0 noequal-v1r23
Lat 50.00th-qrtle-1 38.00 ( 0.00%) 38.00 ( 0.00%)
Lat 75.00th-qrtle-1 49.00 ( 0.00%) 41.00 ( 16.33%)
Lat 90.00th-qrtle-1 52.00 ( 0.00%) 50.00 ( 3.85%)
Lat 95.00th-qrtle-1 54.00 ( 0.00%) 51.00 ( 5.56%)
Lat 99.00th-qrtle-1 63.00 ( 0.00%) 60.00 ( 4.76%)
Lat 99.50th-qrtle-1 66.00 ( 0.00%) 61.00 ( 7.58%)
Lat 99.90th-qrtle-1 78.00 ( 0.00%) 65.00 ( 16.67%)
Lat 50.00th-qrtle-2 38.00 ( 0.00%) 38.00 ( 0.00%)
Lat 75.00th-qrtle-2 42.00 ( 0.00%) 43.00 ( -2.38%)
Lat 90.00th-qrtle-2 46.00 ( 0.00%) 48.00 ( -4.35%)
Lat 95.00th-qrtle-2 49.00 ( 0.00%) 50.00 ( -2.04%)
Lat 99.00th-qrtle-2 55.00 ( 0.00%) 57.00 ( -3.64%)
Lat 99.50th-qrtle-2 58.00 ( 0.00%) 60.00 ( -3.45%)
Lat 99.90th-qrtle-2 65.00 ( 0.00%) 68.00 ( -4.62%)
Lat 50.00th-qrtle-4 41.00 ( 0.00%) 41.00 ( 0.00%)
Lat 75.00th-qrtle-4 45.00 ( 0.00%) 46.00 ( -2.22%)
Lat 90.00th-qrtle-4 50.00 ( 0.00%) 50.00 ( 0.00%)
Lat 95.00th-qrtle-4 54.00 ( 0.00%) 53.00 ( 1.85%)
Lat 99.00th-qrtle-4 61.00 ( 0.00%) 61.00 ( 0.00%)
Lat 99.50th-qrtle-4 65.00 ( 0.00%) 64.00 ( 1.54%)
Lat 99.90th-qrtle-4 76.00 ( 0.00%) 82.00 ( -7.89%)
Lat 50.00th-qrtle-8 48.00 ( 0.00%) 46.00 ( 4.17%)
Lat 75.00th-qrtle-8 55.00 ( 0.00%) 54.00 ( 1.82%)
Lat 90.00th-qrtle-8 60.00 ( 0.00%) 59.00 ( 1.67%)
Lat 95.00th-qrtle-8 63.00 ( 0.00%) 63.00 ( 0.00%)
Lat 99.00th-qrtle-8 71.00 ( 0.00%) 69.00 ( 2.82%)
Lat 99.50th-qrtle-8 74.00 ( 0.00%) 73.00 ( 1.35%)
Lat 99.90th-qrtle-8 98.00 ( 0.00%) 90.00 ( 8.16%)
Lat 50.00th-qrtle-16 56.00 ( 0.00%) 55.00 ( 1.79%)
Lat 75.00th-qrtle-16 68.00 ( 0.00%) 67.00 ( 1.47%)
Lat 90.00th-qrtle-16 77.00 ( 0.00%) 78.00 ( -1.30%)
Lat 95.00th-qrtle-16 82.00 ( 0.00%) 84.00 ( -2.44%)
Lat 99.00th-qrtle-16 90.00 ( 0.00%) 93.00 ( -3.33%)
Lat 99.50th-qrtle-16 93.00 ( 0.00%) 97.00 ( -4.30%)
Lat 99.90th-qrtle-16 110.00 ( 0.00%) 110.00 ( 0.00%)
Lat 50.00th-qrtle-32 68.00 ( 0.00%) 62.00 ( 8.82%)
Lat 75.00th-qrtle-32 90.00 ( 0.00%) 83.00 ( 7.78%)
Lat 90.00th-qrtle-32 110.00 ( 0.00%) 100.00 ( 9.09%)
Lat 95.00th-qrtle-32 122.00 ( 0.00%) 111.00 ( 9.02%)
Lat 99.00th-qrtle-32 145.00 ( 0.00%) 133.00 ( 8.28%)
Lat 99.50th-qrtle-32 154.00 ( 0.00%) 143.00 ( 7.14%)
Lat 99.90th-qrtle-32 2316.00 ( 0.00%) 515.00 ( 77.76%)
Lat 50.00th-qrtle-35 69.00 ( 0.00%) 72.00 ( -4.35%)
Lat 75.00th-qrtle-35 92.00 ( 0.00%) 95.00 ( -3.26%)
Lat 90.00th-qrtle-35 111.00 ( 0.00%) 114.00 ( -2.70%)
Lat 95.00th-qrtle-35 122.00 ( 0.00%) 124.00 ( -1.64%)
Lat 99.00th-qrtle-35 142.00 ( 0.00%) 144.00 ( -1.41%)
Lat 99.50th-qrtle-35 150.00 ( 0.00%) 154.00 ( -2.67%)
Lat 99.90th-qrtle-35 6104.00 ( 0.00%) 5640.00 ( 7.60%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-4-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae3e6f8..a07920f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5747,7 +5747,16 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
prev_eff_load *= capacity_of(this_cpu);
- return this_eff_load <= prev_eff_load ? this_cpu : nr_cpumask_bits;
+ /*
+ * If sync, adjust the weight of prev_eff_load such that if
+ * prev_eff == this_eff that select_idle_sibling() will consider
+ * stacking the wakee on top of the waker if no other CPU is
+ * idle.
+ */
+ if (sync)
+ prev_eff_load += 1;
+
+ return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p,
* [tip:sched/core] sched/fair: Do not migrate due to a sync wakeup on exit
2018-02-13 13:37 ` [PATCH 4/6] sched/fair: Do not migrate due to a sync wakeup on exit Mel Gorman
@ 2018-02-21 10:29 ` tip-bot for Peter Zijlstra
0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2018-02-21 10:29 UTC (permalink / raw)
To: linux-tip-commits
Cc: torvalds, hpa, ggherdovich, efault, mgorman, tglx, peterz, mingo,
linux-kernel, matt
Commit-ID: 24d0c1d6e65f635b2c0684d0a42ff6c0674aa0e6
Gitweb: https://git.kernel.org/tip/24d0c1d6e65f635b2c0684d0a42ff6c0674aa0e6
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 13 Feb 2018 13:37:28 +0000
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 21 Feb 2018 08:49:42 +0100
sched/fair: Do not migrate due to a sync wakeup on exit
When a task exits, it notifies the parent that it has exited. This is a
sync wakeup and the exiting task may pull the parent towards the waker's
CPU. For simple workloads like using a shell, it was observed that the
shell is pulled across nodes by exiting processes. This is daft as the
parent may be long-lived and properly placed. This patch special cases a
sync wakeup on exit to avoid pulling tasks across nodes. Testing on a range
of workloads and machines showed very little difference in performance
although there was a small 3% boost on some machines running a shellscript
intensive workload (git regression test suite).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-5-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a07920f..302dda8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6350,7 +6350,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
int cpu = smp_processor_id();
int new_cpu = prev_cpu;
int want_affine = 0;
- int sync = wake_flags & WF_SYNC;
+ int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
* [tip:sched/core] sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on
2018-02-13 13:37 ` [PATCH 5/6] sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on Mel Gorman
@ 2018-02-21 10:29 ` tip-bot for Mel Gorman
0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Mel Gorman @ 2018-02-21 10:29 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, matt, hpa, torvalds, peterz, mingo, mgorman, tglx,
efault, ggherdovich
Commit-ID: 2c83362734dad8e48ccc0710b5cd2436a0323893
Gitweb: https://git.kernel.org/tip/2c83362734dad8e48ccc0710b5cd2436a0323893
Author: Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Tue, 13 Feb 2018 13:37:29 +0000
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 21 Feb 2018 08:49:43 +0100
sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on
find_idlest_group() compares a local group with each other group to select
the one that is most idle. When comparing groups in different NUMA domains,
a very slight imbalance is enough to select a remote NUMA node even if the
runnable load on both groups is 0 or close to 0. This ignores the cost of
remote accesses entirely and is a problem when selecting the CPU for a
newly forked task to run on: a forking server's children are almost
guaranteed to start on a remote node, incurring numerous remote accesses
and potentially causing automatic NUMA balancing to try to migrate the
task back or migrate the data to another node. Similar weirdness is
observed when a basic shell command pipes output to another, as each
process in the pipeline is likely to start on a different node and then
get adjusted later by wake_affine().
This patch adds the imbalance allowance to a remote group's runnable load
when deciding whether to select CPUs from a remote domain. If the local
domain is selected, imbalance is still used to try to select a CPU from a
lower scheduler domain's group instead of stacking tasks on the same CPU.
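For illustration, a simplified standalone sketch of the resulting decision
with made-up numbers (the real find_idlest_group() has an additional
average-load check that is omitted here; the actual hunk is below):

/* Sketch only: should the forked/woken task stay with the local group? */
static int stay_with_local_group(unsigned long this_runnable_load,
				 unsigned long min_runnable_load,
				 unsigned long imbalance, int crosses_numa)
{
	/* New rule: a remote group must beat local by more than "imbalance". */
	if (crosses_numa && min_runnable_load + imbalance >= this_runnable_load)
		return 1;

	/* Pre-existing rule: stay local if the remote group is clearly busier. */
	return min_runnable_load > this_runnable_load + imbalance;
}

Example: local runnable load 50, best remote group 0, imbalance roughly 256
(an imbalance_pct of 125 applied to one nice-0 task's load). Previously the
nearly idle remote group won on the raw 0 < 50 comparison and a fork landed
on another node; with the new rule, 0 + 256 >= 50 keeps the new task local.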
A variety of workloads and machines were tested and as expected, there is no
difference on UMA. The difference on NUMA can be dramatic. This is a comparison
of elapsed times running the git regression test suite. It's fork-intensive with
short-lived processes:
4.15.0 4.15.0
noexit-v1r23 sdnuma-v1r23
Elapsed min 1706.06 ( 0.00%) 1435.94 ( 15.83%)
Elapsed mean 1709.53 ( 0.00%) 1436.98 ( 15.94%)
Elapsed stddev 2.16 ( 0.00%) 1.01 ( 53.38%)
Elapsed coeffvar 0.13 ( 0.00%) 0.07 ( 44.54%)
Elapsed max 1711.59 ( 0.00%) 1438.01 ( 15.98%)
4.15.0 4.15.0
noexit-v1r23 sdnuma-v1r23
User 5434.12 5188.41
System 4878.77 3467.09
Elapsed 10259.06 8624.21
That shows a considerable reduction in elapsed times. It's important to
note that automatic NUMA balancing does not affect this load as processes
are too short-lived.
There is also a noticeable impact on hackbench, such as this example using
processes and pipes:
hackbench-process-pipes
4.15.0 4.15.0
noexit-v1r23 sdnuma-v1r23
Amean 1 1.0973 ( 0.00%) 0.9393 ( 14.40%)
Amean 4 1.3427 ( 0.00%) 1.3730 ( -2.26%)
Amean 7 1.4233 ( 0.00%) 1.6670 ( -17.12%)
Amean 12 3.0250 ( 0.00%) 3.3013 ( -9.13%)
Amean 21 9.0860 ( 0.00%) 9.5343 ( -4.93%)
Amean 30 14.6547 ( 0.00%) 13.2433 ( 9.63%)
Amean 48 22.5447 ( 0.00%) 20.4303 ( 9.38%)
Amean 79 29.2010 ( 0.00%) 26.7853 ( 8.27%)
Amean 110 36.7443 ( 0.00%) 35.8453 ( 2.45%)
Amean 141 45.8533 ( 0.00%) 42.6223 ( 7.05%)
Amean 172 55.1317 ( 0.00%) 50.6473 ( 8.13%)
Amean 203 64.4420 ( 0.00%) 58.3957 ( 9.38%)
Amean 234 73.2293 ( 0.00%) 67.1047 ( 8.36%)
Amean 265 80.5220 ( 0.00%) 75.7330 ( 5.95%)
Amean 296 88.7567 ( 0.00%) 82.1533 ( 7.44%)
It's not a universal win as there are occasions when spreading wide and
quickly is a benefit but it's more of a win than it is a loss. For other
workloads, there is little difference but netperf is interesting. Without
the patch, the server and client start on different nodes but quickly get
migrated due to wake_affine. Hence, the difference in overall performance
is marginal but detectable:
4.15.0 4.15.0
noexit-v1r23 sdnuma-v1r23
Hmean send-64 349.09 ( 0.00%) 354.67 ( 1.60%)
Hmean send-128 699.16 ( 0.00%) 702.91 ( 0.54%)
Hmean send-256 1316.34 ( 0.00%) 1350.07 ( 2.56%)
Hmean send-1024 5063.99 ( 0.00%) 5124.38 ( 1.19%)
Hmean send-2048 9705.19 ( 0.00%) 9687.44 ( -0.18%)
Hmean send-3312 14359.48 ( 0.00%) 14577.64 ( 1.52%)
Hmean send-4096 16324.20 ( 0.00%) 16393.62 ( 0.43%)
Hmean send-8192 26112.61 ( 0.00%) 26877.26 ( 2.93%)
Hmean send-16384 37208.44 ( 0.00%) 38683.43 ( 3.96%)
Hmean recv-64 349.09 ( 0.00%) 354.67 ( 1.60%)
Hmean recv-128 699.16 ( 0.00%) 702.91 ( 0.54%)
Hmean recv-256 1316.34 ( 0.00%) 1350.07 ( 2.56%)
Hmean recv-1024 5063.99 ( 0.00%) 5124.38 ( 1.19%)
Hmean recv-2048 9705.16 ( 0.00%) 9687.43 ( -0.18%)
Hmean recv-3312 14359.42 ( 0.00%) 14577.59 ( 1.52%)
Hmean recv-4096 16323.98 ( 0.00%) 16393.55 ( 0.43%)
Hmean recv-8192 26111.85 ( 0.00%) 26876.96 ( 2.93%)
Hmean recv-16384 37206.99 ( 0.00%) 38682.41 ( 3.97%)
However, what is very interesting is how automatic NUMA balancing behaves.
Each netperf instance runs long enough for balancing to activate:
4.15.0 4.15.0
noexit-v1r23 sdnuma-v1r23
NUMA base PTE updates 4620 1473
NUMA huge PMD updates 0 0
NUMA page range updates 4620 1473
NUMA hint faults 4301 1383
NUMA hint local faults 1309 451
NUMA hint local percent 30 32
NUMA pages migrated 1335 491
AutoNUMA cost 21% 6%
There is an unfortunate number of remote faults although tracing indicated
that the vast majority are in shared libraries. However, the tendency to
start tasks on the same node if there is capacity means that there were
far fewer PTE updates and faults incurred overall.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-6-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 302dda8..94aea5b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5911,6 +5911,18 @@ skip_spare:
if (!idlest)
return NULL;
+ /*
+ * When comparing groups across NUMA domains, it's possible for the
+ * local domain to be very lightly loaded relative to the remote
+ * domains but "imbalance" skews the comparison making remote CPUs
+ * look much more favourable. When considering cross-domain, add
+ * imbalance to the runnable load on the remote node and consider
+ * staying local.
+ */
+ if ((sd->flags & SD_NUMA) &&
+ min_runnable_load + imbalance >= this_runnable_load)
+ return NULL;
+
if (min_runnable_load > (this_runnable_load + imbalance))
return NULL;
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()
2018-02-13 13:37 ` [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine Mel Gorman
2018-02-13 14:01 ` Peter Zijlstra
@ 2018-02-21 10:30 ` tip-bot for Mel Gorman
2018-05-07 11:06 ` Srikar Dronamraju
1 sibling, 1 reply; 22+ messages in thread
From: tip-bot for Mel Gorman @ 2018-02-21 10:30 UTC (permalink / raw)
To: linux-tip-commits
Cc: tglx, mgorman, torvalds, ggherdovich, peterz, matt, linux-kernel,
efault, hpa, mingo
Commit-ID: 7347fc87dfe6b7315e74310ee1243dc222c68086
Gitweb: https://git.kernel.org/tip/7347fc87dfe6b7315e74310ee1243dc222c68086
Author: Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Tue, 13 Feb 2018 13:37:30 +0000
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 21 Feb 2018 08:49:45 +0100
sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()
If wake_affine() pulls a task to another node for any reason and the node is
no longer preferred, then temporarily stop automatic NUMA balancing from
pulling the task back. Otherwise, tasks with a strong waker/wakee
relationship may constantly fight automatic NUMA balancing over where a
task should be placed.
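For scale, assuming the default scan delay and a ~1s NUMA scan period
(illustrative arithmetic based on the patch below):

	normal placement retry:  min(1s, numa_scan_period / 16)        ~   62ms
	wake_affine backoff:     max(scan_delay, 4 * numa_scan_period) ~ 4000ms

so a single cross-node affine pull holds off placement retries for roughly
64 of the normal retry intervals.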
Once again netperf is interesting here. The performance barely changes
but the automatic NUMA balancing statistics do:
4.15.0 4.15.0
sdnuma-v1r23 delayretry-v1r23
Hmean send-64 354.67 ( 0.00%) 352.15 ( -0.71%)
Hmean send-128 702.91 ( 0.00%) 693.84 ( -1.29%)
Hmean send-256 1350.07 ( 0.00%) 1344.19 ( -0.44%)
Hmean send-1024 5124.38 ( 0.00%) 4941.24 ( -3.57%)
Hmean send-2048 9687.44 ( 0.00%) 9624.45 ( -0.65%)
Hmean send-3312 14577.64 ( 0.00%) 14514.35 ( -0.43%)
Hmean send-4096 16393.62 ( 0.00%) 16488.30 ( 0.58%)
Hmean send-8192 26877.26 ( 0.00%) 26431.63 ( -1.66%)
Hmean send-16384 38683.43 ( 0.00%) 38264.91 ( -1.08%)
Hmean recv-64 354.67 ( 0.00%) 352.15 ( -0.71%)
Hmean recv-128 702.91 ( 0.00%) 693.84 ( -1.29%)
Hmean recv-256 1350.07 ( 0.00%) 1344.19 ( -0.44%)
Hmean recv-1024 5124.38 ( 0.00%) 4941.24 ( -3.57%)
Hmean recv-2048 9687.43 ( 0.00%) 9624.45 ( -0.65%)
Hmean recv-3312 14577.59 ( 0.00%) 14514.35 ( -0.43%)
Hmean recv-4096 16393.55 ( 0.00%) 16488.20 ( 0.58%)
Hmean recv-8192 26876.96 ( 0.00%) 26431.29 ( -1.66%)
Hmean recv-16384 38682.41 ( 0.00%) 38263.94 ( -1.08%)
NUMA alloc hit 1465986 1423090
NUMA alloc miss 0 0
NUMA interleave hit 0 0
NUMA alloc local 1465897 1423003
NUMA base PTE updates 1473 1420
NUMA huge PMD updates 0 0
NUMA page range updates 1473 1420
NUMA hint faults 1383 1312
NUMA hint local faults 451 124
NUMA hint local percent 32 9
There is a slight degradation in performance but slightly fewer NUMA
faults. There is a large drop in the percentage of local faults, but the
bulk of the migrations for netperf are in small shared libraries, so this
reflects the fact that automatic NUMA balancing has backed off. This is a
case where, despite wake_affine() and automatic NUMA balancing fighting
over placement, there is a marginal benefit to rescheduling close to the
local data quickly. However, it should be noted that wake_affine() and
automatic NUMA balancing constantly fighting each other is undesirable.
However, the benefit in other cases is large. This is the result for NAS
with the D class sizing on a 4-socket machine:
nas-mpi
4.15.0 4.15.0
sdnuma-v1r23 delayretry-v1r23
Time cg.D 557.00 ( 0.00%) 431.82 ( 22.47%)
Time ep.D 77.83 ( 0.00%) 79.01 ( -1.52%)
Time is.D 26.46 ( 0.00%) 26.64 ( -0.68%)
Time lu.D 727.14 ( 0.00%) 597.94 ( 17.77%)
Time mg.D 191.35 ( 0.00%) 146.85 ( 23.26%)
4.15.0 4.15.0
sdnuma-v1r23 delayretry-v1r23
User 75665.20 70413.30
System 20321.59 8861.67
Elapsed 766.13 634.92
Minor Faults 16528502 7127941
Major Faults 4553 5068
NUMA alloc local 6963197 6749135
NUMA base PTE updates 366409093 107491434
NUMA huge PMD updates 687556 198880
NUMA page range updates 718437765 209317994
NUMA hint faults 13643410 4601187
NUMA hint local faults 9212593 3063996
NUMA hint local percent 67 66
Note the massive reduction in system CPU usage even though the percentage
of local faults is barely affected. The number of PTE updates also drops
sharply, showing that automatic NUMA balancing has backed off. A critical
observation is the similarly large reduction in minor faults, which is due
to far fewer NUMA hinting faults being trapped.
There were questions about NAS OMP and how it behaves when threads are
bound to CPUs. First, with threads unbound, there are more gains than
losses with this patch applied, along with a reduction in system CPU usage:
nas-omp
4.16.0-rc1 4.16.0-rc1
sdnuma-v2r1 delayretry-v2r1
Time bt.D 436.71 ( 0.00%) 430.05 ( 1.53%)
Time cg.D 201.02 ( 0.00%) 180.87 ( 10.02%)
Time ep.D 32.84 ( 0.00%) 32.68 ( 0.49%)
Time is.D 9.63 ( 0.00%) 9.64 ( -0.10%)
Time lu.D 331.20 ( 0.00%) 304.80 ( 7.97%)
Time mg.D 54.87 ( 0.00%) 52.72 ( 3.92%)
Time sp.D 1108.78 ( 0.00%) 917.10 ( 17.29%)
Time ua.D 378.81 ( 0.00%) 398.83 ( -5.28%)
4.16.0-rc1 4.16.0-rc1
sdnuma-v2r1 delayretry-v2r1
User 305633.08 296751.91
System 451.75 357.80
Elapsed 2595.73 2368.13
However, it does not close the gap between hard binding and running unbound.
There is a negligible difference between the performance of the baseline and
the patched kernel when threads are bound, so that comparison is not
presented; instead, the patched kernel is compared bound versus unbound:
4.16.0-rc1 4.16.0-rc1
delayretry-bind delayretry-unbound
Time bt.D 385.02 ( 0.00%) 430.05 ( -11.70%)
Time cg.D 144.02 ( 0.00%) 180.87 ( -25.59%)
Time ep.D 32.85 ( 0.00%) 32.68 ( 0.52%)
Time is.D 10.52 ( 0.00%) 9.64 ( 8.37%)
Time lu.D 285.31 ( 0.00%) 304.80 ( -6.83%)
Time mg.D 43.21 ( 0.00%) 52.72 ( -22.01%)
Time sp.D 820.24 ( 0.00%) 917.10 ( -11.81%)
Time ua.D 337.09 ( 0.00%) 398.83 ( -18.32%)
4.16.0-rc1 4.16.0-rc1
delayretry-bind delayretry-unbound
User 277731.25 296751.91
System 261.29 357.80
Elapsed 2100.55 2368.13
Unfortunately, while performance is improved by the patch, there is still
quite a long way to go before it's equivalent to hard binding.
Other workloads like hackbench, tbench, dbench and schbench are barely
affected. dbench shows a mix of gains and losses depending on the machine
although in general, the results are more stable.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-7-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 56 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 94aea5b..33662a3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1869,6 +1869,7 @@ static int task_numa_migrate(struct task_struct *p)
static void numa_migrate_preferred(struct task_struct *p)
{
unsigned long interval = HZ;
+ unsigned long numa_migrate_retry;
/* This task has no NUMA fault statistics yet */
if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults))
@@ -1876,7 +1877,18 @@ static void numa_migrate_preferred(struct task_struct *p)
/* Periodically retry migrating the task to the preferred node */
interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
- p->numa_migrate_retry = jiffies + interval;
+ numa_migrate_retry = jiffies + interval;
+
+ /*
+ * Check that the new retry threshold is after the current one. If
+ * the retry is in the future, it implies that wake_affine has
+ * temporarily asked NUMA balancing to backoff from placement.
+ */
+ if (numa_migrate_retry > p->numa_migrate_retry)
+ return;
+
+ /* Safe to try placing the task on the preferred node */
+ p->numa_migrate_retry = numa_migrate_retry;
/* Success if task is already running on preferred CPU */
if (task_node(p) == p->numa_preferred_nid)
@@ -5759,6 +5771,48 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
}
+#ifdef CONFIG_NUMA_BALANCING
+static void
+update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
+{
+ unsigned long interval;
+
+ if (!static_branch_likely(&sched_numa_balancing))
+ return;
+
+ /* If balancing has no preference then continue gathering data */
+ if (p->numa_preferred_nid == -1)
+ return;
+
+ /*
+ * If the wakeup is not affecting locality then it is neutral from
+ * the perspective of NUMA balancing so continue gathering data.
+ */
+ if (cpu_to_node(prev_cpu) == cpu_to_node(target))
+ return;
+
+ /*
+ * Temporarily prevent NUMA balancing trying to place waker/wakee after
+ * wakee has been moved by wake_affine. This will potentially allow
+ * related tasks to converge and update their data placement. The
+ * 4 * numa_scan_period is to allow the two-pass filter to migrate
+ * hot data to the waker's node.
+ */
+ interval = max(sysctl_numa_balancing_scan_delay,
+ p->numa_scan_period << 2);
+ p->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
+
+ interval = max(sysctl_numa_balancing_scan_delay,
+ current->numa_scan_period << 2);
+ current->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
+}
+#else
+static void
+update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
+{
+}
+#endif
+
static int wake_affine(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int prev_cpu, int sync)
{
@@ -5774,6 +5828,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
if (target == nr_cpumask_bits)
return prev_cpu;
+ update_wa_numa_placement(p, prev_cpu, target);
schedstat_inc(sd->ttwu_move_affine);
schedstat_inc(p->se.statistics.nr_wakeups_affine);
return target;
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()
2018-02-21 10:30 ` [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine() tip-bot for Mel Gorman
@ 2018-05-07 11:06 ` Srikar Dronamraju
2018-05-09 8:41 ` Mel Gorman
0 siblings, 1 reply; 22+ messages in thread
From: Srikar Dronamraju @ 2018-05-07 11:06 UTC (permalink / raw)
To: mgorman, torvalds, tglx, mingo, hpa, efault, linux-kernel, matt,
peterz, ggherdovich
Cc: linux-tip-commits, mpe
Hi Mel,
I do see performance improving with this commit 7347fc87df "sched/numa:
Delay retrying placement for automatic NUMA balance after wake_affine()"
even on powerpc where we have SD_WAKE_AFFINE *disabled* on numa sched
domains. Ideally this commit should not have affected powerpc machines.
That made me look a bit deeper.
> @@ -1876,7 +1877,18 @@ static void numa_migrate_preferred(struct task_struct *p)
>
> /* Periodically retry migrating the task to the preferred node */
> interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
> - p->numa_migrate_retry = jiffies + interval;
> + numa_migrate_retry = jiffies + interval;
> +
> + /*
> + * Check that the new retry threshold is after the current one. If
> + * the retry is in the future, it implies that wake_affine has
> + * temporarily asked NUMA balancing to backoff from placement.
> + */
> + if (numa_migrate_retry > p->numa_migrate_retry)
> + return;
The above check looks wrong. This check will most likely be true, since
numa_migrate_preferred() itself is called either when jiffies >
p->numa_migrate_retry or if the task's numa_preferred_nid has changed.
Hence we never end up calling task_numa_migrate(), i.e. we never go through
the active CPU balancing path in NUMA balancing.
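To make it concrete (the common case, with no wake_affine backoff in place):

	entry condition:  jiffies > p->numa_migrate_retry   (old retry is in the past)
	new value:        numa_migrate_retry = jiffies + interval   (always in the future)
	check:            numa_migrate_retry > p->numa_migrate_retry  -> true
	result:           return before task_numa_migrate(), every time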
Reading the comments just above the check makes me think the check
should have been
if (numa_migrate_retry < p->numa_migrate_retry)
return;
Here is perf stat output with 7347fc87df running perf bench numa mem
--no-data_rand_walk 96 -p 2 -t 48 -G 0 -P 3072 -T 0 -l 50 -c -s 1000
2,13,898 cs ( +- 2.65% )
10,228 migrations ( +- 14.61% )
21,86,406 faults ( +- 9.69% )
40,65,84,68,026 cache-misses ( +- 0.31% )
0 sched:sched_move_numa <---------------
0 sched:sched_stick_numa <---------------
0 sched:sched_swap_numa <---------------
1,41,780 migrate:mm_migrate_pages ( +- 24.11% )
0 migrate:mm_numa_migrate_ratelimit
778.331602169 seconds time elapsed
If you look at the sched_move_numa, sched_stick_numa and sched_swap_numa
numbers, it's very clear that we did not try any active CPU migrations.
Same command with the commit reverted
2,38,685 cs ( +- 2.93% )
25,127 migrations ( +- 13.22% )
17,27,858 faults ( +- 2.61% )
34,77,06,21,298 cache-misses ( +- 0.61% )
560 sched:sched_move_numa ( +- 2.05% )
16 sched:sched_stick_numa ( +- 33.33% )
310 sched:sched_swap_numa ( +- 15.16% )
1,25,062 migrate:mm_migrate_pages ( +- 0.91% )
0 migrate:mm_numa_migrate_ratelimit
916.777315465 seconds time elapsed
(The numbers are almost the same with just that check commented out or modified.)
So we are seeing an improvement, but the improvement comes from bypassing
the active CPU balancing. Do we really want to bypass this code?
> +
> + /* Safe to try placing the task on the preferred node */
> + p->numa_migrate_retry = numa_migrate_retry;
>
> /* Success if task is already running on preferred CPU */
> if (task_node(p) == p->numa_preferred_nid)
> @@ -5759,6 +5771,48 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()
2018-05-07 11:06 ` Srikar Dronamraju
@ 2018-05-09 8:41 ` Mel Gorman
2018-05-09 10:58 ` Srikar Dronamraju
0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-05-09 8:41 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: torvalds, tglx, mingo, hpa, efault, linux-kernel, matt, peterz,
ggherdovich, linux-tip-commits, mpe
On Mon, May 07, 2018 at 04:06:07AM -0700, Srikar Dronamraju wrote:
> > @@ -1876,7 +1877,18 @@ static void numa_migrate_preferred(struct task_struct *p)
> >
> > /* Periodically retry migrating the task to the preferred node */
> > interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
> > - p->numa_migrate_retry = jiffies + interval;
> > + numa_migrate_retry = jiffies + interval;
> > +
> > + /*
> > + * Check that the new retry threshold is after the current one. If
> > + * the retry is in the future, it implies that wake_affine has
> > + * temporarily asked NUMA balancing to backoff from placement.
> > + */
> > + if (numa_migrate_retry > p->numa_migrate_retry)
> > + return;
>
> The above check looks wrong. This check will most likely be true, since
> numa_migrate_preferred() itself is called either when jiffies >
> p->numa_migrate_retry or if the task's numa_preferred_nid has changed.
>
Sorry for the delay getting back -- viral infections combined with a public
day off slowed me down.
You're right, without affine wakeups and with a wakeup-intensive workload
the path may never be hit, and with the current code it effectively acts
as a broken throttling mechanism. However, I've confirmed that "fixing"
it has mixed results with many regressions on x86 for both 2 and 4 socket
boxes. I need time to think about it and see if this can be fixed without
introducing another regression. I'll also check if a plain revert is the
way to go for a short-term fix and then revisit it.
Thanks Srikar.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()
2018-05-09 8:41 ` Mel Gorman
@ 2018-05-09 10:58 ` Srikar Dronamraju
2018-05-09 16:34 ` Mel Gorman
0 siblings, 1 reply; 22+ messages in thread
From: Srikar Dronamraju @ 2018-05-09 10:58 UTC (permalink / raw)
To: Mel Gorman
Cc: torvalds, tglx, mingo, hpa, efault, linux-kernel, matt, peterz,
ggherdovich, linux-tip-commits, mpe
* Mel Gorman <mgorman@techsingularity.net> [2018-05-09 09:41:48]:
> On Mon, May 07, 2018 at 04:06:07AM -0700, Srikar Dronamraju wrote:
> > > @@ -1876,7 +1877,18 @@ static void numa_migrate_preferred(struct task_struct *p)
> > >
> > > /* Periodically retry migrating the task to the preferred node */
> > > interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
> > > - p->numa_migrate_retry = jiffies + interval;
> > > + numa_migrate_retry = jiffies + interval;
> > > +
> > > + /*
> > > + * Check that the new retry threshold is after the current one. If
> > > + * the retry is in the future, it implies that wake_affine has
> > > + * temporarily asked NUMA balancing to backoff from placement.
> > > + */
> > > + if (numa_migrate_retry > p->numa_migrate_retry)
> > > + return;
> >
> > The above check looks wrong. This check will most likely be true, since
> > numa_migrate_preferred() itself is called either when jiffies >
> > p->numa_migrate_retry or if the task's numa_preferred_nid has changed.
> >
>
> You're right, without affine wakeups and with a wakeup-intensive workload
> the path may never be hit, and with the current code it effectively acts
> as a broken throttling mechanism.
I haven't tried on an x86 box, but I am still trying to get my head around
that check. How do affine wakeups make a difference to this check?
Let's say p->numa_migrate_retry was set by wake_affine and the task has
crossed that temporary period where we don't want the task to undergo
NUMA balancing.
Now the task is back at numa_migrate_preferred(); p->numa_migrate_retry
is less than jiffies (something like "current jiffies - 100"). It
would always return from that check.
In the other scenario, where wake_affine set p->numa_migrate_retry to a
bigger value, the task calls numa_migrate_preferred() and the local
numa_migrate_retry could be before p->numa_migrate_retry. In such a case,
we should have stopped the task from migrating. However, we overwrite
p->numa_migrate_retry and do the task_numa_migrate(). Somehow this
doesn't seem to achieve what the commit intended.
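To make that second scenario concrete with made-up values (assuming a ~1s
scan period):

	wake_affine backoff:       p->numa_migrate_retry = jiffies + 4*HZ
	numa_migrate_preferred():  numa_migrate_retry    = jiffies + HZ/16
	check:  (jiffies + HZ/16) > (jiffies + 4*HZ)  -> false, so we fall
	through, overwrite p->numa_migrate_retry with the smaller value and
	may call task_numa_migrate() anyway -- exactly what the backoff was
	meant to prevent.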
Or did I misunderstand?
--
Thanks and Regards
Srikar
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()
2018-05-09 10:58 ` Srikar Dronamraju
@ 2018-05-09 16:34 ` Mel Gorman
0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2018-05-09 16:34 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: torvalds, tglx, mingo, hpa, efault, linux-kernel, matt, peterz,
ggherdovich, linux-tip-commits, mpe
On Wed, May 09, 2018 at 03:58:14AM -0700, Srikar Dronamraju wrote:
> In the other scenario, where wake_affine set p->numa_migrate_retry to a
> bigger value, the task calls numa_migrate_preferred() and the local
> numa_migrate_retry could be before p->numa_migrate_retry. In such a case,
> we should have stopped the task from migrating. However, we overwrite
> p->numa_migrate_retry and do the task_numa_migrate(). Somehow this
> doesn't seem to achieve what the commit intended.
>
> Or did I misunderstand?
>
Nope, the logic is broken. While it can be "fixed", the end result adds
complexity for very little gain. I believe the right path for the moment
is a revert and retry at a future date.
Thanks.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2018-05-09 16:34 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-13 13:37 [PATCH 0/6] Reduce migrations and conflicts with automatic NUMA balancing v2 Mel Gorman
2018-02-13 13:37 ` [PATCH 1/6] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine Mel Gorman
2018-02-21 10:27 ` [tip:sched/core] " tip-bot for Mel Gorman
2018-02-13 13:37 ` [PATCH 2/6] sched/fair: Defer calculation of prev_eff_load in wake_affine until needed Mel Gorman
2018-02-21 10:28 ` [tip:sched/core] sched/fair: Defer calculation of 'prev_eff_load' in wake_affine_weight() " tip-bot for Mel Gorman
2018-02-13 13:37 ` [PATCH 3/6] sched/fair: Do not migrate on wake_affine_weight if weights are equal Mel Gorman
2018-02-21 10:28 ` [tip:sched/core] sched/fair: Do not migrate on wake_affine_weight() " tip-bot for Mel Gorman
2018-02-13 13:37 ` [PATCH 4/6] sched/fair: Do not migrate due to a sync wakeup on exit Mel Gorman
2018-02-21 10:29 ` [tip:sched/core] " tip-bot for Peter Zijlstra
2018-02-13 13:37 ` [PATCH 5/6] sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on Mel Gorman
2018-02-21 10:29 ` [tip:sched/core] " tip-bot for Mel Gorman
2018-02-13 13:37 ` [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine Mel Gorman
2018-02-13 14:01 ` Peter Zijlstra
2018-02-13 14:18 ` Mel Gorman
2018-02-13 14:43 ` Peter Zijlstra
2018-02-13 15:00 ` Mel Gorman
2018-02-13 15:10 ` Peter Zijlstra
2018-02-21 10:30 ` [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine() tip-bot for Mel Gorman
2018-05-07 11:06 ` Srikar Dronamraju
2018-05-09 8:41 ` Mel Gorman
2018-05-09 10:58 ` Srikar Dronamraju
2018-05-09 16:34 ` Mel Gorman