LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 00/19]  Fixes for sched/numa_balancing
@ 2018-06-04 10:00 Srikar Dronamraju
  2018-06-04 10:00 ` [PATCH 01/19] sched/numa: Remove redundant field Srikar Dronamraju
                   ` (18 more replies)
  0 siblings, 19 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

This patchset based on v4.17-rc5, provides few simple cleanups and fixes in
the sched/numa_balancing code. Some of these fixes are specific to systems
having more than 2 nodes. Few patches add per-rq and per-node complexities
to solve what I feel are a fairness/correctness issues.


Here are the scripts used to benchmark this series
They are based on Andrea Arcangeli and Petr Holasek's
https://github.com/pholasek/autonuma-benchmark.git

# cat numa01.sh
#! /bin/bash
# numa01.sh corresponds to 2 perf bench processes each having ncpus/2 threads
# 50 loops of 3G process memory.

THREADS=${THREADS:-$(($(getconf _NPROCESSORS_ONLN)/2))}
perf bench numa mem --no-data_rand_walk $CPUS -p 2 -t $THREADS -G 0 -P 3072 -T 0 -l 50 -c -s 2000 $@


# cat numa02.sh
#! /bin/bash
# numa02.sh corresponds to  1 perf bench process having ncpus threads
# 800 loops of 32 MB thread specific memory.
THREADS=$(getconf _NPROCESSORS_ONLN)
perf bench numa mem --no-data_rand_walk -p 1 -t $THREADS -G 0 -P 0 -T 32 -l 800 -c -s 2000 $@



# cat numa03.sh
#! /bin/bash
# numa03.sh corresponds to 1 perf bench process having ncpus threads
# 50 loops of 3G process memory.

THREADS=$(getconf _NPROCESSORS_ONLN)
perf bench numa mem --no-data_rand_walk -p 1 -t $THREADS -G 0 -P 3072 -T 0 -l 50 -c -s 2000 $@


# cat numa04.sh
#! /bin/bash
# numa04.sh corresponds to nrnodes perf bench processes each having
# ncpus/nrnodes threads 50 loops of 3G process memory.

NODES=$(numactl -H |awk /available/'{print $2}')
INST=$NODES
THREADS=$(($(getconf _NPROCESSORS_ONLN)/$INST))
perf bench numa mem --no-data_rand_walk -p $INST -t $THREADS -G 0 -P 3072 -T 0 -l 50 -c -s 2000 $@


# cat numa05.sh
#! /bin/bash
# numa05.sh corresponds to nrnodes *2  perf bench processes each having
# ncpus/(nrnodes *2 ) threads 50 loops of 3G process memory.


NODES=$(numactl -H |awk /available/'{print $2}')
INST=$((2*NODES))
THREADS=$(($(getconf _NPROCESSORS_ONLN)/$INST))
perf bench numa mem --no-data_rand_walk -p $INST -t $THREADS -G 0 -P 3072 -T 0 -l 50 -c -s 2000 $@

Stats were collected on a 4 node/96 cpu machine.

# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 32431 MB
node 0 free: 30759 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 31961 MB
node 1 free: 30502 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 2 size: 30425 MB
node 2 free: 30189 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 3 size: 32200 MB
node 3 free: 31346 MB
node distances:
node   0   1   2   3
  0:  10  20  40  40
  1:  20  10  40  40
  2:  40  40  10  20
  3:  40  40  20  10

Since we are looking for time as a metric, smaller numbers are better.

v4.17-rc5
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      440.65      941.32      758.98      189.17
numa01.sh       Sys:      183.48      320.07      258.42       50.09
numa01.sh      User:    37384.65    71818.14    60302.51    13798.96
numa02.sh      Real:       61.24       65.35       62.49        1.49
numa02.sh       Sys:       16.83       24.18       21.40        2.60
numa02.sh      User:     5219.59     5356.34     5264.03       49.07
numa03.sh      Real:      822.04      912.40      873.55       37.35
numa03.sh       Sys:      118.80      140.94      132.90        7.60
numa03.sh      User:    62485.19    70025.01    67208.33     2967.10
numa04.sh      Real:      690.66      872.12      778.49       65.44
numa04.sh       Sys:      459.26      563.03      494.03       42.39
numa04.sh      User:    51116.44    70527.20    58849.44     8461.28
numa05.sh      Real:      418.37      562.28      525.77       54.27
numa05.sh       Sys:      299.45      481.00      392.49       64.27
numa05.sh      User:    34115.09    41324.02    39105.30     2627.68


v4.17-rc5+patches
Testcase       Time:         Min         Max         Avg      StdDev    %Change
numa01.sh      Real:      424.63      566.18      498.12       59.26    34.36%
numa01.sh       Sys:      160.19      256.53      208.98       37.02    19.13%
numa01.sh      User:    37320.00    46225.58    42001.57     3482.45    30.34%
numa02.sh      Real:       60.17       62.47       60.91        0.85    2.528%
numa02.sh       Sys:       15.30       22.82       17.04        2.90    20.37%
numa02.sh      User:     5202.13     5255.51     5219.08       20.14    0.853%
numa03.sh      Real:      823.91      844.89      833.86        8.46    4.543%
numa03.sh       Sys:      130.69      148.29      140.47        6.21    -5.69%
numa03.sh      User:    62519.15    64262.20    63613.38      620.05    5.348%
numa04.sh      Real:      515.30      603.74      548.56       30.93    29.53%
numa04.sh       Sys:      459.73      525.48      489.18       21.63    0.981%
numa04.sh      User:    40561.96    44919.18    42047.87     1526.85    28.55%
numa05.sh      Real:      396.58      454.37      421.13       19.71    19.90%
numa05.sh       Sys:      208.72      422.02      348.90       73.60    11.10%
numa05.sh      User:    33124.08    36109.35    34846.47     1089.74    10.89%


Even the perf bench o/p (not included here because its pretty verbose)
points to a better consolidation. Attaching perf stat data instead.

Performance counter stats for 'system wide' (5 runs): numa01.sh
					               v4.17-rc5         v4.17-rc5+patches
cs                                          196,530 ( +- 13.22% )            117,524 ( +-  7.46% )
migrations                                   16,077 ( +- 16.98% )              6,602 ( +-  9.93% )
faults                                    1,698,631 ( +-  6.66% )          1,292,159 ( +-  3.99% )
cache-misses                         32,841,908,826 ( +-  5.33% )     27,059,597,808 ( +-  2.17% )
sched:sched_move_numa                           555 ( +- 25.92% )                  8 ( +- 38.45% )
sched:sched_stick_numa                           16 ( +- 20.73% )                  1 ( +- 31.18% )
sched:sched_swap_numa                           313 ( +- 23.21% )                278 ( +-  5.31% )
migrate:mm_migrate_pages                    138,981 ( +- 13.26% )            121,639 ( +-  8.75% )
migrate:mm_numa_migrate_ratelimit               439 ( +-100.00% )                138 ( +-100.00% )
seconds time elapsed     	      759.019898884 ( +- 12.46% )      498.158680658 ( +-  5.95% )

Numa Hinting and other vmstat info (Sum of 5 runs)

numa01.sh               v4.17-rc5         v4.17-rc5+patches
numa_foreign            0                 0
numa_hint_faults        7283263           5389142
numa_hint_faults_local  3689375           2209029
numa_hit                1401549           1264559
numa_huge_pte_updates   0                 0
numa_interleave         0                 0
numa_local              1401487           1264534
numa_miss               0                 0
numa_other              62                25
numa_pages_migrated     693724            608024
numa_pte_updates        7320797           5410463
pgfault                 8514248           6474639
pgmajfault              351               203
pgmigrate_fail          1181              171
pgmigrate_success       693724            608024

Faults and page migrations have decreased and that correlates with perf stat
numbers. We are achieving faster and better consolidation with lesser task
migrations. Esp number of failed numa task migrations have decreased.



Performance counter stats for 'system wide' (5 runs): numa02.sh
					               v4.17-rc5         v4.17-rc5+patches
cs                                           33,541 ( +-  2.20% )             33,472 ( +-  2.58% )
migrations                                    2,022 ( +-  2.36% )              1,742 ( +-  4.36% )
faults                                      452,697 ( +-  6.29% )            400,244 ( +-  3.14% )
cache-misses                          4,559,889,977 ( +-  0.40% )      4,510,581,926 ( +-  0.17% )
sched:sched_move_numa                            27 ( +- 40.26% )                  2 ( +- 32.39% )
sched:sched_stick_numa                            0                                0
sched:sched_swap_numa                             9 ( +- 41.81% )                  8 ( +- 23.39% )
migrate:mm_migrate_pages                     23,428 ( +-  6.91% )             19,418 ( +-  9.28% )
migrate:mm_numa_migrate_ratelimit               238 ( +- 61.52% )                315 ( +- 66.65% )
seconds time elapsed      	       62.532524687 ( +-  1.20% )       60.943143605 ( +-  0.70% )

Numa Hinting and other vmstat info (Sum of 5 runs)

numa02.sh		v4.17-rc5         v4.17-rc5+patches
numa_foreign            0                 0
numa_hint_faults        1797406           1541180
numa_hint_faults_local  1652638           1423986
numa_hit                447642            427011
numa_huge_pte_updates   0                 0
numa_interleave         0                 0
numa_local              447639            427011
numa_miss               0                 0
numa_other              3                 0
numa_pages_migrated     117142            97088
numa_pte_updates        1812907           1557075
pgfault                 2273993           2011485
pgmajfault              112               119
pgmigrate_fail          0                 0
pgmigrate_success       117142            97088

Again, lesser page faults, lesser page migrations but hitting page
migrations ratelimits more often.




Performance counter stats for 'system wide' (5 runs): numa03.sh
					               v4.17-rc5         v4.17-rc5+patches
cs                                          184,615 ( +-  2.83% )           178,526 ( +-  2.66% )
migrations                                   14,010 ( +-  4.68% )             9,511 ( +-  4.20% )
faults                                      766,543 ( +-  2.55% )           835,876 ( +-  6.09% )
cache-misses                         34,905,163,767 ( +-  0.75% )    35,979,821,603 ( +-  0.30% )
sched:sched_move_numa                           562 ( +-  6.38% )                 4 ( +- 22.64% )
sched:sched_stick_numa                           16 ( +- 16.42% )                 1 ( +- 61.24% )
sched:sched_swap_numa                           268 ( +-  4.88% )               394 ( +-  6.05% )
migrate:mm_migrate_pages                     53,999 ( +-  5.89% )            51,520 ( +-  8.68% )
migrate:mm_numa_migrate_ratelimit                 0                             508 ( +- 76.69% )
seconds time elapsed     	      873.586758847 ( +-  2.14% )     833.910858522 ( +-  0.51% )

Numa Hinting and other vmstat info (Sum of 5 runs)

numa03.sh		v4.17-rc5         v4.17-rc5+patches
numa_foreign            0                 0
numa_hint_faults        2962951           3275731
numa_hint_faults_local  1159054           1215206
numa_hit                702071            693754
numa_huge_pte_updates   0                 0
numa_interleave         0                 0
numa_local              702042            693722
numa_miss               0                 0
numa_other              29                32
numa_pages_migrated     269918            256838
numa_pte_updates        2963554           3305006
pgfault                 3853016           4193700
pgmajfault              202               281
pgmigrate_fail          77                764
pgmigrate_success       269918            256838

Seeing more faults and cache misses but lesser task migrations. Increase in
migration ratelimits is a worry.




Performance counter stats for 'system wide' (5 runs): numa04.sh

					               v4.17-rc5         v4.17-rc5+patches
cs                                          203,184 ( +-  6.67% )            141,653 ( +-  3.26% )
migrations                                   17,852 ( +- 12.84% )              6,837 ( +-  5.14% )
faults                                    3,650,884 ( +-  3.15% )          2,910,839 ( +-  1.36% )
cache-misses                         34,362,104,705 ( +-  2.26% )     30,064,624,934 ( +-  1.18% )
sched:sched_move_numa                           923 ( +- 21.36% )                  8 ( +- 30.22% )
sched:sched_stick_numa                           10 ( +- 23.89% )                  1 ( +- 46.77% )
sched:sched_swap_numa                           350 ( +- 21.32% )                261 ( +-  7.80% )
migrate:mm_migrate_pages                    288,410 ( +-  4.10% )            296,726 ( +-  3.33% )
migrate:mm_numa_migrate_ratelimit                 0                              162 ( +-100.00% )
seconds time elapsed     	      778.519948731 ( +-  4.20% )      548.606652462 ( +-  2.82% )

Numa Hinting and other vmstat info (Sum of 5 runs)

numa04.sh		v4.17-rc5         v4.17-rc5+patches
numa_foreign            0                 0
numa_hint_faults        16506833          12815656
numa_hint_faults_local  10237480          7526798
numa_hit                2617983           2647923
numa_huge_pte_updates   0                 0
numa_interleave         0                 0
numa_local              2617962           2647914
numa_miss               0                 0
numa_other              21                9
numa_pages_migrated     1441453           1481743
numa_pte_updates        16519819          12844781
pgfault                 18274350          14567947
pgmajfault              264               180
pgmigrate_fail          595               1889
pgmigrate_success       1441453           1481743

Lesser faults, page migrations and task migrations but increasing hitting
migrate ratelimits.




Performance counter stats for 'system wide' (5 runs): numa05.sh
					               v4.17-rc5         v4.17-rc5+patches
cs                                          149,941 ( +-  5.30% )            119,881 ( +-  9.39% )
migrations                                   10,478 ( +- 13.01% )              4,901 ( +-  6.53% )
faults                                    6,457,542 ( +-  3.07% )          5,799,805 ( +-  1.62% )
cache-misses                         31,146,034,587 ( +-  1.40% )     29,894,482,788 ( +-  0.73% )
sched:sched_move_numa                           667 ( +- 27.46% )                  6 ( +- 21.28% )
sched:sched_stick_numa                            3 ( +- 27.28% )                  0
sched:sched_swap_numa                           173 ( +- 20.79% )                113 ( +- 17.60% )
migrate:mm_migrate_pages                    419,446 ( +-  4.94% )            325,522 ( +- 13.88% )
migrate:mm_numa_migrate_ratelimit             1,714 ( +- 66.17% )                338 ( +- 45.02% )
seconds time elapsed     	      525.801216597 ( +-  5.16% )      421.186302929 ( +-  2.34% )

Numa Hinting and other vmstat info (Sum of 5 runs)

numa05.sh		v4.17-rc5         v4.17-rc5+patches
numa_foreign            0                 0
numa_hint_faults        29575825          26294424
numa_hint_faults_local  21637356          21808958
numa_hit                4246286           3771867
numa_huge_pte_updates   0                 0
numa_interleave         0                 0
numa_local              4246270           3771854
numa_miss               0                 0
numa_other              16                13
numa_pages_migrated     2096896           1625671
numa_pte_updates        29620565          26399455
pgfault                 32309072          29013170
pgmajfault              285               255
pgmigrate_fail          334               1937
pgmigrate_success       2096896           1625671

Faults and page migrations have decreased . We are achieving faster and
better consolidation with lesser task migrations. Esp ratio of swap/move numa
task migrations has increased.

Srikar Dronamraju (19):
  sched/numa: Remove redundant field.
  sched/numa: Evaluate move once per node
  sched/numa: Simplify load_too_imbalanced
  sched/numa: Set preferred_node based on best_cpu
  sched/numa: Use task faults only if numa_group is not yet setup
  sched/debug: Reverse the order of printing faults
  sched/numa: Skip nodes that are at hoplimit
  sched/numa: Remove unused task_capacity from numa_stats
  sched/numa: Modify migrate_swap to accept additional params
  sched/numa: Stop multiple tasks from moving to the cpu at the same time
  sched/numa: Restrict migrating in parallel to the same node.
  sched:numa Remove numa_has_capacity
  mm/migrate: Use xchg instead of spinlock
  sched/numa: Updation of scan period need not be in lock
  sched/numa: Use group_weights to identify if migration degrades locality
  sched/numa: Detect if node actively handling migration
  sched/numa: Pass destination cpu as a parameter to migrate_task_rq
  sched/numa: Reset scan rate whenever task moves across nodes
  sched/numa: Move task_placement closer to numa_migrate_preferred

 include/linux/mmzone.h  |   4 +-
 include/linux/sched.h   |   1 -
 kernel/sched/core.c     |  11 +-
 kernel/sched/deadline.c |   2 +-
 kernel/sched/debug.c    |   4 +-
 kernel/sched/fair.c     | 328 +++++++++++++++++++++++-------------------------
 kernel/sched/sched.h    |   6 +-
 mm/migrate.c            |   8 +-
 mm/page_alloc.c         |   2 +-
 9 files changed, 178 insertions(+), 188 deletions(-)

--
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 01/19] sched/numa: Remove redundant field.
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 14:53   ` Rik van Riel
  2018-06-05  8:41   ` Mel Gorman
  2018-06-04 10:00 ` [PATCH 02/19] sched/numa: Evaluate move once per node Srikar Dronamraju
                   ` (17 subsequent siblings)
  18 siblings, 2 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

numa_entry is a list_head defined in task_struct, but never used.

No functional change

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/sched.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c241370..d1d43be 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1016,7 +1016,6 @@ struct task_struct {
 	u64				last_sum_exec_runtime;
 	struct callback_head		numa_work;
 
-	struct list_head		numa_entry;
 	struct numa_group		*numa_group;
 
 	/*
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 02/19] sched/numa: Evaluate move once per node
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
  2018-06-04 10:00 ` [PATCH 01/19] sched/numa: Remove redundant field Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 14:51   ` Rik van Riel
  2018-06-04 10:00 ` [PATCH 03/19] sched/numa: Simplify load_too_imbalanced Srikar Dronamraju
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

task_numa_compare() helps choose the best cpu to move or swap the
selected task. To achieve this task_numa_compare() is called for every
cpu in the node. Currently it evaluates if the task can be moved/swapped
for each of the cpus. However the move evaluation is mostly independent
of the cpu. Evaluating the move logic once per node, provides scope for
simplifying task_numa_compare().

Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      440.65      941.32      758.98      189.17
numa01.sh       Sys:      183.48      320.07      258.42       50.09
numa01.sh      User:    37384.65    71818.14    60302.51    13798.96
numa02.sh      Real:       61.24       65.35       62.49        1.49
numa02.sh       Sys:       16.83       24.18       21.40        2.60
numa02.sh      User:     5219.59     5356.34     5264.03       49.07
numa03.sh      Real:      822.04      912.40      873.55       37.35
numa03.sh       Sys:      118.80      140.94      132.90        7.60
numa03.sh      User:    62485.19    70025.01    67208.33     2967.10
numa04.sh      Real:      690.66      872.12      778.49       65.44
numa04.sh       Sys:      459.26      563.03      494.03       42.39
numa04.sh      User:    51116.44    70527.20    58849.44     8461.28
numa05.sh      Real:      418.37      562.28      525.77       54.27
numa05.sh       Sys:      299.45      481.00      392.49       64.27
numa05.sh      User:    34115.09    41324.02    39105.30     2627.68

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      516.14      892.41      739.84      151.32 	 2.587%
numa01.sh       Sys:      153.16      192.99      177.70       14.58 	 45.42%
numa01.sh      User:    39821.04    69528.92    57193.87    10989.48 	 5.435%
numa02.sh      Real:       60.91       62.35       61.58        0.63 	 1.477%
numa02.sh       Sys:       16.47       26.16       21.20        3.85 	 0.943%
numa02.sh      User:     5227.58     5309.61     5265.17       31.04 	 -0.02%
numa03.sh      Real:      739.07      917.73      795.75       64.45 	 9.776%
numa03.sh       Sys:       94.46      136.08      109.48       14.58 	 21.39%
numa03.sh      User:    57478.56    72014.09    61764.48     5343.69 	 8.813%
numa04.sh      Real:      442.61      715.43      530.31       96.12 	 46.79%
numa04.sh       Sys:      224.90      348.63      285.61       48.83 	 72.97%
numa04.sh      User:    35836.84    47522.47    40235.41     3985.26 	 46.26%
numa05.sh      Real:      386.13      489.17      434.94       43.59 	 20.88%
numa05.sh       Sys:      144.29      438.56      278.80      105.78 	 40.77%
numa05.sh      User:    33255.86    36890.82    34879.31     1641.98 	 12.11%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 128 +++++++++++++++++++++++-----------------------------
 1 file changed, 57 insertions(+), 71 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 79f574d..57d1ee8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1541,9 +1541,8 @@ static bool load_too_imbalanced(long src_load, long dst_load,
  * be exchanged with the source task
  */
 static void task_numa_compare(struct task_numa_env *env,
-			      long taskimp, long groupimp)
+			      long taskimp, long groupimp, bool move)
 {
-	struct rq *src_rq = cpu_rq(env->src_cpu);
 	struct rq *dst_rq = cpu_rq(env->dst_cpu);
 	struct task_struct *cur;
 	long src_load, dst_load;
@@ -1564,97 +1563,73 @@ static void task_numa_compare(struct task_numa_env *env,
 	if (cur == env->p)
 		goto unlock;
 
+	if (!cur) {
+		if (!move || imp <= env->best_imp)
+			goto unlock;
+		else
+			goto assign;
+	}
+
 	/*
 	 * "imp" is the fault differential for the source task between the
 	 * source and destination node. Calculate the total differential for
 	 * the source task and potential destination task. The more negative
-	 * the value is, the more rmeote accesses that would be expected to
+	 * the value is, the more remote accesses that would be expected to
 	 * be incurred if the tasks were swapped.
 	 */
-	if (cur) {
-		/* Skip this swap candidate if cannot move to the source CPU: */
-		if (!cpumask_test_cpu(env->src_cpu, &cur->cpus_allowed))
-			goto unlock;
+	/* Skip this swap candidate if cannot move to the source cpu */
+	if (!cpumask_test_cpu(env->src_cpu, &cur->cpus_allowed))
+		goto unlock;
 
+	/*
+	 * If dst and source tasks are in the same NUMA group, or not
+	 * in any group then look only at task weights.
+	 */
+	if (cur->numa_group == env->p->numa_group) {
+		imp = taskimp + task_weight(cur, env->src_nid, dist) -
+		      task_weight(cur, env->dst_nid, dist);
 		/*
-		 * If dst and source tasks are in the same NUMA group, or not
-		 * in any group then look only at task weights.
+		 * Add some hysteresis to prevent swapping the
+		 * tasks within a group over tiny differences.
 		 */
-		if (cur->numa_group == env->p->numa_group) {
-			imp = taskimp + task_weight(cur, env->src_nid, dist) -
-			      task_weight(cur, env->dst_nid, dist);
-			/*
-			 * Add some hysteresis to prevent swapping the
-			 * tasks within a group over tiny differences.
-			 */
-			if (cur->numa_group)
-				imp -= imp/16;
-		} else {
-			/*
-			 * Compare the group weights. If a task is all by
-			 * itself (not part of a group), use the task weight
-			 * instead.
-			 */
-			if (cur->numa_group)
-				imp += group_weight(cur, env->src_nid, dist) -
-				       group_weight(cur, env->dst_nid, dist);
-			else
-				imp += task_weight(cur, env->src_nid, dist) -
-				       task_weight(cur, env->dst_nid, dist);
-		}
+		if (cur->numa_group)
+			imp -= imp / 16;
+	} else {
+		/*
+		 * Compare the group weights. If a task is all by itself
+		 * (not part of a group), use the task weight instead.
+		 */
+		if (cur->numa_group && env->p->numa_group)
+			imp += group_weight(cur, env->src_nid, dist) -
+			       group_weight(cur, env->dst_nid, dist);
+		else
+			imp += task_weight(cur, env->src_nid, dist) -
+			       task_weight(cur, env->dst_nid, dist);
 	}
 
-	if (imp <= env->best_imp && moveimp <= env->best_imp)
+	if (imp <= env->best_imp)
 		goto unlock;
 
-	if (!cur) {
-		/* Is there capacity at our destination? */
-		if (env->src_stats.nr_running <= env->src_stats.task_capacity &&
-		    !env->dst_stats.has_free_capacity)
-			goto unlock;
-
-		goto balance;
-	}
-
-	/* Balance doesn't matter much if we're running a task per CPU: */
-	if (imp > env->best_imp && src_rq->nr_running == 1 &&
-			dst_rq->nr_running == 1)
+	if (move && moveimp > imp && moveimp > env->best_imp) {
+		imp = moveimp - 1;
+		cur = NULL;
 		goto assign;
+	}
 
 	/*
 	 * In the overloaded case, try and keep the load balanced.
 	 */
-balance:
-	load = task_h_load(env->p);
+	load = task_h_load(env->p) - task_h_load(cur);
+	if (!load)
+		goto assign;
+
 	dst_load = env->dst_stats.load + load;
 	src_load = env->src_stats.load - load;
 
-	if (moveimp > imp && moveimp > env->best_imp) {
-		/*
-		 * If the improvement from just moving env->p direction is
-		 * better than swapping tasks around, check if a move is
-		 * possible. Store a slightly smaller score than moveimp,
-		 * so an actually idle CPU will win.
-		 */
-		if (!load_too_imbalanced(src_load, dst_load, env)) {
-			imp = moveimp - 1;
-			cur = NULL;
-			goto assign;
-		}
-	}
-
-	if (imp <= env->best_imp)
-		goto unlock;
-
-	if (cur) {
-		load = task_h_load(cur);
-		dst_load -= load;
-		src_load += load;
-	}
-
 	if (load_too_imbalanced(src_load, dst_load, env))
 		goto unlock;
 
+assign:
 	/*
 	 * One idle CPU per node is evaluated for a task numa move.
 	 * Call select_idle_sibling to maybe find a better one.
@@ -1670,7 +1645,6 @@ static void task_numa_compare(struct task_numa_env *env,
 		local_irq_enable();
 	}
 
-assign:
 	task_numa_assign(env, cur, imp);
 unlock:
 	rcu_read_unlock();
@@ -1679,15 +1653,27 @@ static void task_numa_compare(struct task_numa_env *env,
 static void task_numa_find_cpu(struct task_numa_env *env,
 				long taskimp, long groupimp)
 {
+	long src_load, dst_load, load;
+	bool move = false;
 	int cpu;
 
+	load = task_h_load(env->p);
+	dst_load = env->dst_stats.load + load;
+	src_load = env->src_stats.load - load;
+
+	/*
+	 * If the improvement from just moving env->p direction is better
+	 * than swapping tasks around, check if a move is possible.
+	 */
+	move = !load_too_imbalanced(src_load, dst_load, env);
+
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
 		/* Skip this CPU if the source task cannot migrate */
 		if (!cpumask_test_cpu(cpu, &env->p->cpus_allowed))
 			continue;
 
 		env->dst_cpu = cpu;
-		task_numa_compare(env, taskimp, groupimp);
+		task_numa_compare(env, taskimp, groupimp, move);
 	}
 }
 
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 03/19] sched/numa: Simplify load_too_imbalanced
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
  2018-06-04 10:00 ` [PATCH 01/19] sched/numa: Remove redundant field Srikar Dronamraju
  2018-06-04 10:00 ` [PATCH 02/19] sched/numa: Evaluate move once per node Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 14:57   ` Rik van Riel
  2018-06-05  8:46   ` Mel Gorman
  2018-06-04 10:00 ` [PATCH 04/19] sched/numa: Set preferred_node based on best_cpu Srikar Dronamraju
                   ` (15 subsequent siblings)
  18 siblings, 2 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

Currently load_too_imbalance() cares about the slope of imbalance.
It doesn't care of the direction of the imbalance.

However this may not work if nodes that are being compared have
dissimilar capacities. Few nodes might have more cores than other nodes
in the system. Also unlike traditional load balance at a NUMA sched
domain, multiple requests to migrate from the same source node to same
destination node may run in parallel. This can cause huge load
imbalance. This is specially true on a larger machines with either large
cores per node or more number of nodes in the system. Hence allow
move/swap only if the imbalance is going to reduce.

Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      516.14      892.41      739.84      151.32
numa01.sh       Sys:      153.16      192.99      177.70       14.58
numa01.sh      User:    39821.04    69528.92    57193.87    10989.48
numa02.sh      Real:       60.91       62.35       61.58        0.63
numa02.sh       Sys:       16.47       26.16       21.20        3.85
numa02.sh      User:     5227.58     5309.61     5265.17       31.04
numa03.sh      Real:      739.07      917.73      795.75       64.45
numa03.sh       Sys:       94.46      136.08      109.48       14.58
numa03.sh      User:    57478.56    72014.09    61764.48     5343.69
numa04.sh      Real:      442.61      715.43      530.31       96.12
numa04.sh       Sys:      224.90      348.63      285.61       48.83
numa04.sh      User:    35836.84    47522.47    40235.41     3985.26
numa05.sh      Real:      386.13      489.17      434.94       43.59
numa05.sh       Sys:      144.29      438.56      278.80      105.78
numa05.sh      User:    33255.86    36890.82    34879.31     1641.98

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      435.78      653.81      534.58       83.20 	 38.39%
numa01.sh       Sys:      121.93      187.18      145.90       23.47 	 21.79%
numa01.sh      User:    37082.81    51402.80    43647.60     5409.75 	 31.03%
numa02.sh      Real:       60.64       61.63       61.19        0.40 	 0.637%
numa02.sh       Sys:       14.72       25.68       19.06        4.03 	 11.22%
numa02.sh      User:     5210.95     5266.69     5233.30       20.82 	 0.608%
numa03.sh      Real:      746.51      808.24      780.36       23.88 	 1.972%
numa03.sh       Sys:       97.26      108.48      105.07        4.28 	 4.197%
numa03.sh      User:    58956.30    61397.05    60162.95     1050.82 	 2.661%
numa04.sh      Real:      465.97      519.27      484.81       19.62 	 9.385%
numa04.sh       Sys:      304.43      359.08      334.68       20.64 	 -14.6%
numa04.sh      User:    37544.16    41186.15    39262.44     1314.91 	 2.478%
numa05.sh      Real:      411.57      457.20      433.29       16.58 	 0.380%
numa05.sh       Sys:      230.05      435.48      339.95       67.58 	 -17.9%
numa05.sh      User:    33325.54    36896.31    35637.84     1222.64 	 -2.12%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 20 ++------------------
 1 file changed, 2 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 57d1ee8..ea32a66 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1507,28 +1507,12 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 	src_capacity = env->src_stats.compute_capacity;
 	dst_capacity = env->dst_stats.compute_capacity;
 
-	/* We care about the slope of the imbalance, not the direction. */
-	if (dst_load < src_load)
-		swap(dst_load, src_load);
-
-	/* Is the difference below the threshold? */
-	imb = dst_load * src_capacity * 100 -
-	      src_load * dst_capacity * env->imbalance_pct;
-	if (imb <= 0)
-		return false;
+	imb = abs(dst_load * src_capacity - src_load * dst_capacity);
 
-	/*
-	 * The imbalance is above the allowed threshold.
-	 * Compare it with the old imbalance.
-	 */
 	orig_src_load = env->src_stats.load;
 	orig_dst_load = env->dst_stats.load;
 
-	if (orig_dst_load < orig_src_load)
-		swap(orig_dst_load, orig_src_load);
-
-	old_imb = orig_dst_load * src_capacity * 100 -
-		  orig_src_load * dst_capacity * env->imbalance_pct;
+	old_imb = abs(orig_dst_load * src_capacity - orig_src_load * dst_capacity);
 
 	/* Would this change make things worse? */
 	return (imb > old_imb);
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 04/19] sched/numa: Set preferred_node based on best_cpu
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (2 preceding siblings ...)
  2018-06-04 10:00 ` [PATCH 03/19] sched/numa: Simplify load_too_imbalanced Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 12:18   ` Peter Zijlstra
  2018-06-04 12:23   ` Peter Zijlstra
  2018-06-04 10:00 ` [PATCH 05/19] sched/numa: Use task faults only if numa_group is not yet setup Srikar Dronamraju
                   ` (14 subsequent siblings)
  18 siblings, 2 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

Currently preferred node is set to dst_nid which is the last node in the
iteration whose group weight or task weight is greater than the current
node. However it doesn't guarantee that dst_nid has the numa capacity
to move. It also doesn't guarantee that dst_nid has the best_cpu which
is the cpu/node ideal for node migration.

Lets consider faults on a 4 node system with group weight numbers
in different nodes being in 0 < 1 < 2 < 3 proportion. Consider the task
is running on 3 and 0 is its preferred node but its capacity is full.
Consider nodes 1, 2 and 3 have capacity. Then the task should be
migrated to node 1. Currently the task gets moved to node 2. env.dst_nid
points to the last node whose faults were greater than current node.

Modify to set the preferred node based of best_cpu.

Also while modifying task_numa_migrate(), use sched_setnuma to set
preferred node. This ensures out numa accounting is correct.

Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      435.78      653.81      534.58       83.20
numa01.sh       Sys:      121.93      187.18      145.90       23.47
numa01.sh      User:    37082.81    51402.80    43647.60     5409.75
numa02.sh      Real:       60.64       61.63       61.19        0.40
numa02.sh       Sys:       14.72       25.68       19.06        4.03
numa02.sh      User:     5210.95     5266.69     5233.30       20.82
numa03.sh      Real:      746.51      808.24      780.36       23.88
numa03.sh       Sys:       97.26      108.48      105.07        4.28
numa03.sh      User:    58956.30    61397.05    60162.95     1050.82
numa04.sh      Real:      465.97      519.27      484.81       19.62
numa04.sh       Sys:      304.43      359.08      334.68       20.64
numa04.sh      User:    37544.16    41186.15    39262.44     1314.91
numa05.sh      Real:      411.57      457.20      433.29       16.58
numa05.sh       Sys:      230.05      435.48      339.95       67.58
numa05.sh      User:    33325.54    36896.31    35637.84     1222.64

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      506.35      794.46      599.06      104.26 	 -10.76%
numa01.sh       Sys:      150.37      223.56      195.99       24.94 	 -25.55%
numa01.sh      User:    43450.69    61752.04    49281.50     6635.33 	 -11.43%
numa02.sh      Real:       60.33       62.40       61.31        0.90 	 -0.195%
numa02.sh       Sys:       18.12       31.66       24.28        5.89 	 -21.49%
numa02.sh      User:     5203.91     5325.32     5260.29       49.98 	 -0.513%
numa03.sh      Real:      696.47      853.62      745.80       57.28 	 4.6339%
numa03.sh       Sys:       85.68      123.71       97.89       13.48 	 7.3347%
numa03.sh      User:    55978.45    66418.63    59254.94     3737.97 	 1.5323%
numa04.sh      Real:      444.05      514.83      497.06       26.85 	 -2.464%
numa04.sh       Sys:      230.39      375.79      316.23       48.58 	 5.8343%
numa04.sh      User:    35403.12    41004.10    39720.80     2163.08 	 -1.153%
numa05.sh      Real:      423.09      460.41      439.57       13.92 	 -1.428%
numa05.sh       Sys:      287.38      480.15      369.37       68.52 	 -7.964%
numa05.sh      User:    34732.12    38016.80    36255.85     1070.51 	 -1.704%

While there is a performance hit, this is a correctness issue that is very
much needed in bigger systems.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea32a66..94091e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1725,8 +1725,9 @@ static int task_numa_migrate(struct task_struct *p)
 	 * Tasks that are "trapped" in such domains cannot be migrated
 	 * elsewhere, so there is no point in (re)trying.
 	 */
-	if (unlikely(!sd)) {
-		p->numa_preferred_nid = task_node(p);
+	if (unlikely(!sd) && p->numa_preferred_nid != task_node(p)) {
+		/* Set the new preferred node */
+		sched_setnuma(p, task_node(p));
 		return -EINVAL;
 	}
 
@@ -1785,15 +1786,13 @@ static int task_numa_migrate(struct task_struct *p)
 	 * trying for a better one later. Do not set the preferred node here.
 	 */
 	if (p->numa_group) {
-		struct numa_group *ng = p->numa_group;
-
 		if (env.best_cpu == -1)
 			nid = env.src_nid;
 		else
-			nid = env.dst_nid;
+			nid = cpu_to_node(env.best_cpu);
 
-		if (ng->active_nodes > 1 && numa_is_active_node(env.dst_nid, ng))
-			sched_setnuma(p, env.dst_nid);
+		if (nid != p->numa_preferred_nid)
+			sched_setnuma(p, nid);
 	}
 
 	/* No better CPU than the current one was found. */
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 05/19] sched/numa: Use task faults only if numa_group is not yet setup
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (3 preceding siblings ...)
  2018-06-04 10:00 ` [PATCH 04/19] sched/numa: Set preferred_node based on best_cpu Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 12:24   ` Peter Zijlstra
  2018-06-04 10:00 ` [PATCH 06/19] sched/debug: Reverse the order of printing faults Srikar Dronamraju
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

When numa_group faults are available, task_numa_placement only uses
numa_group faults to evaluate preferred node. However it still accounts
task faults and even evaluates the preferred node just based on task
faults just to discard it in favour of preferred node chosen on the
basis of numa_group.

Instead use task faults only if numa_group is not set.

Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      506.35      794.46      599.06      104.26
numa01.sh       Sys:      150.37      223.56      195.99       24.94
numa01.sh      User:    43450.69    61752.04    49281.50     6635.33
numa02.sh      Real:       60.33       62.40       61.31        0.90
numa02.sh       Sys:       18.12       31.66       24.28        5.89
numa02.sh      User:     5203.91     5325.32     5260.29       49.98
numa03.sh      Real:      696.47      853.62      745.80       57.28
numa03.sh       Sys:       85.68      123.71       97.89       13.48
numa03.sh      User:    55978.45    66418.63    59254.94     3737.97
numa04.sh      Real:      444.05      514.83      497.06       26.85
numa04.sh       Sys:      230.39      375.79      316.23       48.58
numa04.sh      User:    35403.12    41004.10    39720.80     2163.08
numa05.sh      Real:      423.09      460.41      439.57       13.92
numa05.sh       Sys:      287.38      480.15      369.37       68.52
numa05.sh      User:    34732.12    38016.80    36255.85     1070.51

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      478.45      565.90      515.11       30.87 	 16.29%
numa01.sh       Sys:      207.79      271.04      232.94       21.33 	 -15.8%
numa01.sh      User:    39763.93    47303.12    43210.73     2644.86 	 14.04%
numa02.sh      Real:       60.00       61.46       60.78        0.49 	 0.871%
numa02.sh       Sys:       15.71       25.31       20.69        3.42 	 17.35%
numa02.sh      User:     5175.92     5265.86     5235.97       32.82 	 0.464%
numa03.sh      Real:      776.42      834.85      806.01       23.22 	 -7.47%
numa03.sh       Sys:      114.43      128.75      121.65        5.49 	 -19.5%
numa03.sh      User:    60773.93    64855.25    62616.91     1576.39 	 -5.36%
numa04.sh      Real:      456.93      511.95      482.91       20.88 	 2.930%
numa04.sh       Sys:      178.09      460.89      356.86       94.58 	 -11.3%
numa04.sh      User:    36312.09    42553.24    39623.21     2247.96 	 0.246%
numa05.sh      Real:      393.98      493.48      436.61       35.59 	 0.677%
numa05.sh       Sys:      164.49      329.15      265.87       61.78 	 38.92%
numa05.sh      User:    33182.65    36654.53    35074.51     1187.71 	 3.368%

Ideally this change shouldn't have affected performance.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 94091e6..e7c07aa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2072,8 +2072,8 @@ static int preferred_group_nid(struct task_struct *p, int nid)
 
 static void task_numa_placement(struct task_struct *p)
 {
-	int seq, nid, max_nid = -1, max_group_nid = -1;
-	unsigned long max_faults = 0, max_group_faults = 0;
+	int seq, nid, max_nid = -1;
+	unsigned long max_faults = 0;
 	unsigned long fault_types[2] = { 0, 0 };
 	unsigned long total_faults;
 	u64 runtime, period;
@@ -2152,15 +2152,15 @@ static void task_numa_placement(struct task_struct *p)
 			}
 		}
 
-		if (faults > max_faults) {
-			max_faults = faults;
+		if (!p->numa_group) {
+			if (faults > max_faults) {
+				max_faults = faults;
+				max_nid = nid;
+			}
+		} else if (group_faults > max_faults) {
+			max_faults = group_faults;
 			max_nid = nid;
 		}
-
-		if (group_faults > max_group_faults) {
-			max_group_faults = group_faults;
-			max_group_nid = nid;
-		}
 	}
 
 	update_task_scan_period(p, fault_types[0], fault_types[1]);
@@ -2168,7 +2168,7 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_group) {
 		numa_group_count_active_nodes(p->numa_group);
 		spin_unlock_irq(group_lock);
-		max_nid = preferred_group_nid(p, max_group_nid);
+		max_nid = preferred_group_nid(p, max_nid);
 	}
 
 	if (max_faults) {
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 06/19] sched/debug: Reverse the order of printing faults
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (4 preceding siblings ...)
  2018-06-04 10:00 ` [PATCH 05/19] sched/numa: Use task faults only if numa_group is not yet setup Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 16:28   ` Rik van Riel
  2018-06-05  8:50   ` Mel Gorman
  2018-06-04 10:00 ` [PATCH 07/19] sched/numa: Skip nodes that are at hoplimit Srikar Dronamraju
                   ` (12 subsequent siblings)
  18 siblings, 2 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

Fix the order in which the private and shared numa faults are getting
printed.

Shouldn't have any performance impact.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/debug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 15b10e2..82ac522 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -869,8 +869,8 @@ void print_numa_stats(struct seq_file *m, int node, unsigned long tsf,
 		unsigned long tpf, unsigned long gsf, unsigned long gpf)
 {
 	SEQ_printf(m, "numa_faults node=%d ", node);
-	SEQ_printf(m, "task_private=%lu task_shared=%lu ", tsf, tpf);
-	SEQ_printf(m, "group_private=%lu group_shared=%lu\n", gsf, gpf);
+	SEQ_printf(m, "task_private=%lu task_shared=%lu ", tpf, tsf);
+	SEQ_printf(m, "group_private=%lu group_shared=%lu\n", gpf, gsf);
 }
 #endif
 
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 07/19] sched/numa: Skip nodes that are at hoplimit
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (5 preceding siblings ...)
  2018-06-04 10:00 ` [PATCH 06/19] sched/debug: Reverse the order of printing faults Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 16:27   ` Rik van Riel
  2018-06-05  8:50   ` Mel Gorman
  2018-06-04 10:00 ` [PATCH 08/19] sched/numa: Remove unused task_capacity from numa_stats Srikar Dronamraju
                   ` (11 subsequent siblings)
  18 siblings, 2 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

When comparing two nodes at a distance of hoplimit, we should consider
nodes only upto hoplimit. Currently we also consider nodes at hoplimit
distance too. Hence two nodes at a distance of "hoplimit" will have same
groupweight. Fix this by skipping nodes at hoplimit.

Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      478.45      565.90      515.11       30.87
numa01.sh       Sys:      207.79      271.04      232.94       21.33
numa01.sh      User:    39763.93    47303.12    43210.73     2644.86
numa02.sh      Real:       60.00       61.46       60.78        0.49
numa02.sh       Sys:       15.71       25.31       20.69        3.42
numa02.sh      User:     5175.92     5265.86     5235.97       32.82
numa03.sh      Real:      776.42      834.85      806.01       23.22
numa03.sh       Sys:      114.43      128.75      121.65        5.49
numa03.sh      User:    60773.93    64855.25    62616.91     1576.39
numa04.sh      Real:      456.93      511.95      482.91       20.88
numa04.sh       Sys:      178.09      460.89      356.86       94.58
numa04.sh      User:    36312.09    42553.24    39623.21     2247.96
numa05.sh      Real:      393.98      493.48      436.61       35.59
numa05.sh       Sys:      164.49      329.15      265.87       61.78
numa05.sh      User:    33182.65    36654.53    35074.51     1187.71

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      414.64      819.20      556.08      147.70 	 -7.36%
numa01.sh       Sys:       77.52      205.04      139.40       52.05 	 67.10%
numa01.sh      User:    37043.24    61757.88    45517.48     9290.38 	 -5.06%
numa02.sh      Real:       60.80       63.32       61.63        0.88 	 -1.37%
numa02.sh       Sys:       17.35       39.37       25.71        7.33 	 -19.5%
numa02.sh      User:     5213.79     5374.73     5268.90       55.09 	 -0.62%
numa03.sh      Real:      780.09      948.64      831.43       63.02 	 -3.05%
numa03.sh       Sys:      104.96      136.92      116.31       11.34 	 4.591%
numa03.sh      User:    60465.42    73339.78    64368.03     4700.14 	 -2.72%
numa04.sh      Real:      412.60      681.92      521.29       96.64 	 -7.36%
numa04.sh       Sys:      210.32      314.10      251.77       37.71 	 41.74%
numa04.sh      User:    34026.38    45581.20    38534.49     4198.53 	 2.825%
numa05.sh      Real:      394.79      439.63      411.35       16.87 	 6.140%
numa05.sh       Sys:      238.32      330.09      292.31       38.32 	 -9.04%
numa05.sh      User:    33456.45    34876.07    34138.62      609.45 	 2.741%

While there is a regression with this change, this change is needed from a
correctness perspective. Also it helps consolidation as seen from perf bench
output.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7c07aa..d4bf973 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1273,7 +1273,7 @@ static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
 		 * of each group. Skip other nodes.
 		 */
 		if (sched_numa_topology_type == NUMA_BACKPLANE &&
-					dist > maxdist)
+					dist >= maxdist)
 			continue;
 
 		/* Add up the faults from nearby nodes. */
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 08/19] sched/numa: Remove unused task_capacity from numa_stats
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (6 preceding siblings ...)
  2018-06-04 10:00 ` [PATCH 07/19] sched/numa: Skip nodes that are at hoplimit Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 16:28   ` Rik van Riel
  2018-06-05  8:57   ` Mel Gorman
  2018-06-04 10:00 ` [PATCH 09/19] sched/numa: Modify migrate_swap to accept additional params Srikar Dronamraju
                   ` (10 subsequent siblings)
  18 siblings, 2 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

task_capacity field in struct numa_stats is redundant.
Also move nr_running for better packing within the struct.

No functional changes.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4bf973..35c4a75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1411,14 +1411,12 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 
 /* Cached statistics for all CPUs within a node */
 struct numa_stats {
-	unsigned long nr_running;
 	unsigned long load;
 
 	/* Total compute capacity of CPUs on a node */
 	unsigned long compute_capacity;
 
-	/* Approximate capacity in terms of runnable tasks on a node */
-	unsigned long task_capacity;
+	unsigned int nr_running;
 	int has_free_capacity;
 };
 
@@ -1456,9 +1454,9 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
 	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, ns->compute_capacity);
 	capacity = cpus / smt; /* cores */
 
-	ns->task_capacity = min_t(unsigned, capacity,
+	capacity = min_t(unsigned, capacity,
 		DIV_ROUND_CLOSEST(ns->compute_capacity, SCHED_CAPACITY_SCALE));
-	ns->has_free_capacity = (ns->nr_running < ns->task_capacity);
+	ns->has_free_capacity = (ns->nr_running < capacity);
 }
 
 struct task_numa_env {
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 09/19] sched/numa: Modify migrate_swap to accept additional params
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (7 preceding siblings ...)
  2018-06-04 10:00 ` [PATCH 08/19] sched/numa: Remove unused task_capacity from numa_stats Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 17:00   ` Rik van Riel
  2018-06-05  8:58   ` Mel Gorman
  2018-06-04 10:00 ` [PATCH 10/19] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
                   ` (9 subsequent siblings)
  18 siblings, 2 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

There are checks in migrate_swap_stop that check if the task/cpu
combination is as per migrate_swap_arg before migrating.

However atleast one of the two tasks to be swapped by migrate_swap could
have migrated to a completely different cpu before updating the
migrate_swap_arg. The new cpu where the task is currently running could
be a different node too. If the task has migrated, numa balancer might
end up placing a task in a wrong node.  Instead of achieving node
consolidation, it may end up spreading the load across nodes.

To avoid that pass the cpus as additional parameters.

While here, place migrate_swap under CONFIG_NUMA_BALANCING.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/core.c  | 9 ++++++---
 kernel/sched/fair.c  | 3 ++-
 kernel/sched/sched.h | 3 ++-
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 092f7c4..68849c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1178,6 +1178,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	__set_task_cpu(p, new_cpu);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
 static void __migrate_swap_task(struct task_struct *p, int cpu)
 {
 	if (task_on_rq_queued(p)) {
@@ -1259,16 +1260,17 @@ static int migrate_swap_stop(void *data)
 /*
  * Cross migrate two tasks
  */
-int migrate_swap(struct task_struct *cur, struct task_struct *p)
+int migrate_swap(struct task_struct *cur, struct task_struct *p,
+		int target_cpu, int curr_cpu)
 {
 	struct migration_swap_arg arg;
 	int ret = -EINVAL;
 
 	arg = (struct migration_swap_arg){
 		.src_task = cur,
-		.src_cpu = task_cpu(cur),
+		.src_cpu = curr_cpu,
 		.dst_task = p,
-		.dst_cpu = task_cpu(p),
+		.dst_cpu = target_cpu,
 	};
 
 	if (arg.src_cpu == arg.dst_cpu)
@@ -1293,6 +1295,7 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p)
 out:
 	return ret;
 }
+#endif /* CONFIG_NUMA_BALANCING */
 
 /*
  * wait_task_inactive - wait for a thread to unschedule.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 35c4a75..46d773c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1810,7 +1810,8 @@ static int task_numa_migrate(struct task_struct *p)
 		return ret;
 	}
 
-	ret = migrate_swap(p, env.best_task);
+	ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu);
+
 	if (ret != 0)
 		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
 	put_task_struct(env.best_task);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15750c2..211841e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1068,7 +1068,8 @@ enum numa_faults_stats {
 };
 extern void sched_setnuma(struct task_struct *p, int node);
 extern int migrate_task_to(struct task_struct *p, int cpu);
-extern int migrate_swap(struct task_struct *, struct task_struct *);
+extern int migrate_swap(struct task_struct *p, struct task_struct *t,
+			int cpu, int scpu);
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_SMP
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 10/19] sched/numa: Stop multiple tasks from moving to the cpu at the same time
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (8 preceding siblings ...)
  2018-06-04 10:00 ` [PATCH 09/19] sched/numa: Modify migrate_swap to accept additional params Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 17:57   ` Rik van Riel
  2018-06-05  9:51   ` Mel Gorman
  2018-06-04 10:00 ` [PATCH 11/19] sched/numa: Restrict migrating in parallel to the same node Srikar Dronamraju
                   ` (8 subsequent siblings)
  18 siblings, 2 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

Task migration under numa balancing can happen in parallel. More than
one task might choose to migrate to the same cpu at the same time. This
can result in
- During task swap, choosing a task that was not part of the evaluation.
- During task swap, task which just got moved into its preferred node,
  moving to a completely different node.
- During task swap, task failing to move to the preferred node, will have
  to wait an extra interval for the next migrate opportunity.
- During task movement, multiple task movements can cause load imbalance.

This problem is more likely if there are more cores per node or more
nodes in the system.

Use a per run-queue variable to check if numa-balance is active on the
run-queue.

Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      414.64      819.20      556.08      147.70
numa01.sh       Sys:       77.52      205.04      139.40       52.05
numa01.sh      User:    37043.24    61757.88    45517.48     9290.38
numa02.sh      Real:       60.80       63.32       61.63        0.88
numa02.sh       Sys:       17.35       39.37       25.71        7.33
numa02.sh      User:     5213.79     5374.73     5268.90       55.09
numa03.sh      Real:      780.09      948.64      831.43       63.02
numa03.sh       Sys:      104.96      136.92      116.31       11.34
numa03.sh      User:    60465.42    73339.78    64368.03     4700.14
numa04.sh      Real:      412.60      681.92      521.29       96.64
numa04.sh       Sys:      210.32      314.10      251.77       37.71
numa04.sh      User:    34026.38    45581.20    38534.49     4198.53
numa05.sh      Real:      394.79      439.63      411.35       16.87
numa05.sh       Sys:      238.32      330.09      292.31       38.32
numa05.sh      User:    33456.45    34876.07    34138.62      609.45

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      434.84      676.90      550.53      106.24 	 1.008%
numa01.sh       Sys:      125.98      217.34      179.41       30.35 	 -22.3%
numa01.sh      User:    38318.48    53789.56    45864.17     6620.80 	 -0.75%
numa02.sh      Real:       60.06       61.27       60.59        0.45 	 1.716%
numa02.sh       Sys:       14.25       17.86       16.09        1.28 	 59.78%
numa02.sh      User:     5190.13     5225.67     5209.24       13.19 	 1.145%
numa03.sh      Real:      748.21      960.25      823.15       73.51 	 1.005%
numa03.sh       Sys:       96.68      122.10      110.42       11.29 	 5.334%
numa03.sh      User:    58222.16    72595.27    63552.22     5048.87 	 1.283%
numa04.sh      Real:      433.08      630.55      499.30       68.15 	 4.404%
numa04.sh       Sys:      245.22      386.75      306.09       63.32 	 -17.7%
numa04.sh      User:    35014.68    46151.72    38530.26     3924.65 	 0.010%
numa05.sh      Real:      394.77      410.07      401.41        5.99 	 2.476%
numa05.sh       Sys:      212.40      301.82      256.23       35.41 	 14.08%
numa05.sh      User:    33224.86    34201.40    33665.61      313.40 	 1.405%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c  | 17 +++++++++++++++++
 kernel/sched/sched.h |  1 +
 2 files changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 46d773c..3e19e32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1478,6 +1478,16 @@ struct task_numa_env {
 static void task_numa_assign(struct task_numa_env *env,
 			     struct task_struct *p, long imp)
 {
+	struct rq *rq = cpu_rq(env->dst_cpu);
+
+	if (xchg(&rq->numa_migrate_on, 1))
+		return;
+
+	if (env->best_cpu != -1) {
+		rq = cpu_rq(env->best_cpu);
+		WRITE_ONCE(rq->numa_migrate_on, 0);
+	}
+
 	if (env->best_task)
 		put_task_struct(env->best_task);
 	if (p)
@@ -1533,6 +1543,9 @@ static void task_numa_compare(struct task_numa_env *env,
 	long moveimp = imp;
 	int dist = env->dist;
 
+	if (READ_ONCE(dst_rq->numa_migrate_on))
+		return;
+
 	rcu_read_lock();
 	cur = task_rcu_dereference(&dst_rq->curr);
 	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
@@ -1699,6 +1712,7 @@ static int task_numa_migrate(struct task_struct *p)
 		.best_cpu = -1,
 	};
 	struct sched_domain *sd;
+	struct rq *best_rq;
 	unsigned long taskweight, groupweight;
 	int nid, ret, dist;
 	long taskimp, groupimp;
@@ -1803,14 +1817,17 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	p->numa_scan_period = task_scan_start(p);
 
+	best_rq = cpu_rq(env.best_cpu);
 	if (env.best_task == NULL) {
 		ret = migrate_task_to(p, env.best_cpu);
+		WRITE_ONCE(best_rq->numa_migrate_on, 0);
 		if (ret != 0)
 			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
 		return ret;
 	}
 
 	ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu);
+	WRITE_ONCE(best_rq->numa_migrate_on, 0);
 
 	if (ret != 0)
 		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 211841e..55bc6e1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -756,6 +756,7 @@ struct rq {
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int		nr_numa_running;
 	unsigned int		nr_preferred_running;
+	unsigned int		numa_migrate_on;
 #endif
 	#define CPU_LOAD_IDX_MAX 5
 	unsigned long		cpu_load[CPU_LOAD_IDX_MAX];
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 11/19] sched/numa: Restrict migrating in parallel to the same node.
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (9 preceding siblings ...)
  2018-06-04 10:00 ` [PATCH 10/19] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 17:59   ` Rik van Riel
  2018-06-05  9:53   ` Mel Gorman
  2018-06-04 10:00 ` [PATCH 12/19] sched:numa Remove numa_has_capacity Srikar Dronamraju
                   ` (7 subsequent siblings)
  18 siblings, 2 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

Since task migration under numa balancing can happen in parallel, more
than one task might choose to move to the same node at the same time.
This can cause load imbalances at the node level.

The problem is more likely if there are more cores per node or more
nodes in system.

Use a per-node variable to indicate if task migration
to the node under numa balance is currently active.
This per-node variable will not track swapping of tasks.

Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      434.84      676.90      550.53      106.24
numa01.sh       Sys:      125.98      217.34      179.41       30.35
numa01.sh      User:    38318.48    53789.56    45864.17     6620.80
numa02.sh      Real:       60.06       61.27       60.59        0.45
numa02.sh       Sys:       14.25       17.86       16.09        1.28
numa02.sh      User:     5190.13     5225.67     5209.24       13.19
numa03.sh      Real:      748.21      960.25      823.15       73.51
numa03.sh       Sys:       96.68      122.10      110.42       11.29
numa03.sh      User:    58222.16    72595.27    63552.22     5048.87
numa04.sh      Real:      433.08      630.55      499.30       68.15
numa04.sh       Sys:      245.22      386.75      306.09       63.32
numa04.sh      User:    35014.68    46151.72    38530.26     3924.65
numa05.sh      Real:      394.77      410.07      401.41        5.99
numa05.sh       Sys:      212.40      301.82      256.23       35.41
numa05.sh      User:    33224.86    34201.40    33665.61      313.40

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      674.61      997.71      785.01      115.95 	 -29.86%
numa01.sh       Sys:      180.87      318.88      270.13       51.32 	 -33.58%
numa01.sh      User:    54001.30    71936.50    60495.48     6237.55 	 -24.18%
numa02.sh      Real:       60.62       62.30       61.46        0.62 	 -1.415%
numa02.sh       Sys:       15.01       33.63       24.38        6.81 	 -34.00%
numa02.sh      User:     5234.20     5325.60     5276.23       38.85 	 -1.269%
numa03.sh      Real:      827.62      946.85      914.48       44.58 	 -9.987%
numa03.sh       Sys:      135.55      172.40      158.46       12.75 	 -30.31%
numa03.sh      User:    64839.42    73195.44    70805.96     3061.20 	 -10.24%
numa04.sh      Real:      481.01      608.76      521.14       47.28 	 -4.190%
numa04.sh       Sys:      329.59      373.15      353.20       14.20 	 -13.33%
numa04.sh      User:    37649.09    40722.94    38806.32     1072.32 	 -0.711%
numa05.sh      Real:      399.21      415.38      409.88        5.54 	 -2.066%
numa05.sh       Sys:      319.46      418.57      363.31       37.62 	 -29.47%
numa05.sh      User:    33727.77    34732.68    34127.41      447.11 	 -1.353%


The commit does cause some performance regression but is needed from
a fairness/correctness perspective.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/mmzone.h |  1 +
 kernel/sched/fair.c    | 14 ++++++++++++++
 mm/page_alloc.c        |  1 +
 3 files changed, 16 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2..b0767703 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -677,6 +677,7 @@ struct zonelist {
 
 	/* Number of pages migrated during the rate limiting time interval */
 	unsigned long numabalancing_migrate_nr_pages;
+	int active_node_migrate;
 #endif
 	/*
 	 * This is a per-node reserve of pages that are not available
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e19e32..259c343 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1478,11 +1478,22 @@ struct task_numa_env {
 static void task_numa_assign(struct task_numa_env *env,
 			     struct task_struct *p, long imp)
 {
+	pg_data_t *pgdat = NODE_DATA(cpu_to_node(env->dst_cpu));
 	struct rq *rq = cpu_rq(env->dst_cpu);
 
 	if (xchg(&rq->numa_migrate_on, 1))
 		return;
 
+	if (!env->best_task && env->best_cpu != -1)
+		WRITE_ONCE(pgdat->active_node_migrate, 0);
+
+	if (!p) {
+		if (xchg(&pgdat->active_node_migrate, 1)) {
+			WRITE_ONCE(rq->numa_migrate_on, 0);
+			return;
+		}
+	}
+
 	if (env->best_cpu != -1) {
 		rq = cpu_rq(env->best_cpu);
 		WRITE_ONCE(rq->numa_migrate_on, 0);
@@ -1819,8 +1830,11 @@ static int task_numa_migrate(struct task_struct *p)
 
 	best_rq = cpu_rq(env.best_cpu);
 	if (env.best_task == NULL) {
+		pg_data_t *pgdat = NODE_DATA(cpu_to_node(env.dst_cpu));
+
 		ret = migrate_task_to(p, env.best_cpu);
 		WRITE_ONCE(best_rq->numa_migrate_on, 0);
+		WRITE_ONCE(pgdat->active_node_migrate, 0);
 		if (ret != 0)
 			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
 		return ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 905db9d..4526643 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6210,6 +6210,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 #ifdef CONFIG_NUMA_BALANCING
 	spin_lock_init(&pgdat->numabalancing_migrate_lock);
 	pgdat->numabalancing_migrate_nr_pages = 0;
+	pgdat->active_node_migrate = 0;
 	pgdat->numabalancing_migrate_next_window = jiffies;
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 12/19] sched:numa Remove numa_has_capacity
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (10 preceding siblings ...)
  2018-06-04 10:00 ` [PATCH 11/19] sched/numa: Restrict migrating in parallel to the same node Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 18:07   ` Rik van Riel
  2018-06-04 10:00 ` [PATCH 13/19] mm/migrate: Use xchg instead of spinlock Srikar Dronamraju
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

task_numa_find_cpu helps to find the cpu to swap/move the task.
Its guarded by numa_has_capacity(). However node not having capacity
shouldn't deter a task swapping if it helps numa placement.

Further load_too_imbalanced, which evaluates possibilities of move/swap,
provides similar checks as numa_has_capacity.

Hence remove numa_has_capacity() to enhance possibilities of task
swapping even if load is imbalanced.

Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      674.61      997.71      785.01      115.95
numa01.sh       Sys:      180.87      318.88      270.13       51.32
numa01.sh      User:    54001.30    71936.50    60495.48     6237.55
numa02.sh      Real:       60.62       62.30       61.46        0.62
numa02.sh       Sys:       15.01       33.63       24.38        6.81
numa02.sh      User:     5234.20     5325.60     5276.23       38.85
numa03.sh      Real:      827.62      946.85      914.48       44.58
numa03.sh       Sys:      135.55      172.40      158.46       12.75
numa03.sh      User:    64839.42    73195.44    70805.96     3061.20
numa04.sh      Real:      481.01      608.76      521.14       47.28
numa04.sh       Sys:      329.59      373.15      353.20       14.20
numa04.sh      User:    37649.09    40722.94    38806.32     1072.32
numa05.sh      Real:      399.21      415.38      409.88        5.54
numa05.sh       Sys:      319.46      418.57      363.31       37.62
numa05.sh      User:    33727.77    34732.68    34127.41      447.11

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      435.67      707.28      527.49       97.85 	 48.81%
numa01.sh       Sys:       76.41      231.19      162.49       56.13 	 66.24%
numa01.sh      User:    38247.36    59033.52    45129.31     7642.69 	 34.04%
numa02.sh      Real:       60.35       62.09       61.09        0.69 	 0.605%
numa02.sh       Sys:       15.01       30.20       20.64        5.56 	 18.12%
numa02.sh      User:     5195.93     5294.82     5240.99       40.55 	 0.672%
numa03.sh      Real:      752.04      919.89      836.81       63.29 	 9.281%
numa03.sh       Sys:      115.10      133.35      125.46        7.78 	 26.30%
numa03.sh      User:    58736.44    70084.26    65103.67     4416.10 	 8.758%
numa04.sh      Real:      418.43      709.69      512.53      104.17 	 1.679%
numa04.sh       Sys:      242.99      370.47      297.39       42.20 	 18.76%
numa04.sh      User:    34916.14    48429.54    38955.65     4928.05 	 -0.38%
numa05.sh      Real:      379.27      434.05      403.70       17.79 	 1.530%
numa05.sh       Sys:      145.94      344.50      268.72       68.53 	 35.20%
numa05.sh      User:    32679.32    35449.75    33989.10      913.19 	 0.406%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 37 +++----------------------------------
 1 file changed, 3 insertions(+), 34 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 259c343..709c77c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1417,7 +1417,6 @@ struct numa_stats {
 	unsigned long compute_capacity;
 
 	unsigned int nr_running;
-	int has_free_capacity;
 };
 
 /*
@@ -1444,8 +1443,7 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
 	 * the @ns structure is NULL'ed and task_numa_compare() will
 	 * not find this node attractive.
 	 *
-	 * We'll either bail at !has_free_capacity, or we'll detect a huge
-	 * imbalance and bail there.
+	 * We'll detect a huge imbalance and bail there.
 	 */
 	if (!cpus)
 		return;
@@ -1456,7 +1454,6 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
 
 	capacity = min_t(unsigned, capacity,
 		DIV_ROUND_CLOSEST(ns->compute_capacity, SCHED_CAPACITY_SCALE));
-	ns->has_free_capacity = (ns->nr_running < capacity);
 }
 
 struct task_numa_env {
@@ -1672,7 +1669,6 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 	 * than swapping tasks around, check if a move is possible.
 	 */
 	move = !load_too_imbalanced(src_load, dst_load, env);
-
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
 		/* Skip this CPU if the source task cannot migrate */
 		if (!cpumask_test_cpu(cpu, &env->p->cpus_allowed))
@@ -1683,31 +1679,6 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 	}
 }
 
-/* Only move tasks to a NUMA node less busy than the current node. */
-static bool numa_has_capacity(struct task_numa_env *env)
-{
-	struct numa_stats *src = &env->src_stats;
-	struct numa_stats *dst = &env->dst_stats;
-
-	if (src->has_free_capacity && !dst->has_free_capacity)
-		return false;
-
-	/*
-	 * Only consider a task move if the source has a higher load
-	 * than the destination, corrected for CPU capacity on each node.
-	 *
-	 *      src->load                dst->load
-	 * --------------------- vs ---------------------
-	 * src->compute_capacity    dst->compute_capacity
-	 */
-	if (src->load * dst->compute_capacity * env->imbalance_pct >
-
-	    dst->load * src->compute_capacity * 100)
-		return true;
-
-	return false;
-}
-
 static int task_numa_migrate(struct task_struct *p)
 {
 	struct task_numa_env env = {
@@ -1764,8 +1735,7 @@ static int task_numa_migrate(struct task_struct *p)
 	update_numa_stats(&env.dst_stats, env.dst_nid);
 
 	/* Try to find a spot on the preferred nid. */
-	if (numa_has_capacity(&env))
-		task_numa_find_cpu(&env, taskimp, groupimp);
+	task_numa_find_cpu(&env, taskimp, groupimp);
 
 	/*
 	 * Look at other nodes in these cases:
@@ -1795,8 +1765,7 @@ static int task_numa_migrate(struct task_struct *p)
 			env.dist = dist;
 			env.dst_nid = nid;
 			update_numa_stats(&env.dst_stats, env.dst_nid);
-			if (numa_has_capacity(&env))
-				task_numa_find_cpu(&env, taskimp, groupimp);
+			task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}
 
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 13/19] mm/migrate: Use xchg instead of spinlock
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (11 preceding siblings ...)
  2018-06-04 10:00 ` [PATCH 12/19] sched:numa Remove numa_has_capacity Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 18:22   ` Rik van Riel
  2018-06-04 19:28   ` Peter Zijlstra
  2018-06-04 10:00 ` [PATCH 14/19] sched/numa: Updation of scan period need not be in lock Srikar Dronamraju
                   ` (5 subsequent siblings)
  18 siblings, 2 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

Currently resetting the migrate rate limit is under a spinlock.
The spinlock will only serialize the migrate rate limiting and something
similar can actually be achieved by a simpler xchg.

Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      435.67      707.28      527.49       97.85
numa01.sh       Sys:       76.41      231.19      162.49       56.13
numa01.sh      User:    38247.36    59033.52    45129.31     7642.69
numa02.sh      Real:       60.35       62.09       61.09        0.69
numa02.sh       Sys:       15.01       30.20       20.64        5.56
numa02.sh      User:     5195.93     5294.82     5240.99       40.55
numa03.sh      Real:      752.04      919.89      836.81       63.29
numa03.sh       Sys:      115.10      133.35      125.46        7.78
numa03.sh      User:    58736.44    70084.26    65103.67     4416.10
numa04.sh      Real:      418.43      709.69      512.53      104.17
numa04.sh       Sys:      242.99      370.47      297.39       42.20
numa04.sh      User:    34916.14    48429.54    38955.65     4928.05
numa05.sh      Real:      379.27      434.05      403.70       17.79
numa05.sh       Sys:      145.94      344.50      268.72       68.53
numa05.sh      User:    32679.32    35449.75    33989.10      913.19

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      490.04      774.86      596.26       96.46 	 -11.5%
numa01.sh       Sys:      151.52      242.88      184.82       31.71 	 -12.0%
numa01.sh      User:    41418.41    60844.59    48776.09     6564.27 	 -7.47%
numa02.sh      Real:       60.14       62.94       60.98        1.00 	 0.180%
numa02.sh       Sys:       16.11       30.77       21.20        5.28 	 -2.64%
numa02.sh      User:     5184.33     5311.09     5228.50       44.24 	 0.238%
numa03.sh      Real:      790.95      856.35      826.41       24.11 	 1.258%
numa03.sh       Sys:      114.93      118.85      117.05        1.63 	 7.184%
numa03.sh      User:    60990.99    64959.28    63470.43     1415.44 	 2.573%
numa04.sh      Real:      434.37      597.92      504.87       59.70 	 1.517%
numa04.sh       Sys:      237.63      397.40      289.74       55.98 	 2.640%
numa04.sh      User:    34854.87    41121.83    38572.52     2615.84 	 0.993%
numa05.sh      Real:      386.77      448.90      417.22       22.79 	 -3.24%
numa05.sh       Sys:      149.23      379.95      303.04       79.55 	 -11.3%
numa05.sh      User:    32951.76    35959.58    34562.18     1034.05 	 -1.65%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/mmzone.h | 3 ---
 mm/migrate.c           | 8 +++-----
 mm/page_alloc.c        | 1 -
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b0767703..0dbe1d5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -669,9 +669,6 @@ struct zonelist {
 	struct task_struct *kcompactd;
 #endif
 #ifdef CONFIG_NUMA_BALANCING
-	/* Lock serializing the migrate rate limiting window */
-	spinlock_t numabalancing_migrate_lock;
-
 	/* Rate limiting time interval */
 	unsigned long numabalancing_migrate_next_window;
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 8c0af0f..1c55956 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1874,11 +1874,9 @@ static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
 	 * all the time is being spent migrating!
 	 */
 	if (time_after(jiffies, pgdat->numabalancing_migrate_next_window)) {
-		spin_lock(&pgdat->numabalancing_migrate_lock);
-		pgdat->numabalancing_migrate_nr_pages = 0;
-		pgdat->numabalancing_migrate_next_window = jiffies +
-			msecs_to_jiffies(migrate_interval_millisecs);
-		spin_unlock(&pgdat->numabalancing_migrate_lock);
+		if (xchg(&pgdat->numabalancing_migrate_nr_pages, 0))
+			pgdat->numabalancing_migrate_next_window = jiffies +
+				msecs_to_jiffies(migrate_interval_millisecs);
 	}
 	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) {
 		trace_mm_numa_migrate_ratelimit(current, pgdat->node_id,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4526643..464a25c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6208,7 +6208,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 
 	pgdat_resize_init(pgdat);
 #ifdef CONFIG_NUMA_BALANCING
-	spin_lock_init(&pgdat->numabalancing_migrate_lock);
 	pgdat->numabalancing_migrate_nr_pages = 0;
 	pgdat->active_node_migrate = 0;
 	pgdat->numabalancing_migrate_next_window = jiffies;
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 14/19] sched/numa: Updation of scan period need not be in lock
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (12 preceding siblings ...)
  2018-06-04 10:00 ` [PATCH 13/19] mm/migrate: Use xchg instead of spinlock Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 18:24   ` Rik van Riel
  2018-06-04 10:00 ` [PATCH 15/19] sched/numa: Use group_weights to identify if migration degrades locality Srikar Dronamraju
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

The metrics for updating scan periods are local or task specific.
Currently this updation happens under numa_group lock which seems
unnecessary. Hence move this updation outside the lock.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 709c77c..6b72240 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2162,8 +2162,6 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	update_task_scan_period(p, fault_types[0], fault_types[1]);
-
 	if (p->numa_group) {
 		numa_group_count_active_nodes(p->numa_group);
 		spin_unlock_irq(group_lock);
@@ -2178,6 +2176,8 @@ static void task_numa_placement(struct task_struct *p)
 		if (task_node(p) != p->numa_preferred_nid)
 			numa_migrate_preferred(p);
 	}
+
+	update_task_scan_period(p, fault_types[0], fault_types[1]);
 }
 
 static inline int get_numa_group(struct numa_group *grp)
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 15/19] sched/numa: Use group_weights to identify if migration degrades locality
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (13 preceding siblings ...)
  2018-06-04 10:00 ` [PATCH 14/19] sched/numa: Updation of scan period need not be in lock Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 18:56   ` Rik van Riel
  2018-06-04 10:00 ` [PATCH 16/19] sched/numa: Detect if node actively handling migration Srikar Dronamraju
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

On NUMA_BACKPLANE and NUMA_GLUELESS_MESH systems, tasks/memory should be
consolidated to the closest group of nodes. In such a case, relying on
group_fault metric may not always help to consolidate. There can always
be a case where a node closer to the preferred node may have lesser
faults than a node further away from the preferred node. In such a case,
moving to node with more faults might avoid numa consolidation.

Using group_weight would help to consolidate task/memory around the
preferred_node.

While here, to be on the conservative side, don't override migrate thread
degrades locality logic for CPU_NEWLY_IDLE load balancing.

Note: Similar problems exist with should_numa_migrate_memory and will be
dealt separately.

Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      490.04      774.86      596.26       96.46
numa01.sh       Sys:      151.52      242.88      184.82       31.71
numa01.sh      User:    41418.41    60844.59    48776.09     6564.27
numa02.sh      Real:       60.14       62.94       60.98        1.00
numa02.sh       Sys:       16.11       30.77       21.20        5.28
numa02.sh      User:     5184.33     5311.09     5228.50       44.24
numa03.sh      Real:      790.95      856.35      826.41       24.11
numa03.sh       Sys:      114.93      118.85      117.05        1.63
numa03.sh      User:    60990.99    64959.28    63470.43     1415.44
numa04.sh      Real:      434.37      597.92      504.87       59.70
numa04.sh       Sys:      237.63      397.40      289.74       55.98
numa04.sh      User:    34854.87    41121.83    38572.52     2615.84
numa05.sh      Real:      386.77      448.90      417.22       22.79
numa05.sh       Sys:      149.23      379.95      303.04       79.55
numa05.sh      User:    32951.76    35959.58    34562.18     1034.05

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      493.19      672.88      597.51       59.38 	 -0.20%
numa01.sh       Sys:      150.09      245.48      207.76       34.26 	 -11.0%
numa01.sh      User:    41928.51    53779.17    48747.06     3901.39 	 0.059%
numa02.sh      Real:       60.63       62.87       61.22        0.83 	 -0.39%
numa02.sh       Sys:       16.64       27.97       20.25        4.06 	 4.691%
numa02.sh      User:     5222.92     5309.60     5254.03       29.98 	 -0.48%
numa03.sh      Real:      821.52      902.15      863.60       32.41 	 -4.30%
numa03.sh       Sys:      112.04      130.66      118.35        7.08 	 -1.09%
numa03.sh      User:    62245.16    69165.14    66443.04     2450.32 	 -4.47%
numa04.sh      Real:      414.53      519.57      476.25       37.00 	 6.009%
numa04.sh       Sys:      181.84      335.67      280.41       54.07 	 3.327%
numa04.sh      User:    33924.50    39115.39    37343.78     1934.26 	 3.290%
numa05.sh      Real:      408.30      441.45      417.90       12.05 	 -0.16%
numa05.sh       Sys:      233.41      381.60      295.58       57.37 	 2.523%
numa05.sh      User:    33301.31    35972.50    34335.19      938.94 	 0.661%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b72240..c388ecf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7222,8 +7222,8 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 {
 	struct numa_group *numa_group = rcu_dereference(p->numa_group);
-	unsigned long src_faults, dst_faults;
-	int src_nid, dst_nid;
+	unsigned long src_weight, dst_weight;
+	int src_nid, dst_nid, dist;
 
 	if (!static_branch_likely(&sched_numa_balancing))
 		return -1;
@@ -7250,18 +7250,19 @@ static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 		return 0;
 
 	/* Leaving a core idle is often worse than degrading locality. */
-	if (env->idle != CPU_NOT_IDLE)
+	if (env->idle == CPU_IDLE)
 		return -1;
 
+	dist = node_distance(src_nid, dst_nid);
 	if (numa_group) {
-		src_faults = group_faults(p, src_nid);
-		dst_faults = group_faults(p, dst_nid);
+		src_weight = group_weight(p, src_nid, dist);
+		dst_weight = group_weight(p, dst_nid, dist);
 	} else {
-		src_faults = task_faults(p, src_nid);
-		dst_faults = task_faults(p, dst_nid);
+		src_weight = task_weight(p, src_nid, dist);
+		dst_weight = task_weight(p, dst_nid, dist);
 	}
 
-	return dst_faults < src_faults;
+	return dst_weight < src_weight;
 }
 
 #else
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 16/19] sched/numa: Detect if node actively handling migration
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (14 preceding siblings ...)
  2018-06-04 10:00 ` [PATCH 15/19] sched/numa: Use group_weights to identify if migration degrades locality Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 20:05   ` Rik van Riel
  2018-06-04 10:00 ` [PATCH 17/19] sched/numa: Pass destination cpu as a parameter to migrate_task_rq Srikar Dronamraju
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

If a node is the destination for a task migration under numa balancing,
then any parallel movements to the node will be restricted. In such a
scenario, detect at the earliest and avoid evaluation for a task
movement.

While here, avoid task migration if the numa imbalance is very minimal.
Especially consider two tasks A and B racing with each other to find the
best cpu to swap. If task A already has found one task/cpu pair to
swap and trying to find a better cpu. Task B is yet to find a better
cpu/task to swap. Task A can race with task B and deprive it from
getting a task/cpu to swap.


Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      493.19      672.88      597.51       59.38
numa01.sh       Sys:      150.09      245.48      207.76       34.26
numa01.sh      User:    41928.51    53779.17    48747.06     3901.39
numa02.sh      Real:       60.63       62.87       61.22        0.83
numa02.sh       Sys:       16.64       27.97       20.25        4.06
numa02.sh      User:     5222.92     5309.60     5254.03       29.98
numa03.sh      Real:      821.52      902.15      863.60       32.41
numa03.sh       Sys:      112.04      130.66      118.35        7.08
numa03.sh      User:    62245.16    69165.14    66443.04     2450.32
numa04.sh      Real:      414.53      519.57      476.25       37.00
numa04.sh       Sys:      181.84      335.67      280.41       54.07
numa04.sh      User:    33924.50    39115.39    37343.78     1934.26
numa05.sh      Real:      408.30      441.45      417.90       12.05
numa05.sh       Sys:      233.41      381.60      295.58       57.37
numa05.sh      User:    33301.31    35972.50    34335.19      938.94

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      428.48      837.17      700.45      162.77 	 -14.6%
numa01.sh       Sys:       78.64      247.70      164.45       58.32 	 26.33%
numa01.sh      User:    37487.25    63728.06    54399.27    10088.13 	 -10.3%
numa02.sh      Real:       60.07       62.65       61.41        0.85 	 -0.30%
numa02.sh       Sys:       15.83       29.36       21.04        4.48 	 -3.75%
numa02.sh      User:     5194.27     5280.60     5236.55       28.01 	 0.333%
numa03.sh      Real:      814.33      881.93      849.69       27.06 	 1.637%
numa03.sh       Sys:      111.45      134.02      125.28        7.69 	 -5.53%
numa03.sh      User:    63007.36    68013.46    65590.46     2023.37 	 1.299%
numa04.sh      Real:      412.19      438.75      424.43        9.28 	 12.20%
numa04.sh       Sys:      232.97      315.77      268.98       26.98 	 4.249%
numa04.sh      User:    33997.30    35292.88    34711.66      415.78 	 7.582%
numa05.sh      Real:      394.88      449.45      424.30       22.53 	 -1.50%
numa05.sh       Sys:      262.03      390.10      314.53       51.01 	 -6.02%
numa05.sh      User:    33389.03    35684.40    34561.34      942.34 	 -0.65%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 37 +++++++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c388ecf..6851412 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1535,14 +1535,22 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 }
 
 /*
+ * Maximum numa importance can be 1998 (2*999);
+ * SMALLIMP @ 30 would be close to 1998/64.
+ * Used to deter task migration.
+ */
+#define SMALLIMP	30
+
+/*
  * This checks if the overall compute and NUMA accesses of the system would
  * be improved if the source tasks was migrated to the target dst_cpu taking
  * into account that it might be best if task running on the dst_cpu should
  * be exchanged with the source task
  */
 static void task_numa_compare(struct task_numa_env *env,
-			      long taskimp, long groupimp, bool move)
+			      long taskimp, long groupimp, bool *move)
 {
+	pg_data_t *pgdat = NODE_DATA(cpu_to_node(env->dst_cpu));
 	struct rq *dst_rq = cpu_rq(env->dst_cpu);
 	struct task_struct *cur;
 	long src_load, dst_load;
@@ -1554,6 +1562,9 @@ static void task_numa_compare(struct task_numa_env *env,
 	if (READ_ONCE(dst_rq->numa_migrate_on))
 		return;
 
+	if (*move && READ_ONCE(pgdat->active_node_migrate))
+		*move = false;
+
 	rcu_read_lock();
 	cur = task_rcu_dereference(&dst_rq->curr);
 	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
@@ -1567,10 +1578,10 @@ static void task_numa_compare(struct task_numa_env *env,
 		goto unlock;
 
 	if (!cur) {
-		if (!move || imp <= env->best_imp)
-			goto unlock;
-		else
+		if (*move && moveimp >= env->best_imp)
 			goto assign;
+		else
+			goto unlock;
 	}
 
 	/*
@@ -1610,16 +1621,22 @@ static void task_numa_compare(struct task_numa_env *env,
 			       task_weight(cur, env->dst_nid, dist);
 	}
 
-	if (imp <= env->best_imp)
-		goto unlock;
-
-	if (move && moveimp > imp && moveimp > env->best_imp) {
-		imp = moveimp - 1;
+	if (*move && moveimp > imp && moveimp > env->best_imp) {
+		imp = moveimp;
 		cur = NULL;
 		goto assign;
 	}
 
 	/*
+	 * If the numa importance is less than SMALLIMP,
+	 * task migration might only result in ping pong
+	 * of tasks and also hurt performance due to cache
+	 * misses.
+	 */
+	if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
+		goto unlock;
+
+	/*
 	 * In the overloaded case, try and keep the load balanced.
 	 */
 	load = task_h_load(env->p) - task_h_load(cur);
@@ -1675,7 +1692,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 			continue;
 
 		env->dst_cpu = cpu;
-		task_numa_compare(env, taskimp, groupimp, move);
+		task_numa_compare(env, taskimp, groupimp, &move);
 	}
 }
 
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 17/19] sched/numa: Pass destination cpu as a parameter to migrate_task_rq
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (15 preceding siblings ...)
  2018-06-04 10:00 ` [PATCH 16/19] sched/numa: Detect if node actively handling migration Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 10:00 ` [PATCH 18/19] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
  2018-06-04 10:00 ` [PATCH 19/19] sched/numa: Move task_placement closer to numa_migrate_preferred Srikar Dronamraju
  18 siblings, 0 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

This additional parameter (new_cpu) is used later for identifying if
task migration is across nodes.

No functional change.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/core.c     | 2 +-
 kernel/sched/deadline.c | 2 +-
 kernel/sched/fair.c     | 2 +-
 kernel/sched/sched.h    | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 68849c2..7ecd131 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1170,7 +1170,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 
 	if (task_cpu(p) != new_cpu) {
 		if (p->sched_class->migrate_task_rq)
-			p->sched_class->migrate_task_rq(p);
+			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
 		perf_event_task_migrate(p);
 	}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e7b3008..4f6b376 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1608,7 +1608,7 @@ static void yield_task_dl(struct rq *rq)
 	return cpu;
 }
 
-static void migrate_task_rq_dl(struct task_struct *p)
+static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused)
 {
 	struct rq *rq;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6851412..339c3dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6616,7 +6616,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
  * cfs_rq_of(p) references at time of call are still valid and identify the
  * previous CPU. The caller guarantees p->pi_lock or task_rq(p)->lock is held.
  */
-static void migrate_task_rq_fair(struct task_struct *p)
+static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unused)
 {
 	/*
 	 * As blocked tasks retain absolute vruntime the migration needs to
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 55bc6e1..33ddea7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1487,7 +1487,7 @@ struct sched_class {
 
 #ifdef CONFIG_SMP
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
-	void (*migrate_task_rq)(struct task_struct *p);
+	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
 
 	void (*task_woken)(struct rq *this_rq, struct task_struct *task);
 
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 18/19] sched/numa: Reset scan rate whenever task moves across nodes
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (16 preceding siblings ...)
  2018-06-04 10:00 ` [PATCH 17/19] sched/numa: Pass destination cpu as a parameter to migrate_task_rq Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  2018-06-04 20:08   ` Rik van Riel
  2018-06-05  9:58   ` Mel Gorman
  2018-06-04 10:00 ` [PATCH 19/19] sched/numa: Move task_placement closer to numa_migrate_preferred Srikar Dronamraju
  18 siblings, 2 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

Currently task scan rate is reset when numa balancer migrates the task
to a different node. If numa balancer initiates a swap, reset is only
applicable to the task that initiates the swap. Similarly no scan rate
reset is done if the task is migrated across nodes by traditional load
balancer.

Instead move the scan reset to the migrate_task_rq. This ensures the
task moved out of its preferred node, either gets back to its preferred
node quickly or finds a new preferred node. Doing so, would be fair to
all tasks migrating across nodes.


Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      428.48      837.17      700.45      162.77
numa01.sh       Sys:       78.64      247.70      164.45       58.32
numa01.sh      User:    37487.25    63728.06    54399.27    10088.13
numa02.sh      Real:       60.07       62.65       61.41        0.85
numa02.sh       Sys:       15.83       29.36       21.04        4.48
numa02.sh      User:     5194.27     5280.60     5236.55       28.01
numa03.sh      Real:      814.33      881.93      849.69       27.06
numa03.sh       Sys:      111.45      134.02      125.28        7.69
numa03.sh      User:    63007.36    68013.46    65590.46     2023.37
numa04.sh      Real:      412.19      438.75      424.43        9.28
numa04.sh       Sys:      232.97      315.77      268.98       26.98
numa04.sh      User:    33997.30    35292.88    34711.66      415.78
numa05.sh      Real:      394.88      449.45      424.30       22.53
numa05.sh       Sys:      262.03      390.10      314.53       51.01
numa05.sh      User:    33389.03    35684.40    34561.34      942.34

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      449.46      770.77      615.22      101.70 	 13.85%
numa01.sh       Sys:      132.72      208.17      170.46       24.96 	 -3.52%
numa01.sh      User:    39185.26    60290.89    50066.76     6807.84 	 8.653%
numa02.sh      Real:       60.85       61.79       61.28        0.37 	 0.212%
numa02.sh       Sys:       15.34       24.71       21.08        3.61 	 -0.18%
numa02.sh      User:     5204.41     5249.85     5231.21       17.60 	 0.102%
numa03.sh      Real:      785.50      916.97      840.77       44.98 	 1.060%
numa03.sh       Sys:      108.08      133.60      119.43        8.82 	 4.898%
numa03.sh      User:    61422.86    70919.75    64720.87     3310.61 	 1.343%
numa04.sh      Real:      429.57      587.37      480.80       57.40 	 -11.7%
numa04.sh       Sys:      240.61      321.97      290.84       33.58 	 -7.51%
numa04.sh      User:    34597.65    40498.99    37079.48     2060.72 	 -6.38%
numa05.sh      Real:      392.09      431.25      414.65       13.82 	 2.327%
numa05.sh       Sys:      229.41      372.48      297.54       53.14 	 5.710%
numa05.sh      User:    33390.86    34697.49    34222.43      556.42 	 0.990%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 339c3dc..fc1f388 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1808,12 +1808,6 @@ static int task_numa_migrate(struct task_struct *p)
 	if (env.best_cpu == -1)
 		return -EAGAIN;
 
-	/*
-	 * Reset the scan period if the task is being rescheduled on an
-	 * alternative node to recheck if the tasks is now properly placed.
-	 */
-	p->numa_scan_period = task_scan_start(p);
-
 	best_rq = cpu_rq(env.best_cpu);
 	if (env.best_task == NULL) {
 		pg_data_t *pgdat = NODE_DATA(cpu_to_node(env.dst_cpu));
@@ -6669,6 +6663,19 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
 
 	/* We have migrated, no longer consider this task hot */
 	p->se.exec_start = 0;
+
+#ifdef CONFIG_NUMA_BALANCING
+	if (!p->mm || (p->flags & PF_EXITING))
+		return;
+
+	if (p->numa_faults) {
+		int src_nid = cpu_to_node(task_cpu(p));
+		int dst_nid = cpu_to_node(new_cpu);
+
+		if (src_nid != dst_nid)
+			p->numa_scan_period = task_scan_start(p);
+	}
+#endif
 }
 
 static void task_dead_fair(struct task_struct *p)
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 19/19] sched/numa: Move task_placement closer to numa_migrate_preferred
  2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (17 preceding siblings ...)
  2018-06-04 10:00 ` [PATCH 18/19] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
@ 2018-06-04 10:00 ` Srikar Dronamraju
  18 siblings, 0 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 10:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

numa_migrate_preferred is called periodically or when task preferred
node changes. Preferred node evaluations happen once per scan sequence.

If the scan completion happens just after the periodic numa migration,
then we try to migrate to the preferred node and the preferred node might
change, needing another node migration.

Avoid this by checking for scan sequence completion only when checking
for periodic migration.

Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      449.46      770.77      615.22      101.70
numa01.sh       Sys:      132.72      208.17      170.46       24.96
numa01.sh      User:    39185.26    60290.89    50066.76     6807.84
numa02.sh      Real:       60.85       61.79       61.28        0.37
numa02.sh       Sys:       15.34       24.71       21.08        3.61
numa02.sh      User:     5204.41     5249.85     5231.21       17.60
numa03.sh      Real:      785.50      916.97      840.77       44.98
numa03.sh       Sys:      108.08      133.60      119.43        8.82
numa03.sh      User:    61422.86    70919.75    64720.87     3310.61
numa04.sh      Real:      429.57      587.37      480.80       57.40
numa04.sh       Sys:      240.61      321.97      290.84       33.58
numa04.sh      User:    34597.65    40498.99    37079.48     2060.72
numa05.sh      Real:      392.09      431.25      414.65       13.82
numa05.sh       Sys:      229.41      372.48      297.54       53.14
numa05.sh      User:    33390.86    34697.49    34222.43      556.42


Testcase       Time:         Min         Max         Avg      StdDev 	%Change
numa01.sh      Real:      424.63      566.18      498.12       59.26 	 23.50%
numa01.sh       Sys:      160.19      256.53      208.98       37.02 	 -18.4%
numa01.sh      User:    37320.00    46225.58    42001.57     3482.45 	 19.20%
numa02.sh      Real:       60.17       62.47       60.91        0.85 	 0.607%
numa02.sh       Sys:       15.30       22.82       17.04        2.90 	 23.70%
numa02.sh      User:     5202.13     5255.51     5219.08       20.14 	 0.232%
numa03.sh      Real:      823.91      844.89      833.86        8.46 	 0.828%
numa03.sh       Sys:      130.69      148.29      140.47        6.21 	 -14.9%
numa03.sh      User:    62519.15    64262.20    63613.38      620.05 	 1.740%
numa04.sh      Real:      515.30      603.74      548.56       30.93 	 -12.3%
numa04.sh       Sys:      459.73      525.48      489.18       21.63 	 -40.5%
numa04.sh      User:    40561.96    44919.18    42047.87     1526.85 	 -11.8%
numa05.sh      Real:      396.58      454.37      421.13       19.71 	 -1.53%
numa05.sh       Sys:      208.72      422.02      348.90       73.60 	 -14.7%
numa05.sh      User:    33124.08    36109.35    34846.47     1089.74 	 -1.79%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fc1f388..fe0a5cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2183,9 +2183,6 @@ static void task_numa_placement(struct task_struct *p)
 		/* Set the new preferred node */
 		if (max_nid != p->numa_preferred_nid)
 			sched_setnuma(p, max_nid);
-
-		if (task_node(p) != p->numa_preferred_nid)
-			numa_migrate_preferred(p);
 	}
 
 	update_task_scan_period(p, fault_types[0], fault_types[1]);
@@ -2388,14 +2385,14 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 				numa_is_active_node(mem_node, ng))
 		local = 1;
 
-	task_numa_placement(p);
-
 	/*
 	 * Retry task to preferred node migration periodically, in case it
 	 * case it previously failed, or the scheduler moved us.
 	 */
-	if (time_after(jiffies, p->numa_migrate_retry))
+	if (time_after(jiffies, p->numa_migrate_retry)) {
+		task_numa_placement(p);
 		numa_migrate_preferred(p);
+	}
 
 	if (migrated)
 		p->numa_pages_migrated += pages;
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/19] sched/numa: Set preferred_node based on best_cpu
  2018-06-04 10:00 ` [PATCH 04/19] sched/numa: Set preferred_node based on best_cpu Srikar Dronamraju
@ 2018-06-04 12:18   ` Peter Zijlstra
  2018-06-04 12:53     ` Srikar Dronamraju
  2018-06-04 12:23   ` Peter Zijlstra
  1 sibling, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2018-06-04 12:18 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

On Mon, Jun 04, 2018 at 03:30:13PM +0530, Srikar Dronamraju wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ea32a66..94091e6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1725,8 +1725,9 @@ static int task_numa_migrate(struct task_struct *p)
>  	 * Tasks that are "trapped" in such domains cannot be migrated
>  	 * elsewhere, so there is no point in (re)trying.
>  	 */
> -	if (unlikely(!sd)) {
> -		p->numa_preferred_nid = task_node(p);
> +	if (unlikely(!sd) && p->numa_preferred_nid != task_node(p)) {
> +		/* Set the new preferred node */
> +		sched_setnuma(p, task_node(p));
>  		return -EINVAL;
>  	}
>  

That looks dodgy.. this would allow things to continue with !sd.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/19] sched/numa: Set preferred_node based on best_cpu
  2018-06-04 10:00 ` [PATCH 04/19] sched/numa: Set preferred_node based on best_cpu Srikar Dronamraju
  2018-06-04 12:18   ` Peter Zijlstra
@ 2018-06-04 12:23   ` Peter Zijlstra
  2018-06-04 12:59     ` Srikar Dronamraju
  1 sibling, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2018-06-04 12:23 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

On Mon, Jun 04, 2018 at 03:30:13PM +0530, Srikar Dronamraju wrote:
> @@ -1785,15 +1786,13 @@ static int task_numa_migrate(struct task_struct *p)
>  	 * trying for a better one later. Do not set the preferred node here.
>  	 */
>  	if (p->numa_group) {
> -		struct numa_group *ng = p->numa_group;
> -
>  		if (env.best_cpu == -1)
>  			nid = env.src_nid;
>  		else
> -			nid = env.dst_nid;
> +			nid = cpu_to_node(env.best_cpu);

OK, the above matches the description, but I'm puzzled by the remainder:

>  
> -		if (ng->active_nodes > 1 && numa_is_active_node(env.dst_nid, ng))
> -			sched_setnuma(p, env.dst_nid);
> +		if (nid != p->numa_preferred_nid)
> +			sched_setnuma(p, nid);
>  	}

That seems to entirely loose the active_node thing, or are you saying
best_cpu already includes that? (Changelog could use a little help there
I suppose)

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 05/19] sched/numa: Use task faults only if numa_group is not yet setup
  2018-06-04 10:00 ` [PATCH 05/19] sched/numa: Use task faults only if numa_group is not yet setup Srikar Dronamraju
@ 2018-06-04 12:24   ` Peter Zijlstra
  2018-06-04 13:09     ` Srikar Dronamraju
  0 siblings, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2018-06-04 12:24 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

On Mon, Jun 04, 2018 at 03:30:14PM +0530, Srikar Dronamraju wrote:
> When numa_group faults are available, task_numa_placement only uses
> numa_group faults to evaluate preferred node. However it still accounts
> task faults and even evaluates the preferred node just based on task
> faults just to discard it in favour of preferred node chosen on the
> basis of numa_group.
> 
> Instead use task faults only if numa_group is not set.
> 
> Testcase       Time:         Min         Max         Avg      StdDev
> numa01.sh      Real:      506.35      794.46      599.06      104.26
> numa01.sh       Sys:      150.37      223.56      195.99       24.94
> numa01.sh      User:    43450.69    61752.04    49281.50     6635.33
> numa02.sh      Real:       60.33       62.40       61.31        0.90
> numa02.sh       Sys:       18.12       31.66       24.28        5.89
> numa02.sh      User:     5203.91     5325.32     5260.29       49.98
> numa03.sh      Real:      696.47      853.62      745.80       57.28
> numa03.sh       Sys:       85.68      123.71       97.89       13.48
> numa03.sh      User:    55978.45    66418.63    59254.94     3737.97
> numa04.sh      Real:      444.05      514.83      497.06       26.85
> numa04.sh       Sys:      230.39      375.79      316.23       48.58
> numa04.sh      User:    35403.12    41004.10    39720.80     2163.08
> numa05.sh      Real:      423.09      460.41      439.57       13.92
> numa05.sh       Sys:      287.38      480.15      369.37       68.52
> numa05.sh      User:    34732.12    38016.80    36255.85     1070.51
> 
> Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
> numa01.sh      Real:      478.45      565.90      515.11       30.87 	 16.29%
> numa01.sh       Sys:      207.79      271.04      232.94       21.33 	 -15.8%
> numa01.sh      User:    39763.93    47303.12    43210.73     2644.86 	 14.04%
> numa02.sh      Real:       60.00       61.46       60.78        0.49 	 0.871%
> numa02.sh       Sys:       15.71       25.31       20.69        3.42 	 17.35%
> numa02.sh      User:     5175.92     5265.86     5235.97       32.82 	 0.464%
> numa03.sh      Real:      776.42      834.85      806.01       23.22 	 -7.47%
> numa03.sh       Sys:      114.43      128.75      121.65        5.49 	 -19.5%
> numa03.sh      User:    60773.93    64855.25    62616.91     1576.39 	 -5.36%
> numa04.sh      Real:      456.93      511.95      482.91       20.88 	 2.930%
> numa04.sh       Sys:      178.09      460.89      356.86       94.58 	 -11.3%
> numa04.sh      User:    36312.09    42553.24    39623.21     2247.96 	 0.246%
> numa05.sh      Real:      393.98      493.48      436.61       35.59 	 0.677%
> numa05.sh       Sys:      164.49      329.15      265.87       61.78 	 38.92%
> numa05.sh      User:    33182.65    36654.53    35074.51     1187.71 	 3.368%
> 
> Ideally this change shouldn't have affected performance.

Ideally you go on here to explain why it does in fact do affect
performance.. :-)

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/19] sched/numa: Set preferred_node based on best_cpu
  2018-06-04 12:18   ` Peter Zijlstra
@ 2018-06-04 12:53     ` Srikar Dronamraju
  0 siblings, 0 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 12:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

* Peter Zijlstra <peterz@infradead.org> [2018-06-04 14:18:00]:

> On Mon, Jun 04, 2018 at 03:30:13PM +0530, Srikar Dronamraju wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ea32a66..94091e6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1725,8 +1725,9 @@ static int task_numa_migrate(struct task_struct *p)
> >  	 * Tasks that are "trapped" in such domains cannot be migrated
> >  	 * elsewhere, so there is no point in (re)trying.
> >  	 */
> > -	if (unlikely(!sd)) {
> > -		p->numa_preferred_nid = task_node(p);
> > +	if (unlikely(!sd) && p->numa_preferred_nid != task_node(p)) {
> > +		/* Set the new preferred node */
> > +		sched_setnuma(p, task_node(p));
> >  		return -EINVAL;
> >  	}
> >  
> 
> That looks dodgy.. this would allow things to continue with !sd.

Okay so are we suggesting something like the below?

if (unlikely(!sd)) {
	/* Set the new preferred node */
	sched_setnuma(p, task_node(p));
  		return -EINVAL;
  	}

The reason for using sched_setnuma was to make sure we account numa
tasks correctly.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/19] sched/numa: Set preferred_node based on best_cpu
  2018-06-04 12:23   ` Peter Zijlstra
@ 2018-06-04 12:59     ` Srikar Dronamraju
  2018-06-04 13:39       ` Peter Zijlstra
  2018-06-04 14:37       ` Rik van Riel
  0 siblings, 2 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 12:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

* Peter Zijlstra <peterz@infradead.org> [2018-06-04 14:23:36]:

> OK, the above matches the description, but I'm puzzled by the remainder:
>
> >
> > -		if (ng->active_nodes > 1 && numa_is_active_node(env.dst_nid, ng))
> > -			sched_setnuma(p, env.dst_nid);
> > +		if (nid != p->numa_preferred_nid)
> > +			sched_setnuma(p, nid);
> >  	}
>
> That seems to entirely loose the active_node thing, or are you saying
> best_cpu already includes that? (Changelog could use a little help there
> I suppose)

I think checking for active_nodes before calling sched_setnuma was a
mistake.

Before this change, we may be retaining numa_preferred_nid to be the
source node while we select another node with better numa affinity to
run on. So we are creating a situation where we force a thread to run on
a node which is not going to be its preferred_node. So in the course of
regular load balancing, this task might then be moved to set
preferred_node which is actually not the preferred_node.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 05/19] sched/numa: Use task faults only if numa_group is not yet setup
  2018-06-04 12:24   ` Peter Zijlstra
@ 2018-06-04 13:09     ` Srikar Dronamraju
  0 siblings, 0 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 13:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

> > Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
> > numa01.sh      Real:      478.45      565.90      515.11       30.87 	 16.29%
> > numa01.sh       Sys:      207.79      271.04      232.94       21.33 	 -15.8%
> > numa01.sh      User:    39763.93    47303.12    43210.73     2644.86 	 14.04%
> > numa02.sh      Real:       60.00       61.46       60.78        0.49 	 0.871%
> > numa02.sh       Sys:       15.71       25.31       20.69        3.42 	 17.35%
> > numa02.sh      User:     5175.92     5265.86     5235.97       32.82 	 0.464%
> > numa03.sh      Real:      776.42      834.85      806.01       23.22 	 -7.47%
> > numa03.sh       Sys:      114.43      128.75      121.65        5.49 	 -19.5%
> > numa03.sh      User:    60773.93    64855.25    62616.91     1576.39 	 -5.36%
> > numa04.sh      Real:      456.93      511.95      482.91       20.88 	 2.930%
> > numa04.sh       Sys:      178.09      460.89      356.86       94.58 	 -11.3%
> > numa04.sh      User:    36312.09    42553.24    39623.21     2247.96 	 0.246%
> > numa05.sh      Real:      393.98      493.48      436.61       35.59 	 0.677%
> > numa05.sh       Sys:      164.49      329.15      265.87       61.78 	 38.92%
> > numa05.sh      User:    33182.65    36654.53    35074.51     1187.71 	 3.368%
> > 
> > Ideally this change shouldn't have affected performance.
> 
> Ideally you go on here to explain why it does in fact do affect
> performance.. :-)

I know it looks bad, but I have been unable to figure out why this patch
affects performance. I repeated the experiment multiple times to recheck
if it was not a one off problem. While there is a variance in different
runs, we do see a change in numbers before and after this patch atleast
on my machine.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/19] sched/numa: Set preferred_node based on best_cpu
  2018-06-04 12:59     ` Srikar Dronamraju
@ 2018-06-04 13:39       ` Peter Zijlstra
  2018-06-04 13:48         ` Srikar Dronamraju
  2018-06-04 14:37       ` Rik van Riel
  1 sibling, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2018-06-04 13:39 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

On Mon, Jun 04, 2018 at 05:59:40AM -0700, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2018-06-04 14:23:36]:
> 
> > OK, the above matches the description, but I'm puzzled by the remainder:
> >
> > >
> > > -		if (ng->active_nodes > 1 && numa_is_active_node(env.dst_nid, ng))
> > > -			sched_setnuma(p, env.dst_nid);
> > > +		if (nid != p->numa_preferred_nid)
> > > +			sched_setnuma(p, nid);
> > >  	}
> >
> > That seems to entirely loose the active_node thing, or are you saying
> > best_cpu already includes that? (Changelog could use a little help there
> > I suppose)
> 
> I think checking for active_nodes before calling sched_setnuma was a
> mistake.
> 
> Before this change, we may be retaining numa_preferred_nid to be the
> source node while we select another node with better numa affinity to
> run on. So we are creating a situation where we force a thread to run on
> a node which is not going to be its preferred_node. So in the course of
> regular load balancing, this task might then be moved to set
> preferred_node which is actually not the preferred_node.

Then your Changelog had better explain all that, no?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/19] sched/numa: Set preferred_node based on best_cpu
  2018-06-04 13:39       ` Peter Zijlstra
@ 2018-06-04 13:48         ` Srikar Dronamraju
  0 siblings, 0 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 13:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

* Peter Zijlstra <peterz@infradead.org> [2018-06-04 15:39:53]:

> > >
> > > That seems to entirely loose the active_node thing, or are you saying
> > > best_cpu already includes that? (Changelog could use a little help there
> > > I suppose)
> > 
> > I think checking for active_nodes before calling sched_setnuma was a
> > mistake.
> > 
> > Before this change, we may be retaining numa_preferred_nid to be the
> > source node while we select another node with better numa affinity to
> > run on. So we are creating a situation where we force a thread to run on
> > a node which is not going to be its preferred_node. So in the course of
> > regular load balancing, this task might then be moved to set
> > preferred_node which is actually not the preferred_node.
> 
> Then your Changelog had better explain all that, no?
> 

I had mentioned
"Modify to set the preferred node based of best_cpu."
I will update that to describe the situation.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/19] sched/numa: Set preferred_node based on best_cpu
  2018-06-04 12:59     ` Srikar Dronamraju
  2018-06-04 13:39       ` Peter Zijlstra
@ 2018-06-04 14:37       ` Rik van Riel
  2018-06-04 15:56         ` Srikar Dronamraju
  1 sibling, 1 reply; 66+ messages in thread
From: Rik van Riel @ 2018-06-04 14:37 UTC (permalink / raw)
  To: Srikar Dronamraju, Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 1225 bytes --]

On Mon, 2018-06-04 at 05:59 -0700, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2018-06-04 14:23:36]:
> 
> > OK, the above matches the description, but I'm puzzled by the
> > remainder:
> > 
> > > 
> > > -		if (ng->active_nodes > 1 &&
> > > numa_is_active_node(env.dst_nid, ng))
> > > -			sched_setnuma(p, env.dst_nid);
> > > +		if (nid != p->numa_preferred_nid)
> > > +			sched_setnuma(p, nid);
> > >  	}
> > 
> > That seems to entirely loose the active_node thing, or are you
> > saying
> > best_cpu already includes that? (Changelog could use a little help
> > there
> > I suppose)
> 
> I think checking for active_nodes before calling sched_setnuma was a
> mistake.
> 
> Before this change, we may be retaining numa_preferred_nid to be the
> source node while we select another node with better numa affinity to
> run on. 

Sometimes workloads are so large they get spread
around to multiple NUMA nodes.

In that case, you do NOT want all the tasks of
that workload (numa group) to try squeezing onto
the same load, only to have the load balancer
randomly move tasks off of that node again later.

How do you keep that from happening?

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 02/19] sched/numa: Evaluate move once per node
  2018-06-04 10:00 ` [PATCH 02/19] sched/numa: Evaluate move once per node Srikar Dronamraju
@ 2018-06-04 14:51   ` Rik van Riel
  2018-06-04 15:45     ` Srikar Dronamraju
  0 siblings, 1 reply; 66+ messages in thread
From: Rik van Riel @ 2018-06-04 14:51 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 815 bytes --]

On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:

> @@ -1564,97 +1563,73 @@ static void task_numa_compare(struct
> task_numa_env *env,
>  	if (cur == env->p)
>  		goto unlock;
>  
> +	if (!cur) {
> +		if (!move || imp <= env->best_imp)
> +			goto unlock;
> +		else
> +			goto assign;
> +	}

Just bike shedding, but it may be easier to read
if the "we found our destination" check were written
more explicitly:


	if (!cur) {
		if (move && imp > env->best_imp)
			gote assign;
		else
			goto unlock;
	}

Also, the "move" variable seems to indicate that
the NUMA code may move the task, but not a decision
that moving the task is better than a swap.

Would it make sense to call it "maymove"?

I like how this patch simplifies the code a little.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 01/19] sched/numa: Remove redundant field.
  2018-06-04 10:00 ` [PATCH 01/19] sched/numa: Remove redundant field Srikar Dronamraju
@ 2018-06-04 14:53   ` Rik van Riel
  2018-06-05  8:41   ` Mel Gorman
  1 sibling, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2018-06-04 14:53 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 416 bytes --]

On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> numa_entry is a list_head defined in task_struct, but never used.
> 
> No functional change
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

I guess we stopped using that when we figured out there
was no need to iterate over the tasks in a numa group.

Reviewed-by: Rik van Riel <riel@surriel.com>

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/19] sched/numa: Simplify load_too_imbalanced
  2018-06-04 10:00 ` [PATCH 03/19] sched/numa: Simplify load_too_imbalanced Srikar Dronamraju
@ 2018-06-04 14:57   ` Rik van Riel
  2018-06-05  8:46   ` Mel Gorman
  1 sibling, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2018-06-04 14:57 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 892 bytes --]

On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> Currently load_too_imbalance() cares about the slope of imbalance.
> It doesn't care of the direction of the imbalance.
> 
> However this may not work if nodes that are being compared have
> dissimilar capacities. Few nodes might have more cores than other
> nodes
> in the system. Also unlike traditional load balance at a NUMA sched
> domain, multiple requests to migrate from the same source node to
> same
> destination node may run in parallel. This can cause huge load
> imbalance. This is specially true on a larger machines with either
> large
> cores per node or more number of nodes in the system. Hence allow
> move/swap only if the imbalance is going to reduce.

> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Reviewed-by: Rik van Riel <riel@surriel.com>

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 02/19] sched/numa: Evaluate move once per node
  2018-06-04 14:51   ` Rik van Riel
@ 2018-06-04 15:45     ` Srikar Dronamraju
  0 siblings, 0 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 15:45 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Thomas Gleixner

* Rik van Riel <riel@surriel.com> [2018-06-04 10:51:27]:

> On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> 
> 
> Just bike shedding, but it may be easier to read
> if the "we found our destination" check were written
> more explicitly:
> 
> 
> 	if (!cur) {
> 		if (move && imp > env->best_imp)
> 			gote assign;
> 		else
> 			goto unlock;
> 	}
> 

will incorporate this.


> Also, the "move" variable seems to indicate that
> the NUMA code may move the task, but not a decision
> that moving the task is better than a swap.
> 
> Would it make sense to call it "maymove"?

Okay, will incorporate this too.

> 
> I like how this patch simplifies the code a little.
> 

Thanks.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/19] sched/numa: Set preferred_node based on best_cpu
  2018-06-04 14:37       ` Rik van Riel
@ 2018-06-04 15:56         ` Srikar Dronamraju
  0 siblings, 0 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-04 15:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Mel Gorman, Thomas Gleixner

* Rik van Riel <riel@surriel.com> [2018-06-04 10:37:30]:

> On Mon, 2018-06-04 at 05:59 -0700, Srikar Dronamraju wrote:
> > * Peter Zijlstra <peterz@infradead.org> [2018-06-04 14:23:36]:
> > 
> > > > -		if (ng->active_nodes > 1 &&
> > > > numa_is_active_node(env.dst_nid, ng))
> > > > -			sched_setnuma(p, env.dst_nid);
> > > > +		if (nid != p->numa_preferred_nid)
> > > > +			sched_setnuma(p, nid);
> > > >  	}
> > > 
> > 
> > I think checking for active_nodes before calling sched_setnuma was a
> > mistake.
> > 
> > Before this change, we may be retaining numa_preferred_nid to be the
> > source node while we select another node with better numa affinity to
> > run on. 
> 
> Sometimes workloads are so large they get spread
> around to multiple NUMA nodes.
> 
> In that case, you do NOT want all the tasks of
> that workload (numa group) to try squeezing onto
> the same load, only to have the load balancer
> randomly move tasks off of that node again later.
> 
> How do you keep that from happening?

Infact we are exactly doing this now in all cases. We are not changing
anything in the ng->active_node > 1 case. (which is the workload spread
across multiple nodes).

Earlier we would not set the numa_preferred_nid if there is only one
active node. However its not certain that the src node is the active
node. Infact its most likely not going to be src node because we
wouldn't have reached here if the task was running on the source node.
Keeping the numa_preferred_nid as the source node increases the chances
of the regular load balancer randomly moving tasks from the node.  Now
we are making sure task_node(p) and numa_preferred_nid. Hence we would
reduce the risk of moving to a random node.

Hope this is clear.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 07/19] sched/numa: Skip nodes that are at hoplimit
  2018-06-04 10:00 ` [PATCH 07/19] sched/numa: Skip nodes that are at hoplimit Srikar Dronamraju
@ 2018-06-04 16:27   ` Rik van Riel
  2018-06-05  8:50   ` Mel Gorman
  1 sibling, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2018-06-04 16:27 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 685 bytes --]

On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> When comparing two nodes at a distance of hoplimit, we should
> consider
> nodes only upto hoplimit. Currently we also consider nodes at
> hoplimit
> distance too. Hence two nodes at a distance of "hoplimit" will have
> same
> groupweight. Fix this by skipping nodes at hoplimit.

> While there is a regression with this change, this change is needed
> from a
> correctness perspective. Also it helps consolidation as seen from
> perf bench
> output.
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Good catch.

Reviewed-by: Rik van Riel <riel@surriel.com>

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/19] sched/debug: Reverse the order of printing faults
  2018-06-04 10:00 ` [PATCH 06/19] sched/debug: Reverse the order of printing faults Srikar Dronamraju
@ 2018-06-04 16:28   ` Rik van Riel
  2018-06-05  8:50   ` Mel Gorman
  1 sibling, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2018-06-04 16:28 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 335 bytes --]

On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> Fix the order in which the private and shared numa faults are getting
> printed.
> 
> Shouldn't have any performance impact.
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Reviewed-by: Rik van Riel <riel@surriel.com>

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 08/19] sched/numa: Remove unused task_capacity from numa_stats
  2018-06-04 10:00 ` [PATCH 08/19] sched/numa: Remove unused task_capacity from numa_stats Srikar Dronamraju
@ 2018-06-04 16:28   ` Rik van Riel
  2018-06-05  8:57   ` Mel Gorman
  1 sibling, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2018-06-04 16:28 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 351 bytes --]

On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> task_capacity field in struct numa_stats is redundant.
> Also move nr_running for better packing within the struct.
> 
> No functional changes.
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Acked-by: Rik van Riel <riel@surriel.com>

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 09/19] sched/numa: Modify migrate_swap to accept additional params
  2018-06-04 10:00 ` [PATCH 09/19] sched/numa: Modify migrate_swap to accept additional params Srikar Dronamraju
@ 2018-06-04 17:00   ` Rik van Riel
  2018-06-05  8:58   ` Mel Gorman
  1 sibling, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2018-06-04 17:00 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 899 bytes --]

On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> There are checks in migrate_swap_stop that check if the task/cpu
> combination is as per migrate_swap_arg before migrating.
> 
> However atleast one of the two tasks to be swapped by migrate_swap
> could
> have migrated to a completely different cpu before updating the
> migrate_swap_arg. The new cpu where the task is currently running
> could
> be a different node too. If the task has migrated, numa balancer
> might
> end up placing a task in a wrong node.  Instead of achieving node
> consolidation, it may end up spreading the load across nodes.
> 
> To avoid that pass the cpus as additional parameters.
> 
> While here, place migrate_swap under CONFIG_NUMA_BALANCING.
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Reviewed-by: Rik van Riel <riel@surriel.com>

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 10/19] sched/numa: Stop multiple tasks from moving to the cpu at the same time
  2018-06-04 10:00 ` [PATCH 10/19] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
@ 2018-06-04 17:57   ` Rik van Riel
  2018-06-05  9:51   ` Mel Gorman
  1 sibling, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2018-06-04 17:57 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 992 bytes --]

On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> Task migration under numa balancing can happen in parallel. More than
> one task might choose to migrate to the same cpu at the same time.
> This
> can result in
> - During task swap, choosing a task that was not part of the
> evaluation.
> - During task swap, task which just got moved into its preferred
> node,
>   moving to a completely different node.
> - During task swap, task failing to move to the preferred node, will
> have
>   to wait an extra interval for the next migrate opportunity.
> - During task movement, multiple task movements can cause load
> imbalance.
> 
> This problem is more likely if there are more cores per node or more
> nodes in the system.
> 
> Use a per run-queue variable to check if numa-balance is active on
> the
> run-queue.

> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> 

Reviewed-by: Rik van Riel <riel@surriel.com>
-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 11/19] sched/numa: Restrict migrating in parallel to the same node.
  2018-06-04 10:00 ` [PATCH 11/19] sched/numa: Restrict migrating in parallel to the same node Srikar Dronamraju
@ 2018-06-04 17:59   ` Rik van Riel
  2018-06-05  9:53   ` Mel Gorman
  1 sibling, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2018-06-04 17:59 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 872 bytes --]

On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> Since task migration under numa balancing can happen in parallel,
> more
> than one task might choose to move to the same node at the same time.
> This can cause load imbalances at the node level.
> 
> The problem is more likely if there are more cores per node or more
> nodes in system.
> 
> Use a per-node variable to indicate if task migration
> to the node under numa balance is currently active.
> This per-node variable will not track swapping of tasks.

> The commit does cause some performance regression but is needed from
> a fairness/correctness perspective.

Does it help any "real workloads", even simple things
like SpecJBB2005?

If this patch only causes regressions, and does not help
any workloads, I would argue that it is not in fact needed.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 12/19] sched:numa Remove numa_has_capacity
  2018-06-04 10:00 ` [PATCH 12/19] sched:numa Remove numa_has_capacity Srikar Dronamraju
@ 2018-06-04 18:07   ` Rik van Riel
  0 siblings, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2018-06-04 18:07 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 646 bytes --]

On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> task_numa_find_cpu helps to find the cpu to swap/move the task.
> Its guarded by numa_has_capacity(). However node not having capacity
> shouldn't deter a task swapping if it helps numa placement.
> 
> Further load_too_imbalanced, which evaluates possibilities of
> move/swap,
> provides similar checks as numa_has_capacity.
> 
> Hence remove numa_has_capacity() to enhance possibilities of task
> swapping even if load is imbalanced.

> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Acked-by: Rik van Riel <riel@surriel.com>

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 13/19] mm/migrate: Use xchg instead of spinlock
  2018-06-04 10:00 ` [PATCH 13/19] mm/migrate: Use xchg instead of spinlock Srikar Dronamraju
@ 2018-06-04 18:22   ` Rik van Riel
  2018-06-04 19:28   ` Peter Zijlstra
  1 sibling, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2018-06-04 18:22 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 872 bytes --]

On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> 
> +++ b/mm/migrate.c
> @@ -1874,11 +1874,9 @@ static bool
> numamigrate_update_ratelimit(pg_data_t *pgdat,
>  	 * all the time is being spent migrating!
>  	 */
>  	if (time_after(jiffies, pgdat-
> >numabalancing_migrate_next_window)) {
> -		spin_lock(&pgdat->numabalancing_migrate_lock);
> -		pgdat->numabalancing_migrate_nr_pages = 0;
> -		pgdat->numabalancing_migrate_next_window = jiffies +
> -			msecs_to_jiffies(migrate_interval_millisecs)
> ;
> -		spin_unlock(&pgdat->numabalancing_migrate_lock);
> +		if (xchg(&pgdat->numabalancing_migrate_nr_pages, 0))
> +			pgdat->numabalancing_migrate_next_window =
> jiffies +
> +				msecs_to_jiffies(migrate_interval_mi
> llisecs);
>  	}

I am not convinced this is simpler, but no real
objection either way :)

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 14/19] sched/numa: Updation of scan period need not be in lock
  2018-06-04 10:00 ` [PATCH 14/19] sched/numa: Updation of scan period need not be in lock Srikar Dronamraju
@ 2018-06-04 18:24   ` Rik van Riel
  0 siblings, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2018-06-04 18:24 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 502 bytes --]

On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> The metrics for updating scan periods are local or task specific.
> Currently this updation happens under numa_group lock which seems
> unnecessary. Hence move this updation outside the lock.

Indeed, it looks like update_task_scan_period does
not look at any numa_group numbers (any more?).

> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Reviewed-by: Rik van Riel <riel@surriel.com>

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 15/19] sched/numa: Use group_weights to identify if migration degrades locality
  2018-06-04 10:00 ` [PATCH 15/19] sched/numa: Use group_weights to identify if migration degrades locality Srikar Dronamraju
@ 2018-06-04 18:56   ` Rik van Riel
  0 siblings, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2018-06-04 18:56 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 964 bytes --]

On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> 
> +	dist = node_distance(src_nid, dst_nid);
>  	if (numa_group) {
> -		src_faults = group_faults(p, src_nid);
> -		dst_faults = group_faults(p, dst_nid);
> +		src_weight = group_weight(p, src_nid, dist);
> +		dst_weight = group_weight(p, dst_nid, dist);
>  	} else {
> -		src_faults = task_faults(p, src_nid);
> -		dst_faults = task_faults(p, dst_nid);
> +		src_weight = task_weight(p, src_nid, dist);
> +		dst_weight = task_weight(p, dst_nid, dist);
>  	}
>  
> -	return dst_faults < src_faults;
> +	return dst_weight < src_weight;
>  }

While this is better in principle, in practice
task/group_weight is a LOT more expensive to
calculate than just comparing the faults.

This may be too expensive to do in the load
balancing code.

This patch regressed performance in your synthetic
test. How does it do for "real workload" style
benchmarks?

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 13/19] mm/migrate: Use xchg instead of spinlock
  2018-06-04 10:00 ` [PATCH 13/19] mm/migrate: Use xchg instead of spinlock Srikar Dronamraju
  2018-06-04 18:22   ` Rik van Riel
@ 2018-06-04 19:28   ` Peter Zijlstra
  2018-06-05  7:24     ` Srikar Dronamraju
  1 sibling, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2018-06-04 19:28 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

On Mon, Jun 04, 2018 at 03:30:22PM +0530, Srikar Dronamraju wrote:
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8c0af0f..1c55956 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1874,11 +1874,9 @@ static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
>  	 * all the time is being spent migrating!
>  	 */
>  	if (time_after(jiffies, pgdat->numabalancing_migrate_next_window)) {
> -		spin_lock(&pgdat->numabalancing_migrate_lock);
> -		pgdat->numabalancing_migrate_nr_pages = 0;
> -		pgdat->numabalancing_migrate_next_window = jiffies +
> -			msecs_to_jiffies(migrate_interval_millisecs);
> -		spin_unlock(&pgdat->numabalancing_migrate_lock);
> +		if (xchg(&pgdat->numabalancing_migrate_nr_pages, 0))
> +			pgdat->numabalancing_migrate_next_window = jiffies +
> +				msecs_to_jiffies(migrate_interval_millisecs);

Note that both are in fact wrong. That wants to be something like:

	pgdat->numabalancing_migrate_next_window += interval;

Otherwise you stretch every interval by 'jiffies - numabalancing_migrate_next_window'.

Also, that all wants READ_ONCE/WRITE_ONCE, irrespective of the
spinlock/xchg.

I suppose the problem here is that PPC has a very nasty test-and-set
spinlock with fwd progress issues while xchg maps to a fairly simple
ll/sc that (hopefully) has some hardware fairness.

And pgdata being a rather course data structure (per node?) there could
be a lot of CPUs stomping on this here thing.

So simpler not really, but better for PPC.

>  	}
>  	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) {
>  		trace_mm_numa_migrate_ratelimit(current, pgdat->node_id,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4526643..464a25c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6208,7 +6208,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
>  
>  	pgdat_resize_init(pgdat);
>  #ifdef CONFIG_NUMA_BALANCING
> -	spin_lock_init(&pgdat->numabalancing_migrate_lock);
>  	pgdat->numabalancing_migrate_nr_pages = 0;
>  	pgdat->active_node_migrate = 0;
>  	pgdat->numabalancing_migrate_next_window = jiffies;
> -- 
> 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration
  2018-06-04 10:00 ` [PATCH 16/19] sched/numa: Detect if node actively handling migration Srikar Dronamraju
@ 2018-06-04 20:05   ` Rik van Riel
  2018-06-05  3:56     ` Srikar Dronamraju
  0 siblings, 1 reply; 66+ messages in thread
From: Rik van Riel @ 2018-06-04 20:05 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 1045 bytes --]

On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:

> @@ -1554,6 +1562,9 @@ static void task_numa_compare(struct
> task_numa_env *env,
>  	if (READ_ONCE(dst_rq->numa_migrate_on))
>  		return;
>  
> +	if (*move && READ_ONCE(pgdat->active_node_migrate))
> +		*move = false;

Why not do this check in task_numa_find_cpu?

That way you won't have to pass this in as a
pointer, and you also will not have to recalculate
NODE_DATA(cpu_to_node(env->dst_cpu)) for every CPU.

>  	/*
> +	 * If the numa importance is less than SMALLIMP,

              ^^^ numa improvement

> +	 * task migration might only result in ping pong
> +	 * of tasks and also hurt performance due to cache
> +	 * misses.
> +	 */
> +	if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
> +		goto unlock;

I can see a use for the first test, but why limit the
search for the best score once you are past the
threshold?

I don't understand the use for that second test.

What workload benefits from it?

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 18/19] sched/numa: Reset scan rate whenever task moves across nodes
  2018-06-04 10:00 ` [PATCH 18/19] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
@ 2018-06-04 20:08   ` Rik van Riel
  2018-06-05  9:58   ` Mel Gorman
  1 sibling, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2018-06-04 20:08 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 811 bytes --]

On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> Currently task scan rate is reset when numa balancer migrates the
> task
> to a different node. If numa balancer initiates a swap, reset is only
> applicable to the task that initiates the swap. Similarly no scan
> rate
> reset is done if the task is migrated across nodes by traditional
> load
> balancer.
> 
> Instead move the scan reset to the migrate_task_rq. This ensures the
> task moved out of its preferred node, either gets back to its
> preferred
> node quickly or finds a new preferred node. Doing so, would be fair
> to
> all tasks migrating across nodes.

How does this impact performance of benchmarks
closer to real world workloads?

Not just the test cases measuring NUMA convergence.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration
  2018-06-04 20:05   ` Rik van Riel
@ 2018-06-05  3:56     ` Srikar Dronamraju
  2018-06-05 13:07       ` Rik van Riel
  0 siblings, 1 reply; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-05  3:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Thomas Gleixner

* Rik van Riel <riel@surriel.com> [2018-06-04 16:05:55]:

> On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> 
> > @@ -1554,6 +1562,9 @@ static void task_numa_compare(struct
> > task_numa_env *env,
> >  	if (READ_ONCE(dst_rq->numa_migrate_on))
> >  		return;
> >  
> > +	if (*move && READ_ONCE(pgdat->active_node_migrate))
> > +		*move = false;
> 
> Why not do this check in task_numa_find_cpu?
> 
> That way you won't have to pass this in as a
> pointer, and you also will not have to recalculate
> NODE_DATA(cpu_to_node(env->dst_cpu)) for every CPU.
> 

I thought about this. Lets say we evaluated that destination node can
allow movement. While we iterate through the list of cpus trying to find
the best cpu node, we find a idle cpu towards the end of the list.
However if another task as already raced with us to move a task to this
node, then we should bail out. Keeping the check in task_numa_compare
will allow us to do this.

> >  	/*
> > +	 * If the numa importance is less than SMALLIMP,
> 
>               ^^^ numa improvement
> 

okay

> > +	 * task migration might only result in ping pong
> > +	 * of tasks and also hurt performance due to cache
> > +	 * misses.
> > +	 */
> > +	if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
> > +		goto unlock;
> 
> I can see a use for the first test, but why limit the
> search for the best score once you are past the
> threshold?
> 
> I don't understand the use for that second test.
> 

Lets say few threads are racing with each other to find a cpu on the
node. The first thread has already found a task/cpu 'A' to swap and
finds another task/cpu 'B' thats slightly better than the current
best_cpu which is 'A'. Currently we allow the second task/cpu 'B' to be
set as best_cpu. However the second or subsequent threads cannot find
the first task/cpu A because its suppose to be in active migration. By
the time it reaches task/cpu B even that may look to be in active
migration. It may never know that task/cpu A was cleared. In this way,
the second and subsequent threads may not get a task/cpu in the
preferred node to swap just because the first task kept hopping task/cpu
as its choice of migration.

While we can't complete avoid this, the second check will try to make
sure we don't hop on/hop off just for small incremental numa
improvement.



> What workload benefits from it?

> 
> -- 
> All Rights Reversed.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 13/19] mm/migrate: Use xchg instead of spinlock
  2018-06-04 19:28   ` Peter Zijlstra
@ 2018-06-05  7:24     ` Srikar Dronamraju
  2018-06-05  8:16       ` Peter Zijlstra
  0 siblings, 1 reply; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-05  7:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

* Peter Zijlstra <peterz@infradead.org> [2018-06-04 21:28:21]:

> >  	if (time_after(jiffies, pgdat->numabalancing_migrate_next_window)) {
> > -		spin_lock(&pgdat->numabalancing_migrate_lock);
> > -		pgdat->numabalancing_migrate_nr_pages = 0;
> > -		pgdat->numabalancing_migrate_next_window = jiffies +
> > -			msecs_to_jiffies(migrate_interval_millisecs);
> > -		spin_unlock(&pgdat->numabalancing_migrate_lock);
> > +		if (xchg(&pgdat->numabalancing_migrate_nr_pages, 0))
> > +			pgdat->numabalancing_migrate_next_window = jiffies +
> > +				msecs_to_jiffies(migrate_interval_millisecs);
>
> Note that both are in fact wrong. That wants to be something like:
>
> 	pgdat->numabalancing_migrate_next_window += interval;
>
> Otherwise you stretch every interval by 'jiffies - numabalancing_migrate_next_window'.

Okay, I get your point.


>
> Also, that all wants READ_ONCE/WRITE_ONCE, irrespective of the
> spinlock/xchg.
>
> I suppose the problem here is that PPC has a very nasty test-and-set
> spinlock with fwd progress issues while xchg maps to a fairly simple
> ll/sc that (hopefully) has some hardware fairness.
>
> And pgdata being a rather course data structure (per node?) there could
> be a lot of CPUs stomping on this here thing.
>
> So simpler not really, but better for PPC.
>

unsigned long interval = READ_ONCE(pgdat->numabalancing_migrate_next_window);

if (time_after(jiffies, interval)) {
	interval += msecs_to_jiffies(migrate_interval_millisecs));
	if (xchg(&pgdat->numabalancing_migrate_nr_pages, 0))
		WRITE_ONCE(pgdat->numabalancing_migrate_next_window, interval);
}

Something like this?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 13/19] mm/migrate: Use xchg instead of spinlock
  2018-06-05  7:24     ` Srikar Dronamraju
@ 2018-06-05  8:16       ` Peter Zijlstra
  0 siblings, 0 replies; 66+ messages in thread
From: Peter Zijlstra @ 2018-06-05  8:16 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

On Tue, Jun 05, 2018 at 12:24:39AM -0700, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2018-06-04 21:28:21]:
> 
> > >  	if (time_after(jiffies, pgdat->numabalancing_migrate_next_window)) {
> > > -		spin_lock(&pgdat->numabalancing_migrate_lock);
> > > -		pgdat->numabalancing_migrate_nr_pages = 0;
> > > -		pgdat->numabalancing_migrate_next_window = jiffies +
> > > -			msecs_to_jiffies(migrate_interval_millisecs);
> > > -		spin_unlock(&pgdat->numabalancing_migrate_lock);
> > > +		if (xchg(&pgdat->numabalancing_migrate_nr_pages, 0))
> > > +			pgdat->numabalancing_migrate_next_window = jiffies +
> > > +				msecs_to_jiffies(migrate_interval_millisecs);
> >
> > Note that both are in fact wrong. That wants to be something like:
> >
> > 	pgdat->numabalancing_migrate_next_window += interval;
> >
> > Otherwise you stretch every interval by 'jiffies - numabalancing_migrate_next_window'.
> 
> Okay, I get your point.

Note that in practise it probably doesn't matter, but it just upsets my
OCD ;-)

> > Also, that all wants READ_ONCE/WRITE_ONCE, irrespective of the
> > spinlock/xchg.

> unsigned long interval = READ_ONCE(pgdat->numabalancing_migrate_next_window);
> 
> if (time_after(jiffies, interval)) {
> 	interval += msecs_to_jiffies(migrate_interval_millisecs));
> 	if (xchg(&pgdat->numabalancing_migrate_nr_pages, 0))
> 		WRITE_ONCE(pgdat->numabalancing_migrate_next_window, interval);
> }
> 
> Something like this?

Almost, you forgot about the case where 'jiffies -
numabalancing_migrate_next_window > interval'.

That wants to be something like:

  unsigned long timo = READ_ONCE(stupid_long_name);

  if (time_after(jiffies, timo) && xchg(&other_long_name, 0)) {
	  do {
		  timo += msec_to_jiffies(..);
	  } while (unlikely(time_after(jiffies, timo);

	  WRITE_ONCE(stupid_long_name, timo);
  }

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 01/19] sched/numa: Remove redundant field.
  2018-06-04 10:00 ` [PATCH 01/19] sched/numa: Remove redundant field Srikar Dronamraju
  2018-06-04 14:53   ` Rik van Riel
@ 2018-06-05  8:41   ` Mel Gorman
  1 sibling, 0 replies; 66+ messages in thread
From: Mel Gorman @ 2018-06-05  8:41 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

On Mon, Jun 04, 2018 at 03:30:10PM +0530, Srikar Dronamraju wrote:
> numa_entry is a list_head defined in task_struct, but never used.
> 
> No functional change
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 03/19] sched/numa: Simplify load_too_imbalanced
  2018-06-04 10:00 ` [PATCH 03/19] sched/numa: Simplify load_too_imbalanced Srikar Dronamraju
  2018-06-04 14:57   ` Rik van Riel
@ 2018-06-05  8:46   ` Mel Gorman
  1 sibling, 0 replies; 66+ messages in thread
From: Mel Gorman @ 2018-06-05  8:46 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

On Mon, Jun 04, 2018 at 03:30:12PM +0530, Srikar Dronamraju wrote:
> Currently load_too_imbalance() cares about the slope of imbalance.
> It doesn't care of the direction of the imbalance.
> 
> However this may not work if nodes that are being compared have
> dissimilar capacities. Few nodes might have more cores than other nodes
> in the system. Also unlike traditional load balance at a NUMA sched
> domain, multiple requests to migrate from the same source node to same
> destination node may run in parallel. This can cause huge load
> imbalance. This is specially true on a larger machines with either large
> cores per node or more number of nodes in the system. Hence allow
> move/swap only if the imbalance is going to reduce.
> 
> Testcase       Time:         Min         Max         Avg      StdDev
> numa01.sh      Real:      516.14      892.41      739.84      151.32
> numa01.sh       Sys:      153.16      192.99      177.70       14.58
> numa01.sh      User:    39821.04    69528.92    57193.87    10989.48
> numa02.sh      Real:       60.91       62.35       61.58        0.63
> numa02.sh       Sys:       16.47       26.16       21.20        3.85
> numa02.sh      User:     5227.58     5309.61     5265.17       31.04
> numa03.sh      Real:      739.07      917.73      795.75       64.45
> numa03.sh       Sys:       94.46      136.08      109.48       14.58
> numa03.sh      User:    57478.56    72014.09    61764.48     5343.69
> numa04.sh      Real:      442.61      715.43      530.31       96.12
> numa04.sh       Sys:      224.90      348.63      285.61       48.83
> numa04.sh      User:    35836.84    47522.47    40235.41     3985.26
> numa05.sh      Real:      386.13      489.17      434.94       43.59
> numa05.sh       Sys:      144.29      438.56      278.80      105.78
> numa05.sh      User:    33255.86    36890.82    34879.31     1641.98
> 
> Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
> numa01.sh      Real:      435.78      653.81      534.58       83.20 	 38.39%
> numa01.sh       Sys:      121.93      187.18      145.90       23.47 	 21.79%
> numa01.sh      User:    37082.81    51402.80    43647.60     5409.75 	 31.03%
> numa02.sh      Real:       60.64       61.63       61.19        0.40 	 0.637%
> numa02.sh       Sys:       14.72       25.68       19.06        4.03 	 11.22%
> numa02.sh      User:     5210.95     5266.69     5233.30       20.82 	 0.608%
> numa03.sh      Real:      746.51      808.24      780.36       23.88 	 1.972%
> numa03.sh       Sys:       97.26      108.48      105.07        4.28 	 4.197%
> numa03.sh      User:    58956.30    61397.05    60162.95     1050.82 	 2.661%
> numa04.sh      Real:      465.97      519.27      484.81       19.62 	 9.385%
> numa04.sh       Sys:      304.43      359.08      334.68       20.64 	 -14.6%
> numa04.sh      User:    37544.16    41186.15    39262.44     1314.91 	 2.478%
> numa05.sh      Real:      411.57      457.20      433.29       16.58 	 0.380%
> numa05.sh       Sys:      230.05      435.48      339.95       67.58 	 -17.9%
> numa05.sh      User:    33325.54    36896.31    35637.84     1222.64 	 -2.12%
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/19] sched/debug: Reverse the order of printing faults
  2018-06-04 10:00 ` [PATCH 06/19] sched/debug: Reverse the order of printing faults Srikar Dronamraju
  2018-06-04 16:28   ` Rik van Riel
@ 2018-06-05  8:50   ` Mel Gorman
  1 sibling, 0 replies; 66+ messages in thread
From: Mel Gorman @ 2018-06-05  8:50 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

On Mon, Jun 04, 2018 at 03:30:15PM +0530, Srikar Dronamraju wrote:
> Fix the order in which the private and shared numa faults are getting
> printed.
> 
> Shouldn't have any performance impact.
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 07/19] sched/numa: Skip nodes that are at hoplimit
  2018-06-04 10:00 ` [PATCH 07/19] sched/numa: Skip nodes that are at hoplimit Srikar Dronamraju
  2018-06-04 16:27   ` Rik van Riel
@ 2018-06-05  8:50   ` Mel Gorman
  1 sibling, 0 replies; 66+ messages in thread
From: Mel Gorman @ 2018-06-05  8:50 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

On Mon, Jun 04, 2018 at 03:30:16PM +0530, Srikar Dronamraju wrote:
> When comparing two nodes at a distance of hoplimit, we should consider
> nodes only upto hoplimit. Currently we also consider nodes at hoplimit
> distance too. Hence two nodes at a distance of "hoplimit" will have same
> groupweight. Fix this by skipping nodes at hoplimit.
> 
> Testcase       Time:         Min         Max         Avg      StdDev
> numa01.sh      Real:      478.45      565.90      515.11       30.87
> numa01.sh       Sys:      207.79      271.04      232.94       21.33
> numa01.sh      User:    39763.93    47303.12    43210.73     2644.86
> numa02.sh      Real:       60.00       61.46       60.78        0.49
> numa02.sh       Sys:       15.71       25.31       20.69        3.42
> numa02.sh      User:     5175.92     5265.86     5235.97       32.82
> numa03.sh      Real:      776.42      834.85      806.01       23.22
> numa03.sh       Sys:      114.43      128.75      121.65        5.49
> numa03.sh      User:    60773.93    64855.25    62616.91     1576.39
> numa04.sh      Real:      456.93      511.95      482.91       20.88
> numa04.sh       Sys:      178.09      460.89      356.86       94.58
> numa04.sh      User:    36312.09    42553.24    39623.21     2247.96
> numa05.sh      Real:      393.98      493.48      436.61       35.59
> numa05.sh       Sys:      164.49      329.15      265.87       61.78
> numa05.sh      User:    33182.65    36654.53    35074.51     1187.71
> 
> Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
> numa01.sh      Real:      414.64      819.20      556.08      147.70 	 -7.36%
> numa01.sh       Sys:       77.52      205.04      139.40       52.05 	 67.10%
> numa01.sh      User:    37043.24    61757.88    45517.48     9290.38 	 -5.06%
> numa02.sh      Real:       60.80       63.32       61.63        0.88 	 -1.37%
> numa02.sh       Sys:       17.35       39.37       25.71        7.33 	 -19.5%
> numa02.sh      User:     5213.79     5374.73     5268.90       55.09 	 -0.62%
> numa03.sh      Real:      780.09      948.64      831.43       63.02 	 -3.05%
> numa03.sh       Sys:      104.96      136.92      116.31       11.34 	 4.591%
> numa03.sh      User:    60465.42    73339.78    64368.03     4700.14 	 -2.72%
> numa04.sh      Real:      412.60      681.92      521.29       96.64 	 -7.36%
> numa04.sh       Sys:      210.32      314.10      251.77       37.71 	 41.74%
> numa04.sh      User:    34026.38    45581.20    38534.49     4198.53 	 2.825%
> numa05.sh      Real:      394.79      439.63      411.35       16.87 	 6.140%
> numa05.sh       Sys:      238.32      330.09      292.31       38.32 	 -9.04%
> numa05.sh      User:    33456.45    34876.07    34138.62      609.45 	 2.741%
> 
> While there is a regression with this change, this change is needed from a
> correctness perspective. Also it helps consolidation as seen from perf bench
> output.
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Agreed that it's better from a correctness POV

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 08/19] sched/numa: Remove unused task_capacity from numa_stats
  2018-06-04 10:00 ` [PATCH 08/19] sched/numa: Remove unused task_capacity from numa_stats Srikar Dronamraju
  2018-06-04 16:28   ` Rik van Riel
@ 2018-06-05  8:57   ` Mel Gorman
  1 sibling, 0 replies; 66+ messages in thread
From: Mel Gorman @ 2018-06-05  8:57 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

On Mon, Jun 04, 2018 at 03:30:17PM +0530, Srikar Dronamraju wrote:
> task_capacity field in struct numa_stats is redundant.
> Also move nr_running for better packing within the struct.
> 
> No functional changes.
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 09/19] sched/numa: Modify migrate_swap to accept additional params
  2018-06-04 10:00 ` [PATCH 09/19] sched/numa: Modify migrate_swap to accept additional params Srikar Dronamraju
  2018-06-04 17:00   ` Rik van Riel
@ 2018-06-05  8:58   ` Mel Gorman
  1 sibling, 0 replies; 66+ messages in thread
From: Mel Gorman @ 2018-06-05  8:58 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

On Mon, Jun 04, 2018 at 03:30:18PM +0530, Srikar Dronamraju wrote:
> There are checks in migrate_swap_stop that check if the task/cpu
> combination is as per migrate_swap_arg before migrating.
> 
> However atleast one of the two tasks to be swapped by migrate_swap could
> have migrated to a completely different cpu before updating the
> migrate_swap_arg. The new cpu where the task is currently running could
> be a different node too. If the task has migrated, numa balancer might
> end up placing a task in a wrong node.  Instead of achieving node
> consolidation, it may end up spreading the load across nodes.
> 
> To avoid that pass the cpus as additional parameters.
> 
> While here, place migrate_swap under CONFIG_NUMA_BALANCING.
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 10/19] sched/numa: Stop multiple tasks from moving to the cpu at the same time
  2018-06-04 10:00 ` [PATCH 10/19] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
  2018-06-04 17:57   ` Rik van Riel
@ 2018-06-05  9:51   ` Mel Gorman
  1 sibling, 0 replies; 66+ messages in thread
From: Mel Gorman @ 2018-06-05  9:51 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

On Mon, Jun 04, 2018 at 03:30:19PM +0530, Srikar Dronamraju wrote:
> Task migration under numa balancing can happen in parallel. More than
> one task might choose to migrate to the same cpu at the same time. This
> can result in
> - During task swap, choosing a task that was not part of the evaluation.
> - During task swap, task which just got moved into its preferred node,
>   moving to a completely different node.
> - During task swap, task failing to move to the preferred node, will have
>   to wait an extra interval for the next migrate opportunity.
> - During task movement, multiple task movements can cause load imbalance.
> 
> This problem is more likely if there are more cores per node or more
> nodes in the system.
> 
> Use a per run-queue variable to check if numa-balance is active on the
> run-queue.
> 

FWIW, I had noticed a similar problem when selecting an idle CPU for
SIS. When prototying something to look at nearby CPUs for cross-node
migrations during wake_affine I found that the patch allowed multiple
tasks to select the same CPU when a waker was waking many wakees and
ultimately dropped the patch. 

Either way for this patch;

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 11/19] sched/numa: Restrict migrating in parallel to the same node.
  2018-06-04 10:00 ` [PATCH 11/19] sched/numa: Restrict migrating in parallel to the same node Srikar Dronamraju
  2018-06-04 17:59   ` Rik van Riel
@ 2018-06-05  9:53   ` Mel Gorman
  2018-06-06 12:58     ` Srikar Dronamraju
  1 sibling, 1 reply; 66+ messages in thread
From: Mel Gorman @ 2018-06-05  9:53 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

On Mon, Jun 04, 2018 at 03:30:20PM +0530, Srikar Dronamraju wrote:
> Since task migration under numa balancing can happen in parallel, more
> than one task might choose to move to the same node at the same time.
> This can cause load imbalances at the node level.
> 
> The problem is more likely if there are more cores per node or more
> nodes in system.
> 
> Use a per-node variable to indicate if task migration
> to the node under numa balance is currently active.
> This per-node variable will not track swapping of tasks.
> 
> Testcase       Time:         Min         Max         Avg      StdDev
> numa01.sh      Real:      434.84      676.90      550.53      106.24
> numa01.sh       Sys:      125.98      217.34      179.41       30.35
> numa01.sh      User:    38318.48    53789.56    45864.17     6620.80
> numa02.sh      Real:       60.06       61.27       60.59        0.45
> numa02.sh       Sys:       14.25       17.86       16.09        1.28
> numa02.sh      User:     5190.13     5225.67     5209.24       13.19
> numa03.sh      Real:      748.21      960.25      823.15       73.51
> numa03.sh       Sys:       96.68      122.10      110.42       11.29
> numa03.sh      User:    58222.16    72595.27    63552.22     5048.87
> numa04.sh      Real:      433.08      630.55      499.30       68.15
> numa04.sh       Sys:      245.22      386.75      306.09       63.32
> numa04.sh      User:    35014.68    46151.72    38530.26     3924.65
> numa05.sh      Real:      394.77      410.07      401.41        5.99
> numa05.sh       Sys:      212.40      301.82      256.23       35.41
> numa05.sh      User:    33224.86    34201.40    33665.61      313.40
> 
> Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
> numa01.sh      Real:      674.61      997.71      785.01      115.95 	 -29.86%
> numa01.sh       Sys:      180.87      318.88      270.13       51.32 	 -33.58%
> numa01.sh      User:    54001.30    71936.50    60495.48     6237.55 	 -24.18%
> numa02.sh      Real:       60.62       62.30       61.46        0.62 	 -1.415%
> numa02.sh       Sys:       15.01       33.63       24.38        6.81 	 -34.00%
> numa02.sh      User:     5234.20     5325.60     5276.23       38.85 	 -1.269%
> numa03.sh      Real:      827.62      946.85      914.48       44.58 	 -9.987%
> numa03.sh       Sys:      135.55      172.40      158.46       12.75 	 -30.31%
> numa03.sh      User:    64839.42    73195.44    70805.96     3061.20 	 -10.24%
> numa04.sh      Real:      481.01      608.76      521.14       47.28 	 -4.190%
> numa04.sh       Sys:      329.59      373.15      353.20       14.20 	 -13.33%
> numa04.sh      User:    37649.09    40722.94    38806.32     1072.32 	 -0.711%
> numa05.sh      Real:      399.21      415.38      409.88        5.54 	 -2.066%
> numa05.sh       Sys:      319.46      418.57      363.31       37.62 	 -29.47%
> numa05.sh      User:    33727.77    34732.68    34127.41      447.11 	 -1.353%
> 
> The commit does cause some performance regression but is needed from
> a fairness/correctness perspective.
> 

While it may cause some performance regressions, it may be due to either
a) some workloads benefit from overloading a node if the tasks idle
frequently or b) the regression may be due to delayed convergence. I'm
not 100% convinced this needs to be done from a correctness point of
view based on just this microbenchmark

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 18/19] sched/numa: Reset scan rate whenever task moves across nodes
  2018-06-04 10:00 ` [PATCH 18/19] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
  2018-06-04 20:08   ` Rik van Riel
@ 2018-06-05  9:58   ` Mel Gorman
  2018-06-06 13:47     ` Srikar Dronamraju
  1 sibling, 1 reply; 66+ messages in thread
From: Mel Gorman @ 2018-06-05  9:58 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

On Mon, Jun 04, 2018 at 03:30:27PM +0530, Srikar Dronamraju wrote:
> Currently task scan rate is reset when numa balancer migrates the task
> to a different node. If numa balancer initiates a swap, reset is only
> applicable to the task that initiates the swap. Similarly no scan rate
> reset is done if the task is migrated across nodes by traditional load
> balancer.
> 
> Instead move the scan reset to the migrate_task_rq. This ensures the
> task moved out of its preferred node, either gets back to its preferred
> node quickly or finds a new preferred node. Doing so, would be fair to
> all tasks migrating across nodes.
> 

By and large you need to be very careful resetting the scan rate without
a lot of justification and I don't think this is enough. With scan rate
resets, there is a significant risk that system CPU overhead is
increased to do the page table updates and handle the resulting minor
faults. There are cases where tasks can get pulled cross-node very
frequently and we do not want NUMA balancing scanning agressively when
that happens.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration
  2018-06-05  3:56     ` Srikar Dronamraju
@ 2018-06-05 13:07       ` Rik van Riel
  2018-06-06 12:55         ` Srikar Dronamraju
  0 siblings, 1 reply; 66+ messages in thread
From: Rik van Riel @ 2018-06-05 13:07 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 2913 bytes --]

On Mon, 2018-06-04 at 20:56 -0700, Srikar Dronamraju wrote:
> * Rik van Riel <riel@surriel.com> [2018-06-04 16:05:55]:
> 
> > On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> > 
> > > @@ -1554,6 +1562,9 @@ static void task_numa_compare(struct
> > > task_numa_env *env,
> > >  	if (READ_ONCE(dst_rq->numa_migrate_on))
> > >  		return;
> > >  
> > > +	if (*move && READ_ONCE(pgdat->active_node_migrate))
> > > +		*move = false;
> > 
> > Why not do this check in task_numa_find_cpu?
> > 
> > That way you won't have to pass this in as a
> > pointer, and you also will not have to recalculate
> > NODE_DATA(cpu_to_node(env->dst_cpu)) for every CPU.
> > 
> 
> I thought about this. Lets say we evaluated that destination node can
> allow movement. While we iterate through the list of cpus trying to
> find
> the best cpu node, we find a idle cpu towards the end of the list.
> However if another task as already raced with us to move a task to
> this
> node, then we should bail out. Keeping the check in task_numa_compare
> will allow us to do this.

Your check is called once for every invocation
of task_numa_compare. It does not matter whether
it is inside or outside, except on the outside
the variable manipulation will be easier to read.

> > > +	 * task migration might only result in ping pong
> > > +	 * of tasks and also hurt performance due to cache
> > > +	 * misses.
> > > +	 */
> > > +	if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP /
> > > 2)
> > > +		goto unlock;
> > 
> > I can see a use for the first test, but why limit the
> > search for the best score once you are past the
> > threshold?
> > 
> > I don't understand the use for that second test.
> > 
> 
> Lets say few threads are racing with each other to find a cpu on the
> node. The first thread has already found a task/cpu 'A' to swap and
> finds another task/cpu 'B' thats slightly better than the current
> best_cpu which is 'A'. Currently we allow the second task/cpu 'B' to
> be
> set as best_cpu. However the second or subsequent threads cannot find
> the first task/cpu A because its suppose to be in active migration.
> By
> the time it reaches task/cpu B even that may look to be in active
> migration. It may never know that task/cpu A was cleared. In this
> way,
> the second and subsequent threads may not get a task/cpu in the
> preferred node to swap just because the first task kept hopping
> task/cpu
> as its choice of migration.
> 
> While we can't complete avoid this, the second check will try to make
> sure we don't hop on/hop off just for small incremental numa
> improvement.

However, all those racing tasks start searching
the CPUs on a node from the same start position.

That means they may all get stuck on the same
task/cpu A, and not select the better task/cpu B.

What am I missing? 

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration
  2018-06-05 13:07       ` Rik van Riel
@ 2018-06-06 12:55         ` Srikar Dronamraju
  2018-06-06 13:55           ` Rik van Riel
  0 siblings, 1 reply; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-06 12:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Thomas Gleixner

> > 
> > I thought about this. Lets say we evaluated that destination node can
> > allow movement. While we iterate through the list of cpus trying to
> > find
> > the best cpu node, we find a idle cpu towards the end of the list.
> > However if another task as already raced with us to move a task to
> > this
> > node, then we should bail out. Keeping the check in task_numa_compare
> > will allow us to do this.
> 
> Your check is called once for every invocation
> of task_numa_compare. It does not matter whether
> it is inside or outside, except on the outside
> the variable manipulation will be easier to read.
> 

Okay I mistook your comment; Basically you want the check to be moved
within the for-loop in task_numa_find_cpu.
I will do the needful.

> > 
> > While we can't complete avoid this, the second check will try to make
> > sure we don't hop on/hop off just for small incremental numa
> > improvement.
> 
> However, all those racing tasks start searching
> the CPUs on a node from the same start position.
> 
> That means they may all get stuck on the same
> task/cpu A, and not select the better task/cpu B.

> 
> What am I missing? 

All tasks will not be stuck at task/cpu A.

"[PATCH 10/19] sched/numa: Stop multiple tasks from moving to the
cpu..." the first task to set cpu A as swap target will ensure
subsequent tasks wont be allowed to set cpu A as target for swap till it
finds a better task/cpu. Because of this there a very very small chance
of a second task unable to find a task to swap.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 11/19] sched/numa: Restrict migrating in parallel to the same node.
  2018-06-05  9:53   ` Mel Gorman
@ 2018-06-06 12:58     ` Srikar Dronamraju
  0 siblings, 0 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-06 12:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

> > The commit does cause some performance regression but is needed from
> > a fairness/correctness perspective.
> > 
> 
> While it may cause some performance regressions, it may be due to either
> a) some workloads benefit from overloading a node if the tasks idle
> frequently or b) the regression may be due to delayed convergence. I'm
> not 100% convinced this needs to be done from a correctness point of
> view based on just this microbenchmark
> 

I will get back with Specjbb2005 numbers as suggested by Rik.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 18/19] sched/numa: Reset scan rate whenever task moves across nodes
  2018-06-05  9:58   ` Mel Gorman
@ 2018-06-06 13:47     ` Srikar Dronamraju
  0 siblings, 0 replies; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-06 13:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

* Mel Gorman <mgorman@techsingularity.net> [2018-06-05 10:58:43]:

> On Mon, Jun 04, 2018 at 03:30:27PM +0530, Srikar Dronamraju wrote:
> > Currently task scan rate is reset when numa balancer migrates the task
> > to a different node. If numa balancer initiates a swap, reset is only
> > applicable to the task that initiates the swap. Similarly no scan rate
> > reset is done if the task is migrated across nodes by traditional load
> > balancer.
> > 
> > Instead move the scan reset to the migrate_task_rq. This ensures the
> > task moved out of its preferred node, either gets back to its preferred
> > node quickly or finds a new preferred node. Doing so, would be fair to
> > all tasks migrating across nodes.
> > 
> 
> By and large you need to be very careful resetting the scan rate without
> a lot of justification and I don't think this is enough. With scan rate
> resets, there is a significant risk that system CPU overhead is
> increased to do the page table updates and handle the resulting minor
> faults. There are cases where tasks can get pulled cross-node very
> frequently and we do not want NUMA balancing scanning agressively when
> that happens.
> 

I agree with your thoughts here. I will try to see if there are other
workloads that benefit from this change. My rational for this change
being, because a workload consolidated and slowed down its scanning
shouldn't adversely affect it from coming back to its preferred node.


> -- 
> Mel Gorman
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration
  2018-06-06 12:55         ` Srikar Dronamraju
@ 2018-06-06 13:55           ` Rik van Riel
  2018-06-06 15:32             ` Srikar Dronamraju
  0 siblings, 1 reply; 66+ messages in thread
From: Rik van Riel @ 2018-06-06 13:55 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 1160 bytes --]

On Wed, 2018-06-06 at 05:55 -0700, Srikar Dronamraju wrote:
> > > 
> > > While we can't complete avoid this, the second check will try to
> > > make
> > > sure we don't hop on/hop off just for small incremental numa
> > > improvement.
> > 
> > However, all those racing tasks start searching
> > the CPUs on a node from the same start position.
> > 
> > That means they may all get stuck on the same
> > task/cpu A, and not select the better task/cpu B.
> > 
> > What am I missing? 
> 
> All tasks will not be stuck at task/cpu A.
> 
> "[PATCH 10/19] sched/numa: Stop multiple tasks from moving to the
> cpu..." the first task to set cpu A as swap target will ensure
> subsequent tasks wont be allowed to set cpu A as target for swap till
> it
> finds a better task/cpu. Because of this there a very very small
> chance
> of a second task unable to find a task to swap.

Would it not be better for task_numa_compare to skip
from consideration CPUs that somebody else is already
trying to migrate a task to, but still search for the
best location, instead of settling for a worse location
to migrate to?

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration
  2018-06-06 13:55           ` Rik van Riel
@ 2018-06-06 15:32             ` Srikar Dronamraju
  2018-06-06 17:06               ` Rik van Riel
  0 siblings, 1 reply; 66+ messages in thread
From: Srikar Dronamraju @ 2018-06-06 15:32 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Thomas Gleixner

> > 
> > All tasks will not be stuck at task/cpu A.
> > 
> > "[PATCH 10/19] sched/numa: Stop multiple tasks from moving to the
> > cpu..." the first task to set cpu A as swap target will ensure
> > subsequent tasks wont be allowed to set cpu A as target for swap till
> > it
> > finds a better task/cpu. Because of this there a very very small
> > chance
> > of a second task unable to find a task to swap.
> 
> Would it not be better for task_numa_compare to skip
> from consideration CPUs that somebody else is already
> trying to migrate a task to, but still search for the
> best location, instead of settling for a worse location
> to migrate to?

Yes its better to skip cpus if they are already in migration.
And we are already doing it with the above patch. However as I said
earlier 

- Task T1 sets Cpu 1 as best_cpu, 
- Task T2 finds cpu1 and skips Cpu1
- Task T1 finds cpu2 slightly better than cpu1.
- Task T1 resets cpu1 as best_cpu, sets best_cpu as cpu2.
- Task T2 finds cpu2 and skips cpu2
- Task T1 finds cpu3 as best_cpu slightly better than cpu2.
- Task T1 resets cpu2 as best_cpu, sets best_cpu as cpu3.
- Task T2 finds cpu3 and skips cpu3

So after this T1 was able to find a cpu but T2 couldn't find a cpu even
though there were 3 cpus that were available for 2 task to swap.

Again, this is too corner case, that I am okay to drop.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration
  2018-06-06 15:32             ` Srikar Dronamraju
@ 2018-06-06 17:06               ` Rik van Riel
  0 siblings, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2018-06-06 17:06 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 1289 bytes --]

On Wed, 2018-06-06 at 08:32 -0700, Srikar Dronamraju wrote:

> Yes its better to skip cpus if they are already in migration.
> And we are already doing it with the above patch. However as I said
> earlier 
> 
> - Task T1 sets Cpu 1 as best_cpu, 
> - Task T2 finds cpu1 and skips Cpu1
> - Task T1 finds cpu2 slightly better than cpu1.
> - Task T1 resets cpu1 as best_cpu, sets best_cpu as cpu2.
> - Task T2 finds cpu2 and skips cpu2
> - Task T1 finds cpu3 as best_cpu slightly better than cpu2.
> - Task T1 resets cpu2 as best_cpu, sets best_cpu as cpu3.
> - Task T2 finds cpu3 and skips cpu3
> 
> So after this T1 was able to find a cpu but T2 couldn't find a cpu
> even
> though there were 3 cpus that were available for 2 task to swap.
>
> Again, this is too corner case, that I am okay to drop.

Not only is that above race highly unlikely, it is also
still possible with your patch applied, if the scores
between cpu1, cpu2, and cpu3 differ by more than
SMALLIMP/2.

Not only is this patch some weird magic that makes the
code harder to maintain (since its purpose does not match
what the code actually does), but it also does not
reliably do what you intend it to do.

We may be better off not adding this bit of complexity.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2018-06-06 17:06 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-04 10:00 [PATCH 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
2018-06-04 10:00 ` [PATCH 01/19] sched/numa: Remove redundant field Srikar Dronamraju
2018-06-04 14:53   ` Rik van Riel
2018-06-05  8:41   ` Mel Gorman
2018-06-04 10:00 ` [PATCH 02/19] sched/numa: Evaluate move once per node Srikar Dronamraju
2018-06-04 14:51   ` Rik van Riel
2018-06-04 15:45     ` Srikar Dronamraju
2018-06-04 10:00 ` [PATCH 03/19] sched/numa: Simplify load_too_imbalanced Srikar Dronamraju
2018-06-04 14:57   ` Rik van Riel
2018-06-05  8:46   ` Mel Gorman
2018-06-04 10:00 ` [PATCH 04/19] sched/numa: Set preferred_node based on best_cpu Srikar Dronamraju
2018-06-04 12:18   ` Peter Zijlstra
2018-06-04 12:53     ` Srikar Dronamraju
2018-06-04 12:23   ` Peter Zijlstra
2018-06-04 12:59     ` Srikar Dronamraju
2018-06-04 13:39       ` Peter Zijlstra
2018-06-04 13:48         ` Srikar Dronamraju
2018-06-04 14:37       ` Rik van Riel
2018-06-04 15:56         ` Srikar Dronamraju
2018-06-04 10:00 ` [PATCH 05/19] sched/numa: Use task faults only if numa_group is not yet setup Srikar Dronamraju
2018-06-04 12:24   ` Peter Zijlstra
2018-06-04 13:09     ` Srikar Dronamraju
2018-06-04 10:00 ` [PATCH 06/19] sched/debug: Reverse the order of printing faults Srikar Dronamraju
2018-06-04 16:28   ` Rik van Riel
2018-06-05  8:50   ` Mel Gorman
2018-06-04 10:00 ` [PATCH 07/19] sched/numa: Skip nodes that are at hoplimit Srikar Dronamraju
2018-06-04 16:27   ` Rik van Riel
2018-06-05  8:50   ` Mel Gorman
2018-06-04 10:00 ` [PATCH 08/19] sched/numa: Remove unused task_capacity from numa_stats Srikar Dronamraju
2018-06-04 16:28   ` Rik van Riel
2018-06-05  8:57   ` Mel Gorman
2018-06-04 10:00 ` [PATCH 09/19] sched/numa: Modify migrate_swap to accept additional params Srikar Dronamraju
2018-06-04 17:00   ` Rik van Riel
2018-06-05  8:58   ` Mel Gorman
2018-06-04 10:00 ` [PATCH 10/19] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
2018-06-04 17:57   ` Rik van Riel
2018-06-05  9:51   ` Mel Gorman
2018-06-04 10:00 ` [PATCH 11/19] sched/numa: Restrict migrating in parallel to the same node Srikar Dronamraju
2018-06-04 17:59   ` Rik van Riel
2018-06-05  9:53   ` Mel Gorman
2018-06-06 12:58     ` Srikar Dronamraju
2018-06-04 10:00 ` [PATCH 12/19] sched:numa Remove numa_has_capacity Srikar Dronamraju
2018-06-04 18:07   ` Rik van Riel
2018-06-04 10:00 ` [PATCH 13/19] mm/migrate: Use xchg instead of spinlock Srikar Dronamraju
2018-06-04 18:22   ` Rik van Riel
2018-06-04 19:28   ` Peter Zijlstra
2018-06-05  7:24     ` Srikar Dronamraju
2018-06-05  8:16       ` Peter Zijlstra
2018-06-04 10:00 ` [PATCH 14/19] sched/numa: Updation of scan period need not be in lock Srikar Dronamraju
2018-06-04 18:24   ` Rik van Riel
2018-06-04 10:00 ` [PATCH 15/19] sched/numa: Use group_weights to identify if migration degrades locality Srikar Dronamraju
2018-06-04 18:56   ` Rik van Riel
2018-06-04 10:00 ` [PATCH 16/19] sched/numa: Detect if node actively handling migration Srikar Dronamraju
2018-06-04 20:05   ` Rik van Riel
2018-06-05  3:56     ` Srikar Dronamraju
2018-06-05 13:07       ` Rik van Riel
2018-06-06 12:55         ` Srikar Dronamraju
2018-06-06 13:55           ` Rik van Riel
2018-06-06 15:32             ` Srikar Dronamraju
2018-06-06 17:06               ` Rik van Riel
2018-06-04 10:00 ` [PATCH 17/19] sched/numa: Pass destination cpu as a parameter to migrate_task_rq Srikar Dronamraju
2018-06-04 10:00 ` [PATCH 18/19] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
2018-06-04 20:08   ` Rik van Riel
2018-06-05  9:58   ` Mel Gorman
2018-06-06 13:47     ` Srikar Dronamraju
2018-06-04 10:00 ` [PATCH 19/19] sched/numa: Move task_placement closer to numa_migrate_preferred Srikar Dronamraju

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).