LKML Archive on lore.kernel.org
* [RFC PATCH 0/5] NUMA Balancer Suite
@ 2019-04-22  2:10 王贇
  2019-04-22  2:11 ` [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic 王贇
                   ` (6 more replies)
  0 siblings, 7 replies; 62+ messages in thread
From: 王贇 @ 2019-04-22  2:10 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm

We have the NUMA Balancing feature, which always tries to move the pages
of a task to the node where it executes most, but it still has issues:

* page cache can't be handled
* no cgroup level balancing

Suppose we have a box with 4 CPUs and two cgroups, A & B, each running
4 tasks; the scenario below could easily be observed:

NODE0			|	NODE1
			|
CPU0		CPU1	|	CPU2		CPU3
task_A0		task_A1	|	task_A2		task_A3
task_B0		task_B1	|	task_B2		task_B3

and usually the memory consumption on each node is equal, when the tasks
have similar behavior.

In this case numa balancing tries to move the pages of task_A0,1 & task_B0,1
to node 0 and the pages of task_A2,3 & task_B2,3 to node 1, but the page cache
will be located randomly, depending on the location of the CPU that first
read or wrote it.

Let's suppose another scenario:

NODE0			|	NODE1
			|
CPU0		CPU1	|	CPU2		CPU3
task_A0		task_A1	|	task_B0		task_B1
task_A2		task_A3	|	task_B2		task_B3

By switching the cpu & memory resources of task_A0,1 and task_B0,1, the
workloads of cgroup A are now all on node 0 and those of cgroup B all on
node 1. Resource consumption is the same, but related tasks could share a
closer cpu cache, while the page cache is still randomly located.

Now what if the workloads generate lots of page cache, and most of the
memory accesses are page cache writes?

Page cache generated by task_A0 on NODE1 won't follow it to NODE0, but if
task_A0 was already on NODE0 before it read or wrote files, the cache will be
there, so how do we make sure this happens?

Usually we could solve this problem by binding workloads to a single node:
if cgroup A were bound to CPU0,1, then all the cache it generates would be
on NODE0, and the numa bonus would be maximal.

However, this requires very careful administration of the specific workloads;
suppose in our case A & B have a CPU requirement that changes from 0% to 400%,
then binding to a single node would be a bad idea.

So what we need is a way to detect memory topology at the cgroup level, and
to try to migrate cpu/mem resources to the node holding most of the cache,
as long as resources are plentiful on that node.

This patch set introduces:
  * advanced per-cgroup numa statistics
  * a numa preferred node feature
  * the Numa Balancer module

Together these help to achieve an easy and flexible numa resource assignment,
to gain as much numa bonus as possible.

Michael Wang (5):
  numa: introduce per-cgroup numa balancing locality statistic
  numa: append per-node execution info in memory.numa_stat
  numa: introduce per-cgroup preferred numa node
  numa: introduce numa balancer infrastructure
  numa: numa balancer

 drivers/Makefile             |   1 +
 drivers/numa/Makefile        |   1 +
 drivers/numa/numa_balancer.c | 715 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/memcontrol.h   |  99 ++++++
 include/linux/sched.h        |   9 +-
 kernel/sched/debug.c         |   8 +
 kernel/sched/fair.c          |  41 +++
 mm/huge_memory.c             |   7 +-
 mm/memcontrol.c              | 246 +++++++++++++++
 mm/memory.c                  |   9 +-
 mm/mempolicy.c               |   4 +
 11 files changed, 1133 insertions(+), 7 deletions(-)
 create mode 100644 drivers/numa/Makefile
 create mode 100644 drivers/numa/numa_balancer.c

-- 
2.14.4.44.g2045bb6



* [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic
  2019-04-22  2:10 [RFC PATCH 0/5] NUMA Balancer Suite 王贇
@ 2019-04-22  2:11 ` 王贇
  2019-04-23  8:44   ` Peter Zijlstra
                     ` (2 more replies)
  2019-04-22  2:12 ` [RFC PATCH 2/5] numa: append per-node execution info in memory.numa_stat 王贇
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 62+ messages in thread
From: 王贇 @ 2019-04-22  2:11 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm

This patch introduces a numa locality statistic, which tries to indicate
the numa balancing efficiency per memory cgroup.

By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we
see a new output line headed by 'locality'; the format is:

  locality 0~9% 10%~19% 20%~29% 30%~39% 40%~49% 50%~59% 60%~69% 70%~79% 80%~89% 90%~100%

Each interval is the percentage of local page accesses seen on a task's
last numa balancing, which we call numa balancing locality.

Each number is how many ticks, inside the cgroup, we hit tasks running
with that locality, for example:

  locality 7260278 54860 90493 209327 295801 462784 558897 667242 2786324 7399308

Here 7260278 means that tasks of this cgroup with 0~9% locality
executed for 7260278 ticks.

By monitoring the increments, we can check whether the workload of a
particular cgroup is doing well with numa; when most of the tasks are
running with locality 0~9%, something is wrong with your numa policy.
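
To make the bucket arithmetic concrete, here is a minimal userspace
sketch; it is not part of the patch. locality_bucket() mirrors the index
computation in memcg_stat_numa_update(), while bucket_share() is a
hypothetical helper (an assumption of this sketch) showing how a
monitoring tool might turn two snapshots of the 'locality' line into the
share of ticks spent in one bucket.

```c
#include <stdint.h>

#define NR_NL_INTERVAL 10	/* ten buckets: 0~9% ... 90%~100% */

/*
 * Map local/remote access counts to a bucket index, mirroring the
 * arithmetic in memcg_stat_numa_update(). Returns -1 when no access
 * has been recorded, in which case no bucket would be incremented.
 */
static int locality_bucket(unsigned long local, unsigned long remote)
{
	unsigned long idx;

	if (!local && !remote)
		return -1;

	idx = (local * 10) / (local + remote);
	if (idx >= NR_NL_INTERVAL)	/* 100% local lands in the last bucket */
		idx = NR_NL_INTERVAL - 1;

	return (int)idx;
}

/*
 * Hypothetical monitoring helper: given two snapshots of the 'locality'
 * counters (assumed monotonically increasing), return the percentage of
 * newly accounted ticks that fell into bucket 'nr', or 0 if nothing was
 * accounted between the snapshots.
 */
static unsigned int bucket_share(const uint64_t *before,
				 const uint64_t *after, int nr)
{
	uint64_t total = 0;
	int i;

	for (i = 0; i < NR_NL_INTERVAL; i++)
		total += after[i] - before[i];

	if (!total)
		return 0;

	return (unsigned int)(((after[nr] - before[nr]) * 100) / total);
}
```

A tool built this way would sample memory.numa_stat periodically and
raise a flag when bucket_share(before, after, 0) stays high; the
threshold is a policy choice this sketch leaves open.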

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 include/linux/memcontrol.h | 38 +++++++++++++++++++++++++++++++++++
 include/linux/sched.h      |  8 +++++++-
 kernel/sched/debug.c       |  7 +++++++
 kernel/sched/fair.c        |  8 ++++++++
 mm/huge_memory.c           |  4 +---
 mm/memcontrol.c            | 50 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/memory.c                |  5 ++---
 7 files changed, 113 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 534267947664..bb62e6294484 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -179,6 +179,27 @@ enum memcg_kmem_state {
 	KMEM_ONLINE,
 };

+#ifdef CONFIG_NUMA_BALANCING
+
+enum memcg_numa_locality_interval {
+	PERCENT_0_9,
+	PERCENT_10_19,
+	PERCENT_20_29,
+	PERCENT_30_39,
+	PERCENT_40_49,
+	PERCENT_50_59,
+	PERCENT_60_69,
+	PERCENT_70_79,
+	PERCENT_80_89,
+	PERCENT_90_100,
+	NR_NL_INTERVAL,
+};
+
+struct memcg_stat_numa {
+	u64 locality[NR_NL_INTERVAL];
+};
+
+#endif
 #if defined(CONFIG_SMP)
 struct memcg_padding {
 	char x[0];
@@ -311,6 +332,10 @@ struct mem_cgroup {
 	struct list_head event_list;
 	spinlock_t event_list_lock;

+#ifdef CONFIG_NUMA_BALANCING
+	struct memcg_stat_numa __percpu *stat_numa;
+#endif
+
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
 };
@@ -818,6 +843,14 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 void mem_cgroup_split_huge_fixup(struct page *head);
 #endif

+#ifdef CONFIG_NUMA_BALANCING
+extern void memcg_stat_numa_update(struct task_struct *p);
+#else
+static inline void memcg_stat_numa_update(struct task_struct *p)
+{
+}
+#endif
+
 #else /* CONFIG_MEMCG */

 #define MEM_CGROUP_ID_SHIFT	0
@@ -1156,6 +1189,11 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void memcg_stat_numa_update(struct task_struct *p)
+{
+}
+
 #endif /* CONFIG_MEMCG */

 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1a3c28d997d4..0b01262d110d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1049,8 +1049,14 @@ struct task_struct {
 	 * scan window were remote/local or failed to migrate. The task scan
 	 * period is adapted based on the locality of the faults with different
 	 * weights depending on whether they were shared or private faults
+	 *
+	 * 0 -- remote faults
+	 * 1 -- local faults
+	 * 2 -- page migration failure
+	 * 3 -- remote page accessing after page migration
+	 * 4 -- local page accessing after page migration
 	 */
-	unsigned long			numa_faults_locality[3];
+	unsigned long			numa_faults_locality[5];

 	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8039d62ae36e..2898f5fa4fba 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -873,6 +873,13 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
 	SEQ_printf(m, "current_node=%d, numa_group_id=%d\n",
 			task_node(p), task_numa_group_id(p));
 	show_numa_stats(p, m);
+	SEQ_printf(m, "faults_locality local=%lu remote=%lu failed=%lu ",
+			p->numa_faults_locality[1],
+			p->numa_faults_locality[0],
+			p->numa_faults_locality[2]);
+	SEQ_printf(m, "lhit=%lu rhit=%lu\n",
+			p->numa_faults_locality[4],
+			p->numa_faults_locality[3]);
 	mpol_put(pol);
 #endif
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdab7eb6f351..ba5a67139d57 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -23,6 +23,7 @@
 #include "sched.h"

 #include <trace/events/sched.h>
+#include <linux/memcontrol.h>

 /*
  * Targeted preemption latency for CPU-bound tasks:
@@ -2387,6 +2388,11 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 		memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
 	}

+	p->numa_faults_locality[mem_node == numa_node_id() ? 4 : 3] += pages;
+
+	if (mem_node == NUMA_NO_NODE)
+		return;
+
 	/*
 	 * First accesses are treated as private, otherwise consider accesses
 	 * to be private if the accessing pid has not changed
@@ -2604,6 +2610,8 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
 		return;

+	memcg_stat_numa_update(curr);
+
 	/*
 	 * Using runtime rather than walltime has the dual advantage that
 	 * we (mostly) drive the selection from busy threads and that the
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 404acdcd0455..2614ce725a63 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1621,9 +1621,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	if (anon_vma)
 		page_unlock_anon_vma_read(anon_vma);

-	if (page_nid != NUMA_NO_NODE)
-		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR,
-				flags);
+	task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags);

 	return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c532f8685aa3..b810d4e9c906 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -66,6 +66,7 @@
 #include <linux/lockdep.h>
 #include <linux/file.h>
 #include <linux/tracehook.h>
+#include <linux/cpuset.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -3396,10 +3397,50 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 		seq_putc(m, '\n');
 	}

+#ifdef CONFIG_NUMA_BALANCING
+	seq_puts(m, "locality");
+	for (nr = 0; nr < NR_NL_INTERVAL; nr++) {
+		int cpu;
+		u64 sum = 0;
+
+		for_each_possible_cpu(cpu)
+			sum += per_cpu(memcg->stat_numa->locality[nr], cpu);
+
+		seq_printf(m, " %llu", sum);
+	}
+	seq_putc(m, '\n');
+#endif
+
 	return 0;
 }
 #endif /* CONFIG_NUMA */

+#ifdef CONFIG_NUMA_BALANCING
+
+void memcg_stat_numa_update(struct task_struct *p)
+{
+	struct mem_cgroup *memcg;
+	unsigned long remote = p->numa_faults_locality[3];
+	unsigned long local = p->numa_faults_locality[4];
+	unsigned long idx = -1;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	if (remote || local) {
+		idx = (local * 10) / (remote + local);
+		if (idx >= NR_NL_INTERVAL)
+			idx = NR_NL_INTERVAL - 1;
+	}
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(p);
+	if (idx != -1)
+		this_cpu_inc(memcg->stat_numa->locality[idx]);
+	rcu_read_unlock();
+}
+#endif
+
 /* Universal VM events cgroup1 shows, original sort order */
 static const unsigned int memcg1_events[] = {
 	PGPGIN,
@@ -4435,6 +4476,9 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)

 	for_each_node(node)
 		free_mem_cgroup_per_node_info(memcg, node);
+#ifdef CONFIG_NUMA_BALANCING
+	free_percpu(memcg->stat_numa);
+#endif
 	free_percpu(memcg->vmstats_percpu);
 	kfree(memcg);
 }
@@ -4468,6 +4512,12 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	if (!memcg->vmstats_percpu)
 		goto fail;

+#ifdef CONFIG_NUMA_BALANCING
+	memcg->stat_numa = alloc_percpu(struct memcg_stat_numa);
+	if (!memcg->stat_numa)
+		goto fail;
+#endif
+
 	for_each_node(node)
 		if (alloc_mem_cgroup_per_node_info(memcg, node))
 			goto fail;
diff --git a/mm/memory.c b/mm/memory.c
index c0391a9f18b8..fb0c1d940d36 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3609,7 +3609,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page = NULL;
 	int page_nid = NUMA_NO_NODE;
-	int last_cpupid;
+	int last_cpupid = 0;
 	int target_nid;
 	bool migrated = false;
 	pte_t pte, old_pte;
@@ -3689,8 +3689,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		flags |= TNF_MIGRATE_FAIL;

 out:
-	if (page_nid != NUMA_NO_NODE)
-		task_numa_fault(last_cpupid, page_nid, 1, flags);
+	task_numa_fault(last_cpupid, page_nid, 1, flags);
 	return 0;
 }

-- 
2.14.4.44.g2045bb6



* [RFC PATCH 2/5] numa: append per-node execution info in memory.numa_stat
  2019-04-22  2:10 [RFC PATCH 0/5] NUMA Balancer Suite 王贇
  2019-04-22  2:11 ` [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic 王贇
@ 2019-04-22  2:12 ` 王贇
  2019-04-23  8:52   ` Peter Zijlstra
  2019-04-22  2:13 ` [RFC PATCH 3/5] numa: introduce per-cgroup preferred numa node 王贇
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 62+ messages in thread
From: 王贇 @ 2019-04-22  2:12 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm

This patch introduces numa execution information, to indicate the numa
efficiency.

By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we
see a new output line headed by 'exectime', like:

  exectime 24399843 27865444

which means the tasks of this cgroup executed 24399843 ticks on node 0,
and 27865444 ticks on node 1.

Combined with the memory node info, we can estimate the numa efficiency;
for example, if memory.numa_stat shows:

  total=4613257 N0=6849 N1=3928327
  ...
  exectime 24399843 27865444

there could be unmovable or cache pages on N1, so good locality could
mean nothing since we are not tracking these types of pages; binding the
workloads to the cpus of N1 is thus worth a try, in order to achieve the
maximum performance bonus.
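
The comparison described above can be sketched in userspace as follows;
this is an illustration only, not code from the patch. share_percent()
and binding_candidate() are hypothetical names, and the 'gap' threshold
is an assumption of the sketch. Shares are computed over the listed
nodes only.

```c
#include <stddef.h>
#include <stdint.h>

/* Percentage of the summed values held by entry 'nid'; 0 if the sum is 0. */
static unsigned int share_percent(const uint64_t *per_node,
				  size_t nr_nodes, size_t nid)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < nr_nodes; i++)
		total += per_node[i];

	if (!total)
		return 0;

	return (unsigned int)((per_node[nid] * 100) / total);
}

/*
 * Hypothetical heuristic: find the node holding the most pages and
 * return it when its share of execution time lags its share of memory
 * by more than 'gap' percentage points; otherwise return -1.
 */
static int binding_candidate(const uint64_t *pages,
			     const uint64_t *exectime,
			     size_t nr_nodes, unsigned int gap)
{
	unsigned int mem, exec;
	size_t nid, best = 0;

	if (!nr_nodes)
		return -1;

	for (nid = 1; nid < nr_nodes; nid++)
		if (pages[nid] > pages[best])
			best = nid;

	mem = share_percent(pages, nr_nodes, best);
	exec = share_percent(exectime, nr_nodes, best);

	if (mem > exec && mem - exec > gap)
		return (int)best;

	return -1;
}
```

With the N0/N1 page counts and exectime values quoted above, node 1
holds nearly all of the pages but only about half of the execution time,
so binding_candidate() with a moderate gap would point at node 1.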

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c            | 13 +++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bb62e6294484..e784d6252d5e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -197,6 +197,7 @@ enum memcg_numa_locality_interval {

 struct memcg_stat_numa {
 	u64 locality[NR_NL_INTERVAL];
+	u64 exectime;
 };

 #endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b810d4e9c906..91bcd71fc38a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3409,6 +3409,18 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 		seq_printf(m, " %llu", sum);
 	}
 	seq_putc(m, '\n');
+
+	seq_puts(m, "exectime");
+	for_each_online_node(nr) {
+		int cpu;
+		u64 sum = 0;
+
+		for_each_cpu(cpu, cpumask_of_node(nr))
+			sum += per_cpu(memcg->stat_numa->exectime, cpu);
+
+		seq_printf(m, " %llu", sum);
+	}
+	seq_putc(m, '\n');
 #endif

 	return 0;
@@ -3437,6 +3449,7 @@ void memcg_stat_numa_update(struct task_struct *p)
 	memcg = mem_cgroup_from_task(p);
 	if (idx != -1)
 		this_cpu_inc(memcg->stat_numa->locality[idx]);
+	this_cpu_inc(memcg->stat_numa->exectime);
 	rcu_read_unlock();
 }
 #endif
-- 
2.14.4.44.g2045bb6



* [RFC PATCH 3/5] numa: introduce per-cgroup preferred numa node
  2019-04-22  2:10 [RFC PATCH 0/5] NUMA Balancer Suite 王贇
  2019-04-22  2:11 ` [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic 王贇
  2019-04-22  2:12 ` [RFC PATCH 2/5] numa: append per-node execution info in memory.numa_stat 王贇
@ 2019-04-22  2:13 ` 王贇
  2019-04-23  8:55   ` Peter Zijlstra
  2019-04-22  2:14 ` [RFC PATCH 4/5] numa: introduce numa balancer infrastructure 王贇
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 62+ messages in thread
From: 王贇 @ 2019-04-22  2:13 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm

This patch adds a new entry 'numa_preferred' for each memory cgroup,
by which we can now override the memory policy of the tasks inside
a particular cgroup; combined with numa balancing, we are now able to
migrate the workloads of a cgroup to the specified numa node, in a
gentle way.

Load balancing and the numa preference work against each other on CPU
placement, which leads to the situation that although a particular node
is capable of holding all the workloads, tasks will still spread out.

In order to gain the numa benefit in this situation, load balancing
should respect the preference as long as balancing won't be broken.

This patch tries to forbid workloads from leaving the memcg preferred
node, when and only when a numa preferred node is configured; in case
load balancing can't find other tasks to move and keeps failing, we
then give up and allow the migration to happen.
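
The two decisions described above can be modelled in plain C as below.
This is a userspace sketch of the logic in memcg_migrate_allow() and
memcg_migrate_prep(), with the kernel state they read passed in as
ordinary parameters; the function names and parameter lists here are
assumptions of the sketch, not the patch's interfaces.

```c
#include <stdbool.h>

#define NUMA_NO_NODE	(-1)

/*
 * Model of the load-balancer check: refuse to pull a task off its
 * preferred node, unless balancing has already failed more often than
 * the domain tolerates, in which case we give up and allow it.
 */
static bool migrate_allowed(int pnid, int src_nid, int dst_nid,
			    unsigned int nr_balance_failed,
			    unsigned int cache_nice_tries)
{
	if (nr_balance_failed > cache_nice_tries)
		return true;

	if (pnid != dst_nid && pnid == src_nid)
		return false;

	return true;
}

/*
 * Model of the page-migration override: when a preferred node is set
 * and permitted by mems_allowed, pages already there stay put
 * (NUMA_NO_NODE) and all others are steered to it; otherwise the
 * target chosen by numa balancing is kept.
 */
static int migrate_target(int target_nid, int page_nid,
			  int preferred_nid, bool preferred_in_mems_allowed)
{
	if (preferred_nid == NUMA_NO_NODE)
		return target_nid;

	if (preferred_in_mems_allowed)
		return page_nid == preferred_nid ? NUMA_NO_NODE : preferred_nid;

	return target_nid;
}
```

Note that a cgroup without a preferred node (NUMA_NO_NODE) never matches
src_nid, so the first function leaves ordinary load balancing untouched.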

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 include/linux/memcontrol.h | 34 +++++++++++++++++++
 include/linux/sched.h      |  1 +
 kernel/sched/debug.c       |  1 +
 kernel/sched/fair.c        | 33 +++++++++++++++++++
 mm/huge_memory.c           |  3 ++
 mm/memcontrol.c            | 82 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/memory.c                |  4 +++
 mm/mempolicy.c             |  4 +++
 8 files changed, 162 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e784d6252d5e..0fd5eeb27c4f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -335,6 +335,8 @@ struct mem_cgroup {

 #ifdef CONFIG_NUMA_BALANCING
 	struct memcg_stat_numa __percpu *stat_numa;
+	s64 numa_preferred;
+	struct mutex numa_mutex;
 #endif

 	struct mem_cgroup_per_node *nodeinfo[0];
@@ -846,10 +848,26 @@ void mem_cgroup_split_huge_fixup(struct page *head);

 #ifdef CONFIG_NUMA_BALANCING
 extern void memcg_stat_numa_update(struct task_struct *p);
+extern int memcg_migrate_prep(int target_nid, int page_nid);
+extern int memcg_preferred_nid(struct task_struct *p, gfp_t gfp);
+extern struct page *alloc_page_numa_preferred(gfp_t gfp, unsigned int order);
 #else
 static inline void memcg_stat_numa_update(struct task_struct *p)
 {
 }
+static inline int memcg_migrate_prep(int target_nid, int page_nid)
+{
+	return target_nid;
+}
+static inline int memcg_preferred_nid(struct task_struct *p, gfp_t gfp)
+{
+	return -1;
+}
+static inline struct page *alloc_page_numa_preferred(gfp_t gfp,
+						     unsigned int order)
+{
+	return NULL;
+}
 #endif

 #else /* CONFIG_MEMCG */
@@ -1195,6 +1213,22 @@ static inline void memcg_stat_numa_update(struct task_struct *p)
 {
 }

+static inline int memcg_migrate_prep(int target_nid, int page_nid)
+{
+	return target_nid;
+}
+
+static inline int memcg_preferred_nid(struct task_struct *p, gfp_t gfp)
+{
+	return -1;
+}
+
+static inline struct page *alloc_page_numa_preferred(gfp_t gfp,
+						     unsigned int order)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_MEMCG */

 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0b01262d110d..9f931db1d31f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -422,6 +422,7 @@ struct sched_statistics {
 	u64				nr_migrations_cold;
 	u64				nr_failed_migrations_affine;
 	u64				nr_failed_migrations_running;
+	u64				nr_failed_migrations_memcg;
 	u64				nr_failed_migrations_hot;
 	u64				nr_forced_migrations;

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2898f5fa4fba..32f5fd66f0fe 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -934,6 +934,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(se.statistics.nr_migrations_cold);
 		P_SCHEDSTAT(se.statistics.nr_failed_migrations_affine);
 		P_SCHEDSTAT(se.statistics.nr_failed_migrations_running);
+		P_SCHEDSTAT(se.statistics.nr_failed_migrations_memcg);
 		P_SCHEDSTAT(se.statistics.nr_failed_migrations_hot);
 		P_SCHEDSTAT(se.statistics.nr_forced_migrations);
 		P_SCHEDSTAT(se.statistics.nr_wakeups);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ba5a67139d57..5d0758e78b96 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6701,6 +6701,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	} else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
 		/* Fast path */
+		int pnid = memcg_preferred_nid(p, 0);
+
+		if (pnid != NUMA_NO_NODE && pnid != cpu_to_node(new_cpu))
+			new_cpu = prev_cpu;

 		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);

@@ -7404,12 +7408,36 @@ static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 	return dst_weight < src_weight;
 }

+static inline bool memcg_migrate_allow(struct task_struct *p,
+					struct lb_env *env)
+{
+	int src_nid, dst_nid, pnid;
+
+	/* failed too much could imply balancing broken, now be a good boy */
+	if (env->sd->nr_balance_failed > env->sd->cache_nice_tries)
+		return true;
+
+	src_nid = cpu_to_node(env->src_cpu);
+	dst_nid = cpu_to_node(env->dst_cpu);
+
+	pnid = memcg_preferred_nid(p, 0);
+	if (pnid != dst_nid && pnid == src_nid)
+		return false;
+
+	return true;
+}
 #else
 static inline int migrate_degrades_locality(struct task_struct *p,
 					     struct lb_env *env)
 {
 	return -1;
 }
+
+static inline bool memcg_migrate_allow(struct task_struct *p,
+					struct lb_env *env)
+{
+	return true;
+}
 #endif

 /*
@@ -7470,6 +7498,11 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}

+	if (!memcg_migrate_allow(p, env)) {
+		schedstat_inc(p->se.statistics.nr_failed_migrations_memcg);
+		return 0;
+	}
+
 	/*
 	 * Aggressive migration if:
 	 * 1) destination numa is preferred
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2614ce725a63..c01e1bb22477 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1523,6 +1523,9 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	 */
 	page_locked = trylock_page(page);
 	target_nid = mpol_misplaced(page, vma, haddr);
+
+	target_nid = memcg_migrate_prep(target_nid, page_nid);
+
 	if (target_nid == NUMA_NO_NODE) {
 		/* If the page was locked, there are no parallel migrations */
 		if (page_locked)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 91bcd71fc38a..f1cb1e726430 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3452,6 +3452,79 @@ void memcg_stat_numa_update(struct task_struct *p)
 	this_cpu_inc(memcg->stat_numa->exectime);
 	rcu_read_unlock();
 }
+
+static s64 memcg_numa_preferred_read_s64(struct cgroup_subsys_state *css,
+				struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return memcg->numa_preferred;
+}
+
+static int memcg_numa_preferred_write_s64(struct cgroup_subsys_state *css,
+				struct cftype *cft, s64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	if (val != NUMA_NO_NODE && !node_isset(val, node_possible_map))
+		return -EINVAL;
+
+	mutex_lock(&memcg->numa_mutex);
+	memcg->numa_preferred = val;
+	mutex_unlock(&memcg->numa_mutex);
+
+	return 0;
+}
+
+int memcg_preferred_nid(struct task_struct *p, gfp_t gfp)
+{
+	int preferred_nid = NUMA_NO_NODE;
+
+	if (!mem_cgroup_disabled() &&
+	    !in_interrupt() &&
+	    !(gfp & __GFP_THISNODE)) {
+		struct mem_cgroup *memcg;
+
+		rcu_read_lock();
+		memcg = mem_cgroup_from_task(p);
+		if (memcg)
+			preferred_nid = memcg->numa_preferred;
+		rcu_read_unlock();
+	}
+
+	return preferred_nid;
+}
+
+int memcg_migrate_prep(int target_nid, int page_nid)
+{
+	bool ret = false;
+	unsigned int cookie;
+	int preferred_nid = memcg_preferred_nid(current, 0);
+
+	if (preferred_nid == NUMA_NO_NODE)
+		return target_nid;
+
+	do {
+		cookie = read_mems_allowed_begin();
+		ret = node_isset(preferred_nid, current->mems_allowed);
+	} while (read_mems_allowed_retry(cookie));
+
+	if (ret)
+		return page_nid == preferred_nid ? NUMA_NO_NODE : preferred_nid;
+
+	return target_nid;
+}
+
+struct page *alloc_page_numa_preferred(gfp_t gfp, unsigned int order)
+{
+	int pnid = memcg_preferred_nid(current, gfp);
+
+	if (pnid == NUMA_NO_NODE || !node_isset(pnid, current->mems_allowed))
+		return NULL;
+
+	return __alloc_pages_node(pnid, gfp, order);
+}
+
 #endif

 /* Universal VM events cgroup1 shows, original sort order */
@@ -4309,6 +4382,13 @@ static struct cftype mem_cgroup_legacy_files[] = {
 		.name = "numa_stat",
 		.seq_show = memcg_numa_stat_show,
 	},
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	{
+		.name = "numa_preferred",
+		.read_s64 = memcg_numa_preferred_read_s64,
+		.write_s64 = memcg_numa_preferred_write_s64,
+	},
 #endif
 	{
 		.name = "kmem.limit_in_bytes",
@@ -4529,6 +4609,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	memcg->stat_numa = alloc_percpu(struct memcg_stat_numa);
 	if (!memcg->stat_numa)
 		goto fail;
+	mutex_init(&memcg->numa_mutex);
+	memcg->numa_preferred = NUMA_NO_NODE;
 #endif

 	for_each_node(node)
diff --git a/mm/memory.c b/mm/memory.c
index fb0c1d940d36..98d988ca717c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -70,6 +70,7 @@
 #include <linux/dax.h>
 #include <linux/oom.h>
 #include <linux/numa.h>
+#include <linux/memcontrol.h>

 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -3675,6 +3676,9 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid,
 			&flags);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+	target_nid = memcg_migrate_prep(target_nid, page_nid);
+
 	if (target_nid == NUMA_NO_NODE) {
 		put_page(page);
 		goto out;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index af171ccb56a2..6513504373b4 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2031,6 +2031,10 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,

 	pol = get_vma_policy(vma, addr);

+	page = alloc_page_numa_preferred(gfp, order);
+	if (page)
+		goto out;
+
 	if (pol->mode == MPOL_INTERLEAVE) {
 		unsigned nid;

-- 
2.14.4.44.g2045bb6



* [RFC PATCH 4/5] numa: introduce numa balancer infrastructure
  2019-04-22  2:10 [RFC PATCH 0/5] NUMA Balancer Suite 王贇
                   ` (2 preceding siblings ...)
  2019-04-22  2:13 ` [RFC PATCH 3/5] numa: introduce per-cgroup preferred numa node 王贇
@ 2019-04-22  2:14 ` 王贇
  2019-04-22  2:21 ` [RFC PATCH 5/5] numa: numa balancer 王贇
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-04-22  2:14 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm

Now that we have a way to estimate and adjust the numa preferred node for
each memcg, the next problem is how to use them.

Usually one will bind workloads with cpuset.cpus, combined with cpuset.mems
or maybe better the memory policy, to achieve numa bonus; however, in
complicated scenarios like mixed types of workloads or the cpushare way of
isolation, this kind of administration could make one crazy. What we need
is a way to gain numa bonus automatically, maybe not the maximum but as
much as possible.

This patch introduces a basic API for kernel modules to do numa adjustment;
the numa balancer module coming later uses it to try to gain as much numa
bonus as possible, automatically.

The API includes:
  * numa preferred control
  * memcg callback hook
  * memcg per-node page count query
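
The semantics of the callback hook can be illustrated with a small
userspace model; this is a sketch only, with no locking, and every name
in it (struct group, register_cb() and so on) is a stand-in invented for
the illustration rather than the kernel interface. It shows the two
properties the hook relies on: only one callback set can be registered
at a time, and init()/exit() are replayed for groups that were already
online when the callback was registered or unregistered.

```c
#include <stddef.h>

#define MAX_GROUPS 8

struct group {			/* stand-in for struct mem_cgroup */
	int id;
	int inited;
};

struct group_callback {		/* stand-in for struct memcg_callback */
	void (*init)(struct group *g);
	void (*exit)(struct group *g);
};

static struct group *online[MAX_GROUPS];	/* models memcg_cb_list */
static size_t nr_online;
static struct group_callback *active_cb;	/* models memcg_cb */

/* Sample callbacks a module might supply. */
static void mark_init(struct group *g) { g->inited = 1; }
static void mark_exit(struct group *g) { g->inited = 0; }

/* Only one callback set may be active; replay init() for online groups. */
static int register_cb(struct group_callback *cb)
{
	size_t i;

	if (active_cb || !cb)
		return -1;

	active_cb = cb;
	if (cb->init)
		for (i = 0; i < nr_online; i++)
			cb->init(online[i]);
	return 0;
}

/* Only the registered owner may unregister; replay exit() first. */
static int unregister_cb(struct group_callback *cb)
{
	size_t i;

	if (!active_cb || active_cb != cb)
		return -1;

	if (cb->exit)
		for (i = 0; i < nr_online; i++)
			cb->exit(online[i]);
	active_cb = NULL;
	return 0;
}

/* A group coming online is tracked and, if a callback exists, initialised. */
static int group_online(struct group *g)
{
	if (nr_online == MAX_GROUPS)
		return -1;

	online[nr_online++] = g;
	if (active_cb && active_cb->init)
		active_cb->init(g);
	return 0;
}
```

Because of the replay, a module loaded late still sees every existing
group exactly once, which is presumably why the list of online memcgs is
maintained even while no callback is registered.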

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 include/linux/memcontrol.h |  26 ++++++++++++
 mm/memcontrol.c            | 101 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 127 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0fd5eeb27c4f..7456b862d5a9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -200,6 +200,11 @@ struct memcg_stat_numa {
 	u64 exectime;
 };

+struct memcg_callback {
+	void (*init)(struct mem_cgroup *memcg);
+	void (*exit)(struct mem_cgroup *memcg);
+};
+
 #endif
 #if defined(CONFIG_SMP)
 struct memcg_padding {
@@ -337,6 +342,8 @@ struct mem_cgroup {
 	struct memcg_stat_numa __percpu *stat_numa;
 	s64 numa_preferred;
 	struct mutex numa_mutex;
+	void *numa_private;
+	struct list_head numa_list;
 #endif

 	struct mem_cgroup_per_node *nodeinfo[0];
@@ -851,6 +858,10 @@ extern void memcg_stat_numa_update(struct task_struct *p);
 extern int memcg_migrate_prep(int target_nid, int page_nid);
 extern int memcg_preferred_nid(struct task_struct *p, gfp_t gfp);
 extern struct page *alloc_page_numa_preferred(gfp_t gfp, unsigned int order);
+extern int register_memcg_callback(void *cb);
+extern int unregister_memcg_callback(void *cb);
+extern void config_numa_preferred(struct mem_cgroup *memcg, int nid);
+extern u64 memcg_numa_pages(struct mem_cgroup *memcg, int nid, u32 mask);
 #else
 static inline void memcg_stat_numa_update(struct task_struct *p)
 {
@@ -868,6 +879,21 @@ static inline struct page *alloc_page_numa_preferred(gfp_t gfp,
 {
 	return NULL;
 }
+static inline int register_memcg_callback(void *cb)
+{
+	return -EINVAL;
+}
+static inline int unregister_memcg_callback(void *cb)
+{
+	return -EINVAL;
+}
+static inline void config_numa_preferred(struct mem_cgroup *memcg, int nid)
+{
+}
+static inline u64 memcg_numa_pages(struct mem_cgroup *memcg, int nid, u32 mask)
+{
+	return 0;
+}
 #endif

 #else /* CONFIG_MEMCG */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f1cb1e726430..dc232ecc904f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3525,6 +3525,102 @@ struct page *alloc_page_numa_preferred(gfp_t gfp, unsigned int order)
 	return __alloc_pages_node(pnid, gfp, order);
 }

+static struct memcg_callback *memcg_cb;
+
+static LIST_HEAD(memcg_cb_list);
+static DEFINE_MUTEX(memcg_cb_mutex);
+
+int register_memcg_callback(void *cb)
+{
+	int ret = 0;
+
+	mutex_lock(&memcg_cb_mutex);
+	if (memcg_cb || !cb) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	memcg_cb = (struct memcg_callback *)cb;
+	if (memcg_cb->init) {
+		struct mem_cgroup *memcg;
+
+		list_for_each_entry(memcg, &memcg_cb_list, numa_list)
+			memcg_cb->init(memcg);
+	}
+
+out:
+	mutex_unlock(&memcg_cb_mutex);
+	return ret;
+}
+EXPORT_SYMBOL(register_memcg_callback);
+
+int unregister_memcg_callback(void *cb)
+{
+	int ret = 0;
+
+	mutex_lock(&memcg_cb_mutex);
+	if (!memcg_cb || memcg_cb != cb) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (memcg_cb->exit) {
+		struct mem_cgroup *memcg;
+
+		list_for_each_entry(memcg, &memcg_cb_list, numa_list)
+			memcg_cb->exit(memcg);
+	}
+	memcg_cb = NULL;
+
+out:
+	mutex_unlock(&memcg_cb_mutex);
+	return ret;
+}
+EXPORT_SYMBOL(unregister_memcg_callback);
+
+void config_numa_preferred(struct mem_cgroup *memcg, int nid)
+{
+	mutex_lock(&memcg->numa_mutex);
+	memcg->numa_preferred = nid;
+	mutex_unlock(&memcg->numa_mutex);
+}
+EXPORT_SYMBOL(config_numa_preferred);
+
+u64 memcg_numa_pages(struct mem_cgroup *memcg, int nid, u32 mask)
+{
+	if (nid == NUMA_NO_NODE)
+		return mem_cgroup_nr_lru_pages(memcg, mask);
+	else
+		return mem_cgroup_node_nr_lru_pages(memcg, nid, mask);
+}
+EXPORT_SYMBOL(memcg_numa_pages);
+
+static void memcg_online_callback(struct mem_cgroup *memcg)
+{
+	mutex_lock(&memcg_cb_mutex);
+	list_add_tail(&memcg->numa_list, &memcg_cb_list);
+	if (memcg_cb && memcg_cb->init)
+		memcg_cb->init(memcg);
+	mutex_unlock(&memcg_cb_mutex);
+}
+
+static void memcg_offline_callback(struct mem_cgroup *memcg)
+{
+	mutex_lock(&memcg_cb_mutex);
+	if (memcg_cb && memcg_cb->exit)
+		memcg_cb->exit(memcg);
+	list_del_init(&memcg->numa_list);
+	mutex_unlock(&memcg_cb_mutex);
+}
+
+#else
+
+static void memcg_online_callback(struct mem_cgroup *memcg)
+{}
+
+static void memcg_offline_callback(struct mem_cgroup *memcg)
+{}
+
 #endif

 /* Universal VM events cgroup1 shows, original sort order */
@@ -4719,6 +4815,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	/* Online state pins memcg ID, memcg ID pins CSS */
 	refcount_set(&memcg->id.ref, 1);
 	css_get(css);
+
+	memcg_online_callback(memcg);
+
 	return 0;
 }

@@ -4727,6 +4826,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	struct mem_cgroup_event *event, *tmp;

+	memcg_offline_callback(memcg);
+
 	/*
 	 * Unregister events and notify userspace.
 	 * Notify userspace about cgroup removing only after rmdir of cgroup
-- 
2.14.4.44.g2045bb6


* [RFC PATCH 5/5] numa: numa balancer
  2019-04-22  2:10 [RFC PATCH 0/5] NUMA Balancer Suite 王贇
                   ` (3 preceding siblings ...)
  2019-04-22  2:14 ` [RFC PATCH 4/5] numa: introduce numa balancer infrastructure 王贇
@ 2019-04-22  2:21 ` 王贇
  2019-04-23  9:05   ` Peter Zijlstra
       [not found] ` <CAHCio2gEw4xyuoiurvwzvEiU8eLas+5ZLhzmqm1V2CJqvt+cyA@mail.gmail.com>
  2019-07-03  3:26 ` [PATCH 0/4] per cpu cgroup numa suite 王贇
  6 siblings, 1 reply; 62+ messages in thread
From: 王贇 @ 2019-04-22  2:21 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm

The numa balancer is a module that tries to tune numa balancing
automatically, to gain as much numa benefit as possible.

For each memory cgroup, we process the work in two stages:

In stage 1 we check the cgroup's exectime and memory topology to see
whether there is a candidate node to settle down on; if we find one,
we move on to stage 2.

In stage 2 we try to settle down as much as possible by preferring the
candidate node; if the node is no longer suitable or locality keeps
declining, we reset things and a new round begins.

Decisions are made by find_candidate_nid(), should_prefer() and
keep_prefer(), which pick a candidate node, check whether we are allowed
to prefer it, and decide whether to keep preferring it.
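
The two stages can be read as a small state machine. The following is
only a minimal userspace sketch of that reading, not the module code:
struct balancer and balancer_step() are hypothetical names, and the
three inputs merely stand in for the results of find_candidate_nid(),
should_prefer() and keep_prefer(). The zombie check, the settled check
and the period adjustment are left out.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of one cgroup's balancer state. */
enum stage { STAGE_1, STAGE_2 };

struct balancer {
	enum stage stage;
	int preferred_nid;	/* -1 means no preferred node */
};

/*
 * One round of the loop. candidate_nid, may_prefer and keep stand in
 * for the results of find_candidate_nid(), should_prefer() and
 * keep_prefer() respectively.
 */
static void balancer_step(struct balancer *b, int candidate_nid,
			  bool may_prefer, bool keep)
{
	if (b->stage == STAGE_1) {
		/* Only move on when a candidate exists and may be preferred. */
		if (candidate_nid >= 0 && may_prefer) {
			b->preferred_nid = candidate_nid;
			b->stage = STAGE_2;
		}
		return;
	}

	/* STAGE_2: drop the preference once it should not be kept. */
	if (!keep) {
		b->preferred_nid = -1;
		b->stage = STAGE_1;
	}
}
```

Starting from stage 1 with no preferred node, a round without a
candidate leaves the state unchanged, a round with an allowed candidate
moves to stage 2, and a later round where keep is false resets to
stage 1.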

Tested on a box with 96 cpus using sysbench-mysql-oltp_read_write:
4 mysqld instances were created and attached to 4 cgroups, then 4
sysbench instances were created and attached to the corresponding
cgroups to test mysql with the oltp_read_write script. Average eps:

				origin		balancer
4 instances each 12 threads	5241.08		5375.59		+2.50%
4 instances each 24 threads	7497.29		7820.73		+4.13%
4 instances each 36 threads	8985.44		9317.04		+3.55%
4 instances each 48 threads	9716.50		9982.60		+2.66%

Other benchmarks like dbench, pgbench and perf bench numa were also
tested, with different parameters and numbers of instances/threads.
Most of the cases show a gain, some show an acceptable regression, and
some show no change.

TODO:
  * improve the logic to address the regression cases
  * find a way, maybe, to handle the page cache left on remote nodes
  * find more scenarios which could benefit

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 drivers/Makefile             |   1 +
 drivers/numa/Makefile        |   1 +
 drivers/numa/numa_balancer.c | 715 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 717 insertions(+)
 create mode 100644 drivers/numa/Makefile
 create mode 100644 drivers/numa/numa_balancer.c

diff --git a/drivers/Makefile b/drivers/Makefile
index c61cde554340..f07936b03870 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -187,3 +187,4 @@ obj-$(CONFIG_UNISYS_VISORBUS)	+= visorbus/
 obj-$(CONFIG_SIOX)		+= siox/
 obj-$(CONFIG_GNSS)		+= gnss/
 obj-$(CONFIG_INTERCONNECT)	+= interconnect/
+obj-$(CONFIG_NUMA_BALANCING)	+= numa/
diff --git a/drivers/numa/Makefile b/drivers/numa/Makefile
new file mode 100644
index 000000000000..acf8a4083333
--- /dev/null
+++ b/drivers/numa/Makefile
@@ -0,0 +1 @@
+obj-m	+= numa_balancer.o
diff --git a/drivers/numa/numa_balancer.c b/drivers/numa/numa_balancer.c
new file mode 100644
index 000000000000..25bbe08c82a2
--- /dev/null
+++ b/drivers/numa/numa_balancer.c
@@ -0,0 +1,715 @@
+/*
+ * NUMA Balancer
+ *
+ *  Copyright (C) 2019 Alibaba Group Holding Limited.
+ *  Author: Michael Wang <yun.wang@linux.alibaba.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/module.h>
+#include <linux/memcontrol.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/kthread.h>
+#include <linux/kernel_stat.h>
+#include <linux/vmstat.h>
+
+static unsigned int debug_level;
+module_param(debug_level, uint, 0644);
+MODULE_PARM_DESC(debug_level, "1 to print decisions, 2 to print both decisions and node info");
+
+static int prefer_level = 10;
+module_param(prefer_level, int, 0644);
+MODULE_PARM_DESC(prefer_level, "stop numa prefer when reach this much continuous downturn, 0 means no prefer");
+
+static unsigned int locality_level = PERCENT_70_79;
+module_param(locality_level, uint, 0644);
+MODULE_PARM_DESC(locality_level, "consider locality as good when above this sector");
+
+static unsigned long period_max = (600 * HZ);
+module_param(period_max, ulong, 0644);
+MODULE_PARM_DESC(period_max, "maximum period between each stage");
+
+static unsigned long period_min = (5 * HZ);
+module_param(period_min, ulong, 0644);
+MODULE_PARM_DESC(period_min, "minimum period between each stage");
+
+static unsigned int cpu_high_wmark = 100;
+module_param(cpu_high_wmark, uint, 0644);
+MODULE_PARM_DESC(cpu_high_wmark, "respect the execution percent rather than memory percent when above this cpu usage");
+
+static unsigned int cpu_low_wmark = 10;
+module_param(cpu_low_wmark, uint, 0644);
+MODULE_PARM_DESC(cpu_low_wmark, "consider cgroup as active when above this cpu usage");
+
+static unsigned int free_low_wmark = 10;
+module_param(free_low_wmark, uint, 0644);
+MODULE_PARM_DESC(free_low_wmark, "consider node as consumed out when below this free percent");
+
+static unsigned int candidate_wmark = 60;
+module_param(candidate_wmark, uint, 0644);
+MODULE_PARM_DESC(candidate_wmark, "consider node as candidate when above this execution time or memory percent");
+
+static unsigned int settled_wmark = 90;
+module_param(settled_wmark, uint, 0644);
+MODULE_PARM_DESC(settled_wmark, "consider cgroup settle down on node when above this execution time and memory percent, or locality");
+
+/*
+ * STAGE_1 -- no preferred node
+ *
+ * STAGE_2 -- preferred node set
+ *
+ * Check handlers for details.
+ */
+enum {
+	STAGE_1,
+	STAGE_2,
+	NR_STAGES,
+};
+
+struct node_info {
+	u64 anon;
+	u64 pages;
+	u64 exectime;
+	u64 exectime_history;
+
+	u64 ticks;
+	u64 ticks_history;
+	u64 idle;
+	u64 idle_history;
+
+	u64 total_pages;
+	u64 free_pages;
+
+	unsigned int exectime_percent;
+	unsigned int last_exectime_percent;
+	unsigned int anon_percent;
+	unsigned int pages_percent;
+	unsigned int free_percent;
+	unsigned int idle_percent;
+	unsigned int cpu_usage;
+};
+
+struct numa_balancer {
+	struct delayed_work dwork;
+	struct mem_cgroup *memcg;
+	struct node_info *ni;
+
+	unsigned long period;
+	unsigned long jstamp;
+
+	u64 locality_good;
+	u64 locality_sum;
+	u64 anon_sum;
+	u64 pages_sum;
+	u64 exectime_sum;
+	u64 free_pages_sum;
+
+	unsigned int stage;
+	unsigned int cpu_usage_sum;
+	unsigned int locality_score;
+	unsigned int downturn;
+
+	int anon_max_nid;
+	int pages_max_nid;
+	int exectime_max_nid;
+	int candidate_nid;
+};
+
+static struct workqueue_struct *numa_balancer_wq;
+
+/*
+ * On each tick that hits one of the memcg's running tasks, the kernel
+ * increases the locality counter, classified by the percentage of
+ * local page accesses.
+ *
+ * This represents the NUMA benefit: the higher the locality, the lower
+ * the memory access latency, so we calculate a score here to tell
+ * how well the memcg is playing with NUMA.
+ *
+ * The score is simply the percentage of ticks above locality_level,
+ * which usually ranges from 0 to 100; -1 means no ticks.
+ *
+ * For example, score 90 with locality_level 7 means that 90 percent
+ * of the ticks hit the memcg's tasks running with above 79% local
+ * page access on numa page fault.
+ */
+static inline void update_locality_score(struct numa_balancer *nb)
+{
+	int i, cpu;
+	u64 good, sum, tmp;
+	unsigned int last_locality_score = nb->locality_score;
+	struct memcg_stat_numa *stat = nb->memcg->stat_numa;
+
+	nb->locality_score = -1;
+
+	for (good = sum = i = 0; i < NR_NL_INTERVAL; i++) {
+		for_each_possible_cpu(cpu) {
+			u64 val = per_cpu(stat->locality[i], cpu);
+
+			good += i > locality_level ? val : 0;
+			sum += val;
+		}
+	}
+
+	tmp = nb->locality_good;
+	nb->locality_good = good;
+	good -= tmp;
+
+	tmp = nb->locality_sum;
+	nb->locality_sum = sum;
+	sum -= tmp;
+
+	if (sum)
+		nb->locality_score = (good * 100) / sum;
+
+	if (nb->locality_score == -1 ||
+	    nb->locality_score > settled_wmark ||
+	    nb->locality_score > last_locality_score)
+		nb->downturn = 0;
+	else
+		nb->downturn++;
+}
+
+static inline void update_numa_info(struct numa_balancer *nb)
+{
+	int nid;
+	unsigned long period_in_jiffies = jiffies - nb->jstamp;
+	struct memcg_stat_numa *stat = nb->memcg->stat_numa;
+
+	if (period_in_jiffies <= 0)
+		return;
+
+	nb->anon_sum = nb->pages_sum = nb->exectime_sum = 0;
+	nb->anon_max_nid = nb->pages_max_nid = nb->exectime_max_nid = 0;
+
+	nb->free_pages_sum = 0;
+
+	for_each_online_node(nid) {
+		int cpu, zid;
+		u64 idle_curr, ticks_curr, exectime_curr;
+		struct node_info *nip = &nb->ni[nid];
+
+		nip->total_pages = nip->free_pages = 0;
+		for (zid = 0; zid <= ZONE_MOVABLE; zid++) {
+			pg_data_t *pgdat = NODE_DATA(nid);
+			struct zone *z = &pgdat->node_zones[zid];
+
+			nip->total_pages += zone_managed_pages(z);
+			nip->free_pages  += zone_page_state(z, NR_FREE_PAGES);
+		}
+
+		idle_curr = ticks_curr = exectime_curr = 0;
+		for_each_cpu(cpu, cpumask_of_node(nid)) {
+			u64 *cstat = kcpustat_cpu(cpu).cpustat;
+
+			/* not accurate but fine */
+			idle_curr += cstat[CPUTIME_IDLE];
+			ticks_curr +=
+				cstat[CPUTIME_USER] + cstat[CPUTIME_NICE] +
+				cstat[CPUTIME_SYSTEM] + cstat[CPUTIME_IDLE] +
+				cstat[CPUTIME_IOWAIT] + cstat[CPUTIME_IRQ] +
+				cstat[CPUTIME_SOFTIRQ] + cstat[CPUTIME_STEAL];
+
+			exectime_curr += per_cpu(stat->exectime, cpu);
+		}
+
+		nip->ticks = ticks_curr - nip->ticks_history;
+		nip->ticks_history = ticks_curr;
+
+		nip->idle = idle_curr - nip->idle_history;
+		nip->idle_history = idle_curr;
+
+		nip->idle_percent = nip->ticks ? nip->idle * 100 / nip->ticks : 0;
+
+		nip->exectime = exectime_curr - nip->exectime_history;
+		nip->exectime_history = exectime_curr;
+
+		nip->anon = memcg_numa_pages(nb->memcg, nid, LRU_ALL_ANON);
+		nip->pages = memcg_numa_pages(nb->memcg, nid, LRU_ALL);
+
+		if (nip->anon > nb->ni[nb->anon_max_nid].anon)
+			nb->anon_max_nid = nid;
+
+		if (nip->pages > nb->ni[nb->pages_max_nid].pages)
+			nb->pages_max_nid = nid;
+
+		if (nip->exectime > nb->ni[nb->exectime_max_nid].exectime)
+			nb->exectime_max_nid = nid;
+
+		nb->anon_sum += nip->anon;
+		nb->pages_sum += nip->pages;
+		nb->exectime_sum += nip->exectime;
+		nb->free_pages_sum += nip->free_pages;
+	}
+
+	for_each_online_node(nid) {
+		struct node_info *nip = &nb->ni[nid];
+
+		nip->last_exectime_percent = nip->exectime_percent;
+		nip->exectime_percent = nb->exectime_sum ?
+			nip->exectime * 100 / nb->exectime_sum : 0;
+
+		nip->anon_percent = nb->anon_sum ?
+			nip->anon * 100 / nb->anon_sum : 0;
+
+		nip->pages_percent = nb->pages_sum ?
+			nip->pages * 100 / nb->pages_sum : 0;
+
+		nip->free_percent = nip->total_pages ?
+			nip->free_pages * 100 / nip->total_pages : 0;
+
+		nip->cpu_usage = nip->exectime * 100 / period_in_jiffies;
+	}
+
+	nb->cpu_usage_sum = nb->exectime_sum * 100 / period_in_jiffies;
+	nb->jstamp = jiffies;
+}
+
+/*
+ * We consider a node a candidate when settling down there is easier,
+ * which means as little page and task migration as possible.
+ *
+ * However, usually it's impossible to find an ideal candidate since
+ * the kernel has no idea about the cgroup's numa affinity, thus we need
+ * to pick out the most likely winner and gamble on it.
+ */
+static inline int find_candidate_nid(struct numa_balancer *nb)
+{
+	int cnid = -1;
+	int enid = nb->exectime_max_nid;
+	int pnid = nb->pages_max_nid;
+	int anid = nb->anon_max_nid;
+	struct node_info *nip;
+
+	/*
+	 * A settled execution percent could imply this is the only
+	 * node available for running, respect this first.
+	 */
+	nip = &nb->ni[enid];
+	if (nb->cpu_usage_sum > cpu_high_wmark &&
+	    nip->exectime_percent > settled_wmark) {
+		cnid = enid;
+		goto out;
+	}
+
+	/*
+	 * Migrating pages costs a lot, if the node is available for
+	 * running and most of the pages reside there, just pick it.
+	 */
+	nip = &nb->ni[pnid];
+	if (nip->exectime_percent &&
+	    nip->pages_percent > candidate_wmark) {
+		cnid = pnid;
+		goto out;
+	}
+
+	/*
+	 * Now pick the node where most of the execution time and
+	 * anonymous pages already are.
+	 */
+	nip = &nb->ni[anid];
+	if (nip->exectime_percent > candidate_wmark &&
+	    nip->anon_percent > candidate_wmark) {
+		cnid = anid;
+		goto out;
+	}
+
+	/*
+	 * No strong hint if we reach here, respect the load balancing
+	 * and gamble.
+	 */
+	nip = &nb->ni[enid];
+	if (nb->cpu_usage_sum > cpu_high_wmark &&
+	    nip->exectime_percent > candidate_wmark) {
+		cnid = enid;
+		goto out;
+	}
+
+out:
+	nb->candidate_nid = cnid;
+	return cnid;
+}
+
+static inline unsigned long clip_period(unsigned long period)
+{
+	if (period < period_min)
+		return period_min;
+	if (period > period_max)
+		return period_max;
+	return period;
+}
+
+static inline void increase_period(struct numa_balancer *nb)
+{
+	nb->period = clip_period(nb->period * 2);
+}
+
+static inline void decrease_period(struct numa_balancer *nb)
+{
+	nb->period = clip_period(nb->period / 2);
+}
+
+static inline bool is_zombie(struct numa_balancer *nb)
+{
+	return (nb->cpu_usage_sum < cpu_low_wmark);
+}
+
+static inline bool is_settled(struct numa_balancer *nb, int nid)
+{
+	return (nb->ni[nid].exectime_percent > settled_wmark &&
+		nb->ni[nid].pages_percent > settled_wmark);
+}
+
+static inline void
+__memcg_printk(struct mem_cgroup *memcg, const char *fmt, ...)
+{
+	struct va_format vaf;
+	va_list args;
+	const char *name = memcg->css.cgroup->kn->name;
+
+	if (!debug_level)
+		return;
+
+	if (*name == '\0')
+		name = "root";
+
+	va_start(args, fmt);
+	vaf.fmt = fmt;
+	vaf.va = &args;
+	pr_notice("%s: [%s] %pV",
+		KBUILD_MODNAME, name, &vaf);
+	va_end(args);
+}
+
+static inline void
+__nb_printk(struct numa_balancer *nb, const char *fmt, ...)
+{
+	int nid;
+	struct va_format vaf;
+	va_list args;
+	const char *name = nb->memcg->css.cgroup->kn->name;
+
+	if (!debug_level)
+		return;
+
+	if (*name == '\0')
+		name = "root";
+
+	va_start(args, fmt);
+	vaf.fmt = fmt;
+	vaf.va = &args;
+	pr_notice("%s: [%s][stage %d] cpu %d%% %pV",
+		KBUILD_MODNAME, name, nb->stage, nb->cpu_usage_sum, &vaf);
+	va_end(args);
+
+	if (debug_level < 2)
+		return;
+
+	for_each_online_node(nid) {
+		struct node_info *nip = &nb->ni[nid];
+
+		pr_notice("%s: [%s][stage %d]\tnid %d exectime %llu[%d%%] anon %llu[%d%%] pages %llu[%d%%] idle [%d%%] free [%d%%]\n",
+			KBUILD_MODNAME, name, nb->stage,
+			nid, nip->exectime, nip->exectime_percent,
+			nip->anon, nip->anon_percent,
+			nip->pages, nip->pages_percent, nip->idle_percent,
+			nip->free_percent);
+	}
+}
+
+#define nb_printk(fmt...)	__nb_printk(nb, fmt)
+#define memcg_printk(fmt...)	__memcg_printk(memcg, fmt)
+
+static inline void reset_stage(struct numa_balancer *nb)
+{
+	nb->stage		= STAGE_1;
+	nb->period		= period_min;
+	nb->candidate_nid	= NUMA_NO_NODE;
+	nb->locality_score	= -1;
+	nb->downturn		= 0;
+
+	config_numa_preferred(nb->memcg, -1);
+}
+
+/*
+ * In most cases, we need to give the kernel a hint of the memcg's
+ * preference in order to settle down on a particular node; the benefit
+ * is obvious, and so is the risk.
+ *
+ * Preferring a node can have global influence and trigger other
+ * memcgs to make their own decisions; ideally different memcg
+ * workloads will shift their resources and then settle down on different
+ * nodes, balancing resources again and gaining maximum numa benefit.
+ */
+static inline bool should_prefer(struct numa_balancer *nb, int cnid)
+{
+	struct node_info *cnip = &nb->ni[cnid];
+	u64 cpu_left, cpu_to_move, mem_left, mem_to_move;
+
+	if (nb->downturn >= prefer_level ||
+	    cnip->free_percent < free_low_wmark ||
+	    cnip->idle_percent < free_low_wmark)
+		return false;
+
+	/*
+	 * We don't want to starve a particular node, but there
+	 * are race conditions and it's impossible to predict
+	 * future resource requirements, so the risk can't be
+	 * avoided.
+	 *
+	 * Fortunately the kernel stops respecting the numa
+	 * preference if things are about to get worse :-P
+	 */
+	cpu_left = cpumask_weight(cpumask_of_node(cnid)) * 100 *
+			(cnip->idle_percent - free_low_wmark);
+	cpu_to_move = nb->cpu_usage_sum - cnip->cpu_usage;
+	if (cpu_left < cpu_to_move)
+		return false;
+
+	mem_left = cnip->total_pages *
+			(cnip->free_percent - free_low_wmark);
+	mem_to_move = nb->pages_sum - cnip->pages;
+	if (mem_left < mem_to_move)
+		return false;
+
+	return true;
+}
+
+static void STAGE_1_handler(struct numa_balancer *nb)
+{
+	int cnid;
+	struct node_info *cnip;
+
+	if (is_zombie(nb)) {
+		reset_stage(nb);
+		increase_period(nb);
+		nb_printk("zombie, silent for %lu seconds\n", nb->period / HZ);
+		return;
+	}
+
+	update_locality_score(nb);
+
+	cnid = find_candidate_nid(nb);
+	if (cnid == NUMA_NO_NODE) {
+		increase_period(nb);
+		nb_printk("no candidate locality %d%%, silent for %lu seconds\n",
+				nb->locality_score, nb->period / HZ);
+		return;
+	}
+
+	cnip = &nb->ni[cnid];
+	if (is_settled(nb, cnid)) {
+		increase_period(nb);
+		nb_printk("settle down node %d exectime %d%% pages %d%% locality %d%%, silent for %lu seconds\n",
+				cnid, cnip->exectime_percent,
+				cnip->pages_percent, nb->locality_score,
+				nb->period / HZ);
+		return;
+	}
+
+	if (!should_prefer(nb, cnid)) {
+		increase_period(nb);
+		nb_printk("discard node %d exectime %d%% pages %d%% locality %d%% free %d%%, silent for %lu seconds\n",
+				cnid, cnip->exectime_percent,
+				cnip->pages_percent, nb->locality_score,
+				cnip->free_percent, nb->period / HZ);
+		return;
+	}
+
+	nb_printk("prefer node %d exectime %d%% pages %d%% locality %d%% free %d%%, goto next stage\n",
+			cnid, cnip->exectime_percent, cnip->pages_percent,
+			nb->locality_score, cnip->free_percent);
+
+	config_numa_preferred(nb->memcg, cnid);
+
+	nb->stage++;
+	nb->period = period_min;
+}
+
+/*
+ * A tough decision here: as soon as we give up preferring the node,
+ * the kernel loses the hint on the memcg's CPU preference. In the good
+ * case tasks keep running on the right node since numa balancing
+ * prefers it, but there are no more guarantees.
+ */
+static inline bool keep_prefer(struct numa_balancer *nb, int cnid)
+{
+	struct node_info *cnip = &nb->ni[cnid];
+
+	if (nb->downturn >= prefer_level)
+		return false;
+
+	/* stop preferring a node that is short on resources */
+	if (cnip->free_percent < free_low_wmark ||
+	    cnip->idle_percent < free_low_wmark)
+		return false;
+
+	if (nb->locality_score > settled_wmark ||
+	    cnip->exectime_percent > settled_wmark)
+		return true;
+
+	if (cnip->exectime_percent > cnip->last_exectime_percent)
+		return true;
+
+	/*
+	 * The kernel makes sure load balancing won't be broken, which
+	 * means some tasks won't stay on the preferred node when
+	 * balancing fails too often. This implies that we should stop
+	 * preferring the node to avoid possible cpu starvation on
+	 * it.
+	 *
+	 * Or maybe the current preferred node just doesn't have enough
+	 * available cpus for the memcg anymore.
+	 */
+	if (cnip->exectime_percent < candidate_wmark ||
+	    nb->exectime_max_nid != cnid)
+		return false;
+
+	return true;
+}
+
+static void STAGE_2_handler(struct numa_balancer *nb)
+{
+	int cnid;
+	struct node_info *cnip;
+
+	if (is_zombie(nb)) {
+		nb_printk("zombie, reset stage\n");
+		reset_stage(nb);
+		return;
+	}
+
+	cnid = nb->candidate_nid;
+	cnip = &nb->ni[cnid];
+
+	update_locality_score(nb);
+
+	if (keep_prefer(nb, cnid)) {
+		if (is_settled(nb, cnid))
+			increase_period(nb);
+		else
+			decrease_period(nb);
+
+		nb_printk("tangled node %d exectime %d%% pages %d%% locality %d%% free %d%%, silent for %lu seconds\n",
+				cnid, cnip->exectime_percent,
+				cnip->pages_percent, nb->locality_score,
+				cnip->free_percent, nb->period / HZ);
+		return;
+	}
+
+	nb_printk("giveup node %d exectime %d%% pages %d%% locality %d%% downturn %d free %d%%, reset stage\n",
+			cnid, cnip->exectime_percent, cnip->pages_percent,
+			nb->locality_score, nb->downturn, cnip->free_percent);
+
+	reset_stage(nb);
+}
+
+static void (*stage_handler[NR_STAGES])(struct numa_balancer *nb) = {
+	&STAGE_1_handler,
+	&STAGE_2_handler,
+};
+
+static void numa_balancer_workfn(struct work_struct *work)
+{
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct numa_balancer *nb =
+			container_of(dwork, struct numa_balancer, dwork);
+
+	update_numa_info(nb);
+	(stage_handler[nb->stage])(nb);
+	cond_resched();
+
+	queue_delayed_work(numa_balancer_wq, &nb->dwork, nb->period);
+}
+
+static void memcg_init_handler(struct mem_cgroup *memcg)
+{
+	struct numa_balancer *nb = memcg->numa_private;
+
+	if (!nb) {
+		nb = kzalloc(sizeof(struct numa_balancer), GFP_KERNEL);
+		if (!nb) {
+			pr_err("allocate balancer private failed\n");
+			return;
+		}
+
+		nb->ni = kcalloc(nr_node_ids, sizeof(*nb->ni), GFP_KERNEL);
+		if (!nb->ni) {
+			pr_err("allocate balancer node info failed\n");
+			kfree(nb);
+			return;
+		}
+
+		nb->memcg = memcg;
+		memcg->numa_private = nb;
+
+		INIT_DELAYED_WORK(&nb->dwork, numa_balancer_workfn);
+	}
+
+	reset_stage(nb);
+	update_numa_info(nb);
+	update_locality_score(nb);
+
+	queue_delayed_work(numa_balancer_wq, &nb->dwork, nb->period);
+	memcg_printk("NUMA Balancer On\n");
+}
+
+static void memcg_exit_handler(struct mem_cgroup *memcg)
+{
+	struct numa_balancer *nb = memcg->numa_private;
+
+	if (nb) {
+		cancel_delayed_work_sync(&nb->dwork);
+
+		kfree(nb->ni);
+		kfree(nb);
+		memcg->numa_private = NULL;
+	}
+
+	config_numa_preferred(memcg, -1);
+	memcg_printk("NUMA Balancer Off\n");
+}
+
+static struct memcg_callback cb = {
+	.init = memcg_init_handler,
+	.exit = memcg_exit_handler,
+};
+
+static int __init numa_balancer_init(void)
+{
+	if (nr_online_nodes < 2) {
+		pr_err("Single node arch don't need numa balancer\n");
+		return -EINVAL;
+	}
+
+	numa_balancer_wq = create_workqueue("numa_balancer");
+	if (!numa_balancer_wq) {
+		pr_err("Create workqueue failed\n");
+		return -ENOMEM;
+	}
+
+	if (register_memcg_callback(&cb) != 0) {
+		pr_err("Register memcg callback failed\n");
+		return -EINVAL;
+	}
+
+	pr_notice(KBUILD_MODNAME ": Initialization Done\n");
+	return 0;
+}
+
+static void __exit numa_balancer_exit(void)
+{
+	unregister_memcg_callback(&cb);
+	destroy_workqueue(numa_balancer_wq);
+
+	pr_notice(KBUILD_MODNAME ": Exit\n");
+}
+
+module_init(numa_balancer_init);
+module_exit(numa_balancer_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Michael Wang <yun.wang@linux.alibaba.com>");
-- 
2.14.4.44.g2045bb6


* Re: [RFC PATCH 0/5] NUMA Balancer Suite
       [not found] ` <CAHCio2gEw4xyuoiurvwzvEiU8eLas+5ZLhzmqm1V2CJqvt+cyA@mail.gmail.com>
@ 2019-04-23  2:14   ` 王贇
  0 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-04-23  2:14 UTC (permalink / raw)
  To: 禹舟键
  Cc: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar,
	linux-kernel, linux-mm

On 2019/4/22 10:34 PM, 禹舟键 wrote:
> Hi, Michael
> I really want to know how you resolve the conflict between the numa balancer and the load balancer. Maybe you gained the numa bonus by migrating some tasks to the node where most of the cache is, but the cpu load balance was broken, so how do you handle that?

The trick here is to allow migration when load balancing keeps failing,
which means there are no better tasks to move.

However, since the idea here is scheduling cgroup workloads, it could be
hard to keep the load balanced, for example with only two cgroups running
different workloads, each placed on a different node.

That's why we made this a module rather than changing the kernel logic.
At this moment not every situation can benefit from the numa balancer,
but in some situations a balanced load brings no benefit while the numa
balancer does.

We are also improving the module to give it an overall view, so it will
know whether a decision breaks the load balance, but this introduces a
big lock and more per cpu/node counters; we need more testing to know
whether this is really helpful.

Anyway, if you have any scenario that might benefit, please give it a try
and let me know what the problems are, we'll try to address them :-)

Regards,
Michael Wang

> 
> Thanks
> Wind

* Re: [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic
  2019-04-22  2:11 ` [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic 王贇
@ 2019-04-23  8:44   ` Peter Zijlstra
  2019-04-23  9:14     ` 王贇
  2019-04-23  8:46   ` Peter Zijlstra
  2019-04-23  8:47   ` Peter Zijlstra
  2 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2019-04-23  8:44 UTC (permalink / raw)
  To: 王贇
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel, linux-mm

On Mon, Apr 22, 2019 at 10:11:24AM +0800, 王贇 wrote:
> +#ifdef CONFIG_NUMA_BALANCING
> +
> +enum memcg_numa_locality_interval {
> +	PERCENT_0_9,
> +	PERCENT_10_19,
> +	PERCENT_20_29,
> +	PERCENT_30_39,
> +	PERCENT_40_49,
> +	PERCENT_50_59,
> +	PERCENT_60_69,
> +	PERCENT_70_79,
> +	PERCENT_80_89,
> +	PERCENT_90_100,
> +	NR_NL_INTERVAL,
> +};
> +
> +struct memcg_stat_numa {
> +	u64 locality[NR_NL_INTERVAL];
> +};

If you make that 8 it fits a single cacheline. Do you really need the
additional resolution? If so, then 16 would be the next logical number
of buckets. 10 otoh makes no sense whatsoever.
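
The arithmetic behind this can be sketched in plain userspace C. This
is only an illustration: it assumes a 64-byte cache line (common on
x86-64, not universal), and the stat_N_buckets structs and the
cachelines_spanned() helper are hypothetical names, not kernel code.

```c
#include <stdint.h>

#define CACHELINE_BYTES 64	/* assumed cache line size */

/* Same layout as struct memcg_stat_numa, with different bucket counts. */
struct stat_8_buckets  { uint64_t locality[8];  };
struct stat_10_buckets { uint64_t locality[10]; };
struct stat_16_buckets { uint64_t locality[16]; };

/* Cache lines covered by an object of this size when line-aligned. */
static inline unsigned int cachelines_spanned(unsigned long size)
{
	return (size + CACHELINE_BYTES - 1) / CACHELINE_BYTES;
}
```

With these assumptions 8 buckets take 64 bytes and one line, while 10
buckets take 80 bytes and already spill into a second line, the same
two lines that 16 buckets would fill completely.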

* Re: [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic
  2019-04-22  2:11 ` [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic 王贇
  2019-04-23  8:44   ` Peter Zijlstra
@ 2019-04-23  8:46   ` Peter Zijlstra
  2019-04-23  9:32     ` 王贇
  2019-04-23  8:47   ` Peter Zijlstra
  2 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2019-04-23  8:46 UTC (permalink / raw)
  To: 王贇
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel, linux-mm

On Mon, Apr 22, 2019 at 10:11:24AM +0800, 王贇 wrote:
> +	 * 0 -- remote faults
> +	 * 1 -- local faults
> +	 * 2 -- page migration failure
> +	 * 3 -- remote page accessing after page migration
> +	 * 4 -- local page accessing after page migration

> @@ -2387,6 +2388,11 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
>  		memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
>  	}
> 
> +	p->numa_faults_locality[mem_node == numa_node_id() ? 4 : 3] += pages;
> +
> +	if (mem_node == NUMA_NO_NODE)
> +		return;

I'm confused about the meaning of 3 & 4. It says 'after page migration',
but 'every' access is after 'a' migration. Even more confusingly, you
account it even when we know the page has never been migrated.

So what are you really counting?

* Re: [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic
  2019-04-22  2:11 ` [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic 王贇
  2019-04-23  8:44   ` Peter Zijlstra
  2019-04-23  8:46   ` Peter Zijlstra
@ 2019-04-23  8:47   ` Peter Zijlstra
  2019-04-23  9:33     ` 王贇
  2 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2019-04-23  8:47 UTC (permalink / raw)
  To: 王贇
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel, linux-mm

On Mon, Apr 22, 2019 at 10:11:24AM +0800, 王贇 wrote:
> +	p->numa_faults_locality[mem_node == numa_node_id() ? 4 : 3] += pages;

Possibly: 3 + !!(mem_node == numa_node_id()), generates better code.
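
As a userspace sketch of the two forms side by side: here cpu_node stands in
for the value of numa_node_id(), and the function names are made up for
illustration. Both must use a comparison (==); whether the arithmetic form
really compiles to better code depends on the compiler and target.

```c
#include <assert.h>

/* Indices follow the numbering in the patch comment. */
enum { REMOTE_ACCESS = 3, LOCAL_ACCESS = 4 };

/* Original form from the patch. */
static int locality_idx_ternary(int mem_node, int cpu_node)
{
	return mem_node == cpu_node ? 4 : 3;
}

/* Suggested form: the comparison yields 0 or 1, added to the base index. */
static int locality_idx_arith(int mem_node, int cpu_node)
{
	int idx = 3 + !!(mem_node == cpu_node);

	assert(idx == REMOTE_ACCESS || idx == LOCAL_ACCESS);
	return idx;
}
```

Both functions return 4 when the page is on the node the task is running on
and 3 otherwise.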

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 2/5] numa: append per-node execution info in memory.numa_stat
  2019-04-22  2:12 ` [RFC PATCH 2/5] numa: append per-node execution info in memory.numa_stat 王贇
@ 2019-04-23  8:52   ` Peter Zijlstra
  2019-04-23  9:36     ` 王贇
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2019-04-23  8:52 UTC (permalink / raw)
  To: 王贇
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel, linux-mm

On Mon, Apr 22, 2019 at 10:12:20AM +0800, 王贇 wrote:
> This patch introduced numa execution information, to imply the numa
> efficiency.
> 
> By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we
> see new output line heading with 'exectime', like:
> 
>   exectime 24399843 27865444
> 
> which means the tasks of this cgroup executed 24399843 ticks on node 0,
> and 27865444 ticks on node 1.

I think we stopped reporting time in HZ to userspace a long long time
ago. Please don't do that.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 3/5] numa: introduce per-cgroup preferred numa node
  2019-04-22  2:13 ` [RFC PATCH 3/5] numa: introduce per-cgroup preferred numa node 王贇
@ 2019-04-23  8:55   ` Peter Zijlstra
  2019-04-23  9:41     ` 王贇
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2019-04-23  8:55 UTC (permalink / raw)
  To: 王贇
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel, linux-mm

On Mon, Apr 22, 2019 at 10:13:36AM +0800, 王贇 wrote:
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index af171ccb56a2..6513504373b4 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2031,6 +2031,10 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> 
>  	pol = get_vma_policy(vma, addr);
> 
> +	page = alloc_page_numa_preferred(gfp, order);
> +	if (page)
> +		goto out;
> +
>  	if (pol->mode == MPOL_INTERLEAVE) {
>  		unsigned nid;
> 

This I think is wrong, it overrides app specific mbind() requests.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 5/5] numa: numa balancer
  2019-04-22  2:21 ` [RFC PATCH 5/5] numa: numa balancer 王贇
@ 2019-04-23  9:05   ` Peter Zijlstra
  2019-04-23  9:59     ` 王贇
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2019-04-23  9:05 UTC (permalink / raw)
  To: 王贇
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel, linux-mm

On Mon, Apr 22, 2019 at 10:21:17AM +0800, 王贇 wrote:
> numa balancer is a module which will try to automatically adjust numa
> balancing stuff to gain numa bonus as much as possible.
> 
> For each memory cgroup, we process the work in two steps:
> 
> On stage 1 we check cgroup's exectime and memory topology to see
> if there could be a candidate for settled down, if we got one then
> move onto stage 2.
> 
> On stage 2 we try to settle down as much as possible by prefer the
> candidate node, if the node no longer suitable or locality keep
> downturn, we reset things and new round begin.
> 
> Decision made with find_candidate_nid(), should_prefer() and keep_prefer(),
> which try to pick a candidate node, see if allowed to prefer it and if
> keep doing the prefer.
> 
> Tested on the box with 96 cpus with sysbench-mysql-oltp_read_write
> testing, 4 mysqld instances created and attached to 4 cgroups, 4
> sysbench instances then created and attached to corresponding cgroup
> to test the mysql with oltp_read_write script, average eps show:
> 
> 				origin		balancer
> 4 instances each 12 threads	5241.08		5375.59		+2.50%
> 4 instances each 24 threads	7497.29		7820.73		+4.13%
> 4 instances each 36 threads	8985.44		9317.04		+3.55%
> 4 instances each 48 threads	9716.50		9982.60		+2.66%
> 
> Other benchmarks like dbench, pgbench, perf bench numa also tested, and
> with different parameters and number of instances/threads, most of
> the cases show bonus, some show acceptable regression, and some got no
> changes.
> 
> TODO:
>   * improve the logical to address the regression cases
>   * Find a way, maybe, to handle the page cache left on remote
>   * find more scenery which could gain benefit
> 
> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
> ---
>  drivers/Makefile             |   1 +
>  drivers/numa/Makefile        |   1 +
>  drivers/numa/numa_balancer.c | 715 +++++++++++++++++++++++++++++++++++++++++++

So I really think this is the wrong direction. Why introduce yet another
balancer thingy and not extend the existing numa balancer with the
additional information you got from the previous patches?

Also, this really should not be a module and not in drivers/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic
  2019-04-23  8:44   ` Peter Zijlstra
@ 2019-04-23  9:14     ` 王贇
  0 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-04-23  9:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel, linux-mm

On 2019/4/23 4:44 PM, Peter Zijlstra wrote:
> On Mon, Apr 22, 2019 at 10:11:24AM +0800, 王贇 wrote:
>> +#ifdef CONFIG_NUMA_BALANCING
>> +
>> +enum memcg_numa_locality_interval {
>> +	PERCENT_0_9,
>> +	PERCENT_10_19,
>> +	PERCENT_20_29,
>> +	PERCENT_30_39,
>> +	PERCENT_40_49,
>> +	PERCENT_50_59,
>> +	PERCENT_60_69,
>> +	PERCENT_70_79,
>> +	PERCENT_80_89,
>> +	PERCENT_90_100,
>> +	NR_NL_INTERVAL,
>> +};
>> +
>> +struct memcg_stat_numa {
>> +	u64 locality[NR_NL_INTERVAL];
>> +};

> If you make that 8 it fits a single cacheline. Do you really need the
> additional resolution? If so, then 16 would be the next logical amount
> of buckets. 10 otoh makes no sense what so ever.

Thanks for pointing that out :-) It doesn't have to be 10. I think we can
drop two buckets by merging the lowest three into PERCENT_0_29; things are
already bad enough once locality drops below 30%, and finer detail in that
range would not help.

Regards,
Michael Wang

> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic
  2019-04-23  8:46   ` Peter Zijlstra
@ 2019-04-23  9:32     ` 王贇
  0 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-04-23  9:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel, linux-mm

On 2019/4/23 4:46 PM, Peter Zijlstra wrote:
> On Mon, Apr 22, 2019 at 10:11:24AM +0800, 王贇 wrote:
>> +	 * 0 -- remote faults
>> +	 * 1 -- local faults
>> +	 * 2 -- page migration failure
>> +	 * 3 -- remote page accessing after page migration
>> +	 * 4 -- local page accessing after page migration
> 
>> @@ -2387,6 +2388,11 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
>>  		memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
>>  	}
>>
>> +	p->numa_faults_locality[mem_node == numa_node_id() ? 4 : 3] += pages;
>> +
>> +	if (mem_node == NUMA_NO_NODE)
>> +		return;
> 
> I'm confused on the meaning of 3 & 4. It says 'after page migration' but
> 'every' access is after 'a' migration. But even more confusingly, you
> even account it if we know the page has never been migrated.
> 
> So what are you really counting?

Here we are trying to count how many times a task accessed local or remote
pages. When no migration happened we still account it, since it is also one
access, either remote or local.

'after page migration' means the accounting needs to know the real page
position after the page fault: whether migration failed or succeeded, and
whether the page moved to the local/remote node or stayed untouched, we want
to know how many times a task accessed the page locally or remotely during
the numa balancing period.

Regards,
Michael Wang

> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic
  2019-04-23  8:47   ` Peter Zijlstra
@ 2019-04-23  9:33     ` 王贇
  2019-04-23  9:46       ` Peter Zijlstra
  0 siblings, 1 reply; 62+ messages in thread
From: 王贇 @ 2019-04-23  9:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel, linux-mm



On 2019/4/23 4:47 PM, Peter Zijlstra wrote:
> On Mon, Apr 22, 2019 at 10:11:24AM +0800, 王贇 wrote:
>> +	p->numa_faults_locality[mem_node == numa_node_id() ? 4 : 3] += pages;
> 
> Possibly: 3 + !!(mem_node == numa_node_id()), generates better code.

Sounds good~ will apply in next version.

Regards,
Michael Wang

> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 2/5] numa: append per-node execution info in memory.numa_stat
  2019-04-23  8:52   ` Peter Zijlstra
@ 2019-04-23  9:36     ` 王贇
  2019-04-23  9:46       ` Peter Zijlstra
  0 siblings, 1 reply; 62+ messages in thread
From: 王贇 @ 2019-04-23  9:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel, linux-mm



On 2019/4/23 4:52 PM, Peter Zijlstra wrote:
> On Mon, Apr 22, 2019 at 10:12:20AM +0800, 王贇 wrote:
>> This patch introduced numa execution information, to imply the numa
>> efficiency.
>>
>> By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we
>> see new output line heading with 'exectime', like:
>>
>>   exectime 24399843 27865444
>>
>> which means the tasks of this cgroup executed 24399843 ticks on node 0,
>> and 27865444 ticks on node 1.
> 
> I think we stopped reporting time in HZ to userspace a long long time
> ago. Please don't do that.

Ah I see, let's make it us maybe?

Regards,
Michael Wang

> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 3/5] numa: introduce per-cgroup preferred numa node
  2019-04-23  8:55   ` Peter Zijlstra
@ 2019-04-23  9:41     ` 王贇
  0 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-04-23  9:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel, linux-mm



On 2019/4/23 4:55 PM, Peter Zijlstra wrote:
> On Mon, Apr 22, 2019 at 10:13:36AM +0800, 王贇 wrote:
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index af171ccb56a2..6513504373b4 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -2031,6 +2031,10 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>>
>>  	pol = get_vma_policy(vma, addr);
>>
>> +	page = alloc_page_numa_preferred(gfp, order);
>> +	if (page)
>> +		goto out;
>> +
>>  	if (pol->mode == MPOL_INTERLEAVE) {
>>  		unsigned nid;
>>
> 
> This I think is wrong, it overrides app specific mbind() requests.

The original concern was that user apps inside the cgroup might handle the
memory policy wrongly and behave badly, but now I agree that we should not
override the policy; the admin will take the responsibility.
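
One possible ordering that respects explicit policies can be sketched in
userspace as follows. This is only an illustration under assumptions: the
enum, the NO_NODE constant and may_use_cgroup_preferred() are hypothetical
stand-ins, not kernel identifiers, and nothing here claims to be what a later
version of the patch does.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, hypothetical stand-ins for memory policy modes. */
enum policy_mode {
	POLICY_DEFAULT,
	POLICY_BIND,
	POLICY_INTERLEAVE,
	POLICY_PREFERRED,
};

#define NO_NODE (-1)

/*
 * Decide whether the cgroup's preferred node may be tried for an
 * allocation. An explicit policy set by the application always wins;
 * the cgroup hint is only considered under the default policy.
 */
static bool may_use_cgroup_preferred(enum policy_mode mode, int cgroup_nid)
{
	assert(cgroup_nid >= NO_NODE);
	if (cgroup_nid == NO_NODE)
		return false;
	return mode == POLICY_DEFAULT;
}
```

With this ordering an mbind()-style request (modelled here as POLICY_BIND or
POLICY_INTERLEAVE) is never overridden by the cgroup hint.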

Regards,
Michael Wang

> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic
  2019-04-23  9:33     ` 王贇
@ 2019-04-23  9:46       ` Peter Zijlstra
  0 siblings, 0 replies; 62+ messages in thread
From: Peter Zijlstra @ 2019-04-23  9:46 UTC (permalink / raw)
  To: 王贇
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel, linux-mm

On Tue, Apr 23, 2019 at 05:33:25PM +0800, 王贇 wrote:
> 
> 
> On 2019/4/23 4:47 PM, Peter Zijlstra wrote:
> > On Mon, Apr 22, 2019 at 10:11:24AM +0800, 王贇 wrote:
> >> +	p->numa_faults_locality[mem_node == numa_node_id() ? 4 : 3] += pages;
> > 
> > Possibly: 3 + !!(mem_node == numa_node_id()), generates better code.
> 
> Sounds good~ will apply in next version.

Well, check code gen first, of course.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 2/5] numa: append per-node execution info in memory.numa_stat
  2019-04-23  9:36     ` 王贇
@ 2019-04-23  9:46       ` Peter Zijlstra
  2019-04-23 10:01         ` 王贇
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2019-04-23  9:46 UTC (permalink / raw)
  To: 王贇
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel, linux-mm

On Tue, Apr 23, 2019 at 05:36:25PM +0800, 王贇 wrote:
> 
> 
> On 2019/4/23 4:52 PM, Peter Zijlstra wrote:
> > On Mon, Apr 22, 2019 at 10:12:20AM +0800, 王贇 wrote:
> >> This patch introduced numa execution information, to imply the numa
> >> efficiency.
> >>
> >> By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we
> >> see new output line heading with 'exectime', like:
> >>
> >>   exectime 24399843 27865444
> >>
> >> which means the tasks of this cgroup executed 24399843 ticks on node 0,
> >> and 27865444 ticks on node 1.
> > 
> > I think we stopped reporting time in HZ to userspace a long long time
> > ago. Please don't do that.
> 
> Ah I see, let's make it us maybe?

ms might be best I think.
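
To illustrate why raw ticks are ambiguous to userspace, here is a small
userspace sketch of the conversion. The parameter ticks_per_sec plays the
role of the kernel's HZ, which is a build-time choice; ticks_to_msecs() is a
made-up name, and inside the kernel jiffies_to_msecs() does this job.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a tick count to milliseconds for a given tick rate.
 * The division is split so that ticks * 1000 cannot overflow for
 * large tick counts.
 */
static uint64_t ticks_to_msecs(uint64_t ticks, unsigned int ticks_per_sec)
{
	assert(ticks_per_sec > 0);
	return (ticks / ticks_per_sec) * 1000 +
	       (ticks % ticks_per_sec) * 1000 / ticks_per_sec;
}
```

The 24399843 ticks from the example output are 24399843 ms with 1000 ticks
per second, but 97599372 ms with 250 ticks per second.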

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 5/5] numa: numa balancer
  2019-04-23  9:05   ` Peter Zijlstra
@ 2019-04-23  9:59     ` 王贇
  0 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-04-23  9:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel, linux-mm



On 2019/4/23 5:05 PM, Peter Zijlstra wrote:
[snip]
>>
>> TODO:
>>   * improve the logical to address the regression cases
>>   * Find a way, maybe, to handle the page cache left on remote
>>   * find more scenery which could gain benefit
>>
>> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
>> ---
>>  drivers/Makefile             |   1 +
>>  drivers/numa/Makefile        |   1 +
>>  drivers/numa/numa_balancer.c | 715 +++++++++++++++++++++++++++++++++++++++++++
> 
> So I really think this is the wrong direction. Why introduce yet another
> balancer thingy and not extend the existing numa balancer with the
> additional information you got from the previous patches?
> 
> Also, this really should not be a module and not in drivers/

The reason we present the idea as a module is that it's not suitable
for all situations; a module is cleaner and easier to deploy on
demand.

Besides, we assume some people may prefer their own logic for numa
balancing, so the module gives them an easy way to DIY.

But we don't insist on this style; once the logic is mature enough,
we can merge the idea into CFS, and a per-cgroup switch could be
enough :-P

Regards,
Michael Wang

> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 2/5] numa: append per-node execution info in memory.numa_stat
  2019-04-23  9:46       ` Peter Zijlstra
@ 2019-04-23 10:01         ` 王贇
  0 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-04-23 10:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel, linux-mm



On 2019/4/23 5:46 PM, Peter Zijlstra wrote:
> On Tue, Apr 23, 2019 at 05:36:25PM +0800, 王贇 wrote:
>>
>>
>> On 2019/4/23 4:52 PM, Peter Zijlstra wrote:
>>> On Mon, Apr 22, 2019 at 10:12:20AM +0800, 王贇 wrote:
>>>> This patch introduced numa execution information, to imply the numa
>>>> efficiency.
>>>>
>>>> By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we
>>>> see new output line heading with 'exectime', like:
>>>>
>>>>   exectime 24399843 27865444
>>>>
>>>> which means the tasks of this cgroup executed 24399843 ticks on node 0,
>>>> and 27865444 ticks on node 1.
>>>
>>> I think we stopped reporting time in HZ to userspace a long long time
>>> ago. Please don't do that.
>>
>> Ah I see, let's make it us maybe?
> 
> ms might be best I think.

Will be in next version.

Regards,
Michael Wang

> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 0/4] per cpu cgroup numa suite
  2019-04-22  2:10 [RFC PATCH 0/5] NUMA Balancer Suite 王贇
                   ` (5 preceding siblings ...)
       [not found] ` <CAHCio2gEw4xyuoiurvwzvEiU8eLas+5ZLhzmqm1V2CJqvt+cyA@mail.gmail.com>
@ 2019-07-03  3:26 ` 王贇
  2019-07-03  3:28   ` [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic 王贇
                     ` (5 more replies)
  6 siblings, 6 replies; 62+ messages in thread
From: 王贇 @ 2019-07-03  3:26 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm, mcgrof, keescook, linux-fsdevel, cgroups

While stress-testing the numa code, we found problems like:

  * missing per-cgroup information about the per-node execution status
  * missing per-cgroup information about the numa locality

That is, when we have a cpu cgroup running a bunch of tasks, there is no
good way to tell how its tasks are dealing with numa.

The first two patches try to fill in the missing pieces, but more
problems appeared after monitoring these statistics:

  * tasks not always running on the preferred numa node
  * tasks from the same cgroup running on different nodes

The task numa group handler always checks whether tasks are sharing pages
and tries to pack them into a single numa group, so they have a chance to
settle down on the same node, but this fails in some cases:

  * workloads share page caches rather than share mappings
  * workloads get too many wakeups across nodes

Since page caches are not traced by numa balancing, there is no way to
detect this kind of relationship, and when there are too many wakeups,
a task is dragged away from the preferred node and then migrated back by
numa balancing, repeatedly.

The third patch tries to address the first issue: we can now give the
kernel a hint about the relationship between tasks, and pack them into a
single numa group.

The fourth patch introduces numa cling, which tries to address the wakeup
issue: we now try to keep a task on its preferred node on wakeup in the
fast path. To address the risk of imbalance, we monitor the numa
migration failure ratio and pause numa cling when it reaches the
specified degree.
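
The pause condition for numa cling can be pictured with the following
userspace sketch. It is an assumption-laden illustration, not the code from
patch 4: numa_cling_allowed() and the max_fail_pct tunable are hypothetical
names, and the counters are assumed to cover one monitoring period.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return true when wakeup may keep clinging to the preferred node.
 * failed and total are numa migration attempts in the last period;
 * max_fail_pct is a hypothetical threshold in percent (0..100).
 * Cling is paused once the failure ratio reaches the threshold.
 */
static bool numa_cling_allowed(unsigned long failed, unsigned long total,
			       unsigned int max_fail_pct)
{
	assert(failed <= total);
	assert(max_fail_pct <= 100);
	if (total == 0)
		return true;	/* no evidence of imbalance yet */
	/* failed / total < max_fail_pct / 100, without division */
	return failed * 100 < (unsigned long)max_fail_pct * total;
}
```

For example, with a 30% threshold, 2 failures out of 10 attempts keep cling
enabled, while 3 out of 10 pause it.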

Michael Wang (4):
  numa: introduce per-cgroup numa balancing locality statistic
  numa: append per-node execution info in memory.numa_stat
  numa: introduce numa group per task group
  numa: introduce numa cling feature

 include/linux/memcontrol.h   |  37 ++++
 include/linux/sched.h        |   8 +-
 include/linux/sched/sysctl.h |   3 +
 kernel/sched/core.c          |  37 ++++
 kernel/sched/debug.c         |   7 +
 kernel/sched/fair.c          | 455 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h         |  14 ++
 kernel/sysctl.c              |   9 +
 mm/memcontrol.c              |  66 +++++++
 9 files changed, 628 insertions(+), 8 deletions(-)

-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic
  2019-07-03  3:26 ` [PATCH 0/4] per cpu cgroup numa suite 王贇
@ 2019-07-03  3:28   ` 王贇
  2019-07-11 13:43     ` Peter Zijlstra
  2019-07-11 13:47     ` Peter Zijlstra
  2019-07-03  3:29   ` [PATCH 2/4] numa: append per-node execution info in memory.numa_stat 王贇
                     ` (4 subsequent siblings)
  5 siblings, 2 replies; 62+ messages in thread
From: 王贇 @ 2019-07-03  3:28 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm, mcgrof, keescook, linux-fsdevel, cgroups

This patch introduces the numa locality statistic, which tries to
indicate the numa balancing efficiency per memory cgroup.

By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we
see a new output line starting with 'locality'; the format is:

  locality 0%~29% 30%~39% 40%~49% 50%~59% 60%~69% 70%~79% 80%~89% 90%~100%

Each interval is the percentage of local page accesses during a task's
last numa balancing period, which we call the numa balancing locality.

Each number is how many milliseconds the tasks inside the cgroup have
been running with that locality, for example:

  locality 15393 21259 13023 44461 21247 17012 28496 145402

The first number means that this cgroup has some tasks with 0~29%
locality which executed for 15393 ms.

By monitoring the increments, we can check whether the workload of a
particular cgroup is doing well with numa; when most of the tasks are
running with locality 0~29%, something is wrong with your numa policy.
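
How access counts map to one of the eight intervals can be sketched in
userspace as below. The enum mirrors the one added by this patch, while
locality_bucket() is a made-up helper; it keeps the index signed so that
0~19% locality (which gives -2 or -1 before clamping) lands in the first
bucket, and it assumes local * 10 does not overflow.

```c
#include <assert.h>

enum locality_interval {
	PERCENT_0_29,
	PERCENT_30_39,
	PERCENT_40_49,
	PERCENT_50_59,
	PERCENT_60_69,
	PERCENT_70_79,
	PERCENT_80_89,
	PERCENT_90_100,
	NR_NL_INTERVAL,
};

/*
 * Map local/remote access counts to a bucket index, or return -1
 * when there were no accesses at all.
 */
static int locality_bucket(unsigned long local, unsigned long remote)
{
	unsigned long total = local + remote;
	int idx;

	if (!total)
		return -1;

	/* Tenths of local accesses, shifted so that 30~39% becomes 1. */
	idx = (int)((local * 10) / total) - 2;
	if (idx < PERCENT_0_29)
		idx = PERCENT_0_29;
	else if (idx >= NR_NL_INTERVAL)
		idx = NR_NL_INTERVAL - 1;

	assert(idx >= 0 && idx < NR_NL_INTERVAL);
	return idx;
}
```

A task with 5 local and 5 remote accesses falls into PERCENT_50_59, and a
task with only local accesses is clamped into PERCENT_90_100.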

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 include/linux/memcontrol.h | 36 +++++++++++++++++++++++++++++++
 include/linux/sched.h      |  8 ++++++-
 kernel/sched/debug.c       |  7 ++++++
 kernel/sched/fair.c        |  9 ++++++++
 mm/memcontrol.c            | 53 ++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 112 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2cbce1fe7780..0a30d14c9f43 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -174,6 +174,25 @@ enum memcg_kmem_state {
 	KMEM_ONLINE,
 };

+#ifdef CONFIG_NUMA_BALANCING
+
+enum memcg_numa_locality_interval {
+	PERCENT_0_29,
+	PERCENT_30_39,
+	PERCENT_40_49,
+	PERCENT_50_59,
+	PERCENT_60_69,
+	PERCENT_70_79,
+	PERCENT_80_89,
+	PERCENT_90_100,
+	NR_NL_INTERVAL,
+};
+
+struct memcg_stat_numa {
+	u64 locality[NR_NL_INTERVAL];
+};
+
+#endif
 #if defined(CONFIG_SMP)
 struct memcg_padding {
 	char x[0];
@@ -313,6 +332,10 @@ struct mem_cgroup {
 	struct list_head event_list;
 	spinlock_t event_list_lock;

+#ifdef CONFIG_NUMA_BALANCING
+	struct memcg_stat_numa __percpu *stat_numa;
+#endif
+
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
 };
@@ -795,6 +818,14 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 void mem_cgroup_split_huge_fixup(struct page *head);
 #endif

+#ifdef CONFIG_NUMA_BALANCING
+extern void memcg_stat_numa_update(struct task_struct *p);
+#else
+static inline void memcg_stat_numa_update(struct task_struct *p)
+{
+}
+#endif
+
 #else /* CONFIG_MEMCG */

 #define MEM_CGROUP_ID_SHIFT	0
@@ -1131,6 +1162,11 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void memcg_stat_numa_update(struct task_struct *p)
+{
+}
+
 #endif /* CONFIG_MEMCG */

 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 907808f1acc5..eb26098de6ea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1117,8 +1117,14 @@ struct task_struct {
 	 * scan window were remote/local or failed to migrate. The task scan
 	 * period is adapted based on the locality of the faults with different
 	 * weights depending on whether they were shared or private faults
+	 *
+	 * 0 -- remote faults
+	 * 1 -- local faults
+	 * 2 -- page migration failure
+	 * 3 -- remote page accessing
+	 * 4 -- local page accessing
 	 */
-	unsigned long			numa_faults_locality[3];
+	unsigned long			numa_faults_locality[5];

 	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f7e4579e746c..473e6b7a1b8d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -849,6 +849,13 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
 	SEQ_printf(m, "current_node=%d, numa_group_id=%d\n",
 			task_node(p), task_numa_group_id(p));
 	show_numa_stats(p, m);
+	SEQ_printf(m, "faults_locality local=%lu remote=%lu failed=%lu ",
+			p->numa_faults_locality[1],
+			p->numa_faults_locality[0],
+			p->numa_faults_locality[2]);
+	SEQ_printf(m, "lhit=%lu rhit=%lu\n",
+			p->numa_faults_locality[4],
+			p->numa_faults_locality[3]);
 	mpol_put(pol);
 #endif
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 036be95a87e9..b32304817eeb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -23,6 +23,7 @@
 #include "sched.h"

 #include <trace/events/sched.h>
+#include <linux/memcontrol.h>

 /*
  * Targeted preemption latency for CPU-bound tasks:
@@ -2449,6 +2450,12 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
 	p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
 	p->numa_faults_locality[local] += pages;
+	/*
+	 * We want to have the real local/remote page access statistic
+	 * here, so use 'mem_node' which is the real residential node of
+	 * page after migrate_misplaced_page().
+	 */
+	p->numa_faults_locality[3 + !!(mem_node == numa_node_id())] += pages;
 }

 static void reset_ptenuma_scan(struct task_struct *p)
@@ -2625,6 +2632,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
 		return;

+	memcg_stat_numa_update(curr);
+
 	/*
 	 * Using runtime rather than walltime has the dual advantage that
 	 * we (mostly) drive the selection from busy threads and that the
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b3f67a6b6527..2edf3f5ac4b9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -58,6 +58,7 @@
 #include <linux/file.h>
 #include <linux/tracehook.h>
 #include <linux/seq_buf.h>
+#include <linux/cpuset.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -3562,10 +3563,53 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 		seq_putc(m, '\n');
 	}

+#ifdef CONFIG_NUMA_BALANCING
+	seq_puts(m, "locality");
+	for (nr = 0; nr < NR_NL_INTERVAL; nr++) {
+		int cpu;
+		u64 sum = 0;
+
+		for_each_possible_cpu(cpu)
+			sum += per_cpu(memcg->stat_numa->locality[nr], cpu);
+
+		seq_printf(m, " %u", jiffies_to_msecs(sum));
+	}
+	seq_putc(m, '\n');
+#endif
+
 	return 0;
 }
 #endif /* CONFIG_NUMA */

+#ifdef CONFIG_NUMA_BALANCING
+
+void memcg_stat_numa_update(struct task_struct *p)
+{
+	struct mem_cgroup *memcg;
+	unsigned long remote = p->numa_faults_locality[3];
+	unsigned long local = p->numa_faults_locality[4];
+	long idx = -1;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	if (remote || local) {
+		idx = ((local * 10) / (remote + local)) - 2;
+		/* 0~29% in one slot for cache align */
+		if (idx < PERCENT_0_29)
+			idx = PERCENT_0_29;
+		else if (idx >= NR_NL_INTERVAL)
+			idx = NR_NL_INTERVAL - 1;
+	}
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(p);
+	if (idx != -1)
+		this_cpu_inc(memcg->stat_numa->locality[idx]);
+	rcu_read_unlock();
+}
+#endif
+
 static const unsigned int memcg1_stats[] = {
 	MEMCG_CACHE,
 	MEMCG_RSS,
@@ -4641,6 +4685,9 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)

 	for_each_node(node)
 		free_mem_cgroup_per_node_info(memcg, node);
+#ifdef CONFIG_NUMA_BALANCING
+	free_percpu(memcg->stat_numa);
+#endif
 	free_percpu(memcg->vmstats_percpu);
 	free_percpu(memcg->vmstats_local);
 	kfree(memcg);
@@ -4679,6 +4726,12 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	if (!memcg->vmstats_percpu)
 		goto fail;

+#ifdef CONFIG_NUMA_BALANCING
+	memcg->stat_numa = alloc_percpu(struct memcg_stat_numa);
+	if (!memcg->stat_numa)
+		goto fail;
+#endif
+
 	for_each_node(node)
 		if (alloc_mem_cgroup_per_node_info(memcg, node))
 			goto fail;
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 2/4] numa: append per-node execution info in memory.numa_stat
  2019-07-03  3:26 ` [PATCH 0/4] per cpu cgroup numa suite 王贇
  2019-07-03  3:28   ` [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic 王贇
@ 2019-07-03  3:29   ` 王贇
  2019-07-11 13:45     ` Peter Zijlstra
  2019-07-03  3:32   ` [PATCH 3/4] numa: introduce numa group per task group 王贇
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 62+ messages in thread
From: 王贇 @ 2019-07-03  3:29 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm, mcgrof, keescook, linux-fsdevel, cgroups

This patch introduces numa execution information, to indicate the numa
efficiency.

By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we
see a new output line starting with 'exectime', like:

  exectime 311900 407166

which means the tasks of this cgroup executed for 311900 ms on node 0,
and 407166 ms on node 1.

Combined with the memory node info, we can estimate the numa efficiency,
for example if the node memory info is:

  total=206892 N0=21933 N1=185171

By monitoring the increments, if the topology stays this way and the
locality is not good, it implies that numa balancing can't migrate the
memory accessed by tasks on node 0 from node 1 to node 0, or that the
tasks can't migrate to node 1 for some reason; you may then consider
binding the cgroup to the cpus of node 1.
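
The comparison described above can be made concrete with a small userspace
sketch that turns the per-node numbers into percentage shares. share_pct()
is a made-up helper, and the memory share is computed from the N0 and N1
values only.

```c
#include <assert.h>
#include <stdint.h>

/* Integer percentage of part within (part + rest); 0 when both are 0. */
static unsigned int share_pct(uint64_t part, uint64_t rest)
{
	uint64_t total = part + rest;

	assert(total >= part);	/* guard against wrap-around */
	if (!total)
		return 0;
	return (unsigned int)(part * 100 / total);
}
```

With the example values, share_pct(311900, 407166) shows node 0 taking about
43% of the execution time, while share_pct(21933, 185171) shows it holding
only about 10% of the memory, which is the kind of mismatch the paragraph
above refers to.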

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c            | 13 +++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0a30d14c9f43..deeca9db17d8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -190,6 +190,7 @@ enum memcg_numa_locality_interval {

 struct memcg_stat_numa {
 	u64 locality[NR_NL_INTERVAL];
+	u64 exectime;
 };

 #endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2edf3f5ac4b9..d5f48365770f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3575,6 +3575,18 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 		seq_printf(m, " %u", jiffies_to_msecs(sum));
 	}
 	seq_putc(m, '\n');
+
+	seq_puts(m, "exectime");
+	for_each_online_node(nr) {
+		int cpu;
+		u64 sum = 0;
+
+		for_each_cpu(cpu, cpumask_of_node(nr))
+			sum += per_cpu(memcg->stat_numa->exectime, cpu);
+
+		seq_printf(m, " %u", jiffies_to_msecs(sum));
+	}
+	seq_putc(m, '\n');
 #endif

 	return 0;
@@ -3606,6 +3618,7 @@ void memcg_stat_numa_update(struct task_struct *p)
 	memcg = mem_cgroup_from_task(p);
 	if (idx != -1)
 		this_cpu_inc(memcg->stat_numa->locality[idx]);
+	this_cpu_inc(memcg->stat_numa->exectime);
 	rcu_read_unlock();
 }
 #endif
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 3/4] numa: introduce numa group per task group
  2019-07-03  3:26 ` [PATCH 0/4] per cpu cgroup numa suite 王贇
  2019-07-03  3:28   ` [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic 王贇
  2019-07-03  3:29   ` [PATCH 2/4] numa: append per-node execution info in memory.numa_stat 王贇
@ 2019-07-03  3:32   ` 王贇
  2019-07-11 14:10     ` Peter Zijlstra
  2019-07-03  3:34   ` [PATCH 4/4] numa: introduce numa cling feature 王贇
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 62+ messages in thread
From: 王贇 @ 2019-07-03  3:32 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm, mcgrof, keescook, linux-fsdevel, cgroups

By tracing numa page faults, we recognize tasks sharing the same page,
and try to pack them together into a single numa group.

However, when two tasks share lots of cache pages but not many
anonymous pages, they have no chance to join the same group, since
numa balancing does not trace cache pages.

Since tracing cache pages costs too much, we could use some hints from
userland instead, and the cpu cgroup could be a good one.

This patch introduces a new entry 'numa_group' for the cpu cgroup; by
echoing a non-zero value into the entry, we can now force all the tasks
of this cgroup to join the same numa group, which serves the task group.

In this way tasks are more likely to settle down on the same node, to
share a closer cpu cache and benefit from NUMA on both file and
anonymous pages.

Besides, when multiple cgroups enable numa group, they will be able to
exchange task locations by utilizing numa migration; in this way they
can settle down on a single node without breaking load balance.

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 kernel/sched/core.c  |  37 +++++++++++
 kernel/sched/fair.c  | 175 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  14 +++++
 3 files changed, 225 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fa43ce3962e7..148c231a4309 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6790,6 +6790,8 @@ void sched_offline_group(struct task_group *tg)
 {
 	unsigned long flags;

+	update_tg_numa_group(tg, false);
+
 	/* End participation in shares distribution: */
 	unregister_fair_sched_group(tg);

@@ -7277,6 +7279,34 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */

+#ifdef CONFIG_NUMA_BALANCING
+static DEFINE_MUTEX(numa_mutex);
+
+static int cpu_numa_group_show(struct seq_file *sf, void *v)
+{
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	mutex_lock(&numa_mutex);
+	show_tg_numa_group(tg, sf);
+	mutex_unlock(&numa_mutex);
+
+	return 0;
+}
+
+static int cpu_numa_group_write_s64(struct cgroup_subsys_state *css,
+				struct cftype *cft, s64 numa_group)
+{
+	int ret;
+	struct task_group *tg = css_tg(css);
+
+	mutex_lock(&numa_mutex);
+	ret = update_tg_numa_group(tg, numa_group);
+	mutex_unlock(&numa_mutex);
+
+	return ret;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -7312,6 +7342,13 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_rt_period_read_uint,
 		.write_u64 = cpu_rt_period_write_uint,
 	},
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	{
+		.name = "numa_group",
+		.write_s64 = cpu_numa_group_write_s64,
+		.seq_show = cpu_numa_group_show,
+	},
 #endif
 	{ }	/* Terminate */
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b32304817eeb..6cf9c9c61258 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1074,6 +1074,7 @@ struct numa_group {
 	int nr_tasks;
 	pid_t gid;
 	int active_nodes;
+	bool evacuate;

 	struct rcu_head rcu;
 	unsigned long total_faults;
@@ -2247,6 +2248,176 @@ static inline void put_numa_group(struct numa_group *grp)
 		kfree_rcu(grp, rcu);
 }

+void show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
+{
+	int nid;
+	struct numa_group *ng = tg->numa_group;
+
+	if (!ng) {
+		seq_puts(sf, "disabled\n");
+		return;
+	}
+
+	seq_printf(sf, "id %d nr_tasks %d active_nodes %d\n",
+		   ng->gid, ng->nr_tasks, ng->active_nodes);
+
+	for_each_online_node(nid) {
+		int f_idx = task_faults_idx(NUMA_MEM, nid, 0);
+		int pf_idx = task_faults_idx(NUMA_MEM, nid, 1);
+
+		seq_printf(sf, "node %d ", nid);
+
+		seq_printf(sf, "mem_private %lu mem_shared %lu ",
+			   ng->faults[f_idx], ng->faults[pf_idx]);
+
+		seq_printf(sf, "cpu_private %lu cpu_shared %lu\n",
+			   ng->faults_cpu[f_idx], ng->faults_cpu[pf_idx]);
+	}
+}
+
+int update_tg_numa_group(struct task_group *tg, bool numa_group)
+{
+	struct numa_group *ng = tg->numa_group;
+
+	/* if no change then do nothing */
+	if ((ng != NULL) == numa_group)
+		return 0;
+
+	if (ng) {
+		/* put and evacuate tg's numa group */
+		rcu_assign_pointer(tg->numa_group, NULL);
+		ng->evacuate = true;
+		put_numa_group(ng);
+	} else {
+		unsigned int size = sizeof(struct numa_group) +
+				    4*nr_node_ids*sizeof(unsigned long);
+
+		ng = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+		if (!ng)
+			return -ENOMEM;
+
+		refcount_set(&ng->refcount, 1);
+		spin_lock_init(&ng->lock);
+		ng->faults_cpu = ng->faults + NR_NUMA_HINT_FAULT_TYPES *
+						nr_node_ids;
+		/* now make tasks see and join */
+		rcu_assign_pointer(tg->numa_group, ng);
+	}
+
+	return 0;
+}
+
+static bool tg_numa_group(struct task_struct *p)
+{
+	int i;
+	struct task_group *tg;
+	struct numa_group *grp, *my_grp;
+
+	rcu_read_lock();
+
+	tg = task_group(p);
+	if (!tg)
+		goto no_join;
+
+	grp = rcu_dereference(tg->numa_group);
+	my_grp = rcu_dereference(p->numa_group);
+
+	if (!grp)
+		goto no_join;
+
+	if (grp == my_grp) {
+		if (!grp->evacuate)
+			goto joined;
+
+		/*
+		 * Evacuate task from tg's numa group
+		 */
+		rcu_read_unlock();
+
+		spin_lock_irq(&grp->lock);
+
+		for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
+			grp->faults[i] -= p->numa_faults[i];
+
+		grp->total_faults -= p->total_numa_faults;
+		grp->nr_tasks--;
+
+		spin_unlock_irq(&grp->lock);
+
+		rcu_assign_pointer(p->numa_group, NULL);
+
+		put_numa_group(grp);
+
+		return false;
+	}
+
+	if (!get_numa_group(grp))
+		goto no_join;
+
+	rcu_read_unlock();
+
+	/*
+	 * Just join tg's numa group
+	 */
+	if (!my_grp) {
+		spin_lock_irq(&grp->lock);
+
+		if (refcount_read(&grp->refcount) == 2) {
+			grp->gid = p->pid;
+			grp->active_nodes = 1;
+			grp->max_faults_cpu = 0;
+		}
+
+		for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
+			grp->faults[i] += p->numa_faults[i];
+
+		grp->total_faults += p->total_numa_faults;
+		grp->nr_tasks++;
+
+		spin_unlock_irq(&grp->lock);
+		rcu_assign_pointer(p->numa_group, grp);
+
+		return true;
+	}
+
+	/*
+	 * Switch from the task's numa group to the tg's
+	 */
+	double_lock_irq(&my_grp->lock, &grp->lock);
+
+	if (refcount_read(&grp->refcount) == 2) {
+		grp->gid = p->pid;
+		grp->active_nodes = 1;
+		grp->max_faults_cpu = 0;
+	}
+
+	for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
+		my_grp->faults[i] -= p->numa_faults[i];
+		grp->faults[i] += p->numa_faults[i];
+	}
+
+	my_grp->total_faults -= p->total_numa_faults;
+	grp->total_faults += p->total_numa_faults;
+
+	my_grp->nr_tasks--;
+	grp->nr_tasks++;
+
+	spin_unlock(&my_grp->lock);
+	spin_unlock_irq(&grp->lock);
+
+	rcu_assign_pointer(p->numa_group, grp);
+
+	put_numa_group(my_grp);
+	return true;
+
+joined:
+	rcu_read_unlock();
+	return true;
+no_join:
+	rcu_read_unlock();
+	return false;
+}
+
 static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 			int *priv)
 {
@@ -2417,7 +2588,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 		priv = 1;
 	} else {
 		priv = cpupid_match_pid(p, last_cpupid);
-		if (!priv && !(flags & TNF_NO_GROUP))
+		if (tg_numa_group(p))
+			priv = (flags & TNF_SHARED) ? 0 : priv;
+		else if (!priv && !(flags & TNF_NO_GROUP))
 			task_numa_group(p, last_cpupid, flags, &priv);
 	}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 802b1f3405f2..b5bc4d804e2d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -393,6 +393,10 @@ struct task_group {
 #endif

 	struct cfs_bandwidth	cfs_bandwidth;
+
+#ifdef CONFIG_NUMA_BALANCING
+	void *numa_group;
+#endif
 };

 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1286,11 +1290,21 @@ extern int migrate_task_to(struct task_struct *p, int cpu);
 extern int migrate_swap(struct task_struct *p, struct task_struct *t,
 			int cpu, int scpu);
 extern void init_numa_balancing(unsigned long clone_flags, struct task_struct *p);
+extern void show_tg_numa_group(struct task_group *tg, struct seq_file *sf);
+extern int update_tg_numa_group(struct task_group *tg, bool numa_group);
 #else
 static inline void
 init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 {
 }
+static inline void
+show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
+{
+}
+static inline int update_tg_numa_group(struct task_group *tg, bool numa_group)
+{
+	return 0;
+}
 #endif /* CONFIG_NUMA_BALANCING */

 #ifdef CONFIG_SMP
-- 
2.14.4.44.g2045bb6



* [PATCH 4/4] numa: introduce numa cling feature
  2019-07-03  3:26 ` [PATCH 0/4] per cpu cgroup numa suite 王贇
                     ` (2 preceding siblings ...)
  2019-07-03  3:32   ` [PATCH 3/4] numa: introduce numa group per task group 王贇
@ 2019-07-03  3:34   ` 王贇
  2019-07-08  2:25     ` [PATCH v2 " 王贇
  2019-07-11 14:27     ` [PATCH " Peter Zijlstra
  2019-07-11  9:00   ` [PATCH 0/4] per cgroup numa suite 王贇
  2019-07-16  3:38   ` [PATCH v2 0/4] per-cgroup " 王贇
  5 siblings, 2 replies; 62+ messages in thread
From: 王贇 @ 2019-07-03  3:34 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm, mcgrof, keescook, linux-fsdevel, cgroups

Although we put a lot of effort into settling a task down on a
particular node, there are still chances for a task to leave its
preferred node: wakeup, numa swap migration or load balance.

When cpu cgroups are used in shared mode, all the workloads see all the
cpus, and this can be really bad, especially when there are many fast
wakeups. Although we can now numa group the tasks, they won't really
stay on the same node. For example, with numa groups ng_A, ng_B, ng_C
and ng_D, the result is very likely to be:

	CPU Usage:
		Node 0		Node 1
		ng_A(600%)	ng_A(400%)
		ng_B(400%)	ng_B(600%)
		ng_C(400%)	ng_C(600%)
		ng_D(600%)	ng_D(400%)

	Memory Ratio:
		Node 0		Node 1
		ng_A(60%)	ng_A(40%)
		ng_B(40%)	ng_B(60%)
		ng_C(40%)	ng_C(60%)
		ng_D(60%)	ng_D(40%)

Locality won't be too bad, but it is far from the best situation: we
want a numa group to settle down thoroughly on a particular node, with
everything balanced.

Thus we introduce numa cling, which tries to prevent tasks from leaving
the preferred node on the wakeup fast path.
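
The wakeup decision can be modelled in userspace roughly as follows.
This is a sketch of the conditions checked by task_numa_cling() in the
diff below, not the kernel code itself: should_cling() and NO_NODE are
names made up for the illustration, and the busy state is passed in as
a plain flag.

```c
#include <stdbool.h>

#define NO_NODE	(-1)	/* stand-in for the kernel's NUMA_NO_NODE */

/*
 * Stay on the previous CPU's node (snid) instead of moving to the
 * wake-affine target's node (dnid) only when cling is enabled, snid is
 * the task's preferred node (pnid), dnid is not, and snid has not been
 * marked busy.
 */
static bool should_cling(bool cling_enabled, int pnid, int snid, int dnid,
			 bool snid_busy)
{
	if (!cling_enabled || pnid == NO_NODE)
		return false;

	if (dnid == pnid || snid != pnid)
		return false;

	/* Never cling to a busy node */
	return !snid_busy;
}
```

A task that prefers node 1, last ran on node 1 and is being pulled to
node 0 clings; the same task stops clinging once node 1 is marked busy.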

This helps settle the workloads down thoroughly on a single node, but
when multiple numa groups try to settle down on the same node, imbalance
can happen.

For example, with numa groups ng_A, ng_B, ng_C and ng_D, it may result
in a situation like:

CPU Usage:
	Node 0		Node 1
	ng_A(1000%)	ng_B(1000%)
	ng_C(400%)	ng_C(600%)
	ng_D(400%)	ng_D(600%)

Memory Ratio:
	Node 0		Node 1
	ng_A(100%)	ng_B(100%)
	ng_C(10%)	ng_C(90%)
	ng_D(10%)	ng_D(90%)

This is because once ng_C and ng_D have most of their memory on node 1,
task_x of ng_C on node 0 will try to do a numa swap migration with
task_y of ng_D on node 1, as long as the load stays balanced. The result
is that task_x ends up on node 1 and task_y on node 0, while both of
them prefer node 1.

Now when other tasks of ng_D on node 1 wake up task_y, task_y will very
likely go back to node 1, and since numa cling is enabled, it will stay
on node 1 even though the load is unbalanced. This can happen
frequently, so more and more tasks will prefer node 1 and make it busy.

So the key point is to stop doing numa cling when the load starts to
become unbalanced.

We achieve this by monitoring the migration failure ratio. In the
scenario above, too many tasks prefer node 1 and keep migrating to it,
and the load imbalance leads to migration failures. When the failure
ratio rises above the specified degree, we pause the cling and try to
resettle the workloads on a better node by stopping tasks from
preferring the busy node. This finally gives us a result like:

CPU Usage:
	Node 0		Node 1
	ng_A(1000%)	ng_B(1000%)
	ng_C(1000%)	ng_D(1000%)

Memory Ratio:
	Node 0		Node 1
	ng_A(100%)	ng_B(100%)
	ng_C(100%)	ng_D(100%)

Now we achieve the best locality and the maximum hot cache benefit.
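
The failure-ratio bookkeeping described above can be sketched in
userspace as below. It mirrors the arithmetic of update_migrate_stat(),
cling_timer_func() and busy_node() from the diff, but struct mstat, the
function names and the fixed NR_NODES are simplifications assumed for
the example; locking and the timer itself are left out.

```c
#include <stdbool.h>

#define MIGRATE_WEIGHT	1024	/* each migration attempt counts as 1024 */
#define NR_NODES	2	/* assumed node count for this sketch */

struct mstat {
	unsigned long failure[NR_NODES];	/* scaled failed attempts */
	unsigned long total[NR_NODES];		/* scaled attempts */
	unsigned long ratio[NR_NODES];		/* failure percentage */
	int busiest_nid;
};

/* Record one migration attempt toward nid and refresh its ratio. */
static void record_migration(struct mstat *s, int nid, bool failed)
{
	if (failed)
		s->failure[nid] += MIGRATE_WEIGHT;
	s->total[nid] += MIGRATE_WEIGHT;
	s->ratio[nid] = s->failure[nid] * 100 / (s->total[nid] + 1);
}

/* Timer tick: halve the counters, pick the node with the most failures. */
static void decay(struct mstat *s)
{
	unsigned long max_failure = 0;
	int nid;

	for (nid = 0; nid < NR_NODES; nid++) {
		s->failure[nid] /= 2;
		s->total[nid] /= 2;
		s->ratio[nid] = s->failure[nid] * 100 / (s->total[nid] + 1);

		if (s->failure[nid] > max_failure) {
			s->busiest_nid = nid;
			max_failure = s->failure[nid];
		}
	}
}

/* Busy: at least one scaled failure, busiest node, ratio above degree. */
static bool node_busy(const struct mstat *s, int nid, unsigned int degree)
{
	if (s->failure[nid] < MIGRATE_WEIGHT)
		return false;
	if (s->busiest_nid != nid)
		return false;
	return s->ratio[nid] > degree;
}
```

With 3 failures out of 10 attempts toward node 1, the ratio is 29, and
after the next tick selects node 1 as busiest it counts as busy for the
default degree of 20. One more tick without new failures drops the
failure count below 1024 and the node is no longer busy.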

Tested on a 2 node box with 96 cpus using sysbench-mysql-oltp_read_write:
X mysqld instances were created and attached to X cgroups, then X
sysbench instances were created and attached to the corresponding
cgroups to test mysql with the oltp_read_write script for 20 minutes.
The average eps:

				origin		ng + cling
4 instances each 24 threads	7545.28		7790.49		+3.25%
4 instances each 48 threads	9359.36		9832.30		+5.05%
4 instances each 72 threads	9602.88		10196.95	+6.19%

8 instances each 24 threads	4478.82		4508.82		+0.67%
8 instances each 48 threads	5514.90		5689.93		+3.17%
8 instances each 72 threads	5582.19		5741.33		+2.85%

Also tested with perf-bench-numa, dbench, sysbench-memory and pgbench;
a tiny improvement was observed.

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 include/linux/sched/sysctl.h |   3 +
 kernel/sched/fair.c          | 283 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sysctl.c              |   9 ++
 3 files changed, 283 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..6eef34331dd2 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -38,6 +38,9 @@ extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;

+extern unsigned int sysctl_numa_balancing_cling_degree;
+extern unsigned int max_numa_balancing_cling_degree;
+
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6cf9c9c61258..a4a48cdd2bbd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1067,6 +1067,20 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;

+/*
+ * A numa group serving a task group enables numa cling, a feature
+ * which tries to prevent tasks from leaving the preferred node on wakeup.
+ *
+ * This helps settle the workloads down thoroughly and quickly on a node,
+ * while introducing the risk of load imbalance.
+ *
+ * In order to detect the risk in advance and pause the feature, we
+ * rely on numa migration failure stats: when the failure ratio rises
+ * above the cling degree, we pause numa cling until resettling is done.
+ */
+unsigned int sysctl_numa_balancing_cling_degree = 20;
+unsigned int max_numa_balancing_cling_degree = 100;
+
 struct numa_group {
 	refcount_t refcount;

@@ -1074,11 +1088,15 @@ struct numa_group {
 	int nr_tasks;
 	pid_t gid;
 	int active_nodes;
+	int busiest_nid;
 	bool evacuate;
+	bool do_cling;
+	struct timer_list cling_timer;

 	struct rcu_head rcu;
 	unsigned long total_faults;
 	unsigned long max_faults_cpu;
+	unsigned long *migrate_stat;
 	/*
 	 * Faults_cpu is used to decide whether memory should move
 	 * towards the CPU. As a consequence, these stats are weighted
@@ -1088,6 +1106,8 @@ struct numa_group {
 	unsigned long faults[0];
 };

+static inline bool busy_node(struct numa_group *ng, int nid);
+
 static inline unsigned long group_faults_priv(struct numa_group *ng);
 static inline unsigned long group_faults_shared(struct numa_group *ng);

@@ -1132,8 +1152,14 @@ static unsigned int task_scan_start(struct task_struct *p)
 	unsigned long smin = task_scan_min(p);
 	unsigned long period = smin;

-	/* Scale the maximum scan period with the amount of shared memory. */
-	if (p->numa_group) {
+	/*
+	 * Scale the maximum scan period with the amount of shared memory.
+	 *
+	 * Skip this for a numa group serving a task group: its tasks are
+	 * not gathered for sharing memory, and we need to detect migration
+	 * failure in time.
+	 */
+	if (p->numa_group && !p->numa_group->do_cling) {
 		struct numa_group *ng = p->numa_group;
 		unsigned long shared = group_faults_shared(ng);
 		unsigned long private = group_faults_priv(ng);
@@ -1154,8 +1180,14 @@ static unsigned int task_scan_max(struct task_struct *p)
 	/* Watch for min being lower than max due to floor calculations */
 	smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);

-	/* Scale the maximum scan period with the amount of shared memory. */
-	if (p->numa_group) {
+	/*
+	 * Scale the maximum scan period with the amount of shared memory.
+	 *
+	 * Skip this for a numa group serving a task group: its tasks are
+	 * not gathered for sharing memory, and we need to detect migration
+	 * failure in time.
+	 */
+	if (p->numa_group && !p->numa_group->do_cling) {
 		struct numa_group *ng = p->numa_group;
 		unsigned long shared = group_faults_shared(ng);
 		unsigned long private = group_faults_priv(ng);
@@ -1475,6 +1507,19 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 					ACTIVE_NODE_FRACTION)
 		return true;

+	/*
+	 * Make sure pages do not stay on a busy node when numa cling is
+	 * enabled, otherwise they could lead to more numa migration to
+	 * the busy node.
+	 */
+	if (ng->do_cling) {
+		if (busy_node(ng, dst_nid))
+			return false;
+
+		if (busy_node(ng, src_nid))
+			return true;
+	}
+
 	/*
 	 * Distribute memory according to CPU & memory use on each node,
 	 * with 3/4 hysteresis to avoid unnecessary memory migrations:
@@ -1874,9 +1919,190 @@ static int task_numa_migrate(struct task_struct *p)
 	return ret;
 }

+/*
+ * Each migration is counted with a weight of 1024, and the cling timer
+ * period is the maximum numa balancing scan period divided by 10, so
+ * that one count decays to about 0 after one maximum scan period.
+ */
+#define NUMA_MIGRATE_SCALE 10
+#define NUMA_MIGRATE_WEIGHT 1024
+
+enum numa_migrate_stats {
+	FAILURE_SCALED,
+	TOTAL_SCALED,
+	FAILURE_RATIO,
+};
+
+static inline int mstat_idx(int nid, enum numa_migrate_stats s)
+{
+	return (nid + s * nr_node_ids);
+}
+
+static inline unsigned long
+mstat_failure_scaled(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, FAILURE_SCALED)];
+}
+
+static inline unsigned long
+mstat_total_scaled(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, TOTAL_SCALED)];
+}
+
+static inline unsigned long
+mstat_failure_ratio(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, FAILURE_RATIO)];
+}
+
+/*
+ * A node is busy when numa migrations toward it fail too often, which
+ * implies the load is already unbalanced by too much numa cling on
+ * that node.
+ */
+static inline bool busy_node(struct numa_group *ng, int nid)
+{
+	int degree = sysctl_numa_balancing_cling_degree;
+
+	if (mstat_failure_scaled(ng, nid) < NUMA_MIGRATE_WEIGHT)
+		return false;
+
+	/*
+	 * Allow only one busy node in one numa group, to prevent
+	 * ping-pong migration case between nodes.
+	 */
+	if (ng->busiest_nid != nid)
+		return false;
+
+	return mstat_failure_ratio(ng, nid) > degree;
+}
+
+/*
+ * Return true if the task should cling to snid, i.e. when it prefers
+ * snid rather than dnid and snid is not busy.
+ */
+static inline bool
+task_numa_cling(struct task_struct *p, int snid, int dnid)
+{
+	bool ret = false;
+	int pnid = p->numa_preferred_nid;
+	struct numa_group *ng;
+
+	rcu_read_lock();
+
+	ng = p->numa_group;
+
+	/* Do cling only when the feature is enabled and not paused */
+	if (!ng || !ng->do_cling)
+		goto out;
+
+	if (pnid == NUMA_NO_NODE ||
+	    dnid == pnid ||
+	    snid != pnid)
+		goto out;
+
+	/* Never allow cling to a busy node */
+	if (busy_node(ng, snid))
+		goto out;
+
+	ret = true;
+out:
+	rcu_read_unlock();
+	return ret;
+}
+
+/*
+ * Prevent more tasks from preferring the busy node to ease the imbalance,
+ * and give the second candidate a chance.
+ */
+static inline bool group_pause_prefer(struct numa_group *ng, int nid)
+{
+	if (!ng || !ng->do_cling)
+		return false;
+
+	return busy_node(ng, nid);
+}
+
+static inline void update_failure_ratio(struct numa_group *ng, int nid)
+{
+	int f_idx = mstat_idx(nid, FAILURE_SCALED);
+	int t_idx = mstat_idx(nid, TOTAL_SCALED);
+	int fp_idx = mstat_idx(nid, FAILURE_RATIO);
+
+	ng->migrate_stat[fp_idx] =
+		ng->migrate_stat[f_idx] * 100 / (ng->migrate_stat[t_idx] + 1);
+}
+
+static void cling_timer_func(struct timer_list *t)
+{
+	int nid;
+	unsigned int degree;
+	unsigned long period, max_failure;
+	struct numa_group *ng = from_timer(ng, t, cling_timer);
+
+	degree = sysctl_numa_balancing_cling_degree;
+	period = msecs_to_jiffies(sysctl_numa_balancing_scan_period_max);
+	period /= NUMA_MIGRATE_SCALE;
+
+	spin_lock_irq(&ng->lock);
+
+	max_failure = 0;
+	for_each_online_node(nid) {
+		int f_idx = mstat_idx(nid, FAILURE_SCALED);
+		int t_idx = mstat_idx(nid, TOTAL_SCALED);
+
+		ng->migrate_stat[f_idx] /= 2;
+		ng->migrate_stat[t_idx] /= 2;
+
+		update_failure_ratio(ng, nid);
+
+		if (ng->migrate_stat[f_idx] > max_failure) {
+			ng->busiest_nid = nid;
+			max_failure = ng->migrate_stat[f_idx];
+		}
+	}
+
+	spin_unlock_irq(&ng->lock);
+
+	mod_timer(&ng->cling_timer, jiffies + period);
+}
+
+static inline void
+update_migrate_stat(struct task_struct *p, int nid, bool failed)
+{
+	int idx;
+	struct numa_group *ng = p->numa_group;
+
+	if (!ng || !ng->do_cling)
+		return;
+
+	spin_lock_irq(&ng->lock);
+
+	if (failed) {
+		idx = mstat_idx(nid, FAILURE_SCALED);
+		ng->migrate_stat[idx] += NUMA_MIGRATE_WEIGHT;
+	}
+
+	idx = mstat_idx(nid, TOTAL_SCALED);
+	ng->migrate_stat[idx] += NUMA_MIGRATE_WEIGHT;
+	update_failure_ratio(ng, nid);
+
+	spin_unlock_irq(&ng->lock);
+
+	/*
+	 * After a failure the task may prefer the source node instead,
+	 * which causes ping-pong migration when numa cling is enabled,
+	 * so reset the preferred node to none.
+	 */
+	if (failed)
+		sched_setnuma(p, NUMA_NO_NODE);
+}
+
 /* Attempt to migrate a task to a CPU on the preferred node. */
 static void numa_migrate_preferred(struct task_struct *p)
 {
+	int failed, target;
 	unsigned long interval = HZ;

 	/* This task has no NUMA fault statistics yet */
@@ -1891,8 +2117,12 @@ static void numa_migrate_preferred(struct task_struct *p)
 	if (task_node(p) == p->numa_preferred_nid)
 		return;

+	target = p->numa_preferred_nid;
+
 	/* Otherwise, try migrate to a CPU on the preferred node */
-	task_numa_migrate(p);
+	failed = (task_numa_migrate(p) != 0);
+
+	update_migrate_stat(p, target, failed);
 }

 /*
@@ -2216,7 +2446,8 @@ static void task_numa_placement(struct task_struct *p)
 				max_faults = faults;
 				max_nid = nid;
 			}
-		} else if (group_faults > max_faults) {
+		} else if (group_faults > max_faults &&
+			   !group_pause_prefer(p->numa_group, nid)) {
 			max_faults = group_faults;
 			max_nid = nid;
 		}
@@ -2258,8 +2489,10 @@ void show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
 		return;
 	}

-	seq_printf(sf, "id %d nr_tasks %d active_nodes %d\n",
-		   ng->gid, ng->nr_tasks, ng->active_nodes);
+	spin_lock_irq(&ng->lock);
+
+	seq_printf(sf, "id %d nr_tasks %d active_nodes %d busiest_nid %d\n",
+		   ng->gid, ng->nr_tasks, ng->active_nodes, ng->busiest_nid);

 	for_each_online_node(nid) {
 		int f_idx = task_faults_idx(NUMA_MEM, nid, 0);
@@ -2270,9 +2503,16 @@ void show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
 		seq_printf(sf, "mem_private %lu mem_shared %lu ",
 			   ng->faults[f_idx], ng->faults[pf_idx]);

-		seq_printf(sf, "cpu_private %lu cpu_shared %lu\n",
+		seq_printf(sf, "cpu_private %lu cpu_shared %lu ",
 			   ng->faults_cpu[f_idx], ng->faults_cpu[pf_idx]);
+
+		seq_printf(sf, "migrate_stat %lu %lu %lu\n",
+			   mstat_failure_scaled(ng, nid),
+			   mstat_total_scaled(ng, nid),
+			   mstat_failure_ratio(ng, nid));
 	}
+
+	spin_unlock_irq(&ng->lock);
 }

 int update_tg_numa_group(struct task_group *tg, bool numa_group)
@@ -2286,20 +2526,26 @@ int update_tg_numa_group(struct task_group *tg, bool numa_group)
 	if (ng) {
 		/* put and evacuate tg's numa group */
 		rcu_assign_pointer(tg->numa_group, NULL);
+		del_timer_sync(&ng->cling_timer);
 		ng->evacuate = true;
 		put_numa_group(ng);
 	} else {
 		unsigned int size = sizeof(struct numa_group) +
-				    4*nr_node_ids*sizeof(unsigned long);
+				    7*nr_node_ids*sizeof(unsigned long);
+		unsigned int offset = NR_NUMA_HINT_FAULT_TYPES * nr_node_ids;

 		ng = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
 		if (!ng)
 			return -ENOMEM;

 		refcount_set(&ng->refcount, 1);
+		ng->busiest_nid = NUMA_NO_NODE;
+		ng->do_cling = true;
+		timer_setup(&ng->cling_timer, cling_timer_func, 0);
 		spin_lock_init(&ng->lock);
-		ng->faults_cpu = ng->faults + NR_NUMA_HINT_FAULT_TYPES *
-						nr_node_ids;
+		ng->faults_cpu = ng->faults + offset;
+		ng->migrate_stat = ng->faults_cpu + offset;
+		add_timer(&ng->cling_timer);
 		/* now make tasks see and join */
 		rcu_assign_pointer(tg->numa_group, ng);
 	}
@@ -2436,6 +2682,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 			return;

 		refcount_set(&grp->refcount, 1);
+		grp->busiest_nid = NUMA_NO_NODE;
 		grp->active_nodes = 1;
 		grp->max_faults_cpu = 0;
 		spin_lock_init(&grp->lock);
@@ -2879,6 +3126,11 @@ static inline void update_scan_period(struct task_struct *p, int new_cpu)
 {
 }

+static inline bool task_numa_cling(struct task_struct *p, int snid, int dnid)
+{
+	return false;
+}
+
 #endif /* CONFIG_NUMA_BALANCING */

 static void
@@ -6195,6 +6447,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;

+	/*
+	 * Failed to find an idle cpu. Wake affine may want to pull, but
+	 * try to stay on prev cpu when the task clings to its node.
+	 */
+	if (task_numa_cling(p, cpu_to_node(prev), cpu_to_node(target)))
+		return prev;
+
 	return target;
 }

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 078950d9605b..0a889dd1c7ed 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -417,6 +417,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "numa_balancing_cling_degree",
+		.data		= &sysctl_numa_balancing_cling_degree,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &max_numa_balancing_cling_degree,
+	},
 	{
 		.procname	= "numa_balancing",
 		.data		= NULL, /* filled in by handler */
-- 
2.14.4.44.g2045bb6



* [PATCH v2 4/4] numa: introduce numa cling feature
  2019-07-03  3:34   ` [PATCH 4/4] numa: introduce numa cling feature 王贇
@ 2019-07-08  2:25     ` 王贇
  2019-07-09  2:15       ` 王贇
  2019-07-09  2:24       ` [PATCH v3 " 王贇
  2019-07-11 14:27     ` [PATCH " Peter Zijlstra
  1 sibling, 2 replies; 62+ messages in thread
From: 王贇 @ 2019-07-08  2:25 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm, mcgrof, keescook, linux-fsdevel, cgroups

Although we put a lot of effort into settling a task down on a
particular node, there are still chances for a task to leave its
preferred node: wakeup, numa swap migration or load balance.

When cpu cgroups are used in shared mode, all the workloads see all the
cpus, and this can be really bad, especially when there are many fast
wakeups. Although we can now numa group the tasks, they won't really
stay on the same node. For example, with numa groups ng_A, ng_B, ng_C
and ng_D, the result is very likely to be:

	CPU Usage:
		Node 0		Node 1
		ng_A(600%)	ng_A(400%)
		ng_B(400%)	ng_B(600%)
		ng_C(400%)	ng_C(600%)
		ng_D(600%)	ng_D(400%)

	Memory Ratio:
		Node 0		Node 1
		ng_A(60%)	ng_A(40%)
		ng_B(40%)	ng_B(60%)
		ng_C(40%)	ng_C(60%)
		ng_D(60%)	ng_D(40%)

Locality won't be too bad, but it is far from the best situation: we
want a numa group to settle down thoroughly on a particular node, with
everything balanced.

Thus we introduce numa cling, which tries to prevent tasks from leaving
the preferred node on the wakeup fast path.

This helps settle the workloads down thoroughly on a single node, but
when multiple numa groups try to settle down on the same node, imbalance
can happen.

For example, with numa groups ng_A, ng_B, ng_C and ng_D, it may result
in a situation like:

CPU Usage:
	Node 0		Node 1
	ng_A(1000%)	ng_B(1000%)
	ng_C(400%)	ng_C(600%)
	ng_D(400%)	ng_D(600%)

Memory Ratio:
	Node 0		Node 1
	ng_A(100%)	ng_B(100%)
	ng_C(10%)	ng_C(90%)
	ng_D(10%)	ng_D(90%)

This is because once ng_C and ng_D have most of their memory on node 1,
task_x of ng_C on node 0 will try to do a numa swap migration with
task_y of ng_D on node 1, as long as the load stays balanced. The result
is that task_x ends up on node 1 and task_y on node 0, while both of
them prefer node 1.

Now when other tasks of ng_D on node 1 wake up task_y, task_y will very
likely go back to node 1, and since numa cling is enabled, it will stay
on node 1 even though the load is unbalanced. This can happen
frequently, so more and more tasks will prefer node 1 and make it busy.

So the key point is to stop doing numa cling when the load starts to
become unbalanced.

We achieve this by monitoring the migration failure ratio. In the
scenario above, too many tasks prefer node 1 and keep migrating to it,
and the load imbalance leads to migration failures. When the failure
ratio rises above the specified degree, we pause the cling and try to
resettle the workloads on a better node by stopping tasks from
preferring the busy node. This finally gives us a result like:

CPU Usage:
	Node 0		Node 1
	ng_A(1000%)	ng_B(1000%)
	ng_C(1000%)	ng_D(1000%)

Memory Ratio:
	Node 0		Node 1
	ng_A(100%)	ng_B(100%)
	ng_C(100%)	ng_D(100%)

Now we achieve the best locality and the maximum hot cache benefit.
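
The pause is temporary because the failure statistics decay. As a rough
timing model (a sketch, not kernel code), assume the counters are halved
on every cling timer tick and that the timer period is one tenth of the
maximum scan period, as in the diff below. ticks_to_zero() and
decay_time_ms() are helper names invented for this example, and the
60000 ms used in the note is the usual default of
numa_balancing_scan_period_max_ms, which may differ on a given system.

```c
#define MIGRATE_SCALE	10	/* timer period = max scan period / 10 */

/* Number of halvings needed for a scaled count to reach zero. */
static unsigned int ticks_to_zero(unsigned long count)
{
	unsigned int ticks = 0;

	while (count) {
		count /= 2;
		ticks++;
	}

	return ticks;
}

/* Time in ms until a count decays to zero, given the max scan period. */
static unsigned long decay_time_ms(unsigned long count,
				   unsigned long scan_period_max_ms)
{
	return ticks_to_zero(count) * (scan_period_max_ms / MIGRATE_SCALE);
}
```

Under these assumptions a single recorded failure (1024) falls to 512
on the first tick, which is already below the 1024 threshold checked by
busy_node(), and needs 11 ticks to reach zero, about 66 seconds with a
60000 ms maximum scan period.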

Tested on a 2 node box with 96 cpus using sysbench-mysql-oltp_read_write:
X mysqld instances were created and attached to X cgroups, then X
sysbench instances were created and attached to the corresponding
cgroups to test mysql with the oltp_read_write script for 20 minutes.
The average eps:

				origin		ng + cling
4 instances each 24 threads	7545.28		7790.49		+3.25%
4 instances each 48 threads	9359.36		9832.30		+5.05%
4 instances each 72 threads	9602.88		10196.95	+6.19%

8 instances each 24 threads	4478.82		4508.82		+0.67%
8 instances each 48 threads	5514.90		5689.93		+3.17%
8 instances each 72 threads	5582.19		5741.33		+2.85%

Also tested with perf-bench-numa, dbench, sysbench-memory and pgbench;
a tiny improvement was observed.

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---

v2:
  * migrate_degrades_locality() now returns 1 when the task clings to the
    source node

 include/linux/sched/sysctl.h |   3 +
 kernel/sched/fair.c          | 287 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sysctl.c              |   9 ++
 3 files changed, 286 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..6eef34331dd2 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -38,6 +38,9 @@ extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;

+extern unsigned int sysctl_numa_balancing_cling_degree;
+extern unsigned int max_numa_balancing_cling_degree;
+
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6cf9c9c61258..5ca5a9a148d6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1067,6 +1067,20 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;

+/*
+ * A numa group serving a task group enables numa cling, a feature
+ * which tries to prevent tasks from leaving the preferred node on wakeup.
+ *
+ * This helps settle the workloads down thoroughly and quickly on a node,
+ * while introducing the risk of load imbalance.
+ *
+ * In order to detect the risk in advance and pause the feature, we
+ * rely on numa migration failure stats: when the failure ratio rises
+ * above the cling degree, we pause numa cling until resettling is done.
+ */
+unsigned int sysctl_numa_balancing_cling_degree = 20;
+unsigned int max_numa_balancing_cling_degree = 100;
+
 struct numa_group {
 	refcount_t refcount;

@@ -1074,11 +1088,15 @@ struct numa_group {
 	int nr_tasks;
 	pid_t gid;
 	int active_nodes;
+	int busiest_nid;
 	bool evacuate;
+	bool do_cling;
+	struct timer_list cling_timer;

 	struct rcu_head rcu;
 	unsigned long total_faults;
 	unsigned long max_faults_cpu;
+	unsigned long *migrate_stat;
 	/*
 	 * Faults_cpu is used to decide whether memory should move
 	 * towards the CPU. As a consequence, these stats are weighted
@@ -1088,6 +1106,8 @@ struct numa_group {
 	unsigned long faults[0];
 };

+static inline bool busy_node(struct numa_group *ng, int nid);
+
 static inline unsigned long group_faults_priv(struct numa_group *ng);
 static inline unsigned long group_faults_shared(struct numa_group *ng);

@@ -1132,8 +1152,14 @@ static unsigned int task_scan_start(struct task_struct *p)
 	unsigned long smin = task_scan_min(p);
 	unsigned long period = smin;

-	/* Scale the maximum scan period with the amount of shared memory. */
-	if (p->numa_group) {
+	/*
+	 * Scale the maximum scan period with the amount of shared memory.
+	 *
+	 * Not for the numa group serving task group, its tasks are not
+	 * gathered for sharing memory, and we need to detect migration
+	 * failure in time.
+	 */
+	if (p->numa_group && !p->numa_group->do_cling) {
 		struct numa_group *ng = p->numa_group;
 		unsigned long shared = group_faults_shared(ng);
 		unsigned long private = group_faults_priv(ng);
@@ -1154,8 +1180,14 @@ static unsigned int task_scan_max(struct task_struct *p)
 	/* Watch for min being lower than max due to floor calculations */
 	smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);

-	/* Scale the maximum scan period with the amount of shared memory. */
-	if (p->numa_group) {
+	/*
+	 * Scale the maximum scan period with the amount of shared memory.
+	 *
+	 * Not for the numa group serving task group, its tasks are not
+	 * gathered for sharing memory, and we need to detect migration
+	 * failure in time.
+	 */
+	if (p->numa_group && !p->numa_group->do_cling) {
 		struct numa_group *ng = p->numa_group;
 		unsigned long shared = group_faults_shared(ng);
 		unsigned long private = group_faults_priv(ng);
@@ -1475,6 +1507,19 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 					ACTIVE_NODE_FRACTION)
 		return true;

+	/*
+	 * Make sure pages do not stay on a busy node when numa cling
+	 * enabled, otherwise they could lead into more numa migration
+	 * to the busy node.
+	 */
+	if (ng->do_cling) {
+		if (busy_node(ng, dst_nid))
+			return false;
+
+		if (busy_node(ng, src_nid))
+			return true;
+	}
+
 	/*
 	 * Distribute memory according to CPU & memory use on each node,
 	 * with 3/4 hysteresis to avoid unnecessary memory migrations:
@@ -1874,9 +1919,190 @@ static int task_numa_migrate(struct task_struct *p)
 	return ret;
 }

+/*
+ * We scale the migration stat count to 1024, divide the maximum numa
+ * balancing scan period by 10 and make that the period of cling timer,
+ * this help to decay one count to 0 after one maximum scan period passed.
+ */
+#define NUMA_MIGRATE_SCALE 10
+#define NUMA_MIGRATE_WEIGHT 1024
+
+enum numa_migrate_stats {
+	FAILURE_SCALED,
+	TOTAL_SCALED,
+	FAILURE_RATIO,
+};
+
+static inline int mstat_idx(int nid, enum numa_migrate_stats s)
+{
+	return (nid + s * nr_node_ids);
+}
+
+static inline unsigned long
+mstat_failure_scaled(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, FAILURE_SCALED)];
+}
+
+static inline unsigned long
+mstat_total_scaled(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, TOTAL_SCALED)];
+}
+
+static inline unsigned long
+mstat_failure_ratio(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, FAILURE_RATIO)];
+}
+
+/*
+ * A node is busy when the numa migration toward it failed too much,
+ * this imply the load already unbalancing for too much numa cling on
+ * that node.
+ */
+static inline bool busy_node(struct numa_group *ng, int nid)
+{
+	int degree = sysctl_numa_balancing_cling_degree;
+
+	if (mstat_failure_scaled(ng, nid) < NUMA_MIGRATE_WEIGHT)
+		return false;
+
+	/*
+	 * Allow only one busy node in one numa group, to prevent
+	 * ping-pong migration case between nodes.
+	 */
+	if (ng->busiest_nid != nid)
+		return false;
+
+	return mstat_failure_ratio(ng, nid) > degree;
+}
+
+/*
+ * Return true if the task should cling to snid, when it preferred snid
+ * rather than dnid and snid is not busy.
+ */
+static inline bool
+task_numa_cling(struct task_struct *p, int snid, int dnid)
+{
+	bool ret = false;
+	int pnid = p->numa_preferred_nid;
+	struct numa_group *ng;
+
+	rcu_read_lock();
+
+	ng = p->numa_group;
+
+	/* Do cling only when the feature enabled and not in pause */
+	if (!ng || !ng->do_cling)
+		goto out;
+
+	if (pnid == NUMA_NO_NODE ||
+	    dnid == pnid ||
+	    snid != pnid)
+		goto out;
+
+	/* Never allow cling to a busy node */
+	if (busy_node(ng, snid))
+		goto out;
+
+	ret = true;
+out:
+	rcu_read_unlock();
+	return ret;
+}
+
+/*
+ * Prevent more tasks from preferring the busy node to ease the imbalance,
+ * and give the second candidate a chance.
+ */
+static inline bool group_pause_prefer(struct numa_group *ng, int nid)
+{
+	if (!ng || !ng->do_cling)
+		return false;
+
+	return busy_node(ng, nid);
+}
+
+static inline void update_failure_ratio(struct numa_group *ng, int nid)
+{
+	int f_idx = mstat_idx(nid, FAILURE_SCALED);
+	int t_idx = mstat_idx(nid, TOTAL_SCALED);
+	int fp_idx = mstat_idx(nid, FAILURE_RATIO);
+
+	ng->migrate_stat[fp_idx] =
+		ng->migrate_stat[f_idx] * 100 / (ng->migrate_stat[t_idx] + 1);
+}
+
+static void cling_timer_func(struct timer_list *t)
+{
+	int nid;
+	unsigned int degree;
+	unsigned long period, max_failure;
+	struct numa_group *ng = from_timer(ng, t, cling_timer);
+
+	degree = sysctl_numa_balancing_cling_degree;
+	period = msecs_to_jiffies(sysctl_numa_balancing_scan_period_max);
+	period /= NUMA_MIGRATE_SCALE;
+
+	spin_lock_irq(&ng->lock);
+
+	max_failure = 0;
+	for_each_online_node(nid) {
+		int f_idx = mstat_idx(nid, FAILURE_SCALED);
+		int t_idx = mstat_idx(nid, TOTAL_SCALED);
+
+		ng->migrate_stat[f_idx] /= 2;
+		ng->migrate_stat[t_idx] /= 2;
+
+		update_failure_ratio(ng, nid);
+
+		if (ng->migrate_stat[f_idx] > max_failure) {
+			ng->busiest_nid = nid;
+			max_failure = ng->migrate_stat[f_idx];
+		}
+	}
+
+	spin_unlock_irq(&ng->lock);
+
+	mod_timer(&ng->cling_timer, jiffies + period);
+}
+
+static inline void
+update_migrate_stat(struct task_struct *p, int nid, bool failed)
+{
+	int idx;
+	struct numa_group *ng = p->numa_group;
+
+	if (!ng || !ng->do_cling)
+		return;
+
+	spin_lock_irq(&ng->lock);
+
+	if (failed) {
+		idx = mstat_idx(nid, FAILURE_SCALED);
+		ng->migrate_stat[idx] += NUMA_MIGRATE_WEIGHT;
+	}
+
+	idx = mstat_idx(nid, TOTAL_SCALED);
+	ng->migrate_stat[idx] += NUMA_MIGRATE_WEIGHT;
+	update_failure_ratio(ng, nid);
+
+	spin_unlock_irq(&ng->lock);
+
+	/*
+	 * On failed task may prefer source node instead, this
+	 * cause ping-pong migration when numa cling enabled,
+	 * so let's reset the preferred node to none.
+	 */
+	if (failed)
+		sched_setnuma(p, NUMA_NO_NODE);
+}
+
 /* Attempt to migrate a task to a CPU on the preferred node. */
 static void numa_migrate_preferred(struct task_struct *p)
 {
+	bool failed, target;
 	unsigned long interval = HZ;

 	/* This task has no NUMA fault statistics yet */
@@ -1891,8 +2117,12 @@ static void numa_migrate_preferred(struct task_struct *p)
 	if (task_node(p) == p->numa_preferred_nid)
 		return;

+	target = p->numa_preferred_nid;
+
 	/* Otherwise, try migrate to a CPU on the preferred node */
-	task_numa_migrate(p);
+	failed = (task_numa_migrate(p) != 0);
+
+	update_migrate_stat(p, target, failed);
 }

 /*
@@ -2216,7 +2446,8 @@ static void task_numa_placement(struct task_struct *p)
 				max_faults = faults;
 				max_nid = nid;
 			}
-		} else if (group_faults > max_faults) {
+		} else if (group_faults > max_faults &&
+			   !group_pause_prefer(p->numa_group, nid)) {
 			max_faults = group_faults;
 			max_nid = nid;
 		}
@@ -2258,8 +2489,10 @@ void show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
 		return;
 	}

-	seq_printf(sf, "id %d nr_tasks %d active_nodes %d\n",
-		   ng->gid, ng->nr_tasks, ng->active_nodes);
+	spin_lock_irq(&ng->lock);
+
+	seq_printf(sf, "id %d nr_tasks %d active_nodes %d busiest_nid %d\n",
+		   ng->gid, ng->nr_tasks, ng->active_nodes, ng->busiest_nid);

 	for_each_online_node(nid) {
 		int f_idx = task_faults_idx(NUMA_MEM, nid, 0);
@@ -2270,9 +2503,16 @@ void show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
 		seq_printf(sf, "mem_private %lu mem_shared %lu ",
 			   ng->faults[f_idx], ng->faults[pf_idx]);

-		seq_printf(sf, "cpu_private %lu cpu_shared %lu\n",
+		seq_printf(sf, "cpu_private %lu cpu_shared %lu ",
 			   ng->faults_cpu[f_idx], ng->faults_cpu[pf_idx]);
+
+		seq_printf(sf, "migrate_stat %lu %lu %lu\n",
+			   mstat_failure_scaled(ng, nid),
+			   mstat_total_scaled(ng, nid),
+			   mstat_failure_ratio(ng, nid));
 	}
+
+	spin_unlock_irq(&ng->lock);
 }

 int update_tg_numa_group(struct task_group *tg, bool numa_group)
@@ -2286,20 +2526,26 @@ int update_tg_numa_group(struct task_group *tg, bool numa_group)
 	if (ng) {
 		/* put and evacuate tg's numa group */
 		rcu_assign_pointer(tg->numa_group, NULL);
+		del_timer_sync(&ng->cling_timer);
 		ng->evacuate = true;
 		put_numa_group(ng);
 	} else {
 		unsigned int size = sizeof(struct numa_group) +
-				    4*nr_node_ids*sizeof(unsigned long);
+				    7*nr_node_ids*sizeof(unsigned long);
+		unsigned int offset = NR_NUMA_HINT_FAULT_TYPES * nr_node_ids;

 		ng = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
 		if (!ng)
 			return -ENOMEM;

 		refcount_set(&ng->refcount, 1);
+		ng->busiest_nid = NUMA_NO_NODE;
+		ng->do_cling = true;
+		timer_setup(&ng->cling_timer, cling_timer_func, 0);
 		spin_lock_init(&ng->lock);
-		ng->faults_cpu = ng->faults + NR_NUMA_HINT_FAULT_TYPES *
-						nr_node_ids;
+		ng->faults_cpu = ng->faults + offset;
+		ng->migrate_stat = ng->faults_cpu + offset;
+		add_timer(&ng->cling_timer);
 		/* now make tasks see and join */
 		rcu_assign_pointer(tg->numa_group, ng);
 	}
@@ -2436,6 +2682,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 			return;

 		refcount_set(&grp->refcount, 1);
+		grp->busiest_nid = NUMA_NO_NODE;
 		grp->active_nodes = 1;
 		grp->max_faults_cpu = 0;
 		spin_lock_init(&grp->lock);
@@ -2879,6 +3126,11 @@ static inline void update_scan_period(struct task_struct *p, int new_cpu)
 {
 }

+static inline bool task_numa_cling(struct task_struct *p, int snid, int dnid)
+{
+	return false;
+}
+
 #endif /* CONFIG_NUMA_BALANCING */

 static void
@@ -6195,6 +6447,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;

+	/*
+	 * Failed to find an idle cpu, wake affine may want to pull but
+	 * try stay on prev-cpu when the task cling to it.
+	 */
+	if (task_numa_cling(p, cpu_to_node(prev), cpu_to_node(target)))
+		return prev;
+
 	return target;
 }

@@ -6671,6 +6930,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		if (want_affine)
 			current->recent_used_cpu = cpu;
 	}
+
 	rcu_read_unlock();

 	return new_cpu;
@@ -7342,7 +7602,8 @@ static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)

 	/* Migrating away from the preferred node is always bad. */
 	if (src_nid == p->numa_preferred_nid) {
-		if (env->src_rq->nr_running > env->src_rq->nr_preferred_running)
+		if (task_numa_cling(p, src_nid, dst_nid) ||
+		    env->src_rq->nr_running > env->src_rq->nr_preferred_running)
 			return 1;
 		else
 			return -1;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 078950d9605b..0a889dd1c7ed 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -417,6 +417,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "numa_balancing_cling_degree",
+		.data		= &sysctl_numa_balancing_cling_degree,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &max_numa_balancing_cling_degree,
+	},
 	{
 		.procname	= "numa_balancing",
 		.data		= NULL, /* filled in by handler */
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v2 4/4] numa: introduce numa cling feature
  2019-07-08  2:25     ` [PATCH v2 " 王贇
@ 2019-07-09  2:15       ` 王贇
  2019-07-09  2:24       ` [PATCH v3 " 王贇
  1 sibling, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-07-09  2:15 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar,
	linux-kernel, linux-mm, mcgrof, keescook, linux-fsdevel, cgroups

On 2019/7/8 4:07 PM, Hillf Danton wrote:
> 
> On Mon, 8 Jul 2019 10:25:27 +0800 Michael Wang wrote:
>> /* Attempt to migrate a task to a CPU on the preferred node. */
>> static void numa_migrate_preferred(struct task_struct *p)
>> {
>> +	bool failed, target;
>> 	unsigned long interval = HZ;
>>
>> 	/* This task has no NUMA fault statistics yet */
>> @@ -1891,8 +2117,12 @@ static void numa_migrate_preferred(struct task_struct *p)
>> 	if (task_node(p) == p->numa_preferred_nid)
>> 		return;
>>
>> +	target = p->numa_preferred_nid;
>> +
> Something instead of bool can be used, too.

Thanks for pointing that out :-) It will be fixed in v3.

> 
>> 	/* Otherwise, try migrate to a CPU on the preferred node */
>> -	task_numa_migrate(p);
>> +	failed = (task_numa_migrate(p) != 0);
>> +
>> +	update_migrate_stat(p, target, failed);
>> }
>>
>> static void
>> @@ -6195,6 +6447,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>> 	if ((unsigned)i < nr_cpumask_bits)
>> 		return i;
>>
>> +	/*
>> +	 * Failed to find an idle cpu, wake affine may want to pull but
>> +	 * try stay on prev-cpu when the task cling to it.
>> +	 */
>> +	if (task_numa_cling(p, cpu_to_node(prev), cpu_to_node(target)))
>> +		return prev;
>> +
> Curious to know what test figures would look like without the above line.

Without it, the result depends on the wake-affine condition: when the waker
considers the wakee suitable to pull, the wakee may leave its preferred node
or may be pulled onto it, more or less at random.

In the mysql case, when there are many such wakeups and the system is very
busy, the observed workloads end up distributed 4:6 or 3:7 across the two nodes.
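
For illustration, the check behind that line can be sketched in user space
as below. This is a simplified sketch and not the kernel code: struct
cling_view, should_cling(), pick_wakeup_cpu() and the cpu_node[] table are
hypothetical stand-ins, and the busy-node test is reduced to a single
busy_nid field.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_NODE (-1)	/* stands in for NUMA_NO_NODE */

/* Hypothetical, flattened view of the state the check needs. */
struct cling_view {
	bool do_cling;		/* cling enabled for the task's numa group */
	int preferred_nid;	/* task's preferred node, or NO_NODE */
	int busy_nid;		/* node currently treated as busy, or NO_NODE */
};

/* Cling to snid only if it is the preferred node, dnid is not, and snid is not busy. */
static bool should_cling(const struct cling_view *v, int snid, int dnid)
{
	if (!v->do_cling)
		return false;

	if (v->preferred_nid == NO_NODE ||
	    dnid == v->preferred_nid ||
	    snid != v->preferred_nid)
		return false;

	/* Never cling to a busy node. */
	if (snid == v->busy_nid)
		return false;

	return true;
}

/* Fallback when no idle cpu was found: stay on prev_cpu if the task clings. */
static int pick_wakeup_cpu(const struct cling_view *v, int prev_cpu,
			   int target_cpu, const int *cpu_node)
{
	assert(prev_cpu >= 0 && target_cpu >= 0);

	if (should_cling(v, cpu_node[prev_cpu], cpu_node[target_cpu]))
		return prev_cpu;

	return target_cpu;
}
```

With cpus 0-1 on node 0 and cpus 2-3 on node 1, a task that prefers node 0
and was last on cpu 1 stays on cpu 1 rather than following the waker to
cpu 2, unless node 0 is marked busy.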

Regards,
Michael Wang

> 
>> 	return target;
>> }
>>
>> Tested on a 2 node box with 96 cpus, do sysbench-mysql-oltp_read_write
>> testing, X mysqld instances created and attached to X cgroups, X sysbench
>> instances then created and attached to corresponding cgroup to test the
>> mysql with oltp_read_write script for 20 minutes, average eps show:
>>
>> 				origin		ng + cling
>> 4 instances each 24 threads	7545.28		7790.49		+3.25%
>> 4 instances each 48 threads	9359.36		9832.30		+5.05%
>> 4 instances each 72 threads	9602.88		10196.95	+6.19%
>>
>> 8 instances each 24 threads	4478.82		4508.82		+0.67%
>> 8 instances each 48 threads	5514.90		5689.93		+3.17%
>> 8 instances each 72 threads	5582.19		5741.33		+2.85%
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v3 4/4] numa: introduce numa cling feature
  2019-07-08  2:25     ` [PATCH v2 " 王贇
  2019-07-09  2:15       ` 王贇
@ 2019-07-09  2:24       ` 王贇
  1 sibling, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-07-09  2:24 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm, mcgrof, keescook, linux-fsdevel, cgroups

Although we put a lot of effort into settling a task on a particular
node, there are still ways for a task to leave its preferred node:
wakeup, numa swap migration, or load balancing.

When the cpu cgroup is used in a shared way, all the workloads see all
the cpus, so this can be really bad, especially when there are many
fast wakeups. Although we can now numa group the tasks, they won't
really stay on the same node. For example, with numa groups ng_A, ng_B,
ng_C and ng_D, the likely result is:

	CPU Usage:
		Node 0		Node 1
		ng_A(600%)	ng_A(400%)
		ng_B(400%)	ng_B(600%)
		ng_C(400%)	ng_C(600%)
		ng_D(600%)	ng_D(400%)

	Memory Ratio:
		Node 0		Node 1
		ng_A(60%)	ng_A(40%)
		ng_B(40%)	ng_B(60%)
		ng_C(40%)	ng_C(60%)
		ng_D(60%)	ng_D(40%)

Locality isn't too bad, but it is far from the best situation; we want
a numa group to settle thoroughly on a particular node, with everything
balanced.

Thus we introduce numa cling, which tries to prevent tasks from leaving
the preferred node on the wakeup fast path.

This helps settle the workloads thoroughly on a single node, but when
multiple numa groups try to settle on the same node, the load can become
unbalanced.

For example, with numa groups ng_A, ng_B, ng_C and ng_D, it may result
in a situation like:

CPU Usage:
	Node 0		Node 1
	ng_A(1000%)	ng_B(1000%)
	ng_C(400%)	ng_C(600%)
	ng_D(400%)	ng_D(600%)

Memory Ratio:
	Node 0		Node 1
	ng_A(100%)	ng_B(100%)
	ng_C(10%)	ng_C(90%)
	ng_D(10%)	ng_D(90%)

This is because when ng_C and ng_D start to have most of their memory on
node 1 at some point, task_x of ng_C on node 0 will try to do a numa swap
migration with task_y of ng_D on node 1 as long as the load stays balanced.
The result is task_x on node 1 and task_y on node 0, while both of them
prefer node 1.

Now when other tasks of ng_D on node 1 wake up task_y, task_y will very
likely go back to node 1, and since numa cling is enabled, it will stay on
node 1 even though the load is unbalanced. This can happen frequently, so
more and more tasks will prefer node 1 and make it busy.

So the key point is to stop doing numa cling when the load starts to
become unbalanced.

We achieve this by monitoring the migration failure ratio. In the scenario
above, too many tasks prefer node 1 and keep trying to migrate to it, and
the load imbalance causes those migrations to fail. When the failure ratio
rises above the specified degree, we pause the cling and try to resettle
the workloads on a better node by stopping tasks from preferring the busy
node. This finally gives us a result like:

CPU Usage:
	Node 0		Node 1
	ng_A(1000%)	ng_B(1000%)
	ng_C(1000%)	ng_D(1000%)

Memory Ratio:
	Node 0		Node 1
	ng_A(100%)	ng_B(100%)
	ng_C(100%)	ng_D(100%)

Now we have achieved the best locality and the maximum hot cache benefit.
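
To make the pause condition concrete, here is a minimal user-space sketch
of the bookkeeping, not the kernel code itself. The counter scale (1024),
the halving on each timer tick and the three checks in the busy test mirror
the patch below; struct cling_stat, record_migration(), decay_tick(),
node_is_busy() and the fixed NR_NODES are names assumed only for this
illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES		2	/* assumption: two-node box, as in the test setup */
#define NUMA_MIGRATE_WEIGHT	1024	/* same scale as the patch */

struct cling_stat {
	unsigned long failure[NR_NODES];	/* scaled failed migrations */
	unsigned long total[NR_NODES];		/* scaled migration attempts */
	unsigned long ratio[NR_NODES];		/* failure percentage */
	int busiest_nid;			/* -1 until a tick picks one */
};

static void update_ratio(struct cling_stat *s, int nid)
{
	/* +1 avoids dividing by zero when nothing was recorded yet. */
	s->ratio[nid] = s->failure[nid] * 100 / (s->total[nid] + 1);
}

/* Record one migration attempt toward nid. */
static void record_migration(struct cling_stat *s, int nid, bool failed)
{
	assert(nid >= 0 && nid < NR_NODES);

	if (failed)
		s->failure[nid] += NUMA_MIGRATE_WEIGHT;
	s->total[nid] += NUMA_MIGRATE_WEIGHT;
	update_ratio(s, nid);
}

/* One timer tick: halve every counter and pick the node with most failures. */
static void decay_tick(struct cling_stat *s)
{
	unsigned long max_failure = 0;

	for (int nid = 0; nid < NR_NODES; nid++) {
		s->failure[nid] /= 2;
		s->total[nid] /= 2;
		update_ratio(s, nid);

		if (s->failure[nid] > max_failure) {
			s->busiest_nid = nid;
			max_failure = s->failure[nid];
		}
	}
}

/* Busy: at least one un-decayed failure, the busiest node, ratio above degree. */
static bool node_is_busy(const struct cling_stat *s, int nid, unsigned int degree)
{
	assert(nid >= 0 && nid < NR_NODES);

	if (s->failure[nid] < NUMA_MIGRATE_WEIGHT)
		return false;
	if (s->busiest_nid != nid)
		return false;

	return s->ratio[nid] > degree;
}
```

With the default degree of 20, four failures out of ten attempts toward
node 1 give a ratio of 39, so node 1 is treated as busy after the next
tick; two further ticks without new failures drop the scaled failure
count below 1024 and the node stops being busy.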

Tested on a 2-node box with 96 cpus running sysbench-mysql-oltp_read_write:
X mysqld instances were created and attached to X cgroups, then X sysbench
instances were created and attached to the corresponding cgroups to test
mysql with the oltp_read_write script for 20 minutes. Average eps:

				origin		ng + cling	improvement
4 instances each 24 threads	7545.28		7790.49		+3.25%
4 instances each 48 threads	9359.36		9832.30		+5.05%
4 instances each 72 threads	9602.88		10196.95	+6.19%

8 instances each 24 threads	4478.82		4508.82		+0.67%
8 instances each 48 threads	5514.90		5689.93		+3.17%
8 instances each 72 threads	5582.19		5741.33		+2.85%
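
The last column is the relative gain of "ng + cling" over "origin". As a
quick check of the arithmetic, a small helper could look like this;
improvement_pct() is a name assumed for this example and is not part of
the benchmark tooling.

```c
#include <assert.h>

/* Relative gain of `patched` over `baseline`, in percent. */
static double improvement_pct(double baseline, double patched)
{
	assert(baseline > 0.0);
	return (patched - baseline) * 100.0 / baseline;
}
```

For the first row, improvement_pct(7545.28, 7790.49) is about 3.25,
matching the table.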

Also tested with perf-bench-numa, dbench, sysbench-memory and pgbench; a
tiny improvement was observed.

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---

v3:
  * fix target type in numa_migrate_preferred()

 include/linux/sched/sysctl.h |   3 +
 kernel/sched/fair.c          | 288 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sysctl.c              |   9 ++
 3 files changed, 287 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..6eef34331dd2 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -38,6 +38,9 @@ extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;

+extern unsigned int sysctl_numa_balancing_cling_degree;
+extern unsigned int max_numa_balancing_cling_degree;
+
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6cf9c9c61258..0757dc86953a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1067,6 +1067,20 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;

+/*
+ * A numa group serving a task group enables numa cling, a feature
+ * which tries to prevent tasks from leaving the preferred node on wakeup.
+ *
+ * This helps settle the workloads thoroughly and quickly on a node,
+ * but introduces the risk of load imbalance.
+ *
+ * In order to detect the risk in advance and pause the feature, we
+ * rely on numa migration failure stats: when the failure ratio rises
+ * above the cling degree, we pause numa cling until resettling is done.
+ */
+unsigned int sysctl_numa_balancing_cling_degree = 20;
+unsigned int max_numa_balancing_cling_degree = 100;
+
 struct numa_group {
 	refcount_t refcount;

@@ -1074,11 +1088,15 @@ struct numa_group {
 	int nr_tasks;
 	pid_t gid;
 	int active_nodes;
+	int busiest_nid;
 	bool evacuate;
+	bool do_cling;
+	struct timer_list cling_timer;

 	struct rcu_head rcu;
 	unsigned long total_faults;
 	unsigned long max_faults_cpu;
+	unsigned long *migrate_stat;
 	/*
 	 * Faults_cpu is used to decide whether memory should move
 	 * towards the CPU. As a consequence, these stats are weighted
@@ -1088,6 +1106,8 @@ struct numa_group {
 	unsigned long faults[0];
 };

+static inline bool busy_node(struct numa_group *ng, int nid);
+
 static inline unsigned long group_faults_priv(struct numa_group *ng);
 static inline unsigned long group_faults_shared(struct numa_group *ng);

@@ -1132,8 +1152,14 @@ static unsigned int task_scan_start(struct task_struct *p)
 	unsigned long smin = task_scan_min(p);
 	unsigned long period = smin;

-	/* Scale the maximum scan period with the amount of shared memory. */
-	if (p->numa_group) {
+	/*
+	 * Scale the maximum scan period with the amount of shared memory.
+	 *
+	 * Not for the numa group serving task group, its tasks are not
+	 * gathered for sharing memory, and we need to detect migration
+	 * failure in time.
+	 */
+	if (p->numa_group && !p->numa_group->do_cling) {
 		struct numa_group *ng = p->numa_group;
 		unsigned long shared = group_faults_shared(ng);
 		unsigned long private = group_faults_priv(ng);
@@ -1154,8 +1180,14 @@ static unsigned int task_scan_max(struct task_struct *p)
 	/* Watch for min being lower than max due to floor calculations */
 	smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);

-	/* Scale the maximum scan period with the amount of shared memory. */
-	if (p->numa_group) {
+	/*
+	 * Scale the maximum scan period with the amount of shared memory.
+	 *
+	 * Not for the numa group serving task group, its tasks are not
+	 * gathered for sharing memory, and we need to detect migration
+	 * failure in time.
+	 */
+	if (p->numa_group && !p->numa_group->do_cling) {
 		struct numa_group *ng = p->numa_group;
 		unsigned long shared = group_faults_shared(ng);
 		unsigned long private = group_faults_priv(ng);
@@ -1475,6 +1507,19 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 					ACTIVE_NODE_FRACTION)
 		return true;

+	/*
+	 * Make sure pages do not stay on a busy node when numa cling
+	 * enabled, otherwise they could lead into more numa migration
+	 * to the busy node.
+	 */
+	if (ng->do_cling) {
+		if (busy_node(ng, dst_nid))
+			return false;
+
+		if (busy_node(ng, src_nid))
+			return true;
+	}
+
 	/*
 	 * Distribute memory according to CPU & memory use on each node,
 	 * with 3/4 hysteresis to avoid unnecessary memory migrations:
@@ -1874,9 +1919,191 @@ static int task_numa_migrate(struct task_struct *p)
 	return ret;
 }

+/*
+ * We scale the migration stat count to 1024, divide the maximum numa
+ * balancing scan period by 10 and make that the period of cling timer,
+ * this help to decay one count to 0 after one maximum scan period passed.
+ */
+#define NUMA_MIGRATE_SCALE 10
+#define NUMA_MIGRATE_WEIGHT 1024
+
+enum numa_migrate_stats {
+	FAILURE_SCALED,
+	TOTAL_SCALED,
+	FAILURE_RATIO,
+};
+
+static inline int mstat_idx(int nid, enum numa_migrate_stats s)
+{
+	return (nid + s * nr_node_ids);
+}
+
+static inline unsigned long
+mstat_failure_scaled(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, FAILURE_SCALED)];
+}
+
+static inline unsigned long
+mstat_total_scaled(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, TOTAL_SCALED)];
+}
+
+static inline unsigned long
+mstat_failure_ratio(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, FAILURE_RATIO)];
+}
+
+/*
+ * A node is busy when the numa migration toward it failed too much,
+ * this imply the load already unbalancing for too much numa cling on
+ * that node.
+ */
+static inline bool busy_node(struct numa_group *ng, int nid)
+{
+	int degree = sysctl_numa_balancing_cling_degree;
+
+	if (mstat_failure_scaled(ng, nid) < NUMA_MIGRATE_WEIGHT)
+		return false;
+
+	/*
+	 * Allow only one busy node in one numa group, to prevent
+	 * ping-pong migration case between nodes.
+	 */
+	if (ng->busiest_nid != nid)
+		return false;
+
+	return mstat_failure_ratio(ng, nid) > degree;
+}
+
+/*
+ * Return true if the task should cling to snid, when it preferred snid
+ * rather than dnid and snid is not busy.
+ */
+static inline bool
+task_numa_cling(struct task_struct *p, int snid, int dnid)
+{
+	bool ret = false;
+	int pnid = p->numa_preferred_nid;
+	struct numa_group *ng;
+
+	rcu_read_lock();
+
+	ng = p->numa_group;
+
+	/* Do cling only when the feature enabled and not in pause */
+	if (!ng || !ng->do_cling)
+		goto out;
+
+	if (pnid == NUMA_NO_NODE ||
+	    dnid == pnid ||
+	    snid != pnid)
+		goto out;
+
+	/* Never allow cling to a busy node */
+	if (busy_node(ng, snid))
+		goto out;
+
+	ret = true;
+out:
+	rcu_read_unlock();
+	return ret;
+}
+
+/*
+ * Prevent more tasks from preferring the busy node to ease the imbalance,
+ * and give the second candidate a chance.
+ */
+static inline bool group_pause_prefer(struct numa_group *ng, int nid)
+{
+	if (!ng || !ng->do_cling)
+		return false;
+
+	return busy_node(ng, nid);
+}
+
+static inline void update_failure_ratio(struct numa_group *ng, int nid)
+{
+	int f_idx = mstat_idx(nid, FAILURE_SCALED);
+	int t_idx = mstat_idx(nid, TOTAL_SCALED);
+	int fp_idx = mstat_idx(nid, FAILURE_RATIO);
+
+	ng->migrate_stat[fp_idx] =
+		ng->migrate_stat[f_idx] * 100 / (ng->migrate_stat[t_idx] + 1);
+}
+
+static void cling_timer_func(struct timer_list *t)
+{
+	int nid;
+	unsigned int degree;
+	unsigned long period, max_failure;
+	struct numa_group *ng = from_timer(ng, t, cling_timer);
+
+	degree = sysctl_numa_balancing_cling_degree;
+	period = msecs_to_jiffies(sysctl_numa_balancing_scan_period_max);
+	period /= NUMA_MIGRATE_SCALE;
+
+	spin_lock_irq(&ng->lock);
+
+	max_failure = 0;
+	for_each_online_node(nid) {
+		int f_idx = mstat_idx(nid, FAILURE_SCALED);
+		int t_idx = mstat_idx(nid, TOTAL_SCALED);
+
+		ng->migrate_stat[f_idx] /= 2;
+		ng->migrate_stat[t_idx] /= 2;
+
+		update_failure_ratio(ng, nid);
+
+		if (ng->migrate_stat[f_idx] > max_failure) {
+			ng->busiest_nid = nid;
+			max_failure = ng->migrate_stat[f_idx];
+		}
+	}
+
+	spin_unlock_irq(&ng->lock);
+
+	mod_timer(&ng->cling_timer, jiffies + period);
+}
+
+static inline void
+update_migrate_stat(struct task_struct *p, int nid, bool failed)
+{
+	int idx;
+	struct numa_group *ng = p->numa_group;
+
+	if (!ng || !ng->do_cling)
+		return;
+
+	spin_lock_irq(&ng->lock);
+
+	if (failed) {
+		idx = mstat_idx(nid, FAILURE_SCALED);
+		ng->migrate_stat[idx] += NUMA_MIGRATE_WEIGHT;
+	}
+
+	idx = mstat_idx(nid, TOTAL_SCALED);
+	ng->migrate_stat[idx] += NUMA_MIGRATE_WEIGHT;
+	update_failure_ratio(ng, nid);
+
+	spin_unlock_irq(&ng->lock);
+
+	/*
+	 * On failed task may prefer source node instead, this
+	 * cause ping-pong migration when numa cling enabled,
+	 * so let's reset the preferred node to none.
+	 */
+	if (failed)
+		sched_setnuma(p, NUMA_NO_NODE);
+}
+
 /* Attempt to migrate a task to a CPU on the preferred node. */
 static void numa_migrate_preferred(struct task_struct *p)
 {
+	bool failed;
+	int target;
 	unsigned long interval = HZ;

 	/* This task has no NUMA fault statistics yet */
@@ -1891,8 +2118,12 @@ static void numa_migrate_preferred(struct task_struct *p)
 	if (task_node(p) == p->numa_preferred_nid)
 		return;

+	target = p->numa_preferred_nid;
+
 	/* Otherwise, try migrate to a CPU on the preferred node */
-	task_numa_migrate(p);
+	failed = (task_numa_migrate(p) != 0);
+
+	update_migrate_stat(p, target, failed);
 }

 /*
@@ -2216,7 +2447,8 @@ static void task_numa_placement(struct task_struct *p)
 				max_faults = faults;
 				max_nid = nid;
 			}
-		} else if (group_faults > max_faults) {
+		} else if (group_faults > max_faults &&
+			   !group_pause_prefer(p->numa_group, nid)) {
 			max_faults = group_faults;
 			max_nid = nid;
 		}
@@ -2258,8 +2490,10 @@ void show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
 		return;
 	}

-	seq_printf(sf, "id %d nr_tasks %d active_nodes %d\n",
-		   ng->gid, ng->nr_tasks, ng->active_nodes);
+	spin_lock_irq(&ng->lock);
+
+	seq_printf(sf, "id %d nr_tasks %d active_nodes %d busiest_nid %d\n",
+		   ng->gid, ng->nr_tasks, ng->active_nodes, ng->busiest_nid);

 	for_each_online_node(nid) {
 		int f_idx = task_faults_idx(NUMA_MEM, nid, 0);
@@ -2270,9 +2504,16 @@ void show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
 		seq_printf(sf, "mem_private %lu mem_shared %lu ",
 			   ng->faults[f_idx], ng->faults[pf_idx]);

-		seq_printf(sf, "cpu_private %lu cpu_shared %lu\n",
+		seq_printf(sf, "cpu_private %lu cpu_shared %lu ",
 			   ng->faults_cpu[f_idx], ng->faults_cpu[pf_idx]);
+
+		seq_printf(sf, "migrate_stat %lu %lu %lu\n",
+			   mstat_failure_scaled(ng, nid),
+			   mstat_total_scaled(ng, nid),
+			   mstat_failure_ratio(ng, nid));
 	}
+
+	spin_unlock_irq(&ng->lock);
 }

 int update_tg_numa_group(struct task_group *tg, bool numa_group)
@@ -2286,20 +2527,26 @@ int update_tg_numa_group(struct task_group *tg, bool numa_group)
 	if (ng) {
 		/* put and evacuate tg's numa group */
 		rcu_assign_pointer(tg->numa_group, NULL);
+		del_timer_sync(&ng->cling_timer);
 		ng->evacuate = true;
 		put_numa_group(ng);
 	} else {
 		unsigned int size = sizeof(struct numa_group) +
-				    4*nr_node_ids*sizeof(unsigned long);
+				    7*nr_node_ids*sizeof(unsigned long);
+		unsigned int offset = NR_NUMA_HINT_FAULT_TYPES * nr_node_ids;

 		ng = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
 		if (!ng)
 			return -ENOMEM;

 		refcount_set(&ng->refcount, 1);
+		ng->busiest_nid = NUMA_NO_NODE;
+		ng->do_cling = true;
+		timer_setup(&ng->cling_timer, cling_timer_func, 0);
 		spin_lock_init(&ng->lock);
-		ng->faults_cpu = ng->faults + NR_NUMA_HINT_FAULT_TYPES *
-						nr_node_ids;
+		ng->faults_cpu = ng->faults + offset;
+		ng->migrate_stat = ng->faults_cpu + offset;
+		add_timer(&ng->cling_timer);
 		/* now make tasks see and join */
 		rcu_assign_pointer(tg->numa_group, ng);
 	}
@@ -2436,6 +2683,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 			return;

 		refcount_set(&grp->refcount, 1);
+		grp->busiest_nid = NUMA_NO_NODE;
 		grp->active_nodes = 1;
 		grp->max_faults_cpu = 0;
 		spin_lock_init(&grp->lock);
@@ -2879,6 +3127,11 @@ static inline void update_scan_period(struct task_struct *p, int new_cpu)
 {
 }

+static inline bool task_numa_cling(struct task_struct *p, int snid, int dnid)
+{
+	return false;
+}
+
 #endif /* CONFIG_NUMA_BALANCING */

 static void
@@ -6195,6 +6448,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;

+	/*
+	 * Failed to find an idle cpu, wake affine may want to pull but
+	 * try stay on prev-cpu when the task cling to it.
+	 */
+	if (task_numa_cling(p, cpu_to_node(prev), cpu_to_node(target)))
+		return prev;
+
 	return target;
 }

@@ -6671,6 +6931,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		if (want_affine)
 			current->recent_used_cpu = cpu;
 	}
+
 	rcu_read_unlock();

 	return new_cpu;
@@ -7342,7 +7603,8 @@ static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)

 	/* Migrating away from the preferred node is always bad. */
 	if (src_nid == p->numa_preferred_nid) {
-		if (env->src_rq->nr_running > env->src_rq->nr_preferred_running)
+		if (task_numa_cling(p, src_nid, dst_nid) ||
+		    env->src_rq->nr_running > env->src_rq->nr_preferred_running)
 			return 1;
 		else
 			return -1;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 078950d9605b..0a889dd1c7ed 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -417,6 +417,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "numa_balancing_cling_degree",
+		.data		= &sysctl_numa_balancing_cling_degree,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &max_numa_balancing_cling_degree,
+	},
 	{
 		.procname	= "numa_balancing",
 		.data		= NULL, /* filled in by handler */
-- 
2.14.4.44.g2045bb6



* Re: [PATCH 0/4] per cgroup numa suite
  2019-07-03  3:26 ` [PATCH 0/4] per cpu cgroup numa suite 王贇
                     ` (3 preceding siblings ...)
  2019-07-03  3:34   ` [PATCH 4/4] numa: introduce numa cling feature 王贇
@ 2019-07-11  9:00   ` 王贇
  2019-07-16  3:38   ` [PATCH v2 0/4] per-cgroup " 王贇
  5 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-07-11  9:00 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm, mcgrof, keescook, linux-fsdevel, cgroups

Hi folks,

What do you think about these patches?

In most of our tests the results show stable improvements, so we
consider this a generic problem and propose this solution in the
hope of helping to address it.

Comments are sincerely welcome :-)

Regards,
Michael Wang

On 2019/7/3 上午11:26, 王贇 wrote:
> During our torturing on numa stuff, we found problems like:
> 
>   * missing per-cgroup information about the per-node execution status
>   * missing per-cgroup information about the numa locality
> 
> That is when we have a cpu cgroup running with bunch of tasks, no good
> way to tell how it's tasks are dealing with numa.
> 
> The first two patches are trying to complete the missing pieces, but
> more problems appeared after monitoring these status:
> 
>   * tasks not always running on the preferred numa node
>   * tasks from same cgroup running on different nodes
> 
> The task numa group handler will always check if tasks are sharing pages
> and try to pack them into a single numa group, so they will have chance to
> settle down on the same node, but this failed in some cases:
> 
>   * workloads share page caches rather than share mappings
>   * workloads got too many wakeup across nodes
> 
> Since page caches are not traced by numa balancing, there are no way to
> realize such kind of relationship, and when there are too many wakeup,
> task will be drag from the preferred node and then migrate back by numa
> balancing, repeatedly.
> 
> Here the third patch try to address the first issue, we could now give hint
> to kernel about the relationship of tasks, and pack them into single numa
> group.
> 
> And the forth patch introduced numa cling, which try to address the wakup
> issue, now we try to make task stay on the preferred node on wakeup in fast
> path, in order to address the unbalancing risk, we monitoring the numa
> migration failure ratio, and pause numa cling when it reach the specified
> degree.
> 
> Michael Wang (4):
>   numa: introduce per-cgroup numa balancing locality statistic
>   numa: append per-node execution info in memory.numa_stat
>   numa: introduce numa group per task group
>   numa: introduce numa cling feature
> 
>  include/linux/memcontrol.h   |  37 ++++
>  include/linux/sched.h        |   8 +-
>  include/linux/sched/sysctl.h |   3 +
>  kernel/sched/core.c          |  37 ++++
>  kernel/sched/debug.c         |   7 +
>  kernel/sched/fair.c          | 455 ++++++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/sched.h         |  14 ++
>  kernel/sysctl.c              |   9 +
>  mm/memcontrol.c              |  66 +++++++
>  9 files changed, 628 insertions(+), 8 deletions(-)
> 
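
The pause condition described for numa cling above could be sketched
roughly as below. This is only an illustration under assumptions: the
function name, its parameters and the exact comparison are made up here,
and the real patch may track and compare the failure ratio differently.
The only thing taken from the series is the idea of a percentage
"degree" threshold (see numa_balancing_cling_degree).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: return true when clinging should be paused,
 * i.e. the share of failed NUMA migrations, in percent, has reached
 * @degree. Assumption: the ratio is failed / total over some window.
 */
static bool cling_should_pause(uint64_t failed, uint64_t total,
			       unsigned int degree)
{
	if (!total)
		return false;	/* no samples yet, nothing to judge */

	/* Compare failed/total >= degree/100 without dividing. */
	return failed * 100 >= (uint64_t)degree * total;
}
```

With a degree of 30, three failures out of ten migrations would pause
clinging, while two out of ten would not.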


* Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic
  2019-07-03  3:28   ` [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic 王贇
@ 2019-07-11 13:43     ` Peter Zijlstra
  2019-07-12  3:15       ` 王贇
  2019-07-11 13:47     ` Peter Zijlstra
  1 sibling, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2019-07-11 13:43 UTC (permalink / raw)
  To: 王贇
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel

On Wed, Jul 03, 2019 at 11:28:10AM +0800, 王贇 wrote:
> +#ifdef CONFIG_NUMA_BALANCING
> +
> +enum memcg_numa_locality_interval {
> +	PERCENT_0_29,
> +	PERCENT_30_39,
> +	PERCENT_40_49,
> +	PERCENT_50_59,
> +	PERCENT_60_69,
> +	PERCENT_70_79,
> +	PERCENT_80_89,
> +	PERCENT_90_100,
> +	NR_NL_INTERVAL,
> +};

That's just daft; why not make 8 equal sized buckets.

> +struct memcg_stat_numa {
> +	u64 locality[NR_NL_INTERVAL];
> +};

> +	if (remote || local) {
> +		idx = ((local * 10) / (remote + local)) - 2;

		idx = (NR_NL_INTERVAL * local) / (remote + local);

> +	}
> +
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(p);
> +	if (idx != -1)
> +		this_cpu_inc(memcg->stat_numa->locality[idx]);
> +	rcu_read_unlock();
> +}
> +#endif
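
For reference, the equal-sized bucketing suggested above could be
sketched roughly as follows. This is a standalone illustration, not code
from the patch: NR_BUCKETS and locality_bucket() are made-up names, and
the clamp for the all-local case is an assumption added here, since with
remote == 0 the one-liner alone yields an index equal to the bucket
count.

```c
#include <stdint.h>

/* Hypothetical bucket count, mirroring NR_NL_INTERVAL in the patch. */
#define NR_BUCKETS 8

/*
 * Map a local/remote fault pair onto one of NR_BUCKETS equal-sized
 * locality buckets. Returns -1 when no faults were recorded.
 */
static int locality_bucket(uint64_t local, uint64_t remote)
{
	uint64_t total = local + remote;
	uint64_t idx;

	if (!total)
		return -1;

	idx = (NR_BUCKETS * local) / total;

	/* local == total gives NR_BUCKETS, so clamp to the last bucket. */
	if (idx >= NR_BUCKETS)
		idx = NR_BUCKETS - 1;

	return (int)idx;
}
```

Each bucket then covers 12.5% of the locality range, and a task with
equal local and remote faults lands in bucket 4.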


* Re: [PATCH 2/4] numa: append per-node execution info in memory.numa_stat
  2019-07-03  3:29   ` [PATCH 2/4] numa: append per-node execution info in memory.numa_stat 王贇
@ 2019-07-11 13:45     ` Peter Zijlstra
  2019-07-12  3:17       ` 王贇
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2019-07-11 13:45 UTC (permalink / raw)
  To: 王贇
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel

On Wed, Jul 03, 2019 at 11:29:15AM +0800, 王贇 wrote:

> +++ b/include/linux/memcontrol.h
> @@ -190,6 +190,7 @@ enum memcg_numa_locality_interval {
> 
>  struct memcg_stat_numa {
>  	u64 locality[NR_NL_INTERVAL];
> +	u64 exectime;

Maybe call the field jiffies, because that's what it counts.

>  };
> 
>  #endif
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2edf3f5ac4b9..d5f48365770f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3575,6 +3575,18 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
>  		seq_printf(m, " %u", jiffies_to_msecs(sum));
>  	}
>  	seq_putc(m, '\n');
> +
> +	seq_puts(m, "exectime");
> +	for_each_online_node(nr) {
> +		int cpu;
> +		u64 sum = 0;
> +
> +		for_each_cpu(cpu, cpumask_of_node(nr))
> +			sum += per_cpu(memcg->stat_numa->exectime, cpu);
> +
> +		seq_printf(m, " %llu", jiffies_to_msecs(sum));
> +	}
> +	seq_putc(m, '\n');
>  #endif
> 
>  	return 0;
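
The per-node summation in the quoted hunk boils down to adding up the
per-CPU counters of the CPUs that belong to a node. A toy userspace
sketch of that idea follows; the fixed CPU-to-node table and all names
are assumptions for illustration, standing in for cpumask_of_node() and
per_cpu() in the kernel.

```c
#include <stdint.h>

#define NR_CPUS_TOY 4	/* hypothetical machine size */

/* Hypothetical topology: CPUs 0-1 on node 0, CPUs 2-3 on node 1. */
static const int cpu_node[NR_CPUS_TOY] = { 0, 0, 1, 1 };

/* Sum the per-CPU counters of every CPU that belongs to node @nid. */
static uint64_t node_exectime(const uint64_t percpu[NR_CPUS_TOY], int nid)
{
	uint64_t sum = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_TOY; cpu++) {
		if (cpu_node[cpu] == nid)
			sum += percpu[cpu];
	}

	return sum;
}
```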


* Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic
  2019-07-03  3:28   ` [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic 王贇
  2019-07-11 13:43     ` Peter Zijlstra
@ 2019-07-11 13:47     ` Peter Zijlstra
  2019-07-12  3:43       ` 王贇
  1 sibling, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2019-07-11 13:47 UTC (permalink / raw)
  To: 王贇
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel

On Wed, Jul 03, 2019 at 11:28:10AM +0800, 王贇 wrote:

> @@ -3562,10 +3563,53 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
>  		seq_putc(m, '\n');
>  	}
> 
> +#ifdef CONFIG_NUMA_BALANCING
> +	seq_puts(m, "locality");
> +	for (nr = 0; nr < NR_NL_INTERVAL; nr++) {
> +		int cpu;
> +		u64 sum = 0;
> +
> +		for_each_possible_cpu(cpu)
> +			sum += per_cpu(memcg->stat_numa->locality[nr], cpu);
> +
> +		seq_printf(m, " %u", jiffies_to_msecs(sum));
> +	}
> +	seq_putc(m, '\n');
> +#endif
> +
>  	return 0;
>  }
>  #endif /* CONFIG_NUMA */
> 
> +#ifdef CONFIG_NUMA_BALANCING
> +
> +void memcg_stat_numa_update(struct task_struct *p)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned long remote = p->numa_faults_locality[3];
> +	unsigned long local = p->numa_faults_locality[4];
> +	unsigned long idx = -1;
> +
> +	if (mem_cgroup_disabled())
> +		return;
> +
> +	if (remote || local) {
> +		idx = ((local * 10) / (remote + local)) - 2;
> +		/* 0~29% in one slot for cache align */
> +		if (idx < PERCENT_0_29)
> +			idx = PERCENT_0_29;
> +		else if (idx >= NR_NL_INTERVAL)
> +			idx = NR_NL_INTERVAL - 1;
> +	}
> +
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(p);
> +	if (idx != -1)
> +		this_cpu_inc(memcg->stat_numa->locality[idx]);

I thought cgroups were supposed to be hierarchical. That is, if we have:

          R
	 / \
	 A
	/\
	  B
	  \
	   t1

Then our task t1 should be accounted to B (as you do), but also to A and
R.

> +	rcu_read_unlock();
> +}
> +#endif
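
As an illustration of what hierarchical accounting means here, a sample
charged to B would also be charged to every ancestor. The sketch below
is a minimal userspace model under assumptions: struct group and
account_locality() are hypothetical stand-ins, not the memcg API, and
the per-CPU counters of the real patch are ignored.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_BUCKETS 8	/* hypothetical, mirrors NR_NL_INTERVAL */

/* Hypothetical stand-in for a cgroup with per-group locality counters. */
struct group {
	struct group *parent;		/* NULL for the root */
	uint64_t locality[NR_BUCKETS];
};

/* Charge one sample to @g and to every ancestor up to the root. */
static void account_locality(struct group *g, int idx)
{
	if (idx < 0 || idx >= NR_BUCKETS)
		return;

	for (; g; g = g->parent)
		g->locality[idx]++;
}
```

With the R/A/B tree above, one sample from t1 increments the same
bucket in B, A and R, at the cost of one extra update per level.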


* Re: [PATCH 3/4] numa: introduce numa group per task group
  2019-07-03  3:32   ` [PATCH 3/4] numa: introduce numa group per task group 王贇
@ 2019-07-11 14:10     ` Peter Zijlstra
  2019-07-12  4:03       ` 王贇
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2019-07-11 14:10 UTC (permalink / raw)
  To: 王贇
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel

On Wed, Jul 03, 2019 at 11:32:32AM +0800, 王贇 wrote:
> By tracing numa page faults, we recognize tasks sharing the same page,
> and try pack them together into a single numa group.
> 
> However when two task share lot's of cache pages while not much
> anonymous pages, since numa balancing do not tracing cache page, they
> have no chance to join into the same group.
> 
> While tracing cache page cost too much, we could use some hints from

I forgot; where again do we skip shared pages? task_numa_work() doesn't
seem to skip file vmas.

> userland and cpu cgroup could be a good one.
> 
> This patch introduced new entry 'numa_group' for cpu cgroup, by echo
> non-zero into the entry, we can now force all the tasks of this cgroup
> to join the same numa group serving for task group.
> 
> In this way tasks are more likely to settle down on the same node, to
> share closer cpu cache and gain benefit from NUMA on both file/anonymous
> pages.
> 
> Besides, when multiple cgroup enabled numa group, they will be able to
> exchange task location by utilizing numa migration, in this way they
> could achieve single node settle down without breaking load balance.

I dislike cgroup only interfaces; is there really nothing else we could
use for this?



* Re: [PATCH 4/4] numa: introduce numa cling feature
  2019-07-03  3:34   ` [PATCH 4/4] numa: introduce numa cling feature 王贇
  2019-07-08  2:25     ` [PATCH v2 " 王贇
@ 2019-07-11 14:27     ` Peter Zijlstra
  2019-07-12  3:10       ` 王贇
  1 sibling, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2019-07-11 14:27 UTC (permalink / raw)
  To: 王贇
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel

On Wed, Jul 03, 2019 at 11:34:16AM +0800, 王贇 wrote:
> Although we paid so many effort to settle down task on a particular
> node, there are still chances for a task to leave it's preferred
> node, that is by wakeup, numa swap migrations or load balance.
> 
> When we are using cpu cgroup in share way, since all the workloads
> see all the cpus, it could be really bad especially when there
> are too many fast wakeup, although now we can numa group the tasks,
> they won't really stay on the same node, for example we have numa
> group ng_A, ng_B, ng_C, ng_D, it's very likely result as:
> 
> 	CPU Usage:
> 		Node 0		Node 1
> 		ng_A(600%)	ng_A(400%)
> 		ng_B(400%)	ng_B(600%)
> 		ng_C(400%)	ng_C(600%)
> 		ng_D(600%)	ng_D(400%)
> 
> 	Memory Ratio:
> 		Node 0		Node 1
> 		ng_A(60%)	ng_A(40%)
> 		ng_B(40%)	ng_B(60%)
> 		ng_C(40%)	ng_C(60%)
> 		ng_D(60%)	ng_D(40%)
> 
> Locality won't be too bad but far from the best situation, we want
> a numa group to settle down thoroughly on a particular node, with
> every thing balanced.
> 
> Thus we introduce the numa cling, which try to prevent tasks leaving
> the preferred node on wakeup fast path.


> @@ -6195,6 +6447,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  	if ((unsigned)i < nr_cpumask_bits)
>  		return i;
> 
> +	/*
> +	 * Failed to find an idle cpu, wake affine may want to pull but
> +	 * try stay on prev-cpu when the task cling to it.
> +	 */
> +	if (task_numa_cling(p, cpu_to_node(prev), cpu_to_node(target)))
> +		return prev;
> +
>  	return target;
>  }

Select idle sibling should never cross node boundaries and is thus the
entirely wrong place to fix anything.


* Re: [PATCH 4/4] numa: introduce numa cling feature
  2019-07-11 14:27     ` [PATCH " Peter Zijlstra
@ 2019-07-12  3:10       ` 王贇
  2019-07-12  7:53         ` Peter Zijlstra
  0 siblings, 1 reply; 62+ messages in thread
From: 王贇 @ 2019-07-12  3:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel



On 2019/7/11 下午10:27, Peter Zijlstra wrote:
[snip]
>> Thus we introduce the numa cling, which try to prevent tasks leaving
>> the preferred node on wakeup fast path.
> 
> 
>> @@ -6195,6 +6447,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>  	if ((unsigned)i < nr_cpumask_bits)
>>  		return i;
>>
>> +	/*
>> +	 * Failed to find an idle cpu, wake affine may want to pull but
>> +	 * try stay on prev-cpu when the task cling to it.
>> +	 */
>> +	if (task_numa_cling(p, cpu_to_node(prev), cpu_to_node(target)))
>> +		return prev;
>> +
>>  	return target;
>>  }
> 
> Select idle sibling should never cross node boundaries and is thus the
> entirely wrong place to fix anything.

Hmm.. in our early testing, printk showed that both select_task_rq_fair() and
task_numa_find_cpu() call select_idle_sibling() with prev and target on
different nodes, so we picked this point to save a few lines.

But if the semantics of select_idle_sibling() is to return a cpu on the same
node as target, what about moving the logic after select_idle_sibling() in
the two callers?

Regards,
Michael Wang

> 


* Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic
  2019-07-11 13:43     ` Peter Zijlstra
@ 2019-07-12  3:15       ` 王贇
  0 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-07-12  3:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel



On 2019/7/11 下午9:43, Peter Zijlstra wrote:
> On Wed, Jul 03, 2019 at 11:28:10AM +0800, 王贇 wrote:
>> +#ifdef CONFIG_NUMA_BALANCING
>> +
>> +enum memcg_numa_locality_interval {
>> +	PERCENT_0_29,
>> +	PERCENT_30_39,
>> +	PERCENT_40_49,
>> +	PERCENT_50_59,
>> +	PERCENT_60_69,
>> +	PERCENT_70_79,
>> +	PERCENT_80_89,
>> +	PERCENT_90_100,
>> +	NR_NL_INTERVAL,
>> +};
> 
> That's just daft; why not make 8 equal sized buckets.
> 
>> +struct memcg_stat_numa {
>> +	u64 locality[NR_NL_INTERVAL];
>> +};
> 
>> +	if (remote || local) {
>> +		idx = ((local * 10) / (remote + local)) - 2;
> 
> 		idx = (NR_NL_INTERVAL * local) / (remote + local);

Makes sense; we actually want to observe the situation rather than
the ratio itself. This will be in the next version.

Regards,
Michael Wang

> 
>> +	}
>> +
>> +	rcu_read_lock();
>> +	memcg = mem_cgroup_from_task(p);
>> +	if (idx != -1)
>> +		this_cpu_inc(memcg->stat_numa->locality[idx]);
>> +	rcu_read_unlock();
>> +}
>> +#endif


* Re: [PATCH 2/4] numa: append per-node execution info in memory.numa_stat
  2019-07-11 13:45     ` Peter Zijlstra
@ 2019-07-12  3:17       ` 王贇
  0 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-07-12  3:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel



On 2019/7/11 下午9:45, Peter Zijlstra wrote:
> On Wed, Jul 03, 2019 at 11:29:15AM +0800, 王贇 wrote:
> 
>> +++ b/include/linux/memcontrol.h
>> @@ -190,6 +190,7 @@ enum memcg_numa_locality_interval {
>>
>>  struct memcg_stat_numa {
>>  	u64 locality[NR_NL_INTERVAL];
>> +	u64 exectime;
> 
> Maybe call the field jiffies, because that's what it counts.

Sure, will be in next version.

Regards,
Michael Wang

> 
>>  };
>>
>>  #endif
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 2edf3f5ac4b9..d5f48365770f 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3575,6 +3575,18 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
>>  		seq_printf(m, " %u", jiffies_to_msecs(sum));
>>  	}
>>  	seq_putc(m, '\n');
>> +
>> +	seq_puts(m, "exectime");
>> +	for_each_online_node(nr) {
>> +		int cpu;
>> +		u64 sum = 0;
>> +
>> +		for_each_cpu(cpu, cpumask_of_node(nr))
>> +			sum += per_cpu(memcg->stat_numa->exectime, cpu);
>> +
>> +		seq_printf(m, " %llu", jiffies_to_msecs(sum));
>> +	}
>> +	seq_putc(m, '\n');
>>  #endif
>>
>>  	return 0;


* Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic
  2019-07-11 13:47     ` Peter Zijlstra
@ 2019-07-12  3:43       ` 王贇
  2019-07-12  7:58         ` Peter Zijlstra
  0 siblings, 1 reply; 62+ messages in thread
From: 王贇 @ 2019-07-12  3:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel



On 2019/7/11 下午9:47, Peter Zijlstra wrote:
[snip]
>> +	rcu_read_lock();
>> +	memcg = mem_cgroup_from_task(p);
>> +	if (idx != -1)
>> +		this_cpu_inc(memcg->stat_numa->locality[idx]);
> 
> I thought cgroups were supposed to be hierarchical. That is, if we have:
> 
>           R
> 	 / \
> 	 A
> 	/\
> 	  B
> 	  \
> 	   t1
> 
> Then our task t1 should be accounted to B (as you do), but also to A and
> R.

I get the point, but I'm not quite sure about this...

Unlike pages, there is no hierarchical limit on locality, and tasks
running in a particular group have no influence on others. Not to mention the
extra overhead: is it really meaningful to account this hierarchically?

Regards,
Michael Wang

> 
>> +	rcu_read_unlock();
>> +}
>> +#endif


* Re: [PATCH 3/4] numa: introduce numa group per task group
  2019-07-11 14:10     ` Peter Zijlstra
@ 2019-07-12  4:03       ` 王贇
  0 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-07-12  4:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel



On 2019/7/11 下午10:10, Peter Zijlstra wrote:
> On Wed, Jul 03, 2019 at 11:32:32AM +0800, 王贇 wrote:
>> By tracing numa page faults, we recognize tasks sharing the same page,
>> and try pack them together into a single numa group.
>>
>> However when two task share lot's of cache pages while not much
>> anonymous pages, since numa balancing do not tracing cache page, they
>> have no chance to join into the same group.
>>
>> While tracing cache page cost too much, we could use some hints from
> 
> I forgot; where again do we skip shared pages? task_numa_work() doesn't
> seem to skip file vmas.

That refers to the page cache generated by file read/write, rather than the
pages of file mappings. Pages that back IO are also not considered shared
between tasks, since they don't belong to any particular task but may
serve several of them.

> 
>> userland and cpu cgroup could be a good one.
>>
>> This patch introduced new entry 'numa_group' for cpu cgroup, by echo
>> non-zero into the entry, we can now force all the tasks of this cgroup
>> to join the same numa group serving for task group.
>>
>> In this way tasks are more likely to settle down on the same node, to
>> share closer cpu cache and gain benefit from NUMA on both file/anonymous
>> pages.
>>
>> Besides, when multiple cgroup enabled numa group, they will be able to
>> exchange task location by utilizing numa migration, in this way they
>> could achieve single node settle down without breaking load balance.
> 
> I dislike cgroup only interfaces; is there really nothing else we could
> use for this?

Me too... but at this moment that's the best approach we have. We also
tried using a separate module to handle this automatically, but that needs
a very good understanding of the system, configuration and workloads, which
only the owner has.

So maybe just providing the functionality and leaving the choice to the user
is not that bad?

Regards,
Michael Wang

> 


* Re: [PATCH 4/4] numa: introduce numa cling feature
  2019-07-12  3:10       ` 王贇
@ 2019-07-12  7:53         ` Peter Zijlstra
  2019-07-12  8:58           ` 王贇
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2019-07-12  7:53 UTC (permalink / raw)
  To: 王贇
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel

On Fri, Jul 12, 2019 at 11:10:08AM +0800, 王贇 wrote:
> On 2019/7/11 下午10:27, Peter Zijlstra wrote:

> >> Thus we introduce the numa cling, which try to prevent tasks leaving
> >> the preferred node on wakeup fast path.
> > 
> > 
> >> @@ -6195,6 +6447,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >>  	if ((unsigned)i < nr_cpumask_bits)
> >>  		return i;
> >>
> >> +	/*
> >> +	 * Failed to find an idle cpu, wake affine may want to pull but
> >> +	 * try stay on prev-cpu when the task cling to it.
> >> +	 */
> >> +	if (task_numa_cling(p, cpu_to_node(prev), cpu_to_node(target)))
> >> +		return prev;
> >> +
> >>  	return target;
> >>  }
> > 
> > Select idle sibling should never cross node boundaries and is thus the
> > entirely wrong place to fix anything.
> 
> Hmm.. in our early testing the printk show both select_task_rq_fair() and
> task_numa_find_cpu() will call select_idle_sibling with prev and target on
> different node, thus we pick this point to save few lines.

But it will never return @prev if it is not in the same cache domain as
@target. See how everything is gated by:

  && cpus_share_cache(x, target)

> But if the semantics of select_idle_sibling() is to return cpu on the same
> node of target, what about move the logical after select_idle_sibling() for
> the two callers?

No, that's insane. You don't do select_idle_sibling() to then ignore the
result. You have to change @target before calling select_idle_sibling().
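
A rough sketch of that ordering, with the decision made on @target
before the idle search: everything here is a simplified userspace model
under assumptions. toy_cpu_to_node() and pick_wake_target() are made-up
names, and the clings flag stands in for whatever a
task_numa_cling()-style check would return.

```c
#include <stdbool.h>

/* Hypothetical topology helper: two CPUs per node. */
static int toy_cpu_to_node(int cpu)
{
	return cpu / 2;
}

/*
 * Decide the wakeup target before any idle-sibling search runs.
 * If the task clings to prev's node and the wake-affine target sits
 * on another node, search around prev instead.
 */
static int pick_wake_target(int prev, int target, bool clings)
{
	if (clings && toy_cpu_to_node(prev) != toy_cpu_to_node(target))
		return prev;

	return target;	/* keep the wake-affine choice */
}
```

The caller would then hand the returned cpu to the idle-sibling search
as its target, so the search result is never overridden afterwards.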


* Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic
  2019-07-12  3:43       ` 王贇
@ 2019-07-12  7:58         ` Peter Zijlstra
  2019-07-12  9:11           ` 王贇
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2019-07-12  7:58 UTC (permalink / raw)
  To: 王贇
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel

On Fri, Jul 12, 2019 at 11:43:17AM +0800, 王贇 wrote:
> 
> 
> On 2019/7/11 下午9:47, Peter Zijlstra wrote:
> [snip]
> >> +	rcu_read_lock();
> >> +	memcg = mem_cgroup_from_task(p);
> >> +	if (idx != -1)
> >> +		this_cpu_inc(memcg->stat_numa->locality[idx]);
> > 
> > I thought cgroups were supposed to be hierarchical. That is, if we have:
> > 
> >           R
> > 	 / \
> > 	 A
> > 	/\
> > 	  B
> > 	  \
> > 	   t1
> > 
> > Then our task t1 should be accounted to B (as you do), but also to A and
> > R.
> 
> I get the point but not quite sure about this...
> 
> Not like pages there are no hierarchical limitation on locality, also tasks

You can use cpusets to affect that.

> running in a particular group have no influence to others, not to mention the
> extra overhead, does it really meaningful to account the stuff hierarchically?

AFAIU it's a requirement of cgroups to be hierarchical. All our other
cgroup accounting is like that.


* Re: [PATCH 4/4] numa: introduce numa cling feature
  2019-07-12  7:53         ` Peter Zijlstra
@ 2019-07-12  8:58           ` 王贇
  2019-07-22  3:44             ` 王贇
  0 siblings, 1 reply; 62+ messages in thread
From: 王贇 @ 2019-07-12  8:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel



On 2019/7/12 下午3:53, Peter Zijlstra wrote:
[snip]
>>>>  	return target;
>>>>  }
>>>
>>> Select idle sibling should never cross node boundaries and is thus the
>>> entirely wrong place to fix anything.
>>
>> Hmm.. in our early testing the printk show both select_task_rq_fair() and
>> task_numa_find_cpu() will call select_idle_sibling with prev and target on
>> different node, thus we pick this point to save few lines.
> 
> But it will never return @prev if it is not in the same cache domain as
> @target. See how everything is gated by:
> 
>   && cpus_share_cache(x, target)

Yeah, that's right.

> 
>> But if the semantics of select_idle_sibling() is to return cpu on the same
>> node of target, what about move the logical after select_idle_sibling() for
>> the two callers?
> 
> No, that's insane. You don't do select_idle_sibling() to then ignore the
> result. You have to change @target before calling select_idle_sibling().
> 

I see, we should not override the decision of select_idle_sibling().

Actually, the original design we tried to achieve was:

  let wake affine select the target
  try find idle sibling of target
  if got one
	pick it
  else if task cling to prev
	pick prev

That is, to consider wake affine superior to numa cling.

But after rethinking, maybe this is not necessary, since numa cling is
also a kind of strong wake affine hint, and maybe even a better
one for filtering out the bad cases.

I'll try changing @target instead and retest then.

Regards,
Michael Wang


* Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic
  2019-07-12  7:58         ` Peter Zijlstra
@ 2019-07-12  9:11           ` 王贇
  2019-07-12  9:42             ` Peter Zijlstra
  0 siblings, 1 reply; 62+ messages in thread
From: 王贇 @ 2019-07-12  9:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel



On 2019/7/12 下午3:58, Peter Zijlstra wrote:
[snip]
>>>
>>> Then our task t1 should be accounted to B (as you do), but also to A and
>>> R.
>>
>> I get the point but not quite sure about this...
>>
>> Not like pages there are no hierarchical limitation on locality, also tasks
> 
> You can use cpusets to affect that.

Could you please give more detail on this?

> 
>> running in a particular group have no influence to others, not to mention the
>> extra overhead, does it really meaningful to account the stuff hierarchically?
> 
> AFAIU it's a requirement of cgroups to be hierarchical. All our other
> cgroup accounting is like that.

Ok, should respect the convention :-)

Regards,
Michael Wang

> 


* Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic
  2019-07-12  9:11           ` 王贇
@ 2019-07-12  9:42             ` Peter Zijlstra
  2019-07-12 10:10               ` 王贇
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2019-07-12  9:42 UTC (permalink / raw)
  To: 王贇
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel

On Fri, Jul 12, 2019 at 05:11:25PM +0800, 王贇 wrote:
> 
> 
> On 2019/7/12 下午3:58, Peter Zijlstra wrote:
> [snip]
> >>>
> >>> Then our task t1 should be accounted to B (as you do), but also to A and
> >>> R.
> >>
> >> I get the point but not quite sure about this...
> >>
> >> Not like pages there are no hierarchical limitation on locality, also tasks
> > 
> > You can use cpusets to affect that.
> 
> Could you please give more detail on this?

Documentation/cgroup-v1/cpusets.txt

Look for mems_allowed.


* Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic
  2019-07-12  9:42             ` Peter Zijlstra
@ 2019-07-12 10:10               ` 王贇
  2019-07-15  2:09                 ` 王贇
  2019-07-15 12:10                 ` Michal Koutný
  0 siblings, 2 replies; 62+ messages in thread
From: 王贇 @ 2019-07-12 10:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel



On 2019/7/12 下午5:42, Peter Zijlstra wrote:
> On Fri, Jul 12, 2019 at 05:11:25PM +0800, 王贇 wrote:
>>
>>
>> On 2019/7/12 下午3:58, Peter Zijlstra wrote:
>> [snip]
>>>>>
>>>>> Then our task t1 should be accounted to B (as you do), but also to A and
>>>>> R.
>>>>
>>>> I get the point but not quite sure about this...
>>>>
>>>> Not like pages there are no hierarchical limitation on locality, also tasks
>>>
>>> You can use cpusets to affect that.
>>
>> Could you please give more detail on this?
> 
> Documentation/cgroup-v1/cpusets.txt
> 
> Look for mems_allowed.

This attribute belongs to the cpuset cgroup, doesn't it?

Forgive me, but I have no idea how to combine this
with the memory cgroup's hierarchical locality update...
a parent memory cgroup has no influence on the mems_allowed
of its children, correct?

What if we just account the locality status of a child
memory group to its ancestors?

Regards,
Michael Wang

> 


* Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic
  2019-07-12 10:10               ` 王贇
@ 2019-07-15  2:09                 ` 王贇
  2019-07-15 12:10                 ` Michal Koutný
  1 sibling, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-07-15  2:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel



On 2019/7/12 下午6:10, 王贇 wrote:
[snip]
>>
>> Documentation/cgroup-v1/cpusets.txt
>>
>> Look for mems_allowed.
> 
> This is the attribute belong to cpuset cgroup isn't it?
> 
> Forgive me but I have no idea on how to combined this
> with memory cgroup's locality hierarchical update...
> parent memory cgroup do not have influence on mems_allowed
> to it's children, correct?
> 
> What about we just account the locality status of child
> memory group into it's ancestors?

We have rethought this and found no strong reason to stay
with the memory cgroup anymore.

We used to acquire page counts, exectime and locality together
from the memory cgroup to make things easier for our numa balancer
module. As we now use the numa group approach, maybe we can just
move this accounting into the cpu cgroup, so all these features
stay in one subsys and can be hierarchical :-)

Regards,
Michael Wang

> 
> Regards,
> Michael Wang
> 
>>


* Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic
  2019-07-12 10:10               ` 王贇
  2019-07-15  2:09                 ` 王贇
@ 2019-07-15 12:10                 ` Michal Koutný
  2019-07-16  2:41                   ` 王贇
  1 sibling, 1 reply; 62+ messages in thread
From: Michal Koutný @ 2019-07-15 12:10 UTC (permalink / raw)
  To: 王贇
  Cc: Peter Zijlstra, keescook, hannes, vdavydov.dev, mcgrof, mhocko,
	linux-mm, Ingo Molnar, riel, Mel Gorman, cgroups, linux-fsdevel,
	linux-kernel

Hello Yun.

On Fri, Jul 12, 2019 at 06:10:24PM +0800, 王贇  <yun.wang@linux.alibaba.com> wrote:
> Forgive me, but I have no idea how to combine this
> with the memory cgroup's hierarchical locality update...
> A parent memory cgroup has no influence on the mems_allowed
> of its children, correct?
I'd recommend looking at v2 of the cpuset controller, which implements
hierarchical behavior among the configured memory node sets.

(My comment would better fit
    [PATCH 3/4] numa: introduce numa group per task group
IIUC, you could use the cpuset controller to constrain memory nodes.)

For the second part (accessing numa statistics, i.e. this patch), I
wonder whether this information wouldn't be better presented under the
cpuset controller too.

HTH,
Michal

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic
  2019-07-15 12:10                 ` Michal Koutný
@ 2019-07-16  2:41                   ` 王贇
  2019-07-19 16:47                     ` Michal Koutný
  0 siblings, 1 reply; 62+ messages in thread
From: 王贇 @ 2019-07-16  2:41 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Peter Zijlstra, keescook, hannes, vdavydov.dev, mcgrof, mhocko,
	linux-mm, Ingo Molnar, riel, Mel Gorman, cgroups, linux-fsdevel,
	linux-kernel

Hi Michal,

Thx for the comments :-)

On 2019/7/15 下午8:10, Michal Koutný wrote:
> Hello Yun.
> 
> On Fri, Jul 12, 2019 at 06:10:24PM +0800, 王贇  <yun.wang@linux.alibaba.com> wrote:
>> Forgive me, but I have no idea how to combine this
>> with the memory cgroup's hierarchical locality update...
>> A parent memory cgroup has no influence on the mems_allowed
>> of its children, correct?
> I'd recommend looking at v2 of the cpuset controller, which implements
> hierarchical behavior among the configured memory node sets.

Actually, whatever the memory node sets or allowed cpu sets are, they
affect the task's behavior regarding memory location and cpu
location, while locality only cares about the results rather than
the sets.

For example, if we bind tasks to the cpus of node 0 and allow memory
only on node 1, by cgroup controller or madvise, then they will run
on node 0 with all their memory on node 1. On each numa balancing PF,
the task will access a page on node 1 remotely from node 0, so the
locality will always be 0.

> 
> (My comment would better fit
>     [PATCH 3/4] numa: introduce numa group per task group
> IIUC, you could use the cpuset controller to constrain memory nodes.)
> 
> For the second part (accessing numa statistics, i.e. this patch), I
> wonder whether this information wouldn't be better presented under the
> cpuset controller too.

Yeah, we realized the cpu cgroup could be a better place to hold these
new statistics. Both locality and exectime describe a task's running
behavior, related to memory location but not to memory behavior. Will
apply this in the next version.

Regards,
Michael Wang

> 
> HTH,
> Michal
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v2 0/4] per-cgroup numa suite
  2019-07-03  3:26 ` [PATCH 0/4] per cpu cgroup numa suite 王贇
                     ` (4 preceding siblings ...)
  2019-07-11  9:00   ` [PATCH 0/4] per cgroup numa suite 王贇
@ 2019-07-16  3:38   ` 王贇
  2019-07-16  3:39     ` [PATCH v2 1/4] numa: introduce per-cgroup numa balancing locality statistic 王贇
                       ` (5 more replies)
  5 siblings, 6 replies; 62+ messages in thread
From: 王贇 @ 2019-07-16  3:38 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm, mcgrof, keescook, linux-fsdevel, cgroups,
	Michal Koutný,
	Hillf Danton

During our torture testing of numa, we found problems like:

  * missing per-cgroup information about the per-node execution status
  * missing per-cgroup information about the numa locality

That is, when a cpu cgroup is running a bunch of tasks, there is no
good way to tell how its tasks are dealing with numa.

The first two patches try to fill in the missing pieces, but more
problems appeared once we started monitoring these statistics:

  * tasks not always running on the preferred numa node
  * tasks from same cgroup running on different nodes

The task numa group handler always checks whether tasks are sharing
pages and tries to pack them into a single numa group, so they have a
chance to settle down on the same node, but this fails in some cases:

  * workloads share page caches rather than mappings
  * workloads get too many wakeups across nodes

Since page caches are not traced by numa balancing, there is no way to
recognize this kind of relationship, and when there are too many
wakeups, a task will be dragged away from the preferred node and then
migrated back by numa balancing, repeatedly.

The third patch tries to address the first issue: we can now give the
kernel a hint about the relationship of tasks, and pack them into a
single numa group.

The fourth patch introduces numa cling, which tries to address the
wakeup issue: we now try to keep a task on its preferred node on wakeup
in the fast path. To address the risk of unbalancing, we monitor the
numa migration failure ratio, and pause numa cling when it reaches the
specified degree.

Since v1:
  * move statistics from the memory cgroup into the cpu cgroup
  * statistics are now accounted hierarchically
  * locality is now accounted into 8 equal regions
  * numa cling no longer overrides select_idle_sibling; instead we
    prevent numa swap migration with tasks clinging to dst-node, and
    prevent wake affine from dragging away tasks which already cling
    to prev-cpu
  * other refinements to comments and names

Michael Wang (4):
  v2 numa: introduce per-cgroup numa balancing locality statistic
  v2 numa: append per-node execution time in cpu.numa_stat
  v2 numa: introduce numa group per task group
  v4 numa: introduce numa cling feature

 include/linux/sched.h        |   8 +-
 include/linux/sched/sysctl.h |   3 +
 kernel/sched/core.c          |  85 ++++++++
 kernel/sched/debug.c         |   7 +
 kernel/sched/fair.c          | 510 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h         |  41 ++++
 kernel/sysctl.c              |   9 +
 7 files changed, 651 insertions(+), 12 deletions(-)

-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v2 1/4] numa: introduce per-cgroup numa balancing locality statistic
  2019-07-16  3:38   ` [PATCH v2 0/4] per-cgroup " 王贇
@ 2019-07-16  3:39     ` 王贇
  2019-07-16  3:40     ` [PATCH v2 2/4] numa: append per-node execution time in cpu.numa_stat 王贇
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-07-16  3:39 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm, mcgrof, keescook, linux-fsdevel, cgroups,
	Michal Koutný,
	Hillf Danton

This patch introduces the numa locality statistic, which tries to
indicate the numa balancing efficiency per cpu cgroup.

On numa balancing, we trace the ratio of local page accesses of tasks,
which we call the locality.

By doing 'cat /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_stat', we
see an output line starting with 'locality', like:

  locality 15393 21259 13023 44461 21247 17012 28496 145402

Locality is divided into 8 regions, each number standing for the
milliseconds we hit a task running with the locality within that region.
For example, here we have tasks with locality around 0~12% running for
15393 ms, and tasks with locality around 88~100% running for 145402 ms.
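
As an illustration of how a task lands in one of the 8 regions, here is
a minimal userspace sketch of the index calculation used by
update_tg_numa_stat() in this patch. The function names
locality_region() and print_region_range() are made up for this example;
only the formula and NR_NL_INTERVAL come from the patch.

```c
#include <stdio.h>

#define NR_NL_INTERVAL 8	/* number of locality regions, as in the patch */

/*
 * Map local/remote page-access counts to a locality region index.
 * Returns -1 when there are no records at all, mirroring the patch,
 * which skips the accounting in that case.
 */
int locality_region(unsigned long local, unsigned long remote)
{
	if (!local && !remote)
		return -1;

	/* the "+ 1" keeps the index below NR_NL_INTERVAL at 100% locality */
	return (int)(NR_NL_INTERVAL * local / (remote + local + 1));
}

/* Print the percentage range covered by one region (hypothetical helper). */
void print_region_range(int idx)
{
	printf("region %d: %.1f%%~%.1f%%\n", idx,
	       idx * 100.0 / NR_NL_INTERVAL,
	       (idx + 1) * 100.0 / NR_NL_INTERVAL);
}
```

With 100 local and 0 remote accesses this gives region 7 (the last
number on the 'locality' line), and with 0 local and 100 remote accesses
it gives region 0 (the first number).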

By monitoring the increments, we can check whether the workloads of a
particular cgroup are doing well with numa. When most of the tasks are
running in the low locality regions, something is wrong with your numa
policy.
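
A rough sketch of such monitoring from userspace, assuming the
'locality' line has been read from cpu.numa_stat twice. The helper
names and the "lower half of the regions" cut-off (locality below 50%)
are assumptions for illustration only; the patch does not define what
counts as low.

```c
#include <stdlib.h>
#include <string.h>

#define NR_NL_INTERVAL 8

/*
 * Parse a "locality v0 ... v7" line into vals[].
 * Returns 0 on success, -1 if the line does not match.
 */
int parse_locality_line(const char *line,
			unsigned long long vals[NR_NL_INTERVAL])
{
	const char *p = line;
	char *end;
	int i;

	if (strncmp(p, "locality", 8) != 0)
		return -1;
	p += 8;

	for (i = 0; i < NR_NL_INTERVAL; i++) {
		vals[i] = strtoull(p, &end, 10);
		if (end == p)
			return -1;	/* fewer than 8 numbers */
		p = end;
	}
	return 0;
}

/*
 * Percentage of the time added between two samples that fell into the
 * lower half of the regions. Returns -1 if nothing was accounted
 * between the samples.
 */
int low_locality_percent(const unsigned long long prev[NR_NL_INTERVAL],
			 const unsigned long long curr[NR_NL_INTERVAL])
{
	unsigned long long low = 0, total = 0;
	int i;

	for (i = 0; i < NR_NL_INTERVAL; i++) {
		unsigned long long delta = curr[i] - prev[i];

		total += delta;
		if (i < NR_NL_INTERVAL / 2)
			low += delta;
	}
	return total ? (int)(low * 100 / total) : -1;
}
```

Taking the sample line above as the increment since an all-zero sample,
about 30% of the accounted time was spent below 50% locality.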

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
Since v1:
  * move implementation from the memory cgroup into the cpu cgroup
  * introduce new entry 'numa_stat' to present locality
  * locality is now accounted hierarchically
  * locality is now accounted into 8 equal regions
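
The hierarchical accounting can be pictured with a small userspace
sketch. 'struct group' below is a hypothetical stand-in for struct
task_group, and plain counters replace the per-cpu ones; the point is
only the walk from a group up to the root, as update_tg_numa_stat()
does via tg->parent.

```c
#include <stddef.h>

#define NR_NL_INTERVAL 8

/* Hypothetical stand-in for struct task_group: only what the walk needs. */
struct group {
	struct group *parent;	/* NULL for the root group */
	unsigned long long locality[NR_NL_INTERVAL];
};

/*
 * Account one tick in region idx for grp and all of its ancestors,
 * so a parent's counters include those of its children.
 */
void account_locality(struct group *grp, int idx)
{
	if (idx < 0 || idx >= NR_NL_INTERVAL)
		return;	/* no fault records yet: skip, as the patch does */

	for (; grp; grp = grp->parent)
		grp->locality[idx]++;
}
```

After accounting into two sibling groups, the root holds the sum of
both, which is what makes cpu.numa_stat of a parent cgroup cover its
whole subtree.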

 include/linux/sched.h |  8 +++++++-
 kernel/sched/core.c   | 40 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/debug.c  |  7 +++++++
 kernel/sched/fair.c   | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  | 29 +++++++++++++++++++++++++++++
 5 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 907808f1acc5..eb26098de6ea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1117,8 +1117,14 @@ struct task_struct {
 	 * scan window were remote/local or failed to migrate. The task scan
 	 * period is adapted based on the locality of the faults with different
 	 * weights depending on whether they were shared or private faults
+	 *
+	 * 0 -- remote faults
+	 * 1 -- local faults
+	 * 2 -- page migration failure
+	 * 3 -- remote page accessing
+	 * 4 -- local page accessing
 	 */
-	unsigned long			numa_faults_locality[3];
+	unsigned long			numa_faults_locality[5];

 	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fa43ce3962e7..71a8d3ed8495 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6367,6 +6367,10 @@ static struct kmem_cache *task_group_cache __read_mostly;
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
 DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);

+#ifdef CONFIG_NUMA_BALANCING
+DECLARE_PER_CPU(struct numa_stat, root_numa_stat);
+#endif
+
 void __init sched_init(void)
 {
 	unsigned long alloc_size = 0, ptr;
@@ -6416,6 +6420,10 @@ void __init sched_init(void)
 	init_defrootdomain();
 #endif

+#ifdef CONFIG_NUMA_BALANCING
+	root_task_group.numa_stat = &root_numa_stat;
+#endif
+
 #ifdef CONFIG_RT_GROUP_SCHED
 	init_rt_bandwidth(&root_task_group.rt_bandwidth,
 			global_rt_period(), global_rt_runtime());
@@ -6727,6 +6735,7 @@ static DEFINE_SPINLOCK(task_group_lock);

 static void sched_free_group(struct task_group *tg)
 {
+	free_tg_numa_stat(tg);
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
 	autogroup_free(tg);
@@ -6742,6 +6751,9 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!tg)
 		return ERR_PTR(-ENOMEM);

+	if (!alloc_tg_numa_stat(tg))
+		goto err;
+
 	if (!alloc_fair_sched_group(tg, parent))
 		goto err;

@@ -7277,6 +7289,28 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */

+#ifdef CONFIG_NUMA_BALANCING
+static int cpu_numa_stat_show(struct seq_file *sf, void *v)
+{
+	int nr;
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	seq_puts(sf, "locality");
+	for (nr = 0; nr < NR_NL_INTERVAL; nr++) {
+		int cpu;
+		u64 sum = 0;
+
+		for_each_possible_cpu(cpu)
+			sum += per_cpu(tg->numa_stat->locality[nr], cpu);
+
+		seq_printf(sf, " %u", jiffies_to_msecs(sum));
+	}
+	seq_putc(sf, '\n');
+
+	return 0;
+}
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -7312,6 +7346,12 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_rt_period_read_uint,
 		.write_u64 = cpu_rt_period_write_uint,
 	},
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+	{
+		.name = "numa_stat",
+		.seq_show = cpu_numa_stat_show,
+	},
 #endif
 	{ }	/* Terminate */
 };
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f7e4579e746c..a22b2a62aee2 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -848,6 +848,13 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
 	P(total_numa_faults);
 	SEQ_printf(m, "current_node=%d, numa_group_id=%d\n",
 			task_node(p), task_numa_group_id(p));
+	SEQ_printf(m, "faults_locality local=%lu remote=%lu failed=%lu ",
+			p->numa_faults_locality[1],
+			p->numa_faults_locality[0],
+			p->numa_faults_locality[2]);
+	SEQ_printf(m, "lhit=%lu rhit=%lu\n",
+			p->numa_faults_locality[4],
+			p->numa_faults_locality[3]);
 	show_numa_stats(p, m);
 	mpol_put(pol);
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 036be95a87e9..cd716355d70e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2449,6 +2449,12 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
 	p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
 	p->numa_faults_locality[local] += pages;
+	/*
+	 * We want to have the real local/remote page access statistic
+	 * here, so use 'mem_node' which is the real residential node of
+	 * page after migrate_misplaced_page().
+	 */
+	p->numa_faults_locality[3 + !!(mem_node == numa_node_id())] += pages;
 }

 static void reset_ptenuma_scan(struct task_struct *p)
@@ -2611,6 +2617,47 @@ void task_numa_work(struct callback_head *work)
 	}
 }

+DEFINE_PER_CPU(struct numa_stat, root_numa_stat);
+
+int alloc_tg_numa_stat(struct task_group *tg)
+{
+	tg->numa_stat = alloc_percpu(struct numa_stat);
+	if (!tg->numa_stat)
+		return 0;
+
+	return 1;
+}
+
+void free_tg_numa_stat(struct task_group *tg)
+{
+	free_percpu(tg->numa_stat);
+}
+
+static void update_tg_numa_stat(struct task_struct *p)
+{
+	struct task_group *tg;
+	unsigned long remote = p->numa_faults_locality[3];
+	unsigned long local = p->numa_faults_locality[4];
+	int idx = -1;
+
+	/* To be scaled? */
+	if (remote || local)
+		idx = NR_NL_INTERVAL * local / (remote + local + 1);
+
+	rcu_read_lock();
+
+	tg = task_group(p);
+	while (tg) {
+		/* skip account when there are no faults records */
+		if (idx != -1)
+			this_cpu_inc(tg->numa_stat->locality[idx]);
+
+		tg = tg->parent;
+	}
+
+	rcu_read_unlock();
+}
+
 /*
  * Drive the periodic memory faults..
  */
@@ -2625,6 +2672,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
 		return;

+	update_tg_numa_stat(curr);
+
 	/*
 	 * Using runtime rather than walltime has the dual advantage that
 	 * we (mostly) drive the selection from busy threads and that the
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 802b1f3405f2..685a9e670880 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -353,6 +353,17 @@ struct cfs_bandwidth {
 #endif
 };

+#ifdef CONFIG_NUMA_BALANCING
+
+/* NUMA Locality Interval, 8 buckets for cache alignment */
+#define NR_NL_INTERVAL	8
+
+struct numa_stat {
+	u64 locality[NR_NL_INTERVAL];
+};
+
+#endif
+
 /* Task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
@@ -393,8 +404,26 @@ struct task_group {
 #endif

 	struct cfs_bandwidth	cfs_bandwidth;
+
+#ifdef CONFIG_NUMA_BALANCING
+	struct numa_stat __percpu *numa_stat;
+#endif
 };

+#ifdef CONFIG_NUMA_BALANCING
+int alloc_tg_numa_stat(struct task_group *tg);
+void free_tg_numa_stat(struct task_group *tg);
+#else
+static inline int alloc_tg_numa_stat(struct task_group *tg)
+{
+	return 1;
+}
+
+static inline void free_tg_numa_stat(struct task_group *tg)
+{
+}
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #define ROOT_TASK_GROUP_LOAD	NICE_0_LOAD

-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v2 2/4] numa: append per-node execution time in cpu.numa_stat
  2019-07-16  3:38   ` [PATCH v2 0/4] per-cgroup " 王贇
  2019-07-16  3:39     ` [PATCH v2 1/4] numa: introduce per-cgroup numa balancing locality statistic 王贇
@ 2019-07-16  3:40     ` 王贇
  2019-07-19 16:39       ` Michal Koutný
  2019-07-16  3:41     ` [PATCH v2 3/4] numa: introduce numa group per task group 王贇
                       ` (3 subsequent siblings)
  5 siblings, 1 reply; 62+ messages in thread
From: 王贇 @ 2019-07-16  3:40 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm, mcgrof, keescook, linux-fsdevel, cgroups,
	Michal Koutný,
	Hillf Danton

This patch introduces numa execution time information, to indicate the
numa efficiency.

By doing 'cat /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_stat', we see a new
output line starting with 'exectime', like:

  exectime 311900 407166

which means the tasks of this cgroup executed for 311900 ms on
node 0, and 407166 ms on node 1.

Combined with the memory node info from the memory cgroup, we can
estimate the numa efficiency, for example if memory.numa_stat shows:

  total=206892 N0=21933 N1=185171

By monitoring the increments, if the topology stays this way and
locality is poor, it implies that numa balancing can't migrate the
memory accessed by tasks on node 0 from node 1 to node 0, or that the
tasks can't migrate to node 1 for some reason. You may then consider
binding the workloads to the cpus of node 1.
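
One way to put the two files side by side is to compare each node's
share of execution time with its share of pages. The sketch below does
that in userspace; the function names and the idea of a single "gap"
number are assumptions made for illustration, not something the patch
provides.

```c
/*
 * Share of node nid in vals[0..nr_nodes-1], as an integer percentage.
 * Returns -1 if the values sum to zero or nid is out of range.
 */
int node_share_percent(const unsigned long long *vals, int nr_nodes, int nid)
{
	unsigned long long total = 0;
	int i;

	if (nid < 0 || nid >= nr_nodes)
		return -1;
	for (i = 0; i < nr_nodes; i++)
		total += vals[i];

	return total ? (int)(vals[nid] * 100 / total) : -1;
}

/*
 * Gap between where a cgroup runs and where its memory sits, for one
 * node: exec-time share minus memory share, in percentage points.
 * A large positive value means tasks run on the node while most of the
 * memory is elsewhere.
 */
int node_exec_mem_gap(const unsigned long long *exectime,
		      const unsigned long long *mem_pages,
		      int nr_nodes, int nid)
{
	int exec = node_share_percent(exectime, nr_nodes, nid);
	int mem = node_share_percent(mem_pages, nr_nodes, nid);

	if (exec < 0 || mem < 0)
		return 0;	/* not enough data to compare */
	return exec - mem;
}
```

With the sample values above, node 0 gets about 43% of the execution
time but holds only about 10% of the pages, a gap of 33 points, which
matches the situation described here.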

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
Since v1:
  * move implementation from the memory cgroup into the cpu cgroup
  * exectime is now accounted hierarchically
  * change member name to jiffies

 kernel/sched/core.c  | 12 ++++++++++++
 kernel/sched/fair.c  |  2 ++
 kernel/sched/sched.h |  1 +
 3 files changed, 15 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 71a8d3ed8495..f8aa73aa879b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7307,6 +7307,18 @@ static int cpu_numa_stat_show(struct seq_file *sf, void *v)
 	}
 	seq_putc(sf, '\n');

+	seq_puts(sf, "exectime");
+	for_each_online_node(nr) {
+		int cpu;
+		u64 sum = 0;
+
+		for_each_cpu(cpu, cpumask_of_node(nr))
+			sum += per_cpu(tg->numa_stat->jiffies, cpu);
+
+		seq_printf(sf, " %u", jiffies_to_msecs(sum));
+	}
+	seq_putc(sf, '\n');
+
 	return 0;
 }
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd716355d70e..2c362266af76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2652,6 +2652,8 @@ static void update_tg_numa_stat(struct task_struct *p)
 		if (idx != -1)
 			this_cpu_inc(tg->numa_stat->locality[idx]);

+		this_cpu_inc(tg->numa_stat->jiffies);
+
 		tg = tg->parent;
 	}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 685a9e670880..456f83f7f595 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -360,6 +360,7 @@ struct cfs_bandwidth {

 struct numa_stat {
 	u64 locality[NR_NL_INTERVAL];
+	u64 jiffies;
 };

 #endif
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v2 3/4] numa: introduce numa group per task group
  2019-07-16  3:38   ` [PATCH v2 0/4] per-cgroup " 王贇
  2019-07-16  3:39     ` [PATCH v2 1/4] numa: introduce per-cgroup numa balancing locality statistic 王贇
  2019-07-16  3:40     ` [PATCH v2 2/4] numa: append per-node execution time in cpu.numa_stat 王贇
@ 2019-07-16  3:41     ` 王贇
  2019-07-16  3:41     ` [PATCH v4 4/4] numa: introduce numa cling feature 王贇
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-07-16  3:41 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm, mcgrof, keescook, linux-fsdevel, cgroups,
	Michal Koutný,
	Hillf Danton

By tracing numa page faults, we recognize tasks sharing the same page,
and try to pack them together into a single numa group.

However, when two tasks share lots of cache pages but not many
anonymous pages, they have no chance to join the same group, since
numa balancing does not trace cache pages.

Since tracing cache pages costs too much, we could use some hints from
userland, and the cpu cgroup could be a good one.

This patch introduces a new entry 'numa_group' for the cpu cgroup. By
echoing a non-zero value into the entry, we can now force all the tasks
of this cgroup to join the same numa group, which serves the task group.

In this way tasks are more likely to settle down on the same node, to
share a closer cpu cache and benefit from NUMA for both file and
anonymous pages.

Besides, when multiple cgroups enable the numa group, they will be able
to exchange task locations by utilizing numa migration. In this way they
can settle down on a single node without breaking load balance.
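
For tooling that wants to toggle the entry programmatically instead of
using echo, a minimal sketch could look like the following. The helper
name set_numa_group() is made up for this example, and the caller is
assumed to pass the full path of the cgroup's cpu.numa_group file
(for instance under /sys/fs/cgroup/cpu/CGROUP_PATH/ on a kernel with
this patch applied).

```c
#include <stdio.h>

/*
 * Write 1 or 0 to a cpu.numa_group file. The patch treats any non-zero
 * value as "enable"; this helper normalizes the value to 1.
 * Returns 0 on success, -1 on failure.
 */
int set_numa_group(const char *path, int enable)
{
	FILE *f = fopen(path, "w");
	int ret = 0;

	if (!f)
		return -1;
	if (fprintf(f, "%d\n", enable ? 1 : 0) < 0)
		ret = -1;
	if (fclose(f) != 0)
		ret = -1;	/* the write may only fail at flush time */
	return ret;
}
```

Checking the result of fclose() matters here because the value is only
handed to the kernel when the stdio buffer is flushed.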

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
Since v1:
  * just rebase, no logical changes

 kernel/sched/core.c  |  33 ++++++++++
 kernel/sched/fair.c  | 175 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  11 ++++
 3 files changed, 218 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f8aa73aa879b..9f100c48d6e4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6802,6 +6802,8 @@ void sched_offline_group(struct task_group *tg)
 {
 	unsigned long flags;

+	update_tg_numa_group(tg, false);
+
 	/* End participation in shares distribution: */
 	unregister_fair_sched_group(tg);

@@ -7321,6 +7323,32 @@ static int cpu_numa_stat_show(struct seq_file *sf, void *v)

 	return 0;
 }
+
+static DEFINE_MUTEX(numa_mutex);
+
+static int cpu_numa_group_show(struct seq_file *sf, void *v)
+{
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	mutex_lock(&numa_mutex);
+	show_tg_numa_group(tg, sf);
+	mutex_unlock(&numa_mutex);
+
+	return 0;
+}
+
+static int cpu_numa_group_write_s64(struct cgroup_subsys_state *css,
+				struct cftype *cft, s64 numa_group)
+{
+	int ret;
+	struct task_group *tg = css_tg(css);
+
+	mutex_lock(&numa_mutex);
+	ret = update_tg_numa_group(tg, numa_group);
+	mutex_unlock(&numa_mutex);
+
+	return ret;
+}
 #endif

 static struct cftype cpu_legacy_files[] = {
@@ -7364,6 +7392,11 @@ static struct cftype cpu_legacy_files[] = {
 		.name = "numa_stat",
 		.seq_show = cpu_numa_stat_show,
 	},
+	{
+		.name = "numa_group",
+		.write_s64 = cpu_numa_group_write_s64,
+		.seq_show = cpu_numa_group_show,
+	},
 #endif
 	{ }	/* Terminate */
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2c362266af76..c28ba040a563 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1073,6 +1073,7 @@ struct numa_group {
 	int nr_tasks;
 	pid_t gid;
 	int active_nodes;
+	bool evacuate;

 	struct rcu_head rcu;
 	unsigned long total_faults;
@@ -2246,6 +2247,176 @@ static inline void put_numa_group(struct numa_group *grp)
 		kfree_rcu(grp, rcu);
 }

+void show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
+{
+	int nid;
+	struct numa_group *ng = tg->numa_group;
+
+	if (!ng) {
+		seq_puts(sf, "disabled\n");
+		return;
+	}
+
+	seq_printf(sf, "id %d nr_tasks %d active_nodes %d\n",
+		   ng->gid, ng->nr_tasks, ng->active_nodes);
+
+	for_each_online_node(nid) {
+		int f_idx = task_faults_idx(NUMA_MEM, nid, 0);
+		int pf_idx = task_faults_idx(NUMA_MEM, nid, 1);
+
+		seq_printf(sf, "node %d ", nid);
+
+		seq_printf(sf, "mem_private %lu mem_shared %lu ",
+			   ng->faults[pf_idx], ng->faults[f_idx]);
+
+		seq_printf(sf, "cpu_private %lu cpu_shared %lu\n",
+			   ng->faults_cpu[pf_idx], ng->faults_cpu[f_idx]);
+	}
+}
+
+int update_tg_numa_group(struct task_group *tg, bool numa_group)
+{
+	struct numa_group *ng = tg->numa_group;
+
+	/* if no change then do nothing */
+	if ((ng != NULL) == numa_group)
+		return 0;
+
+	if (ng) {
+		/* put and evacuate tg's numa group */
+		rcu_assign_pointer(tg->numa_group, NULL);
+		ng->evacuate = true;
+		put_numa_group(ng);
+	} else {
+		unsigned int size = sizeof(struct numa_group) +
+				    4*nr_node_ids*sizeof(unsigned long);
+
+		ng = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+		if (!ng)
+			return -ENOMEM;
+
+		refcount_set(&ng->refcount, 1);
+		spin_lock_init(&ng->lock);
+		ng->faults_cpu = ng->faults + NR_NUMA_HINT_FAULT_TYPES *
+						nr_node_ids;
+		/* now make tasks see and join */
+		rcu_assign_pointer(tg->numa_group, ng);
+	}
+
+	return 0;
+}
+
+static bool tg_numa_group(struct task_struct *p)
+{
+	int i;
+	struct task_group *tg;
+	struct numa_group *grp, *my_grp;
+
+	rcu_read_lock();
+
+	tg = task_group(p);
+	if (!tg)
+		goto no_join;
+
+	grp = rcu_dereference(tg->numa_group);
+	my_grp = rcu_dereference(p->numa_group);
+
+	if (!grp)
+		goto no_join;
+
+	if (grp == my_grp) {
+		if (!grp->evacuate)
+			goto joined;
+
+		/*
+		 * Evacuate task from tg's numa group
+		 */
+		rcu_read_unlock();
+
+		spin_lock_irq(&grp->lock);
+
+		for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
+			grp->faults[i] -= p->numa_faults[i];
+
+		grp->total_faults -= p->total_numa_faults;
+		grp->nr_tasks--;
+
+		spin_unlock_irq(&grp->lock);
+
+		rcu_assign_pointer(p->numa_group, NULL);
+
+		put_numa_group(grp);
+
+		return false;
+	}
+
+	if (!get_numa_group(grp))
+		goto no_join;
+
+	rcu_read_unlock();
+
+	/*
+	 * Just join tg's numa group
+	 */
+	if (!my_grp) {
+		spin_lock_irq(&grp->lock);
+
+		if (refcount_read(&grp->refcount) == 2) {
+			grp->gid = p->pid;
+			grp->active_nodes = 1;
+			grp->max_faults_cpu = 0;
+		}
+
+		for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
+			grp->faults[i] += p->numa_faults[i];
+
+		grp->total_faults += p->total_numa_faults;
+		grp->nr_tasks++;
+
+		spin_unlock_irq(&grp->lock);
+		rcu_assign_pointer(p->numa_group, grp);
+
+		return true;
+	}
+
+	/*
+	 * Switch from the task's numa group to the tg's
+	 */
+	double_lock_irq(&my_grp->lock, &grp->lock);
+
+	if (refcount_read(&grp->refcount) == 2) {
+		grp->gid = p->pid;
+		grp->active_nodes = 1;
+		grp->max_faults_cpu = 0;
+	}
+
+	for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) {
+		my_grp->faults[i] -= p->numa_faults[i];
+		grp->faults[i] += p->numa_faults[i];
+	}
+
+	my_grp->total_faults -= p->total_numa_faults;
+	grp->total_faults += p->total_numa_faults;
+
+	my_grp->nr_tasks--;
+	grp->nr_tasks++;
+
+	spin_unlock(&my_grp->lock);
+	spin_unlock_irq(&grp->lock);
+
+	rcu_assign_pointer(p->numa_group, grp);
+
+	put_numa_group(my_grp);
+	return true;
+
+joined:
+	rcu_read_unlock();
+	return true;
+no_join:
+	rcu_read_unlock();
+	return false;
+}
+
 static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 			int *priv)
 {
@@ -2416,7 +2587,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 		priv = 1;
 	} else {
 		priv = cpupid_match_pid(p, last_cpupid);
-		if (!priv && !(flags & TNF_NO_GROUP))
+		if (tg_numa_group(p))
+			priv = (flags & TNF_SHARED) ? 0 : priv;
+		else if (!priv && !(flags & TNF_NO_GROUP))
 			task_numa_group(p, last_cpupid, flags, &priv);
 	}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 456f83f7f595..23e4a62cd37b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -408,6 +408,7 @@ struct task_group {

 #ifdef CONFIG_NUMA_BALANCING
 	struct numa_stat __percpu *numa_stat;
+	void *numa_group;
 #endif
 };

@@ -1316,11 +1317,21 @@ extern int migrate_task_to(struct task_struct *p, int cpu);
 extern int migrate_swap(struct task_struct *p, struct task_struct *t,
 			int cpu, int scpu);
 extern void init_numa_balancing(unsigned long clone_flags, struct task_struct *p);
+extern void show_tg_numa_group(struct task_group *tg, struct seq_file *sf);
+extern int update_tg_numa_group(struct task_group *tg, bool numa_group);
 #else
 static inline void
 init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 {
 }
+static inline void
+show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
+{
+}
+static inline int update_tg_numa_group(struct task_group *tg, bool numa_group)
+{
+	return 0;
+}
 #endif /* CONFIG_NUMA_BALANCING */

 #ifdef CONFIG_SMP
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v4 4/4] numa: introduce numa cling feature
  2019-07-16  3:38   ` [PATCH v2 0/4] per-cgroup " 王贇
                       ` (2 preceding siblings ...)
  2019-07-16  3:41     ` [PATCH v2 3/4] numa: introduce numa group per task group 王贇
@ 2019-07-16  3:41     ` 王贇
  2019-07-22  2:37       ` [PATCH v5 " 王贇
  2019-07-25  2:33     ` [PATCH v2 0/4] per-cgroup numa suite 王贇
  2019-08-06  1:33     ` 王贇
  5 siblings, 1 reply; 62+ messages in thread
From: 王贇 @ 2019-07-16  3:41 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm, mcgrof, keescook, linux-fsdevel, cgroups,
	Michal Koutný,
	Hillf Danton

Although we put a lot of effort into settling a task down on a
particular node, there are still chances for a task to leave its
preferred node: by wakeup, numa swap migration or load balance.

When we use cpu cgroups in a shared way, since all the workloads
see all the cpus, it could be really bad, especially when there
are too many fast wakeups. Although we can now numa group the tasks,
they won't really stay on the same node. For example, with numa
groups ng_A, ng_B, ng_C, ng_D, the result is very likely:

	CPU Usage:
		Node 0		Node 1
		ng_A(600%)	ng_A(400%)
		ng_B(400%)	ng_B(600%)
		ng_C(400%)	ng_C(600%)
		ng_D(600%)	ng_D(400%)

	Memory Ratio:
		Node 0		Node 1
		ng_A(60%)	ng_A(40%)
		ng_B(40%)	ng_B(60%)
		ng_C(40%)	ng_C(60%)
		ng_D(60%)	ng_D(40%)

Locality won't be too bad, but it is far from the best situation. We
want a numa group to settle down thoroughly on a particular node, with
everything balanced.

Thus we introduce numa cling, which tries to prevent tasks from leaving
the preferred node on the wakeup fast path.

This helps settle the workloads down thoroughly on a single node, but
when multiple numa groups try to settle down on the same node,
unbalancing could happen.

For example, with numa groups ng_A, ng_B, ng_C, ng_D, it may result in
a situation like:

CPU Usage:
	Node 0		Node 1
	ng_A(1000%)	ng_B(1000%)
	ng_C(400%)	ng_C(600%)
	ng_D(400%)	ng_D(600%)

Memory Ratio:
	Node 0		Node 1
	ng_A(100%)	ng_B(100%)
	ng_C(10%)	ng_C(90%)
	ng_D(10%)	ng_D(90%)

This is because when ng_C and ng_D start to have most of their memory
on node 1 at some point, task_x of ng_C staying on node 0 will try to do
numa swap migration with task_y of ng_D staying on node 1, as long as
the load stays balanced. The result is that task_x stays on node 1 and
task_y stays on node 0, while both of them prefer node 1.

Now when other tasks of ng_D staying on node 1 wake up task_y, task_y
will very likely go back to node 1, and since numa cling is enabled, it
will keep staying on node 1 even though the load is unbalanced. This
can happen frequently, so more and more tasks will prefer node 1 and
make it busy.

So the key point here is to stop doing numa cling when the load starts
to become unbalanced.

We achieve this by monitoring the migration failure ratio. In the
scenario above, too many tasks prefer node 1 and keep migrating to it;
load unbalancing leads to migration failures in this case, and when the
failure ratio rises above the specified degree, we pause the cling and
try to resettle the workloads on a better node by stopping tasks from
preferring the busy node. This finally gives us a result like:

CPU Usage:
	Node 0		Node 1
	ng_A(1000%)	ng_B(1000%)
	ng_C(1000%)	ng_D(1000%)

Memory Ratio:
	Node 0		Node 1
	ng_A(100%)	ng_B(100%)
	ng_C(100%)	ng_D(100%)

Now we achieve the best locality and the maximum hot cache benefit.
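
The pause decision described above can be sketched as a simple ratio
check. This is a userspace illustration only: the function name
cling_should_pause() and the way failures and attempts are counted are
assumptions, and the strict "above" comparison follows the wording of
this message rather than code shown here. The degree is a percentage,
matching sysctl_numa_balancing_cling_degree (default 20, max 100).

```c
/*
 * Decide whether numa cling should be paused.
 * failed/total: numa migration failures and attempts in a sampling
 *               window (how the window is defined is an assumption)
 * degree:       threshold in percent
 */
int cling_should_pause(unsigned long failed, unsigned long total,
		       unsigned int degree)
{
	if (!total)
		return 0;	/* nothing attempted, nothing to judge */

	/* compare failed/total > degree/100 without dividing */
	return failed * 100 > (unsigned long)degree * total;
}
```

With the default degree of 20, 21 failures out of 100 attempts would
pause the cling, while 20 out of 100 would not. A degree of 100 can
never be exceeded, so in this sketch it means cling is never paused.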

Tested on a 2-node box with 96 cpus, running sysbench-mysql-oltp_read_write:
X mysqld instances are created and attached to X cgroups, then X sysbench
instances are created and attached to the corresponding cgroups to test
mysql with the oltp_read_write script for 20 minutes. Average eps:

				origin		ng + cling
4 instances each 24 threads	7641.27		8010.18		+4.83%
4 instances each 48 threads	9423.39		10021.03	+6.34%
4 instances each 72 threads	9691.47		10192.73	+5.17%

8 instances each 24 threads	4485.44		4577.95		+2.06%
8 instances each 48 threads	5565.06		5737.50		+3.10%
8 instances each 72 threads	5605.20		5752.33		+2.63%

Also tested with perf-bench-numa, dbench, sysbench-memory and pgbench;
a tiny improvement was observed.

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
Since v3:
  * numa cling no longer overrides select_idle_sibling; instead we
    prevent numa swap migration with tasks clinging to dst-node, and
    prevent wake affine from dragging away tasks which already cling
    to prev-cpu
  * refine comments

 include/linux/sched/sysctl.h |   3 +
 kernel/sched/fair.c          | 296 ++++++++++++++++++++++++++++++++++++++++---
 kernel/sysctl.c              |   9 ++
 3 files changed, 292 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..6eef34331dd2 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -38,6 +38,9 @@ extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;

+extern unsigned int sysctl_numa_balancing_cling_degree;
+extern unsigned int max_numa_balancing_cling_degree;
+
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c28ba040a563..e7525bda5a94 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1066,6 +1066,20 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;

+/*
+ * The numa group serving a task group enables numa cling, a feature
+ * which tries to prevent tasks from leaving the preferred node on wakeup.
+ *
+ * This helps settle the workloads thoroughly and quickly on a node,
+ * while introducing the risk of load imbalance.
+ *
+ * In order to detect the risk in advance and pause the feature, we
+ * rely on numa migration failure stats: when the failure ratio rises
+ * above the cling degree, we pause numa cling until resettling is done.
+ */
+unsigned int sysctl_numa_balancing_cling_degree = 20;
+unsigned int max_numa_balancing_cling_degree = 100;
+
 struct numa_group {
 	refcount_t refcount;

@@ -1073,11 +1087,15 @@ struct numa_group {
 	int nr_tasks;
 	pid_t gid;
 	int active_nodes;
+	int busiest_nid;
 	bool evacuate;
+	bool do_cling;
+	struct timer_list cling_timer;

 	struct rcu_head rcu;
 	unsigned long total_faults;
 	unsigned long max_faults_cpu;
+	unsigned long *migrate_stat;
 	/*
 	 * Faults_cpu is used to decide whether memory should move
 	 * towards the CPU. As a consequence, these stats are weighted
@@ -1087,6 +1105,8 @@ struct numa_group {
 	unsigned long faults[0];
 };

+static inline bool busy_node(struct numa_group *ng, int nid);
+
 static inline unsigned long group_faults_priv(struct numa_group *ng);
 static inline unsigned long group_faults_shared(struct numa_group *ng);

@@ -1131,8 +1151,14 @@ static unsigned int task_scan_start(struct task_struct *p)
 	unsigned long smin = task_scan_min(p);
 	unsigned long period = smin;

-	/* Scale the maximum scan period with the amount of shared memory. */
-	if (p->numa_group) {
+	/*
+	 * Scale the maximum scan period with the amount of shared memory.
+	 *
+	 * Not for the numa group serving a task group: its tasks are not
+	 * gathered for sharing memory, and we need to detect migration
+	 * failure in time.
+	 */
+	if (p->numa_group && !p->numa_group->do_cling) {
 		struct numa_group *ng = p->numa_group;
 		unsigned long shared = group_faults_shared(ng);
 		unsigned long private = group_faults_priv(ng);
@@ -1153,8 +1179,14 @@ static unsigned int task_scan_max(struct task_struct *p)
 	/* Watch for min being lower than max due to floor calculations */
 	smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);

-	/* Scale the maximum scan period with the amount of shared memory. */
-	if (p->numa_group) {
+	/*
+	 * Scale the maximum scan period with the amount of shared memory.
+	 *
+	 * Not for the numa group serving a task group: its tasks are not
+	 * gathered for sharing memory, and we need to detect migration
+	 * failure in time.
+	 */
+	if (p->numa_group && !p->numa_group->do_cling) {
 		struct numa_group *ng = p->numa_group;
 		unsigned long shared = group_faults_shared(ng);
 		unsigned long private = group_faults_priv(ng);
@@ -1474,6 +1506,19 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 					ACTIVE_NODE_FRACTION)
 		return true;

+	/*
+	 * Make sure pages do not stay on a busy node when numa cling is
+	 * enabled, otherwise they could lead to more numa migration
+	 * to the busy node.
+	 */
+	if (ng->do_cling) {
+		if (busy_node(ng, dst_nid))
+			return false;
+
+		if (busy_node(ng, src_nid))
+			return true;
+	}
+
 	/*
 	 * Distribute memory according to CPU & memory use on each node,
 	 * with 3/4 hysteresis to avoid unnecessary memory migrations:
@@ -1592,6 +1637,9 @@ static bool load_too_imbalanced(long src_load, long dst_load,
  */
 #define SMALLIMP	30

+static inline bool
+task_numa_cling(struct task_struct *p, int snid, int dnid);
+
 /*
  * This checks if the overall compute and NUMA accesses of the system would
  * be improved if the source tasks was migrated to the target dst_cpu taking
@@ -1710,6 +1758,10 @@ static void task_numa_compare(struct task_numa_env *env,
 		env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
 						   env->dst_cpu);
 		local_irq_enable();
+	} else {
+		/* Do not swap with a task clinging to 'dst_nid' */
+		if (task_numa_cling(cur, env->dst_nid, env->src_nid))
+			goto unlock;
 	}

 	task_numa_assign(env, cur, imp);
@@ -1873,9 +1925,191 @@ static int task_numa_migrate(struct task_struct *p)
 	return ret;
 }

+/*
+ * We scale each migration stat count to 1024, and use the maximum numa
+ * balancing scan period divided by 10 as the period of the cling timer;
+ * this helps decay one count to about 0 after one maximum scan period.
+ */
+#define NUMA_MIGRATE_SCALE 10
+#define NUMA_MIGRATE_WEIGHT 1024
+
+enum numa_migrate_stats {
+	FAILURE_SCALED,
+	TOTAL_SCALED,
+	FAILURE_RATIO,
+};
+
+static inline int mstat_idx(int nid, enum numa_migrate_stats s)
+{
+	return (nid + s * nr_node_ids);
+}
+
+static inline unsigned long
+mstat_failure_scaled(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, FAILURE_SCALED)];
+}
+
+static inline unsigned long
+mstat_total_scaled(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, TOTAL_SCALED)];
+}
+
+static inline unsigned long
+mstat_failure_ratio(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, FAILURE_RATIO)];
+}
+
+/*
+ * A node is busy when numa migrations toward it have failed too often;
+ * this implies the load is already unbalanced because too many tasks
+ * cling to that node.
+ */
+static inline bool busy_node(struct numa_group *ng, int nid)
+{
+	int degree = sysctl_numa_balancing_cling_degree;
+
+	if (mstat_failure_scaled(ng, nid) < NUMA_MIGRATE_WEIGHT)
+		return false;
+
+	/*
+	 * Allow only one busy node in one numa group, to prevent
+	 * ping-pong migration case between nodes.
+	 */
+	if (ng->busiest_nid != nid)
+		return false;
+
+	return mstat_failure_ratio(ng, nid) > degree;
+}
+
+/*
+ * Return true if the task should cling to snid, i.e. it prefers snid
+ * rather than dnid and snid is not busy.
+ */
+static inline bool
+task_numa_cling(struct task_struct *p, int snid, int dnid)
+{
+	bool ret = false;
+	int pnid = p->numa_preferred_nid;
+	struct numa_group *ng;
+
+	rcu_read_lock();
+
+	ng = p->numa_group;
+
+	/* Cling only when the feature is enabled and not paused */
+	if (!ng || !ng->do_cling)
+		goto out;
+
+	if (pnid == NUMA_NO_NODE ||
+	    dnid == pnid ||
+	    snid != pnid)
+		goto out;
+
+	/* Never allow cling to a busy node */
+	if (busy_node(ng, snid))
+		goto out;
+
+	ret = true;
+out:
+	rcu_read_unlock();
+	return ret;
+}
+
+/*
+ * Prevent more tasks from preferring the busy node to ease the imbalance,
+ * and give the second candidate a chance.
+ */
+static inline bool group_pause_prefer(struct numa_group *ng, int nid)
+{
+	if (!ng || !ng->do_cling)
+		return false;
+
+	return busy_node(ng, nid);
+}
+
+static inline void update_failure_ratio(struct numa_group *ng, int nid)
+{
+	int f_idx = mstat_idx(nid, FAILURE_SCALED);
+	int t_idx = mstat_idx(nid, TOTAL_SCALED);
+	int fp_idx = mstat_idx(nid, FAILURE_RATIO);
+
+	ng->migrate_stat[fp_idx] =
+		ng->migrate_stat[f_idx] * 100 / (ng->migrate_stat[t_idx] + 1);
+}
+
+static void cling_timer_func(struct timer_list *t)
+{
+	int nid;
+	unsigned int degree;
+	unsigned long period, max_failure;
+	struct numa_group *ng = from_timer(ng, t, cling_timer);
+
+	degree = sysctl_numa_balancing_cling_degree;
+	period = msecs_to_jiffies(sysctl_numa_balancing_scan_period_max);
+	period /= NUMA_MIGRATE_SCALE;
+
+	spin_lock_irq(&ng->lock);
+
+	max_failure = 0;
+	for_each_online_node(nid) {
+		int f_idx = mstat_idx(nid, FAILURE_SCALED);
+		int t_idx = mstat_idx(nid, TOTAL_SCALED);
+
+		ng->migrate_stat[f_idx] /= 2;
+		ng->migrate_stat[t_idx] /= 2;
+
+		update_failure_ratio(ng, nid);
+
+		if (ng->migrate_stat[f_idx] > max_failure) {
+			ng->busiest_nid = nid;
+			max_failure = ng->migrate_stat[f_idx];
+		}
+	}
+
+	spin_unlock_irq(&ng->lock);
+
+	mod_timer(&ng->cling_timer, jiffies + period);
+}
+
+static inline void
+update_migrate_stat(struct task_struct *p, int nid, bool failed)
+{
+	int idx;
+	struct numa_group *ng = p->numa_group;
+
+	if (!ng || !ng->do_cling)
+		return;
+
+	spin_lock_irq(&ng->lock);
+
+	if (failed) {
+		idx = mstat_idx(nid, FAILURE_SCALED);
+		ng->migrate_stat[idx] += NUMA_MIGRATE_WEIGHT;
+	}
+
+	idx = mstat_idx(nid, TOTAL_SCALED);
+	ng->migrate_stat[idx] += NUMA_MIGRATE_WEIGHT;
+	update_failure_ratio(ng, nid);
+
+	spin_unlock_irq(&ng->lock);
+
+	/*
+	 * On failure the task may prefer the source node instead, which
+	 * causes ping-pong migration when numa cling is enabled, so
+	 * reset the preferred node to none.
+	 */
+	if (failed)
+		sched_setnuma(p, NUMA_NO_NODE);
+}
+
 /* Attempt to migrate a task to a CPU on the preferred node. */
 static void numa_migrate_preferred(struct task_struct *p)
 {
+	bool failed;
+	int target;
 	unsigned long interval = HZ;

 	/* This task has no NUMA fault statistics yet */
@@ -1890,8 +2124,12 @@ static void numa_migrate_preferred(struct task_struct *p)
 	if (task_node(p) == p->numa_preferred_nid)
 		return;

+	target = p->numa_preferred_nid;
+
 	/* Otherwise, try migrate to a CPU on the preferred node */
-	task_numa_migrate(p);
+	failed = (task_numa_migrate(p) != 0);
+
+	update_migrate_stat(p, target, failed);
 }

 /*
@@ -2215,7 +2453,8 @@ static void task_numa_placement(struct task_struct *p)
 				max_faults = faults;
 				max_nid = nid;
 			}
-		} else if (group_faults > max_faults) {
+		} else if (group_faults > max_faults &&
+			   !group_pause_prefer(p->numa_group, nid)) {
 			max_faults = group_faults;
 			max_nid = nid;
 		}
@@ -2257,8 +2496,10 @@ void show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
 		return;
 	}

-	seq_printf(sf, "id %d nr_tasks %d active_nodes %d\n",
-		   ng->gid, ng->nr_tasks, ng->active_nodes);
+	spin_lock_irq(&ng->lock);
+
+	seq_printf(sf, "id %d nr_tasks %d active_nodes %d busiest_nid %d\n",
+		   ng->gid, ng->nr_tasks, ng->active_nodes, ng->busiest_nid);

 	for_each_online_node(nid) {
 		int f_idx = task_faults_idx(NUMA_MEM, nid, 0);
@@ -2269,9 +2510,16 @@ void show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
 		seq_printf(sf, "mem_private %lu mem_shared %lu ",
 			   ng->faults[f_idx], ng->faults[pf_idx]);

-		seq_printf(sf, "cpu_private %lu cpu_shared %lu\n",
+		seq_printf(sf, "cpu_private %lu cpu_shared %lu ",
 			   ng->faults_cpu[f_idx], ng->faults_cpu[pf_idx]);
+
+		seq_printf(sf, "migrate_stat %lu %lu %lu\n",
+			   mstat_failure_scaled(ng, nid),
+			   mstat_total_scaled(ng, nid),
+			   mstat_failure_ratio(ng, nid));
 	}
+
+	spin_unlock_irq(&ng->lock);
 }

 int update_tg_numa_group(struct task_group *tg, bool numa_group)
@@ -2285,20 +2533,26 @@ int update_tg_numa_group(struct task_group *tg, bool numa_group)
 	if (ng) {
 		/* put and evacuate tg's numa group */
 		rcu_assign_pointer(tg->numa_group, NULL);
+		del_timer_sync(&ng->cling_timer);
 		ng->evacuate = true;
 		put_numa_group(ng);
 	} else {
 		unsigned int size = sizeof(struct numa_group) +
-				    4*nr_node_ids*sizeof(unsigned long);
+				    7*nr_node_ids*sizeof(unsigned long);
+		unsigned int offset = NR_NUMA_HINT_FAULT_TYPES * nr_node_ids;

 		ng = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
 		if (!ng)
 			return -ENOMEM;

 		refcount_set(&ng->refcount, 1);
+		ng->busiest_nid = NUMA_NO_NODE;
+		ng->do_cling = true;
+		timer_setup(&ng->cling_timer, cling_timer_func, 0);
 		spin_lock_init(&ng->lock);
-		ng->faults_cpu = ng->faults + NR_NUMA_HINT_FAULT_TYPES *
-						nr_node_ids;
+		ng->faults_cpu = ng->faults + offset;
+		ng->migrate_stat = ng->faults_cpu + offset;
+		add_timer(&ng->cling_timer);
 		/* now make tasks see and join */
 		rcu_assign_pointer(tg->numa_group, ng);
 	}
@@ -2435,6 +2689,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 			return;

 		refcount_set(&grp->refcount, 1);
+		grp->busiest_nid = NUMA_NO_NODE;
 		grp->active_nodes = 1;
 		grp->max_faults_cpu = 0;
 		spin_lock_init(&grp->lock);
@@ -2921,6 +3176,11 @@ static inline void update_scan_period(struct task_struct *p, int new_cpu)
 {
 }

+static inline bool task_numa_cling(struct task_struct *p, int snid, int dnid)
+{
+	return false;
+}
+
 #endif /* CONFIG_NUMA_BALANCING */

 static void
@@ -6674,8 +6934,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			new_cpu = prev_cpu;
 		}

-		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
-			      cpumask_test_cpu(cpu, p->cpus_ptr);
+		want_affine = !wake_wide(p) &&
+			      !wake_cap(p, cpu, prev_cpu) &&
+			      cpumask_test_cpu(cpu, p->cpus_ptr) &&
+			      !task_numa_cling(p, cpu_to_node(prev_cpu),
+						cpu_to_node(cpu));
 	}

 	rcu_read_lock();
@@ -6707,12 +6970,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	} else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
 		/* Fast path */
-
 		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);

 		if (want_affine)
 			current->recent_used_cpu = cpu;
 	}
+
 	rcu_read_unlock();

 	return new_cpu;
@@ -7384,7 +7647,8 @@ static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)

 	/* Migrating away from the preferred node is always bad. */
 	if (src_nid == p->numa_preferred_nid) {
-		if (env->src_rq->nr_running > env->src_rq->nr_preferred_running)
+		if (task_numa_cling(p, src_nid, dst_nid) ||
+		    env->src_rq->nr_running > env->src_rq->nr_preferred_running)
 			return 1;
 		else
 			return -1;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 078950d9605b..0a889dd1c7ed 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -417,6 +417,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "numa_balancing_cling_degree",
+		.data		= &sysctl_numa_balancing_cling_degree,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &max_numa_balancing_cling_degree,
+	},
 	{
 		.procname	= "numa_balancing",
 		.data		= NULL, /* filled in by handler */
-- 
2.14.4.44.g2045bb6



* Re: [PATCH v2 2/4] numa: append per-node execution time in cpu.numa_stat
  2019-07-16  3:40     ` [PATCH v2 2/4] numa: append per-node execution time in cpu.numa_stat 王贇
@ 2019-07-19 16:39       ` Michal Koutný
  2019-07-22  2:36         ` 王贇
  0 siblings, 1 reply; 62+ messages in thread
From: Michal Koutný @ 2019-07-19 16:39 UTC (permalink / raw)
  To: 王贇
  Cc: hannes, vdavydov.dev, Peter Zijlstra, mhocko, Ingo Molnar,
	keescook, mcgrof, linux-mm, Hillf Danton, cgroups, linux-fsdevel,
	linux-kernel

On Tue, Jul 16, 2019 at 11:40:35AM +0800, 王贇  <yun.wang@linux.alibaba.com> wrote:
> By doing 'cat /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_stat', we see a new
> output line headed with 'exectime', like:
> 
>   exectime 311900 407166
What you present are times aggregated over the CPUs in each NUMA node;
this seems a somewhat lossy interface.

Although the aggregated information is sufficient for your monitoring,
I think it's worth providing the information with the original
granularity.

Note that cpuacct v1 controller used to report such percpu runtime
stats. The v2 implementation would rather build upon the rstat API.

Michal



* Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic
  2019-07-16  2:41                   ` 王贇
@ 2019-07-19 16:47                     ` Michal Koutný
  0 siblings, 0 replies; 62+ messages in thread
From: Michal Koutný @ 2019-07-19 16:47 UTC (permalink / raw)
  To: 王贇
  Cc: keescook, hannes, vdavydov.dev, Peter Zijlstra, mcgrof, mhocko,
	linux-mm, Ingo Molnar, riel, Mel Gorman, cgroups, linux-fsdevel,
	linux-kernel

On Tue, Jul 16, 2019 at 10:41:36AM +0800, 王贇  <yun.wang@linux.alibaba.com> wrote:
> Actually, whatever the memory node set or the allowed cpu set is, it
> will affect the task's behavior regarding memory location and cpu
> location, while the locality only cares about the results rather than
> the sets.
My previous response missed much of the context, so it was a bit off.

I see what you mean by the locality now. Alas, I can't assess whether
it's the right thing to do regarding NUMA behavior that you try to
optimize (i.e. you need an answer from someone more familiar with NUMA
balancing).

Michal


* Re: [PATCH v2 2/4] numa: append per-node execution time in cpu.numa_stat
  2019-07-19 16:39       ` Michal Koutný
@ 2019-07-22  2:36         ` 王贇
  0 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-07-22  2:36 UTC (permalink / raw)
  To: Michal Koutný
  Cc: hannes, vdavydov.dev, Peter Zijlstra, mhocko, Ingo Molnar,
	keescook, mcgrof, linux-mm, Hillf Danton, cgroups, linux-fsdevel,
	linux-kernel



On 2019/7/20 12:39 AM, Michal Koutný wrote:
> On Tue, Jul 16, 2019 at 11:40:35AM +0800, 王贇  <yun.wang@linux.alibaba.com> wrote:
>> By doing 'cat /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_stat', we see a new
>> output line headed with 'exectime', like:
>>
>>   exectime 311900 407166
> What you present are times aggregated over the CPUs in each NUMA node;
> this seems a somewhat lossy interface.
> 
> Although the aggregated information is sufficient for your monitoring,
> I think it's worth providing the information with the original
> granularity.

As Peter suggested previously, the kernel does not report jiffies to
userspace anymore and 'ms' could be better. I guess we usually care
about what percentage of the time is spent on a particular node?
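
As a rough illustration of that per-node percentage, the sketch below
takes the two sample 'exectime' values quoted above and computes each
node's share. The helper name and the idea of passing the values as an
array are assumptions made for this example; nothing here is part of the
patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: share of execution time spent on node @nid, in
 * percent, given one 'exectime' value per node.
 */
static double node_exectime_percent(const unsigned long *exectime,
				    size_t nr_nodes, size_t nid)
{
	unsigned long total = 0;
	size_t i;

	assert(nid < nr_nodes);
	for (i = 0; i < nr_nodes; i++)
		total += exectime[i];

	if (total == 0)
		return 0.0;	/* nothing has run yet */

	return 100.0 * (double)exectime[nid] / (double)total;
}
```

For 'exectime 311900 407166' this gives roughly 43.4% on node 0 and
56.6% on node 1.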

> 
> Note that cpuacct v1 controller used to report such percpu runtime
> stats. The v2 implementation would rather build upon the rstat API.

Supporting cgroup v2 is in the plan :-) Let's mark this as a todo for now;
I suppose the two may not share the same piece of code.

Regards,
Michael Wang

> 
> Michal
> 


* [PATCH v5 4/4] numa: introduce numa cling feature
  2019-07-16  3:41     ` [PATCH v4 4/4] numa: introduce numa cling feature 王贇
@ 2019-07-22  2:37       ` 王贇
  0 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-07-22  2:37 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm, mcgrof, keescook, linux-fsdevel, cgroups,
	Michal Koutný,
	Hillf Danton

Although we put so much effort into settling a task down on a particular
node, there are still chances for a task to leave its preferred
node: by wakeup, numa swap migration or load balance.

When the cpu cgroup is used in a shared way, all the workloads
see all the cpus, so this can be really bad, especially when there
are too many fast wakeups. Although we can now numa group the tasks,
they won't really stay on the same node. For example, with numa
groups ng_A, ng_B, ng_C and ng_D, a very likely result is:

	CPU Usage:
		Node 0		Node 1
		ng_A(600%)	ng_A(400%)
		ng_B(400%)	ng_B(600%)
		ng_C(400%)	ng_C(600%)
		ng_D(600%)	ng_D(400%)

	Memory Ratio:
		Node 0		Node 1
		ng_A(60%)	ng_A(40%)
		ng_B(40%)	ng_B(60%)
		ng_C(40%)	ng_C(60%)
		ng_D(60%)	ng_D(40%)

Locality won't be too bad, but it is far from the best situation. We want
a numa group to settle down thoroughly on a particular node, with
everything balanced.

Thus we introduce numa cling, which tries to prevent tasks from leaving
the preferred node on the wakeup fast path.
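
The cling decision itself can be sketched in user space as below. It
mirrors the conditions checked by task_numa_cling() in this patch, but
should_cling(), NO_NODE and the boolean parameters are hypothetical
simplifications: the real function reads the state from the task and its
numa group under RCU.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_NODE (-1)	/* stands in for the kernel's NUMA_NO_NODE */

/*
 * Hypothetical user-space version of the cling check: a task clings to
 * the source node only if that node is its preferred node, the
 * destination is not, and the source node is not considered busy.
 */
static bool should_cling(int preferred_nid, int src_nid, int dst_nid,
			 bool cling_enabled, bool src_busy)
{
	assert(src_nid >= 0 && dst_nid >= 0);

	if (!cling_enabled)
		return false;
	if (preferred_nid == NO_NODE)
		return false;
	if (dst_nid == preferred_nid || src_nid != preferred_nid)
		return false;

	return !src_busy;
}
```

A task that prefers node 1 and sits on node 1 clings when asked to move
to node 0, unless cling is paused or node 1 is busy.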

This helps settle the workloads thoroughly on a single node, but when
multiple numa groups try to settle down on the same node, imbalance
can happen.

For example, with numa groups ng_A, ng_B, ng_C and ng_D, it may result in
a situation like:

CPU Usage:
	Node 0		Node 1
	ng_A(1000%)	ng_B(1000%)
	ng_C(400%)	ng_C(600%)
	ng_D(400%)	ng_D(600%)

Memory Ratio:
	Node 0		Node 1
	ng_A(100%)	ng_B(100%)
	ng_C(10%)	ng_C(90%)
	ng_D(10%)	ng_D(90%)

This is because when ng_C and ng_D start to have most of their memory on
node 1 at some point, task_x of ng_C staying on node 0 will try to do numa
swap migration with task_y of ng_D staying on node 1 as long as the load is
balanced. The result is that task_x stays on node 1 and task_y stays on
node 0, while both of them prefer node 1.

Now when other tasks of ng_D staying on node 1 wake up task_y, task_y will
very likely go back to node 1, and since numa cling is enabled, it will
keep staying on node 1 even though the load is unbalanced. This can happen
frequently, so more and more tasks will prefer node 1 and make it busy.

So the key point is to stop doing numa cling when the load starts to
become unbalanced.

We achieve this by monitoring the migration failure ratio. In the scenario
above, too many tasks prefer node 1 and keep migrating to it; the load
imbalance then causes those migrations to fail. When the failure ratio
rises above the specified degree, we pause the cling and try to resettle
the workloads on a better node by stopping tasks from preferring the busy
node. This finally gives us a result like:

CPU Usage:
	Node 0		Node 1
	ng_A(1000%)	ng_B(1000%)
	ng_C(1000%)	ng_D(1000%)

Memory Ratio:
	Node 0		Node 1
	ng_A(100%)	ng_B(100%)
	ng_C(100%)	ng_D(100%)

Now we have achieved the best locality and the maximum hot cache benefit.
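
How a pause ends can be illustrated with the decay done by the cling
timer: each tick halves the scaled counters, and a node needs at least
one full NUMA_MIGRATE_WEIGHT of failures to count as busy. The sketch
below counts the ticks needed to fall under a threshold. The helper name
is hypothetical, and treating the threshold as a parameter is a
simplification for this example.

```c
#include <assert.h>

#define NUMA_MIGRATE_WEIGHT 1024UL	/* value used by the patch */

/*
 * Number of timer ticks until a scaled counter drops below @threshold
 * when every tick halves it, as the cling timer does.
 */
static unsigned int ticks_until_below(unsigned long scaled,
				      unsigned long threshold)
{
	unsigned int ticks = 0;

	assert(threshold > 0);	/* a zero threshold is never undercut */
	while (scaled >= threshold) {
		scaled /= 2;
		ticks++;
	}
	return ticks;
}
```

Under these assumptions a single recorded failure stops counting toward
the busy check after one tick, and eight accumulated failures need four
ticks, provided no new failures arrive in between.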

Tested on a 2-node box with 96 CPUs running sysbench-mysql-oltp_read_write:
X mysqld instances are created and attached to X cgroups, then X sysbench
instances are created and attached to the corresponding cgroups to test
mysql with the oltp_read_write script for 20 minutes. Average eps:

				origin		ng + cling
4 instances each 24 threads	7641.27		8010.18		+4.83%
4 instances each 48 threads	9423.39		10021.03	+6.34%
4 instances each 72 threads	9691.47		10192.73	+5.17%

8 instances each 24 threads	4485.44		4577.95		+2.06%
8 instances each 48 threads	5565.06		5737.50		+3.10%
8 instances each 72 threads	5605.20		5752.33		+2.63%

Also tested with perf-bench-numa, dbench, sysbench-memory and pgbench; a
tiny improvement was observed.

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---

Since v4:
  * Trivial cleanup

 include/linux/sched/sysctl.h |   3 +
 kernel/sched/fair.c          | 294 ++++++++++++++++++++++++++++++++++++++++---
 kernel/sysctl.c              |   9 ++
 3 files changed, 291 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..6eef34331dd2 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -38,6 +38,9 @@ extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;

+extern unsigned int sysctl_numa_balancing_cling_degree;
+extern unsigned int max_numa_balancing_cling_degree;
+
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c28ba040a563..87d42c6f676c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1066,6 +1066,20 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;

+/*
+ * The numa group serving a task group enables numa cling, a feature
+ * which tries to prevent tasks from leaving the preferred node on wakeup.
+ *
+ * This helps settle the workloads thoroughly and quickly on a node,
+ * while introducing the risk of load imbalance.
+ *
+ * In order to detect the risk in advance and pause the feature, we
+ * rely on numa migration failure stats: when the failure ratio rises
+ * above the cling degree, we pause numa cling until resettling is done.
+ */
+unsigned int sysctl_numa_balancing_cling_degree = 20;
+unsigned int max_numa_balancing_cling_degree = 100;
+
 struct numa_group {
 	refcount_t refcount;

@@ -1073,11 +1087,15 @@ struct numa_group {
 	int nr_tasks;
 	pid_t gid;
 	int active_nodes;
+	int busiest_nid;
 	bool evacuate;
+	bool do_cling;
+	struct timer_list cling_timer;

 	struct rcu_head rcu;
 	unsigned long total_faults;
 	unsigned long max_faults_cpu;
+	unsigned long *migrate_stat;
 	/*
 	 * Faults_cpu is used to decide whether memory should move
 	 * towards the CPU. As a consequence, these stats are weighted
@@ -1087,6 +1105,8 @@ struct numa_group {
 	unsigned long faults[0];
 };

+static inline bool busy_node(struct numa_group *ng, int nid);
+
 static inline unsigned long group_faults_priv(struct numa_group *ng);
 static inline unsigned long group_faults_shared(struct numa_group *ng);

@@ -1131,8 +1151,14 @@ static unsigned int task_scan_start(struct task_struct *p)
 	unsigned long smin = task_scan_min(p);
 	unsigned long period = smin;

-	/* Scale the maximum scan period with the amount of shared memory. */
-	if (p->numa_group) {
+	/*
+	 * Scale the maximum scan period with the amount of shared memory.
+	 *
+	 * Not for the numa group serving a task group: its tasks are not
+	 * gathered for sharing memory, and we need to detect migration
+	 * failure in time.
+	 */
+	if (p->numa_group && !p->numa_group->do_cling) {
 		struct numa_group *ng = p->numa_group;
 		unsigned long shared = group_faults_shared(ng);
 		unsigned long private = group_faults_priv(ng);
@@ -1153,8 +1179,14 @@ static unsigned int task_scan_max(struct task_struct *p)
 	/* Watch for min being lower than max due to floor calculations */
 	smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);

-	/* Scale the maximum scan period with the amount of shared memory. */
-	if (p->numa_group) {
+	/*
+	 * Scale the maximum scan period with the amount of shared memory.
+	 *
+	 * Not for the numa group serving a task group: its tasks are not
+	 * gathered for sharing memory, and we need to detect migration
+	 * failure in time.
+	 */
+	if (p->numa_group && !p->numa_group->do_cling) {
 		struct numa_group *ng = p->numa_group;
 		unsigned long shared = group_faults_shared(ng);
 		unsigned long private = group_faults_priv(ng);
@@ -1474,6 +1506,19 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 					ACTIVE_NODE_FRACTION)
 		return true;

+	/*
+	 * Make sure pages do not stay on a busy node when numa cling is
+	 * enabled, otherwise they could lead to more numa migration
+	 * to the busy node.
+	 */
+	if (ng->do_cling) {
+		if (busy_node(ng, dst_nid))
+			return false;
+
+		if (busy_node(ng, src_nid))
+			return true;
+	}
+
 	/*
 	 * Distribute memory according to CPU & memory use on each node,
 	 * with 3/4 hysteresis to avoid unnecessary memory migrations:
@@ -1592,6 +1637,9 @@ static bool load_too_imbalanced(long src_load, long dst_load,
  */
 #define SMALLIMP	30

+static inline bool
+task_numa_cling(struct task_struct *p, int snid, int dnid);
+
 /*
  * This checks if the overall compute and NUMA accesses of the system would
  * be improved if the source tasks was migrated to the target dst_cpu taking
@@ -1710,6 +1758,10 @@ static void task_numa_compare(struct task_numa_env *env,
 		env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
 						   env->dst_cpu);
 		local_irq_enable();
+	} else {
+		/* Do not swap with a task clinging to 'dst_nid' */
+		if (task_numa_cling(cur, env->dst_nid, env->src_nid))
+			goto unlock;
 	}

 	task_numa_assign(env, cur, imp);
@@ -1873,9 +1925,191 @@ static int task_numa_migrate(struct task_struct *p)
 	return ret;
 }

+/*
+ * We scale each migration stat count to 1024, and use the maximum numa
+ * balancing scan period divided by 10 as the period of the cling timer;
+ * this helps decay one count to about 0 after one maximum scan period.
+ */
+#define NUMA_MIGRATE_SCALE 10
+#define NUMA_MIGRATE_WEIGHT 1024
+
+enum numa_migrate_stats {
+	FAILURE_SCALED,
+	TOTAL_SCALED,
+	FAILURE_RATIO,
+};
+
+static inline int mstat_idx(int nid, enum numa_migrate_stats s)
+{
+	return (nid + s * nr_node_ids);
+}
+
+static inline unsigned long
+mstat_failure_scaled(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, FAILURE_SCALED)];
+}
+
+static inline unsigned long
+mstat_total_scaled(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, TOTAL_SCALED)];
+}
+
+static inline unsigned long
+mstat_failure_ratio(struct numa_group *ng, int nid)
+{
+	return ng->migrate_stat[mstat_idx(nid, FAILURE_RATIO)];
+}
+
+/*
+ * A node is busy when numa migrations toward it have failed too often;
+ * this implies the load is already unbalanced because too many tasks
+ * cling to that node.
+ */
+static inline bool busy_node(struct numa_group *ng, int nid)
+{
+	int degree = sysctl_numa_balancing_cling_degree;
+
+	if (mstat_failure_scaled(ng, nid) < NUMA_MIGRATE_WEIGHT)
+		return false;
+
+	/*
+	 * Allow only one busy node in one numa group, to prevent
+	 * ping-pong migration case between nodes.
+	 */
+	if (ng->busiest_nid != nid)
+		return false;
+
+	return mstat_failure_ratio(ng, nid) > degree;
+}
+
+/*
+ * Return true if the task should cling to snid, i.e. it prefers snid
+ * rather than dnid and snid is not busy.
+ */
+static inline bool
+task_numa_cling(struct task_struct *p, int snid, int dnid)
+{
+	bool ret = false;
+	int pnid = p->numa_preferred_nid;
+	struct numa_group *ng;
+
+	rcu_read_lock();
+
+	ng = p->numa_group;
+
+	/* Cling only when the feature is enabled and not paused */
+	if (!ng || !ng->do_cling)
+		goto out;
+
+	if (pnid == NUMA_NO_NODE ||
+	    dnid == pnid ||
+	    snid != pnid)
+		goto out;
+
+	/* Never allow cling to a busy node */
+	if (busy_node(ng, snid))
+		goto out;
+
+	ret = true;
+out:
+	rcu_read_unlock();
+	return ret;
+}
+
+/*
+ * Prevent more tasks from preferring the busy node to ease the imbalance,
+ * and give the second candidate a chance.
+ */
+static inline bool group_pause_prefer(struct numa_group *ng, int nid)
+{
+	if (!ng || !ng->do_cling)
+		return false;
+
+	return busy_node(ng, nid);
+}
+
+static inline void update_failure_ratio(struct numa_group *ng, int nid)
+{
+	int f_idx = mstat_idx(nid, FAILURE_SCALED);
+	int t_idx = mstat_idx(nid, TOTAL_SCALED);
+	int fp_idx = mstat_idx(nid, FAILURE_RATIO);
+
+	ng->migrate_stat[fp_idx] =
+		ng->migrate_stat[f_idx] * 100 / (ng->migrate_stat[t_idx] + 1);
+}
+
+static void cling_timer_func(struct timer_list *t)
+{
+	int nid;
+	unsigned int degree;
+	unsigned long period, max_failure;
+	struct numa_group *ng = from_timer(ng, t, cling_timer);
+
+	degree = sysctl_numa_balancing_cling_degree;
+	period = msecs_to_jiffies(sysctl_numa_balancing_scan_period_max);
+	period /= NUMA_MIGRATE_SCALE;
+
+	spin_lock_irq(&ng->lock);
+
+	max_failure = 0;
+	for_each_online_node(nid) {
+		int f_idx = mstat_idx(nid, FAILURE_SCALED);
+		int t_idx = mstat_idx(nid, TOTAL_SCALED);
+
+		ng->migrate_stat[f_idx] /= 2;
+		ng->migrate_stat[t_idx] /= 2;
+
+		update_failure_ratio(ng, nid);
+
+		if (ng->migrate_stat[f_idx] > max_failure) {
+			ng->busiest_nid = nid;
+			max_failure = ng->migrate_stat[f_idx];
+		}
+	}
+
+	spin_unlock_irq(&ng->lock);
+
+	mod_timer(&ng->cling_timer, jiffies + period);
+}
+
+static inline void
+update_migrate_stat(struct task_struct *p, int nid, bool failed)
+{
+	int idx;
+	struct numa_group *ng = p->numa_group;
+
+	if (!ng || !ng->do_cling)
+		return;
+
+	spin_lock_irq(&ng->lock);
+
+	if (failed) {
+		idx = mstat_idx(nid, FAILURE_SCALED);
+		ng->migrate_stat[idx] += NUMA_MIGRATE_WEIGHT;
+	}
+
+	idx = mstat_idx(nid, TOTAL_SCALED);
+	ng->migrate_stat[idx] += NUMA_MIGRATE_WEIGHT;
+	update_failure_ratio(ng, nid);
+
+	spin_unlock_irq(&ng->lock);
+
+	/*
+	 * On failure the task may prefer the source node instead, which
+	 * causes ping-pong migration when numa cling is enabled, so
+	 * reset the preferred node to none.
+	 */
+	if (failed)
+		sched_setnuma(p, NUMA_NO_NODE);
+}
+
 /* Attempt to migrate a task to a CPU on the preferred node. */
 static void numa_migrate_preferred(struct task_struct *p)
 {
+	bool failed;
+	int target;
 	unsigned long interval = HZ;

 	/* This task has no NUMA fault statistics yet */
@@ -1890,8 +2124,12 @@ static void numa_migrate_preferred(struct task_struct *p)
 	if (task_node(p) == p->numa_preferred_nid)
 		return;

+	target = p->numa_preferred_nid;
+
 	/* Otherwise, try migrate to a CPU on the preferred node */
-	task_numa_migrate(p);
+	failed = (task_numa_migrate(p) != 0);
+
+	update_migrate_stat(p, target, failed);
 }

 /*
@@ -2215,7 +2453,8 @@ static void task_numa_placement(struct task_struct *p)
 				max_faults = faults;
 				max_nid = nid;
 			}
-		} else if (group_faults > max_faults) {
+		} else if (group_faults > max_faults &&
+			   !group_pause_prefer(p->numa_group, nid)) {
 			max_faults = group_faults;
 			max_nid = nid;
 		}
@@ -2257,8 +2496,10 @@ void show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
 		return;
 	}

-	seq_printf(sf, "id %d nr_tasks %d active_nodes %d\n",
-		   ng->gid, ng->nr_tasks, ng->active_nodes);
+	spin_lock_irq(&ng->lock);
+
+	seq_printf(sf, "id %d nr_tasks %d active_nodes %d busiest_nid %d\n",
+		   ng->gid, ng->nr_tasks, ng->active_nodes, ng->busiest_nid);

 	for_each_online_node(nid) {
 		int f_idx = task_faults_idx(NUMA_MEM, nid, 0);
@@ -2269,9 +2510,16 @@ void show_tg_numa_group(struct task_group *tg, struct seq_file *sf)
 		seq_printf(sf, "mem_private %lu mem_shared %lu ",
 			   ng->faults[f_idx], ng->faults[pf_idx]);

-		seq_printf(sf, "cpu_private %lu cpu_shared %lu\n",
+		seq_printf(sf, "cpu_private %lu cpu_shared %lu ",
 			   ng->faults_cpu[f_idx], ng->faults_cpu[pf_idx]);
+
+		seq_printf(sf, "migrate_stat %lu %lu %lu\n",
+			   mstat_failure_scaled(ng, nid),
+			   mstat_total_scaled(ng, nid),
+			   mstat_failure_ratio(ng, nid));
 	}
+
+	spin_unlock_irq(&ng->lock);
 }

 int update_tg_numa_group(struct task_group *tg, bool numa_group)
@@ -2285,20 +2533,26 @@ int update_tg_numa_group(struct task_group *tg, bool numa_group)
 	if (ng) {
 		/* put and evacuate tg's numa group */
 		rcu_assign_pointer(tg->numa_group, NULL);
+		del_timer_sync(&ng->cling_timer);
 		ng->evacuate = true;
 		put_numa_group(ng);
 	} else {
 		unsigned int size = sizeof(struct numa_group) +
-				    4*nr_node_ids*sizeof(unsigned long);
+				    7*nr_node_ids*sizeof(unsigned long);
+		unsigned int offset = NR_NUMA_HINT_FAULT_TYPES * nr_node_ids;

 		ng = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
 		if (!ng)
 			return -ENOMEM;

 		refcount_set(&ng->refcount, 1);
+		ng->busiest_nid = NUMA_NO_NODE;
+		ng->do_cling = true;
+		timer_setup(&ng->cling_timer, cling_timer_func, 0);
 		spin_lock_init(&ng->lock);
-		ng->faults_cpu = ng->faults + NR_NUMA_HINT_FAULT_TYPES *
-						nr_node_ids;
+		ng->faults_cpu = ng->faults + offset;
+		ng->migrate_stat = ng->faults_cpu + offset;
+		add_timer(&ng->cling_timer);
 		/* now make tasks see and join */
 		rcu_assign_pointer(tg->numa_group, ng);
 	}
@@ -2435,6 +2689,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
 			return;

 		refcount_set(&grp->refcount, 1);
+		grp->busiest_nid = NUMA_NO_NODE;
 		grp->active_nodes = 1;
 		grp->max_faults_cpu = 0;
 		spin_lock_init(&grp->lock);
@@ -2921,6 +3176,11 @@ static inline void update_scan_period(struct task_struct *p, int new_cpu)
 {
 }

+static inline bool task_numa_cling(struct task_struct *p, int snid, int dnid)
+{
+	return false;
+}
+
 #endif /* CONFIG_NUMA_BALANCING */

 static void
@@ -6674,8 +6934,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			new_cpu = prev_cpu;
 		}

-		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
-			      cpumask_test_cpu(cpu, p->cpus_ptr);
+		want_affine = !wake_wide(p) &&
+			      !wake_cap(p, cpu, prev_cpu) &&
+			      cpumask_test_cpu(cpu, p->cpus_ptr) &&
+			      !task_numa_cling(p, cpu_to_node(prev_cpu),
+						cpu_to_node(cpu));
 	}

 	rcu_read_lock();
@@ -7384,7 +7647,8 @@ static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)

 	/* Migrating away from the preferred node is always bad. */
 	if (src_nid == p->numa_preferred_nid) {
-		if (env->src_rq->nr_running > env->src_rq->nr_preferred_running)
+		if (task_numa_cling(p, src_nid, dst_nid) ||
+		    env->src_rq->nr_running > env->src_rq->nr_preferred_running)
 			return 1;
 		else
 			return -1;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 078950d9605b..0a889dd1c7ed 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -417,6 +417,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "numa_balancing_cling_degree",
+		.data		= &sysctl_numa_balancing_cling_degree,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &max_numa_balancing_cling_degree,
+	},
 	{
 		.procname	= "numa_balancing",
 		.data		= NULL, /* filled in by handler */
-- 
2.14.4.44.g2045bb6
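
For readers following the allocation change in update_tg_numa_group()
above: the size grows from 4 to 7 counters per node because three
migration counters (FAILURE_SCALED, TOTAL_SCALED, FAILURE_RATIO) are
appended behind the two existing fault arrays. Below is a minimal
userspace sketch of that layout, not the kernel code itself. The names
struct ng_sketch and ng_sketch_alloc() are hypothetical, the index formula
inside mstat_idx() is an assumption (the patch's real helper is not part
of this excerpt), and NR_NUMA_HINT_FAULT_TYPES is assumed to be 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed value; it makes the two fault arrays take 4 * nr_node_ids slots. */
#define NR_NUMA_HINT_FAULT_TYPES 2

/* The three per-node migration counters named in the patch. */
enum mstat_type { FAILURE_SCALED, TOTAL_SCALED, FAILURE_RATIO, NR_MSTAT_TYPES };

/* Userspace stand-in for struct numa_group, reduced to the array fields. */
struct ng_sketch {
	unsigned int nr_node_ids;
	unsigned long *faults_cpu;	/* points into faults[] */
	unsigned long *migrate_stat;	/* points into faults[] */
	unsigned long faults[];		/* one allocation backs all three */
};

/* Hypothetical layout: the counters of one node sit next to each other. */
static inline size_t mstat_idx(unsigned int nid, enum mstat_type type)
{
	return (size_t)nid * NR_MSTAT_TYPES + type;
}

static struct ng_sketch *ng_sketch_alloc(unsigned int nr_node_ids)
{
	size_t offset = (size_t)NR_NUMA_HINT_FAULT_TYPES * nr_node_ids;
	size_t slots = 2 * offset + (size_t)NR_MSTAT_TYPES * nr_node_ids;
	struct ng_sketch *ng;

	if (nr_node_ids == 0)
		return NULL;

	/* calloc() zeroes the block, as kzalloc() does in the patch. */
	ng = calloc(1, sizeof(*ng) + slots * sizeof(unsigned long));
	if (!ng)
		return NULL;

	ng->nr_node_ids = nr_node_ids;
	ng->faults_cpu = ng->faults + offset;
	ng->migrate_stat = ng->faults_cpu + offset;

	/* The last counter of the last node must land on the last slot. */
	assert(ng->migrate_stat + mstat_idx(nr_node_ids - 1, FAILURE_RATIO) ==
	       ng->faults + slots - 1);
	return ng;
}
```

With 4 nodes this gives 8 slots for faults, 8 for faults_cpu and 12 for
migrate_stat, 28 in total, which matches 7 * nr_node_ids.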



* Re: [PATCH 4/4] numa: introduce numa cling feature
  2019-07-12  8:58           ` 王贇
@ 2019-07-22  3:44             ` 王贇
  0 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-07-22  3:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hannes, mhocko, vdavydov.dev, Ingo Molnar, linux-kernel,
	linux-mm, mcgrof, keescook, linux-fsdevel, cgroups, Mel Gorman,
	riel



On 2019/7/12 4:58 PM, 王贇 wrote:
[snip]
> 
> I see, we should not override the decision of select_idle_sibling().
> 
> Actually, the original design we tried to achieve is:
> 
>   let wake affine select the target
>   try to find an idle sibling of the target
>   if got one
> 	pick it
>   else if task clings to prev
> 	pick prev
> 
> That is, to consider wake affine superior to numa cling.
> 
> But after rethinking, maybe this is not necessary, since numa cling is
> also a kind of strong wake affine hint, maybe even a better one for
> filtering out the bad cases.
> 
> I'll try changing @target instead and retest.

We now leave select_idle_sibling() untouched; instead we prevent numa swap
with a task that clings to dst, and stop wake affine when the curr & prev
cpus are on different nodes and the wakee clings to prev.

Retesting shows even better results; benchmarks like dbench also show a
1%~5% improvement, not stable but always improved now :-)
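
A minimal userspace sketch of the wake affine rule described above,
mirroring the checks in task_numa_cling() from the patch. It is an
illustration under stated assumptions: struct cling_input and both
function names are hypothetical, the busy_node() result is passed in as a
plain flag, and the other want_affine conditions (wake_wide(), wake_cap(),
the cpumask test) are left out.

```c
#include <stdbool.h>

#define NUMA_NO_NODE	(-1)

/* Hypothetical snapshot of the state task_numa_cling() looks at. */
struct cling_input {
	bool do_cling;		/* group exists and cling is not paused */
	bool src_node_busy;	/* stand-in for busy_node(ng, snid) */
	int preferred_nid;	/* stand-in for p->numa_preferred_nid */
};

/* True when the task should stay on node snid instead of moving to dnid. */
static bool cling_to_src(const struct cling_input *in, int snid, int dnid)
{
	if (!in->do_cling)
		return false;

	/* No preferred node yet: nothing to cling to. */
	if (in->preferred_nid == NUMA_NO_NODE)
		return false;

	/* Cling only if we are on the preferred node and would leave it. */
	if (snid != in->preferred_nid || dnid == in->preferred_nid)
		return false;

	/* Never cling to a busy node. */
	return !in->src_node_busy;
}

/* The cling term of want_affine: stop wake affine if the wakee clings to prev. */
static bool want_wake_affine(const struct cling_input *in,
			     int prev_nid, int curr_nid)
{
	return !cling_to_src(in, prev_nid, curr_nid);
}
```

When prev and curr are on the same node, dnid equals the preferred node
whenever snid does, so cling never triggers and wake affine behaves as
before.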

Regards,
Michael Wang

> 
> Regards,
> Michael Wang
> 


* Re: [PATCH v2 0/4] per-cgroup numa suite
  2019-07-16  3:38   ` [PATCH v2 0/4] per-cgroup " 王贇
                       ` (3 preceding siblings ...)
  2019-07-16  3:41     ` [PATCH v4 4/4] numa: introduce numa cling feature 王贇
@ 2019-07-25  2:33     ` 王贇
  2019-08-06  1:33     ` 王贇
  5 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-07-25  2:33 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm, mcgrof, keescook, linux-fsdevel, cgroups,
	Michal Koutný,
	Hillf Danton

Hi, Peter

Now we have all this stuff in the cpu cgroup. With the new statistics,
folks should be able to estimate their per-cgroup workloads on numa
platforms, and numa group + cling would help to address the issue when
their workloads can't be settled on one node.

What do you think about this version :-)

Regards,
Michael Wang

On 2019/7/16 11:38 AM, 王贇 wrote:
> While torturing the numa stuff, we found problems like:
> 
>   * missing per-cgroup information about the per-node execution status
>   * missing per-cgroup information about the numa locality
> 
> That is, when we have a cpu cgroup running a bunch of tasks, there is no
> good way to tell how its tasks are dealing with numa.
> 
> The first two patches try to fill in the missing pieces, but more
> problems appeared after monitoring these statistics:
> 
>   * tasks not always running on the preferred numa node
>   * tasks from the same cgroup running on different nodes
> 
> The task numa group handler always checks whether tasks are sharing pages
> and tries to pack them into a single numa group, so they have a chance to
> settle down on the same node, but this fails in some cases:
> 
>   * workloads share page caches rather than share mappings
>   * workloads get too many wakeups across nodes
> 
> Since page caches are not traced by numa balancing, there is no way to
> detect this kind of relationship, and when there are too many wakeups,
> a task will be dragged away from the preferred node and then migrated
> back by numa balancing, repeatedly.
> 
> The third patch tries to address the first issue: we can now give the
> kernel a hint about the relationship of tasks, and pack them into a
> single numa group.
> 
> The fourth patch introduces numa cling, which tries to address the wakeup
> issue: we now try to make a task stay on the preferred node on wakeup in
> the fast path. To address the risk of imbalance, we monitor the numa
> migration failure ratio and pause numa cling when it reaches the
> specified degree.
> 
> Since v1:
>   * move statistics from memory cgroup into cpu cgroup
>   * statistics are now accounted hierarchically
>   * locality is now accounted into 8 equal regions
>   * numa cling no longer overrides select_idle_sibling; instead we
>     prevent numa swap migration with tasks that cling to the dst-node,
>     and prevent wake affine from dragging away tasks that already cling
>     to the prev-cpu
>   * other refinements to comments and names
> 
> Michael Wang (4):
>   v2 numa: introduce per-cgroup numa balancing locality statistic
>   v2 numa: append per-node execution time in cpu.numa_stat
>   v2 numa: introduce numa group per task group
>   v4 numa: introduce numa cling feature
> 
>  include/linux/sched.h        |   8 +-
>  include/linux/sched/sysctl.h |   3 +
>  kernel/sched/core.c          |  85 ++++++++
>  kernel/sched/debug.c         |   7 +
>  kernel/sched/fair.c          | 510 ++++++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/sched.h         |  41 ++++
>  kernel/sysctl.c              |   9 +
>  7 files changed, 651 insertions(+), 12 deletions(-)
> 
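
The failure-ratio bookkeeping mentioned in the cover letter above can be
sketched in userspace as follows. The arithmetic follows the v4 patch
(weighted counters, a percentage ratio with +1 in the divisor, and
periodic halving of the counters), but the rest is assumption: struct
node_mstat and the function names are hypothetical, the value of
NUMA_MIGRATE_WEIGHT is made up for illustration, and the ">= degree"
comparison in cling_paused() is only one reading of "pause numa cling
when it reaches the specified degree".

```c
#include <stdbool.h>

/* Hypothetical weight; the real definition is not part of this excerpt. */
#define NUMA_MIGRATE_WEIGHT	1024UL

/* Per-node migration counters kept for one numa group. */
struct node_mstat {
	unsigned long failure_scaled;
	unsigned long total_scaled;
	unsigned long failure_ratio;	/* percent */
};

static void mstat_update_ratio(struct node_mstat *s)
{
	/* +1 avoids a division by zero before any attempt is recorded. */
	s->failure_ratio = s->failure_scaled * 100 / (s->total_scaled + 1);
}

/* Account one attempt to migrate a task to this node. */
static void mstat_record(struct node_mstat *s, bool failed)
{
	if (failed)
		s->failure_scaled += NUMA_MIGRATE_WEIGHT;
	s->total_scaled += NUMA_MIGRATE_WEIGHT;
	mstat_update_ratio(s);
}

/* Periodic decay, as done by the timer: old attempts count half as much. */
static void mstat_decay(struct node_mstat *s)
{
	s->failure_scaled /= 2;
	s->total_scaled /= 2;
	mstat_update_ratio(s);
}

/* Assumed pause rule: stop clinging once the ratio reaches the degree. */
static bool cling_paused(const struct node_mstat *s, unsigned int degree)
{
	return s->failure_ratio >= degree;
}
```

Halving both counters keeps the ratio roughly unchanged while letting new
attempts outweigh old ones, so a node that stops rejecting migrations
drops below the degree after a few periods.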


* Re: [PATCH v2 0/4] per-cgroup numa suite
  2019-07-16  3:38   ` [PATCH v2 0/4] per-cgroup " 王贇
                       ` (4 preceding siblings ...)
  2019-07-25  2:33     ` [PATCH v2 0/4] per-cgroup numa suite 王贇
@ 2019-08-06  1:33     ` 王贇
  5 siblings, 0 replies; 62+ messages in thread
From: 王贇 @ 2019-08-06  1:33 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm, mcgrof, keescook, linux-fsdevel, cgroups,
	Michal Koutný,
	Hillf Danton

Hi, Folks

Please feel free to comment if you have any concerns :-)

Hi, Peter

What do you think about this version?

Please let us know if it's still not good enough to be accepted :-)

Regards,
Michael Wang



end of thread, other threads:[~2019-08-06  1:33 UTC | newest]

Thread overview: 62+ messages
2019-04-22  2:10 [RFC PATCH 0/5] NUMA Balancer Suite 王贇
2019-04-22  2:11 ` [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic 王贇
2019-04-23  8:44   ` Peter Zijlstra
2019-04-23  9:14     ` 王贇
2019-04-23  8:46   ` Peter Zijlstra
2019-04-23  9:32     ` 王贇
2019-04-23  8:47   ` Peter Zijlstra
2019-04-23  9:33     ` 王贇
2019-04-23  9:46       ` Peter Zijlstra
2019-04-22  2:12 ` [RFC PATCH 2/5] numa: append per-node execution info in memory.numa_stat 王贇
2019-04-23  8:52   ` Peter Zijlstra
2019-04-23  9:36     ` 王贇
2019-04-23  9:46       ` Peter Zijlstra
2019-04-23 10:01         ` 王贇
2019-04-22  2:13 ` [RFC PATCH 3/5] numa: introduce per-cgroup preferred numa node 王贇
2019-04-23  8:55   ` Peter Zijlstra
2019-04-23  9:41     ` 王贇
2019-04-22  2:14 ` [RFC PATCH 4/5] numa: introduce numa balancer infrastructure 王贇
2019-04-22  2:21 ` [RFC PATCH 5/5] numa: numa balancer 王贇
2019-04-23  9:05   ` Peter Zijlstra
2019-04-23  9:59     ` 王贇
     [not found] ` <CAHCio2gEw4xyuoiurvwzvEiU8eLas+5ZLhzmqm1V2CJqvt+cyA@mail.gmail.com>
2019-04-23  2:14   ` [RFC PATCH 0/5] NUMA Balancer Suite 王贇
2019-07-03  3:26 ` [PATCH 0/4] per cpu cgroup numa suite 王贇
2019-07-03  3:28   ` [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic 王贇
2019-07-11 13:43     ` Peter Zijlstra
2019-07-12  3:15       ` 王贇
2019-07-11 13:47     ` Peter Zijlstra
2019-07-12  3:43       ` 王贇
2019-07-12  7:58         ` Peter Zijlstra
2019-07-12  9:11           ` 王贇
2019-07-12  9:42             ` Peter Zijlstra
2019-07-12 10:10               ` 王贇
2019-07-15  2:09                 ` 王贇
2019-07-15 12:10                 ` Michal Koutný
2019-07-16  2:41                   ` 王贇
2019-07-19 16:47                     ` Michal Koutný
2019-07-03  3:29   ` [PATCH 2/4] numa: append per-node execution info in memory.numa_stat 王贇
2019-07-11 13:45     ` Peter Zijlstra
2019-07-12  3:17       ` 王贇
2019-07-03  3:32   ` [PATCH 3/4] numa: introduce numa group per task group 王贇
2019-07-11 14:10     ` Peter Zijlstra
2019-07-12  4:03       ` 王贇
2019-07-03  3:34   ` [PATCH 4/4] numa: introduce numa cling feature 王贇
2019-07-08  2:25     ` [PATCH v2 " 王贇
2019-07-09  2:15       ` 王贇
2019-07-09  2:24       ` [PATCH v3 " 王贇
2019-07-11 14:27     ` [PATCH " Peter Zijlstra
2019-07-12  3:10       ` 王贇
2019-07-12  7:53         ` Peter Zijlstra
2019-07-12  8:58           ` 王贇
2019-07-22  3:44             ` 王贇
2019-07-11  9:00   ` [PATCH 0/4] per cgroup numa suite 王贇
2019-07-16  3:38   ` [PATCH v2 0/4] per-cgroup " 王贇
2019-07-16  3:39     ` [PATCH v2 1/4] numa: introduce per-cgroup numa balancing locality statistic 王贇
2019-07-16  3:40     ` [PATCH v2 2/4] numa: append per-node execution time in cpu.numa_stat 王贇
2019-07-19 16:39       ` Michal Koutný
2019-07-22  2:36         ` 王贇
2019-07-16  3:41     ` [PATCH v2 3/4] numa: introduce numa group per task group 王贇
2019-07-16  3:41     ` [PATCH v4 4/4] numa: introduce numa cling feature 王贇
2019-07-22  2:37       ` [PATCH v5 " 王贇
2019-07-25  2:33     ` [PATCH v2 0/4] per-cgroup numa suite 王贇
2019-08-06  1:33     ` 王贇
