LKML Archive on lore.kernel.org
* [RFC/PATCH 00/17] Current sched patch stack
@ 2008-03-09 17:08 Peter Zijlstra
  2008-03-09 17:08 ` [RFC/PATCH 01/17] sched: mix tasks and groups Peter Zijlstra
                   ` (16 more replies)
  0 siblings, 17 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:08 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra

Hi all,

This is just a quick dump of my patch stack that is not already in
sched-devel/master (read: this is on top of sched-devel/master).

These patches are lightly tested - compiled with only a single .config
and boot-tested. Will repost in somewhat smaller chunks once I get around
to actually testing more.

The newest code (and thus least tested) is the new SMP balancer.

Just posting to get it out there... :-)

If it breaks, you get to keep both (or rather, all) pieces, although I'll try
and fix reported issues in a timely manner.

Patches can also be found here:
  http://programming.kicks-ass.net/kernel-patches/sched-eevdf/

Happy hacking,

/me off to enjoy the last few hours of weekend remaining.
--



* [RFC/PATCH 01/17] sched: mix tasks and groups
  2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
@ 2008-03-09 17:08 ` Peter Zijlstra
  2008-04-01 12:12   ` Srivatsa Vaddagiri
  2008-03-09 17:08 ` [RFC/PATCH 02/17] sched: allow the CFS group scheduler to have multiple levels Peter Zijlstra
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:08 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra

[-- Attachment #1: dhaval-fairness-model.patch --]
[-- Type: text/plain, Size: 7813 bytes --]

This patch allows tasks and groups to coexist in the same cfs_rq. With this
change, CFS group scheduling moves from a 1/(1+N) fairness model to a
1/(M+N) model, where M tasks and N groups exist at the cfs_rq level.
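
As a rough worked example (using the illustrative numbers from the comment
added to sched_init() below): with M = 10 tasks of weight 1024 and N = 2
child groups A0/A1 of weight 1024 each on one cfs_rq, a group's share of the
cpu becomes

	1024 / (10*1024 + 1024 + 1024) = 1/12 ~= 8.33%

whereas under the previous 1/(1+N) model, where those tasks hid behind a
single group entity, A0 would have received 1024 / (3*1024) = 1/3.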

[a.p.zijlstra@chello.nl: rt bits]
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c      |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched_fair.c |   48 +++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched_rt.c   |   15 ++++++++------
 3 files changed, 106 insertions(+), 11 deletions(-)

Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -273,18 +273,23 @@ struct task_group {
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+
+#ifdef CONFIG_USER_SCHED
 /* Default task group's sched entity on each cpu */
 static DEFINE_PER_CPU(struct sched_entity, init_sched_entity);
 /* Default task group's cfs_rq on each cpu */
 static DEFINE_PER_CPU(struct cfs_rq, init_cfs_rq) ____cacheline_aligned_in_smp;
+#endif
 
 static struct sched_entity *init_sched_entity_p[NR_CPUS];
 static struct cfs_rq *init_cfs_rq_p[NR_CPUS];
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
+#ifdef CONFIG_USER_SCHED
 static DEFINE_PER_CPU(struct sched_rt_entity, init_sched_rt_entity);
 static DEFINE_PER_CPU(struct rt_rq, init_rt_rq) ____cacheline_aligned_in_smp;
+#endif
 
 static struct sched_rt_entity *init_sched_rt_entity_p[NR_CPUS];
 static struct rt_rq *init_rt_rq_p[NR_CPUS];
@@ -7279,6 +7284,10 @@ static void init_tg_cfs_entry(struct rq 
 		list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
 
 	tg->se[cpu] = se;
+	/* se could be NULL for init_task_group */
+	if (!se)
+		return;
+
 	se->cfs_rq = &rq->cfs;
 	se->my_q = cfs_rq;
 	se->load.weight = tg->shares;
@@ -7301,6 +7310,9 @@ static void init_tg_rt_entry(struct rq *
 		list_add(&rt_rq->leaf_rt_rq_list, &rq->leaf_rt_rq_list);
 
 	tg->rt_se[cpu] = rt_se;
+	if (!rt_se)
+		return;
+
 	rt_se->rt_rq = &rq->rt;
 	rt_se->my_q = rt_rq;
 	rt_se->parent = NULL;
@@ -7343,18 +7355,56 @@ void __init sched_init(void)
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		init_task_group.shares = init_task_group_load;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
+#ifdef CONFIG_CGROUP_SCHED
+		/*
+		 * How much cpu bandwidth does init_task_group get?
+		 *
+		 * In case of task-groups formed thr' the cgroup filesystem, it
+		 * gets 100% of the cpu resources in the system. This overall
+		 * system cpu resource is divided among the tasks of
+		 * init_task_group and its child task-groups in a fair manner,
+		 * based on each entity's (task or task-group's) weight
+		 * (se->load.weight).
+		 *
+		 * In other words, if init_task_group has 10 tasks (of weight
+		 * 1024) and two child groups A0 and A1 (of weight 1024 each),
+		 * then A0's share of the cpu resource is:
+		 *
+		 * 	A0's bandwidth = 1024 / (10*1024 + 1024 + 1024) = 8.33%
+		 *
+		 * We achieve this by letting init_task_group's tasks sit
+		 * directly in rq->cfs (i.e init_task_group->se[] = NULL).
+		 */
+		init_tg_cfs_entry(rq, &init_task_group, &rq->cfs, NULL, i, 1);
+#elif defined CONFIG_USER_SCHED
+		/*
+		 * In case of task-groups formed thr' the user id of tasks,
+		 * init_task_group represents tasks belonging to root user.
+		 * Hence it forms a sibling of all subsequent groups formed.
+		 * In this case, init_task_group gets only a fraction of overall
+		 * system cpu resource, based on the weight assigned to root
+		 * user's cpu share (INIT_TASK_GROUP_LOAD). This is accomplished
+		 * by letting tasks of init_task_group sit in a separate cfs_rq
+		 * (init_cfs_rq) and having one entity represent this group of
+		 * tasks in rq->cfs (i.e init_task_group->se[] != NULL).
+		 */
 		init_tg_cfs_entry(rq, &init_task_group,
 				&per_cpu(init_cfs_rq, i),
 				&per_cpu(init_sched_entity, i), i, 1);
 
 #endif
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+		rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
 #ifdef CONFIG_RT_GROUP_SCHED
 		INIT_LIST_HEAD(&rq->leaf_rt_rq_list);
+#ifdef CONFIG_CGROUP_SCHED
+		init_tg_rt_entry(rq, &init_task_group, &rq->rt, NULL, i, 1);
+#elif defined CONFIG_USER_SCHED
 		init_tg_rt_entry(rq, &init_task_group,
 				&per_cpu(init_rt_rq, i),
 				&per_cpu(init_sched_rt_entity, i), i, 1);
-#else
-		rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
+#endif
 #endif
 
 		for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
Index: linux-2.6-2/kernel/sched_fair.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_fair.c
+++ linux-2.6-2/kernel/sched_fair.c
@@ -1029,6 +1029,17 @@ out_set_cpu:
 }
 #endif /* CONFIG_SMP */
 
+/* return depth at which a sched entity is present in the hierarchy */
+static inline int depth_se(struct sched_entity *se)
+{
+	int depth = 0;
+
+	for_each_sched_entity(se)
+		depth++;
+
+	return depth;
+}
+
 
 /*
  * Preempt the current task with a newly woken task if needed:
@@ -1039,6 +1050,7 @@ static void check_preempt_wakeup(struct 
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	struct sched_entity *se = &curr->se, *pse = &p->se;
 	unsigned long gran;
+	int se_depth, pse_depth;
 
 	if (unlikely(rt_prio(p->prio))) {
 		update_rq_clock(rq);
@@ -1056,6 +1068,27 @@ static void check_preempt_wakeup(struct 
 	if (!sched_feat(WAKEUP_PREEMPT))
 		return;
 
+	/*
+	 * preemption test can be made between sibling entities who are in the
+	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
+	 * both tasks until we find their ancestors who are siblings of common
+	 * parent.
+	 */
+
+	/* First walk up until both entities are at same depth */
+	se_depth = depth_se(se);
+	pse_depth = depth_se(pse);
+
+	while (se_depth > pse_depth) {
+		se_depth--;
+		se = parent_entity(se);
+	}
+
+	while (pse_depth > se_depth) {
+		pse_depth--;
+		pse = parent_entity(pse);
+	}
+
 	while (!is_same_group(se, pse)) {
 		se = parent_entity(se);
 		pse = parent_entity(pse);
@@ -1122,13 +1155,22 @@ static void put_prev_task_fair(struct rq
 static struct task_struct *
 __load_balance_iterator(struct cfs_rq *cfs_rq, struct rb_node *curr)
 {
-	struct task_struct *p;
+	struct task_struct *p = NULL;
+	struct sched_entity *se;
 
 	if (!curr)
 		return NULL;
 
-	p = rb_entry(curr, struct task_struct, se.run_node);
-	cfs_rq->rb_load_balance_curr = rb_next(curr);
+	/* Skip over entities that are not tasks */
+	do {
+		se = rb_entry(curr, struct sched_entity, run_node);
+		curr = rb_next(curr);
+	} while (curr && !entity_is_task(se));
+
+	cfs_rq->rb_load_balance_curr = curr;
+
+	if (entity_is_task(se))
+		p = task_of(se);
 
 	return p;
 }
Index: linux-2.6-2/kernel/sched_rt.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_rt.c
+++ linux-2.6-2/kernel/sched_rt.c
@@ -374,11 +374,15 @@ static void update_curr_rt(struct rq *rq
 	curr->se.exec_start = rq->clock;
 	cpuacct_charge(curr, delta_exec);
 
-	spin_lock(&rt_rq->rt_runtime_lock);
-	rt_rq->rt_time += delta_exec;
-	if (sched_rt_runtime_exceeded(rt_rq))
-		resched_task(curr);
-	spin_unlock(&rt_rq->rt_runtime_lock);
+	for_each_sched_rt_entity(rt_se) {
+		rt_rq = rt_rq_of_se(rt_se);
+
+		spin_lock(&rt_rq->rt_runtime_lock);
+		rt_rq->rt_time += delta_exec;
+		if (sched_rt_runtime_exceeded(rt_rq))
+			resched_task(curr);
+		spin_unlock(&rt_rq->rt_runtime_lock);
+	}
 }
 
 static inline
@@ -477,7 +481,6 @@ static void dequeue_rt_entity(struct sch
  * entries, we must remove entries top - down.
  *
  * XXX: O(1/2 h^2) because we can only walk up, not down the chain.
- *      doesn't matter much for now, as h=2 for GROUP_SCHED.
  */
 static void dequeue_rt_stack(struct task_struct *p)
 {

--



* [RFC/PATCH 02/17] sched: allow the CFS group scheduler to have multiple levels
  2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
  2008-03-09 17:08 ` [RFC/PATCH 01/17] sched: mix tasks and groups Peter Zijlstra
@ 2008-03-09 17:08 ` Peter Zijlstra
  2008-03-12 14:47   ` Dhaval Giani
  2008-03-09 17:08 ` [RFC/PATCH 03/17] sched: rt-group: full hierarchy support for the rt groups Peter Zijlstra
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:08 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra

[-- Attachment #1: dhaval-multi-level.patch --]
[-- Type: text/plain, Size: 4787 bytes --]

This patch makes the group scheduler aware of multi-level hierarchies.
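
A minimal usage sketch of the changed interface (illustrative only, not part
of the patch): callers now pass the parent group, which allows groups to be
nested below an existing group instead of only below the root.

	struct task_group *child, *grandchild;

	/* first level: parent is the root group */
	child = sched_create_group(&init_task_group);
	if (IS_ERR(child))
		return PTR_ERR(child);

	/* second level: nest another group below the one just created */
	grandchild = sched_create_group(child);
	if (IS_ERR(grandchild))
		return PTR_ERR(grandchild);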

[a.p.zijlstra@chello.nl: fixups]
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    2 +-
 kernel/sched.c        |   39 +++++++++++++++++++++++----------------
 kernel/user.c         |    2 +-
 3 files changed, 25 insertions(+), 18 deletions(-)

Index: linux-2.6-2/include/linux/sched.h
===================================================================
--- linux-2.6-2.orig/include/linux/sched.h
+++ linux-2.6-2/include/linux/sched.h
@@ -2062,7 +2062,7 @@ extern void normalize_rt_tasks(void);
 
 extern struct task_group init_task_group;
 
-extern struct task_group *sched_create_group(void);
+extern struct task_group *sched_create_group(struct task_group *parent);
 extern void sched_destroy_group(struct task_group *tg);
 extern void sched_move_task(struct task_struct *tsk);
 #ifdef CONFIG_FAIR_GROUP_SCHED
Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -7273,10 +7273,11 @@ static void init_rt_rq(struct rt_rq *rt_
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static void init_tg_cfs_entry(struct rq *rq, struct task_group *tg,
-		struct cfs_rq *cfs_rq, struct sched_entity *se,
-		int cpu, int add)
+static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
+				struct sched_entity *se, int cpu, int add,
+				struct sched_entity *parent)
 {
+	struct rq *rq = cpu_rq(cpu);
 	tg->cfs_rq[cpu] = cfs_rq;
 	init_cfs_rq(cfs_rq, rq);
 	cfs_rq->tg = tg;
@@ -7288,7 +7289,11 @@ static void init_tg_cfs_entry(struct rq 
 	if (!se)
 		return;
 
-	se->cfs_rq = &rq->cfs;
+	if (!parent)
+		se->cfs_rq = &rq->cfs;
+	else
+		se->cfs_rq = parent->my_q;
+
 	se->my_q = cfs_rq;
 	se->load.weight = tg->shares;
 	se->load.inv_weight = div64_64(1ULL<<32, se->load.weight);
@@ -7375,7 +7380,8 @@ void __init sched_init(void)
 		 * We achieve this by letting init_task_group's tasks sit
 		 * directly in rq->cfs (i.e init_task_group->se[] = NULL).
 		 */
-		init_tg_cfs_entry(rq, &init_task_group, &rq->cfs, NULL, i, 1);
+		init_tg_cfs_entry(&init_task_group, &rq->cfs,
+							NULL, i, 1, NULL);
 #elif defined CONFIG_USER_SCHED
 		/*
 		 * In case of task-groups formed thr' the user id of tasks,
@@ -7390,7 +7396,7 @@ void __init sched_init(void)
 		 */
 		init_tg_cfs_entry(rq, &init_task_group,
 				&per_cpu(init_cfs_rq, i),
-				&per_cpu(init_sched_entity, i), i, 1);
+				&per_cpu(init_sched_entity, i), i, 1, NULL);
 
 #endif
 #endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -7607,7 +7613,8 @@ static void free_fair_sched_group(struct
 	kfree(tg->se);
 }
 
-static int alloc_fair_sched_group(struct task_group *tg)
+static int alloc_fair_sched_group(struct task_group *tg,
+					struct task_group *parent)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se;
@@ -7636,7 +7643,10 @@ static int alloc_fair_sched_group(struct
 		if (!se)
 			goto err;
 
-		init_tg_cfs_entry(rq, tg, cfs_rq, se, i, 0);
+		if (!parent)
+			init_tg_cfs_entry(tg, cfs_rq, se, i, 0, NULL);
+		else
+			init_tg_cfs_entry(tg, cfs_rq, se, i, 0,	parent->se[i]);
 	}
 
 	return 1;
@@ -7769,7 +7779,7 @@ static void free_sched_group(struct task
 }
 
 /* allocate runqueue etc for a new task group */
-struct task_group *sched_create_group(void)
+struct task_group *sched_create_group(struct task_group *parent)
 {
 	struct task_group *tg;
 	unsigned long flags;
@@ -7779,7 +7789,7 @@ struct task_group *sched_create_group(vo
 	if (!tg)
 		return ERR_PTR(-ENOMEM);
 
-	if (!alloc_fair_sched_group(tg))
+	if (!alloc_fair_sched_group(tg, parent))
 		goto err;
 
 	if (!alloc_rt_sched_group(tg))
@@ -8138,7 +8148,7 @@ static inline struct task_group *cgroup_
 static struct cgroup_subsys_state *
 cpu_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
-	struct task_group *tg;
+	struct task_group *tg, *parent;
 
 	if (!cgrp->parent) {
 		/* This is early initialization for the top cgroup */
@@ -8146,11 +8156,8 @@ cpu_cgroup_create(struct cgroup_subsys *
 		return &init_task_group.css;
 	}
 
-	/* we support only 1-level deep hierarchical scheduler atm */
-	if (cgrp->parent->parent)
-		return ERR_PTR(-EINVAL);
-
-	tg = sched_create_group();
+	parent = cgroup_tg(cgrp->parent);
+	tg = sched_create_group(parent);
 	if (IS_ERR(tg))
 		return ERR_PTR(-ENOMEM);
 
Index: linux-2.6-2/kernel/user.c
===================================================================
--- linux-2.6-2.orig/kernel/user.c
+++ linux-2.6-2/kernel/user.c
@@ -101,7 +101,7 @@ static int sched_create_user(struct user
 {
 	int rc = 0;
 
-	up->tg = sched_create_group();
+	up->tg = sched_create_group(NULL);
 	if (IS_ERR(up->tg))
 		rc = -ENOMEM;
 

--



* [RFC/PATCH 03/17] sched: rt-group: full hierarchy support for the rt groups
  2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
  2008-03-09 17:08 ` [RFC/PATCH 01/17] sched: mix tasks and groups Peter Zijlstra
  2008-03-09 17:08 ` [RFC/PATCH 02/17] sched: allow the CFS group scheduler to have multiple levels Peter Zijlstra
@ 2008-03-09 17:08 ` Peter Zijlstra
  2008-03-12  8:25   ` Dhaval Giani
  2008-03-09 17:08 ` [RFC/PATCH 04/17] sched: fair-group: SMP-nice for group scheduling Peter Zijlstra
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:08 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra

[-- Attachment #1: sched-rt-group-multi-level.patch --]
[-- Type: text/plain, Size: 4626 bytes --]

Update the -rt bits to support the full hierarchy work started by Dhaval.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   52 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 30 insertions(+), 22 deletions(-)

Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -7302,10 +7302,12 @@ static void init_tg_cfs_entry(struct tas
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
-static void init_tg_rt_entry(struct rq *rq, struct task_group *tg,
-		struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
-		int cpu, int add)
+static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
+		struct sched_rt_entity *rt_se, int cpu, int add,
+		struct sched_rt_entity *parent)
 {
+	struct rq *rq = cpu_rq(cpu);
+
 	tg->rt_rq[cpu] = rt_rq;
 	init_rt_rq(rt_rq, rq);
 	rt_rq->tg = tg;
@@ -7318,6 +7320,11 @@ static void init_tg_rt_entry(struct rq *
 	if (!rt_se)
 		return;
 
+	if (!parent)
+		rt_se->rt_rq = &rq->rt;
+	else
+		rt_se->rt_rq = parent->my_q;
+
 	rt_se->rt_rq = &rq->rt;
 	rt_se->my_q = rt_rq;
 	rt_se->parent = NULL;
@@ -7380,8 +7387,7 @@ void __init sched_init(void)
 		 * We achieve this by letting init_task_group's tasks sit
 		 * directly in rq->cfs (i.e init_task_group->se[] = NULL).
 		 */
-		init_tg_cfs_entry(&init_task_group, &rq->cfs,
-							NULL, i, 1, NULL);
+		init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, 1, NULL);
 #elif defined CONFIG_USER_SCHED
 		/*
 		 * In case of task-groups formed thr' the user id of tasks,
@@ -7394,7 +7400,7 @@ void __init sched_init(void)
 		 * (init_cfs_rq) and having one entity represent this group of
 		 * tasks in rq->cfs (i.e init_task_group->se[] != NULL).
 		 */
-		init_tg_cfs_entry(rq, &init_task_group,
+		init_tg_cfs_entry(&init_task_group,
 				&per_cpu(init_cfs_rq, i),
 				&per_cpu(init_sched_entity, i), i, 1, NULL);
 
@@ -7405,11 +7411,11 @@ void __init sched_init(void)
 #ifdef CONFIG_RT_GROUP_SCHED
 		INIT_LIST_HEAD(&rq->leaf_rt_rq_list);
 #ifdef CONFIG_CGROUP_SCHED
-		init_tg_rt_entry(rq, &init_task_group, &rq->rt, NULL, i, 1);
+		init_tg_rt_entry(&init_task_group, &rq->rt, NULL, i, 1, NULL);
 #elif defined CONFIG_USER_SCHED
-		init_tg_rt_entry(rq, &init_task_group,
+		init_tg_rt_entry(&init_task_group,
 				&per_cpu(init_rt_rq, i),
-				&per_cpu(init_sched_rt_entity, i), i, 1);
+				&per_cpu(init_sched_rt_entity, i), i, 1, NULL);
 #endif
 #endif
 
@@ -7613,11 +7619,11 @@ static void free_fair_sched_group(struct
 	kfree(tg->se);
 }
 
-static int alloc_fair_sched_group(struct task_group *tg,
-					struct task_group *parent)
+static
+int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 {
 	struct cfs_rq *cfs_rq;
-	struct sched_entity *se;
+	struct sched_entity *se, *parent_se;
 	struct rq *rq;
 	int i;
 
@@ -7643,10 +7649,8 @@ static int alloc_fair_sched_group(struct
 		if (!se)
 			goto err;
 
-		if (!parent)
-			init_tg_cfs_entry(tg, cfs_rq, se, i, 0, NULL);
-		else
-			init_tg_cfs_entry(tg, cfs_rq, se, i, 0,	parent->se[i]);
+		parent_se = parent ? parent->se[i] : NULL;
+		init_tg_cfs_entry(tg, cfs_rq, se, i, 0, parent_se);
 	}
 
 	return 1;
@@ -7670,7 +7674,8 @@ static inline void free_fair_sched_group
 {
 }
 
-static inline int alloc_fair_sched_group(struct task_group *tg)
+static inline
+int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 {
 	return 1;
 }
@@ -7702,10 +7707,11 @@ static void free_rt_sched_group(struct t
 	kfree(tg->rt_se);
 }
 
-static int alloc_rt_sched_group(struct task_group *tg)
+static
+int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
 {
 	struct rt_rq *rt_rq;
-	struct sched_rt_entity *rt_se;
+	struct sched_rt_entity *rt_se, *parent_se;
 	struct rq *rq;
 	int i;
 
@@ -7732,7 +7738,8 @@ static int alloc_rt_sched_group(struct t
 		if (!rt_se)
 			goto err;
 
-		init_tg_rt_entry(rq, tg, rt_rq, rt_se, i, 0);
+		parent_se = parent ? parent->rt_se[i] : NULL;
+		init_tg_rt_entry(tg, rt_rq, rt_se, i, 0, parent_se);
 	}
 
 	return 1;
@@ -7756,7 +7763,8 @@ static inline void free_rt_sched_group(s
 {
 }
 
-static inline int alloc_rt_sched_group(struct task_group *tg)
+static inline
+int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
 {
 	return 1;
 }
@@ -7792,7 +7800,7 @@ struct task_group *sched_create_group(st
 	if (!alloc_fair_sched_group(tg, parent))
 		goto err;
 
-	if (!alloc_rt_sched_group(tg))
+	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
 	spin_lock_irqsave(&task_group_lock, flags);

--



* [RFC/PATCH 04/17] sched: fair-group: SMP-nice for group scheduling
  2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
                   ` (2 preceding siblings ...)
  2008-03-09 17:08 ` [RFC/PATCH 03/17] sched: rt-group: full hierarchy support for the rt groups Peter Zijlstra
@ 2008-03-09 17:08 ` Peter Zijlstra
  2008-03-09 17:08 ` [RFC/PATCH 05/17] sched: rt-group: optimize dequeue_rt_stack Peter Zijlstra
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:08 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra, Abhishek Chandra

[-- Attachment #1: sched-fair-group-load-balance.patch --]
[-- Type: text/plain, Size: 19632 bytes --]

Implement SMP nice support for the full group hierarchy.

On each load-balance action, compile a sched_domain-wide view of the full
task_group tree. We compute the domain-wide RQ weight and RQ load when
walking down the hierarchy, and readjust the weights when walking back up.

After collecting and readjusting the domain-wide view, we try to balance the
tasks within the task_groups. The current approach is a naive search for the
group with the biggest load imbalance - assuming a few group rebalances
will suffice to balance out the cpus.
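
As a rough example of that scaling (illustrative numbers): a raw rq-weight
imbalance of 2048 in a group whose domain-wide aggregate load is 512 and
aggregate weight is 4096 counts as 2048 * 512 / 4096 = 256 in total-load
terms, so a heavier group with the same raw imbalance is preferred, because
moving its tasks shifts more of the overall load.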

If all else fails, we try to balance the cpus by evacuating whole groups away
from the busiest cpu.

Inspired by Srivatsa Vaddagiri's previous code and Abhishek Chandra's H-SMP
paper.

XXX: there will be some numerical issues due to the limited resolution of
     SCHED_LOAD_SCALE wrt representing a task_group's influence on the
     total weight. When the tree is deep enough, or the task weight small
     enough, we'll run out of bits.
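
For instance, assuming SCHED_LOAD_SCALE is 1024: each level scales the
aggregate load by roughly the group's fraction of its parent's weight, so a
chain of groups each holding about a tenth of its parent's weight goes
1024 -> 102 -> 10 -> 1 -> 0 within four levels, after which the group's
influence can no longer be represented at all.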

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Abhishek Chandra <chandra@cs.umn.edu>
CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
 kernel/sched.c      |  317 +++++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched_fair.c |  179 +++++++++++++++++++++--------
 kernel/sched_rt.c   |    4 
 3 files changed, 424 insertions(+), 76 deletions(-)

Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -270,6 +270,9 @@ struct task_group {
 
 	struct rcu_head rcu;
 	struct list_head list;
+	struct task_group *parent;
+	struct list_head siblings;
+	struct list_head children;
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -326,6 +329,8 @@ struct task_group init_task_group = {
 	.rt_se	= init_sched_rt_entity_p,
 	.rt_rq	= init_rt_rq_p,
 #endif
+
+	.children = LIST_HEAD_INIT(init_task_group.children),
 };
 
 /* return group to which a task belongs */
@@ -407,6 +412,30 @@ struct cfs_rq {
 	 */
 	struct list_head leaf_cfs_rq_list;
 	struct task_group *tg;	/* group that "owns" this runqueue */
+
+#ifdef CONFIG_SMP
+	/*
+	 * We need space to build a sched_domain wide view of the full task
+	 * group tree, in order to avoid depending on dynamic memory allocation
+	 * during the load balancing we place this in the per cpu task group
+	 * hierarchy. This limits the load balancing to one instance per cpu,
+	 * but more should not be needed anyway.
+	 */
+	struct aggregate_struct {
+		/*
+		 *   load = weight(cpus) * SCHED_LOAD_SCALE * f(tg)
+		 *
+		 * Where f(tg) is the recursive weight fraction assigned to
+		 * this group.
+		 */
+		unsigned long load;
+
+		/*
+		 * The sum of all runqueue weights within this sched_domain.
+		 */
+		unsigned long weight;
+	} aggregate;
+#endif
 #endif
 };
 
@@ -1360,11 +1389,240 @@ static void cpuacct_charge(struct task_s
 static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
 #endif
 
+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+	update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+	update_load_sub(&rq->load, load);
+}
+
 #ifdef CONFIG_SMP
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
 static unsigned long cpu_avg_load_per_task(int cpu);
 static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+
+/*
+ * Group load balancing.
+ *
+ * We calculate a few balance domain wide aggregate numbers; load and weight.
+ * Given the pictures below, and assuming each item has equal weight:
+ *
+ *         root          1 - thread
+ *         / | \         A - group
+ *        A  1  B
+ *       /|\   / \
+ *      C 2 D 3   4
+ *      |   |
+ *      5   6
+ *
+ * load:
+ *    A and B get 1/3-rd of the total load. C and D get 1/3-rd of A's 1/3-rd,
+ *    which equals 1/9-th of the total load.
+ *
+ * weight:
+ *    Direct sum of all the cpus' rq weights, e.g. A would get 3 while
+ *    B would get 2.
+ *
+ */
+
+static inline
+struct aggregate_struct *aggregate(struct task_group *tg, int cpu)
+{
+	return &tg->cfs_rq[cpu]->aggregate;
+}
+
+typedef void (*aggregate_func)(struct task_group *, int, cpumask_t);
+
+/*
+ * Iterate the full tree, calling @down when first entering a node and @up when
+ * leaving it for the final time.
+ */
+static
+void aggregate_walk_tree(aggregate_func down, aggregate_func up,
+			 int cpu, cpumask_t mask)
+{
+	struct task_group *parent, *child;
+
+	rcu_read_lock();
+	parent = &init_task_group;
+down:
+	(*down)(parent, cpu, mask);
+	list_for_each_entry_rcu(child, &parent->children, siblings) {
+		parent = child;
+		goto down;
+
+up:
+		continue;
+	}
+	(*up)(parent, cpu, mask);
+
+	child = parent;
+	parent = parent->parent;
+	if (parent)
+		goto up;
+	rcu_read_unlock();
+}
+
+/*
+ * Calculate the aggregate runqueue weight.
+ */
+static
+void aggregate_group_weight(struct task_group *tg, int cpu, cpumask_t mask)
+{
+	int i;
+	unsigned long weight = 0;
+
+	for_each_cpu_mask(i, mask)
+		weight += tg->cfs_rq[i]->load.weight;
+
+	aggregate(tg, cpu)->weight = weight;
+}
+
+/*
+ * Compute the load fraction assigned to this group, relies on the aggregate
+ * weight and this group's parent's load, i.e. top-down.
+ */
+static
+void aggregate_group_load(struct task_group *tg, int cpu, cpumask_t mask)
+{
+	unsigned long weight = cpus_weight(mask);
+	unsigned long load;
+
+	if (!tg->parent) {
+		load = SCHED_LOAD_SCALE * weight;
+	} else {
+		load = aggregate(tg->parent, cpu)->load;
+		load *= tg->shares * weight;
+		load /= aggregate(tg->parent, cpu)->weight;
+	}
+
+	aggregate(tg, cpu)->load = load;
+}
+
+static void set_se_shares(struct sched_entity *se, unsigned long shares);
+
+/*
+ * Calculate and set the cpu's group shares.
+ */
+static void __update_group_shares_cpu(struct task_group *tg, int acpu,
+		int tcpu, unsigned long weight)
+{
+	unsigned long local_shares;
+	unsigned long local_weight;
+	unsigned long total_shares;
+
+	if (!tg->se[tcpu])
+		return;
+
+	if (!aggregate(tg, acpu)->weight)
+		return;
+
+	total_shares = weight * tg->shares;
+	local_weight = tg->cfs_rq[tcpu]->load.weight;
+
+	/*
+	 * If there are currently no tasks on the cpu pretend there is one of
+	 * average load so that when a new task gets to run here it will not
+	 * get delayed by group starvation.
+	 */
+	if (!local_weight)
+		local_weight = NICE_0_LOAD;
+
+	local_shares = total_shares * local_weight;
+	local_shares /= aggregate(tg, acpu)->weight;
+
+	if (local_shares < 2)
+		local_shares = 2;
+
+	set_se_shares(tg->se[tcpu], local_shares);
+}
+
+/*
+ * Because changing a group's shares changes the weight of the super-group
+ * we need to walk up the tree and change all shares until we hit the root.
+ */
+static void update_group_shares_cpu(struct task_group *tg, int acpu,
+		int tcpu, unsigned long weight)
+{
+	while (tg) {
+		__update_group_shares_cpu(tg, acpu, tcpu, weight);
+		tg = tg->parent;
+	}
+}
+
+static
+void aggregate_group_shares(struct task_group *tg, int cpu, cpumask_t mask)
+{
+	unsigned long weight = cpus_weight(mask);
+	int i;
+
+	for_each_cpu_mask(i, mask)
+		__update_group_shares_cpu(tg, cpu, i, weight);
+}
+
+/*
+ * Calculate the accumulative weight and recursive load of each task group
+ * while walking down the tree.
+ */
+static
+void aggregate_get_down(struct task_group *tg, int cpu, cpumask_t mask)
+{
+	aggregate_group_weight(tg, cpu, mask);
+	aggregate_group_load(tg, cpu, mask);
+}
+
+/*
+ * Rebalance the cpu shares while walking back up the tree.
+ */
+static
+void aggregate_get_up(struct task_group *tg, int cpu, cpumask_t mask)
+{
+	aggregate_group_shares(tg, cpu, mask);
+}
+
+static DEFINE_PER_CPU(spinlock_t, aggregate_lock);
+
+static void __init init_aggregate(void)
+{
+	int i;
+
+	for_each_possible_cpu(i)
+		spin_lock_init(&per_cpu(aggregate_lock, i));
+}
+
+static void get_aggregate(int cpu, cpumask_t mask)
+{
+	spin_lock(&per_cpu(aggregate_lock, cpu));
+	aggregate_walk_tree(aggregate_get_down, aggregate_get_up, cpu, mask);
+}
+
+static void put_aggregate(int cpu, cpumask_t mask)
+{
+	spin_unlock(&per_cpu(aggregate_lock, cpu));
+}
+
+#else
+
+static inline void init_aggregate(void)
+{
+}
+
+static inline void get_aggregate(int cpu, cpumask_t mask)
+{
+}
+
+static inline void put_aggregate(int cpu, cpumask_t mask)
+{
+}
+
+#endif
+
 #endif /* CONFIG_SMP */
 
 #include "sched_stats.h"
@@ -1377,26 +1635,14 @@ static int task_hot(struct task_struct *
 
 #define sched_class_highest (&rt_sched_class)
 
-static inline void inc_load(struct rq *rq, const struct task_struct *p)
-{
-	update_load_add(&rq->load, p->se.load.weight);
-}
-
-static inline void dec_load(struct rq *rq, const struct task_struct *p)
-{
-	update_load_sub(&rq->load, p->se.load.weight);
-}
-
-static void inc_nr_running(struct task_struct *p, struct rq *rq)
+static void inc_nr_running(struct rq *rq)
 {
 	rq->nr_running++;
-	inc_load(rq, p);
 }
 
-static void dec_nr_running(struct task_struct *p, struct rq *rq)
+static void dec_nr_running(struct rq *rq)
 {
 	rq->nr_running--;
-	dec_load(rq, p);
 }
 
 static void set_load_weight(struct task_struct *p)
@@ -1488,7 +1734,7 @@ static void activate_task(struct rq *rq,
 		rq->nr_uninterruptible--;
 
 	enqueue_task(rq, p, wakeup);
-	inc_nr_running(p, rq);
+	inc_nr_running(rq);
 }
 
 /*
@@ -1500,7 +1746,7 @@ static void deactivate_task(struct rq *r
 		rq->nr_uninterruptible++;
 
 	dequeue_task(rq, p, sleep);
-	dec_nr_running(p, rq);
+	dec_nr_running(rq);
 }
 
 /**
@@ -2141,7 +2387,7 @@ void wake_up_new_task(struct task_struct
 		 * management (if any):
 		 */
 		p->sched_class->task_new(rq, p);
-		inc_nr_running(p, rq);
+		inc_nr_running(rq);
 	}
 	ftrace_wake_up_new_task(p, rq->curr);
 	check_preempt_curr(rq, p);
@@ -3136,6 +3382,8 @@ static int load_balance(int this_cpu, st
 	cpumask_t cpus = CPU_MASK_ALL;
 	unsigned long flags;
 
+	get_aggregate(this_cpu, sd->span);
+
 	/*
 	 * When power savings policy is enabled for the parent domain, idle
 	 * sibling can pick up load irrespective of busy siblings. In this case,
@@ -3251,8 +3499,9 @@ redo:
 
 	if (!ld_moved && !sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
 	    !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
-		return -1;
-	return ld_moved;
+		ld_moved = -1;
+
+	goto out;
 
 out_balanced:
 	schedstat_inc(sd, lb_balanced[idle]);
@@ -3267,8 +3516,12 @@ out_one_pinned:
 
 	if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
 	    !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
-		return -1;
-	return 0;
+		ld_moved = -1;
+	else
+		ld_moved = 0;
+out:
+	put_aggregate(this_cpu, sd->span);
+	return ld_moved;
 }
 
 /*
@@ -4503,10 +4756,8 @@ void set_user_nice(struct task_struct *p
 		goto out_unlock;
 	}
 	on_rq = p->se.on_rq;
-	if (on_rq) {
+	if (on_rq)
 		dequeue_task(rq, p, 0);
-		dec_load(rq, p);
-	}
 
 	p->static_prio = NICE_TO_PRIO(nice);
 	set_load_weight(p);
@@ -4516,7 +4767,6 @@ void set_user_nice(struct task_struct *p
 
 	if (on_rq) {
 		enqueue_task(rq, p, 0);
-		inc_load(rq, p);
 		/*
 		 * If the task increased its priority or is running and
 		 * lowered its priority, then reschedule its CPU:
@@ -7338,6 +7588,7 @@ void __init sched_init(void)
 	int i, j;
 
 #ifdef CONFIG_SMP
+	init_aggregate();
 	init_defrootdomain();
 #endif
 
@@ -7803,12 +8054,22 @@ struct task_group *sched_create_group(st
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	INIT_LIST_HEAD(&tg->children);
+
 	spin_lock_irqsave(&task_group_lock, flags);
 	for_each_possible_cpu(i) {
 		register_fair_sched_group(tg, i);
 		register_rt_sched_group(tg, i);
 	}
 	list_add_rcu(&tg->list, &task_groups);
+
+	rcu_assign_pointer(tg->parent, parent);
+
+	if (!parent)
+		list_add_rcu(&tg->siblings, &init_task_group.children);
+	else
+		list_add_rcu(&tg->siblings, &parent->children);
+
 	spin_unlock_irqrestore(&task_group_lock, flags);
 
 	return tg;
@@ -7837,6 +8098,7 @@ void sched_destroy_group(struct task_gro
 		unregister_rt_sched_group(tg, i);
 	}
 	list_del_rcu(&tg->list);
+	list_del_rcu(&tg->siblings);
 	spin_unlock_irqrestore(&task_group_lock, flags);
 
 	/* wait for possible concurrent references to cfs_rqs complete */
@@ -7889,9 +8151,10 @@ static void set_se_shares(struct sched_e
 {
 	struct cfs_rq *cfs_rq = se->cfs_rq;
 	struct rq *rq = cfs_rq->rq;
+	unsigned long flags;
 	int on_rq;
 
-	spin_lock_irq(&rq->lock);
+	spin_lock_irqsave(&rq->lock, flags);
 
 	on_rq = se->on_rq;
 	if (on_rq)
@@ -7903,7 +8166,7 @@ static void set_se_shares(struct sched_e
 	if (on_rq)
 		enqueue_entity(cfs_rq, se, 0);
 
-	spin_unlock_irq(&rq->lock);
+	spin_unlock_irqrestore(&rq->lock, flags);
 }
 
 static DEFINE_MUTEX(shares_mutex);
Index: linux-2.6-2/kernel/sched_fair.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_fair.c
+++ linux-2.6-2/kernel/sched_fair.c
@@ -804,15 +804,22 @@ static void enqueue_task_fair(struct rq 
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	struct sched_entity *top_se = NULL;
 
 	for_each_sched_entity(se) {
-		if (se->on_rq)
+		top_se = se;
+		if (se->on_rq) {
+			top_se = NULL;
 			break;
+		}
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, wakeup);
 		wakeup = 1;
 	}
 
+	if (top_se)
+		inc_cpu_load(rq, top_se->load.weight);
+
 	hrtick_start_fair(rq, rq->curr);
 }
 
@@ -825,16 +832,24 @@ static void dequeue_task_fair(struct rq 
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	struct sched_entity *top_se = NULL;
 
 	for_each_sched_entity(se) {
+		top_se = se;
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, sleep);
 		/* Don't dequeue parent if it has other entities besides us */
-		if (cfs_rq->load.weight)
+		if (cfs_rq->load.weight) {
+			if (parent_entity(se))
+				top_se = NULL;
 			break;
+		}
 		sleep = 1;
 	}
 
+	if (top_se)
+		dec_cpu_load(rq, top_se->load.weight);
+
 	hrtick_start_fair(rq, rq->curr);
 }
 
@@ -1189,73 +1204,139 @@ static struct task_struct *load_balance_
 	return __load_balance_iterator(cfs_rq, cfs_rq->rb_load_balance_curr);
 }
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-static int cfs_rq_best_prio(struct cfs_rq *cfs_rq)
-{
-	struct sched_entity *curr;
-	struct task_struct *p;
-
-	if (!cfs_rq->nr_running || !first_fair(cfs_rq))
-		return MAX_PRIO;
-
-	curr = cfs_rq->curr;
-	if (!curr)
-		curr = __pick_next_entity(cfs_rq);
-
-	p = task_of(curr);
-
-	return p->prio;
-}
-#endif
-
 static unsigned long
 load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		  unsigned long max_load_move,
 		  struct sched_domain *sd, enum cpu_idle_type idle,
 		  int *all_pinned, int *this_best_prio)
 {
-	struct cfs_rq *busy_cfs_rq;
 	long rem_load_move = max_load_move;
 	struct rq_iterator cfs_rq_iterator;
+	struct cfs_rq *cfs_rq = NULL;
+	unsigned long load_moved;
 
-	cfs_rq_iterator.start = load_balance_start_fair;
-	cfs_rq_iterator.next = load_balance_next_fair;
-
-	for_each_leaf_cfs_rq(busiest, busy_cfs_rq) {
 #ifdef CONFIG_FAIR_GROUP_SCHED
-		struct cfs_rq *this_cfs_rq;
-		long imbalance;
-		unsigned long maxload;
+	int busiest_cpu = cpu_of(busiest);
+	unsigned long max_imbalance = 0;
+	unsigned long min_imbalance = 0;
+	struct task_group *tg, *min_tg = NULL, *max_tg = NULL;
+	unsigned long weight, max_load;
 
-		this_cfs_rq = cpu_cfs_rq(busy_cfs_rq, this_cpu);
+	/*
+	 * Group load balancing; we look for the group whose imbalance has the
+	 * biggest load effect. Failing to find any group that is out of
+	 * balance, we look for groups to remove from the busy cpu.
+	 */
 
-		imbalance = busy_cfs_rq->load.weight - this_cfs_rq->load.weight;
-		/* Don't pull if this_cfs_rq has more load than busy_cfs_rq */
-		if (imbalance <= 0)
+again:
+	rcu_read_lock();
+	list_for_each_entry(tg, &task_groups, list) {
+		long imbalance;
+		unsigned long this_weight, busiest_weight;
+
+		/*
+		 * empty group
+		 */
+		if (!aggregate(tg, this_cpu)->weight)
 			continue;
 
-		/* Don't pull more than imbalance/2 */
-		imbalance /= 2;
-		maxload = min(rem_load_move, imbalance);
+		this_weight = tg->cfs_rq[this_cpu]->load.weight;
+		busiest_weight = tg->cfs_rq[busiest_cpu]->load.weight;
+
+		imbalance = busiest_weight - this_weight;
 
-		*this_best_prio = cfs_rq_best_prio(this_cfs_rq);
-#else
-# define maxload rem_load_move
-#endif
 		/*
-		 * pass busy_cfs_rq argument into
-		 * load_balance_[start|next]_fair iterators
+		 * Scale the imbalance to reflect the effect it has on the
+		 * total load.
 		 */
-		cfs_rq_iterator.arg = busy_cfs_rq;
-		rem_load_move -= balance_tasks(this_rq, this_cpu, busiest,
-					       maxload, sd, idle, all_pinned,
-					       this_best_prio,
-					       &cfs_rq_iterator);
+		if (imbalance > 0) {
+			/*
+			 * Calculate the load we could shift by balancing this
+			 * group.
+			 */
+			imbalance *= aggregate(tg, this_cpu)->load;
+			imbalance /= aggregate(tg, this_cpu)->weight;
 
-		if (rem_load_move <= 0)
-			break;
+			if (imbalance > max_imbalance) {
+				max_imbalance = imbalance;
+				max_tg = tg;
+			}
+		} else {
+			/*
+			 * Calculate the load we could shift by taking away this
+			 * group.
+			 */
+			imbalance = busiest_weight;
+			imbalance *= aggregate(tg, this_cpu)->load;
+			imbalance /= aggregate(tg, this_cpu)->weight;
+
+			if (imbalance < rem_load_move &&
+					imbalance > min_imbalance) {
+				min_imbalance = imbalance;
+				min_tg = tg;
+			}
+		}
 	}
 
+	/*
+	 * Prefer to balance groups to ensure maximal concurrency. If that
+	 * fails, try to take a whole group out.
+	 */
+	if (max_tg) {
+		tg = max_tg;
+		cfs_rq = tg->cfs_rq[busiest_cpu];
+		max_load = min_t(unsigned long, rem_load_move, max_imbalance/2);
+	} else if (min_tg) {
+		tg = min_tg;
+		cfs_rq = tg->cfs_rq[busiest_cpu];
+		max_load = min_t(unsigned long, rem_load_move, min_imbalance);
+	} else
+		goto out;
+
+	max_load *= aggregate(tg, this_cpu)->weight;
+	max_load /= aggregate(tg, this_cpu)->load;
+
+	*this_best_prio = 0;
+#else
+	cfs_rq = &busiest->cfs;
+#define max_load rem_load_move
+#endif
+
+	cfs_rq_iterator.start = load_balance_start_fair;
+	cfs_rq_iterator.next = load_balance_next_fair;
+	cfs_rq_iterator.arg = cfs_rq;
+
+	load_moved = balance_tasks(this_rq, this_cpu, busiest,
+			max_load, sd, idle, all_pinned,
+			this_best_prio, &cfs_rq_iterator);
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	/*
+	 * Scale the moved load to be comparable to total load.
+	 */
+	load_moved *= aggregate(tg, this_cpu)->load;
+	load_moved /= aggregate(tg, this_cpu)->weight;
+
+	rem_load_move -= load_moved;
+
+	/*
+	 * Re-compute the cpu shares for the affected cpus in this group.
+	 */
+	weight = cpus_weight(sd->span);
+	update_group_shares_cpu(tg, this_cpu, this_cpu, weight);
+	update_group_shares_cpu(tg, this_cpu, busiest_cpu, weight);
+
+out:
+	rcu_read_unlock();
+	/*
+	 * While there is more work to do, try again.
+	 */
+	if (rem_load_move > 0 && cfs_rq)
+		goto again;
+#else
+	rem_load_move -= load_moved;
+#endif
+
 	return max_load_move - rem_load_move;
 }
 
Index: linux-2.6-2/kernel/sched_rt.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_rt.c
+++ linux-2.6-2/kernel/sched_rt.c
@@ -518,6 +518,8 @@ static void enqueue_task_rt(struct rq *r
 	 */
 	for_each_sched_rt_entity(rt_se)
 		enqueue_rt_entity(rt_se);
+
+	inc_cpu_load(rq, p->se.load.weight);
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
@@ -537,6 +539,8 @@ static void dequeue_task_rt(struct rq *r
 		if (rt_rq && rt_rq->rt_nr_running)
 			enqueue_rt_entity(rt_se);
 	}
+
+	dec_cpu_load(rq, p->se.load.weight);
 }
 
 /*

--



* [RFC/PATCH 05/17] sched: rt-group: optimize dequeue_rt_stack
  2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
                   ` (3 preceding siblings ...)
  2008-03-09 17:08 ` [RFC/PATCH 04/17] sched: fair-group: SMP-nice for group scheduling Peter Zijlstra
@ 2008-03-09 17:08 ` Peter Zijlstra
  2008-03-09 17:08 ` [RFC/PATCH 06/17] sched: fair-group scheduling vs latency Peter Zijlstra
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:08 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra

[-- Attachment #1: sched-rt-group-opt-dequeue.patch --]
[-- Type: text/plain, Size: 1886 bytes --]

Now that the group hierarchy can have an arbitrary depth, the O(n^2) nature
of RT task dequeues will really hurt. Optimize this by providing space to
store the tree path, so we can walk it the other way.
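
Concretely, for a hierarchy of depth h the old code re-walked the chain from
the task upwards once per dequeued level, i.e. h + (h-1) + ... + 1 =
h*(h+1)/2 steps, whereas recording the path lets us do a single upward walk
to store it and a single downward walk to dequeue - roughly 2h steps.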

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    1 +
 kernel/sched_rt.c     |   28 ++++++++++++----------------
 2 files changed, 13 insertions(+), 16 deletions(-)

Index: linux-2.6-2/kernel/sched_rt.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_rt.c
+++ linux-2.6-2/kernel/sched_rt.c
@@ -479,26 +479,22 @@ static void dequeue_rt_entity(struct sch
 /*
  * Because the prio of an upper entry depends on the lower
  * entries, we must remove entries top - down.
- *
- * XXX: O(1/2 h^2) because we can only walk up, not down the chain.
  */
 static void dequeue_rt_stack(struct task_struct *p)
 {
-	struct sched_rt_entity *rt_se, *top_se;
+	struct sched_rt_entity *rt_se, *back = NULL;
 
-	/*
-	 * dequeue all, top - down.
-	 */
-	do {
-		rt_se = &p->rt;
-		top_se = NULL;
-		for_each_sched_rt_entity(rt_se) {
-			if (on_rt_rq(rt_se))
-				top_se = rt_se;
-		}
-		if (top_se)
-			dequeue_rt_entity(top_se);
-	} while (top_se);
+	rt_se = &p->rt;
+	for_each_sched_rt_entity(rt_se) {
+		if (!on_rt_rq(rt_se))
+			break;
+
+		rt_se->back = back;
+		back = rt_se;
+	}
+
+	for (rt_se = back; rt_se; rt_se = rt_se->back)
+		dequeue_rt_entity(rt_se);
 }
 
 /*
Index: linux-2.6-2/include/linux/sched.h
===================================================================
--- linux-2.6-2.orig/include/linux/sched.h
+++ linux-2.6-2/include/linux/sched.h
@@ -978,6 +978,7 @@ struct sched_rt_entity {
 	unsigned long timeout;
 	int nr_cpus_allowed;
 
+	struct sched_rt_entity *back;
 #ifdef CONFIG_RT_GROUP_SCHED
 	struct sched_rt_entity	*parent;
 	/* rq on which this entity is (to be) queued: */

--



* [RFC/PATCH 06/17] sched: fair-group scheduling vs latency
  2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
                   ` (4 preceding siblings ...)
  2008-03-09 17:08 ` [RFC/PATCH 05/17] sched: rt-group: optimize dequeue_rt_stack Peter Zijlstra
@ 2008-03-09 17:08 ` Peter Zijlstra
  2008-03-09 17:08 ` [RFC/PATCH 07/17] sched: fair-group: de-couple load-balancing from the rb-trees Peter Zijlstra
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:08 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra

[-- Attachment #1: sched-fair-group-slice.patch --]
[-- Type: text/plain, Size: 7370 bytes --]

Currently FAIR_GROUP sched grows the scheduler latency beyond
sysctl_sched_latency; invert this so that it stays within that bound.
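
With this change sched_slice() scales the period by the weight ratio at
every level of the hierarchy instead of only at the task's own cfs_rq. As a
rough worked example (illustrative weights): a nice-0 task (weight 1024) on
a group runqueue of total weight 2048, whose group entity (weight 1024) sits
on a root cfs_rq of total weight 2048, now gets

	slice = period * (1024/2048) * (1024/2048) = period/4

so the slices handed out across the whole hierarchy still sum to roughly one
period (sysctl_sched_latency), instead of every group level stretching the
effective latency with a full period of its own.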

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |  231 +++++++++++++++++++++++++++-------------------------
 1 file changed, 120 insertions(+), 111 deletions(-)

Index: linux-2.6-2/kernel/sched_fair.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_fair.c
+++ linux-2.6-2/kernel/sched_fair.c
@@ -87,6 +87,11 @@ const_debug unsigned int sysctl_sched_mi
  * CFS operations on generic schedulable entities:
  */
 
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+	return container_of(se, struct task_struct, se);
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
 /* cpu runqueue to which this cfs_rq is attached */
@@ -98,6 +103,56 @@ static inline struct rq *rq_of(struct cf
 /* An entity is a task if it doesn't "own" a runqueue */
 #define entity_is_task(se)	(!se->my_q)
 
+/* Walk up scheduling entities hierarchy */
+#define for_each_sched_entity(se) \
+		for (; se; se = se->parent)
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+	return p->se.cfs_rq;
+}
+
+/* runqueue on which this entity is (to be) queued */
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+	return se->cfs_rq;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+	return grp->my_q;
+}
+
+/* Given a group's cfs_rq on one cpu, return its corresponding cfs_rq on
+ * another cpu ('this_cpu')
+ */
+static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
+{
+	return cfs_rq->tg->cfs_rq[this_cpu];
+}
+
+/* Iterate thr' all leaf cfs_rq's on a runqueue */
+#define for_each_leaf_cfs_rq(rq, cfs_rq) \
+	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+
+/* Do the two (enqueued) entities belong to the same group ? */
+static inline int
+is_same_group(struct sched_entity *se, struct sched_entity *pse)
+{
+	if (se->cfs_rq == pse->cfs_rq)
+		return 1;
+
+	return 0;
+}
+
+static inline struct sched_entity *parent_entity(struct sched_entity *se)
+{
+	return se->parent;
+}
+
+#define GROUP_IMBALANCE_PCT	20
+
 #else	/* CONFIG_FAIR_GROUP_SCHED */
 
 static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
@@ -107,13 +162,49 @@ static inline struct rq *rq_of(struct cf
 
 #define entity_is_task(se)	1
 
-#endif	/* CONFIG_FAIR_GROUP_SCHED */
+#define for_each_sched_entity(se) \
+		for (; se; se = NULL)
 
-static inline struct task_struct *task_of(struct sched_entity *se)
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
 {
-	return container_of(se, struct task_struct, se);
+	return &task_rq(p)->cfs;
+}
+
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+	struct task_struct *p = task_of(se);
+	struct rq *rq = task_rq(p);
+
+	return &rq->cfs;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+	return NULL;
+}
+
+static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
+{
+	return &cpu_rq(this_cpu)->cfs;
+}
+
+#define for_each_leaf_cfs_rq(rq, cfs_rq) \
+		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
+
+static inline int
+is_same_group(struct sched_entity *se, struct sched_entity *pse)
+{
+	return 1;
+}
+
+static inline struct sched_entity *parent_entity(struct sched_entity *se)
+{
+	return NULL;
 }
 
+#endif	/* CONFIG_FAIR_GROUP_SCHED */
+
 
 /**************************************************************
  * Scheduling class tree data structure manipulation methods:
@@ -262,31 +353,44 @@ static u64 sched_slice(struct cfs_rq *cf
 {
 	u64 slice = __sched_period(cfs_rq->nr_running);
 
-	slice *= se->load.weight;
-	do_div(slice, cfs_rq->load.weight);
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+
+		slice *= se->load.weight;
+		do_div(slice, cfs_rq->load.weight);
+	}
 
 	return slice;
 }
 
 /*
- * We calculate the vruntime slice.
+ * We calculate the vruntime slice of a to be inserted task
  *
  * vs = s/w = p/rw
  */
-static u64 __sched_vslice(unsigned long rq_weight, unsigned long nr_running)
+static u64 sched_vslice_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	u64 vslice = __sched_period(nr_running);
+	unsigned long nr_running = cfs_rq->nr_running;
+	unsigned long weight;
+	u64 vslice;
 
-	vslice *= NICE_0_LOAD;
-	do_div(vslice, rq_weight);
+	if (!se->on_rq)
+		nr_running++;
 
-	return vslice;
-}
+	vslice = __sched_period(nr_running);
 
-static u64 sched_vslice_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	return __sched_vslice(cfs_rq->load.weight + se->load.weight,
-			cfs_rq->nr_running + 1);
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+
+		weight = cfs_rq->load.weight;
+		if (!se->on_rq)
+			weight += se->load.weight;
+
+		vslice *= NICE_0_LOAD;
+		do_div(vslice, weight);
+	}
+
+	return vslice;
 }
 
 /*
@@ -663,101 +767,6 @@ entity_tick(struct cfs_rq *cfs_rq, struc
  * CFS operations on tasks:
  */
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-
-/* Walk up scheduling entities hierarchy */
-#define for_each_sched_entity(se) \
-		for (; se; se = se->parent)
-
-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
-	return p->se.cfs_rq;
-}
-
-/* runqueue on which this entity is (to be) queued */
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
-	return se->cfs_rq;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
-	return grp->my_q;
-}
-
-/* Given a group's cfs_rq on one cpu, return its corresponding cfs_rq on
- * another cpu ('this_cpu')
- */
-static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
-{
-	return cfs_rq->tg->cfs_rq[this_cpu];
-}
-
-/* Iterate thr' all leaf cfs_rq's on a runqueue */
-#define for_each_leaf_cfs_rq(rq, cfs_rq) \
-	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
-
-/* Do the two (enqueued) entities belong to the same group ? */
-static inline int
-is_same_group(struct sched_entity *se, struct sched_entity *pse)
-{
-	if (se->cfs_rq == pse->cfs_rq)
-		return 1;
-
-	return 0;
-}
-
-static inline struct sched_entity *parent_entity(struct sched_entity *se)
-{
-	return se->parent;
-}
-
-#else	/* CONFIG_FAIR_GROUP_SCHED */
-
-#define for_each_sched_entity(se) \
-		for (; se; se = NULL)
-
-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
-	return &task_rq(p)->cfs;
-}
-
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
-	struct task_struct *p = task_of(se);
-	struct rq *rq = task_rq(p);
-
-	return &rq->cfs;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
-	return NULL;
-}
-
-static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
-{
-	return &cpu_rq(this_cpu)->cfs;
-}
-
-#define for_each_leaf_cfs_rq(rq, cfs_rq) \
-		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
-
-static inline int
-is_same_group(struct sched_entity *se, struct sched_entity *pse)
-{
-	return 1;
-}
-
-static inline struct sched_entity *parent_entity(struct sched_entity *se)
-{
-	return NULL;
-}
-
-#endif	/* CONFIG_FAIR_GROUP_SCHED */
-
 #ifdef CONFIG_SCHED_HRTICK
 static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 {

--



* [RFC/PATCH 07/17] sched: fair-group: de-couple load-balancing from the rb-trees
  2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
                   ` (5 preceding siblings ...)
  2008-03-09 17:08 ` [RFC/PATCH 06/17] sched: fair-group scheduling vs latency Peter Zijlstra
@ 2008-03-09 17:08 ` Peter Zijlstra
  2008-03-09 17:08 ` [RFC/PATCH 08/17] sched: fair-group: single RQ approach Peter Zijlstra
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:08 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra

[-- Attachment #1: sched-fair-group-smp.patch --]
[-- Type: text/plain, Size: 4470 bytes --]

De-couple load-balancing from the rb-trees, so that I can change their
organization.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/init_task.h |    3 +++
 include/linux/sched.h     |    1 +
 kernel/sched.c            |   10 ++++++++--
 kernel/sched_fair.c       |   21 +++++++++++++--------
 4 files changed, 25 insertions(+), 10 deletions(-)

Index: linux-2.6-2/include/linux/init_task.h
===================================================================
--- linux-2.6-2.orig/include/linux/init_task.h
+++ linux-2.6-2/include/linux/init_task.h
@@ -151,6 +151,9 @@ extern struct group_info init_groups;
 	.cpus_allowed	= CPU_MASK_ALL,					\
 	.mm		= NULL,						\
 	.active_mm	= &init_mm,					\
+	.se		= {						\
+		.group_node 	= LIST_HEAD_INIT(tsk.se.group_node),	\
+	},								\
 	.rt		= {						\
 		.run_list	= LIST_HEAD_INIT(tsk.rt.run_list),	\
 		.time_slice	= HZ, 					\
Index: linux-2.6-2/include/linux/sched.h
===================================================================
--- linux-2.6-2.orig/include/linux/sched.h
+++ linux-2.6-2/include/linux/sched.h
@@ -922,6 +922,7 @@ struct load_weight {
 struct sched_entity {
 	struct load_weight	load;		/* for load-balancing */
 	struct rb_node		run_node;
+	struct list_head	group_node;
 	unsigned int		on_rq;
 
 	u64			exec_start;
Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -391,8 +391,12 @@ struct cfs_rq {
 
 	struct rb_root tasks_timeline;
 	struct rb_node *rb_leftmost;
-	struct rb_node *rb_load_balance_curr;
-	/* 'curr' points to currently running entity on this cfs_rq.
+
+	struct list_head tasks;
+	struct list_head *balance_iterator;
+
+	/*
+	 * 'curr' points to currently running entity on this cfs_rq.
 	 * It is set to NULL otherwise (i.e when none are currently running).
 	 */
 	struct sched_entity *curr;
@@ -2312,6 +2316,7 @@ static void __sched_fork(struct task_str
 
 	INIT_LIST_HEAD(&p->rt.run_list);
 	p->se.on_rq = 0;
+	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
@@ -7484,6 +7489,7 @@ int in_sched_functions(unsigned long add
 static void init_cfs_rq(struct cfs_rq *cfs_rq, struct rq *rq)
 {
 	cfs_rq->tasks_timeline = RB_ROOT;
+	INIT_LIST_HEAD(&cfs_rq->tasks);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	cfs_rq->rq = rq;
 #endif
Index: linux-2.6-2/kernel/sched_fair.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_fair.c
+++ linux-2.6-2/kernel/sched_fair.c
@@ -518,6 +518,7 @@ account_entity_enqueue(struct cfs_rq *cf
 	update_load_add(&cfs_rq->load, se->load.weight);
 	cfs_rq->nr_running++;
 	se->on_rq = 1;
+	list_add(&se->group_node, &cfs_rq->tasks);
 }
 
 static void
@@ -526,6 +527,7 @@ account_entity_dequeue(struct cfs_rq *cf
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	cfs_rq->nr_running--;
 	se->on_rq = 0;
+	list_del_init(&se->group_node);
 }
 
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1177,21 +1179,24 @@ static void put_prev_task_fair(struct rq
  * the current task:
  */
 static struct task_struct *
-__load_balance_iterator(struct cfs_rq *cfs_rq, struct rb_node *curr)
+__load_balance_iterator(struct cfs_rq *cfs_rq, struct list_head *next)
 {
 	struct task_struct *p = NULL;
 	struct sched_entity *se;
 
-	if (!curr)
+	if (next == &cfs_rq->tasks)
 		return NULL;
 
 	/* Skip over entities that are not tasks */
 	do {
-		se = rb_entry(curr, struct sched_entity, run_node);
-		curr = rb_next(curr);
-	} while (curr && !entity_is_task(se));
+		se = list_entry(next, struct sched_entity, group_node);
+		next = next->next;
+	} while (next != &cfs_rq->tasks && !entity_is_task(se));
 
-	cfs_rq->rb_load_balance_curr = curr;
+	if (next == &cfs_rq->tasks)
+		return NULL;
+
+	cfs_rq->balance_iterator = next;
 
 	if (entity_is_task(se))
 		p = task_of(se);
@@ -1203,14 +1208,14 @@ static struct task_struct *load_balance_
 {
 	struct cfs_rq *cfs_rq = arg;
 
-	return __load_balance_iterator(cfs_rq, first_fair(cfs_rq));
+	return __load_balance_iterator(cfs_rq, cfs_rq->tasks.next);
 }
 
 static struct task_struct *load_balance_next_fair(void *arg)
 {
 	struct cfs_rq *cfs_rq = arg;
 
-	return __load_balance_iterator(cfs_rq, cfs_rq->rb_load_balance_curr);
+	return __load_balance_iterator(cfs_rq, cfs_rq->balance_iterator);
 }
 
 static unsigned long

--


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC/PATCH 08/17] sched: fair-group: single RQ approach
  2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
                   ` (6 preceding siblings ...)
  2008-03-09 17:08 ` [RFC/PATCH 07/17] sched: fair-group: de-couple load-balancing from the rb-trees Peter Zijlstra
@ 2008-03-09 17:08 ` Peter Zijlstra
  2008-03-09 17:08 ` [RFC/PATCH 09/17] sched: fair: remove moved_group() Peter Zijlstra
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:08 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra

[-- Attachment #1: sched-fair-group-single-rq.patch --]
[-- Type: text/plain, Size: 10647 bytes --]

The current hierarchical RQ group scheduler suffers from a number of problems:
 - yield
 - wakeup preemption
 - sleeper fairness

All these problems stem from the isolation caused by the multiple RQ design:
vruntime becomes a local, per-group property.

Solve this by returning to a single RQ model.
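
To see why local vruntime hurts, here is a stand-alone toy (made-up numbers
and names, not kernel code) showing that raw vruntime comparisons across
groups are meaningless when each group keeps its own clock:

#include <stdio.h>

struct toy_rq   { long long min_vruntime; };
struct toy_task { struct toy_rq *rq; long long vruntime; };

int main(void)
{
	struct toy_rq group_a = { .min_vruntime = 1000 };
	struct toy_rq group_b = { .min_vruntime = 9000 };
	struct toy_task a = { &group_a, 1010 };	/* 10 ahead of its own group */
	struct toy_task b = { &group_b, 9005 };	/*  5 ahead of its own group */

	/* raw vruntime comparison across groups is meaningless ... */
	printf("raw:      a %s b\n", a.vruntime < b.vruntime ? "<" : ">=");

	/* ... only group-relative values compare sensibly, which is what a
	 * single shared timeline gives every task for free */
	printf("relative: a %s b\n",
	       a.vruntime - a.rq->min_vruntime <
	       b.vruntime - b.rq->min_vruntime ? "<" : ">=");
	return 0;
}

With a single RQ every task's vruntime lives on one timeline, so decisions
like wakeup preemption and sleeper placement never compare values taken from
different clocks.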

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c      |    9 --
 kernel/sched_fair.c |  193 +++++++++++++++++++++++++++-------------------------
 2 files changed, 105 insertions(+), 97 deletions(-)

Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -1270,6 +1270,9 @@ static void __resched_task(struct task_s
  */
 #define SRR(x, y) (((x) + (1UL << ((y) - 1))) >> (y))
 
+/*
+ * delta *= weight / lw
+ */
 static unsigned long
 calc_delta_mine(unsigned long delta_exec, unsigned long weight,
 		struct load_weight *lw)
@@ -1292,12 +1295,6 @@ calc_delta_mine(unsigned long delta_exec
 	return (unsigned long)min(tmp, (u64)(unsigned long)LONG_MAX);
 }
 
-static inline unsigned long
-calc_delta_fair(unsigned long delta_exec, struct load_weight *lw)
-{
-	return calc_delta_mine(delta_exec, NICE_0_LOAD, lw);
-}
-
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
 {
 	lw->weight += inc;
Index: linux-2.6-2/kernel/sched_fair.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_fair.c
+++ linux-2.6-2/kernel/sched_fair.c
@@ -238,12 +238,22 @@ static inline s64 entity_key(struct cfs_
  */
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
+	struct rb_node **link;
 	struct rb_node *parent = NULL;
 	struct sched_entity *entry;
-	s64 key = entity_key(cfs_rq, se);
+	s64 key;
 	int leftmost = 1;
 
+	if (!entity_is_task(se))
+		return;
+
+	if (se == cfs_rq->curr)
+		return;
+
+	cfs_rq = &rq_of(cfs_rq)->cfs;
+
+	link = &cfs_rq->tasks_timeline.rb_node;
+	key = entity_key(cfs_rq, se);
 	/*
 	 * Find the right place in the rbtree:
 	 */
@@ -275,6 +285,14 @@ static void __enqueue_entity(struct cfs_
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	if (!entity_is_task(se))
+		return;
+
+	if (se == cfs_rq->curr)
+		return;
+
+	cfs_rq = &rq_of(cfs_rq)->cfs;
+
 	if (cfs_rq->rb_leftmost == &se->run_node)
 		cfs_rq->rb_leftmost = rb_next(&se->run_node);
 
@@ -353,6 +371,13 @@ static u64 sched_slice(struct cfs_rq *cf
 {
 	u64 slice = __sched_period(cfs_rq->nr_running);
 
+	/*
+	 * FIXME: curious 'hack' to make it boot - when the tick is started we
+	 * hit this with the init_task, which is not actually enqueued.
+	 */
+	if (unlikely(rq_of(cfs_rq)->load.weight <= se->load.weight))
+		goto out;
+
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 
@@ -360,37 +385,66 @@ static u64 sched_slice(struct cfs_rq *cf
 		do_div(slice, cfs_rq->load.weight);
 	}
 
+out:
 	return slice;
 }
 
 /*
  * We calculate the vruntime slice of a to be inserted task
  *
- * vs = s/w = p/rw
+ * vs = s*rw/w = p
  */
 static u64 sched_vslice_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	unsigned long nr_running = cfs_rq->nr_running;
-	unsigned long weight;
-	u64 vslice;
 
 	if (!se->on_rq)
 		nr_running++;
 
-	vslice = __sched_period(nr_running);
+	return __sched_period(nr_running);
+}
 
+static inline unsigned long
+calc_delta_fair(unsigned long delta, struct sched_entity *se)
+{
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		delta = calc_delta_mine(delta,
+				cfs_rq_of(se)->load.weight, &se->load);
+	}
 
-		weight = cfs_rq->load.weight;
-		if (!se->on_rq)
-			weight += se->load.weight;
+	return delta;
+}
+
+/*
+ * The goal of calc_delta_asym() is to be asymmetrically around NICE_0_LOAD, in
+ * that it favours >=0 over <0.
+ *
+ *   -20         |
+ *               |
+ *     0 --------+-------
+ *             .'
+ *    19     .'
+ *
+ */
+static unsigned long
+calc_delta_asym(unsigned long delta, struct sched_entity *se)
+{
+	struct load_weight lw = {
+		.weight = NICE_0_LOAD,
+		.inv_weight = 1UL << (WMULT_SHIFT-NICE_0_SHIFT)
+	};
+
+	for_each_sched_entity(se) {
+		struct load_weight *se_lw = &se->load;
 
-		vslice *= NICE_0_LOAD;
-		do_div(vslice, weight);
+		if (se->load.weight < NICE_0_LOAD)
+			se_lw = &lw;
+
+		delta = calc_delta_mine(delta,
+				cfs_rq_of(se)->load.weight, se_lw);
 	}
 
-	return vslice;
+	return delta;
 }
 
 /*
@@ -408,13 +462,14 @@ __update_curr(struct cfs_rq *cfs_rq, str
 
 	curr->sum_exec_runtime += delta_exec;
 	schedstat_add(cfs_rq, exec_clock, delta_exec);
-	delta_exec_weighted = delta_exec;
-	if (unlikely(curr->load.weight != NICE_0_LOAD)) {
-		delta_exec_weighted = calc_delta_fair(delta_exec_weighted,
-							&curr->load);
-	}
+	delta_exec_weighted = calc_delta_fair(delta_exec, curr);
 	curr->vruntime += delta_exec_weighted;
 
+	if (!entity_is_task(curr))
+		return;
+
+	cfs_rq = &rq_of(cfs_rq)->cfs;
+
 	/*
 	 * maintain cfs_rq->min_vruntime to be a monotonic increasing
 	 * value tracking the leftmost vruntime in the tree.
@@ -594,6 +649,11 @@ place_entity(struct cfs_rq *cfs_rq, stru
 {
 	u64 vruntime;
 
+	if (!entity_is_task(se))
+		return;
+
+	cfs_rq = &rq_of(cfs_rq)->cfs;
+
 	vruntime = cfs_rq->min_vruntime;
 
 	/*
@@ -608,7 +668,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	if (!initial) {
 		/* sleeps upto a single latency don't count. */
 		if (sched_feat(NEW_FAIR_SLEEPERS))
-			vruntime -= sysctl_sched_latency;
+			vruntime -= calc_delta_asym(sysctl_sched_latency, se);
 
 		/* ensure we never gain time by being placed backwards. */
 		vruntime = max_vruntime(se->vruntime, vruntime);
@@ -632,8 +692,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 
 	update_stats_enqueue(cfs_rq, se);
 	check_spread(cfs_rq, se);
-	if (se != cfs_rq->curr)
-		__enqueue_entity(cfs_rq, se);
+	__enqueue_entity(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 }
 
@@ -659,8 +718,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 #endif
 	}
 
-	if (se != cfs_rq->curr)
-		__dequeue_entity(cfs_rq, se);
+	__dequeue_entity(cfs_rq, se);
 	account_entity_dequeue(cfs_rq, se);
 }
 
@@ -689,6 +747,8 @@ set_next_entity(struct cfs_rq *cfs_rq, s
 		 * runqueue.
 		 */
 		update_stats_wait_end(cfs_rq, se);
+		if (WARN_ON_ONCE(cfs_rq->curr))
+			cfs_rq->curr = NULL;
 		__dequeue_entity(cfs_rq, se);
 	}
 
@@ -708,18 +768,6 @@ set_next_entity(struct cfs_rq *cfs_rq, s
 	se->prev_sum_exec_runtime = se->sum_exec_runtime;
 }
 
-static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
-{
-	struct sched_entity *se = NULL;
-
-	if (first_fair(cfs_rq)) {
-		se = __pick_next_entity(cfs_rq);
-		set_next_entity(cfs_rq, se);
-	}
-
-	return se;
-}
-
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
 	/*
@@ -730,12 +778,12 @@ static void put_prev_entity(struct cfs_r
 		update_curr(cfs_rq);
 
 	check_spread(cfs_rq, prev);
+	cfs_rq->curr = NULL;
 	if (prev->on_rq) {
 		update_stats_wait_start(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 	}
-	cfs_rq->curr = NULL;
 }
 
 static void
@@ -746,6 +794,9 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 	 */
 	update_curr(cfs_rq);
 
+	if (!entity_is_task(curr))
+		return;
+
 #ifdef CONFIG_SCHED_HRTICK
 	/*
 	 * queued ticks are scheduled to match the slice, so don't bother
@@ -761,7 +812,8 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 		return;
 #endif
 
-	if (cfs_rq->nr_running > 1 || !sched_feat(WAKEUP_PREEMPT))
+	if (rq_of(cfs_rq)->load.weight != curr->load.weight ||
+			!sched_feat(WAKEUP_PREEMPT))
 		check_preempt_tick(cfs_rq, curr);
 }
 
@@ -878,7 +930,7 @@ static void yield_task_fair(struct rq *r
 	/*
 	 * Are we the only task in the tree?
 	 */
-	if (unlikely(cfs_rq->nr_running == 1))
+	if (unlikely(rq->load.weight == curr->se.load.weight))
 		return;
 
 	if (likely(!sysctl_sched_compat_yield) && curr->policy != SCHED_BATCH) {
@@ -893,7 +945,7 @@ static void yield_task_fair(struct rq *r
 	/*
 	 * Find the rightmost entry in the rbtree:
 	 */
-	rightmost = __pick_last_entity(cfs_rq);
+	rightmost = __pick_last_entity(&rq->cfs);
 	/*
 	 * Already in the rightmost position?
 	 */
@@ -1055,18 +1107,6 @@ out_set_cpu:
 }
 #endif /* CONFIG_SMP */
 
-/* return depth at which a sched entity is present in the hierarchy */
-static inline int depth_se(struct sched_entity *se)
-{
-	int depth = 0;
-
-	for_each_sched_entity(se)
-		depth++;
-
-	return depth;
-}
-
-
 /*
  * Preempt the current task with a newly woken task if needed:
  */
@@ -1076,7 +1116,6 @@ static void check_preempt_wakeup(struct 
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	struct sched_entity *se = &curr->se, *pse = &p->se;
 	unsigned long gran;
-	int se_depth, pse_depth;
 
 	if (unlikely(rt_prio(p->prio))) {
 		update_rq_clock(rq);
@@ -1095,39 +1134,10 @@ static void check_preempt_wakeup(struct 
 		return;
 
 	/*
-	 * preemption test can be made between sibling entities who are in the
-	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
-	 * both tasks until we find their ancestors who are siblings of common
-	 * parent.
+	 * More easily preempt - nice tasks, while not making it harder for
+	 * + nice tasks.
 	 */
-
-	/* First walk up until both entities are at same depth */
-	se_depth = depth_se(se);
-	pse_depth = depth_se(pse);
-
-	while (se_depth > pse_depth) {
-		se_depth--;
-		se = parent_entity(se);
-	}
-
-	while (pse_depth > se_depth) {
-		pse_depth--;
-		pse = parent_entity(pse);
-	}
-
-	while (!is_same_group(se, pse)) {
-		se = parent_entity(se);
-		pse = parent_entity(pse);
-	}
-
-	gran = sysctl_sched_wakeup_granularity;
-	/*
-	 * More easily preempt - nice tasks, while not making
-	 * it harder for + nice tasks.
-	 */
-	if (unlikely(se->load.weight > NICE_0_LOAD))
-		gran = calc_delta_fair(gran, &se->load);
-
+	gran = calc_delta_asym(sysctl_sched_wakeup_granularity, se);
 	if (pse->vruntime + gran < se->vruntime)
 		resched_task(curr);
 }
@@ -1136,17 +1146,18 @@ static struct task_struct *pick_next_tas
 {
 	struct task_struct *p;
 	struct cfs_rq *cfs_rq = &rq->cfs;
-	struct sched_entity *se;
+	struct sched_entity *se, *next;
 
-	if (unlikely(!cfs_rq->nr_running))
+	if (!first_fair(cfs_rq))
 		return NULL;
 
-	do {
-		se = pick_next_entity(cfs_rq);
-		cfs_rq = group_cfs_rq(se);
-	} while (cfs_rq);
+	next = se = __pick_next_entity(cfs_rq);
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+		set_next_entity(cfs_rq, se);
+	}
 
-	p = task_of(se);
+	p = task_of(next);
 	hrtick_start_fair(rq, p);
 
 	return p;

--


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC/PATCH 09/17] sched: fair: remove moved_group()
  2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
                   ` (7 preceding siblings ...)
  2008-03-09 17:08 ` [RFC/PATCH 08/17] sched: fair-group: single RQ approach Peter Zijlstra
@ 2008-03-09 17:08 ` Peter Zijlstra
  2008-03-09 17:09 ` [RFC/PATCH 10/17] sched: remove sysctl_sched_batch_wakeup_granularity Peter Zijlstra
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:08 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra

[-- Attachment #1: sched-remove-moved_group.patch --]
[-- Type: text/plain, Size: 2130 bytes --]

Now that all tasks are in a single RQ, tasks in different groups are directly
comparable again, hence we no longer require a notification when a task is
moved between groups.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    4 ----
 kernel/sched.c        |    5 -----
 kernel/sched_fair.c   |   14 --------------
 3 files changed, 23 deletions(-)

Index: linux-2.6-2/include/linux/sched.h
===================================================================
--- linux-2.6-2.orig/include/linux/sched.h
+++ linux-2.6-2/include/linux/sched.h
@@ -899,10 +899,6 @@ struct sched_class {
 			     int running);
 	void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
 			     int oldprio, int running);
-
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	void (*moved_group) (struct task_struct *p);
-#endif
 };
 
 struct load_weight {
Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -8134,11 +8134,6 @@ void sched_move_task(struct task_struct 
 
 	set_task_rq(tsk, task_cpu(tsk));
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	if (tsk->sched_class->moved_group)
-		tsk->sched_class->moved_group(tsk);
-#endif
-
 	if (on_rq) {
 		if (unlikely(running))
 			tsk->sched_class->set_curr_task(rq);
Index: linux-2.6-2/kernel/sched_fair.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_fair.c
+++ linux-2.6-2/kernel/sched_fair.c
@@ -1487,16 +1487,6 @@ static void set_curr_task_fair(struct rq
 		set_next_entity(cfs_rq_of(se), se);
 }
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-static void moved_group_fair(struct task_struct *p)
-{
-	struct cfs_rq *cfs_rq = task_cfs_rq(p);
-
-	update_curr(cfs_rq);
-	place_entity(cfs_rq, &p->se, 1);
-}
-#endif
-
 /*
  * All the scheduling class methods:
  */
@@ -1525,10 +1515,6 @@ static const struct sched_class fair_sch
 
 	.prio_changed		= prio_changed_fair,
 	.switched_to		= switched_to_fair,
-
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	.moved_group		= moved_group_fair,
-#endif
 };
 
 #ifdef CONFIG_SCHED_DEBUG

--


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC/PATCH 10/17] sched: remove sysctl_sched_batch_wakeup_granularity
  2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
                   ` (8 preceding siblings ...)
  2008-03-09 17:08 ` [RFC/PATCH 09/17] sched: fair: remove moved_group() Peter Zijlstra
@ 2008-03-09 17:09 ` Peter Zijlstra
  2008-03-09 17:09 ` [RFC/PATCH 11/17] sched: fair: optimize sched_slice() Peter Zijlstra
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:09 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra

[-- Attachment #1: sched-fair-remove-batch-wakeup.patch --]
[-- Type: text/plain, Size: 3286 bytes --]

It's unused... toss it!

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    1 -
 kernel/sched.c        |    1 -
 kernel/sched_debug.c  |    1 -
 kernel/sched_fair.c   |   10 ----------
 kernel/sysctl.c       |   11 -----------
 5 files changed, 24 deletions(-)

Index: linux-2.6-2/include/linux/sched.h
===================================================================
--- linux-2.6-2.orig/include/linux/sched.h
+++ linux-2.6-2/include/linux/sched.h
@@ -1539,7 +1539,6 @@ extern void sched_idle_next(void);
 extern unsigned int sysctl_sched_latency;
 extern unsigned int sysctl_sched_min_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
-extern unsigned int sysctl_sched_batch_wakeup_granularity;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned int sysctl_sched_features;
 extern unsigned int sysctl_sched_migration_cost;
Index: linux-2.6-2/kernel/sched_fair.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_fair.c
+++ linux-2.6-2/kernel/sched_fair.c
@@ -62,16 +62,6 @@ const_debug unsigned int sysctl_sched_ch
 unsigned int __read_mostly sysctl_sched_compat_yield;
 
 /*
- * SCHED_BATCH wake-up granularity.
- * (default: 10 msec * (1 + ilog(ncpus)), units: nanoseconds)
- *
- * This option delays the preemption effects of decoupled workloads
- * and reduces their over-scheduling. Synchronous workloads will still
- * have immediate wakeup/sleep latencies.
- */
-unsigned int sysctl_sched_batch_wakeup_granularity = 10000000UL;
-
-/*
  * SCHED_OTHER wake-up granularity.
  * (default: 10 msec * (1 + ilog(ncpus)), units: nanoseconds)
  *
Index: linux-2.6-2/kernel/sysctl.c
===================================================================
--- linux-2.6-2.orig/kernel/sysctl.c
+++ linux-2.6-2/kernel/sysctl.c
@@ -271,17 +271,6 @@ static struct ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
-		.procname	= "sched_batch_wakeup_granularity_ns",
-		.data		= &sysctl_sched_batch_wakeup_granularity,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= &proc_dointvec_minmax,
-		.strategy	= &sysctl_intvec,
-		.extra1		= &min_wakeup_granularity_ns,
-		.extra2		= &max_wakeup_granularity_ns,
-	},
-	{
-		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_child_runs_first",
 		.data		= &sysctl_sched_child_runs_first,
 		.maxlen		= sizeof(unsigned int),
Index: linux-2.6-2/kernel/sched_debug.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_debug.c
+++ linux-2.6-2/kernel/sched_debug.c
@@ -214,7 +214,6 @@ static int sched_debug_show(struct seq_f
 	PN(sysctl_sched_latency);
 	PN(sysctl_sched_min_granularity);
 	PN(sysctl_sched_wakeup_granularity);
-	PN(sysctl_sched_batch_wakeup_granularity);
 	PN(sysctl_sched_child_runs_first);
 	P(sysctl_sched_features);
 #undef PN
Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -5702,7 +5702,6 @@ static inline void sched_init_granularit
 		sysctl_sched_latency = limit;
 
 	sysctl_sched_wakeup_granularity *= factor;
-	sysctl_sched_batch_wakeup_granularity *= factor;
 }
 
 #ifdef CONFIG_SMP

--


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC/PATCH 11/17] sched: fair: optimize sched_slice()
  2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
                   ` (9 preceding siblings ...)
  2008-03-09 17:09 ` [RFC/PATCH 10/17] sched: remove sysctl_sched_batch_wakeup_granularity Peter Zijlstra
@ 2008-03-09 17:09 ` Peter Zijlstra
  2008-03-09 17:09 ` [RFC/PATCH 12/17] sched: fair: cfs_root_rq Peter Zijlstra
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:09 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra

[-- Attachment #1: sched-opt-sched_slice.patch --]
[-- Type: text/plain, Size: 1843 bytes --]

Cache the divide in the hope that we can avoid doing it in steady-state operation.
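
For reference, a stand-alone sketch of the trick (made-up names and simplified
arithmetic -- the real work happens in calc_delta_mine() and
update_load_add()/update_load_sub() below):

struct toy_load_weight {
	unsigned long weight;
	unsigned long inv_weight;	/* ~ 2^32 / weight, 0 when stale */
};

/* delta * weight / lw->weight, using the cached reciprocal */
unsigned long toy_calc_delta(unsigned long delta, unsigned long weight,
			     struct toy_load_weight *lw)
{
	if (!lw->inv_weight)		/* re-divide only after a weight change */
		lw->inv_weight = (unsigned long)((1ULL << 32) / lw->weight);

	/* 128-bit intermediate (GCC extension) keeps this toy overflow-free */
	unsigned __int128 tmp =
		(unsigned __int128)delta * weight * lw->inv_weight;

	return (unsigned long)(tmp >> 32);
}

/* changing the weight invalidates the cache, like update_load_add()/sub() */
void toy_update_load_add(struct toy_load_weight *lw, unsigned long inc)
{
	lw->weight += inc;
	lw->inv_weight = 0;		/* one new divide, then steady state again */
}

As long as the weights don't change, inv_weight stays valid and the hot path
sees only multiplies and shifts; every weight update costs exactly one divide.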

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c       |    2 ++
 kernel/sched_debug.c |    1 +
 kernel/sched_fair.c  |    6 ++----
 3 files changed, 5 insertions(+), 4 deletions(-)

Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -1298,11 +1298,13 @@ calc_delta_mine(unsigned long delta_exec
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
 {
 	lw->weight += inc;
+	lw->inv_weight = 0;
 }
 
 static inline void update_load_sub(struct load_weight *lw, unsigned long dec)
 {
 	lw->weight -= dec;
+	lw->inv_weight = 0;
 }
 
 /*
Index: linux-2.6-2/kernel/sched_debug.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_debug.c
+++ linux-2.6-2/kernel/sched_debug.c
@@ -137,6 +137,7 @@ void print_cfs_rq(struct seq_file *m, in
 			SPLIT_NS(spread0));
 	SEQ_printf(m, "  .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
+	SEQ_printf(m, "  .%-30s: %ld\n", "load^-1", cfs_rq->load.inv_weight);
 #ifdef CONFIG_SCHEDSTATS
 	SEQ_printf(m, "  .%-30s: %d\n", "bkl_count",
 			rq->bkl_count);
Index: linux-2.6-2/kernel/sched_fair.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_fair.c
+++ linux-2.6-2/kernel/sched_fair.c
@@ -369,10 +369,8 @@ static u64 sched_slice(struct cfs_rq *cf
 		goto out;
 
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-
-		slice *= se->load.weight;
-		do_div(slice, cfs_rq->load.weight);
+		slice = calc_delta_mine(slice,
+				se->load.weight, &cfs_rq_of(se)->load);
 	}
 
 out:

--


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC/PATCH 12/17] sched: fair: cfs_root_rq
  2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
                   ` (10 preceding siblings ...)
  2008-03-09 17:09 ` [RFC/PATCH 11/17] sched: fair: optimize sched_slice() Peter Zijlstra
@ 2008-03-09 17:09 ` Peter Zijlstra
  2008-03-09 17:09 ` [RFC/PATCH 13/17] sched: fair: calc_delta_weight() Peter Zijlstra
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:09 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra

[-- Attachment #1: sched-single-rq-cleanup.patch --]
[-- Type: text/plain, Size: 14149 bytes --]

Now that we have a single tree, there is no need to carry the tree state in
every cfs_rq of the hierarchy. Strip the tree info from cfs_rq and put it in a
new, per-rq cfs_root_rq.
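
A hypothetical, heavily simplified picture of the resulting split (the real
structs are in the hunks below):

struct sketch_cfs_root_rq {	/* one per CPU: owns the single timeline */
	unsigned long long	min_vruntime;
	void			*tasks_timeline;	/* rb_root in the real code */
	void			*rb_leftmost;		/* cached leftmost entity */
};

struct sketch_cfs_rq {		/* one per group per CPU: accounting only */
	unsigned long		load_weight;
	unsigned long		nr_running;
	void			*curr;			/* currently running entity */
};

Group-level cfs_rq's keep weight and bookkeeping; everything that orders
entities in time now lives exactly once per CPU.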

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c       |   38 ++++++++----
 kernel/sched_debug.c |   41 +++++++++----
 kernel/sched_fair.c  |  152 +++++++++++++++++++++++++--------------------------
 3 files changed, 128 insertions(+), 103 deletions(-)

Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -381,17 +381,23 @@ static inline void unlock_doms_cur(void)
 
 #endif	/* CONFIG_GROUP_SCHED */
 
-/* CFS-related fields in a runqueue */
-struct cfs_rq {
-	struct load_weight load;
-	unsigned long nr_running;
-
-	u64 exec_clock;
+struct cfs_root_rq {
 	u64 min_vruntime;
 
 	struct rb_root tasks_timeline;
 	struct rb_node *rb_leftmost;
 
+#ifdef CONFIG_SCHEDSTATS
+	unsigned long nr_spread_over;
+	u64 exec_clock;
+#endif
+};
+
+/* CFS-related fields in a runqueue */
+struct cfs_rq {
+	struct load_weight load;
+	unsigned long nr_running;
+
 	struct list_head tasks;
 	struct list_head *balance_iterator;
 
@@ -401,8 +407,6 @@ struct cfs_rq {
 	 */
 	struct sched_entity *curr;
 
-	unsigned long nr_spread_over;
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	struct rq *rq;	/* cpu runqueue to which this cfs_rq is attached */
 
@@ -528,6 +532,7 @@ struct rq {
 	unsigned long nr_load_updates;
 	u64 nr_switches;
 
+	struct cfs_root_rq cfs_root;
 	struct cfs_rq cfs;
 	struct rt_rq rt;
 
@@ -1821,8 +1826,8 @@ void set_task_cpu(struct task_struct *p,
 {
 	int old_cpu = task_cpu(p);
 	struct rq *old_rq = cpu_rq(old_cpu), *new_rq = cpu_rq(new_cpu);
-	struct cfs_rq *old_cfsrq = task_cfs_rq(p),
-		      *new_cfsrq = cpu_cfs_rq(old_cfsrq, new_cpu);
+	struct cfs_root_rq *old_cfs_r_rq = &old_rq->cfs_root,
+			   *new_cfs_r_rq = &new_rq->cfs_root;
 	u64 clock_offset;
 
 	clock_offset = old_rq->clock - new_rq->clock;
@@ -1840,8 +1845,8 @@ void set_task_cpu(struct task_struct *p,
 			schedstat_inc(p, se.nr_forced2_migrations);
 	}
 #endif
-	p->se.vruntime -= old_cfsrq->min_vruntime -
-					 new_cfsrq->min_vruntime;
+	p->se.vruntime -= old_cfs_r_rq->min_vruntime -
+			  new_cfs_r_rq->min_vruntime;
 
 	__set_task_cpu(p, new_cpu);
 }
@@ -7484,14 +7489,18 @@ int in_sched_functions(unsigned long add
 		&& addr < (unsigned long)__sched_text_end);
 }
 
+static void init_cfs_root_rq(struct cfs_root_rq *cfs_r_rq)
+{
+	cfs_r_rq->tasks_timeline = RB_ROOT;
+	cfs_r_rq->min_vruntime = (u64)(-(1LL << 20));
+}
+
 static void init_cfs_rq(struct cfs_rq *cfs_rq, struct rq *rq)
 {
-	cfs_rq->tasks_timeline = RB_ROOT;
 	INIT_LIST_HEAD(&cfs_rq->tasks);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	cfs_rq->rq = rq;
 #endif
-	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
 }
 
 static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
@@ -7617,6 +7626,7 @@ void __init sched_init(void)
 		rq->nr_running = 0;
 		rq->clock = 1;
 		update_last_tick_seen(rq);
+		init_cfs_root_rq(&rq->cfs_root);
 		init_cfs_rq(&rq->cfs, rq);
 		init_rt_rq(&rq->rt, rq);
 #ifdef CONFIG_FAIR_GROUP_SCHED
Index: linux-2.6-2/kernel/sched_debug.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_debug.c
+++ linux-2.6-2/kernel/sched_debug.c
@@ -101,49 +101,60 @@ static void print_rq(struct seq_file *m,
 	read_unlock_irqrestore(&tasklist_lock, flags);
 }
 
-void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
+void print_cfs_root(struct seq_file *m, int cpu)
 {
 	s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
 		spread, rq0_min_vruntime, spread0;
 	struct rq *rq = &per_cpu(runqueues, cpu);
+	struct cfs_root_rq *cfs_r_rq = &rq->cfs_root;
 	struct sched_entity *last;
 	unsigned long flags;
 
-	SEQ_printf(m, "\ncfs_rq\n");
-
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "exec_clock",
-			SPLIT_NS(cfs_rq->exec_clock));
+	SEQ_printf(m, "\ncfs_root_rq\n");
 
 	spin_lock_irqsave(&rq->lock, flags);
-	if (cfs_rq->rb_leftmost)
-		MIN_vruntime = (__pick_next_entity(cfs_rq))->vruntime;
-	last = __pick_last_entity(cfs_rq);
+	if (cfs_r_rq->rb_leftmost)
+		MIN_vruntime = (__pick_next_entity(cfs_r_rq))->vruntime;
+	last = __pick_last_entity(cfs_r_rq);
 	if (last)
 		max_vruntime = last->vruntime;
-	min_vruntime = rq->cfs.min_vruntime;
-	rq0_min_vruntime = per_cpu(runqueues, 0).cfs.min_vruntime;
+	min_vruntime = cfs_r_rq->min_vruntime;
+	rq0_min_vruntime = per_cpu(runqueues, 0).cfs_root.min_vruntime;
 	spin_unlock_irqrestore(&rq->lock, flags);
+
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "MIN_vruntime",
 			SPLIT_NS(MIN_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
 			SPLIT_NS(min_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "max_vruntime",
 			SPLIT_NS(max_vruntime));
+
 	spread = max_vruntime - MIN_vruntime;
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread",
 			SPLIT_NS(spread));
 	spread0 = min_vruntime - rq0_min_vruntime;
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread0",
 			SPLIT_NS(spread0));
+
+#ifdef CONFIG_SCHEDSTATS
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "exec_clock",
+			SPLIT_NS(cfs_r_rq->exec_clock));
+	SEQ_printf(m, "  .%-30s: %ld\n", "nr_spread_over",
+			cfs_r_rq->nr_spread_over);
+#endif
+}
+
+void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
+{
+	SEQ_printf(m, "\ncfs_rq\n");
+
 	SEQ_printf(m, "  .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load^-1", cfs_rq->load.inv_weight);
+
 #ifdef CONFIG_SCHEDSTATS
-	SEQ_printf(m, "  .%-30s: %d\n", "bkl_count",
-			rq->bkl_count);
+	SEQ_printf(m, "  .%-30s: %d\n", "bkl_count", rq_of(cfs_rq)->bkl_count);
 #endif
-	SEQ_printf(m, "  .%-30s: %ld\n", "nr_spread_over",
-			cfs_rq->nr_spread_over);
 }
 
 static void print_cpu(struct seq_file *m, int cpu)
@@ -191,6 +202,8 @@ static void print_cpu(struct seq_file *m
 #undef P
 #undef PN
 
+	print_cfs_root(m, cpu);
+
 	print_cfs_stats(m, cpu);
 
 	print_rq(m, rq, cpu);
Index: linux-2.6-2/kernel/sched_fair.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_fair.c
+++ linux-2.6-2/kernel/sched_fair.c
@@ -218,15 +218,14 @@ static inline u64 min_vruntime(u64 min_v
 	return min_vruntime;
 }
 
-static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static inline
+s64 entity_key(struct cfs_root_rq *cfs_r_rq, struct sched_entity *se)
 {
-	return se->vruntime - cfs_rq->min_vruntime;
+	return se->vruntime - cfs_r_rq->min_vruntime;
 }
 
-/*
- * Enqueue an entity into the rb-tree:
- */
-static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void __enqueue_timeline(struct cfs_root_rq *cfs_r_rq,
+		struct sched_entity *se)
 {
 	struct rb_node **link;
 	struct rb_node *parent = NULL;
@@ -234,16 +233,8 @@ static void __enqueue_entity(struct cfs_
 	s64 key;
 	int leftmost = 1;
 
-	if (!entity_is_task(se))
-		return;
-
-	if (se == cfs_rq->curr)
-		return;
-
-	cfs_rq = &rq_of(cfs_rq)->cfs;
-
-	link = &cfs_rq->tasks_timeline.rb_node;
-	key = entity_key(cfs_rq, se);
+	link = &cfs_r_rq->tasks_timeline.rb_node;
+	key = entity_key(cfs_r_rq, se);
 	/*
 	 * Find the right place in the rbtree:
 	 */
@@ -254,7 +245,7 @@ static void __enqueue_entity(struct cfs_
 		 * We dont care about collisions. Nodes with
 		 * the same key stay together.
 		 */
-		if (key < entity_key(cfs_rq, entry)) {
+		if (key < entity_key(cfs_r_rq, entry)) {
 			link = &parent->rb_left;
 		} else {
 			link = &parent->rb_right;
@@ -267,46 +258,66 @@ static void __enqueue_entity(struct cfs_
 	 * used):
 	 */
 	if (leftmost)
-		cfs_rq->rb_leftmost = &se->run_node;
+		cfs_r_rq->rb_leftmost = &se->run_node;
 
 	rb_link_node(&se->run_node, parent, link);
-	rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
+	rb_insert_color(&se->run_node, &cfs_r_rq->tasks_timeline);
 }
 
-static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void __dequeue_timeline(struct cfs_root_rq *cfs_r_rq,
+		struct sched_entity *se)
+{
+	if (cfs_r_rq->rb_leftmost == &se->run_node)
+		cfs_r_rq->rb_leftmost = rb_next(&se->run_node);
+
+	rb_erase(&se->run_node, &cfs_r_rq->tasks_timeline);
+}
+
+/*
+ * Enqueue an entity into the rb-tree:
+ */
+static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+
 	if (!entity_is_task(se))
 		return;
 
 	if (se == cfs_rq->curr)
 		return;
 
-	cfs_rq = &rq_of(cfs_rq)->cfs;
+	__enqueue_timeline(&rq_of(cfs_rq)->cfs_root, se);
+}
 
-	if (cfs_rq->rb_leftmost == &se->run_node)
-		cfs_rq->rb_leftmost = rb_next(&se->run_node);
+static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	if (!entity_is_task(se))
+		return;
 
-	rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
+	if (se == cfs_rq->curr)
+		return;
+
+	__dequeue_timeline(&rq_of(cfs_rq)->cfs_root, se);
 }
 
-static inline struct rb_node *first_fair(struct cfs_rq *cfs_rq)
+static inline struct rb_node *first_fair(struct cfs_root_rq *cfs_r_rq)
 {
-	return cfs_rq->rb_leftmost;
+	return cfs_r_rq->rb_leftmost;
 }
 
-static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
+static struct sched_entity *__pick_next_entity(struct cfs_root_rq *cfs_r_rq)
 {
-	return rb_entry(first_fair(cfs_rq), struct sched_entity, run_node);
+	return rb_entry(first_fair(cfs_r_rq), struct sched_entity, run_node);
 }
 
-static inline struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
+static inline
+struct sched_entity *__pick_last_entity(struct cfs_root_rq *cfs_r_rq)
 {
 	struct rb_node *last = rb_last(&cfs_rq->tasks_timeline);
 
 	if (!last)
 		return NULL;
 
-	return rb_entry(last, struct sched_entity, run_node);
+	return rb_entry(last, struct sched_entity, timeline_node);
 }
 
 /**************************************************************
@@ -435,6 +446,25 @@ calc_delta_asym(unsigned long delta, str
 	return delta;
 }
 
+static inline
+void __update_root(struct cfs_root_rq *cfs_r_rq, struct sched_entity *curr)
+{
+	u64 vruntime;
+
+	/*
+	 * maintain cfs_r_rq->min_vruntime to be a monotonic increasing
+	 * value tracking the leftmost vruntime in the tree.
+	 */
+	if (first_fair(cfs_r_rq)) {
+		vruntime = min_vruntime(curr->vruntime,
+				__pick_next_entity(cfs_r_rq)->vruntime);
+	} else
+		vruntime = curr->vruntime;
+
+	cfs_r_rq->min_vruntime =
+		max_vruntime(cfs_r_rq->min_vruntime, vruntime);
+}
+
 /*
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
@@ -443,33 +473,11 @@ static inline void
 __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
 	      unsigned long delta_exec)
 {
-	unsigned long delta_exec_weighted;
-	u64 vruntime;
-
 	schedstat_set(curr->exec_max, max((u64)delta_exec, curr->exec_max));
 
 	curr->sum_exec_runtime += delta_exec;
-	schedstat_add(cfs_rq, exec_clock, delta_exec);
-	delta_exec_weighted = calc_delta_fair(delta_exec, curr);
-	curr->vruntime += delta_exec_weighted;
-
-	if (!entity_is_task(curr))
-		return;
-
-	cfs_rq = &rq_of(cfs_rq)->cfs;
-
-	/*
-	 * maintain cfs_rq->min_vruntime to be a monotonic increasing
-	 * value tracking the leftmost vruntime in the tree.
-	 */
-	if (first_fair(cfs_rq)) {
-		vruntime = min_vruntime(curr->vruntime,
-				__pick_next_entity(cfs_rq)->vruntime);
-	} else
-		vruntime = curr->vruntime;
-
-	cfs_rq->min_vruntime =
-		max_vruntime(cfs_rq->min_vruntime, vruntime);
+	schedstat_add(&rq_of(cfs_rq)->cfs_root, exec_clock, delta_exec);
+	curr->vruntime += calc_delta_fair(delta_exec, curr);
 }
 
 static void update_curr(struct cfs_rq *cfs_rq)
@@ -492,9 +500,8 @@ static void update_curr(struct cfs_rq *c
 	curr->exec_start = now;
 
 	if (entity_is_task(curr)) {
-		struct task_struct *curtask = task_of(curr);
-
-		cpuacct_charge(curtask, delta_exec);
+		__update_root(&rq_of(cfs_rq)->cfs_root, curr);
+		cpuacct_charge(task_of(curr), delta_exec);
 	}
 }
 
@@ -622,27 +629,27 @@ static void enqueue_sleeper(struct cfs_r
 static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 #ifdef CONFIG_SCHED_DEBUG
-	s64 d = se->vruntime - cfs_rq->min_vruntime;
+	struct cfs_root_rq *cfs_r_rq = &rq_of(cfs_rq)->cfs_root;
+	s64 d = se->vruntime - cfs_r_rq->min_vruntime;
 
 	if (d < 0)
 		d = -d;
 
 	if (d > 3*sysctl_sched_latency)
-		schedstat_inc(cfs_rq, nr_spread_over);
+		schedstat_inc(cfs_r_rq, nr_spread_over);
 #endif
 }
 
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
+	struct cfs_root_rq *cfs_r_rq = &rq_of(cfs_rq)->cfs_root;
 	u64 vruntime;
 
 	if (!entity_is_task(se))
 		return;
 
-	cfs_rq = &rq_of(cfs_rq)->cfs;
-
-	vruntime = cfs_rq->min_vruntime;
+	vruntime = cfs_r_rq->min_vruntime;
 
 	/*
 	 * The 'current' period is already promised to the current tasks,
@@ -933,7 +940,7 @@ static void yield_task_fair(struct rq *r
 	/*
 	 * Find the rightmost entry in the rbtree:
 	 */
-	rightmost = __pick_last_entity(&rq->cfs);
+	rightmost = __pick_last_entity(&rq->cfs_root);
 	/*
 	 * Already in the rightmost position?
 	 */
@@ -1133,17 +1140,15 @@ static void check_preempt_wakeup(struct 
 static struct task_struct *pick_next_task_fair(struct rq *rq)
 {
 	struct task_struct *p;
-	struct cfs_rq *cfs_rq = &rq->cfs;
+	struct cfs_root_rq *cfs_r_rq = &rq->cfs_root;
 	struct sched_entity *se, *next;
 
-	if (!first_fair(cfs_rq))
+	if (!first_fair(cfs_r_rq))
 		return NULL;
 
-	next = se = __pick_next_entity(cfs_rq);
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		set_next_entity(cfs_rq, se);
-	}
+	next = se = __pick_next_entity(cfs_r_rq);
+	for_each_sched_entity(se)
+		set_next_entity(cfs_rq_of(se), se);
 
 	p = task_of(next);
 	hrtick_start_fair(rq, p);
@@ -1157,12 +1162,9 @@ static struct task_struct *pick_next_tas
 static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 {
 	struct sched_entity *se = &prev->se;
-	struct cfs_rq *cfs_rq;
 
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		put_prev_entity(cfs_rq, se);
-	}
+	for_each_sched_entity(se)
+		put_prev_entity(cfs_rq_of(se), se);
 }
 
 #ifdef CONFIG_SMP

--


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC/PATCH 13/17] sched: fair: calc_delta_weight()
  2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
                   ` (11 preceding siblings ...)
  2008-03-09 17:09 ` [RFC/PATCH 12/17] sched: fair: cfs_root_rq Peter Zijlstra
@ 2008-03-09 17:09 ` Peter Zijlstra
  2008-03-09 17:09 ` [RFC/PATCH 14/17] sched: fair: avg_vruntime Peter Zijlstra
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:09 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra

[-- Attachment #1: sched-calc_delta_weight.patch --]
[-- Type: text/plain, Size: 1904 bytes --]

A little cleanup: pull the weight-scaling loops out into calc_delta_weight()
(delta *= w/rw) and calc_delta_fair() (delta *= rw/w).
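
For illustration only (made-up numbers): with an entity of weight 1024 on a
cfs_rq whose total weight is 3072, the two helpers are exact inverses of each
other (up to rounding):

    calc_delta_weight(6ms) = 6ms * 1024/3072 = 2ms    (delta *= w/rw)
    calc_delta_fair(2ms)   = 2ms * 3072/1024 = 6ms    (delta *= rw/w)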

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |   44 +++++++++++++++++++++++++++++---------------
 1 file changed, 29 insertions(+), 15 deletions(-)

Index: linux-2.6-2/kernel/sched_fair.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_fair.c
+++ linux-2.6-2/kernel/sched_fair.c
@@ -342,6 +342,34 @@ int sched_nr_latency_handler(struct ctl_
 #endif
 
 /*
+ * delta *= w / rw
+ */
+static inline unsigned long
+calc_delta_weight(unsigned long delta, struct sched_entity *se)
+{
+	for_each_sched_entity(se) {
+		delta = calc_delta_mine(delta,
+				se->load.weight, &cfs_rq_of(se)->load);
+	}
+
+	return delta;
+}
+
+/*
+ * delta *= rw / w
+ */
+static inline unsigned long
+calc_delta_fair(unsigned long delta, struct sched_entity *se)
+{
+	for_each_sched_entity(se) {
+		delta = calc_delta_mine(delta,
+				cfs_rq_of(se)->load.weight, &se->load);
+	}
+
+	return delta;
+}
+
+/*
  * The idea is to set a period in which each task runs once.
  *
  * When there are too many tasks (sysctl_sched_nr_latency) we have to stretch
@@ -379,10 +407,7 @@ static u64 sched_slice(struct cfs_rq *cf
 	if (unlikely(rq_of(cfs_rq)->load.weight <= se->load.weight))
 		goto out;
 
-	for_each_sched_entity(se) {
-		slice = calc_delta_mine(slice,
-				se->load.weight, &cfs_rq_of(se)->load);
-	}
+	slice = calc_delta_weight(slice, se);
 
 out:
 	return slice;
@@ -403,17 +428,6 @@ static u64 sched_vslice_add(struct cfs_r
 	return __sched_period(nr_running);
 }
 
-static inline unsigned long
-calc_delta_fair(unsigned long delta, struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		delta = calc_delta_mine(delta,
-				cfs_rq_of(se)->load.weight, &se->load);
-	}
-
-	return delta;
-}
-
 /*
  * The goal of calc_delta_asym() is to be asymmetrically around NICE_0_LOAD, in
  * that it favours >=0 over <0.

--


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC/PATCH 14/17] sched: fair: avg_vruntime
  2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
                   ` (12 preceding siblings ...)
  2008-03-09 17:09 ` [RFC/PATCH 13/17] sched: fair: calc_delta_weight() Peter Zijlstra
@ 2008-03-09 17:09 ` Peter Zijlstra
  2008-03-09 17:09 ` [RFC/PATCH 15/17] sched: fair-group: EEVDF scheduling Peter Zijlstra
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:09 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra

[-- Attachment #1: sched-avg-vruntime.patch --]
[-- Type: text/plain, Size: 4946 bytes --]

In order to implement EEVDF we need to be able to test eligibility.
This requires knowing the current virtual time. We use a property of
fair schedulers to determine this in a numerically stable way, namely
that the sum of all lags is 0. Therefore the average of all virtual
times is the position of lag = 0.

We can't just take the average of vruntime - as it will use the full range
of its u64 and will wrap around. Instead we'll use the average of
(vruntime - min_vruntime)

avg = 1/n \Sum_{i=1}^{n} (v_i - v) = 1/n ( (\Sum_{i=1}^{n} v_i) - n*v )

where v is min_vruntime and v_i the i-th entity's vruntime.

By factoring out the 1/n (and never storing it) we avoid rounding, which
would otherwise accumulate an error.
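
A worked example with made-up numbers: suppose three entities are queued with
keys (v_i - min_vruntime) of 0, 3 and 6. We only store

    avg_vruntime = 0 + 3 + 6 = 9,    nr_queued = 3

and divide on demand: the average is 9/3 = 3. When min_vruntime later
advances by d, every key shrinks by d, so the stored sum is corrected by
subtracting nr_queued * d -- no per-update division, hence no accumulating
rounding error.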

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c       |    5 +++
 kernel/sched_debug.c |    5 +++
 kernel/sched_fair.c  |   76 ++++++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 83 insertions(+), 3 deletions(-)

Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -387,6 +387,11 @@ struct cfs_root_rq {
 	struct rb_root tasks_timeline;
 	struct rb_node *rb_leftmost;
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	s64 avg_vruntime;
+	unsigned long nr_queued;
+#endif
+
 #ifdef CONFIG_SCHEDSTATS
 	unsigned long nr_spread_over;
 	u64 exec_clock;
Index: linux-2.6-2/kernel/sched_fair.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_fair.c
+++ linux-2.6-2/kernel/sched_fair.c
@@ -273,6 +273,67 @@ static void __dequeue_timeline(struct cf
 	rb_erase(&se->run_node, &cfs_r_rq->tasks_timeline);
 }
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline
+void avg_vruntime_add(struct cfs_root_rq *cfs_r_rq, struct sched_entity *se)
+{
+	s64 key = entity_key(cfs_r_rq, se);
+	cfs_r_rq->avg_vruntime += key;
+	cfs_r_rq->nr_queued++;
+}
+
+static inline
+void avg_vruntime_sub(struct cfs_root_rq *cfs_r_rq, struct sched_entity *se)
+{
+	s64 key = entity_key(cfs_r_rq, se);
+	cfs_r_rq->avg_vruntime -= key;
+	cfs_r_rq->nr_queued--;
+}
+
+static inline
+void avg_vruntime_update(struct cfs_root_rq *cfs_r_rq, s64 delta)
+{
+	cfs_r_rq->avg_vruntime -= cfs_r_rq->nr_queued * delta;
+}
+
+static inline
+s64 avg_vruntime(struct cfs_root_rq *cfs_r_rq)
+{
+	s64 avg = cfs_r_rq->avg_vruntime;
+	int sign = 0;
+
+	if (avg < 0) {
+		sign = 1;
+		avg = -avg;
+	}
+
+	if (cfs_r_rq->nr_queued)
+		do_div(avg, cfs_r_rq->nr_queued);
+
+	if (sign)
+		avg = -avg;
+
+	return avg;
+}
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline
+void avg_vruntime_add(struct cfs_root_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline
+void avg_vruntime_sub(struct cfs_root_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline
+void avg_vruntime_update(struct cfs_root_rq *cfs_rq, s64 delta)
+{
+}
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
 /*
  * Enqueue an entity into the rb-tree:
  */
@@ -286,6 +347,7 @@ static void __enqueue_entity(struct cfs_
 		return;
 
 	__enqueue_timeline(&rq_of(cfs_rq)->cfs_root, se);
+	avg_vruntime_add(&rq_of(cfs_rq)->cfs_root, se);
 }
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -297,6 +359,7 @@ static void __dequeue_entity(struct cfs_
 		return;
 
 	__dequeue_timeline(&rq_of(cfs_rq)->cfs_root, se);
+	avg_vruntime_sub(&rq_of(cfs_rq)->cfs_root, se);
 }
 
 static inline struct rb_node *first_fair(struct cfs_root_rq *cfs_r_rq)
@@ -464,6 +527,7 @@ static inline
 void __update_root(struct cfs_root_rq *cfs_r_rq, struct sched_entity *curr)
 {
 	u64 vruntime;
+	s64 delta;
 
 	/*
 	 * maintain cfs_r_rq->min_vruntime to be a monotonic increasing
@@ -475,8 +539,14 @@ void __update_root(struct cfs_root_rq *c
 	} else
 		vruntime = curr->vruntime;
 
-	cfs_r_rq->min_vruntime =
-		max_vruntime(cfs_r_rq->min_vruntime, vruntime);
+	/*
+	 * open coded max_vruntime() to allow updating avg_vruntime
+	 */
+	delta = (s64)(vruntime - cfs_r_rq->min_vruntime);
+	if (delta > 0) {
+		avg_vruntime_update(cfs_r_rq, delta);
+		cfs_r_rq->min_vruntime = vruntime;
+	}
 }
 
 /*
@@ -939,7 +1009,7 @@ static void yield_task_fair(struct rq *r
 	/*
 	 * Are we the only task in the tree?
 	 */
-	if (unlikely(rq->load.weight == curr->se.load.weight))
+	if (unlikely(!rq->cfs_root.nr_queued))
 		return;
 
 	if (likely(!sysctl_sched_compat_yield) && curr->policy != SCHED_BATCH) {
Index: linux-2.6-2/kernel/sched_debug.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_debug.c
+++ linux-2.6-2/kernel/sched_debug.c
@@ -136,6 +136,11 @@ void print_cfs_root(struct seq_file *m, 
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread0",
 			SPLIT_NS(spread0));
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
+			SPLIT_NS(avg_vruntime(cfs_r_rq)));
+#endif
+
 #ifdef CONFIG_SCHEDSTATS
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "exec_clock",
 			SPLIT_NS(cfs_r_rq->exec_clock));

--


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC/PATCH 15/17] sched: fair-group: EEVDF scheduling
  2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
                   ` (13 preceding siblings ...)
  2008-03-09 17:09 ` [RFC/PATCH 14/17] sched: fair: avg_vruntime Peter Zijlstra
@ 2008-03-09 17:09 ` Peter Zijlstra
  2008-03-09 17:09 ` [RFC/PATCH 16/17] sched: fair: bound tardiness Peter Zijlstra
  2008-03-09 17:09 ` [RFC/PATCH 17/17] sched: fair: optimize the elegebility test Peter Zijlstra
  16 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:09 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra

[-- Attachment #1: sched-dual.patch --]
[-- Type: text/plain, Size: 12736 bytes --]

Two groups, A and B. A has 1 task, B has 2. Say that B has double
the latency requirement A has, but both have equal weight.

This gives the following scenario:

    A  |---|---|
    ---+-------+
    B1 |-------|
    B2 |-------|

The various scheduling options:

    Ideal: A B1 A  B2
    CFS:   A B1 B2 A
    EDF:   A A  B1 B2

Both CFS and EDF mess up in that they'll lump A together and schedule
both Bs in between, making A miss its latency requirement.

[ Note, both CFS and EDF have the freedom to accidentally schedule it
  properly, but there is no guarantee ]

However, mixing both can give the ideal result.

Vruntime makes task time run as fast as real time, and by keeping track
of where in vruntime we actually are, we can decide between the two tasks
presented by either method.

We schedule using the following rule:
  s1) we use the EDF result if it's eligible to run (has >= 0 lag)
  s2) we use the CFS result

    A  |---|
1)  ---+---+---+           s1 -> A
    B1 |-------|
    B2 |-------|
       ^

    A      |---|
2)  ---+---+---+-------+   s2 -> B1
    B1 |-------|
    B2 |-------|
         ^

    A      |---|
3)  ---+---+---+-------+   s1 -> A
    B1         |-------|
    B2 |-------|
           ^

    A          |---|
4)  ---+-------+---+---+   s2 -> B2
    B1         |-------|
    B2 |-------|
             ^
[ ^ denotes time ]

Note that 3 is still ambiguous as both A and B2 have identical deadlines
and both start times are before or at the current time. While this situation
will be extremely rare, we can resolve this by making the EDF tree prefer
tasks with a smaller period over those with a larger one.

We use a dual tree approach instead of the modified binary tree mentioned in
the EEVDF paper because it allows for a slimmer !group config and saves the
hassle of implementing yet another balanced tree.

With thanks to Dmitry Adamushko for analyzing the problems and poking holes
into my earlier implementations :-)
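
The decision itself boils down to very little code; a hedged, stand-alone
sketch of the s1/s2 rule above (made-up types, not the kernel API):

struct toy_entity {
	long long vruntime;	/* virtual runtime received so far */
	long long vdeadline;	/* virtual deadline = vruntime + vperiod */
};

struct toy_entity *
toy_pick_next(struct toy_entity *earliest_deadline,	/* leftmost of deadline tree */
	      struct toy_entity *smallest_vruntime,	/* leftmost of timeline tree */
	      long long avg_vruntime)			/* the lag == 0 point */
{
	/* s1: earliest *eligible* virtual deadline first (lag >= 0) */
	if (earliest_deadline && earliest_deadline->vruntime <= avg_vruntime)
		return earliest_deadline;

	/* s2: otherwise fall back to the plain CFS pick */
	return smallest_vruntime;
}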

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
---
 include/linux/sched.h |    6 +
 kernel/sched.c        |    8 +-
 kernel/sched_debug.c  |    6 -
 kernel/sched_fair.c   |  193 ++++++++++++++++++++++++++++++++++++++++----------
 4 files changed, 172 insertions(+), 41 deletions(-)

Index: linux-2.6-2/include/linux/sched.h
===================================================================
--- linux-2.6-2.orig/include/linux/sched.h
+++ linux-2.6-2/include/linux/sched.h
@@ -917,7 +917,11 @@ struct load_weight {
  */
 struct sched_entity {
 	struct load_weight	load;		/* for load-balancing */
-	struct rb_node		run_node;
+	struct rb_node		timeline_node;
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	struct rb_node		deadline_node;
+	u64			vperiod;
+#endif
 	struct list_head	group_node;
 	unsigned int		on_rq;
 
Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -385,9 +385,12 @@ struct cfs_root_rq {
 	u64 min_vruntime;
 
 	struct rb_root tasks_timeline;
-	struct rb_node *rb_leftmost;
+	struct rb_node *left_timeline;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	struct rb_root tasks_deadline;
+	struct rb_node *left_deadline;
+
 	s64 avg_vruntime;
 	unsigned long nr_queued;
 #endif
@@ -7498,6 +7501,9 @@ static void init_cfs_root_rq(struct cfs_
 {
 	cfs_r_rq->tasks_timeline = RB_ROOT;
 	cfs_r_rq->min_vruntime = (u64)(-(1LL << 20));
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	cfs_r_rq->tasks_deadline = RB_ROOT;
+#endif
 }
 
 static void init_cfs_rq(struct cfs_rq *cfs_rq, struct rq *rq)
Index: linux-2.6-2/kernel/sched_fair.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_fair.c
+++ linux-2.6-2/kernel/sched_fair.c
@@ -219,7 +219,7 @@ static inline u64 min_vruntime(u64 min_v
 }
 
 static inline
-s64 entity_key(struct cfs_root_rq *cfs_r_rq, struct sched_entity *se)
+s64 entity_timeline_key(struct cfs_root_rq *cfs_r_rq, struct sched_entity *se)
 {
 	return se->vruntime - cfs_r_rq->min_vruntime;
 }
@@ -234,18 +234,18 @@ static void __enqueue_timeline(struct cf
 	int leftmost = 1;
 
 	link = &cfs_r_rq->tasks_timeline.rb_node;
-	key = entity_key(cfs_r_rq, se);
+	key = entity_timeline_key(cfs_r_rq, se);
 	/*
 	 * Find the right place in the rbtree:
 	 */
 	while (*link) {
 		parent = *link;
-		entry = rb_entry(parent, struct sched_entity, run_node);
+		entry = rb_entry(parent, struct sched_entity, timeline_node);
 		/*
 		 * We dont care about collisions. Nodes with
 		 * the same key stay together.
 		 */
-		if (key < entity_key(cfs_r_rq, entry)) {
+		if (key < entity_timeline_key(cfs_r_rq, entry)) {
 			link = &parent->rb_left;
 		} else {
 			link = &parent->rb_right;
@@ -258,26 +258,48 @@ static void __enqueue_timeline(struct cf
 	 * used):
 	 */
 	if (leftmost)
-		cfs_r_rq->rb_leftmost = &se->run_node;
+		cfs_r_rq->left_timeline = &se->timeline_node;
 
-	rb_link_node(&se->run_node, parent, link);
-	rb_insert_color(&se->run_node, &cfs_r_rq->tasks_timeline);
+	rb_link_node(&se->timeline_node, parent, link);
+	rb_insert_color(&se->timeline_node, &cfs_r_rq->tasks_timeline);
 }
 
 static void __dequeue_timeline(struct cfs_root_rq *cfs_r_rq,
 		struct sched_entity *se)
 {
-	if (cfs_r_rq->rb_leftmost == &se->run_node)
-		cfs_r_rq->rb_leftmost = rb_next(&se->run_node);
+	if (cfs_r_rq->left_timeline == &se->timeline_node)
+		cfs_r_rq->left_timeline = rb_next(&se->timeline_node);
 
-	rb_erase(&se->run_node, &cfs_r_rq->tasks_timeline);
+	rb_erase(&se->timeline_node, &cfs_r_rq->tasks_timeline);
+}
+
+static inline struct rb_node *first_fair(struct cfs_root_rq *cfs_r_rq)
+{
+	return cfs_r_rq->left_timeline;
+}
+
+static struct sched_entity *__pick_next_timeline(struct cfs_root_rq *cfs_r_rq)
+{
+	return rb_entry(first_fair(cfs_r_rq),
+			struct sched_entity, timeline_node);
+}
+
+static inline
+struct sched_entity *__pick_last_timeline(struct cfs_root_rq *cfs_r_rq)
+{
+	struct rb_node *last = rb_last(&cfs_r_rq->tasks_timeline);
+
+	if (!last)
+		return NULL;
+
+	return rb_entry(last, struct sched_entity, timeline_node);
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static inline
 void avg_vruntime_add(struct cfs_root_rq *cfs_r_rq, struct sched_entity *se)
 {
-	s64 key = entity_key(cfs_r_rq, se);
+	s64 key = entity_timeline_key(cfs_r_rq, se);
 	cfs_r_rq->avg_vruntime += key;
 	cfs_r_rq->nr_queued++;
 }
@@ -285,7 +307,7 @@ void avg_vruntime_add(struct cfs_root_rq
 static inline
 void avg_vruntime_sub(struct cfs_root_rq *cfs_r_rq, struct sched_entity *se)
 {
-	s64 key = entity_key(cfs_r_rq, se);
+	s64 key = entity_timeline_key(cfs_r_rq, se);
 	cfs_r_rq->avg_vruntime -= key;
 	cfs_r_rq->nr_queued--;
 }
@@ -316,6 +338,108 @@ s64 avg_vruntime(struct cfs_root_rq *cfs
 	return avg;
 }
 
+static inline
+s64 entity_deadline_key(struct cfs_root_rq *cfs_r_rq, struct sched_entity *se)
+{
+	return se->vruntime + se->vperiod - cfs_r_rq->min_vruntime;
+}
+
+static u64 sched_vslice_add(struct cfs_rq *cfs_rq, struct sched_entity *se);
+
+static void sched_calc_deadline(struct cfs_root_rq *cfs_r_rq,
+				struct sched_entity *se)
+{
+	u64 deadline = avg_vruntime(cfs_r_rq) +
+		sched_vslice_add(cfs_rq_of(se), se);
+	se->vperiod = deadline - entity_timeline_key(cfs_r_rq, se);
+}
+
+static void __enqueue_deadline(struct cfs_root_rq *cfs_r_rq,
+			       struct sched_entity *se)
+{
+	struct rb_node **link;
+	struct rb_node *parent = NULL;
+	struct sched_entity *entry;
+	s64 key, entry_key;
+	int leftmost = 1;
+
+	sched_calc_deadline(cfs_r_rq, se);
+
+	link = &cfs_r_rq->tasks_deadline.rb_node;
+	key = entity_deadline_key(cfs_r_rq, se);
+	/*
+	 * Find the right place in the rbtree:
+	 */
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct sched_entity, deadline_node);
+		/*
+		 * Prefer shorter latency tasks over higher.
+		 */
+		entry_key = entity_deadline_key(cfs_r_rq, entry);
+		if (key < entry_key ||
+		    (key == entry_key && se->vperiod < entry->vperiod)) {
+
+			link = &parent->rb_left;
+
+		} else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	/*
+	 * Maintain a cache of leftmost tree entries (it is frequently
+	 * used):
+	 */
+	if (leftmost)
+		cfs_r_rq->left_deadline = &se->deadline_node;
+
+	rb_link_node(&se->deadline_node, parent, link);
+	rb_insert_color(&se->deadline_node, &cfs_r_rq->tasks_deadline);
+}
+
+static void __dequeue_deadline(struct cfs_root_rq *cfs_r_rq,
+			       struct sched_entity *se)
+{
+	if (cfs_r_rq->left_deadline == &se->deadline_node)
+		cfs_r_rq->left_deadline = rb_next(&se->deadline_node);
+
+	rb_erase(&se->deadline_node, &cfs_r_rq->tasks_deadline);
+}
+
+static inline struct rb_node *first_deadline(struct cfs_root_rq *cfs_r_rq)
+{
+	return cfs_r_rq->left_deadline;
+}
+
+static struct sched_entity *__pick_next_deadline(struct cfs_root_rq *cfs_r_rq)
+{
+	return rb_entry(first_deadline(cfs_r_rq),
+			struct sched_entity, deadline_node);
+}
+
+/*
+ * Earliest Eligible Virtual Deadline First
+ *
+ * pick a task using the virtual deadline tree, when it is eligible to run
+ * use it, otherwise run the task with the greatest need.
+ *
+ * Eligibility means we never run a task which has more than its fair share
+ * of runtime.
+ */
+static struct sched_entity *__pick_next_entity(struct cfs_root_rq *cfs_r_rq)
+{
+	struct sched_entity *next_deadline;
+	s64 avg = avg_vruntime(cfs_r_rq);
+
+	next_deadline = __pick_next_deadline(cfs_r_rq);
+	if (entity_timeline_key(cfs_r_rq, next_deadline) <= avg)
+		return next_deadline;
+
+	return __pick_next_timeline(cfs_r_rq);
+}
+
 #else /* CONFIG_FAIR_GROUP_SCHED */
 
 static inline
@@ -332,6 +456,22 @@ static inline
 void avg_vruntime_update(struct cfs_root_rq *cfs_rq, s64 delta)
 {
 }
+
+static inline
+void __enqueue_deadline(struct cfs_root_rq *cfs_r_rq, struct sched_entity *se)
+{
+}
+
+static inline
+void __dequeue_deadline(struct cfs_root_rq *cfs_r_rq, struct sched_entity *se)
+{
+}
+
+static inline
+struct sched_entity *__pick_next_entity(struct cfs_root_rq *cfs_r_rq)
+{
+	return __pick_next_timeline(cfs_r_rq);
+}
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 /*
@@ -347,6 +487,7 @@ static void __enqueue_entity(struct cfs_
 		return;
 
 	__enqueue_timeline(&rq_of(cfs_rq)->cfs_root, se);
+	__enqueue_deadline(&rq_of(cfs_rq)->cfs_root, se);
 	avg_vruntime_add(&rq_of(cfs_rq)->cfs_root, se);
 }
 
@@ -359,30 +500,10 @@ static void __dequeue_entity(struct cfs_
 		return;
 
 	__dequeue_timeline(&rq_of(cfs_rq)->cfs_root, se);
+	__dequeue_deadline(&rq_of(cfs_rq)->cfs_root, se);
 	avg_vruntime_sub(&rq_of(cfs_rq)->cfs_root, se);
 }
 
-static inline struct rb_node *first_fair(struct cfs_root_rq *cfs_r_rq)
-{
-	return cfs_r_rq->rb_leftmost;
-}
-
-static struct sched_entity *__pick_next_entity(struct cfs_root_rq *cfs_r_rq)
-{
-	return rb_entry(first_fair(cfs_r_rq), struct sched_entity, run_node);
-}
-
-static inline
-struct sched_entity *__pick_last_entity(struct cfs_root_rq *cfs_r_rq)
-{
-	struct rb_node *last = rb_last(&cfs_rq->tasks_timeline);
-
-	if (!last)
-		return NULL;
-
-	return rb_entry(last, struct sched_entity, timeline_node);
-}
-
 /**************************************************************
  * Scheduling class statistics methods:
  */
@@ -440,7 +561,7 @@ calc_delta_fair(unsigned long delta, str
  *
  * p = (nr <= nl) ? l : l*nr/nl
  */
-static u64 __sched_period(unsigned long nr_running)
+static inline u64 __sched_period(unsigned long nr_running)
 {
 	u64 period = sysctl_sched_latency;
 	unsigned long nr_latency = sched_nr_latency;
@@ -535,7 +656,7 @@ void __update_root(struct cfs_root_rq *c
 	 */
 	if (first_fair(cfs_r_rq)) {
 		vruntime = min_vruntime(curr->vruntime,
-				__pick_next_entity(cfs_r_rq)->vruntime);
+				__pick_next_timeline(cfs_r_rq)->vruntime);
 	} else
 		vruntime = curr->vruntime;
 
@@ -1024,7 +1145,7 @@ static void yield_task_fair(struct rq *r
 	/*
 	 * Find the rightmost entry in the rbtree:
 	 */
-	rightmost = __pick_last_entity(&rq->cfs_root);
+	rightmost = __pick_last_timeline(&rq->cfs_root);
 	/*
 	 * Already in the rightmost position?
 	 */
Index: linux-2.6-2/kernel/sched_debug.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_debug.c
+++ linux-2.6-2/kernel/sched_debug.c
@@ -113,9 +113,9 @@ void print_cfs_root(struct seq_file *m, 
 	SEQ_printf(m, "\ncfs_root_rq\n");
 
 	spin_lock_irqsave(&rq->lock, flags);
-	if (cfs_r_rq->rb_leftmost)
-		MIN_vruntime = (__pick_next_entity(cfs_r_rq))->vruntime;
-	last = __pick_last_entity(cfs_r_rq);
+	if (cfs_r_rq->left_timeline)
+		MIN_vruntime = (__pick_next_timeline(cfs_r_rq))->vruntime;
+	last = __pick_last_timeline(cfs_r_rq);
 	if (last)
 		max_vruntime = last->vruntime;
 	min_vruntime = cfs_r_rq->min_vruntime;

--


^ permalink raw reply	[flat|nested] 26+ messages in thread
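
As an aside for readers following the pick logic above, here is a minimal,
hypothetical userspace sketch of the same EEVDF rule: take the entity with the
earliest virtual deadline and use it if it is eligible (its vruntime does not
exceed the average), otherwise fall back to the entity with the smallest
vruntime. Plain arrays stand in for the two rbtrees; all names and numbers are
illustrative, not the kernel's.

#include <stdio.h>

struct entity {
	const char *name;
	long long vruntime;	/* virtual time already received */
	long long vdeadline;	/* vruntime + virtual slice */
};

/* average vruntime over all queued entities */
static long long avg_vruntime(const struct entity *e, int n)
{
	long long sum = 0;

	for (int i = 0; i < n; i++)
		sum += e[i].vruntime;
	return sum / n;
}

/* EEVDF pick: earliest eligible virtual deadline first */
static const struct entity *pick_next(const struct entity *e, int n)
{
	const struct entity *earliest_deadline = &e[0];
	const struct entity *smallest_vruntime = &e[0];

	for (int i = 1; i < n; i++) {
		if (e[i].vdeadline < earliest_deadline->vdeadline)
			earliest_deadline = &e[i];
		if (e[i].vruntime < smallest_vruntime->vruntime)
			smallest_vruntime = &e[i];
	}

	/* eligible: has not yet received more than its fair share */
	if (earliest_deadline->vruntime <= avg_vruntime(e, n))
		return earliest_deadline;

	/* not eligible: run the entity with the greatest need instead */
	return smallest_vruntime;
}

int main(void)
{
	struct entity rq[] = {
		{ "A", 100, 140 },	/* earliest deadline, but ran ahead */
		{ "B",  60, 180 },
		{ "C",  80, 160 },
	};

	printf("next: %s\n", pick_next(rq, 3)->name);
	return 0;
}

With the numbers in main(), entity A has the earliest deadline but has already
run ahead of the average vruntime, so the sketch falls back and prints
"next: B".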

* [RFC/PATCH 16/17] sched: fair: bound tardiness
  2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
                   ` (14 preceding siblings ...)
  2008-03-09 17:09 ` [RFC/PATCH 15/17] sched: fair-group: EEVDF scheduling Peter Zijlstra
@ 2008-03-09 17:09 ` Peter Zijlstra
  2008-03-09 17:09 ` [RFC/PATCH 17/17] sched: fair: optimize the eligibility test Peter Zijlstra
  16 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:09 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra

[-- Attachment #1: sched-fair-slice-limit.patch --]
[-- Type: text/plain, Size: 1366 bytes --]

EEVDF guarantees an O(1) bound on the tardiness by which a deadline can be
exceeded, and this tardiness comes from the slice length. So by limiting the
slice length when there is competition, we limit the amount by which the
deadline can be exceeded.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |   14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

Index: linux-2.6-2/kernel/sched_fair.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_fair.c
+++ linux-2.6-2/kernel/sched_fair.c
@@ -584,15 +584,21 @@ static u64 sched_slice(struct cfs_rq *cf
 {
 	u64 slice = __sched_period(cfs_rq->nr_running);
 
-	/*
-	 * FIXME: curious 'hack' to make it boot - when the tick is started we
-	 * hit this with the init_task, which is not actually enqueued.
-	 */
 	if (unlikely(rq_of(cfs_rq)->load.weight <= se->load.weight))
 		goto out;
 
 	slice = calc_delta_weight(slice, se);
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	/*
+	 * EEVDF guarantees an O(1) bound on tardiness past the deadline; this
+	 * tardiness comes from the slice length. So by limiting the slice
+	 * length when there is competition, we limit the amount by which the
+	 * deadline can be exceeded.
+	 */
+	if (rq_of(cfs_rq)->cfs_root.nr_queued)
+		slice = min_t(u64, slice, sysctl_sched_min_granularity);
+#endif
 out:
 	return slice;
 }

--


^ permalink raw reply	[flat|nested] 26+ messages in thread
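
To illustrate the cap above with made-up numbers (the values below are
assumptions, not the actual sysctl defaults): a heavy entity that would
otherwise get most of the latency period has its slice clamped to the minimum
granularity whenever other entities are queued, which in turn bounds how far
past its virtual deadline it can run.

#include <stdio.h>

#define NSEC_PER_MSEC 1000000ULL

static unsigned long long min_u64(unsigned long long a, unsigned long long b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned long long period   = 20 * NSEC_PER_MSEC; /* assumed sched_latency */
	unsigned long long min_gran =  4 * NSEC_PER_MSEC; /* assumed min granularity */
	unsigned long long weight = 8, total_weight = 11;  /* one heavy entity */
	int others_queued = 1;

	/* weight-proportional slice, as calc_delta_weight() would produce */
	unsigned long long slice = period * weight / total_weight;

	printf("uncapped slice: %llu ms\n", slice / NSEC_PER_MSEC);

	if (others_queued)
		slice = min_u64(slice, min_gran);

	printf("capped slice:   %llu ms\n", slice / NSEC_PER_MSEC);
	return 0;
}

Here the heavy entity's 14 ms slice is clamped to 4 ms, so it can overrun its
deadline by roughly one minimum granularity instead of most of the latency
period.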

* [RFC/PATCH 17/17] sched: fair: optimize the eligibility test
  2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
                   ` (15 preceding siblings ...)
  2008-03-09 17:09 ` [RFC/PATCH 16/17] sched: fair: bound tardiness Peter Zijlstra
@ 2008-03-09 17:09 ` Peter Zijlstra
  16 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-09 17:09 UTC (permalink / raw)
  To: LKML, Ingo Molnar
  Cc: Dmitry Adamushko, Mike Galbraith, Dhaval Giani,
	Srivatsa Vaddagiri, Peter Zijlstra

[-- Attachment #1: sched-fair-avg_runtime-opt.patch --]
[-- Type: text/plain, Size: 1361 bytes --]

Suggested by Dmitry Adamushko; trade a div for a mult.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6-2/kernel/sched_fair.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_fair.c
+++ linux-2.6-2/kernel/sched_fair.c
@@ -339,6 +339,12 @@ s64 avg_vruntime(struct cfs_root_rq *cfs
 }
 
 static inline
+int avg_vruntime_cmp(struct cfs_root_rq *cfs_r_rq, s64 vruntime)
+{
+	return (vruntime * cfs_r_rq->nr_queued) <= cfs_r_rq->avg_vruntime;
+}
+
+static inline
 s64 entity_deadline_key(struct cfs_root_rq *cfs_r_rq, struct sched_entity *se)
 {
 	return se->vruntime + se->vperiod - cfs_r_rq->min_vruntime;
@@ -430,11 +436,10 @@ static struct sched_entity *__pick_next_
  */
 static struct sched_entity *__pick_next_entity(struct cfs_root_rq *cfs_r_rq)
 {
-	struct sched_entity *next_deadline;
-	s64 avg = avg_vruntime(cfs_r_rq);
+	struct sched_entity *next_deadline = __pick_next_deadline(cfs_r_rq);
+	s64 vruntime = entity_timeline_key(cfs_r_rq, next_deadline);
 
-	next_deadline = __pick_next_deadline(cfs_r_rq);
-	if (entity_timeline_key(cfs_r_rq, next_deadline) <= avg)
+	if (avg_vruntime_cmp(cfs_r_rq, vruntime))
 		return next_deadline;
 
 	return __pick_next_timeline(cfs_r_rq);

--


^ permalink raw reply	[flat|nested] 26+ messages in thread
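
A tiny standalone illustration of the div-for-mult trade above, with arbitrary
values: instead of dividing the running vruntime sum by the number of queued
entities on every pick, the candidate's key is multiplied by the count and
compared against the sum directly.

#include <stdio.h>

int main(void)
{
	long long sum = 700;	/* running sum of (vruntime - min_vruntime) */
	long long nr  = 3;	/* number of queued entities */
	long long key = 230;	/* candidate's (vruntime - min_vruntime) */

	int with_div  = key <= sum / nr;	/* old test: one division per pick */
	int with_mult = key * nr <= sum;	/* new test: multiplication only */

	printf("div: %d  mult: %d\n", with_div, with_mult);
	return 0;
}

Both forms agree here (230 <= 233 and 690 <= 700); the multiplication form
simply avoids the 64-bit division in the hot path.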

* Re: [RFC/PATCH 03/17] sched: rt-group: full hierarchy support for the rt groups
  2008-03-09 17:08 ` [RFC/PATCH 03/17] sched: rt-group: full hierarchy support for the rt groups Peter Zijlstra
@ 2008-03-12  8:25   ` Dhaval Giani
       [not found]     ` <661de9470803120129l66497656o948c13c6c0c26766@mail.gmail.com>
  0 siblings, 1 reply; 26+ messages in thread
From: Dhaval Giani @ 2008-03-12  8:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Dmitry Adamushko, Mike Galbraith, Srivatsa Vaddagiri

On Sun, Mar 09, 2008 at 06:08:53PM +0100, Peter Zijlstra wrote:
> Update the -rt bits to support the full hierarchy support started by Dhaval.
> 

Thanks for the effort. Just a very minor comment. I will try these out
and give more feedback.

> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  kernel/sched.c |   52 ++++++++++++++++++++++++++++++----------------------
>  1 file changed, 30 insertions(+), 22 deletions(-)
> 
> Index: linux-2.6-2/kernel/sched.c
> ===================================================================
> --- linux-2.6-2.orig/kernel/sched.c
> +++ linux-2.6-2/kernel/sched.c
> @@ -7302,10 +7302,12 @@ static void init_tg_cfs_entry(struct tas
>  #endif
> 
>  #ifdef CONFIG_RT_GROUP_SCHED
> -static void init_tg_rt_entry(struct rq *rq, struct task_group *tg,
> -		struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
> -		int cpu, int add)
> +static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
> +		struct sched_rt_entity *rt_se, int cpu, int add,
> +		struct sched_rt_entity *parent)
>  {
> +	struct rq *rq = cpu_rq(cpu);
> +
>  	tg->rt_rq[cpu] = rt_rq;
>  	init_rt_rq(rt_rq, rq);
>  	rt_rq->tg = tg;
> @@ -7318,6 +7320,11 @@ static void init_tg_rt_entry(struct rq *
>  	if (!rt_se)
>  		return;
> 
> +	if (!parent)
> +		rt_se->rt_rq = &rq->rt;
> +	else
> +		rt_se->rt_rq = parent->my_q;
> +
>  	rt_se->rt_rq = &rq->rt;
>  	rt_se->my_q = rt_rq;
>  	rt_se->parent = NULL;
> @@ -7380,8 +7387,7 @@ void __init sched_init(void)
>  		 * We achieve this by letting init_task_group's tasks sit
>  		 * directly in rq->cfs (i.e init_task_group->se[] = NULL).
>  		 */
> -		init_tg_cfs_entry(&init_task_group, &rq->cfs,
> -							NULL, i, 1, NULL);
> +		init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, 1, NULL);

Shouldn't this go in the previous patch?

>  #elif defined CONFIG_USER_SCHED
>  		/*
>  		 * In case of task-groups formed thr' the user id of tasks,
> @@ -7394,7 +7400,7 @@ void __init sched_init(void)
>  		 * (init_cfs_rq) and having one entity represent this group of
>  		 * tasks in rq->cfs (i.e init_task_group->se[] != NULL).
>  		 */
> -		init_tg_cfs_entry(rq, &init_task_group,
> +		init_tg_cfs_entry(&init_task_group,
>  				&per_cpu(init_cfs_rq, i),
>  				&per_cpu(init_sched_entity, i), i, 1, NULL);
> 

Shouldn't this go in the previous patch? (my mistake, I missed this one in the
patch I sent out).

> @@ -7405,11 +7411,11 @@ void __init sched_init(void)
>  #ifdef CONFIG_RT_GROUP_SCHED
>  		INIT_LIST_HEAD(&rq->leaf_rt_rq_list);
>  #ifdef CONFIG_CGROUP_SCHED
> -		init_tg_rt_entry(rq, &init_task_group, &rq->rt, NULL, i, 1);
> +		init_tg_rt_entry(&init_task_group, &rq->rt, NULL, i, 1, NULL);
>  #elif defined CONFIG_USER_SCHED
> -		init_tg_rt_entry(rq, &init_task_group,
> +		init_tg_rt_entry(&init_task_group,
>  				&per_cpu(init_rt_rq, i),
> -				&per_cpu(init_sched_rt_entity, i), i, 1);
> +				&per_cpu(init_sched_rt_entity, i), i, 1, NULL);
>  #endif
>  #endif
> 
> @@ -7613,11 +7619,11 @@ static void free_fair_sched_group(struct
>  	kfree(tg->se);
>  }
> 
> -static int alloc_fair_sched_group(struct task_group *tg,
> -					struct task_group *parent)
> +static
> +int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
>  {
>  	struct cfs_rq *cfs_rq;
> -	struct sched_entity *se;
> +	struct sched_entity *se, *parent_se;
>  	struct rq *rq;
>  	int i;
> 
> @@ -7643,10 +7649,8 @@ static int alloc_fair_sched_group(struct
>  		if (!se)
>  			goto err;
> 
> -		if (!parent)
> -			init_tg_cfs_entry(tg, cfs_rq, se, i, 0, NULL);
> -		else
> -			init_tg_cfs_entry(tg, cfs_rq, se, i, 0,	parent->se[i]);
> +		parent_se = parent ? parent->se[i] : NULL;
> +		init_tg_cfs_entry(tg, cfs_rq, se, i, 0, parent_se);
>  	}
> 
>  	return 1;
> @@ -7670,7 +7674,8 @@ static inline void free_fair_sched_group
>  {
>  }
> 
> -static inline int alloc_fair_sched_group(struct task_group *tg)
> +static inline
> +int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
>  {
>  	return 1;
>  }
> @@ -7702,10 +7707,11 @@ static void free_rt_sched_group(struct t
>  	kfree(tg->rt_se);
>  }
> 
> -static int alloc_rt_sched_group(struct task_group *tg)
> +static
> +int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
>  {
>  	struct rt_rq *rt_rq;
> -	struct sched_rt_entity *rt_se;
> +	struct sched_rt_entity *rt_se, *parent_se;
>  	struct rq *rq;
>  	int i;
> 
> @@ -7732,7 +7738,8 @@ static int alloc_rt_sched_group(struct t
>  		if (!rt_se)
>  			goto err;
> 
> -		init_tg_rt_entry(rq, tg, rt_rq, rt_se, i, 0);
> +		parent_se = parent ? parent->rt_se[i] : NULL;
> +		init_tg_rt_entry(tg, rt_rq, rt_se, i, 0, parent_se);
>  	}
> 
>  	return 1;
> @@ -7756,7 +7763,8 @@ static inline void free_rt_sched_group(s
>  {
>  }
> 
> -static inline int alloc_rt_sched_group(struct task_group *tg)
> +static inline
> +int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
>  {
>  	return 1;
>  }
> @@ -7792,7 +7800,7 @@ struct task_group *sched_create_group(st
>  	if (!alloc_fair_sched_group(tg, parent))
>  		goto err;
> 
> -	if (!alloc_rt_sched_group(tg))
> +	if (!alloc_rt_sched_group(tg, parent))
>  		goto err;
> 
>  	spin_lock_irqsave(&task_group_lock, flags);
> 
> --

-- 
regards,
Dhaval

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH 03/17] sched: rt-group: full hierarchy support for the rt groups
       [not found]     ` <661de9470803120129l66497656o948c13c6c0c26766@mail.gmail.com>
@ 2008-03-12  9:29       ` Peter Zijlstra
  2008-03-12 13:34         ` Balbir Singh
  2008-03-12 14:08         ` Balbir Singh
  0 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-03-12  9:29 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Dhaval Giani, LKML, Ingo Molnar, Dmitry Adamushko,
	Mike Galbraith, Srivatsa Vaddagiri

On Wed, 2008-03-12 at 13:59 +0530, Balbir Singh wrote:
> On Wed, Mar 12, 2008 at 1:55 PM, Dhaval Giani
> <dhaval@linux.vnet.ibm.com> wrote:
>         On Sun, Mar 09, 2008 at 06:08:53PM +0100, Peter Zijlstra
>         wrote:
>         > Update the -rt bits to support the full hierarchy support
>         started by Dhaval.
>         >
>         
>         
>         Thanks for the effort. Just a very minor comment. I will try
>         these out
>         and give more feedback.
>         
>         
>         > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>         
> 
> Shouldn't this have Dhaval's signed-off-by as well? Dhaval do you want
> to sign-off on these patches?

He can Ack it, but the 3rd patch was not authored by him.

That said, we should probably clean up the first 3 patches and fold
stuff.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH 03/17] sched: rt-group: full hierarchy support for the rt groups
  2008-03-12  9:29       ` Peter Zijlstra
@ 2008-03-12 13:34         ` Balbir Singh
  2008-03-12 13:39           ` Dhaval Giani
  2008-03-12 14:08         ` Balbir Singh
  1 sibling, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2008-03-12 13:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dhaval Giani, LKML, Ingo Molnar, Dmitry Adamushko,
	Mike Galbraith, Srivatsa Vaddagiri

Peter Zijlstra wrote:
> On Wed, 2008-03-12 at 13:59 +0530, Balbir Singh wrote:
>> On Wed, Mar 12, 2008 at 1:55 PM, Dhaval Giani
>> <dhaval@linux.vnet.ibm.com> wrote:
>>         On Sun, Mar 09, 2008 at 06:08:53PM +0100, Peter Zijlstra
>>         wrote:
>>         > Update the -rt bits to support the full hierarchy support
>>         started by Dhaval.
>>         >
>>         
>>         
>>         Thanks for the effort. Just a very minor comment. I will try
>>         these out
>>         and give more feedback.
>>         
>>         
>>         > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>>         
>>
>> Shouldn't this have Dhaval's signed-off-by as well? Dhaval do you want
>> to sign-off on these patches?
> 
> He can Ack it, but the 3rd patch was not authored by him.
> 

So, the first two patches should have his sign-off then? I don't mean to
nit-pick, but want to follow DCO correctly.

> That said, we should probably clean up the first 3 patches and fold
> stuff.
> 

OK. I'll try and review the patchset tonight.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH 03/17] sched: rt-group: full hierarchy support for the rt groups
  2008-03-12 13:34         ` Balbir Singh
@ 2008-03-12 13:39           ` Dhaval Giani
  0 siblings, 0 replies; 26+ messages in thread
From: Dhaval Giani @ 2008-03-12 13:39 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Dmitry Adamushko,
	Mike Galbraith, Srivatsa Vaddagiri

On Wed, Mar 12, 2008 at 07:04:32PM +0530, Balbir Singh wrote:
> Peter Zijlstra wrote:
> > On Wed, 2008-03-12 at 13:59 +0530, Balbir Singh wrote:
> >> On Wed, Mar 12, 2008 at 1:55 PM, Dhaval Giani
> >> <dhaval@linux.vnet.ibm.com> wrote:
> >>         On Sun, Mar 09, 2008 at 06:08:53PM +0100, Peter Zijlstra
> >>         wrote:
> >>         > Update the -rt bits to support the full hierarchy support
> >>         started by Dhaval.
> >>         >
> >>         
> >>         
> >>         Thanks for the effort. Just a very minor comment. I will try
> >>         these out
> >>         and give more feedback.
> >>         
> >>         
> >>         > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> >>         
> >>
> >> Shouldn't this have Dhaval's signed-off-by as well? Dhaval do you want
> >> to sign-off on these patches?
> > 
> > He can Ack it, but the 3rd patch was not authored by him.
> > 
> 
> So, the first two patches should have his sign-off then? I don't mean to
> nit-pick, but want to follow DCO correctly.
> 

They do, and they also have my From: line; quilt mail just ate it up. I hope
Peter can find a workaround so that the next post of the patchset won't lose
it :).

-- 
regards,
Dhaval

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH 03/17] sched: rt-group: full hierarchy support for the rt groups
  2008-03-12  9:29       ` Peter Zijlstra
  2008-03-12 13:34         ` Balbir Singh
@ 2008-03-12 14:08         ` Balbir Singh
  1 sibling, 0 replies; 26+ messages in thread
From: Balbir Singh @ 2008-03-12 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dhaval Giani, LKML, Ingo Molnar, Dmitry Adamushko,
	Mike Galbraith, Srivatsa Vaddagiri

Peter Zijlstra wrote:
> On Wed, 2008-03-12 at 13:59 +0530, Balbir Singh wrote:
>> On Wed, Mar 12, 2008 at 1:55 PM, Dhaval Giani
>> <dhaval@linux.vnet.ibm.com> wrote:
>>         On Sun, Mar 09, 2008 at 06:08:53PM +0100, Peter Zijlstra
>>         wrote:
>>         > Update the -rt bits to support the full hierarchy support
>>         started by Dhaval.
>>         >
>>         
>>         
>>         Thanks for the effort. Just a very minor comment. I will try
>>         these out
>>         and give more feedback.
>>         
>>         
>>         > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>>         
>>
>> Shouldn't this have Dhaval's signed-off-by as well? Dhaval do you want
>> to sign-off on these patches?
> 
> He can Ack it, but the 3rd patch was not authored by him.
> 

My bad, I saw his sign-off on the first two patches. I got confused due to the
order in which my mailer received the patches.



-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH 02/17] sched: allow the CFS group scheduler to have multiple levels
  2008-03-09 17:08 ` [RFC/PATCH 02/17] sched: allow the CFS group scheduler to have multiple levels Peter Zijlstra
@ 2008-03-12 14:47   ` Dhaval Giani
  0 siblings, 0 replies; 26+ messages in thread
From: Dhaval Giani @ 2008-03-12 14:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Dmitry Adamushko, Mike Galbraith, Srivatsa Vaddagiri

On Sun, Mar 09, 2008 at 06:08:52PM +0100, Peter Zijlstra wrote:
> This patch makes the group scheduler multi hierarchy aware.
> 

Just a heads up: my system hangs when I try to add tasks to groups with the
patches up to and including this one applied. Debugging now.

thanks,
-- 
regards,
Dhaval

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH 01/17] sched: mix tasks and groups
  2008-04-01 12:12   ` Srivatsa Vaddagiri
@ 2008-04-01 12:05     ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-04-01 12:05 UTC (permalink / raw)
  To: vatsa; +Cc: LKML, Ingo Molnar, Dmitry Adamushko, Mike Galbraith, Dhaval Giani

On Tue, 2008-04-01 at 17:42 +0530, Srivatsa Vaddagiri wrote:
> On Sun, Mar 09, 2008 at 06:08:51PM +0100, Peter Zijlstra wrote:
> > This patch allows tasks and groups to exist in the same cfs_rq. With this
> > change the CFS group scheduling follows a 1/(M+N) model from a 1/(1+N)
> > fairness model where M tasks and N groups exist at the cfs_rq level.
> > 
> > [a.p.zijlstra@chello.nl: rt bits]
> > Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
> > Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> >  kernel/sched.c      |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  kernel/sched_fair.c |   48 +++++++++++++++++++++++++++++++++++++++++++---
> >  kernel/sched_rt.c   |   15 ++++++++------
> >  3 files changed, 106 insertions(+), 11 deletions(-)
> > 
> > Index: linux-2.6-2/kernel/sched.c
> > ===================================================================
> > --- linux-2.6-2.orig/kernel/sched.c
> > +++ linux-2.6-2/kernel/sched.c
> > @@ -273,18 +273,23 @@ struct task_group {
> >  };
> > 
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > +
> > +#ifdef CONFIG_USER_SCHED
> >  /* Default task group's sched entity on each cpu */
> >  static DEFINE_PER_CPU(struct sched_entity, init_sched_entity);
> >  /* Default task group's cfs_rq on each cpu */
> >  static DEFINE_PER_CPU(struct cfs_rq, init_cfs_rq) ____cacheline_aligned_in_smp;
> > +#endif
> 
> 
> I am trying to understand the change this brings to the semantics of RT scheduling.
> 
> With this change, /cgroup will be seen as the parent group of all other
> groups (say: /cgroup/A, /cgroup/B etc). Is that correct?
> 
> If so, the check in __rt_schedulable(), which assumes that all task groups
> form a flat hierarchy, needs a change as well.

Yes, I have that on my todo list somewhere. I realized the same thing
earlier today :-)

> For example: let's say that init_task_group (/cgroup in this case) had the
> default rt_bandwidth of 95% (global_rt_runtime()). A child group under it
> (/cgroup/A) is created. If the user tries to assign it an rt-bandwidth of
> 50%, then AFAICS, it will fail with the current code, whereas it shouldn't
> (because by giving /cgroup/A 50% bandwidth, we are not really exceeding
> the globally allowed RT bandwidth of 95%, since /cgroup/A is a child of
> /cgroup).


agreed.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC/PATCH 01/17] sched: mix tasks and groups
  2008-03-09 17:08 ` [RFC/PATCH 01/17] sched: mix tasks and groups Peter Zijlstra
@ 2008-04-01 12:12   ` Srivatsa Vaddagiri
  2008-04-01 12:05     ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-04-01 12:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Dmitry Adamushko, Mike Galbraith, Dhaval Giani

On Sun, Mar 09, 2008 at 06:08:51PM +0100, Peter Zijlstra wrote:
> This patch allows tasks and groups to exist in the same cfs_rq. With this
> change the CFS group scheduling follows a 1/(M+N) model from a 1/(1+N)
> fairness model where M tasks and N groups exist at the cfs_rq level.
> 
> [a.p.zijlstra@chello.nl: rt bits]
> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  kernel/sched.c      |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched_fair.c |   48 +++++++++++++++++++++++++++++++++++++++++++---
>  kernel/sched_rt.c   |   15 ++++++++------
>  3 files changed, 106 insertions(+), 11 deletions(-)
> 
> Index: linux-2.6-2/kernel/sched.c
> ===================================================================
> --- linux-2.6-2.orig/kernel/sched.c
> +++ linux-2.6-2/kernel/sched.c
> @@ -273,18 +273,23 @@ struct task_group {
>  };
> 
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> +
> +#ifdef CONFIG_USER_SCHED
>  /* Default task group's sched entity on each cpu */
>  static DEFINE_PER_CPU(struct sched_entity, init_sched_entity);
>  /* Default task group's cfs_rq on each cpu */
>  static DEFINE_PER_CPU(struct cfs_rq, init_cfs_rq) ____cacheline_aligned_in_smp;
> +#endif


I am trying to understand the change this brings to the semantics of RT scheduling.

With this change, /cgroup will be seen as the parent group of all other
groups (say: /cgroup/A, /cgroup/B etc). Is that correct?

If so, the check in __rt_schedulable(), which assumes that all task groups
form a flat hierarchy, needs a change as well.

For example: let's say that init_task_group (/cgroup in this case) had the
default rt_bandwidth of 95% (global_rt_runtime()). A child group under it
(/cgroup/A) is created. If the user tries to assign it an rt-bandwidth of
50%, then AFAICS, it will fail with the current code, whereas it shouldn't
(because by giving /cgroup/A 50% bandwidth, we are not really exceeding
the globally allowed RT bandwidth of 95%, since /cgroup/A is a child of
/cgroup).

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 26+ messages in thread
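
For illustration only, a rough userspace sketch of the hierarchy-aware
admission rule being discussed (this is not the kernel's __rt_schedulable();
structure and names are made up): a group's requested RT runtime is acceptable
as long as, at every level, the children's runtimes fit within their parent's
runtime. Under such a rule, giving /cgroup/A 50% inside /cgroup's 95% is
accepted rather than rejected.

#include <stdio.h>

struct group {
	const char *name;
	long runtime_pct;		/* RT bandwidth as a percent of the period */
	struct group *parent;
	struct group *children[4];
	int nr_children;
};

/* accept iff every group's children fit inside that group's own runtime */
static int hierarchy_schedulable(const struct group *g)
{
	long sum = 0;

	for (int i = 0; i < g->nr_children; i++)
		sum += g->children[i]->runtime_pct;
	if (sum > g->runtime_pct)
		return 0;

	for (int i = 0; i < g->nr_children; i++)
		if (!hierarchy_schedulable(g->children[i]))
			return 0;
	return 1;
}

int main(void)
{
	struct group root = { "/cgroup",   95, NULL,  { NULL }, 0 };
	struct group a    = { "/cgroup/A", 50, &root, { NULL }, 0 };

	root.children[root.nr_children++] = &a;

	/* a flat sum (95 + 50 > 95) would reject this; the hierarchical check accepts it */
	printf("schedulable: %s\n", hierarchy_schedulable(&root) ? "yes" : "no");
	return 0;
}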

end of thread, other threads:[~2008-04-01 12:05 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-03-09 17:08 [RFC/PATCH 00/17] Current sched patch stack Peter Zijlstra
2008-03-09 17:08 ` [RFC/PATCH 01/17] sched: mix tasks and groups Peter Zijlstra
2008-04-01 12:12   ` Srivatsa Vaddagiri
2008-04-01 12:05     ` Peter Zijlstra
2008-03-09 17:08 ` [RFC/PATCH 02/17] sched: allow the CFS group scheduler to have multiple levels Peter Zijlstra
2008-03-12 14:47   ` Dhaval Giani
2008-03-09 17:08 ` [RFC/PATCH 03/17] sched: rt-group: full hierarchy support for the rt groups Peter Zijlstra
2008-03-12  8:25   ` Dhaval Giani
     [not found]     ` <661de9470803120129l66497656o948c13c6c0c26766@mail.gmail.com>
2008-03-12  9:29       ` Peter Zijlstra
2008-03-12 13:34         ` Balbir Singh
2008-03-12 13:39           ` Dhaval Giani
2008-03-12 14:08         ` Balbir Singh
2008-03-09 17:08 ` [RFC/PATCH 04/17] sched: fair-group: SMP-nice for group scheduling Peter Zijlstra
2008-03-09 17:08 ` [RFC/PATCH 05/17] sched: rt-group: optimize dequeue_rt_stack Peter Zijlstra
2008-03-09 17:08 ` [RFC/PATCH 06/17] sched: fair-group scheduling vs latency Peter Zijlstra
2008-03-09 17:08 ` [RFC/PATCH 07/17] sched: fair-group: de-couple load-balancing from the rb-trees Peter Zijlstra
2008-03-09 17:08 ` [RFC/PATCH 08/17] sched: fair-group: single RQ approach Peter Zijlstra
2008-03-09 17:08 ` [RFC/PATCH 09/17] sched: fair: remove moved_group() Peter Zijlstra
2008-03-09 17:09 ` [RFC/PATCH 10/17] sched: remove sysctl_sched_batch_wakeup_granularity Peter Zijlstra
2008-03-09 17:09 ` [RFC/PATCH 11/17] sched: fair: optimize sched_slice() Peter Zijlstra
2008-03-09 17:09 ` [RFC/PATCH 12/17] sched: fair: cfs_root_rq Peter Zijlstra
2008-03-09 17:09 ` [RFC/PATCH 13/17] sched: fair: calc_delta_weight() Peter Zijlstra
2008-03-09 17:09 ` [RFC/PATCH 14/17] sched: fair: avg_vruntime Peter Zijlstra
2008-03-09 17:09 ` [RFC/PATCH 15/17] sched: fair-group: EEVDF scheduling Peter Zijlstra
2008-03-09 17:09 ` [RFC/PATCH 16/17] sched: fair: bound tardiness Peter Zijlstra
2008-03-09 17:09 ` [RFC/PATCH 17/17] sched: fair: optimize the eligibility test Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).