LKML Archive on lore.kernel.org
* [RFC] Fair-user scheduler
@ 2007-01-26  6:01 Srivatsa Vaddagiri
  2007-01-26  6:03 ` [PATCH 1/2] core scheduler changes Srivatsa Vaddagiri
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Srivatsa Vaddagiri @ 2007-01-26  6:01 UTC (permalink / raw)
  To: riel, Ingo Molnar, Nick Piggin, Andrew Morton; +Cc: linux-kernel

The current Linux CPU scheduler doesn't recognize process aggregates while
allocating bandwidth. As a result, a user could simply spawn a large
number of processes and get more bandwidth than others.

Here's a patch that provides fair allocation for all users in a system.

Some benchmark numbers with and without the patch applied follow:


		 	user "vatsa"		    user "guest"
		    (make -s -j4 bzImage)      (make -s -j20 bzImage)

2.6.20-rc5		472.07s (real)		   257.48s (real)
2.6.20-rc5+fairsched	766.74s (real)		   766.73s (real)


(Numbers taken on a 2way Intel x86_64 box)

Eventually something like this can be extended to do weighted fair share
scheduling for:

	- KVM
	- containers
	- resource management

Salient features of the patch:

	- Based on Ingo's RTLIMIT_RT_CPU patch [1]. The primary difference between
	  the RTLIMIT_RT_CPU patch and this one is that this patch handles
	  starvation of lower-priority tasks in a group and that accounting
	  is token-based (rather than a decaying average).

	- Retains existing one-runqueue-per-cpu design

	- breaks O(1) (ouch!)
		The best way to avoid this is to split the runqueue to be per-user
		and per-cpu, which I have not implemented to keep the patch simple.

	- Fairsched aware SMP load balance NOT addressed (yet)

Comments/flames welcome!


References:

1. http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc2/2.6.11-rc2-mm2/broken-out/rlimit_rt_cpu.patch

-- 
Regards,
vatsa


* [PATCH 1/2] core scheduler changes
  2007-01-26  6:01 [RFC] Fair-user scheduler Srivatsa Vaddagiri
@ 2007-01-26  6:03 ` Srivatsa Vaddagiri
  2007-01-31 15:01   ` Srivatsa Vaddagiri
  2007-01-26  6:05 ` [PATCH 2/2] Track number of users in the system Srivatsa Vaddagiri
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Srivatsa Vaddagiri @ 2007-01-26  6:03 UTC (permalink / raw)
  To: riel, Ingo Molnar, Nick Piggin, Andrew Morton; +Cc: linux-kernel


This patch does several things:

	- Introduces the notion of a control window (currently set at 1
	  sec - ideally the window size should be adjusted based on the
	  number of users to avoid rapid context switches). Bandwidth of each
	  user is controlled within this window.  rq->last_update tracks where
	  we are in the current window.

	- Modifies scheduler_tick() to account cpu bandwidth consumption
	  by a task group. Basically bandwidth consumed by a task is
	  charged to itself (p->time_slice) -and- to its group as well.

	- A task is forced off the CPU once its group has expired the
	  bandwidth in the current control window. Such a task is also
	  marked as "starving".

	- schedule() avoids picking tasks whose group has expired its
	  bandwidth in the current control window. Any task (with non-zero
	  p->time_slice) which is not picked to run in schedule() for this
	  reason is marked "starving".
	
	- If a group has bandwidth left and it has starving tasks, then 
	  schedule() prefers picking such tasks over non-starving tasks.
	  This will avoid starvation of lower-priority tasks in a group.
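
For readers who want the mechanism at a glance before wading into the diff,
here is a rough user-space model of the per-user token bucket that the hunks
below implement. The HZ value, the main() driver and the absence of any
locking are illustrative assumptions only, not part of the patch:

/* Toy model of the per-user, per-cpu token bucket (sketch only). */
#include <stdio.h>

#define HZ      1000            /* assumed tick rate */
#define WINDOW  HZ              /* control window = 1 sec in this patch */

struct cpu_usage {
	long tokens;                 /* ticks left in the current window */
	unsigned long last_update;   /* window this bucket was last refilled in */
};

/* inc_cpu_usage(): every scheduler tick charges one token to the group. */
static void charge_tick(struct cpu_usage *cu)
{
	cu->tokens--;
}

/* task_over_cpu_limit(): refill once per new window, then test the bucket. */
static int over_limit(struct cpu_usage *cu, int cpu_limit,
		      unsigned long window_start)
{
	if (cu->last_update != window_start) {
		cu->tokens += cpu_limit * WINDOW / 100;	/* replenish */
		cu->last_update = window_start;
	}
	return cu->tokens <= 0;
}

int main(void)
{
	struct cpu_usage cu = { .tokens = 1, .last_update = 0 };
	unsigned long window_start = 1;	/* pretend jiffies entered a new window */
	int cpu_limit = 50;		/* e.g. two non-root users -> 50% each */
	long used = 0;

	while (!over_limit(&cu, cpu_limit, window_start)) {
		charge_tick(&cu);
		used++;
	}
	printf("group ran for %ld of %d ticks in this window\n", used, WINDOW);
	return 0;
}

Once the bucket is empty, the patch below parks the group's tasks on the
expired array (marking them "starving") until the window rolls over and the
tokens are replenished.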


Signed-off-by : Srivatsa Vaddagiri <vatsa@in.ibm.com>

---


diff -puN include/linux/sched.h~cpu-controller-based-on-rtlimit_rt_cpu-patch include/linux/sched.h
--- linux-2.6.20-rc5/include/linux/sched.h~cpu-controller-based-on-rtlimit_rt_cpu-patch	2007-01-19 15:17:27.000000000 +0530
+++ linux-2.6.20-rc5-vatsa/include/linux/sched.h	2007-01-26 09:04:07.000000000 +0530
@@ -531,6 +531,10 @@ struct signal_struct {
 #define is_rt_policy(p)		((p) != SCHED_NORMAL && (p) != SCHED_BATCH)
 #define has_rt_policy(p)	unlikely(is_rt_policy((p)->policy))
 
+#ifdef CONFIG_FAIRSCHED
+struct cpu_usage;
+#endif
+
 /*
  * Some day this will be a full-fledged user tracking system..
  */
@@ -555,6 +559,10 @@ struct user_struct {
 	/* Hash table maintenance information */
 	struct list_head uidhash_list;
 	uid_t uid;
+#ifdef CONFIG_FAIRSCHED
+	int cpu_limit;
+	struct cpu_usage *cpu_usage;
+#endif
 };
 
 extern struct user_struct *find_user(uid_t);
@@ -1137,6 +1145,9 @@ static inline void put_task_struct(struc
 					/* Not implemented yet, only for 486*/
 #define PF_STARTING	0x00000002	/* being created */
 #define PF_EXITING	0x00000004	/* getting shut down */
+#ifdef CONFIG_FAIRSCHED
+#define PF_STARVING	0x00000010      /* Task starving for CPU */
+#endif
 #define PF_FORKNOEXEC	0x00000040	/* forked but didn't exec */
 #define PF_SUPERPRIV	0x00000100	/* used super-user privileges */
 #define PF_DUMPCORE	0x00000200	/* dumped core */
diff -puN kernel/sched.c~cpu-controller-based-on-rtlimit_rt_cpu-patch kernel/sched.c
--- linux-2.6.20-rc5/kernel/sched.c~cpu-controller-based-on-rtlimit_rt_cpu-patch	2007-01-19 15:17:27.000000000 +0530
+++ linux-2.6.20-rc5-vatsa/kernel/sched.c	2007-01-26 09:04:07.000000000 +0530
@@ -266,6 +266,9 @@ struct rq {
 	unsigned long ttwu_local;
 #endif
 	struct lock_class_key rq_lock_key;
+#ifdef CONFIG_FAIRSCHED
+	unsigned long last_update;
+#endif
 };
 
 static DEFINE_PER_CPU(struct rq, runqueues);
@@ -710,6 +713,126 @@ enqueue_task_head(struct task_struct *p,
 	p->array = array;
 }
 
+#ifdef CONFIG_FAIRSCHED
+
+struct cpu_usage {
+       long tokens;
+       unsigned long last_update;
+       int starve_count;
+};
+
+#define task_starving(p)	(p->flags & PF_STARVING)
+
+/* Mark a task starving - either we shortcircuited its timeslice or we didnt
+ * pick it to run (because user ran out of bandwidth limit in current epoch).
+ */
+static inline void set_tsk_starving(struct task_struct *p)
+{
+	struct user_struct *user = p->user;
+	struct cpu_usage *cu;
+
+	if (task_starving(p) || !user->cpu_limit)
+		return;
+
+	cu = per_cpu_ptr(user->cpu_usage, task_cpu(p));
+	cu->starve_count++;
+	p->flags |= PF_STARVING;
+}
+
+/* Clear a task's starving flag */
+static inline void clear_tsk_starving(struct task_struct *p)
+{
+	struct user_struct *user = p->user;
+	struct cpu_usage *cu;
+
+	if (!task_starving(p) || !user->cpu_limit)
+		return;
+
+	cu = per_cpu_ptr(user->cpu_usage, task_cpu(p));
+	cu->starve_count--;
+	p->flags &= ~PF_STARVING;
+}
+
+/* Does the task's group have starving tasks? */
+static inline int is_user_starving(struct task_struct *p)
+{
+	struct user_struct *user = p->user;
+	struct cpu_usage *cu;
+
+	if (!user->cpu_limit)
+		return 0;
+
+	cu = per_cpu_ptr(user->cpu_usage, task_cpu(p));
+	if (cu->starve_count)
+		return 1;
+
+	return 0;
+}
+
+/* Are we past the 1-sec control window? If so, all groups get to renew their
+ * expired tokens.
+ */
+static inline void adjust_control_window(void)
+{
+	struct rq *rq = this_rq();
+	unsigned long delta;
+
+	delta = jiffies - rq->last_update;
+	if (delta >= HZ)
+		rq->last_update += (delta/HZ) * HZ;
+}
+
+/* Account group's cpu usage */
+static inline void inc_cpu_usage(struct task_struct *p)
+{
+	struct user_struct *user = p->user;
+	struct cpu_usage *cu;
+
+	adjust_control_window();
+
+	if (!user->cpu_limit)
+		return;
+
+	cu = per_cpu_ptr(user->cpu_usage, task_cpu(p));
+	cu->tokens--;
+}
+
+static inline int task_over_cpu_limit(struct task_struct *p)
+{
+	struct rq *rq = task_rq(p);
+	struct user_struct *user = p->user;
+	struct cpu_usage *cu;
+
+	adjust_control_window();
+
+	if (!user->cpu_limit)
+	 	return 0;
+
+	cu = per_cpu_ptr(user->cpu_usage, task_cpu(p));
+	if (cu->last_update != rq->last_update) {
+		/* Replenish tokens */
+		cu->tokens += user->cpu_limit * HZ / 100;
+		cu->last_update = rq->last_update;
+	}
+
+	if (cu->tokens <= 0)
+		return 1;
+
+	return 0;
+}
+
+#else
+
+#define task_starving(p)	0
+
+static void inc_cpu_usage(struct task_struct *p) { }
+static int task_over_cpu_limit(struct task_struct *p) { return 0; }
+static void set_tsk_starving(struct task_struct *p) { }
+static void clear_tsk_starving(struct task_struct *p) { }
+static int is_user_starving(struct task_struct *p) { return 0;}
+
+#endif		/* CONFIG_FAIRSCHED */
+
 /*
  * __normal_prio - return the priority that is based on the static
  * priority but is modified by bonuses/penalties.
@@ -1607,6 +1730,9 @@ void fastcall sched_fork(struct task_str
 	/* Want to start with kernel preemption disabled. */
 	task_thread_info(p)->preempt_count = 1;
 #endif
+#ifdef CONFIG_FAIRSCHED
+	p->flags &= ~PF_STARVING;
+#endif
 	/*
 	 * Share the timeslice between parent and child, thus the
 	 * total amount of pending timeslices in the system doesn't change,
@@ -2074,6 +2200,7 @@ static void pull_task(struct rq *src_rq,
 {
 	dequeue_task(p, src_array);
 	dec_nr_running(p, src_rq);
+	clear_tsk_starving(p);
 	set_task_cpu(p, this_cpu);
 	inc_nr_running(p, this_rq);
 	enqueue_task(p, this_array);
@@ -3137,6 +3264,9 @@ static void task_running_tick(struct rq 
 		return;
 	}
 	spin_lock(&rq->lock);
+
+	inc_cpu_usage(p);
+
 	/*
 	 * The task was running during this tick - update the
 	 * time slice counter. Note: we do not update a thread's
@@ -3163,17 +3293,18 @@ static void task_running_tick(struct rq 
 		dequeue_task(p, rq->active);
 		set_tsk_need_resched(p);
 		p->prio = effective_prio(p);
-		p->time_slice = task_timeslice(p);
 		p->first_time_slice = 0;
 
 		if (!rq->expired_timestamp)
 			rq->expired_timestamp = jiffies;
-		if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
+		if (!TASK_INTERACTIVE(p) || expired_starving(rq)
+						|| task_over_cpu_limit(p)) {
 			enqueue_task(p, rq->expired);
 			if (p->static_prio < rq->best_expired_prio)
 				rq->best_expired_prio = p->static_prio;
 		} else
 			enqueue_task(p, rq->active);
+		goto out_unlock;
 	} else {
 		/*
 		 * Prevent a too long timeslice allowing a task to monopolize
@@ -3200,6 +3331,14 @@ static void task_running_tick(struct rq 
 			set_tsk_need_resched(p);
 		}
 	}
+
+	if (task_over_cpu_limit(p)) {
+		dequeue_task(p, rq->active);
+		set_tsk_need_resched(p);
+		enqueue_task(p, rq->expired);
+		set_tsk_starving(p);
+	}
+
 out_unlock:
 	spin_unlock(&rq->lock);
 }
@@ -3416,7 +3555,7 @@ asmlinkage void __sched schedule(void)
 	struct list_head *queue;
 	unsigned long long now;
 	unsigned long run_time;
-	int cpu, idx, new_prio;
+	int cpu, idx, new_prio, array_switch;
 	long *switch_count;
 	struct rq *rq;
 
@@ -3478,6 +3617,7 @@ need_resched_nonpreemptible:
 		else {
 			if (prev->state == TASK_UNINTERRUPTIBLE)
 				rq->nr_uninterruptible++;
+			clear_tsk_starving(prev);
 			deactivate_task(prev, rq);
 		}
 	}
@@ -3493,11 +3633,15 @@ need_resched_nonpreemptible:
 		}
 	}
 
+	array_switch = 0;
+
+pick_next_task:
 	array = rq->active;
 	if (unlikely(!array->nr_active)) {
 		/*
 		 * Switch the active and expired arrays.
 		 */
+		array_switch++;
 		schedstat_inc(rq, sched_switch);
 		rq->active = rq->expired;
 		rq->expired = array;
@@ -3510,6 +3654,25 @@ need_resched_nonpreemptible:
 	queue = array->queue + idx;
 	next = list_entry(queue->next, struct task_struct, run_list);
 
+	/* If we have done an array switch twice, it means we cant find any
+	 * task which isn't above its limit and hence we just run the
+	 * first task on the active array.
+	 */
+	if (array_switch < 2 && (task_over_cpu_limit(next) ||
+			(!task_starving(next) && is_user_starving(next)))) {
+		dequeue_task(next, rq->active);
+		enqueue_task(next, rq->expired);
+		if (next->time_slice)
+			set_tsk_starving(next);
+		goto pick_next_task;
+	}
+
+	if (task_over_cpu_limit(next))
+		rq->last_update = jiffies;
+	if (!next->time_slice)
+		next->time_slice = task_timeslice(next);
+	clear_tsk_starving(next);
+
 	if (!rt_task(next) && interactive_sleep(next->sleep_type)) {
 		unsigned long long delta = now - next->timestamp;
 		if (unlikely((long long)(now - next->timestamp) < 0))
@@ -5061,6 +5224,7 @@ static int __migrate_task(struct task_st
 	if (!cpu_isset(dest_cpu, p->cpus_allowed))
 		goto out;
 
+	clear_tsk_starving(p);
 	set_task_cpu(p, dest_cpu);
 	if (p->array) {
 		/*
_

-- 
Regards,
vatsa


* [PATCH 2/2] Track number of users in the system
  2007-01-26  6:01 [RFC] Fair-user scheduler Srivatsa Vaddagiri
  2007-01-26  6:03 ` [PATCH 1/2] core scheduler changes Srivatsa Vaddagiri
@ 2007-01-26  6:05 ` Srivatsa Vaddagiri
  2007-01-26 14:09 ` [RFC] Fair-user scheduler Kirill Korotaev
  2007-01-26 18:41 ` Chris Friesen
  3 siblings, 0 replies; 9+ messages in thread
From: Srivatsa Vaddagiri @ 2007-01-26  6:05 UTC (permalink / raw)
  To: riel, Ingo Molnar, Nick Piggin, Andrew Morton; +Cc: linux-kernel

This patch tracks the number of users in a system and divides CPU bandwidth
equally among them.
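
To make the division concrete (illustrative numbers only; the arithmetic is
simply recalc_user_limits() from the patch below, with root treated as
unlimited via cpu_limit = 0):

	1 non-root user    ->  cpu_limit = 100/1 = 100
	2 non-root users   ->  cpu_limit = 100/2 = 50    (half of each control window)
	3 non-root users   ->  cpu_limit = 100/3 = 33    (integer division; 1% left unassigned)
	root               ->  cpu_limit = 0              (no limit)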

Signed-off-by : Srivatsa Vaddagiri <vatsa@in.ibm.com>


---


diff -puN include/linux/sched.h~user-interface include/linux/sched.h
--- linux-2.6.20-rc5/include/linux/sched.h~user-interface	2007-01-25 20:44:53.000000000 +0530
+++ linux-2.6.20-rc5-vatsa/include/linux/sched.h	2007-01-25 20:44:53.000000000 +0530
@@ -533,7 +533,18 @@ struct signal_struct {
 
 #ifdef CONFIG_FAIRSCHED
 struct cpu_usage;
-#endif
+
+int sched_alloc_user(struct user_struct *user);
+void sched_free_user(struct user_struct *user);
+void sched_move_task(void);
+
+#else
+
+static inline int sched_alloc_user(struct user_struct *user) { return 0; }
+static inline void sched_free_user(struct user_struct *user) { }
+static inline void sched_move_task(void) { }
+
+#endif /* CONFIG_FAIRSCHED */
 
 /*
  * Some day this will be a full-fledged user tracking system..
@@ -562,6 +573,7 @@ struct user_struct {
 #ifdef CONFIG_FAIRSCHED
 	int cpu_limit;
 	struct cpu_usage *cpu_usage;
+	struct list_head fair_sched_list;
 #endif
 };
 
diff -puN kernel/user.c~user-interface kernel/user.c
--- linux-2.6.20-rc5/kernel/user.c~user-interface	2007-01-25 20:44:53.000000000 +0530
+++ linux-2.6.20-rc5-vatsa/kernel/user.c	2007-01-25 20:44:53.000000000 +0530
@@ -51,6 +51,10 @@ struct user_struct root_user = {
 	.uid_keyring	= &root_user_keyring,
 	.session_keyring = &root_session_keyring,
 #endif
+#ifdef CONFIG_FAIRSCHED
+	.cpu_limit = 0,		/* No limit */
+	.fair_sched_list = LIST_HEAD_INIT(root_user.fair_sched_list),
+#endif
 };
 
 /*
@@ -112,6 +116,7 @@ void free_uid(struct user_struct *up)
 	if (atomic_dec_and_lock(&up->__count, &uidhash_lock)) {
 		uid_hash_remove(up);
 		spin_unlock_irqrestore(&uidhash_lock, flags);
+		sched_free_user(up);
 		key_put(up->uid_keyring);
 		key_put(up->session_keyring);
 		kmem_cache_free(uid_cachep, up);
@@ -153,6 +158,8 @@ struct user_struct * alloc_uid(uid_t uid
 			return NULL;
 		}
 
+		sched_alloc_user(new);
+
 		/*
 		 * Before adding this, check whether we raced
 		 * on adding the same user already..
@@ -163,6 +170,7 @@ struct user_struct * alloc_uid(uid_t uid
 			key_put(new->uid_keyring);
 			key_put(new->session_keyring);
 			kmem_cache_free(uid_cachep, new);
+			sched_free_user(new);
 		} else {
 			uid_hash_insert(new, hashent);
 			up = new;
@@ -186,6 +194,7 @@ void switch_uid(struct user_struct *new_
 	atomic_inc(&new_user->processes);
 	atomic_dec(&old_user->processes);
 	switch_uid_keyring(new_user);
+	sched_move_task();
 	current->user = new_user;
 
 	/*
diff -puN kernel/sched.c~user-interface kernel/sched.c
--- linux-2.6.20-rc5/kernel/sched.c~user-interface	2007-01-25 20:44:53.000000000 +0530
+++ linux-2.6.20-rc5-vatsa/kernel/sched.c	2007-01-26 09:04:04.000000000 +0530
@@ -7221,3 +7221,63 @@ void set_curr_task(int cpu, struct task_
 }
 
 #endif
+
+#ifdef CONFIG_FAIRSCHED
+
+static struct list_head user_list = LIST_HEAD_INIT(user_list);
+static atomic_t non_root_users;
+
+static void recalc_user_limits(void)
+{
+	int nr_users;
+	struct user_struct *user;
+
+	nr_users = atomic_read(&non_root_users);
+	if (!nr_users)
+		return;
+
+	list_for_each_entry(user, &user_list, fair_sched_list)
+		user->cpu_limit = 100/nr_users;
+}
+
+/* Allocate cpu_usage structure for the new task-group */
+int sched_alloc_user(struct user_struct *user)
+{
+	int i;
+
+	user->cpu_usage = alloc_percpu(struct cpu_usage);
+	if (!user->cpu_usage)
+		return -ENOMEM;
+
+	for_each_possible_cpu(i) {
+		struct cpu_usage *cu;
+
+		cu = per_cpu_ptr(user->cpu_usage, i);
+		cu->tokens = 1;
+		cu->last_update = 0;
+		cu->starve_count = 0;
+	}
+
+	list_add(&user->fair_sched_list, &user_list);
+	atomic_inc(&non_root_users);
+
+	recalc_user_limits();
+
+	return 0;
+}
+
+/* Deallocate cpu_usage structure */
+void sched_free_user(struct user_struct *user)
+{
+	list_del(&user->fair_sched_list);
+	atomic_dec(&non_root_users);
+	recalc_user_limits();
+	free_percpu(user->cpu_usage);
+}
+
+void sched_move_task(void)
+{
+	clear_tsk_starving(current);
+}
+
+#endif
diff -puN init/Kconfig~user-interface init/Kconfig
--- linux-2.6.20-rc5/init/Kconfig~user-interface	2007-01-25 20:44:53.000000000 +0530
+++ linux-2.6.20-rc5-vatsa/init/Kconfig	2007-01-25 20:44:54.000000000 +0530
@@ -249,6 +249,13 @@ config CPUSETS
 
 	  Say N if unsure.
 
+config FAIRSCHED
+	bool "Fair user CPU scheduler"
+	depends on EXPERIMENTAL
+	help
+	  This option lets you allocate equal cpu bandwidth to each
+	  non-root user.
+
 config SYSFS_DEPRECATED
 	bool "Create deprecated sysfs files"
 	default y
_

-- 
Regards,
vatsa


* Re: [RFC] Fair-user scheduler
  2007-01-26  6:01 [RFC] Fair-user scheduler Srivatsa Vaddagiri
  2007-01-26  6:03 ` [PATCH 1/2] core scheduler changes Srivatsa Vaddagiri
  2007-01-26  6:05 ` [PATCH 2/2] Track number of users in the system Srivatsa Vaddagiri
@ 2007-01-26 14:09 ` Kirill Korotaev
  2007-01-26 18:52   ` Eric Piel
  2007-01-26 18:41 ` Chris Friesen
  3 siblings, 1 reply; 9+ messages in thread
From: Kirill Korotaev @ 2007-01-26 14:09 UTC (permalink / raw)
  To: vatsa; +Cc: riel, Ingo Molnar, Nick Piggin, Andrew Morton, linux-kernel

Srivatsa,

> Current Linux CPU scheduler doesnt recognize process aggregates while
> allocating bandwidth. As a result of this, an user could simply spawn large 
> number of processes and get more bandwidth than others.
> 
> Here's a patch that provides fair allocation for all users in a system.
> 
> Some benchmark numbers with and without the patch applied follows:
> 
> 
> 		 	user "vatsa"		    user "guest"
> 		    (make -s -j4 bzImage)      (make -s -j20 bzImage)
> 
> 2.6.20-rc5		472.07s (real)		   257.48s (real)
> 2.6.20-rc5+fairsched	766.74s (real)		   766.73s (real)
1. If I interpret these numbers correctly, then your scheduler is not work-conservative,
i.e. 766.74 + 766.73 >> 472.07 + 257.48
why does it slow down users so much?
2. compilation of kernel is quite CPU-bound task. So it's not that hard to be fair :)
   Can you please try some other applications?
   e.g. pipe-based context switching, java Volano benchmark etc.

Thanks,
Kirill



* Re: [RFC] Fair-user scheduler
  2007-01-26  6:01 [RFC] Fair-user scheduler Srivatsa Vaddagiri
                   ` (2 preceding siblings ...)
  2007-01-26 14:09 ` [RFC] Fair-user scheduler Kirill Korotaev
@ 2007-01-26 18:41 ` Chris Friesen
  2007-01-31 15:16   ` Srivatsa Vaddagiri
  3 siblings, 1 reply; 9+ messages in thread
From: Chris Friesen @ 2007-01-26 18:41 UTC (permalink / raw)
  To: vatsa; +Cc: riel, Ingo Molnar, Nick Piggin, Andrew Morton, linux-kernel

Srivatsa Vaddagiri wrote:
> Current Linux CPU scheduler doesnt recognize process aggregates while
> allocating bandwidth. As a result of this, an user could simply spawn large 
> number of processes and get more bandwidth than others.
> 
> Here's a patch that provides fair allocation for all users in a system.
> 
> Some benchmark numbers with and without the patch applied follows:
> 
> 
> 		 	user "vatsa"		    user "guest"
> 		    (make -s -j4 bzImage)      (make -s -j20 bzImage)
> 
> 2.6.20-rc5		472.07s (real)		   257.48s (real)
> 2.6.20-rc5+fairsched	766.74s (real)		   766.73s (real)

As Kirill brought up, why does it take so much more time?  Are you 
thrashing the cache?

> 	- breaks O(1) (ouch!)
> 		Best way to avoid this is to split runqueue to be per-user and
> 		per-cpu, which I have not implemented to keep the patch simple.

Presumably this would be made generic, as in per-"group" rather than per 
user?

> 	- Fairsched aware SMP load balance NOT addressed (yet)

This is kind of important, no?

Chris


* Re: [RFC] Fair-user scheduler
  2007-01-26 14:09 ` [RFC] Fair-user scheduler Kirill Korotaev
@ 2007-01-26 18:52   ` Eric Piel
  2007-01-31 15:10     ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Piel @ 2007-01-26 18:52 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: vatsa, riel, Ingo Molnar, Nick Piggin, Andrew Morton, linux-kernel

01/26/2007 03:09 PM, Kirill Korotaev wrote:
> Srivatsa,
> 
>> Current Linux CPU scheduler doesnt recognize process aggregates while
>> allocating bandwidth. As a result of this, an user could simply spawn large 
>> number of processes and get more bandwidth than others.
>>
>> Here's a patch that provides fair allocation for all users in a system.
>>
>> Some benchmark numbers with and without the patch applied follows:
>>
>>
>> 		 	user "vatsa"		    user "guest"
>> 		    (make -s -j4 bzImage)      (make -s -j20 bzImage)
>>
>> 2.6.20-rc5		472.07s (real)		   257.48s (real)
>> 2.6.20-rc5+fairsched	766.74s (real)		   766.73s (real)
> 1. If I interpret these numbers correctly, then your scheduler is not work-conservative,
> i.e. 766.74 + 766.73 >> 472.07 + 257.48
> why does it slow down users so much?
You can't measure work-conservation by summing! Everything is run 
_concurrently_. A proof of losing computing power is to show 
"MAX(new_algorithm execution_times) > MAX(old_algorithm 
execution_times)". Anyway... it still seems a lot of power is lost: 
MAX(766,766) >> MAX(472,257).

Actually, I'd be very interested in a "fairness number" and believe so 
far no one has proposed such a thing. It probably needs to take into account 
the loss of CPU power and the variance of execution time between the 
sets of tasks which are supposed to be treated fairly.

> 2. compilation of kernel is quite CPU-bound task. So it's not that hard to be fair :)
>    Can you please try some other applications?
>    e.g. pipe-based context switching, java Volano benchmark etc.
Another worthy benchmark would be :
(make -s -j4 bzImage) vs (nice make -s -j20 bzImage)
                           ^^^^


Regards,
Eric


* Re: [PATCH 1/2] core scheduler changes
  2007-01-26  6:03 ` [PATCH 1/2] core scheduler changes Srivatsa Vaddagiri
@ 2007-01-31 15:01   ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 9+ messages in thread
From: Srivatsa Vaddagiri @ 2007-01-31 15:01 UTC (permalink / raw)
  To: riel, Ingo Molnar, Nick Piggin, Andrew Morton, Eric Piel,
	Kirill Korotaev
  Cc: linux-kernel

On Fri, Jan 26, 2007 at 11:33:17AM +0530, Srivatsa Vaddagiri wrote:
> 
> This patch does several things:
> 
> 	- Introduces the notion of a control window (currently set at 1
> 	  sec - ideally the window size should be adjusted based on the
> 	  number of users to avoid rapid context switches). Bandwidth of each 
> 	  user is controlled within this window.  rq->last_update tracks where 
> 	  we are in the current window.

The patch below makes the control window size configurable (as a macro),
sets the default window size to 10 sec, and also fixes a compile error (for
the !CONFIG_FAIRSCHED case).

Ideally the window size needs to be determined at run time (based on the
number of users). I will address that if there is sufficient interest in a
patch like this.
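
As a concrete illustration (assuming, purely for the sake of example, a user
whose cpu_limit works out to 50): the replenish line below now grants

	50 * WINDOW_SIZE / 100 = 50 * (10*HZ) / 100 = 5*HZ tokens per window,

i.e. 5 CPU-seconds every 10-second window on each cpu, instead of HZ/2 ticks
(0.5 CPU-seconds) every second with the old window - the same long-run share,
but groups are thrown off the cpu an order of magnitude less often.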

Signed-off-by : Srivatsa Vaddagiri <vatsa@in.ibm.com>

---


diff -puN kernel/sched.c~window-fix kernel/sched.c
--- linux-2.6.20-rc5/kernel/sched.c~window-fix	2007-01-31 19:59:48.000000000 +0530
+++ linux-2.6.20-rc5-vatsa/kernel/sched.c	2007-01-31 20:08:57.000000000 +0530
@@ -769,17 +769,24 @@ static inline int is_user_starving(struc
 	return 0;
 }
 
-/* Are we past the 1-sec control window? If so, all groups get to renew their
+#define WINDOW_SIZE	(10*HZ)
+
+/* Are we past the control window? If so, all groups get to renew their
  * expired tokens.
  */
-static inline void adjust_control_window(void)
+static inline void adjust_control_window(int force)
 {
 	struct rq *rq = this_rq();
 	unsigned long delta;
 
+	if (force) {
+		rq->last_update = jiffies;
+		return;
+	}
+
 	delta = jiffies - rq->last_update;
-	if (delta >= HZ)
-		rq->last_update += (delta/HZ) * HZ;
+	if (delta >= WINDOW_SIZE)
+		rq->last_update += (delta/WINDOW_SIZE) * WINDOW_SIZE;
 }
 
 /* Account group's cpu usage */
@@ -788,7 +795,7 @@ static inline void inc_cpu_usage(struct 
 	struct user_struct *user = p->user;
 	struct cpu_usage *cu;
 
-	adjust_control_window();
+	adjust_control_window(0);
 
 	if (!user->cpu_limit)
 		return;
@@ -803,7 +810,7 @@ static inline int task_over_cpu_limit(st
 	struct user_struct *user = p->user;
 	struct cpu_usage *cu;
 
-	adjust_control_window();
+	adjust_control_window(0);
 
 	if (!user->cpu_limit)
 	 	return 0;
@@ -811,7 +818,7 @@ static inline int task_over_cpu_limit(st
 	cu = per_cpu_ptr(user->cpu_usage, task_cpu(p));
 	if (cu->last_update != rq->last_update) {
 		/* Replenish tokens */
-		cu->tokens += user->cpu_limit * HZ / 100;
+		cu->tokens += user->cpu_limit * WINDOW_SIZE / 100;
 		cu->last_update = rq->last_update;
 	}
 
@@ -830,6 +837,7 @@ static int task_over_cpu_limit(struct ta
 static void set_tsk_starving(struct task_struct *p) { }
 static void clear_tsk_starving(struct task_struct *p) { }
 static int is_user_starving(struct task_struct *p) { return 0;}
+static inline void adjust_control_window(int force) { }
 
 #endif		/* CONFIG_FAIRSCHED */
 
@@ -3668,7 +3676,7 @@ pick_next_task:
 	}
 
 	if (task_over_cpu_limit(next))
-		rq->last_update = jiffies;
+		adjust_control_window(1);
 	if (!next->time_slice)
 		next->time_slice = task_timeslice(next);
 	clear_tsk_starving(next);
_


-- 
Regards,
vatsa


* Re: [RFC] Fair-user scheduler
  2007-01-26 18:52   ` Eric Piel
@ 2007-01-31 15:10     ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 9+ messages in thread
From: Srivatsa Vaddagiri @ 2007-01-31 15:10 UTC (permalink / raw)
  To: Eric Piel
  Cc: Kirill Korotaev, riel, Ingo Molnar, Nick Piggin, Andrew Morton,
	linux-kernel

On Fri, Jan 26, 2007 at 07:52:49PM +0100, Eric Piel wrote:
> >>		 	user "vatsa"		    user "guest"
> >>		    (make -s -j4 bzImage)      (make -s -j20 bzImage)
> >>
> >>2.6.20-rc5		472.07s (real)		   257.48s (real)
> >>2.6.20-rc5+fairsched	766.74s (real)		   766.73s (real)
> >
> >1. If I interpret these numbers correctly, then your scheduler is not 
> >work-conservative,
> >i.e. 766.74 + 766.73 >> 472.07 + 257.48
> >why does it slow down users so much?

Ok, I investigated this a bit. The "1-sec" control window was the
killer. I guess it was causing too much thrashing. I increased the
control window to 10 sec [1] and I get decent results now.

NOTE: Since the patches don't support SMP load-balancing currently,
all benchmarks were run on only ONE cpu (using cpuset's cpu_exclusive
feature), and hence the numbers are indicative of UP performance.

First, results of kernel compilation:


-------------------------------------------------------------------------------
					vatsa			 guest
				  (make -j4 bzImage)	  (make -j20 bzImage)
-------------------------------------------------------------------------------

2.6.20-rc5 			       767.16s (real)	       495.98s (real)
2.6.20-rc5 + fairsched (W=10s) 	       765.89s (real)	       764.89s (real)

-------------------------------------------------------------------------------



-------------------------------------------------------------------------------
					vatsa			 guest
			          (make -j4 bzImage)    (nice make -j20 bzImage)
-------------------------------------------------------------------------------

2.6.20-rc5 				767.16s (real)	       636.51s (real)
2.6.20-rc5 + fairsched (W=10s) 		761.19s (real)	       766.51s (real)

-------------------------------------------------------------------------------

Second, results of volanomark benchmark [2].

Both users, vatsa and guest, ran a copy of the volano server on different ports.
Each user then launched the volano client on the respective port, and the
throughput of the client was measured. Ideally we want the throughput to
be the same for both users.


-------------------------------------------------------------------------------
					vatsa			guest	
				(volano server/client)   (volano server/client)
-------------------------------------------------------------------------------

2.6.20-rc5			  400000 msg in 39.495s    400000 msg in 39.528s
				      (10128 msg/sec)	      (10119 msg/sec)

2.6.20-rc5 + fairsched (W=10s)    400000 msg in 39.662s    400000 msg in 39.646s
				      (10085 msg/sec)	      (10089 msg/sec)
				
-------------------------------------------------------------------------------


Here we don't see much difference between the vanilla kernel and the patch
because the number of threads spawned by both users is the same.

Kirill, did you have any other specific volano configuration in mind?

> Actually, I'd be very interested in a "fairness number" and believe so 
> far no one has proposed such a thing. It probably needs to take into account 
> the loss of CPU power and the variance of execution time between the 
> sets of tasks which are supposed to be treated fairly.

Yeah, in my patch I don't account/limit the execution time of root tasks at
all. That needs to be factored in when we distribute leftover cycles to
non-root users (which I don't think I have addressed in my current
patch).

> >2. compilation of kernel is quite CPU-bound task. So it's not that hard to 
> >be fair :)
> >   Can you please try some other applications?
> >   e.g. pipe-based context switching, java Volano benchmark etc.

Kirill, I have provided the volano numbers above. Did you have any specific
volano configuration in mind that should be tested?

Also, do you have a pointer to a ready-made pipe-based benchmark I can use?
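
(For reference, a crude pipe-based context-switch test can be as small as the
sketch below. This is purely illustrative and is not a benchmark used or
referenced anywhere in this thread.)

/* Toy pipe ping-pong: one context switch per hop between parent and child. */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

int main(void)
{
	int p1[2], p2[2], i, iters = 100000;
	char c = 'x';
	struct timeval t0, t1;

	if (pipe(p1) || pipe(p2)) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {			/* child: echo every byte back */
		for (i = 0; i < iters; i++) {
			read(p1[0], &c, 1);
			write(p2[1], &c, 1);
		}
		_exit(0);
	}

	gettimeofday(&t0, NULL);
	for (i = 0; i < iters; i++) {		/* parent: ping, then wait for pong */
		write(p1[1], &c, 1);
		read(p2[0], &c, 1);
	}
	gettimeofday(&t1, NULL);
	wait(NULL);

	printf("%d round trips in %.3f sec\n", iters,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
	return 0;
}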

> Another worthy benchmark would be :
> (make -s -j4 bzImage) vs (nice make -s -j20 bzImage)
>                           ^^^^

I have provided the result of the above benchmark as well.


References:

1. http://lkml.org/lkml/2007/01/31/138

2. Volanomark benchmark configuration details:

java.vendor        = IBM Corporation
java.vendor.url    = http://www.ibm.com/
java.version       = 1.4.2
java.class.version = 48.0
java.compiler      = j9jit22
os.name            = Linux
os.version         = 2.6.20-rc5-1k
os.arch            = amd64

VolanoMark version = 2.5.0.9
Messages sent      = 20000
Messages received  = 380000
Total messages     = 400000





-- 
Regards,
vatsa


* Re: [RFC] Fair-user scheduler
  2007-01-26 18:41 ` Chris Friesen
@ 2007-01-31 15:16   ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 9+ messages in thread
From: Srivatsa Vaddagiri @ 2007-01-31 15:16 UTC (permalink / raw)
  To: Chris Friesen; +Cc: riel, Ingo Molnar, Nick Piggin, Andrew Morton, linux-kernel

On Fri, Jan 26, 2007 at 12:41:27PM -0600, Chris Friesen wrote:
> As Kirill brought up, why does it take so much more time?  Are you 
> thrashing the cache?

Yes, the default 1-sec control window I had was causing a lot of
thrashing. See http://lkml.org/lkml/2007/1/31/142

> Presumably this would be made generic, as in per-"group" rather than per 
> user?

Ideally yes! But I am trying to apply the solution to an existing problem
first, and later extend it for other needs as well.

> >	- Fairsched aware SMP load balance NOT addressed (yet)
> 
> This is kind of important, no?

Yeah, again! I have avoided it for now to keep the patch simple.
Based on interest/response, I can address this later.


-- 
Regards,
vatsa


