LKML Archive on lore.kernel.org
* [PATCH -v2 0/7] New RT Task Balancing -v2
@ 2007-10-23  2:59 Steven Rostedt
  2007-10-23  2:59 ` [PATCH -v2 1/7] Add rt_nr_running accounting Steven Rostedt
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Steven Rostedt @ 2007-10-23  2:59 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Dmitry Adamushko,
	Paul Jackson, Gregory Haskins, Peter Zijlstra

[
  Changes since V1:
    Updated to git tree 55b70a0300b873c0ec7ea6e33752af56f41250ce

    Various clean ups suggested by Gregory Haskins, Dmitry Adamushko,
    and Peter Zijlstra.

    Biggest change was recommended by Ingo Molnar. This is the use of cpusets
    for keeping track of RT overloaded CPUS.  When CONFIG_CPUSETS is not
    defined, we have a single global rt_overload_mask that keeps track
    of the runqueues with more than one RT task queued. When CONFIG_CPUSETS
    is configured, that bitmask is stored in the cpusets.

    Note that in the case of overlapping cpusets it is possible to have
    inconsistent data between the bitmask and actual RT overloaded runqueues.
    The worst that can happen is that a task doesn't get moved quickly
    over to a CPU it can run on.  But this is a price we pay to keep from
    dirtying caches on boxes with a large number of CPUs. If this does happen
    it gets cleaned up rather quickly since there are checks for
    RT overload bits being set when they shouldn't be.

    For most systems this is not an issue since a single cpuset is used.
]

Currently in mainline the balancing of multiple RT threads is quite broken.
That is to say that a high priority thread scheduled on a CPU that is
already running an even higher priority thread may wait unnecessarily,
even though it could easily run on another CPU that's running a lower
priority thread.

Balancing (or migrating) tasks in general is an art. Lots of considerations
must be taken into account: cache lines, NUMA and more. This is true
for general processes, which expect high throughput and whose migration can
be done in batch.  But when it comes to RT tasks, we really need to
move them to a CPU they can run on as soon as possible, even
if it means a bit of cache line flushing.

Right now an RT task can wait several milliseconds before it gets scheduled
to run. And perhaps even longer. The migration thread is not fast enough
to take care of RT tasks.

To demonstrate this, I wrote a simple test.
 
  http://rostedt.homelinux.com/rt/rt-migrate-test.c

  (gcc -o rt-migrate-test rt-migrate-test.c -lpthread)

This test takes a parameter specifying the number of threads to create.
If you add the '-c' option (check), it will terminate if the test fails
one of the iterations. When using this option, pass in one more thread
than the number of CPUs.

For example, on a 4 way box, I used

  rt-migrate-test -c 5

What this test does is create the number of threads specified (in this
case 5). Each thread is set up as an RT FIFO task starting at a specified
prio (default 2), with each subsequent thread one priority higher. So with this
example the 5 threads created are at priorities 2, 3, 4, 5, and 6.

The parent thread sets its priority to one higher than the highest of
the children (7 in this example). It uses pthread_barrier_wait to synchronize
the threads.  Then it takes a time stamp and starts all the threads.
When woken up, each thread takes a time stamp and compares it to the parent's
to see how long it took to be awoken. It then runs for an
interval (20ms default) in a busy loop. The busy loop ends when it reaches
the interval delta from its own start time stamp. So if it is preempted, it
may not actually run for the full interval. This is expected behavior
of the test.
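
For reference, here is a rough sketch of the per-thread logic just described.
This is illustrative only: names such as start_barrier, parent_stamp,
now_usecs() and INTERVAL_US are made up for the sketch, and the authoritative
code is rt-migrate-test.c itself.

/* Illustrative sketch only -- the real code is in rt-migrate-test.c. */
static void *run_thread(void *arg)
{
	struct thread_data *td = arg;	/* hypothetical per-thread record */
	long long start, now;

	/* parent and children all meet here; the parent records parent_stamp */
	pthread_barrier_wait(&start_barrier);

	start = now_usecs();		/* assumed wrapper around clock_gettime() */
	td->wakeup_delta = start - parent_stamp;

	/*
	 * Busy loop until INTERVAL_US has passed since this thread's own
	 * start stamp.  A thread that was preempted after starting will
	 * therefore not run for the full interval -- expected behavior.
	 */
	do {
		td->loops++;
		now = now_usecs();
	} while (now - start < INTERVAL_US);

	td->len = now - parent_stamp;	/* the delta reported on the "len:" line */
	return NULL;
}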

The numbers recorded are: the delta between the thread's wake-up time stamp
and the parent's time stamp, the number of iterations the busy loop ran for,
and the delta between a thread time stamp taken at the end of the loop and
the parent's time stamp.

Sometimes a lower priority task might wake up before a higher priority one,
but this is OK, as long as the higher priority process gets the CPU when
it is awoken.

At the end of the test, the iteration data is printed to stdout. If a
higher priority task had to wait for a lower one to finish running, then
this is considered a failure. Here's an example of the output from
a run against git commit 4fa4d23fa20de67df919030c1216295664866ad7.

   1:       36      33   20041      39      33
 len:    20036   20033   40041   20039   20033
 loops: 167789  167693  227167  167829  167814

On iteration 1 (iterations start at 0), the third task started 20ms after the parent
woke it up. We can see here that the first two tasks ran to completion
before the higher priority task was even able to start. That is a
20ms latency for the higher priority task!!!

So people who think that their audio would avoid most latencies by upping
the priority may be in for a surprise, since some kernel threads (like
the migration thread itself) may introduce this latency.

To solve this issue, I've changed the RT task balancing from a passive
method (migration thread) to an active method.  This new method is
to actively push or pull RT tasks when they are woken up or scheduled.

On wake up of an RT task, if there's already an RT task
of higher priority running on its runqueue, we initiate the push_rt_tasks
algorithm. This algorithm looks at the highest non-running RT task
and tries to find a CPU it can run on. It only migrates the RT
task if it finds a CPU (the one with the lowest priority) where the RT task
can run and can preempt the currently running task on that CPU.
We continue pushing RT tasks until we can't push any more.

If an RT task fails to be migrated, we stop the pushing. This is reasonable
because we are always looking at the highest priority RT task on the
run queue, and if it can't migrate, then most likely the lower priority RT
tasks can not either.
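
In condensed form, the push loop described above looks roughly like the
following. This is a simplified sketch only; the real implementation in
patch 3/7 also deals with locking, retries and races.

/* Simplified sketch -- see push_rt_task()/push_rt_tasks() in patch 3/7. */
static void push_rt_tasks_sketch(struct rq *rq)
{
	struct task_struct *p;
	struct rq *lowest_rq;

	for (;;) {
		/* highest prio RT task that is queued but not running */
		p = pick_next_highest_task_rt(rq);
		if (!p)
			break;

		/* lowest prio rq, within p's affinity, that p can preempt */
		lowest_rq = find_lock_lowest_rq(p, rq);
		if (!lowest_rq)
			break;		/* nothing preemptible: stop pushing */

		deactivate_task(rq, p, 0);
		set_task_cpu(p, lowest_rq->cpu);
		activate_task(lowest_rq, p, 0);
		resched_task(lowest_rq->curr);
		spin_unlock(&lowest_rq->lock);
	}
}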

There is one case that is not covered by this patch set: when the highest
priority non-running RT task has a CPU affinity such that it can not
preempt any of the tasks currently running on the CPUs in its affinity
mask, while a lower priority task has a wider affinity and could run on
other CPUs. In this case the lower priority task will not be pushed to
those CPUs (although those CPUs may pull that task).
Currently this patch set ignores this scenario.

Another case where we push RT tasks is in finish_task_switch. This is
done because a running RT task can not be migrated while it is running.
So if an RT task is preempted by a higher priority RT task, we can
migrate the preempted RT task at that moment.

We also actively pull RT tasks. Whenever a runqueue is about to lower its
priority (schedule a lower priority task), a check is done to see if that
runqueue can pull RT tasks to run instead. All overloaded runqueues
(runqueues with more than one RT task queued on them) are searched for an
RT task that can run on the target runqueue (affinity matches) and is of
higher priority than the task the runqueue is about to schedule. The pull
algorithm pulls all RT tasks that match this case.
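
Roughly, the check at schedule time boils down to the following. Again,
this is just a sketch; the real code is schedule_balance_rt() and
pull_rt_task() in patch 5/7.

/* Simplified sketch -- see schedule_balance_rt()/pull_rt_task() in patch 5/7. */
static void schedule_balance_rt_sketch(struct rq *rq, struct task_struct *prev)
{
	/* only pull when this rq is about to lower its RT priority */
	if (!rt_task(prev) || rq->rt.highest_prio <= prev->prio)
		return;

	/*
	 * Walk the RT overloaded runqueues and pull over any queued RT
	 * task that can run here and has higher priority than what this
	 * runqueue is about to schedule.
	 */
	pull_rt_task(rq);
}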

With this patch set, I ran the rt-migrate-test overnight in a while
loop with the -c option (which terminates upon failure) and it passed
over 6500 runs (each doing 50 wakeups).

Here's an example of the output from the patched kernel. This is just to
explain it a bit more.

   1:    20060      61      55      56      61
 len:    40060   20061   20055   20056   20061
 loops: 227255  146050  168104  145965  168144

   2:       40      46      31      35      42
 len:    20040   20046   20031   20035   20042
 loops:     28  167769  167781  167668  167804

The first iteration (really the 2nd, since we start at zero) is a typical run.
The lowest prio task didn't start executing until the other 4 tasks finished
and gave up the CPU.

The second iteration seems at first like a failure, but it is actually fine.
The lowest priority task just happened to schedule onto a CPU before the
higher priority tasks were woken up. But as you can see from this example,
the higher priority tasks were still able to get scheduled right away and
in doing so preempted the lower priority task. This can be seen from the
number of loops that the lower priority task was able to complete: only 28.
This is because the busy loop terminates when the time stamp reaches the
time interval (20ms here) from the start time stamp. Since the lower priority
task was able to sneak in and start, its time stamp was low. So after it
got preempted and rescheduled, it was already past the run time interval,
so it simply ended the loop.

Finally, the CFS RT balancing had to be removed in order for this to work.
Testing showed that the CFS RT balancing would actually pull RT tasks away
from the runqueues they had already been properly assigned to, and again
cause latencies.  With this new approach, the CFS RT balancing is not needed,
and I suggest that these patches replace the current CFS RT balancing.

Also, let me stress that I made a great attempt to have this add
as little overhead as possible (practically none) to the non-RT cases. Most
of these algorithms only take effect when more than one RT task is queued
on the same runqueue.

Special thanks go to Gregory Haskins for his advice and the back and forth
on IRC with ideas. Although I didn't use his actual patches (his were
against -rt), they did help me clean up some of my code.  Thanks also go to
Ingo Molnar, as I took some ideas from his RT balance code in
the -rt patch.


Comments welcomed!

-- Steve



* [PATCH -v2 1/7] Add rt_nr_running accounting
  2007-10-23  2:59 [PATCH -v2 0/7] New RT Task Balancing -v2 Steven Rostedt
@ 2007-10-23  2:59 ` Steven Rostedt
  2007-10-23  2:59 ` [PATCH -v2 2/7] track highest prio queued on runqueue Steven Rostedt
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2007-10-23  2:59 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Dmitry Adamushko,
	Paul Jackson, Gregory Haskins, Peter Zijlstra

[-- Attachment #1: count-rt-queued.patch --]
[-- Type: text/plain, Size: 1869 bytes --]

This patch adds accounting to keep track of the number of RT tasks running
on a runqueue. This information will be used in later patches.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

---
 kernel/sched.c    |    1 +
 kernel/sched_rt.c |   17 +++++++++++++++++
 2 files changed, 18 insertions(+)

Index: linux-test.git/kernel/sched.c
===================================================================
--- linux-test.git.orig/kernel/sched.c	2007-10-22 22:18:53.000000000 -0400
+++ linux-test.git/kernel/sched.c	2007-10-22 22:31:51.000000000 -0400
@@ -266,6 +266,7 @@ struct rt_rq {
 	struct rt_prio_array active;
 	int rt_load_balance_idx;
 	struct list_head *rt_load_balance_head, *rt_load_balance_curr;
+	unsigned long rt_nr_running;
 };
 
 /*
Index: linux-test.git/kernel/sched_rt.c
===================================================================
--- linux-test.git.orig/kernel/sched_rt.c	2007-10-22 22:18:53.000000000 -0400
+++ linux-test.git/kernel/sched_rt.c	2007-10-22 22:31:51.000000000 -0400
@@ -25,12 +25,27 @@ static void update_curr_rt(struct rq *rq
 	curr->se.exec_start = rq->clock;
 }
 
+static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
+{
+	WARN_ON(!rt_task(p));
+	rq->rt.rt_nr_running++;
+}
+
+static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
+{
+	WARN_ON(!rt_task(p));
+	WARN_ON(!rq->rt.rt_nr_running);
+	rq->rt.rt_nr_running--;
+}
+
 static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
 {
 	struct rt_prio_array *array = &rq->rt.active;
 
 	list_add_tail(&p->run_list, array->queue + p->prio);
 	__set_bit(p->prio, array->bitmap);
+
+	inc_rt_tasks(p, rq);
 }
 
 /*
@@ -45,6 +60,8 @@ static void dequeue_task_rt(struct rq *r
 	list_del(&p->run_list);
 	if (list_empty(array->queue + p->prio))
 		__clear_bit(p->prio, array->bitmap);
+
+	dec_rt_tasks(p, rq);
 }
 
 /*

-- 


* [PATCH -v2 2/7] track highest prio queued on runqueue
  2007-10-23  2:59 [PATCH -v2 0/7] New RT Task Balancing -v2 Steven Rostedt
  2007-10-23  2:59 ` [PATCH -v2 1/7] Add rt_nr_running accounting Steven Rostedt
@ 2007-10-23  2:59 ` Steven Rostedt
  2007-10-23  2:59 ` [PATCH -v2 3/7] push RT tasks Steven Rostedt
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2007-10-23  2:59 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Dmitry Adamushko,
	Paul Jackson, Gregory Haskins, Peter Zijlstra

[-- Attachment #1: add-rq-highest-prio.patch --]
[-- Type: text/plain, Size: 2357 bytes --]

This patch adds accounting to each runqueue to keep track of the
highest prio task queued on the run queue. We only care about
RT tasks, so if the run queue does not contain any active RT tasks
its priority will be considered MAX_RT_PRIO.

This information will be used for later patches.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

---
 kernel/sched.c    |    3 +++
 kernel/sched_rt.c |   18 ++++++++++++++++++
 2 files changed, 21 insertions(+)

Index: linux-test.git/kernel/sched.c
===================================================================
--- linux-test.git.orig/kernel/sched.c	2007-10-22 22:31:51.000000000 -0400
+++ linux-test.git/kernel/sched.c	2007-10-22 22:31:55.000000000 -0400
@@ -267,6 +267,8 @@ struct rt_rq {
 	int rt_load_balance_idx;
 	struct list_head *rt_load_balance_head, *rt_load_balance_curr;
 	unsigned long rt_nr_running;
+	/* highest queued rt task prio */
+	int highest_prio;
 };
 
 /*
@@ -6725,6 +6727,7 @@ void __init sched_init(void)
 		rq->cpu = i;
 		rq->migration_thread = NULL;
 		INIT_LIST_HEAD(&rq->migration_queue);
+		rq->rt.highest_prio = MAX_RT_PRIO;
 #endif
 		atomic_set(&rq->nr_iowait, 0);
 
Index: linux-test.git/kernel/sched_rt.c
===================================================================
--- linux-test.git.orig/kernel/sched_rt.c	2007-10-22 22:31:51.000000000 -0400
+++ linux-test.git/kernel/sched_rt.c	2007-10-22 22:31:55.000000000 -0400
@@ -29,6 +29,10 @@ static inline void inc_rt_tasks(struct t
 {
 	WARN_ON(!rt_task(p));
 	rq->rt.rt_nr_running++;
+#ifdef CONFIG_SMP
+	if (p->prio < rq->rt.highest_prio)
+		rq->rt.highest_prio = p->prio;
+#endif /* CONFIG_SMP */
 }
 
 static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
@@ -36,6 +40,20 @@ static inline void dec_rt_tasks(struct t
 	WARN_ON(!rt_task(p));
 	WARN_ON(!rq->rt.rt_nr_running);
 	rq->rt.rt_nr_running--;
+#ifdef CONFIG_SMP
+	if (rq->rt.rt_nr_running) {
+		struct rt_prio_array *array;
+
+		WARN_ON(p->prio < rq->rt.highest_prio);
+		if (p->prio == rq->rt.highest_prio) {
+			/* recalculate */
+			array = &rq->rt.active;
+			rq->rt.highest_prio =
+				sched_find_first_bit(array->bitmap);
+		} /* otherwise leave rq->highest prio alone */
+	} else
+		rq->rt.highest_prio = MAX_RT_PRIO;
+#endif /* CONFIG_SMP */
 }
 
 static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)

-- 


* [PATCH -v2 3/7] push RT tasks
  2007-10-23  2:59 [PATCH -v2 0/7] New RT Task Balancing -v2 Steven Rostedt
  2007-10-23  2:59 ` [PATCH -v2 1/7] Add rt_nr_running accounting Steven Rostedt
  2007-10-23  2:59 ` [PATCH -v2 2/7] track highest prio queued on runqueue Steven Rostedt
@ 2007-10-23  2:59 ` Steven Rostedt
  2007-10-23  2:59 ` [PATCH -v2 4/7] RT overloaded runqueues accounting Steven Rostedt
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2007-10-23  2:59 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Dmitry Adamushko,
	Paul Jackson, Gregory Haskins, Peter Zijlstra

[-- Attachment #1: rt-balance-push-tasks.patch --]
[-- Type: text/plain, Size: 9189 bytes --]

This patch adds an algorithm to push extra RT tasks off a run queue to
other CPU runqueues.

When more than one RT task is added to a run queue, this algorithm takes
an assertive approach to push the RT tasks that are not running onto other
run queues that have lower priority.  The way this works is that the highest
RT task that is not running is picked and we examine the runqueues of the
CPUs in that task's affinity mask. We find the runqueue with the lowest
prio in the CPU affinity of the picked task, and if it is lower in prio than
the picked task, we push the task onto that CPU's runqueue.

We continue pushing RT tasks off the current runqueue until we can't push any
more.  The algorithm stops when the next highest RT task can't preempt any
other process on another CPU.

TODO: The algorithm may stop when there are still RT tasks that could be
 migrated. Specifically, if the highest non-running RT task's CPU affinity
 is restricted to CPUs that are running higher priority tasks, there may
 be a lower priority task queued whose affinity includes a CPU that is
 running a lower priority task that it could be migrated to.  This
 patch set does not address this issue.

Note: checkpatch reveals two over 80 character instances. I'm not sure
 that breaking them up will help visually, so I left them as is.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

---
 kernel/sched.c    |    8 +
 kernel/sched_rt.c |  229 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 236 insertions(+), 1 deletion(-)

Index: linux-test.git/kernel/sched.c
===================================================================
--- linux-test.git.orig/kernel/sched.c	2007-10-22 22:31:55.000000000 -0400
+++ linux-test.git/kernel/sched.c	2007-10-22 22:31:59.000000000 -0400
@@ -1860,6 +1860,8 @@ static void finish_task_switch(struct rq
 	prev_state = prev->state;
 	finish_arch_switch(prev);
 	finish_lock_switch(rq, prev);
+	schedule_tail_balance_rt(rq);
+
 	fire_sched_in_preempt_notifiers(current);
 	if (mm)
 		mmdrop(mm);
@@ -2093,11 +2095,13 @@ static void double_rq_unlock(struct rq *
 /*
  * double_lock_balance - lock the busiest runqueue, this_rq is locked already.
  */
-static void double_lock_balance(struct rq *this_rq, struct rq *busiest)
+static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 	__releases(this_rq->lock)
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
+	int ret = 0;
+
 	if (unlikely(!irqs_disabled())) {
 		/* printk() doesn't work good under rq->lock */
 		spin_unlock(&this_rq->lock);
@@ -2108,9 +2112,11 @@ static void double_lock_balance(struct r
 			spin_unlock(&this_rq->lock);
 			spin_lock(&busiest->lock);
 			spin_lock(&this_rq->lock);
+			ret = 1;
 		} else
 			spin_lock(&busiest->lock);
 	}
+	return ret;
 }
 
 /*
Index: linux-test.git/kernel/sched_rt.c
===================================================================
--- linux-test.git.orig/kernel/sched_rt.c	2007-10-22 22:31:55.000000000 -0400
+++ linux-test.git/kernel/sched_rt.c	2007-10-22 22:31:59.000000000 -0400
@@ -132,6 +132,235 @@ static void put_prev_task_rt(struct rq *
 	update_curr_rt(rq);
 	p->se.exec_start = 0;
 }
+#ifdef CONFIG_SMP
+/* Only try algorithms three times */
+#define RT_MAX_TRIES 3
+
+static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
+static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep);
+
+/* Return the second highest RT task, NULL otherwise */
+static struct task_struct *pick_next_highest_task_rt(struct rq *rq)
+{
+	struct rt_prio_array *array = &rq->rt.active;
+	struct task_struct *next;
+	struct list_head *queue;
+	int idx;
+
+	assert_spin_locked(&rq->lock);
+
+	if (likely(rq->rt.rt_nr_running < 2))
+		return NULL;
+
+	idx = sched_find_first_bit(array->bitmap);
+	if (unlikely(idx >= MAX_RT_PRIO)) {
+		WARN_ON(1); /* rt_nr_running is bad */
+		return NULL;
+	}
+
+	queue = array->queue + idx;
+	next = list_entry(queue->next, struct task_struct, run_list);
+	if (unlikely(next != rq->curr))
+		return next;
+
+	if (queue->next->next != queue) {
+		/* same prio task */
+		next = list_entry(queue->next->next, struct task_struct, run_list);
+		return next;
+	}
+
+	/* slower, but more flexible */
+	idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
+	if (unlikely(idx >= MAX_RT_PRIO)) {
+		WARN_ON(1); /* rt_nr_running was 2 and above! */
+		return NULL;
+	}
+
+	queue = array->queue + idx;
+	next = list_entry(queue->next, struct task_struct, run_list);
+
+	return next;
+}
+
+/* Will lock the rq it finds */
+static struct rq *find_lock_lowest_rq(struct task_struct *task,
+				      struct rq *this_rq)
+{
+	struct rq *lowest_rq = NULL;
+	cpumask_t cpu_mask;
+	int dst_cpu = -1;
+	int cpu;
+	int tries;
+
+	cpus_and(cpu_mask, cpu_online_map, task->cpus_allowed);
+
+	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
+		/*
+		 * Scan each rq for the lowest prio.
+		 */
+		for_each_cpu_mask(cpu, cpu_mask) {
+			struct rq *rq = &per_cpu(runqueues, cpu);
+
+			if (cpu == this_rq->cpu)
+				continue;
+
+			/* We look for lowest RT prio or non-rt CPU */
+			if (rq->rt.highest_prio >= MAX_RT_PRIO) {
+				lowest_rq = rq;
+				dst_cpu = cpu;
+				break;
+			}
+
+			/* no locking for now */
+			if (rq->rt.highest_prio > task->prio &&
+			    (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
+				dst_cpu = cpu;
+				lowest_rq = rq;
+			}
+		}
+
+		if (!lowest_rq)
+			break;
+
+		/* if the prio of this runqueue changed, try again */
+		if (double_lock_balance(this_rq, lowest_rq)) {
+			/*
+			 * We had to unlock the run queue. In
+			 * the mean time, task could have
+			 * migrated already or had its affinity changed.
+			 * Also make sure that it wasn't scheduled on its rq.
+			 */
+			if (unlikely(task_rq(task) != this_rq ||
+				     !cpu_isset(dst_cpu, task->cpus_allowed) ||
+				     task_running(this_rq, task) ||
+				     !task->se.on_rq)) {
+				spin_unlock(&lowest_rq->lock);
+				lowest_rq = NULL;
+				break;
+			}
+		}
+
+		/* If this rq is still suitable use it. */
+		if (lowest_rq->rt.highest_prio > task->prio)
+			break;
+
+		/* try again */
+		spin_unlock(&lowest_rq->lock);
+		lowest_rq = NULL;
+	}
+
+	return lowest_rq;
+}
+
+/*
+ * If the current CPU has more than one RT task, see if the non
+ * running task can migrate over to a CPU that is running a task
+ * of lesser priority.
+ */
+static int push_rt_task(struct rq *this_rq)
+{
+	struct task_struct *next_task;
+	struct rq *lowest_rq;
+	int dst_cpu;
+	int ret = 0;
+	int paranoid = RT_MAX_TRIES;
+
+	assert_spin_locked(&this_rq->lock);
+
+	next_task = pick_next_highest_task_rt(this_rq);
+	if (!next_task)
+		return 0;
+
+ retry:
+	if (unlikely(next_task == this_rq->curr))
+		return 0;
+
+	/*
+	 * It's possible that the next_task slipped in of
+	 * higher priority than current. If that's the case
+	 * just reschedule current.
+	 */
+	if (unlikely(next_task->prio < this_rq->curr->prio)) {
+		resched_task(this_rq->curr);
+		return 0;
+	}
+
+	/* We might release this_rq lock */
+	get_task_struct(next_task);
+
+	/* find_lock_lowest_rq locks the rq if found */
+	lowest_rq = find_lock_lowest_rq(next_task, this_rq);
+	if (!lowest_rq) {
+		struct task_struct *task;
+		/*
+		 * find lock_lowest_rq releases this_rq->lock
+		 * so it is possible that next_task has changed.
+		 * If it has, then try again.
+		 */
+		task = pick_next_highest_task_rt(this_rq);
+		if (unlikely(task != next_task) && task && paranoid--) {
+			put_task_struct(next_task);
+			next_task = task;
+			goto retry;
+		}
+		goto out;
+	}
+
+	dst_cpu = lowest_rq->cpu;
+
+	assert_spin_locked(&lowest_rq->lock);
+
+	deactivate_task(this_rq, next_task, 0);
+	set_task_cpu(next_task, dst_cpu);
+	activate_task(lowest_rq, next_task, 0);
+
+	resched_task(lowest_rq->curr);
+
+	spin_unlock(&lowest_rq->lock);
+
+	ret = 1;
+out:
+	put_task_struct(next_task);
+
+	return ret;
+}
+
+/*
+ * TODO: Currently we just use the second highest prio task on
+ *       the queue, and stop when it can't migrate (or there's
+ *       no more RT tasks).  There may be a case where a lower
+ *       priority RT task has a different affinity than the
+ *       higher RT task. In this case the lower RT task could
+ *       possibly be able to migrate where as the higher priority
+ *       RT task could not.  We currently ignore this issue.
+ *       Enhancements are welcome!
+ */
+static void push_rt_tasks(struct rq *rq)
+{
+	/* push_rt_task will return true if it moved an RT */
+	while (push_rt_task(rq))
+		;
+}
+
+static void schedule_tail_balance_rt(struct rq *rq)
+{
+	/*
+	 * If we have more than one rt_task queued, then
+	 * see if we can push the other rt_tasks off to other CPUS.
+	 * Note we may release the rq lock, and since
+	 * the lock was owned by prev, we need to release it
+	 * first via finish_lock_switch and then reacquire it here.
+	 */
+	if (unlikely(rq->rt.rt_nr_running > 1)) {
+		spin_lock_irq(&rq->lock);
+		push_rt_tasks(rq);
+		spin_unlock_irq(&rq->lock);
+	}
+}
+#else /* CONFIG_SMP */
+# define schedule_tail_balance_rt(rq)	do { } while (0)
+#endif /* CONFIG_SMP */
+
 
 /*
  * Load-balancing iterator. Note: while the runqueue stays locked

-- 


* [PATCH -v2 4/7] RT overloaded runqueues accounting
  2007-10-23  2:59 [PATCH -v2 0/7] New RT Task Balancing -v2 Steven Rostedt
                   ` (2 preceding siblings ...)
  2007-10-23  2:59 ` [PATCH -v2 3/7] push RT tasks Steven Rostedt
@ 2007-10-23  2:59 ` Steven Rostedt
  2007-10-23  4:17   ` Paul Jackson
  2007-10-23  2:59 ` [PATCH -v2 5/7] pull RT tasks Steven Rostedt
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Steven Rostedt @ 2007-10-23  2:59 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Dmitry Adamushko,
	Paul Jackson, Gregory Haskins, Peter Zijlstra

[-- Attachment #1: rt-overload.patch --]
[-- Type: text/plain, Size: 7706 bytes --]

This patch adds an RT overload accounting system. When a runqueue has
more than one RT task queued, it is marked as overloaded; that is, it
is a candidate to have RT tasks pulled from it.

If CONFIG_CPUSETS is defined, then an RT overloaded CPU bitmask is kept
in the cpusets.  The algorithm for pulling tasks is limited to the cpuset
of the current task on the runqueue. Because of overlapping cpusets, it is
possible that the bitmask may get out of sync with the actual RT overloaded
runqueues. But it won't cause any real harm. The worst that can happen is
that an RT task may not migrate to a CPU that it can run on when it could.
But that's an OK price to pay to keep the accounting simple and not kill
the cache on large SMP boxes.

When CONFIG_CPUSETS is not set, a single global RT overload CPU mask is used.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/cpuset.h |    6 ++++
 kernel/cpuset.c        |   62 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_rt.c      |   29 ++++++++++++++++++++++
 3 files changed, 97 insertions(+)

Index: linux-test.git/kernel/sched_rt.c
===================================================================
--- linux-test.git.orig/kernel/sched_rt.c	2007-10-22 22:31:59.000000000 -0400
+++ linux-test.git/kernel/sched_rt.c	2007-10-22 22:32:18.000000000 -0400
@@ -3,6 +3,31 @@
  * policies)
  */
 
+#ifdef CONFIG_CPUSETS
+# define rt_overload(p) cpuset_rt_overload(p)
+# define rt_set_overload(p, cpu) cpuset_rt_set_overload(p, cpu)
+# define rt_clear_overload(p, cpu) cpuset_rt_clear_overload(p, cpu)
+# define rt_overloaded(p) cpuset_rt_overloaded(p)
+#else
+static cpumask_t rt_overload_mask;
+static inline int rt_overloaded(struct task_struct *p)
+{
+	return !cpus_empty(rt_overload_mask);
+}
+static inline cpumask_t rt_overload(struct task_struct *p)
+{
+	return rt_overload_mask;
+}
+static inline void rt_set_overload(struct task_struct *p, int cpu)
+{
+	cpu_set(cpu, rt_overload_mask);
+}
+static inline void rt_clear_overload(struct task_struct *p, int cpu)
+{
+	cpu_clear(cpu, rt_overload_mask);
+}
+#endif
+
 /*
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
@@ -32,6 +57,8 @@ static inline void inc_rt_tasks(struct t
 #ifdef CONFIG_SMP
 	if (p->prio < rq->rt.highest_prio)
 		rq->rt.highest_prio = p->prio;
+	if (rq->rt.rt_nr_running > 1)
+		rt_set_overload(p, rq->cpu);
 #endif /* CONFIG_SMP */
 }
 
@@ -53,6 +80,8 @@ static inline void dec_rt_tasks(struct t
 		} /* otherwise leave rq->highest prio alone */
 	} else
 		rq->rt.highest_prio = MAX_RT_PRIO;
+	if (rq->rt.rt_nr_running < 2)
+		rt_clear_overload(p, rq->cpu);
 #endif /* CONFIG_SMP */
 }
 
Index: linux-test.git/include/linux/cpuset.h
===================================================================
--- linux-test.git.orig/include/linux/cpuset.h	2007-10-22 22:18:53.000000000 -0400
+++ linux-test.git/include/linux/cpuset.h	2007-10-22 22:32:18.000000000 -0400
@@ -78,6 +78,12 @@ extern void cpuset_track_online_nodes(vo
 
 extern int current_cpuset_is_being_rebound(void);
 
+/* The cpuset_rt_overload code is only used when CONFIG_CPUSETS is defined */
+extern int cpuset_rt_overloaded(struct task_struct *tsk);
+extern void cpuset_rt_set_overload(struct task_struct *tsk, int cpu);
+extern cpumask_t cpuset_rt_overload(struct task_struct *tsk);
+extern void cpuset_rt_clear_overload(struct task_struct *tsk, int cpu);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline int cpuset_init_early(void) { return 0; }
Index: linux-test.git/kernel/cpuset.c
===================================================================
--- linux-test.git.orig/kernel/cpuset.c	2007-10-22 22:18:53.000000000 -0400
+++ linux-test.git/kernel/cpuset.c	2007-10-22 22:36:29.000000000 -0400
@@ -84,6 +84,9 @@ struct cpuset {
 	cpumask_t cpus_allowed;		/* CPUs allowed to tasks in cpuset */
 	nodemask_t mems_allowed;	/* Memory Nodes allowed to tasks */
 
+	/* bits protected by rq locks. */
+	cpumask_t rt_overload;		/* runqueue overload mask */
+
 	struct cpuset *parent;		/* my parent */
 
 	/*
@@ -179,6 +182,7 @@ static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
 	.cpus_allowed = CPU_MASK_ALL,
 	.mems_allowed = NODE_MASK_ALL,
+	.rt_overload = CPU_MASK_NONE,
 };
 
 /*
@@ -1566,6 +1570,7 @@ static void cpuset_post_clone(struct cgr
 
 	cs->mems_allowed = parent_cs->mems_allowed;
 	cs->cpus_allowed = parent_cs->cpus_allowed;
+	cs->rt_overload = parent_cs->rt_overload;
 	return;
 }
 
@@ -1604,6 +1609,7 @@ static struct cgroup_subsys_state *cpuse
 	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 	cs->cpus_allowed = CPU_MASK_NONE;
 	cs->mems_allowed = NODE_MASK_NONE;
+	cs->rt_overload = CPU_MASK_NONE;
 	cs->mems_generation = cpuset_mems_generation++;
 	fmeter_init(&cs->fmeter);
 
@@ -1674,6 +1680,7 @@ int __init cpuset_init(void)
 
 	top_cpuset.cpus_allowed = CPU_MASK_ALL;
 	top_cpuset.mems_allowed = NODE_MASK_ALL;
+	top_cpuset.rt_overload = CPU_MASK_NONE;
 
 	fmeter_init(&top_cpuset.fmeter);
 	top_cpuset.mems_generation = cpuset_mems_generation++;
@@ -1749,6 +1756,7 @@ static void common_cpu_mem_hotplug_unplu
 	guarantee_online_cpus_mems_in_subtree(&top_cpuset);
 	top_cpuset.cpus_allowed = cpu_online_map;
 	top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
+	top_cpuset.rt_overload = CPU_MASK_NONE;
 
 	mutex_unlock(&callback_mutex);
 	cgroup_unlock();
@@ -1798,6 +1806,7 @@ void __init cpuset_init_smp(void)
 {
 	top_cpuset.cpus_allowed = cpu_online_map;
 	top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
+	top_cpuset.rt_overload = CPU_MASK_NONE;
 
 	hotcpu_notifier(cpuset_handle_cpuhp, 0);
 }
@@ -1868,6 +1877,59 @@ nodemask_t cpuset_mems_allowed(struct ta
 }
 
 /**
+ * cpuset_rt_overloaded - true if the tasks cpuset has a RT overloaded runqueue.
+ * @tsk: pointer to the task_struct from which to test the runqueues.
+ *
+ * Description: Returns true if the tasks cpuset has a RT overloaded bit set.
+ * Due to overlapping cpusets, the runqueue marked as overloaded may no longer
+ * be overloaded, and a run queue that may be overloaded may not be marked.
+ * The RT balancing only focuses on the realm of cpusets, so we tolerate these
+ * inconsistencies.
+ */
+int cpuset_rt_overloaded(struct task_struct *tsk)
+{
+	return !cpus_empty(task_cs(tsk)->rt_overload);
+}
+
+/**
+ * cpuset_rt_set_overload - set the tasks cpuset RT overload cpu
+ * @tsk: pointer to the task_struct from which to set the cpuset rt overload.
+ * @cpu: CPU number of the runqueue to set the rt overload.
+ *
+ * Description: Set the bit in the tasks cpuset RT overload mask
+ * that corresponds to the given @cpu.
+ */
+void cpuset_rt_set_overload(struct task_struct *tsk, int cpu)
+{
+	cpu_set(cpu, task_cs(tsk)->rt_overload);
+}
+
+/**
+ * cpuset_rt_overload - return the RT overload mask of a cpuset.
+ * @tsk: pointer to the task_struct from which to set the cpuset rt overload.
+ *
+ * Description: return the cpumask of the rt overload of a
+ * cpuset for a task.
+ */
+cpumask_t cpuset_rt_overload(struct task_struct *tsk)
+{
+	return task_cs(tsk)->rt_overload;
+}
+
+/**
+ * cpuset_rt_clear_overload - clear the tasks cpuset RT overload CPU
+ * @tsk: pointer to the task_struct from which to set the cpuset rt overload.
+ * @cpu: CPU number of the runqueue to clear the rt overload.
+ *
+ * Description: Clear the bit in the tasks cpuset RT overload mask
+ * that corresponds to the given @cpu.
+ */
+void cpuset_rt_clear_overload(struct task_struct *tsk, int cpu)
+{
+	cpu_clear(cpu, task_cs(tsk)->rt_overload);
+}
+
+/**
  * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
  * @zl: the zonelist to be checked
  *

-- 


* [PATCH -v2 5/7] pull RT tasks
  2007-10-23  2:59 [PATCH -v2 0/7] New RT Task Balancing -v2 Steven Rostedt
                   ` (3 preceding siblings ...)
  2007-10-23  2:59 ` [PATCH -v2 4/7] RT overloaded runqueues accounting Steven Rostedt
@ 2007-10-23  2:59 ` Steven Rostedt
  2007-10-23  2:59 ` [PATCH -v2 6/7] wake up balance RT Steven Rostedt
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2007-10-23  2:59 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Dmitry Adamushko,
	Paul Jackson, Gregory Haskins, Peter Zijlstra

[-- Attachment #1: rt-balance-pull-tasks.patch --]
[-- Type: text/plain, Size: 8017 bytes --]

This patch adds the algorithm to pull tasks from RT overloaded runqueues.

When a pull is initiated, all overloaded runqueues are examined for an
RT task that is higher in prio than the highest prio task queued on the
target runqueue. If such a task is found on another runqueue, it is pulled
to the target runqueue.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

---
 kernel/sched.c    |    2 
 kernel/sched_rt.c |  196 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 187 insertions(+), 11 deletions(-)

Index: linux-test.git/kernel/sched.c
===================================================================
--- linux-test.git.orig/kernel/sched.c	2007-10-22 22:31:59.000000000 -0400
+++ linux-test.git/kernel/sched.c	2007-10-22 22:37:28.000000000 -0400
@@ -3622,6 +3622,8 @@ need_resched_nonpreemptible:
 		switch_count = &prev->nvcsw;
 	}
 
+	schedule_balance_rt(rq, prev);
+
 	if (unlikely(!rq->nr_running))
 		idle_balance(cpu, rq);
 
Index: linux-test.git/kernel/sched_rt.c
===================================================================
--- linux-test.git.orig/kernel/sched_rt.c	2007-10-22 22:32:18.000000000 -0400
+++ linux-test.git/kernel/sched_rt.c	2007-10-22 22:37:28.000000000 -0400
@@ -168,8 +168,17 @@ static void put_prev_task_rt(struct rq *
 static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
 static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep);
 
+static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
+{
+	if (!task_running(rq, p) &&
+	    (cpu < 0 || cpu_isset(cpu, p->cpus_allowed)))
+		return 1;
+	return 0;
+}
+
 /* Return the second highest RT task, NULL otherwise */
-static struct task_struct *pick_next_highest_task_rt(struct rq *rq)
+static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
+						     int cpu)
 {
 	struct rt_prio_array *array = &rq->rt.active;
 	struct task_struct *next;
@@ -188,26 +197,36 @@ static struct task_struct *pick_next_hig
 	}
 
 	queue = array->queue + idx;
+	BUG_ON(list_empty(queue));
+
 	next = list_entry(queue->next, struct task_struct, run_list);
-	if (unlikely(next != rq->curr))
-		return next;
+	if (unlikely(pick_rt_task(rq, next, cpu)))
+		goto out;
 
 	if (queue->next->next != queue) {
 		/* same prio task */
 		next = list_entry(queue->next->next, struct task_struct, run_list);
-		return next;
+		if (pick_rt_task(rq, next, cpu))
+			goto out;
 	}
 
+ retry:
 	/* slower, but more flexible */
 	idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
-	if (unlikely(idx >= MAX_RT_PRIO)) {
-		WARN_ON(1); /* rt_nr_running was 2 and above! */
+	if (unlikely(idx >= MAX_RT_PRIO))
 		return NULL;
-	}
 
 	queue = array->queue + idx;
-	next = list_entry(queue->next, struct task_struct, run_list);
+	BUG_ON(list_empty(queue));
 
+	list_for_each_entry(next, queue, run_list) {
+		if (pick_rt_task(rq, next, cpu))
+			goto out;
+	}
+
+	goto retry;
+
+ out:
 	return next;
 }
 
@@ -296,13 +315,15 @@ static int push_rt_task(struct rq *this_
 
 	assert_spin_locked(&this_rq->lock);
 
-	next_task = pick_next_highest_task_rt(this_rq);
+	next_task = pick_next_highest_task_rt(this_rq, -1);
 	if (!next_task)
 		return 0;
 
  retry:
-	if (unlikely(next_task == this_rq->curr))
+	if (unlikely(next_task == this_rq->curr)) {
+		WARN_ON(1);
 		return 0;
+	}
 
 	/*
 	 * It's possible that the next_task slipped in of
@@ -326,7 +347,7 @@ static int push_rt_task(struct rq *this_
 		 * so it is possible that next_task has changed.
 		 * If it has, then try again.
 		 */
-		task = pick_next_highest_task_rt(this_rq);
+		task = pick_next_highest_task_rt(this_rq, -1);
 		if (unlikely(task != next_task) && task && paranoid--) {
 			put_task_struct(next_task);
 			next_task = task;
@@ -371,6 +392,158 @@ static void push_rt_tasks(struct rq *rq)
 		;
 }
 
+static int pull_rt_task(struct rq *this_rq)
+{
+	struct task_struct *next;
+	struct task_struct *p;
+	struct rq *src_rq;
+	cpumask_t rto_cpumask;
+	int this_cpu = this_rq->cpu;
+	int cpu;
+	int ret = 0;
+
+	assert_spin_locked(&this_rq->lock);
+
+	/*
+	 * If cpusets are used, and we have overlapping
+	 * run queue cpusets, then this algorithm may not catch all.
+	 * This is just the price you pay on trying to keep
+	 * dirtying caches down on large SMP machines.
+	 */
+	if (likely(!rt_overloaded(this_rq->curr)))
+		return 0;
+
+	next = pick_next_task_rt(this_rq);
+
+	rto_cpumask = rt_overload(this_rq->curr);
+
+	for_each_cpu_mask(cpu, rto_cpumask) {
+		if (this_cpu == cpu)
+			continue;
+
+		src_rq = cpu_rq(cpu);
+		if (unlikely(src_rq->rt.rt_nr_running <= 1)) {
+			/*
+			 * It is possible that overlapping cpusets
+			 * will miss clearing a non overloaded runqueue.
+			 * Clear it now.
+			 */
+			if (double_lock_balance(this_rq, src_rq)) {
+				/* unlocked our runqueue lock */
+				struct task_struct *old_next = next;
+				next = pick_next_task_rt(this_rq);
+				if (next != old_next)
+					ret = 1;
+			}
+			if (likely(src_rq->rt.rt_nr_running <= 1))
+				/*
+				 * Small chance that this_rq->curr changed
+				 * but it's really harmless here.
+				 */
+				rt_clear_overload(this_rq->curr, this_rq->cpu);
+			else
+				/*
+				 * Heh, the src_rq is now overloaded, since
+				 * we already have the src_rq lock, go straight
+				 * to pulling tasks from it.
+				 */
+				goto try_pulling;
+			spin_unlock(&src_rq->lock);
+			continue;
+		}
+
+		/*
+		 * We can potentially drop this_rq's lock in
+		 * double_lock_balance, and another CPU could
+		 * steal our next task - hence we must cause
+		 * the caller to recalculate the next task
+		 * in that case:
+		 */
+		if (double_lock_balance(this_rq, src_rq)) {
+			struct task_struct *old_next = next;
+			next = pick_next_task_rt(this_rq);
+			if (next != old_next)
+				ret = 1;
+		}
+
+		/*
+		 * Are there still pullable RT tasks?
+		 */
+		if (src_rq->rt.rt_nr_running <= 1) {
+			spin_unlock(&src_rq->lock);
+			continue;
+		}
+
+ try_pulling:
+		p = pick_next_highest_task_rt(src_rq, this_cpu);
+
+		/*
+		 * Do we have an RT task that preempts
+		 * the to-be-scheduled task?
+		 */
+		if (p && (!next || (p->prio < next->prio))) {
+			WARN_ON(p == src_rq->curr);
+			WARN_ON(!p->se.on_rq);
+
+			/*
+			 * There's a chance that p is higher in priority
+			 * than what's currently running on its cpu.
+			 * This is just that p is waking up and hasn't
+			 * had a chance to schedule. We only pull
+			 * p if it is lower in priority than the
+			 * current task on the run queue or
+			 * this_rq next task is lower in prio than
+			 * the current task on that rq.
+			 */
+			if (p->prio < src_rq->curr->prio ||
+			    (next && next->prio < src_rq->curr->prio))
+				goto bail;
+
+			ret = 1;
+
+			deactivate_task(src_rq, p, 0);
+			set_task_cpu(p, this_cpu);
+			activate_task(this_rq, p, 0);
+			/*
+			 * We continue with the search, just in
+			 * case there's an even higher prio task
+			 * in another runqueue. (low likelihood
+			 * but possible)
+			 */
+
+			/*
+			 * Update next so that we won't pick a task
+			 * on another cpu with a priority lower (or equal)
+			 * than the one we just picked.
+			 */
+			next = p;
+
+		}
+ bail:
+		spin_unlock(&src_rq->lock);
+	}
+
+	return ret;
+}
+
+static void schedule_balance_rt(struct rq *rq,
+				struct task_struct *prev)
+{
+	struct rt_prio_array *array;
+	int next_prio;
+
+	/* Try to pull RT tasks here if we lower this rq's prio */
+	if (unlikely(rt_task(prev))) {
+		next_prio = MAX_RT_PRIO;
+		if (rq->rt.rt_nr_running) {
+			array = &rq->rt.active;
+			next_prio = sched_find_first_bit(array->bitmap);
+		}
+		if (next_prio > prev->prio)
+			pull_rt_task(rq);
+	}
+}
+
 static void schedule_tail_balance_rt(struct rq *rq)
 {
 	/*
@@ -388,6 +561,7 @@ static void schedule_tail_balance_rt(str
 }
 #else /* CONFIG_SMP */
 # define schedule_tail_balance_rt(rq)	do { } while (0)
+# define schedule_balance_rt(rq)	do { } while (0)
 #endif /* CONFIG_SMP */
 
 

-- 


* [PATCH -v2 6/7] wake up balance RT
  2007-10-23  2:59 [PATCH -v2 0/7] New RT Task Balancing -v2 Steven Rostedt
                   ` (4 preceding siblings ...)
  2007-10-23  2:59 ` [PATCH -v2 5/7] pull RT tasks Steven Rostedt
@ 2007-10-23  2:59 ` Steven Rostedt
  2007-10-23  2:59 ` [PATCH -v2 7/7] disable CFS RT load balancing Steven Rostedt
  2007-10-23  8:39 ` [PATCH -v2 0/7] New RT Task Balancing -v2 Ingo Molnar
  7 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2007-10-23  2:59 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Dmitry Adamushko,
	Paul Jackson, Gregory Haskins, Peter Zijlstra

[-- Attachment #1: rt-balance-wakeup.patch --]
[-- Type: text/plain, Size: 1881 bytes --]

This patch adds pushing of RT tasks off an overloaded runqueue when tasks
(most likely RT tasks) are added to that run queue.

TODO: We don't cover the case of waking of new RT tasks (yet).

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/sched.c    |    2 ++
 kernel/sched_rt.c |   12 ++++++++++++
 2 files changed, 14 insertions(+)

Index: linux-test.git/kernel/sched_rt.c
===================================================================
--- linux-test.git.orig/kernel/sched_rt.c	2007-10-22 22:37:28.000000000 -0400
+++ linux-test.git/kernel/sched_rt.c	2007-10-22 22:38:04.000000000 -0400
@@ -559,9 +559,21 @@ static void schedule_tail_balance_rt(str
 		spin_unlock_irq(&rq->lock);
 	}
 }
+
+static void wakeup_balance_rt(struct rq *rq, struct task_struct *p)
+{
+	if (unlikely(rt_task(p)) &&
+	    (p->prio >= rq->curr->prio)) {
+		/* pull_rt_task needs task to be running */
+		p->state = TASK_RUNNING;
+		push_rt_tasks(rq);
+	}
+}
+
 #else /* CONFIG_SMP */
 # define schedule_tail_balance_rt(rq)	do { } while (0)
 # define schedule_balance_rt(rq)	do { } while (0)
+# define wakeup_balance_rt(rq, p)	do { } while (0)
 #endif /* CONFIG_SMP */
 
 
Index: linux-test.git/kernel/sched.c
===================================================================
--- linux-test.git.orig/kernel/sched.c	2007-10-22 22:37:28.000000000 -0400
+++ linux-test.git/kernel/sched.c	2007-10-22 22:38:04.000000000 -0400
@@ -22,6 +22,7 @@
  *              by Peter Williams
  *  2007-05-06  Interactivity improvements to CFS by Mike Galbraith
  *  2007-07-01  Group scheduling enhancements by Srivatsa Vaddagiri
+ *  2007-10-22  RT overload balancing by Steven Rostedt
  */
 
 #include <linux/mm.h>
@@ -1614,6 +1615,7 @@ out_activate:
 	update_rq_clock(rq);
 	activate_task(rq, p, 1);
 	check_preempt_curr(rq, p);
+	wakeup_balance_rt(rq, p);
 	success = 1;
 
 out_running:

-- 


* [PATCH -v2 7/7] disable CFS RT load balancing.
  2007-10-23  2:59 [PATCH -v2 0/7] New RT Task Balancing -v2 Steven Rostedt
                   ` (5 preceding siblings ...)
  2007-10-23  2:59 ` [PATCH -v2 6/7] wake up balance RT Steven Rostedt
@ 2007-10-23  2:59 ` Steven Rostedt
  2007-10-23  8:39 ` [PATCH -v2 0/7] New RT Task Balancing -v2 Ingo Molnar
  7 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2007-10-23  2:59 UTC (permalink / raw)
  To: LKML, RT
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Dmitry Adamushko,
	Paul Jackson, Gregory Haskins, Peter Zijlstra

[-- Attachment #1: disable-CFS-rt-balance.patch --]
[-- Type: text/plain, Size: 3361 bytes --]

Since we now take an active approach to load balancing, we don't need to
balance RT tasks via CFS. In fact, this code was found to pull RT tasks
away from CPUs that the active balancing had just moved them to, resulting
in large latencies.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/sched_rt.c |   91 +-----------------------------------------------------
 1 file changed, 2 insertions(+), 89 deletions(-)

Index: linux-test.git/kernel/sched_rt.c
===================================================================
--- linux-test.git.orig/kernel/sched_rt.c	2007-10-22 22:38:04.000000000 -0400
+++ linux-test.git/kernel/sched_rt.c	2007-10-22 22:38:33.000000000 -0400
@@ -576,101 +576,14 @@ static void wakeup_balance_rt(struct rq 
 # define wakeup_balance_rt(rq, p)	do { } while (0)
 #endif /* CONFIG_SMP */
 
-
-/*
- * Load-balancing iterator. Note: while the runqueue stays locked
- * during the whole iteration, the current task might be
- * dequeued so the iterator has to be dequeue-safe. Here we
- * achieve that by always pre-iterating before returning
- * the current task:
- */
-static struct task_struct *load_balance_start_rt(void *arg)
-{
-	struct rq *rq = arg;
-	struct rt_prio_array *array = &rq->rt.active;
-	struct list_head *head, *curr;
-	struct task_struct *p;
-	int idx;
-
-	idx = sched_find_first_bit(array->bitmap);
-	if (idx >= MAX_RT_PRIO)
-		return NULL;
-
-	head = array->queue + idx;
-	curr = head->prev;
-
-	p = list_entry(curr, struct task_struct, run_list);
-
-	curr = curr->prev;
-
-	rq->rt.rt_load_balance_idx = idx;
-	rq->rt.rt_load_balance_head = head;
-	rq->rt.rt_load_balance_curr = curr;
-
-	return p;
-}
-
-static struct task_struct *load_balance_next_rt(void *arg)
-{
-	struct rq *rq = arg;
-	struct rt_prio_array *array = &rq->rt.active;
-	struct list_head *head, *curr;
-	struct task_struct *p;
-	int idx;
-
-	idx = rq->rt.rt_load_balance_idx;
-	head = rq->rt.rt_load_balance_head;
-	curr = rq->rt.rt_load_balance_curr;
-
-	/*
-	 * If we arrived back to the head again then
-	 * iterate to the next queue (if any):
-	 */
-	if (unlikely(head == curr)) {
-		int next_idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
-
-		if (next_idx >= MAX_RT_PRIO)
-			return NULL;
-
-		idx = next_idx;
-		head = array->queue + idx;
-		curr = head->prev;
-
-		rq->rt.rt_load_balance_idx = idx;
-		rq->rt.rt_load_balance_head = head;
-	}
-
-	p = list_entry(curr, struct task_struct, run_list);
-
-	curr = curr->prev;
-
-	rq->rt.rt_load_balance_curr = curr;
-
-	return p;
-}
-
 static unsigned long
 load_balance_rt(struct rq *this_rq, int this_cpu, struct rq *busiest,
 			unsigned long max_nr_move, unsigned long max_load_move,
 			struct sched_domain *sd, enum cpu_idle_type idle,
 			int *all_pinned, int *this_best_prio)
 {
-	int nr_moved;
-	struct rq_iterator rt_rq_iterator;
-	unsigned long load_moved;
-
-	rt_rq_iterator.start = load_balance_start_rt;
-	rt_rq_iterator.next = load_balance_next_rt;
-	/* pass 'busiest' rq argument into
-	 * load_balance_[start|next]_rt iterators
-	 */
-	rt_rq_iterator.arg = busiest;
-
-	nr_moved = balance_tasks(this_rq, this_cpu, busiest, max_nr_move,
-			max_load_move, sd, idle, all_pinned, &load_moved,
-			this_best_prio, &rt_rq_iterator);
-
-	return load_moved;
+	/* don't touch RT tasks */
+	return 0;
 }
 
 static void task_tick_rt(struct rq *rq, struct task_struct *p)

-- 


* Re: [PATCH -v2 4/7] RT overloaded runqueues accounting
  2007-10-23  2:59 ` [PATCH -v2 4/7] RT overloaded runqueues accounting Steven Rostedt
@ 2007-10-23  4:17   ` Paul Jackson
  2007-10-23  6:11     ` Paul Menage
  0 siblings, 1 reply; 15+ messages in thread
From: Paul Jackson @ 2007-10-23  4:17 UTC (permalink / raw)
  To: Steven Rostedt, Paul Menage
  Cc: linux-kernel, linux-rt-users, torvalds, akpm, mingo,
	dmitry.adamushko, ghaskins, a.p.zijlstra

Steven wrote:
> +void cpuset_rt_set_overload(struct task_struct *tsk, int cpu)
> +{
> +	cpu_set(cpu, task_cs(tsk)->rt_overload);
> +}

Question for Steven:

  What locks are held when cpuset_rt_set_overload() is called?

Questions for Paul Menage:

  Does 'tsk' need to be locked for the above task_cs() call?

  Background concern -- one of the things that I like to think has
  allowed cpusets to be useful to others is the careful documentation
  of its locking requirements.  I hope that the cgroup infrastructure,
  and the portions of the cpuset code, such as this task_cs() call,
  that were adapted to work with cgroups, have their locking needs
  documented as well.  I suspect that there is still some presence of
  stale cpuset locking comments, and perhaps an absence of complete
  comments on new cgroup locking needs.  My recollection is that this
  is already on your todo list -- I'm just being annoying here ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH -v2 4/7] RT overloaded runqueues accounting
  2007-10-23  4:17   ` Paul Jackson
@ 2007-10-23  6:11     ` Paul Menage
  2007-10-23  6:19       ` Paul Jackson
  2007-10-23 13:43       ` Steven Rostedt
  0 siblings, 2 replies; 15+ messages in thread
From: Paul Menage @ 2007-10-23  6:11 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Steven Rostedt, linux-kernel, linux-rt-users, torvalds, akpm,
	mingo, dmitry.adamushko, ghaskins, a.p.zijlstra

On 10/22/07, Paul Jackson <pj@sgi.com> wrote:
> Steven wrote:
> > +void cpuset_rt_set_overload(struct task_struct *tsk, int cpu)
> > +{
> > +     cpu_set(cpu, task_cs(tsk)->rt_overload);
> > +}
>
> Question for Steven:
>
>   What locks are held when cpuset_rt_set_overload() is called?
>
> Questions for Paul Menage:
>
>   Does 'tsk' need to be locked for the above task_cs() call?

Cgroups doesn't change the locking rules for accessing a cpuset from a
task - you have to have one of:

- task_lock(task)

- callback_mutex

- be in an RCU section from the point when you call task_cs to the
point when you stop using its result. (Additionally, in this case
there's no guarantee that the task stays in this cpuset for the
duration of the RCU section).

Paul


* Re: [PATCH -v2 4/7] RT overloaded runqueues accounting
  2007-10-23  6:11     ` Paul Menage
@ 2007-10-23  6:19       ` Paul Jackson
  2007-10-23 13:43       ` Steven Rostedt
  1 sibling, 0 replies; 15+ messages in thread
From: Paul Jackson @ 2007-10-23  6:19 UTC (permalink / raw)
  To: Paul Menage
  Cc: rostedt, linux-kernel, linux-rt-users, torvalds, akpm, mingo,
	dmitry.adamushko, ghaskins, a.p.zijlstra

Paul M wrote:
> Cgroups doesn't change the locking rules for accessing a cpuset from a
> task - you have to have one of:

Good - could you comment task_cs() with this explanation?

The rules are derived from the cpuset rules, as you explain,
and as I suspected, but now task_cs() is the most popular
way to access a tasks cpuset from code within kernel/cpuset.c,
and that's code you added.

The reason that I started this subthread is that I didn't see
any immediate evidence that the RT code was honoring this locking,
and I suspected that a clear comment over task_cs() could have
clarified that for them.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH -v2 0/7] New RT Task Balancing -v2
  2007-10-23  2:59 [PATCH -v2 0/7] New RT Task Balancing -v2 Steven Rostedt
                   ` (6 preceding siblings ...)
  2007-10-23  2:59 ` [PATCH -v2 7/7] disable CFS RT load balancing Steven Rostedt
@ 2007-10-23  8:39 ` Ingo Molnar
  7 siblings, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2007-10-23  8:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, RT, Linus Torvalds, Andrew Morton, Dmitry Adamushko,
	Paul Jackson, Gregory Haskins, Peter Zijlstra


* Steven Rostedt <rostedt@goodmis.org> wrote:

>   Changes since V1:
>     Updated to git tree 55b70a0300b873c0ec7ea6e33752af56f41250ce
> 
>     Various clean ups suggested by Gregory Haskins, Dmitry Adamushko,
>     and Peter Zijlstra.

ok, i like this new approach - nice work! I'd suggest we test it in -rt 
for some time and then merge it into v2.6.25?

	Ingo


* Re: [PATCH -v2 4/7] RT overloaded runqueues accounting
  2007-10-23  6:11     ` Paul Menage
  2007-10-23  6:19       ` Paul Jackson
@ 2007-10-23 13:43       ` Steven Rostedt
  2007-10-23 21:46         ` Paul Jackson
  1 sibling, 1 reply; 15+ messages in thread
From: Steven Rostedt @ 2007-10-23 13:43 UTC (permalink / raw)
  To: Paul Menage
  Cc: Paul Jackson, linux-kernel, linux-rt-users, torvalds, akpm,
	mingo, dmitry.adamushko, ghaskins, a.p.zijlstra


--
On Mon, 22 Oct 2007, Paul Menage wrote:

> On 10/22/07, Paul Jackson <pj@sgi.com> wrote:
> > Steven wrote:
> > > +void cpuset_rt_set_overload(struct task_struct *tsk, int cpu)
> > > +{
> > > +     cpu_set(cpu, task_cs(tsk)->rt_overload);
> > > +}
> >
> > Question for Steven:
> >
> >   What locks are held when cpuset_rt_set_overload() is called?

Right now only the rq lock. I know it's not enough. This needs to be
fixed. These are called in the heart of the scheduler so...

> >
> > Questions for Paul Menage:
> >
> >   Does 'tsk' need to be locked for the above task_cs() call?
>
> Cgroups doesn't change the locking rules for accessing a cpuset from a
> task - you have to have one of:
>
> - task_lock(task)

I'm not sure of the lock nesting between task_lock(task) and rq->lock.
If this nesting doesn't yet exist, then I can add code to do the
task_lock.  But if the rq->lock is taken within the task_lock somewhere,
then this can't be used.

>
> - callback_mutex

Can schedule, so it's definitely out.

>
> - be in an RCU section from the point when you call task_cs to the
> point when you stop using its result. (Additionally, in this case
> there's no guarantee that the task stays in this cpuset for the
> duration of the RCU section).

This may also be an option. With interrupts disabled for the entire time
the cpusets are used, RCU grace periods should be kept from moving
forward.
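
For example, the set side could become something like this (just a sketch,
not tested):

void cpuset_rt_set_overload(struct task_struct *tsk, int cpu)
{
	/*
	 * Sketch only: the rq lock is held (interrupts disabled) at the
	 * call site; the rcu_read_lock() just makes the task_cs() access
	 * explicit.  The task may still change cpusets underneath us,
	 * which the RT overload code already tolerates.
	 */
	rcu_read_lock();
	cpu_set(cpu, task_cs(tsk)->rt_overload);
	rcu_read_unlock();
}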

But let me give you two some background to what I'm trying to solve. And
then I'll get to where cpusets come in.

Currently the mainline vanilla kernel does not handle migrating RT tasks
well. This problem also exists (but not as badly) in the -rt patch.
When an RT task is queued on a CPU that is running an even higher priority
RT task, it should be pushed off to another CPU that is running a lower
priority task. But this does not always happen, and an RT task may take
several milliseconds before it gets a chance to run.

This latency is not acceptable for RT tasks. So I added logic to push and
pull RT tasks to and from CPUS.  The push happens when an RT task wakes up
and can't preempt the RT task running on the same CPU, or when a lower
priority RT task is preempted by a higher one. The lower priority task may
be pushed to another CPU.
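
Very roughly, the push hook looks something like this (the hook name and
call site are illustrative; push_rt_tasks() stands in for the push path
of patch 3/7, name assumed):

  static void push_rt_tasks(struct rq *rq);  /* push path, patch 3/7 */

  /* hypothetical hook, called when an RT task p is woken up on rq */
  static void wakeup_push_rt(struct rq *rq, struct task_struct *p)
  {
          /* p can't preempt the running RT task, so try pushing instead */
          if (rt_task(rq->curr) && rq->curr->prio <= p->prio)
                  push_rt_tasks(rq);
  }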

This alone is not enough to cover RT migration. We also need to pull RT
tasks to a CPU if that CPU is lowering its priority (a high priority RT
task has just gone to sleep).

Ideally, and when CONFIG_CPUSETS is not defined, I keep track of all CPUS
that have more than one RT task queued to run on them.  This I call an RT
overload.  There's an RT overload bitmask that keeps track of the CPUS
that have more than one RT task queued.  When a CPU stops running a high
priority RT task, a search is made of all the CPUS that are in the RT
overload state to see if there exists an RT task that can migrate to the
CPU that is lowering its priority.
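
In sketch form (rt_nr_running comes from patch 1/7; its exact placement,
the mask name, and the pull helper below are assumptions for
illustration, not the actual code):

  static cpumask_t rt_overload_mask = CPU_MASK_NONE;

  static inline void rt_set_overload(struct rq *rq)
  {
          /* more than one queued RT task -> mark this runqueue */
          if (rq->rt_nr_running > 1)
                  cpu_set(rq->cpu, rt_overload_mask);
  }

  static inline void rt_clear_overload(struct rq *rq)
  {
          if (rq->rt_nr_running < 2)
                  cpu_clear(rq->cpu, rt_overload_mask);
  }

  /* hypothetical helper: try to move one suitable RT task to this_rq */
  static int pull_one_rt_task(struct rq *src_rq, struct rq *this_rq);

  /* called when this CPU is about to lower its RT priority */
  static void pull_rt_tasks(struct rq *this_rq)
  {
          int cpu;

          for_each_cpu_mask(cpu, rt_overload_mask) {
                  if (cpu == this_rq->cpu)
                          continue;
                  pull_one_rt_task(cpu_rq(cpu), this_rq);
          }
  }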

Ingo Molnar and Peter Zijlstra pointed out to me that this global cpumask
would kill performance on >64 CPU boxes due to cacheline bouncing. To
solve this issue, I placed the RT overload mask into the cpusets.

Ideally, the RT overload mask would keep track of all CPUS that have tasks
that can run on a given CPU. In other words, a cpuset from the point of
view of the CPU (or runqueue).  But this is overkill for the RT migration
code.  Boxes with a large number of CPUS probably don't have the RT
balancing problems that boxes with a small number of CPUS have, since
there are more CPUS available to run RT tasks on.

Cpusets seemed to be a nice place to add this functionality, to keep the
RT overload code from crippling boxes with a large number of CPUS.

The RT overload mask is bound to the CPU (runqueue) and not to the task.
But to get to the cpuset, I needed to go through the task. The task I used
was whatever was currently running on the given CPU, or sometimes the task
that was being pushed to a CPU.
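
In code, that indirection is basically (hypothetical wrapper, just to
show the path from runqueue to cpuset; cpuset_rt_set_overload() is the
function quoted at the top of this mail):

  static void mark_rq_rt_overloaded(struct rq *rq)
  {
          /* reach the runqueue's cpuset through whatever task is on it */
          cpuset_rt_set_overload(rq->curr, rq->cpu);
  }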

Due to overlapping cpusets, there can be inconsistencies between the RT
overload mask and the actual CPUS that are in the overload state. This is
tolerated, as the more CPUS you have, the less of a problem overloaded
CPUS should be. Remember, the push side doesn't use the overload mask.
Only the pull does, and that happens when a push didn't succeed. With more
CPUS, pushes are more likely to succeed.

So the switch to use cpusets was more about keeping the RT balancing code
from hurting large SMP boxes than about being strictly correct on those
boxes.  The RT balancing is much more important when the CPU resource is
limited.  The cpusets were picked simply because it seemed reasonable that,
most of the time, the cpuset of one task on a runqueue would equal that of
another task on the same runqueue. But the code is "good enough" if that's
not the case.

My code can handle inconsistencies between the RT overload mask and the
actual overloaded CPUS. So what I really need to protect against, with
regard to cpusets, is them disappearing and causing an oops.  Whether or
not a task comes and goes from a cpuset is not the important part.  The RT
balance code only uses the cpuset to determine what other RT tasks are out
there that can be brought to a given CPU.

Sorry for this long explanation, but I was hoping to get some advice from
those who have access to large SMP boxes and have a vested interest in
making sure that scheduling enhancements do not hurt performance on those
boxes.

Thoughts?

Thanks,

-- Steve


* Re: [PATCH -v2 4/7] RT overloaded runqueues accounting
  2007-10-23 13:43       ` Steven Rostedt
@ 2007-10-23 21:46         ` Paul Jackson
  2008-01-29 13:00           ` Paul Jackson
  0 siblings, 1 reply; 15+ messages in thread
From: Paul Jackson @ 2007-10-23 21:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: menage, linux-kernel, linux-rt-users, torvalds, akpm, mingo,
	dmitry.adamushko, ghaskins, a.p.zijlstra

Steven wrote:
> Ingo Molnar and Peter Zijlstra pointed out to me that this global cpumask
> would kill performance on >64 CPU boxes due to cacheline bouncing. To
> solve this issue, I placed the RT overload mask into the cpusets.

This feels rather like sched domains to me (which I can't claim
to understand very well).  That is, you need to subdivide the
overload masks, so as to avoid contention on a single mask in a
large system, making the tradeoff that you will sometimes resolve
overloads less aggressively.

With sched domains, we added cpuset code, that is only called in the
infrequent event that the cpusets are reconfigured, to update the
kernel's sched domains.  It hasn't been easy code, but at least the
main line scheduler code paths remain oblivious to cpusets.

Can you create "RT domains", each with their own overload mask?
Perhaps even attach the overload masks to the existing sched domains?

The essential approach here, as with all things involving cpusets and
the scheduler, is to not have the scheduler code access cpusets, but
rather to have the scheduler code have its own controlling data
structures, which it can access with locking suitable to its fast path
requirements, and then have the cpuset code ~~delicately~~ update those
structures on the infrequent occasion that something changes in the
cpuset configuration.
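
Schematically, something like the following (every name here is
hypothetical -- the point is the ownership split, not the details):

  /* owned and locked by the scheduler, one per "RT domain" */
  struct rt_domain {
          cpumask_t span;         /* CPUs covered by this domain */
          cpumask_t rt_overload;  /* overloaded runqueues within the span */
          spinlock_t lock;        /* locking chosen to suit the fast path */
  };

  /*
   * Called from the cpuset code only when the cpuset configuration
   * changes; the scheduler fast paths never touch cpusets directly.
   */
  void rebuild_rt_domains(const cpumask_t *doms, int ndoms);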

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [PATCH -v2 4/7] RT overloaded runqueues accounting
  2007-10-23 21:46         ` Paul Jackson
@ 2008-01-29 13:00           ` Paul Jackson
  0 siblings, 0 replies; 15+ messages in thread
From: Paul Jackson @ 2008-01-29 13:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: menage, linux-kernel, linux-rt-users, torvalds, akpm, mingo,
	dmitry.adamushko, ghaskins, a.p.zijlstra

Back in October 2007, this thread to which I am now (Jan 2008)
responding had a discussion of realtime versus normal load balancing
and how to set up sched domains for this, perhaps using cpusets.

Thanks to Peter Zijlstra, this discussion has started up again on lkml,
in the new thread:

   http://lkml.org/lkml/2008/1/29/60

Come on over and join the fun if you like.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214



Thread overview: 15+ messages
2007-10-23  2:59 [PATCH -v2 0/7] New RT Task Balancing -v2 Steven Rostedt
2007-10-23  2:59 ` [PATCH -v2 1/7] Add rt_nr_running accounting Steven Rostedt
2007-10-23  2:59 ` [PATCH -v2 2/7] track highest prio queued on runqueue Steven Rostedt
2007-10-23  2:59 ` [PATCH -v2 3/7] push RT tasks Steven Rostedt
2007-10-23  2:59 ` [PATCH -v2 4/7] RT overloaded runqueues accounting Steven Rostedt
2007-10-23  4:17   ` Paul Jackson
2007-10-23  6:11     ` Paul Menage
2007-10-23  6:19       ` Paul Jackson
2007-10-23 13:43       ` Steven Rostedt
2007-10-23 21:46         ` Paul Jackson
2008-01-29 13:00           ` Paul Jackson
2007-10-23  2:59 ` [PATCH -v2 5/7] pull RT tasks Steven Rostedt
2007-10-23  2:59 ` [PATCH -v2 6/7] wake up balance RT Steven Rostedt
2007-10-23  2:59 ` [PATCH -v2 7/7] disable CFS RT load balancing Steven Rostedt
2007-10-23  8:39 ` [PATCH -v2 0/7] New RT Task Balancing -v2 Ingo Molnar
