LKML Archive on lore.kernel.org
* Context switch times
@ 2001-10-04 21:04 Mike Kravetz
  2001-10-04 21:14 ` arjan
  2001-10-04 21:25 ` David S. Miller
  0 siblings, 2 replies; 61+ messages in thread
From: Mike Kravetz @ 2001-10-04 21:04 UTC (permalink / raw)
  To: linux-kernel

I've been working on a rewrite of our Multi-Queue scheduler
and am using the lat_ctx program of LMbench as a benchmark.
I'm lucky enough to have access to an 8-CPU system for use
during development.  One time, I 'accidentally' booted the
kernel that came with the distribution installed on this
machine.  That kernel level is '2.2.16-22'.  The results of
running lat_ctx on this kernel when compared to 2.4.10 really
surprised me.  Here is an example:

2.4.10 on 8 CPUs:  lat_ctx -s 0 -r 2 results
"size=0k ovr=2.27
2 3.86

2.2.16-22 on 8 CPUs:  lat_ctx -s 0 -r 2 results
"size=0k ovr=1.99
2 1.44

As you can see, the context switch times for 2.4.10 are more
than double what they were for 2.2.16-22 in this example.  

Comments?

One observation I did make is that this may be related to CPU
affinity/cache warmth.  If you increase the number of 'TRIPS'
to a very large number, you can run 'top' and observe per-CPU
utilization.  On 2.2.16-22, the '2 task' benchmark seemed to
stay on 3 of the 8 CPUs.  On 2.4.10, these 2 tasks were run
on all 8 CPUs and utilization was about the same for each CPU.

-- 
Mike Kravetz                                  kravetz@us.ibm.com
IBM Peace, Love and Linux Technology Center

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-04 21:04 Context switch times Mike Kravetz
@ 2001-10-04 21:14 ` arjan
  2001-10-04 21:39   ` Richard Gooch
  2001-10-04 21:25 ` David S. Miller
  1 sibling, 1 reply; 61+ messages in thread
From: arjan @ 2001-10-04 21:14 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: linux-kernel

In article <20011004140417.C1245@w-mikek2.des.beaverton.ibm.com> you wrote:
> 2.4.10 on 8 CPUs:  lat_ctx -s 0 -r 2 results
> "size=0k ovr=2.27
> 2 3.86

> 2.2.16-22 on 8 CPUs:  lat_ctx -s 0 -r 2 results
> "size=0k ovr=1.99
> 2 1.44

> As you can see, the context switch times for 2.4.10 are more
> than double what they were for 2.2.16-22 in this example.  

> Comments?

2.4.x supports SSE on pentium III/athlons, so the SSE registers need to be
saved/restored on a task switch as well... that's not exactly free.



* Re: Context switch times
  2001-10-04 21:04 Context switch times Mike Kravetz
  2001-10-04 21:14 ` arjan
@ 2001-10-04 21:25 ` David S. Miller
  2001-10-04 21:52   ` David S. Miller
  1 sibling, 1 reply; 61+ messages in thread
From: David S. Miller @ 2001-10-04 21:25 UTC (permalink / raw)
  To: arjan; +Cc: kravetz, linux-kernel

   From: arjan@fenrus.demon.nl
   Date: Thu, 04 Oct 2001 22:14:13 +0100
   
   > Comments?
   
   2.4.x supports SSE on pentium III/athlons, so the SSE registers need to be
   saved/restored on a task switch as well... that's not exactly free.

lat_ctx doesn't execute any FPU ops.  So at worst this happens once
on GLIBC program startup, but then never again.

This assumes I understand how lazy i387 restores work in the kernel
:-)

Franks a lot,
David S. Miller
davem@redhat.com


* Re: Context switch times
  2001-10-04 21:14 ` arjan
@ 2001-10-04 21:39   ` Richard Gooch
  2001-10-09  4:55     ` Richard Gooch
  0 siblings, 1 reply; 61+ messages in thread
From: Richard Gooch @ 2001-10-04 21:39 UTC (permalink / raw)
  To: David S. Miller; +Cc: arjan, kravetz, linux-kernel

David S. Miller writes:
>    From: arjan@fenrus.demon.nl
>    Date: Thu, 04 Oct 2001 22:14:13 +0100
>    
>    > Comments?
>    
>    2.4.x supports SSE on pentium III/athlons, so the SSE registers need to be
>    saved/restored on a task switch as well... that's not exactly free.
> 
> lat_ctx doesn't execute any FPU ops.  So at worst this happens once
> on GLIBC program startup, but then never again.

Has something changed? Last I looked, the whole lmbench timing harness
was based on using the FPU. That was the cause of the big argument
Larry and I had some years back: my context switch benchmark didn't
use the FPU, and thus was more sensitive to variations (such as cache
misses due to aliasing).

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca


* Re: Context switch times
  2001-10-04 21:25 ` David S. Miller
@ 2001-10-04 21:52   ` David S. Miller
  2001-10-04 21:55     ` Benjamin LaHaise
  2001-10-09  5:00     ` David S. Miller
  0 siblings, 2 replies; 61+ messages in thread
From: David S. Miller @ 2001-10-04 21:52 UTC (permalink / raw)
  To: rgooch; +Cc: arjan, kravetz, linux-kernel

   From: Richard Gooch <rgooch@ras.ucalgary.ca>
   Date: Thu, 4 Oct 2001 15:39:05 -0600

   David S. Miller writes:
   > lat_ctx doesn't execute any FPU ops.  So at worst this happens once
   > on GLIBC program startup, but then never again.
   
   Has something changed? Last I looked, the whole lmbench timing harness
   was based on using the FPU.

Oops, that's entirely possible...

But things are usually laid out like this:

	capture_start_time();
	context_switch_N_times();
	capture_end_time();

So the FPU hit is only before/after the runs, not during each and
every iteration.

Franks a lot,
David S. Miller
davem@redhat.com


* Re: Context switch times
  2001-10-04 21:52   ` David S. Miller
@ 2001-10-04 21:55     ` Benjamin LaHaise
  2001-10-04 22:35       ` Davide Libenzi
  2001-10-04 22:42       ` Linus Torvalds
  2001-10-09  5:00     ` David S. Miller
  1 sibling, 2 replies; 61+ messages in thread
From: Benjamin LaHaise @ 2001-10-04 21:55 UTC (permalink / raw)
  To: David S. Miller; +Cc: rgooch, arjan, kravetz, linux-kernel

On Thu, Oct 04, 2001 at 02:52:39PM -0700, David S. Miller wrote:
> So the FPU hit is only before/after the runs, not during each and
> every iteration.

Right.  Plus, the original mail mentioned that it was hitting all 8 
CPUs, which is a pretty good example of braindead scheduler behaviour.

		-ben


* Re: Context switch times
  2001-10-04 21:55     ` Benjamin LaHaise
@ 2001-10-04 22:35       ` Davide Libenzi
  2001-10-04 22:49         ` Mike Kravetz
  2001-10-04 22:42       ` Linus Torvalds
  1 sibling, 1 reply; 61+ messages in thread
From: Davide Libenzi @ 2001-10-04 22:35 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: David S. Miller, rgooch, arjan, kravetz, linux-kernel

On Thu, 4 Oct 2001, Benjamin LaHaise wrote:

> On Thu, Oct 04, 2001 at 02:52:39PM -0700, David S. Miller wrote:
> > So the FPU hit is only before/after the runs, not during each and
> > every iteration.
>
> Right.  Plus, the original mail mentioned that it was hitting all 8
> CPUs, which is a pretty good example of braindead scheduler behaviour.

There was a discussion about process spinning among idle CPUs a couple of
months ago.
Mike, did you code the patch that sticks the task to an idle CPU between the
send-IPI and the idle wakeup?
At that time we simply left the issue unaddressed.



- Davide




* Re: Context switch times
  2001-10-04 21:55     ` Benjamin LaHaise
  2001-10-04 22:35       ` Davide Libenzi
@ 2001-10-04 22:42       ` Linus Torvalds
  2001-10-04 22:53         ` Benjamin LaHaise
  2001-10-04 23:41         ` Mike Kravetz
  1 sibling, 2 replies; 61+ messages in thread
From: Linus Torvalds @ 2001-10-04 22:42 UTC (permalink / raw)
  To: linux-kernel


In article <20011004175526.C18528@redhat.com>,
Benjamin LaHaise  <bcrl@redhat.com> wrote:
>On Thu, Oct 04, 2001 at 02:52:39PM -0700, David S. Miller wrote:
>> So the FPU hit is only before/after the runs, not during each and
>> every iteration.
>
>Right.  Plus, the original mail mentioned that it was hitting all 8 
>CPUs, which is a pretty good example of braindead scheduler behaviour.

Careful.

That's not actually true (the braindead part, that is).

We went through this with Ingo and Larry McVoy, and the sad fact is that
to get the best numbers for lmbench, you simply have to do the wrong
thing. 

Could we try to hit just two? Probably, but it doesn't really matter,
though: to make the lmbench scheduler benchmark go at full speed, you
want to limit it to _one_ CPU, which is not sensible in real-life
situations.  The amount of concurrency in the context switching
benchmark is pretty small, and does not make up for bouncing the locks
etc between CPU's. 

However, that lack of concurrency in lmbench is totally due to the
artificial nature of the benchmark, and the bigger-footprint scheduling
stuff (that isn't reported very much in the summary) is more realistic.

So 2.4.x took the (painful) road of saying that we care less about that
particular benchmark than about some other more realistic loads. 

		Linus


* Re: Context switch times
  2001-10-04 22:35       ` Davide Libenzi
@ 2001-10-04 22:49         ` Mike Kravetz
  0 siblings, 0 replies; 61+ messages in thread
From: Mike Kravetz @ 2001-10-04 22:49 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Benjamin LaHaise, David S. Miller, rgooch, arjan, linux-kernel

On Thu, Oct 04, 2001 at 03:35:35PM -0700, Davide Libenzi wrote:
> > Right.  Plus, the original mail mentioned that it was hitting all 8
> > CPUs, which is a pretty good example of braindead scheduler behaviour.
> 
> There was a discussion about process spinning among idle CPUs a couple of
> months ago.
> Mike, did you code the patch that sticks the task to an idle CPU between the
> send-IPI and the idle wakeup?
> At that time we simply left the issue unaddressed.

I believe the 'quick and dirty' patches we came up with substantially
increased context switch times for this benchmark (doubled them).
The reason is that you needed to add IPI time to the context switch
time.  Therefore, I did not actively pursue getting these accepted. :)
It appears that something in the 2.2 scheduler did a much better
job of handling this situation.  I haven't looked at the 2.2 code.
Does anyone know what feature of the 2.2 scheduler was more successful
in keeping tasks on the CPUs on which they previously executed?
Also, why was that code removed from 2.4?  I can research, but I
suspect someone here has firsthand knowledge.

-- 
Mike Kravetz                                  kravetz@us.ibm.com
IBM Peace, Love and Linux Technology Center


* Re: Context switch times
  2001-10-04 22:42       ` Linus Torvalds
@ 2001-10-04 22:53         ` Benjamin LaHaise
  2001-10-05 15:13           ` Alan Cox
  2001-10-04 23:41         ` Mike Kravetz
  1 sibling, 1 reply; 61+ messages in thread
From: Benjamin LaHaise @ 2001-10-04 22:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Thu, Oct 04, 2001 at 10:42:37PM +0000, Linus Torvalds wrote:
> Could we try to hit just two? Probably, but it doesn't really matter,
> though: to make the lmbench scheduler benchmark go at full speed, you
> want to limit it to _one_ CPU, which is not sensible in real-life
> situations.  The amount of concurrency in the context switching
> benchmark is pretty small, and does not make up for bouncing the locks
> etc between CPU's. 

I don't quite agree with you that it doesn't matter.  A lot of tests 
(volanomark, other silly things) show that the current scheduler jumps 
processes from CPU to CPU on SMP boxes far too easily, in addition to the 
lengthy duration of run queue processing when loaded down.  Yes, these 
applications are leaving too many runnable processes around, but that's 
the way some large app server loads behave.  And right now it makes linux 
look bad compared to other OSes.

Yes, low latency is good, but jumping around cpus adds more latency in 
cache misses across slow busses than is needed when the working set is 
already present in the 2MB L2 of your high end server.

		-ben


* Re: Context switch times
  2001-10-04 22:42       ` Linus Torvalds
  2001-10-04 22:53         ` Benjamin LaHaise
@ 2001-10-04 23:41         ` Mike Kravetz
  2001-10-04 23:50           ` Linus Torvalds
                             ` (2 more replies)
  1 sibling, 3 replies; 61+ messages in thread
From: Mike Kravetz @ 2001-10-04 23:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Thu, Oct 04, 2001 at 10:42:37PM +0000, Linus Torvalds wrote:
> Could we try to hit just two? Probably, but it doesn't really matter,
> though: to make the lmbench scheduler benchmark go at full speed, you
> want to limit it to _one_ CPU, which is not sensible in real-life
> situations.

Can you clarify?  I agree that tuning the system for the best LMbench
performance is not a good thing to do!  However, in general on an
8 CPU system with only 2 'active' tasks I would think limiting the
tasks to 2 CPUs would be desirable for cache effects.

I know that running LMbench with 2 active tasks on an 8 CPU system
results in those 2 tasks being 'round-robined' among all 8 CPUs.
Prior analysis leads me to believe the reason for this is due to
IPI latency.  reschedule_idle() chooses the 'best/correct' CPU for
a task to run on, but before schedule() runs on that CPU another
CPU runs schedule() and the result is that the task runs on a
'less desirable' CPU.  The nature of the LMbench scheduler benchmark
makes this occur frequently.  The real question is: how often
does this happen in real-life situations?
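[Editorial note: the interleaving Mike describes can be sketched as a toy sequential model. All names here are invented for illustration; this is not the 2.4 scheduler's code, only the ordering it permits: reschedule_idle() picks a preferred CPU and sends an IPI, but another CPU reaches schedule() first and takes the task.]

```c
/* Toy model of the IPI-latency race: whichever CPU runs schedule()
 * first claims the woken task, preferred target or not. */
#include <assert.h>

#define NONE (-1)

static int runnable_task = NONE;   /* single task on the runqueue */

/* Wake the task and target its previous CPU; the IPI takes time to
 * arrive, during which other CPUs may enter schedule(). */
static void reschedule_idle(int task, int preferred_cpu)
{
    (void)preferred_cpu;           /* IPI in flight; nothing runs yet */
    runnable_task = task;
}

/* schedule() on some CPU: first caller takes the runnable task. */
static int schedule_on(int cpu)
{
    (void)cpu;
    int t = runnable_task;
    runnable_task = NONE;
    return t;
}
```

Under this model, if CPU 5 happens to call schedule_on() before the IPI target CPU 2 does, the task migrates and CPU 2 finds an empty runqueue, which is exactly the round-robin effect observed in the benchmark.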

-- 
Mike


* Re: Context switch times
  2001-10-04 23:41         ` Mike Kravetz
@ 2001-10-04 23:50           ` Linus Torvalds
  2001-10-05 15:15             ` Eric W. Biederman
  2001-10-04 23:56           ` Davide Libenzi
  2001-10-05  0:45           ` Andrea Arcangeli
  2 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2001-10-04 23:50 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: linux-kernel


On Thu, 4 Oct 2001, Mike Kravetz wrote:

> On Thu, Oct 04, 2001 at 10:42:37PM +0000, Linus Torvalds wrote:
> > Could we try to hit just two? Probably, but it doesn't really matter,
> > though: to make the lmbench scheduler benchmark go at full speed, you
> > want to limit it to _one_ CPU, which is not sensible in real-life
> > situations.
>
> Can you clarify?  I agree that tuning the system for the best LMbench
> performance is not a good thing to do!  However, in general on an
> 8 CPU system with only 2 'active' tasks I would think limiting the
> tasks to 2 CPUs would be desirable for cache effects.

Yes, limiting to 2 CPU's probably gets better cache behaviour, and it
might be worth looking into why it doesn't. The CPU affinity _should_
prioritize it down to two, but I haven't thought through your theory about
IPI latency.

However, the reason 2.2.x does so well is that in 2.2.x it will stay on
_one_ CPU if I remember correctly. We basically tuned the scheduler for
lmbench, and not much else.

		Linus



* Re: Context switch times
  2001-10-04 23:41         ` Mike Kravetz
  2001-10-04 23:50           ` Linus Torvalds
@ 2001-10-04 23:56           ` Davide Libenzi
  2001-10-05  0:45           ` Andrea Arcangeli
  2 siblings, 0 replies; 61+ messages in thread
From: Davide Libenzi @ 2001-10-04 23:56 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Linus Torvalds, lkml

On Thu, 4 Oct 2001, Mike Kravetz wrote:

> On Thu, Oct 04, 2001 at 10:42:37PM +0000, Linus Torvalds wrote:
> > Could we try to hit just two? Probably, but it doesn't really matter,
> > though: to make the lmbench scheduler benchmark go at full speed, you
> > want to limit it to _one_ CPU, which is not sensible in real-life
> > situations.
>
> Can you clarify?  I agree that tuning the system for the best LMbench
> performance is not a good thing to do!  However, in general on an
> 8 CPU system with only 2 'active' tasks I would think limiting the
> tasks to 2 CPUs would be desirable for cache effects.
>
> I know that running LMbench with 2 active tasks on an 8 CPU system
> results in those 2 tasks being 'round-robined' among all 8 CPUs.
> Prior analysis leads me to believe the reason for this is due to
> IPI latency.  reschedule_idle() chooses the 'best/correct' CPU for
> a task to run on, but before schedule() runs on that CPU another
> CPU runs schedule() and the result is that the task runs on a
> 'less desirable' CPU.  The nature of the LMbench scheduler benchmark
> makes this occur frequently.  The real question is: how often
> does this happen in real-life situations?

Well, if you remember, the first time this issue was discussed on the
mailing list it was due to a real-life situation, not a benchmark run.




- Davide




* Re: Context switch times
  2001-10-04 23:41         ` Mike Kravetz
  2001-10-04 23:50           ` Linus Torvalds
  2001-10-04 23:56           ` Davide Libenzi
@ 2001-10-05  0:45           ` Andrea Arcangeli
  2001-10-05  4:35             ` Mike Kravetz
  2 siblings, 1 reply; 61+ messages in thread
From: Andrea Arcangeli @ 2001-10-05  0:45 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Linus Torvalds, linux-kernel

On Thu, Oct 04, 2001 at 04:41:02PM -0700, Mike Kravetz wrote:
> I know that running LMbench with 2 active tasks on an 8 CPU system
> results in those 2 tasks being 'round-robined' among all 8 CPUs.
> Prior analysis leads me to believe the reason for this is due to
> IPI latency.  reschedule_idle() chooses the 'best/correct' CPU for
> a task to run on, but before schedule() runs on that CPU another
> CPU runs schedule() and the result is that the task runs on a
> 'less desirable' CPU.  The nature of the LMbench scheduler benchmark

doesn't lmbench wake up only via pipes? Linux uses the sync-wakeup that
avoids reschedule_idle in that case, to serialize the pipe load on the
same cpu.

Andrea


* Re: Context switch times
  2001-10-05  0:45           ` Andrea Arcangeli
@ 2001-10-05  4:35             ` Mike Kravetz
  2001-10-07 17:59               ` Andrea Arcangeli
  0 siblings, 1 reply; 61+ messages in thread
From: Mike Kravetz @ 2001-10-05  4:35 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-kernel

On Fri, Oct 05, 2001 at 02:45:26AM +0200, Andrea Arcangeli wrote:
> doesn't lmbench wake up only via pipes? Linux uses the sync-wakeup that
> avoids reschedule_idle in that case, to serialize the pipe load on the
> same cpu.

That's what I thought too.  However, kernel profile data of a
lmbench run on 2.4.10 reveals that the pipe routines only call
the non-synchronous form of wake_up.  I believe I reached the
same conclusion in the 2.4.7 time frame by instrumenting this
code.

-- 
Mike


* Re: Context switch times
  2001-10-04 22:53         ` Benjamin LaHaise
@ 2001-10-05 15:13           ` Alan Cox
  2001-10-05 17:49             ` george anzinger
                               ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Alan Cox @ 2001-10-05 15:13 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Linus Torvalds, linux-kernel

> I don't quite agree with you that it doesn't matter.  A lot of tests 
> (volanomark, other silly things) show that the current scheduler jumps 
> processes from CPU to CPU on SMP boxes far too easily, in addition to the 
> lengthy duration of run queue processing when loaded down.  Yes, these 
> applications are leaving too many runnable processes around, but that's 
> the way some large app server loads behave.  And right now it makes linux 
> look bad compared to other OSes.

The current scheduler is completely hopeless by the time you hit 3
processors and a bit flaky on two. It's partly the algorithm and partly
the implementation.

#1	We schedule tasks based on a priority that has no cache
	consideration

#2	We have an O(tasks) loop refreshing priority that isn't needed
	(Montavista fix covers this)

#3	We have an O(runnable) loop which isn't ideal

#4	On x86 we are horribly cache pessimal. All the task structs are
	on the same cache colour. Multiple tasks waiting for the same event
	put their variables (like the wait queue) on the same cache line.

The poor cache performance greatly comes from problem 1. Because we prioritize
blindly on CPU usage with a tiny magic constant fudge factor we are
executing totally wrong task orders. Instead we need to schedule for cache
optimal behaviour, and the moderator is the fairness requirement, not the
other way around.

The classic example is two steady cpu loads and an occasionally waking
client (like an editor)

We end up scheduling (A, B are the loads, C is the editor):

A C B C A C B C A C B C A

whenever a key is hit we swap CPU hog because the priority balance shifted.
Really we should have scheduled something looking more like

A C A C A C B C B C B C B 

I would argue there are two needed priorities

1.	CPU usage based priority - the current one

2.	Task priority.

Task priority being set to maximum when a task sleeps and then bounded by
a function of max(task_priority, fn(cpupriority)) so that tasks fall down
priority levels as their cpu usage grows over the set of time slices. This
causes tasks that are CPU hogs to sit at the bottom of the pile with the
same low task_priority, meaning they run for long bursts and don't get
pre-empted and switched with other hogs through each cycle of the scheduler.

This isn't idle speculation - I've done some minimal playing with this but
my initial re-implementation didn't handle SMP at all and I am still not 100%
sure how to resolve SMP or how SMP will improve out of the current cunning
plan.

Alan
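[Editorial note: Alan's two-priority idea can be sketched as a toy in C. Every name and the fn() mapping below are invented for illustration; this is not the 2.4 scheduler's goodness() calculation, just the shape of the scheme he describes.]

```c
/* Hypothetical sketch of the two-priority scheme: a task priority that
 * snaps to maximum on sleep and decays as slices are used, bounded
 * below by a function of the CPU-usage priority. */
#include <assert.h>

#define MAX_TASK_PRIO 40   /* arbitrary scale for the sketch */

struct task {
    int cpu_prio;    /* 1. CPU usage based priority - the current one */
    int task_prio;   /* 2. task priority, reset on sleep */
};

static int imax(int a, int b) { return a > b ? a : b; }

/* A task that sleeps returns to maximum task priority, so it runs
 * early on wakeup (the editor in the A/B/C example). */
static void on_sleep(struct task *t)
{
    t->task_prio = MAX_TASK_PRIO;
}

/* Each consumed time slice drops the task down a priority level. */
static void on_slice_used(struct task *t)
{
    if (t->task_prio > 0)
        t->task_prio--;
}

/* Effective priority: bounded by fn(cpu_prio), so steady CPU hogs all
 * settle at the same low level and run long bursts back to back
 * instead of pre-empting each other every slice. */
static int effective_prio(const struct task *t)
{
    return imax(t->task_prio, t->cpu_prio / 2);  /* fn() chosen arbitrarily */
}
```

Under this model the occasionally-waking editor C always wakes at maximum priority and preempts, while hogs A and B sink together to fn(cpu_prio) and alternate in full quanta, giving the A C A C A C B C B C B pattern rather than strict hog-swapping.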


* Re: Context switch times
  2001-10-04 23:50           ` Linus Torvalds
@ 2001-10-05 15:15             ` Eric W. Biederman
  0 siblings, 0 replies; 61+ messages in thread
From: Eric W. Biederman @ 2001-10-05 15:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Mike Kravetz, linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

> On Thu, 4 Oct 2001, Mike Kravetz wrote:
> 
> > On Thu, Oct 04, 2001 at 10:42:37PM +0000, Linus Torvalds wrote:
> > > Could we try to hit just two? Probably, but it doesn't really matter,
> > > though: to make the lmbench scheduler benchmark go at full speed, you
> > > want to limit it to _one_ CPU, which is not sensible in real-life
> > > situations.
> >
> > Can you clarify?  I agree that tuning the system for the best LMbench
> > performance is not a good thing to do!  However, in general on an
> > 8 CPU system with only 2 'active' tasks I would think limiting the
> > tasks to 2 CPUs would be desirable for cache effects.
> 
> Yes, limiting to 2 CPU's probably gets better cache behaviour, and it
> might be worth looking into why it doesn't. The CPU affinity _should_
> prioritize it down to two, but I haven't thought through your theory about
> IPI latency.

I don't know what it is but I have seen this excessive cpu switching
in the wild.  In particular on a dual processor machine I ran 2 cpu
intensive jobs, and a handful of daemons.  And the cpu intensive jobs
would switch cpus every couple of seconds.  

I was investigating it because, on the Athlon I was running on, a
customer was getting a super-linear speed up.  With one process it
would take 8 minutes, and with 2 processes one would take 8 minutes
and the other would take 6 minutes.  Very strange.

These processes, except at their very beginning, did no I/O and were
pure cpu hogs until they spit out their results.  Very puzzling.
I can't see why we would ever want to take the cache miss penalty of
switching cpus, in this case.

Eric


* Re: Context switch times
  2001-10-05 15:13           ` Alan Cox
@ 2001-10-05 17:49             ` george anzinger
  2001-10-05 22:29               ` Alan Cox
  2001-10-06  2:24             ` Albert D. Cahalan
  2001-10-07  9:57             ` Mika Liljeberg
  2 siblings, 1 reply; 61+ messages in thread
From: george anzinger @ 2001-10-05 17:49 UTC (permalink / raw)
  To: Alan Cox; +Cc: Benjamin LaHaise, Linus Torvalds, linux-kernel

Alan Cox wrote:
> 
> > I don't quite agree with you that it doesn't matter.  A lot of tests
> > (volanomark, other silly things) show that the current scheduler jumps
> > processes from CPU to CPU on SMP boxes far too easily, in addition to the
> > lengthy duration of run queue processing when loaded down.  Yes, these
> > applications are leaving too many runnable processes around, but that's
> > the way some large app server loads behave.  And right now it makes linux
> > look bad compared to other OSes.
> 
> The current scheduler is completely hopeless by the time you hit 3
> processors and a bit flaky on two. Its partly the algorithm and partly
> implementation
> 
> #1      We schedule tasks based on a priority that has no cache
>         consideration
> 
> #2      We have an O(tasks) loop refreshing priority that isn't needed
>         (Montavista fix covers this)
> 
> #3      We have an O(runnable) loop which isn't ideal
> 
> #4      On x86 we are horribly cache pessimal. All the task structs are
>         on the same cache colour. Multiple tasks waiting for the same event
>         put their variables (like the wait queue) on the same cache line.
> 
> The poor cache performance greatly comes from problem 1. Because we prioritize
> blindly on CPU usage with a tiny magic constant fudge factor we are
> executing totally wrong task orders. Instead we need to schedule for cache
> optimal behaviour, and the moderator is the fairness requirement, not the
> other way around.
> 
> The classic example is two steady cpu loads and an occasionally waking
> client (like an editor)
> 
> We end up scheduling (A, B are the loads, C is the editor):
> 
> A C B C A C B C A C B C A
> 
> whenever a key is hit we swap CPU hog because the priority balance shifted.
> Really we should have scheduled something looking more like
> 
> A C A C A C B C B C B C B
> 
> I would argue there are two needed priorities
> 
> 1.      CPU usage based priority - the current one
> 
> 2.      Task priority.
> 
> Task priority being set to maximum when a task sleeps and then bounded by
> a function of max(task_priority, fn(cpupriority)) so that tasks fall down
> priority levels as their cpu usage grows over the set of time slices. This
> causes tasks that are CPU hogs to sit at the bottom of the pile with the
> same low task_priority, meaning they run for long bursts and don't get
> pre-empted and switched with other hogs through each cycle of the scheduler.

Let me see if I have this right.  Task priority goes to max on any (?)
sleep regardless of how long.  And to min if it doesn't sleep for some
period of time.  Where does the time slice counter come into this, if at
all?  

For what its worth I am currently updating the MontaVista scheduler so,
I am open to ideas.

George
> 
> This isn't idle speculation - I've done some minimal playing with this but
> my initial re-implementation didn't handle SMP at all and I am still not 100%
> sure how to resolve SMP or how SMP will improve out of the current cunning
> plan.
> 
> Alan


* Re: Context switch times
  2001-10-05 17:49             ` george anzinger
@ 2001-10-05 22:29               ` Alan Cox
  2001-10-05 22:56                 ` Davide Libenzi
  2001-10-07  1:20                 ` george anzinger
  0 siblings, 2 replies; 61+ messages in thread
From: Alan Cox @ 2001-10-05 22:29 UTC (permalink / raw)
  To: george anzinger; +Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel

> Let me see if I have this right.  Task priority goes to max on any (?)
> sleep regardless of how long.  And to min if it doesn't sleep for some
> period of time.  Where does the time slice counter come into this, if at
> all?  
> 
> For what its worth I am currently updating the MontaVista scheduler so,
> I am open to ideas.

The time slice counter is the limit on the amount of time you can execute,
the priority determines who runs first.

So if you used your cpu quota you will get run reluctantly. If you slept
you will get run early and as you use up your time slice count you will drop
priority bands, but without pre-emption until you cross a band and there
is another task with higher priority.

This damps down task thrashing a bit, and for the cpu hogs it gets the
desired behaviour - which is that they all run their full quantum in the
background one after another instead of thrashing back and forth.


* Re: Context switch times
  2001-10-05 22:29               ` Alan Cox
@ 2001-10-05 22:56                 ` Davide Libenzi
  2001-10-05 23:04                   ` Alan Cox
  2001-10-07  1:20                 ` george anzinger
  1 sibling, 1 reply; 61+ messages in thread
From: Davide Libenzi @ 2001-10-05 22:56 UTC (permalink / raw)
  To: Alan Cox; +Cc: george anzinger, Benjamin LaHaise, Linus Torvalds, linux-kernel

On Fri, 5 Oct 2001, Alan Cox wrote:

> > Let me see if I have this right.  Task priority goes to max on any (?)
> > sleep regardless of how long.  And to min if it doesn't sleep for some
> > period of time.  Where does the time slice counter come into this, if at
> > all?
> >
> > For what its worth I am currently updating the MontaVista scheduler so,
> > I am open to ideas.
>
> The time slice counter is the limit on the amount of time you can execute,
> the priority determines who runs first.
>
> So if you used your cpu quota you will get run reluctantly. If you slept
> you will get run early and as you use up your time slice count you will drop
> priority bands, but without pre-emption until you cross a band and there
> is another task with higher priority.
>
> This damps down task thrashing a bit, and for the cpu hogs it gets the
> desired behaviour - which is that they all run their full quantum in the
> background one after another instead of thrashing back and forth

What if we give to  prev  a priority boost P=F(T) where T is the time
prev  ran before the current schedule?



- Davide




* Re: Context switch times
  2001-10-05 22:56                 ` Davide Libenzi
@ 2001-10-05 23:04                   ` Alan Cox
  2001-10-05 23:16                     ` Davide Libenzi
  2001-10-05 23:43                     ` Roger Larsson
  0 siblings, 2 replies; 61+ messages in thread
From: Alan Cox @ 2001-10-05 23:04 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Alan Cox, george anzinger, Benjamin LaHaise, Linus Torvalds,
	linux-kernel

> > This damps down task thrashing a bit, and for the cpu hogs it gets the
> > desired behaviour - which is that they all run their full quantum in the
> > background one after another instead of thrashing back and forth
> 
> What if we give to  prev  a priority boost P=F(T) where T is the time
> prev  ran before the current schedule?

That would be the wrong key. You can certainly argue that it is maybe
appropriate to use some function based on remaining scheduler ticks, but
that already occurs, as the scheduler tick count is the upper bound for the
priority band.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-05 23:04                   ` Alan Cox
@ 2001-10-05 23:16                     ` Davide Libenzi
  2001-10-05 23:17                       ` Alan Cox
  2001-10-05 23:43                     ` Roger Larsson
  1 sibling, 1 reply; 61+ messages in thread
From: Davide Libenzi @ 2001-10-05 23:16 UTC (permalink / raw)
  To: Alan Cox
  Cc: Davide Libenzi, george anzinger, Benjamin LaHaise,
	Linus Torvalds, linux-kernel

On Sat, 6 Oct 2001, Alan Cox wrote:

> > > This damps down task thrashing a bit, and for the cpu hogs it gets the
> > > desired behaviour - which is that the all run their full quantum in the
> > > background one after another instead of thrashing back and forth
> >
> > What if we give to  prev  a priority boost P=F(T) where T is the time
> > prev  is ran before the current schedule ?
>
> That would be the wrong key. You can argue certainly that it is maybe
> appropriate to use some function based on remaining scheduler ticks, but
> that already occurs as the scheduler ticks is the upper bound for priority
> band

No, I mean T = (Tend - Tstart) where :

Tstart = time the current ( prev ) task was scheduled
Tend   = current time ( in schedule() )

Basically it's the total time the current ( prev ) task has had the CPU



- Davide



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-05 23:16                     ` Davide Libenzi
@ 2001-10-05 23:17                       ` Alan Cox
  2001-10-05 23:21                         ` Davide Libenzi
  0 siblings, 1 reply; 61+ messages in thread
From: Alan Cox @ 2001-10-05 23:17 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Alan Cox, Davide Libenzi, george anzinger, Benjamin LaHaise,
	Linus Torvalds, linux-kernel

> No, I mean T = (Tend - Tstart) where :
> 
> Tstart = time the current ( prev ) task was scheduled
> Tend   = current time ( in schedule() )
> 
> Basically it's the total time the current ( prev ) task has had the CPU

Ok let me ask one question - why ?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-05 23:17                       ` Alan Cox
@ 2001-10-05 23:21                         ` Davide Libenzi
  0 siblings, 0 replies; 61+ messages in thread
From: Davide Libenzi @ 2001-10-05 23:21 UTC (permalink / raw)
  To: Alan Cox
  Cc: Davide Libenzi, george anzinger, Benjamin LaHaise,
	Linus Torvalds, linux-kernel

On Sat, 6 Oct 2001, Alan Cox wrote:

> > No, I mean T = (Tend - Tstart) where :
> >
> > Tstart = time the current ( prev ) task was scheduled
> > Tend   = current time ( in schedule() )
> >
> > Basically it's the total time the current ( prev ) task has had the CPU
>
> Ok let me ask one question - why ?

Because the longer the task has run, the larger its cache footprint, and
the larger the cache footprint, the more we'd like to pick up the exiting
task to rerun.




- Davide



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-05 23:04                   ` Alan Cox
  2001-10-05 23:16                     ` Davide Libenzi
@ 2001-10-05 23:43                     ` Roger Larsson
  1 sibling, 0 replies; 61+ messages in thread
From: Roger Larsson @ 2001-10-05 23:43 UTC (permalink / raw)
  To: Alan Cox, Davide Libenzi
  Cc: Alan Cox, george anzinger, Benjamin LaHaise, Linus Torvalds,
	linux-kernel

On Saturday 06 October 2001 01:04, Alan Cox wrote:
> > > This damps down task thrashing a bit, and for the cpu hogs it gets the
> > > desired behaviour - which is that the all run their full quantum in the
> > > background one after another instead of thrashing back and forth
> >
> > What if we give to  prev  a priority boost P=F(T) where T is the time
> > prev  is ran before the current schedule ?
>
> That would be the wrong key. You can argue certainly that it is maybe
> appropriate to use some function based on remaining scheduler ticks, but
> that already occurs as the scheduler ticks is the upper bound for priority
> band

How about a LIFO list for each processor where preempted (count != 0) tasks
go?

When a preemption occurs, the current task goes onto the LIFO.
When a process has run the whole of its time slot, it can be moved to a
used-up queue. (No point in keeping it on the generic run queue, but remember
it when giving out new ticks; this could also be on a per-CPU basis.)
Now select what to do next..
The first task on the LIFO can be moved in as "current" before checking the
rest of the run queue...

Pros:
+ The LIFO will be sorted with highest prio on top - no need to search 'em all
+ a process that starts a time slot on one CPU stays the whole slot.

Cons:
- A process might get stuck waiting for the processor in the LIFO while the
  other CPU is idle (but that was the point, wasn't it...)

/RogerL

-- 
Roger Larsson
Skellefteå
Sweden

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-05 15:13           ` Alan Cox
  2001-10-05 17:49             ` george anzinger
@ 2001-10-06  2:24             ` Albert D. Cahalan
  2001-10-06  2:57               ` Davide Libenzi
  2001-10-07  9:57             ` Mika Liljeberg
  2 siblings, 1 reply; 61+ messages in thread
From: Albert D. Cahalan @ 2001-10-06  2:24 UTC (permalink / raw)
  To: Alan Cox; +Cc: Benjamin LaHaise, Linus Torvalds, linux-kernel

Alan Cox writes:
> [somebody]

>> I don't quite agree with you that it doesn't matter.  A lot of tests
>> (volanomark, other silly things) show that the current scheduler jumps
>> processes from CPU to CPU on SMP boxes far too easily

Be careful that your viewer doesn't disturb the system.
Please don't even consider using "top" for this.

> #4	On x86 we are horribly cache pessimal. All the task structs are
> 	on the same cache colour. Multiple tasks waiting for the same event
> 	put their variables (like the wait queue) on the same cache line.

If cache problems are bad enough, maybe this pays for itself:

/* old: task_struct sits at the base of the 8KB stack, same colour for all */
current = stack_ptr & ~0x1fff;

/* new: offset it within the stack by a hash of the stack address */
hash = (stack_ptr>>8)^(stack_ptr>>12)^(stack_ptr>>16)^(stack_ptr>>20);
current = (stack_ptr & ~0x1fff) | (hash & 0x1e0);

> The classic example is two steady cpu loads and an occasionally waking
> client (like an editor)

Like "top"!

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-06  2:24             ` Albert D. Cahalan
@ 2001-10-06  2:57               ` Davide Libenzi
  0 siblings, 0 replies; 61+ messages in thread
From: Davide Libenzi @ 2001-10-06  2:57 UTC (permalink / raw)
  To: Albert D. Cahalan
  Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel

On Fri, 5 Oct 2001, Albert D. Cahalan wrote:

> If cache problems are bad enough, maybe this pays for itself:
>
> /* old */
> current = stack_ptr & ~0x1fff;
>
> /* new */
> hash = (stack_ptr>>8)^(stack_ptr>>12)^(stack_ptr>>16)^(stack_ptr>>20);
> current = (stack_ptr & ~0x1fff) | (hash & 0x1e0);

I suggested something like this ( randomly "moved" struct task_struct * )
a couple of years ago.
Never tried to measure the impact of the system though.



- Davide



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-05 22:29               ` Alan Cox
  2001-10-05 22:56                 ` Davide Libenzi
@ 2001-10-07  1:20                 ` george anzinger
  2001-10-07  1:33                   ` Bill Davidsen
  2001-10-07  9:56                   ` Alan Cox
  1 sibling, 2 replies; 61+ messages in thread
From: george anzinger @ 2001-10-07  1:20 UTC (permalink / raw)
  To: Alan Cox; +Cc: Benjamin LaHaise, Linus Torvalds, linux-kernel

Alan Cox wrote:
> 
> > Let me see if I have this right.  Task priority goes to max on any (?)
> > sleep regardless of how long.  And to min if it doesn't sleep for some
> > period of time.  Where does the time slice counter come into this, if at
> > all?
> >
> > For what its worth I am currently updating the MontaVista scheduler so,
> > I am open to ideas.
> 
> The time slice counter is the limit on the amount of time you can execute,
> the priority determines who runs first.
> 
> So if you used your cpu quota you will get run reluctantly. If you slept
> you will get run early and as you use time slice count you will drop
> priority bands, but without pre-emption until you cross a band and there
> is another task with higher priority.
> 
> This damps down task thrashing a bit, and for the cpu hogs it gets the
> desired behaviour - which is that the all run their full quantum in the
> background one after another instead of thrashing back and forth

If I understand this, you are decreasing the priority of a task that is
running as it consumes its slice.  In the current way of doing things
this happens and is noticed when the scheduler gets called.  A task can
lose its place by being preempted by some quick I/O-bound task.  I.e., A
and B are cpu hogs with A having the cpu; if task Z comes along and runs
for 100 microseconds, A finds itself replaced by B, whereas if Z does
not come along, A will complete its slice.  It seems to me that A should
be first in line after Z and not bumped just because it has used some of
its slice.  This would argue for priority being set at slice renewal
time and left alone until the task blocks or completes its slice.

George

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-07  1:20                 ` george anzinger
@ 2001-10-07  1:33                   ` Bill Davidsen
  2001-10-07  9:56                   ` Alan Cox
  1 sibling, 0 replies; 61+ messages in thread
From: Bill Davidsen @ 2001-10-07  1:33 UTC (permalink / raw)
  To: linux-kernel

On Sat, 6 Oct 2001, george anzinger wrote:

> Alan Cox wrote:

> > So if you used your cpu quota you will get run reluctantly. If you slept
> > you will get run early and as you use time slice count you will drop
> > priority bands, but without pre-emption until you cross a band and there
> > is another task with higher priority.
> > 
> > This damps down task thrashing a bit, and for the cpu hogs it gets the
> > desired behaviour - which is that the all run their full quantum in the
> > background one after another instead of thrashing back and forth
> 
> If I understand this, you are decreasing the priority of a task that is
> running as it consumes its slice.  In the current way of doing things
> this happens and is noticed when the scheduler gets called.  A task can
> loose its place by being preempted by some quick I/O bound task.  I.e. A
> and B are cpu hogs with A having the cpu, if task Z comes along and runs
> for 100 micro seconds, A finds itself replace by B, where as if Z does
> not come along, A will complete its slice.  It seems to me that A should
> be first in line after Z and not bumped just because it has used some of
> its slice.  This would argue for priority being set at slice renewal
> time and left alone until the task blocks or completes its slice.

  It certainly argues for continuing to run the process with the unexpired
timeslice, because it is most likely to have current data in caches of
every type.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-07  1:20                 ` george anzinger
  2001-10-07  1:33                   ` Bill Davidsen
@ 2001-10-07  9:56                   ` Alan Cox
  1 sibling, 0 replies; 61+ messages in thread
From: Alan Cox @ 2001-10-07  9:56 UTC (permalink / raw)
  To: george anzinger; +Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel

> its slice.  This would argue for priority being set at slice renewal
> time and left alone until the task blocks or completes its slice.

That may well be the case. The banding I played with was effectively
doubling the task switch rate for more interactive tasks versus the bottom
feeders.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-05 15:13           ` Alan Cox
  2001-10-05 17:49             ` george anzinger
  2001-10-06  2:24             ` Albert D. Cahalan
@ 2001-10-07  9:57             ` Mika Liljeberg
  2001-10-07 13:03               ` Ingo Oeser
                                 ` (2 more replies)
  2 siblings, 3 replies; 61+ messages in thread
From: Mika Liljeberg @ 2001-10-07  9:57 UTC (permalink / raw)
  To: Alan Cox; +Cc: Benjamin LaHaise, Linus Torvalds, linux-kernel

Alan Cox wrote:
> This isnt idle speculation - I've done some minimal playing with this but
> my initial re-implementation didnt handle SMP at all and I am still not 100%
> sure how to resolve SMP or how SMP will improve out of the current cunning
> plan.

Here's some idle speculation on SMP to top it off. :) I tend to think
that the load balancing between CPUs should be a completely separate
algorithm and should not necessarily be run at every schedule(). The
idea is to completely decouple the problem of scheduling a single CPU
between tasks from the problem of load balancing between the CPUs, making
each problem simpler to solve.

Consider the following basic rules:

A) When a new task comes along, pick the "least loaded" CPU and lock the
new task onto that.
B) Whenever the load imbalance between least loaded CPU and most loaded
CPU becomes too great, move one or more tasks from most loaded CPU to
the least loaded CPU.

The rules themselves should be self-explanatory: A provides initial load
balancing, while B tries to keep the balance (with a sensible hysteresis
to avoid thrashing). However, there are a few minor details to solve:

1) How to determine the load of a CPU? If we can quantify this clearly,
we can easily set a hysteresis level to trigger load balancing between
two CPUs.
2) When and how often to check for load imbalance?
3) How to select the task(s) that should be moved between two CPUs to
correct an imbalance?

For problems 1 and 2 I propose the following solution: insert the
load balancing routine itself as a (fake) task on each CPU and run it
when the CPU gets around to it. The load balancer should behave almost
like a CPU-bound task, scheduled on the lowest priority level with other
runnable tasks. The last bit is important: the load balancer should not
be allowed to starve but should be invoked approximately once every
"full rotation" of the scheduler.

With the above it is easy to estimate the load of a CPU. We can simply
use the elapsed time between two invocations of the load balancer task.
When the load balancer task of a particular CPU gets run, it chalks up
the elapsed time on a score board somewhere, and checks whether there is
a significant imbalance between itself and some other CPU. If there is,
it commences to move some tasks between itself and the other CPU (note
rule B, though, it should be enough to mess with just two CPU queues at
a time to minimize balancing and locking overhead).

Problem 3 is tricky. Basically, there should be a cost/benefit function
F(tasks to move) that should be minimized. Ideally F(task_i), the
cost/benefit of moving a single task, would be calculated as a byproduct
of the CPU scheduler algorithm. 

F(task_i) might be function of elapsed time since task_i was last
scheduled and the average time slice used by task_i, to account for the
probable cache hit. This would leave it up to the load balancer to move
as many lowest cost tasks to a new CPU as is needed to correct the
imbalance (average time slices used by each task would be needed in
order to make this decision).

Naturally, some additional rules might be necessary to make a task
eligible for moving, e.g., never move the only/last CPU bound task to
another CPU. In addition, it might actually make sense to move at most
one task at each invocation of the load balancer, to further reduce the
probability of thrashing. The load would still converge fairly quickly
towards a balanced state. It would also scale fairly well with the
number of CPUs.

How does that sound?

	MikaL

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-07  9:57             ` Mika Liljeberg
@ 2001-10-07 13:03               ` Ingo Oeser
  2001-10-07 13:48                 ` Mika Liljeberg
  2001-10-07 18:39               ` Davide Libenzi
  2001-10-09 20:37               ` Hubertus Franke
  2 siblings, 1 reply; 61+ messages in thread
From: Ingo Oeser @ 2001-10-07 13:03 UTC (permalink / raw)
  To: Mika Liljeberg; +Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel

On Sun, Oct 07, 2001 at 12:57:29PM +0300, Mika Liljeberg wrote:
> For problems 1 and 2 I propose the following solution: Insert the the
> load balancing routine itself as a (fake) task on each CPU and run it
> when the CPU gets around to it. The load balancer should behave almost
> like a CPU-bound task, scheduled on the lowest priority level with other
> runnable tasks. 

The idle-task might be (ab-)used for this, because it has perfect
data for this.

T_SystemElapsed - T_IdleTaskRun = T_CPULoaded

Balancing could be done in schedule() itself, after checking this
value for each CPU.

> The last bit is important: the load balancer should not
> be allowed to starve but should be invoked approximately once every
> "full rotation" of the scheduler.
 
If an artificial CPU hog is used for this task, the idle task will
never run and power savings in the CPU are impossible.

Other parts sound interesting.

Regards

Ingo Oeser
-- 
.... Our continuing mission: To seek out knowledge of C, to explore
strange UNIX commands, and to boldly code where no one has man page 4.
                        --- Mike A. Harris <mharris@meteng.on.ca>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-07 13:03               ` Ingo Oeser
@ 2001-10-07 13:48                 ` Mika Liljeberg
  2001-10-07 14:24                   ` Ingo Oeser
  0 siblings, 1 reply; 61+ messages in thread
From: Mika Liljeberg @ 2001-10-07 13:48 UTC (permalink / raw)
  To: Ingo Oeser; +Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel

Ingo Oeser wrote:
> 
> On Sun, Oct 07, 2001 at 12:57:29PM +0300, Mika Liljeberg wrote:
> > For problems 1 and 2 I propose the following solution: Insert the the
> > load balancing routine itself as a (fake) task on each CPU and run it
> > when the CPU gets around to it. The load balancer should behave almost
> > like a CPU-bound task, scheduled on the lowest priority level with other
> > runnable tasks.
> 
> The idle-task might be (ab-)used for this, because it has perfect
> data for this.
> 
> T_SystemElapsed - T_IdleTaskRun = T_CPULoaded

The problem here is that the idle task doesn't get run at all if there
are runnable tasks. With a little bit of tweaking, the idle task context
could probably be used for this, I guess. However, I was thinking more
along the lines that the load balancer would not be a real task but just
a routine that would be executed in schedule() when the fake task comes
up (i.e. when the scheduler has done a "full rotation").

> Balancing could be done in schedule() itself, after checking this
> value for each CPU.

The point about separating the load balancer from the scheduler is that
you don't need to run it at every task switch. You can make the
scheduler very simple if you don't have to scan queues looking for
likely switchover candidates. This saves time in context switches.

One interesting property of the load balancer tasks would be that the
less heavily loaded CPUs would tend to execute the load balancer more
often, actively relieving the more heavily loaded CPUs, while those
would concentrate more on getting the job done. Come to think of it, it
could be coded in such a way that only the least loaded CPU would
execute the load balancing algorithm, while the others would simply
chalk up elapsed times.

> > The last bit is important: the load balancer should not
> > be allowed to starve but should be invoked approximately once every
> > "full rotation" of the scheduler.
> 
> If a artificial CPU-hog is used for this task, the idle task will
> never be run and power savings in the CPU are impossible.

Right. Obviously, if a CPU has no runnable tasks and the load balancer
can't acquire any from other CPUs, it should yield to the idle task
rather than hog the CPU.

	MikaL

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-07 13:48                 ` Mika Liljeberg
@ 2001-10-07 14:24                   ` Ingo Oeser
  2001-10-07 14:33                     ` Mika Liljeberg
  0 siblings, 1 reply; 61+ messages in thread
From: Ingo Oeser @ 2001-10-07 14:24 UTC (permalink / raw)
  To: Mika Liljeberg; +Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel

On Sun, Oct 07, 2001 at 04:48:30PM +0300, Mika Liljeberg wrote:
> Ingo Oeser wrote:
> > The idle-task might be (ab-)used for this, because it has perfect
> > data for this.
> > 
> > T_SystemElapsed - T_IdleTaskRun = T_CPULoaded
> 
> The problem here is that the idle task doesn't get run at all if there
> are runnable tasks. 

There is an idle task per CPU. If there are runnable tasks elsewhere
but the idle task of a CPU is running, that CPU is not fully loaded at
the time.

No idle task is running if all CPUs are fully loaded, AFAIR.

> One interesting property of the load balancer tasks would be that the
> less heavily loaded CPUs would tend to execute the load balancer more
> often, actively releaving the more heavily loaded CPUs, while those
> would concentrate more on getting the job done. Come to think of it, it
> could be coded in such a way that only the least loaded CPU would
> execute the load balancing algorithm, while the others would simply
> chalk up elapsed times.
 
So you suggest only one load balancer task jumping from CPU to
CPU. I misunderstood it. _Only_ chalking up could certainly be done by
the idle tasks.

Regards

Ingo Oeser

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-07 14:24                   ` Ingo Oeser
@ 2001-10-07 14:33                     ` Mika Liljeberg
  2001-10-07 18:00                       ` george anzinger
  2001-10-08 15:19                       ` Context switch times bill davidsen
  0 siblings, 2 replies; 61+ messages in thread
From: Mika Liljeberg @ 2001-10-07 14:33 UTC (permalink / raw)
  To: Ingo Oeser; +Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel

Ingo Oeser wrote:
> There are idle tasks per CPU. If there are runnable tasks and the
> idle-task of a CPU is running it, it is not fully loaded at this
> time.
> 
> No idle task is running, if all CPUs are fully loaded AFAIR.

Yes. However, you still want to balance the queues even if all CPUs are
100% utilized. It's a fairness issue. Otherwise you could have 1 task
running on one CPU and 49 tasks on another.

	MikaL

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-05  4:35             ` Mike Kravetz
@ 2001-10-07 17:59               ` Andrea Arcangeli
  2001-10-07 19:54                 ` george anzinger
  0 siblings, 1 reply; 61+ messages in thread
From: Andrea Arcangeli @ 2001-10-07 17:59 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Linus Torvalds, linux-kernel

On Thu, Oct 04, 2001 at 09:35:07PM -0700, Mike Kravetz wrote:
> [..] the pipe routines only call
> the non-synchronous form of wake_up. [..]

Ok I see, I overlooked that when we don't need to block we wake up async.

So first of all it would be interesting to change the token passed
through the pipe so that it always fills the PAGE_SIZE pipe buffer, so
that the task goes to sleep before reading from the next pipe in the
ring.

And if we want to optimize the current benchmark (with the small token that
triggers the async wakeup and always goes to sleep in read() rather
than write()) we'd need to invalidate a basic point of the scheduler
that has nothing to do with IPI reschedule. The point to invalidate is
that if the current task (the one that is running wake_up()) has a small
avg_slice, we'd better not reschedule the woken-up task on a new idle cpu
but wait for the current task to go away instead. Unless I'm
missing something, this should fix lmbench in its current implementation.

Andrea

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-07 14:33                     ` Mika Liljeberg
@ 2001-10-07 18:00                       ` george anzinger
  2001-10-07 22:06                         ` Mika Liljeberg
  2001-10-08 15:19                       ` Context switch times bill davidsen
  1 sibling, 1 reply; 61+ messages in thread
From: george anzinger @ 2001-10-07 18:00 UTC (permalink / raw)
  To: Mika Liljeberg
  Cc: Ingo Oeser, Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel

Mika Liljeberg wrote:
> 
> Ingo Oeser wrote:
> > There are idle tasks per CPU. If there are runnable tasks and the
> > idle-task of a CPU is running it, it is not fully loaded at this
> > time.
> >
> > No idle task is running, if all CPUs are fully loaded AFAIR.
> 
> Yes. However, you still want to balance the queues even if all CPUs are
> 100% utilized. It's a fairness issue. Otherwise you could have 1 task
> running on one CPU and 49 tasks on another.
> 
On the face of it, this only seems unfair.  I think what we want to do is
to allocate the cpu resources among competing tasks as dictated by their
NICE values.  If each task gets its allotted share, it should not matter
if one of them is monopolizing one cpu.  This is not to say that things
will work out this way, but that this is not the measure, nor the
thing to look at.

I suggest that, using the "recalculate tick" as a measure of time (I
know it is not very linear), when a cpu finds itself with nothing to do
(because all its tasks have completed their slices or blocked) and other
cpus have tasks in their queues, it is time to "shop" for a new task to
run.  This would then do load balancing just before each "recalculate
tick".

Of course, the above assumes that each cpu has its own run queue.

George

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-07  9:57             ` Mika Liljeberg
  2001-10-07 13:03               ` Ingo Oeser
@ 2001-10-07 18:39               ` Davide Libenzi
  2001-10-09 20:37               ` Hubertus Franke
  2 siblings, 0 replies; 61+ messages in thread
From: Davide Libenzi @ 2001-10-07 18:39 UTC (permalink / raw)
  To: Mika Liljeberg; +Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel

On Sun, 7 Oct 2001, Mika Liljeberg wrote:

> Here's some idle speculation on SMP to top it off. :) I tend to think
> that the load balancing between CPUs should be a completely separate
> algorithm and should not necessarily be run at every schedule().
[...]
> 1) How to determine the load of a CPU? If we can quantify this clearly,
> we can easily set a hysteresis level to trigger load balancing between
> two CPUs.
> 2) When and how often to check for load imbalance?
> 3) How to select the task(s) that should be moved between two CPUs to
> correct an imbalance?
[...]
> How does that sound?

To measure the load of a cpu "nr_running" ( per cpu ) is probably the best
choice. Anyway there's some work already done :

http://lse.sourceforge.net/scheduling/LB/poolMQ.html




- Davide



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-07 17:59               ` Andrea Arcangeli
@ 2001-10-07 19:54                 ` george anzinger
  2001-10-07 20:24                   ` Andrea Arcangeli
  0 siblings, 1 reply; 61+ messages in thread
From: george anzinger @ 2001-10-07 19:54 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Mike Kravetz, Linus Torvalds, linux-kernel

Andrea Arcangeli wrote:
> 
> On Thu, Oct 04, 2001 at 09:35:07PM -0700, Mike Kravetz wrote:
> > [..] the pipe routines only call
> > the non-synchronous form of wake_up. [..]
> 
> Ok I see, I overlooked that when we don't need to block we wakeup-async.
> 
> So first of all it would be interesting to change the token passed
> thorugh the pipe so that it always fills in the PAGE_SIZE pipe buffer so
> that the task goes to sleep before reading from the next pipe in the
> ring.
> 
> And if we want to optimize the current benchmark (with the small token that
> triggers the async-wakeup and it always goes to sleep in read() rather
> than write()) we'd need to invalidate a basic point of the scheduler
> that have nothing to do with IPI reschedule, the point to invalidate is
> that if the current task (the one that is running wake_up()) has a small
> avg_slice we'd better not reschedule the wakenup task on a new idle cpu
> but we'd better wait the current task to go away instead. 

A couple of questions: 

1.) Do you want to qualify that the wake_up is not from an interrupt?

2.) Having done this AND the task doesn't block THIS time, do we wait
for the slice to end or is some other "dead man" timer needed?

George


> Unless I'm
> missing something this should fix lmbench in its current implementation.
> 
> Andrea
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-07 19:54                 ` george anzinger
@ 2001-10-07 20:24                   ` Andrea Arcangeli
  0 siblings, 0 replies; 61+ messages in thread
From: Andrea Arcangeli @ 2001-10-07 20:24 UTC (permalink / raw)
  To: george anzinger; +Cc: Mike Kravetz, Linus Torvalds, linux-kernel

On Sun, Oct 07, 2001 at 12:54:08PM -0700, george anzinger wrote:
> Andrea Arcangeli wrote:
> > 
> > On Thu, Oct 04, 2001 at 09:35:07PM -0700, Mike Kravetz wrote:
> > > [..] the pipe routines only call
> > > the non-synchronous form of wake_up. [..]
> > 
> > Ok I see, I overlooked that when we don't need to block we wakeup-async.
> > 
> > So first of all it would be interesting to change the token passed
> > through the pipe so that it always fills the PAGE_SIZE pipe buffer, so
> > that the task goes to sleep before reading from the next pipe in the
> > ring.
> > 
> > And if we want to optimize the current benchmark (with the small token that
> > triggers the async-wakeup and always goes to sleep in read() rather
> > than write()) we'd need to invalidate a basic point of the scheduler
> > that has nothing to do with the IPI reschedule. The point to invalidate is
> > that if the current task (the one that is running wake_up()) has a small
> > avg_slice we'd better not reschedule the woken-up task on a new idle cpu
> > but instead wait for the current task to go away. 
> 
> A couple of questions: 
> 
> 1.) Do you want to qualify that the wake_up is not from an interrupt?

If the avg_slice is very small we know the task will probabilistically
go away soon, regardless of where the wakeup comes from. I don't think
the logic has any connection to the kind of wakeup; it only has a
connection to the kind of "running task".

In fact, waiting for "bash" or "ksoftirqd" to go away is the right thing
to do to fix the spurious migrations of CPU-intensive tasks too, no
matter where the wakeup comes from.

The only risk here is that we never know from the scheduler whether the
running task with the small avg_slice will really go away this time too
or not. And if it doesn't go away this time, we may not make the best
global decision by not migrating the woken-up task to the idle cpus, but
the probability information provided by avg_slice should cover this
problem.

Of course this is theory; I haven't attempted to test it.

btw, I remember Ingo said on l-k some years ago that immediately
rescheduling onto every idle cpu isn't the best thing to do.

> 2.) Having done this AND the task doesn't block THIS time, do we wait
> for the slice to end or is some other "dead man" timer needed?

Of course, waiting for the slice to end anyway is the natural first
step. As said above, the point of the probability information in
avg_slice is to make the "task doesn't block THIS time" case very
unlikely.

Andrea

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-07 18:00                       ` george anzinger
@ 2001-10-07 22:06                         ` Mika Liljeberg
  2001-10-07 22:31                           ` Davide Libenzi
  2001-10-07 23:49                           ` george anzinger
  0 siblings, 2 replies; 61+ messages in thread
From: Mika Liljeberg @ 2001-10-07 22:06 UTC (permalink / raw)
  To: george anzinger
  Cc: Ingo Oeser, Alan Cox, Benjamin LaHaise, Linus Torvalds,
	linux-kernel, Davide Libenzi

george anzinger wrote:
> On the face of it this only seems unfair.  I think what we want to do is
> to allocate the cpu resources among competing tasks as dictated by their
> NICE values.  If each task gets its allotted share it should not matter
> if one of them is monopolizing one cpu.  This is not to say that things
> will work out this way, but to say that this is not the measure, nor the
> thing to look at.

I'm not sure I fully understood what you're driving at, but surely "each
task getting its allotted share" implies that the tasks are correctly
balanced across CPUs? I take your point that if you're only interested
in the total completion time of certain set of tasks, it doesn't matter
how they are divided among CPUs as long as each CPU is crunching at
maximum capacity. However, if you're interested in the completion time
of a particular task (or if the task is mostly CPU bound but with
occasional interaction with the user), you need to provide fairness all
the way through the scheduling process. This, IMHO, implies balancing
the tasks across CPUs.

How the balance is determined is another issue, though. I basically
proposed summing the time slices consumed by tasks executing on a single
CPU and using that as the comparison value. Davide Libenzi, on the other
hand, simply proposed using the number of tasks.

I would contend that if you can evenly divide a particular load across
multiple CPUs, you can then run a pre-emptive scheduler on each CPU's run
queue independently, without regard for the other CPUs, and still come
out with high CPU utilization and reasonable completion times for all
tasks. This has the major advantage of limiting spinlock contention to
the load balancer, instead of risking it at every schedule().

> I suggest that, using the "recalculate tick" as a measure of time (I
> know it is not very linear)

I'll take your word for it as I'm not very familiar with the current
scheduler. The main thing is to do the rebalancing periodically and to
have a reliable way of estimating the loading (not utilization) of each
CPU.

> when a cpu finds itself with nothing to do
> (because all its tasks have completed their slices or blocked) and other
> cpus have tasks in their queues it is time to "shop" for a new task to
> run.

This again has the problem that you only look for new tasks if you have
nothing to do. While this would work if you only look at the total
completion time of a set of tasks, it does nothing to ensure fair
scheduling for a particular task.

>  This would then do load balancing just before each "recalculate
> tick".
> Of course, the above assumes that each cpu has its own run queue.
> 
> George

Regards,

	MikaL

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-07 22:06                         ` Mika Liljeberg
@ 2001-10-07 22:31                           ` Davide Libenzi
  2001-10-07 22:33                             ` Mika Liljeberg
  2001-10-07 23:49                           ` george anzinger
  1 sibling, 1 reply; 61+ messages in thread
From: Davide Libenzi @ 2001-10-07 22:31 UTC (permalink / raw)
  To: Mika Liljeberg
  Cc: george anzinger, Ingo Oeser, Alan Cox, Benjamin LaHaise,
	Linus Torvalds, linux-kernel, Davide Libenzi

On Mon, 8 Oct 2001, Mika Liljeberg wrote:

> How the balance is determined is another issue, though. I basically
> proposed summing the time slices consumed by tasks executing on a single
> CPU and using that as the comparison value. Davide Libenzi, on the other
> hand, simply proposed using the number of tasks.

CPU based number of __running__ tasks.




- Davide



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-07 22:31                           ` Davide Libenzi
@ 2001-10-07 22:33                             ` Mika Liljeberg
  0 siblings, 0 replies; 61+ messages in thread
From: Mika Liljeberg @ 2001-10-07 22:33 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: george anzinger, Ingo Oeser, Alan Cox, Benjamin LaHaise,
	Linus Torvalds, linux-kernel

Davide Libenzi wrote:
> 
> On Mon, 8 Oct 2001, Mika Liljeberg wrote:
> 
> > How the balance is determined is another issue, though. I basically
> > proposed summing the time slices consumed by tasks executing on a single
> > CPU and using that as the comparison value. Davide Libenzi, on the other
> > hand, simply proposed using the number of tasks.
> 
> CPU based number of __running__ tasks.

That's what I meant, sorry for not making it clear.

I suppose in practice this would amount to using the number of _CPU
bound_ running tasks per CPU, since the non-CPU-bound tasks would likely
be waiting rather than running? This would then be essentially
equivalent to (and simpler than) my proposal, i.e. using the sum of the
time slices.

> - Davide

	MikaL

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-07 22:06                         ` Mika Liljeberg
  2001-10-07 22:31                           ` Davide Libenzi
@ 2001-10-07 23:49                           ` george anzinger
  2001-10-08 21:07                             ` Mika Liljeberg
  1 sibling, 1 reply; 61+ messages in thread
From: george anzinger @ 2001-10-07 23:49 UTC (permalink / raw)
  To: Mika Liljeberg
  Cc: Ingo Oeser, Alan Cox, Benjamin LaHaise, Linus Torvalds,
	linux-kernel, Davide Libenzi

Mika Liljeberg wrote:
> 
> george anzinger wrote:
> > On the face of it this only seems unfair.  I think what we want to do is
> > to allocate the cpu resources among competing tasks as dictated by their
> > NICE values.  If each task gets its allotted share it should not matter
> > if one of them is monopolizing one cpu.  This is not to say that things
> > will work out this way, but to say that this is not the measure, nor the
> > thing to look at.
> 
> I'm not sure I fully understood what you're driving at, but surely "each
> task getting its allotted share" implies that the tasks are correctly
> balanced across CPUs? 

Actually I am trying to recast your requirement (for fairness) into
something that is clean and easy to code :)  Keeping track of the number
of tasks a cpu executes over a given period of time, while possible,
does not seem to me to give a very good indication of its load.

> I take your point that if you're only interested
> in the total completion time of certain set of tasks, it doesn't matter
> how they are divided among CPUs as long as each CPU is crunching at
> maximum capacity. However, if you're interested in the completion time
> of a particular task (or if the task is mostly CPU bound but with
> occasional interaction with the user), you need to provide fairness all
> the way through the scheduling process. This, IMHO, implies balancing
> the tasks across CPUs.

I don't dispute balance, I am just trying to suggest a way to do it that
IMHO is relatively easy.
> 
> How the balance is determined is another issue, though. I basically
> proposed summing the time slices consumed by tasks executing on a single
> CPU and using that as the comparison value. Davide Libenzi, on the other
> hand, simply proposed using the number of tasks.

In practice, in a given time (say a "recalculate period") a given cpu
with a given set of tasks will end up doing a fixed amount of work
(provided we don't allow idle tasks).  This could be stated as a sum of
time slices or CPU jiffies if you like.  As long as loads are passed
from cpu to cpu to make them all end a "recalculate period" at the same
time, I would contend we have balance.  Counting tasks is problematic as
tasks come and go and present varying loads.  The real trick is in
figuring which task to move, or if none should be moved.
> 
> I would contend that if you can evenly divide a particular load across
> > multiple CPUs, you can then run a pre-emptive scheduler on each CPU's run
> queue independently, without regard for the other CPUs, and still come
> out with high CPU utilization and reasonable completion times for all
> tasks. This has the major advantage of limiting spinlock contention to
> the load balancer, instead of risking it at every schedule().

Exactly.
> 
> > I suggest that, using the "recalculate tick" as a measure of time (I
> > know it is not very linear)
> 
> I'll take your word for it as I'm not very familiar with the current
> scheduler. The main thing is to do the rebalancing periodically and to
> have a reliable way of estimating the loading (not utilization) of each
> CPU.

The current scheduler periodically calculates new time slices for each
task.  It does this when no task in the run queue has any slice time
left.  The slice size each task gets is related to the NICE value.  In
practice then (other things being equal) each task will get a portion of
the available processing time that is related to its NICE value.  I
think this is fair.  But, of course, other things are not equal.  When a
task is blocked and the recalculation happens, that task gets to keep
1/2 of its remaining slice as well as the new slice.  Since, in addition
to slice time, the remaining slice value is used as a major component of
the task's priority, this means that the blocked task, if it is blocked
during a recalculate, will get a boosted priority.

I think what Alan is suggesting is that the priority boost is needed for
blocked tasks, but not necessarily the increased slice time.
> 
> > when a cpu finds itself with nothing to do
> > (because all its tasks have completed their slices or blocked) and other
> > cpus have tasks in their queues it is time to "shop" for a new task to
> > run.
> 
> This again has the problem that you only look for new tasks if you have
> nothing to do. While this would work if you only look at the total
> completion time of a set of tasks, it does nothing to ensure fair
> scheduling for a particular task.

But doesn't the f(NICE) used as a slice do this?  The point is that we
will not allow a recalculation until all runnable tasks with non-zero
slice times have run their slice times to zero.  This ensures that all
tasks that can use the cpu will get their fair share of time each
"recalculate period".

This is pretty much what is done today across the whole run queue.  The
enhancement we are talking about is separating tasks by cpu and pretty
much keeping them on one cpu.
> 
> >  This would then do load balancing just before each "recalculate
> > tick".
> > Of course, the above assumes that each cpu has its own run queue.
> >
> > George
> 
> Regards,
> 
>         MikaL

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-07 14:33                     ` Mika Liljeberg
  2001-10-07 18:00                       ` george anzinger
@ 2001-10-08 15:19                       ` bill davidsen
  2001-10-10  6:07                         ` Mike Fedyk
  1 sibling, 1 reply; 61+ messages in thread
From: bill davidsen @ 2001-10-08 15:19 UTC (permalink / raw)
  To: linux-kernel

In article <3BC067BB.73AF1EB5@welho.com> Mika.Liljeberg@welho.com wrote:

>Yes. However, you still want to balance the queues even if all CPUs are
>100% utilized. It's a fairness issue. Otherwise you could have 1 task
>running on one CPU and 49 tasks on another.

  You say that as if it were a bad thing... I believe that if you have
one long-running task and many small tasks in the system, CPU affinity
will make that happen now. Obviously not if all CPUs are 100% loaded,
and your 1 vs. 49 is unrealistic, but having a task stay with a CPU
while trivia run on other CPU(s) is generally a good thing under certain
load conditions, which I guess are no less likely than your example;-)

-- 
bill davidsen <davidsen@tmr.com>
 "If I were a diplomat, in the best case I'd go hungry.  In the worst
  case, people would die."
		-- Robert Lipe

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-07 23:49                           ` george anzinger
@ 2001-10-08 21:07                             ` Mika Liljeberg
  2001-10-08 22:54                               ` discontig physical memory Petko Manolov
  2001-10-08 23:05                               ` David S. Miller
  0 siblings, 2 replies; 61+ messages in thread
From: Mika Liljeberg @ 2001-10-08 21:07 UTC (permalink / raw)
  To: george anzinger
  Cc: Ingo Oeser, Alan Cox, Benjamin LaHaise, Linus Torvalds,
	linux-kernel, Davide Libenzi

george anzinger wrote:
> 
> Mika Liljeberg wrote:
> > I'm not sure I fully understood what you're driving at, but surely "each
> > task getting its allotted share" implies that the tasks are correctly
> > balanced across CPUs?
> 
> Actually I am trying to recast your requirement (for fairness) into
> something that is clean and easy to code :)

Fair enough. ;)

> The current scheduler periodically calculates new time slices for each
> task.  It does this when no task in the run queue has any slice time
> left.  The slice size each task gets is related to the NICE value.  In
> practice then (other things being equal) each task will get a portion of
> the available processing time that is related to its NICE value.  I
> think this is fair.  But, of course, other things are not equal.  When a
> task is blocked and the recalculation happens, that task gets to keep
> 1/2 of its remaining slice as well as the new slice.  Since, in addition
> to slice time, the remaining slice value is used as a major component of
> the task's priority, this means that the blocked task, if it is blocked
> during a recalculate, will get a boosted priority.

Thanks for the concise explanation. I follow you a little better now.

> I think what Alan is suggesting is that the priority boost is needed for
> blocked tasks, but not necessarily the increased slice time.

I have to agree. If the job is I/O bound, it's unlikely to need the
extra CPU time.

> > > when a cpu finds itself with nothing to do
> > > (because all its tasks have completed their slices or blocked) and other
> > > cpus have tasks in their queues it is time to "shop" for a new task to
> > > run.
> >
> > This again has the problem that you only look for new tasks if you have
> > nothing to do. While this would work if you only look at the total
> > completion time of a set of tasks, it does nothing to ensure fair
> > scheduling for a particular task.
> 
> But doesn't the f(NICE) used as a slice do this?  The point is that we
> will not allow a recalculation until all runnable tasks with non-zero
> slice times have run their slice times to zero.  This ensures that all
> tasks that can use the cpu will get their fair share of time each
> "recalculate period".

Ok, I think I get you. I didn't realize that there was already a strict
fairness enforcing mechanism built into the scheduler. So, you're
proposing to do a rebalance when the first CPU runs out of things to do,
which will happen sometime before the recalculate period expires?

Sounds reasonable, but I think you will see a flurry of rebalancing when
the recalculate period is ending and the CPUs are starting to run out of
things to do. This will cause load the balancer to frantically try to
spread the last few tasks with remaining time slices across the CPUs.
This is exactly the kind of thrashing that we would like to avoid.

> This is pretty much what is done today across the whole run queue.  The
> enhancement we are talking about is separating tasks by cpu and pretty
> much keeping them on one cpu.

Exactly.

> > >  This would then do load balancing just before each "recalculate
> > > tick".
> > > Of course, the above assumes that each cpu has its own run queue.
> > >
> > > George

Cheers,

	MikaL

^ permalink raw reply	[flat|nested] 61+ messages in thread

* discontig physical memory
  2001-10-08 21:07                             ` Mika Liljeberg
@ 2001-10-08 22:54                               ` Petko Manolov
  2001-10-08 23:05                               ` David S. Miller
  1 sibling, 0 replies; 61+ messages in thread
From: Petko Manolov @ 2001-10-08 22:54 UTC (permalink / raw)
  To: linux-kernel

	Hi guys,

I ran into this problem with a weird SoC design.
The physical memory map looks like this:

	0 	-  4MB		DMA-able embedded RAM;
	4MB	- 16MB		nothing here;
	16MB	- 32MB		external RAM;

Embedded controllers (FB/USB) can only see the lowest 4MB
and they need almost all of it for buffers. The kernel
lives at physical address 16MB.

Any ideas how to make the lowest 4MB allocatable through
kmalloc(size, GFP_DMA) without breaking the kernel?


	Petko

PS: I've seen CONFIG_DISCONTIGMEM but it is not yet
implemented for MIPS, and I am not sure whether it is
what is required in this case.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: discontig physical memory
  2001-10-08 21:07                             ` Mika Liljeberg
  2001-10-08 22:54                               ` discontig physical memory Petko Manolov
@ 2001-10-08 23:05                               ` David S. Miller
  2001-10-08 23:18                                 ` Petko Manolov
  2001-10-08 23:29                                 ` David S. Miller
  1 sibling, 2 replies; 61+ messages in thread
From: David S. Miller @ 2001-10-08 23:05 UTC (permalink / raw)
  To: pmanolov; +Cc: linux-kernel

   From: Petko Manolov <pmanolov@Lnxw.COM>
   Date: Mon, 08 Oct 2001 15:54:30 -0700
   
   Any ideas how to make the lowest 4MB allocatable through
   kmalloc(size, GFP_DMA) without breaking the kernel?

Set up ZONE_DMA correctly at boot time.

See the code around the free_area_init() call in
arch/i386/mm/init.c for an example of how to do this.
Intel does it for the low 16MB, you would just do it
for the low 4MB.

Franks a lot,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: discontig physical memory
  2001-10-08 23:05                               ` David S. Miller
@ 2001-10-08 23:18                                 ` Petko Manolov
  2001-10-08 23:29                                 ` David S. Miller
  1 sibling, 0 replies; 61+ messages in thread
From: Petko Manolov @ 2001-10-08 23:18 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

I already did that. I fixed MAX_DMA_ADDRESS and the
kernel now reports:

zone(0): 1024 pages
zone(1): 7168 pages	/* this should be 4096 */
...

Which is not true, as we have a gap between 4 and 16MB of
phys memory. I also played with add_memory_region(),
but the kernel still thinks we have 32MB of RAM instead
of the actual 20MB.

It also locks up solid in __get_free_pages() in mem_init().
Probably this is related to the wrong zone(1) assumption,
right?


best,
Petko



"David S. Miller" wrote:
> 
>    From: Petko Manolov <pmanolov@Lnxw.COM>
>    Date: Mon, 08 Oct 2001 15:54:30 -0700
> 
>    Any ideas how to make the lowest 4MB allocatable through
>    kmalloc(size, GFP_DMA) without breaking the kernel?
> 
> Set up ZONE_DMA correctly at boot time.
> 
> See the code around the free_area_init() call in
> arch/i386/mm/init.c for an example of how to do this.
> Intel does it for the low 16MB, you would just do it
> for the low 4MB.
> 
> Franks a lot,
> David S. Miller
> davem@redhat.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: discontig physical memory
  2001-10-08 23:05                               ` David S. Miller
  2001-10-08 23:18                                 ` Petko Manolov
@ 2001-10-08 23:29                                 ` David S. Miller
  2001-10-09  0:34                                   ` Petko Manolov
                                                     ` (2 more replies)
  1 sibling, 3 replies; 61+ messages in thread
From: David S. Miller @ 2001-10-08 23:29 UTC (permalink / raw)
  To: pmanolov; +Cc: linux-kernel

   From: Petko Manolov <pmanolov@Lnxw.COM>
   Date: Mon, 08 Oct 2001 16:18:25 -0700

   I already did that. I fixed MAX_DMA_ADDRESS and the
   kernel now reports:
   
   zone(0): 1024 pages
   zone(1): 7168 pages	/* this should be 4096 */
   ...
   
   Which is not true, as we have a gap between 4 and 16MB of
   phys memory. I also played with add_memory_region(),
   but the kernel still thinks we have 32MB of RAM instead
   of the actual 20MB.
   
You need to set up bootmem correctly, earlier on, such that
you do not tell the kernel that there are any pages in this
4 - 16MB region.

Do something like this instead of whatever your bootmem
calls are doing now:

	bootmap_size = init_bootmem(0, (32 * 1024 * 1024));
	free_bootmem((4 * 1024 * 1024),
		     ((16 - 4) * 1024 * 1024));

Franks a lot,
David S. Miller
davem@redhat.com


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: discontig physical memory
  2001-10-08 23:29                                 ` David S. Miller
@ 2001-10-09  0:34                                   ` Petko Manolov
  2001-10-09  0:36                                   ` Petko Manolov
  2001-10-09  1:37                                   ` David S. Miller
  2 siblings, 0 replies; 61+ messages in thread
From: Petko Manolov @ 2001-10-09  0:34 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

"David S. Miller" wrote:
> 
> You need to set up bootmem correctly, earlier on, such that
> you do not tell the kernel that there are any pages in this
> 4 - 16MB region.
> 
> Do something like this instead of whatever your bootmem
> calls are doing now:
> 
>         bootmap_size = init_bootmem(0, (32 * 1024 * 1024));
>         free_bootmem((4 * 1024 * 1024),
>                      ((16 - 4) * 1024 * 1024));

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: discontig physical memory
  2001-10-08 23:29                                 ` David S. Miller
  2001-10-09  0:34                                   ` Petko Manolov
@ 2001-10-09  0:36                                   ` Petko Manolov
  2001-10-09  1:37                                   ` David S. Miller
  2 siblings, 0 replies; 61+ messages in thread
From: Petko Manolov @ 2001-10-09  0:36 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

"David S. Miller" wrote:
> 
> You need to set up bootmem correctly, earlier on, such that
> you do not tell the kernel that there are any pages in this
> 4 - 16MB region.
> 
> Do something like this instead of whatever your bootmem
> calls are doing now:
> 
>         bootmap_size = init_bootmem(0, (32 * 1024 * 1024));
>         free_bootmem((4 * 1024 * 1024),
>                      ((16 - 4) * 1024 * 1024));

This is supposed to tell the kernel about the gap?


	Petko

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: discontig physical memory
  2001-10-08 23:29                                 ` David S. Miller
  2001-10-09  0:34                                   ` Petko Manolov
  2001-10-09  0:36                                   ` Petko Manolov
@ 2001-10-09  1:37                                   ` David S. Miller
  2001-10-09  2:43                                     ` Petko Manolov
  2 siblings, 1 reply; 61+ messages in thread
From: David S. Miller @ 2001-10-09  1:37 UTC (permalink / raw)
  To: pmanolov; +Cc: linux-kernel

   From: Petko Manolov <pmanolov@Lnxw.COM>
   Date: Mon, 08 Oct 2001 17:36:11 -0700

   "David S. Miller" wrote:
   > Do something like this instead of whatever your bootmem
   > calls are doing now:
   > 
   >         bootmap_size = init_bootmem(0, (32 * 1024 * 1024));
   >         free_bootmem((4 * 1024 * 1024),
   >                      ((16 - 4) * 1024 * 1024));
   
   This is supposed to tell the kernel about the gap?
   
Precisely.  How else did you expect to let the kernel know?

Franks a lot,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: discontig physical memory
  2001-10-09  1:37                                   ` David S. Miller
@ 2001-10-09  2:43                                     ` Petko Manolov
  0 siblings, 0 replies; 61+ messages in thread
From: Petko Manolov @ 2001-10-09  2:43 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

"David S. Miller" wrote:
> 
>    >         bootmap_size = init_bootmem(0, (32 * 1024 * 1024));
>    >         free_bootmem((4 * 1024 * 1024),
>    >                      ((16 - 4) * 1024 * 1024));
> 
>    This is supposed to tell the kernel about the gap?
> 
> Precisely.  How else did you expect to let the kernel know?

Unfortunately this didn't work. I also tried to declare a
16MB hole starting at offset 0, so the kernel is supposed
to boot from phys addr 0x01000000. It crashed when it tried
to execute kernel_thread().
Do you think there is a mechanism to tell the kernel that the
first 16MB is a non-DMA-able area?


	Petko

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-04 21:39   ` Richard Gooch
@ 2001-10-09  4:55     ` Richard Gooch
  2001-10-09 13:49       ` bill davidsen
  0 siblings, 1 reply; 61+ messages in thread
From: Richard Gooch @ 2001-10-09  4:55 UTC (permalink / raw)
  To: David S. Miller; +Cc: arjan, kravetz, linux-kernel

David S. Miller writes:
>    From: Richard Gooch <rgooch@ras.ucalgary.ca>
>    Date: Thu, 4 Oct 2001 15:39:05 -0600
> 
>    David S. Miller writes:
>    > lat_ctx doesn't execute any FPU ops.  So at worst this happens once
>    > on GLIBC program startup, but then never again.
>    
>    Has something changed? Last I looked, the whole lmbench timing harness
>    was based on using the FPU.
> 
> Oops, that's entirely possible...
> 
> But things are usually laid out like this:
> 
> 	capture_start_time();
> 	context_switch_N_times();
> 	capture_end_time();
> 
> So the FPU hit is only before/after the runs, not during each and
> every iteration.

Hm. Perhaps when I did my tests (where I noticed a penalty), we didn't
have lazy FPU saving. Now we disable the FPU, and restore state when
we trap, right?

I do note this comment in arch/i386/kernel/process.c:
 * We fsave/fwait so that an exception goes off at the right time
 * (as a call from the fsave or fwait in effect) rather than to
 * the wrong process. Lazy FP saving no longer makes any sense
 * with modern CPU's, and this simplifies a lot of things (SMP
 * and UP become the same).

So what exactly is the difference between our "delayed FPU restore
upon trap" (which I think of as lazy FPU saving), and the "lazy FP"
saving in the comments?

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-04 21:52   ` David S. Miller
  2001-10-04 21:55     ` Benjamin LaHaise
@ 2001-10-09  5:00     ` David S. Miller
  1 sibling, 0 replies; 61+ messages in thread
From: David S. Miller @ 2001-10-09  5:00 UTC (permalink / raw)
  To: rgooch; +Cc: arjan, kravetz, linux-kernel

   From: Richard Gooch <rgooch@ras.ucalgary.ca>
   Date: Mon, 8 Oct 2001 22:55:23 -0600

   So what exactly is the difference between our "delayed FPU restore
   upon trap" (which I think of as lazy FPU saving), and the "lazy FP"
   saving in the comments?

Save always on switching OUT from a task vs. save only when some
different task asks for the FPU.

Franks a lot,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-09  4:55     ` Richard Gooch
@ 2001-10-09 13:49       ` bill davidsen
  0 siblings, 0 replies; 61+ messages in thread
From: bill davidsen @ 2001-10-09 13:49 UTC (permalink / raw)
  To: linux-kernel

In article <200110090455.f994tNB22322@vindaloo.ras.ucalgary.ca> rgooch@ras.ucalgary.ca asked:

| Hm. Perhaps when I did my tests (where I noticed a penalty), we didn't
| have lazy FPU saving. Now we disable the FPU, and restore state when
| we trap, right?
| 
| I do note this comment in arch/i386/kernel/process.c:
|  * We fsave/fwait so that an exception goes off at the right time
|  * (as a call from the fsave or fwait in effect) rather than to
|  * the wrong process. Lazy FP saving no longer makes any sense
|  * with modern CPU's, and this simplifies a lot of things (SMP
|  * and UP become the same).
| 
| So what exactly is the difference between our "delayed FPU restore
| upon trap" (which I think of as lazy FPU saving), and the "lazy FP"
| saving in the comments?

We always save the FPU, but only restore it when/if it is going to be
used. And obviously we don't want to save it if it hasn't been used,
since it wasn't restored...

-- 
bill davidsen <davidsen@tmr.com>
 "If I were a diplomat, in the best case I'd go hungry.  In the worst
  case, people would die."
		-- Robert Lipe

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-07  9:57             ` Mika Liljeberg
  2001-10-07 13:03               ` Ingo Oeser
  2001-10-07 18:39               ` Davide Libenzi
@ 2001-10-09 20:37               ` Hubertus Franke
  2001-10-09 23:50                 ` george anzinger
  2 siblings, 1 reply; 61+ messages in thread
From: Hubertus Franke @ 2001-10-09 20:37 UTC (permalink / raw)
  To: Mika Liljeberg; +Cc: Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel

* Mika Liljeberg <Mika.Liljeberg@welho.com> [20011007 05;57]:"
> Alan Cox wrote:
> > This isn't idle speculation - I've done some minimal playing with this but
> > my initial re-implementation didn't handle SMP at all and I am still not 100%
> > sure how to resolve SMP or how SMP will improve out of the current cunning
> > plan.
> 
> Here's some idle speculation on SMP to top it off. :) I tend to think
> that the load balancing between CPUs should be a completely separate
> algorithm and should not necessarily be run at every schedule(). The
> idea is to completely decouple the problem of scheduling a single CPU
> between tasks and the problem of load balancing between the CPUs, making
> each problem simpler to solve.
> 

This is what we implemented as an extension to our MQ scheduler.
I will present the results for this during the ALS technical session:
"CPU Pooling and Load Balancing in Linux MultiQueue Scheduling".
If interested, I can already put an earlier version on lse.sourceforge.net.

It basically does (1) and (2). (3) is done unintelligently right now.

> Consider the following basic rules:
> 
> A) When a new task comes along, pick the "least loaded" CPU and lock the
> new task onto that.
> B) Whenever the load imbalance between least loaded CPU and most loaded
> CPU becomes too great, move one or more tasks from most loaded CPU to
> the least loaded CPU.
> 
> The rules themselves should be self-explanatory: A provides initial load
> balancing, while B tries to keep the balance (with a sensible hysteresis
> to avoid thrashing). However, there are a few minor details to solve:
> 
> 1) How to determine the load of a CPU? If we can quantify this clearly,
> we can easily set a hysteresis level to trigger load balancing between
> two CPUs.
> 2) When and how often to check for load imbalance?
> 3) How to select the task(s) that should be moved between two CPUs to
> correct an imbalance?
> 
> For problems 1 and 2 I propose the following solution: Insert the
> load balancing routine itself as a (fake) task on each CPU and run it
> when the CPU gets around to it. The load balancer should behave almost
> like a CPU-bound task, scheduled on the lowest priority level with other
> runnable tasks. The last bit is important: the load balancer should not
> be allowed to starve but should be invoked approximately once every
> "full rotation" of the scheduler.
> 
> With the above it is easy to estimate the load of a CPU. We can simply
> use the elapsed time between two invocations of the load balancer task.
> When the load balancer task of a particular CPU gets run, it chalks up
> the elapsed time on a score board somewhere, and checks whether there is
> a significant imbalance between itself and some other CPU. If there is,
> it commences to move some tasks between itself and the other CPU (note
> rule B, though, it should be enough to mess with just two CPU queues at
> a time to minimize balancing and locking overhead).
> 
> Problem 3 is tricky. Basically, there should be a cost/benefit function
> F(tasks to move) that should be minimized. Ideally F(task_i), the
> cost/benefit of moving a single task, would be calculated as a byproduct
> of the CPU scheduler algorithm. 
> 
> F(task_i) might be a function of elapsed time since task_i was last
> scheduled and the average time slice used by task_i, to account for the
> probable cache hit. This would leave it up to the load balancer to move
> as many lowest cost tasks to a new CPU as is needed to correct the
> imbalance (average time slices used by each task would be needed in
> order to make this decision).
> 
> Naturally, some additional rules might be necessary to make a task
> eligible for moving, e.g., never move the only/last CPU bound task to
> another CPU. In addition, it might actually make sense to move at most
> one task at each invocation of the load balancer, to further reduce the
> probability of thrashing. The load would still converge fairly quickly
> towards a balanced state. It would also scale fairly well with the
> number of CPUs.
> 
> How does that sound?
> 
> 	MikaL
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-09 20:37               ` Hubertus Franke
@ 2001-10-09 23:50                 ` george anzinger
  2001-10-11 10:52                   ` Hubertus Franke
  0 siblings, 1 reply; 61+ messages in thread
From: george anzinger @ 2001-10-09 23:50 UTC (permalink / raw)
  To: Hubertus Franke
  Cc: Mika Liljeberg, Alan Cox, Benjamin LaHaise, Linus Torvalds, linux-kernel

Hubertus Franke wrote:
> 
> * Mika Liljeberg <Mika.Liljeberg@welho.com> [20011007 05;57]:"
> > Alan Cox wrote:
> > > This isn't idle speculation - I've done some minimal playing with this but
> > > my initial re-implementation didn't handle SMP at all and I am still not 100%
> > > sure how to resolve SMP or how SMP will improve out of the current cunning
> > > plan.
> >
> > Here's some idle speculation on SMP to top it off. :) I tend to think
> > that the load balancing between CPUs should be a completely separate
> > algorithm and should not necessarily be run at every schedule(). The
> > idea is to completely decouple the problem of scheduling a single CPU
> > between tasks and the problem of load balancing between the CPUs, making
> > each problem simpler to solve.
> >
> 
> This is what we implemented as an extension to our MQ scheduler.
> I will present the results for this during the ALS technical session:
> "CPU Pooling and Load Balancing in Linux MultiQueue Scheduling".
> If interested, I can already put an earlier version on lse.sourceforge.net.

Please do.  It should be a good read.

George

> 
> It basically does (1) and (2). (3) is done unintelligently right now.
> 
> > Consider the following basic rules:
> >
> > A) When a new task comes along, pick the "least loaded" CPU and lock the
> > new task onto that.
> > B) Whenever the load imbalance between least loaded CPU and most loaded
> > CPU becomes too great, move one or more tasks from most loaded CPU to
> > the least loaded CPU.
> >
> > The rules themselves should be self-explanatory: A provides initial load
> > balancing, while B tries to keep the balance (with a sensible hysteresis
> > to avoid thrashing). However, there are a few minor details to solve:
> >
> > 1) How to determine the load of a CPU? If we can quantify this clearly,
> > we can easily set a hysteresis level to trigger load balancing between
> > two CPUs.
> > 2) When and how often to check for load imbalance?
> > 3) How to select the task(s) that should be moved between two CPUs to
> > correct an imbalance?
> >
> > For problems 1 and 2 I propose the following solution: Insert the
> > load balancing routine itself as a (fake) task on each CPU and run it
> > when the CPU gets around to it. The load balancer should behave almost
> > like a CPU-bound task, scheduled on the lowest priority level with other
> > runnable tasks. The last bit is important: the load balancer should not
> > be allowed to starve but should be invoked approximately once every
> > "full rotation" of the scheduler.
> >
> > With the above it is easy to estimate the load of a CPU. We can simply
> > use the elapsed time between two invocations of the load balancer task.
> > When the load balancer task of a particular CPU gets run, it chalks up
> > the elapsed time on a score board somewhere, and checks whether there is
> > a significant imbalance between itself and some other CPU. If there is,
> > it commences to move some tasks between itself and the other CPU (note
> > rule B, though, it should be enough to mess with just two CPU queues at
> > a time to minimize balancing and locking overhead).
> >
> > Problem 3 is tricky. Basically, there should be a cost/benefit function
> > F(tasks to move) that should be minimized. Ideally F(task_i), the
> > cost/benefit of moving a single task, would be calculated as a byproduct
> > of the CPU scheduler algorithm.
> >
> > F(task_i) might be a function of elapsed time since task_i was last
> > scheduled and the average time slice used by task_i, to account for the
> > probable cache hit. This would leave it up to the load balancer to move
> > as many lowest cost tasks to a new CPU as is needed to correct the
> > imbalance (average time slices used by each task would be needed in
> > order to make this decision).
> >
> > Naturally, some additional rules might be necessary to make a task
> > eligible for moving, e.g., never move the only/last CPU bound task to
> > another CPU. In addition, it might actually make sense to move at most
> > one task at each invocation of the load balancer, to further reduce the
> > probability of thrashing. The load would still converge fairly quickly
> > towards a balanced state. It would also scale fairly well with the
> > number of CPUs.
> >
> > How does that sound?
> >
> >       MikaL

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-08 15:19                       ` Context switch times bill davidsen
@ 2001-10-10  6:07                         ` Mike Fedyk
  0 siblings, 0 replies; 61+ messages in thread
From: Mike Fedyk @ 2001-10-10  6:07 UTC (permalink / raw)
  To: linux-kernel

On Mon, Oct 08, 2001 at 11:19:49AM -0400, bill davidsen wrote:
> In article <3BC067BB.73AF1EB5@welho.com> Mika.Liljeberg@welho.com wrote:
> 
> >Yes. However, you still want to balance the queues even if all CPUs are
> >100% utilized. It's a fairness issue. Otherwise you could have 1 task
> >running on one CPU and 49 tasks on another.
> 
>   You say that as if it were a bad thing... I believe that if you have
> one long running task and many small tasks in the system CPU affinity
> will make that happen now. Obviously not if all CPUs are 100% loaded,
> and your 1 vs. 49 is unrealistic, but having a task stay with a CPU
> while trivia run on other CPU(s) is generally a good thing under certain
> load conditions, which I guess are no less likely than your example;-)
>

Let's look at an actual load example (which I just happen to be doing now):

On my 2x366 celeron smp box, I have a kernel compile running (-j15), and a
bzip2 compressing several large files...

Right now, (2.4.10-ac4, compiling 2.4.10-ac10+preempt) both of my L2
processor caches are moot (128KB) because of the several gcc processes
running, and the very loose affinity.

In this case, bzip2 should probably get a processor (CPU0) to itself, and my
kernel compile another (CPU1).

It would be interesting to see a system that has a kind of hierarchy to
the processors.  All new processes would always start on CPU0, and long
running processes would get slowly moved up the list of processors the
longer they run.  Thus, CPU0 would have abysmal L2 cache locality, and CPU7
would have the best caching possible with its L2.

What do you guys think?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Context switch times
  2001-10-09 23:50                 ` george anzinger
@ 2001-10-11 10:52                   ` Hubertus Franke
  0 siblings, 0 replies; 61+ messages in thread
From: Hubertus Franke @ 2001-10-11 10:52 UTC (permalink / raw)
  To: george anzinger; +Cc: Mika Liljeberg, Linus Torvalds, linux-kernel

Done. I put the ALS2001 paper on CPU pooling and load balancing 
on the lse.sourceforge.net/scheduling website. Here is a direct shortcut
http://lse.sourceforge.net/scheduling/als2001/pmqs.ps
The patches are a bit behind. Will be updated soon.

If you have some suggestions on what is a good metric for moving
tasks, please let me know. We can incorporate them.

-- Hubertus

* george anzinger <george@mvista.com> [20011009 19;50]:"
> Hubertus Franke wrote:
> > 
> > * Mika Liljeberg <Mika.Liljeberg@welho.com> [20011007 05;57]:"
> > > Alan Cox wrote:
> > > > This isn't idle speculation - I've done some minimal playing with this but
> > > > my initial re-implementation didn't handle SMP at all and I am still not 100%
> > > > sure how to resolve SMP or how SMP will improve out of the current cunning
> > > > plan.
> > >
> > > Here's some idle speculation on SMP to top it off. :) I tend to think
> > > that the load balancing between CPUs should be a completely separate
> > > algorithm and should not necessarily be run at every schedule(). The
> > > idea is to completely decouple the problem of scheduling a single CPU
> > > between tasks and the problem of load balancing between the CPUs, making
> > > each problem simpler to solve.
> > >
> > 
> > This is what we implemented as an extension to our MQ scheduler.
> > I will present the results for this during the ALS technical session:
> > "CPU Pooling and Load Balancing in Linux MultiQueue Scheduling".
> > If interested, I can already put an earlier version on lse.sourceforge.net.
> 
> Please do.  It should be a good read.
> 
> George
> 
> > 
> > It basically does (1) and (2). (3) is done unintelligently right now.
> > 
> > > Consider the following basic rules:
> > >
> > > A) When a new task comes along, pick the "least loaded" CPU and lock the
> > > new task onto that.
> > > B) Whenever the load imbalance between least loaded CPU and most loaded
> > > CPU becomes too great, move one or more tasks from most loaded CPU to
> > > the least loaded CPU.
> > >
> > > The rules themselves should be self-explanatory: A provides initial load
> > > balancing, while B tries to keep the balance (with a sensible hysteresis
> > > to avoid thrashing). However, there are a few minor details to solve:
> > >
> > > 1) How to determine the load of a CPU? If we can quantify this clearly,
> > > we can easily set a hysteresis level to trigger load balancing between
> > > two CPUs.
> > > 2) When and how often to check for load imbalance?
> > > 3) How to select the task(s) that should be moved between two CPUs to
> > > correct an imbalance?
> > >
> > > For problems 1 and 2 I propose the following solution: Insert the
> > > load balancing routine itself as a (fake) task on each CPU and run it
> > > when the CPU gets around to it. The load balancer should behave almost
> > > like a CPU-bound task, scheduled on the lowest priority level with other
> > > runnable tasks. The last bit is important: the load balancer should not
> > > be allowed to starve but should be invoked approximately once every
> > > "full rotation" of the scheduler.
> > >
> > > With the above it is easy to estimate the load of a CPU. We can simply
> > > use the elapsed time between two invocations of the load balancer task.
> > > When the load balancer task of a particular CPU gets run, it chalks up
> > > the elapsed time on a score board somewhere, and checks whether there is
> > > a significant imbalance between itself and some other CPU. If there is,
> > > it commences to move some tasks between itself and the other CPU (note
> > > rule B, though, it should be enough to mess with just two CPU queues at
> > > a time to minimize balancing and locking overhead).
> > >
> > > Problem 3 is tricky. Basically, there should be a cost/benefit function
> > > F(tasks to move) that should be minimized. Ideally F(task_i), the
> > > cost/benefit of moving a single task, would be calculated as a byproduct
> > > of the CPU scheduler algorithm.
> > >
> > > F(task_i) might be a function of elapsed time since task_i was last
> > > scheduled and the average time slice used by task_i, to account for the
> > > probable cache hit. This would leave it up to the load balancer to move
> > > as many lowest cost tasks to a new CPU as is needed to correct the
> > > imbalance (average time slices used by each task would be needed in
> > > order to make this decision).
> > >
> > > Naturally, some additional rules might be necessary to make a task
> > > eligible for moving, e.g., never move the only/last CPU bound task to
> > > another CPU. In addition, it might actually make sense to move at most
> > > one task at each invocation of the load balancer, to further reduce the
> > > probability of thrashing. The load would still converge fairly quickly
> > > towards a balanced state. It would also scale fairly well with the
> > > number of CPUs.
> > >
> > > How does that sound?
> > >
> > >       MikaL

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2001-10-11 12:53 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-10-04 21:04 Context switch times Mike Kravetz
2001-10-04 21:14 ` arjan
2001-10-04 21:39   ` Richard Gooch
2001-10-09  4:55     ` Richard Gooch
2001-10-09 13:49       ` bill davidsen
2001-10-04 21:25 ` David S. Miller
2001-10-04 21:52   ` David S. Miller
2001-10-04 21:55     ` Benjamin LaHaise
2001-10-04 22:35       ` Davide Libenzi
2001-10-04 22:49         ` Mike Kravetz
2001-10-04 22:42       ` Linus Torvalds
2001-10-04 22:53         ` Benjamin LaHaise
2001-10-05 15:13           ` Alan Cox
2001-10-05 17:49             ` george anzinger
2001-10-05 22:29               ` Alan Cox
2001-10-05 22:56                 ` Davide Libenzi
2001-10-05 23:04                   ` Alan Cox
2001-10-05 23:16                     ` Davide Libenzi
2001-10-05 23:17                       ` Alan Cox
2001-10-05 23:21                         ` Davide Libenzi
2001-10-05 23:43                     ` Roger Larsson
2001-10-07  1:20                 ` george anzinger
2001-10-07  1:33                   ` Bill Davidsen
2001-10-07  9:56                   ` Alan Cox
2001-10-06  2:24             ` Albert D. Cahalan
2001-10-06  2:57               ` Davide Libenzi
2001-10-07  9:57             ` Mika Liljeberg
2001-10-07 13:03               ` Ingo Oeser
2001-10-07 13:48                 ` Mika Liljeberg
2001-10-07 14:24                   ` Ingo Oeser
2001-10-07 14:33                     ` Mika Liljeberg
2001-10-07 18:00                       ` george anzinger
2001-10-07 22:06                         ` Mika Liljeberg
2001-10-07 22:31                           ` Davide Libenzi
2001-10-07 22:33                             ` Mika Liljeberg
2001-10-07 23:49                           ` george anzinger
2001-10-08 21:07                             ` Mika Liljeberg
2001-10-08 22:54                               ` discontig physical memory Petko Manolov
2001-10-08 23:05                               ` David S. Miller
2001-10-08 23:18                                 ` Petko Manolov
2001-10-08 23:29                                 ` David S. Miller
2001-10-09  0:34                                   ` Petko Manolov
2001-10-09  0:36                                   ` Petko Manolov
2001-10-09  1:37                                   ` David S. Miller
2001-10-09  2:43                                     ` Petko Manolov
2001-10-08 15:19                       ` Context switch times bill davidsen
2001-10-10  6:07                         ` Mike Fedyk
2001-10-07 18:39               ` Davide Libenzi
2001-10-09 20:37               ` Hubertus Franke
2001-10-09 23:50                 ` george anzinger
2001-10-11 10:52                   ` Hubertus Franke
2001-10-04 23:41         ` Mike Kravetz
2001-10-04 23:50           ` Linus Torvalds
2001-10-05 15:15             ` Eric W. Biederman
2001-10-04 23:56           ` Davide Libenzi
2001-10-05  0:45           ` Andrea Arcangeli
2001-10-05  4:35             ` Mike Kravetz
2001-10-07 17:59               ` Andrea Arcangeli
2001-10-07 19:54                 ` george anzinger
2001-10-07 20:24                   ` Andrea Arcangeli
2001-10-09  5:00     ` David S. Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).