LKML Archive on lore.kernel.org
* [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
@ 2007-04-13 20:21 Ingo Molnar
2007-04-13 20:27 ` Bill Huey
` (14 more replies)
0 siblings, 15 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-13 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
[announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
i'm pleased to announce the first release of the "Modular Scheduler Core
and Completely Fair Scheduler [CFS]" patchset:
http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
This project is a complete rewrite of the Linux task scheduler. My goal
is to address various feature requests and to fix deficiencies in the
vanilla scheduler that were suggested/found in the past few years, both
for desktop scheduling and for server scheduling workloads.
[ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
new scheduler will be active by default and all tasks will default
to the new SCHED_FAIR interactive scheduling class. ]
Highlights are:
- the introduction of Scheduling Classes: an extensible hierarchy of
scheduler modules. These modules encapsulate scheduling policy
details and are handled by the scheduler core without the core
code making too many assumptions about them.
- sched_fair.c implements the 'CFS desktop scheduler': it is a
replacement for the vanilla scheduler's SCHED_OTHER interactivity
code.
i'd like to give credit to Con Kolivas for the general approach here:
he has proven via RSDL/SD that 'fair scheduling' is possible and that
it results in better desktop scheduling. Kudos Con!
The CFS patch uses a completely different approach and implementation
from RSDL/SD. My goal was to make CFS's interactivity quality exceed
that of RSDL/SD, which is a high standard to meet :-) Testing
feedback is welcome to decide this one way or another. [ and, in any
case, all of SD's logic could be added via a kernel/sched_sd.c module
as well, if Con is interested in such an approach. ]
CFS's design is quite radical: it does not use runqueues, it uses a
time-ordered rbtree to build a 'timeline' of future task execution,
and thus has no 'array switch' artifacts (by which both the vanilla
scheduler and RSDL/SD are affected).
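( to illustrate the idea - this is not code from the patch, the type and
  field names below are made up: runnable tasks are kept in an rbtree
  sorted by a nanosecond-resolution key, so the leftmost node is always
  the next task to put on the CPU: )
#include <linux/rbtree.h>

struct timeline_node {
	struct rb_node		node;
	unsigned long long	fair_key;	/* ns-resolution ordering key */
};

static void timeline_enqueue(struct rb_root *timeline, struct timeline_node *tn)
{
	struct rb_node **link = &timeline->rb_node, *parent = NULL;
	struct timeline_node *entry;

	/* walk down to the insertion point, ordered by fair_key */
	while (*link) {
		parent = *link;
		entry = rb_entry(parent, struct timeline_node, node);
		if (tn->fair_key < entry->fair_key)
			link = &parent->rb_left;
		else
			link = &parent->rb_right;
	}
	/* link the new node and rebalance; picking the next task is then
	 * just a matter of looking at the leftmost node */
	rb_link_node(&tn->node, parent, link);
	rb_insert_color(&tn->node, timeline);
}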
CFS uses nanosecond granularity accounting and does not rely on any
jiffies or other HZ detail. Thus the CFS scheduler has no notion of
'timeslices' and has no heuristics whatsoever. There is only one
central tunable:
/proc/sys/kernel/sched_granularity_ns
which can be used to tune the scheduler from 'desktop' (low
latencies) to 'server' (good batching) workloads. It defaults to a
setting suitable for desktop workloads. SCHED_BATCH is handled by the
CFS scheduler module too.
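( a tiny userspace sketch of using the tunable - assuming the patch is
  applied, otherwise the /proc file does not exist - it reads the current
  value and optionally writes a new one, in nanoseconds: )
#include <stdio.h>

#define GRAN "/proc/sys/kernel/sched_granularity_ns"

int main(int argc, char **argv)
{
	unsigned long long ns;
	FILE *f = fopen(GRAN, "r");

	if (!f || fscanf(f, "%llu", &ns) != 1) {
		perror(GRAN);
		return 1;
	}
	fclose(f);
	printf("current granularity: %llu ns\n", ns);

	if (argc > 1) {				/* e.g. ./gran 3000000 */
		f = fopen(GRAN, "w");
		if (!f || fprintf(f, "%s\n", argv[1]) < 0) {
			perror(GRAN);
			return 1;
		}
		fclose(f);
	}
	return 0;
}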
due to its design, the CFS scheduler is not prone to any of the
'attacks' that exist today against the heuristics of the stock
scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
work fine and do not impact interactivity and produce the expected
behavior.
the CFS scheduler has a much stronger handling of nice levels and
SCHED_BATCH: both types of workloads should be isolated much more
aggressively than under the vanilla scheduler.
( another detail: due to nanosec accounting and timeline sorting,
sched_yield() support is very simple under CFS, and in fact under
CFS sched_yield() behaves much better than under any other
scheduler i have tested so far. )
- sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
way than the vanilla scheduler does. It uses 100 runqueues (for all
100 RT priority levels, instead of 140 in the vanilla scheduler)
and it needs no expired array.
- reworked/sanitized SMP load-balancing: the runqueue-walking
assumptions are gone from the load-balancing code now, and
iterators of the scheduling modules are used. The balancing code got
quite a bit simpler as a result.
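( a rough sketch of what such an iterator interface can look like - the
  names here are illustrative, not the patch's actual interface: each
  scheduling class exposes start/next callbacks and the balancer pulls
  migration candidates through them without knowing how the class stores
  its tasks: )
struct task_struct;

struct class_iterator {
	void *arg;					/* class-private cursor        */
	struct task_struct *(*start)(void *arg);	/* first migration candidate   */
	struct task_struct *(*next)(void *arg);		/* further candidates, or NULL */
};

/* balancer-side loop: completely unaware of the class's internal layout */
static void walk_candidates(struct class_iterator *it,
			    void (*try_to_pull)(struct task_struct *p))
{
	struct task_struct *p;

	for (p = it->start(it->arg); p; p = it->next(it->arg))
		try_to_pull(p);
}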
the core scheduler got smaller by more than 700 lines:
kernel/sched.c | 1454 ++++++++++++++++------------------------------------------------
1 file changed, 372 insertions(+), 1082 deletions(-)
and even adding all the scheduling modules, the total size impact is
relatively small:
18 files changed, 1454 insertions(+), 1133 deletions(-)
most of the increase is due to extensive comments. The kernel size
impact is in fact a small negative:
text data bss dec hex filename
23366 4001 24 27391 6aff kernel/sched.o.vanilla
24159 2705 56 26920 6928 kernel/sched.o.CFS
(this is mainly due to the benefit of getting rid of the expired array
and its data structure overhead.)
thanks go to Thomas Gleixner and Arjan van de Ven for review of this
patchset.
as usual, any sort of feedback, bugreports, fixes and suggestions are
more than welcome,
Ingo
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
@ 2007-04-13 20:27 ` Bill Huey
2007-04-13 20:55 ` Ingo Molnar
2007-04-13 21:50 ` Ingo Molnar
` (13 subsequent siblings)
14 siblings, 1 reply; 577+ messages in thread
From: Bill Huey @ 2007-04-13 20:27 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
Bill Huey (hui)
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
...
> The CFS patch uses a completely different approach and implementation
> from RSDL/SD. My goal was to make CFS's interactivity quality exceed
> that of RSDL/SD, which is a high standard to meet :-) Testing
> feedback is welcome to decide this one way or another. [ and, in any
> case, all of SD's logic could be added via a kernel/sched_sd.c module
> as well, if Con is interested in such an approach. ]
Ingo,
Con has been asking for module support for years if I understand your patch
correctly. You'll also need this for -rt as well with regards to bandwidth
scheduling. Good to see that you're moving in this direction.
bill
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 20:27 ` Bill Huey
@ 2007-04-13 20:55 ` Ingo Molnar
2007-04-13 21:21 ` William Lee Irwin III
0 siblings, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-13 20:55 UTC (permalink / raw)
To: Bill Huey
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* Bill Huey <billh@gnuppy.monkey.org> wrote:
> Con has been asking for module support for years if I understand your
> patch correctly. [...]
Yeah. Note that there are some subtle but crucial differences between
PlugSched (which Con used, and which i opposed in the past) and this
approach.
PlugSched cuts the interfaces at a high level in a monolithic way and
introduces kernel/scheduler.c that uses one pluggable scheduler
(represented via the 'scheduler' global template) at a time.
while in this CFS patchset i'm using modularization ('scheduler
classes') to simplify the _existing_ multi-policy implementation of the
scheduler. These 'scheduler classes' are in a hierarchy and are stacked
on top of each other. They are all in use at once. Currently there are two
of them: sched_ops_rt is stacked on top of sched_ops_fair. Fortunately the
performance impact is minimal.
So scheduler classes are mainly a simplification of the design of the
scheduler - not just a mere facility to select multiple schedulers.
Their ability to also facilitate easier experimentation with schedulers
is 'just' a happy side-effect. So, all in all: it's a fairly different
model from PlugSched (and that's why i didn't reuse PlugSched) - but
there's indeed overlap.
> [...] You'll also need this for -rt as well with regards to bandwidth
> scheduling.
yeah.
scheduler classes are also useful for other purposes like containers and
virtualization, hierarchical/group scheduling, security encapsulation,
etc. - features that can be layered on demand, and which we don't
necessarily want to have enabled all the time.
> [...] Good to see that you're moving in this direction.
thanks! :)
Ingo
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 20:55 ` Ingo Molnar
@ 2007-04-13 21:21 ` William Lee Irwin III
2007-04-13 21:35 ` Bill Huey
2007-04-13 21:39 ` Ingo Molnar
0 siblings, 2 replies; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-13 21:21 UTC (permalink / raw)
To: Ingo Molnar
Cc: Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner
On Fri, Apr 13, 2007 at 10:55:45PM +0200, Ingo Molnar wrote:
> Yeah. Note that there are some subtle but crucial differences between
> PlugSched (which Con used, and which i opposed in the past) and this
> approach.
> PlugSched cuts the interfaces at a high level in a monolithic way and
> introduces kernel/scheduler.c that uses one pluggable scheduler
> (represented via the 'scheduler' global template) at a time.
What I originally did did so for a good reason, which was that it was
intended to support far more radical reorganizations, for instance,
things that changed the per-cpu runqueue affairs for gang scheduling.
I wrote a top-level driver that did support scheduling classes in a
similar fashion, though it didn't survive others maintaining the patches.
-- wli
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 21:21 ` William Lee Irwin III
@ 2007-04-13 21:35 ` Bill Huey
2007-04-13 21:39 ` Ingo Molnar
1 sibling, 0 replies; 577+ messages in thread
From: Bill Huey @ 2007-04-13 21:35 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Bill Huey (hui)
On Fri, Apr 13, 2007 at 02:21:10PM -0700, William Lee Irwin III wrote:
> On Fri, Apr 13, 2007 at 10:55:45PM +0200, Ingo Molnar wrote:
> > Yeah. Note that there are some subtle but crucial differences between
> > PlugSched (which Con used, and which i opposed in the past) and this
> > approach.
> > PlugSched cuts the interfaces at a high level in a monolithic way and
> > introduces kernel/scheduler.c that uses one pluggable scheduler
> > (represented via the 'scheduler' global template) at a time.
>
> What I originally did did so for a good reason, which was that it was
> intended to support far more radical reorganizations, for instance,
> things that changed the per-cpu runqueue affairs for gang scheduling.
> I wrote a top-level driver that did support scheduling classes in a
> similar fashion, though it didn't survive others maintaining the patches.
Also, gang scheduling is needed to solve virtualization issues regarding
spinlocks in a guest image. You could potentially be spinning on a thread
that isn't currently running, which, needless to say, is very bad.
bill
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 21:21 ` William Lee Irwin III
2007-04-13 21:35 ` Bill Huey
@ 2007-04-13 21:39 ` Ingo Molnar
1 sibling, 0 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-13 21:39 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner
* William Lee Irwin III <wli@holomorphy.com> wrote:
> What I originally did did so for a good reason, which was that it was
> intended to support far more radical reorganizations, for instance,
> things that changed the per-cpu runqueue affairs for gang scheduling.
> I wrote a top-level driver that did support scheduling classes in a
> similar fashion, though it didn't survive others maintaining the
> patches.
yeah - i looked at plugsched-6.5-for-2.6.20.patch in particular.
Ingo
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
2007-04-13 20:27 ` Bill Huey
@ 2007-04-13 21:50 ` Ingo Molnar
2007-04-13 21:57 ` Michal Piotrowski
` (12 subsequent siblings)
14 siblings, 0 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-13 21:50 UTC (permalink / raw)
To: linux-kernel
Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* Ingo Molnar <mingo@elte.hu> wrote:
> and even adding all the scheduling modules, the total size impact is
> relatively small:
>
> 18 files changed, 1454 insertions(+), 1133 deletions(-)
>
> most of the increase is due to extensive comments. The kernel size
> impact is in fact a small negative:
>
> text data bss dec hex filename
> 23366 4001 24 27391 6aff kernel/sched.o.vanilla
> 24159 2705 56 26920 6928 kernel/sched.o.CFS
update: these were older numbers, here are the stats redone with the
latest patch:
text data bss dec hex filename
23366 4001 24 27391 6aff kernel/sched.o.vanilla
23671 4548 24 28243 6e53 kernel/sched.o.sd.v40
23349 2705 24 26078 65de kernel/sched.o.cfs
so CFS is now a win both for text and for data size :)
Ingo
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
2007-04-13 20:27 ` Bill Huey
2007-04-13 21:50 ` Ingo Molnar
@ 2007-04-13 21:57 ` Michal Piotrowski
2007-04-13 22:15 ` Daniel Walker
` (11 subsequent siblings)
14 siblings, 0 replies; 577+ messages in thread
From: Michal Piotrowski @ 2007-04-13 21:57 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>
Friday the 13th, my lucky day :).
/mnt/md0/devel/linux-msc-cfs/usr/include/linux/sched.h requires linux/rbtree.h, which does not exist in exported headers
make[3]: *** No rule to make target `/mnt/md0/devel/linux-msc-cfs/usr/include/linux/.check.sched.h', needed by `__headerscheck'. Stop.
make[2]: *** [linux] Error 2
make[1]: *** [headers_check] Error 2
make: *** [vmlinux] Error 2
Regards,
Michal
--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
--- linux-msc-cfs-clean/include/linux/Kbuild 2007-04-13 23:52:47.000000000 +0200
+++ linux-msc-cfs/include/linux/Kbuild 2007-04-13 23:49:41.000000000 +0200
@@ -133,6 +133,7 @@ header-y += quotaio_v1.h
header-y += quotaio_v2.h
header-y += radeonfb.h
header-y += raw.h
+header-y += rbtree.h
header-y += resource.h
header-y += rose.h
header-y += smbno.h
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
` (2 preceding siblings ...)
2007-04-13 21:57 ` Michal Piotrowski
@ 2007-04-13 22:15 ` Daniel Walker
2007-04-13 22:30 ` Ingo Molnar
2007-04-13 22:21 ` William Lee Irwin III
` (10 subsequent siblings)
14 siblings, 1 reply; 577+ messages in thread
From: Daniel Walker @ 2007-04-13 22:15 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Fri, 2007-04-13 at 22:21 +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.
I'm not in love with the current or other schedulers, so I'm indifferent
to this change. However, I was reviewing your release notes and the
patch and found myself wondering what the algorithmic complexity of this
new scheduler is. I assumed it would also be constant time, but the
__enqueue_task_fair doesn't appear to be constant time (rbtree insert
complexity). Maybe that's not a critical path, but I thought I would
at least comment on it.
Daniel
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
` (3 preceding siblings ...)
2007-04-13 22:15 ` Daniel Walker
@ 2007-04-13 22:21 ` William Lee Irwin III
2007-04-13 22:52 ` Ingo Molnar
2007-04-14 22:38 ` Davide Libenzi
2007-04-13 22:31 ` Willy Tarreau
` (9 subsequent siblings)
14 siblings, 2 replies; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-13 22:21 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.
> [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
> new scheduler will be active by default and all tasks will default
> to the new SCHED_FAIR interactive scheduling class. ]
A pleasant surprise, though I did see it coming.
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> Highlights are:
> - the introduction of Scheduling Classes: an extensible hierarchy of
> scheduler modules. These modules encapsulate scheduling policy
> details and are handled by the scheduler core without the core
> code making too many assumptions about them.
It probably needs further clarification that they're things on the order
of SCHED_FIFO, SCHED_RR, SCHED_NORMAL, etc.; some prioritization amongst
the classes is furthermore assumed, and so on. They're not quite
capable of being full-blown alternative policies, though quite a bit
can be crammed into them.
There are issues with the per-scheduling-class data not being very
well-abstracted. A union for per-class data might help, if not a
dynamically allocated scheduling-class-private structure. Getting
an alternative policy floating around that actually clashes a little
with the stock data in the task structure would help clarify what's
needed.
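To make the union idea concrete - the names below are invented for the
sake of illustration, not taken from the patch - something along these
lines, embedded in the task structure:
#include <linux/rbtree.h>
#include <linux/list.h>

struct fair_entity {			/* CFS-style per-task state */
	unsigned long long	fair_key;
	struct rb_node		run_node;
};

struct rt_entity {			/* RT-style per-task state */
	struct list_head	run_list;
	unsigned int		rt_priority;
};

/* only the arm owned by the task's current scheduling class is valid */
union sched_class_data {
	struct fair_entity	fair;
	struct rt_entity	rt;
};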
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> - sched_fair.c implements the 'CFS desktop scheduler': it is a
> replacement for the vanilla scheduler's SCHED_OTHER interactivity
> code.
> i'd like to give credit to Con Kolivas for the general approach here:
> he has proven via RSDL/SD that 'fair scheduling' is possible and that
> it results in better desktop scheduling. Kudos Con!
Bob Mullens banged out a virtual deadline interactive task scheduler
for Multics back in 1976 or thereabouts. ISTR the name Ferranti in
connection with deadline task scheduling for UNIX in particular. I've
largely seen deadline schedulers as a realtime topic, though. In any
event, it's not so radical as to lack a fair number of precedents.
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> The CFS patch uses a completely different approach and implementation
> from RSDL/SD. My goal was to make CFS's interactivity quality exceed
> that of RSDL/SD, which is a high standard to meet :-) Testing
> feedback is welcome to decide this one way or another. [ and, in any
> case, all of SD's logic could be added via a kernel/sched_sd.c module
> as well, if Con is interested in such an approach. ]
> CFS's design is quite radical: it does not use runqueues, it uses a
> time-ordered rbtree to build a 'timeline' of future task execution,
> and thus has no 'array switch' artifacts (by which both the vanilla
> scheduler and RSDL/SD are affected).
A binomial heap would likely serve your purposes better than rbtrees.
It's faster to have the next item to dequeue at the root of the tree
structure rather than a leaf, for one. There are, of course, other
priority queue structures (e.g. van Emde Boas) able to exploit the
limited precision of the priority key for faster asymptotics, though
actual performance is an open question.
Another advantage of heaps is that they support decreasing priorities
directly, so that instead of removal and reinsertion, a less invasive
movement within the tree is possible. This nets additional constant
factor improvements beyond those for the next item to dequeue for the
case where a task remains runnable, but is preempted and its priority
decreased while it remains runnable.
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> CFS uses nanosecond granularity accounting and does not rely on any
> jiffies or other HZ detail. Thus the CFS scheduler has no notion of
> 'timeslices' and has no heuristics whatsoever. There is only one
> central tunable:
> /proc/sys/kernel/sched_granularity_ns
> which can be used to tune the scheduler from 'desktop' (low
> latencies) to 'server' (good batching) workloads. It defaults to a
> setting suitable for desktop workloads. SCHED_BATCH is handled by the
> CFS scheduler module too.
I like not relying on timeslices. Timeslices ultimately get you into
2.4.x-like epoch expiry scenarios and therefore introduce a number of
RR-esque artifacts.
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> due to its design, the CFS scheduler is not prone to any of the
> 'attacks' that exist today against the heuristics of the stock
> scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
> work fine and do not impact interactivity and produce the expected
> behavior.
I'm always suspicious of these claims. A moderately formal regression
test suite needs to be assembled and the testcases rather seriously
cleaned up so they e.g. run for a deterministic period of time, have
their parameters passable via command-line options instead of editing
and recompiling, don't need Lindenting to be legible, and so on. With
that in hand, a battery of regression tests can be run against scheduler
modifications to verify their correctness and to detect any disturbance
in scheduling semantics they might cause.
A very serious concern is that while a fresh scheduler may pass all
these tests, later modifications may later cause failures unnoticed
because no one's doing the regression tests and there's no obvious
test suite for testing types to latch onto. Another is that the
testcases themselves may bitrot if they're not maintainable code.
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> the CFS scheduler has a much stronger handling of nice levels and
> SCHED_BATCH: both types of workloads should be isolated much more
> aggressively than under the vanilla scheduler.
Speaking of regression tests, let's please at least state intended
nice semantics and get a regression test for CPU bandwidth distribution
by nice levels going.
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> ( another detail: due to nanosec accounting and timeline sorting,
> sched_yield() support is very simple under CFS, and in fact under
> CFS sched_yield() behaves much better than under any other
> scheduler i have tested so far. )
And there's another one. sched_yield() semantics need a regression test
more transparent than VolanoMark or other macrobenchmarks.
At some point we really need to decide what our sched_yield() is
intended to do and get something out there to detect whether it's
behaving as intended.
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> - reworked/sanitized SMP load-balancing: the runqueue-walking
> assumptions are gone from the load-balancing code now, and
> iterators of the scheduling modules are used. The balancing code got
> quite a bit simpler as a result.
The SMP load balancing class operations strike me as unusual and likely
to trip over semantic issues in alternative scheduling classes. Getting
some alternative scheduling classes out there to clarify the issues
would help here, too.
A more general question here is what you mean by "completely fair;"
there doesn't appear to be inter-tgrp, inter-pgrp, inter-session,
or inter-user fairness going on, though one might argue those are
relatively obscure notions of fairness. Complete fairness arguably
precludes static prioritization by nice levels, so there is also
that. There is also the issue of what a fair CPU bandwidth distribution
between tasks of varying desired in-isolation CPU utilization might be.
I suppose my thorniest point is where the demonstration of fairness is
as, say, a testcase. Perhaps it's fair now; when will we find out when
that fairness has been disturbed?
What these things mean when there are multiple CPU's to schedule across
may also be of concern.
I propose the following two testcases:
(1) CPU bandwidth distribution of CPU-bound tasks of varying nice levels
Create a number of tasks at varying nice levels. Measure the
CPU bandwidth allocated to each.
Success depends on intent: we decide up-front that a given nice
level should correspond to a given share of CPU bandwidth.
Check to see how far from the intended distribution of CPU
bandwidth according to those decided-up-front shares the actual
distribution of CPU bandwidth is for the test.
(2) CPU bandwidth distribution of tasks with varying CPU demands
Create a number of tasks that would in isolation consume
varying %cpu. Measure the CPU bandwidth allocated to each.
Success depends on intent here, too. Decide up-front that a
given %cpu that would be consumed in isolation should
correspond to a given share of CPU bandwidth and check the
actual distribution of CPU bandwidth vs. what was intended.
Note that the shares need not linearly correspond to the
%cpu; various sorts of things related to interactivity will
make this nonlinear.
A third testcase for sched_yield() should be brewed up.
These testcases are oblivious to SMP. This will demand that a scheduling
policy integrate with load balancing to the extent that load balancing
occurs for the sake of distributing CPU bandwidth according to nice level.
Some explicit decision should be made regarding that.
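A rough sketch of what testcase (1) could look like - the parameters are
arbitrary and this is only an illustration of its shape, not an
agreed-upon regression test:
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>

#define NTASKS	5
#define RUNTIME	60			/* seconds */

int main(void)
{
	pid_t pids[NTASKS];
	int i;

	for (i = 0; i < NTASKS; i++) {
		pids[i] = fork();
		if (pids[i] < 0) {
			perror("fork");
			return 1;
		}
		if (pids[i] == 0) {
			/* child: pure CPU hog at nice level i */
			setpriority(PRIO_PROCESS, 0, i);
			for (;;)
				;
		}
	}
	sleep(RUNTIME);
	for (i = 0; i < NTASKS; i++) {
		struct rusage ru;
		int status;

		kill(pids[i], SIGKILL);
		wait4(pids[i], &status, 0, &ru);
		printf("nice %2d: %ld.%02ld s of CPU time\n", i,
		       (long)ru.ru_utime.tv_sec,
		       (long)ru.ru_utime.tv_usec / 10000);
	}
	return 0;
}
Success would then be measured by how far the printed distribution is
from whatever per-nice-level shares get decided up-front.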
-- wli
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 22:15 ` Daniel Walker
@ 2007-04-13 22:30 ` Ingo Molnar
2007-04-13 22:37 ` Willy Tarreau
2007-04-13 23:59 ` Daniel Walker
0 siblings, 2 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-13 22:30 UTC (permalink / raw)
To: Daniel Walker
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* Daniel Walker <dwalker@mvista.com> wrote:
> I'm not in love with the current or other schedulers, so I'm
> indifferent to this change. However, I was reviewing your release
> notes and the patch and found myself wondering what the algorithmic
> complexity of this new scheduler is. I assumed it would also be
> constant time, but the __enqueue_task_fair doesn't appear to be
> constant time (rbtree insert complexity). [...]
i've been worried about that myself and i've done extensive measurements
before choosing this implementation. The rbtree turned out to be a quite
compact data structure: we get it quite cheaply as part of the task
structure cachemisses - which have to be touched anyway. For 1000 tasks
it's a loop of ~10 - that's still very fast and bound in practice.
here's a test i did under CFS. Lets take some ridiculous load: 1000
infinite loop tasks running at SCHED_BATCH on a single CPU (all inserted
into the same rbtree), and lets run lat_ctx:
neptune:~/l> uptime
22:51:23 up 8 min, 2 users, load average: 713.06, 254.64, 91.51
neptune:~/l> ./lat_ctx -s 0 2
"size=0k ovr=1.61
2 1.41
lets stop the 1000 tasks and only have ~2 tasks in the runqueue:
neptune:~/l> ./lat_ctx -s 0 2
"size=0k ovr=1.70
2 1.16
so the overhead is 0.25 usecs. Considering the load (1000 tasks trash
the cache like crazy already), this is more than acceptable.
Ingo
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
` (4 preceding siblings ...)
2007-04-13 22:21 ` William Lee Irwin III
@ 2007-04-13 22:31 ` Willy Tarreau
2007-04-13 23:18 ` Ingo Molnar
2007-04-13 23:07 ` Gabriel C
` (8 subsequent siblings)
14 siblings, 1 reply; 577+ messages in thread
From: Willy Tarreau @ 2007-04-13 22:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
Hi Ingo,
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
(...)
> CFS's design is quite radical: it does not use runqueues, it uses a
> time-ordered rbtree to build a 'timeline' of future task execution,
> and thus has no 'array switch' artifacts (by which both the vanilla
> scheduler and RSDL/SD are affected).
I have high confidence this will work better: I've been using
time-ordered trees in userland projects for several years, and never
found anything better. To be honest, I never understood the concept
behind the array switch, but as I never felt brave enough to hack
something in this kernel area, I simply preferred to shut up (not
enough knowledge and not enough time).
However, I have been using a very fast struct timeval-ordered RADIX
tree. I found generic rbtree code to generally be slower, certainly
because of the call to a function with arguments on every node. Both
trees are O(log(n)), the rbtree being balanced and the radix tree
being unbalanced. If you're interested, I can try to see how that
would fit (but not this week-end).
Also, I had spent much time in the past doing paper work on how to
improve fairness between interactive tasks and batch tasks. I came
up with the conclusion that for perfectness, tasks should not be
ordered by their expected wakeup time, but by their expected completion
time, which automatically takes account of their allocated and used
timeslice. It would also allow both types of workloads to share equal
CPU time with better responsiveness for the most interactive one through
the reallocation of a "credit" for the tasks which have not consumed
all of their timeslices. I remember we had discussed this with Mike
about one year ago when he fixed lots of problems in the mainline scheduler.
The downside is that I never found a way to make this algo fit in
O(log(n)). I always ended up with something like O(n.log(n)), IIRC.
But maybe this is overkill for real life anyway. Given that a basic
two-array switch (which I never understood) was sufficient for many people,
a basic tree will probably be an order of magnitude better.
> CFS uses nanosecond granularity accounting and does not rely on any
> jiffies or other HZ detail. Thus the CFS scheduler has no notion of
> 'timeslices' and has no heuristics whatsoever. There is only one
> central tunable:
>
> /proc/sys/kernel/sched_granularity_ns
>
> which can be used to tune the scheduler from 'desktop' (low
> latencies) to 'server' (good batching) workloads. It defaults to a
> setting suitable for desktop workloads. SCHED_BATCH is handled by the
> CFS scheduler module too.
I find this useful, but to be fair with Mike and Con, they both have
proposed similar tuning knobs in the past and you said you did not want
to add that complexity for admins. People can sometimes be demotivated
by seeing their proposals finally used by people who first rejected them.
And since Mike and Con have both done a wonderful job in that area,
we need their experience and continued active participation more than ever.
> due to its design, the CFS scheduler is not prone to any of the
> 'attacks' that exist today against the heuristics of the stock
> scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
> work fine and do not impact interactivity and produce the expected
> behavior.
I'm very pleased to read this, because as I have already said, my major
concern with 2.6 was the stock scheduler. Recently, RSDL fixed most of the
basic problems for me, to the point that I switched the default lilo entry
on my notebook to 2.6! I hope that whatever the next scheduler will be,
we'll definitely get rid of any heuristics. Heuristics are good in 95% of
situations and extremely bad in the remaining 5%. I prefer something
reasonably good in 100% of situations.
> the CFS scheduler has a much stronger handling of nice levels and
> SCHED_BATCH: both types of workloads should be isolated much more
> aggressively than under the vanilla scheduler.
>
> ( another detail: due to nanosec accounting and timeline sorting,
> sched_yield() support is very simple under CFS, and in fact under
> CFS sched_yield() behaves much better than under any other
> scheduler i have tested so far. )
>
> - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
> way than the vanilla scheduler does. It uses 100 runqueues (for all
> 100 RT priority levels, instead of 140 in the vanilla scheduler)
> and it needs no expired array.
>
> - reworked/sanitized SMP load-balancing: the runqueue-walking
> assumptions are gone from the load-balancing code now, and
> iterators of the scheduling modules are used. The balancing code got
> quite a bit simpler as a result.
Will this have any impact on NUMA/HT/multi-core/etc... ?
> the core scheduler got smaller by more than 700 lines:
Well done !
Cheers,
Willy
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 22:30 ` Ingo Molnar
@ 2007-04-13 22:37 ` Willy Tarreau
2007-04-13 23:59 ` Daniel Walker
1 sibling, 0 replies; 577+ messages in thread
From: Willy Tarreau @ 2007-04-13 22:37 UTC (permalink / raw)
To: Ingo Molnar
Cc: Daniel Walker, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner
On Sat, Apr 14, 2007 at 12:30:17AM +0200, Ingo Molnar wrote:
>
> * Daniel Walker <dwalker@mvista.com> wrote:
>
> > I'm not in love with the current or other schedulers, so I'm
> > indifferent to this change. However, I was reviewing your release
> > notes and the patch and found myself wondering what the algorithmic
> > complexity of this new scheduler is. I assumed it would also be
> > constant time, but the __enqueue_task_fair doesn't appear to be
> > constant time (rbtree insert complexity). [...]
>
> i've been worried about that myself and i've done extensive measurements
> before choosing this implementation. The rbtree turned out to be a quite
> compact data structure: we get it quite cheaply as part of the task
> structure cachemisses - which have to be touched anyway. For 1000 tasks
> it's a loop of ~10 - that's still very fast and bound in practice.
I'm not worried at all by O(log(n)) algorithms, and generally prefer a smart
O(log(n)) to a dumb O(1).
In a userland TCP stack I started to write 2 years ago, I used a comparable
scheduler and could reach a sustained rate of 145000 connections/s at 4
million concurrent connections. And yes, each time a packet was sent or
received, a task was queued/dequeued (so about 450k/s with 4 million tasks,
on an athlon 1.5 GHz). So that seems much higher than what we currently need.
Regards,
Willy
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 22:21 ` William Lee Irwin III
@ 2007-04-13 22:52 ` Ingo Molnar
2007-04-13 23:30 ` William Lee Irwin III
2007-04-14 22:38 ` Davide Libenzi
1 sibling, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-13 22:52 UTC (permalink / raw)
To: William Lee Irwin III
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* William Lee Irwin III <wli@holomorphy.com> wrote:
> On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
> > i'm pleased to announce the first release of the "Modular Scheduler Core
> > and Completely Fair Scheduler [CFS]" patchset:
> > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
> > This project is a complete rewrite of the Linux task scheduler. My goal
> > is to address various feature requests and to fix deficiencies in the
> > vanilla scheduler that were suggested/found in the past few years, both
> > for desktop scheduling and for server scheduling workloads.
> > [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
> > new scheduler will be active by default and all tasks will default
> > to the new SCHED_FAIR interactive scheduling class. ]
>
> A pleasant surprise, though I did see it coming.
hey ;)
> On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> > Highlights are:
> > - the introduction of Scheduling Classes: an extensible hierarchy of
> > scheduler modules. These modules encapsulate scheduling policy
> > details and are handled by the scheduler core without the core
> > code making too many assumptions about them.
>
> It probably needs further clarification that they're things on the
> order of SCHED_FIFO, SCHED_RR, SCHED_NORMAL, etc.; some prioritization
> amongst the classes is furthermore assumed, and so on. [...]
yep - they are linked via sched_ops->next pointer, with NULL delimiting
the last one.
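( to illustrate - the callback name below is made up, only the ->next
  linkage is what i'm describing above: the core picks the next task by
  walking the classes in priority order until one of them has something
  runnable: )
struct rq;
struct task_struct;

struct sched_ops {
	struct sched_ops *next;		/* lower-priority class, NULL-terminated */
	struct task_struct *(*pick_next_task)(struct rq *rq);
};

extern struct sched_ops sched_ops_rt;	/* stacked on top of ...	*/
extern struct sched_ops sched_ops_fair;	/* ... the fair class		*/

static struct task_struct *pick_next_task(struct rq *rq)
{
	struct sched_ops *ops;
	struct task_struct *p;

	for (ops = &sched_ops_rt; ops; ops = ops->next) {
		p = ops->pick_next_task(rq);
		if (p)
			return p;
	}
	return NULL;			/* nothing runnable: go idle */
}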
> [...] They're not quite capable of being full-blown alternative
> policies, though quite a bit can be crammed into them.
yeah, they are not full-blown: i extended them on-demand, for the
specific purposes of sched_fair.c and sched_rt.c. More can be done too.
> There are issues with the per-scheduling-class data not being very
> well-abstracted. [...]
yes. It's on my TODO list: i'll work more on extending the cleanups to
those fields too.
> A binomial heap would likely serve your purposes better than rbtrees.
> It's faster to have the next item to dequeue at the root of the tree
> structure rather than a leaf, for one. There are, of course, other
> priority queue structures (e.g. van Emde Boas) able to exploit the
> limited precision of the priority key for faster asymptotics, though
> actual performance is an open question.
i'm caching the leftmost leaf, which serves as an alternate, task-pick
centric root in essence.
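( a sketch of what that caching amounts to - again with made-up names,
  not the patch's code; the cached pointer is kept up to date by the
  enqueue/dequeue paths, so picking the next task is O(1): )
#include <linux/rbtree.h>

struct timeline {
	struct rb_root	root;
	struct rb_node	*leftmost;	/* cached 'next task to run' node */
};

/* maintenance rule: if an insertion walk never takes a right turn, the
 * new node has the smallest key and becomes ->leftmost. */

static struct rb_node *timeline_pick_next(struct timeline *tl)
{
	return tl->leftmost;		/* O(1), no walk down to rb_first() */
}

static void timeline_dequeue(struct timeline *tl, struct rb_node *node)
{
	if (node == tl->leftmost)
		tl->leftmost = rb_next(node);	/* next-smallest key */
	rb_erase(node, &tl->root);
}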
> Another advantage of heaps is that they support decreasing priorities
> directly, so that instead of removal and reinsertion, a less invasive
> movement within the tree is possible. This nets additional constant
> factor improvements beyond those for the next item to dequeue for the
> case where a task remains runnable, but is preempted and its priority
> decreased while it remains runnable.
yeah. (Note that in CFS i'm not decreasing priorities anywhere though -
all the priority levels in CFS stay constant, fairness is not achieved
via rotating priorities or similar, it is achieved via the accounting
code.)
> On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> > due to its design, the CFS scheduler is not prone to any of the
> > 'attacks' that exist today against the heuristics of the stock
> > scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
> > work fine and do not impact interactivity and produce the expected
> > behavior.
>
> I'm always suspicious of these claims. [...]
hey, sure - but please give it a go nevertheless, i _did_ test all these
;)
> A moderately formal regression test suite needs to be assembled [...]
by all means feel free! ;)
> A more general question here is what you mean by "completely fair;"
by that i mean the most common-sense definition: with N tasks running
each gets 1/N CPU time if observed for a reasonable amount of time. Now
extend this to arbitrary scheduling patterns, the end result should
still be completely fair, according to the fundamental 1/N(time) rule
individually applied to all the small scheduling patterns that the
scheduling patterns give. (this assumes that the scheduling patterns are
reasonably independent of each other - if they are not then there's no
reasonable definition of fairness that makes sense, and we might as well
use the 1/N rule for those cases too.)
> there doesn't appear to be inter-tgrp, inter-pgrp, inter-session, or
> inter-user fairness going on, though one might argue those are
> relatively obscure notions of fairness. [...]
sure, i mainly concentrated on what we have in Linux today. The things
you mention are add-ons that i can see handling via new scheduling
classes: all the CKRM and containers type of CPU time management
facilities.
> What these things mean when there are multiple CPU's to schedule
> across may also be of concern.
that is handled by the existing smp-nice load balancer, that logic is
preserved under CFS.
> These testcases are oblivious to SMP. This will demand that a
> scheduling policy integrate with load balancing to the extent that
> load balancing occurs for the sake of distributing CPU bandwidth
> according to nice level. Some explicit decision should be made
> regarding that.
this should already work reasonably fine with CFS: try massive_intr.c on
an SMP box.
Ingo
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
` (5 preceding siblings ...)
2007-04-13 22:31 ` Willy Tarreau
@ 2007-04-13 23:07 ` Gabriel C
2007-04-13 23:25 ` Ingo Molnar
2007-04-14 2:04 ` Nick Piggin
` (7 subsequent siblings)
14 siblings, 1 reply; 577+ messages in thread
From: Gabriel C @ 2007-04-13 23:07 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>
>
[...]
> as usual, any sort of feedback, bugreports, fixes and suggestions are
> more than welcome,
>
Compile error here.
...
CC kernel/sched.o
kernel/sched.c: In function '__rq_clock':
kernel/sched.c:219: error: 'struct rq' has no member named 'cpu'
kernel/sched.c:219: warning: type defaults to 'int' in declaration of
'__ret_warn_once'
kernel/sched.c:219: error: 'struct rq' has no member named 'cpu'
kernel/sched.c: In function 'rq_clock':
kernel/sched.c:230: error: 'struct rq' has no member named 'cpu'
kernel/sched.c: In function 'sched_init':
kernel/sched.c:6013: warning: unused variable 'j'
make[1]: *** [kernel/sched.o] Error 1
make: *** [kernel] Error 2
==> ERROR: Build Failed. Aborting...
...
There the config :
http://frugalware.org/~crazy/other/kernel/config
> Ingo
Regards,
Gabriel
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 22:31 ` Willy Tarreau
@ 2007-04-13 23:18 ` Ingo Molnar
2007-04-14 18:48 ` Bill Huey
0 siblings, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-13 23:18 UTC (permalink / raw)
To: Willy Tarreau
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* Willy Tarreau <w@1wt.eu> wrote:
> > central tunable:
> >
> > /proc/sys/kernel/sched_granularity_ns
> >
> > which can be used to tune the scheduler from 'desktop' (low
> > latencies) to 'server' (good batching) workloads. It defaults to a
> > setting suitable for desktop workloads. SCHED_BATCH is handled by the
> > CFS scheduler module too.
>
> I find this useful, but to be fair with Mike and Con, they both have
> proposed similar tuning knobs in the past and you said you did not
> want to add that complexity for admins. [...]
yeah. [ Note that what i opposed in the past was mostly the 'export all
the zillions of sched.c knobs to /sys and let people mess with them' kind
of patches, which did exist and still exist. A _single_ knob, which
represents basically the totality of parameters within sched_fair.c, is
less of a problem. I don't think i ever objected to this knob within
staircase/SD. (If i did then i was dead wrong.) ]
> [...] People can sometimes be demotivated by seeing their proposals
> finally used by people who first rejected them. And since Mike
> and Con have both done a wonderful job in that area, we need their
> experience and continued active participation more than ever.
very much so! Both Con and Mike have contributed regularly to upstream
sched.c:
$ git-log kernel/sched.c | grep 'by: Con Kolivas' | wc -l
19
$ git-log kernel/sched.c | grep 'by: Mike' | wc -l
6
and i'd very much like both counts to increase steadily in the future
too :)
> > - reworked/sanitized SMP load-balancing: the runqueue-walking
> > assumptions are gone from the load-balancing code now, and
> > iterators of the scheduling modules are used. The balancing code
> > got quite a bit simpler as a result.
>
> Will this have any impact on NUMA/HT/multi-core/etc... ?
it will inevitably have some sort of effect - and if it's negative, i'll
try to fix it.
I got rid of the explicit cache-hot tracking code and replaced it with a
more natural pure 'pick the next-to-run task first, that is likely the
most cache-cold one' logic. That just derives naturally from the rbtree
approach.
> > the core scheduler got smaller by more than 700 lines:
>
> Well done !
thanks :)
Ingo
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 23:07 ` Gabriel C
@ 2007-04-13 23:25 ` Ingo Molnar
2007-04-13 23:39 ` Gabriel C
0 siblings, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-13 23:25 UTC (permalink / raw)
To: Gabriel C
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* Gabriel C <nix.or.die@googlemail.com> wrote:
> > as usual, any sort of feedback, bugreports, fixes and suggestions
> > are more than welcome,
>
> Compile error here.
ah, !CONFIG_SMP. Does the patch below do the trick for you? (I've also
updated the full patch at the cfs-scheduler URL)
Ingo
----------------------->
From: Ingo Molnar <mingo@elte.hu>
Subject: [cfs] fix !CONFIG_SMP build
fix the !CONFIG_SMP build error reported by Gabriel C
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -257,16 +257,6 @@ static inline unsigned long long __rq_cl
return rq->rq_clock;
}
-static inline unsigned long long rq_clock(struct rq *rq)
-{
- int this_cpu = smp_processor_id();
-
- if (this_cpu == rq->cpu)
- return __rq_clock(rq);
-
- return rq->rq_clock;
-}
-
static inline int cpu_of(struct rq *rq)
{
#ifdef CONFIG_SMP
@@ -276,6 +266,16 @@ static inline int cpu_of(struct rq *rq)
#endif
}
+static inline unsigned long long rq_clock(struct rq *rq)
+{
+ int this_cpu = smp_processor_id();
+
+ if (this_cpu == cpu_of(rq))
+ return __rq_clock(rq);
+
+ return rq->rq_clock;
+}
+
/*
* The domain tree (rq->sd) is protected by RCU's quiescent state transition.
* See detach_destroy_domains: synchronize_sched for details.
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 22:52 ` Ingo Molnar
@ 2007-04-13 23:30 ` William Lee Irwin III
2007-04-13 23:44 ` Ingo Molnar
0 siblings, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-13 23:30 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* William Lee Irwin III <wli@holomorphy.com> wrote:
>> A binomial heap would likely serve your purposes better than rbtrees.
>> It's faster to have the next item to dequeue at the root of the tree
>> structure rather than a leaf, for one. There are, of course, other
>> priority queue structures (e.g. van Emde Boas) able to exploit the
>> limited precision of the priority key for faster asymptotics, though
>> actual performance is an open question.
On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> i'm caching the leftmost leaf, which serves as an alternate, task-pick
> centric root in essence.
I noticed that, yes. It seemed a better idea to me to use a data
structure that has what's needed built-in, but I suppose it's not gospel.
* William Lee Irwin III <wli@holomorphy.com> wrote:
>> Another advantage of heaps is that they support decreasing priorities
>> directly, so that instead of removal and reinsertion, a less invasive
>> movement within the tree is possible. This nets additional constant
>> factor improvements beyond those for the next item to dequeue for the
>> case where a task remains runnable, but is preempted and its priority
>> decreased while it remains runnable.
On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> yeah. (Note that in CFS i'm not decreasing priorities anywhere though -
> all the priority levels in CFS stay constant, fairness is not achieved
> via rotating priorities or similar, it is achieved via the accounting
> code.)
Sorry, "priority" here would be from the POV of the queue data
structure. From the POV of the scheduler it would be resetting the
deadline or whatever the nomenclature cooked up for things is, most
obviously in requeue_task_fair() and task_tick_fair().
* William Lee Irwin III <wli@holomorphy.com> wrote:
>> I'm always suspicious of these claims. [...]
On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> hey, sure - but please give it a go nevertheless, i _did_ test all these
> ;)
The suspicion essentially centers around how long the state of affairs
will hold up because comprehensive re-testing is not noticeably done
upon updates to scheduling code or kernel point releases.
* William Lee Irwin III <wli@holomorphy.com> wrote:
>> A moderately formal regression test suite needs to be assembled [...]
On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> by all means feel free! ;)
I can only do so much, but I have done work to clean up other testcases
going around. I'm mostly looking at testcases as I go over them or
develop some interest in the subject and rewriting those that already
exist or hammering out new ones as I need them. The main contribution
toward this is that I've sort of made a mental note to stash the results
of the effort somewhere and pass them along to those who do regular
testing on kernels or otherwise import test suites into their collections.
* William Lee Irwin III <wli@holomorphy.com> wrote:
>> A more general question here is what you mean by "completely fair;"
On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> by that i mean the most common-sense definition: with N tasks running
> each gets 1/N CPU time if observed for a reasonable amount of time. Now
> extend this to arbitrary scheduling patterns, the end result should
> still be completely fair, according to the fundamental 1/N(time) rule
> individually applied to all the small scheduling patterns that the
> scheduling patterns give. (this assumes that the scheduling patterns are
> reasonably independent of each other - if they are not then there's no
> reasonable definition of fairness that makes sense, and we might as well
> use the 1/N rule for those cases too.)
I'd start with identically-behaving CPU-bound tasks here. It's easy
enough to hammer out a testcase that starts up N CPU-bound tasks, runs
them for a few minutes, stops them, collects statistics on their
runtime, and gives us an idea of whether 1/N came out properly. I'll
get around to that at some point.
Where it gets complex is when the behavior patterns vary, e.g. they're
not entirely CPU-bound and their desired in-isolation CPU utilization
varies, or when nice levels vary, or both vary. I went on about
testcases for those in particular in the prior post, though not both
at once. The nice level one in particular needs an up-front goal for
distribution of CPU bandwidth in a mixture of competing tasks with
varying nice levels.
There are different ways to define fairness, but a uniform distribution
of CPU bandwidth across a set of identical competing tasks is a good,
testable definition.
* William Lee Irwin III <wli@holomorphy.com> wrote:
>> there doesn't appear to be inter-tgrp, inter-pgrp, inter-session, or
>> inter-user fairness going on, though one might argue those are
>> relatively obscure notions of fairness. [...]
On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> sure, i mainly concentrated on what we have in Linux today. The things
> you mention are add-ons that i can see handling via new scheduling
> classes: all the CKRM and containers type of CPU time management
> facilities.
At some point the CKRM and container people should be pinged to see
what (if anything) they need to achieve these sorts of things. It's
not clear to me that the specific cases I cited are considered
relevant to anyone. I presume that if they are, someone will pipe
up with a feature request. It was more a sort of catalogue of different
notions of fairness that could arise than any sort of suggestion.
* William Lee Irwin III <wli@holomorphy.com> wrote:
>> What these things mean when there are multiple CPU's to schedule
>> across may also be of concern.
On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> that is handled by the existing smp-nice load balancer, that logic is
> preserved under CFS.
Given the things going wrong, I'm curious as to whether that works, and
if so, how well. I'll drop that into my list of testcases that should be
arranged for, though I won't guarantee that I'll get to it myself in any
sort of timely fashion.
What this ultimately needs is specifying the semantics of nice levels
so that we can say that a mixture of competing tasks with varying nice
levels should have an ideal distribution of CPU bandwidth to check for.
* William Lee Irwin III <wli@holomorphy.com> wrote:
>> These testcases are oblivious to SMP. This will demand that a
>> scheduling policy integrate with load balancing to the extent that
>> load balancing occurs for the sake of distributing CPU bandwidth
>> according to nice level. Some explicit decision should be made
>> regarding that.
On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> this should already work reasonably fine with CFS: try massive_intr.c on
> an SMP box.
Where is massive_intr.c, BTW?
-- wli
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 23:25 ` Ingo Molnar
@ 2007-04-13 23:39 ` Gabriel C
0 siblings, 0 replies; 577+ messages in thread
From: Gabriel C @ 2007-04-13 23:39 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
Ingo Molnar wrote:
> * Gabriel C <nix.or.die@googlemail.com> wrote:
>
>
>>> as usual, any sort of feedback, bugreports, fixes and suggestions
>>> are more than welcome,
>>>
>> Compile error here.
>>
>
> ah, !CONFIG_SMP. Does the patch below do the trick for you? (I've also
> updated the full patch at the cfs-scheduler URL)
>
Yes it does, thx :), only the "warning: unused variable 'j'" is left.
> Ingo
>
Regards,
Gabriel
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 23:30 ` William Lee Irwin III
@ 2007-04-13 23:44 ` Ingo Molnar
2007-04-13 23:58 ` William Lee Irwin III
0 siblings, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-13 23:44 UTC (permalink / raw)
To: William Lee Irwin III
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
[-- Attachment #1: Type: text/plain, Size: 1358 bytes --]
* William Lee Irwin III <wli@holomorphy.com> wrote:
> Where it gets complex is when the behavior patterns vary, e.g. they're
> not entirely CPU-bound and their desired in-isolation CPU utilization
> varies, or when nice levels vary, or both vary. [...]
yes. I tested things like 'massive_intr.c' (attached, written by Satoru
Takeuchi) which starts N tasks which each work for 8msec then sleep
1msec:
from its output, the second column is the CPU time each task got; the
more even it is, the fairer the scheduling. On vanilla i get:
mercury:~> ./massive_intr 10 10
024873 00000150
024874 00000123
024870 00000069
024868 00000068
024866 00000051
024875 00000206
024872 00000093
024869 00000138
024867 00000078
024871 00000223
on CFS i get:
neptune:~> ./massive_intr 10 10
002266 00000112
002260 00000113
002261 00000112
002267 00000112
002269 00000112
002265 00000112
002262 00000113
002268 00000113
002264 00000112
002263 00000113
so it is quite a bit more even ;)
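To put a number on "more even", here is a small editorial helper (the
arrays are copied from the two runs quoted above; nothing else is implied):

#include <stdio.h>

static void spread(const char *name, const int *v, int n)
{
	int i, min = v[0], max = v[0];
	long sum = 0;

	for (i = 0; i < n; i++) {
		if (v[i] < min) min = v[i];
		if (v[i] > max) max = v[i];
		sum += v[i];
	}
	printf("%-8s min=%3d max=%3d mean=%6.1f max/min=%.2f\n",
	       name, min, max, (double)sum / n, (double)max / min);
}

int main(void)
{
	const int vanilla[] = { 150, 123, 69, 68, 51, 206, 93, 138, 78, 223 };
	const int cfs[]     = { 112, 113, 112, 112, 112, 112, 113, 113, 112, 113 };

	spread("vanilla", vanilla, 10);
	spread("cfs", cfs, 10);
	return 0;
}

It reports roughly a 4.4x max/min spread for the vanilla run versus about
1.01x under CFS.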
another related test-utility is one i wrote:
http://people.redhat.com/mingo/scheduler-patches/ring-test.c
this is a ring of 100 tasks each doing work for 100 msecs and then
sleeping for 1 msec. I usually test this by also running a CPU hog in
parallel to it, and checking whether it gets ~50.0% of CPU time under
CFS. (it does)
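A minimal sketch of this kind of ring (an editorial illustration based only
on the description above, not Ingo's ring-test.c itself; the constants and
the pipe-per-neighbour structure are assumptions):

#include <stdlib.h>
#include <unistd.h>
#include <time.h>

#define NTASKS   100
#define WORK_MS  100
#define SLEEP_MS 1

static void burn_ms(int ms)
{
	clock_t end = clock() + (clock_t)ms * CLOCKS_PER_SEC / 1000;

	while (clock() < end)
		;	/* busy loop: consume CPU for roughly 'ms' milliseconds */
}

int main(void)
{
	static int pipes[NTASKS][2];
	char token = 't';
	int i;

	for (i = 0; i < NTASKS; i++)
		if (pipe(pipes[i]) < 0)
			exit(1);

	for (i = 0; i < NTASKS; i++) {
		if (fork() == 0) {
			int in = pipes[i][0];			/* my token arrives here */
			int out = pipes[(i + 1) % NTASKS][1];	/* pass it to my neighbour */
			struct timespec ts = { 0, SLEEP_MS * 1000000L };

			for (;;) {
				read(in, &token, 1);
				burn_ms(WORK_MS);
				nanosleep(&ts, NULL);
				write(out, &token, 1);
			}
		}
	}
	write(pipes[0][1], &token, 1);	/* inject the token and let it circulate */
	for (;;)
		pause();	/* run a CPU hog alongside and watch its share in top */
}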
Ingo
[-- Attachment #2: massive_intr.c --]
[-- Type: text/plain, Size: 9833 bytes --]
#if 0
Hi Ingo and all,
When I was running massive numbers of interactive processes, I found that some
of them occupy the CPU and the others hardly run.
It seems that some of the processes which occupy CPU time always have the max
effective prio (default+5) and the others have max - 1. What happens here is...
1. If there is a moderate number of max-interactive processes, they can be
re-inserted into the active queue again and again without their priority ever
dropping.
2. In this case, the others seldom run and can't get the max effective priority
at their next timeslice expiry, because the scheduler considers them to have
slept too long.
3. Goto 1, OOPS!
Unfortunately I haven't been able to come up with a patch resolving this
problem yet. Any ideas?
I also attach the test program which easily recreates this problem.
Test program flow:
1. The first process starts the child processes and waits for 5 minutes.
2. Each child process executes the "work 8 msec and sleep 1 msec" loop
continuously.
3. After 3 minutes have passed, each child process prints the # of loops
it executed.
What is expected:
Each child process executes a nearly equal # of loops.
Test environment:
- kernel: 2.6.20(*1)
- # of CPUs: 1 or 2
- # of child processes: 200 or 400
- nice value: 0 or 20(*2)
*1) I confirmed that 2.6.21-rc5 has no change regarding this problem.
*2) If a process has nice 20, the scheduler never regards it as interactive.
Test results:
-----------+----------------+------+------------------------------------
 # of CPUs | # of processes | nice | result
-----------+----------------+------+------------------------------------
           |                |  20  | looks good
  1 (i386) |      200       +------+------------------------------------
           |                |   0  | 4 processes occupy 98% of CPU time
-----------+----------------+------+------------------------------------
           |                |  20  | looks good
           |      200       +------+------------------------------------
           |                |   0  | 8 processes occupy 72% of CPU time
  2 (ia64) +----------------+------+------------------------------------
           |                |  20  | looks good
           |      400       +------+------------------------------------
           |                |   0  | 8 processes occupy 98% of CPU time
-----------+----------------+------+------------------------------------
FYI, 2.6.21-rc3-mm1 (which enables the RSDL scheduler) works fine in all cases :-)
Thanks,
Satoru
-------------------------------------------------------------------------------
#endif
/*
* massive_intr - run @nproc interactive processes and print the number of
* loops(*1) each process executes in @runtime secs.
*
* *1) "work 8 msec and sleep 1msec" loop
*
* Usage: massive_intr <nproc> <runtime>
*
* @nproc: number of processes
* @runtime: execute time[sec]
*
* ex) If you want to run 300 processes for 5 mins, issue the
* command as follows:
*
* $ massive_intr 300 300
*
* How to build:
*
* cc -o massive_intr massive_intr.c -lrt
*
*
* Copyright (C) 2007 Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
*
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or (at
* your option) any later version.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public License along
* with this program; if not, write to the Free Software Foundation, Inc.,
* 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
*
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <errno.h>
#include <err.h>
#define WORK_MSECS 8
#define SLEEP_MSECS 1
#define MAX_PROC 1024
#define SAMPLE_COUNT 1000000000
#define USECS_PER_SEC 1000000
#define USECS_PER_MSEC 1000
#define NSECS_PER_MSEC 1000000
#define SHMEMSIZE 4096
static const char *shmname = "/sched_interactive_shmem";
static void *shmem;
static sem_t *printsem;
static int nproc;
static int runtime;
static int fd;
static time_t *first;
static pid_t pid[MAX_PROC];
static int return_code;
static void cleanup_resources(void)
{
if (sem_destroy(printsem) < 0)
warn("sem_destroy() failed");
if (munmap(shmem, SHMEMSIZE) < 0)
warn("munmap() failed");
if (close(fd) < 0)
warn("close() failed");
}
static void abnormal_exit(void)
{
if (kill(getppid(), SIGUSR2) < 0)
err(EXIT_FAILURE, "kill() failed");
}
static void sighandler(int signo)
{
}
static void sighandler2(int signo)
{
return_code = EXIT_FAILURE;
}
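/* burn CPU with an empty loop of the given iteration count */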
static void loopfnc(int nloop)
{
int i;
for (i = 0; i < nloop; i++)
;
}
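/* calibrate how many empty-loop iterations fit into one millisecond */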
static int loop_per_msec(void)
{
struct timeval tv[2];
int before, after;
if (gettimeofday(&tv[0], NULL) < 0)
return -1;
loopfnc(SAMPLE_COUNT);
if (gettimeofday(&tv[1], NULL) < 0)
return -1;
before = tv[0].tv_sec*USECS_PER_SEC+tv[0].tv_usec;
after = tv[1].tv_sec*USECS_PER_SEC+tv[1].tv_usec;
return SAMPLE_COUNT/(after - before)*USECS_PER_MSEC;
}
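/*
 * Child body: wait for SIGUSR1 to start, then repeat the "work 8 msec,
 * sleep 1 msec" loop until @runtime expires, and print the loop count.
 */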
static void *test_job(void *arg)
{
int l = (int)arg;
int count = 0;
time_t current;
sigset_t sigset;
struct sigaction sa;
struct timespec ts = { 0, NSECS_PER_MSEC*SLEEP_MSECS};
sa.sa_handler = sighandler;
if (sigemptyset(&sa.sa_mask) < 0) {
warn("sigemptyset() failed");
abnormal_exit();
}
sa.sa_flags = 0;
if (sigaction(SIGUSR1, &sa, NULL) < 0) {
warn("sigaction() failed");
abnormal_exit();
}
if (sigemptyset(&sigset) < 0) {
warn("sigfillset() failed");
abnormal_exit();
}
sigsuspend(&sigset);
if (errno != EINTR) {
warn("sigsuspend() failed");
abnormal_exit();
}
/* main loop */
do {
loopfnc(WORK_MSECS*l);
if (nanosleep(&ts, NULL) < 0) {
warn("nanosleep() failed");
abnormal_exit();
}
count++;
if (time(&current) == -1) {
warn("time() failed");
abnormal_exit();
}
} while (difftime(current, *first) < runtime);
if (sem_wait(printsem) < 0) {
warn("sem_wait() failed");
abnormal_exit();
}
printf("%06d\t%08d\n", getpid(), count);
if (sem_post(printsem) < 0) {
warn("sem_post() failed");
abnormal_exit();
}
exit(EXIT_SUCCESS);
}
static void usage(void)
{
fprintf(stderr,
"Usage : massive_intr <nproc> <runtime>\n"
"\t\tnproc : number of processes\n"
"\t\truntime : execute time[sec]\n");
exit(EXIT_FAILURE);
}
int main(int argc, char **argv)
{
int i, j;
int status;
sigset_t sigset;
struct sigaction sa;
int c;
if (argc != 3)
usage();
nproc = strtol(argv[1], NULL, 10);
if (errno || nproc < 1 || nproc > MAX_PROC)
err(EXIT_FAILURE, "invalid multinum");
runtime = strtol(argv[2], NULL, 10);
if (errno || runtime <= 0)
err(EXIT_FAILURE, "invalid runtime");
sa.sa_handler = sighandler2;
if (sigemptyset(&sa.sa_mask) < 0)
err(EXIT_FAILURE, "sigemptyset() failed");
sa.sa_flags = 0;
if (sigaction(SIGUSR2, &sa, NULL) < 0)
err(EXIT_FAILURE, "sigaction() failed");
if (sigemptyset(&sigset) < 0)
err(EXIT_FAILURE, "sigemptyset() failed");
if (sigaddset(&sigset, SIGUSR1) < 0)
err(EXIT_FAILURE, "sigaddset() failed");
if (sigaddset(&sigset, SIGUSR2) < 0)
err(EXIT_FAILURE, "sigaddset() failed");
if (sigprocmask(SIG_BLOCK, &sigset, NULL) < 0)
err(EXIT_FAILURE, "sigprocmask() failed");
/* setup shared memory */
if ((fd = shm_open(shmname, O_CREAT | O_RDWR, 0644)) < 0)
err(EXIT_FAILURE, "shm_open() failed");
if (shm_unlink(shmname) < 0) {
warn("shm_unlink() failed");
goto err_close;
}
if (ftruncate(fd, SHMEMSIZE) < 0) {
warn("ftruncate() failed");
goto err_close;
}
shmem = mmap(NULL, SHMEMSIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (shmem == (void *)-1) {
warn("mmap() failed");
goto err_close;	/* mmap() failed, so there is nothing to munmap */
}
printsem = shmem;
first = shmem + sizeof(*printsem);
/* initialize semaphore */
if ((sem_init(printsem, 1, 1)) < 0) {
warn("sem_init() failed");
goto err_unmap;
}
if ((c = loop_per_msec()) < 0) {
fprintf(stderr, "loop_per_msec() failed\n");
goto err_sem;
}
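	/* fork the workers; each blocks in sigsuspend() until SIGUSR1 arrives */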
for (i = 0; i < nproc; i++) {
pid[i] = fork();
if (pid[i] == -1) {
warn("fork() failed\n");
for (j = 0; j < i; j++)
if (kill(pid[j], SIGKILL) < 0)
warn("kill() failed");
goto err_sem;
}
if (pid[i] == 0)
test_job((void *)c);
}
if (sigemptyset(&sigset) < 0) {
warn("sigemptyset() failed");
goto err_proc;
}
if (sigaddset(&sigset, SIGUSR2) < 0) {
warn("sigaddset() failed");
goto err_proc;
}
if (sigprocmask(SIG_UNBLOCK, &sigset, NULL) < 0) {
warn("sigprocmask() failed");
goto err_proc;
}
if (time(first) < 0) {
warn("time() failed");
goto err_proc;
}
if ((kill(0, SIGUSR1)) == -1) {
warn("kill() failed");
goto err_proc;
}
for (i = 0; i < nproc; i++) {
if (wait(&status) < 0) {
warn("wait() failed");
goto err_proc;
}
}
cleanup_resources();
exit(return_code);
err_proc:
for (i = 0; i < nproc; i++)
if (kill(pid[i], SIGKILL) < 0)
if (errno != ESRCH)
warn("kill() failed");
err_sem:
if (sem_destroy(printsem) < 0)
warn("sem_destroy() failed");
err_unmap:
if (munmap(shmem, SHMEMSIZE) < 0)
warn("munmap() failed");
err_close:
if (close(fd) < 0)
warn("close() failed");
exit(EXIT_FAILURE);
}
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 23:44 ` Ingo Molnar
@ 2007-04-13 23:58 ` William Lee Irwin III
0 siblings, 0 replies; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-13 23:58 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* William Lee Irwin III <wli@holomorphy.com> wrote:
>> Where it gets complex is when the behavior patterns vary, e.g. they're
>> not entirely CPU-bound and their desired in-isolation CPU utilization
>> varies, or when nice levels vary, or both vary. [...]
On Sat, Apr 14, 2007 at 01:44:44AM +0200, Ingo Molnar wrote:
> yes. I tested things like 'massive_intr.c' (attached, written by Satoru
> Takeuchi) which starts N tasks which each work for 8msec then sleep
> 1msec:
[...]
> another related test-utility is one i wrote:
> http://people.redhat.com/mingo/scheduler-patches/ring-test.c
> this is a ring of 100 tasks each doing work for 100 msecs and then
> sleeping for 1 msec. I usually test this by also running a CPU hog in
> parallel to it, and checking whether it gets ~50.0% of CPU time under
> CFS. (it does)
These are both tremendously useful. The code is also in rather good
shape so only minimal modifications (for massive_intr.c I'm not even
sure if any are needed at all) are needed to plug them into the test
harness I'm aware of. I'll queue them both for me to adjust and send
over to testers I don't want to burden with hacking on testcases I
myself am asking them to add to their suites.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 22:30 ` Ingo Molnar
2007-04-13 22:37 ` Willy Tarreau
@ 2007-04-13 23:59 ` Daniel Walker
2007-04-14 10:55 ` Ingo Molnar
1 sibling, 1 reply; 577+ messages in thread
From: Daniel Walker @ 2007-04-13 23:59 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
One other thing: what happens in the case of slow, frequency-changing,
and/or inaccurate clocks? Is the old sched_clock() behavior still
tolerated?
Daniel
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
` (6 preceding siblings ...)
2007-04-13 23:07 ` Gabriel C
@ 2007-04-14 2:04 ` Nick Piggin
2007-04-14 6:32 ` Ingo Molnar
2007-04-14 15:09 ` S.Çağlar Onur
` (6 subsequent siblings)
14 siblings, 1 reply; 577+ messages in thread
From: Nick Piggin @ 2007-04-14 2:04 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
Always good to see another contender ;)
>
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.
>
> [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
> new scheduler will be active by default and all tasks will default
> to the new SCHED_FAIR interactive scheduling class. ]
I don't know why there is such noise about fairness right now... I
thought fairness was one of the fundamental properties of a good CPU
scheduler, and my scheduler definitely always aims for that above most
other things. Why not just keep SCHED_OTHER?
> Highlights are:
>
> - the introduction of Scheduling Classes: an extensible hierarchy of
> scheduler modules. These modules encapsulate scheduling policy
> details and are handled by the scheduler core without the core
> code assuming about them too much.
Don't really like this, but anyway...
> - sched_fair.c implements the 'CFS desktop scheduler': it is a
> replacement for the vanilla scheduler's SCHED_OTHER interactivity
> code.
>
> i'd like to give credit to Con Kolivas for the general approach here:
> he has proven via RSDL/SD that 'fair scheduling' is possible and that
> it results in better desktop scheduling. Kudos Con!
I guess the 2.4 and earlier scheduler kind of did that as well.
> The CFS patch uses a completely different approach and implementation
> from RSDL/SD. My goal was to make CFS's interactivity quality exceed
> that of RSDL/SD, which is a high standard to meet :-) Testing
> feedback is welcome to decide this one way or another. [ and, in any
> case, all of SD's logic could be added via a kernel/sched_sd.c module
> as well, if Con is interested in such an approach. ]
Comment about the code: shouldn't you be requeueing the task in the rbtree
wherever you change wait_runtime? eg. task_new_fair? (I've only had a quick
look so far).
> CFS's design is quite radical: it does not use runqueues, it uses a
> time-ordered rbtree to build a 'timeline' of future task execution,
> and thus has no 'array switch' artifacts (by which both the vanilla
> scheduler and RSDL/SD are affected).
>
> CFS uses nanosecond granularity accounting and does not rely on any
> jiffies or other HZ detail. Thus the CFS scheduler has no notion of
> 'timeslices' and has no heuristics whatsoever.
Well, I guess there is still some mechanism to decide which process is most
eligible to run? ;) Considering that question has no "right" answer for
SCHED_OTHER scheduling, I guess you could say it has heuristics. But granted
they are obviously fairly elegant in contrast to the O(1) scheduler ;)
> There is only one
> central tunable:
>
> /proc/sys/kernel/sched_granularity_ns
Suppose you have 2 CPU hogs running, is sched_granularity_ns the
frequency at which they will context switch?
> ( another rdetail: due to nanosec accounting and timeline sorting,
> sched_yield() support is very simple under CFS, and in fact under
> CFS sched_yield() behaves much better than under any other
> scheduler i have tested so far. )
What is better behaviour for sched_yield?
Thanks,
Nick
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 2:04 ` Nick Piggin
@ 2007-04-14 6:32 ` Ingo Molnar
2007-04-14 6:43 ` Ingo Molnar
0 siblings, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-14 6:32 UTC (permalink / raw)
To: Nick Piggin
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* Nick Piggin <npiggin@suse.de> wrote:
> > The CFS patch uses a completely different approach and implementation
> > from RSDL/SD. My goal was to make CFS's interactivity quality exceed
> > that of RSDL/SD, which is a high standard to meet :-) Testing
> > feedback is welcome to decide this one way or another. [ and, in any
> > case, all of SD's logic could be added via a kernel/sched_sd.c module
> > as well, if Con is interested in such an approach. ]
>
> Comment about the code: shouldn't you be requeueing the task in the
> rbtree wherever you change wait_runtime? eg. task_new_fair? [...]
yes: the task's position within the rbtree is updated whenever
wait_runtime is changed. task_new_fair is the method used during new
task creation, but indeed i forgot to requeue the parent. I've fixed
this in my tree (see the delta patch below) - thanks!
Ingo
----------->
From: Ingo Molnar <mingo@elte.hu>
Subject: [cfs] fix parent's rbtree position
Nick noticed that upon fork we change parent->wait_runtime but we do not
requeue it within the rbtree.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -524,6 +524,8 @@ static void task_new_fair(struct rq *rq,
 	p->wait_runtime = parent->wait_runtime/2;
 	parent->wait_runtime /= 2;
 
+	requeue_task_fair(rq, parent);
+
 	/*
 	 * For the first timeslice we allow child threads
 	 * to move their parent-inherited fairness back
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 6:32 ` Ingo Molnar
@ 2007-04-14 6:43 ` Ingo Molnar
2007-04-14 8:08 ` Willy Tarreau
0 siblings, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-14 6:43 UTC (permalink / raw)
To: Nick Piggin
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* Ingo Molnar <mingo@elte.hu> wrote:
> Nick noticed that upon fork we change parent->wait_runtime but we do
> not requeue it within the rbtree.
this fix is not complete - because the child runqueue is locked here,
not the parent's. I've fixed this properly in my tree and have uploaded
a new sched-modular+cfs.patch. (the effects of the original bug are
mostly harmless, the rbtree position gets corrected the first time the
parent reschedules. The fix might improve heavy forker handling.)
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 6:43 ` Ingo Molnar
@ 2007-04-14 8:08 ` Willy Tarreau
2007-04-14 8:36 ` Willy Tarreau
2007-04-14 10:36 ` Ingo Molnar
0 siblings, 2 replies; 577+ messages in thread
From: Willy Tarreau @ 2007-04-14 8:08 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Sat, Apr 14, 2007 at 08:43:34AM +0200, Ingo Molnar wrote:
>
> * Ingo Molnar <mingo@elte.hu> wrote:
>
> > Nick noticed that upon fork we change parent->wait_runtime but we do
> > not requeue it within the rbtree.
>
> this fix is not complete - because the child runqueue is locked here,
> not the parent's. I've fixed this properly in my tree and have uploaded
> a new sched-modular+cfs.patch. (the effects of the original bug are
> mostly harmless, the rbtree position gets corrected the first time the
> parent reschedules. The fix might improve heavy forker handling.)
It looks like it did not reach your public dir yet.
BTW, I've given it a try. It seems pretty usable. I have also tried
the usual meaningless "glxgears" test with 12 of them at the same time,
and they rotate very smoothly, there is absolutely no pause in any of
them. But they don't all run at same speed, and top reports their CPU
load varying from 3.4 to 10.8%, with what looks like more CPU is
assigned to the first processes, and less CPU for the last ones. But
this is just a rough observation on a stupid test, I would not call
that one scientific in any way (and X has its share in the test too).
I'll perform other tests when I can rebuild with your fixed patch.
Cheers,
Willy
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 8:08 ` Willy Tarreau
@ 2007-04-14 8:36 ` Willy Tarreau
2007-04-14 10:53 ` Ingo Molnar
2007-04-14 19:48 ` William Lee Irwin III
2007-04-14 10:36 ` Ingo Molnar
1 sibling, 2 replies; 577+ messages in thread
From: Willy Tarreau @ 2007-04-14 8:36 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Sat, Apr 14, 2007 at 10:08:34AM +0200, Willy Tarreau wrote:
> On Sat, Apr 14, 2007 at 08:43:34AM +0200, Ingo Molnar wrote:
> >
> > * Ingo Molnar <mingo@elte.hu> wrote:
> >
> > > Nick noticed that upon fork we change parent->wait_runtime but we do
> > > not requeue it within the rbtree.
> >
> > this fix is not complete - because the child runqueue is locked here,
> > not the parent's. I've fixed this properly in my tree and have uploaded
> > a new sched-modular+cfs.patch. (the effects of the original bug are
> > mostly harmless, the rbtree position gets corrected the first time the
> > parent reschedules. The fix might improve heavy forker handling.)
>
> It looks like it did not reach your public dir yet.
>
> BTW, I've given it a try. It seems pretty usable. I have also tried
> the usual meaningless "glxgears" test with 12 of them at the same time,
> and they rotate very smoothly, there is absolutely no pause in any of
> them. But they don't all run at same speed, and top reports their CPU
> load varying from 3.4 to 10.8%, with what looks like more CPU is
> assigned to the first processes, and less CPU for the last ones. But
> this is just a rough observation on a stupid test, I would not call
> that one scientific in any way (and X has its share in the test too).
Follow-up: I think this is mostly X-related. I've started 100 scheddos,
and all get the same CPU percentage. Interestingly, mpg123 running in
parallel never skips at all because it needs quite a bit less than 1% CPU
and gets its fair share at a load of 112. Xterms are slow to respond to
typing with the 12 gears and 100 scheddos, and as expected it was X which
was starving. Renicing it to -5 restores the normal feel, with very slow
but smooth gear rotations. Leaving X niced at 0 and killing the gears
also restores normal behaviour.
All in all, it seems logical that processes which serve many others
become a bottleneck for them.
Forking becomes very slow above a load of 100 it seems. Sometimes,
the shell takes 2 or 3 seconds to return to prompt after I run
"scheddos &"
Those are very promising results, I nearly observe the same responsiveness
as I had on a solaris 10 with 10k running processes on a bigger machine.
I would be curious what a mysql test result would look like now.
Regards,
Willy
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 8:08 ` Willy Tarreau
2007-04-14 8:36 ` Willy Tarreau
@ 2007-04-14 10:36 ` Ingo Molnar
1 sibling, 0 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-14 10:36 UTC (permalink / raw)
To: Willy Tarreau
Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* Willy Tarreau <w@1wt.eu> wrote:
> > this fix is not complete - because the child runqueue is locked
> > here, not the parent's. I've fixed this properly in my tree and have
> > uploaded a new sched-modular+cfs.patch. (the effects of the original
> > bug are mostly harmless, the rbtree position gets corrected the
> > first time the parent reschedules. The fix might improve heavy
> > forker handling.)
>
> It looks like it did not reach your public dir yet.
oops, forgot to do the last step - should be fixed now.
> BTW, I've given it a try. It seems pretty usable. I have also tried
> the usual meaningless "glxgears" test with 12 of them at the same
> time, and they rotate very smoothly, there is absolutely no pause in
> any of them. But they don't all run at same speed, and top reports
> their CPU load varying from 3.4 to 10.8%, with what looks like more
> CPU is assigned to the first processes, and less CPU for the last
> ones. But this is just a rough observation on a stupid test, I would
> not call that one scientific in any way (and X has its share in the
> test too).
ok, i'll try that too - there should be nothing particularly special
about glxgears.
there's another tweak you could try:
echo 500000 > /proc/sys/kernel/sched_granularity_ns
note that this causes preemption to be done as fast as the scheduler can
do it. (in practice it will be mainly driven by CONFIG_HZ, so to get the
best results a CONFIG_HZ of 1000 is useful.)
plus there's an add-on to CFS at:
http://redhat.com/~mingo/cfs-scheduler/sched-fair-hog.patch
this makes the 'CPU usage history cutoff' configurable and sets it to a
default of 100 msecs. This means that CPU hogs (tasks which actively
kept other tasks from running) will be remembered, for up to 100 msecs
of their 'hogness'.
Setting this limit back to 0 gives the 'vanilla' CFS scheduler's
behavior:
echo 0 > /proc/sys/kernel/sched_max_hog_history_ns
(So when trying this you don't have to reboot with this patch
applied/unapplied, just set this value.)
> I'll perform other tests when I can rebuild with your fixed patch.
cool, thanks!
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 8:36 ` Willy Tarreau
@ 2007-04-14 10:53 ` Ingo Molnar
2007-04-14 13:01 ` Willy Tarreau
2007-04-14 15:17 ` Mark Lord
2007-04-14 19:48 ` William Lee Irwin III
1 sibling, 2 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-14 10:53 UTC (permalink / raw)
To: Willy Tarreau
Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* Willy Tarreau <w@1wt.eu> wrote:
> Forking becomes very slow above a load of 100 it seems. Sometimes, the
> shell takes 2 or 3 seconds to return to prompt after I run "scheddos
> &"
this might be changed/impacted by the parent-requeue fix that is in the
updated (for real, promise! ;) patch. Right now on CFS a forking parent
shares its own run stats with the child 50%/50%. This means that heavy
forkers are indeed penalized. Another logical choice would be 100%/0%: a
child has to earn its own right.
i kept the 50%/50% rule from the old scheduler, but maybe it's a more
pristine (and smaller/faster) approach to just not give new children any
stats history to begin with. I've implemented an add-on patch that
implements this, you can find it at:
http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch
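As a toy illustration of the two policies (a userspace-only editorial
sketch; 'toy_task' and its field are stand-ins, not kernel code):

#include <stdio.h>

struct toy_task {
	long wait_runtime;	/* stand-in for the per-task CFS statistic */
};

/* 50%/50%: the child inherits half of the parent's accumulated stats */
static void fork_50_50(struct toy_task *parent, struct toy_task *child)
{
	child->wait_runtime = parent->wait_runtime / 2;
	parent->wait_runtime /= 2;
}

/* 100%/0%: the child starts from zero and has to earn its own right */
static void fork_100_0(struct toy_task *parent, struct toy_task *child)
{
	(void)parent;
	child->wait_runtime = 0;
}

int main(void)
{
	struct toy_task p = { .wait_runtime = 1000 }, c;

	fork_50_50(&p, &c);
	printf("50/50 : parent=%ld child=%ld\n", p.wait_runtime, c.wait_runtime);

	p.wait_runtime = 1000;
	fork_100_0(&p, &c);
	printf("100/0 : parent=%ld child=%ld\n", p.wait_runtime, c.wait_runtime);
	return 0;
}

Under the second policy the parent's accumulated standing is not halved on
every fork, which is why heavy forkers are no longer penalized.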
> Those are very promising results, I nearly observe the same
> responsiveness as I had on a solaris 10 with 10k running processes on
> a bigger machine.
cool and thanks for the feedback! (Btw., as another test you could also
try to renice "scheddos" to +19. While that does not push the scheduler
nearly as hard as nice 0, it is perhaps more indicative of how a truly
abusive many-tasks workload would be run in practice.)
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 23:59 ` Daniel Walker
@ 2007-04-14 10:55 ` Ingo Molnar
0 siblings, 0 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-14 10:55 UTC (permalink / raw)
To: Daniel Walker
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* Daniel Walker <dwalker@mvista.com> wrote:
> One other thing, what happens in the case of slow, frequency changing,
> are/or inaccurate clocks .. Is the old sched_clock behavior still
> tolerated?
yeah, good question. Yesterday i did a quick testboot with that too, and
it seemed to behave pretty OK with the low-res [jiffies based]
sched_clock() too. Although in that case things are much more of an
approximation and rounding/arithmetic artifacts are possible. CFS works
best with a high-resolution cycle counter.
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 10:53 ` Ingo Molnar
@ 2007-04-14 13:01 ` Willy Tarreau
2007-04-14 13:27 ` Willy Tarreau
` (2 more replies)
2007-04-14 15:17 ` Mark Lord
1 sibling, 3 replies; 577+ messages in thread
From: Willy Tarreau @ 2007-04-14 13:01 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Sat, Apr 14, 2007 at 12:53:39PM +0200, Ingo Molnar wrote:
>
> * Willy Tarreau <w@1wt.eu> wrote:
>
> > Forking becomes very slow above a load of 100 it seems. Sometimes, the
> > shell takes 2 or 3 seconds to return to prompt after I run "scheddos
> > &"
>
> this might be changed/impacted by the parent-requeue fix that is in the
> updated (for real, promise! ;) patch. Right now on CFS a forking parent
> shares its own run stats with the child 50%/50%. This means that heavy
> forkers are indeed penalized. Another logical choice would be 100%/0%: a
> child has to earn its own right.
>
> i kept the 50%/50% rule from the old scheduler, but maybe it's a more
> pristine (and smaller/faster) approach to just not give new children any
> stats history to begin with. I've implemented an add-on patch that
> implements this, you can find it at:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch
Not tried yet; it already looks better with the update and sched-fair-hog.
Now xterms open "instantly" even with 1000 running processes.
> > Those are very promising results, I nearly observe the same
> > responsiveness as I had on a solaris 10 with 10k running processes on
> > a bigger machine.
>
> cool and thanks for the feedback! (Btw., as another test you could also
> try to renice "scheddos" to +19. While that does not push the scheduler
> nearly as hard as nice 0, it is perhaps more indicative of how a truly
> abusive many-tasks workload would be run in practice.)
Good idea. The machine I'm typing from now has 1000 scheddos running at +19,
and 12 gears at nice 0. Top keeps reporting different CPU usage for the gears,
but I'm pretty sure that it's a top artifact now because the cumulative times
are roughly identical:
14:33:13 up 13 min, 7 users, load average: 900.30, 443.75, 177.70
1088 processes: 80 sleeping, 1008 running, 0 zombie, 0 stopped
CPU0 states: 56.0% user 43.0% system 23.0% nice 0.0% iowait 0.0% idle
CPU1 states: 94.0% user 5.0% system 0.0% nice 0.0% iowait 0.0% idle
Mem: 1034764k av, 223788k used, 810976k free, 0k shrd, 7192k buff
104400k active, 51904k inactive
Swap: 497972k av, 0k used, 497972k free 68020k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
1325 root 20 0 69240 9400 3740 R 27.6 0.9 4:46 1 X
1412 willy 20 0 6284 2552 1740 R 14.2 0.2 1:09 1 glxgears
1419 willy 20 0 6256 2384 1612 R 10.7 0.2 1:09 1 glxgears
1409 willy 20 0 2824 1940 788 R 8.9 0.1 0:25 1 top
1414 willy 20 0 6280 2544 1728 S 8.9 0.2 1:08 0 glxgears
1415 willy 20 0 6256 2376 1600 R 8.9 0.2 1:07 1 glxgears
1417 willy 20 0 6256 2384 1612 S 8.9 0.2 1:05 1 glxgears
1420 willy 20 0 6284 2552 1740 R 8.9 0.2 1:07 1 glxgears
1410 willy 20 0 6256 2372 1600 S 7.1 0.2 1:11 1 glxgears
1413 willy 20 0 6260 2388 1612 S 7.1 0.2 1:08 0 glxgears
1416 willy 20 0 6284 2544 1728 S 6.2 0.2 1:06 0 glxgears
1418 willy 20 0 6252 2384 1612 S 6.2 0.2 1:09 0 glxgears
1411 willy 20 0 6280 2548 1740 S 5.3 0.2 1:15 1 glxgears
1421 willy 20 0 6280 2536 1728 R 5.3 0.2 1:05 1 glxgears
From time to time, one of the 12 aligned gears will quickly perform a full
quarter of a round while the others slowly turn by a few degrees. In fact,
while I don't know this process's CPU usage pattern, there's something
useful in it: it allows me to visually see when processes accelerate or
decelerate. What would be best is just a clock requiring few X resources
and eating vast amounts of CPU between movements. It would help visually
monitor CPU distribution without being too impacted by X.
I've just added another 100 scheddos at nice 0, and the system is still
amazingly usable. I just tried exchanging a 1-byte token between 188 "dd"
processes which communicate through circular pipes. The context switch
rate is rather high but this has no impact on the rest :
willy@pcw:c$ dd if=/tmp/fifo bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | (echo -n a;dd bs=1) | dd bs=1 of=/tmp/fifo
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1105 0 1 0 781108 8364 68180 0 0 0 12 5 82187 59 41 0
1114 0 1 0 781108 8364 68180 0 0 0 0 0 81528 58 42 0
1112 0 1 0 781108 8364 68180 0 0 0 0 1 80899 58 42 0
1113 0 1 0 781108 8364 68180 0 0 0 0 26 83466 58 42 0
1106 0 2 0 781108 8376 68168 0 0 0 8 91 83193 58 42 0
1107 0 1 0 781108 8376 68180 0 0 0 4 7 79951 58 42 0
1106 0 1 0 781108 8376 68180 0 0 0 0 46 80939 57 43 0
1114 0 1 0 781108 8376 68180 0 0 0 0 21 82019 56 44 0
1116 0 1 0 781108 8376 68180 0 0 0 0 16 85134 56 44 0
1114 0 3 0 781108 8388 68168 0 0 0 16 20 85871 56 44 0
1112 0 1 0 781108 8388 68168 0 0 0 0 15 80412 57 43 0
1112 0 1 0 781108 8388 68180 0 0 0 0 101 83002 58 42 0
1113 0 1 0 781108 8388 68180 0 0 0 0 25 82230 56 44 0
Playing with the sched_max_hog_history_ns does not seem to change anything.
Maybe it's useful for other workloads. Anyway, I have nothing to complain
about, because it's not common for me to be able to normally type a mail on
a system with more than 1000 running processes ;-)
Also, mixed with this load, I have started injecting HTTP requests between
two local processes. The load is stable at 7700 req/s (11800 when alone),
and what I was interested in is the response time. It's perfectly stable
between 9.0 and 9.4 ms with a standard deviation of about 6.0 ms. Those
varied a lot under the stock scheduler, with some sessions sometimes pausing
for seconds (RSDL fixed this, though).
Well, I'll stop heating the room for now as I get out of ideas about how
to defeat it. I'm convinced. I'm impatient to read about Mike's feedback
with his workload which behaves strangely on RSDL. If it works OK here,
it will be the proof that heuristics should not be needed.
Congrats !
Willy
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 13:01 ` Willy Tarreau
@ 2007-04-14 13:27 ` Willy Tarreau
2007-04-14 14:45 ` Willy Tarreau
2007-04-14 16:19 ` Ingo Molnar
2007-04-15 7:54 ` Mike Galbraith
2007-04-19 9:01 ` Ingo Molnar
2 siblings, 2 replies; 577+ messages in thread
From: Willy Tarreau @ 2007-04-14 13:27 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote:
>
> Well, I'll stop heating the room for now as I get out of ideas about how
> to defeat it.
Ah, I found something nasty.
If I start large batches of processes like this :
$ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done
the ramp up slows down after 700-800 processes, but something very
strange happens. If I'm under X, I can switch the focus to all xterms
(the WM is still alive) but all xterms are frozen. On the console,
after one moment I simply cannot switch to another VT anymore while
I can still start commands locally. But "chvt 2" simply blocks.
SysRq-K killed everything and restored full control. Dmesg shows lots
of :
SAK: killed process xxxx (scheddos2): process_session(p)==tty->session.
I wonder if part of the problem would be too many processes bound to
the same tty :-/
I'll investigate a bit.
Willy
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 13:27 ` Willy Tarreau
@ 2007-04-14 14:45 ` Willy Tarreau
2007-04-14 16:14 ` Ingo Molnar
2007-04-14 16:19 ` Ingo Molnar
1 sibling, 1 reply; 577+ messages in thread
From: Willy Tarreau @ 2007-04-14 14:45 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Sat, Apr 14, 2007 at 03:27:32PM +0200, Willy Tarreau wrote:
> On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote:
> >
> > Well, I'll stop heating the room for now as I get out of ideas about how
> > to defeat it.
>
> Ah, I found something nasty.
> If I start large batches of processes like this :
>
> $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done
>
> the ramp up slows down after 700-800 processes, but something very
> strange happens. If I'm under X, I can switch the focus to all xterms
> (the WM is still alive) but all xterms are frozen. On the console,
> after one moment I simply cannot switch to another VT anymore while
> I can still start commands locally. But "chvt 2" simply blocks.
> SysRq-K killed everything and restored full control. Dmesg shows lots
> of :
> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session.
>
> I wonder if part of the problem would be too many processes bound to
> the same tty :-/
It does not seem easy to reproduce; it looks like some resource pools are
kept pre-allocated after a first run, because if I kill scheddos during
the ramp up then start it again, it can go further. The problem happens
when the parent is forking. Also, I modified scheddos to close(0,1,2)
and to perform the forks itself and it does not cause any problem, even
with 4000 processes running. So I really suspect that the problem I
encountered above was tty-related.
BTW, I've tried your fork patch. It definitely helps forking because it
takes below one second to create 4000 processes, then the load slowly
increases. As you said, the children have to earn their share, and I
find that it makes it easier to conserve control of the whole system's
stability.
Regards,
Willy
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
` (7 preceding siblings ...)
2007-04-14 2:04 ` Nick Piggin
@ 2007-04-14 15:09 ` S.Çağlar Onur
2007-04-14 16:09 ` Ingo Molnar
2007-04-15 3:27 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Con Kolivas
` (5 subsequent siblings)
14 siblings, 1 reply; 577+ messages in thread
From: S.Çağlar Onur @ 2007-04-14 15:09 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
[-- Attachment #1: Type: text/plain, Size: 1018 bytes --]
On Friday, 13 April 2007, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
> [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
I've been using Linus's current git + your extra patches + CFS for a while now.
Kaffeine constantly freezes (and uses 80%+ CPU time) [1] if I seek
forward/backward while it's playing a video under some workload (checking out
SVN repositories, compiling something). Stopping the other processes didn't help
kaffeine, so it stays frozen until I kill it.
I'm not sure whether it's a xine-lib or kaffeine bug (since mplayer doesn't have
that problem), but I can't reproduce this with mainline or mainline + sd-0.39.
[1] http://cekirdek.pardus.org.tr/~caglar/psaux
--
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/
Linux is like living in a teepee. No Windows, no Gates and an Apache in house!
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 10:53 ` Ingo Molnar
2007-04-14 13:01 ` Willy Tarreau
@ 2007-04-14 15:17 ` Mark Lord
1 sibling, 0 replies; 577+ messages in thread
From: Mark Lord @ 2007-04-14 15:17 UTC (permalink / raw)
To: Ingo Molnar
Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner
Ingo Molnar wrote:
> i kept the 50%/50% rule from the old scheduler, but maybe it's a more
> pristine (and smaller/faster) approach to just not give new children any
> stats history to begin with. I've implemented an add-on patch that
> implements this, you can find it at:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch
I've been running my desktop (single-core Pentium-M w/2GB RAM, Kubuntu Dapper)
with the new CFS for much of this morning now, with the odd switch back to
the stock scheduler for comparison.
Here, CFS really works and feels better than the stock scheduler.
Even with a "make -j2" kernel rebuild happening (no manual renice, either!)
things "just work" about as smoothly as ever. That's something which RSDL
never achieved for me, though I have not retested RSDL beyond v0.34 or so.
Well done, Ingo! I *want* this as my default scheduler.
Things seemed slightly less smooth when I had the CPU hogs
and fair-fork extension patches both applied.
I'm going to try again now with just the fair-fork added on.
Cheers
Mark
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 15:09 ` S.Çağlar Onur
@ 2007-04-14 16:09 ` Ingo Molnar
2007-04-14 16:59 ` S.Çağlar Onur
0 siblings, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-14 16:09 UTC (permalink / raw)
To: S.Çağlar Onur
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* S.Çağlar Onur <caglar@pardus.org.tr> wrote:
> > i'm pleased to announce the first release of the "Modular Scheduler
> > Core and Completely Fair Scheduler [CFS]" patchset:
>
> I've been using Linus's current git + your extra patches + CFS for a
> while now. Kaffeine constantly freezes (and uses 80%+ CPU time) [1] if
> I seek forward/backward while it's playing a video under some workload
> (checking out SVN repositories, compiling something). Stopping the
> other processes didn't help kaffeine, so it stays frozen until I kill
> it.
hm, could you try to strace it and/or attach gdb to it and figure out
what's wrong? (perhaps involving the Kaffeine developers too?) As long
as it's not a kernel level crash i cannot see how the scheduler could
directly cause this - other than by accident creating a scheduling
pattern that triggers a user-space bug more often than with other
schedulers.
> [1] http://cekirdek.pardus.org.tr/~caglar/psaux
looks quite weird!
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 14:45 ` Willy Tarreau
@ 2007-04-14 16:14 ` Ingo Molnar
0 siblings, 0 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-14 16:14 UTC (permalink / raw)
To: Willy Tarreau
Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* Willy Tarreau <w@1wt.eu> wrote:
> BTW, I've tried your fork patch. It definitely helps forking because
> it takes below one second to create 4000 processes, then the load
> slowly increases. As you said, the children have to earn their share,
> and I find that it makes it easier to conserve control of the whole
> system's stability.
ok, thanks for testing this out, i think i'll integrate this one back
into the core. (I'm still unsure about the cpu-hog one.) And it saves
some code-size too:
text data bss dec hex filename
23349 2705 24 26078 65de kernel/sched.o.cfs-v1
23189 2705 24 25918 653e kernel/sched.o.cfs-before
23052 2705 24 25781 64b5 kernel/sched.o.cfs-after
23366 4001 24 27391 6aff kernel/sched.o.vanilla
23671 4548 24 28243 6e53 kernel/sched.o.sd.v40
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 13:27 ` Willy Tarreau
2007-04-14 14:45 ` Willy Tarreau
@ 2007-04-14 16:19 ` Ingo Molnar
2007-04-14 17:15 ` Eric W. Biederman
1 sibling, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-14 16:19 UTC (permalink / raw)
To: Willy Tarreau
Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
Eric W. Biederman, Jiri Slaby, Alan Cox
* Willy Tarreau <w@1wt.eu> wrote:
> On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote:
> >
> > Well, I'll stop heating the room for now as I get out of ideas about how
> > to defeat it.
>
> Ah, I found something nasty.
> If I start large batches of processes like this :
>
> $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done
>
> the ramp up slows down after 700-800 processes, but something very
> strange happens. If I'm under X, I can switch the focus to all xterms
> (the WM is still alive) but all xterms are frozen. On the console,
> after one moment I simply cannot switch to another VT anymore while I
> can still start commands locally. But "chvt 2" simply blocks. SysRq-K
> killed everything and restored full control. Dmesg shows lots of :
> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session.
>
> I wonder if part of the problem would be too many processes bound to
> the same tty :-/
hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan),
maybe this description rings a bell with them?
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 16:09 ` Ingo Molnar
@ 2007-04-14 16:59 ` S.Çağlar Onur
2007-04-15 16:13 ` Kaffeine problem with CFS Ingo Molnar
0 siblings, 1 reply; 577+ messages in thread
From: S.Çağlar Onur @ 2007-04-14 16:59 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
[-- Attachment #1: Type: text/plain, Size: 1180 bytes --]
On Saturday, 14 April 2007, Ingo Molnar wrote:
> hm, could you try to strace it and/or attach gdb to it and figure out
> what's wrong? (perhaps involving the Kaffeine developers too?) As long
> as it's not a kernel level crash i cannot see how the scheduler could
> directly cause this - other than by accident creating a scheduling
> pattern that triggers a user-space bug more often than with other
> schedulers.
...
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = -1 EINTR (Interrupted system call)
--- SIGINT (Interrupt) @ 0 (0) ---
+++ killed by SIGINT +++
is where freeze occurs. Full log can be found at [1]
> > [1] http://cekirdek.pardus.org.tr/~caglar/psaux
>
> looks quite weird!
:)
[1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine
--
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/
Linux is like living in a teepee. No Windows, no Gates and an Apache in house!
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 16:19 ` Ingo Molnar
@ 2007-04-14 17:15 ` Eric W. Biederman
2007-04-14 17:29 ` Willy Tarreau
0 siblings, 1 reply; 577+ messages in thread
From: Eric W. Biederman @ 2007-04-14 17:15 UTC (permalink / raw)
To: Ingo Molnar
Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox
Ingo Molnar <mingo@elte.hu> writes:
> * Willy Tarreau <w@1wt.eu> wrote:
>
>> On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote:
>> >
>> > Well, I'll stop heating the room for now as I get out of ideas about how
>> > to defeat it.
>>
>> Ah, I found something nasty.
>> If I start large batches of processes like this :
>>
>> $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done
>>
>> the ramp up slows down after 700-800 processes, but something very
>> strange happens. If I'm under X, I can switch the focus to all xterms
>> (the WM is still alive) but all xterms are frozen. On the console,
>> after one moment I simply cannot switch to another VT anymore while I
>> can still start commands locally. But "chvt 2" simply blocks. SysRq-K
>> killed everything and restored full control. Dmesg shows lots of :
>
>> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session.
This. Yes. SAK is noisy and tells you everything it kills.
>> I wonder if part of the problem would be too many processes bound to
>> the same tty :-/
>
> hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan),
> maybe this description rings a bell with them?
Is there any swapping going on?
I'm inclined to suspect that it is a problem that has more to do with the
number of processes and has nothing to do with ttys.
Anyway you can easily rule out ttys by having your startup program
detach from a controlling tty before you start everything.
I'm more inclined to guess something is reading /proc a lot, or doing
something that holds the tasklist lock, a lot or something like that,
if the problem isn't that you are being kicked into swap.
Eric
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 17:15 ` Eric W. Biederman
@ 2007-04-14 17:29 ` Willy Tarreau
2007-04-14 17:44 ` Eric W. Biederman
2007-04-14 17:50 ` Linus Torvalds
0 siblings, 2 replies; 577+ messages in thread
From: Willy Tarreau @ 2007-04-14 17:29 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox
Hi Eric,
[...]
> >> the ramp up slows down after 700-800 processes, but something very
> >> strange happens. If I'm under X, I can switch the focus to all xterms
> >> (the WM is still alive) but all xterms are frozen. On the console,
> >> after one moment I simply cannot switch to another VT anymore while I
> >> can still start commands locally. But "chvt 2" simply blocks. SysRq-K
> >> killed everything and restored full control. Dmesg shows lots of :
> >
> >> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session.
>
> This. Yes. SAK is noisy and tells you everything it kills.
OK, that's what I suspected, but I did not know if the fact that it talked
about the session was systematic or related to any particular state when it
killed the task.
> >> I wonder if part of the problem would be too many processes bound to
> >> the same tty :-/
> >
> > hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan),
> > maybe this description rings a bell with them?
>
> Is there any swapping going on?
Not at all.
> I'm inclined to suspect that it is a problem that has more to do with the
> number of processes and has nothing to do with ttys.
It is clearly possible. What I found strange is that I could still fork
processes (eg: ls, dmesg|tail), ... but not switch to another VT anymore.
It first happened under X with frozen xterms but a perfectly usable WM,
then I reproduced it on pure console to rule out any potential X problem.
> Anyway you can easily rule out ttys by having your startup program
> detach from a controlling tty before you start everything.
>
> I'm more inclined to guess something is reading /proc a lot, or doing
> something that holds the tasklist lock, a lot or something like that,
> if the problem isn't that you are being kicked into swap.
Oh I'm sorry you were invited into the discussion without a first description
of the context. I was giving a try to Ingo's new scheduler, and trying to
reach corner cases with lots of processes competing for CPU.
I simply used a "for" loop in bash to fork 1000 processes, and this problem
happened between 700-800 children. The program only uses a busy loop and a
pause. I then changed my program to close 0,1,2 and perform the fork itself,
and the problem vanished. So there are two differences here :
- bash not forking anymore
- far less FDs on /dev/tty1
At first, I had around 2200 fds on /dev/tty1, reason why I suspected something
in this area.
I agree that this is not normal usage at all, I'm just trying to attack
Ingo's scheduler to ensure it is more robust than the stock one. But
sometimes brute force methods can make other sleeping problems pop up.
Thinking about it, I don't know if there are calls to schedule() while
switching from tty1 to tty2. Alt-F2 had no effect anymore, and "chvt 2"
simply blocked. It would have been possible that a schedule() call
somewhere got starved due to the load, I don't know.
Thanks,
Willy
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 17:29 ` Willy Tarreau
@ 2007-04-14 17:44 ` Eric W. Biederman
2007-04-14 17:54 ` Ingo Molnar
2007-04-14 17:50 ` Linus Torvalds
1 sibling, 1 reply; 577+ messages in thread
From: Eric W. Biederman @ 2007-04-14 17:44 UTC (permalink / raw)
To: Willy Tarreau
Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox
Willy Tarreau <w@1wt.eu> writes:
> Hi Eric,
>
> [...]
>> >> the ramp up slows down after 700-800 processes, but something very
>> >> strange happens. If I'm under X, I can switch the focus to all xterms
>> >> (the WM is still alive) but all xterms are frozen. On the console,
>> >> after one moment I simply cannot switch to another VT anymore while I
>> >> can still start commands locally. But "chvt 2" simply blocks. SysRq-K
>> >> killed everything and restored full control. Dmesg shows lots of :
>> >
>> >> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session.
>>
>> This. Yes. SAK is noisy and tells you everything it kills.
>
> OK, that's what I suspected, but I did not know if the fact that it talked
> about the session was systematic or related to any particular state when it
> killed the task.
>
>> >> I wonder if part of the problem would be too many processes bound to
>> >> the same tty :-/
>> >
>> > hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan),
>> > maybe this description rings a bell with them?
>>
>> Is there any swapping going on?
>
> Not at all.
>
>> I'm inclined to suspect that it is a problem that has more to do with the
>> number of processes and has nothing to do with ttys.
>
> It is clearly possible. What I found strange is that I could still fork
> processes (eg: ls, dmesg|tail), ... but not switch to another VT anymore.
> It first happened under X with frozen xterms but a perfectly usable WM,
> then I reproduced it on pure console to rule out any potential X problem.
>
>> Anyway you can easily rule out ttys by having your startup program
>> detach from a controlling tty before you start everything.
>>
>> I'm more inclined to guess something is reading /proc a lot, or doing
>> something that holds the tasklist lock, a lot or something like that,
>> if the problem isn't that you are being kicked into swap.
>
> Oh I'm sorry you were invited into the discussion without a first description
> of the context. I was giving a try to Ingo's new scheduler, and trying to
> reach corner cases with lots of processes competing for CPU.
>
> I simply used a "for" loop in bash to fork 1000 processes, and this problem
> happened between 700-800 children. The program only uses a busy loop and a
> pause. I then changed my program to close 0,1,2 and perform the fork itself,
> and the problem vanished. So there are two differences here :
>
> - bash not forking anymore
> - far less FDs on /dev/tty1
Yes. But with /dev/tty1 being the controlling terminal in both cases,
as you haven't dropped your session, or disassociated your tty.
The bash problem may have something to do with setpgid or scheduling effects.
Hmm. I just looked and setpgid does grab the tasklist lock for
writing so we may possibly have some contention there.
> At first, I had around 2200 fds on /dev/tty1, reason why I suspected something
> in this area.
>
> I agree that this is not normal usage at all, I'm just trying to attack
> Ingo's scheduler to ensure it is more robust than the stock one. But
> sometimes brute force methods can make other sleeping problems pop up.
Yep. If we can narrow it down to one, that would be interesting. Of course
that also means that when we start finding other possibly sleeping problems,
people are working in areas of code they don't normally touch, so we must
investigate.
> Thinking about it, I don't know if there are calls to schedule() while
> switching from tty1 to tty2. Alt-F2 had no effect anymore, and "chvt 2"
> simply blocked. It would have been possible that a schedule() call
> somewhere got starved due to the load, I don't know.
It looks like there is a call to schedule_work.
There are two pieces of the path. If you are switching in and out of a tty
controlled by something like X. User space has to grant permission before
the operation happens. Where there isn't a gate keeper I know it is cheaper
but I don't know by how much, I suspect there is still a schedule happening
in there.
Eric
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 17:29 ` Willy Tarreau
2007-04-14 17:44 ` Eric W. Biederman
@ 2007-04-14 17:50 ` Linus Torvalds
1 sibling, 0 replies; 577+ messages in thread
From: Linus Torvalds @ 2007-04-14 17:50 UTC (permalink / raw)
To: Willy Tarreau
Cc: Eric W. Biederman, Ingo Molnar, Nick Piggin, linux-kernel,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox
On Sat, 14 Apr 2007, Willy Tarreau wrote:
>
> It is clearly possible. What I found strange is that I could still fork
> processes (eg: ls, dmesg|tail), ... but not switch to another VT anymore.
Considering the patches in question, it's almost definitely just a CPU
scheduling problem with starvation.
The VT switching is obviously done by the kernel, but the kernel will
signal and wait for the "controlling process" for the VT. The most obvious
case of that is X, of course, but even in text mode I think gpm will
have taken control of the VT's it runs on (all of them), which means that
when you initiate a VT switch, the kernel will actually signal the
controlling process (gpm), and wait for it to acknowledge the switch.
If gpm doesn't get a timeslice for some reason (and it sounds like there
may be some serious unfairness after "fork()"), your behaviour is
explainable.
(NOTE! I've never actually looked at gpm sources or what it really does,
so maybe I'm wrong, and it doesn't try to do the controlling VT thing, and
something else is going on, but quite frankly, it sounds like the obvious
candidate for this bug. Explaining it with some non-scheduler-related
thing sounds unlikely, considering the patch in question).
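For illustration, the controlling-process handshake being described is the
VT_PROCESS interface from linux/vt.h; from the controlling process's side it
looks roughly like this (a sketch of the interface only, not gpm's or X's
actual code):

#include <fcntl.h>
#include <signal.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vt.h>

static int vt_fd;

/* kernel asks us to give the VT up; until we answer, chvt/Alt-Fn block */
static void release_requested(int sig)
{
        ioctl(vt_fd, VT_RELDISP, 1);            /* allow the switch away */
}

/* kernel hands the VT back to us */
static void acquire_requested(int sig)
{
        ioctl(vt_fd, VT_RELDISP, VT_ACKACQ);    /* acknowledge the switch */
}

int main(void)
{
        struct vt_mode vtm = {
                .mode   = VT_PROCESS,
                .relsig = SIGUSR1,
                .acqsig = SIGUSR2,
        };

        vt_fd = open("/dev/tty1", O_RDWR);
        signal(SIGUSR1, release_requested);
        signal(SIGUSR2, acquire_requested);
        ioctl(vt_fd, VT_SETMODE, &vtm);

        for (;;)
                pause();        /* if this process never runs, the switch stalls */
}

Whether gpm actually does this is exactly the open question above.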
Linus
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 17:44 ` Eric W. Biederman
@ 2007-04-14 17:54 ` Ingo Molnar
2007-04-14 18:18 ` Willy Tarreau
0 siblings, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-14 17:54 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox
* Eric W. Biederman <ebiederm@xmission.com> wrote:
> > Thinking about it, I don't know if there are calls to schedule()
> > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and
> > "chvt 2" simply blocked. It would have been possible that a
> > schedule() call somewhere got starved due to the load, I don't know.
>
> It looks like there is a call to schedule_work.
so this goes over keventd, right?
> There are two pieces of the path. If you are switching in and out of a
> tty controlled by something like X. User space has to grant
> permission before the operation happens. Where there isn't a gate
> keeper I know it is cheaper but I don't know by how much, I suspect
> there is still a schedule happening in there.
Could keventd perhaps be starved? Willy, to exclude this possibility,
could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then
the command to set it to SCHED_FIFO:50 would be:
chrt -f -p 50 5
but ... events/0 is reniced to -5 by default, so it should definitely
not be starved.
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 17:54 ` Ingo Molnar
@ 2007-04-14 18:18 ` Willy Tarreau
2007-04-14 18:40 ` Eric W. Biederman
2007-04-15 17:55 ` Ingo Molnar
0 siblings, 2 replies; 577+ messages in thread
From: Willy Tarreau @ 2007-04-14 18:18 UTC (permalink / raw)
To: Ingo Molnar
Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox
On Sat, Apr 14, 2007 at 07:54:33PM +0200, Ingo Molnar wrote:
>
> * Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> > > Thinking about it, I don't know if there are calls to schedule()
> > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and
> > > "chvt 2" simply blocked. It would have been possible that a
> > > schedule() call somewhere got starved due to the load, I don't know.
> >
> > It looks like there is a call to schedule_work.
>
> so this goes over keventd, right?
>
> > There are two pieces of the path. If you are switching in and out of a
> > tty controlled by something like X. User space has to grant
> > permission before the operation happens. Where there isn't a gate
> > keeper I know it is cheaper but I don't know by how much, I suspect
> > there is still a schedule happening in there.
>
> Could keventd perhaps be starved? Willy, to exclude this possibility,
> could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then
> the command to set it to SCHED_FIFO:50 would be:
>
> chrt -f -p 50 5
>
> but ... events/0 is reniced to -5 by default, so it should definitely
> not be starved.
Well, since I merged the fair-fork patch, I cannot reproduce (in fact,
bash forks 1000 processes, then progressively execs scheddos, but it
takes some time). So I'm rebuilding right now. But I think that Linus
has an interesting clue about GPM and notification before switching
the terminal. I think it was enabled in console mode. I don't know
how that translates to frozen xterms, but let's attack the problems
one at a time.
Willy
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 18:18 ` Willy Tarreau
@ 2007-04-14 18:40 ` Eric W. Biederman
2007-04-14 19:01 ` Willy Tarreau
2007-04-15 17:55 ` Ingo Molnar
1 sibling, 1 reply; 577+ messages in thread
From: Eric W. Biederman @ 2007-04-14 18:40 UTC (permalink / raw)
To: Willy Tarreau
Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox
Willy Tarreau <w@1wt.eu> writes:
> On Sat, Apr 14, 2007 at 07:54:33PM +0200, Ingo Molnar wrote:
>>
>> * Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> > > Thinking about it, I don't know if there are calls to schedule()
>> > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and
>> > > "chvt 2" simply blocked. It would have been possible that a
>> > > schedule() call somewhere got starved due to the load, I don't know.
>> >
>> > It looks like there is a call to schedule_work.
>>
>> so this goes over keventd, right?
>>
>> > There are two pieces of the path. If you are switching in and out of a
>> > tty controlled by something like X. User space has to grant
>> > permission before the operation happens. Where there isn't a gate
>> > keeper I know it is cheaper but I don't know by how much, I suspect
>> > there is still a schedule happening in there.
>>
>> Could keventd perhaps be starved? Willy, to exclude this possibility,
>> could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then
>> the command to set it to SCHED_FIFO:50 would be:
>>
>> chrt -f -p 50 5
>>
>> but ... events/0 is reniced to -5 by default, so it should definitely
>> not be starved.
>
> Well, since I merged the fair-fork patch, I cannot reproduce (in fact,
> bash forks 1000 processes, then progressively execs scheddos, but it
> takes some time). So I'm rebuilding right now. But I think that Linus
> has an interesting clue about GPM and notification before switching
> the terminal. I think it was enabled in console mode. I don't know
> how that translates to frozen xterms, but let's attack the problems
> one at a time.
I think it is a good clue. However the intention of the mechanism is
that only processes that change the video mode on a VT are supposed to
use it. So I really don't think gpm is the culprit. However it easily could
be something else that has similar characteristics.
I just realized we do have proof that schedule_work is actually working
because SAK works, and we can't sanely do SAK from interrupt context
so we call schedule work.
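(For reference, the deferral pattern in question is just the standard keventd
one; roughly, and only as an illustration rather than the actual console/SAK
code:)

#include <linux/workqueue.h>

static void deferred_action(struct work_struct *work)
{
        /* runs later in process context, on keventd (events/N); may sleep */
}

static DECLARE_WORK(my_work, deferred_action);

static void cannot_sleep_here(void)
{
        /* e.g. interrupt context: hand the job to keventd instead */
        schedule_work(&my_work);
}

So if events/N itself gets starved by the runnable flood, anything funnelled
through schedule_work() just sits in the queue until keventd gets CPU time
again, which is why raising keventd's priority, as suggested above, is a
useful test.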
Eric
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 23:18 ` Ingo Molnar
@ 2007-04-14 18:48 ` Bill Huey
0 siblings, 0 replies; 577+ messages in thread
From: Bill Huey @ 2007-04-14 18:48 UTC (permalink / raw)
To: Ingo Molnar
Cc: Willy Tarreau, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Bill Huey (hui)
On Sat, Apr 14, 2007 at 01:18:09AM +0200, Ingo Molnar wrote:
> very much so! Both Con and Mike has contributed regularly to upstream
> sched.c:
The problem here is that Con can get demotivated (and rather upset) when an
idea gets proposed, like SchedPlug, only to have people be hostile to it
and then suddenly turn around and adopt the idea. It gives the impression
that you, in this specific case, were more interested in controlling a
situation and the track of development instead of actually being inclusive
of the development process with discussion and serious consideration, etc...
This is how the Linux community can be perceived as elitist. The old guard
would serve the community better if people were more mindful and sensitive
to developer issues. There was a particular speech at OLS 2006 that turned
me off because it pretty much pandered to the "old guard's" needs over those
of newer developers. Since I'm a somewhat established engineer in -rt (being
the only other person that mapped the lock hierarchy out for full
preemptibility), I had the confidence to pretty much ignore it, while
previously this could have really upset me and been highly discouraging to
a relatively new developer.
As Linux gets larger and larger this is going to be an increasing problem
when folks come into the community with new ideas, and the community will
need to change if it intends to integrate these folks. IMO, a lot of
these flame wars wouldn't need to exist if folks listened to each other
better and permitted co-ownership of code like the scheduler, since it
needs multiple hands in it to adapt to new loads and situations, etc...
I'm saying this nicely now since I can be nasty about it.
bill
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 18:40 ` Eric W. Biederman
@ 2007-04-14 19:01 ` Willy Tarreau
0 siblings, 0 replies; 577+ messages in thread
From: Willy Tarreau @ 2007-04-14 19:01 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox
On Sat, Apr 14, 2007 at 12:40:15PM -0600, Eric W. Biederman wrote:
> Willy Tarreau <w@1wt.eu> writes:
>
> > On Sat, Apr 14, 2007 at 07:54:33PM +0200, Ingo Molnar wrote:
> >>
> >> * Eric W. Biederman <ebiederm@xmission.com> wrote:
> >>
> >> > > Thinking about it, I don't know if there are calls to schedule()
> >> > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and
> >> > > "chvt 2" simply blocked. It would have been possible that a
> >> > > schedule() call somewhere got starved due to the load, I don't know.
> >> >
> >> > It looks like there is a call to schedule_work.
> >>
> >> so this goes over keventd, right?
> >>
> >> > There are two pieces of the path. If you are switching in and out of a
> >> > tty controlled by something like X. User space has to grant
> >> > permission before the operation happens. Where there isn't a gate
> >> > keeper I know it is cheaper but I don't know by how much, I suspect
> >> > there is still a schedule happening in there.
> >>
> >> Could keventd perhaps be starved? Willy, to exclude this possibility,
> >> could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then
> >> the command to set it to SCHED_FIFO:50 would be:
> >>
> >> chrt -f -p 50 5
> >>
> >> but ... events/0 is reniced to -5 by default, so it should definitely
> >> not be starved.
> >
> > Well, since I merged the fair-fork patch, I cannot reproduce (in fact,
> > bash forks 1000 processes, then progressively execs scheddos, but it
> > takes some time). So I'm rebuilding right now. But I think that Linus
> > has an interesting clue about GPM and notification before switching
> > the terminal. I think it was enabled in console mode. I don't know
> > how that translates to frozen xterms, but let's attack the problems
> > one at a time.
>
> I think it is a good clue. However the intention of the mechanism is
> that only processes that change the video mode on a VT are supposed to
> use it. So I really don't think gpm is the culprit. However it easily could
> be something else that has similar characteristics.
>
> I just realized we do have proof that schedule_work is actually working
> because SAK works, and we can't sanely do SAK from interrupt context
> so we call schedule work.
Eric,
I can say that Linus, Ingo and you all got on the right track.
I could reproduce it; I got a hung tty at around 1400 running processes.
Fortunately, it was the one with the root shell which was reniced
to -19.
I could strace chvt 2 :
20:44:23.761117 open("/dev/tty", O_RDONLY) = 3 <0.004000>
20:44:23.765117 ioctl(3, KDGKBTYPE, 0xbfa305a3) = 0 <0.024002>
20:44:23.789119 ioctl(3, VIDIOC_G_COMP or VT_ACTIVATE, 0x3) = 0 <0.000000>
20:44:23.789119 ioctl(3, VIDIOC_S_COMP or VT_WAITACTIVE <unfinished ...>
Then I applied Ingo's suggestion about changing keventd prio :
root@pcw:~# ps auxw|grep event
root 8 0.0 0.0 0 0 ? SW< 20:31 0:00 [events/0]
root 9 0.0 0.0 0 0 ? RW< 20:31 0:00 [events/1]
root@pcw:~# rtprio -s 1 -p 50 8 9 (I don't have chrt but it does the same)
My VT immediately switched as soon as I hit Enter. Everything's
working fine again now. So the good news is that it's not a bug
in the tty code, nor a deadlock.
Now, maybe keventd should get a higher prio? It seems worrying to
me that it may starve when it is so sensitive.
Also, that may explain why I couldn't reproduce with the fork patch.
Since all new processes got no runtime at first, their impact on
existing ones must have been lower. But I think that if I had waited
longer, I would have had the problem again (though I did not see it
even under a load of 7800).
Regards,
Willy
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 8:36 ` Willy Tarreau
2007-04-14 10:53 ` Ingo Molnar
@ 2007-04-14 19:48 ` William Lee Irwin III
2007-04-14 20:12 ` Willy Tarreau
1 sibling, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-14 19:48 UTC (permalink / raw)
To: Willy Tarreau
Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner
On Sat, Apr 14, 2007 at 10:36:25AM +0200, Willy Tarreau wrote:
> Forking becomes very slow above a load of 100 it seems. Sometimes,
> the shell takes 2 or 3 seconds to return to prompt after I run
> "scheddos &"
> Those are very promising results, I nearly observe the same responsiveness
> as I had on a solaris 10 with 10k running processes on a bigger machine.
> I would be curious what a mysql test result would look like now.
Where is scheddos?
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 19:48 ` William Lee Irwin III
@ 2007-04-14 20:12 ` Willy Tarreau
0 siblings, 0 replies; 577+ messages in thread
From: Willy Tarreau @ 2007-04-14 20:12 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner
On Sat, Apr 14, 2007 at 12:48:55PM -0700, William Lee Irwin III wrote:
> On Sat, Apr 14, 2007 at 10:36:25AM +0200, Willy Tarreau wrote:
> > Forking becomes very slow above a load of 100 it seems. Sometimes,
> > the shell takes 2 or 3 seconds to return to prompt after I run
> > "scheddos &"
> > Those are very promising results, I nearly observe the same responsiveness
> > as I had on a solaris 10 with 10k running processes on a bigger machine.
> > I would be curious what a mysql test result would look like now.
>
> Where is scheddos?
I will send it to you off-list. I've been avoiding publishing it for a long
time because the stock scheduler was *very* sensitive to trivial attacks
(freezes larger than 30s, impossible to log in). It's very basic, and I
have no problem sending it to anyone who requests it, it's just that as
long as some distros ship early 2.6 kernels I do not want it to appear on
mailing list archives for anyone to grab it and annoy their admins for free.
Cheers,
Willy
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 22:21 ` William Lee Irwin III
2007-04-13 22:52 ` Ingo Molnar
@ 2007-04-14 22:38 ` Davide Libenzi
2007-04-14 23:26 ` Davide Libenzi
` (2 more replies)
1 sibling, 3 replies; 577+ messages in thread
From: Davide Libenzi @ 2007-04-14 22:38 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds,
Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
On Fri, 13 Apr 2007, William Lee Irwin III wrote:
> On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> > The CFS patch uses a completely different approach and implementation
> > from RSDL/SD. My goal was to make CFS's interactivity quality exceed
> > that of RSDL/SD, which is a high standard to meet :-) Testing
> > feedback is welcome to decide this one way or another. [ and, in any
> > case, all of SD's logic could be added via a kernel/sched_sd.c module
> > as well, if Con is interested in such an approach. ]
> > CFS's design is quite radical: it does not use runqueues, it uses a
> > time-ordered rbtree to build a 'timeline' of future task execution,
> > and thus has no 'array switch' artifacts (by which both the vanilla
> > scheduler and RSDL/SD are affected).
>
> A binomial heap would likely serve your purposes better than rbtrees.
> It's faster to have the next item to dequeue at the root of the tree
> structure rather than a leaf, for one. There are, of course, other
> priority queue structures (e.g. van Emde Boas) able to exploit the
> limited precision of the priority key for faster asymptotics, though
> actual performance is an open question.
Haven't looked at the scheduler code yet, but for a similar problem I use
a time ring. The ring has Ns (a power of 2 is better) slots (where tasks are
queued - in my case they were some sort of timers), and it has a current
base index (Ib), a current base time (Tb) and a time granularity (Tg). It
also has a bitmap with bits telling you which slots contain queued tasks.
An item (task) that has to be scheduled at time T, will be queued in the slot:
S = Ib + min((T - Tb) / Tg, Ns - 1);
Items with T longer than Ns*Tg will be scheduled in the relative last slot
(choosing a proper Ns and Tg can minimize this).
Queueing is O(1) and de-queueing is O(Ns). You can play with Ns and Tg to
suit your needs.
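In rough C the idea looks something like this (a simplified sketch with a
single bitmap word, so Ns <= 64, and without the advancing of Tb/Ib over
time; it is not the smart-queue.c code below):

#include <stddef.h>

#define NS 64                           /* number of slots, power of 2 */

struct item {
        struct item *next;
        unsigned long long deadline;    /* absolute time T */
};

struct time_ring {
        struct item *slot[NS];          /* per-slot queues */
        unsigned long long bitmap;      /* bit i set => slot i non-empty */
        unsigned long long tbase;       /* Tb */
        unsigned int ibase;             /* Ib */
        unsigned int tgran;             /* Tg */
};

/* O(1) enqueue: S = (Ib + min((T - Tb) / Tg, Ns - 1)) % Ns */
static void tr_enqueue(struct time_ring *tr, struct item *it)
{
        unsigned long long d = (it->deadline - tr->tbase) / tr->tgran;
        unsigned int s = (tr->ibase + (d < NS - 1 ? d : NS - 1)) % NS;

        it->next = tr->slot[s];
        tr->slot[s] = it;
        tr->bitmap |= 1ULL << s;
}

/* O(Ns) worst-case dequeue: scan the bitmap starting at the base index */
static struct item *tr_dequeue(struct time_ring *tr)
{
        unsigned int i;

        for (i = 0; i < NS; i++) {
                unsigned int s = (tr->ibase + i) % NS;

                if (tr->bitmap & (1ULL << s)) {
                        struct item *it = tr->slot[s];

                        tr->slot[s] = it->next;
                        if (!tr->slot[s])
                                tr->bitmap &= ~(1ULL << s);
                        return it;
                }
        }
        return NULL;                    /* ring is empty */
}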
This is a simple bench between time-ring (TR) and CFS queueing:
http://www.xmailserver.org/smart-queue.c
In my box (Dual Opteron 252):
davide@alien:~$ ./smart-queue -n 8
CFS = 142.21 cycles/loop
TR = 72.33 cycles/loop
davide@alien:~$ ./smart-queue -n 16
CFS = 188.74 cycles/loop
TR = 83.79 cycles/loop
davide@alien:~$ ./smart-queue -n 32
CFS = 221.36 cycles/loop
TR = 75.93 cycles/loop
davide@alien:~$ ./smart-queue -n 64
CFS = 242.89 cycles/loop
TR = 81.29 cycles/loop
- Davide
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 22:38 ` Davide Libenzi
@ 2007-04-14 23:26 ` Davide Libenzi
2007-04-15 4:01 ` William Lee Irwin III
2007-04-15 23:09 ` Pavel Pisa
2 siblings, 0 replies; 577+ messages in thread
From: Davide Libenzi @ 2007-04-14 23:26 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds,
Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
On Sat, 14 Apr 2007, Davide Libenzi wrote:
> Haven't looked at the scheduler code yet, but for a similar problem I use
> a time ring. The ring has Ns (a power of 2 is better) slots (where tasks are
> queued - in my case they were some sort of timers), and it has a current
> base index (Ib), a current base time (Tb) and a time granularity (Tg). It
> also has a bitmap with bits telling you which slots contain queued tasks.
> An item (task) that has to be scheduled at time T, will be queued in the slot:
>
> S = Ib + min((T - Tb) / Tg, Ns - 1);
... mod Ns, of course ;)
- Davide
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
` (8 preceding siblings ...)
2007-04-14 15:09 ` S.Çağlar Onur
@ 2007-04-15 3:27 ` Con Kolivas
2007-04-15 5:16 ` Bill Huey
` (2 more replies)
2007-04-15 12:29 ` Esben Nielsen
` (4 subsequent siblings)
14 siblings, 3 replies; 577+ messages in thread
From: Con Kolivas @ 2007-04-15 3:27 UTC (permalink / raw)
To: Ingo Molnar, ck list, Peter Williams, Bill Huey
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Saturday 14 April 2007 06:21, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
> [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.
The casual observer will be completely confused by what on earth has happened
here so let me try to demystify things for them.
1. I tried in vain some time ago to push a working extensible pluggable cpu
scheduler framework (based on wli's work) for the linux kernel. It was
perma-vetoed by Linus and Ingo (and Nick also said he didn't like it) as
being absolutely the wrong approach and that we should never do that. Oddly
enough the linux-kernel mailing list was -dead- at the time and the
discussion did not make it to the mailing list. Every time I've tried to
forward it to the mailing list the spam filter decided to drop it so most
people have not even seen this original veto-forever discussion.
2. Since then I've been thinking/working on a cpu scheduler design that takes
all the guesswork out of scheduling and gives very predictable, as fair
as possible, cpu distribution and latency while preserving as solid
interactivity as possible within those confines. For weeks now, Ingo has said
that the interactivity regressions were showstoppers and we should address
them, never mind the fact that the so-called regressions were purely "it
slows down linearly with load" which to me is perfectly desirable behaviour.
While this was not perma-vetoed, I predicted pretty accurately that your intent
was to veto it based on this.
People kept claiming scheduling problems were few and far between, but what was
really happening was that users were terrified of lkml and instead used 1. windows
and 2. 2.4 kernels. The problems were there.
So where are we now? Here is where your latest patch comes in.
As a solution to the many scheduling problems we finally all agree exist, you
propose a patch that adds 1. a limited pluggable framework and 2. a fairness
based cpu scheduler policy... o_O
So I should be happy at last now that the things I was promoting you are also
promoting, right? Well I'll fill in the rest of the gaps and let other people
decide how I should feel.
> as usual, any sort of feedback, bugreports, fixes and suggestions are
> more than welcome,
In the last 4 weeks I've spent time lying in bed drugged to the eyeballs and
having trips in and out of hospitals for my condition. I appreciate greatly
the sympathy and patience from people in this regard. However at one stage I
virtually begged for support with my attempts and help with the code. Dmitry
Adamushko is the only person who actually helped me with the code in the
interim, while others poked sticks at it. Sure the sticks helped at times but
the sticks always seemed to have their ends kerosene doused and flaming for
reasons I still don't get. No other help was forthcoming.
Now that you're agreeing my direction was correct you've done the usual Linux
kernel thing - ignore all my previous code and write your own version. Oh
well, that I've come to expect; at least you get a copyright notice in the
bootup and somewhere in the comments give me credit for proving it's
possible. Let's give some other credit here too. William Lee Irwin provided
the major architecture behind plugsched at my request and I simply finished
the work and got it working. He is also responsible for many IRC discussions
I've had about cpu scheduling fairness, designs, programming history and code
help. Even though he did not contribute code directly to SD, his comments
have been invaluable.
So let's look at the code.
kernel/sched.c
kernel/sched_fair.c
kernel/sched_rt.c
It turns out this is not a pluggable cpu scheduler framework at all, and I
guess you didn't really promote it as such. It's a "modular scheduler core".
Which means you moved code from sched.c into sched_fair.c and sched_rt.c.
This abstracts out each _scheduling policy's_ functions into struct
sched_class and allows each scheduling policy's functions to be in a separate
file etc.
Ok so what it means is that instead of whole cpu schedulers being able to be
plugged into this framework we can plug in only cpu scheduling policies....
hrm... So let's look on
-#define SCHED_NORMAL 0
Ok, once upon a time we renamed SCHED_OTHER, which every other unix calls the
standard policy and which 99.9% of applications used, into a more meaningful
name, SCHED_NORMAL. That's fine since all it did was change the description
internally for those reading the code. Let's see what you've done now:
+#define SCHED_FAIR 0
You've renamed it again. This is, I don't know what exactly to call it, but an
interesting way of making it look like there is now more choice. Well,
whatever you call it, everything in linux spawned from init without
specifying a policy still gets policy 0. This is SCHED_OTHER still, renamed
SCHED_NORMAL and now SCHED_FAIR.
You encouraged me to create a sched_sd.c to add onto your design as well.
Well, what do I do with that? I need to create another scheduling policy for
that code to even be used. A separate scheduling policy requires a userspace
change to even benefit from it. Even if I make that sched_sd.c patch, people
cannot use SD as their default scheduler unless they hack SCHED_FAIR 0 to
read SCHED_SD 0 or similar. The same goes for original staircase cpusched,
nicksched, zaphod, spa_ws, ebs and so on.
So what you've achieved with your patch is - replaced the current scheduler
with another one and moved it into another file. There is no choice, and no
pluggability, just code trumping.
Do I support this? In this form.... no.
It's not that I don't like your new scheduler. Heck it's beautiful like most
of your _serious_ code. It even comes with a catchy name that's bound to give
people hard-ons (even though many schedulers aim to be completely fair, yours
has been named that for maximum selling power). The complaint I have is that
you are not providing quite what you advertise (on the modular front), or
perhaps you're advertising it as such to make it look more appealing; I'm not
sure.
Since we'll just end up with your code, don't pretend SCHED_NORMAL is anything
different, and that this is anything other than your NIH (Not Invented Here)
cpu scheduling policy rewrite which will probably end up taking its position
in mainline after yet another truckload of regression/performance tests and
so on. I haven't seen an awful lot of comparisons with SD yet, just people
jumping on your bandwagon which is fine I guess. Maybe a few tiny tests
showing less than 5% variation in their fairness from what I can see. Either
way, I already feel you've killed off SD... like pretty much everything else
I've done lately. At least I no longer have to try and support my code mostly
by myself.
In the interest of putting aside any ego concerns since this is about linux
and not me...
Because... You are a hair's breadth away from producing something that I
would support, which _does_ do what you say and produces the pluggability
we're all begging for with only tiny changes to the code you've already done.
Make Kconfig let you choose which sched_*.c gets built into the kernel, and
make SCHED_OTHER choose which SCHED_* gets chosen as the default from Kconfig,
and even choose one of the alternative built-in ones with boot parameters; your
code has more clout than mine will (ie do exactly what plugsched does). Then
we can have 7 schedulers in the linux kernel within a few weeks. Oh no! This is
the very thing Linus didn't want: specialisation in the cpu schedulers!
Does this mean this idea will be vetoed yet again? In all likelihood, yes.
I guess I have lots to put into -ck still... sigh.
> Ingo
--
-ck
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 22:38 ` Davide Libenzi
2007-04-14 23:26 ` Davide Libenzi
@ 2007-04-15 4:01 ` William Lee Irwin III
2007-04-15 4:18 ` Davide Libenzi
2007-04-15 23:09 ` Pavel Pisa
2 siblings, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-15 4:01 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds,
Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
On Fri, 13 Apr 2007, William Lee Irwin III wrote:
>> A binomial heap would likely serve your purposes better than rbtrees.
[...]
On Sat, Apr 14, 2007 at 03:38:04PM -0700, Davide Libenzi wrote:
> Haven't looked at the scheduler code yet, but for a similar problem I use
> a time ring. The ring has Ns (a power of 2 is better) slots (where tasks are
> queued - in my case they were some sort of timers), and it has a current
> base index (Ib), a current base time (Tb) and a time granularity (Tg). It
> also has a bitmap with bits telling you which slots contain queued tasks.
> An item (task) that has to be scheduled at time T, will be queued in the slot:
> S = Ib + min((T - Tb) / Tg, Ns - 1);
> Items with T longer than Ns*Tg will be scheduled in the relative last slot
> (choosing a proper Ns and Tg can minimize this).
> Queueing is O(1) and de-queueing is O(Ns). You can play with Ns and Tg to
> suit your needs.
I used a similar sort of queue in the virtual deadline scheduler I
wrote in 2003 or thereabouts. CFS uses queue priorities with too high
a precision to map directly to this (queue priorities are marked as
"key" in the cfs code and should not be confused with task priorities).
The elder virtual deadline scheduler used millisecond resolution and a
rather different calculation for its equivalent of ->key, which
explains how it coped with a limited priority space.
The two basic attacks on such large priority spaces are the near future
vs. far future subdivisions and subdividing the priority space into
(most often regular) intervals. Subdividing the priority space into
intervals is the most obvious; you simply use some O(lg(n)) priority
queue as the bucket discipline in the "time ring," queue by the upper
bits of the queue priority in the time ring, and by the lower bits in
the O(lg(n)) bucket discipline. The near future vs. far future
subdivision is maintaining the first N tasks in a low-constant-overhead
structure like a sorted list and the remainder in some other sort of
queue structure intended to handle large numbers of elements gracefully.
The distribution of queue priorities strongly influences which of the
methods is most potent, though it should be clear the methods can be
used in combination.
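Concretely, the interval-subdivision variant amounts to something like the
following (a rough sketch: a flat bucket array indexed by the upper bits of
the key, with a sorted list standing in for the O(lg(n)) bucket discipline;
illustration only, not CFS code):

#include <stddef.h>

#define BUCKET_BITS     10                      /* 1024 buckets */
#define NBUCKETS        (1u << BUCKET_BITS)
#define LOW_BITS        (64 - BUCKET_BITS)

struct node {
        struct node *next;
        unsigned long long key;                 /* wide queue priority */
};

struct pqueue {
        struct node *bucket[NBUCKETS];          /* assume zero-initialized */
        unsigned int first;                     /* buckets below this are empty */
};

static void pq_insert(struct pqueue *pq, struct node *n)
{
        unsigned int b = n->key >> LOW_BITS;    /* upper bits pick the bucket */
        struct node **p = &pq->bucket[b];

        while (*p && (*p)->key <= n->key)       /* lower bits order the bucket */
                p = &(*p)->next;
        n->next = *p;
        *p = n;
        if (b < pq->first)
                pq->first = b;
}

static struct node *pq_pop(struct pqueue *pq)
{
        unsigned int b;

        for (b = pq->first; b < NBUCKETS; b++) {
                if (pq->bucket[b]) {
                        struct node *n = pq->bucket[b];

                        pq->bucket[b] = n->next;
                        pq->first = b;
                        return n;               /* smallest key overall */
                }
        }
        return NULL;
}

The same shape works with an rbtree per bucket instead of the list, which
would be the O(lg(n)) bucket discipline mentioned above.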
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 4:01 ` William Lee Irwin III
@ 2007-04-15 4:18 ` Davide Libenzi
0 siblings, 0 replies; 577+ messages in thread
From: Davide Libenzi @ 2007-04-15 4:18 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds,
Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
On Sat, 14 Apr 2007, William Lee Irwin III wrote:
> The two basic attacks on such large priority spaces are the near future
> vs. far future subdivisions and subdividing the priority space into
> (most often regular) intervals. Subdividing the priority space into
> intervals is the most obvious; you simply use some O(lg(n)) priority
> queue as the bucket discipline in the "time ring," queue by the upper
> bits of the queue priority in the time ring, and by the lower bits in
> the O(lg(n)) bucket discipline.
Sure. If you really need sub-millisecond precision, you can replace the
bucket's list_head with an rb_root. It may not be necessary though for a
cpu scheduler (still, didn't read Ingo's code yet).
- Davide
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 3:27 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Con Kolivas
@ 2007-04-15 5:16 ` Bill Huey
2007-04-15 8:44 ` Ingo Molnar
2007-04-15 16:11 ` Bernd Eckenfels
2007-04-15 6:43 ` Mike Galbraith
2007-04-15 15:05 ` Ingo Molnar
2 siblings, 2 replies; 577+ messages in thread
From: Bill Huey @ 2007-04-15 5:16 UTC (permalink / raw)
To: Con Kolivas
Cc: Ingo Molnar, ck list, Peter Williams, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner, Bill Huey (hui)
On Sun, Apr 15, 2007 at 01:27:13PM +1000, Con Kolivas wrote:
...
> Now that you're agreeing my direction was correct you've done the usual Linux
> kernel thing - ignore all my previous code and write your own version. Oh
> well, that I've come to expect; at least you get a copyright notice in the
> bootup and somewhere in the comments give me credit for proving it's
> possible. Let's give some other credit here too. William Lee Irwin provided
> the major architecture behind plugsched at my request and I simply finished
> the work and got it working. He is also responsible for many IRC discussions
> I've had about cpu scheduling fairness, designs, programming history and code
> help. Even though he did not contribute code directly to SD, his comments
> have been invaluable.
Hello folks,
I think the main failure I see here is that Con wasn't included in this design
or privately in the review process. There could have been better co-ownership
of the code. This could also have been done openly on lkml (since this is kind
of what this medium is about, to a significant degree) so that consensus can
happen (Con can be reasoned with). It would have achieved the same thing, but
probably more smoothly, if folks had just listened, considered an idea and
then, in this case, created something that would allow for experimentation
from outsiders in a fluid fashion.
If these issues aren't fixed, you're going to be stuck with the same kind of
creeping elitism that has gradually killed the FreeBSD project and other BSDs.
I can't comment on the code implementation. I'm focused on other things now
that I'm at NetApp and I can't help out as much as I could. Being former BSDi,
I had a first-hand account of these issues as they played out.
A development process like this is likely to exclude smart people from wanting
to contribute to Linux, and folks should be conscious of these issues. It's
basically a lot of code and concept that at least two individuals have worked
on (wli and con), only to have it be rejected and then suddenly replaced by
code from a community gatekeeper. In this case, this results in both Con and
Bill Irwin being woefully underutilized.
If I were one of these people, I'd be mighty pissed.
bill
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 3:27 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Con Kolivas
2007-04-15 5:16 ` Bill Huey
@ 2007-04-15 6:43 ` Mike Galbraith
2007-04-15 8:36 ` Bill Huey
2007-04-17 0:06 ` Peter Williams
2007-04-15 15:05 ` Ingo Molnar
2 siblings, 2 replies; 577+ messages in thread
From: Mike Galbraith @ 2007-04-15 6:43 UTC (permalink / raw)
To: Con Kolivas
Cc: Ingo Molnar, ck list, Peter Williams, Bill Huey, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven,
Thomas Gleixner
On Sun, 2007-04-15 at 13:27 +1000, Con Kolivas wrote:
> On Saturday 14 April 2007 06:21, Ingo Molnar wrote:
> > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
> > [CFS]
> >
> > i'm pleased to announce the first release of the "Modular Scheduler Core
> > and Completely Fair Scheduler [CFS]" patchset:
> >
> > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
> >
> > This project is a complete rewrite of the Linux task scheduler. My goal
> > is to address various feature requests and to fix deficiencies in the
> > vanilla scheduler that were suggested/found in the past few years, both
> > for desktop scheduling and for server scheduling workloads.
>
> The casual observer will be completely confused by what on earth has happened
> here so let me try to demystify things for them.
[...]
Demystify what? The casual observer need only read either your attempt
at writing a scheduler, or my attempts at fixing the one we have, to see
that it was high time for someone with the necessary skills to step in.
Now progress can happen, which was _not_ happening before.
-Mike
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 13:01 ` Willy Tarreau
2007-04-14 13:27 ` Willy Tarreau
@ 2007-04-15 7:54 ` Mike Galbraith
2007-04-15 8:58 ` Ingo Molnar
2007-04-19 9:01 ` Ingo Molnar
2 siblings, 1 reply; 577+ messages in thread
From: Mike Galbraith @ 2007-04-15 7:54 UTC (permalink / raw)
To: Willy Tarreau
Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Arjan van de Ven, Thomas Gleixner
On Sat, 2007-04-14 at 15:01 +0200, Willy Tarreau wrote:
> Well, I'll stop heating the room for now as I get out of ideas about how
> to defeat it. I'm convinced. I'm impatient to read about Mike's feedback
> with his workload which behaves strangely on RSDL. If it works OK here,
> it will be the proof that heuristics should not be needed.
You mean the X + mp3 player + audio visualization test? X+Gforce
visualization have problems getting half of my box in the presence of
two other heavy cpu using tasks. Behavior is _much_ better than
RSDL/SD, but the synchronous nature of X/client seems to be a problem.
With this scheduler, renicing X/client does cure it, whereas with SD it
did not help one bit. (I know a trivial way to cure that, and this
framework makes that possible without dorking up fairness as a general
policy.)
-Mike
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 6:43 ` Mike Galbraith
@ 2007-04-15 8:36 ` Bill Huey
2007-04-15 8:45 ` Mike Galbraith
` (2 more replies)
2007-04-17 0:06 ` Peter Williams
1 sibling, 3 replies; 577+ messages in thread
From: Bill Huey @ 2007-04-15 8:36 UTC (permalink / raw)
To: Mike Galbraith
Cc: Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven,
Thomas Gleixner, Bill Huey (hui)
On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote:
> [...]
>
> Demystify what? The casual observer need only read either your attempt
Here's the problem. You're a casual observer and obviously not paying
attention.
> at writing a scheduler, or my attempts at fixing the one we have, to see
> that it was high time for someone with the necessary skills to step in.
> Now progress can happen, which was _not_ happening before.
I think that's inaccurate and there are plenty of folks that have that
technical skill and background. The scheduler code isn't a deep mystery
and there are plenty of good kernel hackers out here across many
communities. Ingo isn't the only person on this planet to have deep
scheduler knowledge. Priority heaps are not new and Solaris has had a
pluggable scheduler framework for years.
Con's characterization of how Linux kernel development works is something that
I'm more prone to believe than your view. I think it's a great shame to have
folks like Bill Irwin and Con waste time trying to do something right, only to
have their ideas attacked, then copied and held up as the solution for this
kind of technical problem, in a complete reversal of technical opinion as it
suits the moment. This is just wrong in so many ways.
It outlines the problems with Linux kernel development and the questionable
elitism regarding ownership of certain sections of the kernel code.
I call it "churn squat", and instances like this only support that view,
which I would much rather be completely wrong and inaccurate.
bill
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 5:16 ` Bill Huey
@ 2007-04-15 8:44 ` Ingo Molnar
2007-04-15 9:51 ` Bill Huey
2007-04-15 16:11 ` Bernd Eckenfels
1 sibling, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-15 8:44 UTC (permalink / raw)
To: Bill Huey
Cc: Con Kolivas, ck list, Peter Williams, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
* Bill Huey <billh@gnuppy.monkey.org> wrote:
> Hello folks,
>
> I think the main failure I see here is that Con wasn't included in
> this design or privately in review process. There could have been
> better co-ownership of the code. This could also have been done openly
> on lkml [...]
Bill, you come from a BSD background and you are still relatively new to
Linux development, so i dont at all fault you for misunderstanding this
situation, and fortunately i have a really easy resolution for your
worries: i did exactly that! :)
i wrote the first line of code of the CFS patch this week, 8am Wednesday
morning, and released it to lkml 62 hours later, at 22:00 on Friday. (I've
listed the file timestamps of my backup patches further below, for all
the fine details.)
I prefer such early releases to lkml _a lot_ more than any private review
process. I released the CFS code about 6 hours after i thought "okay,
this looks pretty good" and i spent those final 6 hours on testing it
(making sure it doesnt blow up on your box, etc.), in the final 2 hours
i showed it to two folks i could reach on IRC (Arjan and Thomas) and on
various finishing touches. It doesnt get much faster than that and i
definitely didnt want to sit on it even one day longer because i very
much thought that Con and others should definitely see this work!
And i very much credited (and still credit) Con for the whole fairness
angle:
|| i'd like to give credit to Con Kolivas for the general approach here:
|| he has proven via RSDL/SD that 'fair scheduling' is possible and that
|| it results in better desktop scheduling. Kudos Con!
the 'design consultation' phase you are talking about is _NOW_! :)
I got the v1 code out to Con, to Mike and to many others ASAP. That's
how you are able to comment on this thread and be part of the
development process to begin with, in a 'private consultation' setup
you'd not have had any opportunity to see _any_ of this.
In the BSD space there seem to be more 'political' mechanisms for
development, but Linux is truly about doing things out in the open, and
doing it immediately.
Okay? ;-)
Here's the timestamps of all my backups of the patch, from its humble 4K
beginnings to the 100K first-cut v1 result:
-rw-rw-r-- 1 mingo mingo 4230 Apr 11 08:47 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 7653 Apr 11 09:12 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 7728 Apr 11 09:26 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 14416 Apr 11 10:08 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 24211 Apr 11 10:41 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 27878 Apr 11 10:45 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 33807 Apr 11 11:05 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 34524 Apr 11 11:09 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 39650 Apr 11 11:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 40231 Apr 11 11:34 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 40627 Apr 11 11:48 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 40638 Apr 11 11:54 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 42733 Apr 11 12:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 42817 Apr 11 12:31 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 43270 Apr 11 12:41 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 43531 Apr 11 12:48 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 44331 Apr 11 12:51 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45173 Apr 11 12:56 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45288 Apr 11 12:59 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45368 Apr 11 13:06 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45370 Apr 11 13:06 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45815 Apr 11 13:14 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45887 Apr 11 13:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45914 Apr 11 13:25 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45850 Apr 11 13:29 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 49196 Apr 11 13:39 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 64317 Apr 11 13:45 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 64403 Apr 11 13:52 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 65199 Apr 11 14:03 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 65199 Apr 11 14:07 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 68995 Apr 11 14:50 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 69919 Apr 11 15:23 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 71065 Apr 11 16:26 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 70642 Apr 11 16:28 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 72334 Apr 11 16:49 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 71624 Apr 11 17:01 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 71854 Apr 11 17:20 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 73571 Apr 11 17:42 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 74708 Apr 11 17:49 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 74708 Apr 11 17:51 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 75144 Apr 11 17:57 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 80722 Apr 11 18:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 80727 Apr 11 18:41 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 80727 Apr 11 18:59 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 89356 Apr 11 21:32 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 95278 Apr 12 08:36 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 97749 Apr 12 10:49 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 97687 Apr 12 10:58 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 97722 Apr 12 11:06 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 97933 Apr 12 11:22 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 98167 Apr 12 12:04 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 98167 Apr 12 12:09 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 100405 Apr 12 12:29 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 100380 Apr 12 12:50 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 101631 Apr 12 13:32 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102293 Apr 12 14:12 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102431 Apr 12 14:28 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102502 Apr 12 14:53 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102128 Apr 13 11:13 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102473 Apr 13 12:12 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102536 Apr 13 12:24 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102481 Apr 13 12:30 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 103408 Apr 13 13:08 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 103441 Apr 13 13:31 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 104759 Apr 13 14:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 104815 Apr 13 14:39 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 104762 Apr 13 15:04 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 105978 Apr 13 16:18 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 105977 Apr 13 16:26 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 106761 Apr 13 17:08 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 106358 Apr 13 17:40 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 107802 Apr 13 19:04 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 104427 Apr 13 19:35 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 103927 Apr 13 19:40 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 101867 Apr 13 20:30 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 101011 Apr 13 21:05 patches/sched-fair.patch
i hope this helps :)
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 8:36 ` Bill Huey
@ 2007-04-15 8:45 ` Mike Galbraith
2007-04-15 9:06 ` Ingo Molnar
2007-04-15 16:25 ` Arjan van de Ven
2 siblings, 0 replies; 577+ messages in thread
From: Mike Galbraith @ 2007-04-15 8:45 UTC (permalink / raw)
To: Bill Huey
Cc: Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven,
Thomas Gleixner
On Sun, 2007-04-15 at 01:36 -0700, Bill Huey wrote:
> On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote:
> > [...]
> >
> > Demystify what? The casual observer need only read either your attempt
>
> Here's the problem. You're a casual observer and obviously not paying
> attention.
>
> > at writing a scheduler, or my attempts at fixing the one we have, to see
> > that it was high time for someone with the necessary skills to step in.
> > Now progress can happen, which was _not_ happening before.
>
> I think that's inaccurate and there are plenty of folks that have that
> technical skill and background. The scheduler code isn't a deep mystery
> and there are plenty of good kernel hackers out here across many
> communities. Ingo isn't the only person on this planet to have deep
> scheduler knowledge.
Ok <shrug>, I'm not paying attention, and you can't read. We're even.
Have a nice life.
-Mike
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 7:54 ` Mike Galbraith
@ 2007-04-15 8:58 ` Ingo Molnar
2007-04-15 9:11 ` Mike Galbraith
0 siblings, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-15 8:58 UTC (permalink / raw)
To: Mike Galbraith
Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Arjan van de Ven, Thomas Gleixner
* Mike Galbraith <efault@gmx.de> wrote:
> On Sat, 2007-04-14 at 15:01 +0200, Willy Tarreau wrote:
>
> > Well, I'll stop heating the room for now as I get out of ideas about
> > how to defeat it. I'm convinced. I'm impatient to read about Mike's
> > feedback with his workload which behaves strangely on RSDL. If it
> > works OK here, it will be the proof that heuristics should not be
> > needed.
>
> You mean the X + mp3 player + audio visualization test? X+Gforce
> visualization have problems getting half of my box in the presence of
> two other heavy cpu using tasks. Behavior is _much_ better than
> RSDL/SD, but the synchronous nature of X/client seems to be a problem.
>
> With this scheduler, renicing X/client does cure it, whereas with SD
> it did not help one bit. [...]
thanks for testing it! I was quite worried about your setup - two tasks
using up 50%/50% of CPU time, pitted against a kernel rebuild workload
seems to be a hard workload to get right.
> [...] (I know a trivial way to cure that, and this framework makes
> that possible without dorking up fairness as a general policy.)
great! Please send patches so i can add them (once you are happy with
the solution) - i think your workload isnt special in any way and could
hit other people too.
Ingo
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 8:36 ` Bill Huey
2007-04-15 8:45 ` Mike Galbraith
@ 2007-04-15 9:06 ` Ingo Molnar
2007-04-16 10:00 ` Ingo Molnar
2007-04-15 16:25 ` Arjan van de Ven
2 siblings, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-15 9:06 UTC (permalink / raw)
To: Bill Huey
Cc: Mike Galbraith, Con Kolivas, ck list, Peter Williams,
linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
Arjan van de Ven, Thomas Gleixner
* Bill Huey <billh@gnuppy.monkey.org> wrote:
> On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote:
> > [...]
> >
> > Demystify what? The casual observer need only read either your
> > attempt
>
> Here's the problem. You're a casual observer and obviously not paying
> attention.
guys, please calm down. Judging by the number of contributions to
sched.c the main folks who are not 'observers' here and who thus have an
unalienable right to be involved in a nasty flamewar about scheduler
interactivity are Con, Mike, Nick and me ;-) Everyone else is just a
happy bystander, ok? ;-)
Ingo
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 8:58 ` Ingo Molnar
@ 2007-04-15 9:11 ` Mike Galbraith
0 siblings, 0 replies; 577+ messages in thread
From: Mike Galbraith @ 2007-04-15 9:11 UTC (permalink / raw)
To: Ingo Molnar
Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Arjan van de Ven, Thomas Gleixner
On Sun, 2007-04-15 at 10:58 +0200, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
> > [...] (I know a trivial way to cure that, and this framework makes
> > that possible without dorking up fairness as a general policy.)
>
> great! Please send patches so i can add them (once you are happy with
> the solution) - i think your workload isnt special in any way and could
> hit other people too.
I'll give it a shot. (have to read and actually understand your new
code first though, then see if it's really viable)
-Mike
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 8:44 ` Ingo Molnar
@ 2007-04-15 9:51 ` Bill Huey
2007-04-15 10:39 ` Pekka Enberg
0 siblings, 1 reply; 577+ messages in thread
From: Bill Huey @ 2007-04-15 9:51 UTC (permalink / raw)
To: Ingo Molnar
Cc: Con Kolivas, ck list, Peter Williams, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner, Bill Huey (hui)
On Sun, Apr 15, 2007 at 10:44:47AM +0200, Ingo Molnar wrote:
> I prefer such early releases to lkml _alot_ more than any private review
> process. I released the CFS code about 6 hours after i thought "okay,
> this looks pretty good" and i spent those final 6 hours on testing it
> (making sure it doesnt blow up on your box, etc.), in the final 2 hours
> i showed it to two folks i could reach on IRC (Arjan and Thomas) and on
> various finishing touches. It doesnt get much faster than that and i
> definitely didnt want to sit on it even one day longer because i very
> much thought that Con and others should definitely see this work!
>
> And i very much credited (and still credit) Con for the whole fairness
> angle:
>
> || i'd like to give credit to Con Kolivas for the general approach here:
> || he has proven via RSDL/SD that 'fair scheduling' is possible and that
> || it results in better desktop scheduling. Kudos Con!
>
> the 'design consultation' phase you are talking about is _NOW_! :)
>
> I got the v1 code out to Con, to Mike and to many others ASAP. That's
> how you are able to comment on this thread and be part of the
> development process to begin with, in a 'private consultation' setup
> you'd not have had any opportunity to see _any_ of this.
>
> In the BSD space there seem to be more 'political' mechanisms for
> development, but Linux is truly about doing things out in the open, and
> doing it immediately.
I can't even begin to talk about how screwed up BSD development is. Maybe
another time privately.
Ok, Linux development and inclusiveness can be improved. I'm not trying
to "call you out" (slang for accusing you with the sole intention of calling
you crazy in a highly confrontational manner). This is discussed publicly
here to bring this issue to light and open a communication channel as a means
to resolve it.
> Okay? ;-)
It's cool. We're still getting to know each other professionally and it's
okay to a certain degree to have a communication disconnect but only as
long as it clears. Your productivity is amazing BTW. But here's the
problem: there's this perception that NIH is the default mentality here
in Linux.
Con feels that this kind of action is intentional and has a malicious
quality to it, as a means of "churn squatting" sections of the kernel tree.
The perception here is that there is this expectation that
sections of the Linux kernel are intentionally "churn squatted" to prevent
any other ideas from creeping in other than those of the owner of that subsystem
(VM, scheduling, etc...) because of lack of modularity in the kernel.
This isn't an API question but possibly one of general code quality
and how maintenance of it can be handled.
This was predicted by folks and then this perception was *realized* when
you wrote the equivalent kind of code that has technical overlap with SD
(this is just one dry example). To a person that is writing new code for
Linux, having one of the old guard write equivalent code to that of a
newcomer has the effect of displacing that person, both with regard to the
code and the responsibility for it. When this happens over and over again
and folks get annoyed by it, it starts to seem that Linux development
is elitist.
I know this because I heard (read) Con's IRC chats about these matters
all of the time. This is not just his view but a view held by other
kernel folks as well. The closing talk at OLS 2006 was highly disturbing
in many ways. It went "Christoph is right, everybody else is wrong",
which sends a highly negative message to new kernel developers that, say,
don't work for RH directly or any of the other mainstream Linux companies.
After a while, it starts seeming like this kind of behavior is completely
intentional and that Linux is full of arrogant bastards.
What I would have done here was to contact Peter Williams, Bill Irwin
and Con about what you're doing and reach a consensus about how
to create something that would be inclusive of all of their ideas.
Discussions can get technically heated but that's ok, the discussion is
happening and it brings down the wall of this perception. Bill and
Con are on oftc.net/#offtopic2. Riel is there as well as Peter Zijlstra.
It might be very useful, it might not be. Folks are all stubborn
about their ideas and hold on to them for dear life. Effective
leaders can deconstruct this hostility and animosity. I don't claim
to be one.
Because of past hostility to something like plugsched, the hostility
and terseness of responses can be perceived simply as "I'm right,
you're wrong", which is condescending. This affects discussion and
outright destroys a constructive process if it happens continually,
since it reinforces that view of "You're an outsider, we don't care
about you". Nobody is listening to each other at that point, folks get
pissed. Then they think "I'm going to NIH this person with patch
X because he/she did the same here", which is dysfunctional.
Oddly enough, sometimes you're the best person to get a new idea
into the tree. What's not happening here is communication. That takes
sensitivity, careful listening (which is a difficult skill), and then
understanding of the characters involved to unify creative energies.
That's a very difficult thing to do for folks that are used to working
solo. It takes time to develop trust in those relationships so that
a true collaboration can happen. I know that there is a lot of
creativity in folks like Con and Bill. It would be wise to develop a
dialog with them to see if they can offload some of your work for you
(we all know you're really busy) yet have you be a key facilitator of
their and your ideas. That's a really tough thing to do and it requires
practice. Just imagine (assuming they can follow through) what could
have positively happened if their collective knowledge was leveraged
better. It's not all clear and rosy, but I think these people are more
on your side than you might realize, and it might be a good thing to
discover that.
This is tough because I know the personalities involved and I know
kind of how people function and malfunction in this discussion on a
personal basis.
[We can continue privately. This is not just about you but applicable
to open source development in general]
The tone of this email is intellectually critical (not meant as a
personal attack) and calm. If I'm otherwise, then I'm a bastard. :)
bill
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 9:51 ` Bill Huey
@ 2007-04-15 10:39 ` Pekka Enberg
2007-04-15 12:45 ` Willy Tarreau
2007-04-15 15:16 ` Gene Heskett
0 siblings, 2 replies; 577+ messages in thread
From: Pekka Enberg @ 2007-04-15 10:39 UTC (permalink / raw)
To: hui Bill Huey
Cc: Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote:
> The perception here is that there is this expectation that
> sections of the Linux kernel are intentionally "churn squatted" to prevent
> any other ideas from creeping in other than those of the owner of that subsystem
Strangely enough, my perception is that Ingo is simply trying to
address the issues Mike's testing discovered in RSDL and SD. It's not
surprising Ingo made it a separate patch set as Con has repeatedly
stated that the "problems" are in fact by design and won't be fixed.
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
` (9 preceding siblings ...)
2007-04-15 3:27 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Con Kolivas
@ 2007-04-15 12:29 ` Esben Nielsen
2007-04-15 13:04 ` Ingo Molnar
2007-04-15 22:49 ` Ismail Dönmez
` (3 subsequent siblings)
14 siblings, 1 reply; 577+ messages in thread
From: Esben Nielsen @ 2007-04-15 12:29 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Fri, 13 Apr 2007, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.
>
> [...]
I took a brief look at it. Have you tested priority inheritance?
As far as I can see rt_mutex_setprio doesn't have much effect on
SCHED_FAIR/SCHED_BATCH. I am looking for a place where such a task changes
scheduler class when boosted in rt_mutex_setprio().
Esben
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 10:39 ` Pekka Enberg
@ 2007-04-15 12:45 ` Willy Tarreau
2007-04-15 13:08 ` Pekka J Enberg
` (2 more replies)
2007-04-15 15:16 ` Gene Heskett
1 sibling, 3 replies; 577+ messages in thread
From: Willy Tarreau @ 2007-04-15 12:45 UTC (permalink / raw)
To: Pekka Enberg
Cc: hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams,
linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Sun, Apr 15, 2007 at 01:39:27PM +0300, Pekka Enberg wrote:
> On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote:
> >The perception here is that there is this expectation that
> >sections of the Linux kernel are intentionally "churn squatted" to prevent
> >any other ideas from creeping in other than those of the owner of that subsystem
>
> Strangely enough, my perception is that Ingo is simply trying to
> address the issues Mike's testing discovered in RSDL and SD. It's not
> surprising Ingo made it a separate patch set as Con has repeatedly
> stated that the "problems" are in fact by design and won't be fixed.
That's not exactly the problem. There are people who work very hard to
try to improve some areas of the kernel. They progress slowly, and
acquire more and more skills. Sometimes they feel like they need to
change some concepts and propose those changes which are required for
them to go further, or to develop faster. Those are rejected. So they
are constrained to work in a delimited perimeter from which it is
difficult for them to escape.
Then, the same person who rejected their changes comes up with something
shiny, new and better, which took him far less time. But he sort of
broke the rules, because what was forbidden to the first people is
suddenly permitted. Maybe for very good reasons, I'm not discussing
that. The good reasons should have been valid the first time too.
The fact is that when changes are rejected, we should not simply say
"no", but explain why and define what would be acceptable. Some people
here have excellent teaching skills for this, but most others do not.
Anyway, the rules should be the same for everybody.
Also, there is what can be perceived as marketing here. Con worked
on his idea with conviction, he took time to write some generous
documentation, but he hit a wall where his concept was suboptimal on
a given workload. But at least, all the work was oriented on a technical
basis: design + code + doc.
Then, Ingo comes in with something looking amazingly better, with
virtually no documentation, an appealing announcement, and shiny
advertising at boot. All this implemented without the constraints
other people had to respect. It already looks like definitive work
which will be merged as-is without many changes except a few bugfixes.
If those were two companies, the first one would simply have accused
the second one of not having respected contracts and of having employed
heavy marketing to take first place.
People here do not code for a living, they do it at least because they
believe in what they are doing, and some of them want a bit of gratitude
for their work. I've met people who were proud to say they implement
this or that feature in the kernel, so it is something important for
them. And being cited in an email is nothing compared to advertising
at boot time.
When the discussion was blocked between Con and Mike concerning the
design problems, that is where a new discussion should have taken place.
Ingo could have publicly spoken with them about his ideas of killing
the O(1) scheduler and replacing it with an rbtree-based one, and using
part of Bill's work to speed up development.
It is far easier to resign yourself when people explain which concepts are
wrong and how they plan to proceed than when they suddenly present something
out of nowhere which is already better.
And it's not specific to Ingo (though I think his ability to work that
fast alone makes him tend to practise this more often than others).
Imagine if Con had worked another full week on his scheduler with better
results on Mike's workload, but still not as good as Ingo's, and they both
published at the same time. You certainly can imagine he would have preferred
to be informed first that it was pointless to continue in that direction.
Now I hope he and Bill will get over this and agree to work on improving
this scheduler, because I really find it smarter than a dumb O(1). I even
agree with Mike that we now have a solid basis for future work. But for
this, maybe a good starting point would be to remove the selfish printk
at boot, revert useless changes (SCHED_NORMAL->SCHED_FAIR comes to mind)
and improve the documentation a bit so that people can work together on
the new design, without feeling like their work will only serve to
promote X or Y.
Regards,
Willy
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 12:29 ` Esben Nielsen
@ 2007-04-15 13:04 ` Ingo Molnar
2007-04-16 7:16 ` Esben Nielsen
0 siblings, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-15 13:04 UTC (permalink / raw)
To: Esben Nielsen
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* Esben Nielsen <nielsen.esben@googlemail.com> wrote:
> I took a brief look at it. Have you tested priority inheritance?
yeah, you are right, it's broken at the moment, i'll fix it. But the
good news is that i think PI could become cleaner via scheduling
classes.
> As far as I can see rt_mutex_setprio doesn't have much effect on
> SCHED_FAIR/SCHED_BATCH. I am looking for a place where such a task
> change scheduler class when boosted in rt_mutex_setprio().
i think via scheduling classes we dont have to do the p->policy and
p->prio based gymnastics anymore, we can just have a clean look at
p->sched_class and stack the original scheduling class into
p->real_sched_class. It would probably also make sense to 'privatize'
p->prio into the scheduling class. That way PI would be a pure property
of sched_rt, and the PI scheduler would be driven purely by
p->rt_priority, not by p->prio. That way all the normal_prio() kind of
complications and interactions with SCHED_OTHER/SCHED_FAIR would be
eliminated as well. What do you think?
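( a rough, purely illustrative userspace sketch of that stacking idea -
  the struct and helper names below are invented for illustration and are
  not the actual kernel API: )

#include <stdio.h>

/* Illustration only: a task points at its scheduling class; a PI boost
 * stacks the original class away and switches the task to the RT class,
 * driven purely by rt_priority rather than the SCHED_OTHER prio. */
struct sched_class { const char *name; };

static const struct sched_class fair_class = { "fair" };
static const struct sched_class rt_class   = { "rt" };

struct task {
	const struct sched_class *sched_class;      /* class in effect   */
	const struct sched_class *real_sched_class; /* stacked original  */
	int rt_priority;                            /* RT-only property  */
};

/* boost: remember the original class, hand the task to the RT class */
static void pi_boost(struct task *p, int rt_prio)
{
	p->real_sched_class = p->sched_class;
	p->sched_class = &rt_class;
	p->rt_priority = rt_prio;
}

/* deboost: restore the stacked class */
static void pi_deboost(struct task *p)
{
	p->sched_class = p->real_sched_class;
	p->real_sched_class = NULL;
	p->rt_priority = 0;
}

int main(void)
{
	struct task p = { &fair_class, NULL, 0 };

	pi_boost(&p, 50);
	printf("boosted:   class=%s rt_prio=%d\n", p.sched_class->name, p.rt_priority);
	pi_deboost(&p);
	printf("deboosted: class=%s\n", p.sched_class->name);
	return 0;
}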
Ingo
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 12:45 ` Willy Tarreau
@ 2007-04-15 13:08 ` Pekka J Enberg
2007-04-15 17:32 ` Mike Galbraith
2007-04-15 15:26 ` William Lee Irwin III
2007-04-15 15:39 ` Ingo Molnar
2 siblings, 1 reply; 577+ messages in thread
From: Pekka J Enberg @ 2007-04-15 13:08 UTC (permalink / raw)
To: Willy Tarreau
Cc: hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams,
linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Sun, 15 Apr 2007, Willy Tarreau wrote:
> Ingo could have publicly spoken with them about his ideas of killing
> the O(1) scheduler and replacing it with an rbtree-based one, and using
> part of Bill's work to speed up development.
He did exactly that and he did it with a patch. Nothing new here. This is
how development on LKML proceeds when you have two or more competing
designs. There's absolutely no need to get upset or hurt your feelings
over it. It's not malicious, it's how we do Linux development.
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 3:27 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Con Kolivas
2007-04-15 5:16 ` Bill Huey
2007-04-15 6:43 ` Mike Galbraith
@ 2007-04-15 15:05 ` Ingo Molnar
2007-04-15 20:05 ` Matt Mackall
2007-04-16 5:16 ` Con Kolivas
2 siblings, 2 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-15 15:05 UTC (permalink / raw)
To: Con Kolivas
Cc: Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* Con Kolivas <kernel@kolivas.org> wrote:
[ i'm quoting this bit out of order: ]
> 2. Since then I've been thinking/working on a cpu scheduler design
> that takes away all the guesswork out of scheduling and gives very
> predictable, as fair as possible, cpu distribution and latency while
> preserving as solid interactivity as possible within those confines.
yeah. I think you were right on target with this call. I've applied the
sched.c change attached at the bottom of this mail to the CFS patch, if
you dont mind. (or feel free to suggest some other text instead.)
> 1. I tried in vain some time ago to push a working extensible
> pluggable cpu scheduler framework (based on wli's work) for the linux
> kernel. It was perma-vetoed by Linus and Ingo (and Nick also said he
> didn't like it) as being absolutely the wrong approach and that we
> should never do that. [...]
i partially replied to that point to Will already, and i'd like to make
it clear again: yes, i rejected plugsched 2-3 years ago (which already
drifted away from wli's original codebase) and i would still reject it
today.
First and foremost, please dont take such rejections too personally - i
had my own share of rejections (and in fact, as i mentioned it in a
previous mail, i had a fair number of complete project throwaways:
4g:4g, in-kernel Tux, irqrate and many others). I know that they can
hurt and can demoralize, but if i dont like something it's my job to
tell that.
Can i sum up your argument as: "you rejected plugsched, but then why on
earth did you modularize portions of the scheduler in CFS? Isnt your
position thus woefully inconsistent?" (i'm sure you would never put it
this impolitely though, but i guess i can flame myself with impunity ;)
While having an inconsistent position isnt a terminal sin in itself,
please realize that the scheduler classes code in CFS is quite different
from plugsched: it was a result of what i saw to be technological
pressure for _internal modularization_. (This internal/policy
modularization aspect is something that Will said was present in his
original plugsched code, but which aspect i didnt see in the plugsched
patches that i reviewed.)
That possibility never even occurred to me until 3 days ago. You never
raised it either AFAIK. No patches to simplify the scheduler that way
were ever sent. Plugsched doesnt even touch the core load-balancer for
example, and most of the time i spent with the modularization was to get
the load-balancing details right. So it's really apples to oranges.
My view about plugsched: first please take a look at the latest
plugsched code:
http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.20.patch
26 files changed, 8951 insertions(+), 1495 deletions(-)
As an experiment i've removed all the add-on schedulers (both the core
and the include files, only kept the vanilla one) from the plugsched
patch (and the makefile and kconfig complications, etc), to see the
'infrastructure cost', and it still gave:
12 files changed, 1933 insertions(+), 1479 deletions(-)
that's the extra complication i didnt like 3 years ago and which i still
dont like today. What the current plugsched code does is that it
simplifies the adding of new experimental schedulers, but it doesnt
really do what i wanted: to simplify the _scheduler itself_. Personally
i'm still not primarily interested in having a large selection of
schedulers, i'm mainly interested in a good and maintainable scheduler
that works for people.
so the rejection was on these grounds, and i still very much stand by
that position here and today: i didnt want to see the Linux scheduler
landscape balkanized and i saw no technological reasons for the
complication that external modularization brings.
the new scheduling classes code in the CFS patch was not a result of "oh,
i want to write a new scheduler, lets make schedulers pluggable" kind of
thinking. That result was just a side-effect of it. (and as you
correctly noted, the CFS related modularization is incomplete).
Btw., the thing that triggered the scheduling classes code wasnt even
plugsched or RSDL/SD, it was Mike's patches. Mike had an itch and he
fixed it within the framework of the existing scheduler, and the end
result behaved quite well when i threw various testloads on it.
But i felt a bit uncomfortable that it added another few hundred lines
of code to an already complex sched.c. This felt unnatural so i mailed
Mike that i'd attempt to clean these infrastructure aspects of sched.c
up a bit so that it becomes more hackable to him. Thus 3 days ago,
without having made up my mind about anything, i started this experiment
(which ended up in the modularization and in the CFS scheduler) to
simplify the code and to enable Mike to fix such itches in an easier
way. By your logic Mike should in fact be quite upset about this: if the
new code works out and proves to be useful then it obsoletes a whole lot
of his code!
> For weeks now, Ingo has said that the interactivity regressions were
> showstoppers and we should address them, never mind the fact that the
> so-called regressions were purely "it slows down linearly with load"
> which to me is perfectly desirable behaviour. [...]
yes. For me the first thing when considering a large scheduler patch is:
"does a patch do what it claims" and "does it work". If those goals are
met (and if it's a complete scheduler i actually try it quite
extensively) then i look at the code cleanliness issues. Mike's patch
was the first one that seemed to meet that threshold in my own humble
testing, and CFS was a direct result of that.
note that i tried the same workloads with CFS and while it wasnt as good
as mainline, it handled them better than SD. Mike reported the same, and
Mark Lord (who too reported SD interactivity problems) reported success
yesterday too.
(but .. CFS is a mere 2 days old so we cannot really tell anything with
certainty yet.)
> [...] However at one stage I virtually begged for support with my
> attempts and help with the code. Dmitry Adamushko is the only person
> who actually helped me with the code in the interim, while others
> poked sticks at it. Sure the sticks helped at times but the sticks
> always seemed to have their ends kerosene doused and flaming for
> reasons I still don't get. No other help was forthcoming.
i'm really sorry you got that impression.
in 2004 i had a good look at the staircase scheduler and said:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0408.0/1146.html
"But in general i'm quite positive about the staircase scheduler."
and even tested it and gave you feedback:
http://lwn.net/Articles/96562/
i think i even told Andrew that i dont really like pluggable schedulers
and if there's any replacement for the current scheduler then that would
be a full replacement, and it would be the staircase scheduler.
Hey, i told this to you as recently as 1 month ago as well:
http://lkml.org/lkml/2007/3/8/54
"cool! I like this even more than i liked your original staircase
scheduler from 2 years ago :)"
Ingo
----------->
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -16,6 +16,7 @@
* by Davide Libenzi, preemptible kernel bits by Robert Love.
* 2003-09-03 Interactivity tuning by Con Kolivas.
* 2004-04-02 Scheduler domains code by Nick Piggin
+ * 2007-04-15 Con Kolivas was dead right: fairness matters! :)
*/
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 10:39 ` Pekka Enberg
2007-04-15 12:45 ` Willy Tarreau
@ 2007-04-15 15:16 ` Gene Heskett
2007-04-15 16:43 ` Con Kolivas
1 sibling, 1 reply; 577+ messages in thread
From: Gene Heskett @ 2007-04-15 15:16 UTC (permalink / raw)
To: Pekka Enberg
Cc: hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams,
linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Sunday 15 April 2007, Pekka Enberg wrote:
>On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote:
>> The perception here is that there is this expectation that
>> sections of the Linux kernel are intentionally "churn squatted" to prevent
>> any other ideas from creeping in other than those of the owner of that subsystem
>
>Strangely enough, my perception is that Ingo is simply trying to
>address the issues Mike's testing discovered in RSDL and SD. It's not
>surprising Ingo made it a separate patch set as Con has repeatedly
>stated that the "problems" are in fact by design and won't be fixed.
I won't get into the middle of this just yet, not having decided which dog I
should bet on. I've been running 2.6.21-rc6 + Con's 0.40 patch for about
24 hours, it's been generally usable, but gzip still causes lots of 5 to 10+
second lags when it's running. I'm coming to the conclusion that gzip simply
doesn't play well with others...
Amazing to me, the cpu it's using stays generally below 80%, and often below
60%, even while the kmail composer has a full sentence in its buffer that it
still hasn't shown me when I switch to the htop screen to check, and back to
the kmail screen to see if it's updated yet. The screen switch doesn't seem
to lag, so I don't think renicing X would be helpful. Those are the obvious
lags, and I'll build & reboot to the CFS patch at some point this morning
(what's left of it, that is :). And report in due time, of course.
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
knot in cables caused data stream to become twisted and kinked
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 12:45 ` Willy Tarreau
2007-04-15 13:08 ` Pekka J Enberg
@ 2007-04-15 15:26 ` William Lee Irwin III
2007-04-16 15:55 ` Chris Friesen
2007-04-15 15:39 ` Ingo Molnar
2 siblings, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-15 15:26 UTC (permalink / raw)
To: Willy Tarreau
Cc: Pekka Enberg, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list,
Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Sun, Apr 15, 2007 at 02:45:27PM +0200, Willy Tarreau wrote:
> Now I hope he and Bill will get over this and agree to work on improving
> this scheduler, because I really find it smarter than a dumb O(1). I even
> agree with Mike that we now have a solid basis for future work. But for
> this, maybe a good starting point would be to remove the selfish printk
> at boot, revert useless changes (SCHED_NORMAL->SCHED_FAIR comes to mind)
> and improve the documentation a bit so that people can work together on
> the new design, without feeling like their work will only serve to
> promote X or Y.
While I appreciate people coming to my defense, or at least the good
intentions behind such, my only actual interest in pointing out
4-year-old work is getting some acknowledgment of having done something
relevant at all. Sometimes it has "I told you so" value. At other times
it's merely clarifying what went on when people refer to it since in a
number of cases the patches are no longer extant, so they can't
actually look at it to get an idea of what was or wasn't done. At other
times I'm miffed about not being credited, whether I should've been or
whether dead and buried code has an implementation of the same idea
resurfacing without the author(s) having any knowledge of my prior work.
One should note that in this case, the first work of mine this trips
over (scheduling classes) was never publicly posted as it was only a
part of the original plugsched (an alternate scheduler implementation
devised to demonstrate plugsched's flexibility with respect to
scheduling policies), and a part that was dropped by subsequent
maintainers. The second work of mine this trips over, a virtual deadline
scheduler named "vdls," was also never publicly posted. Both are from
around the same time period, which makes them approximately 4 years dead.
Neither of the codebases are extant, having been lost in a transition
between employers, though various people recall having been sent them
privately, and plugsched survives in a mutated form as maintained by
Peter Williams, who's been very good about acknowledging my original
contribution.
If I care to become a direct participant in scheduler work, I can do so
easily enough.
I'm not entirely sure what this is about a basis for future work. By
and large one should alter the API's and data structures to fit the
policy being implemented. While the array swapping was nice for
algorithmically improving 2.4.x -style epoch expiry, most algorithms
not based on the 2.4.x scheduler (in however mutated a form) should use
a different queue structure, in fact, one designed around their
policy's specific algorithmic needs. IOW, when one alters the scheduler,
one should also alter the queue data structure appropriately. I'd not
expect the priority queue implementation in cfs to continue to be used
unaltered as it matures, nor would I expect any significant modification
of the scheduler to necessarily use a similar one.
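As a minimal illustration of that point (invented code, nothing to do with
the actual cfs queue): a policy that always runs the task with the earliest
'deadline' naturally wants a queue ordered by that key, rather than the
2.4.x-style priority arrays:

#include <stdio.h>

/* Illustration only: a ready queue kept sorted by a per-task key (say, a
 * virtual deadline), so "pick next" is simply "pop the head". A different
 * policy would want a differently shaped queue. */
struct task {
	const char *name;
	unsigned long long deadline;	/* policy-specific ordering key */
	struct task *next;
};

static void enqueue(struct task **head, struct task *t)
{
	struct task **p = head;

	while (*p && (*p)->deadline <= t->deadline)
		p = &(*p)->next;
	t->next = *p;
	*p = t;
}

static struct task *pick_next(struct task **head)
{
	struct task *t = *head;

	if (t)
		*head = t->next;
	return t;
}

int main(void)
{
	struct task a = { "a", 300 }, b = { "b", 100 }, c = { "c", 200 };
	struct task *rq = NULL, *t;

	enqueue(&rq, &a);
	enqueue(&rq, &b);
	enqueue(&rq, &c);
	while ((t = pick_next(&rq)))
		printf("%s (deadline %llu)\n", t->name, t->deadline);
	return 0;
}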
By and large I've been mystified as to why there is such a penchant for
preserving the existing queue structures in the various scheduler
patches floating around. I am now every bit as mystified at the point
of view that seems to be emerging that a change of queue structure is
particularly significant. These are all largely internal changes to
sched.c, and as such, rather small changes in and of themselves. While
they do tend to have user-visible effects, from this point of view
even changing out every line of sched.c is effectively a micropatch.
Something more significant might be altering the schedule() API to
take a mandatory description of the intention of the call to it, or
breaking up schedule() into several different functions to distinguish
between different sorts of uses of it to which one would then respond
differently. Also more significant would be adding a new state beyond
TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, and TASK_RUNNING for some
tasks to respond only to fatal signals, then sweeping TASK_UNINTERRUPTIBLE
users to use the new state and handle those fatal signals. While not
quite as ostentatious in their user-visible effects as SCHED_OTHER
policy affairs, they are tremendously more work than switching out the
implementation of a single C file, and so somewhat more respectable.
Even as scheduling semantics go, these are micropatches. So SCHED_OTHER
changes a little. Where are the gang schedulers? Where are the batch
schedulers (SCHED_BATCH is not truly such)? Where are the isochronous
(frame) schedulers? I suppose there is some CKRM work that actually has
a semantic impact despite being largely devoted to SCHED_OTHER, and
there's some spufs gang scheduling going on, though not all that much.
And to reiterate a point from other threads, even as SCHED_OTHER
patches go, I see precious little verification that things like the
semantics of nice numbers or other sorts of CPU bandwidth allocation
between competing tasks of various natures are staying the same while
other things are changed, or at least being consciously modified in
such a fashion as to improve them. I've literally only seen one or two
tests (and rather inflexible ones with respect to sleep and running
time mixtures) with any sort of quantification of how CPU bandwidth is
distributed get run on all this.
So from my point of view, there's a lot of churn and craziness going on
in one tiny corner of the kernel and people don't seem to have a very
solid grip on what effects their changes have or how they might
potentially break userspace. So I've developed a sudden interest in
regression testing of the scheduler in order to ensure that various sorts
of semantics on which userspace relies are not broken, and am trying to
spark more interest in general in nailing down scheduling semantics and
verifying that those semantics are honored and remain honored by whatever
future scheduler implementations might be merged.
Thus far, the laundry list of semantics I'd like to have nailed down
are specifically:
(1) CPU bandwidth allocation according to nice numbers
(2) CPU bandwidth allocation among mixtures of tasks with varying
sleep/wakeup behavior e.g. that consume some percentage of cpu
in isolation, perhaps also varying the granularity of their
sleep/wakeup patterns
(3) sched_yield(), so multitier userspace locking doesn't go haywire
(4) How these work with SMP; most people agree it should be mostly the
same as it works on UP, but it's not being verified, as most
testcases are barely SMP-aware if at all, and corner cases
where proportionality breaks down aren't considered
The sorts of explicit decisions I'd like to see made for these are:
(1) In a mixture of tasks with varying nice numbers, a given nice number
corresponds to some share of CPU bandwidth. Implementations
should not have the freedom to change this arbitrarily according
to some intention.
(2) A given scheduler _implementation_ intends to distribute CPU
bandwidth among mixtures of tasks that would each consume some
percentage of the CPU in isolation varying across tasks in some
particular pattern. For example, maybe some scheduler
implementation assigns a share of 1/%cpu to a task that would
consume %cpu in isolation, for a CPU bandwidth allocation of
(1/%cpu)/(sum 1/%cpu(t)) as t ranges over all competing tasks
(this is not to say that such a policy makes sense).
(3) sched_yield() is intended to result in some particular scheduling
pattern in a given scheduler implementation. For instance, an
implementation may intend that a set of CPU hogs calling
sched_yield() between repeatedly printf()'ing their pid's will
see their printf()'s come out in an approximately consistent
order as the scheduler cycles between them.
(4) What an implementation intends to do with respect to SMP CPU
bandwidth allocation when precise emulation of UP behavior is
impossible, considering sched_yield() scheduling patterns when
possible as well. For instance, perhaps an implementation
intends to ensure equal CPU bandwidth among competing CPU-bound
tasks of equal priority at all costs, and so triggers migration
and/or load balancing to make it so. Or perhaps an implementation
intends to ensure precise sched_yield() ordering at all costs
even on SMP. Some sort of specification of the intention, then
a verification that the intention is carried out in a testcase.
Also, if there's a semantic issue to be resolved, I want it to have
something describing it and verifying it. For instance, characterizing
whatever sort of scheduling artifacts queue-swapping causes in the
mainline scheduler and then a testcase to demonstrate the artifact and
its resolution in a given scheduler rewrite would be a good design
statement and verification. For instance, if someone wants to go back
to queue-swapping or other epoch expiry semantics, it would make them
(and hopefully everyone else) conscious of the semantic issue the
change raises, or possibly serve as a demonstration that the artifacts
can be mitigated in some implementation retaining epoch expiry semantics.
As I become aware of more potential issues I'll add more to my laundry
list, and I'll hammer out testcases as I go. My concern with the
scheduler is that this sort of basic functionality may be significantly
disturbed with no one noticing at all until a distro issues a prerelease
and benchmarks go haywire, and furthermore that changes to this kind of
basic behavior may be signs of things going awry, particularly as more
churn happens.
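As one concrete starting point, a testcase along the following lines (a
minimal sketch; the nice levels and run time are arbitrary, and it assumes
the tasks are confined to fewer CPUs than tasks - e.g. pinned with taskset -
so that they actually compete) gives a coarse picture of how CPU bandwidth
is split across nice levels:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/resource.h>
#include <sys/wait.h>

#define NTASKS   3
#define SECONDS 10

/* Busy-loop for a fixed wall-clock window so all children compete for the
 * CPU over the same interval. */
static void burn(int seconds)
{
	time_t end = time(NULL) + seconds;
	volatile unsigned long x = 0;

	while (time(NULL) < end)
		x++;
}

int main(void)
{
	static const int nice_level[NTASKS] = { 0, 5, 10 };
	int i;

	for (i = 0; i < NTASKS; i++) {
		if (fork() == 0) {
			struct rusage ru;

			setpriority(PRIO_PROCESS, 0, nice_level[i]);
			burn(SECONDS);
			getrusage(RUSAGE_SELF, &ru);
			printf("nice %2d: %ld.%02lds of cpu over %ds of wall time\n",
			       nice_level[i], (long)ru.ru_utime.tv_sec,
			       (long)(ru.ru_utime.tv_usec / 10000), SECONDS);
			exit(0);
		}
	}

	/* wait for all children; the per-nice shares can then be compared
	 * across scheduler versions to catch semantic regressions. */
	while (wait(NULL) > 0)
		;
	return 0;
}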
So now that I've clarified my role in all this to date and my point of
view on it, it should be clear that accepting something and working on
some particular scheduler implementation don't make sense as
suggestions to me.
-- wli
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 12:45 ` Willy Tarreau
2007-04-15 13:08 ` Pekka J Enberg
2007-04-15 15:26 ` William Lee Irwin III
@ 2007-04-15 15:39 ` Ingo Molnar
2007-04-15 15:47 ` William Lee Irwin III
2007-04-16 5:27 ` Peter Williams
2 siblings, 2 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-15 15:39 UTC (permalink / raw)
To: Willy Tarreau
Cc: Pekka Enberg, hui Bill Huey, Con Kolivas, ck list,
Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* Willy Tarreau <w@1wt.eu> wrote:
> Ingo could have publicly spoken with them about his ideas of killing
> the O(1) scheduler and replacing it with an rbtree-based one, [...]
yes, that's precisely what i did, via a patchset :)
[ I can even tell you when it all started: i was thinking about Mike's
throttling patches while watching Manchester United beat the crap out
of AS Roma (7 to 1 end result), Tuesday evening. I started coding it
Wednesday morning and sent the patch Friday evening. I very much
believe in low-latency when it comes to development too ;) ]
(if this had been done via a committee then today we'd probably still be
trying to find a suitable timeslot for the initial conference call where
we'd discuss the election of a chair who would be tasked with writing up
an initial document of feature requests, on which we'd take a vote,
possibly this year already, because the matter is really urgent you know
;-)
> [...] and using part of Bill's work to speed up development.
ok, let me make this absolutely clear: i didnt use any bit of plugsched
- in fact the most difficult bits of the modularization was for areas of
sched.c that plugsched never even touched AFAIK. (the load-balancer for
example.)
Plugsched simply does something else: i modularized scheduling policies
in essence that have to cooperate with each other, while plugsched
modularized complete schedulers which are compile-time or boot-time
selected, with no runtime cooperation between them. (one has to be
selected at a time)
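( a tiny, purely illustrative userspace sketch of that distinction - the
  names below are invented and are not the actual sched_class interface:
  the core asks each policy in priority order for the next task, so the
  policies cooperate at runtime instead of one complete scheduler being
  selected at build or boot time: )

#include <stdio.h>
#include <stddef.h>

/* Illustration only: each policy exports a small ops structure and the
 * core walks the class list at pick time, so several policies coexist on
 * one runqueue - unlike a compile-time/boot-time selected scheduler. */
struct task { const char *name; struct task *next; };

struct sched_policy {
	const char *name;
	struct task *queue;                               /* per-policy queue */
	struct task *(*pick_next)(struct sched_policy *); /* policy hook      */
	struct sched_policy *next;                        /* next lower class */
};

static struct task *pick_fifo(struct sched_policy *pol)
{
	struct task *t = pol->queue;

	if (t)
		pol->queue = t->next;
	return t;
}

/* the 'core': ask the highest-priority class first, fall through if empty */
static struct task *core_pick_next(struct sched_policy *top)
{
	struct sched_policy *pol;
	struct task *t;

	for (pol = top; pol; pol = pol->next)
		if ((t = pol->pick_next(pol)))
			return t;
	return NULL;
}

int main(void)
{
	struct task rt_task   = { "rt-task",   NULL };
	struct task fair_task = { "fair-task", NULL };
	struct sched_policy fair = { "fair", &fair_task, pick_fifo, NULL  };
	struct sched_policy rt   = { "rt",   &rt_task,   pick_fifo, &fair };
	struct task *t;

	while ((t = core_pick_next(&rt)))
		printf("running %s (picked via the class list)\n", t->name);
	return 0;
}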
(and i have no trouble at all with crediting Will's work either: a few
years ago i used Will's PID rework concepts for an NPTL related speedup
and Will is very much credited for it in today's kernel/pid.c and he
continued to contribute to it later on.)
(the tree walking bits of sched_fair.c were in fact derived from
kernel/hrtimer.c, the rbtree code written by Thomas and me :-)
Ingo
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 15:39 ` Ingo Molnar
@ 2007-04-15 15:47 ` William Lee Irwin III
2007-04-16 5:27 ` Peter Williams
1 sibling, 0 replies; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-15 15:47 UTC (permalink / raw)
To: Ingo Molnar
Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list,
Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* Willy Tarreau <w@1wt.eu> wrote:
>> [...] and using part of Bill's work to speed up development.
On Sun, Apr 15, 2007 at 05:39:33PM +0200, Ingo Molnar wrote:
> ok, let me make this absolutely clear: i didnt use any bit of plugsched
> - in fact the most difficult bits of the modularization was for areas of
> sched.c that plugsched never even touched AFAIK. (the load-balancer for
> example.)
> Plugsched simply does something else: i modularized scheduling policies
> in essence that have to cooperate with each other, while plugsched
> modularized complete schedulers which are compile-time or boot-time
> selected, with no runtime cooperation between them. (one has to be
> selected at a time)
> (and i have no trouble at all with crediting Will's work either: a few
> years ago i used Will's PID rework concepts for an NPTL related speedup
> and Will is very much credited for it in today's kernel/pid.c and he
> continued to contribute to it later on.)
> (the tree walking bits of sched_fair.c were in fact derived from
> kernel/hrtimer.c, the rbtree code written by Thomas and me :-)
The extant plugsched patches have nothing to do with cfs; I suspect
what everyone else is going on about is terminological confusion. The
4-year-old sample policy with scheduling classes for the original
plugsched is something you had no way of knowing about, as it was never
publicly posted. There isn't really anything all that interesting going
on here, apart from pointing out that it's been done before.
-- wli
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 5:16 ` Bill Huey
2007-04-15 8:44 ` Ingo Molnar
@ 2007-04-15 16:11 ` Bernd Eckenfels
1 sibling, 0 replies; 577+ messages in thread
From: Bernd Eckenfels @ 2007-04-15 16:11 UTC (permalink / raw)
To: linux-kernel
In article <20070415051645.GA28438@gnuppy.monkey.org> you wrote:
> A development process like this is likely to exclude smart people from wanting
> to contribute to Linux and folks should be conscious about these issues.
Nobody is excluded, you can always have a next iteration.
Gruss
Bernd
* Re: Kaffeine problem with CFS
2007-04-14 16:59 ` S.Çağlar Onur
@ 2007-04-15 16:13 ` Ingo Molnar
2007-04-15 16:25 ` Ingo Molnar
0 siblings, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-15 16:13 UTC (permalink / raw)
To: S.Çağlar Onur
Cc: linux-kernel, Michael Lothian, Christophe Thommeret,
Christoph Pfister, Jurgen Kofler
* S.Çağlar Onur <caglar@pardus.org.tr> wrote:
> > hm, could you try to strace it and/or attach gdb to it and figure
> > out what's wrong? (perhaps involving the Kaffeine developers too?)
> > As long as it's not a kernel level crash i cannot see how the
> > scheduler could directly cause this - other than by accident
> > creating a scheduling pattern that triggers a user-space bug more
> > often than with other schedulers.
>
> ...
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = -1 EINTR (Interrupted system call)
> --- SIGINT (Interrupt) @ 0 (0) ---
> +++ killed by SIGINT +++
>
> is where freeze occurs. Full log can be found at [1]
> [1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine
thanks. This does have the appearance of a userspace race condition of
some sorts. Can you trigger this hang with the patch below applied to
the vanilla tree as well? (with no CFS patch applied)
if yes then this would suggest that Kaffeine has some sort of
child-runs-first problem. (which CFS changes to parent-runs-first.
Kaffeine starts a couple of threads and the futex calls are a sign of
thread<->thread communication.)
[ i have also Cc:-ed the Kaffeine folks - maybe your strace gives them
an idea about what the problem could be :) ]
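( purely as an illustration of that class of bug - this is not Kaffeine's
  code - here is the kind of fork-ordering assumption that happens to hold
  when the child runs first and breaks when the parent runs first: )

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

/* Illustration only: the parent assumes that by the time fork() returns to
 * it, the child has already run and set *ready. With child-runs-first
 * scheduling this tends to hold on UP; with parent-runs-first it usually
 * does not, and code built on such an assumption can hang or misbehave. */
int main(void)
{
	volatile int *ready = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
				   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	pid_t pid;

	*ready = 0;
	pid = fork();
	if (pid == 0) {			/* child: announce that we ran */
		*ready = 1;
		_exit(0);
	}

	/* parent: broken "synchronization" that relies purely on run order */
	if (*ready)
		printf("child ran first - the hidden assumption held\n");
	else
		printf("parent ran first - the hidden assumption broke\n");

	waitpid(pid, NULL, 0);
	return 0;
}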
Ingo
---
kernel/sched.c | 70 ++-------------------------------------------------------
1 file changed, 3 insertions(+), 67 deletions(-)
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1653,77 +1653,13 @@ void fastcall sched_fork(struct task_str
*/
void fastcall wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
{
- struct rq *rq, *this_rq;
unsigned long flags;
- int this_cpu, cpu;
+ struct rq *rq;
rq = task_rq_lock(p, &flags);
BUG_ON(p->state != TASK_RUNNING);
- this_cpu = smp_processor_id();
- cpu = task_cpu(p);
-
- /*
- * We decrease the sleep average of forking parents
- * and children as well, to keep max-interactive tasks
- * from forking tasks that are max-interactive. The parent
- * (current) is done further down, under its lock.
- */
- p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
- CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
- p->prio = effective_prio(p);
-
- if (likely(cpu == this_cpu)) {
- if (!(clone_flags & CLONE_VM)) {
- /*
- * The VM isn't cloned, so we're in a good position to
- * do child-runs-first in anticipation of an exec. This
- * usually avoids a lot of COW overhead.
- */
- if (unlikely(!current->array))
- __activate_task(p, rq);
- else {
- p->prio = current->prio;
- p->normal_prio = current->normal_prio;
- list_add_tail(&p->run_list, &current->run_list);
- p->array = current->array;
- p->array->nr_active++;
- inc_nr_running(p, rq);
- }
- set_need_resched();
- } else
- /* Run child last */
- __activate_task(p, rq);
- /*
- * We skip the following code due to cpu == this_cpu
- *
- * task_rq_unlock(rq, &flags);
- * this_rq = task_rq_lock(current, &flags);
- */
- this_rq = rq;
- } else {
- this_rq = cpu_rq(this_cpu);
-
- /*
- * Not the local CPU - must adjust timestamp. This should
- * get optimised away in the !CONFIG_SMP case.
- */
- p->timestamp = (p->timestamp - this_rq->most_recent_timestamp)
- + rq->most_recent_timestamp;
- __activate_task(p, rq);
- if (TASK_PREEMPTS_CURR(p, rq))
- resched_task(rq->curr);
-
- /*
- * Parent and child are on different CPUs, now get the
- * parent runqueue to update the parent's ->sleep_avg:
- */
- task_rq_unlock(rq, &flags);
- this_rq = task_rq_lock(current, &flags);
- }
- current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
- PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
- task_rq_unlock(this_rq, &flags);
+ __activate_task(p, rq);
+ task_rq_unlock(rq, &flags);
}
/*
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 8:36 ` Bill Huey
2007-04-15 8:45 ` Mike Galbraith
2007-04-15 9:06 ` Ingo Molnar
@ 2007-04-15 16:25 ` Arjan van de Ven
2007-04-16 5:36 ` Bill Huey
2 siblings, 1 reply; 577+ messages in thread
From: Arjan van de Ven @ 2007-04-15 16:25 UTC (permalink / raw)
To: Bill Huey
Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list,
Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
Nick Piggin, Thomas Gleixner
> It outlines the problems with Linux kernel development and questionable
> elitism regarding ownership of certain sections of the kernel code.
I have to step in and disagree here....
Linux is not about who writes the code.
Linux is about getting the best solution for a problem. Who wrote which
line of the code is irrelevant in the big picture.
that often means that multiple implementations happen, and that a
darwinistic process decides that the best solution wins.
This darwinistic process often happens in the form of discussion, and
that discussion can happen with words or with code. In this case it
happened with a code proposal.
To make this specific: it has happened many times to me that when I
solved an issue with code, someone else stepped in and wrote a different
solution (although that was usually for smaller pieces). Was I upset
about that? No! I was happy because my *problem got solved* in the best
possible way.
Now this doesn't mean that people shouldn't be nice to each other and
cooperate, or that it's fine to steal credit, but I don't get the impression that that is
happening here. Ingo is taking part in the discussion with a counter
proposal for discussion *on the mailing list*. What more do you want??
If you or anyone else can improve it or do better, take part of this
discussion and show what you mean either in words or in code.
Your qualification of the discussion as an elitist takeover... I disagree
with that. It's a *discussion*. Now if you agree that Ingo's patch is
better technically, you and others should be happy about that because
your problem is getting solved better. If you don't agree that his patch
is better technically, take part in the technical discussion.
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org
* Re: Kaffeine problem with CFS
2007-04-15 16:13 ` Kaffeine problem with CFS Ingo Molnar
@ 2007-04-15 16:25 ` Ingo Molnar
2007-04-15 16:55 ` Christoph Pfister
0 siblings, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-15 16:25 UTC (permalink / raw)
To: S.Çağlar Onur
Cc: linux-kernel, Michael Lothian, Christophe Thommeret,
Christoph Pfister, Jurgen Kofler
* Ingo Molnar <mingo@elte.hu> wrote:
> > [1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine
>
> thanks. This does has the appearance of a userspace race condition of
> some sorts. Can you trigger this hang with the patch below applied to
> the vanilla tree as well? (with no CFS patch applied)
oops, please use the patch below instead.
Ingo
---
kernel/sched.c | 69 ++++-----------------------------------------------------
1 file changed, 5 insertions(+), 64 deletions(-)
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1653,77 +1653,18 @@ void fastcall sched_fork(struct task_str
*/
void fastcall wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
{
- struct rq *rq, *this_rq;
unsigned long flags;
- int this_cpu, cpu;
+ struct rq *rq;
rq = task_rq_lock(p, &flags);
BUG_ON(p->state != TASK_RUNNING);
- this_cpu = smp_processor_id();
- cpu = task_cpu(p);
-
- /*
- * We decrease the sleep average of forking parents
- * and children as well, to keep max-interactive tasks
- * from forking tasks that are max-interactive. The parent
- * (current) is done further down, under its lock.
- */
- p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
- CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
p->prio = effective_prio(p);
+ __activate_task(p, rq);
+ if (TASK_PREEMPTS_CURR(p, rq))
+ resched_task(rq->curr);
- if (likely(cpu == this_cpu)) {
- if (!(clone_flags & CLONE_VM)) {
- /*
- * The VM isn't cloned, so we're in a good position to
- * do child-runs-first in anticipation of an exec. This
- * usually avoids a lot of COW overhead.
- */
- if (unlikely(!current->array))
- __activate_task(p, rq);
- else {
- p->prio = current->prio;
- p->normal_prio = current->normal_prio;
- list_add_tail(&p->run_list, &current->run_list);
- p->array = current->array;
- p->array->nr_active++;
- inc_nr_running(p, rq);
- }
- set_need_resched();
- } else
- /* Run child last */
- __activate_task(p, rq);
- /*
- * We skip the following code due to cpu == this_cpu
- *
- * task_rq_unlock(rq, &flags);
- * this_rq = task_rq_lock(current, &flags);
- */
- this_rq = rq;
- } else {
- this_rq = cpu_rq(this_cpu);
-
- /*
- * Not the local CPU - must adjust timestamp. This should
- * get optimised away in the !CONFIG_SMP case.
- */
- p->timestamp = (p->timestamp - this_rq->most_recent_timestamp)
- + rq->most_recent_timestamp;
- __activate_task(p, rq);
- if (TASK_PREEMPTS_CURR(p, rq))
- resched_task(rq->curr);
-
- /*
- * Parent and child are on different CPUs, now get the
- * parent runqueue to update the parent's ->sleep_avg:
- */
- task_rq_unlock(rq, &flags);
- this_rq = task_rq_lock(current, &flags);
- }
- current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
- PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
- task_rq_unlock(this_rq, &flags);
+ task_rq_unlock(rq, &flags);
}
/*
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 15:16 ` Gene Heskett
@ 2007-04-15 16:43 ` Con Kolivas
2007-04-15 16:58 ` Gene Heskett
0 siblings, 1 reply; 577+ messages in thread
From: Con Kolivas @ 2007-04-15 16:43 UTC (permalink / raw)
To: Gene Heskett
Cc: Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list,
Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Monday 16 April 2007 01:16, Gene Heskett wrote:
> On Sunday 15 April 2007, Pekka Enberg wrote:
> >On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote:
> >> The perception here is that there is this expectation that
> >> sections of the Linux kernel are intentionally "churn squatted" to
> >> prevent any other ideas from creeping in other than those of the owner
> >> of that subsystem
> >
> >Strangely enough, my perception is that Ingo is simply trying to
> >address the issues Mike's testing discovered in RSDL and SD. It's not
> >surprising Ingo made it a separate patch set as Con has repeatedly
> >stated that the "problems" are in fact by design and won't be fixed.
>
> I won't get into the middle of this just yet, not having decided which dog
> I should bet on yet. I've been running 2.6.21-rc6 + Con's 0.40 patch for
> about 24 hours, its been generally usable, but gzip still causes lots of 5
> to 10+ second lags when its running. I'm coming to the conclusion that
> gzip simply doesn't play well with others...
Actually Gene I think you're being bitten here by something I/O bound since
the cpu usage never tops out. If that's the case and gzip is dumping
truckloads of writes then you're suffering something that irks me even more
than the scheduler in linux, and that's how much writes hurt just about
everything else. Try your testcase with bzip2 instead (since that won't be
i/o bound), or drop your dirty ratio to as low as possible which helps a
little bit (5% is the minimum)
echo 5 > /proc/sys/vm/dirty_ratio
and finally try the braindead noop i/o scheduler as well.
echo noop > /sys/block/sda/queue/scheduler
(replace sda with your drive obviously).
I'd wager a big one that's what causes your gzip pain. If it wasn't for the
fact that I've decided to all but give up ever trying to provide code for
mainline again, trying my best to make writes hurt less on linux would be my
next big thing [tm].
Oh and for the others watching, (points to vm hackers) I found a bug when
playing with the dirty ratio code. If you modify it to allow it to drop below 5%
but still stay above the minimum in the vm code, stalls happen somewhere in the vm
where nothing much happens for sometimes 20 or 30 seconds in the worst-case
scenario. I had to drop a patch in 2.6.19 that allowed the dirty ratio to be
set ultra low because these stalls were gross.
> Amazing to me, the cpu its using stays generally below 80%, and often below
> 60%, even while the kmail composer has a full sentence in its buffer that
> it still hasn't shown me when I switch to the htop screen to check, and
> back to the kmail screen to see if its updated yet. The screen switch
> doesn't seem to lag so I don't think renicing x would be helpful. Those
> are the obvious lags, and I'll build & reboot to the CFS patch at some
> point this morning (whats left of it that is :). And report in due time of
> course
--
-ck
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: Kaffeine problem with CFS
2007-04-15 16:25 ` Ingo Molnar
@ 2007-04-15 16:55 ` Christoph Pfister
2007-04-15 22:14 ` S.Çağlar Onur
` (2 more replies)
0 siblings, 3 replies; 577+ messages in thread
From: Christoph Pfister @ 2007-04-15 16:55 UTC (permalink / raw)
To: Ingo Molnar
Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
Christophe Thommeret, Jurgen Kofler
Hi,
2007/4/15, Ingo Molnar <mingo@elte.hu>:
>
> * Ingo Molnar <mingo@elte.hu> wrote:
>
> > > [1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine
Could you try xine-ui or gxine? Because I rather suspect xine-lib for the
freezing issues. In any case I think a gdb backtrace would be much
nicer - but if you can't reproduce the freeze issue with other xine
based players and want to run kaffeine in gdb, you need to execute
"gdb --args kaffeine --nofork".
> > thanks. This does have the appearance of a userspace race condition of
> > some sort. Can you trigger this hang with the patch below applied to
> > the vanilla tree as well? (with no CFS patch applied)
>
> oops, please use the patch below instead.
>
> Ingo
<snip>
Christoph
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 16:43 ` Con Kolivas
@ 2007-04-15 16:58 ` Gene Heskett
2007-04-15 18:00 ` Mike Galbraith
0 siblings, 1 reply; 577+ messages in thread
From: Gene Heskett @ 2007-04-15 16:58 UTC (permalink / raw)
To: Con Kolivas
Cc: Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list,
Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Sunday 15 April 2007, Con Kolivas wrote:
>On Monday 16 April 2007 01:16, Gene Heskett wrote:
>> On Sunday 15 April 2007, Pekka Enberg wrote:
>> >On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote:
>> >> The perception here is that there is this expectation
>> >> that sections of the Linux kernel are intentionally "churn squatted" to
>> >> prevent any other ideas from creeping in other than those of the owner
>> >> of that subsystem
>> >
>> >Strangely enough, my perception is that Ingo is simply trying to
>> >address the issues Mike's testing discovered in RSDL and SD. It's not
>> >surprising Ingo made it a separate patch set as Con has repeatedly
>> >stated that the "problems" are in fact by design and won't be fixed.
>>
>> I won't get into the middle of this just yet, not having decided which dog
>> I should bet on yet. I've been running 2.6.21-rc6 + Con's 0.40 patch for
>> about 24 hours, its been generally usable, but gzip still causes lots of 5
>> to 10+ second lags when its running. I'm coming to the conclusion that
>> gzip simply doesn't play well with others...
>
>Actually Gene I think you're being bitten here by something I/O bound since
>the cpu usage never tops out. If that's the case and gzip is dumping
>truckloads of writes then you're suffering something that irks me even more
>than the scheduler in linux, and that's how much writes hurt just about
>everything else. Try your testcase with bzip2 instead (since that won't be
>i/o bound), or drop your dirty ratio to as low as possible which helps a
>little bit (5% is the minimum)
>
>echo 5 > /proc/sys/vm/dirty_ratio
>
>and finally try the braindead noop i/o scheduler as well.
>
>echo noop > /sys/block/sda/queue/scheduler
>
>(replace sda with your drive obviously).
>
>I'd wager a big one that's what causes your gzip pain. If it wasn't for the
>fact that I've decided to all but give up ever trying to provide code for
>mainline again, trying my best to make writes hurt less on linux would be my
>next big thing [tm].
Chuckle, possibly but then I'm not anything even remotely close to an expert
here Con, just reporting what I get. And I just rebooted to 2.6.21-rc6 +
sched-mike-5.patch for grins and giggles, or frowns and profanity as the case
may call for.
>Oh and for the others watching, (points to vm hackers) I found a bug when
>playing with the dirty ratio code. If you modify it to allow it to drop below
> 5% but still stay above the minimum in the vm code, stalls happen somewhere in
> the vm where nothing much happens for sometimes 20 or 30 seconds in the
> worst-case scenario. I had to drop a patch in 2.6.19 that allowed the dirty ratio to
> be set ultra low because these stalls were gross.
I think I'd need a bit of tutoring on how to do that. I recall that one other
time, several weeks back, I thought I would try one of those famous echo this
>/proc/that ideas that went by on this list, but even though I was root,
apparently /proc was read-only AFAIWC.
>> Amazing to me, the cpu its using stays generally below 80%, and often
>> below 60%, even while the kmail composer has a full sentence in its buffer
>> that it still hasn't shown me when I switch to the htop screen to check,
>> and back to the kmail screen to see if its updated yet. The screen switch
>> doesn't seem to lag so I don't think renicing x would be helpful. Those
>> are the obvious lags, and I'll build & reboot to the CFS patch at some
>> point this morning (whats left of it that is :). And report in due time
>> of course
And now I wonder if I applied the right patch. This one feels good ATM, but I
don't think it's the CFS thingy. No, I'm sure of it now, none of the patches
I've saved say a thing about CFS. Time to backtrack up the list I guess; ignore
me for the nonce.
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Microsoft: Re-inventing square wheels
-- From a Slashdot.org post
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 13:08 ` Pekka J Enberg
@ 2007-04-15 17:32 ` Mike Galbraith
2007-04-15 17:59 ` Linus Torvalds
0 siblings, 1 reply; 577+ messages in thread
From: Mike Galbraith @ 2007-04-15 17:32 UTC (permalink / raw)
To: Pekka J Enberg
Cc: Willy Tarreau, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list,
Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
Nick Piggin, Arjan van de Ven, Thomas Gleixner
On Sun, 2007-04-15 at 16:08 +0300, Pekka J Enberg wrote:
> On Sun, 15 Apr 2007, Willy Tarreau wrote:
> > Ingo could have publicly spoken with them about his ideas of killing
> > the O(1) scheduler and replacing it with an rbtree-based one, and using
> > part of Bill's work to speed up development.
>
> He did exactly that and he did it with a patch. Nothing new here. This is
> how development on LKML proceeds when you have two or more competing
> designs. There's absolutely no need to get upset or hurt your feelings
> over it. It's not malicious, it's how we do Linux development.
Yes. Exactly. This is what it's all about, this is what makes it work.
-Mike
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 18:18 ` Willy Tarreau
2007-04-14 18:40 ` Eric W. Biederman
@ 2007-04-15 17:55 ` Ingo Molnar
2007-04-15 18:06 ` Willy Tarreau
1 sibling, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-15 17:55 UTC (permalink / raw)
To: Willy Tarreau
Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox
* Willy Tarreau <w@1wt.eu> wrote:
> Well, since I merged the fair-fork patch, I cannot reproduce (in fact,
> bash forks 1000 processes, then progressively execs scheddos, but it
> takes some time). So I'm rebuilding right now. But I think that Linus
> has an interesting clue about GPM and notification before switching
> the terminal. I think it was enabled in console mode. I don't know how
> that translates to frozen xterms, but let's attack the problems one at
> a time.
to debug this, could you try to apply this add-on as well:
http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch
with this patch applied you should have a /proc/sched_debug file that
prints all runnable tasks and other interesting info from the runqueue.
[ i've refreshed all the patches on the CFS webpage, so if this doesnt
apply cleanly to your current tree then you'll probably have to
refresh one of the patches.]
The output should look like this:
Sched Debug Version: v0.01
now at 226761724575 nsecs
cpu: 0
.nr_running : 3
.raw_weighted_load : 384
.nr_switches : 13666
.nr_uninterruptible : 0
.next_balance : 4294947416
.curr->pid : 2179
.rq_clock : 241337421233
.fair_clock : 7503791206
.wait_runtime : 2269918379
runnable tasks:
task | PID | tree-key | -delta | waiting | switches
-----------------------------------------------------------------
+ cat 2179 7501930066 -1861140 1861140 2
loop_silent 2149 7503010354 -780852 0 911
loop_silent 2148 7503510048 -281158 280753 918
now for your workload the list should be considerably larger. If there's
starvation going on then the 'switches' field (number of context
switches) of one of the tasks would never increase while you have this
'cannot switch consoles' problem.
maybe you'll have to unapply the fair-fork patch to make it trigger
again. (fair-fork does not fix anything, so it probably just hides a
real bug.)
(i'm meanwhile busy running your scheddos utilities to reproduce it
locally as well :)
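A minimal C sketch of how the 'switches' check described above could be
automated. It is not part of the sched-fair-print patch, and the task-line
column layout (task, PID, tree-key, -delta, waiting, switches) is only
assumed from the sample output above: take two snapshots of
/proc/sched_debug a few seconds apart and flag any task whose context
switch count did not move.

/*
 * starvewatch.c: snapshot /proc/sched_debug twice and report tasks whose
 * 'switches' count did not change between the snapshots. The task-line
 * format is assumed to match the sample output quoted above.
 */
#include <stdio.h>
#include <unistd.h>

#define MAX_TASKS 4096

struct snap {
	int pid;
	long long switches;
};

static int read_snapshot(struct snap *s, int max)
{
	FILE *f = fopen("/proc/sched_debug", "r");
	char line[256], comm[64];
	int n = 0, pid;
	long long key, delta, waiting, switches;

	if (!f)
		return -1;
	while (n < max && fgets(line, sizeof(line), f)) {
		char *p = line;

		/* skip leading spaces and the '+' marking the current task */
		while (*p == ' ' || *p == '\t' || *p == '+')
			p++;
		/* task lines: name, PID, tree-key, -delta, waiting, switches */
		if (sscanf(p, "%63s %d %lld %lld %lld %lld",
			   comm, &pid, &key, &delta, &waiting, &switches) == 6) {
			s[n].pid = pid;
			s[n].switches = switches;
			n++;
		}
	}
	fclose(f);
	return n;
}

int main(void)
{
	static struct snap a[MAX_TASKS], b[MAX_TASKS];
	int na, nb, i, j;

	na = read_snapshot(a, MAX_TASKS);
	sleep(5);
	nb = read_snapshot(b, MAX_TASKS);
	if (na < 0 || nb < 0) {
		perror("/proc/sched_debug");
		return 1;
	}
	/* a task runnable in both snapshots with an unchanged switch count
	   is a starvation suspect */
	for (i = 0; i < na; i++)
		for (j = 0; j < nb; j++)
			if (a[i].pid == b[j].pid &&
			    a[i].switches == b[j].switches)
				printf("suspect: pid %d stuck at %lld switches\n",
				       a[i].pid, a[i].switches);
	return 0;
}

Run it while the 'cannot switch consoles' problem is being reproduced; any
'suspect' line points at a task that never got the CPU during the interval.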
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 17:32 ` Mike Galbraith
@ 2007-04-15 17:59 ` Linus Torvalds
2007-04-15 19:00 ` Jonathan Lundell
0 siblings, 1 reply; 577+ messages in thread
From: Linus Torvalds @ 2007-04-15 17:59 UTC (permalink / raw)
To: Mike Galbraith
Cc: Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar,
Con Kolivas, ck list, Peter Williams, linux-kernel,
Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner
On Sun, 15 Apr 2007, Mike Galbraith wrote:
> On Sun, 2007-04-15 at 16:08 +0300, Pekka J Enberg wrote:
> >
> > He did exactly that and he did it with a patch. Nothing new here. This is
> > how development on LKML proceeds when you have two or more competing
> > designs. There's absolutely no need to get upset or hurt your feelings
> > over it. It's not malicious, it's how we do Linux development.
>
> Yes. Exactly. This is what it's all about, this is what makes it work.
I obviously agree, but I will also add that one of the most motivating
things there *is* in open source is "personal pride".
It's a really good thing, and it means that if somebody shows that your
code is flawed in some way (by, for example, making a patch that people
claim gets better behaviour or numbers), any *good* programmer that
actually cares about his code will obviously suddenly be very motivated to
out-do the out-doer!
Does this mean that there will be tension and rivalry? Hell yes. But
that's kind of the point. Life is a game, and if you aren't in it to win,
what the heck are you still doing here?
As long as it's reasonably civil (I'm not personally a huge believer in
being too polite or "politically correct", so I think the "reasonably" is
more important than the "civil" part!), and as long as the end result is
judged on TECHNICAL MERIT, it's all good.
We don't want to play politics. But encouraging people's competitive
feelings? Oh, yes.
Linus
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 16:58 ` Gene Heskett
@ 2007-04-15 18:00 ` Mike Galbraith
2007-04-16 0:18 ` Gene Heskett
0 siblings, 1 reply; 577+ messages in thread
From: Mike Galbraith @ 2007-04-15 18:00 UTC (permalink / raw)
To: Gene Heskett
Cc: Con Kolivas, Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list,
Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
Nick Piggin, Arjan van de Ven, Thomas Gleixner
On Sun, 2007-04-15 at 12:58 -0400, Gene Heskett wrote:
> Chuckle, possibly but then I'm not anything even remotely close to an expert
> here Con, just reporting what I get. And I just rebooted to 2.6.21-rc6 +
> sched-mike-5.patch for grins and giggles, or frowns and profanity as the case
> may call for.
Erm, that patch is embarrassingly buggy, so profanity should dominate.
-Mike
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 17:55 ` Ingo Molnar
@ 2007-04-15 18:06 ` Willy Tarreau
2007-04-15 19:20 ` Ingo Molnar
0 siblings, 1 reply; 577+ messages in thread
From: Willy Tarreau @ 2007-04-15 18:06 UTC (permalink / raw)
To: Ingo Molnar
Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox
Hi Ingo,
On Sun, Apr 15, 2007 at 07:55:55PM +0200, Ingo Molnar wrote:
>
> * Willy Tarreau <w@1wt.eu> wrote:
>
> > Well, since I merged the fair-fork patch, I cannot reproduce (in fact,
> > bash forks 1000 processes, then progressively execs scheddos, but it
> > takes some time). So I'm rebuilding right now. But I think that Linus
> > has an interesting clue about GPM and notification before switching
> > the terminal. I think it was enabled in console mode. I don't know how
> > that translates to frozen xterms, but let's attack the problems one at
> > a time.
>
> to debug this, could you try to apply this add-on as well:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch
>
> with this patch applied you should have a /proc/sched_debug file that
> prints all runnable tasks and other interesting info from the runqueue.
I don't know if you have seen my mail from yesterday evening (here). I
found that changing keventd prio fixed the problem. You may be interested
in the description. I sent it at 21:01 (+200).
> [ i've refreshed all the patches on the CFS webpage, so if this doesnt
> apply cleanly to your current tree then you'll probably have to
> refresh one of the patches.]
Fine, I'll have a look. I already had to rediff the sched-fair-fork
patch last time.
> The output should look like this:
>
> Sched Debug Version: v0.01
> now at 226761724575 nsecs
>
> cpu: 0
> .nr_running : 3
> .raw_weighted_load : 384
> .nr_switches : 13666
> .nr_uninterruptible : 0
> .next_balance : 4294947416
> .curr->pid : 2179
> .rq_clock : 241337421233
> .fair_clock : 7503791206
> .wait_runtime : 2269918379
>
> runnable tasks:
> task | PID | tree-key | -delta | waiting | switches
> -----------------------------------------------------------------
> + cat 2179 7501930066 -1861140 1861140 2
> loop_silent 2149 7503010354 -780852 0 911
> loop_silent 2148 7503510048 -281158 280753 918
Nice.
> now for your workload the list should be considerably larger. If there's
> starvation going on then the 'switches' field (number of context
> switches) of one of the tasks would never increase while you have this
> 'cannot switch consoles' problem.
>
> maybe you'll have to unapply the fair-fork patch to make it trigger
> again. (fair-fork does not fix anything, so it probably just hides a
> real bug.)
>
> (i'm meanwhile busy running your scheddos utilities to reproduce it
> locally as well :)
I discovered I had the frame-buffer enabled (I did not notice it at first
because I do not have the logo and the resolution is the same as in text mode).
It's matroxfb with a G400, if that helps. It may be that
it needs some CPU time that it cannot get to clear the display before
switching, I don't know.
However I won't try this right now, I'm deep in userland at the moment.
Regards,
Willy
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 17:59 ` Linus Torvalds
@ 2007-04-15 19:00 ` Jonathan Lundell
2007-04-15 22:52 ` Con Kolivas
0 siblings, 1 reply; 577+ messages in thread
From: Jonathan Lundell @ 2007-04-15 19:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mike Galbraith, Pekka J Enberg, Willy Tarreau, hui Bill Huey,
Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel,
Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner
On Apr 15, 2007, at 10:59 AM, Linus Torvalds wrote:
> It's a really good thing, and it means that if somebody shows that
> your
> code is flawed in some way (by, for example, making a patch that
> people
> claim gets better behaviour or numbers), any *good* programmer that
> actually cares about his code will obviously suddenly be very
> motivated to
> out-do the out-doer!
"No one who cannot rejoice in the discovery of his own mistakes
deserves to be called a scholar."
--Don Foster, "literary sleuth", on retracting his attribution of "A
Funerall Elegye" to Shakespeare (it's more likely John Ford's work).
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 18:06 ` Willy Tarreau
@ 2007-04-15 19:20 ` Ingo Molnar
2007-04-15 19:35 ` William Lee Irwin III
2007-04-15 19:37 ` Ingo Molnar
0 siblings, 2 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-15 19:20 UTC (permalink / raw)
To: Willy Tarreau
Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox
* Willy Tarreau <w@1wt.eu> wrote:
> > to debug this, could you try to apply this add-on as well:
> >
> > http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch
> >
> > with this patch applied you should have a /proc/sched_debug file
> > that prints all runnable tasks and other interesting info from the
> > runqueue.
>
> I don't know if you have seen my mail from yesterday evening (here). I
> found that changing keventd prio fixed the problem. You may be
> interested in the description. I sent it at 21:01 (+200).
ah, indeed i missed that mail - the response to the patches was quite
overwhelming (and i naively thought people dont do Linux hacking over
the weekends anymore ;).
so Linus was right: this was caused by scheduler starvation. I can see
one immediate problem already: the 'nice offset' is not divided by
nr_running as it should. The patch below should fix this but i have yet
to test it accurately, this change might as well render nice levels
unacceptably ineffective under high loads.
Ingo
--------->
---
kernel/sched_fair.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -31,7 +31,9 @@ static void __enqueue_task_fair(struct r
int leftmost = 1;
long long key;
- key = rq->fair_clock - p->wait_runtime + p->nice_offset;
+ key = rq->fair_clock - p->wait_runtime;
+ if (unlikely(p->nice_offset))
+ key += p->nice_offset / rq->nr_running;
p->fair_key = key;
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 19:20 ` Ingo Molnar
@ 2007-04-15 19:35 ` William Lee Irwin III
2007-04-15 19:57 ` Ingo Molnar
2007-04-15 19:37 ` Ingo Molnar
1 sibling, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-15 19:35 UTC (permalink / raw)
To: Ingo Molnar
Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel,
Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox
On Sun, Apr 15, 2007 at 09:20:46PM +0200, Ingo Molnar wrote:
> so Linus was right: this was caused by scheduler starvation. I can see
> one immediate problem already: the 'nice offset' is not divided by
> nr_running as it should. The patch below should fix this but i have yet
> to test it accurately, this change might as well render nice levels
> unacceptably ineffective under high loads.
I've been suggesting testing CPU bandwidth allocation as influenced by
nice numbers for a while now for a reason.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 19:20 ` Ingo Molnar
2007-04-15 19:35 ` William Lee Irwin III
@ 2007-04-15 19:37 ` Ingo Molnar
1 sibling, 0 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-15 19:37 UTC (permalink / raw)
To: Willy Tarreau
Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox
* Ingo Molnar <mingo@elte.hu> wrote:
> so Linus was right: this was caused by scheduler starvation. I can see
> one immediate problem already: the 'nice offset' is not divided by
> nr_running as it should. The patch below should fix this but i have
> yet to test it accurately, this change might as well render nice
> levels unacceptably ineffective under high loads.
erm, rather the updated patch below if you want to use this on a 32-bit
system. But ... i think you should wait until i have all this re-tested.
Ingo
---
include/linux/sched.h | 2 +-
kernel/sched_fair.c | 4 +++-
2 files changed, 4 insertions(+), 2 deletions(-)
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -839,7 +839,7 @@ struct task_struct {
s64 wait_runtime;
u64 exec_runtime, fair_key;
- s64 nice_offset, hog_limit;
+ s32 nice_offset, hog_limit;
unsigned long policy;
cpumask_t cpus_allowed;
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -31,7 +31,9 @@ static void __enqueue_task_fair(struct r
int leftmost = 1;
long long key;
- key = rq->fair_clock - p->wait_runtime + p->nice_offset;
+ key = rq->fair_clock - p->wait_runtime;
+ if (unlikely(p->nice_offset))
+ key += p->nice_offset / (rq->nr_running + 1);
p->fair_key = key;
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 19:35 ` William Lee Irwin III
@ 2007-04-15 19:57 ` Ingo Molnar
2007-04-15 23:54 ` William Lee Irwin III
0 siblings, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-15 19:57 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel,
Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox
* William Lee Irwin III <wli@holomorphy.com> wrote:
> On Sun, Apr 15, 2007 at 09:20:46PM +0200, Ingo Molnar wrote:
> > so Linus was right: this was caused by scheduler starvation. I can
> > see one immediate problem already: the 'nice offset' is not divided
> > by nr_running as it should. The patch below should fix this but i
> > have yet to test it accurately, this change might as well render
> > nice levels unacceptably ineffective under high loads.
>
> I've been suggesting testing CPU bandwidth allocation as influenced by
> nice numbers for a while now for a reason.
Oh I was very much testing "CPU bandwidth allocation as influenced by
nice numbers" - it's one of the basic things i do when modifying the
scheduler. An automated tool, while nice (all automation is nice)
wouldnt necessarily show such bugs though, because here too it needed
thousands of running tasks to trigger in practice. Any volunteers? ;)
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 15:05 ` Ingo Molnar
@ 2007-04-15 20:05 ` Matt Mackall
2007-04-15 20:48 ` Ingo Molnar
2007-04-16 5:16 ` Con Kolivas
1 sibling, 1 reply; 577+ messages in thread
From: Matt Mackall @ 2007-04-15 20:05 UTC (permalink / raw)
To: Ingo Molnar
Cc: Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds,
Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner
On Sun, Apr 15, 2007 at 05:05:36PM +0200, Ingo Molnar wrote:
> so the rejection was on these grounds, and i still very much stand by
> that position here and today: i didnt want to see the Linux scheduler
> landscape balkanized and i saw no technological reasons for the
> complication that external modularization brings.
But "balkanization" is a good thing. "Monoculture" is a bad thing.
Look at what happened with I/O scheduling. Opening things up to some
new ideas by making it possible to select your I/O scheduler took us
from 10 years of stagnation to healthy, competitive development, which
gave us a substantially better I/O scheduler.
Look at what's happening right now with TCP congestion algorithms.
We've had decades of tweaking Reno slightly now turned into a vibrant
research area with lots of radical alternatives. A winner will
eventually emerge and it will probably look quite a bit different than
Reno.
Similar things have gone on since the beginning with filesystems on
Linux. Being able to easily compare filesystems head to head has been
immensely valuable in improving our 'core' Linux filesystems.
And what we've had up to now is a scheduler monoculture. Until Andrew
put RSDL in -mm, if people wanted to experiment with other schedulers,
they had to go well off the beaten path to do it. So all the people
who've been hopelessly frustrated with the mainline scheduler go off to
the -ck ghetto, or worse, stick with 2.4.
Whether your motivations have been protectionist or merely
shortsighted, you've stomped pretty heavily on alternative scheduler
development by completely rejecting the whole plugsched concept. If
we'd opened up mainline to a variety of schedulers _3 years ago_, we'd
probably have gotten to where we are today much sooner.
Hopefully, the next time Rik suggests pluggable page replacement
algorithms, folks will actually seriously consider it.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 20:05 ` Matt Mackall
@ 2007-04-15 20:48 ` Ingo Molnar
2007-04-15 21:31 ` Matt Mackall
2007-04-15 23:39 ` William Lee Irwin III
0 siblings, 2 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-15 20:48 UTC (permalink / raw)
To: Matt Mackall
Cc: Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds,
Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner
* Matt Mackall <mpm@selenic.com> wrote:
> Look at what happened with I/O scheduling. Opening things up to some
> new ideas by making it possible to select your I/O scheduler took us
> from 10 years of stagnation to healthy, competitive development, which
> gave us a substantially better I/O scheduler.
actually, 2-3 years ago we already had IO schedulers, and my opinion
against plugsched back then (also shared by Nick and Linus) was very
much considering them. There are at least 4 reasons why I/O schedulers
are different from CPU schedulers:
1) CPUs are a non-persistent resource shared by _all_ tasks and
workloads in the system. Disks are _persistent_ resources very much
attached to specific workloads. (If tasks had to be 'persistent' to
the CPU they were started on we'd have much different scheduling
technology, and there would be much less complexity.) More analogous
to CPU schedulers would perhaps be VM/MM schedulers, and those tend
to be hard to modularize in a technologically sane way too. (and
unlike disks there's no good generic way to attach VM/MM schedulers
to particular workloads.) So it's apples to oranges.
in practice it comes down to having one good scheduler that runs all
workloads on a system reasonably well. And given that a very large
portion of system runs mixed workloads, the demand for one good
scheduler is pretty high. While i can run with mixed IO schedulers
just fine.
2) plugsched did not allow on the fly selection of schedulers, nor did
it allow a per CPU selection of schedulers. IO schedulers you can
change per disk, on the fly, making them much more useful in
practice. Also, IO schedulers (while definitely not being slow!) are
alot less performance sensitive than CPU schedulers.
3) I/O schedulers are pretty damn clean code, and plugsched, at least
the last version i saw of it, didnt come even close.
4) the good thing that happened to I/O, after years of stagnation isnt
I/O schedulers. The good thing that happened to I/O is called Jens
Axboe. If you care about the I/O subystem then print that name out
and hang it on the wall. That and only that is what mattered.
all in one, while there are definitely uses (embedded would like to have
a smaller/different scheduler, etc.), the technical case for
modularization for the sake of selectability is alot lower for CPU
schedulers than it is for I/O schedulers.
nor was the non-modularity of some piece of code ever an impediment to
competition. May i remind you of the pretty competitive SLAB allocator
landscape, resulting in things like the SLOB allocator, written by
yourself? ;-)
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 20:48 ` Ingo Molnar
@ 2007-04-15 21:31 ` Matt Mackall
2007-04-16 3:03 ` Nick Piggin
2007-04-16 15:45 ` William Lee Irwin III
2007-04-15 23:39 ` William Lee Irwin III
1 sibling, 2 replies; 577+ messages in thread
From: Matt Mackall @ 2007-04-15 21:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds,
Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner
On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote:
>
> * Matt Mackall <mpm@selenic.com> wrote:
>
> > Look at what happened with I/O scheduling. Opening things up to some
> > new ideas by making it possible to select your I/O scheduler took us
> > from 10 years of stagnation to healthy, competitive development, which
> > gave us a substantially better I/O scheduler.
>
> actually, 2-3 years ago we already had IO schedulers, and my opinion
> against plugsched back then (also shared by Nick and Linus) was very
> much considering them. There are at least 4 reasons why I/O schedulers
> are different from CPU schedulers:
...
> 3) I/O schedulers are pretty damn clean code, and plugsched, at least
> the last version i saw of it, didnt come even close.
That's irrelevant. Plugsched was an attempt to get alternative
schedulers exposure in mainline. I know, because I remember
encouraging Bill to pursue it. Not only did you veto plugsched (which
may have been a perfectly reasonable thing to do), but you also vetoed
the whole concept of multiple schedulers in the tree too. "We don't
want to balkanize the scheduling landscape".
And that latter part is what I'm claiming has set us back for years.
It's not a technical argument but a strategic one. And it's just not a
good strategy.
> 4) the good thing that happened to I/O, after years of stagnation isnt
> I/O schedulers. The good thing that happened to I/O is called Jens
> Axboe. If you care about the I/O subystem then print that name out
> and hang it on the wall. That and only that is what mattered.
Disagree. Things didn't actually get interesting until Nick showed up
with AS and got it in-tree to demonstrate the huge amount of room we
had for improvement. It took several iterations of AS and CFQ (with a
couple complete rewrites) before CFQ began to look like the winner.
The resulting time-sliced CFQ was fairly heavily influenced by the
ideas in AS.
Similarly, things in scheduler land had been pretty damn boring until
Con finally got Andrew to take one of his schedulers for a spin.
> nor was the non-modularity of some piece of code ever an impediment to
> competition. May i remind you of the pretty competitive SLAB allocator
> landscape, resulting in things like the SLOB allocator, written by
> yourself? ;-)
Thankfully no one came out and said "we don't want to balkanize the
allocator landscape" when I submitted it or I probably would have just
dropped it, rather than painfully dragging it along out of tree for
years. I'm not nearly the glutton for punishment that Con is. :-P
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: Kaffeine problem with CFS
2007-04-15 16:55 ` Christoph Pfister
@ 2007-04-15 22:14 ` S.Çağlar Onur
2007-04-18 8:27 ` Ingo Molnar
[not found] ` <19a3b7a80704180534w3688af87x78ee68cc1c330a5c@mail.gmail.com>
2 siblings, 0 replies; 577+ messages in thread
From: S.Çağlar Onur @ 2007-04-15 22:14 UTC (permalink / raw)
To: Christoph Pfister
Cc: Ingo Molnar, linux-kernel, Michael Lothian, Christophe Thommeret,
Jurgen Kofler
[-- Attachment #1: Type: text/plain, Size: 1063 bytes --]
15 Nis 2007 Paz tarihinde, Christoph Pfister şunları yazmıştı:
> Could you try xine-ui or gxine? Because I rather suspect xine-lib for the
> freezing issues. In any case I think a gdb backtrace would be much
> nicer - but if you can't reproduce the freeze issue with other xine
> based players and want to run kaffeine in gdb, you need to execute
> "gdb --args kaffeine --nofork".
I just tested xine-ui and I can easily reproduce the exact same problem with
it as well, so you are right: it seems to be a xine-lib problem triggered by CFS
changes.
> > > thanks. This does have the appearance of a userspace race condition of
> > > some sort. Can you trigger this hang with the patch below applied to
> > > the vanilla tree as well? (with no CFS patch applied)
> >
> > oops, please use the patch below instead.
Tomorrow I'll test that patch and also try to get a backtrace.
Cheers
--
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/
Linux is like living in a teepee. No Windows, no Gates and an Apache in house!
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
` (10 preceding siblings ...)
2007-04-15 12:29 ` Esben Nielsen
@ 2007-04-15 22:49 ` Ismail Dönmez
2007-04-15 23:23 ` Arjan van de Ven
2007-04-16 11:58 ` Ingo Molnar
2007-04-16 22:00 ` Andi Kleen
` (2 subsequent siblings)
14 siblings, 2 replies; 577+ messages in thread
From: Ismail Dönmez @ 2007-04-15 22:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
[-- Attachment #1: Type: text/plain, Size: 573 bytes --]
Hi,
On Friday 13 April 2007 23:21:00 Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
> [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
Tested this on top of Linus' GIT tree, but the system gets very unresponsive
during high disk I/O using ext3 as the filesystem; even writing a 300MB file
to a USB disk (an iPod, actually) has the same effect.
Regards,
ismail
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 19:00 ` Jonathan Lundell
@ 2007-04-15 22:52 ` Con Kolivas
2007-04-16 2:28 ` Nick Piggin
0 siblings, 1 reply; 577+ messages in thread
From: Con Kolivas @ 2007-04-15 22:52 UTC (permalink / raw)
To: Jonathan Lundell
Cc: Linus Torvalds, Mike Galbraith, Pekka J Enberg, Willy Tarreau,
hui Bill Huey, Ingo Molnar, ck list, Peter Williams,
linux-kernel, Andrew Morton, Nick Piggin, Arjan van de Ven,
Thomas Gleixner
On Monday 16 April 2007 05:00, Jonathan Lundell wrote:
> On Apr 15, 2007, at 10:59 AM, Linus Torvalds wrote:
> > It's a really good thing, and it means that if somebody shows that
> > your
> > code is flawed in some way (by, for example, making a patch that
> > people
> > claim gets better behaviour or numbers), any *good* programmer that
> > actually cares about his code will obviously suddenly be very
> > motivated to
> > out-do the out-doer!
>
> "No one who cannot rejoice in the discovery of his own mistakes
> deserves to be called a scholar."
Lovely comment. I realise this is not truly directed at me but, given the
context in which it has been said, people will assume it is directed my way, so while
we're all spinning lkml quality rhetoric, let me have a right of reply.
One thing I have never tried to do was to ignore bug reports. I'm forever
joking that I keep pulling code out of my arse to improve what I've done.
RSDL/SD was no exception; heck it had 40 iterations. The reason I could not
reply to bug report A with "Oh that is problem B so I'll fix it with code C"
was, as I've said many many times over, health related. I did indeed try to
fix many of them without spending hours replying to sometimes unpleasant
emails. If health wasn't an issue there might have been 1000 iterations of
SD.
There was only ever _one_ thing that I was absolutely steadfast on as a
concept that I refused to fix that people might claim was "a mistake I did
not rejoice in to be a scholar". That was that the _correct_ behaviour for a
scheduler is to be fair such that proportional slowdown with load is (using
that awful pun) a feature, not a bug. Now there are people who will still
disagree violently with me on that. SD attempted to be a fairness first
virtual-deadline design. If I failed on that front, then so be it (and at
least one person certainly has said in lovely warm fuzzy friendly
communication that I'm a global failure on all fronts with SD). But let me
point out now that Ingo's shiny new scheduler is a fairness-first
virtual-deadline design which will have proportional slowdown with load. So
it will have a very similar feature. I dare anyone to claim that proportional
slowdown with load is a bug, because I will no longer feel like I'm standing
alone with a BFG9000 trying to defend my standpoint. Others can take up the
post at last.
--
-ck
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 22:38 ` Davide Libenzi
2007-04-14 23:26 ` Davide Libenzi
2007-04-15 4:01 ` William Lee Irwin III
@ 2007-04-15 23:09 ` Pavel Pisa
2007-04-16 5:47 ` Davide Libenzi
2 siblings, 1 reply; 577+ messages in thread
From: Pavel Pisa @ 2007-04-15 23:09 UTC (permalink / raw)
To: Davide Libenzi
Cc: William Lee Irwin III, Ingo Molnar, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Sunday 15 April 2007 00:38, Davide Libenzi wrote:
> Haven't looked at the scheduler code yet, but for a similar problem I use
> a time ring. The ring has Ns (a power of 2 is better) slots (where tasks are
> queued - in my case they were some sort of timers), and it has a current
> base index (Ib), a current base time (Tb) and a time granularity (Tg). It
> also has a bitmap with bits telling you which slots contain queued tasks.
> An item (task) that has to be scheduled at time T, will be queued in the
> slot:
>
> S = Ib + min((T - Tb) / Tg, Ns - 1);
>
> Items with T further out than Ns*Tg will be scheduled in the last relative slot
> (choosing a proper Ns and Tg can minimize this).
> Queueing is O(1) and de-queueing is O(Ns). You can play with Ns and Tg to
> suit your needs.
> This is a simple bench between time-ring (TR) and CFS queueing:
>
> http://www.xmailserver.org/smart-queue.c
>
> In my box (Dual Opteron 252):
>
> davide@alien:~$ ./smart-queue -n 8
> CFS = 142.21 cycles/loop
> TR = 72.33 cycles/loop
> davide@alien:~$ ./smart-queue -n 16
> CFS = 188.74 cycles/loop
> TR = 83.79 cycles/loop
> davide@alien:~$ ./smart-queue -n 32
> CFS = 221.36 cycles/loop
> TR = 75.93 cycles/loop
> davide@alien:~$ ./smart-queue -n 64
> CFS = 242.89 cycles/loop
> TR = 81.29 cycles/loop
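For illustration, a minimal C sketch of the time ring described in the quoted
text above. This is not Davide's smart-queue.c; the names, the singly linked
slot lists and the omission of base-index advancement are all simplifications.

/* Ns slots (a power of two), a base index Ib, a base time Tb, a
 * granularity Tg, plus a bitmap of non-empty slots. */
#include <stdint.h>
#include <stddef.h>

#define TR_SLOTS 256			/* Ns */

struct tr_item {
	struct tr_item *next;
	uint64_t time;			/* T: time at which the item is due */
};

struct time_ring {
	struct tr_item *slot[TR_SLOTS];
	uint64_t bitmap[TR_SLOTS / 64];	/* which slots hold queued items */
	unsigned int base_idx;		/* Ib */
	uint64_t base_time;		/* Tb */
	uint64_t granularity;		/* Tg */
};

/* S = Ib + min((T - Tb) / Tg, Ns - 1), taken modulo the ring size: O(1) */
static void tr_enqueue(struct time_ring *tr, struct tr_item *it)
{
	uint64_t off = (it->time - tr->base_time) / tr->granularity;
	unsigned int s;

	if (off > TR_SLOTS - 1)
		off = TR_SLOTS - 1;	/* far-future items share the last slot */
	s = (tr->base_idx + off) & (TR_SLOTS - 1);

	it->next = tr->slot[s];
	tr->slot[s] = it;
	tr->bitmap[s / 64] |= 1ULL << (s % 64);
}

/* de-queue scans at most Ns slots starting at the base index, so O(Ns);
 * advancing base_idx/base_time as the ring drains is left out for brevity */
static struct tr_item *tr_dequeue(struct time_ring *tr)
{
	unsigned int i, s;

	for (i = 0; i < TR_SLOTS; i++) {
		s = (tr->base_idx + i) & (TR_SLOTS - 1);
		if (!(tr->bitmap[s / 64] & (1ULL << (s % 64))))
			continue;
		struct tr_item *it = tr->slot[s];

		tr->slot[s] = it->next;
		if (!tr->slot[s])
			tr->bitmap[s / 64] &= ~(1ULL << (s % 64));
		return it;
	}
	return NULL;
}

A real implementation would also use the bitmap with a find-first-set scan to
locate the next non-empty slot quickly instead of walking slot by slot.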
Hello all,
I cannot resist reporting results with the GAVL
tree algorithm here as another competitor in the race.
I believe that it is a better solution for large priority
queues than an RB-tree and even heap trees. On the other hand,
it is disputable whether the scheduler needs such scalability.
The AVL heritage guarantees a lower height,
which results in shorter search times that could
be profitable for other uses in the kernel.
The GAVL algorithm is AVL-tree based, so it does not run into
the priority-granularity limits that TR does. It also allows
use in the generalized case where the tree is not fully balanced,
which makes it possible to cut off the first item without rebalancing.
This degrades the tree by at most one more level
(compared to a non-degraded AVL tree), which is still
considerably better than an RB-tree's maximum.
http://cmp.felk.cvut.cz/~pisa/linux/smart-queue-v-gavl.c
The description behind the code is here:
http://cmp.felk.cvut.cz/~pisa/ulan/gavl.pdf
The code is part of the much broader uLUt library:
http://cmp.felk.cvut.cz/~pisa/ulan/ulut.pdf
http://sourceforge.net/project/showfiles.php?group_id=118937&package_id=130840
I have included all the required GAVL code directly in smart-queue-v-gavl.c
to make testing easy.
The tests were run on my somewhat dated computer, a 600 MHz Duron.
Each test is run twice to suppress run-order influence.
./smart-queue-v-gavl -n 1 -l 2000000
gavl_cfs = 55.66 cycles/loop
CFS = 88.33 cycles/loop
TR = 141.78 cycles/loop
CFS = 90.45 cycles/loop
gavl_cfs = 55.38 cycles/loop
./smart-queue-v-gavl -n 2 -l 2000000
gavl_cfs = 82.85 cycles/loop
CFS = 104.18 cycles/loop
TR = 145.21 cycles/loop
CFS = 102.74 cycles/loop
gavl_cfs = 82.05 cycles/loop
./smart-queue-v-gavl -n 4 -l 2000000
gavl_cfs = 137.45 cycles/loop
CFS = 156.47 cycles/loop
TR = 142.00 cycles/loop
CFS = 152.65 cycles/loop
gavl_cfs = 139.38 cycles/loop
./smart-queue-v-gavl -n 10 -l 2000000
gavl_cfs = 229.22 cycles/loop (WORSE)
CFS = 206.26 cycles/loop
TR = 140.81 cycles/loop
CFS = 208.29 cycles/loop
gavl_cfs = 223.62 cycles/loop (WORSE)
./smart-queue-v-gavl -n 100 -l 2000000
gavl_cfs = 257.66 cycles/loop
CFS = 329.68 cycles/loop
TR = 142.20 cycles/loop
CFS = 319.34 cycles/loop
gavl_cfs = 260.02 cycles/loop
./smart-queue-v-gavl -n 1000 -l 2000000
gavl_cfs = 258.41 cycles/loop
CFS = 393.04 cycles/loop
TR = 134.76 cycles/loop
CFS = 392.20 cycles/loop
gavl_cfs = 260.93 cycles/loop
./smart-queue-v-gavl -n 10000 -l 2000000
gavl_cfs = 259.45 cycles/loop
CFS = 605.89 cycles/loop
TR = 196.69 cycles/loop
CFS = 622.60 cycles/loop
gavl_cfs = 262.72 cycles/loop
./smart-queue-v-gavl -n 100000 -l 2000000
gavl_cfs = 258.21 cycles/loop
CFS = 845.62 cycles/loop
TR = 315.37 cycles/loop
CFS = 860.21 cycles/loop
gavl_cfs = 258.94 cycles/loop
The GAVL code has not been tuned with any "likely"/"unlikely"
constructs. It even carries some extra overhead from its generic
design that is not necessary for this use - it permanently
keeps a pointer to the last element, ensures
that the insertion order is preserved for equal key values,
etc. But it still shows much better scalability than the
RB-tree code used in the kernel. On the other hand, it does not
encode the colour/height in one of the pointers and requires
an additional field for the height.
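For illustration only (these are hypothetical structures, not the actual GAVL
or kernel definitions), the space trade-off mentioned above looks roughly like
this: an rbtree-style node can fold the colour bit into the parent pointer,
while a plain AVL-style node carries an explicit height field.

struct rb_node_like {			/* colour packed into the parent pointer */
	unsigned long parent_color;
	struct rb_node_like *right;
	struct rb_node_like *left;
};

struct avl_node_like {			/* needs one extra word per node */
	struct avl_node_like *left;
	struct avl_node_like *right;
	struct avl_node_like *parent;
	int height;			/* the additional field noted above */
};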
It may be that the difference is due to some bug in my testing;
if so, I would be interested in a correction. The test case
is probably oversimplified. I have run various other
tests against the GAVL code in the past to compare it with
different tree and queue implementations and I have not found
a case of real performance degradation. On the other hand, there
are cases with small item counts where GAVL is sometimes
a little worse than the others (an array-based heap-tree, for example).
The GAVL code itself is used in several open-source and commercial
projects and we have noticed no problems since one small fix
at the time of the first release in 2004.
Best wishes
Pavel Pisa
e-mail: pisa@cmp.felk.cvut.cz
www: http://cmp.felk.cvut.cz/~pisa
work: http://www.pikron.com
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 22:49 ` Ismail Dönmez
@ 2007-04-15 23:23 ` Arjan van de Ven
2007-04-15 23:33 ` Ismail Dönmez
2007-04-16 11:58 ` Ingo Molnar
1 sibling, 1 reply; 577+ messages in thread
From: Arjan van de Ven @ 2007-04-15 23:23 UTC (permalink / raw)
To: Ismail Dönmez
Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Nick Piggin, Mike Galbraith, Thomas Gleixner
On Mon, 2007-04-16 at 01:49 +0300, Ismail Dönmez wrote:
> Hi,
> On Friday 13 April 2007 23:21:00 Ingo Molnar wrote:
> > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
> > [CFS]
> >
> > i'm pleased to announce the first release of the "Modular Scheduler Core
> > and Completely Fair Scheduler [CFS]" patchset:
> >
> > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>
> Tested this on top of Linus' GIT tree, but the system gets very unresponsive
> during high disk I/O using ext3 as the filesystem; even writing a 300MB file
> to a USB disk (an iPod, actually) has the same effect.
just to make sure; this exact same workload but with the stock scheduler
does not have this effect?
if so, then it could well be that the scheduler is too fair for its own
good (being really fair inevitably ends up not batching as much as one
should, and batching is needed to get any kind of decent performance out
of disks nowadays)
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 23:23 ` Arjan van de Ven
@ 2007-04-15 23:33 ` Ismail Dönmez
0 siblings, 0 replies; 577+ messages in thread
From: Ismail Dönmez @ 2007-04-15 23:33 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Nick Piggin, Mike Galbraith, Thomas Gleixner
On Monday 16 April 2007 02:23:08 Arjan van de Ven wrote:
> On Mon, 2007-04-16 at 01:49 +0300, Ismail Dönmez wrote:
> > Hi,
> >
> > On Friday 13 April 2007 23:21:00 Ingo Molnar wrote:
> > > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
> > > [CFS]
> > >
> > > i'm pleased to announce the first release of the "Modular Scheduler
> > > Core and Completely Fair Scheduler [CFS]" patchset:
> > >
> > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
> >
> > Tested this on top of Linus' GIT tree, but the system gets very
> > unresponsive during high disk I/O using ext3 as the filesystem; even
> > writing a 300MB file to a USB disk (an iPod, actually) has the same effect.
>
> just to make sure; this exact same workload but with the stock scheduler
> does not have this effect?
>
> if so, then it could well be that the scheduler is too fair for its own
> good (being really fair inevitably ends up not batching as much as one
> should, and batching is needed to get any kind of decent performance out
> of disks nowadays)
Tried with make install in kdepim (which made the system sluggish with CFS) and
the system is just fine (using CFQ).
Regards,
ismail
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 20:48 ` Ingo Molnar
2007-04-15 21:31 ` Matt Mackall
@ 2007-04-15 23:39 ` William Lee Irwin III
2007-04-16 1:06 ` Peter Williams
1 sibling, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-15 23:39 UTC (permalink / raw)
To: Ingo Molnar
Cc: Matt Mackall, Con Kolivas, Peter Williams, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote:
> 2) plugsched did not allow on the fly selection of schedulers, nor did
> it allow a per CPU selection of schedulers. IO schedulers you can
> change per disk, on the fly, making them much more useful in
> practice. Also, IO schedulers (while definitely not being slow!) are
> alot less performance sensitive than CPU schedulers.
One of the reasons I never posted my own code is that it never met its
own design goals, which absolutely included switching on the fly. I
think Peter Williams may have done something about that. It was my hope
to be able to do insmod sched_foo.ko until it became clear that the
effort it was intended to assist wasn't going to get even the limited
hardware access required, at which point I largely stopped working on
it.
On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote:
> 3) I/O schedulers are pretty damn clean code, and plugsched, at least
> the last version i saw of it, didnt come even close.
I'm not sure what happened there. It wasn't a big enough patch to take
hits in this area due to getting overwhelmed by the programming burden
like some other efforts of mine. Maybe things started getting ugly once
on-the-fly switching entered the picture. My guess is that Peter Williams
will have to chime in here, since things have diverged enough from my
one-time contribution 4 years ago.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 19:57 ` Ingo Molnar
@ 2007-04-15 23:54 ` William Lee Irwin III
2007-04-16 11:24 ` Ingo Molnar
0 siblings, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-15 23:54 UTC (permalink / raw)
To: Ingo Molnar
Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel,
Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox
* William Lee Irwin III <wli@holomorphy.com> wrote:
>> I've been suggesting testing CPU bandwidth allocation as influenced by
>> nice numbers for a while now for a reason.
On Sun, Apr 15, 2007 at 09:57:48PM +0200, Ingo Molnar wrote:
> Oh I was very much testing "CPU bandwidth allocation as influenced by
> nice numbers" - it's one of the basic things i do when modifying the
> scheduler. An automated tool, while nice (all automation is nice)
> wouldnt necessarily show such bugs though, because here too it needed
> thousands of running tasks to trigger in practice. Any volunteers? ;)
Worse comes to worst, I might actually get around to doing it myself.
Any more detailed descriptions of the test for a rainy day?
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 18:00 ` Mike Galbraith
@ 2007-04-16 0:18 ` Gene Heskett
0 siblings, 0 replies; 577+ messages in thread
From: Gene Heskett @ 2007-04-16 0:18 UTC (permalink / raw)
To: Mike Galbraith
Cc: Con Kolivas, Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list,
Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
Nick Piggin, Arjan van de Ven, Thomas Gleixner
On Sunday 15 April 2007, Mike Galbraith wrote:
>On Sun, 2007-04-15 at 12:58 -0400, Gene Heskett wrote:
>> Chuckle, possibly but then I'm not anything even remotely close to an
>> expert here Con, just reporting what I get. And I just rebooted to
>> 2.6.21-rc6 + sched-mike-5.patch for grins and giggles, or frowns and
>> profanity as the case may call for.
>
>Erm, that patch is embarrassingly buggy, so profanity should dominate.
>
> -Mike
Chuckle, ROTFLMAO even.
I didn't run it that long as I immediately rebuilt and rebooted when I found
I'd used the wrong patch, and in fact had tested that one and found it
sub-optimal before I'd built and run Con's -0.40 version. As for bugs of the
type that make it to the screen or logs, I didn't see any. OTOH, my eyesight
is slowly going downhill, now 20/25. It was 20/10 30 years ago. Now that's
reason for profanity...
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Unix weanies are as bad at this as anyone.
-- Larry Wall in <199702111730.JAA28598@wall.org>
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 23:39 ` William Lee Irwin III
@ 2007-04-16 1:06 ` Peter Williams
2007-04-16 3:04 ` William Lee Irwin III
2007-04-16 17:22 ` Chris Friesen
0 siblings, 2 replies; 577+ messages in thread
From: Peter Williams @ 2007-04-16 1:06 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
William Lee Irwin III wrote:
> On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote:
>> 2) plugsched did not allow on the fly selection of schedulers, nor did
>> it allow a per CPU selection of schedulers. IO schedulers you can
>> change per disk, on the fly, making them much more useful in
>> practice. Also, IO schedulers (while definitely not being slow!) are
>> alot less performance sensitive than CPU schedulers.
>
> One of the reasons I never posted my own code is that it never met its
> own design goals, which absolutely included switching on the fly. I
> think Peter Williams may have done something about that.
I didn't, but some students did.
In a previous life, I did implement a runtime configurable CPU
scheduling mechanism (implemented on Tru64, Solaris and Linux) that
allowed schedulers to be loaded as modules at run time. This was
released commercially on Tru64 and Solaris. So I know that it can be done.
I have thought about doing something similar for the SPA schedulers,
which differ in only small ways from each other, but I lack the motivation.
> It was my hope
> to be able to do insmod sched_foo.ko until it became clear that the
> effort it was intended to assist wasn't going to get even the limited
> hardware access required, at which point I largely stopped working on
> it.
>
>
> On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote:
>> 3) I/O schedulers are pretty damn clean code, and plugsched, at least
>> the last version i saw of it, didnt come even close.
>
> I'm not sure what happened there. It wasn't a big enough patch to take
> hits in this area due to getting overwhelmed by the programming burden
> like some other efforts of mine. Maybe things started getting ugly once
> on-the-fly switching entered the picture. My guess is that Peter Williams
> will have to chime in here, since things have diverged enough from my
> one-time contribution 4 years ago.
From my POV, the current version of plugsched is considerably simpler
than it was when I took the code over from Con, as I put considerable
effort into minimizing code overlap in the various schedulers.
I also put considerable effort into minimizing any changes to the load
balancing code (something Ingo seems to think is a deficiency) and the
result is that plugsched allows "intra run queue" scheduling to be
easily modified WITHOUT affecting load balancing. To my mind, scheduling
and load balancing are orthogonal and keeping them that way simplifies
things.
As Ingo correctly points out, plugsched does not allow different
schedulers to be used per CPU but it would not be difficult to modify it
so that they could. Although I've considered doing this over the years
I decided not to as it would just increase the complexity and the amount
of work required to keep the patch set going. About six months ago I
decided to reduce the amount of work I was doing on plugsched (as it was
obviously never going to be accepted) and now only publish patches
against the vanilla kernel's major releases (and the only reason that I
kept doing that is that the download figures indicated that about 80
users were interested in the experiment).
Peter
PS I no longer read LKML (due to time constraints) and would appreciate
it if I could be CC'd on any e-mails suggesting scheduler changes.
PPS I'm just happy to see that Ingo has finally accepted that the
vanilla scheduler was badly in need of fixing and don't really care who
fixes it.
PPS Different schedulers for different aims (i.e. server or
workstation) do make a difference. E.g. the spa_svr scheduler in plugsched
does about 1% better on kernbench than the next best scheduler in the bunch.
PPPS Con, fairness isn't always best as humans aren't very altruistic
and we need to give unfair preference to interactive tasks in order to
stop the users flinging their PCs out the window. But the current
scheduler doesn't do this very well and is also not very good at
fairness, so it needs to change. But the changes need to address
interactive response and fairness, not just fairness.
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 22:52 ` Con Kolivas
@ 2007-04-16 2:28 ` Nick Piggin
2007-04-16 3:15 ` Con Kolivas
[not found] ` <b21f8390704152257v1d879cc3te0cfee5bf5d2bbf3@mail.gmail.com>
0 siblings, 2 replies; 577+ messages in thread
From: Nick Piggin @ 2007-04-16 2:28 UTC (permalink / raw)
To: Con Kolivas
Cc: Jonathan Lundell, Linus Torvalds, Mike Galbraith, Pekka J Enberg,
Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list,
Peter Williams, linux-kernel, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
On Mon, Apr 16, 2007 at 08:52:33AM +1000, Con Kolivas wrote:
> On Monday 16 April 2007 05:00, Jonathan Lundell wrote:
> > On Apr 15, 2007, at 10:59 AM, Linus Torvalds wrote:
> > > It's a really good thing, and it means that if somebody shows that
> > > your
> > > code is flawed in some way (by, for example, making a patch that
> > > people
> > > claim gets better behaviour or numbers), any *good* programmer that
> > > actually cares about his code will obviously suddenly be very
> > > motivated to
> > > out-do the out-doer!
> >
> > "No one who cannot rejoice in the discovery of his own mistakes
> > deserves to be called a scholar."
>
> Lovely comment. I realise this is not truly directed at me but clearly in the
> context it has been said people will assume it is directed my way, so while
> we're all spinning lkml quality rhetoric, let me have a right of reply.
>
> One thing I have never tried to do was to ignore bug reports. I'm forever
> joking that I keep pulling code out of my arse to improve what I've done.
> RSDL/SD was no exception; heck it had 40 iterations. The reason I could not
> reply to bug report A with "Oh that is problem B so I'll fix it with code C"
> was, as I've said many many times over, health related. I did indeed try to
> fix many of them without spending hours replying to sometimes unpleasant
> emails. If health wasn't an issue there might have been 1000 iterations of
> SD.
Well what matters is the code and development. I don't think Ingo's
scheduler is the final word, although I worry that Linus might jump the
gun and merge something "just to give it a test", which we then get
stuck with :P
I don't know how anybody can think Ingo's new scheduler is anything but
a good thing (so long as it has to compete before being merged). And
that's coming from someone who wants *their* scheduler to get merged...
I think mine can compete ;) and if it can't, then I'd rather be using
the scheduler that beats it.
> There was only ever _one_ thing that I was absolutely steadfast on as a
> concept that I refused to fix that people might claim was "a mistake I did
> not rejoice in to be a scholar". That was that the _correct_ behaviour for a
> scheduler is to be fair such that proportional slowdown with load is (using
> that awful pun) a feature, not a bug.
If something is using more than a fair share of CPU time, over some macro
period, in order to be interactive, then definitely it should get throttled.
I've always maintained (since starting scheduler work) that the 2.6 scheduler
is horrible because it allows these cases where some things can get more CPU
time just by how they behave.
Glad people are starting to come around on that point.
So, on to something productive, we have 3 candidates for a new scheduler so
far. How do we decide which way to go? (and yes, I still think switchable
schedulers is wrong and a copout) This is one area where it is virtually
impossible to discount any decent design on correctness/performance/etc.
and even testing in -mm isn't really enough.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 21:31 ` Matt Mackall
@ 2007-04-16 3:03 ` Nick Piggin
2007-04-16 14:28 ` Matt Mackall
2007-04-16 15:45 ` William Lee Irwin III
1 sibling, 1 reply; 577+ messages in thread
From: Nick Piggin @ 2007-04-16 3:03 UTC (permalink / raw)
To: Matt Mackall
Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel,
Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner
On Sun, Apr 15, 2007 at 04:31:54PM -0500, Matt Mackall wrote:
> On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote:
>
> > 4) the good thing that happened to I/O, after years of stagnation isnt
> > I/O schedulers. The good thing that happened to I/O is called Jens
> > Axboe. If you care about the I/O subystem then print that name out
> > and hang it on the wall. That and only that is what mattered.
>
> Disagree. Things didn't actually get interesting until Nick showed up
> with AS and got it in-tree to demonstrate the huge amount of room we
> had for improvement. It took several iterations of AS and CFQ (with a
> couple complete rewrites) before CFQ began to look like the winner.
> The resulting time-sliced CFQ was fairly heavily influenced by the
> ideas in AS.
Well to be fair, Jens had just implemented deadline, which got me
interested ;)
Actually, I would still like to be able to deprecate deadline for
AS, because AS has a tunable that you can switch to turn off read
anticipation and revert to deadline behaviour (or very close to).
It would have been nice if CFQ were then a layer on top of AS that
implemented priorities (or vice versa). And then AS could be
deprecated and we'd be back to 1 primary scheduler.
Well CFQ seems to be going in the right direction with that, however
some large users still find AS faster for some reason...
Anyway, the moral of the story is that I think it would have been nice
if we hadn't proliferated IO schedulers; however, in practice it
isn't easy to just layer features on top of each other, and also
keeping deadline helped a lot to be able to debug and examine
performance regressions and actually get code upstream. And this
was true even when it was only globally boot-time switchable.
I'd prefer if we kept a single CPU scheduler in mainline, because I
think that simplifies analysis and focuses testing. I think we can
have one that is good enough for everyone. But if the only other
option for progress is that Linus or Andrew just pull one out of a
hat, then I would rather merge all of them. Yes I think Con's
scheduler should get a fair go, ditto for Ingo's, mine, and anyone
else's.
> > nor was the non-modularity of some piece of code ever an impediment to
> > competition. May i remind you of the pretty competitive SLAB allocator
> > landscape, resulting in things like the SLOB allocator, written by
> > yourself? ;-)
>
> Thankfully no one came out and said "we don't want to balkanize the
> allocator landscape" when I submitted it or I probably would have just
> dropped it, rather than painfully dragging it along out of tree for
> years. I'm not nearly the glutton for punishment that Con is. :-P
I don't think this is a fault of the people or the code involved.
We just didn't have much collective drive to replace the scheduler,
and even less an idea of how to decide between any two of them.
I've kept nicksched around since 2003 or so and no hard feelings ;)
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 1:06 ` Peter Williams
@ 2007-04-16 3:04 ` William Lee Irwin III
2007-04-16 5:09 ` Peter Williams
2007-04-16 17:22 ` Chris Friesen
1 sibling, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-16 3:04 UTC (permalink / raw)
To: Peter Williams
Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
William Lee Irwin III wrote:
>> One of the reasons I never posted my own code is that it never met its
>> own design goals, which absolutely included switching on the fly. I
>> think Peter Williams may have done something about that.
>> It was my hope
>> to be able to do insmod sched_foo.ko until it became clear that the
>> effort it was intended to assist wasn't going to get even the limited
>> hardware access required, at which point I largely stopped working on
>> it.
On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote:
> I didn't but some students did.
> In a previous life, I did implement a runtime configurable CPU
> scheduling mechanism (implemented on True64, Solaris and Linux) that
> allowed schedulers to be loaded as modules at run time. This was
> released commercially on True64 and Solaris. So I know that it can be done.
> I have thought about doing something similar for the SPA schedulers
> which differ in only small ways from each other but lack motivation.
Driver models for scheduling are not so far out. AFAICS it's largely a
tug-of-war over design goals, e.g. maintaining per-cpu runqueues and
switching out intra-queue policies vs. switching out whole-system
policies, SMP handling and all. Whether this involves load balancing
depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x
scheduler module, for instance, would not have a load balancer at all,
as it has only one global runqueue. There are other sorts of policies
wanting significant changes to SMP handling vs. the stock load
balancing.
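To make that distinction concrete, a hypothetical ops table for such a driver
model might be shaped roughly as follows (purely illustrative; these names are
invented and this is not plugsched's actual interface). An intra-queue policy
would fill in only the queue hooks, while a whole-system policy would also take
over the balance hook:

/* Illustrative only; not the plugsched interface, all names invented. */
#include <stddef.h>

struct task;                            /* opaque to the core */

struct sched_policy_ops {
    const char *name;
    /* intra-runqueue policy hooks */
    void (*enqueue)(struct task *p, int cpu);
    void (*dequeue)(struct task *p, int cpu);
    struct task *(*pick_next)(int cpu);
    void (*tick)(struct task *curr, int cpu);
    /* whole-system policies would also own this; NULL otherwise */
    void (*balance)(int this_cpu);
};

static const struct sched_policy_ops *active_policy;

/* The core dispatches through whatever policy is currently installed. */
struct task *core_pick_next(int cpu)
{
    return active_policy->pick_next(cpu);
}

/* A real switch would have to quiesce and requeue every task first. */
void sched_policy_install(const struct sched_policy_ops *ops)
{
    active_policy = ops;
}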
William Lee Irwin III wrote:
>> I'm not sure what happened there. It wasn't a big enough patch to take
>> hits in this area due to getting overwhelmed by the programming burden
>> like some other efforts of mine. Maybe things started getting ugly once
>> on-the-fly switching entered the picture. My guess is that Peter Williams
>> will have to chime in here, since things have diverged enough from my
>> one-time contribution 4 years ago.
On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote:
> From my POV, the current version of plugsched is considerably simpler
> than it was when I took the code over from Con as I put considerable
> effort into minimizing code overlap in the various schedulers.
> I also put considerable effort into minimizing any changes to the load
> balancing code (something Ingo seems to think is a deficiency) and the
> result is that plugsched allows "intra run queue" scheduling to be
> easily modified WITHOUT effecting load balancing. To my mind scheduling
> and load balancing are orthogonal and keeping them that way simplifies
> things.
ISTR rearranging things for Con in such a fashion that it no longer
worked out of the box (though that wasn't the intention; restructuring it
to be more suited to his purposes was) and that's what he worked off of
afterward. I don't remember very well what changed there as I clearly
invested less effort there than the prior versions. Now that I think of
it, that may have been where the sample policy demonstrating scheduling
classes was lost.
On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote:
> As Ingo correctly points out, plugsched does not allow different
> schedulers to be used per CPU but it would not be difficult to modify it
> so that they could. Although I've considered doing this over the years
> I decided not to as it would just increase the complexity and the amount
> of work required to keep the patch set going. About six months ago I
> decided to reduce the amount of work I was doing on plugsched (as it was
> obviously never going to be accepted) and now only publish patches
> against the vanilla kernel's major releases (and the only reason that I
> kept doing that is that the download figures indicated that about 80
> users were interested in the experiment).
That's a rather different goal from what I was going on about with it,
so it's all diverged quite a bit. Where I had a significant need for
mucking with the entire concept of how SMP was handled, this is rather
different. At this point I'm questioning the relevance of my own work,
though it was already relatively marginal, as it started life as an
attempt at a sort of debug patch to help the gang scheduling code along
(gang scheduling being in itself a rather marginally relevant feature to
most users).
On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote:
> PS I no longer read LKML (due to time constraints) and would appreciate
> it if I could be CC'd on any e-mails suggesting scheduler changes.
> PPS I'm just happy to see that Ingo has finally accepted that the
> vanilla scheduler was badly in need of fixing and don't really care who
> fixes it.
> PPS Different schedulers for different aims (i.e. server or work
> station) do make a difference. E.g. the spa_svr scheduler in plugsched
> does about 1% better on kernbench than the next best scheduler in the bunch.
> PPPS Con, fairness isn't always best as humans aren't very altruistic
> and we need to give unfair preference to interactive tasks in order to
> stop the users flinging their PCs out the window. But the current
> scheduler doesn't do this very well and is also not very good at
> fairness so needs to change. But the changes need to address
> interactive response and fairness not just fairness.
Kernel compiles are not so useful a benchmark. SDET, OAST, AIM7, etc. are
better ones. I'd not bother citing kernel compile results.
In any event, I'm not sure what to say about different schedulers for
different aims. My intentions with plugsched were not centered around
production usage or intra-queue policy. I'm relatively indifferent to
the notion of having pluggable CPU schedulers, intra-queue or otherwise,
in mainline. I don't see any particular harm in it, but neither am I
particularly motivated to have it in. I had a rather strong sense of
instrumentality about it, and since it became useless to me (at a
conceptual level; the implementation was never finished to the point of
dynamic loading of scheduler modules) for assisting development on
large systems via reboot avoidance by dint of it becoming clear that
access to such was never going to happen, I've stopped looking at it.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 2:28 ` Nick Piggin
@ 2007-04-16 3:15 ` Con Kolivas
2007-04-16 3:34 ` Nick Piggin
[not found] ` <b21f8390704152257v1d879cc3te0cfee5bf5d2bbf3@mail.gmail.com>
1 sibling, 1 reply; 577+ messages in thread
From: Con Kolivas @ 2007-04-16 3:15 UTC (permalink / raw)
To: Nick Piggin
Cc: Jonathan Lundell, Linus Torvalds, Mike Galbraith, Pekka J Enberg,
Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list,
Peter Williams, linux-kernel, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
On Monday 16 April 2007 12:28, Nick Piggin wrote:
> So, on to something productive, we have 3 candidates for a new scheduler so
> far. How do we decide which way to go? (and yes, I still think switchable
> schedulers is wrong and a copout) This is one area where it is virtually
> impossible to discount any decent design on correctness/performance/etc.
> and even testing in -mm isn't really enough.
We're in agreement! YAY!
Actually this is simpler than that. I'm taking SD out of the picture. It has
served its purpose of proving that we need to seriously address all the
scheduling issues and did more than a half decent job at it. Unfortunately I
also cannot sit around supporting it forever by myself. My own life is more
important, so consider SD not even running the race any more.
I'm off to continue maintaining permanent-out-of-tree leisurely code at my own
pace. What's more, I think I'll just stick to staircase Gen I version blah
and shelve SD and try to have fond memories of SD as an intellectual
prompting exercise only.
--
-ck
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 3:15 ` Con Kolivas
@ 2007-04-16 3:34 ` Nick Piggin
0 siblings, 0 replies; 577+ messages in thread
From: Nick Piggin @ 2007-04-16 3:34 UTC (permalink / raw)
To: Con Kolivas
Cc: Jonathan Lundell, Linus Torvalds, Mike Galbraith, Pekka J Enberg,
Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list,
Peter Williams, linux-kernel, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
On Mon, Apr 16, 2007 at 01:15:27PM +1000, Con Kolivas wrote:
> On Monday 16 April 2007 12:28, Nick Piggin wrote:
> > So, on to something productive, we have 3 candidates for a new scheduler so
> > far. How do we decide which way to go? (and yes, I still think switchable
> > schedulers is wrong and a copout) This is one area where it is virtually
> > impossible to discount any decent design on correctness/performance/etc.
> > and even testing in -mm isn't really enough.
>
> We're in agreement! YAY!
>
> Actually this is simpler than that. I'm taking SD out of the picture. It has
> served it's purpose of proving that we need to seriously address all the
> scheduling issues and did more than a half decent job at it. Unfortunately I
> also cannot sit around supporting it forever by myself. My own life is more
> important, so consider SD not even running the race any more.
>
> I'm off to continue maintaining permanent-out-of-tree leisurely code at my own
> pace. What's more is, I think I'll just stick to staircase Gen I version blah
> and shelve SD and try to have fond memories of SD as an intellectual
> prompting exercise only.
Well I would hope that _if_ we decide to switch schedulers, then you
get a chance to field something (and I hope you will decide to and have
time to), and I hope we don't rush into the decision.
We've had the current scheduler for so many years now that it is much
more important to make sure we take the time to do the right thing
rather than absolutely have to merge a new scheduler right now ;)
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 3:04 ` William Lee Irwin III
@ 2007-04-16 5:09 ` Peter Williams
2007-04-16 11:04 ` William Lee Irwin III
0 siblings, 1 reply; 577+ messages in thread
From: Peter Williams @ 2007-04-16 5:09 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>>> One of the reasons I never posted my own code is that it never met its
>>> own design goals, which absolutely included switching on the fly. I
>>> think Peter Williams may have done something about that.
>>> It was my hope
>>> to be able to do insmod sched_foo.ko until it became clear that the
>>> effort it was intended to assist wasn't going to get even the limited
>>> hardware access required, at which point I largely stopped working on
>>> it.
>
> On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote:
>> I didn't but some students did.
>> In a previous life, I did implement a runtime configurable CPU
>> scheduling mechanism (implemented on True64, Solaris and Linux) that
>> allowed schedulers to be loaded as modules at run time. This was
>> released commercially on True64 and Solaris. So I know that it can be done.
>> I have thought about doing something similar for the SPA schedulers
>> which differ in only small ways from each other but lack motivation.
>
> Driver models for scheduling are not so far out. AFAICS it's largely a
> tug-of-war over design goals, e.g. maintaining per-cpu runqueues and
> switching out intra-queue policies vs. switching out whole-system
> policies, SMP handling and all. Whether this involves load balancing
> depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x
> scheduler module, for instance, would not have a load balancer at all,
> as it has only one global runqueue. There are other sorts of policies
> wanting significant changes to SMP handling vs. the stock load
> balancing.
Well, a single run queue removes the need for load balancing but has
scalability issues on large systems. Personally, I think something in
between would be the best solution i.e. multiple run queues but more
than one CPU per run queue. I think that this would be a particularly
good solution to the problems introduced by hyper threading and multi
core systems and also NUMA systems. E.g. if all CPUs in a hyper thread
package are using the one queue, then the case where one CPU is trying to
run a high priority task and the other a low priority task (i.e. the
cases that the sleeping dependent mechanism tried to address) is less
likely to occur.
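A toy sketch of that sibling-shared queue idea (the topology constants and
names below are assumed, nothing more): the runqueue is selected per package
rather than per logical CPU, so HT siblings always pick from the same set of
queued tasks.

#include <stdio.h>

/* Assumed topology: 8 logical CPUs, 2 HT siblings per package. */
#define NR_CPUS          8
#define CPUS_PER_PACKAGE 2
#define NR_QUEUES        (NR_CPUS / CPUS_PER_PACKAGE)

struct runqueue {
    /* lock, priority array / timeline, etc. shared by the siblings */
    int nr_running;
};

static struct runqueue queues[NR_QUEUES];

static struct runqueue *cpu_rq(int cpu)
{
    return &queues[cpu / CPUS_PER_PACKAGE]; /* per package, not per CPU */
}

int main(void)
{
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        printf("cpu%d shares rq%ld\n", cpu, (long)(cpu_rq(cpu) - queues));
    return 0;
}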
By the way, I think that it's a very bad idea for the scheduling
mechanism and the load balancing mechanism to be coupled. The anomalies
that will be experienced and the attempts to make ad hoc fixes for them
will lead to complexity spiralling out of control.
>
>
> William Lee Irwin III wrote:
>>> I'm not sure what happened there. It wasn't a big enough patch to take
>>> hits in this area due to getting overwhelmed by the programming burden
>>> like some other efforts of mine. Maybe things started getting ugly once
>>> on-the-fly switching entered the picture. My guess is that Peter Williams
>>> will have to chime in here, since things have diverged enough from my
>>> one-time contribution 4 years ago.
>
> On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote:
>> From my POV, the current version of plugsched is considerably simpler
>> than it was when I took the code over from Con as I put considerable
>> effort into minimizing code overlap in the various schedulers.
>> I also put considerable effort into minimizing any changes to the load
>> balancing code (something Ingo seems to think is a deficiency) and the
>> result is that plugsched allows "intra run queue" scheduling to be
>> easily modified WITHOUT effecting load balancing. To my mind scheduling
>> and load balancing are orthogonal and keeping them that way simplifies
>> things.
>
> ISTR rearranging things for con in such a fashion that it no longer
> worked out of the box (though that wasn't the intention; restructuring it
> to be more suited to his purposes was) and that's what he worked off of
> afterward. I don't remember very well what changed there as I clearly
> invested less effort there than the prior versions. Now that I think of
> it, that may have been where the sample policy demonstrating scheduling
> classes was lost.
I can't comment here as (as far as I can recall) I never saw your code
and only became involved when Con posted his version of cpusched.
>
>
> On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote:
>> As Ingo correctly points out, plugsched does not allow different
>> schedulers to be used per CPU but it would not be difficult to modify it
>> so that they could. Although I've considered doing this over the years
>> I decided not to as it would just increase the complexity and the amount
>> of work required to keep the patch set going. About six months ago I
>> decided to reduce the amount of work I was doing on plugsched (as it was
>> obviously never going to be accepted) and now only publish patches
>> against the vanilla kernel's major releases (and the only reason that I
>> kept doing that is that the download figures indicated that about 80
>> users were interested in the experiment).
>
> That's a rather different goal from what I was going on about with it,
> so it's all diverged quite a bit.
Yes, pragmatic considerations dictated a change of tack.
> Where I had a significant need for
> mucking with the entire concept of how SMP was handled, this is rather
> different.
Yes, I went with the idea of intra run queue scheduling being orthogonal
to load balancing for two reasons:
1. I think that coupling them is a bad idea from the complexity POV, and
2. it's enough of a battle fighting for modifications to one bit of the
code without trying to do it to two simultaneously.
> At this point I'm questioning the relevance of my own work,
> though it was already relatively marginal as it started life as an
> attempt at a sort of debug patch to help gang scheduling (which is in
> itself a rather marginally relevant feature to most users) code along.
The main commercial plug-in scheduler used with the run-time loadable
module scheduler that I mentioned earlier did gang scheduling (at the
insistence of the Tru64 kernel folks). As this scheduler was a
hierarchical "fair share" scheduler, i.e. one allocating CPU "fairly"
("unfairly" really, according to an allocation policy) among higher
level entities such as users, groups and applications as well as
processes, it was fairly easy to make it a gang scheduler by modifying
it to give all of a process's threads the same priority based on the
process's CPU usage rather than different priorities based on the
threads' usage rates. In fact, it would have been possible to select
between gang and non gang on a per process basis if that was considered
desirable.
The fact that threads and processes are distinct entities on Tru64 and
Solaris made this easier to do on them than on Linux.
My experience with this scheduler leads me to believe that to achieve
gang scheduling and fairness, etc. you need (usage) statistics based
schedulers.
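Roughly speaking, the gang variant described above only changes where the
usage statistic feeding the priority calculation comes from; a hypothetical
sketch (invented names, not the Tru64 code) of the two pricing rules:

/* Invented types and mapping; a sketch of the idea, not the product. */
struct process_stats {
    unsigned long long cpu_ns;      /* usage summed over all threads */
    unsigned int nthreads;
};

struct thread_stats {
    struct process_stats *proc;
    unsigned long long cpu_ns;      /* this thread's own usage */
};

/* Placeholder mapping: more accumulated usage means scheduled later. */
int prio_from_usage(unsigned long long used_ns)
{
    return (int)(used_ns >> 26) % 40;
}

/* Per-thread priority: threads of one process can drift apart. */
int thread_prio(const struct thread_stats *t)
{
    return prio_from_usage(t->cpu_ns);
}

/*
 * Gang-style priority: every thread is priced on the process's average
 * usage, so all of a process's threads share one priority and tend to
 * become runnable, and be picked, together.
 */
int gang_prio(const struct thread_stats *t)
{
    unsigned int n = t->proc->nthreads ? t->proc->nthreads : 1;

    return prio_from_usage(t->proc->cpu_ns / n);
}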
>
>
> On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote:
>> PS I no longer read LKML (due to time constraints) and would appreciate
>> it if I could be CC'd on any e-mails suggesting scheduler changes.
>> PPS I'm just happy to see that Ingo has finally accepted that the
>> vanilla scheduler was badly in need of fixing and don't really care who
>> fixes it.
>> PPS Different schedulers for different aims (i.e. server or work
>> station) do make a difference. E.g. the spa_svr scheduler in plugsched
>> does about 1% better on kernbench than the next best scheduler in the bunch.
>> PPPS Con, fairness isn't always best as humans aren't very altruistic
>> and we need to give unfair preference to interactive tasks in order to
>> stop the users flinging their PCs out the window. But the current
>> scheduler doesn't do this very well and is also not very good at
>> fairness so needs to change. But the changes need to address
>> interactive response and fairness not just fairness.
>
> Kernel compiles not so useful a benchmark. SDET, OAST, AIM7, etc. are
> better ones. I'd not bother citing kernel compile results.
spa_svr actually does its best work when the system isn't fully loaded
as the type of improvement it strives to achieve (minimizing on queue
wait time) hasn't got much room to manoeuvre when the system is fully
loaded. Therefore, the fact that it's 1% better even in these
circumstances is a good result and also indicates that the overhead for
keeping the scheduling statistics it uses for its decision making is
well spent. Especially, when you consider that the total available room
for improvement on this benchmark is less than 3%.
To elaborate, the motivation for this scheduler was acquired from the
observation of scheduling statistics (in particular, on queue wait time)
on systems running at about 30% to 50% load. Theoretically, at these
load levels there should be no such waiting but the statistics show that
there is considerable waiting (sometimes as high as 30% to 50%). I put
this down to "lack of serendipity" e.g. everyone sleeping at the same
time and then trying to run at the same time would be complete lack of
serendipity. On the other hand, if everyone is synced then there would
be total serendipity.
Obviously, from the POV of a client, time the server task spends waiting
on the queue adds to the response time for any request that has been
made, so reduction of this time on a server is a good thing(tm). Equally
obviously, trying to achieve this synchronization by asking the tasks to
cooperate with each other is not a feasible solution; some external
influence needs to be exerted, and this is what spa_svr does -- it nudges
the scheduling order of the tasks in a way that makes them become well
synced.
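The statistic being nudged might look something like the toy sketch below
(made-up names, not the actual spa_svr code): record how long each task sits
runnable before being dispatched, keep a decayed average, and fold that into
the effective priority so habitual waiters are pulled forward.

/* Toy sketch with invented names; not spa_svr's actual bookkeeping. */
struct wait_stats {
    unsigned long long enqueue_ns;  /* when the task became runnable */
    unsigned long long avg_wait_ns; /* decayed average on-queue wait */
};

void note_enqueue(struct wait_stats *ws, unsigned long long now_ns)
{
    ws->enqueue_ns = now_ns;
}

void note_dispatch(struct wait_stats *ws, unsigned long long now_ns)
{
    unsigned long long wait = now_ns - ws->enqueue_ns;

    /* exponential decay: 7/8 old + 1/8 new */
    ws->avg_wait_ns = (ws->avg_wait_ns * 7 + wait) / 8;
}

int effective_prio(int base_prio, const struct wait_stats *ws)
{
    /* the longer a task habitually waits, the earlier it is queued;
     * clamping to the valid priority range is omitted here */
    return base_prio - (int)(ws->avg_wait_ns >> 20);    /* ~1 step per ms */
}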
Unfortunately, this is not a good scheduler for an interactive system, as
it minimizes the response times for ALL tasks (and the system as a
whole), and this can result in increased response time for some
interactive tasks (clunkiness), which annoys interactive users. When you
start fiddling with this scheduler to bring back "interactive
unfairness" you kill a lot of its superior low overall wait time
performance.
So this is why I think "horses for courses" schedulers are worthwhile.
>
> In any event, I'm not sure what to say about different schedulers for
> different aims. My intentions with plugsched were not centered around
> production usage or intra-queue policy. I'm relatively indifferent to
> the notion of having pluggable CPU schedulers, intra-queue or otherwise,
> in mainline. I don't see any particular harm in it, but neither am I
> particularly motivated to have it in.
If you look at the struct sched_spa_child in the file
include/linux/sched_spa.h you'll see that the interface for switching
between the various SPA schedulers is quite small and making them
runtime switchable would be easy (I haven't done this in cpusched as I
wanted to keep the same interface for switching schedulers for all
schedulers: i.e. all run time switchable or none run time switchable; as
the main aim of plugsched had become a mechanism for evaluating
different intra queue scheduling designs.)
> I had a rather strong sense of
> instrumentality about it, and since it became useless to me (at a
> conceptual level; the implementation was never finished ot the point of
> dynamic loading of scheduler modules) for assisting development on
> large systems via reboot avoidance by dint of it becoming clear that
> access to such was never going to happen, I've stopped looking at it.
I'll probably stop looking at this problem as well, at least for the time
being, until all this new code has settled.
Peter
PS As I no longer read LKML, I haven't yet seen Ingo's or Con's or
Nick's new schedulers yet so am unable to comment on their technical
merits with respect to my comments above.
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 15:05 ` Ingo Molnar
2007-04-15 20:05 ` Matt Mackall
@ 2007-04-16 5:16 ` Con Kolivas
2007-04-16 5:48 ` Gene Heskett
1 sibling, 1 reply; 577+ messages in thread
From: Con Kolivas @ 2007-04-16 5:16 UTC (permalink / raw)
To: Ingo Molnar
Cc: Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Monday 16 April 2007 01:05, Ingo Molnar wrote:
> * Con Kolivas <kernel@kolivas.org> wrote:
> > 2. Since then I've been thinking/working on a cpu scheduler design
> > that takes away all the guesswork out of scheduling and gives very
> > predictable, as fair as possible, cpu distribution and latency while
> > preserving as solid interactivity as possible within those confines.
>
> yeah. I think you were right on target with this call.
Yay thank goodness :) It's time to fix the damn cpu scheduler once and for
all. Everyone uses this; it's no minor driver or $bigsmp or $bigram or
$small_embedded_RT_hardware feature.
> I've applied the
> sched.c change attached at the bottom of this mail to the CFS patch, if
> you dont mind. (or feel free to suggest some other text instead.)
> * 2003-09-03 Interactivity tuning by Con Kolivas.
> * 2004-04-02 Scheduler domains code by Nick Piggin
> + * 2007-04-15 Con Kolivas was dead right: fairness matters! :)
LOL that's awful. I'd prefer something meaningful like "Work begun on
replacing all interactivity tuning with a fair virtual-deadline design by Con
Kolivas".
While you're at it, it's worth getting rid of a few slightly pointless name
changes too. Don't rename SCHED_NORMAL yet again, and don't call all your
things sched_fair blah_fair __blah_fair and so on. It means that anything
else is by proxy going to be considered unfair. Leave SCHED_NORMAL as is,
replace the use of the word _fair with _cfs. I don't really care how many
copyright notices you put into our already noisy bootup but it's redundant
since there is no choice; we all get the same cpu scheduler.
> > 1. I tried in vain some time ago to push a working extensable
> > pluggable cpu scheduler framework (based on wli's work) for the linux
> > kernel. It was perma-vetoed by Linus and Ingo (and Nick also said he
> > didn't like it) as being absolutely the wrong approach and that we
> > should never do that. [...]
>
> i partially replied to that point to Will already, and i'd like to make
> it clear again: yes, i rejected plugsched 2-3 years ago (which already
> drifted away from wli's original codebase) and i would still reject it
> today.
No that was just me being flabbergasted by what appeared to be you posting
your own plugsched. Note nowhere in the 40 iterations of rsdl->sd did I
ask/suggest for plugsched. I said in my first announcement my aim was to
create a scheduling policy robust enough for all situations rather than
fantastic a lot of the time and awful sometimes. There are plenty of people
ready to throw out arguments for plugsched now and I don't have the energy to
continue that fight (I never did really).
But my question still stands about this comment:
> case, all of SD's logic could be added via a kernel/sched_sd.c module
> as well, if Con is interested in such an approach. ]
What exactly would be the purpose of such a module that governs nothing in
particular? Since, by your admission, there'll be no pluggable scheduler, it
has no control over SCHED_NORMAL, and it would require another scheduling
policy for it to govern, which there is no express way to use at the moment,
and people tend to just use the default without great effort.
> First and foremost, please dont take such rejections too personally - i
> had my own share of rejections (and in fact, as i mentioned it in a
> previous mail, i had a fair number of complete project throwaways:
> 4g:4g, in-kernel Tux, irqrate and many others). I know that they can
> hurt and can demoralize, but if i dont like something it's my job to
> tell that.
Hmm? No, that's not what this is about. Remember dynticks, which was not
originally my code but which I tried to bring up to mainline standard,
fighting with it for months? You came along with yet another rewrite from
scratch, and the flaws in the design I was working with were obvious, so I instantly
bowed down to that and never touched my code again. I didn't ask for credit
back then, but obviously brought the requirement for a no idle tick
implementation to the table.
> My view about plugsched: first please take a look at the latest
> plugsched code:
>
> http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.20.patch
>
> 26 files changed, 8951 insertions(+), 1495 deletions(-)
>
> As an experiment i've removed all the add-on schedulers (both the core
> and the include files, only kept the vanilla one) from the plugsched
> patch (and the makefile and kconfig complications, etc), to see the
> 'infrastructure cost', and it still gave:
>
> 12 files changed, 1933 insertions(+), 1479 deletions(-)
I do not see extra code per se as being a bad thing. I've heard it said a few
times before: "ever notice how when the correct solution is done it is a lot
more code than the quick hack that ultimately fails?". Insert a long-winded
discussion of "perfect is the enemy of good" here, _but_ I'm not arguing
perfect versus good, I'm talking about solid code versus quick fix. Again,
none of this comment is directed specifically at this implementation of
plugsched, its code quality or intent, but using "extra code is bad" as an
argument is not enough.
> By your logic Mike should in fact be quite upset about this: if the
> new code works out and proves to be useful then it obsoletes a whole lot
> of code of him!
> > [...] However at one stage I virtually begged for support with my
> > attempts and help with the code. Dmitry Adamushko is the only person
> > who actually helped me with the code in the interim, while others
> > poked sticks at it. Sure the sticks helped at times but the sticks
> > always seemed to have their ends kerosene doused and flaming for
> > reasons I still don't get. No other help was forthcoming.
> Hey, i told this to you as recently as 1 month ago as well:
>
> http://lkml.org/lkml/2007/3/8/54
>
> "cool! I like this even more than i liked your original staircase
> scheduler from 2 years ago :)"
Email has an awful knack of disguising intent, so I took it at face value
that you did like the idea :).
Above when I said "no other help was forthcoming" all I was hoping for was
really simple obvious bugfixes to help me along while I was laid up in bed
such as "I like what you're doing but oh your use of memset here is bogus,
here is a one line patch". I wasn't specifically expecting you to fix my
code; you've got truckloads of things you need to do.
It just reminds me that the concept of "release early, release often" doesn't
actually work in the kernel. What is far more obvious is "release code only
when it's so close to perfect that noone can argue against it" since most of
the work is done by one person, otherwise someone will come out with a
counterpatch that is _complete_ earlier but in all possibility not as good,
it's just ready sooner. *NOTE* In no way am I saying your code is not as good
as mine; I would have to say exactly the opposite is true pretty much always
(<sarcasm>conversely then I doubt if I dropped you in my work environment
you'd do as good a job as I do</sarcasm>). At one stage wli (again at my
request) put together a quick hack to check for non-preemptible regions
within the kernel. From that quick hack you adopted it and turned it into
that beautiful latency tracer that is the cornerstone of the -rt tree
testing. However, there are many instances I've seen good evolving code in
the linux kernel be trumped by not-as-good but already-working alternatives
written from scratch with no reference to the original work. This is the NIH
(not invented here) mechanism I see happening that is worth objecting to.
What you may find most amusing is the very first iterations of RSDL looked
_nothing_ like the mainline scheduler. There were all sorts of different
structures, mechanisms, one priority array, plans to remove scheduler_tick
entirely and so on. Most of those were never made for public consumption. I
spent about half a dozen iterations of RSDL removing all of that and making
it as close to the mainline design as possible, thus minimising the size of
the patch, and to make it readily readable for most people familiar with the
scheduler policy code in sched.c (all 5 of them). I should have just said
bugger it and started everything from scratch with little to no reference to
the original scheduler but found myself obliged to try to do things the
minimal code patch size readable difference thingy that was valued in linux
kernel development. I think the radically different approach would have been
better in the long run. Trying to play ball I ruined it.
Either way, I've decided, for myself, my family, my career and my sanity, that
I'm abandoning SD. I will shelve SD and try to have fond memories of SD as an
intellectual prompting exercise only.
> Ingo
--
-ck
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 15:39 ` Ingo Molnar
2007-04-15 15:47 ` William Lee Irwin III
@ 2007-04-16 5:27 ` Peter Williams
2007-04-16 6:23 ` Peter Williams
1 sibling, 1 reply; 577+ messages in thread
From: Peter Williams @ 2007-04-16 5:27 UTC (permalink / raw)
To: Ingo Molnar
Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list,
linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
Ingo Molnar wrote:
> * Willy Tarreau <w@1wt.eu> wrote:
>
>> Ingo could have publicly spoken with them about his ideas of killing
>> the O(1) scheduler and replacing it with an rbtree-based one, [...]
>
> yes, that's precisely what i did, via a patchset :)
>
> [ I can even tell you when it all started: i was thinking about Mike's
> throttling patches while watching Manchester United beat the crap out
> of AS Roma (7 to 1 end result), Thuesday evening. I started coding it
> Wednesday morning and sent the patch Friday evening. I very much
> believe in low-latency when it comes to development too ;) ]
>
> (if this had been done via a comittee then today we'd probably still be
> trying to find a suitable timeslot for the initial conference call where
> we'd discuss the election of a chair who would be tasked with writing up
> an initial document of feature requests, on which we'd take a vote,
> possibly this year already, because the matter is really urgent you know
> ;-)
>
>> [...] and using part of Bill's work to speed up development.
>
> ok, let me make this absolutely clear: i didnt use any bit of plugsched
> - in fact the most difficult bits of the modularization was for areas of
> sched.c that plugsched never even touched AFAIK. (the load-balancer for
> example.)
This sounds like your new scheduler intends to increase the coupling
between scheduling and load balancing. I think that this would be a
mistake and lead (down the track) to spiralling complexity as you make
changes to the code to address the corner conditions that it will create.
>
> Plugsched simply does something else: i modularized scheduling policies
> in essence that have to cooperate with each other, while plugsched
> modularized complete schedulers which are compile-time or boot-time
> selected, with no runtime cooperation between them. (one has to be
> selected at a time)
You can't really have more than one scheduler operating in the same
priority range on the same CPU as they will be fighting each other
trying to achieve their separate and not necessarily compatible (in fact
highly likely to be incompatible) aims. Multiple schedulers on the same
CPU have to have a pecking order, just like SCHED_OTHER and the real-time
policies. It wouldn't be hard to prove that mixing SCHED_RR and SCHED_FIFO is
a problem in waiting if someone ever tried to use them both on a highly
real-time system.
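That pecking order can be pictured as a strict walk over the co-existing
classes (a sketch of the concept only, not of any particular implementation):
the core asks the highest class first, and a lower class only gets to run when
every class above it has nothing runnable.

/* Conceptual sketch of a strict pecking order between classes. */
#include <stddef.h>

struct task;

struct sched_class {
    const char *name;
    struct task *(*pick_next)(int cpu);
    const struct sched_class *next;     /* next lower class, or NULL */
};

struct task *pick_next_task(const struct sched_class *highest, int cpu)
{
    const struct sched_class *class;

    for (class = highest; class; class = class->next) {
        struct task *p = class->pick_next(cpu);

        if (p)
            return p;
    }
    return NULL;        /* nothing runnable: go idle */
}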
>
> (and i have no trouble at all with crediting Will's work either: a few
> years ago i used Will's PID rework concepts for an NPTL related speedup
> and Will is very much credited for it in today's kernel/pid.c and he
> continued to contribute to it later on.)
>
> (the tree walking bits of sched_fair.c were in fact derived from
> kernel/hrtimer.c, the rbtree code written by Thomas and me :-)
>
> Ingo
Are your new patches available somewhere for easy download or do I have
to try to dig them out of the mailing list archive? Or could you mail
them to me separately? I'm keen to see how your new scheduler proposal
works.
Thanks
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 16:25 ` Arjan van de Ven
@ 2007-04-16 5:36 ` Bill Huey
2007-04-16 6:17 ` Nick Piggin
0 siblings, 1 reply; 577+ messages in thread
From: Bill Huey @ 2007-04-16 5:36 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list,
Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
Nick Piggin, Thomas Gleixner, Bill Huey (hui)
On Sun, Apr 15, 2007 at 09:25:07AM -0700, Arjan van de Ven wrote:
> Now this doesn't mean that people shouldn't be nice to each other, not
> cooperate or steal credits, but I don't get the impression that that is
> happening here. Ingo is taking part in the discussion with a counter
> proposal for discussion *on the mailing list*. What more do you want??
Con should have been CCed from the first moment this was put into motion,
to limit the perception of exclusion. That was mistake number one, and a
big-time failure to understand this dynamic. After all, it was Con's idea.
Why the hell he was excluded from Ingo's development process is baffling
to me and him (most likely).
He put a lot of effort into SD, and his experiences with scheduling
should still be seriously considered in this development process even if
he doesn't write a single line of code from this moment on.
What should have happened is that our very busy associate at RH by the
name of Ingo Molnar should have leveraged more of Con's and Bill's work
and used them as a proxy for his own ideas. They would have loved to have
contributed more, and our very busy Ingo Molnar would have gotten a lot
of his work and ideas implemented without even opening a single
source file for editing. They would have happily done this work for
Ingo. Ingo could have been used for something else more important like
making KVM less of a freaking ugly hack and we all would have benefitted
from this.
He could have been working on SystemTap so that you stop losing accounts
to Sun and Solaris 10's Dtrace. He could have been working with Riel to
fix your butt ugly page scanning problem causing horrible contention via
the Clock/Pro algorithm, etc... He could have been fixing the ugly futex
rwsem mapping problem that's killing -rt and anything that uses Posix
threads. He could have created a userspace thread control block (TCB)
with Mr. Drepper so that we can turn off preemption in userspace
(userspace per CPU local storage) and implement a very quick non-kernel
crossing implementation of priority ceilings (userspace check for priority
and flags at preempt_schedule() in the TCB) so that our -rt Posix API
doesn't suck donkey shit... Need I say more ?
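The userspace TCB idea above boils down to a couple of flags shared between
the thread and the kernel; a hypothetical sketch of the userspace side
(nothing like this exists in the code being discussed): the thread marks a
short critical section as non-preemptible and the kernel defers the
preemption instead of context switching.

/* Hypothetical per-thread control block; an illustration of the idea only. */
struct tcb {
    volatile int no_preempt;        /* thread: "don't preempt me right now" */
    volatile int preempt_pending;   /* kernel: "you owe me a yield" */
};

/* Userspace side of a short critical section. */
static inline void tcb_enter(struct tcb *t)
{
    t->no_preempt = 1;
}

static inline void tcb_exit(struct tcb *t, void (*yield)(void))
{
    t->no_preempt = 0;
    if (t->preempt_pending) {       /* pay back the deferred preemption */
        t->preempt_pending = 0;
        yield();
    }
}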
As programmers like Ingo get spread more thinly, he needs super smart
folks like Bill Irwin and Con to help him out, and he needs to stop
NIHing other folks' stuff out of some weird fear. When this happens, folks
like Ingo must learn to "facilitate" development in addition to
implementing it with those kinds of folks.
It takes time and practice to learn to entrust folks to do things for him.
Ingo is the best method of getting new Linux kernel ideas communicated
to Linus. His value goes beyond just code; he is often the
biggest hammer we have in the Linux community to get stuff into the
kernel. "Facilitation" of others is something that solo programmers must
learn when groups like the Linux kernel get larger and larger every year.
Understand ? Are we in embarrassing agreement here ?
bill
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 23:09 ` Pavel Pisa
@ 2007-04-16 5:47 ` Davide Libenzi
2007-04-17 0:37 ` Pavel Pisa
0 siblings, 1 reply; 577+ messages in thread
From: Davide Libenzi @ 2007-04-16 5:47 UTC (permalink / raw)
To: Pavel Pisa
Cc: William Lee Irwin III, Ingo Molnar, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Mon, 16 Apr 2007, Pavel Pisa wrote:
> I cannot help myself to not report results with GAVL
> tree algorithm there as an another race competitor.
> I believe, that it is better solution for large priority
> queues than RB-tree and even heap trees. It could be
> disputable if the scheduler needs such scalability on
> the other hand. The AVL heritage guarantees lower height
> which results in shorter search times which could
> be profitable for other uses in kernel.
>
> GAVL algorithm is AVL tree based, so it does not suffer from
> "infinite" priorities granularity there as TR does. It allows
> use for generalized case where tree is not fully balanced.
> This allows to cut the first item withour rebalancing.
> This leads to the degradation of the tree by one more level
> (than non degraded AVL gives) in maximum, which is still
> considerably better than RB-trees maximum.
>
> http://cmp.felk.cvut.cz/~pisa/linux/smart-queue-v-gavl.c
Here are the results on my Opteron 252:
Testing N=1
gavl_cfs = 187.20 cycles/loop
CFS = 194.16 cycles/loop
TR = 314.87 cycles/loop
CFS = 194.15 cycles/loop
gavl_cfs = 187.15 cycles/loop
Testing N=2
gavl_cfs = 268.94 cycles/loop
CFS = 305.53 cycles/loop
TR = 313.78 cycles/loop
CFS = 289.58 cycles/loop
gavl_cfs = 266.02 cycles/loop
Testing N=4
gavl_cfs = 452.13 cycles/loop
CFS = 518.81 cycles/loop
TR = 311.54 cycles/loop
CFS = 516.23 cycles/loop
gavl_cfs = 450.73 cycles/loop
Testing N=8
gavl_cfs = 609.29 cycles/loop
CFS = 644.65 cycles/loop
TR = 308.11 cycles/loop
CFS = 667.01 cycles/loop
gavl_cfs = 592.89 cycles/loop
Testing N=16
gavl_cfs = 686.30 cycles/loop
CFS = 807.41 cycles/loop
TR = 317.20 cycles/loop
CFS = 810.24 cycles/loop
gavl_cfs = 688.42 cycles/loop
Testing N=32
gavl_cfs = 756.57 cycles/loop
CFS = 852.14 cycles/loop
TR = 301.22 cycles/loop
CFS = 876.12 cycles/loop
gavl_cfs = 758.46 cycles/loop
Testing N=64
gavl_cfs = 831.97 cycles/loop
CFS = 997.16 cycles/loop
TR = 304.74 cycles/loop
CFS = 1003.26 cycles/loop
gavl_cfs = 832.83 cycles/loop
Testing N=128
gavl_cfs = 897.33 cycles/loop
CFS = 1030.36 cycles/loop
TR = 295.65 cycles/loop
CFS = 1035.29 cycles/loop
gavl_cfs = 892.51 cycles/loop
Testing N=256
gavl_cfs = 963.17 cycles/loop
CFS = 1146.04 cycles/loop
TR = 295.35 cycles/loop
CFS = 1162.04 cycles/loop
gavl_cfs = 966.31 cycles/loop
Testing N=512
gavl_cfs = 1029.82 cycles/loop
CFS = 1218.34 cycles/loop
TR = 288.78 cycles/loop
CFS = 1257.97 cycles/loop
gavl_cfs = 1029.83 cycles/loop
Testing N=1024
gavl_cfs = 1091.76 cycles/loop
CFS = 1318.47 cycles/loop
TR = 287.74 cycles/loop
CFS = 1311.72 cycles/loop
gavl_cfs = 1093.29 cycles/loop
Testing N=2048
gavl_cfs = 1153.03 cycles/loop
CFS = 1398.84 cycles/loop
TR = 286.75 cycles/loop
CFS = 1438.68 cycles/loop
gavl_cfs = 1149.97 cycles/loop
There seems to be some difference from your numbers. This is with:
gcc version 4.1.2
and -O2. But then an Opteron can behave quite differently than a Duron on
a bench like this ;)
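For reference, the cycles/loop figure presumably comes from a loop shaped
roughly like the sketch below (an assumption about the benchmark, not code
taken from smart-queue-v-gavl.c): keep N entries queued and, per iteration,
remove the earliest entry and requeue it with a fresh key, timing the whole
thing with the TSC.

/* Assumed shape of the micro-benchmark; x86 only because of rdtsc. */
#include <stdint.h>

/* Provided by the queue implementation under test (CFS rbtree, GAVL, ...). */
extern void *queue_extract_first(void);
extern void queue_insert(void *entry, uint64_t key);
extern uint64_t next_key(void *entry);

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

uint64_t bench(unsigned long loops)
{
    uint64_t start = rdtsc();
    unsigned long i;

    for (i = 0; i < loops; i++) {
        void *e = queue_extract_first();    /* leftmost/earliest entry */

        queue_insert(e, next_key(e));       /* requeue with a new key */
    }
    return (rdtsc() - start) / loops;       /* cycles per loop */
}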
- Davide
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 5:16 ` Con Kolivas
@ 2007-04-16 5:48 ` Gene Heskett
0 siblings, 0 replies; 577+ messages in thread
From: Gene Heskett @ 2007-04-16 5:48 UTC (permalink / raw)
To: Con Kolivas
Cc: Ingo Molnar, Peter Williams, linux-kernel, Linus Torvalds,
Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner
On Monday 16 April 2007, Con Kolivas wrote:
And I snipped. Sorry, fellas.
Con's original submission was, to me, quite an improvement. But I have to say
it, and no denigration of your efforts is intended, Con, but you did 'pull the
trigger' and get this thing rolling by scratching the itch & drawing
attention to an ugly lack of user interactivity that had crept into the 2.6
family. So from me to Con, a tip of the hat, and a deep bow in your
direction; thank you. Now, you have done what you aimed to do, so please get
well.
I've now been through most of an amanda session using Ingo's "CFS" and I have
to say that it is another improvement over your 0.40 that is just as
obvious as your first patch was against the stock scheduler. No other
scheduler yet has allowed the full utilization of the cpu and maintained
user interactivity as well as this one has; my cpu is running about 5
degrees F hotter just from this effect alone. gzip, if the rest of the
system is in between tasks, is consistently showing around 95%, but let
anything else stick up its hand, like procmail etc., and gzip now dutifully
steps aside, dropping into the 40% range until procmail and spamd are done,
at which point there is no rest for the wicked and the cpu never gets a
chance to cool.
There was, just now, a pause of about 2 seconds, while amanda moved a tarball
from the holding disk area on /dev/hda to the vtapes disk on /dev/hdd, so
that would have been an I/O bound situation.
This one, Ingo, even without any other patches (and I think I did see one go by
in this thread which I didn't apply), is a definite keeper. Sweet, even.
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
A word to the wise is enough.
-- Miguel de Cervantes
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 5:36 ` Bill Huey
@ 2007-04-16 6:17 ` Nick Piggin
0 siblings, 0 replies; 577+ messages in thread
From: Nick Piggin @ 2007-04-16 6:17 UTC (permalink / raw)
To: Bill Huey
Cc: Arjan van de Ven, Mike Galbraith, Con Kolivas, Ingo Molnar,
ck list, Peter Williams, linux-kernel, Linus Torvalds,
Andrew Morton, Thomas Gleixner
On Sun, Apr 15, 2007 at 10:36:29PM -0700, Bill Huey wrote:
> On Sun, Apr 15, 2007 at 09:25:07AM -0700, Arjan van de Ven wrote:
> > Now this doesn't mean that people shouldn't be nice to each other, not
> > cooperate or steal credits, but I don't get the impression that that is
> > happening here. Ingo is taking part in the discussion with a counter
> > proposal for discussion *on the mailing list*. What more do you want??
>
> Con should have been CCed from the first moment this was put into motion
> to limit the perception of exclusion. That was mistake number one and a
> big-time failure to understand this dynamic. After all, it was Con's idea. Why
> the hell he was excluded from Ingo's development process is baffling to
> me and him (most likely).
Ingo's scheduler is completely different to any I've seen proposed
for Linux. And after he did an initial implementation, he did post
it to everyone.
Maybe something he said offended someone, but the process followed
is exactly how Linux kernel development works (ie. if you think you
can do better, then write the code). Sometimes you can give suggestions,
but other times if you come up with a different idea then it is
better just to do it yourself.
Con's code is still out there. If it is better than Ingo's then it
should win out. Nobody has a monopoly on schedulers or ideas or
posting patches.
> He put in a lot of effort into SD and his experiences with scheduling
> should still be seriously considered in this development process even if
> he doesn't write a single line of code from this moment on.
>
> What should have happened is that our very busy associate at RH by the
> name of Ingo Molnar should have leveraged more of Con's and Bill's work
> and used them as a proxy for his own ideas. They would have loved to have
> contributed more and our very busy Ingo Molnar would have gotten a lot
> of his work and ideas implemented without him even opening a single
> source file for editing. They would have happily done this work for
> Ingo. Ingo could have been used for something else more important like
> making KVM less of a freaking ugly hack and we all would have benefitted
> from this.
>
> He could have been working on SystemTap so that you stop losing accounts
> to Sun and Solaris 10's Dtrace. He could have been working with Riel to
> fix your butt ugly page scanning problem causing horrible contention via
> the Clock/Pro algorithm, etc... He could have been fixing the ugly futex
> rwsem mapping problem that's killing -rt and anything that uses Posix
> threads. He could have created a userspace thread control block (TCB)
> with Mr. Drepper so that we can turn off preemption in userspace
> (userspace per CPU local storage) and implement a very quick non-kernel
> crossing implementation of priority ceilings (userspace check for priority
> and flags at preempt_schedule() in the TCB) so that our -rt Posix API
> doesn't suck donkey shit... Need I say more?
Well that's some pretty strong criticism of Linux and of someone who
does a great deal to improve things... Let's stick to the topic of
schedulers in this thread and try keeping it constructive.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 5:27 ` Peter Williams
@ 2007-04-16 6:23 ` Peter Williams
2007-04-16 6:40 ` Peter Williams
0 siblings, 1 reply; 577+ messages in thread
From: Peter Williams @ 2007-04-16 6:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list,
linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
Peter Williams wrote:
>
> Are your new patches available somewhere for easy download or do I have
> to try to dig them out of the mailing list archive? Or could you mail
> them to me separately? I'm keen to see how your new scheduler proposal
> works.
Forget about this. I found the patch.
After a quick look, I like a lot of what I see especially the removal of
the dual arrays in the run queue.
Some minor suggestions:
1. having defined DEFAULT_PRIO in sched.h, shouldn't you use it to
initialize the task structure in init_task.h?
2. the on_rq field in the task structure is unnecessary as many years of
experience with ingosched in plugsched indicates that
!list_empty(&(p)->run_list) does the job provided list_del_init() is used
when dequeueing, and there is no noticeable overhead incurred so there's
no gain by caching the result. Also it removes the possibility of
errors creeping in due to the value of on_rq being inconsistent with the
task's actual state. (See the sketch after this list.)
3. having modular load balancing is a good idea but it should be
decoupled from the scheduler and provided as a separate interface. This
would enable different schedulers to use the same load balancer if they
desired.
4. why rename SCHED_OTHER to SCHED_FAIR? SCHED_OTHER's supposed to be
fair(ish) anyway.
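A minimal sketch of the idiom from point 2, assuming the mainline-style run_list field; the helper names here are made up for illustration:
#include <linux/list.h>
#include <linux/sched.h>
/* "Queued" simply means the run_list node is linked somewhere; this stays
 * correct only as long as every dequeue path uses list_del_init(). */
static inline int task_queued(struct task_struct *p)
{
	return !list_empty(&p->run_list);
}
/* The matching dequeue rule (the real dequeue paths obviously do more). */
static inline void dequeue_task_sketch(struct task_struct *p)
{
	list_del_init(&p->run_list);	/* re-initialise so task_queued() reads false */
}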
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [ck] Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
[not found] ` <b21f8390704152257v1d879cc3te0cfee5bf5d2bbf3@mail.gmail.com>
@ 2007-04-16 6:27 ` Nick Piggin
0 siblings, 0 replies; 577+ messages in thread
From: Nick Piggin @ 2007-04-16 6:27 UTC (permalink / raw)
To: Matthew Hawkins; +Cc: linux-kernel, ck list
On Mon, Apr 16, 2007 at 03:57:54PM +1000, Matthew Hawkins wrote:
> On 4/16/07, Nick Piggin <npiggin@suse.de> wrote:
> >
> >So, on to something productive, we have 3 candidates for a new scheduler
> >so
> >far. How do we decide which way to go? (and yes, I still think switchable
> >schedulers is wrong and a copout)
>
>
> I'm with you on that one. It sounds good as a concept, however there are
> various kernel structures etc. that simply cannot be altered at runtime,
> which throws away the only advantage I can see of plugsched - a test/debug
> framework.
>
> I think the best way is for those working on this stuff to keep producing
> their separate patches against mainline and people being encouraged to
> test. THEN
> (and here comes the fun part) subsystem maintainers have to be prepared to
> accept code that is not their own or that of their IRC buddies. I'm
> noticing this disturbing trend that Linux kernel development is becoming
> more and more like BSD where only the elite few ever get anywhere. Con
> Kolivas, having a medical degree rather than a CS degree, bruises the egos of those with CS
> degrees when he comes up with fairly clean, working, and widely-tested
> implementations of things like the staircase scheduler, R(SD)L, SCHED_ISO,
> swap prefetch, etc. when they can't. We should be encouraging guys like
The thing is, it is really hard for anybody to change anything in page
reclaim or CPU scheduler. A few people saying a change is good for them
doesn't really mean anything because of the huge amount of diversity in
usages.
I've had my own CPU scheduler for 4 years, and I and a few others think
it is better than mainline. I've tried to make many, many VM changes
that haven't gone in.
Add to that, I don't actually know or care what sort of education most
kernel hackers have. I do know at least one of the more brilliant ones
does not have a CS degree, and I was able to get quite a few things in
before I had a degree (eg. rewrote IO scheduler and multiprocessor
CPU scheduler).
> It's all about the patches, baby
I don't know what would give anyone the idea that it isn't... patches
and numbers.
Nick
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 6:23 ` Peter Williams
@ 2007-04-16 6:40 ` Peter Williams
2007-04-16 7:32 ` Ingo Molnar
0 siblings, 1 reply; 577+ messages in thread
From: Peter Williams @ 2007-04-16 6:40 UTC (permalink / raw)
To: Ingo Molnar
Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list,
linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
Peter Williams wrote:
> Peter Williams wrote:
>>
>> Are your new patches available somewhere for easy download or do I
>> have to try to dig them out of the mailing list archive? Or could you
>> mail them to me separately? I'm keen to see how your new scheduler
>> proposal works.
>
> Forget about this. I found the patch.
>
> After a quick look, I like a lot of what I see especially the removal of
> the dual arrays in the run queue.
>
> Some minor suggestions:
>
> 1. having defined DEFAULT_PRIO in sched.h, shouldn't you use it to
> initialize the task structure in init_task.h?
> 2. the on_rq field in the task structure is unnecessary as many years of
> experience with ingosched in plugsched indicates that
> !list_empty(&(p)->run_list) does the job provided list_del_init() is used
> when dequeueing, and there is no noticeable overhead incurred so there's
> no gain by caching the result. Also it removes the possibility of
> errors creeping in due to the value of on_rq being inconsistent with the
> task's actual state.
> 3. having modular load balancing is a good idea but it should be
> decoupled from the scheduler and provided as a separate interface. This
> would enable different schedulers to use the same load balancer if they
> desired.
> 4. why rename SCHED_OTHER to SCHED_FAIR? SCHED_OTHER's supposed to be
> fair(ish) anyway.
One more quick comment. The claim that there is no concept of time
slice in the new scheduler is only true in the sense of the rather
arcane implementation of time slices extant in the O(1) scheduler. Your
new parameter sched_granularity_ns is equivalent to the concept of time
slice in most other kernels that I've peeked inside and computing
literature in general (going back over several decades e.g. the magic
garden).
Welcome to the mainstream,
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 13:04 ` Ingo Molnar
@ 2007-04-16 7:16 ` Esben Nielsen
0 siblings, 0 replies; 577+ messages in thread
From: Esben Nielsen @ 2007-04-16 7:16 UTC (permalink / raw)
To: Ingo Molnar
Cc: Esben Nielsen, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner
On Sun, 15 Apr 2007, Ingo Molnar wrote:
>
> * Esben Nielsen <nielsen.esben@googlemail.com> wrote:
>
>> I took a brief look at it. Have you tested priority inheritance?
>
> yeah, you are right, it's broken at the moment, i'll fix it. But the
> good news is that i think PI could become cleaner via scheduling
> classes.
>
>> As far as I can see rt_mutex_setprio doesn't have much effect on
>> SCHED_FAIR/SCHED_BATCH. I am looking for a place where such a task
>> changes scheduler class when boosted in rt_mutex_setprio().
>
> i think via scheduling classes we dont have to do the p->policy and
> p->prio based gymnastics anymore, we can just have a clean look at
> p->sched_class and stack the original scheduling class into
> p->real_sched_class. It would probably also make sense to 'privatize'
> p->prio into the scheduling class. That way PI would be a pure property
> of sched_rt, and the PI scheduler would be driven purely by
> p->rt_priority, not by p->prio. That way all the normal_prio() kind of
> complications and interactions with SCHED_OTHER/SCHED_FAIR would be
> eliminated as well. What do you think?
>
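A rough sketch of the class-stacking idea quoted above; the real_sched_class field, the class object name and the omission of any dequeue/requeue handling are illustrative assumptions, not the actual patch:
void rt_mutex_setprio(struct task_struct *p, int prio)
{
	if (prio < MAX_RT_PRIO && p->sched_class != &rt_sched_class) {
		/* boosted into the RT range: remember the original class */
		p->real_sched_class = p->sched_class;
		p->sched_class = &rt_sched_class;
	} else if (prio >= MAX_RT_PRIO && p->real_sched_class) {
		/* boost released: restore whatever class the task had before */
		p->sched_class = p->real_sched_class;
		p->real_sched_class = NULL;
	}
	p->prio = prio;	/* requeueing of an already-queued task omitted here */
}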
Now I have not read your patch in detail. But I agree it would be nice to
have it more "OO" and remove cross references between schedulers. But
first one should consider whether PI between SCHED_FAIR tasks is
useful or not. Does PI among dynamic priorities make sense at all? I think it
does: on heavily loaded systems where a nice 19 might not get the CPU for
very long, a nice -20 task can be priority inverted for a very long
time.
But I see no need to take the dynamic part of the effective priorities
into account. The current/old solution of mapping the static nice values
into a global priority index which can incorporate the two scheduler
classes is probably good enough - it just has to be "switched on" again
:-)
But what about other scheduler classes which some people want to add in
the future? What about having a "cleaner design"?
My thought was to generalize the concept of 'priority' to be an
object (a struct prio) to be interpreted with help from a scheduler class
instead of a globally interpreted integer.
/* Order first by scheduler class, then let the class compare within itself. */
int compare_prio(struct prio *a, struct prio *b)
{
	if (a->sched_class->class_prio < b->sched_class->class_prio)
		return -1;
	if (a->sched_class->class_prio > b->sched_class->class_prio)
		return +1;
	/* same class: delegate to the class-specific comparison */
	return a->sched_class->compare_prio(a, b);
}
Problem 1: Performance.
Problem 2: Operations on a plist with these generalized priorities are not
bounded because the number of different priorities is not bounded.
Problem 2 could be solved by using a combined plist (for rt priorities)
and rbtree (for fair priorities) - making operations logarithmic just as
the fair scheduler itself. But that would take more memory for every
rtmutex.
I conclude that this is too complicated and go on to the obvious idea:
Use a global priority index where each scheduler class gets its own
range (rt: 0-99, fair: 100-139 :-). Let the scheduler class have a
function returning it instead of reading it directly from task_struct, such
that new scheduler classes can return their own numbers.
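A small sketch of that last idea, with a made-up hook name (global_prio) and the ranges above used only as examples; PI code could then compare the hook's return values without knowing which classes are involved:
/* Lower index == more important, across all scheduling classes. */
struct sched_class_sketch {
	int (*global_prio)(struct task_struct *p);
	/* ... the rest of the class operations ... */
};
static int rt_global_prio(struct task_struct *p)
{
	return MAX_RT_PRIO - 1 - p->rt_priority;	/* rt_priority 1..99 -> 98..0 */
}
static int fair_global_prio(struct task_struct *p)
{
	return p->static_prio;		/* already 100..139 for nice -20..19 */
}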
Esben
> Ingo
>
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 6:40 ` Peter Williams
@ 2007-04-16 7:32 ` Ingo Molnar
2007-04-16 8:54 ` Peter Williams
0 siblings, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-16 7:32 UTC (permalink / raw)
To: Peter Williams
Cc: Willy Tarreau, Pekka Enberg, Con Kolivas, ck list, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
* Peter Williams <pwil3058@bigpond.net.au> wrote:
> One more quick comment. The claim that there is no concept of time
> slice in the new scheduler is only true in the sense of the rather
> arcane implementation of time slices extant in the O(1) scheduler.
yeah. AFAIK most other mainstream OSs also still often use similarly
'arcane' concepts (i'm here ignoring literature, you can find everything
and its opposite suggested in literature) so i felt the need to point
out the difference ;) After all Linux is about doing a better mainstream
OS, it is not about beating the OS literature at lunacy ;-)
The precise statement would be: "there's no concept of giving out a
time-slice to a task and sticking to it unless a higher-prio task comes
along, nor is there a concept of having a low-res granularity
->time_slice thing. There is accurate accounting of how much CPU time a
task used up, and there is a granularity setting that together gives the
current task a fairness advantage of a given amount of nanoseconds -
which has similar [but not equivalent] effects to traditional timeslices
that most mainstream OSs use".
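As a very rough, hedged illustration of that "fairness advantage" (the structure, field and sysctl names below are placeholders, not necessarily what the patch uses):
#include <linux/types.h>
struct cfs_task_sketch {
	s64 wait_runtime;	/* placeholder: CPU time the task is currently "owed" */
};
static s64 sysctl_sched_granularity_ns = 5000000;	/* example value: 5ms */
/* Preempt the running task only once the best waiting task is owed more
 * CPU time than it by at least the granularity. */
static int should_preempt_sketch(struct cfs_task_sketch *curr,
				 struct cfs_task_sketch *leftmost)
{
	return leftmost->wait_runtime - curr->wait_runtime >
	       sysctl_sched_granularity_ns;
}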
> Your new parameter sched_granularity_ns is equivalent to the concept
> of time slice in most other kernels that I've peeked inside and
> computing literature in general (going back over several decades e.g.
> the magic garden).
note that you can set it to 0 and the box still functions - so
sched_granularity_ns, while useful for performance/bandwidth workloads,
isnt truly inherent to the design.
So in the announcement i just opted for a short sentence: "there's no
concept of timeslices", albeit like most short stentences it's not a
technically 100% accurate statement - but still it conveyed the intended
information more effectively to the interested lkml reader than the
longer version could ever have =B-)
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 7:32 ` Ingo Molnar
@ 2007-04-16 8:54 ` Peter Williams
0 siblings, 0 replies; 577+ messages in thread
From: Peter Williams @ 2007-04-16 8:54 UTC (permalink / raw)
To: Ingo Molnar
Cc: Willy Tarreau, Pekka Enberg, Con Kolivas, ck list, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
Ingo Molnar wrote:
> * Peter Williams <pwil3058@bigpond.net.au> wrote:
>
>> One more quick comment. The claim that there is no concept of time
>> slice in the new scheduler is only true in the sense of the rather
>> arcane implementation of time slices extant in the O(1) scheduler.
>
> yeah. AFAIK most other mainstream OSs also still often use similarly
> 'arcane' concepts (i'm here ignoring literature, you can find everything
> and its opposite suggested in literature) so i felt the need to point
> out the difference ;) After all Linux is about doing a better mainstream
> OS, it is not about beating the OS literature at lunacy ;-)
>
> The precise statement would be: "there's no concept of giving out a
> time-slice to a task and sticking to it unless a higher-prio task comes
> along,
I would have said "no concept of using time slices to implement nice"
which always seemed strange to me.
If it really does what you just said then a (malicious or otherwise) CPU
intensive task that never sleeps, once it got the CPU, would completely
hog the CPU.
> nor is there a concept of having a low-res granularity
> ->time_slice thing. There is accurate accounting of how much CPU time a
> task used up, and there is a granularity setting that together gives the
> current task a fairness advantage of a given amount of nanoseconds -
> which has similar [but not equivalent] effects to traditional timeslices
> that most mainstream OSs use".
Most traditional OSes have more or less fixed time slices and do the
scheduling by fiddling the dynamic priority.
Using total CPU used will also come to grief when used for long running
tasks. Eventually, even very low bandwidth tasks will accumulate enough
total CPU to look busy. The CPU bandwidth the task is using is what
needs to be controlled.
Or have I not looked closely enough at what sched_granularity_ns does?
Is it really a control for the decay rate of a CPU usage bandwidth metric?
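To illustrate the distinction Peter is drawing (recent bandwidth versus accumulated total), a decaying average behaves the way he suggests; the structure, field names and constants below are invented for the example and are not CFS's actual mechanism:
#include <linux/types.h>
struct usage_sketch {
	s64 avg_bw;	/* fixed point: 1024 == one full CPU of recent bandwidth */
};
/* Called once per sampling period: old usage decays away geometrically,
 * so a long-running but mostly idle task never comes to look busy. */
static void update_bandwidth(struct usage_sketch *u, u64 ran_ns, u64 period_ns)
{
	s64 recent = period_ns ? (s64)((ran_ns << 10) / period_ns) : 0;
	u->avg_bw += (recent - u->avg_bw) >> 3;	/* new = 7/8 old + 1/8 recent */
}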
>
>> Your new parameter sched_granularity_ns is equivalent to the concept
>> of time slice in most other kernels that I've peeked inside and
>> computing literature in general (going back over several decades e.g.
>> the magic garden).
>
> note that you can set it to 0 and the box still functions - so
> sched_granularity_ns, while useful for performance/bandwidth workloads,
> isnt truly inherent to the design.
Just like my SPA schedulers. But if you set it to zero you'll get a
fairly high context switch rate with associated overhead, won't you?
>
> So in the announcement i just opted for a short sentence: "there's no
> concept of timeslices", albeit like most short sentences it's not a
> technically 100% accurate statement - but still it conveyed the intended
> information more effectively to the interested lkml reader than the
> longer version could ever have =B-)
I hope that I implied that I was being picky :-) (I meant to -- imply I
was being picky, that is).
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 9:06 ` Ingo Molnar
@ 2007-04-16 10:00 ` Ingo Molnar
0 siblings, 0 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-16 10:00 UTC (permalink / raw)
To: Bill Huey
Cc: Mike Galbraith, Con Kolivas, ck list, Peter Williams,
linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
Peter Williams, Arjan van de Ven, Thomas Gleixner
* Ingo Molnar <mingo@elte.hu> wrote:
> guys, please calm down. Judging by the number of contributions to
> sched.c the main folks who are not 'observers' here and who thus have
> an unalienable right to be involved in a nasty flamewar about
> scheduler interactivity are Con, Mike, Nick and me ;-) Everyone else
> is just a happy bystander, ok? ;-)
just to make sure: this is a short (and incomplete) list of contributors
related to scheduler interactivity code. The full list of contributors
to sched.c includes many other people as well: Peter, Suresh, Christoph,
Kenneth and many others. Even the git logs, which only span 2 years out
of 15, already list 79 individual contributors to kernel/sched.c.
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 5:09 ` Peter Williams
@ 2007-04-16 11:04 ` William Lee Irwin III
2007-04-16 12:55 ` Peter Williams
0 siblings, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-16 11:04 UTC (permalink / raw)
To: Peter Williams
Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
William Lee Irwin III wrote:
>> Driver models for scheduling are not so far out. AFAICS it's largely a
>> tug-of-war over design goals, e.g. maintaining per-cpu runqueues and
>> switching out intra-queue policies vs. switching out whole-system
>> policies, SMP handling and all. Whether this involves load balancing
>> depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x
>> scheduler module, for instance, would not have a load balancer at all,
>> as it has only one global runqueue. There are other sorts of policies
>> wanting significant changes to SMP handling vs. the stock load
>> balancing.
On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
> Well a single run queue removes the need for load balancing but has
> scalability issues on large systems. Personally, I think something in
> between would be the best solution i.e. multiple run queues but more
> than one CPU per run queue. I think that this would be a particularly
> good solution to the problems introduced by hyper threading and multi
> core systems and also NUMA systems. E.g. if all CPUs in a hyper thread
> package are using the one queue then the case where one CPU is trying to
> run a high priority task and the other a low priority task (i.e. the
> cases that the sleeping dependent mechanism tried to address) is less
> likely to occur.
This wasn't meant to sing the praises of the 2.4.x scheduler; it was
rather meant to point out that the 2.4.x scheduler, among others, is
unimplementable within the framework if it assumes per-cpu runqueues.
More plausibly useful single-queue schedulers would likely use a vastly
different policy and attempt to carry out all queue manipulations via
lockless operations.
On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
> By the way, I think that it's a very bad idea for the scheduling
> mechanism and the load balancing mechanism to be coupled. The anomalies
> that will be experienced and the attempts to make ad hoc fixes for them
> will lead to complexity spiralling out of control.
This is clearly unavoidable in the case of gang scheduling. There is
simply no other way to schedule N tasks which must all be run
simultaneously when they run at all on N cpus of the system without
such coupling and furthermore at an extremely intimate level,
particularly when multiple competing gangs must be scheduled in such
a fashion.
William Lee Irwin III wrote:
>> Where I had a significant need for
>> mucking with the entire concept of how SMP was handled, this is rather
>> different.
On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
> Yes, I went with the idea of intra run queue scheduling being orthogonal
> to load balancing for two reasons:
> 1. I think that coupling them is a bad idea from the complexity POV, and
> 2. it's enough of a battle fighting for modifications to one bit of the
> code without trying to do it to two simultaneously.
As nice as that sounds, such a code structure would've precluded the
entire raison d'etre of the patch, i.e. gang scheduling.
William Lee Irwin III wrote:
>> At this point I'm questioning the relevance of my own work,
>> though it was already relatively marginal as it started life as an
>> attempt at a sort of debug patch to help gang scheduling (which is in
>> itself a rather marginally relevant feature to most users) code along.
On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
> The main commercial plug in scheduler used with the run time loadable
> module scheduler that I mentioned earlier did gang scheduling (at the
> insistence of the Tru64 kernel folks). As this scheduler was a
> hierarchical "fair share" scheduler: i.e. allocating CPU "fairly"
> ("unfairly" really in according to an allocation policy) among higher
> level entities such as users, groups and applications as well as
> processes; it was fairly easy to make it a gang scheduler by modifying
> it to give all of a process's threads the same priority based on the
> process's CPU usage rather than different priorities based on the
> threads' usage rates. In fact, it would have been possible to select
> between gang and non gang on a per process basis if that was considered
> desirable.
> The fact that threads and processes are distinct entities on Tru64 and
> Solaris made this easier to do on them than on Linux.
> My experience with this scheduler leads me to believe that to achieve
> gang scheduling and fairness, etc. you need (usage) statistics based
> schedulers.
This does not appear to make sense unless it's based on an incorrect
use of the term "gang scheduling." I'm referring to a gang as a set of
tasks (typically restricted to threads of the same process) which must
all be considered runnable or unrunnable simultaneously, and are for
the sake of performance required to all actually be run simultaneously.
This means a gang of N threads, when run, must run on N processors at
once. A time and a set of processors must be chosen for any time
interval where the gang is running. This interacts with load balancing
by needing to choose the cpus to run the gang on, and also arranging
for a set of cpus available for the gang to use to exist by means of
migrating tasks off the chosen cpus.
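For concreteness, the kind of object this definition implies might look something like the sketch below; it is entirely illustrative and not taken from any existing patch:
struct gang_sketch {
	struct list_head tasks;		/* members: runnable/unrunnable only as a unit */
	int nr_tasks;			/* N: number of CPUs needed whenever it runs */
	cpumask_t cpus;			/* CPUs the balancer has cleared for this slot */
};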
William Lee Irwin III wrote:
>> Kernel compiles not so useful a benchmark. SDET, OAST, AIM7, etc. are
>> better ones. I'd not bother citing kernel compile results.
On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
> spa_svr actually does its best work when the system isn't fully loaded
> as the type of improvement it strives to achieve (minimizing on queue
> wait time) hasn't got much room to manoeuvre when the system is fully
> loaded. Therefore, the fact that it's 1% better even in these
> circumstances is a good result and also indicates that the overhead for
> keeping the scheduling statistics it uses for its decision making is
> well spent. Especially, when you consider that the total available room
> for improvement on this benchmark is less than 3%.
None of these benchmarks require the system to be fully loaded. They
are, on the other hand, vastly more realistic simulated workloads than
kernel compiles, and furthermore are actually developed as benchmarks,
with in some cases even measurements of variance, iteration to
convergence, and similar such things that make them actually scientific.
On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
> To elaborate, the motivation for this scheduler was acquired from the
> observation of scheduling statistics (in particular, on queue wait time)
> on systems running at about 30% to 50% load. Theoretically, at these
> load levels there should be no such waiting but the statistics show that
> there is considerable waiting (sometimes as high as 30% to 50%). I put
> this down to "lack of serendipity" e.g. everyone sleeping at the same
> time and then trying to run at the same time would be complete lack of
> serendipity. On the other hand, if everyone is synced then there would
> be total serendipity.
> Obviously, from the POV of a client, time the server task spends waiting
> on the queue adds to the response time for any request that has been
> made so reduction of this time on a server is a good thing(tm). Equally
> obviously, trying to achieve this synchronization by asking the tasks to
> cooperate with each other is not a feasible solution and some external
> influence needs to be exerted and this is what spa_svr does -- it nudges
> the scheduling order of the tasks in a way that makes them become well
> synced.
This all sounds like a relatively good idea. So it's good for throughput
vs. latency or otherwise not particularly interactive. No big deal, just
use it where it makes sense.
On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
> Unfortunately, this is not a good scheduler for an interactive system as
> it minimizes the response times for ALL tasks (and the system as a
> whole) and this can result in increased response time for some
> interactive tasks (clunkiness) which annoys interactive users. When you
> start fiddling with this scheduler to bring back "interactive
> unfairness" you kill a lot of its superior low overall wait time
> performance.
> So this is why I think "horses for courses" schedulers are worth while.
I have no particular objection to using an appropriate scheduler for the
system's workload. I also have little or no preference as to how that's
accomplished overall. But I really think that if we want to push
pluggable scheduling it should load schedulers as kernel modules on the
fly and so on versus pure /proc/ tunables and a compiled-in set of
alternatives.
William Lee Irwin III wrote:
>> In any event, I'm not sure what to say about different schedulers for
>> different aims. My intentions with plugsched were not centered around
>> production usage or intra-queue policy. I'm relatively indifferent to
>> the notion of having pluggable CPU schedulers, intra-queue or otherwise,
>> in mainline. I don't see any particular harm in it, but neither am I
>> particularly motivated to have it in.
On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
> If you look at the struct sched_spa_child in the file
> include/linux/sched_spa.h you'll see that the interface for switching
> between the various SPA schedulers is quite small and making them
> runtime switchable would be easy (I haven't done this in cpusched as I
> wanted to keep the same interface for switching schedulers for all
> schedulers: i.e. all run time switchable or none run time switchable; as
> the main aim of plugsched had become a mechanism for evaluating
> different intra queue scheduling designs.)
I remember actually looking at this, and I would almost characterize
the differences between the SPA schedulers as a tunable parameter. I
have a different concept of what pluggability means from how the SPA
schedulers were switched, but no particular objection to the method
given the commonalities between them.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 23:54 ` William Lee Irwin III
@ 2007-04-16 11:24 ` Ingo Molnar
2007-04-16 13:46 ` William Lee Irwin III
0 siblings, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-16 11:24 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel,
Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox
* William Lee Irwin III <wli@holomorphy.com> wrote:
> On Sun, Apr 15, 2007 at 09:57:48PM +0200, Ingo Molnar wrote:
> > Oh I was very much testing "CPU bandwidth allocation as influenced by
> > nice numbers" - it's one of the basic things i do when modifying the
> > scheduler. An automated tool, while nice (all automation is nice)
> > wouldnt necessarily show such bugs though, because here too it needed
> > thousands of running tasks to trigger in practice. Any volunteers? ;)
>
> Worse comes to worse I might actually get around to doing it myself.
> Any more detailed descriptions of the test for a rainy day?
the main complication here is that the handling of nice levels is still
typically a 2nd or 3rd degree design factor when writing schedulers. The
reason isnt carelessness, the reason is simply that users typically only
care about a single nice level: the one that all tasks run under by
default.
Also, often there's just one or two good ways to attack the problem
within a given scheduler approach and the quality of nice levels often
suffers under other, more important design factors like performance.
This means that for example for the vanilla scheduler the distribution
of CPU power depends on HZ, on the number of tasks and on the scheduling
pattern. The distribution of CPU power amongst nice levels is basically
a function of _everything_. That makes any automated test pretty
challenging. Both with SD and with CFS there's a good chance to actually
formalize the meaning of nice levels, but i'd not go as far as to
mandate any particular behavior by rigidly saying "pass this automated
tool, else ...", other than "make nice levels resonable". All the other
more formal CPU resource limitation techniques are then a matter of
CKRM-alike patches, which offer much more finegrained mechanisms than
pure nice levels anyway.
so to answer your question: it's pretty much freely defined. Make up
your mind about it and figure out the ways how people use nice levels
and think about which aspects of that experience are worth testing for
intelligently.
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 22:49 ` Ismail Dönmez
2007-04-15 23:23 ` Arjan van de Ven
@ 2007-04-16 11:58 ` Ingo Molnar
2007-04-16 12:02 ` Ismail Dönmez
1 sibling, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-16 11:58 UTC (permalink / raw)
To: Ismail Dönmez
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* Ismail Dönmez <ismail@pardus.org.tr> wrote:
> Tested this on top of Linus' GIT tree but the system gets very
> unresponsive during high disk i/o using ext3 as filesystem; even
> writing a 300mb file to a usb disk (iPod actually) has the same
> effect.
hm. Is this an SMP system+kernel by any chance?
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 11:58 ` Ingo Molnar
@ 2007-04-16 12:02 ` Ismail Dönmez
0 siblings, 0 replies; 577+ messages in thread
From: Ismail Dönmez @ 2007-04-16 12:02 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Monday 16 April 2007 14:58:54 Ingo Molnar wrote:
> * Ismail Dönmez <ismail@pardus.org.tr> wrote:
> > Tested this on top of Linus' GIT tree but the system gets very
> > unresponsive during high disk i/o using ext3 as filesystem; even
> > writing a 300mb file to a usb disk (iPod actually) has the same
> > effect.
>
> hm. Is this an SMP system+kernel by any chance?
Nope, the kernel and the system are both UP.
Regards,
ismail
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 11:04 ` William Lee Irwin III
@ 2007-04-16 12:55 ` Peter Williams
2007-04-16 23:10 ` Michael K. Edwards
[not found] ` <20070416135915.GK8915@holomorphy.com>
0 siblings, 2 replies; 577+ messages in thread
From: Peter Williams @ 2007-04-16 12:55 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>>> Driver models for scheduling are not so far out. AFAICS it's largely a
>>> tug-of-war over design goals, e.g. maintaining per-cpu runqueues and
>>> switching out intra-queue policies vs. switching out whole-system
>>> policies, SMP handling and all. Whether this involves load balancing
>>> depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x
>>> scheduler module, for instance, would not have a load balancer at all,
>>> as it has only one global runqueue. There are other sorts of policies
>>> wanting significant changes to SMP handling vs. the stock load
>>> balancing.
>
> On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
>> Well a single run queue removes the need for load balancing but has
>> scalability issues on large systems. Personally, I think something in
>> between would be the best solution i.e. multiple run queues but more
>> than one CPU per run queue. I think that this would be a particularly
>> good solution to the problems introduced by hyper threading and multi
>> core systems and also NUMA systems. E.g. if all CPUs in a hyper thread
>> package are using the one queue then the case where one CPU is trying to
>> run a high priority task and the other a low priority task (i.e. the
>> cases that the sleeping dependent mechanism tried to address) is less
>> likely to occur.
>
> This wasn't meant to sing the praises of the 2.4.x scheduler; it was
> rather meant to point out that the 2.4.x scheduler, among others, is
> unimplementable within the framework if it assumes per-cpu runqueues.
> More plausibly useful single-queue schedulers would likely use a vastly
> different policy and attempt to carry out all queue manipulations via
> lockless operations.
>
>
> On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
>> By the way, I think that it's a very bad idea for the scheduling
>> mechanism and the load balancing mechanism to be coupled. The anomalies
>> that will be experienced and the attempts to make ad hoc fixes for them
>> will lead to complexity spiralling out of control.
>
> This is clearly unavoidable in the case of gang scheduling. There is
> simply no other way to schedule N tasks which must all be run
> simultaneously when they run at all on N cpus of the system without
> such coupling and furthermore at an extremely intimate level,
> particularly when multiple competing gangs must be scheduled in such
> a fashion.
I can't see the logic here or why you would want to do such a thing. It
certainly doesn't coincide with what I interpret "gang scheduling" to mean.
>
>
> William Lee Irwin III wrote:
>>> Where I had a significant need for
>>> mucking with the entire concept of how SMP was handled, this is rather
>>> different.
>
> On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
>> Yes, I went with the idea of intra run queue scheduling being orthogonal
>> to load balancing for two reasons:
>> 1. I think that coupling them is a bad idea from the complexity POV, and
>> 2. it's enough of a battle fighting for modifications to one bit of the
>> code without trying to do it to two simultaneously.
>
> As nice as that sounds, such a code structure would've precluded the
> entire raison d'etre of the patch, i.e. gang scheduling.
Not for what I understand "gang scheduling" to mean. As I understand it
the constraints of gang scheduling are nowhere near as strict as you
seem to think they are. And for what it's worth I don't think that what
you think it means is in any sense a reasonable target.
>
>
> William Lee Irwin III wrote:
>>> At this point I'm questioning the relevance of my own work,
>>> though it was already relatively marginal as it started life as an
>>> attempt at a sort of debug patch to help gang scheduling (which is in
>>> itself a rather marginally relevant feature to most users) code along.
>
> On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
>> The main commercial plug in scheduler used with the run time loadable
>> module scheduler that I mentioned earlier did gang scheduling (at the
>> insistence of the Tru64 kernel folks). As this scheduler was a
>> hierarchical "fair share" scheduler: i.e. allocating CPU "fairly"
>> ("unfairly" really in according to an allocation policy) among higher
>> level entities such as users, groups and applications as well as
>> processes; it was fairly easy to make it a gang scheduler by modifying
>> it to give all of a process's threads the same priority based on the
>> process's CPU usage rather than different priorities based on the
>> threads' usage rates. In fact, it would have been possible to select
>> between gang and non gang on a per process basis if that was considered
>> desirable.
>> The fact that threads and processes are distinct entities on Tru64 and
>> Solaris made this easier to do on them than on Linux.
>> My experience with this scheduler leads me to believe that to achieve
>> gang scheduling and fairness, etc. you need (usage) statistics based
>> schedulers.
>
> This does not appear to make sense unless it's based on an incorrect
> use of the term "gang scheduling."
It's become obvious that we mean different things.
> I'm referring to a gang as a set of
> tasks (typically restricted to threads of the same process) which must
> all be considered runnable or unrunnable simultaneously, and are for
> the sake of performance required to all actually be run simultaneously.
> This means a gang of N threads, when run, must run on N processors at
> once. A time and a set of processors must be chosen for any time
> interval where the gang is running. This interacts with load balancing
> by needing to choose the cpus to run the gang on, and also arranging
> for a set of cpus available for the gang to use to exist by means of
> migrating tasks off the chosen cpus.
Sounds like a job for the load balancer NOT the scheduler.
Also I can't see you meeting such strict constraints without making the
tasks all SCHED_FIFO.
>
>
> William Lee Irwin III wrote:
>>> Kernel compiles not so useful a benchmark. SDET, OAST, AIM7, etc. are
>>> better ones. I'd not bother citing kernel compile results.
>
> On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
>> spa_svr actually does its best work when the system isn't fully loaded
>> as the type of improvement it strives to achieve (minimizing on queue
>> wait time) hasn't got much room to manoeuvre when the system is fully
>> loaded. Therefore, the fact that it's 1% better even in these
>> circumstances is a good result and also indicates that the overhead for
>> keeping the scheduling statistics it uses for its decision making is
>> well spent. Especially, when you consider that the total available room
>> for improvement on this benchmark is less than 3%.
>
> None of these benchmarks require the system to be fully loaded. They
> are, on the other hand, vastly more realistic simulated workloads than
> kernel compiles, and furthermore are actually developed as benchmarks,
> with in some cases even measurements of variance, iteration to
> convergence, and similar such things that make them actually scientific.
>
>
> On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
>> To elaborate, the motivation for this scheduler was acquired from the
>> observation of scheduling statistics (in particular, on queue wait time)
>> on systems running at about 30% to 50% load. Theoretically, at these
>> load levels there should be no such waiting but the statistics show that
>> there is considerable waiting (sometimes as high as 30% to 50%). I put
>> this down to "lack of serendipity" e.g. everyone sleeping at the same
>> time and then trying to run at the same time would be complete lack of
>> serendipity. On the other hand, if everyone is synced then there would
>> be total serendipity.
>> Obviously, from the POV of a client, time the server task spends waiting
>> on the queue adds to the response time for any request that has been
>> made so reduction of this time on a server is a good thing(tm). Equally
>> obviously, trying to achieve this synchronization by asking the tasks to
>> cooperate with each other is not a feasible solution and some external
>> influence needs to be exerted and this is what spa_svr does -- it nudges
>> the scheduling order of the tasks in a way that makes them become well
>> synced.
>
> This all sounds like a relatively good idea. So it's good for throughput
> vs. latency or otherwise not particularly interactive. No big deal, just
> use it where it makes sense.
>
>
> On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
>> Unfortunately, this is not a good scheduler for an interactive system as
>> it minimizes the response times for ALL tasks (and the system as a
>> whole) and this can result in increased response time for some
>> interactive tasks (clunkiness) which annoys interactive users. When you
>> start fiddling with this scheduler to bring back "interactive
>> unfairness" you kill a lot of its superior low overall wait time
>> performance.
>> So this is why I think "horses for courses" schedulers are worth while.
>
> I have no particular objection to using an appropriate scheduler for the
> system's workload. I also have little or no preference as to how that's
> accomplished overall. But I really think that if we want to push
> pluggable scheduling it should load schedulers as kernel modules on the
> fly and so on versus pure /proc/ tunables and a compiled-in set of
> alternatives.
>
>
> William Lee Irwin III wrote:
>>> In any event, I'm not sure what to say about different schedulers for
>>> different aims. My intentions with plugsched were not centered around
>>> production usage or intra-queue policy. I'm relatively indifferent to
>>> the notion of having pluggable CPU schedulers, intra-queue or otherwise,
>>> in mainline. I don't see any particular harm in it, but neither am I
>>> particularly motivated to have it in.
>
> On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
>> If you look at the struct sched_spa_child in the file
>> include/linux/sched_spa.h you'll see that the interface for switching
>> between the various SPA schedulers is quite small and making them
>> runtime switchable would be easy (I haven't done this in cpusched as I
>> wanted to keep the same interface for switching schedulers for all
>> schedulers: i.e. all run time switchable or none run time switchable; as
>> the main aim of plugsched had become a mechanism for evaluating
>> different intra queue scheduling designs.)
>
> I remember actually looking at this, and I would almost characterize
> the differences between the SPA schedulers as a tunable parameter. I
> have a different concept of what pluggability means from how the SPA
> schedulers were switched, but no particular objection to the method
> given the commonalities between them.
Yes, that's the way I look at them (in fact, in Zaphod that's exactly
what they were -- i.e. Zaphod could be made to behave like various SPA
schedulers by fiddling its run time parameters). They illustrate (to my
mind) that once you get rid of the O(1) scheduler and replace it with a
simple mechanism such as SPA (where there's a small number of points
where the scheduling discipline gets to do its thing rather than being
interspersed willy nilly throughout the rest of the code) adding run
time switchable "horses for courses" scheduler disciplines becomes
simple. I think that the simplifications in Ingo's new scheduler (whose
scheduling classes now look a lot like Solaris's and its predecessor
OSes' scheduler classes) may make it possible to have switchable
scheduling disciplines within a scheduling class.
I think that something similar (i.e. switchability) could be done for
load balancing so that different load balancers could be used when
required. By keeping this load balancing functionality orthogonal to
the intra run queue scheduling disciplines you increase the number of
options available.
As I see it, if the scheduling discipline in use does its job properly
within a run queue and the load balancer does its job of keeping the
weighted load/demand on each run queue roughly equal (except where it
has to do otherwise for your version of "gang scheduling") then the
overall outcome will meet expectations. Note that I talk of run queues
not CPUs as I think a shift to multiple CPUs per run queue may be a good
idea.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 11:24 ` Ingo Molnar
@ 2007-04-16 13:46 ` William Lee Irwin III
0 siblings, 0 replies; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-16 13:46 UTC (permalink / raw)
To: Ingo Molnar
Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel,
Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox
* William Lee Irwin III <wli@holomorphy.com> wrote:
>> Worse comes to worse I might actually get around to doing it myself.
>> Any more detailed descriptions of the test for a rainy day?
On Mon, Apr 16, 2007 at 01:24:40PM +0200, Ingo Molnar wrote:
> the main complication here is that the handling of nice levels is still
> typically a 2nd or 3rd degree design factor when writing schedulers. The
> reason isnt carelessness, the reason is simply that users typically only
> care about a single nice level: the one that all tasks run under by
> default.
I'm a bit unconvinced here. Support for prioritization is a major
scheduler feature IMHO.
On Mon, Apr 16, 2007 at 01:24:40PM +0200, Ingo Molnar wrote:
> Also, often there's just one or two good ways to attack the problem
> within a given scheduler approach and the quality of nice levels often
> suffers under other, more important design factors like performance.
> This means that for example for the vanilla scheduler the distribution
> of CPU power depends on HZ, on the number of tasks and on the scheduling
> pattern. The distribution of CPU power amongst nice levels is basically
> a function of _everything_. That makes any automated test pretty
> challenging. Both with SD and with CFS there's a good chance to actually
> formalize the meaning of nice levels, but i'd not go as far as to
> mandate any particular behavior by rigidly saying "pass this automated
> tool, else ...", other than "make nice levels reasonable". All the other
> more formal CPU resource limitation techniques are then a matter of
> CKRM-alike patches, which offer much more finegrained mechanisms than
> pure nice levels anyway.
Some of the issues with respect to the number of tasks and scheduling
patterns can be made part of the test; one can furthermore insist that
the system be quiescent in a variety of ways. I'm not convinced that
formalization of nice levels is a bad idea. They're the standard UNIX
prioritization facility, and it should work with some definite value
of "work."
Even supposing one doesn't care to bolt down the semantics of nice
levels, there should at least be some awareness of what those semantics
are and when and how they're changing. So in that respect a test for
CPU bandwidth distribution according to nice level remains valuable
even supposing that the semantics aren't required to be rigidly fixed.
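A sketch of the sort of userspace test being discussed, with all specifics (the nice levels, the 60-second window, reading utime from /proc) chosen arbitrarily for illustration:
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/resource.h>
#include <sys/types.h>
int main(void)
{
	int levels[] = { 0, 5, 10, 15, 19 };
	int n = sizeof(levels) / sizeof(levels[0]);
	pid_t pids[5];
	int i;
	for (i = 0; i < n; i++) {
		pids[i] = fork();
		if (pids[i] == 0) {
			setpriority(PRIO_PROCESS, 0, levels[i]);
			for (;;)
				;	/* pure CPU hog */
		}
	}
	sleep(60);	/* let them compete on an otherwise quiescent box */
	for (i = 0; i < n; i++) {
		char path[64];
		unsigned long utime = 0;
		FILE *f;
		snprintf(path, sizeof(path), "/proc/%d/stat", (int)pids[i]);
		f = fopen(path, "r");
		if (f) {
			/* utime is field 14 of /proc/<pid>/stat */
			fscanf(f, "%*d %*s %*c %*d %*d %*d %*d %*d "
				  "%*u %*u %*u %*u %*u %lu", &utime);
			fclose(f);
		}
		printf("nice %2d: %lu ticks\n", levels[i], utime);
		kill(pids[i], SIGKILL);
	}
	return 0;
}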
As far as CKRM goes, I'm wild about it. I wish things would get in
shape to be merged (if they're not already) and merged ASAP on that
front. I think with so much agreement in concept we can work with
changing out implementations as-needed with it sitting in mainline once
the user API/ABI is decided upon, and I think it already is.
I'm not entirely convinced CKRM answers this, though. If the scheduler
can't support nice levels, how is it supposed to support prioritization
or CPU bandwidth allocation according to CKRM configurations? I'm
relatively certain schedulers must be able to support prioritization
with deterministic CPU bandwidth as essential functionality. This is,
of course, not to say my certainty about things sets the standards for
what testcases are considered meaningful and valid.
On Mon, Apr 16, 2007 at 01:24:40PM +0200, Ingo Molnar wrote:
> so to answer your question: it's pretty much freely defined. Make up
> your mind about it and figure out the ways how people use nice levels
> and think about which aspects of that experience are worth testing for
> intelligently.
Looking for usage cases is a good idea; I'll do that before coding any
testcase for nice semantics.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 3:03 ` Nick Piggin
@ 2007-04-16 14:28 ` Matt Mackall
2007-04-17 3:31 ` Nick Piggin
0 siblings, 1 reply; 577+ messages in thread
From: Matt Mackall @ 2007-04-16 14:28 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel,
Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner
On Mon, Apr 16, 2007 at 05:03:49AM +0200, Nick Piggin wrote:
> I'd prefer if we kept a single CPU scheduler in mainline, because I
> think that simplifies analysis and focuses testing.
I think you'll find something like 80-90% of the testing will be done
on the default choice, even if other choices exist. So you really
won't have much of a problem here.
But when the only choice for other schedulers is to go out-of-tree,
then only 1% of the people will try it out and those people are
guaranteed to be the ones who saw scheduling problems in mainline.
So the alternative won't end up getting any testing on many of the
workloads that work fine in mainline, so their feedback won't tell
you very much at all.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 21:31 ` Matt Mackall
2007-04-16 3:03 ` Nick Piggin
@ 2007-04-16 15:45 ` William Lee Irwin III
1 sibling, 0 replies; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-16 15:45 UTC (permalink / raw)
To: Matt Mackall
Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
On Sun, Apr 15, 2007 at 04:31:54PM -0500, Matt Mackall wrote:
> That's irrelevant. Plugsched was an attempt to get alternative
> schedulers exposure in mainline. I know, because I remember
> encouraging Bill to pursue it. Not only did you veto plugsched (which
> may have been a perfectly reasonable thing to do), but you also vetoed
> the whole concept of multiple schedulers in the tree too. "We don't
> want to balkanize the scheduling landscape".
> And that latter part is what I'm claiming has set us back for years.
> It's not a technical argument but a strategic one. And it's just not a
> good strategy.
[... excellent post trimmed...]
These are some rather powerful arguments. I think I'll actually start
looking at plugsched again.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 15:26 ` William Lee Irwin III
@ 2007-04-16 15:55 ` Chris Friesen
2007-04-16 16:13 ` William Lee Irwin III
` (2 more replies)
0 siblings, 3 replies; 577+ messages in thread
From: Chris Friesen @ 2007-04-16 15:55 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Ingo Molnar,
Con Kolivas, ck list, Peter Williams, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
William Lee Irwin III wrote:
> The sorts of explicit decisions I'd like to be made for these are:
> (1) In a mixture of tasks with varying nice numbers, a given nice number
> corresponds to some share of CPU bandwidth. Implementations
> should not have the freedom to change this arbitrarily according
> to some intention.
The first question that comes to my mind is whether nice levels should
be linear or not. I would lean towards nonlinear as it allows a wider
range (although of course at the expense of precision). Maybe something
like "each nice level gives X times the cpu of the previous"? I think a
value of X somewhere between 1.15 and 1.25 might be reasonable.
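To get a rough feel for what such a mapping produces, the sketch below
assumes a ratio of 1.25 and normalises nice 0 to a weight of 1024; both
constants are illustrative choices for the example, not values taken from
any of the schedulers being discussed (build with -lm):
#include <math.h>
#include <stdio.h>

/* Illustrative only: each nice level is worth 1.25x the CPU share of the
 * next one, normalised so that nice 0 maps to 1024. */
int main(void)
{
	int nice;

	for (nice = -20; nice <= 19; nice++)
		printf("nice %3d -> relative weight %8.0f\n",
		       nice, 1024.0 * pow(1.25, -nice));
	return 0;
}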
What about also having something that looks at latency, and how latency
changes with niceness?
What about specifying the timeframe over which the cpu bandwidth is
measured? I currently have a system where the application designers
would like it to be totally fair over a period of 1 second. As you can
imagine, mainline doesn't do very well in this case.
Chris
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 15:55 ` Chris Friesen
@ 2007-04-16 16:13 ` William Lee Irwin III
2007-04-17 0:04 ` Peter Williams
2007-04-17 13:07 ` James Bruce
2 siblings, 0 replies; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-16 16:13 UTC (permalink / raw)
To: Chris Friesen
Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Ingo Molnar,
Con Kolivas, ck list, Peter Williams, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
William Lee Irwin III wrote:
>> The sorts of explicit decisions I'd like to be made for these are:
>> (1) In a mixture of tasks with varying nice numbers, a given nice number
>> corresponds to some share of CPU bandwidth. Implementations
>> should not have the freedom to change this arbitrarily according
>> to some intention.
On Mon, Apr 16, 2007 at 09:55:14AM -0600, Chris Friesen wrote:
> The first question that comes to my mind is whether nice levels should
> be linear or not. I would lean towards nonlinear as it allows a wider
> range (although of course at the expense of precision). Maybe something
> like "each nice level gives X times the cpu of the previous"? I think a
> value of X somewhere between 1.15 and 1.25 might be reasonable.
> What about also having something that looks at latency, and how latency
> changes with niceness?
> What about specifying the timeframe over which the cpu bandwidth is
> measured? I currently have a system where the application designers
> would like it to be totally fair over a period of 1 second. As you can
> imagine, mainline doesn't do very well in this case.
It's unclear how latency enters the picture as the semantics of nice
levels relevant to such are essentially priority preemption, which is
not particularly easy to mess up. I suppose tests to ensure priority
preemption occurs properly are in order.
I don't really have a preference regarding specific semantics for nice
numbers, just that they should be deterministic and specified somewhere.
It's not really for us to decide what those semantics are as it's more
of a userspace ABI/API issue.
The timeframe is also relevant, but I suspect it's more of a performance
metric than a strict requirement.
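As a starting point, the bandwidth half of such a testcase can be very
small: pin two busy loops to one CPU at different nice levels and compare
how far each gets over a fixed interval. A minimal sketch follows; the nice
values, the 10-second window and the single-CPU pinning are arbitrary
choices for the sketch, not part of any proposed semantics:
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child: pin to CPU 0, set the requested nice level, spin and count. */
static void burn(volatile unsigned long *counter, int niceval)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	sched_setaffinity(0, sizeof(set), &set);
	nice(niceval);
	for (;;)
		(*counter)++;
}

int main(void)
{
	volatile unsigned long *ctr = mmap(NULL, 2 * sizeof(*ctr),
					   PROT_READ | PROT_WRITE,
					   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	pid_t a, b;

	if (ctr == MAP_FAILED)
		return 1;
	if ((a = fork()) == 0)
		burn(&ctr[0], 0);	/* nice  0 */
	if ((b = fork()) == 0)
		burn(&ctr[1], 5);	/* nice +5 */
	sleep(10);			/* the measurement timeframe */
	printf("nice 0: %lu  nice 5: %lu  ratio: %.2f\n",
	       ctr[0], ctr[1], (double)ctr[0] / (double)ctr[1]);
	kill(a, SIGKILL);
	kill(b, SIGKILL);
	while (wait(NULL) > 0)
		;
	return 0;
}
The measured ratio is then something a testcase could compare against
whatever nice-to-bandwidth mapping ends up being specified.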
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 1:06 ` Peter Williams
2007-04-16 3:04 ` William Lee Irwin III
@ 2007-04-16 17:22 ` Chris Friesen
2007-04-17 0:54 ` Peter Williams
1 sibling, 1 reply; 577+ messages in thread
From: Chris Friesen @ 2007-04-16 17:22 UTC (permalink / raw)
To: Peter Williams
Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas,
linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
Peter Williams wrote:
> To my mind scheduling
> and load balancing are orthogonal and keeping them that way simplifies
> things.
Scuse me if I jump in here, but doesn't the load balancer need some way
to figure out a) when to run, and b) which tasks to pull and where to
push them?
I suppose you could abstract this into a per-scheduler API, but to me at
least these are the hard parts of the load balancer...
Chris
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 22:00 ` Andi Kleen
@ 2007-04-16 21:05 ` Ingo Molnar
2007-04-16 21:21 ` Andi Kleen
0 siblings, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-16 21:05 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel
* Andi Kleen <andi@firstfloor.org> wrote:
> > i'm pleased to announce the first release of the "Modular Scheduler
> > Core and Completely Fair Scheduler [CFS]" patchset:
> >
> > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>
> I would suggest to drop the tsc.c change. The "small errors" can be
> really large on some systems and you can also see large backward
> jumps.
actually, i designed the CFS code assuming a per-CPU TSC (with no global
synchronization), not assuming any globally sync TSC. In fact i wrote it
on such systems: a CoreDuo2 box that stops the TSC in C3 and the
different cores have wildly different TSC values and a dual-core
Athlon64 that quickly drifts its TSC. So i'll keep the sched_clock()
change for now.
> BTW with all this CPU time measurement it would be really nice to
> report it to the user too. It seems a bit bizarre that the scheduler
> keeps track of ns, but top only knows jiffies with large sampling
> errors.
yeah - i'll fix that too if someone doesn't beat me to it.
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 21:05 ` Ingo Molnar
@ 2007-04-16 21:21 ` Andi Kleen
0 siblings, 0 replies; 577+ messages in thread
From: Andi Kleen @ 2007-04-16 21:21 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Andi Kleen, linux-kernel
> actually, i designed the CFS code assuming a per-CPU TSC (with no global
> synchronization), not assuming any globally sync TSC. In fact i wrote it
That already worked in the old scheduler (just in a hackish way)
> on such systems: a CoreDuo2 box that stops the TSC in C3 and the
> different cores have wildly different TSC values and a dual-core
> Athlon64 that quickly drifts its TSC. So i'll keep the sched_clock()
> change for now.
The problem is not CPU synchronized TSC, but TSC with varying frequency
on a single CPU like on the A64.
The old implementation can lose really badly on that because it mixes
measurements at different frequencies together without individual scaling.
The error gets worse the longer the system runs.
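For illustration only (this is not Andi's pending sched-clock patch; the
structure, names and fixed-point shift are invented for the sketch): the
usual cure is to scale every TSC delta with the multiplier that was valid
while that delta accumulated, refreshing the multiplier whenever cpufreq
changes the clock:
/* Hypothetical sketch: per-interval TSC-to-ns scaling. */
#define TSC_SHIFT 22

struct tsc_clock {
	unsigned long long last_tsc;	/* TSC value at the last update */
	unsigned long long ns;		/* accumulated nanoseconds */
	unsigned long mult;		/* (ns per cycle) << TSC_SHIFT for the
					 * frequency currently in effect */
};

unsigned long long tsc_clock_read(struct tsc_clock *c,
				  unsigned long long now_tsc)
{
	unsigned long long delta = now_tsc - c->last_tsc;

	c->last_tsc = now_tsc;
	c->ns += (delta * c->mult) >> TSC_SHIFT;	/* scale this interval */
	return c->ns;
}

/* Would be called from a cpufreq transition notifier: fold in the interval
 * that ran at the old frequency, then switch to the new multiplier. */
void tsc_clock_set_khz(struct tsc_clock *c, unsigned long long now_tsc,
		       unsigned long khz)
{
	tsc_clock_read(c, now_tsc);
	c->mult = (unsigned long)((1000000ULL << TSC_SHIFT) / khz);
}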
>> BTW with all this CPU time measurement it would be really nice to
>> report it to the user too. It seems a bit bizarre that the scheduler
>> keeps track of ns, but top only knows jiffies with large sampling
>> errors.
> yeah - i'll fix that too if someone doesn't beat me to it.
I've been pondering for some time if doubling the NMI watchdog as a
ring 0 counter for this is worth it. So far I'm still undecided
(and it's moot now since it's disabled by default :/)
-Andi
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
` (11 preceding siblings ...)
2007-04-15 22:49 ` Ismail Dönmez
@ 2007-04-16 22:00 ` Andi Kleen
2007-04-16 21:05 ` Ingo Molnar
2007-04-17 7:56 ` Andy Whitcroft
2007-04-18 15:58 ` CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) Christian Hesse
14 siblings, 1 reply; 577+ messages in thread
From: Andi Kleen @ 2007-04-16 22:00 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel
Ingo Molnar <mingo@elte.hu> writes:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
I would suggest to drop the tsc.c change. The "small errors" can be
really large on some systems and you can also see large backward jumps.
I have a proper (but complicated) solution pending
in ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/sched-clock-share
BTW with all this CPU time measurement it would be really nice to
report it to the user too. It seems a bit bizarre that the scheduler
keeps track of ns, but top only knows jiffies with large sampling errors.
-Andi
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 12:55 ` Peter Williams
@ 2007-04-16 23:10 ` Michael K. Edwards
2007-04-17 3:55 ` Nick Piggin
[not found] ` <20070416135915.GK8915@holomorphy.com>
1 sibling, 1 reply; 577+ messages in thread
From: Michael K. Edwards @ 2007-04-16 23:10 UTC (permalink / raw)
To: Peter Williams
Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas,
linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
> Note that I talk of run queues
> not CPUs as I think a shift to multiple CPUs per run queue may be a good
> idea.
This observation of Peter's is the best thing to come out of this
whole foofaraw. Looking at what's happening in CPU-land, I think it's
going to be necessary, within a couple of years, to replace the whole
idea of "CPU scheduling" with "run queue scheduling" across a complex,
possibly dynamic mix of CPU-ish resources. Ergo, there's not much
point in churning the mainline scheduler through a design that isn't
significantly more flexible than any of those now under discussion.
For instance, there are architectures where several "CPUs"
(instruction stream decoders feeding execution pipelines) share parts
of a cache hierarchy ("chip-level multithreading"). On these machines,
you may want to co-schedule a "real" processing task on one pipeline
with a "cache warming" task on the other pipeline -- but only for
tasks whose memory access patterns have been sufficiently analyzed to
write the "cache warming" task code. Some other tasks may want to
idle the second pipeline so they can use the full cache-to-RAM
bandwidth. Yet other tasks may be genuinely CPU-intensive (or I/O
bound but so context-heavy that it's not worth yielding the CPU during
quick I/Os), and hence perfectly happy to run concurrently with an
unrelated task on the other pipeline.
There are other architectures where several "hardware threads" fight
over parts of a cache hierarchy (sometimes bizarrely described as
"sharing" the cache, kind of the way most two-year-olds "share" toys).
On these machines, one instruction pipeline can't help the other
along cache-wise, but it sure can hurt. A scheduler designed, tested,
and tuned principally on one of these architectures (hint:
"hyperthreading") will probably leave a lot of performance on the
floor on processors in the former category.
In the not-so-distant future, we're likely to see architectures with
dynamically reconfigurable interconnect between instruction issue
units and execution resources. (This is already quite feasible on,
say, Virtex4 FX devices with multiple PPC cores, or Altera FPGAs with
as many Nios II cores as fit on the chip.) Restoring task context may
involve not just MMU swaps and FPU instructions (with state-dependent
hidden costs) but processor reconfiguration. Achieving "fairness"
according to any standard that a platform integrator cares about (let
alone an end user) will require a fairly detailed model of the hidden
costs associated with different sorts of task switch.
So if you are interested in schedulers for some reason other than a
paycheck, let the distros worry about 5% improvements on x86[_64].
Get hold of some different "hardware" -- say:
- a Xilinx ML410 if you've got $3K to blow and want to explore
reconfigurable processors;
- a SunFire T2000 if you've got $11K and want to mess with a CMT
system that's actually shipping;
- a QEMU-simulated massively SMP x86 if you're poor but clever
enough to implement funky cross-core cache effects yourself; or
- a cycle-accurate simulator from Gaisler or Virtio if you want a
real research project.
Then go explore some more interesting regions of parameter space and
see what the demands on mainline Linux will look like in a few years.
Cheers,
- Michael
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 15:55 ` Chris Friesen
2007-04-16 16:13 ` William Lee Irwin III
@ 2007-04-17 0:04 ` Peter Williams
2007-04-17 13:07 ` James Bruce
2 siblings, 0 replies; 577+ messages in thread
From: Peter Williams @ 2007-04-17 0:04 UTC (permalink / raw)
To: Chris Friesen
Cc: William Lee Irwin III, Willy Tarreau, Pekka Enberg,
hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
Chris Friesen wrote:
> William Lee Irwin III wrote:
>
>> The sorts of explicit decisions I'd like to be made for these are:
>> (1) In a mixture of tasks with varying nice numbers, a given nice number
>> corresponds to some share of CPU bandwidth. Implementations
>> should not have the freedom to change this arbitrarily according
>> to some intention.
>
> The first question that comes to my mind is whether nice levels should
> be linear or not.
No. That squishes one end of the table too much. It needs to be
(approximately) piecewise linear around nice == 0. Here's the mapping I
use in my entitlement based schedulers:
#define NICE_TO_LP(nice) (((nice) >= 0) ? (20 - (nice)) : (20 + (nice) * (nice)))
It has the (good) feature that a nice == 19 task has 1/20th the
entitlement of a nice == 0 task and a nice == -20 task has 21 times the
entitlement of a nice == 0 task. It's not strictly linear for negative
nice values but is very cheap to calculate and quite easy to invert if
necessary.
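A few lines of throwaway userspace code are enough to check those ratios;
the driver below is purely illustrative and uses the macro as quoted above:
#include <stdio.h>

#define NICE_TO_LP(nice) (((nice) >= 0) ? (20 - (nice)) : (20 + (nice) * (nice)))

int main(void)
{
	int nice;

	/* nice 19 -> 1 (1/20th of nice 0), nice -20 -> 420 (21x nice 0) */
	for (nice = -20; nice <= 19; nice++)
		printf("nice %3d -> entitlement %3d (%5.2fx nice 0)\n",
		       nice, NICE_TO_LP(nice),
		       (double)NICE_TO_LP(nice) / NICE_TO_LP(0));
	return 0;
}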
> I would lean towards nonlinear as it allows a wider
> range (although of course at the expense of precision). Maybe something
> like "each nice level gives X times the cpu of the previous"? I think a
> value of X somewhere between 1.15 and 1.25 might be reasonable.
>
> What about also having something that looks at latency, and how latency
> changes with niceness?
>
> What about specifying the timeframe over which the cpu bandwidth is
> measured? I currently have a system where the application designers
> would like it to be totally fair over a period of 1 second.
Have you tried the spa_ebs scheduler? The half life is no longer a run
time configurable parameter (as making it highly adjustable results in
less efficient code) but it could be adjusted to be approximately
equivalent to 0.5 seconds by changing some constants in the code.
> As you can
> imagine, mainline doesn't do very well in this case.
You should look back through the plugsched patches where many of these
ideas have been experimented with.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 6:43 ` Mike Galbraith
2007-04-15 8:36 ` Bill Huey
@ 2007-04-17 0:06 ` Peter Williams
2007-04-17 2:29 ` Mike Galbraith
1 sibling, 1 reply; 577+ messages in thread
From: Peter Williams @ 2007-04-17 0:06 UTC (permalink / raw)
To: Mike Galbraith
Cc: Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven,
Thomas Gleixner
Mike Galbraith wrote:
> On Sun, 2007-04-15 at 13:27 +1000, Con Kolivas wrote:
>> On Saturday 14 April 2007 06:21, Ingo Molnar wrote:
>>> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
>>> [CFS]
>>>
>>> i'm pleased to announce the first release of the "Modular Scheduler Core
>>> and Completely Fair Scheduler [CFS]" patchset:
>>>
>>> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>>>
>>> This project is a complete rewrite of the Linux task scheduler. My goal
>>> is to address various feature requests and to fix deficiencies in the
>>> vanilla scheduler that were suggested/found in the past few years, both
>>> for desktop scheduling and for server scheduling workloads.
>> The casual observer will be completely confused by what on earth has happened
>> here so let me try to demystify things for them.
>
> [...]
>
> Demystify what? The casual observer need only read either your attempt
> at writing a scheduler, or my attempts at fixing the one we have, to see
> that it was high time for someone with the necessary skills to step in.
Make that "someone with the necessary clout".
> Now progress can happen, which was _not_ happening before.
>
This is true.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 5:47 ` Davide Libenzi
@ 2007-04-17 0:37 ` Pavel Pisa
0 siblings, 0 replies; 577+ messages in thread
From: Pavel Pisa @ 2007-04-17 0:37 UTC (permalink / raw)
To: Davide Libenzi
Cc: William Lee Irwin III, Ingo Molnar, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Monday 16 April 2007 07:47, Davide Libenzi wrote:
> On Mon, 16 Apr 2007, Pavel Pisa wrote:
> > I cannot help myself to not report results with GAVL
> > tree algorithm there as an another race competitor.
> > I believe, that it is better solution for large priority
> > queues than RB-tree and even heap trees. It could be
> > disputable if the scheduler needs such scalability on
> > the other hand. The AVL heritage guarantees lower height
> > which results in shorter search times which could
> > be profitable for other uses in kernel.
> >
> > GAVL algorithm is AVL tree based, so it does not suffer from
> > "infinite" priorities granularity there as TR does. It allows
> > use in the generalized case where the tree is not fully balanced.
> > This allows cutting the first item without rebalancing.
> > This leads to the degradation of the tree by at most one more level
> > (than a non-degraded AVL gives), which is still
> > considerably better than the RB-tree maximum.
> >
> > http://cmp.felk.cvut.cz/~pisa/linux/smart-queue-v-gavl.c
>
> Here are the results on my Opteron 252:
>
> Testing N=1
> gavl_cfs = 187.20 cycles/loop
> CFS = 194.16 cycles/loop
> TR = 314.87 cycles/loop
> CFS = 194.15 cycles/loop
> gavl_cfs = 187.15 cycles/loop
>
> Testing N=2
> gavl_cfs = 268.94 cycles/loop
> CFS = 305.53 cycles/loop
> TR = 313.78 cycles/loop
> CFS = 289.58 cycles/loop
> gavl_cfs = 266.02 cycles/loop
>
> Testing N=4
> gavl_cfs = 452.13 cycles/loop
> CFS = 518.81 cycles/loop
> TR = 311.54 cycles/loop
> CFS = 516.23 cycles/loop
> gavl_cfs = 450.73 cycles/loop
>
> Testing N=8
> gavl_cfs = 609.29 cycles/loop
> CFS = 644.65 cycles/loop
> TR = 308.11 cycles/loop
> CFS = 667.01 cycles/loop
> gavl_cfs = 592.89 cycles/loop
>
> Testing N=16
> gavl_cfs = 686.30 cycles/loop
> CFS = 807.41 cycles/loop
> TR = 317.20 cycles/loop
> CFS = 810.24 cycles/loop
> gavl_cfs = 688.42 cycles/loop
>
> Testing N=32
> gavl_cfs = 756.57 cycles/loop
> CFS = 852.14 cycles/loop
> TR = 301.22 cycles/loop
> CFS = 876.12 cycles/loop
> gavl_cfs = 758.46 cycles/loop
>
> Testing N=64
> gavl_cfs = 831.97 cycles/loop
> CFS = 997.16 cycles/loop
> TR = 304.74 cycles/loop
> CFS = 1003.26 cycles/loop
> gavl_cfs = 832.83 cycles/loop
>
> Testing N=128
> gavl_cfs = 897.33 cycles/loop
> CFS = 1030.36 cycles/loop
> TR = 295.65 cycles/loop
> CFS = 1035.29 cycles/loop
> gavl_cfs = 892.51 cycles/loop
>
> Testing N=256
> gavl_cfs = 963.17 cycles/loop
> CFS = 1146.04 cycles/loop
> TR = 295.35 cycles/loop
> CFS = 1162.04 cycles/loop
> gavl_cfs = 966.31 cycles/loop
>
> Testing N=512
> gavl_cfs = 1029.82 cycles/loop
> CFS = 1218.34 cycles/loop
> TR = 288.78 cycles/loop
> CFS = 1257.97 cycles/loop
> gavl_cfs = 1029.83 cycles/loop
>
> Testing N=1024
> gavl_cfs = 1091.76 cycles/loop
> CFS = 1318.47 cycles/loop
> TR = 287.74 cycles/loop
> CFS = 1311.72 cycles/loop
> gavl_cfs = 1093.29 cycles/loop
>
> Testing N=2048
> gavl_cfs = 1153.03 cycles/loop
> CFS = 1398.84 cycles/loop
> TR = 286.75 cycles/loop
> CFS = 1438.68 cycles/loop
> gavl_cfs = 1149.97 cycles/loop
>
>
> There seem to be some difference from your numbers. This is with:
>
> gcc version 4.1.2
>
> and -O2. But then an Opteron can behave quite differently than a Duron on
> a bench like this ;)
Thanks for testing, but your numbers are more correct
than my first report. My numbers seemed over-optimistic even
to me; in fact, I was surprised that the difference was so high.
But I had tested a bad version of the code without the GAVL_FAFTER
option set. The code pushed to the web page is the correct one.
I have not gotten to look into the case until now because I have had
a busy day preparing some Linux-based labs at the university.
Without the GAVL_FAFTER option, the insert operation fails
if an item with the same key is already inserted (an intended feature
of the code), and as a result of that, not all items were inserted
in the test. The meaning of GAVL_FAFTER is to find/insert after
all items with the same key value. The default behavior is to
operate on unique keys in the tree and reject duplicates.
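The same issue arises for any tree used as a scheduling timeline, since
many tasks can legitimately share a key: either the tree offers an explicit
"insert after equal keys" mode, as GAVL_FAFTER does, or the comparison
function breaks ties itself. A generic sketch of the tie-breaking approach
(this is not GAVL's API, just an illustration):
#include <stdint.h>

struct timeline_node {
	unsigned long long key;		/* e.g. expected time to run */
};

/* Order primarily by key; break ties by node identity so that two distinct
 * nodes never compare equal and duplicate keys are never rejected. */
int timeline_cmp(const struct timeline_node *a,
		 const struct timeline_node *b)
{
	if (a->key != b->key)
		return a->key < b->key ? -1 : 1;
	if (a == b)
		return 0;
	return (uintptr_t)a < (uintptr_t)b ? -1 : 1;
}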
My results are even worse for GAVL than yours.
It is possible to try to tweak the code and optimize it more
(likely/unlikely, not keeping the last ptr, etc.) for this actual usage.
It may be that I will try this exercise, but I do not expect the
result after tuning to be so much better that it would
outweigh some redesign work. I can still see some advantages of AVL,
but it has its own drawbacks: the need for a separate height
field and slightly worse timing for deletes in the middle.
So excuse me for the disturbance. I was only curious how the
GAVL code would behave in comparison with other algorithms,
and I did not keep my premature enthusiasm under lock.
Best wishes
Pavel Pisa
./smart-queue-v-gavl -n 4
gavl_cfs = 279.02 cycles/loop
CFS = 200.87 cycles/loop
TR = 229.55 cycles/loop
CFS = 201.23 cycles/loop
gavl_cfs = 276.08 cycles/loop
./smart-queue-v-gavl -n 8
gavl_cfs = 310.92 cycles/loop
CFS = 288.45 cycles/loop
TR = 192.46 cycles/loop
CFS = 284.94 cycles/loop
gavl_cfs = 357.02 cycles/loop
./smart-queue-v-gavl -n 16
gavl_cfs = 350.45 cycles/loop
CFS = 354.01 cycles/loop
TR = 189.79 cycles/loop
CFS = 320.08 cycles/loop
gavl_cfs = 387.43 cycles/loop
./smart-queue-v-gavl -n 32
gavl_cfs = 419.23 cycles/loop
CFS = 406.88 cycles/loop
TR = 198.10 cycles/loop
CFS = 398.15 cycles/loop
gavl_cfs = 412.57 cycles/loop
./smart-queue-v-gavl -n 64
gavl_cfs = 442.81 cycles/loop
CFS = 429.62 cycles/loop
TR = 235.40 cycles/loop
CFS = 389.54 cycles/loop
gavl_cfs = 433.56 cycles/loop
./smart-queue-v-gavl -n 128
gavl_cfs = 358.20 cycles/loop
CFS = 605.49 cycles/loop
TR = 236.01 cycles/loop
CFS = 458.50 cycles/loop
gavl_cfs = 455.05 cycles/loop
./smart-queue-v-gavl -n 256
gavl_cfs = 529.72 cycles/loop
CFS = 530.98 cycles/loop
TR = 193.75 cycles/loop
CFS = 533.75 cycles/loop
gavl_cfs = 471.47 cycles/loop
./smart-queue-v-gavl -n 512
gavl_cfs = 525.80 cycles/loop
CFS = 550.63 cycles/loop
TR = 188.71 cycles/loop
CFS = 549.81 cycles/loop
gavl_cfs = 494.73 cycles/loop
./smart-queue-v-gavl -n 1024
gavl_cfs = 544.91 cycles/loop
CFS = 561.68 cycles/loop
TR = 230.97 cycles/loop
CFS = 522.68 cycles/loop
gavl_cfs = 542.40 cycles/loop
./smart-queue-v-gavl -n 2048
gavl_cfs = 567.46 cycles/loop
CFS = 581.85 cycles/loop
TR = 229.69 cycles/loop
CFS = 585.41 cycles/loop
gavl_cfs = 563.22 cycles/loop
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 17:22 ` Chris Friesen
@ 2007-04-17 0:54 ` Peter Williams
2007-04-17 15:52 ` Chris Friesen
0 siblings, 1 reply; 577+ messages in thread
From: Peter Williams @ 2007-04-17 0:54 UTC (permalink / raw)
To: Chris Friesen
Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas,
linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
Chris Friesen wrote:
> Peter Williams wrote:
>
>> To my mind scheduling and load balancing are orthogonal and keeping
>> them that way simplifies things.
>
> Scuse me if I jump in here, but doesn't the load balancer need some way
> to figure out a) when to run, and b) which tasks to pull and where to
> push them?
Yes but both of these are independent of the scheduler discipline in force.
>
> I suppose you could abstract this into a per-scheduler API, but to me at
> least these are the hard parts of the load balancer...
Load balancing needs to be based on the static priorities (i.e. nice or
real time priority) of the runnable tasks, not the dynamic priorities.
If the load balancer manages to keep the weighted (according to static
priority) load and distribution of priorities within the loads on the
CPUs roughly equal, and the scheduler does a good job of ensuring
fairness, interactive responsiveness etc. for the tasks within a CPU,
then the result will be good system performance within the constraints
set by the sys admin's use of real time priorities and nice.
The smpnice modifications to the load balancer were meant to give it the
appropriate behaviour, and what we need to fix now is the intra-CPU
scheduling.
Even if the load balancer isn't yet perfect, perfecting it can be done
separately from fixing the scheduler, preferably with as little
interdependency as possible. Probably the only contribution to load
balancing that the scheduler really needs to make is calculating
the average weighted load on each of the CPUs (or run queues if there's
more than one CPU per runqueue).
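A bare-bones sketch of that one contribution is below; the per-task weight
formula is made up purely for illustration and is not the smpnice table:
struct task_stub {
	int nice;			/* static priority: -20 .. 19 */
	struct task_stub *next;		/* next runnable task on this queue */
};

unsigned long task_weight(const struct task_stub *t)
{
	/* illustrative: nice 0 is worth 100 load units, each level +/- 5 */
	return 100 - 5 * t->nice;
}

unsigned long runqueue_weighted_load(const struct task_stub *head)
{
	unsigned long load = 0;

	for (; head; head = head->next)
		load += task_weight(head);
	return load;
}
With per-queue sums like these the balancer can compare weighted load
rather than raw task counts, independently of the scheduling discipline.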
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 0:06 ` Peter Williams
@ 2007-04-17 2:29 ` Mike Galbraith
2007-04-17 3:40 ` Nick Piggin
0 siblings, 1 reply; 577+ messages in thread
From: Mike Galbraith @ 2007-04-17 2:29 UTC (permalink / raw)
To: Peter Williams
Cc: Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven,
Thomas Gleixner
On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote:
> Mike Galbraith wrote:
> >
> > Demystify what? The casual observer need only read either your attempt
> > at writing a scheduler, or my attempts at fixing the one we have, to see
> > that it was high time for someone with the necessary skills to step in.
>
> Make that "someone with the necessary clout".
No, I was brutally honest to both of us, but quite correct.
> > Now progress can happen, which was _not_ happening before.
> >
>
> This is true.
Yup, and progress _is_ happening now, quite rapidly.
-Mike
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 14:28 ` Matt Mackall
@ 2007-04-17 3:31 ` Nick Piggin
2007-04-17 17:35 ` Matt Mackall
0 siblings, 1 reply; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 3:31 UTC (permalink / raw)
To: Matt Mackall
Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel,
Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner
On Mon, Apr 16, 2007 at 09:28:24AM -0500, Matt Mackall wrote:
> On Mon, Apr 16, 2007 at 05:03:49AM +0200, Nick Piggin wrote:
> > I'd prefer if we kept a single CPU scheduler in mainline, because I
> > think that simplifies analysis and focuses testing.
>
> I think you'll find something like 80-90% of the testing will be done
> on the default choice, even if other choices exist. So you really
> won't have much of a problem here.
>
> But when the only choice for other schedulers is to go out-of-tree,
> then only 1% of the people will try it out and those people are
> guaranteed to be the ones who saw scheduling problems in mainline.
> So the alternative won't end up getting any testing on many of the
> workloads that work fine in mainstream so their feedback won't tell
> you very much at all.
Yeah I concede that perhaps it is the only way to get things going
any further. But how do we decide if and when the current scheduler
should be demoted from default, and which should replace it?
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 2:29 ` Mike Galbraith
@ 2007-04-17 3:40 ` Nick Piggin
2007-04-17 4:01 ` Mike Galbraith
2007-04-17 4:17 ` Peter Williams
0 siblings, 2 replies; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 3:40 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote:
> On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote:
> > Mike Galbraith wrote:
> > >
> > > Demystify what? The casual observer need only read either your attempt
> > > at writing a scheduler, or my attempts at fixing the one we have, to see
> > > that it was high time for someone with the necessary skills to step in.
> >
> > Make that "someone with the necessary clout".
>
> No, I was brutally honest to both of us, but quite correct.
>
> > > Now progress can happen, which was _not_ happening before.
> > >
> >
> > This is true.
>
> Yup, and progress _is_ happening now, quite rapidly.
Progress as in progress on Ingo's scheduler. I still don't know how we'd
decide when to replace the mainline scheduler or with what.
I don't think we can say Ingo's is better than the alternatives, can we?
If there is some kind of bakeoff, then I'd like one of Con's designs to
be involved, and mine, and Peter's...
Maybe the progress is that more key people are becoming open to the idea
of changing the scheduler.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely FairScheduler [CFS]
2007-04-17 4:01 ` Mike Galbraith
@ 2007-04-17 3:43 ` David Lang
2007-04-17 4:14 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Nick Piggin
2007-04-20 20:36 ` Bill Davidsen
2 siblings, 0 replies; 577+ messages in thread
From: David Lang @ 2007-04-17 3:43 UTC (permalink / raw)
To: Mike Galbraith
Cc: Nick Piggin, Peter Williams, Con Kolivas, Ingo Molnar, ck list,
Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Tue, 17 Apr 2007, Mike Galbraith wrote:
> Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely
> FairScheduler [CFS]
>
> On Tue, 2007-04-17 at 05:40 +0200, Nick Piggin wrote:
>> On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote:
>
>>> Yup, and progress _is_ happening now, quite rapidly.
>>
>> Progress as in progress on Ingo's scheduler. I still don't know how we'd
>> decide when to replace the mainline scheduler or with what.
>>
>> I don't think we can say Ingo's is better than the alternatives, can we?
>
> No, that would require massive performance testing of all alternatives.
>
>> If there is some kind of bakeoff, then I'd like one of Con's designs to
>> be involved, and mine, and Peter's...
>
> The trouble with a bakeoff is that it's pretty darn hard to get people
> to test in the first place, and then comes weighting the subjective and
> hard performance numbers. If they're close in numbers, do you go with
> the one which starts the least flamewars or what?
it's especially hard if the people doing the testing need to find the latest
patch and apply it.
even having a compile-time option to switch between them at least means that the
testers can have confidence that the various patches haven't bitrotted.
boot time options would be even better, but I understand from previous
discussions I've watched that this is performance critical enough that the
overhead of this would throw off the results.
David Lang
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 23:10 ` Michael K. Edwards
@ 2007-04-17 3:55 ` Nick Piggin
2007-04-17 4:25 ` Peter Williams
2007-04-17 8:24 ` William Lee Irwin III
0 siblings, 2 replies; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 3:55 UTC (permalink / raw)
To: Michael K. Edwards
Cc: Peter Williams, William Lee Irwin III, Ingo Molnar, Matt Mackall,
Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote:
> On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
> >Note that I talk of run queues
> >not CPUs as I think a shift to multiple CPUs per run queue may be a good
> >idea.
>
> This observation of Peter's is the best thing to come out of this
> whole foofaraw. Looking at what's happening in CPU-land, I think it's
> going to be necessary, within a couple of years, to replace the whole
> idea of "CPU scheduling" with "run queue scheduling" across a complex,
> possibly dynamic mix of CPU-ish resources. Ergo, there's not much
> point in churning the mainline scheduler through a design that isn't
> significantly more flexible than any of those now under discussion.
Why? If you do that, then your load balancer just becomes less flexible
because it is harder to have tasks run on one or the other.
You can have single-runqueue-per-domain behaviour (or close to) just by
relaxing all restrictions on idle load balancing within that domain. It
is harder to go the other way and place any per-cpu affinity or
restrictions with multiple cpus on a single runqueue.
> For instance, there are architectures where several "CPUs"
> (instruction stream decoders feeding execution pipelines) share parts
> of a cache hierarchy ("chip-level multitasking"). On these machines,
> you may want to co-schedule a "real" processing task on one pipeline
> with a "cache warming" task on the other pipeline -- but only for
> tasks whose memory access patterns have been sufficiently analyzed to
> write the "cache warming" task code. Some other tasks may want to
> idle the second pipeline so they can use the full cache-to-RAM
> bandwidth. Yet other tasks may be genuinely CPU-intensive (or I/O
> bound but so context-heavy that it's not worth yielding the CPU during
> quick I/Os), and hence perfectly happy to run concurrently with an
> unrelated task on the other pipeline.
We can do all that now with load balancing, affinities or by shutting
down threads dynamically.
> There are other architectures where several "hardware threads" fight
> over parts of a cache hierarchy (sometimes bizarrely described as
> "sharing" the cache, kind of the way most two-year-olds "share" toys).
> On these machines, one instruction pipeline can't help the other
> along cache-wise, but it sure can hurt. A scheduler designed, tested,
> and tuned principally on one of these architectures (hint:
> "hyperthreading") will probably leave a lot of performance on the
> floor on processors in the former category.
>
> In the not-so-distant future, we're likely to see architectures with
> dynamically reconfigurable interconnect between instruction issue
> units and execution resources. (This is already quite feasible on,
> say, Virtex4 FX devices with multiple PPC cores, or Altera FPGAs with
> as many Nios II cores as fit on the chip.) Restoring task context may
> involve not just MMU swaps and FPU instructions (with state-dependent
> hidden costs) but processor reconfiguration. Achieving "fairness"
> according to any standard that a platform integrator cares about (let
> alone an end user) will require a fairly detailed model of the hidden
> costs associated with different sorts of task switch.
>
> So if you are interested in schedulers for some reason other than a
> paycheck, let the distros worry about 5% improvements on x86[_64].
> Get hold of some different "hardware" -- say:
> - a Xilinx ML410 if you've got $3K to blow and want to explore
> reconfigurable processors;
> - a SunFire T2000 if you've got $11K and want to mess with a CMT
> system that's actually shipping;
> - a QEMU-simulated massively SMP x86 if you're poor but clever
> enough to implement funky cross-core cache effects yourself; or
> - a cycle-accurate simulator from Gaisler or Virtio if you want a
> real research project.
> Then go explore some more interesting regions of parameter space and
> see what the demands on mainline Linux will look like in a few years.
There are no doubt improvements to be made, but they are generally
intended to be doable within the sched-domains framework. I
am not aware of a particular need that would be impossible to do using
that topology hierarchy and per-CPU runqueues, and there are added
complications involved with multiple CPUs per runqueue.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 3:40 ` Nick Piggin
@ 2007-04-17 4:01 ` Mike Galbraith
2007-04-17 3:43 ` [Announce] [patch] Modular Scheduler Core and Completely FairScheduler [CFS] David Lang
` (2 more replies)
2007-04-17 4:17 ` Peter Williams
1 sibling, 3 replies; 577+ messages in thread
From: Mike Galbraith @ 2007-04-17 4:01 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
On Tue, 2007-04-17 at 05:40 +0200, Nick Piggin wrote:
> On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote:
> > Yup, and progress _is_ happening now, quite rapidly.
>
> Progress as in progress on Ingo's scheduler. I still don't know how we'd
> decide when to replace the mainline scheduler or with what.
>
> I don't think we can say Ingo's is better than the alternatives, can we?
No, that would require massive performance testing of all alternatives.
> If there is some kind of bakeoff, then I'd like one of Con's designs to
> be involved, and mine, and Peter's...
The trouble with a bakeoff is that it's pretty darn hard to get people
to test in the first place, and then comes weighting the subjective and
hard performance numbers. If they're close in numbers, do you go with
the one which starts the least flamewars or what?
> Maybe the progress is that more key people are becoming open to the idea
> of changing the scheduler.
Could be. All was quiet for quite a while, but when RSDL showed up, it
aroused enough interest to show that scheduling woes are on folks' radar.
-Mike
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 4:01 ` Mike Galbraith
2007-04-17 3:43 ` [Announce] [patch] Modular Scheduler Core and Completely FairScheduler [CFS] David Lang
@ 2007-04-17 4:14 ` Nick Piggin
2007-04-17 6:26 ` Peter Williams
2007-04-17 9:51 ` Ingo Molnar
2007-04-20 20:36 ` Bill Davidsen
2 siblings, 2 replies; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 4:14 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
On Tue, Apr 17, 2007 at 06:01:29AM +0200, Mike Galbraith wrote:
> On Tue, 2007-04-17 at 05:40 +0200, Nick Piggin wrote:
> > On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote:
>
> > > Yup, and progress _is_ happening now, quite rapidly.
> >
> > Progress as in progress on Ingo's scheduler. I still don't know how we'd
> > decide when to replace the mainline scheduler or with what.
> >
> > I don't think we can say Ingo's is better than the alternatives, can we?
>
> No, that would require massive performance testing of all alternatives.
>
> > If there is some kind of bakeoff, then I'd like one of Con's designs to
> > be involved, and mine, and Peter's...
>
> The trouble with a bakeoff is that it's pretty darn hard to get people
> to test in the first place, and then comes weighting the subjective and
> hard performance numbers. If they're close in numbers, do you go with
> the one which starts the least flamewars or what?
I don't know how you'd do it. I know you wouldn't count people telling you
how good they are (getting people to tell you how bad they are, and whether
others do better in a given situation might be slightly more viable).
But we have to choose somehow. I'd hope that is going to be based solely
on the results and technical properties of the code, so... if we were to
somehow determine that the results are exactly the same, we'd go for the
simpler one, wouldn't we?
> > Maybe the progress is that more key people are becoming open to the idea
> > of changing the scheduler.
>
> Could be. All was quiet for quite a while, but when RSDL showed up, it
> aroused enough interest to show that scheduling woes are on folks' radar.
Well I know people have had woes with the scheduler for ever (I guess that
isn't going to change :P). I think people generally lost a bit of interest
in trying to improve the situation because of the upstream problem.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 3:40 ` Nick Piggin
2007-04-17 4:01 ` Mike Galbraith
@ 2007-04-17 4:17 ` Peter Williams
2007-04-17 4:29 ` Nick Piggin
1 sibling, 1 reply; 577+ messages in thread
From: Peter Williams @ 2007-04-17 4:17 UTC (permalink / raw)
To: Nick Piggin
Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
Nick Piggin wrote:
> On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote:
>> On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote:
>>> Mike Galbraith wrote:
>>>> Demystify what? The casual observer need only read either your attempt
>>>> at writing a scheduler, or my attempts at fixing the one we have, to see
>>>> that it was high time for someone with the necessary skills to step in.
>>> Make that "someone with the necessary clout".
>> No, I was brutally honest to both of us, but quite correct.
>>
>>>> Now progress can happen, which was _not_ happening before.
>>>>
>>> This is true.
>> Yup, and progress _is_ happening now, quite rapidly.
>
> Progress as in progress on Ingo's scheduler. I still don't know how we'd
> decide when to replace the mainline scheduler or with what.
>
> I don't think we can say Ingo's is better than the alternatives, can we?
> If there is some kind of bakeoff, then I'd like one of Con's designs to
> be involved, and mine, and Peter's...
I myself was thinking of this as the chance for a much needed
simplification of the scheduling code, and if this can be done with the
result being "reasonable" it then gives us the basis on which to propose
improvements based on the ideas of others, such as those you mention.
As the size of the cpusched indicates, trying to evaluate alternative
proposals based on the current O(1) scheduler is fraught. Hopefully,
this initiative can fix this problem. Then we just need Ingo to listen
to suggestions and he's showing signs of being willing to do this :-)
>
> Maybe the progress is that more key people are becoming open to the idea
> of changing the scheduler.
That too.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 3:55 ` Nick Piggin
@ 2007-04-17 4:25 ` Peter Williams
2007-04-17 4:34 ` Nick Piggin
2007-04-17 8:24 ` William Lee Irwin III
1 sibling, 1 reply; 577+ messages in thread
From: Peter Williams @ 2007-04-17 4:25 UTC (permalink / raw)
To: Nick Piggin
Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar,
Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds,
Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
Nick Piggin wrote:
> On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote:
>> On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
>>> Note that I talk of run queues
>>> not CPUs as I think a shift to multiple CPUs per run queue may be a good
>>> idea.
>> This observation of Peter's is the best thing to come out of this
>> whole foofaraw. Looking at what's happening in CPU-land, I think it's
>> going to be necessary, within a couple of years, to replace the whole
>> idea of "CPU scheduling" with "run queue scheduling" across a complex,
>> possibly dynamic mix of CPU-ish resources. Ergo, there's not much
>> point in churning the mainline scheduler through a design that isn't
>> significantly more flexible than any of those now under discussion.
>
> Why? If you do that, then your load balancer just becomes less flexible
> because it is harder to have tasks run on one or the other.
>
> You can have single-runqueue-per-domain behaviour (or close to) just by
> relaxing all restrictions on idle load balancing within that domain. It
> is harder to go the other way and place any per-cpu affinity or
> restrictions with multiple cpus on a single runqueue.
Allowing N (where N can be one or greater) CPUs per run queue actually
increases flexibility as you can still set N to 1 to get the current
behaviour.
One advantage of allowing multiple CPUs per run queue would be at the
smaller end of the system scale: a PC with a single hyper-threading
chip (i.e. 2 CPUs) would not need to worry about load balancing at all
if both CPUs used the one runqueue, and all the nasty side effects that
come with hyper-threading would be minimized at the same time.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 4:17 ` Peter Williams
@ 2007-04-17 4:29 ` Nick Piggin
2007-04-17 5:53 ` Willy Tarreau
` (2 more replies)
0 siblings, 3 replies; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 4:29 UTC (permalink / raw)
To: Peter Williams
Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote:
> Nick Piggin wrote:
> >On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote:
> >>On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote:
> >>>Mike Galbraith wrote:
> >>>>Demystify what? The casual observer need only read either your attempt
> >>>>at writing a scheduler, or my attempts at fixing the one we have, to see
> >>>>that it was high time for someone with the necessary skills to step in.
> >>>Make that "someone with the necessary clout".
> >>No, I was brutally honest to both of us, but quite correct.
> >>
> >>>>Now progress can happen, which was _not_ happening before.
> >>>>
> >>>This is true.
> >>Yup, and progress _is_ happening now, quite rapidly.
> >
> >Progress as in progress on Ingo's scheduler. I still don't know how we'd
> >decide when to replace the mainline scheduler or with what.
> >
> >I don't think we can say Ingo's is better than the alternatives, can we?
> >If there is some kind of bakeoff, then I'd like one of Con's designs to
> >be involved, and mine, and Peter's...
>
> I myself was thinking of this as the chance for a much needed
> simplification of the scheduling code and if this can be done with the
> result being "reasonable" it then gives us the basis on which to propose
> improvements based on the ideas of others such as you mention.
>
> As the size of the cpusched indicates, trying to evaluate alternative
> proposals based on the current O(1) scheduler is fraught. Hopefully,
I don't know why. The problem is that you can't really evaluate good
proposals by looking at the code (you can say that one is bad, ie. the
current one, which has a huge amount of temporal complexity and is
explicitly unfair), but it is pretty hard to say one behaves well.
And my scheduler for example cuts down the amount of policy code and
code size significantly. I haven't looked at Con's ones for a while,
but I believe they are also much more straightforward than mainline...
For example, let's say all else is equal between them, then why would
we go with the O(logN) implementation rather than the O(1)?
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 4:25 ` Peter Williams
@ 2007-04-17 4:34 ` Nick Piggin
2007-04-17 6:03 ` Peter Williams
0 siblings, 1 reply; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 4:34 UTC (permalink / raw)
To: Peter Williams
Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar,
Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds,
Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 02:25:39PM +1000, Peter Williams wrote:
> Nick Piggin wrote:
> >On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote:
> >>On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
> >>>Note that I talk of run queues
> >>>not CPUs as I think a shift to multiple CPUs per run queue may be a good
> >>>idea.
> >>This observation of Peter's is the best thing to come out of this
> >>whole foofaraw. Looking at what's happening in CPU-land, I think it's
> >>going to be necessary, within a couple of years, to replace the whole
> >>idea of "CPU scheduling" with "run queue scheduling" across a complex,
> >>possibly dynamic mix of CPU-ish resources. Ergo, there's not much
> >>point in churning the mainline scheduler through a design that isn't
> >>significantly more flexible than any of those now under discussion.
> >
> >Why? If you do that, then your load balancer just becomes less flexible
> >because it is harder to have tasks run on one or the other.
> >
> >You can have single-runqueue-per-domain behaviour (or close to) just by
> >relaxing all restrictions on idle load balancing within that domain. It
> >is harder to go the other way and place any per-cpu affinity or
> >restrictions with multiple cpus on a single runqueue.
>
> Allowing N (where N can be one or greater) CPUs per run queue actually
> increases flexibility as you can still set N to 1 to get the current
> behaviour.
But you add extra code for that on top of what we have, and are also
prevented from making per-cpu assumptions.
And you can get N CPUs per runqueue behaviour by having them in a domain
with no restrictions on idle balancing. So where does your increased
flexibilty come from?
> One advantage of allowing multiple CPUs per run queue would be at the
> smaller end of the system scale i.e. a PC with a single hyper threading
> chip (i.e. 2 CPUs) would not need to worry about load balancing at all
> if both CPUs used the one runqueue and all the nasty side effects that
> come with hyper threading would be minimized at the same time.
I don't know about that -- the current load balancer already minimises
the nasty multi threading effects. SMT is very important for IBM's chips
for example, and they've never had any problem with that side of it
since it was introduced and bugs ironed out (at least, none that I've
heard).
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 4:29 ` Nick Piggin
@ 2007-04-17 5:53 ` Willy Tarreau
2007-04-17 6:10 ` Nick Piggin
2007-04-17 6:09 ` William Lee Irwin III
2007-04-17 6:23 ` Peter Williams
2 siblings, 1 reply; 577+ messages in thread
From: Willy Tarreau @ 2007-04-17 5:53 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar,
ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
Hi Nick,
On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote:
(...)
> And my scheduler for example cuts down the amount of policy code and
> code size significantly. I haven't looked at Con's ones for a while,
> but I believe they are also much more straightforward than mainline...
>
> For example, let's say all else is equal between them, then why would
> we go with the O(logN) implementation rather than the O(1)?
Of course, if this is the case, the question will be raised. But as a
general rule, I don't see much potential in O(1) to finely tune scheduling
according to several criteria. In O(logN), you can adjust scheduling in
realtime at a very low cost. Better processing of varying priorities or
fork() comes to mind.
Regards,
Willy
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 4:34 ` Nick Piggin
@ 2007-04-17 6:03 ` Peter Williams
2007-04-17 6:14 ` William Lee Irwin III
` (2 more replies)
0 siblings, 3 replies; 577+ messages in thread
From: Peter Williams @ 2007-04-17 6:03 UTC (permalink / raw)
To: Nick Piggin
Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar,
Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds,
Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
Nick Piggin wrote:
> On Tue, Apr 17, 2007 at 02:25:39PM +1000, Peter Williams wrote:
>> Nick Piggin wrote:
>>> On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote:
>>>> On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
>>>>> Note that I talk of run queues
>>>>> not CPUs as I think a shift to multiple CPUs per run queue may be a good
>>>>> idea.
>>>> This observation of Peter's is the best thing to come out of this
>>>> whole foofaraw. Looking at what's happening in CPU-land, I think it's
>>>> going to be necessary, within a couple of years, to replace the whole
>>>> idea of "CPU scheduling" with "run queue scheduling" across a complex,
>>>> possibly dynamic mix of CPU-ish resources. Ergo, there's not much
>>>> point in churning the mainline scheduler through a design that isn't
>>>> significantly more flexible than any of those now under discussion.
>>> Why? If you do that, then your load balancer just becomes less flexible
>>> because it is harder to have tasks run on one or the other.
>>>
>>> You can have single-runqueue-per-domain behaviour (or close to) just by
>>> relaxing all restrictions on idle load balancing within that domain. It
>>> is harder to go the other way and place any per-cpu affinity or
>>> restrictions with multiple cpus on a single runqueue.
>> Allowing N (where N can be one or greater) CPUs per run queue actually
>> increases flexibility as you can still set N to 1 to get the current
>> behaviour.
>
> But you add extra code for that on top of what we have, and are also
> prevented from making per-cpu assumptions.
>
> And you can get N CPUs per runqueue behaviour by having them in a domain
> with no restrictions on idle balancing. So where does your increased
> flexibilty come from?
>
>> One advantage of allowing multiple CPUs per run queue would be at the
>> smaller end of the system scale i.e. a PC with a single hyper threading
>> chip (i.e. 2 CPUs) would not need to worry about load balancing at all
>> if both CPUs used the one runqueue and all the nasty side effects that
>> come with hyper threading would be minimized at the same time.
>
> I don't know about that -- the current load balancer already minimises
> the nasty multi threading effects. SMT is very important for IBM's chips
> for example, and they've never had any problem with that side of it
> since it was introduced and bugs ironed out (at least, none that I've
> heard).
>
There's a lot of ugly code in the load balancer that is only there to
overcome the side effects of SMT and dual core. A lot of it was put
there by Intel employees trying to make load balancing more friendly to
their systems. What I'm suggesting is that N CPUs per runqueue is a
better way of achieving that end. I may (of course) be wrong but I
think that the idea deserves more consideration than you're willing to
give it.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 4:29 ` Nick Piggin
2007-04-17 5:53 ` Willy Tarreau
@ 2007-04-17 6:09 ` William Lee Irwin III
2007-04-17 6:15 ` Nick Piggin
2007-04-17 6:23 ` Peter Williams
2 siblings, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-17 6:09 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar,
ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote:
>> I myself was thinking of this as the chance for a much needed
>> simplification of the scheduling code and if this can be done with the
>> result being "reasonable" it then gives us the basis on which to propose
>> improvements based on the ideas of others such as you mention.
>> As the size of the cpusched indicates, trying to evaluate alternative
>> proposals based on the current O(1) scheduler is fraught. Hopefully,
On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote:
> I don't know why. The problem is that you can't really evaluate good
> proposals by looking at the code (you can say that one is bad, ie. the
> current one, which has a huge amount of temporal complexity and is
> explicitly unfair), but it is pretty hard to say one behaves well.
> And my scheduler for example cuts down the amount of policy code and
> code size significantly. I haven't looked at Con's ones for a while,
> but I believe they are also much more straightforward than mainline...
> For example, let's say all else is equal between them, then why would
> we go with the O(logN) implementation rather than the O(1)?
All things are not equal; they all have different properties. I like
getting rid of the queue-swapping artifacts as ebs and cfs have done,
as the artifacts introduced there are nasty IMNSHO. I'm queueing up
a demonstration of epoch expiry scheduling artifacts as a testcase,
albeit one with no pass/fail semantics for its results, just detecting
scheduler properties.
That said, inequality/inequivalence is not a superiority/inferiority
ranking per se. What needs to come out of these discussions is a set
of standards which a candidate for mainline must pass to be considered
correct and a set of performance metrics by which to rank them. Video
game framerates and some sort of way to automate window wiggle tests
sound like good ideas, but automating such is beyond my userspace
programming abilities. An organization able to devote manpower to
devising such testcases will likely have to get involved for them to
happen, I suspect.
On a random note, limitations on kernel address space make O(lg(n))
effectively O(1), albeit with large upper bounds on the worst case
and an expected case much faster than the worst case.
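As a back-of-the-envelope illustration of that bound (a rough sketch; the
task-count cap below is an assumed, pid_max-like figure, not taken from this
thread), a red-black tree of n nodes is at most about 2*log2(n+1) levels
deep, so even an enormous runnable set gives a small constant:

#include <math.h>
#include <stdio.h>

/* Sketch only: assume the number of runnable tasks is capped at 4194304
 * (2^22, an assumed value).  An rbtree of n nodes has height at most
 * 2*log2(n+1), so the "O(lg n)" walk is a few dozen steps at worst.
 * Build with -lm. */
int main(void)
{
        double n = 4194304.0;
        printf("worst-case rbtree height for %.0f tasks: %.1f levels\n",
               n, 2.0 * log2(n + 1.0));
        return 0;
}

That comes out at roughly 44 levels here, which is the sense in which
"O(lg n) is effectively O(1) with a large constant".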
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 5:53 ` Willy Tarreau
@ 2007-04-17 6:10 ` Nick Piggin
0 siblings, 0 replies; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 6:10 UTC (permalink / raw)
To: Willy Tarreau
Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar,
ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 07:53:55AM +0200, Willy Tarreau wrote:
> Hi Nick,
>
> On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote:
> (...)
> > And my scheduler for example cuts down the amount of policy code and
> > code size significantly. I haven't looked at Con's ones for a while,
> > but I believe they are also much more straightforward than mainline...
> >
> > For example, let's say all else is equal between them, then why would
> > we go with the O(logN) implementation rather than the O(1)?
>
> Of course, if this is the case, the question will be raised. But as a
> general rule, I don't see much potential in O(1) to finely tune scheduling
> according to several criteria.
What do you mean? By what criteria?
> In O(logN), you can adjust scheduling in
> realtime at a very low cost. Better processing of varying priorities or
> fork() comes to mind.
The main problem as I see it is choosing which task to run next and
how much time to run it for. And given that there are typically far fewer
than 58 (the number of priorities in nicksched) runnable tasks for a
desktop system, I don't find it at all constraining to quantize my "next
runnable" criteria onto that size of key.
Even if you do expect a huge number of runnable tasks, you would hope
for fewer interactive ones toward the higher end of the priority scale.
Handwaving or even detailed design descriptions are simply not the best
way to decide on a new scheduler, is all I'm saying.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 6:03 ` Peter Williams
@ 2007-04-17 6:14 ` William Lee Irwin III
2007-04-17 6:23 ` Nick Piggin
2007-04-17 9:36 ` Ingo Molnar
2 siblings, 0 replies; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-17 6:14 UTC (permalink / raw)
To: Peter Williams
Cc: Nick Piggin, Michael K. Edwards, Ingo Molnar, Matt Mackall,
Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 04:03:41PM +1000, Peter Williams wrote:
> There's a lot of ugly code in the load balancer that is only there to
> overcome the side effects of SMT and dual core. A lot of it was put
> there by Intel employees trying to make load balancing more friendly to
> their systems. What I'm suggesting is that N CPUs per runqueue is a
> better way of achieving that end. I may (of course) be wrong but I
> think that the idea deserves more consideration than you're willing to
> give it.
This may be a good one to ask Ingo about, as he did significant
performance work on per-core runqueues for SMT. While I did write
per-node runqueue code for NUMA at some point in the past, I did no
tuning or other performance work on it, only functionality.
I've actually dealt with kernels using elder versions of Ingo's code
for per-core runqueues on SMT, but was never called upon to examine
that particular code for either performance or stability, so I'm
largely ignorant of what the perceived outcome of it was.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 6:09 ` William Lee Irwin III
@ 2007-04-17 6:15 ` Nick Piggin
2007-04-17 6:26 ` William Lee Irwin III
2007-04-17 6:50 ` Davide Libenzi
0 siblings, 2 replies; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 6:15 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar,
ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
> On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote:
> >> I myself was thinking of this as the chance for a much needed
> >> simplification of the scheduling code and if this can be done with the
> >> result being "reasonable" it then gives us the basis on which to propose
> >> improvements based on the ideas of others such as you mention.
> >> As the size of the cpusched indicates, trying to evaluate alternative
> >> proposals based on the current O(1) scheduler is fraught. Hopefully,
>
> On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote:
> > I don't know why. The problem is that you can't really evaluate good
> > proposals by looking at the code (you can say that one is bad, ie. the
> > current one, which has a huge amount of temporal complexity and is
> > explicitly unfair), but it is pretty hard to say one behaves well.
> > And my scheduler for example cuts down the amount of policy code and
> > code size significantly. I haven't looked at Con's ones for a while,
> > but I believe they are also much more straightforward than mainline...
> > For example, let's say all else is equal between them, then why would
> > we go with the O(logN) implementation rather than the O(1)?
>
> All things are not equal; they all have different properties. I like
Exactly. So we have to explore those properties and evaluate performance
(in all meanings of the word). That's only logical.
> On a random note, limitations on kernel address space make O(lg(n))
> effectively O(1), albeit with large upper bounds on the worst case
> and an expected case much faster than the worst case.
Yeah. O(n!) is also O(1) if you can put an upper bound on n ;)
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 6:03 ` Peter Williams
2007-04-17 6:14 ` William Lee Irwin III
@ 2007-04-17 6:23 ` Nick Piggin
2007-04-17 9:36 ` Ingo Molnar
2 siblings, 0 replies; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 6:23 UTC (permalink / raw)
To: Peter Williams
Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar,
Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds,
Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 04:03:41PM +1000, Peter Williams wrote:
> Nick Piggin wrote:
> >
> >But you add extra code for that on top of what we have, and are also
> >prevented from making per-cpu assumptions.
> >
> >And you can get N CPUs per runqueue behaviour by having them in a domain
> >with no restrictions on idle balancing. So where does your increased
> >flexibility come from?
> >
> >>One advantage of allowing multiple CPUs per run queue would be at the
> >>smaller end of the system scale i.e. a PC with a single hyper threading
> >>chip (i.e. 2 CPUs) would not need to worry about load balancing at all
> >>if both CPUs used the one runqueue and all the nasty side effects that
> >>come with hyper threading would be minimized at the same time.
> >
> >I don't know about that -- the current load balancer already minimises
> >the nasty multi threading effects. SMT is very important for IBM's chips
> >for example, and they've never had any problem with that side of it
> >since it was introduced and bugs ironed out (at least, none that I've
> >heard).
> >
>
> There's a lot of ugly code in the load balancer that is only there to
> overcome the side effects of SMT and dual core. A lot of it was put
> there by Intel employees trying to make load balancing more friendly to
I agree that some of that has exploded complexity. I have some
thoughts about better approaches for some of those things, but
basically been stuck working on VM problems for a while.
> their systems. What I'm suggesting is that N CPUs per runqueue is a
> better way of achieving that end. I may (of course) be wrong but I
> think that the idea deserves more consideration than you're willing to
> give it.
Put it this way: it is trivial to group the load balancing stats
of N CPUs with their own runqueues. Just put them under a domain
and take the sum. The domain essentially takes on the same function
as a single queue with N CPUs under it. Anything _further_ you can
do with individual runqueues (like naturally adding an affinity
pressure ranging from nothing to absolute) is something that you
don't trivially get with a 1:N approach. AFAIKS.
So I will definitely give any idea consideration, but I just need to
be shown where the benefit comes from.
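A minimal sketch of the grouping described above (types and field names are
made up for the example, not the kernel's sched_domain code): the balancing
statistic for a group of CPUs is just the sum over their individual
runqueues, so grouping does not require them to share one queue.

#include <stdio.h>

struct toy_rq {
        unsigned long nr_running;
        unsigned long weighted_load;
};

/* Sum the load of every runqueue in a "domain": group N CPUs for
 * balancing decisions without making them share a single queue. */
static unsigned long domain_load(const struct toy_rq *rqs, int n)
{
        unsigned long sum = 0;
        for (int i = 0; i < n; i++)
                sum += rqs[i].weighted_load;
        return sum;
}

int main(void)
{
        struct toy_rq cpus[4] = { { 2, 2048 }, { 1, 1024 }, { 0, 0 }, { 3, 3072 } };
        printf("domain load over 4 cpus: %lu\n", domain_load(cpus, 4));
        return 0;
}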
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 4:29 ` Nick Piggin
2007-04-17 5:53 ` Willy Tarreau
2007-04-17 6:09 ` William Lee Irwin III
@ 2007-04-17 6:23 ` Peter Williams
2007-04-17 6:44 ` Nick Piggin
2007-04-17 8:44 ` Ingo Molnar
2 siblings, 2 replies; 577+ messages in thread
From: Peter Williams @ 2007-04-17 6:23 UTC (permalink / raw)
To: Nick Piggin
Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
Nick Piggin wrote:
> On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote:
>> Nick Piggin wrote:
>>> On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote:
>>>> On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote:
>>>>> Mike Galbraith wrote:
>>>>>> Demystify what? The casual observer need only read either your attempt
>>>>>> at writing a scheduler, or my attempts at fixing the one we have, to see
>>>>>> that it was high time for someone with the necessary skills to step in.
>>>>> Make that "someone with the necessary clout".
>>>> No, I was brutally honest to both of us, but quite correct.
>>>>
>>>>>> Now progress can happen, which was _not_ happening before.
>>>>>>
>>>>> This is true.
>>>> Yup, and progress _is_ happening now, quite rapidly.
>>> Progress as in progress on Ingo's scheduler. I still don't know how we'd
>>> decide when to replace the mainline scheduler or with what.
>>>
>>> I don't think we can say Ingo's is better than the alternatives, can we?
>>> If there is some kind of bakeoff, then I'd like one of Con's designs to
>>> be involved, and mine, and Peter's...
>> I myself was thinking of this as the chance for a much needed
>> simplification of the scheduling code and if this can be done with the
>> result being "reasonable" it then gives us the basis on which to propose
>> improvements based on the ideas of others such as you mention.
>>
>> As the size of the cpusched indicates, trying to evaluate alternative
>> proposals based on the current O(1) scheduler is fraught. Hopefully,
>
> I don't know why. The problem is that you can't really evaluate good
> proposals by looking at the code (you can say that one is bad, ie. the
> current one, which has a huge amount of temporal complexity and is
> explicitly unfair), but it is pretty hard to say one behaves well.
I meant that it's indicative of the amount of work that you have to do
to implement a new scheduling discipline for evaluation.
>
> And my scheduler for example cuts down the amount of policy code and
> code size significantly.
Yours is one of the smaller patches mainly because you perpetuate (or
you did in the last one I looked at) the (horrible to my eyes) dual
array (active/expired) mechanism. That this idea was bad should have
been apparent to all as soon as the decision was made to excuse some
tasks from being moved from the active array to the expired array. This
essentially meant that there would be circumstances where extreme
unfairness (to the extent of starvation in some cases) could occur -- the
very thing that the mechanism was originally designed to prevent (as far
as I can gather). Right about then in the development of the O(1) scheduler
alternative solutions should have been sought.
Another hint that it was a bad idea was the need to transfer time slices
between children and parents during fork() and exit().
This disregard for the dual array mechanism has prevented me from
looking at the rest of your scheduler in any great detail so I can't
comment on any other ideas that may be in there.
> I haven't looked at Con's ones for a while,
> but I believe they are also much more straightforward than mainline...
I like Con's scheduler (partly because it uses a single array) but
mainly because it's nice and simple. However, his earlier schedulers
were prone to starvation (admittedly, only if you went out of your way
to make it happen) and I tried to convince him to use the anti
starvation mechanism in my SPA schedulers but was unsuccessful. I
haven't looked at his latest scheduler that sparked all this furore so
can't comment on it.
>
> For example, let's say all else is equal between them, then why would
> we go with the O(logN) implementation rather than the O(1)?
In the highly unlikely event that you can't separate them on technical
grounds, Occam's razor recommends choosing the simplest solution. :-)
To digress, my main concern is that load balancing is being lumped in
with this new change. It's becoming "accept this big lump of new code
or nothing". I'd rather see a good fix to the intra-runqueue/CPU
scheduler problem implemented first and then, if there really are any
outstanding problems with the load balancer, attack them later. Having
them all mixed up together gives me a nasty deja vu of impending disaster.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 4:14 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Nick Piggin
@ 2007-04-17 6:26 ` Peter Williams
2007-04-17 9:51 ` Ingo Molnar
1 sibling, 0 replies; 577+ messages in thread
From: Peter Williams @ 2007-04-17 6:26 UTC (permalink / raw)
To: Nick Piggin
Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
Nick Piggin wrote:
> Well I know people have had woes with the scheduler for ever (I guess that
> isn't going to change :P). I think people generally lost a bit of interest
> in trying to improve the situation because of the upstream problem.
Yes.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 6:15 ` Nick Piggin
@ 2007-04-17 6:26 ` William Lee Irwin III
2007-04-17 7:01 ` Nick Piggin
2007-04-17 6:50 ` Davide Libenzi
1 sibling, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-17 6:26 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar,
ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
>> All things are not equal; they all have different properties. I like
On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
> Exactly. So we have to explore those properties and evaluate performance
> (in all meanings of the word). That's only logical.
Any chance you'd be willing to put down a few thoughts on what sorts
of standards you'd like to set for both correctness (i.e. the bare
minimum a scheduler implementation must do to be considered valid
beyond not oopsing) and performance metrics (i.e. things that produce
numbers for each scheduler you can compare to say "this scheduler is
better than this other scheduler at this.").
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 6:23 ` Peter Williams
@ 2007-04-17 6:44 ` Nick Piggin
2007-04-17 7:48 ` Peter Williams
2007-04-17 8:44 ` Ingo Molnar
1 sibling, 1 reply; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 6:44 UTC (permalink / raw)
To: Peter Williams
Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
On Tue, Apr 17, 2007 at 04:23:37PM +1000, Peter Williams wrote:
> Nick Piggin wrote:
> >And my scheduler for example cuts down the amount of policy code and
> >code size significantly.
>
> Yours is one of the smaller patches mainly because you perpetuate (or
> you did in the last one I looked at) the (horrible to my eyes) dual
> array (active/expired) mechanism.
Actually, I wasn't comparing with other out of tree schedulers (but it
is good to know mine is among the smaller ones). I was comparing with
the mainline scheduler, which also has the dual arrays.
> That this idea was bad should have
> been apparent to all as soon as the decision was made to excuse some
> tasks from being moved from the active array to the expired array. This
My patch doesn't implement any such excusing.
> essentially meant that there would be circumstances where extreme
> unfairness (to the extent of starvation in some cases) could occur -- the
> very thing that the mechanism was originally designed to prevent (as far
> as I can gather). Right about then in the development of the O(1) scheduler
> alternative solutions should have been sought.
Fairness has always been my first priority, and I consider it a bug
if it is possible for any process to get more CPU time than a CPU hog
over the long term. Or over another task doing the same thing, for
that matter.
> Another hint that it was a bad idea was the need to transfer time slices
> between children and parents during fork() and exit().
I don't see how that has anything to do with dual arrays. If you put
a new child at the back of the queue, then your various interactive
shell commands that typically do a lot of dependant forking get slowed
right down behind your compile job. If you give a new child its own
timeslice irrespective of the parent, then you have things like 'make'
(which doesn't use a lot of CPU time) spawning off lots of high
priority children.
You need to do _something_ (Ingo's does). I don't see why this would
be tied with a dual array. FWIW, mine doesn't do anything on exit()
like most others, but it may need more tuning in this area.
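For reference, a simplified sketch of the fork-time bookkeeping being argued
about here (field and function names are illustrative; the mainline 2.6 O(1)
scheduler did something along these lines, splitting the parent's remaining
slice with the child so that fork() cannot mint free CPU time):

#include <stdio.h>

struct toy_task {
        unsigned int time_slice;        /* remaining slice, in ticks */
};

static void fork_timeslice(struct toy_task *parent, struct toy_task *child)
{
        /* child gets half of what the parent had left... */
        child->time_slice = (parent->time_slice + 1) / 2;
        /* ...and the parent keeps the other half */
        parent->time_slice /= 2;
}

int main(void)
{
        struct toy_task parent = { .time_slice = 100 }, child = { 0 };

        fork_timeslice(&parent, &child);
        printf("parent: %u ticks, child: %u ticks\n",
               parent.time_slice, child.time_slice);
        return 0;
}

The exit()-side transfer mentioned above is the mirror image: an unused
share handed back to the parent when the child dies early.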
> This disregard for the dual array mechanism has prevented me from
> looking at the rest of your scheduler in any great detail so I can't
> comment on any other ideas that may be in there.
Well I wasn't really asking you to review it. As I said, everyone
has their own idea of what a good design does, and review can't really
distinguish between the better of two reasonable designs.
A fair evaluation of the alternatives seems like a good idea though.
Nobody is actually against this, are they?
> >I haven't looked at Con's ones for a while,
> >but I believe they are also much more straightforward than mainline...
>
> I like Con's scheduler (partly because it uses a single array) but
> mainly because it's nice and simple. However, his earlier schedulers
> were prone to starvation (admittedly, only if you went out of your way
> to make it happen) and I tried to convince him to use the anti
> starvation mechanism in my SPA schedulers but was unsuccessful. I
> haven't looked at his latest scheduler that sparked all this furore so
> can't comment on it.
I agree starvation or unfairness is unacceptable for a new scheduler.
> >For example, let's say all else is equal between them, then why would
> >we go with the O(logN) implementation rather than the O(1)?
>
> In the highly unlikely event that you can't separate them on technical
> grounds, Occam's razor recommends choosing the simplest solution. :-)
O(logN) vs O(1) is technical grounds.
But yeah, see my earlier comment: simplicity would be a factor too.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 6:15 ` Nick Piggin
2007-04-17 6:26 ` William Lee Irwin III
@ 2007-04-17 6:50 ` Davide Libenzi
2007-04-17 7:09 ` William Lee Irwin III
2007-04-17 7:11 ` Nick Piggin
1 sibling, 2 replies; 577+ messages in thread
From: Davide Libenzi @ 2007-04-17 6:50 UTC (permalink / raw)
To: Nick Piggin
Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
Con Kolivas, Ingo Molnar, ck list, Bill Huey,
Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Tue, 17 Apr 2007, Nick Piggin wrote:
> > All things are not equal; they all have different properties. I like
>
> Exactly. So we have to explore those properties and evaluate performance
> (in all meanings of the word). That's only logical.
I had a quick look at Ingo's code yesterday. Ingo is always smart to
prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;)
And even this code does that pretty nicely. The deadline design looks
good, although I think the final "key" calculation code will end up quite
different from what it looks now.
I would suggest to thoroughly test all your alternatives before deciding.
Some code and design may look very good and small at the beginning, but
when you start patching it to cover all the dark spots, you effectively
end up with another thing (in both design and code footprint).
About O(1), I never thought it was a must (besides a good marketing
material), and O(log(N)) *may* be just fine (to be verified, of course).
- Davide
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 6:26 ` William Lee Irwin III
@ 2007-04-17 7:01 ` Nick Piggin
2007-04-17 8:23 ` William Lee Irwin III
2007-04-17 21:39 ` Matt Mackall
0 siblings, 2 replies; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 7:01 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar,
ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
> On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
> >> All things are not equal; they all have different properties. I like
>
> On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
> > Exactly. So we have to explore those properties and evaluate performance
> > (in all meanings of the word). That's only logical.
>
> Any chance you'd be willing to put down a few thoughts on what sorts
> of standards you'd like to set for both correctness (i.e. the bare
> minimum a scheduler implementation must do to be considered valid
> beyond not oopsing) and performance metrics (i.e. things that produce
> numbers for each scheduler you can compare to say "this scheduler is
> better than this other scheduler at this.").
Yeah I guess that's the hard part :)
For correctness, I guess fairness is an easy one. I think that unfairness
is basically a bug and that it would be very unfortunate to merge something
unfair. But this is just within the context of a single runqueue... for
better or worse, we allow some unfairness in multiprocessors for performance
reasons of course.
Latency. Given N tasks in the system, an arbitrary task should get
onto the CPU in a bounded amount of time (excluding events like freak
IRQ holdoffs and such, obviously -- ie. just considering the context
of the scheduler's state machine).
I wouldn't like to see a significant drop in any micro or macro
benchmarks or, even worse, real workloads, but I could accept some if it
means having a fair scheduler by default.
Now it isn't actually too hard to achieve the above, I think. The hard bit
is trying to compare interactivity. Ideally, we'd be able to get scripted
dumps of login sessions, and measure scheduling latencies of key processes
(sh/X/wm/xmms/firefox/etc). People would send a dump if they were having
problems with any scheduler, and we could compare all of them against it.
Wishful thinking!
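A very rough sketch of the sort of per-process latency measurement being
wished for here (a toy probe, not a proposed benchmark; the interval and
iteration count are arbitrary): sleep for a fixed period under load and
record how late the wakeup actually came back.

#include <stdio.h>
#include <time.h>

static long long ts_ns(const struct timespec *t)
{
        return (long long)t->tv_sec * 1000000000LL + t->tv_nsec;
}

int main(void)
{
        struct timespec before, after;
        struct timespec req = { 0, 10 * 1000 * 1000 };  /* ask for 10ms sleeps */
        long long worst = 0;

        for (int i = 0; i < 100; i++) {
                clock_gettime(CLOCK_MONOTONIC, &before);
                nanosleep(&req, NULL);
                clock_gettime(CLOCK_MONOTONIC, &after);
                /* lateness = actual elapsed time minus requested sleep */
                long long late = ts_ns(&after) - ts_ns(&before) - ts_ns(&req);
                if (late > worst)
                        worst = late;
        }
        printf("worst wakeup lateness over 100 sleeps: %lld ns\n", worst);
        return 0;
}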
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 6:50 ` Davide Libenzi
@ 2007-04-17 7:09 ` William Lee Irwin III
2007-04-17 7:22 ` Peter Williams
` (3 more replies)
2007-04-17 7:11 ` Nick Piggin
1 sibling, 4 replies; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-17 7:09 UTC (permalink / raw)
To: Davide Libenzi
Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas,
Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> I had a quick look at Ingo's code yesterday. Ingo is always smart to
> prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;)
> And even this code does that pretty nicely. The deadline designs looks
> good, although I think the final "key" calculation code will end up quite
> different from what it looks now.
The additive nice_offset breaks nice levels. A multiplicative priority
weighting of a different, nonnegative metric of cpu utilization from
what's now used is required for nice levels to work. I've been trying
to point this out politely by strongly suggesting testing whether nice
levels work.
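A toy illustration of the distinction being drawn (the weights and offsets
below are invented for the example, not CFS's or anyone else's numbers):
with an additive offset on the sort key, the effect of nice fades as
accumulated CPU time grows, whereas a multiplicative weight keeps the
intended ratio between nice levels.

#include <stdio.h>

int main(void)
{
        /* three tasks at assumed nice levels -10, 0, +10 */
        const double weight[3] = { 4.0, 1.0, 0.25 };    /* invented weights */
        const double offset[3] = { -50.0, 0.0, 50.0 };  /* invented offsets */

        for (double used = 100.0; used <= 10000.0; used *= 10.0) {
                printf("cpu used %7.0f ms:", used);
                for (int i = 0; i < 3; i++)
                        printf("  add=%8.0f mul=%8.0f",
                               used + offset[i], used / weight[i]);
                printf("\n");
        }
        /* As 'used' grows, the additive keys converge (nice stops mattering),
         * while the multiplicative keys stay in a fixed ratio. */
        return 0;
}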
On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> I would suggest to thoroughly test all your alternatives before deciding.
> Some code and design may look very good and small at the beginning, but
> when you start patching it to cover all the dark spots, you effectively
> end up with another thing (in both design and code footprint).
> About O(1), I never thought it was a must (besides a good marketing
> material), and O(log(N)) *may* be just fine (to be verified, of course).
The trouble with thorough testing right now is that no one agrees on
what the tests should be and a number of the testcases are not in great
shape. An agreed-upon set of testcases for basic correctness should be
devised and the implementations of those testcases need to be
maintainable code and the tests set up for automated testing and
changing their parameters without recompiling via command-line options.
Once there's a standard regression test suite for correctness, one
needs to be devised for performance, including interactive performance.
The primary difficulty I see along these lines is finding a way to
automate tests of graphics and input device response performance. Others,
like how deterministically priorities are respected over progressively
smaller time intervals and noninteractive workload performance are
nowhere near as difficult to arrange and in many cases already exist.
Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 6:50 ` Davide Libenzi
2007-04-17 7:09 ` William Lee Irwin III
@ 2007-04-17 7:11 ` Nick Piggin
2007-04-17 7:21 ` Davide Libenzi
1 sibling, 1 reply; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 7:11 UTC (permalink / raw)
To: Davide Libenzi
Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
Con Kolivas, Ingo Molnar, ck list, Bill Huey,
Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> On Tue, 17 Apr 2007, Nick Piggin wrote:
>
> > > All things are not equal; they all have different properties. I like
> >
> > Exactly. So we have to explore those properties and evaluate performance
> > (in all meanings of the word). That's only logical.
>
> I had a quick look at Ingo's code yesterday. Ingo is always smart to
> prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;)
> And even this code does that pretty nicely. The deadline designs looks
> good, although I think the final "key" calculation code will end up quite
> different from what it looks now.
> I would suggest to thoroughly test all your alternatives before deciding.
> Some code and design may look very good and small at the beginning, but
> when you start patching it to cover all the dark spots, you effectively
> end up with another thing (in both design and code footprint).
> About O(1), I never thought it was a must (besides a good marketing
> material), and O(log(N)) *may* be just fine (to be verified, of course).
To be clear, I'm not saying O(logN) itself is a big problem. Type
plot [10:100] x with lines, log(x) with lines, 1 with lines
into gnuplot. I was just trying to point out that we need to evaluate
things. Considering how long we've had this scheduler with its known
deficiencies, let's pick a new one wisely.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 7:11 ` Nick Piggin
@ 2007-04-17 7:21 ` Davide Libenzi
0 siblings, 0 replies; 577+ messages in thread
From: Davide Libenzi @ 2007-04-17 7:21 UTC (permalink / raw)
To: Nick Piggin
Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
Con Kolivas, Ingo Molnar, ck list, Bill Huey,
Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Tue, 17 Apr 2007, Nick Piggin wrote:
> To be clear, I'm not saying O(logN) itself is a big problem. Type
>
> plot [10:100] x with lines, log(x) with lines, 1 with lines
Haha, Nick, I know what a log() looks like :)
The Time Ring I posted as an example (which is nothing other than a
ring-based bucket sort) keeps O(1) if you can concede some timer
clustering.
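A rough sketch of a ring of time buckets in that spirit (the slot count and
granularity are made-up values, and this is not the code Davide posted):
insertion hashes the deadline into a slot, which is O(1), at the cost of
clustering everything that lands within one slot's granularity.

#include <stdio.h>

#define RING_SLOTS      256
#define SLOT_NSEC       1000000ULL      /* 1ms of clustering per slot (assumed) */

struct ring_entry {
        unsigned long long deadline;    /* absolute deadline, ns */
        struct ring_entry *next;
};

static struct ring_entry *ring[RING_SLOTS];

/* O(1) insert: pick the slot the deadline falls into and push onto it. */
static void ring_insert(struct ring_entry *e)
{
        unsigned int slot = (unsigned int)((e->deadline / SLOT_NSEC) % RING_SLOTS);

        e->next = ring[slot];
        ring[slot] = e;
}

int main(void)
{
        struct ring_entry a = { 1500000ULL, NULL };
        struct ring_entry b = { 1900000ULL, NULL };

        ring_insert(&a);
        ring_insert(&b);        /* clusters into the same 1ms slot as 'a' */
        printf("slot 1 now holds deadline %llu first\n", ring[1]->deadline);
        return 0;
}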
- Davide
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 7:09 ` William Lee Irwin III
@ 2007-04-17 7:22 ` Peter Williams
2007-04-17 7:23 ` Nick Piggin
` (2 subsequent siblings)
3 siblings, 0 replies; 577+ messages in thread
From: Peter Williams @ 2007-04-17 7:22 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Davide Libenzi, Nick Piggin, Mike Galbraith, Con Kolivas,
Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
William Lee Irwin III wrote:
> On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
>> I had a quick look at Ingo's code yesterday. Ingo is always smart to
>> prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;)
>> And even this code does that pretty nicely. The deadline designs looks
>> good, although I think the final "key" calculation code will end up quite
>> different from what it looks now.
>
> The additive nice_offset breaks nice levels. A multiplicative priority
> weighting of a different, nonnegative metric of cpu utilization from
> what's now used is required for nice levels to work. I've been trying
> to point this out politely by strongly suggesting testing whether nice
> levels work.
>
>
> On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
>> I would suggest to thoroughly test all your alternatives before deciding.
>> Some code and design may look very good and small at the beginning, but
>> when you start patching it to cover all the dark spots, you effectively
>> end up with another thing (in both design and code footprint).
>> About O(1), I never thought it was a must (besides a good marketing
>> material), and O(log(N)) *may* be just fine (to be verified, of course).
>
> The trouble with thorough testing right now is that no one agrees on
> what the tests should be and a number of the testcases are not in great
> shape. An agreed-upon set of testcases for basic correctness should be
> devised and the implementations of those testcases need to be
> maintainable code and the tests set up for automated testing and
> changing their parameters without recompiling via command-line options.
>
> Once there's a standard regression test suite for correctness, one
> needs to be devised for performance, including interactive performance.
> The primary difficulty I see along these lines is finding a way to
> automate tests of graphics and input device response performance. Others,
> like how deterministically priorities are respected over progressively
> smaller time intervals and noninteractive workload performance are
> nowhere near as difficult to arrange and in many cases already exist.
> Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.
At this point, I'd like to direct everyone's attention to the simloads package:
<http://downloads.sourceforge.net/cpuse/simloads-0.1.1.tar.gz>
which contains a set of programs designed to be used in the construction
of CPU scheduler tests. Of particular use is the aspin program which
can be used to launch tasks with specified sleep/wake characteristics.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 7:09 ` William Lee Irwin III
2007-04-17 7:22 ` Peter Williams
@ 2007-04-17 7:23 ` Nick Piggin
2007-04-17 7:27 ` Davide Libenzi
2007-04-17 7:33 ` Ingo Molnar
3 siblings, 0 replies; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 7:23 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Davide Libenzi, Peter Williams, Mike Galbraith, Con Kolivas,
Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 12:09:49AM -0700, William Lee Irwin III wrote:
>
> The trouble with thorough testing right now is that no one agrees on
> what the tests should be and a number of the testcases are not in great
> shape. An agreed-upon set of testcases for basic correctness should be
> devised and the implementations of those testcases need to be
> maintainable code and the tests set up for automated testing and
> changing their parameters without recompiling via command-line options.
>
> Once there's a standard regression test suite for correctness, one
> needs to be devised for performance, including interactive performance.
> The primary difficulty I see along these lines is finding a way to
> automate tests of graphics and input device response performance. Others,
> like how deterministically priorities are respected over progressively
> smaller time intervals and noninteractive workload performance are
> nowhere near as difficult to arrange and in many cases already exist.
> Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.
Definitely. It would be really good if we could have interactivity
regression tests too (see my earlier wishful email). The problem
with a lot of the scripted interactivity tests I see is that they
don't really capture the complexities of the interactions between,
say, an interactive X session. Others just go straight for trying
to exploit the design by making lots of high priority processes
runnablel at once. This just provides an unrealistic decoy and you
end up trying to tune for the wrong thing.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 7:09 ` William Lee Irwin III
2007-04-17 7:22 ` Peter Williams
2007-04-17 7:23 ` Nick Piggin
@ 2007-04-17 7:27 ` Davide Libenzi
2007-04-17 7:33 ` Nick Piggin
2007-04-17 7:33 ` Ingo Molnar
3 siblings, 1 reply; 577+ messages in thread
From: Davide Libenzi @ 2007-04-17 7:27 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas,
Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
On Tue, 17 Apr 2007, William Lee Irwin III wrote:
> On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> > I would suggest to thoroughly test all your alternatives before deciding.
> > Some code and design may look very good and small at the beginning, but
> > when you start patching it to cover all the dark spots, you effectively
> > end up with another thing (in both design and code footprint).
> > About O(1), I never thought it was a must (besides a good marketing
> > material), and O(log(N)) *may* be just fine (to be verified, of course).
>
> The trouble with thorough testing right now is that no one agrees on
> what the tests should be and a number of the testcases are not in great
> shape. An agreed-upon set of testcases for basic correctness should be
> devised and the implementations of those testcases need to be
> maintainable code and the tests set up for automated testing and
> changing their parameters without recompiling via command-line options.
>
> Once there's a standard regression test suite for correctness, one
> needs to be devised for performance, including interactive performance.
> The primary difficulty I see along these lines is finding a way to
> automate tests of graphics and input device response performance. Others,
> like how deterministically priorities are respected over progressively
> smaller time intervals and noninteractive workload performance are
> nowhere near as difficult to arrange and in many cases already exist.
> Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.
What I meant was that the rules (requirements and associated test cases)
for this new Scheduler Amazing Race should be set out up front, and not
kept a moving target to fit and follow one or the other implementation.
- Davide
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 7:09 ` William Lee Irwin III
` (2 preceding siblings ...)
2007-04-17 7:27 ` Davide Libenzi
@ 2007-04-17 7:33 ` Ingo Molnar
2007-04-17 7:40 ` Nick Piggin
2007-04-17 9:05 ` William Lee Irwin III
3 siblings, 2 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-17 7:33 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith,
Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
* William Lee Irwin III <wli@holomorphy.com> wrote:
> On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> > I had a quick look at Ingo's code yesterday. Ingo is always smart to
> > prepare a main dish (feature) with a nice sider (code cleanup) to
> > Linus ;) And even this code does that pretty nicely. The deadline
> > designs looks good, although I think the final "key" calculation
> > code will end up quite different from what it looks now.
>
> The additive nice_offset breaks nice levels. A multiplicative priority
> weighting of a different, nonnegative metric of cpu utilization from
> what's now used is required for nice levels to work. I've been trying
> to point this out politely by strongly suggesting testing whether nice
> levels work.
granted, CFS's nice code is still incomplete, but you err quite
significantly with this extreme statement that they are "broken".
nice levels certainly work to a fair degree even in the current code and
much of the focus is elsewhere - just try it. (In fact i claim that
CFS's nice levels often work _better_ than the mainline scheduler's nice
level support, for the testcases that matter to users.)
The precise behavior of nice levels, as i pointed out in previous
mails, is largely 'uninteresting' and it has changed multiple times in
the past 10 years.
What matters to users is mainly: whether X reniced to -10 does get
enough CPU time and whether stuff reniced to +19 doesnt take away too
much CPU time from the rest of the system. _How_ a Linux scheduler
achieves this is an internal matter and certainly CFS does it in a hacky
way at the moment.
All the rest, 'CPU bandwidth utilization' or whatever abstract metric we
could come up with is just a fancy academic technicality that has no
real significance to any of the testers who are trying CFS right now.
Sure we prefer final solutions that are clean and make sense (because
such things are the easiest to maintain long-term), and often such final
solutions are quite close to academic concepts, and i think Davide
correctly observed this by saying that "the final key calculation code
will end up quite different from what it looks now", but your
extreme-end claim of 'breakage' for something that is just plain
incomplete is not really a fair characterisation at this point.
Anyone who thinks that there exists only two kinds of code: 100% correct
and 100% incorrect with no shades of grey inbetween is in reality a sort
of an extremist: whom, depending on mood and affection, we could call
either a 'coding purist' or a 'coding taliban' ;-)
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 7:27 ` Davide Libenzi
@ 2007-04-17 7:33 ` Nick Piggin
0 siblings, 0 replies; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 7:33 UTC (permalink / raw)
To: Davide Libenzi
Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
Con Kolivas, Ingo Molnar, ck list, Bill Huey,
Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 12:27:28AM -0700, Davide Libenzi wrote:
> On Tue, 17 Apr 2007, William Lee Irwin III wrote:
>
> > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> > > I would suggest to thoroughly test all your alternatives before deciding.
> > > Some code and design may look very good and small at the beginning, but
> > > when you start patching it to cover all the dark spots, you effectively
> > > end up with another thing (in both design and code footprint).
> > > About O(1), I never thought it was a must (besides a good marketing
> > > material), and O(log(N)) *may* be just fine (to be verified, of course).
> >
> > The trouble with thorough testing right now is that no one agrees on
> > what the tests should be and a number of the testcases are not in great
> > shape. An agreed-upon set of testcases for basic correctness should be
> > devised and the implementations of those testcases need to be
> > maintainable code and the tests set up for automated testing and
> > changing their parameters without recompiling via command-line options.
> >
> > Once there's a standard regression test suite for correctness, one
> > needs to be devised for performance, including interactive performance.
> > The primary difficulty I see along these lines is finding a way to
> > automate tests of graphics and input device response performance. Others,
> > like how deterministically priorities are respected over progressively
> > smaller time intervals and noninteractive workload performance are
> > nowhere near as difficult to arrange and in many cases already exist.
> > Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.
>
> What I meant was that the rules (requirements and associated test cases)
> for this new Scheduler Amazing Race should be set out up front, and not
> kept a moving target to fit and follow one or the other implementation.
Exactly. Well I don't mind if it is a moving target as such, just as
long as the decisions are rational (no "blah is more important
because I say so").
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 7:33 ` Ingo Molnar
@ 2007-04-17 7:40 ` Nick Piggin
2007-04-17 7:58 ` Ingo Molnar
2007-04-17 9:05 ` William Lee Irwin III
1 sibling, 1 reply; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 7:40 UTC (permalink / raw)
To: Ingo Molnar
Cc: William Lee Irwin III, Davide Libenzi, Peter Williams,
Mike Galbraith, Con Kolivas, ck list, Bill Huey,
Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
>
> * William Lee Irwin III <wli@holomorphy.com> wrote:
>
> > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> > > I had a quick look at Ingo's code yesterday. Ingo is always smart to
> > > prepare a main dish (feature) with a nice sider (code cleanup) to
> > > Linus ;) And even this code does that pretty nicely. The deadline
> > > designs looks good, although I think the final "key" calculation
> > > code will end up quite different from what it looks now.
> >
> > The additive nice_offset breaks nice levels. A multiplicative priority
> > weighting of a different, nonnegative metric of cpu utilization from
> > what's now used is required for nice levels to work. I've been trying
> > to point this out politely by strongly suggesting testing whether nice
> > levels work.
>
> granted, CFS's nice code is still incomplete, but you err quite
> significantly with this extreme statement that they are "broken".
>
> nice levels certainly work to a fair degree even in the current code and
> much of the focus is elsewhere - just try it. (In fact i claim that
> CFS's nice levels often work _better_ than the mainline scheduler's nice
> level support, for the testcases that matter to users.)
>
> The precise behavior of nice levels, as i pointed out in previous
> mails, is largely 'uninteresting' and it has changed multiple times in
> the past 10 years.
>
> What matters to users is mainly: whether X reniced to -10 does get
> enough CPU time and whether stuff reniced to +19 doesnt take away too
> much CPU time from the rest of the system.
I agree there.
> _How_ a Linux scheduler
> achieves this is an internal matter and certainly CFS does it in a hacky
> way at the moment.
>
> All the rest, 'CPU bandwidth utilization' or whatever abstract metric we
> could come up with is just a fancy academic technicality that has no
> real significance to any of the testers who are trying CFS right now.
>
> Sure we prefer final solutions that are clean and make sense (because
> such things are the easiest to maintain long-term), and often such final
> solutions are quite close to academic concepts, and i think Davide
> correctly observed this by saying that "the final key calculation code
> will end up quite different from what it looks now", but your
> extreme-end claim of 'breakage' for something that is just plain
> incomplete is not really a fair characterisation at this point.
>
> Anyone who thinks that there exists only two kinds of code: 100% correct
> and 100% incorrect with no shades of grey inbetween is in reality a sort
> of an extremist: whom, depending on mood and affection, we could call
> either a 'coding purist' or a 'coding taliban' ;-)
Only if you are an extremist-naming extremist with no shades of grey.
Others, like myself, also include 'coding al-qaeda' and 'coding john
howard' in that scale.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 6:44 ` Nick Piggin
@ 2007-04-17 7:48 ` Peter Williams
2007-04-17 7:56 ` Nick Piggin
0 siblings, 1 reply; 577+ messages in thread
From: Peter Williams @ 2007-04-17 7:48 UTC (permalink / raw)
To: Nick Piggin
Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
Nick Piggin wrote:
> On Tue, Apr 17, 2007 at 04:23:37PM +1000, Peter Williams wrote:
>> Nick Piggin wrote:
>>> And my scheduler for example cuts down the amount of policy code and
>>> code size significantly.
>> Yours is one of the smaller patches mainly because you perpetuate (or
>> you did in the last one I looked at) the (horrible to my eyes) dual
>> array (active/expired) mechanism.
>
> Actually, I wasn't comparing with other out of tree schedulers (but it
> is good to know mine is among the smaller ones). I was comparing with
> the mainline scheduler, which also has the dual arrays.
>
>
>> That this idea was bad should have
>> been apparent to all as soon as the decision was made to excuse some
>> tasks from being moved from the active array to the expired array. This
>
> My patch doesn't implement any such excusing.
>
>
>> essentially meant that there would be circumstances where extreme
>> unfairness (to the extent of starvation in some cases) could occur -- the
>> very thing that the mechanism was originally designed to prevent (as far
>> as I can gather). Right about then in the development of the O(1) scheduler
>> alternative solutions should have been sought.
>
> Fairness has always been my first priority, and I consider it a bug
> if it is possible for any process to get more CPU time than a CPU hog
> over the long term. Or over another task doing the same thing, for
> that matter.
>
>
>> Another hint that it was a bad idea was the need to transfer time slices
>> between children and parents during fork() and exit().
>
> I don't see how that has anything to do with dual arrays.
It's totally to do with the dual arrays. The only real purpose of the
time slice in O(1) (regardless of what its perceived purpose was) was to
control the switching between the arrays.
> If you put
> a new child at the back of the queue, then your various interactive
> shell commands that typically do a lot of dependant forking get slowed
> right down behind your compile job. If you give a new child its own
> timeslice irrespective of the parent, then you have things like 'make'
> (which doesn't use a lot of CPU time) spawning off lots of high
> priority children.
This is an artefact of trying to control nice using time slices while
using them for controlling array switching and whatever else they were
being used for. Priority (static and dynamic) is the best way to
implement nice.
>
> You need to do _something_ (Ingo's does). I don't see why this would
> be tied with a dual array. FWIW, mine doesn't do anything on exit()
> like most others, but it may need more tuning in this area.
>
>
>> This disregard for the dual array mechanism has prevented me from
>> looking at the rest of your scheduler in any great detail so I can't
>> comment on any other ideas that may be in there.
>
> Well I wasn't really asking you to review it. As I said, everyone
> has their own idea of what a good design does, and review can't really
> distinguish between the better of two reasonable designs.
>
> A fair evaluation of the alternatives seems like a good idea though.
> Nobody is actually against this, are they?
No. It would be nice if the basic ideas that each scheduler tries to
implement could be extracted and explained though. This could lead to a
melding of ideas that leads to something quite good.
>
>
>>> I haven't looked at Con's ones for a while,
>>> but I believe they are also much more straightforward than mainline...
>> I like Con's scheduler (partly because it uses a single array) but
>> mainly because it's nice and simple. However, his earlier schedulers
>> were prone to starvation (admittedly, only if you went out of your way
>> to make it happen) and I tried to convince him to use the anti
>> starvation mechanism in my SPA schedulers but was unsuccessful. I
>> haven't looked at his latest scheduler that sparked all this furore so
>> can't comment on it.
>
> I agree starvation or unfairness is unacceptable for a new scheduler.
>
>
>>> For example, let's say all else is equal between them, then why would
>>> we go with the O(logN) implementation rather than the O(1)?
>> In the highly unlikely event that you can't separate them on technical
>> grounds, Occam's razor recommends choosing the simplest solution. :-)
>
> O(logN) vs O(1) is technical grounds.
In that case I'd go O(1) provided that the k factor for the O(1) wasn't
greater than O(logN)'s k factor multiplied by logMaxN.
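A worked example of that break-even rule, with assumed numbers (a sketch,
not a measurement): if the maximum plausible number of runnable tasks is
around 2^20, the O(1) scheme only wins while its constant cost stays below
roughly 20x the per-level cost of the O(logN) structure.

#include <math.h>
#include <stdio.h>

/* Break-even check with assumed inputs; build with -lm. */
int main(void)
{
        double max_n = 1048576.0;       /* assumed MaxN (~a million tasks) */
        double k_log = 1.0;             /* per-level cost, arbitrary unit */

        printf("O(1) constant must stay below ~%.0f units to win\n",
               k_log * log2(max_n));
        return 0;
}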
>
> But yeah, see my earlier comment: simplicity would be a factor too.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 7:48 ` Peter Williams
@ 2007-04-17 7:56 ` Nick Piggin
2007-04-17 13:16 ` Peter Williams
0 siblings, 1 reply; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 7:56 UTC (permalink / raw)
To: Peter Williams
Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
On Tue, Apr 17, 2007 at 05:48:55PM +1000, Peter Williams wrote:
> Nick Piggin wrote:
> >>Another hint that it was a bad idea was the need to transfer time slices
> >>between children and parents during fork() and exit().
> >
> >I don't see how that has anything to do with dual arrays.
>
> It's totally to do with the dual arrays. The only real purpose of the
> time slice in O(1) (regardless of what its perceived purpose was) was to
> control the switching between the arrays.
The O(1) design is pretty convoluted in that regard. In my scheduler,
the only purpose of the arrays is to renew time slices.
The fork/exit logic is added to make interactivity better. Ingo's
scheduler has similar equivalent logic.
> >If you put
> >a new child at the back of the queue, then your various interactive
> >shell commands that typically do a lot of dependant forking get slowed
> >right down behind your compile job. If you give a new child its own
> >timeslice irrespective of the parent, then you have things like 'make'
> >(which doesn't use a lot of CPU time) spawning off lots of high
> >priority children.
>
> This is an artefact of trying to control nice using time slices while
> using them for controlling array switching and whatever else they were
> being used for. Priority (static and dynamic) is the best way to
> implement nice.
I don't like the timeslice based nice in mainline. It's too nasty
with latencies. nicksched is far better in that regard IMO.
But I don't know how you can assert a particular way is the best way
to do something.
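To make the distinction concrete: "priority-based nice" maps a nice value
onto a priority slot, while "timeslice-based nice" maps it onto a slice
length. A minimal sketch, with the mainline-style priority mapping shown
for contrast and a purely hypothetical timeslice formula (the constants
are illustrative, not any scheduler's actual tunables):

#define MAX_RT_PRIO          100
#define NICE_TO_PRIO(nice)   (MAX_RT_PRIO + (nice) + 20)  /* nice -20..19 -> prio 100..139 */
#define PRIO_TO_NICE(prio)   ((prio) - MAX_RT_PRIO - 20)

/* hypothetical timeslice-based alternative: higher nice -> shorter slice */
static inline unsigned int nice_to_timeslice_ms(int nice)
{
        return (unsigned int)(100 - 4 * nice);  /* nice -20 -> 180ms, nice 19 -> 24ms */
}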
> >You need to do _something_ (Ingo's does). I don't see why this would
> >be tied with a dual array. FWIW, mine doesn't do anything on exit()
> >like most others, but it may need more tuning in this area.
> >
> >
> >>This disregard for the dual array mechanism has prevented me from
> >>looking at the rest of your scheduler in any great detail so I can't
> >>comment on any other ideas that may be in there.
> >
> >Well I wasn't really asking you to review it. As I said, everyone
> >has their own idea of what a good design does, and review can't really
> >distinguish between the better of two reasonable designs.
> >
> >A fair evaluation of the alternatives seems like a good idea though.
> >Nobody is actually against this, are they?
>
> No. It would be nice if the basic ideas that each scheduler tries to
> implement could be extracted and explained though. This could lead to a
> melding of ideas that leads to something quite good.
>
> >
> >
> >>>I haven't looked at Con's ones for a while,
> >>>but I believe they are also much more straightforward than mainline...
> >>I like Con's scheduler (partly because it uses a single array) but
> >>mainly because it's nice and simple. However, his earlier schedulers
> >>were prone to starvation (admittedly, only if you went out of your way
> >>to make it happen) and I tried to convince him to use the anti
> >>starvation mechanism in my SPA schedulers but was unsuccessful. I
> >>haven't looked at his latest scheduler that sparked all this furore so
> >>can't comment on it.
> >
> >I agree starvation or unfairness is unacceptable for a new scheduler.
> >
> >
> >>>For example, let's say all else is equal between them, then why would
> >>>we go with the O(logN) implementation rather than the O(1)?
> >>In the highly unlikely event that you can't separate them on technical
> >>grounds, Occam's razor recommends choosing the simplest solution. :-)
> >
> >O(logN) vs O(1) is technical grounds.
>
> In that case I'd go O(1) provided that the k factor for the O(1) wasn't
> greater than O(logN)'s k factor multiplied by logMaxN.
Yes, or even significantly greater around typical large sizes of N.
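As a back-of-the-envelope illustration of the k-factor comparison above
(every number here is made up, not a measurement of any scheduler):

#include <math.h>
#include <stdio.h>

/* build with: cc breakeven.c -lm */
int main(void)
{
        double k_o1  = 80.0;    /* assumed cost of an O(1) enqueue/dequeue, ns */
        double k_log = 15.0;    /* assumed cost per rbtree level touched, ns   */
        double max_n = 1000.0;  /* assumed largest plausible runqueue length   */

        /* the criterion above: prefer O(1) only while k_o1 <= k_log * log2(max_n) */
        printf("O(log N) bound at N=%.0f: %.1f ns, O(1): %.1f ns\n",
               max_n, k_log * log2(max_n), k_o1);
        return 0;
}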
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
` (12 preceding siblings ...)
2007-04-16 22:00 ` Andi Kleen
@ 2007-04-17 7:56 ` Andy Whitcroft
2007-04-17 9:32 ` Nick Piggin
2007-04-18 10:22 ` Ingo Molnar
2007-04-18 15:58 ` CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) Christian Hesse
14 siblings, 2 replies; 577+ messages in thread
From: Andy Whitcroft @ 2007-04-17 7:56 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
Steve Fox, Nishanth Aravamudan
Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.
>
> [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
> new scheduler will be active by default and all tasks will default
> to the new SCHED_FAIR interactive scheduling class. ]
>
> Highlights are:
>
> - the introduction of Scheduling Classes: an extensible hierarchy of
> scheduler modules. These modules encapsulate scheduling policy
> details and are handled by the scheduler core without the core
> code assuming about them too much.
>
> - sched_fair.c implements the 'CFS desktop scheduler': it is a
> replacement for the vanilla scheduler's SCHED_OTHER interactivity
> code.
>
> i'd like to give credit to Con Kolivas for the general approach here:
> he has proven via RSDL/SD that 'fair scheduling' is possible and that
> it results in better desktop scheduling. Kudos Con!
>
> The CFS patch uses a completely different approach and implementation
> from RSDL/SD. My goal was to make CFS's interactivity quality exceed
> that of RSDL/SD, which is a high standard to meet :-) Testing
> feedback is welcome to decide this one way or another. [ and, in any
> case, all of SD's logic could be added via a kernel/sched_sd.c module
> as well, if Con is interested in such an approach. ]
>
> CFS's design is quite radical: it does not use runqueues, it uses a
> time-ordered rbtree to build a 'timeline' of future task execution,
> and thus has no 'array switch' artifacts (by which both the vanilla
> scheduler and RSDL/SD are affected).
>
> CFS uses nanosecond granularity accounting and does not rely on any
> jiffies or other HZ detail. Thus the CFS scheduler has no notion of
> 'timeslices' and has no heuristics whatsoever. There is only one
> central tunable:
>
> /proc/sys/kernel/sched_granularity_ns
>
> which can be used to tune the scheduler from 'desktop' (low
> latencies) to 'server' (good batching) workloads. It defaults to a
> setting suitable for desktop workloads. SCHED_BATCH is handled by the
> CFS scheduler module too.
>
> due to its design, the CFS scheduler is not prone to any of the
> 'attacks' that exist today against the heuristics of the stock
> scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
> work fine and do not impact interactivity and produce the expected
> behavior.
>
> the CFS scheduler has a much stronger handling of nice levels and
> SCHED_BATCH: both types of workloads should be isolated much more
> agressively than under the vanilla scheduler.
>
> ( another rdetail: due to nanosec accounting and timeline sorting,
> sched_yield() support is very simple under CFS, and in fact under
> CFS sched_yield() behaves much better than under any other
> scheduler i have tested so far. )
>
> - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
> way than the vanilla scheduler does. It uses 100 runqueues (for all
> 100 RT priority levels, instead of 140 in the vanilla scheduler)
> and it needs no expired array.
>
> - reworked/sanitized SMP load-balancing: the runqueue-walking
> assumptions are gone from the load-balancing code now, and
> iterators of the scheduling modules are used. The balancing code got
> quite a bit simpler as a result.
>
> the core scheduler got smaller by more than 700 lines:
>
> kernel/sched.c | 1454 ++++++++++++++++------------------------------------------------
> 1 file changed, 372 insertions(+), 1082 deletions(-)
>
> and even adding all the scheduling modules, the total size impact is
> relatively small:
>
> 18 files changed, 1454 insertions(+), 1133 deletions(-)
>
> most of the increase is due to extensive comments. The kernel size
> impact is in fact a small negative:
>
> text data bss dec hex filename
> 23366 4001 24 27391 6aff kernel/sched.o.vanilla
> 24159 2705 56 26920 6928 kernel/sched.o.CFS
>
> (this is mainly due to the benefit of getting rid of the expired array
> and its data structure overhead.)
>
> thanks go to Thomas Gleixner and Arjan van de Ven for review of this
> patchset.
>
> as usual, any sort of feedback, bugreports, fixes and suggestions are
> more than welcome,
Pushed this through the test.kernel.org and nothing new blew up.
Notably the kernbench figures are within expectations even on the bigger
numa systems, commonly badly affected by balancing problems in the
scheduler.
I see there is a second one out, I'll push that one through too.
-apw
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 7:40 ` Nick Piggin
@ 2007-04-17 7:58 ` Ingo Molnar
0 siblings, 0 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-17 7:58 UTC (permalink / raw)
To: Nick Piggin
Cc: William Lee Irwin III, Davide Libenzi, Peter Williams,
Mike Galbraith, Con Kolivas, ck list, Bill Huey,
Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
* Nick Piggin <npiggin@suse.de> wrote:
> > Anyone who thinks that there exists only two kinds of code: 100%
> > correct and 100% incorrect with no shades of grey inbetween is in
> > reality a sort of an extremist: whom, depending on mood and
> > affection, we could call either a 'coding purist' or a 'coding
> > taliban' ;-)
>
> Only if you are an extremist-naming extremist with no shades of grey.
> Others, like myself, also include 'coding al-qaeda' and 'coding john
> howard' in that scale.
heh ;) You, you ... nitpicking extremist! ;)
And beware that you just committed another act of extremism too:
> I agree there.
because you just went to the extreme position of saying that "i agree
with this portion 100%", instead of saying "this seems to be 91.5%
correct in my opinion, Tue, 17 Apr 2007 09:40:25 +0200".
and the nasty thing is, that in reality even shades of grey, if you
print them out, are just a set of extreme black dots on an extreme white
sheet of paper! ;)
[ so i guess we've got to consider the scope of extremism too: the
larger the scope, the more limiting and hence the more dangerous it
is. ]
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
[not found] ` <20070417064109.GP8915@holomorphy.com>
@ 2007-04-17 8:00 ` Peter Williams
2007-04-17 10:41 ` William Lee Irwin III
0 siblings, 1 reply; 577+ messages in thread
From: Peter Williams @ 2007-04-17 8:00 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Linux Kernel Mailing List
William Lee Irwin III wrote:
> On Tue, Apr 17, 2007 at 04:34:36PM +1000, Peter Williams wrote:
>> This doesn't make any sense to me.
>> For a start, exact simultaneous operation would be impossible to achieve
>> except with highly specialized architecture such as the long departed
>> transputer. And secondly, I can't see why it's necessary.
>
> We're not going to make any headway here, so we might as well drop the
> thread.
Yes, we were starting to go around in circles weren't we?
>
> There are other things to talk about anyway, for instance I'm seeing
> interest in plugsched come about from elsewhere and am therefore taking
> an interest in getting it into shape wrt. various design goals.
>
> Probably the largest issue of note is getting scheduler drivers
> loadable as kernel modules. Addressing the points Ingo made that can
> be addressed is also lined up for this effort.
>
> Comments on which directions you'd like this to go in these respects
> would be appreciated, as I regard you as the current "project owner."
I'd do a scan through LKML from about 18 months ago looking for mention
of a runtime-configurable version of plugsched. Some students at a
university (in Germany, I think) posted some patches adding this feature
to plugsched around about then.
I never added them to plugsched proper as I knew (from previous
experience when the company I worked for posted patches with similar
functionality) that Linus would like this idea less than he did the
current plugsched mechanism.
Unfortunately, my own cache of the relevant e-mails got overwritten
during a Fedora Core upgrade (I've since moved /var onto a separate
drive to avoid a repetition) or I would dig them out and send them to
you. I'd provided them with copies of the company's patches to use as a
guide to how to overcome the problems associated with changing
schedulers on a running system (a few non-trivial locking issues pop up).
Maybe if one of the students still reads LKML he will provide a pointer.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 7:01 ` Nick Piggin
@ 2007-04-17 8:23 ` William Lee Irwin III
2007-04-17 22:23 ` Davide Libenzi
2007-04-17 21:39 ` Matt Mackall
1 sibling, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-17 8:23 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar,
ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
>> Any chance you'd be willing to put down a few thoughts on what sorts
>> of standards you'd like to set for both correctness (i.e. the bare
>> minimum a scheduler implementation must do to be considered valid
>> beyond not oopsing) and performance metrics (i.e. things that produce
>> numbers for each scheduler you can compare to say "this scheduler is
>> better than this other scheduler at this.").
On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> Yeah I guess that's the hard part :)
> For correctness, I guess fairness is an easy one. I think that unfairness
> is basically a bug and that it would be very unfortunate to merge something
> unfair. But this is just within the context of a single runqueue... for
> better or worse, we allow some unfairness in multiprocessors for performance
> reasons of course.
Requiring that identical tasks be allocated equal shares of CPU
bandwidth is the easy part here. ringtest.c exercises another aspect
of fairness that is extremely important. Generalizing ringtest.c is
a good idea for fairness testing.
But another aspect of fairness is that "controlled unfairness" is also
intended to exist, in no small part by virtue of nice levels, but also
in the form of favoring tasks that are considered interactive somehow.
Testing various forms of controlled unfairness to ensure that they are
indeed controlled and otherwise have the semantics intended is IMHO the
more difficult aspect of fairness testing.
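A rough sketch of the "easy part" -- checking that identical CPU-bound
tasks get equal shares -- might look like the following. This is
illustrative only, not ringtest.c or any of the tests mentioned in this
thread; pin it to one CPU (e.g. with taskset) to factor out SMP
balancing, and the interesting output is the spread between the tasks:

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/resource.h>

#define NTASKS   4
#define SECONDS  10

int main(void)
{
        pid_t pid[NTASKS];
        int i;

        for (i = 0; i < NTASKS; i++) {
                pid[i] = fork();
                if (pid[i] < 0) {
                        perror("fork");
                        return 1;
                }
                if (pid[i] == 0)
                        for (;;)
                                ;               /* identical CPU hog */
        }

        sleep(SECONDS);                         /* let them compete */

        for (i = 0; i < NTASKS; i++) {
                struct rusage ru;

                kill(pid[i], SIGKILL);
                wait4(pid[i], NULL, 0, &ru);    /* per-child CPU usage */
                printf("task %d: %ld.%06ld s of CPU\n", i,
                       (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
        }
        return 0;
}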
On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> Latency. Given N tasks in the system, an arbitrary task should get
> onto the CPU in a bounded amount of time (excluding events like freak
> IRQ holdoffs and such, obviously -- ie. just considering the context
> of the scheduler's state machine).
ISTR Davide Libenzi having a scheduling latency test a number of years
ago. Resurrecting that and tuning it to the needs of this kind of
testing sounds relevant here. The test suite Peter Williams mentioned
would also help.
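In the same spirit, a minimal wakeup-latency probe could look like this
(illustrative only, not Davide's test); run it next to a CPU-bound load
and watch how far the wakeups overshoot the requested sleep:

#include <stdio.h>
#include <time.h>

/* build with: cc wakelat.c -lrt (the -lrt is only needed on older glibc) */
int main(void)
{
        struct timespec req = { 0, 10 * 1000 * 1000 };  /* ask for 10ms sleeps */
        long worst_us = 0;
        int i;

        for (i = 0; i < 1000; i++) {
                struct timespec t0, t1;
                long late_us;

                clock_gettime(CLOCK_MONOTONIC, &t0);
                nanosleep(&req, NULL);
                clock_gettime(CLOCK_MONOTONIC, &t1);

                late_us = (t1.tv_sec - t0.tv_sec) * 1000000L
                        + (t1.tv_nsec - t0.tv_nsec) / 1000L
                        - 10000L;               /* subtract the requested 10ms */
                if (late_us > worst_us)
                        worst_us = late_us;
        }
        printf("worst wakeup latency: %ld us\n", worst_us);
        return 0;
}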
On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> I wouldn't like to see a significant drop in any micro or macro
> benchmarks or even worse real workloads, but I could accept some if it
> means having a fair scheduler by default.
On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> Now it isn't actually too hard to achieve the above, I think. The hard bit
> is trying to compare interactivity. Ideally, we'd be able to get scripted
> dumps of login sessions, and measure scheduling latencies of key processes
> (sh/X/wm/xmms/firefox/etc). People would send a dump if they were having
> problems with any scheduler, and we could compare all of them against it.
> Wishful thinking!
That's a pretty good idea. I'll queue up writing something of that form
as well.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 3:55 ` Nick Piggin
2007-04-17 4:25 ` Peter Williams
@ 2007-04-17 8:24 ` William Lee Irwin III
1 sibling, 0 replies; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-17 8:24 UTC (permalink / raw)
To: Nick Piggin
Cc: Michael K. Edwards, Peter Williams, Ingo Molnar, Matt Mackall,
Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote:
>> This observation of Peter's is the best thing to come out of this
>> whole foofaraw. Looking at what's happening in CPU-land, I think it's
>> going to be necessary, within a couple of years, to replace the whole
>> idea of "CPU scheduling" with "run queue scheduling" across a complex,
>> possibly dynamic mix of CPU-ish resources. Ergo, there's not much
>> point in churning the mainline scheduler through a design that isn't
>> significantly more flexible than any of those now under discussion.
On Tue, Apr 17, 2007 at 05:55:28AM +0200, Nick Piggin wrote:
> Why? If you do that, then your load balancer just becomes less flexible
> because it is harder to have tasks run on one or the other.
On Tue, Apr 17, 2007 at 05:55:28AM +0200, Nick Piggin wrote:
> You can have single-runqueue-per-domain behaviour (or close to) just by
> relaxing all restrictions on idle load balancing within that domain. It
> is harder to go the other way and place any per-cpu affinity or
> restrictions with multiple cpus on a single runqueue.
The big sticking point here is order-sensitivity. One can point to
stringent sched_yield() ordering but that's not so important in and of
itself. The more significant case is RT applications which are order-
sensitive. Per-cpu runqueues rather significantly disturb the ordering
requirements of applications that care about it.
In terms of a plugging framework, the per-cpu arrangement precludes or
makes extremely awkward scheduling policies that don't have per-cpu
runqueues, for instance, the 2.4.x policy. There is also the alternate
SMP scalability strategy of a lockless scheduler with a single global
queue, which is more performance-oriented.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 6:23 ` Peter Williams
2007-04-17 6:44 ` Nick Piggin
@ 2007-04-17 8:44 ` Ingo Molnar
2007-04-19 2:20 ` Peter Williams
1 sibling, 1 reply; 577+ messages in thread
From: Ingo Molnar @ 2007-04-17 8:44 UTC (permalink / raw)
To: Peter Williams
Cc: Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
* Peter Williams <pwil3058@bigpond.net.au> wrote:
> > And my scheduler for example cuts down the amount of policy code and
> > code size significantly.
>
> Yours is one of the smaller patches mainly because you perpetuate (or
> you did in the last one I looked at) the (horrible to my eyes) dual
> array (active/expired) mechanism. That this idea was bad should have
> been apparent to all as soon as the decision was made to excuse some
> tasks from being moved from the active array to the expired array.
> This essentially meant that there would be circumstances where extreme
> unfairness (to the extent of starvation in some cases) could arise -- the
> very thing that the mechanism was originally designed to prevent (as far
> as I can gather). Right about then in the development of the O(1)
> scheduler alternative solutions should have been sought.
in hindsight i'd agree. But back then we were clearly not ready for
fine-grained accurate statistics + trees (cpus are a lot faster at more
complex arithmetic today, plus people still believed that low-res could
be done well enough), and taking out either of those two concepts from
CFS would result in a similarly complex runqueue implementation. Also,
the array switch was just thought to be another piece of 'if the
heuristics go wrong, we fall back to an array switch' logic, right in
line with the other heuristics. And you have to accept it, mainline's
ability to auto-renice make -j jobs (and other CPU hogs) was quite a
plus for developers, so it had (and probably still has) quite some
inertia.
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 7:33 ` Ingo Molnar
2007-04-17 7:40 ` Nick Piggin
@ 2007-04-17 9:05 ` William Lee Irwin III
2007-04-17 9:24 ` Ingo Molnar
1 sibling, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-17 9:05 UTC (permalink / raw)
To: Ingo Molnar
Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith,
Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
* William Lee Irwin III <wli@holomorphy.com> wrote:
>> The additive nice_offset breaks nice levels. A multiplicative priority
>> weighting of a different, nonnegative metric of cpu utilization from
>> what's now used is required for nice levels to work. I've been trying
>> to point this out politely by strongly suggesting testing whether nice
>> levels work.
On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> granted, CFS's nice code is still incomplete, but you err quite
> significantly with this extreme statement that they are "broken".
I used the word relatively loosely. Nothing extreme is going on.
Maybe the phrasing exaggerated the force of the opinion. I'm sorry
about having misspoken.
On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> nice levels certainly work to a fair degree even in the current code and
> much of the focus is elsewhere - just try it. (In fact i claim that
> CFS's nice levels often work _better_ than the mainline scheduler's nice
> level support, for the testcases that matter to users.)
Al Boldi's testcase appears to reveal some issues. I'm plotting a
testcase of my own if I can ever get past responding to email.
On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> The precise behavior of nice levels, as i pointed it out in previous
> mails, is largely 'uninteresting' and it has changed multiple times in
> the past 10 years.
I expect that whether a scheduler can handle such prioritization has a
rather strong predictive quality regarding whether it can handle, say,
CKRM controls. I remain convinced that there should be some target
behavior and that some attempt should be made to achieve it. I don't
think any particular behavior is best, just that the behavior should
be well-defined.
On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> What matters to users is mainly: whether X reniced to -10 does get
> enough CPU time and whether stuff reniced to +19 doesn't take away too
> much CPU time from the rest of the system. _How_ a Linux scheduler
> achieves this is an internal matter and certainly CFS does it in a hacky
> way at the moment.
It's not so far out. Basically just changing the key calculation in a
relatively simple manner should get things into relatively good shape.
It can, of course, be done other ways (I did it a rather different way
in vdls, though that method is not likely to be considered desirable).
I can't really write a testcase for such loose semantics, so the above
description is useless to me. These squishy sorts of definitions of
semantics are also uninformative to users, who, I would argue, do have
some interest in what nice levels mean. There have been at least a small
number of concerns about the strength of nice levels, and issues in that
area would surface earlier if there were an objective target one could
test against to see whether it had been achieved.
It is, furthermore, a user-visible change in system call semantics, and
we should be careful about changing such things out from beneath users.
So I see a lot of good reasons to pin down nice numbers. Incompleteness
is not a particularly mortal sin, but the proliferation of competing
schedulers is creating a need for standards, and that's what I'm really
on about.
On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> All the rest, 'CPU bandwidth utilization' or whatever abstract metric we
> could come up with is just a fancy academic technicality that has no
> real significance to any of the testers who are trying CFS right now.
I could say "percent cpu" if it sounds less like formal jargon, which
"CPU bandwidth utilization" isn't really.
On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> Sure we prefer final solutions that are clean and make sense (because
> such things are the easiest to maintain long-term), and often such final
> solutions are quite close to academic concepts, and i think Davide
> correctly observed this by saying that "the final key calculation code
> will end up quite different from what it looks now", but your
> extreme-end claim of 'breakage' for something that is just plain
> incomplete is not really a fair characterisation at this point.
It wasn't meant to be quite as strong a statement as it came out.
Sorry about that.
On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> Anyone who thinks that there exists only two kinds of code: 100% correct
> and 100% incorrect with no shades of grey inbetween is in reality a sort
> of an extremist: whom, depending on mood and affection, we could call
> either a 'coding purist' or a 'coding taliban' ;-)
I've made no such claims. Also rest assured that the tone of the
critique is not hostile, and wasn't meant to sound that way.
Also, given the general comments it appears clear that some statistical
metric of deviation from the intended behavior furthermore qualified by
timescale is necessary, so this appears to be headed toward a sort of
performance metric as opposed to a pass/fail test anyway. However, to
even measure this at all, some statement of intention is required. I'd
prefer that there be a Linux-standard semantics for nice so results are
more directly comparable and so that users also get similar nice
behavior from the scheduler as it varies over time and possibly
implementations if users should care to switch them out with some
scheduler patch or other.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 9:05 ` William Lee Irwin III
@ 2007-04-17 9:24 ` Ingo Molnar
2007-04-17 9:57 ` William Lee Irwin III
2007-04-17 22:08 ` Matt Mackall
0 siblings, 2 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-17 9:24 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith,
Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
* William Lee Irwin III <wli@holomorphy.com> wrote:
> [...] Also rest assured that the tone of the critique is not hostile,
> and wasn't meant to sound that way.
ok :) (And i guess i was too touchy - sorry about coming out swinging.)
> Also, given the general comments it appears clear that some
> statistical metric of deviation from the intended behavior furthermore
> qualified by timescale is necessary, so this appears to be headed
> toward a sort of performance metric as opposed to a pass/fail test
> anyway. However, to even measure this at all, some statement of
> intention is required. I'd prefer that there be a Linux-standard
> semantics for nice so results are more directly comparable and so that
> users also get similar nice behavior from the scheduler as it varies
> over time and possibly implementations if users should care to switch
> them out with some scheduler patch or other.
yeah. If you could come up with a sane definition that also translates
into low overhead on the algorithm side that would be great! The only
good generic definition i could come up with (nice levels are isolated
buckets with a constant maximum relative percentage of CPU time
available to every active bucket) resulted in having a per-nice-level
array of rbtree roots, which did not look worth the hassle at first
sight :-)
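To make the bucket idea concrete, a minimal structural sketch might look
like this; it is hypothetical, not CFS code (the rb_root stand-in and
the per-bucket share cap are assumptions for illustration):

/* hypothetical: one time-ordered tree per nice level, each active
 * bucket capped at a fixed relative share of CPU time */
struct rb_root_sketch { void *rb_node; };       /* stand-in for the kernel's rb_root */

#define NICE_LEVELS 40                          /* nice -20 .. +19 */

struct nice_bucket {
        struct rb_root_sketch timeline;         /* time-ordered runnable tasks  */
        unsigned int nr_running;                /* tasks queued at this level   */
        unsigned int max_share_pct;             /* CPU share cap while active   */
};

struct fair_rq_sketch {
        struct nice_bucket bucket[NICE_LEVELS]; /* indexed by nice + 20         */
        unsigned int nr_active_buckets;         /* buckets with nr_running > 0  */
};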
until now the main approach for nice levels in Linux was always:
"implement your main scheduling logic for nice 0 and then look for some
low-overhead method that can be glued to it that does something that
behaves like nice levels". Feel free to turn that around into a more
natural approach, but the algorithm should remain fairly simple i think.
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 7:56 ` Andy Whitcroft
@ 2007-04-17 9:32 ` Nick Piggin
2007-04-17 9:59 ` Ingo Molnar
2007-04-18 10:22 ` Ingo Molnar
1 sibling, 1 reply; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 9:32 UTC (permalink / raw)
To: Andy Whitcroft
Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
Steve Fox, Nishanth Aravamudan
On Tue, Apr 17, 2007 at 08:56:27AM +0100, Andy Whitcroft wrote:
> >
> > as usual, any sort of feedback, bugreports, fixes and suggestions are
> > more than welcome,
>
> Pushed this through the test.kernel.org and nothing new blew up.
> Notably the kernbench figures are within expectations even on the bigger
> numa systems, commonly badly affected by balancing problems in the
> scheduler.
>
> I see there is a second one out, I'll push that one through too.
Well I just sent some feedback on cfs-v2, but realised it went off-list,
so I'll resend here because others may find it interesting too. Sorry
about jamming it in here, but it is relevant to performance...
Anyway, roughly in the context of good cfs-v2 interactivity, I wrote:
Well I'm not too surprised. I am disappointed that it uses such small
timeslices (or whatever they are called) as the default.
Using small timeslices is actually a pretty easy way to ensure everything
stays smooth even under load, but is bad for efficiency. Sure you can say
you'll have desktop and server tunings, but... With nicksched I'm testing
a default timeslice of *300ms* even on the desktop, whereas Ingo's seems
to be effectively 3ms :P So if you compare default tunings, it isn't
exactly fair!
Kbuild times on a 2x Xeon:
2.6.21-rc7
508.87user 32.47system 2:17.82elapsed 392%CPU
509.05user 32.25system 2:17.84elapsed 392%CPU
508.75user 32.26system 2:17.83elapsed 392%CPU
508.63user 32.17system 2:17.88elapsed 392%CPU
509.01user 32.26system 2:17.90elapsed 392%CPU
509.08user 32.20system 2:17.95elapsed 392%CPU
2.6.21-rc7-cfs-v2
534.80user 30.92system 2:23.64elapsed 393%CPU
534.75user 31.01system 2:23.70elapsed 393%CPU
534.66user 31.07system 2:23.76elapsed 393%CPU
534.56user 30.91system 2:23.76elapsed 393%CPU
534.66user 31.07system 2:23.67elapsed 393%CPU
535.43user 30.62system 2:23.72elapsed 393%CPU
2.6.21-rc7-nicksched
505.60user 32.31system 2:17.91elapsed 390%CPU
506.55user 32.42system 2:17.66elapsed 391%CPU
506.41user 32.30system 2:17.85elapsed 390%CPU
506.48user 32.36system 2:17.77elapsed 391%CPU
506.10user 32.40system 2:17.81elapsed 390%CPU
506.69user 32.16system 2:17.78elapsed 391%CPU
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 6:03 ` Peter Williams
2007-04-17 6:14 ` William Lee Irwin III
2007-04-17 6:23 ` Nick Piggin
@ 2007-04-17 9:36 ` Ingo Molnar
2 siblings, 0 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-17 9:36 UTC (permalink / raw)
To: Peter Williams
Cc: Nick Piggin, Michael K. Edwards, William Lee Irwin III,
Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds,
Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner
* Peter Williams <pwil3058@bigpond.net.au> wrote:
> There's a lot of ugly code in the load balancer that is only there to
> overcome the side effects of SMT and dual core. A lot of it was put
> there by Intel employees trying to make load balancing more friendly
> to their systems. What I'm suggesting is that an N CPUs per runqueue
> is a better way of achieving that end. I may (of course) be wrong but
> I think that the idea deserves more consideration than you're willing
> to give it.
i actually implemented that some time ago and i'm afraid it was ugly as
hell and pretty fragile. Load-balancing gets simpler, but task picking
gets a lot uglier.
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 4:14 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Nick Piggin
2007-04-17 6:26 ` Peter Williams
@ 2007-04-17 9:51 ` Ingo Molnar
2007-04-17 13:44 ` Peter Williams
2007-04-20 20:47 ` Bill Davidsen
1 sibling, 2 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-17 9:51 UTC (permalink / raw)
To: Nick Piggin
Cc: Mike Galbraith, Peter Williams, Con Kolivas, ck list, Bill Huey,
linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
* Nick Piggin <npiggin@suse.de> wrote:
> > > Maybe the progress is that more key people are becoming open to
> > > the idea of changing the scheduler.
> >
> > Could be. All was quiet for quite a while, but when RSDL showed up,
> > it aroused enough interest to show that scheduling woes are on folks'
> > radar.
>
> Well I know people have had woes with the scheduler for ever (I guess
> that isn't going to change :P). [...]
yes, that part isnt going to change, because the CPU is a _scarce
resource_ that is perhaps the most frequently overcommitted physical
computer resource in existence, and because the kernel does not (yet)
track eye movements of humans to figure out which tasks are more
important to them. So critical human constraints are unknown to the
scheduler and thus complaints will always come.
The upstream scheduler thought it had enough information: the sleep
average. So now the attempt is to go back and _simplify_ the scheduler
and remove that information, and concentrate on getting fairness
precisely right. The magic thing about 'fairness' is that it's a pretty
good default policy if we decide that we simply have not enough
information to do an intelligent choice.
( Let's be cautious though: the jury is still out on whether people actually
like this more than the current approach. While CFS feedback looks
promising after a whopping 3 days of it being released [ ;-) ], the
test coverage of all 'fairness centric' schedulers, even considering
years of availability is less than 1% i'm afraid, and that < 1% was
mostly self-selecting. )
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 9:24 ` Ingo Molnar
@ 2007-04-17 9:57 ` William Lee Irwin III
2007-04-17 10:01 ` Ingo Molnar
2007-04-17 11:31 ` William Lee Irwin III
2007-04-17 22:08 ` Matt Mackall
1 sibling, 2 replies; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-17 9:57 UTC (permalink / raw)
To: Ingo Molnar
Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith,
Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
* William Lee Irwin III <wli@holomorphy.com> wrote:
>> Also, given the general comments it appears clear that some
>> statistical metric of deviation from the intended behavior furthermore
>> qualified by timescale is necessary, so this appears to be headed
>> toward a sort of performance metric as opposed to a pass/fail test
[...]
On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
> yeah. If you could come up with a sane definition that also translates
> into low overhead on the algorithm side that would be great! The only
> good generic definition i could come up with (nice levels are isolated
> buckets with a constant maximum relative percentage of CPU time
> available to every active bucket) resulted in having a per-nice-level
> array of rbtree roots, which did not look worth the hassle at first
> sight :-)
Interesting! That's what vdls did, except its fundamental data structure
was more like a circular buffer data structure (resembling Davide
Libenzi's timer ring in concept, but with all the details different).
I'm not entirely sure how that would've turned out performancewise if
I'd done any tuning at all. I was mostly interested in doing something
like what I heard Bob Mullens did in 1976 for basic pedagogical value
about schedulers to prepare for writing patches for gang scheduling as
opposed to creating a viable replacement for the mainline scheduler.
I'm relatively certain a different key calculation will suffice, but
it may disturb other desired semantics, since the keys really need to be
nonnegative for multiplication by a scaling factor corresponding to a
task's nice number to work properly. Well, as the cfs code now stands, it
would correspond to negative keys. Dividing positive keys by the nice
scaling factor is my first thought of how to extend the method to the
current key semantics. Or such are my thoughts on the subject.
I expect that all that's needed is to fiddle with those numbers a bit.
There's quite some capacity for expression there given the precision.
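As a sketch of what a multiplicative weighting (as opposed to an
additive nice_offset) could look like -- the ~25% change per nice step
and the 1024 baseline are assumptions for illustration, not CFS's actual
numbers:

/* hypothetical: scale consumed CPU time by a per-nice weight, so each
 * nice step changes a task's effective share by a fixed ratio */
static inline unsigned long long
weighted_delta(unsigned long long delta_exec_ns, int nice)
{
        unsigned long long weight = 1024;       /* nice 0 baseline */
        int i;

        for (i = 0; i < nice; i++)
                weight = weight * 4 / 5;        /* positive nice: lighter weight */
        for (i = 0; i > nice; i--)
                weight = weight * 5 / 4;        /* negative nice: heavier weight */

        /* a nice -20 task accrues key far more slowly than a nice +19 one;
         * note this only behaves sensibly for a nonnegative delta, which
         * is the signedness concern raised above */
        return delta_exec_ns * 1024 / weight;
}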
On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
> until now the main approach for nice levels in Linux was always:
> "implement your main scheduling logic for nice 0 and then look for some
> low-overhead method that can be glued to it that does something that
> behaves like nice levels". Feel free to turn that around into a more
> natural approach, but the algorithm should remain fairly simple i think.
Part of my insistence was because it seemed to be relatively close to a
one-liner, though I'm not entirely sure what particular computation to
use to handle the signedness of the keys. I guess I could pick some
particular nice semantics myself and then sweep the extant schedulers
to use them after getting a testcase hammered out.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 9:32 ` Nick Piggin
@ 2007-04-17 9:59 ` Ingo Molnar
2007-04-17 11:11 ` Nick Piggin
2007-04-18 8:55 ` Nick Piggin
0 siblings, 2 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-17 9:59 UTC (permalink / raw)
To: Nick Piggin
Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
Steve Fox, Nishanth Aravamudan
* Nick Piggin <npiggin@suse.de> wrote:
> 2.6.21-rc7-cfs-v2
> 534.80user 30.92system 2:23.64elapsed 393%CPU
> 534.75user 31.01system 2:23.70elapsed 393%CPU
> 534.66user 31.07system 2:23.76elapsed 393%CPU
> 534.56user 30.91system 2:23.76elapsed 393%CPU
> 534.66user 31.07system 2:23.67elapsed 393%CPU
> 535.43user 30.62system 2:23.72elapsed 393%CPU
Thanks for testing this! Could you please try this also with:
echo 100000000 > /proc/sys/kernel/sched_granularity
on the same system, so that we can get a complete set of numbers? Just
to make sure that lowering the preemption frequency indeed has the
expected result of moving kernbench numbers back to mainline levels. (if
not then that would indicate some CFS buglet)
could you maybe even try a more extreme setting of:
echo 500000000 > /proc/sys/kernel/sched_granularity
for kicks? This would allow us to see how much kernbench we lose due to
preemption granularity. Thanks!
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 9:57 ` William Lee Irwin III
@ 2007-04-17 10:01 ` Ingo Molnar
2007-04-17 11:31 ` William Lee Irwin III
1 sibling, 0 replies; 577+ messages in thread
From: Ingo Molnar @ 2007-04-17 10:01 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith,
Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
* William Lee Irwin III <wli@holomorphy.com> wrote:
> On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
>
> > until now the main approach for nice levels in Linux was always:
> > "implement your main scheduling logic for nice 0 and then look for
> > some low-overhead method that can be glued to it that does something
> > that behaves like nice levels". Feel free to turn that around into a
> > more natural approach, but the algorithm should remain fairly simple
> > i think.
>
> Part of my insistence was because it seemed to be relatively close to
> a one-liner, though I'm not entirely sure what particular computation
> to use to handle the signedness of the keys. I guess I could pick some
> particular nice semantics myself and then sweep the extant schedulers
> to use them after getting a testcase hammered out.
i'd love to have a oneliner solution :-)
wrt. signedness: note that in v2 i have made rq_running signed, and most
calculations (especially those related to nice) are signed values. (On
64-bit systems this all isn't a big issue - most of the arithmetic
gymnastics in CFS are done to keep deltas within 32 bits, so that
divisions and multiplications are sane.)
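As a small illustration of that kind of arithmetic gymnastics, a
clamp-then-divide helper might look like this (hypothetical, not the CFS
code):

typedef unsigned long long u64_sketch;

/* keep a nanosecond delta within 32 bits before dividing, so the
 * division stays cheap on 32-bit hosts; clamping is harmless here
 * because a delta that large is far beyond any scheduling granularity */
static inline unsigned long clamp_delta_ns(u64_sketch delta_ns)
{
        if (delta_ns > 0xffffffffULL)
                return 0xffffffffUL;
        return (unsigned long)delta_ns;
}

static inline unsigned long scaled_delta(u64_sketch delta_ns, unsigned int weight)
{
        return clamp_delta_ns(delta_ns) / weight;
}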
Ingo
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 8:00 ` Peter Williams
@ 2007-04-17 10:41 ` William Lee Irwin III
2007-04-17 13:48 ` Peter Williams
0 siblings, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-17 10:41 UTC (permalink / raw)
To: Peter Williams; +Cc: Linux Kernel Mailing List
William Lee Irwin III wrote:
>> Comments on which directions you'd like this to go in these respects
>> would be appreciated, as I regard you as the current "project owner."
On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
> I'd do a scan through LKML from about 18 months ago looking for mention
> of a runtime-configurable version of plugsched. Some students at a
> university (in Germany, I think) posted some patches adding this feature
> to plugsched around about then.
Excellent. I'll go hunting for that.
On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
> I never added them to plugsched proper as I knew (from previous
> experience when the company I worked for posted patches with similar
> functionality) that Linus would like this idea less than he did the
> current plugsched mechanism.
Odd how the requirements ended up including that. Fickleness abounds.
If only we knew up-front what the end would be.
On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
> Unfortunately, my own cache of the relevant e-mails got overwritten
> during a Fedora Core upgrade (I've since moved /var onto a separate
> drive to avoid a repetition) or I would dig them out and send them to
> you. I'd provided them with copies of the company's patches to use as a
> guide to how to overcome the problems associated with changing
> schedulers on a running system (a few non-trivial locking issues pop up).
> Maybe if one of the students still reads LKML he will provide a pointer.
I was tempted to restart from scratch given Ingo's comments, but I
reconsidered and I'll be working with your code (and the German
students' as well). If everything has to change, so be it, but it'll
still be a derived work. It would be ignoring precedent and a failure to
properly attribute if I did otherwise.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 9:59 ` Ingo Molnar
@ 2007-04-17 11:11 ` Nick Piggin
2007-04-18 8:55 ` Nick Piggin
1 sibling, 0 replies; 577+ messages in thread
From: Nick Piggin @ 2007-04-17 11:11 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
Steve Fox, Nishanth Aravamudan
On Tue, Apr 17, 2007 at 11:59:00AM +0200, Ingo Molnar wrote:
>
> * Nick Piggin <npiggin@suse.de> wrote:
>
> > 2.6.21-rc7-cfs-v2
> > 534.80user 30.92system 2:23.64elapsed 393%CPU
> > 534.75user 31.01system 2:23.70elapsed 393%CPU
> > 534.66user 31.07system 2:23.76elapsed 393%CPU
> > 534.56user 30.91system 2:23.76elapsed 393%CPU
> > 534.66user 31.07system 2:23.67elapsed 393%CPU
> > 535.43user 30.62system 2:23.72elapsed 393%CPU
>
> Thanks for testing this! Could you please try this also with:
>
> echo 100000000 > /proc/sys/kernel/sched_granularity
>
> on the same system, so that we can get a complete set of numbers? Just
> to make sure that lowering the preemption frequency indeed has the
> expected result of moving kernbench numbers back to mainline levels. (if
> not then that would indicate some CFS buglet)
>
> could you maybe even try a more extreme setting of:
>
> echo 500000000 > /proc/sys/kernel/sched_granularity
>
> for kicks? This would allow us to see how much kernbench we lose due to
> preemption granularity. Thanks!
Yeah but I just powered down the test-box, so I'll have to get onto
that tomorrow.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 9:57 ` William Lee Irwin III
2007-04-17 10:01 ` Ingo Molnar
@ 2007-04-17 11:31 ` William Lee Irwin III
1 sibling, 0 replies; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-17 11:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith,
Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 02:57:49AM -0700, William Lee Irwin III wrote:
> Interesting! That's what vdls did, except its fundamental data structure
> was more like a circular buffer data structure (resembling Davide
> Libenzi's timer ring in concept, but with all the details different).
> I'm not entirely sure how that would've turned out performancewise if
> I'd done any tuning at all. I was mostly interested in doing something
> like what I heard Bob Mullens did in 1976 for basic pedagogical value
> about schedulers to prepare for writing patches for gang scheduling as
> opposed to creating a viable replacement for the mainline scheduler.
Con helped me dredge up the vdls bits, so here is the last version I had
before I got tired of toying with the idea. It's not all that clean,
with a fair amount of debug code floating around and a number of
idiocies (it seems there was a plot to use a heap somewhere I forgot
about entirely, never mind other cruft), but I thought I should at least
say something more provable than "there was a patch I never posted."
Enjoy!
-- wli
diff -prauN linux-2.6.0-test11/fs/proc/array.c sched-2.6.0-test11-5/fs/proc/array.c
--- linux-2.6.0-test11/fs/proc/array.c 2003-11-26 12:44:26.000000000 -0800
+++ sched-2.6.0-test11-5/fs/proc/array.c 2003-12-17 07:37:11.000000000 -0800
@@ -162,7 +162,7 @@ static inline char * task_state(struct t
"Uid:\t%d\t%d\t%d\t%d\n"
"Gid:\t%d\t%d\t%d\t%d\n",
get_task_state(p),
- (p->sleep_avg/1024)*100/(1000000000/1024),
+ 0UL, /* was ->sleep_avg */
p->tgid,
p->pid, p->pid ? p->real_parent->pid : 0,
p->pid && p->ptrace ? p->parent->pid : 0,
@@ -345,7 +345,7 @@ int proc_pid_stat(struct task_struct *ta
read_unlock(&tasklist_lock);
res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
%lu %lu %lu %lu %lu %ld %ld %ld %ld %d %ld %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %d %d\n",
task->pid,
task->comm,
state,
@@ -390,8 +390,8 @@ int proc_pid_stat(struct task_struct *ta
task->cnswap,
task->exit_signal,
task_cpu(task),
- task->rt_priority,
- task->policy);
+ task_prio(task),
+ task_sched_policy(task));
if(mm)
mmput(mm);
return res;
diff -prauN linux-2.6.0-test11/include/asm-i386/thread_info.h sched-2.6.0-test11-5/include/asm-i386/thread_info.h
--- linux-2.6.0-test11/include/asm-i386/thread_info.h 2003-11-26 12:43:06.000000000 -0800
+++ sched-2.6.0-test11-5/include/asm-i386/thread_info.h 2003-12-17 04:55:22.000000000 -0800
@@ -114,6 +114,8 @@ static inline struct thread_info *curren
#define TIF_SINGLESTEP 4 /* restore singlestep on return to user mode */
#define TIF_IRET 5 /* return with iret */
#define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED 17
+#define TIF_PREEMPT 18
#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME)
diff -prauN linux-2.6.0-test11/include/linux/binomial.h sched-2.6.0-test11-5/include/linux/binomial.h
--- linux-2.6.0-test11/include/linux/binomial.h 1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/include/linux/binomial.h 2003-12-20 15:53:33.000000000 -0800
@@ -0,0 +1,16 @@
+/*
+ * Simple binomial heaps.
+ */
+
+struct binomial {
+ unsigned priority, degree;
+ struct binomial *parent, *child, *sibling;
+};
+
+
+struct binomial *binomial_minimum(struct binomial **);
+void binomial_union(struct binomial **, struct binomial **, struct binomial **);
+void binomial_insert(struct binomial **, struct binomial *);
+struct binomial *binomial_extract_min(struct binomial **);
+void binomial_decrease(struct binomial **, struct binomial *, unsigned);
+void binomial_delete(struct binomial **, struct binomial *);
diff -prauN linux-2.6.0-test11/include/linux/init_task.h sched-2.6.0-test11-5/include/linux/init_task.h
--- linux-2.6.0-test11/include/linux/init_task.h 2003-11-26 12:42:58.000000000 -0800
+++ sched-2.6.0-test11-5/include/linux/init_task.h 2003-12-18 05:51:16.000000000 -0800
@@ -56,6 +56,12 @@
.siglock = SPIN_LOCK_UNLOCKED, \
}
+#define INIT_SCHED_INFO(info) \
+{ \
+ .run_list = LIST_HEAD_INIT((info).run_list), \
+ .policy = 1 /* SCHED_POLICY_TS */, \
+}
+
/*
* INIT_TASK is used to set up the first task table, touch at
* your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -67,14 +73,10 @@
.usage = ATOMIC_INIT(2), \
.flags = 0, \
.lock_depth = -1, \
- .prio = MAX_PRIO-20, \
- .static_prio = MAX_PRIO-20, \
- .policy = SCHED_NORMAL, \
+ .sched_info = INIT_SCHED_INFO(tsk.sched_info), \
.cpus_allowed = CPU_MASK_ALL, \
.mm = NULL, \
.active_mm = &init_mm, \
- .run_list = LIST_HEAD_INIT(tsk.run_list), \
- .time_slice = HZ, \
.tasks = LIST_HEAD_INIT(tsk.tasks), \
.ptrace_children= LIST_HEAD_INIT(tsk.ptrace_children), \
.ptrace_list = LIST_HEAD_INIT(tsk.ptrace_list), \
diff -prauN linux-2.6.0-test11/include/linux/sched.h sched-2.6.0-test11-5/include/linux/sched.h
--- linux-2.6.0-test11/include/linux/sched.h 2003-11-26 12:42:58.000000000 -0800
+++ sched-2.6.0-test11-5/include/linux/sched.h 2003-12-23 03:47:45.000000000 -0800
@@ -126,6 +126,8 @@ extern unsigned long nr_iowait(void);
#define SCHED_NORMAL 0
#define SCHED_FIFO 1
#define SCHED_RR 2
+#define SCHED_BATCH 3
+#define SCHED_IDLE 4
struct sched_param {
int sched_priority;
@@ -281,10 +283,14 @@ struct signal_struct {
#define MAX_USER_RT_PRIO 100
#define MAX_RT_PRIO MAX_USER_RT_PRIO
-
-#define MAX_PRIO (MAX_RT_PRIO + 40)
-
-#define rt_task(p) ((p)->prio < MAX_RT_PRIO)
+#define NICE_QLEN 128
+#define MIN_TS_PRIO MAX_RT_PRIO
+#define MAX_TS_PRIO (40*NICE_QLEN)
+#define MIN_BATCH_PRIO (MAX_RT_PRIO + MAX_TS_PRIO)
+#define MAX_BATCH_PRIO 100
+#define MAX_PRIO (MIN_BATCH_PRIO + MAX_BATCH_PRIO)
+#define USER_PRIO(prio) ((prio) - MAX_RT_PRIO)
+#define MAX_USER_PRIO USER_PRIO(MAX_PRIO)
/*
* Some day this will be a full-fledged user tracking system..
@@ -330,6 +336,36 @@ struct k_itimer {
struct io_context; /* See blkdev.h */
void exit_io_context(void);
+struct rt_data {
+ int prio, rt_policy;
+ unsigned long quantum, ticks;
+};
+
+/* XXX: do %cpu estimation for ts wakeup levels */
+struct ts_data {
+ int nice;
+ unsigned long ticks, frac_cpu;
+ unsigned long sample_start, sample_ticks;
+};
+
+struct bt_data {
+ int prio;
+ unsigned long ticks;
+};
+
+union class_data {
+ struct rt_data rt;
+ struct ts_data ts;
+ struct bt_data bt;
+};
+
+struct sched_info {
+ int idx; /* queue index, used by all classes */
+ unsigned long policy; /* scheduling policy */
+ struct list_head run_list; /* list links for priority queues */
+ union class_data cl_data; /* class-specific data */
+};
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
struct thread_info *thread_info;
@@ -339,18 +375,9 @@ struct task_struct {
int lock_depth; /* Lock depth */
- int prio, static_prio;
- struct list_head run_list;
- prio_array_t *array;
-
- unsigned long sleep_avg;
- long interactive_credit;
- unsigned long long timestamp;
- int activated;
+ struct sched_info sched_info;
- unsigned long policy;
cpumask_t cpus_allowed;
- unsigned int time_slice, first_time_slice;
struct list_head tasks;
struct list_head ptrace_children;
@@ -391,7 +418,6 @@ struct task_struct {
int __user *set_child_tid; /* CLONE_CHILD_SETTID */
int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */
- unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
unsigned long it_real_incr, it_prof_incr, it_virt_incr;
struct timer_list real_timer;
@@ -520,12 +546,14 @@ extern void node_nr_running_init(void);
#define node_nr_running_init() {}
#endif
-extern void set_user_nice(task_t *p, long nice);
-extern int task_prio(task_t *p);
-extern int task_nice(task_t *p);
-extern int task_curr(task_t *p);
-extern int idle_cpu(int cpu);
-
+void set_user_nice(task_t *task, long nice);
+int task_prio(task_t *task);
+int task_nice(task_t *task);
+int task_sched_policy(task_t *task);
+void set_task_sched_policy(task_t *task, int policy);
+int rt_task(task_t *task);
+int task_curr(task_t *task);
+int idle_cpu(int cpu);
void yield(void);
/*
@@ -844,6 +872,21 @@ static inline int need_resched(void)
return unlikely(test_thread_flag(TIF_NEED_RESCHED));
}
+static inline void set_task_queued(task_t *task)
+{
+ set_tsk_thread_flag(task, TIF_QUEUED);
+}
+
+static inline void clear_task_queued(task_t *task)
+{
+ clear_tsk_thread_flag(task, TIF_QUEUED);
+}
+
+static inline int task_queued(task_t *task)
+{
+ return test_tsk_thread_flag(task, TIF_QUEUED);
+}
+
extern void __cond_resched(void);
static inline void cond_resched(void)
{
diff -prauN linux-2.6.0-test11/kernel/Makefile sched-2.6.0-test11-5/kernel/Makefile
--- linux-2.6.0-test11/kernel/Makefile 2003-11-26 12:43:24.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/Makefile 2003-12-17 03:30:08.000000000 -0800
@@ -6,7 +6,7 @@ obj-y = sched.o fork.o exec_domain.o
exit.o itimer.o time.o softirq.o resource.o \
sysctl.o capability.o ptrace.o timer.o user.o \
signal.o sys.o kmod.o workqueue.o pid.o \
- rcupdate.o intermodule.o extable.o params.o posix-timers.o
+ rcupdate.o intermodule.o extable.o params.o posix-timers.o sched/
obj-$(CONFIG_FUTEX) += futex.o
obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
diff -prauN linux-2.6.0-test11/kernel/exit.c sched-2.6.0-test11-5/kernel/exit.c
--- linux-2.6.0-test11/kernel/exit.c 2003-11-26 12:45:29.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/exit.c 2003-12-17 07:04:02.000000000 -0800
@@ -225,7 +225,7 @@ void reparent_to_init(void)
/* Set the exit signal to SIGCHLD so we signal init on exit */
current->exit_signal = SIGCHLD;
- if ((current->policy == SCHED_NORMAL) && (task_nice(current) < 0))
+ if (task_nice(current) < 0)
set_user_nice(current, 0);
/* cpus_allowed? */
/* rt_priority? */
diff -prauN linux-2.6.0-test11/kernel/fork.c sched-2.6.0-test11-5/kernel/fork.c
--- linux-2.6.0-test11/kernel/fork.c 2003-11-26 12:42:58.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/fork.c 2003-12-23 06:22:59.000000000 -0800
@@ -836,6 +836,9 @@ struct task_struct *copy_process(unsigne
atomic_inc(&p->user->__count);
atomic_inc(&p->user->processes);
+ clear_tsk_thread_flag(p, TIF_SIGPENDING);
+ clear_tsk_thread_flag(p, TIF_QUEUED);
+
/*
* If multiple threads are within copy_process(), then this check
* triggers too late. This doesn't hurt, the check is only there
@@ -861,13 +864,21 @@ struct task_struct *copy_process(unsigne
p->state = TASK_UNINTERRUPTIBLE;
copy_flags(clone_flags, p);
- if (clone_flags & CLONE_IDLETASK)
+ if (clone_flags & CLONE_IDLETASK) {
p->pid = 0;
- else {
+ set_task_sched_policy(p, SCHED_IDLE);
+ } else {
+ if (task_sched_policy(p) == SCHED_IDLE) {
+ memset(&p->sched_info, 0, sizeof(struct sched_info));
+ set_task_sched_policy(p, SCHED_NORMAL);
+ set_user_nice(p, 0);
+ }
p->pid = alloc_pidmap();
if (p->pid == -1)
goto bad_fork_cleanup;
}
+ if (p->pid == 1)
+ BUG_ON(task_nice(p));
retval = -EFAULT;
if (clone_flags & CLONE_PARENT_SETTID)
if (put_user(p->pid, parent_tidptr))
@@ -875,8 +886,7 @@ struct task_struct *copy_process(unsigne
p->proc_dentry = NULL;
- INIT_LIST_HEAD(&p->run_list);
-
+ INIT_LIST_HEAD(&p->sched_info.run_list);
INIT_LIST_HEAD(&p->children);
INIT_LIST_HEAD(&p->sibling);
INIT_LIST_HEAD(&p->posix_timers);
@@ -885,8 +895,6 @@ struct task_struct *copy_process(unsigne
spin_lock_init(&p->alloc_lock);
spin_lock_init(&p->switch_lock);
spin_lock_init(&p->proc_lock);
-
- clear_tsk_thread_flag(p, TIF_SIGPENDING);
init_sigpending(&p->pending);
p->it_real_value = p->it_virt_value = p->it_prof_value = 0;
@@ -898,7 +906,6 @@ struct task_struct *copy_process(unsigne
p->tty_old_pgrp = 0;
p->utime = p->stime = 0;
p->cutime = p->cstime = 0;
- p->array = NULL;
p->lock_depth = -1; /* -1 = no lock */
p->start_time = get_jiffies_64();
p->security = NULL;
@@ -948,33 +955,6 @@ struct task_struct *copy_process(unsigne
p->pdeath_signal = 0;
/*
- * Share the timeslice between parent and child, thus the
- * total amount of pending timeslices in the system doesn't change,
- * resulting in more scheduling fairness.
- */
- local_irq_disable();
- p->time_slice = (current->time_slice + 1) >> 1;
- /*
- * The remainder of the first timeslice might be recovered by
- * the parent if the child exits early enough.
- */
- p->first_time_slice = 1;
- current->time_slice >>= 1;
- p->timestamp = sched_clock();
- if (!current->time_slice) {
- /*
- * This case is rare, it happens when the parent has only
- * a single jiffy left from its timeslice. Taking the
- * runqueue lock is not a problem.
- */
- current->time_slice = 1;
- preempt_disable();
- scheduler_tick(0, 0);
- local_irq_enable();
- preempt_enable();
- } else
- local_irq_enable();
- /*
* Ok, add it to the run-queues and make it
* visible to the rest of the system.
*
diff -prauN linux-2.6.0-test11/kernel/sched/Makefile sched-2.6.0-test11-5/kernel/sched/Makefile
--- linux-2.6.0-test11/kernel/sched/Makefile 1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/sched/Makefile 2003-12-17 03:32:21.000000000 -0800
@@ -0,0 +1 @@
+obj-y = util.o ts.o idle.o rt.o batch.o
diff -prauN linux-2.6.0-test11/kernel/sched/batch.c sched-2.6.0-test11-5/kernel/sched/batch.c
--- linux-2.6.0-test11/kernel/sched/batch.c 1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/sched/batch.c 2003-12-19 21:32:49.000000000 -0800
@@ -0,0 +1,190 @@
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+#include <linux/kernel_stat.h>
+#include <asm/page.h>
+#include "queue.h"
+
+struct batch_queue {
+ int base, tasks;
+ task_t *curr;
+ unsigned long bitmap[BITS_TO_LONGS(MAX_BATCH_PRIO)];
+ struct list_head queue[MAX_BATCH_PRIO];
+};
+
+static int batch_quantum = 1024;
+static DEFINE_PER_CPU(struct batch_queue, batch_queues);
+
+static int batch_init(struct policy *policy, int cpu)
+{
+ int k;
+ struct batch_queue *queue = &per_cpu(batch_queues, cpu);
+
+ policy->queue = (struct queue *)queue;
+ for (k = 0; k < MAX_BATCH_PRIO; ++k)
+ INIT_LIST_HEAD(&queue->queue[k]);
+ return 0;
+}
+
+static int batch_tick(struct queue *__queue, task_t *task, int user_ticks, int sys_ticks)
+{
+ struct batch_queue *queue = (struct batch_queue *)__queue;
+ struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+
+ cpustat->nice += user_ticks;
+ cpustat->system += sys_ticks;
+
+ task->sched_info.cl_data.bt.ticks--;
+ if (!task->sched_info.cl_data.bt.ticks) {
+ int new_idx;
+
+ task->sched_info.cl_data.bt.ticks = batch_quantum;
+ new_idx = (task->sched_info.idx + task->sched_info.cl_data.bt.prio)
+ % MAX_BATCH_PRIO;
+ if (!test_bit(new_idx, queue->bitmap))
+ __set_bit(new_idx, queue->bitmap);
+ list_move_tail(&task->sched_info.run_list,
+ &queue->queue[new_idx]);
+ if (list_empty(&queue->queue[task->sched_info.idx]))
+ __clear_bit(task->sched_info.idx, queue->bitmap);
+ task->sched_info.idx = new_idx;
+ queue->base = find_first_circular_bit(queue->bitmap,
+ queue->base,
+ MAX_BATCH_PRIO);
+ set_need_resched();
+ }
+ return 0;
+}
+
+static void batch_yield(struct queue *__queue, task_t *task)
+{
+ int new_idx;
+ struct batch_queue *queue = (struct batch_queue *)__queue;
+
+ new_idx = (queue->base + MAX_BATCH_PRIO - 1) % MAX_BATCH_PRIO;
+ if (!test_bit(new_idx, queue->bitmap))
+ __set_bit(new_idx, queue->bitmap);
+ list_move_tail(&task->sched_info.run_list, &queue->queue[new_idx]);
+ if (list_empty(&queue->queue[task->sched_info.idx]))
+ __clear_bit(task->sched_info.idx, queue->bitmap);
+ task->sched_info.idx = new_idx;
+ queue->base = find_first_circular_bit(queue->bitmap,
+ queue->base,
+ MAX_BATCH_PRIO);
+ set_need_resched();
+}
+
+static task_t *batch_curr(struct queue *__queue)
+{
+ struct batch_queue *queue = (struct batch_queue *)__queue;
+ return queue->curr;
+}
+
+static void batch_set_curr(struct queue *__queue, task_t *task)
+{
+ struct batch_queue *queue = (struct batch_queue *)__queue;
+ queue->curr = task;
+}
+
+static task_t *batch_best(struct queue *__queue)
+{
+ int idx;
+ struct batch_queue *queue = (struct batch_queue *)__queue;
+
+ idx = find_first_circular_bit(queue->bitmap,
+ queue->base,
+ MAX_BATCH_PRIO);
+ BUG_ON(idx >= MAX_BATCH_PRIO);
+ BUG_ON(list_empty(&queue->queue[idx]));
+ return list_entry(queue->queue[idx].next, task_t, sched_info.run_list);
+}
+
+static void batch_enqueue(struct queue *__queue, task_t *task)
+{
+ int idx;
+ struct batch_queue *queue = (struct batch_queue *)__queue;
+
+ idx = (queue->base + task->sched_info.cl_data.bt.prio) % MAX_BATCH_PRIO;
+ if (!test_bit(idx, queue->bitmap))
+ __set_bit(idx, queue->bitmap);
+ list_add_tail(&task->sched_info.run_list, &queue->queue[idx]);
+ task->sched_info.idx = idx;
+ task->sched_info.cl_data.bt.ticks = batch_quantum;
+ queue->tasks++;
+ if (!queue->curr)
+ queue->curr = task;
+}
+
+static void batch_dequeue(struct queue *__queue, task_t *task)
+{
+ struct batch_queue *queue = (struct batch_queue *)__queue;
+ list_del(&task->sched_info.run_list);
+ if (list_empty(&queue->queue[task->sched_info.idx]))
+ __clear_bit(task->sched_info.idx, queue->bitmap);
+ queue->tasks--;
+ if (!queue->tasks)
+ queue->curr = NULL;
+ else if (task == queue->curr)
+ queue->curr = batch_best(__queue);
+}
+
+static int batch_preempt(struct queue *__queue, task_t *task)
+{
+ struct batch_queue *queue = (struct batch_queue *)__queue;
+ if (!queue->curr)
+ return 1;
+ else
+ return task->sched_info.cl_data.bt.prio
+ < queue->curr->sched_info.cl_data.bt.prio;
+}
+
+static int batch_tasks(struct queue *__queue)
+{
+ struct batch_queue *queue = (struct batch_queue *)__queue;
+ return queue->tasks;
+}
+
+static int batch_nice(struct queue *queue, task_t *task)
+{
+ return 20;
+}
+
+static int batch_prio(task_t *task)
+{
+ return USER_PRIO(task->sched_info.cl_data.bt.prio + MIN_BATCH_PRIO);
+}
+
+static void batch_setprio(task_t *task, int prio)
+{
+ BUG_ON(prio < 0);
+ BUG_ON(prio >= MAX_BATCH_PRIO);
+ task->sched_info.cl_data.bt.prio = prio;
+}
+
+struct queue_ops batch_ops = {
+ .init = batch_init,
+ .fini = nop_fini,
+ .tick = batch_tick,
+ .yield = batch_yield,
+ .curr = batch_curr,
+ .set_curr = batch_set_curr,
+ .tasks = batch_tasks,
+ .best = batch_best,
+ .enqueue = batch_enqueue,
+ .dequeue = batch_dequeue,
+ .start_wait = queue_nop,
+ .stop_wait = queue_nop,
+ .sleep = queue_nop,
+ .wake = queue_nop,
+ .preempt = batch_preempt,
+ .nice = batch_nice,
+ .renice = nop_renice,
+ .prio = batch_prio,
+ .setprio = batch_setprio,
+ .timeslice = nop_timeslice,
+ .set_timeslice = nop_set_timeslice,
+};
+
+struct policy batch_policy = {
+ .ops = &batch_ops,
+};
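
( aside, purely illustrative and not part of the patch: sched/batch.c
  above requeues a task whose quantum expired at slot
  (idx + prio) % MAX_BATCH_PRIO, so larger batch priorities drift
  further behind the rotating base on every expiry. a minimal
  user-space sketch of that rotation, with MAX_BATCH_PRIO assumed to
  be 40 for the demo: )

#include <stdio.h>

#define MAX_BATCH_PRIO 40	/* assumed value, for the demo only */

/* slot a task currently in 'idx' with batch priority 'prio' moves to
 * when its quantum expires, mirroring the batch_tick() path above */
static int requeue_slot(int idx, int prio)
{
	return (idx + prio) % MAX_BATCH_PRIO;
}

int main(void)
{
	int base = 0, prio;

	for (prio = 1; prio <= 8; prio *= 2) {
		int round, idx = (base + prio) % MAX_BATCH_PRIO; /* batch_enqueue() */

		printf("prio %d slots:", prio);
		for (round = 0; round < 5; round++) {
			printf(" %d", idx);
			idx = requeue_slot(idx, prio);
		}
		printf("\n");
	}
	return 0;
}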
diff -prauN linux-2.6.0-test11/kernel/sched/idle.c sched-2.6.0-test11-5/kernel/sched/idle.c
--- linux-2.6.0-test11/kernel/sched/idle.c 1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/sched/idle.c 2003-12-19 17:31:39.000000000 -0800
@@ -0,0 +1,99 @@
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+#include <linux/kernel_stat.h>
+#include <asm/page.h>
+#include "queue.h"
+
+static DEFINE_PER_CPU(task_t *, idle_tasks) = NULL;
+
+static int idle_nice(struct queue *queue, task_t *task)
+{
+ return 20;
+}
+
+static int idle_tasks(struct queue *queue)
+{
+ task_t **idle = (task_t **)queue;
+ return !!(*idle);
+}
+
+static task_t *idle_task(struct queue *queue)
+{
+ return *((task_t **)queue);
+}
+
+static void idle_yield(struct queue *queue, task_t *task)
+{
+ set_need_resched();
+}
+
+static void idle_enqueue(struct queue *queue, task_t *task)
+{
+ task_t **idle = (task_t **)queue;
+ *idle = task;
+}
+
+static void idle_dequeue(struct queue *queue, task_t *task)
+{
+}
+
+static int idle_preempt(struct queue *queue, task_t *task)
+{
+ return 0;
+}
+
+static int idle_tick(struct queue *queue, task_t *task, int user_ticks, int sys_ticks)
+{
+ struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+ runqueue_t *rq = &per_cpu(runqueues, smp_processor_id());
+
+ if (atomic_read(&rq->nr_iowait) > 0)
+ cpustat->iowait += sys_ticks;
+ else
+ cpustat->idle += sys_ticks;
+ return 1;
+}
+
+static int idle_init(struct policy *policy, int cpu)
+{
+ policy->queue = (struct queue *)&per_cpu(idle_tasks, cpu);
+ return 0;
+}
+
+static int idle_prio(task_t *task)
+{
+ return MAX_USER_PRIO;
+}
+
+static void idle_setprio(task_t *task, int prio)
+{
+}
+
+static struct queue_ops idle_ops = {
+ .init = idle_init,
+ .fini = nop_fini,
+ .tick = idle_tick,
+ .yield = idle_yield,
+ .curr = idle_task,
+ .set_curr = queue_nop,
+ .tasks = idle_tasks,
+ .best = idle_task,
+ .enqueue = idle_enqueue,
+ .dequeue = idle_dequeue,
+ .start_wait = queue_nop,
+ .stop_wait = queue_nop,
+ .sleep = queue_nop,
+ .wake = queue_nop,
+ .preempt = idle_preempt,
+ .nice = idle_nice,
+ .renice = nop_renice,
+ .prio = idle_prio,
+ .setprio = idle_setprio,
+ .timeslice = nop_timeslice,
+ .set_timeslice = nop_set_timeslice,
+};
+
+struct policy idle_policy = {
+ .ops = &idle_ops,
+};
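
( aside, an illustration rather than patch code: sched/idle.c gets away
  without a real queue because its "struct queue *" is simply the
  address of a per-cpu task pointer, cast back and forth. a tiny
  user-space sketch of that opaque-pointer trick: )

#include <stdio.h>

struct queue;				/* opaque, never defined */
struct task { const char *comm; };

static struct task *idle_slot;		/* stands in for per_cpu(idle_tasks) */

static void enqueue(struct queue *q, struct task *t)
{
	*(struct task **)q = t;		/* cf. idle_enqueue() */
}

static int tasks(struct queue *q)
{
	return !!*(struct task **)q;	/* cf. idle_tasks() */
}

int main(void)
{
	struct queue *q = (struct queue *)&idle_slot;
	struct task idle = { "swapper" };

	printf("tasks before: %d\n", tasks(q));
	enqueue(q, &idle);
	printf("tasks after:  %d, curr = %s\n", tasks(q), idle_slot->comm);
	return 0;
}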
diff -prauN linux-2.6.0-test11/kernel/sched/queue.h sched-2.6.0-test11-5/kernel/sched/queue.h
--- linux-2.6.0-test11/kernel/sched/queue.h 1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/sched/queue.h 2003-12-23 03:58:02.000000000 -0800
@@ -0,0 +1,104 @@
+#define SCHED_POLICY_RT 0
+#define SCHED_POLICY_TS 1
+#define SCHED_POLICY_BATCH 2
+#define SCHED_POLICY_IDLE 3
+
+#define RT_POLICY_FIFO 0
+#define RT_POLICY_RR 1
+
+#define NODE_THRESHOLD 125
+
+struct queue;
+struct queue_ops;
+
+struct policy {
+ struct queue *queue;
+ struct queue_ops *ops;
+};
+
+extern struct policy rt_policy, ts_policy, batch_policy, idle_policy;
+
+struct runqueue {
+ spinlock_t lock;
+ int curr;
+ task_t *__curr;
+ unsigned long policy_bitmap;
+ struct policy *policies[BITS_PER_LONG];
+ unsigned long nr_running, nr_switches, nr_uninterruptible;
+ struct mm_struct *prev_mm;
+ int prev_cpu_load[NR_CPUS];
+#ifdef CONFIG_NUMA
+ atomic_t *node_nr_running;
+ int prev_node_load[MAX_NUMNODES];
+#endif
+ task_t *migration_thread;
+ struct list_head migration_queue;
+
+ atomic_t nr_iowait;
+};
+
+typedef struct runqueue runqueue_t;
+
+struct queue_ops {
+ int (*init)(struct policy *, int);
+ void (*fini)(struct policy *, int);
+ task_t *(*curr)(struct queue *);
+ void (*set_curr)(struct queue *, task_t *);
+ task_t *(*best)(struct queue *);
+ int (*tick)(struct queue *, task_t *, int, int);
+ int (*tasks)(struct queue *);
+ void (*enqueue)(struct queue *, task_t *);
+ void (*dequeue)(struct queue *, task_t *);
+ void (*start_wait)(struct queue *, task_t *);
+ void (*stop_wait)(struct queue *, task_t *);
+ void (*sleep)(struct queue *, task_t *);
+ void (*wake)(struct queue *, task_t *);
+ int (*preempt)(struct queue *, task_t *);
+ void (*yield)(struct queue *, task_t *);
+ int (*prio)(task_t *);
+ void (*setprio)(task_t *, int);
+ int (*nice)(struct queue *, task_t *);
+ void (*renice)(struct queue *, task_t *, int);
+ unsigned long (*timeslice)(struct queue *, task_t *);
+ void (*set_timeslice)(struct queue *, task_t *, unsigned long);
+};
+
+DECLARE_PER_CPU(runqueue_t, runqueues);
+
+int find_first_circular_bit(unsigned long *, int, int);
+void queue_nop(struct queue *, task_t *);
+void nop_renice(struct queue *, task_t *, int);
+void nop_fini(struct policy *, int);
+unsigned long nop_timeslice(struct queue *, task_t *);
+void nop_set_timeslice(struct queue *, task_t *, unsigned long);
+
+/* #define DEBUG_SCHED */
+
+#ifdef DEBUG_SCHED
+#define __check_task_policy(idx) \
+do { \
+ unsigned long __idx__ = (idx); \
+ if (__idx__ > SCHED_POLICY_IDLE) { \
+ printk("invalid policy 0x%lx\n", __idx__); \
+ BUG(); \
+ } \
+} while (0)
+
+#define check_task_policy(task) \
+do { \
+ __check_task_policy((task)->sched_info.policy); \
+} while (0)
+
+#define check_policy(policy) \
+do { \
+ BUG_ON((policy) != &rt_policy && \
+ (policy) != &ts_policy && \
+ (policy) != &batch_policy && \
+ (policy) != &idle_policy); \
+} while (0)
+
+#else /* !DEBUG_SCHED */
+#define __check_task_policy(idx) do { } while (0)
+#define check_task_policy(task) do { } while (0)
+#define check_policy(policy) do { } while (0)
+#endif /* !DEBUG_SCHED */
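
( aside: a rough user-space model of how the core is expected to
  dispatch through struct queue_ops above -- policies are ordered
  RT < TS < BATCH < IDLE, the runqueue keeps a bitmap of policies with
  runnable tasks, and picking the next task means asking the
  lowest-numbered non-empty policy for its best. the toy single-task
  policy below is an assumption for the demo, not code from the
  patch: )

#include <stdio.h>
#include <strings.h>		/* ffs() */

struct task { const char *comm; };
struct policy;

struct ops {
	struct task *(*best)(struct policy *);
	int (*tasks)(struct policy *);
};

struct policy {
	const char *name;
	struct ops *ops;
	struct task *only;	/* toy state: at most one runnable task */
};

static struct task *best(struct policy *p)  { return p->only; }
static int ntasks(struct policy *p)         { return !!p->only; }
static struct ops toy_ops = { .best = best, .tasks = ntasks };

int main(void)
{
	struct task hog = { "cpu-hog" }, shell = { "bash" };
	struct policy pols[4] = {
		{ "rt",    &toy_ops, NULL   },
		{ "ts",    &toy_ops, &shell },
		{ "batch", &toy_ops, &hog   },
		{ "idle",  &toy_ops, NULL   },
	};
	unsigned long bitmap = 0;
	int i;

	for (i = 0; i < 4; i++)
		if (pols[i].ops->tasks(&pols[i]))
			bitmap |= 1UL << i;
	i = ffs((int)bitmap) - 1;	/* lowest policy index wins */
	printf("next: %s task %s\n", pols[i].name,
	       pols[i].ops->best(&pols[i])->comm);
	return 0;
}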
diff -prauN linux-2.6.0-test11/kernel/sched/rt.c sched-2.6.0-test11-5/kernel/sched/rt.c
--- linux-2.6.0-test11/kernel/sched/rt.c 1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/sched/rt.c 2003-12-19 18:16:07.000000000 -0800
@@ -0,0 +1,208 @@
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+#include <linux/kernel_stat.h>
+#include <asm/page.h>
+#include "queue.h"
+
+#ifdef DEBUG_SCHED
+#define check_rt_policy(task) \
+do { \
+ BUG_ON((task)->sched_info.policy != SCHED_POLICY_RT); \
+ BUG_ON((task)->sched_info.cl_data.rt.rt_policy != RT_POLICY_RR \
+ && \
+ (task)->sched_info.cl_data.rt.rt_policy!=RT_POLICY_FIFO); \
+ BUG_ON((task)->sched_info.cl_data.rt.prio < 0); \
+ BUG_ON((task)->sched_info.cl_data.rt.prio >= MAX_RT_PRIO); \
+} while (0)
+#else
+#define check_rt_policy(task) do { } while (0)
+#endif
+
+struct rt_queue {
+ unsigned long bitmap[BITS_TO_LONGS(MAX_RT_PRIO)];
+ struct list_head queue[MAX_RT_PRIO];
+ task_t *curr;
+ int tasks;
+};
+
+static DEFINE_PER_CPU(struct rt_queue, rt_queues);
+
+static int rt_init(struct policy *policy, int cpu)
+{
+ int k;
+ struct rt_queue *queue = &per_cpu(rt_queues, cpu);
+
+ policy->queue = (struct queue *)queue;
+ for (k = 0; k < MAX_RT_PRIO; ++k)
+ INIT_LIST_HEAD(&queue->queue[k]);
+ return 0;
+}
+
+static void rt_yield(struct queue *__queue, task_t *task)
+{
+ struct rt_queue *queue = (struct rt_queue *)__queue;
+ check_rt_policy(task);
+ list_del(&task->sched_info.run_list);
+ if (list_empty(&queue->queue[task->sched_info.cl_data.rt.prio]))
+ set_need_resched();
+ list_add_tail(&task->sched_info.run_list,
+ &queue->queue[task->sched_info.cl_data.rt.prio]);
+ check_rt_policy(task);
+}
+
+static int rt_tick(struct queue *queue, task_t *task, int user_ticks, int sys_ticks)
+{
+ struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+ check_rt_policy(task);
+ cpustat->user += user_ticks;
+ cpustat->system += sys_ticks;
+ if (task->sched_info.cl_data.rt.rt_policy == RT_POLICY_RR) {
+ task->sched_info.cl_data.rt.ticks--;
+ if (!task->sched_info.cl_data.rt.ticks) {
+ task->sched_info.cl_data.rt.ticks =
+ task->sched_info.cl_data.rt.quantum;
+ rt_yield(queue, task);
+ }
+ }
+ check_rt_policy(task);
+ return 0;
+}
+
+static task_t *rt_curr(struct queue *__queue)
+{
+ struct rt_queue *queue = (struct rt_queue *)__queue;
+ task_t *task = queue->curr;
+ check_rt_policy(task);
+ return task;
+}
+
+static void rt_set_curr(struct queue *__queue, task_t *task)
+{
+ struct rt_queue *queue = (struct rt_queue *)__queue;
+ queue->curr = task;
+ check_rt_policy(task);
+}
+
+static task_t *rt_best(struct queue *__queue)
+{
+ struct rt_queue *queue = (struct rt_queue *)__queue;
+ task_t *task;
+ int idx;
+ idx = find_first_bit(queue->bitmap, MAX_RT_PRIO);
+ BUG_ON(idx >= MAX_RT_PRIO);
+ task = list_entry(queue->queue[idx].next, task_t, sched_info.run_list);
+ check_rt_policy(task);
+ return task;
+}
+
+static void rt_enqueue(struct queue *__queue, task_t *task)
+{
+ struct rt_queue *queue = (struct rt_queue *)__queue;
+ check_rt_policy(task);
+ if (!test_bit(task->sched_info.cl_data.rt.prio, queue->bitmap))
+ __set_bit(task->sched_info.cl_data.rt.prio, queue->bitmap);
+ list_add_tail(&task->sched_info.run_list,
+ &queue->queue[task->sched_info.cl_data.rt.prio]);
+ check_rt_policy(task);
+ queue->tasks++;
+ if (!queue->curr)
+ queue->curr = task;
+}
+
+static void rt_dequeue(struct queue *__queue, task_t *task)
+{
+ struct rt_queue *queue = (struct rt_queue *)__queue;
+ check_rt_policy(task);
+ list_del(&task->sched_info.run_list);
+ if (list_empty(&queue->queue[task->sched_info.cl_data.rt.prio]))
+ __clear_bit(task->sched_info.cl_data.rt.prio, queue->bitmap);
+ queue->tasks--;
+ check_rt_policy(task);
+ if (!queue->tasks)
+ queue->curr = NULL;
+ else if (task == queue->curr)
+ queue->curr = rt_best(__queue);
+}
+
+static int rt_preempt(struct queue *__queue, task_t *task)
+{
+ struct rt_queue *queue = (struct rt_queue *)__queue;
+ check_rt_policy(task);
+ if (!queue->curr)
+ return 1;
+ check_rt_policy(queue->curr);
+ return task->sched_info.cl_data.rt.prio
+ < queue->curr->sched_info.cl_data.rt.prio;
+}
+
+static int rt_tasks(struct queue *__queue)
+{
+ struct rt_queue *queue = (struct rt_queue *)__queue;
+ return queue->tasks;
+}
+
+static int rt_nice(struct queue *queue, task_t *task)
+{
+ check_rt_policy(task);
+ return -20;
+}
+
+static unsigned long rt_timeslice(struct queue *queue, task_t *task)
+{
+ check_rt_policy(task);
+ if (task->sched_info.cl_data.rt.rt_policy != RT_POLICY_RR)
+ return 0;
+ else
+ return task->sched_info.cl_data.rt.quantum;
+}
+
+static void rt_set_timeslice(struct queue *queue, task_t *task, unsigned long n)
+{
+ check_rt_policy(task);
+ if (task->sched_info.cl_data.rt.rt_policy == RT_POLICY_RR)
+ task->sched_info.cl_data.rt.quantum = n;
+ check_rt_policy(task);
+}
+
+static void rt_setprio(task_t *task, int prio)
+{
+ check_rt_policy(task);
+ BUG_ON(prio < 0);
+ BUG_ON(prio >= MAX_RT_PRIO);
+ task->sched_info.cl_data.rt.prio = prio;
+}
+
+static int rt_prio(task_t *task)
+{
+ check_rt_policy(task);
+ return USER_PRIO(task->sched_info.cl_data.rt.prio);
+}
+
+static struct queue_ops rt_ops = {
+ .init = rt_init,
+ .fini = nop_fini,
+ .tick = rt_tick,
+ .yield = rt_yield,
+ .curr = rt_curr,
+ .set_curr = rt_set_curr,
+ .tasks = rt_tasks,
+ .best = rt_best,
+ .enqueue = rt_enqueue,
+ .dequeue = rt_dequeue,
+ .start_wait = queue_nop,
+ .stop_wait = queue_nop,
+ .sleep = queue_nop,
+ .wake = queue_nop,
+ .preempt = rt_preempt,
+ .nice = rt_nice,
+ .renice = nop_renice,
+ .prio = rt_prio,
+ .setprio = rt_setprio,
+ .timeslice = rt_timeslice,
+ .set_timeslice = rt_set_timeslice,
+};
+
+struct policy rt_policy = {
+ .ops = &rt_ops,
+};
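
( aside, a small sketch (not from the patch) of the RT tick rule in
  sched/rt.c above: RT_POLICY_RR decrements a per-task tick count and
  round-robins within its priority level when it hits zero, while
  RT_POLICY_FIFO is never rotated by the tick: )

#include <stdio.h>

#define RT_POLICY_FIFO	0
#define RT_POLICY_RR	1

struct rt_task {
	const char *comm;
	int rt_policy, quantum, ticks;
};

/* returns 1 when the task should be moved to the tail of its prio list */
static int rt_tick(struct rt_task *t)
{
	if (t->rt_policy != RT_POLICY_RR)
		return 0;
	if (--t->ticks)
		return 0;
	t->ticks = t->quantum;		/* refill, then yield in place */
	return 1;
}

int main(void)
{
	struct rt_task rr   = { "rr",   RT_POLICY_RR,   3, 3 };
	struct rt_task fifo = { "fifo", RT_POLICY_FIFO, 0, 0 };
	int tick;

	for (tick = 1; tick <= 6; tick++)
		printf("tick %d: rr rotates=%d fifo rotates=%d\n",
		       tick, rt_tick(&rr), rt_tick(&fifo));
	return 0;
}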
diff -prauN linux-2.6.0-test11/kernel/sched/ts.c sched-2.6.0-test11-5/kernel/sched/ts.c
--- linux-2.6.0-test11/kernel/sched/ts.c 1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/sched/ts.c 2003-12-23 08:24:55.000000000 -0800
@@ -0,0 +1,841 @@
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+#include <linux/kernel_stat.h>
+#include <asm/page.h>
+#include "queue.h"
+
+#ifdef DEBUG_SCHED
+#define check_ts_policy(task) \
+do { \
+ BUG_ON((task)->sched_info.policy != SCHED_POLICY_TS); \
+} while (0)
+
+#define check_nice(__queue__) \
+({ \
+ int __k__, __count__ = 0; \
+ if ((__queue__)->tasks < 0) { \
+ printk("negative nice task count %d\n", \
+ (__queue__)->tasks); \
+ BUG(); \
+ } \
+ for (__k__ = 0; __k__ < NICE_QLEN; ++__k__) { \
+ task_t *__task__; \
+ if (list_empty(&(__queue__)->queue[__k__])) { \
+ if (test_bit(__k__, (__queue__)->bitmap)) { \
+ printk("wrong nice bit set\n"); \
+ BUG(); \
+ } \
+ } else { \
+ if (!test_bit(__k__, (__queue__)->bitmap)) { \
+ printk("wrong nice bit clear\n"); \
+ BUG(); \
+ } \
+ } \
+ list_for_each_entry(__task__, \
+ &(__queue__)->queue[__k__], \
+ sched_info.run_list) { \
+ check_ts_policy(__task__); \
+ if (__task__->sched_info.idx != __k__) { \
+ printk("nice index mismatch\n"); \
+ BUG(); \
+ } \
+ ++__count__; \
+ } \
+ } \
+ if ((__queue__)->tasks != __count__) { \
+ printk("wrong nice task count\n"); \
+ printk("expected %d, got %d\n", \
+ (__queue__)->tasks, \
+ __count__); \
+ BUG(); \
+ } \
+ __count__; \
+})
+
+#define check_queue(__queue) \
+do { \
+ int __k, __count = 0; \
+ if ((__queue)->tasks < 0) { \
+ printk("negative queue task count %d\n", \
+ (__queue)->tasks); \
+ BUG(); \
+ } \
+ for (__k = 0; __k < 40; ++__k) { \
+ struct nice_queue *__nice; \
+ if (list_empty(&(__queue)->nices[__k])) { \
+ if (test_bit(__k, (__queue)->bitmap)) { \
+ printk("wrong queue bit set\n"); \
+ BUG(); \
+ } \
+ } else { \
+ if (!test_bit(__k, (__queue)->bitmap)) { \
+ printk("wrong queue bit clear\n"); \
+ BUG(); \
+ } \
+ } \
+ list_for_each_entry(__nice, \
+ &(__queue)->nices[__k], \
+ list) { \
+ __count += check_nice(__nice); \
+ if (__nice->idx != __k) { \
+ printk("queue index mismatch\n"); \
+ BUG(); \
+ } \
+ } \
+ } \
+ if ((__queue)->tasks != __count) { \
+ printk("wrong queue task count\n"); \
+ printk("expected %d, got %d\n", \
+ (__queue)->tasks, \
+ __count); \
+ BUG(); \
+ } \
+} while (0)
+
+#else /* !DEBUG_SCHED */
+#define check_ts_policy(task) do { } while (0)
+#define check_nice(nice) do { } while (0)
+#define check_queue(queue) do { } while (0)
+#endif
+
+/*
+ * Hybrid deadline/multilevel scheduling. CPU-utilization-dependent
+ * deadlines at wake. Queue rotation every 50ms or when
+ * demotions empty the highest level, setting demoted deadlines
+ * relative to the new highest level. Intra-level RR quantum at 10ms.
+ */
+struct nice_queue {
+ int idx, nice, base, tasks, level_quantum, expired;
+ unsigned long bitmap[BITS_TO_LONGS(NICE_QLEN)];
+ struct list_head list, queue[NICE_QLEN];
+ task_t *curr;
+};
+
+/*
+ * Deadline-schedule nice levels with priority-dependent deadlines and
+ * a default quantum of 100ms. The queue rotates when demotions empty
+ * the highest level, setting the demoted deadline relative to the new
+ * highest level.
+ */
+struct ts_queue {
+ struct nice_queue nice_levels[40];
+ struct list_head nices[40];
+ int base, quantum, tasks;
+ unsigned long bitmap[BITS_TO_LONGS(40)];
+ struct nice_queue *curr;
+};
+
+/*
+ * Make these sysctl-tunable.
+ */
+static int nice_quantum = 100;
+static int rr_quantum = 10;
+static int level_quantum = 50;
+static int sample_interval = HZ;
+
+static DEFINE_PER_CPU(struct ts_queue, ts_queues);
+
+static task_t *nice_best(struct nice_queue *);
+static struct nice_queue *ts_best_nice(struct ts_queue *);
+
+static void nice_init(struct nice_queue *queue)
+{
+ int k;
+
+ INIT_LIST_HEAD(&queue->list);
+ for (k = 0; k < NICE_QLEN; ++k) {
+ INIT_LIST_HEAD(&queue->queue[k]);
+ }
+}
+
+static int ts_init(struct policy *policy, int cpu)
+{
+ int k;
+ struct ts_queue *queue = &per_cpu(ts_queues, cpu);
+
+ policy->queue = (struct queue *)queue;
+ queue->quantum = nice_quantum;
+
+ for (k = 0; k < 40; ++k) {
+ nice_init(&queue->nice_levels[k]);
+ queue->nice_levels[k].nice = k;
+ INIT_LIST_HEAD(&queue->nices[k]);
+ }
+ return 0;
+}
+
+static int task_deadline(task_t *task)
+{
+ u64 frac_cpu = task->sched_info.cl_data.ts.frac_cpu;
+ frac_cpu *= (u64)NICE_QLEN;
+ frac_cpu >>= 32;
+ return (int)min((u32)(NICE_QLEN - 1), (u32)frac_cpu);
+}
+
+static void nice_rotate_queue(struct nice_queue *queue)
+{
+ int idx, new_idx, deadline, idxdiff;
+ task_t *task = queue->curr;
+
+ check_nice(queue);
+
+ /* note: idxdiff == NICE_QLEN - 1 is handled by the special case below */
+ idx = queue->curr->sched_info.idx;
+ idxdiff = (idx - queue->base + NICE_QLEN) % NICE_QLEN;
+ deadline = min(1 + task_deadline(task), NICE_QLEN - idxdiff - 1);
+ new_idx = (idx + deadline) % NICE_QLEN;
+#if 0
+ if (idx == new_idx) {
+ /*
+ * buggy; it sets queue->base = idx because in this case
+ * we have task_deadline(task) == 0
+ */
+ new_idx = (idx - task_deadline(task) + NICE_QLEN) % NICE_QLEN;
+ if (queue->base != new_idx)
+ queue->base = new_idx;
+ return;
+ }
+ BUG_ON(!deadline);
+ BUG_ON(queue->base <= new_idx && new_idx <= idx);
+ BUG_ON(idx < queue->base && queue->base <= new_idx);
+ BUG_ON(new_idx <= idx && idx < queue->base);
+ if (0 && idx == new_idx) {
+ printk("FUCKUP: pid = %d, tdl = %d, dl = %d, idx = %d, "
+ "base = %d, diff = %d, fcpu = 0x%lx\n",
+ queue->curr->pid,
+ task_deadline(queue->curr),
+ deadline,
+ idx,
+ queue->base,
+ idxdiff,
+ task->sched_info.cl_data.ts.frac_cpu);
+ BUG();
+ }
+#else
+ /*
+ * RR within the last deadline slot,
+ * special-cased so as not to trip the BUG_ON()s below
+ */
+ if (idx == new_idx) {
+ /* if we got here these two things must hold */
+ BUG_ON(idxdiff != NICE_QLEN - 1);
+ BUG_ON(deadline);
+ list_move_tail(&task->sched_info.run_list, &queue->queue[idx]);
+ if (queue->expired) {
+ queue->level_quantum = level_quantum;
+ queue->expired = 0;
+ }
+ return;
+ }
+#endif
+ task->sched_info.idx = new_idx;
+ if (!test_bit(new_idx, queue->bitmap)) {
+ BUG_ON(!list_empty(&queue->queue[new_idx]));
+ __set_bit(new_idx, queue->bitmap);
+ }
+ list_move_tail(&task->sched_info.run_list,
+ &queue->queue[new_idx]);
+
+ /* expired until list drains */
+ if (!list_empty(&queue->queue[idx]))
+ queue->expired = 1;
+ else {
+ int k, w, m = NICE_QLEN % BITS_PER_LONG;
+ BUG_ON(!test_bit(idx, queue->bitmap));
+ __clear_bit(idx, queue->bitmap);
+
+ for (w = 0, k = 0; k < NICE_QLEN/BITS_PER_LONG; ++k)
+ w += hweight_long(queue->bitmap[k]);
+ if (NICE_QLEN % BITS_PER_LONG)
+ w += hweight_long(queue->bitmap[k] & ((1UL << m) - 1));
+ if (w > 1)
+ queue->base = (queue->base + 1) % NICE_QLEN;
+ queue->level_quantum = level_quantum;
+ queue->expired = 0;
+ }
+ check_nice(queue);
+}
+
+static void nice_tick(struct nice_queue *queue, task_t *task)
+{
+ int idx = task->sched_info.idx;
+ BUG_ON(!task_queued(task));
+ BUG_ON(task != queue->curr);
+ BUG_ON(!test_bit(idx, queue->bitmap));
+ BUG_ON(list_empty(&queue->queue[idx]));
+ check_ts_policy(task);
+ check_nice(queue);
+
+ if (task->sched_info.cl_data.ts.ticks)
+ task->sched_info.cl_data.ts.ticks--;
+
+ if (queue->level_quantum > level_quantum) {
+ WARN_ON(1);
+ queue->level_quantum = 1;
+ }
+
+ if (!queue->expired) {
+ if (queue->level_quantum)
+ queue->level_quantum--;
+ } else if (0 && queue->queue[idx].prev != &task->sched_info.run_list) {
+ int queued = 0, new_idx = (queue->base + 1) % NICE_QLEN;
+ task_t *curr, *sav;
+ task_t *victim = list_entry(queue->queue[idx].prev,
+ task_t,
+ sched_info.run_list);
+ victim->sched_info.idx = new_idx;
+ if (!test_bit(new_idx, queue->bitmap))
+ __set_bit(new_idx, queue->bitmap);
+#if 1
+ list_for_each_entry_safe(curr, sav, &queue->queue[new_idx], sched_info.run_list) {
+ if (victim->sched_info.cl_data.ts.frac_cpu
+ < curr->sched_info.cl_data.ts.frac_cpu) {
+ queued = 1;
+ list_move(&victim->sched_info.run_list,
+ curr->sched_info.run_list.prev);
+ break;
+ }
+ }
+ if (!queued)
+ list_move_tail(&victim->sched_info.run_list,
+ &queue->queue[new_idx]);
+#else
+ list_move(&victim->sched_info.run_list, &queue->queue[new_idx]);
+#endif
+ BUG_ON(list_empty(&queue->queue[idx]));
+ }
+
+ if (!queue->level_quantum && !queue->expired) {
+ check_nice(queue);
+ nice_rotate_queue(queue);
+ check_nice(queue);
+ set_need_resched();
+ } else if (!task->sched_info.cl_data.ts.ticks) {
+ int idxdiff = (idx - queue->base + NICE_QLEN) % NICE_QLEN;
+ check_nice(queue);
+ task->sched_info.cl_data.ts.ticks = rr_quantum;
+ BUG_ON(!test_bit(idx, queue->bitmap));
+ BUG_ON(list_empty(&queue->queue[idx]));
+ if (queue->expired)
+ nice_rotate_queue(queue);
+ else if (idxdiff == NICE_QLEN - 1)
+ list_move_tail(&task->sched_info.run_list,
+ &queue->queue[idx]);
+ else {
+ int new_idx = (idx + 1) % NICE_QLEN;
+ list_del(&task->sched_info.run_list);
+ if (list_empty(&queue->queue[idx])) {
+ BUG_ON(!test_bit(idx, queue->bitmap));
+ __clear_bit(idx, queue->bitmap);
+ }
+ if (!test_bit(new_idx, queue->bitmap)) {
+ BUG_ON(!list_empty(&queue->queue[new_idx]));
+ __set_bit(new_idx, queue->bitmap);
+ }
+ task->sched_info.idx = new_idx;
+ list_add(&task->sched_info.run_list,
+ &queue->queue[new_idx]);
+ }
+ check_nice(queue);
+ set_need_resched();
+ }
+ check_nice(queue);
+ check_ts_policy(task);
+}
+
+static void ts_rotate_queue(struct ts_queue *queue)
+{
+ int idx, new_idx, idxdiff, off, deadline;
+
+ queue->base = find_first_circular_bit(queue->bitmap, queue->base, 40);
+
+ /* note: idxdiff == 39 is handled by the idx == new_idx case below */
+ check_queue(queue);
+ idx = queue->curr->idx;
+ idxdiff = (idx - queue->base + 40) % 40;
+ off = (int)(queue->curr - queue->nice_levels);
+ deadline = min(1 + off, 40 - idxdiff - 1);
+ new_idx = (idx + deadline) % 40;
+ if (idx == new_idx) {
+ new_idx = (idx - off + 40) % 40;
+ if (queue->base != new_idx)
+ queue->base = new_idx;
+ return;
+ }
+ BUG_ON(!deadline);
+ BUG_ON(queue->base <= new_idx && new_idx <= idx);
+ BUG_ON(idx < queue->base && queue->base <= new_idx);
+ BUG_ON(new_idx <= idx && idx < queue->base);
+ if (!test_bit(new_idx, queue->bitmap)) {
+ BUG_ON(!list_empty(&queue->nices[new_idx]));
+ __set_bit(new_idx, queue->bitmap);
+ }
+ list_move_tail(&queue->curr->list, &queue->nices[new_idx]);
+ queue->curr->idx = new_idx;
+
+ if (list_empty(&queue->nices[idx])) {
+ BUG_ON(!test_bit(idx, queue->bitmap));
+ __clear_bit(idx, queue->bitmap);
+ queue->base = (queue->base + 1) % 40;
+ }
+ check_queue(queue);
+}
+
+static int ts_tick(struct queue *__queue, task_t *task, int user_ticks, int sys_ticks)
+{
+ struct ts_queue *queue = (struct ts_queue *)__queue;
+ struct nice_queue *nice = queue->curr;
+ struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+ int nice_idx = (int)(queue->curr - queue->nice_levels);
+ unsigned long sample_end, delta;
+
+ check_queue(queue);
+ check_ts_policy(task);
+ BUG_ON(!nice);
+ BUG_ON(nice_idx != task->sched_info.cl_data.ts.nice);
+ BUG_ON(!test_bit(nice->idx, queue->bitmap));
+ BUG_ON(list_empty(&queue->nices[nice->idx]));
+
+ sample_end = jiffies;
+ delta = sample_end - task->sched_info.cl_data.ts.sample_start;
+ if (delta)
+ task->sched_info.cl_data.ts.sample_ticks++;
+ else {
+ task->sched_info.cl_data.ts.sample_start = jiffies;
+ task->sched_info.cl_data.ts.sample_ticks = 1;
+ }
+
+ if (delta >= sample_interval) {
+ u64 frac_cpu;
+ frac_cpu = (u64)task->sched_info.cl_data.ts.sample_ticks << 32;
+ do_div(frac_cpu, delta);
+ frac_cpu = 2*frac_cpu + task->sched_info.cl_data.ts.frac_cpu;
+ do_div(frac_cpu, 3);
+ frac_cpu = min(frac_cpu, (1ULL << 32) - 1);
+ task->sched_info.cl_data.ts.frac_cpu = (unsigned long)frac_cpu;
+ task->sched_info.cl_data.ts.sample_start = sample_end;
+ task->sched_info.cl_data.ts.sample_ticks = 0;
+ }
+
+ cpustat->user += user_ticks;
+ cpustat->system += sys_ticks;
+ nice_tick(nice, task);
+ if (queue->quantum > nice_quantum) {
+ queue->quantum = 0;
+ WARN_ON(1);
+ } else if (queue->quantum)
+ queue->quantum--;
+ if (!queue->quantum) {
+ queue->quantum = nice_quantum;
+ ts_rotate_queue(queue);
+ set_need_resched();
+ }
+ check_queue(queue);
+ check_ts_policy(task);
+ return 0;
+}
+
+static void nice_yield(struct nice_queue *queue, task_t *task)
+{
+ int idx, new_idx = (queue->base + NICE_QLEN - 1) % NICE_QLEN;
+
+ check_nice(queue);
+ check_ts_policy(task);
+ if (!test_bit(new_idx, queue->bitmap)) {
+ BUG_ON(!list_empty(&queue->queue[new_idx]));
+ __set_bit(new_idx, queue->bitmap);
+ }
+ list_move_tail(&task->sched_info.run_list, &queue->queue[new_idx]);
+ idx = task->sched_info.idx;
+ task->sched_info.idx = new_idx;
+ set_need_resched();
+
+ if (list_empty(&queue->queue[idx])) {
+ BUG_ON(!test_bit(idx, queue->bitmap));
+ __clear_bit(idx, queue->bitmap);
+ }
+ queue->curr = nice_best(queue);
+#if 0
+ if (queue->curr->sched_info.idx != queue->base)
+ queue->base = queue->curr->sched_info.idx;
+#endif
+ check_nice(queue);
+ check_ts_policy(task);
+}
+
+/*
+ * This is somewhat problematic; nice_yield() only parks tasks on
+ * the end of their current nice levels.
+ */
+static void ts_yield(struct queue *__queue, task_t *task)
+{
+ struct ts_queue *queue = (struct ts_queue *)__queue;
+ struct nice_queue *nice = queue->curr;
+
+ check_queue(queue);
+ check_ts_policy(task);
+ nice_yield(nice, task);
+
+ /*
+ * If there's no one to yield to, move the whole nice level.
+ * If this is problematic, setting nice-dependent deadlines
+ * on a single unified queue may be in order.
+ */
+ if (nice->tasks == 1) {
+ int idx, new_idx = (queue->base + 40 - 1) % 40;
+ idx = nice->idx;
+ if (!test_bit(new_idx, queue->bitmap)) {
+ BUG_ON(!list_empty(&queue->nices[new_idx]));
+ __set_bit(new_idx, queue->bitmap);
+ }
+ list_move_tail(&nice->list, &queue->nices[new_idx]);
+ if (list_empty(&queue->nices[idx])) {
+ BUG_ON(!test_bit(idx, queue->bitmap));
+ __clear_bit(idx, queue->bitmap);
+ }
+ nice->idx = new_idx;
+ queue->base = find_first_circular_bit(queue->bitmap,
+ queue->base,
+ 40);
+ BUG_ON(queue->base >= 40);
+ BUG_ON(!test_bit(queue->base, queue->bitmap));
+ queue->curr = ts_best_nice(queue);
+ }
+ check_queue(queue);
+ check_ts_policy(task);
+}
+
+static task_t *ts_curr(struct queue *__queue)
+{
+ struct ts_queue *queue = (struct ts_queue *)__queue;
+ task_t *task = queue->curr->curr;
+ check_queue(queue);
+ if (task)
+ check_ts_policy(task);
+ return task;
+}
+
+static void ts_set_curr(struct queue *__queue, task_t *task)
+{
+ struct ts_queue *queue = (struct ts_queue *)__queue;
+ struct nice_queue *nice;
+ check_queue(queue);
+ check_ts_policy(task);
+ nice = &queue->nice_levels[task->sched_info.cl_data.ts.nice];
+ queue->curr = nice;
+ nice->curr = task;
+ check_queue(queue);
+ check_ts_policy(task);
+}
+
+static task_t *nice_best(struct nice_queue *queue)
+{
+ task_t *task;
+ int idx = find_first_circular_bit(queue->bitmap,
+ queue->base,
+ NICE_QLEN);
+ check_nice(queue);
+ if (idx >= NICE_QLEN)
+ return NULL;
+ BUG_ON(list_empty(&queue->queue[idx]));
+ BUG_ON(!test_bit(idx, queue->bitmap));
+ task = list_entry(queue->queue[idx].next, task_t, sched_info.run_list);
+ check_nice(queue);
+ check_ts_policy(task);
+ return task;
+}
+
+static struct nice_queue *ts_best_nice(struct ts_queue *queue)
+{
+ int idx = find_first_circular_bit(queue->bitmap, queue->base, 40);
+ check_queue(queue);
+ if (idx >= 40)
+ return NULL;
+ BUG_ON(list_empty(&queue->nices[idx]));
+ BUG_ON(!test_bit(idx, queue->bitmap));
+ return list_entry(queue->nices[idx].next, struct nice_queue, list);
+}
+
+static task_t *ts_best(struct queue *__queue)
+{
+ struct ts_queue *queue = (struct ts_queue *)__queue;
+ struct nice_queue *nice = ts_best_nice(queue);
+ return nice ? nice_best(nice) : NULL;
+}
+
+static void nice_enqueue(struct nice_queue *queue, task_t *task)
+{
+ task_t *curr, *sav;
+ int queued = 0, idx, deadline, base, idxdiff;
+ check_nice(queue);
+ check_ts_policy(task);
+
+ /* don't livelock when queue->expired */
+ deadline = min(!!queue->expired + task_deadline(task), NICE_QLEN - 1);
+ idx = (queue->base + deadline) % NICE_QLEN;
+
+ if (!test_bit(idx, queue->bitmap)) {
+ BUG_ON(!list_empty(&queue->queue[idx]));
+ __set_bit(idx, queue->bitmap);
+ }
+
+#if 1
+ /* keep nice level's queue sorted -- use binomial heaps here soon */
+ list_for_each_entry_safe(curr, sav, &queue->queue[idx], sched_info.run_list) {
+ if (task->sched_info.cl_data.ts.frac_cpu
+ >= curr->sched_info.cl_data.ts.frac_cpu) {
+ list_add(&task->sched_info.run_list,
+ curr->sched_info.run_list.prev);
+ queued = 1;
+ break;
+ }
+ }
+ if (!queued)
+ list_add_tail(&task->sched_info.run_list, &queue->queue[idx]);
+#else
+ list_add_tail(&task->sched_info.run_list, &queue->queue[idx]);
+#endif
+ task->sched_info.idx = idx;
+ /* if (!task->sched_info.cl_data.ts.ticks) */
+ task->sched_info.cl_data.ts.ticks = rr_quantum;
+
+ if (queue->tasks)
+ BUG_ON(!queue->curr);
+ else {
+ BUG_ON(queue->curr);
+ queue->curr = task;
+ }
+ queue->tasks++;
+ check_nice(queue);
+ check_ts_policy(task);
+}
+
+static void ts_enqueue(struct queue *__queue, task_t *task)
+{
+ struct ts_queue *queue = (struct ts_queue *)__queue;
+ struct nice_queue *nice;
+
+ check_queue(queue);
+ check_ts_policy(task);
+ nice = &queue->nice_levels[task->sched_info.cl_data.ts.nice];
+ if (!nice->tasks) {
+ int idx = (queue->base + task->sched_info.cl_data.ts.nice) % 40;
+ if (!test_bit(idx, queue->bitmap)) {
+ BUG_ON(!list_empty(&queue->nices[idx]));
+ __set_bit(idx, queue->bitmap);
+ }
+ list_add_tail(&nice->list, &queue->nices[idx]);
+ nice->idx = idx;
+ if (!queue->curr)
+ queue->curr = nice;
+ }
+ nice_enqueue(nice, task);
+ queue->tasks++;
+ queue->base = find_first_circular_bit(queue->bitmap, queue->base, 40);
+ check_queue(queue);
+ check_ts_policy(task);
+}
+
+static void nice_dequeue(struct nice_queue *queue, task_t *task)
+{
+ check_nice(queue);
+ check_ts_policy(task);
+ list_del(&task->sched_info.run_list);
+ if (list_empty(&queue->queue[task->sched_info.idx])) {
+ BUG_ON(!test_bit(task->sched_info.idx, queue->bitmap));
+ __clear_bit(task->sched_info.idx, queue->bitmap);
+ }
+ queue->tasks--;
+ if (task == queue->curr) {
+ queue->curr = nice_best(queue);
+#if 0
+ if (queue->curr)
+ queue->base = queue->curr->sched_info.idx;
+#endif
+ }
+ check_nice(queue);
+ check_ts_policy(task);
+}
+
+static void ts_dequeue(struct queue *__queue, task_t *task)
+{
+ struct ts_queue *queue = (struct ts_queue *)__queue;
+ struct nice_queue *nice;
+
+ BUG_ON(!queue->tasks);
+ check_queue(queue);
+ check_ts_policy(task);
+ nice = &queue->nice_levels[task->sched_info.cl_data.ts.nice];
+
+ nice_dequeue(nice, task);
+ queue->tasks--;
+ if (!nice->tasks) {
+ list_del_init(&nice->list);
+ if (list_empty(&queue->nices[nice->idx])) {
+ BUG_ON(!test_bit(nice->idx, queue->bitmap));
+ __clear_bit(nice->idx, queue->bitmap);
+ }
+ if (nice == queue->curr)
+ queue->curr = ts_best_nice(queue);
+ }
+ queue->base = find_first_circular_bit(queue->bitmap, queue->base, 40);
+ if (queue->base >= 40)
+ queue->base = 0;
+ check_queue(queue);
+ check_ts_policy(task);
+}
+
+static int ts_tasks(struct queue *__queue)
+{
+ struct ts_queue *queue = (struct ts_queue *)__queue;
+ check_queue(queue);
+ return queue->tasks;
+}
+
+static int ts_nice(struct queue *__queue, task_t *task)
+{
+ int nice = task->sched_info.cl_data.ts.nice - 20;
+ check_ts_policy(task);
+ BUG_ON(nice < -20);
+ BUG_ON(nice >= 20);
+ return nice;
+}
+
+static void ts_renice(struct queue *queue, task_t *task, int nice)
+{
+ check_queue((struct ts_queue *)queue);
+ check_ts_policy(task);
+ BUG_ON(nice < -20);
+ BUG_ON(nice >= 20);
+ task->sched_info.cl_data.ts.nice = nice + 20;
+ check_queue((struct ts_queue *)queue);
+}
+
+static int nice_task_prio(struct nice_queue *nice, task_t *task)
+{
+ if (!task_queued(task))
+ return task_deadline(task);
+ else {
+ int prio = task->sched_info.idx - nice->base;
+ return prio < 0 ? prio + NICE_QLEN : prio;
+ }
+}
+
+static int ts_nice_prio(struct ts_queue *ts, struct nice_queue *nice)
+{
+ if (list_empty(&nice->list))
+ return (int)(nice - ts->nice_levels);
+ else {
+ int prio = nice->idx - ts->base;
+ return prio < 0 ? prio + 40 : prio;
+ }
+}
+
+/* purely synthetic priority, reported for heuristics and the like */
+static int ts_prio(task_t *task)
+{
+ int policy_idx;
+ struct policy *policy;
+ struct ts_queue *ts;
+ struct nice_queue *nice;
+
+ policy_idx = task->sched_info.policy;
+ policy = per_cpu(runqueues, task_cpu(task)).policies[policy_idx];
+ ts = (struct ts_queue *)policy->queue;
+ nice = &ts->nice_levels[task->sched_info.cl_data.ts.nice];
+ return 40*ts_nice_prio(ts, nice) + nice_task_prio(nice, task);
+}
+
+static void ts_setprio(task_t *task, int prio)
+{
+}
+
+static void ts_start_wait(struct queue *__queue, task_t *task)
+{
+}
+
+static void ts_stop_wait(struct queue *__queue, task_t *task)
+{
+}
+
+static void ts_sleep(struct queue *__queue, task_t *task)
+{
+}
+
+static void ts_wake(struct queue *__queue, task_t *task)
+{
+}
+
+static int nice_preempt(struct nice_queue *queue, task_t *task)
+{
+ check_nice(queue);
+ check_ts_policy(task);
+ /* assume feedback-queue (FB) style preemption at wakeup */
+ if (!task_queued(task) || !queue->curr)
+ return 1;
+ else {
+ int delta_t, delta_q;
+ delta_t = (task->sched_info.idx - queue->base + NICE_QLEN)
+ % NICE_QLEN;
+ delta_q = (queue->curr->sched_info.idx - queue->base
+ + NICE_QLEN)
+ % NICE_QLEN;
+ if (delta_t < delta_q)
+ return 1;
+ else if (task->sched_info.cl_data.ts.frac_cpu
+ < queue->curr->sched_info.cl_data.ts.frac_cpu)
+ return 1;
+ else
+ return 0;
+ }
+ check_nice(queue);
+}
+
+static int ts_preempt(struct queue *__queue, task_t *task)
+{
+ int curr_nice;
+ struct ts_queue *queue = (struct ts_queue *)__queue;
+ struct nice_queue *nice = queue->curr;
+
+ check_queue(queue);
+ check_ts_policy(task);
+ if (!queue->curr)
+ return 1;
+
+ curr_nice = (int)(nice - queue->nice_levels);
+
+ /* preempt when the nice number is lower; defer to nice_preempt() when they match */
+ if (task->sched_info.cl_data.ts.nice != curr_nice)
+ return task->sched_info.cl_data.ts.nice < curr_nice;
+ else
+ return nice_preempt(nice, task);
+}
+
+static struct queue_ops ts_ops = {
+ .init = ts_init,
+ .fini = nop_fini,
+ .tick = ts_tick,
+ .yield = ts_yield,
+ .curr = ts_curr,
+ .set_curr = ts_set_curr,
+ .tasks = ts_tasks,
+ .best = ts_best,
+ .enqueue = ts_enqueue,
+ .dequeue = ts_dequeue,
+ .start_wait = ts_start_wait,
+ .stop_wait = ts_stop_wait,
+ .sleep = ts_sleep,
+ .wake = ts_wake,
+ .preempt = ts_preempt,
+ .nice = ts_nice,
+ .renice = ts_renice,
+ .prio = ts_prio,
+ .setprio = ts_setprio,
+ .timeslice = nop_timeslice,
+ .set_timeslice = nop_set_timeslice,
+};
+
+struct policy ts_policy = {
+ .ops = &ts_ops,
+};
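
( aside, a user-space sketch of the utilization estimate and deadline
  mapping ts.c above uses: frac_cpu is a 32-bit fixed-point fraction
  of the sampling interval the task actually ran, blended as
  (2*new + old)/3, and the wakeup deadline is that fraction scaled
  into the NICE_QLEN slots. the NICE_QLEN value here is an assumption
  for the demo; this is an illustration, not patch code: )

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define NICE_QLEN 20	/* assumed for the demo */

static uint32_t update_frac_cpu(uint32_t old, unsigned ran, unsigned interval)
{
	uint64_t new_frac = ((uint64_t)ran << 32) / interval;
	uint64_t mix = (2 * new_frac + old) / 3;	/* cf. ts_tick() */

	if (mix > 0xffffffffULL)
		mix = 0xffffffffULL;
	return (uint32_t)mix;
}

static int task_deadline(uint32_t frac_cpu)
{
	uint64_t slot = ((uint64_t)frac_cpu * NICE_QLEN) >> 32;
	return slot > NICE_QLEN - 1 ? NICE_QLEN - 1 : (int)slot;
}

int main(void)
{
	uint32_t frac = 0;
	int i;

	/* a task running 75 of every 100 ticks converges on a late deadline */
	for (i = 0; i < 6; i++) {
		frac = update_frac_cpu(frac, 75, 100);
		printf("sample %d: frac_cpu=%#010" PRIx32 " deadline=%d\n",
		       i, frac, task_deadline(frac));
	}
	return 0;
}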
diff -prauN linux-2.6.0-test11/kernel/sched/util.c sched-2.6.0-test11-5/kernel/sched/util.c
--- linux-2.6.0-test11/kernel/sched/util.c 1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/sched/util.c 2003-12-19 08:43:20.000000000 -0800
@@ -0,0 +1,37 @@
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+#include <asm/page.h>
+#include "queue.h"
+
+int find_first_circular_bit(unsigned long *addr, int start, int end)
+{
+ int bit = find_next_bit(addr, end, start);
+ if (bit < end)
+ return bit;
+ bit = find_first_bit(addr, start);
+ if (bit < start)
+ return bit;
+ return end;
+}
+
+void queue_nop(struct queue *queue, task_t *task)
+{
+}
+
+void nop_renice(struct queue *queue, task_t *task, int nice)
+{
+}
+
+void nop_fini(struct policy *policy, int cpu)
+{
+}
+
+unsigned long nop_timeslice(struct queue *queue, task_t *task)
+{
+ return 0;
+}
+
+void nop_set_timeslice(struct queue *queue, task_t *task, unsigned long n)
+{
+}
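
( aside: a stand-alone rendition of find_first_circular_bit() from
  sched/util.c above, written against a plain array of longs so it can
  be compiled and poked at in user space -- same semantics, but not
  the kernel bitops: )

#include <stdio.h>
#include <limits.h>

#define BITS_PER_LONG	(CHAR_BIT * sizeof(long))

static int test_bit(const unsigned long *addr, int bit)
{
	return (addr[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1;
}

/* first set bit at or after 'start', wrapping to [0, start) before
 * giving up and returning 'end' */
static int find_first_circular_bit(const unsigned long *addr, int start, int end)
{
	int bit;

	for (bit = start; bit < end; bit++)
		if (test_bit(addr, bit))
			return bit;
	for (bit = 0; bit < start; bit++)
		if (test_bit(addr, bit))
			return bit;
	return end;
}

int main(void)
{
	unsigned long map[2]   = { 0, 0 };
	unsigned long empty[2] = { 0, 0 };

	map[0] |= 1UL << 3;	/* only bit 3 set */
	printf("start 0 -> %d\n", find_first_circular_bit(map, 0, 40));
	printf("start 5 -> %d (wraps around)\n", find_first_circular_bit(map, 5, 40));
	printf("no bits -> %d (== end)\n", find_first_circular_bit(empty, 5, 40));
	return 0;
}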
diff -prauN linux-2.6.0-test11/kernel/sched.c sched-2.6.0-test11-5/kernel/sched.c
--- linux-2.6.0-test11/kernel/sched.c 2003-11-26 12:45:17.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/sched.c 2003-12-21 06:06:32.000000000 -0800
@@ -15,6 +15,8 @@
* and per-CPU runqueues. Cleanups and useful suggestions
* by Davide Libenzi, preemptible kernel bits by Robert Love.
* 2003-09-03 Interactivity tuning by Con Kolivas.
+ * 2003-12-17 Total rewrite and generalized scheduler policies
+ * by William Irwin.
*/
#include <linux/mm.h>
@@ -38,6 +40,8 @@
#include <linux/cpu.h>
#include <linux/percpu.h>
+#include "sched/queue.h"
+
#ifdef CONFIG_NUMA
#define cpu_to_node_mask(cpu) node_to_cpumask(cpu_to_node(cpu))
#else
@@ -45,181 +49,79 @@
#endif
/*
- * Convert user-nice values [ -20 ... 0 ... 19 ]
- * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
- * and back.
- */
-#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20)
-#define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20)
-#define TASK_NICE(p) PRIO_TO_NICE((p)->static_prio)
-
-/*
- * 'User priority' is the nice value converted to something we
- * can work with better when scaling various scheduler parameters,
- * it's a [ 0 ... 39 ] range.
- */
-#define USER_PRIO(p) ((p)-MAX_RT_PRIO)
-#define TASK_USER_PRIO(p) USER_PRIO((p)->static_prio)
-#define MAX_USER_PRIO (USER_PRIO(MAX_PRIO))
-#define AVG_TIMESLICE (MIN_TIMESLICE + ((MAX_TIMESLICE - MIN_TIMESLICE) *\
- (MAX_PRIO-1-NICE_TO_PRIO(0))/(MAX_USER_PRIO - 1)))
-
-/*
- * Some helpers for converting nanosecond timing to jiffy resolution
- */
-#define NS_TO_JIFFIES(TIME) ((TIME) / (1000000000 / HZ))
-#define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ))
-
-/*
- * These are the 'tuning knobs' of the scheduler:
- *
- * Minimum timeslice is 10 msecs, default timeslice is 100 msecs,
- * maximum timeslice is 200 msecs. Timeslices get refilled after
- * they expire.
- */
-#define MIN_TIMESLICE ( 10 * HZ / 1000)
-#define MAX_TIMESLICE (200 * HZ / 1000)
-#define ON_RUNQUEUE_WEIGHT 30
-#define CHILD_PENALTY 95
-#define PARENT_PENALTY 100
-#define EXIT_WEIGHT 3
-#define PRIO_BONUS_RATIO 25
-#define MAX_BONUS (MAX_USER_PRIO * PRIO_BONUS_RATIO / 100)
-#define INTERACTIVE_DELTA 2
-#define MAX_SLEEP_AVG (AVG_TIMESLICE * MAX_BONUS)
-#define STARVATION_LIMIT (MAX_SLEEP_AVG)
-#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG))
-#define NODE_THRESHOLD 125
-#define CREDIT_LIMIT 100
-
-/*
- * If a task is 'interactive' then we reinsert it in the active
- * array after it has expired its current timeslice. (it will not
- * continue to run immediately, it will still roundrobin with
- * other interactive tasks.)
- *
- * This part scales the interactivity limit depending on niceness.
- *
- * We scale it linearly, offset by the INTERACTIVE_DELTA delta.
- * Here are a few examples of different nice levels:
- *
- * TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
- * TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
- * TASK_INTERACTIVE( 0): [1,1,1,1,0,0,0,0,0,0,0]
- * TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
- * TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
- *
- * (the X axis represents the possible -5 ... 0 ... +5 dynamic
- * priority range a task can explore, a value of '1' means the
- * task is rated interactive.)
- *
- * Ie. nice +19 tasks can never get 'interactive' enough to be
- * reinserted into the active array. And only heavily CPU-hog nice -20
- * tasks will be expired. Default nice 0 tasks are somewhere between,
- * it takes some effort for them to get interactive, but it's not
- * too hard.
- */
-
-#define CURRENT_BONUS(p) \
- (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
- MAX_SLEEP_AVG)
-
-#ifdef CONFIG_SMP
-#define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \
- (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \
- num_online_cpus())
-#else
-#define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \
- (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)))
-#endif
-
-#define SCALE(v1,v1_max,v2_max) \
- (v1) * (v2_max) / (v1_max)
-
-#define DELTA(p) \
- (SCALE(TASK_NICE(p), 40, MAX_USER_PRIO*PRIO_BONUS_RATIO/100) + \
- INTERACTIVE_DELTA)
-
-#define TASK_INTERACTIVE(p) \
- ((p)->prio <= (p)->static_prio - DELTA(p))
-
-#define JUST_INTERACTIVE_SLEEP(p) \
- (JIFFIES_TO_NS(MAX_SLEEP_AVG * \
- (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
-
-#define HIGH_CREDIT(p) \
- ((p)->interactive_credit > CREDIT_LIMIT)
-
-#define LOW_CREDIT(p) \
- ((p)->interactive_credit < -CREDIT_LIMIT)
-
-#define TASK_PREEMPTS_CURR(p, rq) \
- ((p)->prio < (rq)->curr->prio)
-
-/*
- * BASE_TIMESLICE scales user-nice values [ -20 ... 19 ]
- * to time slice values.
- *
- * The higher a thread's priority, the bigger timeslices
- * it gets during one round of execution. But even the lowest
- * priority thread gets MIN_TIMESLICE worth of execution time.
- *
- * task_timeslice() is the interface that is used by the scheduler.
- */
-
-#define BASE_TIMESLICE(p) (MIN_TIMESLICE + \
- ((MAX_TIMESLICE - MIN_TIMESLICE) * (MAX_PRIO-1-(p)->static_prio)/(MAX_USER_PRIO - 1)))
-
-static inline unsigned int task_timeslice(task_t *p)
-{
- return BASE_TIMESLICE(p);
-}
-
-/*
- * These are the runqueue data structures:
- */
-
-#define BITMAP_SIZE ((((MAX_PRIO+1+7)/8)+sizeof(long)-1)/sizeof(long))
-
-typedef struct runqueue runqueue_t;
-
-struct prio_array {
- int nr_active;
- unsigned long bitmap[BITMAP_SIZE];
- struct list_head queue[MAX_PRIO];
-};
-
-/*
* This is the main, per-CPU runqueue data structure.
*
* Locking rule: those places that want to lock multiple runqueues
* (such as the load balancing or the thread migration code), lock
* acquire operations must be ordered by ascending &runqueue.
*/
-struct runqueue {
- spinlock_t lock;
- unsigned long nr_running, nr_switches, expired_timestamp,
- nr_uninterruptible;
- task_t *curr, *idle;
- struct mm_struct *prev_mm;
- prio_array_t *active, *expired, arrays[2];
- int prev_cpu_load[NR_CPUS];
-#ifdef CONFIG_NUMA
- atomic_t *node_nr_running;
- int prev_node_load[MAX_NUMNODES];
-#endif
- task_t *migration_thread;
- struct list_head migration_queue;
+DEFINE_PER_CPU(struct runqueue, runqueues);
- atomic_t nr_iowait;
+struct policy *policies[] = {
+ &rt_policy,
+ &ts_policy,
+ &batch_policy,
+ &idle_policy,
+ NULL,
};
-static DEFINE_PER_CPU(struct runqueue, runqueues);
-
#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
#define this_rq() (&__get_cpu_var(runqueues))
#define task_rq(p) cpu_rq(task_cpu(p))
-#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
+#define rq_curr(rq) (rq)->__curr
+#define cpu_curr(cpu) rq_curr(cpu_rq(cpu))
+
+static inline struct policy *task_policy(task_t *task)
+{
+ unsigned long idx;
+ struct policy *policy;
+ idx = task->sched_info.policy;
+ __check_task_policy(idx);
+ policy = task_rq(task)->policies[idx];
+ check_policy(policy);
+ return policy;
+}
+
+static inline struct policy *rq_policy(runqueue_t *rq)
+{
+ unsigned long idx;
+ task_t *task;
+ struct policy *policy;
+
+ task = rq_curr(rq);
+ BUG_ON(!task);
+ BUG_ON((unsigned long)task < PAGE_OFFSET);
+ idx = task->sched_info.policy;
+ __check_task_policy(idx);
+ policy = rq->policies[idx];
+ check_policy(policy);
+ return policy;
+}
+
+static int __task_nice(task_t *task)
+{
+ struct policy *policy = task_policy(task);
+ return policy->ops->nice(policy->queue, task);
+}
+
+static inline void set_rq_curr(runqueue_t *rq, task_t *task)
+{
+ rq->curr = task->sched_info.policy;
+ __check_task_policy(rq->curr);
+ rq->__curr = task;
+}
+
+static inline int task_preempts_curr(task_t *task, runqueue_t *rq)
+{
+ check_task_policy(rq_curr(rq));
+ check_task_policy(task);
+ if (rq_curr(rq)->sched_info.policy != task->sched_info.policy)
+ return task->sched_info.policy < rq_curr(rq)->sched_info.policy;
+ else {
+ struct policy *policy = rq_policy(rq);
+ return policy->ops->preempt(policy->queue, task);
+ }
+}
/*
* Default context-switch locking:
@@ -227,7 +129,7 @@ static DEFINE_PER_CPU(struct runqueue, r
#ifndef prepare_arch_switch
# define prepare_arch_switch(rq, next) do { } while(0)
# define finish_arch_switch(rq, next) spin_unlock_irq(&(rq)->lock)
-# define task_running(rq, p) ((rq)->curr == (p))
+# define task_running(rq, p) (rq_curr(rq) == (p))
#endif
#ifdef CONFIG_NUMA
@@ -320,53 +222,32 @@ static inline void rq_unlock(runqueue_t
}
/*
- * Adding/removing a task to/from a priority array:
+ * Adding/removing a task to/from a policy's queue.
+ * We dare not BUG_ON() a wrong task_queued() as boot-time
+ * calls may trip it.
*/
-static inline void dequeue_task(struct task_struct *p, prio_array_t *array)
+static inline void dequeue_task(task_t *task, runqueue_t *rq)
{
- array->nr_active--;
- list_del(&p->run_list);
- if (list_empty(array->queue + p->prio))
- __clear_bit(p->prio, array->bitmap);
+ struct policy *policy = task_policy(task);
+ BUG_ON(!task_queued(task));
+ policy->ops->dequeue(policy->queue, task);
+ if (!policy->ops->tasks(policy->queue)) {
+ BUG_ON(!test_bit(task->sched_info.policy, &rq->policy_bitmap));
+ __clear_bit(task->sched_info.policy, &rq->policy_bitmap);
+ }
+ clear_task_queued(task);
}
-static inline void enqueue_task(struct task_struct *p, prio_array_t *array)
+static inline void enqueue_task(task_t *task, runqueue_t *rq)
{
- list_add_tail(&p->run_list, array->queue + p->prio);
- __set_bit(p->prio, array->bitmap);
- array->nr_active++;
- p->array = array;
-}
-
-/*
- * effective_prio - return the priority that is based on the static
- * priority but is modified by bonuses/penalties.
- *
- * We scale the actual sleep average [0 .... MAX_SLEEP_AVG]
- * into the -5 ... 0 ... +5 bonus/penalty range.
- *
- * We use 25% of the full 0...39 priority range so that:
- *
- * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs.
- * 2) nice -20 CPU hogs do not get preempted by nice 0 tasks.
- *
- * Both properties are important to certain workloads.
- */
-static int effective_prio(task_t *p)
-{
- int bonus, prio;
-
- if (rt_task(p))
- return p->prio;
-
- bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
-
- prio = p->static_prio - bonus;
- if (prio < MAX_RT_PRIO)
- prio = MAX_RT_PRIO;
- if (prio > MAX_PRIO-1)
- prio = MAX_PRIO-1;
- return prio;
+ struct policy *policy = task_policy(task);
+ BUG_ON(task_queued(task));
+ if (!policy->ops->tasks(policy->queue)) {
+ BUG_ON(test_bit(task->sched_info.policy, &rq->policy_bitmap));
+ __set_bit(task->sched_info.policy, &rq->policy_bitmap);
+ }
+ policy->ops->enqueue(policy->queue, task);
+ set_task_queued(task);
}
/*
@@ -374,134 +255,34 @@ static int effective_prio(task_t *p)
*/
static inline void __activate_task(task_t *p, runqueue_t *rq)
{
- enqueue_task(p, rq->active);
+ enqueue_task(p, rq);
nr_running_inc(rq);
}
-static void recalc_task_prio(task_t *p, unsigned long long now)
-{
- unsigned long long __sleep_time = now - p->timestamp;
- unsigned long sleep_time;
-
- if (__sleep_time > NS_MAX_SLEEP_AVG)
- sleep_time = NS_MAX_SLEEP_AVG;
- else
- sleep_time = (unsigned long)__sleep_time;
-
- if (likely(sleep_time > 0)) {
- /*
- * User tasks that sleep a long time are categorised as
- * idle and will get just interactive status to stay active &
- * prevent them suddenly becoming cpu hogs and starving
- * other processes.
- */
- if (p->mm && p->activated != -1 &&
- sleep_time > JUST_INTERACTIVE_SLEEP(p)){
- p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG -
- AVG_TIMESLICE);
- if (!HIGH_CREDIT(p))
- p->interactive_credit++;
- } else {
- /*
- * The lower the sleep avg a task has the more
- * rapidly it will rise with sleep time.
- */
- sleep_time *= (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;
-
- /*
- * Tasks with low interactive_credit are limited to
- * one timeslice worth of sleep avg bonus.
- */
- if (LOW_CREDIT(p) &&
- sleep_time > JIFFIES_TO_NS(task_timeslice(p)))
- sleep_time =
- JIFFIES_TO_NS(task_timeslice(p));
-
- /*
- * Non high_credit tasks waking from uninterruptible
- * sleep are limited in their sleep_avg rise as they
- * are likely to be cpu hogs waiting on I/O
- */
- if (p->activated == -1 && !HIGH_CREDIT(p) && p->mm){
- if (p->sleep_avg >= JUST_INTERACTIVE_SLEEP(p))
- sleep_time = 0;
- else if (p->sleep_avg + sleep_time >=
- JUST_INTERACTIVE_SLEEP(p)){
- p->sleep_avg =
- JUST_INTERACTIVE_SLEEP(p);
- sleep_time = 0;
- }
- }
-
- /*
- * This code gives a bonus to interactive tasks.
- *
- * The boost works by updating the 'average sleep time'
- * value here, based on ->timestamp. The more time a task
- * spends sleeping, the higher the average gets - and the
- * higher the priority boost gets as well.
- */
- p->sleep_avg += sleep_time;
-
- if (p->sleep_avg > NS_MAX_SLEEP_AVG){
- p->sleep_avg = NS_MAX_SLEEP_AVG;
- if (!HIGH_CREDIT(p))
- p->interactive_credit++;
- }
- }
- }
-
- p->prio = effective_prio(p);
-}
-
/*
* activate_task - move a task to the runqueue and do priority recalculation
*
* Update all the scheduling statistics stuff. (sleep average
* calculation, priority modifiers, etc.)
*/
-static inline void activate_task(task_t *p, runqueue_t *rq)
+static inline void activate_task(task_t *task, runqueue_t *rq)
{
- unsigned long long now = sched_clock();
-
- recalc_task_prio(p, now);
-
- /*
- * This checks to make sure it's not an uninterruptible task
- * that is now waking up.
- */
- if (!p->activated){
- /*
- * Tasks which were woken up by interrupts (ie. hw events)
- * are most likely of interactive nature. So we give them
- * the credit of extending their sleep time to the period
- * of time they spend on the runqueue, waiting for execution
- * on a CPU, first time around:
- */
- if (in_interrupt())
- p->activated = 2;
- else
- /*
- * Normal first-time wakeups get a credit too for on-runqueue
- * time, but it will be weighted down:
- */
- p->activated = 1;
- }
- p->timestamp = now;
-
- __activate_task(p, rq);
+ struct policy *policy = task_policy(task);
+ policy->ops->wake(policy->queue, task);
+ __activate_task(task, rq);
}
/*
* deactivate_task - remove a task from the runqueue.
*/
-static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
+static inline void deactivate_task(task_t *task, runqueue_t *rq)
{
+ struct policy *policy = task_policy(task);
nr_running_dec(rq);
- if (p->state == TASK_UNINTERRUPTIBLE)
+ if (task->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
- dequeue_task(p, p->array);
- p->array = NULL;
+ policy->ops->sleep(policy->queue, task);
+ dequeue_task(task, rq);
}
/*
@@ -625,7 +406,7 @@ repeat_lock_task:
rq = task_rq_lock(p, &flags);
old_state = p->state;
if (old_state & state) {
- if (!p->array) {
+ if (!task_queued(p)) {
/*
* Fast-migrate the task if it's not running or runnable
* currently. Do not violate hard affinity.
@@ -644,14 +425,13 @@ repeat_lock_task:
* Tasks on involuntary sleep don't earn
* sleep_avg beyond just interactive state.
*/
- p->activated = -1;
}
if (sync)
__activate_task(p, rq);
else {
activate_task(p, rq);
- if (TASK_PREEMPTS_CURR(p, rq))
- resched_task(rq->curr);
+ if (task_preempts_curr(p, rq))
+ resched_task(rq_curr(rq));
}
success = 1;
}
@@ -679,68 +459,26 @@ int wake_up_state(task_t *p, unsigned in
* This function will do some initial scheduler statistics housekeeping
* that must be done for every newly created process.
*/
-void wake_up_forked_process(task_t * p)
+void wake_up_forked_process(task_t *task)
{
unsigned long flags;
runqueue_t *rq = task_rq_lock(current, &flags);
- p->state = TASK_RUNNING;
- /*
- * We decrease the sleep average of forking parents
- * and children as well, to keep max-interactive tasks
- * from forking tasks that are max-interactive.
- */
- current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
- PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
- p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
- CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
- p->interactive_credit = 0;
-
- p->prio = effective_prio(p);
- set_task_cpu(p, smp_processor_id());
-
- if (unlikely(!current->array))
- __activate_task(p, rq);
- else {
- p->prio = current->prio;
- list_add_tail(&p->run_list, &current->run_list);
- p->array = current->array;
- p->array->nr_active++;
- nr_running_inc(rq);
- }
+ task->state = TASK_RUNNING;
+ set_task_cpu(task, smp_processor_id());
+ if (unlikely(!task_queued(current)))
+ __activate_task(task, rq);
+ else
+ activate_task(task, rq);
task_rq_unlock(rq, &flags);
}
/*
- * Potentially available exiting-child timeslices are
- * retrieved here - this way the parent does not get
- * penalized for creating too many threads.
- *
- * (this cannot be used to 'generate' timeslices
- * artificially, because any timeslice recovered here
- * was given away by the parent in the first place.)
+ * Policies that depend on trapping fork() and exit() may need to
+ * put a hook here.
*/
-void sched_exit(task_t * p)
+void sched_exit(task_t *task)
{
- unsigned long flags;
-
- local_irq_save(flags);
- if (p->first_time_slice) {
- p->parent->time_slice += p->time_slice;
- if (unlikely(p->parent->time_slice > MAX_TIMESLICE))
- p->parent->time_slice = MAX_TIMESLICE;
- }
- local_irq_restore(flags);
- /*
- * If the child was a (relative-) CPU hog then decrease
- * the sleep_avg of the parent as well.
- */
- if (p->sleep_avg < p->parent->sleep_avg)
- p->parent->sleep_avg = p->parent->sleep_avg /
- (EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg /
- (EXIT_WEIGHT + 1);
}
/**
@@ -1128,18 +866,18 @@ out:
* pull_task - move a task from a remote runqueue to the local runqueue.
* Both runqueues must be locked.
*/
-static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
+static inline void pull_task(runqueue_t *src_rq, task_t *p, runqueue_t *this_rq, int this_cpu)
{
- dequeue_task(p, src_array);
+ dequeue_task(p, src_rq);
nr_running_dec(src_rq);
set_task_cpu(p, this_cpu);
nr_running_inc(this_rq);
- enqueue_task(p, this_rq->active);
+ enqueue_task(p, this_rq);
/*
* Note that idle threads have a prio of MAX_PRIO, for this test
* to be always true for them.
*/
- if (TASK_PREEMPTS_CURR(p, this_rq))
+ if (task_preempts_curr(p, this_rq))
set_need_resched();
}
@@ -1150,14 +888,14 @@ static inline void pull_task(runqueue_t
* ((!idle || (NS_TO_JIFFIES(now - (p)->timestamp) > \
* cache_decay_ticks)) && !task_running(rq, p) && \
* cpu_isset(this_cpu, (p)->cpus_allowed))
+ *
+ * Since there isn't a timestamp anymore, this needs adjustment.
*/
static inline int
can_migrate_task(task_t *tsk, runqueue_t *rq, int this_cpu, int idle)
{
- unsigned long delta = sched_clock() - tsk->timestamp;
-
- if (!idle && (delta <= JIFFIES_TO_NS(cache_decay_ticks)))
+ if (!idle)
return 0;
if (task_running(rq, tsk))
return 0;
@@ -1176,11 +914,8 @@ can_migrate_task(task_t *tsk, runqueue_t
*/
static void load_balance(runqueue_t *this_rq, int idle, cpumask_t cpumask)
{
- int imbalance, idx, this_cpu = smp_processor_id();
+ int imbalance, this_cpu = smp_processor_id();
runqueue_t *busiest;
- prio_array_t *array;
- struct list_head *head, *curr;
- task_t *tmp;
busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask);
if (!busiest)
@@ -1192,37 +927,6 @@ static void load_balance(runqueue_t *thi
*/
imbalance /= 2;
- /*
- * We first consider expired tasks. Those will likely not be
- * executed in the near future, and they are most likely to
- * be cache-cold, thus switching CPUs has the least effect
- * on them.
- */
- if (busiest->expired->nr_active)
- array = busiest->expired;
- else
- array = busiest->active;
-
-new_array:
- /* Start searching at priority 0: */
- idx = 0;
-skip_bitmap:
- if (!idx)
- idx = sched_find_first_bit(array->bitmap);
- else
- idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
- if (idx >= MAX_PRIO) {
- if (array == busiest->expired) {
- array = busiest->active;
- goto new_array;
- }
- goto out_unlock;
- }
-
- head = array->queue + idx;
- curr = head->prev;
-skip_queue:
- tmp = list_entry(curr, task_t, run_list);
/*
* We do not migrate tasks that are:
@@ -1231,21 +935,19 @@ skip_queue:
* 3) are cache-hot on their current CPU.
*/
- curr = curr->prev;
+ do {
+ struct policy *policy;
+ task_t *task;
+
+ policy = rq_migrate_policy(busiest);
+ if (!policy)
+ break;
+ task = policy->migrate(policy->queue);
+ if (!task)
+ break;
+ pull_task(busiest, task, this_rq, this_cpu);
+ } while (!idle && --imbalance);
- if (!can_migrate_task(tmp, busiest, this_cpu, idle)) {
- if (curr != head)
- goto skip_queue;
- idx++;
- goto skip_bitmap;
- }
- pull_task(busiest, array, tmp, this_rq, this_cpu);
- if (!idle && --imbalance) {
- if (curr != head)
- goto skip_queue;
- idx++;
- goto skip_bitmap;
- }
out_unlock:
spin_unlock(&busiest->lock);
out:
@@ -1356,10 +1058,10 @@ EXPORT_PER_CPU_SYMBOL(kstat);
*/
void scheduler_tick(int user_ticks, int sys_ticks)
{
- int cpu = smp_processor_id();
+ int idle, cpu = smp_processor_id();
struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+ struct policy *policy;
runqueue_t *rq = this_rq();
- task_t *p = current;
if (rcu_pending(cpu))
rcu_check_callbacks(cpu, user_ticks);
@@ -1373,98 +1075,28 @@ void scheduler_tick(int user_ticks, int
sys_ticks = 0;
}
- if (p == rq->idle) {
- if (atomic_read(&rq->nr_iowait) > 0)
- cpustat->iowait += sys_ticks;
- else
- cpustat->idle += sys_ticks;
- rebalance_tick(rq, 1);
- return;
- }
- if (TASK_NICE(p) > 0)
- cpustat->nice += user_ticks;
- else
- cpustat->user += user_ticks;
- cpustat->system += sys_ticks;
-
- /* Task might have expired already, but not scheduled off yet */
- if (p->array != rq->active) {
- set_tsk_need_resched(p);
- goto out;
- }
spin_lock(&rq->lock);
- /*
- * The task was running during this tick - update the
- * time slice counter. Note: we do not update a thread's
- * priority until it either goes to sleep or uses up its
- * timeslice. This makes it possible for interactive tasks
- * to use up their timeslices at their highest priority levels.
- */
- if (unlikely(rt_task(p))) {
- /*
- * RR tasks need a special form of timeslice management.
- * FIFO tasks have no timeslices.
- */
- if ((p->policy == SCHED_RR) && !--p->time_slice) {
- p->time_slice = task_timeslice(p);
- p->first_time_slice = 0;
- set_tsk_need_resched(p);
-
- /* put it at the end of the queue: */
- dequeue_task(p, rq->active);
- enqueue_task(p, rq->active);
- }
- goto out_unlock;
- }
- if (!--p->time_slice) {
- dequeue_task(p, rq->active);
- set_tsk_need_resched(p);
- p->prio = effective_prio(p);
- p->time_slice = task_timeslice(p);
- p->first_time_slice = 0;
-
- if (!rq->expired_timestamp)
- rq->expired_timestamp = jiffies;
- if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
- enqueue_task(p, rq->expired);
- } else
- enqueue_task(p, rq->active);
- } else {
- /*
- * Prevent a too long timeslice allowing a task to monopolize
- * the CPU. We do this by splitting up the timeslice into
- * smaller pieces.
- *
- * Note: this does not mean the task's timeslices expire or
- * get lost in any way, they just might be preempted by
- * another task of equal priority. (one with higher
- * priority would have preempted this task already.) We
- * requeue this task to the end of the list on this priority
- * level, which is in essence a round-robin of tasks with
- * equal priority.
- *
- * This only applies to tasks in the interactive
- * delta range with at least TIMESLICE_GRANULARITY to requeue.
- */
- if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
- p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
- (p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
- (p->array == rq->active)) {
-
- dequeue_task(p, rq->active);
- set_tsk_need_resched(p);
- p->prio = effective_prio(p);
- enqueue_task(p, rq->active);
- }
- }
-out_unlock:
+ policy = rq_policy(rq);
+ idle = policy->ops->tick(policy->queue, current, user_ticks, sys_ticks);
spin_unlock(&rq->lock);
-out:
- rebalance_tick(rq, 0);
+ rebalance_tick(rq, idle);
}
void scheduling_functions_start_here(void) { }
+static inline task_t *find_best_task(runqueue_t *rq)
+{
+ int idx;
+ struct policy *policy;
+
+ BUG_ON(!rq->policy_bitmap);
+ idx = __ffs(rq->policy_bitmap);
+ __check_task_policy(idx);
+ policy = rq->policies[idx];
+ check_policy(policy);
+ return policy->ops->best(policy->queue);
+}
+
/*
* schedule() is the main scheduler function.
*/
@@ -1472,11 +1104,7 @@ asmlinkage void schedule(void)
{
task_t *prev, *next;
runqueue_t *rq;
- prio_array_t *array;
- struct list_head *queue;
- unsigned long long now;
- unsigned long run_time;
- int idx;
+ struct policy *policy;
/*
* Test if we are atomic. Since do_exit() needs to call into
@@ -1494,22 +1122,9 @@ need_resched:
preempt_disable();
prev = current;
rq = this_rq();
+ policy = rq_policy(rq);
release_kernel_lock(prev);
- now = sched_clock();
- if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
- run_time = now - prev->timestamp;
- else
- run_time = NS_MAX_SLEEP_AVG;
-
- /*
- * Tasks with interactive credits get charged less run_time
- * at high sleep_avg to delay them losing their interactive
- * status
- */
- if (HIGH_CREDIT(prev))
- run_time /= (CURRENT_BONUS(prev) ? : 1);
-
spin_lock_irq(&rq->lock);
/*
@@ -1530,66 +1145,27 @@ need_resched:
prev->nvcsw++;
break;
case TASK_RUNNING:
+ policy->ops->start_wait(policy->queue, prev);
prev->nivcsw++;
}
+
pick_next_task:
- if (unlikely(!rq->nr_running)) {
#ifdef CONFIG_SMP
+ if (unlikely(!rq->nr_running))
load_balance(rq, 1, cpu_to_node_mask(smp_processor_id()));
- if (rq->nr_running)
- goto pick_next_task;
#endif
- next = rq->idle;
- rq->expired_timestamp = 0;
- goto switch_tasks;
- }
-
- array = rq->active;
- if (unlikely(!array->nr_active)) {
- /*
- * Switch the active and expired arrays.
- */
- rq->active = rq->expired;
- rq->expired = array;
- array = rq->active;
- rq->expired_timestamp = 0;
- }
-
- idx = sched_find_first_bit(array->bitmap);
- queue = array->queue + idx;
- next = list_entry(queue->next, task_t, run_list);
-
- if (next->activated > 0) {
- unsigned long long delta = now - next->timestamp;
-
- if (next->activated == 1)
- delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
-
- array = next->array;
- dequeue_task(next, array);
- recalc_task_prio(next, next->timestamp + delta);
- enqueue_task(next, array);
- }
- next->activated = 0;
-switch_tasks:
+ next = find_best_task(rq);
+ BUG_ON(!next);
prefetch(next);
clear_tsk_need_resched(prev);
RCU_qsctr(task_cpu(prev))++;
- prev->sleep_avg -= run_time;
- if ((long)prev->sleep_avg <= 0){
- prev->sleep_avg = 0;
- if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev)))
- prev->interactive_credit--;
- }
- prev->timestamp = now;
-
if (likely(prev != next)) {
- next->timestamp = now;
rq->nr_switches++;
- rq->curr = next;
-
prepare_arch_switch(rq, next);
+ policy = task_policy(next);
+ policy->ops->set_curr(policy->queue, next);
+ set_rq_curr(rq, next);
prev = context_switch(rq, prev, next);
barrier();
@@ -1845,45 +1421,46 @@ void scheduling_functions_end_here(void)
void set_user_nice(task_t *p, long nice)
{
unsigned long flags;
- prio_array_t *array;
runqueue_t *rq;
- int old_prio, new_prio, delta;
+ struct policy *policy;
+ int delta, queued;
- if (TASK_NICE(p) == nice || nice < -20 || nice > 19)
+ if (nice < -20 || nice > 19)
return;
/*
* We have to be careful, if called from sys_setpriority(),
* the task might be in the middle of scheduling on another CPU.
*/
rq = task_rq_lock(p, &flags);
+ delta = nice - __task_nice(p);
+ if (!delta) {
+ if (p->pid == 0 || p->pid == 1)
+ printk("no change in nice, set_user_nice() nops!\n");
+ goto out_unlock;
+ }
+
+ policy = task_policy(p);
+
/*
* The RT priorities are set via setscheduler(), but we still
* allow the 'normal' nice value to be set - but as expected
* it wont have any effect on scheduling until the task is
* not SCHED_NORMAL:
*/
- if (rt_task(p)) {
- p->static_prio = NICE_TO_PRIO(nice);
- goto out_unlock;
- }
- array = p->array;
- if (array)
- dequeue_task(p, array);
-
- old_prio = p->prio;
- new_prio = NICE_TO_PRIO(nice);
- delta = new_prio - old_prio;
- p->static_prio = NICE_TO_PRIO(nice);
- p->prio += delta;
+ queued = task_queued(p);
+ if (queued)
+ dequeue_task(p, rq);
+
+ policy->ops->renice(policy->queue, p, nice);
- if (array) {
- enqueue_task(p, array);
+ if (queued) {
+ enqueue_task(p, rq);
/*
* If the task increased its priority or is running and
* lowered its priority, then reschedule its CPU:
*/
if (delta < 0 || (delta > 0 && task_running(rq, p)))
- resched_task(rq->curr);
+ resched_task(rq_curr(rq));
}
out_unlock:
task_rq_unlock(rq, &flags);
@@ -1919,7 +1496,7 @@ asmlinkage long sys_nice(int increment)
if (increment > 40)
increment = 40;
- nice = PRIO_TO_NICE(current->static_prio) + increment;
+ nice = task_nice(current) + increment;
if (nice < -20)
nice = -20;
if (nice > 19)
@@ -1935,6 +1512,12 @@ asmlinkage long sys_nice(int increment)
#endif
+static int __task_prio(task_t *task)
+{
+ struct policy *policy = task_policy(task);
+ return policy->ops->prio(task);
+}
+
/**
* task_prio - return the priority value of a given task.
* @p: the task in question.
@@ -1943,29 +1526,111 @@ asmlinkage long sys_nice(int increment)
* RT tasks are offset by -200. Normal tasks are centered
* around 0, value goes from -16 to +15.
*/
-int task_prio(task_t *p)
+int task_prio(task_t *task)
{
- return p->prio - MAX_RT_PRIO;
+ int prio;
+ unsigned long flags;
+ runqueue_t *rq;
+
+ rq = task_rq_lock(task, &flags);
+ prio = __task_prio(task);
+ task_rq_unlock(rq, &flags);
+ return prio;
}
/**
* task_nice - return the nice value of a given task.
* @p: the task in question.
*/
-int task_nice(task_t *p)
+int task_nice(task_t *task)
{
- return TASK_NICE(p);
+ int nice;
+ unsigned long flags;
+ runqueue_t *rq;
+
+
+ rq = task_rq_lock(task, &flags);
+ nice = __task_nice(task);
+ task_rq_unlock(rq, &flags);
+ return nice;
}
EXPORT_SYMBOL(task_nice);
+int task_sched_policy(task_t *task)
+{
+ check_task_policy(task);
+ switch (task->sched_info.policy) {
+ case SCHED_POLICY_RT:
+ if (task->sched_info.cl_data.rt.rt_policy
+ == RT_POLICY_RR)
+ return SCHED_RR;
+ else
+ return SCHED_FIFO;
+ case SCHED_POLICY_TS:
+ return SCHED_NORMAL;
+ case SCHED_POLICY_BATCH:
+ return SCHED_BATCH;
+ case SCHED_POLICY_IDLE:
+ return SCHED_IDLE;
+ default:
+ BUG();
+ return -1;
+ }
+}
+EXPORT_SYMBOL(task_sched_policy);
+
+void set_task_sched_policy(task_t *task, int policy)
+{
+ check_task_policy(task);
+ BUG_ON(task_queued(task));
+ switch (policy) {
+ case SCHED_FIFO:
+ task->sched_info.policy = SCHED_POLICY_RT;
+ task->sched_info.cl_data.rt.rt_policy = RT_POLICY_FIFO;
+ break;
+ case SCHED_RR:
+ task->sched_info.policy = SCHED_POLICY_RT;
+ task->sched_info.cl_data.rt.rt_policy = RT_POLICY_RR;
+ break;
+ case SCHED_NORMAL:
+ task->sched_info.policy = SCHED_POLICY_TS;
+ break;
+ case SCHED_BATCH:
+ task->sched_info.policy = SCHED_POLICY_BATCH;
+ break;
+ case SCHED_IDLE:
+ task->sched_info.policy = SCHED_POLICY_IDLE;
+ break;
+ default:
+ BUG();
+ break;
+ }
+ check_task_policy(task);
+}
+EXPORT_SYMBOL(set_task_sched_policy);
+
+int rt_task(task_t *task)
+{
+ check_task_policy(task);
+ return !!(task->sched_info.policy == SCHED_POLICY_RT);
+}
+EXPORT_SYMBOL(rt_task);
+
/**
* idle_cpu - is a given cpu idle currently?
* @cpu: the processor in question.
*/
int idle_cpu(int cpu)
{
- return cpu_curr(cpu) == cpu_rq(cpu)->idle;
+ int idle;
+ unsigned long flags;
+ runqueue_t *rq = cpu_rq(cpu);
+
+ spin_lock_irqsave(&rq->lock, flags);
+ idle = !!(rq->curr == SCHED_POLICY_IDLE);
+ spin_unlock_irqrestore(&rq->lock, flags);
+ return idle;
}
EXPORT_SYMBOL_GPL(idle_cpu);
@@ -1985,11 +1650,10 @@ static inline task_t *find_process_by_pi
static int setscheduler(pid_t pid, int policy, struct sched_param __user *param)
{
struct sched_param lp;
- int retval = -EINVAL;
- int oldprio;
- prio_array_t *array;
+ int queued, retval = -EINVAL;
unsigned long flags;
runqueue_t *rq;
+ struct policy *rq_policy;
task_t *p;
if (!param || pid < 0)
@@ -2017,7 +1681,7 @@ static int setscheduler(pid_t pid, int p
rq = task_rq_lock(p, &flags);
if (policy < 0)
- policy = p->policy;
+ policy = task_sched_policy(p);
else {
retval = -EINVAL;
if (policy != SCHED_FIFO && policy != SCHED_RR &&
@@ -2047,29 +1711,23 @@ static int setscheduler(pid_t pid, int p
if (retval)
goto out_unlock;
- array = p->array;
- if (array)
+ queued = task_queued(p);
+ if (queued)
deactivate_task(p, task_rq(p));
retval = 0;
- p->policy = policy;
- p->rt_priority = lp.sched_priority;
- oldprio = p->prio;
- if (policy != SCHED_NORMAL)
- p->prio = MAX_USER_RT_PRIO-1 - p->rt_priority;
- else
- p->prio = p->static_prio;
- if (array) {
+ set_task_sched_policy(p, policy);
+ check_task_policy(p);
+ rq_policy = rq->policies[p->sched_info.policy];
+ check_policy(rq_policy);
+ rq_policy->ops->setprio(p, lp.sched_priority);
+ if (queued) {
__activate_task(p, task_rq(p));
/*
* Reschedule if we are currently running on this runqueue and
* our priority decreased, or if we are not currently running on
* this runqueue and our priority is higher than the current's
*/
- if (rq->curr == p) {
- if (p->prio > oldprio)
- resched_task(rq->curr);
- } else if (p->prio < rq->curr->prio)
- resched_task(rq->curr);
+ resched_task(rq_curr(rq));
}
out_unlock:
@@ -2121,7 +1779,7 @@ asmlinkage long sys_sched_getscheduler(p
if (p) {
retval = security_task_getscheduler(p);
if (!retval)
- retval = p->policy;
+ retval = task_sched_policy(p);
}
read_unlock(&tasklist_lock);
@@ -2153,7 +1811,7 @@ asmlinkage long sys_sched_getparam(pid_t
if (retval)
goto out_unlock;
- lp.sched_priority = p->rt_priority;
+ lp.sched_priority = task_prio(p);
read_unlock(&tasklist_lock);
/*
@@ -2262,32 +1920,13 @@ out_unlock:
*/
asmlinkage long sys_sched_yield(void)
{
+ struct policy *policy;
runqueue_t *rq = this_rq_lock();
- prio_array_t *array = current->array;
-
- /*
- * We implement yielding by moving the task into the expired
- * queue.
- *
- * (special rule: RT tasks will just roundrobin in the active
- * array.)
- */
- if (likely(!rt_task(current))) {
- dequeue_task(current, array);
- enqueue_task(current, rq->expired);
- } else {
- list_del(&current->run_list);
- list_add_tail(&current->run_list, array->queue + current->prio);
- }
- /*
- * Since we are going to call schedule() anyway, there's
- * no need to preempt:
- */
+ policy = rq_policy(rq);
+ policy->ops->yield(policy->queue, current);
_raw_spin_unlock(&rq->lock);
preempt_enable_no_resched();
-
schedule();
-
return 0;
}
@@ -2387,6 +2026,19 @@ asmlinkage long sys_sched_get_priority_m
return ret;
}
+static inline unsigned long task_timeslice(task_t *task)
+{
+ unsigned long flags, timeslice;
+ struct policy *policy;
+ runqueue_t *rq;
+
+ rq = task_rq_lock(task, &flags);
+ policy = task_policy(task);
+ timeslice = policy->ops->timeslice(policy->queue, task);
+ task_rq_unlock(rq, &flags);
+ return timeslice;
+}
+
/**
* sys_sched_rr_get_interval - return the default timeslice of a process.
* @pid: pid of the process.
@@ -2414,8 +2066,7 @@ asmlinkage long sys_sched_rr_get_interva
if (retval)
goto out_unlock;
- jiffies_to_timespec(p->policy & SCHED_FIFO ?
- 0 : task_timeslice(p), &t);
+ jiffies_to_timespec(task_timeslice(p), &t);
read_unlock(&tasklist_lock);
retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0;
out_nounlock:
@@ -2523,17 +2174,22 @@ void show_state(void)
void __init init_idle(task_t *idle, int cpu)
{
runqueue_t *idle_rq = cpu_rq(cpu), *rq = cpu_rq(task_cpu(idle));
+ struct policy *policy;
unsigned long flags;
local_irq_save(flags);
double_rq_lock(idle_rq, rq);
-
- idle_rq->curr = idle_rq->idle = idle;
+ policy = rq_policy(rq);
+ BUG_ON(policy != task_policy(idle));
+ printk("deactivating, have %d tasks\n",
+ policy->ops->tasks(policy->queue));
deactivate_task(idle, rq);
- idle->array = NULL;
- idle->prio = MAX_PRIO;
+ set_task_sched_policy(idle, SCHED_IDLE);
idle->state = TASK_RUNNING;
set_task_cpu(idle, cpu);
+ activate_task(idle, rq);
+ nr_running_dec(rq);
+ set_rq_curr(rq, idle);
double_rq_unlock(idle_rq, rq);
set_tsk_need_resched(idle);
local_irq_restore(flags);
@@ -2804,38 +2460,27 @@ __init static void init_kstat(void) {
void __init sched_init(void)
{
runqueue_t *rq;
- int i, j, k;
+ int i, j;
/* Init the kstat counters */
init_kstat();
for (i = 0; i < NR_CPUS; i++) {
- prio_array_t *array;
-
rq = cpu_rq(i);
- rq->active = rq->arrays;
- rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
INIT_LIST_HEAD(&rq->migration_queue);
atomic_set(&rq->nr_iowait, 0);
nr_running_init(rq);
-
- for (j = 0; j < 2; j++) {
- array = rq->arrays + j;
- for (k = 0; k < MAX_PRIO; k++) {
- INIT_LIST_HEAD(array->queue + k);
- __clear_bit(k, array->bitmap);
- }
- // delimiter for bitsearch
- __set_bit(MAX_PRIO, array->bitmap);
- }
+ memcpy(rq->policies, policies, sizeof(policies));
+ for (j = 0; j < BITS_PER_LONG && rq->policies[j]; ++j)
+ rq->policies[j]->ops->init(rq->policies[j], i);
}
/*
* We have to do a little magic to get the first
* thread right in SMP mode.
*/
rq = this_rq();
- rq->curr = current;
- rq->idle = current;
+ set_task_sched_policy(current, SCHED_NORMAL);
+ set_rq_curr(rq, current);
set_task_cpu(current, smp_processor_id());
wake_up_forked_process(current);
diff -prauN linux-2.6.0-test11/lib/Makefile sched-2.6.0-test11-5/lib/Makefile
--- linux-2.6.0-test11/lib/Makefile 2003-11-26 12:42:55.000000000 -0800
+++ sched-2.6.0-test11-5/lib/Makefile 2003-12-20 15:09:16.000000000 -0800
@@ -5,7 +5,7 @@
lib-y := errno.o ctype.o string.o vsprintf.o cmdline.o \
bust_spinlocks.o rbtree.o radix-tree.o dump_stack.o \
- kobject.o idr.o div64.o parser.o
+ kobject.o idr.o div64.o parser.o binomial.o
lib-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
diff -prauN linux-2.6.0-test11/lib/binomial.c sched-2.6.0-test11-5/lib/binomial.c
--- linux-2.6.0-test11/lib/binomial.c 1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/lib/binomial.c 2003-12-20 17:32:09.000000000 -0800
@@ -0,0 +1,138 @@
+#include <linux/kernel.h>
+#include <linux/binomial.h>
+
+struct binomial *binomial_minimum(struct binomial **heap)
+{
+ struct binomial *minimum, *tmp;
+
+ for (minimum = NULL, tmp = *heap; tmp; tmp = tmp->sibling) {
+ if (!minimum || minimum->priority > tmp->priority)
+ minimum = tmp;
+ }
+ return minimum;
+}
+
+static void binomial_link(struct binomial *left, struct binomial *right)
+{
+ left->parent = right;
+ left->sibling = right->child;
+ right->child = left;
+ right->degree++;
+}
+
+static void binomial_merge(struct binomial **both, struct binomial **left,
+ struct binomial **right)
+{
+ while (*left && *right) {
+ if ((*left)->degree < (*right)->degree) {
+ *both = *left;
+ left = &(*left)->sibling;
+ } else {
+ *both = *right;
+ right = &(*right)->sibling;
+ }
+ both = &(*both)->sibling;
+ }
+ /*
+ * for more safety:
+ * *left = *right = NULL;
+ */
+}
+
+void binomial_union(struct binomial **both, struct binomial **left,
+ struct binomial **right)
+{
+ struct binomial *prev, *tmp, *next;
+
+ binomial_merge(both, left, right);
+ if (!(tmp = *both))
+ return;
+
+ for (prev = NULL, next = tmp->sibling; next; next = tmp->sibling) {
+ if ((next->sibling && next->sibling->degree == tmp->degree)
+ || tmp->degree != next->degree) {
+ prev = tmp;
+ tmp = next;
+ } else if (tmp->priority <= next->priority) {
+ tmp->sibling = next->sibling;
+ binomial_link(next, tmp);
+ } else {
+ if (!prev)
+ *both = next;
+ else
+ prev->sibling = next;
+ binomial_link(tmp, next);
+ tmp = next;
+ }
+ }
+}
+
+void binomial_insert(struct binomial **heap, struct binomial *element)
+{
+ element->parent = NULL;
+ element->child = NULL;
+ element->sibling = NULL;
+ element->degree = 0;
+ binomial_union(heap, heap, &element);
+}
+
+static void binomial_reverse(struct binomial **in, struct binomial **out)
+{
+ while (*in) {
+ struct binomial *tmp = *in;
+ *in = (*in)->sibling;
+ tmp->sibling = *out;
+ *out = tmp;
+ }
+}
+
+struct binomial *binomial_extract_min(struct binomial **heap)
+{
+ struct binomial *tmp, *minimum, *last, *min_last, *new_heap;
+
+ minimum = last = min_last = new_heap = NULL;
+ for (tmp = *heap; tmp; last = tmp, tmp = tmp->sibling) {
+ if (!minimum || tmp->priority < minimum->priority) {
+ minimum = tmp;
+ min_last = last;
+ }
+ }
+ if (min_last && minimum)
+ min_last->sibling = minimum->sibling;
+ else if (minimum)
+ (*heap)->sibling = minimum->sibling;
+ else
+ return NULL;
+ binomial_reverse(&minimum->child, &new_heap);
+ binomial_union(heap, heap, &new_heap);
+ return minimum;
+}
+
+void binomial_decrease(struct binomial **heap, struct binomial *element,
+ unsigned increment)
+{
+ struct binomial *tmp, *last = NULL;
+
+ element->priority -= min(element->priority, increment);
+ last = element;
+ tmp = last->parent;
+ while (tmp && last->priority < tmp->priority) {
+ unsigned tmp_prio = tmp->priority;
+ tmp->priority = last->priority;
+ last->priority = tmp_prio;
+ last = tmp;
+ tmp = tmp->parent;
+ }
+}
+
+void binomial_delete(struct binomial **heap, struct binomial *element)
+{
+ struct binomial *tmp, *last = element;
+ for (tmp = last->parent; tmp; last = tmp, tmp = tmp->parent) {
+ unsigned tmp_prio = tmp->priority;
+ tmp->priority = last->priority;
+ last->priority = tmp_prio;
+ }
+ binomial_reverse(&last->child, &tmp);
+ binomial_union(heap, heap, &tmp);
+}
diff -prauN linux-2.6.0-test11/mm/oom_kill.c sched-2.6.0-test11-5/mm/oom_kill.c
--- linux-2.6.0-test11/mm/oom_kill.c 2003-11-26 12:44:16.000000000 -0800
+++ sched-2.6.0-test11-5/mm/oom_kill.c 2003-12-17 07:07:53.000000000 -0800
@@ -158,7 +158,6 @@ static void __oom_kill_task(task_t *p)
* all the memory it needs. That way it should be able to
* exit() and clear out its resources quickly...
*/
- p->time_slice = HZ;
p->flags |= PF_MEMALLOC | PF_MEMDIE;
/* This process has hardware access, be more careful. */
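The struct policy / struct policy_ops definitions themselves are not visible
in this excerpt of the patch, only their call sites. Judging purely from those
call sites (so the exact signatures below are guesses, not the real
declarations), the per-runqueue hook table has roughly this shape:

/*
 * Sketch reconstructed from the call sites above (policy->ops->wake,
 * ->sleep, ->tick, ->best, ...); the real declarations in the patch may
 * differ in naming and argument types.
 */
struct task_struct;
typedef struct task_struct task_t;
struct policy;
struct policy_queue;			/* per-policy, per-CPU queue state */

struct policy_ops {
	void (*init)(struct policy *policy, int cpu);
	void (*wake)(struct policy_queue *q, task_t *task);
	void (*sleep)(struct policy_queue *q, task_t *task);
	void (*start_wait)(struct policy_queue *q, task_t *task);
	void (*yield)(struct policy_queue *q, task_t *task);
	void (*set_curr)(struct policy_queue *q, task_t *task);
	void (*renice)(struct policy_queue *q, task_t *task, long nice);
	void (*setprio)(task_t *task, int prio);
	int (*prio)(task_t *task);
	int (*tick)(struct policy_queue *q, task_t *curr,
		    int user_ticks, int sys_ticks);	/* returns "idle" */
	int (*tasks)(struct policy_queue *q);
	unsigned long (*timeslice)(struct policy_queue *q, task_t *task);
	task_t *(*best)(struct policy_queue *q);	/* next task to run */
};

struct policy {
	struct policy_ops *ops;
	struct policy_queue *queue;
	/* load_balance() calls this directly as policy->migrate(queue): */
	task_t *(*migrate)(struct policy_queue *queue);
};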
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-16 15:55 ` Chris Friesen
2007-04-16 16:13 ` William Lee Irwin III
2007-04-17 0:04 ` Peter Williams
@ 2007-04-17 13:07 ` James Bruce
2007-04-17 20:05 ` William Lee Irwin III
2 siblings, 1 reply; 577+ messages in thread
From: James Bruce @ 2007-04-17 13:07 UTC (permalink / raw)
To: linux-kernel; +Cc: ck
Chris Friesen wrote:
> William Lee Irwin III wrote:
>
>> The sorts of like explicit decisions I'd like to be made for these are:
>> (1) In a mixture of tasks with varying nice numbers, a given nice number
>> corresponds to some share of CPU bandwidth. Implementations
>> should not have the freedom to change this arbitrarily according
>> to some intention.
>
> The first question that comes to my mind is whether nice levels should
> be linear or not. I would lean towards nonlinear as it allows a wider
> range (although of course at the expense of precision). Maybe something
> like "each nice level gives X times the cpu of the previous"? I think a
> value of X somewhere between 1.15 and 1.25 might be reasonable.
Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589
That value has the property that a nice=10 task gets 1/10th the cpu of a
nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that
would be fairly easy to explain to admins and users so that they can
know what to expect from nicing tasks.
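To put numbers on that (a standalone toy program, not code from any of the
schedulers being discussed; the base and the printing are just for
illustration):

#include <math.h>
#include <stdio.h>

/*
 * Relative CPU weight per nice level when each nice step scales bandwidth
 * by X = exp(ln(10)/10) ~= 1.2589, i.e. every 10 nice levels is a factor
 * of 10. Toy illustration only.
 */
int main(void)
{
	double x = exp(log(10.0) / 10.0);
	int nice;

	for (nice = -20; nice <= 19; nice++) {
		/* share relative to a nice 0 task: +10 -> 0.1, +20 -> 0.01 */
		printf("nice %3d -> relative share %9.4f\n", nice, pow(x, -nice));
	}
	return 0;
}

(Compile with -lm; the output makes it easy to sanity-check that nice 10
lands at exactly one tenth of nice 0.)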
> What about also having something that looks at latency, and how latency
> changes with niceness?
I think this would be a lot harder to pin down, since it's a function of
all the other tasks running and their nice levels. Do you have any of
the RT-derived analysis models in mind?
> What about specifying the timeframe over which the cpu bandwidth is
> measured? I currently have a system where the application designers
> would like it to be totally fair over a period of 1 second. As you can
> imagine, mainline doesn't do very well in this case.
It might be easier to specify the maximum deviation from the ideal
bandwidth over a certain period. I.e. something like "over a period of
one second, each task receives within 10% of the expected bandwidth".
- Jim Bruce
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 7:56 ` Nick Piggin
@ 2007-04-17 13:16 ` Peter Williams
2007-04-18 4:46 ` Nick Piggin
0 siblings, 1 reply; 577+ messages in thread
From: Peter Williams @ 2007-04-17 13:16 UTC (permalink / raw)
To: Nick Piggin
Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
Nick Piggin wrote:
> On Tue, Apr 17, 2007 at 05:48:55PM +1000, Peter Williams wrote:
>> Nick Piggin wrote:
>>>> Other hints that it was a bad idea was the need to transfer time slices
>>>> between children and parents during fork() and exit().
>>> I don't see how that has anything to do with dual arrays.
>> It's totally to do with the dual arrays. The only real purpose of the
>> time slice in O(1) (regardless of what its perceived purpose was) was to
>> control the switching between the arrays.
>
> The O(1) design is pretty convoluted in that regard. In my scheduler,
> the only purpose of the arrays is to renew time slices.
>
> The fork/exit logic is added to make interactivity better. Ingo's
> scheduler has similar equivalent logic.
>
>
>>> If you put
>>> a new child at the back of the queue, then your various interactive
>>> shell commands that typically do a lot of dependant forking get slowed
>>> right down behind your compile job. If you give a new child its own
>>> timeslice irrespective of the parent, then you have things like 'make'
>>> (which doesn't use a lot of CPU time) spawning off lots of high
>>> priority children.
>> This is an artefact of trying to control nice using time slices while
>> using them for controlling array switching and whatever else they were
>> being used for. Priority (static and dynamic) is the the best way to
>> implement nice.
>
> I don't like the timeslice based nice in mainline. It's too nasty
> with latencies. nicksched is far better in that regard IMO.
>
> But I don't know how you can assert a particular way is the best way
> to do something.
I should have added "I may be wrong but I think that ...".
My opinion is based on a lot of experience with different types of
scheduler design, and on the observation (from gathering scheduling
statistics while playing with these schedulers) that the time slices
we're talking about are much larger than the CPU chunks most tasks use
in any one go. So time slice size has no real effect on most tasks, and
the faster CPUs become, the more this holds true.
>
>
>>> You need to do _something_ (Ingo's does). I don't see why this would
>>> be tied with a dual array. FWIW, mine doesn't do anything on exit()
>>> like most others, but it may need more tuning in this area.
>>>
>>>
>>>> This disregard for the dual array mechanism has prevented me from
>>>> looking at the rest of your scheduler in any great detail so I can't
>>>> comment on any other ideas that may be in there.
>>> Well I wasn't really asking you to review it. As I said, everyone
>>> has their own idea of what a good design does, and review can't really
>>> distinguish between the better of two reasonable designs.
>>>
>>> A fair evaluation of the alternatives seems like a good idea though.
>>> Nobody is actually against this, are they?
>> No. It would be nice if the basic ideas that each scheduler tries to
>> implement could be extracted and explained though. This could lead to a
>> melding of ideas that leads to something quite good.
>>
>>>
>>>>> I haven't looked at Con's ones for a while,
>>>>> but I believe they are also much more straightforward than mainline...
>>>> I like Con's scheduler (partly because it uses a single array) but
>>>> mainly because it's nice and simple. However, his earlier schedulers
>>>> were prone to starvation (admittedly, only if you went out of your way
>>>> to make it happen) and I tried to convince him to use the anti
>>>> starvation mechanism in my SPA schedulers but was unsuccessful. I
>>>> haven't looked at his latest scheduler that sparked all this furore so
>>>> can't comment on it.
>>> I agree starvation or unfairness is unacceptable for a new scheduler.
>>>
>>>
>>>>> For example, let's say all else is equal between them, then why would
>>>>> we go with the O(logN) implementation rather than the O(1)?
>>>> In the highly unlikely event that you can't separate them on technical
>>>> grounds, Occam's razor recommends choosing the simplest solution. :-)
>>> O(logN) vs O(1) is technical grounds.
>> In that case I'd go O(1) provided that the k factor for the O(1) wasn't
>> greater than O(logN)'s k factor multiplied by logMaxN.
>
> Yes, or even significantly greater around typical large sizes of N.
Yes. In fact it's probably better to use the maximum number of threads
allowed on the system for N. We know that value, don't we?
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 9:51 ` Ingo Molnar
@ 2007-04-17 13:44 ` Peter Williams
2007-04-17 23:00 ` Michael K. Edwards
2007-04-20 20:47 ` Bill Davidsen
1 sibling, 1 reply; 577+ messages in thread
From: Peter Williams @ 2007-04-17 13:44 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
Ingo Molnar wrote:
> * Nick Piggin <npiggin@suse.de> wrote:
>
>>>> Maybe the progress is that more key people are becoming open to
>>>> the idea of changing the scheduler.
>>> Could be. All was quiet for quite a while, but when RSDL showed up,
>>> it aroused enough interest to show that scheduling woes is on folks
>>> radar.
>> Well I know people have had woes with the scheduler for ever (I guess
>> that isn't going to change :P). [...]
>
> yes, that part isnt going to change, because the CPU is a _scarce
> resource_ that is perhaps the most frequently overcommitted physical
> computer resource in existence, and because the kernel does not (yet)
> track eye movements of humans to figure out which tasks are more
> important to them. So critical human constraints are unknown to the
> scheduler and thus complaints will always come.
>
> The upstream scheduler thought it had enough information: the sleep
> average. So now the attempt is to go back and _simplify_ the scheduler
> and remove that information, and concentrate on getting fairness
> precisely right. The magic thing about 'fairness' is that it's a pretty
> good default policy if we decide that we simply have not enough
> information to do an intelligent choice.
>
> ( Lets be cautious though: the jury is still out whether people actually
> like this more than the current approach. While CFS feedback looks
> promising after a whopping 3 days of it being released [ ;-) ], the
> test coverage of all 'fairness centric' schedulers, even considering
> years of availability is less than 1% i'm afraid, and that < 1% was
> mostly self-selecting. )
At this point I'd like to make the observation that spa_ebs is a very
fair scheduler if you consider "nice" to be an indication of the
relative entitlement of tasks to CPU bandwidth.
It works by mapping nice to shares using a function very similar to the
one for calculating p->load weight except it's not offset by the RT
priorities as RT is handled separately. In theory, a runnable task's
entitlement to CPU bandwidth at any time is the ratio of its shares to
the total shares held by runnable tasks on the same CPU (in reality, a
smoothed average of this sum is used to make scheduling smoother). The
dynamic priorities of the runnable tasks are then fiddled to try to keep
each task's CPU bandwidth usage in proportion to its entitlement.
That's the theory anyway.
The actual implementation looks a bit different due to efficiency
considerations. The modifications to the above theory boil down to
keeping a running measure of the (recent) highest CPU bandwidth use per
share for tasks running on the CPU -- I call this the yardstick for this
CPU. When it's time to put a task on the run queue, its dynamic
priority is determined by comparing its CPU bandwidth per share value
with the yardstick for its CPU. If it's greater than the yardstick, this
value becomes the new yardstick and the task is given the lowest
possible dynamic priority (for its scheduling class). If the value is
zero it gets the highest possible priority (for its scheduling class),
which would be MAX_RT_PRIO for a SCHED_OTHER task. Otherwise it is
given a priority between these two extremes, proportional to the ratio of
its CPU bandwidth per share value to the yardstick. Quite simple really.
The other way in which the code deviates from the original is that (for
a few years now) I no longer calculate CPU bandwidth usage directly.
I've found that the overhead is less if I keep a running average of the
size of a task's CPU bursts and the length of its scheduling cycle (i.e.
from getting onto the CPU one time to getting onto it the next) and use
the ratio of these values as a measure of bandwidth usage.
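A rough sketch of that bookkeeping (a self-contained toy with invented names
and made-up fixed-point scaling; it is not the actual spa_ebs code, just the
shape of the idea):

#include <stdio.h>

/* Toy model: usage per share = (smoothed burst / smoothed cycle) / shares,
 * and the per-CPU "yardstick" is the highest recent usage per share. */
struct ebs_stats {
	unsigned long avg_burst;	/* smoothed CPU burst length */
	unsigned long avg_cycle;	/* smoothed on-CPU to on-CPU interval */
	unsigned long shares;		/* derived from nice */
};

static unsigned long usage_per_share(const struct ebs_stats *s)
{
	unsigned long cycle = s->avg_cycle ? s->avg_cycle : 1;
	unsigned long shares = s->shares ? s->shares : 1;

	return ((s->avg_burst << 10) / cycle) / shares;	/* 10-bit fixed point */
}

static int ebs_dynamic_prio(const struct ebs_stats *s, unsigned long *yardstick,
			    int best_prio, int worst_prio)
{
	unsigned long ups = usage_per_share(s);

	if (ups >= *yardstick) {
		*yardstick = ups;	/* this task becomes the reference */
		return worst_prio;	/* lowest possible dynamic priority */
	}
	if (ups == 0)
		return best_prio;	/* highest possible dynamic priority */
	return best_prio +
		(int)(((unsigned long)(worst_prio - best_prio) * ups) / *yardstick);
}

int main(void)
{
	unsigned long yardstick = 1;
	struct ebs_stats hog   = { 90, 100, 10 };	/* uses ~90% of its cycle */
	struct ebs_stats light = {  5, 100, 10 };	/* uses ~5% of its cycle */

	printf("hog   -> prio %d\n", ebs_dynamic_prio(&hog,   &yardstick, 100, 139));
	printf("light -> prio %d\n", ebs_dynamic_prio(&light, &yardstick, 100, 139));
	return 0;
}

The hog ends up at the bottom of the priority range and the light task close
to the top, which is the intended behaviour.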
Anyway it works and gives very predictable allocations of CPU bandwidth
based on nice.
Another good feature is that (in this pure form) it's starvation-free.
However, if you fiddle with it and do things like giving bonus priority
boosts to interactive tasks, it becomes susceptible to starvation. This
can be fixed by using an anti-starvation mechanism such as SPA's
promotion scheme, and that's what spa_ebs does.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 10:41 ` William Lee Irwin III
@ 2007-04-17 13:48 ` Peter Williams
2007-04-18 0:27 ` Peter Williams
0 siblings, 1 reply; 577+ messages in thread
From: Peter Williams @ 2007-04-17 13:48 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Linux Kernel Mailing List
William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>>> Comments on which directions you'd like this to go in these respects
>>> would be appreciated, as I regard you as the current "project owner."
>
> On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
>> I'd do scan through LKML from about 18 months ago looking for mention of
>> runtime configurable version of plugsched. Some students at a
>> university (in Germany, I think) posted some patches adding this feature
>> to plugsched around about then.
>
> Excellent. I'll go hunting for that.
>
>
> On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
>> I never added them to plugsched proper as I knew (from previous
>> experience when the company I worked for posted patches with similar
>> functionality) that Linux would like this idea less than he did the
>> current plugsched mechanism.
>
> Odd how the requirements ended up including that. Fickleness abounds.
> If only we knew up-front what the end would be.
>
>
> On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
>> Unfortunately, my own cache of the relevant e-mails got overwritten
>> during a Fedora Core upgrade (I've since moved /var onto a separate
>> drive to avoid a repetition) or I would dig them out and send them to
>> you. I'd provided with copies of the company's patches to use as a
>> guide to how to overcome the problems associated with changing
>> schedulers on a running system (a few non trivial locking issues pop up).
>> Maybe if one of the students still reads LKML he will provide a pointer.
>
> I was tempted to restart from scratch given Ingo's comments, but I
> reconsidered and I'll be working with your code (and the German
> students' as well). If everything has to change, so be it, but it'll
> still be a derived work. It would be ignoring precedent and failure to
> properly attribute if I did otherwise.
I can give you a patch (or set of patches) against the latest git
vanilla kernel version if that would help. There have been changes to
the vanilla scheduler code since 2.6.20 so the latest patch on
sourceforge won't apply cleanly. I've found that implementing this as a
series of patches rather than one big patch makes it easier for me to
cope with changes to the underlying code.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 0:54 ` Peter Williams
@ 2007-04-17 15:52 ` Chris Friesen
2007-04-17 23:50 ` Peter Williams
0 siblings, 1 reply; 577+ messages in thread
From: Chris Friesen @ 2007-04-17 15:52 UTC (permalink / raw)
To: Peter Williams
Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas,
linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
Peter Williams wrote:
> Chris Friesen wrote:
>> Scuse me if I jump in here, but doesn't the load balancer need some
>> way to figure out a) when to run, and b) which tasks to pull and where
>> to push them?
> Yes but both of these are independent of the scheduler discipline in force.
It is not clear to me that this is always the case, especially once you
mix in things like resource groups.
> If
> the load balancer manages to keep the weighted (according to static
> priority) load and distribution of priorities within the loads on the
> CPUs roughly equal and the scheduler does a good job of ensuring
> fairness, interactive responsiveness etc. for the tasks within a CPU
> then the result will be good system performance within the constraints
> set by the sys admins use of real time priorities and nice.
Suppose I have a really high priority task running. Another very high
priority task wakes up and would normally preempt the first one.
However, there happens to be another cpu available. It seems like it
would be a win if we moved one of those tasks to the available cpu
immediately so they can both run simultaneously. This would seem to
require some communication between the scheduler and the load balancer.
Certainly the above design could introduce a lot of context switching.
But if my goal is a scheduler that minimizes latency (even at the cost
of throughput) then that's an acceptable price to pay.
Chris
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 3:31 ` Nick Piggin
@ 2007-04-17 17:35 ` Matt Mackall
0 siblings, 0 replies; 577+ messages in thread
From: Matt Mackall @ 2007-04-17 17:35 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel,
Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner
On Tue, Apr 17, 2007 at 05:31:20AM +0200, Nick Piggin wrote:
> On Mon, Apr 16, 2007 at 09:28:24AM -0500, Matt Mackall wrote:
> > On Mon, Apr 16, 2007 at 05:03:49AM +0200, Nick Piggin wrote:
> > > I'd prefer if we kept a single CPU scheduler in mainline, because I
> > > think that simplifies analysis and focuses testing.
> >
> > I think you'll find something like 80-90% of the testing will be done
> > on the default choice, even if other choices exist. So you really
> > won't have much of a problem here.
> >
> > But when the only choice for other schedulers is to go out-of-tree,
> > then only 1% of the people will try it out and those people are
> > guaranteed to be the ones who saw scheduling problems in mainline.
> > So the alternative won't end up getting any testing on many of the
> > workloads that work fine in mainstream so their feedback won't tell
> > you very much at all.
>
> Yeah I concede that perhaps it is the only way to get things going
> any further. But how do we decide if and when the current scheduler
> should be demoted from default, and which should replace it?
Step one is ship both in -mm. If that doesn't give us enough
confidence, ship both in mainline. If that doesn't give us enough
confidence, wait until vendors ship both. Eventually a clear picture
should emerge. If it doesn't, either the change is not significant or
no one cares.
But it really is important to be able to do controlled experiments on
this stuff with little effort. That's the recipe for getting lots of
valid feedback.
--
Mathematics is the supreme nostalgia of our time.
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 13:07 ` James Bruce
@ 2007-04-17 20:05 ` William Lee Irwin III
0 siblings, 0 replies; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-17 20:05 UTC (permalink / raw)
To: James Bruce
Cc: Chris Friesen, Willy Tarreau, Pekka Enberg, hui Bill Huey,
Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel,
Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote:
> Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589
> That value has the property that a nice=10 task gets 1/10th the cpu of a
> nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that
> would be fairly easy to explain to admins and users so that they can
> know what to expect from nicing tasks.
[...additional good commentary trimmed...]
Lots of good ideas here. I'll follow them.
-- wli
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 7:01 ` Nick Piggin
2007-04-17 8:23 ` William Lee Irwin III
@ 2007-04-17 21:39 ` Matt Mackall
2007-04-17 23:23 ` Peter Williams
2007-04-18 3:15 ` Nick Piggin
1 sibling, 2 replies; 577+ messages in thread
From: Matt Mackall @ 2007-04-17 21:39 UTC (permalink / raw)
To: Nick Piggin
Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
> > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
> > >> All things are not equal; they all have different properties. I like
> >
> > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
> > > Exactly. So we have to explore those properties and evaluate performance
> > > (in all meanings of the word). That's only logical.
> >
> > Any chance you'd be willing to put down a few thoughts on what sorts
> > of standards you'd like to set for both correctness (i.e. the bare
> > minimum a scheduler implementation must do to be considered valid
> > beyond not oopsing) and performance metrics (i.e. things that produce
> > numbers for each scheduler you can compare to say "this scheduler is
> > better than this other scheduler at this.").
>
> Yeah I guess that's the hard part :)
>
> For correctness, I guess fairness is an easy one. I think that unfairness
> is basically a bug and that it would be very unfortunate to merge something
> unfair. But this is just within the context of a single runqueue... for
> better or worse, we allow some unfairness in multiprocessors for performance
> reasons of course.
I'm a big fan of fairness, but I think it's a bit early to declare it
a mandatory feature. Bounded unfairness is probably something we can
agree on, ie "if we decide to be unfair, no process suffers more than
a factor of x".
> Latency. Given N tasks in the system, an arbitrary task should get
> onto the CPU in a bounded amount of time (excluding events like freak
> IRQ holdoffs and such, obviously -- ie. just considering the context
> of the scheduler's state machine).
This is a slightly stronger statement than starvation-free (which is
obviously mandatory). I think you're looking for something like
"worst-case scheduling latency is proportional to the number of
runnable tasks". Which I think is quite a reasonable requirement.
I'm pretty sure the stock scheduler falls short of both of these
guarantees though.
--
Mathematics is the supreme nostalgia of our time.
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 9:24 ` Ingo Molnar
2007-04-17 9:57 ` William Lee Irwin III
@ 2007-04-17 22:08 ` Matt Mackall
2007-04-17 22:32 ` William Lee Irwin III
1 sibling, 1 reply; 577+ messages in thread
From: Matt Mackall @ 2007-04-17 22:08 UTC (permalink / raw)
To: Ingo Molnar
Cc: William Lee Irwin III, Davide Libenzi, Nick Piggin,
Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
>
> * William Lee Irwin III <wli@holomorphy.com> wrote:
>
> > [...] Also rest assured that the tone of the critique is not hostile,
> > and wasn't meant to sound that way.
>
> ok :) (And i guess i was too touchy - sorry about coming out swinging.)
>
> > Also, given the general comments it appears clear that some
> > statistical metric of deviation from the intended behavior furthermore
> > qualified by timescale is necessary, so this appears to be headed
> > toward a sort of performance metric as opposed to a pass/fail test
> > anyway. However, to even measure this at all, some statement of
> > intention is required. I'd prefer that there be a Linux-standard
> > semantics for nice so results are more directly comparable and so that
> > users also get similar nice behavior from the scheduler as it varies
> > over time and possibly implementations if users should care to switch
> > them out with some scheduler patch or other.
>
> yeah. If you could come up with a sane definition that also translates
> into low overhead on the algorithm side that would be great!
How's this:
If you're running two identical CPU hog tasks A and B differing only by nice
level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a
constant f(Anice - Bnice).
Other definitions make things hard to analyze and are probably not
well-bounded when confronted with > 2 tasks.
I -think- this implies keeping a separate scaled CPU usage counter,
where the scaling factor is a trivial exponential function of nice
level where f(0) == 1. Then you schedule based on this scaled usage
counter rather than unscaled.
I also suspect we want to keep the exponential base small so that the
maximal difference is 10x-100x.
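A toy model of that (invented names, an arbitrary base of 1.25 per nice step,
not taken from any posted patch): schedule whichever task has the smallest
scaled usage, and charge it at a rate that grows exponentially with its nice
level.

#include <stdio.h>

#define SCALE_SHIFT 10

/*
 * Toy illustration only: each task's CPU usage is charged to a counter
 * scaled by ~1.25^nice (nice 0 scales by exactly 1), and the task with
 * the smallest scaled counter runs next.
 */
static unsigned long nice_to_charge(int nice)
{
	unsigned long w = 1 << SCALE_SHIFT;	/* 1.0 in fixed point */
	int i;

	for (i = 0; i < nice; i++)
		w = w * 5 / 4;			/* positive nice: charged more */
	for (i = 0; i > nice; i--)
		w = w * 4 / 5;			/* negative nice: charged less */
	return w ? w : 1;
}

struct toy_task {
	int nice;
	unsigned long long scaled_usage;
	unsigned long raw_usage;
};

static struct toy_task *pick_next(struct toy_task *t, int n)
{
	struct toy_task *best = &t[0];
	int i;

	for (i = 1; i < n; i++)
		if (t[i].scaled_usage < best->scaled_usage)
			best = &t[i];
	return best;
}

int main(void)
{
	struct toy_task tasks[2] = { { 0, 0, 0 }, { 5, 0, 0 } };
	int i;

	for (i = 0; i < 1000000; i++) {
		struct toy_task *t = pick_next(tasks, 2);

		/* run for 1000 "ns" and charge at the nice-scaled rate */
		t->scaled_usage += 1000ULL * nice_to_charge(t->nice) >> SCALE_SHIFT;
		t->raw_usage += 1000;
	}
	printf("cputime(nice 0) / cputime(nice 5) = %.2f\n",
	       (double)tasks[0].raw_usage / tasks[1].raw_usage);
	return 0;
}

Running it prints a raw cputime ratio of roughly 3.05 == 1.25^5, i.e. a
function only of the nice difference, which is the property being asked for.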
--
Mathematics is the supreme nostalgia of our time.
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 8:23 ` William Lee Irwin III
@ 2007-04-17 22:23 ` Davide Libenzi
0 siblings, 0 replies; 577+ messages in thread
From: Davide Libenzi @ 2007-04-17 22:23 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas,
Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
On Tue, 17 Apr 2007, William Lee Irwin III wrote:
> On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> > Latency. Given N tasks in the system, an arbitrary task should get
> > onto the CPU in a bounded amount of time (excluding events like freak
> > IRQ holdoffs and such, obviously -- ie. just considering the context
> > of the scheduler's state machine).
>
> ISTR Davide Libenzi having a scheduling latency test a number of years
> ago. Resurrecting that and tuning it to the needs of this kind of
> testing sounds relevant here. The test suite Peter Willliams mentioned
> would also help.
That helped me a lot at that time. At every context switch it was sampling
critical scheduler parameters for both the entering and exiting task (and
associated timestamps). Then the data was collected from userspace through
a /dev/idontremember for analysis. It'd be very useful to have
it these days, to study what really happens under the hood
(scheduler internal parameter variations and such) when those weird loads
make the scheduler unstable.
- Davide
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 22:08 ` Matt Mackall
@ 2007-04-17 22:32 ` William Lee Irwin III
2007-04-17 22:39 ` Matt Mackall
0 siblings, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-17 22:32 UTC (permalink / raw)
To: Matt Mackall
Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams,
Mike Galbraith, Con Kolivas, ck list, Bill Huey,
Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
>> yeah. If you could come up with a sane definition that also translates
>> into low overhead on the algorithm side that would be great!
On Tue, Apr 17, 2007 at 05:08:09PM -0500, Matt Mackall wrote:
> How's this:
> If you're running two identical CPU hog tasks A and B differing only by nice
> level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a
> constant f(Anice - Bnice).
> Other definitions make things hard to analyze and probably not
> well-bounded when confronted with > 2 tasks.
> I -think- this implies keeping a separate scaled CPU usage counter,
> where the scaling factor is a trivial exponential function of nice
> level where f(0) == 1. Then you schedule based on this scaled usage
> counter rather than unscaled.
> I also suspect we want to keep the exponential base small so that the
> maximal difference is 10x-100x.
I'm already working with this as my assumed nice semantics (actually
something with a specific exponential base, suggested in other emails)
until others start saying they want something different and agree.
-- wli
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 22:32 ` William Lee Irwin III
@ 2007-04-17 22:39 ` Matt Mackall
2007-04-17 22:59 ` William Lee Irwin III
0 siblings, 1 reply; 577+ messages in thread
From: Matt Mackall @ 2007-04-17 22:39 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams,
Mike Galbraith, Con Kolivas, ck list, Bill Huey,
Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote:
> On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
> >> yeah. If you could come up with a sane definition that also translates
> >> into low overhead on the algorithm side that would be great!
>
> On Tue, Apr 17, 2007 at 05:08:09PM -0500, Matt Mackall wrote:
> > How's this:
> > If you're running two identical CPU hog tasks A and B differing only by nice
> > level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a
> > constant f(Anice - Bnice).
> > Other definitions make things hard to analyze and probably not
> > well-bounded when confronted with > 2 tasks.
> > I -think- this implies keeping a separate scaled CPU usage counter,
> > where the scaling factor is a trivial exponential function of nice
> > level where f(0) == 1. Then you schedule based on this scaled usage
> > counter rather than unscaled.
> > I also suspect we want to keep the exponential base small so that the
> > maximal difference is 10x-100x.
>
> I'm already working with this as my assumed nice semantics (actually
> something with a specific exponential base, suggested in other emails)
> until others start saying they want something different and agree.
Good. This has a couple nice mathematical properties, including
"bounded unfairness" which I mentioned earlier. What base are you
looking at?
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 22:59 ` William Lee Irwin III
@ 2007-04-17 22:57 ` Matt Mackall
2007-04-18 4:29 ` William Lee Irwin III
2007-04-18 7:29 ` James Bruce
0 siblings, 2 replies; 577+ messages in thread
From: Matt Mackall @ 2007-04-17 22:57 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams,
Mike Galbraith, Con Kolivas, ck list, Bill Huey,
Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 03:59:02PM -0700, William Lee Irwin III wrote:
> On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote:
> >> I'm already working with this as my assumed nice semantics (actually
> >> something with a specific exponential base, suggested in other emails)
> >> until others start saying they want something different and agree.
>
> On Tue, Apr 17, 2007 at 05:39:09PM -0500, Matt Mackall wrote:
> > Good. This has a couple nice mathematical properties, including
> > "bounded unfairness" which I mentioned earlier. What base are you
> > looking at?
>
> I'm working with the following suggestion:
>
>
> On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote:
> > Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589
> > That value has the property that a nice=10 task gets 1/10th the cpu of a
> > nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that
> > would be fairly easy to explain to admins and users so that they can
> > know what to expect from nicing tasks.
>
> I'm not likely to write the testcase until this upcoming weekend, though.
So that means there's a 10000:1 ratio between nice 20 and nice -19. In
that sort of dynamic range, you're likely to have non-trivial
numerical accuracy issues in integer/fixed-point math.
(Especially if your clock is jiffies-scale, which a significant number
of machines will continue to be.)
I really think if we want to have vastly different ratios, we probably
want to be looking at BATCH and RT scheduling classes instead.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 577+ messages in thread
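( To make the dynamic-range concern concrete: a throwaway program that builds
an integer weight table for base exp(ln(10)/10) with nice 0 scaled to 1024 and
prints the extreme ratio. Purely illustrative; the 1024 scale factor is an
arbitrary choice for the example: )

/* Print an integer weight table for base exp(ln(10)/10) ~= 1.2589,
 * with nice 0 == 1024, and the ratio between the extremes.  Purely
 * illustrative of the fixed-point precision concern. */
#include <stdio.h>
#include <math.h>

int main(void)
{
        const double base = exp(log(10.0) / 10.0);   /* ~1.2589 */
        long w[40];

        for (int nice = -20; nice <= 19; nice++)
                w[nice + 20] = lround(1024.0 * pow(base, -nice));

        for (int nice = -20; nice <= 19; nice++)
                printf("nice %3d -> weight %6ld\n", nice, w[nice + 20]);

        /* nice -20 vs nice 19 is roughly 10^3.9, i.e. nearly 10000:1,
         * while the nice 19 weight rounds to a very small integer. */
        printf("ratio of extremes: %.1f\n", (double)w[0] / (double)w[39]);
        return 0;
}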
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 22:39 ` Matt Mackall
@ 2007-04-17 22:59 ` William Lee Irwin III
2007-04-17 22:57 ` Matt Mackall
0 siblings, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-17 22:59 UTC (permalink / raw)
To: Matt Mackall
Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams,
Mike Galbraith, Con Kolivas, ck list, Bill Huey,
Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote:
>> I'm already working with this as my assumed nice semantics (actually
>> something with a specific exponential base, suggested in other emails)
>> until others start saying they want something different and agree.
On Tue, Apr 17, 2007 at 05:39:09PM -0500, Matt Mackall wrote:
> Good. This has a couple nice mathematical properties, including
> "bounded unfairness" which I mentioned earlier. What base are you
> looking at?
I'm working with the following suggestion:
On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote:
> Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589
> That value has the property that a nice=10 task gets 1/10th the cpu of a
> nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that
> would be fairly easy to explain to admins and users so that they can
> know what to expect from nicing tasks.
I'm not likely to write the testcase until this upcoming weekend, though.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 13:44 ` Peter Williams
@ 2007-04-17 23:00 ` Michael K. Edwards
2007-04-17 23:07 ` William Lee Irwin III
2007-04-18 2:39 ` Peter Williams
0 siblings, 2 replies; 577+ messages in thread
From: Michael K. Edwards @ 2007-04-17 23:00 UTC (permalink / raw)
To: Peter Williams
Cc: Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list,
Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On 4/17/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
> The other way in which the code deviates from the original is that (for
> a few years now) I no longer calculate CPU bandwidth usage directly.
> I've found that the overhead is less if I keep a running average of the
> size of a task's CPU bursts and the length of its scheduling cycle (i.e.
> from going on CPU one time to going on CPU the next time) and use the
> ratio of these values as a measure of bandwidth usage.
>
> Anyway it works and gives very predictable allocations of CPU bandwidth
> based on nice.
Works, that is, right up until you add nonlinear interactions with CPU
speed scaling. From my perspective as an embedded platform
integrator, clock/voltage scaling is the elephant in the scheduler's
living room. Patch in DPM (now OpPoint?) to scale the clock based on
what task is being scheduled, and suddenly the dynamic priority
calculations go wild. Nip this in the bud by putting an RT priority
on the relevant threads (which you have to do anyway if you need
remotely audio-grade latency), and the lock affinity heuristics break,
so you have to hand-tune all the thread priorities. Blecch.
Not to mention the likelihood that the task whose clock speed you're
trying to crank up (say, a WiFi soft MAC) needs to be _lower_ priority
than the application. (You want to crank the CPU for this task
because it runs with the RF hot, which may cost you as much power as
the rest of the platform.) You'd better hope you can remove it from
the dynamic priority heuristics with SCHED_BATCH. Otherwise
everything _else_ has to be RT priority (or it'll be starved by the
soft MAC) and you've basically tossed SCHED_NORMAL in the bin. Double
blecch!
Is it too much to ask for someone with actual engineering training
(not me, unfortunately) to sit down and build a negative-feedback
control system that handles soft-real-time _and_ dynamic-priority
_and_ batch loads, CPU _and_ I/O scheduling, preemption _and_ clock
scaling? And actually separates the accounting and control mechanisms
from the heuristics, so the latter can be tuned (within a well
documented stable range) to reflect the expected system usage
patterns?
It's not like there isn't a vast literature in this area over the past
decade, including some dealing specifically with clock scaling
consistent with low-latency applications. It's a pity that people
doing academic work in this area rarely wade into LKML, even when
they're hacking on a Linux fork. But then, there's not much economic
incentive for them to do so, and they can usually get their fill of
citation politics and dominance games without leaving their home
department. :-P
Seriously, though. If you're really going to put the mainline
scheduler through this kind of churn, please please pretty please knit
in per-task clock scaling (possibly even rejigged during the slice;
see e. g. Yuan and Nahrstedt's GRACE-OS papers) and some sort of
linger mechanism to keep from taking context switch hits when you're
confident that an I/O will complete quickly.
Cheers,
- Michael
^ permalink raw reply [flat|nested] 577+ messages in thread
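( For illustration: a small sketch of the bookkeeping Peter describes in the
quoted text, i.e. decaying averages of burst length and scheduling-cycle
length with their ratio as the bandwidth estimate. The decay factor and the
names are invented for the example; this is not the spa_ebs code itself: )

/* Toy version of "average CPU burst / average scheduling cycle" as a
 * CPU bandwidth estimate.  Fed once per scheduling cycle with how long
 * the task ran and how long the whole on-CPU-to-on-CPU cycle took. */
#include <stdio.h>

struct ebs_stats {
        double avg_burst_ns;   /* decaying average of time on CPU per burst */
        double avg_cycle_ns;   /* decaying average of on-CPU to on-CPU time */
};

/* Simple exponential decay; a real in-kernel version would use
 * fixed-point shifts rather than floating point. */
#define DECAY 0.9

static void ebs_account(struct ebs_stats *s, double burst_ns, double cycle_ns)
{
        s->avg_burst_ns = DECAY * s->avg_burst_ns + (1.0 - DECAY) * burst_ns;
        s->avg_cycle_ns = DECAY * s->avg_cycle_ns + (1.0 - DECAY) * cycle_ns;
}

static double ebs_bandwidth(const struct ebs_stats *s)
{
        return s->avg_cycle_ns > 0 ? s->avg_burst_ns / s->avg_cycle_ns : 0.0;
}

int main(void)
{
        struct ebs_stats s = { 0, 0 };

        /* a task that runs 2ms out of every 10ms cycle -> ~20% bandwidth */
        for (int i = 0; i < 100; i++)
                ebs_account(&s, 2e6, 10e6);
        printf("estimated CPU bandwidth: %.2f\n", ebs_bandwidth(&s));
        return 0;
}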
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 23:00 ` Michael K. Edwards
@ 2007-04-17 23:07 ` William Lee Irwin III
2007-04-17 23:52 ` Michael K. Edwards
2007-04-18 2:39 ` Peter Williams
1 sibling, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-17 23:07 UTC (permalink / raw)
To: Michael K. Edwards
Cc: Peter Williams, Ingo Molnar, Nick Piggin, Mike Galbraith,
Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds,
Andrew Morton, Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 04:00:53PM -0700, Michael K. Edwards wrote:
> Works, that is, right up until you add nonlinear interactions with CPU
> speed scaling. From my perspective as an embedded platform
> integrator, clock/voltage scaling is the elephant in the scheduler's
> living room. Patch in DPM (now OpPoint?) to scale the clock based on
> what task is being scheduled, and suddenly the dynamic priority
> calculations go wild. Nip this in the bud by putting an RT priority
> on the relevant threads (which you have to do anyway if you need
> remotely audio-grade latency), and the lock affinity heuristics break,
> so you have to hand-tune all the thread priorities. Blecch.
[...not terribly enlightening stuff trimmed...]
The ongoing scheduler work is on a much more basic level than these
affairs I'm guessing you googled. When the basics work as intended it
will be possible to move on to more advanced issues.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 23:23 ` Peter Williams
@ 2007-04-17 23:19 ` Matt Mackall
0 siblings, 0 replies; 577+ messages in thread
From: Matt Mackall @ 2007-04-17 23:19 UTC (permalink / raw)
To: Peter Williams
Cc: Nick Piggin, William Lee Irwin III, Mike Galbraith, Con Kolivas,
Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds,
Andrew Morton, Arjan van de Ven, Thomas Gleixner
On Wed, Apr 18, 2007 at 09:23:42AM +1000, Peter Williams wrote:
> Matt Mackall wrote:
> >On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> >>On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
> >>>On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
> >>>>>All things are not equal; they all have different properties. I like
> >>>On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
> >>>>Exactly. So we have to explore those properties and evaluate performance
> >>>>(in all meanings of the word). That's only logical.
> >>>Any chance you'd be willing to put down a few thoughts on what sorts
> >>>of standards you'd like to set for both correctness (i.e. the bare
> >>>minimum a scheduler implementation must do to be considered valid
> >>>beyond not oopsing) and performance metrics (i.e. things that produce
> >>>numbers for each scheduler you can compare to say "this scheduler is
> >>>better than this other scheduler at this.").
> >>Yeah I guess that's the hard part :)
> >>
> >>For correctness, I guess fairness is an easy one. I think that unfairness
> >>is basically a bug and that it would be very unfortunate to merge
> >>something
> >>unfair. But this is just within the context of a single runqueue... for
> >>better or worse, we allow some unfairness in multiprocessors for
> >>performance
> >>reasons of course.
> >
> >I'm a big fan of fairness, but I think it's a bit early to declare it
> >a mandatory feature. Bounded unfairness is probably something we can
> >agree on, ie "if we decide to be unfair, no process suffers more than
> >a factor of x".
> >
> >>Latency. Given N tasks in the system, an arbitrary task should get
> >>onto the CPU in a bounded amount of time (excluding events like freak
> >>IRQ holdoffs and such, obviously -- ie. just considering the context
> >>of the scheduler's state machine).
> >
> >This is a slightly stronger statement than starvation-free (which is
> >obviously mandatory). I think you're looking for something like
> >"worst-case scheduling latency is proportional to the number of
> >runnable tasks".
>
> add "taking into consideration nice and/or real time priorities of
> runnable tasks". I.e. if a task is nice 19 it can expect to wait longer
> to get onto the CPU than if it was nice 0.
Yes. Assuming we meet the "bounded unfairness" criterion above, this
follows.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 21:39 ` Matt Mackall
@ 2007-04-17 23:23 ` Peter Williams
2007-04-17 23:19 ` Matt Mackall
2007-04-18 3:15 ` Nick Piggin
1 sibling, 1 reply; 577+ messages in thread
From: Peter Williams @ 2007-04-17 23:23 UTC (permalink / raw)
To: Matt Mackall
Cc: Nick Piggin, William Lee Irwin III, Mike Galbraith, Con Kolivas,
Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds,
Andrew Morton, Arjan van de Ven, Thomas Gleixner
Matt Mackall wrote:
> On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
>> On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
>>> On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
>>>>> All things are not equal; they all have different properties. I like
>>> On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
>>>> Exactly. So we have to explore those properties and evaluate performance
>>>> (in all meanings of the word). That's only logical.
>>> Any chance you'd be willing to put down a few thoughts on what sorts
>>> of standards you'd like to set for both correctness (i.e. the bare
>>> minimum a scheduler implementation must do to be considered valid
>>> beyond not oopsing) and performance metrics (i.e. things that produce
>>> numbers for each scheduler you can compare to say "this scheduler is
>>> better than this other scheduler at this.").
>> Yeah I guess that's the hard part :)
>>
>> For correctness, I guess fairness is an easy one. I think that unfairness
>> is basically a bug and that it would be very unfortunate to merge something
>> unfair. But this is just within the context of a single runqueue... for
>> better or worse, we allow some unfairness in multiprocessors for performance
>> reasons of course.
>
> I'm a big fan of fairness, but I think it's a bit early to declare it
> a mandatory feature. Bounded unfairness is probably something we can
> agree on, ie "if we decide to be unfair, no process suffers more than
> a factor of x".
>
>> Latency. Given N tasks in the system, an arbitrary task should get
>> onto the CPU in a bounded amount of time (excluding events like freak
>> IRQ holdoffs and such, obviously -- ie. just considering the context
>> of the scheduler's state machine).
>
> This is a slightly stronger statement than starvation-free (which is
> obviously mandatory). I think you're looking for something like
> "worst-case scheduling latency is proportional to the number of
> runnable tasks".
add "taking into consideration nice and/or real time priorities of
runnable tasks". I.e. if a task is nice 19 it can expect to wait longer
to get onto the CPU than if it was nice 0.
> Which I think is quite a reasonable requirement.
>
> I'm pretty sure the stock scheduler falls short of both of these
> guarantees though.
>
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
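( One way to read the two criteria being discussed, bounded unfairness among
equal-nice tasks and wakeup latency proportional to the number of runnable
tasks, is as assertions over per-task measurements from a test run. A
throwaway checker along those lines, with both bound values picked
arbitrarily for the example: )

/* Check two properties over per-task measurements collected during a
 * run of N equally-niced CPU hogs: bounded unfairness (no task gets
 * less than 1/X of what the best-treated task got) and wakeup latency
 * bounded by nr_running * a per-task constant.  The bounds here are
 * arbitrary example values. */
#include <stdio.h>

#define UNFAIRNESS_FACTOR    1.5        /* "x" in the formulation above */
#define LATENCY_PER_TASK_NS  20000000   /* 20ms per runnable task, say */

struct sample {
        double cputime_s;         /* CPU time received over the run */
        double max_wakeup_lat_ns; /* worst observed wait for the CPU */
};

static int check(const struct sample *s, int n)
{
        double max_cpu = 0;
        int ok = 1;

        for (int i = 0; i < n; i++)
                if (s[i].cputime_s > max_cpu)
                        max_cpu = s[i].cputime_s;

        for (int i = 0; i < n; i++) {
                if (s[i].cputime_s * UNFAIRNESS_FACTOR < max_cpu) {
                        printf("task %d: unfair beyond factor %.1f\n",
                               i, UNFAIRNESS_FACTOR);
                        ok = 0;
                }
                if (s[i].max_wakeup_lat_ns > (double)n * LATENCY_PER_TASK_NS) {
                        printf("task %d: latency not O(nr_running)\n", i);
                        ok = 0;
                }
        }
        return ok;
}

int main(void)
{
        struct sample s[3] = { { 10.0, 30e6 }, { 9.5, 35e6 }, { 6.0, 90e6 } };

        printf("%s\n", check(s, 3) ? "PASS" : "FAIL");
        return 0;
}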
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 15:52 ` Chris Friesen
@ 2007-04-17 23:50 ` Peter Williams
2007-04-18 5:43 ` Chris Friesen
0 siblings, 1 reply; 577+ messages in thread
From: Peter Williams @ 2007-04-17 23:50 UTC (permalink / raw)
To: Chris Friesen
Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas,
linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
Chris Friesen wrote:
> Peter Williams wrote:
>> Chris Friesen wrote:
>>> Scuse me if I jump in here, but doesn't the load balancer need some
>>> way to figure out a) when to run, and b) which tasks to pull and
>>> where to push them?
>
>> Yes but both of these are independent of the scheduler discipline in
>> force.
>
> It is not clear to me that this is always the case, especially once you
> mix in things like resource groups.
>
>> If
>> the load balancer manages to keep the weighted (according to static
>> priority) load and distribution of priorities within the loads on the
>> CPUs roughly equal and the scheduler does a good job of ensuring
>> fairness, interactive responsiveness etc. for the tasks within a CPU
>> then the result will be good system performance within the constraints
>> set by the sys admins use of real time priorities and nice.
>
> Suppose I have a really high priority task running. Another very high
> priority task wakes up and would normally preempt the first one.
> However, there happens to be another cpu available. It seems like it
> would be a win if we moved one of those tasks to the available cpu
> immediately so they can both run simultaneously. This would seem to
> require some communication between the scheduler and the load balancer.
Not really. The load balancer can do this on its own AND the decision
should be based on the STATIC priority of the task being woken.
>
> Certainly the above design could introduce a lot of context switching.
> But if my goal is a scheduler that minimizes latency (even at the cost
> of throughput) then that's an acceptable price to pay.
It would actually probably reduce context switching as putting the woken
task on the best CPU at wake up means you don't have to move it later
on. The wake up code already does a little bit in this direction when
it chooses which CPU to put a newly woken task on but could do more --
the only real cost would be the cost of looking at more candidate CPUs
than it currently does.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
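( A sketch of the wake-up placement idea above: prefer an idle CPU, otherwise
the CPU whose running task has the weakest static priority, and only if that
is weaker than the woken task. Toy code with invented names, not the actual
try_to_wake_up() logic: )

/* Toy wake-up placement: prefer an idle CPU; failing that, prefer the
 * CPU running the task with the weakest static priority, and only if
 * that priority is weaker than the woken task's (lower value means
 * stronger priority).  -1 means "leave the task where it was". */
#include <stdio.h>

struct cpu {
        int idle;                /* no runnable task */
        int curr_static_prio;    /* static prio of the running task */
};

static int place_woken_task(const struct cpu *cpus, int ncpus, int woken_prio)
{
        int best = -1, weakest_prio = woken_prio;

        for (int i = 0; i < ncpus; i++) {
                if (cpus[i].idle)
                        return i;                  /* an idle CPU always wins */
                if (cpus[i].curr_static_prio > weakest_prio) {
                        weakest_prio = cpus[i].curr_static_prio;
                        best = i;                  /* weakest current task */
                }
        }
        return best;
}

int main(void)
{
        struct cpu cpus[2] = { { 0, 100 }, { 0, 139 } }; /* both busy */

        /* the woken task outranks CPU 1's current task, so it goes there */
        printf("woken prio 120 -> CPU %d\n", place_woken_task(cpus, 2, 120));
        return 0;
}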
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 23:07 ` William Lee Irwin III
@ 2007-04-17 23:52 ` Michael K. Edwards
2007-04-18 0:36 ` Bill Huey
0 siblings, 1 reply; 577+ messages in thread
From: Michael K. Edwards @ 2007-04-17 23:52 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Peter Williams, Ingo Molnar, Nick Piggin, Mike Galbraith,
Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds,
Andrew Morton, Arjan van de Ven, Thomas Gleixner
On 4/17/07, William Lee Irwin III <wli@holomorphy.com> wrote:
> The ongoing scheduler work is on a much more basic level than these
> affairs I'm guessing you googled. When the basics work as intended it
> will be possible to move on to more advanced issues.
OK, let me try this in smaller words for people who can't tell bitter
experience from Google hits. CPU clock scaling for power efficiency
is already the only thing that matters about the Linux scheduler in my
world, because battery-powered device vendors in their infinite wisdom
are abandoning real RTOSes in favor of Linux now that WiFi is the "in"
thing (again). And on the timescale that anyone will actually be
_using_ this shiny new scheduler of Ingo's, it'll be nearly the only
thing that matters about the Linux scheduler in anyone's world,
because the amount of work the CPU can get done in a given minute will
depend mostly on how intelligently it can spend its heat dissipation
budget.
Clock scaling schemes that aren't integral to the scheduler design
make a bad situation (scheduling embedded loads with shotgun
heuristics tuned for desktop CPUs) worse, because the opaque
heuristics are now being applied to distorted data. Add a "smoothing"
scheme for the distorted data, and you may find that you have
introduced an actual control-path instability. A small fluctuation in
the data (say, two bursts of interrupt traffic at just the right
interval) can result in a long-lasting oscillation in some task's
"dynamic priority" -- and, on a fully loaded CPU, in the time that
task actually gets. If anything else depends on how much work this
task gets done each time around, the oscillation can easily propagate
throughout the system. Thrash city.
(If you haven't seen this happen on real production systems under what
shouldn't be a pathological load, you haven't been around long. The
classic mechanisms that triggered oscillations in, say, early SMP
Solaris boxes haven't bitten recently, perhaps because most modern
CPUs don't lose their marbles so comprehensively on context switch.
But I got to live this nightmare again recently on ARM Linux, due to
some impressively broken application-level threading/locking "design",
whose assumptions about scheduler behavior got broken when I switched
to an NPTL toolchain.)
I don't have the training to design a scheduler that isn't vulnerable
to control-feedback oscillations. Neither do you, if you haven't
taken (and excelled at) a control theory course, which nowadays seems
to be taught by applied math and ECE departments and too often skipped
by CS types. But I can recognize an impending train wreck when I see
it.
Cheers,
- Michael
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 13:48 ` Peter Williams
@ 2007-04-18 0:27 ` Peter Williams
2007-04-18 2:03 ` William Lee Irwin III
0 siblings, 1 reply; 577+ messages in thread
From: Peter Williams @ 2007-04-18 0:27 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Linux Kernel Mailing List
Peter Williams wrote:
> William Lee Irwin III wrote:
>> I was tempted to restart from scratch given Ingo's comments, but I
>> reconsidered and I'll be working with your code (and the German
>> students' as well). If everything has to change, so be it, but it'll
>> still be a derived work. It would be ignoring precedent and failure to
>> properly attribute if I did otherwise.
>
> I can give you a patch (or set of patches) against the latest git
> vanilla kernel version if that would help. There have been changes to
> the vanilla scheduler code since 2.6.20 so the latest patch on
> sourceforge won't apply cleanly. I've found that implementing this as a
> series of patches rather than one big patch makes it easier for me to
> cope with changes to the underlying code.
I've just placed a single patch for plugsched against 2.6.21-rc7 updated
to Linus's git tree as of an hour or two ago on sourceforge:
<http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch>
This should at least enable you to get it to apply cleanly to the latest
kernel sources. Let me know if you'd also like this as a quilt/mq
friendly patch series?
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 23:52 ` Michael K. Edwards
@ 2007-04-18 0:36 ` Bill Huey
0 siblings, 0 replies; 577+ messages in thread
From: Bill Huey @ 2007-04-18 0:36 UTC (permalink / raw)
To: Michael K. Edwards
Cc: William Lee Irwin III, Peter Williams, Ingo Molnar, Nick Piggin,
Mike Galbraith, Con Kolivas, ck list, linux-kernel,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner,
Bill Huey (hui)
On Tue, Apr 17, 2007 at 04:52:08PM -0700, Michael K. Edwards wrote:
> On 4/17/07, William Lee Irwin III <wli@holomorphy.com> wrote:
> >The ongoing scheduler work is on a much more basic level than these
> >affairs I'm guessing you googled. When the basics work as intended it
> >will be possible to move on to more advanced issues.
...
Will probably shouldn't have dismissed your points, but he probably means
that you can't even get at this stuff until the fundamentals are in place.
> Clock scaling schemes that aren't integral to the scheduler design
> make a bad situation (scheduling embedded loads with shotgun
> heuristics tuned for desktop CPUs) worse, because the opaque
> heuristics are now being applied to distorted data. Add a "smoothing"
> scheme for the distorted data, and you may find that you have
> introduced an actual control-path instability. A small fluctuation in
> the data (say, two bursts of interrupt traffic at just the right
> interval) can result in a long-lasting oscillation in some task's
> "dynamic priority" -- and, on a fully loaded CPU, in the time that
> task actually gets. If anything else depends on how much work this
> task gets done each time around, the oscillation can easily propagate
> throughout the system. Thrash city.
Hyperthreading issues are quite similar to clock scaling issues.
Con's infrastructure changes to move things in that direction were
rejected, as well as other infrastructure changes, further infuriating
Con and leading him to drop development on RSDL and derivatives.
bill
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-18 0:27 ` Peter Williams
@ 2007-04-18 2:03 ` William Lee Irwin III
2007-04-18 2:31 ` Peter Williams
0 siblings, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-18 2:03 UTC (permalink / raw)
To: Peter Williams; +Cc: Linux Kernel Mailing List
> Peter Williams wrote:
> >William Lee Irwin III wrote:
> >>I was tempted to restart from scratch given Ingo's comments, but I
> >>reconsidered and I'll be working with your code (and the German
> >>students' as well). If everything has to change, so be it, but it'll
> >>still be a derived work. It would be ignoring precedent and failure to
> >>properly attribute if I did otherwise.
> >
> >I can give you a patch (or set of patches) against the latest git
> >vanilla kernel version if that would help. There have been changes to
> >the vanilla scheduler code since 2.6.20 so the latest patch on
> >sourceforge won't apply cleanly. I've found that implementing this as a
> >series of patches rather than one big patch makes it easier for me to
> >cope with changes to the underlying code.
>
On Wed, Apr 18, 2007 at 10:27:27AM +1000, Peter Williams wrote:
> I've just placed a single patch for plugsched against 2.6.21-rc7 updated
> to Linus's git tree as of an hour or two ago on sourceforge:
> <http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch>
> This should at least enable you to get it to apply cleanly to the latest
> kernel sources. Let me know if you'd also like this as a quilt/mq
> friendly patch series?
A quilt-friendly series would be most excellent if you could arrange it.
Thanks.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-18 2:03 ` William Lee Irwin III
@ 2007-04-18 2:31 ` Peter Williams
0 siblings, 0 replies; 577+ messages in thread
From: Peter Williams @ 2007-04-18 2:31 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Linux Kernel Mailing List
William Lee Irwin III wrote:
>> Peter Williams wrote:
>>> William Lee Irwin III wrote:
>>>> I was tempted to restart from scratch given Ingo's comments, but I
>>>> reconsidered and I'll be working with your code (and the German
>>>> students' as well). If everything has to change, so be it, but it'll
>>>> still be a derived work. It would be ignoring precedent and failure to
>>>> properly attribute if I did otherwise.
>>> I can give you a patch (or set of patches) against the latest git
>>> vanilla kernel version if that would help. There have been changes to
>>> the vanilla scheduler code since 2.6.20 so the latest patch on
>>> sourceforge won't apply cleanly. I've found that implementing this as a
>>> series of patches rather than one big patch makes it easier for me to
>>> cope with changes to the underlying code.
> On Wed, Apr 18, 2007 at 10:27:27AM +1000, Peter Williams wrote:
>> I've just placed a single patch for plugsched against 2.6.21-rc7 updated
>> to Linus's git tree as of an hour or two ago on sourceforge:
>> <http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch>
>> This should at least enable you to get it to apply cleanly to the latest
>> kernel sources. Let me know if you'd also like this as a quilt/mq
>> friendly patch series?
>
> A quilt-friendly series would be most excellent if you could arrange it.
Done:
<http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch-series.tar.gz>
Just untar this in the base directory of your Linux kernel source and
Bob's your uncle.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 23:00 ` Michael K. Edwards
2007-04-17 23:07 ` William Lee Irwin III
@ 2007-04-18 2:39 ` Peter Williams
1 sibling, 0 replies; 577+ messages in thread
From: Peter Williams @ 2007-04-18 2:39 UTC (permalink / raw)
To: Michael K. Edwards
Cc: Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list,
Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
Michael K. Edwards wrote:
> On 4/17/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
>> The other way in which the code deviates from the original is that (for
>> a few years now) I no longer calculate CPU bandwidth usage directly.
>> I've found that the overhead is less if I keep a running average of the
>> size of a task's CPU bursts and the length of its scheduling cycle (i.e.
>> from going on CPU one time to going on CPU the next time) and use the
>> ratio of these values as a measure of bandwidth usage.
>>
>> Anyway it works and gives very predictable allocations of CPU bandwidth
>> based on nice.
>
> Works, that is, right up until you add nonlinear interactions with CPU
> speed scaling. From my perspective as an embedded platform
> integrator, clock/voltage scaling is the elephant in the scheduler's
> living room. Patch in DPM (now OpPoint?) to scale the clock based on
> what task is being scheduled, and suddenly the dynamic priority
> calculations go wild. Nip this in the bud by putting an RT priority
> on the relevant threads (which you have to do anyway if you need
> remotely audio-grade latency), and the lock affinity heuristics break,
> so you have to hand-tune all the thread priorities. Blecch.
>
> Not to mention the likelihood that the task whose clock speed you're
> trying to crank up (say, a WiFi soft MAC) needs to be _lower_ priority
> than the application. (You want to crank the CPU for this task
> because it runs with the RF hot, which may cost you as much power as
> the rest of the platform.) You'd better hope you can remove it from
> the dynamic priority heuristics with SCHED_BATCH. Otherwise
> everything _else_ has to be RT priority (or it'll be starved by the
> soft MAC) and you've basically tossed SCHED_NORMAL in the bin. Double
> blecch!
>
> Is it too much to ask for someone with actual engineering training
> (not me, unfortunately) to sit down and build a negative-feedback
> control system that handles soft-real-time _and_ dynamic-priority
> _and_ batch loads, CPU _and_ I/O scheduling, preemption _and_ clock
> scaling? And actually separates the accounting and control mechanisms
> from the heuristics, so the latter can be tuned (within a well
> documented stable range) to reflect the expected system usage
> patterns?
>
> It's not like there isn't a vast literature in this area over the past
> decade, including some dealing specifically with clock scaling
> consistent with low-latency applications. It's a pity that people
> doing academic work in this area rarely wade into LKML, even when
> they're hacking on a Linux fork. But then, there's not much economic
> incentive for them to do so, and they can usually get their fill of
> citation politics and dominance games without leaving their home
> department. :-P
>
> Seriously, though. If you're really going to put the mainline
> scheduler through this kind of churn, please please pretty please knit
> in per-task clock scaling (possibly even rejigged during the slice;
> see e. g. Yuan and Nahrstedt's GRACE-OS papers) and some sort of
> linger mechanism to keep from taking context switch hits when you're
> confident that an I/O will complete quickly.
I think that this doesn't affect the basic design principles of spa_ebs
but just means that the statistics that it uses need to be rethought.
E.g. instead of measuring average CPU usage per burst in terms of wall
clock time spent on the CPU, measure it in terms of CPU capacity (for
want of a better word) used per burst.
I don't have suitable hardware for investigating this line of attack
further, unfortunately, and have no idea what would be the best way to
calculate this new statistic.
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 21:39 ` Matt Mackall
2007-04-17 23:23 ` Peter Williams
@ 2007-04-18 3:15 ` Nick Piggin
2007-04-18 3:45 ` Mike Galbraith
2007-04-18 4:38 ` Matt Mackall
1 sibling, 2 replies; 577+ messages in thread
From: Nick Piggin @ 2007-04-18 3:15 UTC (permalink / raw)
To: Matt Mackall
Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote:
> On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> > On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
> > > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
> > > >> All things are not equal; they all have different properties. I like
> > >
> > > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
> > > > Exactly. So we have to explore those properties and evaluate performance
> > > > (in all meanings of the word). That's only logical.
> > >
> > > Any chance you'd be willing to put down a few thoughts on what sorts
> > > of standards you'd like to set for both correctness (i.e. the bare
> > > minimum a scheduler implementation must do to be considered valid
> > > beyond not oopsing) and performance metrics (i.e. things that produce
> > > numbers for each scheduler you can compare to say "this scheduler is
> > > better than this other scheduler at this.").
> >
> > Yeah I guess that's the hard part :)
> >
> > For correctness, I guess fairness is an easy one. I think that unfairness
> > is basically a bug and that it would be very unfortunate to merge something
> > unfair. But this is just within the context of a single runqueue... for
> > better or worse, we allow some unfairness in multiprocessors for performance
> > reasons of course.
>
> I'm a big fan of fairness, but I think it's a bit early to declare it
> a mandatory feature. Bounded unfairness is probably something we can
> agree on, ie "if we decide to be unfair, no process suffers more than
> a factor of x".
I don't know why this would be a useful feature (of course I'm talking
about processes at the same nice level). One of the big problems with
the current scheduler is that it is unfair in some corner cases. It
works OK for most people, but when it breaks down it really hurts. At
least if you start with a fair scheduler, you can alter priorities
until it satisfies your need... with an unfair one your guess is as
good as mine.
So on what basis would you allow unfairness? On the basis that it doesn't
seem to harm anyone? It doesn't seem to harm testers?
I think we should aim for something better.
> > Latency. Given N tasks in the system, an arbitrary task should get
> > onto the CPU in a bounded amount of time (excluding events like freak
> > IRQ holdoffs and such, obviously -- ie. just considering the context
> > of the scheduler's state machine).
>
> This is a slightly stronger statement than starvation-free (which is
> obviously mandatory). I think you're looking for something like
> "worst-case scheduling latency is proportional to the number of
> runnable tasks". Which I think is quite a reasonable requirement.
Yes, bounded and proportional to.
> I'm pretty sure the stock scheduler falls short of both of these
> guarantees though.
And I think that's where its main problems are. Its interactivity
obviously can't be too bad for most people. Its performance seems to
be pretty good.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-18 3:15 ` Nick Piggin
@ 2007-04-18 3:45 ` Mike Galbraith
2007-04-18 3:56 ` Nick Piggin
2007-04-18 4:38 ` Matt Mackall
1 sibling, 1 reply; 577+ messages in thread
From: Mike Galbraith @ 2007-04-18 3:45 UTC (permalink / raw)
To: Nick Piggin
Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Con Kolivas,
Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds,
Andrew Morton, Arjan van de Ven, Thomas Gleixner
On Wed, 2007-04-18 at 05:15 +0200, Nick Piggin wrote:
> On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote:
> >
> > I'm a big fan of fairness, but I think it's a bit early to declare it
> > a mandatory feature. Bounded unfairness is probably something we can
> > agree on, ie "if we decide to be unfair, no process suffers more than
> > a factor of x".
>
> I don't know why this would be a useful feature (of course I'm talking
> about processes at the same nice level). One of the big problems with
> the current scheduler is that it is unfair in some corner cases. It
> works OK for most people, but when it breaks down it really hurts. At
> least if you start with a fair scheduler, you can alter priorities
> until it satisfies your need... with an unfair one your guess is as
> good as mine.
>
> So on what basis would you allow unfairness? On the basis that it doesn't
> seem to harm anyone? It doesn't seem to harm testers?
Well, there's short term fair and long term fair. Seems to me a burst
load having to always merge with a steady stream load using a short term
fairness yardstick absolutely must 'starve' relative to the steady load,
so to be long term fair, you have to add some short term unfairness.
The mainline scheduler is more long term fair (discounting the rather
obnoxious corner cases).
-Mike
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-18 3:45 ` Mike Galbraith
@ 2007-04-18 3:56 ` Nick Piggin
2007-04-18 4:29 ` Mike Galbraith
0 siblings, 1 reply; 577+ messages in thread
From: Nick Piggin @ 2007-04-18 3:56 UTC (permalink / raw)
To: Mike Galbraith
Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Con Kolivas,
Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds,
Andrew Morton, Arjan van de Ven, Thomas Gleixner
On Wed, Apr 18, 2007 at 05:45:20AM +0200, Mike Galbraith wrote:
> On Wed, 2007-04-18 at 05:15 +0200, Nick Piggin wrote:
> > On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote:
> > >
> > > I'm a big fan of fairness, but I think it's a bit early to declare it
> > > a mandatory feature. Bounded unfairness is probably something we can
> > > agree on, ie "if we decide to be unfair, no process suffers more than
> > > a factor of x".
> >
> > I don't know why this would be a useful feature (of course I'm talking
> > about processes at the same nice level). One of the big problems with
> > the current scheduler is that it is unfair in some corner cases. It
> > works OK for most people, but when it breaks down it really hurts. At
> > least if you start with a fair scheduler, you can alter priorities
> > until it satisfies your need... with an unfair one your guess is as
> > good as mine.
> >
> > So on what basis would you allow unfairness? On the basis that it doesn't
> > seem to harm anyone? It doesn't seem to harm testers?
>
> Well, there's short term fair and long term fair. Seems to me a burst
> load having to always merge with a steady stream load using a short term
> fairness yardstick absolutely must 'starve' relative to the steady load,
> so to be long term fair, you have to add some short term unfairness.
> The mainline scheduler is more long term fair (discounting the rather
> obnoxious corner cases).
Oh yes definitely I mean long term fair. I guess it is impossible to be
completely fair so long as we have to timeshare the CPU :)
So a constant delta is fine and unavoidable. But I don't think I agree
with a constant factor: that means you can pick a time where process 1
is allowed an arbitrary T more CPU time than process 2.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 22:57 ` Matt Mackall
@ 2007-04-18 4:29 ` William Lee Irwin III
2007-04-18 4:42 ` Davide Libenzi
2007-04-18 7:29 ` James Bruce
1 sibling, 1 reply; 577+ messages in thread
From: William Lee Irwin III @ 2007-04-18 4:29 UTC (permalink / raw)
To: Matt Mackall
Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams,
Mike Galbraith, Con Kolivas, ck list, Bill Huey,
Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote:
>>> Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589
>>> That value has the property that a nice=10 task gets 1/10th the cpu of a
>>> nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that
>>> would be fairly easy to explain to admins and users so that they can
>>> know what to expect from nicing tasks.
On Tue, Apr 17, 2007 at 03:59:02PM -0700, William Lee Irwin III wrote:
>> I'm not likely to write the testcase until this upcoming weekend, though.
On Tue, Apr 17, 2007 at 05:57:23PM -0500, Matt Mackall wrote:
> So that means there's a 10000:1 ratio between nice 20 and nice -19. In
> that sort of dynamic range, you're likely to have non-trivial
> numerical accuracy issues in integer/fixed-point math.
> (Especially if your clock is jiffies-scale, which a significant number
> of machines will continue to be.)
> I really think if we want to have vastly different ratios, we probably
> want to be looking at BATCH and RT scheduling classes instead.
100**(1/39.0) ~= 1.12534 may do if so, but it seems a little weak, and
even 1000**(1/39.0) ~= 1.19378 still seems weak.
I suspect that in order to get low nice numbers strong enough without
making high nice numbers too strong something sub-exponential may need
to be used. Maybe just picking percentages outright as opposed to some
particular function.
We may also be better off defining it in terms of a share weighting as
opposed to two tasks in competition. In such a manner the extension to
N tasks is more automatic. f(n) would be a univariate function of nice
numbers and two tasks in competition with nice numbers n_1 and n_2
would get shares f(n_1)/(f(n_1)+f(n_2)) and f(n_2)/(f(n_1)+f(n_2)). In
the exponential case f(n) = K*e**(r*n) this ends up as
1/(1+e**(r*(n_2-n_1))) which is indeed a function of n_1-n_2 but for
other choices it's not so. f(n) = n+K for K >= 20 results in a share
weighting of (n_1+K,n_2+K)/(n_1+n_2+2*K), which is not entirely clear
in its impact. My guess is that f(n)=1/(n+1) when n >= 0 and f(n)=1-n
when n <= 0 is highly plausible. An exponent or an additive constant
may be worthwhile to throw in. In this case, f(-19) = 20, f(20) = 1/21,
and the ratio of shares is 420, which is still arithmetically feasible.
-10 vs. 0 and 0 vs. 10 are both 10:1.
-- wli
^ permalink raw reply [flat|nested] 577+ messages in thread
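( A small sketch of the share-weighting formulation above, using the piecewise
f() suggested there and generalized to N competing tasks. Throwaway
illustration code, not a proposal for the kernel: )

/* Share weighting: task i gets f(n_i) / sum_j f(n_j) of the CPU, with
 * f(n) = 1/(n+1) for n >= 0 and f(n) = 1-n for n <= 0.  As noted above,
 * f(-19) = 20 and f(20) = 1/21, so the extreme ratio is 420. */
#include <stdio.h>

static double f(int nice)
{
        return nice >= 0 ? 1.0 / (nice + 1) : 1.0 - nice;
}

int main(void)
{
        int nice[] = { -10, 0, 10 };
        int n = sizeof(nice) / sizeof(nice[0]);
        double total = 0.0;

        for (int i = 0; i < n; i++)
                total += f(nice[i]);
        for (int i = 0; i < n; i++)
                printf("nice %3d -> share %.3f\n",
                       nice[i], f(nice[i]) / total);

        printf("extreme ratio f(-19)/f(20) = %.0f\n", f(-19) / f(20));
        return 0;
}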
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-18 3:56 ` Nick Piggin
@ 2007-04-18 4:29 ` Mike Galbraith
0 siblings, 0 replies; 577+ messages in thread
From: Mike Galbraith @ 2007-04-18 4:29 UTC (permalink / raw)
To: Nick Piggin
Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Con Kolivas,
Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds,
Andrew Morton, Arjan van de Ven, Thomas Gleixner
On Wed, 2007-04-18 at 05:56 +0200, Nick Piggin wrote:
> On Wed, Apr 18, 2007 at 05:45:20AM +0200, Mike Galbraith wrote:
> > On Wed, 2007-04-18 at 05:15 +0200, Nick Piggin wrote:
> > >
> > >
> > > So on what basis would you allow unfairness? On the basis that it doesn't
> > > seem to harm anyone? It doesn't seem to harm testers?
> >
> > Well, there's short term fair and long term fair. Seems to me a burst
> > load having to always merge with a steady stream load using a short term
> > fairness yardstick absolutely must 'starve' relative to the steady load,
> > so to be long term fair, you have to add some short term unfairness.
> > The mainline scheduler is more long term fair (discounting the rather
> > obnoxious corner cases).
>
> Oh yes definitely I mean long term fair. I guess it is impossible to be
> completely fair so long as we have to timeshare the CPU :)
>
> So a constant delta is fine and unavoidable. But I don't think I agree
> with a constant factor: that means you can pick a time where process 1
> is allowed an arbitrary T more CPU time than process 2.
Definitely. Using constants with no consideration of what else is
running is what causes the fairness mechanism in mainline to break down
under load.
(aside: What I was experimenting with before this new scheduler came
along was to turn the sleep_avg thing into an off-cpu period thing. Once
a time slice begins execution [runqueue wait doesn't count], that task
has the right to use its slice in one go, and _anything_ that knocked
it off the cpu added to its credit. Knocking someone else off detracts
from credit, and to get to the point where you can knock others off
costs you stored credit proportional to the dynamic priority you attain
by using it. All tasks that have credit stay active, no favorites.)
-Mike
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-18 3:15 ` Nick Piggin
2007-04-18 3:45 ` Mike Galbraith
@ 2007-04-18 4:38 ` Matt Mackall
2007-04-18 5:00 ` Nick Piggin
1 sibling, 1 reply; 577+ messages in thread
From: Matt Mackall @ 2007-04-18 4:38 UTC (permalink / raw)
To: Nick Piggin
Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
On Wed, Apr 18, 2007 at 05:15:11AM +0200, Nick Piggin wrote:
> On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote:
> > On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> > > On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
> > > > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
> > > > >> All things are not equal; they all have different properties. I like
> > > >
> > > > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
> > > > > Exactly. So we have to explore those properties and evaluate performance
> > > > > (in all meanings of the word). That's only logical.
> > > >
> > > > Any chance you'd be willing to put down a few thoughts on what sorts
> > > > of standards you'd like to set for both correctness (i.e. the bare
> > > > minimum a scheduler implementation must do to be considered valid
> > > > beyond not oopsing) and performance metrics (i.e. things that produce
> > > > numbers for each scheduler you can compare to say "this scheduler is
> > > > better than this other scheduler at this.").
> > >
> > > Yeah I guess that's the hard part :)
> > >
> > > For correctness, I guess fairness is an easy one. I think that unfairness
> > > is basically a bug and that it would be very unfortunate to merge something
> > > unfair. But this is just within the context of a single runqueue... for
> > > better or worse, we allow some unfairness in multiprocessors for performance
> > > reasons of course.
> >
> > I'm a big fan of fairness, but I think it's a bit early to declare it
> > a mandatory feature. Bounded unfairness is probably something we can
> > agree on, ie "if we decide to be unfair, no process suffers more than
> > a factor of x".
>
> I don't know why this would be a useful feature (of course I'm talking
> about processes at the same nice level). One of the big problems with
> the current scheduler is that it is unfair in some corner cases. It
> works OK for most people, but when it breaks down it really hurts. At
> least if you start with a fair scheduler, you can alter priorities
> until it satisfies your need... with an unfair one your guess is as
> good as mine.
>
> So on what basis would you allow unfairness? On the basis that it doesn't
> seem to harm anyone? It doesn't seem to harm testers?
On the basis that there's only anecdotal evidence thus far that
fairness is the right approach.
It's not yet clear that a fair scheduler can do the right thing with X,
with various kernel threads, etc. without fiddling with nice levels.
Which makes it no longer "completely fair".
It's also not yet clear that a scheduler can't be taught to do the
right thing with X without fiddling with nice levels.
So I'm just not yet willing to completely rule out systems that aren't
"completely fair".
But I think we should rule out schedulers that don't have rigid bounds on
that unfairness. That's where the really ugly behavior lies.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-18 4:29 ` William Lee Irwin III
@ 2007-04-18 4:42 ` Davide Libenzi
0 siblings, 0 replies; 577+ messages in thread
From: Davide Libenzi @ 2007-04-18 4:42 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Matt Mackall, Ingo Molnar, Nick Piggin, Peter Williams,
Mike Galbraith, Con Kolivas, ck list, Bill Huey,
Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
Arjan van de Ven, Thomas Gleixner
On Tue, 17 Apr 2007, William Lee Irwin III wrote:
> 100**(1/39.0) ~= 1.12534 may do if so, but it seems a little weak, and
> even 1000**(1/39.0) ~= 1.19378 still seems weak.
>
> I suspect that in order to get low nice numbers strong enough without
> making high nice numbers too strong something sub-exponential may need
> to be used. Maybe just picking percentages outright as opposed to some
> particular function.
>
> We may also be better off defining it in terms of a share weighting as
> opposed to two tasks in competition. In such a manner the extension to
> N tasks is more automatic. f(n) would be a univariate function of nice
> numbers and two tasks in competition with nice numbers n_1 and n_2
> would get shares f(n_1)/(f(n_1)+f(n_2)) and f(n_2)/(f(n_1)+f(n_2)). In
> the exponential case f(n) = K*e**(r*n) this ends up as
> 1/(1+e**(r*(n_2-n_1))) which is indeed a function of n_1-n_2 but for
> other choices it's not so. f(n) = n+K for K >= 20 results in a share
> weighting of (n_1+K,n_2+K)/(n_1+n_2+2*K), which is not entirely clear
> in its impact. My guess is that f(n)=1/(n+1) when n >= 0 and f(n)=1-n
> when n <= 0 is highly plausible. An exponent or an additive constant
> may be worthwhile to throw in. In this case, f(-19) = 20, f(20) = 1/21,
> and the ratio of shares is 420, which is still arithmetically feasible.
> -10 vs. 0 and 0 vs. 10 are both 10:1.
This makes more sense, and the ratio at the extremes is something
reasonable.
- Davide
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 13:16 ` Peter Williams
@ 2007-04-18 4:46 ` Nick Piggin
0 siblings, 0 replies; 577+ messages in thread
From: Nick Piggin @ 2007-04-18 4:46 UTC (permalink / raw)
To: Peter Williams
Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
Thomas Gleixner
On Tue, Apr 17, 2007 at 11:16:54PM +1000, Peter Williams wrote:
> Nick Piggin wrote:
> >I don't like the timeslice based nice in mainline. It's too nasty
> >with latencies. nicksched is far better in that regard IMO.
> >
> >But I don't know how you can assert a particular way is the best way
> >to do something.
>
> I should have added "I may be wrong but I think that ...".
>
> My opinion is based on a lot of experience with different types of
> scheduler design and the observation from gathering scheduling
> statistics while playing with these schedulers that the size of the time
> slices we're talking about is much larger than the CPU chunks most tasks
> use in any one go so time slice size has no real effect on most tasks
> and the faster CPUs become the more this becomes true.
For desktop loads, maybe. But for things that are compute bound, the
cost of context switching I believe still gets worse as CPUs continue
to be able to execute more instructions per cycle, get clocked faster,
and get larger caches.
> >>In that case I'd go O(1) provided that the k factor for the O(1) wasn't
> >>greater than O(logN)'s k factor multiplied by logMaxN.
> >
> >Yes, or even significantly greater around typical large sizes of N.
>
Yes. In fact it's probably better to use the maximum number of threads
allowed on the system for N. We know that value, don't we?
Well we might be able to work it out by looking at the tunables or
amount of kernel memory available, but I guess it is hard to just
pick a number.
I'll try running a few more benchmarks.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-18 4:38 ` Matt Mackall
@ 2007-04-18 5:00 ` Nick Piggin
2007-04-18 5:55 ` Matt Mackall
0 siblings, 1 reply; 577+ messages in thread
From: Nick Piggin @ 2007-04-18 5:00 UTC (permalink / raw)
To: Matt Mackall
Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
On Tue, Apr 17, 2007 at 11:38:31PM -0500, Matt Mackall wrote:
> On Wed, Apr 18, 2007 at 05:15:11AM +0200, Nick Piggin wrote:
> >
> > I don't know why this would be a useful feature (of course I'm talking
> > about processes at the same nice level). One of the big problems with
> > the current scheduler is that it is unfair in some corner cases. It
> > works OK for most people, but when it breaks down it really hurts. At
> > least if you start with a fair scheduler, you can alter priorities
> > until it satisfies your need... with an unfair one your guess is as
> > good as mine.
> >
> > So on what basis would you allow unfairness? On the basis that it doesn't
> > seem to harm anyone? It doesn't seem to harm testers?
>
> On the basis that there's only anecdotal evidence thus far that
> fairness is the right approach.
>
> It's not yet clear that a fair scheduler can do the right thing with X,
> with various kernel threads, etc. without fiddling with nice levels.
> Which makes it no longer "completely fair".
Of course I mean SCHED_OTHER tasks at the same nice level. Otherwise
I would be arguing to make nice basically a noop.
> It's also not yet clear that a scheduler can't be taught to do the
> right thing with X without fiddling with nice levels.
Being fair doesn't prevent that. Implicit unfairness is wrong though,
because it will bite people.
> What's wrong with allowing X to get more than its fair share of CPU
time by "fiddling with nice levels"? That's what they're there for.
> So I'm just not yet willing to completely rule out systems that aren't
> "completely fair".
>
> But I think we should rule out schedulers that don't have rigid bounds on
> that unfairness. That's where the really ugly behavior lies.
Been a while since I really looked at the mainline scheduler, but I
don't think it can permanently starve something, so I don't know what
your bounded unfairness would help with.
^ permalink raw reply [flat|nested] 577+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-17 23:50 ` Peter Williams
@ 2007-04-18 5:43 ` Chris Friesen
2007-04-18 13:00 ` Peter Williams
0 siblings, 1 reply; 577+ messages in thread
From: Chris Friesen @ 2007-04-18 5:43 UTC (permalink / raw)
To: Peter Williams
Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas,
linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
Peter Williams wrote:
> Chris Friesen wrote:
>> Suppose I have a really high priority task running. Another very high
>> priority task wakes up and would normally preempt the first one.
>> However, there happens to be another cpu available. It seems like it
>> would be a win if we moved one of those tasks to the available cpu
>> immediately so they can both run simultaneously. This would seem to
>> require some communication between the scheduler and the load balancer.
>
>
> Not really. The load balancer can do this on its own, AND the decision
> should be based on the STATIC priority of the task being woken.
I guess I don't follow. How would the load balancer know that it needs
to run? Running on every task wake-up seems expensive. Also, static
priority isn't everything. What about the gang-scheduler concept where
certain tasks must be scheduled simultaneously on different cpus? What
about a resource-group scenario where you have per-cpu resource limits,
so that for good latency/fairness you need to force a high priority task
to migrate to another cpu once it has consumed the cpu allocation of
that group on the current cpu?
I can see having generic load balancer core code, but it seems to me
that the scheduler proper needs some way of triggering the load
balancer to run, and some kind of goodness functions to indicate a)
which tasks to move, and b) where to move them.
Chris
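One way to picture the split Chris describes is a per-policy hook table
consulted from the wakeup path. The structure and callback names below are
purely illustrative sketches of that idea; none of them exist in the
posted patch.

/*
 * Illustrative only: the scheduler proper decides *when* a balance
 * attempt is worth making (for example when a woken high-priority
 * task would preempt but another CPU is idle), and generic balancer
 * code asks these "goodness" callbacks what to move and where.
 */
struct rq;
struct task_struct;

struct balance_hooks {
	/* trigger: does this wakeup justify an immediate balance pass? */
	int (*balance_needed)(struct rq *this_rq, struct task_struct *woken);

	/* a) which task to move: best candidate on the busiest runqueue */
	struct task_struct *(*pick_task)(struct rq *busiest);

	/* b) where to move it: target CPU for the chosen task */
	int (*pick_cpu)(struct task_struct *p);
};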
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-18 5:00 ` Nick Piggin
@ 2007-04-18 5:55 ` Matt Mackall
2007-04-18 6:37 ` Nick Piggin
` (2 more replies)
0 siblings, 3 replies; 577+ messages in thread
From: Matt Mackall @ 2007-04-18 5:55 UTC (permalink / raw)
To: Nick Piggin
Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
On Wed, Apr 18, 2007 at 07:00:24AM +0200, Nick Piggin wrote:
> > It's also not yet clear that a scheduler can't be taught to do the
> > right thing with X without fiddling with nice levels.
>
> Being fair doesn't prevent that. Implicit unfairness is wrong though,
> because it will bite people.
>
> > What's wrong with allowing X to get more than its fair share of CPU
> time by "fiddling with nice levels"? That's what they're there for.
Why is X special? Because it does work on behalf of other processes?
Lots of things do this. Perhaps a scheduler should focus entirely on
the implicit and directed wakeup matrix and optimizing that
instead[1].
Why are processes special? Should user A be able to get more CPU time
for his job than user B by splitting it into N parallel jobs? Should
we be fair per process, per user, per thread group, per session, per
controlling terminal? Some weighted combination of the preceding?[2]
Why is the measure CPU time? I can imagine a scheduler that weighed
memory bandwidth in the equation. Or power consumption. Or execution
unit usage.
Fairness is nice. It's simple, it's obvious, it's predictable. But
it's just not clear that it's optimal. If the question is (and it
was!) "what should the basic requirements for the scheduler be?" it's
not clear that fairness is a requirement or even how to pick a metric
for fairness that's obviously and uniquely superior.
It's instead much easier to try to recognize and rule out really bad
behaviour with bounded latencies, minimum service guarantees, etc.
[1] That's basically how Google decides to prioritize webpages, which
it seems to do moderately well. And how a number of other optimization
problems are solved.
[2] It's trivial to construct two or more perfectly reasonable and
desirable definitions of fairness that are mutually incompatible.
--
Mathematics is the supreme nostalgia of our time.
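As a concrete instance of footnote [2], with made-up numbers: suppose user
A runs three CPU hogs and user B runs one. Per-process fairness and
per-user fairness are each perfectly reasonable, yet they demand different
splits, so no scheduler can honour both at once.

#include <stdio.h>

int main(void)
{
	int a_tasks = 3, b_tasks = 1;
	int total = a_tasks + b_tasks;

	/* per-process fair: every runnable task gets an equal slice */
	printf("per-process fair: user A %d%%, user B %d%%\n",
	       100 * a_tasks / total, 100 * b_tasks / total);

	/* per-user fair: each user gets an equal slice, however many tasks */
	printf("per-user fair:    user A 50%%, user B 50%%\n");
	return 0;
}

One definition gives user A 75% of the machine, the other gives him 50%;
both cannot hold simultaneously.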
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-18 5:55 ` Matt Mackall
@ 2007-04-18 6:37 ` Nick Piggin
2007-04-18 6:55 ` Matt Mackall
2007-04-18 13:08 ` William Lee Irwin III
2007-04-18 14:48 ` Linus Torvalds
2 siblings, 1 reply; 577+ messages in thread
From: Nick Piggin @ 2007-04-18 6:37 UTC (permalink / raw)
To: Matt Mackall
Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
On Wed, Apr 18, 2007 at 12:55:25AM -0500, Matt Mackall wrote:
> On Wed, Apr 18, 2007 at 07:00:24AM +0200, Nick Piggin wrote:
> > > It's also not yet clear that a scheduler can't be taught to do the
> > > right thing with X without fiddling with nice levels.
> >
> > Being fair doesn't prevent that. Implicit unfairness is wrong though,
> > because it will bite people.
> >
> > What's wrong with allowing X to get more than its fair share of CPU
> > time by "fiddling with nice levels"? That's what they're there for.
>
> Why is X special? Because it does work on behalf of other processes?
The high level reason is that giving it more than its fair share of
CPU allows a desktop to remain interactive under load. And it isn't
just about doing work on behalf of other processes. Mouse interrupts
are a big part of it, for example.
> Lots of things do this. Perhaps a scheduler should focus entirely on
> the implicit and directed wakeup matrix and optimizing that
> instead[1].
You could do that, and I tried a variant of it at one point. The
problem was that it led to unexpected bad things too.
UNIX programs more or less expect fair SCHED_OTHER scheduling, and
given the principle of least surprise...
> Why are processes special? Should user A be able to get more CPU time
> for his job than user B by splitting it into N parallel jobs? Should
> we be fair per process, per user, per thread group, per session, per
> controlling terminal? Some weighted combination of the preceding?[2]
I don't know how that supports your argument for unfairness, but
processes are special only because that's how we've always done
scheduling. I'm not precluding other groupings for fairness, though.
> Why is the measure CPU time? I can imagine a scheduler that weighed
> memory bandwidth in the equation. Or power consumption. Or execution
> unit usage.
Feel free. And I'd also argue that once you schedule for those metrics,
fairness is important there too.
> Fairness is nice. It's simple, it's obvious, it's predictable. But
> it's just not clear that it's optimal. If the question is (and it
> was!) "what should the basic requirements for the scheduler be?" it's
> not clear that fairness is a requirement or even how to pick a metric
> for fairness that's obviously and uniquely superior.
What do you mean by optimal? If your criterion is fairness, then of course
it is optimal. If your criterion is throughput, then it probably isn't.
Considering it is simple and what we've always done, measuring fairness
by CPU time per process is the obvious choice for a general-purpose scheduler.
If you accept that, then I argue that fairness is an optimal property
given that the alternative is unfairness.
> It's instead much easier to try to recognize and rule out really bad
> behaviour with bounded latencies, minimum service guarantees, etc.
It's the bad behaviour that you didn't recognize that is the problem.
If you start with explicit fairness, then unfairness will never be
one of those problems.
> [1] That's basically how Google decides to prioritize webpages, which
> it seems to do moderately well. And how a number of other optimization
> problems are solved.
This is not an optimization problem; it is a heuristic. There is no
right or wrong answer.
> [2] It's trivial to construct two or more perfectly reasonable and
> desirable definitions of fairness that are mutually incompatible.
Probably not if you use common sense, and in the context of a replacement
for the 2.6 scheduler.
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-18 6:37 ` Nick Piggin
@ 2007-04-18 6:55 ` Matt Mackall
2007-04-18 7:24 ` Nick Piggin
2007-04-21 13:33 ` Bill Davidsen
0 siblings, 2 replies; 577+ messages in thread
From: Matt Mackall @ 2007-04-18 6:55 UTC (permalink / raw)
To: Nick Piggin
Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner
On Wed, Apr 18, 2007 at 08:37:11AM +0200, Nick Piggin wrote: