LKML Archive on lore.kernel.org
* [RESEND][RFC] BFQ I/O Scheduler
@ 2008-04-01 15:29 Fabio Checconi
  2008-04-15  8:22 ` Jens Axboe
  2008-04-16 18:44 ` Pavel Machek
  0 siblings, 2 replies; 30+ messages in thread
From: Fabio Checconi @ 2008-04-01 15:29 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, paolo.valente

[sorry for reposting, wrong subject]

Hi,
    we are working on a new I/O scheduler based on CFQ, aiming at
improved predictability and fairness of the service, while maintaining
the high throughput it already provides.

The patchset, too big for lkml posting, is available here:
    http://feanor.sssup.it/~fabio/linux/bfq/patches/

The Budget Fair Queueing (BFQ) scheduler turns the CFQ Round-Robin
scheduling policy of time slices into a fair queueing scheduling
of sector budgets.  More precisely, each task is assigned a budget
measured in number of sectors instead of amount of time, and budgets
are scheduled using a slightly modified version of WF2Q+.  The
budget assigned to each task varies over time as a function of its
behaviour.  However, one can set the maximum value of the budget
that BFQ can assign to any task.
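
Roughly, the core of the budget scheduling can be sketched as follows
(a simplified, self-contained userspace example, for illustration only;
the names are made up, the linear scan and the update of the virtual
time are not how the actual patch code works):

#include <stdio.h>

/*
 * Each queue gets virtual start/finish timestamps computed from its
 * sector budget and weight; among the eligible queues (start <= vtime)
 * the one with the smallest finish time is served next.
 */
struct queue {
	unsigned long budget;		/* assigned budget, in sectors */
	unsigned int weight;		/* weight derived from the ioprio */
	unsigned long start, finish;	/* virtual timestamps */
};

static unsigned long vtime;		/* system virtual time */

static void activate(struct queue *q)
{
	q->start = vtime > q->finish ? vtime : q->finish;
	q->finish = q->start + q->budget / q->weight;
}

static struct queue *select_queue(struct queue *q, int n)
{
	struct queue *best = NULL;
	int i;

	for (i = 0; i < n; i++)
		if (q[i].start <= vtime &&
		    (!best || q[i].finish < best->finish))
			best = &q[i];
	return best;
}

int main(void)
{
	struct queue q[2] = { { 4096, 1, 0, 0 }, { 1024, 1, 0, 0 } };

	activate(&q[0]);
	activate(&q[1]);
	/* the queue with the smaller budget has the earlier virtual
	 * finish time, so it is served first; a larger weight would
	 * shrink the finish time in the same way */
	printf("first served: queue %d\n", (int)(select_queue(q, 2) - q));
	return 0;
}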

The time-based allocation of the disk service in CFQ, while having
the desirable effect of implicitly charging each application for
the seek time it incurs, suffers from unfairness even towards
processes making the best possible use of the disk bandwidth.
In fact, even if the same time slice is assigned to two processes,
they may each obtain a different throughput, depending on the
positions of their requests on the disk.  In contrast, BFQ can
provide strong guarantees on bandwidth distribution because the
assigned budgets are measured in number of sectors.  Moreover, due
to its Round Robin policy, CFQ is characterized by an O(N) worst-case
delay (jitter) in request completion time, where N is the number
of tasks competing for the disk.  In contrast, given the accurate
service distribution of the internal WF2Q+ scheduler, BFQ exhibits
O(1) delay.
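
To put rough numbers on the round-robin case (the figures are purely
illustrative):

    N tasks, time slice T each: a request issued right after its task's
    slice expires may have to wait for the other N - 1 slices, that is
    (N - 1) * T; with N = 10 and T = 100 ms this is already ~900 ms,
    and it grows linearly with N.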

We ran several tests to measure the aggregate throughput, long-term
bandwidth distribution and single-request completion time guaranteed
by CFQ and BFQ; the results presented here were obtained with an outdated
version of the code, and we are in the process of collecting data for
the current one (see [1]).

In the first type of tests, to achieve a higher throughput than CFQ
(with the default 100 ms time slice), the maximum budget for BFQ
had to be set to at least 4k sectors.  Using the same value for the
maximum budget, in the second type of tests, BFQ guaranteed a maximum
deviation from the desired bandwidth distribution on the order of
3% over all the experiments.  In contrast, CFQ exhibited a maximum
deviation of 28%, as a consequence of the different positions of the
files on the disk.

		 Slowest task's bw (MB/s)  Fastest task's bw (MB/s)
-------------------------------------------------------------------
BFQ (2 files)		9.81 +/- 0.47		9.95 +/- 0.43
CFQ (2 files)		8.61 +/- 0.67		11.92 +/- 0.44
-------------------------------------------------------------------
BFQ (5 files)		4.29 +/- 0.10		4.31 +/- 0.09
CFQ (5 files)		4.01 +/- 0.17		5.24 +/- 0.14

Finally, we set up a VLC video streaming server to stream an
increasing number of movies in the presence of disturbing ON/OFF random
file readers.  Each test ended when a 1% packet loss was reached
(a packet was deemed lost if transmitted with a delay of more
than 1 second).  With BFQ it was possible to transmit at most 24
movies in parallel (again with a 4k sectors maximum budget), against
15 movies with CFQ (with a time slice of 20 ms).  This is likely
to be a consequence of the higher jitter of CFQ.

			Nr. of movies		Aggr. bw (MB/s)
---------------------------------------------------------------
BFQ (max_budget=4096)	24.00 +/- 0.00		7.56 +/- 0.87
BFQ (max_budget=16384)	18.70 +/- 9.45		12.78 +/- 5.64
CFQ (slice_sync=20)	14.35 +/- 1.40		12.59 +/- 2.12

More stuff related to BFQ (extended results, the test programs used
and the setup for the tests, a document describing the algorithm in
detail and so on) can be found at:

[1] http://algo.ing.unimo.it/people/paolo/disk_sched/

We would greatly appreciate any sort of feedback from you, comments,
suggestions, corrections and so on.  Thank you for your attention.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-01 15:29 [RESEND][RFC] BFQ I/O Scheduler Fabio Checconi
@ 2008-04-15  8:22 ` Jens Axboe
  2008-04-15  9:11   ` Fabio Checconi
                     ` (2 more replies)
  2008-04-16 18:44 ` Pavel Machek
  1 sibling, 3 replies; 30+ messages in thread
From: Jens Axboe @ 2008-04-15  8:22 UTC (permalink / raw)
  To: linux-kernel, paolo.valente

On Tue, Apr 01 2008, Fabio Checconi wrote:
> [sorry for reposting, wrong subject]
> 
> Hi,
>     we are working to a new I/O scheduler based on CFQ, aiming at
> improved predictability and fairness of the service, while maintaining
> the high throughput it already provides.
> 
> The patchset, too big for lkml posting, is available here:
>     http://feanor.sssup.it/~fabio/linux/bfq/patches/
> 
> The Budget Fair Queueing (BFQ) scheduler turns the CFQ Round-Robin
> scheduling policy of time slices into a fair queueing scheduling
> of sector budgets.  More precisely, each task is assigned a budget
> measured in number of sectors instead of amount of time, and budgets
> are scheduled using a slightly modified version of WF2Q+.  The
> budget assigned to each task varies over time as a function of its
> behaviour.  However, one can set the maximum value of the budget
> that BFQ can assign to any task.
> 
> The time-based allocation of the disk service in CFQ, while having
> the desirable effect of implicitly charging each application for
> the seek time it incurs, suffers from unfairness problems also
> towards processes making the best possible use of the disk bandwidth.
> In fact, even if the same time slice is assigned to two processes,
> they may get a different throughput each, as a function of the
> positions on the disk of their requests.  On the contrary, BFQ can
> provide strong guarantees on bandwidth distribution because the
> assigned budgets are measured in number of sectors.  Moreover, due
> to its Round Robin policy, CFQ is characterized by an O(N) worst-case
> delay (jitter) in request completion time, where N is the number
> of tasks competing for the disk.  On the contrary, given the accurate
> service distribution of the internal WF2Q+ scheduler, BFQ exhibits
> O(1) delay.
> 
> We made several tests to measure the aggregate throughput, long-term
> bandwidth distribution and single-request completion time guaranteed
> by CFQ and BFQ; what we present here was obtained with an outdated
> version of the code, we are in the process of collecting data for
> the current one (see [1]).
> 
> In the first type of tests, to achieve a higher throughput than CFQ
> (with the default 100 ms time slice), the maximum budget for BFQ
> had to be set to at least 4k sectors.  Using the same value for the
> maximum budget, in the second type of tests, BFQ guaranteed a maximum
> deviation from the desired bandwidth distribution in the order of
> 3% over all the experiments.  On the contrary CFQ exhibited a maximum
> deviation of 28% in consequence of the different positions of the
> files on the disk.
> 
> 		 Slowest task's bw (MB/s)  Fastest task's bw (MB/s)
> -------------------------------------------------------------------
> BFQ (2 files)		9.81 +/- 0.47		9.95 +/- 0.43
> CFQ (2 files)		8.61 +/- 0.67		11.92 +/- 0.44
> -------------------------------------------------------------------
> BFQ (5 files)		4.29 +/- 0.10		4.31 +/- 0.09
> CFQ (5 files)		4.01 +/- 0.17		5.24 +/- 0.14
> 
> Finally, we set up a VLC video streaming server to stream an
> increasing number of movies in presence of disturbing ON/OFF random
> file readers.  Each test ended when a 1% packet loss was reached
> (a packet was deemed as lost if transmitted with a delayed of more
> than 1 second).  With BFQ it was possible to transmit at most 24
> movies in parallel (again with a 4k sectors maximum budget), against
> 15 movies with CFQ (with a time slice of 20 ms).  This is likely
> to be a consequence of the higher jitter of CFQ.
> 
> 			Nr. of movies		Aggr. bw (MB/s)
> ---------------------------------------------------------------
> BFQ (max_budget=4096)	24.00 +/- 0.00		7.56 +/- 0.87
> BFQ (max_budget=16384)	18.70 +/- 9.45		12.78 +/- 5.64
> CFQ (slice_sync=20)	14.35 +/- 1.40		12.59 +/- 2.12
> 
> More stuff related to BFQ (extended results, the test programs used
> and the setup for the tests, a document describing the algorithm in
> detail and so on) can be found at:
> 
> [1] http://algo.ing.unimo.it/people/paolo/disk_sched/
> 
> We would greatly appreciate any sort of feedback from you, comments,
> suggestions, corrections and so on.  Thank you for your attention.

Fabio, I've merged the scheduler for some testing. Overall the code
looks great, you've done a good job!

I didn't touch patches 2 and 3, but I rewrote #1 somewhat. See the
result here:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=973a02c4ea1c324c41e45b69c074b13d3bfa97de;hp=a985aabe4d7a720b109c2b63549f8641676a9c88

I'm sure you'll agree with the hlist_sched_*() functions. I also killed
the ->bfq_ioprio_changed modification, what exactly was the purpose of
that?

The code is now in the 'bfq' branch of the block git repo.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-15  8:22 ` Jens Axboe
@ 2008-04-15  9:11   ` Fabio Checconi
  2008-04-15 12:42     ` Jens Axboe
  2008-04-16  6:48   ` Paolo Valente
  2008-04-18  1:26   ` Aaron Carroll
  2 siblings, 1 reply; 30+ messages in thread
From: Fabio Checconi @ 2008-04-15  9:11 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, paolo.valente

> From: Jens Axboe <jens.axboe@oracle.com>
> Date: Tue, Apr 15, 2008 10:22:36AM +0200
>
> On Tue, Apr 01 2008, Fabio Checconi wrote:
...
> > We would greatly appreciate any sort of feedback from you, comments,
> > suggestions, corrections and so on.  Thank you for your attention.
> 
> Fabio, I've merged the scheduler for some testing. Overall the code
> looks great, you've done a good job!
> 

thank you very much :)


> I didn't touch patches 2 and 3, but I rewrote #1 somewhat. See the
> result here:
> 
> http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=973a02c4ea1c324c41e45b69c074b13d3bfa97de;hp=a985aabe4d7a720b109c2b63549f8641676a9c88
> 
> I'm sure you'll agree with the hlist_sched_*() functions. I also killed
> the ->bfq_ioprio_changed modification, what exactly was the purpose of
> that?
> 

of course the hlist_sched_*() functions are much better than what was
in the patch (the idea behind the patch was not to touch the cfq code at all).

the ->bfq_ioprio_changed field was there to prevent a process/ioc doing
i/o on multiple devices managed by cfq and bfq from seeing ioprio
changes for only one of the two schedulers.

cfq_ioc_set_ioprio() (and its bfq counterpart bfq_ioc_set_ioprio())
unconditionally assigns 0 to (bfq_)ioprio_changed, so the first
scheduler that sees the ioprio change eats the new priority values.
of course I may be wrong, but I think it (or some better mechanism
to avoid the problem) is necessary.


> The code is now in the 'bfq' branch of the block git repo.
> 

of course we'll be glad to help with testing in any way you find useful.

thank you!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-15  9:11   ` Fabio Checconi
@ 2008-04-15 12:42     ` Jens Axboe
  2008-04-15 18:08       ` Fabio Checconi
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2008-04-15 12:42 UTC (permalink / raw)
  To: Fabio Checconi; +Cc: linux-kernel, paolo.valente

On Tue, Apr 15 2008, Fabio Checconi wrote:
> > From: Jens Axboe <jens.axboe@oracle.com>
> > Date: Tue, Apr 15, 2008 10:22:36AM +0200
> >
> > On Tue, Apr 01 2008, Fabio Checconi wrote:
> ...
> > > We would greatly appreciate any sort of feedback from you, comments,
> > > suggestions, corrections and so on.  Thank you for your attention.
> > 
> > Fabio, I've merged the scheduler for some testing. Overall the code
> > looks great, you've done a good job!
> > 
> 
> thank you very much :)
> 
> 
> > I didn't touch patches 2 and 3, but I rewrote #1 somewhat. See the
> > result here:
> > 
> > http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=973a02c4ea1c324c41e45b69c074b13d3bfa97de;hp=a985aabe4d7a720b109c2b63549f8641676a9c88
> > 
> > I'm sure you'll agree with the hlist_sched_*() functions. I also killed
> > the ->bfq_ioprio_changed modification, what exactly was the purpose of
> > that?
> > 
> 
> of course the hlist_sched_*() functions are much better than what was
> in the patch (the idea behind the patch was to not touch at all cfq code).

As long as the changes at that point are straightforward and 'obviously
correct', there's no harm done. Have you thought about merging bfq and
cfq?

> the ->bfq_ioprio_changed was there to avoid that a process/ioc doing
> i/o on multiple devices managed by cfq and bfq would see ioprio
> changes only for one of the two schedulers.
> 
> cfq_ioc_set_ioprio() (and its bfq counterpart bfq_ioc_set_ioprio())
> unconditionally assign 0 to (bfq_)ioprio_changed, so the first
> scheduler that sees the ioprio change eats the new priority values.
> of course I may be wrong, but I think it (or some better mechanism
> to avoid the problem) is necessary.

I see. If we can and will merge bfq and cfq, then it's not really an
issue. Otherwise, I'd suggest using bit 0 of ioprio_changed for cfq, bit
1 for bfq and so on. That avoids adding another int to the io context.
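
Something along these lines (just an illustrative userspace sketch; the
names and the field layout are made up, not the real io_context):

#include <stdio.h>

/*
 * A priority change sets one bit per elevator, and each scheduler
 * clears only its own bit, so cfq picking up the change no longer
 * hides it from bfq (and vice versa).
 */
enum {
	IOPRIO_CHANGED_CFQ = 1 << 0,
	IOPRIO_CHANGED_BFQ = 1 << 1,
};

struct demo_io_context {
	unsigned int ioprio_changed;	/* bitmask instead of a flag */
};

static void demo_set_ioprio(struct demo_io_context *ioc)
{
	ioc->ioprio_changed = IOPRIO_CHANGED_CFQ | IOPRIO_CHANGED_BFQ;
}

static int demo_cfq_sees_change(struct demo_io_context *ioc)
{
	if (!(ioc->ioprio_changed & IOPRIO_CHANGED_CFQ))
		return 0;
	ioc->ioprio_changed &= ~IOPRIO_CHANGED_CFQ;
	return 1;
}

int main(void)
{
	struct demo_io_context ioc = { 0 };

	demo_set_ioprio(&ioc);
	printf("cfq sees the change: %d\n", demo_cfq_sees_change(&ioc));
	printf("bfq bit still set:   %d\n",
	       !!(ioc.ioprio_changed & IOPRIO_CHANGED_BFQ));
	return 0;
}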

> > The code is now in the 'bfq' branch of the block git repo.
> > 
> 
> of course we'll be glad to help in testing in any way you can find useful.

I'll push it to the for-akpm branch as well, so it should show up in the
next -mm and get some testing there.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-15 12:42     ` Jens Axboe
@ 2008-04-15 18:08       ` Fabio Checconi
  0 siblings, 0 replies; 30+ messages in thread
From: Fabio Checconi @ 2008-04-15 18:08 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, paolo.valente

> From: Jens Axboe <jens.axboe@oracle.com>
> Date: Tue, Apr 15, 2008 02:42:48PM +0200
>
> On Tue, Apr 15 2008, Fabio Checconi wrote:
> > of course the hlist_sched_*() functions are much better than what was
> > in the patch (the idea behind the patch was to not touch at all cfq code).
> 
> As long as the changes at that point are straight forward and 'obviously
> correct', there's no harm done. Have you thought about merging bfq and
> cfq?
> 

Well, I'm maintaining bfq as a modified version of cfq, and I use a
script to generate the bfq-iosched.c file.  It allows keeping the common
code in sync.

I've posted this version to allow comparisons between the two schedulers
and because I consider cfq the more tested and stable reference scheduler.  If you
are interested in it I can clean up the cfq patch in the next few days and
post it here for discussion.


> > the ->bfq_ioprio_changed was there to avoid that a process/ioc doing
> > i/o on multiple devices managed by cfq and bfq would see ioprio
> > changes only for one of the two schedulers.
> > 
> > cfq_ioc_set_ioprio() (and its bfq counterpart bfq_ioc_set_ioprio())
> > unconditionally assign 0 to (bfq_)ioprio_changed, so the first
> > scheduler that sees the ioprio change eats the new priority values.
> > of course I may be wrong, but I think it (or some better mechanism
> > to avoid the problem) is necessary.
> 
> I see. If we can and will merge bfq and cfq, then it's not really an
> issue. Otherwise, I'd suggest using bits 0 of ioprio_changed for cfq and
> 1 for bfq and so on. That avoids adding another int to the io context.
> 

Yes, that's a better solution, at least for now.
[I'm sorry I cannot post a patch correcting it right now because I don't
have access to my dev and test boxes.]
 

> > > The code is now in the 'bfq' branch of the block git repo.
> > > 
> > 
> > of course we'll be glad to help in testing in any way you can find useful.
> 
> I'll push it to the for-akpm branch as well, so it should show up in the
> next -mm and get some testing there.
> 

Ok, thank you very much.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-15  8:22 ` Jens Axboe
  2008-04-15  9:11   ` Fabio Checconi
@ 2008-04-16  6:48   ` Paolo Valente
  2008-04-18  1:26   ` Aaron Carroll
  2 siblings, 0 replies; 30+ messages in thread
From: Paolo Valente @ 2008-04-16  6:48 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, Fabio

Jens Axboe ha scritto:
>
>
> Fabio, I've merged the scheduler for some testing. Overall the code
> looks great, you've done a good job!
>   
Hi, I'm Paolo Valente and I worked with Fabio on bfq. I'm very happy 
about your interest in it. Thanks a lot :)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-01 15:29 [RESEND][RFC] BFQ I/O Scheduler Fabio Checconi
  2008-04-15  8:22 ` Jens Axboe
@ 2008-04-16 18:44 ` Pavel Machek
  2008-04-17  6:14   ` Paolo Valente
  2008-04-17  9:14   ` Fabio Checconi
  1 sibling, 2 replies; 30+ messages in thread
From: Pavel Machek @ 2008-04-16 18:44 UTC (permalink / raw)
  To: axboe, linux-kernel, paolo.valente

On Tue 2008-04-01 17:29:03, Fabio Checconi wrote:
> [sorry for reposting, wrong subject]
> 
> Hi,
>     we are working to a new I/O scheduler based on CFQ, aiming at
> improved predictability and fairness of the service, while maintaining
> the high throughput it already provides.
> 
> The patchset, too big for lkml posting, is available here:
>     http://feanor.sssup.it/~fabio/linux/bfq/patches/
> 
> The Budget Fair Queueing (BFQ) scheduler turns the CFQ Round-Robin
> scheduling policy of time slices into a fair queueing scheduling
> of sector budgets.  More precisely, each task is assigned a budget
> measured in number of sectors instead of amount of time, and budgets
...
> In the first type of tests, to achieve a higher throughput than CFQ
> (with the default 100 ms time slice), the maximum budget for BFQ
> had to be set to at least 4k sectors.  Using the same value for the

Hmm, 4k sectors is ~40 seconds worst case, no? That's quite long...
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-16 18:44 ` Pavel Machek
@ 2008-04-17  6:14   ` Paolo Valente
  2008-04-17  7:10     ` Jens Axboe
  2008-04-17  9:14   ` Fabio Checconi
  1 sibling, 1 reply; 30+ messages in thread
From: Paolo Valente @ 2008-04-17  6:14 UTC (permalink / raw)
  To: Pavel Machek; +Cc: axboe, linux-kernel

Pavel Machek ha scritto:
>
>> In the first type of tests, to achieve a higher throughput than CFQ
>> (with the default 100 ms time slice), the maximum budget for BFQ
>> had to be set to at least 4k sectors.  Using the same value for the
>>     
>
> Hmm, 4k sectors is ~40 seconds worst case, no? That's quite long...
>   
Actually, in the worst case among our tests, the aggregate throughput 
with 4k sectors was ~ 20 MB/s, hence the time for 4k sectors ~ 4k * 512 
/ 20M = 100 ms.
In these test cases we repeated several measurements of the aggregate 
throughput with 2 to 5 simultaneous file reads (and other varying 
parameters). The lowest aggregate throughput for 4k sectors (~ 20 MB/s) 
was obtained with two 128 MB files lying on two slices at 
the maximum distance from each other (more details on the experiments, 
testbed and so on at http://algo.ing.unimo.it/people/paolo/disk_sched/).
I hope I didn't misunderstand your point.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17  6:14   ` Paolo Valente
@ 2008-04-17  7:10     ` Jens Axboe
  2008-04-17  8:26       ` Paolo Valente
  2008-04-17  8:48       ` Pavel Machek
  0 siblings, 2 replies; 30+ messages in thread
From: Jens Axboe @ 2008-04-17  7:10 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Pavel Machek, linux-kernel

On Thu, Apr 17 2008, Paolo Valente wrote:
> Pavel Machek ha scritto:
> >
> >>In the first type of tests, to achieve a higher throughput than CFQ
> >>(with the default 100 ms time slice), the maximum budget for BFQ
> >>had to be set to at least 4k sectors.  Using the same value for the
> >>    
> >
> >Hmm, 4k sectors is ~40 seconds worst case, no? That's quite long...
> >  
> Actually, in the worst case among our tests, the aggregate throughput 
> with 4k sectors was ~ 20 MB/s, hence the time for 4k sectors ~ 4k * 512 
> / 20M = 100 ms.

That's not worst case, it is pretty close to BEST case. Worst case is 4k
of sectors, with each being a 512b IO and causing a full stroke seek.
For that type of workload, even a modern sata hard drive will be doing
500kb/sec or less. That's roughly a thousand sectors per second, so ~4
seconds worst case for 4k sectors.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17  7:10     ` Jens Axboe
@ 2008-04-17  8:26       ` Paolo Valente
  2008-04-17  8:30         ` Jens Axboe
  2008-04-17 10:24         ` Aaron Carroll
  2008-04-17  8:48       ` Pavel Machek
  1 sibling, 2 replies; 30+ messages in thread
From: Paolo Valente @ 2008-04-17  8:26 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Pavel Machek, linux-kernel

Jens Axboe ha scritto:
>> Actually, in the worst case among our tests, the aggregate throughput 
>> with 4k sectors was ~ 20 MB/s, hence the time for 4k sectors ~ 4k * 512 
>> / 20M = 100 ms.
>>     
>
> That's not worse case, it is pretty close to BEST case. 
Yes. 100 ms is just the worst case among our tests with 4k, but these 
tests are limited to not much more than simultaneous sequential reads.
> Worst case is 4k
> of sectors, with each being a 512b IO and causing a full stroke seek.
> For that type of workload, even a modern sata hard drive will be doing
> 500kb/sec or less. That's rougly a thousand sectors per seconds, so ~4
> seconds worst case for 4k sectors.
>   
In my opinion, the time-slice approach of cfq is definitely better 
suited than the (sector) budget approach for this type of workload. At 
the opposite end, the price of time-slices is unfairness towards, e.g., 
threads doing sequential accesses. In bfq we were mainly thinking about 
file copy, ftp, video streaming and so on. I was not able to find a good 
solution for both types of workloads.

BTW, there is also another possibility. The internal scheduler of bfq 
may be used to schedule time-slices instead of budgets. By doing so, the 
O(1), instead of O(N), delay/jitter would still be guaranteed (as is 
probably already clear, bfq is obtained from cfq by just turning slices 
into budgets, and the Round Robin-like scheduling policy into a Weighted 
Fair Queueing one).



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17  8:26       ` Paolo Valente
@ 2008-04-17  8:30         ` Jens Axboe
  2008-04-17  9:24           ` Paolo Valente
  2008-04-17 15:19           ` Avi Kivity
  2008-04-17 10:24         ` Aaron Carroll
  1 sibling, 2 replies; 30+ messages in thread
From: Jens Axboe @ 2008-04-17  8:30 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Pavel Machek, linux-kernel

On Thu, Apr 17 2008, Paolo Valente wrote:
> Jens Axboe ha scritto:
> >>Actually, in the worst case among our tests, the aggregate throughput 
> >>with 4k sectors was ~ 20 MB/s, hence the time for 4k sectors ~ 4k * 512 
> >>/ 20M = 100 ms.
> >>    
> >
> >That's not worse case, it is pretty close to BEST case. 
> Yes. 100 ms is just the worst case among our tests with 4k, but these 
> tests are limited to not much more than simultaneous sequential reads.
> >Worst case is 4k
> >of sectors, with each being a 512b IO and causing a full stroke seek.
> >For that type of workload, even a modern sata hard drive will be doing
> >500kb/sec or less. That's rougly a thousand sectors per seconds, so ~4
> >seconds worst case for 4k sectors.
> >  
> In my opinion, the time-slice approach of cfq is definitely better 
> suited than the (sector) budget approach for this type of workloads. On 
> the opposite end, the price of time-slices is unfairness towards, e.g., 
> threads doing sequential accesses. In bfq we were mainly thinking about 
> file copy, ftp, video streaming and so on. I was not able to find a good 
> solution for both types of workloads.

Which is fine, nothing wrong with a scheduler tuned for a specific
workload. Highlighting the shortcomings is also interesting :-)

> BTW, there is also another possibility. The internal scheduler of bfq 
> may be used to schedule time-slices instead of budgets. By doing so, the 
> O(1) instead of O(N) delay/jitter would still be guaranteed (as it is 
> probably already clear, bfq is obtained from cfq by just turning slices 
> into budgets, and the Round Robin-like scheduling policy into a Weighted 
> Fair Queueuing one).

I was thinking about that too. Generally I've been opposed to making
scheduling decisions based on anything but time, since that is always
relevant. When to hand out slices and to what process, that algorithm is
really basic in CFQ and could do with an improvement.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17  7:10     ` Jens Axboe
  2008-04-17  8:26       ` Paolo Valente
@ 2008-04-17  8:48       ` Pavel Machek
  2008-04-17  8:57         ` Jens Axboe
  1 sibling, 1 reply; 30+ messages in thread
From: Pavel Machek @ 2008-04-17  8:48 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Paolo Valente, linux-kernel

> On Thu, Apr 17 2008, Paolo Valente wrote:
> > Pavel Machek ha scritto:
> > >
> > >>In the first type of tests, to achieve a higher throughput than CFQ
> > >>(with the default 100 ms time slice), the maximum budget for BFQ
> > >>had to be set to at least 4k sectors.  Using the same value for the
> > >>    
> > >
> > >Hmm, 4k sectors is ~40 seconds worst case, no? That's quite long...
> > >  
> > Actually, in the worst case among our tests, the aggregate throughput 
> > with 4k sectors was ~ 20 MB/s, hence the time for 4k sectors ~ 4k * 512 
> > / 20M = 100 ms.
> 
> That's not worse case, it is pretty close to BEST case. Worst case is 4k
> of sectors, with each being a 512b IO and causing a full stroke seek.
> For that type of workload, even a modern sata hard drive will be doing
> 500kb/sec or less. That's rougly a thousand sectors per seconds, so ~4
> seconds worst case for 4k sectors.

One seek is still 10 msec on a modern drive, right? So 4k seeks =
40 seconds, no? 4 seconds would correspond to 1 msec per seek, which
seems low.

writes with O_SYNC could force a full seek on each request, right?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17  8:48       ` Pavel Machek
@ 2008-04-17  8:57         ` Jens Axboe
  0 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2008-04-17  8:57 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Paolo Valente, linux-kernel

On Thu, Apr 17 2008, Pavel Machek wrote:
> > On Thu, Apr 17 2008, Paolo Valente wrote:
> > > Pavel Machek ha scritto:
> > > >
> > > >>In the first type of tests, to achieve a higher throughput than CFQ
> > > >>(with the default 100 ms time slice), the maximum budget for BFQ
> > > >>had to be set to at least 4k sectors.  Using the same value for the
> > > >>    
> > > >
> > > >Hmm, 4k sectors is ~40 seconds worst case, no? That's quite long...
> > > >  
> > > Actually, in the worst case among our tests, the aggregate throughput 
> > > with 4k sectors was ~ 20 MB/s, hence the time for 4k sectors ~ 4k * 512 
> > > / 20M = 100 ms.
> > 
> > That's not worse case, it is pretty close to BEST case. Worst case is 4k
> > of sectors, with each being a 512b IO and causing a full stroke seek.
> > For that type of workload, even a modern sata hard drive will be doing
> > 500kb/sec or less. That's rougly a thousand sectors per seconds, so ~4
> > seconds worst case for 4k sectors.
> 
> One seek is still 10msec on modern drive, right? So 4k seeks =
> 40seconds, no? 4seconds would correspond to 1msec per seek, which
> seems low.

I actually meant 4k ios, not 512b as that isn't really realistic. With
512b full device seeks, you are looking at probably 30kb/sec on a normal
7200rpm drive and that would be around a minute worst case time. The 4kb
number of 500kb/sec may even be a bit too high, doing a quick test here
shows a little less than 300kb/sec on this drive. So more than 4 seconds
still, around 7-8s or so.
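
Spelling the arithmetic out, with the figures above:

    4k sectors * 512 B  = ~2 MiB of budget to drain
    512 B random reads  @ ~30 KB/s   -> ~70 s  (around a minute)
    4 KiB random reads  @ ~300 KB/s  -> ~7 s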

> writes with O_SYNC could force full seek on each request, right?

Writes generally work somewhat better due to caching, but doing O_DIRECT
512 byte reads all over the drive would exhibit worst case behaviour
easily.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-16 18:44 ` Pavel Machek
  2008-04-17  6:14   ` Paolo Valente
@ 2008-04-17  9:14   ` Fabio Checconi
  1 sibling, 0 replies; 30+ messages in thread
From: Fabio Checconi @ 2008-04-17  9:14 UTC (permalink / raw)
  To: Pavel Machek; +Cc: axboe, linux-kernel, paolo.valente

> From: Pavel Machek <pavel@ucw.cz>
> Date: Wed, Apr 16, 2008 08:44:41PM +0200
>
> On Tue 2008-04-01 17:29:03, Fabio Checconi wrote:
> ...
> > In the first type of tests, to achieve a higher throughput than CFQ
> > (with the default 100 ms time slice), the maximum budget for BFQ
> > had to be set to at least 4k sectors.  Using the same value for the
> 
> Hmm, 4k sectors is ~40 seconds worst case, no? That's quite long...


A process with such a low throughput would be marked as seeky by
the heuristics implemented in cfq/bfq.  Seeky processes are not
treated in the same way as sequential ones and they should not get
their full slice allocated, since they idle only for very short
periods.

BTW looking at the code they can get a full slice, if they always
reissue requests fast enough - within BFQ_MIN_TT - and this is
definitely an issue/error in the current implementation (and we
didn't notice it when converting the code from time-based to
service-based allocation :) ).

An easy solution (without changing the nature of bfq) would be
to use shorter slices for seeky queues, with the same mechanism
we already use for the async ones.
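
Something like this, just to give the idea (an illustrative sketch; the
names and the reduction factor are made up, not code from the patchset):

#include <stdio.h>

/*
 * A queue flagged as seeky by the heuristics gets only a fraction of
 * the configured maximum budget, much like async queues already get a
 * reduced allocation.
 */
#define DEMO_MAX_BUDGET		4096UL	/* sectors, as in the tests above */
#define DEMO_SEEKY_DIVISOR	8	/* made-up reduction factor */

struct demo_queue {
	int seeky;			/* set by the seekiness heuristic */
};

static unsigned long demo_budget(const struct demo_queue *q)
{
	return q->seeky ? DEMO_MAX_BUDGET / DEMO_SEEKY_DIVISOR
			: DEMO_MAX_BUDGET;
}

int main(void)
{
	struct demo_queue seq = { 0 }, rnd = { 1 };

	printf("sequential queue budget: %lu sectors\n", demo_budget(&seq));
	printf("seeky queue budget:      %lu sectors\n", demo_budget(&rnd));
	return 0;
}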

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17  8:30         ` Jens Axboe
@ 2008-04-17  9:24           ` Paolo Valente
  2008-04-17  9:27             ` Jens Axboe
  2008-04-17 15:19           ` Avi Kivity
  1 sibling, 1 reply; 30+ messages in thread
From: Paolo Valente @ 2008-04-17  9:24 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Pavel Machek, linux-kernel

Jens Axboe ha scritto:
>
>
> I was thinking about that too. Generally I've been opposed to doing
> scheduling decisions on anything but time, since that is always
> relevant. When to hand out slices and to what process, that algorithm is
> really basic in CFQ and could do with an improvement.
>
>   
Maybe there is also another middle-ground solution. I'll try to sketch 
it out:
. use sectors instead of time
. impose a penalty on each thread in proportion to the distance between 
its disk requests
. reduce the maximum budget of each thread as a function of this seek 
penalty so as to prevent the thread from stealing more than a given time 
slice (the simple mechanism to limit per-thread budget is already 
implemented in bfq).

By doing so, both fairness and time isolation should be guaranteed.
Finally, this policy should be safe in that, given the maximum time used 
by a seeky thread to consume its maximum budget on a reference disk, the 
time used on any faster disk should be shorter.
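
To give a flavour of the third point (a crude sketch; the formula and
all the constants are just placeholders for illustration):

#include <stdio.h>

/*
 * Shrink a queue's maximum budget in proportion to how much of its
 * service time goes into seeking, so that even a seeky queue cannot
 * hold the disk much longer than a reference time slice.
 */
#define DEMO_BASE_MAX_BUDGET	4096UL	/* sectors */

static unsigned long demo_scaled_budget(unsigned long seek_us_per_req,
					unsigned long xfer_us_per_req)
{
	/* seek penalty: time spent seeking relative to time transferring */
	unsigned long penalty =
		seek_us_per_req / (xfer_us_per_req ? xfer_us_per_req : 1);

	return DEMO_BASE_MAX_BUDGET / (1 + penalty);
}

int main(void)
{
	/* sequential reader: no seek between requests */
	printf("sequential: %lu sectors\n", demo_scaled_budget(0, 100));
	/* seeky reader: ~10 ms of seek for ~0.1 ms of transfer */
	printf("seeky:      %lu sectors\n", demo_scaled_budget(10000, 100));
	return 0;
}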

Does it seem reasonable?


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17  9:24           ` Paolo Valente
@ 2008-04-17  9:27             ` Jens Axboe
  2008-04-17 10:19               ` Aaron Carroll
  2008-04-17 11:30               ` Fabio Checconi
  0 siblings, 2 replies; 30+ messages in thread
From: Jens Axboe @ 2008-04-17  9:27 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Pavel Machek, linux-kernel

On Thu, Apr 17 2008, Paolo Valente wrote:
> Jens Axboe ha scritto:
> >
> >
> >I was thinking about that too. Generally I've been opposed to doing
> >scheduling decisions on anything but time, since that is always
> >relevant. When to hand out slices and to what process, that algorithm is
> >really basic in CFQ and could do with an improvement.
> >
> >  
> Maybe there is also another middle-ground solution. I'll try to sketch 
> it out:
> . use sectors instead of time
> . impose a penalty to each thread in proportion to the distance between 
> its disk requests
> . reduce the maximum budget of each thread as a function of this seek 
> penalty so as to prevent the thread from stealing more than a given time 
> slice (the simple mechanism to limit per-thread budget is already 
> implemented in bfq).
> 
> By doing so, both fairness and time isolation should be guaranteed.
> Finally, this policy should be safe in that, given the maximum time used 
> by a seeky thread to consume its maximum budget on a reference disk, the 
> time used on any faster disk should be shorter.
> 
> Does it seem reasonable?

Not for CFQ, that will stay time based. The problem with #2 above is
that it then quickly turns into various heuristics, which is just
impossible to tune for general behaviour. Or it just falls apart for
other real life situations.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17  9:27             ` Jens Axboe
@ 2008-04-17 10:19               ` Aaron Carroll
  2008-04-17 10:21                 ` Jens Axboe
  2008-04-17 11:30               ` Fabio Checconi
  1 sibling, 1 reply; 30+ messages in thread
From: Aaron Carroll @ 2008-04-17 10:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Paolo Valente, Pavel Machek, linux-kernel

Jens Axboe wrote:
>> Maybe there is also another middle-ground solution. I'll try to sketch 
>> it out:
>> . use sectors instead of time
>> . impose a penalty to each thread in proportion to the distance between 
>> its disk requests
>> . reduce the maximum budget of each thread as a function of this seek 
>> penalty so as to prevent the thread from stealing more than a given time 
>> slice (the simple mechanism to limit per-thread budget is already 
>> implemented in bfq).
>>
>> By doing so, both fairness and time isolation should be guaranteed.
>> Finally, this policy should be safe in that, given the maximum time used 
>> by a seeky thread to consume its maximum budget on a reference disk, the 
>> time used on any faster disk should be shorter.
>>
>> Does it seem reasonable?
> 
> Not for CFQ, that will stay time based. The problem with #2 above is
> that it then quickly turns to various heuristics, which is just
> impossible to tune for general behaviour. Or it just falls apart for
> other real life situations.

Like SSD or hardware RAID.  Time-slices have the nice property of fairness
irrespective of the underlying hardware characteristics.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17 10:19               ` Aaron Carroll
@ 2008-04-17 10:21                 ` Jens Axboe
  0 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2008-04-17 10:21 UTC (permalink / raw)
  To: Aaron Carroll; +Cc: Paolo Valente, Pavel Machek, linux-kernel

On Thu, Apr 17 2008, Aaron Carroll wrote:
> Jens Axboe wrote:
> >>Maybe there is also another middle-ground solution. I'll try to sketch 
> >>it out:
> >>. use sectors instead of time
> >>. impose a penalty to each thread in proportion to the distance between 
> >>its disk requests
> >>. reduce the maximum budget of each thread as a function of this seek 
> >>penalty so as to prevent the thread from stealing more than a given time 
> >>slice (the simple mechanism to limit per-thread budget is already 
> >>implemented in bfq).
> >>
> >>By doing so, both fairness and time isolation should be guaranteed.
> >>Finally, this policy should be safe in that, given the maximum time used 
> >>by a seeky thread to consume its maximum budget on a reference disk, the 
> >>time used on any faster disk should be shorter.
> >>
> >>Does it seem reasonable?
> >
> >Not for CFQ, that will stay time based. The problem with #2 above is
> >that it then quickly turns to various heuristics, which is just
> >impossible to tune for general behaviour. Or it just falls apart for
> >other real life situations.
> 
> Like SSD or hardware RAID.  Time-slices have the nice property of fairness
> irrespective of the underlying hardware characteristics.

Exactly. We can cater to that somewhat by adding some simple hardware
profiles, so that the IO schedulers know if seeks etc are costly or not. But
it's a good example nonetheless.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17  8:26       ` Paolo Valente
  2008-04-17  8:30         ` Jens Axboe
@ 2008-04-17 10:24         ` Aaron Carroll
  2008-04-17 11:14           ` Fabio Checconi
  1 sibling, 1 reply; 30+ messages in thread
From: Aaron Carroll @ 2008-04-17 10:24 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jens Axboe, Pavel Machek, linux-kernel

Paolo Valente wrote:
> In my opinion, the time-slice approach of cfq is definitely better 
> suited than the (sector) budget approach for this type of workloads. On 
> the opposite end, the price of time-slices is unfairness towards, e.g., 
> threads doing sequential accesses. In bfq we were mainly thinking about 

How do you figure that?  This is a situation where time-slices work nicely,
because they implicitly account for the performance penalty of poor access
patterns.  The sequential-accessing processes (and the system overall) end
up with higher throughput.
 
 -- Aaron


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17 10:24         ` Aaron Carroll
@ 2008-04-17 11:14           ` Fabio Checconi
  2008-04-17 12:14             ` Aaron Carroll
  0 siblings, 1 reply; 30+ messages in thread
From: Fabio Checconi @ 2008-04-17 11:14 UTC (permalink / raw)
  To: Aaron Carroll; +Cc: Paolo Valente, Jens Axboe, Pavel Machek, linux-kernel

> From: Aaron Carroll <aaronc@cse.unsw.edu.au>
> Date: Thu, Apr 17, 2008 08:24:09PM +1000
>
> Paolo Valente wrote:
> >In my opinion, the time-slice approach of cfq is definitely better 
> >suited than the (sector) budget approach for this type of workloads. On 
> >the opposite end, the price of time-slices is unfairness towards, e.g., 
> >threads doing sequential accesses. In bfq we were mainly thinking about 
> 
> How do you figure that?  This is a situation where time-slices work nicely,
> because they implicitly account for the performance penalty of poor access
> patterns.  The sequential-accessing processes (and the system overall) ends
> up with higher throughput.
> 

The unfairness is not WRT tasks generating poor access patterns.
If you have two tasks doing sequential accesses on two different
regions of the disk, the exact amount of service they receive in the
same amount of time depends on the transfer rate of the disk in
those regions, and, depending on the media, it is not always the same.

We showed some examples of that in the original post, and it is quite
easy to try it out if you put two partitions at the ends of a disk
and you try to read from them concurrently.
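
As a rough illustration (the numbers are only indicative of a typical
ZBR disk, not measured):

    task A, outer zone: 100 ms slice * ~65 MB/s  ~=  6.5 MB of service
    task B, inner zone: 100 ms slice * ~40 MB/s  ~=  4.0 MB of service

so with identical time slices task B transfers almost 40% less data
per round.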

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17  9:27             ` Jens Axboe
  2008-04-17 10:19               ` Aaron Carroll
@ 2008-04-17 11:30               ` Fabio Checconi
  1 sibling, 0 replies; 30+ messages in thread
From: Fabio Checconi @ 2008-04-17 11:30 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Paolo Valente, Pavel Machek, linux-kernel

> From: Jens Axboe <jens.axboe@oracle.com>
> Date: Thu, Apr 17, 2008 11:27:55AM +0200
>
> On Thu, Apr 17 2008, Paolo Valente wrote:
> > Maybe there is also another middle-ground solution. I'll try to sketch 
...
> > Does it seem reasonable?
> 
> Not for CFQ, that will stay time based. The problem with #2 above is
> that it then quickly turns to various heuristics, which is just
> impossible to tune for general behaviour. Or it just falls apart for
> other real life situations.
> 

Ok.

After a brief offline discussion with Paolo, here is what we can do:

    o propose a patch for discussion that uses our WF2Q+ variant to
      schedule timeslices in cfq.  The resulting scheduler would be
      quite close to the EEVDF scheduler discussed some time ago for
      the cpu.

    o Introduce a timeout in bfq to give an upper time limit to the
      slices.  Since we have not experimented with that mixed approach
      before[*], we will need to do some tests with relevant workloads to
      see if/how it can work.

I fear that it will take some time, as we're both travelling this
week.

[*] Anyway it is quite close to how cfq handles async queues, with their
slice_async and slice_async_rq, so it's definitely not something new.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17 11:14           ` Fabio Checconi
@ 2008-04-17 12:14             ` Aaron Carroll
  2008-04-17 13:54               ` Jens Axboe
  2008-04-17 15:18               ` Paolo Valente
  0 siblings, 2 replies; 30+ messages in thread
From: Aaron Carroll @ 2008-04-17 12:14 UTC (permalink / raw)
  To: Fabio Checconi; +Cc: Paolo Valente, Jens Axboe, Pavel Machek, linux-kernel

Fabio Checconi wrote:
>> From: Aaron Carroll <aaronc@cse.unsw.edu.au>
>> How do you figure that?  This is a situation where time-slices work nicely,
>> because they implicitly account for the performance penalty of poor access
>> patterns.  The sequential-accessing processes (and the system overall) ends
>> up with higher throughput.
>>
> 
> The unfairness is not WRT tasks generating poor access patterns.
> If you have two tasks doing sequential accesses on two different
> regions of the disk the exact amount of service they receive in the
> same amount of time depends on the transfer rate of the disk on
> that regions, and, depending on the media, it is not always the same.

Ok... you're talking about ZBR.

I'm not convinced this should be treated differently from, say, random vs.
sequential workloads.  You still end up with reduced global throughput as
you've shown in the ``Short-term time guarantees'' table.  It is an
interesting case though... since the lower performance is not through any
fault of the process, it doesn't seem fair to ``punish'' it.

 -- Aaron


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17 12:14             ` Aaron Carroll
@ 2008-04-17 13:54               ` Jens Axboe
  2008-04-17 15:18               ` Paolo Valente
  1 sibling, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2008-04-17 13:54 UTC (permalink / raw)
  To: Aaron Carroll; +Cc: Fabio Checconi, Paolo Valente, Pavel Machek, linux-kernel

On Thu, Apr 17 2008, Aaron Carroll wrote:
> Fabio Checconi wrote:
> >>From: Aaron Carroll <aaronc@cse.unsw.edu.au>
> >>How do you figure that?  This is a situation where time-slices work 
> >>nicely,
> >>because they implicitly account for the performance penalty of poor access
> >>patterns.  The sequential-accessing processes (and the system overall) 
> >>ends
> >>up with higher throughput.
> >>
> >
> >The unfairness is not WRT tasks generating poor access patterns.
> >If you have two tasks doing sequential accesses on two different
> >regions of the disk the exact amount of service they receive in the
> >same amount of time depends on the transfer rate of the disk on
> >that regions, and, depending on the media, it is not always the same.
> 
> Ok... you're talking about ZBR.
> 
> I'm not convinced this should be treated differently to, say, random vs.
> sequential workloads.  You still end up with reduced global throughput as
> you've shown in the ``Short-term time guarantees'' table.  It is an
> interesting case though... since the lower performance is not though fault
> of the process it doesn't seem fair to ``punish'' it.

It is indeed a valid observation, but I think we are getting into
details still. CFQ wants to provide fair access to the drive, it doesn't
claim to be 100% fair wrt throughput or transfer sums at all costs. This
is where fairness and real life for an all round scheduler divert
somewhat.

So while it IS true that you could have 40mb/sec at one end and 65mb/sec
at the other and thus give the process at the start an 'unfair' share of
bandwidth, it's honestly mostly a theoretical problem. I can envision
some valid concerns for media streaming filling the entire drive, but
then my solution would be to just bump the time slice if you are not
meeting deadlines. I've never heard anyone complain about this issue.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17 12:14             ` Aaron Carroll
  2008-04-17 13:54               ` Jens Axboe
@ 2008-04-17 15:18               ` Paolo Valente
  1 sibling, 0 replies; 30+ messages in thread
From: Paolo Valente @ 2008-04-17 15:18 UTC (permalink / raw)
  To: Aaron Carroll; +Cc: Fabio Checconi, Jens Axboe, Pavel Machek, linux-kernel

Aaron Carroll ha scritto:
> You still end up with reduced global throughput as
> you've shown in the ``Short-term time guarantees'' table.  It is an
> interesting case though... since the lower performance is not though 
> fault
> of the process it doesn't seem fair to ``punish'' it.
Just a note about that table. The lower aggregate throughput of bfq is 
due to the fact that, because of the higher number of movies being read, 
a higher percentage of not-that-profitable accesses is being performed 
under bfq than under cfq. As shown in the complete logs of the aggregate 
throughput in the raw results, the aggregate throughput with bfq and cfq 
is practically the same when the number of movies is the same.
The figure in the "Aggregate throughput" subsection is probably best 
suited for a comparison of the performance of the two schedulers with 
sequential accesses under the same conditions (the figure refers to the 
two 128 MB files, but we got virtually the same results in all the 
other tests).
I do agree that these experiments should be repeated with different 
(faster) devices.

Paolo
>
> -- Aaron
>
>



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17  8:30         ` Jens Axboe
  2008-04-17  9:24           ` Paolo Valente
@ 2008-04-17 15:19           ` Avi Kivity
  2008-04-17 15:47             ` Paolo Valente
  2008-04-17 23:44             ` Aaron Carroll
  1 sibling, 2 replies; 30+ messages in thread
From: Avi Kivity @ 2008-04-17 15:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Paolo Valente, Pavel Machek, linux-kernel

Jens Axboe wrote:
> I was thinking about that too. Generally I've been opposed to doing
> scheduling decisions on anything but time, since that is always
> relevant. When to hand out slices and to what process, that algorithm is
> really basic in CFQ and could do with an improvement.
>   

Jumping in at random, does "process" here mean task or mm_struct?  If 
the former, doesn't that mean that a 100-thread process can starve out a 
single-threaded process?

Perhaps we need hierarchical io scheduling, like cfs has for the cpu.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17 15:19           ` Avi Kivity
@ 2008-04-17 15:47             ` Paolo Valente
  2008-04-17 15:51               ` Avi Kivity
  2008-04-17 23:44             ` Aaron Carroll
  1 sibling, 1 reply; 30+ messages in thread
From: Paolo Valente @ 2008-04-17 15:47 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jens Axboe, Pavel Machek, linux-kernel

Avi Kivity ha scritto:
> Jumping in at random, does "process" here mean task or mms_struct?  If 
> the former, doesn't that mean that a 100-thread process can starve out 
> a single-threaded process?
>
> Perhaps we need hierarchical io scheduling, like cfs has for the cpu.
>
Hierarchical scheduling would simplify isolating groups of threads or processes.
However, a simple solution is already available with bfq. For 
example, if you have to fairly share the disk bandwidth between the 
above 100 threads and another important thread, you get it by just 
assigning weight 1 to each of these 100 threads, and weight 100 to the 
important one.
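
With proportional sharing, and all the threads backlogged, those weights
give:

    total weight               = 100 * 1 + 1 * 100 = 200
    important thread's share   = 100 / 200 = 50% of the disk bandwidth
    each of the other threads  =   1 / 200 = 0.5%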

Paolo

-- 
-----------------------------------------------------------
| Paolo Valente              |                            |
| Algogroup                  |                            |
| Dip. Ing. Informazione     | tel:   +39 059 2056318     |
| Via Vignolese 905/b	     | fax:   +39 059 2056199     |
| 41100 Modena               | 				  |
|     home:  http://algo.ing.unimo.it/people/paolo/       |
-----------------------------------------------------------


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17 15:47             ` Paolo Valente
@ 2008-04-17 15:51               ` Avi Kivity
  2008-04-17 18:12                 ` Paolo Valente
  0 siblings, 1 reply; 30+ messages in thread
From: Avi Kivity @ 2008-04-17 15:51 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jens Axboe, Pavel Machek, linux-kernel

Paolo Valente wrote:
> Avi Kivity ha scritto:
>> Jumping in at random, does "process" here mean task or mms_struct?  
>> If the former, doesn't that mean that a 100-thread process can starve 
>> out a single-threaded process?
>>
>> Perhaps we need hierarchical io scheduling, like cfs has for the cpu.
>>
> Hierarchical would simplify isolating groups of threads or processes.
> However, some simple solution is already available with bfq. For 
> example, if you have to fairly share the disk bandwidth between the 
> above 100 threads and another important thread, you get it by just 
> assigning weight 1 to each of these 100 threads, and weight 100 to the 
> important one.

Doesn't work.  If the 100-thread process wants to use just one thread for 
issuing I/O, it will be starved by the single-threaded process.

[my example has process A with 100 threads, and process B with 1 thread, 
not a 101-thread process with one important thread]


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17 15:51               ` Avi Kivity
@ 2008-04-17 18:12                 ` Paolo Valente
  0 siblings, 0 replies; 30+ messages in thread
From: Paolo Valente @ 2008-04-17 18:12 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jens Axboe, Pavel Machek, linux-kernel

Avi Kivity ha scritto:
> Paolo Valente wrote:
>> Avi Kivity ha scritto:
>>> Jumping in at random, does "process" here mean task or mms_struct?  
>>> If the former, doesn't that mean that a 100-thread process can 
>>> starve out a single-threaded process?
>>>
>>> Perhaps we need hierarchical io scheduling, like cfs has for the cpu.
>>>
>> Hierarchical would simplify isolating groups of threads or processes.
>> However, some simple solution is already available with bfq. For 
>> example, if you have to fairly share the disk bandwidth between the 
>> above 100 threads and another important thread, you get it by just 
>> assigning weight 1 to each of these 100 threads, and weight 100 to 
>> the important one.
>
> Doesn't work.  If the 100-thread process wants to use just on thread 
> for issuing I/O, it will be starved by the single-threaded process.
>
> [my example has process A with 100 threads, and process B with 1 
> thread, not a 101-thread process with one important thread]
>
Right. I was thinking only about the case where all the 101 threads 
concurrently access the disk, and I just wanted to say that weights may 
offer more help than priorities in simple cases like this one.
Apart from this, automatically recomputing weights as needed is most 
certainly a worse solution than hierarchical scheduling.

Paolo


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-17 15:19           ` Avi Kivity
  2008-04-17 15:47             ` Paolo Valente
@ 2008-04-17 23:44             ` Aaron Carroll
  1 sibling, 0 replies; 30+ messages in thread
From: Aaron Carroll @ 2008-04-17 23:44 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jens Axboe, Paolo Valente, Pavel Machek, linux-kernel

Avi Kivity wrote:
> Jens Axboe wrote:
>> I was thinking about that too. Generally I've been opposed to doing
>> scheduling decisions on anything but time, since that is always
>> relevant. When to hand out slices and to what process, that algorithm is
>> really basic in CFQ and could do with an improvement.
>>   
> 
> Jumping in at random, does "process" here mean task or mms_struct?  If 
> the former, doesn't that mean that a 100-thread process can starve out a 
> single-threaded process?

task_struct.  It would also be nice to do per-user I/O scheduling..

 -- Aaron

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RESEND][RFC] BFQ I/O Scheduler
  2008-04-15  8:22 ` Jens Axboe
  2008-04-15  9:11   ` Fabio Checconi
  2008-04-16  6:48   ` Paolo Valente
@ 2008-04-18  1:26   ` Aaron Carroll
  2 siblings, 0 replies; 30+ messages in thread
From: Aaron Carroll @ 2008-04-18  1:26 UTC (permalink / raw)
  To: Fabio Checconi; +Cc: Jens Axboe, linux-kernel, paolo.valente

Jens Axboe wrote:
> On Tue, Apr 01 2008, Fabio Checconi wrote:
>> [sorry for reposting, wrong subject]
>>
>> Hi,
>>     we are working to a new I/O scheduler based on CFQ, aiming at
>> improved predictability and fairness of the service, while maintaining
>> the high throughput it already provides.

Here are some microbenchmark results.  Test setup is a 2-way IA64 with a
single 15k RPM 73GiB SCSI disk with TCQ depth set to 1.  Workloads are
generated with FIO: 128 processes issuing synchronous, O_DIRECT, 16KiB
block size requests.

Figures are quoted as average (stdev).  CFQ (i=0) means idle window
disabled.  All other tunables are default.

==================================x8=======================================

                Random Readers
-----------------------------------------------
           Latency (ms)       Bandwidth (KiB/s)
-----------------------------------------------
CFQ        841.788 (4070.3)   2428.032 (23.1)
CFQ (i=0)  536.728 (216.9)    3841.024 (8.5)
BFQ        884.4 (8816.0)     2439.04 (1375.0)


            Sequential 1MiB Readers
-----------------------------------------------
           Latency (ms)       Bandwidth (KiB/s)
-----------------------------------------------
CFQ        2865.331 (737.2)   46866.048 (103.1)
CFQ (i=0)  2544.618 (1047.2)  52685.952 (294.2)
BFQ        2860.795 (419.1)   46850.944 (81.5)


Clearly BFQ suffers from the same idle window problems as CFQ, but otherwise
the performance seems comparable in bandwidth terms.  I'm guessing variability
in random workload service is due to max budget being too large compared to
CFQ's default time-slice.  Sequential access looks nice and consistent, though.


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2008-04-18  1:26 UTC | newest]

Thread overview: 30+ messages
2008-04-01 15:29 [RESEND][RFC] BFQ I/O Scheduler Fabio Checconi
2008-04-15  8:22 ` Jens Axboe
2008-04-15  9:11   ` Fabio Checconi
2008-04-15 12:42     ` Jens Axboe
2008-04-15 18:08       ` Fabio Checconi
2008-04-16  6:48   ` Paolo Valente
2008-04-18  1:26   ` Aaron Carroll
2008-04-16 18:44 ` Pavel Machek
2008-04-17  6:14   ` Paolo Valente
2008-04-17  7:10     ` Jens Axboe
2008-04-17  8:26       ` Paolo Valente
2008-04-17  8:30         ` Jens Axboe
2008-04-17  9:24           ` Paolo Valente
2008-04-17  9:27             ` Jens Axboe
2008-04-17 10:19               ` Aaron Carroll
2008-04-17 10:21                 ` Jens Axboe
2008-04-17 11:30               ` Fabio Checconi
2008-04-17 15:19           ` Avi Kivity
2008-04-17 15:47             ` Paolo Valente
2008-04-17 15:51               ` Avi Kivity
2008-04-17 18:12                 ` Paolo Valente
2008-04-17 23:44             ` Aaron Carroll
2008-04-17 10:24         ` Aaron Carroll
2008-04-17 11:14           ` Fabio Checconi
2008-04-17 12:14             ` Aaron Carroll
2008-04-17 13:54               ` Jens Axboe
2008-04-17 15:18               ` Paolo Valente
2008-04-17  8:48       ` Pavel Machek
2008-04-17  8:57         ` Jens Axboe
2008-04-17  9:14   ` Fabio Checconi
