LKML Archive on lore.kernel.org
* [patch 1/3] fix illogical behavior in balance_dirty_pages()
@ 2007-03-24 21:55 Miklos Szeredi
2007-03-24 21:57 ` [patch 2/3] remove throttle_vm_writeout() Miklos Szeredi
` (3 more replies)
0 siblings, 4 replies; 21+ messages in thread
From: Miklos Szeredi @ 2007-03-24 21:55 UTC (permalink / raw)
To: akpm; +Cc: dgc, linux-kernel
This is a slightly different take on the fix for the deadlock in fuse
with dirty balancing. David Chinner convinced me that per-bdi
counters are too expensive, and that it's not worth trying to account
the number of pages under writeback, as they will be limited by the
queue anyway.
----
From: Miklos Szeredi <mszeredi@suse.cz>
Current behavior of balance_dirty_pages() is to try to start writeout
into the specified queue for at least "write_chunk" number of pages.
If "write_chunk" pages have been submitted, then return.
However if there are less than "write_chunk" dirty pages for this
queue, then it doesn't return, waiting for the global dirty+writeback
counters to subside, but without doing any actual work.
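For reference, the pre-patch loop being described has roughly this shape (a simplified sketch, not the exact source; the helpers named are the real ones from that kernel, but details and error paths are approximate):
        long pages_written = 0;
        long write_chunk = sync_writeback_pages();  /* ~1.5 * ratelimit_pages */

        for (;;) {
                struct writeback_control wbc = {
                        .bdi         = mapping->backing_dev_info,
                        .sync_mode   = WB_SYNC_NONE,
                        .nr_to_write = write_chunk,
                };
                long background_thresh, dirty_thresh, nr_reclaimable;

                get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
                nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
                                 global_page_state(NR_UNSTABLE_NFS);
                if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
                                dirty_thresh)
                        break;

                /*
                 * Only this branch makes progress; once this queue has no
                 * dirty pages left, the loop just sleeps below, waiting
                 * for the global counters to drop.
                 */
                if (nr_reclaimable) {
                        writeback_inodes(&wbc);
                        pages_written += write_chunk - wbc.nr_to_write;
                        if (pages_written >= write_chunk)
                                break;          /* We've done our duty */
                }
                congestion_wait(WRITE, HZ/10);
        }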
This is illogical behavior: it allows more dirtyings while there are
dirty pages, but stops further dirtying completely if there are no
more dirty pages.
It also makes a deadlock possible when one filesystem is writing data
through another, and the balance_dirty_pages() for the lower
filesystem is stalling the writeback for the upper filesystem's
data (*).
So the exit condition should instead be:
  submitted at least "write_chunk" number of pages
OR
  submitted ALL the dirty pages destined for this backing dev
    AND
  backing dev is not congested
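In code, the new exit checks inside the loop amount to roughly the following (a sketch of the logic only, gathered in one place; the real hunks are in the patch below):
        if (pages_written >= write_chunk)
                break;  /* wrote our quota for this call */
        if (!wbc.nr_dirty && !bdi_write_congested(bdi))
                break;  /* no dirty pages left for this backing dev */
        if (!nr_reclaimable && !bdi_write_congested(bdi))
                break;  /* nothing reclaimable anywhere, queue is idle */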
To do this, introduce a new counter in writeback_control, which counts
the number of dirty pages encountered during writeback. This includes
all dirty pages, even those which are already under writeback but have
been dirtied again, and those which have been skipped due to having
locked buffers.
If this counter is zero after trying to submit some pages for
writeback, and the backing dev is uncongested, then don't wait any
more. After this, newly dirtied pages can quickly be written back to
this backing dev.
If there are globally no more pages to submit for writeback
(nr_reclaimable == 0), then also don't wait for ever, only while this
backing dev is congested.
(*) For more info on this deadlock, see the following discussions:
http://lkml.org/lkml/2007/3/1/9
http://lkml.org/lkml/2007/3/12/16
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
Index: linux/include/linux/writeback.h
===================================================================
--- linux.orig/include/linux/writeback.h 2007-03-24 22:06:56.000000000 +0100
+++ linux/include/linux/writeback.h 2007-03-24 22:29:02.000000000 +0100
@@ -44,6 +44,7 @@ struct writeback_control {
long nr_to_write; /* Write this many pages, and decrement
this for each page written */
long pages_skipped; /* Pages which were not written */
+ long nr_dirty; /* Number of dirty pages encountered */
/*
* For a_ops->writepages(): is start or end are non-zero then this is
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c 2007-03-24 22:06:56.000000000 +0100
+++ linux/mm/page-writeback.c 2007-03-24 22:29:02.000000000 +0100
@@ -207,7 +207,15 @@ static void balance_dirty_pages(struct a
* written to the server's write cache, but has not yet
* been flushed to permanent storage.
*/
- if (nr_reclaimable) {
+ if (!nr_reclaimable) {
+ /*
+ * If there's nothing more to write back and this queue
+ * is uncongested, then it is possible to quickly
+ * write out some more data, so let's not wait
+ */
+ if (!bdi_write_congested(bdi))
+ break;
+ } else {
writeback_inodes(&wbc);
get_dirty_limits(&background_thresh,
&dirty_thresh, mapping);
@@ -220,6 +228,14 @@ static void balance_dirty_pages(struct a
pages_written += write_chunk - wbc.nr_to_write;
if (pages_written >= write_chunk)
break; /* We've done our duty */
+
+ /*
+ * If there are no more dirty pages for this backing
+ * dev, and the queue is not congested, then
+ * it is possible to quickly write out some more data
+ */
+ if (!wbc.nr_dirty && !bdi_write_congested(bdi))
+ break;
}
congestion_wait(WRITE, HZ/10);
}
@@ -619,6 +635,7 @@ retry:
min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
unsigned i;
+ wbc->nr_dirty += nr_pages;
scanned = 1;
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
* [patch 2/3] remove throttle_vm_writeout()
2007-03-24 21:55 [patch 1/3] fix illogical behavior in balance_dirty_pages() Miklos Szeredi
@ 2007-03-24 21:57 ` Miklos Szeredi
2007-03-25 23:41 ` Andrew Morton
2007-03-24 21:58 ` [patch 3/3] balance dirty pages from loop device Miklos Szeredi
` (2 subsequent siblings)
3 siblings, 1 reply; 21+ messages in thread
From: Miklos Szeredi @ 2007-03-24 21:57 UTC (permalink / raw)
To: akpm; +Cc: dgc, linux-kernel
From: Miklos Szeredi <mszeredi@suse.cz>
Remove this function. Its purpose was to limit the global number of
writeback pages submitted by direct reclaim. But this is equally
well accomplished by the limited queue lengths. When this function was
added, the device queues had much larger default lengths (8192
requests; now it's 128), causing problems.
When writable shared mapping support is added to fuse, this function
could cause a deadlock if the userspace filesystem needs to
allocate memory while writing back dirty pages.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
Index: linux/include/linux/writeback.h
===================================================================
--- linux.orig/include/linux/writeback.h 2007-03-24 22:07:00.000000000 +0100
+++ linux/include/linux/writeback.h 2007-03-24 22:28:52.000000000 +0100
@@ -85,7 +85,6 @@ static inline void wait_on_inode(struct
int wakeup_pdflush(long nr_pages);
void laptop_io_completion(void);
void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
/* These are exported to sysctl. */
extern int dirty_background_ratio;
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c 2007-03-24 22:07:00.000000000 +0100
+++ linux/mm/page-writeback.c 2007-03-24 22:28:52.000000000 +0100
@@ -312,37 +312,6 @@ void balance_dirty_pages_ratelimited_nr(
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
-void throttle_vm_writeout(gfp_t gfp_mask)
-{
- long background_thresh;
- long dirty_thresh;
-
- if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
- /*
- * The caller might hold locks which can prevent IO completion
- * or progress in the filesystem. So we cannot just sit here
- * waiting for IO to complete.
- */
- congestion_wait(WRITE, HZ/10);
- return;
- }
-
- for ( ; ; ) {
- get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
-
- /*
- * Boost the allowable dirty threshold a bit for page
- * allocators so they don't get DoS'ed by heavy writers
- */
- dirty_thresh += dirty_thresh / 10; /* wheeee... */
-
- if (global_page_state(NR_UNSTABLE_NFS) +
- global_page_state(NR_WRITEBACK) <= dirty_thresh)
- break;
- congestion_wait(WRITE, HZ/10);
- }
-}
-
/*
* writeback at least _min_pages, and keep writing until the amount of dirty
* memory is less than the background threshold, or until we're all clean.
Index: linux/mm/vmscan.c
===================================================================
--- linux.orig/mm/vmscan.c 2007-03-24 22:06:53.000000000 +0100
+++ linux/mm/vmscan.c 2007-03-24 22:07:03.000000000 +0100
@@ -952,8 +952,6 @@ static unsigned long shrink_zone(int pri
}
}
- throttle_vm_writeout(sc->gfp_mask);
-
atomic_dec(&zone->reclaim_in_progress);
return nr_reclaimed;
}
* [patch 3/3] balance dirty pages from loop device
2007-03-24 21:55 [patch 1/3] fix illogical behavior in balance_dirty_pages() Miklos Szeredi
2007-03-24 21:57 ` [patch 2/3] remove throttle_vm_writeout() Miklos Szeredi
@ 2007-03-24 21:58 ` Miklos Szeredi
2007-03-25 10:03 ` [patch 1/3] fix illogical behavior in balance_dirty_pages() Peter Zijlstra
2007-03-25 23:35 ` Andrew Morton
3 siblings, 0 replies; 21+ messages in thread
From: Miklos Szeredi @ 2007-03-24 21:58 UTC (permalink / raw)
To: akpm; +Cc: dgc, linux-kernel
From: Miklos Szeredi <mszeredi@suse.cz>
The function do_lo_send_aops() should call
balance_dirty_pages_ratelimited() after each page similarly to
generic_file_buffered_write().
Without this, writing the loop device directly (not through a
filesystem) is very slow, and also slows the whole system down,
because nr_dirty is constantly over the limit.
Beware: this patch without the fix to balance_dirty_pages() makes a
loopback mounted filesystem prone to deadlock.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
Index: linux/drivers/block/loop.c
===================================================================
--- linux.orig/drivers/block/loop.c 2007-03-24 21:00:40.000000000 +0100
+++ linux/drivers/block/loop.c 2007-03-24 22:07:06.000000000 +0100
@@ -275,6 +275,8 @@ static int do_lo_send_aops(struct loop_d
pos += size;
unlock_page(page);
page_cache_release(page);
+ balance_dirty_pages_ratelimited(mapping);
+ cond_resched();
}
ret = 0;
out:
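For comparison, the generic_file_buffered_write() path that the changelog above refers to does roughly the following per page (a simplified sketch; error handling and details omitted):
        do {
                /* ... copy user data into the page cache page ... */
                status = a_ops->commit_write(file, page, offset, offset + bytes);
                unlock_page(page);
                mark_page_accessed(page);
                page_cache_release(page);

                balance_dirty_pages_ratelimited(mapping);
                cond_resched();
        } while (count);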
* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
2007-03-24 21:55 [patch 1/3] fix illogical behavior in balance_dirty_pages() Miklos Szeredi
2007-03-24 21:57 ` [patch 2/3] remove throttle_vm_writeout() Miklos Szeredi
2007-03-24 21:58 ` [patch 3/3] balance dirty pages from loop device Miklos Szeredi
@ 2007-03-25 10:03 ` Peter Zijlstra
2007-03-25 11:12 ` Miklos Szeredi
2007-03-25 23:35 ` Andrew Morton
3 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2007-03-25 10:03 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: akpm, dgc, linux-kernel
On Sat, 2007-03-24 at 22:55 +0100, Miklos Szeredi wrote:
> This is a slightly different take on the fix for the deadlock in fuse
> with dirty balancing. David Chinner convinced me, that per-bdi
> counters are too expensive, and that it's not worth trying to account
> the number of pages under writeback, as they will be limited by the
> queue anyway.
>
Please have a look at this:
http://lkml.org/lkml/2007/3/19/220
* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
2007-03-25 10:03 ` [patch 1/3] fix illogical behavior in balance_dirty_pages() Peter Zijlstra
@ 2007-03-25 11:12 ` Miklos Szeredi
2007-03-25 11:34 ` Miklos Szeredi
0 siblings, 1 reply; 21+ messages in thread
From: Miklos Szeredi @ 2007-03-25 11:12 UTC (permalink / raw)
To: a.p.zijlstra; +Cc: akpm, dgc, linux-kernel
> On Sat, 2007-03-24 at 22:55 +0100, Miklos Szeredi wrote:
> > This is a slightly different take on the fix for the deadlock in fuse
> > with dirty balancing. David Chinner convinced me, that per-bdi
> > counters are too expensive, and that it's not worth trying to account
> > the number of pages under writeback, as they will be limited by the
> > queue anyway.
> >
>
> Please have a look at this:
> http://lkml.org/lkml/2007/3/19/220
> + if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
> + bdi_thresh)
> + break;
>
Yes, this will resolve the deadlock as well, where balance_dirty_pages()
is currently looping forever with:
bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) == 0
Thanks,
Miklos
* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
2007-03-25 11:12 ` Miklos Szeredi
@ 2007-03-25 11:34 ` Miklos Szeredi
2007-03-25 11:50 ` Peter Zijlstra
0 siblings, 1 reply; 21+ messages in thread
From: Miklos Szeredi @ 2007-03-25 11:34 UTC (permalink / raw)
To: a.p.zijlstra; +Cc: akpm, dgc, linux-kernel
> >
> > Please have a look at this:
> > http://lkml.org/lkml/2007/3/19/220
>
>
>
> > + if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
> > + bdi_thresh)
> > + break;
> >
>
> Yes, this will resolve the deadlock as well, where balance_dirty_pages()
> is currently looping forever with:
Almost.
This
> - if (nr_reclaimable) {
> + if (bdi_nr_reclaimable) {
> writeback_inodes(&wbc);
still makes it loop forever if bdi_nr_reclaimable == 0, since the exit
condition is not checked.
Shouldn't it break out of the loop if bdi_stat(bdi, BDI_WRITEBACK) <=
bdi_thresh in this case?
Thanks,
Miklos
* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
2007-03-25 11:34 ` Miklos Szeredi
@ 2007-03-25 11:50 ` Peter Zijlstra
2007-03-25 20:41 ` Miklos Szeredi
0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2007-03-25 11:50 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: akpm, dgc, linux-kernel
On Sun, 2007-03-25 at 13:34 +0200, Miklos Szeredi wrote:
> > >
> > > Please have a look at this:
> > > http://lkml.org/lkml/2007/3/19/220
> >
> >
> >
> > > + if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
> > > + bdi_thresh)
> > > + break;
> > >
> >
> > Yes, this will resolve the deadlock as well, where balance_dirty_pages()
> > is currently looping forever with:
>
> Almost.
>
> This
>
> > - if (nr_reclaimable) {
> > + if (bdi_nr_reclaimable) {
> > writeback_inodes(&wbc);
>
> still makes it loop forever if bdi_nr_reclaimable == 0, since the exit
> condition is not checked.
>
> Shouldn't it break out of the loop if bdi_stat(bdi, BDI_WRITEBACK) <=
> bdi_thresh in this case?
for (;;) {
        struct writeback_control wbc = {
                .bdi             = bdi,
                .sync_mode       = WB_SYNC_NONE,
                .older_than_this = NULL,
                .nr_to_write     = write_chunk,
                .range_cyclic    = 1,
        };

        get_dirty_limits(&background_thresh, &dirty_thresh,
                         &bdi_thresh, bdi);
        bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
                             bdi_stat(bdi, BDI_UNSTABLE);
(A)     if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
                        bdi_thresh)
                break;

        /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
         * Unstable writes are a feature of certain networked
         * filesystems (i.e. NFS) in which data may have been
         * written to the server's write cache, but has not yet
         * been flushed to permanent storage.
         */
(B)     if (bdi_nr_reclaimable) {
                writeback_inodes(&wbc);

                get_dirty_limits(&background_thresh, &dirty_thresh,
                                 &bdi_thresh, bdi);
                bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
                                     bdi_stat(bdi, BDI_UNSTABLE);
(C)             if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
                                bdi_thresh)
                        break;

                pages_written += write_chunk - wbc.nr_to_write;
                if (pages_written >= write_chunk)
                        break;          /* We've done our duty */
        }
        congestion_wait(WRITE, HZ/10);
}
I'm thinking that if bdi_nr_reclaimable == 0, A reduces to
bdi_stat(bdi, BDI_WRITEBACK) <= bdi_thresh and we're still out of the
loop, no?
* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
2007-03-25 11:50 ` Peter Zijlstra
@ 2007-03-25 20:41 ` Miklos Szeredi
0 siblings, 0 replies; 21+ messages in thread
From: Miklos Szeredi @ 2007-03-25 20:41 UTC (permalink / raw)
To: a.p.zijlstra; +Cc: akpm, dgc, linux-kernel
> for (;;) {
>         struct writeback_control wbc = {
>                 .bdi             = bdi,
>                 .sync_mode       = WB_SYNC_NONE,
>                 .older_than_this = NULL,
>                 .nr_to_write     = write_chunk,
>                 .range_cyclic    = 1,
>         };
>
>         get_dirty_limits(&background_thresh, &dirty_thresh,
>                          &bdi_thresh, bdi);
>         bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
>                              bdi_stat(bdi, BDI_UNSTABLE);
> (A)     if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
>                         bdi_thresh)
>                 break;
>
>         /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>          * Unstable writes are a feature of certain networked
>          * filesystems (i.e. NFS) in which data may have been
>          * written to the server's write cache, but has not yet
>          * been flushed to permanent storage.
>          */
> (B)     if (bdi_nr_reclaimable) {
>                 writeback_inodes(&wbc);
>
>                 get_dirty_limits(&background_thresh, &dirty_thresh,
>                                  &bdi_thresh, bdi);
>                 bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
>                                      bdi_stat(bdi, BDI_UNSTABLE);
> (C)             if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
>                                 bdi_thresh)
>                         break;
>
>                 pages_written += write_chunk - wbc.nr_to_write;
>                 if (pages_written >= write_chunk)
>                         break;          /* We've done our duty */
>         }
>         congestion_wait(WRITE, HZ/10);
> }
>
> I'm thinking that if bdi_nr_reclaimable == 0, A reduces to
> bdi_stat(bdi, BDI_WRITEBACK) <= bdi_thresh and we're still out of the
> loop, no?
>
Yep, sorry.
I'll review this patch; there seem to be odd things, like calculating
dirty_thresh and then never using it...
Btw, would you mind measuring the performance of my patch with the
slow/fast device mix, to see how it compares to your patchset and the
vanilla kernel?
Thanks,
Miklos
* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
2007-03-24 21:55 [patch 1/3] fix illogical behavior in balance_dirty_pages() Miklos Szeredi
` (2 preceding siblings ...)
2007-03-25 10:03 ` [patch 1/3] fix illogical behavior in balance_dirty_pages() Peter Zijlstra
@ 2007-03-25 23:35 ` Andrew Morton
2007-03-26 8:26 ` Miklos Szeredi
3 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2007-03-25 23:35 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: dgc, linux-kernel
On Sat, 24 Mar 2007 22:55:29 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> This is a slightly different take on the fix for the deadlock in fuse
> with dirty balancing. David Chinner convinced me, that per-bdi
> counters are too expensive, and that it's not worth trying to account
> the number of pages under writeback, as they will be limited by the
> queue anyway.
>
> ----
> From: Miklos Szeredi <mszeredi@suse.cz>
>
> Current behavior of balance_dirty_pages() is to try to start writeout
> into the specified queue for at least "write_chunk" number of pages.
> If "write_chunk" pages have been submitted, then return.
>
> However if there are less than "write_chunk" dirty pages for this
> queue, then it doesn't return, waiting for the global dirty+writeback
> counters to subside, but without doing any actual work.
>
> This is illogical behavior: it allows more dirtyings while there are
> dirty pages, but stops further dirtying completely if there are no
> more dirty pages.
That behaviour is perfectly logical. It prevents the number of
dirty+writeback pages from exceeding dirty_ratio.
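Roughly, the limit being referred to is derived like this (simplified from get_dirty_limits(); the unmapped-ratio clamping and per-task boosts are left out):
        available_memory  = vm_total_pages;     /* minus highmem, roughly */
        background_thresh = (dirty_background_ratio * available_memory) / 100;
        dirty_thresh      = (vm_dirty_ratio * available_memory) / 100;

        /*
         * balance_dirty_pages() then throttles the dirtier while
         * nr_dirty + nr_unstable + nr_writeback > dirty_thresh.
         */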
> It also makes a deadlock possible when one filesystem is writing data
> through another, and the balance_dirty_pages() for the lower
> filesystem is stalling the writeback for the upper filesystem's
> data (*).
I still don't understand this one. I got lost when belatedly told that
i_mutex had something to do with it.
> So the exit condition should instead be:
>
> submitted at least "write_chunk" number of pages
> OR
> submitted ALL the dirty pages destined for this backing dev
> AND
> backing dev is not congested
>
> To do this, introduce a new counter in writeback_control, which counts
> the number of dirty pages encountered during writeback. This includes
> all dirty pages, even those which are already under writeback but have
> been dirtied again, and those which have been skipped due to having
> locked buffers.
>
> If this counter is zero after trying to submit some pages for
> writeback, and the backing dev is uncongested, then don't wait any
> more. After this, newly dirtied pages can quickly be written back to
> this backing dev.
>
> If there are globally no more pages to submit for writeback
> (nr_reclaimable == 0), then also don't wait for ever, only while this
> backing dev is congested.
>
> (*) For more info on this deadlock, see the following discussions:
>
> http://lkml.org/lkml/2007/3/1/9
> http://lkml.org/lkml/2007/3/12/16
>
> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
> ---
>
> Index: linux/include/linux/writeback.h
> ===================================================================
> --- linux.orig/include/linux/writeback.h 2007-03-24 22:06:56.000000000 +0100
> +++ linux/include/linux/writeback.h 2007-03-24 22:29:02.000000000 +0100
> @@ -44,6 +44,7 @@ struct writeback_control {
> long nr_to_write; /* Write this many pages, and decrement
> this for each page written */
> long pages_skipped; /* Pages which were not written */
> + long nr_dirty; /* Number of dirty pages encountered */
>
> /*
> * For a_ops->writepages(): is start or end are non-zero then this is
> Index: linux/mm/page-writeback.c
> ===================================================================
> --- linux.orig/mm/page-writeback.c 2007-03-24 22:06:56.000000000 +0100
> +++ linux/mm/page-writeback.c 2007-03-24 22:29:02.000000000 +0100
> @@ -207,7 +207,15 @@ static void balance_dirty_pages(struct a
> * written to the server's write cache, but has not yet
> * been flushed to permanent storage.
> */
> - if (nr_reclaimable) {
> + if (!nr_reclaimable) {
> + /*
> + * If there's nothing more to write back and this queue
> + * is uncongested, then it is possible to quickly
> + * write out some more data, so let's not wait
> + */
> + if (!bdi_write_congested(bdi))
> + break;
> + } else {
This says "if there are no dirty pages in the machine at all, then go back
and dirty some more pages, regardless of the present number of
under-writeback pages".
> writeback_inodes(&wbc);
> get_dirty_limits(&background_thresh,
> &dirty_thresh, mapping);
> @@ -220,6 +228,14 @@ static void balance_dirty_pages(struct a
> pages_written += write_chunk - wbc.nr_to_write;
> if (pages_written >= write_chunk)
> break; /* We've done our duty */
> +
> + /*
> + * If there are no more dirty pages for this backing
> + * dev, and the queue is not congested, then
> + * it is possible to quickly write out some more data
> + */
> + if (!wbc.nr_dirty && !bdi_write_congested(bdi))
> + break;
This says "if there are no pages dirty againt this device then go back and
dirty some more pages, regardless of the present number of under-writeback
pages".
IOW: this change will allow us to 100% fill all memory with under-writeback
pages.
The VM _should_ cope with that - it kinda used to. But it is an untested
region of operation and the chances of bogus oom-killings are excellent.
* Re: [patch 2/3] remove throttle_vm_writeout()
2007-03-24 21:57 ` [patch 2/3] remove throttle_vm_writeout() Miklos Szeredi
@ 2007-03-25 23:41 ` Andrew Morton
2007-03-26 8:35 ` Miklos Szeredi
0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2007-03-25 23:41 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: dgc, linux-kernel
On Sat, 24 Mar 2007 22:57:34 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> From: Miklos Szeredi <mszeredi@suse.cz>
>
> Remove this function. Its purpose was to limit the global number of
> writeback pages submitted by direct reclaim. But this is equally
> well accomplished by the limited queue lengths. When this function was
> added, the device queues had much larger default lengths (8192
> requests; now it's 128), causing problems.
This changelog is wrong.
Yes, the _default_ depth of CFQ was decreased. But it is trivial for the
user to increase the queue depth again, and we'd prefer that the VM not
shit itself in response.
Plus, more significantly, the queue is per-disk. A system with many disks
will easily be able to cover all memory with under-writeback pages.
Which is why the level of under-writeback memory is controlled at a higher
level.
These variables are why we must not rely upon per-queue throttling in
VFS/MM/VM.
All that being said, the patch (with a new, correct changelog) is OK. This
is because we now control the amount of dirty memory in the machine by
running balance_dirty_pages() at first-write-fault time. So we can no
longer get into the situation where all memory is dirty+writeback.
That being said, this patch is only correct if we don't apply "[patch 1/3]
fix illogical behavior in balance_dirty_pages()". If we _do_ apply that
patch then we can again get all memory under-writeback and we again need
throttle_vm_writeout().
* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
2007-03-25 23:35 ` Andrew Morton
@ 2007-03-26 8:26 ` Miklos Szeredi
2007-03-26 9:01 ` Andrew Morton
0 siblings, 1 reply; 21+ messages in thread
From: Miklos Szeredi @ 2007-03-26 8:26 UTC (permalink / raw)
To: akpm; +Cc: dgc, linux-kernel
> > This is a slightly different take on the fix for the deadlock in fuse
> > with dirty balancing. David Chinner convinced me, that per-bdi
> > counters are too expensive, and that it's not worth trying to account
> > the number of pages under writeback, as they will be limited by the
> > queue anyway.
> >
> > ----
> > From: Miklos Szeredi <mszeredi@suse.cz>
> >
> > Current behavior of balance_dirty_pages() is to try to start writeout
> > into the specified queue for at least "write_chunk" number of pages.
> > If "write_chunk" pages have been submitted, then return.
> >
> > However if there are less than "write_chunk" dirty pages for this
> > queue, then it doesn't return, waiting for the global dirty+writeback
> > counters to subside, but without doing any actual work.
> >
> > This is illogical behavior: it allows more dirtyings while there are
> > dirty pages, but stops further dirtying completely if there are no
> > more dirty pages.
>
> That behaviour is perfectly logical. It prevents the number of
> dirty+writeback pages from exceeding dirty_ratio.
But it does it in an illogical way. While it's cleaning your pages,
it allows more dirtyings, but when it has cleaned ALL the pages
destined for this queue, you hit a wall and stand still waiting for
other queues to get their act together even if this queue is
_completely idle_.
I think this accounts for 90% of the problems experienced with copying
to slow devices.
> > It also makes a deadlock possible when one filesystem is writing data
> > through another, and the balance_dirty_pages() for the lower
> > filesystem is stalling the writeback for the upper filesystem's
> > data (*).
>
> I still don't understand this one. I got lost when belatedly told that
> i_mutex had something to do with it.
This deadlock only happens if there's some bottleneck for writing
data to the lower filesystem. This bottleneck could be
- i_mutex, preventing parallel writes to the same inode
- limited number of filesystem threads
- limited request queue length in the upper filesystem
Imagine it this way: balance_dirty_pages() for the lower filesystem is
stalling a write() because dirty pages in the upper filesystem are
over the limit. Because there's a bottleneck for writing to the lower
filesystem, this is stalling _other_ writes from completing. So
there's no progress in writing back pages from the upper filesystem.
The problem is not that dirty pages are not being submitted for
writeback, rather that the writeback itself is stalling.
Hmm?
> > Index: linux/mm/page-writeback.c
> > ===================================================================
> > --- linux.orig/mm/page-writeback.c 2007-03-24 22:06:56.000000000 +0100
> > +++ linux/mm/page-writeback.c 2007-03-24 22:29:02.000000000 +0100
> > @@ -207,7 +207,15 @@ static void balance_dirty_pages(struct a
> > * written to the server's write cache, but has not yet
> > * been flushed to permanent storage.
> > */
> > - if (nr_reclaimable) {
> > + if (!nr_reclaimable) {
> > + /*
> > + * If there's nothing more to write back and this queue
> > + * is uncongested, then it is possible to quickly
> > + * write out some more data, so let's not wait
> > + */
> > + if (!bdi_write_congested(bdi))
> > + break;
> > + } else {
>
> This says "if there are no dirty pages in the machine at all, then go back
> and dirty some more pages, regardless of the present number of
> under-writeback pages".
No, this says that "there are no dirty pages in the machine at all,
and this queue is uncongested, so you can safely go back and dirty
some more data, because we are sure it can be written back quickly".
Ditto for the case, when there are no more dirty pages destined for
this queue.
I understand that this can fill up the memory with under-writeback
pages, but only if the data sitting in all the device queues is
comparable to the total memory. I don't know what the realistic
chance of that is but David Chinner convinced me, that it's not
something that happens in real life. Quoting him from an earlier
mail:
| Right, and most ppl don't have enough devices in their system for
| this to be a problem. Even those of us that do have enough devices
| for this to potentially be a problem usually have enough RAM in
| the machine so that it is not a problem....
Miklos
* Re: [patch 2/3] remove throttle_vm_writeout()
2007-03-25 23:41 ` Andrew Morton
@ 2007-03-26 8:35 ` Miklos Szeredi
0 siblings, 0 replies; 21+ messages in thread
From: Miklos Szeredi @ 2007-03-26 8:35 UTC (permalink / raw)
To: akpm; +Cc: dgc, linux-kernel
> >
> > Remove this function. Its purpose was to limit the global number of
> > writeback pages submitted by direct reclaim. But this is equally
> > well accomplished by the limited queue lengths. When this function was
> > added, the device queues had much larger default lengths (8192
> > requests; now it's 128), causing problems.
>
> This changelog is wrong.
>
> Yes, the _default_ depth of CFQ was decreased. But it is trivial for the
> user to inrcrease the queue depth again, and we'd prefer that the VM not
> shit itself in response.
>
> Plus, more significantly, the queue is per-disk. A system with many disks
> will easily be able to cover all memory with under-writeback pages.
> Which is why the level of under-writeback memory is controlled at a higher
> level.
>
> These variables are why we must not rely upon per-queue throttling in
> VFS/MM/VM.
>
>
> All that being said, the patch (with a new, correct changelog) is OK. This
> is because we now control the amount of dirty memory in the machine by
> running balance_dirty_pages() at first-write-fault time. So we can no
> longer get into the situation where all memory is dirty+writeback.
Well, sort of. I think you are not taking SwapCache pages into
account, which just pop up out of nothing to be turned into writeback
pages.
So I think removing this function still relies on limited queues, or
some other method of limiting writeback.
Miklos
* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
2007-03-26 8:26 ` Miklos Szeredi
@ 2007-03-26 9:01 ` Andrew Morton
2007-03-26 9:20 ` Miklos Szeredi
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Andrew Morton @ 2007-03-26 9:01 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: dgc, linux-kernel
On Mon, 26 Mar 2007 10:26:18 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > This is a slightly different take on the fix for the deadlock in fuse
> > > with dirty balancing. David Chinner convinced me, that per-bdi
> > > counters are too expensive, and that it's not worth trying to account
> > > the number of pages under writeback, as they will be limited by the
> > > queue anyway.
> > >
> > > ----
> > > From: Miklos Szeredi <mszeredi@suse.cz>
> > >
> > > Current behavior of balance_dirty_pages() is to try to start writeout
> > > into the specified queue for at least "write_chunk" number of pages.
> > > If "write_chunk" pages have been submitted, then return.
> > >
> > > However if there are less than "write_chunk" dirty pages for this
> > > queue, then it doesn't return, waiting for the global dirty+writeback
> > > counters to subside, but without doing any actual work.
> > >
> > > This is illogical behavior: it allows more dirtyings while there are
> > > dirty pages, but stops further dirtying completely if there are no
> > > more dirty pages.
> >
> > That behaviour is perfectly logical. It prevents the number of
> > dirty+writeback pages from exceeding dirty_ratio.
>
> But it does it in an illogical way. While it's cleaning your pages,
> it allows more dirtyings, but when it has cleaned ALL the pages
> destined for this queue, you hit a wall and stand still waiting for
> other queues to get their act together even if this queue is
> _completely idle_.
Memory is a machine-wide resource, not a per-queue resource.
If we have hit the dirty+writeback limit, it is logical to throttle
dirtiers until the levels subside.
> I think this accounts for 90% of the problems experienced with copying
> to slow devices.
It is related, but this observation doesn't actually strengthen your
argument. If memory is full of dirty-or-writeback pages against a USB
drive, we don't want to go letting writers to other devices go and dirty
even more memory. We instead need to clamp down harder on the writer to
the USB drive.
> > > It also makes a deadlock possible when one filesystem is writing data
> > > through another, and the balance_dirty_pages() for the lower
> > > filesystem is stalling the writeback for the upper filesystem's
> > > data (*).
> >
> > I still don't understand this one. I got lost when belatedly told that
> > i_mutex had something to do with it.
>
> This deadlock only happens, if there's some bottleneck for writing
> data to the lower filesystem. This bottleneck could be
>
> - i_mutex, preventing parallel writes to the same inode
> - limited number of filesystem threads
> - limited request queue length in the upper filesystem
>
> Imagine it this way: balance_dirty_pages() for the lower filesystem is
> stalling a write() because dirty pages in the upper filesystem are
> over the limit. Because there's a bottleneck for writing to the lower
> filesystem, this is stalling _other_ writes from completing. So
> there's no progress in writing back pages from the upper filesystem.
You mean that someone is stuck in balance_dirty_pages() against the lower
fs while holding locks which prevent writes into the upper fs from
succeeding?
Draw us a picture ;)
> The problem is not that dirty pages are not being submitted for
> writeback, rather that the writeback itself is stalling.
>
> Hmm?
>
> > > Index: linux/mm/page-writeback.c
> > > ===================================================================
> > > --- linux.orig/mm/page-writeback.c 2007-03-24 22:06:56.000000000 +0100
> > > +++ linux/mm/page-writeback.c 2007-03-24 22:29:02.000000000 +0100
> > > @@ -207,7 +207,15 @@ static void balance_dirty_pages(struct a
> > > * written to the server's write cache, but has not yet
> > > * been flushed to permanent storage.
> > > */
> > > - if (nr_reclaimable) {
> > > + if (!nr_reclaimable) {
> > > + /*
> > > + * If there's nothing more to write back and this queue
> > > + * is uncongested, then it is possible to quickly
> > > + * write out some more data, so let's not wait
> > > + */
> > > + if (!bdi_write_congested(bdi))
> > > + break;
> > > + } else {
> >
> > This says "if there are no dirty pages in the machine at all, then go back
> > and dirty some more pages, regardless of the present number of
> > under-writeback pages".
>
> No, this says, that "there are no dirty pages in the machine at all,
> and this queue is uncongested, so you can safely go back and dirty
> some more data, because we are sure it can be written back quickly".
The queue congestion thing is meaningless, and I ignored it.
> Ditto for the case, when there are no more dirty pages destined for
> this queue.
>
> I understand, that this can fill up the memory with under writeback
> pages, but only if the data sitting in all the device queues is
> comparable to the total memory. I don't know what the realistic
> chance of that is but David Chinner convinced me, that it's not
> something that happens in real life. Quoting him from an earlier
> mail:
>
> | Right, and most ppl don't have enough devices in their system for
> | this to be a problem. Even those of us that do have enough devices
> | for this to potentially be a problem usually have enough RAM in
> | the machine so that it is not a problem....
>
David is overoptimistic ;)
Two scsi disks on a 256MB machine will put us over the edge, and that's
even before the user has gone and fiddled with the knobs.
Then there's NFS which for a long time had basically unbounded queue
lengths. That's allegedly been fixed, but I haven't seen testing results
which convince me of that.
Then there's CIFS, SMBFS, v9fs, DM, MD, whatever. Plenty of things in
there which can screw things up.
<ponders scsi_tgt_alloc_queue()>
* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
2007-03-26 9:01 ` Andrew Morton
@ 2007-03-26 9:20 ` Miklos Szeredi
2007-03-26 9:32 ` Andrew Morton
2007-03-26 9:32 ` Miklos Szeredi
2007-03-27 0:23 ` David Chinner
2 siblings, 1 reply; 21+ messages in thread
From: Miklos Szeredi @ 2007-03-26 9:20 UTC (permalink / raw)
To: akpm; +Cc: dgc, linux-kernel
> > > > It also makes a deadlock possible when one filesystem is writing data
> > > > through another, and the balance_dirty_pages() for the lower
> > > > filesystem is stalling the writeback for the upper filesystem's
> > > > data (*).
> > >
> > > I still don't understand this one. I got lost when belatedly told that
> > > i_mutex had something to do with it.
> >
> > This deadlock only happens, if there's some bottleneck for writing
> > data to the lower filesystem. This bottleneck could be
> >
> > - i_mutex, preventing parallel writes to the same inode
> > - limited number of filesystem threads
> > - limited request queue length in the upper filesystem
> >
> > Imagine it this way: balance_dirty_pages() for the lower filesystem is
> > stalling a write() because dirty pages in the upper filesystem are
> > over the limit. Because there's a bottleneck for writing to the lower
> > filesystem, this is stalling _other_ writes from completing. So
> > there's no progress in writing back pages from the upper filesystem.
>
> You mean that someone is stuck in balance_dirty_pages() against the lower
> fs while holding locks which prevent writes into the upper fs from
> succeeding?
>
> Draw us a picture ;)
Well, not a picture, but a sort of indented call trace:
[some process, which has a fuse file writably mmaped]
write fault on upper filesystem
balance_dirty_pages
loop...
submit write requests
---------------------------------
[fuse loopback fs thread 1]
read request from /dev/fuse
sys_write
mutex_lock(i_mutex)
...
copy data to page cache
balance_dirty_pages
loop ...
submit write requests
write requests completed ...
dirty still over limit ...
... loop forever
[fuse loopback fs thread 2]
read request from /dev/fuse
sys_write
mutex_lock(i_mutex) blocks
The lower filesystem (e.g. ext3) has completed the single write
request that was sent to it, and then it's just looping in
balance_dirty_pages. The upper (fuse) filesystem has all the dirty
data (over the threshold), either still dirty or waiting in the
request queue as writeback.
Does this help?
Miklos
* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
2007-03-26 9:20 ` Miklos Szeredi
@ 2007-03-26 9:32 ` Andrew Morton
2007-03-26 9:48 ` Miklos Szeredi
0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2007-03-26 9:32 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: dgc, linux-kernel
On Mon, 26 Mar 2007 11:20:11 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > > > It also makes a deadlock possible when one filesystem is writing data
> > > > > through another, and the balance_dirty_pages() for the lower
> > > > > filesystem is stalling the writeback for the upper filesystem's
> > > > > data (*).
> > > >
> > > > I still don't understand this one. I got lost when belatedly told that
> > > > i_mutex had something to do with it.
> > >
> > > This deadlock only happens, if there's some bottleneck for writing
> > > data to the lower filesystem. This bottleneck could be
> > >
> > > - i_mutex, preventing parallel writes to the same inode
> > > - limited number of filesystem threads
> > > - limited request queue length in the upper filesystem
> > >
> > > Imagine it this way: balance_dirty_pages() for the lower filesystem is
> > > stalling a write() because dirty pages in the upper filesystem are
> > > over the limit. Because there's a bottleneck for writing to the lower
> > > filesystem, this is stalling _other_ writes from completing. So
> > > there's no progress in writing back pages from the upper filesystem.
> >
> > You mean that someone is stuck in balance_dirty_pages() against the lower
> > fs while holding locks which prevent writes into the upper fs from
> > succeeding?
> >
> > Draw us a picture ;)
>
> Well, not a picture, but a sort of indented call trace:
>
> [some process, which has a fuse file writably mmaped]
> write fault on upper filesystem
> balance_dirty_pages
> loop...
> submit write requests
This, I assume, is the upper fs
> ---------------------------------
> [fuse loopback fs thread 1]
> read request from /dev/fuse
> sys_write
> mutex_lock(i_mutex)
> ...
> copy data to page cache
> balance_dirty_pages
> loop ...
> submit write requests
> write requests completed ...
> dirty still over limit ...
> ... loop forever
>
> [fuse loopback fs thread 2]
> read request from /dev/fuse
> sys_write
> mutex_lock(i_mutex) blocks
And these, I assume, are handling what you term the lower fs.
>
> The lower filesystem (e.g. ext3) has completed the single write
> request that was sent to it, and then it's just looping in
> balance_dirty_pages. The upper (fuse) filesystem has all the dirty
> data (over the threshold), either still dirty or waiting in the
> request queue as writeback.
>
> Does this help?
yup.
Interesting problem. I don't suppose that it'd be appreciated if I were to
commend the use of O_DIRECT for handling the lower fs ;)
Let me think about that a bit, after I've made the latest shitpile people
have inflicted upon me begin to look like it has a chance of compiling.
* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
2007-03-26 9:01 ` Andrew Morton
2007-03-26 9:20 ` Miklos Szeredi
@ 2007-03-26 9:32 ` Miklos Szeredi
2007-03-26 10:08 ` Andrew Morton
2007-03-27 0:23 ` David Chinner
2 siblings, 1 reply; 21+ messages in thread
From: Miklos Szeredi @ 2007-03-26 9:32 UTC (permalink / raw)
To: akpm; +Cc: dgc, linux-kernel
> > > > This is a slightly different take on the fix for the deadlock in fuse
> > > > with dirty balancing. David Chinner convinced me, that per-bdi
> > > > counters are too expensive, and that it's not worth trying to account
> > > > the number of pages under writeback, as they will be limited by the
> > > > queue anyway.
> > > >
> > > > ----
> > > > From: Miklos Szeredi <mszeredi@suse.cz>
> > > >
> > > > Current behavior of balance_dirty_pages() is to try to start writeout
> > > > into the specified queue for at least "write_chunk" number of pages.
> > > > If "write_chunk" pages have been submitted, then return.
> > > >
> > > > However if there are less than "write_chunk" dirty pages for this
> > > > queue, then it doesn't return, waiting for the global dirty+writeback
> > > > counters to subside, but without doing any actual work.
> > > >
> > > > This is illogical behavior: it allows more dirtyings while there are
> > > > dirty pages, but stops further dirtying completely if there are no
> > > > more dirty pages.
> > >
> > > That behaviour is perfectly logical. It prevents the number of
> > > dirty+writeback pages from exceeding dirty_ratio.
> >
> > But it does it in an illogical way. While it's cleaning your pages,
> > it allows more dirtyings, but when it has cleaned ALL the pages
> > destined for this queue, you hit a wall and stand still waiting for
> > other queues to get their act together even if this queue is
> > _completely idle_.
>
> Memory is a machine-wide resource, not a per-queue resource.
>
> If we have hit the dirty+writeback limit, it is logical to throttle
> dirtiers until the levels subside.
It is logical to throttle, but it's not logical to stall.
> > I think this accounts for 90% of the problems experienced with copying
> > to slow devices.
>
> It is related, but this observation doesn't actually strengthen your
> argument. If memory is full of dirty-or-writeback pages against a USB
> drive, we don't want to go letting writers to other devices go and dirty
> even more memory. We instead need to clamp down harder on the writer to
> the USB drive.
Yes, if you can notice that the USB drive is really slow, _before_ the
writer fills all the allotted memory, then that's good. I don't know
how well Peter's patch does that, but I'll have a look.
But even if you don't want to bother with limiting per-bdi dirty
numbers based on device speed, it is more logical to allow unrelated
queues to function at a steady state than to stop them completely.
Stopping writers which have idle queues is completely unproductive,
and that is basically what the current algorithm does.
Miklos
* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
2007-03-26 9:32 ` Andrew Morton
@ 2007-03-26 9:48 ` Miklos Szeredi
0 siblings, 0 replies; 21+ messages in thread
From: Miklos Szeredi @ 2007-03-26 9:48 UTC (permalink / raw)
To: akpm; +Cc: miklos, dgc, linux-kernel
> > Well, not a picture, but a sort of indented call trace:
> >
> > [some process, which has a fuse file writably mmaped]
> > write fault on upper filesystem
> > balance_dirty_pages
> > loop...
> > submit write requests
>
> This, I assume, is the upper fs
>
> > ---------------------------------
> > [fuse loopback fs thread 1]
> > read request from /dev/fuse
> > sys_write
> > mutex_lock(i_mutex)
> > ...
> > copy data to page cache
> > balance_dirty_pages
> > loop ...
> > submit write requests
> > write requests completed ...
> > dirty still over limit ...
> > ... loop forever
> >
> > [fuse loopback fs thread 2]
> > read request from /dev/fuse
> > sys_write
> > mutex_lock(i_mutex) blocks
>
> And these, I assume, are handling what you term the lower fs.
All of this is actually the upper, fuse filesystem (maybe except the
filemap/page-writeback parts, which could be argued to belong to the
lower, ext3 filesystem).
Above the line is the kernel part, below the line is the userspace
part.
> > The lower filesystem (e.g. ext3) has completed the single write
> > request that was sent to it, and then it's just looping in
> > balance_dirty_pages. The upper (fuse) filesystem has all the dirty
> > data (over the threshold), either still dirty or waiting in the
> > request queue as writeback.
> >
> > Does this help?
>
> yup.
>
> Interesting problem. I don't suppose that it'd be appreciated if I were to
> commend the use of O_DIRECT for handling the lower fs ;)
Not really ;) One of the goals of fuse is to not restrict the
filesystem daemon in any way. There are zillions of ways to work
around this deadlock, but I don't want it to be even
theoretically possible.
> Let me think about that a bit, after I've made the latest shitpile people
> have inflicted upon me begin to look like it has a chance of compiling.
OK.
Thanks,
Miklos
* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
2007-03-26 9:32 ` Miklos Szeredi
@ 2007-03-26 10:08 ` Andrew Morton
2007-03-26 13:30 ` Peter Zijlstra
0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2007-03-26 10:08 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: dgc, linux-kernel
On Mon, 26 Mar 2007 11:32:47 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:
> Stopping writers which have idle queues is completely unproductive,
> and that is basically what the current algorithm does.
This is because the kernel permits all of its allotment of dirty+writeback
pages to be dirty+writeback against a single device.
A good way of solving the one-device-starves-another-one problem is to
dynamically adjust the per-device dirty+writeback levels so that (for
example) if two devices are being written to, each gets 50% of the
allotment.
I started working on that but got derailed by the usual blah. I don't
think either of the proposed fixes took that approach, actually.
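A minimal sketch of that idea, assuming a hypothetical helper bdi_writeout_fraction() that reports what share of recent writeout a given device accounted for (illustrative only; neither Andrew's nor Peter's actual code):
        static long bdi_dirty_limit(struct backing_dev_info *bdi, long dirty_thresh)
        {
                long numerator, denominator;

                /* hypothetical: fraction of recent writeout done by this bdi */
                bdi_writeout_fraction(bdi, &numerator, &denominator);

                /* this device's share of the global dirty+writeback allotment */
                return dirty_thresh * numerator / denominator;
        }
Two devices being written to at the same rate would then each converge on roughly half of the allotment, which is what Peter reports his patch does below.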
* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
2007-03-26 10:08 ` Andrew Morton
@ 2007-03-26 13:30 ` Peter Zijlstra
2007-03-27 0:30 ` David Chinner
0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2007-03-26 13:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: Miklos Szeredi, dgc, linux-kernel
On Mon, 2007-03-26 at 02:08 -0800, Andrew Morton wrote:
> On Mon, 26 Mar 2007 11:32:47 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > Stopping writers which have idle queues is completely unproductive,
> > and that is basically what the current algorithm does.
>
> This is because the kernel permits all of its allotment of dirty+writeback
> pages to be dirty+writeback against a single device.
>
> A good way of solving the one-device-starves-another-one problem is to
> dynamically adjust the per-device dirty+writeback levels so that (for
> example) if two devices are being written to, each gets 50% of the
> allotment.
This is exactly what happens with my patch if both devices write at the
same speed. (Or at least, that is what is supposed to happen ;-)
* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
2007-03-26 9:01 ` Andrew Morton
2007-03-26 9:20 ` Miklos Szeredi
2007-03-26 9:32 ` Miklos Szeredi
@ 2007-03-27 0:23 ` David Chinner
2 siblings, 0 replies; 21+ messages in thread
From: David Chinner @ 2007-03-27 0:23 UTC (permalink / raw)
To: Andrew Morton; +Cc: Miklos Szeredi, dgc, linux-kernel
On Mon, Mar 26, 2007 at 01:01:24AM -0800, Andrew Morton wrote:
> On Mon, 26 Mar 2007 10:26:18 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:
> > Ditto for the case, when there are no more dirty pages destined for
> > this queue.
> >
> > I understand, that this can fill up the memory with under writeback
> > pages, but only if the data sitting in all the device queues is
> > comparable to the total memory. I don't know what the realistic
> > chance of that is but David Chinner convinced me, that it's not
> > something that happens in real life. Quoting him from an earlier
> > mail:
> >
> > | Right, and most ppl don't have enough devices in their system for
> > | this to be a problem. Even those of us that do have enough devices
> > | for this to potentially be a problem usually have enough RAM in
> > | the machine so that it is not a problem....
> >
>
> David is overoptimistic ;)
My excuse is that I have trouble thinking in units of memory smaller
than a GB. ;)
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
2007-03-26 13:30 ` Peter Zijlstra
@ 2007-03-27 0:30 ` David Chinner
0 siblings, 0 replies; 21+ messages in thread
From: David Chinner @ 2007-03-27 0:30 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Andrew Morton, Miklos Szeredi, dgc, linux-kernel
On Mon, Mar 26, 2007 at 03:30:27PM +0200, Peter Zijlstra wrote:
> On Mon, 2007-03-26 at 02:08 -0800, Andrew Morton wrote:
> > On Mon, 26 Mar 2007 11:32:47 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > > Stopping writers which have idle queues is completely unproductive,
> > > and that is basically what the current algorithm does.
> >
> > This is because the kernel permits all of its allotment of dirty+writeback
> > pages to be dirty+writeback against a single device.
> >
> > A good way of solving the one-device-starves-another-one problem is to
> > dynamically adjust the per-device dirty+writeback levels so that (for
> > example) if two devices are being written to, each gets 50% of the
> > allotment.
>
> This is exactly what happens with my patch if both devices write at the
> same speed. (Or at least, that is what is supposed to happen ;-)
The testing that I did of Peter's patch showed that this cache
splitting works as advertised for multiple devices writing at
the same speed.
(http://marc.info/?l=linux-kernel&m=117437686328396&w=2)
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group