LKML Archive on lore.kernel.org
From: Miklos Szeredi <miklos@szeredi.hu>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: [patch 04/22] fix deadlock in throttle_vm_writeout
Date: Wed, 28 Feb 2007 00:14:46 +0100
Message-ID: <20070227231553.047834932@szeredi.hu>
In-Reply-To: <20070227231442.627972152@szeredi.hu>
[-- Attachment #1: throttle_vm_writeout_fix.patch --]
[-- Type: text/plain, Size: 4480 bytes --]
From: Miklos Szeredi <mszeredi@suse.cz>
This deadlock is similar to the one in balance_dirty_pages, but
instead of waiting in balance_dirty_pages after submitting a write
request, it happens during a memory allocation for filesystem B before
submitting a write request.
It is easy to reproduce on a machine with little memory.
E.g. try this on 2.6.21-rc1 UML with 32MB (it also works on physical
hardware):
dd if=/dev/zero of=/tmp/tmp.img bs=1048576 count=40
mke2fs -j -F /tmp/tmp.img
mkdir /tmp/img
mount -oloop /tmp/tmp.img /tmp/img
bash-shared-mapping /tmp/img/foo 30000000
The deadlock doesn't happen immediately; sometimes it only shows up
after a few minutes.
Simplified stack trace for bash-shared-mapping after the deadlock:
io_schedule_timeout
congestion_wait
balance_dirty_pages
balance_dirty_pages_ratelimited_nr
generic_file_buffered_write
__generic_file_aio_write_nolock
generic_file_aio_write
ext3_file_write
do_sync_write
vfs_write
sys_pwrite64
and for [loop0]:
io_schedule_timeout
congestion_wait
throttle_vm_writeout
shrink_zone
shrink_zones
try_to_free_pages
__alloc_pages
find_or_create_page
do_lo_send_aops
lo_send
do_bio_filebacked
loop_thread
The requirement for the deadlock is that
nr_writeback > dirty_thresh * 1.1 + margin
Again, margin seems to be in the 100-page range.
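For illustration only, the condition above can be written as a small predicate. This is a user-space sketch, not kernel code: deadlock_possible() and its parameters are hypothetical names, and the 1.1 factor is expressed in integer arithmetic as dirty_thresh + dirty_thresh / 10, which is an assumption about how the threshold is boosted.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not a kernel function): evaluates
 *     nr_writeback > dirty_thresh * 1.1 + margin
 * with all quantities given in pages.
 */
static bool deadlock_possible(long nr_writeback, long dirty_thresh, long margin)
{
	assert(dirty_thresh >= 0 && margin >= 0);

	/* dirty_thresh * 1.1, kept in integer arithmetic (assumed form) */
	long boosted = dirty_thresh + dirty_thresh / 10;

	return nr_writeback > boosted + margin;
}
```

With dirty_thresh = 1000 and margin = 100, the predicate only becomes true once nr_writeback exceeds 1200 pages.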
The task of throttle_vm_writeout is to limit the rate at which
under-writeback pages are created due to swapping. There's no other
way direct reclaim can increase nr_writeback + nr_file_dirty.
So when there are few or no pages under swap writeback, it is safe for
this function to return. This ensures that writeback of dirty pages
keeps making progress.
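A rough user-space model of the resulting decision, for one pass through the loop, might look like the following. should_throttle(), SWAP_WRITEBACK_MIN and the parameter names are hypothetical; the cutoff of 16 is taken from the patch below, and the 10% boost of dirty_thresh is assumed from the deadlock condition quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Cutoff taken from the patch; the macro name itself is made up here. */
#define SWAP_WRITEBACK_MIN 16

/*
 * Hypothetical model of one iteration of throttle_vm_writeout().
 * nr_writeback_total stands for NR_UNSTABLE_NFS + NR_WRITEBACK.
 * Returns true if the caller would go on to congestion_wait().
 */
static bool should_throttle(long nr_swap_writeback, long nr_writeback_total,
			    long dirty_thresh)
{
	assert(nr_swap_writeback >= 0 && nr_writeback_total >= 0);

	/* Few or no pages under swap writeback: never wait. */
	if (nr_swap_writeback < SWAP_WRITEBACK_MIN)
		return false;

	/* Assumed 10% boost of the threshold for page allocators. */
	dirty_thresh += dirty_thresh / 10;

	/* At or below the boosted threshold: no need to wait. */
	return nr_writeback_total > dirty_thresh;
}
```

In this model the loop thread from the trace above, which allocates memory while no swap writeback is in flight, returns immediately no matter how far nr_writeback_total is above the threshold.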
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
Index: linux/include/linux/swap.h
===================================================================
--- linux.orig/include/linux/swap.h 2007-02-27 14:40:55.000000000 +0100
+++ linux/include/linux/swap.h 2007-02-27 14:41:08.000000000 +0100
@@ -279,10 +279,14 @@ static inline void disable_swap_token(vo
put_swap_token(swap_token_mm);
}
+#define nr_swap_writeback \
+ atomic_long_read(&swapper_space.backing_dev_info->nr_writeback)
+
#else /* CONFIG_SWAP */
#define total_swap_pages 0
#define total_swapcache_pages 0UL
+#define nr_swap_writeback 0UL
#define si_swapinfo(val) \
do { (val)->freeswap = (val)->totalswap = 0; } while (0)
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c 2007-02-27 14:41:07.000000000 +0100
+++ linux/mm/page-writeback.c 2007-02-27 14:41:08.000000000 +0100
@@ -33,6 +33,7 @@
#include <linux/syscalls.h>
#include <linux/buffer_head.h>
#include <linux/pagevec.h>
+#include <linux/swap.h>
/*
* The maximum number of pages to writeout in a single bdflush/kupdate
@@ -303,6 +304,21 @@ void throttle_vm_writeout(void)
long dirty_thresh;
for ( ; ; ) {
+ /*
+ * If there's no swapping going on, don't throttle.
+ *
+ * Starting writeback against mapped pages shouldn't
+ * be a problem, as that doesn't increase the
+ * sum of dirty + writeback.
+ *
+ * Without this, a deadlock is possible (also see
+ * comment in balance_dirty_pages). This has been
+ * observed with running bash-shared-mapping on a
+ * loopback mount.
+ */
+ if (nr_swap_writeback < 16)
+ break;
+
get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
/*
@@ -314,6 +330,7 @@ void throttle_vm_writeout(void)
if (global_page_state(NR_UNSTABLE_NFS) +
global_page_state(NR_WRITEBACK) <= dirty_thresh)
break;
+
congestion_wait(WRITE, HZ/10);
}
}
Index: linux/mm/page_io.c
===================================================================
--- linux.orig/mm/page_io.c 2007-02-27 14:40:55.000000000 +0100
+++ linux/mm/page_io.c 2007-02-27 14:41:08.000000000 +0100
@@ -70,6 +70,7 @@ static int end_swap_bio_write(struct bio
ClearPageReclaim(page);
}
end_page_writeback(page);
+ atomic_long_dec(&swapper_space.backing_dev_info->nr_writeback);
bio_put(bio);
return 0;
}
@@ -121,6 +122,7 @@ int swap_writepage(struct page *page, st
if (wbc->sync_mode == WB_SYNC_ALL)
rw |= (1 << BIO_RW_SYNC);
count_vm_event(PSWPOUT);
+ atomic_long_inc(&swapper_space.backing_dev_info->nr_writeback);
set_page_writeback(page);
unlock_page(page);
submit_bio(rw, bio);
--