From: Miklos Szeredi <miklos@szeredi.hu>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: [patch 03/22] fix deadlock in balance_dirty_pages
Date: Wed, 28 Feb 2007 00:14:45 +0100
Message-ID: <20070227231549.597468815@szeredi.hu>
In-Reply-To: 20070227231442.627972152@szeredi.hu

[-- Attachment #1: dirty_balancing_fix.patch --]
[-- Type: text/plain, Size: 3829 bytes --]

From: Miklos Szeredi <mszeredi@suse.cz>

This deadlock happens when dirty pages from one filesystem are
written back through another filesystem.  It is easiest to demonstrate
with fuse, although it could affect loopback mounts as well (see the
following patches).

Let's call the filesystems A(bove) and B(elow).  Process Pr_a is
writing to A, and process Pr_b is writing to B.

Pr_a is bash-shared-mapping.  Pr_b is the fuse filesystem daemon
(fusexmp_fh).  For simplicity, let's assume that Pr_b is single
threaded.

These are the simplified stack traces of these processes after the
deadlock:

Pr_a (bash-shared-mapping):

  (block on queue)
  fuse_writepage
  generic_writepages
  writeback_inodes
  balance_dirty_pages
  balance_dirty_pages_ratelimited_nr
  set_page_dirty_mapping_balance
  do_no_page


Pr_b (fusexmp_fh):

  io_schedule_timeout
  congestion_wait
  balance_dirty_pages
  balance_dirty_pages_ratelimited_nr
  generic_file_buffered_write
  generic_file_aio_write
  ext3_file_write
  do_sync_write
  vfs_write
  sys_pwrite64


Thanks to the aggressive nature of Pr_a, it can happen that

  nr_file_dirty > dirty_thresh + margin

This is due to both nr_dirty growing and dirty_thresh shrinking, which
in turn is due to nr_file_mapped rapidly growing.  The exact size of
the margin at which the deadlock happens is not known, but it's around
100 pages.

At this point Pr_a enters balance_dirty_pages and starts to write back
some of its dirty pages.  After submitting some requests, it blocks
on the request queue.

The first write request will trigger Pr_b to perform a write()
syscall.  This will submit a write request to the block device and
then may enter balance_dirty_pages().

The condition for exiting balance_dirty_pages() is

 - either that write_chunk pages have been written

 - or nr_file_dirty + nr_writeback < dirty_thresh

It is entirely possible that fewer than write_chunk pages were written,
in which case balance_dirty_pages() will not exit even after all the
submitted requests have been successfully completed.

Which means that the write() syscall does not return.

Which means that no more dirty pages from A will be written back, and
neither nr_writeback nor nr_file_dirty will decrease.

Which means that balance_dirty_pages() will loop forever.

Q.E.D.

The solution is to exit balance_dirty_pages() on the condition that
there are only a few dirty + writeback pages for this backing dev.  This
makes sure that there is always some progress with this setup.

The number of outstanding dirty + writeback pages is limited to 8, which
means that when over the threshold (dirty_exceeded == 1), each
filesystem may only effectively pin a maximum of 16 (+8 because of
ratelimiting) extra pages.

Note: a similar safety vent is always needed if there's a global limit
for the dirty+writeback pages, even if in the future there will be
some per-queue (or other) soft limit.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
+++ linux/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
@@ -201,6 +201,17 @@ static void balance_dirty_pages(struct a
 		if (!dirty_exceeded)
 			dirty_exceeded = 1;
 
+		/*
+		 * Acquit producer of dirty pages if there's little or
+		 * nothing to write back to this particular queue.
+		 *
+		 * Without this check a deadlock is possible if
+		 * one filesystem is writing data through another.
+		 */
+		if (atomic_long_read(&bdi->nr_dirty) +
+		    atomic_long_read(&bdi->nr_writeback) < 8)
+			break;
+
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
 		 * filesystems (i.e. NFS) in which data may have been

--
