LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Andrew Morton <akpm@osdl.org>
To: Nick Piggin <npiggin@suse.de>, Hugh Dickins <hugh@veritas.com>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>,
	Linux Filesystems <linux-fsdevel@vger.kernel.org>,
	Linux Memory Management <linux-mm@kvack.org>
Subject: Re: [patch 0/9] buffered write deadlock fix
Date: Tue, 30 Jan 2007 12:55:58 -0800	[thread overview]
Message-ID: <20070130125558.ae9119b0.akpm@osdl.org> (raw)
In-Reply-To: <20070129081905.23584.97878.sendpatchset@linux.site>

On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
Nick Piggin <npiggin@suse.de> wrote:

> The following set of patches attempt to fix the buffered write
> locking problems 

y'know, four or five years back I fixed this bug by doing

	current->locked_page = page;

in the write() code, and then teaching the pagefault code to avoid locking
the same page.  Patch below.

But then evil mean Hugh pointed out that the patch is still vulnerable to
ab/ba deadlocking so I dropped it.

But Hugh lied!  There is no ab/ba deadlock because both a and b need
i_mutex to get into the write() path.

This approach doesn't fix the writev() performance regresson which
nobody has measured yet but which the NFS guys reported.

But I think with this fix in place we can go back to a modified version of
the 2.6.17 filemap.c code and get that performance back, but I haven't
thought about that.

It's a heck of a lot simpler than your patches though ;)





There is a longstanding problem in which the kernel can deadlock if an
application performs a write(bf, buf, n), and `buf' points at a mmapped
page of `fd', and that page is the pagecache page into which this write
will write, and the page is not yet present.

The kernel will put a new page into pagecache, lock it, run
copy_from_user().  The copy will fault and filemap_nopage() will try to
lock the page in preparation for reading it in.  But it was already
locked up in generic_file_write().

We have had a workaround in there - touch the userspace address by hand
(to fault it in) before locking the pagecache page, and hope that it
doesn't get evicted before we try to lock it.

But there's a place in the kmap_atomic-for-generic_file_write code
where this race is detectable.  I put a printk in there and saw a
stream of them during heavy testing.  So the workaround doesn't work.


What this patch does is

- within generic_file_write(): record the locked page in current->locked_page

- within filemap_nopage, look to see if the page which we're faulting
  in is equal to current->locked_page.

  If it is, taken special action: instead of locking the page,
  reading it and unlocking it, we assume that the page is already
  locked.  Bring it uptodate and relock it before returning.


I tested this by disabling the fault-it-in-by-hand workaround, writing
a special test app and adding several printks.  Not pretty, but it does
the job.




 include/linux/sched.h |    2 ++
 mm/filemap.c          |   26 ++++++++++++++++++++++----
 2 files changed, 24 insertions(+), 4 deletions(-)

--- linux-mnm/mm/filemap.c~write-deadlock	Thu Oct 31 22:42:38 2002
+++ linux-mnm-akpm/mm/filemap.c	Thu Oct 31 22:42:38 2002
@@ -997,6 +997,7 @@ struct page * filemap_nopage(struct vm_a
 	struct page *page;
 	unsigned long size, pgoff, endoff;
 	int did_readahead;
+	struct page *locked_page = current->locked_page;
 
 	pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
 	endoff = ((area->vm_end - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
@@ -1092,10 +1093,12 @@ no_cached_page:
 
 page_not_uptodate:
 	inc_page_state(pgmajfault);
-	lock_page(page);
+	if (page != locked_page)
+		lock_page(page);
 
 	/* Did it get unhashed while we waited for it? */
 	if (!page->mapping) {
+		BUG_ON(page == locked_page);	/* can't happen: i_sem */
 		unlock_page(page);
 		page_cache_release(page);
 		goto retry_all;
@@ -1103,12 +1106,15 @@ page_not_uptodate:
 
 	/* Did somebody else get it up-to-date? */
 	if (PageUptodate(page)) {
-		unlock_page(page);
+		if (page != locked_page)
+			unlock_page(page);
 		goto success;
 	}
 
 	if (!mapping->a_ops->readpage(file, page)) {
 		wait_on_page_locked(page);
+		if (page == locked_page)
+			lock_page(page);
 		if (PageUptodate(page))
 			goto success;
 	}
@@ -1119,10 +1125,12 @@ page_not_uptodate:
 	 * because there really aren't any performance issues here
 	 * and we need to check for errors.
 	 */
-	lock_page(page);
+	if (page != locked_page)
+		lock_page(page);
 
 	/* Somebody truncated the page on us? */
 	if (!page->mapping) {
+		BUG_ON(page == locked_page);	/* can't happen: i_sem */
 		unlock_page(page);
 		page_cache_release(page);
 		goto retry_all;
@@ -1130,12 +1138,15 @@ page_not_uptodate:
 
 	/* Somebody else successfully read it in? */
 	if (PageUptodate(page)) {
-		unlock_page(page);
+		if (page != locked_page)
+			unlock_page(page);
 		goto success;
 	}
 	ClearPageError(page);
 	if (!mapping->a_ops->readpage(file, page)) {
 		wait_on_page_locked(page);
+		if (page == locked_page)
+			lock_page(page);
 		if (PageUptodate(page))
 			goto success;
 	}
@@ -1445,6 +1456,11 @@ void remove_suid(struct dentry *dentry)
 	}
 }
 
+/*
+ * There is rare deadlock scenario in which the source and dest pages are
+ * the same, and the VM evicts the page at the wrong time.   Here we set
+ * current->locked_page to signal that to special-case code in filemap_nopage
+ */
 static inline int
 filemap_copy_from_user(struct page *page, unsigned long offset,
 			const char *buf, unsigned bytes)
@@ -1459,7 +1475,9 @@ filemap_copy_from_user(struct page *page
 	if (left != 0) {
 		/* Do it the slow way */
 		kaddr = kmap(page);
+		current->locked_page = page;
 		left = __copy_from_user(kaddr + offset, buf, bytes);
+		current->locked_page = NULL;
 		kunmap(page);
 	}
 	return left;
--- linux-mnm/include/linux/sched.h~write-deadlock	Thu Oct 31 22:42:38 2002
+++ linux-mnm-akpm/include/linux/sched.h	Thu Oct 31 22:42:38 2002
@@ -266,6 +266,7 @@ extern struct user_struct root_user;
 
 typedef struct prio_array prio_array_t;
 struct backing_dev_info;
+struct page;
 
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
@@ -387,6 +388,7 @@ struct task_struct {
 	void *journal_info;
 	struct dentry *proc_dentry;
 	struct backing_dev_info *backing_dev_info;
+	struct page *locked_page;
 };
 
 extern void __put_task_struct(struct task_struct *tsk);

.


  parent reply	other threads:[~2007-01-30 20:56 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-01-29 10:31 Nick Piggin
2007-01-29 10:31 ` [patch 1/9] fs: libfs buffered write leak fix Nick Piggin
2007-02-02 23:52   ` Andrew Morton
2007-02-03  1:33     ` Nick Piggin
2007-02-03  1:58       ` Andrew Morton
2007-02-03  2:09         ` Nick Piggin
2007-02-03  2:19           ` Andrew Morton
2007-02-03  2:28             ` Nick Piggin
2007-02-03 17:49       ` Jörn Engel
2007-02-04  3:55         ` Nick Piggin
2007-01-29 10:31 ` [patch 2/9] mm: revert "generic_file_buffered_write(): handle zero length iovec segments" Nick Piggin
2007-01-29 10:32 ` [patch 3/9] mm: revert "generic_file_buffered_write(): deadlock on vectored write" Nick Piggin
2007-01-29 10:32 ` [patch 4/9] mm: generic_file_buffered_write cleanup Nick Piggin
2007-01-29 10:32 ` [patch 5/9] mm: debug write deadlocks Nick Piggin
2007-01-29 10:32 ` [patch 6/9] mm: be sure to trim blocks Nick Piggin
2007-01-29 10:32 ` [patch 7/9] mm: cleanup pagecache insertion operations Nick Piggin
2007-01-29 10:32 ` [patch 8/9] mm: generic_file_buffered_write iovec cleanup Nick Piggin
2007-01-29 10:33 ` [patch 9/9] mm: fix pagecache write deadlocks Nick Piggin
2007-01-29 11:11   ` Nick Piggin
2007-02-02 23:53   ` Andrew Morton
2007-02-03  1:38     ` Nick Piggin
2007-01-30 20:55 ` Andrew Morton [this message]
2007-01-30 23:21   ` [patch 0/9] buffered write deadlock fix Andrew Morton
2007-01-31  1:31     ` Nick Piggin
2007-01-31  0:32   ` Nick Piggin
2007-02-02 23:52 ` Andrew Morton
2007-02-03  1:22   ` Nick Piggin
2007-02-03  6:43   ` Suparna Bhattacharya
     [not found]   ` <370516630.03363@ustc.edu.cn>
2007-02-03 15:31     ` Fengguang Wu
2007-02-04  8:49 Nick Piggin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070130125558.ae9119b0.akpm@osdl.org \
    --to=akpm@osdl.org \
    --cc=hugh@veritas.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=npiggin@suse.de \
    --subject='Re: [patch 0/9] buffered write deadlock fix' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).