LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Chris Wright <chrisw@sous-sol.org>
To: linux-kernel@vger.kernel.org, stable@kernel.org
Cc: Justin Forbes <jmforbes@linuxtx.org>,
	Zwane Mwaikambo <zwane@arm.linux.org.uk>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Randy Dunlap <rdunlap@xenotime.net>,
	Dave Jones <davej@redhat.com>,
	Chuck Wolber <chuckw@quantumlinux.com>,
	Chris Wedgwood <reviews@ml.cw.f00f.org>,
	Michael Krufky <mkrufky@linuxtv.org>,
	torvalds@osdl.org, akpm@osdl.org, alan@lxorguk.ukuu.org.uk,
	Linus Torvalds <torvalds@macmini.osdl.org>,
	Nick Piggin <nickpiggin@yahoo.com.au>,
	Martin J Bligh <mbligh@google.com>,
	Martin Michlmayr <tbm@cyrius.com>,
	Martin Johansson <martin@fatbob.nu>, Ingo Molnar <mingo@elte.hu>,
	Andrei Popa <andrei.popa@i-neo.ro>,
	High Dickins <hugh@veritas.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Segher Boessenkool <segher@kernel.crashing.org>,
	David Miller <davem@davemloft.net>,
	Arjan van de Ven <arjan@infradead.org>,
	Gordon Farquharson <gordonfarquharson@gmail.com>,
	Guillaume Chazarain <guichaz@yahoo.fr>,
	Kenneth Cheng <kenneth.w.chen@intel.com>,
	Tobias Diedrich <ranma@tdiedrich.de>
Subject: [patch 38/50] VM: Fix nasty and subtle race in shared mmaped page writeback
Date: Fri, 05 Jan 2007 18:28:31 -0800	[thread overview]
Message-ID: <20070106023542.365678000@sous-sol.org> (raw)
In-Reply-To: <20070106022753.334962000@sous-sol.org>

[-- Attachment #1: vm-fix-nasty-and-subtle-race-in-shared-mmap-ed-page-writeback.patch --]
[-- Type: text/plain, Size: 4512 bytes --]

-stable review patch.  If anyone has any objections, please let us know.
------------------

From: Linus Torvalds <torvalds@macmini.osdl.org>

The VM layer (on the face of it, fairly reasonably) expected that when
it does a ->writepage() call to the filesystem, it would write out the
full page at that point in time.  Especially since it had earlier marked
the whole page dirty with "set_page_dirty()".

But that isn't actually the case: ->writepage() does not actually write
a page, it writes the parts of the page that have been explicitly marked
dirty before, *and* that had not got written out for other reasons since
the last time we told it they were dirty.

That last caveat is the important one.

Which _most_ of the time ends up being the whole page (since we had
called "set_page_dirty()" on the page earlier), but if the filesystem
had done any dirty flushing of its own (for example, to honor some
internal write ordering guarantees), it might end up doing only a
partial page IO (or none at all) when ->writepage() is actually called.

That is the correct thing in general (since we actually often _want_
only the known-dirty parts of the page to be written out), but the
shared dirty page handling had implicitly forgotten about these details,
and had a number of cases where it was doing just the "->writepage()"
part, without telling the low-level filesystem that the whole page might
have been re-dirtied as part of being mapped writably into user space.

Since most of the time the FS did actually write out the full page, we
didn't notice this for a loong time, and this needed some really odd
patterns to trigger.  But it caused occasional corruption with rtorrent
and with the Debian "apt" database, because both use shared mmaps to
update the end result.

This fixes it. Finally. After way too much hair-pulling.

Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Martin J. Bligh <mbligh@google.com>
Acked-by: Martin Michlmayr <tbm@cyrius.com>
Acked-by: Martin Johansson <martin@fatbob.nu>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Andrei Popa <andrei.popa@i-neo.ro>
Cc: High Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@osdl.org>,
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Gordon Farquharson <gordonfarquharson@gmail.com>
Cc: Guillaume Chazarain <guichaz@yahoo.fr>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Kenneth Cheng <kenneth.w.chen@intel.com>
Cc: Tobias Diedrich <ranma@tdiedrich.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[chrisw: backport to 2.6.19.1]
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
---
 mm/page-writeback.c |   39 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 34 insertions(+), 5 deletions(-)

--- linux-2.6.19.1.orig/mm/page-writeback.c
+++ linux-2.6.19.1/mm/page-writeback.c
@@ -893,12 +893,41 @@ int clear_page_dirty_for_io(struct page 
 {
 	struct address_space *mapping = page_mapping(page);
 
-	if (mapping) {
+	if (mapping && mapping_cap_account_dirty(mapping)) {
+		/*
+		 * Yes, Virginia, this is indeed insane.
+		 *
+		 * We use this sequence to make sure that
+		 *  (a) we account for dirty stats properly
+		 *  (b) we tell the low-level filesystem to
+		 *      mark the whole page dirty if it was
+		 *      dirty in a pagetable. Only to then
+		 *  (c) clean the page again and return 1 to
+		 *      cause the writeback.
+		 *
+		 * This way we avoid all nasty races with the
+		 * dirty bit in multiple places and clearing
+		 * them concurrently from different threads.
+		 *
+		 * Note! Normally the "set_page_dirty(page)"
+		 * has no effect on the actual dirty bit - since
+		 * that will already usually be set. But we
+		 * need the side effects, and it can help us
+		 * avoid races.
+		 *
+		 * We basically use the page "master dirty bit"
+		 * as a serialization point for all the different
+		 * threads doing their things.
+		 *
+		 * FIXME! We still have a race here: if somebody
+		 * adds the page back to the page tables in
+		 * between the "page_mkclean()" and the "TestClearPageDirty()",
+		 * we might have it mapped without the dirty bit set.
+		 */
+		if (page_mkclean(page))
+			set_page_dirty(page);
 		if (TestClearPageDirty(page)) {
-			if (mapping_cap_account_dirty(mapping)) {
-				page_mkclean(page);
-				dec_zone_page_state(page, NR_FILE_DIRTY);
-			}
+			dec_zone_page_state(page, NR_FILE_DIRTY);
 			return 1;
 		}
 		return 0;

--

  parent reply	other threads:[~2007-01-06  2:38 UTC|newest]

Thread overview: 70+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-01-06  2:27 [patch 00/50] -stable review Chris Wright
2007-01-06  2:27 ` [patch 01/50] dm-crypt: Select CRYPTO_CBC Chris Wright
2007-01-06  2:27 ` [patch 02/50] sha512: Fix sha384 block size Chris Wright
2007-01-06  2:27 ` [patch 03/50] read_zero_pagealigned() locking fix Chris Wright
2007-01-06  2:27 ` [patch 04/50] ieee80211softmac: Fix mutex_lock at exit of ieee80211_softmac_get_genie Chris Wright
2007-01-08 15:24   ` John W. Linville
2007-01-06  2:27 ` [patch 05/50] x86-64: Mark rdtsc as sync only for netburst, not for core2 Chris Wright
2007-01-06  2:27 ` [patch 06/50] bonding: incorrect bonding state reported via ioctl Chris Wright
2007-01-06  2:28 ` [patch 07/50] DVB: lgdt330x: fix signal / lock status detection bug Chris Wright
2007-01-06  2:28 ` [patch 08/50] V4L: Fix broken TUNER_LG_NTSC_TAPE radio support Chris Wright
2007-01-06  2:28 ` [patch 09/50] Revert "[PATCH] zd1211rw: Removed unneeded packed attributes" Chris Wright
2007-01-08 15:32   ` John W. Linville
2007-01-06  2:28 ` [patch 10/50] libata: handle 0xff status properly Chris Wright
2007-01-06  2:28 ` [patch 11/50] ieee1394: ohci1394: add PPC_PMAC platform code to driver probe Chris Wright
2007-01-06  2:28 ` [patch 12/50] kbuild: dont put temp files in source Chris Wright
2007-01-06  2:28 ` [patch 13/50] ARM: Add sys_*at syscalls Chris Wright
2007-01-06  2:28 ` [patch 14/50] sched: remove __cpuinitdata anotation to cpu_isolated_map Chris Wright
2007-01-06  2:28 ` [patch 15/50] IB/srp: Fix FMR mapping for 32-bit kernels and addresses above 4G Chris Wright
2007-01-06  2:28 ` [patch 16/50] SCSI: add missing cdb clearing in scsi_execute() Chris Wright
2007-01-06  2:28 ` [patch 17/50] Bluetooth: Add packet size checks for CAPI messages (CVE-2006-6106) Chris Wright
2007-01-06  2:28 ` [patch 18/50] i2c: fix broken ds1337 initialization Chris Wright
2007-01-06  2:28 ` [patch 19/50] sched: fix bad missed wakeups in the i386, x86_64, ia64, ACPI and APM idle code Chris Wright
2007-01-06  2:28 ` [patch 20/50] Fix for shmem_truncate_range() BUG_ON() Chris Wright
2007-01-06  2:28 ` [patch 21/50] smc911x: fix netpoll compilation faliure Chris Wright
2007-01-06  2:28 ` [patch 22/50] fix aoe without scatter-gather [Bug 7662] Chris Wright
2007-01-06  2:28 ` [patch 23/50] UDP: Fix reversed logic in udp_get_port() Chris Wright
2007-01-06  2:28 ` [patch 24/50] cciss: fix XFER_READ/XFER_WRITE in do_cciss_request Chris Wright
2007-01-06  2:28 ` [patch 25/50] [stable] [stable patch] i386: CPU hotplug broken with 2GB VMSPLIT Chris Wright
2007-01-06  2:28 ` [patch 26/50] ramfs breaks without CONFIG_BLOCK Chris Wright
2007-01-06  2:28 ` [patch 27/50] Buglet in vmscan.c Chris Wright
2007-01-06  2:28 ` [patch 28/50] softmac: Fixed handling of deassociation from AP Chris Wright
2007-01-08 15:14   ` John W. Linville
2007-01-06  2:28 ` [patch 29/50] zd1211rw: Call ieee80211_rx in tasklet Chris Wright
2007-01-08 15:20   ` John W. Linville
2007-01-06  2:28 ` [patch 30/50] handle ext3 directory corruption better (CVE-2006-6053) Chris Wright
2007-01-06  2:28 ` [patch 31/50] corrupted cramfs filesystems cause kernel oops (CVE-2006-5823) Chris Wright
2007-01-06  2:28 ` [patch 32/50] ext2: skip pages past number of blocks in ext2_find_entry (CVE-2006-6054) Chris Wright
2007-01-06  2:28 ` [patch 33/50] PKTGEN: Fix module load/unload races Chris Wright
2007-01-06  2:28 ` [patch 34/50] SPARC64: Fix "mem=xxx" handling Chris Wright
2007-01-06  2:28 ` [patch 35/50] SPARC64: Handle ISA devices with no regs property Chris Wright
2007-01-06  2:28 ` [patch 36/50] NET: Dont export linux/random.h outside __KERNEL__ Chris Wright
2007-01-06  2:28 ` [patch 37/50] sparc32: add offset in pci_map_sg() Chris Wright
2007-01-06  2:28 ` Chris Wright [this message]
2007-01-06  2:28 ` [patch 39/50] V4L: cx2341x: audio_properties is an u16, not u8 Chris Wright
2007-01-06  2:28 ` [patch 40/50] dvb-core: fix bug in CRC-32 checking on 64-bit systems Chris Wright
2007-01-06  2:28 ` [patch 41/50] V4L: cx88: Fix leadtek_eeprom tagging Chris Wright
2007-01-06  2:28 ` [patch 42/50] ebtables: dont compute gap before checking struct type Chris Wright
2007-01-06  2:28 ` [patch 43/50] SOUND: Sparc CS4231: Fix IRQ return value and initialization Chris Wright
2007-01-06  2:28 ` [patch 44/50] SOUND: Sparc CS4231: Use 64 for period_bytes_min Chris Wright
2007-01-06 14:20   ` [patch 43/50] SOUND: Sparc CS4231: Fix IRQ return value and initialization Jan Engelhardt
2007-01-06  2:28 ` [patch 45/50] IPV4/IPV6: Fix inet{,6} device initialization order Chris Wright
2007-01-06  2:28 ` [patch 46/50] asix: Fix typo for AX88772 PHY Selection Chris Wright
2007-01-06  2:28 ` [patch 47/50] NetLabel: correctly fill in unused CIPSOv4 level and category mappings Chris Wright
2007-01-06  2:28 ` [patch 48/50] connector: some fixes for ia64 unaligned access errors Chris Wright
2007-01-06  2:28 ` [patch 49/50] fix OOM killing of swapoff Chris Wright
2007-01-06  2:28 ` [patch 50/50] Fix incorrect user space access locking in mincore() (CVE-2006-4814) Chris Wright
2007-01-06  2:37 ` [patch 00/50] -stable review Chris Wright
2007-02-04 18:24 ` Michael Krufky
2007-02-05 21:43   ` [stable] " Greg KH
2007-02-05 22:12     ` Chris Wright
2007-02-05 22:18       ` Greg KH
2007-02-05 22:16     ` Michael Krufky
2007-02-05 22:33       ` Greg KH
2007-02-07 21:08         ` Michael Krufky
2007-02-09 19:41           ` Michael Krufky
2007-02-09 19:51             ` Greg KH
2007-02-20 23:12               ` Greg KH
2007-02-21  2:29                 ` Michael Krufky
2007-02-21 21:01                   ` [v4l-dvb-maintainer] " Michael Krufky
2007-02-26 14:35                     ` Adrian Bunk

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070106023542.365678000@sous-sol.org \
    --to=chrisw@sous-sol.org \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@osdl.org \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=andrei.popa@i-neo.ro \
    --cc=arjan@infradead.org \
    --cc=chuckw@quantumlinux.com \
    --cc=davej@redhat.com \
    --cc=davem@davemloft.net \
    --cc=gordonfarquharson@gmail.com \
    --cc=guichaz@yahoo.fr \
    --cc=hugh@veritas.com \
    --cc=jmforbes@linuxtx.org \
    --cc=kenneth.w.chen@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=martin@fatbob.nu \
    --cc=mbligh@google.com \
    --cc=mingo@elte.hu \
    --cc=mkrufky@linuxtv.org \
    --cc=nickpiggin@yahoo.com.au \
    --cc=ranma@tdiedrich.de \
    --cc=rdunlap@xenotime.net \
    --cc=reviews@ml.cw.f00f.org \
    --cc=segher@kernel.crashing.org \
    --cc=stable@kernel.org \
    --cc=tbm@cyrius.com \
    --cc=torvalds@macmini.osdl.org \
    --cc=torvalds@osdl.org \
    --cc=tytso@mit.edu \
    --cc=zwane@arm.linux.org.uk \
    --subject='Re: [patch 38/50] VM: Fix nasty and subtle race in shared mmaped page writeback' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).