LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Filipe Manana <fdmanana@suse.com>,
	Chris Mason <clm@fb.com>
Subject: [PATCH 3.10 26/62] Btrfs: fix data loss in the fast fsync path
Date: Mon, 16 Mar 2015 15:09:43 +0100	[thread overview]
Message-ID: <20150316140934.373400231@linuxfoundation.org> (raw)
In-Reply-To: <20150316140933.139548981@linuxfoundation.org>

3.10-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Filipe Manana <fdmanana@suse.com>

commit 3a8b36f378060d20062a0918e99fae39ff077bf0 upstream.

When using the fast file fsync code path we can miss the fact that new
writes happened since the last file fsync and therefore return without
waiting for the IO to finish and write the new extents to the fsync log.

Here's an example scenario where the fsync will miss the fact that new
file data exists that wasn't yet durably persisted:

1. fs_info->last_trans_committed == N - 1 and current transaction is
   transaction N (fs_info->generation == N);

2. do a buffered write;

3. fsync our inode, this clears our inode's full sync flag, starts
   an ordered extent and waits for it to complete - when it completes
   at btrfs_finish_ordered_io(), the inode's last_trans is set to the
   value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
   btrfs_set_inode_last_trans);

4. transaction N is committed, so fs_info->last_trans_committed is now
   set to the value N and fs_info->generation remains with the value N;

5. do another buffered write, when this happens btrfs_file_write_iter
   sets our inode's last_trans to the value N + 1 (that is
   fs_info->generation + 1 == N + 1);

6. transaction N + 1 is started and fs_info->generation now has the
   value N + 1;

7. transaction N + 1 is committed, so fs_info->last_trans_committed
   is set to the value N + 1;

8. fsync our inode - because it doesn't have the full sync flag set,
   we only start the ordered extent, we don't wait for it to complete
   (only in a later phase) therefore its last_trans field has the
   value N + 1 set previously by btrfs_file_write_iter(), and so we
   have:

       inode->last_trans <= fs_info->last_trans_committed
           (N + 1)              (N + 1)

   Which made us not log the last buffered write and exit the fsync
   handler immediately, returning success (0) to user space and resulting
   in data loss after a crash.

This can actually be triggered deterministically and the following excerpt
from a testcase I made for xfstests triggers the issue. It moves a dummy
file across directories and then fsyncs the old parent directory - this
is just to trigger a transaction commit, so moving files around isn't
directly related to the issue but it was chosen because running 'sync' for
example does more than just committing the current transaction, as it
flushes/waits for all file data to be persisted. The issue can also happen
at random periods, since the transaction kthread periodicaly commits the
current transaction (about every 30 seconds by default).
The body of the test is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our main test file 'foo', the one we check for data loss.
  # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
  # bit from its flags (btrfs inode specific flags).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                  -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io

  # Now create one other file and 2 directories. We will move this second file
  # from one directory to the other later because it forces btrfs to commit its
  # currently open transaction if we fsync the old parent directory. This is
  # necessary to trigger the data loss bug that affected btrfs.
  mkdir $SCRATCH_MNT/testdir_1
  touch $SCRATCH_MNT/testdir_1/bar
  mkdir $SCRATCH_MNT/testdir_2

  # Make sure everything is durably persisted.
  sync

  # Write more 8Kb of data to our file.
  $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io

  # Move our 'bar' file into a new directory.
  mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar

  # Fsync our first directory. Because it had a file moved into some other
  # directory, this made btrfs commit the currently open transaction. This is
  # a condition necessary to trigger the data loss bug.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1

  # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
  # data we wrote previously to be persisted and available if a crash happens.
  # This did not happen with btrfs, because of the transaction commit that
  # happened when we fsynced the parent directory.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Now check that all data we wrote before are available.
  echo "File content after log replay:"
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected golden output for the test, which is what we get with this
fix applied (or when running against ext3/4 and xfs), is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0040000

Without this fix applied, the output shows the test file does not have
the second 8Kb extent that we successfully fsynced:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000

So fix this by skipping the fsync only if we're doing a full sync and
if the inode's last_trans is <= fs_info->last_trans_committed, or if
the inode is already in the log. Also remove setting the inode's
last_trans in btrfs_file_write_iter since it's useless/unreliable.

Also because btrfs_file_write_iter no longer sets inode->last_trans to
fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
bail out if last_trans is 0, otherwise something as simple as the following
example wouldn't log the second write on the last fsync:

  1. write to file

  2. fsync file

  3. fsync file
       |--> btrfs_inode_in_log() returns true and it set last_trans to 0

  4. write to file
       |--> btrfs_file_write_iter() no longers sets last_trans, so it
            remained with a value of 0
  5. fsync
       |--> inode->last_trans == 0, so it bails out without logging the
            second write

A test case for xfstests will be sent soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 fs/btrfs/file.c |   56 ++++++++++++++++++++++++++++----------------------------
 1 file changed, 28 insertions(+), 28 deletions(-)

--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1593,22 +1593,10 @@ static ssize_t btrfs_file_aio_write(stru
 	mutex_unlock(&inode->i_mutex);
 
 	/*
-	 * we want to make sure fsync finds this change
-	 * but we haven't joined a transaction running right now.
-	 *
-	 * Later on, someone is sure to update the inode and get the
-	 * real transid recorded.
-	 *
-	 * We set last_trans now to the fs_info generation + 1,
-	 * this will either be one more than the running transaction
-	 * or the generation used for the next transaction if there isn't
-	 * one running right now.
-	 *
 	 * We also have to set last_sub_trans to the current log transid,
 	 * otherwise subsequent syncs to a file that's been synced in this
 	 * transaction will appear to have already occured.
 	 */
-	BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
 	BTRFS_I(inode)->last_sub_trans = root->log_transid;
 	if (num_written > 0 || num_written == -EIOCBQUEUED) {
 		err = generic_write_sync(file, pos, num_written);
@@ -1706,25 +1694,37 @@ int btrfs_sync_file(struct file *file, l
 	atomic_inc(&root->log_batch);
 
 	/*
-	 * check the transaction that last modified this inode
-	 * and see if its already been committed
-	 */
-	if (!BTRFS_I(inode)->last_trans) {
-		mutex_unlock(&inode->i_mutex);
-		goto out;
-	}
-
-	/*
-	 * if the last transaction that changed this file was before
-	 * the current transaction, we can bail out now without any
-	 * syncing
+	 * If the last transaction that changed this file was before the current
+	 * transaction and we have the full sync flag set in our inode, we can
+	 * bail out now without any syncing.
+	 *
+	 * Note that we can't bail out if the full sync flag isn't set. This is
+	 * because when the full sync flag is set we start all ordered extents
+	 * and wait for them to fully complete - when they complete they update
+	 * the inode's last_trans field through:
+	 *
+	 *     btrfs_finish_ordered_io() ->
+	 *         btrfs_update_inode_fallback() ->
+	 *             btrfs_update_inode() ->
+	 *                 btrfs_set_inode_last_trans()
+	 *
+	 * So we are sure that last_trans is up to date and can do this check to
+	 * bail out safely. For the fast path, when the full sync flag is not
+	 * set in our inode, we can not do it because we start only our ordered
+	 * extents and don't wait for them to complete (that is when
+	 * btrfs_finish_ordered_io runs), so here at this point their last_trans
+	 * value might be less than or equals to fs_info->last_trans_committed,
+	 * and setting a speculative last_trans for an inode when a buffered
+	 * write is made (such as fs_info->generation + 1 for example) would not
+	 * be reliable since after setting the value and before fsync is called
+	 * any number of transactions can start and commit (transaction kthread
+	 * commits the current transaction periodically), and a transaction
+	 * commit does not start nor waits for ordered extents to complete.
 	 */
 	smp_mb();
 	if (btrfs_inode_in_log(inode, root->fs_info->generation) ||
-	    BTRFS_I(inode)->last_trans <=
-	    root->fs_info->last_trans_committed) {
-		BTRFS_I(inode)->last_trans = 0;
-
+	    (full_sync && BTRFS_I(inode)->last_trans <=
+	     root->fs_info->last_trans_committed)) {
 		/*
 		 * We'v had everything committed since the last time we were
 		 * modified so clear this flag in case it was set for whatever



  parent reply	other threads:[~2015-03-16 14:26 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-16 14:09 [PATCH 3.10 00/62] 3.10.72-stable review Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 01/62] rtnetlink: ifla_vf_policy: fix misuses of NLA_BINARY Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 02/62] ipv6: fix ipv6_cow_metrics for non DST_HOST case Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 03/62] rtnetlink: call ->dellink on failure when ->newlink exists Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 05/62] ipv4: ip_check_defrag should correctly check return value of skb_copy_bits Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 06/62] ipv4: ip_check_defrag should not assume that skb_network_offset is zero Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 07/62] net: phy: Fix verification of EEE support in phy_init_eee Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 09/62] net: reject creation of netdev names with colons Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 10/62] team: fix possible null pointer dereference in team_handle_frame Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 11/62] net: compat: Ignore MSG_CMSG_COMPAT in compat_sys_{send, recv}msg Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 12/62] macvtap: make sure neighbour code can push ethernet header Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 13/62] usb: plusb: Add support for National Instruments host-to-host cable Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 14/62] udp: only allow UFO for packets from SOCK_DGRAM sockets Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 15/62] team: dont traverse port list using rcu in team_set_mac_address Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 16/62] mm/hugetlb: add migration entry check in __unmap_hugepage_range Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 17/62] mm/mmap.c: fix arithmetic overflow in __vm_enough_memory() Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 18/62] mm/nommu.c: " Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 19/62] mm/compaction: fix wrong order check in compact_finished() Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 20/62] mm/memory.c: actually remap enough memory Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 21/62] target: Check for LBA + sectors wrap-around in sbc_parse_cdb Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 22/62] PM / QoS: remove duplicate call to pm_qos_update_target Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 23/62] x86/asm/entry/64: Remove a bogus ret_from_fork optimization Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 24/62] iio: imu: adis16400: Fix sign extension Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 25/62] btrfs: fix lost return value due to variable shadowing Greg Kroah-Hartman
2015-03-16 14:09 ` Greg Kroah-Hartman [this message]
2015-03-16 14:09 ` [PATCH 3.10 27/62] Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 28/62] KVM: emulate: fix CMPXCHG8B on 32-bit hosts Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 29/62] KVM: MIPS: Fix trace event to save PC directly Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 30/62] USB: serial: cp210x: Adding Seletek device ids Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 31/62] USB: usbfs: dont leak kernel data in siginfo Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 32/62] usb: ftdi_sio: Add jtag quirk support for Cyber Cortex AV boards Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 33/62] xhci: Allocate correct amount of scratchpad buffers Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 34/62] xhci: fix reporting of 0-sized URBs in control endpoint Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 35/62] net: irda: fix wait_until_sent poll timeout Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 36/62] USB: serial: fix infinite wait_until_sent timeout Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 37/62] TTY: fix tty_wait_until_sent on 64-bit machines Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 38/62] USB: serial: fix potential use-after-free after failed probe Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 39/62] autofs4 copy_dev_ioctl(): keep the value of ->size wed used for allocation Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 40/62] debugfs: leave freeing a symlink body until inode eviction Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 41/62] procfs: fix race between symlink removals and traversals Greg Kroah-Hartman
2015-03-16 14:09 ` [PATCH 3.10 42/62] sunrpc: fix braino in ->poll() Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 43/62] tty: fix up atime/mtime mess, take four Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 44/62] ALSA: pcm: Dont leave PREPARED state after draining Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 45/62] ALSA: hda - Add pin configs for ASUS mobo with IDT 92HD73XX codec Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 46/62] sg: fix read() error reporting Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 47/62] IB/qib: Do not write EEPROM Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 48/62] nilfs2: fix potential memory overrun on inode Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 49/62] fixed invalid assignment of 64bit mask to host dma_boundary for scatter gather segment boundary limit Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 50/62] clk: sunxi: Support factor clocks with N factor starting not from 0 Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 51/62] staging: comedi: comedi_compat32.c: fix COMEDI_CMD copy back Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 52/62] dm mirror: do not degrade the mirror on discard error Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 53/62] dm io: reject unsupported DISCARD requests with EOPNOTSUPP Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 54/62] dm: fix a race condition in dm_get_md Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 55/62] dm snapshot: fix a possible invalid memory access on unload Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 56/62] staging: comedi: cb_pcidas64: fix incorrect AI range code handling Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 57/62] HID: input: fix confusion on conflicting mappings Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 58/62] HID: fixup the conflicting keyboard mappings quirk Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 60/62] drm/radeon: fix 1 RB harvest config setup for TN/RL Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 61/62] ACPI / video: Load the module even if ACPI is disabled Greg Kroah-Hartman
2015-03-16 14:10 ` [PATCH 3.10 62/62] ath5k: fix spontaneus AR5312 freezes Greg Kroah-Hartman
2015-03-16 19:57 ` [PATCH 3.10 00/62] 3.10.72-stable review Guenter Roeck
2015-03-17 17:01 ` Masanari Iida
2015-03-17 20:44   ` Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150316140934.373400231@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=clm@fb.com \
    --cc=fdmanana@suse.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    --subject='Re: [PATCH 3.10 26/62] Btrfs: fix data loss in the fast fsync path' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).