Linux-Fsdevel Archive on lore.kernel.org
* [PATCH 0/2 v2] bdev: Avoid discarding buffers under a filesystem
@ 2020-09-04  8:58 Jan Kara
  2020-09-04  8:58 ` [PATCH 1/2] fs: Don't invalidate page buffers in block_write_full_page() Jan Kara
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Jan Kara @ 2020-09-04  8:58 UTC
  To: linux-fsdevel
  Cc: linux-ext4, linux-block, Christoph Hellwig, yebin,
	Andreas Dilger, Jens Axboe, Jan Kara

Hello,

this patch set fixes problems when buffer heads are discarded under a
live filesystem (which can lead to all sorts of issues like crashes in case
of ext4). Patch 1 drops some stale buffer invalidation code, patch 2
temporarily gets exclusive access to the block device for the duration of
buffer cache handling to avoid interfering with another exclusive bdev user.
The patches fix the problems for me and pass xfstests for ext4.

Changes since v1:
* Check for exclusive access to the bdev instead of for the presence of
  superblock

								Honza


* [PATCH 1/2] fs: Don't invalidate page buffers in block_write_full_page()
  2020-09-04  8:58 [PATCH 0/2 v2] bdev: Avoid discarding buffers under a filesystem Jan Kara
@ 2020-09-04  8:58 ` Jan Kara
  2020-09-07  7:12   ` Christoph Hellwig
  2020-09-04  8:58 ` [PATCH 2/2] block: Do not discard buffers under a mounted filesystem Jan Kara
  2020-09-07 10:35 ` [PATCH 0/2 v2] bdev: Avoid discarding buffers under a filesystem Jan Kara
  2 siblings, 1 reply; 7+ messages in thread
From: Jan Kara @ 2020-09-04  8:58 UTC
  To: linux-fsdevel
  Cc: linux-ext4, linux-block, Christoph Hellwig, yebin,
	Andreas Dilger, Jens Axboe, Jan Kara, stable

If block_write_full_page() is called for a page that is beyond the current
inode size, it will truncate page buffers for the page and return 0.
This logic was added in 2.5.62 in commit 81eb69062588 ("fix ext3
BUG due to race with truncate") in the history.git tree to fix a problem
with ext3 in data=ordered mode. This particular problem doesn't exist
anymore because ext3 is long gone and ext4 handles ordered data
differently. Also, buffers are normally invalidated by the truncate code and
there's no need to handle this specially in ->writepage() code.

This invalidation of page buffers in block_write_full_page() causes
issues for filesystems (e.g. ext4 or ocfs2) when the block device is shrunk
under the filesystem's hands and metadata buffers get discarded while being
tracked by the journalling layer. Although this is obviously "not
supported", it can cause kernel crashes like:

[ 7986.689400] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 7986.697197] PGD 0 P4D 0
[ 7986.699724] Oops: 0002 [#1] SMP PTI
[ 7986.703200] CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G O     --------- -  - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
[ 7986.716438] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
[ 7986.723462] RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
...
[ 7986.810150] Call Trace:
[ 7986.812595]  __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
[ 7986.818408]  jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
[ 7986.836467]  kjournald2+0xbd/0x270 [jbd2]

which is not great. The crash happens because bh->b_private is suddenly
NULL although the BH_JBD flag is still set (this is because
block_invalidatepage() cleared the BH_Mapped flag and a subsequent bh lookup
found the buffer without BH_Mapped set and called init_page_buffers(), which
rewrote bh->b_private). So just remove the invalidation in
block_write_full_page().

Note that the buffer cache invalidation when a block device changes size
is already careful to avoid similar problems by using
invalidate_mapping_pages(), which skips busy buffers, so it was only this
odd block_write_full_page() behavior that could tear down bdev buffers
under the filesystem's hands.
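
For readers following along, the "page beyond inode size" test that guards
the removed call can be modelled in userspace. This is only a sketch, not
kernel code: classify_page(), the enum and the MODEL_* constants are
hypothetical names, a 4 KiB page size is assumed, and the "fully inside"
early return is assumed to precede the hunk shown in the diff below.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel uses the architecture's PAGE_SIZE. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1ULL << MODEL_PAGE_SHIFT)

enum page_pos { PAGE_INSIDE, PAGE_STRADDLES, PAGE_OUTSIDE };

/*
 * Hypothetical helper: classify page 'index' against inode size 'i_size'
 * using the same arithmetic as the check in block_write_full_page().
 */
enum page_pos classify_page(uint64_t index, int64_t i_size)
{
	uint64_t end_index, offset;

	assert(i_size >= 0);
	end_index = (uint64_t)i_size >> MODEL_PAGE_SHIFT;
	offset = (uint64_t)i_size & (MODEL_PAGE_SIZE - 1);

	/* Page lies entirely below i_size: written out normally. */
	if (index < end_index)
		return PAGE_INSIDE;

	/*
	 * Page lies entirely at or beyond i_size. Before this patch the
	 * page buffers were invalidated here; afterwards the page is only
	 * unlocked and 0 is returned.
	 */
	if (index >= end_index + 1 || !offset)
		return PAGE_OUTSIDE;

	/* i_size falls inside this page. */
	return PAGE_STRADDLES;
}
```

With i_size = 10000 the page at index 2 straddles the end of the device and
index 3 is fully outside; with i_size = 8192 (page aligned) index 2 is
already fully outside.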

Reported-by: Ye Bin <yebin10@huawei.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 061dd202979d..163c2c0b9aa3 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2771,16 +2771,6 @@ int nobh_writepage(struct page *page, get_block_t *get_block,
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_SIZE-1);
 	if (page->index >= end_index+1 || !offset) {
-		/*
-		 * The page may have dirty, unmapped buffers.  For example,
-		 * they may have been added in ext3_writepage().  Make them
-		 * freeable here, so the page does not leak.
-		 */
-#if 0
-		/* Not really sure about this  - do we need this ? */
-		if (page->mapping->a_ops->invalidatepage)
-			page->mapping->a_ops->invalidatepage(page, offset);
-#endif
 		unlock_page(page);
 		return 0; /* don't care */
 	}
@@ -2975,12 +2965,6 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
 	/* Is the page fully outside i_size? (truncate in progress) */
 	offset = i_size & (PAGE_SIZE-1);
 	if (page->index >= end_index+1 || !offset) {
-		/*
-		 * The page may have dirty, unmapped buffers.  For example,
-		 * they may have been added in ext3_writepage().  Make them
-		 * freeable here, so the page does not leak.
-		 */
-		do_invalidatepage(page, 0, PAGE_SIZE);
 		unlock_page(page);
 		return 0; /* don't care */
 	}
-- 
2.16.4



* [PATCH 2/2] block: Do not discard buffers under a mounted filesystem
  2020-09-04  8:58 [PATCH 0/2 v2] bdev: Avoid discarding buffers under a filesystem Jan Kara
  2020-09-04  8:58 ` [PATCH 1/2] fs: Don't invalidate page buffers in block_write_full_page() Jan Kara
@ 2020-09-04  8:58 ` Jan Kara
  2020-09-07  7:12   ` Christoph Hellwig
  2020-09-07 10:35 ` [PATCH 0/2 v2] bdev: Avoid discarding buffers under a filesystem Jan Kara
  2 siblings, 1 reply; 7+ messages in thread
From: Jan Kara @ 2020-09-04  8:58 UTC
  To: linux-fsdevel
  Cc: linux-ext4, linux-block, Christoph Hellwig, yebin,
	Andreas Dilger, Jens Axboe, Jan Kara

Discarding blocks and buffers under a mounted filesystem is hardly
something an admin wants to do. Usually it will confuse the filesystem, and
sometimes the loss of buffer_head state (including the b_private field) can
even cause crashes like:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 0 P4D 0
Oops: 0002 [#1] SMP PTI
CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G O     --------- -  - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
...
Call Trace:
 __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
 jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
 kjournald2+0xbd/0x270 [jbd2]

So if we don't have the block device open with O_EXCL already, claim the
block device while we truncate the buffer cache. This makes sure that any
exclusive block device user (such as a filesystem) cannot operate on the
device while we are discarding the buffer cache.
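
From the userspace side, the visible effect should be an error return from
the ioctl rather than a silent discard. The following is a minimal sketch
under stated assumptions: discard_range() is a hypothetical wrapper (not
part of this patch), and the expectation that the failure surfaces as
-EBUSY is based on bd_prepare_to_claim() refusing to claim a device that
already has another exclusive holder.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* BLKDISCARD */

/*
 * Hypothetical wrapper: issue BLKDISCARD for [start, start + len) on the
 * block device open at 'fd'. Returns 0 on success or a negative errno.
 *
 * Assumption: with this patch applied, a descriptor opened without O_EXCL
 * on a device that has a mounted filesystem gets -EBUSY here.
 */
int discard_range(int fd, uint64_t start, uint64_t len)
{
	/* The ioctl takes a two-element array: byte offset, byte length. */
	uint64_t range[2] = { start, len };

	if (ioctl(fd, BLKDISCARD, range) < 0)
		return -errno;
	return 0;
}
```

A caller that legitimately owns the device would open it with
O_RDWR | O_EXCL first; on a device with a mounted filesystem that open()
itself fails with EBUSY, so the discard is never attempted.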

Reported-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/ioctl.c          | 16 ++++++++++------
 fs/block_dev.c         | 37 +++++++++++++++++++++++++++++++++----
 include/linux/blkdev.h |  7 +++++++
 3 files changed, 50 insertions(+), 10 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index bdb3bbb253d9..ae74d0409afa 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -112,8 +112,7 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
 	uint64_t range[2];
 	uint64_t start, len;
 	struct request_queue *q = bdev_get_queue(bdev);
-	struct address_space *mapping = bdev->bd_inode->i_mapping;
-
+	int err;
 
 	if (!(mode & FMODE_WRITE))
 		return -EBADF;
@@ -134,7 +133,11 @@ static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
 
 	if (start + len > i_size_read(bdev->bd_inode))
 		return -EINVAL;
-	truncate_inode_pages_range(mapping, start, start + len - 1);
+
+	err = truncate_bdev_range(bdev, mode, start, start + len - 1);
+	if (err)
+		return err;
+
 	return blkdev_issue_discard(bdev, start >> 9, len >> 9,
 				    GFP_KERNEL, flags);
 }
@@ -143,8 +146,8 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
 		unsigned long arg)
 {
 	uint64_t range[2];
-	struct address_space *mapping;
 	uint64_t start, end, len;
+	int err;
 
 	if (!(mode & FMODE_WRITE))
 		return -EBADF;
@@ -166,8 +169,9 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
 		return -EINVAL;
 
 	/* Invalidate the page cache, including dirty pages */
-	mapping = bdev->bd_inode->i_mapping;
-	truncate_inode_pages_range(mapping, start, end);
+	err = truncate_bdev_range(bdev, mode, start, end);
+	if (err)
+		return err;
 
 	return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
 			BLKDEV_ZERO_NOUNMAP);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 8ae833e00443..02a749370717 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -103,6 +103,35 @@ void invalidate_bdev(struct block_device *bdev)
 }
 EXPORT_SYMBOL(invalidate_bdev);
 
+/*
+ * Drop all buffers & page cache for given bdev range. This function bails
+ * with error if bdev has other exclusive owner (such as filesystem).
+ */
+int truncate_bdev_range(struct block_device *bdev, fmode_t mode,
+			loff_t lstart, loff_t lend)
+{
+	struct block_device *claimed_bdev = NULL;
+	int err;
+
+	/*
+	 * If we don't hold exclusive handle for the device, upgrade to it
+	 * while we discard the buffer cache to avoid discarding buffers
+	 * under live filesystem.
+	 */
+	if (!(mode & FMODE_EXCL)) {
+		claimed_bdev = bdev->bd_contains;
+		err = bd_prepare_to_claim(bdev, claimed_bdev,
+					  truncate_bdev_range);
+		if (err)
+			return err;
+	}
+	truncate_inode_pages_range(bdev->bd_inode->i_mapping, lstart, lend);
+	if (claimed_bdev)
+		bd_abort_claiming(bdev, claimed_bdev, truncate_bdev_range);
+	return 0;
+}
+EXPORT_SYMBOL(truncate_bdev_range);
+
 static void set_init_blocksize(struct block_device *bdev)
 {
 	bdev->bd_inode->i_blkbits = blksize_bits(bdev_logical_block_size(bdev));
@@ -1969,7 +1998,6 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 			     loff_t len)
 {
 	struct block_device *bdev = I_BDEV(bdev_file_inode(file));
-	struct address_space *mapping;
 	loff_t end = start + len - 1;
 	loff_t isize;
 	int error;
@@ -1997,8 +2025,9 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 		return -EINVAL;
 
 	/* Invalidate the page cache, including dirty pages. */
-	mapping = bdev->bd_inode->i_mapping;
-	truncate_inode_pages_range(mapping, start, end);
+	error = truncate_bdev_range(bdev, file->f_mode, start, end);
+	if (error)
+		return error;
 
 	switch (mode) {
 	case FALLOC_FL_ZERO_RANGE:
@@ -2025,7 +2054,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	 * the caller will be given -EBUSY.  The third argument is
 	 * inclusive, so the rounding here is safe.
 	 */
-	return invalidate_inode_pages2_range(mapping,
+	return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping,
 					     start >> PAGE_SHIFT,
 					     end >> PAGE_SHIFT);
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bb5636cc17b9..91c62bfb2042 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1984,11 +1984,18 @@ void bdput(struct block_device *);
 
 #ifdef CONFIG_BLOCK
 void invalidate_bdev(struct block_device *bdev);
+int truncate_bdev_range(struct block_device *bdev, fmode_t mode, loff_t lstart,
+			loff_t lend);
 int sync_blockdev(struct block_device *bdev);
 #else
 static inline void invalidate_bdev(struct block_device *bdev)
 {
 }
+static inline int truncate_bdev_range(struct block_device *bdev, fmode_t mode,
+				      loff_t lstart, loff_t lend)
+{
+	return 0;
+}
 static inline int sync_blockdev(struct block_device *bdev)
 {
 	return 0;
-- 
2.16.4



* Re: [PATCH 1/2] fs: Don't invalidate page buffers in block_write_full_page()
  2020-09-04  8:58 ` [PATCH 1/2] fs: Don't invalidate page buffers in block_write_full_page() Jan Kara
@ 2020-09-07  7:12   ` Christoph Hellwig
  0 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2020-09-07  7:12 UTC
  To: Jan Kara
  Cc: linux-fsdevel, linux-ext4, linux-block, Christoph Hellwig, yebin,
	Andreas Dilger, Jens Axboe, stable

On Fri, Sep 04, 2020 at 10:58:51AM +0200, Jan Kara wrote:
> If block_write_full_page() is called for a page that is beyond the current
> inode size, it will truncate page buffers for the page and return 0.
> This logic was added in 2.5.62 in commit 81eb69062588 ("fix ext3
> BUG due to race with truncate") in the history.git tree to fix a problem
> with ext3 in data=ordered mode. This particular problem doesn't exist
> anymore because ext3 is long gone and ext4 handles ordered data
> differently. Also, buffers are normally invalidated by the truncate code and
> there's no need to handle this specially in ->writepage() code.
> 
> This invalidation of page buffers in block_write_full_page() causes
> issues for filesystems (e.g. ext4 or ocfs2) when the block device is shrunk
> under the filesystem's hands and metadata buffers get discarded while being
> tracked by the journalling layer. Although this is obviously "not
> supported", it can cause kernel crashes like:

Btw, while looking over the block device revalidation code I think
all the magic we do on shrinking block devices actually is a bad
idea - potentially very harmful, but without much real benefit.
And it is only run on file systems created directly on the whole device,
meaning it isn't even used at all with the typical setups that use
partitions.

Anyway, this patch looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 2/2] block: Do not discard buffers under a mounted filesystem
  2020-09-04  8:58 ` [PATCH 2/2] block: Do not discard buffers under a mounted filesystem Jan Kara
@ 2020-09-07  7:12   ` Christoph Hellwig
  0 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2020-09-07  7:12 UTC
  To: Jan Kara
  Cc: linux-fsdevel, linux-ext4, linux-block, Christoph Hellwig, yebin,
	Andreas Dilger, Jens Axboe

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 0/2 v2] bdev: Avoid discarding buffers under a filesystem
  2020-09-04  8:58 [PATCH 0/2 v2] bdev: Avoid discarding buffers under a filesystem Jan Kara
  2020-09-04  8:58 ` [PATCH 1/2] fs: Don't invalidate page buffers in block_write_full_page() Jan Kara
  2020-09-04  8:58 ` [PATCH 2/2] block: Do not discard buffers under a mounted filesystem Jan Kara
@ 2020-09-07 10:35 ` Jan Kara
  2020-09-07 16:24   ` Jens Axboe
  2 siblings, 1 reply; 7+ messages in thread
From: Jan Kara @ 2020-09-07 10:35 UTC
  To: linux-fsdevel
  Cc: linux-ext4, linux-block, Christoph Hellwig, yebin,
	Andreas Dilger, Jens Axboe, Jan Kara

Hello!

On Fri 04-09-20 10:58:50, Jan Kara wrote:
> this patch set fixes problems when buffer heads are discarded under a
> live filesystem (which can lead to all sorts of issues like crashes in case
> of ext4). Patch 1 drops some stale buffer invalidation code, patch 2
> temporarily gets exclusive access to the block device for the duration of
> buffer cache handling to avoid interfering with another exclusive bdev user.
> The patches fix the problems for me and pass xfstests for ext4.
> 
> Changes since v1:
> * Check for exclusive access to the bdev instead of for the presence of
>   superblock

Jens, now that Christoph has reviewed the patches (thanks Christoph!), can
you pick up the patches to your tree please? Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 0/2 v2] bdev: Avoid discarding buffers under a filesystem
  2020-09-07 10:35 ` [PATCH 0/2 v2] bdev: Avoid discarding buffers under a filesystem Jan Kara
@ 2020-09-07 16:24   ` Jens Axboe
  0 siblings, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2020-09-07 16:24 UTC
  To: Jan Kara, linux-fsdevel
  Cc: linux-ext4, linux-block, Christoph Hellwig, yebin, Andreas Dilger

On 9/7/20 4:35 AM, Jan Kara wrote:
> Hello!
> 
> On Fri 04-09-20 10:58:50, Jan Kara wrote:
>> this patch set fixes problems when buffer heads are discarded under a
>> live filesystem (which can lead to all sorts of issues like crashes in case
>> of ext4). Patch 1 drops some stale buffer invalidation code, patch 2
>> temporarily gets exclusive access to the block device for the duration of
>> buffer cache handling to avoid interfering with another exclusive bdev user.
>> The patches fix the problems for me and pass xfstests for ext4.
>>
>> Changes since v1:
>> * Check for exclusive access to the bdev instead of for the presence of
>>   superblock
> 
> Jens, now that Christoph has reviewed the patches (thanks Christoph!), can
> you pick up the patches to your tree please? Thanks!

Yep, I applied them for 5.10. Thanks!

-- 
Jens Axboe



Thread overview: 7+ messages
2020-09-04  8:58 [PATCH 0/2 v2] bdev: Avoid discarding buffers under a filesystem Jan Kara
2020-09-04  8:58 ` [PATCH 1/2] fs: Don't invalidate page buffers in block_write_full_page() Jan Kara
2020-09-07  7:12   ` Christoph Hellwig
2020-09-04  8:58 ` [PATCH 2/2] block: Do not discard buffers under a mounted filesystem Jan Kara
2020-09-07  7:12   ` Christoph Hellwig
2020-09-07 10:35 ` [PATCH 0/2 v2] bdev: Avoid discarding buffers under a filesystem Jan Kara
2020-09-07 16:24   ` Jens Axboe
