Linux-Fsdevel Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RFC 0/4] Fix gfs2 readahead deadlocks
@ 2020-07-02 16:51 Andreas Gruenbacher
  2020-07-02 16:51 ` [RFC 1/4] gfs2: Revert readahead conversion Andreas Gruenbacher
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2020-07-02 16:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Linus Torvalds, linux-fsdevel, linux-mm,
	linux-kernel, Andreas Gruenbacher

Hi all,

commit d4388340ae0b ("fs: convert mpage_readpages to mpage_readahead")
converted gfs2 and other filesystems from the ->readpages to the
->readahead address space operation.  Due to gfs2 doing its locking in
the ->readpage and ->readahead address space operations rather than at a
higher level, this is leading to deadlocks.  Switching to a trylock
operation in ->readahead improves things but doesn't eliminate all
deadlocks; the only reasonable fix seems to be to lift gfs2's locking to
the ->fault vm operation and ->read_iter file operation.

However, gfs2 includes an optimization that allows reads to be served
out of the page cache without any filesystem locking.  This optimization
is important in concurrent read scenarios.  The best way we could find
to preserve this optimization is by introducing a new IOCB_NOIO flag for
generic_file_read_iter.

Introducing this new flag may be too big a change for 5.8.  So this
patch queue takes a different approach:  it first walks back gfs2's
conversion to ->readahead.  Then it introduces IOCB_NOIO, fixes the
locking, and re-applies the readahead conversion.

Of this patch queue, either only the first patch or all four patches can
be applied to fix gfs2's current issues in 5.8.  Please let me know what
you think.

Thanks,
Andreas

Andreas Gruenbacher (4):
  gfs2: Revert readahead conversion
  fs: Add IOCB_NOIO flag for generic_file_read_iter
  gfs2: Rework read and page fault locking
  gfs2: Reinstate readahead conversion

 fs/gfs2/aops.c     | 45 +------------------------------------
 fs/gfs2/file.c     | 55 ++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/fs.h |  1 +
 mm/filemap.c       | 16 ++++++++++++--
 4 files changed, 69 insertions(+), 48 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC 1/4] gfs2: Revert readahead conversion
  2020-07-02 16:51 [RFC 0/4] Fix gfs2 readahead deadlocks Andreas Gruenbacher
@ 2020-07-02 16:51 ` Andreas Gruenbacher
  2020-07-02 16:51 ` [RFC 2/4] fs: Add IOCB_NOIO flag for generic_file_read_iter Andreas Gruenbacher
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2020-07-02 16:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Linus Torvalds, linux-fsdevel, linux-mm,
	linux-kernel, Andreas Gruenbacher

Commit d4388340ae0b ("fs: convert mpage_readpages to mpage_readahead")
converted gfs2 and other filesystems from the ->readpages to the
->readahead address space operation.  Other than ->readpages,
->readahead is passed the pages to readahead locked.  Due to problems in
the current page locking strategy, this is now causing deadlocks in
gfs2.

Fix this by reinstating mpage_readpages from before commit d4388340ae0b
and by converting gfs2 back to ->readpages.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/aops.c        | 23 ++++++++-----
 fs/mpage.c            | 75 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/mpage.h |  2 ++
 3 files changed, 92 insertions(+), 8 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 72c9560f4467..786c1ce8f030 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -577,7 +577,7 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
 }
 
 /**
- * gfs2_readahead - Read a bunch of pages at once
+ * gfs2_readpages - Read a bunch of pages at once
  * @file: The file to read from
  * @mapping: Address space info
  * @pages: List of pages to read
@@ -590,24 +590,31 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
  *    obviously not something we'd want to do on too regular a basis.
  *    Any I/O we ignore at this time will be done via readpage later.
  * 2. We don't handle stuffed files here we let readpage do the honours.
- * 3. mpage_readahead() does most of the heavy lifting in the common case.
+ * 3. mpage_readpages() does most of the heavy lifting in the common case.
  * 4. gfs2_block_map() is relied upon to set BH_Boundary in the right places.
  */
 
-static void gfs2_readahead(struct readahead_control *rac)
+static int gfs2_readpages(struct file *file, struct address_space *mapping,
+			  struct list_head *pages, unsigned nr_pages)
 {
-	struct inode *inode = rac->mapping->host;
+	struct inode *inode = mapping->host;
 	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	struct gfs2_holder gh;
+	int ret;
 
 	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
-	if (gfs2_glock_nq(&gh))
+	ret = gfs2_glock_nq(&gh);
+	if (unlikely(ret))
 		goto out_uninit;
 	if (!gfs2_is_stuffed(ip))
-		mpage_readahead(rac, gfs2_block_map);
+		ret = mpage_readpages(mapping, pages, nr_pages, gfs2_block_map);
 	gfs2_glock_dq(&gh);
 out_uninit:
 	gfs2_holder_uninit(&gh);
+	if (unlikely(gfs2_withdrawn(sdp)))
+		ret = -EIO;
+	return ret;
 }
 
 /**
@@ -826,7 +833,7 @@ static const struct address_space_operations gfs2_aops = {
 	.writepage = gfs2_writepage,
 	.writepages = gfs2_writepages,
 	.readpage = gfs2_readpage,
-	.readahead = gfs2_readahead,
+	.readpages = gfs2_readpages,
 	.bmap = gfs2_bmap,
 	.invalidatepage = gfs2_invalidatepage,
 	.releasepage = gfs2_releasepage,
@@ -840,7 +847,7 @@ static const struct address_space_operations gfs2_jdata_aops = {
 	.writepage = gfs2_jdata_writepage,
 	.writepages = gfs2_jdata_writepages,
 	.readpage = gfs2_readpage,
-	.readahead = gfs2_readahead,
+	.readpages = gfs2_readpages,
 	.set_page_dirty = jdata_set_page_dirty,
 	.bmap = gfs2_bmap,
 	.invalidatepage = gfs2_invalidatepage,
diff --git a/fs/mpage.c b/fs/mpage.c
index 830e6cc2a9e7..5243a065a062 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -396,6 +396,81 @@ void mpage_readahead(struct readahead_control *rac, get_block_t get_block)
 }
 EXPORT_SYMBOL(mpage_readahead);
 
+/**
+ * mpage_readpages - populate an address space with some pages & start reads against them
+ * @mapping: the address_space
+ * @pages: The address of a list_head which contains the target pages.  These
+ *   pages have their ->index populated and are otherwise uninitialised.
+ *   The page at @pages->prev has the lowest file offset, and reads should be
+ *   issued in @pages->prev to @pages->next order.
+ * @nr_pages: The number of pages at *@pages
+ * @get_block: The filesystem's block mapper function.
+ *
+ * This function walks the pages and the blocks within each page, building and
+ * emitting large BIOs.
+ *
+ * If anything unusual happens, such as:
+ *
+ * - encountering a page which has buffers
+ * - encountering a page which has a non-hole after a hole
+ * - encountering a page with non-contiguous blocks
+ *
+ * then this code just gives up and calls the buffer_head-based read function.
+ * It does handle a page which has holes at the end - that is a common case:
+ * the end-of-file on blocksize < PAGE_SIZE setups.
+ *
+ * BH_Boundary explanation:
+ *
+ * There is a problem.  The mpage read code assembles several pages, gets all
+ * their disk mappings, and then submits them all.  That's fine, but obtaining
+ * the disk mappings may require I/O.  Reads of indirect blocks, for example.
+ *
+ * So an mpage read of the first 16 blocks of an ext2 file will cause I/O to be
+ * submitted in the following order:
+ *
+ * 	12 0 1 2 3 4 5 6 7 8 9 10 11 13 14 15 16
+ *
+ * because the indirect block has to be read to get the mappings of blocks
+ * 13,14,15,16.  Obviously, this impacts performance.
+ *
+ * So what we do it to allow the filesystem's get_block() function to set
+ * BH_Boundary when it maps block 11.  BH_Boundary says: mapping of the block
+ * after this one will require I/O against a block which is probably close to
+ * this one.  So you should push what I/O you have currently accumulated.
+ *
+ * This all causes the disk requests to be issued in the correct order.
+ */
+int
+mpage_readpages(struct address_space *mapping, struct list_head *pages,
+				unsigned nr_pages, get_block_t get_block)
+{
+	struct mpage_readpage_args args = {
+		.get_block = get_block,
+		.is_readahead = true,
+	};
+	unsigned page_idx;
+
+	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
+		struct page *page = lru_to_page(pages);
+
+		prefetchw(&page->flags);
+		list_del(&page->lru);
+		if (!add_to_page_cache_lru(page, mapping,
+					page->index,
+					readahead_gfp_mask(mapping))) {
+			args.page = page;
+			args.nr_pages = nr_pages - page_idx;
+			args.bio = do_mpage_readpage(&args);
+		}
+		put_page(page);
+	}
+	BUG_ON(!list_empty(pages));
+	if (args.bio)
+		mpage_bio_submit(REQ_OP_READ, REQ_RAHEAD, args.bio);
+	return 0;
+}
+EXPORT_SYMBOL(mpage_readpages);
+
 /*
  * This isn't called much at all
  */
diff --git a/include/linux/mpage.h b/include/linux/mpage.h
index f4f5e90a6844..181f1b0fbd83 100644
--- a/include/linux/mpage.h
+++ b/include/linux/mpage.h
@@ -16,6 +16,8 @@ struct writeback_control;
 struct readahead_control;
 
 void mpage_readahead(struct readahead_control *, get_block_t get_block);
+int mpage_readpages(struct address_space *, struct list_head *, unsigned,
+		    get_block_t);
 int mpage_readpage(struct page *page, get_block_t get_block);
 int mpage_writepages(struct address_space *mapping,
 		struct writeback_control *wbc, get_block_t get_block);
-- 
2.26.2


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC 2/4] fs: Add IOCB_NOIO flag for generic_file_read_iter
  2020-07-02 16:51 [RFC 0/4] Fix gfs2 readahead deadlocks Andreas Gruenbacher
  2020-07-02 16:51 ` [RFC 1/4] gfs2: Revert readahead conversion Andreas Gruenbacher
@ 2020-07-02 16:51 ` Andreas Gruenbacher
  2020-07-02 18:06   ` Linus Torvalds
  2020-07-02 16:51 ` [RFC 3/4] gfs2: Rework read and page fault locking Andreas Gruenbacher
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Andreas Gruenbacher @ 2020-07-02 16:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Linus Torvalds, linux-fsdevel, linux-mm,
	linux-kernel, Andreas Gruenbacher

Add an IOCB_NOIO flag that indicates to generic_file_read_iter that it
shouldn't trigger any filesystem I/O for the actual request or for
readahead.  This allows to do tentative reads out of the page cache as
some filesystems allow, and to take the appropriate locks and retry the
reads only if the requested pages are not cached.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/fs.h |  1 +
 mm/filemap.c       | 16 ++++++++++++++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3f881a892ea7..1ab2ea19e883 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -315,6 +315,7 @@ enum rw_hint {
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
+#define IOCB_NOIO		(1 << 8)
 
 struct kiocb {
 	struct file		*ki_filp;
diff --git a/mm/filemap.c b/mm/filemap.c
index f0ae9a6308cb..e8318f99f468 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2028,7 +2028,7 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 		page = find_get_page(mapping, index);
 		if (!page) {
-			if (iocb->ki_flags & IOCB_NOWAIT)
+			if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
 				goto would_block;
 			page_cache_sync_readahead(mapping,
 					ra, filp,
@@ -2038,12 +2038,16 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 				goto no_cached_page;
 		}
 		if (PageReadahead(page)) {
+			if (iocb->ki_flags & IOCB_NOIO) {
+				put_page(page);
+				goto out;
+			}
 			page_cache_async_readahead(mapping,
 					ra, filp, page,
 					index, last_index - index);
 		}
 		if (!PageUptodate(page)) {
-			if (iocb->ki_flags & IOCB_NOWAIT) {
+			if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) {
 				put_page(page);
 				goto would_block;
 			}
@@ -2249,6 +2253,14 @@ EXPORT_SYMBOL_GPL(generic_file_buffered_read);
  *
  * This is the "read_iter()" routine for all filesystems
  * that can use the page cache directly.
+ *
+ * The IOCB_NOWAIT flag in iocb->ki_flags indicates that -EAGAIN shall
+ * be returned when no data can be read without issuing I/O requests;
+ * this doesn't prevent readahead.  The IOCB_NOIO flag indicates that
+ * -EAGAIN shall be returned when no data can be read without issuing
+ * I/O requests, and 0 shall be returned when readhead would be
+ * triggered.
+ *
  * Return:
  * * number of bytes copied, even for partial reads
  * * negative error code if nothing was read
-- 
2.26.2


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC 3/4] gfs2: Rework read and page fault locking
  2020-07-02 16:51 [RFC 0/4] Fix gfs2 readahead deadlocks Andreas Gruenbacher
  2020-07-02 16:51 ` [RFC 1/4] gfs2: Revert readahead conversion Andreas Gruenbacher
  2020-07-02 16:51 ` [RFC 2/4] fs: Add IOCB_NOIO flag for generic_file_read_iter Andreas Gruenbacher
@ 2020-07-02 16:51 ` Andreas Gruenbacher
  2020-07-02 16:51 ` [RFC 4/4] gfs2: Reinstate readahead conversion Andreas Gruenbacher
  2020-07-02 18:10 ` [RFC 0/4] Fix gfs2 readahead deadlocks Linus Torvalds
  4 siblings, 0 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2020-07-02 16:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Linus Torvalds, linux-fsdevel, linux-mm,
	linux-kernel, Andreas Gruenbacher

So far, gfs2 has taken the filesystem locks inside the ->readpage and
->readahead address space operations.  These operations are too
low-level, causing lock ordering problems and workarounds.  To get rid
of those, move the locking to the ->read_iter file and ->fault vm
operations.

The cache consistency model of filesystems like gfs2 is such that if
data is found in the page cache, the data is up to date and can be used
without taking any filesystem locks.  If a page is not cached,
filesystem locks must be taken before populating the page cache.

To avoid taking the inode glock when the data is already cached, the
->read_iter file operation first tries to read the data with the
IOCB_NOIO flag set.  If that fails, the inode glock is taken and the
operation is repeated with the IOCB_NOIO flag cleared.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/aops.c | 59 ++++----------------------------------------------
 fs/gfs2/file.c | 55 ++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 57 insertions(+), 57 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 786c1ce8f030..28f097636e78 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -468,21 +468,10 @@ static int stuffed_readpage(struct gfs2_inode *ip, struct page *page)
 }
 
 
-/**
- * __gfs2_readpage - readpage
- * @file: The file to read a page for
- * @page: The page to read
- *
- * This is the core of gfs2's readpage. It's used by the internal file
- * reading code as in that case we already hold the glock. Also it's
- * called by gfs2_readpage() once the required lock has been granted.
- */
-
 static int __gfs2_readpage(void *file, struct page *page)
 {
 	struct gfs2_inode *ip = GFS2_I(page->mapping->host);
 	struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host);
-
 	int error;
 
 	if (i_blocksize(page->mapping->host) == PAGE_SIZE &&
@@ -505,36 +494,11 @@ static int __gfs2_readpage(void *file, struct page *page)
  * gfs2_readpage - read a page of a file
  * @file: The file to read
  * @page: The page of the file
- *
- * This deals with the locking required. We have to unlock and
- * relock the page in order to get the locking in the right
- * order.
  */
 
 static int gfs2_readpage(struct file *file, struct page *page)
 {
-	struct address_space *mapping = page->mapping;
-	struct gfs2_inode *ip = GFS2_I(mapping->host);
-	struct gfs2_holder gh;
-	int error;
-
-	unlock_page(page);
-	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
-	error = gfs2_glock_nq(&gh);
-	if (unlikely(error))
-		goto out;
-	error = AOP_TRUNCATED_PAGE;
-	lock_page(page);
-	if (page->mapping == mapping && !PageUptodate(page))
-		error = __gfs2_readpage(file, page);
-	else
-		unlock_page(page);
-	gfs2_glock_dq(&gh);
-out:
-	gfs2_holder_uninit(&gh);
-	if (error && error != AOP_TRUNCATED_PAGE)
-		lock_page(page);
-	return error;
+	return __gfs2_readpage(file, page);
 }
 
 /**
@@ -597,24 +561,9 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
 static int gfs2_readpages(struct file *file, struct address_space *mapping,
 			  struct list_head *pages, unsigned nr_pages)
 {
-	struct inode *inode = mapping->host;
-	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_sbd *sdp = GFS2_SB(inode);
-	struct gfs2_holder gh;
-	int ret;
-
-	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
-	ret = gfs2_glock_nq(&gh);
-	if (unlikely(ret))
-		goto out_uninit;
-	if (!gfs2_is_stuffed(ip))
-		ret = mpage_readpages(mapping, pages, nr_pages, gfs2_block_map);
-	gfs2_glock_dq(&gh);
-out_uninit:
-	gfs2_holder_uninit(&gh);
-	if (unlikely(gfs2_withdrawn(sdp)))
-		ret = -EIO;
-	return ret;
+	if (!gfs2_is_stuffed(GFS2_I(mapping->host)))
+		return mpage_readpages(mapping, pages, nr_pages, gfs2_block_map);
+	return 0;
 }
 
 /**
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index fe305e4bfd37..607bbf4dfadb 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -558,8 +558,29 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	return block_page_mkwrite_return(ret);
 }
 
+static vm_fault_t gfs2_fault(struct vm_fault *vmf)
+{
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_holder gh;
+	vm_fault_t ret;
+	int err;
+
+	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
+	err = gfs2_glock_nq(&gh);
+	if (err) {
+		ret = block_page_mkwrite_return(err);
+		goto out_uninit;
+	}
+	ret = filemap_fault(vmf);
+	gfs2_glock_dq(&gh);
+out_uninit:
+	gfs2_holder_uninit(&gh);
+	return ret;
+}
+
 static const struct vm_operations_struct gfs2_vm_ops = {
-	.fault = filemap_fault,
+	.fault = gfs2_fault,
 	.map_pages = filemap_map_pages,
 	.page_mkwrite = gfs2_page_mkwrite,
 };
@@ -824,15 +845,45 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 
 static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
+	struct gfs2_inode *ip;
+	struct gfs2_holder gh;
+	size_t written = 0;
 	ssize_t ret;
 
+	gfs2_holder_mark_uninitialized(&gh);
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		ret = gfs2_file_direct_read(iocb, to);
 		if (likely(ret != -ENOTBLK))
 			return ret;
 		iocb->ki_flags &= ~IOCB_DIRECT;
 	}
-	return generic_file_read_iter(iocb, to);
+	iocb->ki_flags |= IOCB_NOIO;
+	ret = generic_file_read_iter(iocb, to);
+	iocb->ki_flags &= ~IOCB_NOIO;
+	if (ret >= 0) {
+		if (!iov_iter_count(to))
+			return ret;
+		written = ret;
+	} else {
+		if (ret != -EAGAIN)
+			return ret;
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return ret;
+	}
+	ip = GFS2_I(iocb->ki_filp->f_mapping->host);
+	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
+	ret = gfs2_glock_nq(&gh);
+	if (ret)
+		goto out_uninit;
+	ret = generic_file_read_iter(iocb, to);
+	if (ret > 0)
+		written += ret;
+	if (gfs2_holder_initialized(&gh))
+		gfs2_glock_dq(&gh);
+out_uninit:
+	if (gfs2_holder_initialized(&gh))
+		gfs2_holder_uninit(&gh);
+	return written ? written : ret;
 }
 
 /**
-- 
2.26.2


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC 4/4] gfs2: Reinstate readahead conversion
  2020-07-02 16:51 [RFC 0/4] Fix gfs2 readahead deadlocks Andreas Gruenbacher
                   ` (2 preceding siblings ...)
  2020-07-02 16:51 ` [RFC 3/4] gfs2: Rework read and page fault locking Andreas Gruenbacher
@ 2020-07-02 16:51 ` Andreas Gruenbacher
  2020-07-02 18:10 ` [RFC 0/4] Fix gfs2 readahead deadlocks Linus Torvalds
  4 siblings, 0 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2020-07-02 16:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Linus Torvalds, linux-fsdevel, linux-mm,
	linux-kernel, Andreas Gruenbacher

Now that the locking order in gfs2 is fixed, switch back to using the
->readahead address space operation.  With that, mpage_readpages is
unused and can be removed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/aops.c        | 19 +++++------
 fs/mpage.c            | 75 -------------------------------------------
 include/linux/mpage.h |  2 --
 3 files changed, 10 insertions(+), 86 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 28f097636e78..68cd700a2719 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -541,7 +541,7 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
 }
 
 /**
- * gfs2_readpages - Read a bunch of pages at once
+ * gfs2_readahead - Read a bunch of pages at once
  * @file: The file to read from
  * @mapping: Address space info
  * @pages: List of pages to read
@@ -554,16 +554,17 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
  *    obviously not something we'd want to do on too regular a basis.
  *    Any I/O we ignore at this time will be done via readpage later.
  * 2. We don't handle stuffed files here we let readpage do the honours.
- * 3. mpage_readpages() does most of the heavy lifting in the common case.
+ * 3. mpage_readahead() does most of the heavy lifting in the common case.
  * 4. gfs2_block_map() is relied upon to set BH_Boundary in the right places.
  */
 
-static int gfs2_readpages(struct file *file, struct address_space *mapping,
-			  struct list_head *pages, unsigned nr_pages)
+static void gfs2_readahead(struct readahead_control *rac)
 {
-	if (!gfs2_is_stuffed(GFS2_I(mapping->host)))
-		return mpage_readpages(mapping, pages, nr_pages, gfs2_block_map);
-	return 0;
+	struct inode *inode = rac->mapping->host;
+	struct gfs2_inode *ip = GFS2_I(inode);
+
+	if (!gfs2_is_stuffed(ip))
+		mpage_readahead(rac, gfs2_block_map);
 }
 
 /**
@@ -782,7 +783,7 @@ static const struct address_space_operations gfs2_aops = {
 	.writepage = gfs2_writepage,
 	.writepages = gfs2_writepages,
 	.readpage = gfs2_readpage,
-	.readpages = gfs2_readpages,
+	.readahead = gfs2_readahead,
 	.bmap = gfs2_bmap,
 	.invalidatepage = gfs2_invalidatepage,
 	.releasepage = gfs2_releasepage,
@@ -796,7 +797,7 @@ static const struct address_space_operations gfs2_jdata_aops = {
 	.writepage = gfs2_jdata_writepage,
 	.writepages = gfs2_jdata_writepages,
 	.readpage = gfs2_readpage,
-	.readpages = gfs2_readpages,
+	.readahead = gfs2_readahead,
 	.set_page_dirty = jdata_set_page_dirty,
 	.bmap = gfs2_bmap,
 	.invalidatepage = gfs2_invalidatepage,
diff --git a/fs/mpage.c b/fs/mpage.c
index 5243a065a062..830e6cc2a9e7 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -396,81 +396,6 @@ void mpage_readahead(struct readahead_control *rac, get_block_t get_block)
 }
 EXPORT_SYMBOL(mpage_readahead);
 
-/**
- * mpage_readpages - populate an address space with some pages & start reads against them
- * @mapping: the address_space
- * @pages: The address of a list_head which contains the target pages.  These
- *   pages have their ->index populated and are otherwise uninitialised.
- *   The page at @pages->prev has the lowest file offset, and reads should be
- *   issued in @pages->prev to @pages->next order.
- * @nr_pages: The number of pages at *@pages
- * @get_block: The filesystem's block mapper function.
- *
- * This function walks the pages and the blocks within each page, building and
- * emitting large BIOs.
- *
- * If anything unusual happens, such as:
- *
- * - encountering a page which has buffers
- * - encountering a page which has a non-hole after a hole
- * - encountering a page with non-contiguous blocks
- *
- * then this code just gives up and calls the buffer_head-based read function.
- * It does handle a page which has holes at the end - that is a common case:
- * the end-of-file on blocksize < PAGE_SIZE setups.
- *
- * BH_Boundary explanation:
- *
- * There is a problem.  The mpage read code assembles several pages, gets all
- * their disk mappings, and then submits them all.  That's fine, but obtaining
- * the disk mappings may require I/O.  Reads of indirect blocks, for example.
- *
- * So an mpage read of the first 16 blocks of an ext2 file will cause I/O to be
- * submitted in the following order:
- *
- * 	12 0 1 2 3 4 5 6 7 8 9 10 11 13 14 15 16
- *
- * because the indirect block has to be read to get the mappings of blocks
- * 13,14,15,16.  Obviously, this impacts performance.
- *
- * So what we do it to allow the filesystem's get_block() function to set
- * BH_Boundary when it maps block 11.  BH_Boundary says: mapping of the block
- * after this one will require I/O against a block which is probably close to
- * this one.  So you should push what I/O you have currently accumulated.
- *
- * This all causes the disk requests to be issued in the correct order.
- */
-int
-mpage_readpages(struct address_space *mapping, struct list_head *pages,
-				unsigned nr_pages, get_block_t get_block)
-{
-	struct mpage_readpage_args args = {
-		.get_block = get_block,
-		.is_readahead = true,
-	};
-	unsigned page_idx;
-
-	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
-		struct page *page = lru_to_page(pages);
-
-		prefetchw(&page->flags);
-		list_del(&page->lru);
-		if (!add_to_page_cache_lru(page, mapping,
-					page->index,
-					readahead_gfp_mask(mapping))) {
-			args.page = page;
-			args.nr_pages = nr_pages - page_idx;
-			args.bio = do_mpage_readpage(&args);
-		}
-		put_page(page);
-	}
-	BUG_ON(!list_empty(pages));
-	if (args.bio)
-		mpage_bio_submit(REQ_OP_READ, REQ_RAHEAD, args.bio);
-	return 0;
-}
-EXPORT_SYMBOL(mpage_readpages);
-
 /*
  * This isn't called much at all
  */
diff --git a/include/linux/mpage.h b/include/linux/mpage.h
index 181f1b0fbd83..f4f5e90a6844 100644
--- a/include/linux/mpage.h
+++ b/include/linux/mpage.h
@@ -16,8 +16,6 @@ struct writeback_control;
 struct readahead_control;
 
 void mpage_readahead(struct readahead_control *, get_block_t get_block);
-int mpage_readpages(struct address_space *, struct list_head *, unsigned,
-		    get_block_t);
 int mpage_readpage(struct page *page, get_block_t get_block);
 int mpage_writepages(struct address_space *mapping,
 		struct writeback_control *wbc, get_block_t get_block);
-- 
2.26.2


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 2/4] fs: Add IOCB_NOIO flag for generic_file_read_iter
  2020-07-02 16:51 ` [RFC 2/4] fs: Add IOCB_NOIO flag for generic_file_read_iter Andreas Gruenbacher
@ 2020-07-02 18:06   ` Linus Torvalds
  2020-07-02 19:58     ` Andreas Gruenbacher
  2020-07-07 14:30     ` Andreas Gruenbacher
  0 siblings, 2 replies; 12+ messages in thread
From: Linus Torvalds @ 2020-07-02 18:06 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Matthew Wilcox, Dave Chinner, linux-fsdevel, Linux-MM,
	Linux Kernel Mailing List

On Thu, Jul 2, 2020 at 9:51 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> Add an IOCB_NOIO flag that indicates to generic_file_read_iter that it
> shouldn't trigger any filesystem I/O for the actual request or for
> readahead.  This allows to do tentative reads out of the page cache as
> some filesystems allow, and to take the appropriate locks and retry the
> reads only if the requested pages are not cached.

This looks sane to me, except for this part:
>                 if (!PageUptodate(page)) {
> -                       if (iocb->ki_flags & IOCB_NOWAIT) {
> +                       if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) {
>                                 put_page(page);
>                                 goto would_block;
>                         }

This path doesn't actually initiate reads at all - it waits for
existing reads to finish.

So I think it should only check for IOCB_NOWAIT.

Of course, if you want to avoid both new reads to be submitted _and_
avoid waiting for existing pending reads, you should just set both
flags, and you get the semantics you want. So for your case, this may
not make any difference.

But if the issue is a deadlock where the code can block for IO, but
not call back down to the filesystem for new IO (because it holds a
lock that the filesystem might need) then this patch as-is is wrong,
because it disallows even that case.

              Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 0/4] Fix gfs2 readahead deadlocks
  2020-07-02 16:51 [RFC 0/4] Fix gfs2 readahead deadlocks Andreas Gruenbacher
                   ` (3 preceding siblings ...)
  2020-07-02 16:51 ` [RFC 4/4] gfs2: Reinstate readahead conversion Andreas Gruenbacher
@ 2020-07-02 18:10 ` Linus Torvalds
  2020-07-02 18:23   ` Andreas Gruenbacher
  4 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2020-07-02 18:10 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Matthew Wilcox, Dave Chinner, linux-fsdevel, Linux-MM,
	Linux Kernel Mailing List

On Thu, Jul 2, 2020 at 9:51 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> Of this patch queue, either only the first patch or all four patches can
> be applied to fix gfs2's current issues in 5.8.  Please let me know what
> you think.

I think the IOCB_NOIO flag looks fine (apart from the nit I pointed
out), abnd we could do that.

However, is the "revert and reinstate" looks odd. Is the reinstate so
different front he original that it makes sense to do that way?

Or was it done that way only to give the choice of just doing the revert?

Because if so, I think I'd rather just see a "fix" rather than
"revert+reinstate".

But I didn't look that closely at the gfs2 code itself, maybe there's
some reason you did it that way.

              Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 0/4] Fix gfs2 readahead deadlocks
  2020-07-02 18:10 ` [RFC 0/4] Fix gfs2 readahead deadlocks Linus Torvalds
@ 2020-07-02 18:23   ` Andreas Gruenbacher
  0 siblings, 0 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2020-07-02 18:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Dave Chinner, linux-fsdevel, Linux-MM,
	Linux Kernel Mailing List

On Thu, Jul 2, 2020 at 8:11 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jul 2, 2020 at 9:51 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > Of this patch queue, either only the first patch or all four patches can
> > be applied to fix gfs2's current issues in 5.8.  Please let me know what
> > you think.
>
> I think the IOCB_NOIO flag looks fine (apart from the nit I pointed
> out), and we could do that.

Ok, that's a step forward.

> However, is the "revert and reinstate" looks odd. Is the reinstate so
> different from the original that it makes sense to do that way?
>
> Or was it done that way only to give the choice of just doing the revert?
>
> Because if so, I think I'd rather just see a "fix" rather than
> "revert+reinstate".

I only did the "revert and reinstate" so that the revert alone will
give us a working gfs2 in 5.8. If there's agreement to add the
IOCB_NOIO flag, then we can just fix gfs2 (basically
https://lore.kernel.org/linux-fsdevel/20200619093916.1081129-3-agruenba@redhat.com/
with IOCB_CACHED renamed to IOCB_NOIO).

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 2/4] fs: Add IOCB_NOIO flag for generic_file_read_iter
  2020-07-02 18:06   ` Linus Torvalds
@ 2020-07-02 19:58     ` Andreas Gruenbacher
  2020-07-02 20:17       ` Linus Torvalds
  2020-07-07 14:30     ` Andreas Gruenbacher
  1 sibling, 1 reply; 12+ messages in thread
From: Andreas Gruenbacher @ 2020-07-02 19:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Dave Chinner, linux-fsdevel, Linux-MM,
	Linux Kernel Mailing List

On Thu, Jul 2, 2020 at 8:06 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jul 2, 2020 at 9:51 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > Add an IOCB_NOIO flag that indicates to generic_file_read_iter that it
> > shouldn't trigger any filesystem I/O for the actual request or for
> > readahead.  This allows to do tentative reads out of the page cache as
> > some filesystems allow, and to take the appropriate locks and retry the
> > reads only if the requested pages are not cached.
>
> This looks sane to me, except for this part:
> >                 if (!PageUptodate(page)) {
> > -                       if (iocb->ki_flags & IOCB_NOWAIT) {
> > +                       if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) {
> >                                 put_page(page);
> >                                 goto would_block;
> >                         }
>
> This path doesn't actually initiate reads at all - it waits for
> existing reads to finish.
>
> So I think it should only check for IOCB_NOWAIT.
>
> Of course, if you want to avoid both new reads to be submitted _and_
> avoid waiting for existing pending reads, you should just set both
> flags, and you get the semantics you want. So for your case, this may
> not make any difference.

Indeed, in the gfs2 case, waiting for existing pending reads should be
fine. I'll send an update after some testing.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 2/4] fs: Add IOCB_NOIO flag for generic_file_read_iter
  2020-07-02 19:58     ` Andreas Gruenbacher
@ 2020-07-02 20:17       ` Linus Torvalds
  2020-07-03  9:45         ` Andreas Gruenbacher
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2020-07-02 20:17 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Matthew Wilcox, Dave Chinner, linux-fsdevel, Linux-MM,
	Linux Kernel Mailing List

On Thu, Jul 2, 2020 at 12:58 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> > Of course, if you want to avoid both new reads to be submitted _and_
> > avoid waiting for existing pending reads, you should just set both
> > flags, and you get the semantics you want. So for your case, this may
> > not make any difference.
>
> Indeed, in the gfs2 case, waiting for existing pending reads should be
> fine. I'll send an update after some testing.

Do note that "wait for pending reads" very much does imply "wait for
those reads to _complete_".

And maybe the IO completion handler itself ends up having to finalize
something and take the lock to do that?

So in that case, even just "waiting" will cause a deadlock. Not
because the waiter itself needs the lock, but because the thing it
waits for might possibly need it.

But in many simple cases, IO completion shouldn't need any filesystem
locks. I just don't know the gfs2 code at all, so I'm not even going
to guess. I just wanted to mention it.

               Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 2/4] fs: Add IOCB_NOIO flag for generic_file_read_iter
  2020-07-02 20:17       ` Linus Torvalds
@ 2020-07-03  9:45         ` Andreas Gruenbacher
  0 siblings, 0 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2020-07-03  9:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Dave Chinner, linux-fsdevel, Linux-MM,
	Linux Kernel Mailing List

On Thu, Jul 2, 2020 at 10:18 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jul 2, 2020 at 12:58 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > Of course, if you want to avoid both new reads to be submitted _and_
> > > avoid waiting for existing pending reads, you should just set both
> > > flags, and you get the semantics you want. So for your case, this may
> > > not make any difference.
> >
> > Indeed, in the gfs2 case, waiting for existing pending reads should be
> > fine. I'll send an update after some testing.
>
> Do note that "wait for pending reads" very much does imply "wait for
> those reads to _complete_".
>
> And maybe the IO completion handler itself ends up having to finalize
> something and take the lock to do that?
>
> So in that case, even just "waiting" will cause a deadlock. Not
> because the waiter itself needs the lock, but because the thing it
> waits for might possibly need it.
>
> But in many simple cases, IO completion shouldn't need any filesystem
> locks. I just don't know the gfs2 code at all, so I'm not even going
> to guess. I just wanted to mention it.

Yes, that makes sense. Luckily gfs2 doesn't do any such locking on IO
completion.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 2/4] fs: Add IOCB_NOIO flag for generic_file_read_iter
  2020-07-02 18:06   ` Linus Torvalds
  2020-07-02 19:58     ` Andreas Gruenbacher
@ 2020-07-07 14:30     ` Andreas Gruenbacher
  1 sibling, 0 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2020-07-07 14:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Dave Chinner, linux-fsdevel, Linux-MM,
	Linux Kernel Mailing List

On Thu, Jul 2, 2020 at 8:06 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jul 2, 2020 at 9:51 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > Add an IOCB_NOIO flag that indicates to generic_file_read_iter that it
> > shouldn't trigger any filesystem I/O for the actual request or for
> > readahead.  This allows to do tentative reads out of the page cache as
> > some filesystems allow, and to take the appropriate locks and retry the
> > reads only if the requested pages are not cached.
>
> This looks sane to me, except for this part:
> >                 if (!PageUptodate(page)) {
> > -                       if (iocb->ki_flags & IOCB_NOWAIT) {
> > +                       if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) {
> >                                 put_page(page);
> >                                 goto would_block;
> >                         }
>
> This path doesn't actually initiate reads at all - it waits for
> existing reads to finish.
>
> So I think it should only check for IOCB_NOWAIT.

It turns out that label readpage is reachable from here via goto
page_not_up_to_date / goto page_not_up_to_date_locked. So IOCB_NOIO
needs to be checked somewhere. I'll send an update.

Andreas


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-07-07 14:30 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-02 16:51 [RFC 0/4] Fix gfs2 readahead deadlocks Andreas Gruenbacher
2020-07-02 16:51 ` [RFC 1/4] gfs2: Revert readahead conversion Andreas Gruenbacher
2020-07-02 16:51 ` [RFC 2/4] fs: Add IOCB_NOIO flag for generic_file_read_iter Andreas Gruenbacher
2020-07-02 18:06   ` Linus Torvalds
2020-07-02 19:58     ` Andreas Gruenbacher
2020-07-02 20:17       ` Linus Torvalds
2020-07-03  9:45         ` Andreas Gruenbacher
2020-07-07 14:30     ` Andreas Gruenbacher
2020-07-02 16:51 ` [RFC 3/4] gfs2: Rework read and page fault locking Andreas Gruenbacher
2020-07-02 16:51 ` [RFC 4/4] gfs2: Reinstate readahead conversion Andreas Gruenbacher
2020-07-02 18:10 ` [RFC 0/4] Fix gfs2 readahead deadlocks Linus Torvalds
2020-07-02 18:23   ` Andreas Gruenbacher

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).