Linux-Fsdevel Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v2 0/7] no-copy bvec
@ 2021-01-02 15:17 Pavel Begunkov
  2021-01-02 15:17 ` [PATCH v2 1/7] splice: don't generate zero-len segement bvecs Pavel Begunkov
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-01-02 15:17 UTC (permalink / raw)
  To: linux-block
  Cc: Jens Axboe, Christoph Hellwig, Matthew Wilcox, Ming Lei,
	Johannes Weiner, Alexander Viro, Darrick J . Wong,
	Martin K . Petersen, Jonathan Corbet, linux-xfs, linux-fsdevel,
	io-uring, linux-kernel, target-devel, linux-scsi, linux-doc

Currently, when iomap and block direct IO gets a bvec based iterator
the bvec will be copied, with all other accounting that takes much
CPU time and causes additional allocation for larger bvecs. The
patchset makes it to reuse the passed in iter bvec.

[1,2] are forbidding zero-length bvec segments to not pile special
cases, [3] skip/fix PSI tracking to not iterate over bvecs extra
time.


nullblk completion_nsec=0 submit_queues=NR_CORES, no merges, no stats
fio/t/io_uring /dev/nullb0 -d 128 -s 32 -c 32 -p 0 -B 1 -F 1 -b BLOCK_SIZE

BLOCK_SIZE             512     4K      8K      16K     32K     64K
===================================================================
old (KIOPS)            1208    1208    1131    1039    863     699
new (KIOPS)            1222    1222    1170    1137    1083    982

Previously, Jens got before 10% difference for polling real HW and small
block sizes, but that was for an older version that had one
iov_iter_advance() less


since RFC:
- add target_core_file patch by Christoph
- make no-copy default behaviour, remove iter flag
- iter_advance() instead of hacks to revert to work
- add bvec iter_advance() optimisation patch
- remove PSI annotations from direct IO (iomap, block and fs/direct)
- note in d/f/porting

since v1:
- don't allow zero-length bvec segments (Ming)
- don't add a BIO_WORKINGSET-less version of bio_add_page(), just clear
  the flag at the end and leave it for further cleanups (Christoph)
- commit message and comments rewording (Dave)
- other nits by Christoph

Christoph Hellwig (1):
  target/file: allocate the bvec array as part of struct
    target_core_file_cmd

Pavel Begunkov (6):
  splice: don't generate zero-len segement bvecs
  bvec/iter: disallow zero-length segment bvecs
  block/psi: remove PSI annotations from direct IO
  iov_iter: optimise bvec iov_iter_advance()
  bio: add a helper calculating nr segments to alloc
  bio: don't copy bvec for direct IO

 Documentation/filesystems/porting.rst | 16 ++++++
 block/bio.c                           | 71 +++++++++++++--------------
 drivers/target/target_core_file.c     | 20 +++-----
 fs/block_dev.c                        |  7 +--
 fs/direct-io.c                        |  2 +
 fs/iomap/direct-io.c                  |  9 ++--
 fs/splice.c                           |  9 ++--
 include/linux/bio.h                   | 13 +++++
 lib/iov_iter.c                        | 21 +++++++-
 9 files changed, 103 insertions(+), 65 deletions(-)

-- 
2.24.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 1/7] splice: don't generate zero-len segement bvecs
  2021-01-02 15:17 [PATCH v2 0/7] no-copy bvec Pavel Begunkov
@ 2021-01-02 15:17 ` Pavel Begunkov
  2021-01-04 16:17   ` Christoph Hellwig
  2021-01-02 15:17 ` [PATCH v2 2/7] bvec/iter: disallow zero-length segment bvecs Pavel Begunkov
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2021-01-02 15:17 UTC (permalink / raw)
  To: linux-block
  Cc: Jens Axboe, Christoph Hellwig, Matthew Wilcox, Ming Lei,
	Johannes Weiner, Alexander Viro, Darrick J . Wong,
	Martin K . Petersen, Jonathan Corbet, linux-xfs, linux-fsdevel,
	io-uring, linux-kernel, target-devel, linux-scsi, linux-doc

iter_file_splice_write() may spawn bvec segments with zero-length. In
preparation for prohibiting them, filter out by hand at splice level.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/splice.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 866d5c2367b2..7299330c3270 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -644,7 +644,6 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 		ret = splice_from_pipe_next(pipe, &sd);
 		if (ret <= 0)
 			break;
-
 		if (unlikely(nbufs < pipe->max_usage)) {
 			kfree(array);
 			nbufs = pipe->max_usage;
@@ -662,12 +661,13 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 
 		/* build the vector */
 		left = sd.total_len;
-		for (n = 0; !pipe_empty(head, tail) && left && n < nbufs; tail++, n++) {
+		for (n = 0; !pipe_empty(head, tail) && left && n < nbufs; tail++) {
 			struct pipe_buffer *buf = &pipe->bufs[tail & mask];
 			size_t this_len = buf->len;
 
-			if (this_len > left)
-				this_len = left;
+			if (!this_len)
+				continue;
+			this_len = min(this_len, left);
 
 			ret = pipe_buf_confirm(pipe, buf);
 			if (unlikely(ret)) {
@@ -680,6 +680,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 			array[n].bv_len = this_len;
 			array[n].bv_offset = buf->offset;
 			left -= this_len;
+			n++;
 		}
 
 		iov_iter_bvec(&from, WRITE, array, n, sd.total_len - left);
-- 
2.24.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 2/7] bvec/iter: disallow zero-length segment bvecs
  2021-01-02 15:17 [PATCH v2 0/7] no-copy bvec Pavel Begunkov
  2021-01-02 15:17 ` [PATCH v2 1/7] splice: don't generate zero-len segement bvecs Pavel Begunkov
@ 2021-01-02 15:17 ` Pavel Begunkov
  2021-01-04 16:18   ` Christoph Hellwig
  2021-01-04 16:37   ` Matthew Wilcox
  2021-01-02 15:17 ` [PATCH v2 3/7] block/psi: remove PSI annotations from direct IO Pavel Begunkov
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-01-02 15:17 UTC (permalink / raw)
  To: linux-block
  Cc: Jens Axboe, Christoph Hellwig, Matthew Wilcox, Ming Lei,
	Johannes Weiner, Alexander Viro, Darrick J . Wong,
	Martin K . Petersen, Jonathan Corbet, linux-xfs, linux-fsdevel,
	io-uring, linux-kernel, target-devel, linux-scsi, linux-doc

zero-length bvec segments are allowed in general, but not handled by bio
and down the block layer so filtered out. This inconsistency may be
confusing and prevent from optimisations. As zero-length segments are
useless and places that were generating them are patched, declare them
not allowed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 Documentation/filesystems/porting.rst | 7 +++++++
 lib/iov_iter.c                        | 2 --
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 867036aa90b8..c722d94f29ea 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -865,3 +865,10 @@ no matter what.  Everything is handled by the caller.
 
 clone_private_mount() returns a longterm mount now, so the proper destructor of
 its result is kern_unmount() or kern_unmount_array().
+
+---
+
+**mandatory**
+
+zero-length bvec segments are disallowed, they must be filtered out before
+passed on to an iterator.
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 1635111c5bd2..7de304269641 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -72,8 +72,6 @@
 	__start.bi_bvec_done = skip;			\
 	__start.bi_idx = 0;				\
 	for_each_bvec(__v, i->bvec, __bi, __start) {	\
-		if (!__v.bv_len)			\
-			continue;			\
 		(void)(STEP);				\
 	}						\
 }
-- 
2.24.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 3/7] block/psi: remove PSI annotations from direct IO
  2021-01-02 15:17 [PATCH v2 0/7] no-copy bvec Pavel Begunkov
  2021-01-02 15:17 ` [PATCH v2 1/7] splice: don't generate zero-len segement bvecs Pavel Begunkov
  2021-01-02 15:17 ` [PATCH v2 2/7] bvec/iter: disallow zero-length segment bvecs Pavel Begunkov
@ 2021-01-02 15:17 ` Pavel Begunkov
  2021-01-04 16:45   ` Christoph Hellwig
  2021-01-02 15:17 ` [PATCH v2 4/7] target/file: allocate the bvec array as part of struct target_core_file_cmd Pavel Begunkov
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2021-01-02 15:17 UTC (permalink / raw)
  To: linux-block
  Cc: Jens Axboe, Christoph Hellwig, Matthew Wilcox, Ming Lei,
	Johannes Weiner, Alexander Viro, Darrick J . Wong,
	Martin K . Petersen, Jonathan Corbet, linux-xfs, linux-fsdevel,
	io-uring, linux-kernel, target-devel, linux-scsi, linux-doc

Direct IO does not operate on the current working set of pages managed
by the kernel, so it should not be accounted as memory stall to PSI
infrastructure.

The block layer and iomap direct IO use bio_iov_iter_get_pages()
to build bios, and they are the only users of it, so to avoid PSI
tracking for them clear out BIO_WORKINGSET flag. Do same for
dio_bio_submit() because fs/direct_io constructs bios by hand directly
calling bio_add_page().

Reported-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 block/bio.c    | 6 ++++++
 fs/direct-io.c | 2 ++
 2 files changed, 8 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 1f2cc1fbe283..9f26984af643 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1099,6 +1099,9 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
  * fit into the bio, or are requested in @iter, whatever is smaller. If
  * MM encounters an error pinning the requested pages, it stops. Error
  * is returned only if 0 pages could be pinned.
+ *
+ * It's intended for direct IO, so doesn't do PSI tracking, the caller is
+ * responsible for setting BIO_WORKINGSET if necessary.
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1123,6 +1126,9 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	if (is_bvec)
 		bio_set_flag(bio, BIO_NO_PAGE_REF);
+
+	/* don't account direct I/O as memory stall */
+	bio_clear_flag(bio, BIO_WORKINGSET);
 	return bio->bi_vcnt ? 0 : ret;
 }
 EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index d53fa92a1ab6..0e689233f2c7 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -426,6 +426,8 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 	unsigned long flags;
 
 	bio->bi_private = dio;
+	/* don't account direct I/O as memory stall */
+	bio_clear_flag(bio, BIO_WORKINGSET);
 
 	spin_lock_irqsave(&dio->bio_lock, flags);
 	dio->refcount++;
-- 
2.24.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 4/7] target/file: allocate the bvec array as part of struct target_core_file_cmd
  2021-01-02 15:17 [PATCH v2 0/7] no-copy bvec Pavel Begunkov
                   ` (2 preceding siblings ...)
  2021-01-02 15:17 ` [PATCH v2 3/7] block/psi: remove PSI annotations from direct IO Pavel Begunkov
@ 2021-01-02 15:17 ` Pavel Begunkov
  2021-01-02 15:17 ` [PATCH v2 5/7] iov_iter: optimise bvec iov_iter_advance() Pavel Begunkov
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-01-02 15:17 UTC (permalink / raw)
  To: linux-block
  Cc: Jens Axboe, Christoph Hellwig, Matthew Wilcox, Ming Lei,
	Johannes Weiner, Alexander Viro, Darrick J . Wong,
	Martin K . Petersen, Jonathan Corbet, linux-xfs, linux-fsdevel,
	io-uring, linux-kernel, target-devel, linux-scsi, linux-doc,
	Christoph Hellwig

From: Christoph Hellwig <hch@lst.de>

This saves one memory allocation, and ensures the bvecs aren't freed
before the AIO completion.  This will allow the lower level code to be
optimized so that it can avoid allocating another bvec array.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 drivers/target/target_core_file.c | 20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
index b0cb5b95e892..cce455929778 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -241,6 +241,7 @@ struct target_core_file_cmd {
 	unsigned long	len;
 	struct se_cmd	*cmd;
 	struct kiocb	iocb;
+	struct bio_vec	bvecs[];
 };
 
 static void cmd_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
@@ -268,29 +269,22 @@ fd_execute_rw_aio(struct se_cmd *cmd, struct scatterlist *sgl, u32 sgl_nents,
 	struct target_core_file_cmd *aio_cmd;
 	struct iov_iter iter = {};
 	struct scatterlist *sg;
-	struct bio_vec *bvec;
 	ssize_t len = 0;
 	int ret = 0, i;
 
-	aio_cmd = kmalloc(sizeof(struct target_core_file_cmd), GFP_KERNEL);
+	aio_cmd = kmalloc(struct_size(aio_cmd, bvecs, sgl_nents), GFP_KERNEL);
 	if (!aio_cmd)
 		return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
 
-	bvec = kcalloc(sgl_nents, sizeof(struct bio_vec), GFP_KERNEL);
-	if (!bvec) {
-		kfree(aio_cmd);
-		return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
-	}
-
 	for_each_sg(sgl, sg, sgl_nents, i) {
-		bvec[i].bv_page = sg_page(sg);
-		bvec[i].bv_len = sg->length;
-		bvec[i].bv_offset = sg->offset;
+		aio_cmd->bvecs[i].bv_page = sg_page(sg);
+		aio_cmd->bvecs[i].bv_len = sg->length;
+		aio_cmd->bvecs[i].bv_offset = sg->offset;
 
 		len += sg->length;
 	}
 
-	iov_iter_bvec(&iter, is_write, bvec, sgl_nents, len);
+	iov_iter_bvec(&iter, is_write, aio_cmd->bvecs, sgl_nents, len);
 
 	aio_cmd->cmd = cmd;
 	aio_cmd->len = len;
@@ -307,8 +301,6 @@ fd_execute_rw_aio(struct se_cmd *cmd, struct scatterlist *sgl, u32 sgl_nents,
 	else
 		ret = call_read_iter(file, &aio_cmd->iocb, &iter);
 
-	kfree(bvec);
-
 	if (ret != -EIOCBQUEUED)
 		cmd_rw_aio_complete(&aio_cmd->iocb, ret, 0);
 
-- 
2.24.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 5/7] iov_iter: optimise bvec iov_iter_advance()
  2021-01-02 15:17 [PATCH v2 0/7] no-copy bvec Pavel Begunkov
                   ` (3 preceding siblings ...)
  2021-01-02 15:17 ` [PATCH v2 4/7] target/file: allocate the bvec array as part of struct target_core_file_cmd Pavel Begunkov
@ 2021-01-02 15:17 ` Pavel Begunkov
  2021-01-02 15:17 ` [PATCH v2 6/7] bio: add a helper calculating nr segments to alloc Pavel Begunkov
  2021-01-02 15:17 ` [PATCH v2 7/7] bio: don't copy bvec for direct IO Pavel Begunkov
  6 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-01-02 15:17 UTC (permalink / raw)
  To: linux-block
  Cc: Jens Axboe, Christoph Hellwig, Matthew Wilcox, Ming Lei,
	Johannes Weiner, Alexander Viro, Darrick J . Wong,
	Martin K . Petersen, Jonathan Corbet, linux-xfs, linux-fsdevel,
	io-uring, linux-kernel, target-devel, linux-scsi, linux-doc,
	Christoph Hellwig

iov_iter_advance() is heavily used, but implemented through generic
means. For bvecs there is a specifically crafted function for that, so
use bvec_iter_advance() instead, it's faster and slimmer.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 lib/iov_iter.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 7de304269641..9b1c109dc8a9 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1065,6 +1065,21 @@ static void pipe_advance(struct iov_iter *i, size_t size)
 	pipe_truncate(i);
 }
 
+static void iov_iter_bvec_advance(struct iov_iter *i, size_t size)
+{
+	struct bvec_iter bi;
+
+	bi.bi_size = i->count;
+	bi.bi_bvec_done = i->iov_offset;
+	bi.bi_idx = 0;
+	bvec_iter_advance(i->bvec, &bi, size);
+
+	i->bvec += bi.bi_idx;
+	i->nr_segs -= bi.bi_idx;
+	i->count = bi.bi_size;
+	i->iov_offset = bi.bi_bvec_done;
+}
+
 void iov_iter_advance(struct iov_iter *i, size_t size)
 {
 	if (unlikely(iov_iter_is_pipe(i))) {
@@ -1075,6 +1090,10 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 		i->count -= size;
 		return;
 	}
+	if (iov_iter_is_bvec(i)) {
+		iov_iter_bvec_advance(i, size);
+		return;
+	}
 	iterate_and_advance(i, size, v, 0, 0, 0)
 }
 EXPORT_SYMBOL(iov_iter_advance);
-- 
2.24.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 6/7] bio: add a helper calculating nr segments to alloc
  2021-01-02 15:17 [PATCH v2 0/7] no-copy bvec Pavel Begunkov
                   ` (4 preceding siblings ...)
  2021-01-02 15:17 ` [PATCH v2 5/7] iov_iter: optimise bvec iov_iter_advance() Pavel Begunkov
@ 2021-01-02 15:17 ` Pavel Begunkov
  2021-01-02 15:17 ` [PATCH v2 7/7] bio: don't copy bvec for direct IO Pavel Begunkov
  6 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-01-02 15:17 UTC (permalink / raw)
  To: linux-block
  Cc: Jens Axboe, Christoph Hellwig, Matthew Wilcox, Ming Lei,
	Johannes Weiner, Alexander Viro, Darrick J . Wong,
	Martin K . Petersen, Jonathan Corbet, linux-xfs, linux-fsdevel,
	io-uring, linux-kernel, target-devel, linux-scsi, linux-doc,
	Christoph Hellwig

Add a helper function calculating the number of bvec segments we need to
allocate to construct a bio. It doesn't change anything functionally,
but will be used to not duplicate special cases in the future.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/block_dev.c       |  7 ++++---
 fs/iomap/direct-io.c |  9 ++++-----
 include/linux/bio.h  | 10 ++++++++++
 3 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3e5b02f6606c..034e6c8cfcab 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -416,7 +416,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		dio->size += bio->bi_iter.bi_size;
 		pos += bio->bi_iter.bi_size;
 
-		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
+		nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_PAGES);
 		if (!nr_pages) {
 			bool polled = false;
 
@@ -481,9 +481,10 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	int nr_pages;
 
-	nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1);
-	if (!nr_pages)
+	if (!iov_iter_count(iter))
 		return 0;
+
+	nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_PAGES + 1);
 	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
 		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
 
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 933f234d5bec..ea1e8f696076 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -250,11 +250,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 	orig_count = iov_iter_count(dio->submit.iter);
 	iov_iter_truncate(dio->submit.iter, length);
 
-	nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
-	if (nr_pages <= 0) {
-		ret = nr_pages;
+	if (!iov_iter_count(dio->submit.iter))
 		goto out;
-	}
 
 	if (need_zeroout) {
 		/* zero out from the start of the block to the write offset */
@@ -263,6 +260,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 			iomap_dio_zero(dio, iomap, pos - pad, pad);
 	}
 
+	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_PAGES);
 	do {
 		size_t n;
 		if (dio->error) {
@@ -308,7 +306,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		dio->size += n;
 		copied += n;
 
-		nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
+		nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter,
+						 BIO_MAX_PAGES);
 		iomap_dio_submit_bio(dio, iomap, bio, pos);
 		pos += n;
 	} while (nr_pages);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 1edda614f7ce..d8f9077c43ef 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -10,6 +10,7 @@
 #include <linux/ioprio.h>
 /* struct bio, bio_vec and BIO_* flags are defined in blk_types.h */
 #include <linux/blk_types.h>
+#include <linux/uio.h>
 
 #define BIO_DEBUG
 
@@ -441,6 +442,15 @@ static inline void bio_wouldblock_error(struct bio *bio)
 	bio_endio(bio);
 }
 
+/*
+ * Calculate number of bvec segments that should be allocated to fit data
+ * pointed by @iter.
+ */
+static inline int bio_iov_vecs_to_alloc(struct iov_iter *iter, int max_segs)
+{
+	return iov_iter_npages(iter, max_segs);
+}
+
 struct request_queue;
 
 extern int submit_bio_wait(struct bio *bio);
-- 
2.24.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 7/7] bio: don't copy bvec for direct IO
  2021-01-02 15:17 [PATCH v2 0/7] no-copy bvec Pavel Begunkov
                   ` (5 preceding siblings ...)
  2021-01-02 15:17 ` [PATCH v2 6/7] bio: add a helper calculating nr segments to alloc Pavel Begunkov
@ 2021-01-02 15:17 ` Pavel Begunkov
  2021-01-04 16:20   ` Christoph Hellwig
  6 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2021-01-02 15:17 UTC (permalink / raw)
  To: linux-block
  Cc: Jens Axboe, Christoph Hellwig, Matthew Wilcox, Ming Lei,
	Johannes Weiner, Alexander Viro, Darrick J . Wong,
	Martin K . Petersen, Jonathan Corbet, linux-xfs, linux-fsdevel,
	io-uring, linux-kernel, target-devel, linux-scsi, linux-doc

The block layer spends quite a while in blkdev_direct_IO() to copy and
initialise bio's bvec. However, if we've already got a bvec in the input
iterator it might be reused in some cases, i.e. when new
ITER_BVEC_FLAG_FIXED flag is set. Simple tests show considerable
performance boost, and it also reduces memory footprint.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 Documentation/filesystems/porting.rst |  9 ++++
 block/bio.c                           | 67 ++++++++++++---------------
 include/linux/bio.h                   |  5 +-
 3 files changed, 42 insertions(+), 39 deletions(-)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index c722d94f29ea..1f8cf8e10b34 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -872,3 +872,12 @@ its result is kern_unmount() or kern_unmount_array().
 
 zero-length bvec segments are disallowed, they must be filtered out before
 passed on to an iterator.
+
+---
+
+**mandatory**
+
+For bvec based itererators bio_iov_iter_get_pages() now doesn't copy bvecs but
+uses the one provided. Anyone issuing kiocb-I/O should ensure that the bvec and
+page references stay until I/O has completed, i.e. until ->ki_complete() has
+been called or returned with non -EIOCBQUEUED code.
diff --git a/block/bio.c b/block/bio.c
index 9f26984af643..6f031a04b59a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -960,21 +960,17 @@ void bio_release_pages(struct bio *bio, bool mark_dirty)
 }
 EXPORT_SYMBOL_GPL(bio_release_pages);
 
-static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
+static int bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 {
-	const struct bio_vec *bv = iter->bvec;
-	unsigned int len;
-	size_t size;
-
-	if (WARN_ON_ONCE(iter->iov_offset > bv->bv_len))
-		return -EINVAL;
-
-	len = min_t(size_t, bv->bv_len - iter->iov_offset, iter->count);
-	size = bio_add_page(bio, bv->bv_page, len,
-				bv->bv_offset + iter->iov_offset);
-	if (unlikely(size != len))
-		return -EINVAL;
-	iov_iter_advance(iter, size);
+	WARN_ON_ONCE(BVEC_POOL_IDX(bio) != 0);
+
+	bio->bi_vcnt = iter->nr_segs;
+	bio->bi_max_vecs = iter->nr_segs;
+	bio->bi_io_vec = (struct bio_vec *)iter->bvec;
+	bio->bi_iter.bi_bvec_done = iter->iov_offset;
+	bio->bi_iter.bi_size = iter->count;
+
+	iov_iter_advance(iter, iter->count);
 	return 0;
 }
 
@@ -1088,12 +1084,12 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
  * This takes either an iterator pointing to user memory, or one pointing to
  * kernel pages (BVEC iterator). If we're adding user pages, we pin them and
  * map them into the kernel. On IO completion, the caller should put those
- * pages. If we're adding kernel pages, and the caller told us it's safe to
- * do so, we just have to add the pages to the bio directly. We don't grab an
- * extra reference to those pages (the user should already have that), and we
- * don't put the page on IO completion. The caller needs to check if the bio is
- * flagged BIO_NO_PAGE_REF on IO completion. If it isn't, then pages should be
- * released.
+ * pages. For bvec based iterators bio_iov_iter_get_pages() uses the provided
+ * bvecs rather than copying them. Hence anyone issuing kiocb based IO needs
+ * to ensure the bvecs and pages stay referenced until the submitted I/O is
+ * completed by a call to ->ki_complete() or returns with an error other than
+ * -EIOCBQUEUED. The caller needs to check if the bio is flagged BIO_NO_PAGE_REF
+ * on IO completion. If it isn't, then pages should be released.
  *
  * The function tries, but does not guarantee, to pin as many pages as
  * fit into the bio, or are requested in @iter, whatever is smaller. If
@@ -1105,27 +1101,22 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
-	const bool is_bvec = iov_iter_is_bvec(iter);
-	int ret;
-
-	if (WARN_ON_ONCE(bio->bi_vcnt))
-		return -EINVAL;
+	int ret = 0;
 
-	do {
-		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-			if (WARN_ON_ONCE(is_bvec))
-				return -EINVAL;
-			ret = __bio_iov_append_get_pages(bio, iter);
-		} else {
-			if (is_bvec)
-				ret = __bio_iov_bvec_add_pages(bio, iter);
+	if (iov_iter_is_bvec(iter)) {
+		if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
+			return -EINVAL;
+		bio_iov_bvec_set(bio, iter);
+		bio_set_flag(bio, BIO_NO_PAGE_REF);
+		return 0;
+	} else {
+		do {
+			if (bio_op(bio) == REQ_OP_ZONE_APPEND)
+				ret = __bio_iov_append_get_pages(bio, iter);
 			else
 				ret = __bio_iov_iter_get_pages(bio, iter);
-		}
-	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
-
-	if (is_bvec)
-		bio_set_flag(bio, BIO_NO_PAGE_REF);
+		} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
+	}
 
 	/* don't account direct I/O as memory stall */
 	bio_clear_flag(bio, BIO_WORKINGSET);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index d8f9077c43ef..1d30572a8c53 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -444,10 +444,13 @@ static inline void bio_wouldblock_error(struct bio *bio)
 
 /*
  * Calculate number of bvec segments that should be allocated to fit data
- * pointed by @iter.
+ * pointed by @iter. If @iter is backed by bvec it's going to be reused
+ * instead of allocating a new one.
  */
 static inline int bio_iov_vecs_to_alloc(struct iov_iter *iter, int max_segs)
 {
+	if (iov_iter_is_bvec(iter))
+		return 0;
 	return iov_iter_npages(iter, max_segs);
 }
 
-- 
2.24.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 1/7] splice: don't generate zero-len segement bvecs
  2021-01-02 15:17 ` [PATCH v2 1/7] splice: don't generate zero-len segement bvecs Pavel Begunkov
@ 2021-01-04 16:17   ` Christoph Hellwig
  2021-01-04 16:54     ` Pavel Begunkov
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2021-01-04 16:17 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-block, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	Ming Lei, Johannes Weiner, Alexander Viro, Darrick J . Wong,
	Martin K . Petersen, Jonathan Corbet, linux-xfs, linux-fsdevel,
	io-uring, linux-kernel, target-devel, linux-scsi, linux-doc

On Sat, Jan 02, 2021 at 03:17:33PM +0000, Pavel Begunkov wrote:
> iter_file_splice_write() may spawn bvec segments with zero-length. In
> preparation for prohibiting them, filter out by hand at splice level.
> 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  fs/splice.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/splice.c b/fs/splice.c
> index 866d5c2367b2..7299330c3270 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -644,7 +644,6 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
>  		ret = splice_from_pipe_next(pipe, &sd);
>  		if (ret <= 0)
>  			break;
> -

Spurious empty line removal..

> +			if (!this_len)
> +				continue;

Maybe throw in a comment on why we skip empty segments here?

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/7] bvec/iter: disallow zero-length segment bvecs
  2021-01-02 15:17 ` [PATCH v2 2/7] bvec/iter: disallow zero-length segment bvecs Pavel Begunkov
@ 2021-01-04 16:18   ` Christoph Hellwig
  2021-01-04 16:37   ` Matthew Wilcox
  1 sibling, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2021-01-04 16:18 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-block, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	Ming Lei, Johannes Weiner, Alexander Viro, Darrick J . Wong,
	Martin K . Petersen, Jonathan Corbet, linux-xfs, linux-fsdevel,
	io-uring, linux-kernel, target-devel, linux-scsi, linux-doc

On Sat, Jan 02, 2021 at 03:17:34PM +0000, Pavel Begunkov wrote:
> zero-length bvec segments are allowed in general, but not handled by bio
> and down the block layer so filtered out. This inconsistency may be
> confusing and prevent from optimisations. As zero-length segments are
> useless and places that were generating them are patched, declare them
> not allowed.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 7/7] bio: don't copy bvec for direct IO
  2021-01-02 15:17 ` [PATCH v2 7/7] bio: don't copy bvec for direct IO Pavel Begunkov
@ 2021-01-04 16:20   ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2021-01-04 16:20 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-block, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	Ming Lei, Johannes Weiner, Alexander Viro, Darrick J . Wong,
	Martin K . Petersen, Jonathan Corbet, linux-xfs, linux-fsdevel,
	io-uring, linux-kernel, target-devel, linux-scsi, linux-doc

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/7] bvec/iter: disallow zero-length segment bvecs
  2021-01-02 15:17 ` [PATCH v2 2/7] bvec/iter: disallow zero-length segment bvecs Pavel Begunkov
  2021-01-04 16:18   ` Christoph Hellwig
@ 2021-01-04 16:37   ` Matthew Wilcox
  2021-01-04 17:23     ` Pavel Begunkov
  1 sibling, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2021-01-04 16:37 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-block, Jens Axboe, Christoph Hellwig, Ming Lei,
	Johannes Weiner, Alexander Viro, Darrick J . Wong,
	Martin K . Petersen, Jonathan Corbet, linux-xfs, linux-fsdevel,
	io-uring, linux-kernel, target-devel, linux-scsi, linux-doc

On Sat, Jan 02, 2021 at 03:17:34PM +0000, Pavel Begunkov wrote:
> --- a/Documentation/filesystems/porting.rst
> +++ b/Documentation/filesystems/porting.rst
> @@ -865,3 +865,10 @@ no matter what.  Everything is handled by the caller.
>  
>  clone_private_mount() returns a longterm mount now, so the proper destructor of
>  its result is kern_unmount() or kern_unmount_array().
> +
> +---
> +
> +**mandatory**
> +
> +zero-length bvec segments are disallowed, they must be filtered out before
> +passed on to an iterator.

Why are you putting this in filesystems/porting?  Filesystems don't usually
generate bvecs ... there's nothing in this current series that stops them.
I'd suggest Documentation/block/biovecs.rst or biodoc.rst (and frankly,
biodoc.rst needs a good cleanup)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 3/7] block/psi: remove PSI annotations from direct IO
  2021-01-02 15:17 ` [PATCH v2 3/7] block/psi: remove PSI annotations from direct IO Pavel Begunkov
@ 2021-01-04 16:45   ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2021-01-04 16:45 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-block, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	Ming Lei, Johannes Weiner, Alexander Viro, Darrick J . Wong,
	Martin K . Petersen, Jonathan Corbet, linux-xfs, linux-fsdevel,
	io-uring, linux-kernel, target-devel, linux-scsi, linux-doc

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 1/7] splice: don't generate zero-len segement bvecs
  2021-01-04 16:17   ` Christoph Hellwig
@ 2021-01-04 16:54     ` Pavel Begunkov
  0 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-01-04 16:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Jens Axboe, Matthew Wilcox, Ming Lei,
	Johannes Weiner, Alexander Viro, Darrick J . Wong,
	Martin K . Petersen, Jonathan Corbet, linux-xfs, linux-fsdevel,
	io-uring, linux-kernel, target-devel, linux-scsi, linux-doc

On 04/01/2021 16:17, Christoph Hellwig wrote:
> On Sat, Jan 02, 2021 at 03:17:33PM +0000, Pavel Begunkov wrote:
>> iter_file_splice_write() may spawn bvec segments with zero-length. In
>> preparation for prohibiting them, filter out by hand at splice level.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>  fs/splice.c | 9 +++++----
>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/splice.c b/fs/splice.c
>> index 866d5c2367b2..7299330c3270 100644
>> --- a/fs/splice.c
>> +++ b/fs/splice.c
>> @@ -644,7 +644,6 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
>>  		ret = splice_from_pipe_next(pipe, &sd);
>>  		if (ret <= 0)
>>  			break;
>> -
> 
> Spurious empty line removal..
> 
>> +			if (!this_len)
>> +				continue;
> 
> Maybe throw in a comment on why we skip empty segments here?

Definitely won't hurt. Thanks for taking a look

> 
> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/7] bvec/iter: disallow zero-length segment bvecs
  2021-01-04 16:37   ` Matthew Wilcox
@ 2021-01-04 17:23     ` Pavel Begunkov
  0 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-01-04 17:23 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-block, Jens Axboe, Christoph Hellwig, Ming Lei,
	Johannes Weiner, Alexander Viro, Darrick J . Wong,
	Martin K . Petersen, Jonathan Corbet, linux-xfs, linux-fsdevel,
	io-uring, linux-kernel, target-devel, linux-scsi, linux-doc

On 04/01/2021 16:37, Matthew Wilcox wrote:
> On Sat, Jan 02, 2021 at 03:17:34PM +0000, Pavel Begunkov wrote:
>> --- a/Documentation/filesystems/porting.rst
>> +++ b/Documentation/filesystems/porting.rst
>> @@ -865,3 +865,10 @@ no matter what.  Everything is handled by the caller.
>>  
>>  clone_private_mount() returns a longterm mount now, so the proper destructor of
>>  its result is kern_unmount() or kern_unmount_array().
>> +
>> +---
>> +
>> +**mandatory**
>> +
>> +zero-length bvec segments are disallowed, they must be filtered out before
>> +passed on to an iterator.
> 
> Why are you putting this in filesystems/porting?  Filesystems don't usually

At least splice and a dozen of others do. As block apriori doesn't have them,
it's mainly addressed to fs + net. 

> generate bvecs ... there's nothing in this current series that stops them.

Yes, just mixing it with splice changes, which did all the work, looks
awkward to me. Would you suggest another segregation for the patches?

> I'd suggest Documentation/block/biovecs.rst or biodoc.rst (and frankly,
> biodoc.rst needs a good cleanup)

I'll add a note to biovecs.rst

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-01-04 17:27 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-02 15:17 [PATCH v2 0/7] no-copy bvec Pavel Begunkov
2021-01-02 15:17 ` [PATCH v2 1/7] splice: don't generate zero-len segement bvecs Pavel Begunkov
2021-01-04 16:17   ` Christoph Hellwig
2021-01-04 16:54     ` Pavel Begunkov
2021-01-02 15:17 ` [PATCH v2 2/7] bvec/iter: disallow zero-length segment bvecs Pavel Begunkov
2021-01-04 16:18   ` Christoph Hellwig
2021-01-04 16:37   ` Matthew Wilcox
2021-01-04 17:23     ` Pavel Begunkov
2021-01-02 15:17 ` [PATCH v2 3/7] block/psi: remove PSI annotations from direct IO Pavel Begunkov
2021-01-04 16:45   ` Christoph Hellwig
2021-01-02 15:17 ` [PATCH v2 4/7] target/file: allocate the bvec array as part of struct target_core_file_cmd Pavel Begunkov
2021-01-02 15:17 ` [PATCH v2 5/7] iov_iter: optimise bvec iov_iter_advance() Pavel Begunkov
2021-01-02 15:17 ` [PATCH v2 6/7] bio: add a helper calculating nr segments to alloc Pavel Begunkov
2021-01-02 15:17 ` [PATCH v2 7/7] bio: don't copy bvec for direct IO Pavel Begunkov
2021-01-04 16:20   ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).