Linux-Fsdevel Archive on lore.kernel.org
* [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast()
@ 2020-08-22  4:20 John Hubbard
  2020-08-22  4:20 ` [PATCH 1/5] iov_iter: introduce iov_iter_pin_user_pages*() routines John Hubbard
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: John Hubbard @ 2020-08-22  4:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Christoph Hellwig, Ilya Dryomov, Jens Axboe,
	Jeff Layton, linux-xfs, linux-fsdevel, linux-block, ceph-devel,
	linux-mm, LKML, John Hubbard

Hi,

This converts the Direct IO block/bio layer over to use FOLL_PIN pages
(those acquired via pin_user_pages*()). This effectively converts
several file systems (ext4, for example) that use the common Direct IO
routines. See "Remaining work", below, for more detail.

Quite a few approaches have been considered over the years. This one is
inspired by Christoph Hellwig's July, 2019 observation that there are
only 5 ITER_ types, and we can simplify handling of them for Direct IO
[1]. After working through how bio submission and completion works, I
became convinced that this is the simplest and cleanest approach to
conversion.

Not content to let well enough alone, I then continued on to the
unthinkable: adding a new flag to struct bio, whose "short int" flags
field was full, thus triggering an expansion of the field from 16 to
32 bits. This allows for a nice assertion in bio_release_pages(), that
the bio page release mechanism matches the page acquisition mechanism.
This is especially welcome for a change that affects a lot of callers
and could really make a mess if there is a bug somewhere.

I'm unable to spot any performance implications, either theoretically or
via (rather light) performance testing, from enlarging bio.bi_flags, but
I suspect there may still be valid reasons for having such a tiny
bio.bi_flags field; I just have no idea what they are. (Hardware
that knows the size of a bio? No, because there would be obvious
build-time assertions, and comments about such a constraint.) Anyway, I
can drop that patch if it seems like too much cost for too little
benefit.
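As a rough userspace sketch of what the widened field buys (not kernel code; the constant names mirror include/linux/blk_types.h, and the helper names here are hypothetical), the 3 bvec-pool bits simply move from the top of a 16-bit field to the top of a 32-bit one, leaving bits 0..28 free for BIO_* flags:

```c
#include <assert.h>
#include <stdint.h>

#define BVEC_POOL_BITS		3
#define BVEC_POOL_OFFSET	(32 - BVEC_POOL_BITS)	/* was (16 - BVEC_POOL_BITS) */

/* pack a bvec pool index into the top bits, flags below */
static uint32_t toy_pack(uint32_t flags, uint32_t pool)
{
	return flags | (pool << BVEC_POOL_OFFSET);
}

/* models BVEC_POOL_IDX(): recover the pool index */
static uint32_t toy_pool_idx(uint32_t bi_flags)
{
	return bi_flags >> BVEC_POOL_OFFSET;
}

/* everything below BVEC_POOL_OFFSET is a flag bit */
static uint32_t toy_flag_bits(uint32_t bi_flags)
{
	return bi_flags & ((1u << BVEC_POOL_OFFSET) - 1);
}
```

With a 16-bit field the flag space below the pool bits was nearly exhausted; widening the field makes room for BIO_FOLL_PIN without disturbing the pool-index encoding.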

And finally, as long as we're all staring at the iov_iter code, I'm
including a nice, easy ceph patch that removes one more caller of
iov_iter_get_pages().

Design notes
============

This whole approach depends on certain concepts:

1) Each struct bio instance must not mix different types of pages:
FOLL_PIN and non-FOLL_PIN pages. (By FOLL_PIN I'm referring to pages
that were acquired and pinned via pin_user_page*() routines.)
Fortunately, this is already an enforced constraint for bios, as
evidenced by the existence and use of BIO_NO_PAGE_REF.

2) Christoph Hellwig's July, 2019 observation that there are
only 5 ITER_ types, and we can simplify handling of them for Direct IO
[1]. Accordingly, this series implements the following pseudocode:

Direct IO behavior:

    ITER_IOVEC:
        pin_user_pages_fast();
        break;

    ITER_KVEC:    // already elevated page refcount, leave alone
    ITER_BVEC:    // already elevated page refcount, leave alone
    ITER_PIPE:    // just, no :)
    ITER_DISCARD: // discard
        return -EFAULT or -EINVAL;

...which works for callers that have already sorted out which case
they are in, such as Direct IO in the block/bio layers.

Now, this does leave ITER_KVEC and ITER_BVEC unconverted, but on the
other hand, it's not clear that these are actually affected in the real
world, by the get_user_pages()+filesystem interaction problems of [2].
If it turns out to matter, then those can be handled too, but it's just
more refactoring and surgery to do so.

Testing
=======

Performance: no obvious regressions from running fio (direct=1: Direct
IO) on both SSD and NVMe drives.

Functionality: selected non-destructive bare metal xfstests on xfs,
ext4, btrfs, orangefs filesystems, plus LTP tests.

Note that I have only a single x86 64-bit test machine, though.

Remaining work
==============

Non-converted call sites for iov_iter_get_pages*() at the
moment include: net, crypto, cifs, ceph, vhost, fuse, nfs/direct,
vhost/scsi.

About-to-be-converted sites (in a subsequent patch) are: Direct IO for
filesystems that use the generic read/write functions.

[1] https://lore.kernel.org/kvm/20190724061750.GA19397@infradead.org/

[2] "Explicit pinning of user-space pages":
    https://lwn.net/Articles/807108/


John Hubbard (5):
  iov_iter: introduce iov_iter_pin_user_pages*() routines
  mm/gup: introduce pin_user_page()
  bio: convert get_user_pages_fast() --> pin_user_pages_fast()
  bio: introduce BIO_FOLL_PIN flag
  fs/ceph: use pipe_get_pages_alloc() for pipe

 block/bio.c               | 29 +++++++------
 block/blk-map.c           |  7 +--
 fs/ceph/file.c            |  3 +-
 fs/direct-io.c            | 30 ++++++-------
 fs/iomap/direct-io.c      |  2 +-
 include/linux/blk_types.h |  5 ++-
 include/linux/mm.h        |  2 +
 include/linux/uio.h       |  9 +++-
 lib/iov_iter.c            | 91 +++++++++++++++++++++++++++++++++++++--
 mm/gup.c                  | 30 +++++++++++++
 10 files changed, 169 insertions(+), 39 deletions(-)

-- 
2.28.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/5] iov_iter: introduce iov_iter_pin_user_pages*() routines
  2020-08-22  4:20 [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast() John Hubbard
@ 2020-08-22  4:20 ` John Hubbard
  2020-08-22  4:20 ` [PATCH 2/5] mm/gup: introduce pin_user_page() John Hubbard
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: John Hubbard @ 2020-08-22  4:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Christoph Hellwig, Ilya Dryomov, Jens Axboe,
	Jeff Layton, linux-xfs, linux-fsdevel, linux-block, ceph-devel,
	linux-mm, LKML, John Hubbard

The new routines are:
    iov_iter_pin_user_pages()
    iov_iter_pin_user_pages_alloc()

and those correspond to these pre-existing routines:
    iov_iter_get_pages()
    iov_iter_get_pages_alloc()

Unlike the iov_iter_get_pages*() routines, the
iov_iter_pin_user_pages*() routines assert that only ITER_IOVEC items
are passed in. They then call pin_user_pages_fast(), instead of
get_user_pages_fast().

Why: In order to incrementally change Direct IO callers from calling
get_user_pages_fast() and put_page(), over to calling
pin_user_pages_fast() and unpin_user_page(), there need to be mid-level
routines that specifically call one or the other systems, for both page
acquisition and page release.
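The pairing rule being enforced can be modeled in a few lines of userspace C (all names here are hypothetical, not kernel APIs): a page taken via the pin_user_pages*() family must be released with unpin_user_page(), and a get_user_pages*() page with put_page(); mixing the two is exactly the class of bug the series guards against.

```c
#include <assert.h>

enum toy_acq { TOY_ACQ_GET, TOY_ACQ_PIN };

struct toy_page {
	enum toy_acq method;
	int refs;
};

static void toy_acquire(struct toy_page *p, enum toy_acq how)
{
	p->method = how;	/* remember which system took the reference */
	p->refs++;
}

static int toy_release(struct toy_page *p, enum toy_acq how)
{
	if (p->method != how)
		return -1;	/* mismatch: put_page() on a pinned page, or vice versa */
	p->refs--;
	return 0;
}
```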

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/uio.h |  5 +++
 lib/iov_iter.c      | 80 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 3835a8a8e9ea..29b0504a27cc 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -229,6 +229,11 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages);
 
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
 
+ssize_t iov_iter_pin_user_pages(struct iov_iter *i, struct page **pages,
+			size_t maxsize, unsigned int maxpages, size_t *start);
+ssize_t iov_iter_pin_user_pages_alloc(struct iov_iter *i, struct page ***pages,
+			size_t maxsize, size_t *start);
+
 static inline size_t iov_iter_count(const struct iov_iter *i)
 {
 	return i->count;
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 5e40786c8f12..d818b16d136b 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1309,6 +1309,44 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
 	return __pipe_get_pages(i, min(maxsize, capacity), pages, iter_head, start);
 }
 
+ssize_t iov_iter_pin_user_pages(struct iov_iter *i,
+		   struct page **pages, size_t maxsize, unsigned int maxpages,
+		   size_t *start)
+{
+	size_t skip = i->iov_offset;
+	const struct iovec *iov;
+	struct iovec v;
+
+	if (WARN_ON_ONCE(!iter_is_iovec(i)))
+		return -EFAULT;
+
+	if (unlikely(!maxsize))
+		return 0;
+	maxsize = min(maxsize, i->count);
+
+	iterate_iovec(i, maxsize, v, iov, skip, ({
+		unsigned long addr = (unsigned long)v.iov_base;
+		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
+		int n;
+		int res;
+
+		if (len > maxpages * PAGE_SIZE)
+			len = maxpages * PAGE_SIZE;
+		addr &= ~(PAGE_SIZE - 1);
+		n = DIV_ROUND_UP(len, PAGE_SIZE);
+
+		res = pin_user_pages_fast(addr, n,
+				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0,
+				pages);
+		if (unlikely(res < 0))
+			return res;
+		return (res == n ? len : res * PAGE_SIZE) - *start;
+		0;
+	}))
+	return 0;
+}
+EXPORT_SYMBOL(iov_iter_pin_user_pages);
+
 ssize_t iov_iter_get_pages(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
@@ -1388,6 +1426,48 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
 	return n;
 }
 
+ssize_t iov_iter_pin_user_pages_alloc(struct iov_iter *i,
+		   struct page ***pages, size_t maxsize,
+		   size_t *start)
+{
+	struct page **p;
+	size_t skip = i->iov_offset;
+	const struct iovec *iov;
+	struct iovec v;
+
+	if (WARN_ON_ONCE(!iter_is_iovec(i)))
+		return -EFAULT;
+
+	if (unlikely(!maxsize))
+		return 0;
+	maxsize = min(maxsize, i->count);
+
+	iterate_iovec(i, maxsize, v, iov, skip, ({
+		unsigned long addr = (unsigned long)v.iov_base;
+		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
+		int n;
+		int res;
+
+		addr &= ~(PAGE_SIZE - 1);
+		n = DIV_ROUND_UP(len, PAGE_SIZE);
+		p = get_pages_array(n);
+		if (!p)
+			return -ENOMEM;
+
+		res = pin_user_pages_fast(addr, n,
+				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0, p);
+		if (unlikely(res < 0)) {
+			kvfree(p);
+			return res;
+		}
+		*pages = p;
+		return (res == n ? len : res * PAGE_SIZE) - *start;
+		0;
+	}))
+	return 0;
+}
+EXPORT_SYMBOL(iov_iter_pin_user_pages_alloc);
+
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start)
-- 
2.28.0



* [PATCH 2/5] mm/gup: introduce pin_user_page()
  2020-08-22  4:20 [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast() John Hubbard
  2020-08-22  4:20 ` [PATCH 1/5] iov_iter: introduce iov_iter_pin_user_pages*() routines John Hubbard
@ 2020-08-22  4:20 ` John Hubbard
  2020-08-22  4:20 ` [PATCH 3/5] bio: convert get_user_pages_fast() --> pin_user_pages_fast() John Hubbard
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: John Hubbard @ 2020-08-22  4:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Christoph Hellwig, Ilya Dryomov, Jens Axboe,
	Jeff Layton, linux-xfs, linux-fsdevel, linux-block, ceph-devel,
	linux-mm, LKML, John Hubbard

pin_user_page() is the FOLL_PIN equivalent of get_page().

This was always a missing piece of the pin/unpin API calls (early
reviewers of pin_user_pages() asked about it, in fact), but until now,
it just wasn't needed. Finally though, now that the Direct IO pieces in
block/bio are about to be converted to use FOLL_PIN, it turns out that
there are some cases in which get_page() and get_user_pages_fast() were
both used. Converting those sites requires a drop-in replacement for
get_page(), which this patch supplies.
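For non-huge pages, the FOLL_PIN scheme overloads the ordinary page refcount: a pin adds GUP_PIN_COUNTING_BIAS rather than 1. A minimal userspace model of that accounting (the struct and helper names are hypothetical; the bias value matches include/linux/mm.h at the time of this series) looks like:

```c
#include <assert.h>

#define GUP_PIN_COUNTING_BIAS 1024	/* value used in include/linux/mm.h */

struct toy_pincount {
	int refcount;
};

static void toy_pin_user_page(struct toy_pincount *p)
{
	p->refcount += GUP_PIN_COUNTING_BIAS;	/* pin = bias-sized refcount bump */
}

static void toy_unpin_user_page(struct toy_pincount *p)
{
	p->refcount -= GUP_PIN_COUNTING_BIAS;
}

/* models page_maybe_dma_pinned(): a large refcount implies a pin */
static int toy_maybe_dma_pinned(const struct toy_pincount *p)
{
	return p->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

This is why pin_user_page() below adds GUP_PIN_COUNTING_BIAS (or uses the separate hpage_pincount field when available) instead of doing a plain page_ref_inc().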

[1] and [2] provide some background about pin_user_pages() in general.

[1] "Explicit pinning of user-space pages":
    https://lwn.net/Articles/807108/

[2] Documentation/core-api/pin_user_pages.rst

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h |  2 ++
 mm/gup.c           | 30 ++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1983e08f5906..bee26614f430 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1149,6 +1149,8 @@ static inline void get_page(struct page *page)
 	page_ref_inc(page);
 }
 
+void pin_user_page(struct page *page);
+
 bool __must_check try_grab_page(struct page *page, unsigned int flags);
 
 static inline __must_check bool try_get_page(struct page *page)
diff --git a/mm/gup.c b/mm/gup.c
index ae096ea7583f..2cae5bbbc862 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -123,6 +123,36 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
 	return NULL;
 }
 
+/*
+ * pin_user_page() - elevate the page refcount, and mark as FOLL_PIN
+ *
+ * This is the FOLL_PIN equivalent of get_page(). It is intended for use when the
+ * page will be released via unpin_user_page().
+ */
+void pin_user_page(struct page *page)
+{
+	int refs = 1;
+
+	page = compound_head(page);
+
+	VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
+
+	if (hpage_pincount_available(page))
+		hpage_pincount_add(page, 1);
+	else
+		refs = GUP_PIN_COUNTING_BIAS;
+
+	/*
+	 * Similar to try_grab_compound_head(): even if using the
+	 * hpage_pincount_add/_sub() routines, be sure to
+	 * *also* increment the normal page refcount field at least
+	 * once, so that the page really is pinned.
+	 */
+	page_ref_add(page, refs);
+
+	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED, 1);
+}
+
 /**
  * try_grab_page() - elevate a page's refcount by a flag-dependent amount
  *
-- 
2.28.0



* [PATCH 3/5] bio: convert get_user_pages_fast() --> pin_user_pages_fast()
  2020-08-22  4:20 [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast() John Hubbard
  2020-08-22  4:20 ` [PATCH 1/5] iov_iter: introduce iov_iter_pin_user_pages*() routines John Hubbard
  2020-08-22  4:20 ` [PATCH 2/5] mm/gup: introduce pin_user_page() John Hubbard
@ 2020-08-22  4:20 ` John Hubbard
  2020-08-22  4:20 ` [PATCH 4/5] bio: introduce BIO_FOLL_PIN flag John Hubbard
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: John Hubbard @ 2020-08-22  4:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Christoph Hellwig, Ilya Dryomov, Jens Axboe,
	Jeff Layton, linux-xfs, linux-fsdevel, linux-block, ceph-devel,
	linux-mm, LKML, John Hubbard

Change generic block/bio Direct IO routines, to acquire FOLL_PIN user
pages via the recently added routines:

    iov_iter_pin_user_pages()
    iov_iter_pin_user_pages_alloc()
    pin_user_page()

This effectively converts several file systems (ext4, for example) that
use the common Direct IO routines.

Change the corresponding page release calls from put_page() to
unpin_user_page().

Change bio_release_pages() to handle FOLL_PIN pages. In fact, that is
now the *only* type of page it handles.

Design notes
============

Quite a few approaches have been considered over the years. This one is
inspired by Christoph Hellwig's July, 2019 observation that there are
only 5 ITER_ types, and we can simplify handling of them for Direct IO
[1]. Accordingly, this patch implements the following pseudocode:

Direct IO behavior:

    ITER_IOVEC:
        pin_user_pages_fast();
        break;

    ITER_KVEC:    // already elevated page refcount, leave alone
    ITER_BVEC:    // already elevated page refcount, leave alone
    ITER_PIPE:    // just, no :)
    ITER_DISCARD: // discard
        return -EFAULT or -EINVAL;

...which works for callers that have already sorted out which case
they are in, such as Direct IO in the block/bio layers.

Now, this does leave ITER_KVEC and ITER_BVEC unconverted, but on the
other hand, it's not clear that these are actually affected in the real
world, by the get_user_pages()+filesystem interaction problems of [2].
If it turns out to matter, then those can be handled too, but it's just
more refactoring and surgery to do so.

Page acquisition: The iov_iter_get_pages*() routines
above are at just the right level in the call stack: the callers already
know which system to use, and so it's a small change to just drop in the
replacement routines. And it's a fan-in/fan-out point: block/bio call
sites for Direct IO funnel their page acquisitions through the
iov_iter_get_pages*() routines, and there are many other callers of
those. And we can't convert all of the callers at once--too many
subsystems are involved, and it would be a too large and too risky
patch.

Page release: there are already separate release routines: put_page()
vs. unpin_user_page(), so it's already done there.

[1] https://lore.kernel.org/kvm/20190724061750.GA19397@infradead.org/

[2] "Explicit pinning of user-space pages":
    https://lwn.net/Articles/807108/

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 block/bio.c          | 24 ++++++++++++------------
 block/blk-map.c      |  6 +++---
 fs/direct-io.c       | 28 ++++++++++++++--------------
 fs/iomap/direct-io.c |  2 +-
 4 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index c63ba04bd629..00d548e3c2b8 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -955,7 +955,7 @@ void bio_release_pages(struct bio *bio, bool mark_dirty)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (mark_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+		unpin_user_page(bvec->bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(bio_release_pages);
@@ -986,9 +986,9 @@ static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
  * @iter: iov iterator describing the region to be mapped
  *
  * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * pages will have to be released using put_page() or unpin_user_page() when
+ * done. For multi-segment *iter, this function only adds pages from the next
+ * non-empty segment of the iov iterator.
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1009,7 +1009,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
-	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	size = iov_iter_pin_user_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
@@ -1020,7 +1020,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 		if (__bio_try_merge_page(bio, page, len, offset, &same_page)) {
 			if (same_page)
-				put_page(page);
+				unpin_user_page(page);
 		} else {
 			if (WARN_ON_ONCE(bio_full(bio, len)))
 				return -EINVAL;
@@ -1056,7 +1056,7 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
-	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	size = iov_iter_pin_user_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
@@ -1069,7 +1069,7 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
 				max_append_sectors, &same_page) != len)
 			return -EINVAL;
 		if (same_page)
-			put_page(page);
+			unpin_user_page(page);
 		offset = 0;
 	}
 
@@ -1113,8 +1113,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		} else {
 			if (is_bvec)
 				ret = __bio_iov_bvec_add_pages(bio, iter);
-			else
-				ret = __bio_iov_iter_get_pages(bio, iter);
+		else
+			ret = __bio_iov_iter_get_pages(bio, iter);
 		}
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
 
@@ -1326,8 +1326,8 @@ void bio_set_pages_dirty(struct bio *bio)
  * the BIO and re-dirty the pages in process context.
  *
  * It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on.  It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on.  It will run one unpin_user_page() against each page
+ * and will run one bio_put() against the BIO.
  */
 
 static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk-map.c b/block/blk-map.c
index 6e804892d5ec..7a095b4947ea 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -275,7 +275,7 @@ static struct bio *bio_map_user_iov(struct request_queue *q,
 		size_t offs, added = 0;
 		int npages;
 
-		bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offs);
+		bytes = iov_iter_pin_user_pages_alloc(iter, &pages, LONG_MAX, &offs);
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
@@ -298,7 +298,7 @@ static struct bio *bio_map_user_iov(struct request_queue *q,
 				if (!bio_add_hw_page(q, bio, page, n, offs,
 						     max_sectors, &same_page)) {
 					if (same_page)
-						put_page(page);
+						unpin_user_page(page);
 					break;
 				}
 
@@ -312,7 +312,7 @@ static struct bio *bio_map_user_iov(struct request_queue *q,
 		 * release the pages we didn't map into the bio, if any
 		 */
 		while (j < npages)
-			put_page(pages[j++]);
+			unpin_user_page(pages[j++]);
 		kvfree(pages);
 		/* couldn't stuff something into bio? */
 		if (bytes)
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 183299892465..b01c8d003bd3 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -170,7 +170,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 {
 	ssize_t ret;
 
-	ret = iov_iter_get_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
+	ret = iov_iter_pin_user_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
 				&sdio->from);
 
 	if (ret < 0 && sdio->blocks_available && (dio->op == REQ_OP_WRITE)) {
@@ -182,7 +182,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 		 */
 		if (dio->page_errors == 0)
 			dio->page_errors = ret;
-		get_page(page);
+		pin_user_page(page);
 		dio->pages[0] = page;
 		sdio->head = 0;
 		sdio->tail = 1;
@@ -472,7 +472,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 static inline void dio_cleanup(struct dio *dio, struct dio_submit *sdio)
 {
 	while (sdio->head < sdio->tail)
-		put_page(dio->pages[sdio->head++]);
+		unpin_user_page(dio->pages[sdio->head++]);
 }
 
 /*
@@ -739,7 +739,7 @@ static inline int dio_bio_add_page(struct dio_submit *sdio)
 		 */
 		if ((sdio->cur_page_len + sdio->cur_page_offset) == PAGE_SIZE)
 			sdio->pages_in_io--;
-		get_page(sdio->cur_page);
+		pin_user_page(sdio->cur_page);
 		sdio->final_block_in_bio = sdio->cur_page_block +
 			(sdio->cur_page_len >> sdio->blkbits);
 		ret = 0;
@@ -853,13 +853,13 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 	 */
 	if (sdio->cur_page) {
 		ret = dio_send_cur_page(dio, sdio, map_bh);
-		put_page(sdio->cur_page);
+		unpin_user_page(sdio->cur_page);
 		sdio->cur_page = NULL;
 		if (ret)
 			return ret;
 	}
 
-	get_page(page);		/* It is in dio */
+	pin_user_page(page);		/* It is in dio */
 	sdio->cur_page = page;
 	sdio->cur_page_offset = offset;
 	sdio->cur_page_len = len;
@@ -874,7 +874,7 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 		ret = dio_send_cur_page(dio, sdio, map_bh);
 		if (sdio->bio)
 			dio_bio_submit(dio, sdio);
-		put_page(sdio->cur_page);
+		unpin_user_page(sdio->cur_page);
 		sdio->cur_page = NULL;
 	}
 	return ret;
@@ -974,7 +974,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				ret = get_more_blocks(dio, sdio, map_bh);
 				if (ret) {
-					put_page(page);
+					unpin_user_page(page);
 					goto out;
 				}
 				if (!buffer_mapped(map_bh))
@@ -1019,7 +1019,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				/* AKPM: eargh, -ENOTBLK is a hack */
 				if (dio->op == REQ_OP_WRITE) {
-					put_page(page);
+					unpin_user_page(page);
 					return -ENOTBLK;
 				}
 
@@ -1032,7 +1032,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 				if (sdio->block_in_file >=
 						i_size_aligned >> blkbits) {
 					/* We hit eof */
-					put_page(page);
+					unpin_user_page(page);
 					goto out;
 				}
 				zero_user(page, from, 1 << blkbits);
@@ -1072,7 +1072,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 						  sdio->next_block_for_io,
 						  map_bh);
 			if (ret) {
-				put_page(page);
+				unpin_user_page(page);
 				goto out;
 			}
 			sdio->next_block_for_io += this_chunk_blocks;
@@ -1087,8 +1087,8 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 				break;
 		}
 
-		/* Drop the ref which was taken in get_user_pages() */
-		put_page(page);
+		/* Drop the ref which was taken in pin_user_pages() */
+		unpin_user_page(page);
 	}
 out:
 	return ret;
@@ -1327,7 +1327,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 		ret2 = dio_send_cur_page(dio, &sdio, &map_bh);
 		if (retval == 0)
 			retval = ret2;
-		put_page(sdio.cur_page);
+		unpin_user_page(sdio.cur_page);
 		sdio.cur_page = NULL;
 	}
 	if (sdio.bio)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index c1aafb2ab990..390f611528ea 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -194,7 +194,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	get_page(page);
+	pin_user_page(page);
 	__bio_add_page(bio, page, len, 0);
 	bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
 	iomap_dio_submit_bio(dio, iomap, bio, pos);
-- 
2.28.0



* [PATCH 4/5] bio: introduce BIO_FOLL_PIN flag
  2020-08-22  4:20 [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast() John Hubbard
                   ` (2 preceding siblings ...)
  2020-08-22  4:20 ` [PATCH 3/5] bio: convert get_user_pages_fast() --> pin_user_pages_fast() John Hubbard
@ 2020-08-22  4:20 ` John Hubbard
  2020-08-23  6:25   ` Christoph Hellwig
  2020-08-22  4:20 ` [PATCH 5/5] fs/ceph: use pipe_get_pages_alloc() for pipe John Hubbard
  2020-08-25  1:54 ` [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast() Al Viro
  5 siblings, 1 reply; 24+ messages in thread
From: John Hubbard @ 2020-08-22  4:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Christoph Hellwig, Ilya Dryomov, Jens Axboe,
	Jeff Layton, linux-xfs, linux-fsdevel, linux-block, ceph-devel,
	linux-mm, LKML, John Hubbard

Add a new BIO_FOLL_PIN flag to struct bio, whose "short int" flags field
was full, thus triggering an expansion of the field from 16 to 32
bits. This allows for a nice assertion in bio_release_pages(), that the
bio page release mechanism matches the page acquisition mechanism.

Set BIO_FOLL_PIN whenever pin_user_pages_fast() is used, and check for
BIO_FOLL_PIN before using unpin_user_page().
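The invariant can be sketched in userspace as follows (the flag bit position and all names here are illustrative only; the real flag value comes from the blk_types.h enum): the flag is set at pin time, and the release path refuses to unpin when the flag is absent.

```c
#include <assert.h>

#define TOY_BIO_FOLL_PIN (1u << 13)	/* illustrative bit, not the real value */

struct toy_bio {
	unsigned int bi_flags;
	int nr_pinned;
};

static void toy_pin_into_bio(struct toy_bio *bio, int nr)
{
	bio->bi_flags |= TOY_BIO_FOLL_PIN;	/* set at pin_user_pages_fast() time */
	bio->nr_pinned += nr;
}

static int toy_bio_release_pages(struct toy_bio *bio)
{
	if (!(bio->bi_flags & TOY_BIO_FOLL_PIN))
		return -1;	/* models the WARN_ON_ONCE(): wrong release mechanism */
	bio->nr_pinned = 0;	/* would unpin_user_page() each bvec page */
	return 0;
}
```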

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 block/bio.c               | 9 +++++++--
 block/blk-map.c           | 3 ++-
 fs/direct-io.c            | 4 ++--
 include/linux/blk_types.h | 5 +++--
 include/linux/uio.h       | 5 +++--
 lib/iov_iter.c            | 9 +++++++--
 6 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 00d548e3c2b8..dd8e85618d5e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -952,6 +952,9 @@ void bio_release_pages(struct bio *bio, bool mark_dirty)
 	if (bio_flagged(bio, BIO_NO_PAGE_REF))
 		return;
 
+	if (WARN_ON_ONCE(!bio_flagged(bio, BIO_FOLL_PIN)))
+		return;
+
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (mark_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
@@ -1009,7 +1012,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
-	size = iov_iter_pin_user_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	size = iov_iter_pin_user_pages(bio, iter, pages, LONG_MAX, nr_pages,
+				       &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
@@ -1056,7 +1060,8 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
-	size = iov_iter_pin_user_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	size = iov_iter_pin_user_pages(bio, iter, pages, LONG_MAX, nr_pages,
+				       &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
diff --git a/block/blk-map.c b/block/blk-map.c
index 7a095b4947ea..ddfff2f0b1cb 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -275,7 +275,8 @@ static struct bio *bio_map_user_iov(struct request_queue *q,
 		size_t offs, added = 0;
 		int npages;
 
-		bytes = iov_iter_pin_user_pages_alloc(iter, &pages, LONG_MAX, &offs);
+		bytes = iov_iter_pin_user_pages_alloc(bio, iter, &pages,
+						      LONG_MAX, &offs);
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index b01c8d003bd3..4d0787ba85eb 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -170,8 +170,8 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 {
 	ssize_t ret;
 
-	ret = iov_iter_pin_user_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
-				&sdio->from);
+	ret = iov_iter_pin_user_pages(sdio->bio, sdio->iter, dio->pages,
+				      LONG_MAX, DIO_PAGES, &sdio->from);
 
 	if (ret < 0 && sdio->blocks_available && (dio->op == REQ_OP_WRITE)) {
 		struct page *page = ZERO_PAGE(0);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 4ecf4fed171f..d0e0da762af3 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -188,7 +188,7 @@ struct bio {
 						 * top bits REQ_OP. Use
 						 * accessors.
 						 */
-	unsigned short		bi_flags;	/* status, etc and bvec pool number */
+	unsigned int		bi_flags;	/* status, etc and bvec pool number */
 	unsigned short		bi_ioprio;
 	unsigned short		bi_write_hint;
 	blk_status_t		bi_status;
@@ -267,6 +267,7 @@ enum {
 				 * of this bio. */
 	BIO_CGROUP_ACCT,	/* has been accounted to a cgroup */
 	BIO_TRACKED,		/* set if bio goes through the rq_qos path */
+	BIO_FOLL_PIN,		/* must release pages via unpin_user_pages() */
 	BIO_FLAG_LAST
 };
 
@@ -285,7 +286,7 @@ enum {
  * freed.
  */
 #define BVEC_POOL_BITS		(3)
-#define BVEC_POOL_OFFSET	(16 - BVEC_POOL_BITS)
+#define BVEC_POOL_OFFSET	(32 - BVEC_POOL_BITS)
 #define BVEC_POOL_IDX(bio)	((bio)->bi_flags >> BVEC_POOL_OFFSET)
 #if (1<< BVEC_POOL_BITS) < (BVEC_POOL_NR+1)
 # error "BVEC_POOL_BITS is too small"
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 29b0504a27cc..62bcf5e45f2b 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -209,6 +209,7 @@ size_t copy_to_iter_mcsafe(void *addr, size_t bytes, struct iov_iter *i)
 		return _copy_to_iter_mcsafe(addr, bytes, i);
 }
 
+struct bio;
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
 unsigned long iov_iter_gap_alignment(const struct iov_iter *i);
@@ -229,9 +230,9 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages);
 
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
 
-ssize_t iov_iter_pin_user_pages(struct iov_iter *i, struct page **pages,
+ssize_t iov_iter_pin_user_pages(struct bio *bio, struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned int maxpages, size_t *start);
-ssize_t iov_iter_pin_user_pages_alloc(struct iov_iter *i, struct page ***pages,
+ssize_t iov_iter_pin_user_pages_alloc(struct bio *bio, struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
 
 static inline size_t iov_iter_count(const struct iov_iter *i)
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index d818b16d136b..a4bc1b3a3fda 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -3,6 +3,7 @@
 #include <linux/export.h>
 #include <linux/bvec.h>
 #include <linux/uio.h>
+#include <linux/bio.h>
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
@@ -1309,7 +1310,7 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
 	return __pipe_get_pages(i, min(maxsize, capacity), pages, iter_head, start);
 }
 
-ssize_t iov_iter_pin_user_pages(struct iov_iter *i,
+ssize_t iov_iter_pin_user_pages(struct bio *bio, struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned int maxpages,
 		   size_t *start)
 {
@@ -1335,6 +1336,8 @@ ssize_t iov_iter_pin_user_pages(struct iov_iter *i,
 		addr &= ~(PAGE_SIZE - 1);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
 
+		bio_set_flag(bio, BIO_FOLL_PIN);
+
 		res = pin_user_pages_fast(addr, n,
 				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0,
 				pages);
@@ -1426,7 +1429,7 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
 	return n;
 }
 
-ssize_t iov_iter_pin_user_pages_alloc(struct iov_iter *i,
+ssize_t iov_iter_pin_user_pages_alloc(struct bio *bio, struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start)
 {
@@ -1454,6 +1457,8 @@ ssize_t iov_iter_pin_user_pages_alloc(struct iov_iter *i,
 		if (!p)
 			return -ENOMEM;
 
+		bio_set_flag(bio, BIO_FOLL_PIN);
+
 		res = pin_user_pages_fast(addr, n,
 				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0, p);
 		if (unlikely(res < 0)) {
-- 
2.28.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 5/5] fs/ceph: use pipe_get_pages_alloc() for pipe
  2020-08-22  4:20 [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast() John Hubbard
                   ` (3 preceding siblings ...)
  2020-08-22  4:20 ` [PATCH 4/5] bio: introduce BIO_FOLL_PIN flag John Hubbard
@ 2020-08-22  4:20 ` John Hubbard
  2020-08-24 10:53   ` Jeff Layton
  2020-08-25  1:54 ` [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast() Al Viro
  5 siblings, 1 reply; 24+ messages in thread
From: John Hubbard @ 2020-08-22  4:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Christoph Hellwig, Ilya Dryomov, Jens Axboe,
	Jeff Layton, linux-xfs, linux-fsdevel, linux-block, ceph-devel,
	linux-mm, LKML, John Hubbard

This reduces, by one, the number of callers of iov_iter_get_pages().
That's helpful because these calls are being audited and converted over
to use iov_iter_pin_user_pages(), where applicable. And this one here is
already known by the caller to be only for ITER_PIPE, so let's just
simplify it now.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 fs/ceph/file.c      | 3 +--
 include/linux/uio.h | 3 ++-
 lib/iov_iter.c      | 6 +++---
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index d51c3f2fdca0..d3d7dd957390 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -879,8 +879,7 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
 		more = len < iov_iter_count(to);
 
 		if (unlikely(iov_iter_is_pipe(to))) {
-			ret = iov_iter_get_pages_alloc(to, &pages, len,
-						       &page_off);
+			ret = pipe_get_pages_alloc(to, &pages, len, &page_off);
 			if (ret <= 0) {
 				ceph_osdc_put_request(req);
 				ret = -ENOMEM;
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 62bcf5e45f2b..76cd47ab3dfd 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -227,7 +227,8 @@ ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
-
+ssize_t pipe_get_pages_alloc(struct iov_iter *i, struct page ***pages,
+			     size_t maxsize, size_t *start);
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
 
 ssize_t iov_iter_pin_user_pages(struct bio *bio, struct iov_iter *i, struct page **pages,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index a4bc1b3a3fda..f571fe3ddbe8 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1396,9 +1396,8 @@ static struct page **get_pages_array(size_t n)
 	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
 }
 
-static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
-		   struct page ***pages, size_t maxsize,
-		   size_t *start)
+ssize_t pipe_get_pages_alloc(struct iov_iter *i, struct page ***pages,
+			     size_t maxsize, size_t *start)
 {
 	struct page **p;
 	unsigned int iter_head, npages;
@@ -1428,6 +1427,7 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
 		kvfree(p);
 	return n;
 }
+EXPORT_SYMBOL(pipe_get_pages_alloc);
 
 ssize_t iov_iter_pin_user_pages_alloc(struct bio *bio, struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
-- 
2.28.0



* Re: [PATCH 4/5] bio: introduce BIO_FOLL_PIN flag
  2020-08-22  4:20 ` [PATCH 4/5] bio: introduce BIO_FOLL_PIN flag John Hubbard
@ 2020-08-23  6:25   ` Christoph Hellwig
  2020-08-23  6:57     ` John Hubbard
  2020-08-24  9:20     ` Jens Axboe
  0 siblings, 2 replies; 24+ messages in thread
From: Christoph Hellwig @ 2020-08-23  6:25 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Alexander Viro, Christoph Hellwig, Ilya Dryomov,
	Jens Axboe, Jeff Layton, linux-xfs, linux-fsdevel, linux-block,
	ceph-devel, linux-mm, LKML

On Fri, Aug 21, 2020 at 09:20:58PM -0700, John Hubbard wrote:
> Add a new BIO_FOLL_PIN flag to struct bio, whose "short int" flags field
> was full, thus triggering an expansion of the field from 16 to 32
> bits. This allows for a nice assertion in bio_release_pages(), that the
> bio page release mechanism matches the page acquisition mechanism.
> 
> Set BIO_FOLL_PIN whenever pin_user_pages_fast() is used, and check for
> BIO_FOLL_PIN before using unpin_user_page().

When would the flag not be set when BIO_NO_PAGE_REF is not set?

Also, I don't think we can just expand the flags field, but I can send
a series to kill off two flags.


* Re: [PATCH 4/5] bio: introduce BIO_FOLL_PIN flag
  2020-08-23  6:25   ` Christoph Hellwig
@ 2020-08-23  6:57     ` John Hubbard
  2020-08-24  7:36       ` John Hubbard
  2020-08-24  9:20     ` Jens Axboe
  1 sibling, 1 reply; 24+ messages in thread
From: John Hubbard @ 2020-08-23  6:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Alexander Viro, Ilya Dryomov, Jens Axboe,
	Jeff Layton, linux-xfs, linux-fsdevel, linux-block, ceph-devel,
	linux-mm, LKML

On 8/22/20 11:25 PM, Christoph Hellwig wrote:
> On Fri, Aug 21, 2020 at 09:20:58PM -0700, John Hubbard wrote:
>> Add a new BIO_FOLL_PIN flag to struct bio, whose "short int" flags field
>> was full, thus triggering an expansion of the field from 16 to 32
>> bits. This allows for a nice assertion in bio_release_pages(), that the
>> bio page release mechanism matches the page acquisition mechanism.
>>
>> Set BIO_FOLL_PIN whenever pin_user_pages_fast() is used, and check for
>> BIO_FOLL_PIN before using unpin_user_page().
> 
> When would the flag not be set when BIO_NO_PAGE_REF is not set?

Well, I don't *think* you can get there. However, I've only been studying
bio/block for a fairly short time, and the scattering of get_page() and
put_page() calls in some of the paths made me wonder if, for example,
someone was using get_page() to acquire ITER_BVEC or ITER_KVEC pages,
and releasing them via bio_release_pages(). It's hard to tell.

It seems like that shouldn't be part of the design. I'm asserting that
it isn't, with this new flag. But if you're sure that this assertion is
unnecessary, then let's just drop this patch, of course.

> 
> Also, I don't think we can just expand the flags field, but I can send
> a series to kill off two flags.
> 

Good to know, just in case we do want this flag. Great!

thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH 4/5] bio: introduce BIO_FOLL_PIN flag
  2020-08-23  6:57     ` John Hubbard
@ 2020-08-24  7:36       ` John Hubbard
  0 siblings, 0 replies; 24+ messages in thread
From: John Hubbard @ 2020-08-24  7:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Alexander Viro, Ilya Dryomov, Jens Axboe,
	Jeff Layton, linux-xfs, linux-fsdevel, linux-block, ceph-devel,
	linux-mm, LKML

On 8/22/20 11:57 PM, John Hubbard wrote:
> On 8/22/20 11:25 PM, Christoph Hellwig wrote:
>> On Fri, Aug 21, 2020 at 09:20:58PM -0700, John Hubbard wrote:
>>> Add a new BIO_FOLL_PIN flag to struct bio, whose "short int" flags field
>>> was full, thus triggering an expansion of the field from 16 to 32
>>> bits. This allows for a nice assertion in bio_release_pages(), that the
>>> bio page release mechanism matches the page acquisition mechanism.
>>>
>>> Set BIO_FOLL_PIN whenever pin_user_pages_fast() is used, and check for
>>> BIO_FOLL_PIN before using unpin_user_page().
>>
>> When would the flag not be set when BIO_NO_PAGE_REF is not set?
> 
> Well, I don't *think* you can get there. However, I've only been studying
> bio/block for a fairly short time, and the scattering of get_page() and
> put_page() calls in some of the paths made me wonder if, for example,
> someone was using get_page() to acquire ITER_BVEC or ITER_KVEC pages,
> and releasing them via bio_release_pages(). It's hard to tell.
> 
> It seems like that shouldn't be part of the design. I'm asserting that
> it isn't, with this new flag. But if you're sure that this assertion is
> unnecessary, then let's just drop this patch, of course.
> 

Also, I should have done a few more subsystem conversions before
concluding that BIO_FOLL_PIN was a good idea. Now, as I work through
mopping up those other subsystems, I see that nfs/direct.c, for example,
has no bio instance available, so passing a bio into the
iov_iter_pin_user_pages*() APIs is not really a great move after all.

Let's just drop this patch, after all.


thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH 4/5] bio: introduce BIO_FOLL_PIN flag
  2020-08-23  6:25   ` Christoph Hellwig
  2020-08-23  6:57     ` John Hubbard
@ 2020-08-24  9:20     ` Jens Axboe
  2020-08-24 14:42       ` Christoph Hellwig
  1 sibling, 1 reply; 24+ messages in thread
From: Jens Axboe @ 2020-08-24  9:20 UTC (permalink / raw)
  To: Christoph Hellwig, John Hubbard
  Cc: Andrew Morton, Alexander Viro, Ilya Dryomov, Jeff Layton,
	linux-xfs, linux-fsdevel, linux-block, ceph-devel, linux-mm,
	LKML

On 8/23/20 12:25 AM, Christoph Hellwig wrote:
> On Fri, Aug 21, 2020 at 09:20:58PM -0700, John Hubbard wrote:
>> Add a new BIO_FOLL_PIN flag to struct bio, whose "short int" flags field
>> was full, thus triggering an expansion of the field from 16 to 32
>> bits. This allows for a nice assertion in bio_release_pages(), that the
>> bio page release mechanism matches the page acquisition mechanism.
>>
>> Set BIO_FOLL_PIN whenever pin_user_pages_fast() is used, and check for
>> BIO_FOLL_PIN before using unpin_user_page().
> 
> When would the flag not be set when BIO_NO_PAGE_REF is not set?
> 
> Also, I don't think we can just expand the flags field, but I can send
> a series to kill off two flags.

(not relevant to this series as this patch has thankfully already been
dropped, just in general - but yes, definitely need a *strong* justification
to bump the bio size).

Would actually be nice to kill off a few flags, if possible, so the
flags space isn't totally full.

-- 
Jens Axboe



* Re: [PATCH 5/5] fs/ceph: use pipe_get_pages_alloc() for pipe
  2020-08-22  4:20 ` [PATCH 5/5] fs/ceph: use pipe_get_pages_alloc() for pipe John Hubbard
@ 2020-08-24 10:53   ` Jeff Layton
  2020-08-24 17:54     ` John Hubbard
  0 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2020-08-24 10:53 UTC (permalink / raw)
  To: John Hubbard, Andrew Morton
  Cc: Alexander Viro, Christoph Hellwig, Ilya Dryomov, Jens Axboe,
	linux-xfs, linux-fsdevel, linux-block, ceph-devel, linux-mm,
	LKML

On Fri, 2020-08-21 at 21:20 -0700, John Hubbard wrote:
> This reduces, by one, the number of callers of iov_iter_get_pages().
> That's helpful because these calls are being audited and converted over
> to use iov_iter_pin_user_pages(), where applicable. And this one here is
> already known by the caller to be only for ITER_PIPE, so let's just
> simplify it now.
> 
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  fs/ceph/file.c      | 3 +--
>  include/linux/uio.h | 3 ++-
>  lib/iov_iter.c      | 6 +++---
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index d51c3f2fdca0..d3d7dd957390 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -879,8 +879,7 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
>  		more = len < iov_iter_count(to);
>  
>  		if (unlikely(iov_iter_is_pipe(to))) {
> -			ret = iov_iter_get_pages_alloc(to, &pages, len,
> -						       &page_off);
> +			ret = pipe_get_pages_alloc(to, &pages, len, &page_off);
>  			if (ret <= 0) {
>  				ceph_osdc_put_request(req);
>  				ret = -ENOMEM;
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 62bcf5e45f2b..76cd47ab3dfd 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -227,7 +227,8 @@ ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
>  ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
>  			size_t maxsize, size_t *start);
>  int iov_iter_npages(const struct iov_iter *i, int maxpages);
> -
> +ssize_t pipe_get_pages_alloc(struct iov_iter *i, struct page ***pages,
> +			     size_t maxsize, size_t *start);
>  const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
>  
>  ssize_t iov_iter_pin_user_pages(struct bio *bio, struct iov_iter *i, struct page **pages,
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index a4bc1b3a3fda..f571fe3ddbe8 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1396,9 +1396,8 @@ static struct page **get_pages_array(size_t n)
>  	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
>  }
>  
> -static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
> -		   struct page ***pages, size_t maxsize,
> -		   size_t *start)
> +ssize_t pipe_get_pages_alloc(struct iov_iter *i, struct page ***pages,
> +			     size_t maxsize, size_t *start)
>  {
>  	struct page **p;
>  	unsigned int iter_head, npages;
> @@ -1428,6 +1427,7 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
>  		kvfree(p);
>  	return n;
>  }
> +EXPORT_SYMBOL(pipe_get_pages_alloc);
>  
>  ssize_t iov_iter_pin_user_pages_alloc(struct bio *bio, struct iov_iter *i,
>  		   struct page ***pages, size_t maxsize,


This looks fine to me. Let me know if you need this merged via the ceph
tree. Thanks!

Acked-by: Jeff Layton <jlayton@kernel.org>



* Re: [PATCH 4/5] bio: introduce BIO_FOLL_PIN flag
  2020-08-24  9:20     ` Jens Axboe
@ 2020-08-24 14:42       ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2020-08-24 14:42 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, John Hubbard, Andrew Morton, Alexander Viro,
	Ilya Dryomov, Jeff Layton, linux-xfs, linux-fsdevel, linux-block,
	ceph-devel, linux-mm, LKML

On Mon, Aug 24, 2020 at 03:20:26AM -0600, Jens Axboe wrote:
> (not relevant to this series as this patch has thankfully already been
> dropped, just in general - but yes, definitely need a *strong* justification
> to bump the bio size).
> 
> Would actually be nice to kill off a few flags, if possible, so the
> flags space isn't totally full.

I have a series to kill two flags that I need to resurrect and post.


* Re: [PATCH 5/5] fs/ceph: use pipe_get_pages_alloc() for pipe
  2020-08-24 10:53   ` Jeff Layton
@ 2020-08-24 17:54     ` John Hubbard
  2020-08-24 18:47       ` Jeff Layton
  0 siblings, 1 reply; 24+ messages in thread
From: John Hubbard @ 2020-08-24 17:54 UTC (permalink / raw)
  To: Jeff Layton, Andrew Morton
  Cc: Alexander Viro, Christoph Hellwig, Ilya Dryomov, Jens Axboe,
	linux-xfs, linux-fsdevel, linux-block, ceph-devel, linux-mm,
	LKML

On 8/24/20 3:53 AM, Jeff Layton wrote:
> 
> This looks fine to me. Let me know if you need this merged via the ceph
> tree. Thanks!
> 
> Acked-by: Jeff Layton <jlayton@kernel.org>
> 

Yes, please! It will get proper testing that way, and it doesn't have
any prerequisites, despite being part of this series. So it would be
great to go in via the ceph tree.

For the main series here, I'll send a v2 with only patches 1-3, once
enough feedback has happened.

thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH 5/5] fs/ceph: use pipe_get_pages_alloc() for pipe
  2020-08-24 17:54     ` John Hubbard
@ 2020-08-24 18:47       ` Jeff Layton
  2020-08-24 18:54         ` Matthew Wilcox
  0 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2020-08-24 18:47 UTC (permalink / raw)
  To: John Hubbard, Andrew Morton
  Cc: Alexander Viro, Christoph Hellwig, Ilya Dryomov, Jens Axboe,
	linux-xfs, linux-fsdevel, linux-block, ceph-devel, linux-mm,
	LKML

On Mon, 2020-08-24 at 10:54 -0700, John Hubbard wrote:
> On 8/24/20 3:53 AM, Jeff Layton wrote:
> > This looks fine to me. Let me know if you need this merged via the ceph
> > tree. Thanks!
> > 
> > Acked-by: Jeff Layton <jlayton@kernel.org>
> > 
> 
> Yes, please! It will get proper testing that way, and it doesn't have
> any prerequisites, despite being part of this series. So it would be
> great to go in via the ceph tree.
> 
> For the main series here, I'll send a v2 with only patches 1-3, once
> enough feedback has happened.
> 

Ok, I'll plan to pick it up providing no one has issues with exporting that symbol.

Thanks!
-- 
Jeff Layton <jlayton@kernel.org>



* Re: [PATCH 5/5] fs/ceph: use pipe_get_pages_alloc() for pipe
  2020-08-24 18:47       ` Jeff Layton
@ 2020-08-24 18:54         ` Matthew Wilcox
  2020-08-24 19:02           ` John Hubbard
  2020-08-25  1:20           ` [PATCH v2] " John Hubbard
  0 siblings, 2 replies; 24+ messages in thread
From: Matthew Wilcox @ 2020-08-24 18:54 UTC (permalink / raw)
  To: Jeff Layton
  Cc: John Hubbard, Andrew Morton, Alexander Viro, Christoph Hellwig,
	Ilya Dryomov, Jens Axboe, linux-xfs, linux-fsdevel, linux-block,
	ceph-devel, linux-mm, LKML

On Mon, Aug 24, 2020 at 02:47:53PM -0400, Jeff Layton wrote:
> Ok, I'll plan to pick it up providing no one has issues with exporting that symbol.

_GPL, perhaps?


* Re: [PATCH 5/5] fs/ceph: use pipe_get_pages_alloc() for pipe
  2020-08-24 18:54         ` Matthew Wilcox
@ 2020-08-24 19:02           ` John Hubbard
  2020-08-25  1:20           ` [PATCH v2] " John Hubbard
  1 sibling, 0 replies; 24+ messages in thread
From: John Hubbard @ 2020-08-24 19:02 UTC (permalink / raw)
  To: Matthew Wilcox, Jeff Layton
  Cc: Andrew Morton, Alexander Viro, Christoph Hellwig, Ilya Dryomov,
	Jens Axboe, linux-xfs, linux-fsdevel, linux-block, ceph-devel,
	linux-mm, LKML

On 8/24/20 11:54 AM, Matthew Wilcox wrote:
> On Mon, Aug 24, 2020 at 02:47:53PM -0400, Jeff Layton wrote:
>> Ok, I'll plan to pick it up providing no one has issues with exporting that symbol.
> 
> _GPL, perhaps?
> 

I looked at that, and the equivalent iov_iter_get_pages* and related stuff was
just EXPORT_SYMBOL, so I tried to match that. But if it needs to be _GPL then
that's fine too...

thanks,
-- 
John Hubbard
NVIDIA


* [PATCH v2] fs/ceph: use pipe_get_pages_alloc() for pipe
  2020-08-24 18:54         ` Matthew Wilcox
  2020-08-24 19:02           ` John Hubbard
@ 2020-08-25  1:20           ` John Hubbard
  2020-08-25 16:22             ` Jeff Layton
  1 sibling, 1 reply; 24+ messages in thread
From: John Hubbard @ 2020-08-25  1:20 UTC (permalink / raw)
  To: willy, jlayton
  Cc: akpm, axboe, ceph-devel, hch, idryomov, jhubbard, linux-block,
	linux-fsdevel, linux-kernel, linux-mm, linux-xfs, viro

This reduces, by one, the number of callers of iov_iter_get_pages().
That's helpful because these calls are being audited and converted over
to use iov_iter_pin_user_pages(), where applicable. And this one here is
already known by the caller to be only for ITER_PIPE, so let's just
simplify it now.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---

OK, here's a v2 that does EXPORT_SYMBOL_GPL, instead of EXPORT_SYMBOL,
that's the only change from v1. That should help give this patch a
clear bill of passage. :)

thanks,
John Hubbard
NVIDIA

 fs/ceph/file.c      | 3 +--
 include/linux/uio.h | 3 ++-
 lib/iov_iter.c      | 6 +++---
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index d51c3f2fdca0..d3d7dd957390 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -879,8 +879,7 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
 		more = len < iov_iter_count(to);
 
 		if (unlikely(iov_iter_is_pipe(to))) {
-			ret = iov_iter_get_pages_alloc(to, &pages, len,
-						       &page_off);
+			ret = pipe_get_pages_alloc(to, &pages, len, &page_off);
 			if (ret <= 0) {
 				ceph_osdc_put_request(req);
 				ret = -ENOMEM;
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 3835a8a8e9ea..270a4dcf5453 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -226,7 +226,8 @@ ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
-
+ssize_t pipe_get_pages_alloc(struct iov_iter *i, struct page ***pages,
+			     size_t maxsize, size_t *start);
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
 
 static inline size_t iov_iter_count(const struct iov_iter *i)
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 5e40786c8f12..6290998df480 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1355,9 +1355,8 @@ static struct page **get_pages_array(size_t n)
 	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
 }
 
-static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
-		   struct page ***pages, size_t maxsize,
-		   size_t *start)
+ssize_t pipe_get_pages_alloc(struct iov_iter *i, struct page ***pages,
+			     size_t maxsize, size_t *start)
 {
 	struct page **p;
 	unsigned int iter_head, npages;
@@ -1387,6 +1386,7 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
 		kvfree(p);
 	return n;
 }
+EXPORT_SYMBOL_GPL(pipe_get_pages_alloc);
 
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
-- 
2.28.0



* Re: [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast()
  2020-08-22  4:20 [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast() John Hubbard
                   ` (4 preceding siblings ...)
  2020-08-22  4:20 ` [PATCH 5/5] fs/ceph: use pipe_get_pages_alloc() for pipe John Hubbard
@ 2020-08-25  1:54 ` Al Viro
  2020-08-25  2:07   ` Al Viro
  2020-08-25  2:07   ` John Hubbard
  5 siblings, 2 replies; 24+ messages in thread
From: Al Viro @ 2020-08-25  1:54 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Christoph Hellwig, Ilya Dryomov, Jens Axboe,
	Jeff Layton, linux-xfs, linux-fsdevel, linux-block, ceph-devel,
	linux-mm, LKML

On Fri, Aug 21, 2020 at 09:20:54PM -0700, John Hubbard wrote:

> Direct IO behavior:
> 
>     ITER_IOVEC:
>         pin_user_pages_fast();
>         break;
> 
>     ITER_KVEC:    // already elevated page refcount, leave alone
>     ITER_BVEC:    // already elevated page refcount, leave alone
>     ITER_PIPE:    // just, no :)

Why?  What's wrong with splice to O_DIRECT file?


* Re: [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast()
  2020-08-25  1:54 ` [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast() Al Viro
@ 2020-08-25  2:07   ` Al Viro
  2020-08-25  2:13     ` John Hubbard
  2020-08-25  2:07   ` John Hubbard
  1 sibling, 1 reply; 24+ messages in thread
From: Al Viro @ 2020-08-25  2:07 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Christoph Hellwig, Ilya Dryomov, Jens Axboe,
	Jeff Layton, linux-xfs, linux-fsdevel, linux-block, ceph-devel,
	linux-mm, LKML

On Tue, Aug 25, 2020 at 02:54:28AM +0100, Al Viro wrote:
> On Fri, Aug 21, 2020 at 09:20:54PM -0700, John Hubbard wrote:
> 
> > Direct IO behavior:
> > 
> >     ITER_IOVEC:
> >         pin_user_pages_fast();
> >         break;
> > 
> >     ITER_KVEC:    // already elevated page refcount, leave alone
> >     ITER_BVEC:    // already elevated page refcount, leave alone
> >     ITER_PIPE:    // just, no :)
> 
> Why?  What's wrong with splice to O_DIRECT file?

Sorry - s/to/from/, obviously.

To spell it out: consider generic_file_splice_read() behaviour when
the source had been opened with O_DIRECT; you will get a call of
->read_iter() into ITER_PIPE destination.  And it bloody well
will hit iov_iter_get_pages() on common filesystems, to pick the
pages we want to read into.

So... what's wrong with having that "pin" primitive making sure
the pages are there and referenced by the pipe?


* Re: [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast()
  2020-08-25  1:54 ` [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast() Al Viro
  2020-08-25  2:07   ` Al Viro
@ 2020-08-25  2:07   ` John Hubbard
  2020-08-25  2:22     ` Al Viro
  1 sibling, 1 reply; 24+ messages in thread
From: John Hubbard @ 2020-08-25  2:07 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrew Morton, Christoph Hellwig, Ilya Dryomov, Jens Axboe,
	Jeff Layton, linux-xfs, linux-fsdevel, linux-block, ceph-devel,
	linux-mm, LKML

On 8/24/20 6:54 PM, Al Viro wrote:
> On Fri, Aug 21, 2020 at 09:20:54PM -0700, John Hubbard wrote:
> 
>> Direct IO behavior:
>>
>>      ITER_IOVEC:
>>          pin_user_pages_fast();
>>          break;
>>
>>      ITER_KVEC:    // already elevated page refcount, leave alone
>>      ITER_BVEC:    // already elevated page refcount, leave alone
>>      ITER_PIPE:    // just, no :)
> 
> Why?  What's wrong with splice to O_DIRECT file?
> 

Oh! I'll take a look. Is this the fs/splice.c stuff?  I ruled this out early
mainly based on Christoph's comment in [1] ("ITER_PIPE is rejected in the
direct I/O path"), but if it's supportable then I'll hook it up.

(As you can see, I'm still very much coming up to speed on the various things
that invoke iov_iter_get_pages*().)

[1] https://lore.kernel.org/kvm/20190724061750.GA19397@infradead.org/

thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast()
  2020-08-25  2:07   ` Al Viro
@ 2020-08-25  2:13     ` John Hubbard
  0 siblings, 0 replies; 24+ messages in thread
From: John Hubbard @ 2020-08-25  2:13 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrew Morton, Christoph Hellwig, Ilya Dryomov, Jens Axboe,
	Jeff Layton, linux-xfs, linux-fsdevel, linux-block, ceph-devel,
	linux-mm, LKML

On 8/24/20 7:07 PM, Al Viro wrote:
> On Tue, Aug 25, 2020 at 02:54:28AM +0100, Al Viro wrote:
>> On Fri, Aug 21, 2020 at 09:20:54PM -0700, John Hubbard wrote:
>>
>>> Direct IO behavior:
>>>
>>>      ITER_IOVEC:
>>>          pin_user_pages_fast();
>>>          break;
>>>
>>>      ITER_KVEC:    // already elevated page refcount, leave alone
>>>      ITER_BVEC:    // already elevated page refcount, leave alone
>>>      ITER_PIPE:    // just, no :)
>>
>> Why?  What's wrong with splice to O_DIRECT file?
> 
> Sorry - s/to/from/, obviously.
> 
> To spell it out: consider generic_file_splice_read() behaviour when
> the source had been opened with O_DIRECT; you will get a call of
> ->read_iter() into ITER_PIPE destination.  And it bloody well
> will hit iov_iter_get_pages() on common filesystems, to pick the
> pages we want to read into.
> 
> So... what's wrong with having that "pin" primitive making sure
> the pages are there and referenced by the pipe?
> 

(our emails crossed) OK, yes, let me hook that up. I was just unaware
of that flow, I'll go off and figure it out.

Thanks for looking at this!

thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast()
  2020-08-25  2:07   ` John Hubbard
@ 2020-08-25  2:22     ` Al Viro
  2020-08-25  2:27       ` John Hubbard
  0 siblings, 1 reply; 24+ messages in thread
From: Al Viro @ 2020-08-25  2:22 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Christoph Hellwig, Ilya Dryomov, Jens Axboe,
	Jeff Layton, linux-xfs, linux-fsdevel, linux-block, ceph-devel,
	linux-mm, LKML

On Mon, Aug 24, 2020 at 07:07:02PM -0700, John Hubbard wrote:
> On 8/24/20 6:54 PM, Al Viro wrote:
> > On Fri, Aug 21, 2020 at 09:20:54PM -0700, John Hubbard wrote:
> > 
> > > Direct IO behavior:
> > > 
> > >      ITER_IOVEC:
> > >          pin_user_pages_fast();
> > >          break;
> > > 
> > >      ITER_KVEC:    // already elevated page refcount, leave alone
> > >      ITER_BVEC:    // already elevated page refcount, leave alone
> > >      ITER_PIPE:    // just, no :)
> > 
> > Why?  What's wrong with splice to O_DIRECT file?
> > 
> 
> Oh! I'll take a look. Is this the fs/splice.c stuff?  I ruled this out early
> mainly based on Christoph's comment in [1] ("ITER_PIPE is rejected in the
> direct I/O path"), but if it's supportable then I'll hook it up.

; cat >a.c <<'EOF'
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>

int main()
{
        int fd = open("./a.out", O_DIRECT);
        splice(fd, NULL, 1, NULL, 4096, 0);
	return 0;
}
EOF
; cc a.c
; ./a.out | wc -c
4096

and you just had ->read_iter() called with ITER_PIPE destination.


* Re: [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast()
  2020-08-25  2:22     ` Al Viro
@ 2020-08-25  2:27       ` John Hubbard
  0 siblings, 0 replies; 24+ messages in thread
From: John Hubbard @ 2020-08-25  2:27 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrew Morton, Christoph Hellwig, Ilya Dryomov, Jens Axboe,
	Jeff Layton, linux-xfs, linux-fsdevel, linux-block, ceph-devel,
	linux-mm, LKML

On 8/24/20 7:22 PM, Al Viro wrote:
> On Mon, Aug 24, 2020 at 07:07:02PM -0700, John Hubbard wrote:
>> On 8/24/20 6:54 PM, Al Viro wrote:
>>> On Fri, Aug 21, 2020 at 09:20:54PM -0700, John Hubbard wrote:
>>>
>>>> Direct IO behavior:
>>>>
>>>>       ITER_IOVEC:
>>>>           pin_user_pages_fast();
>>>>           break;
>>>>
>>>>       ITER_KVEC:    // already elevated page refcount, leave alone
>>>>       ITER_BVEC:    // already elevated page refcount, leave alone
>>>>       ITER_PIPE:    // just, no :)
>>>
>>> Why?  What's wrong with splice to O_DIRECT file?
>>>
>>
>> Oh! I'll take a look. Is this the fs/splice.c stuff?  I ruled this out early
>> mainly based on Christoph's comment in [1] ("ITER_PIPE is rejected in the
>> direct I/O path"), but if it's supportable then I'll hook it up.
> 
> ; cat >a.c <<'EOF'
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdlib.h>
> 
> int main()
> {
>          int fd = open("./a.out", O_DIRECT);
>          splice(fd, NULL, 1, NULL, 4096, 0);
> 	return 0;
> }
> EOF
> ; cc a.c
> ; ./a.out | wc -c
> 4096
> 
> and you just had ->read_iter() called with ITER_PIPE destination.
> 

That example saves me a lot of time!  Much appreciated.

thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH v2] fs/ceph: use pipe_get_pages_alloc() for pipe
  2020-08-25  1:20           ` [PATCH v2] " John Hubbard
@ 2020-08-25 16:22             ` Jeff Layton
  0 siblings, 0 replies; 24+ messages in thread
From: Jeff Layton @ 2020-08-25 16:22 UTC (permalink / raw)
  To: John Hubbard, willy
  Cc: akpm, axboe, ceph-devel, hch, idryomov, linux-block,
	linux-fsdevel, linux-kernel, linux-mm, linux-xfs, viro, Yan,
	Zheng

On Mon, 2020-08-24 at 18:20 -0700, John Hubbard wrote:
> This reduces, by one, the number of callers of iov_iter_get_pages().
> That's helpful because these calls are being audited and converted over
> to use iov_iter_pin_user_pages(), where applicable. And this one here is
> already known by the caller to be only for ITER_PIPE, so let's just
> simplify it now.
> 
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
> 
> OK, here's a v2 that does EXPORT_SYMBOL_GPL, instead of EXPORT_SYMBOL,
> that's the only change from v1. That should help give this patch a
> clear bill of passage. :)
> 
> thanks,
> John Hubbard
> NVIDIA
> 
>  fs/ceph/file.c      | 3 +--
>  include/linux/uio.h | 3 ++-
>  lib/iov_iter.c      | 6 +++---
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index d51c3f2fdca0..d3d7dd957390 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -879,8 +879,7 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
>  		more = len < iov_iter_count(to);
>  
>  		if (unlikely(iov_iter_is_pipe(to))) {
> -			ret = iov_iter_get_pages_alloc(to, &pages, len,
> -						       &page_off);
> +			ret = pipe_get_pages_alloc(to, &pages, len, &page_off);
>  			if (ret <= 0) {
>  				ceph_osdc_put_request(req);
>  				ret = -ENOMEM;
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 3835a8a8e9ea..270a4dcf5453 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -226,7 +226,8 @@ ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
>  ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
>  			size_t maxsize, size_t *start);
>  int iov_iter_npages(const struct iov_iter *i, int maxpages);
> -
> +ssize_t pipe_get_pages_alloc(struct iov_iter *i, struct page ***pages,
> +			     size_t maxsize, size_t *start);
>  const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
>  
>  static inline size_t iov_iter_count(const struct iov_iter *i)
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 5e40786c8f12..6290998df480 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1355,9 +1355,8 @@ static struct page **get_pages_array(size_t n)
>  	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
>  }
>  
> -static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
> -		   struct page ***pages, size_t maxsize,
> -		   size_t *start)
> +ssize_t pipe_get_pages_alloc(struct iov_iter *i, struct page ***pages,
> +			     size_t maxsize, size_t *start)
>  {
>  	struct page **p;
>  	unsigned int iter_head, npages;
> @@ -1387,6 +1386,7 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
>  		kvfree(p);
>  	return n;
>  }
> +EXPORT_SYMBOL_GPL(pipe_get_pages_alloc);
>  
>  ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
>  		   struct page ***pages, size_t maxsize,

Thanks. I've got a v1 of this in the ceph-client/testing branch and it
seems fine so far.

I'd prefer an ack from Al on one or the other though, since I'm not sure
he wants to expose this primitive, and in the past he hasn't been
enamored with EXPORT_SYMBOL_GPL, because its meaning wasn't well
defined. Maybe that's changed since.

As a side note, Al also asked privately why ceph special cases
ITER_PIPE. I wasn't sure either, so I did a bit of git-archaeology. The
change was added here:

---------------------------8<---------------------------
commit 7ce469a53e7106acdaca2e25027941d0f7c12a8e
Author: Yan, Zheng <zyan@redhat.com>
Date:   Tue Nov 8 21:54:34 2016 +0800

    ceph: fix splice read for no Fc capability case
    
    When the iov_iter type is ITER_PIPE, copy_page_to_iter() takes a
    reference on the page and adds the page to a pipe_buffer. It also
    sets the pipe_buffer's ops to page_cache_pipe_buf_ops. The confirm
    callback in page_cache_pipe_buf_ops expects the page to come from
    the page cache and be uptodate; otherwise it returns an error.
    
    In the ceph_sync_read() case, the pages are not from the page
    cache, so we can't call copy_page_to_iter() when the iov_iter type
    is ITER_PIPE. The fix is to use iov_iter_get_pages_alloc() to
    allocate pages for the pipe. (The code is similar to
    default_file_splice_read().)
    
    Signed-off-by: Yan, Zheng <zyan@redhat.com>
---------------------------8<---------------------------

If we don't have Fc (FILE_CACHE) caps, then the client's not allowed to
cache data and so can't use the pagecache. I'm not certain that
special-casing pipes in ceph is the best approach to handling that, but
the confirm callback still seems to work the same way today.

Cheers,
-- 
Jeff Layton <jlayton@kernel.org>



end of thread, other threads:[~2020-08-25 16:22 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-22  4:20 [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast() John Hubbard
2020-08-22  4:20 ` [PATCH 1/5] iov_iter: introduce iov_iter_pin_user_pages*() routines John Hubbard
2020-08-22  4:20 ` [PATCH 2/5] mm/gup: introduce pin_user_page() John Hubbard
2020-08-22  4:20 ` [PATCH 3/5] bio: convert get_user_pages_fast() --> pin_user_pages_fast() John Hubbard
2020-08-22  4:20 ` [PATCH 4/5] bio: introduce BIO_FOLL_PIN flag John Hubbard
2020-08-23  6:25   ` Christoph Hellwig
2020-08-23  6:57     ` John Hubbard
2020-08-24  7:36       ` John Hubbard
2020-08-24  9:20     ` Jens Axboe
2020-08-24 14:42       ` Christoph Hellwig
2020-08-22  4:20 ` [PATCH 5/5] fs/ceph: use pipe_get_pages_alloc() for pipe John Hubbard
2020-08-24 10:53   ` Jeff Layton
2020-08-24 17:54     ` John Hubbard
2020-08-24 18:47       ` Jeff Layton
2020-08-24 18:54         ` Matthew Wilcox
2020-08-24 19:02           ` John Hubbard
2020-08-25  1:20           ` [PATCH v2] " John Hubbard
2020-08-25 16:22             ` Jeff Layton
2020-08-25  1:54 ` [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast() Al Viro
2020-08-25  2:07   ` Al Viro
2020-08-25  2:13     ` John Hubbard
2020-08-25  2:07   ` John Hubbard
2020-08-25  2:22     ` Al Viro
2020-08-25  2:27       ` John Hubbard
