LKML Archive on lore.kernel.org
* [RFC][PATCHES] iov_iter.c rewrite
@ 2014-12-04 20:20 Al Viro
  2014-12-04 20:23 ` [RFC][PATCH 01/13] iov_iter.c: macros for iterating over iov_iter Al Viro
                   ` (13 more replies)
  0 siblings, 14 replies; 48+ messages in thread
From: Al Viro @ 2014-12-04 20:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel, netdev

	First of all, I want to apologize for the nastiness of preprocessor
use in this series.  Seeing that the whole "macros that look like new kinds
of C statements" thing (including list_for_each_...(), etc) is very much not
to my liking, I really don't trust my taste on finer details and I'd very
much like some feedback.

	The reason for doing that kind of tricks is that iov_iter.c keeps
growing more and more boilerplate code.  For iov_iter-net series we need
	* csum_and_copy_from_iter()
	* csum_and_copy_to_iter()
	* copy_from_iter_nocache()
That's 3 new primitives, each in 2 variants (iovec and bvec).
	* ITER_KVEC handled without going through uaccess.h stuff (and
independent of set_fs() state).
And *that* means 3 variants instead of 2 for most of the existing primitives.
That's far too much, and the number of copies of the same logic would pretty
much guarantee that it becomes a breeding ground for hard-to-kill bugs.

	The following series (also in vfs.git#iov_iter) actually manages to
do all of the above *and* shrink the damn thing quite a bit.  The generated
code appears to be no worse than before.  The price is a couple of iterator
macros - iterate_all_kinds() and iterate_and_advance().  They are given an
iov_iter, a size (i.e. the amount of data at the beginning of the iov_iter we
want to go through), the name of the loop variable and 3 variants of loop
body - for iovec, bvec and kvec resp.  The loop variable is declared *inside*
the expansion of those suckers, typed according to the kind of iov_iter -
struct iovec, struct bio_vec or struct kvec - and covers the current range
to deal with.
	The difference between those two is that iterate_and_advance() will
advance the iov_iter by the amount it has handled and iterate_all_kinds()
will leave iov_iter unchanged.
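The shape of those macros can be illustrated with a userspace toy (all names
here are hypothetical, not the kernel code): the macro itself declares the
per-range loop variable and runs a caller-supplied step over each range,
trimmed to the requested size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct iovec; the kernel gets the real one from uio.h. */
struct toy_iovec {
	void *iov_base;
	size_t iov_len;
};

/*
 * Minimal analogue of iterate_all_kinds() for the iovec case only:
 * the macro declares the loop variable `v` itself, trims each range
 * to the bytes still wanted, and runs the caller-supplied STEP once
 * per non-empty range.
 */
#define toy_iterate(iov, nseg, n, v, STEP) do {			\
	size_t __left = (n);					\
	for (size_t __s = 0; __s < (nseg) && __left; __s++) {	\
		struct toy_iovec v = (iov)[__s];		\
		if (v.iov_len > __left)				\
			v.iov_len = __left;			\
		if (!v.iov_len)					\
			continue;				\
		STEP;						\
		__left -= v.iov_len;				\
	}							\
} while (0)

/* Example step: gather up to n bytes from all ranges into dst. */
static size_t toy_gather(char *dst, const struct toy_iovec *iov,
			 size_t nseg, size_t n)
{
	size_t copied = 0;

	toy_iterate(iov, nseg, n, v, ({
		memcpy(dst + copied, v.iov_base, v.iov_len);
		copied += v.iov_len;
	}));
	return copied;
}
```

The real macros additionally dispatch on the iov_iter kind (iovec/bvec/kvec)
and, in the iterate_and_advance() case, write the final segment/offset back
into the iov_iter; the GNU statement-expression trick for the step body is
the same one the patches below use.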

	Unless I hear anybody yelling, it goes into vfs.git#for-next today,
so if you have objections, suggestions, etc., give those *now*.

Al Viro (13):
      iov_iter.c: macros for iterating over iov_iter
      iov_iter.c: iterate_and_advance
      iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
      iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
      iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
      iov_iter.c: convert iov_iter_zero() to iterate_and_advance
      iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
      iov_iter.c: convert copy_from_iter() to iterate_and_advance
      iov_iter.c: convert copy_to_iter() to iterate_and_advance
      iov_iter.c: handle ITER_KVEC directly
      csum_and_copy_..._iter()
      new helper: iov_iter_kvec()
      copy_from_iter_nocache()

Diffstat:
 include/linux/uio.h |    6 +
 mm/iov_iter.c       | 1077 +++++++++++++++++++++------------------------------
 2 files changed, 445 insertions(+), 638 deletions(-)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC][PATCH 01/13] iov_iter.c: macros for iterating over iov_iter
  2014-12-04 20:20 [RFC][PATCHES] iov_iter.c rewrite Al Viro
@ 2014-12-04 20:23 ` Al Viro
  2014-12-04 20:23 ` [RFC][PATCH 02/13] iov_iter.c: iterate_and_advance Al Viro
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Al Viro @ 2014-12-04 20:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel, netdev

From: Al Viro <viro@zeniv.linux.org.uk>

iterate_all_kinds(iter, size, ident, step_iovec, step_bvec)
iterates through the ranges covered by iter (up to size bytes total),
repeating step_iovec or step_bvec for each of those.  ident is
declared in the expansion of that thing, either as struct iovec or
struct bio_vec, and it contains the range we are currently looking
at.  step_bvec should be a void expression; step_iovec - a size_t
one, with a non-zero value meaning "stop here, that many bytes from
this range are left".  In the end, the amount actually handled is
stored in size.

iov_iter_copy_from_user_atomic() and iov_iter_alignment() converted
to it.
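That size_t step convention - return how many bytes of the current range
were left unhandled, non-zero meaning "stop" - can be sketched with a
userspace toy (hypothetical names; the real thing is the iterate_iovec()
macro in this patch):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_iovec {
	const char *iov_base;
	size_t iov_len;
};

/*
 * Walk up to *sizep bytes of iov[].  step() returns how many bytes of
 * the current range it failed to handle; a non-zero return stops the
 * walk.  On return, *sizep holds the amount actually handled.
 */
static void toy_walk(const struct toy_iovec *iov, size_t nseg, size_t *sizep,
		     size_t (*step)(const char *base, size_t len, void *arg),
		     void *arg)
{
	size_t n = *sizep, wanted = n;

	for (size_t s = 0; s < nseg && n; s++) {
		size_t len = iov[s].iov_len < n ? iov[s].iov_len : n;
		size_t left = step(iov[s].iov_base, len, arg);

		n -= len - left;
		if (left)
			break;
	}
	*sizep = wanted - n;
}

/* Example step: copy until a pretend fault at d->limit bytes. */
struct sink {
	char buf[16];
	size_t pos, limit;
};

static size_t step_copy(const char *base, size_t len, void *arg)
{
	struct sink *d = arg;
	size_t room = d->limit - d->pos;
	size_t take = len < room ? len : room;

	memcpy(d->buf + d->pos, base, take);
	d->pos += take;
	return len - take;	/* non-zero => "fault", stop the walk */
}
```

This mirrors why partially-faulting primitives like
__copy_from_user_inatomic() fit the iovec step slot naturally: the bytes
they report as uncopied become the "stop here" signal.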

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 mm/iov_iter.c | 212 ++++++++++++++++++++++++----------------------------------
 1 file changed, 86 insertions(+), 126 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index e34a3cb..798fcb4 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -4,6 +4,72 @@
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 
+#define iterate_iovec(i, n, __v, __p, skip, STEP) {	\
+	size_t left;					\
+	size_t wanted = n;				\
+	__p = i->iov;					\
+	__v.iov_len = min(n, __p->iov_len - skip);	\
+	if (likely(__v.iov_len)) {			\
+		__v.iov_base = __p->iov_base + skip;	\
+		left = (STEP);				\
+		__v.iov_len -= left;			\
+		skip += __v.iov_len;			\
+		n -= __v.iov_len;			\
+	} else {					\
+		left = 0;				\
+	}						\
+	while (unlikely(!left && n)) {			\
+		__p++;					\
+		__v.iov_len = min(n, __p->iov_len);	\
+		if (unlikely(!__v.iov_len))		\
+			continue;			\
+		__v.iov_base = __p->iov_base;		\
+		left = (STEP);				\
+		__v.iov_len -= left;			\
+		skip = __v.iov_len;			\
+		n -= __v.iov_len;			\
+	}						\
+	n = wanted - n;					\
+}
+
+#define iterate_bvec(i, n, __v, __p, skip, STEP) {	\
+	size_t wanted = n;				\
+	__p = i->bvec;					\
+	__v.bv_len = min_t(size_t, n, __p->bv_len - skip);	\
+	if (likely(__v.bv_len)) {			\
+		__v.bv_page = __p->bv_page;		\
+		__v.bv_offset = __p->bv_offset + skip; 	\
+		(void)(STEP);				\
+		skip += __v.bv_len;			\
+		n -= __v.bv_len;			\
+	}						\
+	while (unlikely(n)) {				\
+		__p++;					\
+		__v.bv_len = min_t(size_t, n, __p->bv_len);	\
+		if (unlikely(!__v.bv_len))		\
+			continue;			\
+		__v.bv_page = __p->bv_page;		\
+		__v.bv_offset = __p->bv_offset;		\
+		(void)(STEP);				\
+		skip = __v.bv_len;			\
+		n -= __v.bv_len;			\
+	}						\
+	n = wanted;					\
+}
+
+#define iterate_all_kinds(i, n, v, I, B) {			\
+	size_t skip = i->iov_offset;				\
+	if (unlikely(i->type & ITER_BVEC)) {			\
+		const struct bio_vec *bvec;			\
+		struct bio_vec v;				\
+		iterate_bvec(i, n, v, bvec, skip, (B))		\
+	} else {						\
+		const struct iovec *iov;			\
+		struct iovec v;					\
+		iterate_iovec(i, n, v, iov, skip, (I))		\
+	}							\
+}
+
 static size_t copy_to_iter_iovec(void *from, size_t bytes, struct iov_iter *i)
 {
 	size_t skip, copy, left, wanted;
@@ -300,54 +366,6 @@ static size_t zero_iovec(size_t bytes, struct iov_iter *i)
 	return wanted - bytes;
 }
 
-static size_t __iovec_copy_from_user_inatomic(char *vaddr,
-			const struct iovec *iov, size_t base, size_t bytes)
-{
-	size_t copied = 0, left = 0;
-
-	while (bytes) {
-		char __user *buf = iov->iov_base + base;
-		int copy = min(bytes, iov->iov_len - base);
-
-		base = 0;
-		left = __copy_from_user_inatomic(vaddr, buf, copy);
-		copied += copy;
-		bytes -= copy;
-		vaddr += copy;
-		iov++;
-
-		if (unlikely(left))
-			break;
-	}
-	return copied - left;
-}
-
-/*
- * Copy as much as we can into the page and return the number of bytes which
- * were successfully copied.  If a fault is encountered then return the number of
- * bytes which were copied.
- */
-static size_t copy_from_user_atomic_iovec(struct page *page,
-		struct iov_iter *i, unsigned long offset, size_t bytes)
-{
-	char *kaddr;
-	size_t copied;
-
-	kaddr = kmap_atomic(page);
-	if (likely(i->nr_segs == 1)) {
-		int left;
-		char __user *buf = i->iov->iov_base + i->iov_offset;
-		left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
-		copied = bytes - left;
-	} else {
-		copied = __iovec_copy_from_user_inatomic(kaddr + offset,
-						i->iov, i->iov_offset, bytes);
-	}
-	kunmap_atomic(kaddr);
-
-	return copied;
-}
-
 static void advance_iovec(struct iov_iter *i, size_t bytes)
 {
 	BUG_ON(i->count < bytes);
@@ -404,30 +422,6 @@ int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
 }
 EXPORT_SYMBOL(iov_iter_fault_in_readable);
 
-static unsigned long alignment_iovec(const struct iov_iter *i)
-{
-	const struct iovec *iov = i->iov;
-	unsigned long res;
-	size_t size = i->count;
-	size_t n;
-
-	if (!size)
-		return 0;
-
-	res = (unsigned long)iov->iov_base + i->iov_offset;
-	n = iov->iov_len - i->iov_offset;
-	if (n >= size)
-		return res | size;
-	size -= n;
-	res |= n;
-	while (size > (++iov)->iov_len) {
-		res |= (unsigned long)iov->iov_base | iov->iov_len;
-		size -= iov->iov_len;
-	}
-	res |= (unsigned long)iov->iov_base | size;
-	return res;
-}
-
 void iov_iter_init(struct iov_iter *i, int direction,
 			const struct iovec *iov, unsigned long nr_segs,
 			size_t count)
@@ -691,28 +685,6 @@ static size_t zero_bvec(size_t bytes, struct iov_iter *i)
 	return wanted - bytes;
 }
 
-static size_t copy_from_user_bvec(struct page *page,
-		struct iov_iter *i, unsigned long offset, size_t bytes)
-{
-	char *kaddr;
-	size_t left;
-	const struct bio_vec *bvec;
-	size_t base = i->iov_offset;
-
-	kaddr = kmap_atomic(page);
-	for (left = bytes, bvec = i->bvec; left; bvec++, base = 0) {
-		size_t copy = min(left, bvec->bv_len - base);
-		if (!bvec->bv_len)
-			continue;
-		memcpy_from_page(kaddr + offset, bvec->bv_page,
-				 bvec->bv_offset + base, copy);
-		offset += copy;
-		left -= copy;
-	}
-	kunmap_atomic(kaddr);
-	return bytes;
-}
-
 static void advance_bvec(struct iov_iter *i, size_t bytes)
 {
 	BUG_ON(i->count < bytes);
@@ -749,30 +721,6 @@ static void advance_bvec(struct iov_iter *i, size_t bytes)
 	}
 }
 
-static unsigned long alignment_bvec(const struct iov_iter *i)
-{
-	const struct bio_vec *bvec = i->bvec;
-	unsigned long res;
-	size_t size = i->count;
-	size_t n;
-
-	if (!size)
-		return 0;
-
-	res = bvec->bv_offset + i->iov_offset;
-	n = bvec->bv_len - i->iov_offset;
-	if (n >= size)
-		return res | size;
-	size -= n;
-	res |= n;
-	while (size > (++bvec)->bv_len) {
-		res |= bvec->bv_offset | bvec->bv_len;
-		size -= bvec->bv_len;
-	}
-	res |= bvec->bv_offset | size;
-	return res;
-}
-
 static ssize_t get_pages_bvec(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
@@ -887,10 +835,15 @@ EXPORT_SYMBOL(iov_iter_zero);
 size_t iov_iter_copy_from_user_atomic(struct page *page,
 		struct iov_iter *i, unsigned long offset, size_t bytes)
 {
-	if (i->type & ITER_BVEC)
-		return copy_from_user_bvec(page, i, offset, bytes);
-	else
-		return copy_from_user_atomic_iovec(page, i, offset, bytes);
+	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
+	iterate_all_kinds(i, bytes, v,
+		__copy_from_user_inatomic((p += v.iov_len) - v.iov_len,
+					  v.iov_base, v.iov_len),
+		memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len)
+	)
+	kunmap_atomic(kaddr);
+	return bytes;
 }
 EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
 
@@ -919,10 +872,17 @@ EXPORT_SYMBOL(iov_iter_single_seg_count);
 
 unsigned long iov_iter_alignment(const struct iov_iter *i)
 {
-	if (i->type & ITER_BVEC)
-		return alignment_bvec(i);
-	else
-		return alignment_iovec(i);
+	unsigned long res = 0;
+	size_t size = i->count;
+
+	if (!size)
+		return 0;
+
+	iterate_all_kinds(i, size, v,
+		(res |= (unsigned long)v.iov_base | v.iov_len, 0),
+		res |= v.bv_offset | v.bv_len
+	)
+	return res;
 }
 EXPORT_SYMBOL(iov_iter_alignment);
 
-- 
2.1.3



* [RFC][PATCH 02/13] iov_iter.c: iterate_and_advance
  2014-12-04 20:20 [RFC][PATCHES] iov_iter.c rewrite Al Viro
  2014-12-04 20:23 ` [RFC][PATCH 01/13] iov_iter.c: macros for iterating over iov_iter Al Viro
@ 2014-12-04 20:23 ` Al Viro
  2014-12-04 20:23 ` [RFC][PATCH 03/13] iov_iter.c: convert iov_iter_npages() to iterate_all_kinds Al Viro
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Al Viro @ 2014-12-04 20:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel, netdev

From: Al Viro <viro@zeniv.linux.org.uk>

Same as iterate_all_kinds, but the iterator is advanced to the position
past the last byte we'd handled.

iov_iter_advance() is converted to it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 mm/iov_iter.c | 104 ++++++++++++++++------------------------------------------
 1 file changed, 28 insertions(+), 76 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 798fcb4..e91bf0a 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -70,6 +70,33 @@
 	}							\
 }
 
+#define iterate_and_advance(i, n, v, I, B) {			\
+	size_t skip = i->iov_offset;				\
+	if (unlikely(i->type & ITER_BVEC)) {			\
+		const struct bio_vec *bvec;			\
+		struct bio_vec v;				\
+		iterate_bvec(i, n, v, bvec, skip, (B))		\
+		if (skip == bvec->bv_len) {			\
+			bvec++;					\
+			skip = 0;				\
+		}						\
+		i->nr_segs -= bvec - i->bvec;			\
+		i->bvec = bvec;					\
+	} else {						\
+		const struct iovec *iov;			\
+		struct iovec v;					\
+		iterate_iovec(i, n, v, iov, skip, (I))		\
+		if (skip == iov->iov_len) {			\
+			iov++;					\
+			skip = 0;				\
+		}						\
+		i->nr_segs -= iov - i->iov;			\
+		i->iov = iov;					\
+	}							\
+	i->count -= n;						\
+	i->iov_offset = skip;					\
+}
+
 static size_t copy_to_iter_iovec(void *from, size_t bytes, struct iov_iter *i)
 {
 	size_t skip, copy, left, wanted;
@@ -366,42 +393,6 @@ static size_t zero_iovec(size_t bytes, struct iov_iter *i)
 	return wanted - bytes;
 }
 
-static void advance_iovec(struct iov_iter *i, size_t bytes)
-{
-	BUG_ON(i->count < bytes);
-
-	if (likely(i->nr_segs == 1)) {
-		i->iov_offset += bytes;
-		i->count -= bytes;
-	} else {
-		const struct iovec *iov = i->iov;
-		size_t base = i->iov_offset;
-		unsigned long nr_segs = i->nr_segs;
-
-		/*
-		 * The !iov->iov_len check ensures we skip over unlikely
-		 * zero-length segments (without overruning the iovec).
-		 */
-		while (bytes || unlikely(i->count && !iov->iov_len)) {
-			int copy;
-
-			copy = min(bytes, iov->iov_len - base);
-			BUG_ON(!i->count || i->count < copy);
-			i->count -= copy;
-			bytes -= copy;
-			base += copy;
-			if (iov->iov_len == base) {
-				iov++;
-				nr_segs--;
-				base = 0;
-			}
-		}
-		i->iov = iov;
-		i->iov_offset = base;
-		i->nr_segs = nr_segs;
-	}
-}
-
 /*
  * Fault in the first iovec of the given iov_iter, to a maximum length
  * of bytes. Returns 0 on success, or non-zero if the memory could not be
@@ -685,42 +676,6 @@ static size_t zero_bvec(size_t bytes, struct iov_iter *i)
 	return wanted - bytes;
 }
 
-static void advance_bvec(struct iov_iter *i, size_t bytes)
-{
-	BUG_ON(i->count < bytes);
-
-	if (likely(i->nr_segs == 1)) {
-		i->iov_offset += bytes;
-		i->count -= bytes;
-	} else {
-		const struct bio_vec *bvec = i->bvec;
-		size_t base = i->iov_offset;
-		unsigned long nr_segs = i->nr_segs;
-
-		/*
-		 * The !iov->iov_len check ensures we skip over unlikely
-		 * zero-length segments (without overruning the iovec).
-		 */
-		while (bytes || unlikely(i->count && !bvec->bv_len)) {
-			int copy;
-
-			copy = min(bytes, bvec->bv_len - base);
-			BUG_ON(!i->count || i->count < copy);
-			i->count -= copy;
-			bytes -= copy;
-			base += copy;
-			if (bvec->bv_len == base) {
-				bvec++;
-				nr_segs--;
-				base = 0;
-			}
-		}
-		i->bvec = bvec;
-		i->iov_offset = base;
-		i->nr_segs = nr_segs;
-	}
-}
-
 static ssize_t get_pages_bvec(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
@@ -849,10 +804,7 @@ EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
 
 void iov_iter_advance(struct iov_iter *i, size_t size)
 {
-	if (i->type & ITER_BVEC)
-		advance_bvec(i, size);
-	else
-		advance_iovec(i, size);
+	iterate_and_advance(i, size, v, 0, 0)
 }
 EXPORT_SYMBOL(iov_iter_advance);
 
-- 
2.1.3



* [RFC][PATCH 03/13] iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
  2014-12-04 20:20 [RFC][PATCHES] iov_iter.c rewrite Al Viro
  2014-12-04 20:23 ` [RFC][PATCH 01/13] iov_iter.c: macros for iterating over iov_iter Al Viro
  2014-12-04 20:23 ` [RFC][PATCH 02/13] iov_iter.c: iterate_and_advance Al Viro
@ 2014-12-04 20:23 ` Al Viro
  2014-12-04 20:23 ` [RFC][PATCH 04/13] iov_iter.c: convert iov_iter_get_pages() " Al Viro
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Al Viro @ 2014-12-04 20:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel, netdev

From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 mm/iov_iter.c | 73 ++++++++++++++++-------------------------------------------
 1 file changed, 19 insertions(+), 54 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index e91bf0a..bc666e7 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -493,32 +493,6 @@ static ssize_t get_pages_alloc_iovec(struct iov_iter *i,
 	return (res == n ? len : res * PAGE_SIZE) - *start;
 }
 
-static int iov_iter_npages_iovec(const struct iov_iter *i, int maxpages)
-{
-	size_t offset = i->iov_offset;
-	size_t size = i->count;
-	const struct iovec *iov = i->iov;
-	int npages = 0;
-	int n;
-
-	for (n = 0; size && n < i->nr_segs; n++, iov++) {
-		unsigned long addr = (unsigned long)iov->iov_base + offset;
-		size_t len = iov->iov_len - offset;
-		offset = 0;
-		if (unlikely(!len))	/* empty segment */
-			continue;
-		if (len > size)
-			len = size;
-		npages += (addr + len + PAGE_SIZE - 1) / PAGE_SIZE
-			  - addr / PAGE_SIZE;
-		if (npages >= maxpages)	/* don't bother going further */
-			return maxpages;
-		size -= len;
-		offset = 0;
-	}
-	return min(npages, maxpages);
-}
-
 static void memcpy_from_page(char *to, struct page *page, size_t offset, size_t len)
 {
 	char *from = kmap_atomic(page);
@@ -715,30 +689,6 @@ static ssize_t get_pages_alloc_bvec(struct iov_iter *i,
 	return len;
 }
 
-static int iov_iter_npages_bvec(const struct iov_iter *i, int maxpages)
-{
-	size_t offset = i->iov_offset;
-	size_t size = i->count;
-	const struct bio_vec *bvec = i->bvec;
-	int npages = 0;
-	int n;
-
-	for (n = 0; size && n < i->nr_segs; n++, bvec++) {
-		size_t len = bvec->bv_len - offset;
-		offset = 0;
-		if (unlikely(!len))	/* empty segment */
-			continue;
-		if (len > size)
-			len = size;
-		npages++;
-		if (npages >= maxpages)	/* don't bother going further */
-			return maxpages;
-		size -= len;
-		offset = 0;
-	}
-	return min(npages, maxpages);
-}
-
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
@@ -862,9 +812,24 @@ EXPORT_SYMBOL(iov_iter_get_pages_alloc);
 
 int iov_iter_npages(const struct iov_iter *i, int maxpages)
 {
-	if (i->type & ITER_BVEC)
-		return iov_iter_npages_bvec(i, maxpages);
-	else
-		return iov_iter_npages_iovec(i, maxpages);
+	size_t size = i->count;
+	int npages = 0;
+
+	if (!size)
+		return 0;
+
+	iterate_all_kinds(i, size, v, ({
+		unsigned long p = (unsigned long)v.iov_base;
+		npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
+			- p / PAGE_SIZE;
+		if (npages >= maxpages)
+			return maxpages;
+	0;}),({
+		npages++;
+		if (npages >= maxpages)
+			return maxpages;
+	})
+	)
+	return npages;
 }
 EXPORT_SYMBOL(iov_iter_npages);
-- 
2.1.3



* [RFC][PATCH 04/13] iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
  2014-12-04 20:20 [RFC][PATCHES] iov_iter.c rewrite Al Viro
                   ` (2 preceding siblings ...)
  2014-12-04 20:23 ` [RFC][PATCH 03/13] iov_iter.c: convert iov_iter_npages() to iterate_all_kinds Al Viro
@ 2014-12-04 20:23 ` Al Viro
  2014-12-04 20:23 ` [RFC][PATCH 05/13] iov_iter.c: convert iov_iter_get_pages_alloc() " Al Viro
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Al Viro @ 2014-12-04 20:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel, netdev

From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 mm/iov_iter.c | 78 +++++++++++++++++++++--------------------------------------
 1 file changed, 28 insertions(+), 50 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index bc666e7..75e29ef 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -428,34 +428,6 @@ void iov_iter_init(struct iov_iter *i, int direction,
 }
 EXPORT_SYMBOL(iov_iter_init);
 
-static ssize_t get_pages_iovec(struct iov_iter *i,
-		   struct page **pages, size_t maxsize, unsigned maxpages,
-		   size_t *start)
-{
-	size_t offset = i->iov_offset;
-	const struct iovec *iov = i->iov;
-	size_t len;
-	unsigned long addr;
-	int n;
-	int res;
-
-	len = iov->iov_len - offset;
-	if (len > i->count)
-		len = i->count;
-	if (len > maxsize)
-		len = maxsize;
-	addr = (unsigned long)iov->iov_base + offset;
-	len += *start = addr & (PAGE_SIZE - 1);
-	if (len > maxpages * PAGE_SIZE)
-		len = maxpages * PAGE_SIZE;
-	addr &= ~(PAGE_SIZE - 1);
-	n = (len + PAGE_SIZE - 1) / PAGE_SIZE;
-	res = get_user_pages_fast(addr, n, (i->type & WRITE) != WRITE, pages);
-	if (unlikely(res < 0))
-		return res;
-	return (res == n ? len : res * PAGE_SIZE) - *start;
-}
-
 static ssize_t get_pages_alloc_iovec(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start)
@@ -650,24 +622,6 @@ static size_t zero_bvec(size_t bytes, struct iov_iter *i)
 	return wanted - bytes;
 }
 
-static ssize_t get_pages_bvec(struct iov_iter *i,
-		   struct page **pages, size_t maxsize, unsigned maxpages,
-		   size_t *start)
-{
-	const struct bio_vec *bvec = i->bvec;
-	size_t len = bvec->bv_len - i->iov_offset;
-	if (len > i->count)
-		len = i->count;
-	if (len > maxsize)
-		len = maxsize;
-	/* can't be more than PAGE_SIZE */
-	*start = bvec->bv_offset + i->iov_offset;
-
-	get_page(*pages = bvec->bv_page);
-
-	return len;
-}
-
 static ssize_t get_pages_alloc_bvec(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start)
@@ -792,10 +746,34 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
 {
-	if (i->type & ITER_BVEC)
-		return get_pages_bvec(i, pages, maxsize, maxpages, start);
-	else
-		return get_pages_iovec(i, pages, maxsize, maxpages, start);
+	if (maxsize > i->count)
+		maxsize = i->count;
+
+	if (!maxsize)
+		return 0;
+
+	iterate_all_kinds(i, maxsize, v, ({
+		unsigned long addr = (unsigned long)v.iov_base;
+		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
+		int n;
+		int res;
+
+		if (len > maxpages * PAGE_SIZE)
+			len = maxpages * PAGE_SIZE;
+		addr &= ~(PAGE_SIZE - 1);
+		n = DIV_ROUND_UP(len, PAGE_SIZE);
+		res = get_user_pages_fast(addr, n, (i->type & WRITE) != WRITE, pages);
+		if (unlikely(res < 0))
+			return res;
+		return (res == n ? len : res * PAGE_SIZE) - *start;
+	0;}),({
+		/* can't be more than PAGE_SIZE */
+		*start = v.bv_offset;
+		get_page(*pages = v.bv_page);
+		return v.bv_len;
+	})
+	)
+	return 0;
 }
 EXPORT_SYMBOL(iov_iter_get_pages);
 
-- 
2.1.3



* [RFC][PATCH 05/13] iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
  2014-12-04 20:20 [RFC][PATCHES] iov_iter.c rewrite Al Viro
                   ` (3 preceding siblings ...)
  2014-12-04 20:23 ` [RFC][PATCH 04/13] iov_iter.c: convert iov_iter_get_pages() " Al Viro
@ 2014-12-04 20:23 ` Al Viro
  2014-12-04 20:23 ` [RFC][PATCH 06/13] iov_iter.c: convert iov_iter_zero() to iterate_and_advance Al Viro
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Al Viro @ 2014-12-04 20:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel, netdev

From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 mm/iov_iter.c | 107 ++++++++++++++++++++++++----------------------------------
 1 file changed, 45 insertions(+), 62 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 75e29ef..3214b9b 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -428,43 +428,6 @@ void iov_iter_init(struct iov_iter *i, int direction,
 }
 EXPORT_SYMBOL(iov_iter_init);
 
-static ssize_t get_pages_alloc_iovec(struct iov_iter *i,
-		   struct page ***pages, size_t maxsize,
-		   size_t *start)
-{
-	size_t offset = i->iov_offset;
-	const struct iovec *iov = i->iov;
-	size_t len;
-	unsigned long addr;
-	void *p;
-	int n;
-	int res;
-
-	len = iov->iov_len - offset;
-	if (len > i->count)
-		len = i->count;
-	if (len > maxsize)
-		len = maxsize;
-	addr = (unsigned long)iov->iov_base + offset;
-	len += *start = addr & (PAGE_SIZE - 1);
-	addr &= ~(PAGE_SIZE - 1);
-	n = (len + PAGE_SIZE - 1) / PAGE_SIZE;
-	
-	p = kmalloc(n * sizeof(struct page *), GFP_KERNEL);
-	if (!p)
-		p = vmalloc(n * sizeof(struct page *));
-	if (!p)
-		return -ENOMEM;
-
-	res = get_user_pages_fast(addr, n, (i->type & WRITE) != WRITE, p);
-	if (unlikely(res < 0)) {
-		kvfree(p);
-		return res;
-	}
-	*pages = p;
-	return (res == n ? len : res * PAGE_SIZE) - *start;
-}
-
 static void memcpy_from_page(char *to, struct page *page, size_t offset, size_t len)
 {
 	char *from = kmap_atomic(page);
@@ -622,27 +585,6 @@ static size_t zero_bvec(size_t bytes, struct iov_iter *i)
 	return wanted - bytes;
 }
 
-static ssize_t get_pages_alloc_bvec(struct iov_iter *i,
-		   struct page ***pages, size_t maxsize,
-		   size_t *start)
-{
-	const struct bio_vec *bvec = i->bvec;
-	size_t len = bvec->bv_len - i->iov_offset;
-	if (len > i->count)
-		len = i->count;
-	if (len > maxsize)
-		len = maxsize;
-	*start = bvec->bv_offset + i->iov_offset;
-
-	*pages = kmalloc(sizeof(struct page *), GFP_KERNEL);
-	if (!*pages)
-		return -ENOMEM;
-
-	get_page(**pages = bvec->bv_page);
-
-	return len;
-}
-
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
@@ -777,14 +719,55 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 }
 EXPORT_SYMBOL(iov_iter_get_pages);
 
+static struct page **get_pages_array(size_t n)
+{
+	struct page **p = kmalloc(n * sizeof(struct page *), GFP_KERNEL);
+	if (!p)
+		p = vmalloc(n * sizeof(struct page *));
+	return p;
+}
+
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start)
 {
-	if (i->type & ITER_BVEC)
-		return get_pages_alloc_bvec(i, pages, maxsize, start);
-	else
-		return get_pages_alloc_iovec(i, pages, maxsize, start);
+	struct page **p;
+
+	if (maxsize > i->count)
+		maxsize = i->count;
+
+	if (!maxsize)
+		return 0;
+
+	iterate_all_kinds(i, maxsize, v, ({
+		unsigned long addr = (unsigned long)v.iov_base;
+		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
+		int n;
+		int res;
+
+		addr &= ~(PAGE_SIZE - 1);
+		n = DIV_ROUND_UP(len, PAGE_SIZE);
+		p = get_pages_array(n);
+		if (!p)
+			return -ENOMEM;
+		res = get_user_pages_fast(addr, n, (i->type & WRITE) != WRITE, p);
+		if (unlikely(res < 0)) {
+			kvfree(p);
+			return res;
+		}
+		*pages = p;
+		return (res == n ? len : res * PAGE_SIZE) - *start;
+	0;}),({
+		/* can't be more than PAGE_SIZE */
+		*start = v.bv_offset;
+		*pages = p = get_pages_array(1);
+		if (!p)
+			return -ENOMEM;
+		get_page(*p = v.bv_page);
+		return v.bv_len;
+	})
+	)
+	return 0;
 }
 EXPORT_SYMBOL(iov_iter_get_pages_alloc);
 
-- 
2.1.3



* [RFC][PATCH 06/13] iov_iter.c: convert iov_iter_zero() to iterate_and_advance
  2014-12-04 20:20 [RFC][PATCHES] iov_iter.c rewrite Al Viro
                   ` (4 preceding siblings ...)
  2014-12-04 20:23 ` [RFC][PATCH 05/13] iov_iter.c: convert iov_iter_get_pages_alloc() " Al Viro
@ 2014-12-04 20:23 ` Al Viro
  2014-12-04 20:23 ` [RFC][PATCH 07/13] iov_iter.c: get rid of bvec_copy_page_{to,from}_iter() Al Viro
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Al Viro @ 2014-12-04 20:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel, netdev

From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 mm/iov_iter.c | 98 ++++++++---------------------------------------------------
 1 file changed, 12 insertions(+), 86 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 3214b9b..39ad713 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -349,50 +349,6 @@ done:
 	return wanted - bytes;
 }
 
-static size_t zero_iovec(size_t bytes, struct iov_iter *i)
-{
-	size_t skip, copy, left, wanted;
-	const struct iovec *iov;
-	char __user *buf;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-
-	if (unlikely(!bytes))
-		return 0;
-
-	wanted = bytes;
-	iov = i->iov;
-	skip = i->iov_offset;
-	buf = iov->iov_base + skip;
-	copy = min(bytes, iov->iov_len - skip);
-
-	left = __clear_user(buf, copy);
-	copy -= left;
-	skip += copy;
-	bytes -= copy;
-
-	while (unlikely(!left && bytes)) {
-		iov++;
-		buf = iov->iov_base;
-		copy = min(bytes, iov->iov_len);
-		left = __clear_user(buf, copy);
-		copy -= left;
-		skip = copy;
-		bytes -= copy;
-	}
-
-	if (skip == iov->iov_len) {
-		iov++;
-		skip = 0;
-	}
-	i->count -= wanted - bytes;
-	i->nr_segs -= iov - i->iov;
-	i->iov = iov;
-	i->iov_offset = skip;
-	return wanted - bytes;
-}
-
 /*
  * Fault in the first iovec of the given iov_iter, to a maximum length
  * of bytes. Returns 0 on success, or non-zero if the memory could not be
@@ -548,43 +504,6 @@ static size_t copy_page_from_iter_bvec(struct page *page, size_t offset,
 	return wanted;
 }
 
-static size_t zero_bvec(size_t bytes, struct iov_iter *i)
-{
-	size_t skip, copy, wanted;
-	const struct bio_vec *bvec;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-
-	if (unlikely(!bytes))
-		return 0;
-
-	wanted = bytes;
-	bvec = i->bvec;
-	skip = i->iov_offset;
-	copy = min_t(size_t, bytes, bvec->bv_len - skip);
-
-	memzero_page(bvec->bv_page, skip + bvec->bv_offset, copy);
-	skip += copy;
-	bytes -= copy;
-	while (bytes) {
-		bvec++;
-		copy = min(bytes, (size_t)bvec->bv_len);
-		memzero_page(bvec->bv_page, bvec->bv_offset, copy);
-		skip = copy;
-		bytes -= copy;
-	}
-	if (skip == bvec->bv_len) {
-		bvec++;
-		skip = 0;
-	}
-	i->count -= wanted - bytes;
-	i->nr_segs -= bvec - i->bvec;
-	i->bvec = bvec;
-	i->iov_offset = skip;
-	return wanted - bytes;
-}
-
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
@@ -625,11 +544,18 @@ EXPORT_SYMBOL(copy_from_iter);
 
 size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 {
-	if (i->type & ITER_BVEC) {
-		return zero_bvec(bytes, i);
-	} else {
-		return zero_iovec(bytes, i);
-	}
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	iterate_and_advance(i, bytes, v,
+		__clear_user(v.iov_base, v.iov_len),
+		memzero_page(v.bv_page, v.bv_offset, v.bv_len)
+	)
+
+	return bytes;
 }
 EXPORT_SYMBOL(iov_iter_zero);
 
-- 
2.1.3



* [RFC][PATCH 07/13] iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
  2014-12-04 20:20 [RFC][PATCHES] iov_iter.c rewrite Al Viro
                   ` (5 preceding siblings ...)
  2014-12-04 20:23 ` [RFC][PATCH 06/13] iov_iter.c: convert iov_iter_zero() to iterate_and_advance Al Viro
@ 2014-12-04 20:23 ` Al Viro
  2014-12-05 12:28   ` Sergei Shtylyov
  2014-12-04 20:23 ` [RFC][PATCH 08/13] iov_iter.c: convert copy_from_iter() to iterate_and_advance Al Viro
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Al Viro @ 2014-12-04 20:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel, netdev

From: Al Viro <viro@zeniv.linux.org.uk>

Just have copy_page_{to,from}_iter() fall back to kmap_atomic +
copy_{to,from}_iter() + kunmap_atomic() in the ITER_BVEC case.  As
a matter of fact, that's what we want to do for any iov_iter
kind that isn't blocking - e.g. ITER_KVEC will also go that way
once we recognize it on the iov_iter.c primitives level.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 mm/iov_iter.c | 60 ++++++++++++++++++++++++-----------------------------------
 1 file changed, 24 insertions(+), 36 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 39ad713..17b7144 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -486,30 +486,33 @@ static size_t copy_from_iter_bvec(void *to, size_t bytes, struct iov_iter *i)
 	return wanted;
 }
 
-static size_t copy_page_to_iter_bvec(struct page *page, size_t offset,
-					size_t bytes, struct iov_iter *i)
+size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
 {
-	void *kaddr = kmap_atomic(page);
-	size_t wanted = copy_to_iter_bvec(kaddr + offset, bytes, i);
-	kunmap_atomic(kaddr);
-	return wanted;
+	if (i->type & ITER_BVEC)
+		return copy_to_iter_bvec(addr, bytes, i);
+	else
+		return copy_to_iter_iovec(addr, bytes, i);
 }
+EXPORT_SYMBOL(copy_to_iter);
 
-static size_t copy_page_from_iter_bvec(struct page *page, size_t offset,
-					size_t bytes, struct iov_iter *i)
+size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 {
-	void *kaddr = kmap_atomic(page);
-	size_t wanted = copy_from_iter_bvec(kaddr + offset, bytes, i);
-	kunmap_atomic(kaddr);
-	return wanted;
+	if (i->type & ITER_BVEC)
+		return copy_from_iter_bvec(addr, bytes, i);
+	else
+		return copy_from_iter_iovec(addr, bytes, i);
 }
+EXPORT_SYMBOL(copy_from_iter);
 
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
-	if (i->type & ITER_BVEC)
-		return copy_page_to_iter_bvec(page, offset, bytes, i);
-	else
+	if (i->type & (ITER_BVEC|ITER_KVEC)) {
+		void *kaddr = kmap_atomic(page);
+		size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
+		kunmap_atomic(kaddr);
+		return wanted;
+	} else
 		return copy_page_to_iter_iovec(page, offset, bytes, i);
 }
 EXPORT_SYMBOL(copy_page_to_iter);
@@ -517,31 +520,16 @@ EXPORT_SYMBOL(copy_page_to_iter);
 size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
-	if (i->type & ITER_BVEC)
-		return copy_page_from_iter_bvec(page, offset, bytes, i);
-	else
+	if (i->type & ITER_BVEC) {
+		void *kaddr = kmap_atomic(page);
+		size_t wanted = copy_from_iter(kaddr + offset, bytes, i);
+		kunmap_atomic(kaddr);
+		return wanted;
+	} else
 		return copy_page_from_iter_iovec(page, offset, bytes, i);
 }
 EXPORT_SYMBOL(copy_page_from_iter);
 
-size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
-{
-	if (i->type & ITER_BVEC)
-		return copy_to_iter_bvec(addr, bytes, i);
-	else
-		return copy_to_iter_iovec(addr, bytes, i);
-}
-EXPORT_SYMBOL(copy_to_iter);
-
-size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
-{
-	if (i->type & ITER_BVEC)
-		return copy_from_iter_bvec(addr, bytes, i);
-	else
-		return copy_from_iter_iovec(addr, bytes, i);
-}
-EXPORT_SYMBOL(copy_from_iter);
-
 size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 {
 	if (unlikely(bytes > i->count))
-- 
2.1.3


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC][PATCH 08/13] iov_iter.c: convert copy_from_iter() to iterate_and_advance
  2014-12-04 20:20 [RFC][PATCHES] iov_iter.c rewrite Al Viro
                   ` (6 preceding siblings ...)
  2014-12-04 20:23 ` [RFC][PATCH 07/13] iov_iter.c: get rid of bvec_copy_page_{to,from}_iter() Al Viro
@ 2014-12-04 20:23 ` Al Viro
  2014-12-04 20:23 ` [RFC][PATCH 09/13] iov_iter.c: convert copy_to_iter() " Al Viro
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Al Viro @ 2014-12-04 20:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel, netdev

From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 mm/iov_iter.c | 106 +++++++++-------------------------------------------------
 1 file changed, 15 insertions(+), 91 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 17b7144..791429d 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -142,51 +142,6 @@ static size_t copy_to_iter_iovec(void *from, size_t bytes, struct iov_iter *i)
 	return wanted - bytes;
 }
 
-static size_t copy_from_iter_iovec(void *to, size_t bytes, struct iov_iter *i)
-{
-	size_t skip, copy, left, wanted;
-	const struct iovec *iov;
-	char __user *buf;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-
-	if (unlikely(!bytes))
-		return 0;
-
-	wanted = bytes;
-	iov = i->iov;
-	skip = i->iov_offset;
-	buf = iov->iov_base + skip;
-	copy = min(bytes, iov->iov_len - skip);
-
-	left = __copy_from_user(to, buf, copy);
-	copy -= left;
-	skip += copy;
-	to += copy;
-	bytes -= copy;
-	while (unlikely(!left && bytes)) {
-		iov++;
-		buf = iov->iov_base;
-		copy = min(bytes, iov->iov_len);
-		left = __copy_from_user(to, buf, copy);
-		copy -= left;
-		skip = copy;
-		to += copy;
-		bytes -= copy;
-	}
-
-	if (skip == iov->iov_len) {
-		iov++;
-		skip = 0;
-	}
-	i->count -= wanted - bytes;
-	i->nr_segs -= iov - i->iov;
-	i->iov = iov;
-	i->iov_offset = skip;
-	return wanted - bytes;
-}
-
 static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
@@ -444,48 +399,6 @@ static size_t copy_to_iter_bvec(void *from, size_t bytes, struct iov_iter *i)
 	return wanted - bytes;
 }
 
-static size_t copy_from_iter_bvec(void *to, size_t bytes, struct iov_iter *i)
-{
-	size_t skip, copy, wanted;
-	const struct bio_vec *bvec;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-
-	if (unlikely(!bytes))
-		return 0;
-
-	wanted = bytes;
-	bvec = i->bvec;
-	skip = i->iov_offset;
-
-	copy = min(bytes, bvec->bv_len - skip);
-
-	memcpy_from_page(to, bvec->bv_page, bvec->bv_offset + skip, copy);
-
-	to += copy;
-	skip += copy;
-	bytes -= copy;
-
-	while (bytes) {
-		bvec++;
-		copy = min(bytes, (size_t)bvec->bv_len);
-		memcpy_from_page(to, bvec->bv_page, bvec->bv_offset, copy);
-		skip = copy;
-		to += copy;
-		bytes -= copy;
-	}
-	if (skip == bvec->bv_len) {
-		bvec++;
-		skip = 0;
-	}
-	i->count -= wanted;
-	i->nr_segs -= bvec - i->bvec;
-	i->bvec = bvec;
-	i->iov_offset = skip;
-	return wanted;
-}
-
 size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
 {
 	if (i->type & ITER_BVEC)
@@ -497,10 +410,21 @@ EXPORT_SYMBOL(copy_to_iter);
 
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 {
-	if (i->type & ITER_BVEC)
-		return copy_from_iter_bvec(addr, bytes, i);
-	else
-		return copy_from_iter_iovec(addr, bytes, i);
+	char *to = addr;
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	iterate_and_advance(i, bytes, v,
+		__copy_from_user((to += v.iov_len) - v.iov_len, v.iov_base,
+				 v.iov_len),
+		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len)
+	)
+
+	return bytes;
 }
 EXPORT_SYMBOL(copy_from_iter);
 
-- 
2.1.3


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC][PATCH 09/13] iov_iter.c: convert copy_to_iter() to iterate_and_advance
  2014-12-04 20:20 [RFC][PATCHES] iov_iter.c rewrite Al Viro
                   ` (7 preceding siblings ...)
  2014-12-04 20:23 ` [RFC][PATCH 08/13] iov_iter.c: convert copy_from_iter() to iterate_and_advance Al Viro
@ 2014-12-04 20:23 ` Al Viro
  2014-12-04 20:23 ` [RFC][PATCH 10/13] iov_iter.c: handle ITER_KVEC directly Al Viro
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Al Viro @ 2014-12-04 20:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel, netdev

From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 mm/iov_iter.c | 91 ++++++-----------------------------------------------------
 1 file changed, 9 insertions(+), 82 deletions(-)

diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 791429d..66665449 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -97,51 +97,6 @@
 	i->iov_offset = skip;					\
 }
 
-static size_t copy_to_iter_iovec(void *from, size_t bytes, struct iov_iter *i)
-{
-	size_t skip, copy, left, wanted;
-	const struct iovec *iov;
-	char __user *buf;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-
-	if (unlikely(!bytes))
-		return 0;
-
-	wanted = bytes;
-	iov = i->iov;
-	skip = i->iov_offset;
-	buf = iov->iov_base + skip;
-	copy = min(bytes, iov->iov_len - skip);
-
-	left = __copy_to_user(buf, from, copy);
-	copy -= left;
-	skip += copy;
-	from += copy;
-	bytes -= copy;
-	while (unlikely(!left && bytes)) {
-		iov++;
-		buf = iov->iov_base;
-		copy = min(bytes, iov->iov_len);
-		left = __copy_to_user(buf, from, copy);
-		copy -= left;
-		skip = copy;
-		from += copy;
-		bytes -= copy;
-	}
-
-	if (skip == iov->iov_len) {
-		iov++;
-		skip = 0;
-	}
-	i->count -= wanted - bytes;
-	i->nr_segs -= iov - i->iov;
-	i->iov = iov;
-	i->iov_offset = skip;
-	return wanted - bytes;
-}
-
 static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
@@ -360,51 +315,23 @@ static void memzero_page(struct page *page, size_t offset, size_t len)
 	kunmap_atomic(addr);
 }
 
-static size_t copy_to_iter_bvec(void *from, size_t bytes, struct iov_iter *i)
+size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
 {
-	size_t skip, copy, wanted;
-	const struct bio_vec *bvec;
-
+	char *from = addr;
 	if (unlikely(bytes > i->count))
 		bytes = i->count;
 
 	if (unlikely(!bytes))
 		return 0;
 
-	wanted = bytes;
-	bvec = i->bvec;
-	skip = i->iov_offset;
-	copy = min_t(size_t, bytes, bvec->bv_len - skip);
-
-	memcpy_to_page(bvec->bv_page, skip + bvec->bv_offset, from, copy);
-	skip += copy;
-	from += copy;
-	bytes -= copy;
-	while (bytes) {
-		bvec++;
-		copy = min(bytes, (size_t)bvec->bv_len);
-		memcpy_to_page(bvec->bv_page, bvec->bv_offset, from, copy);
-		skip = copy;
-		from += copy;
-		bytes -= copy;
-	}
-	if (skip == bvec->bv_len) {
-		bvec++;
-		skip = 0;
-	}
-	i->count -= wanted - bytes;
-	i->nr_segs -= bvec - i->bvec;
-	i->bvec = bvec;
-	i->iov_offset = skip;
-	return wanted - bytes;
-}
+	iterate_and_advance(i, bytes, v,
+		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
+			       v.iov_len),
+		memcpy_to_page(v.bv_page, v.bv_offset,
+			       (from += v.bv_len) - v.bv_len, v.bv_len)
+	)
 
-size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
-{
-	if (i->type & ITER_BVEC)
-		return copy_to_iter_bvec(addr, bytes, i);
-	else
-		return copy_to_iter_iovec(addr, bytes, i);
+	return bytes;
 }
 EXPORT_SYMBOL(copy_to_iter);
 
-- 
2.1.3


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC][PATCH 10/13] iov_iter.c: handle ITER_KVEC directly
  2014-12-04 20:20 [RFC][PATCHES] iov_iter.c rewrite Al Viro
                   ` (8 preceding siblings ...)
  2014-12-04 20:23 ` [RFC][PATCH 09/13] iov_iter.c: convert copy_to_iter() " Al Viro
@ 2014-12-04 20:23 ` Al Viro
  2014-12-04 20:23 ` [RFC][PATCH 11/13] csum_and_copy_..._iter() Al Viro
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Al Viro @ 2014-12-04 20:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel, netdev

From: Al Viro <viro@zeniv.linux.org.uk>

... without bothering with copy_..._user()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 include/linux/uio.h |   1 +
 mm/iov_iter.c       | 101 +++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 89 insertions(+), 13 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 9b15814..6e16945 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -31,6 +31,7 @@ struct iov_iter {
 	size_t count;
 	union {
 		const struct iovec *iov;
+		const struct kvec *kvec;
 		const struct bio_vec *bvec;
 	};
 	unsigned long nr_segs;
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 66665449..d74de6d 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -32,6 +32,29 @@
 	n = wanted - n;					\
 }
 
+#define iterate_kvec(i, n, __v, __p, skip, STEP) {	\
+	size_t wanted = n;				\
+	__p = i->kvec;					\
+	__v.iov_len = min(n, __p->iov_len - skip);	\
+	if (likely(__v.iov_len)) {			\
+		__v.iov_base = __p->iov_base + skip;	\
+		(void)(STEP);				\
+		skip += __v.iov_len;			\
+		n -= __v.iov_len;			\
+	}						\
+	while (unlikely(n)) {				\
+		__p++;					\
+		__v.iov_len = min(n, __p->iov_len);	\
+		if (unlikely(!__v.iov_len))		\
+			continue;			\
+		__v.iov_base = __p->iov_base;		\
+		(void)(STEP);				\
+		skip = __v.iov_len;			\
+		n -= __v.iov_len;			\
+	}						\
+	n = wanted;					\
+}
+
 #define iterate_bvec(i, n, __v, __p, skip, STEP) {	\
 	size_t wanted = n;				\
 	__p = i->bvec;					\
@@ -57,12 +80,16 @@
 	n = wanted;					\
 }
 
-#define iterate_all_kinds(i, n, v, I, B) {			\
+#define iterate_all_kinds(i, n, v, I, B, K) {			\
 	size_t skip = i->iov_offset;				\
 	if (unlikely(i->type & ITER_BVEC)) {			\
 		const struct bio_vec *bvec;			\
 		struct bio_vec v;				\
 		iterate_bvec(i, n, v, bvec, skip, (B))		\
+	} else if (unlikely(i->type & ITER_KVEC)) {		\
+		const struct kvec *kvec;			\
+		struct kvec v;					\
+		iterate_kvec(i, n, v, kvec, skip, (K))		\
 	} else {						\
 		const struct iovec *iov;			\
 		struct iovec v;					\
@@ -70,7 +97,7 @@
 	}							\
 }
 
-#define iterate_and_advance(i, n, v, I, B) {			\
+#define iterate_and_advance(i, n, v, I, B, K) {			\
 	size_t skip = i->iov_offset;				\
 	if (unlikely(i->type & ITER_BVEC)) {			\
 		const struct bio_vec *bvec;			\
@@ -82,6 +109,16 @@
 		}						\
 		i->nr_segs -= bvec - i->bvec;			\
 		i->bvec = bvec;					\
+	} else if (unlikely(i->type & ITER_KVEC)) {		\
+		const struct kvec *kvec;			\
+		struct kvec v;					\
+		iterate_kvec(i, n, v, kvec, skip, (K))		\
+		if (skip == kvec->iov_len) {			\
+			kvec++;					\
+			skip = 0;				\
+		}						\
+		i->nr_segs -= kvec - i->kvec;			\
+		i->kvec = kvec;					\
 	} else {						\
 		const struct iovec *iov;			\
 		struct iovec v;					\
@@ -270,7 +307,7 @@ done:
  */
 int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
 {
-	if (!(i->type & ITER_BVEC)) {
+	if (!(i->type & (ITER_BVEC|ITER_KVEC))) {
 		char __user *buf = i->iov->iov_base + i->iov_offset;
 		bytes = min(bytes, i->iov->iov_len - i->iov_offset);
 		return fault_in_pages_readable(buf, bytes);
@@ -284,10 +321,14 @@ void iov_iter_init(struct iov_iter *i, int direction,
 			size_t count)
 {
 	/* It will get better.  Eventually... */
-	if (segment_eq(get_fs(), KERNEL_DS))
+	if (segment_eq(get_fs(), KERNEL_DS)) {
 		direction |= ITER_KVEC;
-	i->type = direction;
-	i->iov = iov;
+		i->type = direction;
+		i->kvec = (struct kvec *)iov;
+	} else {
+		i->type = direction;
+		i->iov = iov;
+	}
 	i->nr_segs = nr_segs;
 	i->iov_offset = 0;
 	i->count = count;
@@ -328,7 +369,8 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
 		__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
 			       v.iov_len),
 		memcpy_to_page(v.bv_page, v.bv_offset,
-			       (from += v.bv_len) - v.bv_len, v.bv_len)
+			       (from += v.bv_len) - v.bv_len, v.bv_len),
+		memcpy(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len)
 	)
 
 	return bytes;
@@ -348,7 +390,8 @@ size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 		__copy_from_user((to += v.iov_len) - v.iov_len, v.iov_base,
 				 v.iov_len),
 		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
-				 v.bv_offset, v.bv_len)
+				 v.bv_offset, v.bv_len),
+		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
 	)
 
 	return bytes;
@@ -371,7 +414,7 @@ EXPORT_SYMBOL(copy_page_to_iter);
 size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
-	if (i->type & ITER_BVEC) {
+	if (i->type & (ITER_BVEC|ITER_KVEC)) {
 		void *kaddr = kmap_atomic(page);
 		size_t wanted = copy_from_iter(kaddr + offset, bytes, i);
 		kunmap_atomic(kaddr);
@@ -391,7 +434,8 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 
 	iterate_and_advance(i, bytes, v,
 		__clear_user(v.iov_base, v.iov_len),
-		memzero_page(v.bv_page, v.bv_offset, v.bv_len)
+		memzero_page(v.bv_page, v.bv_offset, v.bv_len),
+		memset(v.iov_base, 0, v.iov_len)
 	)
 
 	return bytes;
@@ -406,7 +450,8 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 		__copy_from_user_inatomic((p += v.iov_len) - v.iov_len,
 					  v.iov_base, v.iov_len),
 		memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
-				 v.bv_offset, v.bv_len)
+				 v.bv_offset, v.bv_len),
+		memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
 	)
 	kunmap_atomic(kaddr);
 	return bytes;
@@ -415,7 +460,7 @@ EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
 
 void iov_iter_advance(struct iov_iter *i, size_t size)
 {
-	iterate_and_advance(i, size, v, 0, 0)
+	iterate_and_advance(i, size, v, 0, 0, 0)
 }
 EXPORT_SYMBOL(iov_iter_advance);
 
@@ -443,7 +488,8 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
 
 	iterate_all_kinds(i, size, v,
 		(res |= (unsigned long)v.iov_base | v.iov_len, 0),
-		res |= v.bv_offset | v.bv_len
+		res |= v.bv_offset | v.bv_len,
+		res |= (unsigned long)v.iov_base | v.iov_len
 	)
 	return res;
 }
@@ -478,6 +524,16 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 		*start = v.bv_offset;
 		get_page(*pages = v.bv_page);
 		return v.bv_len;
+	}),({
+		unsigned long addr = (unsigned long)v.iov_base, end;
+		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
+
+		if (len > maxpages * PAGE_SIZE)
+			len = maxpages * PAGE_SIZE;
+		addr &= ~(PAGE_SIZE - 1);
+		for (end = addr + len; addr < end; addr += PAGE_SIZE)
+			get_page(*pages++ = virt_to_page(addr));
+		return len - *start;
 	})
 	)
 	return 0;
@@ -530,6 +586,19 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 			return -ENOMEM;
 		get_page(*p = v.bv_page);
 		return v.bv_len;
+	}),({
+		unsigned long addr = (unsigned long)v.iov_base, end;
+		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
+		int n;
+
+		addr &= ~(PAGE_SIZE - 1);
+		n = DIV_ROUND_UP(len, PAGE_SIZE);
+		*pages = p = get_pages_array(n);
+		if (!p)
+			return -ENOMEM;
+		for (end = addr + len; addr < end; addr += PAGE_SIZE)
+			get_page(*p++ = virt_to_page(addr));
+		return len - *start;
 	})
 	)
 	return 0;
@@ -554,6 +623,12 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 		npages++;
 		if (npages >= maxpages)
 			return maxpages;
+	}),({
+		unsigned long p = (unsigned long)v.iov_base;
+		npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
+			- p / PAGE_SIZE;
+		if (npages >= maxpages)
+			return maxpages;
 	})
 	)
 	return npages;
-- 
2.1.3


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC][PATCH 11/13] csum_and_copy_..._iter()
  2014-12-04 20:20 [RFC][PATCHES] iov_iter.c rewrite Al Viro
                   ` (9 preceding siblings ...)
  2014-12-04 20:23 ` [RFC][PATCH 10/13] iov_iter.c: handle ITER_KVEC directly Al Viro
@ 2014-12-04 20:23 ` Al Viro
  2014-12-04 20:23 ` [RFC][PATCH 12/13] new helper: iov_iter_kvec() Al Viro
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Al Viro @ 2014-12-04 20:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel, netdev

From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 include/linux/uio.h |  2 ++
 mm/iov_iter.c       | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 91 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 6e16945..28ed2d9 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -124,6 +124,8 @@ static inline void iov_iter_reexpand(struct iov_iter *i, size_t count)
 {
 	i->count = count;
 }
+size_t csum_and_copy_to_iter(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
+size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
 
 int memcpy_fromiovec(unsigned char *kdata, struct iovec *iov, int len);
 int memcpy_toiovec(struct iovec *iov, unsigned char *kdata, int len);
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index d74de6d..5a3b4a8 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -3,6 +3,7 @@
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
+#include <net/checksum.h>
 
 #define iterate_iovec(i, n, __v, __p, skip, STEP) {	\
 	size_t left;					\
@@ -605,6 +606,94 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 }
 EXPORT_SYMBOL(iov_iter_get_pages_alloc);
 
+size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
+			       struct iov_iter *i)
+{
+	char *to = addr;
+	__wsum sum, next;
+	size_t off = 0;
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	sum = *csum;
+	iterate_and_advance(i, bytes, v, ({
+		int err = 0;
+		next = csum_and_copy_from_user(v.iov_base, 
+					       (to += v.iov_len) - v.iov_len,
+					       v.iov_len, 0, &err);
+		if (!err) {
+			sum = csum_block_add(sum, next, off);
+			off += v.iov_len;
+		}
+		err ? v.iov_len : 0;
+	}), ({
+		char *p = kmap_atomic(v.bv_page);
+		next = csum_partial_copy_nocheck(p + v.bv_offset,
+						 (to += v.bv_len) - v.bv_len,
+						 v.bv_len, 0);
+		kunmap_atomic(p);
+		sum = csum_block_add(sum, next, off);
+		off += v.bv_len;
+	}),({
+		next = csum_partial_copy_nocheck(v.iov_base,
+						 (to += v.iov_len) - v.iov_len,
+						 v.iov_len, 0);
+		sum = csum_block_add(sum, next, off);
+		off += v.iov_len;
+	})
+	)
+	*csum = sum;
+	return bytes;
+}
+EXPORT_SYMBOL(csum_and_copy_from_iter);
+
+size_t csum_and_copy_to_iter(void *addr, size_t bytes, __wsum *csum,
+			     struct iov_iter *i)
+{
+	char *from = addr;
+	__wsum sum, next;
+	size_t off = 0;
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	sum = *csum;
+	iterate_and_advance(i, bytes, v, ({
+		int err = 0;
+		next = csum_and_copy_to_user((from += v.iov_len) - v.iov_len,
+					     v.iov_base, 
+					     v.iov_len, 0, &err);
+		if (!err) {
+			sum = csum_block_add(sum, next, off);
+			off += v.iov_len;
+		}
+		err ? v.iov_len : 0;
+	}), ({
+		char *p = kmap_atomic(v.bv_page);
+		next = csum_partial_copy_nocheck((from += v.bv_len) - v.bv_len,
+						 p + v.bv_offset,
+						 v.bv_len, 0);
+		kunmap_atomic(p);
+		sum = csum_block_add(sum, next, off);
+		off += v.bv_len;
+	}),({
+		next = csum_partial_copy_nocheck((from += v.iov_len) - v.iov_len,
+						 v.iov_base,
+						 v.iov_len, 0);
+		sum = csum_block_add(sum, next, off);
+		off += v.iov_len;
+	})
+	)
+	*csum = sum;
+	return bytes;
+}
+EXPORT_SYMBOL(csum_and_copy_to_iter);
+
 int iov_iter_npages(const struct iov_iter *i, int maxpages)
 {
 	size_t size = i->count;
-- 
2.1.3


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC][PATCH 12/13] new helper: iov_iter_kvec()
  2014-12-04 20:20 [RFC][PATCHES] iov_iter.c rewrite Al Viro
                   ` (10 preceding siblings ...)
  2014-12-04 20:23 ` [RFC][PATCH 11/13] csum_and_copy_..._iter() Al Viro
@ 2014-12-04 20:23 ` Al Viro
  2014-12-04 20:23 ` [RFC][PATCH 13/13] copy_from_iter_nocache() Al Viro
  2014-12-08 16:46 ` [RFC][PATCHES] iov_iter.c rewrite Kirill A. Shutemov
  13 siblings, 0 replies; 48+ messages in thread
From: Al Viro @ 2014-12-04 20:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel, netdev

From: Al Viro <viro@zeniv.linux.org.uk>

initialization of kvec-backed iov_iter

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 include/linux/uio.h |  2 ++
 mm/iov_iter.c       | 13 +++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 28ed2d9..c567655 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -87,6 +87,8 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
 void iov_iter_init(struct iov_iter *i, int direction, const struct iovec *iov,
 			unsigned long nr_segs, size_t count);
+void iov_iter_kvec(struct iov_iter *i, int direction, const struct kvec *iov,
+			unsigned long nr_segs, size_t count);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 5a3b4a8..7c04051 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -479,6 +479,19 @@ size_t iov_iter_single_seg_count(const struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_single_seg_count);
 
+void iov_iter_kvec(struct iov_iter *i, int direction,
+			const struct kvec *iov, unsigned long nr_segs,
+			size_t count)
+{
+	BUG_ON(!(direction & ITER_KVEC));
+	i->type = direction;
+	i->kvec = (struct kvec *)iov;
+	i->nr_segs = nr_segs;
+	i->iov_offset = 0;
+	i->count = count;
+}
+EXPORT_SYMBOL(iov_iter_kvec);
+
 unsigned long iov_iter_alignment(const struct iov_iter *i)
 {
 	unsigned long res = 0;
-- 
2.1.3


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC][PATCH 13/13] copy_from_iter_nocache()
  2014-12-04 20:20 [RFC][PATCHES] iov_iter.c rewrite Al Viro
                   ` (11 preceding siblings ...)
  2014-12-04 20:23 ` [RFC][PATCH 12/13] new helper: iov_iter_kvec() Al Viro
@ 2014-12-04 20:23 ` Al Viro
  2014-12-08 16:46 ` [RFC][PATCHES] iov_iter.c rewrite Kirill A. Shutemov
  13 siblings, 0 replies; 48+ messages in thread
From: Al Viro @ 2014-12-04 20:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel, netdev

From: Al Viro <viro@zeniv.linux.org.uk>

BTW, do we want memcpy_nocache()?

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 include/linux/uio.h |  1 +
 mm/iov_iter.c       | 21 +++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index c567655..bd8569a 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -83,6 +83,7 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
 size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
+size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
 void iov_iter_init(struct iov_iter *i, int direction, const struct iovec *iov,
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 7c04051..e0605c1 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -399,6 +399,27 @@ size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(copy_from_iter);
 
+size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
+{
+	char *to = addr;
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	iterate_and_advance(i, bytes, v,
+		__copy_from_user_nocache((to += v.iov_len) - v.iov_len,
+					 v.iov_base, v.iov_len),
+		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len),
+		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+	)
+
+	return bytes;
+}
+EXPORT_SYMBOL(copy_from_iter_nocache);
+
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
-- 
2.1.3


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 07/13] iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
  2014-12-04 20:23 ` [RFC][PATCH 07/13] iov_iter.c: get rid of bvec_copy_page_{to,from}_iter() Al Viro
@ 2014-12-05 12:28   ` Sergei Shtylyov
  0 siblings, 0 replies; 48+ messages in thread
From: Sergei Shtylyov @ 2014-12-05 12:28 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds; +Cc: linux-kernel, linux-fsdevel, netdev

Hello.

On 12/4/2014 11:23 PM, Al Viro wrote:

    Looks like you've erred in the subject as you seem to be getting rid of 
copy_page_{to|from}_iter_bvec() instead of bvec_copy_page_{to|from}_iter()...

> From: Al Viro <viro@zeniv.linux.org.uk>

> Just have copy_page_{to,from}_iter() fall back to kmap_atomic +
> copy_{to,from}_iter() + kunmap_atomic() in ITER_BVEC case.  As
> a matter of fact, that's what we want to do for any iov_iter
> kind that isn't blocking - e.g. ITER_KVEC will also go that way
> once we recognize it at the iov_iter.c primitives level.

> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>   mm/iov_iter.c | 60 ++++++++++++++++++++++++-----------------------------------
>   1 file changed, 24 insertions(+), 36 deletions(-)

> diff --git a/mm/iov_iter.c b/mm/iov_iter.c
> index 39ad713..17b7144 100644
> --- a/mm/iov_iter.c
> +++ b/mm/iov_iter.c
> @@ -486,30 +486,33 @@ static size_t copy_from_iter_bvec(void *to, size_t bytes, struct iov_iter *i)
>   	return wanted;
>   }
>
> -static size_t copy_page_to_iter_bvec(struct page *page, size_t offset,
> -					size_t bytes, struct iov_iter *i)
> +size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
>   {
> -	void *kaddr = kmap_atomic(page);
> -	size_t wanted = copy_to_iter_bvec(kaddr + offset, bytes, i);
> -	kunmap_atomic(kaddr);
> -	return wanted;
> +	if (i->type & ITER_BVEC)
> +		return copy_to_iter_bvec(addr, bytes, i);
> +	else
> +		return copy_to_iter_iovec(addr, bytes, i);
>   }
> +EXPORT_SYMBOL(copy_to_iter);
>
> -static size_t copy_page_from_iter_bvec(struct page *page, size_t offset,
> -					size_t bytes, struct iov_iter *i)
> +size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
>   {
> -	void *kaddr = kmap_atomic(page);
> -	size_t wanted = copy_from_iter_bvec(kaddr + offset, bytes, i);
> -	kunmap_atomic(kaddr);
> -	return wanted;
> +	if (i->type & ITER_BVEC)
> +		return copy_from_iter_bvec(addr, bytes, i);
> +	else
> +		return copy_from_iter_iovec(addr, bytes, i);
>   }
> +EXPORT_SYMBOL(copy_from_iter);
[...]

WBR, Sergei


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-04 20:20 [RFC][PATCHES] iov_iter.c rewrite Al Viro
                   ` (12 preceding siblings ...)
  2014-12-04 20:23 ` [RFC][PATCH 13/13] copy_from_iter_nocache() Al Viro
@ 2014-12-08 16:46 ` Kirill A. Shutemov
  2014-12-08 17:58   ` Al Viro
  2014-12-08 18:07   ` Linus Torvalds
  13 siblings, 2 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2014-12-08 16:46 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel, netdev

On Thu, Dec 04, 2014 at 08:20:11PM +0000, Al Viro wrote:
> 	First of all, I want to apologize for the nastiness of preprocessor
> use in this series.  Seeing that the whole "macros that look like new kinds
> of C statements" thing (including list_for_each_...(), etc) is very much not
> to my liking, I really don't trust my taste on finer details and I'd very
> much like some feedback.
> 
> 	The reason for doing that kind of tricks is that iov_iter.c keeps
> growing more and more boilerplate code.  For iov_iter-net series we need
> 	* csum_and_copy_from_iter()
> 	* csum_and_copy_to_iter()
> 	* copy_from_iter_nocache()
> That's 3 new primitives, each in 2 variants (iovec and bvec).
> 	* ITER_KVEC handled without going through uaccess.h stuff (and
> independent of set_fs() state).
> And *that* means 3 variants instead of 2 for most of the existing primitives.
> That's far too much, and the amount of copies of the same logic would pretty
> much guarantee that it will be a breeding ground for hard-to-kill bugs.
> 
> 	The following series (also in vfs.git#iov_iter) actually manages to
> do all of the above *and* shrink the damn thing quite a bit.  The generated
> code appears to be no worse than before.  The price is a couple of iterator
> macros - iterate_all_kinds() and iterate_and_advance().  They are given an
> iov_iter, size (i.e. the amount of data in iov_iter beginning we want to go
> through), name of the loop variable and 3 variants of loop body - for iovec,
> bvec and kvec resp.  Loop variable is declared *inside* the expansion of those
> suckers according to the kind of iov_iter - it's struct iovec, struct bio_vec
> or struct kvec, covering the current range to deal with.
> 	The difference between those two is that iterate_and_advance() will
> advance the iov_iter by the amount it has handled and iterate_all_kinds()
> will leave iov_iter unchanged.
> 
> 	Unless I hear anybody yelling, it goes into vfs.git#for-next today,
> so if you have objections, suggestions, etc., give those *now*.
> 
> Al Viro (13):
>       iov_iter.c: macros for iterating over iov_iter
>       iov_iter.c: iterate_and_advance
>       iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
>       iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
>       iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
>       iov_iter.c: convert iov_iter_zero() to iterate_and_advance
>       iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
>       iov_iter.c: convert copy_from_iter() to iterate_and_advance
>       iov_iter.c: convert copy_to_iter() to iterate_and_advance
>       iov_iter.c: handle ITER_KVEC directly
>       csum_and_copy_..._iter()
>       new helper: iov_iter_kvec()
>       copy_from_iter_nocache()

I guess this crash is related to the patchset.

[  102.337742] ------------[ cut here ]------------
[  102.338270] kernel BUG at /home/kas/git/public/linux-next/arch/x86/mm/physaddr.c:26!
[  102.339043] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[  102.339622] Modules linked in:
[  102.339951] CPU: 2 PID: 6029 Comm: trinity-c23 Not tainted 3.18.0-next-20141208-00036-gc7edb4791544-dirty #269
[  102.340011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[  102.340011] task: ffff880041c51510 ti: ffff880049c70000 task.ti: ffff880049c70000
[  102.340011] RIP: 0010:[<ffffffff8104d8c0>]  [<ffffffff8104d8c0>] __phys_addr+0x40/0x60
[  102.340011] RSP: 0018:ffff880049c73838  EFLAGS: 00010206
[  102.340011] RAX: 00004100174b4000 RBX: ffff880049c73b08 RCX: 0000000000000028
[  102.340011] RDX: 0000000000000041 RSI: ffff88015dc980a8 RDI: ffffc900174b4000
[  102.340011] RBP: ffff880049c73838 R08: ffffc900174b4000 R09: 000000000000000c
[  102.340011] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88015dc980a8
[  102.340011] R13: ffffc900174f4000 R14: ffffea0000000000 R15: ffffc900174b4000
[  102.340011] FS:  00007fcdd37fb700(0000) GS:ffff88017aa00000(0000) knlGS:0000000000000000
[  102.340011] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  102.340011] CR2: 00000000006246a8 CR3: 0000000068dbb000 CR4: 00000000001406e0
[  102.340011] Stack:
[  102.340011]  ffff880049c73888 ffffffff8117a96b 0000000000000002 0000000000040000
[  102.340011]  ffffffff81204b3f ffff88015dc98028 0000000000000000 ffff880049c73a78
[  102.340011]  ffff88015dc98000 ffff880049c73a10 ffff880049c73978 ffffffff81203ec9
[  102.340011] Call Trace:
[  102.340011]  [<ffffffff8117a96b>] iov_iter_get_pages+0x17b/0x390
[  102.340011]  [<ffffffff81204b3f>] ? __blockdev_direct_IO+0x15f/0x16a0
[  102.340011]  [<ffffffff81203ec9>] do_direct_IO+0x10a9/0x1bc0
[  102.340011]  [<ffffffff810a92d8>] ? lockdep_init_map+0x68/0x5c0
[  102.340011]  [<ffffffff81204d6c>] __blockdev_direct_IO+0x38c/0x16a0
[  102.340011]  [<ffffffff810a9d27>] ? __lock_acquire+0x4f7/0xd40
[  102.340011]  [<ffffffff810a73f9>] ? mark_held_locks+0x79/0xb0
[  102.340011]  [<ffffffff8133b510>] ? xfs_get_blocks+0x20/0x20
[  102.340011]  [<ffffffff81338ff0>] xfs_vm_direct_IO+0x120/0x140
[  102.340011]  [<ffffffff8133b510>] ? xfs_get_blocks+0x20/0x20
[  102.340011]  [<ffffffff810aa773>] ? lock_release_non_nested+0x203/0x3d0
[  102.340011]  [<ffffffff8135b9a7>] ? xfs_ilock+0x167/0x2e0
[  102.340011]  [<ffffffff8114d957>] generic_file_read_iter+0x517/0x6a0
[  102.340011]  [<ffffffff810a73f9>] ? mark_held_locks+0x79/0xb0
[  102.340011]  [<ffffffff8185e192>] ? __mutex_unlock_slowpath+0xb2/0x190
[  102.340011]  [<ffffffff8134bc2f>] xfs_file_read_iter+0x12f/0x460
[  102.340011]  [<ffffffff811c237e>] new_sync_read+0x7e/0xb0
[  102.340011]  [<ffffffff811c3528>] __vfs_read+0x18/0x50
[  102.340011]  [<ffffffff811c35ed>] vfs_read+0x8d/0x150
[  102.340011]  [<ffffffff811c89e8>] kernel_read+0x48/0x60
[  102.340011]  [<ffffffff810f4a92>] copy_module_from_fd.isra.51+0x112/0x170
[  102.340011]  [<ffffffff810f7646>] SyS_finit_module+0x56/0x80
[  102.340011]  [<ffffffff81861f92>] system_call_fastpath+0x12/0x17
[  102.340011] Code: 00 00 00 00 00 78 00 00 48 01 f8 48 39 c2 72 1b 0f b6 0d 9d 7a ec 00 48 89 c2 48 d3 ea 48 85 d2 75 09 5d c3 0f 1f 80 00 00 00 00 <0f> 0b 66 0f 1f 44 00 00 48 89 d0 48 03 05 3e 87 dc 00 48 81 fa 
[  102.340011] RIP  [<ffffffff8104d8c0>] __phys_addr+0x40/0x60
[  102.340011]  RSP <ffff880049c73838>
[  102.371939] ---[ end trace e07368268cd6b49c ]---


-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 16:46 ` [RFC][PATCHES] iov_iter.c rewrite Kirill A. Shutemov
@ 2014-12-08 17:58   ` Al Viro
  2014-12-08 18:08     ` Al Viro
  2014-12-08 18:56     ` [RFC][PATCHES] iov_iter.c rewrite Kirill A. Shutemov
  2014-12-08 18:07   ` Linus Torvalds
  1 sibling, 2 replies; 48+ messages in thread
From: Al Viro @ 2014-12-08 17:58 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel, netdev

On Mon, Dec 08, 2014 at 06:46:50PM +0200, Kirill A. Shutemov wrote:

> I guess this crash is related to the patchset.

Might be.  Do you have a reproducer for it?

It looks like the second VIRTUAL_BUG_ON() in __phys_addr(), most likely
from __pa(), from virt_to_page(), from
                unsigned long addr = (unsigned long)v.iov_base, end;
                size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));

                if (len > maxpages * PAGE_SIZE)
                        len = maxpages * PAGE_SIZE;
                addr &= ~(PAGE_SIZE - 1);
                for (end = addr + len; addr < end; addr += PAGE_SIZE)
                        get_page(*pages++ = virt_to_page(addr));
                return len - *start;
in iov_iter_get_pages().  And that's ITER_KVEC case there...  Further
call chain looks like dio_refill_pages(), from dio_get_page(), from
do_direct_io(), eventually from kernel_read() and finit_module(),
Presumably called on O_DIRECT descriptor...

I'll try to reproduce it here, but if you have any reliable reproducer, it
would be very welcome (and would make a useful addition to LTP and/or
xfstests).

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 16:46 ` [RFC][PATCHES] iov_iter.c rewrite Kirill A. Shutemov
  2014-12-08 17:58   ` Al Viro
@ 2014-12-08 18:07   ` Linus Torvalds
  2014-12-08 18:14     ` Al Viro
  1 sibling, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2014-12-08 18:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Al Viro, Linux Kernel Mailing List, linux-fsdevel, Network Development

On Mon, Dec 8, 2014 at 8:46 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> I guess this crash is related to the patchset.

Sounds likely.

> [  102.338270] kernel BUG at /home/kas/git/public/linux-next/arch/x86/mm/physaddr.c:26!

So that's

                VIRTUAL_BUG_ON((x > y) || !phys_addr_valid(x));

and the code disassembles to:

   0: 48 01 f8             add    %rdi,%rax
   3: 48 39 c2             cmp    %rax,%rdx
   6: 72 1b                 jb     0x23
   8: 0f b6 0d 9d 7a ec 00 movzbl 0xec7a9d(%rip),%ecx        # 0xec7aac
   f: 48 89 c2             mov    %rax,%rdx
  12: 48 d3 ea             shr    %cl,%rdx
  15: 48 85 d2             test   %rdx,%rdx
  18: 75 09                 jne    0x23
  1a: 5d                   pop    %rbp
  1b: c3                   retq
  1c: 0f 1f 80 00 00 00 00 nopl   0x0(%rax)
  23:* 0f 0b                 ud2     <-- trapping instruction

with the relevant registers being

> [  102.340011] RAX: 00004100174b4000 RBX: ffff880049c73b08 RCX: 0000000000000028
> [  102.340011] RDX: 0000000000000041 RSI: ffff88015dc980a8 RDI: ffffc900174b4000

so we've taken the second case (the %rcx value is
"boot_cpu_data.x86_phys_bits", which is that "movzbl", and the %rdx
value is the shifted value of %rax).

So %rax seems to contain 'x' at that point, which means that 'y' should be

   x - (__START_KERNEL_map - PAGE_OFFSET)

which means that the _original_ address should be that plus
__START_KERNEL_map, ie just x + PAGE_OFFSET.

So it smells like the original virtual address was that
ffffc900174b4000 that we still find in %rdi.

Which is in the vmalloc address space. So somebody used a vmalloc'ed
address and tried to convert it to a physical address in order to look
up the page.

Which is not a valid operation, and the BUG_ON() is definitely proper.

Now *why* something tried to do a virt_to_page() on a vmalloc'ed
address, that I leave to others.

                        Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 17:58   ` Al Viro
@ 2014-12-08 18:08     ` Al Viro
  2014-12-08 18:14       ` Linus Torvalds
  2014-12-08 18:56     ` [RFC][PATCHES] iov_iter.c rewrite Kirill A. Shutemov
  1 sibling, 1 reply; 48+ messages in thread
From: Al Viro @ 2014-12-08 18:08 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel, netdev

On Mon, Dec 08, 2014 at 05:58:05PM +0000, Al Viro wrote:
> It looks like the second VIRTUAL_BUG_ON() in __phys_addr(), most likely
> from __pa(), from virt_to_page(), from
>                 unsigned long addr = (unsigned long)v.iov_base, end;
>                 size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
> 
>                 if (len > maxpages * PAGE_SIZE)
>                         len = maxpages * PAGE_SIZE;
>                 addr &= ~(PAGE_SIZE - 1);
>                 for (end = addr + len; addr < end; addr += PAGE_SIZE)
>                         get_page(*pages++ = virt_to_page(addr));
>                 return len - *start;
> in iov_iter_get_pages().  And that's ITER_KVEC case there...  Further
> call chain looks like dio_refill_pages(), from dio_get_page(), from
> do_direct_io(), eventually from kernel_read() and finit_module(),
> Presumably called on O_DIRECT descriptor...

FWIW, virt_to_page() is probably not OK to call on an address in the
middle of vmalloc'ed area, is it?  Would
		for (end = addr + len; addr < end; addr += PAGE_SIZE) {
			if (is_vmalloc_addr((void *)addr))
				ACCESS_ONCE(*(char *)addr);
			get_page(*pages++ = virt_to_page(addr));
		}
be a safe replacement for the loop in the above?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 18:07   ` Linus Torvalds
@ 2014-12-08 18:14     ` Al Viro
  2014-12-08 18:23       ` Linus Torvalds
  0 siblings, 1 reply; 48+ messages in thread
From: Al Viro @ 2014-12-08 18:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Mon, Dec 08, 2014 at 10:07:55AM -0800, Linus Torvalds wrote:

> Which is in the vmalloc address space. So somebody used a vmalloc'ed
> address and tried to convert it to a physical address in order to look
> up the page.
> 
> Which is not a valid operation, and the BUG_ON() is definitely proper.
> 
> Now *why* something tried to do a virt_to_page() on a vmalloc'ed
> address, that I leave to others.

iov_iter_get_pages() in ITER_KVEC case, trying to avoid get_user_pages_fast()
and getting it wrong.  FWIW, the reproducer is finit_module(fd, ....)
where fd has been opened with O_DIRECT.  In that case we get kernel_read()
on O_DIRECT and the buffer has just been vmalloc'ed.

What's the sane way to grab struct page * for a vmalloc'ed address?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 18:08     ` Al Viro
@ 2014-12-08 18:14       ` Linus Torvalds
  2014-12-08 18:20         ` Al Viro
  0 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2014-12-08 18:14 UTC (permalink / raw)
  To: Al Viro
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Mon, Dec 8, 2014 at 10:08 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, virt_to_page() is probably not OK to call on an address in the
> middle of vmalloc'ed area, is it?

See my email that crossed yours. No it is not.

>  Would
>                 for (end = addr + len; addr < end; addr += PAGE_SIZE) {
>                         if (is_vmalloc_addr(addr))
>                                 ACCESS_ONCE(*(char *)addr);
>                         get_page(*pages++ = virt_to_page(addr));
>                 }
> be a safe replacement for the loop in the above?

No. That "ACCESS_ONCE()" does nothing. It reads a byte from 'addr' in
the vmalloc space, and might cause a page fault to make sure it's
mapped, but that is still a no-op.

You can't do "virt_to_page()" on anything but the normal 1:1 kernel
mappings (and only for non-highmem pages at that).

For a vmalloc() address, you'd have to actually walk the page tables.
Which is a f*cking horrible idea. Don't do it. We do have a
"vmalloc_to_page()" that does it, but the basic issue is that you damn
well shouldn't do IO on vmalloc'ed addresses.  vmalloc'ed addresses
only exist in the first place to give a linear *virtual* mapping, if
you want physical pages you shouldn't have mixed it up with vmalloc in
the first place!

Where the hell does this crop up, and who does this insane thing
anyway? It's wrong. How did it ever work before?

                     Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 18:14       ` Linus Torvalds
@ 2014-12-08 18:20         ` Al Viro
  2014-12-08 18:37           ` Linus Torvalds
  0 siblings, 1 reply; 48+ messages in thread
From: Al Viro @ 2014-12-08 18:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Mon, Dec 08, 2014 at 10:14:13AM -0800, Linus Torvalds wrote:

> For a vmalloc() address, you'd have to actually walk the page tables.
> Which is a f*cking horrible idea. Don't do it. We do have a
> "vmalloc_to_page()" that does it, but the basic issue is that you damn
> well shouldn't do IO on vmalloc'ed addresses.  vmalloc'ed addresses
> only exist in the first place to give a linear *virtual* mapping, if
> you want physical pages you shouldn't have mixed it up with vmalloc in
> the first place!
> 
> Where the hell does this crop up, and who does this insane thing
> anyway? It's wrong. How did it ever work before?

finit_module() with O_DIRECT descriptor.  And I suspect that "not well"
is the answer - it used to call get_user_pages_fast() in that case.

I certainly had missed that insanity during the analysis - we don't do
a lot of O_DIRECT IO to/from kernel addresses of any sort...  This
codepath allows it ;-/  Ability to trigger it is equivalent to ability
to run any code in kernel mode, so it's not an additional security hole,
but...

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 18:14     ` Al Viro
@ 2014-12-08 18:23       ` Linus Torvalds
  2014-12-08 18:35         ` Al Viro
  0 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2014-12-08 18:23 UTC (permalink / raw)
  To: Al Viro
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Mon, Dec 8, 2014 at 10:14 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> iov_iter_get_pages() in ITER_KVEC case, trying to avoid get_user_pages_fast()
> and getting it wrong.  FWIW, the reproducer is finit_module(fd, ....)
> where fd has been opened with O_DIRECT.  In that case we get kernel_read()
> on O_DIRECT and the buffer has just been vmalloc'ed.

Ugh. That's horrid. Do we need to even support O_DIRECT in that case?
In general, we should *not* do IO on vmalloc'ed areas, although at
least the non-O_DIRECT case where we just memcpy() it as if it came
from user space is much better.

Did this actually use to work? Or is it an issue of "the new iov_iter
is so generic that something that used to just return an error now
'works' and triggers the problem"?

> What's the sane way to grab struct page * for a vmalloc'ed address?

So "vmalloc_to_page()" should work.

However, it's actually fundamentally racy unless you can guarantee
that the vmalloc()'ed area in question is stable (so you had better
have done that allocation yourself, and be in control of the freeing,
rather than "we look up random vmalloc'ed addresses").

In general, it's really a horrible thing to use, and tends to be a big
red sign that "somebody misdesigned this badly"

                    Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 18:23       ` Linus Torvalds
@ 2014-12-08 18:35         ` Al Viro
  0 siblings, 0 replies; 48+ messages in thread
From: Al Viro @ 2014-12-08 18:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Mon, Dec 08, 2014 at 10:23:26AM -0800, Linus Torvalds wrote:

> Did this actually use to work? Or is it an issue of "the new iov_iter
> is so generic that something that used to just return an error now
> 'works' and triggers the problem"?

Looks like it failed with EINVAL.  Which might very well be the sane
reaction - if we run into a vmalloc/module address, act as if we failed
to get that page and exit the loop.

> > What's the sane way to grab struct page * for a vmalloc'ed address?
> 
> So "vmalloc_to_page()" should work.
> 
> However, it's actually fundamentally racy unless you can guarantee
> that the vmalloc()'ed area in question is stable (so you had better
> have done that allocation yourself, and be in control of the freeing,
> rather than "we look up random vmalloc'ed addresses).

If vfree(buffer) races with kernel_read() into buffer, we are so badly
fucked that stability of pointers to pages is the least of our concerns...

> In general, it's really a horrible thing to use, and tends to be a big
> red sign that "somebody misdesigned this badly"

More like "nobody has thought of that case", at a guess, but then I hadn't
been involved in finit_module() design - I don't even remember the discussions
around it.  That would be what, something circa 3.7?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 18:20         ` Al Viro
@ 2014-12-08 18:37           ` Linus Torvalds
  2014-12-08 18:46             ` Al Viro
  0 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2014-12-08 18:37 UTC (permalink / raw)
  To: Al Viro
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Mon, Dec 8, 2014 at 10:20 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> I certainly had missed that insanity during the analysis - we don't do
> a lot of O_DIRECT IO to/from kernel addresses of any sort...  This
> codepath allows it ;-/  Ability to trigger it is equivalent to ability
> to run any code in kernel mode, so it's not an additional security hole,
> but...

Is there any chance we could just return EINVAL for this case?

Who does O_DIRECT on module load anyway? If this is only for
finit_module(), that uses "kernel_read()", and maybe we could just
make sure that the kernel_read() function never ever uses the
direct-IO paths?

[ Time passes, I look at the code ]

Oh crap. So the reason it triggers seems to be that we basically get a
random file descriptor that we didn't open, and then we have

  vfs_read() ->
  xfs_file_operations->read() ->
  new_sync_read() ->
  xfs_file_operations->read_iter() ->
  xfs_file_read_iter()

and we are stuck with this iterator that really just wants to do copies.

How about we make "kernel_read()" just clear O_DIRECT? Does that fix
it to just use copies?

                      Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 18:37           ` Linus Torvalds
@ 2014-12-08 18:46             ` Al Viro
  2014-12-08 18:57               ` Linus Torvalds
  0 siblings, 1 reply; 48+ messages in thread
From: Al Viro @ 2014-12-08 18:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Mon, Dec 08, 2014 at 10:37:51AM -0800, Linus Torvalds wrote:

> How about we make "kernel_read()" just clear O_DIRECT? Does that fix
> it to just use copies?

Umm...  clearing O_DIRECT on struct file that might have other references
to it isn't nice, to put it mildly...

Frankly, stopping iov_iter_get_pages() on the first page in vmalloc/module
space looks like the sanest strategy anyway.  We'd get the same behaviour
as we used to, and as for finit_module(2), well... put "fails if given
an O_DIRECT descriptor" and be done with that.

Alternatively, we can really go for
	page = is_vmalloc_or_module_addr((void *)addr) ? vmalloc_to_page((void *)addr)
						       : virt_to_page(addr);
	get_page(*pages++ = page);
and have the fucker work.  Stability is up to the caller, of course -
reading into buffer that might get freed (and reused) under you has much
worse problems...

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 17:58   ` Al Viro
  2014-12-08 18:08     ` Al Viro
@ 2014-12-08 18:56     ` Kirill A. Shutemov
  2014-12-08 19:01       ` Linus Torvalds
  1 sibling, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2014-12-08 18:56 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel, netdev

On Mon, Dec 08, 2014 at 05:58:05PM +0000, Al Viro wrote:
> On Mon, Dec 08, 2014 at 06:46:50PM +0200, Kirill A. Shutemov wrote:
> 
> > I guess this crash is related to the patchset.
> 
> Might be.  Do you have a reproducer for it?

trinity triggers it for me in a few minutes. I will try to find out more once
I get some time.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 18:46             ` Al Viro
@ 2014-12-08 18:57               ` Linus Torvalds
  2014-12-08 19:28                 ` Al Viro
  2015-04-01  2:33                 ` [RFC] iov_iter_get_pages() semantics Al Viro
  0 siblings, 2 replies; 48+ messages in thread
From: Linus Torvalds @ 2014-12-08 18:57 UTC (permalink / raw)
  To: Al Viro
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Mon, Dec 8, 2014 at 10:46 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Dec 08, 2014 at 10:37:51AM -0800, Linus Torvalds wrote:
>
>> How about we make "kernel_read()" just clear O_DIRECT? Does that fix
>> it to just use copies?
>
> Umm...  clearing O_DIRECT on struct file that might have other references
> to it isn't nice, to put it mildly...

Yeah.

> Frankly, stopping iov_iter_get_pages() on the first page in vmalloc/module
> space looks like the sanest strategy anyway.  We'd get the same behaviour
> as we used to, and as for finit_module(2), well... put "fails if given
> an O_DIRECT descriptor" and be done with that.

If it used to fail, then by all means, just make this a failure in the
new model. I really don't want to make core infrastructure silently
just call vmalloc_to_page() to make things "work".

And if it used to do "get_user_pages_fast()" then the old code really
didn't work on vmalloc ranges anyway, since that one checks for not
just _PAGE_PRESENT but also _PAGE_USER, which won't be set on a
vmalloc() page. In fact, it should have failed on *all* kernel pages.

> Alternatively, we can really go for
>         page = is_vmalloc_or_module_addr(addr) ? vmalloc_to_page(addr)
>                                                 : virt_to_page(addr)
>         *pages++ = get_page(page);

actually, no we cannot. Thinking some more about it, that
"get_page(page)" is wrong in _all_ cases. It actually works better for
vmalloc pages than for normal 1:1 pages, since it's actually seriously
and *horrendously* wrong for the case of random kernel addresses which
may not even be refcounted to begin with.

So the whole "get_page()" thing is broken. Iterating over pages in a
KVEC is simply wrong, wrong, wrong. It needs to fail.

Iterating over a KVEC to *copy* data is ok. But no page lookup stuff
or page reference things.

The old code that apparently used "get_user_pages_fast()" was ok
almost by mistake, because it fails on all kernel pages.

                       Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 18:56     ` [RFC][PATCHES] iov_iter.c rewrite Kirill A. Shutemov
@ 2014-12-08 19:01       ` Linus Torvalds
  2014-12-08 19:15         ` Dave Jones
  2014-12-08 19:23         ` Kirill A. Shutemov
  0 siblings, 2 replies; 48+ messages in thread
From: Linus Torvalds @ 2014-12-08 19:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Al Viro, Linux Kernel Mailing List, linux-fsdevel, Network Development

On Mon, Dec 8, 2014 at 10:56 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> trinity triggers it for me in few minutes. I will try find out more once
> get some time.

You run trinity as *root*?

You're a brave man. Stupid, but brave ;)

I guess you're running it in a VM, but still.. Doing random system
calls as root sounds like a bad bad idea.

                      Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 19:01       ` Linus Torvalds
@ 2014-12-08 19:15         ` Dave Jones
  2014-12-08 19:23         ` Kirill A. Shutemov
  1 sibling, 0 replies; 48+ messages in thread
From: Dave Jones @ 2014-12-08 19:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Al Viro, Linux Kernel Mailing List,
	linux-fsdevel, Network Development

On Mon, Dec 08, 2014 at 11:01:41AM -0800, Linus Torvalds wrote:
 > On Mon, Dec 8, 2014 at 10:56 AM, Kirill A. Shutemov
 > <kirill@shutemov.name> wrote:
 > >
 > > trinity triggers it for me in few minutes. I will try find out more once
 > > get some time.
 > 
 > You run trinity as *root*?
 > 
 > You're a brave man. Stupid, but brave ;)
 > 
 > I guess you're running it in a VM, but still.. Doing random system
 > calls as root sounds like a bad bad idea.

I've flip-flopped on this a few times.  I used to be solidly in the same
position as your statement, but after seeing the things the secure-boot
crowd want to lock down, there are a ton of places in the kernel that
would need additional root-proofing to avoid scribbling over kernel
memory.

In short though, yeah, expect fireworks right now, especially on bare-metal.

At the same time, just to increase coverage testing of a lot of
root-required functionality (like various network sockets that can't be
opened as a regular user) I added a --drop-privs mode to trinity a while
ago, so after the socket creation, it can't do anything _too_ crazy.

	Dave

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 19:01       ` Linus Torvalds
  2014-12-08 19:15         ` Dave Jones
@ 2014-12-08 19:23         ` Kirill A. Shutemov
  2014-12-08 22:14           ` Theodore Ts'o
  1 sibling, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2014-12-08 19:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Linux Kernel Mailing List, linux-fsdevel, Network Development

On Mon, Dec 08, 2014 at 11:01:41AM -0800, Linus Torvalds wrote:
> On Mon, Dec 8, 2014 at 10:56 AM, Kirill A. Shutemov
> <kirill@shutemov.name> wrote:
> >
> > trinity triggers it for me in few minutes. I will try find out more once
> > get some time.
> 
> You run trinity as *root*?
> 
> You're a brave man. Stupid, but brave ;)
> 
> I guess you're running it in a VM, but still.. Doing random system
> calls as root sounds like a bad bad idea.

I've been doing this for a long time and haven't seen any problem bigger than
a qemu crash[1]. ;)

[1] http://thread.gmane.org/gmane.comp.emulators.qemu/194845/

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 18:57               ` Linus Torvalds
@ 2014-12-08 19:28                 ` Al Viro
  2014-12-08 19:48                   ` Linus Torvalds
  2014-12-09  1:56                   ` Al Viro
  2015-04-01  2:33                 ` [RFC] iov_iter_get_pages() semantics Al Viro
  1 sibling, 2 replies; 48+ messages in thread
From: Al Viro @ 2014-12-08 19:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Mon, Dec 08, 2014 at 10:57:31AM -0800, Linus Torvalds wrote:
> So the whole "get_page()" thing is broken. Iterating over pages in a
> KVEC is simply wrong, wrong, wrong. It needs to fail.

Well, _that_ is easy to do, of course...  E.g. replacing that thing in
iov_iter_get_pages()/iov_iter_get_pages_alloc() with ({ return -EFAULT; })
will do it.

> Iterating over a KVEC to *copy* data is ok. But no page lookup stuff
> or page reference things.
> 
> The old code that apparently used "get_user_pages_fast()" was ok
> almost by mistake, because it fails on all kernel pages.

On x86 it does, but I don't see anything obvious in the generic version in
mm/gup.c, so the old code might still have a problem on some architectures.
What am I missing here?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 19:28                 ` Al Viro
@ 2014-12-08 19:48                   ` Linus Torvalds
  2014-12-09  1:56                   ` Al Viro
  1 sibling, 0 replies; 48+ messages in thread
From: Linus Torvalds @ 2014-12-08 19:48 UTC (permalink / raw)
  To: Al Viro
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development, linux-arch

On Mon, Dec 8, 2014 at 11:28 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On x86 it does, but I don't see anything obvious in generic version in
> mm/gup.c, so the old code might still have a problem on some architectures.
> What am I missing here?

Hmm. You may be right. The "access_ok()" is supposed to protect
things, but for cases like finit_module() that has explicitly said
"kernel addresses are ok", that check doesn't work.

Maybe  something like this..

    diff --git a/mm/gup.c b/mm/gup.c
    index cd62c8c90d4a..6234b1e6ced9 100644
    --- a/mm/gup.c
    +++ b/mm/gup.c
    @@ -951,6 +951,9 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
            len = (unsigned long) nr_pages << PAGE_SHIFT;
            end = start + len;

    +       if (unlikely(segment_eq(get_fs(), KERNEL_DS)))
    +               return 0;
    +
            if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
                                            start, len)))
                    return 0;

Completely untested, obviously. That code isn't even compiled on x86.
Adding linux-arch for more comments.

(Background: the generic non-x86 "get_user_pages_fast()" function
cannot check that the page tables are actually *user* page tables,
so..)

                      Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 19:23         ` Kirill A. Shutemov
@ 2014-12-08 22:14           ` Theodore Ts'o
  2014-12-08 22:23             ` Linus Torvalds
  0 siblings, 1 reply; 48+ messages in thread
From: Theodore Ts'o @ 2014-12-08 22:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Al Viro, Linux Kernel Mailing List,
	linux-fsdevel, Network Development

On Mon, Dec 08, 2014 at 09:23:58PM +0200, Kirill A. Shutemov wrote:
> > I guess you're running it in a VM, but still.. Doing random system
> > calls as root sounds like a bad bad idea.
> 
> I've been doing this for a long time and haven't seen any problem bigger
> than a qemu crash[1]. ;)
> 
> [1] http://thread.gmane.org/gmane.comp.emulators.qemu/194845/

If you use something like this:

   -drive file=root_fs.img,if=virtio,snapshot=on

running trinity as root should be quite safe in a VM.  :-)

		   	       	  	- Ted

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 22:14           ` Theodore Ts'o
@ 2014-12-08 22:23             ` Linus Torvalds
  2014-12-08 22:31               ` Dave Jones
  0 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2014-12-08 22:23 UTC (permalink / raw)
  To: Theodore Ts'o, Kirill A. Shutemov, Linus Torvalds, Al Viro,
	Linux Kernel Mailing List, linux-fsdevel, Network Development

On Mon, Dec 8, 2014 at 2:14 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>
> running trinity as root should be quite safe in a VM.  :-)

It's not so much the safety that I'd worry about, it's the "you can
legitimately just reboot it or cause kernel corruption as root". You
may not cause any problems outside of the VM, but any oopses inside
the VM might be due to trinity just doing bad things as root, rather
than kernel bugs..

Of course, it's probably hard to hit things like loading random
modules etc, since even without signature requirements there are tons
of ELF sanity checks and other things. So it might be hard to actually
do those kinds of "corrupt kernel memory as root" things with trinity.

                 Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 22:23             ` Linus Torvalds
@ 2014-12-08 22:31               ` Dave Jones
  0 siblings, 0 replies; 48+ messages in thread
From: Dave Jones @ 2014-12-08 22:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Ts'o, Kirill A. Shutemov, Al Viro,
	Linux Kernel Mailing List, linux-fsdevel, Network Development

On Mon, Dec 08, 2014 at 02:23:06PM -0800, Linus Torvalds wrote:
 > On Mon, Dec 8, 2014 at 2:14 PM, Theodore Ts'o <tytso@mit.edu> wrote:
 > >
 > > running trinity as root should be quite safe in a VM.  :-)
 > 
 > It's not so much the safety that I'd worry about, it's the "you can
 > legitimately just reboot it or cause kernel corruption as root". You
 > may not cause any problems outside of the VM, but any oopses inside
 > the VM might be due to trinity just doing bad things as root, rather
 > than kernel bugs..
 > 
 > Of course, it's probably hard to hit things like loading random
 > modules etc, since even without signature requirements there are tons
 > of ELF sanity checks and other things. So it might be hard to actually
 > do those kinds of "corrupt kernel memory as root" things with trinity.

It also goes out of its way to avoid doing obviously stupid things,
like using /dev/mem as an fd, plus a whole bunch of similar sysfs/procfs
knobs.  There are likely still a whole bunch of other things that
might have horrible effects too.  It's been on my todo list for a while to
revisit that particular case and blacklist a bunch of other things.

And then there's obviously "don't do this syscall, ever" or "with these
args" type things, which could use expanding.  It's a big effort, tbh.
I'm amazed that Sasha, Kirill and everyone else running it as root in
VMs aren't buried alive in false positives.

	Dave
 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-08 19:28                 ` Al Viro
  2014-12-08 19:48                   ` Linus Torvalds
@ 2014-12-09  1:56                   ` Al Viro
  2014-12-09  2:21                     ` Kirill A. Shutemov
  1 sibling, 1 reply; 48+ messages in thread
From: Al Viro @ 2014-12-09  1:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Mon, Dec 08, 2014 at 07:28:17PM +0000, Al Viro wrote:
> On Mon, Dec 08, 2014 at 10:57:31AM -0800, Linus Torvalds wrote:
> > So the whole "get_page()" thing is broken. Iterating over pages in a
> > KVEC is simply wrong, wrong, wrong. It needs to fail.
> 
Well, _that_ is easy to do, of course...  E.g. replacing that thing in
iov_iter_get_pages()/iov_iter_get_pages_alloc() with ({ return -EFAULT; })
will do it.

OK, folded and pushed (i.e. in vfs.git#iov_iter and vfs.git#for-next).
Matches earlier behaviour; oops reproducer is easy -

dd if=/dev/zero of=foo.ko bs=4096 count=1
cat >a.c <<'EOF'
#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
int main(void)
{
        int fd = open("foo.ko", 00040000);
        syscall(__NR_finit_module, fd, "", 0);
}
EOF
gcc a.c
strace ./a.out

Correct behaviour (both the mainline and this series with fixes folded):
open("foo.ko", O_RDONLY|O_DIRECT)       = 3
finit_module(3, "", 0)                  = -1 EFAULT (Bad address)
Incorrect one:
open("foo.ko", O_RDONLY|O_DIRECT)       = 3
finit_module(3, "", 0 <unfinished ...>
+++ killed by SIGKILL +++
Killed

with that oops reproduced.  Not sure if it's a proper LTP or xfstests fodder,
but anyway, here it is.

Incremental from previous for-next is
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index e0605c1..a1599ca 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -560,15 +560,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 		get_page(*pages = v.bv_page);
 		return v.bv_len;
 	}),({
-		unsigned long addr = (unsigned long)v.iov_base, end;
-		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
-
-		if (len > maxpages * PAGE_SIZE)
-			len = maxpages * PAGE_SIZE;
-		addr &= ~(PAGE_SIZE - 1);
-		for (end = addr + len; addr < end; addr += PAGE_SIZE)
-			get_page(*pages++ = virt_to_page(addr));
-		return len - *start;
+		return -EFAULT;
 	})
 	)
 	return 0;
@@ -622,18 +614,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		get_page(*p = v.bv_page);
 		return v.bv_len;
 	}),({
-		unsigned long addr = (unsigned long)v.iov_base, end;
-		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
-		int n;
-
-		addr &= ~(PAGE_SIZE - 1);
-		n = DIV_ROUND_UP(len, PAGE_SIZE);
-		*pages = p = get_pages_array(n);
-		if (!p)
-			return -ENOMEM;
-		for (end = addr + len; addr < end; addr += PAGE_SIZE)
-			get_page(*p++ = virt_to_page(addr));
-		return len - *start;
+		return -EFAULT;
 	})
 	)
 	return 0;

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCHES] iov_iter.c rewrite
  2014-12-09  1:56                   ` Al Viro
@ 2014-12-09  2:21                     ` Kirill A. Shutemov
  0 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2014-12-09  2:21 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Tue, Dec 09, 2014 at 01:56:51AM +0000, Al Viro wrote:
> On Mon, Dec 08, 2014 at 07:28:17PM +0000, Al Viro wrote:
> > On Mon, Dec 08, 2014 at 10:57:31AM -0800, Linus Torvalds wrote:
> > > So the whole "get_page()" thing is broken. Iterating over pages in a
> > > KVEC is simply wrong, wrong, wrong. It needs to fail.
> > 
> > Well, _that_ is easy to do, of course...  E.g. replacing that thing in
> > iov_iter_get_pages()/iov_iter_get_pages_alloc() with ({ return -EFAULT; })
> > will do it.
> 
> OK, folded and pushed (i.e. in vfs.git#iov_iter and vfs.git#for-next).
> Matches earlier behaviour; oops reproducer is easy -
> 
> dd if=/dev/zero of=foo.ko bs=4096 count=1
> cat >a.c <<'EOF'
> #define _GNU_SOURCE
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/stat.h>
> #include <sys/syscall.h>
> int main(void)
> {
>         int fd = open("foo.ko", 00040000);
>         syscall(__NR_finit_module, fd, "", 0);
> }
> EOF
> gcc a.c
> strace ./a.out
> 
> Correct behaviour (both the mainline and this series with fixes folded):
> open("foo.ko", O_RDONLY|O_DIRECT)       = 3
> finit_module(3, "", 0)                  = -1 EFAULT (Bad address)
> Incorrect one:
> open("foo.ko", O_RDONLY|O_DIRECT)       = 3
> finit_module(3, "", 0 <unfinished ...>
> +++ killed by SIGKILL +++
> Killed
> 
> with that oops reproduced.  Not sure if it's a proper LTP or xfstests fodder,
> but anyway, here it is.
> 
> Incremental from previous for-next is

Works for me.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC] iov_iter_get_pages() semantics
  2014-12-08 18:57               ` Linus Torvalds
  2014-12-08 19:28                 ` Al Viro
@ 2015-04-01  2:33                 ` Al Viro
  2015-04-01 16:45                   ` Linus Torvalds
  1 sibling, 1 reply; 48+ messages in thread
From: Al Viro @ 2015-04-01  2:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Mon, Dec 08, 2014 at 10:57:31AM -0800, Linus Torvalds wrote:
> actually, no we cannot. Thinking some more about it, that
> "get_page(page)" is wrong in _all_ cases. It actually works better for
> vmalloc pages than for normal 1:1 pages, since it's actually seriously
> and *horrendously* wrong for the case of random kernel addresses which
> may not even be refcounted to begin with.
> 
> So the whole "get_page()" thing is broken. Iterating over pages in a
> KVEC is simply wrong, wrong, wrong. It needs to fail.
> 
> Iterating over a KVEC to *copy* data is ok. But no page lookup stuff
> or page reference things.

Hmm...  FWIW, for ITER_KVEC the underlying data would bloody better not
go away anyway - vmalloc space or not.  Protecting the object from being
freed under us is the caller's responsibility, and the caller can guarantee that.
Would a variant that does kmap_to_page()/vmalloc_to_page() _without_
get_page() for ITER_KVEC work sanely?

Of course, that would have to be used with a matching primitive for releasing
those suckers - page_cache_release() for ITER_IOVEC (and ITER_BVEC, while
we are at it - those are backed with normal pages) and nothing for ITER_KVEC
ones.

It would make life much more pleasant for fuse and the zerocopy side of 9p - the
latter does pretty much that kind of thing anyway...

Comments?
			Al, digging himself from under a huge pile of mail...


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] iov_iter_get_pages() semantics
  2015-04-01  2:33                 ` [RFC] iov_iter_get_pages() semantics Al Viro
@ 2015-04-01 16:45                   ` Linus Torvalds
  2015-04-01 18:08                     ` Al Viro
  0 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2015-04-01 16:45 UTC (permalink / raw)
  To: Al Viro
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Tue, Mar 31, 2015 at 7:33 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> So the whole "get_page()" thing is broken. Iterating over pages in a
>> KVEC is simply wrong, wrong, wrong. It needs to fail.
>>
>> Iterating over a KVEC to *copy* data is ok. But no page lookup stuff
>> or page reference things.
>
> Hmm...  FWIW, for ITER_KVEC the underlying data would bloody better not
> go away anyway - vmalloc space or not.  Protecting the object from being
> freed under us is caller's responsibility and caller can guarantee that.
> Would a variant that does kmap_to_page()/vmalloc_to_page() _without_
> get_page() for ITER_KVEC work sanely?

No.

It's insane to iterate over 'struct page' pointers, whether you do
get_page or not.

Why?

The *only* thing you can do with those pages is to copy the content.
You must never *ever* do anything else. And if the caller only copies
the content, then the caller is *wrong* to use the page-iterators.

It really is that simple. Either the caller cares about just the
content - in which case it should use the normal "iterate over
addresses" - or the caller somehow cares about 'struct page' itself,
in which case it is *wrong* to do it over vmalloc space or random
kernel mappings.

You cannot have it both ways.

In fact, even _if_ the caller just does a "kmap()" and then looks at
the contents, it is very questionable to allow it for kernel or
vmalloc data. Why? Because we actually do things like mark vmalloc
areas read-only for module code etc, and using kmap() on the pages is
just a way for bad users to more easily overcome things like that by
mistake.

So it's simply wrong in so many ways. There is absolutely *zero*
excuse to look at "struct page" for random kernel data. You had better
get the struct page some valid way - either by explicitly allocating
the page as such, or by looking it up from a *valid* source (ie
looking up a page range from a user mapping).

                       Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] iov_iter_get_pages() semantics
  2015-04-01 16:45                   ` Linus Torvalds
@ 2015-04-01 18:08                     ` Al Viro
  2015-04-01 18:15                       ` Linus Torvalds
  2015-04-01 18:26                       ` Linus Torvalds
  0 siblings, 2 replies; 48+ messages in thread
From: Al Viro @ 2015-04-01 18:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Wed, Apr 01, 2015 at 09:45:27AM -0700, Linus Torvalds wrote:

> The *only* thing you can do with those pages is to copy the content.
> You must never *ever* do anything else. And if the caller only copies
> the content, then the caller is *wrong* to use the page-iterators.
> 
> It really is that simple. Either the caller cares about just the
> content - in which case it should use the normal "iterate over
> addresses" - or the caller somehow cares about 'struct page' itself,
> in which case it is *wrong* to do it over vmalloc space or random
> kernel mappings.

... or the caller wants an sg_table.  When net/9p's p9_client_write() is called
with virtio for underlying transport (and the amount of data is large enough)
it allocates two buffers (for request header and response resp.), builds
header in the first buffer and proceeds to populate two scatterlists - one
with header + data being sent (sg_set_buf() on the header + sg_set_page() on
each page in payload) and another covering the buffer for response.  Then it
passes the sucker to virtio, which does the actual transfer.

The non-zerocopy variant of the same allocates a buffer for request + data,
a buffer for response, builds header _and_ copies data, then proceeds the
same way as zerocopy path.

In both cases we end up doing sg_set_buf() on some of that - in fact,
in the non-zerocopy path that's all we do.  And sg_set_buf() is just
	sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));

What's more, on the sg_set_page() side it has to deal with several cases:
	1) normal write(); pages are obtained by get_user_pages_fast().
	2) writepage().  The page really should've been passed by the caller
via an ITER_BVEC iov_iter and picked from there, but p9_client_write() isn't
iov_iter-aware, so the caller does kmap() and passes the kernel address
as payload.  And then this sucker proceeds to do kmap_to_page() instead
of get_user_pages_fast() - virt_to_page() won't do.  That one should be
killed off - there's really no reason to play with kmap() at all.
	3) setxattr().  The data being written comes from a kernel
buffer, _hopefully_ not a vmalloc'ed one.  Same path as in (2) once we
get inside p9_client_write(); I'd prefer to pass ITER_KVEC there.

A similar thing happens in p9_client_read(), only there we definitely can
get a vmalloc'ed destination - finit_module(2) will trigger a read into a
vmalloc'ed buffer.  The zerocopy logic in net/9p is actually shared for
read and write (except that we have fixed header for request and header
+ payload for response).  Getting the pages to feed into sg_set_page()
is done in net/9p/trans_virtio.c:p9_get_mapped_pages(); basically, it's
get_user_pages_fast() for userland payloads and this
        } else {
                /* kernel buffer, no need to pin pages */
                int s, index = 0;
                int count = nr_pages;
                while (nr_pages) {
                        s = rest_of_page(data);
                        if (is_vmalloc_addr(data))
                                pages[index++] = vmalloc_to_page(data);
                        else
                                pages[index++] = kmap_to_page(data);
                        data += s;
                        nr_pages--;
                }
                nr_pages = count;
        }
for the kernel ones.  kmap_to_page() probably ought to be virt_to_page()
(it's an artefact of calling conventions - readpage and writepage shouldn't
have to pass the page as a kmapped address), but vmalloc_to_page() is
there for an absolutely real use case - finit_module() reading a file
from a 9p fs into a vmalloc'ed buffer.

AFAICS, your objections seem to apply to that code.  A lot of clumsiness
in there (and in fs/9p) would be killed if we switched p9_client_{read,write}
to passing iov_iter, but it's not a matter of adding functionality to those;
they already do the same thing, only with bloody inconvenient calling
conventions.  It's already there.

How about a primitive that would feed an iov_iter into a scatterlist, and
fill an array with struct page pointers to drop afterwards?  With ITER_IOVEC
dumping the get_user_pages_fast() results into that array and ITER_KVEC
just putting NULLs there.

IOW, do you have a problem with obtaining a pointer to kernel page and
immediately shoving it into scatterlist?  

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] iov_iter_get_pages() semantics
  2015-04-01 18:08                     ` Al Viro
@ 2015-04-01 18:15                       ` Linus Torvalds
  2015-04-01 19:23                         ` Al Viro
  2015-04-01 18:26                       ` Linus Torvalds
  1 sibling, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2015-04-01 18:15 UTC (permalink / raw)
  To: Al Viro
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Wed, Apr 1, 2015 at 11:08 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> It really is that simple. Either the caller cares about just the
>> content - in which case it should use the normal "iterate over
>> addresses" - or the caller somehow cares about 'struct page' itself,
>> in which case it is *wrong* to do it over vmalloc space or random
>> kernel mappings.
>
> ... or the caller wants an sg_table.

Completely immaterial.

Seriously.

If you want an sg-table, how the hell do you know that some user
won't be doing "get_page()" etc on the resulting pages?

Answer: you don't.  If you pass this off to networking or whatever,
and it does some zero-copy thing, it may well be playing games with
the 'struct page'. After all, that's why it exists in the first place.

So no. You *MUST*NOT* turn random vmalloc space (or non-vmalloc space,
for that matter) kernel mappings into struct page arrays. It's wrong.
It's fundamentally invalid crap.

If there are specific cases where it is valid, those specific cases
need to be special-cased.

NO WAY IN HELL do we add generic support for doing shit. Really. If p9
does crazy crap, that is not an excuse to extend the crazy crap to
more code.

                         Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] iov_iter_get_pages() semantics
  2015-04-01 18:08                     ` Al Viro
  2015-04-01 18:15                       ` Linus Torvalds
@ 2015-04-01 18:26                       ` Linus Torvalds
  2015-04-01 18:34                         ` Linus Torvalds
  2015-04-01 19:50                         ` Al Viro
  1 sibling, 2 replies; 48+ messages in thread
From: Linus Torvalds @ 2015-04-01 18:26 UTC (permalink / raw)
  To: Al Viro
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Wed, Apr 1, 2015 at 11:08 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> IOW, do you have a problem with obtaining a pointer to kernel page and
> immediately shoving it into scatterlist?

And just to clarify, yes I do. Why the f*ck wasn't it a struct page to
begin with? And why do you think that a scatter-list is somehow "safe"
and guarantees people won't be playing (invalid and completely broken)
games with page counters etc that you cannot play for those things?

If this is just about finit_module(), then dammit, why the f*ck does
it even try to do zero-copy in the first place? But if that's the only
use, maybe we can improve on kernel_read() to do some aio-read on the
raw pages instead. And change the "info->hdr" thing to not just do a
blind vmalloc, but actually do the page allocations and then do
vmap_page_range() to map in the end result after IO etc.

IOW, it's fine to do IO on 'struct page', but it should be
*controlled* and you damn well need to _own_ that struct page and its
lifetime, not just "look up random struct page from some kernel
address".

                        Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] iov_iter_get_pages() semantics
  2015-04-01 18:26                       ` Linus Torvalds
@ 2015-04-01 18:34                         ` Linus Torvalds
  2015-04-01 20:15                           ` Al Viro
  2015-04-01 19:50                         ` Al Viro
  1 sibling, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2015-04-01 18:34 UTC (permalink / raw)
  To: Al Viro
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Wed, Apr 1, 2015 at 11:26 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> IOW, it's fine to do IO on 'struct page', but it should be
> *controlled* and you damn well need to _own_ that struct page and its
> lifetime, not just "look up random struct page from some kernel
> address".

.. and finally, the reason I feel so strongly about this is that we
*already* had a bug exactly here where you tried to make kvec's "more
generic" and it was just wrong, wrong, wrong.  It's just dangerous to
randomly allow these things.

When we map things into user space, such mappings have some rules
about them (ie they have to be full pages, they have to actually be
special or refcounted etc etc). Kernel mappings simply don't have
those rules. Not even vmalloc, exactly because we sometimes play games
with kernel mappings. But vmalloc space is at least better than random
"just kernel addresses" which is even worse, since they could be stack
pages or SLAB pages or whatever.

And slab allocations, for example, do *not* honor the page count, even
though such pages do have page counts. The slab allocations can be
reused for other things freely, completely regardless of whether
somebody incremented the page count or not.

And yes, people do things like "kernel_read()" on just normal
kmalloc() allocations. So no, I do *not* think that it's ok to "just
make zero-copy kernel_read() work on kernel addresses by turning them
into 'struct page' and then do who-the-hell-knows-what random things
with such a 'struct page'".

                       Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] iov_iter_get_pages() semantics
  2015-04-01 18:15                       ` Linus Torvalds
@ 2015-04-01 19:23                         ` Al Viro
  0 siblings, 0 replies; 48+ messages in thread
From: Al Viro @ 2015-04-01 19:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Wed, Apr 01, 2015 at 11:15:17AM -0700, Linus Torvalds wrote:
> Seriously.
> 
> If you want an sg-table, how the hell do you know that some user
> won't be doing "get_page()" etc on the resulting pages?
> 
> Answer: you don't.  If you pass this off to networking or whatever,
> and it does some zero-copy thing, it may well be playing games with
> the 'struct page'. After all, that's why it exists in the first place.
> 
> So no. You *MUST*NOT* turn random vmalloc space (or non-vmalloc space,
> for that matter) kernel mappings into struct page arrays. It's wrong.
> It's fundamentally invalid crap.

For non-vmalloc space you've just described sg_set_buf() (and sg_init_one(),
which is a wrapper for it)...

> If there are specific cases where it is valid, those specific cases
> need to be special-cased.
> 
> NO WAY IN HELL do we add generic support for doing shit. Really. If p9
> does crazy crap, that is not an excuse to extend the crazy crap to
> more code.

OK...  The scenario you've described might actually be a real bug in
net/9p - I'm not familiar enough with virtio to tell.  AFAICS, all it
wants with them is sg_phys() + length, which would be safe, but I could
easily be missing something...

Oh, well - virtio is already in a somewhat incestuous relationship with
iov_iter.  I wonder if an analogue of virtqueue_add_sgs() that would
take an iov_iter would be the right approach long-term...

Anyway, for now I'll just switch p9_client_{read,write}() to saner
calling conventions and let it look into iov_iter internals in that one
place (p9_get_mapped_pages()).  For the ITER_IOVEC case we can use
iov_iter_get_pages() there, for ITER_BVEC - pick the page from
iter->bvec->bv_page, for ITER_KVEC - take the first element of iter->kvec
and do that vmalloc_to_page()/virt_to_page() thing.

It's no worse than what we do right now, doesn't introduce new primitives
that would invite abuse, and it allows us to make things much simpler for
the rest of net/9p and for fs/9p...

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] iov_iter_get_pages() semantics
  2015-04-01 18:26                       ` Linus Torvalds
  2015-04-01 18:34                         ` Linus Torvalds
@ 2015-04-01 19:50                         ` Al Viro
  1 sibling, 0 replies; 48+ messages in thread
From: Al Viro @ 2015-04-01 19:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Wed, Apr 01, 2015 at 11:26:38AM -0700, Linus Torvalds wrote:
> On Wed, Apr 1, 2015 at 11:08 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > IOW, do you have a problem with obtaining a pointer to kernel page and
> > immediately shoving it into scatterlist?
> 
> And just to clarify, yes I do. Why the f*ck wasn't it a struct page to
> begin with? And why do you think that a scatter-list is somehow "safe"
> and guarantees people won't be playing (invalid and completely broken)
> games with page counters etc that you cannot play for those things?

Point taken.  What do you think about sg_set_buf() and sg_init_one()?

> If this is just about finit_module(), then dammit, why the f*ck does
> it even try to do zero-copy in the first place?

Mostly because there's no way to tell the filesystem that we don't want
zero-copy deep in the bowels of the underlying driver...

> But if that's the only
> use, maybe we can improve on kernel_read() to do some aio-read on the
> raw pages instead. And change the "info->hdr" thing to not just do a
> blind vmalloc, but actually do the page allocations and then do
> vmap_page_range() to map in the end result after IO etc.

Can do, but that would depend on 9p getting converted to read_iter/write_iter
in a sane way ;-) (and that's worth doing for a lot of other reasons, which
is what had brought me to net/9p in the first place).

That might actually be a good idea - for ITER_BVEC we know that the page is
a normal one (not many originators of such - __swap_writepage() and
pipe_buffer ones), so making iov_iter_get_pages() work for those wouldn't
be a problem...

kernel_read() is the wrong helper, though - it should just use vfs_read_iter().
We are -><- that close to making it work on all "normal files" - the only
exceptions right now are ncpfs (fixed in local tree), coda (ditto) and 9p.

> IOW, it's fine to do IO on 'struct page', but it should be
> *controlled* and you damn well need to _own_ that struct page and its
> lifetime, no just "look up random struct page from some kernel
> address".

I certainly agree that throwing pointers to weird pages around is generally
a bad idea, but lifetime is not an issue, AFAICS - if somebody manages to
vfree() the destination of your read right under you, you are already very
deep in trouble.

Speaking of weird pages, some of vmalloc_to_page() users look very strange -
netlink_mmap(), in particular...

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] iov_iter_get_pages() semantics
  2015-04-01 18:34                         ` Linus Torvalds
@ 2015-04-01 20:15                           ` Al Viro
  2015-04-01 21:57                             ` Linus Torvalds
  0 siblings, 1 reply; 48+ messages in thread
From: Al Viro @ 2015-04-01 20:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Wed, Apr 01, 2015 at 11:34:12AM -0700, Linus Torvalds wrote:
> with kernel mappings. But vmalloc space is at least better than random
> "just kernel addresses" which is even worse, since they could be stack
> pages or SLAB pages or whatever.
> 
> And slab allocations, for example, do *not* honor the page count, even
> though such pages do have page counts. The slab allocations can be
> reused for other things freely, completely regardless of whether
> somebody incremented the page count or not.
> 
> And yes, people do things like "kernel_read()" on just normal
> kmalloc() allocations. So no, I do *not* think that it's ok to "just
> make zero-copy kernel_read() work on kernel addresses by turning them
> into 'struct page' and then do whop-the-hell-knows-what random things
> with such a 'struct page').

Umm...  Tangentially related question - do we have any IO codepaths where
a pointer to a page might stick around indefinitely?  AFAICS, for AIO it's
enough to wait for the request to complete, but that's available to the caller.
Do we have anything where one would really have to do e.g. freeing (or
dropping refcount, etc.) in a callback one can't conveniently wait for?
sg_tables with pointers to pages of kmalloc'ed objects are really common in
e.g. crypto/*...

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC] iov_iter_get_pages() semantics
  2015-04-01 20:15                           ` Al Viro
@ 2015-04-01 21:57                             ` Linus Torvalds
  0 siblings, 0 replies; 48+ messages in thread
From: Linus Torvalds @ 2015-04-01 21:57 UTC (permalink / raw)
  To: Al Viro
  Cc: Kirill A. Shutemov, Linux Kernel Mailing List, linux-fsdevel,
	Network Development

On Wed, Apr 1, 2015 at 1:15 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Umm...  Tangentially related question - do we have any IO codepaths where
> a pointer to a page might stick around indefinitely?

Well, the one obvious case is "splice()", where things might sit
around forever in the pipe buffer. But that is fairly explicit, and
splice does have the "release" callback to notify you when it finally
does drop.

For something like regular read/write (or aio_read with waiting for
completion), I don't see how anybody could really ever keep a
reference around to after the completion unless it has some odd
RCU-like freeing.

And for regular vmalloc'ed pages, I actually think that the page
refcounting *should* work fine. I'd still prefer to never see code
that doesn't directly own the area do it, because I could certainly
_imagine_ people playing games with vmap_page_range() (eg doing
things like mapping in partial hugepages or something odd like that).

Put another way: I think that it's fine if somebody does "vmalloc()"
and then does "vmalloc_to_page()" to look up the pages it just
allocated.

I'm *not* nearly as happy with code that does

   if (is_vmalloc(addr)) vmalloc_to_page()

because that basically says "I don't know what this allocation is, but
if somebody plays around in vmalloc space, I'll just look up the
pages".

And yeah, it probably works fine. It just makes me really really
nervous. We use that vmalloc space for some mmio mappings too, don't
we? Worrying about things like that just makes me go "auugh".

                        Linus


end of thread, other threads:[~2015-04-01 21:57 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-04 20:20 [RFC][PATCHES] iov_iter.c rewrite Al Viro
2014-12-04 20:23 ` [RFC][PATCH 01/13] iov_iter.c: macros for iterating over iov_iter Al Viro
2014-12-04 20:23 ` [RFC][PATCH 02/13] iov_iter.c: iterate_and_advance Al Viro
2014-12-04 20:23 ` [RFC][PATCH 03/13] iov_iter.c: convert iov_iter_npages() to iterate_all_kinds Al Viro
2014-12-04 20:23 ` [RFC][PATCH 04/13] iov_iter.c: convert iov_iter_get_pages() " Al Viro
2014-12-04 20:23 ` [RFC][PATCH 05/13] iov_iter.c: convert iov_iter_get_pages_alloc() " Al Viro
2014-12-04 20:23 ` [RFC][PATCH 06/13] iov_iter.c: convert iov_iter_zero() to iterate_and_advance Al Viro
2014-12-04 20:23 ` [RFC][PATCH 07/13] iov_iter.c: get rid of bvec_copy_page_{to,from}_iter() Al Viro
2014-12-05 12:28   ` Sergei Shtylyov
2014-12-04 20:23 ` [RFC][PATCH 08/13] iov_iter.c: convert copy_from_iter() to iterate_and_advance Al Viro
2014-12-04 20:23 ` [RFC][PATCH 09/13] iov_iter.c: convert copy_to_iter() " Al Viro
2014-12-04 20:23 ` [RFC][PATCH 10/13] iov_iter.c: handle ITER_KVEC directly Al Viro
2014-12-04 20:23 ` [RFC][PATCH 11/13] csum_and_copy_..._iter() Al Viro
2014-12-04 20:23 ` [RFC][PATCH 12/13] new helper: iov_iter_kvec() Al Viro
2014-12-04 20:23 ` [RFC][PATCH 13/13] copy_from_iter_nocache() Al Viro
2014-12-08 16:46 ` [RFC][PATCHES] iov_iter.c rewrite Kirill A. Shutemov
2014-12-08 17:58   ` Al Viro
2014-12-08 18:08     ` Al Viro
2014-12-08 18:14       ` Linus Torvalds
2014-12-08 18:20         ` Al Viro
2014-12-08 18:37           ` Linus Torvalds
2014-12-08 18:46             ` Al Viro
2014-12-08 18:57               ` Linus Torvalds
2014-12-08 19:28                 ` Al Viro
2014-12-08 19:48                   ` Linus Torvalds
2014-12-09  1:56                   ` Al Viro
2014-12-09  2:21                     ` Kirill A. Shutemov
2015-04-01  2:33                 ` [RFC] iov_iter_get_pages() semantics Al Viro
2015-04-01 16:45                   ` Linus Torvalds
2015-04-01 18:08                     ` Al Viro
2015-04-01 18:15                       ` Linus Torvalds
2015-04-01 19:23                         ` Al Viro
2015-04-01 18:26                       ` Linus Torvalds
2015-04-01 18:34                         ` Linus Torvalds
2015-04-01 20:15                           ` Al Viro
2015-04-01 21:57                             ` Linus Torvalds
2015-04-01 19:50                         ` Al Viro
2014-12-08 18:56     ` [RFC][PATCHES] iov_iter.c rewrite Kirill A. Shutemov
2014-12-08 19:01       ` Linus Torvalds
2014-12-08 19:15         ` Dave Jones
2014-12-08 19:23         ` Kirill A. Shutemov
2014-12-08 22:14           ` Theodore Ts'o
2014-12-08 22:23             ` Linus Torvalds
2014-12-08 22:31               ` Dave Jones
2014-12-08 18:07   ` Linus Torvalds
2014-12-08 18:14     ` Al Viro
2014-12-08 18:23       ` Linus Torvalds
2014-12-08 18:35         ` Al Viro
