LKML Archive on lore.kernel.org
* [PATCH] Add block device specific splice write method
@ 2008-10-19 14:00 Dmitri Monakhov
  2008-10-20 17:49 ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitri Monakhov @ 2008-10-19 14:00 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Dmitri Monakhov

The block device write procedure differs from that of a regular file:
 - The actual write is performed without i_mutex.
 - It has no metadata, so generic_osync_inode(OSYNC_METADATA) cannot livelock.
 - We do not have to worry about S_ISUID/S_ISGID bits.

Signed-off-by: Dmitri Monakhov <dmonakhov@openvz.org>
---
 fs/block_dev.c     |    2 +-
 fs/splice.c        |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |    2 ++
 3 files changed, 51 insertions(+), 1 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 7ce823c..9aa63b5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1251,7 +1251,7 @@ const struct file_operations def_blk_fops = {
 	.compat_ioctl	= compat_blkdev_ioctl,
 #endif
 	.splice_read	= generic_file_splice_read,
-	.splice_write	= generic_file_splice_write,
+	.splice_write	= blkdev_splice_write,
 };
 
 int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)
diff --git a/fs/splice.c b/fs/splice.c
index a1e701c..f0ba76c 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -884,6 +884,54 @@ ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe, struct file *out,
 
 EXPORT_SYMBOL(generic_splice_sendpage);
 
+/**
+ * blkdev_splice_write - splice data from a pipe to a block device
+ * @pipe:	pipe info
+ * @out:	file to write to
+ * @ppos:	position in @out
+ * @len:	number of bytes to splice
+ * @flags:	splice modifier flags
+ *
+ * Description:
+ *    Will either move or copy pages (determined by @flags options) from
+ *    the given pipe inode to the given block device.
+ *    Note: blockdev's i_mutex is not held on entry and it is never taken.
+ */
+ssize_t
+blkdev_splice_write(struct pipe_inode_info *pipe, struct file *out,
+				loff_t *ppos, size_t len, unsigned int flags)
+{
+	struct address_space *mapping = out->f_mapping;
+	struct inode *inode = mapping->host;
+	struct splice_desc sd = {
+		.total_len = len,
+		.flags = flags,
+		.pos = *ppos,
+		.u.file = out,
+	};
+	ssize_t ret;
+	unsigned long nr_pages;
+	mutex_lock(&pipe->inode->i_mutex);
+	ret = __splice_from_pipe(pipe, &sd, pipe_to_file);
+	mutex_unlock(&pipe->inode->i_mutex);
+	if (ret <= 0)
+		return ret;
+
+	*ppos += ret;
+	nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+
+	if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
+		int err;
+		err = sync_page_range_nolock(inode, mapping, *ppos, ret);
+		if (err)
+			ret = err;
+	}
+	balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
+	return ret;
+}
+
+EXPORT_SYMBOL(blkdev_splice_write);
+
 /*
  * Attempt to initiate a splice from pipe to file.
  */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 194b607..8543b21 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1960,6 +1960,8 @@ extern ssize_t generic_file_splice_write_nolock(struct pipe_inode_info *,
 		struct file *, loff_t *, size_t, unsigned int);
 extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
 		struct file *out, loff_t *, size_t len, unsigned int flags);
+extern ssize_t blkdev_splice_write(struct pipe_inode_info *pipe,
+		struct file *out, loff_t *, size_t len, unsigned int flags);
 extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
 		size_t len, unsigned int flags);
 
-- 
1.5.4.3


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Add block device specific splice write method
  2008-10-19 14:00 [PATCH] Add block device specific splice write method Dmitri Monakhov
@ 2008-10-20 17:49 ` Jens Axboe
  2008-10-20 18:11   ` Jens Axboe
  2008-10-20 18:29   ` Dmitri Monakhov
  0 siblings, 2 replies; 13+ messages in thread
From: Jens Axboe @ 2008-10-20 17:49 UTC (permalink / raw)
  To: Dmitri Monakhov; +Cc: linux-kernel, linux-fsdevel

On Sun, Oct 19 2008, Dmitri Monakhov wrote:
> The block device write procedure differs from that of a regular file:
>  - The actual write is performed without i_mutex.
>  - It has no metadata, so generic_osync_inode(OSYNC_METADATA) cannot livelock.
>  - We do not have to worry about S_ISUID/S_ISGID bits.

I already did an O_DIRECT part of block device splicing [1], I'll fold
this into the splice branch and double check with some testing.

[1] http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=fbb724a0484aba938024d41ca1dd86337d2550c9;hp=08c7910b275a4c580ad646ae8654439c8dfae4c5

-- 
Jens Axboe



* Re: [PATCH] Add block device specific splice write method
  2008-10-20 17:49 ` Jens Axboe
@ 2008-10-20 18:11   ` Jens Axboe
  2008-10-20 18:42     ` Dmitri Monakhov
  2008-10-23  5:39     ` Andrew Morton
  2008-10-20 18:29   ` Dmitri Monakhov
  1 sibling, 2 replies; 13+ messages in thread
From: Jens Axboe @ 2008-10-20 18:11 UTC (permalink / raw)
  To: Dmitri Monakhov; +Cc: linux-kernel, linux-fsdevel

On Mon, Oct 20 2008, Jens Axboe wrote:
> On Sun, Oct 19 2008, Dmitri Monakhov wrote:
> > The block device write procedure differs from that of a regular file:
> >  - The actual write is performed without i_mutex.
> >  - It has no metadata, so generic_osync_inode(OSYNC_METADATA) cannot livelock.
> >  - We do not have to worry about S_ISUID/S_ISGID bits.
> 
> I already did an O_DIRECT part of block device splicing [1], I'll fold
> this into the splice branch and double check with some testing.
> 
> [1] http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=fbb724a0484aba938024d41ca1dd86337d2550c9;hp=08c7910b275a4c580ad646ae8654439c8dfae4c5

The below is what I merged. Note that I changed the naming and made the
function look a lot more like the other splice helpers, so it's more
apparent how it differs. Let me know if I can add your Signed-off-by to
this one (preferably after you test it as well :-)
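
For anyone testing this from user space, a minimal sketch of a helper that
drives the splice write path follows. It assumes Linux with glibc's splice(2)
wrapper; the name splice_pipe_to_fd is made up for illustration, and the
caller is assumed to have opened the output descriptor on a block device node
(or, for a dry run, on a regular file).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: splice up to len bytes from the read end of a
 * pipe into out_fd. Returns the number of bytes moved, or -1 with
 * errno set by splice(2).
 */
static ssize_t splice_pipe_to_fd(int pipe_rd, int out_fd, size_t len)
{
	size_t done = 0;

	while (done < len) {
		/* NULL offsets: the pipe has none, out_fd uses its file position. */
		ssize_t n = splice(pipe_rd, NULL, out_fd, NULL, len - done,
				   SPLICE_F_MOVE);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		if (n == 0)
			break;			/* pipe empty and writer closed */
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

Opening the device with or without O_DIRECT should select between the two
branches of block_splice_write() in the patch below.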

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4d154dc..083198a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1288,7 +1288,7 @@ new_bio:
  * Splice to file opened with O_DIRECT. Bypass caching completely and
  * just go direct-to-bio
  */
-static ssize_t __block_splice_write(struct pipe_inode_info *pipe,
+static ssize_t __block_splice_direct_write(struct pipe_inode_info *pipe,
 				    struct file *out, loff_t *ppos, size_t len,
 				    unsigned int flags)
 {
@@ -1318,6 +1318,9 @@ static ssize_t __block_splice_write(struct pipe_inode_info *pipe,
 	if (bsd.bio)
 		submit_bio(WRITE, bsd.bio);
 
+	if (ret > 0)
+		*ppos += ret;
+
 	return ret;
 }
 
@@ -1327,12 +1330,11 @@ static ssize_t block_splice_write(struct pipe_inode_info *pipe,
 {
 	ssize_t ret;
 
-	if (out->f_flags & O_DIRECT) {
-		ret = __block_splice_write(pipe, out, ppos, len, flags);
-		if (ret > 0)
-			*ppos += ret;
-	} else
-		ret = generic_file_splice_write(pipe, out, ppos, len, flags);
+	if (out->f_flags & O_DIRECT)
+		ret = __block_splice_direct_write(pipe, out, ppos, len, flags);
+	else
+		ret = generic_file_splice_write_file_nolock(pipe, out, ppos,
+								len, flags);
 
 	return ret;
 }
diff --git a/fs/splice.c b/fs/splice.c
index 4108264..eb1e1ac 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -788,6 +788,59 @@ ssize_t splice_from_pipe(struct pipe_inode_info *pipe, struct file *out,
 }
 
 /**
+ * generic_file_splice_write_file_nolock - splice data from a pipe to a file
+ * @pipe:	pipe info
+ * @out:	file to write to
+ * @ppos:	position in @out
+ * @len:	number of bytes to splice
+ * @flags:	splice modifier flags
+ *
+ * Description:
+ *    Will either move or copy pages (determined by @flags options) from
+ *    the given pipe inode to the given block device.
+ *    Note: this is like @generic_file_splice_write, except that we
+ *    don't bother locking the output file. Useful for splicing directly
+ *    to a block device.
+ */
+ssize_t generic_file_splice_write_file_nolock(struct pipe_inode_info *pipe,
+					      struct file *out, loff_t *ppos,
+					      size_t len, unsigned int flags)
+{
+	struct address_space *mapping = out->f_mapping;
+	struct inode *inode = mapping->host;
+	struct splice_desc sd = {
+		.total_len = len,
+		.flags = flags,
+		.pos = *ppos,
+		.u.file = out,
+	};
+	ssize_t ret;
+
+	mutex_lock(&pipe->inode->i_mutex);
+	ret = __splice_from_pipe(pipe, &sd, pipe_to_file);
+	mutex_unlock(&pipe->inode->i_mutex);
+
+	if (ret > 0) {
+		unsigned long nr_pages;
+
+		*ppos += ret;
+		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+
+		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
+			int er;
+
+			er = sync_page_range_nolock(inode, mapping, *ppos, ret);
+			if (er)
+				ret = er;
+		}
+		balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(generic_file_splice_write_file_nolock);
+
+/**
  * generic_file_splice_write_nolock - generic_file_splice_write without mutexes
  * @pipe:	pipe info
  * @out:	file to write to
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a6a625b..5c9b880 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1957,6 +1957,8 @@ extern ssize_t generic_file_splice_write(struct pipe_inode_info *,
 		struct file *, loff_t *, size_t, unsigned int);
 extern ssize_t generic_file_splice_write_nolock(struct pipe_inode_info *,
 		struct file *, loff_t *, size_t, unsigned int);
+extern ssize_t generic_file_splice_write_file_nolock(struct pipe_inode_info *,
+		struct file *, loff_t *, size_t, unsigned int);
 extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
 		struct file *out, loff_t *, size_t len, unsigned int flags);
 extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,

-- 
Jens Axboe



* Re: [PATCH] Add block device specific splice write method
  2008-10-20 17:49 ` Jens Axboe
  2008-10-20 18:11   ` Jens Axboe
@ 2008-10-20 18:29   ` Dmitri Monakhov
  2008-10-20 18:33     ` Jens Axboe
  1 sibling, 1 reply; 13+ messages in thread
From: Dmitri Monakhov @ 2008-10-20 18:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel

Jens Axboe <jens.axboe@oracle.com> writes:

> On Sun, Oct 19 2008, Dmitri Monakhov wrote:
>> The block device write procedure differs from that of a regular file:
>>  - The actual write is performed without i_mutex.
>>  - It has no metadata, so generic_osync_inode(OSYNC_METADATA) cannot livelock.
>>  - We do not have to worry about S_ISUID/S_ISGID bits.
>
> I already did an O_DIRECT part of block device splicing [1], I'll fold
> this into the splice branch and double check with some testing.
>
> [1] http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=fbb724a0484aba938024d41ca1dd86337d2550c9;hp=08c7910b275a4c580ad646ae8654439c8dfae4c5
OK, I missed this branch :(. Your approach is really cool.
But the current patch does not seem completely ready:
O_DIRECT case:
  - the sync case is missing; someone may want to use it with O_DIRECT|O_SYNC
  - I'm not sure why it is necessary to always hold bd_inode->i_mutex
    inside __splice_from_pipe(.., pipe_to_disk)
!O_DIRECT case:
  - it still uses generic_file_splice_write

So I'll rebase onto your patch and:
 - add the necessary fixes for the direct case.
 - redo my patch on top of yours for buffered writes.

What do you think?


* Re: [PATCH] Add block device specific splice write method
  2008-10-20 18:29   ` Dmitri Monakhov
@ 2008-10-20 18:33     ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2008-10-20 18:33 UTC (permalink / raw)
  To: Dmitri Monakhov; +Cc: linux-kernel, linux-fsdevel

On Mon, Oct 20 2008, Dmitri Monakhov wrote:
> Jens Axboe <jens.axboe@oracle.com> writes:
> 
> > On Sun, Oct 19 2008, Dmitri Monakhov wrote:
> >> The block device write procedure differs from that of a regular file:
> >>  - The actual write is performed without i_mutex.
> >>  - It has no metadata, so generic_osync_inode(OSYNC_METADATA) cannot livelock.
> >>  - We do not have to worry about S_ISUID/S_ISGID bits.
> >
> > I already did an O_DIRECT part of block device splicing [1], I'll fold
> > this into the splice branch and double check with some testing.
> >
> > [1] http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=fbb724a0484aba938024d41ca1dd86337d2550c9;hp=08c7910b275a4c580ad646ae8654439c8dfae4c5
> OK, I missed this branch :(. Your approach is really cool.
> But the current patch does not seem completely ready:

Not surprising, it's still pretty fresh. The core of it works, which was
the first objective :-)

> O_DIRECT case:
>   - the sync case is missing; someone may want to use it with O_DIRECT|O_SYNC

Good point, I'll update that to wait on in-progress bios.

>   - I'm not sure why it is necessary to always hold bd_inode->i_mutex
>     inside __splice_from_pipe(.., pipe_to_disk)

It is not, I'll drop that too.

> !O_DIRECT case:
>   - it still uses generic_file_splice_write

Well, the patch adds O_DIRECT support, so that's not really a missing
piece!

> So I'll re-base to your patch and:
>  - add the necessary fixes for the direct case.
>  - redo my patch on top of yours for buffered writes.
> 
> What do you think?

Please just send a patch for the missing bits on top of the current
splice branch, that includes the patch I sent which is a rebased version
of yours.

-- 
Jens Axboe



* Re: [PATCH] Add block device specific splice write method
  2008-10-20 18:11   ` Jens Axboe
@ 2008-10-20 18:42     ` Dmitri Monakhov
  2008-10-23  5:39     ` Andrew Morton
  1 sibling, 0 replies; 13+ messages in thread
From: Dmitri Monakhov @ 2008-10-20 18:42 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel

Jens Axboe <jens.axboe@oracle.com> writes:

> On Mon, Oct 20 2008, Jens Axboe wrote:
>> On Sun, Oct 19 2008, Dmitri Monakhov wrote:
>> > The block device write procedure differs from that of a regular file:
>> >  - The actual write is performed without i_mutex.
>> >  - It has no metadata, so generic_osync_inode(OSYNC_METADATA) cannot livelock.
>> >  - We do not have to worry about S_ISUID/S_ISGID bits.
>> 
>> I already did an O_DIRECT part of block device splicing [1], I'll fold
>> this into the splice branch and double check with some testing.
>> 
>> [1] http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=fbb724a0484aba938024d41ca1dd86337d2550c9;hp=08c7910b275a4c580ad646ae8654439c8dfae4c5
>
> The below is what I merged. Note that I changed the naming and made the
> function look a lot more like the other splice helpers, so it's more
> apparent how it differs. Let me know if I can add your Signed-off-by to
Of course, yes.
> this one (preferably after you test it as well :-)
I'm currently testing this.
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 4d154dc..083198a 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1288,7 +1288,7 @@ new_bio:
>   * Splice to file opened with O_DIRECT. Bypass caching completely and
>   * just go direct-to-bio
>   */
> -static ssize_t __block_splice_write(struct pipe_inode_info *pipe,
> +static ssize_t __block_splice_direct_write(struct pipe_inode_info *pipe,
>  				    struct file *out, loff_t *ppos, size_t len,
>  				    unsigned int flags)
>  {
> @@ -1318,6 +1318,9 @@ static ssize_t __block_splice_write(struct pipe_inode_info *pipe,
>  	if (bsd.bio)
>  		submit_bio(WRITE, bsd.bio);
>  
> +	if (ret > 0)
> +		*ppos += ret;
> +
>  	return ret;
>  }
>  
> @@ -1327,12 +1330,11 @@ static ssize_t block_splice_write(struct pipe_inode_info *pipe,
>  {
>  	ssize_t ret;
>  
> -	if (out->f_flags & O_DIRECT) {
> -		ret = __block_splice_write(pipe, out, ppos, len, flags);
> -		if (ret > 0)
> -			*ppos += ret;
> -	} else
> -		ret = generic_file_splice_write(pipe, out, ppos, len, flags);
> +	if (out->f_flags & O_DIRECT)
> +		ret = __block_splice_direct_write(pipe, out, ppos, len, flags);
> +	else
> +		ret = generic_file_splice_write_file_nolock(pipe, out, ppos,
> +								len, flags);
>  
>  	return ret;
>  }
> diff --git a/fs/splice.c b/fs/splice.c
> index 4108264..eb1e1ac 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -788,6 +788,59 @@ ssize_t splice_from_pipe(struct pipe_inode_info *pipe, struct file *out,
>  }
>  
>  /**
> + * generic_file_splice_write_file_nolock - splice data from a pipe to a file
> + * @pipe:	pipe info
> + * @out:	file to write to
> + * @ppos:	position in @out
> + * @len:	number of bytes to splice
> + * @flags:	splice modifier flags
> + *
> + * Description:
> + *    Will either move or copy pages (determined by @flags options) from
> + *    the given pipe inode to the given block device.
> + *    Note: this is like @generic_file_splice_write, except that we
> + *    don't bother locking the output file. Useful for splicing directly
> + *    to a block device.
> + */
> +ssize_t generic_file_splice_write_file_nolock(struct pipe_inode_info *pipe,
> +					      struct file *out, loff_t *ppos,
> +					      size_t len, unsigned int flags)
> +{
> +	struct address_space *mapping = out->f_mapping;
> +	struct inode *inode = mapping->host;
> +	struct splice_desc sd = {
> +		.total_len = len,
> +		.flags = flags,
> +		.pos = *ppos,
> +		.u.file = out,
> +	};
> +	ssize_t ret;
> +
> +	mutex_lock(&pipe->inode->i_mutex);
> +	ret = __splice_from_pipe(pipe, &sd, pipe_to_file);
> +	mutex_unlock(&pipe->inode->i_mutex);
> +
> +	if (ret > 0) {
> +		unsigned long nr_pages;
> +
> +		*ppos += ret;
> +		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> +
> +		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
> +			int er;
> +
> +			er = sync_page_range_nolock(inode, mapping, *ppos, ret);
> +			if (er)
> +				ret = er;
> +		}
> +		balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(generic_file_splice_write_file_nolock);
> +
> +/**
>   * generic_file_splice_write_nolock - generic_file_splice_write without mutexes
>   * @pipe:	pipe info
>   * @out:	file to write to
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index a6a625b..5c9b880 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1957,6 +1957,8 @@ extern ssize_t generic_file_splice_write(struct pipe_inode_info *,
>  		struct file *, loff_t *, size_t, unsigned int);
>  extern ssize_t generic_file_splice_write_nolock(struct pipe_inode_info *,
>  		struct file *, loff_t *, size_t, unsigned int);
> +extern ssize_t generic_file_splice_write_file_nolock(struct pipe_inode_info *,
> +		struct file *, loff_t *, size_t, unsigned int);
>  extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
>  		struct file *out, loff_t *, size_t len, unsigned int flags);
>  extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,


* Re: [PATCH] Add block device specific splice write method
  2008-10-20 18:11   ` Jens Axboe
  2008-10-20 18:42     ` Dmitri Monakhov
@ 2008-10-23  5:39     ` Andrew Morton
  2008-10-23  6:29       ` Jens Axboe
  2008-10-23  8:41       ` Dmitri Monakhov
  1 sibling, 2 replies; 13+ messages in thread
From: Andrew Morton @ 2008-10-23  5:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Dmitri Monakhov, linux-kernel, linux-fsdevel

On Mon, 20 Oct 2008 20:11:56 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:

> +ssize_t generic_file_splice_write_file_nolock(struct pipe_inode_info *pipe,
> +					      struct file *out, loff_t *ppos,
> +					      size_t len, unsigned int flags)
> +{
> +	struct address_space *mapping = out->f_mapping;
> +	struct inode *inode = mapping->host;
> +	struct splice_desc sd = {
> +		.total_len = len,
> +		.flags = flags,
> +		.pos = *ppos,
> +		.u.file = out,
> +	};
> +	ssize_t ret;
> +
> +	mutex_lock(&pipe->inode->i_mutex);
> +	ret = __splice_from_pipe(pipe, &sd, pipe_to_file);
> +	mutex_unlock(&pipe->inode->i_mutex);
> +
> +	if (ret > 0) {
> +		unsigned long nr_pages;
> +
> +		*ppos += ret;
> +		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> +
> +		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
> +			int er;
> +
> +			er = sync_page_range_nolock(inode, mapping, *ppos, ret);
> +			if (er)
> +				ret = er;
> +		}
> +		balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(generic_file_splice_write_file_nolock);

I don't think the balance_dirty_pages() is needed if we just did the
sync_page_range().


But really it'd be better if the throttling happened down in
pipe_to_file(), on a per-page basis.  As it stands we can dirty an
arbitrary number of pagecache pages without throttling.  I think?
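
The nr_pages value handed to balance_dirty_pages_ratelimited_nr() in the
quoted code is the spliced byte count rounded up to whole pages. A small
user-space sketch of that arithmetic, assuming 4 KiB pages (the MODEL_* names
and bytes_to_pages() are illustrative stand-ins, not kernel symbols):

```c
#include <stddef.h>

/* Assumed page geometry for this sketch: 4 KiB pages. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

/* Round a byte count up to the number of pages it spans. */
static unsigned long bytes_to_pages(size_t bytes)
{
	return (bytes + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;
}
```

Under this assumption a one-byte splice counts as one page, and a full 64 KiB
splice counts as sixteen.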



* Re: [PATCH] Add block device specific splice write method
  2008-10-23  5:39     ` Andrew Morton
@ 2008-10-23  6:29       ` Jens Axboe
  2008-10-23  6:41         ` Andrew Morton
  2008-10-23  8:41       ` Dmitri Monakhov
  1 sibling, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2008-10-23  6:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Dmitri Monakhov, linux-kernel, linux-fsdevel

On Wed, Oct 22 2008, Andrew Morton wrote:
> On Mon, 20 Oct 2008 20:11:56 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > +ssize_t generic_file_splice_write_file_nolock(struct pipe_inode_info *pipe,
> > +					      struct file *out, loff_t *ppos,
> > +					      size_t len, unsigned int flags)
> > +{
> > +	struct address_space *mapping = out->f_mapping;
> > +	struct inode *inode = mapping->host;
> > +	struct splice_desc sd = {
> > +		.total_len = len,
> > +		.flags = flags,
> > +		.pos = *ppos,
> > +		.u.file = out,
> > +	};
> > +	ssize_t ret;
> > +
> > +	mutex_lock(&pipe->inode->i_mutex);
> > +	ret = __splice_from_pipe(pipe, &sd, pipe_to_file);
> > +	mutex_unlock(&pipe->inode->i_mutex);
> > +
> > +	if (ret > 0) {
> > +		unsigned long nr_pages;
> > +
> > +		*ppos += ret;
> > +		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> > +
> > +		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
> > +			int er;
> > +
> > +			er = sync_page_range_nolock(inode, mapping, *ppos, ret);
> > +			if (er)
> > +				ret = er;
> > +		}
> > +		balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
> > +	}
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(generic_file_splice_write_file_nolock);
> 
> I don't think the balance_dirty_pages() is needed if we just did the
> sync_page_range().

Good point, I think we can get rid of that.
> 
> 
> But really it'd be better if the throttling happened down in
> pipe_to_file(), on a per-page basis.  As it stands we can dirty an
> arbitrary number of pagecache pages without throttling.  I think?

That's pretty exactly why it isn't done in the actor, to avoid doing it
per-page. As it's going to be PIPE_BUFFERS (16) pages max, I think this
is better.

Back in the splice early days, the balance_dirty_pages() actually showed
up in profiles when it was done on a per-page basis. So I'm reluctant to
change it :-)

-- 
Jens Axboe



* Re: [PATCH] Add block device specific splice write method
  2008-10-23  6:29       ` Jens Axboe
@ 2008-10-23  6:41         ` Andrew Morton
  2008-10-23  6:51           ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2008-10-23  6:41 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Dmitri Monakhov, linux-kernel, linux-fsdevel

On Thu, 23 Oct 2008 08:29:23 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:

> > But really it'd be better if the throttling happened down in
> > pipe_to_file(), on a per-page basis.  As it stands we can dirty an
> > arbitrary number of pagecache pages without throttling.  I think?
> 
> That's pretty exactly why it isn't done in the actor, to avoid doing it
> per-page. As it's going to be PIPE_BUFFERS (16) pages max, I think this
> is better.
> 
> Back in the splice early days, the balance_dirty_pages() actually showed
> up in profiles when it was done on a per-page basis. So I'm reluctant to
> change it :-)

That's why (the misnamed) balance_dirty_pages_ratelimited() exists?


* Re: [PATCH] Add block device specific splice write method
  2008-10-23  6:41         ` Andrew Morton
@ 2008-10-23  6:51           ` Jens Axboe
  2008-10-23  7:03             ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2008-10-23  6:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Dmitri Monakhov, linux-kernel, linux-fsdevel

On Wed, Oct 22 2008, Andrew Morton wrote:
> On Thu, 23 Oct 2008 08:29:23 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > > But really it'd be better if the throttling happened down in
> > > pipe_to_file(), on a per-page basis.  As it stands we can dirty an
> > > arbitrary number of pagecache pages without throttling.  I think?
> > 
> > That's pretty exactly why it isn't done in the actor, to avoid doing it
> > per-page. As it's going to be PIPE_BUFFERS (16) pages max, I think this
> > is better.
> > 
> > Back in the splice early days, the balance_dirty_pages() actually showed
> > up in profiles when it was done on a per-page basis. So I'm reluctant to
> > change it :-)
> 
> That's why (the misnamed) balance_dirty_pages_ratelimited() exists?

I think that is what was used, but the details are a little hazy at this
point. So I can't say for sure. In this case it's moot anyway, since we
can kill it.

-- 
Jens Axboe



* Re: [PATCH] Add block device specific splice write method
  2008-10-23  6:51           ` Jens Axboe
@ 2008-10-23  7:03             ` Andrew Morton
  2008-10-23  7:16               ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2008-10-23  7:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Dmitri Monakhov, linux-kernel, linux-fsdevel

On Thu, 23 Oct 2008 08:51:13 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:

> On Wed, Oct 22 2008, Andrew Morton wrote:
> > On Thu, 23 Oct 2008 08:29:23 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
> > 
> > > > But really it'd be better if the throttling happened down in
> > > > pipe_to_file(), on a per-page basis.  As it stands we can dirty an
> > > > arbitrary number of pagecache pages without throttling.  I think?
> > > 
> > > That's pretty exactly why it isn't done in the actor, to avoid doing it
> > > per-page. As it's going to be PIPE_BUFFERS (16) pages max, I think this
> > > is better.
> > > 
> > > Back in the splice early days, the balance_dirty_pages() actually showed
> > > up in profiles when it was done on a per-page basis. So I'm reluctant to
> > > change it :-)
> > 
> > That's why (the misnamed) balance_dirty_pages_ratelimited() exists?
> 
> I think that is what was used, but the details are a little hazy at this
> point. So I can't say for sure.

All that function does is to bump a per-cpu variable and
once-per-thousand or so it does the balance.  If it was causing
problems in the splice application we want to know, because write()
uses it!

>  In this case it's moot anyway, since we can kill it.

Nope, we can only remove it if the fd is O_SYNC||is_sync().
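
The counter-and-threshold behaviour described above can be sketched in user
space roughly as follows. This is a single-threaded model, not the kernel's
implementation: the struct, its fields, and note_pages_dirtied() are
hypothetical names, and the threshold is left as a parameter rather than
guessing the kernel's value.

```c
#include <stdbool.h>

/* Hypothetical model of a rate-limited throttle; not kernel code. */
struct ratelimit_model {
	unsigned long pending;   /* pages dirtied since the last balance */
	unsigned long threshold; /* run the expensive path at this many pages */
	unsigned long balances;  /* how often the expensive path has run */
};

/*
 * Cheap path: bump the counter. Only when the threshold is reached do
 * we reset the counter and account for one (expensive) balance.
 * Returns true when the expensive path ran.
 */
static bool note_pages_dirtied(struct ratelimit_model *rl,
			       unsigned long nr_pages)
{
	rl->pending += nr_pages;
	if (rl->pending < rl->threshold)
		return false;

	rl->pending = 0;
	rl->balances++;
	return true;
}
```

In this model, thirty-two single-page calls and two sixteen-page calls each
reach a threshold of 32 exactly once.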


* Re: [PATCH] Add block device specific splice write method
  2008-10-23  7:03             ` Andrew Morton
@ 2008-10-23  7:16               ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2008-10-23  7:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Dmitri Monakhov, linux-kernel, linux-fsdevel

On Thu, Oct 23 2008, Andrew Morton wrote:
> On Thu, 23 Oct 2008 08:51:13 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > On Wed, Oct 22 2008, Andrew Morton wrote:
> > > On Thu, 23 Oct 2008 08:29:23 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
> > > 
> > > > > But really it'd be better if the throttling happened down in
> > > > > pipe_to_file(), on a per-page basis.  As it stands we can dirty an
> > > > > arbitrary number of pagecache pages without throttling.  I think?
> > > > 
> > > > That's pretty exactly why it isn't done in the actor, to avoid doing it
> > > > per-page. As it's going to be PIPE_BUFFERS (16) pages max, I think this
> > > > is better.
> > > > 
> > > > Back in the splice early days, the balance_dirty_pages() actually showed
> > > > up in profiles when it was done on a per-page basis. So I'm reluctant to
> > > > change it :-)
> > > 
> > > That's why (the misnamed) balance_dirty_pages_ratelimited() exists?
> > 
> > I think that is what was used, but the details are a little hazy at this
> > point. So I can't say for sure.
> 
> All that function does is to bump a per-cpu variable and
> once-per-thousand or so it does the balance.  If it was causing
> problems in the splice application we want to know, because write()
> uses it!

Once per 8 or 32. If we haven't exceeded the dirty limit, calling it in
the actor or at the end should not make a difference for splice, since
we should be going into balance_dirty_pages() at most once.

Perhaps it was different some years ago, or perhaps the micro benchmarks
were screwed. Or perhaps my memory is shot, can't say for sure :)

> >  In this case it's moot anyway, since we can kill it.
> 
> Nope, we can only remove it if the fd is O_SYNC||is_sync().

Right, I forgot this is still the buffered path.
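
One way to read the outcome of this exchange, sketched as a user-space
decision function: sync when the file or inode is synchronous, and throttle
only otherwise. The enum and function names are hypothetical, and no final
patch appears in this thread, so this is only an illustration of the logic
being discussed.

```c
#include <stdbool.h>

/* Hypothetical follow-up actions after a buffered splice write. */
enum after_splice {
	AFTER_NOTHING,	/* nothing was written */
	AFTER_SYNC,	/* write out the range; throttling is redundant */
	AFTER_THROTTLE,	/* pages were left dirty, so rate-limit the writer */
};

/*
 * ret:        bytes spliced, or a negative error
 * file_sync:  stands in for (out->f_flags & O_SYNC)
 * inode_sync: stands in for IS_SYNC(inode)
 */
static enum after_splice after_buffered_splice(long ret, bool file_sync,
					       bool inode_sync)
{
	if (ret <= 0)
		return AFTER_NOTHING;
	if (file_sync || inode_sync)
		return AFTER_SYNC;
	return AFTER_THROTTLE;
}
```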

-- 
Jens Axboe



* Re: [PATCH] Add block device specific splice write method
  2008-10-23  5:39     ` Andrew Morton
  2008-10-23  6:29       ` Jens Axboe
@ 2008-10-23  8:41       ` Dmitri Monakhov
  1 sibling, 0 replies; 13+ messages in thread
From: Dmitri Monakhov @ 2008-10-23  8:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, linux-kernel, linux-fsdevel

Andrew Morton <akpm@linux-foundation.org> writes:

> On Mon, 20 Oct 2008 20:11:56 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:
>
>> +ssize_t generic_file_splice_write_file_nolock(struct pipe_inode_info *pipe,
>> +					      struct file *out, loff_t *ppos,
>> +					      size_t len, unsigned int flags)
>> +{
>> +	struct address_space *mapping = out->f_mapping;
>> +	struct inode *inode = mapping->host;
>> +	struct splice_desc sd = {
>> +		.total_len = len,
>> +		.flags = flags,
>> +		.pos = *ppos,
>> +		.u.file = out,
>> +	};
>> +	ssize_t ret;
>> +
>> +	mutex_lock(&pipe->inode->i_mutex);
>> +	ret = __splice_from_pipe(pipe, &sd, pipe_to_file);
>> +	mutex_unlock(&pipe->inode->i_mutex);
>> +
>> +	if (ret > 0) {
>> +		unsigned long nr_pages;
>> +
>> +		*ppos += ret;
>> +		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>> +
>> +		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
>> +			int er;
>> +
>> +			er = sync_page_range_nolock(inode, mapping, *ppos, ret);
>> +			if (er)
>> +				ret = er;
>> +		}
>> +		balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
>> +	}
>> +
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(generic_file_splice_write_file_nolock);
>
> I don't think the balance_dirty_pages() is needed if we just did the
> sync_page_range().
I think so too, but I did it this way because all the other writers
do it.
>
>
> But really it'd be better if the throttling happened down in
> pipe_to_file(), on a per-page basis.  As it stands we can dirty an
> arbitrary number of pagecache pages without throttling.  I think?
