LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 1 of 2] block_page_mkwrite() Implementation V2
@ 2007-03-18 23:30 David Chinner
  2007-03-19  6:37 ` Nick Piggin
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: David Chinner @ 2007-03-18 23:30 UTC (permalink / raw)
  To: lkml; +Cc: linux-mm, linux-fsdevel


Generic page_mkwrite functionality.

Filesystems that make use of the VM ->page_mkwrite() callout will generally use
the same core code to implement it. There are several tricky truncate-related
issues that we need to deal with here as we cannot take the i_mutex as we
normally would for these paths.  These issues are not documented anywhere yet
so block_page_mkwrite() seems like the best place to start.

Version 2:

- read inode size only once
- more comments explaining implementation restrictions

Signed-Off-By: Dave Chinner <dgc@sgi.com>

---
 fs/buffer.c                 |   47 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/buffer_head.h |    2 +
 2 files changed, 49 insertions(+)

Index: 2.6.x-xfs-new/fs/buffer.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/buffer.c	2007-03-17 10:55:32.291414968 +1100
+++ 2.6.x-xfs-new/fs/buffer.c	2007-03-19 08:13:54.519909087 +1100
@@ -2194,6 +2194,52 @@ int generic_commit_write(struct file *fi
 	return 0;
 }
 
+/*
+ * block_page_mkwrite() is not allowed to change the file size as it gets
+ * called from a page fault handler when a page is first dirtied. Hence we must
+ * be careful to check for EOF conditions here. We set the page up correctly
+ * for a written page which means we get ENOSPC checking when writing into
+ * holes and correct delalloc and unwritten extent mapping on filesystems that
+ * support these features.
+ *
+ * We are not allowed to take the i_mutex here so we have to play games to
+ * protect against truncate races as the page could now be beyond EOF.  Because
+ * vmtruncate() writes the inode size before removing pages, once we have the
+ * page lock we can determine safely if the page is beyond EOF. If it is not
+ * beyond EOF, then the page is guaranteed safe against truncation until we
+ * unlock the page.
+ */
+int
+block_page_mkwrite(struct vm_area_struct *vma, struct page *page,
+		   get_block_t get_block)
+{
+	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
+	unsigned long end;
+	loff_t size;
+	int ret = -EINVAL;
+
+	lock_page(page);
+	size = i_size_read(inode);
+	if ((page->mapping != inode->i_mapping) ||
+	    ((page->index << PAGE_CACHE_SHIFT) > size)) {
+		/* page got truncated out from underneath us */
+		goto out_unlock;
+	}
+
+	/* page is wholly or partially inside EOF */
+	if (((page->index + 1) << PAGE_CACHE_SHIFT) > size)
+		end = size & ~PAGE_CACHE_MASK;
+	else
+		end = PAGE_CACHE_SIZE;
+
+	ret = block_prepare_write(page, 0, end, get_block);
+	if (!ret)
+		ret = block_commit_write(page, 0, end);
+
+out_unlock:
+	unlock_page(page);
+	return ret;
+}
 
 /*
  * nobh_prepare_write()'s prereads are special: the buffer_heads are freed
@@ -2997,6 +3043,7 @@ EXPORT_SYMBOL(__brelse);
 EXPORT_SYMBOL(__wait_on_buffer);
 EXPORT_SYMBOL(block_commit_write);
 EXPORT_SYMBOL(block_prepare_write);
+EXPORT_SYMBOL(block_page_mkwrite);
 EXPORT_SYMBOL(block_read_full_page);
 EXPORT_SYMBOL(block_sync_page);
 EXPORT_SYMBOL(block_truncate_page);
Index: 2.6.x-xfs-new/include/linux/buffer_head.h
===================================================================
--- 2.6.x-xfs-new.orig/include/linux/buffer_head.h	2007-03-17 10:55:32.135435539 +1100
+++ 2.6.x-xfs-new/include/linux/buffer_head.h	2007-03-17 10:55:32.567378573 +1100
@@ -206,6 +206,8 @@ int cont_prepare_write(struct page*, uns
 int generic_cont_expand(struct inode *inode, loff_t size);
 int generic_cont_expand_simple(struct inode *inode, loff_t size);
 int block_commit_write(struct page *page, unsigned from, unsigned to);
+int block_page_mkwrite(struct vm_area_struct *vma, struct page *page,
+				get_block_t get_block);
 void block_sync_page(struct page *);
 sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *);
 int generic_commit_write(struct file *, struct page *, unsigned, unsigned);

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-03-18 23:30 [PATCH 1 of 2] block_page_mkwrite() Implementation V2 David Chinner
@ 2007-03-19  6:37 ` Nick Piggin
  2007-03-19  8:12   ` David Chinner
  2007-03-19  9:22 ` Christoph Hellwig
  2007-05-16 10:19 ` David Howells
  2 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2007-03-19  6:37 UTC (permalink / raw)
  To: David Chinner; +Cc: lkml, linux-mm, linux-fsdevel

David Chinner wrote:
> Generic page_mkwrite functionality.
> 
> Filesystems that make use of the VM ->page_mkwrite() callout will generally use
> the same core code to implement it. There are several tricky truncate-related
> issues that we need to deal with here as we cannot take the i_mutex as we
> normally would for these paths.  These issues are not documented anywhere yet
> so block_page_mkwrite() seems like the best place to start.



> 
> Version 2:
> 
> - read inode size only once
> - more comments explaining implementation restrictions
> 
> Signed-Off-By: Dave Chinner <dgc@sgi.com>
> 
> ---
>  fs/buffer.c                 |   47 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/buffer_head.h |    2 +
>  2 files changed, 49 insertions(+)
> 
> Index: 2.6.x-xfs-new/fs/buffer.c
> ===================================================================
> --- 2.6.x-xfs-new.orig/fs/buffer.c	2007-03-17 10:55:32.291414968 +1100
> +++ 2.6.x-xfs-new/fs/buffer.c	2007-03-19 08:13:54.519909087 +1100
> @@ -2194,6 +2194,52 @@ int generic_commit_write(struct file *fi
>  	return 0;
>  }
>  
> +/*
> + * block_page_mkwrite() is not allowed to change the file size as it gets
> + * called from a page fault handler when a page is first dirtied. Hence we must
> + * be careful to check for EOF conditions here. We set the page up correctly
> + * for a written page which means we get ENOSPC checking when writing into
> + * holes and correct delalloc and unwritten extent mapping on filesystems that
> + * support these features.
> + *
> + * We are not allowed to take the i_mutex here so we have to play games to
> + * protect against truncate races as the page could now be beyond EOF.  Because
> + * vmtruncate() writes the inode size before removing pages, once we have the
> + * page lock we can determine safely if the page is beyond EOF. If it is not
> + * beyond EOF, then the page is guaranteed safe against truncation until we
> + * unlock the page.
> + */
> +int
> +block_page_mkwrite(struct vm_area_struct *vma, struct page *page,
> +		   get_block_t get_block)
> +{
> +	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
> +	unsigned long end;
> +	loff_t size;
> +	int ret = -EINVAL;
> +
> +	lock_page(page);
> +	size = i_size_read(inode);
> +	if ((page->mapping != inode->i_mapping) ||
> +	    ((page->index << PAGE_CACHE_SHIFT) > size)) {
> +		/* page got truncated out from underneath us */
> +		goto out_unlock;
> +	}

I see your explanation above, but I still don't see why this can't
just follow the conventional if (!page->mapping) check for truncation.
If the test happens to be performed after truncate concurrently
decreases i_size, then the blocks are going to get truncated by the
truncate afterwards anyway.

> +
> +	/* page is wholly or partially inside EOF */
> +	if (((page->index + 1) << PAGE_CACHE_SHIFT) > size)
> +		end = size & ~PAGE_CACHE_MASK;
> +	else
> +		end = PAGE_CACHE_SIZE;
> +
> +	ret = block_prepare_write(page, 0, end, get_block);
> +	if (!ret)
> +		ret = block_commit_write(page, 0, end);
> +
> +out_unlock:
> +	unlock_page(page);
> +	return ret;
> +}

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-03-19  6:37 ` Nick Piggin
@ 2007-03-19  8:12   ` David Chinner
  2007-03-19  9:57     ` Nick Piggin
  0 siblings, 1 reply; 19+ messages in thread
From: David Chinner @ 2007-03-19  8:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: David Chinner, lkml, linux-mm, linux-fsdevel

On Mon, Mar 19, 2007 at 05:37:03PM +1100, Nick Piggin wrote:
> David Chinner wrote:
> > 
> >+/*
> >+ * block_page_mkwrite() is not allowed to change the file size as it gets
> >+ * called from a page fault handler when a page is first dirtied. Hence 
> >we must
> >+ * be careful to check for EOF conditions here. We set the page up 
> >correctly
> >+ * for a written page which means we get ENOSPC checking when writing into
> >+ * holes and correct delalloc and unwritten extent mapping on filesystems 
> >that
> >+ * support these features.
> >+ *
> >+ * We are not allowed to take the i_mutex here so we have to play games to
> >+ * protect against truncate races as the page could now be beyond EOF.  
> >Because
> >+ * vmtruncate() writes the inode size before removing pages, once we have 
> >the
> >+ * page lock we can determine safely if the page is beyond EOF. If it is 
> >not
> >+ * beyond EOF, then the page is guaranteed safe against truncation until 
> >we
> >+ * unlock the page.
> >+ */
> >+int
> >+block_page_mkwrite(struct vm_area_struct *vma, struct page *page,
> >+		   get_block_t get_block)
> >+{
> >+	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
> >+	unsigned long end;
> >+	loff_t size;
> >+	int ret = -EINVAL;
> >+
> >+	lock_page(page);
> >+	size = i_size_read(inode);
> >+	if ((page->mapping != inode->i_mapping) ||
> >+	    ((page->index << PAGE_CACHE_SHIFT) > size)) {
> >+		/* page got truncated out from underneath us */
> >+		goto out_unlock;
> >+	}
> 
> I see your explanation above, but I still don't see why this can't
> just follow the conventional if (!page->mapping) check for truncation.
> If the test happens to be performed after truncate concurrently
> decreases i_size, then the blocks are going to get truncated by the
> truncate afterwards anyway.

We have to read the inode size in the normal case so that we know if
the page is at EOF and is a partial page so we don't allocate past EOF in
block_prepare_write().  Hence it seems like a no-brainer to me to check
and error out on a page that we *know* is beyond EOF.

I can drop the check if you see no value in it - I just don't
like the idea of ignoring obvious boundary condition violations...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-03-18 23:30 [PATCH 1 of 2] block_page_mkwrite() Implementation V2 David Chinner
  2007-03-19  6:37 ` Nick Piggin
@ 2007-03-19  9:22 ` Christoph Hellwig
  2007-03-19 10:11   ` Nick Piggin
  2007-05-16 10:19 ` David Howells
  2 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2007-03-19  9:22 UTC (permalink / raw)
  To: David Chinner; +Cc: lkml, linux-mm, linux-fsdevel, nickpiggin

On Mon, Mar 19, 2007 at 10:30:08AM +1100, David Chinner wrote:
> 
> Generic page_mkwrite functionality.
> 
> Filesystems that make use of the VM ->page_mkwrite() callout will generally use
> the same core code to implement it. There are several tricky truncate-related
> issues that we need to deal with here as we cannot take the i_mutex as we
> normally would for these paths.  These issues are not documented anywhere yet
> so block_page_mkwrite() seems like the best place to start.

This will need some updates when ->fault replaces ->page_mkwrite.

Nich, what's the plan for merging ->fault?


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-03-19  8:12   ` David Chinner
@ 2007-03-19  9:57     ` Nick Piggin
  2007-03-19 10:28       ` Nick Piggin
  0 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2007-03-19  9:57 UTC (permalink / raw)
  To: David Chinner; +Cc: lkml, linux-mm, linux-fsdevel

David Chinner wrote:
> On Mon, Mar 19, 2007 at 05:37:03PM +1100, Nick Piggin wrote:
> 
>>David Chinner wrote:
>>

>>>+block_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>>>+		   get_block_t get_block)
>>>+{
>>>+	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
>>>+	unsigned long end;
>>>+	loff_t size;
>>>+	int ret = -EINVAL;
>>>+
>>>+	lock_page(page);
>>>+	size = i_size_read(inode);
>>>+	if ((page->mapping != inode->i_mapping) ||
>>>+	    ((page->index << PAGE_CACHE_SHIFT) > size)) {
>>>+		/* page got truncated out from underneath us */
>>>+		goto out_unlock;
>>>+	}
>>
>>I see your explanation above, but I still don't see why this can't
>>just follow the conventional if (!page->mapping) check for truncation.
>>If the test happens to be performed after truncate concurrently
>>decreases i_size, then the blocks are going to get truncated by the
>>truncate afterwards anyway.
> 
> 
> We have to read the inode size in the normal case so that we know if
> the page is at EOF and is a partial page so we don't allocate past EOF in
> block_prepare_write().  Hence it seems like a no-brainer to me to check
> and error out on a page that we *know* is beyond EOF.
> 
> I can drop the check if you see no value in it - I just don't
> like the idea of ignoring obvious boundary condition violations...

I would prefer it dropped, to be honest. I can see how the check does
pick up that corner case, however truncate is difficult enough (at
least, it has been an endless source of problems) that we want to keep
everyone else simple and have all the non-trivial stuff in truncate.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-03-19  9:22 ` Christoph Hellwig
@ 2007-03-19 10:11   ` Nick Piggin
  2007-03-19 12:22     ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2007-03-19 10:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: David Chinner, lkml, linux-mm, linux-fsdevel

Christoph Hellwig wrote:
> On Mon, Mar 19, 2007 at 10:30:08AM +1100, David Chinner wrote:
> 
>>Generic page_mkwrite functionality.
>>
>>Filesystems that make use of the VM ->page_mkwrite() callout will generally use
>>the same core code to implement it. There are several tricky truncate-related
>>issues that we need to deal with here as we cannot take the i_mutex as we
>>normally would for these paths.  These issues are not documented anywhere yet
>>so block_page_mkwrite() seems like the best place to start.
> 
> 
> This will need some updates when ->fault replaces ->page_mkwrite.
> 
> Nich, what's the plan for merging ->fault?

I've got the patches in -mm now. I hope they will get merged when the
the next window opens.

I didn't submit the ->page_mkwrite conversion yet, because I didn't
have any callers to look at. It is is slightly less trivial than for
nopage and nopfn, so having David's block_page_mkwrite is helpful.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-03-19  9:57     ` Nick Piggin
@ 2007-03-19 10:28       ` Nick Piggin
  0 siblings, 0 replies; 19+ messages in thread
From: Nick Piggin @ 2007-03-19 10:28 UTC (permalink / raw)
  To: Nick Piggin; +Cc: David Chinner, lkml, linux-mm, linux-fsdevel

Nick Piggin wrote:
> David Chinner wrote:
> 
>> On Mon, Mar 19, 2007 at 05:37:03PM +1100, Nick Piggin wrote:
>>
>>> David Chinner wrote:
>>>
> 
>>>> +block_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>>>> +           get_block_t get_block)
>>>> +{
>>>> +    struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
>>>> +    unsigned long end;
>>>> +    loff_t size;
>>>> +    int ret = -EINVAL;
>>>> +
>>>> +    lock_page(page);
>>>> +    size = i_size_read(inode);
>>>> +    if ((page->mapping != inode->i_mapping) ||
>>>> +        ((page->index << PAGE_CACHE_SHIFT) > size)) {
>>>> +        /* page got truncated out from underneath us */
>>>> +        goto out_unlock;
>>>> +    }
>>>
>>>
>>> I see your explanation above, but I still don't see why this can't
>>> just follow the conventional if (!page->mapping) check for truncation.
>>> If the test happens to be performed after truncate concurrently
>>> decreases i_size, then the blocks are going to get truncated by the
>>> truncate afterwards anyway.
>>
>>
>>
>> We have to read the inode size in the normal case so that we know if
>> the page is at EOF and is a partial page so we don't allocate past EOF in
>> block_prepare_write().  Hence it seems like a no-brainer to me to check
>> and error out on a page that we *know* is beyond EOF.
>>
>> I can drop the check if you see no value in it - I just don't
>> like the idea of ignoring obvious boundary condition violations...
> 
> 
> I would prefer it dropped, to be honest. I can see how the check does
> pick up that corner case, however truncate is difficult enough (at
> least, it has been an endless source of problems) that we want to keep
> everyone else simple and have all the non-trivial stuff in truncate.
> 

Hmm, actually on second thoughts it probably is reasonable to recheck
i_size under the page lock... we need to do similar in the nopage path
to close the nopage vs invalidate race.

However, the already-truncated test I think can just be !page->mapping:
there should be no way for the page mapping to change to something
other than NULL.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-03-19 10:11   ` Nick Piggin
@ 2007-03-19 12:22     ` Christoph Hellwig
  2007-03-20  5:34       ` Nick Piggin
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2007-03-19 12:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Hellwig, David Chinner, lkml, linux-mm, linux-fsdevel

On Mon, Mar 19, 2007 at 09:11:31PM +1100, Nick Piggin wrote:
> I've got the patches in -mm now. I hope they will get merged when the
> the next window opens.
> 
> I didn't submit the ->page_mkwrite conversion yet, because I didn't
> have any callers to look at. It is is slightly less trivial than for
> nopage and nopfn, so having David's block_page_mkwrite is helpful.

Yes.  I was just wondering whether it makes more sense to do this
functionality directly ontop of ->fault instead of converting i over
real soon.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-03-19 12:22     ` Christoph Hellwig
@ 2007-03-20  5:34       ` Nick Piggin
  0 siblings, 0 replies; 19+ messages in thread
From: Nick Piggin @ 2007-03-20  5:34 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: David Chinner, lkml, linux-mm, linux-fsdevel

Christoph Hellwig wrote:
> On Mon, Mar 19, 2007 at 09:11:31PM +1100, Nick Piggin wrote:
> 
>>I've got the patches in -mm now. I hope they will get merged when the
>>the next window opens.
>>
>>I didn't submit the ->page_mkwrite conversion yet, because I didn't
>>have any callers to look at. It is is slightly less trivial than for
>>nopage and nopfn, so having David's block_page_mkwrite is helpful.
> 
> 
> Yes.  I was just wondering whether it makes more sense to do this
> functionality directly ontop of ->fault instead of converting i over
> real soon.

I would personally prefer that, but I don't want to block David's
patch from being merged if the ->fault patches do not get in next
cycle. If the fault patches do make it in first, then yes we should
do the page_mkwrite conversion before merging David's patch.

I'll keep an eye on it, and try to do the right thing.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-03-18 23:30 [PATCH 1 of 2] block_page_mkwrite() Implementation V2 David Chinner
  2007-03-19  6:37 ` Nick Piggin
  2007-03-19  9:22 ` Christoph Hellwig
@ 2007-05-16 10:19 ` David Howells
  2007-05-16 11:59   ` Nick Piggin
                     ` (4 more replies)
  2 siblings, 5 replies; 19+ messages in thread
From: David Howells @ 2007-05-16 10:19 UTC (permalink / raw)
  To: David Chinner; +Cc: lkml, linux-mm, linux-fsdevel

David Chinner <dgc@sgi.com> wrote:

> +	ret = block_prepare_write(page, 0, end, get_block);

As I understand the way prepare_write() works, this is incorrect.

The start and end points passed to block_prepare_write() delimit the region of
the page that is going to be modified.  This means that prepare_write()
doesn't need to fill it in if the page is not up to date.  It does, however,
need to fill in the region before (if present) and the region after (if
present).  Look at it like this:

		+---------------+
		|               |
		|               |	<-- Filled in by prepare_write()
		|               |
	to->	|:::::::::::::::|
		|               |
		|               |	<-- Filled in by caller
		|               |
	offset->|:::::::::::::::|
		|               |
		|               |	<-- Filled in by prepare_write()
		|               |
	page->	+---------------+

However, page_mkwrite() isn't told which bit of the page is going to be
written to.  This means it has to ask prepare_write() to make sure the whole
page is filled in.  In other words, offset and to must be equal (in AFS I set
them both to 0).

With what you've got, if, say, 'offset' is 0 and 'to' is calculated at
PAGE_SIZE, then if the page is not up to date for any reason, then none of the
page will be updated before the page is written on by the faulting code.

You probably get away with this in a blockdev-based filesystem because it's
unlikely that the page will cease to be up to date.

However, if someone adds a syscall to punch holes in files, this may change...

David

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-05-16 10:19 ` David Howells
@ 2007-05-16 11:59   ` Nick Piggin
  2007-05-16 12:09   ` David Woodhouse
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Nick Piggin @ 2007-05-16 11:59 UTC (permalink / raw)
  To: David Howells; +Cc: David Chinner, lkml, linux-mm, linux-fsdevel

David Howells wrote:
> David Chinner <dgc@sgi.com> wrote:
> 
> 
>>+	ret = block_prepare_write(page, 0, end, get_block);
> 
> 
> As I understand the way prepare_write() works, this is incorrect.

I think it is actually OK.


> The start and end points passed to block_prepare_write() delimit the region of
> the page that is going to be modified.  This means that prepare_write()
> doesn't need to fill it in if the page is not up to date.  It does, however,
> need to fill in the region before (if present) and the region after (if
> present).  Look at it like this:
> 
> 		+---------------+
> 		|               |
> 		|               |	<-- Filled in by prepare_write()
> 		|               |
> 	to->	|:::::::::::::::|
> 		|               |
> 		|               |	<-- Filled in by caller
> 		|               |
> 	offset->|:::::::::::::::|
> 		|               |
> 		|               |	<-- Filled in by prepare_write()
> 		|               |
> 	page->	+---------------+
> 
> However, page_mkwrite() isn't told which bit of the page is going to be
> written to.  This means it has to ask prepare_write() to make sure the whole
> page is filled in.  In other words, offset and to must be equal (in AFS I set
> them both to 0).

Dave is using prepare_write here to ensure blocks are allocated in the
given range. The filesystem's ->nopage function must ensure it is uptodate
before allowing it to be mapped.


> With what you've got, if, say, 'offset' is 0 and 'to' is calculated at
> PAGE_SIZE, then if the page is not up to date for any reason, then none of the
> page will be updated before the page is written on by the faulting code.

Consider that the code currently works OK today _without_ page_mkwrite.
page_mkwrite is being added to do block allocation / reservation.


> You probably get away with this in a blockdev-based filesystem because it's
> unlikely that the page will cease to be up to date.
> 
> However, if someone adds a syscall to punch holes in files, this may change...

We have one. Strangely enough, it is done with madvise(MADV_REMOVE).

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-05-16 10:19 ` David Howells
  2007-05-16 11:59   ` Nick Piggin
@ 2007-05-16 12:09   ` David Woodhouse
  2007-05-16 12:53     ` Chris Mason
  2007-05-16 13:20   ` David Howells
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: David Woodhouse @ 2007-05-16 12:09 UTC (permalink / raw)
  To: David Howells; +Cc: David Chinner, lkml, linux-mm, linux-fsdevel

On Wed, 2007-05-16 at 11:19 +0100, David Howells wrote:
> The start and end points passed to block_prepare_write() delimit the region of
> the page that is going to be modified.  This means that prepare_write()
> doesn't need to fill it in if the page is not up to date. 

Really? Is it _really_ going to be modified? Even if the pointer
userspace gave to write() is bogus, and is going to fault half-way
through the copy_from_user()?

-- 
dwmw2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-05-16 12:09   ` David Woodhouse
@ 2007-05-16 12:53     ` Chris Mason
  2007-05-16 13:04       ` Nick Piggin
  0 siblings, 1 reply; 19+ messages in thread
From: Chris Mason @ 2007-05-16 12:53 UTC (permalink / raw)
  To: David Woodhouse
  Cc: David Howells, David Chinner, lkml, linux-mm, linux-fsdevel

On Wed, May 16, 2007 at 08:09:19PM +0800, David Woodhouse wrote:
> On Wed, 2007-05-16 at 11:19 +0100, David Howells wrote:
> > The start and end points passed to block_prepare_write() delimit the region of
> > the page that is going to be modified.  This means that prepare_write()
> > doesn't need to fill it in if the page is not up to date. 
> 
> Really? Is it _really_ going to be modified? Even if the pointer
> userspace gave to write() is bogus, and is going to fault half-way
> through the copy_from_user()?

This is why there are so many variations on copy_from_user that zero on
faults.  One way or another, the prepare_write/commit_write pair are
responsible for filling it in.

-chris


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-05-16 12:53     ` Chris Mason
@ 2007-05-16 13:04       ` Nick Piggin
  2007-05-16 13:10         ` Chris Mason
  0 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2007-05-16 13:04 UTC (permalink / raw)
  To: Chris Mason
  Cc: David Woodhouse, David Howells, David Chinner, lkml, linux-mm,
	linux-fsdevel

Chris Mason wrote:
> On Wed, May 16, 2007 at 08:09:19PM +0800, David Woodhouse wrote:
> 
>>On Wed, 2007-05-16 at 11:19 +0100, David Howells wrote:
>>
>>>The start and end points passed to block_prepare_write() delimit the region of
>>>the page that is going to be modified.  This means that prepare_write()
>>>doesn't need to fill it in if the page is not up to date. 
>>
>>Really? Is it _really_ going to be modified? Even if the pointer
>>userspace gave to write() is bogus, and is going to fault half-way
>>through the copy_from_user()?
> 
> 
> This is why there are so many variations on copy_from_user that zero on
> faults.  One way or another, the prepare_write/commit_write pair are
> responsible for filling it in.

I'll add to David's question about David's comment on David's patch, yes
it will be modified but in that case it would be zero-filled as Chris
says. However I believe this is incorrect behaviour.

It is possible to easily fix that so it would only happen via a tiny race
window (where the source memory gets unmapped at just the right time)
however nobody seemed to interested (just by checking the return value of
fault_in_pages_readable).

The buffered write patches I'm working on fix that (among other things) of
course. But they do away with prepare_write and introduce new aops, and
they indeed must not expect the full range to have been written to.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-05-16 13:04       ` Nick Piggin
@ 2007-05-16 13:10         ` Chris Mason
  0 siblings, 0 replies; 19+ messages in thread
From: Chris Mason @ 2007-05-16 13:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: David Woodhouse, David Howells, David Chinner, lkml, linux-mm,
	linux-fsdevel

On Wed, May 16, 2007 at 11:04:11PM +1000, Nick Piggin wrote:
> Chris Mason wrote:
> >On Wed, May 16, 2007 at 08:09:19PM +0800, David Woodhouse wrote:
> >
> >>On Wed, 2007-05-16 at 11:19 +0100, David Howells wrote:
> >>
> >>>The start and end points passed to block_prepare_write() delimit the 
> >>>region of
> >>>the page that is going to be modified.  This means that prepare_write()
> >>>doesn't need to fill it in if the page is not up to date. 
> >>
> >>Really? Is it _really_ going to be modified? Even if the pointer
> >>userspace gave to write() is bogus, and is going to fault half-way
> >>through the copy_from_user()?
> >
> >
> >This is why there are so many variations on copy_from_user that zero on
> >faults.  One way or another, the prepare_write/commit_write pair are
> >responsible for filling it in.
> 
> I'll add to David's question about David's comment on David's patch, yes
> it will be modified but in that case it would be zero-filled as Chris
> says. However I believe this is incorrect behaviour.
> 
> It is possible to easily fix that so it would only happen via a tiny race
> window (where the source memory gets unmapped at just the right time)
> however nobody seemed to interested (just by checking the return value of
> fault_in_pages_readable).
> 
> The buffered write patches I'm working on fix that (among other things) of
> course. But they do away with prepare_write and introduce new aops, and
> they indeed must not expect the full range to have been written to.

I was also wrong to say prepare_write and commit_write are
responsible, they work together with their callers to make the right
things happen.  Oh well, so much for trying to give a short answer for a
chunk of code full of corner cases ;)

-chris


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-05-16 10:19 ` David Howells
  2007-05-16 11:59   ` Nick Piggin
  2007-05-16 12:09   ` David Woodhouse
@ 2007-05-16 13:20   ` David Howells
  2007-05-16 13:41     ` Nick Piggin
  2007-05-16 13:25   ` David Howells
  2007-05-16 23:28   ` David Chinner
  4 siblings, 1 reply; 19+ messages in thread
From: David Howells @ 2007-05-16 13:20 UTC (permalink / raw)
  To: Nick Piggin; +Cc: David Chinner, lkml, linux-mm, linux-fsdevel

Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Dave is using prepare_write here to ensure blocks are allocated in the
> given range. The filesystem's ->nopage function must ensure it is uptodate
> before allowing it to be mapped.

Which is fine... assuming it's called.  For blockdev-based filesystems, this
is probably true.  But I'm not sure you can guarantee it.

I've seen Ext3, for example, unlocking a page that isn't yet uptodate.
nopage() won't get called on it again, but prepare_write() might.  I don't
know why this happens, but it's something I've fallen over in doing
CacheFiles.  When reading, readpage() is just called on it again and again
until it is up to date.  When writing, prepare_write() is called correctly.

> Consider that the code currently works OK today _without_ page_mkwrite.
> page_mkwrite is being added to do block allocation / reservation.

Which doesn't prove anything.  All it means is that PG_uptodate being unset is
handled elsewhere.

David

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-05-16 10:19 ` David Howells
                     ` (2 preceding siblings ...)
  2007-05-16 13:20   ` David Howells
@ 2007-05-16 13:25   ` David Howells
  2007-05-16 23:28   ` David Chinner
  4 siblings, 0 replies; 19+ messages in thread
From: David Howells @ 2007-05-16 13:25 UTC (permalink / raw)
  To: David Woodhouse; +Cc: David Chinner, lkml, linux-mm, linux-fsdevel

David Woodhouse <dwmw2@infradead.org> wrote:

> Really? Is it _really_ going to be modified?

Well, generic_file_buffered_write() doesn't check the success of the copy
before calling commit_write(), presumably because it uses
fault_in_pages_readable() first.

David

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-05-16 13:20   ` David Howells
@ 2007-05-16 13:41     ` Nick Piggin
  0 siblings, 0 replies; 19+ messages in thread
From: Nick Piggin @ 2007-05-16 13:41 UTC (permalink / raw)
  To: David Howells; +Cc: David Chinner, lkml, linux-mm, linux-fsdevel

David Howells wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>>Dave is using prepare_write here to ensure blocks are allocated in the
>>given range. The filesystem's ->nopage function must ensure it is uptodate
>>before allowing it to be mapped.
> 
> 
> Which is fine... assuming it's called.  For blockdev-based filesystems, this
> is probably true.  But I'm not sure you can guarantee it.
> 
> I've seen Ext3, for example, unlocking a page that isn't yet uptodate.
> nopage() won't get called on it again, but prepare_write() might.  I don't
> know why this happens, but it's something I've fallen over in doing
> CacheFiles.  When reading, readpage() is just called on it again and again
> until it is up to date.  When writing, prepare_write() is called correctly.

There are bugs in the core VM and block filesystem code where !uptodate pages
are left in pagetables. Some of these are fixed in -mm.

But they aren't a good reason to invent completely different ways to do things.


>>Consider that the code currently works OK today _without_ page_mkwrite.
>>page_mkwrite is being added to do block allocation / reservation.
> 
> 
> Which doesn't prove anything.  All it means is that PG_uptodate being unset is
> handled elsewhere.

It means that Dave's page_mkwrite function will do the block allocation
and everything else continues as it is. Your suggested change to pass in
offset == to is just completely wrong for this.

PG_uptodate being unset should be done via pagecache invalidation or truncation
APIs, which (sometimes... modulo bugs) tear down pagetables first.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
  2007-05-16 10:19 ` David Howells
                     ` (3 preceding siblings ...)
  2007-05-16 13:25   ` David Howells
@ 2007-05-16 23:28   ` David Chinner
  4 siblings, 0 replies; 19+ messages in thread
From: David Chinner @ 2007-05-16 23:28 UTC (permalink / raw)
  To: David Howells; +Cc: David Chinner, lkml, linux-mm, linux-fsdevel

On Wed, May 16, 2007 at 11:19:29AM +0100, David Howells wrote:
> 
> However, page_mkwrite() isn't told which bit of the page is going to be
> written to.  This means it has to ask prepare_write() to make sure the whole
> page is filled in.  In other words, offset and to must be equal (in AFS I set
> them both to 0).

The assumption is the page is already up to date and we are writing
the whole page unless EOF lands inside the page. AFAICT, we can't
get called with a page that is not uptodate and so page filling is
not something we should be doing (or want to be doing) here. All we
want to do is to be able to change the mapping from a read to a
write mapping (e.g. a read mapping of a hole needs to be changed on
write) and do the relevant space reservation/allocation and buffer
mapping needed for this change.

> However, if someone adds a syscall to punch holes in files, this may change...

We already have them - ioctl(XFS_IOC_UNRESVSP) and
madvise(MADV_REMOVE) - and another - fallocate(FA_DEALLOCATE) - is
on it's way. Racing with truncates should already be handled by the
truncate code (i.e. partial page truncation does the zero filling).

/me makes note to implement ->truncate_range() in XFS for MADV_REMOVE.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2007-05-16 23:29 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-18 23:30 [PATCH 1 of 2] block_page_mkwrite() Implementation V2 David Chinner
2007-03-19  6:37 ` Nick Piggin
2007-03-19  8:12   ` David Chinner
2007-03-19  9:57     ` Nick Piggin
2007-03-19 10:28       ` Nick Piggin
2007-03-19  9:22 ` Christoph Hellwig
2007-03-19 10:11   ` Nick Piggin
2007-03-19 12:22     ` Christoph Hellwig
2007-03-20  5:34       ` Nick Piggin
2007-05-16 10:19 ` David Howells
2007-05-16 11:59   ` Nick Piggin
2007-05-16 12:09   ` David Woodhouse
2007-05-16 12:53     ` Chris Mason
2007-05-16 13:04       ` Nick Piggin
2007-05-16 13:10         ` Chris Mason
2007-05-16 13:20   ` David Howells
2007-05-16 13:41     ` Nick Piggin
2007-05-16 13:25   ` David Howells
2007-05-16 23:28   ` David Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).