LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH][RFC] truncate vs add_to_page_cache race
@ 2004-05-14  2:01 Nick Piggin
  2004-05-14  2:20 ` Nick Piggin
  2004-05-14  2:33 ` Andrew Morton
  0 siblings, 2 replies; 9+ messages in thread
From: Nick Piggin @ 2004-05-14  2:01 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton, viro, Linus Torvalds, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1003 bytes --]

Hi,
I think there is a race between truncate and do_generic_mapping_read.

do_generic_mapping_read()
{
	check i_size -> ok
no_cached_page:
	allocate a page
	add_to_page_cache
	readpage
}

And what can happen is truncate gets to the file after i_size is
checked and before the page is added to the page cache.

I asked Hugh about this because a quick search showed he was the
last one to make a noise about this kind of thing. He wasn't up
to speed with the current code, but agreed it looks fishy.

OK, I made a debug patch to printk and schedule_timeout in this
race window so I can easily truncate the file. When this happens,
it turns out that the readpage thinks it is reading a hole and
fills the page with zeros -> invalid result?

I have attached a patch which uses i_lock to close this race
AFAIKS. Lightly tested only. I can't experimentally verify
that it closes the race because I have not been able to reproduce
it without changing the code.

Comments? Too ugly? Have I've missed something?

[-- Attachment #2: truncate-vs-add_to_page_cache.patch --]
[-- Type: text/x-patch, Size: 4258 bytes --]

 linux-2.6-npiggin/mm/filemap.c |   39 +++++++++++++++++++++++++++------------
 linux-2.6-npiggin/mm/memory.c  |    9 +++++++++
 linux-2.6-npiggin/mm/nommu.c   |    8 ++++++++
 mm/readahead.c                 |    0 
 4 files changed, 44 insertions(+), 12 deletions(-)

diff -puN mm/filemap.c~truncate-vs-add_to_page_cache mm/filemap.c
--- linux-2.6/mm/filemap.c~truncate-vs-add_to_page_cache	2004-05-14 11:16:03.000000000 +1000
+++ linux-2.6-npiggin/mm/filemap.c	2004-05-14 11:55:46.000000000 +1000
@@ -78,6 +78,9 @@
  *  ->i_sem
  *    ->i_alloc_sem             (various)
  *
+ *  ->i_lock			(do_generic_mapping_read)
+ *    ->mapping->tree_lock	(add_to_page_cache)
+ *
  *  ->inode_lock
  *    ->sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
@@ -649,8 +652,9 @@ void do_generic_mapping_read(struct addr
 			     read_actor_t actor)
 {
 	struct inode *inode = mapping->host;
-	unsigned long index, offset;
+	unsigned long index, end_index, offset;
 	struct page *cached_page;
+	loff_t isize;
 	int error;
 	struct file_ra_state ra = *_ra;
 
@@ -658,26 +662,28 @@ void do_generic_mapping_read(struct addr
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	offset = *ppos & ~PAGE_CACHE_MASK;
 
+	isize = i_size_read(inode);
+	end_index = isize >> PAGE_CACHE_SHIFT;
+	if (index > end_index)
+		goto out;
+
 	for (;;) {
 		struct page *page;
-		unsigned long end_index, nr, ret;
-		loff_t isize = i_size_read(inode);
+		unsigned long nr, ret;
 
-		end_index = isize >> PAGE_CACHE_SHIFT;
-			
-		if (index > end_index)
-			break;
+		cond_resched();
+
+		/* Establish the number of bytes to read from this page */
 		nr = PAGE_CACHE_SIZE;
 		if (index == end_index) {
 			nr = isize & ~PAGE_CACHE_MASK;
 			if (nr <= offset)
 				break;
 		}
+		nr = nr - offset;
 
-		cond_resched();
 		page_cache_readahead(mapping, &ra, filp, index);
 
-		nr = nr - offset;
 find_page:
 		page = find_get_page(mapping, index);
 		if (unlikely(page == NULL)) {
@@ -721,9 +727,6 @@ page_ok:
 		break;
 
 page_not_up_to_date:
-		if (PageUptodate(page))
-			goto page_ok;
-
 		/* Get exclusive access to the page ... */
 		lock_page(page);
 
@@ -770,8 +773,19 @@ no_cached_page:
 				break;
 			}
 		}
+
+		/* Take the i_lock to protect against a concurrent truncate */
+		spin_lock(&inode->i_lock);
+		isize = i_size_read(inode);
+		end_index = isize >> PAGE_CACHE_SHIFT;
+		if (index > end_index) {
+			spin_unlock(&inode->i_lock);
+			goto out;
+		}
 		error = add_to_page_cache_lru(cached_page, mapping,
 						index, GFP_KERNEL);
+		spin_unlock(&inode->i_lock);
+
 		if (error) {
 			if (error == -EEXIST)
 				goto find_page;
@@ -783,6 +797,7 @@ no_cached_page:
 		goto readpage;
 	}
 
+out:
 	*_ra = ra;
 
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
diff -puN mm/memory.c~truncate-vs-add_to_page_cache mm/memory.c
--- linux-2.6/mm/memory.c~truncate-vs-add_to_page_cache	2004-05-14 11:18:26.000000000 +1000
+++ linux-2.6-npiggin/mm/memory.c	2004-05-14 11:31:27.000000000 +1000
@@ -1222,7 +1222,16 @@ int vmtruncate(struct inode * inode, lof
 
 	if (inode->i_size < offset)
 		goto do_expand;
+	
+	/*
+	 * Need i_lock to serialise against a concurrent reader adding a new
+	 * page to the pagecache. See mm/filemap.c.
+	 *
+	 * This isn't needed if i_size is being expanded.
+	 */
+	spin_lock(&inode->i_lock);
 	i_size_write(inode, offset);
+	spin_unlock(&inode->i_lock);
 	unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
 	truncate_inode_pages(mapping, offset);
 	goto out_truncate;
diff -puN mm/nommu.c~truncate-vs-add_to_page_cache mm/nommu.c
--- linux-2.6/mm/nommu.c~truncate-vs-add_to_page_cache	2004-05-14 11:18:29.000000000 +1000
+++ linux-2.6-npiggin/mm/nommu.c	2004-05-14 11:31:55.000000000 +1000
@@ -48,7 +48,15 @@ int vmtruncate(struct inode *inode, loff
 
 	if (inode->i_size < offset)
 		goto do_expand;
+	/*
+	 * Need i_lock to serialise against a concurrent reader adding a new
+	 * page to the pagecache. See mm/filemap.c.
+	 *
+	 * This isn't needed if i_size is being expanded.
+	 */
+	spin_lock(&inode->i_lock);
 	i_size_write(inode, offset);
+	spin_unlock(&inode->i_lock);
 
 	truncate_inode_pages(mapping, offset);
 	goto out_truncate;
diff -puN mm/readahead.c~truncate-vs-add_to_page_cache mm/readahead.c

_

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH][RFC] truncate vs add_to_page_cache race
  2004-05-14  2:01 [PATCH][RFC] truncate vs add_to_page_cache race Nick Piggin
@ 2004-05-14  2:20 ` Nick Piggin
  2004-05-14  2:33 ` Andrew Morton
  1 sibling, 0 replies; 9+ messages in thread
From: Nick Piggin @ 2004-05-14  2:20 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, viro, Linus Torvalds, linux-kernel

Nick Piggin wrote:
> Hi,
> I think there is a race between truncate and do_generic_mapping_read.

> Comments? Too ugly? Have I've missed something?

Hmm, no I think the right timing can still cause readpage to see
the shortened i_size and fill the page with zeros.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH][RFC] truncate vs add_to_page_cache race
  2004-05-14  2:01 [PATCH][RFC] truncate vs add_to_page_cache race Nick Piggin
  2004-05-14  2:20 ` Nick Piggin
@ 2004-05-14  2:33 ` Andrew Morton
  2004-05-14  2:39   ` Nick Piggin
  1 sibling, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2004-05-14  2:33 UTC (permalink / raw)
  To: Nick Piggin; +Cc: hugh, viro, torvalds, linux-kernel

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>  I think there is a race between truncate and do_generic_mapping_read.
> 
>  do_generic_mapping_read()
>  {
>  	check i_size -> ok
>  no_cached_page:
>  	allocate a page
>  	add_to_page_cache
>  	readpage
>  }
> 
>  And what can happen is truncate gets to the file after i_size is
>  checked and before the page is added to the page cache.
> 
>  I asked Hugh about this because a quick search showed he was the
>  last one to make a noise about this kind of thing. He wasn't up
>  to speed with the current code, but agreed it looks fishy.
> 
>  OK, I made a debug patch to printk and schedule_timeout in this
>  race window so I can easily truncate the file. When this happens,
>  it turns out that the readpage thinks it is reading a hole and
>  fills the page with zeros -> invalid result?

A zero-filled pagecache page outside i_size is OK.

If someone extends i_size and reads the page, they get zeros.  If they
truncate the file more, it gets dropped.  If they extend i_size then write
to the page, that's OK.  And page reclaim can reclaim it.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH][RFC] truncate vs add_to_page_cache race
  2004-05-14  2:33 ` Andrew Morton
@ 2004-05-14  2:39   ` Nick Piggin
  2004-05-14  3:10     ` Nick Piggin
  0 siblings, 1 reply; 9+ messages in thread
From: Nick Piggin @ 2004-05-14  2:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: hugh, viro, torvalds, linux-kernel

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:

>> OK, I made a debug patch to printk and schedule_timeout in this
>> race window so I can easily truncate the file. When this happens,
>> it turns out that the readpage thinks it is reading a hole and
>> fills the page with zeros -> invalid result?
> 
> 
> A zero-filled pagecache page outside i_size is OK.
> 

Yes. But in this case the zero filled page actually gets
read by read(2).

In any case, I think my patch won't close the race completely.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH][RFC] truncate vs add_to_page_cache race
  2004-05-14  2:39   ` Nick Piggin
@ 2004-05-14  3:10     ` Nick Piggin
  2004-05-14  3:15       ` Nick Piggin
  2004-05-14 23:29       ` Nick Piggin
  0 siblings, 2 replies; 9+ messages in thread
From: Nick Piggin @ 2004-05-14  3:10 UTC (permalink / raw)
  To: Andrew Morton, hugh; +Cc: viro, torvalds, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 689 bytes --]

Nick Piggin wrote:
> Andrew Morton wrote:
> 
>> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>>> OK, I made a debug patch to printk and schedule_timeout in this
>>> race window so I can easily truncate the file. When this happens,
>>> it turns out that the readpage thinks it is reading a hole and
>>> fills the page with zeros -> invalid result?
>>
>>
>>
>> A zero-filled pagecache page outside i_size is OK.
>>
> 
> Yes. But in this case the zero filled page actually gets
> read by read(2).
> 
> In any case, I think my patch won't close the race completely.
> 

This following patch should be right.

It causes the zeros to not get copied back unless i_size
gets extended again.

[-- Attachment #2: truncate-vs-add_to_page_cache2.patch --]
[-- Type: text/x-patch, Size: 2163 bytes --]

 linux-2.6-npiggin/mm/filemap.c |   41 +++++++++++++++++++++++++----------------
 1 files changed, 25 insertions(+), 16 deletions(-)

diff -puN mm/filemap.c~truncate-vs-add_to_page_cache2 mm/filemap.c
--- linux-2.6/mm/filemap.c~truncate-vs-add_to_page_cache2	2004-05-14 12:54:42.000000000 +1000
+++ linux-2.6-npiggin/mm/filemap.c	2004-05-14 13:09:16.000000000 +1000
@@ -661,23 +661,11 @@ void do_generic_mapping_read(struct addr
 	for (;;) {
 		struct page *page;
 		unsigned long end_index, nr, ret;
-		loff_t isize = i_size_read(inode);
-
-		end_index = isize >> PAGE_CACHE_SHIFT;
-			
-		if (index > end_index)
-			break;
-		nr = PAGE_CACHE_SIZE;
-		if (index == end_index) {
-			nr = isize & ~PAGE_CACHE_MASK;
-			if (nr <= offset)
-				break;
-		}
+		loff_t isize;
 
 		cond_resched();
 		page_cache_readahead(mapping, &ra, filp, index);
 
-		nr = nr - offset;
 find_page:
 		page = find_get_page(mapping, index);
 		if (unlikely(page == NULL)) {
@@ -687,6 +675,30 @@ find_page:
 		if (!PageUptodate(page))
 			goto page_not_up_to_date;
 page_ok:
+		/*
+		 * i_size must be checked after we have a reference
+		 * to the uptodate page. Checking i_size before readpage
+		 * can cause a racing truncate to cause readpage to see
+		 * the truncated i_size and zero-fill the page.
+		 *
+		 * Checking i_size after the readpage allows us to calculate
+		 * the correct value for "nr", which means the zero-filled
+		 * part of the page is not copied back to userspace (unless
+		 * another truncate extends the file, this is not a problem).
+		 */
+		isize = i_size_read(inode);
+		end_index = isize >> PAGE_CACHE_SHIFT;
+		if (index > end_index)
+			break;
+
+		nr = PAGE_CACHE_SIZE;
+		if (index == end_index) {
+			nr = isize & ~PAGE_CACHE_MASK;
+			if (nr <= offset)
+				break;
+		}
+		nr = nr - offset;
+
 		/* If users can be writing to this page using arbitrary
 		 * virtual addresses, take care about potential aliasing
 		 * before reading the page on the kernel side.
@@ -721,9 +733,6 @@ page_ok:
 		break;
 
 page_not_up_to_date:
-		if (PageUptodate(page))
-			goto page_ok;
-
 		/* Get exclusive access to the page ... */
 		lock_page(page);
 

_

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH][RFC] truncate vs add_to_page_cache race
  2004-05-14  3:10     ` Nick Piggin
@ 2004-05-14  3:15       ` Nick Piggin
  2004-05-14 23:29       ` Nick Piggin
  1 sibling, 0 replies; 9+ messages in thread
From: Nick Piggin @ 2004-05-14  3:15 UTC (permalink / raw)
  To: Andrew Morton, hugh; +Cc: viro, torvalds, linux-kernel

Nick Piggin wrote:
> Nick Piggin wrote:
> 
>> Andrew Morton wrote:
>>
>>> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>>
>>
>>
>>>> OK, I made a debug patch to printk and schedule_timeout in this
>>>> race window so I can easily truncate the file. When this happens,
>>>> it turns out that the readpage thinks it is reading a hole and
>>>> fills the page with zeros -> invalid result?
>>>
>>>
>>>
>>>
>>> A zero-filled pagecache page outside i_size is OK.
>>>
>>
>> Yes. But in this case the zero filled page actually gets
>> read by read(2).
>>
>> In any case, I think my patch won't close the race completely.
>>
> 
> This following patch should be right.
> 
> It causes the zeros to not get copied back unless i_size
> gets extended again.
> 

However, it causes the fast path reading off the end of a file
to always go into ->readpage and copy the non-existant page of
zeros. This could be fixed no problem, but I'll shut up and let
others comment in case I'm making a fool of myself :)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH][RFC] truncate vs add_to_page_cache race
  2004-05-14  3:10     ` Nick Piggin
  2004-05-14  3:15       ` Nick Piggin
@ 2004-05-14 23:29       ` Nick Piggin
  2004-05-14 23:50         ` Andrew Morton
  1 sibling, 1 reply; 9+ messages in thread
From: Nick Piggin @ 2004-05-14 23:29 UTC (permalink / raw)
  To: torvalds; +Cc: Andrew Morton, hugh, viro, linux-kernel

Nick Piggin wrote:

> 
> This following patch should be right.
> 
> It causes the zeros to not get copied back unless i_size
> gets extended again.
> 

OK, nobody has looked at this yet, so I'll just summarise my
ramblings.

The truncate vs add_to_page_cache race which I first misidentified
to be the problem is actually harmless and there by design as
Andrew points out. My first patch did "fix" it AFAIKS, but doubtful
if it is worth the locking and complexity.

The real problem is that readpage doesn't use the same i_size as
do_generic_mapping_read (ie. it could be changed by another racing
operation). This race window is quite large at the moment, containing
cond_resched and a page allocation.

Is this transient returning of zero fill actually a problem? Transient,
ie. a second read could see the correct i_size and not return the zeros.

My second patch gets closer to the heart of the problem: I think it
fixes all truncate races. However a write(2) could still expand the
i_size between ->readpage and calculation of nr, thus causing zero fill
to appear transiently again (one would either want to return the
written data, or a short read, not zeros).

I think the entire problem can be fixed by ensuring ->readpage and
do_generic_mapping read see the same i_size. This would either mean
passing i_size to or from ->readpage, *or* having ->readpage return
the number of bytes read, for example.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH][RFC] truncate vs add_to_page_cache race
  2004-05-14 23:29       ` Nick Piggin
@ 2004-05-14 23:50         ` Andrew Morton
  2004-05-15  0:09           ` Nick Piggin
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2004-05-14 23:50 UTC (permalink / raw)
  To: Nick Piggin; +Cc: torvalds, hugh, viro, linux-kernel

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> I think the entire problem can be fixed by ensuring ->readpage and
> do_generic_mapping read see the same i_size. This would either mean
> passing i_size to or from ->readpage, *or* having ->readpage return
> the number of bytes read, for example.

Or not check i_size in ->readpage at all?

If fixing this is going to cost extra fastpath cycles I'd be inclined to
not bother, frankly.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH][RFC] truncate vs add_to_page_cache race
  2004-05-14 23:50         ` Andrew Morton
@ 2004-05-15  0:09           ` Nick Piggin
  0 siblings, 0 replies; 9+ messages in thread
From: Nick Piggin @ 2004-05-15  0:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, hugh, viro, linux-kernel

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>>I think the entire problem can be fixed by ensuring ->readpage and
>>do_generic_mapping read see the same i_size. This would either mean
>>passing i_size to or from ->readpage, *or* having ->readpage return
>>the number of bytes read, for example.
> 
> 
> Or not check i_size in ->readpage at all?
> 

It needn't check readpage if it gets the number of bytes to read
passed to it, or gets i_size passed to it.

With do_generic_mapping_read and ->readpage each having a different
idea of how much of the page to process(*), bad things can happen.
They have different ideas about how much they need to process due to
the each one checking i_size on its own.

* That is "copy to userspace" and "read" for do_generic_mapping_read
   and ->readpages respectively.

> If fixing this is going to cost extra fastpath cycles I'd be inclined to
> not bother, frankly.
> 

What I'm thinking of shouldn't cost any cycles, it would require a
change to ->readpage API though. Preferably one where we can tell it
how many bytes to read. I can't see how else to fix it.

If this is not acceptable for 2.6, we could use a nicer variation of
my second patch which at least fixes the truncate problem, and its
remaining race is *much* more improbable.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2004-05-15  0:52 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-05-14  2:01 [PATCH][RFC] truncate vs add_to_page_cache race Nick Piggin
2004-05-14  2:20 ` Nick Piggin
2004-05-14  2:33 ` Andrew Morton
2004-05-14  2:39   ` Nick Piggin
2004-05-14  3:10     ` Nick Piggin
2004-05-14  3:15       ` Nick Piggin
2004-05-14 23:29       ` Nick Piggin
2004-05-14 23:50         ` Andrew Morton
2004-05-15  0:09           ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).