LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
@ 2007-01-12 20:45 Eric Sandeen
  2007-01-12 21:02 ` Alex Tomas
  2007-01-14 11:58 ` Dmitriy Monakhov
  0 siblings, 2 replies; 13+ messages in thread
From: Eric Sandeen @ 2007-01-12 20:45 UTC (permalink / raw)
  To: Linux Kernel Mailing List, ext4 development

I've been looking at a case where many threads are opening, unlinking, and
hardlinking files on ext3 .  At unmount time I see an oops, because the superblock's
orphan list points to a freed inode.

I did some tracing of the inodes, and it looks like this:

  ext3_unlink():[/src/linux-2.6.18/fs/ext3/namei.c:2123] adding orphan
      i_state:0x7 cpu:1 i_count:2 i_nlink:0

  ext3_orphan_add():[/src/linux-2.6.18/fs/ext3/namei.c:1890] ext3_orphan_add
      i_state:0x7 cpu:1 i_count:2 i_nlink:0

  iput():[/src/linux-2.6.18/fs/inode.c:1139] iput enter
      i_state:0x7 cpu:1 i_count:2 i_nlink:0

  ext3_link():[/src/linux-2.6.18/fs/ext3/namei.c:2202] ext3_link enter
      i_state:0x7 cpu:3 i_count:1 i_nlink:0

  ext3_inc_count():[/src/linux-2.6.18/fs/ext3/namei.c:1627] done
      i_state:0x7 cpu:3 i_count:1 i_nlink:1

The unlink gets there first, finds i_count > 0 (in use) but nlink goes to 0, so
it puts it on the orphan inode list.  Then link comes along, and bumps the link
back up to 1.  So now we are on the orphan inode list, but we are not unlinked.

Eventually when count goes to 0, and we still have 1 link, again no action is
taken to remove the inode from the orphan list, because it is still linked (i.e.
we don't go through ext3_delete())

When this inode is eventually freed, the sb orphan list gets corrupted, because 
we have freed it without first removing it from the orphan list.

I think the simple solution is to remove the inode from the orphan list
when we bump the link back up from 0 to 1.  I put that test in there because
there are other potential reasons that we might be on the list (truncates,
direct IO).

Comments?

Thanks,
-Eric

p.s. ext3_inc_count and ext3_dec_count seem misnamed, have an unused
arg, and are very infrequently called.  I'll probably submit a patch
to just put the single line of code into the caller, too.

---

Remove inode from the orphan list in ext3_link() if we might have
raced with ext3_unlink(), which potentially put it on the list.
If we're on the list with nlink > 0, we'll never get cleaned up
properly and eventually may corrupt the list.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>

Index: linux-2.6.19/fs/ext3/namei.c
===================================================================
--- linux-2.6.19.orig/fs/ext3/namei.c
+++ linux-2.6.19/fs/ext3/namei.c
@@ -2204,6 +2204,9 @@ retry:
 	inode->i_ctime = CURRENT_TIME_SEC;
 	ext3_inc_count(handle, inode);
 	atomic_inc(&inode->i_count);
+	/* did we race w/ unlink? */
+	if (inode->i_nlink == 1)
+		ext3_orphan_del(handle, inode);
 
 	err = ext3_add_nondir(handle, dentry, inode);
 	ext3_journal_stop(handle);



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
  2007-01-12 20:45 [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race Eric Sandeen
@ 2007-01-12 21:02 ` Alex Tomas
  2007-01-12 21:14   ` Eric Sandeen
                     ` (2 more replies)
  2007-01-14 11:58 ` Dmitriy Monakhov
  1 sibling, 3 replies; 13+ messages in thread
From: Alex Tomas @ 2007-01-12 21:02 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Linux Kernel Mailing List, ext4 development


interesting ..

I thought VFS doesn't allow concurrent operations.
if unlink goes first, then link should wait on the
parent's i_mutex and then found no source name.

thanks, Alex

>>>>> Eric Sandeen (ES) writes:

 ES> )
 ES> I've been looking at a case where many threads are opening, unlinking, and
 ES> hardlinking files on ext3 .  At unmount time I see an oops, because the superblock's
 ES> orphan list points to a freed inode.

 ES> I did some tracing of the inodes, and it looks like this:

 ES>   ext3_unlink():[/src/linux-2.6.18/fs/ext3/namei.c:2123] adding orphan
 ES>       i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES>   ext3_orphan_add():[/src/linux-2.6.18/fs/ext3/namei.c:1890] ext3_orphan_add
 ES>       i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES>   iput():[/src/linux-2.6.18/fs/inode.c:1139] iput enter
 ES>       i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES>   ext3_link():[/src/linux-2.6.18/fs/ext3/namei.c:2202] ext3_link enter
 ES>       i_state:0x7 cpu:3 i_count:1 i_nlink:0

 ES>   ext3_inc_count():[/src/linux-2.6.18/fs/ext3/namei.c:1627] done
 ES>       i_state:0x7 cpu:3 i_count:1 i_nlink:1

 ES> The unlink gets there first, finds i_count > 0 (in use) but nlink goes to 0, so
 ES> it puts it on the orphan inode list.  Then link comes along, and bumps the link
 ES> back up to 1.  So now we are on the orphan inode list, but we are not unlinked.

 ES> Eventually when count goes to 0, and we still have 1 link, again no action is
 ES> taken to remove the inode from the orphan list, because it is still linked (i.e.
 ES> we don't go through ext3_delete())

 ES> When this inode is eventually freed, the sb orphan list gets corrupted, because 
 ES> we have freed it without first removing it from the orphan list.

 ES> I think the simple solution is to remove the inode from the orphan list
 ES> when we bump the link back up from 0 to 1.  I put that test in there because
 ES> there are other potential reasons that we might be on the list (truncates,
 ES> direct IO).

 ES> Comments?

 ES> Thanks,
 ES> -Eric

 ES> p.s. ext3_inc_count and ext3_dec_count seem misnamed, have an unused
 ES> arg, and are very infrequently called.  I'll probably submit a patch
 ES> to just put the single line of code into the caller, too.

 ES> ---

 ES> Remove inode from the orphan list in ext3_link() if we might have
 ES> raced with ext3_unlink(), which potentially put it on the list.
 ES> If we're on the list with nlink > 0, we'll never get cleaned up
 ES> properly and eventually may corrupt the list.

 ES> Signed-off-by: Eric Sandeen <sandeen@redhat.com>

 ES> Index: linux-2.6.19/fs/ext3/namei.c
 ES> ===================================================================
 ES> --- linux-2.6.19.orig/fs/ext3/namei.c
 ES> +++ linux-2.6.19/fs/ext3/namei.c
 ES> @@ -2204,6 +2204,9 @@ retry:
 inode-> i_ctime = CURRENT_TIME_SEC;
 ES>  	ext3_inc_count(handle, inode);
 ES>  	atomic_inc(&inode->i_count);
 ES> +	/* did we race w/ unlink? */
 ES> +	if (inode->i_nlink == 1)
 ES> +		ext3_orphan_del(handle, inode);
 
 ES>  	err = ext3_add_nondir(handle, dentry, inode);
 ES>  	ext3_journal_stop(handle);


 ES> -
 ES> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
 ES> the body of a message to majordomo@vger.kernel.org
 ES> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
  2007-01-12 21:02 ` Alex Tomas
@ 2007-01-12 21:14   ` Eric Sandeen
  2007-01-12 21:25     ` Alex Tomas
  2007-01-12 21:17   ` Alex Tomas
  2007-01-12 21:24   ` Dave Kleikamp
  2 siblings, 1 reply; 13+ messages in thread
From: Eric Sandeen @ 2007-01-12 21:14 UTC (permalink / raw)
  To: Alex Tomas; +Cc: Linux Kernel Mailing List, ext4 development

Alex Tomas wrote:
> interesting ..
> 
> I thought VFS doesn't allow concurrent operations.
> if unlink goes first, then link should wait on the
> parent's i_mutex and then found no source name.
> 
> thanks, Alex

Well... I was wondering that myself, whether this race should even
happen.  But the bottom of do_unlinkat looks like:

        mutex_unlock(&nd.dentry->d_inode->i_mutex);
        if (inode)
                iput(inode);    /* truncate the inode here */
exit1:
        path_release(&nd);
exit:
        putname(name);
        return error;

so I think it's possible that link can sneak in there & find it after
the mutex is dropped...?  Is this ok? :)  It's certainly -happening-
anyway....

-Eric

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
  2007-01-12 21:02 ` Alex Tomas
  2007-01-12 21:14   ` Eric Sandeen
@ 2007-01-12 21:17   ` Alex Tomas
  2007-01-12 21:24   ` Dave Kleikamp
  2 siblings, 0 replies; 13+ messages in thread
From: Alex Tomas @ 2007-01-12 21:17 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Alex Tomas, Linux Kernel Mailing List, ext4 development


ah, it seems vfs_link() doesn't check whether source is still alive.
for example, in mkdir case vfs_mkdir() calls may_create() and
checks the parent is still there:
	if (IS_DEADDIR(dir))
		return -ENOENT;

VFS doesn't set S_DEAD on regular files, but we could check i_nlink.

thanks, Alex

>>>>> Alex Tomas (AT) writes:

 AT> interesting ..

 AT> I thought VFS doesn't allow concurrent operations.
 AT> if unlink goes first, then link should wait on the
 AT> parent's i_mutex and then found no source name.

 AT> thanks, Alex

>>>>> Eric Sandeen (ES) writes:

 ES> )
 ES> I've been looking at a case where many threads are opening, unlinking, and
 ES> hardlinking files on ext3 .  At unmount time I see an oops, because the superblock's
 ES> orphan list points to a freed inode.

 ES> I did some tracing of the inodes, and it looks like this:

 ES> ext3_unlink():[/src/linux-2.6.18/fs/ext3/namei.c:2123] adding orphan
 ES> i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES> ext3_orphan_add():[/src/linux-2.6.18/fs/ext3/namei.c:1890] ext3_orphan_add
 ES> i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES> iput():[/src/linux-2.6.18/fs/inode.c:1139] iput enter
 ES> i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES> ext3_link():[/src/linux-2.6.18/fs/ext3/namei.c:2202] ext3_link enter
 ES> i_state:0x7 cpu:3 i_count:1 i_nlink:0

 ES> ext3_inc_count():[/src/linux-2.6.18/fs/ext3/namei.c:1627] done
 ES> i_state:0x7 cpu:3 i_count:1 i_nlink:1

 ES> The unlink gets there first, finds i_count > 0 (in use) but nlink goes to 0, so
 ES> it puts it on the orphan inode list.  Then link comes along, and bumps the link
 ES> back up to 1.  So now we are on the orphan inode list, but we are not unlinked.

 ES> Eventually when count goes to 0, and we still have 1 link, again no action is
 ES> taken to remove the inode from the orphan list, because it is still linked (i.e.
 ES> we don't go through ext3_delete())

 ES> When this inode is eventually freed, the sb orphan list gets corrupted, because 
 ES> we have freed it without first removing it from the orphan list.

 ES> I think the simple solution is to remove the inode from the orphan list
 ES> when we bump the link back up from 0 to 1.  I put that test in there because
 ES> there are other potential reasons that we might be on the list (truncates,
 ES> direct IO).

 ES> Comments?

 ES> Thanks,
 ES> -Eric

 ES> p.s. ext3_inc_count and ext3_dec_count seem misnamed, have an unused
 ES> arg, and are very infrequently called.  I'll probably submit a patch
 ES> to just put the single line of code into the caller, too.

 ES> ---

 ES> Remove inode from the orphan list in ext3_link() if we might have
 ES> raced with ext3_unlink(), which potentially put it on the list.
 ES> If we're on the list with nlink > 0, we'll never get cleaned up
 ES> properly and eventually may corrupt the list.

 ES> Signed-off-by: Eric Sandeen <sandeen@redhat.com>

 ES> Index: linux-2.6.19/fs/ext3/namei.c
 ES> ===================================================================
 ES> --- linux-2.6.19.orig/fs/ext3/namei.c
 ES> +++ linux-2.6.19/fs/ext3/namei.c
 ES> @@ -2204,6 +2204,9 @@ retry:
 inode-> i_ctime = CURRENT_TIME_SEC;
 ES> ext3_inc_count(handle, inode);
 ES> atomic_inc(&inode->i_count);
 ES> +	/* did we race w/ unlink? */
 ES> +	if (inode->i_nlink == 1)
 ES> +		ext3_orphan_del(handle, inode);
 
 ES> err = ext3_add_nondir(handle, dentry, inode);
 ES> ext3_journal_stop(handle);


 ES> -
 ES> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
 ES> the body of a message to majordomo@vger.kernel.org
 ES> More majordomo info at  http://vger.kernel.org/majordomo-info.html
 AT> -
 AT> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
 AT> the body of a message to majordomo@vger.kernel.org
 AT> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
  2007-01-12 21:02 ` Alex Tomas
  2007-01-12 21:14   ` Eric Sandeen
  2007-01-12 21:17   ` Alex Tomas
@ 2007-01-12 21:24   ` Dave Kleikamp
  2 siblings, 0 replies; 13+ messages in thread
From: Dave Kleikamp @ 2007-01-12 21:24 UTC (permalink / raw)
  To: Alex Tomas; +Cc: Eric Sandeen, Linux Kernel Mailing List, ext4 development

On Sat, 2007-01-13 at 00:02 +0300, Alex Tomas wrote:
> interesting ..
> 
> I thought VFS doesn't allow concurrent operations.
> if unlink goes first, then link should wait on the
> parent's i_mutex and then found no source name.

I don't think the VFS ever takes the source's parent's i_mutex.  Unless
the source and destination's parent is the same, in which case the
i_mutex is taken after the source has already been looked up.

> thanks, Alex
> 
> >>>>> Eric Sandeen (ES) writes:
> 
>  ES> )
>  ES> I've been looking at a case where many threads are opening, unlinking, and
>  ES> hardlinking files on ext3 .  At unmount time I see an oops, because the superblock's
>  ES> orphan list points to a freed inode.
> 
>  ES> I did some tracing of the inodes, and it looks like this:
> 
>  ES>   ext3_unlink():[/src/linux-2.6.18/fs/ext3/namei.c:2123] adding orphan
>  ES>       i_state:0x7 cpu:1 i_count:2 i_nlink:0
> 
>  ES>   ext3_orphan_add():[/src/linux-2.6.18/fs/ext3/namei.c:1890] ext3_orphan_add
>  ES>       i_state:0x7 cpu:1 i_count:2 i_nlink:0
> 
>  ES>   iput():[/src/linux-2.6.18/fs/inode.c:1139] iput enter
>  ES>       i_state:0x7 cpu:1 i_count:2 i_nlink:0
> 
>  ES>   ext3_link():[/src/linux-2.6.18/fs/ext3/namei.c:2202] ext3_link enter
>  ES>       i_state:0x7 cpu:3 i_count:1 i_nlink:0
> 
>  ES>   ext3_inc_count():[/src/linux-2.6.18/fs/ext3/namei.c:1627] done
>  ES>       i_state:0x7 cpu:3 i_count:1 i_nlink:1
> 
>  ES> The unlink gets there first, finds i_count > 0 (in use) but nlink goes to 0, so
>  ES> it puts it on the orphan inode list.  Then link comes along, and bumps the link
>  ES> back up to 1.  So now we are on the orphan inode list, but we are not unlinked.
> 
>  ES> Eventually when count goes to 0, and we still have 1 link, again no action is
>  ES> taken to remove the inode from the orphan list, because it is still linked (i.e.
>  ES> we don't go through ext3_delete())
> 
>  ES> When this inode is eventually freed, the sb orphan list gets corrupted, because 
>  ES> we have freed it without first removing it from the orphan list.
> 
>  ES> I think the simple solution is to remove the inode from the orphan list
>  ES> when we bump the link back up from 0 to 1.  I put that test in there because
>  ES> there are other potential reasons that we might be on the list (truncates,
>  ES> direct IO).
> 
>  ES> Comments?
> 
>  ES> Thanks,
>  ES> -Eric
> 
>  ES> p.s. ext3_inc_count and ext3_dec_count seem misnamed, have an unused
>  ES> arg, and are very infrequently called.  I'll probably submit a patch
>  ES> to just put the single line of code into the caller, too.
> 
>  ES> ---
> 
>  ES> Remove inode from the orphan list in ext3_link() if we might have
>  ES> raced with ext3_unlink(), which potentially put it on the list.
>  ES> If we're on the list with nlink > 0, we'll never get cleaned up
>  ES> properly and eventually may corrupt the list.
> 
>  ES> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> 
>  ES> Index: linux-2.6.19/fs/ext3/namei.c
>  ES> ===================================================================
>  ES> --- linux-2.6.19.orig/fs/ext3/namei.c
>  ES> +++ linux-2.6.19/fs/ext3/namei.c
>  ES> @@ -2204,6 +2204,9 @@ retry:
>  inode-> i_ctime = CURRENT_TIME_SEC;
>  ES>  	ext3_inc_count(handle, inode);
>  ES>  	atomic_inc(&inode->i_count);
>  ES> +	/* did we race w/ unlink? */
>  ES> +	if (inode->i_nlink == 1)
>  ES> +		ext3_orphan_del(handle, inode);
>  
>  ES>  	err = ext3_add_nondir(handle, dentry, inode);
>  ES>  	ext3_journal_stop(handle);

-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
  2007-01-12 21:14   ` Eric Sandeen
@ 2007-01-12 21:25     ` Alex Tomas
  2007-01-12 21:48       ` Eric Sandeen
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Tomas @ 2007-01-12 21:25 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Alex Tomas, Linux Kernel Mailing List, ext4 development

>>>>> Eric Sandeen (ES) writes:

 ES> so I think it's possible that link can sneak in there & find it after
 ES> the mutex is dropped...?  Is this ok? :)  It's certainly -happening-
 ES> anyway....

yes, but it shouldn't allow to re-link such inode back, IMHO.
a filesystem may start some non-revertable activity in its
unlink method.

thanks, Alex

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
  2007-01-12 21:25     ` Alex Tomas
@ 2007-01-12 21:48       ` Eric Sandeen
  2007-01-12 21:55         ` Alex Tomas
  2007-01-12 22:35         ` Eric Sandeen
  0 siblings, 2 replies; 13+ messages in thread
From: Eric Sandeen @ 2007-01-12 21:48 UTC (permalink / raw)
  To: Alex Tomas; +Cc: Linux Kernel Mailing List, ext4 development

Alex Tomas wrote:
>>>>>> Eric Sandeen (ES) writes:
> 
>  ES> so I think it's possible that link can sneak in there & find it after
>  ES> the mutex is dropped...?  Is this ok? :)  It's certainly -happening-
>  ES> anyway....
> 
> yes, but it shouldn't allow to re-link such inode back, IMHO.
> a filesystem may start some non-revertable activity in its
> unlink method.
> 
> thanks, Alex

I tend to agree, chatting w/ Al I think he does too.  :)  I'll test
a patch that kicks out ext3_link() with -ENOENT at the top, and resubmit
that if things go well.

-Eric

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
  2007-01-12 21:48       ` Eric Sandeen
@ 2007-01-12 21:55         ` Alex Tomas
  2007-01-12 22:01           ` Eric Sandeen
  2007-01-12 22:35         ` Eric Sandeen
  1 sibling, 1 reply; 13+ messages in thread
From: Alex Tomas @ 2007-01-12 21:55 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Alex Tomas, Linux Kernel Mailing List, ext4 development

>>>>> Eric Sandeen (ES) writes:
 ES> I tend to agree, chatting w/ Al I think he does too.  :)  I'll test
 ES> a patch that kicks out ext3_link() with -ENOENT at the top, and resubmit
 ES> that if things go well.

shouldn't VFS do that?

thanks, Alex

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
  2007-01-12 21:55         ` Alex Tomas
@ 2007-01-12 22:01           ` Eric Sandeen
  2007-01-12 22:07             ` Alex Tomas
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Sandeen @ 2007-01-12 22:01 UTC (permalink / raw)
  To: Alex Tomas; +Cc: Linux Kernel Mailing List, ext4 development

Alex Tomas wrote:
>>>>>> Eric Sandeen (ES) writes:
>  ES> I tend to agree, chatting w/ Al I think he does too.  :)  I'll test
>  ES> a patch that kicks out ext3_link() with -ENOENT at the top, and resubmit
>  ES> that if things go well.
> 
> shouldn't VFS do that?

Al says "no" and I'm not arguing.  :)

Apparently this may be OK with some filesystems, and Al says he doesn't
want to know about i_nlink in the vfs in any case.

But I suppose there may be other filesystems which DO care, and should
be checking if they're not.

-Eric

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
  2007-01-12 22:01           ` Eric Sandeen
@ 2007-01-12 22:07             ` Alex Tomas
  0 siblings, 0 replies; 13+ messages in thread
From: Alex Tomas @ 2007-01-12 22:07 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Alex Tomas, Linux Kernel Mailing List, ext4 development

>>>>> Eric Sandeen (ES) writes:

 ES> Al says "no" and I'm not arguing.  :)

 ES> Apparently this may be OK with some filesystems, and Al says he doesn't
 ES> want to know about i_nlink in the vfs in any case.

well, generic_drop_inode() uses i_nlink ...

 ES> But I suppose there may be other filesystems which DO care, and should
 ES> be checking if they're not.

this is why I thought VFS could take care.

thanks, Alex

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
  2007-01-12 21:48       ` Eric Sandeen
  2007-01-12 21:55         ` Alex Tomas
@ 2007-01-12 22:35         ` Eric Sandeen
  1 sibling, 0 replies; 13+ messages in thread
From: Eric Sandeen @ 2007-01-12 22:35 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Alex Tomas, Linux Kernel Mailing List, ext4 development, Al Viro

Eric Sandeen wrote:

> Alex Tomas wrote:
>   
>> yes, but it shouldn't allow to re-link such inode back, IMHO.
>> a filesystem may start some non-revertable activity in its
>> unlink method.
>>
>> thanks, Alex
>>     
>
> I tend to agree, chatting w/ Al I think he does too.  :)  I'll test
> a patch that kicks out ext3_link() with -ENOENT at the top, and resubmit
> that if things go well.
>   
Well this seems to fix things up for ext3 (and ext4 by extension):

---

Return -ENOENT from ext[34]_link if we've raced with unlink and
i_nlink is 0.  Doing otherwise has the potential to corrupt the
orphan inode list, because we'd wind up with an inode with a
non-zero link count on the list, and it will never get properly
cleaned up.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>

Index: linux-2.6.19/fs/ext3/namei.c
===================================================================
--- linux-2.6.19.orig/fs/ext3/namei.c
+++ linux-2.6.19/fs/ext3/namei.c
@@ -2191,6 +2191,8 @@ static int ext3_link (struct dentry * ol
 
 	if (inode->i_nlink >= EXT3_LINK_MAX)
 		return -EMLINK;
+	if (inode->i_nlink == 0)
+		return -ENOENT;
 
 retry:
 	handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS(dir->i_sb) +
Index: linux-2.6.19/fs/ext4/namei.c
===================================================================
--- linux-2.6.19.orig/fs/ext4/namei.c
+++ linux-2.6.19/fs/ext4/namei.c
@@ -2189,6 +2189,8 @@ static int ext4_link (struct dentry * ol
 
 	if (inode->i_nlink >= EXT4_LINK_MAX)
 		return -EMLINK;
+	if (inode->i_nlink == 0)
+		return -ENOENT;
 
 retry:
 	handle = ext4_journal_start(dir, EXT4_DATA_TRANS_BLOCKS(dir->i_sb) +



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
  2007-01-12 20:45 [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race Eric Sandeen
  2007-01-12 21:02 ` Alex Tomas
@ 2007-01-14 11:58 ` Dmitriy Monakhov
  2007-01-14 15:42   ` Eric Sandeen
  1 sibling, 1 reply; 13+ messages in thread
From: Dmitriy Monakhov @ 2007-01-14 11:58 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Linux Kernel Mailing List, ext4 development

Eric Sandeen <sandeen@redhat.com> writes:

> I've been looking at a case where many threads are opening, unlinking, and
> hardlinking files on ext3 .
How many concurent threads do you use and how long does it takes to trigger 
this race? I've tried to reproduce this with two threads, but not succeed.
<thread 1>  
        fd = create("src")
        close(fd)
        unlink("src")
<thread 2>
        link("src", "dst")
        unlink("dst")

Original testcase will be the best answer :).
Thanks.
>  At unmount time I see an oops, because the superblock's
> orphan list points to a freed inode.
>
> I did some tracing of the inodes, and it looks like this:
>
>   ext3_unlink():[/src/linux-2.6.18/fs/ext3/namei.c:2123] adding orphan
>       i_state:0x7 cpu:1 i_count:2 i_nlink:0
>
>   ext3_orphan_add():[/src/linux-2.6.18/fs/ext3/namei.c:1890] ext3_orphan_add
>       i_state:0x7 cpu:1 i_count:2 i_nlink:0
>
>   iput():[/src/linux-2.6.18/fs/inode.c:1139] iput enter
>       i_state:0x7 cpu:1 i_count:2 i_nlink:0
>
>   ext3_link():[/src/linux-2.6.18/fs/ext3/namei.c:2202] ext3_link enter
>       i_state:0x7 cpu:3 i_count:1 i_nlink:0
>
>   ext3_inc_count():[/src/linux-2.6.18/fs/ext3/namei.c:1627] done
>       i_state:0x7 cpu:3 i_count:1 i_nlink:1
>
> The unlink gets there first, finds i_count > 0 (in use) but nlink goes to 0, so
> it puts it on the orphan inode list.  Then link comes along, and bumps the link
> back up to 1.  So now we are on the orphan inode list, but we are not unlinked.
>
> Eventually when count goes to 0, and we still have 1 link, again no action is
> taken to remove the inode from the orphan list, because it is still linked (i.e.
> we don't go through ext3_delete())
>
> When this inode is eventually freed, the sb orphan list gets corrupted, because 
> we have freed it without first removing it from the orphan list.
>
> I think the simple solution is to remove the inode from the orphan list
> when we bump the link back up from 0 to 1.  I put that test in there because
> there are other potential reasons that we might be on the list (truncates,
> direct IO).
>
> Comments?
>
> Thanks,
> -Eric
>
> p.s. ext3_inc_count and ext3_dec_count seem misnamed, have an unused
> arg, and are very infrequently called.  I'll probably submit a patch
> to just put the single line of code into the caller, too.
>
> ---
>
> Remove inode from the orphan list in ext3_link() if we might have
> raced with ext3_unlink(), which potentially put it on the list.
> If we're on the list with nlink > 0, we'll never get cleaned up
> properly and eventually may corrupt the list.
>
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>
> Index: linux-2.6.19/fs/ext3/namei.c
> ===================================================================
> --- linux-2.6.19.orig/fs/ext3/namei.c
> +++ linux-2.6.19/fs/ext3/namei.c
> @@ -2204,6 +2204,9 @@ retry:
>  	inode->i_ctime = CURRENT_TIME_SEC;
>  	ext3_inc_count(handle, inode);
>  	atomic_inc(&inode->i_count);
> +	/* did we race w/ unlink? */
> +	if (inode->i_nlink == 1)
> +		ext3_orphan_del(handle, inode);
>  
>  	err = ext3_add_nondir(handle, dentry, inode);
>  	ext3_journal_stop(handle);
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
  2007-01-14 11:58 ` Dmitriy Monakhov
@ 2007-01-14 15:42   ` Eric Sandeen
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Sandeen @ 2007-01-14 15:42 UTC (permalink / raw)
  To: Dmitriy Monakhov; +Cc: Linux Kernel Mailing List, ext4 development

Dmitriy Monakhov wrote:
> Eric Sandeen <sandeen@redhat.com> writes:
> 
>> I've been looking at a case where many threads are opening, unlinking, and
>> hardlinking files on ext3 .
> How many concurent threads do you use and how long does it takes to trigger 
> this race? I've tried to reproduce this with two threads, but not succeed.
> <thread 1>  
>         fd = create("src")
>         close(fd)
>         unlink("src")
> <thread 2>
>         link("src", "dst")
>         unlink("dst")
> 
> Original testcase will be the best answer :).

Sure :)  Though I didn't write it... see this collection of bash scripts:

http://people.redhat.com/esandeen/testcases/orphan-repro.tar.bz2

I didn't write it, but it exposed the bug for me.  The VAR file contains 
variables to specify mountpoint and a device, which the script starts by 
mkfs'ing, so be warned.

It spawns -many- threads, and on my 4 CPU opteron I can hit it in a 
reasonable amount of time.  It would probably be nice to have a more 
targeted testcase but it did the trick for me.

Thanks,
-Eric

> Thanks.


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2007-01-14 15:42 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-01-12 20:45 [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race Eric Sandeen
2007-01-12 21:02 ` Alex Tomas
2007-01-12 21:14   ` Eric Sandeen
2007-01-12 21:25     ` Alex Tomas
2007-01-12 21:48       ` Eric Sandeen
2007-01-12 21:55         ` Alex Tomas
2007-01-12 22:01           ` Eric Sandeen
2007-01-12 22:07             ` Alex Tomas
2007-01-12 22:35         ` Eric Sandeen
2007-01-12 21:17   ` Alex Tomas
2007-01-12 21:24   ` Dave Kleikamp
2007-01-14 11:58 ` Dmitriy Monakhov
2007-01-14 15:42   ` Eric Sandeen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).