LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* Re: sysfs reclaim crash
[not found] ` <46042DDF.5060000@google.com>
@ 2007-03-24 3:05 ` Maneesh Soni
2007-03-27 18:27 ` Ethan Solomita
2007-04-03 7:38 ` [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash) Maneesh Soni
0 siblings, 2 replies; 6+ messages in thread
From: Maneesh Soni @ 2007-03-24 3:05 UTC (permalink / raw)
To: Ethan Solomita
Cc: Dipankar Sarma, Andrew Morton, Greg KH, Martin Bligh, Rohit Seth,
viro, LKML
On Fri, Mar 23, 2007 at 12:43:27PM -0700, Ethan Solomita wrote:
> I ran stress testing overnight and came up with a similar failure
> (s_dentry == NULL) but in a different location. A NULL pointer
> dereference happened in sysfs_readdir():
>
> | if (next->s_dentry)
> | ino =
> next->s_dentry->d_inode->i_ino;
>
> It seems that d_inode was NULL. I don't have the pointer to d_inode
> to look at, but I have next and, lo and behold, its s_dentry is now
> NULL, which it clearly wasn't when the if-clause above ran.
>
> I tried to reconstruct the sysfs_dirents starting with "next". I
> filled in all the structure contents that I had data for:
>
> sysfs_dirent 0xffff81000fc61690:
> s_count 1
> s_sibling ffff81000fc616e8 / ffff81000e0c7468
> s_children ffff81000fc616a8 / ffff81000fc616a8
> s_element ffff81000f4ad1b0
> DOR__ATA1RTS
> ffffffff8800b600
> 124
> s_type 4
> s_mode 8124
> s_dentry NULL
> s_iattr NULL
> s_event 0
>
> s_sibling.prev:
> s_count 1
> s_sibling ffff81000fc61738 / ffff81000fc61698
> s_children ffff81000fc616f8 / ffff81000fc616f8
> s_element ffff81000f4ad148
> s_type 4
> s_mode 8124
> <unknown>
>
> s_sibling.next:
> s_count 1
> s_sibling ffff81000fc61698 / ffff81000fc61648
> s_children ffff81000e0c7478 / ffff81000e0c7478
> s_element NULL
> s_type 0
> s_mode 0
> s_dentry NULL
> s_iattr NULL
> s_event 0
>
> s_sibling.next.next:
> s_count 1
> s_sibling ffff81000e0c7468 / ffff81000fc615f8
> s_children ffff81000fc61658 / ffff81000fc61658
> s_element ffff81000f4ad218
> DOR__ATA1RTS
> ffffffff8800b600
> 124
> s_type 4
> s_mode 8124
> s_dentry NULL
> s_iattr NULL
> s_event 0
>
> s_sibling.next.next.next:
> s_count 1
> s_sibling ffff81000fc61648 / ffff81000fc610f8
> s_children ffff81000fc61608 / ffff81000fc61608
> s_element ffff81000f4ad280
> CK??ATCR
> ffffffff8800b600
> 124
> s_type 4
> s_mode 8124
> s_dentry NULL
> s_iattr NULL
> s_event 0
>
> I should acknowledge that this is based upon 2.6.18 with some newer
> code backported. If there are fixes since 2.6.18 that we should know
> about I can try backporting them into our kernel.
>
> Thanks,
> -- Ethan
Hi Ethan,
Thank you very much for the crash data. This is helpful. Could you please
test the appended patch. Here we avoid modifying s_dentry in sysfs_d_iput()
and also uses iunique() in sysfs_readdir() instead of accessing s_dentry
to get the inode number.
Though I was not able to recreate this race without the patch, but I am
running this patch successfully for more than 6 hrs now with the following
loops parallely on a 4-way SMP system. So, at least it has not done anything
bad.
1. while true; do insmod drivers/net/dummy.ko ; rmmod dummy; done
2. [root@llm01 net]# pwd
/sys/class/net
[root@llm01 net]# while true; do find | xargs cat > /dev/null; done
3. [root@llm01 sys]# pwd
/sys
[root@llm01 sys]# while true; do ls -liR; done
Also, CC-ed Viro and lkml for feedback.
Thanks
Maneesh
o sysfs_d_iput() is invoked in dentry reclaim path under memory pressure. This
happens without i_mutex. It also nullifies s_dentry just to indicate that
the associated dentry is evicted. sysfs_readdir() access the s_dentry,
and gets the inode number from the associated dentry, if there is one, else
it invokes iunique(). This can create a race situation, and crash while
accessing the dentry/inode in sysfs_readdir().
o The following patch always use i_unique() to get the inode number. This
is ok as sysfs doesnot have permanent inode numbering. It could be slower
but avoids the above mentioned race.
o This also avoids the now unnecessary s_dentry in sysfs_d_iput().
Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
---
linux-2.6.21-rc4-git8-maneesh/fs/sysfs/dir.c | 6 +-----
1 files changed, 1 insertion(+), 5 deletions(-)
diff -puN fs/sysfs/dir.c~fix-sysfs-reclaim-race fs/sysfs/dir.c
--- linux-2.6.21-rc4-git8/fs/sysfs/dir.c~fix-sysfs-reclaim-race 2007-03-24 02:06:38.000000000 +0530
+++ linux-2.6.21-rc4-git8-maneesh/fs/sysfs/dir.c 2007-03-24 02:09:26.000000000 +0530
@@ -20,7 +20,6 @@ static void sysfs_d_iput(struct dentry *
if (sd) {
BUG_ON(sd->s_dentry != dentry);
- sd->s_dentry = NULL;
sysfs_put(sd);
}
iput(inode);
@@ -538,10 +537,7 @@ static int sysfs_readdir(struct file * f
name = sysfs_get_name(next);
len = strlen(name);
- if (next->s_dentry)
- ino = next->s_dentry->d_inode->i_ino;
- else
- ino = iunique(sysfs_sb, 2);
+ ino = iunique(sysfs_sb, 2);
if (filldir(dirent, name, len, filp->f_pos, ino,
dt_type(next)) < 0)
_
--
Maneesh Soni
Linux Technology Center,
IBM India Systems and Technology Lab,
Bangalore, India
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: sysfs reclaim crash
2007-03-24 3:05 ` sysfs reclaim crash Maneesh Soni
@ 2007-03-27 18:27 ` Ethan Solomita
2007-03-27 18:49 ` Maneesh Soni
2007-04-03 7:38 ` [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash) Maneesh Soni
1 sibling, 1 reply; 6+ messages in thread
From: Ethan Solomita @ 2007-03-27 18:27 UTC (permalink / raw)
To: maneesh
Cc: Dipankar Sarma, Andrew Morton, Greg KH, Martin Bligh, Rohit Seth,
viro, LKML
Hi Maneesh -- I will start testing with the patch you provided. If you
come up with any further issues please let me know. Also, if you could
suggest some additional BUG() lines that I could insert I would
appreciate it. Since the bug is hard to reproduce, it may be easier to
catch a race condition in the making via BUG() than an actual failure
due to a race condition.
Thanks!
-- Ethan
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: sysfs reclaim crash
2007-03-27 18:27 ` Ethan Solomita
@ 2007-03-27 18:49 ` Maneesh Soni
0 siblings, 0 replies; 6+ messages in thread
From: Maneesh Soni @ 2007-03-27 18:49 UTC (permalink / raw)
To: Ethan Solomita
Cc: Dipankar Sarma, Andrew Morton, Greg KH, Martin Bligh, Rohit Seth,
viro, LKML
On Tue, Mar 27, 2007 at 11:27:14AM -0700, Ethan Solomita wrote:
> Hi Maneesh -- I will start testing with the patch you provided. If
> you come up with any further issues please let me know. Also, if you could
> suggest some additional BUG() lines that I could insert I would
> appreciate it. Since the bug is hard to reproduce, it may be easier to
> catch a race condition in the making via BUG() than an actual failure
> due to a race condition.
>
Hi Ethan,
Thanks for testing. The BUG_ON in sysfs_d_iput() is still there to catch
the first race we saw. And the second one should not occur as now we are not
using the s_dentry in sysfs_readdir().
Regards,
Maneesh
--
Maneesh Soni
Linux Technology Center,
IBM India Systems and Technology Lab,
Bangalore, India
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash)
2007-03-24 3:05 ` sysfs reclaim crash Maneesh Soni
2007-03-27 18:27 ` Ethan Solomita
@ 2007-04-03 7:38 ` Maneesh Soni
2007-04-03 20:46 ` Ethan Solomita
2007-04-07 8:31 ` Tejun Heo
1 sibling, 2 replies; 6+ messages in thread
From: Maneesh Soni @ 2007-04-03 7:38 UTC (permalink / raw)
To: Ethan Solomita
Cc: Dipankar Sarma, Andrew Morton, Greg KH, Martin Bligh, Rohit Seth,
viro, LKML
On Sat, Mar 24, 2007 at 08:35:35AM +0530, Maneesh Soni wrote:
> On Fri, Mar 23, 2007 at 12:43:27PM -0700, Ethan Solomita wrote:
> > I ran stress testing overnight and came up with a similar failure
> > (s_dentry == NULL) but in a different location. A NULL pointer
> > dereference happened in sysfs_readdir():
> >
> > | if (next->s_dentry)
> > | ino =
> > next->s_dentry->d_inode->i_ino;
> >
> > It seems that d_inode was NULL. I don't have the pointer to d_inode
> > to look at, but I have next and, lo and behold, its s_dentry is now
> > NULL, which it clearly wasn't when the if-clause above ran.
> >
> > I tried to reconstruct the sysfs_dirents starting with "next". I
> > filled in all the structure contents that I had data for:
> >
> > sysfs_dirent 0xffff81000fc61690:
> > s_count 1
> > s_sibling ffff81000fc616e8 / ffff81000e0c7468
> > s_children ffff81000fc616a8 / ffff81000fc616a8
> > s_element ffff81000f4ad1b0
> > DOR__ATA1RTS
> > ffffffff8800b600
> > 124
> > s_type 4
> > s_mode 8124
> > s_dentry NULL
> > s_iattr NULL
> > s_event 0
> >
> > s_sibling.prev:
> > s_count 1
> > s_sibling ffff81000fc61738 / ffff81000fc61698
> > s_children ffff81000fc616f8 / ffff81000fc616f8
> > s_element ffff81000f4ad148
> > s_type 4
> > s_mode 8124
> > <unknown>
> >
> > s_sibling.next:
> > s_count 1
> > s_sibling ffff81000fc61698 / ffff81000fc61648
> > s_children ffff81000e0c7478 / ffff81000e0c7478
> > s_element NULL
> > s_type 0
> > s_mode 0
> > s_dentry NULL
> > s_iattr NULL
> > s_event 0
> >
> > s_sibling.next.next:
> > s_count 1
> > s_sibling ffff81000e0c7468 / ffff81000fc615f8
> > s_children ffff81000fc61658 / ffff81000fc61658
> > s_element ffff81000f4ad218
> > DOR__ATA1RTS
> > ffffffff8800b600
> > 124
> > s_type 4
> > s_mode 8124
> > s_dentry NULL
> > s_iattr NULL
> > s_event 0
> >
> > s_sibling.next.next.next:
> > s_count 1
> > s_sibling ffff81000fc61648 / ffff81000fc610f8
> > s_children ffff81000fc61608 / ffff81000fc61608
> > s_element ffff81000f4ad280
> > CK??ATCR
> > ffffffff8800b600
> > 124
> > s_type 4
> > s_mode 8124
> > s_dentry NULL
> > s_iattr NULL
> > s_event 0
> >
> > I should acknowledge that this is based upon 2.6.18 with some newer
> > code backported. If there are fixes since 2.6.18 that we should know
> > about I can try backporting them into our kernel.
> >
> > Thanks,
> > -- Ethan
>
>
Hi Ethan / Andrew,
I have modified the previous patch (which was dropped from -mm) and now keeping
the statement making s_dentry as NULL in sysfs_d_iput(), so this should
_safely_ fix sysfs_readdir() oops.
Please see the other mail for sysfs_d_iput() BUG hit.
Thanks
Maneesh
--
Maneesh Soni
Linux Technology Center,
IBM India Systems and Technology Lab,
Bangalore, India
o sysfs_d_iput() is invoked in dentry reclaim path under memory pressure. This
happens without i_mutex. It also nullifies s_dentry to indicate that
the associated dentry is evicted. sysfs_readdir() accesses the s_dentry,
and gets the inode number from the associated dentry->d_inode, if
there is one, else it invokes iunique(). This can create a race situation,
and crash while accessing the d_inode in sysfs_readdir().
o The race happens when the dentry is getting reclaimed and detached from
the corresponding sysfs_dirent though sysfs_dirent is still a valid
node. Accessing dentry fields are ok as it is under RCU but the inode is
not hence we may see oops accessing dentry->d_inode->i_no.
o The following patch always use i_unique() to get the inode number in
sysfs_readdir. This is ok as sysfs doesnot have permanent inode numbering.
It could be slower but avoids the oops.
Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
---
linux-2.6.21-rc5-mm3-maneesh/fs/sysfs/dir.c | 5 +----
1 files changed, 1 insertion(+), 4 deletions(-)
diff -puN fs/sysfs/dir.c~fix-sysfs_readdir-oops fs/sysfs/dir.c
--- linux-2.6.21-rc5-mm3/fs/sysfs/dir.c~fix-sysfs_readdir-oops 2007-04-03 10:39:52.000000000 +0530
+++ linux-2.6.21-rc5-mm3-maneesh/fs/sysfs/dir.c 2007-04-03 10:39:52.000000000 +0530
@@ -538,10 +538,7 @@ static int sysfs_readdir(struct file * f
name = sysfs_get_name(next);
len = strlen(name);
- if (next->s_dentry)
- ino = next->s_dentry->d_inode->i_ino;
- else
- ino = iunique(sysfs_sb, 2);
+ ino = iunique(sysfs_sb, 2);
if (filldir(dirent, name, len, filp->f_pos, ino,
dt_type(next)) < 0)
_
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash)
2007-04-03 7:38 ` [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash) Maneesh Soni
@ 2007-04-03 20:46 ` Ethan Solomita
2007-04-07 8:31 ` Tejun Heo
1 sibling, 0 replies; 6+ messages in thread
From: Ethan Solomita @ 2007-04-03 20:46 UTC (permalink / raw)
To: maneesh
Cc: Dipankar Sarma, Andrew Morton, Greg KH, Martin Bligh, Rohit Seth,
viro, LKML
Maneesh Soni wrote:
> I have modified the previous patch (which was dropped from -mm) and now keeping
> the statement making s_dentry as NULL in sysfs_d_iput(), so this should
> _safely_ fix sysfs_readdir() oops.
>
If you could find some additional places in sysfs code to add new
BUG() checks I'd appreciate it. Especially if it turns out that you
can't reproduce it, I'd like to have as many asserts as is reasonable.
-- Ethan
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash)
2007-04-03 7:38 ` [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash) Maneesh Soni
2007-04-03 20:46 ` Ethan Solomita
@ 2007-04-07 8:31 ` Tejun Heo
1 sibling, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2007-04-07 8:31 UTC (permalink / raw)
To: maneesh
Cc: Ethan Solomita, Dipankar Sarma, Andrew Morton, Greg KH,
Martin Bligh, Rohit Seth, viro, LKML
Hello, Maneesh.
Maneesh Soni wrote:
> o sysfs_d_iput() is invoked in dentry reclaim path under memory pressure. This
> happens without i_mutex. It also nullifies s_dentry to indicate that
> the associated dentry is evicted. sysfs_readdir() accesses the s_dentry,
> and gets the inode number from the associated dentry->d_inode, if
> there is one, else it invokes iunique(). This can create a race situation,
> and crash while accessing the d_inode in sysfs_readdir().
>
> o The race happens when the dentry is getting reclaimed and detached from
> the corresponding sysfs_dirent though sysfs_dirent is still a valid
> node. Accessing dentry fields are ok as it is under RCU but the inode is
> not hence we may see oops accessing dentry->d_inode->i_no.
>
> o The following patch always use i_unique() to get the inode number in
> sysfs_readdir. This is ok as sysfs doesnot have permanent inode numbering.
> It could be slower but avoids the oops.
This isn't correct as i_unique() assumes the inode is in inode hash
table which isn't true for sysfs. This can result in duplicate inode
numbers. Please take a look at the following alternative fix.
http://article.gmane.org/gmane.linux.kernel/513325
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2007-04-07 8:31 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20070319140238.cbf28b2b.akpm@linux-foundation.org>
[not found] ` <20070321134654.5wkqpfbpk0ggwks8@imap.linux.ibm.com>
[not found] ` <20070321182123.GA12602@in.ibm.com>
[not found] ` <20070323043047.GA5641@in.ibm.com>
[not found] ` <20070322210504.3b052139.akpm@linux-foundation.org>
[not found] ` <46042DDF.5060000@google.com>
2007-03-24 3:05 ` sysfs reclaim crash Maneesh Soni
2007-03-27 18:27 ` Ethan Solomita
2007-03-27 18:49 ` Maneesh Soni
2007-04-03 7:38 ` [PATCH] fix sysfs_readdir oops (was Re: sysfs reclaim crash) Maneesh Soni
2007-04-03 20:46 ` Ethan Solomita
2007-04-07 8:31 ` Tejun Heo
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).