LKML Archive on lore.kernel.org
* BUG: drop_pagecache_sb vs kjournald lockup
@ 2008-03-18 11:28 David Chinner
  2008-03-18 13:43 ` Jan Kara
       [not found] ` <E1JbwS9-0008Gq-7h@localhost>
  0 siblings, 2 replies; 7+ messages in thread
From: David Chinner @ 2008-03-18 11:28 UTC (permalink / raw)
  To: lkml; +Cc: linux-fsdevel

2.6.25-rc3, 4p ia64, ext3 root drive.

I was running an XFS stress test on one of the XFS partitions on
the machine (zero load on the root ext3 drive), when the system
locked up in kjournald with this on the console:

BUG: spinlock lockup on CPU#2, kjournald/2150, a000000100e022e0

Call Trace:
 [<a0000001000125e0>] show_stack+0x40/0xa0
                                sp=e00000384195fbf0 bsp=e000003841951010
 [<a000000100012670>] dump_stack+0x30/0x60
                                sp=e00000384195fdc0 bsp=e000003841950ff0
 [<a0000001004a5500>] _raw_spin_lock+0x200/0x260
                                sp=e00000384195fdc0 bsp=e000003841950fb8
 [<a000000100837360>] _spin_lock+0x20/0x40
                                sp=e00000384195fdc0 bsp=e000003841950f98
 [<a0000001001ab370>] __mark_inode_dirty+0x270/0x440
                                sp=e00000384195fdc0 bsp=e000003841950f58
 [<a0000001001b5930>] __set_page_dirty+0x330/0x360
                                sp=e00000384195fdc0 bsp=e000003841950f28
 [<a0000001001b5aa0>] mark_buffer_dirty+0x140/0x160
                                sp=e00000384195fdc0 bsp=e000003841950f00
 [<a0000001002a9100>] __journal_temp_unlink_buffer+0x3a0/0x3c0
                                sp=e00000384195fdc0 bsp=e000003841950ee0
 [<a0000001002a9690>] __journal_unfile_buffer+0x30/0x60
                                sp=e00000384195fdc0 bsp=e000003841950ec0
 [<a0000001002a97f0>] __journal_refile_buffer+0x130/0x2a0
                                sp=e00000384195fdc0 bsp=e000003841950e88
 [<a0000001002b0e70>] journal_commit_transaction+0x23b0/0x29c0
                                sp=e00000384195fdc0 bsp=e000003841950e18
 [<a0000001002b8f80>] kjournald+0x180/0x420
                                sp=e00000384195fe00 bsp=e000003841950db8
 [<a0000001000bf780>] kthread+0xa0/0x120
                                sp=e00000384195fe30 bsp=e000003841950d88
 [<a000000100010d10>] kernel_thread_helper+0x30/0x60
                                sp=e00000384195fe30 bsp=e000003841950d60
 [<a0000001000089c0>] start_kernel_thread+0x20/0x40
                                sp=e00000384195fe30 bsp=e000003841950d60


The system is stuck with all 4 CPUs spinning on locks:

[0]kdb> bta R
Stack traceback for pid 2592
0xe000003802950000     2592        1  1    0   R  0xe0000038029503a0 *syslogd
0xa0000001004a5470 _raw_spin_lock+0x170
0xa000000100837360 _spin_lock+0x20
0xa0000001002aaa20 do_get_write_access+0x960
0xa0000001002aad80 journal_get_write_access+0x40
0xa0000001002a2e50 __ext3_journal_get_write_access+0x30
0xa000000100285f60 ext3_reserve_inode_write+0x80
0xa000000100286050 ext3_mark_inode_dirty+0x50
0xa00000010028c990 ext3_dirty_inode+0xd0
0xa0000001001ab190 __mark_inode_dirty+0x90
0xa0000001001948c0 file_update_time+0x220
0xa000000100101000 __generic_file_aio_write_nolock+0x4e0
0xa000000100101340 generic_file_aio_write+0xe0
0xa000000100281a00 ext3_file_write+0x60
0xa000000100164030 do_sync_readv_writev+0x150
0xa000000100164fd0 do_readv_writev+0x130
0xa000000100165270 vfs_writev+0xd0
0xa000000100165310 sys_writev+0x70
0xa000000100009ca0 ia64_ret_from_syscall
0xa000000000010720 __kernel_syscall_via_break
Enter <q> to end, <cr> to continue:
Stack traceback for pid 4700
0xe00000b84aa10000     4700     2884  1    1   R  0xe00000b84aa103a0  sh
0xa0000001004a5460 _raw_spin_lock+0x160
0xa000000100837360 _spin_lock+0x20
0xa0000001002ae460 journal_try_to_free_buffers+0x200
0xa00000010028a730 ext3_releasepage+0xb0
0xa000000100101a90 try_to_release_page+0xf0
0xa0000001001139f0 __invalidate_mapping_pages+0x170
0xa0000001001ac3f0 drop_pagecache+0x110
0xa0000001001ac5d0 drop_caches_sysctl_handler+0x70
0xa0000001002089b0 proc_sys_write+0x110
0xa000000100165670 vfs_write+0x1b0
0xa000000100166110 sys_write+0x70
0xa000000100009ca0 ia64_ret_from_syscall
0xa000000000010720 __kernel_syscall_via_break
Enter <q> to end, <cr> to continue:
Stack traceback for pid 2150
0xe000003841950000     2150        2  1    2   R  0xe0000038419503a0  kjournald
0xa0000001004a5470 _raw_spin_lock+0x170
0xa000000100837360 _spin_lock+0x20
0xa0000001001ab370 __mark_inode_dirty+0x270
0xa0000001001b5930 __set_page_dirty+0x330
0xa0000001001b5aa0 mark_buffer_dirty+0x140
0xa0000001002a9100 __journal_temp_unlink_buffer+0x3a0
0xa0000001002a9690 __journal_unfile_buffer+0x30
0xa0000001002a97f0 __journal_refile_buffer+0x130
0xa0000001002b0e70 journal_commit_transaction+0x23b0
0xa0000001002b8f80 kjournald+0x180
0xa0000001000bf780 kthread+0xa0
0xa000000100010d10 kernel_thread_helper+0x30
0xa0000001000089c0 start_kernel_thread+0x20
Enter <q> to end, <cr> to continue:
Stack traceback for pid 2739
0xe00000b8222e0000     2739        1  1    3   R  0xe00000b8222e03a0  cron
0xa0000001004a5470 _raw_spin_lock+0x170
0xa000000100837360 _spin_lock+0x20
0xa0000001001931a0 ifind_fast+0x20
0xa0000001001936f0 iget_locked+0x70
0xa00000010028ca10 ext3_iget+0x30
0xa000000100293e30 ext3_lookup+0x130
0xa000000100176270 do_lookup+0x1b0
0xa00000010017afe0 __link_path_walk+0x1580
0xa00000010017bc20 link_path_walk+0xc0
0xa00000010017bdb0 path_walk+0x30
0xa00000010017c4d0 do_path_lookup+0x330
0xa00000010017d8c0 __user_walk_fd+0x60
0xa00000010016c410 vfs_stat_fd+0x30
0xa00000010016c950 sys_newstat+0x30
0xa000000100009ca0 ia64_ret_from_syscall
0xa000000000010720 __kernel_syscall_via_break
Enter <q> to end, <cr> to continue:

Looks like everything is backed up on the inode_lock.  Why? Looks
like drop_pagecache_sb() is doing something ..... suboptimal.

static void drop_pagecache_sb(struct super_block *sb)
{
        struct inode *inode;

        spin_lock(&inode_lock);
        list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                if (inode->i_state & (I_FREEING|I_WILL_FREE))
                        continue;
                __invalidate_mapping_pages(inode->i_mapping, 0, -1, true);
        }
        spin_unlock(&inode_lock);
}

It holds the inode_lock for an amazingly long time, and calls a
function that ends up in ->releasepage, which can issue
transactions.

Given that transactions can then mark an inode dirty or the
kjournald might need to mark an inode dirty while holding
transaction locks, the implementation of drop_pagecache_sb seems to
be just a little dangerous....

Anyone know the reason why drop_pagecache_sb() uses such a brute-force
mechanism to free up clean page cache pages?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: BUG: drop_pagecache_sb vs kjournald lockup
  2008-03-18 11:28 BUG: drop_pagecache_sb vs kjournald lockup David Chinner
@ 2008-03-18 13:43 ` Jan Kara
  2008-03-18 20:46   ` David Chinner
       [not found] ` <E1JbwS9-0008Gq-7h@localhost>
  1 sibling, 1 reply; 7+ messages in thread
From: Jan Kara @ 2008-03-18 13:43 UTC (permalink / raw)
  To: David Chinner; +Cc: lkml, linux-fsdevel

> 2.6.25-rc3, 4p ia64, ext3 root drive.
> 
> I was running an XFS stress test on one of the XFS partitions on
> the machine (zero load on the root ext3 drive), when the system
> locked up in kjournald with this on the console:
> 
> BUG: spinlock lockup on CPU#2, kjournald/2150, a000000100e022e0
> 
  <snip traces>

> Looks like everything is backed up on the inode_lock.  Why? Looks
> like drop_pagecache_sb() is doing something ..... suboptimal.
> 
> static void drop_pagecache_sb(struct super_block *sb)
> {
>         struct inode *inode;
> 
>         spin_lock(&inode_lock);
>         list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
>                 if (inode->i_state & (I_FREEING|I_WILL_FREE))
>                         continue;
>                 __invalidate_mapping_pages(inode->i_mapping, 0, -1, true);
>         }
>         spin_unlock(&inode_lock);
> }
> 
> It holds the inode_lock for an amazingly long time, and calls a
> function that ends up in ->releasepage, which can issue
> transactions.
> 
> Given that transactions can then mark an inode dirty or the
> kjournald might need to mark an inode dirty while holding
> transaction locks, the implementation of drop_pagecache_sb seems to
> be just a little dangerous....
> 
> Anyone know the reason why drop_pagecache_sb() uses such a brute-force
> mechanism to free up clean page cache pages?
  Yes, we know that drop_pagecache_sb() has locking issues but since it
is intended to be used for debugging purposes only, nobody cared enough
to fix it. Completely untested patch below if you dare to try ;)

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs
---

From: Jan Kara <jack@suse.cz>
Date: Tue, 18 Mar 2008 14:38:06 +0100
Subject: [PATCH] Fix drop_pagecache_sb() to not call __invalidate_mapping_pages() under
inode_lock.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/drop_caches.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 59375ef..f5aae26 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -14,15 +14,21 @@ int sysctl_drop_caches;
 
 static void drop_pagecache_sb(struct super_block *sb)
 {
-	struct inode *inode;
+	struct inode *inode, *toput_inode = NULL;
 
 	spin_lock(&inode_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (inode->i_state & (I_FREEING|I_WILL_FREE))
 			continue;
+		__iget(inode);
+		spin_unlock(&inode_lock);
 		__invalidate_mapping_pages(inode->i_mapping, 0, -1, true);
+		iput(toput_inode);
+		toput_inode = inode;
+		spin_lock(&inode_lock);
 	}
 	spin_unlock(&inode_lock);
+	iput(toput_inode);
 }
 
 void drop_pagecache(void)
-- 
1.5.2.4



* Re: BUG: drop_pagecache_sb vs kjournald lockup
  2008-03-18 13:43 ` Jan Kara
@ 2008-03-18 20:46   ` David Chinner
  2008-03-19 10:08     ` Jan Kara
  0 siblings, 1 reply; 7+ messages in thread
From: David Chinner @ 2008-03-18 20:46 UTC (permalink / raw)
  To: Jan Kara; +Cc: David Chinner, lkml, linux-fsdevel

On Tue, Mar 18, 2008 at 02:43:26PM +0100, Jan Kara wrote:
> > 2.6.25-rc3, 4p ia64, ext3 root drive.
> > 
> > I was running an XFS stress test on one of the XFS partitions on
> > the machine (zero load on the root ext3 drive), when the system
> > locked up in kjournald with this on the console:
> > 
> > BUG: spinlock lockup on CPU#2, kjournald/2150, a000000100e022e0
> > 
>   <snip traces>

<snip other stuff>

> > Anyone know the reason why drop_pagecache_sb() uses such a brute-force
> > mechanism to free up clean page cache pages?
>   Yes, we know that drop_pagecache_sb() has locking issues but since it
> is intended to be used for debugging purposes only, nobody cared enough
> to fix it. Completely untested patch below if you dare to try ;)

It may be intended for debugging purposes, but it does get used in
production HPC environments (a lot!). I guess I've never seen this
lockup before because SGI customers don't use ext3, but they have
complained about the system "stopping" while drop_caches is executed.
This locking ..... strategy would explain it, though.

I'll try the patch, but I can't guarantee anything - I only saw this
lockup once in about 18 hours when dropping caches every 2 seconds.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: BUG: drop_pagecache_sb vs kjournald lockup
  2008-03-18 20:46   ` David Chinner
@ 2008-03-19 10:08     ` Jan Kara
  2008-03-19 12:17       ` David Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Kara @ 2008-03-19 10:08 UTC (permalink / raw)
  To: David Chinner; +Cc: lkml, linux-fsdevel

On Wed 19-03-08 07:46:46, David Chinner wrote:
> On Tue, Mar 18, 2008 at 02:43:26PM +0100, Jan Kara wrote:
> > > 2.6.25-rc3, 4p ia64, ext3 root drive.
> > > 
> > > I was running an XFS stress test on one of the XFS partitions on
> > > the machine (zero load on the root ext3 drive), when the system
> > > locked up in kjournald with this on the console:
> > > 
> > > BUG: spinlock lockup on CPU#2, kjournald/2150, a000000100e022e0
> > > 
> >   <snip traces>
> 
> <snip other stuff>
> 
> > > Anyone know the reason why drop_pagecache_sb() uses such a brute-force
> > > mechanism to free up clean page cache pages?
> >   Yes, we know that drop_pagecache_sb() has locking issues but since it
> > is intended to be used for debugging purposes only, nobody cared enough
> > to fix it. Completely untested patch below if you dare to try ;)
> 
> It may be intended for debugging purposes, but it does get used in
> production HPC environments (a lot!). I guess I've never seen this
> lockup before because SGI customers don't use ext3, but they have
> complained about the system "stopping" while drop_caches is executed.
> This locking ..... strategy would explain it, though.
  ;) But if your customers use it in production, shouldn't you push
for a better interface for such a feature? Just a thought...

> I'll try the patch, but I can't guarantee anything - I only saw this
> lockup once in about 18 hours when dropping caches every 2 seconds.
  Well, if you enable lockdep, it should warn you about possible problems
much earlier (at least we already got several reports of lockdep warnings
when using drop_caches).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: BUG: drop_pagecache_sb vs kjournald lockup
  2008-03-19 10:08     ` Jan Kara
@ 2008-03-19 12:17       ` David Chinner
  0 siblings, 0 replies; 7+ messages in thread
From: David Chinner @ 2008-03-19 12:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: David Chinner, lkml, linux-fsdevel

On Wed, Mar 19, 2008 at 11:08:05AM +0100, Jan Kara wrote:
> On Wed 19-03-08 07:46:46, David Chinner wrote:
> > On Tue, Mar 18, 2008 at 02:43:26PM +0100, Jan Kara wrote:
> > > > 2.6.25-rc3, 4p ia64, ext3 root drive.
> > > > 
> > > > I was running an XFS stress test on one of the XFS partitions on
> > > > the machine (zero load on the root ext3 drive), when the system
> > > > locked up in kjournald with this on the console:
> > > > 
> > > > BUG: spinlock lockup on CPU#2, kjournald/2150, a000000100e022e0
> > > > 
> > >   <snip traces>
> > 
> > <snip other stuff>
> > 
> > > > Anyone know the reason why drop_pagecache_sb() uses such a brute-force
> > > > mechanism to free up clean page cache pages?
> > >   Yes, we know that drop_pagecache_sb() has locking issues but since it
> > > is intended to be used for debugging purposes only, nobody cared enough
> > > to fix it. Completely untested patch below if you dare to try ;)
> > 
> > It may be intended for debugging purposes, but it does get used in
> > production HPC environments (a lot!). I guess I've never seen this
> > lockup before because SGI customers don't use ext3, but they have
> > complained about the system "stopping" while drop_caches is executed.
> > This locking ..... strategy would explain it, though.
>   ;) But if your customers use it in production, shouldn't you push
> for a better interface for such a feature? Just a thought...

Because it's much less horrible than the thing it replaced, does
what is needed and mostly does not cause problems? And because it
usually just works I've never felt the need to look at the
implementation....

/me is off to look at Fengguang's patch

> > I'll try the patch, but I can't guarantee anything - I only saw this
> > lockup once in about 18 hours when dropping caches every 2 seconds.
>   Well, if you enable lockdep, it should warn you about possible problems
> much earlier (at least we already got several reports of lockdep warnings
> when using drop_caches).

ia64 - no lockdep :/

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: [PATCH] A deadlock free and best try version of drop_caches()
       [not found] ` <E1JbwS9-0008Gq-7h@localhost>
@ 2008-03-19 22:03   ` David Chinner
       [not found]     ` <406016111.11683@ustc.edu.cn>
  0 siblings, 1 reply; 7+ messages in thread
From: David Chinner @ 2008-03-19 22:03 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: David Chinner, lkml, linux-fsdevel

On Wed, Mar 19, 2008 at 07:27:29PM +0800, Fengguang Wu wrote:
> On Tue, Mar 18, 2008 at 10:28:44PM +1100, David Chinner wrote:
> > Looks like everything is backed up on the inode_lock.  Why? Looks
> > like drop_pagecache_sb() is doing something ..... suboptimal.
......
> > Anyone know the reason why drop_pagecache_sb() uses such a brute-force
> > mechanism to free up clean page cache pages?
> 
> Because extensive use of it (outside of testing) is discouraged? ;-)
> 
> I have been running a longer but safer version, let's merge it?

So you walk the inode hash to find inodes? Seems like a nice idea on
the surface.... Won't it need to hold the iprune_mutex to prevent
races with prune_icache() and invalidate_list()?

Hmmmm - what about unhashed inodes? We'll never see them with this
method of traversal. I ask because I'm working on some prototype
patches for XFS that avoid using the inode hash altogether and drive
inode lookup from the multitude of radix trees we have per filesystem
(for parallelised and lockless inode lookup).

The above scanning method would not work at all with that sort of
filesystem structure. Perhaps combining the bulk get/put with Jan's
get/put method for walking the sb inode list would be sufficient?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: [PATCH] A deadlock free and best try version of drop_caches()
       [not found]     ` <406016111.11683@ustc.edu.cn>
@ 2008-03-20 12:28       ` Fengguang Wu
  0 siblings, 0 replies; 7+ messages in thread
From: Fengguang Wu @ 2008-03-20 12:28 UTC (permalink / raw)
  To: David Chinner; +Cc: lkml, linux-fsdevel

On Thu, Mar 20, 2008 at 09:03:18AM +1100, David Chinner wrote:
> On Wed, Mar 19, 2008 at 07:27:29PM +0800, Fengguang Wu wrote:
> > On Tue, Mar 18, 2008 at 10:28:44PM +1100, David Chinner wrote:
> > > Looks like everything is backed up on the inode_lock.  Why? Looks
> > > like drop_pagecache_sb() is doing something ..... suboptimal.
> ......
> > > Anyone know the reason why drop_pagecache_sb() uses such a brute-force
> > > mechanism to free up clean page cache pages?
> > 
> > Because extensive use of it (outside of testing) is discouraged? ;-)
> > 
> > I have been running a longer but safer version, let's merge it?
> 
> So you walk the inode hash to find inodes? Seems like a nice idea on
> the surface.... Won't it need to hold the iprune_mutex to prevent
> races with prune_icache() and invalidate_list()?

Oh, I didn't notice iprune_mutex :-/

> Hmmmm - what about unhashed inodes? We'll never see them with this
> method of traversal. I ask because I'm working on some prototype
> patches for XFS that avoid using the inode hash altogether and drive
> inode lookup from the multitude of radix trees we have per filesystem
> (for parallelised and lockless inode lookup).
> 
> The above scanning method would not work at all with that sort of
> filesystem structure. Perhaps combining the bulk get/put with Jan's
> get/put method for walking the sb inode list would be sufficient?

You mean to move __invalidate_mapping_pages() out of inode_lock in
drop_pagecache_sb()? Sounds reasonable.

BTW, I noticed the following comments on locking order are outdated:

filemap.c 
 *  ->inode_lock
 *    ->sb_lock                 (fs/fs-writeback.c)

rmap.c 
 *                 sb_lock (within inode_lock in fs/fs-writeback.c)

They are invalidated by this patch:
http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commit;h=03da8814512f55441b9bd57c1ceddd28fbcf59ba
[PATCH] Rearrangement of inode_lock in writeback_inodes()

[...] It narrows down use of inode_lock and removes unnecessary
nesting of sb_lock and inode_lock. [...]

Thank you,
Fengguang



end of thread

Thread overview: 7+ messages
-- links below jump to the message on this page --
2008-03-18 11:28 BUG: drop_pagecache_sb vs kjournald lockup David Chinner
2008-03-18 13:43 ` Jan Kara
2008-03-18 20:46   ` David Chinner
2008-03-19 10:08     ` Jan Kara
2008-03-19 12:17       ` David Chinner
     [not found] ` <E1JbwS9-0008Gq-7h@localhost>
2008-03-19 22:03   ` [PATCH] A deadlock free and best try version of drop_caches() David Chinner
     [not found]     ` <406016111.11683@ustc.edu.cn>
2008-03-20 12:28       ` Fengguang Wu
