* xfslogd-spinlock bug?
From: Haar János @ 2006-12-11 23:00 UTC
To: linux-xfs; +Cc: linux-kernel

Hello, list,

I am the "big red button man" with the one big 14TB XFS, if somebody remembers me. :-)

Now I found something in 2.6.16.18, and tried 2.6.18.4 and 2.6.19, but the bug still exists:

Dec 11 22:47:21 dy-base BUG: spinlock bad magic on CPU#3, xfslogd/3/317
Dec 11 22:47:21 dy-base general protection fault: 0000 [1] SMP
Dec 11 22:47:21 dy-base CPU 3
Dec 11 22:47:21 dy-base Modules linked in: nbd rd netconsole e1000 video
Dec 11 22:47:21 dy-base Pid: 317, comm: xfslogd/3 Not tainted 2.6.19 #1
Dec 11 22:47:21 dy-base RIP: 0010:[<ffffffff803f3aba>]  [<ffffffff803f3aba>] spin_bug+0x69/0xdf
Dec 11 22:47:21 dy-base RSP: 0018:ffff81011fb89bc0 EFLAGS: 00010002
Dec 11 22:47:21 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
Dec 11 22:47:21 dy-base RDX: ffffffff808137a0 RSI: 0000000000000082 RDI: 0000000100000000
Dec 11 22:47:21 dy-base RBP: ffff81011fb89be0 R08: 0000000000026a70 R09: 000000006b6b6b6b
Dec 11 22:47:21 dy-base R10: 0000000000000082 R11: ffff81000584d380 R12: ffff8100db92ad80
Dec 11 22:47:21 dy-base R13: ffffffff80642dc6 R14: 0000000000000000 R15: 0000000000000003
Dec 11 22:47:21 dy-base FS: 0000000000000000(0000) GS:ffff81011fc76b90(0000) knlGS:0000000000000000
Dec 11 22:47:21 dy-base CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Dec 11 22:47:21 dy-base CR2: 00002ba007700000 CR3: 0000000108c05000 CR4: 00000000000006e0
Dec 11 22:47:21 dy-base Process xfslogd/3 (pid: 317, threadinfo ffff81011fb88000, task ffff81011fa7f830)
Dec 11 22:47:21 dy-base Stack: ffff81011fb89be0 ffff8100db92ad80 0000000000000000 0000000000000000
Dec 11 22:47:21 dy-base  ffff81011fb89c10 ffffffff803f3bdc 0000000000000282 0000000000000000
Dec 11 22:47:21 dy-base  0000000000000000 0000000000000000 ffff81011fb89c30 ffffffff805e7f2b
Dec 11 22:47:21 dy-base Call Trace:
Dec 11 22:47:21 dy-base  [<ffffffff803f3bdc>] _raw_spin_lock+0x23/0xf1
Dec 11 22:47:21 dy-base  [<ffffffff805e7f2b>] _spin_lock_irqsave+0x11/0x18
Dec 11 22:47:21 dy-base  [<ffffffff80222aab>] __wake_up+0x22/0x50
Dec 11 22:47:21 dy-base  [<ffffffff803c97f9>] xfs_buf_unpin+0x21/0x23
Dec 11 22:47:21 dy-base  [<ffffffff803970a4>] xfs_buf_item_unpin+0x2e/0xa6
Dec 11 22:47:21 dy-base  [<ffffffff803bc460>] xfs_trans_chunk_committed+0xc3/0xf7
Dec 11 22:47:21 dy-base  [<ffffffff803bc4dd>] xfs_trans_committed+0x49/0xde
Dec 11 22:47:21 dy-base  [<ffffffff803b1bde>] xlog_state_do_callback+0x185/0x33f
Dec 11 22:47:21 dy-base  [<ffffffff803b1e9c>] xlog_iodone+0x104/0x131
Dec 11 22:47:22 dy-base  [<ffffffff803c9dae>] xfs_buf_iodone_work+0x1a/0x3e
Dec 11 22:47:22 dy-base  [<ffffffff802399d8>] worker_thread+0x0/0x134
Dec 11 22:47:22 dy-base  [<ffffffff8023937e>] run_workqueue+0xa8/0xf8
Dec 11 22:47:22 dy-base  [<ffffffff803c9d94>] xfs_buf_iodone_work+0x0/0x3e
Dec 11 22:47:22 dy-base  [<ffffffff802399d8>] worker_thread+0x0/0x134
Dec 11 22:47:22 dy-base  [<ffffffff80239ad3>] worker_thread+0xfb/0x134
Dec 11 22:47:22 dy-base  [<ffffffff80223f6c>] default_wake_function+0x0/0xf
Dec 11 22:47:22 dy-base  [<ffffffff802399d8>] worker_thread+0x0/0x134
Dec 11 22:47:22 dy-base  [<ffffffff8023c6e5>] kthread+0xd8/0x10b
Dec 11 22:47:22 dy-base  [<ffffffff802256ac>] schedule_tail+0x45/0xa6
Dec 11 22:47:22 dy-base  [<ffffffff8020a6a8>] child_rip+0xa/0x12
Dec 11 22:47:22 dy-base  [<ffffffff802399d8>] worker_thread+0x0/0x134
Dec 11 22:47:22 dy-base  [<ffffffff8023c60d>] kthread+0x0/0x10b
Dec 11 22:47:22 dy-base  [<ffffffff8020a69e>] child_rip+0x0/0x12
Dec 11 22:47:22 dy-base Code: 8b 83 0c 01 00 00 48 8d 8b 98 02 00 00 41 8b 54 24 04 41 89
Dec 11 22:47:22 dy-base RIP  [<ffffffff803f3aba>] spin_bug+0x69/0xdf
Dec 11 22:47:22 dy-base RSP <ffff81011fb89bc0>
Dec 11 22:47:22 dy-base Kernel panic - not syncing: Fatal exception
Dec 11 22:47:22 dy-base Rebooting in 5 seconds..

After this, sometimes the server reboots normally, but sometimes it hangs: no console, no sysrq, nothing.

This is a "simple" crash; not "too much" data lost, or anything like that.

Can somebody help me track down the problem?

Thanks,
Janos Haar
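For readers unfamiliar with the "spinlock bad magic" message: with CONFIG_DEBUG_SPINLOCK, every initialised spinlock carries a magic value that is verified before the lock is taken, so the check fires whenever the memory holding the lock has been overwritten or freed. The following is a condensed sketch of the 2.6-era check in lib/spinlock_debug.c, simplified for illustration rather than quoted verbatim:

```c
/* Condensed sketch of the 2.6-era CONFIG_DEBUG_SPINLOCK check
 * (after lib/spinlock_debug.c); simplified, not verbatim source. */
#define SPINLOCK_MAGIC	0xdead4ead	/* stamped into every live spinlock */

static void spin_bug(spinlock_t *lock, const char *msg)
{
	/* This is the function at the RIP in the oops above. */
	printk(KERN_EMERG "BUG: spinlock %s on CPU#%d, %s/%d\n",
	       msg, raw_smp_processor_id(), current->comm, current->pid);
	dump_stack();
}

static inline void debug_spin_lock_before(spinlock_t *lock)
{
	/* If the memory holding the lock has been scribbled on or
	 * freed, the magic no longer matches and the kernel complains
	 * instead of silently spinning on garbage. */
	if (unlikely(lock->magic != SPINLOCK_MAGIC))
		spin_bug(lock, "bad magic");
}
```

Note RBX holding 6b6b6b6b6b6b6b6b in the register dump above; the significance of that pattern is worked out later in the thread.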
* Re: xfslogd-spinlock bug?
From: Justin Piszcz @ 2006-12-12 14:32 UTC
To: Haar János; +Cc: linux-xfs, linux-kernel

I'm not sure what is causing this problem, but I was curious: is this on a 32-bit or 64-bit platform?

Justin.

On Tue, 12 Dec 2006, Haar János wrote:

> Hello, list,
>
> I am the "big red button man" with the one big 14TB XFS, if somebody
> remembers me. :-)
>
> Now I found something in 2.6.16.18, and tried 2.6.18.4 and 2.6.19, but
> the bug still exists:
>
> [full oops quoted; identical to the report above]
>
> After this, sometimes the server reboots normally, but sometimes it
> hangs: no console, no sysrq, nothing.
>
> This is a "simple" crash; not "too much" data lost, or anything like that.
>
> Can somebody help me track down the problem?
>
> Thanks,
> Janos Haar
* Re: xfslogd-spinlock bug?
From: Haar János @ 2006-12-13 1:11 UTC
To: Justin Piszcz; +Cc: linux-xfs, linux-kernel

Hello, Justin,

This is a 64-bit system.

But I cannot understand what is curious about that. :-) I am not a kernel developer and not a C programmer, but the long pointers show me it is 64-bit. Or am I on the wrong clue? :-)

Anyway, this issue happens for me about daily, or at most every 2-3 days, but I have no idea what exactly causes it. 2.6.16.18 was stable for me for a long time, and then one day it started tricking me, and it happens more and more often. That's why I suspect some bad part of the (14TB) filesystem on the disks.

(This fs has a lot of errors which xfs_repair cannot correct, but anyway this is a production system, and it works well except for this issue. Some months ago I replaced the parity disk in one of the RAID4 arrays, and the next day, during the resync process, another disk died. I almost lost everything, but thanks to RAID4 and mdadm I successfully recovered a lot of data with the 1-day-older parity-only drive. This was really bad, and it left some scars on the fs.)

This issue looks to me like a race condition between the CPUs (2x Xeon with HT). Am I right? :-)

Thanks,
Janos

----- Original Message -----
From: "Justin Piszcz" <jpiszcz@lucidpixels.com>
To: "Haar János" <djani22@netcenter.hu>
Cc: <linux-xfs@oss.sgi.com>; <linux-kernel@vger.kernel.org>
Sent: Tuesday, December 12, 2006 3:32 PM
Subject: Re: xfslogd-spinlock bug?

> I'm not sure what is causing this problem, but I was curious: is this
> on a 32-bit or 64-bit platform?
>
> Justin.
>
> On Tue, 12 Dec 2006, Haar János wrote:
>
> > [original report and oops quoted in full; identical to the messages above]
* Re: xfslogd-spinlock bug?
From: Haar János @ 2006-12-16 11:19 UTC
To: Haar János; +Cc: linux-xfs, linux-kernel

Hi,

I have some news. I don't know whether there is a connection between the 2 messages, but I can see that the spinlock bug always comes on CPU #3.

Does somebody have any idea?

Thanks,
Janos

Dec 16 11:42:48 dy-base BUG: warning at mm/truncate.c:398/invalidate_inode_pages2_range()
Dec 16 11:42:49 dy-base Call Trace:
Dec 16 11:42:49 dy-base  [<ffffffff8025b8a2>] invalidate_inode_pages2_range+0x285/0x2b8
Dec 16 11:42:49 dy-base  [<ffffffff805e7ddd>] _spin_unlock+0x9/0xb
Dec 16 11:42:49 dy-base  [<ffffffff8031b98d>] nfs_sync_inode_wait+0x1ab/0x1bd
Dec 16 11:42:49 dy-base  [<ffffffff8025b8e4>] invalidate_inode_pages2+0xf/0x11
Dec 16 11:42:49 dy-base  [<ffffffff80314859>] nfs_revalidate_mapping+0xa0/0x152
Dec 16 11:42:49 dy-base  [<ffffffff80312a07>] nfs_file_read+0x6e/0xbe
Dec 16 11:42:49 dy-base  [<ffffffff80273f99>] do_sync_read+0xe2/0x126
Dec 16 11:42:49 dy-base  [<ffffffff805e7ffc>] unlock_kernel+0x35/0x37
Dec 16 11:42:49 dy-base  [<ffffffff8023c804>] autoremove_wake_function+0x0/0x38
Dec 16 11:42:49 dy-base  [<ffffffff802729e7>] nameidata_to_filp+0x2d/0x3e
Dec 16 11:42:49 dy-base  [<ffffffff8026fd15>] poison_obj+0x27/0x32
Dec 16 11:42:49 dy-base  [<ffffffff8026fee6>] cache_free_debugcheck+0x1c6/0x1d6
Dec 16 11:42:49 dy-base  [<ffffffff8027cb26>] putname+0x37/0x39
Dec 16 11:42:49 dy-base  [<ffffffff80274849>] vfs_read+0xcc/0x172
Dec 16 11:42:49 dy-base  [<ffffffff80274d3e>] sys_pread64+0x55/0x76
Dec 16 11:42:49 dy-base  [<ffffffff802098ee>] system_call+0x7e/0x83

....

Dec 16 12:08:36 dy-base BUG: spinlock bad magic on CPU#3, xfslogd/3/317
Dec 16 12:08:36 dy-base general protection fault: 0000 [1] SMP
Dec 16 12:08:36 dy-base CPU 3
Dec 16 12:08:36 dy-base Modules linked in: nbd rd netconsole e1000 video
Dec 16 12:08:36 dy-base Pid: 317, comm: xfslogd/3 Not tainted 2.6.19 #1
Dec 16 12:08:36 dy-base RIP: 0010:[<ffffffff803f3aba>]  [<ffffffff803f3aba>] spin_bug+0x69/0xdf
Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0 EFLAGS: 00010002
Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
Dec 16 12:08:36 dy-base RDX: ffffffff807f3be2 RSI: 0000000000000082 RDI: 0000000100000000
Dec 16 12:08:36 dy-base RBP: ffff81011fdedbe0 R08: 0000000000006eb2 R09: 000000006b6b6b6b
Dec 16 12:08:36 dy-base R10: 0000000000000082 R11: ffff81000584d280 R12: ffff810081476098
Dec 16 12:08:36 dy-base R13: ffffffff80642dc6 R14: 0000000000000000 R15: 0000000000000003
Dec 16 12:08:36 dy-base FS: 0000000000000000(0000) GS:ffff81011fc76b90(0000) knlGS:0000000000000000
Dec 16 12:08:36 dy-base CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Dec 16 12:08:36 dy-base CR2: 00002afc7d3ea000 CR3: 0000000117afc000 CR4: 00000000000006e0
Dec 16 12:08:36 dy-base Process xfslogd/3 (pid: 317, threadinfo ffff81011fdec000, task ffff81011fafc140)
Dec 16 12:08:36 dy-base Stack: ffff81011fdedbe0 ffff810081476098 0000000000000000 0000000000000000
Dec 16 12:08:36 dy-base  ffff81011fdedc10 ffffffff803f3bdc 0000000000000282 0000000000000000
Dec 16 12:08:36 dy-base  0000000000000000 0000000000000000 ffff81011fdedc30 ffffffff805e7f2b
Dec 16 12:08:36 dy-base Call Trace:
Dec 16 12:08:36 dy-base  [<ffffffff803f3bdc>] _raw_spin_lock+0x23/0xf1
Dec 16 12:08:36 dy-base  [<ffffffff805e7f2b>] _spin_lock_irqsave+0x11/0x18
Dec 16 12:08:36 dy-base  [<ffffffff80222aab>] __wake_up+0x22/0x50
Dec 16 12:08:36 dy-base  [<ffffffff803c97f9>] xfs_buf_unpin+0x21/0x23
Dec 16 12:08:36 dy-base  [<ffffffff803970a4>] xfs_buf_item_unpin+0x2e/0xa6
Dec 16 12:08:36 dy-base  [<ffffffff803bc460>] xfs_trans_chunk_committed+0xc3/0xf7
Dec 16 12:08:37 dy-base  [<ffffffff803bc4dd>] xfs_trans_committed+0x49/0xde
Dec 16 12:08:37 dy-base  [<ffffffff803b1bde>] xlog_state_do_callback+0x185/0x33f
Dec 16 12:08:37 dy-base  [<ffffffff803b1e9c>] xlog_iodone+0x104/0x131
Dec 16 12:08:37 dy-base  [<ffffffff803c9dae>] xfs_buf_iodone_work+0x1a/0x3e
Dec 16 12:08:37 dy-base  [<ffffffff802399d8>] worker_thread+0x0/0x134
Dec 16 12:08:37 dy-base  [<ffffffff8023937e>] run_workqueue+0xa8/0xf8
Dec 16 12:08:37 dy-base  [<ffffffff803c9d94>] xfs_buf_iodone_work+0x0/0x3e
Dec 16 12:08:37 dy-base  [<ffffffff802399d8>] worker_thread+0x0/0x134
Dec 16 12:08:37 dy-base  [<ffffffff80239ad3>] worker_thread+0xfb/0x134
Dec 16 12:08:37 dy-base  [<ffffffff80223f6c>] default_wake_function+0x0/0xf
Dec 16 12:08:37 dy-base  [<ffffffff802399d8>] worker_thread+0x0/0x134
Dec 16 12:08:37 dy-base  [<ffffffff8023c6e5>] kthread+0xd8/0x10b
Dec 16 12:08:37 dy-base  [<ffffffff802256ac>] schedule_tail+0x45/0xa6
Dec 16 12:08:37 dy-base  [<ffffffff8020a6a8>] child_rip+0xa/0x12
Dec 16 12:08:37 dy-base  [<ffffffff802399d8>] worker_thread+0x0/0x134
Dec 16 12:08:37 dy-base  [<ffffffff8023c60d>] kthread+0x0/0x10b
Dec 16 12:08:38 dy-base  [<ffffffff8020a69e>] child_rip+0x0/0x12
Dec 16 12:08:38 dy-base Code: 8b 83 0c 01 00 00 48 8d 8b 98 02 00 00 41 8b 54 24 04 41 89
Dec 16 12:08:38 dy-base RIP  [<ffffffff803f3aba>] spin_bug+0x69/0xdf
Dec 16 12:08:38 dy-base RSP <ffff81011fdedbc0>
Dec 16 12:08:38 dy-base Kernel panic - not syncing: Fatal exception
Dec 16 12:08:38 dy-base Rebooting in 5 seconds..

----- Original Message -----
From: "Haar János" <djani22@netcenter.hu>
To: "Justin Piszcz" <jpiszcz@lucidpixels.com>
Cc: <linux-xfs@oss.sgi.com>; <linux-kernel@vger.kernel.org>
Sent: Wednesday, December 13, 2006 2:11 AM
Subject: Re: xfslogd-spinlock bug?

> [previous reply, Justin's question, and the original oops report quoted in full; identical to the messages above]
* Re: xfslogd-spinlock bug?
From: David Chinner @ 2006-12-17 22:44 UTC
To: Haar János; +Cc: linux-xfs, linux-kernel

On Sat, Dec 16, 2006 at 12:19:45PM +0100, Haar János wrote:
> Hi
>
> I have some news. I don't know whether there is a connection between
> the 2 messages, but I can see that the spinlock bug always comes on
> CPU #3.
>
> Does somebody have any idea?

Your disk interrupts are directed to CPU 3, and so log I/O completion
occurs on that CPU.

> Dec 16 12:08:36 dy-base BUG: spinlock bad magic on CPU#3, xfslogd/3/317
> Dec 16 12:08:36 dy-base general protection fault: 0000 [1] SMP
> Dec 16 12:08:36 dy-base CPU 3
> Dec 16 12:08:36 dy-base Modules linked in: nbd

Are you using XFS on a NBD?

> Dec 16 12:08:36 dy-base Pid: 317, comm: xfslogd/3 Not tainted 2.6.19 #1
> Dec 16 12:08:36 dy-base RIP: 0010:[<ffffffff803f3aba>] spin_bug+0x69/0xdf
> Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0 EFLAGS: 00010002
> Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX:
                                                     ^^^^^^^^^^^^^^^^
Anyone recognise that pattern?

> Dec 16 12:08:36 dy-base Call Trace:
> Dec 16 12:08:36 dy-base  [<ffffffff803f3bdc>] _raw_spin_lock+0x23/0xf1
> Dec 16 12:08:36 dy-base  [<ffffffff805e7f2b>] _spin_lock_irqsave+0x11/0x18
> Dec 16 12:08:36 dy-base  [<ffffffff80222aab>] __wake_up+0x22/0x50
> Dec 16 12:08:36 dy-base  [<ffffffff803c97f9>] xfs_buf_unpin+0x21/0x23
> Dec 16 12:08:36 dy-base  [<ffffffff803970a4>] xfs_buf_item_unpin+0x2e/0xa6

This implies a spinlock inside a wait_queue_head_t is corrupt.

What type of system do you have, and what sort of workload are you
running?

Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
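Dave's "spinlock inside a wait_queue_head_t" remark refers to the lock embedded at the head of every wait queue: the __wake_up() frame in the trace takes exactly that lock. The following condensed sketch of the relevant 2.6-era definitions (after include/linux/wait.h and kernel/sched.c) is included for orientation and is not the exact kernel source:

```c
/* Condensed from 2.6-era include/linux/wait.h: the spinlock the magic
 * check trips over is the first member of the wait queue head. */
struct __wait_queue_head {
	spinlock_t		lock;		/* <-- corrupt in this oops */
	struct list_head	task_list;
};
typedef struct __wait_queue_head wait_queue_head_t;

/* Condensed from 2.6-era kernel/sched.c: the __wake_up() frame in the
 * trace.  It takes q->lock first, so if the structure containing q has
 * already been freed, _raw_spin_lock sees bad magic and the box panics. */
void fastcall __wake_up(wait_queue_head_t *q, unsigned int mode,
			int nr_exclusive, void *key)
{
	unsigned long flags;

	spin_lock_irqsave(&q->lock, flags);	/* the two frames above __wake_up */
	__wake_up_common(q, mode, nr_exclusive, 0, key);
	spin_unlock_irqrestore(&q->lock, flags);
}
```

In this trace the wait queue being woken is the one inside the XFS buffer that xfs_buf_unpin() operates on (its b_waiters queue, as the patch later in the thread shows).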
* Re: xfslogd-spinlock bug?
From: Haar János @ 2006-12-17 23:56 UTC
To: David Chinner; +Cc: linux-xfs, linux-kernel

----- Original Message -----
From: "David Chinner" <dgc@sgi.com>
To: "Haar János" <djani22@netcenter.hu>
Cc: <linux-xfs@oss.sgi.com>; <linux-kernel@vger.kernel.org>
Sent: Sunday, December 17, 2006 11:44 PM
Subject: Re: xfslogd-spinlock bug?

> On Sat, Dec 16, 2006 at 12:19:45PM +0100, Haar János wrote:
> > I have some news. I don't know whether there is a connection between
> > the 2 messages, but I can see that the spinlock bug always comes on
> > CPU #3. Does somebody have any idea?
>
> Your disk interrupts are directed to CPU 3, and so log I/O completion
> occurs on that CPU.

           CPU0       CPU1       CPU2       CPU3
  0:        100          0          0    4583704   IO-APIC-edge      timer
  1:          0          0          0          2   IO-APIC-edge      i8042
  4:          0          0          0    3878668   IO-APIC-edge      serial
  8:          0          0          0          0   IO-APIC-edge      rtc
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          0          0          0          3   IO-APIC-edge      i8042
 14:    3072118          0          0        181   IO-APIC-edge      ide0
 16:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
 18:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 23:          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1
 52:          0          0          0  213052723   IO-APIC-fasteoi   eth1
 53:          0          0          0   91913759   IO-APIC-fasteoi   eth2
100:          0          0          0   16776910   IO-APIC-fasteoi   eth0
NMI:      42271      43187      42234      43168
LOC:    4584247    4584219    4584215    4584198
ERR:          0

Maybe.... I have 3 XFS filesystems on this system, from 3 sources:

1. 200G, one IDE hdd.
2. 2x200G mirror on 1 IDE + 1 SATA hdd.
3. 4x3.3TB stripe on NBD.

The NBD is served through eth1, which is on CPU3, but ide0 is on CPU0.

> > Dec 16 12:08:36 dy-base BUG: spinlock bad magic on CPU#3, xfslogd/3/317
> > Dec 16 12:08:36 dy-base Modules linked in: nbd
>
> Are you using XFS on a NBD?

Yes, on the 3rd source. I have used it for about 1.5 years. (The nbd deadlock is fixed on my system, thanks to Herbert Xu, since 2.6.14.)

> > Dec 16 12:08:36 dy-base Pid: 317, comm: xfslogd/3 Not tainted 2.6.19 #1
> > Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX:
>                                                      ^^^^^^^^^^^^^^^^
> Anyone recognise that pattern?

I think I have one idea. This issue sometimes stops the 5-second automatic restart on crash, which points to possible memory corruption, and if the bug occurs during IRQ handling.... :-) I have a lot of logs about this issue, and RAX and RBX are always the same.

> > Dec 16 12:08:36 dy-base Call Trace:
> > Dec 16 12:08:36 dy-base  [<ffffffff803f3bdc>] _raw_spin_lock+0x23/0xf1
> > Dec 16 12:08:36 dy-base  [<ffffffff805e7f2b>] _spin_lock_irqsave+0x11/0x18
> > Dec 16 12:08:36 dy-base  [<ffffffff80222aab>] __wake_up+0x22/0x50
> > Dec 16 12:08:36 dy-base  [<ffffffff803c97f9>] xfs_buf_unpin+0x21/0x23
> > Dec 16 12:08:36 dy-base  [<ffffffff803970a4>] xfs_buf_item_unpin+0x2e/0xa6
>
> This implies a spinlock inside a wait_queue_head_t is corrupt.
>
> What type of system do you have, and what sort of workload are you
> running?

OS: Fedora 5, 64-bit.
HW: dual Xeon with HT, 4GB RAM.
(The min_free_kbytes limit is set to 128000, because sometimes the e1000 driver ran out of the reserved memory during IRQ handling.)

Workload: I use this system for free web storage (2x apache 2.0.xx, 12x pure-ftpd, 2x mysql, but SQL only uses the source #2 fs). The normal system load is ~20-40, but currently I have a little problem with apache: it sometimes starts to read a lot from the big XFS device and eats all the memory, and the load rises to 700-800. At that point I restart httpd and everything goes back to normal, but if I were offline.....

Thanks a lot!
Janos

> Cheers,
> Dave.
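Since the corruption always fires on the CPU that takes the eth1 (NBD) interrupt, one cheap experiment is to move that IRQ to another CPU and see whether the oops follows it. This is not something suggested in the thread, just a hypothetical diagnostic using the standard /proc/irq/<n>/smp_affinity interface (a hex CPU bitmask); IRQ 52 is eth1 in the listing above:

```c
/* Hypothetical diagnostic, not from the thread: pin IRQ 52 (eth1 in
 * the /proc/interrupts listing above) to CPU0 via the standard
 * /proc/irq/<n>/smp_affinity interface, then watch which CPU the next
 * "bad magic" oops reports. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/irq/52/smp_affinity", "w");

	if (!f) {
		perror("fopen /proc/irq/52/smp_affinity");
		return 1;
	}
	fputs("1\n", f);	/* hex bitmask 0x1 = CPU0 only */
	fclose(f);
	return 0;
}
```

If the splat then reports a different CPU, the bug is tied to I/O completion context rather than to anything specific about CPU3.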
* Re: xfslogd-spinlock bug?
From: David Chinner @ 2006-12-18 6:24 UTC
To: Haar János; +Cc: David Chinner, linux-xfs, linux-kernel

On Mon, Dec 18, 2006 at 12:56:41AM +0100, Haar János wrote:
> > On Sat, Dec 16, 2006 at 12:19:45PM +0100, Haar János wrote:
> > > I don't know whether there is a connection between the 2 messages,
> > > but I can see that the spinlock bug always comes on CPU #3.
> > >
> > > Does somebody have any idea?
> >
> > Your disk interrupts are directed to CPU 3, and so log I/O completion
> > occurs on that CPU.
>
>            CPU0       CPU1       CPU2       CPU3
>   0:        100          0          0    4583704   IO-APIC-edge      timer
>   1:          0          0          0          2   IO-APIC-edge      i8042
>   4:          0          0          0    3878668   IO-APIC-edge      serial
.....
>  14:    3072118          0          0        181   IO-APIC-edge      ide0
.....
>  52:          0          0          0  213052723   IO-APIC-fasteoi   eth1
>  53:          0          0          0   91913759   IO-APIC-fasteoi   eth2
> 100:          0          0          0   16776910   IO-APIC-fasteoi   eth0
....
>
> Maybe.... I have 3 XFS filesystems on this system, from 3 sources:
>
> 1. 200G, one IDE hdd.
> 2. 2x200G mirror on 1 IDE + 1 SATA hdd.
> 3. 4x3.3TB stripe on NBD.
>
> The NBD is served through eth1, which is on CPU3, but ide0 is on CPU0.

I'd say your NBD based XFS filesystem is having trouble.

> > Are you using XFS on a NBD?
>
> Yes, on the 3rd source.

Ok, I've never heard of a problem like this before and you are doing
something that very few ppl are doing (i.e. XFS on NBD). Hence I'd
start by suspecting a bug in the NBD driver.

> > > Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0 EFLAGS: 00010002
> > > Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX:
> >                                                      ^^^^^^^^^^^^^^^^
> > Anyone recognise that pattern?
>
> I think I have one idea. This issue sometimes stops the 5-second
> automatic restart on crash, which points to possible memory corruption,
> and if the bug occurs during IRQ handling.... :-)
> I have a lot of logs about this issue, and RAX and RBX are always the same.

And is this the only place where you see the problem? Or are there
other stack traces that you see this in as well?

> > This implies a spinlock inside a wait_queue_head_t is corrupt.
> >
> > What type of system do you have, and what sort of workload are you
> > running?
>
> OS: Fedora 5, 64-bit.
> HW: dual Xeon with HT, 4GB RAM.
> (The min_free_kbytes limit is set to 128000, because sometimes the e1000
> driver ran out of the reserved memory during IRQ handling.)

That does not sound good. What happens when it does run out of memory?
Is that when you start to see the above corruptions?

Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: xfslogd-spinlock bug?
From: Haar János @ 2006-12-18 8:17 UTC
To: David Chinner; +Cc: dgc, linux-xfs, linux-kernel

----- Original Message -----
From: "David Chinner" <dgc@sgi.com>
To: "Haar János" <djani22@netcenter.hu>
Cc: "David Chinner" <dgc@sgi.com>; <linux-xfs@oss.sgi.com>; <linux-kernel@vger.kernel.org>
Sent: Monday, December 18, 2006 7:24 AM
Subject: Re: xfslogd-spinlock bug?

> On Mon, Dec 18, 2006 at 12:56:41AM +0100, Haar János wrote:
> > [earlier exchange and /proc/interrupts listing quoted; identical to the messages above]
> >
> > The NBD is served through eth1, which is on CPU3, but ide0 is on CPU0.
>
> I'd say your NBD based XFS filesystem is having trouble.
>
> > > Are you using XFS on a NBD?
> >
> > Yes, on the 3rd source.
>
> Ok, I've never heard of a problem like this before and you are doing
> something that very few ppl are doing (i.e. XFS on NBD). Hence I'd
> start by suspecting a bug in the NBD driver.

Ok, if you are right, this can also be connected with the following issue:

http://download.netcenter.hu/bughunt/20061217/messages.txt (10KB)

> > > > Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX:
> > >                                                      ^^^^^^^^^^^^^^^^
> > > Anyone recognise that pattern?
> >
> > I have a lot of logs about this issue, and RAX and RBX are always the same.
>
> And is this the only place where you see the problem? Or are there
> other stack traces that you see this in as well?

I used 2.6.16.18 for a long time, and it was stable except for this issue (~20 dumps with xfslogd). And I tried the new releases, and now I have more. :-)

What do you mean exactly? I can look in the logs, but search for what? The RAX/RBX thing, the xfslogd-spinlock problem, or the old nbd-deadlock + memory corruption?

[root@NetCenter netlog]# grep "0000000000000033" messages*
messages.1:Dec 11 22:47:21 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 12 18:16:35 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 13 11:40:05 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 14 22:25:32 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 15 06:24:44 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.11:Oct 3 19:49:44 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.11:Oct 7 01:11:17 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.13:Sep 21 15:35:31 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.15:Sep 3 16:13:35 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.15:Sep 5 21:00:38 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec 9 00:10:47 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec 9 14:07:01 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec 10 04:44:48 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.3:Nov 30 10:59:21 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.3:Dec 2 00:54:23 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 13 10:44:49 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 14 03:14:14 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 14 03:37:07 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 15 01:39:54 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 6 14:48:54 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 7 04:36:13 dy-base RAX: 0000000000000033 RBX: ffff8100057d2080 RCX: ffff810050d638f8
messages.6:Nov 7 04:36:13 dy-base RDX: 0000000000000008 RSI: 0000000000012cff RDI: 0000000000000033
messages.6:Nov 7 11:12:06 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 8 03:20:38 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 8 15:02:16 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 8 15:27:12 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 10 15:29:43 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 11 20:44:14 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.9:Oct 18 15:31:02 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.9:Oct 19 13:53:24 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000

> > > This implies a spinlock inside a wait_queue_head_t is corrupt.
> > >
> > > What type of system do you have, and what sort of workload are you
> > > running?
> >
> > OS: Fedora 5, 64-bit.
> > HW: dual Xeon with HT, 4GB RAM.
> > (The min_free_kbytes limit is set to 128000, because sometimes the e1000
> > driver ran out of the reserved memory during IRQ handling.)
>
> That does not sound good. What happens when it does run out of memory?

This is an old problem on 2.6.16.18. The default min_free_kbytes is 38xx, and the GigE controller can easily overflow that little space. If that happens, the system freezes, and I can only use the serial console + sysrq to dump stacks:

download.netcenter.hu/bughunt/20060530/261618-good.txt
download.netcenter.hu/bughunt/20060530/dmesg.txt
download.netcenter.hu/bughunt/20060530/dump.txt

This problem is already fixed by setting min_free_kbytes to 128M.

> Is that when you start to see the above corruptions?

I think not, but I am not 100% sure.

Cheers,
Janos

> Cheers,
> Dave.
* Re: xfslogd-spinlock bug?
From: David Chinner @ 2006-12-18 22:36 UTC
To: Haar János; +Cc: David Chinner, linux-xfs, linux-kernel

On Mon, Dec 18, 2006 at 09:17:50AM +0100, Haar János wrote:
> > > The NBD is served through eth1, which is on CPU3, but ide0 is on CPU0.
> >
> > I'd say your NBD based XFS filesystem is having trouble.
> >
> > > > Are you using XFS on a NBD?
> > >
> > > Yes, on the 3rd source.
> >
> > Ok, I've never heard of a problem like this before and you are doing
> > something that very few ppl are doing (i.e. XFS on NBD). Hence I'd
> > start by suspecting a bug in the NBD driver.
>
> Ok, if you are right, this can also be connected with the following issue:
>
> http://download.netcenter.hu/bughunt/20061217/messages.txt (10KB)

Which appears to be a crash in wake_up_process() when doing memory
reclaim (waking the xfsbufd).

> > > > > Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0 EFLAGS: 00010002
> > > > > Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX:
> > > >                                                      ^^^^^^^^^^^^^^^^
> > > > Anyone recognise that pattern?

Ok, I've found this pattern:

#define POISON_FREE 0x6b

Can you confirm that you are running with CONFIG_DEBUG_SLAB=y?

If so, we have a use-after-free occurring here, and it would also
explain why no-one has reported it before.

FWIW, can you turn on CONFIG_XFS_DEBUG=y and see if that triggers
a different bug check prior to the above dump?

Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
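Dave's identification of 0x6b as POISON_FREE is the key clue. For reference, these are the slab-debug constants as found in the 2.6.19-era include/linux/poison.h (listed here for orientation; quoted from memory of that tree, so treat as approximate):

```c
/* Slab-debug constants, after 2.6.19-era include/linux/poison.h.
 * With CONFIG_DEBUG_SLAB=y the allocator brackets and fills objects
 * with these patterns, so reading one back at runtime (e.g. RBX =
 * 6b6b6b6b6b6b6b6b in the oops) is direct evidence of use-after-free. */
#define RED_INACTIVE	0x5A2CF071UL	/* red-zone word while the object is free */
#define RED_ACTIVE	0x170FC2A5UL	/* red-zone word while the object is live */
#define POISON_INUSE	0x5a	/* fill byte for allocated, uninitialised objects */
#define POISON_FREE	0x6b	/* fill byte for freed objects */
#define POISON_END	0xa5	/* marker byte at the end of a freed object */
```

The same constants decode the kdb memory dump later in the thread: the 0x5a2cf071 words are the red zones bracketing a freed object, the 0x6b fill is POISON_FREE, and the stray 0xa5 byte is POISON_END.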
* Re: xfslogd-spinlock bug?
From: Haar János @ 2006-12-18 23:39 UTC
To: David Chinner; +Cc: dgc, linux-xfs, linux-kernel

----- Original Message -----
From: "David Chinner" <dgc@sgi.com>
To: "Haar János" <djani22@netcenter.hu>
Cc: "David Chinner" <dgc@sgi.com>; <linux-xfs@oss.sgi.com>; <linux-kernel@vger.kernel.org>
Sent: Monday, December 18, 2006 11:36 PM
Subject: Re: xfslogd-spinlock bug?

> On Mon, Dec 18, 2006 at 09:17:50AM +0100, Haar János wrote:
> > Ok, if you are right, this can also be connected with the following issue:
> >
> > http://download.netcenter.hu/bughunt/20061217/messages.txt (10KB)
>
> Which appears to be a crash in wake_up_process() when doing memory
> reclaim (waking the xfsbufd).

Sorry, can you translate that into poor man's language? :-) Is this a different bug?

> Ok, I've found this pattern:
>
> #define POISON_FREE 0x6b
>
> Can you confirm that you are running with CONFIG_DEBUG_SLAB=y?

Yes, I built with this option enabled. Is this wrong?

> If so, we have a use-after-free occurring here, and it would also
> explain why no-one has reported it before.
>
> FWIW, can you turn on CONFIG_XFS_DEBUG=y and see if that triggers
> a different bug check prior to the above dump?

[root@X64 linux-2.6.19]# make bzImage
scripts/kconfig/conf -s arch/x86_64/Kconfig
.config:7:warning: trying to assign nonexistent symbol XFS_DEBUG

Have I missed something?

Thanks,
Janos
* Re: xfslogd-spinlock bug?
From: David Chinner @ 2006-12-19 2:52 UTC
To: Haar János; +Cc: David Chinner, linux-xfs, linux-kernel

On Tue, Dec 19, 2006 at 12:39:46AM +0100, Haar János wrote:
> > Which appears to be a crash in wake_up_process() when doing memory
> > reclaim (waking the xfsbufd).
>
> Sorry, can you translate that into poor man's language? :-)
> Is this a different bug?

Don't know - it's a different crash, but once again one that I've never
heard of occurring before.

> > Ok, I've found this pattern:
> >
> > #define POISON_FREE 0x6b
> >
> > Can you confirm that you are running with CONFIG_DEBUG_SLAB=y?
>
> Yes, I built with this option enabled.
> Is this wrong?

No, but it does slow your machine down.

> > FWIW, can you turn on CONFIG_XFS_DEBUG=y and see if that triggers
> > a different bug check prior to the above dump?
>
> [root@X64 linux-2.6.19]# make bzImage
> scripts/kconfig/conf -s arch/x86_64/Kconfig
> .config:7:warning: trying to assign nonexistent symbol XFS_DEBUG
>
> Have I missed something?

No - I forgot that config option doesn't exist in mainline XFS - it's
only in the dev tree.

FWIW, I've run XFSQA twice now on a scsi disk with slab debugging turned
on and I haven't seen this problem. I'm not sure how to track down the
source of the problem without a test case, but as a quick test, can you
try the following patch?

Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/linux-2.6/xfs_buf.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c	2006-12-19 12:22:54.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c	2006-12-19 13:48:36.937118569 +1100
@@ -942,11 +942,14 @@ xfs_buf_unlock(
 /*
  * Pinning Buffer Storage in Memory
  * Ensure that no attempt to force a buffer to disk will succeed.
+ * Hold the buffer so we don't attempt to free it while it
+ * is pinned.
  */
 void
 xfs_buf_pin(
 	xfs_buf_t		*bp)
 {
+	xfs_buf_hold(bp);
 	atomic_inc(&bp->b_pin_count);
 	XB_TRACE(bp, "pin", (long)bp->b_pin_count.counter);
 }
@@ -958,6 +961,7 @@ xfs_buf_unpin(
 	if (atomic_dec_and_test(&bp->b_pin_count))
 		wake_up_all(&bp->b_waiters);
 	XB_TRACE(bp, "unpin", (long)bp->b_pin_count.counter);
+	xfs_buf_rele(bp);
 }
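The idea behind the patch: a log-I/O completion thread may still be inside xfs_buf_unpin(), about to wake b_waiters, while another thread drops the last reference and frees the buffer; the wakeup then spins on the poisoned wait-queue lock. The timeline below sketches one plausible interleaving consistent with the traces above; it is a hypothesis, not a diagnosis confirmed in the thread:

```c
/*
 * Sketch of the suspected use-after-free window (hypothetical
 * interleaving, consistent with the "bad magic" traces above):
 *
 *   CPU A: log I/O completion          CPU B: last user of the buffer
 *   --------------------------         ------------------------------
 *   xfs_buf_item_unpin()
 *     xfs_buf_unpin(bp)
 *       atomic_dec_and_test(
 *           &bp->b_pin_count)
 *                                      xfs_buf_rele(bp)
 *                                        -> last reference gone, the
 *                                           buffer is freed; DEBUG_SLAB
 *                                           fills it with POISON_FREE
 *       wake_up_all(&bp->b_waiters)
 *         __wake_up() takes the
 *         poisoned b_waiters.lock
 *         -> "spinlock bad magic"
 *
 * The patch takes an extra hold in xfs_buf_pin(), so the final free
 * cannot happen until xfs_buf_unpin() has finished the wakeup.
 */
```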
* Re: xfslogd-spinlock bug?
  2006-12-19  2:52 ` David Chinner
@ 2006-12-19  4:47 ` David Chinner
  2006-12-27 12:58 ` Haar János
  0 siblings, 1 reply; 17+ messages in thread

From: David Chinner @ 2006-12-19 4:47 UTC (permalink / raw)
To: David Chinner; +Cc: Haar János, linux-xfs, linux-kernel

On Tue, Dec 19, 2006 at 01:52:29PM +1100, David Chinner wrote:
> On Tue, Dec 19, 2006 at 12:39:46AM +0100, Haar János wrote:
> > From: "David Chinner" <dgc@sgi.com>
> > > > #define POISON_FREE 0x6b
> > >
> > > Can you confirm that you are running with CONFIG_DEBUG_SLAB=y?
> >
> > Yes, I built with this option enabled.
......
> FWIW, I've run XFSQA twice now on a scsi disk with slab debugging turned
> on and I haven't seen this problem. I'm not sure how to track down
> the source of the problem without a test case, but as a quick test, can
> you try the following patch?

Third try and I got a crash on a poisoned object:

[1]kdb> md8c40 e00000300d7d5100
0xe00000300d7d5100 000000005a2cf071 0000000000000000 q.,Z............
0xe00000300d7d5110 000000005a2cf071 6b6b6b6b6b6b6b6b q.,Z....kkkkkkkk
0xe00000300d7d5120 e0000039eb7b6320 6b6b6b6b6b6b6b6b c{.9...kkkkkkkk
0xe00000300d7d5130 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
0xe00000300d7d5140 6b6b6b6f6b6b6b6b 6b6b6b6b6b6b6b6b kkkkokkkkkkkkkkk
0xe00000300d7d5150 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
0xe00000300d7d5160 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
0xe00000300d7d5170 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
0xe00000300d7d5180 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
0xe00000300d7d5190 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
0xe00000300d7d51a0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
0xe00000300d7d51b0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
0xe00000300d7d51c0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
0xe00000300d7d51d0 6b6b6b6b6b6b6b6b a56b6b6b6b6b6b6b kkkkkkkkkkkkkkk.
0xe00000300d7d51e0 000000005a2cf071 a000000100468c30 q.,Z....0.F.....
[1]kdb> mds 0xe00000300d7d51e0
0xe00000300d7d51e0 5a2cf071 q.,Z....
0xe00000300d7d51e8 a000000100468c30 xfs_inode_item_destroy+0x30

So the use-after-free here is on an inode item. You're tripping
over a buffer item.

Unfortunately, it is not the same problem - the problem I've just
hit is to do with a QA test that does a forced shutdown on an active
filesystem, and:

[1]kdb> xmount 0xe00000304393e238
.....
flags 0x440010 <FSSHUTDOWN IDELETE COMPAT_IOSIZE >

The filesystem was being shut down, so xfs_inode_item_destroy() just
frees the inode log item without removing it from the AIL. I'll fix
that, and see if I have any luck....

So I'd still try that patch I sent in the previous email...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
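[The repeated 0x6b byte is what makes dumps like this readable at a glance: CONFIG_DEBUG_SLAB overwrites freed objects with POISON_FREE, so any register or word that is nothing but 0x6b almost certainly came from freed slab memory. A small stand-alone sketch of that eyeball check - a hypothetical helper, not kernel code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define POISON_FREE 0x6b	/* written over freed objects by slab debugging */

/* True if every byte of the word carries the free-poison pattern. */
static bool looks_freed(uint64_t word)
{
	for (int i = 0; i < 8; i++)
		if (((word >> (i * 8)) & 0xff) != POISON_FREE)
			return false;
	return true;
}

int main(void)
{
	/* RBX from the original oops: pure poison, i.e. use after free. */
	printf("%d\n", looks_freed(0x6b6b6b6b6b6b6b6bULL));	/* prints 1 */
	/* A pointer-looking value fails the test. */
	printf("%d\n", looks_freed(0xffff8100db92ad80ULL));	/* prints 0 */
	return 0;
}

The other constants in the dump corroborate the reading: the 0x5a2cf071 words at the object's edges match the slab red-zone marker for an inactive (freed) object, the trailing 0xa5 byte matches POISON_END, and the lone 0x6f amid the 0x6b poison (one bit away from 0x6b) is exactly the kind of stray scribble this machinery exists to catch.]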
* Re: xfslogd-spinlock bug?
  2006-12-19  4:47 ` David Chinner
@ 2006-12-27 12:58 ` Haar János
  2007-01-07 23:14 ` David Chinner
  0 siblings, 1 reply; 17+ messages in thread

From: Haar János @ 2006-12-27 12:58 UTC (permalink / raw)
To: David Chinner; +Cc: linux-xfs, linux-kernel, dgc

Hello,

----- Original Message -----
From: "David Chinner" <dgc@sgi.com>
To: "David Chinner" <dgc@sgi.com>
Cc: "Haar János" <djani22@netcenter.hu>; <linux-xfs@oss.sgi.com>; <linux-kernel@vger.kernel.org>
Sent: Tuesday, December 19, 2006 5:47 AM
Subject: Re: xfslogd-spinlock bug?

> On Tue, Dec 19, 2006 at 01:52:29PM +1100, David Chinner wrote:
> > On Tue, Dec 19, 2006 at 12:39:46AM +0100, Haar János wrote:
> > > From: "David Chinner" <dgc@sgi.com>
> > > > #define POISON_FREE 0x6b
> > > >
> > > > Can you confirm that you are running with CONFIG_DEBUG_SLAB=y?
> > >
> > > Yes, I built with this option enabled.
> ......
> > FWIW, I've run XFSQA twice now on a scsi disk with slab debugging turned
> > on and I haven't seen this problem. I'm not sure how to track down
> > the source of the problem without a test case, but as a quick test, can
> > you try the following patch?
>
> Third try and I got a crash on a poisoned object:
>
> [1]kdb> md8c40 e00000300d7d5100
> 0xe00000300d7d5100 000000005a2cf071 0000000000000000 q.,Z............
> 0xe00000300d7d5110 000000005a2cf071 6b6b6b6b6b6b6b6b q.,Z....kkkkkkkk
> 0xe00000300d7d5120 e0000039eb7b6320 6b6b6b6b6b6b6b6b c{.9...kkkkkkkk
> 0xe00000300d7d5130 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
> 0xe00000300d7d5140 6b6b6b6f6b6b6b6b 6b6b6b6b6b6b6b6b kkkkokkkkkkkkkkk
> 0xe00000300d7d5150 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
> 0xe00000300d7d5160 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
> 0xe00000300d7d5170 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
> 0xe00000300d7d5180 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
> 0xe00000300d7d5190 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
> 0xe00000300d7d51a0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
> 0xe00000300d7d51b0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
> 0xe00000300d7d51c0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b kkkkkkkkkkkkkkkk
> 0xe00000300d7d51d0 6b6b6b6b6b6b6b6b a56b6b6b6b6b6b6b kkkkkkkkkkkkkkk.
> 0xe00000300d7d51e0 000000005a2cf071 a000000100468c30 q.,Z....0.F.....
> [1]kdb> mds 0xe00000300d7d51e0
> 0xe00000300d7d51e0 5a2cf071 q.,Z....
> 0xe00000300d7d51e8 a000000100468c30 xfs_inode_item_destroy+0x30
>
> So the use-after-free here is on an inode item. You're tripping
> over a buffer item.
>
> Unfortunately, it is not the same problem - the problem I've just
> hit is to do with a QA test that does a forced shutdown on an active
> filesystem, and:
>
> [1]kdb> xmount 0xe00000304393e238
> .....
> flags 0x440010 <FSSHUTDOWN IDELETE COMPAT_IOSIZE >
>
> The filesystem was being shut down, so xfs_inode_item_destroy() just
> frees the inode log item without removing it from the AIL. I'll fix
> that, and see if I have any luck....
>
> So I'd still try that patch I sent in the previous email...

I am still using the patch, but it didn't show any messages at this point.

I've got 3 crashes/reboots, but 2 were caused by nbd disconnects, and
this one:

Dec 27 13:41:29 dy-base BUG: warning at kernel/mutex.c:220/__mutex_unlock_common_slowpath()
Dec 27 13:41:29 dy-base Unable to handle kernel paging request at 0000000066604480 RIP:
Dec 27 13:41:29 dy-base [<ffffffff80222c64>] resched_task+0x12/0x64
Dec 27 13:41:29 dy-base PGD 115246067 PUD 0
Dec 27 13:41:29 dy-base Oops: 0000 [1] SMP
Dec 27 13:41:29 dy-base CPU 1
Dec 27 13:41:29 dy-base Modules linked in: nbd rd netconsole e1000 video
Dec 27 13:41:29 dy-base Pid: 4069, comm: httpd Not tainted 2.6.19 #3
Dec 27 13:41:29 dy-base RIP: 0010:[<ffffffff80222c64>] [<ffffffff80222c64>] resched_task+0x12/0x64
Dec 27 13:41:29 dy-base RSP: 0018:ffff810105c01b78 EFLAGS: 00010083
Dec 27 13:41:29 dy-base RAX: ffffffff807d5800 RBX: 00001749fd97c214 RCX: ffff81001cbd0000
Dec 27 13:41:29 dy-base RDX: 000000001cbd0048 RSI: ffff810005834068 RDI: ffff8101047bf040
Dec 27 13:41:29 dy-base RBP: ffff810105c01b78 R08: 0000000000000001 R09: 0000000000000000
Dec 27 13:41:29 dy-base R10: 0000000000000057 R11: ffff81000583cd80 R12: ffff810116693140
Dec 27 13:41:29 dy-base R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
Dec 27 13:41:29 dy-base FS: 00002ba3c1ad07d0(0000) GS:ffff81011fc769c8(0000) knlGS:0000000000000000
Dec 27 13:41:29 dy-base CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 27 13:41:29 dy-base CR2: 0000000066604480 CR3: 0000000118196000 CR4: 00000000000006e0
Dec 27 13:41:29 dy-base Process httpd (pid: 4069, threadinfo ffff810105c00000, task ffff8101166e1040)
Dec 27 13:41:29 dy-base Stack: ffff810105c01bf8 ffffffff80223f37 ffff810002996a00 0000000002cba600
Dec 27 13:41:29 dy-base 000000000000000f 0000000000000001 ffff810005833700 0000000100000000
Dec 27 13:41:29 dy-base 0000000000000005 0000000000000296 ffff810105c01bd8 ffff810117fef690
Dec 27 13:41:29 dy-base Call Trace:
Dec 27 13:41:29 dy-base [<ffffffff80223f37>] try_to_wake_up+0x3a7/0x3dc
Dec 27 13:41:29 dy-base [<ffffffff80223f98>] wake_up_process+0x10/0x12
Dec 27 13:41:29 dy-base [<ffffffff803c9186>] xfsbufd_wakeup+0x34/0x61
Dec 27 13:41:29 dy-base [<ffffffff8025cdf9>] shrink_slab+0x64/0x163
Dec 27 13:41:29 dy-base [<ffffffff8025d913>] try_to_free_pages+0x19c/0x289
Dec 27 13:41:29 dy-base [<ffffffff80258b62>] __alloc_pages+0x1b8/0x2c0
Dec 27 13:41:29 dy-base [<ffffffff80267f94>] anon_vma_prepare+0x29/0xf1
Dec 27 13:41:29 dy-base [<ffffffff80260d52>] __handle_mm_fault+0x496/0x9e3
Dec 27 13:41:29 dy-base [<ffffffff805e7dfd>] _spin_unlock+0x9/0xb
Dec 27 13:41:29 dy-base [<ffffffff8021a1b6>] do_page_fault+0x418/0x7b6
Dec 27 13:41:29 dy-base [<ffffffff802082ed>] __switch_to+0x280/0x28f
Dec 27 13:41:29 dy-base [<ffffffff805e7efb>] _spin_unlock_irq+0x9/0xc
Dec 27 13:41:29 dy-base [<ffffffff805e5ca8>] thread_return+0x5e/0xf7
Dec 27 13:41:29 dy-base [<ffffffff805e809d>] error_exit+0x0/0x84
Dec 27 13:41:29 dy-base
Dec 27 13:41:29 dy-base
Dec 27 13:41:29 dy-base Code: 48 8b 14 d5 40 42 78 80 48 03 42 08 8b 00 85 c0 7e 0a 0f 0b
Dec 27 13:41:29 dy-base RIP [<ffffffff80222c64>] resched_task+0x12/0x64
Dec 27 13:41:29 dy-base RSP <ffff810105c01b78>
Dec 27 13:41:29 dy-base CR2: 0000000066604480
Dec 27 13:41:29 dy-base <0>Kernel panic - not syncing: Fatal exception
Dec 27 13:41:29 dy-base
Dec 27 13:41:29 dy-base Rebooting in 5 seconds..

I found one bug in my apache config, and I think the test case has
changed. :-(
Before the config was fixed, some users could stress the xfs source
device readahead and periodically overload the system with this action.

I think the original bug only shows up on a highly overloaded system,
and is related to readahead + buffering/caching.

Thanks,
Janos

> Cheers,
>
> Dave.
> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
* Re: xfslogd-spinlock bug?
  2006-12-27 12:58 ` Haar János
@ 2007-01-07 23:14 ` David Chinner
  2007-01-10 17:18 ` Janos Haar
  0 siblings, 1 reply; 17+ messages in thread

From: David Chinner @ 2007-01-07 23:14 UTC (permalink / raw)
To: Haar János; +Cc: David Chinner, linux-xfs, linux-kernel

On Wed, Dec 27, 2006 at 01:58:06PM +0100, Haar János wrote:
> Hello,
>
> ----- Original Message -----
> From: "David Chinner" <dgc@sgi.com>
> To: "David Chinner" <dgc@sgi.com>
> Cc: "Haar János" <djani22@netcenter.hu>; <linux-xfs@oss.sgi.com>;
> <linux-kernel@vger.kernel.org>
> Sent: Tuesday, December 19, 2006 5:47 AM
> Subject: Re: xfslogd-spinlock bug?
>
> > On Tue, Dec 19, 2006 at 01:52:29PM +1100, David Chinner wrote:
> >
> > The filesystem was being shut down, so xfs_inode_item_destroy() just
> > frees the inode log item without removing it from the AIL. I'll fix
> > that, and see if I have any luck....
> >
> > So I'd still try that patch I sent in the previous email...
>
> I am still using the patch, but it didn't show any messages at this point.
>
> I've got 3 crashes/reboots, but 2 were caused by nbd disconnects, and
> this one:
>
> Dec 27 13:41:29 dy-base BUG: warning at kernel/mutex.c:220/__mutex_unlock_common_slowpath()
> Dec 27 13:41:29 dy-base Unable to handle kernel paging request at 0000000066604480 RIP:
> Dec 27 13:41:29 dy-base [<ffffffff80222c64>] resched_task+0x12/0x64
> Dec 27 13:41:29 dy-base PGD 115246067 PUD 0
> Dec 27 13:41:29 dy-base Oops: 0000 [1] SMP
> Dec 27 13:41:29 dy-base CPU 1
> Dec 27 13:41:29 dy-base Modules linked in: nbd rd netconsole e1000 video
> Dec 27 13:41:29 dy-base Pid: 4069, comm: httpd Not tainted 2.6.19 #3
> Dec 27 13:41:29 dy-base RIP: 0010:[<ffffffff80222c64>] [<ffffffff80222c64>] resched_task+0x12/0x64
> Dec 27 13:41:29 dy-base RSP: 0018:ffff810105c01b78 EFLAGS: 00010083
> Dec 27 13:41:29 dy-base RAX: ffffffff807d5800 RBX: 00001749fd97c214 RCX:

Different corruption in RBX here. Looks like semi-random garbage there.
I wonder - what's the mac and ip address(es) of your machine and nbd
servers?

(i.e. I suspect this is a nbd problem, not an XFS problem)

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
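[Asking for MAC and IP addresses is a standard triage step for this kind of "semi-random garbage": if the bytes landing in a corrupted register or structure match addresses seen in network frames, the finger points at the network stack or a driver writing into memory it no longer owns - very relevant with NBD in the I/O path. A rough sketch of the comparison being done by eye, using a hypothetical helper and one of the node MACs Janos lists in the next message:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Does any 6-byte window of this 64-bit value match the MAC? */
static int contains_mac(uint64_t word, const uint8_t mac[6])
{
	uint8_t bytes[8];

	memcpy(bytes, &word, sizeof(bytes));	/* native order; little-endian on this x86-64 box */
	for (int off = 0; off + 6 <= 8; off++)
		if (memcmp(bytes + off, mac, 6) == 0)
			return 1;
	return 0;
}

int main(void)
{
	const uint8_t node1[6] = { 0x00, 0x0e, 0x0c, 0xa0, 0xe5, 0x7e };

	/* RBX from the Dec 27 oops against the node1 MAC: no match. */
	printf("%d\n", contains_mac(0x00001749fd97c214ULL, node1));
	return 0;
}

In this thread the comparison comes up empty ("no matches. Oh well, it was a long shot." below), which leaves the NBD path as a suspect on circumstantial grounds only.]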
* Re: xfslogd-spinlock bug?
  2007-01-07 23:14 ` David Chinner
@ 2007-01-10 17:18 ` Janos Haar
  2007-01-11  3:34 ` David Chinner
  0 siblings, 1 reply; 17+ messages in thread

From: Janos Haar @ 2007-01-10 17:18 UTC (permalink / raw)
To: David Chinner; +Cc: dgc, linux-xfs, linux-kernel

----- Original Message -----
From: "David Chinner" <dgc@sgi.com>
To: "Haar János" <djani22@netcenter.hu>
Cc: "David Chinner" <dgc@sgi.com>; <linux-xfs@oss.sgi.com>; <linux-kernel@vger.kernel.org>
Sent: Monday, January 08, 2007 12:14 AM
Subject: Re: xfslogd-spinlock bug?

> On Wed, Dec 27, 2006 at 01:58:06PM +0100, Haar János wrote:
> > Hello,
> >
> > ----- Original Message -----
> > From: "David Chinner" <dgc@sgi.com>
> > To: "David Chinner" <dgc@sgi.com>
> > Cc: "Haar János" <djani22@netcenter.hu>; <linux-xfs@oss.sgi.com>;
> > <linux-kernel@vger.kernel.org>
> > Sent: Tuesday, December 19, 2006 5:47 AM
> > Subject: Re: xfslogd-spinlock bug?
> >
> > > On Tue, Dec 19, 2006 at 01:52:29PM +1100, David Chinner wrote:
> > >
> > > The filesystem was being shut down, so xfs_inode_item_destroy() just
> > > frees the inode log item without removing it from the AIL. I'll fix
> > > that, and see if I have any luck....
> > >
> > > So I'd still try that patch I sent in the previous email...
> >
> > I am still using the patch, but it didn't show any messages at this point.
> >
> > I've got 3 crashes/reboots, but 2 were caused by nbd disconnects, and
> > this one:
> >
> > Dec 27 13:41:29 dy-base BUG: warning at kernel/mutex.c:220/__mutex_unlock_common_slowpath()
> > Dec 27 13:41:29 dy-base Unable to handle kernel paging request at 0000000066604480 RIP:
> > Dec 27 13:41:29 dy-base [<ffffffff80222c64>] resched_task+0x12/0x64
> > Dec 27 13:41:29 dy-base PGD 115246067 PUD 0
> > Dec 27 13:41:29 dy-base Oops: 0000 [1] SMP
> > Dec 27 13:41:29 dy-base CPU 1
> > Dec 27 13:41:29 dy-base Modules linked in: nbd rd netconsole e1000 video
> > Dec 27 13:41:29 dy-base Pid: 4069, comm: httpd Not tainted 2.6.19 #3
> > Dec 27 13:41:29 dy-base RIP: 0010:[<ffffffff80222c64>] [<ffffffff80222c64>] resched_task+0x12/0x64
> > Dec 27 13:41:29 dy-base RSP: 0018:ffff810105c01b78 EFLAGS: 00010083
> > Dec 27 13:41:29 dy-base RAX: ffffffff807d5800 RBX: 00001749fd97c214 RCX:
>
> Different corruption in RBX here. Looks like semi-random garbage there.
> I wonder - what's the mac and ip address(es) of your machine and nbd
> servers?

dy-base:

eth0   Link encap:Ethernet HWaddr 00:90:27:A2:7B:8B
eth0:1 Link encap:Ethernet HWaddr 00:90:27:A2:7B:8B
eth0:2 Link encap:Ethernet HWaddr 00:90:27:A2:7B:8B
eth1   Link encap:Ethernet HWaddr 00:07:E9:32:E6:D8
eth1:1 Link encap:Ethernet HWaddr 00:07:E9:32:E6:D8
eth1:2 Link encap:Ethernet HWaddr 00:07:E9:32:E6:D8
eth2   Link encap:Ethernet HWaddr 00:07:E9:32:E6:D9

node1-4:
00:0E:0C:A0:E5:7E
00:0E:0C:A0:EF:5E
00:0E:0C:A0:E9:58
00:0E:0C:A0:EF:A3

Some new stuff:

Jan 8 18:11:16 dy-base BUG: warning at kernel/mutex.c:220/__mutex_unlock_common_slowpath()
Jan 8 18:11:16 dy-base Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP:
Jan 8 18:11:16 dy-base [<ffffffff80222c74>] resched_task+0x1a/0x64
Jan 8 18:11:16 dy-base PGD 9859d067 PUD 4e347067 PMD 0
Jan 8 18:11:16 dy-base Oops: 0000 [1] SMP
Jan 8 18:11:16 dy-base CPU 3
Jan 8 18:11:16 dy-base Modules linked in: nbd rd netconsole e1000
Jan 8 18:11:16 dy-base Pid: 3471, comm: ls Not tainted 2.6.19 #4
Jan 8 18:11:16 dy-base RIP: 0010:[<ffffffff80222c74>] [<ffffffff80222c74>] resched_task+0x1a/0x64
Jan 8 18:11:16 dy-base RSP: 0000:ffff81011fd1fb10 EFLAGS: 00010097
Jan 8 18:11:16 dy-base RAX: ffffffff80810800 RBX: 000004f0e2850659 RCX: ffff81010c0a2000
Jan 8 18:11:16 dy-base RDX: 0000000000000000 RSI: ffff81000583c368 RDI: ffff81002b809830
Jan 8 18:11:16 dy-base RBP: ffff81011fd1fb10 R08: 0000000000000000 R09: 0000000000000080
Jan 8 18:11:16 dy-base R10: 0000000000000080 R11: ffff81000584d280 R12: ffff8100d852a7b0
Jan 8 18:11:16 dy-base R13: 0000000000000003 R14: 0000000000000001 R15: 0000000000000001
Jan 8 18:11:16 dy-base FS: 00002b69786c8a90(0000) GS:ffff81011fcb98c0(0000) knlGS:0000000000000000
Jan 8 18:11:16 dy-base CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 8 18:11:16 dy-base CR2: 0000000000000008 CR3: 0000000056ec5000 CR4: 00000000000006e0
Jan 8 18:11:16 dy-base Process ls (pid: 3471, threadinfo ffff810080d80000, task ffff8100dcdea770)
Jan 8 18:11:16 dy-base Stack: ffff81011fd1fb90 ffffffff80223f3f ffff81011fd1fb50 0000000000000282
Jan 8 18:11:16 dy-base 0000000000000001 0000000000000001 ffff81000583ba00 00000003032ca4b8
Jan 8 18:11:16 dy-base 000000000000000a 0000000000000082 ffff810103539d00 ffff810013867bf8
Jan 8 18:11:16 dy-base Call Trace:
Jan 8 18:11:16 dy-base <IRQ> [<ffffffff80223f3f>] try_to_wake_up+0x3a7/0x3dc
Jan 8 18:11:16 dy-base [<ffffffff80223f81>] default_wake_function+0xd/0xf
Jan 8 18:11:16 dy-base [<ffffffff8023c81d>] autoremove_wake_function+0x11/0x38
Jan 8 18:11:16 dy-base [<ffffffff802225b1>] __wake_up_common+0x3e/0x68
Jan 8 18:11:16 dy-base [<ffffffff80222ac9>] __wake_up+0x38/0x50
Jan 8 18:11:16 dy-base [<ffffffff80569d1c>] sk_stream_write_space+0x5d/0x83
Jan 8 18:11:16 dy-base [<ffffffff80591afd>] tcp_check_space+0x8f/0xcd
Jan 8 18:11:16 dy-base [<ffffffff80596de0>] tcp_rcv_established+0x116/0x76e
Jan 8 18:11:16 dy-base [<ffffffff8059ce41>] tcp_v4_do_rcv+0x2d/0x322
Jan 8 18:11:16 dy-base [<ffffffff8059f59e>] tcp_v4_rcv+0x8bb/0x925
Jan 8 18:11:16 dy-base [<ffffffff80583b24>] ip_local_deliver_finish+0x0/0x1ce
Jan 8 18:11:16 dy-base [<ffffffff805843e6>] ip_local_deliver+0x172/0x238
Jan 8 18:11:16 dy-base [<ffffffff8058422c>] ip_rcv+0x44f/0x497
Jan 8 18:11:16 dy-base [<ffffffff88004bbc>] :e1000:e1000_alloc_rx_buffers+0x1e7/0x2cb
Jan 8 18:11:16 dy-base [<ffffffff8056c0a6>] netif_receive_skb+0x1ee/0x255
Jan 8 18:11:16 dy-base [<ffffffff8056de0c>] process_backlog+0x8a/0x10f
Jan 8 18:11:16 dy-base [<ffffffff8056e060>] net_rx_action+0xa9/0x16e
Jan 8 18:11:16 dy-base [<ffffffff8022edf9>] __do_softirq+0x57/0xc7
Jan 8 18:11:16 dy-base [<ffffffff8020aa1c>] call_softirq+0x1c/0x28
Jan 8 18:11:16 dy-base [<ffffffff8020c68a>] do_softirq+0x34/0x87
Jan 8 18:11:16 dy-base [<ffffffff8022ed0f>] irq_exit+0x3f/0x41
Jan 8 18:11:16 dy-base [<ffffffff8020c786>] do_IRQ+0xa9/0xc7
Jan 8 18:11:16 dy-base [<ffffffff80209e11>] ret_from_intr+0x0/0xa
Jan 8 18:11:16 dy-base <EOI>
Jan 8 18:11:16 dy-base Code: 48 03 42 08 8b 00 85 c0 7e 0a 0f 0b 68 ed 80 64 80 c2 f0 03
Jan 8 18:11:16 dy-base RIP [<ffffffff80222c74>] resched_task+0x1a/0x64
Jan 8 18:11:16 dy-base RSP <ffff81011fd1fb10>
Jan 8 18:11:16 dy-base CR2: 0000000000000008
Jan 8 18:11:16 dy-base <0>Kernel panic - not syncing: Fatal exception
Jan 8 18:11:16 dy-base
Jan 8 18:11:16 dy-base Rebooting in 5 seconds..

(I have disabled the slab debugging, because I need more performance.)

Thanks,
Janos

> (i.e. I suspect this is a nbd problem, not an XFS problem)
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
* Re: xfslogd-spinlock bug?
  2007-01-10 17:18 ` Janos Haar
@ 2007-01-11  3:34 ` David Chinner
  2007-01-11 20:15 ` Janos Haar
  0 siblings, 1 reply; 17+ messages in thread

From: David Chinner @ 2007-01-11 3:34 UTC (permalink / raw)
To: Janos Haar; +Cc: David Chinner, linux-xfs, linux-kernel

On Wed, Jan 10, 2007 at 06:18:08PM +0100, Janos Haar wrote:
> From: "David Chinner" <dgc@sgi.com>
> > Different corruption in RBX here. Looks like semi-random garbage there.
> > I wonder - what's the mac and ip address(es) of your machine and nbd
> > servers?
>
> dy-base:

no matches. Oh well, it was a long shot.

> Some new stuff:
> Jan 8 18:11:16 dy-base RAX: ffffffff80810800 RBX: 000004f0e2850659 RCX:

RBX trashed again with more random garbage.

> Jan 8 18:11:16 dy-base [<ffffffff80223f81>] default_wake_function+0xd/0xf
> Jan 8 18:11:16 dy-base [<ffffffff8023c81d>] autoremove_wake_function+0x11/0x38
> Jan 8 18:11:16 dy-base [<ffffffff802225b1>] __wake_up_common+0x3e/0x68
> Jan 8 18:11:16 dy-base [<ffffffff80222ac9>] __wake_up+0x38/0x50
> Jan 8 18:11:16 dy-base [<ffffffff80569d1c>] sk_stream_write_space+0x5d/0x83
> Jan 8 18:11:16 dy-base [<ffffffff80591afd>] tcp_check_space+0x8f/0xcd
> Jan 8 18:11:16 dy-base [<ffffffff80596de0>] tcp_rcv_established+0x116/0x76e
> Jan 8 18:11:16 dy-base [<ffffffff8059ce41>] tcp_v4_do_rcv+0x2d/0x322
> Jan 8 18:11:16 dy-base [<ffffffff8059f59e>] tcp_v4_rcv+0x8bb/0x925
> Jan 8 18:11:16 dy-base [<ffffffff80583b24>] ip_local_deliver_finish+0x0/0x1ce
> Jan 8 18:11:16 dy-base [<ffffffff805843e6>] ip_local_deliver+0x172/0x238
> Jan 8 18:11:16 dy-base [<ffffffff8058422c>] ip_rcv+0x44f/0x497
> Jan 8 18:11:16 dy-base [<ffffffff88004bbc>] :e1000:e1000_alloc_rx_buffers+0x1e7/0x2cb
> Jan 8 18:11:16 dy-base [<ffffffff8056c0a6>] netif_receive_skb+0x1ee/0x255
> Jan 8 18:11:16 dy-base [<ffffffff8056de0c>] process_backlog+0x8a/0x10f
> Jan 8 18:11:16 dy-base [<ffffffff8056e060>] net_rx_action+0xa9/0x16e
> Jan 8 18:11:16 dy-base [<ffffffff8022edf9>] __do_softirq+0x57/0xc7
> Jan 8 18:11:16 dy-base [<ffffffff8020aa1c>] call_softirq+0x1c/0x28
> Jan 8 18:11:16 dy-base [<ffffffff8020c68a>] do_softirq+0x34/0x87
> Jan 8 18:11:16 dy-base [<ffffffff8022ed0f>] irq_exit+0x3f/0x41
> Jan 8 18:11:16 dy-base [<ffffffff8020c786>] do_IRQ+0xa9/0xc7
> Jan 8 18:11:16 dy-base [<ffffffff80209e11>] ret_from_intr+0x0/0xa
......
> > (i.e. I suspect this is a nbd problem, not an XFS problem)

There's something seriously wrong in your kernel that has, AFAICT,
nothing to do with XFS. I suggest talking to the NBD folk as that is
the only unusual thing that I can see that you are using....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: xfslogd-spinlock bug?
  2007-01-11  3:34 ` David Chinner
@ 2007-01-11 20:15 ` Janos Haar
  0 siblings, 0 replies; 17+ messages in thread

From: Janos Haar @ 2007-01-11 20:15 UTC (permalink / raw)
To: David Chinner; +Cc: dgc, linux-xfs, linux-kernel

----- Original Message -----
From: "David Chinner" <dgc@sgi.com>
To: "Janos Haar" <djani22@netcenter.hu>
Cc: "David Chinner" <dgc@sgi.com>; <linux-xfs@oss.sgi.com>; <linux-kernel@vger.kernel.org>
Sent: Thursday, January 11, 2007 4:34 AM
Subject: Re: xfslogd-spinlock bug?

> On Wed, Jan 10, 2007 at 06:18:08PM +0100, Janos Haar wrote:
> > From: "David Chinner" <dgc@sgi.com>
> > > Different corruption in RBX here. Looks like semi-random garbage there.
> > > I wonder - what's the mac and ip address(es) of your machine and nbd
> > > servers?
> >
> > dy-base:
>
> no matches. Oh well, it was a long shot.
>
> > Some new stuff:
> > Jan 8 18:11:16 dy-base RAX: ffffffff80810800 RBX: 000004f0e2850659 RCX:
>
> RBX trashed again with more random garbage.
>
> > Jan 8 18:11:16 dy-base [<ffffffff80223f81>] default_wake_function+0xd/0xf
> > Jan 8 18:11:16 dy-base [<ffffffff8023c81d>] autoremove_wake_function+0x11/0x38
> > Jan 8 18:11:16 dy-base [<ffffffff802225b1>] __wake_up_common+0x3e/0x68
> > Jan 8 18:11:16 dy-base [<ffffffff80222ac9>] __wake_up+0x38/0x50
> > Jan 8 18:11:16 dy-base [<ffffffff80569d1c>] sk_stream_write_space+0x5d/0x83
> > Jan 8 18:11:16 dy-base [<ffffffff80591afd>] tcp_check_space+0x8f/0xcd
> > Jan 8 18:11:16 dy-base [<ffffffff80596de0>] tcp_rcv_established+0x116/0x76e
> > Jan 8 18:11:16 dy-base [<ffffffff8059ce41>] tcp_v4_do_rcv+0x2d/0x322
> > Jan 8 18:11:16 dy-base [<ffffffff8059f59e>] tcp_v4_rcv+0x8bb/0x925
> > Jan 8 18:11:16 dy-base [<ffffffff80583b24>] ip_local_deliver_finish+0x0/0x1ce
> > Jan 8 18:11:16 dy-base [<ffffffff805843e6>] ip_local_deliver+0x172/0x238
> > Jan 8 18:11:16 dy-base [<ffffffff8058422c>] ip_rcv+0x44f/0x497
> > Jan 8 18:11:16 dy-base [<ffffffff88004bbc>] :e1000:e1000_alloc_rx_buffers+0x1e7/0x2cb
> > Jan 8 18:11:16 dy-base [<ffffffff8056c0a6>] netif_receive_skb+0x1ee/0x255
> > Jan 8 18:11:16 dy-base [<ffffffff8056de0c>] process_backlog+0x8a/0x10f
> > Jan 8 18:11:16 dy-base [<ffffffff8056e060>] net_rx_action+0xa9/0x16e
> > Jan 8 18:11:16 dy-base [<ffffffff8022edf9>] __do_softirq+0x57/0xc7
> > Jan 8 18:11:16 dy-base [<ffffffff8020aa1c>] call_softirq+0x1c/0x28
> > Jan 8 18:11:16 dy-base [<ffffffff8020c68a>] do_softirq+0x34/0x87
> > Jan 8 18:11:16 dy-base [<ffffffff8022ed0f>] irq_exit+0x3f/0x41
> > Jan 8 18:11:16 dy-base [<ffffffff8020c786>] do_IRQ+0xa9/0xc7
> > Jan 8 18:11:16 dy-base [<ffffffff80209e11>] ret_from_intr+0x0/0xa
> ......
> > > (i.e. I suspect this is a nbd problem, not an XFS problem)
>
> There's something seriously wrong in your kernel that has, AFAICT,
> nothing to do with XFS. I suggest talking to the NBD folk as that is
> the only unusual thing that I can see that you are using....
>
> Cheers,

OK, I will try....

Thanks a lot,
Janos

> Dave.
> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
end of thread, other threads:[~2007-01-11 20:17 UTC | newest]

Thread overview: 17+ messages
2006-12-11 23:00 xfslogd-spinlock bug? Haar János
2006-12-12 14:32 ` Justin Piszcz
2006-12-13  1:11   ` Haar János
2006-12-16 11:19     ` Haar János
2006-12-17 22:44       ` David Chinner
2006-12-17 23:56         ` Haar János
2006-12-18  6:24           ` David Chinner
2006-12-18  8:17             ` Haar János
2006-12-18 22:36               ` David Chinner
2006-12-18 23:39                 ` Haar János
2006-12-19  2:52                   ` David Chinner
2006-12-19  4:47                     ` David Chinner
2006-12-27 12:58                       ` Haar János
2007-01-07 23:14                         ` David Chinner
2007-01-10 17:18                           ` Janos Haar
2007-01-11  3:34                             ` David Chinner
2007-01-11 20:15                               ` Janos Haar