LKML Archive on lore.kernel.org
* BUG/ spinlock lockup, 2.6.24
@ 2008-02-15 15:19 Denys Fedoryshchenko
  2008-02-15 15:24 ` Bart Van Assche
  0 siblings, 1 reply; 6+ messages in thread
From: Denys Fedoryshchenko @ 2008-02-15 15:19 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

The server crashed (stopped responding over the network); the last line 
received over netconsole was:

Feb 15 15:50:17 217.151.X.X [1521315.068984] BUG: spinlock lockup on CPU#1, ksoftirqd/1/7, f0551180
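
For reference, the single line above appears to follow the printk format
string visible in the lib/spinlock_debug.c patch later in this thread
("BUG: spinlock lockup on CPU#%d, %s/%d, %p"): CPU number, task name, a
number assumed here to be the task's PID, and the lock address. A minimal
user-space sketch of splitting such a line into fields, assuming that layout;
struct lockup_report and parse_lockup_line() are hypothetical names, not
kernel code:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Fields of one "BUG: spinlock lockup" report (illustrative names). */
struct lockup_report {
	int cpu;                 /* CPU that was spinning on the lock */
	char comm[32];           /* task name; may itself contain '/' */
	int pid;                 /* assumed to be the task's PID */
	unsigned long long lock; /* lock address as printed by %p */
};

/* Returns 0 on success, -1 if the line does not match the assumed format. */
static int parse_lockup_line(const char *line, struct lockup_report *r)
{
	static const char tag[] = "BUG: spinlock lockup on CPU#";
	const char *p;
	char task[64];
	char *slash;
	char *end;

	assert(line != NULL && r != NULL);

	p = strstr(line, tag);
	if (!p)
		return -1;
	p += sizeof(tag) - 1;

	/* Remaining text: "<cpu>, <comm>/<pid>, <lock>" */
	if (sscanf(p, "%d, %63[^,], %llx", &r->cpu, task, &r->lock) != 3)
		return -1;

	/* The PID follows the last '/', because comm can contain one too. */
	slash = strrchr(task, '/');
	if (!slash)
		return -1;
	*slash = '\0';
	r->pid = (int)strtol(slash + 1, &end, 10);
	if (end == slash + 1 || *end != '\0')
		return -1;

	snprintf(r->comm, sizeof(r->comm), "%s", task);
	return 0;
}
```

Applied to the line above, this splits it into cpu 1, comm "ksoftirqd/1",
pid 7 and lock address 0xf0551180.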

I get random crashes, at least once per week. It is very difficult to catch 
the error message, and I only recently set up netconsole. Now I got a crash, 
but there is no traceback; only the single line mentioned above came over 
netconsole.

.config file 
http://www.nuclearcat.com/files/config_qos

The kernel is 2.6.24 with the epoll patch (from mainline) applied.
cat /proc/version
Linux version 2.6.24-devel (root@visp-1) (gcc version 4.1.1 (Gentoo 4.1.1-r3)) #1 SMP Sat Jan 26 17:26:54 EET 2008

visp-1 ~ # cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:      18361      17785      17471      17748   IO-APIC-edge      timer
  1:          2          0          0          0   IO-APIC-edge      i8042
  8:          5          4          3          4   IO-APIC-edge      rtc
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          1          0          1          2   IO-APIC-edge      i8042
 14:         14         17         17         15   IO-APIC-edge      libata
 15:          0          0          0          0   IO-APIC-edge      libata
 17:        269        259        256        259   IO-APIC-fasteoi   ioc0
 18:          5          5          6          7   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb4
 19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 66:          1          0          0          0      none-<NULL>
212:         27         32         35         32   PCI-MSI-edge      eth1
213:      36818      36995      37307      37029   PCI-MSI-edge      eth0
214:          0          1          1          1   PCI-MSI-edge
NMI:      71107      70983      70962      70962   Non-maskable interrupts
LOC:      53005      53178      53490      53214   Local timer interrupts
RES:        414        434        363        378   Rescheduling interrupts
CAL:         52         46         56         47   function call interrupts
TLB:        398        288        403        264   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
SPU:          0          0          0          0   Spurious interrupts
ERR:          0
MIS:          0

visp-1 ~ # cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 6
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 4
cpu MHz         : 3192.163
cache size      : 2048 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 6
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm 
constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx cid cx16 xtpr lahf_lm
bogomips        : 6390.17
clflush size    : 64

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 6
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 4
cpu MHz         : 3192.163
cache size      : 2048 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 6
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm 
constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx cid cx16 xtpr lahf_lm
bogomips        : 6383.72
clflush size    : 64

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 15
model           : 6
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 4
cpu MHz         : 3192.163
cache size      : 2048 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 6
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm 
constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx cid cx16 xtpr lahf_lm
bogomips        : 6383.75
clflush size    : 64

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 15
model           : 6
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 4
cpu MHz         : 3192.163
cache size      : 2048 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 6
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm 
constant_tsc pebs bts sync_rdtsc pni monitor ds_cpl vmx cid cx16 xtpr lahf_lm
bogomips        : 6383.76
clflush size    : 64


--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.


* Re: BUG/ spinlock lockup, 2.6.24
  2008-02-15 15:19 BUG/ spinlock lockup, 2.6.24 Denys Fedoryshchenko
@ 2008-02-15 15:24 ` Bart Van Assche
  2008-02-15 19:42   ` Denys Fedoryshchenko
  0 siblings, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2008-02-15 15:24 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev, linux-kernel

2008/2/15 Denys Fedoryshchenko <denys@visp.net.lb>:
>  I get random crashes, at least once per week. It is very difficult to catch
>  the error message, and I only recently set up netconsole. Now I got a crash,
>  but there is no traceback; only the single line mentioned above came over
>  netconsole.

Have you already run memtest? You can run memtest by booting from the
Knoppix CD-ROM or DVD. Most Linux distributions also include
memtest on their bootable CDs/DVDs.

Bart Van Assche.

* Re: BUG/ spinlock lockup, 2.6.24
  2008-02-15 15:24 ` Bart Van Assche
@ 2008-02-15 19:42   ` Denys Fedoryshchenko
  2008-02-15 20:21     ` Jarek Poplawski
  0 siblings, 1 reply; 6+ messages in thread
From: Denys Fedoryshchenko @ 2008-02-15 19:42 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: netdev, linux-kernel

This server worked fine under load with FreeBSD, and it previously worked 
fine with other tasks under Linux, so I don't think it is the RAM.
It is also server hardware (Dell PowerEdge) with ECC, MCE and other layers 
that would most probably report any hardware issue, I think even better 
than memtest would. 
It is also very difficult to run a test on it, because it is in another 
country and I have limited access to it (I don't have a network KVM).

I see similar crashes on completely different hardware doing the same job 
(QoS), so I think it is actually some nasty bug in networking.


On Fri, 15 Feb 2008 16:24:56 +0100, Bart Van Assche wrote
> 2008/2/15 Denys Fedoryshchenko <denys@visp.net.lb>:
> >  I get random crashes, at least once per week. It is very difficult to catch
> >  the error message, and I only recently set up netconsole. Now I got a crash,
> >  but there is no traceback; only the single line mentioned above came over
> >  netconsole.
> 
> Have you already run memtest? You can run memtest by booting from the
> Knoppix CD-ROM or DVD. Most Linux distributions also include
> memtest on their bootable CDs/DVDs.
> 
> Bart Van Assche.


--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.


* Re: BUG/ spinlock lockup, 2.6.24
  2008-02-15 19:42   ` Denys Fedoryshchenko
@ 2008-02-15 20:21     ` Jarek Poplawski
  2008-02-15 21:03       ` Jarek Poplawski
  0 siblings, 1 reply; 6+ messages in thread
From: Jarek Poplawski @ 2008-02-15 20:21 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: Bart Van Assche, netdev, linux-kernel

Denys Fedoryshchenko wrote, On 02/15/2008 08:42 PM:
...

> I see similar crashes on completely different hardware doing the same job 
> (QoS), so I think it is actually some nasty bug in networking.

Maybe you could try some other debugging options? E.g. since lockdep
doesn't help, turn it off. Instead try enabling some others, like these,
which are currently not set in your .config:

> # CONFIG_DEBUG_SPINLOCK_SLEEP is not set
> # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
> # CONFIG_DEBUG_KOBJECT is not set
> # CONFIG_DEBUG_HIGHMEM is not set
> # CONFIG_DEBUG_VM is not set
> # CONFIG_DEBUG_LIST is not set
> # CONFIG_DEBUG_SG is not set
> # CONFIG_BOOT_PRINTK_DELAY is not set
> # CONFIG_DEBUG_STACKOVERFLOW is not set
> # CONFIG_DEBUG_STACK_USAGE is not set
> # CONFIG_DEBUG_RODATA is not set

Regards,
Jarek P.

* Re: BUG/ spinlock lockup, 2.6.24
  2008-02-15 20:21     ` Jarek Poplawski
@ 2008-02-15 21:03       ` Jarek Poplawski
  2008-02-15 22:49         ` Jarek Poplawski
  0 siblings, 1 reply; 6+ messages in thread
From: Jarek Poplawski @ 2008-02-15 21:03 UTC (permalink / raw)
  Cc: Denys Fedoryshchenko, Bart Van Assche, netdev, linux-kernel

Jarek Poplawski wrote, On 02/15/2008 09:21 PM:

> Denys Fedoryshchenko wrote, On 02/15/2008 08:42 PM:
> ...
> 
>> I see similar crashes on completely different hardware doing the same job 
>> (QoS), so I think it is actually some nasty bug in networking.
> 
> Maybe you could try some other debugging options? E.g. since lockdep
> doesn't help, turn it off. Instead try enabling some others, like these:

...On the other hand this:

> Feb 15 15:50:17 217.151.X.X [1521315.068984] BUG: spinlock lockup on CPU#1, ksoftirqd/1/7, f0551180

seems to point just at a spinlock lockup, so the real problem is getting the
full report. I wonder if this patch to printk could help here:

author	Ingo Molnar <mingo at elte.hu>
	Fri, 25 Jan 2008 20:07:58 +0000 (21:07 +0100)
printk: make printk more robust by not allowing recursion

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=32a76006683f7b28ae3cc491da37716e002f198e

Jarek P.

* Re: BUG/ spinlock lockup, 2.6.24
  2008-02-15 21:03       ` Jarek Poplawski
@ 2008-02-15 22:49         ` Jarek Poplawski
  0 siblings, 0 replies; 6+ messages in thread
From: Jarek Poplawski @ 2008-02-15 22:49 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: Bart Van Assche, netdev, linux-kernel

Jarek Poplawski wrote, On 02/15/2008 10:03 PM:
...

> ...On the other hand this:
> 
>> Feb 15 15:50:17 217.151.X.X [1521315.068984] BUG: spinlock lockup on CPU#1, ksoftirqd/1/7, f0551180
> 
> seems to point just at a spinlock lockup, so the real problem is getting the
> full report. I wonder if this patch to printk could help here:
> 
> author	Ingo Molnar <mingo at elte.hu>
> 	Fri, 25 Jan 2008 20:07:58 +0000 (21:07 +0100)
> printk: make printk more robust by not allowing recursion
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=32a76006683f7b28ae3cc491da37716e002f198e


...or maybe a patch like the one attached here?

Jarek P.

[-- Attachment #2: spinlock_debug.diff --]
[-- Type: text/x-diff, Size: 733 bytes --]

diff --git a/lib/spinlock_debug.c b/lib/spinlock_debug.c
index 9c4b025..21c8aaa 100644
--- a/lib/spinlock_debug.c
+++ b/lib/spinlock_debug.c
@@ -111,8 +111,7 @@ static void __spin_lock_debug(spinlock_t *lock)
 			__delay(1);
 		}
 		/* lockup suspected: */
-		if (print_once) {
-			print_once = 0;
+		if (print_once == 1) {
 			printk(KERN_EMERG "BUG: spinlock lockup on CPU#%d, "
 					"%s/%d, %p\n",
 				raw_smp_processor_id(), current->comm,
@@ -122,7 +121,14 @@ static void __spin_lock_debug(spinlock_t *lock)
 			trigger_all_cpu_backtrace();
 #endif
 		}
+		if (print_once++ > 1000)
+			goto out;
 	}
+	return;
+out:
+	panic("spinlock lockup panic #%llu\n", i);
+	// or:
+	// BUG();
 }
 
 void _raw_spin_lock(spinlock_t *lock)
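
The idea in the patch above (report the suspected lockup once, keep spinning,
and give up after a bounded number of rounds instead of looping forever) can
be sketched in user space as follows. This is only an illustration, not
kernel code: spin_lock_bounded(), the trylock callback type, and the two demo
callbacks are hypothetical names, and where the patch would call panic() the
sketch simply returns SPIN_GAVE_UP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical callback: returns true once the lock has been taken. */
typedef bool (*trylock_fn)(void *lock);

enum spin_result { SPIN_ACQUIRED, SPIN_GAVE_UP };

/*
 * Spin in rounds of 'loops' attempts. After the first failed round the
 * suspected lockup is reported once; after more than 'max_rounds' failed
 * rounds the caller is told to give up (the patch would panic here).
 */
static enum spin_result spin_lock_bounded(void *lock, trylock_fn trylock,
					  unsigned long loops,
					  unsigned int max_rounds,
					  unsigned int *reports)
{
	unsigned int round;
	unsigned long i;

	assert(trylock != NULL);

	for (round = 1; ; round++) {
		for (i = 0; i < loops; i++) {
			if (trylock(lock))
				return SPIN_ACQUIRED;
		}
		/* lockup suspected: report on the first failed round only */
		if (round == 1) {
			fprintf(stderr, "BUG: spinlock lockup, %p\n", lock);
			if (reports)
				(*reports)++;
		}
		if (round > max_rounds)
			return SPIN_GAVE_UP;
	}
}

/* Demo callback: a lock that is never released. */
static bool trylock_never(void *lock)
{
	(void)lock;
	return false;
}

/* Demo callback: succeeds once the int counter behind 'lock' reaches 0. */
static bool trylock_countdown(void *lock)
{
	int *remaining = lock;

	return --(*remaining) <= 0;
}
```

With max_rounds set to 1000 this mirrors the patch's "print_once++ > 1000"
check: the report is printed after the first failed round and the function
gives up on round 1001. The demo callbacks exist only so the behaviour can be
exercised without real contention.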
