LKML Archive on lore.kernel.org
* Hangs and reboots under high loads, oops with DEBUG_SHIRQ
@ 2007-07-30 15:30 Attila Nagy
  2007-07-30 16:18 ` Kok, Auke
  2007-07-30 16:19 ` Alan Cox
  0 siblings, 2 replies; 7+ messages in thread
From: Attila Nagy @ 2007-07-30 15:30 UTC
  To: linux-kernel

Hello,

I have four identical machines, based on Supermicro X7DBE motherboards.
All of the machines have two Xeon 5130 CPUs, 16GB RAM and two add-on
cards: an Areca 1261 SATA RAID HBA and a Qlogic QLA2342 fibre channel HBA.

I would like to use these as file servers (via FC), but during the
performance and reliability tests it turned out that the machines are
very unreliable, even though they seemed to be OK hardware-wise
(memtest and the usual stuff).

While debugging this (seemingly) high IO load related problem, I have
observed the following:
- when MSI is enabled (the first iteration), the machines sometimes
  "hang"; not the whole system, just the SCSI target subsystem (SCST),
  which does heavy IO on the Areca (arcmsr) and the Qlogic (patched
  qla2xxx) HBAs
- when MSI is disabled, I couldn't reproduce that hung state; instead
  the machines sometimes throw an MCE (see below), but I couldn't find
  its cause
- when MSI is disabled and CONFIG_DEBUG_SHIRQ is enabled, the machines
  can't even boot normally; I get an oops immediately during kernel
  initialization
- with MSI disabled, the machines sometimes fail to respond: the ssh
  sessions terminate and I can't type on the console for many seconds.
  I have nearly all debugging turned on, but can't see anything in the
  logs or on the console. The machine recovers from this hang
  automatically. It looks like what happens when there is high (e.g.
  network) interrupt activity on a heavily loaded machine, but I could
  observe this even after a fresh boot, with nothing running on the
  machine (apart from the standard stuff, sshd and the like).

The kernel is 2.6.21.5 (I've tried 2.6.18, the effects are the same),
running in 64-bit mode.

The oops I get with MSI disabled and CONFIG_DEBUG_SHIRQ enabled:

[   92.681320] NET: Registered protocol family 17
[   93.491658] Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
[   93.557402]  [<0000000000000000>]
[   93.626770] PGD 0
[   93.651106] Oops: 0010 [1] SMP
[   93.689170] CPU 1
[   93.713506] Modules linked in:
[   93.750322] Pid: 1, comm: swapper Not tainted 2.6.21.5 #1
[   93.815011] RIP: 0010:[<0000000000000000>]  [<0000000000000000>]
[   93.887187] RSP: 0018:ffff81042fc5dc68  EFLAGS: 00010002
[   93.950836] RAX: ffff81042fbe6b70 RBX: 0000000000000202 RCX: ffff81042fbe6b70
[   94.036323] RDX: ffffc20000040000 RSI: ffff81042f51cdf8 RDI: ffff81042fbe6800
[   94.121812] RBP: ffff81042fc5dd10 R08: 0000000000000000 R09: ffff81042f4c0ea8
[   94.207298] R10: 0000000000000000 R11: ffff81042fbe6800 R12: 00000000fffffff4
[   94.292788] R13: ffff81042fbe6000 R14: 0000000000000001 R15: ffffffff80399450
[   94.378275] FS:  0000000000000000(0000) GS:ffff81042fc694c8(0000) knlGS:0000000000000000
[   94.475307] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[   94.544153] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
[   94.629643] Process swapper (pid: 1, threadinfo ffff81042fc5c000, task ffff81042fc58040)
[   94.726673] Stack:  ffffffff80399559 ffff81042fc5dca0 ffffffff802121b2 ffff81042f4c0ea0
[   94.823603]  ffff81042fc5dca0 ffffffff80221cd7 ffffffff802addb5 ffff81042fbe6800
[   94.913042]  ffffffff8020c9bc ffff81042fbe6b70 0000000000000246 ffff81042fbe6b70
[   95.000194] Call Trace:
[   95.031814]  [<ffffffff80399559>] e1000_intr+0x109/0x590
[   95.095461]  [<ffffffff802121b2>] poison_obj+0x42/0x60
[   95.157027]  [<ffffffff80221cd7>] dbg_redzone1+0x17/0x30
[   95.220676]  [<ffffffff802addb5>] request_irq+0x95/0x150
[   95.284324]  [<ffffffff8020c9bc>] cache_alloc_debugcheck_after+0x17c/0x1c0
[   95.366690]  [<ffffffff8020a43d>] kmem_cache_alloc+0xcd/0xf0
[   95.434500]  [<ffffffff80399450>] e1000_intr+0x0/0x590
[   95.496067]  [<ffffffff802ade00>] request_irq+0xe0/0x150
[   95.559716]  [<ffffffff8039558c>] e1000_request_irq+0x3c/0x80
[   95.628564]  [<ffffffff803985bc>] e1000_open+0x5c/0x100
[   95.691172]  [<ffffffff8041d937>] dev_open+0x37/0x80
[   95.750661]  [<ffffffff8041becd>] dev_change_flags+0x6d/0x150
[   95.819508]  [<ffffffff80616565>] ip_auto_config+0x175/0xea0
[   95.887317]  [<ffffffff80442f88>] tcp_set_default_congestion_control+0x18/0x70
[   95.973947]  [<ffffffff80442fcf>] tcp_set_default_congestion_control+0x5f/0x70
[   96.060582]  [<ffffffff80265236>] _spin_unlock+0x26/0x30
[   96.124227]  [<ffffffff805f1754>] init+0x1a4/0x2b0
[   96.181635]  [<ffffffff802a0e7b>] trace_hardirqs_on+0x14b/0x180
[   96.252563]  [<ffffffff8025ff28>] child_rip+0xa/0x12
[   96.312051]  [<ffffffff8026563b>] _spin_unlock_irq+0x2b/0x40
[   96.379859]  [<ffffffff8025f63c>] restore_args+0x0/0x30
[   96.442467]  [<ffffffff805f15b0>] init+0x0/0x2b0
[   96.497795]  [<ffffffff8025ff1e>] child_rip+0x0/0x12
[   96.557282]
[   96.575170]
[   96.575171] Code:  Bad RIP value.
[   96.633203] RIP  [<0000000000000000>]
[   96.677297]  RSP <ffff81042fc5dc68>
[   96.719105] CR2: 0000000000000000
[   96.758835] Kernel panic - not syncing: Attempted to kill init!
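
A minimal sketch of how the key fields of the oops above can be read
mechanically. It assumes the standard x86 page-fault error-code bit
layout and the "[<address>] symbol+0xoffset/0xsize" trace format seen
in this log; the helper names decode_oops_code and parse_trace_line are
made up for illustration and are not part of any kernel tool.

```python
import re

# x86 page-fault error-code bits as printed after "Oops:".
# Each entry: (mask, meaning when set, meaning when clear or None).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "not a write"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_oops_code(code):
    """Return human-readable meanings for a page-fault error code."""
    meanings = []
    for mask, if_set, if_clear in PF_BITS:
        text = if_set if code & mask else if_clear
        if text:
            meanings.append(text)
    return meanings


# Matches entries such as "[<ffffffff80399559>] e1000_intr+0x109/0x590".
TRACE_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_trace_line(line):
    """Split one call-trace line into address, symbol, offset and size."""
    match = TRACE_RE.search(line)
    if match is None:
        return None
    addr, symbol, offset, size = match.groups()
    return {
        "addr": int(addr, 16),
        "symbol": symbol,
        "offset": int(offset, 16),
        "size": int(size, 16),
    }


if __name__ == "__main__":
    print(decode_oops_code(0x0010))
    entry = parse_trace_line(
        "[   95.031814]  [<ffffffff80399559>] e1000_intr+0x109/0x590"
    )
    # Address minus offset gives the start of the function.
    print(entry["symbol"], hex(entry["addr"] - entry["offset"]))
```

Read this way, "Oops: 0010" is an instruction fetch from a non-present
page in kernel mode, which fits RIP = 0 and "Bad RIP value": something
jumped through a NULL function pointer. The first trace entry minus its
offset is ffffffff80399450, the same value as R15 and as the
e1000_intr+0x0 entry further down.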


MCE:
[153103.918654] HARDWARE ERROR
[153103.918655] CPU 1: Machine Check Exception:                5 Bank 0: b200004010000400
[153104.066037] RIP !INEXACT! 10:<ffffffff802569e6> {mwait_idle+0x46/0x60}
[153104.145699] TSC 1167e915e93ce
[153104.183554] This is not a software problem!
[153104.234724] Run through mcelog --ascii to decode and contact your hardware vendor
[153104.325517]
[153104.325518] HARDWARE ERROR
[153104.325519] CPU 1: Machine Check Exception:                5 Bank 5: b200221024080400
[153104.472883] RIP !INEXACT! 10:<ffffffff802569e6> {mwait_idle+0x46/0x60}
[153104.552546] TSC 1167e915e9ea8
[153104.590402] This is not a software problem!
[153104.641572] Run through mcelog --ascii to decode and contact your hardware vendor
[153104.732365] Kernel panic - not syncing: Machine check

I get exactly the same errors (only the TSC and the CPU values change)
on all four machines. Could this really be a hardware error?

full dmesg: http://people.fsn.hu/~bra/linux/x7dbe-20070730/with_debug_shirq
dmesg with MSI enabled: http://people.fsn.hu/~bra/linux/x7dbe-20070730/with_msi_and_debug_shirq
kernel config: http://people.fsn.hu/~bra/linux/x7dbe-20070730/config

I've tried disabling all possible devices which consume interrupts,
and placed the cards into various slots (the FC HBA is PCI-X, while the
Areca is PCI-E). Currently arcmsr and qla2xxx (one of its ports; it's a
dual-channel HBA) are on a shared IRQ.
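
Which handlers share a line can be read from /proc/interrupts. The
following is a rough sketch that lists IRQs with more than one
registered device. SAMPLE is invented for illustration (it is not
output from these machines), and the parser assumes the layout of
kernels from this era: one count column per CPU, one controller column,
then a comma-separated device list. Newer kernels add another column,
which this sketch does not handle.

```python
# Invented sample in the /proc/interrupts layout assumed above.
SAMPLE = """\
           CPU0       CPU1
  0:     123456          0   IO-APIC-edge      timer
 16:      40211      39870   IO-APIC-fasteoi   arcmsr, qla2xxx
 17:       9001       8777   IO-APIC-fasteoi   eth0
NMI:          0          0
"""


def shared_irqs(proc_interrupts_text):
    """Map IRQ number -> device list for lines with several handlers."""
    lines = proc_interrupts_text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue  # skip NMI, LOC, ERR and similar summary rows
        # ncpus counters, one controller name, then the device list.
        fields = rest.split(None, ncpus + 1)
        if len(fields) < ncpus + 2:
            continue
        devices = [name.strip() for name in fields[ncpus + 1].split(",")]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared


if __name__ == "__main__":
    for irq, devices in shared_irqs(SAMPLE).items():
        print(irq, devices)
```

On a live system the same function could be fed the contents of
/proc/interrupts read as text.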

Could you please help?
Do you think this is related to the strange hang under high IO load, the
occasional, complete "blackouts", where all ssh network sessions
time out, but the machine recovers, and the MCEs?

I'm open to anything; I have serial consoles etc. if needed.

Please keep me on CC, I'm not subscribed.

Thanks,


* Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ
  2007-07-30 15:30 Hangs and reboots under high loads, oops with DEBUG_SHIRQ Attila Nagy
@ 2007-07-30 16:18 ` Kok, Auke
  2007-07-30 16:19 ` Alan Cox
  1 sibling, 0 replies; 7+ messages in thread
From: Kok, Auke @ 2007-07-30 16:18 UTC
  To: Attila Nagy; +Cc: linux-kernel

Attila Nagy wrote:
> Hello,
> 
> I have four identical machines, based on Supermicro X7DBE motherboards.
> All of the machines have two Xeon 5130 CPUs, 16GB RAM and two add-on
> cards: an Areca 1261 SATA RAID HBA and a Qlogic QLA2342 fibre channel HBA.
> 
> I would like to use these as file servers (via FC), but during the 
> performance and
> reliabilty tests it turned out that the machines are very unreliable, 
> despite that they
> seemed to be OK hardware-wise (memtest and the usual stuff).
> 
> During the debugging of this (seemingly) high IO load related problem, I 
> have
> observed the following:
> - when MSI is enabled (the first iteration), the machines sometimes 
> "hang", but
> not the whole system, just the SCSI target subsystem (SCST), which makes
> heavy IO on the Areca (arcmsr) and the Qlogic (patched qla2xxx) HBAs
> - when MSI is disabled, I couldn't reproduce that hung up state, instead the
> machines sometimes throw an MCE (see below), but I couldn't find its cause
> - when MSI is disabled, and CONFIG_DEBUG_SHIRQ is enabled, the machines
> can't even boot normally, I get an oops instantly during the kernel 
> initialization
> - with MSI disabled sometimes the machines fail to respond, the ssh 
> sessions terminate
> and on the console I can't type for very long seconds. I have nearly all 
> debugging turned on,
> but can't see anything in the logs or on the console. The machine 
> recovers from this hang
> automatically. The whole thing seems like when a high (eg. network) 
> interrupt activity happens
> on a highly loaded machine, but I could observe this even after a fresh 
> boot, without anything
> (of course minus the standard stuff, sshd, and the others) running on 
> the machine.
> 
> The kernel is 2.6.21.5 (I've tried 2.6.18, the effects are the same), 
> running in 64 bit mode.

Something is definitely not happy on this system. There was an e1000 fix
related to DEBUG_SHIRQ in 2.6.22, so I definitely advise you to test
2.6.22.1 immediately - however:

> The oops I get with MSI disabled and CONFIG_DEBUG_SHIRQ enabled:
> 
> [   92.681320] NET: Registered protocol family 17
> [   93.491658] Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
> [   93.557402]  [<0000000000000000>]
> [   93.626770] PGD 0
> [   93.651106] Oops: 0010 [1] SMP
> [   93.689170] CPU 1
> [   93.713506] Modules linked in:
> [   93.750322] Pid: 1, comm: swapper Not tainted 2.6.21.5 #1
> [   93.815011] RIP: 0010:[<0000000000000000>]  [<0000000000000000>]
> [   93.887187] RSP: 0018:ffff81042fc5dc68  EFLAGS: 00010002
> [   93.950836] RAX: ffff81042fbe6b70 RBX: 0000000000000202 RCX: ffff81042fbe6b70
> [   94.036323] RDX: ffffc20000040000 RSI: ffff81042f51cdf8 RDI: ffff81042fbe6800
> [   94.121812] RBP: ffff81042fc5dd10 R08: 0000000000000000 R09: ffff81042f4c0ea8
> [   94.207298] R10: 0000000000000000 R11: ffff81042fbe6800 R12: 00000000fffffff4
> [   94.292788] R13: ffff81042fbe6000 R14: 0000000000000001 R15: ffffffff80399450
> [   94.378275] FS:  0000000000000000(0000) GS:ffff81042fc694c8(0000) knlGS:0000000000000000
> [   94.475307] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [   94.544153] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
> [   94.629643] Process swapper (pid: 1, threadinfo ffff81042fc5c000, task ffff81042fc58040)
> [   94.726673] Stack:  ffffffff80399559 ffff81042fc5dca0 ffffffff802121b2 ffff81042f4c0ea0
> [   94.823603]  ffff81042fc5dca0 ffffffff80221cd7 ffffffff802addb5 ffff81042fbe6800
> [   94.913042]  ffffffff8020c9bc ffff81042fbe6b70 0000000000000246 ffff81042fbe6b70
> [   95.000194] Call Trace:
> [   95.031814]  [<ffffffff80399559>] e1000_intr+0x109/0x590
> [   95.095461]  [<ffffffff802121b2>] poison_obj+0x42/0x60
> [   95.157027]  [<ffffffff80221cd7>] dbg_redzone1+0x17/0x30
> [   95.220676]  [<ffffffff802addb5>] request_irq+0x95/0x150
> [   95.284324]  [<ffffffff8020c9bc>] cache_alloc_debugcheck_after+0x17c/0x1c0
> [   95.366690]  [<ffffffff8020a43d>] kmem_cache_alloc+0xcd/0xf0
> [   95.434500]  [<ffffffff80399450>] e1000_intr+0x0/0x590
> [   95.496067]  [<ffffffff802ade00>] request_irq+0xe0/0x150
> [   95.559716]  [<ffffffff8039558c>] e1000_request_irq+0x3c/0x80
> [   95.628564]  [<ffffffff803985bc>] e1000_open+0x5c/0x100
> [   95.691172]  [<ffffffff8041d937>] dev_open+0x37/0x80
> [   95.750661]  [<ffffffff8041becd>] dev_change_flags+0x6d/0x150
> [   95.819508]  [<ffffffff80616565>] ip_auto_config+0x175/0xea0
> [   95.887317]  [<ffffffff80442f88>] tcp_set_default_congestion_control+0x18/0x70
> [   95.973947]  [<ffffffff80442fcf>] tcp_set_default_congestion_control+0x5f/0x70
> [   96.060582]  [<ffffffff80265236>] _spin_unlock+0x26/0x30
> [   96.124227]  [<ffffffff805f1754>] init+0x1a4/0x2b0
> [   96.181635]  [<ffffffff802a0e7b>] trace_hardirqs_on+0x14b/0x180
> [   96.252563]  [<ffffffff8025ff28>] child_rip+0xa/0x12
> [   96.312051]  [<ffffffff8026563b>] _spin_unlock_irq+0x2b/0x40
> [   96.379859]  [<ffffffff8025f63c>] restore_args+0x0/0x30
> [   96.442467]  [<ffffffff805f15b0>] init+0x0/0x2b0
> [   96.497795]  [<ffffffff8025ff1e>] child_rip+0x0/0x12
> [   96.557282]
> [   96.575170]
> [   96.575171] Code:  Bad RIP value.
> [   96.633203] RIP  [<0000000000000000>]
> [   96.677297]  RSP <ffff81042fc5dc68>
> [   96.719105] CR2: 0000000000000000
> [   96.758835] Kernel panic - not syncing: Attempted to kill init!
> 
> 
> MCE:
> [153103.918654] HARDWARE ERROR
> [153103.918655] CPU 1: Machine Check Exception:                5 Bank 0: 
> b200004010000400
> [153104.066037] RIP !INEXACT! 10:<ffffffff802569e6> {mwait_idle+0x46/0x60}
> [153104.145699] TSC 1167e915e93ce
> [153104.183554] This is not a software problem!
> [153104.234724] Run through mcelog --ascii to decode and contact your 
> hardware vendor
> [153104.325517]
> [153104.325518] HARDWARE ERROR
> [153104.325519] CPU 1: Machine Check Exception:                5 Bank 5: 
> b200221024080400
> [153104.472883] RIP !INEXACT! 10:<ffffffff802569e6> {mwait_idle+0x46/0x60}
> [153104.552546] TSC 1167e915e9ea8
> [153104.590402] This is not a software problem!
> [153104.641572] Run through mcelog --ascii to decode and contact your 
> hardware vendor
> [153104.732365] Kernel panic - not syncing: Machine check

This is a serious problem that might not be resolved and may be a
symptom of a true hardware issue. Judging by the timestamps, it seems to
be a separate issue, unrelated to the e1000 DEBUG_SHIRQ fix.

> I've got exactly the same errors (only the TSC and the CPU value 
> changing) on all four machines,
> could this really be a hardware error?

yes

> full dmesg: http://people.fsn.hu/~bra/linux/x7dbe-20070730/with_debug_shirq
> dmesg with MSI enabled: 
> http://people.fsn.hu/~bra/linux/x7dbe-20070730/with_msi_and_debug_shirq
> kernel config: http://people.fsn.hu/~bra/linux/x7dbe-20070730/config
> 
> I've tried to disable all possible devices which consume interrupts,
> and placed the cards into various slots (the FC HBA is PCI-X, which the 
> Areca
> is PCI-E). Currently the arcmsr and (one of, it's a dual channel
> HBA) qla2xxx are on a shared IRQ.
> 
> Could you please help?
> Do you think this is related to the strange hang under high IO load, the
> occasional, complete "blackouts", where all ssh network sessions
> time out, but the machine recovers, and the MCEs?


Like I said, please try 2.6.22, which should have the e1000 issue fixed.
The MCE looks real and is a different problem.
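
For background, CONFIG_DEBUG_SHIRQ makes the kernel call a shared
interrupt handler once, spuriously, while it is being registered, so a
driver that requests its IRQ before it has finished setting itself up
can crash right there. The following is only a toy model of that
ordering problem in plain Python. ToyIrqCore, ToyNic and rx_callback
are invented names; this is not kernel or e1000 code, and the thread
does not say which pointer was actually NULL in the oops.

```python
class ToyIrqCore:
    """Toy stand-in for the IRQ core, with an optional DEBUG_SHIRQ check."""

    def __init__(self, debug_shirq):
        self.debug_shirq = debug_shirq
        self.handlers = {}

    def request_irq(self, irq, handler, shared, dev):
        if shared and self.debug_shirq:
            # Spurious call before registration: a shared handler must
            # cope with an interrupt arriving at any moment.
            handler(irq, dev)
        self.handlers.setdefault(irq, []).append((handler, dev))


class ToyNic:
    """Toy driver whose handler depends on state set up during open."""

    def __init__(self):
        self.rx_callback = None  # hypothetical; filled in during open

    def intr(self, irq, dev):
        # Calling None here is the toy analogue of a NULL function
        # pointer call in the kernel.
        self.rx_callback()

    def open_buggy(self, core):
        core.request_irq(18, self.intr, True, self)  # handler is live
        self.rx_callback = lambda: None              # set up too late

    def open_fixed(self, core):
        self.rx_callback = lambda: None              # set up first
        core.request_irq(18, self.intr, True, self)


if __name__ == "__main__":
    try:
        ToyNic().open_buggy(ToyIrqCore(debug_shirq=True))
    except TypeError as exc:
        print("buggy ordering fails under the debug check:", exc)
    ToyNic().open_fixed(ToyIrqCore(debug_shirq=True))
    print("fixed ordering survives the debug check")
```

In the toy model the buggy ordering only fails when the debug check is
on. Without it, the same ordering would fail only if a real interrupt
from another device on the shared line arrived inside that window.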

Auke


* Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ
  2007-07-30 15:30 Hangs and reboots under high loads, oops with DEBUG_SHIRQ Attila Nagy
  2007-07-30 16:18 ` Kok, Auke
@ 2007-07-30 16:19 ` Alan Cox
  2007-07-31 14:21   ` Attila Nagy
  1 sibling, 1 reply; 7+ messages in thread
From: Alan Cox @ 2007-07-30 16:19 UTC
  To: Attila Nagy; +Cc: linux-kernel

> MCE:
> [153103.918654] HARDWARE ERROR
> [153103.918655] CPU 1: Machine Check Exception:                5 Bank 0: 
> b200004010000400
> [153104.066037] RIP !INEXACT! 10:<ffffffff802569e6> {mwait_idle+0x46/0x60}
> [153104.145699] TSC 1167e915e93ce
> [153104.183554] This is not a software problem!
> [153104.234724] Run through mcelog --ascii to decode and contact your 
> hardware vendor

If you run it through mcelog as it suggests, it will decode the meaning
of the MCE data and that should give you some idea. Generally speaking,
MCE errors are real hardware errors but can certainly be caused by
external factors (power supply glitches, heat etc).


* Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ
  2007-07-30 16:19 ` Alan Cox
@ 2007-07-31 14:21   ` Attila Nagy
  2007-07-31 22:08     ` Roger Heflin
  0 siblings, 1 reply; 7+ messages in thread
From: Attila Nagy @ 2007-07-31 14:21 UTC
  To: Alan Cox; +Cc: linux-kernel

On 2007.07.30. 18:19, Alan Cox wrote:
> > MCE:
>   
>> [153103.918654] HARDWARE ERROR
>> [153103.918655] CPU 1: Machine Check Exception:                5 Bank 0: 
>> b200004010000400
>> [153104.066037] RIP !INEXACT! 10:<ffffffff802569e6> {mwait_idle+0x46/0x60}
>> [153104.145699] TSC 1167e915e93ce
>> [153104.183554] This is not a software problem!
>> [153104.234724] Run through mcelog --ascii to decode and contact your 
>> hardware vendor
>>     
>
> If you it through mcelog as it suggests it wil decode the meaning of the
> MCE data and that should give you some idea. Generally speaking MCE
> errors are real hardware errors but can certainly be caused by external
> factors (power supply glitches, heat etc)
>   
Sorry, of course I ran that through mcelog, but inadvertently attached
the original version.

I've tried the machines with two types of power sources (different
UPSes, line filtering, etc., and the chassis have redundant PSUs), and
I have been monitoring the temperatures (they seem to be OK, the CPUs
don't go over 30 °C even under load). I have the latest BIOS for the
motherboard.
But I will recheck everything.

BTW, here's the output from mcelog; I see this occasionally on all four
machines:

HARDWARE ERROR
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 0 TSC 1167e915e93ce
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200004010000400 MCGSTATUS 5
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor

HARDWARE ERROR
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 5 TSC 1167e915e9ea8
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal Timer error
STATUS b200221024080400 MCGSTATUS 5
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
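
The raw STATUS and MCGSTATUS values can also be cross-checked by hand.
A minimal sketch, assuming the architectural IA32_MCi_STATUS and
IA32_MCG_STATUS bit layout from Intel's manuals (flag bits 63..57, MCA
error code in bits 15:0, model-specific code in bits 31:16); the helper
names are made up for illustration and this is not how mcelog itself is
implemented.

```python
# Architectural flag bits of IA32_MCi_STATUS (bit position, name).
MCI_FLAGS = [
    (63, "VAL"),    # register contents valid
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MISC register valid
    (58, "ADDRV"),  # ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

# Flag bits of IA32_MCG_STATUS.
MCG_FLAGS = [(0, "RIPV"), (1, "EIPV"), (2, "MCIP")]

# Simple MCA error code for "internal timer error".
INTERNAL_TIMER_ERROR = 0x0400


def decode_flags(value, table):
    """Return the names of all flag bits set in value."""
    return [name for bit, name in table if (value >> bit) & 1]


def decode_mci_status(status):
    """Split an MCi_STATUS value into flags and error codes."""
    return {
        "flags": decode_flags(status, MCI_FLAGS),
        "mca_code": status & 0xFFFF,
        "model_code": (status >> 16) & 0xFFFF,
    }


if __name__ == "__main__":
    for raw in (0xB200004010000400, 0xB200221024080400):
        info = decode_mci_status(raw)
        print(hex(raw), info["flags"], hex(info["mca_code"]),
              info["mca_code"] == INTERNAL_TIMER_ERROR)
    print(decode_flags(5, MCG_FLAGS))
```

Both STATUS values decode to VAL, UC, EN and PCC with MCA code 0x0400,
and MCGSTATUS 5 decodes to RIPV and MCIP, which agrees with the mcelog
text above.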

Thanks,



* Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ
  2007-07-31 14:21   ` Attila Nagy
@ 2007-07-31 22:08     ` Roger Heflin
  2007-08-02  8:29       ` Attila Nagy
  0 siblings, 1 reply; 7+ messages in thread
From: Roger Heflin @ 2007-07-31 22:08 UTC
  To: Attila Nagy; +Cc: Alan Cox, linux-kernel

Attila Nagy wrote:
> On 2007.07.30. 18:19, Alan Cox wrote:
>>> MCE:
>>  
>>> [153103.918654] HARDWARE ERROR
>>> [153103.918655] CPU 1: Machine Check Exception:                5 Bank 
>>> 0: b200004010000400
>>> [153104.066037] RIP !INEXACT! 10:<ffffffff802569e6> 
>>> {mwait_idle+0x46/0x60}
>>> [153104.145699] TSC 1167e915e93ce
>>> [153104.183554] This is not a software problem!
>>> [153104.234724] Run through mcelog --ascii to decode and contact your 
>>> hardware vendor
>>>     
>>
>> If you it through mcelog as it suggests it wil decode the meaning of the
>> MCE data and that should give you some idea. Generally speaking MCE
>> errors are real hardware errors but can certainly be caused by external
>> factors (power supply glitches, heat etc)
>>   
> Sorry, of course I ran that through mcelog, but inadvertently attached 
> the original version.
> 
> I've tried the machines with two types of power sources (different 
> UPSes, line filtering, etc,
> and the chassis have redundant PSes), monitoring the temperatures (seems 
> to be OK,
> the CPUs don't go over 30 °C even under load). I have the latest BIOS 
> for the
> motherboard.
> But I will recheck everything.
> 
> BTW, here's the output from mcelog, I see this occasionally on all four 
> machines:
> 
> HARDWARE ERROR
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 1 BANK 0 TSC 1167e915e93ce
> MCG status:RIPV MCIP
> MCi status:
> Uncorrected error
> Error enabled
> Processor context corrupt
> MCA: Internal Timer error
> STATUS b200004010000400 MCGSTATUS 5
> This is not a software problem!
> Run through mcelog --ascii to decode and contact your hardware vendor
> 
> HARDWARE ERROR
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 1 BANK 5 TSC 1167e915e9ea8
> MCG status:RIPV MCIP
> MCi status:
> Uncorrected error
> Error enabled
> Processor context corrupt
> MCA: Internal Timer error
> STATUS b200221024080400 MCGSTATUS 5
> This is not a software problem!
> Run through mcelog --ascii to decode and contact your hardware vendor
> 
> Thanks,
> 

Attila,

We had some issues with very similar boards; all of the problems
seemed to be around the PCI-X bus area of the machine. Setting the
PCI-X buses to 66 MHz in the BIOS made things stable (but slow). Not
using the PCI-X bus also seemed to make things work. We got MCEs and
other odd crashes under heavy IO loads. I believe turning things
down to 100 MHz made things more stable, but things still crashed.

Supermicro reported being able to fix the issue by changing the
PCI Configuration -> PCI-e I/O performance setting to Coalesce 128B.

I am not exactly sure where to set it; we did not try it because
we had already changed to a different motherboard that did not
have the issue.

If this works please tell me.

                      Roger







* Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ
  2007-07-31 22:08     ` Roger Heflin
@ 2007-08-02  8:29       ` Attila Nagy
  2007-08-04 22:56         ` Mr. James W. Laferriere
  0 siblings, 1 reply; 7+ messages in thread
From: Attila Nagy @ 2007-08-02  8:29 UTC
  To: Roger Heflin; +Cc: Alan Cox, linux-kernel

On 2007.08.01. 0:08, Roger Heflin wrote:
> Attila Nagy wrote:
>> HARDWARE ERROR
>> HARDWARE ERROR. This is *NOT* a software problem!
>> Please contact your hardware vendor
>> CPU 1 BANK 0 TSC 1167e915e93ce
>> MCG status:RIPV MCIP
>> MCi status:
>> Uncorrected error
>> Error enabled
>> Processor context corrupt
>> MCA: Internal Timer error
>> STATUS b200004010000400 MCGSTATUS 5
>> This is not a software problem!
>> Run through mcelog --ascii to decode and contact your hardware vendor
>>
>> HARDWARE ERROR
>> HARDWARE ERROR. This is *NOT* a software problem!
>> Please contact your hardware vendor
>> CPU 1 BANK 5 TSC 1167e915e9ea8
>> MCG status:RIPV MCIP
>> MCi status:
>> Uncorrected error
>> Error enabled
>> Processor context corrupt
>> MCA: Internal Timer error
>> STATUS b200221024080400 MCGSTATUS 5
>> This is not a software problem!
>> Run through mcelog --ascii to decode and contact your hardware vendor
>
> Attila,
>
> We had some issues with very similar boards all of the problems
> seem to be around the PCIX bus area of the machine, setting the
> PCIX buses to 66 mhz in the bios made things stable (but slow).   Not 
> using
> the PCIX bus also seemed to make things work.   We got MCE's and
> other odd crashes under heavy IO loads.   I believe turning things
> down to 100mhz made things more stable, but things still crashed.
>
> Supermicro reported being able to fix the issue with:
> setting the PCI Configuration -> PCI-e I/O performance
> setting to Colasce 128B.
>
> I am not exactly sure where to set it as we did not try it
> as we had already changed to a different motherboard that did not
> have the issue.
>
> If this works please tell me.
Roger, you are my hero. :)
With that PCI-e setting (again, for the record, this is on a Supermicro
X7DBE motherboard, and the BIOS setting is "PCIe I/O performance", which
has two states: Coalesce and Payload 256B) all four machines have
survived half a day of continuous bashing. Previously one or two
machines typically fell over after that amount of IO load, so it looks
promising so far.
I hope this won't change over time.

BTW, this is still with 2.6.21.5, because the SCSI target stuff I use
(SCST) has some (I hope temporary) problems with changed (deleted)
interfaces in newer kernels.

Should the DEBUG_SHIRQ problem in e1000 affect stability (or performance)?

Thanks,


* Re: Hangs and reboots under high loads, oops with DEBUG_SHIRQ
  2007-08-02  8:29       ` Attila Nagy
@ 2007-08-04 22:56         ` Mr. James W. Laferriere
  0 siblings, 0 replies; 7+ messages in thread
From: Mr. James W. Laferriere @ 2007-08-04 22:56 UTC
  To: Attila Nagy; +Cc: Linux Kernel Maillist

	Hello Attila,

On Thu, 2 Aug 2007, Attila Nagy wrote:
> On 2007.08.01. 0:08, Roger Heflin wrote:
>> Attila Nagy wrote:
>>> HARDWARE ERROR
>>> HARDWARE ERROR. This is *NOT* a software problem!
>>> Please contact your hardware vendor
>>> CPU 1 BANK 0 TSC 1167e915e93ce
>>> MCG status:RIPV MCIP
>>> MCi status:
>>> Uncorrected error
>>> Error enabled
>>> Processor context corrupt
>>> MCA: Internal Timer error
>>> STATUS b200004010000400 MCGSTATUS 5
>>> This is not a software problem!
>>> Run through mcelog --ascii to decode and contact your hardware vendor
>>> 
>>> HARDWARE ERROR
>>> HARDWARE ERROR. This is *NOT* a software problem!
>>> Please contact your hardware vendor
>>> CPU 1 BANK 5 TSC 1167e915e9ea8
>>> MCG status:RIPV MCIP
>>> MCi status:
>>> Uncorrected error
>>> Error enabled
>>> Processor context corrupt
>>> MCA: Internal Timer error
>>> STATUS b200221024080400 MCGSTATUS 5
>>> This is not a software problem!
>>> Run through mcelog --ascii to decode and contact your hardware vendor
>> 
>> Attila,
>> 
>> We had some issues with very similar boards all of the problems
>> seem to be around the PCIX bus area of the machine, setting the
>> PCIX buses to 66 mhz in the bios made things stable (but slow).   Not using
>> the PCIX bus also seemed to make things work.   We got MCE's and
>> other odd crashes under heavy IO loads.   I believe turning things
>> down to 100mhz made things more stable, but things still crashed.
>> 
>> Supermicro reported being able to fix the issue with:
>> setting the PCI Configuration -> PCI-e I/O performance
>> setting to Colasce 128B.
>> 
>> I am not exactly sure where to set it as we did not try it
>> as we had already changed to a different motherboard that did not
>> have the issue.
>> 
>> If this works please tell me.
> Roger, you are my hero. :)
> With that PCI-e setting (again, for the record, this is on a Supermicro X7DBE 
> motherboard,
> and the BIOS setting is PCIe I/O performance, which has two states: Coalesce 
> and Payload 256B)
> all of the four machines have survived a half day of continous bashing. 
> Previously one, or two
> machines typically fell off after such amount of IO load, so it looks 
> promising so far.
> I hope this won't change over the time.
>
> BTW, this is still with 2.6.21.5, because the SCSI target stuff I use (SCST) 
> has some
> -I hope temporary- problems with changed (deleted) interfaces in newer 
> kernels.
>
> Should the DEBUG_SHIRQ problem in e1000 affect stability (or performance)?
>
> Thanks,

	I too have a SuperMicro MB, but it is an X7DB8. Same symptoms.
Reported MCE problems here a couple of times.
	I set the BIOS setting 'PCIe I/O performance' to 'Coalesce'.
	For everyone's information, stability went way up, SCSI IO is ~half,
but if there's no stability ...

	I'm going to try their 1.3b BIOS update & see if that helps any.
	IIRC, some said they'd already acquired the latest for their MB &
that did not help them at all. What the heck, I'll give it a try anyway.
		HTH, JimL
-- 
+-----------------------------------------------------------------+
| James   W.   Laferriere | System   Techniques | Give me VMS     |
| Network        Engineer | 663  Beaumont  Blvd |  Give me Linux  |
| babydr@baby-dragons.com | Pacifica, CA. 94044 |   only  on  AXP |
+-----------------------------------------------------------------+

