LKML Archive on lore.kernel.org
* BUG: soft lockup detected on Phenom with Debian 2.6.24-4
@ 2008-03-08 22:00 Laurent GUERBY
  2008-03-16 14:43 ` Laurent GUERBY
  2008-04-02 19:15 ` Tim Schmielau
  0 siblings, 2 replies; 10+ messages in thread
From: Laurent GUERBY @ 2008-03-08 22:00 UTC (permalink / raw)
  To: linux-kernel

Hi,

I have a system with an "AMD64 Phenom 9500" quad core cpu, 4GB RAM,
"ASUS M3A32 MVP Deluxe wifi" motherboard with latest vendor BIOS (0801).

I tried the stock Debian etch kernel (Debian 2.6.18.dfsg.1-18etch1): the
machine froze with no message. The Debian etch backport kernel did the
same. With Debian 2.6.24-4 from unstable I got some messages: the machine
is not frozen, but some userland processes are (ps says "Dl" state
with child in "Zs" state) and "events/3" is taking 100% cpu
according to top:

   18 root      15  -5     0    0    0 R  100  0.0  74:59.46 events/3  

Got to the same state with ubuntu hardy 2.6.24-8-server kernel. All
kernels are untainted, no X running anyway.

It takes a few hours of load, in my case bootstrapping or
testing GCC at -j 4, before the problem happens.

I ran memtest for 32 hours without issue on this system; temperatures
are very low and the case has plenty of airflow, making a memory
issue less likely.

Here is what I found by looking around:

"BUG: soft lockup with kernel 2.6.23.12 (x86-64)"
http://lkml.org/lkml/2008/1/11/98

"[BUG] 2.6.24-git6 soft lockup detected while running libhugetlbfs"
http://lkml.org/lkml/2008/1/30/30
(this one has some discussion)

Bug I reported in ubuntu BTS "BUG: soft lockup - CPU#3 stuck for 11s!
[events/3:18]"
https://bugs.launchpad.net/ubuntu/+source/linux-meta/+bug/197252

"BUG: soft lockup - CPU#0 stuck for 11s! [swapper:0]"
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=464387

Let me know if you need more information or if there's something
I should try. ssh and root access to this box is possible for known
developers (send me your ssh public key); there's nothing of value on
it. I'm trying to make this system stable for use in the GCC compile
farm (which is open to free software projects):

http://gcc.gnu.org/wiki/CompileFarm

Thanks in advance,

Laurent

Mar  8 21:36:28 gcc04 kernel: BUG: soft lockup - CPU#3 stuck for 11s! [events/3:18]
Mar  8 21:36:28 gcc04 kernel: CPU 3:
Mar  8 21:36:28 gcc04 kernel: Modules linked in: tun nfs nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ac battery ipv6 loop ath5k mac80211 serio_raw snd_hda_intel snd_pcm snd_timer snd soundcore snd_page_alloc floppy psmouse pcspkr cfg80211 button i2c_piix4 i2c_core sky2 evdev ext3 jbd mbcache dm_mirror dm_snapshot dm_mod ata_generic sd_mod usbhid hid ahci pata_marvell firewire_ohci firewire_core crc_itu_t libata r8169 atiixp ehci_hcd ohci_hcd scsi_mod generic ide_core thermal processor fan
Mar  8 21:36:28 gcc04 kernel: Pid: 18, comm: events/3 Not tainted 2.6.24-1-amd64 #1
Mar  8 21:36:28 gcc04 kernel: RIP: 0010:[<ffffffff8021bc75>]  [<ffffffff8021bc75>] __smp_call_function_mask+0x9b/0xbf
Mar  8 21:36:28 gcc04 kernel: RSP: 0000:ffff81012b773e00  EFLAGS: 00000297
Mar  8 21:36:28 gcc04 kernel: RAX: 00000000000008fc RBX: 0000000000000003 RCX: 0000000000000001
Mar  8 21:36:28 gcc04 kernel: RDX: 00000000000008fc RSI: 00000000000000fc RDI: 0000000000000007
Mar  8 21:36:28 gcc04 kernel: RBP: ffff81000103c8c0 R08: ffff81012b772000 R09: ffff810001040ee0
Mar  8 21:36:28 gcc04 kernel: R10: ffff81010484c800 R11: 0000000000000000 R12: 0000000280428760
Mar  8 21:36:28 gcc04 kernel: R13: 0000000000000000 R14: ffff81012b773d90 R15: ffff81012b773e24
Mar  8 21:36:28 gcc04 kernel: FS:  00002b97e47336d0(0000) GS:ffff81012b6b58c0(0000) knlGS:0000000000000000
Mar  8 21:36:28 gcc04 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Mar  8 21:36:28 gcc04 kernel: CR2: 00002b97e4c6d008 CR3: 0000000000201000 CR4: 00000000000006e0
Mar  8 21:36:28 gcc04 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar  8 21:36:28 gcc04 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar  8 21:36:28 gcc04 kernel: 
Mar  8 21:36:28 gcc04 kernel: Call Trace:
Mar  8 21:36:28 gcc04 kernel:  [<ffffffff8021611c>] mcheck_check_cpu+0x0/0x37
Mar  8 21:36:28 gcc04 kernel:  [<ffffffff8021611c>] mcheck_check_cpu+0x0/0x37
Mar  8 21:36:28 gcc04 kernel:  [<ffffffff8021bcf7>] smp_call_function_mask+0x5e/0x70
Mar  8 21:36:28 gcc04 kernel:  [<ffffffff802159c2>] mcheck_timer+0x0/0x7c
Mar  8 21:36:28 gcc04 kernel:  [<ffffffff8021611c>] mcheck_check_cpu+0x0/0x37
Mar  8 21:36:28 gcc04 kernel:  [<ffffffff8023a0cb>] on_each_cpu+0x10/0x22
Mar  8 21:36:28 gcc04 kernel:  [<ffffffff802159df>] mcheck_timer+0x1d/0x7c
Mar  8 21:36:28 gcc04 kernel:  [<ffffffff8027cb05>] vmstat_update+0x0/0x31
Mar  8 21:36:28 gcc04 kernel:  [<ffffffff80244695>] run_workqueue+0x7f/0x10b
Mar  8 21:36:28 gcc04 kernel:  [<ffffffff80244fa7>] worker_thread+0x0/0xe4
Mar  8 21:36:28 gcc04 kernel:  [<ffffffff80245081>] worker_thread+0xda/0xe4
Mar  8 21:36:28 gcc04 kernel:  [<ffffffff80247ff2>] autoremove_wake_function+0x0/0x2e
Mar  8 21:36:28 gcc04 kernel:  [<ffffffff80247ed3>] kthread+0x47/0x74
Mar  8 21:36:28 gcc04 kernel:  [<ffffffff8020cc48>] child_rip+0xa/0x12
Mar  8 21:36:28 gcc04 kernel:  [<ffffffff80247e8c>] kthread+0x0/0x74
Mar  8 21:36:28 gcc04 kernel:  [<ffffffff8020cc3e>] child_rip+0x0/0x12
Mar  8 21:36:28 gcc04 kernel: 
Mar  8 21:36:40 gcc04 kernel: BUG: soft lockup - CPU#3 stuck for 11s! [events/3:18]
(repeated about every 11s)
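
Repeated reports like these can be condensed per CPU and task before comparing logs. A minimal sketch, assuming the message format shown above ("BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]"); the function name `summarize_lockups` is hypothetical and not part of any existing tool:

```shell
#!/bin/sh
# summarize_lockups: read kernel log lines on stdin and print one line per
# distinct (cpu, comm, pid) soft-lockup report, with its occurrence count.
# Hypothetical helper; assumes the message format seen in this log.
summarize_lockups() {
    sed -n 's/.*BUG: soft lockup - CPU#\([0-9][0-9]*\) stuck for [0-9][0-9]*s! \[\(.*\):\([0-9][0-9]*\)\].*/cpu=\1 comm=\2 pid=\3/p' |
        awk '{ n[$0]++ } END { for (k in n) print k, "count=" n[k] }' |
        sort
}
```

It reads from standard input, so it can be fed a log file directly, for instance `summarize_lockups < /var/log/messages`.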

# lspci
00:00.0 Host bridge: ATI Technologies Inc RD790 Northbridge only dual slot PCI-e_GFX and HT3 K8 part
00:02.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (external gfx0 port A)
00:04.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI express gpp port A)
00:06.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI express gpp port C)
00:07.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI express gpp port D)
00:09.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI express gpp port E)
00:12.0 SATA controller: ATI Technologies Inc SB600 Non-Raid-5 SATA
00:13.0 USB Controller: ATI Technologies Inc SB600 USB (OHCI0)
00:13.1 USB Controller: ATI Technologies Inc SB600 USB (OHCI1)
00:13.2 USB Controller: ATI Technologies Inc SB600 USB (OHCI2)
00:13.3 USB Controller: ATI Technologies Inc SB600 USB (OHCI3)
00:13.4 USB Controller: ATI Technologies Inc SB600 USB (OHCI4)
00:13.5 USB Controller: ATI Technologies Inc SB600 USB Controller (EHCI)
00:14.0 SMBus: ATI Technologies Inc SBx00 SMBus Controller (rev 14)
00:14.1 IDE interface: ATI Technologies Inc SB600 IDE
00:14.2 Audio device: ATI Technologies Inc SBx00 Azalia
00:14.3 ISA bridge: ATI Technologies Inc SB600 PCI to LPC Bridge
00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge
00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Link Control
01:00.0 VGA compatible controller: ATI Technologies Inc RV515 [Radeon X1600]
01:00.1 Display controller: ATI Technologies Inc Unknown device 7160
02:00.0 IDE interface: Marvell Technology Group Ltd. 88SE6121 SATA II Controller (rev b1)
03:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E Gigabit Ethernet Controller (rev 12)
04:00.0 IDE interface: Marvell Technology Group Ltd. 88SE6121 SATA II Controller (rev b2)
05:00.0 Ethernet controller: Atheros Communications, Inc. AR242x 802.11abg Wireless PCI Express Adapter (rev 01)
06:08.0 FireWire (IEEE 1394): Agere Systems FW323 (rev 70)




* Re: BUG: soft lockup detected on Phenom with Debian 2.6.24-4
  2008-03-08 22:00 BUG: soft lockup detected on Phenom with Debian 2.6.24-4 Laurent GUERBY
@ 2008-03-16 14:43 ` Laurent GUERBY
  2008-03-20 12:20   ` Laurent GUERBY
  2008-04-02 19:15 ` Tim Schmielau
  1 sibling, 1 reply; 10+ messages in thread
From: Laurent GUERBY @ 2008-03-16 14:43 UTC (permalink / raw)
  To: linux-kernel

On Sat, 2008-03-08 at 23:00 +0100, Laurent GUERBY wrote:
> Hi,
> 
> I have a system with an "AMD64 Phenom 9500" quad core cpu, 4GB RAM,
> "ASUS M3A32 MVP Deluxe wifi" motherboard with latest vendor BIOS
> (0801).
> 
> I tried stock debian etch kernel (Debian 2.6.18.dfsg.1-18etch1),
> machine
> froze with no message, debian etch backport kernel same, and then
> Debian 2.6.24-4 from unstable and I got some messages: machine
> is not frozen but some userland processes are (ps says "Dl" state
> with child in "Zs" state) and "events/3" is taking 100% cpu
> according to top:
> 
>    18 root      15  -5     0    0    0 R  100  0.0  74:59.46
> events/3  
> 
> Got to the same state with ubuntu hardy 2.6.24-8-server kernel. All
> kernels are untainted, no X running anyway.
> 
> It takes a few hours of doing some stuff, in my case bootstraping or
> testing GCC at -j 4, and then the problem happens.

On 2.6.24-1-amd64 (Debian 2.6.24-4) I got a slightly different
backtrace in /var/log/messages after a few hours of stressing the
machine with compilations, see below. The given process
was stuck and unkillable.

Any idea on what to do/try?

Thanks in advance,

Laurent

BUG: soft lockup - CPU#3 stuck for 11s! [sh:7652]
CPU 3:
Modules linked in: nfs tun nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ac battery ipv6 it87 hwmon_vid eeprom loop snd_hda_intel snd_pcm snd_timer snd serio_raw i2c_piix4 soundcore snd_page_alloc ath5k pcspkr psmouse i2c_core mac80211 cfg80211 button sky2 evdev ext3 jbd mbcache dm_mirror dm_snapshot dm_mod sd_mod ata_generic firewire_ohci firewire_core atiixp crc_itu_t ehci_hcd ahci pata_marvell ohci_hcd libata scsi_mod generic ide_core thermal processor fan
Pid: 7652, comm: sh Not tainted 2.6.24-1-amd64 #1
RIP: 0010:[<ffffffff8021bc79>]  [<ffffffff8021bc79>] __smp_call_function_mask+0x9f/0xbf
RSP: 0018:ffff81000131bd68  EFLAGS: 00000297
RAX: 00000000000008fc RBX: 0000000000000003 RCX: 0000000000000001
RDX: 00000000000008fc RSI: 00000000000000fc RDI: 0000000000000007
RBP: ffff81000000e000 R08: ffff81011bab5982 R09: ffff81000131bce8
R10: ffff81000131bce8 R11: 0000000000000062 R12: 0000000000000000
R13: 00000000000000d0 R14: ffff81000131bcb8 R15: ffff81000131bcb8
FS:  00002b61abbe4ae0(0000) GS:ffff81011bab58c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000005e1808 CR3: 0000000001303000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
 [<ffffffff8021bc64>] __smp_call_function_mask+0x8a/0xbf
 [<ffffffff80275939>] smp_drain_local_pages+0x0/0x5
 [<ffffffff80275939>] smp_drain_local_pages+0x0/0x5
 [<ffffffff8021bcf7>] smp_call_function_mask+0x5e/0x70
 [<ffffffff80276c3a>] __alloc_pages+0x1ef/0x309
 [<ffffffff8020be2e>] system_call+0x7e/0x83
 [<ffffffff802762a4>] __get_free_pages+0xe/0x32
 [<ffffffff802334bb>] copy_process+0xc1/0x148f
 [<ffffffff8022d224>] set_next_entity+0x18/0x3a
 [<ffffffff8022d2a3>] pick_next_task_fair+0x2d/0x3f
 [<ffffffff8020be2e>] system_call+0x7e/0x83
 [<ffffffff80234985>] do_fork+0x70/0x204
 [<ffffffff8023f29a>] recalc_sigpending+0xe/0x3c
 [<ffffffff8023f366>] sigprocmask+0x9e/0xc0
 [<ffffffff8020c147>] ptregscall_common+0x67/0xb0


guerby@gcc04:~$ cat /proc/7652/stat
7652 (sh) R 7633 3857 3069 34816 3069 4194368 114 0 0 0 0 294472 0 0 20 0 1 0 2369557 8212480 332 18446744073709551615 4194304 4924460 140737476255664 18446744073709551615 47698491444962 262400 65538 0 65538 0 0 0 17 3 0 0 0 0 0
guerby@gcc04:~$ cat /proc/7652/status 
Name:   sh
State:  R (running)
Tgid:   7652
Pid:    7652
PPid:   7633
TracerPid:      27382
Uid:    1000    1000    1000    1000
Gid:    1000    1000    1000    1000
FDSize: 256
Groups: 1000 
VmPeak:     8020 kB
VmSize:     8020 kB
VmLck:         0 kB
VmHWM:      1328 kB
VmRSS:      1328 kB
VmData:     1168 kB
VmStk:        92 kB
VmExe:       716 kB
VmLib:      1560 kB
VmPTE:        24 kB
Threads:        1
SigQ:   12/18446744073709551615
SigPnd: 0000000000040100
ShdPnd: 00000000000f4102
SigBlk: 0000000000010002
SigIgn: 0000000000000000
SigCgt: 0000000000010002
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
Cpus_allowed:   0000000f
Mems_allowed:   00000000,00000001
voluntary_ctxt_switches:        0
nonvoluntary_ctxt_switches:     1
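
The SigPnd and ShdPnd fields above are hexadecimal bitmasks in which bit n-1 stands for signal n. A minimal sketch for decoding such a mask into signal numbers; the function name `sigmask_to_signals` is hypothetical:

```shell
#!/bin/sh
# sigmask_to_signals HEXMASK
# Print the signal numbers whose bits are set in a /proc/<pid>/status
# signal mask (bit 0 = signal 1, bit 1 = signal 2, ...).
# Hypothetical helper, not an existing tool.
sigmask_to_signals() {
    mask=$((0x$1))
    sig=1
    out=
    # Walk all 64 bit positions instead of looping until the mask is zero,
    # so a mask with the top bit set cannot cause an endless loop.
    while [ "$sig" -le 64 ]; do
        if [ $(( $mask & 1 )) -eq 1 ]; then
            out="$out${out:+ }$sig"
        fi
        mask=$(( $mask >> 1 ))
        sig=$(( $sig + 1 ))
    done
    printf '%s\n' "$out"
}
```

For the SigPnd value above, `sigmask_to_signals 0000000000040100` prints `9 19`, which on x86 Linux are SIGKILL and SIGSTOP: a kill has been sent but not yet acted on.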




* Re: BUG: soft lockup detected on Phenom with Debian 2.6.24-4
  2008-03-16 14:43 ` Laurent GUERBY
@ 2008-03-20 12:20   ` Laurent GUERBY
  2008-03-20 13:45     ` Peter Oruba
  0 siblings, 1 reply; 10+ messages in thread
From: Laurent GUERBY @ 2008-03-20 12:20 UTC (permalink / raw)
  To: linux-kernel

On Sun, 2008-03-16 at 15:43 +0100, Laurent GUERBY wrote:
> On Sat, 2008-03-08 at 23:00 +0100, Laurent GUERBY wrote:
> > Hi,
> > 
> > I have a system with an "AMD64 Phenom 9500" quad core cpu, 4GB RAM,
> > "ASUS M3A32 MVP Deluxe wifi" motherboard with latest vendor BIOS
> > (0801).
> > 
> > I tried stock debian etch kernel (Debian 2.6.18.dfsg.1-18etch1),
> > machine
> > froze with no message, debian etch backport kernel same, and then
> > Debian 2.6.24-4 from unstable and I got some messages: machine
> > is not frozen but some userland processes are (ps says "Dl" state
> > with child in "Zs" state) and "events/3" is taking 100% cpu
> > according to top:
> > 
> >    18 root      15  -5     0    0    0 R  100  0.0  74:59.46
> > events/3  
> > 
> > Got to the same state with ubuntu hardy 2.6.24-8-server kernel. All
> > kernels are untainted, no X running anyway.
> > 
> > It takes a few hours of doing some stuff, in my case bootstraping or
> > testing GCC at -j 4, and then the problem happens.
> 
> On 2.6.24-1-amd64 (Debian 2.6.24-4) I got a slightly different
> backtrace in /var/log/messages after a few hours of stressing the
> machine with compilations, see below. The given process
> was stuck and unkillable.
> 
> Any idea on what to do/try?

I changed motherboard and went for a (way cheaper but older) ASUS M2A-VM,
which is based on the AMD 690G chipset, with the exact same
kernel/phenom/memory/disk/box, and installed the latest vendor BIOS
(1604). It took longer (25 hours) to get a stuck and unkillable process,
but it did happen; this time I got nothing in /var/log/kern.log.

In order to rule out a motherboard or memory issue I bought an Athlon X2
4400+ EE and put it in place of the phenom with the exact same
kernel/memory/disk/box, and my stress test has been running for 72 hours
without any issue so far.

So in the end it seems to be a problem specific to phenom 9500 with the
linux kernel.

Did anyone succeed in getting a stable linux box based on a phenom 9500
processor? "Stable" defined as being able to survive a few days
compiling at -j4 (this is for the GCC compile farm after all :).

If so I'm interested in the exact motherboard/bios version/kernel
version/distro used.

As proposed in my first email, ssh root access is possible to my machine
(with either motherboard).

Thanks in advance,

Laurent




* Re: BUG: soft lockup detected on Phenom with Debian 2.6.24-4
  2008-03-20 12:20   ` Laurent GUERBY
@ 2008-03-20 13:45     ` Peter Oruba
  2008-03-21 18:23       ` Laurent GUERBY
  2008-04-03  1:04       ` Schmielau, Tim
  0 siblings, 2 replies; 10+ messages in thread
From: Peter Oruba @ 2008-03-20 13:45 UTC (permalink / raw)
  To: Laurent GUERBY; +Cc: linux-kernel

Laurent,

you may have triggered the L2 eviction bug (E298).

Please try

hexdump -s 0xc0010015 -n 8 -C /dev/cpu/0/msr

Output is little-endian, so bit 3 of the left-most byte should be set,
meaning TLB caching is disabled.

-Peter

Laurent GUERBY wrote:
> On Sun, 2008-03-16 at 15:43 +0100, Laurent GUERBY wrote:
>> On Sat, 2008-03-08 at 23:00 +0100, Laurent GUERBY wrote:
>>> Hi,
>>>
>>> I have a system with an "AMD64 Phenom 9500" quad core cpu, 4GB RAM,
>>> "ASUS M3A32 MVP Deluxe wifi" motherboard with latest vendor BIOS
>>> (0801).
>>>
>>> I tried stock debian etch kernel (Debian 2.6.18.dfsg.1-18etch1),
>>> machine
>>> froze with no message, debian etch backport kernel same, and then
>>> Debian 2.6.24-4 from unstable and I got some messages: machine
>>> is not frozen but some userland processes are (ps says "Dl" state
>>> with child in "Zs" state) and "events/3" is taking 100% cpu
>>> according to top:
>>>
>>>    18 root      15  -5     0    0    0 R  100  0.0  74:59.46
>>> events/3  
>>>
>>> Got to the same state with ubuntu hardy 2.6.24-8-server kernel. All
>>> kernels are untainted, no X running anyway.
>>>
>>> It takes a few hours of doing some stuff, in my case bootstraping or
>>> testing GCC at -j 4, and then the problem happens.
>> On 2.6.24-1-amd64 (Debian 2.6.24-4) I got a slightly different
>> backtrace in /var/log/messages after a few hours of stressing the
>> machine with compilations, see below. The given process
>> was stuck and unkillable.
>>
>> Any idea on what to do/try?
> 
> I changed motherboard and went for a (way cheaper but older) ASUS M2A-VM
> with is based on the AMD 690G chipset with the exact same
> kernel/phenom/memory/disk/box and installed the latest vendor BIOS
> (1604). It took longer (25 hours) to get a stuck and unkillable process
> but it did happen, but I got nothing in /var/log/kern.log this time.
> 
> In order to rule out a motherboard or memory issue I bought an Athlon X2
> 4400+ EE and put in replacement of the phenom with the exact same
> kenel/memory/disk/box and my stress test has been running for 72 hours
> without any issue so far.
> 
> So in the end it seems to be a problem specific to phenom 9500 with the
> linux kernel.
> 
> Did anyone succeed in getting a stable linux box based on a phenom 9500
> processor ? "Stable" defined as being able to survive a few days
> compiling at -j4 (this is for the GCC compile farm after all :).
> 
> If so I'm interested by the exact motherboard/bios version/kernel
> version/distro used.
> 
> As proposed in my first email, ssh root access is possible to my machine
> (with either motherboard).
> 
> Thanks in advance,
> 
> Laurent
> 
> 
> 
> 

-- 
            |           AMD Saxony Limited Liability Company & Co. KG
  Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
  System    |                  Register Court Dresden: HRA 4896
  Research  |              General Partner authorized to represent:
  Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
            | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy



* Re: BUG: soft lockup detected on Phenom with Debian 2.6.24-4
  2008-03-20 13:45     ` Peter Oruba
@ 2008-03-21 18:23       ` Laurent GUERBY
  2008-04-12  7:28         ` Laurent GUERBY
  2008-04-03  1:04       ` Schmielau, Tim
  1 sibling, 1 reply; 10+ messages in thread
From: Laurent GUERBY @ 2008-03-21 18:23 UTC (permalink / raw)
  To: Peter Oruba; +Cc: linux-kernel

On Thu, 2008-03-20 at 14:45 +0100, Peter Oruba wrote:
> Laurent,
> 
> you may have triggered the L2 eviction bug (E298).
> 
> Please try
> 
> hexdump -s 0xc0010015 -n 8 -C /dev/cpu/0/msr
> 
> Output is little-endian, so the left-most byte must have bit 3 enabled meaning 
> TLB caching is disabled.

Hi Peter,

Since I did not have /dev/cpu, after some searching I ran:

for i in 0 1 2 3; do
	mkdir -p /dev/cpu/$i
	mknod -m 444 /dev/cpu/$i/msr c 202 $i
	mknod -m 444 /dev/cpu/$i/cpuid c 203 $i
done
modprobe msr

Then:

# hexdump -s 0xc0010015 -n 8 -C /dev/cpu/0/msr
# echo $?
0
# cat /proc/version 
Linux version 2.6.24-1-amd64 (Debian 2.6.24-4) (waldi@debian.org) (gcc
version 4.1.3 20080114 (prerelease) (Debian 4.1.2-19)) #1 SMP Mon Feb 11
13:47:43 UTC 2008

When I look at hexdump -v it's only zeroes for as long as I run it.

What am I doing wrong?

Thanks for your help!

Laurent

# lsmod
Module                  Size  Used by
msr                     8968  0 
nfs                   258800  0 
tun                    16512  1 
cpufreq_userspace       9604  4 
nfsd                  261416  17 
lockd                  73136  3 nfs,nfsd
nfs_acl                 8192  2 nfs,nfsd
auth_rpcgss            53152  1 nfsd
sunrpc                201096  11 nfs,nfsd,lockd,nfs_acl,auth_rpcgss
exportfs                9472  1 nfsd
ac                     11400  0 
battery                19976  0 
ipv6                  286248  28 
it87                   28432  0 
hwmon_vid               7424  1 it87
k8temp                 10496  0 
eeprom                 12688  0 
powernow_k8            18464  0 
freq_table              9728  1 powernow_k8
loop                   22788  0 
snd_hda_intel         362024  0 
parport_pc             42408  0 
parport                44812  1 parport_pc
serio_raw              11908  0 
snd_pcm                89352  1 snd_hda_intel
snd_timer              28552  1 snd_pcm
snd                    65896  3 snd_hda_intel,snd_pcm,snd_timer
soundcore              13216  1 snd
snd_page_alloc         15248  2 snd_hda_intel,snd_pcm
pcspkr                  7808  0 
i2c_piix4              13708  0 
psmouse                45724  0 
i2c_core               29824  2 eeprom,i2c_piix4
shpchp                 38428  0 
button                 13984  0 
pci_hotplug            35504  1 shpchp
evdev                  17024  0 
ext3                  139024  1 
jbd                    55976  1 ext3
mbcache                13952  1 ext3
dm_mirror              27008  0 
dm_snapshot            21960  0 
dm_mod                 67832  2 dm_mirror,dm_snapshot
ata_generic            13060  0 
sd_mod                 33408  3 
generic                 9732  0 [permanent]
usbhid                 35168  0 
hid                    41792  1 usbhid
atiixp                  9488  0 [permanent]
ahci                   34436  2 
ide_core              144920  2 generic,atiixp
libata                162736  2 ata_generic,ahci
ehci_hcd               39692  0 
ohci_hcd               28164  0 
scsi_mod              170296  2 sd_mod,libata
r8169                  36740  0 
thermal                22688  0 
processor              45032  2 powernow_k8,thermal
fan                     9864  0 




* Re: BUG: soft lockup detected on Phenom with Debian 2.6.24-4
  2008-03-08 22:00 BUG: soft lockup detected on Phenom with Debian 2.6.24-4 Laurent GUERBY
  2008-03-16 14:43 ` Laurent GUERBY
@ 2008-04-02 19:15 ` Tim Schmielau
  1 sibling, 0 replies; 10+ messages in thread
From: Tim Schmielau @ 2008-04-02 19:15 UTC (permalink / raw)
  To: Laurent GUERBY; +Cc: linux-kernel

[apologies if you receive this email twice; apparently the exchange server
this message was sent through previously triggered one of lkml's taboo
expressions]

On Sat, 08 Mar 2008 23:00:15 +0100, Laurent GUERBY wrote:

 > I have a system with an "AMD64 Phenom 9500" quad core cpu, 4GB RAM,
 > "ASUS M3A32 MVP Deluxe wifi" motherboard with latest vendor BIOS (0801).
 >
 > I tried stock debian etch kernel (Debian 2.6.18.dfsg.1-18etch1), machine
 > froze with no message, debian etch backport kernel same, and then
 > Debian 2.6.24-4 from unstable and I got some messages: machine
 > is not frozen but some userland processes are (ps says "Dl" state
 > with child in "Zs" state) and "events/3" is taking 100% cpu
 > according to top:
 >
 >    18 root      15  -5     0    0    0 R  100  0.0  74:59.46 events/3
 >
 > Got to the same state with ubuntu hardy 2.6.24-8-server kernel. All
 > kernels are untainted, no X running anyway.
 >
 > It takes a few hours of doing some stuff, in my case bootstraping or
 > testing GCC at -j 4, and then the problem happens.
 >
 > I did 32 hours of memtest without issue on this system, temperatures
 > are very low and the case has plenty of airflow, making memory
 > issue less likely.

I have a very similar, if not the same issue:

I just bought an HP Pavilion 6332 with a Phenom 9500 quad core cpu and
3GB RAM on some ASUS-looking mainboard with NVidia MCP61 chipset
(actually my first PC not assembled from components myself).
I installed OpenSUSE 10.3 64bit and tried various kernels (2.6.24.4,
2.6.23.17 + the erratum298-workaround from AMD, OpenSUSE's default
2.6.22.5-31 and 2.6.22.17-0.1), but the machine hangs within less
than an hour of intense OpenMP load over all four cores (using a
homemade scientific application).

The symptoms of the hang are similar to what Laurent saw: in xosview,
the load display for one cpu would get stuck (not necessarily at 100%), and
usually (but not always) ps would hang in the middle of its output. I
could still log in (no X running here either) and use the remaining cores,
but the OpenMP application would hang in an unkillable state (I don't
know which, as ps gets stuck). Unlike Laurent, however, I don't see
anything relevant in the syslog.

Occasionally I also had wrong output from the program (i.e., running it
twice gives different results, one of them bogus), although that was on
32bit OpenSUSE 10.3, as far as I remember. I know due to the nature of
OpenMP this could as well be a bug in my program, but it has not yet
happened on any other machine.

The system is well vented and very cool, but as of yet I only had time
to run memtest for 6 hours (no errors found).

If anybody has an idea how to debug this, please ask for more
information, otherwise I'll just return this machine for refund.

Thanks,
Tim



* RE: BUG: soft lockup detected on Phenom with Debian 2.6.24-4
  2008-03-20 13:45     ` Peter Oruba
  2008-03-21 18:23       ` Laurent GUERBY
@ 2008-04-03  1:04       ` Schmielau, Tim
  1 sibling, 0 replies; 10+ messages in thread
From: Schmielau, Tim @ 2008-04-03  1:04 UTC (permalink / raw)
  To: Peter Oruba, Laurent GUERBY; +Cc: linux-kernel

On Thu 3/20/2008 1:45 PM, Peter Oruba wrote:
 
> you may have triggered the L2 eviction bug (E298).
> 
> Please try
> 
> hexdump -s 0xc0010015 -n 8 -C /dev/cpu/0/msr
> 
> Output is little-endian, so the left-most byte must have bit 3 enabled meaning 
> TLB caching is disabled.

While this question was not directed at me, I indeed find that bit 3 of the
leftmost byte is set with 2.6.24:

  # hexdump -s 0xc0010015 -n 8 -C /dev/cpu/0/msr
  c0010015  18 00 00 01 00 00 00 00                           |........|
  c001001d
  # uname -a
  Linux quad 2.6.24.4-default #1 SMP Mon Mar 31 18:51:48 BST 2008 x86_64 x86_64 x86_64 GNU/Linux

while (as expected) it was zero running 2.6.23.17 patched with the erratum 298
workaround. Yet I saw the lockups with both kernels.
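
The bit test described above can be done mechanically on the byte list that hexdump -C prints. A minimal sketch; `msr_bit_is_set` is a hypothetical helper name, and the bytes are passed lowest-order first, exactly as printed:

```shell
#!/bin/sh
# msr_bit_is_set "B0 B1 ... B7" BIT
# Succeeds if bit BIT (0-63) is set in a register whose eight bytes are given
# as space-separated hex, lowest-order byte first (little-endian dump order).
# Hypothetical helper, not an existing tool.
msr_bit_is_set() {
    bit=$2
    set -- $1                   # split the byte list into $1..$8
    shift $(( $bit / 8 ))       # drop the bytes below the one holding the bit
    [ $(( (0x$1 >> ($bit % 8)) & 1 )) -eq 1 ]
}

# Bytes taken from the 2.6.24 dump above.
if msr_bit_is_set "18 00 00 01 00 00 00 00" 3; then
    echo "bit 3 is set"
fi
```

The left-most byte 0x18 is binary 00011000, so bits 3 and 4 are both set in that dump.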

Thanks,
Tim


* Re: BUG: soft lockup detected on Phenom with Debian 2.6.24-4
  2008-03-21 18:23       ` Laurent GUERBY
@ 2008-04-12  7:28         ` Laurent GUERBY
  2008-04-13 14:37           ` Thomas Gleixner
  0 siblings, 1 reply; 10+ messages in thread
From: Laurent GUERBY @ 2008-04-12  7:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Oruba

Hi,

FYI, with Peter's off-list help we found a way to make the ASUS M2A-VM with
1604 BIOS stable under my stress test: we just needed nmi_watchdog=1 in
the kernel boot options (no other boot option necessary).

With nmi_watchdog=1 we see in kern.log "APIC error" but
the machine stayed stable during 3 days of stress testing:

...
Apr  7 22:41:43 gcc04 kernel: APIC error on CPU2: 00(40)
Apr  7 22:41:43 gcc04 kernel: APIC error on CPU1: 00(40)
Apr  7 22:41:43 gcc04 kernel: APIC error on CPU3: 00(40)
Apr  7 22:41:43 gcc04 kernel: APIC error on CPU0: 00(40)
Apr  7 22:53:01 gcc04 kernel: APIC error on CPU3: 40(40)
Apr  7 22:53:01 gcc04 kernel: APIC error on CPU0: 40(40)
Apr  7 22:53:01 gcc04 kernel: APIC error on CPU1: 40(40)
...

guerby@gcc04:~$ cat /proc/cmdline 
root=/dev/sda1 ro nmi_watchdog=1

We are now stress testing the 1705 BIOS version which was released by
ASUS on 20080331, with and without nmi_watchdog=1. Then we'll go
back to testing the ASUS M3A32-MVP Deluxe/WiFi-AP with the newer 1002
BIOS also released on 20080331.

Note: for msr decoding xxd should be used since hexdump doesn't work:

xxd -s 0xc0010015 -l 8 /dev/cpu/0/msr
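
To see the register as one 64-bit value rather than as individual bytes, the byte order has to be reversed, since the bytes come out lowest-order first. A minimal sketch, assuming the input is the 16 hex digits that xxd prints in plain mode (-p); `le_to_hex64` is a hypothetical helper name:

```shell
#!/bin/sh
# le_to_hex64 HEX16
# Reverse the byte order of a 16-digit little-endian hex string and print it
# as a single 0x-prefixed 64-bit value. Hypothetical helper.
le_to_hex64() {
    [ "${#1}" -eq 16 ] || return 1   # expect exactly eight bytes
    in=$1
    out=
    while [ -n "$in" ]; do
        rest=${in#??}                # everything after the first byte
        out=${in%"$rest"}$out        # prepend that byte to the result
        in=$rest
    done
    printf '0x%s\n' "$out"
}
```

With the msr module loaded, something like `le_to_hex64 "$(xxd -s 0xc0010015 -l 8 -p /dev/cpu/0/msr)"` (run as root) would print the value in the usual most-significant-digit-first notation.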

So people having stability problems with Phenom 9x00 under Linux should
try nmi_watchdog=1 as a boot option.
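
Whether the option was actually passed at boot can be checked from /proc/cmdline. A minimal sketch; `cmdline_has_option` is a hypothetical helper that looks for a whole-word match in a command-line string:

```shell
#!/bin/sh
# cmdline_has_option OPTION CMDLINE
# Succeeds if OPTION appears as a complete space-separated word in CMDLINE.
# Hypothetical helper, not an existing tool.
cmdline_has_option() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

On a running system it could be called as `cmdline_has_option nmi_watchdog=1 "$(cat /proc/cmdline)" && echo "nmi_watchdog=1 is set"`.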

Sincerely,

Laurent




* Re: BUG: soft lockup detected on Phenom with Debian 2.6.24-4
  2008-04-12  7:28         ` Laurent GUERBY
@ 2008-04-13 14:37           ` Thomas Gleixner
  2008-04-14 10:23             ` Laurent GUERBY
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Gleixner @ 2008-04-13 14:37 UTC (permalink / raw)
  To: Laurent GUERBY; +Cc: linux-kernel, Peter Oruba

On Sat, 12 Apr 2008, Laurent GUERBY wrote:
> Hi,
> 
> FYI with Peter off-list help we found a way to make the ASUS M2A-VM with
> 1604 BIOS stable under my stress test: we just needed nmi_watchdog=1 in
> the kernel boot options (no other boot option necessary).

hmm, nmi_watchdog=1 disables the local apic timer. Can you add
"noapictimer" to the kernel command line instead ?

Thanks,
	tglx


* Re: BUG: soft lockup detected on Phenom with Debian 2.6.24-4
  2008-04-13 14:37           ` Thomas Gleixner
@ 2008-04-14 10:23             ` Laurent GUERBY
  0 siblings, 0 replies; 10+ messages in thread
From: Laurent GUERBY @ 2008-04-14 10:23 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, Peter Oruba

In fact BIOS 1705 without any boot option seems
to provide a stable machine (51 hours of stress
test without failure). BIOS 1705 also has
an explicit TLB fix setting (auto/enable/disable)
that the previous BIOS didn't have (even
if the TLB fix was on).

So it was probably a BIOS bug; we're continuing
the stress testing.

Laurent


On Sun, 2008-04-13 at 16:37 +0200, Thomas Gleixner wrote:
> On Sat, 12 Apr 2008, Laurent GUERBY wrote:
> > Hi,
> > 
> > FYI with Peter off-list help we found a way to make the ASUS M2A-VM with
> > 1604 BIOS stable under my stress test: we just needed nmi_watchdog=1 in
> > the kernel boot options (no other boot option necessary).
> 
> hmm, nmi_watchdog=1 disables the local apic timer. Can you add
> "noapictimer" to the kernel command line instead ?
> 
> Thanks,
> 	tglx
> 




Thread overview: 10+ messages
2008-03-08 22:00 BUG: soft lockup detected on Phenom with Debian 2.6.24-4 Laurent GUERBY
2008-03-16 14:43 ` Laurent GUERBY
2008-03-20 12:20   ` Laurent GUERBY
2008-03-20 13:45     ` Peter Oruba
2008-03-21 18:23       ` Laurent GUERBY
2008-04-12  7:28         ` Laurent GUERBY
2008-04-13 14:37           ` Thomas Gleixner
2008-04-14 10:23             ` Laurent GUERBY
2008-04-03  1:04       ` Schmielau, Tim
2008-04-02 19:15 ` Tim Schmielau
