LKML Archive on lore.kernel.org
* I do not know if this is the correct place to ask about this but...
@ 2011-02-08  9:31 dave b
  2011-02-08  9:52 ` Clemens Ladisch
  2011-02-08 10:00 ` Borislav Petkov
  0 siblings, 2 replies; 4+ messages in thread
From: dave b @ 2011-02-08  9:31 UTC
  To: Linux Kernel

I do not know if this is the correct place to ask about this but...
I have only seen the following output twice, and both times I was
running a 2.6.37 kernel.

[152399.816058] [Hardware Error]: MC4_STATUS: Corrected error, other errors lost: no, CPU context corrupt: no, CECC Error
[152399.816075] [Hardware Error]: Northbridge Error, node 0: , core: 1L3 ECC data cache error.
[152399.816086] [Hardware Error]: Transaction: RD, Type: GEN, Cache Level: L3/GEN
[152399.816092] Disabling lock debugging due to kernel taint
[152399.816099] [Hardware Error]: Machine check events logged

I assume it is just a coincidence. Also, I am not exactly sure what
the message "means". (Yes, I can read the text, but I haven't found
good documentation that describes its impact.) Note: I submitted a
bug[0] regarding 'the output' the first time this occurred.

[0] - https://bugzilla.kernel.org/show_bug.cgi?id=27332

* Re: I do not know if this is the correct place to ask about this but...
  2011-02-08  9:31 I do not know if this is the correct place to ask about this but dave b
@ 2011-02-08  9:52 ` Clemens Ladisch
  2011-02-08 10:03   ` dave b
  2011-02-08 10:00 ` Borislav Petkov
  1 sibling, 1 reply; 4+ messages in thread
From: Clemens Ladisch @ 2011-02-08  9:52 UTC
  To: dave b; +Cc: Linux Kernel

dave b wrote:
> I do not know if this is the correct place to ask about this but...
> 
> [Hardware Error]:

This is a hardware error that was detected by the kernel.

> I have only seen the following output twice
> 
> ... Corrected error
> ... L3 ECC data cache error.

There was a wrong bit in your CPU's level 3 cache, but with the help of
the redundant error correction bits, this was caught and corrected.

(If possible, enable background scrubbing of the caches in the BIOS ECC
settings to catch these errors earlier.)

If this is an overclocked CPU or one with an unlocked core, you deserve
what you got.  Otherwise, if this happens repeatedly, it indicates
a hardware defect, and your warranty should cover this.
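
To judge whether this "happens repeatedly", one option is to count the
corrected-error reports in a saved kernel log. The following is only a
minimal sketch: the function name count_corrected_mce is made up for
illustration, and the match string is taken from the 2.6.37 output
quoted in this thread, so other kernel versions may word it differently.

```shell
#!/bin/bash
# Count lines reporting a corrected MC4 error.
# Reads the file(s) given as arguments, or stdin when none are given.
# The pattern is copied from the log in this thread (an assumption for
# any other kernel version).
count_corrected_mce() {
    # grep -c exits non-zero when the count is 0; ignore that status.
    grep -c 'MC4_STATUS: Corrected error' "$@" || true
}
```

For example, "dmesg | count_corrected_mce" prints the number of such
reports still present in the kernel ring buffer.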


Regards,
Clemens

* Re: I do not know if this is the correct place to ask about this but...
  2011-02-08  9:31 I do not know if this is the correct place to ask about this but dave b
  2011-02-08  9:52 ` Clemens Ladisch
@ 2011-02-08 10:00 ` Borislav Petkov
  1 sibling, 0 replies; 4+ messages in thread
From: Borislav Petkov @ 2011-02-08 10:00 UTC
  To: dave b; +Cc: Linux Kernel, borislav.petkov

On Tue, Feb 08, 2011 at 08:31:50PM +1100, dave b wrote:
> I do not know if this is the correct place to ask about this but...
> I have only seen the following output twice, and both times I was
> running a 2.6.37 kernel.
> 
> [152399.816058] [Hardware Error]: MC4_STATUS: Corrected error, other
> errors lost: no, CPU context corrupt: no, CECC Error
> [152399.816075] [Hardware Error]: Northbridge Error, node 0: , core:
> 1L3 ECC data cache error.
> [152399.816086] [Hardware Error]: Transaction: RD, Type: GEN, Cache
> Level: L3/GEN
> [152399.816092] Disabling lock debugging due to kernel taint
> [152399.816099] [Hardware Error]: Machine check events logged
> 
> I assume it is just a coincidence. Also, I am not exactly sure what
> the message "means". (Yes, I can read the text, but I haven't found
> good documentation that describes its impact.) Note: I submitted a
> bug[0] regarding 'the output' the first time this occurred.

This is an L3 cache correctable error on an AMD F10h machine, I'd guess.

You could install x86info from
http://codemonkey.org.uk/projects/x86info/ and run, as root:

for i in $(seq 0 3); do echo -e "\nCPU$i:"; lsmsr -c $i -a; done > lsmsr.log

 [ $(seq 0 3) assumes you have 4 cores; adjust it according to your
    machine. Also, you need msr.ko module support, i.e. CONFIG_X86_MSR in
    your kernel .config. ]

and send me the lsmsr.log file to check whether there is some more info
about the L3 error.
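
A minimal sketch of the same loop that derives the CPU count instead of
hard-coding it. It assumes nproc (GNU coreutils) is available and that
CPU numbering is contiguous from 0; the function names cpu_ids and
dump_msrs are hypothetical, chosen only for illustration.

```shell
#!/bin/bash
# Print CPU indices 0..N-1, one per line. N defaults to the number of
# processing units reported by nproc (assumes contiguous numbering).
cpu_ids() {
    local n="${1:-$(nproc)}"
    seq 0 $((n - 1))
}

# Dump all MSRs for each CPU, as in the one-liner above. Needs root and
# msr support; lsmsr ships with x86info.
dump_msrs() {
    local i
    for i in $(cpu_ids "$@"); do
        printf '\nCPU%s:\n' "$i"
        lsmsr -c "$i" -a
    done
}

# Usage (as root): dump_msrs > lsmsr.log
```

Passing an explicit count, e.g. "dump_msrs 6", overrides the detected
value.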

If you don't have msr.ko support (that is, CONFIG_X86_MSR is set to
neither y nor m in your config), that tool won't help. In that case,
I'd suggest you upgrade your kernel to 2.6.38-rc4, which is stable
enough, enable CONFIG_X86_MSR and catch the error again. Then retry the
small bash one-liner above.
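
To check how CONFIG_X86_MSR is set before trying lsmsr, something along
these lines may help. It is only a sketch: the default path
/boot/config-$(uname -r) is a common distribution convention rather than
something stated in this thread, and the function name msr_config_state
is hypothetical.

```shell
#!/bin/bash
# Report how CONFIG_X86_MSR is set in a kernel config file:
# prints "y" (built in), "m" (module), or "unset".
# $1 is the config file; the default path is an assumption.
msr_config_state() {
    local cfg="${1:-/boot/config-$(uname -r)}"
    local line
    # A missing file or missing option both end up as "unset".
    line=$(grep -E '^CONFIG_X86_MSR=' "$cfg" 2>/dev/null || true)
    case "$line" in
        CONFIG_X86_MSR=y) echo y ;;
        CONFIG_X86_MSR=m) echo m ;;
        *) echo unset ;;
    esac
}
```

If this prints "m", loading the module with "modprobe msr" as root
should make the /dev/cpu/*/msr device nodes appear.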

That should be all for now; feel free to ask questions should anything
be unclear.

Thanks.

-- 
Regards/Gruss,
    Boris.

* Re: I do not know if this is the correct place to ask about this but...
  2011-02-08  9:52 ` Clemens Ladisch
@ 2011-02-08 10:03   ` dave b
  0 siblings, 0 replies; 4+ messages in thread
From: dave b @ 2011-02-08 10:03 UTC
  To: Clemens Ladisch; +Cc: Linux Kernel

On 8 February 2011 20:52, Clemens Ladisch <clemens@ladisch.de> wrote:
> dave b wrote:
>> I do not know if this is the correct place to ask about this but...
>>
>> [Hardware Error]:
>
> This is a hardware error that was detected by the kernel.
>
>> I have only seen the following output twice
>>
>> ... Corrected error
>> ... L3 ECC data cache error.
>
> There was a wrong bit in your CPU's level 3 cache, but with the help of
> the redundant error correction bits, this was caught and corrected.

Yep, I got that.

> (If possible, enable background scrubbing of the caches in the BIOS ECC
> settings to catch these errors earlier.)

OK, I will look into that.

> If this is an overclocked CPU or one with an unlocked core, you deserve
> what you got.  Otherwise, if this happens repeatedly, it indicates
> a hardware defect, and your warranty should cover this.

OK, fair enough.

Notes:
The CPU is not overclocked.
I forgot to list the hardware specifications:
the CPU is an AMD Phenom(tm) II X6 1055T Processor
running on an ASUS M4A88TD-M motherboard.
