LKML Archive on lore.kernel.org
* Opteron ECC/ChipKill error
@ 2011-02-08 13:30 martin f krafft
  2011-02-08 13:49 ` Borislav Petkov
  0 siblings, 1 reply; 5+ messages in thread
From: martin f krafft @ 2011-02-08 13:30 UTC
  To: linux kernel mailing list

Dear list,

I just got to see the following message on my Opteron server:

  kernel: [810137.744689]  Northbridge Error, node 1
  kernel: [810137.756250] ECC/ChipKill ECC error.
  kernel: [810137.766975] EDAC amd64 MC1: CE ERROR_ADDRESS= 0x26bdd40f0
  kernel: [810137.766982] EDAC MC1: CE page 0x26bdd4, offset 0xf0, grain 0, syndrome 0xe1e2, row 6, channel 1, label "": amd64_edac

Is there any way to deduce from these data the actual
culprit/component to replace?

Thanks,

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
"a cigarette is the perfect type of pleasure.
 it is exquisite, and it leaves one unsatisfied."
                                                        -- oscar wilde
 
spamtraps: madduck.bogus@madduck.net

* Re: Opteron ECC/ChipKill error
  2011-02-08 13:30 Opteron ECC/ChipKill error martin f krafft
@ 2011-02-08 13:49 ` Borislav Petkov
  2011-02-08 13:59   ` martin f krafft
  0 siblings, 1 reply; 5+ messages in thread
From: Borislav Petkov @ 2011-02-08 13:49 UTC
  To: martin f krafft; +Cc: LKML

On Tue, Feb 08, 2011 at 02:30:11PM +0100, martin f krafft wrote:
> Dear list,
> 
> I just got to see the following message on my Opteron server:
> 
>   kernel: [810137.744689]  Northbridge Error, node 1
>   kernel: [810137.756250] ECC/ChipKill ECC error.
>   kernel: [810137.766975] EDAC amd64 MC1: CE ERROR_ADDRESS= 0x26bdd40f0
>   kernel: [810137.766982] EDAC MC1: CE page 0x26bdd4, offset 0xf0, grain 0, syndrome 0xe1e2, row 6, channel 1, label "": amd64_edac
> 
> Is there any way to deduce from these data the actual
> culprit/component to replace?

It is a DRAM ECC error on one of the DIMMs on your node 1. If it is a
single occurrence I wouldn't start to worry yet - I'd monitor to see
whether the same row above (row 6) starts increasing its error rate.
Also, sometimes reseating the DIMMs helps.
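
The monitoring suggested above can be scripted. What follows is a minimal sketch, not a definitive tool: it assumes the legacy EDAC sysfs layout (per-csrow `ce_count` files under `/sys/devices/system/edac/mc/mcN/`), and the helper names `read_ce_counts` and `increased_rows` are made up for illustration. It takes two snapshots of the corrected-error counters and reports which rows grew.

```python
from pathlib import Path


def read_ce_counts(mc_dir):
    """Return {csrow name: corrected-error count} for one memory controller.

    mc_dir is assumed to look like /sys/devices/system/edac/mc/mc1
    (legacy csrow layout); a directory without csrow*/ce_count files
    simply yields an empty dict.
    """
    counts = {}
    for counter in sorted(Path(mc_dir).glob("csrow*/ce_count")):
        counts[counter.parent.name] = int(counter.read_text().strip())
    return counts


def increased_rows(before, after):
    """Return {csrow name: delta} for rows whose count grew between snapshots."""
    return {
        row: count - before.get(row, 0)
        for row, count in after.items()
        if count > before.get(row, 0)
    }


# Typical use (path is an assumption; MC1 is the controller named in the log):
#   old = read_ce_counts("/sys/devices/system/edac/mc/mc1")
#   ... wait a day ...
#   new = read_ce_counts("/sys/devices/system/edac/mc/mc1")
#   print(increased_rows(old, new))
```

A growing delta on `csrow6` would be the signal to watch for here; a count that stays flat matches the "single occurrence" case.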

Can you send your dmesg please?

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

* Re: Opteron ECC/ChipKill error
  2011-02-08 13:49 ` Borislav Petkov
@ 2011-02-08 13:59   ` martin f krafft
  2011-02-08 14:24     ` Borislav Petkov
  0 siblings, 1 reply; 5+ messages in thread
From: martin f krafft @ 2011-02-08 13:59 UTC
  To: Borislav Petkov; +Cc: LKML

also sprach Borislav Petkov <bp@amd64.org> [2011.02.08.1449 +0100]:
> It is a DRAM ECC error on one of the DIMMs on your node 1. If it is a
> single occurrence I wouldn't start to worry yet - I'd monitor to see
> whether the same row above (row 6) starts increasing its error rate.
> Also, sometimes reseating the DIMMs helps.

Thanks. I really hope this won't happen again as I really don't want
to go to the hosting place and open the server. ;)

> Can you send your dmesg please?

Don't want to spam the list, so:
http://scratch.madduck.net/__tmp__dmesg.gz

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
"women, when they are not in love,
 have all the cold blood of an experienced attorney."
                                                   -- honoré de balzac
 
spamtraps: madduck.bogus@madduck.net

* Re: Opteron ECC/ChipKill error
  2011-02-08 13:59   ` martin f krafft
@ 2011-02-08 14:24     ` Borislav Petkov
  2011-02-08 17:09       ` martin f krafft
  0 siblings, 1 reply; 5+ messages in thread
From: Borislav Petkov @ 2011-02-08 14:24 UTC
  To: martin f krafft; +Cc: LKML

On Tue, Feb 08, 2011 at 08:59:56AM -0500, martin f krafft wrote:
> also sprach Borislav Petkov <bp@amd64.org> [2011.02.08.1449 +0100]:
> > It is a DRAM ECC error on one of the DIMMs on your node 1. If it is a
> > single occurrence I wouldn't start to worry yet - I'd monitor to see
> > whether the same row above (row 6) starts increasing its error rate.
> > Also, sometimes reseating the DIMMs helps.
> 
> Thanks. I really hope this won't happen again as I really don't want
> to go to the hosting place and open the server. ;)

Yeah, well, keep your fingers crossed. Just to reiterate, getting ECCs
is not a problem per se - they may appear even during normal operation
and in this case get corrected just fine by the memory controller. Only
an increase in the error rate may hint at a failing DRAM device, so if
the error starts repeating, you might start thinking about when the
downtime to replace the failing DIMM would hurt least.

Cough.. IMHO. :)

> > Can you send your dmesg please?
> 
> Don't want to spam the list, so:
> http://scratch.madduck.net/__tmp__dmesg.gz

Ah ok, this is a .32 kernel and it doesn't have the information I was
looking for. I've changed that in later kernels so that EDAC dumps the
DRAM chip selects placement on the memory controller. Here's an example:


[   15.256809] EDAC MC: DCT0 chip selects:
[   15.261007] EDAC amd64: MC: 0:  2048MB 1:  2048MB
[   15.266073] EDAC amd64: MC: 2:  2048MB 3:  2048MB
[   15.271140] EDAC amd64: MC: 4:     0MB 5:     0MB
[   15.276207] EDAC amd64: MC: 6:     0MB 7:     0MB
[   15.291246] EDAC MC: DCT1 chip selects:
[   15.295443] EDAC amd64: MC: 0:  2048MB 1:  2048MB
[   15.300511] EDAC amd64: MC: 2:  2048MB 3:  2048MB
[   15.305579] EDAC amd64: MC: 4:     0MB 5:     0MB
[   15.310647] EDAC amd64: MC: 6:     0MB 7:     0MB
[   15.315711] EDAC amd64: using x8 syndromes.

and from this I can see that I have 4 DIMMs on the node, 2 per channel,
and that each DIMM is 4G (dual-ranked). That last part I know from the
DIMM type.
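
The reading of the dump above can be reproduced mechanically. This is only a sketch under stated assumptions: the function names are hypothetical, and pairing two populated chip selects into one DIMM relies on the DIMMs being dual-ranked, which the dump itself does not tell you.

```python
import re

DUMP = """\
[   15.256809] EDAC MC: DCT0 chip selects:
[   15.261007] EDAC amd64: MC: 0:  2048MB 1:  2048MB
[   15.266073] EDAC amd64: MC: 2:  2048MB 3:  2048MB
[   15.271140] EDAC amd64: MC: 4:     0MB 5:     0MB
[   15.276207] EDAC amd64: MC: 6:     0MB 7:     0MB
[   15.291246] EDAC MC: DCT1 chip selects:
[   15.295443] EDAC amd64: MC: 0:  2048MB 1:  2048MB
[   15.300511] EDAC amd64: MC: 2:  2048MB 3:  2048MB
[   15.305579] EDAC amd64: MC: 4:     0MB 5:     0MB
[   15.310647] EDAC amd64: MC: 6:     0MB 7:     0MB
[   15.315711] EDAC amd64: using x8 syndromes.
"""


def parse_chip_selects(text):
    """Return {DCT number: {chip select: size in MB}} from the dmesg dump."""
    dcts, current = {}, None
    for line in text.splitlines():
        header = re.search(r"DCT(\d+) chip selects", line)
        if header:
            current = dcts.setdefault(int(header.group(1)), {})
            continue
        if current is not None and "MC:" in line:
            # Everything after "MC:" is a list of "<cs>: <size>MB" pairs.
            pairs = re.findall(r"(\d+):\s*(\d+)MB", line.split("MC:", 1)[1])
            for cs, mb in pairs:
                current[int(cs)] = int(mb)
    return dcts


def dimm_sizes(cs_sizes, ranks_per_dimm=2):
    """Group populated chip selects into DIMMs (assumes dual-ranked DIMMs)."""
    populated = [mb for _, mb in sorted(cs_sizes.items()) if mb > 0]
    return [
        sum(populated[i:i + ranks_per_dimm])
        for i in range(0, len(populated), ranks_per_dimm)
    ]


for dct, sizes in sorted(parse_chip_selects(DUMP).items()):
    print(f"DCT{dct}: DIMM sizes in MB = {dimm_sizes(sizes)}")
```

With single-ranked DIMMs you would pass `ranks_per_dimm=1` instead, and the same dump would read as four 2G DIMMs per channel.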

In your case, the ECC comes from chip select 6 (the "row 6" in the log),
which should mean the last DIMM on the node, on the second channel. You
have to look at the silkscreen labels on the board to pinpoint which DIMM
it is, or search through board layout manuals. (I know, this should be
easier, I know...)
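
For reference, the fields used above (controller, row, channel) can be pulled out of the reported log line with a short script. This is a minimal sketch: the helper names are hypothetical, the regular expression is fitted to the one line quoted in this thread, and the 4 KiB page size (shift of 12) is an assumption that happens to agree with the page and offset the kernel printed.

```python
import re

ADDRESS = 0x26BDD40F0  # ERROR_ADDRESS from the first EDAC line
LINE = (
    'EDAC MC1: CE page 0x26bdd4, offset 0xf0, grain 0, syndrome 0xe1e2, '
    'row 6, channel 1, label "": amd64_edac'
)

CE_PATTERN = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-f]+), "
    r"offset (?P<offset>0x[0-9a-f]+), grain \d+, "
    r"syndrome (?P<syndrome>0x[0-9a-f]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def parse_ce_line(line):
    """Return the numeric fields of a corrected-error line, or None."""
    match = CE_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "mc": int(fields["mc"]),
        "page": int(fields["page"], 16),
        "offset": int(fields["offset"], 16),
        "syndrome": int(fields["syndrome"], 16),
        "row": int(fields["row"]),
        "channel": int(fields["channel"]),
    }


def split_address(address, page_shift=12):
    """Split a physical address into (page frame, offset within the page)."""
    return address >> page_shift, address & ((1 << page_shift) - 1)


event = parse_ce_line(LINE)
page, offset = split_address(ADDRESS)
print(f"MC{event['mc']} row {event['row']} channel {event['channel']}")
print(f"page {page:#x}, offset {offset:#x}")
```

The row and channel only narrow things down to a chip select on one channel; mapping that to a physical slot still needs the board's silkscreen or manual.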

Btw, you should be able to get the above output if you enable
CONFIG_EDAC_DEBUG or upgrade your kernel.

HTH.

-- 
Regards/Gruss,
Boris.


* Re: Opteron ECC/ChipKill error
  2011-02-08 14:24     ` Borislav Petkov
@ 2011-02-08 17:09       ` martin f krafft
  0 siblings, 0 replies; 5+ messages in thread
From: martin f krafft @ 2011-02-08 17:09 UTC
  To: Borislav Petkov; +Cc: LKML

also sprach Borislav Petkov <bp@amd64.org> [2011.02.08.1524 +0100]:
> Yeah, well, keep your fingers crossed. Just to reiterate, getting ECCs
> is not a problem per se - they may appear even during normal operation
> and in this case get corrected just fine by the memory controller.

That's what I thought. Many thanks for confirming it.

> > Don't want to spam the list, so:
> > http://scratch.madduck.net/__tmp__dmesg.gz
> 
> Ah ok, this is a .32 kernel and it doesn't have the information I was
> looking for. I've changed that in later kernels so that EDAC dumps the
> DRAM chip selects placement on the memory controller.

Excellent to see you are working to improve this. If the problems
increase, then I shall either turn on CONFIG_EDAC_DEBUG or upgrade
to 2.6.38.

Thank you for your help!

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
wind catches lily,
scattering petals to the ground.
segmentation fault.
 
spamtraps: madduck.bogus@madduck.net
