From: Borislav Petkov <>
To: martin f krafft <>
Cc: LKML <>
Subject: Re: Opteron ECC/ChipKill error
Date: Tue, 8 Feb 2011 15:24:42 +0100
Message-ID: <20110208142442.GA30263@aftab>
In-Reply-To: <>

On Tue, Feb 08, 2011 at 08:59:56AM -0500, martin f krafft wrote:
> also sprach Borislav Petkov <> [2011.02.08.1449 +0100]:
> > It is a DRAM ECC error on one of the DIMMs on your node 1. If it is a
> > single occurrence I wouldn't start to worry yet - I'd monitor to see
> > whether the same row above (row 6) starts increasing its error rate.
> > Also, sometimes reseating the DIMMs helps.
> Thanks. I really hope this won't happen again as I really don't want
> to go to the hosting place and open the server. ;)

Yeah, well, keep your fingers crossed. Just to reiterate, getting ECC
errors is not a problem per se - they may appear even during normal
operation, and in this case they get corrected just fine by the memory
controller. Only an increase in the error rate may hint at a failing DRAM
device, so if the error starts repeating you might start thinking about
when the downtime to replace the failing DIMM would be least hurtful/most
suitable for you.

Cough.. IMHO. :)
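
If you'd rather watch the error rate than wait for the next log line, here
is a rough sketch that snapshots the corrected-error counters EDAC exposes
in sysfs. It assumes the EDAC driver is loaded and publishes the
mcN/csrowM/ce_count files under /sys/devices/system/edac/mc; the helper
names read_ce_counts and increased are made up for this example.

```python
from pathlib import Path

# Standard EDAC sysfs location; assumes the EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ce_counts(root=EDAC_ROOT):
    """Return {(mc_name, csrow_name): corrected_error_count}."""
    counts = {}
    for ce_file in sorted(Path(root).glob("mc*/csrow*/ce_count")):
        csrow = ce_file.parent
        counts[(csrow.parent.name, csrow.name)] = int(ce_file.read_text().strip())
    return counts


def increased(before, after):
    """Return the rows whose corrected-error count grew between two snapshots."""
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if after[key] > before.get(key, 0)
    }


if __name__ == "__main__":
    for (mc, csrow), count in read_ce_counts().items():
        print(f"{mc}/{csrow}: {count} corrected errors")
```

Run it from cron, keep the previous snapshot around, and feed both to
increased(); a row that keeps showing up there is the one to worry about.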

> > Can you send your dmesg please?
> Don't want to spam the list, so:

Ah ok, this is a .32 kernel and it doesn't have the information I was
looking for. I've changed that in later kernels so that EDAC dumps the
DRAM chip selects placement on the memory controller. Here's an example:

[   15.256809] EDAC MC: DCT0 chip selects:
[   15.261007] EDAC amd64: MC: 0:  2048MB 1:  2048MB
[   15.266073] EDAC amd64: MC: 2:  2048MB 3:  2048MB
[   15.271140] EDAC amd64: MC: 4:     0MB 5:     0MB
[   15.276207] EDAC amd64: MC: 6:     0MB 7:     0MB
[   15.291246] EDAC MC: DCT1 chip selects:
[   15.295443] EDAC amd64: MC: 0:  2048MB 1:  2048MB
[   15.300511] EDAC amd64: MC: 2:  2048MB 3:  2048MB
[   15.305579] EDAC amd64: MC: 4:     0MB 5:     0MB
[   15.310647] EDAC amd64: MC: 6:     0MB 7:     0MB
[   15.315711] EDAC amd64: using x8 syndromes.

and from this I can see that I have 4 DIMMs on the node, 2 per channel
and each DIMM is 4G (dual-ranked). The last one I know from the DIMMs
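
To spell out the arithmetic, here is a small sketch that parses a dump like
the one above and counts DIMMs per DCT. The log itself does not say how many
ranks a DIMM has, so ranks_per_dimm=2 is an assumption passed in by hand;
parse_chip_selects and summarize are names invented for this example.

```python
import re

SAMPLE = """\
[   15.256809] EDAC MC: DCT0 chip selects:
[   15.261007] EDAC amd64: MC: 0:  2048MB 1:  2048MB
[   15.266073] EDAC amd64: MC: 2:  2048MB 3:  2048MB
[   15.271140] EDAC amd64: MC: 4:     0MB 5:     0MB
[   15.276207] EDAC amd64: MC: 6:     0MB 7:     0MB
[   15.291246] EDAC MC: DCT1 chip selects:
[   15.295443] EDAC amd64: MC: 0:  2048MB 1:  2048MB
[   15.300511] EDAC amd64: MC: 2:  2048MB 3:  2048MB
[   15.305579] EDAC amd64: MC: 4:     0MB 5:     0MB
[   15.310647] EDAC amd64: MC: 6:     0MB 7:     0MB
[   15.315711] EDAC amd64: using x8 syndromes.
"""


def parse_chip_selects(log):
    """Return {dct: {chip_select: size_in_mb}} from the EDAC dump."""
    layout = {}
    dct = None
    for line in log.splitlines():
        header = re.search(r"DCT(\d+) chip selects", line)
        if header:
            dct = int(header.group(1))
            layout[dct] = {}
            continue
        if dct is not None and "EDAC amd64: MC:" in line:
            for cs, mb in re.findall(r"(\d+):\s*(\d+)MB", line):
                layout[dct][int(cs)] = int(mb)
    return layout


def summarize(layout, ranks_per_dimm=2):
    """Return {dct: (dimm_count, dimm_size_mb)}; ranks_per_dimm is an assumption."""
    summary = {}
    for dct, selects in layout.items():
        populated = [mb for mb in selects.values() if mb > 0]
        dimms = len(populated) // ranks_per_dimm
        size = sum(populated) // dimms if dimms else 0
        summary[dct] = (dimms, size)
    return summary


if __name__ == "__main__":
    for dct, (dimms, size) in summarize(parse_chip_selects(SAMPLE)).items():
        print(f"DCT{dct}: {dimms} DIMMs of {size}MB each")
```

Four populated 2048MB chip selects per DCT, at two ranks per DIMM, gives two
4096MB DIMMs per channel, i.e. four on the node.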

In your case, the ECC comes from chip select 6 which should mean the
last DIMM on the node on the second channel. You have to look at the
silkscreen labels on the board to pinpoint which DIMM it is or search
through board layout manuals. (I know, this should be easier, I know...)

Btw, you should be able to get the above output if you enable
CONFIG_EDAC_DEBUG or upgrade your kernel.
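
A quick way to see whether your running kernel already has that option set
is to look it up in the kernel config file. The sketch below assumes your
distro installs the config as /boot/config-<release>, which is common but
not guaranteed; kernel_config_value is a name made up for this example.

```python
import platform
import re
from pathlib import Path


def kernel_config_value(config_text, option):
    """Return the value of option (e.g. 'y' or 'm'), or None if it is not set."""
    match = re.search(rf"^{re.escape(option)}=(.*)$", config_text, re.MULTILINE)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Assumed location; some distros only provide /proc/config.gz instead.
    path = Path(f"/boot/config-{platform.release()}")
    if path.exists():
        value = kernel_config_value(path.read_text(), "CONFIG_EDAC_DEBUG")
        print(f"CONFIG_EDAC_DEBUG={value}" if value else "CONFIG_EDAC_DEBUG is not set")
    else:
        print(f"{path} not found")
```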



Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
