LKML Archive on lore.kernel.org
From: Hans-Peter Jansen <hpj@urpla.net>
To: Tejun Heo <htejun@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org,
	Roger Heflin <rogerheflin@gmail.com>
Subject: Re: 2.6.24.3: regular sata drive resets - worrisome?
Date: Sun, 30 Mar 2008 13:00:09 +0100
Message-ID: <200803301400.10766.hpj@urpla.net>
In-Reply-To: <47EEE4BF.5080609@gmail.com>

On Sunday, 30 March 2008, Tejun Heo wrote:
> Hello,
>
> Hans-Peter Jansen wrote:
> >>>> Should I be worried? smartd doesn't show anything suspicious on
> >>>> those.
> >>
> >> Can you please post the result of "smartctl -a /dev/sdX"?
> >
> > Here's the last smart report from two of the offending drives. As noted
> > before, I did the hardware reorganization, replaced the dog slow 3ware
> > 9500S-8 and the SiI 3124 with a single Areca 1130 and retired the
> > drives for now, but a nephew already showed interest. What do you
> > think, can I cede those drives with a clear conscience? The
> > Hardware_ECC_Recovered values are really worrisome, aren't they?
>
> Different vendors use different scales for the raw values.  The value is
> still pegged at the highest so it could be those raw values are okay or
> that the vendor just doesn't update value field accordingly.  My P120
> says 0 for the raw value and 904635 for hardware ECC recovered so there
> is some difference.  What do other non-failing drives say about those
> values?

The only non-failing drive was sdf, as it was running in standby mode in
this md RAID 5 array:

20080323-011337-sdc.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162956700
20080323-011337-sdc.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
20080323-011337-sdc.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
20080323-011337-sdc.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
20080323-011337-sdc.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
20080323-011338-sdd.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162520674
20080323-011338-sdd.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
20080323-011338-sdd.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
20080323-011338-sdd.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
20080323-011338-sdd.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
20080323-011338-sde.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       148429049
20080323-011338-sde.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
20080323-011338-sde.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
20080323-011338-sde.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
20080323-011338-sde.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
20080323-011339-sdf.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1559
20080323-011339-sdf.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
20080323-011339-sdf.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
20080323-011339-sdf.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
20080323-011339-sdf.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
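
The raw values above are easier to compare side by side with a small
script. What follows is only a minimal sketch, assuming the grep-style
"file:attribute line" format shown above; parse_smart_lines, LINE_RE and
SAMPLE are illustrative names, not part of smartmontools, and SAMPLE holds
just a subset of the lines from this mail.

```python
import re

# A subset of the log lines quoted above (attributes 195 and 197 per drive).
SAMPLE = """\
20080323-011337-sdc.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162956700
20080323-011337-sdc.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
20080323-011338-sdd.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162520674
20080323-011338-sdd.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
20080323-011338-sde.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       148429049
20080323-011338-sde.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
20080323-011339-sdf.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1559
20080323-011339-sdf.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
"""

# "<date>-<time>-<dev>.log:<id> <name> ... <raw>"; only the device, the
# attribute name and the trailing raw value are captured.
LINE_RE = re.compile(
    r"^\d{8}-\d{6}-(?P<dev>\w+)\.log:(?P<id>\d+)\s+(?P<name>\S+)\s+.*\s(?P<raw>\d+)$"
)


def parse_smart_lines(text):
    """Return {device: {attribute_name: raw_value}} for every matching line."""
    table = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not an attribute line
        table.setdefault(m["dev"], {})[m["name"]] = int(m["raw"])
    return table


table = parse_smart_lines(SAMPLE)

# Use the standby drive (sdf) as the baseline for the ECC raw value.
baseline = table["sdf"]["Hardware_ECC_Recovered"]
for dev in sorted(table):
    ecc = table[dev]["Hardware_ECC_Recovered"]
    pending = table[dev]["Current_Pending_Sector"]
    print(f"{dev}: ecc_raw={ecc} ({ecc / baseline:.0f}x sdf), pending={pending}")
```

As noted in the quoted reply, vendors scale raw values differently, so the
ratio printed here is only meaningful between drives of the same model.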

> Hmmm... If the drive is failing FLUSHs, I would expect to see elevated
> reallocation counters and maybe some pending counts.  Aieee.. weird.

But there are neither reallocations nor pending sectors on any of them.

> >>>> In all, it's 4 Samsung drives hanging off a SATA SiI 3124:
> >>
> >> FLUSH_EXT timing out usually indicates that the drive is having
> >> problems writing out what it has in its cache to the media.  There was
> >> one case where FLUSH_EXT timeout was caused by the driver failing to
> >> switch controller back from NCQ mode before issuing FLUSH_EXT but that
> >> was on sata_nv.  There hasn't been any similar problem on sata_sil24.
> >
> > Hmm, I didn't notice any data distortions, and if there were, they
> > live on as copies in their new home..
>
> It should have appeared as read errors.  Maybe the drive successfully
                             ^^^^
                             write (I guess)
> wrote those sectors after 30+ secs timeout.

That would point to some driver issue, wouldn't it? Roger Heflin also
experienced similar behavior with that controller, which wasn't 
reproducible with another. 

I can offer to rebuild that md array in a test environment and give
you access to it, if you're interested.

Anyway, thanks for caring, Tejun,
Pete
