LKML Archive on lore.kernel.org
From: Tejun Heo <htejun@gmail.com>
To: Roger Heflin <rogerheflin@gmail.com>
Cc: Hans-Peter Jansen <hpj@urpla.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org
Subject: Re: 2.6.24.3: regular sata drive resets - worrisome?
Date: Mon, 31 Mar 2008 13:33:11 +0900
Message-ID: <47F06987.2060208@gmail.com>
In-Reply-To: <47EF8A65.1010005@gmail.com>

Roger Heflin wrote:
>> The only non-failing drive was sdf as it was running in standby mode 
>> in this md raid 5 ensemble:
>>
>> 20080323-011337-sdc.log:195 Hardware_ECC_Recovered  0x001a   100   
>> 100   000    Old_age   Always       -       162956700
>> 20080323-011338-sde.log:195 Hardware_ECC_Recovered  0x001a   100   
>> 100   000    Old_age   Always       -       148429049

Hmmm... looks similar.

>>> Hmmm... If the drive is failing FLUSHs, I would expect to see elevated
>>> reallocation counters and maybe some pending counts.  Aieee.. weird.
>>
>> But there are no reallocations nor any pending sectors on any of them.

Yeah, indeed.

>>>>>>> It's been 4 samsung drives at all hanging on a sata sil 3124:
>>>>> FLUSH_EXT timing out usually indicates that the drive is having
>>>>> problem writing out what it has in its cache to the media.  There was
>>>>> one case where FLUSH_EXT timeout was caused by the driver failing to
>>>>> switch controller back from NCQ mode before issuing FLUSH_EXT but that
>>>>> was on sata_nv.  There hasn't been any similar problem on sata_sil24.
>>>> Hmm, I didn't noticed any data distortions, and if there where, they
>>>> live on as copies in their new home..
>>> It should have appeared as read errors.  Maybe the drive successfully
>>                              ^^^^
>>                              write (I guess)

I actually meant read.  For the corrupted data to get transferred to 
other disks, it would have had to be read back as wrong values, but that 
should never happen because the ECC checks would fail.

>>> wrote those sectors after 30+ secs timeout.
>>
>> That would point to some driver issue, wouldn't it? Roger Heflin also
>> experienced similar behavior with that controller, which wasn't 
>> reproducible with another.

Roger's problem is a different one.  I'll talk about it below.

>> I can offer to you rebuilding that md in a test environment, and 
>> giving you access to it, if you're interested.

Can you hook up those failed drives to a different controller?  Say, 
ahci or ata_piix and put them under write load (ext3 w/ barrier=1 and 
copying lots of files into it should work) and see whether the problem 
reproduces?
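
A minimal sketch of such a write load, assuming the test filesystem is 
already mounted (e.g. ext3 with barrier=1) and that fsync() on it ends 
up issuing a cache flush to the drive.  The function name write_load and 
its parameters are made up for illustration; copying a large tree with 
cp would do the job just as well.

```python
import os
import tempfile


def write_load(target_dir, n_files=100, size=1 << 20):
    """Write n_files files of `size` bytes into target_dir, fsync()ing each.

    Returns the total number of bytes written.  Whether each fsync() turns
    into a FLUSH CACHE (EXT) on the wire depends on the filesystem and its
    mount options; that part is an assumption, not something this code checks.
    """
    block = os.urandom(size)  # incompressible payload
    total = 0
    for i in range(n_files):
        path = os.path.join(target_dir, f"load-{i:05d}.bin")
        with open(path, "wb") as f:
            f.write(block)
            f.flush()             # push Python's buffer to the kernel
            os.fsync(f.fileno())  # ask the kernel to push it to the device
        total += size
    return total


if __name__ == "__main__":
    # Harmless dry run in a temporary directory; point target_dir at the
    # filesystem under test for a real run.
    with tempfile.TemporaryDirectory() as d:
        print(write_load(d, n_files=4, size=4096))
```

Watch dmesg for FLUSH timeouts or resets while it runs.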

> Here are the errors I get, though look at it closer, I am don't appear 
> to be getting the reset, just this error from time to time:
> 
> sd 9:0:0:0: [sde] 976773168 512-byte hardware sectors (500108 MB)
> sd 9:0:0:0: [sde] Write Protect is off
> sd 9:0:0:0: [sde] Mode Sense: 00 3a 00 00
> sd 9:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't 
> support DPO or FUA
> ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x280000 action 0x0
> ata8.00: BMDMA2 stat 0x687d8009
> ata8.00: cmd 25/00:80:a7:00:1d/00:01:1d:00:00/e0 tag 0 cdb 0x0 data 
> 196608 in
>          res 51/04:8f:98:01:1d/00:00:1d:00:00/f0 Emask 0x1 (device error)
> ata8.00: configured for UDMA/100

That's a device abort error on a read.  The drive just can't read one of 
the requested sectors, and the controller isn't sata_sil24; it's a bmdma 
one.
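
For reference, a small sketch of how the "res" line above decodes.  The 
first two hex bytes of a libata "res" field are the ATA status and error 
registers; the bit names follow the classic ATA register layout (a few 
of them are obsolete in newer spec revisions).  The helper names 
decode_bits and decode_res are made up for illustration and are not 
part of any kernel tool.

```python
# ATA status register bits (CORR and IDX are obsolete in later ATA revisions).
ATA_STATUS_BITS = {
    0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
    0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR",
}

# ATA error register bits (bit 1 is TK0NF in old revisions, NM in later ones).
ATA_ERROR_BITS = {
    0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
    0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "AMNF",
}


def decode_bits(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if value & bit]


def decode_res(res):
    """Decode the leading 'status/error' pair of a libata 'res' string."""
    status_hex, rest = res.split("/", 1)
    error_hex = rest.split(":", 1)[0]
    return (decode_bits(int(status_hex, 16), ATA_STATUS_BITS),
            decode_bits(int(error_hex, 16), ATA_ERROR_BITS))


if __name__ == "__main__":
    print(decode_res("51/04:8f:98:01:1d/00:00:1d:00:00/f0"))
```

For the log above this gives status DRDY, DSC, ERR and error ABRT, with 
ICRC and UNC both clear.  The command byte 0x25 in the "cmd" line is 
READ DMA EXT.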

> I have 4 identical disks, with all 4 connected to the SIL controller all 
> give some errors, moving 2 of the disks to a promise controller makes 
> the errors go away on the 2 connected to the promise controller.   All 
> drives are part of a software raid5 array.

Ah.. okay, sata_sil.  Roger, moving the drives and the errors going away 
are not very likely to have anything to do with each other.  The only 
possibility is transmission problems, but the drive didn't report a 
transport error (ICRC), so it's more likely that the drive was 
experiencing temporary failures.  It's also possible, though, that the 
drive set ABRT even though the real problem was in the transport.
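
As a decoding aid for the transport question, here is a sketch that 
names the bits set in the SErr value from the log (0x280000).  The bit 
positions follow the SATA SError register layout; the descriptive 
strings and the function name decode_serr are my own, not what libata 
prints.  It only shows what the host-side link register recorded and 
doesn't by itself say which component was at fault.

```python
# SATA SError register: ERR field in bits 0-15, DIAG field in bits 16-26.
SERR_BITS = {
    0: "recovered data integrity error",
    1: "recovered communications error",
    8: "transient data integrity error",
    9: "persistent communication or data integrity error",
    10: "protocol error",
    11: "internal error",
    16: "PhyRdy change",
    17: "Phy internal error",
    18: "Comm wake",
    19: "10b-to-8b decode error",
    20: "disparity error",
    21: "CRC error",
    22: "handshake error",
    23: "link sequence error",
    24: "transport state transition error",
    25: "unrecognized FIS type",
    26: "device exchanged",
}


def decode_serr(serr):
    """Return descriptions of all known bits set in serr, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items())
            if serr & (1 << bit)]


if __name__ == "__main__":
    for line in decode_serr(0x280000):
        print(line)
```

For 0x280000 that is bits 19 and 21: a 10b-to-8b decode error and a CRC 
error seen at the link level.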

If you move the drive back to the sata_sil, do those problems appear 
again?  Anyways, this doesn't really have anything to do with what Hans 
is seeing.

Thanks.

-- 
tejun
