LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* 2.6.24.3: regular sata drive resets - worrisome?
@ 2008-03-20 14:18 Hans-Peter Jansen
  2008-03-21  4:48 ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Hans-Peter Jansen @ 2008-03-20 14:18 UTC (permalink / raw)
  To: linux-kernel

Hi,

since I upgraded to 2.6.24.3 on one of my production systems, I see 
regular device resets like these:

Mar 20 14:33:03 lisa5 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Mar 20 14:33:03 lisa5 kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Mar 20 14:33:03 lisa5 kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 20 14:33:03 lisa5 kernel: ata2.00: status: { DRDY }
Mar 20 14:33:03 lisa5 kernel: ata2: hard resetting link
Mar 20 14:33:05 lisa5 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
Mar 20 14:33:05 lisa5 kernel: ata2.00: configured for UDMA/100
Mar 20 14:33:05 lisa5 kernel: ata2: EH complete
Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] 488397168 512-byte hardware sectors (250059 MB)
Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Write Protect is off
Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 20 14:36:11 lisa5 kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Mar 20 14:36:11 lisa5 kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Mar 20 14:36:11 lisa5 kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 20 14:36:11 lisa5 kernel: ata3.00: status: { DRDY }
Mar 20 14:36:11 lisa5 kernel: ata3: hard resetting link
Mar 20 14:36:13 lisa5 kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
Mar 20 14:36:13 lisa5 kernel: ata3.00: configured for UDMA/100
Mar 20 14:36:13 lisa5 kernel: ata3: EH complete
Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] 488397168 512-byte hardware sectors (250059 MB)
Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Write Protect is off
Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Should I be worried? smartd doesn't show anything suspicious on those.

It's been 4 samsung drives at all hanging on a sata sil 3124:

29: PCI 202.0: 0104 RAID bus controller
  [Created at pci.273]
  Unique ID: LHB6.kPnnhej1LTC
  Parent ID: bSAa.qlvrYsPk2G3
  SysFS ID: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0
  SysFS BusID: 0000:02:02.0
  Hardware Class: storage
  Model: "Silicon Image, Inc. (formerly CMD Technology Inc) SiI 3124 PCI-X Serial ATA Controller"
  Vendor: pci 0x1095 "Silicon Image, Inc. (formerly CMD Technology Inc)"
  Device: pci 0x3124 "SiI 3124 PCI-X Serial ATA Controller"
  SubVendor: pci 0x1095 "Silicon Image, Inc. (formerly CMD Technology Inc)"
  SubDevice: pci 0x7124 
  Revision: 0x01
  Driver: "sata_sil24"
  Memory Range: 0xfc9ffc00-0xfc9ffc7f (rw,non-prefetchable)
  Memory Range: 0xfc9f0000-0xfc9f7fff (rw,non-prefetchable)
  I/O Ports: 0xac00-0xac0f (rw)
  Memory Range: 0xfc900000-0xfc97ffff (ro,prefetchable,disabled)
  IRQ: 17 (20116954 events)
  Driver Info #0:
    Driver Status: sata_sil24 is active
    Driver Activation Cmd: "modprobe sata_sil24"
  Config Status: cfg=no, avail=yes, need=no, active=unknown
  Attached to: #15 (PCI bridge)

27: IDE 100.0: 10600 Disk
  [Created at block.188]
  Unique ID: DQI9.0o8IZwvZth7
  Parent ID: LHB6.kPnnhej1LTC
  SysFS ID: /block/sdb
  SysFS BusID: 1:0:0:0
  SysFS Device Link: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/host1/target1:0:0/1:0:0:0
  Hardware Class: disk
  Model: "SAMSUNG SP2504C"
  Vendor: "SAMSUNG"
  Device: "SP2504C"
  Revision: "VT10"
  Serial ID: "S09QJ1GYA03006"
  Driver: "sata_sil24", "sd"
  Device File: /dev/sdb
  Device 
Files: /dev/sdb, /dev/disk/by-path/pci-0000:02:02.0-scsi-0:0:0:0, /dev/disk/by-id/1ATA_SAMSUNG_SP2504C_S09QJ1GYA03006
  Device Number: block 8:16-8:31
  Geometry (Logical): CHS 30401/255/63
  Size: 488397168 sectors a 512 bytes
  Config Status: cfg=no, avail=yes, need=no, active=unknown
  Attached to: #20 (RAID bus controller)

28: IDE 200.0: 10600 Disk
  [Created at block.188]
  Unique ID: gbYD.jbwYSfXQ1s1
  Parent ID: LHB6.kPnnhej1LTC
  SysFS ID: /block/sdc
  SysFS BusID: 2:0:0:0
  SysFS Device Link: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/host2/target2:0:0/2:0:0:0
  Hardware Class: disk
  Model: "SAMSUNG SP2504C"
  Vendor: "SAMSUNG"
  Device: "SP2504C"
  Revision: "VT10"
  Serial ID: "S09QJ1GYA03003"
  Driver: "sata_sil24", "sd"
  Device File: /dev/sdc
  Device 
Files: /dev/sdc, /dev/disk/by-path/pci-0000:02:02.0-scsi-1:0:0:0, /dev/disk/by-id/1ATA_SAMSUNG_SP2504C_S09QJ1GYA03003
  Device Number: block 8:32-8:47
  Geometry (Logical): CHS 30401/255/63
  Size: 488397168 sectors a 512 bytes
  Config Status: cfg=no, avail=yes, need=no, active=unknown
  Attached to: #20 (RAID bus controller)

29: IDE 300.0: 10600 Disk
  [Created at block.188]
  Unique ID: 8noH.C6Upa9vrTP5
  Parent ID: LHB6.kPnnhej1LTC
  SysFS ID: /block/sdd
  SysFS BusID: 3:0:0:0
  SysFS Device Link: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/host3/target3:0:0/3:0:0:0
  Hardware Class: disk
  Model: "SAMSUNG SP2504C"
  Vendor: "SAMSUNG"
  Device: "SP2504C"
  Revision: "VT10"
  Serial ID: "S09QJ1GYA02991"
  Driver: "sata_sil24", "sd"
  Device File: /dev/sdd
  Device 
Files: /dev/sdd, /dev/disk/by-path/pci-0000:02:02.0-scsi-2:0:0:0, /dev/disk/by-id/1ATA_SAMSUNG_SP2504C_S09QJ1GYA02991
  Device Number: block 8:48-8:63
  Geometry (Logical): CHS 30401/255/63
  Size: 488397168 sectors a 512 bytes
  Config Status: cfg=no, avail=yes, need=no, active=unknown
  Attached to: #20 (RAID bus controller)

30: IDE 400.0: 10600 Disk
  [Created at block.188]
  Unique ID: by2M.RNPEDP68X7E
  Parent ID: LHB6.kPnnhej1LTC
  SysFS ID: /block/sde
  SysFS BusID: 4:0:0:0
  SysFS Device Link: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/host4/target4:0:0/4:0:0:0
  Hardware Class: disk
  Model: "SAMSUNG SP2504C"
  Vendor: "SAMSUNG"
  Device: "SP2504C"
  Revision: "VT10"
  Serial ID: "S09QJ1GYA02990"
  Driver: "sata_sil24", "sd"
  Device File: /dev/sde
  Device 
Files: /dev/sde, /dev/disk/by-path/pci-0000:02:02.0-scsi-3:0:0:0, /dev/disk/by-id/1ATA_SAMSUNG_SP2504C_S09QJ1GYA02990
  Device Number: block 8:64-8:79
  Geometry (Logical): CHS 30401/255/63
  Size: 488397168 sectors a 512 bytes
  Config Status: cfg=no, avail=yes, need=no, active=unknown
  Attached to: #20 (RAID bus controller)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.24.3: regular sata drive resets - worrisome?
  2008-03-20 14:18 2.6.24.3: regular sata drive resets - worrisome? Hans-Peter Jansen
@ 2008-03-21  4:48 ` Andrew Morton
  2008-03-21 18:32   ` Roger Heflin
  2008-03-29 12:58   ` Tejun Heo
  0 siblings, 2 replies; 11+ messages in thread
From: Andrew Morton @ 2008-03-21  4:48 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: linux-kernel, linux-ide


(cc linux-ide)
(regression?)

On Thu, 20 Mar 2008 15:18:31 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote:

> Hi,
> 
> since I upgraded to 2.6.24.3 on one of my production systems, I see 
> regular device resets like these:
> 
> Mar 20 14:33:03 lisa5 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> Mar 20 14:33:03 lisa5 kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> Mar 20 14:33:03 lisa5 kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> Mar 20 14:33:03 lisa5 kernel: ata2.00: status: { DRDY }
> Mar 20 14:33:03 lisa5 kernel: ata2: hard resetting link
> Mar 20 14:33:05 lisa5 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
> Mar 20 14:33:05 lisa5 kernel: ata2.00: configured for UDMA/100
> Mar 20 14:33:05 lisa5 kernel: ata2: EH complete
> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] 488397168 512-byte hardware sectors (250059 MB)
> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Write Protect is off
> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> Mar 20 14:36:11 lisa5 kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> Mar 20 14:36:11 lisa5 kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> Mar 20 14:36:11 lisa5 kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> Mar 20 14:36:11 lisa5 kernel: ata3.00: status: { DRDY }
> Mar 20 14:36:11 lisa5 kernel: ata3: hard resetting link
> Mar 20 14:36:13 lisa5 kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
> Mar 20 14:36:13 lisa5 kernel: ata3.00: configured for UDMA/100
> Mar 20 14:36:13 lisa5 kernel: ata3: EH complete
> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] 488397168 512-byte hardware sectors (250059 MB)
> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Write Protect is off
> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> 
> Should I be worried? smartd doesn't show anything suspicious on those.
> 
> It's been 4 samsung drives at all hanging on a sata sil 3124:
> 
> 29: PCI 202.0: 0104 RAID bus controller
>   [Created at pci.273]
>   Unique ID: LHB6.kPnnhej1LTC
>   Parent ID: bSAa.qlvrYsPk2G3
>   SysFS ID: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0
>   SysFS BusID: 0000:02:02.0
>   Hardware Class: storage
>   Model: "Silicon Image, Inc. (formerly CMD Technology Inc) SiI 3124 PCI-X Serial ATA Controller"
>   Vendor: pci 0x1095 "Silicon Image, Inc. (formerly CMD Technology Inc)"
>   Device: pci 0x3124 "SiI 3124 PCI-X Serial ATA Controller"
>   SubVendor: pci 0x1095 "Silicon Image, Inc. (formerly CMD Technology Inc)"
>   SubDevice: pci 0x7124 
>   Revision: 0x01
>   Driver: "sata_sil24"
>   Memory Range: 0xfc9ffc00-0xfc9ffc7f (rw,non-prefetchable)
>   Memory Range: 0xfc9f0000-0xfc9f7fff (rw,non-prefetchable)
>   I/O Ports: 0xac00-0xac0f (rw)
>   Memory Range: 0xfc900000-0xfc97ffff (ro,prefetchable,disabled)
>   IRQ: 17 (20116954 events)
>   Driver Info #0:
>     Driver Status: sata_sil24 is active
>     Driver Activation Cmd: "modprobe sata_sil24"
>   Config Status: cfg=no, avail=yes, need=no, active=unknown
>   Attached to: #15 (PCI bridge)
> 
> 27: IDE 100.0: 10600 Disk
>   [Created at block.188]
>   Unique ID: DQI9.0o8IZwvZth7
>   Parent ID: LHB6.kPnnhej1LTC
>   SysFS ID: /block/sdb
>   SysFS BusID: 1:0:0:0
>   SysFS Device Link: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/host1/target1:0:0/1:0:0:0
>   Hardware Class: disk
>   Model: "SAMSUNG SP2504C"
>   Vendor: "SAMSUNG"
>   Device: "SP2504C"
>   Revision: "VT10"
>   Serial ID: "S09QJ1GYA03006"
>   Driver: "sata_sil24", "sd"
>   Device File: /dev/sdb
>   Device 
> Files: /dev/sdb, /dev/disk/by-path/pci-0000:02:02.0-scsi-0:0:0:0, /dev/disk/by-id/1ATA_SAMSUNG_SP2504C_S09QJ1GYA03006
>   Device Number: block 8:16-8:31
>   Geometry (Logical): CHS 30401/255/63
>   Size: 488397168 sectors a 512 bytes
>   Config Status: cfg=no, avail=yes, need=no, active=unknown
>   Attached to: #20 (RAID bus controller)
> 
> 28: IDE 200.0: 10600 Disk
>   [Created at block.188]
>   Unique ID: gbYD.jbwYSfXQ1s1
>   Parent ID: LHB6.kPnnhej1LTC
>   SysFS ID: /block/sdc
>   SysFS BusID: 2:0:0:0
>   SysFS Device Link: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/host2/target2:0:0/2:0:0:0
>   Hardware Class: disk
>   Model: "SAMSUNG SP2504C"
>   Vendor: "SAMSUNG"
>   Device: "SP2504C"
>   Revision: "VT10"
>   Serial ID: "S09QJ1GYA03003"
>   Driver: "sata_sil24", "sd"
>   Device File: /dev/sdc
>   Device 
> Files: /dev/sdc, /dev/disk/by-path/pci-0000:02:02.0-scsi-1:0:0:0, /dev/disk/by-id/1ATA_SAMSUNG_SP2504C_S09QJ1GYA03003
>   Device Number: block 8:32-8:47
>   Geometry (Logical): CHS 30401/255/63
>   Size: 488397168 sectors a 512 bytes
>   Config Status: cfg=no, avail=yes, need=no, active=unknown
>   Attached to: #20 (RAID bus controller)
> 
> 29: IDE 300.0: 10600 Disk
>   [Created at block.188]
>   Unique ID: 8noH.C6Upa9vrTP5
>   Parent ID: LHB6.kPnnhej1LTC
>   SysFS ID: /block/sdd
>   SysFS BusID: 3:0:0:0
>   SysFS Device Link: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/host3/target3:0:0/3:0:0:0
>   Hardware Class: disk
>   Model: "SAMSUNG SP2504C"
>   Vendor: "SAMSUNG"
>   Device: "SP2504C"
>   Revision: "VT10"
>   Serial ID: "S09QJ1GYA02991"
>   Driver: "sata_sil24", "sd"
>   Device File: /dev/sdd
>   Device 
> Files: /dev/sdd, /dev/disk/by-path/pci-0000:02:02.0-scsi-2:0:0:0, /dev/disk/by-id/1ATA_SAMSUNG_SP2504C_S09QJ1GYA02991
>   Device Number: block 8:48-8:63
>   Geometry (Logical): CHS 30401/255/63
>   Size: 488397168 sectors a 512 bytes
>   Config Status: cfg=no, avail=yes, need=no, active=unknown
>   Attached to: #20 (RAID bus controller)
> 
> 30: IDE 400.0: 10600 Disk
>   [Created at block.188]
>   Unique ID: by2M.RNPEDP68X7E
>   Parent ID: LHB6.kPnnhej1LTC
>   SysFS ID: /block/sde
>   SysFS BusID: 4:0:0:0
>   SysFS Device Link: /devices/pci0000:00/0000:00:0a.0/0000:02:02.0/host4/target4:0:0/4:0:0:0
>   Hardware Class: disk
>   Model: "SAMSUNG SP2504C"
>   Vendor: "SAMSUNG"
>   Device: "SP2504C"
>   Revision: "VT10"
>   Serial ID: "S09QJ1GYA02990"
>   Driver: "sata_sil24", "sd"
>   Device File: /dev/sde
>   Device 
> Files: /dev/sde, /dev/disk/by-path/pci-0000:02:02.0-scsi-3:0:0:0, /dev/disk/by-id/1ATA_SAMSUNG_SP2504C_S09QJ1GYA02990
>   Device Number: block 8:64-8:79
>   Geometry (Logical): CHS 30401/255/63
>   Size: 488397168 sectors a 512 bytes
>   Config Status: cfg=no, avail=yes, need=no, active=unknown
>   Attached to: #20 (RAID bus controller)
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.24.3: regular sata drive resets - worrisome?
  2008-03-21  4:48 ` Andrew Morton
@ 2008-03-21 18:32   ` Roger Heflin
  2008-03-21 23:06     ` Hans-Peter Jansen
  2008-03-29 12:58   ` Tejun Heo
  1 sibling, 1 reply; 11+ messages in thread
From: Roger Heflin @ 2008-03-21 18:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hans-Peter Jansen, linux-kernel, linux-ide

Andrew Morton wrote:
> (cc linux-ide)
> (regression?)
> 
> On Thu, 20 Mar 2008 15:18:31 +0100 Hans-Peter Jansen <hpj@urpla.net> wrote:
> 
>> Hi,
>>
>> since I upgraded to 2.6.24.3 on one of my production systems, I see 
>> regular device resets like these:

Hans,

What kernel were you using before you updated to that kernel?

>>
>> Mar 20 14:33:03 lisa5 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> Mar 20 14:33:03 lisa5 kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>> Mar 20 14:33:03 lisa5 kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
>> Mar 20 14:33:03 lisa5 kernel: ata2.00: status: { DRDY }
>> Mar 20 14:33:03 lisa5 kernel: ata2: hard resetting link
>> Mar 20 14:33:05 lisa5 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
>> Mar 20 14:33:05 lisa5 kernel: ata2.00: configured for UDMA/100
>> Mar 20 14:33:05 lisa5 kernel: ata2: EH complete
>> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] 488397168 512-byte hardware sectors (250059 MB)
>> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Write Protect is off
>> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
>> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>> Mar 20 14:36:11 lisa5 kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> Mar 20 14:36:11 lisa5 kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>> Mar 20 14:36:11 lisa5 kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
>> Mar 20 14:36:11 lisa5 kernel: ata3.00: status: { DRDY }
>> Mar 20 14:36:11 lisa5 kernel: ata3: hard resetting link
>> Mar 20 14:36:13 lisa5 kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
>> Mar 20 14:36:13 lisa5 kernel: ata3.00: configured for UDMA/100
>> Mar 20 14:36:13 lisa5 kernel: ata3: EH complete
>> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] 488397168 512-byte hardware sectors (250059 MB)
>> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Write Protect is off
>> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
>> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>>
>> Should I be worried? smartd doesn't show anything suspicious on those.
>>

Andrew,

I don't think it is a recent regression, I have seen it happening for a while on 
my machine, I don't think it is causing any crashes but I am getting unexplained 
events about 1x per month that appear to deadlock a number of things (machine is 
up, but top won't run and vmstat actually gets a FP exception on the second 
sample, and a number of other things have issues until reboot).

I have 4 identical disks, 2 on a sata_sil and 2 on another controller, the ones 
on the sil controller have this behavior, I have seen it in 2.6.23.1, 
FC7-2.6.23.15-80 and FC7-2.6.22.9-91.   My sil is a 4-port 3114 PCI card, and my 
disks are 500GB Western Digital disks.  I have a fairly long run with 20-30 
events on the 2 disks on the sata_sil and no events on the identical non-sil 
disks that had previously been getting resets (when on the sil controller), and 
since they are under software raid5 all 4 disks should have very very similar IO 
loads.

                           Roger


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.24.3: regular sata drive resets - worrisome?
  2008-03-21 18:32   ` Roger Heflin
@ 2008-03-21 23:06     ` Hans-Peter Jansen
  0 siblings, 0 replies; 11+ messages in thread
From: Hans-Peter Jansen @ 2008-03-21 23:06 UTC (permalink / raw)
  To: Roger Heflin; +Cc: Andrew Morton, linux-kernel, linux-ide

Am Freitag, 21. März 2008 schrieb Roger Heflin:
> Andrew Morton wrote:
> > (cc linux-ide)
> > (regression?)
> >
> > On Thu, 20 Mar 2008 15:18:31 +0100 Hans-Peter Jansen <hpj@urpla.net> 
wrote:
> >> Hi,
> >>
> >> since I upgraded to 2.6.24.3 on one of my production systems, I see
> >> regular device resets like these:
>
> Hans,
>
> What kernel were you using before you updated to that kernel?

You don't what to know that ;-) (cough 2.6.11 cough)

> >> Mar 20 14:33:03 lisa5 kernel: ata2.00: exception Emask 0x0 SAct 0x0
> >> SErr 0x0 action 0x2 frozen Mar 20 14:33:03 lisa5 kernel: ata2.00: cmd
> >> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 Mar 20 14:33:03 lisa5
> >> kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4
> >> (timeout) Mar 20 14:33:03 lisa5 kernel: ata2.00: status: { DRDY }
> >> Mar 20 14:33:03 lisa5 kernel: ata2: hard resetting link
> >> Mar 20 14:33:05 lisa5 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123
> >> SControl 0) Mar 20 14:33:05 lisa5 kernel: ata2.00: configured for
> >> UDMA/100 Mar 20 14:33:05 lisa5 kernel: ata2: EH complete
> >> Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0: [sdc] 488397168 512-byte
> >> hardware sectors (250059 MB) Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0:
> >> [sdc] Write Protect is off Mar 20 14:33:05 lisa5 kernel: sd 2:0:0:0:
> >> [sdc] Mode Sense: 00 3a 00 00 Mar 20 14:33:05 lisa5 kernel: sd
> >> 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't
> >> support DPO or FUA Mar 20 14:36:11 lisa5 kernel: ata3.00: exception
> >> Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen Mar 20 14:36:11 lisa5
> >> kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 Mar 20
> >> 14:36:11 lisa5 kernel:          res
> >> 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) Mar 20
> >> 14:36:11 lisa5 kernel: ata3.00: status: { DRDY }
> >> Mar 20 14:36:11 lisa5 kernel: ata3: hard resetting link
> >> Mar 20 14:36:13 lisa5 kernel: ata3: SATA link up 3.0 Gbps (SStatus 123
> >> SControl 0) Mar 20 14:36:13 lisa5 kernel: ata3.00: configured for
> >> UDMA/100 Mar 20 14:36:13 lisa5 kernel: ata3: EH complete
> >> Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0: [sdd] 488397168 512-byte
> >> hardware sectors (250059 MB) Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0:
> >> [sdd] Write Protect is off Mar 20 14:36:13 lisa5 kernel: sd 3:0:0:0:
> >> [sdd] Mode Sense: 00 3a 00 00 Mar 20 14:36:13 lisa5 kernel: sd
> >> 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't
> >> support DPO or FUA
> >>
> >> Should I be worried? smartd doesn't show anything suspicious on those.
>
> Andrew,
>
> I don't think it is a recent regression, I have seen it happening for a
> while on my machine, I don't think it is causing any crashes but I am
> getting unexplained events about 1x per month that appear to deadlock a
> number of things (machine is up, but top won't run and vmstat actually
> gets a FP exception on the second sample, and a number of other things
> have issues until reboot).

Well, that doesn't sound reassuring, does it?

> I have 4 identical disks, 2 on a sata_sil and 2 on another controller,
> the ones on the sil controller have this behavior, I have seen it in
> 2.6.23.1, FC7-2.6.23.15-80 and FC7-2.6.22.9-91.   My sil is a 4-port 3114
> PCI card, and my disks are 500GB Western Digital disks.  I have a fairly
> long run with 20-30 events on the 2 disks on the sata_sil and no events
> on the identical non-sil disks that had previously been getting resets
> (when on the sil controller), and since they are under software raid5 all
> 4 disks should have very very similar IO loads.

Okay, those resets may happen without further consequences, but they're 
disturbing nevertheless. 

BTW, I'm preparing a hardware reorg, which will eliminate this controller 
during this weekend.. Well, to be correct, the other is the one, that 
really nags me for some time now (3ware 9xxx-8). Swapping both with one 
Areca 1130, which performes _much_ better.

Pete

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.24.3: regular sata drive resets - worrisome?
  2008-03-21  4:48 ` Andrew Morton
  2008-03-21 18:32   ` Roger Heflin
@ 2008-03-29 12:58   ` Tejun Heo
  2008-03-30  0:14     ` Hans-Peter Jansen
  1 sibling, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2008-03-29 12:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hans-Peter Jansen, linux-kernel, linux-ide

Hello, Hans.

Andrew Morton wrote:
>> since I upgraded to 2.6.24.3 on one of my production systems, I see 
>> regular device resets like these:
>>
>> Mar 20 14:33:03 lisa5 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> Mar 20 14:33:03 lisa5 kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>> Mar 20 14:33:03 lisa5 kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)

Ouch, timeout on FLUSH_EXT.  Are all errors on cmd ea?

>> Should I be worried? smartd doesn't show anything suspicious on those.

Can you please post the result of "smartctl -a /dev/sdX"?

>> It's been 4 samsung drives at all hanging on a sata sil 3124:

FLUSH_EXT timing out usually indicates that the drive is having problem 
writing out what it has in its cache to the media.  There was one case 
where FLUSH_EXT timeout was caused by the driver failing to switch 
controller back from NCQ mode before issuing FLUSH_EXT but that was on 
sata_nv.  There hasn't been any similar problem on sata_sil24.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.24.3: regular sata drive resets - worrisome?
  2008-03-29 12:58   ` Tejun Heo
@ 2008-03-30  0:14     ` Hans-Peter Jansen
  2008-03-30  0:54       ` Tejun Heo
  0 siblings, 1 reply; 11+ messages in thread
From: Hans-Peter Jansen @ 2008-03-30  0:14 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Andrew Morton, linux-kernel, linux-ide

Hi Tejun,

thanks for picking this issue up.

Am Samstag, 29. März 2008 schrieb Tejun Heo:
> Hello, Hans.
>
> Andrew Morton wrote:
> >> since I upgraded to 2.6.24.3 on one of my production systems, I see
> >> regular device resets like these:
> >>
> >> Mar 20 14:33:03 lisa5 kernel: ata2.00: exception Emask 0x0 SAct 0x0
> >> SErr 0x0 action 0x2 frozen Mar 20 14:33:03 lisa5 kernel: ata2.00: cmd
> >> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 Mar 20 14:33:03 lisa5
> >> kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4
> >> (timeout)
>
> Ouch, timeout on FLUSH_EXT.  Are all errors on cmd ea?
>
> >> Should I be worried? smartd doesn't show anything suspicious on those.
>
> Can you please post the result of "smartctl -a /dev/sdX"?

Here's the last smart report from two of the offending drives. As noted 
before, I did the hardware reorganization, replaced the dog slow 3ware 
9500S-8 and the SiI 3124 with a single Areca 1130 and retired the drives 
for now, but a nephew already showed interest. What do you think, can I 
cede those drives with a clear conscience? The Hardware_ECC_Recovered
values are really worrisome, aren't they?

sdc:
smartctl version 5.38 [i686-suse-linux-gnu] Copyright (C) 2002-7 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint P120 series
Device Model:     SAMSUNG SP2504C
Serial Number:    S09QJ1GYA03006
Firmware Version: VT100-33
User Capacity:    250.059.350.016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
Local Time is:    Sun Mar 23 01:13:37 2008 CET

==> WARNING: May need -F samsung3 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (4866) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  81) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       82
  3 Spin_Up_Time            0x0007   100   100   025    Pre-fail  Always       -       5952
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       23
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       17647
 10 Spin_Retry_Count        0x0033   253   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   253   002   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       19
190 Airflow_Temperature_Cel 0x0022   124   124   000    Old_age   Always       -       38
194 Temperature_Celsius     0x0022   124   124   000    Old_age   Always       -       38
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162956700
196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   253   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
202 TA_Increase_Count       0x0032   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     17624         -
# 2  Short offline       Completed without error       00%     17601         -
# 3  Short offline       Completed without error       00%     17577         -
# 4  Short offline       Completed without error       00%     17553         -
# 5  Short offline       Completed without error       00%     17528         -
# 6  Short offline       Completed without error       00%     17504         -
# 7  Extended offline    Completed without error       00%     17489         -
# 8  Short offline       Completed without error       00%     17480         -
# 9  Short offline       Completed without error       00%     17456         -
#10  Short offline       Completed without error       00%     17432         -
#11  Short offline       Completed without error       00%     17408         -
#12  Short offline       Completed without error       00%     17384         -
#13  Short offline       Completed without error       00%     17360         -
#14  Short offline       Completed without error       00%     17336         -
#15  Extended offline    Completed without error       00%     17320         -
#16  Short offline       Completed without error       00%     17311         -
#17  Short offline       Completed without error       00%     17287         -
#18  Short offline       Completed without error       00%     17263         -
#19  Short offline       Completed without error       00%     17239         -

SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

sdd:

smartctl version 5.38 [i686-suse-linux-gnu] Copyright (C) 2002-7 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint P120 series
Device Model:     SAMSUNG SP2504C
Serial Number:    S09QJ1GYA03003
Firmware Version: VT100-33
User Capacity:    250.059.350.016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
Local Time is:    Sun Mar 23 01:13:38 2008 CET

==> WARNING: May need -F samsung3 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (4836) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  80) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       79
  3 Spin_Up_Time            0x0007   100   100   025    Pre-fail  Always       -       5952
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       23
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       17648
 10 Spin_Retry_Count        0x0033   253   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   253   002   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       19
190 Airflow_Temperature_Cel 0x0022   118   118   000    Old_age   Always       -       40
194 Temperature_Celsius     0x0022   118   118   000    Old_age   Always       -       40
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162520674
196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   253   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
202 TA_Increase_Count       0x0032   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     17626         -
# 2  Short offline       Completed without error       00%     17602         -
# 3  Short offline       Completed without error       00%     17578         -
# 4  Short offline       Completed without error       00%     17554         -
# 5  Short offline       Completed without error       00%     17530         -
# 6  Short offline       Completed without error       00%     17506         -
# 7  Extended offline    Completed without error       00%     17490         -
# 8  Short offline       Completed without error       00%     17482         -
# 9  Short offline       Completed without error       00%     17457         -
#10  Short offline       Completed without error       00%     17433         -
#11  Short offline       Completed without error       00%     17409         -
#12  Short offline       Completed without error       00%     17385         -
#13  Short offline       Completed without error       00%     17361         -
#14  Short offline       Completed without error       00%     17337         -
#15  Extended offline    Completed without error       00%     17321         -
#16  Short offline       Completed without error       00%     17313         -
#17  Short offline       Completed without error       00%     17289         -
#18  Short offline       Completed without error       00%     17264         -
#19  Short offline       Completed without error       00%     17240         -

SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

> >> It's been 4 samsung drives at all hanging on a sata sil 3124:
>
> FLUSH_EXT timing out usually indicates that the drive is having problem
> writing out what it has in its cache to the media.  There was one case
> where FLUSH_EXT timeout was caused by the driver failing to switch
> controller back from NCQ mode before issuing FLUSH_EXT but that was on
> sata_nv.  There hasn't been any similar problem on sata_sil24.

Hmm, I didn't noticed any data distortions, and if there where, they live
on as copies in their new home.. 

Thanks,
Pete

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.24.3: regular sata drive resets - worrisome?
  2008-03-30  0:14     ` Hans-Peter Jansen
@ 2008-03-30  0:54       ` Tejun Heo
  2008-03-30 12:00         ` Hans-Peter Jansen
  0 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2008-03-30  0:54 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: Andrew Morton, linux-kernel, linux-ide

Hello,

Hans-Peter Jansen wrote:
>>>> Should I be worried? smartd doesn't show anything suspicious on those.
>> Can you please post the result of "smartctl -a /dev/sdX"?
> 
> Here's the last smart report from two of the offending drives. As noted 
> before, I did the hardware reorganization, replaced the dog slow 3ware 
> 9500S-8 and the SiI 3124 with a single Areca 1130 and retired the drives 
> for now, but a nephew already showed interest. What do you think, can I 
> cede those drives with a clear conscience? The Hardware_ECC_Recovered
> values are really worrisome, aren't they?

Different vendors use different scales for the raw values.  The value is 
still pegged at the highest so it could be those raw values are okay or 
that the vendor just doesn't update value field accordingly.  My P120 
says 0 for the raw value and 904635 for hardware ECC recovered so there 
is some difference.  What do other non-failing drives say about those 
values?

> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       82
>   3 Spin_Up_Time            0x0007   100   100   025    Pre-fail  Always       -       5952
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       23
>   5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
>   8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       0
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       17647
>  10 Spin_Retry_Count        0x0033   253   253   051    Pre-fail  Always       -       0
>  11 Calibration_Retry_Count 0x0012   253   002   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       19
> 190 Airflow_Temperature_Cel 0x0022   124   124   000    Old_age   Always       -       38
> 194 Temperature_Celsius     0x0022   124   124   000    Old_age   Always       -       38
> 195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162956700
> 196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x000a   253   100   000    Old_age   Always       -       0
> 201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
> 202 TA_Increase_Count       0x0032   253   253   000    Old_age   Always       -       0

Hmmm... If the drive is failing FLUSHs, I would expect to see elevated 
reallocation counters and maybe some pending counts.  Aieee.. weird.

>>>> It's been 4 samsung drives at all hanging on a sata sil 3124:
>> FLUSH_EXT timing out usually indicates that the drive is having problem
>> writing out what it has in its cache to the media.  There was one case
>> where FLUSH_EXT timeout was caused by the driver failing to switch
>> controller back from NCQ mode before issuing FLUSH_EXT but that was on
>> sata_nv.  There hasn't been any similar problem on sata_sil24.
> 
> Hmm, I didn't noticed any data distortions, and if there where, they live
> on as copies in their new home.. 

It should have appeared as read errors.  Maybe the drive successfully 
wrote those sectors after 30+ secs timeout.

-- 
tejun

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.24.3: regular sata drive resets - worrisome?
  2008-03-30  0:54       ` Tejun Heo
@ 2008-03-30 12:00         ` Hans-Peter Jansen
  2008-03-30 12:41           ` Roger Heflin
  0 siblings, 1 reply; 11+ messages in thread
From: Hans-Peter Jansen @ 2008-03-30 12:00 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Andrew Morton, linux-kernel, linux-ide, Roger Heflin

Am Sonntag, 30. März 2008 schrieb Tejun Heo:
> Hello,
>
> Hans-Peter Jansen wrote:
> >>>> Should I be worried? smartd doesn't show anything suspicious on
> >>>> those.
> >>
> >> Can you please post the result of "smartctl -a /dev/sdX"?
> >
> > Here's the last smart report from two of the offending drives. As noted
> > before, I did the hardware reorganization, replaced the dog slow 3ware
> > 9500S-8 and the SiI 3124 with a single Areca 1130 and retired the
> > drives for now, but a nephew already showed interest. What do you
> > think, can I cede those drives with a clear conscience? The
> > Hardware_ECC_Recovered values are really worrisome, aren't they?
>
> Different vendors use different scales for the raw values.  The value is
> still pegged at the highest so it could be those raw values are okay or
> that the vendor just doesn't update value field accordingly.  My P120
> says 0 for the raw value and 904635 for hardware ECC recovered so there
> is some difference.  What do other non-failing drives say about those
> values?

The only non-failing drive was sdf as it was running in standby mode in this 
md raid 5 ensemble:

20080323-011337-sdc.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162956700
20080323-011337-sdc.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
20080323-011337-sdc.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
20080323-011337-sdc.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
20080323-011337-sdc.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
20080323-011338-sdd.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162520674
20080323-011338-sdd.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
20080323-011338-sdd.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
20080323-011338-sdd.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
20080323-011338-sdd.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
20080323-011338-sde.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       148429049
20080323-011338-sde.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
20080323-011338-sde.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
20080323-011338-sde.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
20080323-011338-sde.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
20080323-011339-sdf.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1559
20080323-011339-sdf.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
20080323-011339-sdf.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
20080323-011339-sdf.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
20080323-011339-sdf.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

> Hmmm... If the drive is failing FLUSHs, I would expect to see elevated
> reallocation counters and maybe some pending counts.  Aieee.. weird.

But there are no reallocations nor any pending sectors on any of them.

> >>>> It's been 4 samsung drives at all hanging on a sata sil 3124:
> >>
> >> FLUSH_EXT timing out usually indicates that the drive is having
> >> problem writing out what it has in its cache to the media.  There was
> >> one case where FLUSH_EXT timeout was caused by the driver failing to
> >> switch controller back from NCQ mode before issuing FLUSH_EXT but that
> >> was on sata_nv.  There hasn't been any similar problem on sata_sil24.
> >
> > Hmm, I didn't noticed any data distortions, and if there where, they
> > live on as copies in their new home..
>
> It should have appeared as read errors.  Maybe the drive successfully
                             ^^^^
                             write (I guess)
> wrote those sectors after 30+ secs timeout.

That would point to some driver issue, wouldn't it? Roger Heflin also
experienced similar behavior with that controller, which wasn't 
reproducible with another. 

I can offer to you rebuilding that md in a test environment, and giving 
you access to it, if you're interested.

Anyway, thanks for caring Tejun,
Pete

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.24.3: regular sata drive resets - worrisome?
  2008-03-30 12:00         ` Hans-Peter Jansen
@ 2008-03-30 12:41           ` Roger Heflin
  2008-03-31  4:33             ` Tejun Heo
  0 siblings, 1 reply; 11+ messages in thread
From: Roger Heflin @ 2008-03-30 12:41 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: Tejun Heo, Andrew Morton, linux-kernel, linux-ide

Hans-Peter Jansen wrote:
> Am Sonntag, 30. März 2008 schrieb Tejun Heo:
>> Hello,
>>
>> Hans-Peter Jansen wrote:
>>>>>> Should I be worried? smartd doesn't show anything suspicious on
>>>>>> those.
>>>> Can you please post the result of "smartctl -a /dev/sdX"?
>>> Here's the last smart report from two of the offending drives. As noted
>>> before, I did the hardware reorganization, replaced the dog slow 3ware
>>> 9500S-8 and the SiI 3124 with a single Areca 1130 and retired the
>>> drives for now, but a nephew already showed interest. What do you
>>> think, can I cede those drives with a clear conscience? The
>>> Hardware_ECC_Recovered values are really worrisome, aren't they?
>> Different vendors use different scales for the raw values.  The value is
>> still pegged at the highest so it could be those raw values are okay or
>> that the vendor just doesn't update value field accordingly.  My P120
>> says 0 for the raw value and 904635 for hardware ECC recovered so there
>> is some difference.  What do other non-failing drives say about those
>> values?
> 
> The only non-failing drive was sdf as it was running in standby mode in this 
> md raid 5 ensemble:
> 
> 20080323-011337-sdc.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162956700
> 20080323-011337-sdc.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> 20080323-011337-sdc.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> 20080323-011337-sdc.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> 20080323-011337-sdc.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 20080323-011338-sdd.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162520674
> 20080323-011338-sdd.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> 20080323-011338-sdd.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> 20080323-011338-sdd.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> 20080323-011338-sdd.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 20080323-011338-sde.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       148429049
> 20080323-011338-sde.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> 20080323-011338-sde.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> 20080323-011338-sde.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> 20080323-011338-sde.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 20080323-011339-sdf.log:195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1559
> 20080323-011339-sdf.log:196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> 20080323-011339-sdf.log:197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> 20080323-011339-sdf.log:198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> 20080323-011339-sdf.log:199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 
>> Hmmm... If the drive is failing FLUSHs, I would expect to see elevated
>> reallocation counters and maybe some pending counts.  Aieee.. weird.
> 
> But there are no reallocations nor any pending sectors on any of them.
> 
>>>>>> It's been 4 samsung drives at all hanging on a sata sil 3124:
>>>> FLUSH_EXT timing out usually indicates that the drive is having
>>>> problem writing out what it has in its cache to the media.  There was
>>>> one case where FLUSH_EXT timeout was caused by the driver failing to
>>>> switch controller back from NCQ mode before issuing FLUSH_EXT but that
>>>> was on sata_nv.  There hasn't been any similar problem on sata_sil24.
>>> Hmm, I didn't noticed any data distortions, and if there where, they
>>> live on as copies in their new home..
>> It should have appeared as read errors.  Maybe the drive successfully
>                              ^^^^
>                              write (I guess)
>> wrote those sectors after 30+ secs timeout.
> 
> That would point to some driver issue, wouldn't it? Roger Heflin also
> experienced similar behavior with that controller, which wasn't 
> reproducible with another. 
> 
> I can offer to you rebuilding that md in a test environment, and giving 
> you access to it, if you're interested.
> 
> Anyway, thanks for caring Tejun,
> Pete
> 

Here are the errors I get, though look at it closer, I am don't appear to be 
getting the reset, just this error from time to time:

sd 9:0:0:0: [sde] 976773168 512-byte hardware sectors (500108 MB)
sd 9:0:0:0: [sde] Write Protect is off
sd 9:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 9:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO 
or FUA
ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x280000 action 0x0
ata8.00: BMDMA2 stat 0x687d8009
ata8.00: cmd 25/00:80:a7:00:1d/00:01:1d:00:00/e0 tag 0 cdb 0x0 data 196608 in
          res 51/04:8f:98:01:1d/00:00:1d:00:00/f0 Emask 0x1 (device error)
ata8.00: configured for UDMA/100
ata8: EH complete
sd 7:0:0:0: [sdd] 976773168 512-byte hardware sectors (500108 MB)
sd 7:0:0:0: [sdd] Write Protect is off
sd 7:0:0:0: [sdd] Mode Sense: 00 3a 00 00
sd 7:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO 
or FUA

I have 4 identical disks, with all 4 connected to the SIL controller all give 
some errors, moving 2 of the disks to a promise controller makes the errors go 
away on the 2 connected to the promise controller.   All drives are part of a 
software raid5 array.

Startup looks like this:
sata_sil 0000:00:09.0: version 2.3
ACPI: PCI Interrupt 0000:00:09.0[A] -> GSI 16 (level, low) -> IRQ 20
sata_sil 0000:00:09.0: Applying R_ERR on DMA activate FIS errata fix
scsi7 : sata_sil
scsi8 : sata_sil
scsi9 : sata_sil
scsi10 : sata_sil
ata8: SATA max UDMA/100 cmd 0xf8942080 ctl 0xf894208a bmdma 0xf8942000 irq 20
ata9: SATA max UDMA/100 cmd 0xf89420c0 ctl 0xf89420ca bmdma 0xf8942008 irq 20
ata10: SATA max UDMA/100 cmd 0xf8942280 ctl 0xf894228a bmdma 0xf8942200 irq 20
ata11: SATA max UDMA/100 cmd 0xf89422c0 ctl 0xf89422ca bmdma 0xf8942208 irq 20

Right now I am running 2.6.23.15-80.fc7, but have also got the errors under 2.6.23.1

                                     Roger

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.24.3: regular sata drive resets - worrisome?
  2008-03-30 12:41           ` Roger Heflin
@ 2008-03-31  4:33             ` Tejun Heo
  2008-04-01 19:27               ` Roger Heflin
  0 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2008-03-31  4:33 UTC (permalink / raw)
  To: Roger Heflin; +Cc: Hans-Peter Jansen, Andrew Morton, linux-kernel, linux-ide

Roger Heflin wrote:
>> The only non-failing drive was sdf as it was running in standby mode 
>> in this md raid 5 ensemble:
>>
>> 20080323-011337-sdc.log:195 Hardware_ECC_Recovered  0x001a   100   
>> 100   000    Old_age   Always       -       162956700
>> 20080323-011338-sde.log:195 Hardware_ECC_Recovered  0x001a   100   
>> 100   000    Old_age   Always       -       148429049

Hmmm... looks similar.

>>> Hmmm... If the drive is failing FLUSHs, I would expect to see elevated
>>> reallocation counters and maybe some pending counts.  Aieee.. weird.
>>
>> But there are no reallocations nor any pending sectors on any of them.

Yeah, indded.

>>>>>>> It's been 4 samsung drives at all hanging on a sata sil 3124:
>>>>> FLUSH_EXT timing out usually indicates that the drive is having
>>>>> problem writing out what it has in its cache to the media.  There was
>>>>> one case where FLUSH_EXT timeout was caused by the driver failing to
>>>>> switch controller back from NCQ mode before issuing FLUSH_EXT but that
>>>>> was on sata_nv.  There hasn't been any similar problem on sata_sil24.
>>>> Hmm, I didn't noticed any data distortions, and if there where, they
>>>> live on as copies in their new home..
>>> It should have appeared as read errors.  Maybe the drive successfully
>>                              ^^^^
>>                              write (I guess)

I actually meant read.  For the corrupted data to get transferred to 
other disks, it should have been read as wrong values but such things 
should never happen as ECC checks would fail.

>>> wrote those sectors after 30+ secs timeout.
>>
>> That would point to some driver issue, wouldn't it? Roger Heflin also
>> experienced similar behavior with that controller, which wasn't 
>> reproducible with another.

Roger's problem is different one.  I'll talk about it below.

>> I can offer to you rebuilding that md in a test environment, and 
>> giving you access to it, if you're interested.

Can you hook up those failed drives to a different controller?  Say, 
ahci or ata_piix and put them under write load (ext3 w/ barrier=1 and 
copying lots of files into it should work) and see whether the problem 
reproduces?

> Here are the errors I get, though look at it closer, I am don't appear 
> to be getting the reset, just this error from time to time:
> 
> sd 9:0:0:0: [sde] 976773168 512-byte hardware sectors (500108 MB)
> sd 9:0:0:0: [sde] Write Protect is off
> sd 9:0:0:0: [sde] Mode Sense: 00 3a 00 00
> sd 9:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't 
> support DPO or FUA
> ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x280000 action 0x0
> ata8.00: BMDMA2 stat 0x687d8009
> ata8.00: cmd 25/00:80:a7:00:1d/00:01:1d:00:00/e0 tag 0 cdb 0x0 data 
> 196608 in
>          res 51/04:8f:98:01:1d/00:00:1d:00:00/f0 Emask 0x1 (device error)
> ata8.00: configured for UDMA/100

That's device abort error on read.  The drive just can't read sector one 
of the requested sectors and it's not sata_sil24.  It's a bmdma one.

> I have 4 identical disks, with all 4 connected to the SIL controller all 
> give some errors, moving 2 of the disks to a promise controller makes 
> the errors go away on the 2 connected to the promise controller.   All 
> drives are part of a software raid5 array.

Ah.. okay, sata_sil.  Roger, the moving and errors are not very likely 
to have anything to do with each other.  The only possibility is 
transmission problems but the drive didn't report transport error (ICRC) 
and it's more likely that the drive was experiencing temporary failures. 
  It's also possible that the drive set ABRT although there was some 
problem with the transport tho.

If you move the drive back to the sata_sil, do those problems appear 
again?  Anyways, this doesn't really have anything to do with what Hans 
is seeing.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.6.24.3: regular sata drive resets - worrisome?
  2008-03-31  4:33             ` Tejun Heo
@ 2008-04-01 19:27               ` Roger Heflin
  0 siblings, 0 replies; 11+ messages in thread
From: Roger Heflin @ 2008-04-01 19:27 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Hans-Peter Jansen, Andrew Morton, linux-kernel, linux-ide

Tejun Heo wrote:
>>> I can offer to you rebuilding that md in a test environment, and 
>>> giving you access to it, if you're interested.
> 
> Can you hook up those failed drives to a different controller?  Say, 
> ahci or ata_piix and put them under write load (ext3 w/ barrier=1 and 
> copying lots of files into it should work) and see whether the problem 
> reproduces?

I can move switch the disks to a sata_promise controller, I also have a sata_via 
controller but I cannot get those disks to work at all on it (it initially sees 
the disk, but does not finish init).

I don't on the machine that those disks are on have any other sata controllers.

> 
>> Here are the errors I get, though look at it closer, I am don't appear 
>> to be getting the reset, just this error from time to time:
>>
>> sd 9:0:0:0: [sde] 976773168 512-byte hardware sectors (500108 MB)
>> sd 9:0:0:0: [sde] Write Protect is off
>> sd 9:0:0:0: [sde] Mode Sense: 00 3a 00 00
>> sd 9:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't 
>> support DPO or FUA
>> ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x280000 action 0x0
>> ata8.00: BMDMA2 stat 0x687d8009
>> ata8.00: cmd 25/00:80:a7:00:1d/00:01:1d:00:00/e0 tag 0 cdb 0x0 data 
>> 196608 in
>>          res 51/04:8f:98:01:1d/00:00:1d:00:00/f0 Emask 0x1 (device error)
>> ata8.00: configured for UDMA/100
> 
> That's device abort error on read.  The drive just can't read sector one 
> of the requested sectors and it's not sata_sil24.  It's a bmdma one.
> 
>> I have 4 identical disks, with all 4 connected to the SIL controller 
>> all give some errors, moving 2 of the disks to a promise controller 
>> makes the errors go away on the 2 connected to the promise 
>> controller.   All drives are part of a software raid5 array.
> 
> Ah.. okay, sata_sil.  Roger, the moving and errors are not very likely 
> to have anything to do with each other.  The only possibility is 
> transmission problems but the drive didn't report transport error (ICRC) 
> and it's more likely that the drive was experiencing temporary failures. 
>  It's also possible that the drive set ABRT although there was some 
> problem with the transport tho.
> 
> If you move the drive back to the sata_sil, do those problems appear 
> again?  Anyways, this doesn't really have anything to do with what Hans 
> is seeing.

I can swap the disk around next time I reboot the machine, the 2 on the promise 
will go to the sil and the 2 on the sil will go to the promise, from past 
testing I expect the disk on the sil to have the errors and the ones on the 
promise to not have errors.

After I looked at the error more carefully and I though that too, I had 
originally thought I was getting resets also but I was wrong on that.

                       Roger

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2008-04-01 19:28 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-03-20 14:18 2.6.24.3: regular sata drive resets - worrisome? Hans-Peter Jansen
2008-03-21  4:48 ` Andrew Morton
2008-03-21 18:32   ` Roger Heflin
2008-03-21 23:06     ` Hans-Peter Jansen
2008-03-29 12:58   ` Tejun Heo
2008-03-30  0:14     ` Hans-Peter Jansen
2008-03-30  0:54       ` Tejun Heo
2008-03-30 12:00         ` Hans-Peter Jansen
2008-03-30 12:41           ` Roger Heflin
2008-03-31  4:33             ` Tejun Heo
2008-04-01 19:27               ` Roger Heflin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).