LKML Archive on lore.kernel.org
* What does this scsi error mean ?
From: Olivier Galibert @ 2007-01-15 17:16 UTC
To: Hack inc.

sd 0:0:0:0: SCSI error: return code = 0x08000002
sda: Current: sense key: Hardware Error
    ASC=0x42 ASCQ=0x0
Info fld=0x400802c
end_request: I/O error, dev sda, sector 202369
Aborting journal on device sda1.
journal commit I/O error
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only

It's always on a journal write and smart on the disk doesn't see a
thing (no error log, short and long smart tests pass).

In case it is relevant (it's an IBM LS20 blade):

00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
00:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
00:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
01:00.1 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
01:04.0 VGA compatible controller: ATI Technologies Inc Radeon RV100 QY [Radeon 7000/VE]
02:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet (rev 10)
02:01.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet (rev 10)
02:02.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 08)

ioc0: LSI53C1030, FwRev=01032700h, Ports=1, MaxQ=222

Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: IBM-ESXS Model: ST936701LC FN Rev: B41D
  Type: Direct-Access ANSI SCSI revision: 04

  OG.
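For reference, the "return code = 0x08000002" in the log above packs four one-byte fields into a single word. The following is a minimal Python sketch of that split, not anything posted in the thread. It assumes the 2.6-era Linux SCSI mid-layer layout (driver byte, host byte, message byte, raw SAM status byte, from high to low), and the helper names `decode_scsi_result` and `describe` are hypothetical. The name tables are deliberately incomplete and list only a few well-known values.

```python
# Sketch only: split a Linux SCSI result word into its four bytes.
# Assumed layout (high to low): driver | host | msg | status.

STATUS_NAMES = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY"}
HOST_NAMES = {0x00: "DID_OK"}
DRIVER_NAMES = {0x00: "DRIVER_OK", 0x08: "DRIVER_SENSE"}


def decode_scsi_result(result: int) -> dict:
    """Return the four byte-wide fields of a SCSI result word."""
    return {
        "status": result & 0xFF,          # raw SAM status byte
        "msg": (result >> 8) & 0xFF,      # message byte
        "host": (result >> 16) & 0xFF,    # host adapter (DID_*) byte
        "driver": (result >> 24) & 0xFF,  # mid-layer driver byte
    }


def describe(result: int) -> str:
    """Human-readable summary; unknown values are shown in hex."""
    f = decode_scsi_result(result)
    return "driver={} host={} msg=0x{:02x} status={}".format(
        DRIVER_NAMES.get(f["driver"], hex(f["driver"])),
        HOST_NAMES.get(f["host"], hex(f["host"])),
        f["msg"],
        STATUS_NAMES.get(f["status"], hex(f["status"])),
    )


if __name__ == "__main__":
    print(describe(0x08000002))
```

Read this way, the word says only that the target returned CHECK CONDITION and that sense data is available; the actual cause has to come from the sense key and ASC/ASCQ lines that follow it in the log.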
* Re: What does this scsi error mean ?
From: Alan @ 2007-01-15 18:45 UTC
To: Olivier Galibert; +Cc: Hack inc.

On Mon, 15 Jan 2007 18:16:02 +0100
Olivier Galibert <galibert@pobox.com> wrote:

> sd 0:0:0:0: SCSI error: return code = 0x08000002
> sda: Current: sense key: Hardware Error
>     ASC=0x42 ASCQ=0x0

I'll give you a clue: The words "Hardware Error".

Run a SCSI verify pass on the drive with some drive utilities and see
what happens. If you are lucky it'll just reallocate blocks and decide
the drive is ok, if not well see what the smart data thinks.

Alan
* Re: What does this scsi error mean ?
From: Olivier Galibert @ 2007-01-15 21:45 UTC
To: Alan; +Cc: Hack inc.

On Mon, Jan 15, 2007 at 06:45:40PM +0000, Alan wrote:
> On Mon, 15 Jan 2007 18:16:02 +0100
> Olivier Galibert <galibert@pobox.com> wrote:
>
> > sd 0:0:0:0: SCSI error: return code = 0x08000002
> > sda: Current: sense key: Hardware Error
> >     ASC=0x42 ASCQ=0x0
>
> I'll give you a clue: The words "Hardware Error".
>
> Run a SCSI verify pass on the drive with some drive utilities and see
> what happens. If you are lucky it'll just reallocate blocks and decide
> the drive is ok, if not well see what the smart data thinks.

Both smart and the internal blade diagnostics say "everything is a-ok
with the drive, there hasn't been any error ever except a bunch of
corrected ECC ones, and no more than with a similar drive in another
working blade". Hence my initial post. "Hardware error" is kinda
imprecise, so I was wondering whether it was unexpected controller
answer, detected transmission error, block write error, sector not
found... Is there a way to have more information?

  OG.
* Re: What does this scsi error mean ?
From: Alan @ 2007-01-15 23:14 UTC
To: Olivier Galibert; +Cc: Hack inc.

> Both smart and the internal blade diagnostics say "everything is a-ok
> with the drive, there hasn't been any error ever except a bunch of
> corrected ECC ones, and no more than with a similar drive in another
> working blade". Hence my initial post. "Hardware error" is kinda
> imprecise, so I was wondering whether it was unexpected controller
> answer, detected transmission error, block write error, sector not
> found... Is there a way to have more information?

Well the right place to look would indeed have been the SMART data
providing the drive didn't get into a state it couldn't update it.
Hardware error comes from the drive deciding something is wrong (or a
raid card faking it I guess). That covers everything from power
fluctuations and overheating through firmware consistency failures and
more.

If you pull the drive and test it in another box does it show the same ?
And what does a scsi verify have to say ?

Alan
* Re: What does this scsi error mean ?
From: Olivier Galibert @ 2007-01-16 0:10 UTC
To: Alan; +Cc: Hack inc.

On Mon, Jan 15, 2007 at 11:14:52PM +0000, Alan wrote:
> If you pull the drive and test it in another box does it show the same ?

I'm going to try that. The problem requires 3-7 days to appear, so I
won't know immediately.

> And what does a scsi verify have to say ?

Running, looks like it's gonna take a little while.

  OG.
* Re: What does this scsi error mean ?
From: Olivier Galibert @ 2007-01-18 14:08 UTC
To: Alan; +Cc: Hack inc.

On Mon, Jan 15, 2007 at 11:14:52PM +0000, Alan wrote:
> > Both smart and the internal blade diagnostics say "everything is a-ok
> > with the drive, there hasn't been any error ever except a bunch of
> > corrected ECC ones, and no more than with a similar drive in another
> > working blade". Hence my initial post. "Hardware error" is kinda
> > imprecise, so I was wondering whether it was unexpected controller
> > answer, detected transmission error, block write error, sector not
> > found... Is there a way to have more information?
>
> Well the right place to look would indeed have been the SMART data
> providing the drive didn't get into a state it couldn't update it.
> Hardware error comes from the drive deciding something is wrong (or a
> raid card faking it I guess). That covers everything from power
> fluctuations and overheating through firmware consistency failures and
> more.
>
> If you pull the drive and test it in another box does it show the same ?

Ok, inverted the disks, got a crash of the same blade with the new
disk, so the problem is not the drive itself. Gonna try inverting two
blades to check if it's the power supply connector/rail.

  OG.
* Re: What does this scsi error mean ?
From: Olivier Galibert @ 2007-02-07 17:43 UTC
To: Alan, Hack inc.

On Thu, Jan 18, 2007 at 03:08:46PM +0100, Olivier Galibert wrote:
> On Mon, Jan 15, 2007 at 11:14:52PM +0000, Alan wrote:
> > > Both smart and the internal blade diagnostics say "everything is a-ok
> > > with the drive, there hasn't been any error ever except a bunch of
> > > corrected ECC ones, and no more than with a similar drive in another
> > > working blade". Hence my initial post. "Hardware error" is kinda
> > > imprecise, so I was wondering whether it was unexpected controller
> > > answer, detected transmission error, block write error, sector not
> > > found... Is there a way to have more information?
> >
> > Well the right place to look would indeed have been the SMART data
> > providing the drive didn't get into a state it couldn't update it.
> > Hardware error comes from the drive deciding something is wrong (or a
> > raid card faking it I guess). That covers everything from power
> > fluctuations and overheating through firmware consistency failures and
> > more.
> >
> > If you pull the drive and test it in another box does it show the same ?
>
> Ok, inverted the disks, got a crash of the same blade with the new
> disk, so the problem is not the drive itself. Gonna try inverting two
> blades to check if it's the power supply connector/rail.

...and it is the power supply/connector. Failure is linked to the
position of the blade in the box (as in the blade in the first
position always fails). Now that's a cute failure. Having the
support act on it is going to be fun.

  OG.

PS: Yes, I did forget to send that email :-)
* Re: What does this scsi error mean ?
From: linux-os (Dick Johnson) @ 2007-01-16 15:16 UTC
To: Olivier Galibert; +Cc: Hack inc.

On Mon, 15 Jan 2007, Olivier Galibert wrote:

> On Mon, Jan 15, 2007 at 06:45:40PM +0000, Alan wrote:
>> On Mon, 15 Jan 2007 18:16:02 +0100
>> Olivier Galibert <galibert@pobox.com> wrote:
>>
>>> sd 0:0:0:0: SCSI error: return code = 0x08000002
>>> sda: Current: sense key: Hardware Error
>>>     ASC=0x42 ASCQ=0x0
>>
>> I'll give you a clue: The words "Hardware Error".
>>
>> Run a SCSI verify pass on the drive with some drive utilities and see
>> what happens. If you are lucky it'll just reallocate blocks and decide
>> the drive is ok, if not well see what the smart data thinks.
>
> Both smart and the internal blade diagnostics say "everything is a-ok
> with the drive, there hasn't been any error ever except a bunch of
> corrected ECC ones, and no more than with a similar drive in another
> working blade". Hence my initial post. "Hardware error" is kinda
> imprecise, so I was wondering whether it was unexpected controller
> answer, detected transmission error, block write error, sector not
> found... Is there a way to have more information?
>
>  OG.

Correctable SCSI errors show that the data in a sector was not properly
read, but the device was able to fix the data error because of the
redundancy in the CRC. The error could be permanently fixed if you
rewrote the sector. You probably don't know where the bad sector is
without adding a printk() to driver code. Some BIOS SCSI utilities
(Adaptec) have the capability of reading an entire drive and fixing
bad sectors either by rewrite or relocation.

Since drives can be accessed as files, you could write a utility that
opens the RAW device while it is NOT mounted, reads a bunch of sectors,
then writes them back. To do this, you need to verify that lseek() works
on your particular drive because you need to write the data back to the
same offset that you read it from. I mention this because the raw r/w
of an early Adaptec (aha1542) driver didn't implement lseek, it just
returned 'okay'. You can imagine the mess I made of a drive with that
controller!

Once you verify that lseek works, the rest of the code is trivial. I
suggest reading then writing 64 kilobytes at a time. It will seem to
take 'forever', but the retries on these relatively short groups of
sectors (128 sectors) will be short when errors are encountered.
Make sure the drive is either not mounted or mounted r/o.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.24 on an i686 machine (5592.67 BogoMips).
New book: http://www.AbominableFirebug.com/
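The read-then-rewrite utility described in the message above could look roughly like the following. This is a minimal Python sketch under stated assumptions, not the code the poster had in mind: the names `CHUNK`, `lseek_works`, and `rewrite_in_place` are hypothetical, the 64 KiB chunk size comes from the message, and the 512-byte sector size behind "128 sectors" is assumed. The `lseek_works` check only compares the position the kernel reports with the one requested, so it would not catch a driver that lies consistently about the offset.

```python
# Sketch only: read a device (or file) in 64 KiB chunks and write each
# chunk back to the same offset. Run it only on an unmounted device.
import os

CHUNK = 64 * 1024  # 64 KiB, i.e. 128 sectors of 512 bytes (assumed size)


def lseek_works(fd: int, offset: int) -> bool:
    """Seek to offset and confirm the reported position matches."""
    if os.lseek(fd, offset, os.SEEK_SET) != offset:
        return False
    return os.lseek(fd, 0, os.SEEK_CUR) == offset


def rewrite_in_place(path: str, chunk: int = CHUNK) -> int:
    """Rewrite every chunk of path at its own offset; return bytes rewritten."""
    fd = os.open(path, os.O_RDWR)
    offset = 0
    try:
        while True:
            if not lseek_works(fd, offset):
                raise OSError("lseek did not land on the requested offset")
            data = os.read(fd, chunk)
            if not data:
                break  # end of device or file
            # The read advanced the position, so seek back before writing.
            os.lseek(fd, offset, os.SEEK_SET)
            if os.write(fd, data) != len(data):
                raise OSError("short write at offset %d" % offset)
            offset += len(data)
        os.fsync(fd)  # push the rewritten data out to the device
    finally:
        os.close(fd)
    return offset
```

Trying it first on a scratch file rather than a real device, for example `rewrite_in_place("/tmp/scratch.img")` (a hypothetical path), is a cheap way to confirm the contents come back unchanged before pointing it at a disk.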
* Re: What does this scsi error mean ?
From: Alan @ 2007-01-16 15:47 UTC
To: linux-os (Dick Johnson); +Cc: Olivier Galibert, Hack inc.

> Correctable SCSI errors show that the data in a sector was not properly
> read, but the device was able to fix the data error because of the
> redundancy in the CRC. The error could be permanently fixed if you
> rewrote the sector. You probably don't know where the bad sector is

The drives do that automatically, and the SCSI verify did it for him too
if there were any other problems.
* Re: What does this scsi error mean ?
From: Olivier Galibert @ 2007-01-16 17:25 UTC
To: Alan; +Cc: linux-os (Dick Johnson), Hack inc.

On Tue, Jan 16, 2007 at 03:47:52PM +0000, Alan wrote:
> The drives do that automatically, and the SCSI verify did it for him too
> if there were any other problems.

The SCSI verify didn't see a thing, I'm gonna do the disk swapping
dance.

  OG.
* Re: What does this scsi error mean ?
From: Stefan Richter @ 2007-01-15 23:27 UTC
To: Olivier Galibert; +Cc: Hack inc.

On 15 Jan, Olivier Galibert wrote:
> sd 0:0:0:0: SCSI error: return code = 0x08000002
> sda: Current: sense key: Hardware Error
>     ASC=0x42 ASCQ=0x0

The Additional Sense Code means "power-on or self-test failure" FWIW.
(SPC-4 annex D)
-- 
Stefan Richter
-=====-=-=== ---= =----
http://arcgraph.de/sr/
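The lookup described in the message above can be illustrated with a tiny table. This is a hedged Python sketch, not an excerpt from any real tool: `SENSE_KEYS`, `ASC_ASCQ`, and `describe_sense` are hypothetical names, and the tables hold only the sense key and the ASC/ASCQ pair that appear in this thread. A complete table would have to be taken from the SPC annex cited above.

```python
# Sketch only: map the sense key and ASC/ASCQ pair from the kernel log
# to text. Only the values seen in this thread are included.

SENSE_KEYS = {0x4: "Hardware Error"}
ASC_ASCQ = {(0x42, 0x00): "power-on or self-test failure"}


def describe_sense(sense_key: int, asc: int, ascq: int) -> str:
    """Return a short description, flagging anything not in the tables."""
    key = SENSE_KEYS.get(
        sense_key, "sense key 0x%x (not in this table)" % sense_key)
    detail = ASC_ASCQ.get(
        (asc, ascq), "ASC=0x%02x ASCQ=0x%02x (not in this table)" % (asc, ascq))
    return "%s: %s" % (key, detail)


if __name__ == "__main__":
    # Values from the log: sense key Hardware Error, ASC=0x42, ASCQ=0x0
    print(describe_sense(0x4, 0x42, 0x00))
```

Keeping unknown codes visible in hex, instead of raising an error, means an unexpected pair from the log is still reported and can be looked up by hand.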
* Re: What does this scsi error mean ?
From: Olivier Galibert @ 2007-01-15 23:35 UTC
To: Stefan Richter; +Cc: Hack inc.

On Tue, Jan 16, 2007 at 12:27:17AM +0100, Stefan Richter wrote:
> On 15 Jan, Olivier Galibert wrote:
> > sd 0:0:0:0: SCSI error: return code = 0x08000002
> > sda: Current: sense key: Hardware Error
> >     ASC=0x42 ASCQ=0x0
>
> The Additional Sense Code means "power-on or self-test failure" FWIW.
> (SPC-4 annex D)

Given that happens between 3 days to a week after bootup on the root
drive, it's obviously not the "power on" part. It's kinda annoying
nothing appears in the smart logs though:

smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: IBM-ESXS ST936701LC FN Version: B41D
Serial number: 3LC0C8P000007647WLMV
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Tue Jan 16 00:33:09 2007 CET
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature:     33 C
Drive Trip Temperature:        60 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 16206797
  Blocks received from initiator = 83607272
  Blocks read from cache and sent to initiator = 3311410
  Number of read and write commands whose size <= segment size = 2801896
  Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 533.07
  number of minutes until next internal SMART test = 112

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      10474        0         0     10474      10474         61.360           0
write:         0        0         0         0          0         58.647           2

Non-medium error count: 1457822

SMART Self-test log
Num  Test              Status       segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                    number   (hours)
# 1  Background long   Completed       -       407           -       [-   -    -]
# 2  Background short  Completed       -       243           -       [-   -    -]

Long (extended) Self Test duration: 793 seconds [13.2 minutes]

  OG.