LKML Archive on lore.kernel.org
* ATA device reset, shoud I be concerned?
@ 2008-01-13 22:19 Georgi Chulkov
2008-01-15 10:54 ` Andrew Morton
0 siblings, 1 reply; 22+ messages in thread
From: Georgi Chulkov @ 2008-01-13 22:19 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-ide
Hello,
During heavy disk load on my laptop, sometimes the IDE disk will pause for a
second and then continue. I get this in my kernel log:
[ 9031.028000] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[ 9031.028000] ata1.00: cmd c8/00:08:90:ca:ce/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
[ 9031.028000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 9036.068000] ata1: port is slow to respond, please be patient (Status 0xd0)
[ 9041.052000] ata1: device not ready (errno=-16), forcing hardreset
[ 9041.052000] ata1: soft resetting port
[ 9041.232000] ata1.00: configured for UDMA/100
[ 9041.232000] ata1: EH complete
[ 9041.248000] sd 0:0:0:0: [sda] 78140160 512-byte hardware sectors (40008 MB)
[ 9041.248000] sd 0:0:0:0: [sda] Write Protect is off
[ 9041.248000] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 9041.248000] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
My question: What is this telling me, and do I need to be concerned?
Everything continues to work normally after the message: no I/O errors, no
fsck errors, etc.
I've seen some similar reports on the mailing list, but they include slightly
different messages. I would appreciate any information!
uname -a (on Kubuntu Gutsy, CPU is a single-core 32-bit Pentium M):
Linux superfly 2.6.22-14-386 #1 Tue Dec 18 07:34:24 UTC 2007 i686 GNU/Linux
lspci:
00:00.0 Host bridge: Intel Corporation 82855PM Processor to I/O Controller (rev 03)
00:01.0 PCI bridge: Intel Corporation 82855PM Processor to AGP Controller (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #3 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 81)
00:1f.0 ISA bridge: Intel Corporation 82801DBM (ICH4-M) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801DBM (ICH4-M) IDE Controller (rev 01)
00:1f.5 Multimedia audio controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97 Audio Controller (rev 01)
01:00.0 VGA compatible controller: nVidia Corporation NV28 [GeForce4 Ti 4200 Go AGP 8x] (rev a1)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5705M Gigabit Ethernet (rev 01)
02:01.0 CardBus bridge: Texas Instruments PCI7510 PC card Cardbus Controller (rev 01)
02:01.2 FireWire (IEEE 1394): Texas Instruments PCI7410,7510,7610 OHCI-Lynx Controller
02:01.3 System peripheral: Texas Instruments PCI7410,7510,7610 PCI Firmware Loading Function
02:03.0 CardBus bridge: Texas Instruments PCI1410 PC card Cardbus Controller (rev 01)
Thanks!
* Re: ATA device reset, shoud I be concerned?
2008-01-13 22:19 ATA device reset, shoud I be concerned? Georgi Chulkov
@ 2008-01-15 10:54 ` Andrew Morton
2008-01-15 11:35 ` Alan Cox
2008-01-22 20:29 ` Georgi Chulkov
0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2008-01-15 10:54 UTC (permalink / raw)
To: Georgi Chulkov; +Cc: linux-kernel, linux-ide
On Mon, 14 Jan 2008 00:19:20 +0200 Georgi Chulkov <g.chulkov@jacobs-university.de> wrote:
> Hello,
>
> During heavy disk load on my laptop, sometimes the IDE disk will pause for a
> second and then continue. I get this in my kernel log:
>
> [ 9031.028000] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
> frozen
> [ 9031.028000] ata1.00: cmd c8/00:08:90:ca:ce/00:00:00:00:00/e0 tag 0 cdb 0x0
> data 4096 in
> [ 9031.028000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4
> (timeout)
> [ 9036.068000] ata1: port is slow to respond, please be patient (Status 0xd0)
> [ 9041.052000] ata1: device not ready (errno=-16), forcing hardreset
> [ 9041.052000] ata1: soft resetting port
> [ 9041.232000] ata1.00: configured for UDMA/100
> [ 9041.232000] ata1: EH complete
> [ 9041.248000] sd 0:0:0:0: [sda] 78140160 512-byte hardware sectors (40008 MB)
> [ 9041.248000] sd 0:0:0:0: [sda] Write Protect is off
> [ 9041.248000] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> [ 9041.248000] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled,
> doesn't support DPO or FUA
>
> My question: What is this telling me, and do I need to be concerned?
> Everything continues to work normally after the message: no I/O errors, no
> fsck errors, etc.
>
> I've seen some similar reports on the mailing list, but they include slightly
> different messages. I would appreciate any information!
>
> uname -a (on Kubuntu Gutsy, CPU is a single-core 32-bit Pentium M):
>
> Linux superfly 2.6.22-14-386 #1 Tue Dec 18 07:34:24 UTC 2007 i686 GNU/Linux
>
Has it done this in all kernel versions or did some earlier version work OK?
* Re: ATA device reset, shoud I be concerned?
2008-01-15 10:54 ` Andrew Morton
@ 2008-01-15 11:35 ` Alan Cox
2008-01-21 7:56 ` Tejun Heo
2008-01-22 20:29 ` Georgi Chulkov
1 sibling, 1 reply; 22+ messages in thread
From: Alan Cox @ 2008-01-15 11:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: Georgi Chulkov, linux-kernel, linux-ide
> > [ 9031.028000] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
> > frozen
> > [ 9031.028000] ata1.00: cmd c8/00:08:90:ca:ce/00:00:00:00:00/e0 tag 0 cdb 0x0
> > data 4096 in
> > [ 9031.028000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4
> > (timeout)
We got bored of waiting for the drive to respond to our request. I still
think we have the timeouts too short or are accounting queue time
somewhere we shouldn't, as there are a few other examples where we don't allow
long enough for a drive to retry out and fail with a media error on a bad
sector.
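
For readers decoding the log excerpt quoted above, here is a minimal Python sketch; it is not libata code. It assumes the usual libata layout of the "cmd" line (command/feature:nsect:lbal:lbam:lbah/HOB registers/device) and the standard ATA status bit names. The helper names decode_cmd and decode_status are hypothetical, and only 28-bit LBA commands such as 0xc8 (READ DMA) are handled.

```python
import re

# ATA status register bit names, from bit 7 down to bit 0.
STATUS_BITS = ["BSY", "DRDY", "DF", "DSC", "DRQ", "CORR", "IDX", "ERR"]


def decode_status(value):
    """Return the names of the bits set in an ATA status byte."""
    return [name for i, name in enumerate(STATUS_BITS) if value & (0x80 >> i)]


def decode_cmd(line):
    """Extract command, sector count and LBA28 address from a 'cmd' log line.

    Assumed field order: command/feature:nsect:lbal:lbam:lbah/hob.../device.
    A sector count of 0 (meaning 256 in LBA28) is not special-cased here.
    """
    m = re.search(r"cmd ([0-9a-f]{2})/([0-9a-f:]+)/([0-9a-f:]+)/([0-9a-f]{2})", line)
    command = int(m.group(1), 16)
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in m.group(2).split(":"))
    device = int(m.group(4), 16)
    # LBA28: the low nibble of the device register holds address bits 24-27.
    lba = ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    return {"command": command, "sectors": nsect, "lba": lba}


info = decode_cmd("ata1.00: cmd c8/00:08:90:ca:ce/00:00:00:00:00/e0 tag 0 cdb 0x0")
print(info, decode_status(0xD0))
```

On the reported log this yields 8 sectors (4096 bytes, matching "data 4096 in") starting at LBA 13552272, and Status 0xd0 decodes to BSY, DRDY and DSC, i.e. the drive was still busy when the port was polled.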
Alan
* Re: ATA device reset, shoud I be concerned?
2008-01-15 11:35 ` Alan Cox
@ 2008-01-21 7:56 ` Tejun Heo
2008-01-21 13:02 ` Alan Cox
0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2008-01-21 7:56 UTC (permalink / raw)
To: Alan Cox
Cc: Andrew Morton, Georgi Chulkov, linux-kernel, linux-ide, Mark Lord
Alan Cox wrote:
>>> [ 9031.028000] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
>>> frozen
>>> [ 9031.028000] ata1.00: cmd c8/00:08:90:ca:ce/00:00:00:00:00/e0 tag 0 cdb 0x0
>>> data 4096 in
>>> [ 9031.028000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4
>>> (timeout)
>
> We got bored of waiting for the drive to respond to our request. I still
> think we have the timeouts too short or are accounting queue time
> somewhere we shouldn't, as there are a few other examples where we don't allow
> long enough for a drive to retry out and fail with a media error on a bad
> sector.
Hmm.. That's not what I hear from Mark and vendor contacts. They say
30secs is more than enough. I actually am thinking about reducing it to
15secs (not for FLUSH of course) as many SFF controllers report
transmission failure as timeouts. Of course, if we're ticking the timer
while the command is not in flight, that's a bug. If there are cases
where 30 secs isn't enough, can you please point me to those reports?
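
The accounting question raised here (does the timer tick while the command is still queued?) can be illustrated with a small Python sketch. This is a toy model, not kernel code; the Command class and both function names are hypothetical, and the 30-second value is simply the limit mentioned above.

```python
from dataclasses import dataclass

TIMEOUT = 30.0  # seconds, the per-command limit discussed in the thread


@dataclass
class Command:
    queued_at: float  # when the request entered the queue
    issued_at: float  # when it was actually sent to the drive


def timed_out_from_queue(cmd, now, timeout=TIMEOUT):
    """Suspected bug: the clock starts while the command is still queued."""
    return now - cmd.queued_at > timeout


def timed_out_in_flight(cmd, now, timeout=TIMEOUT):
    """Intended behaviour: only the time spent in flight counts."""
    return now - cmd.issued_at > timeout


# A command that waited 20 s in the queue and has been in flight for 15 s.
cmd = Command(queued_at=0.0, issued_at=20.0)
print(timed_out_from_queue(cmd, now=35.0), timed_out_in_flight(cmd, now=35.0))
```

With queue time included, the command is declared timed out even though the drive has only had it for 15 seconds.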
Thanks.
--
tejun
* Re: ATA device reset, shoud I be concerned?
2008-01-21 7:56 ` Tejun Heo
@ 2008-01-21 13:02 ` Alan Cox
2008-01-21 13:14 ` Tejun Heo
0 siblings, 1 reply; 22+ messages in thread
From: Alan Cox @ 2008-01-21 13:02 UTC (permalink / raw)
To: Tejun Heo
Cc: Andrew Morton, Georgi Chulkov, linux-kernel, linux-ide, Mark Lord
> transmission failure as timeouts. Of course, if we're ticking the timer
> while the command is not in flight, that's a bug. If there are cases
> where 30 secs isn't enough, can you please point me to those reports?
I have been, in bugzilla - the raid failure example where old IDE
eventually reports a media error while libata keeps timing out,
resetting, repeating.
Alan
* Re: ATA device reset, shoud I be concerned?
2008-01-21 13:02 ` Alan Cox
@ 2008-01-21 13:14 ` Tejun Heo
2008-01-21 14:14 ` Alan Cox
0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2008-01-21 13:14 UTC (permalink / raw)
To: Alan Cox
Cc: Andrew Morton, Georgi Chulkov, linux-kernel, linux-ide, Mark Lord
Alan Cox wrote:
>> transmission failure as timeouts. Of course, if we're ticking the timer
>> while the command is not in flight, that's a bug. If there are cases
>> where 30 secs isn't enough, can you please point me to those reports?
>
> I have been, in bugzilla - the raid failure example where old IDE
> eventually reports a media error while libata keeps timing out,
> resetting, repeating.
Maybe the difference is not in the timeout but in what the driver does after
a timeout happens. After a timeout, libata ignores almost everything (it
does consider a DMA error reported in the BMDMA status) and resets the device,
while IDE assumes that the IRQ might have been lost and completes the command
if the TF status register says so.
It could be that the particular device doesn't raise IRQ on certain
error conditions but updates TF registers. After timeout, IDE completes
the command with the indicated error while libata ignores the status and
resets the device.
libata never touches TF register after timeout because some controllers
lock up hard if TF register is read after certain error conditions
(even the status register).
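
The two behaviours described in this message can be contrasted in a toy Python model. It is only a sketch of the policies as described here, not the actual IDE or libata code; the function names and return strings are made up for illustration.

```python
ATA_BUSY = 0x80  # status bit 7: device busy
ATA_ERR = 0x01   # status bit 0: error, details are in the error register


def old_ide_after_timeout(read_status):
    """Assume the IRQ was lost: read the status register and trust it."""
    status = read_status()
    if status & ATA_BUSY:
        return "reset"
    if status & ATA_ERR:
        return "complete with error"
    return "complete ok"


def libata_after_timeout(read_status):
    """Never read the taskfile after a timeout; always reset the device."""
    return "reset"


# A drive that flagged an error in its status register but raised no IRQ.
silent_error = lambda: 0x51  # DRDY | DSC | ERR
print(old_ide_after_timeout(silent_error), "/", libata_after_timeout(silent_error))
```

In this model the first policy surfaces the device error to the upper layers, while the second discards it but never performs the register read that can hang some controllers.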
--
tejun
* Re: ATA device reset, shoud I be concerned?
2008-01-21 13:14 ` Tejun Heo
@ 2008-01-21 14:14 ` Alan Cox
2008-01-21 14:31 ` Tejun Heo
0 siblings, 1 reply; 22+ messages in thread
From: Alan Cox @ 2008-01-21 14:14 UTC (permalink / raw)
To: Tejun Heo
Cc: Andrew Morton, Georgi Chulkov, linux-kernel, linux-ide, Mark Lord
> while IDE thinks that IRQ might be lost and completes the command if the
> TF status register says so.
For PATA at least that makes a lot of sense. It would probably make the
Promise driver a lot more stable too.
> It could be that the particular device doesn't raise IRQ on certain
> error conditions but updates TF registers. After timeout, IDE completes
> the command with the indicated error while libata ignores the status and
> resets the device.
And loses the important information like media errors
> libata never touches TF register after timeout because some controllers
> lock up hard if TF register is read after certain error conditions
> (even the status register).
Should that not then be a per host flag ?
Alan
* Re: ATA device reset, shoud I be concerned?
2008-01-21 14:14 ` Alan Cox
@ 2008-01-21 14:31 ` Tejun Heo
2008-01-21 14:33 ` Tejun Heo
2008-01-21 16:47 ` Alan Cox
0 siblings, 2 replies; 22+ messages in thread
From: Tejun Heo @ 2008-01-21 14:31 UTC (permalink / raw)
To: Alan Cox
Cc: Andrew Morton, Georgi Chulkov, linux-kernel, linux-ide, Mark Lord
Alan Cox wrote:
>> while IDE thinks that IRQ might be lost and completes the command if the
>> TF status register says so.
>
> For PATA at least that makes a lot of sense. It would probably make the
> Promise driver a lot more stable too.
Can you elaborate a bit? I don't really think completing a command
after 30sec timeout contributes a lot to driver stability.
>> It could be that the particular device doesn't raise IRQ on certain
>> error conditions but updates TF registers. After timeout, IDE completes
>> the command with the indicated error while libata ignores the status and
>> resets the device.
>
> And loses the important information like media errors
>
>> libata never touches TF register after timeout because some controllers
>> lock up hard if TF register is read after certain error conditions
>> (even the status register).
>
> Should that not then be a per host flag ?
Yeah, that would be the best. The problem is that there are several
different kinds of timeouts and we don't know which controller locks up
after which timeout and investigating them is really difficult.
IMHO, losing media error information is much better than locking up a
machine hard. We can start white listing known good controllers but I'm
skeptical how much benefit it will bring.
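
A per-host flag with a safe default, as discussed above, might look roughly like the following Python sketch. HOST_TF_READ_SAFE is a hypothetical name invented for this illustration; no such libata flag is implied, and the action strings are placeholders.

```python
ATA_BUSY, ATA_ERR = 0x80, 0x01
HOST_TF_READ_SAFE = 0x1  # hypothetical per-host flag, set only for whitelisted hosts


def handle_timeout(host_flags, read_status):
    """Pick a recovery action after a command timeout."""
    if not (host_flags & HOST_TF_READ_SAFE):
        return "reset"  # default: never touch the taskfile registers
    status = read_status()
    if status & ATA_BUSY:
        return "reset"  # device is still stuck
    if status & ATA_ERR:
        return "fail command with device error"
    return "complete command"  # command finished, only the IRQ was lost


print(handle_timeout(0, lambda: 0x51))
print(handle_timeout(HOST_TF_READ_SAFE, lambda: 0x51))
```

The default path never calls read_status, so an unknown controller is treated exactly as it is today; only hosts explicitly marked as safe get the extra error information.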
Thanks.
--
tejun
* Re: ATA device reset, shoud I be concerned?
2008-01-21 14:31 ` Tejun Heo
@ 2008-01-21 14:33 ` Tejun Heo
2008-01-21 16:44 ` Alan Cox
2009-08-27 2:40 ` Robert Hancock
2008-01-21 16:47 ` Alan Cox
1 sibling, 2 replies; 22+ messages in thread
From: Tejun Heo @ 2008-01-21 14:33 UTC (permalink / raw)
To: Alan Cox
Cc: Andrew Morton, Georgi Chulkov, linux-kernel, linux-ide, Mark Lord
Tejun Heo wrote:
> IMHO, losing media error information is much better than locking up a
> machine hard. We can start white listing known good controllers but I'm
> skeptical how much benefit it will bring.
Just a data point, even ICHs lock up after PHY event if the wrong TF
register is accessed. I just don't think tampering with TF regs after
timeout is worth the cost.
--
tejun
* Re: ATA device reset, shoud I be concerned?
2008-01-21 14:33 ` Tejun Heo
@ 2008-01-21 16:44 ` Alan Cox
2009-08-27 2:40 ` Robert Hancock
1 sibling, 0 replies; 22+ messages in thread
From: Alan Cox @ 2008-01-21 16:44 UTC (permalink / raw)
To: Tejun Heo
Cc: Andrew Morton, Georgi Chulkov, linux-kernel, linux-ide, Mark Lord
> Just a data point, even ICHs lock up after PHY event if the wrong TF
> register is accessed. I just don't think tampering with TF regs after
> timeout is worth the cost.
For SATA maybe; for PATA I don't have any controllers with your bug, so
it's wrong for libata to cripple the PATA support for working controllers.
Alan
* Re: ATA device reset, shoud I be concerned?
2008-01-21 14:31 ` Tejun Heo
2008-01-21 14:33 ` Tejun Heo
@ 2008-01-21 16:47 ` Alan Cox
2008-01-21 17:02 ` Tejun Heo
1 sibling, 1 reply; 22+ messages in thread
From: Alan Cox @ 2008-01-21 16:47 UTC (permalink / raw)
To: Tejun Heo
Cc: Andrew Morton, Georgi Chulkov, linux-kernel, linux-ide, Mark Lord
> Can you elaborate a bit? I don't really think completing a command
> after 30sec timeout contributes a lot to driver stability.
Timeout, timeout, timeout, reset, timeout.. (repeat), failed I/O
This gives the end user no information about the fault, nor does it let
the upper layers of SCSI and above distinguish between a random passing
sulk and media errors which need the disk replacing.
> > Should that not then be a per host flag ?
>
> Yeah, that would be the best. The problem is that there are several
> different kinds of timeouts and we don't know which controller locks up
> after which timeout and investigating them is really difficult.
PATA controllers don't lock up in that case so it's quite easy. The one
exception is if the device jams IORDY, but in that case you are dead
anyway on the next I/O (except on a SIL680 which has a timer we could use).
Old IDE says it works for PATA. For SATA I can see it might need more
care and you might simply not be able to get the info.
Alan
* Re: ATA device reset, shoud I be concerned?
2008-01-21 16:47 ` Alan Cox
@ 2008-01-21 17:02 ` Tejun Heo
2008-01-21 17:27 ` Alan Cox
2008-01-22 1:31 ` Bartlomiej Zolnierkiewicz
0 siblings, 2 replies; 22+ messages in thread
From: Tejun Heo @ 2008-01-21 17:02 UTC (permalink / raw)
To: Alan Cox
Cc: Andrew Morton, Georgi Chulkov, linux-kernel, linux-ide, Mark Lord
Alan Cox wrote:
>> Can you elaborate a bit? I don't really think completing a command
>> after 30sec timeout contributes a lot to driver stability.
>
> Timeout, timeout, timeout, reset, timeout.. (repeat), failed I/O
>
> This gives the end user no information about the fault, nor does it let
> the upper layers of SCSI and above distinguish between a random passing
> sulk and media errors which need the disk replacing.
I still don't think it's worth the trouble. There's currently only one
reported device which forgets to raise IRQ on media error. The behavior
is out of spec and rare. I don't think it's a good idea to change EH
behavior for it.
>>> Should that not then be a per host flag ?
>> Yeah, that would be the best. The problem is that there are several
>> different kinds of timeouts and we don't know which controller locks up
>> after which timeout and investigating them is really difficult.
>
> PATA controllers don't lock up in that case so it's quite easy. The one
> exception is if the device jams IORDY but in that case you are dead
> anyway the next I/O (except on a SIL680 which has a timer we could use).
>
> Old IDE says it works for PATA. For SATA I can see it might need more
> care and you might simply not be able to get the info.
Old IDE often locks up the machine hard after timeouts. I'm all for
gathering more info but benefit vs. risk equation just doesn't look good
here. Why take risk for a rare device which forgets to raise IRQ on
media error? If such behavior is widespread among PATA drives && we
can verify that TF register access after timeout is safe for PATA
controllers, sure, but currently we aren't sure about either.
Thanks.
--
tejun
* Re: ATA device reset, shoud I be concerned?
2008-01-21 17:02 ` Tejun Heo
@ 2008-01-21 17:27 ` Alan Cox
2008-01-22 0:31 ` Tejun Heo
2008-01-22 1:31 ` Bartlomiej Zolnierkiewicz
1 sibling, 1 reply; 22+ messages in thread
From: Alan Cox @ 2008-01-21 17:27 UTC (permalink / raw)
To: Tejun Heo
Cc: Andrew Morton, Georgi Chulkov, linux-kernel, linux-ide, Mark Lord
> I still don't think it's worth the trouble. There's currently only one
> reported device which forgets to raise IRQ on media error. The behavior
Most people wouldn't realise what is going on.
> > Old IDE says it works for PATA. For SATA I can see it might need more
> > care and you might simply not be able to get the info.
>
> Old IDE often locks up the machine hard after timeouts. I'm all for
The code paths are racy - it didn't use to lock up in 2.4 (except for the
Promise drain bug)
> gathering more info but benefit vs. risk equation just doesn't look good
> here. Why take risk for a rare device which forgets to raise IRQ on
> media error? If such behavior is widespread among PATA drives && we
> can verify that TF register access after timeout is safe for PATA
> controllers, sure, but currently we aren't sure about either.
We lose IRQs in lots of other cases. Promise PATA is particularly bad at
forgetting to give us the completion interrupt.
Alan
* Re: ATA device reset, shoud I be concerned?
2008-01-21 17:27 ` Alan Cox
@ 2008-01-22 0:31 ` Tejun Heo
0 siblings, 0 replies; 22+ messages in thread
From: Tejun Heo @ 2008-01-22 0:31 UTC (permalink / raw)
To: Alan Cox
Cc: Andrew Morton, Georgi Chulkov, linux-kernel, linux-ide, Mark Lord
Hello,
Alan Cox wrote:
>> I still don't think it's worth the trouble. There's currently only one
>> reported device which forgets to raise IRQ on media error. The behavior
>
> Most people wouldn't realise what is going on.
Yeap, true but I don't think we have many timeouts due to media errors.
I've seen lots of SMART logs for drives which caused timeouts but
haven't seen any which logged related media errors.
>>> Old IDE says it works for PATA. For SATA I can see it might need more
>>> care and you might simply not be able to get the info.
>> Old IDE often locks up the machine hard after timeouts. I'm all for
>
> The code paths are racy - it didn't use to in 2.4 (except for the promise
> drain bug)
My jmicron locks up hard under certain conditions. I haven't
investigated it too deep but it looks like a hard lockup (controller
dying while holding PCI bus). NMI watchdog doesn't work afterwards.
>> gathering more info but benefit vs. risk equation just doesn't look good
>> here. Why take risk for a rare device which forgets to raise IRQ on
>> media error? If such behavior is widespread among PATA drives && we
>> can verify that TF register access after timeout is safe for PATA
>> controllers, sure, but currently we aren't sure about either.
>
> We lose IRQs in lots of other cases. Promise PATA is particularly bad at
> forgetting to give us the completion interrupt.
In that case, completing commands after 30secs doesn't really help as
long as normal operation can be recovered afterward. The driver should
take measures against lost interrupts like polling for interrupts after
a while. Those are two different problems and require different, almost
opposite, solutions. Some controllers need registers polled once in a
while, while others die when registers are read unexpectedly.
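
The lost-interrupt countermeasure suggested above (start polling after a while instead of waiting out the full timeout) can be sketched as a simulation in Python. The one-second tick, the poll_after threshold and the function name are assumptions made for the example; as the message notes, this only suits controllers that tolerate having their registers read.

```python
ATA_BUSY = 0x80


def wait_for_completion(irq_seen, read_status, poll_after=2, timeout=30):
    """Simulate waiting in one-second ticks; return (result, seconds waited).

    irq_seen(t) and read_status(t) stand in for the hardware at tick t.
    """
    for t in range(1, timeout + 1):
        if irq_seen(t):
            return ("completed by irq", t)
        # Once the grace period has passed, fall back to polling status.
        if t >= poll_after and not (read_status(t) & ATA_BUSY):
            return ("completed by poll", t)
    return ("timeout", timeout)


# The drive finishes at t=3 but its completion interrupt never arrives.
result = wait_for_completion(
    irq_seen=lambda t: False,
    read_status=lambda t: 0xD0 if t < 3 else 0x50,
)
print(result)
```

Here the command is completed by polling at the third tick rather than failing after the full 30 seconds.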
Thanks.
--
tejun
* Re: ATA device reset, shoud I be concerned?
2008-01-21 17:02 ` Tejun Heo
2008-01-21 17:27 ` Alan Cox
@ 2008-01-22 1:31 ` Bartlomiej Zolnierkiewicz
2008-01-22 1:36 ` Tejun Heo
2008-01-22 1:39 ` Alan Cox
1 sibling, 2 replies; 22+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2008-01-22 1:31 UTC (permalink / raw)
To: Tejun Heo
Cc: Alan Cox, Andrew Morton, Georgi Chulkov, linux-kernel, linux-ide,
Mark Lord
On Monday 21 January 2008, Tejun Heo wrote:
[...]
> > Old IDE says it works for PATA. For SATA I can see it might need more
> > care and you might simply not be able to get the info.
>
> Old IDE often locks up the machine hard after timeouts. I'm all for
Could you point me to some bugreports?
I would like to know more about hosts/conditions for which it happens.
> gathering more info but benefit vs. risk equation just doesn't look good
> here. Why take risk for a rare device which forgets to raise IRQ on
> media error? If such behavior is widespread among PATA drives && we
> can verify that TF register access after timeout is safe for PATA
> controllers, sure, but currently we aren't sure about either.
Thanks,
Bart
* Re: ATA device reset, shoud I be concerned?
2008-01-22 1:31 ` Bartlomiej Zolnierkiewicz
@ 2008-01-22 1:36 ` Tejun Heo
2008-01-22 2:08 ` Tejun Heo
2008-01-22 1:39 ` Alan Cox
1 sibling, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2008-01-22 1:36 UTC (permalink / raw)
To: Bartlomiej Zolnierkiewicz
Cc: Alan Cox, Andrew Morton, Georgi Chulkov, linux-kernel, linux-ide,
Mark Lord
Bartlomiej Zolnierkiewicz wrote:
> On Monday 21 January 2008, Tejun Heo wrote:
>
> [...]
>
>>> Old IDE says it works for PATA. For SATA I can see it might need more
>>> care and you might simply not be able to get the info.
>> Old IDE often locks up the machine hard after timeouts. I'm all for
>
> Could you point me to some bugreports?
>
> I would like to know more about hosts/conditions for which it happens.
It's JMicron, and all the on-board JMicrons I have show the same problem.
Connect a hard drive to the controller and drive it via jmicron, hot plug
and unplug SATA drives continuously, and after a while jmicron says it lost
an interrupt and the machine locks up hard.
Thanks.
--
tejun
* Re: ATA device reset, shoud I be concerned?
2008-01-22 1:31 ` Bartlomiej Zolnierkiewicz
2008-01-22 1:36 ` Tejun Heo
@ 2008-01-22 1:39 ` Alan Cox
1 sibling, 0 replies; 22+ messages in thread
From: Alan Cox @ 2008-01-22 1:39 UTC (permalink / raw)
To: Bartlomiej Zolnierkiewicz
Cc: Tejun Heo, Andrew Morton, Georgi Chulkov, linux-kernel,
linux-ide, Mark Lord
> Could you point me to some bugreports?
>
> I would like to know more about hosts/conditions for which it happens.
The timer reset path races the I/O path races the interrupt path. That
was the vomitously foul race that persuaded me to go libata instead. I
seem to remember explaining this all some time ago.
The root cause is that we drop the lock on some error paths.
Unfortunately it's almost impossible to fix that without having an
infrastructure for quiescing the driver and running a separate error
handling context (the way scsi does).
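
The idea of quiescing the port and running recovery in its own context can be shown with a single-threaded Python model. This is a sketch of the concept only: the Port class and its method names are hypothetical, and a real driver needs locking and a separate error-handling thread, which this model leaves out.

```python
from collections import deque


class Port:
    """Toy port that is frozen while error handling runs."""

    def __init__(self):
        self.frozen = False
        self.pending = deque()
        self.log = []

    def issue(self, cmd):
        if self.frozen:
            self.pending.append(cmd)  # held back until recovery is done
            return False
        self.log.append(f"issued {cmd}")
        return True

    def on_timeout(self):
        # From here on, the normal I/O path must not touch the hardware.
        self.frozen = True

    def run_error_handler(self):
        # Recovery has exclusive use of the port, so nothing can race it.
        self.log.append("reset")
        self.frozen = False
        while self.pending:
            self.issue(self.pending.popleft())


port = Port()
port.issue("read A")
port.on_timeout()
port.issue("read B")  # arrives during recovery and is deferred
port.run_error_handler()
print(port.log)
```

Because new commands are parked while the port is frozen, the reset never runs concurrently with an I/O submission in this model.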
* Re: ATA device reset, shoud I be concerned?
2008-01-22 1:36 ` Tejun Heo
@ 2008-01-22 2:08 ` Tejun Heo
0 siblings, 0 replies; 22+ messages in thread
From: Tejun Heo @ 2008-01-22 2:08 UTC (permalink / raw)
To: Bartlomiej Zolnierkiewicz
Cc: Alan Cox, Andrew Morton, Georgi Chulkov, linux-kernel, linux-ide,
Mark Lord
Tejun Heo wrote:
> Bartlomiej Zolnierkiewicz wrote:
>> On Monday 21 January 2008, Tejun Heo wrote:
>>
>> [...]
>>
>>>> Old IDE says it works for PATA. For SATA I can see it might need more
>>>> care and you might simply not be able to get the info.
>>> Old IDE often locks up the machine hard after timeouts. I'm all for
>> Could you point me to some bugreports?
>>
>> I would like to know more about hosts/conditions for which it happens.
>
> It's jmicron and all on-board jmicrons I have show the same problem.
> Connect a hard drive to the controller and drive it via jmicron, hot plug
> unplug SATA drives continuously, after a while, jmicron says it lost
> interrupt and the machine locks up hard.
BTW, those hot plug/unplugs don't have any direct relationship with the
JMB controller. It's some interference or power issue, I guess. Hot
plugging unrelated drives somehow locks up the jmicron driver. :-(
--
tejun
* Re: ATA device reset, shoud I be concerned?
2008-01-15 10:54 ` Andrew Morton
2008-01-15 11:35 ` Alan Cox
@ 2008-01-22 20:29 ` Georgi Chulkov
1 sibling, 0 replies; 22+ messages in thread
From: Georgi Chulkov @ 2008-01-22 20:29 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-ide
It appears that the problem was caused by a faulty power supply. Thanks
anyway!
On Tuesday 15 January 2008 11:54:35 Andrew Morton wrote:
> On Mon, 14 Jan 2008 00:19:20 +0200 Georgi Chulkov <g.chulkov@jacobs-university.de> wrote:
> > Hello,
> >
> > During heavy disk load on my laptop, sometimes the IDE disk will pause
> > for a second and then continue. I get this in my kernel log:
> >
> > [ 9031.028000] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> > [ 9031.028000] ata1.00: cmd c8/00:08:90:ca:ce/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
> > [ 9031.028000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> > [ 9036.068000] ata1: port is slow to respond, please be patient (Status 0xd0)
> > [ 9041.052000] ata1: device not ready (errno=-16), forcing hardreset
> > [ 9041.052000] ata1: soft resetting port
> > [ 9041.232000] ata1.00: configured for UDMA/100
> > [ 9041.232000] ata1: EH complete
> > [ 9041.248000] sd 0:0:0:0: [sda] 78140160 512-byte hardware sectors (40008 MB)
> > [ 9041.248000] sd 0:0:0:0: [sda] Write Protect is off
> > [ 9041.248000] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> > [ 9041.248000] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> >
> > My question: What is this telling me, and do I need to be concerned?
> > Everything continues to work normally after the message: no I/O errors,
> > no fsck errors, etc.
> >
> > I've seen some similar reports on the mailing list, but they include
> > slightly different messages. I would appreciate any information!
> >
> > uname -a (on Kubuntu Gutsy, CPU is a single-core 32-bit Pentium M):
> >
> > Linux superfly 2.6.22-14-386 #1 Tue Dec 18 07:34:24 UTC 2007 i686
> > GNU/Linux
>
> Has it done this in all kernel versions or did some earlier version work OK?
* Re: ATA device reset, shoud I be concerned?
2008-01-21 14:33 ` Tejun Heo
2008-01-21 16:44 ` Alan Cox
@ 2009-08-27 2:40 ` Robert Hancock
2009-08-27 3:07 ` Jeff Garzik
2009-08-27 8:37 ` Alan Cox
1 sibling, 2 replies; 22+ messages in thread
From: Robert Hancock @ 2009-08-27 2:40 UTC (permalink / raw)
To: Tejun Heo
Cc: Alan Cox, Andrew Morton, Georgi Chulkov, linux-kernel, linux-ide,
Mark Lord
On 01/21/2008 08:33 AM, Tejun Heo wrote:
> Tejun Heo wrote:
>> IMHO, losing media error information is much better than locking up a
>> machine hard. We can start white listing known good controllers but I'm
>> skeptical how much benefit it will bring.
>
> Just a data point, even ICHs lock up after PHY event if the wrong TF
> register is accessed. I just don't think tampering with TF regs after
> timeout is worth the cost.
Nvidia CK804 SATA controllers appear to also explode on reading TF
registers after media errors in certain cases. (They tend to either
lock up the machine or throw HyperTransport timeout machine check
exceptions). I suspect those error paths aren't well tested (except that
it even explodes in Windows with the default Microsoft IDE driver, when
reading a scratched DVD on a SATA drive, for example.)
* Re: ATA device reset, shoud I be concerned?
2009-08-27 2:40 ` Robert Hancock
@ 2009-08-27 3:07 ` Jeff Garzik
2009-08-27 8:37 ` Alan Cox
1 sibling, 0 replies; 22+ messages in thread
From: Jeff Garzik @ 2009-08-27 3:07 UTC (permalink / raw)
To: Robert Hancock
Cc: Tejun Heo, Alan Cox, Andrew Morton, Georgi Chulkov, linux-kernel,
linux-ide, Mark Lord
On 08/26/2009 10:40 PM, Robert Hancock wrote:
> On 01/21/2008 08:33 AM, Tejun Heo wrote:
>> Tejun Heo wrote:
>>> IMHO, losing media error information is much better than locking up a
>>> machine hard. We can start white listing known good controllers but I'm
>>> skeptical how much benefit it will bring.
>>
>> Just a data point, even ICHs lock up after PHY event if the wrong TF
>> register is accessed. I just don't think tampering with TF regs after
>> timeout is worth the cost.
>
> Nvidia CK804 SATA controllers appear to also explode on reading TF
> registers after media errors in certain cases. (They tend to either
> lockup the machine or throw HyperTransport timeout machine check
> exceptions). I suspect those error paths aren't well tested (except that
> it even explodes in Windows with the default Microsoft IDE driver, when
> reading a scratched DVD on a SATA drive, for example.)
Well, reading TF when DMA or another operation is enabled is a big
no-no... according to spec. If we touch TF before the data xfer
operation has completed and the host controller has had a chance to receive
the D2H FIS, TF is undefined (except Status, in some cases).
Jeff
* Re: ATA device reset, shoud I be concerned?
2009-08-27 2:40 ` Robert Hancock
2009-08-27 3:07 ` Jeff Garzik
@ 2009-08-27 8:37 ` Alan Cox
1 sibling, 0 replies; 22+ messages in thread
From: Alan Cox @ 2009-08-27 8:37 UTC (permalink / raw)
To: Robert Hancock
Cc: Tejun Heo, Andrew Morton, Georgi Chulkov, linux-kernel,
linux-ide, Mark Lord
> Nvidia CK804 SATA controllers appear to also explode on reading TF
> registers after media errors in certain cases. (They tend to either
> lockup the machine or throw HyperTransport timeout machine check
You must halt any running DMA activity before reading them. This is true
for pretty much all task file style controllers although the results of
not doing so vary by device somewhat.
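
The ordering rule stated here (stop DMA first, then read the task file) can be captured in a short Python model. The Controller class is a stand-in invented for this sketch and does not correspond to any real driver structure; raising an exception merely represents the undefined or fatal behaviour reported for real hardware.

```python
class Controller:
    """Toy controller whose taskfile is only readable once DMA is stopped."""

    def __init__(self):
        self.dma_running = True
        # Status 0x51 = DRDY | DSC | ERR; error 0x40 = UNC (uncorrectable data).
        self.taskfile = {"status": 0x51, "error": 0x40}

    def stop_dma(self):
        self.dma_running = False

    def read_taskfile(self):
        if self.dma_running:
            # Real hardware may lock up or return garbage instead.
            raise RuntimeError("taskfile read while DMA is active")
        return dict(self.taskfile)


def safe_read_taskfile(ctrl):
    """Halt any running DMA before touching the taskfile registers."""
    ctrl.stop_dma()
    return ctrl.read_taskfile()


print(safe_read_taskfile(Controller()))
```

Wrapping the read in a helper that always halts DMA first keeps the unsafe ordering out of individual error paths.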