LKML Archive on lore.kernel.org
* 2.6.24-rt1 IRQ routing anomaly
@ 2008-02-21 12:01 Mark Hounschell
  2008-02-21 12:51 ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Hounschell @ 2008-02-21 12:01 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar; +Cc: Mark Hounschell

According to /proc/interrupts, every interrupt received by eth1 is also
being received by the sound card EMU10K1. The problem showed itself
first with this. The sound system was quiet BTW.

It does not happen with 2.6.24 vanilla.

kernel: irq 19: nobody cared (try booting with the "irqpoll" option)
kernel: Pid: 1832, comm: IRQ-19 Not tainted 2.6.24.2-crt #2
kernel:  [<c013d6da>] __report_bad_irq+0x36/0x75
kernel:  [<c013d910>] note_interrupt+0x1f7/0x227
kernel:  [<c013ce85>] thread_simple_irq+0x61/0x74
kernel:  [<c013d455>] do_irqd+0x0/0x22f
kernel:  [<c013d507>] do_irqd+0xb2/0x22f
kernel:  [<c013d455>] do_irqd+0x0/0x22f
kernel:  [<c012b137>] kthread+0x38/0x5d
kernel:  [<c012b0ff>] kthread+0x0/0x5d
kernel:  [<c0104c13>] kernel_thread_helper+0x7/0x10
kernel:  =======================
kernel: ---------------------------
kernel: | preempt count: 00000001 ]
kernel: | 1-level deep critical section nesting:
kernel: ----------------------------------------
kernel: .. [<c02b03b3>] .... __spin_lock_irq+0xe/0x1e
kernel: .....[<00000000>] ..   ( <= _stext+0x3feff000/0x14)
kernel:
kernel: handlers:
kernel: [<f4d16544>] (snd_emu10k1_interrupt+0x0/0x42c [snd_emu10k1])
kernel: turning off IO-APIC fast mode.
kernel: irq 19: nobody cared (try booting with the "irqpoll" option)
kernel: Pid: 1832, comm: IRQ-19 Not tainted 2.6.24.2-crt #2
kernel:  [<c013d6da>] __report_bad_irq+0x36/0x75
kernel:  [<c013d910>] note_interrupt+0x1f7/0x227
kernel:  [<c013ce85>] thread_simple_irq+0x61/0x74
kernel:  [<c013d455>] do_irqd+0x0/0x22f
kernel:  [<c013d507>] do_irqd+0xb2/0x22f
kernel:  [<c013d455>] do_irqd+0x0/0x22f
kernel:  [<c012b137>] kthread+0x38/0x5d
kernel:  [<c012b0ff>] kthread+0x0/0x5d
kernel:  [<c0104c13>] kernel_thread_helper+0x7/0x10
kernel:  =======================
kernel: ---------------------------
kernel: | preempt count: 00000001 ]
kernel: | 1-level deep critical section nesting:
kernel: ----------------------------------------
kernel: .. [<c02b03b3>] .... __spin_lock_irq+0xe/0x1e
kernel: .....[<00000000>] ..   ( <= _stext+0x3feff000/0x14)
kernel:
kernel: handlers:
kernel: [<f4d16544>] (snd_emu10k1_interrupt+0x0/0x42c [snd_emu10k1])

Looking at /proc/interrupts I could see that the EMU10K1 interrupt was
going to town. eth1 was very busy at the time.

So a simple external ping test on a quiet system at runlevel 3 revealed:

# cat before.ping
           CPU0       CPU1
  0:         85          0   IO-APIC-edge      timer
  1:        396        420   IO-APIC-edge      i8042
  3:          4          2   IO-APIC-edge
  4:          5          1   IO-APIC-edge
  6:          1          4   IO-APIC-edge      floppy
  7:          0          0   IO-APIC-edge      parport0
  8:          2          0   IO-APIC-edge      rtc
  9:          0          1   IO-APIC-fasteoi   acpi
 12:         21         84   IO-APIC-edge      i8042
 14:       8457       8179   IO-APIC-edge      libata
 15:       1016       1519   IO-APIC-edge      libata
 16:         60         60   IO-APIC-fasteoi   aic7xxx
 17:        113         96   IO-APIC-fasteoi   eth1
 18:         44         47   IO-APIC-fasteoi
 19:         99        114   IO-APIC-fasteoi   EMU10K1
NMI:          0          0   Non-maskable interrupts
LOC:      93895      94157   Local timer interrupts
RES:       8831       8188   Rescheduling interrupts
CAL:       4176       5267   function call interrupts
TLB:        271        235   TLB shootdowns
TRM:          0          0   Thermal event interrupts
SPU:          0          0   Spurious interrupts
ERR:          0
MIS:          0


Then from an external machine: ping -c10 10.10.10.200


# cat after.ping
           CPU0       CPU1
  0:         85          0   IO-APIC-edge      timer
  1:        464        432   IO-APIC-edge      i8042
  3:          4          2   IO-APIC-edge
  4:          5          1   IO-APIC-edge
  6:          1          4   IO-APIC-edge      floppy
  7:          0          0   IO-APIC-edge      parport0
  8:          2          0   IO-APIC-edge      rtc
  9:          0          1   IO-APIC-fasteoi   acpi
 12:         21         84   IO-APIC-edge      i8042
 14:       8460       8198   IO-APIC-edge      libata
 15:       1360       1549   IO-APIC-edge      libata
 16:         60         60   IO-APIC-fasteoi   aic7xxx
 17:        129        102   IO-APIC-fasteoi   eth1
 18:         44         47   IO-APIC-fasteoi
 19:        105        130   IO-APIC-fasteoi   EMU10K1
NMI:          0          0   Non-maskable interrupts
LOC:     104387     104637   Local timer interrupts
RES:       8890       8214   Rescheduling interrupts
CAL:       4176       5267   function call interrupts
TLB:        271        236   TLB shootdowns
TRM:          0          0   Thermal event interrupts
SPU:          0          0   Spurious interrupts
ERR:          0
MIS:          0


22 interrupts were added to each of eth1 and EMU10K1 (44 in total).
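
For anyone repeating this test, the comparison can be scripted. The following is a minimal sketch in Python, assuming the two snapshots were saved as text. The names parse_interrupts and diff_snapshots are illustrative, not existing tools, and only the four PCI rows from the snapshots above are embedded to keep it short. Summed over both CPUs, it reports 22 new interrupts on IRQ 17 (eth1) and 22 on IRQ 19 (EMU10K1).

```python
# Hypothetical helpers for comparing two saved copies of /proc/interrupts.

BEFORE = """\
           CPU0       CPU1
 16:         60         60   IO-APIC-fasteoi   aic7xxx
 17:        113         96   IO-APIC-fasteoi   eth1
 18:         44         47   IO-APIC-fasteoi
 19:         99        114   IO-APIC-fasteoi   EMU10K1
"""

AFTER = """\
           CPU0       CPU1
 16:         60         60   IO-APIC-fasteoi   aic7xxx
 17:        129        102   IO-APIC-fasteoi   eth1
 18:         44         47   IO-APIC-fasteoi
 19:        105        130   IO-APIC-fasteoi   EMU10K1
"""


def parse_interrupts(text):
    """Map each row label to (count summed over all CPUs, rest of the row)."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    rows = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        # Rows such as ERR/MIS carry fewer counters than there are CPUs.
        counts = [int(f) for f in fields[:ncpus] if f.isdigit()]
        description = " ".join(fields[len(counts):])
        rows[label.strip()] = (sum(counts), description)
    return rows


def diff_snapshots(before, after):
    """Return {label: increase} for every row whose total changed."""
    old, new = parse_interrupts(before), parse_interrupts(after)
    return {
        label: new[label][0] - old[label][0]
        for label in new
        if label in old and new[label][0] != old[label][0]
    }


if __name__ == "__main__":
    names = parse_interrupts(AFTER)
    for label, delta in diff_snapshots(BEFORE, AFTER).items():
        print(f"IRQ {label}: +{delta}  ({names[label][1]})")
```

On a live system the two strings would come from reading /proc/interrupts before and after the ping.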


#lspci

00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] System Controller (rev 20)
00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] AGP Bridge
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ISA (rev 05)
00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-768 [Opus] IDE (rev 04)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI (rev 03)
00:08.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 6c)
00:09.0 Class Class ff00: Compro Computer Services, Inc. Unknown device 4610 (rev 03)
00:10.0 PCI bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] PCI (rev 05)
01:05.0 VGA compatible controller: nVidia Corporation NV25 [GeForce4 Ti 4400] (rev a2)
02:04.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 04)
02:04.1 Input device controller: Creative Labs SB Live! Game Port (rev 01)
02:05.0 Communication controller: National Instruments PCI-GPIB (rev 01)
02:06.0 SCSI storage controller: Adaptec AHA-2930CU (rev 03)
02:07.0 Communication controller: National Instruments PCI-GPIB (rev 01)
02:08.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)

Again this does not happen with 2.6.24 vanilla. I'm not sure about
earlier RT kernels.

Regards
Mark



* Re: 2.6.24-rt1 IRQ routing anomaly
  2008-02-21 12:01 2.6.24-rt1 IRQ routing anomaly Mark Hounschell
@ 2008-02-21 12:51 ` Steven Rostedt
  2008-02-21 13:30   ` Mark Hounschell
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2008-02-21 12:51 UTC (permalink / raw)
  To: Mark Hounschell
  Cc: linux-kernel, Ingo Molnar, Mark Hounschell, Thomas Gleixner, Jon Masters

[CC'd Thomas and Jon]

Thomas, Jon, looks like someone has the funny interrupt controller.

On Thu, 21 Feb 2008, Mark Hounschell wrote:

> According to /proc/interrupts, every interrupt received by eth1 is also
> being received by the sound card EMU10K1. The problem showed itself
> first with this. The sound system was quiet BTW.
>
> It does not happen with 2.6.24 vanilla.
>
> kernel: irq 19: nobody cared (try booting with the "irqpoll" option)
> kernel: Pid: 1832, comm: IRQ-19 Not tainted 2.6.24.2-crt #2
> kernel:  [<c013d6da>] __report_bad_irq+0x36/0x75
> kernel:  [<c013d910>] note_interrupt+0x1f7/0x227
> kernel:  [<c013ce85>] thread_simple_irq+0x61/0x74
> kernel:  [<c013d455>] do_irqd+0x0/0x22f
> kernel:  [<c013d507>] do_irqd+0xb2/0x22f
> kernel:  [<c013d455>] do_irqd+0x0/0x22f
> kernel:  [<c012b137>] kthread+0x38/0x5d
> kernel:  [<c012b0ff>] kthread+0x0/0x5d
> kernel:  [<c0104c13>] kernel_thread_helper+0x7/0x10
> kernel:  =======================
> kernel: ---------------------------
> kernel: | preempt count: 00000001 ]
> kernel: | 1-level deep critical section nesting:
> kernel: ----------------------------------------
> kernel: .. [<c02b03b3>] .... __spin_lock_irq+0xe/0x1e
> kernel: .....[<00000000>] ..   ( <= _stext+0x3feff000/0x14)
> kernel:
> kernel: handlers:
> kernel: [<f4d16544>] (snd_emu10k1_interrupt+0x0/0x42c [snd_emu10k1])
> kernel: turning off IO-APIC fast mode.
> kernel: irq 19: nobody cared (try booting with the "irqpoll" option)
> kernel: Pid: 1832, comm: IRQ-19 Not tainted 2.6.24.2-crt #2
> kernel:  [<c013d6da>] __report_bad_irq+0x36/0x75
> kernel:  [<c013d910>] note_interrupt+0x1f7/0x227
> kernel:  [<c013ce85>] thread_simple_irq+0x61/0x74
> kernel:  [<c013d455>] do_irqd+0x0/0x22f
> kernel:  [<c013d507>] do_irqd+0xb2/0x22f
> kernel:  [<c013d455>] do_irqd+0x0/0x22f
> kernel:  [<c012b137>] kthread+0x38/0x5d
> kernel:  [<c012b0ff>] kthread+0x0/0x5d
> kernel:  [<c0104c13>] kernel_thread_helper+0x7/0x10
> kernel:  =======================
> kernel: ---------------------------
> kernel: | preempt count: 00000001 ]
> kernel: | 1-level deep critical section nesting:
> kernel: ----------------------------------------
> kernel: .. [<c02b03b3>] .... __spin_lock_irq+0xe/0x1e
> kernel: .....[<00000000>] ..   ( <= _stext+0x3feff000/0x14)
> kernel:
> kernel: handlers:
> kernel: [<f4d16544>] (snd_emu10k1_interrupt+0x0/0x42c [snd_emu10k1])
>
> Looking at /proc/interrupts I could see that the EMU10K1 interrupt was
> going to town. eth1 was very busy at the time.
>
> So a simple external ping test on a quiet system at runlevel 3 revealed:
>
> # cat before.ping
>            CPU0       CPU1
>   0:         85          0   IO-APIC-edge      timer
>   1:        396        420   IO-APIC-edge      i8042
>   3:          4          2   IO-APIC-edge
>   4:          5          1   IO-APIC-edge
>   6:          1          4   IO-APIC-edge      floppy
>   7:          0          0   IO-APIC-edge      parport0
>   8:          2          0   IO-APIC-edge      rtc
>   9:          0          1   IO-APIC-fasteoi   acpi
>  12:         21         84   IO-APIC-edge      i8042
>  14:       8457       8179   IO-APIC-edge      libata
>  15:       1016       1519   IO-APIC-edge      libata
>  16:         60         60   IO-APIC-fasteoi   aic7xxx
>  17:        113         96   IO-APIC-fasteoi   eth1
>  18:         44         47   IO-APIC-fasteoi
>  19:         99        114   IO-APIC-fasteoi   EMU10K1
> NMI:          0          0   Non-maskable interrupts
> LOC:      93895      94157   Local timer interrupts
> RES:       8831       8188   Rescheduling interrupts
> CAL:       4176       5267   function call interrupts
> TLB:        271        235   TLB shootdowns
> TRM:          0          0   Thermal event interrupts
> SPU:          0          0   Spurious interrupts
> ERR:          0
> MIS:          0
>
>
> Then from an external machine: ping -c10 10.10.10.200
>
>
> # cat after.ping
>            CPU0       CPU1
>   0:         85          0   IO-APIC-edge      timer
>   1:        464        432   IO-APIC-edge      i8042
>   3:          4          2   IO-APIC-edge
>   4:          5          1   IO-APIC-edge
>   6:          1          4   IO-APIC-edge      floppy
>   7:          0          0   IO-APIC-edge      parport0
>   8:          2          0   IO-APIC-edge      rtc
>   9:          0          1   IO-APIC-fasteoi   acpi
>  12:         21         84   IO-APIC-edge      i8042
>  14:       8460       8198   IO-APIC-edge      libata
>  15:       1360       1549   IO-APIC-edge      libata
>  16:         60         60   IO-APIC-fasteoi   aic7xxx
>  17:        129        102   IO-APIC-fasteoi   eth1
>  18:         44         47   IO-APIC-fasteoi
>  19:        105        130   IO-APIC-fasteoi   EMU10K1
> NMI:          0          0   Non-maskable interrupts
> LOC:     104387     104637   Local timer interrupts
> RES:       8890       8214   Rescheduling interrupts
> CAL:       4176       5267   function call interrupts
> TLB:        271        236   TLB shootdowns
> TRM:          0          0   Thermal event interrupts
> SPU:          0          0   Spurious interrupts
> ERR:          0
> MIS:          0
>
>
> 44 interrupts added to both eth1 and EMU10K1

This is a known problem. Some interrupt controllers do funny things when
an interrupt line is masked while interrupts are still enabled: they route
the interrupt to the wrong interrupt line. The only reason vanilla doesn't
show it is that vanilla runs the interrupt handler as soon as the interrupt
triggers, so it has no need to mask. RT, on the other hand, runs interrupt
handlers in threaded context, which triggers this little quirk because we
mask the line until the handler thread has run. For some strange reason,
the interrupt controller then raises the interrupt on another line while
the original line is masked.

To prove this is the problem, boot with noapic on the kernel command line:
1) the problem should disappear.
2) (I'm betting) you will see that eth1 and EMU10K1 share the same
   interrupt line.
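
Point 2 can be checked directly in /proc/interrupts, where handlers that share a line appear comma-separated on the same row. A minimal sketch in Python, assuming that layout; shared_irqs is an illustrative helper, and the sample rows (IRQ numbers, counts, and the XT-PIC-XT chip name) are made up for the example, since the noapic output was not posted in this thread:

```python
# Hypothetical helper: list IRQ lines that have more than one handler.

# Made-up sample of what a noapic boot might show; not taken from the thread.
SAMPLE = """\
           CPU0       CPU1
  0:         85          0   XT-PIC-XT        timer
 10:        120          0   XT-PIC-XT        aic7xxx
 11:        209          0   XT-PIC-XT        eth1, EMU10K1
NMI:          0          0   Non-maskable interrupts
"""


def shared_irqs(text):
    """Return {irq: [handler names]} for numbered rows with several handlers."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue  # skip summary rows such as NMI, LOC, ERR
        # Per-CPU counts, then the chip type, then the handler names.
        fields = rest.split(None, ncpus + 1)
        if len(fields) < ncpus + 2:
            continue  # no handler registered on this line
        names = [name.strip() for name in fields[ncpus + 1].split(",")]
        if len(names) > 1:
            shared[label] = names
    return shared


if __name__ == "__main__":
    for irq, names in shared_irqs(SAMPLE).items():
        print(f"IRQ {irq} is shared by: {', '.join(names)}")
```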

I see from the back trace that this is i386. We have a workaround for this
on x86_64. Jon Masters has been working on better solutions too.

-- Steve

>
>
> #lspci
>
> 00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P]
> System Controller (rev 20)
> 00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P]
> AGP Bridge
> 00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ISA (rev 05)
> 00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-768 [Opus] IDE
> (rev 04)
> 00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI (rev 03)
> 00:08.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado]
> (rev 6c)
> 00:09.0 Class Class ff00: Compro Computer Services, Inc. Unknown device
> 4610 (rev 03)
> 00:10.0 PCI bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] PCI (rev 05)
> 01:05.0 VGA compatible controller: nVidia Corporation NV25 [GeForce4 Ti
> 4400] (rev a2)
> 02:04.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 04)
> 02:04.1 Input device controller: Creative Labs SB Live! Game Port (rev 01)
> 02:05.0 Communication controller: National Instruments PCI-GPIB (rev 01)
> 02:06.0 SCSI storage controller: Adaptec AHA-2930CU (rev 03)
> 02:07.0 Communication controller: National Instruments PCI-GPIB (rev 01)
> 02:08.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado]
> (rev 78)
>
> Again this does not happen with 2.6.24 vanilla. I'm not sure about
> earlier RT kernels.
>
> Regards
> Mark
>


* Re: 2.6.24-rt1 IRQ routing anomaly
  2008-02-21 12:51 ` Steven Rostedt
@ 2008-02-21 13:30   ` Mark Hounschell
  2008-02-21 15:08     ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Hounschell @ 2008-02-21 13:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mark Hounschell, linux-kernel, Ingo Molnar, Thomas Gleixner, Jon Masters

Steven Rostedt wrote:
> [CC'd Thomas and Jon]
> 
> Thomas, Jon, looks like someone has the funny interrupt controller.
> 
> On Thu, 21 Feb 2008, Mark Hounschell wrote:
> 
>> According to /proc/interrupts, every interrupt received by eth1 is also
>> being received by the sound card EMU10K1. The problem showed itself
>> first with this. The sound system was quiet BTW.
>>
>> It does not happen with 2.6.24 vanilla.
>>
>> kernel: irq 19: nobody cared (try booting with the "irqpoll" option)
>> kernel: Pid: 1832, comm: IRQ-19 Not tainted 2.6.24.2-crt #2
>> kernel:  [<c013d6da>] __report_bad_irq+0x36/0x75
>> kernel:  [<c013d910>] note_interrupt+0x1f7/0x227
>> kernel:  [<c013ce85>] thread_simple_irq+0x61/0x74
>> kernel:  [<c013d455>] do_irqd+0x0/0x22f
>> kernel:  [<c013d507>] do_irqd+0xb2/0x22f
>> kernel:  [<c013d455>] do_irqd+0x0/0x22f
>> kernel:  [<c012b137>] kthread+0x38/0x5d
>> kernel:  [<c012b0ff>] kthread+0x0/0x5d
>> kernel:  [<c0104c13>] kernel_thread_helper+0x7/0x10
>> kernel:  =======================
>> kernel: ---------------------------
>> kernel: | preempt count: 00000001 ]
>> kernel: | 1-level deep critical section nesting:
>> kernel: ----------------------------------------
>> kernel: .. [<c02b03b3>] .... __spin_lock_irq+0xe/0x1e
>> kernel: .....[<00000000>] ..   ( <= _stext+0x3feff000/0x14)
>> kernel:
>> kernel: handlers:
>> kernel: [<f4d16544>] (snd_emu10k1_interrupt+0x0/0x42c [snd_emu10k1])
>> kernel: turning off IO-APIC fast mode.
>> kernel: irq 19: nobody cared (try booting with the "irqpoll" option)
>> kernel: Pid: 1832, comm: IRQ-19 Not tainted 2.6.24.2-crt #2
>> kernel:  [<c013d6da>] __report_bad_irq+0x36/0x75
>> kernel:  [<c013d910>] note_interrupt+0x1f7/0x227
>> kernel:  [<c013ce85>] thread_simple_irq+0x61/0x74
>> kernel:  [<c013d455>] do_irqd+0x0/0x22f
>> kernel:  [<c013d507>] do_irqd+0xb2/0x22f
>> kernel:  [<c013d455>] do_irqd+0x0/0x22f
>> kernel:  [<c012b137>] kthread+0x38/0x5d
>> kernel:  [<c012b0ff>] kthread+0x0/0x5d
>> kernel:  [<c0104c13>] kernel_thread_helper+0x7/0x10
>> kernel:  =======================
>> kernel: ---------------------------
>> kernel: | preempt count: 00000001 ]
>> kernel: | 1-level deep critical section nesting:
>> kernel: ----------------------------------------
>> kernel: .. [<c02b03b3>] .... __spin_lock_irq+0xe/0x1e
>> kernel: .....[<00000000>] ..   ( <= _stext+0x3feff000/0x14)
>> kernel:
>> kernel: handlers:
>> kernel: [<f4d16544>] (snd_emu10k1_interrupt+0x0/0x42c [snd_emu10k1])
>>
>> Looking at /proc/interrupts I could see that the EMU10K1 interrupt was
>> going to town. eth1 was very busy at the time.
>>
>> So a simple external ping test on a quiet system at runlevel 3 revealed:
>>
>> # cat before.ping
>>            CPU0       CPU1
>>   0:         85          0   IO-APIC-edge      timer
>>   1:        396        420   IO-APIC-edge      i8042
>>   3:          4          2   IO-APIC-edge
>>   4:          5          1   IO-APIC-edge
>>   6:          1          4   IO-APIC-edge      floppy
>>   7:          0          0   IO-APIC-edge      parport0
>>   8:          2          0   IO-APIC-edge      rtc
>>   9:          0          1   IO-APIC-fasteoi   acpi
>>  12:         21         84   IO-APIC-edge      i8042
>>  14:       8457       8179   IO-APIC-edge      libata
>>  15:       1016       1519   IO-APIC-edge      libata
>>  16:         60         60   IO-APIC-fasteoi   aic7xxx
>>  17:        113         96   IO-APIC-fasteoi   eth1
>>  18:         44         47   IO-APIC-fasteoi
>>  19:         99        114   IO-APIC-fasteoi   EMU10K1
>> NMI:          0          0   Non-maskable interrupts
>> LOC:      93895      94157   Local timer interrupts
>> RES:       8831       8188   Rescheduling interrupts
>> CAL:       4176       5267   function call interrupts
>> TLB:        271        235   TLB shootdowns
>> TRM:          0          0   Thermal event interrupts
>> SPU:          0          0   Spurious interrupts
>> ERR:          0
>> MIS:          0
>>
>>
>> Then from an external machine: ping -c10 10.10.10.200
>>
>>
>> # cat after.ping
>>            CPU0       CPU1
>>   0:         85          0   IO-APIC-edge      timer
>>   1:        464        432   IO-APIC-edge      i8042
>>   3:          4          2   IO-APIC-edge
>>   4:          5          1   IO-APIC-edge
>>   6:          1          4   IO-APIC-edge      floppy
>>   7:          0          0   IO-APIC-edge      parport0
>>   8:          2          0   IO-APIC-edge      rtc
>>   9:          0          1   IO-APIC-fasteoi   acpi
>>  12:         21         84   IO-APIC-edge      i8042
>>  14:       8460       8198   IO-APIC-edge      libata
>>  15:       1360       1549   IO-APIC-edge      libata
>>  16:         60         60   IO-APIC-fasteoi   aic7xxx
>>  17:        129        102   IO-APIC-fasteoi   eth1
>>  18:         44         47   IO-APIC-fasteoi
>>  19:        105        130   IO-APIC-fasteoi   EMU10K1
>> NMI:          0          0   Non-maskable interrupts
>> LOC:     104387     104637   Local timer interrupts
>> RES:       8890       8214   Rescheduling interrupts
>> CAL:       4176       5267   function call interrupts
>> TLB:        271        236   TLB shootdowns
>> TRM:          0          0   Thermal event interrupts
>> SPU:          0          0   Spurious interrupts
>> ERR:          0
>> MIS:          0
>>
>>
>> 44 interrupts added to both eth1 and EMU10K1
> 
> This is a known problem. Some interrupt controllers do funny things when
> an interrupt line is masked while interrupts are still enabled: they route
> the interrupt to the wrong interrupt line. The only reason vanilla doesn't
> show it is that vanilla runs the interrupt handler as soon as the interrupt
> triggers, so it has no need to mask. RT, on the other hand, runs interrupt
> handlers in threaded context, which triggers this little quirk because we
> mask the line until the handler thread has run. For some strange reason,
> the interrupt controller then raises the interrupt on another line while
> the original line is masked.
> 
> To prove this is the problem, boot with noapic in the kernel command line.
> 1) the problem should disappear.
> 2) (I'm betting) you see that the eth and EMU10K1 share the same
>    interrupt line.
> 

Yep, you were right. They do share the same IRQ and the problem does go away.
Unfortunately, I can't run this machine with noapic. I need IRQ affinity.

> I see from the back trace that this is i386. We have a workaround for this
> on x86_64. Jon Masters has been working on better solutions too.
> 

Yes, this is i386. Is it just one particular interrupt controller that acts this way, or are there more?
I guess I have at least one of "those funny interrupt controllers". Hmm.


>>
>> #lspci
>>
>> 00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P]
>> System Controller (rev 20)
>> 00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P]
>> AGP Bridge
>> 00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ISA (rev 05)
>> 00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-768 [Opus] IDE
>> (rev 04)
>> 00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI (rev 03)
>> 00:08.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado]
>> (rev 6c)
>> 00:09.0 Class Class ff00: Compro Computer Services, Inc. Unknown device
>> 4610 (rev 03)
>> 00:10.0 PCI bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] PCI (rev 05)
>> 01:05.0 VGA compatible controller: nVidia Corporation NV25 [GeForce4 Ti
>> 4400] (rev a2)
>> 02:04.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 04)
>> 02:04.1 Input device controller: Creative Labs SB Live! Game Port (rev 01)
>> 02:05.0 Communication controller: National Instruments PCI-GPIB (rev 01)
>> 02:06.0 SCSI storage controller: Adaptec AHA-2930CU (rev 03)
>> 02:07.0 Communication controller: National Instruments PCI-GPIB (rev 01)
>> 02:08.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado]
>> (rev 78)
>>
>> Again this does not happen with 2.6.24 vanilla. I'm not sure about
>> earlier RT kernels.
>>
>> Regards
>> Mark
>>
 



* Re: 2.6.24-rt1 IRQ routing anomaly
  2008-02-21 13:30   ` Mark Hounschell
@ 2008-02-21 15:08     ` Steven Rostedt
  2008-02-21 22:41       ` Mark Hounschell
  2008-03-03  4:31       ` Jon Masters
  0 siblings, 2 replies; 11+ messages in thread
From: Steven Rostedt @ 2008-02-21 15:08 UTC (permalink / raw)
  To: Mark Hounschell
  Cc: Mark Hounschell, linux-kernel, Ingo Molnar, Thomas Gleixner, Jon Masters


On Thu, 21 Feb 2008, Mark Hounschell wrote:
> >
> > To prove this is the problem, boot with noapic in the kernel command line.
> > 1) the problem should disappear.
> > 2) (I'm betting) you see that the eth and EMU10K1 share the same
> >    interrupt line.
> >
>
> Yep, you were right. They do share the same IRQ and the problem does go away.
> Unfortunately  I can't run this machine with noapic. I need irq affinity.
>

Thanks for verifying. OK, I'll see if I can get the workaround on i386. I
thought I saw a patch someplace where someone ported that workaround. I'm
still sorting out bugs in -rt2 (which is why it is still not out). I'll see
if I can find time to get the workaround to i386 for -rt2 as well.

-- Steve



* Re: 2.6.24-rt1 IRQ routing anomaly
  2008-02-21 15:08     ` Steven Rostedt
@ 2008-02-21 22:41       ` Mark Hounschell
  2008-02-21 23:10         ` Steven Rostedt
  2008-03-03  4:31       ` Jon Masters
  1 sibling, 1 reply; 11+ messages in thread
From: Mark Hounschell @ 2008-02-21 22:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mark Hounschell, linux-kernel, Ingo Molnar, Thomas Gleixner, Jon Masters

Steven Rostedt wrote:
> On Thu, 21 Feb 2008, Mark Hounschell wrote:
>>> To prove this is the problem, boot with noapic in the kernel command line.
>>> 1) the problem should disappear.
>>> 2) (I'm betting) you see that the eth and EMU10K1 share the same
>>>    interrupt line.
>>>
>> Yep, you were right. They do share the same IRQ and the problem does go away.
>> Unfortunately  I can't run this machine with noapic. I need irq affinity.
>>
> 
> Thanks for verifying. OK, I'll see if I can get the workaround on i386. I
> thought I saw a patch someplace where someone ported that workaround. I'm
> still sorting out bugs in -rt2 (why it is still not out). I'll see if I
> can find time to get the workaround to i386 for -rt2 as well.
> 
> -- Steve

I'll be happy to test it for ya

Mark



* Re: 2.6.24-rt1 IRQ routing anomaly
  2008-02-21 22:41       ` Mark Hounschell
@ 2008-02-21 23:10         ` Steven Rostedt
  0 siblings, 0 replies; 11+ messages in thread
From: Steven Rostedt @ 2008-02-21 23:10 UTC (permalink / raw)
  To: Mark Hounschell
  Cc: Mark Hounschell, linux-kernel, Ingo Molnar, Thomas Gleixner, Jon Masters


On Thu, 21 Feb 2008, Mark Hounschell wrote:
> >
> > Thanks for verifying. OK, I'll see if I can get the workaround on i386. I
> > thought I saw a patch someplace where someone ported that workaround. I'm
> > still sorting out bugs in -rt2 (why it is still not out). I'll see if I
> > can find time to get the workaround to i386 for -rt2 as well.
> >
> > -- Steve
>
> I'll be happy to test it for ya
>

I'm looking at it now, and it seems the workaround is already there for
i386. I don't know why it didn't trigger for you.

Can you send me, privately, a full dmesg and the output of lspci -vvv?

Thanks,

-- Steve



* Re: 2.6.24-rt1 IRQ routing anomaly
  2008-02-21 15:08     ` Steven Rostedt
  2008-02-21 22:41       ` Mark Hounschell
@ 2008-03-03  4:31       ` Jon Masters
  2008-03-03 13:24         ` Steven Rostedt
  1 sibling, 1 reply; 11+ messages in thread
From: Jon Masters @ 2008-03-03  4:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mark Hounschell, Mark Hounschell, linux-kernel, Ingo Molnar,
	Thomas Gleixner


On Thu, 2008-02-21 at 10:08 -0500, Steven Rostedt wrote:
> On Thu, 21 Feb 2008, Mark Hounschell wrote:
> > >
> > > To prove this is the problem, boot with noapic in the kernel command line.
> > > 1) the problem should disappear.
> > > 2) (I'm betting) you see that the eth and EMU10K1 share the same
> > >    interrupt line.
> > >
> >
> > Yep, you were right. They do share the same IRQ and the problem does go away.
> > Unfortunately  I can't run this machine with noapic. I need irq affinity.
> >
> 
> Thanks for verifying. OK, I'll see if I can get the workaround on i386.

What's the situation with this one? Want me to look at it?

Jon.




* Re: 2.6.24-rt1 IRQ routing anomaly
  2008-03-03  4:31       ` Jon Masters
@ 2008-03-03 13:24         ` Steven Rostedt
  2008-03-03 13:36           ` Mark Hounschell
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2008-03-03 13:24 UTC (permalink / raw)
  To: Jon Masters
  Cc: Mark Hounschell, Mark Hounschell, linux-kernel, Ingo Molnar,
	Thomas Gleixner


On Sun, 2 Mar 2008, Jon Masters wrote:
> On Thu, 2008-02-21 at 10:08 -0500, Steven Rostedt wrote:
> > On Thu, 21 Feb 2008, Mark Hounschell wrote:
> > > >
> > > > To prove this is the problem, boot with noapic in the kernel command line.
> > > > 1) the problem should disappear.
> > > > 2) (I'm betting) you see that the eth and EMU10K1 share the same
> > > >    interrupt line.
> > > >
> > >
> > > Yep, you were right. They do share the same IRQ and the problem does go away.
> > > Unfortunately  I can't run this machine with noapic. I need irq affinity.
> > >
> >
> > Thanks for verifying. OK, I'll see if I can get the workaround on i386.
>
> What's the situation with this one? Want me to look at it?

Jon,

The board Mark has may just be some cheap hardware. It would be great
if RT worked on all boxes, but I'm not sure we want to spend time on
"broken-by-design" hardware while there are bigger fish in the sea to catch.

The current workaround is just to use noapic, although I do understand
that's not good enough for Mark. But I'm sure Mark has other hardware
he could use ;-)

-- Steve



* Re: 2.6.24-rt1 IRQ routing anomaly
  2008-03-03 13:24         ` Steven Rostedt
@ 2008-03-03 13:36           ` Mark Hounschell
  2008-03-03 14:31             ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Hounschell @ 2008-03-03 13:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jon Masters, Mark Hounschell, linux-kernel, Ingo Molnar, Thomas Gleixner

Steven Rostedt wrote:
> On Sun, 2 Mar 2008, Jon Masters wrote:
>> On Thu, 2008-02-21 at 10:08 -0500, Steven Rostedt wrote:
>>> On Thu, 21 Feb 2008, Mark Hounschell wrote:
>>>>> To prove this is the problem, boot with noapic in the kernel command line.
>>>>> 1) the problem should disappear.
>>>>> 2) (I'm betting) you see that the eth and EMU10K1 share the same
>>>>>    interrupt line.
>>>>>
>>>> Yep, you were right. They do share the same IRQ and the problem does go away.
>>>> Unfortunately  I can't run this machine with noapic. I need irq affinity.
>>>>
>>> Thanks for verifying. OK, I'll see if I can get the workaround on i386.
>> What's the situation with this one? Want me to look at it?
> 
> Jon,
> 
> The board Mark has may just be some cheap hardware. It would be great
> that RT would work on all boxes, but I'm not sure we want to spend time on
> "broken-by-design" hardware while there's bigger fish in the sea to catch.
> 
> The current workaround is just use noapic, although I do understand
> that's not good enough for Mark. But I'm sure Mark has other hardware
> he could use ;-)
> 
> -- Steve
> 
> 
Steve is correct. I have plenty of other choices. Steve, you mentioned that a "workaround"
is in -rt3. My only concern is whether the current workaround for other hardware really
works, or whether I may see this again with other "non-cheap" hardware.

Is there a known list of hardware this problem is seen on?

Thanks
Mark


* Re: 2.6.24-rt1 IRQ routing anomaly
  2008-03-03 13:36           ` Mark Hounschell
@ 2008-03-03 14:31             ` Steven Rostedt
  2008-03-03 17:00               ` Jon Masters
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2008-03-03 14:31 UTC (permalink / raw)
  To: Mark Hounschell
  Cc: Jon Masters, Mark Hounschell, linux-kernel, Ingo Molnar, Thomas Gleixner



On Mon, 3 Mar 2008, Mark Hounschell wrote:
> >
> >
> Steve is correct. I have plenty of other choices. Steve, you mentioned, a "work around"
> is in -rt3. My only concern is does the current "work around" for other hardware really
> work or may I see this again with other "non cheap" hardware?

We have a work around for secondary IO-APICs, which sometimes show this
behaviour (on non-cheap hardware).

The problem we have is that for some reason, IO-APICs with PCI-X chips get
confused when the interrupt line is masked. The work around that we
currently have (besides noapic) is to switch the interrupt from level to
edge triggered instead of masking it. We mark the interrupt as IN_PROGRESS
and if new interrupts come in from the same line, we can just flag them as
pending and return.
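
To make the IN_PROGRESS/pending idea concrete, here is a rough userspace
sketch. It is not the -rt code; every sk_* name and both flag values are
made up for illustration, and the demo handler fakes "a new interrupt came
in on the same line" by calling back into the entry point.

```c
#include <stdbool.h>

/* Hypothetical status bits, named after the flags described above. */
#define SK_IRQ_INPROGRESS 0x1u
#define SK_IRQ_PENDING    0x2u

struct sk_irq_desc {
    unsigned int status;
    void (*handler)(struct sk_irq_desc *desc); /* device handler */
    unsigned int handled;                      /* times the handler ran */
    unsigned int simulate_rearrival;           /* demo only: fake re-raises */
};

/*
 * Entry point for every interrupt on the line. Instead of masking the
 * line, a nested arrival only sets PENDING and returns; the outer call
 * re-runs the handler until nothing is pending any more.
 * Returns true if this call actually ran the handler loop.
 */
static bool sk_handle_edge_style(struct sk_irq_desc *desc)
{
    if (desc->status & SK_IRQ_INPROGRESS) {
        desc->status |= SK_IRQ_PENDING;   /* remember it, do not recurse */
        return false;
    }

    desc->status |= SK_IRQ_INPROGRESS;
    do {
        desc->status &= ~SK_IRQ_PENDING;
        desc->handler(desc);
        desc->handled++;
    } while (desc->status & SK_IRQ_PENDING);
    desc->status &= ~SK_IRQ_INPROGRESS;

    return true;
}

/* Demo handler: pretends the line fires again while we are running. */
static void sk_demo_handler(struct sk_irq_desc *desc)
{
    if (desc->simulate_rearrival > 0) {
        desc->simulate_rearrival--;
        sk_handle_edge_style(desc);       /* only sets PENDING */
    }
}
```

With simulate_rearrival set to 1, a single call to sk_handle_edge_style()
runs the handler twice and leaves the status word clear again.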

This works for some boxes. But this can cause problems for other boxes
that don't like having the interrupt being switched from level to edge and
back. For these boxes, the workaround must be disabled.

Then we have a third set of boxes where the masking causes wrong
interrupts (like what you were seeing) and the level/edge hack also causes
problems. For these boxes the only current solution is noapic.


The last solution to this, which is also our long term fix, is to add a
new interface for devices to let them disable the interrupt at the device
level (not masked at the IO-APIC). The disadvantage to this is the longer
time to traverse the PCI Bus, and added traffic on it. But the advantage
is not only a fix to this problem, but a way to fine-grain the priorities of
interrupts further than just the interrupt line. With this fix we can
create an interrupt thread per device, which would also make the use of
tasklets and softirqs obsolete. But this has a long way to go still.

>
> Is there a known list of hardware this problem is seen on?

We know of some, the list is still growing.

Jon, where are we on the "blacklist"?

-- Steve



* Re: 2.6.24-rt1 IRQ routing anomaly
  2008-03-03 14:31             ` Steven Rostedt
@ 2008-03-03 17:00               ` Jon Masters
  0 siblings, 0 replies; 11+ messages in thread
From: Jon Masters @ 2008-03-03 17:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mark Hounschell, Mark Hounschell, linux-kernel, Ingo Molnar,
	Thomas Gleixner

On Mon, 2008-03-03 at 09:31 -0500, Steven Rostedt wrote:
> 
> On Mon, 3 Mar 2008, Mark Hounschell wrote:
> > >
> > >
> > Steve is correct. I have plenty of other choices. Steve, you mentioned, a "work around"
> > is in -rt3. My only concern is does the current "work around" for other hardware really
> > work or may I see this again with other "non cheap" hardware?
> 
> We have a work around for secondary IOAPICS, which sometimes shows this
> behaviour (on non-cheap hardware).

Yeah, and weirdly, we never see it on the primary IO-APIC. However, I
think that's just wonderfully convenient. It usually happens on IO-APICs
integrated into e.g. a PCIe chipset that do weird legacy mode support,
but those can equally well happen at the root node of modern systems
(and hence it could happen on the primary one in theory too).

As you can see, we've had a lot of trouble actually tracking down
exactly what circumstances cause this to happen - and it depends upon
the overall system design for what legacy lines are wired together :)

> The problem we have is that for some reason, IO-APICS with PCI-X chips

*and* PCIe. In fact, it's happening in chipsets that either have or are
closely connected with another that provides an IO-APIC. For example,
the HT-1000/HT-2000 combos that talk to one another over HT can be
affected by this problem too.

> The work around that we
> currently have (besides noapic) is to switch the interrupt to an edged
> level interrupt instead of masking. We mark the interrupt as IN_PROGRESS
> and if new interrupts come in from the same line, we can just flag them as
> pending and return.

Yup. That works quite well, for some boxen.

> This works for some boxes. But this can cause problems for other boxes
> that don't like having the interrupt being switched from level to edge and
> back. For these boxes, the workaround must be disabled.

Yup. And there are some errata too that mean some chips explicitly won't
support masking as we do, some won't support level/edge/level, some
won't do both, some will just put their fingers in their ears.

> Then we have a third set of boxes where the masking causes wrong
> interrupts (like what you were seeing) and the level/edge hack also causes
> problems. For these boxes the only current solution is noapic.

And a specific warning message in the RHEL-RT Red Hat kernel. I can put
together a semi-upstream acceptable patch for this, but the quirk we
have is really a temporary workaround at the moment. We abuse PCI
quirking code to enable a global IO-APIC level/edge/level hack.
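
Roughly, the quirk approach amounts to a table lookup that flips one
global switch. The sketch below is plain userspace C, not the RHEL-RT
patch: the sk_* names, the three modes and the vendor/device IDs are all
placeholders I made up to show the shape of it.

```c
#include <stddef.h>
#include <stdint.h>

/* The three classes of boxes discussed in this thread (names are mine). */
enum sk_ioapic_mode {
    SK_MODE_MASK,        /* default: masking at the IO-APIC works */
    SK_MODE_LEVEL_EDGE,  /* needs the level/edge/level workaround */
    SK_MODE_WARN_NOAPIC  /* neither works: warn and suggest noapic */
};

struct sk_quirk {
    uint16_t vendor;
    uint16_t device;
    enum sk_ioapic_mode mode;
};

/* Placeholder IDs only; these do not name real hardware. */
static const struct sk_quirk sk_quirks[] = {
    { 0xfff0, 0x0001, SK_MODE_LEVEL_EDGE },
    { 0xfff0, 0x0002, SK_MODE_WARN_NOAPIC },
};

/* One global setting, since the hack is enabled system-wide. */
static enum sk_ioapic_mode sk_global_mode = SK_MODE_MASK;

/* Called once per discovered PCI device; the most restrictive match wins. */
static void sk_apply_quirks(uint16_t vendor, uint16_t device)
{
    size_t i;

    for (i = 0; i < sizeof(sk_quirks) / sizeof(sk_quirks[0]); i++) {
        if (sk_quirks[i].vendor == vendor &&
            sk_quirks[i].device == device &&
            sk_quirks[i].mode > sk_global_mode)
            sk_global_mode = sk_quirks[i].mode;
    }
}
```

Because the setting is global, one affected bridge anywhere in the box
changes the behaviour for every line, which is part of why this is only
a temporary workaround.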

Note that we can't do what "mainstream" Linux does in -RT, namely using
the fastEOI path, because Fast EOI interrupt handling (in which Linux
talks directly to the LAPIC (local APIC within the CPU) and tells it to
shut the hell up the IO-APIC that just raised an interrupt) relies upon
serial interrupt delivery and complete handling of an interrupt before
handling of the next one. The "new world order" basically completely
screws over the design of this hardware, but does the right thing (IMHO
the way EOI is implemented actually needs rethinking anyway).

Also note that, for completeness, I wrote a blog entry on modern IO-APIC
handling a while back, for when this issue would finally come up here:
http://www.jonmasters.org/blog/?p=641

> The last solution to this, which is also our long term fix, is to add a
> new interface for devices to let them disable the interrupt at the device
> level (not masked at the IO-APIC). The disadvantage to this is the longer
> time to traverse the PCI Bus, and added traffic on it. But the advantage
> is not only a fix to this problem, but a way to fine-grain the priorities of
> interrupts further than just the interrupt line. With this fix we can
> create an interrupt thread per device. Also making the use of tasklets and
> softirqs obsolete.  But this has a long way to go still.

Indeed. I have submitted a paper to OLS on this, and will follow up with
a patch series in due course, once I have time to get a test system
fully converted over. We don't plan to re-introduce top and bottom
halves, but we do plan on this:

*). Interrupt is asserted.
*). Quiesce handler called to shut up device, verify interrupt source,
subsequently calls the regular EOI handling path to shut up the IO-APIC.
*). Linux schedules the corresponding per-device interrupt thread.
*). Interrupt thread runs, is schedulable, can do away with lots of
pointless use of tasklets, and other "DPC" in kernel IRQ paths.
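
The four steps above can be sketched in userspace C as follows. This is
only a model of the flow under my own assumptions, not the planned patch
series: the sk_* names and the struct fields are invented, and the
"thread" is an ordinary function you call by hand.

```c
#include <stdbool.h>
#include <stddef.h>

struct sk_device {
    bool irq_asserted;      /* device is currently raising its interrupt */
    bool irq_enabled;       /* device-level interrupt enable bit */
    bool thread_pending;    /* per-device thread has been woken */
    unsigned int work_done; /* times the thread serviced the device */
};

static unsigned int sk_eoi_count;

/* Stand-in for the regular EOI path that quiets the IO-APIC. */
static void sk_eoi(void)
{
    sk_eoi_count++;
}

/* Quiesce: verify the source, then silence the device itself. */
static bool sk_quiesce(struct sk_device *dev)
{
    if (!dev->irq_asserted)
        return false;          /* not ours; shared line, other device */
    dev->irq_enabled = false;  /* disabled at the device, line not masked */
    return true;
}

/* The line fired: quiesce each sharer, EOI, wake only the real sources. */
static unsigned int sk_line_fired(struct sk_device **devs, size_t n)
{
    unsigned int woken = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (sk_quiesce(devs[i])) {
            devs[i]->thread_pending = true;
            woken++;
        }
    }
    sk_eoi();
    return woken;
}

/* Per-device thread body: service the device, then re-enable it. */
static void sk_irq_thread_run(struct sk_device *dev)
{
    if (!dev->thread_pending)
        return;
    dev->thread_pending = false;
    dev->irq_asserted = false; /* servicing clears the interrupt cause */
    dev->work_done++;
    dev->irq_enabled = true;
}
```

With two devices on one line and only the first asserting,
sk_line_fired() wakes one thread and leaves the second device alone,
which is the per-device (not per-line) behaviour we are after.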

Besides, I quite like this for actually getting interrupt threads
upstream in due course. We can make a really nice and compelling
argument that splitting interrupts into two pieces like this (really
just making the threads work "right" - one thread per device, and not
one thread per line...which doesn't scale) makes IRQ threads the right
solution for Linux, removes complexity, and is just a good idea. Also,
we can even make systems more robust once interrupt threads that lock
hard can't actually always bring down the system around you.

> >
> > Is there a known list of hardware this problem is seen on?
> 
> We know of some, the list is still growing.
> 
> Jon, where are we on the "blacklist"?

The blacklist is simply a list of PCI quirks. Do you think I should
clean this up for upstream? I mean, really, I was just hoping to get the
longer term interrupt rewrite done, quietly sneak it out before OLS,
then wait for the flamewar...I guess you let the cat out of the bag.

:)

Jon.




end of thread, other threads:[~2008-03-03 17:01 UTC | newest]

Thread overview: 11+ messages
2008-02-21 12:01 2.6.24-rt1 IRQ routing anomaly Mark Hounschell
2008-02-21 12:51 ` Steven Rostedt
2008-02-21 13:30   ` Mark Hounschell
2008-02-21 15:08     ` Steven Rostedt
2008-02-21 22:41       ` Mark Hounschell
2008-02-21 23:10         ` Steven Rostedt
2008-03-03  4:31       ` Jon Masters
2008-03-03 13:24         ` Steven Rostedt
2008-03-03 13:36           ` Mark Hounschell
2008-03-03 14:31             ` Steven Rostedt
2008-03-03 17:00               ` Jon Masters
