LKML Archive on lore.kernel.org
* net: tx timeouts with skge, 8139too, dmfe drivers/NICs
@ 2008-02-25 20:37 Marin Mitov
  2008-02-25 20:53 ` Jeff Garzik
  0 siblings, 1 reply; 7+ messages in thread
From: Marin Mitov @ 2008-02-25 20:37 UTC (permalink / raw)
  To: linux-kernel

Hi all,

I experience very rare freezes at heavy outbound traffic 
(sending ~4GB DVD image to another host(s) on the same LAN) 
using skge driver (NIC on the mobo) as well as (recently tested)
using rtl8139 or dmfe NICs on the PCI bus. There is a single 
switch between them (tested with another one just to exclude
a faulty switch).

skge <--> Marvell 88E8001 chip
8139too <--> Realtek 8136B chip
dmfe <--> Davicom DM9102 chip

Symptoms are similar: tx timeouts and no more net activity.
The KDE desktop works, computational programs work, and the machine
is usable, but it cannot ping, nor can it be pinged anymore.
rmmod && modprobe of the respective module repairs the problem.
Simple surfing/e-mailing from it does not trigger the problem.

The machine is used as LTSP server for old PCs (as X terminals)
(mostly outbound traffic) and is not usable as such due to this
problem.

The kernel is 2.6.24.2-SMP/x86_32 (PREEMPT or not - NO difference).

Since this happens with 3 different NICs/drivers, could it be
a problem in the networking subsystem (common to all of them)?

Since many people are working on this machine, only limited
testing can be done.

Thank you in advance for your suggestions, help (and patches).

Regards.

Marin Mitov


* Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs
  2008-02-25 20:37 net: tx timeouts with skge, 8139too, dmfe drivers/NICs Marin Mitov
@ 2008-02-25 20:53 ` Jeff Garzik
  2008-02-25 21:36   ` Marin Mitov
  2008-03-12 11:41   ` Marin Mitov
  0 siblings, 2 replies; 7+ messages in thread
From: Jeff Garzik @ 2008-02-25 20:53 UTC (permalink / raw)
  To: Marin Mitov; +Cc: linux-kernel

Marin Mitov wrote:
> Hi all,
> 
> I experience very rare freezes at heavy outbound traffic 
> (sending ~4GB DVD image to another host(s) on the same LAN) 
> using skge driver (NIC on the mobo) as well as (recently tested)
> using rtl8139 or dmfe NICs on the PCI bus. There is a single 
> switch between them (tested with another one just to exclude
> a faulty switch).
> 
> skge <--> Marvell 88E8001 chip
> 8139too <--> Realtek 8136B chip
> dmfe <--> Davicom DM9102 chip
> 
> Symptoms are similar: tx timeouts and no more net activity.
> KDE desktop works, computational programs - work, the machine 
> is usable, but cannot ping, nor can be ping-ed anymore.
> rmmod && modprobe the respective modules repairs the problem.
> Simple surfing/e-mailing from it do not trigger the problem.
> 
> The machine is used as LTSP server for old PCs (as X terminals)
> (mostly outbound traffic) and is not usable as such due to this
> problem.
> 
> The kernel is 2.6.24.2-SMP/x86_32 (PREEMPT or not - NO difference).
> 
> As far as this happens with 3 different NICs/drivers could it be
> a problem in the (common for all of them) networking subsystem?

A TX timeout (like hardware timeouts, in general) is a very generic 
behavior, with many causes.

In general, when you see timeouts with varied hardware and drivers, 
you're almost always dealing with a problem with interrupt delivery, or 
a generic system problem, rather than bugs in the network stack or all 
three drivers.

	Jeff





* Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs
  2008-02-25 20:53 ` Jeff Garzik
@ 2008-02-25 21:36   ` Marin Mitov
  2008-02-25 21:42     ` Stephen Hemminger
  2008-03-12 11:41   ` Marin Mitov
  1 sibling, 1 reply; 7+ messages in thread
From: Marin Mitov @ 2008-02-25 21:36 UTC (permalink / raw)
  To: linux-kernel

On Monday 25 February 2008 10:53:01 pm you wrote:
> Marin Mitov wrote:
> > Hi all,
> >
> > I experience very rare freezes at heavy outbound traffic
> > (sending ~4GB DVD image to another host(s) on the same LAN)
> > using skge driver (NIC on the mobo) as well as (recently tested)
> > using rtl8139 or dmfe NICs on the PCI bus. There is a single
> > switch between them (tested with another one just to exclude
> > a faulty switch).
> >
> > skge <--> Marvell 88E8001 chip
> > 8139too <--> Realtek 8136B chip
> > dmfe <--> Davicom DM9102 chip
> >
> > Symptoms are similar: tx timeouts and no more net activity.
> > KDE desktop works, computational programs - work, the machine
> > is usable, but cannot ping, nor can be ping-ed anymore.
> > rmmod && modprobe the respective modules repairs the problem.
> > Simple surfing/e-mailing from it do not trigger the problem.
> >
> > The machine is used as LTSP server for old PCs (as X terminals)
> > (mostly outbound traffic) and is not usable as such due to this
> > problem.
> >
> > The kernel is 2.6.24.2-SMP/x86_32 (PREEMPT or not - NO difference).
> >
> > As far as this happens with 3 different NICs/drivers could it be
> > a problem in the (common for all of them) networking subsystem?
>
> A TX timeout (like hardware timeouts, in general) is a very generic
> behavior, with many causes.
>
> In general, when you see timeouts with varied hardware and drivers,
> you're almost always dealing with a problem with interrupt delivery, or

All the drivers are using #INTA on PCI bus (no MSI/MSI-X).

"problem with interrupt delivery" - you suspect interrupts incorrectly
 disabled (lost) in the drivers or faulty hardware(motherboard)?

> a generic system problem, rather than bugs in the network stack or all

"a generic system problem" - bad config or faulty hardware(motherboard)?

Where should I look for the problem?

Just for info: the system is very stable - uptime (if no power outages) could
be a month or more (rebooting for kernel changes or updates).

Marin Mitov

> three drivers.
>
> 	Jeff
>
>
>




* Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs
  2008-02-25 21:36   ` Marin Mitov
@ 2008-02-25 21:42     ` Stephen Hemminger
  2008-02-25 22:09       ` Marin Mitov
  0 siblings, 1 reply; 7+ messages in thread
From: Stephen Hemminger @ 2008-02-25 21:42 UTC (permalink / raw)
  To: Marin Mitov; +Cc: linux-kernel

On Mon, 25 Feb 2008 23:36:06 +0200
Marin Mitov <mitov@issp.bas.bg> wrote:

> On Monday 25 February 2008 10:53:01 pm you wrote:
> > Marin Mitov wrote:
> > > Hi all,
> > >
> > > I experience very rare freezes at heavy outbound traffic
> > > (sending ~4GB DVD image to another host(s) on the same LAN)
> > > using skge driver (NIC on the mobo) as well as (recently tested)
> > > using rtl8139 or dmfe NICs on the PCI bus. There is a single
> > > switch between them (tested with another one just to exclude
> > > a faulty switch).
> > >
> > > skge <--> Marvell 88E8001 chip
> > > 8139too <--> Realtek 8136B chip
> > > dmfe <--> Davicom DM9102 chip
> > >
> > > Symptoms are similar: tx timeouts and no more net activity.
> > > KDE desktop works, computational programs - work, the machine
> > > is usable, but cannot ping, nor can be ping-ed anymore.
> > > rmmod && modprobe the respective modules repairs the problem.
> > > Simple surfing/e-mailing from it do not trigger the problem.
> > >
> > > The machine is used as LTSP server for old PCs (as X terminals)
> > > (mostly outbound traffic) and is not usable as such due to this
> > > problem.
> > >
> > > The kernel is 2.6.24.2-SMP/x86_32 (PREEMPT or not - NO difference).
> > >
> > > As far as this happens with 3 different NICs/drivers could it be
> > > a problem in the (common for all of them) networking subsystem?
> >
> > A TX timeout (like hardware timeouts, in general) is a very generic
> > behavior, with many causes.
> >
> > In general, when you see timeouts with varied hardware and drivers,
> > you're almost always dealing with a problem with interrupt delivery, or
> 
> All the drivers are using #INTA on PCI bus (no MSI/MSI-X).
> 
> "problem with interrupt delivery" - you suspect interrupts incorrectly
>  disabled (lost) in the drivers or faulty hardware(motherboard)?
> 
> > a generic system problem, rather than bugs in the network stack or all
> 
> "a generic system problem" - bad config or faulty hardware(motherboard)?
> 
> Where I should look for the problem?
> 
> Just for info: the system is very stable - uptime (if no power outages) could
> be a month or more (rebooting for kernel changes or updates).
> 
> Marin Mitov

Make sure the interrupt is showing up as level triggered in /proc/interrupts.
The BIOS may be configuring it as edge-triggered and that won't work with
Ethernet drivers that use NAPI.
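
A minimal sketch of that check, not part of the original message. It
assumes the 2.6.24-style /proc/interrupts layout quoted in this thread
(IRQ number, one count per CPU, a CHIP-HANDLER column, then the device
names). The function name irq_trigger is hypothetical. Treating a
"-fasteoi" entry as level-triggered is an assumption about how x86 kernels
of that era label level-triggered IO-APIC lines; verify it against the
kernel you are running.

```python
def irq_trigger(text, device):
    """Return (irq, chip, mode) for `device` in /proc/interrupts-style text.

    Assumes the 2.6.24-era layout: 'IRQ: count... CHIP-HANDLER dev[, dev]'.
    Returns None if the device is not listed.
    """
    lines = text.splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or not label.strip().isdigit() or len(fields) < ncpu + 2:
            continue                      # skip NMI:, LOC:, ERR: and similar rows
        chip = fields[ncpu]
        devices = [d.strip() for d in " ".join(fields[ncpu + 1:]).split(",")]
        if device not in devices:
            continue
        if chip.endswith("-edge"):
            mode = "edge"
        elif chip.endswith(("-level", "-fasteoi")):  # assumption, see note above
            mode = "level"
        else:
            mode = "unknown"
        return label.strip(), chip, mode
    return None

# Usage on a live system (hypothetical device name):
#   with open("/proc/interrupts") as f:
#       print(irq_trigger(f.read(), "eth0"))
```

Newer kernels print the chip and handler in separate columns, so the
parsing would need adjusting there.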


* Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs
  2008-02-25 21:42     ` Stephen Hemminger
@ 2008-02-25 22:09       ` Marin Mitov
  2008-02-25 22:57         ` Stephen Hemminger
  0 siblings, 1 reply; 7+ messages in thread
From: Marin Mitov @ 2008-02-25 22:09 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: linux-kernel

Hi Stephen,

> Make sure the interrupt is showing up as level triggered in
> /proc/interrupts. The BIOS may be configuring it as edge-triggered and that
> won't work with Ethernet drivers that use NAPI.

for: skge <--> Marvell 88E8001 chip
cat /proc/interrupts gives (AMD64 X2 SMP):
           CPU0       CPU1
 21:   11691000   11933174   IO-APIC-fasteoi   eth0

It is neither IO-APIC-edge, nor IO-APIC-level.

Could it be the problem?

Marin Mitov




* Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs
  2008-02-25 22:09       ` Marin Mitov
@ 2008-02-25 22:57         ` Stephen Hemminger
  0 siblings, 0 replies; 7+ messages in thread
From: Stephen Hemminger @ 2008-02-25 22:57 UTC (permalink / raw)
  To: Marin Mitov; +Cc: linux-kernel

On Tue, 26 Feb 2008 00:09:46 +0200
Marin Mitov <mitov@issp.bas.bg> wrote:

> Hi Stephen,
> 
> > Make sure the interrupt is showing up as level triggered in
> > /proc/interrupts. The BIOS may be configuring it as edge-triggered and that
> > won't work with Ethernet drivers that use NAPI.
> 
> for: skge <--> Marvell 88E8001 chip
> cat /proc/interrupts gives (AMD64 X2 SMP):
>            CPU0       CPU1
>  21:   11691000   11933174   IO-APIC-fasteoi   eth0
> 
> It is neither IO-APIC-edge, nor IO-APIC-level.
> 
> Could it be the problem?
> 
> Marin Mitov

No, that isn't the problem.


* Re: net: tx timeouts with skge, 8139too, dmfe drivers/NICs
  2008-02-25 20:53 ` Jeff Garzik
  2008-02-25 21:36   ` Marin Mitov
@ 2008-03-12 11:41   ` Marin Mitov
  1 sibling, 0 replies; 7+ messages in thread
From: Marin Mitov @ 2008-03-12 11:41 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-kernel

On Monday 25 February 2008 10:53:01 pm you wrote:
> > As far as this happens with 3 different NICs/drivers could it be
> > a problem in the (common for all of them) networking subsystem?
> 
> A TX timeout (like hardware timeouts, in general) is a very generic 
> behavior, with many causes.
> 
> In general, when you see timeouts with varied hardware and drivers, 
> you're almost always dealing with a problem with interrupt delivery, or 
> a generic system problem, rather than bugs in the network stack or all 
> three drivers.

Well, this gave me a direction for research.

Using printk in various parts of the skge driver, as well as modifying it to
collect additional statistics (read via ethtool -S eth0), I made the following
observations when it freezes:

1. Interrupts are generated (the status register shows that interrupts are
pending and that they are NOT masked), but irq_handler is NOT invoked.

2. cat /proc/interrupts shows that while skge is working, both CPUs receive
IRQs. When skge freezes, NO CPU receives skge's interrupts: CPU[0] receives
all other IRQs except skge's, while CPU[1] does not receive any IRQ above
the line (see below), but still receives LOC: and RES: below the line.
#cat /proc/interrupts
           CPU0       CPU1
  0:         85          1   IO-APIC-edge      timer
  1:      34078          9   IO-APIC-edge      i8042
  6:          1          4   IO-APIC-edge      floppy
  7:        216          1   IO-APIC-edge      parport0
  8:          0          1   IO-APIC-edge      rtc
  9:          0          0   IO-APIC-fasteoi   acpi
 12:     893003    1390080   IO-APIC-edge      i8042
 14:      59682     286628   IO-APIC-edge      ide0
 15:    5458527         12   IO-APIC-edge      ide1
 16:   60547054          1   IO-APIC-fasteoi   mga@pci:0000:01:00.0
 17:    1634623     914447   IO-APIC-fasteoi   sata_via
 18:       7768          7   IO-APIC-fasteoi   sata_promise
 19:          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4, uhci_hcd:usb5
 20:     535380          1   IO-APIC-fasteoi   VIA8237
 21:   30780380   31448992   IO-APIC-fasteoi   eth0
---------line added by me----------------------------------
NMI:          0          0   Non-maskable interrupts
LOC:  154311126  154736178   Local timer interrupts
RES:    1325239    2423719   Rescheduling interrupts
CAL:      40893        456   function call interrupts
TLB:      52651      29184   TLB shootdowns
TRM:          0          0   Thermal event interrupts
SPU:          0          0   Spurious interrupts
ERR:          0
MIS:          0

It looks as if IRQs are somehow disabled (at the IO-APIC/LAPIC?)
at some priority and below.
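
The comparison described above (which IRQ lines stop advancing, and on
which CPU) can be sketched by diffing two /proc/interrupts snapshots taken
a few seconds apart. This is only an illustration under the assumption of
the layout shown above. The helper names parse_counts, irq_deltas,
stalled_irqs and silent_cpus are hypothetical. A line that does not advance
may simply belong to an idle device, so the result is a list of candidates,
not a diagnosis.

```python
def parse_counts(text):
    """Map each row label ('21', 'LOC', ...) to its list of per-CPU counts."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue                      # not an 'XX: ...' row
        fields = rest.split()
        row = []
        while fields and len(row) < ncpu and fields[0].isdigit():
            row.append(int(fields.pop(0)))
        counts[label.strip()] = row
    return counts

def irq_deltas(before, after):
    """Per-CPU count increase for every row present in both snapshots."""
    a, b = parse_counts(before), parse_counts(after)
    return {k: [y - x for x, y in zip(a[k], b[k])] for k in b if k in a}

def stalled_irqs(before, after):
    """Numbered IRQ lines whose counts did not advance on any CPU."""
    deltas = irq_deltas(before, after)
    return sorted((k for k, d in deltas.items() if k.isdigit() and not any(d)),
                  key=int)

def silent_cpus(before, after):
    """CPUs on which no numbered IRQ line advanced between the snapshots."""
    rows = [d for k, d in irq_deltas(before, after).items() if k.isdigit()]
    ncpu = max(len(d) for d in rows)
    return [cpu for cpu in range(ncpu)
            if not any(d[cpu] for d in rows if cpu < len(d))]
```

On a live system the two snapshots would come from reading
/proc/interrupts twice while the outbound transfer is running.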

I should add that after a freeze, ifconfig down/up (plus restoring the
routing info) does NOT solve the problem, while rmmod/modprobe of the driver
makes it work again.

So I moved the request_irq()/free_irq() calls from the driver's
probe()/release() methods to its open()/stop() methods. With this
modification, when skge freezes, ifconfig down/up makes it work again
(no need to rmmod/modprobe).

That makes me think that skge's IRQ is somehow being disabled OUTSIDE the
driver, and that free_irq()/request_irq() clears the condition. Am I wrong?

Could it be possible? How could this happen?

Any comments/suggestions/patches are welcome.

Regards

Marin Mitov




Thread overview: 7+ messages
2008-02-25 20:37 net: tx timeouts with skge, 8139too, dmfe drivers/NICs Marin Mitov
2008-02-25 20:53 ` Jeff Garzik
2008-02-25 21:36   ` Marin Mitov
2008-02-25 21:42     ` Stephen Hemminger
2008-02-25 22:09       ` Marin Mitov
2008-02-25 22:57         ` Stephen Hemminger
2008-03-12 11:41   ` Marin Mitov
