LKML Archive on lore.kernel.org
* kernel BUG at page_alloc.c:98 -- compiling with distcc
@ 2004-04-02 10:21 Marco Fais
  2004-04-02 13:15 ` Marco Roeland
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Marco Fais @ 2004-04-02 10:21 UTC (permalink / raw)
  To: linux-kernel

Hi!


[1.] Kernel panic while using distcc

[2.] I have 5-6 development Linux systems that we use without problems
under a normal development workload. When we tried distcc to speed up
compilation, we got a fully reproducible kernel panic within seconds of
starting a compile. The panic happens *only* when a system is "remotely
controlled" (its distcc daemon receives source files from remote
systems, compiles them and sends back the compiled objects). The system
that starts the distcc compile doesn't show any kernel panic, while the
same system used as a "remote compiler system" dies very quickly.

[3.] Keywords: distcc BUG page_alloc.c

[4.] Linux version 2.4.25 (root@test1) (gcc version 3.2 20020903 (Red
Hat Linux 8.0 3.2-7)) #1 mer mar 31 10:28:36 CEST 2004

[5.]
ksymoops 2.4.5 on i686 2.4.25.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.25/ (default)
     -m /boot/System.map-2.4.25 (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

kernel BUG at page_alloc.c:98!
invalid operand: 0000
CPU:    0
EIP:    0010:[<c01372ae>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010202
eax: 00000002   ebx: c14b3f00   ecx: c14b3f00   edx: 00000000
esi: 00000000   edi: dec11340   ebp: c02f1d04   esp: c02f1cd4
ds: 0018   es: 0018   ss: 0018
Process swapper (pid: 0, stackpage=c02f1000)
Stack: ddd46000 c02f1cfc c0135a76 c158f6f0 de9bcdf4 ddd45800 de9bcdf4 005207dc
       ddd45800 00000001 dec4d894 dec11340 c02f1d18 c021667b 00000282 dec4d894
       dec4d8c4 c02f1d2c c02166b4 dec4d894 dec4d894 dec4d894 c02f1d44 c0216816
Call Trace:    [<c0135a76>] [<c021667b>] [<c02166b4>] [<c0216816>] [<c023be39>]
  [<c023c385>] [<c023f51c>] [<c02465a9>] [<c0246a76>] [<c022dad0>] [<c022dc25>]
  [<c0222780>] [<c022dad0>] [<c022d88f>] [<c022dad0>] [<c022de3a>] [<e08d7eab>]
  [<c021ad14>] [<c021ae3f>] [<c021af5a>] [<c0121cd7>] [<c010a66d>] [<c01070a0>]
  [<c010cb58>] [<c01070a0>] [<c01070c6>] [<c0107142>] [<c0105000>]
Code: 0f 0b 62 00 f7 60 27 c0 e9 ad fd ff ff 90 8d 74 26 00 55 89


>>EIP; c01372ae <__free_pages_ok+26e/280>   <=====

>>ebx; c14b3f00 <_end+116e728/204d48a8>
>>ecx; c14b3f00 <_end+116e728/204d48a8>
>>edi; dec11340 <_end+1e8cbb68/204d48a8>
>>ebp; c02f1d04 <init_task_union+1d04/2000>
>>esp; c02f1cd4 <init_task_union+1cd4/2000>

Trace; c0135a76 <kmem_cache_free_one+f6/210>
Trace; c021667b <skb_release_data+6b/90>
Trace; c02166b4 <kfree_skbmem+14/70>
Trace; c0216816 <__kfree_skb+106/160>
Trace; c023be39 <tcp_clean_rtx_queue+139/330>
Trace; c023c385 <tcp_ack+c5/380>
Trace; c023f51c <tcp_rcv_state_process+19c/a90>
Trace; c02465a9 <tcp_v4_do_rcv+a9/130>
Trace; c0246a76 <tcp_v4_rcv+446/560>
Trace; c022dad0 <ip_local_deliver_finish+0/180>
Trace; c022dc25 <ip_local_deliver_finish+155/180>
Trace; c0222780 <nf_hook_slow+b0/170>
Trace; c022dad0 <ip_local_deliver_finish+0/180>
Trace; c022d88f <ip_local_deliver+4f/70>
Trace; c022dad0 <ip_local_deliver_finish+0/180>
Trace; c022de3a <ip_rcv_finish+1ea/270>
Trace; e08d7eab <[8139too]rtl8139_rx_interrupt+6b/3b0>
Trace; c021ad14 <netif_receive_skb+c4/180>
Trace; c021ae3f <process_backlog+6f/120>
Trace; c021af5a <net_rx_action+6a/100>
Trace; c0121cd7 <do_softirq+97/a0>
Trace; c010a66d <do_IRQ+bd/f0>
Trace; c01070a0 <default_idle+0/30>
Trace; c010cb58 <call_do_IRQ+5/d>
Trace; c01070a0 <default_idle+0/30>
Trace; c01070c6 <default_idle+26/30>
Trace; c0107142 <cpu_idle+42/60>
Trace; c0105000 <_stext+0/0>

Code;  c01372ae <__free_pages_ok+26e/280>
00000000 <_EIP>:
Code;  c01372ae <__free_pages_ok+26e/280>   <=====
   0:   0f 0b                     ud2a      <=====
Code;  c01372b0 <__free_pages_ok+270/280>
   2:   62 00                     bound  %eax,(%eax)
Code;  c01372b2 <__free_pages_ok+272/280>
   4:   f7 60 27                  mull   0x27(%eax)
Code;  c01372b5 <__free_pages_ok+275/280>
   7:   c0 e9 ad                  shr    $0xad,%cl
Code;  c01372b8 <__free_pages_ok+278/280>
   a:   fd                        std
Code;  c01372b9 <__free_pages_ok+279/280>
   b:   ff                        (bad)
Code;  c01372ba <__free_pages_ok+27a/280>
   c:   ff 90 8d 74 26 00         call   *0x26748d(%eax)
Code;  c01372c0 <rmqueue+0/230>
  12:   55                        push   %ebp
Code;  c01372c1 <rmqueue+1/230>
  13:   89 00                     mov    %eax,(%eax)

<0>Kernel panic: Aiee, killing interrupt handler!

1 warning issued.  Results may not be reliable.


[6.] Launch distccd --daemon on the affected system, then on the remote
host, set DISTCC_HOSTS="<problematic remote system>" and launch, for
example, a kernel compile: make -j2 CC=distcc bzImage.

[7.] All systems are Athlon XP 2600+, on a VIA KT400 chipset (various
motherboard vendors). All use ext3 filesystems, with various Red Hat
distributions (8.0, 9, RHEL3 -- not using NPTL)

[7.1.]
Gnu C                  3.2
Gnu make               3.79.1
util-linux             2.11r
mount                  2.11r
modutils               2.4.18
e2fsprogs              1.27
jfsutils               1.0.17
reiserfsprogs          3.6.2
pcmcia-cs              3.1.31
quota-tools            3.06.
PPP                    2.4.1
isdn4k-utils           3.1pre4
Linux C Library        2.3.2
Dynamic linker (ldd)   2.3.2
Procps                 2.0.7
Net-tools              1.60
Kbd                    1.06
Sh-utils               2.0.12

[7.2.] Processor information (from /proc/cpuinfo):

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 8
model name      : AMD Athlon(tm) XP 2600+
stepping        : 1
cpu MHz         : 2075.355
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 4141.87

[7.3.] Module information (from /proc/modules):

Module                  Size  Used by    Not tainted
nfs                    78968   1  (autoclean)
binfmt_misc             7304   1
nfsd                   80304   8  (autoclean)
lockd                  58480   1  (autoclean) [nfs nfsd]
sunrpc                 84188   1  (autoclean) [nfs nfsd lockd]
8139too                19784   2
mii                     3944   0  [8139too]
crc32                   3680   0  [8139too]
iptable_filter          2412   0  (autoclean) (unused)
ip_tables              15392   1  [iptable_filter]
ohci1394               33608   0  (unused)
ieee1394               64676   0  [ohci1394]
mousedev                5428   0  (unused)
keybdev                 3072   0  (unused)
input                   5824   0  [mousedev keybdev]
hid                    12248   0  (unused)
rtc                     8764   0  (autoclean)

[7.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem)

0000-001f : dma1
0020-003f : pic1
0040-005f : timer
0060-006f : keyboard
0070-007f : rtc
0080-008f : dma page reg
00a0-00bf : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : ide1
01f0-01f7 : ide0
02f8-02ff : serial(auto)
0376-0376 : ide1
03c0-03df : vga+
03f6-03f6 : ide0
03f8-03ff : serial(auto)
0cf8-0cff : PCI conf1
c000-c0ff : Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+
  c000-c0ff : 8139too
c400-c4ff : Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (#2)
  c400-c4ff : 8139too
d400-d47f : VIA Technologies, Inc. IEEE 1394 Host Controller
d800-d8ff : C-Media Electronics Inc CM8738
dc00-dc1f : VIA Technologies, Inc. USB
e000-e01f : VIA Technologies, Inc. USB (#2)
e400-e41f : VIA Technologies, Inc. USB (#3)
e800-e80f : VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE
  e800-e807 : ide0
  e808-e80f : ide1

00000000-0009ffff : System RAM
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000f0000-000fffff : System ROM
00100000-1ffeffff : System RAM
  00100000-002667cb : Kernel code
  002667cc-002ef563 : Kernel data
1fff0000-1fff2fff : ACPI Non-volatile Storage
1fff3000-1fffffff : ACPI Tables
d0000000-dfffffff : PCI Bus #01
  d0000000-d7ffffff : nVidia Corporation NV17 [GeForce4 MX 440]
  d8000000-d807ffff : nVidia Corporation NV17 [GeForce4 MX 440]
e0000000-e3ffffff : VIA Technologies, Inc. VT8377 [KT400 AGP] Host Bridge
e4000000-e5ffffff : PCI Bus #01
  e4000000-e4ffffff : nVidia Corporation NV17 [GeForce4 MX 440]
e6020000-e60200ff : Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (#2)
  e6020000-e60200ff : 8139too
e6022000-e60220ff : Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+
  e6022000-e60220ff : 8139too
e6023000-e60237ff : VIA Technologies, Inc. IEEE 1394 Host Controller
  e6023000-e60237ff : ohci1394
e6024000-e60240ff : VIA Technologies, Inc. USB 2.0
fec00000-fec00fff : reserved
fee00000-fee00fff : reserved
ffff0000-ffffffff : reserved

[7.5.] PCI information

00:00.0 Host bridge: VIA Technologies, Inc.: Unknown device 3189
        Subsystem: VIA Technologies, Inc.: Unknown device 3189
        Flags: bus master, 66Mhz, medium devsel, latency 8
        Memory at e0000000 (32-bit, prefetchable) [size=64M]
        Capabilities: [a0] AGP version 2.0
        Capabilities: [c0] Power Management version 2

00:01.0 PCI bridge: VIA Technologies, Inc.: Unknown device b168 (prog-if 00 [Normal decode])
        Flags: bus master, 66Mhz, medium devsel, latency 0
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        Memory behind bridge: e4000000-e5ffffff
        Prefetchable memory behind bridge: d0000000-dfffffff
        Capabilities: [80] Power Management version 2

00:09.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
        Subsystem: Realtek Semiconductor Co., Ltd. RT8139
        Flags: bus master, medium devsel, latency 32, IRQ 17
        I/O ports at c000 [size=256]
        Memory at e6022000 (32-bit, non-prefetchable) [size=256]
        Capabilities: [50] Power Management version 2

00:0b.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
        Subsystem: Realtek Semiconductor Co., Ltd. RT8139
        Flags: bus master, medium devsel, latency 32, IRQ 19
        I/O ports at c400 [size=256]
        Memory at e6020000 (32-bit, non-prefetchable) [size=256]
        Capabilities: [50] Power Management version 2

00:0e.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host Controller (rev 46) (prog-if 10 [OHCI])
        Subsystem: Biostar Microtech Int'l Corp: Unknown device 4200
        Flags: bus master, medium devsel, latency 32, IRQ 18
        Memory at e6023000 (32-bit, non-prefetchable) [size=2K]
        I/O ports at d400 [size=128]
        Capabilities: [50] Power Management version 2

00:0f.0 Multimedia audio controller: C-Media Electronics Inc CM8738 (rev 10)
        Subsystem: Biostar Microtech Int'l Corp: Unknown device 8738
        Flags: bus master, medium devsel, latency 32, IRQ 19
        I/O ports at d800 [size=256]
        Capabilities: [c0] Power Management version 2

00:10.0 USB Controller: VIA Technologies, Inc. USB (rev 80) (prog-if 00 [UHCI])
        Subsystem: VIA Technologies, Inc. USB
        Flags: bus master, medium devsel, latency 32, IRQ 21
        I/O ports at dc00 [size=32]
        Capabilities: [80] Power Management version 2

00:10.1 USB Controller: VIA Technologies, Inc. USB (rev 80) (prog-if 00 [UHCI])
        Subsystem: VIA Technologies, Inc. USB
        Flags: bus master, medium devsel, latency 32, IRQ 21
        I/O ports at e000 [size=32]
        Capabilities: [80] Power Management version 2

00:10.2 USB Controller: VIA Technologies, Inc. USB (rev 80) (prog-if 00 [UHCI])
        Subsystem: VIA Technologies, Inc. USB
        Flags: bus master, medium devsel, latency 32, IRQ 21
        I/O ports at e400 [size=32]
        Capabilities: [80] Power Management version 2

00:10.3 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 82) (prog-if 20 [EHCI])
        Subsystem: VIA Technologies, Inc. USB 2.0
        Flags: bus master, medium devsel, latency 32, IRQ 19
        Memory at e6024000 (32-bit, non-prefetchable) [size=256]
        Capabilities: [80] Power Management version 2

00:11.0 ISA bridge: VIA Technologies, Inc. VT8233A ISA Bridge
        Subsystem: VIA Technologies, Inc. VT8233A ISA Bridge
        Flags: bus master, stepping, medium devsel, latency 0
        Capabilities: [c0] Power Management version 2

00:11.1 IDE interface: VIA Technologies, Inc. VT82C586B PIPC Bus Master IDE (rev 06) (prog-if 8a [Master SecP PriP])
        Subsystem: VIA Technologies, Inc. VT82C586B PIPC Bus Master IDE
        Flags: bus master, medium devsel, latency 32, IRQ 11
        I/O ports at e800 [size=16]
        Capabilities: [c0] Power Management version 2

01:00.0 VGA compatible controller: nVidia Corporation NV17 [GeForce4 MX440] (rev a3) (prog-if 00 [VGA])
        Flags: bus master, 66Mhz, medium devsel, latency 32, IRQ 16
        Memory at e4000000 (32-bit, non-prefetchable) [size=16M]
        Memory at d0000000 (32-bit, prefetchable) [size=128M]
        Memory at d8000000 (32-bit, prefetchable) [size=512K]
        Expansion ROM at <unassigned> [disabled] [size=128K]
        Capabilities: [60] Power Management version 2
        Capabilities: [44] AGP version 2.0

         CPU0
  0:     292363    IO-APIC-edge  timer
  1:          3    IO-APIC-edge  keyboard
  2:          0          XT-PIC  cascade
  8:          1    IO-APIC-edge  rtc
 12:          0          XT-PIC  PS/2 Mouse
 14:       8958    IO-APIC-edge  ide0
 15:          4    IO-APIC-edge  ide1
 17:       6482   IO-APIC-level  eth0
 18:          2   IO-APIC-level  ohci1394
 19:         28   IO-APIC-level  eth1
NMI:          0
LOC:     292280
ERR:          0
MIS:          0

[7.7.] Other information that might be relevant to the problem

Other systems (DL-360G3 dual Xeon 2.8 GHz, RHEL3, SMP or UP kernel)
don't show the problem.






* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-04-02 10:21 kernel BUG at page_alloc.c:98 -- compiling with distcc Marco Fais
@ 2004-04-02 13:15 ` Marco Roeland
       [not found]   ` <6.0.0.22.2.20040402163334.02abe7d8@pop.localnet>
  2004-04-02 23:36 ` Andrew Morton
  2004-05-04  1:07 ` Marcelo Tosatti
  2 siblings, 1 reply; 27+ messages in thread
From: Marco Roeland @ 2004-04-02 13:15 UTC (permalink / raw)
  To: Marco Fais; +Cc: linux-kernel

On Friday April 2nd 2004 Marco Fais wrote:

> [...] 
 
> When compiling with distcc the local system doesn't show any kernel
> panic, while the same system used as a "remote compiler system" dies
> very quickly.

> >>EIP; c01372ae <__free_pages_ok+26e/280>   <=====
> ... 
> Trace; e08d7eab <[8139too]rtl8139_rx_interrupt+6b/3b0>

> <0>Kernel panic: Aiee, killing interrupt handler!

From a very superficial examination of your data, it looks like there is
something going wrong in the interrupt handling of the driver for (one
of) the network cards.

Distcc can generate a lot of network traffic. You might experiment with
switching the role of the two network cards (in case there might be
something wrong with the hardware of one of them) or use the '--listen'
directive in the distccd configuration to do so.

If the panic is indeed caused by the network driver, then it should also
be possible to trigger and debug this with a tool like netcat: listen on
the panicking box with 'nc -l someport' and send some stuff from another
box with 'cat /dev/zero | nc panicker someport' (or vice versa).

Sadly, none of this will solve your problem of course, but it might
pinpoint the cause somewhat more accurately, leading hopefully to a
solution!
-- 
Marco Roeland


* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
       [not found]   ` <6.0.0.22.2.20040402163334.02abe7d8@pop.localnet>
@ 2004-04-02 15:05     ` Marco Roeland
  2004-04-05 10:42       ` Marco Fais
  2004-04-05 17:03       ` Max Valdez
  0 siblings, 2 replies; 27+ messages in thread
From: Marco Roeland @ 2004-04-02 15:05 UTC (permalink / raw)
  To: Marco Fais; +Cc: linux-kernel

On Friday April 2nd 2004 Marco Fais wrote:

> Mmmh, all the servers use an RTL-8139 compatible card, with the same 
> 8139too driver. So this can be the problem.

Hey, I'm by no means an expert. Suggesting the driver is to blame was
mostly based on the fact that compiling locally worked, while compiling
from a remote machine triggered a panic. The rest of your description
below indicates that it probably *isn't* the driver.
 
> But at this moment I'm doing a kernel compile while receiving and sending
> huge amounts of data using netcat, as you suggested... and it works perfectly.

> OK, next I will test the second network card on the server, just to rule out
> the possibility of a hardware failure -- but I have 4 other servers that
> show the same behaviour, so I don't think it's caused by faulty hardware.

If 4 other servers show the same behaviour, and netcatting a lot of data
doesn't panic the machine, that strongly suggests that the network card
and driver are innocent! I thought only one machine had the problem.
 
> Running this test for about an hour, using all the available bandwidth on
> the NIC, while compiling the kernel in a loop... no problem. Using distcc,
> compiling the same files causes a kernel panic in a few seconds.
> So this test doesn't show the problem, but I still think that the network
> card driver (or the hardware) is involved.

Why do you think so? It seems there's nothing wrong with it; you've just
tested that.

One last suggestion:

Have you tried a local distcc compile, but specifying the host name as
its IP address or its real name? Distcc treats 'localhost' differently,
but if it sees an IP address it will use the network route. As specified
in the man page this is slower, but if there's something peculiar in
the interaction of distcc with the network layer, then perhaps this
triggers it. You can also use the '--verbose' option on distccd; perhaps
it reports something useful before panicking.
-- 
Marco Roeland


* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-04-02 10:21 kernel BUG at page_alloc.c:98 -- compiling with distcc Marco Fais
  2004-04-02 13:15 ` Marco Roeland
@ 2004-04-02 23:36 ` Andrew Morton
  2004-04-05 10:47   ` Marco Fais
  2004-05-04  1:07 ` Marcelo Tosatti
  2 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2004-04-02 23:36 UTC (permalink / raw)
  To: Marco Fais; +Cc: linux-kernel, netdev


(linux-2.4.25) 

Marco Fais <marco.fais@abbeynet.it> wrote:
>
> kernel BUG at page_alloc.c:98!
> 

uh-oh.

> 
> > >EIP; c01372ae <__free_pages_ok+26e/280>   <=====
> 
> > >ebx; c14b3f00 <_end+116e728/204d48a8>
> > >ecx; c14b3f00 <_end+116e728/204d48a8>
> > >edi; dec11340 <_end+1e8cbb68/204d48a8>
> > >ebp; c02f1d04 <init_task_union+1d04/2000>
> > >esp; c02f1cd4 <init_task_union+1cd4/2000>
> 
> Trace; c0135a76 <kmem_cache_free_one+f6/210>
> Trace; c021667b <skb_release_data+6b/90>
> Trace; c02166b4 <kfree_skbmem+14/70>
> Trace; c0216816 <__kfree_skb+106/160>
> Trace; c023be39 <tcp_clean_rtx_queue+139/330>
> Trace; c023c385 <tcp_ack+c5/380>
> Trace; c023f51c <tcp_rcv_state_process+19c/a90>
> Trace; c02465a9 <tcp_v4_do_rcv+a9/130>
> Trace; c0246a76 <tcp_v4_rcv+446/560>
> Trace; c022dad0 <ip_local_deliver_finish+0/180>
> Trace; c022dc25 <ip_local_deliver_finish+155/180>
> Trace; c0222780 <nf_hook_slow+b0/170>
> Trace; c022dad0 <ip_local_deliver_finish+0/180>
> Trace; c022d88f <ip_local_deliver+4f/70>
> Trace; c022dad0 <ip_local_deliver_finish+0/180>
> Trace; c022de3a <ip_rcv_finish+1ea/270>
> Trace; e08d7eab <[8139too]rtl8139_rx_interrupt+6b/3b0>
> Trace; c021ad14 <netif_receive_skb+c4/180>
> Trace; c021ae3f <process_backlog+6f/120>
> Trace; c021af5a <net_rx_action+6a/100>
> Trace; c0121cd7 <do_softirq+97/a0>
> Trace; c010a66d <do_IRQ+bd/f0>

distcc uses sendfile().  The 8139too hardware and driver are
zerocopy-capable so the kernel uses zerocopy direct-from-user-pages for
sendfile().

The bug is that the networking layer is releasing the final ref to user
pages from softirq context.  Those pages are still on the page LRU so
__free_pages_ok() will take them off.

Problem is, removing these pages from the LRU requires that the
pagemap_lru_lock be taken, and that lock may not be taken from interrupt
context.   So we go BUG instead.

This was all discussed fairly extensively a couple of years back and I
thought it ended up being fixed.
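
The sequence described above can be sketched as a small userspace model in
C. This is only an illustration under stated assumptions, not kernel code:
struct model_page, model_put_page() and the PUT_* results are hypothetical
names, and the on_lru flag merely stands in for a page being on the LRU and
needing pagemap_lru_lock to come off it. The point is the order of events:
whoever drops the *last* reference does the freeing, and if that happens in
interrupt context while the page is still on the LRU, the real
__free_pages_ok() goes BUG.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these are not the real kernel types. */
struct model_page {
    int  refcount;   /* outstanding references to the page */
    bool on_lru;     /* page is still linked on the LRU list */
};

enum put_result { PUT_STILL_REFERENCED, PUT_FREED, PUT_WOULD_BUG };

/*
 * Drop one reference. If it was the last one, the page has to come off
 * the LRU first. That needs the LRU lock, which may not be taken from
 * interrupt context, so the model reports PUT_WOULD_BUG at the point
 * where the kernel would call BUG().
 */
static enum put_result model_put_page(struct model_page *page, bool in_interrupt)
{
    assert(page->refcount > 0);

    if (--page->refcount > 0)
        return PUT_STILL_REFERENCED;

    if (page->on_lru) {
        if (in_interrupt)
            return PUT_WOULD_BUG;
        page->on_lru = false;   /* stands in for lock + unlink from the LRU */
    }
    return PUT_FREED;
}
```

In this model, a page referenced by both the process and an in-flight skb
is harmless as long as the process drops its reference last. When the
process lets go first and the skb is freed later from softirq, the final
put happens with in_interrupt set and the model returns PUT_WOULD_BUG.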


* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-04-02 15:05     ` Marco Roeland
@ 2004-04-05 10:42       ` Marco Fais
  2004-04-05 11:46         ` Marco Roeland
  2004-04-05 17:03       ` Max Valdez
  1 sibling, 1 reply; 27+ messages in thread
From: Marco Fais @ 2004-04-05 10:42 UTC (permalink / raw)
  To: Marco Roeland; +Cc: linux-kernel

Marco Roeland wrote:

>>Mmmh, all the servers use an RTL-8139 compatible card, with the same 
>>8139too driver. So this can be the problem.
> Hey, I'm by no means an expert. Suggesting the driver is to blame was
> mostly based on the fact that compiling locally worked, and from a
> remote machine triggered a panic. The rest of your description below
> indicates that it probably *isn't* the driver.

I was not saying *this is the problem*, just noticing that all the 
systems that show this problem have this network card, while the other 
systems that are working perfectly are using other network hardware 
(e100 driver) :)

>>OK, next I will test the second network card on the server, just to rule out
>>the possibility of a hardware failure -- but I have 4 other servers that
>>show the same behaviour, so I don't think it's caused by faulty hardware.
> If 4 other servers show the same behaviour, and netcatting a lot of data
> doesn't panic the machine, that strongly suggests that the network card
> and driver are innocent! I thought only one machine had the problem.

If you read Andrew's message, it seems that distcc uses a function that
triggers the problem -- sendfile() -- so, if netcat doesn't use it, it's
clear why it doesn't panic the kernel.

> Have you tried a local distcc compile, but specifying the host name as
> its IP address or its real name? Distcc treats 'localhost' differently,
> but if it sees an IP address it will use the network route. As specified

Good test.

Yeah, kernel panic in a few seconds. Using localhost instead, the
compile runs perfectly for hours.
So it's definitely an issue related to distcc AND networking (and
probably the interaction between network driver and kernel).

Thank you again for your advice!



* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-04-02 23:36 ` Andrew Morton
@ 2004-04-05 10:47   ` Marco Fais
  2004-04-05 10:56     ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Marco Fais @ 2004-04-05 10:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, netdev

Andrew Morton wrote:

>>kernel BUG at page_alloc.c:98!
> uh-oh.

That was the same thing I said when I saw all the LEDs blinking
on *all* the keyboards ... :)

> distcc uses sendfile().  The 8139too hardware and driver are
> zerocopy-capable so the kernel uses zerocopy direct-from-user-pages for
> sendfile().

OK. Other servers with the e100 driver don't show the problem. Does this
mean that they're not "zerocopy-capable"?

> This was all discussed fairly extensively a couple of years back and I
> thought it ended up being fixed.

Are there any workarounds for this, until the problem is corrected?

Thank you very much.




* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-04-05 10:47   ` Marco Fais
@ 2004-04-05 10:56     ` Andrew Morton
  2004-04-05 13:58       ` Marco Fais
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2004-04-05 10:56 UTC (permalink / raw)
  To: Marco Fais; +Cc: linux-kernel, netdev

Marco Fais <marco.fais@abbeynet.it> wrote:
>
> Andrew Morton wrote:
> 
> >>kernel BUG at page_alloc.c:98!
> > uh-oh.
> 
> That was the same thing I said when I saw all the LEDs blinking
> on *all* the keyboards ... :)
> 
> > distcc uses sendfile().  The 8139too hardware and driver are
> > zerocopy-capable so the kernel uses zerocopy direct-from-user-pages for
> > sendfile().
> 
> OK. Other servers with the e100 driver don't show the problem. Does this
> mean that they're not "zerocopy-capable"?

They are.  It could be a timing thing.

> > This was all discussed fairly extensively a couple of years back and I
> > thought it ended up being fixed.
> 
> Are there any workarounds for this, until the problem is corrected?

This will probably make it go away.

--- linux-2.4.26-rc1/drivers/net/8139too.c	2004-03-27 22:06:18.000000000 -0800
+++ 24/drivers/net/8139too.c	2004-04-05 03:54:50.478692968 -0700
@@ -983,7 +983,7 @@ static int __devinit rtl8139_init_one (s
 	 * through the use of skb_copy_and_csum_dev we enable these
 	 * features
 	 */
-	dev->features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_HIGHDMA;
+	dev->features |= NETIF_F_SG | NETIF_F_HIGHDMA;
 
 	dev->irq = pdev->irq;
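
The idea behind dropping NETIF_F_HW_CSUM can be sketched in userspace C. This
is a simplified model and rests on an assumption that is not spelled out in
this thread: that the TCP sendfile() path only takes the zero-copy route when
the device advertises both scatter-gather and some form of checksum offload,
and otherwise falls back to copying. The MODEL_F_* names echo the flags in
the patch, but their values here are illustrative, and model_can_zerocopy()
and model_clear_feature() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; the names echo the patch, the values are made up. */
enum {
    MODEL_F_SG      = 1 << 0,  /* scatter-gather I/O */
    MODEL_F_HW_CSUM = 1 << 1,  /* device can checksum packets itself */
    MODEL_F_HIGHDMA = 1 << 2   /* device can DMA to high memory */
};

/* Assumed rule: zero-copy needs scatter-gather AND checksum offload. */
static bool model_can_zerocopy(unsigned int features)
{
    return (features & MODEL_F_SG) && (features & MODEL_F_HW_CSUM);
}

/* Return the feature mask with one flag removed, as the patch does. */
static unsigned int model_clear_feature(unsigned int features, unsigned int flag)
{
    assert(flag != 0);
    return features & ~flag;
}
```

Under that assumption, a device advertising all three flags is eligible for
zero-copy, and the same mask with MODEL_F_HW_CSUM cleared is not, so the
stack would copy the data instead of pinning user pages.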
 



* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-04-05 10:42       ` Marco Fais
@ 2004-04-05 11:46         ` Marco Roeland
  2004-04-05 14:08           ` Marco Fais
  0 siblings, 1 reply; 27+ messages in thread
From: Marco Roeland @ 2004-04-05 11:46 UTC (permalink / raw)
  To: Marco Fais; +Cc: Marco Roeland, linux-kernel

On Monday April 5th 2004 Marco Fais wrote:

> I was not saying *this is the problem*, just noticing that all the 
> systems that show this problem have this network card, while the other 
> systems that are working perfectly are using other network hardware 
> (e100 driver) :)

Yes, my conclusion was too hasty, it *is* driver related! ;-)

With hindsight we also should have tried, of course, a 'strace distccd
--no-detach' in a crashing and a non-crashing situation. This would
probably have shown that 'sendfile()' was the first missing system call
(and therefore likely the culprit) in the crashing situation. Oh, well...
 
> If you read Andrew's message, it seems that distcc uses a function that
> triggers the problem -- sendfile() -- so, if netcat doesn't use it, it's
> clear why it doesn't panic the kernel.

Yes, sendfile() in combination with the 8139too driver seems to be
causing the trouble. Until that is fixed, it doesn't seem easy to work
around. At the moment it looks like there is no easily configurable
option to *not* use zero_copy functionality, either in the kernel or
in distcc.

There is an '8139cp' driver too, which is supposed to work better as
well; perhaps that one doesn't free the pages that are to be
zero_copied across the network before they are sent?! That is the real
problem if I understand Andrew's mail correctly.

You might send a 'linux 8139too sendfile() panic' kind of bugreport
to the 'netdev@oss.sgi.com' mailing list. That is the list where the
networking gurus are supposed to be hanging out. Although IMVHO this bug
is more on the kernel than on the network side. Also filing an entry to
bugzilla.kernel.org might speed up someone fixing the real problem.

Easiest workaround might be to just use a customised distcc for the
machines involved: just download the source from 'distcc.samba.org', do
a regular './configure', and then in the generated 'src/config.h' hand
edit '#undef HAVE_SENDFILE' and '#undef HAVE_SYS_SENDFILE_H'. That
should stop distcc from using sendfile().
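
To show what such a switch typically controls, here is a minimal sketch in
C. It is not distcc's actual source: send_file_body() is a hypothetical
helper, and only the HAVE_SENDFILE macro name is taken from the text above.
With the macro defined the data goes out through sendfile(); with it
undefined the same bytes are copied through an ordinary read()/write() loop,
so no file pages are handed directly to the network stack.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#ifdef HAVE_SENDFILE
#include <sys/sendfile.h>
#endif

/*
 * Send `len` bytes from in_fd (a file) to out_fd (normally a socket).
 * Returns 0 on success, -1 on error or premature end of input.
 */
static int send_file_body(int out_fd, int in_fd, size_t len)
{
    assert(out_fd >= 0 && in_fd >= 0);
#ifdef HAVE_SENDFILE
    /* Zero-copy path: the kernel passes file pages to the socket. */
    while (len > 0) {
        ssize_t n = sendfile(out_fd, in_fd, NULL, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        len -= (size_t)n;
    }
#else
    /* Copying path: every byte passes through a user-space buffer. */
    char buf[4096];
    while (len > 0) {
        size_t want = len < sizeof buf ? len : sizeof buf;
        ssize_t got = read(in_fd, buf, want);
        if (got < 0 && errno == EINTR)
            continue;
        if (got <= 0)
            return -1;
        for (ssize_t off = 0; off < got; ) {
            ssize_t put = write(out_fd, buf + off, (size_t)(got - off));
            if (put < 0 && errno == EINTR)
                continue;
            if (put <= 0)
                return -1;
            off += put;
        }
        len -= (size_t)got;
    }
#endif
    return 0;
}
```

The copying path is slower, but it is the same path a tool like netcat
exercises, which did not trigger the panic in the tests reported earlier in
this thread.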
-- 
Marco Roeland


* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-04-05 10:56     ` Andrew Morton
@ 2004-04-05 13:58       ` Marco Fais
  0 siblings, 0 replies; 27+ messages in thread
From: Marco Fais @ 2004-04-05 13:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, netdev

Hola Andrew!

Andrew Morton wrote:

>>Are there any workarounds for this, until the problem is corrected?
> This will probably make it go away.
> 
> --- linux-2.4.26-rc1/drivers/net/8139too.c	2004-03-27 22:06:18.000000000 -0800
> +++ 24/drivers/net/8139too.c	2004-04-05 03:54:50.478692968 -0700
> @@ -983,7 +983,7 @@ static int __devinit rtl8139_init_one (s
>  	 * through the use of skb_copy_and_csum_dev we enable these
>  	 * features
>  	 */
> -	dev->features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_HIGHDMA;
> +	dev->features |= NETIF_F_SG | NETIF_F_HIGHDMA;
>  
>  	dev->irq = pdev->irq;

Unfortunately, this doesn't solve the problem. It seems that the panic is
triggered a little later (1-2 minutes instead of a few seconds), but
I still get a kernel panic every time, even with this patch.

The oops trace looks very similar to the one I've posted on the
linux-kernel list.

Thank you Andrew, bye!



* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-04-05 11:46         ` Marco Roeland
@ 2004-04-05 14:08           ` Marco Fais
  2004-04-05 14:36             ` Marco Roeland
  0 siblings, 1 reply; 27+ messages in thread
From: Marco Fais @ 2004-04-05 14:08 UTC (permalink / raw)
  To: Marco Roeland; +Cc: linux-kernel

Marco Roeland wrote:

> There is an '8139cp' driver too, it's supposed to be working better
> as well, perhaps that one might not free the pages that are to be
> zero_copied across the network before they are sent?! That is the real
> problem if I understand Andrew's mail correctly.

Just tried that; unfortunately this network card isn't supported by the
8139cp driver.

> You might send a 'linux 8139too sendfile() panic' kind of bugreport
> to the 'netdev@oss.sgi.com' mailing list. That is the list where the
> networking gurus are supposed to be hanging out. Although IMVHO this bug

Andrew's messages are in CC: to the netdev@oss.sgi.com list, so I think 
they're already aware of the problem.

> is more on the kernel than on the network side. Also filing an entry to
> bugzilla.kernel.org might speed up someone fixing the real problem.

OK, let's see if we get a patch from this discussion; otherwise I'll file
a new bugzilla entry.

> Easiest workaround might be to just use a customised distcc for the
> machines involved: just download the source from 'distcc.samba.org', do
> a regular './configure', and then in the generated 'src/config.h' hand
> edit '#undef HAVE_SENDFILE' and '#undef HAVE_SYS_SENDFILE_H'. That
> should stop distcc from using sendfile().

Great! I'm going to test that right now; it's surely better than deploying
customized kernels on all servers until an "official" patch comes out.

Thank you very much, Marco.



* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-04-05 14:08           ` Marco Fais
@ 2004-04-05 14:36             ` Marco Roeland
  0 siblings, 0 replies; 27+ messages in thread
From: Marco Roeland @ 2004-04-05 14:36 UTC (permalink / raw)
  To: Marco Fais; +Cc: Marco Roeland, linux-kernel

On Monday April 5th 2004 Marco Fais wrote:

> OK, let's see if we get a patch from this discussion; otherwise I'll file 
> a new bugzilla entry.

Perhaps the fact that you have *two* cards in each machine that crashes
with the 8139too driver could be important? I have two Athlon XP 2000+
machines with Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ cards
that run distcc quite a lot and have never crashed. But network topology
and timings might just trigger the panic in your situation and not in others...

> [building distcc without sendfile()]
> Great! I'm going to test that right now, surely better than deploying 
> customized kernels in all servers until an "official" patch comes out.

Yeah, although that viewpoint might not be very popular on this mailing
list. ;-) By the way the patch looks quite alright and applies (with
an offset) to 2.6.5 as well. If you build 8139too modular, you might
even make two modules, a modified one with the reduced advertised
capabilities (so that the kernel assumes the card isn't zero-copy
capable) under another name perhaps like 8139too-nosendfile, and the
standard one. You can then at least distribute one kernel package, and
only on the affected machines modprobe the bugfix module.

Anyway, installing a distcc that doesn't use sendfile() first will let
you (distcc-)build patched kernels much faster in the future. ;-)
-- 
Marco Roeland

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-04-02 15:05     ` Marco Roeland
  2004-04-05 10:42       ` Marco Fais
@ 2004-04-05 17:03       ` Max Valdez
  1 sibling, 0 replies; 27+ messages in thread
From: Max Valdez @ 2004-04-05 17:03 UTC (permalink / raw)
  To: Marco Roeland; +Cc: Marco Fais, linux-kernel

I sent an email a couple of weeks ago about the same issue.

But it wasn't as well documented and organized.

I can say that the card and hardware are innocent; maybe it's the driver. The 
"remote" machines that hang are using the latest Fedora stable kernel.

I would need really good pointers on the procedure to debug the problem; I'm 
not an expert in anything kernel-related.

I think it's a problem in the network handling because it happens on different 
kernels, on different hardware. It has been happening for a couple of months 
(since we got a new, faster network "architecture"), and the problem seems to be 
triggered by fast transfer of files over NTF, and by distcc. I remember having a 
crash using scp too for some ISO files.

If needed I can help track down this problem, but I need some hints on the 
procedure.

Max

-- 
Linux garaged 2.6.5-rc2-mm3 #1 Fri Mar 26 11:07:16 CST 2004 i686 Intel(R) 
Pentium(R) 4 CPU 2.80GHz GenuineIntel GNU/Linux
gpg-key: http://garaged.homeip.net/gpg-key.txt

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-04-02 10:21 kernel BUG at page_alloc.c:98 -- compiling with distcc Marco Fais
  2004-04-02 13:15 ` Marco Roeland
  2004-04-02 23:36 ` Andrew Morton
@ 2004-05-04  1:07 ` Marcelo Tosatti
  2004-05-05 16:25   ` Carson Gaspar
  2 siblings, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2004-05-04  1:07 UTC (permalink / raw)
  To: Marco Fais; +Cc: linux-kernel, Carson Gaspar

On Fri, Apr 02, 2004 at 12:21:03PM +0200, Marco Fais wrote:
> Hi!
> 
> 
> [1.] Kernel panic while using distcc
> 
> [2.] I have 5-6 development linux systems that we use without problem
> under a normal development workload. Trying distcc for speeding up
> compilation, we have a fully reproducible kernel panic in a very short
> time (seconds after compilation start). The kernel panic happens *only*
> when the systems are "remotely controlled" (the distcc daemon is
> receiving source files from remote systems, compile and send back
> compiled objects). When compiling with distcc the local system doesn't
> show any kernel panic, while the same system used as a "remote compiler
> system" dies very quickly.
> 
> [3.] Keywords: distcc BUG page_alloc.c

Marco, Carson,

Can you please try to reproduce this distcc generated oops using 2.4.27-pre2?
 
Thanks!


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-05-04  1:07 ` Marcelo Tosatti
@ 2004-05-05 16:25   ` Carson Gaspar
  2004-05-05 16:28     ` Marc-Christian Petersen
  2004-05-05 18:35     ` Marcelo Tosatti
  0 siblings, 2 replies; 27+ messages in thread
From: Carson Gaspar @ 2004-05-05 16:25 UTC (permalink / raw)
  To: Marcelo Tosatti, Marco Fais, linux-kernel

--On Monday, May 03, 2004 22:07:14 -0300 Marcelo Tosatti 
<marcelo.tosatti@cyclades.com> wrote:

> On Fri, Apr 02, 2004 at 12:21:03PM +0200, Marco Fais wrote:
>> Hi!
>>
>>
>> [1.] Kernel panic while using distcc
>>
>> [2.] I have 5-6 development linux systems that we use without problem
>> under a normal development workload. Trying distcc for speeding up
>> compilation, we have a fully reproducible kernel panic in a very short
>> time (seconds after compilation start). The kernel panic happens *only*
>> when the systems are "remotely controlled" (the distcc daemon is
>> receiving source files from remote systems, compile and send back
>> compiled objects). When compiling with distcc the local system doesn't
>> show any kernel panic, while the same system used as a "remote compiler
>> system" dies very quickly.
>>
>> [3.] Keywords: distcc BUG page_alloc.c
>
> Marco, Carson,
>
> Can you please try to reproduce this distcc generated oops using
> 2.4.27-pre2?

I'd love to. However 2.4.27-pre2 broke the tg3 driver. tg3.c contains 
WARN_ON(1). Sadly, WARN_ON doesn't exist in 2.4.x, so depmod correctly 
complains about an unresolved symbol.

I'm beginning to wonder if anyone actually builds these pre releases... I 
mean, I know the tg3 driver is really obscure, and only used by 2 people, 
but...

-- 
Carson


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-05-05 16:25   ` Carson Gaspar
@ 2004-05-05 16:28     ` Marc-Christian Petersen
  2004-05-05 18:35     ` Marcelo Tosatti
  1 sibling, 0 replies; 27+ messages in thread
From: Marc-Christian Petersen @ 2004-05-05 16:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: Carson Gaspar, Marcelo Tosatti, Marco Fais

[-- Attachment #1: Type: text/plain, Size: 520 bytes --]

On Wednesday 05 May 2004 18:25, Carson Gaspar wrote:

Hi Carson,

> I'd love to. However 2.4.27-pre2 broke the tg3 driver. tg3.c contains
> WARN_ON(1). Sadly, WARN_ON doesn't exist in 2.4.x, so depmod correctly
> complains about an unresolved symbol.
> I'm beginning to wonder if anyone actually builds these pre releases... I
> mean, I know the tg3 driver is really obscure, and only used by 2 people,
> but...

by 2 people? you have to be kidding.

Anyway, attached is 2.4 WARN_ON. Apply it and use tg3 :p

ciao, Marc

[-- Attachment #2: 2.4-WARN_ON.patch --]
[-- Type: text/x-diff, Size: 472 bytes --]

--- old/include/linux/kernel.h	2004-05-04 21:48:24.000000000 +0200
+++ new/include/linux/kernel.h	2004-05-05 10:53:32.000000000 +0200
@@ -196,4 +196,11 @@ struct sysinfo {
 
 #define BUG_ON(condition) do { if (unlikely((condition)!=0)) BUG(); } while(0)
 
+#define WARN_ON(condition) do { \
+	if (unlikely((condition)!=0)) { \
+		printk("Badness in %s at %s:%d\n", __FUNCTION__, __FILE__, __LINE__); \
+		dump_stack(); \
+	} \
+} while (0)
+
 #endif /* _LINUX_KERNEL_H */
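
For readers who want to see what the macro does without building a kernel, here is a rough user-space analogue. It is only a sketch: fprintf() stands in for printk(), dump_stack() is left out, and warn_count and checked_div() are hypothetical names added purely for illustration.

```c
#include <stdio.h>

/* Illustration only: counts how many times a warning has fired. */
static int warn_count;

/* User-space stand-in for the kernel macro: report and keep running. */
#define WARN_ON(condition) do { \
	if ((condition) != 0) { \
		fprintf(stderr, "Badness in %s at %s:%d\n", \
			__func__, __FILE__, __LINE__); \
		warn_count++; \
	} \
} while (0)

/* Hypothetical caller: warns on a zero divisor instead of aborting. */
int checked_div(int a, int b)
{
	WARN_ON(b == 0);
	return b ? a / b : 0;
}
```

The point of the sketch is the contrast with BUG_ON() in the context line above: WARN_ON() logs the location and lets execution continue, whereas BUG() stops the offending code path.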

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-05-05 16:25   ` Carson Gaspar
  2004-05-05 16:28     ` Marc-Christian Petersen
@ 2004-05-05 18:35     ` Marcelo Tosatti
  2004-05-19 11:59       ` Marcelo Tosatti
  1 sibling, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2004-05-05 18:35 UTC (permalink / raw)
  To: Carson Gaspar; +Cc: Marco Fais, linux-kernel

On Wed, May 05, 2004 at 12:25:00PM -0400, Carson Gaspar wrote:
> --On Monday, May 03, 2004 22:07:14 -0300 Marcelo Tosatti 
> <marcelo.tosatti@cyclades.com> wrote:
> 
> >On Fri, Apr 02, 2004 at 12:21:03PM +0200, Marco Fais wrote:
> >>Hi!
> >>
> >>
> >>[1.] Kernel panic while using distcc
> >>
> >>[2.] I have 5-6 development linux systems that we use without problem
> >>under a normal development workload. Trying distcc for speeding up
> >>compilation, we have a fully reproducible kernel panic in a very short
> >>time (seconds after compilation start). The kernel panic happens *only*
> >>when the systems are "remotely controlled" (the distcc daemon is
> >>receiving source files from remote systems, compile and send back
> >>compiled objects). When compiling with distcc the local system doesn't
> >>show any kernel panic, while the same system used as a "remote compiler
> >>system" dies very quickly.
> >>
> >>[3.] Keywords: distcc BUG page_alloc.c
> >
> >Marco, Carson,
> >
> >Can you please try to reproduce this distcc generated oops using
> >2.4.27-pre2?
> 
> I'd love to. However 2.4.27-pre2 broke the tg3 driver. tg3.c contains 
> WARN_ON(1). Sadly, WARN_ON doesn't exist in 2.4.x, so depmod correctly 
> complains about an unresolved symbol.
> 
> I'm beginning to wonder if anyone actually builds these pre releases... I 
> mean, I know the tg3 driver is really obscure, and only used by 2 people, 
> but...

I just committed a fix to the BK tree. 

Can you please apply this?

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.1385  -> 1.1386 
#	include/linux/kernel.h	1.22    -> 1.23   
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 04/05/05	m.c.p@kernel.linux-systeme.com	1.1386
# [PATCH] copy WARN_ON() definition from 2.6
# 
# --------------------------------------------
#
diff -Nru a/include/linux/kernel.h b/include/linux/kernel.h
--- a/include/linux/kernel.h	Wed May  5 15:35:31 2004
+++ b/include/linux/kernel.h	Wed May  5 15:35:31 2004
@@ -196,4 +196,11 @@
 
 #define BUG_ON(condition) do { if (unlikely((condition)!=0)) BUG(); } while(0)
 
+#define WARN_ON(condition) do { \
+	if (unlikely((condition)!=0)) { \
+		printk("Badness in %s at %s:%d\n", __FUNCTION__, __FILE__, __LINE__); \
+		dump_stack(); \
+	} \
+} while (0)
+
 #endif /* _LINUX_KERNEL_H */

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-05-05 18:35     ` Marcelo Tosatti
@ 2004-05-19 11:59       ` Marcelo Tosatti
  2004-05-19 15:50         ` Marc-Christian Petersen
  2004-05-19 20:21         ` Carson Gaspar
  0 siblings, 2 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2004-05-19 11:59 UTC (permalink / raw)
  To: Carson Gaspar; +Cc: Marco Fais, linux-kernel

On Wed, May 05, 2004 at 03:35:58PM -0300, Marcelo Tosatti wrote:
> On Wed, May 05, 2004 at 12:25:00PM -0400, Carson Gaspar wrote:
> > --On Monday, May 03, 2004 22:07:14 -0300 Marcelo Tosatti 
> > <marcelo.tosatti@cyclades.com> wrote:
> > 
> > >On Fri, Apr 02, 2004 at 12:21:03PM +0200, Marco Fais wrote:
> > >>Hi!
> > >>
> > >>
> > >>[1.] Kernel panic while using distcc
> > >>
> > >>[2.] I have 5-6 development linux systems that we use without problem
> > >>under a normal development workload. Trying distcc for speeding up
> > >>compilation, we have a fully reproducible kernel panic in a very short
> > >>time (seconds after compilation start). The kernel panic happens *only*
> > >>when the systems are "remotely controlled" (the distcc daemon is
> > >>receiving source files from remote systems, compile and send back
> > >>compiled objects). When compiling with distcc the local system doesn't
> > >>show any kernel panic, while the same system used as a "remote compiler
> > >>system" dies very quickly.
> > >>
> > >>[3.] Keywords: distcc BUG page_alloc.c
> > >Marco, Carson,

Hi Carson, 

So did Andrea's fix work for you? :) 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-05-19 11:59       ` Marcelo Tosatti
@ 2004-05-19 15:50         ` Marc-Christian Petersen
  2004-05-20 12:21           ` Marcelo Tosatti
  2004-05-19 20:21         ` Carson Gaspar
  1 sibling, 1 reply; 27+ messages in thread
From: Marc-Christian Petersen @ 2004-05-19 15:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: Marcelo Tosatti, Carson Gaspar, Marco Fais

On Wednesday 19 May 2004 13:59, Marcelo Tosatti wrote:

Hi Marcelo, Carson ...

> > > >>[1.] Kernel panic while using distcc
> > > >>[2.] I have 5-6 development linux systems that we use without problem
> > > >>under a normal development workload. Trying distcc for speeding up
> > > >>compilation, we have a fully reproducible kernel panic in a very
> > > >> short time (seconds after compilation start). The kernel panic
> > > >> happens *only* when the systems are "remotely controlled" (the
> > > >> distcc daemon is receiving source files from remote systems, compile
> > > >> and send back compiled objects). When compiling with distcc the
> > > >> local system doesn't show any kernel panic, while the same system
> > > >> used as a "remote compiler system" dies very quickly.
> > > >>[3.] Keywords: distcc BUG page_alloc.c

> So did Andrea's fix work for you? :)

Sorry if I did not follow this thread from the beginning, but why is distcc 
causing a BUG() in page_alloc.c? I have been using distcc for I don't know how 
long and have never had any BUG() in page_alloc with distcc, let alone the 
specific one at :98.

I have 7 machines in my distcc farm, and all are "remote controlled".

Could someone please clarify this for me? Thank you.

ciao, Marc


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-05-19 11:59       ` Marcelo Tosatti
  2004-05-19 15:50         ` Marc-Christian Petersen
@ 2004-05-19 20:21         ` Carson Gaspar
  1 sibling, 0 replies; 27+ messages in thread
From: Carson Gaspar @ 2004-05-19 20:21 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Marco Fais, linux-kernel



--On Wednesday, May 19, 2004 8:59 AM -0300 Marcelo Tosatti 
<marcelo.tosatti@cyclades.com> wrote:

> Hi Carson,
>
> So did Andrea's fix work for you? :)

Yes.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-05-19 15:50         ` Marc-Christian Petersen
@ 2004-05-20 12:21           ` Marcelo Tosatti
  0 siblings, 0 replies; 27+ messages in thread
From: Marcelo Tosatti @ 2004-05-20 12:21 UTC (permalink / raw)
  To: Marc-Christian Petersen; +Cc: linux-kernel, Carson Gaspar, Marco Fais

On Wed, May 19, 2004 at 05:50:10PM +0200, Marc-Christian Petersen wrote:
> On Wednesday 19 May 2004 13:59, Marcelo Tosatti wrote:
> 
> Hi Marcelo, Carson ...
> 
> > > > >>[1.] Kernel panic while using distcc
> > > > >>[2.] I have 5-6 development linux systems that we use without problem
> > > > >>under a normal development workload. Trying distcc for speeding up
> > > > >>compilation, we have a fully reproducible kernel panic in a very
> > > > >> short time (seconds after compilation start). The kernel panic
> > > > >> happens *only* when the systems are "remotely controlled" (the
> > > > >> distcc daemon is receiving source files from remote systems, compile
> > > > >> and send back compiled objects). When compiling with distcc the
> > > > >> local system doesn't show any kernel panic, while the same system
> > > > >> used as a "remote compiler system" dies very quickly.
> > > > >>[3.] Keywords: distcc BUG page_alloc.c
> 
> > So did Andrea's fix work for you? :)
> 
> Sorry if I did not follow this thread from the beginning, but why is distcc 
> causing a BUG() in page_alloc.c? I have been using distcc for I don't know how 
> long and have never had any BUG() in page_alloc with distcc, let alone the 
> specific one at :98.
> 
> I have 7 machines in my distcc farm, and all are "remote controlled".
> 
> Could someone please clarify this for me? Thank you.

We try to free, in IRQ context, a page which has been sent over the 
network: with sendfile() it's possible that the IRQ context reference 
is the last one on the page.

Quoting David Miller:

When vmscan.c shrinks a cache, it never tosses a page which is LRU if
the page count is not 1 (after trying to toss attached buffers if any).

I think LRU used to contribute a page count, which prevented this problem,
or something like that.

In fact it seems trivial to trigger the bug in question:

1) open file
2) sendfile() it over a socket
3) quickly close the file, no existing user references, only the
   sendfile() packet references remain to the page

When the TCP packet gets ACK'd, we explode in __free_pages_ok() as per
the report.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-04-29 23:26         ` Andrew Morton
@ 2004-04-30  0:15           ` Andrea Arcangeli
  0 siblings, 0 replies; 27+ messages in thread
From: Andrea Arcangeli @ 2004-04-30  0:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: marcelo.tosatti, jmoyer, carson, linux-kernel, netdev, davem

On Thu, Apr 29, 2004 at 04:26:32PM -0700, Andrew Morton wrote:
> The only application which we know will exercise that code is the distcc
> server.  Making that little change while testing the patch will increase
> the chance of shaking out any problems.

if you're scared it has bugs I think it'd be more useful to change it to
"|| 1" and run it under some stress test, and then remove the "|| 1".
the aio code in unmap_kvec is also a big user of that.  a schedule every
40M of ram freed isn't too nice to my eyes (but I doubt it can be
measured).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-04-29 22:49       ` Andrea Arcangeli
@ 2004-04-29 23:26         ` Andrew Morton
  2004-04-30  0:15           ` Andrea Arcangeli
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2004-04-29 23:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: marcelo.tosatti, jmoyer, carson, linux-kernel, netdev, davem

Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Thu, Apr 29, 2004 at 02:28:07PM -0700, Andrew Morton wrote:
> > just to exercise that code path a bit more.
> 
> what's the point of exercising that code path more? are you worried that
> there are bugs in it?

The only application which we know will exercise that code is the distcc
server.  Making that little change while testing the patch will increase
the chance of shaking out any problems.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-04-29 21:28     ` Andrew Morton
@ 2004-04-29 22:49       ` Andrea Arcangeli
  2004-04-29 23:26         ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Andrea Arcangeli @ 2004-04-29 22:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Marcelo Tosatti, jmoyer, carson, linux-kernel, netdev, davem

On Thu, Apr 29, 2004 at 02:28:07PM -0700, Andrew Morton wrote:
> just to exercise that code path a bit more.

what's the point of exercising that code path more? are you worried that
there are bugs in it?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-04-29 21:09   ` Marcelo Tosatti
@ 2004-04-29 21:28     ` Andrew Morton
  2004-04-29 22:49       ` Andrea Arcangeli
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2004-04-29 21:28 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: jmoyer, carson, linux-kernel, netdev, andrea, davem

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> > Andrea fixed this in his tree by deferring the page free to process context
> > instead of BUG()ing on PageLRU(page).
> 
> Yeap, his fix looks OK.

It does.

It would be nice to change

	if (in_interrupt())

to

	if (in_interrupt() || ((count++ % 10000) == 0))

just to exercise that code path a bit more.
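
As a small illustration of that condition, the following user-space sketch shows how the counter forces the otherwise rare branch once every 10000 calls. The in_irq flag is a stand-in for in_interrupt(), and take_deferred_path() is a hypothetical name, not anything in the patch.

```c
#include <stdbool.h>

/* Hypothetical call counter, playing the role of "count" above. */
static unsigned long count;

/*
 * Returns true when the deferred path should be taken: always when
 * in_irq is set, and additionally on every 10000th non-IRQ call.
 * Because || short-circuits, count only advances on non-IRQ calls.
 */
bool take_deferred_path(bool in_irq)
{
    return in_irq || ((count++ % 10000) == 0);
}
```

Note that the very first non-IRQ call takes the deferred path, since the counter starts at zero. Assuming 4 KiB pages, one forced deferral per 10000 single-page frees works out to roughly one per 40 MB freed.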


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-04-28  2:02 ` Jeff Moyer
@ 2004-04-29 21:09   ` Marcelo Tosatti
  2004-04-29 21:28     ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Marcelo Tosatti @ 2004-04-29 21:09 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Carson Gaspar, linux-kernel, netdev, akpm, andrea, davem

On Tue, Apr 27, 2004 at 10:02:11PM -0400, Jeff Moyer wrote:
> 
> >FYI, we see the exact same panic with the tg3 driver using 2.4.25 and 
> >distcc with sendfile(). The bcm5700 driver also panics, but I haven't 
> >captured a panic message to be certain it's the same bug.
> 
> >kernel BUG at page_alloc.c:98!
> 
> Andrea fixed this in his tree by deferring the page free to process context
> instead of BUG()ing on PageLRU(page).

Yeap, his fix looks OK.

Can the people seeing the oops please try this, from Andrea (on top of 2.4.26):

--- a/mm/page_alloc.c.orig	2004-04-29 17:38:14.184021976 -0300
+++ b/mm/page_alloc.c	2004-04-29 17:47:27.906843312 -0300
@@ -46,6 +46,34 @@
 
 int vm_gfp_debug = 0;
 
+static void FASTCALL(__free_pages_ok (struct page *page, unsigned int order));
+
+static spinlock_t free_pages_ok_no_irq_lock = SPIN_LOCK_UNLOCKED;
+struct page * free_pages_ok_no_irq_head;
+
+static void do_free_pages_ok_no_irq(void * arg)
+{
+       struct page * page, * __page;
+
+       spin_lock_irq(&free_pages_ok_no_irq_lock);
+
+       page = free_pages_ok_no_irq_head;
+       free_pages_ok_no_irq_head = NULL;
+
+       spin_unlock_irq(&free_pages_ok_no_irq_lock);
+
+       while (page) {
+               __page = page;
+               page = page->next_hash;
+               __free_pages_ok(__page, __page->index);
+       }
+}
+
+static struct tq_struct free_pages_ok_no_irq_task = {
+       .routine        = do_free_pages_ok_no_irq,
+};
+
+
 /*
  * Temporary debugging check.
  */
@@ -81,7 +109,6 @@
  * -- wli
  */
 
-static void FASTCALL(__free_pages_ok (struct page *page, unsigned int order));
 static void __free_pages_ok (struct page *page, unsigned int order)
 {
 	unsigned long index, page_idx, mask, flags;
@@ -94,8 +121,20 @@
 	 * a reference to a page in order to pin it for io. -ben
 	 */
 	if (PageLRU(page)) {
-		if (unlikely(in_interrupt()))
-			BUG();
+		if (unlikely(in_interrupt())) {
+			unsigned long flags;
+
+			spin_lock_irqsave(&free_pages_ok_no_irq_lock, flags);
+			page->next_hash = free_pages_ok_no_irq_head;
+			free_pages_ok_no_irq_head = page;
+			page->index = order;
+	
+			spin_unlock_irqrestore(&free_pages_ok_no_irq_lock, flags);
+	
+			schedule_task(&free_pages_ok_no_irq_task);
+			return;
+		}
+		
 		lru_cache_del(page);
 	}
 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
  2004-04-23 22:33 Carson Gaspar
@ 2004-04-28  2:02 ` Jeff Moyer
  2004-04-29 21:09   ` Marcelo Tosatti
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff Moyer @ 2004-04-28  2:02 UTC (permalink / raw)
  To: Carson Gaspar; +Cc: linux-kernel, netdev


>FYI, we see the exact same panic with the tg3 driver using 2.4.25 and 
>distcc with sendfile(). The bcm5700 driver also panics, but I haven't 
>captured a panic message to be certain it's the same bug.

>kernel BUG at page_alloc.c:98!

Andrea fixed this in his tree by deferring the page free to process context
instead of BUG()ing on PageLRU(page).
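
The idea of deferring the free can be sketched in user space along these lines. This is an analogy under stated assumptions, not the kernel patch itself: struct fake_page, defer_free() and drain_deferred() are hypothetical names, a pthread mutex stands in for the IRQ-safe spinlock, and marking a flag stands in for really freeing the page.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page with only the fields needed. */
struct fake_page {
    struct fake_page *next;   /* link used while the page is queued */
    unsigned int order;       /* allocation order remembered for later */
    int freed;                /* set once the deferred free has run */
};

static pthread_mutex_t deferred_lock = PTHREAD_MUTEX_INITIALIZER;
static struct fake_page *deferred_head;

/* Called from the context that must not free: only queue the page. */
void defer_free(struct fake_page *page, unsigned int order)
{
    pthread_mutex_lock(&deferred_lock);
    page->order = order;
    page->next = deferred_head;
    deferred_head = page;
    pthread_mutex_unlock(&deferred_lock);
}

/*
 * Called later from a context where freeing is safe. The whole list is
 * detached under the lock and processed outside it, so the lock is held
 * only briefly. Returns the number of pages processed.
 */
int drain_deferred(void)
{
    struct fake_page *page;
    int n = 0;

    pthread_mutex_lock(&deferred_lock);
    page = deferred_head;
    deferred_head = NULL;
    pthread_mutex_unlock(&deferred_lock);

    while (page) {
        struct fake_page *next = page->next;  /* read link before freeing */
        page->freed = 1;
        page = next;
        n++;
    }
    return n;
}
```

In this picture the interrupt path only ever calls defer_free(), and something running in process context calls drain_deferred() afterwards.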

-Jeff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: kernel BUG at page_alloc.c:98 -- compiling with distcc
@ 2004-04-23 22:33 Carson Gaspar
  2004-04-28  2:02 ` Jeff Moyer
  0 siblings, 1 reply; 27+ messages in thread
From: Carson Gaspar @ 2004-04-23 22:33 UTC (permalink / raw)
  To: linux-kernel, netdev

FYI, we see the exact same panic with the tg3 driver using 2.4.25 and 
distcc with sendfile(). The bcm5700 driver also panics, but I haven't 
captured a panic message to be certain it's the same bug.

kernel BUG at page_alloc.c:98!
invalid operand: 0000
CPU:    1
EIP:    0010:[<c0139492>]    Tainted: PF
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010202
eax: 00000001   ebx: c294dcb0   ecx: 00000001   edx: 00000020
esi: edb6e2e0   edi: 00000000   ebp: 00000004   esp: c55af9b4
ds: 0018   es: 0018   ss: 0018
Process cc1plus (pid: 21186, stackpage=c55af000)
Stack: c022e9ee f6fb1000 c022aa9c 00000287 00000206 00000286 db5a9600 
00000001
       edb6e2e0 edb6e2e0 00000004 c022aa4e edb6e2e0 f3716100 c022aa9c 
edb6e2e0
       f371623c f3716100 c022ac25 edb6e2e0 00000000 c025423a edb6e2e0 
c55ae000
Call Trace:    [<c022e9ee>] [<c022aa9c>] [<c022aa4e>] [<c022aa9c>] 
[<c022ac25>]
  [<c025423a>] [<c0247d28>] [<c024be53>] [<c025675b>] [<c02547c8>] 
[<c0256bdf>]
  [<c0138175>] [<c022aa9c>] [<c0254307>] [<c0258a67>] [<c022aa9c>] 
[<c0254307>]
  [<c025ef5b>] [<c025f4ad>] [<c022ac25>] [<c0256bec>] [<c01550dc>] 
[<c014ba00>]
  [<c02449a3>] [<c02449a3>] [<c0244da6>] [<c025ef5b>] [<c0139c05>] 
[<c025f4ad>]
  [<c022a8af>] [<c022f189>] [<c022a8af>] [<f8990d48>] [<c02449a3>] 
[<f8990ef9>]
  [<c022f3a3>]o[<c0122c5b>] [<c010a74e>] [<c0131a04>] [<c012e232>] 
[<c0131487>]
  [<c0119e06>] [<c0131b08>] [<c0131990>] [<c01410d6>] [<c012e72a>] 
[<c0108b5f>]
Code: 0f 0b 62 00 bd 35 2a c0 89 d8 e8 5f ed ff ff 8b 6b 28 85 ed

>>EIP; c0139492 <__free_pages_ok+32/2b0>   <=====
Trace; c022e9ee <dev_queue_xmit+14e/320>
Trace; c022aa9c <kfree_skbmem+c/70>
Trace; c022aa4e <skb_release_data+4e/90>
Trace; c022aa9c <kfree_skbmem+c/70>
Trace; c022ac25 <__kfree_skb+125/130>
Trace; c025423a <tcp_clean_rtx_queue+15a/310>
Trace; c0247d28 <ip_queue_xmit+3d8/550>
Trace; c024be53 <tcp_write_space+53/80>
Trace; c025675b <tcp_new_space+7b/80>
Trace; c02547c8 <tcp_ack+138/360>
Trace; c0256bdf <tcp_rcv_established+ef/8b0>
Trace; c0138175 <lru_cache_add+75/80>
Trace; c022aa9c <kfree_skbmem+c/70>
Trace; c0254307 <tcp_clean_rtx_queue+227/310>
Trace; c0258a67 <tcp_transmit_skb+567/620>
Trace; c022aa9c <kfree_skbmem+c/70>
Trace; c0254307 <tcp_clean_rtx_queue+227/310>
Trace; c025ef5b <tcp_v4_do_rcv+3b/120>
Trace; c025f4ad <tcp_v4_rcv+46d/6f0>
Trace; c022ac25 <__kfree_skb+125/130>
Trace; c0256bec <tcp_rcv_established+fc/8b0>
Trace; c01550dc <dput+1c/160>
Trace; c014ba00 <cached_lookup+10/50>
Trace; c02449a3 <ip_local_deliver+f3/190>
Trace; c02449a3 <ip_local_deliver+f3/190>
Trace; c0244da6 <ip_rcv+366/400>
Trace; c025ef5b <tcp_v4_do_rcv+3b/120>
Trace; c0139c05 <__alloc_pages+75/2f0>
Trace; c025f4ad <tcp_v4_rcv+46d/6f0>
Trace; c022a8af <alloc_skb+ef/1c0>
Trace; c022f189 <netif_receive_skb+189/1c0>
Trace; c022a8af <alloc_skb+ef/1c0>
Trace; f8990d48 <[usbcore]__kstrtab_usb_hcd_giveback_urb+52f8/6a50>
Trace; c02449a3 <ip_local_deliver+f3/190>
Trace; f8990ef9 <[usbcore]__kstrtab_usb_hcd_giveback_urb+54a9/6a50>
Trace; c022f3a3 <net_rx_action+b3/170>
Trace; c0119e06 <do_page_fault+1a6/4eb>
Trace; c0131b08 <generic_file_read+88/170>
Trace; c0131990 <file_read_actor+0/f0>
Trace; c01410d6 <sys_read+96/110>
Trace; c012e72a <sys_brk+ba/f0>
Trace; c0108b5f <system_call+33/38>
Code;  c0139492 <__free_pages_ok+32/2b0>
00000000 <_EIP>:
Code;  c0139492 <__free_pages_ok+32/2b0>   <=====
   0:   0f 0b                     ud2a      <=====
Code;  c0139494 <__free_pages_ok+34/2b0>
   2:   62 00                     bound  %eax,(%eax)
Code;  c0139496 <__free_pages_ok+36/2b0>
   4:   bd 35 2a c0 89            mov    $0x89c02a35,%ebp
Code;  c013949b <__free_pages_ok+3b/2b0>
   9:   d8 e8                     fsubr  %st(0),%st
Code;  c013949d <__free_pages_ok+3d/2b0>
   b:   5f                        pop    %edi
Code;  c013949e <__free_pages_ok+3e/2b0>
   c:   ed                        in     (%dx),%eax
Code;  c013949f <__free_pages_ok+3f/2b0>
   d:   ff                        (bad)
Code;  c01394a0 <__free_pages_ok+40/2b0>
   e:   ff 8b 6b 28 85 ed         decl   0xed85286b(%ebx)

 <0>Kernel panic: Aiee, killing interrupt handler!


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2004-05-20 12:20 UTC | newest]

Thread overview: 27+ messages
2004-04-02 10:21 kernel BUG at page_alloc.c:98 -- compiling with distcc Marco Fais
2004-04-02 13:15 ` Marco Roeland
     [not found]   ` <6.0.0.22.2.20040402163334.02abe7d8@pop.localnet>
2004-04-02 15:05     ` Marco Roeland
2004-04-05 10:42       ` Marco Fais
2004-04-05 11:46         ` Marco Roeland
2004-04-05 14:08           ` Marco Fais
2004-04-05 14:36             ` Marco Roeland
2004-04-05 17:03       ` Max Valdez
2004-04-02 23:36 ` Andrew Morton
2004-04-05 10:47   ` Marco Fais
2004-04-05 10:56     ` Andrew Morton
2004-04-05 13:58       ` Marco Fais
2004-05-04  1:07 ` Marcelo Tosatti
2004-05-05 16:25   ` Carson Gaspar
2004-05-05 16:28     ` Marc-Christian Petersen
2004-05-05 18:35     ` Marcelo Tosatti
2004-05-19 11:59       ` Marcelo Tosatti
2004-05-19 15:50         ` Marc-Christian Petersen
2004-05-20 12:21           ` Marcelo Tosatti
2004-05-19 20:21         ` Carson Gaspar
2004-04-23 22:33 Carson Gaspar
2004-04-28  2:02 ` Jeff Moyer
2004-04-29 21:09   ` Marcelo Tosatti
2004-04-29 21:28     ` Andrew Morton
2004-04-29 22:49       ` Andrea Arcangeli
2004-04-29 23:26         ` Andrew Morton
2004-04-30  0:15           ` Andrea Arcangeli
