LKML Archive on lore.kernel.org
* Kernel 2.6.20 does not work anymore with SCSI or SATA on old Opteron / Xeon servers
@ 2007-03-18 20:50 Stefan Priebe
  2007-03-20  7:27 ` Andrew Morton
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Priebe @ 2007-03-18 20:50 UTC (permalink / raw)
  To: linux-kernel

Hello!

We have a very strange problem with kernel 2.6.20.x.

If I try to access a SCSI or SATA disk (tested with Adaptec U320
ASC-29320, ICP Vortex 9024, Promise TX300), the whole server hangs
completely, with no output and no error on the screen. It does not
happen on all our systems: only old 604-pin Xeons and Socket 940
Opterons are affected. Socket F Opterons and Socket 771 Xeons work fine.

I have also tested acpi=off and pci=routeirq, but neither helps. The
systems work fine with 2.6.19.x and earlier.

Stefan


* Re: Kernel 2.6.20 does not work anymore with SCSI or SATA on old Opteron / Xeon servers
  2007-03-18 20:50 Kernel 2.6.20 does not work anymore with SCSI or SATA on old Opteron / Xeon servers Stefan Priebe
@ 2007-03-20  7:27 ` Andrew Morton
  2007-03-20 10:33   ` Stefan Priebe
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2007-03-20  7:27 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: linux-kernel, linux-scsi

On Sun, 18 Mar 2007 21:50:46 +0100 Stefan Priebe <stefan@prie.be> wrote:

> Hello!
> 
> We have a very strange problem with kernel 2.6.20.x.
> 
> If I try to access a SCSI or SATA disk (tested with Adaptec U320
> ASC-29320, ICP Vortex 9024, Promise TX300), the whole server hangs
> completely, with no output and no error on the screen. It does not
> happen on all our systems: only old 604-pin Xeons and Socket 940
> Opterons are affected. Socket F Opterons and Socket 771 Xeons work fine.
> 
> I have also tested acpi=off and pci=routeirq, but neither helps. The
> systems work fine with 2.6.19.x and earlier.

Well that's a bit sad.

Could you please set up netconsole
(Documentation/networking/netconsole.txt) and add initcall_debug to the
kernel boot command line and then send us the full bootup logs?

(Even better: serial console with earlyprintk).

If that doesn't shed any light, we might have to ask you to perform a
git-bisect search to find the buggy commit, I'm afraid.
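
The bisection mentioned above narrows a known-good/known-bad range by
repeatedly testing the midpoint. The C sketch below only illustrates that
search under simplifying assumptions: revisions are numbered linearly
(git itself walks a commit graph), every revision after the first bad one
is also bad, and `example_is_bad` is a hypothetical stand-in for "boot
this kernel and see whether disk access hangs". None of these names come
from git or from the kernel.

```c
#include <stddef.h>

/* Hypothetical predicate type: nonzero means the kernel built from
 * revision `rev` shows the hang. In practice a person or a test script
 * answers this by booting each candidate kernel. */
typedef int (*is_bad_fn)(size_t rev);

/* Hypothetical example predicate: pretend the regression was introduced
 * at revision 37 and never fixed afterwards. */
int example_is_bad(size_t rev)
{
    return rev >= 37;
}

/* Binary search for the first bad revision.
 * Preconditions (assumed, not checked): `good` is known good, `bad` is
 * known bad, and good < bad. Returns the index of the first bad revision. */
size_t first_bad_revision(size_t good, size_t bad, is_bad_fn is_bad)
{
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;
        if (is_bad(mid))
            bad = mid;   /* the regression is at mid or earlier */
        else
            good = mid;  /* the regression is after mid */
    }
    return bad;
}
```

Calling `first_bad_revision(0, 100, example_is_bad)` takes about seven
test boots instead of up to a hundred, which is why bisection is practical
even across a whole kernel release cycle.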


* Re: Kernel 2.6.20 does not work anymore with SCSI or SATA on old Opteron / Xeon servers
  2007-03-20  7:27 ` Andrew Morton
@ 2007-03-20 10:33   ` Stefan Priebe
  2007-03-20 10:54     ` Olaf Kirch
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Priebe @ 2007-03-20 10:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-scsi

Hello!

Here is more information. The problem seems to be a bit more specific.

1.) I've booted these systems through NFS and then want to access
/dev/sda or /dev/sdb, for example via fdisk. This does not work.

2.) I've now tested the following kernels:
2.6.18.8 - works
2.6.19.7 - works
2.6.20 - does not work
2.6.21-rc4 - does not work

3.) The funny thing is that I can boot the whole system via 2.6.18.8,
for example, fdisk the hard disk, format it and copy the whole image
with a 2.6.20.3 kernel onto it. The server installed this way then works
perfectly, and I can also fdisk /dev/sdb and so on. It only fails if the
system itself is booted via NFS...

Stefan


* Re: Kernel 2.6.20 does not work anymore with SCSI or SATA on old Opteron / Xeon servers
  2007-03-20 10:33   ` Stefan Priebe
@ 2007-03-20 10:54     ` Olaf Kirch
  2007-03-20 10:59       ` Stefan Priebe
                         ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Olaf Kirch @ 2007-03-20 10:54 UTC (permalink / raw)
  To: stefan; +Cc: linux-kernel, linux-scsi

On Tuesday 20 March 2007 11:33, Stefan Priebe wrote:
> 1.) I've booted these systems through NFS and then want to access
> /dev/sda or /dev/sdb, for example via fdisk. This does not work.

What do you mean by "booted through NFS"? Do you mean the machine
runs with the root file system mounted via NFS? Or does it mean you
booted, and started the NFS server?

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir@lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax


* Re: Kernel 2.6.20 does not work anymore with SCSI or SATA on old Opteron / Xeon servers
  2007-03-20 10:54     ` Olaf Kirch
@ 2007-03-20 10:59       ` Stefan Priebe
  2007-03-20 11:20       ` Stefan Priebe
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Stefan Priebe @ 2007-03-20 10:59 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: linux-kernel, linux-scsi

Hello!

It runs with nfsroot

# mount
192.168.0.100:/PXE/debian on / type nfs (rw)

Kernel command line: nfs root=/dev/nfs nfsroot=192.168.0.100:/PXE/debian 
ip=dhcp

Stefan


* Re: Kernel 2.6.20 does not work anymore with SCSI or SATA on old Opteron / Xeon servers
  2007-03-20 10:54     ` Olaf Kirch
  2007-03-20 10:59       ` Stefan Priebe
@ 2007-03-20 11:20       ` Stefan Priebe
  2007-03-20 12:23       ` Stefan Priebe
  2007-03-20 13:28       ` Stefan Priebe
  3 siblings, 0 replies; 9+ messages in thread
From: Stefan Priebe @ 2007-03-20 11:20 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: linux-kernel, linux-scsi

Hello!

Here is some more information:
- sometimes the whole system crashes, sometimes it stays alive
- if it stays alive, fdisk consumes 99% CPU
- fdisk cannot be killed, not even with kill -9
- the same happens with a cat on /dev/sdX
- there is no problem when accessing /dev/hdX

Stefan


* Re: Kernel 2.6.20 does not work anymore with SCSI or SATA on old Opteron / Xeon servers
  2007-03-20 10:54     ` Olaf Kirch
  2007-03-20 10:59       ` Stefan Priebe
  2007-03-20 11:20       ` Stefan Priebe
@ 2007-03-20 12:23       ` Stefan Priebe
  2007-03-20 13:28       ` Stefan Priebe
  3 siblings, 0 replies; 9+ messages in thread
From: Stefan Priebe @ 2007-03-20 12:23 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: linux-kernel, linux-scsi

 >  - on a 2.6.20 system, try "dd if=/dev/sdb of=/dev/null bs=4k count=1" or
 >    something like this (with NFS root) - does this crash, too?
No, it does not crash. It is also no problem to set count= to 10000 or
so, or to change bs to 16k.

 >  - do you have ACLs on files in /dev?
no

 >  - enable the sysrq key, make sure kernel messages go to the console
 >    by using "dmesg -n7", and when the kernel hangs, try sysrq-p, and
 >    sysrq-t
 >    (sysrq is documented in Documentation/sysrq.txt in the kernel source)
 >  - try to capture the oops message - there must be one.

OK, I've done the following:
1.) I've set up netconsole
2.) dmesg -n7
3.) fdisk /dev/sda
4.) sysrq-t / sysrq-p

Here is the output of sysrq-p and sysrq-t. It hangs at nfs_sync_mapping_wait:
SysRq : Show Regs

Pid: 1598, comm:                fdisk
EIP: 0060:[<c03bf506>] CPU: 0
EIP is at _spin_lock+0x7/0xf
  EFLAGS: 00000286    Not tainted  (2.6.20.3 #6)
EAX: c3117afc EBX: c3117a2c ECX: 00000020 EDX: 00000000
ESI: f7b63ed4 EDI: f7b63f04 EBP: f7b63edc DS: 007b ES: 007b GS: 00d8
CR0: 8005003b CR2: b7f00f90 CR3: 033ea000 CR4: 000006d0
  [<c01b5c92>] nfs_sync_mapping_wait+0x83/0x1aa
  [<c01516c5>] cache_alloc_refill+0xc8/0x196
  [<c01b5eca>] nfs_sync_mapping_range+0x97/0xb6
  [<c01ae5cf>] nfs_getattr+0x3a/0x96
  [<c01ae595>] nfs_getattr+0x0/0x96
  [<c01565d9>] vfs_getattr+0x21/0x30
  [<c01566a3>] vfs_fstat+0x22/0x31
  [<c0156c51>] sys_fstat64+0xf/0x23
  [<c015da9c>] sys_ioctl+0x33/0x4b
  [<c0114358>] do_page_fault+0x0/0x549
  [<c010291c>] syscall_call+0x7/0xb
  [<c03b0033>] call_verify+0x182/0x36f
  =======================
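
The Show Regs trace above runs from sys_fstat64 through vfs_fstat,
vfs_getattr and nfs_getattr into nfs_sync_mapping_range and
nfs_sync_mapping_wait. A minimal userspace sketch of a call that enters
the kernel through that fstat path might look like the following. The
helper name `fstat_path` is hypothetical and is not taken from fdisk;
that fdisk issues exactly this sequence on the device node is an
assumption based only on the trace.

```c
#define _POSIX_C_SOURCE 200809L  /* expose open/fstat/close under strict -std= modes */

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open `path` read-only and fstat() the resulting descriptor.
 * Returns 0 and fills *st on success, -1 if open() or fstat() fails. */
int fstat_path(const char *path, struct stat *st)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = fstat(fd, st);  /* the syscall that ends up in the filesystem's getattr */
    close(fd);
    return rc;
}
```

When the root file system is on NFS, the device node itself is an NFS
inode, so on an affected kernel a call such as
`fstat_path("/dev/sda", &st)` would be expected to reach nfs_getattr in
the same way.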




SysRq : Show State

                          free                        sibling
   task             PC    stack   pid father child younger older
init          S C0117721     0     1      0     2               (NOTLB)
        c313fc48 00000082 c312fa90 c0117721 00100100 00200200 f7da9600 
f7941e40
        00000010 c313fc04 00000008 00000002 c3022700 c312fa90 c312fb9c 
000008dd
        64bf803e 00000029 c312f030 c313fc90 00000000 c30013c0 c03b3515 
c03b352f
Call Trace:
  [<c0117721>] default_wake_function+0x0/0xc
  [<c03b3515>] rpc_wait_bit_interruptible+0x0/0x1f
  [<c03b352f>] rpc_wait_bit_interruptible+0x1a/0x1f
  [<c03beb38>] __wait_on_bit+0x2c/0x51
  [<c03b3515>] rpc_wait_bit_interruptible+0x0/0x1f
  [<c03bebd0>] out_of_line_wait_on_bit+0x73/0x7b
  [<c012c950>] wake_bit_function+0x0/0x3c
  [<c012c950>] wake_bit_function+0x0/0x3c
  [<c03b3c6a>] __rpc_execute+0xdb/0x18b
  [<c03b354d>] rpc_set_active+0x19/0x57
  [<c03af1ef>] rpc_call_sync+0x71/0x98
  [<c01b1824>] nfs_proc_getattr+0x5b/0x7f
  [<c01ae981>] __nfs_revalidate_inode+0xe7/0x21a
  [<c01ad415>] nfs_permission+0x0/0x133
  [<c01ad415>] nfs_permission+0x0/0x133
  [<c01ad527>] nfs_permission+0x112/0x133
  [<c01ad415>] nfs_permission+0x0/0x133
  [<c0159928>] permission+0x94/0xa2
  [<c0159e57>] __link_path_walk+0x6c/0xa59
  [<c013e20c>] __alloc_pages+0x4a/0x2a3
  [<c015a883>] link_path_walk+0x3f/0xa4
  [<c015abc5>] do_path_lookup+0x170/0x18b
  [<c015ae0c>] __user_walk_fd+0x2d/0x43
  [<c0156601>] vfs_stat_fd+0x19/0x40
  [<c0156c0b>] sys_stat64+0xf/0x23
  [<c02456d4>] copy_to_user+0x2f/0x37
  [<c01234f6>] do_gettimeofday+0x35/0x119
  [<c011f93e>] sys_time+0x1e/0x2e
  [<c010291c>] syscall_call+0x7/0xb
  =======================
ksoftirqd/0   S C33442C0     0     3      1             4     2 (L-TLB)
        c3149fb8 00000046 c013cd73 c33442c0 00000000 c30131e0 00000003 
f7931900
        c301321c 00000000 c33f5030 00000000 c3012700 c3136030 c313613c 
000001d9
        a733fbbd 00000004 c04a8cc0 c0539380 c0539380 c0120494 fffffffc 
c01204d6
Call Trace:
  [<c013cd73>] mempool_free+0x65/0x6a
  [<c0120494>] ksoftirqd+0x0/0xa7
  [<c01204d6>] ksoftirqd+0x42/0xa7
  [<c012c5e6>] kthread+0x72/0x96
  [<c012c574>] kthread+0x0/0x96
  [<c01034f7>] kernel_thread_helper+0x7/0x10
  =======================
migration/1   S F745BF24     0     4      1             5     3 (L-TLB)
        c314bfb0 00000046 00000092 f745bf24 00000001 f745bf70 c314bf94 
f7ab03c0
        00000000 00000001 f745bf74 00000001 c301a700 c3139a90 c3139b9c 
000023c5
        b7d09ccb 00000004 c312f560 c301b054 c301a700 00000001 c314bfc4 
c0118643
Call Trace:
  [<c0118643>] migration_thread+0x7a/0xd2
  [<c01185c9>] migration_thread+0x0/0xd2
  [<c012c5e6>] kthread+0x72/0x96
  [<c012c574>] kthread+0x0/0x96
  [<c01034f7>] kernel_thread_helper+0x7/0x10
  =======================
ksoftirqd/1   S C301B1A0     0     5      1             6     4 (L-TLB)
        c316ffb8 00000046 00000000 c301b1a0 00000008 c012a884 c301b1e0 
f7f39040
        c012aa25 c301b21c 00000000 00000001 c301a700 c3139560 c313966c 
00000c4f
        48c808e9 00000004 c312f560 c0539380 c0539380 c0120494 fffffffc 
c01204d6
Call Trace:
  [<c012a884>] rcu_do_batch+0x1a/0x7f
  [<c012aa25>] __rcu_process_callbacks+0x8f/0xa1
  [<c0120494>] ksoftirqd+0x0/0xa7
  [<c01204d6>] ksoftirqd+0x42/0xa7
  [<c012c5e6>] kthread+0x72/0x96
  [<c012c574>] kthread+0x0/0x96
  [<c01034f7>] kernel_thread_helper+0x7/0x10
  =======================
migration/2   S F7B63F24     0     6      1             7     5 (L-TLB)
        c3171fb0 00000046 00000092 f7b63f24 00000001 f7b63f70 c3171f94 
f79703c0
        00000000 00000001 f7b63f74 00000002 c3022700 c3139030 c313913c 
000011f0
        482d3411 00000022 c312f030 c3023054 c3022700 00000002 c3171fc4 
c0118643
Call Trace:
  [<c0118643>] migration_thread+0x7a/0xd2
  [<c01185c9>] migration_thread+0x0/0xd2
  [<c012c5e6>] kthread+0x72/0x96
  [<c012c574>] kthread+0x0/0x96
  [<c01034f7>] kernel_thread_helper+0x7/0x10
  =======================
ksoftirqd/2   S C324D780     0     7      1             8     6 (L-TLB)
        c3175fb8 00000046 c013cd73 c324d780 00000000 c30231e0 00000003 
f7ba2740
        c302321c 00000000 c053ab90 00000002 c3022700 c3155a90 c3155b9c 
00000564
        610707d5 00000004 c312f030 c0539380 c0539380 c0120494 fffffffc 
c01204d6
Call Trace:
  [<c013cd73>] mempool_free+0x65/0x6a
  [<c0120494>] ksoftirqd+0x0/0xa7
  [<c01204d6>] ksoftirqd+0x42/0xa7
  [<c012c5e6>] kthread+0x72/0x96
  [<c012c574>] kthread+0x0/0x96
  [<c01034f7>] kernel_thread_helper+0x7/0x10
  =======================
migration/3   S F74F1F24     0     8      1             9     7 (L-TLB)
        c3177fb0 00000046 00000092 f74f1f24 00000001 f74f1f70 c3177f94 
f7ab03c0
        00000000 00000001 f74f1f74 00000003 c302a700 c3155560 c315566c 
00000ea1
        b2116928 00000004 c3136a90 c302b054 c302a700 00000003 c3177fc4 
c0118643
Call Trace:
  [<c0118643>] migration_thread+0x7a/0xd2
  [<c01185c9>] migration_thread+0x0/0xd2
  [<c012c5e6>] kthread+0x72/0x96
  [<c012c574>] kthread+0x0/0x96
  [<c01034f7>] kernel_thread_helper+0x7/0x10
  =======================
ksoftirqd/3   S C317BFC4     0     9      1            10     8 (L-TLB)
        c317bfb8 00000046 c03be392 c317bfc4 00000046 00000086 c313fee8 
00000002 c312f560 kthread+0x72/0x96
0000002e schedule_timeout+0x70/0x8d
00000082 prep_new_page+0xb2/0xea
  [<c02456d4>] inet_csk_accept+0x51/0x125


Stefan


Olaf Kirch wrote:
 > On Tuesday 20 March 2007 11:59, Stefan Priebe wrote:
 >> Kernel command line: nfs root=/dev/nfs nfsroot=192.168.0.100:/PXE/debian
 >> ip=dhcp
 >
 > Some things that may be worth trying:
 >
 >  - on a 2.6.20 system, try "dd if=/dev/sdb of=/dev/null bs=4k count=1" or
 >    something like this (with NFS root) - does this crash, too?
 >
 >  - do you have ACLs on files in /dev?
 >
 >  - enable the sysrq key, make sure kernel messages go to the console
 >    by using "dmesg -n7", and when the kernel hangs, try sysrq-p, and
 >    sysrq-t
 >    (sysrq is documented in Documentation/sysrq.txt in the kernel source)
 >
 >  - try to capture the oops message - there must be one.
 >
 > Olaf



* Re: Kernel 2.6.20 does not work anymore with SCSI or SATA on old Opteron / Xeon servers
  2007-03-20 10:54     ` Olaf Kirch
                         ` (2 preceding siblings ...)
  2007-03-20 12:23       ` Stefan Priebe
@ 2007-03-20 13:28       ` Stefan Priebe
  2007-03-20 16:01         ` Chuck Ebbert
  3 siblings, 1 reply; 9+ messages in thread
From: Stefan Priebe @ 2007-03-20 13:28 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: linux-kernel, linux-scsi, akpm

[-- Attachment #1: Type: text/plain, Size: 293 bytes --]

Hello!

With sysrq I've found the function which is the problem:
inode.c => nfs_getattr => nfs_sync_mapping_range

I've also found the attached patch, which is not included in any stable
release or in 2.6.21.X, but has been public since 2007-02-20.

I think this is very important.

Stefan Priebe

[-- Attachment #2: linux-2.6.20-001-fix_block_device_getattr.dif --]
[-- Type: text/plain, Size: 818 bytes --]

commit 090ad38f8ceea3cc048981e9fe9cc62ed43fee58
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date:   Tue Feb 20 19:28:07 2007 -0500

    NFS: nfs_getattr() can't call nfs_sync_mapping_range() for non-regular files
    
    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index af53c02..93d046c 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -429,7 +429,8 @@ int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
 	int err;
 
 	/* Flush out writes to the server in order to update c/mtime */
-	nfs_sync_mapping_range(inode->i_mapping, 0, 0, FLUSH_NOCOMMIT);
+	if (S_ISREG(inode->i_mode))
+		nfs_sync_mapping_range(inode->i_mapping, 0, 0, FLUSH_NOCOMMIT);
 
 	/*
 	 * We may force a getattr if the user cares about atime.
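
The guard added by the patch uses the standard S_ISREG() file-type test,
so a block device node such as /dev/sda on an NFS root no longer triggers
the flush. The userspace C sketch below only mirrors that condition to
show which file types pass it. The function name
`needs_flush_before_getattr` is hypothetical and does not exist in the
kernel; the kernel tests inode->i_mode rather than a mode taken from
stat().

```c
#define _XOPEN_SOURCE 700  /* expose the S_IF* constants under strict -std= modes */

#include <sys/stat.h>

/* Mirrors the condition the patch adds in nfs_getattr(): only regular
 * files are flushed before their attributes are fetched. Returns 1 for
 * regular files and 0 for every other file type. */
int needs_flush_before_getattr(mode_t mode)
{
    return S_ISREG(mode) ? 1 : 0;
}
```

With this predicate, a mode of `S_IFREG | 0644` yields 1, while
`S_IFBLK | 0660` (a disk device node) and `S_IFCHR | 0666` yield 0.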


* Re: Kernel 2.6.20 does not work anymore with SCSI or SATA on old Opteron / Xeon servers
  2007-03-20 13:28       ` Stefan Priebe
@ 2007-03-20 16:01         ` Chuck Ebbert
  0 siblings, 0 replies; 9+ messages in thread
From: Chuck Ebbert @ 2007-03-20 16:01 UTC (permalink / raw)
  To: stefan; +Cc: Olaf Kirch, linux-kernel, linux-scsi, akpm

Stefan Priebe wrote:
> Hello!
> 
> With sysrq I've found the function which is the problem:
> inode.c => nfs_getattr => nfs_sync_mapping_range
> 
> I've also found the attached patch, which is not included in any stable
> release or in 2.6.21.X, but has been public since 2007-02-20.
> 
> I think this is very important.
> 

It is queued for 2.6.20.4.

