LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* INFO: task mount:11202 blocked for more than 120 seconds
@ 2008-03-07 20:32 Christian Kujau
2008-03-07 22:40 ` David Chinner
0 siblings, 1 reply; 34+ messages in thread
From: Christian Kujau @ 2008-03-07 20:32 UTC (permalink / raw)
To: LKML; +Cc: xfs
Hi,
after upgrading from 2.6.24.1 to 2.6.25-rc3, I came across[0]. This
warning seems to be gone now. With 2.6.25-rc4 (and the fix from [1])
the box was running fine for 20 hours or so (doing its usual jobs plus
a "make randconfig && make" loop).
After this, I noticed that /bin/sync would not exit anymore and
remains stuck in D state. Looking around I noticed that the rsync
backup jobs (rsync'ing to an xfs partition) from earlier this
morning did not exit either and hung in D state. With sync hung, the
following messages started to appear:
[75377.756985] INFO: task sync:2697 blocked for more than 120 seconds.
[75377.757579] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[75377.758211] sync D c013835c 0 2697 16457
[75377.758216] f59506c0 00000082 f4c34000 c013835c fffeffff f6c1bcb0 f5dd0000 f4c34000
[75377.758223] c04405d7 f53f7e98 f6c1bcb4 f6c1bcd0 00000000 f6c1bcb0 00000000 f7ca1090
[75377.758230] f4c34000 c044070a f6c1bcd0 f6c1bcd0 f5dd0000 00000001 f6c1bcb0 c044074b
[75377.758237] Call Trace:
[75377.758253] [<c013835c>] trace_hardirqs_on+0x9c/0x110
[75377.758269] [<c04405d7>] rwsem_down_failed_common+0x67/0x150
[75377.758279] [<c044070a>] rwsem_down_read_failed+0x1a/0x24
[75377.758286] [<c044074b>] call_rwsem_down_read_failed+0x7/0xc
[75377.758291] [<c012fd7c>] down_read_nested+0x4c/0x60
[75377.758295] [<c027a64b>] xfs_ilock+0x5b/0xb0
[75377.758301] [<c027a64b>] xfs_ilock+0x5b/0xb0
[75377.758306] [<c029693d>] xfs_sync_inodes+0x3dd/0x6b0
[75377.758314] [<c0440b14>] _spin_unlock+0x14/0x20
[75377.758325] [<c0296d9b>] xfs_syncsub+0x18b/0x300
[75377.758330] [<c0440b14>] _spin_unlock+0x14/0x20
[75377.758335] [<c02a7c2b>] xfs_fs_sync_super+0x2b/0xd0
[75377.758342] [<c016a124>] sync_filesystems+0xa4/0x100
[75377.758351] [<c043fdd8>] down_read+0x38/0x50
[75377.758356] [<c016a13f>] sync_filesystems+0xbf/0x100
[75377.758361] [<c01872b3>] do_sync+0x33/0x70
[75377.758366] [<c0102ed7>] restore_nocheck+0x12/0x15
[75377.758371] [<c01872fa>] sys_sync+0xa/0x10
[75377.758375] [<c0102dee>] sysenter_past_esp+0x5f/0xa5
[75377.758402] =======================
[75377.758405] 3 locks held by sync/2697:
[75377.758407] #0: (mutex){--..}, at: [<c016a091>] sync_filesystems+0x11/0x100
[75377.758414] #1: (&type->s_umount_key#22){----}, at: [<c016a124>] sync_filesystems+0xa4/0x100
[75377.758422] #2: (&(&ip->i_iolock)->mr_lock){----}, at: [<c027a64b>] xfs_ilock+0x5b/0xb0
The box is still up & running, although the load is increasing slightly.
I've gathered some details here: http://nerdbynature.de/bits/2.6.25-rc4/
I've searched the archives for this error, but the only thing was
* http://lkml.org/lkml/2008/2/12/44
[BUG] 2.6.25-rc1-git1 softlockup while bootup on powerpc
...however, I don't get "CPU stuck" messages
* http://lkml.org/lkml/2008/1/29/370
Re: system hang on latest git
...but calltrace looks a lot different.
Since both mailings are not so current, I'd like to got back to -rc3 and
try to reproduce this one.
Do you have any idea what's going on here?
Thanks,
Christian.
[0] http://lkml.org/lkml/2008/3/2/171
[1] http://lkml.org/lkml/2008/3/4/634
--
BOFH excuse #158:
Defunct processes
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: INFO: task mount:11202 blocked for more than 120 seconds
2008-03-07 20:32 INFO: task mount:11202 blocked for more than 120 seconds Christian Kujau
@ 2008-03-07 22:40 ` David Chinner
2008-03-07 23:46 ` Christian Kujau
0 siblings, 1 reply; 34+ messages in thread
From: David Chinner @ 2008-03-07 22:40 UTC (permalink / raw)
To: Christian Kujau; +Cc: LKML, xfs
On Fri, Mar 07, 2008 at 09:32:57PM +0100, Christian Kujau wrote:
> Hi,
>
> after upgrading from 2.6.24.1 to 2.6.25-rc3, I came across[0]. This
> warning seems to be gone now. With 2.6.25-rc4 (and the fix from [1])
> the box was running fine for 20 hours or so (doing its usual jobs plus
> a "make randconfig && make" loop).
>
> After this, I noticed that /bin/sync would not exit anymore and
> remains stuck in D state. Looking around I noticed that the rsync
> backup jobs (rsync'ing to an xfs partition) from earlier this
> morning did not exit either and hung in D state. With sync hung, the
> following messages started to appear:
>
> [75377.756985] INFO: task sync:2697 blocked for more than 120 seconds.
> [75377.757579] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [75377.758211] sync D c013835c 0 2697 16457
> [75377.758216] f59506c0 00000082 f4c34000 c013835c fffeffff f6c1bcb0
> f5dd0000 f4c34000 [75377.758223] c04405d7 f53f7e98 f6c1bcb4 f6c1bcd0
> 00000000 f6c1bcb0 00000000 f7ca1090 [75377.758230] f4c34000 c044070a
> f6c1bcd0 f6c1bcd0 f5dd0000 00000001 f6c1bcb0 c044074b [75377.758237] Call
> Trace:
> [75377.758253] [<c013835c>] trace_hardirqs_on+0x9c/0x110
> [75377.758269] [<c04405d7>] rwsem_down_failed_common+0x67/0x150
> [75377.758279] [<c044070a>] rwsem_down_read_failed+0x1a/0x24
> [75377.758286] [<c044074b>] call_rwsem_down_read_failed+0x7/0xc
> [75377.758291] [<c012fd7c>] down_read_nested+0x4c/0x60
> [75377.758295] [<c027a64b>] xfs_ilock+0x5b/0xb0
> [75377.758301] [<c027a64b>] xfs_ilock+0x5b/0xb0
> [75377.758306] [<c029693d>] xfs_sync_inodes+0x3dd/0x6b0
> [75377.758314] [<c0440b14>] _spin_unlock+0x14/0x20
> [75377.758325] [<c0296d9b>] xfs_syncsub+0x18b/0x300
> [75377.758330] [<c0440b14>] _spin_unlock+0x14/0x20
> [75377.758335] [<c02a7c2b>] xfs_fs_sync_super+0x2b/0xd0
> [75377.758342] [<c016a124>] sync_filesystems+0xa4/0x100
> [75377.758351] [<c043fdd8>] down_read+0x38/0x50
> [75377.758356] [<c016a13f>] sync_filesystems+0xbf/0x100
> [75377.758361] [<c01872b3>] do_sync+0x33/0x70
> [75377.758366] [<c0102ed7>] restore_nocheck+0x12/0x15
> [75377.758371] [<c01872fa>] sys_sync+0xa/0x10
> [75377.758375] [<c0102dee>] sysenter_past_esp+0x5f/0xa5
> [75377.758402] =======================
> [75377.758405] 3 locks held by sync/2697:
> [75377.758407] #0: (mutex){--..}, at: [<c016a091>]
> sync_filesystems+0x11/0x100
> [75377.758414] #1: (&type->s_umount_key#22){----}, at: [<c016a124>]
> sync_filesystems+0xa4/0x100
> [75377.758422] #2: (&(&ip->i_iolock)->mr_lock){----}, at: [<c027a64b>]
> xfs_ilock+0x5b/0xb0
Well, if that is hung there, something else must be holding on to
the iolock it's waiting on. What are the other D state processes in the
machine?
Also, the iolock can be held across I/O so it's possible you've lost an I/O.
Any I/O errors in the syslog?
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: INFO: task mount:11202 blocked for more than 120 seconds
2008-03-07 22:40 ` David Chinner
@ 2008-03-07 23:46 ` Christian Kujau
2008-03-08 1:54 ` Christian Kujau
2008-03-09 21:34 ` INFO: task mount:11202 blocked for more than 120 seconds David Chinner
0 siblings, 2 replies; 34+ messages in thread
From: Christian Kujau @ 2008-03-07 23:46 UTC (permalink / raw)
To: David Chinner; +Cc: LKML, xfs
On Sat, 8 Mar 2008, David Chinner wrote:
> Well, if that is hung there, something else must be holding on to
> the iolock it's waiting on. What are the other D state processes in the
> machine?
I have 7 processes in D state so far:
$ ps auxww [....]
root 9844 0.0 0.0 0 0 ? D Mar06 0:22 [pdflush]
root 2697 0.0 0.0 4712 460 ? D Mar07 0:00 sync
root 8342 0.0 0.0 1780 440 ? D Mar07 0:01 /bin/rm -rf /data/md1/stuff
root 12494 0.0 0.0 11124 1228 ? D Mar07 0:14 /usr/bin/rsync
root 15008 0.0 0.0 4712 460 ? D Mar07 0:00 sync
root 11202 0.0 0.0 5012 764 ? D Mar07 0:00 mount -o remount,ro /data/md1
root 15936 0.0 0.0 4712 460 ? D Mar07 0:00 sync
At one point I did a sysrq-D and put the results in:
http://nerdbynature.de/bits/2.6.25-rc4/hung_task/kern.log.gz
(grep for "SysRq : Show Locks Held" and "SysRq : Show Blocked State")
> Also, the iolock can be held across I/O so it's possible you've lost an I/O.
> Any I/O errors in the syslog?
No, no I/O errors at all. See the kern.log above, I could even do dd(1)
from the md1 (dm-crypt on raid1), no errors either.
thanks,
Christian.
--
BOFH excuse #233:
TCP/IP UDP alarm threshold is set too low.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: INFO: task mount:11202 blocked for more than 120 seconds
2008-03-07 23:46 ` Christian Kujau
@ 2008-03-08 1:54 ` Christian Kujau
2008-03-09 6:15 ` 2.6.25-rc hangs (was: INFO: task mount:11202 blocked for more than 120 seconds) Christian Kujau
2008-03-09 21:34 ` INFO: task mount:11202 blocked for more than 120 seconds David Chinner
1 sibling, 1 reply; 34+ messages in thread
From: Christian Kujau @ 2008-03-08 1:54 UTC (permalink / raw)
To: David Chinner; +Cc: LKML, xfs
On Sat, 8 Mar 2008, Christian Kujau wrote:
> I have 7 processes in D state so far:
FWIW, it's 100% reproducible with 2.6.25-rc3 too...sigh :-\
So, the last working kernel for me is 2.6.24.1 - that's a lot of bisecting
and I fear that compile errors will invalidate the bisecting results again
or make it impossible at all....I'll try anyway....tomorrow...
C.
--
BOFH excuse #285:
Telecommunications is upgrading.
^ permalink raw reply [flat|nested] 34+ messages in thread
* 2.6.25-rc hangs (was: INFO: task mount:11202 blocked for more than 120 seconds)
2008-03-08 1:54 ` Christian Kujau
@ 2008-03-09 6:15 ` Christian Kujau
2008-03-09 16:44 ` 2.6.25-rc hangs Eric Sandeen
0 siblings, 1 reply; 34+ messages in thread
From: Christian Kujau @ 2008-03-09 6:15 UTC (permalink / raw)
To: David Chinner; +Cc: LKML, xfs
On Sat, 8 Mar 2008, Christian Kujau wrote:
> FWIW, it's 100% reproducible with 2.6.25-rc3 too...sigh :-\
> So, the last working kernel for me is 2.6.24.1 - that's a lot of bisecting
> and I fear that compile errors will invalidate the bisecting results again or
> make it impossible at all....I'll try anyway....tomorrow...
Bisecting failed as expected :-(
I tried to follow the git-bisect manpage (and have successfully used
bisect in the past a few times), but I think ~5700 revisions between
2.6.24 and 2.6.25 are just too much fuzz. The bisect logs so far, with
my comments inbetween:
git-bisect start
# 2.6.25-rc4, known to be bad:
# bad: [84c6f6046c5a2189160a8f0dca8b90427bf690ea] x86_64: make ptrace always sign-extend orig_ax to 64 bits
git-bisect bad 84c6f6046c5a2189160a8f0dca8b90427bf690ea
# 2.6.24, good:
# good: [49914084e797530d9baaf51df9eda77babc98fa8] Linux 2.6.24
git-bisect good 49914084e797530d9baaf51df9eda77babc98fa8
# 2.6.24, hard lockup during boot, bad:
# bad: [bd45ac0c5daae35e7c71138172e63df5cf644cf6] Merge branch 'linux-2.6'
git-bisect bad bd45ac0c5daae35e7c71138172e63df5cf644cf6
I marked the last one bad, because I could not boot any more. As it's a
headless box, I could not get more details. It did not even respond so
sysrq-b. After marking this bad, I compiled and booted again - same
result, hard lockup. So I tried again:
git bisect reset
git-bisect start
# 2.6.25-rc4, known to be bad:
# bad: [84c6f6046c5a2189160a8f0dca8b90427bf690ea] x86_64: make ptrace always sign-extend orig_ax to 64 bits
git-bisect bad 84c6f6046c5a2189160a8f0dca8b90427bf690ea
# 2.6.24, good:
# good: [49914084e797530d9baaf51df9eda77babc98fa8] Linux 2.6.24
git-bisect good 49914084e797530d9baaf51df9eda77babc98fa8
# 2.6.24, hard lockup during boot, marking good anyway:
# good: [bd45ac0c5daae35e7c71138172e63df5cf644cf6] Merge branch 'linux-2.6'
git-bisect good bd45ac0c5daae35e7c71138172e63df5cf644cf6
# lockup
# bad: [f0f1b3364ae7f48084bdf2837fb979ff59622523] Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
git-bisect bad f0f1b3364ae7f48084bdf2837fb979ff59622523
Although I could not boot with bd45ac0c5daae35e7c71138172e63df5cf644cf6, I
marked it "good", as the lockup is totally unrelated to my problem.
However, the box locks up as soon as I'm using the device-mapper. This
time it does respond to sysrq-b.
But still: I'm unable to diagnose the system hang [0] and I fear that
2.6.25 is released and for the first time since ages I'd have to skip a release...
Help!
Christian.
[0] http://lkml.org/lkml/2008/3/7/308
--
BOFH excuse #288:
Hard drive sleeping. Let it wake up on it's own...
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: 2.6.25-rc hangs
2008-03-09 6:15 ` 2.6.25-rc hangs (was: INFO: task mount:11202 blocked for more than 120 seconds) Christian Kujau
@ 2008-03-09 16:44 ` Eric Sandeen
2008-03-09 18:05 ` Christian Kujau
0 siblings, 1 reply; 34+ messages in thread
From: Eric Sandeen @ 2008-03-09 16:44 UTC (permalink / raw)
To: Christian Kujau; +Cc: David Chinner, LKML, xfs
Christian Kujau wrote:
> On Sat, 8 Mar 2008, Christian Kujau wrote:
>> FWIW, it's 100% reproducible with 2.6.25-rc3 too...sigh :-\
>> So, the last working kernel for me is 2.6.24.1 - that's a lot of bisecting
>> and I fear that compile errors will invalidate the bisecting results again or
>> make it impossible at all....I'll try anyway....tomorrow...
>
> Bisecting failed as expected :-(
> I tried to follow the git-bisect manpage (and have successfully used
> bisect in the past a few times), but I think ~5700 revisions between
> 2.6.24 and 2.6.25 are just too much fuzz. The bisect logs so far, with
> my comments inbetween:
Christian, what is the test you are using for the bisect?
Thanks,
-Eric
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: 2.6.25-rc hangs
2008-03-09 16:44 ` 2.6.25-rc hangs Eric Sandeen
@ 2008-03-09 18:05 ` Christian Kujau
[not found] ` <47D42AD6.10500@sandeen.net>
0 siblings, 1 reply; 34+ messages in thread
From: Christian Kujau @ 2008-03-09 18:05 UTC (permalink / raw)
To: Eric Sandeen; +Cc: David Chinner, LKML, xfs
On Sun, 9 Mar 2008, Eric Sandeen wrote:
>> I tried to follow the git-bisect manpage (and have successfully used
>> bisect in the past a few times), but I think ~5700 revisions between
>> 2.6.24 and 2.6.25 are just too much fuzz. The bisect logs so far, with
>> my comments inbetween:
>
> Christian, what is the test you are using for the bisect?
Sorry, I don't understand: which "test" do you mean?
I did the bisect as per the bisect log and then just rebooted. Which gave
me no usable results yet. I'm trying to boot 2.6.25-rc1 in a few moments
and see if the hang is there as well. If it is, I'll start bisecting again
and hope that "halfway between -rc1 and .24" will be a bootable kernel
this time...
The "error" I'm trying to chase is a system "hang", but no instant lockup.
I can reproduce this by increasing disk I/O. I did this primarily with
rsync from different filesystems to my backup XFS partition. After a few
minutes, the INFO: messages[0] appeared.
Thanks,
Christian.
[0] http://lkml.org/lkml/2008/3/7/308
--
BOFH excuse #2:
solar flares
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: 2.6.25-rc hangs
[not found] ` <47D42AD6.10500@sandeen.net>
@ 2008-03-09 18:49 ` Christian Kujau
2008-03-12 18:06 ` Samuel Tardieu
0 siblings, 1 reply; 34+ messages in thread
From: Christian Kujau @ 2008-03-09 18:49 UTC (permalink / raw)
To: Eric Sandeen; +Cc: LKML
On Sun, 9 Mar 2008, Eric Sandeen wrote:
> I meant the test you were using to determine "good" from "bad" - I guess
> it was "boot up and do IO for a while to see if it hangs"
Maybe I was too confusing, let me try again:
* 2.6.24.1 was running fine for weeks (and is now)
* 2.6.25-rc3, -rc4 comes with system hang. Trying to bisect it failed
at a really early stage:
[check out current -git, 2.6.25-rc4)
$ git bisect start
$ git bisect bad # because I know, that current -git is bad
$ git bisect good v2.6.24 # because I know 2.6.24 is good
[ compiling, and first reboot]
-> failed, because hard lockup.
After rebooting to a working kernel I can now do either:
1) mark the current one as "bad", solely on the fact that "it does not
boot" and is "bad" per se, ignoring the fact that if the box *had*
booted, the system hang *perhaps* did not occur. IOW, I'm marking
it "bad" because of a totally different issue.
2) although booting failed, I still mark it "good", which means I'm
literally *guessing* that this current kernel (bd45ac0c5daa...)
does NOT have the system hang and thus affecting all subsequent
bisects. Worst case: this guess turns out to be false and I'm bisecting
all through all ~2800 revisions without ever getting the correct "bad"
one (because they were actually all "bad").
I tried 1), but did not get any further, as the next kernel did not boot
either.
I tried 2), but did not get any further, as the next kernel did boot but
locked up when I tried to use the device mapper.
Hope that's a bit better explained than before...
Thanks,
C.
--
BOFH excuse #58:
high pressure system failure
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: INFO: task mount:11202 blocked for more than 120 seconds
2008-03-07 23:46 ` Christian Kujau
2008-03-08 1:54 ` Christian Kujau
@ 2008-03-09 21:34 ` David Chinner
2008-03-10 1:46 ` Christian Kujau
` (2 more replies)
1 sibling, 3 replies; 34+ messages in thread
From: David Chinner @ 2008-03-09 21:34 UTC (permalink / raw)
To: Christian Kujau; +Cc: David Chinner, LKML, xfs, dm-devel
[adding dm-devel to cc list]
On Sat, Mar 08, 2008 at 12:46:40AM +0100, Christian Kujau wrote:
> On Sat, 8 Mar 2008, David Chinner wrote:
> >Well, if that is hung there, something else must be holding on to
> >the iolock it's waiting on. What are the other D state processes in the
> >machine?
>
> I have 7 processes in D state so far:
>
> $ ps auxww [....]
> root 9844 0.0 0.0 0 0 ? D Mar06 0:22 [pdflush]
> root 2697 0.0 0.0 4712 460 ? D Mar07 0:00 sync
> root 8342 0.0 0.0 1780 440 ? D Mar07 0:01 /bin/rm -rf
> /data/md1/stuff
> root 12494 0.0 0.0 11124 1228 ? D Mar07 0:14 /usr/bin/rsync
> root 15008 0.0 0.0 4712 460 ? D Mar07 0:00 sync
> root 11202 0.0 0.0 5012 764 ? D Mar07 0:00 mount -o
> remount,ro /data/md1
> root 15936 0.0 0.0 4712 460 ? D Mar07 0:00 sync
>
> At one point I did a sysrq-D and put the results in:
> http://nerdbynature.de/bits/2.6.25-rc4/hung_task/kern.log.gz
> (grep for "SysRq : Show Locks Held" and "SysRq : Show Blocked State")
Ok, this looks like a block layer issue. Everything is waiting in ioschedule()
and so in places we are blocked holding locked inodes. Hence sync, pdflush,
etc are hung waiting for the inode locks to be released. e.g:
INFO: task rm:8342 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rm D ee08de8c 0 8342 8199
f5eebd80 00000086 c0380a16 ee08de8c ee08de8c 00000000 ee08de94 c200ebd0
c043ef2b c0146237 c043f1d2 c0146210 ee08de8c 00000000 00000000 db122110
c01464ca 00000002 c1865b00 0000000c 00000000 ee3acd60 c012c700 c200ebec
Call Trace:
[<c0380a16>] dm_table_unplug_all+0x26/0x40
[<c043ef2b>] io_schedule+0xb/0x20
[<c0146237>] sync_page+0x27/0x40
[<c043f1d2>] __wait_on_bit+0x42/0x70
[<c0146210>] sync_page+0x0/0x40
[<c01464ca>] wait_on_page_bit+0x5a/0x60
[<c012c700>] wake_bit_function+0x0/0x60
[<c014ec53>] truncate_inode_pages_range+0x223/0x360
[<c014eda7>] truncate_inode_pages+0x17/0x20
[<c017b44f>] generic_delete_inode+0xef/0x100
[<c017a8ac>] iput+0x5c/0x70
[<c0171e47>] do_unlinkat+0xf7/0x160
[<c0102e29>] sysenter_past_esp+0x9a/0xa5
[<c013835c>] trace_hardirqs_on+0x9c/0x110
[<c0102dee>] sysenter_past_esp+0x5f/0xa5
=======================
no locks held by rm/8342.
And rsync is stuck in congestion_wait()
SysRq : Show Blocked State
task PC stack pid father
rsync D 00000292 0 12494 1
f5dc7b40 00000086 00000000 00000292 f697bcdc 00735f5c 00000000 00000600
c043efd9 c013820d f532ed60 c05c0f04 f5cc1b58 00735f5c c0122900 f532ed60
c05c0c00 c053eb04 0000000a c043ef0b c01515f8 00000000 f532ed60 c012c6c0
Call Trace:
[<c043efd9>] schedule_timeout+0x49/0xc0
[<c013820d>] mark_held_locks+0x3d/0x70
[<c0122900>] process_timeout+0x0/0x10
[<c043ef0b>] io_schedule_timeout+0xb/0x20
[<c01515f8>] congestion_wait+0x58/0x80
[<c012c6c0>] autoremove_wake_function+0x0/0x40
[<c014cc32>] balance_dirty_pages_ratelimited_nr+0xb2/0x240
[<c0147368>] generic_file_buffered_write+0x1a8/0x650
[<c028540e>] xfs_log_move_tail+0x3e/0x180
[<c0440be9>] _spin_lock+0x29/0x40
[<c02a73cc>] xfs_write+0x7ac/0x8a0
[<c0174c01>] core_sys_select+0x21/0x350
[<c02a32bc>] xfs_file_aio_write+0x5c/0x70
[<c0167d25>] do_sync_write+0xd5/0x120
[<c0102ed7>] restore_nocheck+0x12/0x15
[<c012c6c0>] autoremove_wake_function+0x0/0x40
[<c019d105>] dnotify_parent+0x35/0x90
[<c0167c50>] do_sync_write+0x0/0x120
[<c016859f>] vfs_write+0x9f/0x140
[<c0168b51>] sys_write+0x41/0x70
[<c0102dee>] sysenter_past_esp+0x5f/0xa5
And this rsync procss will definitely be holding the iolock on an XFS
inode here (which is why other processes are hanging on an inode iolock).
> >Also, the iolock can be held across I/O so it's possible you've lost an
> >I/O. Any I/O errors in the syslog?
>
> No, no I/O errors at all. See the kern.log above, I could even do dd(1)
> from the md1 (dm-crypt on raid1), no errors either.
Oh, dm-crypt. Well, I'd definitely start looking there. XFS has a
history of exposing dm-crypt bugs, and these hangs appear to be I/O
congestion/scheduling related and not XFS. Also, we haven't changed
anything related to plug/unplug of block devices in XFS recently, so
that also points to some other change as well...
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: INFO: task mount:11202 blocked for more than 120 seconds
2008-03-09 21:34 ` INFO: task mount:11202 blocked for more than 120 seconds David Chinner
@ 2008-03-10 1:46 ` Christian Kujau
2008-03-12 18:03 ` Samuel Tardieu
2008-03-14 9:27 ` Milan Broz
2 siblings, 0 replies; 34+ messages in thread
From: Christian Kujau @ 2008-03-10 1:46 UTC (permalink / raw)
To: David Chinner; +Cc: LKML, xfs, dm-devel
On Mon, 10 Mar 2008, David Chinner wrote:
> Oh, dm-crypt. Well, I'd definitely start looking there. XFS has a
> history of exposing dm-crypt bugs, and these hangs appear to be I/O
> congestion/scheduling related and not XFS.
Yeah, I noticed that too, thanks for verifying this: during the 2nd bisect
run, the box locked up hard when I accessed the device-mapper. I'm using a
wrapper script to set up my luks/dm-crypt devices and still have to find
out which command exactly triggers the lockup. So, maybe the hard lockup
and the hangs are not so unrelated after all...oh well.
> Also, we haven't changed
> anything related to plug/unplug of block devices in XFS recently, so
> that also points to some other change as well...
Thanks for your assistance, David, I really appreciate it. I'll try to
find out more about this dm-crypt thingy....
Christian.
--
BOFH excuse #396:
Mail server hit by UniSpammer.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: INFO: task mount:11202 blocked for more than 120 seconds
2008-03-09 21:34 ` INFO: task mount:11202 blocked for more than 120 seconds David Chinner
2008-03-10 1:46 ` Christian Kujau
@ 2008-03-12 18:03 ` Samuel Tardieu
2008-03-12 19:53 ` Chr
2008-03-14 9:27 ` Milan Broz
2 siblings, 1 reply; 34+ messages in thread
From: Samuel Tardieu @ 2008-03-12 18:03 UTC (permalink / raw)
To: device-mapper development; +Cc: Christian Kujau, David Chinner, LKML, xfs
>>>>> "David" == David Chinner <dgc@sgi.com> writes:
>> No, no I/O errors at all. See the kern.log above, I could even do
>> dd(1) from the md1 (dm-crypt on raid1), no errors either.
David> Oh, dm-crypt. Well, I'd definitely start looking there. XFS has
David> a history of exposing dm-crypt bugs, and these hangs appear to
David> be I/O congestion/scheduling related and not XFS. Also, we
David> haven't changed anything related to plug/unplug of block
David> devices in XFS recently, so that also points to some other
David> change as well...
For what it is worth, I notice the same error last week (with a kernel
of the day) on my laptop but was too busy at work to investigate. And
I use ext3... on dm-crypt.
Sam
--
Samuel Tardieu -- sam@rfc1149.net -- http://www.rfc1149.net/
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: 2.6.25-rc hangs
2008-03-09 18:49 ` Christian Kujau
@ 2008-03-12 18:06 ` Samuel Tardieu
2008-03-12 21:02 ` Christian Kujau
0 siblings, 1 reply; 34+ messages in thread
From: Samuel Tardieu @ 2008-03-12 18:06 UTC (permalink / raw)
To: Christian Kujau; +Cc: Eric Sandeen, LKML
>>>>> "Christian" == Christian Kujau <lists@nerdbynature.de> writes:
Christian> After rebooting to a working kernel I can now do either:
Christian> 1) mark the current one as "bad" [...]
Christian> 2) although booting failed, I still mark it "good" [...]
Can't you skip it with "git bisect skip" in this case? (I've never
tried "skip" but from "git bisect" documentation it does what you
want, and "git bisect" will select a nearby revision)
Sam
--
Samuel Tardieu -- sam@rfc1149.net -- http://www.rfc1149.net/
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: INFO: task mount:11202 blocked for more than 120 seconds
2008-03-12 18:03 ` Samuel Tardieu
@ 2008-03-12 19:53 ` Chr
2008-03-12 23:07 ` Christian Kujau
0 siblings, 1 reply; 34+ messages in thread
From: Chr @ 2008-03-12 19:53 UTC (permalink / raw)
To: Samuel Tardieu
Cc: device-mapper development, Christian Kujau, David Chinner, LKML, xfs
On Wednesday 12 March 2008 19:03:51 Samuel Tardieu wrote:
>
> David> Oh, dm-crypt. Well, I'd definitely start looking there. XFS has
> David> a history of exposing dm-crypt bugs, and these hangs appear to
> David> be I/O congestion/scheduling related and not XFS. Also, we
> David> haven't changed anything related to plug/unplug of block
> David> devices in XFS recently, so that also points to some other
> David> change as well...
>
me too...
http://lkml.org/lkml/2008/3/11/377
(BTW: I'm still bisecting it... can somebody please confirm that 2.6.24-git17
http://www.kernel.org/pub/linux/kernel/v2.6/snapshots/patch-2.6.24-git17.bz2
was the _last_ "good" one?)
BTW:
by pressing "SYSRQ-S (Emerg. Sync)" over and over again can unlock
some but not all "D" tasks...
Regards,
Chr.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: 2.6.25-rc hangs
2008-03-12 18:06 ` Samuel Tardieu
@ 2008-03-12 21:02 ` Christian Kujau
0 siblings, 0 replies; 34+ messages in thread
From: Christian Kujau @ 2008-03-12 21:02 UTC (permalink / raw)
To: Samuel Tardieu; +Cc: LKML, dm-devel
On Wed, 12 Mar 2008, Samuel Tardieu wrote:
> Can't you skip it with "git bisect skip" in this case? (I've never
> tried "skip" but from "git bisect" documentation it does what you
> want, and "git bisect" will select a nearby revision)
Yes, I shall try so. In the meantime I tried -rc5 but even more headaches
with this one:
[ 1219.355352] ------------[ cut here ]------------
[ 1219.355359] WARNING: at drivers/usb/host/ehci-hcd.c:287 ehci_iaa_watchdog+0x7a/0x80()
[ 1219.355362] Modules linked in: act_police sch_ingress cls_u32 sch_sfq sch_cbq xt_tcpudp ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_nat_ftp nf_nat nf_conntrack_ftp xt_conntrack nf_conntrack iptable_filter ip_tables ipt_ULOG x_tables nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc tun fuse sg sr_mod twofish_i586 twofish_common eeprom w83l785ts asb100 hwmon_vid hwmon usb_storage ecb zd1211rw firmware_class snd_intel8x0 snd_ac97_codec mac80211 ac97_bus snd_pcm snd_timer snd soundcore pl2303 usbserial snd_page_alloc cfg80211 i2c_nforce2 i2c_core
[ 1219.355411] Pid: 450, comm: md0_raid1 Not tainted 2.6.25-rc5 #1
[ 1219.355415] [<c0119a0f>] warn_on_slowpath+0x5f/0x90
[ 1219.355429] [<c0138d57>] __lock_acquire+0x537/0x10b0
[ 1219.355441] [<c0117062>] task_tick_fair+0x32/0x60
[ 1219.355446] [<c01159cf>] scheduler_tick+0xcf/0x200
[ 1219.355453] [<c011b47d>] profile_tick+0x3d/0x80
[ 1219.355459] [<c0102ed7>] restore_nocheck+0x12/0x15
[ 1219.355467] [<c043c9ff>] _spin_lock_irqsave+0x3f/0x50
[ 1219.355475] [<c03620ba>] ehci_iaa_watchdog+0x7a/0x80
[ 1219.355480] [<c0122748>] run_timer_softirq+0x128/0x190
[ 1219.355486] [<c0105688>] do_softirq+0x98/0xd0
[ 1219.355492] [<c0362040>] ehci_iaa_watchdog+0x0/0x80
[ 1219.355498] [<c011ecb2>] __do_softirq+0x52/0xa0
[ 1219.355503] [<c0105688>] do_softirq+0x98/0xd0
[ 1219.355507] [<c0144960>] handle_level_irq+0x0/0xe0
[ 1219.355515] [<c011ec3d>] irq_exit+0x4d/0x70
[ 1219.355518] [<c0105750>] do_IRQ+0x90/0xf0
[ 1219.355522] [<c013837c>] trace_hardirqs_on+0x9c/0x110
[ 1219.355529] [<c0103886>] common_interrupt+0x2e/0x34
[ 1219.355536] [<c043c982>] _spin_unlock_irq+0x22/0x30
[ 1219.355540] [<c02ba101>] __make_request+0xc1/0x320
[ 1219.355550] [<c02b91e7>] generic_make_request+0x177/0x230
[ 1219.355556] [<c043caf5>] _spin_unlock_irqrestore+0x45/0x60
[ 1219.355561] [<c013837c>] trace_hardirqs_on+0x9c/0x110
[ 1219.355567] [<c03705d9>] raid1d+0x419/0xcc0
[ 1219.355574] [<c0138d57>] __lock_acquire+0x537/0x10b0
[ 1219.355581] [<c01159cf>] scheduler_tick+0xcf/0x200
[ 1219.355590] [<c0102ed7>] restore_nocheck+0x12/0x15
[ 1219.355593] [<c037b6e0>] md_thread+0x0/0xe0
[ 1219.355599] [<c013837c>] trace_hardirqs_on+0x9c/0x110
[ 1219.355608] [<c043caf5>] _spin_unlock_irqrestore+0x45/0x60
[ 1219.355612] [<c037b6e0>] md_thread+0x0/0xe0
[ 1219.355616] [<c013837c>] trace_hardirqs_on+0x9c/0x110
[ 1219.355622] [<c037b6e0>] md_thread+0x0/0xe0
[ 1219.355626] [<c037b702>] md_thread+0x22/0xe0
[ 1219.355631] [<c012c6d0>] autoremove_wake_function+0x0/0x40
[ 1219.355639] [<c037b6e0>] md_thread+0x0/0xe0
[ 1219.355643] [<c012c402>] kthread+0x42/0x70
[ 1219.355647] [<c012c3c0>] kthread+0x0/0x70
[ 1219.355651] [<c0103a1f>] kernel_thread_helper+0x7/0x18
[ 1219.355657] =======================
[ 1219.355660] ---[ end trace d0567d7a35270324 ]---
@dm-devel, can you get something useful out of these traces?
Thank you,
Christian.
--
BOFH excuse #152:
My pony-tail hit the on/off switch on the power strip.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: INFO: task mount:11202 blocked for more than 120 seconds
2008-03-12 19:53 ` Chr
@ 2008-03-12 23:07 ` Christian Kujau
2008-03-13 13:45 ` Christian Kujau
0 siblings, 1 reply; 34+ messages in thread
From: Christian Kujau @ 2008-03-12 23:07 UTC (permalink / raw)
To: Chr; +Cc: Samuel Tardieu, device-mapper development, LKML
On Wed, 12 Mar 2008, Chr wrote:
> http://lkml.org/lkml/2008/3/11/377
> (BTW: I'm still bisecting it... can somebody please confirm that 2.6.24-git17
> http://www.kernel.org/pub/linux/kernel/v2.6/snapshots/patch-2.6.24-git17.bz2
> was the _last_ "good" one?)
Hm, -git17 locked up hard when trying to access the device mapper
(just like -rc1), not even sysrq was usable any more. Sadly,
nothing was logged to the disks/network. Will try earlier snapshots
then. And yes, this resyncing of the md devices after an unclean shutdown
is a real pita right now...
Thanks for reporting!
Christian.
--
BOFH excuse #197:
I'm sorry a pentium won't do, you need an SGI to connect with us.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: INFO: task mount:11202 blocked for more than 120 seconds
2008-03-12 23:07 ` Christian Kujau
@ 2008-03-13 13:45 ` Christian Kujau
2008-03-13 21:33 ` Chr
0 siblings, 1 reply; 34+ messages in thread
From: Christian Kujau @ 2008-03-13 13:45 UTC (permalink / raw)
To: Chr; +Cc: Samuel Tardieu, device-mapper development, LKML
On Thu, 13 Mar 2008, Christian Kujau wrote:
> Hm, -git17 locked up hard when trying to access the device mapper
and so does -git8 and -git9...sigh :(
C.
--
BOFH excuse #412:
Radial Telemetry Infiltration
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: INFO: task mount:11202 blocked for more than 120 seconds
2008-03-13 13:45 ` Christian Kujau
@ 2008-03-13 21:33 ` Chr
2008-03-13 21:54 ` Christian Kujau
0 siblings, 1 reply; 34+ messages in thread
From: Chr @ 2008-03-13 21:33 UTC (permalink / raw)
To: Christian Kujau, Christian Kujau
Cc: Samuel Tardieu, device-mapper development, LKML
On Thursday 13 March 2008 14:45:39 Christian Kujau wrote:
> and so does -git8 and -git9...sigh :(
hmm, I can boot any of these -rcX-gitY kernels (but not without some stuck
tasks after about a hour or so...)
but back: I tried lockdep and ran into this: (debug blocker)
pktcdvd: writer pktcdvd0 mapped to sr0
=============================================
[ INFO: possible recursive locking detected ]
2.6.25-rc5-git3 #2
---------------------------------------------
scsi_id/8262 is trying to acquire lock:
(&bdev->bd_mutex){--..}, at: [<ffffffff802a8e51>] do_open+0x7b/0x2c0
but task is already holding lock:
(&bdev->bd_mutex){--..}, at: [<ffffffff802a8e51>] do_open+0x7b/0x2c0
other info that might help us debug this:
2 locks held by scsi_id/8262:
#0: (&bdev->bd_mutex){--..}, at: [<ffffffff802a8e51>] do_open+0x7b/0x2c0
#1: (&ctl_mutex#2){--..}, at: [<ffffffff88231af2>] pkt_open+0x26/0x516
[pktcdvd]
stack backtrace:
Pid: 8262, comm: scsi_id Not tainted 2.6.25-rc5-git3 #2
Call Trace:
[<ffffffff8024ffe4>] __lock_acquire+0x8c9/0xc72
[<ffffffff804bb23e>] ? __mutex_unlock_slowpath+0xf9/0x105
[<ffffffff8025079f>] lock_acquire+0x5e/0x77
[<ffffffff802a8e51>] ? do_open+0x7b/0x2c0
[<ffffffff804bb336>] mutex_lock_nested+0xe1/0x270
[<ffffffff802a8e51>] ? do_open+0x7b/0x2c0
[<ffffffff802a8e51>] do_open+0x7b/0x2c0
[<ffffffff802a9116>] __blkdev_get+0x80/0x92
[<ffffffff802a9133>] blkdev_get+0xb/0xd
[<ffffffff88231b6b>] :pktcdvd:pkt_open+0x9f/0x516
[<ffffffff802a8e84>] do_open+0xae/0x2c0
[<ffffffff802a92a4>] ? blkdev_open+0x0/0x6a
[<ffffffff804bcbab>] ? _spin_unlock+0x26/0x2a
[<ffffffff802a92a4>] ? blkdev_open+0x0/0x6a
[<ffffffff802a92db>] blkdev_open+0x37/0x6a
[<ffffffff80282571>] __dentry_open+0xe6/0x1c0
[<ffffffff802826e4>] nameidata_to_filp+0x2e/0x40
[<ffffffff8028272f>] do_filp_open+0x39/0x4b
[<ffffffff804bcbab>] ? _spin_unlock+0x26/0x2a
[<ffffffff8028246f>] ? get_unused_fd_flags+0x10d/0x11c
[<ffffffff80282792>] do_sys_open+0x51/0xd9
[<ffffffff80282843>] sys_open+0x1b/0x1d
[<ffffffff8020b22b>] system_call_after_swapgs+0x7b/0x80
soooo.....?
scsi_id/8262 is trying to acquire lock:
(&bdev->bd_mutex){--..}, at: [<ffffffff802a8e51>] do_open+0x7b/0x2c0
but task is already holding lock:
(&bdev->bd_mutex){--..}, at: [<ffffffff802a8e51>] do_open+0x7b/0x2c0
sounds a bit strange? Is this a SMP related problem after all?
(*disabled pktcddvd for now... to see if still locks)
Regards,
Chr...
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: INFO: task mount:11202 blocked for more than 120 seconds
2008-03-13 21:33 ` Chr
@ 2008-03-13 21:54 ` Christian Kujau
2008-03-14 0:15 ` Chr
0 siblings, 1 reply; 34+ messages in thread
From: Christian Kujau @ 2008-03-13 21:54 UTC (permalink / raw)
To: Chr; +Cc: Samuel Tardieu, device-mapper development, LKML
On Thu, 13 Mar 2008, Chr wrote:
> On Thursday 13 March 2008 14:45:39 Christian Kujau wrote:
>> and so does -git8 and -git9...sigh :(
>
> hmm, I can boot any of these -rcX-gitY kernels (but not without some stuck
> tasks after about a hour or so...)
Ah, when you said -git17 would be the last "good" kernel, I thought
everything earlier would not have the stuck tasks.
> but back: I tried lockdep and ran into this: (debug blocker)
I have lockdep enabled as well, but no warnings so far...
> sounds a bit strange? Is this a SMP related problem after all?
I don't think so, as I don't have an SMP system (and CONFIG_SMP is
disabled as well).
C.
--
BOFH excuse #441:
Hash table has woodworm
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: INFO: task mount:11202 blocked for more than 120 seconds
2008-03-13 21:54 ` Christian Kujau
@ 2008-03-14 0:15 ` Chr
0 siblings, 0 replies; 34+ messages in thread
From: Chr @ 2008-03-14 0:15 UTC (permalink / raw)
To: Christian Kujau
Cc: Samuel Tardieu, device-mapper development, LKML, Ingo Molnar
On Thursday 13 March 2008 22:54:25 Christian Kujau wrote:
> On Thu, 13 Mar 2008, Chr wrote:
> > On Thursday 13 March 2008 14:45:39 Christian Kujau wrote:
> >> and so does -git8 and -git9...sigh :(
> >
> > hmm, I can boot any of these -rcX-gitY kernels (but not without some
> > stuck tasks after about a hour or so...)
>
> Ah, when you said -git17 would be the last "good" kernel, I thought
> everything earlier would not have the stuck tasks.
well, no... even 2.6.24.3 has _hangs_, however not forever (approx. 5-10 min)
and everything is back to normal...
> > sounds a bit strange? Is this a SMP related problem after all?
>
> I don't think so, as I don't have an SMP system (and CONFIG_SMP is
> disabled as well).
Then maybe scheduler? because these values looks suspecious:
(that's why I added Ingo to CC, because he probably knows what's NOT going on
;-) )
(full dmesg there:
http://www.pastebin.ca/941845
config here:
http://www.pastebin.ca/941856
)
the funny stuff is highlighted with a >:
cpu#1, 2211.332 MHz
.nr_running : 8
.load : 188884
.nr_switches : 9615826
.nr_load_updates : 1879402
> .nr_uninterruptible : -2053
.jiffies : 4296946718
.next_balance : 4296.946740
.curr->pid : 0
.clock : 9800558.756490
.idle_clock : 6521432.794621
.prev_clock_raw : 4476359.432375
.clock_warps : 0
.clock_overflows : 8676899
.clock_underflows : 688045
.clock_deep_idle_events : 0
.clock_max_delta : 4.000250
.cpu_load[0] : 0
.cpu_load[1] : 24
.cpu_load[2] : 112
.cpu_load[3] : 155
.cpu_load[4] : 177
cfs_rq
.exec_clock : 1353227.466775
.MIN_vruntime : 1808656.942015
.min_vruntime : 1808696.942015
.max_vruntime : 1808657.235424
.spread : 0.293409
> .spread0 : -267982.528057
.nr_running : 7
.load : 11362
.bkl_count : 60751
.nr_spread_over : 6234
runnable tasks:
task PID tree-key switches prio exec-runtime
sum-exec sum-sleep
----------------------------------------------------------------------------------------------------------
kcryptd 2266 1808656.942015 12778 115 1808656.942015
3689.257692 3125323.031084
hald-addon-inpu 5504 1808656.942015 622 120 1808656.942015
10.739211 3044831.003941
kcryptd 7312 1808656.960820 578891 115 1808656.960820
71166.387493 2625492.750280
Xorg 9564 1808656.942016 513338 120 1808656.942016
245777.204573 2587613.331559
pdflush 10059 1808656.942015 14575 120 1808656.942015
4400.355110 2985523.630368
pdflush 10696 1808656.942015 9433 120 1808656.942015
3607.302084 3103638.660388
xine 16773 1808657.235424 13455 120 1808657.235424
2135.923058 574348.774173
> watchdog 16959 -206866.028086 0 98 -206866.028086
0.000000 0.000000
(it's at the end of the dmesg, the kernel, X.org and probably everything else
was stuck...)
Thanks,
Christian.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: INFO: task mount:11202 blocked for more than 120 seconds
2008-03-09 21:34 ` INFO: task mount:11202 blocked for more than 120 seconds David Chinner
2008-03-10 1:46 ` Christian Kujau
2008-03-12 18:03 ` Samuel Tardieu
@ 2008-03-14 9:27 ` Milan Broz
2008-03-14 23:58 ` Christian Kujau
2008-03-15 0:08 ` Chr
2 siblings, 2 replies; 34+ messages in thread
From: Milan Broz @ 2008-03-14 9:27 UTC (permalink / raw)
To: David Chinner; +Cc: Christian Kujau, LKML, xfs, dm-devel
David Chinner wrote:
>>> Also, the iolock can be held across I/O so it's possible you've lost an
>>> I/O. Any I/O errors in the syslog?
>> No, no I/O errors at all. See the kern.log above, I could even do dd(1)
>> from the md1 (dm-crypt on raid1), no errors either.
>
> Oh, dm-crypt. Well, I'd definitely start looking there. XFS has a
> history of exposing dm-crypt bugs, and these hangs appear to be I/O
> congestion/scheduling related and not XFS. Also, we haven't changed
> anything related to plug/unplug of block devices in XFS recently, so
> that also points to some other change as well...
Yes, there is bug in dm-crypt...
Please try if the patch here helps: http://lkml.org/lkml/2008/3/14/71
Thanks,
Milan
--
mbroz@redhat.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: INFO: task mount:11202 blocked for more than 120 seconds
2008-03-14 9:27 ` Milan Broz
@ 2008-03-14 23:58 ` Christian Kujau
2008-03-16 20:33 ` David Chinner
2008-03-15 0:08 ` Chr
1 sibling, 1 reply; 34+ messages in thread
From: Christian Kujau @ 2008-03-14 23:58 UTC (permalink / raw)
To: Milan Broz; +Cc: David Chinner, LKML, xfs, dm-devel
On Fri, 14 Mar 2008, Milan Broz wrote:
> Yes, there is bug in dm-crypt...
> Please try if the patch here helps: http://lkml.org/lkml/2008/3/14/71
Hm, it seems to help the hangs, yes. Applied to today's -git a few hours
ago, the hangs are gone. However, when doing lots of disk I/O, the machine
locks up after a few (10-20) minutes. Sadly, netconsole got nothing :(
After the first lockup I tried again and shortly after bootup I got:
[ 866.681441] [ INFO: possible circular locking dependency detected ]
[ 866.681876] 2.6.25-rc5 #1
[ 866.682203] -------------------------------------------------------
[ 866.682637] kswapd0/132 is trying to acquire lock:
[ 866.683028] (&(&ip->i_iolock)->mr_lock){----}, at: [<c027a686>] xfs_ilock+0x96/0xb0
[ 866.683916]
[ 866.683917] but task is already holding lock:
[ 866.684582] (iprune_mutex){--..}, at: [<c017b592>] shrink_icache_memory+0x72/0x220
[ 866.685461]
[ 866.685462] which lock already depends on the new lock.
[ 866.685463]
[ 866.686440]
[ 866.686441] the existing dependency chain (in reverse order) is:
[ 866.687151]
[ 866.687152] -> #1 (iprune_mutex){--..}:
[ 866.687339] [<c0136914>] add_lock_to_list+0x44/0xc0
[ 866.687339] [<c01393a6>] __lock_acquire+0xc26/0x10b0
[ 866.687339] [<c017b592>] shrink_icache_memory+0x72/0x220
[ 866.687339] [<c013890f>] __lock_acquire+0x18f/0x10b0
[ 866.687339] [<c013988e>] lock_acquire+0x5e/0x80
[ 866.687339] [<c017b592>] shrink_icache_memory+0x72/0x220
[ 866.687339] [<c043fc79>] mutex_lock_nested+0x89/0x240
[ 866.687339] [<c017b592>] shrink_icache_memory+0x72/0x220
[ 866.687339] [<c017b592>] shrink_icache_memory+0x72/0x220
[ 866.687339] [<c017b592>] shrink_icache_memory+0x72/0x220
[ 866.687339] [<c0150051>] shrink_slab+0x21/0x160
[ 866.687340] [<c0150131>] shrink_slab+0x101/0x160
[ 866.687340] [<c01502e2>] try_to_free_pages+0x152/0x230
[ 866.687340] [<c014f060>] isolate_pages_global+0x0/0x60
[ 866.687340] [<c014b95b>] __alloc_pages+0x14b/0x370
[ 866.687340] [<c04413a0>] _read_unlock_irq+0x20/0x30
[ 866.687340] [<c0146601>] __grab_cache_page+0x81/0xc0
[ 866.687340] [<c01896f6>] block_write_begin+0x76/0xe0
[ 866.687340] [<c029ec76>] xfs_vm_write_begin+0x46/0x50
[ 866.687340] [<c029f4c0>] xfs_get_blocks+0x0/0x30
[ 866.687340] [<c0147297>] generic_file_buffered_write+0x117/0x650
[ 866.687340] [<c027a65d>] xfs_ilock+0x6d/0xb0
[ 866.687340] [<c02a73cc>] xfs_write+0x7ac/0x8a0
[ 866.687340] [<c0174ac1>] core_sys_select+0x21/0x350
[ 866.687340] [<c02a32bc>] xfs_file_aio_write+0x5c/0x70
[ 866.687340] [<c0167bf5>] do_sync_write+0xd5/0x120
[ 866.687340] [<c012c630>] autoremove_wake_function+0x0/0x40
[ 866.687340] [<c019cfd5>] dnotify_parent+0x35/0x90
[ 866.687340] [<c0167b20>] do_sync_write+0x0/0x120
[ 866.687340] [<c016846f>] vfs_write+0x9f/0x140
[ 866.687340] [<c0168a21>] sys_write+0x41/0x70
[ 866.687340] [<c0102dee>] sysenter_past_esp+0x5f/0xa5
[ 866.687340] [<ffffffff>] 0xffffffff
[ 866.687340]
[ 866.687340] -> #0 (&(&ip->i_iolock)->mr_lock){----}:
[ 866.687340] [<c0136ba0>] print_circular_bug_entry+0x40/0x50
The box was running fine then for ~20 minutes, then it locked up again.
Full dmesg and .config: http://nerdbynature.de/bits/2.6.25-rc5/
Right now I'm back to 2.6.24.3...
Thanks,
Christian.
--
BOFH excuse #350:
paradigm shift...without a clutch
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: INFO: task mount:11202 blocked for more than 120 seconds
2008-03-14 9:27 ` Milan Broz
2008-03-14 23:58 ` Christian Kujau
@ 2008-03-15 0:08 ` Chr
2008-03-15 13:32 ` [dm-crypt] " Chr
1 sibling, 1 reply; 34+ messages in thread
From: Chr @ 2008-03-15 0:08 UTC (permalink / raw)
To: Milan Broz; +Cc: David Chinner, Christian Kujau, LKML, xfs, dm-devel
On Friday 14 March 2008 10:27:07 Milan Broz wrote:
>
> > Oh, dm-crypt. Well, I'd definitely start looking there. XFS has a
> > history of exposing dm-crypt bugs, and these hangs appear to be I/O
> > congestion/scheduling related and not XFS. Also, we haven't changed
> > anything related to plug/unplug of block devices in XFS recently, so
> > that also points to some other change as well...
>
> Yes, there is bug in dm-crypt...
> Please try if the patch here helps: http://lkml.org/lkml/2008/3/14/71
>
hmm, :-/
stuck again... (tried the patch on top of 2.6.25-rc5 and 2.6.25-rc5-git4)
SYS-RQ (several times) T+W+Q
right here:
http://www.pastebin.ca/943145
Regards,
Christian
^ permalink raw reply [flat|nested] 34+ messages in thread
* [dm-crypt] INFO: task mount:11202 blocked for more than 120 seconds
2008-03-15 0:08 ` Chr
@ 2008-03-15 13:32 ` Chr
2008-03-15 21:34 ` Chr
0 siblings, 1 reply; 34+ messages in thread
From: Chr @ 2008-03-15 13:32 UTC (permalink / raw)
To: Milan Broz
Cc: David Chinner, Christian Kujau, LKML, xfs, dm-devel,
Alasdair G Kergon, dm-crypt
>On Sat, Mar 15, 2008 at 01:21:04AM +0100, Ulrich Lukas wrote:
>> With that patch applied, writing seems to work again.
>
> Current version of the patch is below. It's still not quite finished.
>
:-/, both patches freezes here...
I'll try to revert:
commit 3a7f6c990ad04e6f576a159876c602d14d6f7fef
dm crypt: use async crypto
dm-crypt: Use crypto ablkcipher interface
Move encrypt/decrypt core to async crypto call.
maybe it isn't the async stuff after all..
>> Btw, is this a regression because of the changed 2.6.25(-rc) code or is
>> it just a coincidence that it worked before?
>
>Introduced during code changes to support async crypto interface.
Regards,
Chr
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [dm-crypt] INFO: task mount:11202 blocked for more than 120 seconds
2008-03-15 13:32 ` [dm-crypt] " Chr
@ 2008-03-15 21:34 ` Chr
2008-03-16 13:08 ` Christian Kujau
0 siblings, 1 reply; 34+ messages in thread
From: Chr @ 2008-03-15 21:34 UTC (permalink / raw)
To: Milan Broz
Cc: David Chinner, Christian Kujau, LKML, xfs, dm-devel,
Alasdair G Kergon, dm-crypt
On Saturday 15 March 2008 14:32:10 Chr wrote:
> reverted:
>
> commit 3a7f6c990ad04e6f576a159876c602d14d6f7fef
> dm crypt: use async crypto
>
> dm-crypt: Use crypto ablkcipher interface
> Move encrypt/decrypt core to async crypto call.
>
well.... it's much better now, without the async interface.
Christian Kajau, can you confirm it too?
Regards,
Christian
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [dm-crypt] INFO: task mount:11202 blocked for more than 120 seconds
2008-03-15 21:34 ` Chr
@ 2008-03-16 13:08 ` Christian Kujau
2008-03-17 17:36 ` Alasdair G Kergon
0 siblings, 1 reply; 34+ messages in thread
From: Christian Kujau @ 2008-03-16 13:08 UTC (permalink / raw)
To: Chr
Cc: Milan Broz, David Chinner, LKML, xfs, dm-devel,
Alasdair G Kergon, dm-crypt
On Sat, 15 Mar 2008, Chr wrote:
> On Saturday 15 March 2008 14:32:10 Chr wrote:
>> reverted:
>>
>> commit 3a7f6c990ad04e6f576a159876c602d14d6f7fef
>> dm crypt: use async crypto
>>
>> dm-crypt: Use crypto ablkcipher interface
>> Move encrypt/decrypt core to async crypto call.
>>
>
> well.... it's much better now, without the async interface.
> Christian Kajau, can you confirm it too?
I reverted the commit above from today's -git and booted....I could not
notice any hangs more. But when I tried to reproduce these hangs by
generating disk I/O (mostly reads) ~10 minutes later the box panicked, but
still not netconsole messages :-(
thanks,
C.
--
BOFH excuse #233:
TCP/IP UDP alarm threshold is set too low.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: INFO: task mount:11202 blocked for more than 120 seconds
2008-03-14 23:58 ` Christian Kujau
@ 2008-03-16 20:33 ` David Chinner
0 siblings, 0 replies; 34+ messages in thread
From: David Chinner @ 2008-03-16 20:33 UTC (permalink / raw)
To: Christian Kujau; +Cc: Milan Broz, David Chinner, LKML, xfs, dm-devel
On Sat, Mar 15, 2008 at 12:58:02AM +0100, Christian Kujau wrote:
> On Fri, 14 Mar 2008, Milan Broz wrote:
> >Yes, there is bug in dm-crypt...
> >Please try if the patch here helps: http://lkml.org/lkml/2008/3/14/71
>
> Hm, it seems to help the hangs, yes. Applied to today's -git a few hours
> ago, the hangs are gone. However, when doing lots of disk I/O, the machine
> locks up after a few (10-20) minutes. Sadly, netconsole got nothing :(
>
> After the first lockup I tried again and shortly after bootup I got:
False positive. Memory reclaim can inverts the order of iprune_mutex and
the normal inode locking orders. i.e. Both of these are possible:
do_something()
enter memory reclaim
iprune_mutex
inode lock
or
do_something()
inode lock
do_something_else
enter memory reclaim
iprune_mutex
inode lock on different inode
So depending on what is seen first (in this case the first), the second
will trip lockdep. Neither are a deadlock, because the inode lock
held before memory reclaim is a referenced inode and will *never* be
on the free list for memory reclaim to deadlock on....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [dm-crypt] INFO: task mount:11202 blocked for more than 120 seconds
2008-03-16 13:08 ` Christian Kujau
@ 2008-03-17 17:36 ` Alasdair G Kergon
2008-03-17 18:36 ` Chr
` (3 more replies)
0 siblings, 4 replies; 34+ messages in thread
From: Alasdair G Kergon @ 2008-03-17 17:36 UTC (permalink / raw)
To: Christian Kujau, Chr, Milan Broz, David Chinner, LKML, xfs,
dm-devel, dm-crypt, Herbert Xu, Ritesh Raj Sarraf
Latest version for everyone to try:
From: Milan Broz <mbroz@redhat.com>
Fix regression in dm-crypt introduced in commit
3a7f6c990ad04e6f576a159876c602d14d6f7fef
(dm crypt: use async crypto).
If write requests need to be split into pieces, the code must not
process them in parallel because the crypto context cannot be shared.
So there can be parallel crypto operations on one part of the write,
but only one write bio can be processed at a time.
This is not optimal and the workqueue code need to be optimized for
parallel processing, but for now it solves problem without affecting
the performance of synchronous crypto operation (most of current
dm-crypt users).
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
---
drivers/md/dm-crypt.c | 58 +++++++++++++++++++++++++-------------------------
1 files changed, 30 insertions(+), 28 deletions(-)
Index: linux-2.6.25-rc4/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.25-rc4.orig/drivers/md/dm-crypt.c 2008-03-17 11:42:16.000000000 +0000
+++ linux-2.6.25-rc4/drivers/md/dm-crypt.c 2008-03-17 11:42:28.000000000 +0000
@@ -1,7 +1,7 @@
/*
* Copyright (C) 2003 Christophe Saout <christophe@saout.de>
* Copyright (C) 2004 Clemens Fruhwirth <clemens@endorphin.org>
- * Copyright (C) 2006-2007 Red Hat, Inc. All rights reserved.
+ * Copyright (C) 2006-2008 Red Hat, Inc. All rights reserved.
*
* This file is released under the GPL.
*/
@@ -93,6 +93,8 @@ struct crypt_config {
struct workqueue_struct *io_queue;
struct workqueue_struct *crypt_queue;
+ wait_queue_head_t writeq;
+
/*
* crypto related data
*/
@@ -331,14 +333,7 @@ static void crypt_convert_init(struct cr
ctx->idx_out = bio_out ? bio_out->bi_idx : 0;
ctx->sector = sector + cc->iv_offset;
init_completion(&ctx->restart);
- /*
- * Crypto operation can be asynchronous,
- * ctx->pending is increased after request submission.
- * We need to ensure that we don't call the crypt finish
- * operation before pending got incremented
- * (dependent on crypt submission return code).
- */
- atomic_set(&ctx->pending, 2);
+ atomic_set(&ctx->pending, 1);
}
static int crypt_convert_block(struct crypt_config *cc,
@@ -411,43 +406,42 @@ static void crypt_alloc_req(struct crypt
static int crypt_convert(struct crypt_config *cc,
struct convert_context *ctx)
{
- int r = 0;
+ int r;
while(ctx->idx_in < ctx->bio_in->bi_vcnt &&
ctx->idx_out < ctx->bio_out->bi_vcnt) {
crypt_alloc_req(cc, ctx);
+ atomic_inc(&ctx->pending);
+
r = crypt_convert_block(cc, ctx, cc->req);
switch (r) {
+ /* async */
case -EBUSY:
wait_for_completion(&ctx->restart);
INIT_COMPLETION(ctx->restart);
/* fall through*/
case -EINPROGRESS:
- atomic_inc(&ctx->pending);
cc->req = NULL;
- r = 0;
- /* fall through*/
+ ctx->sector++;
+ continue;
+
+ /* sync */
case 0:
+ atomic_dec(&ctx->pending);
ctx->sector++;
continue;
- }
- break;
+ /* error */
+ default:
+ atomic_dec(&ctx->pending);
+ return r;
+ }
}
- /*
- * If there are pending crypto operation run async
- * code. Otherwise process return code synchronously.
- * The step of 2 ensures that async finish doesn't
- * call crypto finish too early.
- */
- if (atomic_sub_return(2, &ctx->pending))
- return -EINPROGRESS;
-
- return r;
+ return 0;
}
static void dm_crypt_bio_destructor(struct bio *bio)
@@ -624,8 +618,10 @@ static void kcryptd_io_read(struct dm_cr
static void kcryptd_io_write(struct dm_crypt_io *io)
{
struct bio *clone = io->ctx.bio_out;
+ struct crypt_config *cc = io->target->private;
generic_make_request(clone);
+ wake_up(&cc->writeq);
}
static void kcryptd_io(struct work_struct *work)
@@ -698,7 +694,8 @@ static void kcryptd_crypt_write_convert_
r = crypt_convert(cc, &io->ctx);
- if (r != -EINPROGRESS) {
+ if (atomic_dec_and_test(&io->ctx.pending)) {
+ /* processed, no running async crypto */
kcryptd_crypt_write_io_submit(io, r, 0);
if (unlikely(r < 0))
return;
@@ -706,8 +703,12 @@ static void kcryptd_crypt_write_convert_
atomic_inc(&io->pending);
/* out of memory -> run queues */
- if (unlikely(remaining))
+ if (unlikely(remaining)) {
+ /* wait for async crypto and reinitialize pending counter */
+ wait_event(cc->writeq, !atomic_read(&io->ctx.pending));
+ atomic_set(&io->ctx.pending, 1);
congestion_wait(WRITE, HZ/100);
+ }
}
}
@@ -746,7 +747,7 @@ static void kcryptd_crypt_read_convert(s
r = crypt_convert(cc, &io->ctx);
- if (r != -EINPROGRESS)
+ if (atomic_dec_and_test(&io->ctx.pending))
kcryptd_crypt_read_done(io, r);
crypt_dec_pending(io);
@@ -1047,6 +1048,7 @@ static int crypt_ctr(struct dm_target *t
goto bad_crypt_queue;
}
+ init_waitqueue_head(&cc->writeq);
ti->private = cc;
return 0;
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [dm-crypt] INFO: task mount:11202 blocked for more than 120 seconds
2008-03-17 17:36 ` Alasdair G Kergon
@ 2008-03-17 18:36 ` Chr
2008-03-18 17:46 ` Christian Kujau
2008-03-18 0:56 ` Herbert Xu
` (2 subsequent siblings)
3 siblings, 1 reply; 34+ messages in thread
From: Chr @ 2008-03-17 18:36 UTC (permalink / raw)
To: Alasdair G Kergon
Cc: Christian Kujau, Milan Broz, David Chinner, LKML, xfs, dm-devel,
dm-crypt, Herbert Xu, Ritesh Raj Sarraf
On Monday 17 March 2008 18:36:09 Alasdair G Kergon wrote:
> Latest version for everyone to try:
>
> From: Milan Broz <mbroz@redhat.com>
>
> Fix regression in dm-crypt introduced in commit
> 3a7f6c990ad04e6f576a159876c602d14d6f7fef
> (dm crypt: use async crypto).
>
> If write requests need to be split into pieces, the code must not
> process them in parallel because the crypto context cannot be shared.
> So there can be parallel crypto operations on one part of the write,
> but only one write bio can be processed at a time.
>
> This is not optimal and the workqueue code need to be optimized for
> parallel processing, but for now it solves problem without affecting
> the performance of synchronous crypto operation (most of current
> dm-crypt users).
>
> Signed-off-by: Milan Broz <mbroz@redhat.com>
> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Tested-by: Christian Lamparter <chunkeey@web.de>
Well, the dm-crypt regressions seems to be gone. :)
Thanks for your work & time!
Regards,
Christian
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [dm-crypt] INFO: task mount:11202 blocked for more than 120 seconds
2008-03-17 17:36 ` Alasdair G Kergon
2008-03-17 18:36 ` Chr
@ 2008-03-18 0:56 ` Herbert Xu
2008-03-18 4:07 ` Milan Broz
2008-03-22 2:52 ` Christian Kujau
2008-03-26 16:57 ` Chr
3 siblings, 1 reply; 34+ messages in thread
From: Herbert Xu @ 2008-03-18 0:56 UTC (permalink / raw)
To: Christian Kujau, Chr, Milan Broz, David Chinner, LKML, xfs,
dm-devel, dm-crypt, Ritesh Raj Sarraf
On Mon, Mar 17, 2008 at 05:36:09PM +0000, Alasdair G Kergon wrote:
> Latest version for everyone to try:
>
> From: Milan Broz <mbroz@redhat.com>
>
> Fix regression in dm-crypt introduced in commit
> 3a7f6c990ad04e6f576a159876c602d14d6f7fef
> (dm crypt: use async crypto).
>
> If write requests need to be split into pieces, the code must not
> process them in parallel because the crypto context cannot be shared.
> So there can be parallel crypto operations on one part of the write,
> but only one write bio can be processed at a time.
Could you explain this part please? Crypto tfm objects are meant
to be reentrant, synchronous or not.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [dm-crypt] INFO: task mount:11202 blocked for more than 120 seconds
2008-03-18 0:56 ` Herbert Xu
@ 2008-03-18 4:07 ` Milan Broz
0 siblings, 0 replies; 34+ messages in thread
From: Milan Broz @ 2008-03-18 4:07 UTC (permalink / raw)
To: Herbert Xu
Cc: Christian Kujau, Chr, David Chinner, LKML, xfs, dm-devel,
dm-crypt, Ritesh Raj Sarraf
Herbert Xu wrote:
> On Mon, Mar 17, 2008 at 05:36:09PM +0000, Alasdair G Kergon wrote:
>> Latest version for everyone to try:
>>
>> From: Milan Broz <mbroz@redhat.com>
>>
>> Fix regression in dm-crypt introduced in commit
>> 3a7f6c990ad04e6f576a159876c602d14d6f7fef
>> (dm crypt: use async crypto).
>>
>> If write requests need to be split into pieces, the code must not
>> process them in parallel because the crypto context cannot be shared.
>> So there can be parallel crypto operations on one part of the write,
>> but only one write bio can be processed at a time.
>
> Could you explain this part please? Crypto tfm objects are meant
> to be reentrant, synchronous or not.
Ah, sorry - I mean dm-crypt convert context (with crypto context included).
Context is reentrant in the sense of crypto operations. But we need also
sometimes split bio in writes (not only because of low memory,
but also new memory layout of cloned bio can be different and we
must not violate hardware restrictions - spec. XFS generates such
highly optimized bio requests - it's why it discovers so many dm-crypt problems ;-)
see problematic dm-crypt bio write path
while (remaining) {
clone = crypt_alloc_buffer(io, remaining);
...
io->ctx.bio_out = clone;
io->ctx.idx_out = 0;
remaining -= clone->bi_size;
...
r = crypt_convert(cc, &io->ctx);
-> fire sync or (multiple) async crypto operation
if (atomic_dec_and_test(&io->ctx.pending))
-> sync mode, submit clone direclty
...
if (unlikely(remaining))
congestion_wait(WRITE, HZ/100);
}
and in async crypto completion callback (because async callback cannot
call in its context generic_make_request directly) is called:
struct convert_context *ctx = async_req->data;
...
if (!atomic_dec_and_test(&ctx->pending))
return;
...
INIT_WORK(&io->work, kcryptd_io);
queue_work(cc->io_queue, &io->work);
...
(and from io thread later)
struct bio *clone = io->ctx.bio_out;
generic_make_request(clone);
problems:
1) we cannot use io->work struct in parallel
2) io->ctx.pending is shared here between multiple sub-bio clones
...
(there was no problems in sync crypto mode. and dm-crypt io struct is
already allocated from mempool in crypt_map allocation, so changing this
to per cloned sub-bio allocation can cause new problems in low-memory situations,
not good idea change it in this development phase...)
Milan
--
mbroz@redhat.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [dm-crypt] INFO: task mount:11202 blocked for more than 120 seconds
2008-03-17 18:36 ` Chr
@ 2008-03-18 17:46 ` Christian Kujau
0 siblings, 0 replies; 34+ messages in thread
From: Christian Kujau @ 2008-03-18 17:46 UTC (permalink / raw)
To: Chr
Cc: Alasdair G Kergon, Milan Broz, David Chinner, LKML, xfs,
dm-devel, dm-crypt, Herbert Xu, Ritesh Raj Sarraf
Chr wrote:
> Well, the dm-crypt regressions seems to be gone. :)
> Thanks for your work& time!
Unfortunately I'm unable to test atm as I currently don't have access to
the machine where the hangs occured. So, if Chr says "it's fixed" I
believe it is and I don't wanna be a show stopper...IOW: feel free to
close the bug.
Thanks to all involved,
Christian.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [dm-crypt] INFO: task mount:11202 blocked for more than 120 seconds
2008-03-17 17:36 ` Alasdair G Kergon
2008-03-17 18:36 ` Chr
2008-03-18 0:56 ` Herbert Xu
@ 2008-03-22 2:52 ` Christian Kujau
2008-03-27 8:21 ` Christian Kujau
2008-03-26 16:57 ` Chr
3 siblings, 1 reply; 34+ messages in thread
From: Christian Kujau @ 2008-03-22 2:52 UTC (permalink / raw)
To: Alasdair G Kergon
Cc: Chr, Milan Broz, David Chinner, LKML, xfs, dm-devel, dm-crypt,
Herbert Xu, Ritesh Raj Sarraf
On Mon, 17 Mar 2008, Alasdair G Kergon wrote:
> From: Milan Broz <mbroz@redhat.com>
>
> Fix regression in dm-crypt introduced in commit
> 3a7f6c990ad04e6f576a159876c602d14d6f7fef
> (dm crypt: use async crypto).
I finally got around to test this - and yes, 2.6.25-rc6 with this patch
applied make the hangs go away. There are other problems now[0], but they
don't seem to be related, AFAICT.
Thank you!
Christian.
[0] http://lkml.org/lkml/2008/3/21/408
--
BOFH excuse #197:
I'm sorry a pentium won't do, you need an SGI to connect with us.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [dm-crypt] INFO: task mount:11202 blocked for more than 120 seconds
2008-03-17 17:36 ` Alasdair G Kergon
` (2 preceding siblings ...)
2008-03-22 2:52 ` Christian Kujau
@ 2008-03-26 16:57 ` Chr
3 siblings, 0 replies; 34+ messages in thread
From: Chr @ 2008-03-26 16:57 UTC (permalink / raw)
To: Alasdair G Kergon
Cc: Christian Kujau, Milan Broz, David Chinner, LKML,
Ritesh Raj Sarraf, dm-devel, dm-crypt, Herbert Xu
On Monday 17 March 2008 18:36:09 Alasdair G Kergon wrote:
> Latest version for everyone to try:
>
> From: Milan Broz <mbroz@redhat.com>
>
> Fix regression in dm-crypt introduced in commit
> 3a7f6c990ad04e6f576a159876c602d14d6f7fef
> (dm crypt: use async crypto).
>
> If write requests need to be split into pieces, the code must not
> process them in parallel because the crypto context cannot be shared.
> So there can be parallel crypto operations on one part of the write,
> but only one write bio can be processed at a time.
>
> This is not optimal and the workqueue code need to be optimized for
> parallel processing, but for now it solves problem without affecting
> the performance of synchronous crypto operation (most of current
> dm-crypt users).
>
> Signed-off-by: Milan Broz <mbroz@redhat.com>
> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
>
> [...]
Is there already a new, optimized version? Or is it postponed and someone will
_hopefully_ merge this patch into -rc7-gitX before we have a buggy mainline
2.6.25?
Thanks,
Chr
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [dm-crypt] INFO: task mount:11202 blocked for more than 120 seconds
2008-03-22 2:52 ` Christian Kujau
@ 2008-03-27 8:21 ` Christian Kujau
0 siblings, 0 replies; 34+ messages in thread
From: Christian Kujau @ 2008-03-27 8:21 UTC (permalink / raw)
To: Alasdair G Kergon
Cc: Chr, Milan Broz, David Chinner, LKML, xfs, dm-devel, dm-crypt,
Herbert Xu, Ritesh Raj Sarraf
On Sat, 22 Mar 2008, Christian Kujau wrote:
> On Mon, 17 Mar 2008, Alasdair G Kergon wrote:
>> From: Milan Broz <mbroz@redhat.com>
>>
>> Fix regression in dm-crypt introduced in commit
>> 3a7f6c990ad04e6f576a159876c602d14d6f7fef
>> (dm crypt: use async crypto).
I noticed that this patch[0] hasn't made it into mainline yet,
will this be included or is this fixed by something else?
Thanks,
Christian.
[0] http://lkml.org/lkml/2008/3/17/214
--
BOFH excuse #416:
We're out of slots on the server
^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2008-03-27 8:22 UTC | newest]
Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-03-07 20:32 INFO: task mount:11202 blocked for more than 120 seconds Christian Kujau
2008-03-07 22:40 ` David Chinner
2008-03-07 23:46 ` Christian Kujau
2008-03-08 1:54 ` Christian Kujau
2008-03-09 6:15 ` 2.6.25-rc hangs (was: INFO: task mount:11202 blocked for more than 120 seconds) Christian Kujau
2008-03-09 16:44 ` 2.6.25-rc hangs Eric Sandeen
2008-03-09 18:05 ` Christian Kujau
[not found] ` <47D42AD6.10500@sandeen.net>
2008-03-09 18:49 ` Christian Kujau
2008-03-12 18:06 ` Samuel Tardieu
2008-03-12 21:02 ` Christian Kujau
2008-03-09 21:34 ` INFO: task mount:11202 blocked for more than 120 seconds David Chinner
2008-03-10 1:46 ` Christian Kujau
2008-03-12 18:03 ` Samuel Tardieu
2008-03-12 19:53 ` Chr
2008-03-12 23:07 ` Christian Kujau
2008-03-13 13:45 ` Christian Kujau
2008-03-13 21:33 ` Chr
2008-03-13 21:54 ` Christian Kujau
2008-03-14 0:15 ` Chr
2008-03-14 9:27 ` Milan Broz
2008-03-14 23:58 ` Christian Kujau
2008-03-16 20:33 ` David Chinner
2008-03-15 0:08 ` Chr
2008-03-15 13:32 ` [dm-crypt] " Chr
2008-03-15 21:34 ` Chr
2008-03-16 13:08 ` Christian Kujau
2008-03-17 17:36 ` Alasdair G Kergon
2008-03-17 18:36 ` Chr
2008-03-18 17:46 ` Christian Kujau
2008-03-18 0:56 ` Herbert Xu
2008-03-18 4:07 ` Milan Broz
2008-03-22 2:52 ` Christian Kujau
2008-03-27 8:21 ` Christian Kujau
2008-03-26 16:57 ` Chr
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).