LKML Archive on lore.kernel.org
* Re: Strange freeze on 2.6.22 (deadlock?)
@ 2008-01-07 19:06 Brice Figureau
  2008-01-08 23:16 ` Chuck Ebbert
  0 siblings, 1 reply; 6+ messages in thread
From: Brice Figureau @ 2008-01-07 19:06 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: linux-kernel


On Mon, January 7, 2008 18:20, Randy Dunlap wrote:
> On Mon, 07 Jan 2008 16:48:54 +0100 Brice Figureau wrote:
>
>> Hi,
>>
>> I'm seeing a strange complete server freeze/lock-up on a bi-Xeon HT
>> amd64 server running the standard Debian 2.6.22 kernel (and, before
>> that, vanilla 2.6.19.x and 2.6.20.x, which exhibited the same issue).
>>
>> I'm only reporting it now, since I could get a full sysrq-t only this
>> morning.
>>
>> The symptoms are that every 5 to 7 days, the server (which acts as an
>> MX and hosts a few low-traffic websites) locks up. The IPMI watchdog
>> is unable to reboot the server (it doesn't even trigger, since there
>> is no evidence in the esmlog), yet the machine is still pingable. I
>> can't ssh to it; I can enter my login and password on a serial
>> console, but no shell is ever started.
>>
>> Pressing sysrq-t produced the trace hosted here:
>> http://www.daysofwonder.com/bug/crash-server1.txt.gz
>>
>> It happened once when I was connected to the server through ssh, and
>> I could see the load climb well above 100. It was then impossible to
>> launch new processes from the command line (and I had to reboot
>> manually).
>> It also happened last week, when the server was stuck for about 6
>> hours. As I started investigating, it slowly came back to life (with
>> a 1-minute load average of more than 1500 and tons of cron processes
>> running in parallel).
>>
>> I'm not really familiar with kernel development, so I can't pinpoint
>> the issue in the aforementioned trace output.
>> My guess is that, for some reason, there is a race/deadlock that
>> eventually prevents new processes from really starting (which in turn
>> produces the high load).
>>
>> What seems suspect in the aforementioned trace is:
>>  *) lots of processes' stack traces end in __mod_timer+0xc3/0xd3,
>> which seems to correspond to this line from kernel/timer.c:
>>
>> 415	timer->expires = expires;
>> 416	internal_add_timer(base, timer);
>> -->	spin_unlock_irqrestore(&base->lock, flags);
>>
>> 419	return ret;
>> 420  }
>>
>>  *) lots of processes' stack traces end in __mutex_lock_slowpath
>> and/or zone_statistics
>
> There are also lots of processes in D state (usually waiting
> for I/O to complete).  And jbd is in their stack traces.
>
> How is/are the ext3 filesystems mounted?  I mean what data=xyz
> mode?  data=journal (the heaviest duty mode) has at least one
> known deadlock.  If you are using data=journal, you could try
> switching to data=ordered...

Thanks for the answer.
I'm using whatever the default mount option is (which I think is
data=ordered). The only other mount options I use are nodiratime,noatime.
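
For the record, the data mode actually in effect can be double-checked
rather than assumed; something along these lines (device names here are
just illustrative):

```shell
# List the mount options the kernel is actually using for ext3
# filesystems; the default data=ordered may not be spelled out here.
grep ext3 /proc/mounts

# The mount-time kernel message does state the mode explicitly,
# e.g. "EXT3-fs: mounted filesystem with ordered data mode."
dmesg | grep -i "EXT3-fs: mounted"
```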

Note that a large share of the processes in D state are "waiting" in
__mutex_lock, called from generic_file_aio_write.
Another large share comes from balance_dirty_pages_ratelimited_nr.

It seems there was some writeback congestion on the block device.
All the /proc/sys/vm/* settings are at their defaults.
It looks as if it were no longer possible to write to the block device
at all. Could a block-device write error (i.e. a hardware failure) be
the root cause?
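
If writeback really is backing up behind balance_dirty_pages, one
low-risk experiment (the numbers below are illustrative, not a
recommendation) would be to lower the dirty thresholds so writers get
throttled earlier and more gradually:

```shell
# Inspect the current writeback thresholds (percent of memory that
# may be dirty before background/foreground writeback kicks in):
cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio

# Throttle writers earlier; values are illustrative:
echo 5  > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio
```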

Any other idea?
What should I try the next time it freezes?
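
If the serial console still accepts input next time, a few targeted
sysrq dumps would say more than a raw sysrq-t alone (assuming sysrq is
enabled in the kernel config):

```shell
# Enable all sysrq functions, then trigger selected dumps.
echo 1 > /proc/sys/kernel/sysrq
echo m > /proc/sysrq-trigger   # memory state; shows writeback pressure
echo t > /proc/sysrq-trigger   # full task/stack dump
# On kernels that support it, 'w' dumps only the blocked (D-state)
# tasks, which is far easier to read than the full 't' output:
echo w > /proc/sysrq-trigger
```
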
-- 
Brice Figureau


* Re: Strange freeze on 2.6.22 (deadlock?)
@ 2008-01-07 21:14 Brice Figureau
  0 siblings, 0 replies; 6+ messages in thread
From: Brice Figureau @ 2008-01-07 21:14 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: linux-kernel


On Mon, January 7, 2008 20:06, Brice Figureau wrote:
>
> On Mon, January 7, 2008 18:20, Randy Dunlap wrote:
>> On Mon, 07 Jan 2008 16:48:54 +0100 Brice Figureau wrote:
>>
>>> Hi,
>>>
>>> I'm seeing a strange complete server freeze/lock-up on a bi-Xeon HT
>>> amd64 server running the standard Debian 2.6.22 kernel (and, before
>>> that, vanilla 2.6.19.x and 2.6.20.x, which exhibited the same issue).
>>>
>>> I'm only reporting it now, since I could get a full sysrq-t only this
>>> morning.
>>>
>>> The symptoms are that every 5 to 7 days, the server (which acts as
>>> an MX and hosts a few low-traffic websites) locks up. The IPMI
>>> watchdog is unable to reboot the server (it doesn't even trigger,
>>> since there is no evidence in the esmlog), yet the machine is still
>>> pingable. I can't ssh to it; I can enter my login and password on a
>>> serial console, but no shell is ever started.
>>>
>>> Pressing sysrq-t produced the trace hosted here:
>>> http://www.daysofwonder.com/bug/crash-server1.txt.gz
>>>
>>> It happened once when I was connected to the server through ssh,
>>> and I could see the load climb well above 100. It was then
>>> impossible to launch new processes from the command line (and I had
>>> to reboot manually).
>>> It also happened last week, when the server was stuck for about 6
>>> hours. As I started investigating, it slowly came back to life
>>> (with a 1-minute load average of more than 1500 and tons of cron
>>> processes running in parallel).
>>>
>>> I'm not really familiar with kernel development, so I can't
>>> pinpoint the issue in the aforementioned trace output.
>>> My guess is that, for some reason, there is a race/deadlock that
>>> eventually prevents new processes from really starting (which in
>>> turn produces the high load).
>>>
>>> What seems suspect in the aforementioned trace is:
>>>  *) lots of processes' stack traces end in __mod_timer+0xc3/0xd3,
>>> which seems to correspond to this line from kernel/timer.c:
>>>
>>> 415	timer->expires = expires;
>>> 416	internal_add_timer(base, timer);
>>> -->	spin_unlock_irqrestore(&base->lock, flags);
>>>
>>> 419	return ret;
>>> 420  }
>>>
>>>  *) lots of processes' stack traces end in __mutex_lock_slowpath
>>> and/or zone_statistics
>>
>> There are also lots of processes in D state (usually waiting
>> for I/O to complete).  And jbd is in their stack traces.
>>
>> How is/are the ext3 filesystems mounted?  I mean what data=xyz
>> mode?  data=journal (the heaviest duty mode) has at least one
>> known deadlock.  If you are using data=journal, you could try
>> switching to data=ordered...
>
> Thanks for the answer.
> I'm using whatever the default mount option is (which I think is
> data=ordered). The only other mount options I use are nodiratime,noatime.

I'm sorry, I stand corrected: there is a data=journal option, though
not on a mount point but as an attribute on a bunch of directories
(the postfix queues):
# lsattr /var/spool/postfix/
-------A-----j---- /var/spool/postfix/active
-------A-----j---- /var/spool/postfix/bounce
-------A-----j---- /var/spool/postfix/corrupt
-------A-----j---- /var/spool/postfix/defer
-------A-----j---- /var/spool/postfix/deferred
-------A-----j---- /var/spool/postfix/flush
-------A-----j---- /var/spool/postfix/hold
-------A-----j---- /var/spool/postfix/incoming
-------A-----j---- /var/spool/postfix/private
-------A-----j---- /var/spool/postfix/maildrop
-------A-----j---- /var/spool/postfix/public
-------A-----j---- /var/spool/postfix/pid
-------A-----j---- /var/spool/postfix/saved
-------A-----j---- /var/spool/postfix/trace

The idea was that queue files are short-lived, so they could live
entirely in the journal and never hit the regular data area.

So I might have hit the deadlock you were talking about.
I just disabled this setup; we'll see if that changes anything.
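
For the record, "disabled this setup" just means clearing the
journal-data flag again; roughly (as root, ideally with postfix
stopped):

```shell
# Clear the 'j' attribute on each queue directory so files created
# there afterwards fall back to the filesystem's default (ordered)
# data mode.
for d in /var/spool/postfix/*; do
    chattr -j "$d"
done

# Verify: the 'j' column should now show '-' everywhere.
lsattr -d /var/spool/postfix/*
```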

Note that I had been using this setup since the 2.4 days; the problem
only appeared a few kernel revisions ago (around 2.6.18, I think).

Thanks
-- 
Brice Figureau


* Strange freeze on 2.6.22 (deadlock?)
@ 2008-01-07 15:48 Brice Figureau
  2008-01-07 17:20 ` Randy Dunlap
  0 siblings, 1 reply; 6+ messages in thread
From: Brice Figureau @ 2008-01-07 15:48 UTC (permalink / raw)
  To: linux-kernel

Hi,

I'm seeing a strange complete server freeze/lock-up on a bi-Xeon HT
amd64 server running the standard Debian 2.6.22 kernel (and, before
that, vanilla 2.6.19.x and 2.6.20.x, which exhibited the same issue).

I'm only reporting it now, since I could get a full sysrq-t only this
morning.

The symptoms are that every 5 to 7 days, the server (which acts as an
MX and hosts a few low-traffic websites) locks up. The IPMI watchdog is
unable to reboot the server (it doesn't even trigger, since there is no
evidence in the esmlog), yet the machine is still pingable. I can't ssh
to it; I can enter my login and password on a serial console, but no
shell is ever started.

Pressing sysrq-t produced the trace hosted here:
http://www.daysofwonder.com/bug/crash-server1.txt.gz

It happened once when I was connected to the server through ssh, and I
could see the load climb well above 100. It was then impossible to
launch new processes from the command line (and I had to reboot
manually).
It also happened last week, when the server was stuck for about 6 hours.
As I started investigating, it slowly came back to life (with a
1-minute load average of more than 1500 and tons of cron processes
running in parallel).

I'm not really familiar with kernel development, so I can't pinpoint
the issue in the aforementioned trace output.
My guess is that, for some reason, there is a race/deadlock that
eventually prevents new processes from really starting (which in turn
produces the high load).

What seems suspect in the aforementioned trace is:
 *) lots of processes' stack traces end in __mod_timer+0xc3/0xd3,
which seems to correspond to this line from kernel/timer.c:

415	timer->expires = expires;
416	internal_add_timer(base, timer);
-->	spin_unlock_irqrestore(&base->lock, flags);

419	return ret;
420  }

 *) lots of processes' stack traces end in __mutex_lock_slowpath and/or
zone_statistics
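
For what it's worth, a trace entry like __mod_timer+0xc3/0xd3 can be
resolved against the exact kernel binary instead of matching source
lines by eye; something like this, assuming a vmlinux with debug info
that matches the running kernel (the path is illustrative):

```shell
# Resolve symbol+offset from a stack trace to a file:line location.
gdb -batch -ex 'info line *(__mod_timer+0xc3)' vmlinux

# Alternatively, add the offset to the symbol's address from
# System.map and feed the absolute address to addr2line:
# addr2line -f -e vmlinux <absolute-address>
```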

Anyway, I will soon reboot into a 2.6.23.x kernel to see whether the
symptom persists.
More information (config, server specs) is available on request.

I'm not subscribed to the list, so please CC: me on any answer.
Many thanks,
-- 
Brice Figureau <brice+lklm@daysofwonder.com>



end of thread, other threads:[~2008-01-09 16:02 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
2008-01-07 19:06 Strange freeze on 2.6.22 (deadlock?) Brice Figureau
2008-01-08 23:16 ` Chuck Ebbert
2008-01-09 16:02   ` Brice Figureau
  -- strict thread matches above, loose matches on Subject: below --
2008-01-07 21:14 Brice Figureau
2008-01-07 15:48 Brice Figureau
2008-01-07 17:20 ` Randy Dunlap
