LKML Archive on lore.kernel.org
* Monthly md check == hung machine; how do I debug?
@ 2008-02-03 21:21 Robin Lee Powell
  2008-02-04  5:37 ` martin f krafft
  2008-02-04 10:40 ` Nick Piggin
  0 siblings, 2 replies; 9+ messages in thread
From: Robin Lee Powell @ 2008-02-03 21:21 UTC
  To: linux-kernel


I've got a machine with a 4 disk SATA raid10 configuration using md.
The entire disk is loop-AES encrypted, but that shouldn't matter
here.

Once a month, Debian runs:

    /usr/share/mdadm/checkarray --cron --all --quiet

and the machine hangs within 30 minutes of that starting.

It seems that I can avoid the hang by not having "mdadm --monitor"
running, but I'm not certain if that's the case or if I've just been
lucky this go-round.

I'm on kernel 2.6.23.1, my own compile thereof, x86_64, AMD
Athlon(tm) 64 Processor 3700+.

I've looked through all the 2.6.23 and 2.6.24 Changelogs, and I
can't find anything that looks relevant.

So, how can I (help you all) debug this?

-Robin

-- 
Lojban Reason #17: http://en.wikipedia.org/wiki/Buffalo_buffalo
Proud Supporter of the Singularity Institute - http://singinst.org/
http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/


* Re: Monthly md check == hung machine; how do I debug?
  2008-02-03 21:21 Monthly md check == hung machine; how do I debug? Robin Lee Powell
@ 2008-02-04  5:37 ` martin f krafft
  2008-02-04  6:59   ` Robin Lee Powell
  2008-02-04 10:40 ` Nick Piggin
  1 sibling, 1 reply; 9+ messages in thread
From: martin f krafft @ 2008-02-04  5:37 UTC
  To: linux-kernel

also sprach Robin Lee Powell <rlpowell@digitalkingdom.org> [2008.02.04.1021 +1300]:
>     /usr/share/mdadm/checkarray --cron --all --quiet

FYI:
http://git.debian.org/?p=pkg-mdadm/mdadm.git;a=blob;f=debian/checkarray

It basically does

  echo check > /sys/block/$array/md/sync_action

for all arrays.
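
The loop described above could be sketched in Python roughly as follows. This is an illustration only, not the checkarray script itself: the name `start_md_check` and the `sys_block` parameter are made up for the example, skipping arrays that are not idle is a choice made here, and on a real system it would have to run as root.

```python
from pathlib import Path


def start_md_check(sys_block="/sys/block"):
    """Write "check" to md/sync_action of every idle md array.

    Hypothetical helper, not part of mdadm or checkarray.
    Returns the names of the arrays a check was requested for.
    """
    started = []
    for dev in sorted(Path(sys_block).glob("md*")):
        action = dev / "md" / "sync_action"
        if not action.exists():
            continue  # no sync_action file, so nothing to write to
        # Assumption of this sketch: leave busy arrays alone.
        if action.read_text().strip() != "idle":
            continue
        action.write_text("check\n")
        started.append(dev.name)
    return started
```

Calling `start_md_check()` with no argument would act on the real sysfs tree; passing another directory makes it easy to try against a scratch copy first.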

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
i feel like i'm diagonally parked in a parallel universe.
 
spamtraps: madduck.bogus@madduck.net



* Re: Monthly md check == hung machine; how do I debug?
  2008-02-04  5:37 ` martin f krafft
@ 2008-02-04  6:59   ` Robin Lee Powell
  0 siblings, 0 replies; 9+ messages in thread
From: Robin Lee Powell @ 2008-02-04  6:59 UTC
  To: linux-kernel

On Mon, Feb 04, 2008 at 06:37:02PM +1300, martin f krafft wrote:
> also sprach Robin Lee Powell <rlpowell@digitalkingdom.org> [2008.02.04.1021 +1300]:
> >     /usr/share/mdadm/checkarray --cron --all --quiet
> 
> FYI:
> http://git.debian.org/?p=pkg-mdadm/mdadm.git;a=blob;f=debian/checkarray
> 
> It basically does
> 
>   echo check > /sys/block/$array/md/sync_action
> 
> for all arrays.

Thanks for the clarification.

I've tried a few more times, by the way, and it seems that without
"mdadm --monitor" running, the hang doesn't occur.

I'd certainly prefer being notified of state changes, though, so
that's not much of a solution.

-Robin

-- 
Lojban Reason #17: http://en.wikipedia.org/wiki/Buffalo_buffalo
Proud Supporter of the Singularity Institute - http://singinst.org/
http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/


* Re: Monthly md check == hung machine; how do I debug?
  2008-02-03 21:21 Monthly md check == hung machine; how do I debug? Robin Lee Powell
  2008-02-04  5:37 ` martin f krafft
@ 2008-02-04 10:40 ` Nick Piggin
  2008-02-05 17:10   ` Robin Lee Powell
  1 sibling, 1 reply; 9+ messages in thread
From: Nick Piggin @ 2008-02-04 10:40 UTC
  To: Robin Lee Powell; +Cc: linux-kernel

On Monday 04 February 2008 08:21, Robin Lee Powell wrote:
> I've got a machine with a 4 disk SATA raid10 configuration using md.
> The entire disk is loop-AES encrypted, but that shouldn't matter
> here.
>
> Once a month, Debian runs:
>
>     /usr/share/mdadm/checkarray --cron --all --quiet
>
> and the machine hangs within 30 minutes of that starting.
>
> It seems that I can avoid the hang by not having "mdadm --monitor"
> running, but I'm not certain if that's the case or if I've just been
> lucky this go-round.
>
> I'm on kernel 2.6.23.1, my own compile thereof, x86_64, AMD
> Athlon(tm) 64 Processor 3700+.
>
> I've looked through all the 2.6.23 and 2.6.24 Changelogs, and I
> can't find anything that looks relevant.
>
> So, how can I (help you all) debug this?

Do you have a serial console? Does it respond to pings?

Can you try to get sysrq+T traces, and sysrq+P traces, and post
them?
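
If the box still accepts logins, the same dumps can be requested without a keyboard by writing the command letters to /proc/sysrq-trigger. A minimal sketch, assuming root access; the function name `request_sysrq` and its `trigger` parameter are invented for the example, and the traces themselves end up in the kernel log, not in the program's output:

```python
def request_sysrq(keys="tp", trigger="/proc/sysrq-trigger"):
    """Send each SysRq command letter in `keys` to the kernel.

    't' asks for a dump of all tasks, 'p' for the current registers.
    Hypothetical helper; must run as root on a real system.
    Returns the list of letters that were written.
    """
    for key in keys:
        # One command letter per write, so reopen the file each time.
        with open(trigger, "w") as f:
            f.write(key)
    return list(keys)
```

The `trigger` argument exists only so the function can be pointed at an ordinary file while experimenting.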


* Re: Monthly md check == hung machine; how do I debug?
  2008-02-04 10:40 ` Nick Piggin
@ 2008-02-05 17:10   ` Robin Lee Powell
  2008-02-05 18:55     ` Lennart Sorensen
  2008-02-05 20:27     ` Neil Brown
  0 siblings, 2 replies; 9+ messages in thread
From: Robin Lee Powell @ 2008-02-05 17:10 UTC
  To: Nick Piggin; +Cc: linux-kernel

On Mon, Feb 04, 2008 at 09:40:55PM +1100, Nick Piggin wrote:
> On Monday 04 February 2008 08:21, Robin Lee Powell wrote:
> > I've got a machine with a 4 disk SATA raid10 configuration using
> > md. The entire disk is loop-AES encrypted, but that shouldn't
> > matter here.
> >
> > Once a month, Debian runs:
> >
> >     /usr/share/mdadm/checkarray --cron --all --quiet
> >
> > and the machine hangs within 30 minutes of that starting.
> >
> > It seems that I can avoid the hang by not having "mdadm
> > --monitor" running, but I'm not certain if that's the case or if
> > I've just been lucky this go-round.
> >
> > I'm on kernel 2.6.23.1, my own compile thereof, x86_64, AMD
> > Athlon(tm) 64 Processor 3700+.
> >
> > I've looked through all the 2.6.23 and 2.6.24 Changelogs, and I
> > can't find anything that looks relevant.
> >
> > So, how can I (help you all) debug this?
> 
> Do you have a serial console? Does it respond to pings?

No and yes.

> Can you try to get sysrq+T traces, and sysrq+P traces, and post
> them?

I played with those after you suggested it, but without serial
console had no way to capture them.

I was able to solve the problem, however, like so:

132c133
< # CONFIG_PREEMPT_NONE is not set
---
> CONFIG_PREEMPT_NONE=y
134,135c135,136
< CONFIG_PREEMPT=y
< CONFIG_PREEMPT_BKL=y
---
> # CONFIG_PREEMPT is not set
> # CONFIG_PREEMPT_BKL is not set
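
The diff above switches the kernel from CONFIG_PREEMPT to CONFIG_PREEMPT_NONE. To see which preemption option a given .config selects, something like the following sketch would do; `preempt_model` is a made-up name, and the function only looks at the three option names listed in it:

```python
def preempt_model(config_text):
    """Return the enabled preemption option in a kernel .config, or None.

    Hypothetical helper for illustration; `config_text` is the full
    text of a .config file.
    """
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # "# CONFIG_FOO is not set" lines are comments and are skipped.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    for option in ("CONFIG_PREEMPT_NONE",
                   "CONFIG_PREEMPT_VOLUNTARY",
                   "CONFIG_PREEMPT"):
        if option in enabled:
            return option
    return None
```

Run against the two sides of the diff, it reports CONFIG_PREEMPT for the config that hung and CONFIG_PREEMPT_NONE for the one that did not.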

-Robin

-- 
Lojban Reason #17: http://en.wikipedia.org/wiki/Buffalo_buffalo
Proud Supporter of the Singularity Institute - http://singinst.org/
http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/


* Re: Monthly md check == hung machine; how do I debug?
  2008-02-05 17:10   ` Robin Lee Powell
@ 2008-02-05 18:55     ` Lennart Sorensen
  2008-02-05 19:18       ` Robin Lee Powell
  2008-02-05 20:27     ` Neil Brown
  1 sibling, 1 reply; 9+ messages in thread
From: Lennart Sorensen @ 2008-02-05 18:55 UTC
  To: Robin Lee Powell; +Cc: Nick Piggin, linux-kernel

On Tue, Feb 05, 2008 at 09:10:05AM -0800, Robin Lee Powell wrote:
> On Mon, Feb 04, 2008 at 09:40:55PM +1100, Nick Piggin wrote:
> > On Monday 04 February 2008 08:21, Robin Lee Powell wrote:
> > > I've got a machine with a 4 disk SATA raid10 configuration using
> > > md. The entire disk is loop-AES encrypted, but that shouldn't
> > > matter here.
> > >
> > > Once a month, Debian runs:
> > >
> > >     /usr/share/mdadm/checkarray --cron --all --quiet
> > >
> > > and the machine hangs within 30 minutes of that starting.
> > >
> > > It seems that I can avoid the hang by not having "mdadm
> > > --monitor" running, but I'm not certain if that's the case or if
> > > I've just been lucky this go-round.
> > >
> > > I'm on kernel 2.6.23.1, my own compile thereof, x86_64, AMD
> > > Athlon(tm) 64 Processor 3700+.
> > >
> > > I've looked through all the 2.6.23 and 2.6.24 Changelogs, and I
> > > can't find anything that looks relevant.
> > >
> > > So, how can I (help you all) debug this?
> > 
> > Do you have a serial console? Does it respond to pings?
> 
> No and yes.
> 
> > Can you try to get sysrq+T traces, and sysrq+P traces, and post
> > them?
> 
> I played with those after you suggested it, but without serial
> console had no way to capture them.
> 
> I was able to solve the problem, however, like so:

I tend to adjust the max disk speed raid is allowed to use, since the
default of 200MB/s makes the system close to unusable while it is taking
place.  Could having slow disk access be causing things to lock up?

Things made much more sense some years ago when the default was 10MB/s
or something along those lines.

Who has 200MB/s capable hardware anyhow?
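
For reference, the limit in question is the md sysctl /proc/sys/dev/raid/speed_limit_max, which takes a value in KB/s. A rough sketch of lowering it, assuming root; the function name `set_raid_speed_limit` and its `path` parameter are invented for the example:

```python
def set_raid_speed_limit(kb_per_sec,
                         path="/proc/sys/dev/raid/speed_limit_max"):
    """Set the md resync/check speed ceiling and return the old value.

    Hypothetical helper; `kb_per_sec` is the new limit in KB/s.
    """
    if kb_per_sec <= 0:
        raise ValueError("speed limit must be positive")
    with open(path) as f:
        old = int(f.read().strip())
    with open(path, "w") as f:
        f.write(f"{kb_per_sec}\n")
    return old
```

Something like `set_raid_speed_limit(10000)` would bring the ceiling down to about 10MB/s; the returned value can be written back afterwards to restore the previous setting.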

--
Len Sorensen


* Re: Monthly md check == hung machine; how do I debug?
  2008-02-05 18:55     ` Lennart Sorensen
@ 2008-02-05 19:18       ` Robin Lee Powell
  0 siblings, 0 replies; 9+ messages in thread
From: Robin Lee Powell @ 2008-02-05 19:18 UTC
  To: Lennart Sorensen; +Cc: Nick Piggin, linux-kernel

On Tue, Feb 05, 2008 at 01:55:17PM -0500, Lennart Sorensen wrote:
> I tend to adjust the max disk speed raid is allowed to use, since
> the default of 200MB/s makes the system close to unusable while it
> is taking place.  Could having slow disk access be causing things
> to lock up?

I don't know if it could or not, but I have no performance problems
at all when the sync is running, as long as it doesn't lock up, so
it seems unlikely to me.

(Shout out to a fellow CSC sysadmin, btw.)

-Robin

-- 
Lojban Reason #17: http://en.wikipedia.org/wiki/Buffalo_buffalo
Proud Supporter of the Singularity Institute - http://singinst.org/
http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/


* Re: Monthly md check == hung machine; how do I debug?
  2008-02-05 17:10   ` Robin Lee Powell
  2008-02-05 18:55     ` Lennart Sorensen
@ 2008-02-05 20:27     ` Neil Brown
  2008-02-05 21:17       ` Robin Lee Powell
  1 sibling, 1 reply; 9+ messages in thread
From: Neil Brown @ 2008-02-05 20:27 UTC
  To: Robin Lee Powell; +Cc: Nick Piggin, linux-kernel

On Tuesday February 5, rlpowell@digitalkingdom.org wrote:
> 
> I was able to solve the problem, however, like so:
> 
> 132c133
> < # CONFIG_PREEMPT_NONE is not set
> ---
> > CONFIG_PREEMPT_NONE=y
> 134,135c135,136
> < CONFIG_PREEMPT=y
> < CONFIG_PREEMPT_BKL=y
> ---
> > # CONFIG_PREEMPT is not set
> > # CONFIG_PREEMPT_BKL is not set
> 

This suggests that there is some sort of race.
Given that I've never hit it on SMP machines, it is probably a very
small window that opens immediately after some event that triggers
kernel preemption.

The only thing "mdadm --monitor" does in the kernel is read /proc/mdstat
and maybe make some GET_ARRAY_INFO / GET_DISK_INFO ioctl calls.

They don't do much more than grab the reconfig_mutex.....

What sort of hardware do you have?  x86?  SMP or uni-processor?
Also, exactly what kernel are you running?

I might see if I can reproduce it... so if you can send me the broken
.config, that might help too.

Thanks,
NeilBrown


* Re: Monthly md check == hung machine; how do I debug?
  2008-02-05 20:27     ` Neil Brown
@ 2008-02-05 21:17       ` Robin Lee Powell
  0 siblings, 0 replies; 9+ messages in thread
From: Robin Lee Powell @ 2008-02-05 21:17 UTC
  To: Neil Brown; +Cc: Nick Piggin, linux-kernel

On Wed, Feb 06, 2008 at 07:27:56AM +1100, Neil Brown wrote:
> On Tuesday February 5, rlpowell@digitalkingdom.org wrote:
> > 
> > I was able to solve the problem, however, like so:
> > 
> > 132c133
> > < # CONFIG_PREEMPT_NONE is not set
> > ---
> > > CONFIG_PREEMPT_NONE=y
> > 134,135c135,136
> > < CONFIG_PREEMPT=y
> > < CONFIG_PREEMPT_BKL=y
> > ---
> > > # CONFIG_PREEMPT is not set
> > > # CONFIG_PREEMPT_BKL is not set
> > 
> 
> This suggests that there is some sort of race. Given that I've
> never hit it on SMP machines, it is probably a very small window
> that opens immediately after some event that triggers kernel
> preemption.
> 
> The only "mdadm --monitor" does

Going to stop you right there; "mdadm --monitor" wasn't it, nor was
smartd as I thought at one point.  I honestly don't know what was
triggering it, except maybe disk access.  The fact that backups were
running at the same time as the sync seemed to make it happen
faster; that's the best I've got at this point.

> What sort of hardware do you have?  x86?  SMP or uni-processor?
> Also, exactly what kernel are you running?

rlpowell@chain> uname -a
Linux chain.digitalkingdom.org 2.6.23.1-dk3 #4 SMP Mon Feb 4 06:14:44 PST 2008 x86_64 GNU/Linux
rlpowell@chain> cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 39
model name      : AMD Athlon(tm) 64 Processor 3700+
stepping        : 1
cpu MHz         : 2210.251
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflu
t fxsr_opt lm 3dnowext 3dnow up rep_good pni lahf_lm
bogomips        : 4422.66
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc


> I might see if I can reproduce it... so if you can send me the
> broken .config, that might help too.

http://teddyb.org/~rlpowell/media/regular/config-2.6.23.1-dk2.txt

-Robin
