LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: "Haar János" <djani22@netcenter.hu>
To: "David Chinner" <dgc@sgi.com>
Cc: <dgc@sgi.com>, <linux-xfs@oss.sgi.com>, <linux-kernel@vger.kernel.org>
Subject: Re: xfslogd-spinlock bug?
Date: Mon, 18 Dec 2006 09:17:50 +0100	[thread overview]
Message-ID: <027b01c7227d$0e26d1f0$0400a8c0@dcccs> (raw)
In-Reply-To: 20061218062444.GH44411608@melbourne.sgi.com


----- Original Message ----- 
From: "David Chinner" <dgc@sgi.com>
To: "Haar János" <djani22@netcenter.hu>
Cc: "David Chinner" <dgc@sgi.com>; <linux-xfs@oss.sgi.com>;
<linux-kernel@vger.kernel.org>
Sent: Monday, December 18, 2006 7:24 AM
Subject: Re: xfslogd-spinlock bug?


> On Mon, Dec 18, 2006 at 12:56:41AM +0100, Haar János wrote:
> > > On Sat, Dec 16, 2006 at 12:19:45PM +0100, Haar János wrote:
> > > > I dont know there is a context between 2 messages, but i can see,
the
> > > > spinlock bug comes always on cpu #3.
> > > >
> > > > Somebody have any idea?
> > >
> > > Your disk interrupts are directed to CPU 3, and so log I/O completion
> > > occurs on that CPU.
> >
> >            CPU0       CPU1       CPU2       CPU3
> >   0:        100          0          0    4583704   IO-APIC-edge
timer
> >   1:          0          0          0          2   IO-APIC-edge
i8042
> >   4:          0          0          0    3878668   IO-APIC-edge
serial
> .....
> >  14:    3072118          0          0        181   IO-APIC-edge
ide0
> .....
> >  52:          0          0          0  213052723   IO-APIC-fasteoi
eth1
> >  53:          0          0          0   91913759   IO-APIC-fasteoi
eth2
> > 100:          0          0          0   16776910   IO-APIC-fasteoi
eth0
> ....
> >
> > Maybe....
> > I have 3 XFS on this system, with 3 source.
> >
> > 1. 200G one ide hdd.
> > 2. 2x200G mirror on 1 ide + 1 sata hdd.
> > 3. 4x3.3TB strip on NBD.
> >
> > The NBD serves through eth1, and it is on the CPU3, but the ide0 is on
the
> > CPU0.
>
> I'd say your NBD based XFS filesystem is having trouble.
>
> > > Are you using XFS on a NBD?
> >
> > Yes, on the 3. source.
>
> Ok, I've never heard of a problem like this before and you are doing
> something that very few ppl are doing (i.e. XFS on NBD). I'd start
> Hence  I'd start by suspecting a bug in the NBD driver.

Ok, if you have right, this also can be in context with the following issue:

http://download.netcenter.hu/bughunt/20061217/messages.txt   (10KB)


>
> > > > Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0  EFLAGS: 00010002
> > > > Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b
RCX:
> > >                                                      ^^^^^^^^^^^^^^^^
> > > Anyone recognise that pattern?
> >
> > I think i have one idea.
> > This issue can stops sometimes the 5sec automatic restart on crash, and
this
> > shows possible memory corruption, and if the bug occurs in the IRQ
> > handling.... :-)
> > I have a lot of logs about this issue, and the RAX, RBX always the same.
>
> And is this the only place where you see the problem? Or are there
> other stack traces that you see this in as well?

I have used the 2.6.16.18 for a long time, and it was stable, except this
issue. (~20 dump with xfslogd)
And i try the new releases, and now i have more. :-)

What do you think exactly?
I can see in the logs, but search for what?
The RAX, RBX thing, or the xfslogd-spinlock problem or the old nbd-deadlock
+ mem corruption?

[root@NetCenter netlog]# grep "0000000000000033" messages*
messages.1:Dec 11 22:47:21 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 12 18:16:35 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 13 11:40:05 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 14 22:25:32 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 15 06:24:44 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.11:Oct  3 19:49:44 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.11:Oct  7 01:11:17 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.13:Sep 21 15:35:31 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.15:Sep  3 16:13:35 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.15:Sep  5 21:00:38 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec  9 00:10:47 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec  9 14:07:01 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec 10 04:44:48 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.3:Nov 30 10:59:21 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.3:Dec  2 00:54:23 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 13 10:44:49 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 14 03:14:14 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 14 03:37:07 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 15 01:39:54 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  6 14:48:54 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  7 04:36:13 dy-base RAX: 0000000000000033 RBX:
ffff8100057d2080 RCX: ffff810050d638f8
messages.6:Nov  7 04:36:13 dy-base RDX: 0000000000000008 RSI:
0000000000012cff RDI: 0000000000000033
messages.6:Nov  7 11:12:06 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  8 03:20:38 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  8 15:02:16 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  8 15:27:12 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 10 15:29:43 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 11 20:44:14 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.9:Oct 18 15:31:02 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.9:Oct 19 13:53:24 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000


>
> > > This implies a spinlock inside a wait_queue_head_t is corrupt.
> > >
> > > What are you type of system do you have, and what sort of
> > > workload are you running?
> >
> > OS: Fedora 5, 64bit.
> > HW: dual xeon, with HT, ram 4GB.
> > (the min_free_kbytes limit is set to 128000, because sometimes the e1000
> > driver run out the reserved memory during irq handling.)
>
> That does not sound good. What happens when it does run out of memory?



This is an old problem, on 2.6.16.18 .
The default min_free_kbytes is 38xx , and the GIGE controller easily can be
overflow this little place.
If this happens, the system freez, and i can only use the serial console +
sysreq to dump stack:

download.netcenter.hu/bughunt/20060530/261618-good.txt
download.netcenter.hu/bughunt/20060530/dmesg.txt
download.netcenter.hu/bughunt/20060530/dump.txt

This problem is already fixed with set the min_free_kbytes to 128M.

> Is that when you start to see the above corruptions?

I think no, but i am not 100% sure.

Cheers,

Janos

>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group


  reply	other threads:[~2006-12-18  8:19 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-12-11 23:00 xfslogd-spinlock bug? Haar János
2006-12-12 14:32 ` Justin Piszcz
2006-12-13  1:11   ` Haar János
2006-12-16 11:19     ` Haar János
2006-12-17 22:44       ` David Chinner
2006-12-17 23:56         ` Haar János
2006-12-18  6:24           ` David Chinner
2006-12-18  8:17             ` Haar János [this message]
2006-12-18 22:36               ` David Chinner
2006-12-18 23:39                 ` Haar János
2006-12-19  2:52                   ` David Chinner
2006-12-19  4:47                     ` David Chinner
2006-12-27 12:58                       ` Haar János
2007-01-07 23:14                         ` David Chinner
2007-01-10 17:18                           ` Janos Haar
2007-01-11  3:34                             ` David Chinner
2007-01-11 20:15                               ` Janos Haar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='027b01c7227d$0e26d1f0$0400a8c0@dcccs' \
    --to=djani22@netcenter.hu \
    --cc=dgc@sgi.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).