LKML Archive on lore.kernel.org
* Re: xfs internal error on a new filesystem
@ 2007-02-15 16:19 Ahmed El Zein
  2007-02-16 17:58 ` David Chinner
  2007-02-18 13:56 ` Leon Kolchinsky
  0 siblings, 2 replies; 7+ messages in thread
From: Ahmed El Zein @ 2007-02-15 16:19 UTC (permalink / raw)
  To: David Chinner; +Cc: Ramy M. Hassan , linux-kernel, xfs



David Chinner <dgc@sgi.com> wrote on 15 Feb 2007, 11:16 AM:
Subject: Re: xfs internal error on a new filesystem
>On Wed, Feb 14, 2007 at 10:24:27AM +0000, Ramy M. Hassan  wrote:
>> Hello,
>> We got the following xfs internal error on one of our production servers:
>> 
>> Feb 14 08:28:52 info6 kernel: [238186.676483] Filesystem "sdd8": XFS
>> internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c. 
>> Caller 0xf8b906e7
>
>Real stack looks to be:
>
> xfs_trans_cancel
> xfs_mkdir
> xfs_vn_mknod
> xfs_vn_mkdir
> vfs_mkdir
> sys_mkdirat
> sys_mkdir
>
>We aborted a transaction for some reason. We got an error somewhere in
>a mkdir while we had a dirty transaction.  Unfortunately, this tells us very
>little about the error that actually caused the shutdown.
>
>What is your filesystem layout? (xfs_info <mntpt>) How much memory
>do you have, and were you near ENOMEM conditions?

We have 1536 MB of RAM. It is possible that we were near ENOMEM conditions
at the time of the crash; I don't know for sure, but we have seen such
memory spikes on our servers.

root@info6:~# xfs_info /vol/6/
meta-data=/dev/sdd8              isize=256    agcount=16, agsize=7001584 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=112025248, imaxpct=25
         =                       sunit=16     swidth=64 blks, unwritten=0
naming   =version 2              bsize=4096  
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks
realtime =none                   extsz=65536  blocks=0, rtextents=0


>
>> We were able to unmount/remount the volume (didn't do xfs_repair because we
>> thought it might take a long time, and the server was already in production
>> at the moment)
>
>Risky to run a production system on a filesystem that might be corrupted.
>You risk further problems if you don't run repair....
>
>> The file system was created less than 48 hours ago, and 370G of sensitive
>> production data was moved to the server before the XFS crash.
>
>So that's not a "new" filesystem at all...
By new we meant 48 hours old.

>
>FWIW, did you do any offline testing before you put it into production?

We did some basic testing. But as a filesystem developer, how would you
test a filesystem so that you could be confident in its stability and
worry-free in terms of faulty hardware?

>
>> System details :
>> Kernel: 2.6.18
>> Controller: 3ware 9550SX-8LP (RAID 10)
>
>Can you describe your dm/md volume layout?

One unit, 8 HDDs: a stripe of 4 mirrors.

>
>> We are wondering here if this problem is an indicator of data corruption
>> on disk?
>
>It might be. You didn't run xfs_check or xfs_repair, so we don't know if
>there is any on disk corruption here.
>
>> Is it really necessary to run xfs_repair?
>
>If you want to know whether you've left any landmines around for the
>filesystem to trip over again, yes. That is, you should run repair after
>any sort of XFS shutdown to make sure nothing is corrupted on disk.
>If nothing is corrupted on disk, then we are looking at an in-memory
>problem....
We will run repair tonight.

>
>> Do you recommend that we switch back to reiserfs?
>
>Not yet.
>
>> Could it be a hardware-related problem?
>
>Yes. Do you have ECC memory on your server? Have you run memtest86?
>Were there any I/O errors in the log prior to the shutdown message?
Yes, we have ECC memory.
We will try to run memtest86 as soon as possible.
There were no I/O errors in the log prior to the shutdown message.

Btw, this is a vmware image. /vol/6 is an exported physical partition.

>Cheers,
>
>Dave.
>-- 
>Dave Chinner
>Principal Engineer
>SGI Australian Software Group
>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: xfs internal error on a new filesystem
  2007-02-15 16:19 xfs internal error on a new filesystem Ahmed El Zein
@ 2007-02-16 17:58 ` David Chinner
  2007-02-18 13:56 ` Leon Kolchinsky
  1 sibling, 0 replies; 7+ messages in thread
From: David Chinner @ 2007-02-16 17:58 UTC (permalink / raw)
  To: Ahmed El Zein; +Cc: David Chinner, Ramy M. Hassan , linux-kernel, xfs

On Thu, Feb 15, 2007 at 04:19:32PM +0000, Ahmed El Zein wrote:
> David Chinner <dgc@sgi.com> wrote on 15 Feb 2007, 11:16 AM:
> >What is your filesystem layout? (xfs_info <mntpt>) How much memory
> >do you have, and were you near ENOMEM conditions?
> 
> We have 1536 MB of RAM. It is possible that we were near ENOMEM conditions
> at the time of the crash; I don't know for sure, but we have seen such
> memory spikes on our servers.

Ok, so that's a possibility.

> root@info6:~# xfs_info /vol/6/
> meta-data=/dev/sdd8              isize=256    agcount=16, agsize=7001584 blks
>          =                       sectsz=512   attr=0
> data     =                       bsize=4096   blocks=112025248, imaxpct=25
>          =                       sunit=16     swidth=64 blks, unwritten=0
> naming   =version 2              bsize=4096  
> log      =internal               bsize=4096   blocks=32768, version=1
>          =                       sectsz=512   sunit=0 blks
> realtime =none                   extsz=65536  blocks=0, rtextents=0

Nothing unusual here...

> >Yes. Do you have ECC memory on your server? Have you run memtest86?
> >Were there any I/O errors in the log prior to the shutdown message?
> Yes, we have ECC memory.
> We will try to run memtest86 as soon as possible.
> There were no I/O errors in the log prior to the shutdown message.
> 
> Btw, this is a vmware image. /vol/6 is an exported physical partition.

I'd suggest trying to reproduce this problem without vmware in the
picture - you need to rule out a vmware based problem first before we
can really make any progress on this....

Cheers,

Dave.

-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: xfs internal error on a new filesystem
  2007-02-15 16:19 xfs internal error on a new filesystem Ahmed El Zein
  2007-02-16 17:58 ` David Chinner
@ 2007-02-18 13:56 ` Leon Kolchinsky
  1 sibling, 0 replies; 7+ messages in thread
From: Leon Kolchinsky @ 2007-02-18 13:56 UTC (permalink / raw)
  To: 'Ahmed El Zein', 'David Chinner'
  Cc: 'Ramy M. Hassan ', linux-kernel, xfs

> >
> >> Do you recommend that we switch back to reiserfs?
> >
> >Not yet.
> >
> >> Could it be a hardware-related problem?
> >
> >Yes. Do you have ECC memory on your server? Have you run memtest86?
> >Were there any I/O errors in the log prior to the shutdown message?
> Yes, we have ECC memory.
> We will try to run memtest86 as soon as possible.
> There were no I/O errors in the log prior to the shutdown message.
> 
> Btw, this is a vmware image. /vol/6 is an exported physical partition.
> 

I've read that VMware does disk caching by default, and we know XFS has
problems with that when disaster strikes.

You definitely should disable disk caching on your side.
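
One way to inspect and turn off the drive-level write cache is with hdparm
(a sketch; the device name is only an example, and whether this applies at
all depends on how the VMware virtual disk and the 3ware controller expose
caching, since a hardware RAID controller usually manages its cache through
its own CLI):

```shell
# Query and disable the on-drive write cache (example device name).
hdparm -W /dev/sdd    # show the current write-cache setting
hdparm -W0 /dev/sdd   # disable the write cache
```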

> >Cheers,
> >
> >Dave.
> >--
> >Dave Chinner
> >Principal Engineer
> >SGI Australian Software Group
> >
>


Regards,
Leon Kolchinsky
 



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: xfs internal error on a new filesystem
  2007-02-14 10:24 Ramy M. Hassan 
  2007-02-14 10:40 ` Jan-Benedict Glaw
  2007-02-14 10:48 ` Patrick Ale
@ 2007-02-15  9:16 ` David Chinner
  2 siblings, 0 replies; 7+ messages in thread
From: David Chinner @ 2007-02-15  9:16 UTC (permalink / raw)
  To: Ramy M. Hassan ; +Cc: linux-kernel, Ahmed El Zein, xfs

On Wed, Feb 14, 2007 at 10:24:27AM +0000, Ramy M. Hassan  wrote:
> Hello,
> We got the following xfs internal error on one of our production servers:
> 
> Feb 14 08:28:52 info6 kernel: [238186.676483] Filesystem "sdd8": XFS
> internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c. 
> Caller 0xf8b906e7

Real stack looks to be:

 xfs_trans_cancel
 xfs_mkdir
 xfs_vn_mknod
 xfs_vn_mkdir
 vfs_mkdir
 sys_mkdirat
 sys_mkdir

We aborted a transaction for some reason. We got an error somewhere in
a mkdir while we had a dirty transaction.  Unfortunately, this tells us very
little about the error that actually caused the shutdown.

What is your filesystem layout? (xfs_info <mntpt>) How much memory
do you have, and were you near ENOMEM conditions?

> We were able to unmount/remount the volume (didn't do xfs_repair because we
> thought it might take a long time, and the server was already in production
> at the moment)

Risky to run a production system on a filesystem that might be corrupted.
You risk further problems if you don't run repair....

> The file system was created less than 48 hours ago, and 370G of sensitive
> production data was moved to the server before the XFS crash.

So that's not a "new" filesystem at all...

FWIW, did you do any offline testing before you put it into production?

> System details :
> Kernel: 2.6.18
> Controller: 3ware 9550SX-8LP (RAID 10)

Can you describe your dm/md volume layout?

> We are wondering here if this problem is an indicator of data corruption on
> disk?

It might be. You didn't run xfs_check or xfs_repair, so we don't know if
there is any on disk corruption here.

> Is it really necessary to run xfs_repair?

If you want to know whether you've left any landmines around for the
filesystem to trip over again, yes. That is, you should run repair after
any sort of XFS shutdown to make sure nothing is corrupted on disk.
If nothing is corrupted on disk, then we are looking at an in-memory
problem....
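
A minimal sketch of that repair cycle, using the device and mount point from
this thread (run it during a maintenance window; the `-n` pass is a read-only
check that reports problems without changing anything):

```shell
# Post-shutdown XFS repair cycle. Device and mount point are the ones from
# this report; adjust for your system. xfs_repair requires the filesystem
# to be unmounted.
umount /vol/6
xfs_repair -n /dev/sdd8   # dry run: report problems, change nothing
xfs_repair /dev/sdd8      # actual repair
mount /dev/sdd8 /vol/6
```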

> Do you recommend that we switch back to reiserfs?

Not yet.

> Could it be a hardware-related problem?

Yes. Do you have ECC memory on your server? Have you run memtest86?
Were there any I/O errors in the log prior to the shutdown message?
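
Checking for prior I/O errors can be as simple as grepping the kernel log
(the log path below is a common default; yours may differ, and the patterns
are only examples of what a disk or controller error typically looks like):

```shell
# Look for I/O and SCSI errors on the affected device around the time of
# the shutdown (/var/log/kern.log is a common default path).
grep -iE 'i/o error|scsi error|sdd' /var/log/kern.log | tail -n 50
```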

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: xfs internal error on a new filesystem
  2007-02-14 10:24 Ramy M. Hassan 
  2007-02-14 10:40 ` Jan-Benedict Glaw
@ 2007-02-14 10:48 ` Patrick Ale
  2007-02-15  9:16 ` David Chinner
  2 siblings, 0 replies; 7+ messages in thread
From: Patrick Ale @ 2007-02-14 10:48 UTC (permalink / raw)
  To: Ramy M. Hassan; +Cc: linux-kernel, Ahmed El Zein

On 2/14/07, Ramy M. Hassan <ramy@gawab.com> wrote:
> Hello,
> We got the following xfs internal error on one of our production servers:

Hi. First, a disclaimer: I am not an XFS or kernel guru. What I am writing
now is purely my own experience, since I use XFS on all my machines, on
different disks.


I encountered the problem you have now twice over the past three years.
Once it was caused by a disk whose 8MB on-disk cache was faulty, causing
corruption, and once by what seems to have been a CPU that couldn't keep
up with XFS. This sounds illogical, honestly, to me too, but the
explanation I got was that XFS writes are quite CPU intensive, especially
when you write at 500MB/s, and we tried to do this on a PII-400MHz.

I tried reiserfs as well, and I honestly can't give you one reason to
switch back to it. I love XFS, always did; it's fast and reliable. The
problems I had were never caused by XFS itself but by the hardware that
had to deal with XFS (CPU/disk).

And xfs_repair DID repair my filesystems: the data was on the disks and
valid; XFS had just shut down my filesystem because it found my journal
unreliable/corrupted.

Again, please be aware that I am just a regular user who likes to play
around with Linux and the kernel; I am no expert in the field of XFS.


I hope this helps you a bit.


Patrick

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: xfs internal error on a new filesystem
  2007-02-14 10:24 Ramy M. Hassan 
@ 2007-02-14 10:40 ` Jan-Benedict Glaw
  2007-02-14 10:48 ` Patrick Ale
  2007-02-15  9:16 ` David Chinner
  2 siblings, 0 replies; 7+ messages in thread
From: Jan-Benedict Glaw @ 2007-02-14 10:40 UTC (permalink / raw)
  To: Ramy M. Hassan ; +Cc: linux-kernel, Ahmed El Zein

[-- Attachment #1: Type: text/plain, Size: 1122 bytes --]

On Wed, 2007-02-14 10:24:27 +0000, Ramy M. Hassan  <ramy@gawab.com> wrote:
> Feb 14 08:28:52 info6 kernel: [238186.945610] Filesystem "sdd8": Corruption of in-memory data detected.  Shutting down filesystem: sdd8
[...]
> We are wondering here if this problem is an indicator of data corruption on
> disk? Is it really necessary to run xfs_repair? Do you recommend that we
> switch back to reiserfs? Could it be a hardware-related problem? Any
> clues how to verify that?

Running memtest86 for some hours could at least help to verify that
you don't have bad memory...

MfG, JBG

-- 
      Jan-Benedict Glaw      jbglaw@lug-owl.de              +49-172-7608481
Signature of: 23:53 <@jbglaw> So, I'm off to climb into bed now.
the second  : 23:57 <@jever2> .oO( climb ..., does he still have bars on his bed, like my kids used to?)
              00:00 <@jbglaw> jever2: *slap*
              00:01 <@jever2> *ouch*, what for? Thoughts are free!
              00:02 <@jbglaw> Nah, free thoughts have been out since 1984!
              00:03 <@jever2> 1984? I've only been married since 1985!

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* xfs internal error on a new filesystem
@ 2007-02-14 10:24 Ramy M. Hassan 
  2007-02-14 10:40 ` Jan-Benedict Glaw
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Ramy M. Hassan  @ 2007-02-14 10:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ahmed El Zein

Hello,
We got the following xfs internal error on one of our production servers:

Feb 14 08:28:52 info6 kernel: [238186.676483] Filesystem "sdd8": XFS
internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c. 
Caller 0xf8b906e7
Feb 14 08:28:52 info6 kernel: [238186.869089]  [pg0+947617605/1069749248]
xfs_trans_cancel+0x115/0x140 [xfs]
Feb 14 08:28:52 info6 kernel: [238186.879131]  [pg0+947664615/1069749248]
xfs_mkdir+0x2ad/0x777 [xfs]
Feb 14 08:28:52 info6 kernel: [238186.882517]  [pg0+947664615/1069749248]
xfs_mkdir+0x2ad/0x777 [xfs]
Feb 14 08:28:52 info6 kernel: [238186.882558]  [pg0+947708079/1069749248]
xfs_vn_mknod+0x46f/0x493 [xfs]
Feb 14 08:28:52 info6 kernel: [238186.882625]  [pg0+947621368/1069749248]
xfs_trans_unlocked_item+0x32/0x50 [xfs]
Feb 14 08:28:52 info6 kernel: [238186.882654]  [pg0+947428195/1069749248]
xfs_da_buf_done+0x73/0xca [xfs]
Feb 14 08:28:52 info6 kernel: [238186.882712]  [pg0+947623330/1069749248]
xfs_trans_brelse+0xd2/0xe5 [xfs]
Feb 14 08:28:52 info6 kernel: [238186.882739]  [pg0+947686680/1069749248]
xfs_buf_rele+0x21/0x77 [xfs]
Feb 14 08:28:52 info6 kernel: [238186.882768]  [pg0+947428868/1069749248]
xfs_da_state_free+0x64/0x6b [xfs]
Feb 14 08:28:52 info6 kernel: [238186.882827]  [pg0+947468486/1069749248]
xfs_dir2_node_lookup+0x85/0xbb [xfs]
Feb 14 08:28:52 info6 kernel: [238186.882857]  [pg0+947445628/1069749248]
xfs_dir_lookup+0x13a/0x13e [xfs]
Feb 14 08:28:52 info6 kernel: [238186.882912]  [vfs_permission+32/36]
vfs_permission+0x20/0x24
Feb 14 08:28:52 info6 kernel: [238186.883395]  [__link_path_walk+119/3686]
__link_path_walk+0x77/0xe66
Feb 14 08:28:52 info6 kernel: [238186.883413]  [pg0+947627458/1069749248]
xfs_dir_lookup_int+0x3c/0x121 [xfs]
Feb 14 08:28:52 info6 kernel: [238186.883453]  [pg0+947708157/1069749248]
xfs_vn_mkdir+0x2a/0x2e [xfs]
Feb 14 08:28:52 info6 kernel: [238186.883481]  [vfs_mkdir+221/299]
vfs_mkdir+0xdd/0x12b
Feb 14 08:28:52 info6 kernel: [238186.883493]  [sys_mkdirat+156/222]
sys_mkdirat+0x9c/0xde
Feb 14 08:28:52 info6 kernel: [238186.883510]  [sys_mkdir+31/35]
sys_mkdir+0x1f/0x23
Feb 14 08:28:52 info6 kernel: [238186.883519]  [sysenter_past_esp+86/121]
sysenter_past_esp+0x56/0x79
Feb 14 08:28:52 info6 kernel: [238186.883804] xfs_force_shutdown(sdd8,0x8)
called from line 1139 of file fs/xfs/xfs_trans.c.  Return address =
0xf8b84f6b
Feb 14 08:28:52 info6 kernel: [238186.945610] Filesystem "sdd8": Corruption
of in-memory data detected.  Shutting down filesystem: sdd8
Feb 14 08:28:52 info6 kernel: [238186.952732] Please umount the filesystem,
and rectify the problem(s)

We were able to unmount/remount the volume (didn't do xfs_repair because we
thought it might take a long time, and the server was already in production
at the moment)
The file system was created less than 48 hours ago, and 370G of sensitive
production data was moved to the server before the XFS crash.
In fact, we had been using reiserfs for a long period of time, and we had
suffered several filesystem corruption incidents, so we decided to switch
to XFS hoping that it is more stable and corrupts less than reiserfs. It
was disappointing to get this problem just 48 hours after reformatting a
production server with XFS.

System details :
Kernel: 2.6.18
Controller: 3ware 9550SX-8LP (RAID 10)

We are wondering here if this problem is an indicator of data corruption on
disk. Is it really necessary to run xfs_repair? Do you recommend that we
switch back to reiserfs? Could it be a hardware-related problem? Any
clues how to verify that?

Thanks

--Ramy


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2007-02-18 14:27 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-15 16:19 xfs internal error on a new filesystem Ahmed El Zein
2007-02-16 17:58 ` David Chinner
2007-02-18 13:56 ` Leon Kolchinsky
  -- strict thread matches above, loose matches on Subject: below --
2007-02-14 10:24 Ramy M. Hassan 
2007-02-14 10:40 ` Jan-Benedict Glaw
2007-02-14 10:48 ` Patrick Ale
2007-02-15  9:16 ` David Chinner
