LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RFC][PATCH] ChunkFS: fs fission for faster fsck
@ 2007-04-23 11:21 Amit Gud
[not found] ` <17965.6084 1.900376.524639@gargle.gargle.HOWL>
` (2 more replies)
0 siblings, 3 replies; 34+ messages in thread
From: Amit Gud @ 2007-04-23 11:21 UTC (permalink / raw)
To: linux-fsdevel, linux-kernel
Cc: val_henson, riel, zab, arjan, suparna, brandon, karunasagark, gud
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1990 bytes --]
This is an initial implementation of ChunkFS technique, briefly discussed
at: http://lwn.net/Articles/190222 and
http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
This implementation is done within ext2 driver. Every chunk is an
independent ext2 file system. The knowledge about chunks is kept within
ext2 and 'continuation inodes', which are used to allow files and
directories span across multiple chunks, are managed within ext2.
At mount time, super blocks for all the chunks are created and linked with
the global super_blocks list maintained by VFS. This allows independent
behavior or individual chunks and also helps writebacks to happen
seamlessly.
Apart from this, chunkfs code in ext2 effectively only provides knowledge of:
- what inode's which block number to look for, for a given file's logical
block number
- in which chunk to allocate next inode / block
- number of inodes to scan when a directory is being read
To maintain the ext2's inode number uniqueness property, 8 msb bits of
inode number are used to indicate the chunk number in which it resides.
As said, this is a preliminary implementation and lots of changes are
expected before this code is even sanely usable. Some known issues and
obvious optimizations are listed in the TODO file in the chunkfs patch.
http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch
- one big patch
- applies to 2.6.18
Attached - ext2-chunkfs-diff.patch.gz
- since the code is a spin-off of ext2, this patch explains better what
has changed from the ext2.
git://cislinux.cis.ksu.edu/chunkfs-tools
- mkfs, and fsck for chunkfs.
http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml
- config file used; tested mostly on UML with loopback file systems.
NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP
should be "no" for clean compile.
Please comment, suggest, criticize. Patches most welcome.
Best,
AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud
[-- Attachment #2: Type: APPLICATION/octet-stream, Size: 41382 bytes --]
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-23 16:28 ` Suparna Bhattacharya
@ 2007-04-23 15:25 ` Amit Gud
2007-04-23 16:32 ` Suparna Bhattacharya
1 sibling, 0 replies; 34+ messages in thread
From: Amit Gud @ 2007-04-23 15:25 UTC (permalink / raw)
To: suparna
Cc: linux-fsdevel, linux-kernel, val_henson, riel, zab, arjan,
brandon, karunasagark
Suparna Bhattacharya wrote:
> Could you send this out as a patch to ext2 codebase, so we can just look
> at the changes for chunkfs ? That might also make it small enough
> to inline your patch in email for review.
>
> What kind of results are you planning to gather to evaluate/optimize this ?
>
Mainly I'm trying to gather following:
- Graph of continuation inodes vs. the file system fragmentation (or
aging) factor with varying configurations of chunk sizes
- Graph of wall clock time vs. disk size + data on the disk with both
chunkfs and native ext2, and/or other file systems
AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-23 11:21 [RFC][PATCH] ChunkFS: fs fission for faster fsck Amit Gud
[not found] ` <17965.6084 1.900376.524639@gargle.gargle.HOWL>
@ 2007-04-23 16:28 ` Suparna Bhattacharya
2007-04-23 15:25 ` Amit Gud
2007-04-23 16:32 ` Suparna Bhattacharya
2007-04-24 11:44 ` Nikita Danilov
2 siblings, 2 replies; 34+ messages in thread
From: Suparna Bhattacharya @ 2007-04-23 16:28 UTC (permalink / raw)
To: Amit Gud
Cc: linux-fsdevel, linux-kernel, val_henson, riel, zab, arjan,
brandon, karunasagark, gud
On Mon, Apr 23, 2007 at 06:21:34AM -0500, Amit Gud wrote:
>
> This is an initial implementation of ChunkFS technique, briefly discussed
> at: http://lwn.net/Articles/190222 and
> http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
>
> This implementation is done within ext2 driver. Every chunk is an
> independent ext2 file system. The knowledge about chunks is kept within
> ext2 and 'continuation inodes', which are used to allow files and
> directories span across multiple chunks, are managed within ext2.
>
> At mount time, super blocks for all the chunks are created and linked with
> the global super_blocks list maintained by VFS. This allows independent
> behavior or individual chunks and also helps writebacks to happen
> seamlessly.
>
> Apart from this, chunkfs code in ext2 effectively only provides knowledge
> of:
>
> - what inode's which block number to look for, for a given file's logical
> block number
> - in which chunk to allocate next inode / block
> - number of inodes to scan when a directory is being read
>
> To maintain the ext2's inode number uniqueness property, 8 msb bits of
> inode number are used to indicate the chunk number in which it resides.
>
> As said, this is a preliminary implementation and lots of changes are
> expected before this code is even sanely usable. Some known issues and
> obvious optimizations are listed in the TODO file in the chunkfs patch.
>
> http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch
> - one big patch
> - applies to 2.6.18
Could you send this out as a patch to ext2 codebase, so we can just look
at the changes for chunkfs ? That might also make it small enough
to inline your patch in email for review.
What kind of results are you planning to gather to evaluate/optimize this ?
Regards
Suparna
>
> Attached - ext2-chunkfs-diff.patch.gz
> - since the code is a spin-off of ext2, this patch explains better what
> has changed from the ext2.
>
> git://cislinux.cis.ksu.edu/chunkfs-tools
> - mkfs, and fsck for chunkfs.
>
> http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml
> - config file used; tested mostly on UML with loopback file systems.
>
> NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP
> should be "no" for clean compile.
>
>
> Please comment, suggest, criticize. Patches most welcome.
>
>
> Best,
> AG
> --
> May the source be with you.
> http://www.cis.ksu.edu/~gud
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-23 16:28 ` Suparna Bhattacharya
2007-04-23 15:25 ` Amit Gud
@ 2007-04-23 16:32 ` Suparna Bhattacharya
1 sibling, 0 replies; 34+ messages in thread
From: Suparna Bhattacharya @ 2007-04-23 16:32 UTC (permalink / raw)
To: Amit Gud
Cc: linux-fsdevel, linux-kernel, val_henson, riel, zab, arjan,
brandon, karunasagark, gud
On Mon, Apr 23, 2007 at 09:58:49PM +0530, Suparna Bhattacharya wrote:
> On Mon, Apr 23, 2007 at 06:21:34AM -0500, Amit Gud wrote:
> >
> > This is an initial implementation of ChunkFS technique, briefly discussed
> > at: http://lwn.net/Articles/190222 and
> > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> >
> > This implementation is done within ext2 driver. Every chunk is an
> > independent ext2 file system. The knowledge about chunks is kept within
> > ext2 and 'continuation inodes', which are used to allow files and
> > directories span across multiple chunks, are managed within ext2.
> >
> > At mount time, super blocks for all the chunks are created and linked with
> > the global super_blocks list maintained by VFS. This allows independent
> > behavior or individual chunks and also helps writebacks to happen
> > seamlessly.
> >
> > Apart from this, chunkfs code in ext2 effectively only provides knowledge
> > of:
> >
> > - what inode's which block number to look for, for a given file's logical
> > block number
> > - in which chunk to allocate next inode / block
> > - number of inodes to scan when a directory is being read
> >
> > To maintain the ext2's inode number uniqueness property, 8 msb bits of
> > inode number are used to indicate the chunk number in which it resides.
> >
> > As said, this is a preliminary implementation and lots of changes are
> > expected before this code is even sanely usable. Some known issues and
> > obvious optimizations are listed in the TODO file in the chunkfs patch.
> >
> > http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch
> > - one big patch
> > - applies to 2.6.18
>
>
> Could you send this out as a patch to ext2 codebase, so we can just look
> at the changes for chunkfs ? That might also make it small enough
> to inline your patch in email for review.
Sorry, I missed the part about ext2-chunkfs-diff below.
Regards
suparna
>
> What kind of results are you planning to gather to evaluate/optimize this ?
>
> Regards
> Suparna
>
> >
> > Attached - ext2-chunkfs-diff.patch.gz
> > - since the code is a spin-off of ext2, this patch explains better what
> > has changed from the ext2.
> >
> > git://cislinux.cis.ksu.edu/chunkfs-tools
> > - mkfs, and fsck for chunkfs.
> >
> > http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml
> > - config file used; tested mostly on UML with loopback file systems.
> >
> > NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP
> > should be "no" for clean compile.
> >
> >
> > Please comment, suggest, criticize. Patches most welcome.
> >
> >
> > Best,
> > AG
> > --
> > May the source be with you.
> > http://www.cis.ksu.edu/~gud
>
>
>
> --
> Suparna Bhattacharya (suparna@in.ibm.com)
> Linux Technology Center
> IBM Software Lab, India
>
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-23 11:21 [RFC][PATCH] ChunkFS: fs fission for faster fsck Amit Gud
[not found] ` <17965.6084 1.900376.524639@gargle.gargle.HOWL>
2007-04-23 16:28 ` Suparna Bhattacharya
@ 2007-04-24 11:44 ` Nikita Danilov
2007-04-24 18:27 ` David Lang
2 siblings, 1 reply; 34+ messages in thread
From: Nikita Danilov @ 2007-04-24 11:44 UTC (permalink / raw)
To: Amit Gud
Cc: linux-fsdevel, linux-kernel, val_henson, riel, zab, arjan,
suparna, brandon, karunasagark, gud
Amit Gud writes:
Hello,
>
> This is an initial implementation of ChunkFS technique, briefly discussed
> at: http://lwn.net/Articles/190222 and
> http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
I have a couple of questions about chunkfs repair process.
First, as I understand it, each continuation inode is a sparse file,
mapping some subset of logical file blocks into block numbers. Then it
seems, that during "final phase" fsck has to check that these partial
mappings are consistent, for example, that no two different continuation
inodes for a given file contain a block number for the same offset. This
check requires scan of all chunks (rather than of only "active during
crash"), which seems to return us back to the scalability problem
chunkfs tries to address.
Second, it is not clear how, under assumption of bugs in the file system
code (which paper makes at the very beginning), fsck can limit itself
only to the chunks that were active at the moment of crash.
[...]
>
> Best,
> AG
Nikita.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-24 11:44 ` Nikita Danilov
@ 2007-04-24 18:27 ` David Lang
2007-04-24 19:34 ` Nikita Danilov
0 siblings, 1 reply; 34+ messages in thread
From: David Lang @ 2007-04-24 18:27 UTC (permalink / raw)
To: Nikita Danilov
Cc: Amit Gud, linux-fsdevel, linux-kernel, val_henson, riel, zab,
arjan, suparna, brandon, karunasagark, gud
On Tue, 24 Apr 2007, Nikita Danilov wrote:
> Amit Gud writes:
>
> Hello,
>
> >
> > This is an initial implementation of ChunkFS technique, briefly discussed
> > at: http://lwn.net/Articles/190222 and
> > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
>
> I have a couple of questions about chunkfs repair process.
>
> First, as I understand it, each continuation inode is a sparse file,
> mapping some subset of logical file blocks into block numbers. Then it
> seems, that during "final phase" fsck has to check that these partial
> mappings are consistent, for example, that no two different continuation
> inodes for a given file contain a block number for the same offset. This
> check requires scan of all chunks (rather than of only "active during
> crash"), which seems to return us back to the scalability problem
> chunkfs tries to address.
not quite.
this checking is a O(n^2) or worse problem, and it can eat a lot of memory in
the process. with chunkfs you divide the problem by a large constant (100 or
more) for the checks of individual chunks. after those are done then the final
pass checking the cross-chunk links doesn't have to keep track of everything, it
only needs to check those links and what they point to
any ability to mark a filesystem as 'clean' and then not have to check it on
reboot is a bonus on top of this.
David Lang
> Second, it is not clear how, under assumption of bugs in the file system
> code (which paper makes at the very beginning), fsck can limit itself
> only to the chunks that were active at the moment of crash.
>
> [...]
>
> >
> > Best,
> > AG
>
> Nikita.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-24 19:34 ` Nikita Danilov
@ 2007-04-24 19:26 ` David Lang
2007-04-25 11:34 ` Nikita Danilov
2007-04-24 21:53 ` Amit Gud
2007-04-25 22:43 ` Valerie Henson
2 siblings, 1 reply; 34+ messages in thread
From: David Lang @ 2007-04-24 19:26 UTC (permalink / raw)
To: Nikita Danilov
Cc: Amit Gud, linux-fsdevel, linux-kernel, val_henson, riel, zab,
arjan, suparna, brandon, karunasagark, gud
On Tue, 24 Apr 2007, Nikita Danilov wrote:
> David Lang writes:
> > On Tue, 24 Apr 2007, Nikita Danilov wrote:
> >
> > > Amit Gud writes:
> > >
> > > Hello,
> > >
> > > >
> > > > This is an initial implementation of ChunkFS technique, briefly discussed
> > > > at: http://lwn.net/Articles/190222 and
> > > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> > >
> > > I have a couple of questions about chunkfs repair process.
> > >
> > > First, as I understand it, each continuation inode is a sparse file,
> > > mapping some subset of logical file blocks into block numbers. Then it
> > > seems, that during "final phase" fsck has to check that these partial
> > > mappings are consistent, for example, that no two different continuation
> > > inodes for a given file contain a block number for the same offset. This
> > > check requires scan of all chunks (rather than of only "active during
> > > crash"), which seems to return us back to the scalability problem
> > > chunkfs tries to address.
> >
> > not quite.
> >
> > this checking is a O(n^2) or worse problem, and it can eat a lot of memory in
> > the process. with chunkfs you divide the problem by a large constant (100 or
> > more) for the checks of individual chunks. after those are done then the final
> > pass checking the cross-chunk links doesn't have to keep track of everything, it
> > only needs to check those links and what they point to
>
> Maybe I failed to describe the problem presicely.
>
> Suppose that all chunks have been checked. After that, for every inode
> I0 having continuations I1, I2, ... In, one has to check that every
> logical block is presented in at most one of these inodes. For this one
> has to read I0, with all its indirect (double-indirect, triple-indirect)
> blocks, then read I1 with all its indirect blocks, etc. And to repeat
> this for every inode with continuations.
>
> In the worst case (every inode has a continuation in every chunk) this
> obviously is as bad as un-chunked fsck. But even in the average case,
> total amount of io necessary for this operation is proportional to the
> _total_ file system size, rather than to the chunk size.
actually, it should be proportional to the number of continuation nodes. The
expectation (and design) is that they are rare.
If you get into the worst-case situation of all of them being continuation
nodes, then you are actually worse off then you were to start with (as you are
saying), but numbers from people's real filesystems (assuming a chunk size equal
to a block cluster size) indicates that we are more on the order of a fraction
of a percent of the nodes. and the expectation is that since the chunk sizes
will be substantially larger then the block cluster sizes this should get
reduced even more.
David Lang
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-24 18:27 ` David Lang
@ 2007-04-24 19:34 ` Nikita Danilov
2007-04-24 19:26 ` David Lang
` (2 more replies)
0 siblings, 3 replies; 34+ messages in thread
From: Nikita Danilov @ 2007-04-24 19:34 UTC (permalink / raw)
To: David Lang
Cc: Amit Gud, linux-fsdevel, linux-kernel, val_henson, riel, zab,
arjan, suparna, brandon, karunasagark, gud
David Lang writes:
> On Tue, 24 Apr 2007, Nikita Danilov wrote:
>
> > Amit Gud writes:
> >
> > Hello,
> >
> > >
> > > This is an initial implementation of ChunkFS technique, briefly discussed
> > > at: http://lwn.net/Articles/190222 and
> > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> >
> > I have a couple of questions about chunkfs repair process.
> >
> > First, as I understand it, each continuation inode is a sparse file,
> > mapping some subset of logical file blocks into block numbers. Then it
> > seems, that during "final phase" fsck has to check that these partial
> > mappings are consistent, for example, that no two different continuation
> > inodes for a given file contain a block number for the same offset. This
> > check requires scan of all chunks (rather than of only "active during
> > crash"), which seems to return us back to the scalability problem
> > chunkfs tries to address.
>
> not quite.
>
> this checking is a O(n^2) or worse problem, and it can eat a lot of memory in
> the process. with chunkfs you divide the problem by a large constant (100 or
> more) for the checks of individual chunks. after those are done then the final
> pass checking the cross-chunk links doesn't have to keep track of everything, it
> only needs to check those links and what they point to
Maybe I failed to describe the problem presicely.
Suppose that all chunks have been checked. After that, for every inode
I0 having continuations I1, I2, ... In, one has to check that every
logical block is presented in at most one of these inodes. For this one
has to read I0, with all its indirect (double-indirect, triple-indirect)
blocks, then read I1 with all its indirect blocks, etc. And to repeat
this for every inode with continuations.
In the worst case (every inode has a continuation in every chunk) this
obviously is as bad as un-chunked fsck. But even in the average case,
total amount of io necessary for this operation is proportional to the
_total_ file system size, rather than to the chunk size.
>
> any ability to mark a filesystem as 'clean' and then not have to check it on
> reboot is a bonus on top of this.
>
> David Lang
Nikita.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-24 19:34 ` Nikita Danilov
2007-04-24 19:26 ` David Lang
@ 2007-04-24 21:53 ` Amit Gud
2007-04-25 10:54 ` David Chinner
2007-04-25 22:43 ` Valerie Henson
2 siblings, 1 reply; 34+ messages in thread
From: Amit Gud @ 2007-04-24 21:53 UTC (permalink / raw)
To: Nikita Danilov
Cc: David Lang, linux-fsdevel, linux-kernel, val_henson, riel, zab,
arjan, suparna, brandon, karunasagark
Nikita Danilov wrote:
> Maybe I failed to describe the problem presicely.
>
> Suppose that all chunks have been checked. After that, for every inode
> I0 having continuations I1, I2, ... In, one has to check that every
> logical block is presented in at most one of these inodes. For this one
> has to read I0, with all its indirect (double-indirect, triple-indirect)
> blocks, then read I1 with all its indirect blocks, etc. And to repeat
> this for every inode with continuations.
>
> In the worst case (every inode has a continuation in every chunk) this
> obviously is as bad as un-chunked fsck. But even in the average case,
> total amount of io necessary for this operation is proportional to the
> _total_ file system size, rather than to the chunk size.
>
Perhaps, I should talk about how continuation inodes are managed /
located on disk. (This is how it is in my current implementation)
Right now, there is no distinction between an inode and continuation
inode (also referred to as 'cnode' below), except for the
EXT2_IS_CONT_FL flag. Every inode holds a list of static number of
inodes, currently limited to 4.
The structure looks like this:
---------- ----------
| cnode 0 |---------->| cnode 0 |----------> to another cnode or NULL
---------- ----------
| cnode 1 |----- | cnode 1 |-----
---------- | ---------- |
| cnode 2 |-- | | cnode 2 |-- |
---------- | | ---------- | |
| cnode 3 | | | | cnode 3 | | |
---------- | | ---------- | |
| | | | | |
inodes inodes or NULL
I.e. only first cnode in the list carries forward the chain if all the
slots are occupied.
Every cnode# field contains
{
ino_t cnode;
__u32 start; /* starting logical block number */
__u32 end; /* ending logical block number */
}
(current implementation has just one field: cnode)
I thought of this structure to avoid recursion and / or use of any data
structure while traversing the continuation inodes.
Additional flag, EXT2_SPARSE_CONT_FL would indicate whether the inode
has any sparse portions. 'start' and 'end' fields are used to speed-up
finding a cnode given a logical block number without the need of
actually reading the inode - this can be done away with, perhaps more
conveniently by, pinning the cnodes in memory as and when read.
Now going back to the Nikita's question, all the cnodes for an inode
need to be scanned iff 'start' field or number of blocks or flag
EXT2_SPARSE_CONT_FL in any of its cnodes is altered.
And yes, the whole attempt is to reduce the number of continuation inodes.
Comments, suggestions welcome.
AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-24 21:53 ` Amit Gud
@ 2007-04-25 10:54 ` David Chinner
2007-04-25 11:38 ` Andreas Dilger
2007-04-25 23:03 ` Valerie Henson
0 siblings, 2 replies; 34+ messages in thread
From: David Chinner @ 2007-04-25 10:54 UTC (permalink / raw)
To: Amit Gud
Cc: Nikita Danilov, David Lang, linux-fsdevel, linux-kernel,
val_henson, riel, zab, arjan, suparna, brandon, karunasagark
On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> Nikita Danilov wrote:
> >Maybe I failed to describe the problem presicely.
> >
> >Suppose that all chunks have been checked. After that, for every inode
> >I0 having continuations I1, I2, ... In, one has to check that every
> >logical block is presented in at most one of these inodes. For this one
> >has to read I0, with all its indirect (double-indirect, triple-indirect)
> >blocks, then read I1 with all its indirect blocks, etc. And to repeat
> >this for every inode with continuations.
> >
> >In the worst case (every inode has a continuation in every chunk) this
> >obviously is as bad as un-chunked fsck. But even in the average case,
> >total amount of io necessary for this operation is proportional to the
> >_total_ file system size, rather than to the chunk size.
> >
>
> Perhaps, I should talk about how continuation inodes are managed /
> located on disk. (This is how it is in my current implementation)
>
> Right now, there is no distinction between an inode and continuation
> inode (also referred to as 'cnode' below), except for the
> EXT2_IS_CONT_FL flag. Every inode holds a list of static number of
> inodes, currently limited to 4.
>
> The structure looks like this:
>
> ---------- ----------
> | cnode 0 |---------->| cnode 0 |----------> to another cnode or NULL
> ---------- ----------
> | cnode 1 |----- | cnode 1 |-----
> ---------- | ---------- |
> | cnode 2 |-- | | cnode 2 |-- |
> ---------- | | ---------- | |
> | cnode 3 | | | | cnode 3 | | |
> ---------- | | ---------- | |
> | | | | | |
>
> inodes inodes or NULL
How do you recover if fsfuzzer takes out a cnode in the chain? The
chunk is marked clean, but clearly corrupted and needs fixing and
you don't know what it was pointing at. Hence you have a pointer to
a trashed cnode *somewhere* that you need to find and fix, and a
bunch of orphaned cnodes that nobody points to *somewhere else* in
the filesystem that you have to find. That's a full scan fsck case,
isn't?
It seems that any sort of damage to the underlying storage (e.g.
media error, I/O error or user brain explosion) results in the need
to do a full fsck and hence chunkfs gives you no benefit in this
case.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-24 19:26 ` David Lang
@ 2007-04-25 11:34 ` Nikita Danilov
2007-04-25 16:39 ` David Lang
2007-04-25 22:47 ` Valerie Henson
0 siblings, 2 replies; 34+ messages in thread
From: Nikita Danilov @ 2007-04-25 11:34 UTC (permalink / raw)
To: David Lang
Cc: Amit Gud, linux-fsdevel, linux-kernel, val_henson, riel, zab,
arjan, suparna, brandon, karunasagark, gud
David Lang writes:
> On Tue, 24 Apr 2007, Nikita Danilov wrote:
>
> > David Lang writes:
> > > On Tue, 24 Apr 2007, Nikita Danilov wrote:
> > >
> > > > Amit Gud writes:
> > > >
> > > > Hello,
> > > >
> > > > >
> > > > > This is an initial implementation of ChunkFS technique, briefly discussed
> > > > > at: http://lwn.net/Articles/190222 and
> > > > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> > > >
> > > > I have a couple of questions about chunkfs repair process.
> > > >
> > > > First, as I understand it, each continuation inode is a sparse file,
> > > > mapping some subset of logical file blocks into block numbers. Then it
> > > > seems, that during "final phase" fsck has to check that these partial
> > > > mappings are consistent, for example, that no two different continuation
> > > > inodes for a given file contain a block number for the same offset. This
> > > > check requires scan of all chunks (rather than of only "active during
> > > > crash"), which seems to return us back to the scalability problem
> > > > chunkfs tries to address.
> > >
> > > not quite.
> > >
> > > this checking is a O(n^2) or worse problem, and it can eat a lot of memory in
> > > the process. with chunkfs you divide the problem by a large constant (100 or
> > > more) for the checks of individual chunks. after those are done then the final
> > > pass checking the cross-chunk links doesn't have to keep track of everything, it
> > > only needs to check those links and what they point to
> >
> > Maybe I failed to describe the problem presicely.
> >
> > Suppose that all chunks have been checked. After that, for every inode
> > I0 having continuations I1, I2, ... In, one has to check that every
> > logical block is presented in at most one of these inodes. For this one
> > has to read I0, with all its indirect (double-indirect, triple-indirect)
> > blocks, then read I1 with all its indirect blocks, etc. And to repeat
> > this for every inode with continuations.
> >
> > In the worst case (every inode has a continuation in every chunk) this
> > obviously is as bad as un-chunked fsck. But even in the average case,
> > total amount of io necessary for this operation is proportional to the
> > _total_ file system size, rather than to the chunk size.
>
> actually, it should be proportional to the number of continuation nodes. The
> expectation (and design) is that they are rare.
Indeed, but total size of meta-data pertaining to all continuation
inodes is still proportional to the total file system size, and so is
fsck time: O(total_file_system_size).
What is more important, design puts (as far as I can see) no upper limit
on the number of continuation inodes, and hence, even if _average_ fsck
time is greatly reduced, occasionally it can take more time than ext2 of
the same size. This is clearly unacceptable in many situations (HA,
etc.).
Nikita.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-25 10:54 ` David Chinner
@ 2007-04-25 11:38 ` Andreas Dilger
2007-04-25 17:52 ` Amit Gud
2007-04-25 23:06 ` Valerie Henson
2007-04-25 23:03 ` Valerie Henson
1 sibling, 2 replies; 34+ messages in thread
From: Andreas Dilger @ 2007-04-25 11:38 UTC (permalink / raw)
To: David Chinner
Cc: Amit Gud, Nikita Danilov, David Lang, linux-fsdevel,
linux-kernel, val_henson, riel, zab, arjan, suparna, brandon,
karunasagark
On Apr 25, 2007 20:54 +1000, David Chinner wrote:
> On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> > Right now, there is no distinction between an inode and continuation
> > inode (also referred to as 'cnode' below), except for the
> > EXT2_IS_CONT_FL flag. Every inode holds a list of static number of
> > inodes, currently limited to 4.
> >
> > The structure looks like this:
> >
> > ---------- ----------
> > | cnode 0 |---------->| cnode 0 |----------> to another cnode or NULL
> > ---------- ----------
> > | cnode 1 |----- | cnode 1 |-----
> > ---------- | ---------- |
> > | cnode 2 |-- | | cnode 2 |-- |
> > ---------- | | ---------- | |
> > | cnode 3 | | | | cnode 3 | | |
> > ---------- | | ---------- | |
> > | | | | | |
> >
> > inodes inodes or NULL
>
> How do you recover if fsfuzzer takes out a cnode in the chain? The
> chunk is marked clean, but clearly corrupted and needs fixing and
> you don't know what it was pointing at. Hence you have a pointer to
> a trashed cnode *somewhere* that you need to find and fix, and a
> bunch of orphaned cnodes that nobody points to *somewhere else* in
> the filesystem that you have to find. That's a full scan fsck case,
> isn't?
Presumably, the cnodes in the other chunks contain forward and back
references. Those need to contain at minimum inode + generation + chunk
to avoid problem of pointing to a _different_ inode after such corruption
caused the old inode to be deleted and a new one allocated in its place.
If the cnode in each chunk is more than just a singly-linked list, the
file as a whole could survive multiple chunk corruptions, though there
would now be holes in the file.
> It seems that any sort of damage to the underlying storage (e.g.
> media error, I/O error or user brain explosion) results in the need
> to do a full fsck and hence chunkfs gives you no benefit in this
> case.
There are several cases where such corruption could be found:
- file access from the "parent" cnode will be missing corrupted cnode,
probably causing a fsck of both the source and target chunks
- a fsck of the source chunk would find the dangling cnode reference
and cause a fsck of the corrupt chunk
- a fsck of the later cnode chunks would find the dangling cnode reference
and cause a fsck of the corrupt chunk
- a fsck of the corrupt chunk would find the original corruption
The case where only a fsck of the corrupt chunk is done would not find the
cnode references. Maybe there needs to be per-chunk info which contains
a list/bitmap of other chunks that have cnodes shared with each chunk?
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-25 11:34 ` Nikita Danilov
@ 2007-04-25 16:39 ` David Lang
2007-04-25 22:47 ` Valerie Henson
1 sibling, 0 replies; 34+ messages in thread
From: David Lang @ 2007-04-25 16:39 UTC (permalink / raw)
To: Nikita Danilov
Cc: Amit Gud, linux-fsdevel, linux-kernel, val_henson, riel, zab,
arjan, suparna, brandon, karunasagark, gud
On Wed, 25 Apr 2007, Nikita Danilov wrote:
> David Lang writes:
> > On Tue, 24 Apr 2007, Nikita Danilov wrote:
> >
> > > David Lang writes:
> > > > On Tue, 24 Apr 2007, Nikita Danilov wrote:
> > > >
> > > > > Amit Gud writes:
> > > > >
> > > > > Hello,
> > > > >
> > > > > >
> > > > > > This is an initial implementation of ChunkFS technique, briefly discussed
> > > > > > at: http://lwn.net/Articles/190222 and
> > > > > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> > > > >
> > > > > I have a couple of questions about chunkfs repair process.
> > > > >
> > > > > First, as I understand it, each continuation inode is a sparse file,
> > > > > mapping some subset of logical file blocks into block numbers. Then it
> > > > > seems, that during "final phase" fsck has to check that these partial
> > > > > mappings are consistent, for example, that no two different continuation
> > > > > inodes for a given file contain a block number for the same offset. This
> > > > > check requires scan of all chunks (rather than of only "active during
> > > > > crash"), which seems to return us back to the scalability problem
> > > > > chunkfs tries to address.
> > > >
> > > > not quite.
> > > >
> > > > this checking is a O(n^2) or worse problem, and it can eat a lot of memory in
> > > > the process. with chunkfs you divide the problem by a large constant (100 or
> > > > more) for the checks of individual chunks. after those are done then the final
> > > > pass checking the cross-chunk links doesn't have to keep track of everything, it
> > > > only needs to check those links and what they point to
> > >
> > > Maybe I failed to describe the problem presicely.
> > >
> > > Suppose that all chunks have been checked. After that, for every inode
> > > I0 having continuations I1, I2, ... In, one has to check that every
> > > logical block is presented in at most one of these inodes. For this one
> > > has to read I0, with all its indirect (double-indirect, triple-indirect)
> > > blocks, then read I1 with all its indirect blocks, etc. And to repeat
> > > this for every inode with continuations.
> > >
> > > In the worst case (every inode has a continuation in every chunk) this
> > > obviously is as bad as un-chunked fsck. But even in the average case,
> > > total amount of io necessary for this operation is proportional to the
> > > _total_ file system size, rather than to the chunk size.
> >
> > actually, it should be proportional to the number of continuation nodes. The
> > expectation (and design) is that they are rare.
>
> Indeed, but total size of meta-data pertaining to all continuation
> inodes is still proportional to the total file system size, and so is
> fsck time: O(total_file_system_size).
correct, but remember that in the real world O(total_file_system_size) does not
mean that it can't work well. it just means that larger filesystems will take
longer to check.
they aren't out to eliminate the need for fsck, just to be able to divide the
time it currently takes by a large value so that as the filesystems continue to
get larger it is still reasonable to check them
> What is more important, design puts (as far as I can see) no upper limit
> on the number of continuation inodes, and hence, even if _average_ fsck
> time is greatly reduced, occasionally it can take more time than ext2 of
> the same size. This is clearly unacceptable in many situations (HA,
> etc.).
in a pathalogical situation you are correct, it would take longer. however
before declaring that this is completely unacceptable why don't you wait and see
if the pathalogical situation is at all likely?
when you are doing ha with shared storage you tend to be doing things like
databases, every database that I know about splits it's data files into many
pieces of a fixed size. Postgres for example does 1M files. if you do a chunk
size of 1G it's very unlikly that more then a couple files out of every thousand
will end up with continuation nodes.
remember that the current thinking on chunk size is to make the chunks be ~1% of
your filesystem, so on a 1TB filesystem your chunk size would be 10G (which, in
the example above would mean just a couple files out of every ten thousand would
have continuation nodes).
with the current filesystems it's _possible_ for a file to be spread out across
the disk such that it's first block is at the beginning of the disk, the second
at the end of the disk, the third back at the beginning, the fourth at the end,
etc. but users don't worry about this when useing the filesystems becouse the
odds of this happening under normal use are vanishingly small (and the
filesystem designers work to make the odds this small). similarly the chunkfs
designers are working to make the odds of every file having a continuation nodes
vanishingly small as well.
David Lang
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-25 11:38 ` Andreas Dilger
@ 2007-04-25 17:52 ` Amit Gud
2007-04-25 23:06 ` Valerie Henson
1 sibling, 0 replies; 34+ messages in thread
From: Amit Gud @ 2007-04-25 17:52 UTC (permalink / raw)
To: David Chinner, Nikita Danilov, David Lang, linux-fsdevel,
linux-kernel, val_henson, riel, zab, arjan, suparna, brandon,
karunasagark
Andreas Dilger wrote:
>> How do you recover if fsfuzzer takes out a cnode in the chain? The
>> chunk is marked clean, but clearly corrupted and needs fixing and
>> you don't know what it was pointing at. Hence you have a pointer to
>> a trashed cnode *somewhere* that you need to find and fix, and a
>> bunch of orphaned cnodes that nobody points to *somewhere else* in
>> the filesystem that you have to find. That's a full scan fsck case,
>> isn't?
>
> Presumably, the cnodes in the other chunks contain forward and back
> references. Those need to contain at minimum inode + generation + chunk
> to avoid problem of pointing to a _different_ inode after such corruption
> caused the old inode to be deleted and a new one allocated in its place.
>
> If the cnode in each chunk is more than just a singly-linked list, the
> file as a whole could survive multiple chunk corruptions, though there
> would now be holes in the file.
>
>> It seems that any sort of damage to the underlying storage (e.g.
>> media error, I/O error or user brain explosion) results in the need
>> to do a full fsck and hence chunkfs gives you no benefit in this
>> case.
>
Yes, what originated from discussions on #linuxfs is that redundancy is
required for cnodes, in order to avoid checking the entire file system
in search of a dangling cnode reference or for "parent" of a cnode.
If corruption, due to any reason, occurs in any other part of the file
system, it would be localized for that chunk. Even if entire fsck is
needed, chances of which are rare, full fsck of chunked file system is
no worse than fsck of non-chunked file system. Passes 3, 4, and 5 of
fsck take only 10-15% of total fsck run time and almost no-I/O Pass 6
for chunkfs wouldn't add whole lot.
AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-24 19:34 ` Nikita Danilov
2007-04-24 19:26 ` David Lang
2007-04-24 21:53 ` Amit Gud
@ 2007-04-25 22:43 ` Valerie Henson
2 siblings, 0 replies; 34+ messages in thread
From: Valerie Henson @ 2007-04-25 22:43 UTC (permalink / raw)
To: Nikita Danilov
Cc: David Lang, Amit Gud, linux-fsdevel, linux-kernel, riel, zab,
arjan, suparna, brandon, karunasagark, gud
On Tue, Apr 24, 2007 at 11:34:48PM +0400, Nikita Danilov wrote:
>
> Maybe I failed to describe the problem presicely.
>
> Suppose that all chunks have been checked. After that, for every inode
> I0 having continuations I1, I2, ... In, one has to check that every
> logical block is presented in at most one of these inodes. For this one
> has to read I0, with all its indirect (double-indirect, triple-indirect)
> blocks, then read I1 with all its indirect blocks, etc. And to repeat
> this for every inode with continuations.
>
> In the worst case (every inode has a continuation in every chunk) this
> obviously is as bad as un-chunked fsck. But even in the average case,
> total amount of io necessary for this operation is proportional to the
> _total_ file system size, rather than to the chunk size.
Fsck in chunkfs is still going to have an element that is proportional
to the file system size for certain cases. However, that element will
be a great deal smaller than in a regular file system, except in the
most pathological cases. If those pathological cases happen often,
then it's back to the drawing board. My hunch is that they won't be
common.
-VAL
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-25 11:34 ` Nikita Danilov
2007-04-25 16:39 ` David Lang
@ 2007-04-25 22:47 ` Valerie Henson
2007-04-26 14:14 ` Jeff Dike
1 sibling, 1 reply; 34+ messages in thread
From: Valerie Henson @ 2007-04-25 22:47 UTC (permalink / raw)
To: Nikita Danilov
Cc: David Lang, Amit Gud, linux-fsdevel, linux-kernel, riel, zab,
arjan, suparna, brandon, karunasagark, gud
On Wed, Apr 25, 2007 at 03:34:03PM +0400, Nikita Danilov wrote:
>
> What is more important, design puts (as far as I can see) no upper limit
> on the number of continuation inodes, and hence, even if _average_ fsck
> time is greatly reduced, occasionally it can take more time than ext2 of
> the same size. This is clearly unacceptable in many situations (HA,
> etc.).
Actually, there is an upper limit on the number of continuation
inodes. Each file can have a maximum of one continuation inode per
chunk. (This is why we need to support sparse files.)
-VAL
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-25 10:54 ` David Chinner
2007-04-25 11:38 ` Andreas Dilger
@ 2007-04-25 23:03 ` Valerie Henson
2007-04-26 0:47 ` David Chinner
2007-04-26 8:47 ` Jan Kara
1 sibling, 2 replies; 34+ messages in thread
From: Valerie Henson @ 2007-04-25 23:03 UTC (permalink / raw)
To: David Chinner
Cc: Amit Gud, Nikita Danilov, David Lang, linux-fsdevel,
linux-kernel, riel, zab, arjan, suparna, brandon, karunasagark
On Wed, Apr 25, 2007 at 08:54:34PM +1000, David Chinner wrote:
> On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> >
> > The structure looks like this:
> >
> > ---------- ----------
> > | cnode 0 |---------->| cnode 0 |----------> to another cnode or NULL
> > ---------- ----------
> > | cnode 1 |----- | cnode 1 |-----
> > ---------- | ---------- |
> > | cnode 2 |-- | | cnode 2 |-- |
> > ---------- | | ---------- | |
> > | cnode 3 | | | | cnode 3 | | |
> > ---------- | | ---------- | |
> > | | | | | |
> >
> > inodes inodes or NULL
>
> How do you recover if fsfuzzer takes out a cnode in the chain? The
> chunk is marked clean, but clearly corrupted and needs fixing and
> you don't know what it was pointing at. Hence you have a pointer to
> a trashed cnode *somewhere* that you need to find and fix, and a
> bunch of orphaned cnodes that nobody points to *somewhere else* in
> the filesystem that you have to find. That's a full scan fsck case,
> isn't?
Excellent question. This is one of the trickier aspects of chunkfs -
the orphan inode problem (tricky, but solvable). The problem is what
if you smash/lose/corrupt an inode in one chunk that has a
continuation inode in another chunk? A back pointer does you no good
if the back pointer is corrupted.
What you do is keep tabs on whether you see damage that looks like
this has occurred - e.g., inode use/free counts wrong, you had to zero
a corrupted inode - and when this happens, you do a scan of all
continuation inodes in chunks that have links to the corrupted chunk.
What you need to make this go fast is (1) a pre-made list of which
chunks have links with which other chunks, (2) a fast way to read all
of the continuation inodes in a chunk (ignoring chunk-local inodes).
This stage is O(fs size) approximately, but it should be quite swift.
> It seems that any sort of damage to the underlying storage (e.g.
> media error, I/O error or user brain explosion) results in the need
> to do a full fsck and hence chunkfs gives you no benefit in this
> case.
I worry about this but so far haven't found something which couldn't
be cut down significantly with just a little extra work. It might be
helpful to look at an extreme case.
Let's say we're incredibly paranoid. We could be justified in running
a full fsck on the entire file system in between every single I/O.
After all, something *might* have been silently corrupted. But this
would be ridiculously slow. We could instead never check the file
system. But then we would end up panicking and corrupting the file
system a lot. So what's a good compromise?
In the chunkfs case, here's my rules of thumb so far:
1. Detection: All metadata has magic numbers and checksums.
2. Scrubbing: Random check of chunks when possible.
3. Repair: When we detect corruption, either by checksum error, file
system code assertion failure, or hardware tells us we have a bug,
check the chunk containing the error and any outside-chunk
information that could be affected by it.
-VAL
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-25 11:38 ` Andreas Dilger
2007-04-25 17:52 ` Amit Gud
@ 2007-04-25 23:06 ` Valerie Henson
1 sibling, 0 replies; 34+ messages in thread
From: Valerie Henson @ 2007-04-25 23:06 UTC (permalink / raw)
To: David Chinner, Amit Gud, Nikita Danilov, David Lang,
linux-fsdevel, linux-kernel, riel, zab, arjan, suparna, brandon,
karunasagark
On Wed, Apr 25, 2007 at 05:38:34AM -0600, Andreas Dilger wrote:
>
> The case where only a fsck of the corrupt chunk is done would not find the
> cnode references. Maybe there needs to be per-chunk info which contains
> a list/bitmap of other chunks that have cnodes shared with each chunk?
Yes, exactly. One might almost think you had solved this problem
before. :):):)
-VAL
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-25 23:03 ` Valerie Henson
@ 2007-04-26 0:47 ` David Chinner
2007-04-26 22:21 ` Jörn Engel
2007-04-26 8:47 ` Jan Kara
1 sibling, 1 reply; 34+ messages in thread
From: David Chinner @ 2007-04-26 0:47 UTC (permalink / raw)
To: Valerie Henson
Cc: David Chinner, Amit Gud, Nikita Danilov, David Lang,
linux-fsdevel, linux-kernel, riel, zab, arjan, suparna, brandon,
karunasagark
On Wed, Apr 25, 2007 at 04:03:44PM -0700, Valerie Henson wrote:
> On Wed, Apr 25, 2007 at 08:54:34PM +1000, David Chinner wrote:
> > On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> > >
> > > The structure looks like this:
> > >
> > > ---------- ----------
> > > | cnode 0 |---------->| cnode 0 |----------> to another cnode or NULL
> > > ---------- ----------
> > > | cnode 1 |----- | cnode 1 |-----
> > > ---------- | ---------- |
> > > | cnode 2 |-- | | cnode 2 |-- |
> > > ---------- | | ---------- | |
> > > | cnode 3 | | | | cnode 3 | | |
> > > ---------- | | ---------- | |
> > > | | | | | |
> > >
> > > inodes inodes or NULL
> >
> > How do you recover if fsfuzzer takes out a cnode in the chain? The
> > chunk is marked clean, but clearly corrupted and needs fixing and
> > you don't know what it was pointing at. Hence you have a pointer to
> > a trashed cnode *somewhere* that you need to find and fix, and a
> > bunch of orphaned cnodes that nobody points to *somewhere else* in
> > the filesystem that you have to find. That's a full scan fsck case,
> > isn't?
>
> Excellent question. This is one of the trickier aspects of chunkfs -
> the orphan inode problem (tricky, but solvable). The problem is what
> if you smash/lose/corrupt an inode in one chunk that has a
> continuation inode in another chunk? A back pointer does you no good
> if the back pointer is corrupted.
*nod*
> What you do is keep tabs on whether you see damage that looks like
> this has occurred - e.g., inode use/free counts wrong, you had to zero
> a corrupted inode - and when this happens, you do a scan of all
> continuation inodes in chunks that have links to the corrupted chunk.
This assumes that you know a chunk has been corrupted, though.
How do you find that out?
> What you need to make this go fast is (1) a pre-made list of which
> chunks have links with which other chunks,
So you add a new on-disk structure that needs to be kept up to
date? How do you trust that structure to be correct if you are
not journalling it? What happens if fsfuzzer trashes part
of this table as well and you can't trust it?
> (2) a fast way to read all
> of the continuation inodes in a chunk (ignoring chunk-local inodes).
> This stage is O(fs size) approximately, but it should be quite swift.
Assuming you can trust this list. if not, finding cnodes is going
to be rather slow.....
> > It seems that any sort of damage to the underlying storage (e.g.
> > media error, I/O error or user brain explosion) results in the need
> > to do a full fsck and hence chunkfs gives you no benefit in this
> > case.
>
> I worry about this but so far haven't found something which couldn't
> be cut down significantly with just a little extra work. It might be
> helpful to look at an extreme case.
>
> Let's say we're incredibly paranoid. We could be justified in running
> a full fsck on the entire file system in between every single I/O.
> After all, something *might* have been silently corrupted. But this
> would be ridiculously slow. We could instead never check the file
> system. But then we would end up panicking and corrupting the file
> system a lot. So what's a good compromise?
>
> In the chunkfs case, here's my rules of thumb so far:
>
> 1. Detection: All metadata has magic numbers and checksums.
> 2. Scrubbing: Random check of chunks when possible.
> 3. Repair: When we detect corruption, either by checksum error, file
> system code assertion failure, or hardware tells us we have a bug,
> check the chunk containing the error and any outside-chunk
> information that could be affected by it.
So if you end up with a corruption in a "clean" part of the
filesystem, you may not find out about the corruption on reboot and
fsck? You need to trip over the corruption first before fsck can be
told it needs to check/repair a given chunk? Or do you need to force
a "check everything" fsck in this case?
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-25 23:03 ` Valerie Henson
2007-04-26 0:47 ` David Chinner
@ 2007-04-26 8:47 ` Jan Kara
2007-04-27 5:07 ` Valerie Henson
1 sibling, 1 reply; 34+ messages in thread
From: Jan Kara @ 2007-04-26 8:47 UTC (permalink / raw)
To: Valerie Henson
Cc: David Chinner, Amit Gud, Nikita Danilov, David Lang,
linux-fsdevel, linux-kernel, riel, zab, arjan, suparna, brandon,
karunasagark
> On Wed, Apr 25, 2007 at 08:54:34PM +1000, David Chinner wrote:
> > On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> > > ---------- ----------
> > > | cnode 0 |---------->| cnode 0 |----------> to another cnode or NULL
> > > ---------- ----------
> > > | cnode 1 |----- | cnode 1 |-----
> > > ---------- | ---------- |
> > > | cnode 2 |-- | | cnode 2 |-- |
> > > ---------- | | ---------- | |
> > > | cnode 3 | | | | cnode 3 | | |
> > > ---------- | | ---------- | |
> > > | | | | | |
> > >
> > > inodes inodes or NULL
> >
> > How do you recover if fsfuzzer takes out a cnode in the chain? The
> > chunk is marked clean, but clearly corrupted and needs fixing and
> > you don't know what it was pointing at. Hence you have a pointer to
> > a trashed cnode *somewhere* that you need to find and fix, and a
> > bunch of orphaned cnodes that nobody points to *somewhere else* in
> > the filesystem that you have to find. That's a full scan fsck case,
> > isn't?
>
> Excellent question. This is one of the trickier aspects of chunkfs -
> the orphan inode problem (tricky, but solvable). The problem is what
> if you smash/lose/corrupt an inode in one chunk that has a
> continuation inode in another chunk? A back pointer does you no good
> if the back pointer is corrupted.
>
> What you do is keep tabs on whether you see damage that looks like
> this has occurred - e.g., inode use/free counts wrong, you had to zero
> a corrupted inode - and when this happens, you do a scan of all
> continuation inodes in chunks that have links to the corrupted chunk.
> What you need to make this go fast is (1) a pre-made list of which
> chunks have links with which other chunks, (2) a fast way to read all
> of the continuation inodes in a chunk (ignoring chunk-local inodes).
> This stage is O(fs size) approximately, but it should be quite swift.
Do I get it right that you just have in each cnode a pointer to the
previous & next cnode? But then if two consecutive cnodes get corrupted,
you have no way to connect the chain, do you? If each cnode contained
some unique identifier of the file and a number identifying position of
cnode, then there would be at least some way (through expensive) to
link them together correctly...
Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-25 22:47 ` Valerie Henson
@ 2007-04-26 14:14 ` Jeff Dike
2007-04-26 15:53 ` Amit Gud
0 siblings, 1 reply; 34+ messages in thread
From: Jeff Dike @ 2007-04-26 14:14 UTC (permalink / raw)
To: Valerie Henson
Cc: Nikita Danilov, David Lang, Amit Gud, linux-fsdevel,
linux-kernel, riel, zab, arjan, suparna, brandon, karunasagark,
gud
On Wed, Apr 25, 2007 at 03:47:10PM -0700, Valerie Henson wrote:
> Actually, there is an upper limit on the number of continuation
> inodes. Each file can have a maximum of one continuation inode per
> chunk. (This is why we need to support sparse files.)
How about this case:
Growing file starts in chunk A.
Overflows into chunk B.
Delete file in chunk A.
Growing file overflows chunk B and spots new free space in
chunk A (and nothing anywhere else)
Overflows into chunk A
Delete file in chunk B.
Overflow into chunk B again.
Maybe this is not realistic, but in the absence of a mechanism to pull
data back from an overflow chunk, it seems at least a theoretical
possibility that there could be > 1 continuation inodes per file per
chunk.
Jeff
--
Work email - jdike at linux dot intel dot com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-26 14:14 ` Jeff Dike
@ 2007-04-26 15:53 ` Amit Gud
2007-04-26 16:05 ` Jeff Dike
2007-04-26 16:11 ` Alan Cox
0 siblings, 2 replies; 34+ messages in thread
From: Amit Gud @ 2007-04-26 15:53 UTC (permalink / raw)
To: Jeff Dike
Cc: Valerie Henson, Nikita Danilov, David Lang, linux-fsdevel,
linux-kernel, riel, zab, arjan, suparna, brandon, karunasagark
Jeff Dike wrote:
> How about this case:
>
> Growing file starts in chunk A.
> Overflows into chunk B.
> Delete file in chunk A.
> Growing file overflows chunk B and spots new free space in
> chunk A (and nothing anywhere else)
> Overflows into chunk A
> Delete file in chunk B.
> Overflow into chunk B again.
>
> Maybe this is not realistic, but in the absence of a mechanism to pull
> data back from an overflow chunk, it seems at least a theoretical
> possibility that there could be > 1 continuation inodes per file per
> chunk.
>
Preventive measures are taken to limit only one continuation inode per
file per chunk. This can be done easily in the chunk allocation
algorithm for disk space. Although I'm not quite sure what you mean by
"Delete file in chunk A". If you are referring to same file thats
growing, then deletion is not possible, because individual parts of any
file in any chunk cannot be deleted.
AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-26 15:53 ` Amit Gud
@ 2007-04-26 16:05 ` Jeff Dike
2007-04-26 16:56 ` Amit Gud
2007-04-27 4:58 ` Valerie Henson
2007-04-26 16:11 ` Alan Cox
1 sibling, 2 replies; 34+ messages in thread
From: Jeff Dike @ 2007-04-26 16:05 UTC (permalink / raw)
To: Amit Gud
Cc: Valerie Henson, Nikita Danilov, David Lang, linux-fsdevel,
linux-kernel, riel, zab, arjan, suparna, brandon, karunasagark
On Thu, Apr 26, 2007 at 10:53:16AM -0500, Amit Gud wrote:
> Jeff Dike wrote:
> >How about this case:
> >
> > Growing file starts in chunk A.
> > Overflows into chunk B.
> > Delete file in chunk A.
> > Growing file overflows chunk B and spots new free space in
> >chunk A (and nothing anywhere else)
> > Overflows into chunk A
> > Delete file in chunk B.
> > Overflow into chunk B again.
> >
> >Maybe this is not realistic, but in the absence of a mechanism to pull
> >data back from an overflow chunk, it seems at least a theoretical
> >possibility that there could be > 1 continuation inodes per file per
> >chunk.
> >
>
> Preventive measures are taken to limit only one continuation inode per
> file per chunk. This can be done easily in the chunk allocation
> algorithm for disk space. Although I'm not quite sure what you mean by
> "Delete file in chunk A". If you are referring to same file thats
> growing, then deletion is not possible, because individual parts of any
> file in any chunk cannot be deleted.
No, I'm referring to a different file. The scenario is that you have
a growing file in a nearly full disk with files being deleted (and
thus space being freed) such that allocations for the growing file
bounce back and forth between chunks.
Jeff
--
Work email - jdike at linux dot intel dot com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-26 15:53 ` Amit Gud
2007-04-26 16:05 ` Jeff Dike
@ 2007-04-26 16:11 ` Alan Cox
2007-04-26 16:44 ` Amit Gud
1 sibling, 1 reply; 34+ messages in thread
From: Alan Cox @ 2007-04-26 16:11 UTC (permalink / raw)
To: Amit Gud
Cc: Jeff Dike, Valerie Henson, Nikita Danilov, David Lang,
linux-fsdevel, linux-kernel, riel, zab, arjan, suparna, brandon,
karunasagark
> Preventive measures are taken to limit only one continuation inode per
> file per chunk. This can be done easily in the chunk allocation
> algorithm for disk space. Although I'm not quite sure what you mean by
How are you handling the allocation in this situation, are you assuming
that a chunk is "out of bounds" because part of a file already lives on
it or simply keeping a single inode per chunk which has multiple sparse
pieces of the file on it ?
ie if I write 0-8MB to chunk A and then 8-16 to chunk B can I write
16-24MB to chunk A producing a single inode of 0-8 16-24, or does it have
to find another chunk to use ?
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-26 16:11 ` Alan Cox
@ 2007-04-26 16:44 ` Amit Gud
0 siblings, 0 replies; 34+ messages in thread
From: Amit Gud @ 2007-04-26 16:44 UTC (permalink / raw)
To: Alan Cox
Cc: Jeff Dike, Valerie Henson, Nikita Danilov, David Lang,
linux-fsdevel, linux-kernel, riel, zab, arjan, suparna, brandon,
karunasagark
Alan Cox wrote:
>> Preventive measures are taken to limit only one continuation inode per
>> file per chunk. This can be done easily in the chunk allocation
>> algorithm for disk space. Although I'm not quite sure what you mean by
>
> How are you handling the allocation in this situation, are you assuming
> that a chunk is "out of bounds" because part of a file already lives on
> it or simply keeping a single inode per chunk which has multiple sparse
> pieces of the file on it ?
>
> ie if I write 0-8MB to chunk A and then 8-16 to chunk B can I write
> 16-24MB to chunk A producing a single inode of 0-8 16-24, or does it have
> to find another chunk to use ?
Hello Alan,
You re-use the same inode with multiple sparse pieces.
This way you avoid hopping around continuation inodes and coming back to
same chunk with which you started but this time on a different
continuation inode. This may not be I/O intensive for successive
traversals if the continuation inodes are pinned in the memory, but it
certainly is a waste of resource - inodes. Not allowing this would make
worst case of every file having a continuation inode in every chunk,
even worse; may be like only single file exist in the file system and
rest all inodes in all chunks (including file's own chunk) are
continuation inodes.
AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-26 16:05 ` Jeff Dike
@ 2007-04-26 16:56 ` Amit Gud
2007-04-27 4:58 ` Valerie Henson
1 sibling, 0 replies; 34+ messages in thread
From: Amit Gud @ 2007-04-26 16:56 UTC (permalink / raw)
To: Jeff Dike
Cc: Valerie Henson, Nikita Danilov, David Lang, linux-fsdevel,
linux-kernel, riel, zab, arjan, suparna, brandon, karunasagark
Jeff Dike wrote:
> On Thu, Apr 26, 2007 at 10:53:16AM -0500, Amit Gud wrote:
>> Jeff Dike wrote:
>>> How about this case:
>>>
>>> Growing file starts in chunk A.
>>> Overflows into chunk B.
>>> Delete file in chunk A.
>>> Growing file overflows chunk B and spots new free space in
>>> chunk A (and nothing anywhere else)
>>> Overflows into chunk A
>>> Delete file in chunk B.
>>> Overflow into chunk B again.
>>>
>>> Maybe this is not realistic, but in the absence of a mechanism to pull
>>> data back from an overflow chunk, it seems at least a theoretical
>>> possibility that there could be > 1 continuation inodes per file per
>>> chunk.
>>>
>> Preventive measures are taken to limit only one continuation inode per
>> file per chunk. This can be done easily in the chunk allocation
>> algorithm for disk space. Although I'm not quite sure what you mean by
>> "Delete file in chunk A". If you are referring to same file thats
>> growing, then deletion is not possible, because individual parts of any
>> file in any chunk cannot be deleted.
>
> No, I'm referring to a different file. The scenario is that you have
> a growing file in a nearly full disk with files being deleted (and
> thus space being freed) such that allocations for the growing file
> bounce back and forth between chunks.
>
In such scenario either lot of continuation inodes < number of chunks
would be created or lot of sparse pieces would be created. But we can
certainly enforce the constraint of one continuation inode per file per
chunk, excluding the file's primary chunk in which it started.
AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-26 0:47 ` David Chinner
@ 2007-04-26 22:21 ` Jörn Engel
0 siblings, 0 replies; 34+ messages in thread
From: Jörn Engel @ 2007-04-26 22:21 UTC (permalink / raw)
To: David Chinner
Cc: Valerie Henson, Amit Gud, Nikita Danilov, David Lang,
linux-fsdevel, linux-kernel, riel, zab, arjan, suparna, brandon,
karunasagark
On Thu, 26 April 2007 10:47:40 +1000, David Chinner wrote:
>
> This assumes that you know a chunk has been corrupted, though.
> How do you find that out?
Option 1: you notice something odd while serving userspace.
Option 2: a checking/scrubbing daemon of some sorts.
The first will obviously miss any corruption in data that is not touched
for a long time (ever?).
> > What you need to make this go fast is (1) a pre-made list of which
> > chunks have links with which other chunks,
>
> So you add a new on-disk structure that needs to be kept up to
> date? How do you trust that structure to be correct if you are
> not journalling it?
Only chance I see is to treat this list as hints. It should contain all
chunks that possibly have links. It may also contain chunks that don't
have links. By keeping strict FFS-style ordering of all relevant
writes, any mismatch should only cost fsck time.
Managing this list appears to be less than trivial. Might actually be
easier to have LogFS-style rmap for each object in the filesystem.
> What happens if fsfuzzer trashes part
> of this table as well and you can't trust it?
If you have 5000 redundant copies of data and all get corrupted, you are
doomed. I don't expect my filesystem to recover after having written
0x00 over the whole device.
Being able to recover a single corruption happening anywhere on the
device is already a huge step forward. Of course most current
filesystems wouldn't even be able to detect all possible corruptions.
That alone would be a step forward.
One of the smart things of ZFS is to checksum everything. Among the
Linux filesystems only JFFS2 seems to do it, but it cannot distinguish
between corrupted data and incomplete writes before a crash. It
definitely costs performance, but that is the price one has to pay if
errors are to be detected.
Jörn
--
Do not stop an army on its way home.
-- Sun Tzu
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-26 16:05 ` Jeff Dike
2007-04-26 16:56 ` Amit Gud
@ 2007-04-27 4:58 ` Valerie Henson
2007-04-27 15:06 ` Jeff Dike
1 sibling, 1 reply; 34+ messages in thread
From: Valerie Henson @ 2007-04-27 4:58 UTC (permalink / raw)
To: Jeff Dike
Cc: Amit Gud, Nikita Danilov, David Lang, linux-fsdevel,
linux-kernel, riel, zab, arjan, suparna, brandon, karunasagark
On Thu, Apr 26, 2007 at 12:05:04PM -0400, Jeff Dike wrote:
>
> No, I'm referring to a different file. The scenario is that you have
> a growing file in a nearly full disk with files being deleted (and
> thus space being freed) such that allocations for the growing file
> bounce back and forth between chunks.
This is an excellent question. I call this the ping-pong problem.
The solution is as Amit describes: You have a maximum of one
continuation inode per file per chunk, and you require sparse files.
Here's an example, spelled out:
Allocate file 1 in chunk A.
Grow file 1.
Chunk A fills up.
Allocate continuation inode for file 1 in chunk B.
Chunk A gets some free space.
Chunk B fills up.
Pick chunk A for allocating next block of file 1.
Try to look up a continuation inode for file 1 in chunk A.
Continuation inode for file 1 found in chunk A!
Attach newly allocated block to existing inode for file 1 in chunk A.
This is why the file format inside each chunk needs to support sparse
files.
I have a presentation that has a series of slides on problems and
potential resolutions that might help:
http://infohost.nmt.edu/~val/review/chunkfs_presentation.pdf
-VAL
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-26 8:47 ` Jan Kara
@ 2007-04-27 5:07 ` Valerie Henson
2007-04-27 10:53 ` Jörn Engel
0 siblings, 1 reply; 34+ messages in thread
From: Valerie Henson @ 2007-04-27 5:07 UTC (permalink / raw)
To: Jan Kara
Cc: David Chinner, Amit Gud, Nikita Danilov, David Lang,
linux-fsdevel, linux-kernel, riel, zab, arjan, suparna, brandon,
karunasagark
On Thu, Apr 26, 2007 at 10:47:38AM +0200, Jan Kara wrote:
> Do I get it right that you just have in each cnode a pointer to the
> previous & next cnode? But then if two consecutive cnodes get corrupted,
> you have no way to connect the chain, do you? If each cnode contained
> some unique identifier of the file and a number identifying position of
> cnode, then there would be at least some way (through expensive) to
> link them together correctly...
You're right, it's easy to add a little more redundancy that would
make it possible to recover from two consecutive nodes being
corrupted. Keeping a parent inode id in each continuation inode is
definitely a smart thing to do.
Some minor side notes: Continuation inodes aren't really in any
defined order - if you look at Jeff's ping-pong chunk allocation
example, you'll see that the data in each continuation inode won't be
in linearly increasing order. Also, while the current implementation
is a simple doubly-linked list, this may not be the best solution
long-term. What's important is that each continuation inode have a
back pointer to the parent and that there is some structure for
quickly looking up the continuation inode for a given file offset.
Suggestions for data structures that work well in this situation are
welcome. :)
-VAL
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-27 5:07 ` Valerie Henson
@ 2007-04-27 10:53 ` Jörn Engel
2007-04-28 6:50 ` Valerie Henson
0 siblings, 1 reply; 34+ messages in thread
From: Jörn Engel @ 2007-04-27 10:53 UTC (permalink / raw)
To: Valerie Henson
Cc: Jan Kara, David Chinner, Amit Gud, Nikita Danilov, David Lang,
linux-fsdevel, linux-kernel, riel, zab, arjan, suparna, brandon,
karunasagark
On Thu, 26 April 2007 22:07:43 -0700, Valerie Henson wrote:
>
> What's important is that each continuation inode have a
> back pointer to the parent and that there is some structure for
> quickly looking up the continuation inode for a given file offset.
> Suggestions for data structures that work well in this situation are
> welcome. :)
All this would get easier if continuation inodes were known to be rare.
You can ditch the doubly-linked list in favor of a pointer to the main
inode then - traversing the list again is cheap, after all. And you can
just try to read the same block once for every continuation inode.
If those lists can get long and you need a mapping from offset to
continuation inode on the medium, you are basically fscked. Storing the
mapping requires space. You need the mapping only when space (in some
chunk) gets tight and you allocate continuation inodes. So either you
don't need the mapping or you don't have a good place to put it.
Having a mapping in memory is also questionable. Either you scan the
whole file on first access and spend a long time for large files. Or
you create the mapping on the fly. In that case the page cache will
already give you a 90% solution for free.
You should spend a lot of effort trying to minimize cnodes. ;)
Jörn
--
To recognize individual spam features you have to try to get into the
mind of the spammer, and frankly I want to spend as little time inside
the minds of spammers as possible.
-- Paul Graham
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-27 4:58 ` Valerie Henson
@ 2007-04-27 15:06 ` Jeff Dike
2007-05-01 17:26 ` Valerie Henson
0 siblings, 1 reply; 34+ messages in thread
From: Jeff Dike @ 2007-04-27 15:06 UTC (permalink / raw)
To: Valerie Henson
Cc: Amit Gud, Nikita Danilov, David Lang, linux-fsdevel,
linux-kernel, riel, zab, arjan, suparna, brandon, karunasagark
On Thu, Apr 26, 2007 at 09:58:25PM -0700, Valerie Henson wrote:
> Here's an example, spelled out:
>
> Allocate file 1 in chunk A.
> Grow file 1.
> Chunk A fills up.
> Allocate continuation inode for file 1 in chunk B.
> Chunk A gets some free space.
> Chunk B fills up.
> Pick chunk A for allocating next block of file 1.
> Try to look up a continuation inode for file 1 in chunk A.
> Continuation inode for file 1 found in chunk A!
> Attach newly allocated block to existing inode for file 1 in chunk A.
So far, so good (and the slides are helpful, tx!). What happens when
file 1 keeps growing and chunk A fills up (and chunk B is still full)?
Can the same continuation inode also point at chunk C, where the file
is going to grow to?
Jeff
--
Work email - jdike at linux dot intel dot com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-27 10:53 ` Jörn Engel
@ 2007-04-28 6:50 ` Valerie Henson
2007-04-28 10:03 ` Jörn Engel
0 siblings, 1 reply; 34+ messages in thread
From: Valerie Henson @ 2007-04-28 6:50 UTC (permalink / raw)
To: J??rn Engel
Cc: Jan Kara, David Chinner, Amit Gud, Nikita Danilov, David Lang,
linux-fsdevel, linux-kernel, riel, zab, arjan, suparna, brandon,
karunasagark
On Fri, Apr 27, 2007 at 12:53:34PM +0200, J??rn Engel wrote:
>
> All this would get easier if continuation inodes were known to be rare.
> You can ditch the doubly-linked list in favor of a pointer to the main
> inode then - traversing the list again is cheap, after all. And you can
> just try to read the same block once for every continuation inode.
>
> If those lists can get long and you need a mapping from offset to
> continuation inode on the medium, you are basically fscked. Storing the
> mapping requires space. You need the mapping only when space (in some
> chunk) gets tight and you allocate continuation inodes. So either you
> don't need the mapping or you don't have a good place to put it.
Any mapping structure will have to be pre-allocated.
> Having a mapping in memory is also questionable. Either you scan the
> whole file on first access and spend a long time for large files. Or
> you create the mapping on the fly. In that case the page cache will
> already give you a 90% solution for free.
So in my secret heart of hearts, I do indeed hope that cnodes are rare
enough that we don't actually have to do anything smart to make them
go fast. Either having no fast lookup structure or creating it in
memory as needed would be the nicest solution. However, since I can't
guarantee this will be the case, it's nice to have some idea of what
we'll do if this does become important.
> You should spend a lot of effort trying to minimize cnodes. ;)
Yep. It's much better to optimize away most cnodes instead of trying
to make the go fast.
-VAL
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-28 6:50 ` Valerie Henson
@ 2007-04-28 10:03 ` Jörn Engel
0 siblings, 0 replies; 34+ messages in thread
From: Jörn Engel @ 2007-04-28 10:03 UTC (permalink / raw)
To: Valerie Henson
Cc: Jan Kara, David Chinner, Amit Gud, Nikita Danilov, David Lang,
linux-fsdevel, linux-kernel, riel, zab, arjan, suparna, brandon,
karunasagark
On Fri, 27 April 2007 23:50:06 -0700, Valerie Henson wrote:
>
> Any mapping structure will have to be pre-allocated.
How much space would you allocate for it then? The required size surely
depends on the file size and fragmentation.
> So in my secret heart of hearts, I do indeed hope that cnodes are rare
> enough that we don't actually have to do anything smart to make them
> go fast. Either having no fast lookup structure or creating it in
> memory as needed would be the nicest solution. However, since I can't
> guarantee this will be the case, it's nice to have some idea of what
> we'll do if this does become important.
You go slow and slower. Or you can defragment/clean the filesystem.
Fragmented filesystems being slow is hardly news to anyone. ChunkFS
will be slower than others, so the reasons to fight fragmentation weigh
heavier.
What you can do:
- If two files 1 and 2 contain blocks in chunks A and B, move file 1
blocks to chunk A and file 2 blocks to chunk B until one of the files
only spans a single chunk.
- Same as above with longer chains like file 1 in chunks (A,B), file 2
in chunks (B,C), file 3 in chunks (A,C).
- When writing to file 1 in chunk A but running out of space in chunk A,
move some cross-chunk file over instead of writing file 1 blocks to
chunk B.
The first two can hope to use some idle times (if they exist), the third
will make writes go slower in order to speed up future reads.
There ain't no such thing as a free lunch. :)
Jörn
--
I can say that I spend most of my time fixing bugs even if I have lots
of new features to implement in mind, but I give bugs more priority.
-- Andrea Arcangeli, 2000
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
2007-04-27 15:06 ` Jeff Dike
@ 2007-05-01 17:26 ` Valerie Henson
0 siblings, 0 replies; 34+ messages in thread
From: Valerie Henson @ 2007-05-01 17:26 UTC (permalink / raw)
To: Jeff Dike
Cc: Amit Gud, Nikita Danilov, David Lang, linux-fsdevel,
linux-kernel, riel, zab, arjan, suparna, brandon, karunasagark
On Fri, Apr 27, 2007 at 11:06:47AM -0400, Jeff Dike wrote:
> On Thu, Apr 26, 2007 at 09:58:25PM -0700, Valerie Henson wrote:
> > Here's an example, spelled out:
> >
> > Allocate file 1 in chunk A.
> > Grow file 1.
> > Chunk A fills up.
> > Allocate continuation inode for file 1 in chunk B.
> > Chunk A gets some free space.
> > Chunk B fills up.
> > Pick chunk A for allocating next block of file 1.
> > Try to look up a continuation inode for file 1 in chunk A.
> > Continuation inode for file 1 found in chunk A!
> > Attach newly allocated block to existing inode for file 1 in chunk A.
>
> So far, so good (and the slides are helpful, tx!). What happens when
> file 1 keeps growing and chunk A fills up (and chunk B is still full)?
> Can the same continuation inode also point at chunk C, where the file
> is going to grow to?
You allocate a new continuation inode in chunk C. The rule is that
only inodes inside a chunk can point to blocks inside the chunk, so
you need an inode in C if you want to allocate blocks from C.
-VAL
^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2007-05-01 17:27 UTC | newest]
Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-23 11:21 [RFC][PATCH] ChunkFS: fs fission for faster fsck Amit Gud
[not found] ` <17965.6084 1.900376.524639@gargle.gargle.HOWL>
2007-04-23 16:28 ` Suparna Bhattacharya
2007-04-23 15:25 ` Amit Gud
2007-04-23 16:32 ` Suparna Bhattacharya
2007-04-24 11:44 ` Nikita Danilov
2007-04-24 18:27 ` David Lang
2007-04-24 19:34 ` Nikita Danilov
2007-04-24 19:26 ` David Lang
2007-04-25 11:34 ` Nikita Danilov
2007-04-25 16:39 ` David Lang
2007-04-25 22:47 ` Valerie Henson
2007-04-26 14:14 ` Jeff Dike
2007-04-26 15:53 ` Amit Gud
2007-04-26 16:05 ` Jeff Dike
2007-04-26 16:56 ` Amit Gud
2007-04-27 4:58 ` Valerie Henson
2007-04-27 15:06 ` Jeff Dike
2007-05-01 17:26 ` Valerie Henson
2007-04-26 16:11 ` Alan Cox
2007-04-26 16:44 ` Amit Gud
2007-04-24 21:53 ` Amit Gud
2007-04-25 10:54 ` David Chinner
2007-04-25 11:38 ` Andreas Dilger
2007-04-25 17:52 ` Amit Gud
2007-04-25 23:06 ` Valerie Henson
2007-04-25 23:03 ` Valerie Henson
2007-04-26 0:47 ` David Chinner
2007-04-26 22:21 ` Jörn Engel
2007-04-26 8:47 ` Jan Kara
2007-04-27 5:07 ` Valerie Henson
2007-04-27 10:53 ` Jörn Engel
2007-04-28 6:50 ` Valerie Henson
2007-04-28 10:03 ` Jörn Engel
2007-04-25 22:43 ` Valerie Henson
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).