LKML Archive on lore.kernel.org
* [00/17] Large Blocksize Support V3
@ 2007-04-24 22:21 clameter
2007-04-24 22:21 ` [01/17] Remove open coded implementation of memclear_highpage_flush clameter
` (26 more replies)
0 siblings, 27 replies; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
V2->V3
- More restructuring
- It actually works!
- Add XFS support
- Fix up UP support
- Work out the direct I/O issues
- Add CONFIG_LARGE_BLOCKSIZE. Off by default, which makes the inlines revert
back to constants. Disabled for 32-bit and HIGHMEM configurations.
This also allows a gradual migration to the new page cache
inline functions: LARGE_BLOCKSIZE capabilities can be
added gradually, and if there is a problem we can disable
a subsystem.
V1->V2
- Some ext2 support
- Some block layer, fs layer support etc.
- Better page cache macros
- Use macros to clean up code.
This patchset modifies the Linux kernel so that block sizes larger than
the page size can be supported. Larger block sizes are handled by using
compound pages of an arbitrary order for the page cache instead of
single order-0 pages.
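As a rough illustration (sketch only, assuming a 4k PAGE_SIZE): a 64k block
is then backed by a single order-4 compound page, i.e. 16 physically
contiguous base pages:

	#include <linux/mm.h>	/* get_order() */

	/* Sketch only: map a block size to the compound page order used
	 * for its page cache pages. */
	static inline int blocksize_to_page_order(unsigned int blocksize)
	{
		return get_order(blocksize);	/* 64k -> 4 (1 << 4 = 16 pages) */
	}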
Rationales:
1. We have problems supporting devices with a blocksize higher than the
page size. This is for example important to support CDs and DVDs that
can only read and write 32k or 64k blocks. We currently have a shim
layer in there to deal with this situation, which limits the speed
of I/O. The developers are currently looking for ways to completely
bypass the page cache because of this deficiency.
2. 32/64k blocksize is also used in flash devices. Same issues.
3. Future hard disks will support bigger block sizes than Linux can
handle, since we are limited to PAGE_SIZE. OK, the on-board cache
may buffer this for us, but what is the point of handling smaller
page sizes than what the drive supports?
4. Reduce fsck times. Larger block sizes mean faster file system checking.
5. Performance. If we look at IA64 vs. x86_64 then it seems that the
faster interrupt handling on x86_64 compensates for the speed loss due to
a smaller page size (4k vs 16k on IA64). Supporting larger block sizes
on all platforms allows a significant reduction in I/O overhead and increases
the size of I/O that can be performed by hardware in a single request,
since the number of scatter-gather entries per request is typically limited.
This is going to become increasingly important to support
the ever growing memory sizes, since we may have to handle excessively
large numbers of 4k requests for data sizes that may become common
soon. For example, to write a 1 terabyte file the kernel would have to
handle 256 million (2^40 / 2^12 = 2^28) 4k chunks; with 64k pages the
same file needs only 2^24 requests, a 16-fold reduction.
6. Cross-arch compatibility: It is currently not possible to mount
a 16k blocksize ext2 filesystem created on IA64 on an x86_64 system.
With this patch this becomes possible.
The support here is currently only for buffered I/O. Modifications for
three filesystems are included:
A. XFS
B. Ext2
C. ramfs
Unsupported
- Mmapping blocks larger than page size
Issues:
- There are numerous places, not yet fixed, where the kernel still assumes
that the page cache consists of PAGE_SIZE pages.
- Defrag warning: The patch set can fragment memory very fast.
Mel Gorman's anti-frag patches, and likely some more work by him on
defragmentation, will be needed if one wants to use super-sized pages.
If you run a 2.6.21 kernel with this patch and start a kernel compile
on a 4k volume with a concurrent copy operation to a 64k volume on
a system with only 1 Gig then you will go boom (ummm no ... OOM) fast.
How well Mel's antifrag/defrag methods address this issue remains to
be seen.
Future:
- Mmap support could be done in a way that makes the mmap page size
independent from the page cache order. It is okay to map a 4k section
of a larger page cache page via a pte. 4k mmap semantics can be completely
preserved even for larger page sizes.
- Maybe people could perform benchmarks to see how much of a difference
there is between 4k size I/O and 64k? Andrew surely would like to know.
- If there is a chance for inclusion then I will diff this against mm,
do a complete scan over the kernel to find all page cache == PAGE_SIZE
assumptions and then try to get it upstream for 2.6.23.
How to make this work:
1. Apply this patchset to 2.6.21-rc7
2. Configure LARGE_BLOCKSIZE Support
3. compile kernel
--
* [01/17] Remove open coded implementation of memclear_highpage_flush
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
@ 2007-04-24 22:21 ` clameter
2007-04-24 22:21 ` [02/17] Fix page allocation flags in grow_dev_page() clameter
` (25 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
[-- Attachment #1: consolidate_partial_clear --]
[-- Type: text/plain, Size: 10931 bytes --]
There is a series of open-coded reimplementations of memclear_highpage_flush
all over the page cache code. Call memclear_highpage_flush in those locations
instead. This consolidates code and eases maintenance.
[There seems to be a better patch in mm. This patch here is just to make things
work against 2.6.21 until the rest of the patch is diffed against mm]
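For reference, the helper that replaces the open coded sequences is roughly
the following (as in include/linux/highmem.h; shown here only as a sketch):

	static inline void memclear_highpage_flush(struct page *page,
				unsigned int offset, unsigned int size)
	{
		void *kaddr;

		BUG_ON(offset + size > PAGE_SIZE);

		kaddr = kmap_atomic(page, KM_USER0);
		memset((char *)kaddr + offset, 0, size);
		flush_dcache_page(page);
		kunmap_atomic(kaddr, KM_USER0);
	}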
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/buffer.c | 77 ++++++++++++-------------------------------------------
fs/direct-io.c | 8 +----
fs/libfs.c | 10 +++----
fs/mpage.c | 12 ++------
mm/filemap_xip.c | 6 ----
5 files changed, 29 insertions(+), 84 deletions(-)
Index: linux-2.6.21-rc7/fs/buffer.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/buffer.c 2007-04-23 14:19:04.000000000 -0700
+++ linux-2.6.21-rc7/fs/buffer.c 2007-04-23 14:44:09.000000000 -0700
@@ -1803,17 +1803,12 @@ static int __block_prepare_write(struct
continue;
}
if (block_end > to || block_start < from) {
- void *kaddr;
-
- kaddr = kmap_atomic(page, KM_USER0);
if (block_end > to)
- memset(kaddr+to, 0,
- block_end-to);
+ memclear_highpage_flush(page,
+ to, block_end - to);
if (block_start < from)
- memset(kaddr+block_start,
- 0, from-block_start);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
+ memclear_highpage_flush(page,
+ block_start, from - block_start);
}
continue;
}
@@ -1861,13 +1856,8 @@ static int __block_prepare_write(struct
if (block_start >= to)
break;
if (buffer_new(bh)) {
- void *kaddr;
-
clear_buffer_new(bh);
- kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr+block_start, 0, bh->b_size);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
+ memclear_highpage_flush(page, block_start, bh->b_size);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
}
@@ -1955,10 +1945,7 @@ int block_read_full_page(struct page *pa
SetPageError(page);
}
if (!buffer_mapped(bh)) {
- void *kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr + i * blocksize, 0, blocksize);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
+ memclear_highpage_flush(page, i * blocksize, blocksize);
if (!err)
set_buffer_uptodate(bh);
continue;
@@ -2101,7 +2088,6 @@ int cont_prepare_write(struct page *page
long status;
unsigned zerofrom;
unsigned blocksize = 1 << inode->i_blkbits;
- void *kaddr;
while(page->index > (pgpos = *bytes>>PAGE_CACHE_SHIFT)) {
status = -ENOMEM;
@@ -2123,10 +2109,8 @@ int cont_prepare_write(struct page *page
PAGE_CACHE_SIZE, get_block);
if (status)
goto out_unmap;
- kaddr = kmap_atomic(new_page, KM_USER0);
- memset(kaddr+zerofrom, 0, PAGE_CACHE_SIZE-zerofrom);
- flush_dcache_page(new_page);
- kunmap_atomic(kaddr, KM_USER0);
+ memclear_highpage_flush(new_page, zerofrom,
+ PAGE_CACHE_SIZE - zerofrom);
generic_commit_write(NULL, new_page, zerofrom, PAGE_CACHE_SIZE);
unlock_page(new_page);
page_cache_release(new_page);
@@ -2153,10 +2137,7 @@ int cont_prepare_write(struct page *page
if (status)
goto out1;
if (zerofrom < offset) {
- kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr+zerofrom, 0, offset-zerofrom);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
+ memclear_highpage_flush(page, zerofrom, offset - zerofrom);
__block_commit_write(inode, page, zerofrom, offset);
}
return 0;
@@ -2243,7 +2224,6 @@ int nobh_prepare_write(struct page *page
unsigned block_in_page;
unsigned block_start;
sector_t block_in_file;
- char *kaddr;
int nr_reads = 0;
int i;
int ret = 0;
@@ -2283,13 +2263,12 @@ int nobh_prepare_write(struct page *page
if (PageUptodate(page))
continue;
if (buffer_new(&map_bh) || !buffer_mapped(&map_bh)) {
- kaddr = kmap_atomic(page, KM_USER0);
if (block_start < from)
- memset(kaddr+block_start, 0, from-block_start);
+ memclear_highpage_flush(page,
+ block_start, from - block_start);
if (block_end > to)
- memset(kaddr + to, 0, block_end - to);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
+ memclear_highpage_flush(page,
+ to, block_end - to);
continue;
}
if (buffer_uptodate(&map_bh))
@@ -2355,10 +2334,7 @@ failed:
* Error recovery is pretty slack. Clear the page and mark it dirty
* so we'll later zero out any blocks which _were_ allocated.
*/
- kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr, 0, PAGE_CACHE_SIZE);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
+ memclear_highpage_flush(page, 0, PAGE_SIZE);
SetPageUptodate(page);
set_page_dirty(page);
return ret;
@@ -2397,7 +2373,6 @@ int nobh_writepage(struct page *page, ge
loff_t i_size = i_size_read(inode);
const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
unsigned offset;
- void *kaddr;
int ret;
/* Is the page fully inside i_size? */
@@ -2428,10 +2403,7 @@ int nobh_writepage(struct page *page, ge
* the page size, the remaining memory is zeroed when mapped, and
* writes to that region are not written out to the file."
*/
- kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
+ memclear_highpage_flush(page, offset, PAGE_CACHE_SIZE - offset);
out:
ret = mpage_writepage(page, get_block, wbc);
if (ret == -EAGAIN)
@@ -2452,7 +2424,6 @@ int nobh_truncate_page(struct address_sp
unsigned to;
struct page *page;
const struct address_space_operations *a_ops = mapping->a_ops;
- char *kaddr;
int ret = 0;
if ((offset & (blocksize - 1)) == 0)
@@ -2466,10 +2437,7 @@ int nobh_truncate_page(struct address_sp
to = (offset + blocksize) & ~(blocksize - 1);
ret = a_ops->prepare_write(NULL, page, offset, to);
if (ret == 0) {
- kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
+ memclear_highpage_flush(page, offset, PAGE_CACHE_SIZE - offset);
/*
* It would be more correct to call aops->commit_write()
* here, but this is more efficient.
@@ -2495,7 +2463,6 @@ int block_truncate_page(struct address_s
struct inode *inode = mapping->host;
struct page *page;
struct buffer_head *bh;
- void *kaddr;
int err;
blocksize = 1 << inode->i_blkbits;
@@ -2549,11 +2516,7 @@ int block_truncate_page(struct address_s
goto unlock;
}
- kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr + offset, 0, length);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
-
+ memclear_highpage_flush(page, offset, length);
mark_buffer_dirty(bh);
err = 0;
@@ -2574,7 +2537,6 @@ int block_write_full_page(struct page *p
loff_t i_size = i_size_read(inode);
const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
unsigned offset;
- void *kaddr;
/* Is the page fully inside i_size? */
if (page->index < end_index)
@@ -2600,10 +2562,7 @@ int block_write_full_page(struct page *p
* the page size, the remaining memory is zeroed when mapped, and
* writes to that region are not written out to the file."
*/
- kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
+ memclear_highpage_flush(page, offset, PAGE_CACHE_SIZE - offset);
return __block_write_full_page(inode, page, get_block, wbc);
}
Index: linux-2.6.21-rc7/fs/direct-io.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/direct-io.c 2007-04-23 14:28:35.000000000 -0700
+++ linux-2.6.21-rc7/fs/direct-io.c 2007-04-23 14:29:21.000000000 -0700
@@ -867,7 +867,6 @@ static int do_direct_IO(struct dio *dio)
do_holes:
/* Handle holes */
if (!buffer_mapped(map_bh)) {
- char *kaddr;
loff_t i_size_aligned;
/* AKPM: eargh, -ENOTBLK is a hack */
@@ -888,11 +887,8 @@ do_holes:
page_cache_release(page);
goto out;
}
- kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr + (block_in_page << blkbits),
- 0, 1 << blkbits);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
+ memclear_highpage_flush(page,
+ block_in_page << blkbits, 1 << blkbits);
dio->block_in_file++;
block_in_page++;
goto next_block;
Index: linux-2.6.21-rc7/fs/libfs.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/libfs.c 2007-04-23 14:29:49.000000000 -0700
+++ linux-2.6.21-rc7/fs/libfs.c 2007-04-23 14:32:19.000000000 -0700
@@ -332,11 +332,11 @@ int simple_prepare_write(struct file *fi
{
if (!PageUptodate(page)) {
if (to - from != PAGE_CACHE_SIZE) {
- void *kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr, 0, from);
- memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
+ if (from)
+ memclear_highpage_flush(page, 0, from);
+ if (to < PAGE_CACHE_SIZE)
+ memclear_highpage_flush(page, to,
+ PAGE_CACHE_SIZE - to);
}
}
return 0;
Index: linux-2.6.21-rc7/fs/mpage.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/mpage.c 2007-04-23 14:32:27.000000000 -0700
+++ linux-2.6.21-rc7/fs/mpage.c 2007-04-23 14:53:55.000000000 -0700
@@ -284,11 +284,8 @@ do_mpage_readpage(struct bio *bio, struc
}
if (first_hole != blocks_per_page) {
- char *kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr + (first_hole << blkbits), 0,
+ memclear_highpage_flush(page, first_hole << blkbits,
PAGE_CACHE_SIZE - (first_hole << blkbits));
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
if (first_hole == 0) {
SetPageUptodate(page);
unlock_page(page);
@@ -576,14 +573,11 @@ page_is_mapped:
* written out to the file."
*/
unsigned offset = i_size & (PAGE_CACHE_SIZE - 1);
- char *kaddr;
if (page->index > end_index || !offset)
goto confused;
- kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
+ memclear_highpage_flush(page, offset,
+ PAGE_CACHE_SIZE - offset);
}
/*
Index: linux-2.6.21-rc7/mm/filemap_xip.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap_xip.c 2007-04-23 14:34:41.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap_xip.c 2007-04-23 14:35:26.000000000 -0700
@@ -458,11 +458,7 @@ xip_truncate_page(struct address_space *
else
return PTR_ERR(page);
}
- kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr + offset, 0, length);
- kunmap_atomic(kaddr, KM_USER0);
-
- flush_dcache_page(page);
+ memclear_highpage_flush(page, offset, length);
return 0;
}
EXPORT_SYMBOL_GPL(xip_truncate_page);
--
* [02/17] Fix page allocation flags in grow_dev_page()
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
2007-04-24 22:21 ` [01/17] Remove open coded implementation of memclear_highpage_flush clameter
@ 2007-04-24 22:21 ` clameter
2007-04-24 22:21 ` [03/17] Fix: find_or_create_page does not spread memory clameter
` (24 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
[-- Attachment #1: fix_buffer_alloc --]
[-- Type: text/plain, Size: 1282 bytes --]
grow_dev_page() simply passes GFP_NOFS to find_or_create_page(). This means
that both the radix tree nodes and the new page itself are allocated with a
bare GFP_NOFS.
The mapping has a flags field that contains the necessary allocation flags for
the page cache allocation. These need to be consulted in order to get DMA
and HIGHMEM allocations etc. right.
So fix grow_dev_page() to do the same as __page_symlink() does: retrieve the
gfp mask from the mapping and then remove __GFP_FS.
[mm has other changes to this code. Will post a different patch when diffing against mm]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/buffer.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Index: linux-2.6.21-rc7/fs/buffer.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/buffer.c 2007-04-23 15:29:18.000000000 -0700
+++ linux-2.6.21-rc7/fs/buffer.c 2007-04-23 15:29:43.000000000 -0700
@@ -988,7 +988,8 @@ grow_dev_page(struct block_device *bdev,
struct page *page;
struct buffer_head *bh;
- page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+ page = find_or_create_page(inode->i_mapping, index,
+ mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS);
if (!page)
return NULL;
--
* [03/17] Fix: find_or_create_page does not spread memory.
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
2007-04-24 22:21 ` [01/17] Remove open coded implementation of memclear_highpage_flush clameter
2007-04-24 22:21 ` [02/17] Fix page allocation flags in grow_dev_page() clameter
@ 2007-04-24 22:21 ` clameter
2007-04-24 22:21 ` [04/17] Free up page->private for compound pages clameter
` (23 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
[-- Attachment #1: fix_find_or_create --]
[-- Type: text/plain, Size: 1663 bytes --]
The find_or_create_page() function calls alloc_page() with the gfp_mask passed
to it, which is derived from the mapping's gfp mask. So the allocation flags are
right (assuming my bugfix to fs/buffer.c is applied).
However, we call alloc_page() instead of page_cache_alloc() in
find_or_create_page(). This means that the page cache allocation will not obey
cpuset memory spreading. We really need to call __page_cache_alloc() there.
Also stick a comment in there explaining how the allocation mask passed to
find_or_create_page needs to be handled.
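For context, __page_cache_alloc() (mm/filemap.c; it also shows up in patch
[08/17] below) looks roughly like this on NUMA, which is where the cpuset
memory spreading happens:

	#ifdef CONFIG_NUMA
	struct page *__page_cache_alloc(gfp_t gfp)
	{
		if (cpuset_do_page_mem_spread()) {
			/* Spread page cache pages over the cpuset's nodes */
			int n = cpuset_mem_spread_node();

			return alloc_pages_node(n, gfp, 0);
		}
		return alloc_pages(gfp, 0);
	}
	#endif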
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/filemap.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
Index: linux-2.6.21-rc7/mm/filemap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-23 15:37:27.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap.c 2007-04-23 15:38:09.000000000 -0700
@@ -648,7 +648,8 @@ EXPORT_SYMBOL(find_lock_page);
* find_or_create_page - locate or add a pagecache page
* @mapping: the page's address_space
* @index: the page's index into the mapping
- * @gfp_mask: page allocation mode
+ * @gfp_mask: page allocation mode. This must be derived from the
+ * allocation flags of the mapping!
*
* Locates a page in the pagecache. If the page is not present, a new page
* is allocated using @gfp_mask and is added to the pagecache and to the VM's
@@ -670,7 +671,8 @@ repeat:
page = find_lock_page(mapping, index);
if (!page) {
if (!cached_page) {
- cached_page = alloc_page(gfp_mask);
+ cached_page =
+ __page_cache_alloc(gfp_mask);
if (!cached_page)
return NULL;
}
--
* [04/17] Free up page->private for compound pages
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (2 preceding siblings ...)
2007-04-24 22:21 ` [03/17] Fix: find_or_create_page does not spread memory clameter
@ 2007-04-24 22:21 ` clameter
2007-04-24 22:21 ` [05/17] More compound page features clameter
` (22 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
[-- Attachment #1: var_pc_compound --]
[-- Type: text/plain, Size: 9002 bytes --]
If we add a new flag so that we can distinguish between the
first page and the tail pages then we can avoid using page->private
in the first page. page->private == page for the first page, so there
is no real information in there.
Freeing up page->private makes the use of compound pages more transparent:
they behave more like real pages. Right now we have to be careful, for example,
if we are going beyond PAGE_SIZE allocations in the slab on i386 because we
can then no longer use the private field. This is one of the issues that
prevent us from supporting debugging for page-size slabs in SLAB.
Also, if page->private is available then a compound page may be equipped
with buffer heads. This may open the way for filesystems to support
blocks larger than the page size.
Note that this patch is different from the one in mm. The one in mm
uses PG_reclaim as PG_tail. We cannot reuse PG_reclaim that way here since
these pages can now be reclaimed. So use a separate page flag. The patch in mm
has more cleanups. Both need to be reconciled at some point.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
arch/ia64/mm/init.c | 2 +-
include/linux/mm.h | 28 ++++++++++++++++++++++------
include/linux/page-flags.h | 6 ++++++
mm/internal.h | 2 +-
mm/page_alloc.c | 35 +++++++++++++++++++++++++----------
mm/slab.c | 6 ++----
mm/swap.c | 2 +-
7 files changed, 58 insertions(+), 23 deletions(-)
Index: linux-2.6.21-rc7/include/linux/mm.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/mm.h 2007-04-24 11:31:51.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/mm.h 2007-04-24 11:32:45.000000000 -0700
@@ -263,21 +263,24 @@ static inline int put_page_testzero(stru
*/
static inline int get_page_unless_zero(struct page *page)
{
- VM_BUG_ON(PageCompound(page));
return atomic_inc_not_zero(&page->_count);
}
+static inline struct page *compound_head(struct page *page)
+{
+ if (unlikely(PageTail(page)))
+ return (struct page *)page->private;
+ return page;
+}
+
static inline int page_count(struct page *page)
{
- if (unlikely(PageCompound(page)))
- page = (struct page *)page_private(page);
- return atomic_read(&page->_count);
+ return atomic_read(&compound_head(page)->_count);
}
static inline void get_page(struct page *page)
{
- if (unlikely(PageCompound(page)))
- page = (struct page *)page_private(page);
+ page = compound_head(page);
VM_BUG_ON(atomic_read(&page->_count) == 0);
atomic_inc(&page->_count);
}
@@ -314,6 +317,19 @@ static inline compound_page_dtor *get_co
return (compound_page_dtor *)page[1].lru.next;
}
+static inline void set_compound_order(struct page *page, unsigned long order)
+{
+ page[1].lru.prev = (void *)order;
+}
+
+static inline int compound_order(struct page *page)
+{
+ if (!PageCompound(page) || PageTail(page))
+ return 0;
+
+ return (unsigned long)page[1].lru.prev;
+}
+
/*
* Multiple processes may "see" the same page. E.g. for untouched
* mappings of /dev/null, all processes see the same page full of
Index: linux-2.6.21-rc7/include/linux/page-flags.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/page-flags.h 2007-04-24 11:31:51.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/page-flags.h 2007-04-24 11:32:00.000000000 -0700
@@ -91,6 +91,8 @@
#define PG_nosave_free 18 /* Used for system suspend/resume */
#define PG_buddy 19 /* Page is free, on buddy lists */
+#define PG_tail 20 /* Page is tail of a compound page */
+
/* PG_owner_priv_1 users should have descriptive aliases */
#define PG_checked PG_owner_priv_1 /* Used by some filesystems */
@@ -241,6 +243,10 @@ static inline void SetPageUptodate(struc
#define __SetPageCompound(page) __set_bit(PG_compound, &(page)->flags)
#define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
+#define PageTail(page) test_bit(PG_tail, &(page)->flags)
+#define __SetPageTail(page) __set_bit(PG_tail, &(page)->flags)
+#define __ClearPageTail(page) __clear_bit(PG_tail, &(page)->flags)
+
#ifdef CONFIG_SWAP
#define PageSwapCache(page) test_bit(PG_swapcache, &(page)->flags)
#define SetPageSwapCache(page) set_bit(PG_swapcache, &(page)->flags)
Index: linux-2.6.21-rc7/mm/internal.h
===================================================================
--- linux-2.6.21-rc7.orig/mm/internal.h 2007-04-24 11:31:51.000000000 -0700
+++ linux-2.6.21-rc7/mm/internal.h 2007-04-24 11:32:00.000000000 -0700
@@ -24,7 +24,7 @@ static inline void set_page_count(struct
*/
static inline void set_page_refcounted(struct page *page)
{
- VM_BUG_ON(PageCompound(page) && page_private(page) != (unsigned long)page);
+ VM_BUG_ON(PageTail(page));
VM_BUG_ON(atomic_read(&page->_count));
set_page_count(page, 1);
}
Index: linux-2.6.21-rc7/mm/page_alloc.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/page_alloc.c 2007-04-24 11:31:51.000000000 -0700
+++ linux-2.6.21-rc7/mm/page_alloc.c 2007-04-24 11:32:00.000000000 -0700
@@ -227,7 +227,7 @@ static void bad_page(struct page *page)
static void free_compound_page(struct page *page)
{
- __free_pages_ok(page, (unsigned long)page[1].lru.prev);
+ __free_pages_ok(page, compound_order(page));
}
static void prep_compound_page(struct page *page, unsigned long order)
@@ -236,12 +236,14 @@ static void prep_compound_page(struct pa
int nr_pages = 1 << order;
set_compound_page_dtor(page, free_compound_page);
- page[1].lru.prev = (void *)order;
- for (i = 0; i < nr_pages; i++) {
+ set_compound_order(page, order);
+ __SetPageCompound(page);
+ for (i = 1; i < nr_pages; i++) {
struct page *p = page + i;
+ __SetPageTail(p);
__SetPageCompound(p);
- set_page_private(p, (unsigned long)page);
+ p->private = (unsigned long)page;
}
}
@@ -250,15 +252,19 @@ static void destroy_compound_page(struct
int i;
int nr_pages = 1 << order;
- if (unlikely((unsigned long)page[1].lru.prev != order))
+ if (unlikely(compound_order(page) != order))
bad_page(page);
- for (i = 0; i < nr_pages; i++) {
+ if (unlikely(!PageCompound(page)))
+ bad_page(page);
+ __ClearPageCompound(page);
+ for (i = 1; i < nr_pages; i++) {
struct page *p = page + i;
- if (unlikely(!PageCompound(p) |
- (page_private(p) != (unsigned long)page)))
+ if (unlikely(!PageCompound(p) | !PageTail(p) |
+ ((struct page *)p->private != page)))
bad_page(page);
+ __ClearPageTail(p);
__ClearPageCompound(p);
}
}
@@ -1438,8 +1444,17 @@ void __pagevec_free(struct pagevec *pvec
{
int i = pagevec_count(pvec);
- while (--i >= 0)
- free_hot_cold_page(pvec->pages[i], pvec->cold);
+ while (--i >= 0) {
+ struct page *page = pvec->pages[i];
+
+ if (PageCompound(page)) {
+ compound_page_dtor *dtor;
+
+ dtor = get_compound_page_dtor(page);
+ (*dtor)(page);
+ } else
+ free_hot_cold_page(page, pvec->cold);
+ }
}
fastcall void __free_pages(struct page *page, unsigned int order)
Index: linux-2.6.21-rc7/mm/slab.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/slab.c 2007-04-24 11:31:51.000000000 -0700
+++ linux-2.6.21-rc7/mm/slab.c 2007-04-24 11:32:00.000000000 -0700
@@ -592,8 +592,7 @@ static inline void page_set_cache(struct
static inline struct kmem_cache *page_get_cache(struct page *page)
{
- if (unlikely(PageCompound(page)))
- page = (struct page *)page_private(page);
+ page = compound_head(page);
BUG_ON(!PageSlab(page));
return (struct kmem_cache *)page->lru.next;
}
@@ -605,8 +604,7 @@ static inline void page_set_slab(struct
static inline struct slab *page_get_slab(struct page *page)
{
- if (unlikely(PageCompound(page)))
- page = (struct page *)page_private(page);
+ page = compound_head(page);
BUG_ON(!PageSlab(page));
return (struct slab *)page->lru.prev;
}
Index: linux-2.6.21-rc7/mm/swap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/swap.c 2007-04-24 11:31:51.000000000 -0700
+++ linux-2.6.21-rc7/mm/swap.c 2007-04-24 11:32:00.000000000 -0700
@@ -55,7 +55,7 @@ static void fastcall __page_cache_releas
static void put_compound_page(struct page *page)
{
- page = (struct page *)page_private(page);
+ page = compound_head(page);
if (put_page_testzero(page)) {
compound_page_dtor *dtor;
Index: linux-2.6.21-rc7/arch/ia64/mm/init.c
===================================================================
--- linux-2.6.21-rc7.orig/arch/ia64/mm/init.c 2007-04-24 11:31:51.000000000 -0700
+++ linux-2.6.21-rc7/arch/ia64/mm/init.c 2007-04-24 11:32:00.000000000 -0700
@@ -121,7 +121,7 @@ lazy_mmu_prot_update (pte_t pte)
return; /* i-cache is already coherent with d-cache */
if (PageCompound(page)) {
- order = (unsigned long) (page[1].lru.prev);
+ order = compound_order(page);
flush_icache_range(addr, addr + (1UL << order << PAGE_SHIFT));
}
else
--
* [05/17] More compound page features
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (3 preceding siblings ...)
2007-04-24 22:21 ` [04/17] Free up page->private for compound pages clameter
@ 2007-04-24 22:21 ` clameter
2007-04-24 22:21 ` [06/17] Fix up handling of Compound head pages clameter
` (21 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
[-- Attachment #1: var_pc_compound_advanced --]
[-- Type: text/plain, Size: 1056 bytes --]
Add a couple of more compound functions to avoid having to duplicate code in various places.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/mm.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
Index: linux-2.6.21-rc7/include/linux/mm.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/mm.h 2007-04-24 11:33:34.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/mm.h 2007-04-24 11:32:16.000000000 -0700
@@ -330,6 +330,21 @@ static inline int compound_order(struct
return (unsigned long)page[1].lru.prev;
}
+static inline int compound_pages(struct page *page)
+{
+ return 1 << compound_order(page);
+}
+
+static inline int compound_shift(struct page *page)
+{
+ return PAGE_SHIFT + compound_order(page);
+}
+
+static inline int compound_size(struct page *page)
+{
+ return PAGE_SIZE << compound_order(page);
+}
+
/*
* Multiple processes may "see" the same page. E.g. for untouched
* mappings of /dev/null, all processes see the same page full of
--
* [06/17] Fix up handling of Compound head pages
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (4 preceding siblings ...)
2007-04-24 22:21 ` [05/17] More compound page features clameter
@ 2007-04-24 22:21 ` clameter
2007-04-24 22:21 ` [07/17] vmstat.c: Support accounting for compound pages clameter
` (20 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
[-- Attachment #1: var_compound_release_pages --]
[-- Type: text/plain, Size: 1343 bytes --]
Compound pages can be on the LRU. This means that the page pointer
to the head page is on a pagevec. In that case we need full LRU processing
for the page in release_pages().
The check for compound pages in release_pages() was introduced by
Nick Piggin to make sure that the page count of tail pages does not
go negative. We can now explicitly check for a tail page instead.
The head page should be processed like a regular page in order to
support higher order page cache pages.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/swap.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
Index: linux-2.6.21-rc7/mm/swap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/swap.c 2007-04-24 09:40:16.000000000 -0700
+++ linux-2.6.21-rc7/mm/swap.c 2007-04-24 09:42:27.000000000 -0700
@@ -263,7 +263,13 @@ void release_pages(struct page **pages,
for (i = 0; i < nr; i++) {
struct page *page = pages[i];
- if (unlikely(PageCompound(page))) {
+ /*
+ * If we have a tail page on the LRU then we need to
+ * decrement the page count of the head page. There
+ * is no further need to do anything since tail pages
+ * cannot be on the LRU.
+ */
+ if (unlikely(PageTail(page))) {
if (zone) {
spin_unlock_irq(&zone->lru_lock);
zone = NULL;
--
* [07/17] vmstat.c: Support accounting for compound pages
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (5 preceding siblings ...)
2007-04-24 22:21 ` [06/17] Fix up handling of Compound head pages clameter
@ 2007-04-24 22:21 ` clameter
2007-04-24 22:21 ` [08/17] Define functions for page cache handling clameter
` (19 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
[-- Attachment #1: var_pc_vmstat --]
[-- Type: text/plain, Size: 4237 bytes --]
Compound pages must increment the counters in terms of base pages.
If we detect a compound page then add the number of base pages that
the compound page has to the counter (e.g. adding an order-4 page to
the active list increments NR_ACTIVE by 16, not 1).
This avoids numerous changes in the VM to fix up page accounting
as we add more support for compound pages.
Also fix up the accounting for active / inactive pages.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/mm_inline.h | 12 ++++++------
include/linux/vmstat.h | 5 ++---
mm/vmstat.c | 8 +++-----
3 files changed, 11 insertions(+), 14 deletions(-)
Index: linux-2.6.21-rc7/mm/vmstat.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/vmstat.c 2007-04-23 20:31:54.000000000 -0700
+++ linux-2.6.21-rc7/mm/vmstat.c 2007-04-23 20:31:59.000000000 -0700
@@ -223,7 +223,7 @@ void __inc_zone_state(struct zone *zone,
void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
{
- __inc_zone_state(page_zone(page), item);
+ __mod_zone_page_state(page_zone(page), item, compound_pages(page));
}
EXPORT_SYMBOL(__inc_zone_page_state);
@@ -244,7 +244,7 @@ void __dec_zone_state(struct zone *zone,
void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
{
- __dec_zone_state(page_zone(page), item);
+ __mod_zone_page_state(page_zone(page), item, -compound_pages(page));
}
EXPORT_SYMBOL(__dec_zone_page_state);
@@ -260,11 +260,9 @@ void inc_zone_state(struct zone *zone, e
void inc_zone_page_state(struct page *page, enum zone_stat_item item)
{
unsigned long flags;
- struct zone *zone;
- zone = page_zone(page);
local_irq_save(flags);
- __inc_zone_state(zone, item);
+ __inc_zone_page_state(page, item);
local_irq_restore(flags);
}
EXPORT_SYMBOL(inc_zone_page_state);
Index: linux-2.6.21-rc7/include/linux/mm_inline.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/mm_inline.h 2007-04-23 20:31:54.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/mm_inline.h 2007-04-23 20:47:16.000000000 -0700
@@ -2,28 +2,28 @@ static inline void
add_page_to_active_list(struct zone *zone, struct page *page)
{
list_add(&page->lru, &zone->active_list);
- __inc_zone_state(zone, NR_ACTIVE);
+ __inc_zone_page_state(page, NR_ACTIVE);
}
static inline void
add_page_to_inactive_list(struct zone *zone, struct page *page)
{
list_add(&page->lru, &zone->inactive_list);
- __inc_zone_state(zone, NR_INACTIVE);
+ __inc_zone_page_state(page, NR_INACTIVE);
}
static inline void
del_page_from_active_list(struct zone *zone, struct page *page)
{
list_del(&page->lru);
- __dec_zone_state(zone, NR_ACTIVE);
+ __dec_zone_page_state(page, NR_ACTIVE);
}
static inline void
del_page_from_inactive_list(struct zone *zone, struct page *page)
{
list_del(&page->lru);
- __dec_zone_state(zone, NR_INACTIVE);
+ __dec_zone_page_state(page, NR_INACTIVE);
}
static inline void
@@ -32,9 +32,9 @@ del_page_from_lru(struct zone *zone, str
list_del(&page->lru);
if (PageActive(page)) {
__ClearPageActive(page);
- __dec_zone_state(zone, NR_ACTIVE);
+ __dec_zone_page_state(page, NR_ACTIVE);
} else {
- __dec_zone_state(zone, NR_INACTIVE);
+ __dec_zone_page_state(page, NR_INACTIVE);
}
}
Index: linux-2.6.21-rc7/include/linux/vmstat.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/vmstat.h 2007-04-23 20:33:02.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/vmstat.h 2007-04-23 20:49:06.000000000 -0700
@@ -235,7 +235,7 @@ static inline void __inc_zone_state(stru
static inline void __inc_zone_page_state(struct page *page,
enum zone_stat_item item)
{
- __inc_zone_state(page_zone(page), item);
+ __mod_zone_page_state(page_zone(page), item, compound_pages(page));
}
static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
@@ -247,8 +247,7 @@ static inline void __dec_zone_state(stru
static inline void __dec_zone_page_state(struct page *page,
enum zone_stat_item item)
{
- atomic_long_dec(&page_zone(page)->vm_stat[item]);
- atomic_long_dec(&vm_stat[item]);
+ __mod_zone_page_state(page_zone(page), item, -compound_pages(page));
}
/*
--
* [08/17] Define functions for page cache handling
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (6 preceding siblings ...)
2007-04-24 22:21 ` [07/17] vmstat.c: Support accounting for compound pages clameter
@ 2007-04-24 22:21 ` clameter
2007-04-24 23:00 ` Eric Dumazet
2007-04-24 22:21 ` [09/17] Convert PAGE_CACHE_xxx -> page_cache_xxx function calls clameter
` (18 subsequent siblings)
26 siblings, 1 reply; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
[-- Attachment #1: var_pc_sizes --]
[-- Type: text/plain, Size: 8764 bytes --]
We use the macros PAGE_CACHE_SIZE, PAGE_CACHE_SHIFT, PAGE_CACHE_MASK
and PAGE_CACHE_ALIGN in various places in the kernel. These are useful
if one only wants to support a single page size in the page cache.
This patch provides a set of functions in order to provide the
ability to define new page cache page sizes in the future.
All functions take an address_space pointer. Add a set of extended functions
that will be used to consolidate the hand-crafted shifts and adds in use
right now.
New function                      Related base page constant
---------------------------------------------------------------------
page_cache_shift(a)               PAGE_CACHE_SHIFT
page_cache_size(a)                PAGE_CACHE_SIZE
page_cache_mask(a)                PAGE_CACHE_MASK
page_cache_index(a, pos)          Calculate page number from position
page_cache_next(a, pos)           Page number of next page
page_cache_offset(a, pos)         Calculate offset into a page
page_cache_pos(a, index, offset)  Form position based on page number
                                  and an offset.
The workings of these functions depend on CONFIG_LARGE_BLOCKSIZE. If set
then these sizes are dynamically calculated. Otherwise the functions will
provide constant results.
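A small usage sketch (not part of the patch) showing how a file position is
split into a page cache index plus in-page offset and recombined with the
new helpers, independent of the mapping's order:

	#include <linux/pagemap.h>

	static void example_split_pos(struct address_space *mapping, loff_t pos)
	{
		pgoff_t index = page_cache_index(mapping, pos);
		unsigned int offset = page_cache_offset(mapping, pos);

		/* page_cache_pos() is the inverse operation */
		BUG_ON(page_cache_pos(mapping, index, offset) != pos);
	}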
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
block/Kconfig | 17 +++++++
include/linux/fs.h | 5 ++
include/linux/pagemap.h | 115 +++++++++++++++++++++++++++++++++++++++++++++---
mm/filemap.c | 12 ++---
4 files changed, 139 insertions(+), 10 deletions(-)
Index: linux-2.6.21-rc7/include/linux/pagemap.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-24 11:31:49.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-24 11:37:21.000000000 -0700
@@ -32,6 +32,10 @@ static inline void mapping_set_gfp_mask(
{
m->flags = (m->flags & ~(__force unsigned long)__GFP_BITS_MASK) |
(__force unsigned long)mask;
+#ifdef CONFIG_LARGE_BLOCKSIZE
+ if (m->order)
+ m->flags |= __GFP_COMP;
+#endif
}
/*
@@ -41,33 +45,134 @@ static inline void mapping_set_gfp_mask(
* space in smaller chunks for same flexibility).
*
* Or rather, it _will_ be done in larger chunks.
+ *
+ * The following constants can be used if a filesystem only supports a single
+ * page size.
*/
#define PAGE_CACHE_SHIFT PAGE_SHIFT
#define PAGE_CACHE_SIZE PAGE_SIZE
#define PAGE_CACHE_MASK PAGE_MASK
#define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
+/*
+ * The next set of functions allow to write code that is capable of dealing
+ * with multiple page sizes.
+ */
+#ifdef CONFIG_LARGE_BLOCKSIZE
+static inline void set_mapping_order(struct address_space *a, int order)
+{
+ a->order = order;
+ a->shift = order + PAGE_SHIFT;
+ a->offset_mask = (1UL << a->shift) - 1;
+ if (order)
+ a->flags |= __GFP_COMP;
+}
+
+static inline int mapping_order(struct address_space *a)
+{
+ return a->order;
+}
+
+static inline int page_cache_shift(struct address_space *a)
+{
+ return a->shift;
+}
+
+static inline unsigned int page_cache_size(struct address_space *a)
+{
+ return a->offset_mask + 1;
+}
+
+static inline loff_t page_cache_mask(struct address_space *a)
+{
+ return ~a->offset_mask;
+}
+
+static inline unsigned int page_cache_offset(struct address_space *a,
+ loff_t pos)
+{
+ return pos & a->offset_mask;
+}
+#else
+/*
+ * Kernel configured for a fixed PAGE_SIZEd page cache
+ */
+static inline void set_mapping_order(struct address_space *a, int order)
+{
+ BUG_ON(order);
+}
+
+static inline int mapping_order(struct address_space *a)
+{
+ return 0;
+}
+
+static inline int page_cache_shift(struct address_space *a)
+{
+ return PAGE_SHIFT;
+}
+
+static inline unsigned int page_cache_size(struct address_space *a)
+{
+ return PAGE_SIZE;
+}
+
+static inline loff_t page_cache_mask(struct address_space *a)
+{
+ return (loff_t)PAGE_MASK;
+}
+
+static inline unsigned int page_cache_offset(struct address_space *a,
+ loff_t pos)
+{
+ return pos & ~PAGE_MASK;
+}
+#endif
+
+static inline pgoff_t page_cache_index(struct address_space *a,
+ loff_t pos)
+{
+ return pos >> page_cache_shift(a);
+}
+
+/*
+ * Index of the page starting on or after the given position.
+ */
+static inline pgoff_t page_cache_next(struct address_space *a,
+ loff_t pos)
+{
+ return page_cache_index(a, pos + page_cache_size(a) - 1);
+}
+
+static inline loff_t page_cache_pos(struct address_space *a,
+ pgoff_t index, unsigned long offset)
+{
+ return ((loff_t)index << page_cache_shift(a)) + offset;
+}
+
#define page_cache_get(page) get_page(page)
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);
#ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc(gfp_t gfp, int order);
#else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc(gfp_t gfp, int order)
{
- return alloc_pages(gfp, 0);
+ return alloc_pages(gfp, order);
}
#endif
static inline struct page *page_cache_alloc(struct address_space *x)
{
- return __page_cache_alloc(mapping_gfp_mask(x));
+ return __page_cache_alloc(mapping_gfp_mask(x),
+ mapping_order(x));
}
static inline struct page *page_cache_alloc_cold(struct address_space *x)
{
- return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
+ return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD,
+ mapping_order(x));
}
typedef int filler_t(void *, struct page *);
Index: linux-2.6.21-rc7/include/linux/fs.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/fs.h 2007-04-24 11:31:49.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/fs.h 2007-04-24 11:37:21.000000000 -0700
@@ -435,6 +435,11 @@ struct address_space {
struct inode *host; /* owner: inode, block_device */
struct radix_tree_root page_tree; /* radix tree of all pages */
rwlock_t tree_lock; /* and rwlock protecting it */
+#ifdef CONFIG_LARGE_BLOCKSIZE
+ unsigned int order; /* Page order of the pages in here */
+ unsigned int shift; /* Shift of index */
+ loff_t offset_mask; /* Mask to get to offset bits */
+#endif
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
Index: linux-2.6.21-rc7/mm/filemap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-24 11:31:49.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap.c 2007-04-24 11:37:21.000000000 -0700
@@ -467,13 +467,13 @@ int add_to_page_cache_lru(struct page *p
}
#ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct page *__page_cache_alloc(gfp_t gfp, int order)
{
if (cpuset_do_page_mem_spread()) {
int n = cpuset_mem_spread_node();
- return alloc_pages_node(n, gfp, 0);
+ return alloc_pages_node(n, gfp, order);
}
- return alloc_pages(gfp, 0);
+ return alloc_pages(gfp, order);
}
EXPORT_SYMBOL(__page_cache_alloc);
#endif
@@ -672,7 +672,8 @@ repeat:
if (!page) {
if (!cached_page) {
cached_page =
- __page_cache_alloc(gfp_mask);
+ __page_cache_alloc(gfp_mask,
+ mapping_order(mapping));
if (!cached_page)
return NULL;
}
@@ -805,7 +806,8 @@ grab_cache_page_nowait(struct address_sp
page_cache_release(page);
return NULL;
}
- page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
+ page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS,
+ mapping_order(mapping));
if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
page_cache_release(page);
page = NULL;
Index: linux-2.6.21-rc7/block/Kconfig
===================================================================
--- linux-2.6.21-rc7.orig/block/Kconfig 2007-04-24 11:37:57.000000000 -0700
+++ linux-2.6.21-rc7/block/Kconfig 2007-04-24 11:40:23.000000000 -0700
@@ -49,6 +49,23 @@ config LSF
If unsure, say Y.
+#
+# We do not support HIGHMEM because kmap does not support higher order pages
+# We do not support 32 bit because smaller machines are limited in memory
+# and fragmentation could easily occur. Also 32 bit machines typically
+# have restricted DMA areas which requires page bouncing.
+#
+config LARGE_BLOCKSIZE
+ bool "Support blocksizes larger than page size"
+ default n
+ depends on EXPERIMENTAL && !HIGHMEM && 64BIT
+ help
+ Allows the page cache to support higher orders of pages. Higher
+ order page cache pages may be useful to support special devices
+ like CD or DVDs and Flash. Also to increase I/O performance.
+ However, be aware that higher order pages may cause fragmentation
+ which in turn may lead to OOM conditions.
+
endif
source block/Kconfig.iosched
--
* [09/17] Convert PAGE_CACHE_xxx -> page_cache_xxx function calls
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (7 preceding siblings ...)
2007-04-24 22:21 ` [08/17] Define functions for page cache handling clameter
@ 2007-04-24 22:21 ` clameter
2007-04-24 22:21 ` [10/17] Variable Order Page Cache: Add clearing and flushing function clameter
` (17 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
[-- Attachment #1: var_pc_conversion --]
[-- Type: text/plain, Size: 31676 bytes --]
This transforms the page cache code to use the page_cache_xxx calls.
The patch could be more complete.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/buffer.c | 99 +++++++++++++++++++++++++-------------------
fs/libfs.c | 13 +++--
fs/mpage.c | 30 +++++++------
fs/sync.c | 8 +--
include/linux/buffer_head.h | 9 +++-
mm/fadvise.c | 8 +--
mm/filemap.c | 58 ++++++++++++-------------
mm/page-writeback.c | 4 -
mm/truncate.c | 23 +++++-----
9 files changed, 140 insertions(+), 112 deletions(-)
Index: linux-2.6.21-rc7/fs/libfs.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/libfs.c 2007-04-23 22:10:03.000000000 -0700
+++ linux-2.6.21-rc7/fs/libfs.c 2007-04-23 22:22:37.000000000 -0700
@@ -330,13 +330,15 @@ int simple_readpage(struct file *file, s
int simple_prepare_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
+ unsigned int page_size = page_cache_size(file->f_mapping);
+
if (!PageUptodate(page)) {
- if (to - from != PAGE_CACHE_SIZE) {
+ if (to - from != page_size) {
if (from)
memclear_highpage_flush(page, 0, from);
- if (to < PAGE_CACHE_SIZE)
+ if (to < page_size)
memclear_highpage_flush(page, to,
- PAGE_CACHE_SIZE - to);
+ page_size - to);
}
}
return 0;
@@ -345,8 +347,9 @@ int simple_prepare_write(struct file *fi
int simple_commit_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
- struct inode *inode = page->mapping->host;
- loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
+ loff_t pos = page_cache_pos(mapping, page->index, to);
if (!PageUptodate(page))
SetPageUptodate(page);
Index: linux-2.6.21-rc7/mm/filemap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-23 22:10:08.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap.c 2007-04-23 22:22:41.000000000 -0700
@@ -302,8 +302,8 @@ int wait_on_page_writeback_range(struct
int sync_page_range(struct inode *inode, struct address_space *mapping,
loff_t pos, loff_t count)
{
- pgoff_t start = pos >> PAGE_CACHE_SHIFT;
- pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
+ pgoff_t start = page_cache_index(mapping, pos);
+ pgoff_t end = page_cache_index(mapping, pos + count - 1);
int ret;
if (!mapping_cap_writeback_dirty(mapping) || !count)
@@ -334,8 +334,8 @@ EXPORT_SYMBOL(sync_page_range);
int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
loff_t pos, loff_t count)
{
- pgoff_t start = pos >> PAGE_CACHE_SHIFT;
- pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
+ pgoff_t start = page_cache_index(mapping, pos);
+ pgoff_t end = page_cache_index(mapping, pos + count - 1);
int ret;
if (!mapping_cap_writeback_dirty(mapping) || !count)
@@ -364,7 +364,7 @@ int filemap_fdatawait(struct address_spa
return 0;
return wait_on_page_writeback_range(mapping, 0,
- (i_size - 1) >> PAGE_CACHE_SHIFT);
+ page_cache_index(mapping, i_size - 1));
}
EXPORT_SYMBOL(filemap_fdatawait);
@@ -412,8 +412,8 @@ int filemap_write_and_wait_range(struct
/* See comment of filemap_write_and_wait() */
if (err != -EIO) {
int err2 = wait_on_page_writeback_range(mapping,
- lstart >> PAGE_CACHE_SHIFT,
- lend >> PAGE_CACHE_SHIFT);
+ page_cache_index(mapping, lstart),
+ page_cache_index(mapping, lend));
if (!err)
err = err2;
}
@@ -878,27 +878,27 @@ void do_generic_mapping_read(struct addr
struct file_ra_state ra = *_ra;
cached_page = NULL;
- index = *ppos >> PAGE_CACHE_SHIFT;
+ index = page_cache_index(mapping, *ppos);
next_index = index;
prev_index = ra.prev_page;
- last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
- offset = *ppos & ~PAGE_CACHE_MASK;
+ last_index = page_cache_next(mapping, *ppos + desc->count);
+ offset = page_cache_offset(mapping, *ppos);
isize = i_size_read(inode);
if (!isize)
goto out;
- end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+ end_index = page_cache_index(mapping, isize - 1);
for (;;) {
struct page *page;
unsigned long nr, ret;
/* nr is the maximum number of bytes to copy from this page */
- nr = PAGE_CACHE_SIZE;
+ nr = page_cache_size(mapping);
if (index >= end_index) {
if (index > end_index)
goto out;
- nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
+ nr = page_cache_offset(mapping, isize - 1) + 1;
if (nr <= offset) {
goto out;
}
@@ -947,8 +947,8 @@ page_ok:
*/
ret = actor(desc, page, offset, nr);
offset += ret;
- index += offset >> PAGE_CACHE_SHIFT;
- offset &= ~PAGE_CACHE_MASK;
+ index += page_cache_index(mapping, offset);
+ offset = page_cache_offset(mapping, offset);
page_cache_release(page);
if (ret == nr && desc->count)
@@ -1012,16 +1012,16 @@ readpage:
* another truncate extends the file - this is desired though).
*/
isize = i_size_read(inode);
- end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+ end_index = page_cache_index(mapping, isize - 1);
if (unlikely(!isize || index > end_index)) {
page_cache_release(page);
goto out;
}
/* nr is the maximum number of bytes to copy from this page */
- nr = PAGE_CACHE_SIZE;
+ nr = page_cache_size(mapping);
if (index == end_index) {
- nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
+ nr = page_cache_offset(mapping, isize - 1) + 1;
if (nr <= offset) {
page_cache_release(page);
goto out;
@@ -1064,7 +1064,7 @@ no_cached_page:
out:
*_ra = ra;
- *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
+ *ppos = page_cache_pos(mapping, index, offset);
if (cached_page)
page_cache_release(cached_page);
if (filp)
@@ -1260,8 +1260,8 @@ asmlinkage ssize_t sys_readahead(int fd,
if (file) {
if (file->f_mode & FMODE_READ) {
struct address_space *mapping = file->f_mapping;
- unsigned long start = offset >> PAGE_CACHE_SHIFT;
- unsigned long end = (offset + count - 1) >> PAGE_CACHE_SHIFT;
+ unsigned long start = page_cache_index(mapping, offset);
+ unsigned long end = page_cache_index(mapping, offset + count - 1);
unsigned long len = end - start + 1;
ret = do_readahead(mapping, file, start, len);
}
@@ -2076,9 +2076,9 @@ generic_file_buffered_write(struct kiocb
unsigned long offset;
size_t copied;
- offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
- index = pos >> PAGE_CACHE_SHIFT;
- bytes = PAGE_CACHE_SIZE - offset;
+ offset = page_cache_offset(mapping, pos);
+ index = page_cache_index(mapping, pos);
+ bytes = page_cache_size(mapping) - offset;
/* Limit the size of the copy to the caller's write size */
bytes = min(bytes, count);
@@ -2305,8 +2305,8 @@ __generic_file_aio_write_nolock(struct k
if (err == 0) {
written = written_buffered;
invalidate_mapping_pages(mapping,
- pos >> PAGE_CACHE_SHIFT,
- endbyte >> PAGE_CACHE_SHIFT);
+ page_cache_index(mapping, pos),
+ page_cache_index(mapping, endbyte));
} else {
/*
* We don't know how much we wrote, so just return
@@ -2393,7 +2393,7 @@ generic_file_direct_IO(int rw, struct ki
*/
if (rw == WRITE) {
write_len = iov_length(iov, nr_segs);
- end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT;
+ end = page_cache_index(mapping, offset + write_len - 1);
if (mapping_mapped(mapping))
unmap_mapping_range(mapping, offset, write_len, 0);
}
@@ -2410,7 +2410,7 @@ generic_file_direct_IO(int rw, struct ki
*/
if (rw == WRITE && mapping->nrpages) {
retval = invalidate_inode_pages2_range(mapping,
- offset >> PAGE_CACHE_SHIFT, end);
+ page_cache_index(mapping, offset), end);
if (retval)
goto out;
}
@@ -2428,7 +2428,7 @@ generic_file_direct_IO(int rw, struct ki
*/
if (rw == WRITE && mapping->nrpages) {
int err = invalidate_inode_pages2_range(mapping,
- offset >> PAGE_CACHE_SHIFT, end);
+ page_cache_index(mapping, offset), end);
if (err && retval >= 0)
retval = err;
}
Index: linux-2.6.21-rc7/fs/sync.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/sync.c 2007-04-23 22:10:03.000000000 -0700
+++ linux-2.6.21-rc7/fs/sync.c 2007-04-23 22:14:27.000000000 -0700
@@ -254,8 +254,8 @@ int do_sync_file_range(struct file *file
ret = 0;
if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
ret = wait_on_page_writeback_range(mapping,
- offset >> PAGE_CACHE_SHIFT,
- endbyte >> PAGE_CACHE_SHIFT);
+ page_cache_index(mapping, offset),
+ page_cache_index(mapping, endbyte));
if (ret < 0)
goto out;
}
@@ -269,8 +269,8 @@ int do_sync_file_range(struct file *file
if (flags & SYNC_FILE_RANGE_WAIT_AFTER) {
ret = wait_on_page_writeback_range(mapping,
- offset >> PAGE_CACHE_SHIFT,
- endbyte >> PAGE_CACHE_SHIFT);
+ page_cache_index(mapping, offset),
+ page_cache_index(mapping, endbyte));
}
out:
return ret;
Index: linux-2.6.21-rc7/mm/fadvise.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/fadvise.c 2007-04-23 22:10:03.000000000 -0700
+++ linux-2.6.21-rc7/mm/fadvise.c 2007-04-23 22:22:36.000000000 -0700
@@ -79,8 +79,8 @@ asmlinkage long sys_fadvise64_64(int fd,
}
/* First and last PARTIAL page! */
- start_index = offset >> PAGE_CACHE_SHIFT;
- end_index = endbyte >> PAGE_CACHE_SHIFT;
+ start_index = page_cache_index(mapping, offset);
+ end_index = page_cache_index(mapping, endbyte);
/* Careful about overflow on the "+1" */
nrpages = end_index - start_index + 1;
@@ -100,8 +100,8 @@ asmlinkage long sys_fadvise64_64(int fd,
filemap_flush(mapping);
/* First and last FULL page! */
- start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
- end_index = (endbyte >> PAGE_CACHE_SHIFT);
+ start_index = page_cache_next(mapping, offset);
+ end_index = page_cache_index(mapping, endbyte);
if (end_index >= start_index)
invalidate_mapping_pages(mapping, start_index,
Index: linux-2.6.21-rc7/mm/page-writeback.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/page-writeback.c 2007-04-23 22:10:03.000000000 -0700
+++ linux-2.6.21-rc7/mm/page-writeback.c 2007-04-23 22:14:27.000000000 -0700
@@ -606,8 +606,8 @@ int generic_writepages(struct address_sp
index = mapping->writeback_index; /* Start from prev offset */
end = -1;
} else {
- index = wbc->range_start >> PAGE_CACHE_SHIFT;
- end = wbc->range_end >> PAGE_CACHE_SHIFT;
+ index = page_cache_index(mapping, wbc->range_start);
+ end = page_cache_index(mapping, wbc->range_end);
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;
scanned = 1;
Index: linux-2.6.21-rc7/mm/truncate.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/truncate.c 2007-04-23 22:10:03.000000000 -0700
+++ linux-2.6.21-rc7/mm/truncate.c 2007-04-23 22:14:27.000000000 -0700
@@ -46,7 +46,8 @@ void do_invalidatepage(struct page *page
static inline void truncate_partial_page(struct page *page, unsigned partial)
{
- memclear_highpage_flush(page, partial, PAGE_CACHE_SIZE-partial);
+ memclear_highpage_flush(page, partial,
+ compound_size(page) - partial);
if (PagePrivate(page))
do_invalidatepage(page, partial);
}
@@ -94,7 +95,7 @@ truncate_complete_page(struct address_sp
if (page->mapping != mapping)
return;
- cancel_dirty_page(page, PAGE_CACHE_SIZE);
+ cancel_dirty_page(page, page_cache_size(mapping));
if (PagePrivate(page))
do_invalidatepage(page, 0);
@@ -156,9 +157,9 @@ invalidate_complete_page(struct address_
void truncate_inode_pages_range(struct address_space *mapping,
loff_t lstart, loff_t lend)
{
- const pgoff_t start = (lstart + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
+ const pgoff_t start = page_cache_next(mapping, lstart);
pgoff_t end;
- const unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
+ const unsigned partial = page_cache_offset(mapping, lstart);
struct pagevec pvec;
pgoff_t next;
int i;
@@ -166,8 +167,9 @@ void truncate_inode_pages_range(struct a
if (mapping->nrpages == 0)
return;
- BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
- end = (lend >> PAGE_CACHE_SHIFT);
+ BUG_ON(page_cache_offset(mapping, lend) !=
+ page_cache_size(mapping) - 1);
+ end = page_cache_index(mapping, lend);
pagevec_init(&pvec, 0);
next = start;
@@ -402,9 +404,8 @@ int invalidate_inode_pages2_range(struct
* Zap the rest of the file in one hit.
*/
unmap_mapping_range(mapping,
- (loff_t)page_index<<PAGE_CACHE_SHIFT,
- (loff_t)(end - page_index + 1)
- << PAGE_CACHE_SHIFT,
+ page_cache_pos(mapping, page_index, 0),
+ page_cache_pos(mapping, end - page_index + 1, 0),
0);
did_range_unmap = 1;
} else {
@@ -412,8 +413,8 @@ int invalidate_inode_pages2_range(struct
* Just zap this page
*/
unmap_mapping_range(mapping,
- (loff_t)page_index<<PAGE_CACHE_SHIFT,
- PAGE_CACHE_SIZE, 0);
+ page_cache_pos(mapping, page_index, 0),
+ page_cache_size(mapping), 0);
}
}
ret = do_launder_page(mapping, page);
Index: linux-2.6.21-rc7/fs/buffer.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/buffer.c 2007-04-23 22:10:03.000000000 -0700
+++ linux-2.6.21-rc7/fs/buffer.c 2007-04-23 22:22:35.000000000 -0700
@@ -259,7 +259,7 @@ __find_get_block_slow(struct block_devic
struct page *page;
int all_mapped = 1;
- index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits);
+ index = block >> (page_cache_shift(bd_mapping) - bd_inode->i_blkbits);
page = find_get_page(bd_mapping, index);
if (!page)
goto out;
@@ -733,7 +733,7 @@ int __set_page_dirty_buffers(struct page
if (page->mapping) { /* Race with truncate? */
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
- task_io_account_write(PAGE_CACHE_SIZE);
+ task_io_account_write(page_cache_size(mapping));
}
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
@@ -879,10 +879,13 @@ struct buffer_head *alloc_page_buffers(s
{
struct buffer_head *bh, *head;
long offset;
+ unsigned page_size = page_cache_size(page->mapping);
+
+ BUG_ON(size > page_size);
try_again:
head = NULL;
- offset = PAGE_SIZE;
+ offset = page_size;
while ((offset -= size) >= 0) {
bh = alloc_buffer_head(GFP_NOFS);
if (!bh)
@@ -1418,7 +1421,7 @@ void set_bh_page(struct buffer_head *bh,
struct page *page, unsigned long offset)
{
bh->b_page = page;
- BUG_ON(offset >= PAGE_SIZE);
+ VM_BUG_ON(offset >= page_cache_size(page->mapping));
if (PageHighMem(page))
/*
* This catches illegal uses and preserves the offset:
@@ -1617,7 +1620,8 @@ static int __block_write_full_page(struc
* handle that here by just cleaning them.
*/
- block = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ block = (sector_t)page->index <<
+ (compound_shift(page) - inode->i_blkbits);
head = page_buffers(page);
bh = head;
@@ -1767,8 +1771,8 @@ static int __block_prepare_write(struct
struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
BUG_ON(!PageLocked(page));
- BUG_ON(from > PAGE_CACHE_SIZE);
- BUG_ON(to > PAGE_CACHE_SIZE);
+ BUG_ON(from > page_cache_size(inode->i_mapping));
+ BUG_ON(to > page_cache_size(inode->i_mapping));
BUG_ON(from > to);
blocksize = 1 << inode->i_blkbits;
@@ -1777,7 +1781,7 @@ static int __block_prepare_write(struct
head = page_buffers(page);
bbits = inode->i_blkbits;
- block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
+ block = (sector_t)page->index << (page_cache_shift(inode->i_mapping) - bbits);
for(bh = head, block_start = 0; bh != head || !block_start;
block++, block_start=block_end, bh = bh->b_this_page) {
@@ -1925,7 +1929,7 @@ int block_read_full_page(struct page *pa
create_empty_buffers(page, blocksize, 0);
head = page_buffers(page);
- iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ iblock = (sector_t)page->index << (page_cache_shift(page->mapping) - inode->i_blkbits);
lblock = (i_size_read(inode)+blocksize-1) >> inode->i_blkbits;
bh = head;
nr = 0;
@@ -2046,10 +2050,11 @@ out:
int generic_cont_expand(struct inode *inode, loff_t size)
{
+ struct address_space *mapping = inode->i_mapping;
pgoff_t index;
unsigned int offset;
- offset = (size & (PAGE_CACHE_SIZE - 1)); /* Within page */
+ offset = page_cache_offset(mapping, size);
/* ugh. in prepare/commit_write, if from==to==start of block, we
** skip the prepare. make sure we never send an offset for the start
@@ -2059,7 +2064,7 @@ int generic_cont_expand(struct inode *in
/* caller must handle this extra byte. */
offset++;
}
- index = size >> PAGE_CACHE_SHIFT;
+ index = page_cache_index(mapping, size);
return __generic_cont_expand(inode, size, index, offset);
}
@@ -2067,8 +2072,8 @@ int generic_cont_expand(struct inode *in
int generic_cont_expand_simple(struct inode *inode, loff_t size)
{
loff_t pos = size - 1;
- pgoff_t index = pos >> PAGE_CACHE_SHIFT;
- unsigned int offset = (pos & (PAGE_CACHE_SIZE - 1)) + 1;
+ pgoff_t index = page_cache_index(inode->i_mapping, pos);
+ unsigned int offset = page_cache_offset(inode->i_mapping, pos) + 1;
/* prepare/commit_write can handle even if from==to==start of block. */
return __generic_cont_expand(inode, size, index, offset);
@@ -2089,30 +2094,31 @@ int cont_prepare_write(struct page *page
long status;
unsigned zerofrom;
unsigned blocksize = 1 << inode->i_blkbits;
+ unsigned page_size = page_cache_size(mapping);
- while(page->index > (pgpos = *bytes>>PAGE_CACHE_SHIFT)) {
+ while(page->index > (pgpos = page_cache_index(mapping, *bytes))) {
status = -ENOMEM;
new_page = grab_cache_page(mapping, pgpos);
if (!new_page)
goto out;
/* we might sleep */
- if (*bytes>>PAGE_CACHE_SHIFT != pgpos) {
+ if (page_cache_index(mapping, *bytes) != pgpos) {
unlock_page(new_page);
page_cache_release(new_page);
continue;
}
- zerofrom = *bytes & ~PAGE_CACHE_MASK;
+ zerofrom = page_cache_offset(mapping, *bytes);
if (zerofrom & (blocksize-1)) {
*bytes |= (blocksize-1);
(*bytes)++;
}
status = __block_prepare_write(inode, new_page, zerofrom,
- PAGE_CACHE_SIZE, get_block);
+ page_size, get_block);
if (status)
goto out_unmap;
memclear_highpage_flush(new_page, zerofrom,
- PAGE_CACHE_SIZE - zerofrom);
- generic_commit_write(NULL, new_page, zerofrom, PAGE_CACHE_SIZE);
+ page_size - zerofrom);
+ generic_commit_write(NULL, new_page, zerofrom, page_size);
unlock_page(new_page);
page_cache_release(new_page);
}
@@ -2122,7 +2128,7 @@ int cont_prepare_write(struct page *page
zerofrom = offset;
} else {
/* page covers the boundary, find the boundary offset */
- zerofrom = *bytes & ~PAGE_CACHE_MASK;
+ zerofrom = page_cache_offset(mapping, *bytes);
/* if we will expand the thing last block will be filled */
if (to > zerofrom && (zerofrom & (blocksize-1))) {
@@ -2174,8 +2180,9 @@ int block_commit_write(struct page *page
int generic_commit_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
- struct inode *inode = page->mapping->host;
- loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
+ loff_t pos = page_cache_pos(mapping, page->index, to);
__block_commit_write(inode,page,from,to);
/*
* No need to use i_size_read() here, the i_size
@@ -2217,6 +2224,7 @@ static void end_buffer_read_nobh(struct
int nobh_prepare_write(struct page *page, unsigned from, unsigned to,
get_block_t *get_block)
{
+ struct address_space *mapping = page->mapping;
struct inode *inode = page->mapping->host;
const unsigned blkbits = inode->i_blkbits;
const unsigned blocksize = 1 << blkbits;
@@ -2224,6 +2232,7 @@ int nobh_prepare_write(struct page *page
struct buffer_head *read_bh[MAX_BUF_PER_PAGE];
unsigned block_in_page;
unsigned block_start;
+ unsigned page_size = page_cache_size(mapping);
sector_t block_in_file;
int nr_reads = 0;
int i;
@@ -2233,7 +2242,7 @@ int nobh_prepare_write(struct page *page
if (PageMappedToDisk(page))
return 0;
- block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
+ block_in_file = (sector_t)page->index << (page_cache_shift(mapping) - blkbits);
map_bh.b_page = page;
/*
@@ -2242,7 +2251,7 @@ int nobh_prepare_write(struct page *page
* page is fully mapped-to-disk.
*/
for (block_start = 0, block_in_page = 0;
- block_start < PAGE_CACHE_SIZE;
+ block_start < page_size;
block_in_page++, block_start += blocksize) {
unsigned block_end = block_start + blocksize;
int create;
@@ -2335,7 +2344,7 @@ failed:
* Error recovery is pretty slack. Clear the page and mark it dirty
* so we'll later zero out any blocks which _were_ allocated.
*/
- memclear_highpage_flush(page, 0, PAGE_SIZE);
+ memclear_highpage_flush(page, 0, page_cache_size(mapping));
SetPageUptodate(page);
set_page_dirty(page);
return ret;
@@ -2349,8 +2358,9 @@ EXPORT_SYMBOL(nobh_prepare_write);
int nobh_commit_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
- struct inode *inode = page->mapping->host;
- loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
+ loff_t pos = page_cache_pos(mapping, page->index, to);
SetPageUptodate(page);
set_page_dirty(page);
@@ -2370,9 +2380,10 @@ EXPORT_SYMBOL(nobh_commit_write);
int nobh_writepage(struct page *page, get_block_t *get_block,
struct writeback_control *wbc)
{
- struct inode * const inode = page->mapping->host;
+ struct address_space *mapping = page->mapping;
+ struct inode * const inode = mapping->host;
loff_t i_size = i_size_read(inode);
- const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
+ const pgoff_t end_index = page_cache_index(mapping, i_size);
unsigned offset;
int ret;
@@ -2381,7 +2392,7 @@ int nobh_writepage(struct page *page, ge
goto out;
/* Is the page fully outside i_size? (truncate in progress) */
- offset = i_size & (PAGE_CACHE_SIZE-1);
+ offset = page_cache_offset(mapping, i_size);
if (page->index >= end_index+1 || !offset) {
/*
* The page may have dirty, unmapped buffers. For example,
@@ -2404,7 +2415,8 @@ int nobh_writepage(struct page *page, ge
* the page size, the remaining memory is zeroed when mapped, and
* writes to that region are not written out to the file."
*/
- memclear_highpage_flush(page, offset, PAGE_CACHE_SIZE - offset);
+ memclear_highpage_flush(page, offset,
+ page_cache_size(mapping) - offset);
out:
ret = mpage_writepage(page, get_block, wbc);
if (ret == -EAGAIN)
@@ -2420,8 +2432,8 @@ int nobh_truncate_page(struct address_sp
{
struct inode *inode = mapping->host;
unsigned blocksize = 1 << inode->i_blkbits;
- pgoff_t index = from >> PAGE_CACHE_SHIFT;
- unsigned offset = from & (PAGE_CACHE_SIZE-1);
+ pgoff_t index = page_cache_index(mapping, from);
+ unsigned offset = page_cache_offset(mapping, from);
unsigned to;
struct page *page;
const struct address_space_operations *a_ops = mapping->a_ops;
@@ -2438,7 +2450,8 @@ int nobh_truncate_page(struct address_sp
to = (offset + blocksize) & ~(blocksize - 1);
ret = a_ops->prepare_write(NULL, page, offset, to);
if (ret == 0) {
- memclear_highpage_flush(page, offset, PAGE_CACHE_SIZE - offset);
+ memclear_highpage_flush(page, offset,
+ page_cache_size(mapping) - offset);
/*
* It would be more correct to call aops->commit_write()
* here, but this is more efficient.
@@ -2456,8 +2469,8 @@ EXPORT_SYMBOL(nobh_truncate_page);
int block_truncate_page(struct address_space *mapping,
loff_t from, get_block_t *get_block)
{
- pgoff_t index = from >> PAGE_CACHE_SHIFT;
- unsigned offset = from & (PAGE_CACHE_SIZE-1);
+ pgoff_t index = page_cache_index(mapping, from);
+ unsigned offset = page_cache_offset(mapping, from);
unsigned blocksize;
sector_t iblock;
unsigned length, pos;
@@ -2474,7 +2487,7 @@ int block_truncate_page(struct address_s
return 0;
length = blocksize - length;
- iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ iblock = (sector_t)index << (page_cache_shift(mapping) - inode->i_blkbits);
page = grab_cache_page(mapping, index);
err = -ENOMEM;
@@ -2534,9 +2547,10 @@ out:
int block_write_full_page(struct page *page, get_block_t *get_block,
struct writeback_control *wbc)
{
- struct inode * const inode = page->mapping->host;
+ struct address_space *mapping = page->mapping;
+ struct inode * const inode = mapping->host;
loff_t i_size = i_size_read(inode);
- const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
+ const pgoff_t end_index = page_cache_index(mapping, i_size);
unsigned offset;
/* Is the page fully inside i_size? */
@@ -2544,7 +2558,7 @@ int block_write_full_page(struct page *p
return __block_write_full_page(inode, page, get_block, wbc);
/* Is the page fully outside i_size? (truncate in progress) */
- offset = i_size & (PAGE_CACHE_SIZE-1);
+ offset = page_cache_offset(mapping, i_size);
if (page->index >= end_index+1 || !offset) {
/*
* The page may have dirty, unmapped buffers. For example,
@@ -2563,7 +2577,8 @@ int block_write_full_page(struct page *p
* the page size, the remaining memory is zeroed when mapped, and
* writes to that region are not written out to the file."
*/
- memclear_highpage_flush(page, offset, PAGE_CACHE_SIZE - offset);
+ memclear_highpage_flush(page, offset,
+ page_cache_size(mapping) - offset);
return __block_write_full_page(inode, page, get_block, wbc);
}
@@ -2817,7 +2832,7 @@ int try_to_free_buffers(struct page *pag
* dirty bit from being lost.
*/
if (ret)
- cancel_dirty_page(page, PAGE_CACHE_SIZE);
+ cancel_dirty_page(page, page_cache_size(mapping));
spin_unlock(&mapping->private_lock);
out:
if (buffers_to_free) {
Index: linux-2.6.21-rc7/fs/mpage.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/mpage.c 2007-04-23 22:10:03.000000000 -0700
+++ linux-2.6.21-rc7/fs/mpage.c 2007-04-23 22:15:29.000000000 -0700
@@ -133,7 +133,8 @@ mpage_alloc(struct block_device *bdev,
static void
map_buffer_to_page(struct page *page, struct buffer_head *bh, int page_block)
{
- struct inode *inode = page->mapping->host;
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
struct buffer_head *page_bh, *head;
int block = 0;
@@ -142,9 +143,9 @@ map_buffer_to_page(struct page *page, st
* don't make any buffers if there is only one buffer on
* the page and the page just needs to be set up to date
*/
- if (inode->i_blkbits == PAGE_CACHE_SHIFT &&
+ if (inode->i_blkbits == page_cache_shift(mapping) &&
buffer_uptodate(bh)) {
- SetPageUptodate(page);
+ SetPageUptodate(page);
return;
}
create_empty_buffers(page, 1 << inode->i_blkbits, 0);
@@ -177,9 +178,10 @@ do_mpage_readpage(struct bio *bio, struc
sector_t *last_block_in_bio, struct buffer_head *map_bh,
unsigned long *first_logical_block, get_block_t get_block)
{
- struct inode *inode = page->mapping->host;
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
const unsigned blkbits = inode->i_blkbits;
- const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
+ const unsigned blocks_per_page = page_cache_size(mapping) >> blkbits;
const unsigned blocksize = 1 << blkbits;
sector_t block_in_file;
sector_t last_block;
@@ -196,7 +198,7 @@ do_mpage_readpage(struct bio *bio, struc
if (page_has_buffers(page))
goto confused;
- block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
+ block_in_file = (sector_t)page->index << (page_cache_shift(mapping) - blkbits);
last_block = block_in_file + nr_pages * blocks_per_page;
last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
if (last_block > last_block_in_file)
@@ -285,7 +287,7 @@ do_mpage_readpage(struct bio *bio, struc
if (first_hole != blocks_per_page) {
memclear_highpage_flush(page, first_hole << blkbits,
- PAGE_CACHE_SIZE - (first_hole << blkbits));
+ page_cache_size(mapping) - (first_hole << blkbits));
if (first_hole == 0) {
SetPageUptodate(page);
unlock_page(page);
@@ -462,7 +464,7 @@ __mpage_writepage(struct bio *bio, struc
struct inode *inode = page->mapping->host;
const unsigned blkbits = inode->i_blkbits;
unsigned long end_index;
- const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
+ const unsigned blocks_per_page = page_cache_size(mapping) >> blkbits;
sector_t last_block;
sector_t block_in_file;
sector_t blocks[MAX_BUF_PER_PAGE];
@@ -530,7 +532,7 @@ __mpage_writepage(struct bio *bio, struc
* The page has no buffers: map it to disk
*/
BUG_ON(!PageUptodate(page));
- block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
+ block_in_file = (sector_t)page->index << (page_cache_shift(mapping) - blkbits);
last_block = (i_size - 1) >> blkbits;
map_bh.b_page = page;
for (page_block = 0; page_block < blocks_per_page; ) {
@@ -562,7 +564,7 @@ __mpage_writepage(struct bio *bio, struc
first_unmapped = page_block;
page_is_mapped:
- end_index = i_size >> PAGE_CACHE_SHIFT;
+ end_index = page_cache_index(mapping, i_size);
if (page->index >= end_index) {
/*
* The page straddles i_size. It must be zeroed out on each
@@ -572,12 +574,12 @@ page_is_mapped:
* is zeroed when mapped, and writes to that region are not
* written out to the file."
*/
- unsigned offset = i_size & (PAGE_CACHE_SIZE - 1);
+ unsigned offset = page_cache_offset(mapping, i_size);
if (page->index > end_index || !offset)
goto confused;
memclear_highpage_flush(page, offset,
- PAGE_CACHE_SIZE - offset);
+ page_cache_size(mapping) - offset);
}
/*
@@ -721,8 +723,8 @@ mpage_writepages(struct address_space *m
index = mapping->writeback_index; /* Start from prev offset */
end = -1;
} else {
- index = wbc->range_start >> PAGE_CACHE_SHIFT;
- end = wbc->range_end >> PAGE_CACHE_SHIFT;
+ index = page_cache_index(mapping, wbc->range_start);
+ end = page_cache_index(mapping, wbc->range_end);
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;
scanned = 1;
Index: linux-2.6.21-rc7/include/linux/buffer_head.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/buffer_head.h 2007-04-23 22:10:03.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/buffer_head.h 2007-04-23 22:15:29.000000000 -0700
@@ -129,7 +129,14 @@ BUFFER_FNS(Ordered, ordered)
BUFFER_FNS(Eopnotsupp, eopnotsupp)
BUFFER_FNS(Unwritten, unwritten)
-#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)
+static inline unsigned long bh_offset(struct buffer_head *bh)
+{
+ /* Cannot use the mapping since it may be set to NULL. */
+ unsigned long mask = compound_size(bh->b_page) - 1;
+
+ return (unsigned long)bh->b_data & mask;
+}
+
#define touch_buffer(bh) mark_page_accessed(bh->b_page)
/* If we *know* page->private refers to buffer_heads */
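For reference when reading the hunks above in isolation: the page_cache_*()
helpers are introduced by an earlier patch in this series and are not part of
this mail. A minimal sketch of their presumed shape (how the per-mapping order
is stored is not shown here; the mapping_order() stub below returns 0 only to
keep the sketch self-contained):

static inline int mapping_order(struct address_space *a)
{
	/* Per-mapping page order set by set_mapping_order() elsewhere in
	 * the series; 0 means plain 4k (PAGE_SIZE) pages. */
	return 0;
}

static inline unsigned int page_cache_shift(struct address_space *a)
{
	return PAGE_CACHE_SHIFT + mapping_order(a);	/* bits per cache page */
}

static inline unsigned long page_cache_size(struct address_space *a)
{
	return 1UL << page_cache_shift(a);		/* bytes per cache page */
}

static inline pgoff_t page_cache_index(struct address_space *a, loff_t pos)
{
	return pos >> page_cache_shift(a);		/* byte position -> page index */
}

static inline unsigned int page_cache_offset(struct address_space *a, loff_t pos)
{
	return pos & (page_cache_size(a) - 1);		/* byte offset within the page */
}

static inline pgoff_t page_cache_next(struct address_space *a, loff_t pos)
{
	/* round pos up to the next page boundary and return that page's index */
	return page_cache_index(a, pos + page_cache_size(a) - 1);
}

static inline loff_t page_cache_pos(struct address_space *a, pgoff_t index,
				    unsigned long offset)
{
	return ((loff_t)index << page_cache_shift(a)) + offset;
}

With mapping_order() returning 0 these reduce to the old PAGE_CACHE_*
arithmetic, so the conversion above is behaviorally a no-op for existing
4k-page users.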
--
^ permalink raw reply [flat|nested] 235+ messages in thread
* [10/17] Variable Order Page Cache: Add clearing and flushing function
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (8 preceding siblings ...)
2007-04-24 22:21 ` [09/17] Convert PAGE_CACHE_xxx -> page_cache_xxx function calls clameter
@ 2007-04-24 22:21 ` clameter
2007-04-26 7:02 ` Christoph Lameter
2007-04-24 22:21 ` [11/17] Readahead support for the variable order page cache clameter
` (16 subsequent siblings)
26 siblings, 1 reply; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
[-- Attachment #1: var_pc_flush_zero --]
[-- Type: text/plain, Size: 3803 bytes --]
Add a flushing and clearing function for higher order pages.
These are provisional and will likely have to be optimized.
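The compound_pages(), compound_size() and compound_shift() helpers used here
and in the preceding conversion patch also come from earlier in the series and
are not shown in this mail. Based on how they are used, they presumably derive
from the page's compound (allocation) order roughly as follows; this is a
sketch, not part of the patch:

static inline int compound_pages(struct page *page)
{
	return 1 << compound_order(page);		/* base pages covered */
}

static inline unsigned long compound_size(struct page *page)
{
	return PAGE_SIZE << compound_order(page);	/* bytes covered */
}

static inline unsigned int compound_shift(struct page *page)
{
	return PAGE_SHIFT + compound_order(page);
}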
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/libfs.c | 4 ++--
include/linux/highmem.h | 12 ++++++++++++
include/linux/pagemap.h | 25 +++++++++++++++++++++++++
mm/filemap.c | 4 ++--
4 files changed, 41 insertions(+), 4 deletions(-)
Index: linux-2.6.21-rc7/include/linux/pagemap.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-24 11:41:35.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-24 11:41:51.000000000 -0700
@@ -291,6 +291,31 @@ static inline void wait_on_page_writebac
extern void end_page_writeback(struct page *page);
+/* Support for clearing higher order pages */
+static inline void clear_mapping_page(struct page *page)
+{
+ int nr_pages = compound_pages(page);
+ int i;
+
+ for (i = 0; i < nr_pages; i++)
+ clear_highpage(page + i);
+}
+
+/*
+ * Support for flushing higher order pages.
+ *
+ * A bit stupid: On many platforms flushing the first page
+ * will flush any TLB starting there
+ */
+static inline void flush_mapping_page(struct page *page)
+{
+ int nr_pages = compound_pages(page);
+ int i;
+
+ for (i = 0; i < nr_pages; i++)
+ flush_dcache_page(page + i);
+}
+
/*
* Fault a userspace page into pagetables. Return non-zero on a fault.
*
Index: linux-2.6.21-rc7/fs/libfs.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/libfs.c 2007-04-24 11:41:37.000000000 -0700
+++ linux-2.6.21-rc7/fs/libfs.c 2007-04-24 11:41:51.000000000 -0700
@@ -320,8 +320,8 @@ int simple_rename(struct inode *old_dir,
int simple_readpage(struct file *file, struct page *page)
{
- clear_highpage(page);
- flush_dcache_page(page);
+ clear_mapping_page(page);
+ flush_mapping_page(page);
SetPageUptodate(page);
unlock_page(page);
return 0;
Index: linux-2.6.21-rc7/include/linux/highmem.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/highmem.h 2007-04-24 11:42:52.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/highmem.h 2007-04-24 11:59:31.000000000 -0700
@@ -81,6 +81,17 @@ static inline void clear_highpage(struct
kunmap_atomic(kaddr, KM_USER0);
}
+#ifdef CONFIG_LARGE_BLOCKSIZE
+/*
+ * A macro is needed because the flush_mapping_page() definition is not
+ * available yet. CONFIG_LARGE_BLOCKSIZE implies !HIGHMEM, so page_address()
+ * can be used directly instead of kmap_atomic().
+ */
+#define memclear_highpage_flush(__page,__offset,__size) \
+{ \
+ memset(page_address(__page) + (__offset), 0, (__size)); \
+ flush_mapping_page(__page); \
+}
+#else
/*
* Same but also flushes aliased cache contents to RAM.
*/
@@ -95,6 +106,7 @@ static inline void memclear_highpage_flu
flush_dcache_page(page);
kunmap_atomic(kaddr, KM_USER0);
}
+#endif
#ifndef __HAVE_ARCH_COPY_USER_HIGHPAGE
Index: linux-2.6.21-rc7/mm/filemap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-24 11:41:37.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap.c 2007-04-24 11:59:36.000000000 -0700
@@ -925,7 +925,7 @@ page_ok:
* before reading the page on the kernel side.
*/
if (mapping_writably_mapped(mapping))
- flush_dcache_page(page);
+ flush_mapping_page(page);
/*
* When (part of) the same page is read multiple times
@@ -2139,7 +2139,7 @@ generic_file_buffered_write(struct kiocb
else
copied = filemap_copy_from_user_iovec(page, offset,
cur_iov, iov_base, bytes);
- flush_dcache_page(page);
+ flush_mapping_page(page);
status = a_ops->commit_write(file, page, offset, offset+bytes);
if (status == AOP_TRUNCATED_PAGE) {
page_cache_release(page);
--
^ permalink raw reply [flat|nested] 235+ messages in thread
* [11/17] Readahead support for the variable order page cache
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (9 preceding siblings ...)
2007-04-24 22:21 ` [10/17] Variable Order Page Cache: Add clearing and flushing function clameter
@ 2007-04-24 22:21 ` clameter
2007-04-24 22:21 ` [12/17] Variable Page Cache Size: Fix up reclaim counters clameter
` (15 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
[-- Attachment #1: var_pc_readahead --]
[-- Type: text/plain, Size: 5070 bytes --]
Readahead is now dependent on the page size. For larger page sizes
we want less readahead.
Add a parameter to max_sane_readahead specifying the page order
and update the code in mm/readahead.c to be aware of variable
page sizes.
Mark the 2M readahead constant as a potential future problem.
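To illustrate the effect of the new order argument (the caller below is
hypothetical and not part of the patch): with a 4k base page, an order-4
mapping uses 64k pages, so the same memory budget allows 16 times fewer page
cache pages, which is exactly what the extra right shift in
max_sane_readahead() expresses.

/* Hypothetical caller, for illustration only. */
static unsigned long example_readahead_cap(struct address_space *mapping,
					   unsigned long nr_pages)
{
	/* nr_pages is counted in the mapping's (possibly large) pages */
	return max_sane_readahead(nr_pages, mapping_order(mapping));
}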
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/mm.h | 2 +-
mm/fadvise.c | 5 +++--
mm/filemap.c | 5 +++--
mm/madvise.c | 4 +++-
mm/readahead.c | 20 +++++++++++++-------
5 files changed, 23 insertions(+), 13 deletions(-)
Index: linux-2.6.21-rc7/include/linux/mm.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/mm.h 2007-04-23 22:26:28.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/mm.h 2007-04-23 22:27:44.000000000 -0700
@@ -1115,7 +1115,7 @@ unsigned long page_cache_readahead(struc
unsigned long size);
void handle_ra_miss(struct address_space *mapping,
struct file_ra_state *ra, pgoff_t offset);
-unsigned long max_sane_readahead(unsigned long nr);
+unsigned long max_sane_readahead(unsigned long nr, int order);
/* Do stack extension */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
Index: linux-2.6.21-rc7/mm/fadvise.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/fadvise.c 2007-04-23 22:26:28.000000000 -0700
+++ linux-2.6.21-rc7/mm/fadvise.c 2007-04-23 22:27:44.000000000 -0700
@@ -86,10 +86,11 @@ asmlinkage long sys_fadvise64_64(int fd,
nrpages = end_index - start_index + 1;
if (!nrpages)
nrpages = ~0UL;
-
+
ret = force_page_cache_readahead(mapping, file,
start_index,
- max_sane_readahead(nrpages));
+ max_sane_readahead(nrpages,
+ mapping_order(mapping)));
if (ret > 0)
ret = 0;
break;
Index: linux-2.6.21-rc7/mm/filemap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-23 22:26:28.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap.c 2007-04-23 22:27:44.000000000 -0700
@@ -1246,7 +1246,7 @@ do_readahead(struct address_space *mappi
return -EINVAL;
force_page_cache_readahead(mapping, filp, index,
- max_sane_readahead(nr));
+ max_sane_readahead(nr, mapping_order(mapping)));
return 0;
}
@@ -1381,7 +1381,8 @@ retry_find:
count_vm_event(PGMAJFAULT);
}
did_readaround = 1;
- ra_pages = max_sane_readahead(file->f_ra.ra_pages);
+ ra_pages = max_sane_readahead(file->f_ra.ra_pages,
+ mapping_order(mapping));
if (ra_pages) {
pgoff_t start = 0;
Index: linux-2.6.21-rc7/mm/madvise.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/madvise.c 2007-04-23 22:26:28.000000000 -0700
+++ linux-2.6.21-rc7/mm/madvise.c 2007-04-23 22:27:44.000000000 -0700
@@ -105,7 +105,9 @@ static long madvise_willneed(struct vm_a
end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
force_page_cache_readahead(file->f_mapping,
- file, start, max_sane_readahead(end - start));
+ file, start,
+ max_sane_readahead(end - start,
+ mapping_order(file->f_mapping)));
return 0;
}
Index: linux-2.6.21-rc7/mm/readahead.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/readahead.c 2007-04-23 22:26:28.000000000 -0700
+++ linux-2.6.21-rc7/mm/readahead.c 2007-04-23 22:27:44.000000000 -0700
@@ -152,7 +152,7 @@ int read_cache_pages(struct address_spac
put_pages_list(pages);
break;
}
- task_io_account_read(PAGE_CACHE_SIZE);
+ task_io_account_read(page_cache_size(mapping));
}
pagevec_lru_add(&lru_pvec);
return ret;
@@ -276,7 +276,7 @@ __do_page_cache_readahead(struct address
if (isize == 0)
goto out;
- end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+ end_index = page_cache_index(mapping, isize - 1);
/*
* Preallocate as many pages as we will need.
@@ -330,7 +330,11 @@ int force_page_cache_readahead(struct ad
while (nr_to_read) {
int err;
- unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_CACHE_SIZE;
+ /*
+ * FIXME: Note the 2M constant here that may prove to
+ * be a problem if page sizes become bigger than one megabyte.
+ */
+ unsigned long this_chunk = page_cache_index(mapping, 2 * 1024 * 1024);
if (this_chunk > nr_to_read)
this_chunk = nr_to_read;
@@ -570,11 +574,13 @@ void handle_ra_miss(struct address_space
}
/*
- * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
+ * Given a desired number of page order readahead pages, return a
* sensible upper limit.
*/
-unsigned long max_sane_readahead(unsigned long nr)
+unsigned long max_sane_readahead(unsigned long nr, int order)
{
- return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE)
- + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+ unsigned long base_pages = node_page_state(numa_node_id(), NR_INACTIVE)
+ + node_page_state(numa_node_id(), NR_FREE_PAGES);
+
+ return min(nr, (base_pages / 2) >> order);
}
--
^ permalink raw reply [flat|nested] 235+ messages in thread
* [12/17] Variable Page Cache Size: Fix up reclaim counters
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (10 preceding siblings ...)
2007-04-24 22:21 ` [11/17] Readahead support for the variable order page cache clameter
@ 2007-04-24 22:21 ` clameter
2007-04-24 22:21 ` [13/17] set_blocksize: Allow to set a larger block size than PAGE_SIZE clameter
` (14 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
[-- Attachment #1: var_pc_reclaim --]
[-- Type: text/plain, Size: 2417 bytes --]
We can now reclaim larger pages. Adjust the VM counters to deal with it.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/vmscan.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
Index: linux-2.6.21-rc7/mm/vmscan.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/vmscan.c 2007-04-23 22:15:30.000000000 -0700
+++ linux-2.6.21-rc7/mm/vmscan.c 2007-04-23 22:19:21.000000000 -0700
@@ -471,14 +471,14 @@ static unsigned long shrink_page_list(st
VM_BUG_ON(PageActive(page));
- sc->nr_scanned++;
+ sc->nr_scanned += compound_pages(page);
if (!sc->may_swap && page_mapped(page))
goto keep_locked;
/* Double the slab pressure for mapped and swapcache pages */
if (page_mapped(page) || PageSwapCache(page))
- sc->nr_scanned++;
+ sc->nr_scanned += compound_pages(page);
if (PageWriteback(page))
goto keep_locked;
@@ -581,7 +581,7 @@ static unsigned long shrink_page_list(st
free_it:
unlock_page(page);
- nr_reclaimed++;
+ nr_reclaimed += compound_pages(page);
if (!pagevec_add(&freed_pvec, page))
__pagevec_release_nonlru(&freed_pvec);
continue;
@@ -627,7 +627,7 @@ static unsigned long isolate_lru_pages(u
struct page *page;
unsigned long scan;
- for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
+ for (scan = 0; scan < nr_to_scan && !list_empty(src); ) {
struct list_head *target;
page = lru_to_page(src);
prefetchw_prev_lru_page(page, src, flags);
@@ -644,10 +644,11 @@ static unsigned long isolate_lru_pages(u
*/
ClearPageLRU(page);
target = dst;
- nr_taken++;
+ nr_taken += compound_pages(page);
} /* else it is being freed elsewhere */
list_add(&page->lru, target);
+ scan += compound_pages(page);
}
*scanned = scan;
@@ -856,7 +857,7 @@ force_reclaim_mapped:
ClearPageActive(page);
list_move(&page->lru, &zone->inactive_list);
- pgmoved++;
+ pgmoved += compound_pages(page);
if (!pagevec_add(&pvec, page)) {
__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
spin_unlock_irq(&zone->lru_lock);
@@ -884,7 +885,7 @@ force_reclaim_mapped:
SetPageLRU(page);
VM_BUG_ON(!PageActive(page));
list_move(&page->lru, &zone->active_list);
- pgmoved++;
+ pgmoved += compound_pages(page);
if (!pagevec_add(&pvec, page)) {
__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
pgmoved = 0;
--
^ permalink raw reply [flat|nested] 235+ messages in thread
* [13/17] set_blocksize: Allow to set a larger block size than PAGE_SIZE
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (11 preceding siblings ...)
2007-04-24 22:21 ` [12/17] Variable Page Cache Size: Fix up reclaim counters clameter
@ 2007-04-24 22:21 ` clameter
2007-04-24 22:21 ` [14/17] Add VM_BUG_ONs to check for correct page order clameter
` (13 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
[-- Attachment #1: var_pc_buffer_head --]
[-- Type: text/plain, Size: 3345 bytes --]
set_blocksize is changed to allow specifying a blocksize larger than a
page. If that occurs then we switch the device to use compound pages.
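As a rough illustration of the relaxed interface (the caller below is made up
and not part of the patch): a filesystem that wants 64k blocks on a machine
with 4k pages is no longer rejected; set_blocksize() switches the block
device's mapping over to compound pages of the matching order.

/* Hypothetical caller, for illustration only. */
static int example_use_64k_blocks(struct block_device *bdev)
{
	int err = set_blocksize(bdev, 64 * 1024);	/* returned -EINVAL before this patch */

	if (err)
		return err;
	/* bdev->bd_inode->i_mapping is now an order-4 mapping on a 4k-page box */
	return 0;
}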
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/block_dev.c | 22 +++++++++++++++-------
fs/buffer.c | 2 +-
fs/inode.c | 5 +++++
3 files changed, 21 insertions(+), 8 deletions(-)
Index: linux-2.6.21-rc7/fs/block_dev.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/block_dev.c 2007-04-23 23:13:16.000000000 -0700
+++ linux-2.6.21-rc7/fs/block_dev.c 2007-04-23 23:13:19.000000000 -0700
@@ -60,12 +60,12 @@ static void kill_bdev(struct block_devic
{
invalidate_bdev(bdev, 1);
truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
-}
+}
int set_blocksize(struct block_device *bdev, int size)
{
- /* Size must be a power of two, and between 512 and PAGE_SIZE */
- if (size > PAGE_SIZE || size < 512 || (size & (size-1)))
+ /* Size must be a power of two, and at least 512 */
+ if (size < 512 || (size & (size-1)))
return -EINVAL;
/* Size cannot be smaller than the size supported by the device */
@@ -74,10 +74,16 @@ int set_blocksize(struct block_device *b
/* Don't change the size if it is same as current */
if (bdev->bd_block_size != size) {
+ int bits = blksize_bits(size);
+ struct address_space *mapping =
+ bdev->bd_inode->i_mapping;
+
sync_blockdev(bdev);
- bdev->bd_block_size = size;
- bdev->bd_inode->i_blkbits = blksize_bits(size);
kill_bdev(bdev);
+ bdev->bd_block_size = size;
+ bdev->bd_inode->i_blkbits = bits;
+ set_mapping_order(mapping,
+ bits < PAGE_SHIFT ? 0 : bits - PAGE_SHIFT);
}
return 0;
}
@@ -88,8 +94,10 @@ int sb_set_blocksize(struct super_block
{
if (set_blocksize(sb->s_bdev, size))
return 0;
- /* If we get here, we know size is power of two
- * and it's value is between 512 and PAGE_SIZE */
+ /*
+ * If we get here, we know size is power of two
+ * and its value is at least 512
+ */
sb->s_blocksize = size;
sb->s_blocksize_bits = blksize_bits(size);
return sb->s_blocksize;
Index: linux-2.6.21-rc7/fs/buffer.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/buffer.c 2007-04-23 23:13:16.000000000 -0700
+++ linux-2.6.21-rc7/fs/buffer.c 2007-04-23 23:13:19.000000000 -0700
@@ -1084,7 +1084,7 @@ __getblk_slow(struct block_device *bdev,
{
/* Size must be multiple of hard sectorsize */
if (unlikely(size & (bdev_hardsect_size(bdev)-1) ||
- (size < 512 || size > PAGE_SIZE))) {
+ size < 512)) {
printk(KERN_ERR "getblk(): invalid block size %d requested\n",
size);
printk(KERN_ERR "hardsect size: %d\n",
Index: linux-2.6.21-rc7/fs/inode.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/inode.c 2007-04-23 23:14:36.000000000 -0700
+++ linux-2.6.21-rc7/fs/inode.c 2007-04-23 23:17:26.000000000 -0700
@@ -146,6 +146,11 @@ static struct inode *alloc_inode(struct
mapping->host = inode;
mapping->flags = 0;
mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
+ if (inode->i_blkbits > PAGE_SHIFT)
+ set_mapping_order(mapping,
+ inode->i_blkbits - PAGE_SHIFT);
+ else
+ set_mapping_order(mapping, 0);
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
--
^ permalink raw reply [flat|nested] 235+ messages in thread
* [14/17] Add VM_BUG_ONs to check for correct page order
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (12 preceding siblings ...)
2007-04-24 22:21 ` [13/17] set_blocksize: Allow to set a larger block size than PAGE_SIZE clameter
@ 2007-04-24 22:21 ` clameter
2007-04-24 22:21 ` [15/17] ramfs: Variable order page cache support clameter
` (12 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
[-- Attachment #1: var_pc_guards --]
[-- Type: text/plain, Size: 3663 bytes --]
Before we start changing the page order we'd better get some debugging
in there that trips us up whenever a wrong-order page shows up in a
mapping. This will be helpful for converting new filesystems to
utilize higher orders.
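All of the checks below take the same form; as a sketch, the invariant could
equally be expressed through one helper (the patch open-codes it at every call
site instead, and the helper below does not exist in the series):

/* Hypothetical helper, shown only to spell out the invariant. */
static inline void check_mapping_page_order(struct address_space *mapping,
					    struct page *page)
{
	VM_BUG_ON(mapping_order(mapping) != compound_order(page));
}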
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/filemap.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
Index: linux-2.6.21-rc7/mm/filemap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-24 11:45:26.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap.c 2007-04-24 11:46:04.000000000 -0700
@@ -127,6 +127,7 @@ void remove_from_page_cache(struct page
struct address_space *mapping = page->mapping;
BUG_ON(!PageLocked(page));
+ VM_BUG_ON(mapping_order(mapping) != compound_order(page));
write_lock_irq(&mapping->tree_lock);
__remove_from_page_cache(page);
@@ -268,6 +269,7 @@ int wait_on_page_writeback_range(struct
if (page->index > end)
continue;
+ VM_BUG_ON(mapping_order(mapping) != compound_order(page));
wait_on_page_writeback(page);
if (PageError(page))
ret = -EIO;
@@ -439,6 +441,7 @@ int add_to_page_cache(struct page *page,
{
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+ VM_BUG_ON(mapping_order(mapping) != compound_order(page));
if (error == 0) {
write_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
@@ -598,8 +601,10 @@ struct page * find_get_page(struct addre
read_lock_irq(&mapping->tree_lock);
page = radix_tree_lookup(&mapping->page_tree, offset);
- if (page)
+ if (page) {
+ VM_BUG_ON(mapping_order(mapping) != compound_order(page));
page_cache_get(page);
+ }
read_unlock_irq(&mapping->tree_lock);
return page;
}
@@ -624,6 +629,7 @@ struct page *find_lock_page(struct addre
repeat:
page = radix_tree_lookup(&mapping->page_tree, offset);
if (page) {
+ VM_BUG_ON(mapping_order(mapping) != compound_order(page));
page_cache_get(page);
if (TestSetPageLocked(page)) {
read_unlock_irq(&mapping->tree_lock);
@@ -685,6 +691,7 @@ repeat:
} else if (err == -EEXIST)
goto repeat;
}
+ VM_BUG_ON(mapping_order(mapping) != compound_order(page));
if (cached_page)
page_cache_release(cached_page);
return page;
@@ -716,8 +723,10 @@ unsigned find_get_pages(struct address_s
read_lock_irq(&mapping->tree_lock);
ret = radix_tree_gang_lookup(&mapping->page_tree,
(void **)pages, start, nr_pages);
- for (i = 0; i < ret; i++)
+ for (i = 0; i < ret; i++) {
+ VM_BUG_ON(mapping_order(mapping) != compound_order(pages[i]));
page_cache_get(pages[i]);
+ }
read_unlock_irq(&mapping->tree_lock);
return ret;
}
@@ -747,6 +756,7 @@ unsigned find_get_pages_contig(struct ad
if (pages[i]->mapping == NULL || pages[i]->index != index)
break;
+ VM_BUG_ON(mapping_order(mapping) != compound_order(pages[i]));
page_cache_get(pages[i]);
index++;
}
@@ -774,8 +784,10 @@ unsigned find_get_pages_tag(struct addre
read_lock_irq(&mapping->tree_lock);
ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
(void **)pages, *index, nr_pages, tag);
- for (i = 0; i < ret; i++)
+ for (i = 0; i < ret; i++) {
+ VM_BUG_ON(mapping_order(mapping) != compound_order(pages[i]));
page_cache_get(pages[i]);
+ }
if (ret)
*index = pages[ret - 1]->index + 1;
read_unlock_irq(&mapping->tree_lock);
@@ -2457,6 +2469,7 @@ int try_to_release_page(struct page *pag
struct address_space * const mapping = page->mapping;
BUG_ON(!PageLocked(page));
+ VM_BUG_ON(mapping_order(mapping) != compound_order(page));
if (PageWriteback(page))
return 0;
--
^ permalink raw reply [flat|nested] 235+ messages in thread
* [15/17] ramfs: Variable order page cache support
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (13 preceding siblings ...)
2007-04-24 22:21 ` [14/17] Add VM_BUG_ONs to check for correct page order clameter
@ 2007-04-24 22:21 ` clameter
2007-04-24 22:21 ` [16/17] ext2: " clameter
` (11 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
[-- Attachment #1: var_pc_ramfs --]
[-- Type: text/plain, Size: 3969 bytes --]
The simplest file system to use is ramfs. Add a mount parameter that
specifies the page order of the pages that ramfs should use. If the
order is greater than zero then mmap functionality is disabled.
This restriction could be removed if the VM were changed to support faulting
higher order pages, but for now we are content with buffered I/O on higher
order pages.
Note that ramfs does not use the lower layers (buffer I/O etc), so it is
the safest one to use right now.
If you apply this patch and then you can f.e. try this:
mount -tramfs -o10 none /media
Mounts a ramfs filesystem with order 10 pages (4 MB)
cp linux-2.6.21-rc7.tar.gz /media
Populate the ramfs. Note that we allocate 14 pages of 4M each
instead of the 13508 4k pages it would otherwise take.
umount /media
Gets rid of the large pages again
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/ramfs/file-mmu.c | 11 +++++++++++
fs/ramfs/inode.c | 16 +++++++++++++---
include/linux/ramfs.h | 1 +
3 files changed, 25 insertions(+), 3 deletions(-)
Index: linux-2.6.21-rc7/fs/ramfs/file-mmu.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/ramfs/file-mmu.c 2007-04-23 22:26:18.000000000 -0700
+++ linux-2.6.21-rc7/fs/ramfs/file-mmu.c 2007-04-23 22:27:51.000000000 -0700
@@ -45,6 +45,17 @@ const struct file_operations ramfs_file_
.llseek = generic_file_llseek,
};
+/* Higher order mappings do not support mmap */
+const struct file_operations ramfs_file_higher_order_operations = {
+ .read = do_sync_read,
+ .aio_read = generic_file_aio_read,
+ .write = do_sync_write,
+ .aio_write = generic_file_aio_write,
+ .fsync = simple_sync_file,
+ .sendfile = generic_file_sendfile,
+ .llseek = generic_file_llseek,
+};
+
const struct inode_operations ramfs_file_inode_operations = {
.getattr = simple_getattr,
};
Index: linux-2.6.21-rc7/fs/ramfs/inode.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/ramfs/inode.c 2007-04-23 22:26:18.000000000 -0700
+++ linux-2.6.21-rc7/fs/ramfs/inode.c 2007-04-23 22:33:23.000000000 -0700
@@ -61,6 +61,8 @@ struct inode *ramfs_get_inode(struct sup
inode->i_blocks = 0;
inode->i_mapping->a_ops = &ramfs_aops;
inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
+ set_mapping_order(inode->i_mapping,
+ sb->s_blocksize_bits - PAGE_CACHE_SHIFT);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch (mode & S_IFMT) {
default:
@@ -68,7 +70,10 @@ struct inode *ramfs_get_inode(struct sup
break;
case S_IFREG:
inode->i_op = &ramfs_file_inode_operations;
- inode->i_fop = &ramfs_file_operations;
+ if (mapping_order(inode->i_mapping))
+ inode->i_fop = &ramfs_file_higher_order_operations;
+ else
+ inode->i_fop = &ramfs_file_operations;
break;
case S_IFDIR:
inode->i_op = &ramfs_dir_inode_operations;
@@ -164,10 +169,15 @@ static int ramfs_fill_super(struct super
{
struct inode * inode;
struct dentry * root;
+ int order = 0;
+ char *options = data;
+
+ if (options && *options)
+ order = simple_strtoul(options, NULL, 10);
sb->s_maxbytes = MAX_LFS_FILESIZE;
- sb->s_blocksize = PAGE_CACHE_SIZE;
- sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
+ sb->s_blocksize = PAGE_CACHE_SIZE << order;
+ sb->s_blocksize_bits = order + PAGE_CACHE_SHIFT;
sb->s_magic = RAMFS_MAGIC;
sb->s_op = &ramfs_ops;
sb->s_time_gran = 1;
Index: linux-2.6.21-rc7/include/linux/ramfs.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/ramfs.h 2007-04-23 22:26:18.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/ramfs.h 2007-04-23 22:27:51.000000000 -0700
@@ -16,6 +16,7 @@ extern int ramfs_nommu_mmap(struct file
#endif
extern const struct file_operations ramfs_file_operations;
+extern const struct file_operations ramfs_file_higher_order_operations;
extern struct vm_operations_struct generic_file_vm_ops;
extern int __init init_rootfs(void);
--
^ permalink raw reply [flat|nested] 235+ messages in thread
* [16/17] ext2: Variable order page cache support
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (14 preceding siblings ...)
2007-04-24 22:21 ` [15/17] ramfs: Variable order page cache support clameter
@ 2007-04-24 22:21 ` clameter
2007-04-24 22:21 ` [17/17] xfs: " clameter
` (10 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
[-- Attachment #1: var_pc_ext2 --]
[-- Type: text/plain, Size: 9904 bytes --]
This adds variable page size support. It is then possible to mount filesystems
that have a larger blocksize than the page size.
F.e. the following is possible on x86_64 and i386 that have only a 4k page
size.
mke2fs -b 16384 /dev/hdd2 <Ignore warning about too large block size>
mount /dev/hdd2 /media
ls -l /media
.... Do more things with the volume that uses a 16k page cache size on
a 4k page sized platform..
We disable mmap for higher order pages like also done for ramfs. This
is temporary until we get support for mmapping higher order pages.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/ext2/dir.c | 40 +++++++++++++++++++++++-----------------
fs/ext2/ext2.h | 1 +
fs/ext2/file.c | 18 ++++++++++++++++++
fs/ext2/inode.c | 10 ++++++++--
fs/ext2/namei.c | 10 ++++++++--
5 files changed, 58 insertions(+), 21 deletions(-)
Index: linux-2.6.21-rc7/fs/ext2/dir.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/ext2/dir.c 2007-04-23 22:26:18.000000000 -0700
+++ linux-2.6.21-rc7/fs/ext2/dir.c 2007-04-23 22:27:52.000000000 -0700
@@ -44,7 +44,8 @@ static inline void ext2_put_page(struct
static inline unsigned long dir_pages(struct inode *inode)
{
- return (inode->i_size+PAGE_CACHE_SIZE-1)>>PAGE_CACHE_SHIFT;
+ return (inode->i_size+page_cache_size(inode->i_mapping)-1)>>
+ page_cache_shift(inode->i_mapping);
}
/*
@@ -55,10 +56,11 @@ static unsigned
ext2_last_byte(struct inode *inode, unsigned long page_nr)
{
unsigned last_byte = inode->i_size;
+ struct address_space *mapping = inode->i_mapping;
- last_byte -= page_nr << PAGE_CACHE_SHIFT;
- if (last_byte > PAGE_CACHE_SIZE)
- last_byte = PAGE_CACHE_SIZE;
+ last_byte -= page_nr << page_cache_shift(mapping);
+ if (last_byte > page_cache_size(mapping))
+ last_byte = page_cache_size(mapping);
return last_byte;
}
@@ -77,18 +79,19 @@ static int ext2_commit_chunk(struct page
static void ext2_check_page(struct page *page)
{
- struct inode *dir = page->mapping->host;
+ struct address_space *mapping = page->mapping;
+ struct inode *dir = mapping->host;
struct super_block *sb = dir->i_sb;
unsigned chunk_size = ext2_chunk_size(dir);
char *kaddr = page_address(page);
u32 max_inumber = le32_to_cpu(EXT2_SB(sb)->s_es->s_inodes_count);
unsigned offs, rec_len;
- unsigned limit = PAGE_CACHE_SIZE;
+ unsigned limit = page_cache_size(mapping);
ext2_dirent *p;
char *error;
- if ((dir->i_size >> PAGE_CACHE_SHIFT) == page->index) {
- limit = dir->i_size & ~PAGE_CACHE_MASK;
+ if (page_cache_index(mapping, dir->i_size) == page->index) {
+ limit = page_cache_offset(mapping, dir->i_size);
if (limit & (chunk_size - 1))
goto Ebadsize;
if (!limit)
@@ -140,7 +143,7 @@ Einumber:
bad_entry:
ext2_error (sb, "ext2_check_page", "bad entry in directory #%lu: %s - "
"offset=%lu, inode=%lu, rec_len=%d, name_len=%d",
- dir->i_ino, error, (page->index<<PAGE_CACHE_SHIFT)+offs,
+ dir->i_ino, error, page_cache_pos(mapping, page->index, offs),
(unsigned long) le32_to_cpu(p->inode),
rec_len, p->name_len);
goto fail;
@@ -149,7 +152,7 @@ Eend:
ext2_error (sb, "ext2_check_page",
"entry in directory #%lu spans the page boundary"
"offset=%lu, inode=%lu",
- dir->i_ino, (page->index<<PAGE_CACHE_SHIFT)+offs,
+ dir->i_ino, page_cache_pos(mapping, page->index, offs),
(unsigned long) le32_to_cpu(p->inode));
fail:
SetPageChecked(page);
@@ -250,8 +253,9 @@ ext2_readdir (struct file * filp, void *
loff_t pos = filp->f_pos;
struct inode *inode = filp->f_path.dentry->d_inode;
struct super_block *sb = inode->i_sb;
- unsigned int offset = pos & ~PAGE_CACHE_MASK;
- unsigned long n = pos >> PAGE_CACHE_SHIFT;
+ struct address_space *mapping = inode->i_mapping;
+ unsigned int offset = page_cache_offset(mapping, pos);
+ unsigned long n = page_cache_index(mapping, pos);
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(ext2_chunk_size(inode)-1);
unsigned char *types = NULL;
@@ -272,14 +276,14 @@ ext2_readdir (struct file * filp, void *
ext2_error(sb, __FUNCTION__,
"bad page in #%lu",
inode->i_ino);
- filp->f_pos += PAGE_CACHE_SIZE - offset;
+ filp->f_pos += page_cache_size(mapping) - offset;
return -EIO;
}
kaddr = page_address(page);
if (unlikely(need_revalidate)) {
if (offset) {
offset = ext2_validate_entry(kaddr, offset, chunk_mask);
- filp->f_pos = (n<<PAGE_CACHE_SHIFT) + offset;
+ filp->f_pos = page_cache_pos(mapping, n, offset);
}
filp->f_version = inode->i_version;
need_revalidate = 0;
@@ -302,7 +306,7 @@ ext2_readdir (struct file * filp, void *
offset = (char *)de - kaddr;
over = filldir(dirent, de->name, de->name_len,
- (n<<PAGE_CACHE_SHIFT) | offset,
+ page_cache_pos(mapping, n, offset),
le32_to_cpu(de->inode), d_type);
if (over) {
ext2_put_page(page);
@@ -328,6 +332,7 @@ struct ext2_dir_entry_2 * ext2_find_entr
struct dentry *dentry, struct page ** res_page)
{
const char *name = dentry->d_name.name;
+ struct address_space *mapping = dir->i_mapping;
int namelen = dentry->d_name.len;
unsigned reclen = EXT2_DIR_REC_LEN(namelen);
unsigned long start, n;
@@ -369,7 +374,7 @@ struct ext2_dir_entry_2 * ext2_find_entr
if (++n >= npages)
n = 0;
/* next page is past the blocks we've got */
- if (unlikely(n > (dir->i_blocks >> (PAGE_CACHE_SHIFT - 9)))) {
+ if (unlikely(n > (dir->i_blocks >> (page_cache_shift(mapping) - 9)))) {
ext2_error(dir->i_sb, __FUNCTION__,
"dir %lu size %lld exceeds block count %llu",
dir->i_ino, dir->i_size,
@@ -438,6 +443,7 @@ void ext2_set_link(struct inode *dir, st
int ext2_add_link (struct dentry *dentry, struct inode *inode)
{
struct inode *dir = dentry->d_parent->d_inode;
+ struct address_space *mapping = dir->i_mapping;
const char *name = dentry->d_name.name;
int namelen = dentry->d_name.len;
unsigned chunk_size = ext2_chunk_size(dir);
@@ -467,7 +473,7 @@ int ext2_add_link (struct dentry *dentry
kaddr = page_address(page);
dir_end = kaddr + ext2_last_byte(dir, n);
de = (ext2_dirent *)kaddr;
- kaddr += PAGE_CACHE_SIZE - reclen;
+ kaddr += page_cache_size(mapping) - reclen;
while ((char *)de <= kaddr) {
if ((char *)de == dir_end) {
/* We hit i_size */
Index: linux-2.6.21-rc7/fs/ext2/ext2.h
===================================================================
--- linux-2.6.21-rc7.orig/fs/ext2/ext2.h 2007-04-23 22:26:18.000000000 -0700
+++ linux-2.6.21-rc7/fs/ext2/ext2.h 2007-04-23 22:27:52.000000000 -0700
@@ -160,6 +160,7 @@ extern const struct file_operations ext2
/* file.c */
extern const struct inode_operations ext2_file_inode_operations;
extern const struct file_operations ext2_file_operations;
+extern const struct file_operations ext2_no_mmap_file_operations;
extern const struct file_operations ext2_xip_file_operations;
/* inode.c */
Index: linux-2.6.21-rc7/fs/ext2/file.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/ext2/file.c 2007-04-23 22:26:18.000000000 -0700
+++ linux-2.6.21-rc7/fs/ext2/file.c 2007-04-23 22:27:52.000000000 -0700
@@ -58,6 +58,24 @@ const struct file_operations ext2_file_o
.splice_write = generic_file_splice_write,
};
+const struct file_operations ext2_no_mmap_file_operations = {
+ .llseek = generic_file_llseek,
+ .read = do_sync_read,
+ .write = do_sync_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
+ .ioctl = ext2_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = ext2_compat_ioctl,
+#endif
+ .open = generic_file_open,
+ .release = ext2_release_file,
+ .fsync = ext2_sync_file,
+ .sendfile = generic_file_sendfile,
+ .splice_read = generic_file_splice_read,
+ .splice_write = generic_file_splice_write,
+};
+
#ifdef CONFIG_EXT2_FS_XIP
const struct file_operations ext2_xip_file_operations = {
.llseek = generic_file_llseek,
Index: linux-2.6.21-rc7/fs/ext2/inode.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/ext2/inode.c 2007-04-23 22:26:18.000000000 -0700
+++ linux-2.6.21-rc7/fs/ext2/inode.c 2007-04-23 22:29:49.000000000 -0700
@@ -1128,10 +1128,16 @@ void ext2_read_inode (struct inode * ino
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
- inode->i_fop = &ext2_file_operations;
+ if (mapping_order(inode->i_mapping))
+ inode->i_fop = &ext2_no_mmap_file_operations;
+ else
+ inode->i_fop = &ext2_file_operations;
} else {
inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_file_operations;
+ if (mapping_order(inode->i_mapping))
+ inode->i_fop = &ext2_no_mmap_file_operations;
+ else
+ inode->i_fop = &ext2_file_operations;
}
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = &ext2_dir_inode_operations;
Index: linux-2.6.21-rc7/fs/ext2/namei.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/ext2/namei.c 2007-04-23 22:26:18.000000000 -0700
+++ linux-2.6.21-rc7/fs/ext2/namei.c 2007-04-23 22:30:56.000000000 -0700
@@ -114,10 +114,16 @@ static int ext2_create (struct inode * d
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
- inode->i_fop = &ext2_file_operations;
+ if (mapping_order(inode->i_mapping))
+ inode->i_fop = &ext2_no_mmap_file_operations;
+ else
+ inode->i_fop = &ext2_file_operations;
} else {
inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_file_operations;
+ if (mapping_order(inode->i_mapping))
+ inode->i_fop = &ext2_no_mmap_file_operations;
+ else
+ inode->i_fop = &ext2_file_operations;
}
mark_inode_dirty(inode);
err = ext2_add_nondir(dentry, inode);
--
^ permalink raw reply [flat|nested] 235+ messages in thread
* [17/17] xfs: Variable order page cache support
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (15 preceding siblings ...)
2007-04-24 22:21 ` [16/17] ext2: " clameter
@ 2007-04-24 22:21 ` clameter
2007-04-25 0:46 ` [00/17] Large Blocksize Support V3 Jörn Engel
` (9 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: clameter @ 2007-04-24 22:21 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
[-- Attachment #1: var_pc_xfs --]
[-- Type: text/plain, Size: 10960 bytes --]
From: David Chinner <dgc@sgi.com>
Patch is attached that converts the XFS data path to use large order
page cache pages.
I haven't tested this on a real system yet but it works on UML. I've
tested it with fsx and it seems to do everything it is supposed to.
Data is actually written to the block device as it persists across
mount and unmount, so that appears to be working as well.
> - Lets try to keep scope as small as possible.
Hence I haven't tried to convert anything on the metadata side
of XFS to use the high order page cache - the XFS buffer cache
takes care of that for us right now and it's not a simple
change like the data path is.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
---
fs/xfs/linux-2.6/xfs_aops.c | 53 ++++++++++++++++++++++---------------------
fs/xfs/linux-2.6/xfs_file.c | 22 +++++++++++++++++
fs/xfs/linux-2.6/xfs_iops.h | 1
fs/xfs/linux-2.6/xfs_lrw.c | 6 ++--
fs/xfs/linux-2.6/xfs_super.c | 5 +++-
fs/xfs/xfs_mount.c | 13 ----------
6 files changed, 58 insertions(+), 42 deletions(-)
Index: linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-04-23 22:26:17.000000000 -0700
+++ linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_aops.c 2007-04-23 22:27:52.000000000 -0700
@@ -74,7 +74,7 @@ xfs_page_trace(
xfs_inode_t *ip;
bhv_vnode_t *vp = vn_from_inode(inode);
loff_t isize = i_size_read(inode);
- loff_t offset = page_offset(page);
+ loff_t offset = page_cache_offset(page->mapping);
int delalloc = -1, unmapped = -1, unwritten = -1;
if (page_has_buffers(page))
@@ -547,7 +547,7 @@ xfs_probe_page(
break;
} while ((bh = bh->b_this_page) != head);
} else
- ret = mapped ? 0 : PAGE_CACHE_SIZE;
+ ret = mapped ? 0 : page_cache_size(page->mapping);
}
return ret;
@@ -574,7 +574,7 @@ xfs_probe_cluster(
} while ((bh = bh->b_this_page) != head);
/* if we reached the end of the page, sum forwards in following pages */
- tlast = i_size_read(inode) >> PAGE_CACHE_SHIFT;
+ tlast = page_cache_index(inode->i_mapping, i_size_read(inode));
tindex = startpage->index + 1;
/* Prune this back to avoid pathological behavior */
@@ -592,14 +592,14 @@ xfs_probe_cluster(
size_t pg_offset, len = 0;
if (tindex == tlast) {
- pg_offset =
- i_size_read(inode) & (PAGE_CACHE_SIZE - 1);
+ pg_offset = page_cache_offset(inode->i_mapping,
+ i_size_read(inode));
if (!pg_offset) {
done = 1;
break;
}
} else
- pg_offset = PAGE_CACHE_SIZE;
+ pg_offset = page_cache_size(inode->i_mapping);
if (page->index == tindex && !TestSetPageLocked(page)) {
len = xfs_probe_page(page, pg_offset, mapped);
@@ -681,7 +681,8 @@ xfs_convert_page(
int bbits = inode->i_blkbits;
int len, page_dirty;
int count = 0, done = 0, uptodate = 1;
- xfs_off_t offset = page_offset(page);
+ struct address_space *map = inode->i_mapping;
+ xfs_off_t offset = page_cache_pos(map, page->index, 0);
if (page->index != tindex)
goto fail;
@@ -689,7 +690,7 @@ xfs_convert_page(
goto fail;
if (PageWriteback(page))
goto fail_unlock_page;
- if (page->mapping != inode->i_mapping)
+ if (page->mapping != map)
goto fail_unlock_page;
if (!xfs_is_delayed_page(page, (*ioendp)->io_type))
goto fail_unlock_page;
@@ -701,20 +702,20 @@ xfs_convert_page(
* Derivation:
*
* End offset is the highest offset that this page should represent.
- * If we are on the last page, (end_offset & (PAGE_CACHE_SIZE - 1))
- * will evaluate non-zero and be less than PAGE_CACHE_SIZE and
+ * If we are on the last page, (end_offset & page_cache_mask())
+ * will evaluate non-zero and be less than page_cache_size() and
* hence give us the correct page_dirty count. On any other page,
* it will be zero and in that case we need page_dirty to be the
* count of buffers on the page.
*/
end_offset = min_t(unsigned long long,
- (xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT,
+ (xfs_off_t)(page->index + 1) << page_cache_shift(map),
i_size_read(inode));
len = 1 << inode->i_blkbits;
- p_offset = min_t(unsigned long, end_offset & (PAGE_CACHE_SIZE - 1),
- PAGE_CACHE_SIZE);
- p_offset = p_offset ? roundup(p_offset, len) : PAGE_CACHE_SIZE;
+ p_offset = min_t(unsigned long, page_cache_offset(map, end_offset),
+ page_cache_size(map));
+ p_offset = p_offset ? roundup(p_offset, len) : page_cache_size(map);
page_dirty = p_offset / len;
bh = head = page_buffers(page);
@@ -870,6 +871,7 @@ xfs_page_state_convert(
int page_dirty, count = 0;
int trylock = 0;
int all_bh = unmapped;
+ struct address_space *map = inode->i_mapping;
if (startio) {
if (wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking)
@@ -878,11 +880,11 @@ xfs_page_state_convert(
/* Is this page beyond the end of the file? */
offset = i_size_read(inode);
- end_index = offset >> PAGE_CACHE_SHIFT;
- last_index = (offset - 1) >> PAGE_CACHE_SHIFT;
+ end_index = page_cache_index(map, offset);
+ last_index = page_cache_index(map, (offset - 1));
if (page->index >= end_index) {
if ((page->index >= end_index + 1) ||
- !(i_size_read(inode) & (PAGE_CACHE_SIZE - 1))) {
+ !(page_cache_offset(map, i_size_read(inode)))) {
if (startio)
unlock_page(page);
return 0;
@@ -896,22 +898,23 @@ xfs_page_state_convert(
* Derivation:
*
* End offset is the highest offset that this page should represent.
- * If we are on the last page, (end_offset & (PAGE_CACHE_SIZE - 1))
- * will evaluate non-zero and be less than PAGE_CACHE_SIZE and
+ * If we are on the last page, (end_offset & page_cache_mask())
+ * will evaluate non-zero and be less than page_cache_size() and
* hence give us the correct page_dirty count. On any other page,
* it will be zero and in that case we need page_dirty to be the
* count of buffers on the page.
*/
end_offset = min_t(unsigned long long,
- (xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT, offset);
+ (xfs_off_t)(page->index + 1) << page_cache_shift(map),
+ offset);
len = 1 << inode->i_blkbits;
- p_offset = min_t(unsigned long, end_offset & (PAGE_CACHE_SIZE - 1),
- PAGE_CACHE_SIZE);
- p_offset = p_offset ? roundup(p_offset, len) : PAGE_CACHE_SIZE;
+ p_offset = min_t(unsigned long, page_cache_offset(map, end_offset),
+ page_cache_size(map));
+ p_offset = p_offset ? roundup(p_offset, len) : page_cache_size(map);
page_dirty = p_offset / len;
bh = head = page_buffers(page);
- offset = page_offset(page);
+ offset = page_cache_pos(map, page->index, 0);
flags = -1;
type = 0;
@@ -1040,7 +1043,7 @@ xfs_page_state_convert(
if (ioend && iomap_valid) {
offset = (iomap.iomap_offset + iomap.iomap_bsize - 1) >>
- PAGE_CACHE_SHIFT;
+ page_cache_shift(map);
tlast = min_t(pgoff_t, offset, last_index);
xfs_cluster_write(inode, page->index + 1, &iomap, &ioend,
wbc, startio, all_bh, tlast);
Index: linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_lrw.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/xfs/linux-2.6/xfs_lrw.c 2007-04-23 22:26:17.000000000 -0700
+++ linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_lrw.c 2007-04-23 22:27:52.000000000 -0700
@@ -143,9 +143,9 @@ xfs_iozero(
do {
unsigned long index, offset;
- offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
- index = pos >> PAGE_CACHE_SHIFT;
- bytes = PAGE_CACHE_SIZE - offset;
+ offset = page_cache_offset(mapping, pos); /* Within page */
+ index = page_cache_index(mapping, pos);
+ bytes = page_cache_size(mapping) - offset;
if (bytes > count)
bytes = count;
Index: linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_file.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/xfs/linux-2.6/xfs_file.c 2007-04-23 22:26:17.000000000 -0700
+++ linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_file.c 2007-04-23 22:27:52.000000000 -0700
@@ -469,6 +469,28 @@ const struct file_operations xfs_file_op
#endif
};
+const struct file_operations xfs_no_mmap_file_operations = {
+ .llseek = generic_file_llseek,
+ .read = do_sync_read,
+ .write = do_sync_write,
+ .aio_read = xfs_file_aio_read,
+ .aio_write = xfs_file_aio_write,
+ .sendfile = xfs_file_sendfile,
+ .splice_read = xfs_file_splice_read,
+ .splice_write = xfs_file_splice_write,
+ .unlocked_ioctl = xfs_file_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = xfs_file_compat_ioctl,
+#endif
+ .open = xfs_file_open,
+ .flush = xfs_file_close,
+ .release = xfs_file_release,
+ .fsync = xfs_file_fsync,
+#ifdef HAVE_FOP_OPEN_EXEC
+ .open_exec = xfs_file_open_exec,
+#endif
+};
+
const struct file_operations xfs_invis_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
Index: linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_iops.h
===================================================================
--- linux-2.6.21-rc7.orig/fs/xfs/linux-2.6/xfs_iops.h 2007-04-23 22:26:17.000000000 -0700
+++ linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_iops.h 2007-04-23 22:27:52.000000000 -0700
@@ -23,6 +23,7 @@ extern const struct inode_operations xfs
extern const struct inode_operations xfs_symlink_inode_operations;
extern const struct file_operations xfs_file_operations;
+extern const struct file_operations xfs_no_mmap_file_operations;
extern const struct file_operations xfs_dir_file_operations;
extern const struct file_operations xfs_invis_file_operations;
Index: linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/xfs/linux-2.6/xfs_super.c 2007-04-23 22:26:17.000000000 -0700
+++ linux-2.6.21-rc7/fs/xfs/linux-2.6/xfs_super.c 2007-04-23 22:34:50.000000000 -0700
@@ -125,8 +125,11 @@ xfs_set_inodeops(
{
switch (inode->i_mode & S_IFMT) {
case S_IFREG:
+ if (mapping_order(inode->i_mapping))
+ inode->i_fop = &xfs_no_mmap_file_operations;
+ else
+ inode->i_fop = &xfs_file_operations;
inode->i_op = &xfs_inode_operations;
- inode->i_fop = &xfs_file_operations;
inode->i_mapping->a_ops = &xfs_address_space_operations;
break;
case S_IFDIR:
Index: linux-2.6.21-rc7/fs/xfs/xfs_mount.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/xfs/xfs_mount.c 2007-04-23 22:26:17.000000000 -0700
+++ linux-2.6.21-rc7/fs/xfs/xfs_mount.c 2007-04-23 22:27:52.000000000 -0700
@@ -315,19 +315,6 @@ xfs_mount_validate_sb(
return XFS_ERROR(ENOSYS);
}
- /*
- * Until this is fixed only page-sized or smaller data blocks work.
- */
- if (unlikely(sbp->sb_blocksize > PAGE_SIZE)) {
- xfs_fs_mount_cmn_err(flags,
- "file system with blocksize %d bytes",
- sbp->sb_blocksize);
- xfs_fs_mount_cmn_err(flags,
- "only pagesize (%ld) or less will currently work.",
- PAGE_SIZE);
- return XFS_ERROR(ENOSYS);
- }
-
return 0;
}
--
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [08/17] Define functions for page cache handling
2007-04-24 22:21 ` [08/17] Define functions for page cache handling clameter
@ 2007-04-24 23:00 ` Eric Dumazet
2007-04-25 6:27 ` Christoph Lameter
0 siblings, 1 reply; 235+ messages in thread
From: Eric Dumazet @ 2007-04-24 23:00 UTC (permalink / raw)
To: clameter
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
clameter@sgi.com a écrit :
> --- linux-2.6.21-rc7.orig/include/linux/fs.h 2007-04-24 11:31:49.000000000 -0700
> +++ linux-2.6.21-rc7/include/linux/fs.h 2007-04-24 11:37:21.000000000 -0700
> @@ -435,6 +435,11 @@ struct address_space {
> struct inode *host; /* owner: inode, block_device */
> struct radix_tree_root page_tree; /* radix tree of all pages */
> rwlock_t tree_lock; /* and rwlock protecting it */
> +#ifdef CONFIG_LARGE_BLOCKSIZE
> + unsigned int order; /* Page order of the pages in here */
> + unsigned int shift; /* Shift of index */
> + loff_t offset_mask; /* Mask to get to offset bits */
> +#endif
> unsigned int i_mmap_writable;/* count VM_SHARED mappings */
> struct prio_tree_root i_mmap; /* tree of private and shared mappings */
> struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
Some comments about this :
1) not optimal placement on 64-bit arches (it creates one hole before
i_mmap_writable)
2) sizeof(struct address_space) is an issue, given it is included in every
inode, even sockets or pipes.
-> 2.1) Do we really need 32 bits for 'order' and 'shift'?
-> 2.2) Do we really need a 64-bit offset_mask, since it's trivial arithmetic
from 'shift' (or 'order')?
BTW, I presume splice() is not yet supported? If so, each pipe could hold 16
'big pages', so we could probably hit some DoS condition sooner...
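For reference, the accessors that patch [08/17] builds on top of these fields
presumably reduce to something along the following lines when
CONFIG_LARGE_BLOCKSIZE is enabled. This is only a sketch reconstructed from the
quoted hunk and from the call sites seen elsewhere in the series, not the
literal patch text; the constant fallbacks used when the option is disabled are
omitted.

#include <linux/fs.h>

static inline unsigned int page_cache_shift(struct address_space *a)
{
	return a->shift;		/* PAGE_SHIFT + a->order */
}

static inline unsigned int page_cache_size(struct address_space *a)
{
	return 1U << a->shift;
}

static inline pgoff_t page_cache_index(struct address_space *a, loff_t pos)
{
	return pos >> a->shift;
}

static inline unsigned int page_cache_offset(struct address_space *a, loff_t pos)
{
	/* offset_mask == page_cache_size(a) - 1, i.e. derivable from shift */
	return pos & a->offset_mask;
}

static inline loff_t page_cache_pos(struct address_space *a, pgoff_t index,
				    unsigned long offset)
{
	return ((loff_t)index << a->shift) + offset;
}

Whether offset_mask is stored or recomputed from shift is exactly the
trade-off discussed in the reply below.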
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (16 preceding siblings ...)
2007-04-24 22:21 ` [17/17] xfs: " clameter
@ 2007-04-25 0:46 ` Jörn Engel
2007-04-25 0:47 ` H. Peter Anvin
` (8 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: Jörn Engel @ 2007-04-25 0:46 UTC (permalink / raw)
To: clameter
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Tue, 24 April 2007 15:21:05 -0700, clameter@sgi.com wrote:
>
> This patchset modifies the Linux kernel so that larger block sizes than
> page size can be supported. Larger block sizes are handled by using
> compound pages of an arbitrary order for the page cache instead of
> single pages with order 0.
I like to see this.
> 2. 32/64k blocksize is also used in flash devices. Same issues.
Actually most chips I encounter these days already have 128KiB. And
some people seem to do some kind of raid-0 in the drivers to increase
bandwidth. FS-visible blocksize is also increased by that.
> Unsupported
> - Mmapping blocks larger than page size
Bummer. Can this change in the future?
> Issues:
> - There are numerous places where the kernel can no longer assume that the
> page cache consists of PAGE_SIZE pages that have not been fixed yet.
> - Defrag warning: The patch set can fragment memory very fast.
> It is likely that Mel Gorman's anti-frag patches and some more
> work by him on defragmentation may be needed if one wants to use
> super sized pages.
> If you run a 2.6.21 kernel with this patch and start a kernel compile
> on a 4k volume with a concurrent copy operation to a 64k volume on
> a system with only 1 Gig then you will go boom (ummm no ... OOM) fast.
> How well Mel's antifrag/defrag methods address this issue still has to
> be seen.
"only 1 Gig" :)
With my LogFS hat on, I don't care too much whether data is cached in
terms of pages or blocks. What matters to me most is to get fed
blocksize'd chunks on writeback and be able to read blocksize'd chunks.
Compressing 64KiB at a time gives somewhere around 10% (don't remember
exact number) better compression when compared to 4KiB. JFFS2 can
benefit from this as well.
That should also be sufficient for cross-platform compatibility,
shouldn't it?
Better performance for the pagecache is also nice to have, no doubt.
But if system stability remains an issue, I'd rather keep slow and
stable.
Jörn
--
More computing sins are committed in the name of efficiency (without
necessarily achieving it) than for any other single reason - including
blind stupidity.
-- W. A. Wulf
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (17 preceding siblings ...)
2007-04-25 0:46 ` [00/17] Large Blocksize Support V3 Jörn Engel
@ 2007-04-25 0:47 ` H. Peter Anvin
2007-04-25 3:11 ` William Lee Irwin III
` (7 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: H. Peter Anvin @ 2007-04-25 0:47 UTC (permalink / raw)
To: clameter
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
FWIW, this would also let zisofs remove the ugly hacks we currently
employ to deal with compression blocks.
-hpa
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (18 preceding siblings ...)
2007-04-25 0:47 ` H. Peter Anvin
@ 2007-04-25 3:11 ` William Lee Irwin III
2007-04-25 11:35 ` Jens Axboe
` (6 subsequent siblings)
26 siblings, 0 replies; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-25 3:11 UTC (permalink / raw)
To: clameter
Cc: linux-kernel, Mel Gorman, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Tue, Apr 24, 2007 at 03:21:05PM -0700, clameter@sgi.com wrote:
> V2->V3
> - More restructuring
> - It actually works!
> - Add XFS support
> - Fix up UP support
> - Work out the direct I/O issues
> - Add CONFIG_LARGE_BLOCKSIZE. Off by default which makes the inlines revert
> back to constants. Disabled for 32bit and HIGHMEM configurations.
> This also allows a gradual migration to the new page cache
> inline functions. LARGE_BLOCKSIZE capabilities can be
> added gradually and if there is a problem then we can disable
> a subsystem.
Excellent, I'll do some testing here at the very least.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [08/17] Define functions for page cache handling
2007-04-24 23:00 ` Eric Dumazet
@ 2007-04-25 6:27 ` Christoph Lameter
0 siblings, 0 replies; 235+ messages in thread
From: Christoph Lameter @ 2007-04-25 6:27 UTC (permalink / raw)
To: Eric Dumazet
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Wed, 25 Apr 2007, Eric Dumazet wrote:
> > struct radix_tree_root page_tree; /* radix tree of all pages */
> > rwlock_t tree_lock; /* and rwlock protecting it */
> > +#ifdef CONFIG_LARGE_BLOCKSIZE
> > +	unsigned int order;	/* Page order of the pages in here */
> > +	unsigned int shift;	/* Shift of index */
> > +	loff_t offset_mask;	/* Mask to get to offset bits */
> > +#endif
> > 	unsigned int i_mmap_writable;	/* count VM_SHARED mappings */
> > 	struct prio_tree_root i_mmap;	/* tree of private and shared mappings */
> > 	struct list_head i_mmap_nonlinear;	/* list VM_NONLINEAR mappings */
>
> Some comments about this :
>
> 1) not optimal placement on 64-bit arches (it creates one hole before
> i_mmap_writable)
loff_t is a 64 bit long. We could put shift behind offset_mask.
> 2) sizeof(struct address_space) is an issue, given it is included in every
> inode, even sockets or pipes.
Right. I had an earlier implementation that just used an order field. See
V2 of this patchset. That did a lot of shifting instead of lookups.
> -> 2.1) Do we really need 32 bits for 'order' and 'shift'?
No, u8 would be sufficient. If you can pack other fields in between,
then I can switch it.
> -> 2.2) Do we really need a 64-bit offset_mask, since it's trivial arithmetic
> from 'shift' (or 'order')?
It simplifies the generated asm code significantly.
> BTW, I presume splice() is not yet supported? If so, each pipe could hold 16
> 'big pages', so we could probably hit some DoS condition sooner...
Not yet. I'd be glad if you could come up with a patch that converts it.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (19 preceding siblings ...)
2007-04-25 3:11 ` William Lee Irwin III
@ 2007-04-25 11:35 ` Jens Axboe
2007-04-25 15:36 ` Christoph Lameter
2007-04-25 13:28 ` Mel Gorman
` (5 subsequent siblings)
26 siblings, 1 reply; 235+ messages in thread
From: Jens Axboe @ 2007-04-25 11:35 UTC (permalink / raw)
To: clameter
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Badari Pulavarty, Maxim Levitsky
On Tue, Apr 24 2007, clameter@sgi.com wrote:
> V2->V3
> - More restructuring
> - It actually works!
> - Add XFS support
> - Fix up UP support
> - Work out the direct I/O issues
> - Add CONFIG_LARGE_BLOCKSIZE. Off by default which makes the inlines revert
> back to constants. Disabled for 32bit and HIGHMEM configurations.
> This also allows a gradual migration to the new page cache
> inline functions. LARGE_BLOCKSIZE capabilities can be
> added gradually and if there is a problem then we can disable
> a subsystem.
I need this patch to actually boot the thing, or it bombs with a NULL
deref in page_cache_size().
It then boots, doing a little test with 8kb ext2 quickly dies though:
BUG: unable to handle kernel NULL pointer dereference at virtual address 000000ac
printing eip:
d8149519
*pde = 00000000
Oops: 0002 [#1]
SMP
Modules linked in: sunrpc button battery ac uhci_hcd ehci_hcd tg3 ide_cd cdrom
CPU: 0
EIP: 0060:[<d8149519>] Not tainted VLI
EFLAGS: 00010246 (2.6.21-rc7-g56a56164-dirty #131)
EIP is at 0xd8149519
eax: 000000ac ebx: d81492a8 ecx: ddc4b804 edx: d837a940
esi: d81495b5 edi: d8149580 ebp: d814954c esp: e9049d1c
ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
Process fsck.ext2 (pid: 5604, ti=e9049000 task=f75c9030 task.ti=e9049000)
Stack: d81494e4 d8ebcac8 cd29113c 00000000 78337894 000241d0 7a33abd4 78337890
       7813b751 00000044 78228db3 0000000f e9049d78 e9049e20 f75c9030 00000010
       00000001 f7445680 79aa5540 00000003 7a33abd4 00000004 7813d35c 00000004
Call Trace:
[<7813b751>] __alloc_pages+0x5c/0x2a8
[<78228db3>] sd_open+0x5e/0x106
[<7813d35c>] __do_page_cache_readahead+0x109/0x123
[<7813d48d>] blockable_page_cache_readahead+0x4a/0x9a
[<7813d60f>] page_cache_readahead+0x8f/0x159
[<78137d7b>] do_generic_mapping_read+0x17b/0x534
[<781383d4>] generic_file_aio_read+0x19f/0x1c3
[<78138134>] file_read_actor+0x0/0x101
[<78151dee>] do_sync_read+0xbf/0xfc
[<78127fbf>] autoremove_wake_function+0x0/0x33
[<782a5bb5>] mutex_lock+0x13/0x22
[<78151eb4>] vfs_read+0x89/0x104
[<78152172>] sys_read+0x41/0x67
[<78102548>] syscall_call+0x7/0xb
[<782a0000>] bictcp_cong_avoid+0x110/0x3cd
=======================
Code: b9 79 0d 00 00 00 00 02 00 00 00 1a ba d4 c0 aa 33 7a b4 b9 16 78 00 00 00 00 08 95 14 d8 08 95 14 d8 00 00 00 00 00 00 00 00 ac <00> 00 00 e4 94 14 d8 00 74 b9 79 0c 00 00 00 00 02 00 00 00 18
EIP: [<d8149519>] 0xd8149519 SS:ESP 0068:e9049d1c
That's here:
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
-> if (page)
goto got_pg;
which doesn't look healthy. Note that this is a 32-bit machine, I
removed the 32-bit check (devices in this box are fine).
--- fs/libfs.c~ 2007-04-25 13:30:50.000000000 +0200
+++ fs/libfs.c 2007-04-25 13:31:00.000000000 +0200
@@ -330,7 +330,7 @@
int simple_prepare_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
- unsigned int page_size = page_cache_size(file->f_mapping);
+ unsigned int page_size = page_cache_size(page->mapping);
if (!PageUptodate(page)) {
if (to - from != page_size) {
--
Jens Axboe
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (20 preceding siblings ...)
2007-04-25 11:35 ` Jens Axboe
@ 2007-04-25 13:28 ` Mel Gorman
2007-04-25 15:23 ` Christoph Lameter
2007-04-25 22:46 ` Badari Pulavarty
` (4 subsequent siblings)
26 siblings, 1 reply; 235+ messages in thread
From: Mel Gorman @ 2007-04-25 13:28 UTC (permalink / raw)
To: clameter
Cc: linux-kernel, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Nuts. Didn't spot V3 before I started V2, ah well.
On (24/04/07 15:21), clameter@sgi.com didst pronounce:
> V2->V3
> - More restructuring
> - It actually works!
> - Add XFS support
> - Fix up UP support
> - Work out the direct I/O issues
> - Add CONFIG_LARGE_BLOCKSIZE. Off by default which makes the inlines revert
> back to constants. Disabled for 32bit and HIGHMEM configurations.
HIGHMEM I can understand because I suppose the kmap() issue is still in
there, but why 32 bit? Is this temporary or do you expect to see it
fixed up later?
> This also allows a gradual migration to the new page cache
> inline functions. LARGE_BLOCKSIZE capabilities can be
> added gradually and if there is a problem then we can disable
> a subsystem.
>
> V1->V2
> - Some ext2 support
> - Some block layer, fs layer support etc.
> - Better page cache macros
> - Use macros to clean up code.
>
> This patchset modifies the Linux kernel so that larger block sizes than
> page size can be supported. Larger block sizes are handled by using
> compound pages of an arbitrary order for the page cache instead of
> single pages with order 0.
>
> Rationales:
>
> 1. We have problems supporting devices with a higher blocksize than
> page size. This is for example important to support CD and DVDs that
> can only read and write 32k or 64k blocks. We currently have a shim
> layer in there to deal with this situation which limits the speed
> of I/O. The developers are currently looking for ways to completely
> bypass the page cache because of this deficiency.
>
> 2. 32/64k blocksize is also used in flash devices. Same issues.
>
> 3. Future harddisks will support bigger block sizes that Linux cannot
> support since we are limited to PAGE_SIZE. Ok the on board cache
> may buffer this for us but what is the point of handling smaller
> page sizes than what the drive supports?
>
> 4. Reduce fsck times. Larger block sizes mean faster file system checking.
>
> 5. Performance. If we look at IA64 vs. x86_64 then it seems that the
> faster interrupt handling on x86_64 compensate for the speed loss due to
> a smaller page size (4k vs 16k on IA64). Supporting larger block sizes
> sizes on all allows a significant reduction in I/O overhead and increases
> the size of I/O that can be performed by hardware in a single request
> since the number of scatter gather entries are typically limited for
> one request. This is going to become increasingly important to support
> the ever growing memory sizes since we may have to handle excessively
> large amounts of 4k requests for data sizes that may become common
> soon. For example to write a 1 terabyte file the kernel would have to
> handle 256 million 4k chunks.
>
> 6. Cross arch compatibility: It is currently not possible to mount
> an 16k blocksize ext2 filesystem created on IA64 on an x86_64 system.
> With this patch this becoems possible.
>
> The support here is currently only for buffered I/O. Modifications for
> three filesystems are included:
>
> A. XFS
> B. Ext2
> C. ramfs
>
> Unsupported
> - Mmapping blocks larger than page size
>
> Issues:
> - There are numerous places where the kernel can no longer assume that the
> page cache consists of PAGE_SIZE pages that have not been fixed yet.
> - Defrag warning: The patch set can fragment memory very fast.
I bet they do.
> It is likely that Mel Gorman's anti-frag patches and some more
> work by him on defragmentation may be needed if one wants to use
> super sized pages.
Very likely.
> If you run a 2.6.21 kernel with this patch and start a kernel compile
> on a 4k volume with a concurrent copy operation to a 64k volume on
> a system with only 1 Gig then you will go boom (ummm no ... OOM) fast.
On systems with larger amounts of memory, it'll go boom eventually. More
memory does not magically avoid fragmentation problems.
> How well Mel's antifrag/defrag methods address this issue still has to
> be seen.
>
The grouping pages by mobility should hold up for ext2 and XFS because
their page cache pages are reclaimable/movable and will get grouped with
other pages that are reclaimable/movable. ramfs may be a problem if it was
heavily used but lets see how things pan out.
> Future:
> - Mmap support could be done in a way that makes the mmap page size
> independent from the page cache order. It is okay to map a 4k section
> of a larger page cache page via a pte. 4k mmap semantics can be completely
> preserved even for larger page sizes.
> - Maybe people could perform benchmarks to see how much of a difference
> there is between 4k size I/O and 64k? Andrew surely would like to know.
> - If there is a chance for inclusion then I will diff this against mm,
> do a complete scan over the kernel to find all page cache == PAGE_SIZE
> assumptions and then try to get it upstream for 2.6.23.
>
> How to make this work:
>
> 1. Apply this patchset to 2.6.21-rc7
> 2. Configure LARGE_BLOCKSIZE Support
> 3. compile kernel
>
> --
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-25 13:28 ` Mel Gorman
@ 2007-04-25 15:23 ` Christoph Lameter
0 siblings, 0 replies; 235+ messages in thread
From: Christoph Lameter @ 2007-04-25 15:23 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-kernel, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Wed, 25 Apr 2007, Mel Gorman wrote:
> HIGHMEM I can understand because I suppose the kmap() issue is still in
> there, but why 32 bit? Is this temporary or do you expect to see it
> fixed up later?
It could be fixed but I only tested with 64 bit. Jens's report shows that
my skepticism was justified. 32 bit has only small memory and suffers from
VM crappiness due to highmem, bouncing due to dma zones etc etc. It's a big
mess that I would like to avoid dealing with for now. Lots of corner cases
that need consideration. When we actually merge this then we can deal with
all the cases. If some cases cannot be dealt with then the large blocksize
support will not be available if another particular feature is
enabled.
> > How well Mel's antifrag/defrag methods address this issue still has to
> > be seen.
> The grouping pages by mobility should hold up for ext2 and XFS because
> their page cache pages are reclaimable/movable and will get grouped with
> other pages that are reclaimable/movable. ramfs may be a problem if it was
> heavily used but lets see how things pan out.
Sounds good.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-25 11:35 ` Jens Axboe
@ 2007-04-25 15:36 ` Christoph Lameter
2007-04-25 17:53 ` Jens Axboe
0 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-25 15:36 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Badari Pulavarty, Maxim Levitsky
On Wed, 25 Apr 2007, Jens Axboe wrote:
> I need this patch to actually boot the thing, or it bombs with a NULL
> deref in page_cache_size().
Yeah on 32 bit which I disabled....
> It then boots, doing a little test with 8kb ext2 quickly dies though:
>
> BUG: unable to handle kernel NULL pointer dereference at virtual address
> EIP: [<d8149519>] 0xd8149519 SS:ESP 0068:e9049d1c
>
> That's here:
>
> page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
> zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
> -> if (page)
> goto got_pg;
Impossible. page is not dereferenced. Cannot cause a NULL pointer
dereference.
> which doesn't look healthy. Note that this is a 32-bit machine, I
> removed the 32-bit check (devices in this box are fine).
>
> --- fs/libfs.c~ 2007-04-25 13:30:50.000000000 +0200
> +++ fs/libfs.c 2007-04-25 13:31:00.000000000 +0200
> @@ -330,7 +330,7 @@
> int simple_prepare_write(struct file *file, struct page *page,
> unsigned from, unsigned to)
> {
> - unsigned int page_size = page_cache_size(file->f_mapping);
> + unsigned int page_size = page_cache_size(page->mapping);
We do a write operation on a file that has no mapping (and thus no radix
tree etc etc)? Is that legit?
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-25 15:36 ` Christoph Lameter
@ 2007-04-25 17:53 ` Jens Axboe
2007-04-25 18:03 ` Christoph Lameter
0 siblings, 1 reply; 235+ messages in thread
From: Jens Axboe @ 2007-04-25 17:53 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Badari Pulavarty, Maxim Levitsky
On Wed, Apr 25 2007, Christoph Lameter wrote:
> On Wed, 25 Apr 2007, Jens Axboe wrote:
>
> > I need this patch to actually boot the thing, or it bombs with a NULL
> > deref in page_cache_size().
>
> Yeah on 32 bit which I disabled....
Yep, on the grounds that devices may not cover the full address space.
On my box they do.
> > It then boots, doing a little test with 8kb ext2 quickly dies though:
> >
> > BUG: unable to handle kernel NULL pointer dereference at virtual address
>
> > EIP: [<d8149519>] 0xd8149519 SS:ESP 0068:e9049d1c
> >
> > That's here:
> >
> > page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
> > zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
> > -> if (page)
> > goto got_pg;
>
> Impossible. page is not dereferenced. Cannot cause a NULL pointer
> dereference.
Sure I know that, hence the oops happening there is likely fallout from
something scribbling where it should not have.
> > which doesn't look healthy. Note that this is a 32-bit machine, I
> > removed the 32-bit check (devices in this box are fine).
>
>
> >
> > --- fs/libfs.c~ 2007-04-25 13:30:50.000000000 +0200
> > +++ fs/libfs.c 2007-04-25 13:31:00.000000000 +0200
> > @@ -330,7 +330,7 @@
> > int simple_prepare_write(struct file *file, struct page *page,
> > unsigned from, unsigned to)
> > {
> > - unsigned int page_size = page_cache_size(file->f_mapping);
> > + unsigned int page_size = page_cache_size(page->mapping);
>
> We do a write operation on a file that has no mapping (and thus no radix
> tree etc etc)? Is that legit?
Probably not.
--
Jens Axboe
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-25 17:53 ` Jens Axboe
@ 2007-04-25 18:03 ` Christoph Lameter
2007-04-25 18:05 ` Jens Axboe
0 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-25 18:03 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Badari Pulavarty, Maxim Levitsky
On Wed, 25 Apr 2007, Jens Axboe wrote:
> > Impossible. page is not dereferenced. Cannot cause a NULL pointer
> > dereference.
>
> Sure I know that, hence the oops happening there is likely fallout from
> something scribbling where it should not have.
Hmmm... Must be a big scribble. And you are sure that no bouncing or
highmem stuff is done? Any way to figure out what that is?
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-25 18:03 ` Christoph Lameter
@ 2007-04-25 18:05 ` Jens Axboe
2007-04-25 18:14 ` Christoph Lameter
0 siblings, 1 reply; 235+ messages in thread
From: Jens Axboe @ 2007-04-25 18:05 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Badari Pulavarty, Maxim Levitsky
On Wed, Apr 25 2007, Christoph Lameter wrote:
> On Wed, 25 Apr 2007, Jens Axboe wrote:
>
> > > Impossible. page is not dereferenced. Cannot cause a NULL pointer
> > > deference.
> >
> > Sure I know that, hence the oops happening there is likely fallout from
> > something scribbling where it should not have.
>
> Hmmm... Must be a big scribble. And you are sure that no bouncing or
> highmem stuff is done? Any way to figure out what that is?
2038MB LOWMEM available.
All of memory is lowmem. There are no bounces going on, for sure.
--
Jens Axboe
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-25 18:05 ` Jens Axboe
@ 2007-04-25 18:14 ` Christoph Lameter
2007-04-25 18:16 ` Jens Axboe
0 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-25 18:14 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Badari Pulavarty, Maxim Levitsky
On Wed, 25 Apr 2007, Jens Axboe wrote:
> All of memory is lowmem. There are no bounces going on, for sure.
Could you figure out what this is? The patchset was intentionally only for
64 bit. I cannot get to the 32 bit headaches for a week or so.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-25 18:14 ` Christoph Lameter
@ 2007-04-25 18:16 ` Jens Axboe
0 siblings, 0 replies; 235+ messages in thread
From: Jens Axboe @ 2007-04-25 18:16 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Badari Pulavarty, Maxim Levitsky
On Wed, Apr 25 2007, Christoph Lameter wrote:
> On Wed, 25 Apr 2007, Jens Axboe wrote:
>
> > All of memory is lowmem. There are no bounces going on, for sure.
>
> Could you figure out what this is? The patchset was intentionally only for
> 64 bit. I cannot get to the 32 bit headaches for a week or so.
Sure, but I probably won't have cycles to throw at that until next
week... Come Monday I'll see if I can pinpoint the problem(s).
--
Jens Axboe
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (21 preceding siblings ...)
2007-04-25 13:28 ` Mel Gorman
@ 2007-04-25 22:46 ` Badari Pulavarty
2007-04-26 1:14 ` David Chinner
2007-04-26 4:51 ` Eric W. Biederman
` (3 subsequent siblings)
26 siblings, 1 reply; 235+ messages in thread
From: Badari Pulavarty @ 2007-04-25 22:46 UTC (permalink / raw)
To: clameter
Cc: lkml, Mel Gorman, William Lee Irwin III, David Chinner,
Jens Axboe, Maxim Levitsky
On Tue, 2007-04-24 at 15:21 -0700, clameter@sgi.com wrote:
> V2->V3
Hmm.. It broke ext2 :(
V2 worked fine with the small fix I sent you earlier.
But on V3, I can't run fsx. I see random data showing up.
I will debug, when I get a chance.
Thanks,
Badari
READ BAD DATA: offset = 0x5092a4, size = 0x5093a0, fname = (null)
OFFSET GOOD BAD RANGE
0x77f466d0 0x77e1f7c0 0x0000 0x 6ef
operation# (mod 256) for the bad data may be 5284080
0x2820236e 0x77e1f7c0 0x0000 0x 6f1
operation# (mod 256) for the bad data may be 5284080
0x2820236e 0x77e1f7c0 0x0000 0x 6f1
operation# (mod 256) for the bad data may be 5284080
0x2820236e 0x77e1f7c0 0x0000 0x 6f3
operation# (mod 256) for the bad data may be 5284080
0x2820236e 0x77e1f7c0 0x0000 0x 6f3
operation# (mod 256) for the bad data may be 5284080
0x2820236e 0x77e1f7c0 0x0000 0x 6f5
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x776f6e6b 0x6b636568 0x0000 0x 84f
operation# (mod 256) for the bad data may be 5284080
0x2820236e 0x77e1f7c0 0x0000 0x 851
operation# (mod 256) for the bad data may be 5284080
0x2820236e 0x77e1f7c0 0x0000 0x 851
operation# (mod 256) for the bad data may be 5284080
0x2820236e 0x77e1f7c0 0x0000 0x 853
operation# (mod 256) for the bad data may be 5284080
0x2820236e 0x77e1f7c0 0x0000 0x 853
operation# (mod 256) for the bad data may be 5284080
0x2820236e 0x77e1f7c0 0x0000 0x 855
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x776f6e6b 0x6b636568 0x0000 0x 857
operation# (mod 256) for the bad data may be 5284080
0x2820236e 0x77e1f7c0 0x0000 0x 859
operation# (mod 256) for the bad data may be 5284080
0x2820236e 0x77e1f7c0 0x0000 0x 859
operation# (mod 256) for the bad data may be 5284080
0x2820236e 0x77e1f7c0 0x0000 0x 85b
operation# (mod 256) for the bad data may be 5284080
LOG DUMP (49149 total operations):
0(2012505808 mod 256): WRITE 0x77f466d0 thru 0x77e1f7c0 (0x0 bytes) HOL***WWWW
0(2012505808 mod 256): WRITE 0x77f466d0 thru 0x77e1f7c0 (0x0 bytes)
0(2012505808 mod 256): READ 0x77f466d0 thru 0x77e1f7c0 (0x0 bytes)
0(2012505808 mod 256): READ 0x77f466d0 thru 0x77e1f7c0 (0x0 bytes) ***RRRR***
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-25 22:46 ` Badari Pulavarty
@ 2007-04-26 1:14 ` David Chinner
2007-04-26 1:17 ` David Chinner
0 siblings, 1 reply; 235+ messages in thread
From: David Chinner @ 2007-04-26 1:14 UTC (permalink / raw)
To: Badari Pulavarty
Cc: clameter, lkml, Mel Gorman, William Lee Irwin III, David Chinner,
Jens Axboe, Maxim Levitsky
On Wed, Apr 25, 2007 at 03:46:19PM -0700, Badari Pulavarty wrote:
> On Tue, 2007-04-24 at 15:21 -0700, clameter@sgi.com wrote:
> > V2->V3
>
> Hmm.. It broke ext2 :(
>
> V2 worked fine with the small fix I sent you earlier.
> But on V3, I can't run fsx. I see random data showing up.
> I will debug, when I get a chance.
Same thing on XFS - 'fsx -d -S 42 -R -W foobar' fails on
the tenth operation....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 1:14 ` David Chinner
@ 2007-04-26 1:17 ` David Chinner
0 siblings, 0 replies; 235+ messages in thread
From: David Chinner @ 2007-04-26 1:17 UTC (permalink / raw)
To: David Chinner
Cc: Badari Pulavarty, clameter, lkml, Mel Gorman,
William Lee Irwin III, Jens Axboe, Maxim Levitsky
On Thu, Apr 26, 2007 at 11:14:49AM +1000, David Chinner wrote:
> On Wed, Apr 25, 2007 at 03:46:19PM -0700, Badari Pulavarty wrote:
> > On Tue, 2007-04-24 at 15:21 -0700, clameter@sgi.com wrote:
> > > V2->V3
> >
> > Hmm.. It broke ext2 :(
> >
> > V2 worked fine with the small fix I sent you earlier.
> > But on V3, I can't run fsx. I see random data showing up.
> > I will debug, when I get a chance.
>
> Same thing on XFS - 'fsx -d -S 42 -R -W foobar' fails on
> the tenth operation....
Hmmmm - even normal block size filesystems (ext3) are reading bogus
data (e.g. /etc/mtod).
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (22 preceding siblings ...)
2007-04-25 22:46 ` Badari Pulavarty
@ 2007-04-26 4:51 ` Eric W. Biederman
2007-04-26 5:05 ` Christoph Lameter
2007-04-26 5:37 ` Nick Piggin
2007-04-26 18:50 ` Maxim Levitsky
` (2 subsequent siblings)
26 siblings, 2 replies; 235+ messages in thread
From: Eric W. Biederman @ 2007-04-26 4:51 UTC (permalink / raw)
To: clameter
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
clameter@sgi.com writes:
> V2->V3
> - More restructuring
> - It actually works!
> - Add XFS support
> - Fix up UP support
> - Work out the direct I/O issues
> - Add CONFIG_LARGE_BLOCKSIZE. Off by default which makes the inlines revert
> back to constants. Disabled for 32bit and HIGHMEM configurations.
> This also allows a gradual migration to the new page cache
> inline functions. LARGE_BLOCKSIZE capabilities can be
> added gradually and if there is a problem then we can disable
> a subsystem.
>
> V1->V2
> - Some ext2 support
> - Some block layer, fs layer support etc.
> - Better page cache macros
> - Use macros to clean up code.
>
> This patchset modifies the Linux kernel so that larger block sizes than
> page size can be supported. Larger block sizes are handled by using
> compound pages of an arbitrary order for the page cache instead of
> single pages with order 0.
Huh?
You seem to be mixing two very different concepts.
The page cache has no problems supporting things with a block
size larger then page size. Now the block device layer may not
have the code to do the scatter gather into small pages and it
may not handle buffer heads whose data is split between multiple
pages.
But this is not a page cache issue.
And generally larger physical pages are a mistake to use.
Especially as it looks from some of the later comments you don't
dare test on 32bit because the memory fragments faster.
Is it common for hardware that supports large block sizes to not
support splitting those blocks apart during DMA? Unless it is common
the whole premise of this patchset seems broken.
I suspect what needs to be fixed is the page cache block device
interface so that we have helper functions that know how to stuff
a single block into several pages.
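A rough sketch of what such a helper could look like against the 2.6.21 block
layer, purely as an illustration of the idea (the function name and the
caller-supplied completion handler are placeholders; nothing of the sort is
actually proposed in this thread):

#include <linux/bio.h>
#include <linux/pagemap.h>

/*
 * Stuff one large filesystem block into several order-0 pages by making
 * each page one segment of a single bio.  The caller owns the locked
 * pages and supplies the completion handler that unlocks them and marks
 * them up to date.
 */
static int submit_large_block(struct block_device *bdev, sector_t sector,
			      struct page **pages, unsigned int nr_pages,
			      bio_end_io_t *end_io, void *private)
{
	struct bio *bio = bio_alloc(GFP_NOFS, nr_pages);
	unsigned int i;

	if (!bio)
		return -ENOMEM;
	bio->bi_bdev = bdev;
	bio->bi_sector = sector;
	bio->bi_end_io = end_io;
	bio->bi_private = private;

	for (i = 0; i < nr_pages; i++) {
		/* each order-0 page carries one PAGE_CACHE_SIZE slice */
		if (bio_add_page(bio, pages[i], PAGE_CACHE_SIZE, 0)
				< PAGE_CACHE_SIZE) {
			bio_put(bio);
			return -EIO;
		}
	}
	submit_bio(READ, bio);
	return 0;
}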
That would make the choice of using larger order pages (essentially
increasing PAGE_SIZE) something that can be investigated in parallel.
Right now I don't even want to think about trying to use a swap device
with a large block size when we are low on memory.
>
> Rationales:
>
> 1. We have problems supporting devices with a higher blocksize than
> page size. This is for example important to support CD and DVDs that
> can only read and write 32k or 64k blocks. We currently have a shim
> layer in there to deal with this situation which limits the speed
> of I/O. The developers are currently looking for ways to completely
> bypass the page cache because of this deficiency.
block device /page cache interface issue.
> 2. 32/64k blocksize is also used in flash devices. Same issues.
flash devices are not block devices so I strongly doubt it is
the same issue.
> 3. Future harddisks will support bigger block sizes that Linux cannot
> support since we are limited to PAGE_SIZE. Ok the on board cache
> may buffer this for us but what is the point of handling smaller
> page sizes than what the drive supports?
Not fragmenting memory and keeping the system running.
> 4. Reduce fsck times. Larger block sizes mean faster file system checking.
Fewer seeks and less meta-data means faster fsck times. Larger block
sizes get us there only tangentially.
> 5. Performance. If we look at IA64 vs. x86_64 then it seems that the
> faster interrupt handling on x86_64 compensate for the speed loss due to
> a smaller page size (4k vs 16k on IA64). Supporting larger block sizes
> sizes on all allows a significant reduction in I/O overhead and increases
> the size of I/O that can be performed by hardware in a single request
> since the number of scatter gather entries are typically limited for
> one request. This is going to become increasingly important to support
> the ever growing memory sizes since we may have to handle excessively
> large amounts of 4k requests for data sizes that may become common
> soon. For example to write a 1 terabyte file the kernel would have to
> handle 256 million 4k chunks.
This assumes you get the option of large files and batching things as
the systems scale. At SGI maybe that is true. However in general
you get lots of small requests as systems scale up.
For example I have gigabytes of kernel trees. How are larger requests
going to speed up my reading and writing of those? And yes even with
8G of ram I have enough kernel trees that they fall out of memory.
So cache is not the only answer.
> 6. Cross arch compatibility: It is currently not possible to mount
> an 16k blocksize ext2 filesystem created on IA64 on an x86_64 system.
> With this patch this becoems possible.
Again this is a problem with the page cache block device interface not
a page cache problem.
I think supporting larger block sizes is a nice goal. However unless
we are bumping up against hardware limitations let's see how far
we can go with batching and fixing the block layer/page cache interface
instead of assuming that larger page sizes are the answer.
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 4:51 ` Eric W. Biederman
@ 2007-04-26 5:05 ` Christoph Lameter
2007-04-26 5:44 ` Eric W. Biederman
2007-04-26 5:37 ` Nick Piggin
1 sibling, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 5:05 UTC (permalink / raw)
To: Eric W. Biederman
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Wed, 25 Apr 2007, Eric W. Biederman wrote:
> The page cache has no problems supporting things with a block
> size larger then page size. Now the block device layer may not
> have the code to do the scatter gather into small pages and it
> may not handle buffer heads whose data is split between multiple
> pages.
It does have that problem. If a system is in use then memory is fragmented
and requests to the devices are in 4k sizes. The kernel has to manage the
4k size. The number of requests that the driver can take is limited.
Larger blocks allow shuffling more data to the device.
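To put a number on that: with a controller limited to, say, 256 scatter/gather
entries per request, 4k pages cap a single request at 1MB of data, while
order-2 (16k) compound pages raise the same request to 4MB (the 256-entry
limit here is only an illustrative figure).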
> And generally larger physical pages are a mistake to use.
> Especially as it looks from some of the later comments you don't
> dare test on 32bit because the memory fragments faster.
Ummm.. Don't get me to comment on i386. I never said that memory fragments
faster on i386. i386 has multiple issues with memory management that
require a lot of work and that will cause difficulty. If you have these
fun systems with 512k ZONE_NORMAL and 63GB HIGHMEM then good luck...
> Is it common for hardware that supports large block sizes to not
> support splitting those blocks apart during DMA? Unless it is common
> the whole premise of this patchset seems broken.
Huh? Splitting the blocks requires hardware effort -> Reduction in
transfer rate.
> I suspect what needs to be fixed is the page cache block device
> interface so that we have helper functions that know how to stuff
> a single block into several pages.
Oh we have scores of these hacks around. Look at the dvd/cd layer. The
point is to get rid of those.
> Right now I don't even want to think about trying to use a swap device
> with a large block size when we are low on memory.
But that is due to the VM (at least Linus tree) having no defrag methods.
mm has Mel's antifrag methods and can do it.
> > 2. 32/64k blocksize is also used in flash devices. Same issues.
>
> flash devices are not block devices so I strongly doubt it is
> the same issue.
But they could be treated as such. Right now these poor guys have to
improvise around the page size limit.
> > 4. Reduce fsck times. Larger block sizes mean faster file system checking.
>
> Fewer seeks and less meta-data means faster fsck times. Larger block
> sizes get us there only tangentially.
Less meta data to manage does not reduce fsck times? Going from order 0 to
order 2 blocks cuts the metadata to a fourth.
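As a rough illustration, treating per-block metadata as proportional to the
block count: a 1GB file occupies 262144 blocks at 4k but only 65536 blocks at
16k (order 2), so the block mappings fsck has to walk shrink by the same
factor of four.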
> > 5. Performance. If we look at IA64 vs. x86_64 then it seems that the
> > faster interrupt handling on x86_64 compensate for the speed loss due to
> > a smaller page size (4k vs 16k on IA64). Supporting larger block sizes
> > sizes on all allows a significant reduction in I/O overhead and increases
> > the size of I/O that can be performed by hardware in a single request
> > since the number of scatter gather entries are typically limited for
> > one request. This is going to become increasingly important to support
> > the ever growing memory sizes since we may have to handle excessively
> > large amounts of 4k requests for data sizes that may become common
> > soon. For example to write a 1 terabyte file the kernel would have to
> > handle 256 million 4k chunks.
>
> This assumes you get the option of large files and batching things as
> the systems scale. At SGI maybe that is true. However in general
> you get lots of small requests as systems scale up.
Yes, you get lots of small requests *because* we do not support defrag and
cannot do large contiguous allocations.
> > 6. Cross arch compatibility: It is currently not possible to mount
> > an 16k blocksize ext2 filesystem created on IA64 on an x86_64 system.
> > With this patch this becoems possible.
>
> Again this is a problem with the page cache block device interface not
> a page cache problem.
Ummm, the other arches read 16k blocks of contiguous memory. That is not
supported on 4k platforms right now. I guess you'd move those to vmalloc
areas? Want to hack the filesystems for this?
> I think supporting larger block sizes is a nice goal. However unless
> we are bumping up against hardware limitations let's see how far
> we can go with batching and fixing the block layer/page cache interface
> instead of assuming that larger page sizes are the answer.
There are multiple scaling issues in the kernel. What you propose is to
add hack over hack into the VM to avoid having to deal with
defragmentation. That in turn will cause churn with hardware etc etc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 4:51 ` Eric W. Biederman
2007-04-26 5:05 ` Christoph Lameter
@ 2007-04-26 5:37 ` Nick Piggin
2007-04-26 6:38 ` David Chinner
2007-04-26 6:40 ` Christoph Lameter
1 sibling, 2 replies; 235+ messages in thread
From: Nick Piggin @ 2007-04-26 5:37 UTC (permalink / raw)
To: Eric W. Biederman
Cc: clameter, linux-kernel, Mel Gorman, William Lee Irwin III,
David Chinner, Jens Axboe, Badari Pulavarty, Maxim Levitsky
Eric W. Biederman wrote:
> clameter@sgi.com writes:
>
>
>>V2->V3
>>- More restructuring
>>- It actually works!
>>- Add XFS support
>>- Fix up UP support
>>- Work out the direct I/O issues
>>- Add CONFIG_LARGE_BLOCKSIZE. Off by default which makes the inlines revert
>> back to constants. Disabled for 32bit and HIGHMEM configurations.
>> This also allows a gradual migration to the new page cache
>> inline functions. LARGE_BLOCKSIZE capabilities can be
>> added gradually and if there is a problem then we can disable
>> a subsystem.
>>
>>V1->V2
>>- Some ext2 support
>>- Some block layer, fs layer support etc.
>>- Better page cache macros
>>- Use macros to clean up code.
>>
>>This patchset modifies the Linux kernel so that larger block sizes than
>>page size can be supported. Larger block sizes are handled by using
>>compound pages of an arbitrary order for the page cache instead of
>>single pages with order 0.
>
>
> Huh?
>
> You seem to be mixing two very different concepts.
>
> The page cache has no problems supporting things with a block
> size larger then page size. Now the block device layer may not
> have the code to do the scatter gather into small pages and it
> may not handle buffer heads whose data is split between multiple
> pages.
Yeah, this patch is not really large blocksize support (which we normally
think of as block size > page cache size).
> But this is not a page cache issue.
>
> And generally larger physical pages are a mistake to use.
> Especially as it looks from some of the later comment you don't
> date test on 32bit because the memory fragments faster.
I actually completely agree with this, and I'm concerned in general about
using higher order pages. I think it is fundamentally the wrong approach
because of fragmentation and defragmentation costs (similarly to Linus's
take on page colouring).
I think starting with the assumption that we _want_ to use higher order
allocations, and then creating all this complexity around that is not a
good one, and if we start introducing things that _require_ significant
higher order allocations to function then it is a nasty thing for
robustness.
> Is it common for hardware that supports large block sizes to not
> support splitting those blocks apart during DMA? Unless it is common
> the whole premise of this patchset seems broken.
>
> I suspect what needs to be fixed is the page cache block device
> interface so that we have helper functions that know how to stuff
> a single block into several pages.
I am working now and again on some code to do this, it is a big job but
I think it is the right way to do it. But it would take a long time to
get stable and supported by filesystems...
> That would make the choice of using larger order pages (essentially
> increasing PAGE_SIZE) something that can be investigated in parallel.
I agree that hardware inefficiencies should be handled by increasing
PAGE_SIZE (not making PAGE_CACHE_SIZE > PAGE_SIZE) at the arch level.
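For context on that distinction: in mainline at this point the two are still
tied together in include/linux/pagemap.h, essentially

#define PAGE_CACHE_SHIFT	PAGE_SHIFT
#define PAGE_CACHE_SIZE		PAGE_SIZE
#define PAGE_CACHE_MASK		PAGE_MASK

so raising the page cache unit per architecture would simply mean raising
PAGE_SIZE, whereas this patch set instead makes the unit a per-mapping
property.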
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 5:05 ` Christoph Lameter
@ 2007-04-26 5:44 ` Eric W. Biederman
2007-04-26 6:37 ` Christoph Lameter
` (3 more replies)
0 siblings, 4 replies; 235+ messages in thread
From: Eric W. Biederman @ 2007-04-26 5:44 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
Christoph Lameter <clameter@sgi.com> writes:
> On Wed, 25 Apr 2007, Eric W. Biederman wrote:
>
>> The page cache has no problems supporting things with a block
>> size larger then page size. Now the block device layer may not
>> have the code to do the scatter gather into small pages and it
>> may not handle buffer heads whose data is split between multiple
>> pages.
>
> It does have that problem. If a system is in use then memory is fragmented
> and requests to the devices are in 4k sizes. The kernel has to manage the
> 4k size. The number of requests that the driver can take is limited.
> Larger blocks allow shuffling more data to the device.
I have a hard time believing that device hardware limits don't allow them
to have enough space to handle larger requests. If so it was a poor
design by the hardware manufacturers.
>> And generally larger physical pages are a mistake to use.
>> Especially as it looks from some of the later comments you don't
>> dare test on 32bit because the memory fragments faster.
>
> Ummm.. Don't get me to comment on i386. I never said that memory fragments
> faster on i386. i386 has multiple issues with memory management that
> require a lot of work and that will cause difficulty. If you have these
> fun systems with 512k ZONE_NORMAL and 63GB HIGHMEM then good luck...
>
>> Is it common for hardware that supports large block sizes to not
>> support splitting those blocks apart during DMA? Unless it is common
>> the whole premise of this patchset seems broken.
>
> Huh? Splitting the blocks requires hardware effort -> Reduction in
> transfer rate.
Splitting the blocks doesn't change the transfer effort one iota.
The buses (PCI/PCIe/HyperTransport) already have block sizes below 4KB.
Reading a longer list of descriptors might slow things down, but I would
be surprised if it mattered.
The physical medium is the primary disk bottleneck.
Thinking about it the fastest thing I can do with a filesystem or disk
is to not use it. That is to cache it efficiently. Having page sized
chunks in my cache increases my caching efficiency. Large order
pages work directly against my caching efficiency.
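For what it's worth, the internal fragmentation effect is easy to quantify. A toy calculation follows; the 6K average file size and the 64K compound-page size are assumptions picked for illustration, not numbers from this thread:
#include <stdio.h>

int main(void)
{
	unsigned long files = 1000;
	unsigned long avg_file = 6 * 1024;	/* assumed 6K average file */
	unsigned long page_4k = 4 * 1024;
	unsigned long page_64k = 64 * 1024;

	/* Round each cached file up to whole page-cache chunks. */
	unsigned long cache_4k =
		files * (((avg_file + page_4k - 1) / page_4k) * page_4k);
	unsigned long cache_64k =
		files * (((avg_file + page_64k - 1) / page_64k) * page_64k);

	printf("4K chunks:  %lu KiB\n", cache_4k / 1024);	/*  8000 KiB */
	printf("64K chunks: %lu KiB\n", cache_64k / 1024);	/* 64000 KiB */
	return 0;
}
Under those assumptions the same data occupies roughly eight times as much cache once every small file is rounded up to a large compound page.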
>> I suspect what needs to be fixed is the page cache block device
>> interface so that we have helper functions that know how to stuff
>> a single block into several pages.
>
> Oh we have scores of these hacks around. Look at the dvd/cd layer. The
> point is to get rid of those.
Perhaps this is just a matter of cleaning them up so they are no
longer hacks?
You are trying to couple something that has no business being coupled
as it reduces the system usability when you couple them.
>> Right now I don't even want to think about trying to use a swap device
>> with a large block size when we are low on memory.
>
> But that is due to the VM (at least Linus tree) having no defrag methods.
> mm has Mel's antifrag methods and can do it.
This is fundamental. Fragmentation when you have multiple chunk sizes
cannot be solved without the ability to move things in memory,
whereas it doesn't exist when you only have a single chunk size.
>> > 2. 32/64k blocksize is also used in flash devices. Same issues.
>>
>> flash devices are not block devices so I strongly doubt it is
>> the same issue.
>
> But they could be treated as such. Right now these poor guys have to
> improvise around the page size limit.
The reason they are different is that they have very different
fundamental properties. Flash devices have essentially no seek time
so random access is fast. However they have a maximum number of erases
per sector, so you have to be careful to do wear leveling. Flash
devices are distinctly different, and using the block layer for them
while they do not behave like block devices is the wrong thing to do.
>> > 4. Reduce fsck times. Larger block sizes mean faster file system checking.
>>
>> Fewer seeks and less meta-data means faster fsck times. Larger block
>> sizes get us there only tangentially.
>
> Less meta data to manage does not reduce fsck times? Going from order 0 to
> order 2 blocks cuts the metadata to a fourth.
I agree that less meta data helps. But switching to extents can reduce the
meta data much more, and still doesn't penalize you for small files if
you have them.
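To put rough numbers on the metadata argument, here is a back-of-the-envelope sketch; the 1 GiB file is an assumed example, and real filesystems add indirect blocks on top of this:
#include <stdio.h>

int main(void)
{
	unsigned long long file = 1ULL << 30;		/* assumed 1 GiB file */
	unsigned long long ptrs_4k = file / 4096;	/* one pointer per 4K block  */
	unsigned long long ptrs_16k = file / 16384;	/* one pointer per 16K block */

	/* Order-2 blocks cut per-block metadata to a quarter; an extent-based
	 * layout can describe a contiguous file as one (start, length) pair. */
	printf("4K blocks:  %llu block pointers\n", ptrs_4k);	/* 262144 */
	printf("16K blocks: %llu block pointers\n", ptrs_16k);	/*  65536 */
	printf("extents:    1 record if the file is contiguous\n");
	return 0;
}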
>> > 5. Performance. If we look at IA64 vs. x86_64 then it seems that the
>> > faster interrupt handling on x86_64 compensate for the speed loss due to
>> > a smaller page size (4k vs 16k on IA64). Supporting larger block sizes
>> > sizes on all allows a significant reduction in I/O overhead and increases
>> > the size of I/O that can be performed by hardware in a single request
>> > since the number of scatter gather entries are typically limited for
>> > one request. This is going to become increasingly important to support
>> > the ever growing memory sizes since we may have to handle excessively
>> > large amounts of 4k requests for data sizes that may become common
>> > soon. For example to write a 1 terabyte file the kernel would have to
>> > handle 256 million 4k chunks.
>>
>> This assumes you get the option of large files and batching things as
>> the systems scale. At SGI maybe that is true. However in general
>> you get lots of small requests as systems scale up.
>
> Yes, you get lots of small requests *because* we do not support defrag and
> cannot do large contiguous allocations.
Lots of small requests are fundamental. If lots of small requests were
not fundamental we would get large scatter/gather requests instead.
>> > 6. Cross arch compatibility: It is currently not possible to mount
>> > an 16k blocksize ext2 filesystem created on IA64 on an x86_64 system.
>> > With this patch this becoems possible.
>>
>> Again this is a problem with the page cache block device interface not
>> a page cache problem.
>
> Ummm, the other arches read 16k blocks of contiguous memory. That is not
> supported on 4k platforms right now. I guess you would move those to vmalloc
> areas? Want to hack the filesystems for this?
Freak no. You teach the code how to have a block in multiple physical
pages.
>> I think supporting larger block sizes is a nice goal. However unless
>> we are bumping up against hardware limitations let's see how far
>> we can go with batching and fixing the block layer/page cache interface
>> instead of assuming that larger page sizes are the answer.
>
> There are multiple scaling issues in the kernel. What you propose is to
> add hack over hack into the VM to avoid having to deal with
> defragmentation. That in turn will cause churn with hardware etc
> etc.
No. I propose to avoid all designs that have the concept of
fragmentation.
There is an argument for having struct page control more than 4K
of memory even when the hardware page size is 4K. But that is a
separate problem. And by only having one size we still don't
get fragmentation.
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 5:44 ` Eric W. Biederman
@ 2007-04-26 6:37 ` Christoph Lameter
2007-04-26 9:16 ` Mel Gorman
2007-04-26 6:38 ` Nick Piggin
` (2 subsequent siblings)
3 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 6:37 UTC (permalink / raw)
To: Eric W. Biederman
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Wed, 25 Apr 2007, Eric W. Biederman wrote:
> You are trying to couple something that has no business being coupled
> as it reduces the system usability when you couple them.
What am I coupling? The approach solves a series of issues as far as I can
tell.
> > But that is due to the VM (at least Linus tree) having no defrag methods.
> > mm has Mel's antifrag methods and can do it.
>
> This is fundamental. Fragmentation when you have multiple chunk sizes
> cannot be solved without the ability to move things in memory,
> whereas it doesn't exist when you only have a single chunk size.
We have that ability (although in a limited form) right now.
> > Yes, you get lots of small requests *because* we do not support defrag and
> > cannot do large contiguous allocations.
>
> Lots of small requests are fundamental. If lots of small requests were
> not fundamental we would get large scatter/gather requests instead.
That is a statement of faith in small requests? Small requests are
fundamental so we want them?
> > Ummm, the other arches read 16k blocks of contiguous memory. That is not
> > supported on 4k platforms right now. I guess you would move those to vmalloc
> > areas? Want to hack the filesystems for this?
>
> Freak no. You teach the code how to have a block in multiple physical
> pages.
This aint gonna work without something that stores the information about
how the pieces come together. Teach the code.... More hacks.
> > There are multiple scaling issues in the kernel. What you propose is to
> > add hack over hack into the VM to avoid having to deal with
> > defragmentation. That in turn will cause churn with hardware etc
> > etc.
>
> No. I propose to avoid all designs that have the concept of
> fragmentation.
There are such designs? You can limit fragmentation but not avoid it.
> There is an argument for having struct page control more than 4K
> of memory even when the hardware page size is 4K. But that is a
> separate problem. And by only having one size we still don't
> get fragmentation.
We have fragmentation because we cannot limit our allocation sizes to 4k.
The stack is already 8k and there are subsystems that need more (f.e.
for jumbo frames). Then there is the huge page subsystem that is used to
avoid the TLB pressure that comes with small pages.
I think we are doing our typical community thing of running away from the
problem and developing ways to explain why our traditional approaches are
right.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 5:37 ` Nick Piggin
@ 2007-04-26 6:38 ` David Chinner
2007-04-26 6:50 ` Nick Piggin
2007-04-26 10:10 ` Eric W. Biederman
2007-04-26 6:40 ` Christoph Lameter
1 sibling, 2 replies; 235+ messages in thread
From: David Chinner @ 2007-04-26 6:38 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26, 2007 at 03:37:28PM +1000, Nick Piggin wrote:
> I think starting with the assumption that we _want_ to use higher order
> allocations, and then creating all this complexity around that is not a
> good one, and if we start introducing things that _require_ significant
> higher order allocations to function then it is a nasty thing for
> robustness.
From my POV, we started with the problem of how to provide atomic
access to a multi-page block in the page cache. For example, we want
to lock the filesystem block and prevent any updates to it, so we
need to lock all the pages in it. And then when we write them back,
they all need to change state at the same time, and they all need to
have their radix tree tags changed at the same time, the problem of
mapping them to disk, getting writeback to do block aligned and
sized writeback chunks, and so on.
And then there's the problem that most hardware is limited to 128
s/g entries and that means 128 non-contiguous pages in memory is the
maximum I/O size we can issue to these devices. We have RAID arrays
that go twice as fast if we can send them 1MB I/Os instead of 512k
I/Os and that means we need contiguous pages to be handed to the
devices....
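For concreteness, here is the arithmetic behind that limit as a small sketch; the 128 s/g entries and the 512k/1MB figures come from the paragraph above, while the 16K compound-page size is an assumption:
#include <stdio.h>

int main(void)
{
	unsigned int sg_entries = 128;		/* typical controller s/g limit  */
	unsigned int chunk_4k = 4 * 1024;	/* order-0 page                  */
	unsigned int chunk_16k = 16 * 1024;	/* assumed order-2 compound page */

	/* Each s/g entry maps one physically contiguous chunk, so the largest
	 * single I/O is (entries * chunk size). */
	printf("max I/O, 4K chunks:  %u KiB\n", sg_entries * chunk_4k / 1024);	/*  512 KiB */
	printf("max I/O, 16K chunks: %u KiB\n", sg_entries * chunk_16k / 1024);	/* 2048 KiB */
	return 0;
}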
All of these things require some aggregating structure to
co-ordinate. In times gone by on other OSs, this has been done with
a buffer cache, but Linux got rid of that long ago and we don't want
to reintroduce one. We can't use buffer heads - they can only point
to one page. So what do we do?
That's where compound pages are so nice - they solve all of these
problems with only a very small amount of code perturbation and they
don't change any algorithms or fundamental design of the OS at all.
We don't have to rewrite filesystems to support this. We don't have
to redesign the VM to support this. We don't have to do very much
work to the block layer and drivers to make this work.
FWIW, if you want 32 bit machines to support larger than 16TB
devices, you need high order page indexing in the page cache....
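That 16TB figure presumably falls out of a 32-bit page cache index; a quick sketch of the arithmetic, assuming a 32-bit pgoff_t and 4K page cache chunks:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t indices = 1ULL << 32;	/* 32-bit pgoff_t: 2^32 page cache slots */
	uint64_t chunk = 4096;		/* 4K page cache chunk */

	/* Largest device/file offset reachable through the page cache. */
	printf("%llu TiB\n", (unsigned long long)((indices * chunk) >> 40));	/* 16 */
	/* With 16K (order-2) chunks the same index space would cover 64 TiB. */
	return 0;
}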
> >Is it common for hardware that supports large block sizes to not
> >support splitting those blocks apart during DMA? Unless it is common
> >the whole premise of this patchset seems broken.
> >
> >I suspect what needs to be fixed is the page cache block device
> >interface so that we have helper functions that know how to stuff
> >a single block into several pages.
>
> I am working now and again on some code to do this, it is a big job but
> I think it is the right way to do it. But it would take a long time to
> get stable and supported by filesystems...
Compared to a method that requires almost no change to the
filesystems that want to support large block sizes?
> >That would make the choice of using larger order pages (essentially
> >increasing PAGE_SIZE) something that can be investigated in parallel.
>
> I agree that hardware inefficiencies should be handled by increasing
> PAGE_SIZE (not making PAGE_CACHE_SIZE > PAGE_SIZE) at the arch level.
And what do we do for arches that can't do multiple page sizes, or
only have a limited and mostly useless set of page sizes to choose
from?
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 5:44 ` Eric W. Biederman
2007-04-26 6:37 ` Christoph Lameter
@ 2007-04-26 6:38 ` Nick Piggin
2007-04-26 6:46 ` Christoph Lameter
2007-04-26 15:58 ` Christoph Hellwig
2007-04-26 13:28 ` Alan Cox
2007-04-28 10:55 ` Pierre Ossman
3 siblings, 2 replies; 235+ messages in thread
From: Nick Piggin @ 2007-04-26 6:38 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Eric W. Biederman wrote:
> Christoph Lameter <clameter@sgi.com> writes:
>>>Right now I don't even want to think about trying to use a swap device
>>>with a large block size when we are low on memory.
>>
>>But that is due to the VM (at least Linus tree) having no defrag methods.
>>mm has Mel's antifrag methods and can do it.
>
>
> This is fundamental. Fragmentation when you have multiple chunk sizes
> cannot be solved without the ability to move things in memory,
> whereas it doesn't exist when you only have a single chunk size.
And even if you can (and you can't always, because the anti-frag is
only heuristics), then it costs you complexity and overhead to do.
>>Less meta data to manage does not reduce fsck times? Going from order 0 to
>>order 2 blocks cuts the metadata to a fourth.
>
>
> I agree that less meta data helps. But switching to extents can reduce the
> meta data much more, and still doesn't penalize you for small files if
> you have them.
Anyway, this is a general large block size issue, and not specifically
anything to do with large pagecache size.
>>There are multiple scaling issues in the kernel. What you propose is to
>>add hack over hack into the VM to avoid having to deal with
>>defragmentation. That in turn will cause churn with hardware etc
>>etc.
>
>
> No. I propose to avoid all designs that have the concept of
> fragmentation.
Yeah. IMO anti-fragmentation and defragmentation is the hack, and we
should stay away from higher order allocations whenever possible.
Hardware is built to handle many small pages efficiently, and I don't
understand how it could be an SGI-only issue. Sure, you may have an
order of magnitude or more memory than anyone else, but even my lowly
desktop _already_ has orders of magnitude more pages than it has TLB
entries or cache -- if a workload is cache-nice for me, it probably
will be on a 1TB machine as well, and if it is bad for the 1TB machine,
it is also bad on mine.
If this is instead an issue of io path or reclaim efficiency, then it
would be really nice to see numbers... but I don't think making these
fundamental paths more complex and slower is a nice way to fix it
(larger PAGE_SIZE would be, though).
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 5:37 ` Nick Piggin
2007-04-26 6:38 ` David Chinner
@ 2007-04-26 6:40 ` Christoph Lameter
2007-04-26 6:53 ` Nick Piggin
1 sibling, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 6:40 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Thu, 26 Apr 2007, Nick Piggin wrote:
> > The page cache has no problems supporting things with a block
> > size larger then page size. Now the block device layer may not
> > have the code to do the scatter gather into small pages and it
> > may not handle buffer heads whose data is split between multiple
> > pages.
>
> Yeah, this patch is not really large blocksize support (which we normally
> think of as block size > page cache size).
No? It depends on how you define block size. This patch definitely allows
a set blocksize function call with a size larger than 4k.
> > I suspect what needs to be fixed is the page cache block device
> > interface so that we have helper functions that know how to stuff
> > a single block into several pages.
>
> I am working now and again on some code to do this, it is a big job but
> I think it is the right way to do it. But it would take a long time to
> get stable and supported by filesystems...
Ummm... We already have a radix tree for this???? What more is needed? You
just need to go through all filesystems and make them use extents.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 6:38 ` Nick Piggin
@ 2007-04-26 6:46 ` Christoph Lameter
2007-04-26 6:57 ` Nick Piggin
2007-04-26 10:06 ` [00/17] Large Blocksize Support V3 Mel Gorman
2007-04-26 15:58 ` Christoph Hellwig
1 sibling, 2 replies; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 6:46 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Thu, 26 Apr 2007, Nick Piggin wrote:
> Yeah. IMO anti-fragmentation and defragmentation is the hack, and we
> should stay away from higher order allocations whenever possible.
Right, and we need to create a series of other approaches that we then label
"non-hack" to replace it.
> Hardware is built to handle many small pages efficiently, and I don't
> understand how it could be an SGI-only issue. Sure, you may have an
> order of magnitude or more memory than anyone else, but even my lowly
> desktop _already_ has orders of magnitude more pages than it has TLB
> entries or cache -- if a workload is cache-nice for me, it probably
> will be on a 1TB machine as well, and if it is bad for the 1TB machine,
> it is also bad on mine.
There have been a number of people who have argued the same point. Just
because we have developed a way of thinking to defend our traditional 4k
values does not make them right.
> If this is instead an issue of io path or reclaim efficiency, then it
> would be really nice to see numbers... but I don't think making these
> fundamental paths more complex and slower is a nice way to fix it
> (larger PAGE_SIZE would be, though).
The code paths can stay the same. You can switch CONFIG_LARGE pages off
if you do not want it and it is as it was.
If you would have a look at the patches: the code is significantly cleaned up
and easier to read.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 6:38 ` David Chinner
@ 2007-04-26 6:50 ` Nick Piggin
2007-04-26 8:40 ` Mel Gorman
2007-04-26 16:11 ` Christoph Hellwig
2007-04-26 10:10 ` Eric W. Biederman
1 sibling, 2 replies; 235+ messages in thread
From: Nick Piggin @ 2007-04-26 6:50 UTC (permalink / raw)
To: David Chinner
Cc: Eric W. Biederman, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
David Chinner wrote:
> On Thu, Apr 26, 2007 at 03:37:28PM +1000, Nick Piggin wrote:
>
>>I think starting with the assumption that we _want_ to use higher order
>>allocations, and then creating all this complexity around that is not a
>>good one, and if we start introducing things that _require_ significant
>>higher order allocations to function then it is a nasty thing for
>>robustness.
>
>
> From my POV, we started with the problem of how to provide atomic
> access to a multi-page block in the page cache. For example, we want
> to lock the filesystem block and prevent any updates to it, so we
> need to lock all the pages in it. And then when we write them back,
> they all need to change state at the same time, and they all need to
> have their radix tree tags changed at the same time, the problem of
> mapping them to disk, getting writeback to do block aligned and
> sized writeback chunks, and so on.
>
> And then there's the problem that most hardware is limited to 128
> s/g entries and that means 128 non-contiguous pages in memory is the
> maximum I/O size we can issue to these devices. We have RAID arrays
> that go twice as fast if we can send them 1MB I/Os instead of 512k
> I/Os and that means we need contiguous pages to be handed to the
> devices....
>
> All of these things require some aggregating structure to
> co-ordinate. In times gone by on other OSs, this has been done with
> a buffer cache, but Linux got rid of that long ago and we don't want
> to reintroduce one. We can't use buffer heads - they can only point
> to one page. So what do we do?
Improving the buffer layer would be a good way. Of course, that is
a long and difficult task, so nobody wants to do it.
> That's where compound pages are so nice - they solve all of these
> problems with only a very small amount of code perturbation and they
> don't change any algorithms or fundamental design of the OS at all.
>
> We don't have to rewrite filesystems to support this. We don't have
> to redesign the VM to support this. We don't have to do very much
> work to the block layer and drivers to make this work.
Fragmentation is the problem. The anti-frag patches don't actually
guarantee anything about fragmentation, and even if they did, then
it takes more effort to actually find and reclaim higher order
pages (even after the lumpy reclaim thing). So we've redesigned
the page allocator and page reclaim and still don't have something
that you should rely on without fallbacks.
> FWIW, if you want 32 bit machines to support larger than 16TB
> devices, you need high order page indexing in the page cache....
How about a 64-bit pgoff_t and radix-tree key? Or larger PAGE_SIZE?
>>I am working now and again on some code to do this, it is a big job but
>>I think it is the right way to do it. But it would take a long time to
>>get stable and supported by filesystems...
>
>
> Compared to a method that requires almost no change to the
> filesystems that want to support large block sizes?
Luckily we have filesystem maintainers who want to support larger
block sizes ;)
>>>That would make the choice of using larger order pages (essentially
>>>increasing PAGE_SIZE) something that can be investigated in parallel.
>>
>>I agree that hardware inefficiencies should be handled by increasing
>>PAGE_SIZE (not making PAGE_CACHE_SIZE > PAGE_SIZE) at the arch level.
>
>
> And what do we do for arches that can't do multiple page sizes, or
> only have a limited and mostly useless set of page sizes to choose
> from?
Well, for those architectures (and this would solve your large block
size and 16TB pagecache size without any core kernel changes), you
can manage 1<<order hardware ptes as a single Linux pte. There is
nothing that says you must implement PAGE_SIZE as a single TLB sized
page.
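A toy sketch of that idea follows; every name and the PTE encoding are made up purely for illustration (this is not code from any arch), it only shows the shape of backing one logical Linux pte with several hardware PTEs:
#include <stdint.h>

/* All names and the PTE encoding below are hypothetical. */
#define HW_PAGE_SHIFT		12	/* 4K hardware page      */
#define LINUX_PAGE_SHIFT	14	/* 16K logical PAGE_SIZE */
#define HW_PTES_PER_LINUX_PTE	(1u << (LINUX_PAGE_SHIFT - HW_PAGE_SHIFT))

typedef uint64_t hw_pte_t;

static hw_pte_t mk_hw_pte(uint64_t pfn, uint64_t prot)
{
	return (pfn << HW_PAGE_SHIFT) | prot;	/* toy encoding */
}

/* One logical Linux pte is backed by four consecutive 4K hardware PTEs
 * that together map one physically contiguous 16K region. */
static void set_linux_pte(hw_pte_t *hw, uint64_t pfn, uint64_t prot)
{
	unsigned int i;

	for (i = 0; i < HW_PTES_PER_LINUX_PTE; i++)
		hw[i] = mk_hw_pte(pfn + i, prot);
}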
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 6:40 ` Christoph Lameter
@ 2007-04-26 6:53 ` Nick Piggin
2007-04-26 7:04 ` David Chinner
2007-04-26 7:07 ` Christoph Lameter
0 siblings, 2 replies; 235+ messages in thread
From: Nick Piggin @ 2007-04-26 6:53 UTC (permalink / raw)
To: Christoph Lameter
Cc: Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Christoph Lameter wrote:
> On Thu, 26 Apr 2007, Nick Piggin wrote:
>
>
>>>The page cache has no problems supporting things with a block
>>>size larger then page size. Now the block device layer may not
>>>have the code to do the scatter gather into small pages and it
>>>may not handle buffer heads whose data is split between multiple
>>>pages.
>>
>>Yeah, this patch is not really large blocksize support (which we normally
>>think of as block size > page cache size).
>
>
> No? It depends on how you define block size. This patch definitely allows
> a set blocksize function call with a size larger than 4k.
Yeah because it is still <= page cache page size. Anyway, that's just
semantics, it doesn't matter.
>>>I suspect what needs to be fixed is the page cache block device
>>>interface so that we have helper functions that know how to stuff
>>>a single block into several pages.
>>
>>I am working now and again on some code to do this, it is a big job but
>>I think it is the right way to do it. But it would take a long time to
>>get stable and supported by filesystems...
>
>
> Ummm... We already have a radix tree for this???? What more is needed? You
> just need to go through all filesystems and make them use extents.
I'm talking about block size > page size in the buffer layer.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 6:46 ` Christoph Lameter
@ 2007-04-26 6:57 ` Nick Piggin
2007-04-26 7:10 ` Christoph Lameter
2007-04-26 10:06 ` [00/17] Large Blocksize Support V3 Mel Gorman
1 sibling, 1 reply; 235+ messages in thread
From: Nick Piggin @ 2007-04-26 6:57 UTC (permalink / raw)
To: Christoph Lameter
Cc: Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Christoph Lameter wrote:
> On Thu, 26 Apr 2007, Nick Piggin wrote:
>
>
>>Yeah. IMO anti-fragmentation and defragmentation is the hack, and we
>>should stay away from higher order allocations whenever possible.
>
>
> Right, and we need to create a series of other approaches that we then label
> "non-hack" to replace it.
I don't understand? We're talking about several utterly different designs
to approach these problems. You don't agree that one might be better than
another?
>>Hardware is built to handle many small pages efficiently, and I don't
>>understand how it could be an SGI-only issue. Sure, you may have an
>>order of magnitude or more memory than anyone else, but even my lowly
>>desktop _already_ has orders of magnitude more pages than it has TLB
>>entries or cache -- if a workload is cache-nice for me, it probably
>>will be on a 1TB machine as well, and if it is bad for the 1TB machine,
>>it is also bad on mine.
>
>
> There have been a number of people who have argued the same point. Just
> because we have developed a way of thinking to defend our traditional 4k
> values does not make them right.
>
>
>>If this is instead an issue of io path or reclaim efficiency, then it
>>would be really nice to see numbers... but I don't think making these
>>fundamental paths more complex and slower is a nice way to fix it
>>(larger PAGE_SIZE would be, though).
>
>
> The code paths can stay the same. You can switch CONFIG_LARGE pages off
> if you do not want it and it is as it was.
That isn't a good reason to merge something. If you don't have numbers then
that just seems incredible.
> If you would have a look at the patches: the code is significantly cleaned up
> and easier to read.
Cleanups are fine.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [10/17] Variable Order Page Cache: Add clearing and flushing function
2007-04-24 22:21 ` [10/17] Variable Order Page Cache: Add clearing and flushing function clameter
@ 2007-04-26 7:02 ` Christoph Lameter
2007-04-26 8:14 ` David Chinner
0 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 7:02 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Tue, 24 Apr 2007, clameter@sgi.com wrote:
> +{ \
> + memset(page_address(__page), (__offset), (__size)); \
> + flush_mapping_page(__page); \
This was borked. Dave, does this patch make it work?
---
include/linux/highmem.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6.21-rc7/include/linux/highmem.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/highmem.h 2007-04-25 23:59:33.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/highmem.h 2007-04-26 00:00:05.000000000 -0700
@@ -88,7 +88,7 @@ static inline void clear_highpage(struct
*/
#define memclear_highpage_flush(__page,__offset,__size) \
{ \
- memset(page_address(__page), (__offset), (__size)); \
+ memset(page_address(__page) + (__offset), 0, (__size)); \
flush_mapping_page(__page); \
}
#else
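The semantic difference is easy to see outside the kernel. Below is a minimal userspace illustration of the broken and fixed forms; plain memset stands in for the highmem macro and there is no cache flush here, so it is only a sketch of the intended behaviour:
#include <stdio.h>
#include <string.h>

/* Broken form: the offset is misused as the memset fill byte. */
static void clear_buggy(unsigned char *page, size_t offset, size_t size)
{
	memset(page, (int)offset, size);	/* fills page[0..size) with the value 'offset' */
}

/* Fixed form: zero 'size' bytes starting at 'offset'. */
static void clear_fixed(unsigned char *page, size_t offset, size_t size)
{
	memset(page + offset, 0, size);
}

int main(void)
{
	unsigned char a[16], b[16];

	memset(a, 0xff, sizeof(a));
	memset(b, 0xff, sizeof(b));

	clear_buggy(a, 4, 8);	/* a[0..7] now hold the byte value 4, a[8..15] untouched */
	clear_fixed(b, 4, 8);	/* b[4..11] are now 0, the rest stay 0xff */

	printf("buggy: a[0]=%d a[4]=%d a[12]=%d\n", a[0], a[4], a[12]);
	printf("fixed: b[0]=%d b[4]=%d b[12]=%d\n", b[0], b[4], b[12]);
	return 0;
}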
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 6:53 ` Nick Piggin
@ 2007-04-26 7:04 ` David Chinner
2007-04-26 7:07 ` Nick Piggin
2007-04-26 7:07 ` Christoph Lameter
1 sibling, 1 reply; 235+ messages in thread
From: David Chinner @ 2007-04-26 7:04 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26, 2007 at 04:53:40PM +1000, Nick Piggin wrote:
> Christoph Lameter wrote:
> >On Thu, 26 Apr 2007, Nick Piggin wrote:
> >>I am working now and again on some code to do this, it is a big job but
> >>I think it is the right way to do it. But it would take a long time to
> >>get stable and supported by filesystems...
> >
> >Ummm... We already have a radix tree for this???? What more is needed? You
> >just need to go through all filesystems and make them use extents.
>
> I'm talking about block size > page size in the buffer layer.
Nick, what's the buffer layer? Are you talking about operations
based on bufferheads?
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:04 ` David Chinner
@ 2007-04-26 7:07 ` Nick Piggin
2007-04-26 7:11 ` Christoph Lameter
0 siblings, 1 reply; 235+ messages in thread
From: Nick Piggin @ 2007-04-26 7:07 UTC (permalink / raw)
To: David Chinner
Cc: Christoph Lameter, Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
David Chinner wrote:
> On Thu, Apr 26, 2007 at 04:53:40PM +1000, Nick Piggin wrote:
>
>>Christoph Lameter wrote:
>>
>>>On Thu, 26 Apr 2007, Nick Piggin wrote:
>>>
>>>>I am working now and again on some code to do this, it is a big job but
>>>>I think it is the right way to do it. But it would take a long time to
>>>>get stable and supported by filesystems...
>>>
>>>Ummm... We already have a radix tree for this???? What more is needed? You
>>>just need to go through all filesystems and make them use extents.
>>
>>I'm talking about block size > page size in the buffer layer.
>
>
> Nick, what's the buffer layer? Are you talking about operations
> based on bufferheads?
Yeah. Our pgoff_t->sector_t translation.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 6:53 ` Nick Piggin
2007-04-26 7:04 ` David Chinner
@ 2007-04-26 7:07 ` Christoph Lameter
2007-04-26 7:15 ` Nick Piggin
1 sibling, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 7:07 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Thu, 26 Apr 2007, Nick Piggin wrote:
> > > I am working now and again on some code to do this, it is a big job but
> > > I think it is the right way to do it. But it would take a long time to
> > > get stable and supported by filesystems...
> > Ummm... We already have a radix tree for this???? What more is needed? You
> > just need to go through all filesystems and make them use extents.
>
> I'm talking about block size > page size in the buffer layer.
I fail to see the point of adding another layer when you already have a
mapping through the radix tree. You just need to change the way the
filesystem looks up pages.
What are the exact requirements you are trying to address?
You fundamentally cannot address the large blocksize requirements with 4k
pages since you simply must have larger contiguous memory.
Large blocksize means that the device can do I/O on blocks of that size.
What can be done is to create some kind of fake linearity. At one level
the radix tree and the address space already provide that. The radix tree
allows you to find the next page etc. Another approach would be to create
a virtual address space that fakes linearity even for the processor.
Then there are ways with I/O mmus to avoid the issues again.
However, you still have not addressed the underlying problem of the device
not being able to do I/O to a larger block of memory.
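For reference, the "virtual address space that fakes linearity" variant roughly corresponds to vmap()ing order-0 pages. A rough kernel-style sketch, not from the patch set, and note that it only gives the CPU a linear view; the device would still see discontiguous pages:
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

/*
 * Sketch only: map 'nr' independently allocated 4K pages so that the CPU
 * sees one virtually contiguous buffer.  The caller frees it with
 * vunmap() plus __free_page() on each page.
 */
static void *alloc_fake_linear(struct page **pages, unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		pages[i] = alloc_page(GFP_KERNEL);
		if (!pages[i])
			goto out_free;
	}
	return vmap(pages, nr, VM_MAP, PAGE_KERNEL);

out_free:
	while (i--)
		__free_page(pages[i]);
	return NULL;
}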
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 6:57 ` Nick Piggin
@ 2007-04-26 7:10 ` Christoph Lameter
2007-04-26 7:22 ` Nick Piggin
0 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 7:10 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Thu, 26 Apr 2007, Nick Piggin wrote:
> > Right, and we need to create a series of other approaches that we then label
> > "non-hack" to replace it.
>
> I don't understand? We're talking about several utterly different designs
> to approach these problems. You don't agree that one might be better than
> another?
What I am seeing is a series of approaches being put into the kernel to
address this issue. We already have the lumpy reclaim there. Then we talk
about other fixes to basic page handling in the kernel to make it better.
Now you want yet another fs layer. All of that could be taken care of by
a defrag approach with larger pages. This has been done a number of times
before and actually the large page approach is a textbook example on how
to improve performance. It goes waaaay back.
> > The code paths can stay the same. You can switch CONFIG_LARGE pages off
> > if you do not want it and it is as it was.
>
> That isn't a good reason to merge something. If you don't have numbers then
> that just seems incredible.
Don't worry, you will get numbers... I just did not have time to fix the bug
in this one since I had to take care of something else.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:07 ` Nick Piggin
@ 2007-04-26 7:11 ` Christoph Lameter
2007-04-26 7:17 ` Nick Piggin
0 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 7:11 UTC (permalink / raw)
To: Nick Piggin
Cc: David Chinner, Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Thu, 26 Apr 2007, Nick Piggin wrote:
> > Nick, what's the buffer layer? Are you talking about operations
> > based on bufferheads?
>
> Yeah. Our pgoff_t->sector_t translation.
Sadly the buffers in the buffer layer still assume contiguous memory. You
would have to add a series of pointers there and then add a layer to
handle this.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:07 ` Christoph Lameter
@ 2007-04-26 7:15 ` Nick Piggin
2007-04-26 7:22 ` Christoph Lameter
0 siblings, 1 reply; 235+ messages in thread
From: Nick Piggin @ 2007-04-26 7:15 UTC (permalink / raw)
To: Christoph Lameter
Cc: Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Christoph Lameter wrote:
> On Thu, 26 Apr 2007, Nick Piggin wrote:
>
>
>>>>I am working now and again on some code to do this, it is a big job but
>>>>I think it is the right way to do it. But it would take a long time to
>>>>get stable and supported by filesystems...
>>>
>>>Ummm... We already have a radix tree for this???? What more is needed? You
>>>just need to go through all filesystems and make them use extents.
>>
>>I'm talking about block size > page size in the buffer layer.
>
>
> I fail to see the point of adding another layer when you already have a
It isn't another layer. We already have this layer.
> mapping through the radix tree. You just need to change the way the
> filesystem looks up pages.
You didn't think any of the criticisms of higher order page cache size
were valid?
> What are the exact requirements you are trying to address?
Block size > page cache size.
> You fundamentally cannot address the large blocksize requirements with 4k
> pages since you simply must have larger contiguous memory.
>
> Large blocksize means that the device can do I/O on blocks of that size.
>
> What can be done is to create some kind of fake linearity. At one level
> the radix tree and the address space already provide that. The radix tree
> allows you to find the next page etc. Another approach would be to create
> a virtual address space that fakes linearity even for the processor.
>
> Then there are ways with I/O mmus to avoid the issues again.
>
> However, you still have not addressed the underlying problem of the device
> not being able to do I/O to a larger block of memory.
With iommus and sg lists?
You guys have a couple of problems, firstly you need to have ia64
filesystems accessible to x86_64. And secondly you have these controllers
without enough sg entries for nice sized IOs.
I sympathise, and higher order pagecache might solve these in a way, but
I don't think it is the right way to go, mainly because of the fragmentation
issues.
Increasing PAGE_SIZE, support for block size > page cache size, and getting
io controllers matched to a 4K page size IMO would be some good ways to
solve these problems. I know they are probably harder...
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:11 ` Christoph Lameter
@ 2007-04-26 7:17 ` Nick Piggin
2007-04-26 7:28 ` Christoph Lameter
0 siblings, 1 reply; 235+ messages in thread
From: Nick Piggin @ 2007-04-26 7:17 UTC (permalink / raw)
To: Christoph Lameter
Cc: David Chinner, Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
Christoph Lameter wrote:
> On Thu, 26 Apr 2007, Nick Piggin wrote:
>
>
>>>Nick, what's the buffer layer? Are you talking about operations
>>>based on bufferheads?
>>
>>Yeah. Our pgoff_t->sector_t translation.
>
>
> Sadly the buffers in the buffer layer still assume contiguous memory. You
> would have to add a series of pointers there and then add a layer to
> handle this.
That's the least of the problems with rewriting the buffer layer.
But I maintain that the end result is better than the fragmentation
based approach. A lot of people don't actually want a bigger page
cache size, because they want efficient internal fragmentation as
well, so your radix-tree based approach isn't really comparable.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:10 ` Christoph Lameter
@ 2007-04-26 7:22 ` Nick Piggin
2007-04-26 7:34 ` Christoph Lameter
2007-04-26 7:48 ` Questions on printk and console_drivers gshan
0 siblings, 2 replies; 235+ messages in thread
From: Nick Piggin @ 2007-04-26 7:22 UTC (permalink / raw)
To: Christoph Lameter
Cc: Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Christoph Lameter wrote:
> On Thu, 26 Apr 2007, Nick Piggin wrote:
>
>
>>>Right, and we need to create a series of other approaches that we then label
>>>"non-hack" to replace it.
>>
>>I don't understand? We're talking about several utterly different designs
>>to approach these problems. You don't agree that one might be better than
>>another?
>
>
> What I am seeing is a series of approaches being put into the kernel to
> address this issue. We already have the lumpy reclaim there. Then we talk
> about other fixes to basic page handling in the kernel to make it better.
> Now you want yet another fs layer. All of that could be taken care of by
No I don't want to add another fs layer.
> a defrag approach with larger pages. This has been done a number of times
> before and actually the large page approach is a textbook example on how
> to improve performance. It goes waaaay back.
I still don't think anti fragmentation or defragmentation are a good
approach, when you consider the alternatives.
It is like Linus on the page colouring issue. That goes back a looong
way too, but that doesn't mean it is the right way to do it.
>>>The code paths can stay the same. You can switch CONFIG_LARGE pages off
>>>if you do not want it and it is as it was.
>>
>>That isn't a good reason to merge something. If you don't have numbers then
>>that just seems incredible.
>
>
> Don't worry, you will get numbers... I just did not have time to fix the bug
> in this one since I had to take care of something else.
OK, I would like to see them. And also discussions of things like why
we shouldn't increase PAGE_SIZE instead.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:15 ` Nick Piggin
@ 2007-04-26 7:22 ` Christoph Lameter
2007-04-26 7:42 ` Nick Piggin
2007-04-26 14:49 ` William Lee Irwin III
0 siblings, 2 replies; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 7:22 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Thu, 26 Apr 2007, Nick Piggin wrote:
> > mapping through the radix tree. You just need to change the way the
> > filesystem looks up pages.
>
> You didn't think any of the criticisms of higher order page cache size
> were valid?
They are all known points that have been discussed to death.
> > What are the exact requirements you are trying to address?
>
> Block size > page cache size.
But what do you mean by it? A block is no longer a contiguous section of
memory. So you have redefined the term.
> You guys have a couple of problems, firstly you need to have ia64
> filesystems accessible to x86_64. And secondly you have these controllers
> without enough sg entries for nice sized IOs.
This is not sgi specific sorry.
> I sympathise, and higher order pagecache might solve these in a way, but
> I don't think it is the right way to go, mainly because of the fragmentation
> issues.
And you don't care about Mel's work on that level?
> Increasing PAGE_SIZE, support for block size > page cache size, and getting
> io controllers matched to a 4K page size IMO would be some good ways to
> solve these problems. I know they are probably harder...
No, this has been tried before and does not work. Why should we lose the
capability to work with 4k pages just because there is some data that
has to be thrown around in quantity? I'd like to have flexibility here.
The fragmentation problem is solvable and we already have a solution in
mm. So I do not really see a problem there?
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:17 ` Nick Piggin
@ 2007-04-26 7:28 ` Christoph Lameter
2007-04-26 7:45 ` Nick Piggin
0 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 7:28 UTC (permalink / raw)
To: Nick Piggin
Cc: David Chinner, Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Thu, 26 Apr 2007, Nick Piggin wrote:
> But I maintain that the end result is better than the fragmentation
> based approach. A lot of people don't actually want a bigger page
> cache size, because they want efficient internal fragmentation as
> well, so your radix-tree based approach isn't really comparable.
Me? Radix tree based approach? That approach is in the kernel. Do not
create a solution where there is no problem. If we do not want to
support large blocksizes then let's be honest and say so instead of
redefining what a block is. The current approach is fine if one is
satisfied with scatter gather and the VM overhead that comes with handling
these pages. I fail to see what any of what you are proposing would add to
that.
Let's be clear here: a bigger page cache size, if it's just the one, is not
useful. A 4k page size is a good size for many files on the system and
changing it would break the binary format. I just do not want it to be the
only one, because different usage scenarios may require different page
sizes for optimal application performance.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:22 ` Nick Piggin
@ 2007-04-26 7:34 ` Christoph Lameter
2007-04-26 7:48 ` Nick Piggin
2007-04-26 13:50 ` William Lee Irwin III
2007-04-26 7:48 ` Questions on printk and console_drivers gshan
1 sibling, 2 replies; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 7:34 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Thu, 26 Apr 2007, Nick Piggin wrote:
> No I don't want to add another fs layer.
Well maybe you could explain what you want. Preferably without redefining
the established terms?
> I still don't think anti fragmentation or defragmentation are a good
> approach, when you consider the alternatives.
I have not heard of any alternatives in this discussion here. Just the old
line of let's tune the VM here and there and hope it lasts a while longer.
> OK, I would like to see them. And also discussions of things like why
> we shouldn't increase PAGE_SIZE instead.
Because 4k is a good page size that is bound to the binary format? Frankly
there is no point in having my text files in large page sizes. However,
when I read a dvd then I may want to transfer 64k chunks, or when I use my
flash drive I may want to transfer 128k chunks. And yes if a scientific
application needs to do data dump then it should be able to use very high
page sizes (megabytes, gigabytes) to be able to continue its work while
the huge dump runs at full I/O speed ...
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:22 ` Christoph Lameter
@ 2007-04-26 7:42 ` Nick Piggin
2007-04-26 10:48 ` Mel Gorman
` (3 more replies)
2007-04-26 14:49 ` William Lee Irwin III
1 sibling, 4 replies; 235+ messages in thread
From: Nick Piggin @ 2007-04-26 7:42 UTC (permalink / raw)
To: Christoph Lameter
Cc: Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Christoph Lameter wrote:
> On Thu, 26 Apr 2007, Nick Piggin wrote:
>
>
>>>mapping through the radix tree. You just need to change the way the
>>>filesystem looks up pages.
>>
>>You didn't think any of the criticisms of higher order page cache size
>>were valid?
>
>
> They are all known points that have been discussed to death.
I missed the part where you showed that it was a better solution than
the alternatives.
>>>What are the exact requirements you are trying to address?
>>
>>Block size > page cache size.
>
>
> But what do you mean with it? A block is no longer a contiguous section of
> memory. So you have redefined the term.
I don't understand what you mean at all. A block has always been a
contiguous area of disk.
>>You guys have a couple of problems, firstly you need to have ia64
>>filesystems accessible to x86_64. And secondly you have these controllers
>>without enough sg entries for nice sized IOs.
>
>
> This is not sgi specific sorry.
>
>
>>I sympathise, and higher order pagecache might solve these in a way, but
>>I don't think it is the right way to go, mainly because of the fragmentation
>>issues.
>
>
> And you don't care about Mel's work on that level?
I actually don't like it too much because it can't provide a robust
solution. What do you do on systems with small memories, or those that
eventually do get fragmented?
Actually, I don't know why people are so excited about being able to
use higher order allocations (I would rather be more excited about
never having to use them). But for those few places that really need
it, I'd rather see them use a virtually mapped kernel with proper
defragmentation rather than putting hacks all through the core code.
>>Increasing PAGE_SIZE, support for block size > page cache size, and getting
>>io controllers matched to a 4K page size IMO would be some good ways to
>>solve these problems. I know they are probably harder...
>
>
> No, this has been tried before and does not work. Why should we lose the
> capability to work with 4k pages just because there is some data that
> has to be thrown around in quantity? I'd like to have flexibility here.
Is that a big problem? Really? You use 16K pages on your IPF systems,
don't you?
> The fragmentation problem is solvable and we already have a solution in
> mm. So I do not really see a problem there?
I don't think that it is solved, and I think the heuristics that are
there would be put under more stress if they become widely used. And
it isn't only about whether we can get the page or not, but also about
the cost. Look up Linus's arguments about page colouring, which are
similar and I also think are pretty valid.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:28 ` Christoph Lameter
@ 2007-04-26 7:45 ` Nick Piggin
2007-04-26 18:10 ` Christoph Lameter
0 siblings, 1 reply; 235+ messages in thread
From: Nick Piggin @ 2007-04-26 7:45 UTC (permalink / raw)
To: Christoph Lameter
Cc: David Chinner, Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
Christoph Lameter wrote:
> On Thu, 26 Apr 2007, Nick Piggin wrote:
>
>
>>But I maintain that the end result is better than the fragmentation
>>based approach. A lot of people don't actually want a bigger page
>>cache size, because they want efficient internal fragmentation as
>>well, so your radix-tree based approach isn't really comparable.
>
>
> Me? Radix tree based approach? That approach is in the kernel. Do not
> create a solution where there is no problem. If we do not want to
> support large blocksizes then lets be honest and say so instead of
> redefining what a block is. The current approach is fine if one is
> satisfied with scatter gather and the VM overhead coming with handling
> these pages. I fail to see what any of what you are proposing would add to
> that.
I'm not just making this up. Fragmentation. OK?
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:34 ` Christoph Lameter
@ 2007-04-26 7:48 ` Nick Piggin
2007-04-26 9:20 ` David Chinner
2007-04-26 16:07 ` Christoph Hellwig
2007-04-26 13:50 ` William Lee Irwin III
1 sibling, 2 replies; 235+ messages in thread
From: Nick Piggin @ 2007-04-26 7:48 UTC (permalink / raw)
To: Christoph Lameter
Cc: Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Christoph Lameter wrote:
> On Thu, 26 Apr 2007, Nick Piggin wrote:
>
>
>>No I don't want to add another fs layer.
>
>
> Well maybe you could explain what you want. Preferably without redefining
> the established terms?
Support for larger buffers than page cache pages.
>>I still don't think anti fragmentation or defragmentation are a good
>>approach, when you consider the alternatives.
>
>
> I have not heard of any alternatives in this discussion here. Just the old
> line of let's tune the VM here and there and hope it lasts a while longer.
I didn't realise that one was even in the running. How can you "tune" the
VM to handle bigger block sizes?
>>OK, I would like to see them. And also discussions of things like why
>>we shouldn't increase PAGE_SIZE instead.
>
>
> Because 4k is a good page size that is bound to the binary format? Frankly
> there is no point in having my text files in large page sizes. However,
> when I read a dvd then I may want to transfer 64k chunks, or when I use my
> flash drive I may want to transfer 128k chunks. And yes if a scientific
> application needs to do data dump then it should be able to use very high
> page sizes (megabytes, gigabytes) to be able to continue its work while
> the huge dump runs at full I/O speed ...
So block size > page cache size... also, you should obviously be using
hardware that is tuned to work well with 4K pages, because surely there
is lots of that around.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Questions on printk and console_drivers
2007-04-26 7:22 ` Nick Piggin
2007-04-26 7:34 ` Christoph Lameter
@ 2007-04-26 7:48 ` gshan
1 sibling, 0 replies; 235+ messages in thread
From: gshan @ 2007-04-26 7:48 UTC (permalink / raw)
To: linux-kernel
Cc: Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Hi Folks,
I have a question on printk and console_drivers. I have 2 serial ports
and want to see output from both of them. So I registered 2 console
drivers using register_console. I can see the output from serial port 1
after it was registered, but I can't see any output from it after serial
port 2 has been registered.
I don't know why.
Thanks,
Gavin
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [10/17] Variable Order Page Cache: Add clearing and flushing function
2007-04-26 7:02 ` Christoph Lameter
@ 2007-04-26 8:14 ` David Chinner
0 siblings, 0 replies; 235+ messages in thread
From: David Chinner @ 2007-04-26 8:14 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26, 2007 at 12:02:31AM -0700, Christoph Lameter wrote:
> On Tue, 24 Apr 2007, clameter@sgi.com wrote:
>
> > +{ \
> > + memset(page_address(__page), (__offset), (__size)); \
> > + flush_mapping_page(__page); \
>
> This was borked. Dave does this patch make it work?
>
> ---
> include/linux/highmem.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6.21-rc7/include/linux/highmem.h
> ===================================================================
> --- linux-2.6.21-rc7.orig/include/linux/highmem.h 2007-04-25 23:59:33.000000000 -0700
> +++ linux-2.6.21-rc7/include/linux/highmem.h 2007-04-26 00:00:05.000000000 -0700
> @@ -88,7 +88,7 @@ static inline void clear_highpage(struct
> */
> #define memclear_highpage_flush(__page,__offset,__size) \
> { \
> - memset(page_address(__page), (__offset), (__size)); \
> + memset(page_address(__page) + (__offset), 0, (__size)); \
> flush_mapping_page(__page); \
> }
> #else
Yes, works for me.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 6:50 ` Nick Piggin
@ 2007-04-26 8:40 ` Mel Gorman
2007-04-26 8:55 ` Nick Piggin
2007-04-26 16:11 ` Christoph Hellwig
1 sibling, 1 reply; 235+ messages in thread
From: Mel Gorman @ 2007-04-26 8:40 UTC (permalink / raw)
To: Nick Piggin
Cc: David Chinner, Eric W. Biederman, clameter, linux-kernel,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On (26/04/07 16:50), Nick Piggin didst pronounce:
> David Chinner wrote:
> >On Thu, Apr 26, 2007 at 03:37:28PM +1000, Nick Piggin wrote:
> >
> >>I think starting with the assumption that we _want_ to use higher order
> >>allocations, and then creating all this complexity around that is not a
> >>good one, and if we start introducing things that _require_ significant
> >>higher order allocations to function then it is a nasty thing for
> >>robustness.
> >
> >
> >From my POV, we started with the problem of how to provide atomic
> >access to a multi-page block in the page cache. For example, we want
> >to lock the filesystem block and prevent any updates to it, so we
> >need to lock all the pages in it. And then when we write them back,
> >they all need to change state at the same time, and they all need to
> >have their radix tree tags changed at the same time, the problem of
> >mapping them to disk, getting writeback to do block aligned and
> >sized writeback chunks, and so on.
> >
> >And then there's the problem that most hardware is limited to 128
> >s/g entries and that means 128 non-contiguous pages in memory is the
> >maximum I/O size we can issue to these devices. We have RAID arrays
> >that go twice as fast if we can send them 1MB I/Os instead of 512k
> >I/Os and that means we need contiguous pages to be handled to the
> >devices....
> >
> >All of these things require some aggregating structure to
> >co-ordinate. In times gone by on other OSs, this has been done with
> >a buffer cache, but Linux got rid of that long ago and we don't want
> >to reintroduce one. We can't use buffer heads - they can only point
> >to one page. So what do we do?
>
> Improving the buffer layer would be a good way. Of course, that is
> a long and difficult task, so nobody wants to do it.
>
>
> >That's where compound pages are so nice - they solve all of these
> >problems with only a very small amount of code perturbation and they
> >don't change any algorithms or fundamental design of the OS at all.
> >
> >We don't have to rewrite filesystems to support this. We don't have
> >to redesign the VM to support this. We don't have to do very much
> >work to the block layer and drivers to make this work.
>
> Fragmentation is the problem. The anti-frag patches don't actually
> guarantee anything about fragmentation, and even if they did, then
Grouping pages by mobility does not guarantee anything, but the memory
partition (kernelcore= boot parameter) does give hard guarantees about
the amount of memory that is "movable". Of course, the partition requires
configuration at boot-time so it's less than ideal but it does give hard
guarantees.
> it takes more effort to actually find and reclaim higher order
> pages (even after the lumpy reclaim thing). So we've redesigned
> the page allocator and page reclaim and still don't have something
> that you should rely on without fallbacks.
>
>
> >FWIW, if you want 32 bit machines to support larger than 16TB
> >devices, you need high order page indexing in the page cache....
>
> How about a 64-bit pgoff_t and radix-tree key? Or larger PAGE_SIZE?
>
Larger page size can result in wastage for small files and there is no
guarantee that the hardware supports the large page size so you now have to
deal with inserting pages into PTEs that are smaller than PAGE_SIZE.
>
> >>I am working now and again on some code to do this, it is a big job but
> >>I think it is the right way to do it. But it would take a long time to
> >>get stable and supported by filesystems...
> >
> >
> >Compared to a method that requires almost no change to the
> >filesystems that want to support large block sizes?
>
> Luckily we have filesystem maintainers who want to support larger
> block sizes ;)
>
>
> >>>That would make the choice of using larger order pages (essentially
> >>>increasing PAGE_SIZE) something that can be investigated in parallel.
> >>
> >>I agree that hardware inefficiencies should be handled by increasing
> >>PAGE_SIZE (not making PAGE_CACHE_SIZE > PAGE_SIZE) at the arch level.
> >
> >
> >And what do we do for arches that can't do multiple page sizes, only
> >only have a limited and mostly useless set of page sizes to choose
> >from?
>
> Well, for those architectures (and this would solve your large block
> size and 16TB pagecache size without any core kernel changes), you
> can manage 1<<order hardware ptes as a single Linux pte. There is
> nothing that says you must implement PAGE_SIZE as a single TLB sized
> page.
Indeed but then you have to deal with internal fragmentation
for pages-larger-than-TLB-page. I'm not saying it's wrong but it does
come with its own set of issues.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 8:40 ` Mel Gorman
@ 2007-04-26 8:55 ` Nick Piggin
2007-04-26 10:30 ` Mel Gorman
0 siblings, 1 reply; 235+ messages in thread
From: Nick Piggin @ 2007-04-26 8:55 UTC (permalink / raw)
To: Mel Gorman
Cc: David Chinner, Eric W. Biederman, clameter, linux-kernel,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
Mel Gorman wrote:
> On (26/04/07 16:50), Nick Piggin didst pronounce:
>>Fragmentation is the problem. The anti-frag patches don't actually
>>guarantee anything about fragmentation, and even if they did, then
>
>
> The grouping pages by mobility do not guarantee anything but the memory
> partition (kernelcore= boot parameter) does give hard guarantees about
> the amount of memory that is "movable". Of course, the partition requires
> configuration at boot-time so it's less than ideal but it does give hard
> guarantees.
For the hugepages people, I can understand that's a solution. But that's
the last thing you want to do on a system with a limited amount of memory,
or a regular Joe's desktop/server.
> Indeed but then you have to deal with internal fragmentation
> for pages-larger-than-TLB-page. I'm not saying it's wrong but it does
> come with it's own set of issues.
None of them is perfect (the ways to increase the size of pagecache pages,
that is).
I think in the long term, TLB page sizes will probably increase a little
bit... but if a given page size is "good enough" for a CPU, they really
should be good enough for other hardware. I mean, come on, the CPU's TLB
has to have a good hit ratio and handle several lookups per cycle with a
3-cycle latency on 3GHz+ hardware... surely an IO controller's
scatter-gather engine or IOMMU that has to do a few lookups per disk IO
is nowhere near so critical as a CPU's datapath: just add a few more
entries to it, they've already got hundreds of megs of cache, so that
isn't an issue either.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 6:37 ` Christoph Lameter
@ 2007-04-26 9:16 ` Mel Gorman
0 siblings, 0 replies; 235+ messages in thread
From: Mel Gorman @ 2007-04-26 9:16 UTC (permalink / raw)
To: Christoph Lameter
Cc: Eric W. Biederman, linux-kernel, William Lee Irwin III,
David Chinner, Jens Axboe, Badari Pulavarty, Maxim Levitsky
On (25/04/07 23:37), Christoph Lameter didst pronounce:
> On Wed, 25 Apr 2007, Eric W. Biederman wrote:
>
> > You are trying to couple something that has no business being coupled
> > as it reduces the system usability when you couple them.
>
> What I am coupling? The approach solves a series of issues as far as I can
> tell.
>
> > > But that is due to the VM (at least Linus tree) having no defrag methods.
> > > mm has Mel's antifrag methods and can do it.
> >
> > This is fundamental. Fragmentation when you multiple chunk sizes
> > cannot be solved without a the ability to move things in memory,
> > whereas it doesn't exist when you only have a single chunk size.
>
> We have that ability (although in a limited form) right now.
>
And grouping pages by mobility works best when the majority of memory is
used as page cache and other movable/reclaimable allocations which it be
for the majority of workloads that care about larger blocksizes. If a
failure case is found, the memory partitioning is there to give hard
guarantees until I figure out what went wrong.
> > > Yes you get lots of small request *because* we do not support defrag and
> > > cannot large contiguous allocations.
> >
> > Lots of small requests are fundamental. If lots of small requests were
> > not fundamental we would gets large requests scatter gather requests.
>
> That is a statement of faith in small requests? Small requests are
> fundamental so we want them?
>
> > > Ummm the other arches read 16k blocks of contigous memory. That is not
> > > supported on 4k platforms right now. I guess you you move those to vmalloc
> > > areas? Want to hack the filesystems for this?
> >
> > Freak no. You teach the code how to have a block in multiple physical
> > pages.
>
> This aint gonna work without something that stores the information about
> how the pieces come together. Teach the code.... More hacks.
>
> > > There are multiple scaling issues in the kernel. What you propose is to
> > > add hack over hack into the VM to avoid having to deal with
> > > defragmentation. That in turn will cause churn with hardware etc
> > > etc.
> >
> > No. I propose to avoid all designs that have the concept of
> > fragmentation.
>
> There are such designs? You can limit fragmentation but not avoid it.
>
Indeed, it can't be eliminated unless all memory is movable, which it isn't.
That's why grouping pages by mobility keeps migratable+reclaimable memory
in one set of blocks and reclaimable (mainly slab) in a second set on the
knowledge that truly unmovable allocations are rare.
Heuristic it might be, but I expect it'll work well in practice. This sort
of patchset will put the fragmentation avoidance under more pressure than I
was expecting so problems will be found sooner rather than later. It's also
worth bearing in mind that the high-order allocations looked for here are
at the order-3 or order-4 level instead of the order-9 and order-10 allocations
that I normally test with and get reasonably high success rates for.
Besides, we've seen with the normal kernel that order-3 allocations
(e1000 jumbo frames) work longer than one would expect without fragmentation
avoidance and they are atomic allocations as well as everything else. With
fragmentation avoidance, we should be able to handle it although I'll admit
that jumbo frame allocations are nowhere near as long lived. If I'm wrong,
the allocation failure bug reports will roll in in a very obvious manner.
> > There is an argument for having struct page control more than 4K
> > of memory even when the hardware page size is 4K. But that is a
> > separate problem. And by only having one size we still don't
> > get fragmentation.
>
> We have fragmentation because we cannot limit our allocation sizes to 4k.
> The stack is already 8k and there are subsystems that need more (f.e.
> for jumbo frames). Then there is the huge page subsystem that is used to
> avoid the TLB pressure that comes with small pages.
>
> I think we are doing our typical community thing of running away from the
> problem and developing ways to explain why our traditional approaches are
> right.
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:48 ` Nick Piggin
@ 2007-04-26 9:20 ` David Chinner
2007-04-26 13:53 ` Avi Kivity
2007-04-26 15:20 ` Nick Piggin
2007-04-26 16:07 ` Christoph Hellwig
1 sibling, 2 replies; 235+ messages in thread
From: David Chinner @ 2007-04-26 9:20 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26, 2007 at 05:48:12PM +1000, Nick Piggin wrote:
> Christoph Lameter wrote:
> >On Thu, 26 Apr 2007, Nick Piggin wrote:
> >
> >
> >>No I don't want to add another fs layer.
> >
> >
> >Well maybe you could explain what you want. Preferably without redefining
> >the established terms?
>
> Support for larger buffers than page cache pages.
The problem with this approach is that it turns around the whole
way we look at bufferheads. Right now we have well defined 1:n
mapping of page to bufferheads and so we typically lock the
page first then iterate all the bufferheads on the page.
Going the other way, we need to support m:n, which means
the buffer has to become the primary interface for the filesystem
to the page cache. i.e. we need to lock the bufferhead first, then
iterate all the pages on it. This is messy because the cache indexes
via pages, not bufferheads. Hence a buffer needs to point to all the
pages in it explicitly, and this leads to interesting issues with
locking.
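[Editorial note: the 1:n page-to-bufferhead walk described above is the standard
kernel idiom of locking the page and then following the circular b_this_page
ring. A minimal sketch of that pattern (illustrative only, not code from this
patchset):]
#include <linux/buffer_head.h>
#include <linux/pagemap.h>
/* Illustrative only: lock the page first, then visit each buffer_head
 * attached to it via the b_this_page ring. */
static void touch_buffers_on_page(struct page *page)
{
	struct buffer_head *head, *bh;
	lock_page(page);
	if (page_has_buffers(page)) {
		head = page_buffers(page);
		bh = head;
		do {
			/* per-block state lives here: bh->b_blocknr,
			 * bh->b_state, bh->b_size, ... */
			bh = bh->b_this_page;
		} while (bh != head);
	}
	unlock_page(page);
}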
If you still think that this is a good idea, I suggest that you
spend a bit of time looking at fs/xfs/linux-2.6/xfs_buf.c, because
that is *exactly* what this does - it is a multi-page buffer
interface on top of a block device address space radix tree. This
cache is the reason that XFS was so easy to transition to large
block sizes (I only needed to convert the data path).
However, this approach has some serious problems:
- need to index buffers so that lookups can be done
on buffer before page
- completely different locking is required
- needs memory allocation to hold more than 4 pages
- needs vmap() rather than kmap_atomic() for mapping
multi-page buffers
- I/O needs to be issued based on buffers, not pages
- needs its own flush code
- does not interface with memory reclaim well
IOWs, we need to turn every filesystem completely upside down to
make it work with this sort of large page infrastructure, not to mention
the rest of the VM (mmap, page reclaim, etc). It's back to the
bad ol' days of buffer caches again and we don't want to go back
there.
Compared to a buffer based implementation, the high order page cache
is a picture of elegance and refined integration. It is an
evolutionary step, not a disconnect, from what we have now....
> >Because 4k is a good page size that is bound to the binary format? Frankly
> >there is no point in having my text files in large page sizes. However,
> >when I read a dvd then I may want to transfer 64k chunks or when use my
> >flash drive I may want to transfer 128k chunks. And yes if a scientific
> >application needs to do data dump then it should be able to use very high
> >page sizes (megabytes, gigabytes) to be able to continue its work while
> >the huge dumps runs at full I/O speed ...
>
> So block size > page cache size... also, you should obviously be using
> hardware that is tuned to work well with 4K pages, because surely there
> is lots of that around.
The CPU hardware works well with 4k pages, but in general I/O
hardware works more efficiently as the numbers of s/g entries they
require drops for a given I/O size. Given that we limit drivers to
128 s/g entries, we really aren't using I/O hardware to its full
potential or at its most efficient by limiting each s/g entry to a
single 4k page.
And FWIW, having a buffer for block size > page size does not
solve this problem - only contiguous page allocation solves this
problem.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 6:46 ` Christoph Lameter
2007-04-26 6:57 ` Nick Piggin
@ 2007-04-26 10:06 ` Mel Gorman
2007-04-26 14:47 ` Nick Piggin
1 sibling, 1 reply; 235+ messages in thread
From: Mel Gorman @ 2007-04-26 10:06 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Eric W. Biederman, linux-kernel,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On (25/04/07 23:46), Christoph Lameter didst pronounce:
> On Thu, 26 Apr 2007, Nick Piggin wrote:
>
> > Yeah. IMO anti-fragmentation and defragmentation is the hack, and we
> > should stay away from higher order allocations whenever possible.
>
> Right and we need to create series of other approaches that we then label
> "non-hack" to replace it.
>
To date, there hasn't been a credible alternative to dealing with
fragmentation. Breaking the 1:1 virtual:physical mapping and defragmenting
would incur a serious performance hit.
> > Hardware is built to handle many small pages efficintly, and I don't
> > understand how it could be an SGI-only issue. Sure, you may have an
> > order of magnitude or more memory than anyone else, but even my lowly
> > desktop _already_ has orders of magnitude more pages than it has TLB
> > entries or cache -- if a workload is cache-nice for me, it probably
> > will be on a 1TB machine as well, and if it is bad for the 1TB machine,
> > it is also bad on mine.
>
> There have been numbers of people that have argued the same point. Just
> because we have developed a way of thinking to defend our traditional 4k
> values does not make them right.
>
> > If this is instead an issue of io path or reclaim efficiency, then it
> > would be really nice to see numbers... but I don't think making these
> > fundamental paths more complex and slower is a nice way to fix it
> > (larger PAGE_SIZE would be, though).
>
> The code paths can stay the same. You can switch CONFIG_LARGE pages off
> if you do not want it and it is as it was.
>
It may not even need that much effort. The most stressful use of the
high order allocation paths here requires the creation of a filesystem and
is a deliberate action by the user.
> If you would have a look the patches: The code is significantly cleanup
> and easier to read.
It is easier to read all right and may be worth doing anyway.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 6:38 ` David Chinner
2007-04-26 6:50 ` Nick Piggin
@ 2007-04-26 10:10 ` Eric W. Biederman
2007-04-26 13:50 ` David Chinner
2007-04-26 18:07 ` Christoph Lameter
1 sibling, 2 replies; 235+ messages in thread
From: Eric W. Biederman @ 2007-04-26 10:10 UTC (permalink / raw)
To: David Chinner
Cc: Nick Piggin, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
David Chinner <dgc@sgi.com> writes:
> On Thu, Apr 26, 2007 at 03:37:28PM +1000, Nick Piggin wrote:
>> I think starting with the assumption that we _want_ to use higher order
>> allocations, and then creating all this complexity around that is not a
>> good one, and if we start introducing things that _require_ significant
>> higher order allocations to function then it is a nasty thing for
>> robustness.
>
> From my POV, we started with the problem of how to provide atomic
> access to a multi-page block in the page cache. For example, we want
> to lock the filesystem block and prevent any updates to it, so we
> need to lock all the pages in it. And then when we write them back,
> they all need to change state at the same time, and they all need to
> have their radix tree tags changed at the same time, the problem of
> mapping them to disk, getting writeback to do block aligned and
> sized writeback chunks, and so on.
Ok. That is a reasonable problem and worth solving.
I suspect the easiest way to go is to have something in the code
that points all of the locking activities at the first page in the
block. Like large pages do but they don't have to be physically
contiguous.
I think that would be even less code than what you are proposing.
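[Editorial note: a rough sketch of the idea being floated here; the structure and
helper below are invented purely for illustration, nothing like this exists in the
patchset:]
#include <linux/mm.h>
#include <linux/pagemap.h>
/*
 * Hypothetical: every page-cache page that belongs to a multi-page block
 * carries a pointer to the block's first page, and block-level locking is
 * redirected there (the way compound pages redirect to their head page),
 * without requiring the member pages to be physically contiguous.
 */
struct pcache_block_page {
	struct page *page;	/* one (possibly non-contiguous) member page */
	struct page *head;	/* first page of the logical block           */
};
static void lock_pcache_block(struct pcache_block_page *p)
{
	lock_page(p->head);	/* one lock serialises the whole block */
}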
> And then there's the problem that most hardware is limited to 128
> s/g entries and that means 128 non-contiguous pages in memory is the
> maximum I/O size we can issue to these devices. We have RAID arrays
> that go twice as fast if we can send them 1MB I/Os instead of 512k
> I/Os and that means we need contiguous pages to be handled to the
> devices....
Ok. Now why are high end hardware manufacturers building crippled
hardware? Or is there only an 8bit field in SCSI for describing
scatter gather entries? Although I would think this would be
more of a controller rather than a drive issue.
> All of these things require some aggregating structure to
> co-ordinate. In times gone by on other OSs, this has been done with
> a buffer cache, but Linux got rid of that long ago and we don't want
> to reintroduce one. We can't use buffer heads - they can only point
> to one page. So what do we do?
For I/O we have the BIO which can point to multiple pages just fine.
Buffer heads are irrelevant. The question is how do we get to
the page cache from the BIO and from the BIO to the page cache.
> That's where compound pages are so nice - they solve all of these
> problems with only a very small amount of code perturbation and they
> don't change any algorithms or fundamental design of the OS at all.
They change the fundamental fragmentation avoidance algorithm of the
OS. Use only one size of page. That is a huge problem.
Yes we do relax that rule but only on things that we don't care
about much and don't mind failing.
> We don't have to rewrite filesystems to support this. We don't have
> to redesign the VM to support this. We don't have to do very much
> work to the block layer and drivers to make this work.
This changes the fundamental algorithm for avoiding fragmentation
problems in the VM. Don't have multiple page sizes.
That rule is relaxed allowing us to use larger pages in special
cases but this is not a special case. This is the common case.
That is a huge concern.
> FWIW, if you want 32 bit machines to support larger than 16TB
> devices, you need high order page indexing in the page cache....
Huh? Only if you are doing raw device accesses. We are limited
to 16TB files.
>> >That would make the choice of using larger order pages (essentially
>> >increasing PAGE_SIZE) something that can be investigated in parallel.
>>
>> I agree that hardware inefficiencies should be handled by increasing
>> PAGE_SIZE (not making PAGE_CACHE_SIZE > PAGE_SIZE) at the arch level.
>
> And what do we do for arches that can't do multiple page sizes, only
> only have a limited and mostly useless set of page sizes to choose
> from?
You have HW_PAGE_SIZE != PAGE_SIZE. That is, you hide from the bulk of
the kernel the fact that struct page manages 2 or more real hardware pages.
But you expose it to the handful of places that actually care.
Partly this is a path you are starting down in your patches, with
larger page cache support.
Eric
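[Editorial note: a very rough sketch of the HW_PAGE_SIZE != PAGE_SIZE idea; every
name below is invented for illustration and no architecture of the time did
exactly this:]
/*
 * Hypothetical arch glue: the generic kernel sees 16k pages while the MMU
 * only knows 4k pages, so one software PTE fans out to four hardware PTEs.
 */
#define HW_PAGE_SHIFT	12				/* 4k hardware page  */
#define SW_PAGE_SHIFT	14				/* 16k software page */
#define HW_PTES_PER_PTE	(1 << (SW_PAGE_SHIFT - HW_PAGE_SHIFT))
typedef unsigned long pte_t;
static void set_hw_pte(unsigned long vaddr, pte_t pte)	/* stub */
{
	(void)vaddr; (void)pte;		/* would program the real MMU */
}
static void set_sw_pte(unsigned long vaddr, unsigned long paddr,
		       unsigned long prot)
{
	int i;
	/* The "handful of places that actually care" do the fan-out. */
	for (i = 0; i < HW_PTES_PER_PTE; i++)
		set_hw_pte(vaddr + ((unsigned long)i << HW_PAGE_SHIFT),
			   (pte_t)((paddr + ((unsigned long)i << HW_PAGE_SHIFT)) | prot));
}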
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 8:55 ` Nick Piggin
@ 2007-04-26 10:30 ` Mel Gorman
2007-04-26 10:54 ` Eric W. Biederman
0 siblings, 1 reply; 235+ messages in thread
From: Mel Gorman @ 2007-04-26 10:30 UTC (permalink / raw)
To: Nick Piggin
Cc: David Chinner, Eric W. Biederman, clameter, linux-kernel,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On (26/04/07 18:55), Nick Piggin didst pronounce:
> Mel Gorman wrote:
> >On (26/04/07 16:50), Nick Piggin didst pronounce:
>
> >>Fragmentation is the problem. The anti-frag patches don't actually
> >>guarantee anything about fragmentation, and even if they did, then
> >
> >
> >The grouping pages by mobility do not guarantee anything but the memory
> >partition (kernelcore= boot parameter) does give hard guarantees about
> >the amount of memory that is "movable". Of course, the partition requires
> >configuration at boot-time so it's less than ideal but it does give hard
> >guarantees.
>
> For the hugepages people, I can understand that's a solution.
I think that's the closest you have ever come to saying fragmentation
avoidance is not a terrible idea :)
> But that's
> the last thing you want to do on a system with a limited amount of memory,
> or a regular Joe's desktop/server.
>
Regular Joe is not going to be creating a filesystem with large blocks or
overly concerned with saturating all the disks hanging off his RAID array.
At most, he'll care about faster DVD writing and considering the number and
duration of those allocations, the system will be able to handle it. So I
don't think Regular Joe will generally care.
> >Indeed but then you have to deal with internal fragmentation
> >for pages-larger-than-TLB-page. I'm not saying it's wrong but it does
> >come with it's own set of issues.
>
> None of them is perfect (the ways to increase the size of pagecache pages,
> that is).
>
> I think in the long term, TLB page sizes will probably increase a little
> bit... but if a given page size is "good enough" for a CPU, they really
> should be good enough for other hardware. I mean, come on, the CPU's TLB
> has to have a good hit ratio and handle several lookups per cycle with a
> 3-cycle latency on 3GHz+ hardware... surely a an IO controller's
> scatter-gather engine or IOMMU that has to do a few lookups per disk IO
> is nowhere near so critical as a CPU's datapath: just add a few more
> entries to it, they've already got hundreds of megs of cache, so that
> isn't an issue either.
>
I cannot speak with authority on what IO controllers are really capable
of so maybe someone else will comment on this more.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:42 ` Nick Piggin
@ 2007-04-26 10:48 ` Mel Gorman
2007-04-26 12:37 ` Andy Whitcroft
` (2 subsequent siblings)
3 siblings, 0 replies; 235+ messages in thread
From: Mel Gorman @ 2007-04-26 10:48 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, Eric W. Biederman, linux-kernel,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On (26/04/07 17:42), Nick Piggin didst pronounce:
> Christoph Lameter wrote:
> >On Thu, 26 Apr 2007, Nick Piggin wrote:
> >
> >
> >>>mapping through the radix tree. You just need to change the way the
> >>>filesystem looks up pages.
> >>
> >>You didn't think any of the criticisms of higher order page cache size
> >>were valid?
> >
> >
> >They are all known points that have been discussed to death.
>
> I missed the part where you showed that it was a better solution than
> the alternatives.
>
>
> >>>What are the exact requirement you are trying to address?
> >>
> >>Block size > page cache size.
> >
> >
> >But what do you mean with it? A block is no longer a contiguous section of
> >memory. So you have redefined the term.
>
> I don't understand what you mean at all. A block has always been a
> contiguous area of disk.
>
Yes, but what you seem to be proposing is that lower layers be
able to treat non-contiguous pages as one IO artifact - like
non-contiguous-compound-pages. Ultimately both are probably needed. i.e.
Use contiguous pages for large blocks where possible but be able to deal
with the compound page consisting of multiple smaller pages with a
regression in performance when necessary.
I don't think what Christoph and Nick are proposing are mutually
exclusive. The argument is really "which do we deal with first".
>
> >>You guys have a couple of problems, firstly you need to have ia64
> >>filesystems accessable to x86_64. And secondly you have these controllers
> >>without enough sg entries for nice sized IOs.
> >
> >
> >This is not sgi specific sorry.
> >
> >
> >>I sympathise, and higher order pagecache might solve these in a way, but
> >>I don't think it is the right way to go, mainly because of the
> >>fragmentation
> >>issues.
> >
> >
> >And you dont care about Mel's work on that level?
>
> I actually don't like it too much because it can't provide a robust
> solution. What do you do on systems with small memories, or those that
> eventually do get fragmented?
>
They won't be creating filesystems with large blocks for a start, but
even if they did, they would need your proposal.
Again, I don't think they are mutually exclusive as such.
> Actually, I don't know why people are so excited about being able to
> use higher order allocations (I would rather be more excited about
> never having to use them). But for those few places that really need
> it, I'd rather see them use a virtually mapped kernel with proper
> defragmentation rather than putting hacks all through the core code.
>
That involves creating a vmalloc-like area or breaking the 1:1 physical:virtual
mapping in the kernel address space. Of those two, I think a vmalloc area
would be easier, have less performance impact and might even be a starting
point for non-contiguous-compound-pages.
hmm.
> >>Increasing PAGE_SIZE, support for block size > page cache size, and
> >>getting
> >>io controllers matched to a 4K page size IMO would be some good ways to
> >>solve these problems. I know they are probably harder...
> >
> >
> >No this has been tried before and does not work. Why should we loose the
> >capability to work with 4k pages just because there is some data that
> >has to be thrown around in quantity? I'd like to have flexibility here.
>
> Is that a big problem? Really? You use 16K pages on your IPF systems,
> don't you?
>
>
> >The fragmentation problem is solvable and we already have a solution in
> >mm. So I do not really see a problem there?
>
> I don't think that it is solved, and I think the heuristics that are
> there would be put under more stress if they become widely used. And
> it isn't only about whether we can get the page or not, but also about
> the cost. Look up Linus's arguments about page colouring, which are
> similar and I also think are pretty valid.
>
Page coloring has come up a lot in the past. Can you point me at an
example you have in mind and I'll see if the same arguments really
apply.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 10:30 ` Mel Gorman
@ 2007-04-26 10:54 ` Eric W. Biederman
2007-04-26 12:23 ` Mel Gorman
2007-04-26 17:58 ` Christoph Lameter
0 siblings, 2 replies; 235+ messages in thread
From: Eric W. Biederman @ 2007-04-26 10:54 UTC (permalink / raw)
To: Mel Gorman
Cc: Nick Piggin, David Chinner, clameter, linux-kernel,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
mel@skynet.ie (Mel Gorman) writes:
> On (26/04/07 18:55), Nick Piggin didst pronounce:
>> Mel Gorman wrote:
>> >On (26/04/07 16:50), Nick Piggin didst pronounce:
>>
>> >>Fragmentation is the problem. The anti-frag patches don't actually
>> >>guarantee anything about fragmentation, and even if they did, then
>> >
>> >
>> >The grouping pages by mobility do not guarantee anything but the memory
>> >partition (kernelcore= boot parameter) does give hard guarantees about
>> >the amount of memory that is "movable". Of course, the partition requires
>> >configuration at boot-time so it's less than ideal but it does give hard
>> >guarantees.
>>
>> For the hugepages people, I can understand that's a solution.
>
> I think that's the closest you have ever come to saying fragmentation
> avoidance is not a terrible idea :)
>
>> But that's
>> the last thing you want to do on a system with a limited amount of memory,
>> or a regular Joe's desktop/server.
>>
>
> Regular Joe is not going to be creating a filesystem with large blocks or
> overly concerned with saturating all the disks hanging off his RAID array.
> At most, he'll care about faster DVD writing and considering the number and
> duration of those allocations, the system will be able to handle it. So I
> don't think Regular Joe will generally care.
The practical question is if this is a special purpose hack, or is
this general infrastructure that we expect filesystems to seriously
use.
If this is a special purpose hack then the larger pages and
fragmentation are minor issues, but its very existence is a problem.
If this is not a special purpose hack and people do use this a lot
then the fragmentation is a problem.
Your reply indicates that fragmentation is a concern, as does the
initial posting and the more positive threads. Therefore either
this is a special purpose hack, in which case its existence is
questionable, or it fails as a general solution and its existence
is questionable.
>> >Indeed but then you have to deal with internal fragmentation
>> >for pages-larger-than-TLB-page. I'm not saying it's wrong but it does
>> >come with it's own set of issues.
>>
>> None of them is perfect (the ways to increase the size of pagecache pages,
>> that is).
>>
>> I think in the long term, TLB page sizes will probably increase a little
>> bit... but if a given page size is "good enough" for a CPU, they really
>> should be good enough for other hardware. I mean, come on, the CPU's TLB
>> has to have a good hit ratio and handle several lookups per cycle with a
>> 3-cycle latency on 3GHz+ hardware... surely a an IO controller's
>> scatter-gather engine or IOMMU that has to do a few lookups per disk IO
>> is nowhere near so critical as a CPU's datapath: just add a few more
>> entries to it, they've already got hundreds of megs of cache, so that
>> isn't an issue either.
>>
>
> I cannot speak with authority on what IO controllers are really capable
> of so maybe someone else will comment on this more.
Well here is some reality. Using an FPGA (slow by definition)
connected to a hypertransport interface I have exceeded a gigabyte a
second, without even setting up scatter gather.
All of the I/O on all of the peripheral busses happens in sub
1 page chunks.
At the rates we are talking for disks the issue is really how to
bury the disk latency. For fast arrays how do you submit enough
requests to bury the latency.
I can't possibly believe any of this is about the cost of processing
a request, but rather the problem that some devices don't have a large
enough pool of requests to keep them busy if you submit the requests
in a 4K page sizes.
This all sounds like a device design issue rather than anything more
significant, and it doesn't sound like a long term trend. Market
pressure should fix the hardware.
Eric
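[Editorial note: a back-of-the-envelope illustration of "burying the latency";
the bandwidth and latency figures below are made-up round numbers, not
measurements from this thread:]
#include <stdio.h>
/*
 * Little's-law style estimate: bytes in flight = target bandwidth x device
 * latency, so the number of outstanding requests needed depends directly
 * on the request size.
 */
int main(void)
{
	const double bandwidth = 1e9;			/* 1 GB/s target    */
	const double latency   = 5e-3;			/* 5 ms per request */
	const double inflight  = bandwidth * latency;	/* bytes in flight  */
	const double sizes[]   = { 4096, 512 * 1024, 2 * 1024 * 1024 };
	int i;
	for (i = 0; i < 3; i++)
		printf("%8.0f-byte requests: %6.0f outstanding needed\n",
		       sizes[i], inflight / sizes[i]);
	return 0;
}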
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 10:54 ` Eric W. Biederman
@ 2007-04-26 12:23 ` Mel Gorman
2007-04-26 17:58 ` Christoph Lameter
1 sibling, 0 replies; 235+ messages in thread
From: Mel Gorman @ 2007-04-26 12:23 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Nick Piggin, David Chinner, clameter, Linux Kernel Mailing List,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Thu, 26 Apr 2007, Eric W. Biederman wrote:
> mel@skynet.ie (Mel Gorman) writes:
>
>> On (26/04/07 18:55), Nick Piggin didst pronounce:
>>> Mel Gorman wrote:
>>>> On (26/04/07 16:50), Nick Piggin didst pronounce:
>>>
>>>>> Fragmentation is the problem. The anti-frag patches don't actually
>>>>> guarantee anything about fragmentation, and even if they did, then
>>>>
>>>>
>>>> The grouping pages by mobility do not guarantee anything but the memory
>>>> partition (kernelcore= boot parameter) does give hard guarantees about
>>>> the amount of memory that is "movable". Of course, the partition requires
>>>> configuration at boot-time so it's less than ideal but it does give hard
>>>> guarantees.
>>>
>>> For the hugepages people, I can understand that's a solution.
>>
>> I think that's the closest you have ever come to saying fragmentation
>> avoidance is not a terrible idea :)
>>
>>> But that's
>>> the last thing you want to do on a system with a limited amount of memory,
>>> or a regular Joe's desktop/server.
>>>
>>
>> Regular Joe is not going to be creating a filesystem with large blocks or
>> overly concerned with saturating all the disks hanging off his RAID array.
>> At most, he'll care about faster DVD writing and considering the number and
>> duration of those allocations, the system will be able to handle it. So I
>> don't think Regular Joe will generally care.
>
> The practical question is if this a special purpose hack, or is
> this general infrastructure that we expect filesystems to seriously
> use.
>
> If this is a special purpose hack then the larger pages and
> fragmentation are minor issues, but it's very existence is a problem.
>
> If this is not a special purpose hack and people do use this a lot
> then the fragmentation is a problem.
>
Like so many other things, I think this would start as something used by
the minority of users with the hardware that requires this sort of feature
and slowly bubble down until it appears on day-to-day machines. I
wouldn't describe it as a hack but it's fair to say that it'll start as
special purpose. It won't stay that way forever.
> Your reply indicates that fragmentation is a concern, as does the
> initial posting and the more positive threads. Therefore either
> this is a special purpose hack in which case it's existence is
> questionable or it fails a general solution and it's existence
> is questionable.
>
Or it'll start as a special purpose feature and evolve to be a general
solution.
>
>>>> Indeed but then you have to deal with internal fragmentation
>>>> for pages-larger-than-TLB-page. I'm not saying it's wrong but it does
>>>> come with it's own set of issues.
>>>
>>> None of them is perfect (the ways to increase the size of pagecache pages,
>>> that is).
>>>
>>> I think in the long term, TLB page sizes will probably increase a little
>>> bit... but if a given page size is "good enough" for a CPU, they really
>>> should be good enough for other hardware. I mean, come on, the CPU's TLB
>>> has to have a good hit ratio and handle several lookups per cycle with a
>>> 3-cycle latency on 3GHz+ hardware... surely a an IO controller's
>>> scatter-gather engine or IOMMU that has to do a few lookups per disk IO
>>> is nowhere near so critical as a CPU's datapath: just add a few more
>>> entries to it, they've already got hundreds of megs of cache, so that
>>> isn't an issue either.
>>>
>>
>> I cannot speak with authority on what IO controllers are really capable
>> of so maybe someone else will comment on this more.
>
> Well here is some reality. Using an FPGA (slow by definition)
> connected to a hypertransport interface I have exceeded a gigabyte a
> second, without even setting up scatter gather.
>
> All of the I/O on all of the peripheral busses happens in sub
> 1 page chunks.
>
> At the rates we are talking for disks the issue is really how to
> bury the disk latency. For fast arrays how do you submit enough
> request to bury the latency.
>
> I can't possibly believe any of this is about the cost of processing
> a request, but rather the problem that some devices don't have a large
> enough pool of requests to keep them busy if you submit the requests
> in a 4K page sizes.
>
> This all sounds like a device design issue rather than anything more
> significant, and it doesn't sound like a long term trend. Market
> pressure should fix the hardware.
>
>
> Eric
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:42 ` Nick Piggin
2007-04-26 10:48 ` Mel Gorman
@ 2007-04-26 12:37 ` Andy Whitcroft
2007-04-26 14:18 ` David Chinner
2007-04-26 15:08 ` Nick Piggin
2007-04-26 14:53 ` William Lee Irwin III
2007-04-26 18:13 ` Christoph Lameter
3 siblings, 2 replies; 235+ messages in thread
From: Andy Whitcroft @ 2007-04-26 12:37 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Nick Piggin wrote:
> Christoph Lameter wrote:
>> On Thu, 26 Apr 2007, Nick Piggin wrote:
>>
>>
>>>> mapping through the radix tree. You just need to change the way the
>>>> filesystem looks up pages.
>>>
>>> You didn't think any of the criticisms of higher order page cache size
>>> were valid?
>>
>>
>> They are all known points that have been discussed to death.
>
> I missed the part where you showed that it was a better solution than
> the alternatives.
>
>
>>>> What are the exact requirement you are trying to address?
>>>
>>> Block size > page cache size.
>>
>>
>> But what do you mean with it? A block is no longer a contiguous
>> section of memory. So you have redefined the term.
>
> I don't understand what you mean at all. A block has always been a
> contiguous area of disk.
Lets take Nick's definition of block being a disk based unit for the
moment. That does not change the key contention here, that even with
hardware specifically designed to handle 4k pages that hardware handles
larger contiguous areas more efficiently. David Chinner gives us
figures showing major overall throughput improvements from (I assume)
shorter scatter gather lists and better tag utilisation. I am loath to
say we can just blame the hardware vendors for poor design.
>>> You guys have a couple of problems, firstly you need to have ia64
>>> filesystems accessable to x86_64. And secondly you have these
>>> controllers
>>> without enough sg entries for nice sized IOs.
>>
>>
>> This is not sgi specific sorry.
>>
>>
>>> I sympathise, and higher order pagecache might solve these in a way, but
>>> I don't think it is the right way to go, mainly because of the
>>> fragmentation
>>> issues.
>>
>>
>> And you dont care about Mel's work on that level?
>
> I actually don't like it too much because it can't provide a robust
> solution. What do you do on systems with small memories, or those that
> eventually do get fragmented?
>
> Actually, I don't know why people are so excited about being able to
> use higher order allocations (I would rather be more excited about
> never having to use them). But for those few places that really need
> it, I'd rather see them use a virtually mapped kernel with proper
> defragmentation rather than putting hacks all through the core code.
Virtually mapping the kernel was considered pretty seriously around the
time SPARSEMEM was being developed. However, that leads to a
non-constant relation for converting kernel virtual addresses to
physical ones which leads to significant complexity, not to mention
runtime overhead.
As a solution to the problem of supplying large pages from the allocator
it seems somewhat unsatisfactory. If no other significant changes are
made in support of large allocations, the process of defragmenting
becomes very expensive, requiring a stop_machine style hiatus while the
physical copy and replace occurs for any kernel backed memory.
To put it a different way, even with such a full defragmentation scheme
available, some sort of avoidance scheme would be highly desirable to
avoid using the very expensive defragmentation underlying it.
>>> Increasing PAGE_SIZE, support for block size > page cache size, and
>>> getting
>>> io controllers matched to a 4K page size IMO would be some good ways to
>>> solve these problems. I know they are probably harder...
>>
>>
>> No this has been tried before and does not work. Why should we loose
>> the capability to work with 4k pages just because there is some data
>> that has to be thrown around in quantity? I'd like to have flexibility
>> here.
>
> Is that a big problem? Really? You use 16K pages on your IPF systems,
> don't you?
To my knowledge, moving to a higher base page size has its advantages in
TLB reach, but brings with it some pretty serious downsides, especially
in caching small files, where internal fragmentation in the page cache
significantly affects system performance. So much so that development
is ongoing to see if supporting sub-base-page objects in the buffer
cache could be beneficial.
>> The fragmentation problem is solvable and we already have a solution
>> in mm. So I do not really see a problem there?
>
> I don't think that it is solved, and I think the heuristics that are
> there would be put under more stress if they become widely used. And
> it isn't only about whether we can get the page or not, but also about
> the cost. Look up Linus's arguments about page colouring, which are
> similar and I also think are pretty valid.
-apw
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 5:44 ` Eric W. Biederman
2007-04-26 6:37 ` Christoph Lameter
2007-04-26 6:38 ` Nick Piggin
@ 2007-04-26 13:28 ` Alan Cox
2007-04-26 13:30 ` Jens Axboe
2007-04-29 14:12 ` Matt Mackall
2007-04-28 10:55 ` Pierre Ossman
3 siblings, 2 replies; 235+ messages in thread
From: Alan Cox @ 2007-04-26 13:28 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
> > Oh we have scores of these hacks around. Look at the dvd/cd layer. The
> > point is to get rid of those.
>
> Perhaps this is just a matter of cleaning them up so they are no
> longer hacks?
CD and DVD media support various non power-of-two block sizes. Supporting
more block sizes would also be useful as we could then read older SmartMedia
(256 bytes/sector) without the SCSI layer objecting and the like.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 13:28 ` Alan Cox
@ 2007-04-26 13:30 ` Jens Axboe
2007-04-29 14:12 ` Matt Mackall
1 sibling, 0 replies; 235+ messages in thread
From: Jens Axboe @ 2007-04-26 13:30 UTC (permalink / raw)
To: Alan Cox
Cc: Eric W. Biederman, Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Badari Pulavarty,
Maxim Levitsky
On Thu, Apr 26 2007, Alan Cox wrote:
> > > Oh we have scores of these hacks around. Look at the dvd/cd layer. The
> > > point is to get rid of those.
> >
> > Perhaps this is just a matter of cleaning them up so they are no
> > longer hacks?
>
> CD and DVD media support various non power-of-two block sizes. Supporting
> more block sizes would also be useful as we could then read older smart
> media (256byte/sector) without the SCSI layer objecting and the like.
I think Christoph is referring to the packet writing stuff, where we use
32kb or 64kb block sizes. Those aren't implemented as page cache hacks
currently though, the issue is worked around at the block layer level by
doing software RMW.
Supporting larger page cache blocks would allow pktcdvd to be reduced to
almost nothing.
--
Jens Axboe
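[Editorial note: for readers unfamiliar with it, the software read-modify-write
Jens refers to amounts to the following. This is a schematic sketch with
hypothetical I/O helpers, not the actual pktcdvd code:]
#include <stddef.h>
#include <string.h>
#define PACKET_SIZE	(32 * 1024)	/* CD-RW/DVD packet, 32k or 64k */
/* Hypothetical stand-ins for the real block-layer I/O paths. */
static int read_packet(size_t packet_no, void *buf)
{ (void)packet_no; (void)buf; return 0; }
static int write_packet(size_t packet_no, const void *buf)
{ (void)packet_no; (void)buf; return 0; }
/*
 * Servicing a 4k write inside a 32k packet means reading the whole packet,
 * patching it in memory, and writing it back as one unit -- the software
 * read-modify-write that a 32k page-cache block would make unnecessary.
 */
static int rmw_write(size_t packet_no, size_t offset,
		     const void *data, size_t len)
{
	static unsigned char packet[PACKET_SIZE];
	int err = read_packet(packet_no, packet);	/* Read   */
	if (err)
		return err;
	memcpy(packet + offset, data, len);		/* Modify */
	return write_packet(packet_no, packet);		/* Write  */
}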
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 10:10 ` Eric W. Biederman
@ 2007-04-26 13:50 ` David Chinner
2007-04-26 14:40 ` William Lee Irwin III
` (2 more replies)
2007-04-26 18:07 ` Christoph Lameter
1 sibling, 3 replies; 235+ messages in thread
From: David Chinner @ 2007-04-26 13:50 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Chinner, Nick Piggin, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Thu, Apr 26, 2007 at 04:10:32AM -0600, Eric W. Biederman wrote:
> David Chinner <dgc@sgi.com> writes:
>
> > On Thu, Apr 26, 2007 at 03:37:28PM +1000, Nick Piggin wrote:
> >> I think starting with the assumption that we _want_ to use higher order
> >> allocations, and then creating all this complexity around that is not a
> >> good one, and if we start introducing things that _require_ significant
> >> higher order allocations to function then it is a nasty thing for
> >> robustness.
> >
> > From my POV, we started with the problem of how to provide atomic
> > access to a multi-page block in the page cache. For example, we want
> > to lock the filesystem block and prevent any updates to it, so we
> > need to lock all the pages in it. And then when we write them back,
> > they all need to change state at the same time, and they all need to
> > have their radix tree tags changed at the same time, the problem of
> > mapping them to disk, getting writeback to do block aligned and
> > sized writeback chunks, and so on.
>
> Ok. That is a reasonable problem and worth solving.
>
> I suspect the easiest way to go is to have something in the code
> that points all of the locking activities at the first page in the
> block. Like large pages do but they don't have to be physically
> contiguous.
You just described non-contiguous compound pages.
Not much different, but how do you link all the pages together while
still having a private pointer for the filesystem without growing
the struct page? Remember, you have to be able to get from any page
to the head node, and you have to be able to get from the head node
to the individual pages when you have to free the compound page.
You'll need something like a linked list between the pages in
addition to the head pointer.
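[Editorial note: a sketch of the bookkeeping described above; the structures are
invented for illustration only and show where the extra space would have to go,
i.e. either struct page grows or an external descriptor is allocated per block:]
#include <linux/list.h>
#include <linux/mm.h>
/* Hypothetical per-block descriptor: the head needs a way to every
 * member page, and every member page needs a way back to the head. */
struct noncontig_block {
	struct page	*head;		/* first page, holds ->private etc. */
	struct list_head pages;		/* all member pages of the block    */
	unsigned int	 nr_pages;
};
struct noncontig_page_link {
	struct page	*page;		/* one 4k member page               */
	struct list_head list;		/* links into noncontig_block.pages */
	struct noncontig_block *block;	/* back-pointer to the head         */
};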
And then there's the allocation code - you need to alloc 1 page
N times rather than N pages once. And then we can still only
get 128 pages into a bio, so all we've done is churn some in-memory
structures and not gained anything where it counts (i.e. on the
storage).
IMO, non-contiguous compound pages are more difficult to manage,
will burn lots more cpu, consume more memory and be able to drive
I/O no faster than the current code. Three strikes and you're
out; four strikes....
OTOH, contiguous compound pages burn little extra CPU, will consume
less memory due to smaller radix trees and less bufferhead usage and
can improve the I/O efficiency of the machine.....
> I think that would be even less code then what you are proposing.
More, actually, and with a bigger impact on memory usage because of
the struct page size changes needed.
> > And then there's the problem that most hardware is limited to 128
> > s/g entries and that means 128 non-contiguous pages in memory is the
> > maximum I/O size we can issue to these devices. We have RAID arrays
> > that go twice as fast if we can send them 1MB I/Os instead of 512k
> > I/Os and that means we need contiguous pages to be handled to the
> > devices....
>
> Ok. Now why are high end hardware manufacturers building crippled
> hardware? Or is there only an 8bit field in SCSI for describing
> scatter gather entries? Although I would think this would be
> move of a controller ranter than a drive issue.
scsi.h:
/*
* The maximum sg list length SCSI can cope with
* (currently must be a power of 2 between 32 and 256)
*/
#define SCSI_MAX_PHYS_SEGMENTS MAX_PHYS_SEGMENTS
And from blkdev.h:
#define MAX_PHYS_SEGMENTS 128
#define MAX_HW_SEGMENTS 128
So currently on SCSI we are limited to 128 s/g entries, and the
maximum is 256. So I'd say we've got good grounds for needing
contiguous pages to go beyond 1MB I/O size on x86_64.
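[Editorial note: the arithmetic behind that, as a small self-contained check
using the constants quoted above. Contiguous pages let one s/g entry cover
more than 4k, which is the whole point:]
#include <stdio.h>
#define PAGE_SIZE_4K		4096
#define MAX_PHYS_SEGMENTS	128	/* current blkdev.h default       */
#define SCSI_SG_HARD_MAX	256	/* upper bound quoted from scsi.h */
int main(void)
{
	/* One s/g entry per non-contiguous 4k page. */
	printf("128 entries x 4k  = %u KiB max I/O\n",
	       MAX_PHYS_SEGMENTS * PAGE_SIZE_4K / 1024);	/*  512 KiB */
	printf("256 entries x 4k  = %u KiB max I/O\n",
	       SCSI_SG_HARD_MAX * PAGE_SIZE_4K / 1024);		/* 1024 KiB */
	/* With order-2 (16k) contiguous chunks per entry: */
	printf("128 entries x 16k = %u KiB max I/O\n",
	       MAX_PHYS_SEGMENTS * 4 * PAGE_SIZE_4K / 1024);	/* 2048 KiB */
	return 0;
}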
> > All of these things require some aggregating structure to
> > co-ordinate. In times gone by on other OSs, this has been done with
> > a buffer cache, but Linux got rid of that long ago and we don't want
> > to reintroduce one. We can't use buffer heads - they can only point
> > to one page. So what do we do?
>
> For I/O we have the BIO which can point to multiple pages just fine.
And it's a block layer construct; it's rather heavyweight compared
to a compound page and a bufferhead for the disk mapping....
> Buffer heads are irrelevant. The question is how do we get to
> the page cache from the BIO and from the BIO to the page cache.
You missed a layer - the filesystem is usually the translation
layer between the page cache and the BIO. Buffer heads are used to
carry extra filesystem state that can't fit on the page. Like the
block number the page is mapped to (i.e. the translation key for
page cache to bio mapping), the per block state (dirty,
unwritten, new, etc), and so on.
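[Editorial note: the per-block state mentioned above lives on struct buffer_head;
an abridged illustration using the standard accessors, with many fields and flags
omitted:]
#include <linux/buffer_head.h>
/* Illustrative only: the "extra filesystem state that can't fit on the
 * page" is carried per block by the buffer_head. */
static void show_block_state(struct buffer_head *bh)
{
	sector_t blocknr = bh->b_blocknr;    /* disk block this bh maps to */
	size_t   size    = bh->b_size;       /* size of the mapping        */
	int      dirty   = buffer_dirty(bh); /* BH_Dirty bit in bh->b_state */
	int      mapped  = buffer_mapped(bh);
	int      newblk  = buffer_new(bh);   /* freshly allocated block    */
	(void)blocknr; (void)size; (void)dirty; (void)mapped; (void)newblk;
}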
From this information we can construct bios when they are
appropriate, but to use bios where we currently use buffer heads
is extremely wasteful of memory (we often have millions of bufferheads
in memory at once).
FWIW, block size larger than page size also reduces bufferhead usage
so XFS will reduce its memory footprint when doing buffered writes.
> > That's where compound pages are so nice - they solve all of these
> > problems with only a very small amount of code perturbation and they
> > don't change any algorithms or fundamental design of the OS at all.
>
> The change the fundamental fragmentation avoidance algorithm of the
> OS. Use only one size of page. That is a huge problem.
And one that Mel Gorman has been working on for quite some time
with good results.
> Yes we do relax that rule but only on things that we don't care
> about much and don't mind failing.
No comment.
[snip repeated mantra]
> That is a huge concern.
So don't compile it in!
> > FWIW, if you want 32 bit machines to support larger than 16TB
> > devices, you need high order page indexing in the page cache....
>
> Huh? Only if you are doing raw device accesses. We are limited
> to 16TB files.
Many filesystems use a block device address space to address all
their metadata. e.g. ext2/3 can't put metadata and therefore track
usage above 16TB. Therefore, the filesystem gets no larger.
> >> >That would make the choice of using larger order pages (essentially
> >> >increasing PAGE_SIZE) something that can be investigated in parallel.
> >>
> >> I agree that hardware inefficiencies should be handled by increasing
> >> PAGE_SIZE (not making PAGE_CACHE_SIZE > PAGE_SIZE) at the arch level.
> >
> > And what do we do for arches that can't do multiple page sizes, only
> > only have a limited and mostly useless set of page sizes to choose
> > from?
>
> You have HW_PAGE_SIZE != PAGE_SIZE.
That's rather wasteful, though. Better to only use the large pages
when the filesystem needs them rather than penalise all filesystems.
> That is you hide the fact from
> the bulk of the kernel struct page manges 2 or more real hardware pages.
> But you expose it to the handful of places that actually care.
> Partly this is a path you are starting down in your patches, with
> larger page cache support.
Right, exactly. So apart from the contiguous allocation issue, you think
we are doing the right thing?
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:34 ` Christoph Lameter
2007-04-26 7:48 ` Nick Piggin
@ 2007-04-26 13:50 ` William Lee Irwin III
2007-04-26 18:09 ` Eric W. Biederman
1 sibling, 1 reply; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-26 13:50 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Eric W. Biederman, linux-kernel, Mel Gorman,
David Chinner, Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Thu, 26 Apr 2007, Nick Piggin wrote:
>> OK, I would like to see them. And also discussions of things like why
>> we shouldn't increase PAGE_SIZE instead.
On Thu, Apr 26, 2007 at 12:34:50AM -0700, Christoph Lameter wrote:
> Because 4k is a good page size that is bound to the binary format? Frankly
> there is no point in having my text files in large page sizes. However,
> when I read a dvd then I may want to transfer 64k chunks or when use my
> flash drive I may want to transfer 128k chunks. And yes if a scientific
> application needs to do data dump then it should be able to use very high
> page sizes (megabytes, gigabytes) to be able to continue its work while
> the huge dumps runs at full I/O speed ...
It's possible to divorce PAGE_SIZE from the binary formats, though I
found it difficult to keep up with the update treadmill. Maybe it's
like hch says and I just needed to find more and better API cleanups.
I've only not tried to resurrect it because it's too much for me to do
on my own. I essentially collapsed under the weight of it and my 2.5.x
codebase ended up worse than Katrina as a disaster, which I don't want
to repeat and think collaborators or a different project lead from
myself are needed to avoid that happening again.
It's unclear how much the situation has changed since 32-bit workload
feasibility issues have been relegated to ignorable or deliberate
"f**k 32-bit" status. The effect is doubtless to make it easier, though
to what degree I'm not sure.
Anyway, if that's being kicked around as an alternative, it could be
said that I have some insight into the issues surrounding it.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 9:20 ` David Chinner
@ 2007-04-26 13:53 ` Avi Kivity
2007-04-26 14:33 ` David Chinner
2007-04-26 15:20 ` Nick Piggin
1 sibling, 1 reply; 235+ messages in thread
From: Avi Kivity @ 2007-04-26 13:53 UTC (permalink / raw)
To: David Chinner
Cc: Nick Piggin, Christoph Lameter, Eric W. Biederman, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
David Chinner wrote:
> On Thu, Apr 26, 2007 at 05:48:12PM +1000, Nick Piggin wrote:
>
>> Christoph Lameter wrote:
>>
>>> On Thu, 26 Apr 2007, Nick Piggin wrote:
>>>
>>>
>>>
>>>> No I don't want to add another fs layer.
>>>>
>>> Well maybe you could explain what you want. Preferably without redefining
>>> the established terms?
>>>
>> Support for larger buffers than page cache pages.
>>
>
> The problem with this approach is that it turns around the whole
> way we look at bufferheads. Right now we have well defined 1:n
> mapping of page to bufferheads and so we tpyically lock the
> page first them iterate all the bufferheads on the page.
>
> Going the other way, we need to support m:n which we means
> the buffer has to become the primary interface for the filesystem
> to the page cache. i.e. we need to lock the bufferhead first, then
> iterate all the pages on it. This is messy because the cache indexes
> via pages, not bufferheads. hence a buffer needs to point to all the
> pages in it explicitly, and this leads to interesting issues with
> locking.
>
Why is it necessary to assume that one filesystem block == one buffer?
Is it for atomicity, efficiency, or something else?
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 235+ messages in thread
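A minimal sketch of the 1:n page-to-bufferhead relationship being discussed in
the message above, using the standard page_buffers()/b_this_page walk from
fs/buffer.c; error handling is omitted and the helper name is made up for
illustration:

#include <linux/mm.h>
#include <linux/buffer_head.h>

/*
 * Walk every buffer_head attached to a locked page cache page.
 * With blocksize <= PAGE_SIZE each page carries a circular list of
 * 1..n buffers, so locking the page serialises all of its buffers.
 */
static void for_each_buffer_on_page(struct page *page)
{
	struct buffer_head *bh, *head;

	BUG_ON(!PageLocked(page));
	if (!page_has_buffers(page))
		return;

	head = page_buffers(page);
	bh = head;
	do {
		/* per-block state and disk mapping live in *bh */
		bh = bh->b_this_page;
	} while (bh != head);
}

Inverting this into an m:n scheme means the lock order above (page first, then
its buffers) no longer has an obvious equivalent, which is the locking mess
being pointed out.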
* Re: [00/17] Large Blocksize Support V3
2007-04-26 12:37 ` Andy Whitcroft
@ 2007-04-26 14:18 ` David Chinner
2007-04-26 15:08 ` Nick Piggin
1 sibling, 0 replies; 235+ messages in thread
From: David Chinner @ 2007-04-26 14:18 UTC (permalink / raw)
To: Andy Whitcroft
Cc: Nick Piggin, Christoph Lameter, Eric W. Biederman, linux-kernel,
Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26, 2007 at 01:37:42PM +0100, Andy Whitcroft wrote:
> Nick Piggin wrote:
> > Christoph Lameter wrote:
> >> On Thu, 26 Apr 2007, Nick Piggin wrote:
> >>>> What are the exact requirement you are trying to address?
> >>>
> >>> Block size > page cache size.
> >>
> >> But what do you mean with it? A block is no longer a contiguous
> >> section of memory. So you have redefined the term.
> >
> > I don't understand what you mean at all. A block has always been a
> > contiguous area of disk.
>
> Let's take Nick's definition of block being a disk based unit for the
> moment. That does not change the key contention here, that even with
> hardware specifically designed to handle 4k pages that hardware handles
> larger contiguous areas more efficiently. David Chinner gives us
> figures showing major overall throughput improvements from (I assume)
> shorter scatter gather lists and better tag utilisation.
I haven't actually provided any figures - it's knowledge passed down
from those that know more about it than I do.
If you want figures about the impact of large I/Os, then we should
not be looking at the HBAs but at the impact on RAID controller
throughput (this I do have numbers on ;). It is not uncommon to
see 2MB I/Os give twice the throughput of 512K I/Os to a single
RAID controller - larger than 512k can only be achieved on systems
with a page size larger than 4k.....
> I am loath to
> say we can just blame the hardware vendors for poor design.
Never did - I'm pointing out that linux can't use all the
capabilities they have.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 13:53 ` Avi Kivity
@ 2007-04-26 14:33 ` David Chinner
2007-04-26 14:56 ` Avi Kivity
0 siblings, 1 reply; 235+ messages in thread
From: David Chinner @ 2007-04-26 14:33 UTC (permalink / raw)
To: Avi Kivity
Cc: David Chinner, Nick Piggin, Christoph Lameter, Eric W. Biederman,
linux-kernel, Mel Gorman, William Lee Irwin III, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26, 2007 at 04:53:00PM +0300, Avi Kivity wrote:
> David Chinner wrote:
> >The problem with this approach is that it turns around the whole
> >way we look at bufferheads. Right now we have a well defined 1:n
> >mapping of page to bufferheads and so we typically lock the
> >page first then iterate all the bufferheads on the page.
> >
> >Going the other way, we need to support m:n which means
> >the buffer has to become the primary interface for the filesystem
> >to the page cache. i.e. we need to lock the bufferhead first, then
> >iterate all the pages on it. This is messy because the cache indexes
> >via pages, not bufferheads. Hence a buffer needs to point to all the
> >pages in it explicitly, and this leads to interesting issues with
> >locking.
> >
>
> Why is it necessary to assume that one filesystem block == one buffer?
> Is it for atomicity, efficiency, or something else?
By definition, really - each filesystem block has its own state and
its own disk mapping and so we need something to carry that
information around....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
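For reference, the per-block state and disk mapping Dave refers to above is
what struct buffer_head already carries for sub-page blocks; a trimmed field
subset (see include/linux/buffer_head.h, layout approximate):

/* Illustrative subset only, not the full definition. */
struct buffer_head_fields_example {
	unsigned long b_state;		/* uptodate, dirty, mapped, ... */
	struct buffer_head *b_this_page;/* circular list of the page's buffers */
	struct page *b_page;		/* the page this buffer is mapped to */
	sector_t b_blocknr;		/* start block number on disk */
	size_t b_size;			/* size of the block */
	char *b_data;			/* pointer into the page's data */
};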
* Re: [00/17] Large Blocksize Support V3
2007-04-26 13:50 ` David Chinner
@ 2007-04-26 14:40 ` William Lee Irwin III
2007-04-26 15:38 ` Nick Piggin
2007-04-27 0:19 ` Jeremy Higdon
2 siblings, 0 replies; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-26 14:40 UTC (permalink / raw)
To: David Chinner
Cc: Eric W. Biederman, Nick Piggin, clameter, linux-kernel,
Mel Gorman, Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26, 2007 at 04:10:32AM -0600, Eric W. Biederman wrote:
>> You have HW_PAGE_SIZE != PAGE_SIZE.
On Thu, Apr 26, 2007 at 11:50:33PM +1000, David Chinner wrote:
> That's rather wasteful, though. Better to only use the large pages
> when the filesystem needs them rather than penalise all filesystems.
I found less of an issue with filesystem pagecache than with internal
fragmentation of anonymous memory when I did it. I'd expect, though,
that 4KB is probably too small and 64KB too large, at least for some
workloads. 16KB may do better; if not, 8KB may be worth doing just to
compensate, in terms of memory overhead, for the larger pointers and
unsigned longs in struct page so as not to regress vs. 32-bit; that
isn't critical, but it does have some cache and other
performance impacts.
Basically I found that without some intelligent method of divvying
out fragments of anonymous pages, pathological performance resulted.
The naive scheme of faulting at PAGE_SIZE-aligned boundaries created
swapstorms on memory-constrained hardware (e.g. laptops). Pagecache
was a second-order effect. It's not necessarily all that complex to
handle. One could easily recover the PAGE_SIZE == MMUPAGE_SIZE
behavior by keeping partially utilized anonymous pages cached in the mm
and handing out MMUPAGE_SIZE-sized fragments during COW/zerofill faults.
I had some sort of trouble with tracking the state for it, though I
don't remember what it was.
It's also notable that the two strategies (increasing base page size
and dealing with higher-order pages) don't clash all that much.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 10:06 ` [00/17] Large Blocksize Support V3 Mel Gorman
@ 2007-04-26 14:47 ` Nick Piggin
0 siblings, 0 replies; 235+ messages in thread
From: Nick Piggin @ 2007-04-26 14:47 UTC (permalink / raw)
To: Mel Gorman
Cc: Christoph Lameter, Eric W. Biederman, linux-kernel,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Mel Gorman wrote:
> On (25/04/07 23:46), Christoph Lameter didst pronounce:
>
>>On Thu, 26 Apr 2007, Nick Piggin wrote:
>>
>>
>>>Yeah. IMO anti-fragmentation and defragmentation is the hack, and we
>>>should stay away from higher order allocations whenever possible.
>>
>>Right and we need to create series of other approaches that we then label
>>"non-hack" to replace it.
>>
>
>
> To date, there hasn't been a credible alternative to dealing with
> fragmentation. Breaking the 1:1 virtual:physical mapping and defragmenting
> would incur a serious performance hit.
Depends what you mean by "dealing with", I guess.
I would say there has been no credible alternative to virtually mapping
the kernel.
>>>Hardware is built to handle many small pages efficiently, and I don't
>>>understand how it could be an SGI-only issue. Sure, you may have an
>>>order of magnitude or more memory than anyone else, but even my lowly
>>>desktop _already_ has orders of magnitude more pages than it has TLB
>>>entries or cache -- if a workload is cache-nice for me, it probably
>>>will be on a 1TB machine as well, and if it is bad for the 1TB machine,
>>>it is also bad on mine.
>>
>>There have been a number of people who have argued the same point. Just
>>because we have developed a way of thinking to defend our traditional 4k
>>values does not make them right.
>>
>>
>>>If this is instead an issue of io path or reclaim efficiency, then it
>>>would be really nice to see numbers... but I don't think making these
>>>fundamental paths more complex and slower is a nice way to fix it
>>>(larger PAGE_SIZE would be, though).
>>
>>The code paths can stay the same. You can switch CONFIG_LARGE pages off
>>if you do not want it and it is as it was.
>>
>
>
> It may not even need that much effort. The most stressful use of the
> high order allocation paths here requires the creation of a filesystem and
> is a deliberate action by the user.
Saying "oh this stuff may not always work quite right for everyone, but
it is OK because it is a special purpose solution for now" IMO is a big
sign saying that it is a bad design, and including it means we're lumped
with it forever.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:22 ` Christoph Lameter
2007-04-26 7:42 ` Nick Piggin
@ 2007-04-26 14:49 ` William Lee Irwin III
1 sibling, 0 replies; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-26 14:49 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Eric W. Biederman, linux-kernel, Mel Gorman,
David Chinner, Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Thu, 26 Apr 2007, Nick Piggin wrote:
>> Block size > page cache size.
On Thu, Apr 26, 2007 at 12:22:42AM -0700, Christoph Lameter wrote:
> But what do you mean with it? A block is no longer a contiguous section of
> memory. So you have redefined the term.
AIUI a block is a contiguous section of disk, not memory, with further
qualifications.
On Thu, 26 Apr 2007, Nick Piggin wrote:
>> You guys have a couple of problems, firstly you need to have ia64
>> filesystems accessable to x86_64. And secondly you have these controllers
>> without enough sg entries for nice sized IOs.
On Thu, Apr 26, 2007 at 12:22:42AM -0700, Christoph Lameter wrote:
> This is not sgi specific sorry.
Throw Oracle in the mix. I expect other IHVs and ISVs are also
interested, but probably need to be asked in a way they can more easily
respond to, as their minions are not necessarily
reading this thread. It's a relatively broad point of interest.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:42 ` Nick Piggin
2007-04-26 10:48 ` Mel Gorman
2007-04-26 12:37 ` Andy Whitcroft
@ 2007-04-26 14:53 ` William Lee Irwin III
2007-04-26 18:16 ` Christoph Lameter
2007-04-26 18:21 ` Eric W. Biederman
2007-04-26 18:13 ` Christoph Lameter
3 siblings, 2 replies; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-26 14:53 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, Eric W. Biederman, linux-kernel, Mel Gorman,
David Chinner, Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26, 2007 at 05:42:17PM +1000, Nick Piggin wrote:
> Actually, I don't know why people are so excited about being able to
> use higher order allocations (I would rather be more excited about
> never having to use them). But for those few places that really need
> it, I'd rather see them use a virtually mapped kernel with proper
> defragmentation rather than putting hacks all through the core code.
In memory as on disk, contiguity matters a lot for performance.
On Thu, Apr 26, 2007 at 05:42:17PM +1000, Nick Piggin wrote:
> I don't think that it is solved, and I think the heuristics that are
> there would be put under more stress if they become widely used. And
> it isn't only about whether we can get the page or not, but also about
> the cost. Look up Linus's arguments about page colouring, which are
> similar and I also think are pretty valid.
Perhaps broader exposure would help demonstrate their efficacy, or
otherwise discover what deficits need to be addressed.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 14:33 ` David Chinner
@ 2007-04-26 14:56 ` Avi Kivity
0 siblings, 0 replies; 235+ messages in thread
From: Avi Kivity @ 2007-04-26 14:56 UTC (permalink / raw)
To: David Chinner
Cc: Nick Piggin, Christoph Lameter, Eric W. Biederman, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
David Chinner wrote:
>> Why is it necessary to assume that one filesystem block == one buffer?
>> Is it for atomicity, efficiency, or something else?
>>
>
> By definition, really - each filesystem block has its own state and
> its own disk mapping and so we need something to carry that
> information around....
>
Well, for block sizes > PAGE_SIZE, you can just duplicate the mapping
information (with an offset-in-block bit field) in each page's struct
page. But I see from your other posts that there are atomicity and
performance reasons as well.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 235+ messages in thread
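A rough sketch of the alternative Avi floats above, duplicating the block's
disk mapping into each constituent page together with an offset-in-block
field; all of the names here are hypothetical, nothing like this exists in
the tree:

/*
 * Hypothetical per-page block mapping for blocksize > PAGE_SIZE:
 * every page of the block repeats the mapping plus its own offset,
 * so the cache can still be indexed and locked page by page.
 */
struct page_block_map_example {
	sector_t	block_start;	/* disk address of the whole fs block */
	unsigned int	block_order;	/* block size = PAGE_SIZE << block_order */
	unsigned int	page_in_block;	/* this page's index within the block */
	unsigned long	state;		/* needs block-wide coherency rules */
};

Keeping the n copies of that state coherent is the atomicity problem raised in
the follow-ups.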
* Re: [00/17] Large Blocksize Support V3
2007-04-26 12:37 ` Andy Whitcroft
2007-04-26 14:18 ` David Chinner
@ 2007-04-26 15:08 ` Nick Piggin
2007-04-26 15:19 ` William Lee Irwin III
2007-04-26 15:28 ` David Chinner
1 sibling, 2 replies; 235+ messages in thread
From: Nick Piggin @ 2007-04-26 15:08 UTC (permalink / raw)
To: Andy Whitcroft
Cc: Christoph Lameter, Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Andy Whitcroft wrote:
> Nick Piggin wrote:
>
>>I don't understand what you mean at all. A block has always been a
>>contiguous area of disk.
>
>
> Let's take Nick's definition of block being a disk based unit for the
> moment. That does not change the key contention here, that even with
> hardware specifically designed to handle 4k pages that hardware handles
> larger contiguous areas more efficiently. David Chinner gives us
> figures showing major overall throughput improvements from (I assume)
> shorter scatter gather lists and better tag utilisation. I am loath to
> say we can just blame the hardware vendors for poor design.
So their controllers get double the throughput when going from 512K
(128x4K pages) to 2MB (128x16K pages) requests. Do you really think
it is to do with command processing overhead?
>>Actually, I don't know why people are so excited about being able to
>>use higher order allocations (I would rather be more excited about
>>never having to use them). But for those few places that really need
>>it, I'd rather see them use a virtually mapped kernel with proper
>>defragmentation rather than putting hacks all through the core code.
>
>
> Virtually mapping the kernel was considered pretty seriously around the
> time SPARSEMEM was being developed. However, that leads to a
> non-constant relation for converting kernel virtual addresses to
> physical ones which leads to significant complexity, not to mention
> runtime overhead.
Yeah, a page table walk (or better, a TLB hit). And yeah it will cost
a bit of performance, it always does.
> As a solution to the problem of supplying large pages from the allocator
> it seems somewhat unsatisfactory. If no significant other changes are
> made in support of large allocations, the process of defragmenting
> becomes very expensive, requiring a stop_machine style hiatus while the
> physical copy and replace occurs for any kernel-backed memory.
That would be a stupid thing to do though. All you need to do (after
you keep DMA away) is to unmap the pages.
> To put it a different way, even with such a full defragmentation scheme
> available some sort of avoidance scheme would be highly desirable to
> avoid using the very expensive defragmentation underlying it.
Maybe. That doesn't change the fact that avoidance isn't a complete
solution by itself.
>>Is that a big problem? Really? You use 16K pages on your IPF systems,
>>don't you?
>
>
> To my knowledge, moving to a higher base page size has its advantages in
> TLB reach, but brings with it some pretty serious downsides. Especially
> in caching small files. Internal fragmentation in the page cache
> significantly affects system performance. So much so that development
> is ongoing to see if supporting sub-base-page objects in the buffer
> cache could be beneficial.
I think 16K would be pretty reasonable (ia64 tends to use it). I guess
powerpc went to 64k either because that's what databases want or because
their TLB refills are too slow, so the internal fragmentation bites
them a lot harder.
But that was more of a side comment, because I still think io controllers
should be easily capable of operation on 4K pages. Graphics cards are,
aren't they?
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 15:08 ` Nick Piggin
@ 2007-04-26 15:19 ` William Lee Irwin III
2007-04-26 15:28 ` David Chinner
1 sibling, 0 replies; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-26 15:19 UTC (permalink / raw)
To: Nick Piggin
Cc: Andy Whitcroft, Christoph Lameter, Eric W. Biederman,
linux-kernel, Mel Gorman, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Andy Whitcroft wrote:
>> To my knowledge, moving to a higher base page size has its advantages in
>> TLB reach, but brings with it some pretty serious downsides. Especially
>> in caching small files. Internal fragmentation in the page cache
>> significantly affects system performance. So much so that development
>> is ongoing to see if supporting sub-base-page objects in the buffer
>> cache could be beneficial.
On Fri, Apr 27, 2007 at 01:08:17AM +1000, Nick Piggin wrote:
> I think 16K would be pretty reasonable (ia64 tends to use it). I guess
> powerpc went to 64k either because that's what databases want or because
> their TLB refills are too slow, so the internal fragmentation bites
> them a lot harder.
> But that was more of a side comment, because I still think io controllers
> should be easily capable of operation on 4K pages. Graphics cards are,
> aren't they?
Base page size need only be a power-of-two multiple of the MMU pagesize.
ISTR the 64KB PAGE_SIZE in ppc64 being motivated by TLB considerations.
There is not going to be a way to improve how well IO controllers deal
with 4K pages. The scatter/gather lists for smaller fragments will
require more device communication overhead no matter what, and
furthermore, the more discontiguity, the more seekiness.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
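The base-page vs MMU-page split wli mentions above amounts to something like
the following; the constants are illustrative, in the spirit of the old page
clustering patches rather than current kernel code:

/*
 * Software page size as a power-of-two multiple of the hardware (MMU)
 * page size: struct page, the page cache and the allocator all see
 * PAGE_SIZE, while the fault path installs PAGE_MMUCOUNT hardware ptes.
 */
#define MMUPAGE_SHIFT	12			/* 4K hardware pages (x86) */
#define MMUPAGE_SIZE	(1UL << MMUPAGE_SHIFT)

#define PAGE_MMUSHIFT	2			/* clustering factor: 2^2 = 4 */
#define PAGE_SHIFT	(MMUPAGE_SHIFT + PAGE_MMUSHIFT)
#define PAGE_SIZE	(1UL << PAGE_SHIFT)	/* 16K software pages */
#define PAGE_MMUCOUNT	(PAGE_SIZE / MMUPAGE_SIZE)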
* Re: [00/17] Large Blocksize Support V3
2007-04-26 9:20 ` David Chinner
2007-04-26 13:53 ` Avi Kivity
@ 2007-04-26 15:20 ` Nick Piggin
2007-04-26 17:42 ` Jens Axboe
1 sibling, 1 reply; 235+ messages in thread
From: Nick Piggin @ 2007-04-26 15:20 UTC (permalink / raw)
To: David Chinner
Cc: Christoph Lameter, Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
David Chinner wrote:
> On Thu, Apr 26, 2007 at 05:48:12PM +1000, Nick Piggin wrote:
>
>>Christoph Lameter wrote:
>>
>>>On Thu, 26 Apr 2007, Nick Piggin wrote:
>>>
>>>
>>>
>>>>No I don't want to add another fs layer.
>>>
>>>
>>>Well maybe you could explain what you want. Preferably without redefining
>>>the established terms?
>>
>>Support for larger buffers than page cache pages.
>
>
> The problem with this approach is that it turns around the whole
> way we look at bufferheads. Right now we have a well defined 1:n
> mapping of page to bufferheads and so we typically lock the
> page first then iterate all the bufferheads on the page.
>
> Going the other way, we need to support m:n which means
> the buffer has to become the primary interface for the filesystem
> to the page cache. i.e. we need to lock the bufferhead first, then
> iterate all the pages on it. This is messy because the cache indexes
> via pages, not bufferheads. Hence a buffer needs to point to all the
> pages in it explicitly, and this leads to interesting issues with
> locking.
>
> If you still think that this is a good idea, I suggest that you
Yeah, I think it possibly is. But I don't think that much of the
vm should need to be touched at all, especially not mmap. And I
think only those filesystems interested in supporting it would
have to implement it (maybe a couple), not all of them.
I didn't run into all the issues you mentioned yet, and I don't
see why it would have to turn into a real (traditional) buffer
cache layer rather than just a Linux (ie. block mapping) buffer
layer.
>>So block size > page cache size... also, you should obviously be using
>>hardware that is tuned to work well with 4K pages, because surely there
>>is lots of that around.
>
>
> The CPU hardware works well with 4k pages, but in general I/O
> hardware works more efficiently as the number of s/g entries they
> require drops for a given I/O size. Given that we limit drivers to
> 128 s/g entries, we really aren't using I/O hardware to its full
> potential or at its most efficient by limiting each s/g entry to a
> single 4k page.
>
> And FWIW, having a buffer for block size > page size does not
> solve this problem - only contiguous page allocation solves this
> problem.
But this seems to be due to your final IO request size, rather than
the number of sg entries the card has to process, right? My point
is that MMUs and the whole memory and data transfer system is quite
likely to be optimised for 4K pages, so it would be strange to think
that an IO card could not do that.
Why do we limit drivers to 128 sg entries?
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 15:08 ` Nick Piggin
2007-04-26 15:19 ` William Lee Irwin III
@ 2007-04-26 15:28 ` David Chinner
1 sibling, 0 replies; 235+ messages in thread
From: David Chinner @ 2007-04-26 15:28 UTC (permalink / raw)
To: Nick Piggin
Cc: Andy Whitcroft, Christoph Lameter, Eric W. Biederman,
linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Fri, Apr 27, 2007 at 01:08:17AM +1000, Nick Piggin wrote:
> Andy Whitcroft wrote:
> >Nick Piggin wrote:
> >
>
> >>I don't understand what you mean at all. A block has always been a
> >>contiguous area of disk.
> >
> >
> >Let's take Nick's definition of block being a disk based unit for the
> >moment. That does not change the key contention here, that even with
> >hardware specifically designed to handle 4k pages that hardware handles
> >larger contiguous areas more efficiently. David Chinner gives us
> >figures showing major overall throughput improvements from (I assume)
> >shorter scatter gather lists and better tag utilisation. I am loath to
> >say we can just blame the hardware vendors for poor design.
>
> So their controllers get double the throughput when going from 512K
> (128x4K pages) to 2MB (128x16K pages) requests. Do you really think
> it is to do with command processing overhead?
No - it has to do with things like the RAID controller caching behaviour, the
number of disks a single request can keep busy, getting I/Os large
enough to avoid partial stripe writes, etc. Remember that this
controller is often on the other side of an HBA so large I/Os are
really desirable here....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 13:50 ` David Chinner
2007-04-26 14:40 ` William Lee Irwin III
@ 2007-04-26 15:38 ` Nick Piggin
2007-04-26 15:58 ` William Lee Irwin III
2007-04-27 0:19 ` Jeremy Higdon
2 siblings, 1 reply; 235+ messages in thread
From: Nick Piggin @ 2007-04-26 15:38 UTC (permalink / raw)
To: David Chinner
Cc: Eric W. Biederman, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
David Chinner wrote:
> On Thu, Apr 26, 2007 at 04:10:32AM -0600, Eric W. Biederman wrote:
>>Ok. Now why are high end hardware manufacturers building crippled
>>hardware? Or is there only an 8bit field in SCSI for describing
>>scatter gather entries? Although I would think this would be
>>more of a controller rather than a drive issue.
>
>
> scsi.h:
>
> /*
> * The maximum sg list length SCSI can cope with
> * (currently must be a power of 2 between 32 and 256)
> */
> #define SCSI_MAX_PHYS_SEGMENTS MAX_PHYS_SEGMENTS
>
> And from blkdev.h:
>
> #define MAX_PHYS_SEGMENTS 128
> #define MAX_HW_SEGMENTS 128
>
> So currently on SCSI we are limited to 128 s/g entries, and the
> maximum is 256. So I'd say we've got good grounds for needing
> contiguous pages to go beyond 1MB I/O size on x86_64.
Or good grounds to increase the sg limit and push for io controller
manufacturers to do the same. If we have a hack in the kernel that
mostly works, they won't.
Page colouring was always rejected, and lots of people who knew
better got upset because it was the only way the hardware would go
fast...
>>>And what do we do for arches that can't do multiple page sizes, only
>>>only have a limited and mostly useless set of page sizes to choose
>>>from?
>>
>>You have HW_PAGE_SIZE != PAGE_SIZE.
>
>
> That's rather wasteful, though. Better to only use the large pages
> when the filesystem needs them rather than penalise all filesystems.
But 16k pages are fine for ia64. While you're talking about special
casing stuff, surely a bigger page size could be the config option
instead of higher order pagecache.
>>That is you hide the fact from
>>the bulk of the kernel struct page manges 2 or more real hardware pages.
>>But you expose it to the handful of places that actually care.
>>Partly this is a path you are starting down in your patches, with
>>larger page cache support.
>
>
> Right, exactly. So apart from the contiguous allocation issue, you think
> we are doing the right thing?
You could put it that way. Or that it is wrong because of the
fragmentation problem. Realise that it is somewhat fundamental
considering that it is basically an unsolvable problem with our
current kernel assumptions of unconstrained kernel allocations and
a 1:1 kernel mapping.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
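Back-of-envelope arithmetic behind the figures quoted in the message above,
assuming every s/g entry maps one non-contiguous page:

/*
 *   128 entries * 4k  = 512k   (MAX_PHYS_SEGMENTS today, 4k pages)
 *   256 entries * 4k  =   1M   (the SCSI maximum, 4k pages)
 *   128 entries * 16k =   2M   (the same controller with 16k compound pages)
 */
#define MAX_IO_BYTES(sg_entries, bytes_per_entry) \
	((unsigned long)(sg_entries) * (bytes_per_entry))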
* Re: [00/17] Large Blocksize Support V3
2007-04-26 15:38 ` Nick Piggin
@ 2007-04-26 15:58 ` William Lee Irwin III
2007-04-27 9:46 ` Nick Piggin
0 siblings, 1 reply; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-26 15:58 UTC (permalink / raw)
To: Nick Piggin
Cc: David Chinner, Eric W. Biederman, clameter, linux-kernel,
Mel Gorman, Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Fri, Apr 27, 2007 at 01:38:30AM +1000, Nick Piggin wrote:
> Or good grounds to increase the sg limit and push for io controller
> manufacturers to do the same. If we have a hack in the kernel that
> mostly works, they won't.
On Fri, Apr 27, 2007 at 01:38:30AM +1000, Nick Piggin wrote:
> Page colouring was always rejected, and lots of people who knew
> better got upset because it was the only way the hardware would go
> fast...
Yes, stunning wisdom there. Reject the speedups.
On Fri, Apr 27, 2007 at 01:38:30AM +1000, Nick Piggin wrote:
> You could put it that way. Or that it is wrong because of the
> fragmentation problem. Realise that it is somewhat fundamental
> considering that it is basically an unsolvable problem with our
> current kernel assumptions of unconstrained kernel allocations and
> a 1:1 kernel mapping.
Depends on what you consider a solution. A broadly used criterion is
that it improves performance significantly in important usage cases.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 6:38 ` Nick Piggin
2007-04-26 6:46 ` Christoph Lameter
@ 2007-04-26 15:58 ` Christoph Hellwig
2007-04-26 16:05 ` Jens Axboe
1 sibling, 1 reply; 235+ messages in thread
From: Christoph Hellwig @ 2007-04-26 15:58 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26, 2007 at 04:38:54PM +1000, Nick Piggin wrote:
> Hardware is built to handle many small pages efficiently, and I don't
> understand how it could be an SGI-only issue. Sure, you may have an
> order of magnitude or more memory than anyone else, but even my lowly
> desktop _already_ has orders of magnitude more pages than it has TLB
> entries or cache -- if a workload is cache-nice for me, it probably
> will be on a 1TB machine as well, and if it is bad for the 1TB machine,
> it is also bad on mine.
It's not an SGI-only issue, but apparently SGI are the only ones
that actually care enough about real high-end Linux setups to work on these
issues. The problem is not on the CPU hardware side. It's on
the software side and storage hardware side, or rather a combination
of the two.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 15:58 ` Christoph Hellwig
@ 2007-04-26 16:05 ` Jens Axboe
2007-04-26 16:16 ` Christoph Hellwig
0 siblings, 1 reply; 235+ messages in thread
From: Jens Axboe @ 2007-04-26 16:05 UTC (permalink / raw)
To: Christoph Hellwig, Nick Piggin, Eric W. Biederman,
Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Badari Pulavarty,
Maxim Levitsky
On Thu, Apr 26 2007, Christoph Hellwig wrote:
> On Thu, Apr 26, 2007 at 04:38:54PM +1000, Nick Piggin wrote:
> > Hardware is built to handle many small pages efficiently, and I don't
> > understand how it could be an SGI-only issue. Sure, you may have an
> > order of magnitude or more memory than anyone else, but even my lowly
> > desktop _already_ has orders of magnitude more pages than it has TLB
> > entries or cache -- if a workload is cache-nice for me, it probably
> > will be on a 1TB machine as well, and if it is bad for the 1TB machine,
> > it is also bad on mine.
>
> It's not an SGI-only issue, but apparently SGI are the only ones
> that actually care enough about real high-end Linux setups to work on these
> issues. The problem is not on the CPU hardware side. It's on
> the software side and storage hardware side, or rather a combination
> of the two.
Agree. I don't know why we are arguing the merits of this, it's an
obvious win. The problem is more whether it's doable this way or not due to
fragmentation, but that's a different discussion and should be kept
separate.
--
Jens Axboe
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:48 ` Nick Piggin
2007-04-26 9:20 ` David Chinner
@ 2007-04-26 16:07 ` Christoph Hellwig
2007-04-27 10:05 ` Nick Piggin
1 sibling, 1 reply; 235+ messages in thread
From: Christoph Hellwig @ 2007-04-26 16:07 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26, 2007 at 05:48:12PM +1000, Nick Piggin wrote:
> >Well maybe you could explain what you want. Preferably without redefining
> >the established terms?
>
> Support for larger buffers than page cache pages.
I don't think you really want this :) The whole non-pagecache I/O
path before 2.3 was a total pain just because it used buffers to drive
I/O. Add to that buffers bigger than a page and you add another
two orders of magnitude of complexity. If you want to see a mess like that
download one of the early XFS/Linux releases that had an I/O path
like that. I _really_ _really_ don't want to go there.
Linux has a long tradition of trading a tiny bit of efficiency for
much cleaner code, and I'd 100% go down Christoph's route here.
Then again I'd actually be rather surprised if > page buffers
were more efficient - you'd run into shitloads of overhead due to
them being non-contiguous, like calling vmap all over the place,
reprogramming iommus to at least make them look virtually contiguous [1],
etc..
I also don't quite get what your problem with higher order allocations
is. Order 1 allocations are generally just fine, and in fact
thread stacks are >= order 1 on most architectures. And if the pagecache
uses higher order allocations that means we'll finally fix our problems
with them, which we have to do anyway. Workloads continue to grow and
with them the kernel overhead to manage them, while the pagesize for
many architectures is fixed. So we'll have to deal with order 1
and order 2 allocations better just for backing kmalloc and co.
Or think jumboframes for that matter.
[1] many iommu implementation of course also have a limit of how many
segments they can actually virtually merge
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 6:50 ` Nick Piggin
2007-04-26 8:40 ` Mel Gorman
@ 2007-04-26 16:11 ` Christoph Hellwig
2007-04-26 17:49 ` Eric W. Biederman
2007-04-27 10:38 ` Nick Piggin
1 sibling, 2 replies; 235+ messages in thread
From: Christoph Hellwig @ 2007-04-26 16:11 UTC (permalink / raw)
To: Nick Piggin
Cc: David Chinner, Eric W. Biederman, clameter, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Thu, Apr 26, 2007 at 04:50:06PM +1000, Nick Piggin wrote:
> Improving the buffer layer would be a good way. Of course, that is
> a long and difficult task, so nobody wants to do it.
It's also a stupid idea. We got rid of the buffer layer because it's
a complete pain in the ass, and now you want to reintroduce one that's
even more complex, and most likely even slower than the elegant solution?
> Well, for those architectures (and this would solve your large block
> size and 16TB pagecache size without any core kernel changes), you
> can manage 1<<order hardware ptes as a single Linux pte. There is
> nothing that says you must implement PAGE_SIZE as a single TLB sized
> page.
Well, ppc64 can do that. And guess what, it's really painful for a lot
of workloads. Think of a poor ps3 with 256MB, from which the broken hypervisor
already takes a lot away, and now every file in the pagecache takes
64k, every thread stack takes 64k, etc. It's good to have variable
sized objects in places where it makes sense, and the pagecache is
definitely one of them.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 16:05 ` Jens Axboe
@ 2007-04-26 16:16 ` Christoph Hellwig
0 siblings, 0 replies; 235+ messages in thread
From: Christoph Hellwig @ 2007-04-26 16:16 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Nick Piggin, Eric W. Biederman,
Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Badari Pulavarty,
Maxim Levitsky
On Thu, Apr 26, 2007 at 06:05:07PM +0200, Jens Axboe wrote:
> Agree. I don't know why we are arguing the merits of this, it's an
> obvious win. The problem is more whether it's doable this way or not due to
> fragmentation, but that's a different discussion and should be kept
> separate.
Btw, another bit I really like about the patches is that they make
the page cache size notation actually useful. People used PAGE_SIZE
and PAGE_CACHE_SIZE totally interchangeably before and created a total
mess. Getting rid of PAGE_CACHE_SIZE and only using the per-mapping
macros would be a nice cleanup all by itself even without putting the
rest in for now.
^ permalink raw reply [flat|nested] 235+ messages in thread
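The per-mapping macros hch refers to above look roughly like the following;
the names and the order field in struct address_space are approximations of
the patchset's interface, not verbatim:

/* Page cache geometry per address_space instead of a global constant. */
static inline unsigned int page_cache_shift(struct address_space *mapping)
{
	return PAGE_SHIFT + mapping->order;	/* 'order' field assumed here */
}

static inline unsigned long page_cache_size(struct address_space *mapping)
{
	return 1UL << page_cache_shift(mapping);
}

static inline pgoff_t page_cache_index(struct address_space *mapping,
				       loff_t pos)
{
	return pos >> page_cache_shift(mapping);
}

static inline unsigned long page_cache_offset(struct address_space *mapping,
					      loff_t pos)
{
	return pos & (page_cache_size(mapping) - 1);
}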
* Re: [00/17] Large Blocksize Support V3
2007-04-26 15:20 ` Nick Piggin
@ 2007-04-26 17:42 ` Jens Axboe
2007-04-26 18:59 ` Eric W. Biederman
0 siblings, 1 reply; 235+ messages in thread
From: Jens Axboe @ 2007-04-26 17:42 UTC (permalink / raw)
To: Nick Piggin
Cc: David Chinner, Christoph Lameter, Eric W. Biederman,
linux-kernel, Mel Gorman, William Lee Irwin III,
Badari Pulavarty, Maxim Levitsky
On Fri, Apr 27 2007, Nick Piggin wrote:
> Why do we limit drivers to 128 sg entries?
No particular reason, except to avoid bigger 2^order allocations.
2MiB requests would require 3 contig pages to set up the sg list, which
is (probably) a little troublesome especially since it's sometimes
atomically allocated.
Larger pages are by no means a prerequisite to getting larger requests,
assuming your hardware can handle the bigger sglist. There are other
ways of doing that, I've contemplated doing chained sglists and adding
sg_for_each_segment() macros for iterating these things. Drivers that
want larger sglists would then be required to update their sg mapping
loop to use the provided macros. It wouldn't be too hard.
--
Jens Axboe
^ permalink raw reply [flat|nested] 235+ messages in thread
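A rough sketch of the chaining Jens describes above; the layout (the last slot
of each 128-entry array doubling as a link to the next array) and all names
are hypothetical, taken from the proposal rather than from existing code:

#include <linux/scatterlist.h>

#define SG_ARRAY_LEN	128	/* one page worth of entries, as today */

/*
 * Iterate 'nents' data segments spread over chained arrays.  Slot 127
 * of every array is the link entry, not a data segment; stuffing the
 * next-array pointer into .page is only to keep the sketch short, a
 * real implementation would flag and encode the link entry properly.
 */
static void walk_chained_sg_example(struct scatterlist *sg, int nents)
{
	int i, slot = 0;

	for (i = 0; i < nents; i++) {
		/* process one segment: sg->page, sg->offset, sg->length */

		if (++slot == SG_ARRAY_LEN - 1) {
			sg = (struct scatterlist *)sg[1].page;
			slot = 0;
		} else {
			sg++;
		}
	}
}

Drivers would not open-code this; they would switch from plain array walks to
an iterator of this shape, which is the driver update Jens mentions.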
* Re: [00/17] Large Blocksize Support V3
2007-04-26 16:11 ` Christoph Hellwig
@ 2007-04-26 17:49 ` Eric W. Biederman
2007-04-26 18:03 ` Christoph Lameter
2007-04-27 10:38 ` Nick Piggin
1 sibling, 1 reply; 235+ messages in thread
From: Eric W. Biederman @ 2007-04-26 17:49 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Nick Piggin, David Chinner, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
Christoph Hellwig <hch@infradead.org> writes:
> On Thu, Apr 26, 2007 at 04:50:06PM +1000, Nick Piggin wrote:
>> Improving the buffer layer would be a good way. Of course, that is
>> a long and difficult task, so nobody wants to do it.
>
> It's also a stupid idea. We got rid of the buffer layer because it's
> a complete pain in the ass, and now you want to reintroduce one that's
> even more complex, and most likely even slower than the elegant solution?
No. I'm really suggesting improving the translation from BIOs
to the page cache. A set of helper functions.
This patch is suggesting we move to a BSD-like buffer cache, except
that everything is physically mapped.
My most practical suggestion is to have support code so that you can
do all of the locking (that I/O cares about) on the first page of a
page group in the page cache. You don't need larger physical pages to
do that.
>> Well, for those architectures (and this would solve your large block
>> size and 16TB pagecache size without any core kernel changes), you
>> can manage 1<<order hardware ptes as a single Linux pte. There is
>> nothing that says you must implement PAGE_SIZE as a single TLB sized
>> page.
>
> Well, ppc64 can do that. And guess what, it's really painful for a lot
> of workloads. Think of a poor ps3 with 256MB, from which the broken hypervisor
> already takes a lot away, and now every file in the pagecache takes
> 64k, every thread stack takes 64k, etc. It's good to have variable
> sized objects in places where it makes sense, and the pagecache is
> definitely one of them.
Agreed the page cache is all about variable sized objects known as files!
You don't need to do anything extra. The problem is only with building
I/O requests from what is there.
Iff we really need the larger physical page size to support the hardware
then it makes sense to go down a path of larger pages. But it doesn't.
There is also a more fundamental reason this patch is silly. It assumes
that there is a trivial mapping between filesystems (the primary target
of the page cache) and blocks on disk. Now I admit this is common but
there is no reason to suppose it is true, and this patch appears to
exacerbate things.
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
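A rough illustration of the helper Eric is suggesting above; the group-order
argument and the helper name are hypothetical:

#include <linux/pagemap.h>

/*
 * Treat 2^order consecutive page cache pages as one block and
 * serialise block-sized I/O by locking only the group's head page.
 */
static struct page *lock_page_group_example(struct address_space *mapping,
					    pgoff_t index, unsigned int order)
{
	pgoff_t head_index = index & ~((pgoff_t)(1UL << order) - 1);

	/* returns the head page locked, or NULL if it is not cached */
	return find_lock_page(mapping, head_index);
}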
* Re: [00/17] Large Blocksize Support V3
2007-04-26 10:54 ` Eric W. Biederman
2007-04-26 12:23 ` Mel Gorman
@ 2007-04-26 17:58 ` Christoph Lameter
2007-04-26 18:02 ` Jens Axboe
1 sibling, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 17:58 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Mel Gorman, Nick Piggin, David Chinner, linux-kernel,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Thu, 26 Apr 2007, Eric W. Biederman wrote:
> I can't possibly believe any of this is about the cost of processing
> a request, but rather the problem that some devices don't have a large
> enough pool of requests to keep them busy if you submit the requests
> in 4K page sizes.
>
> This all sounds like a device design issue rather than anything more
> significant, and it doesn't sound like a long term trend. Market
> pressure should fix the hardware.
Sounds like we are dictating to device manufacturers how to
design their devices instead of leaving them a choice.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 17:58 ` Christoph Lameter
@ 2007-04-26 18:02 ` Jens Axboe
0 siblings, 0 replies; 235+ messages in thread
From: Jens Axboe @ 2007-04-26 18:02 UTC (permalink / raw)
To: Christoph Lameter
Cc: Eric W. Biederman, Mel Gorman, Nick Piggin, David Chinner,
linux-kernel, William Lee Irwin III, Badari Pulavarty,
Maxim Levitsky
On Thu, Apr 26 2007, Christoph Lameter wrote:
> On Thu, 26 Apr 2007, Eric W. Biederman wrote:
>
> > I can't possibly believe any of this is about the cost of processing
> > a request, but rather the problem that some devices don't have a large
> > enough pool of requests to keep them busy if you submit the requests
> > in 4K page sizes.
> >
> > This all sounds like a device design issue rather than anything more
> > significant, and it doesn't sound like a long term trend. Market
> > pressure should fix the hardware.
>
> Sounds like we are dictating to device manufacturers how to
> design their devices instead of leaving them a choice.
Well... First of all (to Eric), the 4kb pages aren't the issue. Only if
you have limited sg capabilities in your hardware AND need larger
segment sizes to get up in the range of request sizes that makes that
hardware go fast would you benefit from always using bigger pages. And
market pressure should indeed make you lose, if you design crappy
hardware like that.
Secondly, choice isn't always good. Linux should cater to the hardware
only to the extent that it makes sense. We don't make design decisions
that are unmaintainable or cripple us in some way because of bad
hardware. That would be stupid.
If we need and want larger IO sizes, then we can do that without big
compound pages. I'm merely interested in the bigger page cache pages
because it'll solve several issues at once.
--
Jens Axboe
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 17:49 ` Eric W. Biederman
@ 2007-04-26 18:03 ` Christoph Lameter
2007-04-26 18:03 ` Jens Axboe
0 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 18:03 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Christoph Hellwig, Nick Piggin, David Chinner, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Thu, 26 Apr 2007, Eric W. Biederman wrote:
> My most practical suggestion is to have support code so that you can
> do all of the locking (that I/O cares about) on the first page of a
> page group in the page cache. You don't need larger physical pages to
> do that.
Virtual mappings for larger pages? Or some software simulated form? This
all sounds like awful hacks.
> Iff we really need the larger physical page size to support the hardware
> then it makes sense to go down a path of larger pages. But it doesn't.
You are redefining the problem. We need larger physical sizes to support
the hardware. Yes. We can dodge the issue with shim layers and hacks. It
is obvious from the kernel sources that this is needed.
> There is also a more fundamental reason this patch is silly. It assumes
> that there is a trivial mapping between filesystems (the primary target
> of the page cache) and blocks on disk. Now I admit this is common but
> there is no reason to suppose it is true, and this patch appears to
> exacerbate things.
The patch does the obvious... Nothing is silly about data being
contiguous in memory and on disk.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:03 ` Christoph Lameter
@ 2007-04-26 18:03 ` Jens Axboe
2007-04-26 18:09 ` Christoph Hellwig
0 siblings, 1 reply; 235+ messages in thread
From: Jens Axboe @ 2007-04-26 18:03 UTC (permalink / raw)
To: Christoph Lameter
Cc: Eric W. Biederman, Christoph Hellwig, Nick Piggin, David Chinner,
linux-kernel, Mel Gorman, William Lee Irwin III,
Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26 2007, Christoph Lameter wrote:
> > Iff we really need the larger physical page size to support the hardware
> > then it makes sense to go down a path of larger pages. But it doesn't.
>
> You are redefining the problem. We need larger physical sizes to support
> the hardware. Yes. We can dodge the issue with shim layers and hacks. It
> is obvious from the kernel sources that this is needed.
We definitely don't. Larger sizes are ONE way to solve the problem, they
are definitely not the only one. If the larger pages become unfeasible
for some reason (be it fragmentation, or just because the design isn't
good), then we can solve it differently.
--
Jens Axboe
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 10:10 ` Eric W. Biederman
2007-04-26 13:50 ` David Chinner
@ 2007-04-26 18:07 ` Christoph Lameter
2007-04-26 18:45 ` Eric W. Biederman
1 sibling, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 18:07 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Chinner, Nick Piggin, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Thu, 26 Apr 2007, Eric W. Biederman wrote:
> Ok. Now why are high end hardware manufacturers building crippled
> hardware? Or is there only an 8bit field in SCSI for describing
> scatter gather entries? Although I would think this would be
> more of a controller rather than a drive issue.
They do not share your definition of "crippled" and would rather
say that Linux is a crippled OS because it cannot support large physical
block size I/O.
> The change the fundamental fragmentation avoidance algorithm of the
> OS. Use only one size of page. That is a huge problem.
No, it's no problem at all. It's in the heads of some people who have never
seen it done otherwise. We have another OS here that has dealt with variable
sized pages for ages and our people have been shaking their heads for years
about these strong opinions that lead to us not dealing with the
problem.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 13:50 ` William Lee Irwin III
@ 2007-04-26 18:09 ` Eric W. Biederman
2007-04-26 23:34 ` William Lee Irwin III
0 siblings, 1 reply; 235+ messages in thread
From: Eric W. Biederman @ 2007-04-26 18:09 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Christoph Lameter, Nick Piggin, linux-kernel, Mel Gorman,
David Chinner, Jens Axboe, Badari Pulavarty, Maxim Levitsky
William Lee Irwin III <wli@holomorphy.com> writes:
> On Thu, 26 Apr 2007, Nick Piggin wrote:
>>> OK, I would like to see them. And also discussions of things like why
>>> we shouldn't increase PAGE_SIZE instead.
>
> On Thu, Apr 26, 2007 at 12:34:50AM -0700, Christoph Lameter wrote:
>> Because 4k is a good page size that is bound to the binary format? Frankly
>> there is no point in having my text files in large page sizes. However,
>> when I read a dvd then I may want to transfer 64k chunks or when I use my
>> flash drive I may want to transfer 128k chunks. And yes if a scientific
>> application needs to do a data dump then it should be able to use very high
>> page sizes (megabytes, gigabytes) to be able to continue its work while
>> the huge dump runs at full I/O speed ...
>
> It's possible to divorce PAGE_SIZE from the binary formats, though I
> found it difficult to keep up with the update treadmill.
On x86_64 the size is actually 64K for executable binaries if I
recall correctly. It certainly is not PAGE_SIZE, so we have some
flexibility there.
> Maybe it's
> like hch says and I just needed to find more and better API cleanups.
> I've only not tried to resurrect it because it's too much for me to do
> on my own. I essentially collapsed under the weight of it and my 2.5.x
> codebase ended up worse than Katrina as a disaster, which I don't want
> to repeat and think collaborators or a different project lead from
> myself are needed to avoid that happening again.
But we still have some issues with mmap. Since we could increase
PAGE_SIZE on x86_64, we would not even have to worry about sub-PAGE_SIZE
mmaps. It is being suggested that if people really need larger
physical pages they just fix PAGE_SIZE. Then everything just
works.
Thinking about it changing PAGE_SIZE on x86_64 should be about as
hard as doing the 3-level vs 2-level page table format. We say
we have a different page table format that uses a larger PAGE_SIZE.
All arch code, all code in paths that we expect to change.
Boom all done.
It might be worth implementing just so people can play with different
PAGE_SIZE values for benchmarking.
I don't think the larger physical page size is really the issue here
though.
> It's unclear how much the situation has changed since 32-bit workload
> feasibility issues have been relegated to ignorable or deliberate
> "f**k 32-bit" status. The effect is doubtless to make it easier, though
> to what degree I'm not sure.
Perhaps.
> Anyway, if that's being kicked around as an alternative, it could be
> said that I have some insight into the issues surrounding it.
Partially, but they are also very much suggesting going down
the same path. Currently mmap doesn't work with order >0 pages because
they are not yet addressing these issues at all.
This looks like a more flexible version of the old PAGE_CACHE_SIZE >
PAGE_SIZE code. Which makes me seriously question the whole idea.
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:03 ` Jens Axboe
@ 2007-04-26 18:09 ` Christoph Hellwig
2007-04-26 18:12 ` Jens Axboe
0 siblings, 1 reply; 235+ messages in thread
From: Christoph Hellwig @ 2007-04-26 18:09 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Lameter, Eric W. Biederman, Christoph Hellwig,
Nick Piggin, David Chinner, linux-kernel, Mel Gorman,
William Lee Irwin III, Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26, 2007 at 08:03:58PM +0200, Jens Axboe wrote:
> On Thu, Apr 26 2007, Christoph Lameter wrote:
> > > Iff we really need the larger physical page size to support the hardware
> > > then it makes sense to go down a path of larger pages. But it doesn't.
> >
> > You are redefining the problem. We need larger physical sizes to support
> > the hardware. Yes. We can dodge the issue with shim layers and hacks. It
> > is obvious from the kernel sources that this is needed.
>
> We definitely don't. Larger sizes are ONE way to solve the problem, they
> are definitely not the only one. If the larger pages become unfeasible
> for some reason (be it fragmentation, or just because the design isn't
> good), then we can solve it differently.
Exactly. But the only counter-proposal we have so far seems far worse :)
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:45 ` Nick Piggin
@ 2007-04-26 18:10 ` Christoph Lameter
2007-04-27 10:08 ` Nick Piggin
0 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 18:10 UTC (permalink / raw)
To: Nick Piggin
Cc: David Chinner, Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Thu, 26 Apr 2007, Nick Piggin wrote:
> Christoph Lameter wrote:
> > On Thu, 26 Apr 2007, Nick Piggin wrote:
> >
> >
> > > But I maintain that the end result is better than the fragmentation
> > > based approach. A lot of people don't actually want a bigger page
> > > cache size, because they want efficient internal fragmentation as
> > > well, so your radix-tree based approach isn't really comparable.
> >
> >
> > Me? Radix tree based approach? That approach is in the kernel. Do not create
> > a solution where there is no problem. If we do not want to support large
> > blocksizes then lets be honest and say so instead of redefining what a block
> > is. The current approach is fine if one is satisfied with scatter gather and
> > the VM overhead coming with handling these pages. I fail to see what any of
> > what you are proposing would add to that.
>
> I'm not just making this up. Fragmentation. OK?
Yes you are. If you want to avoid fragmentation by restricting the OS to
4k alone then the radix tree is sufficient to establish the order of pages
in a mapping. The only problem is to get an array of pointers to a
sequence of pages together by reading through the radix tree. I do not
know what else would be needed.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:09 ` Christoph Hellwig
@ 2007-04-26 18:12 ` Jens Axboe
2007-04-26 18:24 ` Christoph Hellwig
2007-04-26 18:28 ` Christoph Lameter
0 siblings, 2 replies; 235+ messages in thread
From: Jens Axboe @ 2007-04-26 18:12 UTC (permalink / raw)
To: Christoph Hellwig, Christoph Lameter, Eric W. Biederman,
Nick Piggin, David Chinner, linux-kernel, Mel Gorman,
William Lee Irwin III, Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26 2007, Christoph Hellwig wrote:
> On Thu, Apr 26, 2007 at 08:03:58PM +0200, Jens Axboe wrote:
> > On Thu, Apr 26 2007, Christoph Lameter wrote:
> > > > Iff we really need the larger physical page size to support the hardware
> > > > then it makes sense to go down a path of larger pages. But it doesn't.
> > >
> > > You are redefining the problem. We need larger physical sizes to support
> > > the hardware. Yes. We can dodge the issue with shim layers and hacks. It
> > > is obvious from the kernel sources that this is needed.
> >
> > We definitely don't. Larger sizes are ONE way to solve the problem, they
> > are definitely not the only one. If the larger pages become unfeasible
> > for some reason (be it fragmentation, or just because the design isn't
> > good), then we can solve it differently.
>
> Exactly. But the only counter-proposal we have so far seems far worse :)
Let's look at some numbers. I'll just concentrate on the scatterlist,
since the bio_vec is smaller. On x86 32-bit, the scatterlist is 20 bytes
long. If we accept that 2^1 allocations are ok (they should be), then we
can support ~1.6MB IOs just like that.
My approach would be to support scatterlist chaining. Essentially you'd
have the last element of the sglist pointing to the next array of
entries. We can then stick to 128 entry arrays which fit nicely in a
single page allocation and easily support >> 2mb ios. The only caveat is
that you'd need to update the drivers to get there, since a regular
iteration over the array isn't enough. My plan was to add an sglist
iterator helper that hides this from the drivers, if they need to loop
over the scatterlist. Things like {dma/pci}_map_sg() would of course be
updated.
The above can be implemented fairly cleanly, and on a need-to-have
basis. It's not something that'll break drivers.
What do you think?
--
Jens Axboe
^ permalink raw reply [flat|nested] 235+ messages in thread
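Where the ~1.6MB figure in the message above comes from, assuming one 4k page
per scatterlist entry:

/*
 *   order-1 allocation     = 2 * 4096   = 8192 bytes
 *   entries per allocation = 8192 / 20  =  409
 *   maximum request        = 409 * 4096 = 1675264 bytes (~1.6MB)
 */
#define SG_ENTRY_BYTES	20U			/* x86-32 sizeof(struct scatterlist) */
#define SG_ALLOC_BYTES	(2U * 4096U)		/* one order-1 allocation */
#define SG_ENTRIES	(SG_ALLOC_BYTES / SG_ENTRY_BYTES)
#define SG_MAX_IO_BYTES	(SG_ENTRIES * 4096U)	/* 1675264, ~1.6MB */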
* Re: [00/17] Large Blocksize Support V3
2007-04-26 7:42 ` Nick Piggin
` (2 preceding siblings ...)
2007-04-26 14:53 ` William Lee Irwin III
@ 2007-04-26 18:13 ` Christoph Lameter
2007-04-27 10:15 ` Nick Piggin
3 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 18:13 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Thu, 26 Apr 2007, Nick Piggin wrote:
> > But what do you mean with it? A block is no longer a contiguous section of
> > memory. So you have redefined the term.
>
> I don't understand what you mean at all. A block has always been a
> contiguous area of disk.
You want to change the block layer to support larger blocksize than
PAGE_SIZE right? So you need to segment that larger block into pieces.
> > And you dont care about Mel's work on that level?
>
> I actually don't like it too much because it can't provide a robust
> solution. What do you do on systems with small memories, or those that
> eventually do get fragmented?
You could f.e. switch off defragmentation and the large block support?
> Actually, I don't know why people are so excited about being able to
> use higher order allocations (I would rather be more excited about
> never having to use them). But for those few places that really need
> it, I'd rather see them use a virtually mapped kernel with proper
> defragmentation rather than putting hacks all through the core code.
Ahh. I knew we were going this way.... Now we have virtual contiguous vs.
physical discontiguous.... Yuck hackidihack.
> > No this has been tried before and does not work. Why should we loose the
> > capability to work with 4k pages just because there is some data that has to
> > be thrown around in quantity? I'd like to have flexibility here.
>
> Is that a big problem? Really? You use 16K pages on your IPF systems,
> don't you?
Yes, but the processor supports 4k also. I'd rather have a choice. 16k is a
choice for performance given the current kernel limitations, and it wastes
lots of memory.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 14:53 ` William Lee Irwin III
@ 2007-04-26 18:16 ` Christoph Lameter
2007-04-26 18:21 ` Eric W. Biederman
1 sibling, 0 replies; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 18:16 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Nick Piggin, Eric W. Biederman, linux-kernel, Mel Gorman,
David Chinner, Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Thu, 26 Apr 2007, William Lee Irwin III wrote:
> On Thu, Apr 26, 2007 at 05:42:17PM +1000, Nick Piggin wrote:
> > I don't think that it is solved, and I think the heuristics that are
> > there would be put under more stress if they become widely used. And
> > it isn't only about whether we can get the page or not, but also about
> > the cost. Look up Linus's arguments about page colouring, which are
> > similar and I also think are pretty valid.
>
> Perhaps broader exposure would help demonstrate their efficacy, or
> otherwise discover what deficits need to be addressed.
It would help if developers would once in a while look at how other
operating systems handle this. They have supported such features for a
long time.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 14:53 ` William Lee Irwin III
2007-04-26 18:16 ` Christoph Lameter
@ 2007-04-26 18:21 ` Eric W. Biederman
2007-04-27 0:32 ` William Lee Irwin III
1 sibling, 1 reply; 235+ messages in thread
From: Eric W. Biederman @ 2007-04-26 18:21 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Nick Piggin, Christoph Lameter, linux-kernel, Mel Gorman,
David Chinner, Jens Axboe, Badari Pulavarty, Maxim Levitsky
William Lee Irwin III <wli@holomorphy.com> writes:
> On Thu, Apr 26, 2007 at 05:42:17PM +1000, Nick Piggin wrote:
>> Actually, I don't know why people are so excited about being able to
>> use higher order allocations (I would rather be more excited about
>> never having to use them). But for those few places that really need
>> it, I'd rather see them use a virtually mapped kernel with proper
>> defragmentation rather than putting hacks all through the core code.
>
> In memory as on disk, contiguity matters a lot for performance.
Not nearly so much though. In memory you don't have seeks to avoid.
On disks avoiding seeks is everything.
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:24 ` Christoph Hellwig
@ 2007-04-26 18:24 ` Jens Axboe
0 siblings, 0 replies; 235+ messages in thread
From: Jens Axboe @ 2007-04-26 18:24 UTC (permalink / raw)
To: Christoph Hellwig, Christoph Lameter, Eric W. Biederman,
Nick Piggin, David Chinner, linux-kernel, Mel Gorman,
William Lee Irwin III, Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26 2007, Christoph Hellwig wrote:
> On Thu, Apr 26, 2007 at 08:12:51PM +0200, Jens Axboe wrote:
> > > Exactly. But the only counter-proposal we have so far seems far worse :)
> >
> > Lets look at some numbers. I'll just concentrate on the scatterlist,
> > since the bio_vec is smaller. On x86 32-bit, the scatterlist is 20 bytes
> > long. If we accept that 2^1 allocations are ok (they should be), then we
> > can support ~1.6mb ios just like that.
> >
> > My approach would be to support scatterlist chaining. Essentially you'd
> > have the last element of the sglist pointing to the next array of
> > entries. We can then stick to 128 entry arrays which fit nicely in a
> > single page allocation and easily support >> 2mb ios. The only caveat is
> > that you'd need to update the drivers to get there, since a regular
> > iteration over the array isn't enough. My plan was to add an sglist
> > iterator helper that hides this from the drivers, if they need to loop
> > over the scatterlist. Things like {dma/pci}_map_sg() would of course be
> > updated.
> >
> > The above can be implemented fairly cleanly, and on a need-to-have
> > basis. It's not something that'll break drivers.
> >
> > What do you think?
>
> Purely for the I/O sizes to external arrays problem that's nice,
> and I think we (well, you :)) should implement it.
I will get it implemented, next week.
> But there's other reasons why larger objects in the page cache make
> sense that are mostly related to keeping overhead for large files
> in the operating system down. So I'd go both for s/g list chaining
> and variable order pagecache.
Oh I definitely agree, I just think we should keep the discussion
focused on the seperate issues and not mix everything up.
> Btw, we should talk a little about the sglist iterators on linux-scsi,
> as a lot of the dma mapping API will need updates for bidirection dmas
> anyway, and we should try to get everything done in one rush.
Yep
--
Jens Axboe
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:12 ` Jens Axboe
@ 2007-04-26 18:24 ` Christoph Hellwig
2007-04-26 18:24 ` Jens Axboe
2007-04-26 18:28 ` Christoph Lameter
1 sibling, 1 reply; 235+ messages in thread
From: Christoph Hellwig @ 2007-04-26 18:24 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Christoph Lameter, Eric W. Biederman,
Nick Piggin, David Chinner, linux-kernel, Mel Gorman,
William Lee Irwin III, Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26, 2007 at 08:12:51PM +0200, Jens Axboe wrote:
> > Exactly. But the only counter-proposal we have so far seems far worse :)
>
> Lets look at some numbers. I'll just concentrate on the scatterlist,
> since the bio_vec is smaller. On x86 32-bit, the scatterlist is 20 bytes
> long. If we accept that 2^1 allocations are ok (they should be), then we
> can support ~1.6mb ios just like that.
>
> My approach would be to support scatterlist chaining. Essentially you'd
> have the last element of the sglist pointing to the next array of
> entries. We can then stick to 128 entry arrays which fit nicely in a
> single page allocation and easily support >> 2mb ios. The only caveat is
> that you'd need to update the drivers to get there, since a regular
> iteration over the array isn't enough. My plan was to add an sglist
> iterator helper that hides this from the drivers, if they need to loop
> over the scatterlist. Things like {dma/pci}_map_sg() would of course be
> updated.
>
> The above can be implemented fairly cleanly, and on a need-to-have
> basis. It's not something that'll break drivers.
>
> What do you think?
Purely for the I/O sizes to external arrays problem that's nice,
and I think we (well, you :)) should implement it.
But there's other reasons why larger objects in the page cache make
sense that are mostly related to keeping overhead for large files
in the operating system down. So I'd go both for s/g list chaining
and variable order pagecache.
Btw, we should talk a little about the sglist iterators on linux-scsi,
as a lot of the dma mapping API will need updates for bidirection dmas
anyway, and we should try to get everything done in one rush.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:12 ` Jens Axboe
2007-04-26 18:24 ` Christoph Hellwig
@ 2007-04-26 18:28 ` Christoph Lameter
2007-04-26 18:29 ` Jens Axboe
1 sibling, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 18:28 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Eric W. Biederman, Nick Piggin, David Chinner,
linux-kernel, Mel Gorman, William Lee Irwin III,
Badari Pulavarty, Maxim Levitsky
On Thu, 26 Apr 2007, Jens Axboe wrote:
> The above can be implemented fairly cleanly, and on a need-to-have
> basis. It's not something that'll break drivers.
But its also not going to fix the hacks that we have in the kernel
to deal with > PAGE_SIZE i/o.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:28 ` Christoph Lameter
@ 2007-04-26 18:29 ` Jens Axboe
2007-04-26 18:35 ` Christoph Lameter
0 siblings, 1 reply; 235+ messages in thread
From: Jens Axboe @ 2007-04-26 18:29 UTC (permalink / raw)
To: Christoph Lameter
Cc: Christoph Hellwig, Eric W. Biederman, Nick Piggin, David Chinner,
linux-kernel, Mel Gorman, William Lee Irwin III,
Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26 2007, Christoph Lameter wrote:
> On Thu, 26 Apr 2007, Jens Axboe wrote:
>
> > The above can be implemented fairly cleanly, and on a need-to-have
> > basis. It's not something that'll break drivers.
>
> But its also not going to fix the hacks that we have in the kernel
> to deal with > PAGE_SIZE i/o.
No, but that's a _seperate_ issue! Don't keep mixing up the two.
--
Jens Axboe
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:29 ` Jens Axboe
@ 2007-04-26 18:35 ` Christoph Lameter
2007-04-26 18:39 ` Jens Axboe
0 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 18:35 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Eric W. Biederman, Nick Piggin, David Chinner,
linux-kernel, Mel Gorman, William Lee Irwin III,
Badari Pulavarty, Maxim Levitsky
On Thu, 26 Apr 2007, Jens Axboe wrote:
> On Thu, Apr 26 2007, Christoph Lameter wrote:
> > On Thu, 26 Apr 2007, Jens Axboe wrote:
> >
> > > The above can be implemented fairly cleanly, and on a need-to-have
> > > basis. It's not something that'll break drivers.
> >
> > But its also not going to fix the hacks that we have in the kernel
> > to deal with > PAGE_SIZE i/o.
>
> No, but that's a _seperate_ issue! Don't keep mixing up the two.
Yes, I understand that you want it to be a separate issue so we get
more rationales for the hacks that we do to avoid the large
order allocations.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:35 ` Christoph Lameter
@ 2007-04-26 18:39 ` Jens Axboe
2007-04-26 19:35 ` Eric W. Biederman
2007-04-26 20:22 ` Mel Gorman
0 siblings, 2 replies; 235+ messages in thread
From: Jens Axboe @ 2007-04-26 18:39 UTC (permalink / raw)
To: Christoph Lameter
Cc: Christoph Hellwig, Eric W. Biederman, Nick Piggin, David Chinner,
linux-kernel, Mel Gorman, William Lee Irwin III,
Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26 2007, Christoph Lameter wrote:
> On Thu, 26 Apr 2007, Jens Axboe wrote:
>
> > On Thu, Apr 26 2007, Christoph Lameter wrote:
> > > On Thu, 26 Apr 2007, Jens Axboe wrote:
> > >
> > > > The above can be implemented fairly cleanly, and on a need-to-have
> > > > basis. It's not something that'll break drivers.
> > >
> > > But its also not going to fix the hacks that we have in the kernel
> > > to deal with > PAGE_SIZE i/o.
> >
> > No, but that's a _seperate_ issue! Don't keep mixing up the two.
>
> Yes I understand that you want it to be a separate issue so we get get
> more rationales for the hacks that we do to avoid the large
> order allocations.
Christoph, don't take your frustrations out on me. I've several times in
this thread said that I'd LIKE to have > PAGE_SIZE support in the page
cache. I WROTE the initial pktcdvd driver that is a primary example of
these hacks, I'm very well aware of the pain and bugs involved with
that.
But don't push large pages as the only solution to larger ios, because
that is trivially not true.
--
Jens Axboe
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:07 ` Christoph Lameter
@ 2007-04-26 18:45 ` Eric W. Biederman
2007-04-26 18:59 ` Christoph Lameter
0 siblings, 1 reply; 235+ messages in thread
From: Eric W. Biederman @ 2007-04-26 18:45 UTC (permalink / raw)
To: Christoph Lameter
Cc: David Chinner, Nick Piggin, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
Christoph Lameter <clameter@sgi.com> writes:
> On Thu, 26 Apr 2007, Eric W. Biederman wrote:
>> They change the fundamental fragmentation avoidance algorithm of the
>> OS. Use only one size of page. That is a huge problem.
>
No, it's no problem at all. It's in the heads of some people who have never
seen it done otherwise. We have another OS here that has dealt with variable
sized pages for ages, and our people have been shaking their heads for years
about these strong opinions that lead to us not dealing with the
problem.
If you can make small things go fast, everything speeds up.
If you can only make big things go fast only some things speed up.
I want to make small things go fast so everything speeds up.
This isn't whatever OS you are used to dealing with. Assumptions are
different, and improving streaming read/write benchmarks is not our
focus.
One of the first rules in storage or OS design is that it had better be
reliable, and your patch looks like it will make the VM unreliable. So
no, a design like this will not be agreed to without comment until it
is shown that the VM does not have problems.
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (23 preceding siblings ...)
2007-04-26 4:51 ` Eric W. Biederman
@ 2007-04-26 18:50 ` Maxim Levitsky
2007-04-27 2:04 ` Andrew Morton
2007-04-28 16:39 ` Maxim Levitsky
26 siblings, 0 replies; 235+ messages in thread
From: Maxim Levitsky @ 2007-04-26 18:50 UTC (permalink / raw)
To: clameter
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Jens Axboe, Badari Pulavarty
On Wednesday 25 April 2007 01:21, clameter@sgi.com wrote:
> V2->V3
> - More restructuring
> - It actually works!
> - Add XFS support
> - Fix up UP support
> - Work out the direct I/O issues
> - Add CONFIG_LARGE_BLOCKSIZE. Off by default which makes the inlines revert
> back to constants. Disabled for 32bit and HIGHMEM configurations.
> This also allows a gradual migration to the new page cache
> inline functions. LARGE_BLOCKSIZE capabilities can be
> added gradually and if there is a problem then we can disable
> a subsystem.
>
> V1->V2
> - Some ext2 support
> - Some block layer, fs layer support etc.
> - Better page cache macros
> - Use macros to clean up code.
>
> This patchset modifies the Linux kernel so that larger block sizes than
> page size can be supported. Larger block sizes are handled by using
> compound pages of an arbitrary order for the page cache instead of
> single pages with order 0.
>
> Rationales:
>
> 1. We have problems supporting devices with a higher blocksize than
> page size. This is for example important to support CD and DVDs that
> can only read and write 32k or 64k blocks. We currently have a shim
> layer in there to deal with this situation which limits the speed
> of I/O. The developers are currently looking for ways to completely
> bypass the page cache because of this deficiency.
>
> 2. 32/64k blocksize is also used in flash devices. Same issues.
>
> 3. Future harddisks will support bigger block sizes that Linux cannot
> support since we are limited to PAGE_SIZE. Ok the on board cache
> may buffer this for us but what is the point of handling smaller
> page sizes than what the drive supports?
>
> 4. Reduce fsck times. Larger block sizes mean faster file system checking.
>
> 5. Performance. If we look at IA64 vs. x86_64 then it seems that the
> faster interrupt handling on x86_64 compensate for the speed loss due to
> a smaller page size (4k vs 16k on IA64). Supporting larger block sizes
> sizes on all allows a significant reduction in I/O overhead and
> increases the size of I/O that can be performed by hardware in a single
> request since the number of scatter gather entries are typically limited
> for one request. This is going to become increasingly important to support
> the ever growing memory sizes since we may have to handle excessively large
> amounts of 4k requests for data sizes that may become common soon. For
> example to write a 1 terabyte file the kernel would have to handle 256
> million 4k chunks.
>
> 6. Cross arch compatibility: It is currently not possible to mount
> an 16k blocksize ext2 filesystem created on IA64 on an x86_64 system.
> With this patch this becoems possible.
>
> The support here is currently only for buffered I/O. Modifications for
> three filesystems are included:
>
> A. XFS
> B. Ext2
> C. ramfs
>
> Unsupported
> - Mmapping blocks larger than page size
>
> Issues:
> - There are numerous places where the kernel can no longer assume that the
> page cache consists of PAGE_SIZE pages that have not been fixed yet.
> - Defrag warning: The patch set can fragment memory very fast.
> It is likely that Mel Gorman's anti-frag patches and some more
> work by him on defragmentation may be needed if one wants to use
> super sized pages.
> If you run a 2.6.21 kernel with this patch and start a kernel compile
> on a 4k volume with a concurrent copy operation to a 64k volume on
> a system with only 1 Gig then you will go boom (ummm no ... OOM) fast.
> How well Mel's antifrag/defrag methods address this issue still has to
> be seen.
>
> Future:
> - Mmap support could be done in a way that makes the mmap page size
> independent from the page cache order. It is okay to map a 4k section
> of a larger page cache page via a pte. 4k mmap semantics can be
> completely preserved even for larger page sizes.
> - Maybe people could perform benchmarks to see how much of a difference
> there is between 4k size I/O and 64k? Andrew surely would like to know.
> - If there is a chance for inclusion then I will diff this against mm,
> do a complete scan over the kernel to find all page cache == PAGE_SIZE
> assumptions and then try to get it upstream for 2.6.23.
>
> How to make this work:
>
> 1. Apply this patchset to 2.6.21-rc7
> 2. Configure LARGE_BLOCKSIZE Support
> 3. compile kernel
>
> --
Hi,
I really like the idea of block size > page size.
I just want to suggest my ideas about how to implement it (I can't do it
myself since it is too complicated for me and I don't have time now).
I visualized it like this:
For size <= 4K, a page still holds one or more blocks.
For size > 4K:
A page holds a fraction of a block, and its ->buffer_head also holds info
about that fraction.
But that ->bh contains a ->next pointer (or a list_head) that ties all
fractions of the block together.
This will minimize changes in fs code.
Today the fs blindly uses the bh returned by the block code.
A modified filesystem will also read from the linked ->next bhs to get all
parts of the block.
For blocksizes <= 4k that ->next will be NULL, indicating that the bh
contains the whole block.
Then the implementation of the block address_space operations should not
need to change a lot either.
They will just be aware that the page can be linked with other pages of the
same block and do the right thing.
For example:
->readpage will read not only _that_ page but also all sibling pages
->writepage will magically write not only that page but its siblings too.
buffer_head already has a pointer to the page, so it is easy to get the page
from the buffer head:
page -> private -> next -> page -> private ...
and so on (sorry, but I did a study on this half a year ago, for myself,
trying to improve packet writing, so I don't remember much about it
now).
What about that ?
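A very rough sketch of the structure being described, purely as an
illustration (the fragment link and helper below are invented; the real
struct buffer_head has no such field):

struct page;                            /* kernel type, declaration only */

/* One 4k fraction of a larger block, as described above. */
struct frag_bh {
        struct page    *b_page;         /* the page holding this fraction */
        struct frag_bh *b_next_frag;    /* next fraction of the same block,
                                         * NULL when blocksize <= 4k */
        /* ... the rest of the usual buffer_head state ... */
};

/* Walk every page making up one block, starting from any fraction's bh
 * (the page -> private -> next -> page chain sketched above). */
static void for_each_block_page(struct frag_bh *bh,
                                void (*visit)(struct page *page))
{
        for (; bh; bh = bh->b_next_frag)
                visit(bh->b_page);
}

->readpage and ->writepage would then call such a walker to pull in or push
out all sibling pages of the block.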
Best regards,
Maxim Levitsky
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 17:42 ` Jens Axboe
@ 2007-04-26 18:59 ` Eric W. Biederman
0 siblings, 0 replies; 235+ messages in thread
From: Eric W. Biederman @ 2007-04-26 18:59 UTC (permalink / raw)
To: Jens Axboe
Cc: Nick Piggin, David Chinner, Christoph Lameter, linux-kernel,
Mel Gorman, William Lee Irwin III, Badari Pulavarty,
Maxim Levitsky
Jens Axboe <jens.axboe@oracle.com> writes:
> On Fri, Apr 27 2007, Nick Piggin wrote:
>> Why do we limit drivers to 128 sg entries?
>
> No particular reason, except than to avoid 2^bigger order allocations.
> 2MiB requests would require 3 contig pages to setup the sg list, which
> is (probably) a little troublesome especially since it's sometimes
> atomically allocated.
>
> Larger pages are by no means a prerequisite to getting larger requests,
> assuming your hardware can handle the bigger sglist. There are other
> ways of doing that, I've contemplated doing chained sglists and adding
> sg_for_each_segment() macros for iterating these things. Drivers that
> want larger sglists would then be required to update their sg mapping
> loop to use the provided macros. It wouldn't be too hard.
Thanks. At least in the long term supporting larger scatter gather lists
in the kernel looks like something we need to do assuming bandwidth
goes up exponentially but I/O device latency remains high.
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:45 ` Eric W. Biederman
@ 2007-04-26 18:59 ` Christoph Lameter
2007-04-26 19:21 ` Eric W. Biederman
0 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-26 18:59 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Chinner, Nick Piggin, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Thu, 26 Apr 2007, Eric W. Biederman wrote:
> If you can make small things go fast, everything speeds up.
> If you can only make big things go fast only some things speed up.
>
> I want to make small things go fast so everything speeds up.
A reductionist operating system theory? Have you ever heard of emergent
properties? A system comprised of pieces can as a whole exibit other
characteristics than the basic elements would provide.
I think it is wrong to use the same small things for big things. You
would not tow a tanker with your bicycle. Similarly you could use small
pages for text files but huge page sizes when you need to transfer
gigabytes or terabytes of memory. Its wrong to dictate ones size fits
all. (Memories of my relatives in the eastern bloc surface but I better shut up.)
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:59 ` Christoph Lameter
@ 2007-04-26 19:21 ` Eric W. Biederman
0 siblings, 0 replies; 235+ messages in thread
From: Eric W. Biederman @ 2007-04-26 19:21 UTC (permalink / raw)
To: Christoph Lameter
Cc: David Chinner, Nick Piggin, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
Christoph Lameter <clameter@sgi.com> writes:
> On Thu, 26 Apr 2007, Eric W. Biederman wrote:
>
>> If you can make small things go fast, everything speeds up.
>> If you can only make big things go fast only some things speed up.
>>
>> I want to make small things go fast so everything speeds up.
>
> A reductionist operating system theory? Have you ever heard of emergent
> properties? A system comprised of pieces can as a whole exibit other
> characteristics than the basic elements would provide.
>
> I think it is wrong to use the same small things for big things. You
> would not tow a tanker with your bicycle. Similarly you could use small
> pages for text files but huge page sizes when you need to transfer
> gigabytes or terabytes of memory. Its wrong to dictate ones size fits
> all. (Memories of my relatives in the eastern bloc surface but I better shut
> up.)
Think of it like designing a cpu. Which would you rather have a
faster clock rate or a new instruction say floating point multiply-accumulate
that means you theoretically double your floating point computation speed
at a cost of reducing your clock rate?
Up to the limit of it being possible you get more bang out of improving
the little things. CPUs unfortunately pretty much hit the limit of
improving clock rates.
I don't think we are at the limit of making small things like syscalls,
and pages go fast.
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:39 ` Jens Axboe
@ 2007-04-26 19:35 ` Eric W. Biederman
2007-04-26 19:42 ` Jens Axboe
2007-04-26 20:22 ` Mel Gorman
1 sibling, 1 reply; 235+ messages in thread
From: Eric W. Biederman @ 2007-04-26 19:35 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Lameter, Christoph Hellwig, Nick Piggin, David Chinner,
linux-kernel, Mel Gorman, William Lee Irwin III,
Badari Pulavarty, Maxim Levitsky
Jens Axboe <jens.axboe@oracle.com> writes:
> On Thu, Apr 26 2007, Christoph Lameter wrote:
>> On Thu, 26 Apr 2007, Jens Axboe wrote:
>>
>> > On Thu, Apr 26 2007, Christoph Lameter wrote:
>> > > On Thu, 26 Apr 2007, Jens Axboe wrote:
>> > >
>> > > > The above can be implemented fairly cleanly, and on a need-to-have
>> > > > basis. It's not something that'll break drivers.
>> > >
>> > > But its also not going to fix the hacks that we have in the kernel
>> > > to deal with > PAGE_SIZE i/o.
>> >
>> > No, but that's a _seperate_ issue! Don't keep mixing up the two.
>>
>> Yes I understand that you want it to be a separate issue so we get get
>> more rationales for the hacks that we do to avoid the large
>> order allocations.
>
> Christoph, don't take your frustrations out on me. I've several times in
> this thread said that I'd LIKE to have > PAGE_SIZE support in the page
> cache. I WROTE the initial pktcdvd driver that is a primary example of
> these hacks, I'm very well aware of the pain and bugs involved with
> that.
>
> But don't push large pages as the only solution to larger ios, because
> that is trivially not true.
Just taking a quick look at pktcdvd that does appear to be a case where
we want to express to the rest of the system that our block device has
a > 2KB block size.
I think it would be very useful to express to the rest of the system
that yes indeed this device has a 32K sector size (or whatever the
real limit is) instead of hiding that fact away.
Now I'm not a block layer expert and my knowledge of the page cache
is rusty. So I can't immediately point to a way we can do this.
I'll guess that it will require cleaning up some old crufty code that
no one wants to touch.
I will point out that while this appears to require support from
upper levels it also does not require a larger physical page size
in the page cache.
Am I correct in assuming that the problem is primarily about getting
filesystems (and other upper layers) to submit BIOs that take into
consideration the larger block size of the underlying device, so
that read/modify write is not needed in the pktcdvd layer?
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 19:35 ` Eric W. Biederman
@ 2007-04-26 19:42 ` Jens Axboe
2007-04-27 4:05 ` Eric W. Biederman
0 siblings, 1 reply; 235+ messages in thread
From: Jens Axboe @ 2007-04-26 19:42 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Christoph Lameter, Christoph Hellwig, Nick Piggin, David Chinner,
linux-kernel, Mel Gorman, William Lee Irwin III,
Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26 2007, Eric W. Biederman wrote:
> Jens Axboe <jens.axboe@oracle.com> writes:
>
> > On Thu, Apr 26 2007, Christoph Lameter wrote:
> >> On Thu, 26 Apr 2007, Jens Axboe wrote:
> >>
> >> > On Thu, Apr 26 2007, Christoph Lameter wrote:
> >> > > On Thu, 26 Apr 2007, Jens Axboe wrote:
> >> > >
> >> > > > The above can be implemented fairly cleanly, and on a need-to-have
> >> > > > basis. It's not something that'll break drivers.
> >> > >
> >> > > But its also not going to fix the hacks that we have in the kernel
> >> > > to deal with > PAGE_SIZE i/o.
> >> >
> >> > No, but that's a _seperate_ issue! Don't keep mixing up the two.
> >>
> >> Yes I understand that you want it to be a separate issue so we get get
> >> more rationales for the hacks that we do to avoid the large
> >> order allocations.
> >
> > Christoph, don't take your frustrations out on me. I've several times in
> > this thread said that I'd LIKE to have > PAGE_SIZE support in the page
> > cache. I WROTE the initial pktcdvd driver that is a primary example of
> > these hacks, I'm very well aware of the pain and bugs involved with
> > that.
> >
> > But don't push large pages as the only solution to larger ios, because
> > that is trivially not true.
>
> Just taking a quick look at pktcdvd that does appear to be a case where
> we want to express to the rest of the system that our block device has
> a > 2KB block size.
>
> I think it would be very useful to express to the rest of the system
> that yes indeed this device has a 32K sector size (or whatever the
> real limit is) instead of hiding that fact away.
>
> Now I'm not a block layer expert and my knowledge of the page cache
> is rusty. So I can't immediately point to a way we can do this.
> I'll guess that it will require cleaning up some old crufty code that
> no one wants to touch.
>
> I will point out that while this appears to require support from
> upper levels it also does not require a larger physical page size
> in the page cache.
Yep, if you could just have > PAGE_CACHE_SIZE blocks in the filesystem
easily, the problem would basically be solved for cd and dvd packet
writing.
> Am I correct in assuming that the problem is primarily about getting
> filesystems (and other upper layers) to submit BIOs that take into
> consideration the larger block size of the underlying device, so
> that read/modify write is not needed in the pktcdvd layer?
Yes, that is exactly the problem. Once you have that, pktcdvd is pretty
much reduced to setup and init code, the actual data handling can be
done by sr or ide-cd directly. You could merge it into cdrom.c, it would
not be very different from mt-rainier handling (which basically does RMW
in firmware, so it works for any write, but performance is of course
horrible if you don't do it right).
--
Jens Axboe
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:39 ` Jens Axboe
2007-04-26 19:35 ` Eric W. Biederman
@ 2007-04-26 20:22 ` Mel Gorman
2007-04-27 0:21 ` William Lee Irwin III
2007-04-27 5:16 ` Jens Axboe
1 sibling, 2 replies; 235+ messages in thread
From: Mel Gorman @ 2007-04-26 20:22 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Lameter, Christoph Hellwig, Eric W. Biederman,
Nick Piggin, David Chinner, linux-kernel, William Lee Irwin III,
Badari Pulavarty, Maxim Levitsky
On (26/04/07 20:39), Jens Axboe didst pronounce:
> On Thu, Apr 26 2007, Christoph Lameter wrote:
> > On Thu, 26 Apr 2007, Jens Axboe wrote:
> >
> > > On Thu, Apr 26 2007, Christoph Lameter wrote:
> > > > On Thu, 26 Apr 2007, Jens Axboe wrote:
> > > >
> > > > > The above can be implemented fairly cleanly, and on a need-to-have
> > > > > basis. It's not something that'll break drivers.
> > > >
> > > > But its also not going to fix the hacks that we have in the kernel
> > > > to deal with > PAGE_SIZE i/o.
> > >
> > > No, but that's a _seperate_ issue! Don't keep mixing up the two.
> >
> > Yes I understand that you want it to be a separate issue so we get get
> > more rationales for the hacks that we do to avoid the large
> > order allocations.
>
> Christoph, don't take your frustrations out on me. I've several times in
> this thread said that I'd LIKE to have > PAGE_SIZE support in the page
> cache. I WROTE the initial pktcdvd driver that is a primary example of
> these hacks, I'm very well aware of the pain and bugs involved with
> that.
>
> But don't push large pages as the only solution to larger ios, because
> that is trivially not true.
>
Would it be fair to say that your approach and using large pages are not
mutually exclusive solutions? It seems a lot of the debate here is
assuming there is One And Only One Solution for larger ios.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:09 ` Eric W. Biederman
@ 2007-04-26 23:34 ` William Lee Irwin III
0 siblings, 0 replies; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-26 23:34 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Christoph Lameter, Nick Piggin, linux-kernel, Mel Gorman,
David Chinner, Jens Axboe, Badari Pulavarty, Maxim Levitsky
>> It's possible to divorce PAGE_SIZE from the binary formats, though I
>> found it difficult to keep up with the update treadmill.
On Thu, Apr 26, 2007 at 12:09:24PM -0600, Eric W. Biederman wrote:
> On x86_64 the sizes is actually 64K for executable binaries if I
> recall correctly. It certainly is not PAGE_SIZE, so we have some
> flexibility there.
Not so fast. The x86-64 ABI may allow any power of two size between
4KB and 64KB, but the emulated i386 ABI does not.
William Lee Irwin III <wli@holomorphy.com> writes:
>> Maybe it's
>> like hch says and I just needed to find more and better API cleanups.
>> I've only not tried to resurrect it because it's too much for me to do
>> on my own. I essentially collapsed under the weight of it and my 2.5.x
>> codebase ended up worse than Katrina as a disaster, which I don't want
>> to repeat and think collaborators or a different project lead from
>> myself are needed to avoid that happening again.
On Thu, Apr 26, 2007 at 12:09:24PM -0600, Eric W. Biederman wrote:
> But we still have some issues with mmap. But since we could increase
> PAGE_SIZE on x86_64 and not have to even worry about sub PAGE_SIZE
> mmaps. It is being suggested that if people really need larger
> physical pages that they just fix PAGE_SIZE. Then everything just
> works.
There's a little more to it than that, plus i386 ABI emulation.
On Thu, Apr 26, 2007 at 12:09:24PM -0600, Eric W. Biederman wrote:
> Thinking about it changing PAGE_SIZE on x86_64 should be about as
> hard as doing the 3-level vs 2-level page table format. We say
> we have a different page table format that uses a larger PAGE_SIZE.
> All arch code, all code in paths that we expect to change.
It does not resemble 3-level vs. 2-level pagetable format code. It
can be done entirely in arch code if you don't care to emulate the
i386 ABI.
On Thu, Apr 26, 2007 at 12:09:24PM -0600, Eric W. Biederman wrote:
> Boom all done.
> It might be worth implementing just so people can play with different
> PAGE_SIZE values for benchmarking.
> I don't think the larger physical page size is really the issue here
> though.
I'll clean up what I have and post it at some point.
William Lee Irwin III <wli@holomorphy.com> writes:
>> Anyway, if that's being kicked around as an alternative, it could be
>> said that I have some insight into the issues surrounding it.
On Thu, Apr 26, 2007 at 12:09:24PM -0600, Eric W. Biederman wrote:
> Partially but also partially they are very much suggesting going down
> the same path. Currently mmap doesn't work with order >0 pages because
> they are not yet addressing these issues at all.
> This looks like a more flexible version of the old PAGE_CACHE_SIZE >
> PAGE_SIZE code. Which makes me seriously question the whole idea.
mmap() should be emulatable with the base page size without anything
interesting happening. Of course, there are opportunities for more, if
anyone cares to take them.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 13:50 ` David Chinner
2007-04-26 14:40 ` William Lee Irwin III
2007-04-26 15:38 ` Nick Piggin
@ 2007-04-27 0:19 ` Jeremy Higdon
2 siblings, 0 replies; 235+ messages in thread
From: Jeremy Higdon @ 2007-04-27 0:19 UTC (permalink / raw)
To: David Chinner
Cc: Eric W. Biederman, Nick Piggin, clameter, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Thu, Apr 26, 2007 at 11:50:33PM +1000, David Chinner wrote:
> On Thu, Apr 26, 2007 at 04:10:32AM -0600, Eric W. Biederman wrote:
> > > And then there's the problem that most hardware is limited to 128
> > > s/g entries and that means 128 non-contiguous pages in memory is the
> > > maximum I/O size we can issue to these devices. We have RAID arrays
> > > that go twice as fast if we can send them 1MB I/Os instead of 512k
> > > I/Os and that means we need contiguous pages to be handled to the
> > > devices....
> >
> > Ok. Now why are high end hardware manufacturers building crippled
> > hardware? Or is there only an 8bit field in SCSI for describing
> > scatter gather entries? Although I would think this would be
> > more of a controller rather than a drive issue.
>
> scsi.h:
>
> /*
> * The maximum sg list length SCSI can cope with
> * (currently must be a power of 2 between 32 and 256)
> */
> #define SCSI_MAX_PHYS_SEGMENTS MAX_PHYS_SEGMENTS
>
> And from blkdev.h:
>
> #define MAX_PHYS_SEGMENTS 128
> #define MAX_HW_SEGMENTS 128
>
> So currently on SCSI we are limited to 128 s/g entries, and the
> maximum is 256. So I'd say we've got good grounds for needing
> contiguous pages to go beyond 1MB I/O size on x86_64.
Right, and there are also RAID devices that really want a 2 MiB I/O
size. Even if we could use 512 s/g entries (which would take two
pages), the other big problem is that many I/O chips/cards are limited
in the amount of space they have for s/g lists. So, you'd face the
possibility that you could do a 2MiB I/O request with 512 s/g entries,
but then you couldn't start a second request on that host until the
first one finished.
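For reference, the arithmetic behind those limits, assuming 4KiB pages and
(worst case) one non-contiguous page per s/g entry:

        128 entries * 4 KiB = 512 KiB max I/O
        256 entries * 4 KiB = 1 MiB max I/O
        2 MiB / 4 KiB       = 512 entries needed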
jeremy
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 20:22 ` Mel Gorman
@ 2007-04-27 0:21 ` William Lee Irwin III
2007-04-27 5:16 ` Jens Axboe
1 sibling, 0 replies; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-27 0:21 UTC (permalink / raw)
To: Mel Gorman
Cc: Jens Axboe, Christoph Lameter, Christoph Hellwig,
Eric W. Biederman, Nick Piggin, David Chinner, linux-kernel,
Badari Pulavarty, Maxim Levitsky
On (26/04/07 20:39), Jens Axboe didst pronounce:
>> But don't push large pages as the only solution to larger ios, because
>> that is trivially not true.
On Thu, Apr 26, 2007 at 09:22:02PM +0100, Mel Gorman wrote:
> Would it be fair to say that your approach and using large pages are not
> mutually exclusive solutions? It seems a lot of the debate here is
> assuming there is One And Only One Solution for larger ios.
I'd like to see how all the strategies mentioned thus far do in
conjunction.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:21 ` Eric W. Biederman
@ 2007-04-27 0:32 ` William Lee Irwin III
2007-04-27 10:22 ` Nick Piggin
0 siblings, 1 reply; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-27 0:32 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Nick Piggin, Christoph Lameter, linux-kernel, Mel Gorman,
David Chinner, Jens Axboe, Badari Pulavarty, Maxim Levitsky
William Lee Irwin III <wli@holomorphy.com> writes:
>> In memory as on disk, contiguity matters a lot for performance.
On Thu, Apr 26, 2007 at 12:21:24PM -0600, Eric W. Biederman wrote:
> Not nearly so much though. In memory you don't have seeks to avoid.
> On disks avoiding seeks is everything.
I readily concede that seeks are most costly. Yet memory contiguity
remains rather influential.
Witness the fact that I'm now being called upon a second time to
adjust the order in which mm/page_alloc.c returns pages for the
sake of implicitly establishing IO contiguity (or otherwise
determining why things are coming out backward now).
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (24 preceding siblings ...)
2007-04-26 18:50 ` Maxim Levitsky
@ 2007-04-27 2:04 ` Andrew Morton
2007-04-27 2:27 ` David Chinner
2007-04-28 16:39 ` Maxim Levitsky
26 siblings, 1 reply; 235+ messages in thread
From: Andrew Morton @ 2007-04-27 2:04 UTC (permalink / raw)
To: clameter
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Tue, 24 Apr 2007 15:21:05 -0700 clameter@sgi.com wrote:
> This patchset modifies the Linux kernel so that larger block sizes than
> page size can be supported. Larger block sizes are handled by using
> compound pages of an arbitrary order for the page cache instead of
> single pages with order 0.
Something I was looking for but couldn't find: suppose an application takes
a pagefault against the third 4k page of an order-2 pagecache "page". We
need to instantiate a pte against find_get_page(offset/4)+3. But these
patches don't touch mm/memory.c at all and filemap_nopage() appears to
return the zeroeth 4k page all the time in that case.
So.. what am I missing, and how does that part work?
Also, afaict your important requirements would be met by retaining
PAGE_CACHE_SIZE=4k and simply ensuring that pagecache is populated by
physically contiguous pages - so instead of allocating and adding one 4k
page, we allocate an order-2 page and sprinkle all four page*'s into the
radix tree in one hit. That should be fairly straightforward to do, and
could be made indistinguishably fast from doing a single 16k page for some
common pagecache operations (gang-insert, gang-lookup).
The BIO and block layers will do-the-right-thing with that pagecache and
you end up with four times more data in the SG lists, worst-case.
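A rough sketch of that scheme using page cache helpers that already exist in
this era (the function itself is invented, the index is assumed to be order-2
aligned, and error unwinding is omitted):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Fill four consecutive 4k pagecache slots from one physically
 * contiguous order-2 allocation, keeping PAGE_CACHE_SIZE == 4k.
 */
static int add_contig_pages(struct address_space *mapping,
                            pgoff_t index, gfp_t gfp)
{
        struct page *page = alloc_pages(gfp, 2);  /* 16k, physically contiguous */
        int i, err;

        if (!page)
                return -ENOMEM;

        split_page(page, 2);    /* give each 4k page its own reference count */

        for (i = 0; i < 4; i++) {
                err = add_to_page_cache_lru(page + i, mapping, index + i, gfp);
                if (err)
                        return err;     /* real code would unwind the earlier pages */
        }
        return 0;
}

A true gang-insert (one radix tree preload and four stores under a single
tree_lock) would presumably be the optimisation needed to make this as cheap
as inserting one 16k page.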
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 2:04 ` Andrew Morton
@ 2007-04-27 2:27 ` David Chinner
2007-04-27 2:53 ` Andrew Morton
0 siblings, 1 reply; 235+ messages in thread
From: David Chinner @ 2007-04-27 2:27 UTC (permalink / raw)
To: Andrew Morton
Cc: clameter, linux-kernel, Mel Gorman, William Lee Irwin III,
David Chinner, Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26, 2007 at 07:04:38PM -0700, Andrew Morton wrote:
> On Tue, 24 Apr 2007 15:21:05 -0700 clameter@sgi.com wrote:
>
> > This patchset modifies the Linux kernel so that larger block sizes than
> > page size can be supported. Larger block sizes are handled by using
> > compound pages of an arbitrary order for the page cache instead of
> > single pages with order 0.
>
> Something I was looking for but couldn't find: suppose an application takes
> a pagefault against the third 4k page of an order-2 pagecache "page". We
> need to instantiate a pte against find_get_page(offset/4)+3. But these
> patches don't touch mm/memory.c at all and filemap_nopage() appears to
> return the zeroeth 4k page all the time in that case.
>
> So.. what am I missing, and how does that part work?
"mmap not supported yet" ;)
> Also, afaict your important requirements would be met by retaining
> PAGE_CACHE_SIZE=4k and simply ensuring that pagecache is populated by
> physically contiguous pages
Sure, that addresses the larger I/O side of things, but it doesn't address
the large filesystem blocksize issues that can only be solved with some kind
of page aggregation abstraction. Compound pages and high order page cache
indexing solves this extremely neatly, regardless of whether the compound
page is contiguous or not.....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 2:27 ` David Chinner
@ 2007-04-27 2:53 ` Andrew Morton
2007-04-27 3:47 ` [00/17] Large Blocksize Support V3 (mmap conceptual discussion) Christoph Lameter
2007-04-27 4:20 ` [00/17] Large Blocksize Support V3 David Chinner
0 siblings, 2 replies; 235+ messages in thread
From: Andrew Morton @ 2007-04-27 2:53 UTC (permalink / raw)
To: David Chinner
Cc: clameter, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Fri, 27 Apr 2007 12:27:31 +1000 David Chinner <dgc@sgi.com> wrote:
> On Thu, Apr 26, 2007 at 07:04:38PM -0700, Andrew Morton wrote:
> > On Tue, 24 Apr 2007 15:21:05 -0700 clameter@sgi.com wrote:
> >
> > > This patchset modifies the Linux kernel so that larger block sizes than
> > > page size can be supported. Larger block sizes are handled by using
> > > compound pages of an arbitrary order for the page cache instead of
> > > single pages with order 0.
> >
> > Something I was looking for but couldn't find: suppose an application takes
> > a pagefault against the third 4k page of an order-2 pagecache "page". We
> > need to instantiate a pte against find_get_page(offset/4)+3. But these
> > patches don't touch mm/memory.c at all and filemap_nopage() appears to
> > return the zeroeth 4k page all the time in that case.
> >
> > So.. what am I missing, and how does that part work?
>
> "mmap not supported yet" ;)
erk. I suspect this will have its sticky paws all over core mm.
> > Also, afaict your important requirements would be met by retaining
> > PAGE_CACHE_SIZE=4k and simply ensuring that pagecache is populated by
> > physically contiguous pages
>
> Sure, that addresses the larger I/O side of things, but it doesn't address
> the large filesystem blocksize issues that can only be solved with some kind
> of page aggregation abstraction.
a) That wasn't a part of Christoph's original rationale list, so forgive
me for thinking it is not so important and got snuck in post-facto when
things got tough.
b) I don't immediately see why a filesystem cannot implement larger
blocksizes via this scheme - instantiate and lock four pages and go for
it.
> Compound pages and high order page cache
> indexing solves this extremely neatly, regardless of whether the compound
> page is contiguous or not.....
We cannot say anything about neatness until we've seen mmap.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3 (mmap conceptual discussion)
2007-04-27 2:53 ` Andrew Morton
@ 2007-04-27 3:47 ` Christoph Lameter
2007-04-27 4:20 ` [00/17] Large Blocksize Support V3 David Chinner
1 sibling, 0 replies; 235+ messages in thread
From: Christoph Lameter @ 2007-04-27 3:47 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Thu, 26 Apr 2007, Andrew Morton wrote:
> > Sure, that addresses the larger I/O side of things, but it doesn't address
> > the large filesystem blocksize issues that can only be solved with some kind
> > of page aggregation abstraction.
>
> a) That wasn't a part of Christoph's original rationale list, so forgive
> me for thinking it is not so important and got snuck in post-facto when
> things got tough.
It was definitely part of my thinking. I never thought anyone could do it
differently so I did not emphasize it.
> b) I don't immediately see why a filesystam cannot implement larger
> blocksizes via this scheme - instantiate and lock four pages and go for
> it.
>
> > Compound pages and high order page cache
> > indexing solves this extremely neatly, regardless of whether the compound
> > page is contiguous or not.....
>
> We cannot say anything about neatness until we've seen mmap.
Rough Draft was posted at
http://marc.info/?l=linux-kernel&m=117709695522443&w=2
http://marc.info/?l=linux-kernel&m=117709215016822&w=2
http://marc.info/?l=linux-kernel&m=117709238129124&w=2
Basically 4k mmap semantics are preserved. One can mmap any 4k section of
a compound page. State information is kept in the head page. So we have two
page struct pointers to juggle:
1. The one pointing to the page for address calculations COW etc.
2. The one pointing to the head page for state information.
For each 4k pointer from a process to a compound page we would
have to take a refcount.
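As a sketch of that bookkeeping, following the find_get_page(offset/4)+3
framing from earlier in the thread (the helper name and the assumption that
the radix tree is indexed in compound-page units are mine, not code from the
draft above; pgoff is in 4k units):

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Resolve a 4k fault offset within an order-@order compound pagecache
 * page.  The pointer returned feeds address calculation and COW; the
 * lock, flags and the reference taken here live in the head page.
 */
static struct page *fault_subpage(struct address_space *mapping,
                                  pgoff_t pgoff, unsigned int order)
{
        struct page *head = find_get_page(mapping, pgoff >> order);

        if (!head)
                return NULL;    /* caller would instantiate the compound page */

        return head + (pgoff & ((1UL << order) - 1));
}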
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 19:42 ` Jens Axboe
@ 2007-04-27 4:05 ` Eric W. Biederman
2007-04-27 10:26 ` Nick Piggin
0 siblings, 1 reply; 235+ messages in thread
From: Eric W. Biederman @ 2007-04-27 4:05 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Lameter, Christoph Hellwig, Nick Piggin, David Chinner,
linux-kernel, Mel Gorman, William Lee Irwin III,
Badari Pulavarty, Maxim Levitsky
Jens Axboe <jens.axboe@oracle.com> writes:
> On Thu, Apr 26 2007, Eric W. Biederman wrote:
>
> Yep, if you could just have > PAGE_CACHE_SIZE blocks in the filesystem
> easily, the problem would basically be solved for cd and dvd packet
> writing.
Ok. I'm not in a position to do this work. But I will keep it in
mind and look at it.
>> Am I correct in assuming that the problem is primarily about getting
>> filesystems (and other upper layers) to submit BIOs that take into
>> consideration the larger block size of the underlying device, so
>> that read/modify write is not needed in the pktcdvd layer?
>
> Yes, that is exactly the problem. Once you have that, pktcdvd is pretty
> much reduced to setup and init code, the actual data handling can be
> done by sr or ide-cd directly. You could merge it into cdrom.c, it would
> not be very different from mt-rainier handling (which basically does RMW
> in firmware, so it works for any write, but performance is of course
> horrible if you don't do it right).
Thanks for the clarification.
So we do have a clear problem that we do not have generic support for
large sector sizes residing in the page cache.
There is one place where this has a direct effect: fs/block_dev.c.
We have an indirect effect in the filesystems because there are a few
bits of generic support missing and there is no Linux convention
on how to handle this case.
I expect if we can enhance fs/block_dev.c to handle this case the
other parts will fall out naturally.
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 2:53 ` Andrew Morton
2007-04-27 3:47 ` [00/17] Large Blocksize Support V3 (mmap conceptual discussion) Christoph Lameter
@ 2007-04-27 4:20 ` David Chinner
2007-04-27 5:15 ` Andrew Morton
1 sibling, 1 reply; 235+ messages in thread
From: David Chinner @ 2007-04-27 4:20 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Thu, Apr 26, 2007 at 07:53:57PM -0700, Andrew Morton wrote:
> On Fri, 27 Apr 2007 12:27:31 +1000 David Chinner <dgc@sgi.com> wrote:
> > On Thu, Apr 26, 2007 at 07:04:38PM -0700, Andrew Morton wrote:
> > > On Tue, 24 Apr 2007 15:21:05 -0700 clameter@sgi.com wrote:
> > > Also, afaict your important requirements would be met by retaining
> > > PAGE_CACHE_SIZE=4k and simply ensuring that pagecache is populated by
> > > physically contiguous pages
> >
> > Sure, that addresses the larger I/O side of things, but it doesn't address
> > the large filesystem blocksize issues that can only be solved with some kind
> > of page aggregation abstraction.
>
> a) That wasn't a part of Christoph's original rationale list, so forgive
> me for thinking it is not so important and got snuck in post-facto when
> things got tough.
I've been pushing christoph to do something like this for more than a year
purely so we can support large block sizes in XFS. He's got other reasons
for wanting to do this, but that doesn't mean that the large filesystem
blocksize issue is any less important.
> blocksizes via this scheme - instantiate and lock four pages and go for
> it.
So now how do you get block aligned writeback? Or make sure that truncate
doesn't race on a partial *block* truncate? You basically have to
jump through nasty, nasty hoops, to handle corner cases that are introduced
because the generic code can no longer reliably lock out access to a
filesystem block.
Eventually you end up with something like fs/xfs/linux-2.6/xfs_buf.c and
doing everything inside the filesystem because it's the only sane
way to serialise access to these aggregated structures. This is
the way XFS used to work in its data path, and we all know how long
and loud people complained about that.....
A filesystem specific aggregation mechanism is not a palatable solution
here because it drives filesystems away from being able to use generic
code.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 4:20 ` [00/17] Large Blocksize Support V3 David Chinner
@ 2007-04-27 5:15 ` Andrew Morton
2007-04-27 5:49 ` Christoph Lameter
` (2 more replies)
0 siblings, 3 replies; 235+ messages in thread
From: Andrew Morton @ 2007-04-27 5:15 UTC (permalink / raw)
To: David Chinner
Cc: clameter, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Fri, 27 Apr 2007 14:20:46 +1000 David Chinner <dgc@sgi.com> wrote:
> > blocksizes via this scheme - instantiate and lock four pages and go for
> > it.
>
> So now how do you get block aligned writeback?
in writeback and pageout:
if (page->index & mapping->block_size_mask)
continue;
> Or make sure that truncate
> doesn't race on a partial *block* truncate?
lock four pages
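Sketched out slightly (the helper is invented, pages_per_block is assumed to
be a power of two, and missing pages plus error handling are ignored):

#include <linux/pagemap.h>

/*
 * Lock every 4k page of the filesystem block containing @index, in
 * ascending order so lock ordering stays sane.  Writeback would then
 * only act on pages whose index is block aligned, as above.
 */
static void lock_block_pages(struct address_space *mapping, pgoff_t index,
                             unsigned int pages_per_block, struct page **pages)
{
        pgoff_t start = index & ~((pgoff_t)pages_per_block - 1);
        unsigned int i;

        for (i = 0; i < pages_per_block; i++)
                pages[i] = find_lock_page(mapping, start + i);
}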
> You basically have to
> jump through nasty, nasty hoops, to handle corner cases that are introduced
> because the generic code can no longer reliably lock out access to a
> filesystem block.
>
> Eventually you end up with something like fs/xfs/linux-2.6/xfs_buf.c and
> doing everything inside the filesystem because it's the only way sane
> way to serialise access to these aggregated structures. This is
> the way XFS used to work in it's data path, and we all know how long
> and loud people complained about that.....
>
> A filesystem specific aggregation mechanism is not a palatable solution
> here because it drives filesystems away from being able to use generic
> code.
I would expect we could (should) implement this in generic code by
modifying the existing stuff.
I'm not saying it's especially simple, nor fast. But it has the advantage
that we're not forced to use larger pages with _its_ attendant performance
problems.
And it will benefit all filesystems immediately.
And it doesn't introduce a rather nasty hack of pretending (in some places)
that pages are larger than they really are.
And it has the very significant advantage that it doesn't introduce brand
new concepts and some complexity into core MM.
And make no mistake: the latter disadvantage is huge. Because if we do the
PAGE_CACHE_SIZE hack (sorry, but it _is_), we have to do it *for ever*.
Maintaining and enhancing core MM and VFS becomes harder and more costly
and slower and more buggy *for ever*. The ramp for people to become
competent on core MM becomes longer. Our developer pool becomes smaller, and
proportionally less skilled.
And hardware gets better. If Intel & AMD come out with a 16k pagesize
option in a couple of years we'll look pretty dumb. If the problems which
you're presently having with that controller get sorted out in the next
generation of the hardware, we'll also look pretty dumb.
As always, there are tradeoffs. We can see the cons, and they are very
significant. We don't yet know the pros. Perhaps they will be similarly
significant. But I don't believe that the larger PAGE_CACHE_SIZE hack
(sorry) is the only way in which they can be realised.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 20:22 ` Mel Gorman
2007-04-27 0:21 ` William Lee Irwin III
@ 2007-04-27 5:16 ` Jens Axboe
1 sibling, 0 replies; 235+ messages in thread
From: Jens Axboe @ 2007-04-27 5:16 UTC (permalink / raw)
To: Mel Gorman
Cc: Christoph Lameter, Christoph Hellwig, Eric W. Biederman,
Nick Piggin, David Chinner, linux-kernel, William Lee Irwin III,
Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26 2007, Mel Gorman wrote:
> On (26/04/07 20:39), Jens Axboe didst pronounce:
> > On Thu, Apr 26 2007, Christoph Lameter wrote:
> > > On Thu, 26 Apr 2007, Jens Axboe wrote:
> > >
> > > > On Thu, Apr 26 2007, Christoph Lameter wrote:
> > > > > On Thu, 26 Apr 2007, Jens Axboe wrote:
> > > > >
> > > > > > The above can be implemented fairly cleanly, and on a need-to-have
> > > > > > basis. It's not something that'll break drivers.
> > > > >
> > > > > But its also not going to fix the hacks that we have in the kernel
> > > > > to deal with > PAGE_SIZE i/o.
> > > >
> > > > No, but that's a _seperate_ issue! Don't keep mixing up the two.
> > >
> > > Yes I understand that you want it to be a separate issue so we get get
> > > more rationales for the hacks that we do to avoid the large
> > > order allocations.
> >
> > Christoph, don't take your frustrations out on me. I've several times in
> > this thread said that I'd LIKE to have > PAGE_SIZE support in the page
> > cache. I WROTE the initial pktcdvd driver that is a primary example of
> > these hacks, I'm very well aware of the pain and bugs involved with
> > that.
> >
> > But don't push large pages as the only solution to larger ios, because
> > that is trivially not true.
> >
>
> Would it be fair to say that your approach and using large pages are not
> mutually exclusive solutions? It seems a lot of the debate here is
> assuming there is One And Only One Solution for larger ios.
Definitely, there's zero reason they cannot coexist.
--
Jens Axboe
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 5:15 ` Andrew Morton
@ 2007-04-27 5:49 ` Christoph Lameter
2007-04-27 6:55 ` Andrew Morton
2007-04-27 6:09 ` David Chinner
2007-04-27 16:55 ` Theodore Tso
2 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-27 5:49 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Thu, 26 Apr 2007, Andrew Morton wrote:
> > Or make sure that truncate
> > doesn't race on a partial *block* truncate?
>
> lock four pages
You would only lock a single higher order block. Truncate works on that
level.
If you have 4 separate pages then you need to take separate locks and you
may not have contiguous memory which makes the filesystem run through all
sorts of hoops.
> I'm not saying it's especially simple, nor fast. But it has the advantage
> that we're not forced to use larger pages with _its_ attendant performance
> problems.
The patch is not about forcing the use of large pages but about the option to
use larger pages. It's a new flexibility.
> And it doesn't introduce a rather nasty hack of pretending (in some places)
> that pages are larger than they really are.
They are really larger. One page struct controls it all.
> And it has the very significant advantage that it doesn't introduce brand
> new concepts and some complexity into core MM.
The patchset would reduce complexity and make it easier to handle the page
cache. It gets rid of the hacks used to support larger blocks right now. It's
straightforward, with no new locking; very much a cleanup patch.
> And make no mistake: the latter disadvantage is huge. Because if we do the
> PAGE_CACHE_SIZE hack (sorry, but it _is_), we have to do it *for ever*.
> Maintaining and enhancing core MM and VFS becomes harder and more costly
> and slower and more buggy *for ever*. The ramp for people to become
> competent on core MM becomes longer. Our developer pool becomes smaller, and
> proportionally less skilled.
No it becomes easier. Look at the patchset. It cleans up a huge mess.
What is hacky about it? It is consistently using larger pages for the page
cache and it integrates nicely into the VM.
> And hardware gets better. If Intel & AMD come out with a 16k pagesize
> option in a couple of years we'll look pretty dumb. If the problems which
> you're presently having with that controller get sorted out in the next
> generation of the hardware, we'll also look pretty dumb.
We are currently looking dumb and unable to deal with the hardware. Yes
we can pressure the hardware vendors to produce hardware conforming to our
specifications but I always thought that was how another company operates.
> As always, there are tradeoffs. We can see the cons, and they are very
> significant. We don't yet know the pros. Perhaps they will be similarly
> significant. But I don't believe that the larger PAGE_CACHE_SIZE hack
> (sorry) is the only way in which they can be realised.
It is the most consistent solution that avoids the proliferation of further
hacks to address large blocksizes.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 5:15 ` Andrew Morton
2007-04-27 5:49 ` Christoph Lameter
@ 2007-04-27 6:09 ` David Chinner
2007-04-27 7:04 ` Andrew Morton
2007-04-27 16:55 ` Theodore Tso
2 siblings, 1 reply; 235+ messages in thread
From: David Chinner @ 2007-04-27 6:09 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Thu, Apr 26, 2007 at 10:15:28PM -0700, Andrew Morton wrote:
> On Fri, 27 Apr 2007 14:20:46 +1000 David Chinner <dgc@sgi.com> wrote:
>
> > > blocksizes via this scheme - instantiate and lock four pages and go for
> > > it.
> >
> > So now how do you get block aligned writeback?
>
> in writeback and pageout:
>
> if (page->index & mapping->block_size_mask)
> continue;
So we might do writeback on one page in N - how do we
make sure none of the other pages are reclaimed while we are doing
writeback on this block?
IOWs, we have to lock every page in the block, mark them all as
writeback, etc. Instead of doing something once, we have
to repeat it for every page in the block. This is better than a compound
page, how?
> > Or make sure that truncate
> > doesn't race on a partial *block* truncate?
>
> lock four pages
And the locking order? How do you enforce *kernel wide* the
same locking order for all pages in the same block so that we
don't get ABBA deadlocks on page locks within a block?
i.e:
> > You basically have to
> > jump through nasty, nasty hoops, to handle corner cases that are introduced
> > because the generic code can no longer reliably lock out access to a
> > filesystem block.
This way lies insanity.
> > way to serialise access to these aggregated structures. This is
> > the way XFS used to work in its data path, and we all know how long
> > and loud people complained about that.....
> >
> > A filesystem specific aggregation mechanism is not a palatable solution
> > here because it drives filesystems away from being able to use generic
> > code.
>
> I would expect we could (should) implement this in generic code by
> modifying the existing stuff.
So you're suggesting that we reintroduce a buffer-oriented filesystem
interface to support large block sizes?
> I'm not saying it's especially simple, nor fast. But it has the advantage
> that we're not forced to use larger pages with _its_ attendant performance
> problems.
So you'll take slow, inefficient and complex rather than use a
non-intrusive and /optional/ interface to large pages?
Words fail me......
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 5:49 ` Christoph Lameter
@ 2007-04-27 6:55 ` Andrew Morton
2007-04-27 7:19 ` Christoph Lameter
` (3 more replies)
0 siblings, 4 replies; 235+ messages in thread
From: Andrew Morton @ 2007-04-27 6:55 UTC (permalink / raw)
To: Christoph Lameter
Cc: David Chinner, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Thu, 26 Apr 2007 22:49:53 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> On Thu, 26 Apr 2007, Andrew Morton wrote:
>
> > > Or make sure that truncate
> > > doesn't race on a partial *block* truncate?
> >
> > lock four pages
>
> You would only lock a single higher order block. Truncate works on that
> level.
We all know that.
> If you have 4 separate pages then you need to take separate locks and you
> may not have contiguous memory which makes the filesystem run through all
> sorts of hoops.
This is completely incorrect.
Of *course* they're contiguous. That's the whole point.
It's not exactly hard to lock four pages which are contiguous in pagecache,
contiguous in physical memory and are contiguous in the radix-tree.
> > I'm not saying it's especially simple, nor fast. But it has the advantage
> > that we're not forced to use larger pages with _its_ attendant performance
> > problems.
>
> The patch is not about forcing the use of large pages but about the option to
> use larger pages. It's a new flexibility.
That's just spin.
> > And it doesn't introduce a rather nasty hack of pretending (in some places)
> > that pages are larger than they really are.
>
> They are really larger. One page struct controls it all.
No it doesn't and please stop spinning. x86 ptes map 4k pages and the core
MM needs changes to continue to work with this hack in place.
If x86 had larger pagesize we wouldn't be seeing any of this. It is a workaround
for present-generation hardware.
> > And it has the very significant advantage that it doesn't introduce brand
> > new concepts and some complexity into core MM.
>
> The patchset would reduce complexity and make it easier to handle the page
> cache. It gets rid of the hacks used to support larger blocks right now. It's
> straightforward, with no new locking; very much a cleanup patch.
Were any cleanups made which were not also applicable as standalone things
to mainline?
> > And make no mistake: the latter disadvantage is huge. Because if we do the
> > PAGE_CACHE_SIZE hack (sorry, but it _is_), we have to do it *for ever*.
> > Maintaining and enhancing core MM and VFS becomes harder and more costly
> > and slower and more buggy *for ever*. The ramp for people to become
> > competent on core MM becomes longer. Our developer pool becomes smaller, and
> > proportionally less skilled.
>
> No it becomes easier. Look at the patchset. It cleans up a huge mess.
I see no cleanups which are not also applicable to mainline.
> What is hacky about it?
It pretends that pages are larger than they actually are, forcing the
pte-management code to also play along with the pretence.
Pages *aren't* 16k. They're 4k.
> It is consistently using larger pages for the page
> cache and it integrates nicely into the VM.
> > And hardware gets better. If Intel & AMD come out with a 16k pagesize
> > option in a couple of years we'll look pretty dumb. If the problems which
> > you're presently having with that controller get sorted out in the next
> > generation of the hardware, we'll also look pretty dumb.
>
> We are currently looking dumb and unable to deal with the hardware. Yes
> we can pressure the hardware vendors to produce hardware conforming to our
> specifications but I always thought that was how another company operates.
That's spin as well.
Please address my point: if in five years time x86 has larger or variable
pagesize, this code will be a permanent millstone around our necks which we
*should not have merged*.
And if in five years time x86 does not have larger pagesize support then
the manufacturers would have decided that 4k pages are not a performance
problem, so we again should not have merged this code.
> > As always, there are tradeoffs. We can see the cons, and they are very
> > significant. We don't yet know the pros. Perhaps they will be similarly
> > significant. But I don't believe that the larger PAGE_CACHE_SIZE hack
> > (sorry) is the only way in which they can be realised.
>
> It is the most consistent solution that avoids the proliferation of further
> hacks to address large blocksizes.
You cannot say this. I'm sitting here *watching* you refuse to seriously
consider alternatives.
And you've conspicuously failed to address my point regarding the
*permanent* additional maintenance cost.
Anyway. Let's await those performance numbers. If they're notably good,
and if we judge that this goodness will be realised on more than one
arguably-crippled present-day disk adapter then we can evaluate the
*various* options which we have for stuffing more data into that adapter.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 6:09 ` David Chinner
@ 2007-04-27 7:04 ` Andrew Morton
2007-04-27 8:03 ` David Chinner
0 siblings, 1 reply; 235+ messages in thread
From: Andrew Morton @ 2007-04-27 7:04 UTC (permalink / raw)
To: David Chinner
Cc: clameter, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Fri, 27 Apr 2007 16:09:21 +1000 David Chinner <dgc@sgi.com> wrote:
> On Thu, Apr 26, 2007 at 10:15:28PM -0700, Andrew Morton wrote:
> > On Fri, 27 Apr 2007 14:20:46 +1000 David Chinner <dgc@sgi.com> wrote:
> >
> > > > blocksizes via this scheme - instantiate and lock four pages and go for
> > > > it.
> > >
> > > So now how do you get block aligned writeback?
> >
> > in writeback and pageout:
> >
> > if (page->index & mapping->block_size_mask)
> > continue;
>
> So we might do writeback on one page in N - how do we
> make sure none of the other pages are reclaimed while we are doing
> writeback on this block?
By marking them all dirty when one is marked dirty.
David, you're perfectly capable of working all this out yourself. But
you're trying not to. Please stop this game.
> IOWs, we have to lock every page in the block, mark them all as
> writeback, etc. Instead of doing something once, we have
> to repeat it for every page in the block. This is better than a compound
> page, how?
I already said it'd be a bit slower. But given that those four pageframes
will fall within two cachelines the cost will be small.
> > > Or make sure that truncate
> > > doesn't race on a partial *block* truncate?
> >
> > lock four pages
>
> And the locking order? How do you enforce *kernel wide* the
> same locking order for all pages in the same block so that we
> don't get ABBA deadlocks on page locks within a block?
Oh stop it. You know perfectly well how to do this. It is trivial.
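For what it's worth, a minimal sketch of the trivial rule in question: always
take the page locks of one block in ascending ->index order, so two tasks
contending for the same block can never deadlock ABBA. The block_order field
and the helper are made up for this example; find_lock_page() is the existing
pagecache primitive.

/*
 * Illustrative only: lock all pagecache pages of one filesystem block
 * in ascending index order.  Whoever gets the first page of the block
 * wins; everyone else sleeps on that lock, so no ABBA ordering problem
 * can arise.  'block_order' is a hypothetical per-mapping field
 * (log2 of pages per block).
 */
static int lock_block_pages(struct address_space *mapping,
                            pgoff_t block, struct page **pages)
{
        unsigned int i, nr = 1 << mapping->block_order;
        pgoff_t first = block << mapping->block_order;

        for (i = 0; i < nr; i++) {
                pages[i] = find_lock_page(mapping, first + i);
                if (!pages[i])
                        goto unwind;
        }
        return 0;
unwind:
        while (i--)
                unlock_page(pages[i]);
        return -ENOENT;
}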
> i.e:
>
> > > You basically have to
> > > jump through nasty, nasty hoops, to handle corner cases that are introduced
> > > because the generic code can no longer reliably lock out access to a
> > > filesystem block.
>
> This way lies insanity.
>
You're addressing Christoph's straw man here.
> > > way to serialise access to these aggregated structures. This is
> > > the way XFS used to work in its data path, and we all know how long
> > > and loud people complained about that.....
> > >
> > > A filesystem specific aggregation mechanism is not a palatable solution
> > > here because it drives filesystems away from being able to use generic
> > > code.
> >
> > I would expect we could (should) implement this in generic code by
> > modifying the existing stuff.
>
> So you're suggesting that we reintroduce a buffer-oriented filesystem
> interface to support large block sizes?
Nothing vaguely like it. Please be serious.
> > I'm not saying it's especially simple, nor fast. But it has the advantage
> > that we're not forced to use larger pages with _its_ attendant performance
> > problems.
>
> So you'll take slow, inefficient and complex rather than use a
> non-intrusive and /optional/ interface to large pages?
The optionality is useful - it at least means that we can easily remove it
all if/when it becomes obsolete.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 6:55 ` Andrew Morton
@ 2007-04-27 7:19 ` Christoph Lameter
2007-04-27 7:26 ` Andrew Morton
2007-04-27 7:22 ` Christoph Lameter
` (2 subsequent siblings)
3 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-27 7:19 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Thu, 26 Apr 2007, Andrew Morton wrote:
> It's not exactly hard to lock four pages which are contiguous in pagecache,
> contiguous in physical memory and are contiguous in the radix-tree.
If you can find them....
> > The patch is not about forcing the use of large pages but about the option to
> > use larger pages. It's a new flexibility.
>
> That's just spin.
No, it's a fact. The patchset really allows one to switch large page
support on and off. It opens up new options.
> > They are really larger. One page struct controls it all.
>
> No it doesn't and please stop spinning. x86 ptes map 4k pages and the core
> MM needs changes to continue to work with this hack in place.
The page cache is different from pte mapping. One page struct controls
them all. Look at the patches. There is no state information in the tail
pages apart from pointing to the head page.
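To illustrate the point (a sketch, not code from the patchset): all per-block
state lives in the one head page, so a block is locked, dirtied and written
back with the same single-page calls the 4k case already uses.

/*
 * Illustrative sketch only.  With an order-2 compound page backing a
 * 16k block, PG_locked, PG_dirty, PG_writeback etc. exist exactly once,
 * on the head page, so the whole block is handled with one call each.
 */
static void dirty_whole_block(struct address_space *mapping, pgoff_t index)
{
        struct page *page = read_mapping_page(mapping, index, NULL);

        if (IS_ERR(page))
                return;
        lock_page(page);                /* serialises the whole block  */
        set_page_dirty(page);           /* one dirty bit for the block */
        unlock_page(page);
        page_cache_release(page);
}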
> If x86 had larger pagesize we wouldn't be seeing any of this. It is a workaround
> for present-generation hardware.
Pagecache != mmap.
> > The patchset would reduce complexity and make it easier to handle the page
> > cache. It gets rid of the hacks used to support larger blocks right now. It's
> > straightforward, with no new locking; very much a cleanup patch.
>
> Were any cleanups made which were not also applicable as standalone things
> to mainline?
The page cache functions require a mapping parameter. This is available
in most places and is a natural thing given that allocation etc. is also bound
to mapping information.
> > No it becomes easier. Look at the patchset. It cleans up a huge mess.
>
> I see no cleanups which are not also applicable to mainline.
Not sure what you mean by that.
> > What is hacky about it?
>
> It pretends that pages are larger than they actually are, forcing the
> pte-management code to also play along with the pretence.
>
> Pages *aren't* 16k. They're 4k.
No, they are 16k if the filesystem wants them to be 16k. The filesystem
does not need to have the data mapped into an address space. And there is
no problem with mapping 4k sections using the ptes if we want to.
> Please address my point: if in five years time x86 has larger or variable
> pagesize, this code will be a permanent millstone around our necks which we
> *should not have merged*.
No, this code will enable us to switch to such a new page size very
quickly. Because the pagecache already supports it, it becomes easier to add
mmap support for other page sizes.
> And if in five years time x86 does not have larger pagesize support then
> the manufacturers would have decided that 4k pages are not a performance
> problem, so we again should not have merged this code.
The manufacturers on x86 are already supporting 2M page sizes and cannot
support intermediate sizes since they are married to the page table
format for performance reasons. The patch could f.e. lead to
straightforward support for 2M page mappings if we wanted it.
> > It is the most consistent solution that avoid the proliferation of further
> > hacks to address the large blocksize.
>
> You cannot say this. I'm sitting here *watching* you refuse to seriously
> consider alternatives.
And I am sitting here in disbelief about the series of weird alternatives
running over my screen just to avoid the obvious solution. Then there is
this weird idea that this would hinder us from supporting additional page
sizes for mmap, while the patch actually leads towards enabling support for
such features in the future.
> And you've conspicuously failed to address my point regarding the
> *permanent* additional maintenance cost.
Where? The page cache handling in the various layers is significantly
simplified which reduces maintenance cost.
> Anyway. Let's await those performance numbers. If they're notably good,
> and if we judge that this goodness will be realised on more than one
> arguably-crippled present-day disk adapter then we can evaluate the
> *various* options which we have for stuffing more data into that adapter.
One? Spin.... The majority you mean?
Dave, where are we with the performance tests?
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 6:55 ` Andrew Morton
2007-04-27 7:19 ` Christoph Lameter
@ 2007-04-27 7:22 ` Christoph Lameter
2007-04-27 7:29 ` Andrew Morton
2007-04-27 11:05 ` Paul Mackerras
2007-04-27 13:44 ` William Lee Irwin III
3 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-27 7:22 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Thu, 26 Apr 2007, Andrew Morton wrote:
> Were any cleanups made which were not also applicable as standalone things
> to mainline?
Ahh. I think I know what you mean. The current patchset is for performance
testing against mainline. Let's first cover the bases and then see where
we go. It is not against mm. I will submit pieces to mm depending on the
outcome of our discussions.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 7:19 ` Christoph Lameter
@ 2007-04-27 7:26 ` Andrew Morton
2007-04-27 8:37 ` David Chinner
` (2 more replies)
0 siblings, 3 replies; 235+ messages in thread
From: Andrew Morton @ 2007-04-27 7:26 UTC (permalink / raw)
To: Christoph Lameter
Cc: David Chinner, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Fri, 27 Apr 2007 00:19:49 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> The page cache handling in the various layers is significantly
> simplified which reduces maintenance cost.
How on earth can the *addition* of variable pagecache size simplify the
existing code?
What cleanups are in this patchset which cannot be made *without* the
addition of variable pagecache size?
> Dave, where are we with the performance tests?
Well yes.
Do note that if the numbers are good, we also need to look at how generally
useful this work is. For example, if it only benefits one particular
arguably-crippled present-generation adapter then that of course weakens the
case.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 7:22 ` Christoph Lameter
@ 2007-04-27 7:29 ` Andrew Morton
2007-04-27 7:35 ` Christoph Lameter
0 siblings, 1 reply; 235+ messages in thread
From: Andrew Morton @ 2007-04-27 7:29 UTC (permalink / raw)
To: Christoph Lameter
Cc: David Chinner, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Fri, 27 Apr 2007 00:22:26 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> I will submit pieces to mm depending on the
> outcome of our discussions.
Thanks.
There's a ludicrous amount of MM work pending in -mm. It would probably be
less work at your end to see what ends up landing in 2.6.22-rc1.
<wanders off to do his mm-merge-plans email>
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 7:29 ` Andrew Morton
@ 2007-04-27 7:35 ` Christoph Lameter
2007-04-27 7:43 ` Andrew Morton
0 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-27 7:35 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Fri, 27 Apr 2007, Andrew Morton wrote:
> On Fri, 27 Apr 2007 00:22:26 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
>
> > I will submit pieces to mm depending on the
> > outcome of our discussions.
> There's a ludicrous amount of MM work pending in -mm. It would probably be
> less work at your end to see what ends up landing in 2.6.22-rc1.
I am aware of that and that's why I kept this against upstream. The need
right now is for justification and explanation. I had to go
through a head-spinning series of VM layers to get an idea of how to do
this in a clean way and then had to make additional passes to do minimal
modifications to get this working so that it is testable.
Performance tests please...
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 7:35 ` Christoph Lameter
@ 2007-04-27 7:43 ` Andrew Morton
0 siblings, 0 replies; 235+ messages in thread
From: Andrew Morton @ 2007-04-27 7:43 UTC (permalink / raw)
To: Christoph Lameter
Cc: David Chinner, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Fri, 27 Apr 2007 00:35:19 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> On Fri, 27 Apr 2007, Andrew Morton wrote:
>
> > On Fri, 27 Apr 2007 00:22:26 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> >
> > > I will submit pieces to mm depending on the
> > > outcome of our discussions.
>
> > There's a ludicrous amount of MM work pending in -mm. It would probably be
> > less work at your end to see what ends up landing in 2.6.22-rc1.
>
> I am aware of that and that's why I kept this against upstream. The need
> right now is for justification and explanation. I had to go
> through a head-spinning series of VM layers to get an idea of how to do
> this in a clean way and then had to make additional passes to do minimal
> modifications to get this working so that it is testable.
OK.
Don't get me wrong - I do think this is neat code and is a good way of
addressing the problem. (I'm surprised that the mmap protopatch didn't
touch rmap.c).
But I don't think it's a slam dunk and I would like you to appreciate the
constraints which I believe we operate under. And I don't think we've
adequately considered alternative solutions to the immediate performance problems.
> Performance tests please...
On various HBAs, please ;)
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 7:04 ` Andrew Morton
@ 2007-04-27 8:03 ` David Chinner
2007-04-27 8:48 ` Andrew Morton
2007-05-04 13:31 ` Eric W. Biederman
0 siblings, 2 replies; 235+ messages in thread
From: David Chinner @ 2007-04-27 8:03 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Fri, Apr 27, 2007 at 12:04:03AM -0700, Andrew Morton wrote:
> On Fri, 27 Apr 2007 16:09:21 +1000 David Chinner <dgc@sgi.com> wrote:
>
> > On Thu, Apr 26, 2007 at 10:15:28PM -0700, Andrew Morton wrote:
> > > On Fri, 27 Apr 2007 14:20:46 +1000 David Chinner <dgc@sgi.com> wrote:
> > >
> > > > > blocksizes via this scheme - instantiate and lock four pages and go for
> > > > > it.
> > > >
> > > > So now how do you get block aligned writeback?
> > >
> > > in writeback and pageout:
> > >
> > > if (page->index & mapping->block_size_mask)
> > > continue;
> >
> > So we might do writeback on one page in N - how do we
> > make sure none of the other pages are reclaimed while we are doing
> > writeback on this block?
>
> By marking them all dirty when one is marked dirty.
>
> David, you're perfectly capable of working all this out yourself. But
> you're trying not to. Please stop this game.
I've looked at all this but I'm trying to work out if anyone
else has looked at the impact of doing this. I have direct experience
with this form of block aggregation - this is pretty much what is
done in Irix - and it's full of nasty, ugly corner cases.
I've got several year-old Irix bugs assigned that are hit every so
often where one page in the aggregated set has the wrong state, and
it's simply not possible to either reproduce the problem or work out
how it happened. The code has grown too complex and convoluted, and
by the time the problem is noticed (either by hang, panic or bug
check) the cause of it is long gone.
I don't want to go back to having to deal with this sort of problem
- I'd much prefer to have a design that does not make the same
mistakes that lead to these sorts of problem.
> > > > You basically have to
> > > > jump through nasty, nasty hoops, to handle corner cases that are introduced
> > > > because the generic code can no longer reliably lock out access to a
> > > > filesystem block.
> >
> > This way lies insanity.
>
> You're addressing Christoph's straw man here.
No, I'm speaking from years of experience working on a
page/buffer/chunk cache capable of using both large pages and
aggregating multiple pages. It has, at times, almost driven me
insane and I don't want to go back there.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 7:26 ` Andrew Morton
@ 2007-04-27 8:37 ` David Chinner
2007-04-27 12:01 ` Christoph Lameter
2007-04-27 16:36 ` David Chinner
2 siblings, 0 replies; 235+ messages in thread
From: David Chinner @ 2007-04-27 8:37 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, David Chinner, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Fri, Apr 27, 2007 at 12:26:40AM -0700, Andrew Morton wrote:
> On Fri, 27 Apr 2007 00:19:49 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
>
> > The page cache handling in the various layers is significantly
> > simplified which reduces maintenance cost.
>
> How on earth can the *addition* of variable pagecache size simplify the
> existing code?
>
> What cleanups are in this patchset which cannot be made *without* the
> addition of variable pagecache size?
Sure they can - but then the variable size page cache becomes even
more trivial to implement. ;)
> > Dave, where are we with the performance tests?
>
> Well yes.
Backed up behind real work. :/
Hopefully I'll have something tonight from my small test box.
It'll be some time next week before I get a chance to do anything
on a hardware raid array.
> Do note that if the numbers are good, we also need to look at how generally
> useful this work is. For example, if it only benefits one particular
> arguably-crippled present-generation adapter then that of course weakens the
> case.
I think you mistook my "hw RAID" as a "HW RAID adapter". I was
talking about "HW RAID arrays" as in rack-mounted controllers on the
other end of multiple FC HCAs.
What I'm talking about is the difference in performance when we can
do large enough I/Os to be able to do full-stripe writes on this
sort of hardware. Modern HW RAID arrays go much faster when you do
aligned, full-stripe width writes and right now x86_64 Linux is
not able to do this....
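As a back-of-the-envelope illustration (the array geometry and the 128-entry
sg limit are assumed example numbers, not figures taken from this message):

/*
 * Worked example with assumed numbers: a RAID5 LUN with 8 data disks
 * and a 128k stripe unit wants 1MB aligned writes for a full stripe,
 * but 128 sg entries of 4k pages cap a single request at 512k, so the
 * array only ever sees partial-stripe writes.
 */
#include <stdio.h>

int main(void)
{
        unsigned long stripe_unit = 128 * 1024;         /* assumed */
        unsigned long data_disks  = 8;                  /* assumed */
        unsigned long full_stripe = stripe_unit * data_disks;
        unsigned long max_request = 128 * 4096;         /* 128 sg entries x 4k */

        printf("full stripe: %lu KB, max request: %lu KB\n",
               full_stripe / 1024, max_request / 1024);
        return 0;
}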
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 8:03 ` David Chinner
@ 2007-04-27 8:48 ` Andrew Morton
2007-04-27 16:45 ` Theodore Tso
2007-05-04 12:57 ` Eric W. Biederman
2007-05-04 13:31 ` Eric W. Biederman
1 sibling, 2 replies; 235+ messages in thread
From: Andrew Morton @ 2007-04-27 8:48 UTC (permalink / raw)
To: David Chinner
Cc: clameter, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Fri, 27 Apr 2007 18:03:21 +1000 David Chinner <dgc@sgi.com> wrote:
> > > > > You basically have to
> > > > > jump through nasty, nasty hoops, to handle corner cases that are introduced
> > > > > because the generic code can no longer reliably lock out access to a
> > > > > filesystem block.
> > >
> > > This way lies insanity.
> >
> > You're addressing Christoph's straw man here.
>
> No, I'm speaking from years of experience working on a
> page/buffer/chunk cache capable of using both large pages and
> aggregating multiple pages. It has, at times, almost driven me
> insane and I don't want to go back there.
We're talking about two separate things here - let us not conflate them.
1: The arguably-crippled HBA which wants bigger SG lists.
2: The late-breaking large-blocksizes-in-the-fs thing.
None of this multiple-page-locking stuff we're discussing here is relevant
to the HBA performance problem. It's pretty simple (I think) for us to
ensure that, for the great majority of the time, contiguous pages in a file
are also physically contiguous. Problem solved, HBA go nice and quick,
move on.
Now, we have this the second and completely unrelated requirement:
supporting fs-blocksize > PAGE_SIZE. One way to address this is via the
mangle-multiple-pages-into-one approach. And it's obviously the best way
to do it, if mangle-multiple-pages is already available.
But I don't know how important requirement 2 is. XFS already has
presumably-working private code to do it, and there is simplification and
perhaps modest performance gain in the block allocator to be had here.
And other filesystems (ie: ext4) _might_ use it. But ext4 is extent-based,
so perhaps it's not worth churning the on-disk format to get a bit of a
boost in the block allocator.
So I _think_ what this boils down to is offering some simplifications in
XFS, by adding complexications to core VFS and MM. I dunno if that's a
good deal.
So... tell us why you want feature 2?
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 15:58 ` William Lee Irwin III
@ 2007-04-27 9:46 ` Nick Piggin
0 siblings, 0 replies; 235+ messages in thread
From: Nick Piggin @ 2007-04-27 9:46 UTC (permalink / raw)
To: William Lee Irwin III
Cc: David Chinner, Eric W. Biederman, clameter, linux-kernel,
Mel Gorman, Jens Axboe, Badari Pulavarty, Maxim Levitsky
William Lee Irwin III wrote:
> On Fri, Apr 27, 2007 at 01:38:30AM +1000, Nick Piggin wrote:
>
>>Or good grounds to increase the sg limit and push for io controller
>>manufacturers to do the same. If we have a hack in the kernel that
>>mostly works, they won't.
>
>
> On Fri, Apr 27, 2007 at 01:38:30AM +1000, Nick Piggin wrote:
>
>>Page colouring was always rejected, and lots of people who knew
>>better got upset because it was the only way the hardware would go
>>fast...
>
>
> Yes, stunning wisdom there. Reject the speedups.
Yeah, that's how lots of people felt. But there is a good argument
to do just that.
> On Fri, Apr 27, 2007 at 01:38:30AM +1000, Nick Piggin wrote:
>
>>You could put it that way. Or that it is wrong because of the
>>fragmenatation problem. Realise that it is somewhat fundamental
>>considering that it is basically an unsolvable problem with our
>>current kernel assumptions of unconstrained kernel allocations and
>>a 1:1 kernel mapping.
>
>
> Depends on what you consider a solution. A broadly used criterion is
> that it improves performance significantly in important usage cases.
My criterion is that you are not suddenly unable to access your
filesystem because you cannot allocate a higher order page.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 16:07 ` Christoph Hellwig
@ 2007-04-27 10:05 ` Nick Piggin
2007-04-27 13:06 ` Mel Gorman
0 siblings, 1 reply; 235+ messages in thread
From: Nick Piggin @ 2007-04-27 10:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Christoph Lameter, Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Christoph Hellwig wrote:
> On Thu, Apr 26, 2007 at 05:48:12PM +1000, Nick Piggin wrote:
>
>>>Well maybe you could explain what you want. Preferably without redefining
>>>the established terms?
>>
>>Support for larger buffers than page cache pages.
>
>
> I don't think you really want this :) The whole non-pagecache I/O
> path before 2.3 was a total pain just because it used buffers to drive
> I/O. Add to that buffers bigger than a page and you add another
> two magnitudes of complexity. If you want to see a mess like that
> download one of the early XFS/Linux releases that had an I/O path
> like that. I _really_ _really_ don't want to go there.
I'm not actually suggesting adding anything like that. But I think
larger blocks can be doable while retaining the "buffer" layer as a
relatively simple pagecache-to-block translation.
Anyway, I'm working on patches... they might crash and burn, but we
might have something to talk about later.
> Linux has a long tradition of trading a tiny bit of efficiency for
> much cleaner code, and I'd 100% go down Christoph's route here.
> Then again I'd actually be rather surprised if > page buffers
> were more efficient - you'd run into shitloads of overhead due to
> them being non-contiguous, like calling vmap all over the place,
> reprogramming iommus to at least make them look virtually contiguous [1],
> etc..
I still think hardware should work reasonably well with 4K pages. The
SGI IO controllers and/or the Linux block layer that doesn't allow more
than 128 sg entries are clearly suboptimal if the hardware runs twice as
fast with 2MB submissions.
> I also don't quite get what your problem with higher order allocations
> is. Order 1 allocations are generally just fine, and in fact
> thread stacks are >= order 1 on most architectures. And if the pagecache
> uses higher order allocations that means we'll finally fix our problems
> with them, which we have to do anyway. Workloads continue to grow and
> with them the kernel overhead to manage them, while the pagesize for
> many architectures is fixed. So we'll have to deal with order 1
> and order 2 allocations better just for backing kmalloc and co.
The pagecache is much bigger and often has a lot more activity than these
other things though. Also, the more things you add to higher order
allocations, the more pressure you have.
I like PAGE_SIZE pagecache, because it is reliable and really fast, if
you need to reclaim a page it should be almost O(1).
> Or think jumboframes for that matter.
They can actually run into problems if the hardware wants contiguous
memory.
I don't know why you think the fragmentation issues are just magically
fixed. It is hard and inefficient to reclaim larger order blocks (even
with lumpy reclaim), and Mel's patches aren't perfect. Actually, last
time I looked, they needed to keep at least 16MB of pages free to be
reasonably effective (or do we just say that people with less than XMB
of memory shouldn't be accessing these filesystems anyway?), and I'm
not sure if they have been tested for long term stability in the
presence of a reasonable amount of higher order allocations.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:10 ` Christoph Lameter
@ 2007-04-27 10:08 ` Nick Piggin
0 siblings, 0 replies; 235+ messages in thread
From: Nick Piggin @ 2007-04-27 10:08 UTC (permalink / raw)
To: Christoph Lameter
Cc: David Chinner, Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
Christoph Lameter wrote:
> On Thu, 26 Apr 2007, Nick Piggin wrote:
>
>
>>Christoph Lameter wrote:
>>
>>>On Thu, 26 Apr 2007, Nick Piggin wrote:
>>>
>>>
>>>
>>>>But I maintain that the end result is better than the fragmentation
>>>>based approach. A lot of people don't actually want a bigger page
>>>>cache size, because they want efficient internal fragmentation as
>>>>well, so your radix-tree based approach isn't really comparable.
>>>
>>>
>>>Me? Radix tree based approach? That approach is in the kernel. Do not create
>>>a solution where there is no problem. If we do not want to support large
>>>blocksizes then let's be honest and say so instead of redefining what a block
>>>is. The current approach is fine if one is satisfied with scatter gather and
>>>the VM overhead coming with handling these pages. I fail to see what any of
>>>what you are proposing would add to that.
>>
>>I'm not just making this up. Fragmentation. OK?
>
>
> Yes you are. If you want to avoid fragmentation by restricting the OS to
> 4k alone then the radix tree is sufficient to establish the order of pages
> in a mapping. The only problem is to get an array of pointers to a
> sequence of pages together by reading through the radix tree. I do not
> know what else would be needed.
No. We have avoided fragmentation up until now. We avoid fragmentation like
the plague because it is crap. What _I_ do not want to do is add some
patches that make it work a bit better and have everyone think that's a signal
that it is a good idea to start using higher order allocations wherever
possible.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 18:13 ` Christoph Lameter
@ 2007-04-27 10:15 ` Nick Piggin
0 siblings, 0 replies; 235+ messages in thread
From: Nick Piggin @ 2007-04-27 10:15 UTC (permalink / raw)
To: Christoph Lameter
Cc: Eric W. Biederman, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Christoph Lameter wrote:
> On Thu, 26 Apr 2007, Nick Piggin wrote:
>
>
>>>But what do you mean with it? A block is no longer a contiguous section of
>>>memory. So you have redefined the term.
>>
>>I don't understand what you mean at all. A block has always been a
>>contiguous area of disk.
>
>
You want to change the block layer to support larger blocksizes than
PAGE_SIZE, right? So you need to segment that larger block into pieces.
The block is the disk block, which does not get segmented.
What you have is a small layer that tells you which block a pagecache
page points to, and which pagecache page refers to a given block. Just
like we have now, only slightly extended.
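As a sketch of what that small layer amounts to (assuming 4k pagecache pages
and a larger filesystem block; the helper names are invented here):

/*
 * Illustrative only: with PAGE_CACHE_SHIFT = 12 and, say, 16k
 * filesystem blocks (blkbits = 14), the page <-> block translation is
 * plain shift arithmetic.
 */
static inline sector_t page_to_fsblock(struct page *page, unsigned int blkbits)
{
        return (sector_t)page->index >> (blkbits - PAGE_CACHE_SHIFT);
}

static inline pgoff_t fsblock_to_first_page(sector_t block, unsigned int blkbits)
{
        return (pgoff_t)block << (blkbits - PAGE_CACHE_SHIFT);
}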
>>>And you dont care about Mel's work on that level?
>>
>>I actually don't like it too much because it can't provide a robust
>>solution. What do you do on systems with small memories, or those that
>>eventually do get fragmented?
>
>
> You could f.e. switch off defragmentation and the large block support?
Ahh, then you reboot your machine to access your other filesystems?
>>Actually, I don't know why people are so excited about being able to
>>use higher order allocations (I would rather be more excited about
>>never having to use them). But for those few places that really need
>>it, I'd rather see them use a virtually mapped kernel with proper
>>defragmentation rather than putting hacks all through the core code.
>
>
> Ahh. I knew we were going this way.... Now we have virtual contiguous vs.
> physical discontiguous.... Yuck hackidihack.
That gives you the proper infrastructure needed to actually
support higher order _physical_ allocations _properly_.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 0:32 ` William Lee Irwin III
@ 2007-04-27 10:22 ` Nick Piggin
2007-04-27 12:58 ` William Lee Irwin III
0 siblings, 1 reply; 235+ messages in thread
From: Nick Piggin @ 2007-04-27 10:22 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Eric W. Biederman, Christoph Lameter, linux-kernel, Mel Gorman,
David Chinner, Jens Axboe, Badari Pulavarty, Maxim Levitsky
William Lee Irwin III wrote:
> William Lee Irwin III <wli@holomorphy.com> writes:
>
>>>In memory as on disk, contiguity matters a lot for performance.
>
>
> On Thu, Apr 26, 2007 at 12:21:24PM -0600, Eric W. Biederman wrote:
>
>>Not nearly so much though. In memory you don't have seeks to avoid.
>>On disks avoiding seeks is everything.
>
>
> I readily concede that seeks are most costly. Yet memory contiguity
> remains rather influential.
>
> Witness the fact that I'm now being called upon a second time to
> adjust the order in which mm/page_alloc.c returns pages for the
> sake of implicitly establishing IO contiguity (or otherwise
> determining why things are coming out backward now).
Just a random aside question... doesn't Oracle db do direct IO from
hugepages?
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 4:05 ` Eric W. Biederman
@ 2007-04-27 10:26 ` Nick Piggin
2007-04-27 13:51 ` Eric W. Biederman
0 siblings, 1 reply; 235+ messages in thread
From: Nick Piggin @ 2007-04-27 10:26 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Jens Axboe, Christoph Lameter, Christoph Hellwig, David Chinner,
linux-kernel, Mel Gorman, William Lee Irwin III,
Badari Pulavarty, Maxim Levitsky
Eric W. Biederman wrote:
> Jens Axboe <jens.axboe@oracle.com> writes:
>>Yes, that is exactly the problem. Once you have that, pktcdvd is pretty
>>much reduced to setup and init code, the actual data handling can be
>>done by sr or ide-cd directly. You could merge it into cdrom.c, it would
>>not be very different from mt-rainier handling (which basically does RMW
>>in firmware, so it works for any write, but performance is of course
>>horrible if you don't do it right).
>
>
> Thanks for the clarification.
>
> So we do have a clear problem that we do not have generic support for
> large sector sizes residing in the page cache.
Well, it is a clear limitation. It hasn't mattered too much until
now, but it is one of the other issues that SGI hit (aside from
IO efficiency) because they have 16K-blocksize filesystems created on ia64
systems that I believe they want to access from x86-64 systems.
I'm slowly looking at patches in the background, but I'm hoping to
be able to spend a decent chunk of time working on them again soon.
It isn't trivial :)
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 16:11 ` Christoph Hellwig
2007-04-26 17:49 ` Eric W. Biederman
@ 2007-04-27 10:38 ` Nick Piggin
1 sibling, 0 replies; 235+ messages in thread
From: Nick Piggin @ 2007-04-27 10:38 UTC (permalink / raw)
To: Christoph Hellwig
Cc: David Chinner, Eric W. Biederman, clameter, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
Christoph Hellwig wrote:
> On Thu, Apr 26, 2007 at 04:50:06PM +1000, Nick Piggin wrote:
>
>>Improving the buffer layer would be a good way. Of course, that is
>>a long and difficult task, so nobody wants to do it.
>
>
> It's also a stupid idea. We got rid of the buffer layer because it's
> a complete pain in the ass, and now you want to reintroduce one that's
> even more complex, and most likely even slower than the elegant solution?
>
>
>>Well, for those architectures (and this would solve your large block
>>size and 16TB pagecache size without any core kernel changes), you
>>can manage 1<<order hardware ptes as a single Linux pte. There is
>>nothing that says you must implement PAGE_SIZE as a single TLB sized
>>page.
>
>
> Well, ppc64 can do that. And guess what, it's really painful for a lot
> of workloads. Think of a poor ps3 with 256MB from which the broken hypervisor
> already takes a lot away and now every file in the pagecache takes
> 64k, every thread stack takes 64k, etc? It's good to have variable
> sized objects in places where it makes sense, and the pagecache is
> definitely one of them.
Well, I think 64K is probably too large for a generic Linux config, but
16K is a lot more reasonable and would also nicely cut out most other
higher order allocations too.
The efficiency argument has always been there and is nothing new. Yeah,
I know it can be slightly more efficient to do most hardware operations
in larger chunks, and I never bought it before and I don't now. Superpages can
also be used to use TLBs more efficiently sometimes, etc etc. This is
coming up now because of some specific big problems that are being
encountered, which are absolutely not some fundamental hardware limit we
see approaching.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 6:55 ` Andrew Morton
2007-04-27 7:19 ` Christoph Lameter
2007-04-27 7:22 ` Christoph Lameter
@ 2007-04-27 11:05 ` Paul Mackerras
2007-04-27 11:41 ` Nick Piggin
2007-04-27 11:58 ` Christoph Lameter
2007-04-27 13:44 ` William Lee Irwin III
3 siblings, 2 replies; 235+ messages in thread
From: Paul Mackerras @ 2007-04-27 11:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, David Chinner, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
Andrew Morton writes:
> If x86 had larger pagesize we wouldn't be seeing any of this. It is a workaround
> for present-generation hardware.
Unfortunately, it's not really practical to increase the page size
very much on most systems, because you end up wasting a lot of space
in the page cache. So there is a tension between wanting a small page
size so your page cache uses memory efficiently, and wanting a large
page size so the TLB covers more address space and your programs run
faster (not to mention other benefits such as the kernel having to
manage fewer pages, and I/O being done in bigger chunks).
Thus there is not really any single page size that suits all workloads
and machines. With distros wanting to just have a single kernel per
architecture, and the fact that the page size is a compile-time
constant, we currently end up having to pick one size and just put up
with the fact that it will suck for some users. We currently have
this situation on ppc64 now that POWER5+ and POWER6 machines have
hardware support for 64k pages as well as 4k pages.
So I can see a few different options:
(a) Keep things more or less as they are now and just wear the fact
that we will continue to show lower performance than certain
proprietary OSes, or
(b) Somehow manage to make the page size a variable rather than a
compile-time constant, and pick a suitable page size at boot time
based on how much memory the machine has, or something. I looked at
implementing this at one point and recoiled in horror. :)
(c) Make the page cache able to use small pages for small files and
large pages for large files. AIUI this is basically what Christoph is
proposing.
Option (a) isn't very palatable to me (nor I expect, Christoph :)
since it basically says that Linux is very much focussed on the
embedded and desktop end of things and isn't really suitable as a
high-performance OS for large SMP systems. I don't want to believe
that. ;)
Option (b) would be a bit of an ugly hack.
Which leaves option (c) - unless you have a further option. So I have
to say I support Christoph on this, at least as far as the general
principle is concerned.
Regards,
Paul.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 11:05 ` Paul Mackerras
@ 2007-04-27 11:41 ` Nick Piggin
2007-04-27 12:12 ` Christoph Lameter
2007-04-27 12:14 ` Paul Mackerras
2007-04-27 11:58 ` Christoph Lameter
1 sibling, 2 replies; 235+ messages in thread
From: Nick Piggin @ 2007-04-27 11:41 UTC (permalink / raw)
To: Paul Mackerras
Cc: Andrew Morton, Christoph Lameter, David Chinner, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
Paul Mackerras wrote:
> Andrew Morton writes:
>
>
>>If x86 had larger pagesize we wouldn't be seeing any of this. It is a workaround
>>for present-generation hardware.
>
>
> Unfortunately, it's not really practical to increase the page size
> very much on most systems, because you end up wasting a lot of space
> in the page cache. So there is a tension between wanting a small page
> size so your page cache uses memory efficiently, and wanting a large
> page size so the TLB covers more address space and your programs run
> faster (not to mention other benefits such as the kernel having to
> manage fewer pages, and I/O being done in bigger chunks).
>
> Thus there is not really any single page size that suits all workloads
> and machines. With distros wanting to just have a single kernel per
> architecture, and the fact that the page size is a compile-time
> constant, we currently end up having to pick one size and just put up
> with the fact that it will suck for some users. We currently have
> this situation on ppc64 now that POWER5+ and POWER6 machines have
> hardware support for 64k pages as well as 4k pages.
>
> So I can see a few different options:
>
> (a) Keep things more or less as they are now and just wear the fact
> that we will continue to show lower performance than certain
> proprietary OSes, or
>
> (b) Somehow manage to make the page size a variable rather than a
> compile-time constant, and pick a suitable page size at boot time
> based on how much memory the machine has, or something. I looked at
> implementing this at one point and recoiled in horror. :)
>
> (c) Make the page cache able to use small pages for small files and
> large pages for large files. AIUI this is basically what Christoph is
> proposing.
>
> Option (a) isn't very palatable to me (nor I expect, Christoph :)
> since it basically says that Linux is very much focussed on the
> embedded and desktop end of things and isn't really suitable as a
> high-performance OS for large SMP systems. I don't want to believe
> that. ;)
>
> Option (b) would be a bit of an ugly hack.
>
> Which leaves option (c) - unless you have a further option. So I have
> to say I support Christoph on this, at least as far as the general
> principle is concerned.
For the TLB issue, higher order pagecache doesn't help. If distros
ship with a 4K page size on powerpc, and use some larger pages in
the pagecache, some people are still going to get angry because
they wanted to use 64K pages... But I agree 64K pages is too big
for most things anyway, and 16 would be better as a default (which
hopefully x86-64 will get one day).
Anyway, for IO performance, there are alternatives, despite what
some people seem to be saying. We can submit larger sglists to the
device for larger ios, which Jens is looking at (which could help
all types of workloads, not just those with sequential large file
IO).
After that, I'd find it amusing if HBAs worth thousands of $ have
trouble looking up sglists at the relatively glacial pace that IO
requires, and/or can't spare a few more K for reasonable sglist
sizes, but if that is really the case, then we could use iommus
and/or just attempt to put physically contiguous pages in pagecache,
rather than require it.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 11:05 ` Paul Mackerras
2007-04-27 11:41 ` Nick Piggin
@ 2007-04-27 11:58 ` Christoph Lameter
1 sibling, 0 replies; 235+ messages in thread
From: Christoph Lameter @ 2007-04-27 11:58 UTC (permalink / raw)
To: Paul Mackerras
Cc: Andrew Morton, David Chinner, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Fri, 27 Apr 2007, Paul Mackerras wrote:
> Option (b) would be a bit of an ugly hack.
>
> Which leaves option (c) - unless you have a further option. So I have
> to say I support Christoph on this, at least as far as the general
> principle is concerned.
We could approximate option (b) by setting a standard page size for the
page cache and setting the same page size in SLUB (SLUB is already boot-time
configurable in that respect). That will make large portions of the VM
use the same page order. I can try to add similar controls to the page
cache.
If the page size is set too high for a mount then we use the buffer
head functionality to split the higher order page into pieces of the
appropriate size. That could limit the number of page sizes that we need
to support.
But I think we should first see how well Mel's antifrag work does.
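A rough sketch of the buffer-head splitting mentioned above (the sizes are
example assumptions; create_empty_buffers() is the existing fs/buffer.c
helper):

/*
 * Illustrative only: a higher-order pagecache page mounted with a
 * smaller filesystem block size gets carved up with the existing
 * buffer-head machinery, just as a 4k page is carved into 1k blocks
 * today.  E.g. a 64k page with 16k blocks -> 4 buffer_heads.
 */
static void attach_block_buffers(struct page *page, unsigned int blocksize)
{
        if (!page_has_buffers(page))
                create_empty_buffers(page, blocksize, 0);
}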
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 7:26 ` Andrew Morton
2007-04-27 8:37 ` David Chinner
@ 2007-04-27 12:01 ` Christoph Lameter
2007-04-27 16:36 ` David Chinner
2 siblings, 0 replies; 235+ messages in thread
From: Christoph Lameter @ 2007-04-27 12:01 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Fri, 27 Apr 2007, Andrew Morton wrote:
> On Fri, 27 Apr 2007 00:19:49 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
>
> > The page cache handling in the various layers is significantly
> > simplified which reduces maintenance cost.
>
> How on earth can the *addition* of variable pagecache size simplify the
> existing code?
Because the page cache code is full of manual shifts and adds. It's a mess.
Adding appropriate accessors simplifies the code. I took that opportunity
to slide an address_space parameter into the accessors, which allows us to
have sets of accessors for single-page-size and multi-page-size
configurations.
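Something along these lines (names and the per-mapping order field are
illustrative, in the spirit of what is described above rather than the exact
patch):

/*
 * Illustrative accessors: the open-coded PAGE_CACHE_SHIFT arithmetic is
 * replaced by helpers that take the mapping, so a single-page-size
 * build compiles back down to constants while a multi-page-size build
 * reads the order from the mapping ('order' is a hypothetical field).
 */
static inline unsigned int page_cache_shift(struct address_space *mapping)
{
#ifdef CONFIG_LARGE_BLOCKSIZE
        return PAGE_CACHE_SHIFT + mapping->order;
#else
        return PAGE_CACHE_SHIFT;
#endif
}

static inline loff_t page_cache_size(struct address_space *mapping)
{
        return 1LL << page_cache_shift(mapping);
}

static inline pgoff_t page_cache_index(struct address_space *mapping, loff_t pos)
{
        return pos >> page_cache_shift(mapping);
}

static inline unsigned int page_cache_offset(struct address_space *mapping, loff_t pos)
{
        return pos & (page_cache_size(mapping) - 1);
}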
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 11:41 ` Nick Piggin
@ 2007-04-27 12:12 ` Christoph Lameter
2007-04-27 12:25 ` Nick Piggin
2007-04-27 13:37 ` Christoph Hellwig
2007-04-27 12:14 ` Paul Mackerras
1 sibling, 2 replies; 235+ messages in thread
From: Christoph Lameter @ 2007-04-27 12:12 UTC (permalink / raw)
To: Nick Piggin
Cc: Paul Mackerras, Andrew Morton, David Chinner, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Fri, 27 Apr 2007, Nick Piggin wrote:
> For the TLB issue, higher order pagecache doesn't help. If distros
> ship with a 4K page size on powerpc, and use some larger pages in
> the pagecache, some people are still going to get angry because
> they wanted to use 64K pages... But I agree 64K pages is too big
> for most things anyway, and 16 would be better as a default (which
> hopefully x86-64 will get one day).
Powerpc supports multiple pagesizes. Maybe we could make mmap use those
page sizes some day if we had a variable order page cache. Your stance on
the issue means that powerpc will be forever crippled and not be able to
use its full potential.
> Anyway, for io performance, there are alternatives, dispite what
> some people seem to be saying. We can submit larger sglists to the
> device for larger ios, which Jens is looking at (which could help
> all types of workloads, not just those with sequential large file
> IO).
Right, this could help, but it is not addressing the basic requirement for
devices that need large contiguous chunks of memory for I/O.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 11:41 ` Nick Piggin
2007-04-27 12:12 ` Christoph Lameter
@ 2007-04-27 12:14 ` Paul Mackerras
2007-04-27 12:36 ` Nick Piggin
2007-04-27 13:42 ` Christoph Hellwig
1 sibling, 2 replies; 235+ messages in thread
From: Paul Mackerras @ 2007-04-27 12:14 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, Christoph Lameter, David Chinner, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
Nick Piggin writes:
> For the TLB issue, higher order pagecache doesn't help. If distros
Oh? Assuming your hardware is capable of supporting a variety of page
sizes, and of putting a page at any address that is a multiple of its
size, it should help, potentially a great deal, as far as I can see.
I'm thinking in particular of machines that have software-loaded
fully-associative TLBs and support a lot of page sizes, e.g.
4kB * 4^n for n = 0 up to 8 or so, like some embedded powerpc chips.
It's not as simple on 64-bit powerpc with the hash table of course,
because the page size is chosen at the segment (256MB) level,
restricting where we can put 64k and 16M pages to some degree.
> ship with a 4K page size on powerpc, and use some larger pages in
> the pagecache, some people are still going to get angry because
> they wanted to use 64K pages... But I agree 64K pages is too big
> for most things anyway, and 16 would be better as a default (which
> hopefully x86-64 will get one day).
Even 16k is going to bloat the page cache, and some people will
complain. One way that x86-64 could do 16k pages is by still indexing
the PTE page in units of 4k, but then have an indicator in the PTE
that this is a 16k page. Thus a 16k page would occupy 4 consecutive
PTEs, but once it was loaded into the TLB, a single TLB entry would
map the whole 16k. That would give the expanded TLB reach and allow
4k and 16k pages to be intermixed freely.
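Purely to illustrate that layout (the size-hint bit, its position and the
TLB coalescing behaviour are hypothetical - no current x86-64 part
implements this):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT_4K	12
#define PTES_PER_16K	4
#define PTE_PRESENT	(1ULL << 0)
#define PTE_HINT_16K	(1ULL << 9)	/* imaginary "part of a 16k page" bit */

typedef uint64_t pte_t;

/* Fill the four consecutive 4k PTEs backing one 16k page.  pfn_4k must be
 * 16k aligned (a multiple of 4 in 4k units) so a single TLB entry could
 * cover the whole range. */
static void map_16k_page(pte_t *ptep, uint64_t pfn_4k)
{
	for (int i = 0; i < PTES_PER_16K; i++)
		ptep[i] = ((pfn_4k + i) << PAGE_SHIFT_4K) | PTE_PRESENT | PTE_HINT_16K;
}

int main(void)
{
	pte_t ptes[PTES_PER_16K];

	map_16k_page(ptes, 0x1000);	/* 4k pfn 0x1000 == physical 16MB */
	for (int i = 0; i < PTES_PER_16K; i++)
		printf("pte[%d] = %#llx\n", i, (unsigned long long)ptes[i]);
	return 0;
}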
Paul.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 12:12 ` Christoph Lameter
@ 2007-04-27 12:25 ` Nick Piggin
2007-04-27 13:39 ` Christoph Hellwig
2007-04-27 16:48 ` Christoph Lameter
2007-04-27 13:37 ` Christoph Hellwig
1 sibling, 2 replies; 235+ messages in thread
From: Nick Piggin @ 2007-04-27 12:25 UTC (permalink / raw)
To: Christoph Lameter
Cc: Paul Mackerras, Andrew Morton, David Chinner, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
Christoph Lameter wrote:
> On Fri, 27 Apr 2007, Nick Piggin wrote:
>
>
>>For the TLB issue, higher order pagecache doesn't help. If distros
>>ship with a 4K page size on powerpc, and use some larger pages in
>>the pagecache, some people are still going to get angry because
>>they wanted to use 64K pages... But I agree 64K pages is too big
>>for most things anyway, and 16 would be better as a default (which
>>hopefully x86-64 will get one day).
>
>
> Powerpc supports multiple pagesizes. Maybe we could make mmap use those
> page sizes some day if we had a variable order page cache. Your stance on
> the issue means that powerpc will be forever crippled and not be able to
> use its full potential.
Linus's favourite jokes about powerpc mmu being crippled forever, aside ;)
This seems like just speculation. I would not be against something without
which some relevant hardware would be "crippled", but you are just handwaving
at this point. And you are still ignoring the alternatives.
>>Anyway, for io performance, there are alternatives, despite what
>>some people seem to be saying. We can submit larger sglists to the
>>device for larger ios, which Jens is looking at (which could help
>>all types of workloads, not just those with sequential large file
>>IO).
>
>
> Right, this could help, but it is not addressing the basic requirement for
> devices that need large contiguous chunks of memory for I/O.
Did you read the last paragraph? Or anything Andrew's been writing?
"After that, I'd find it amusing if HBAs worth thousands of $ have
trouble looking up sglists at the relatively glacial pace that IO
requires, and/or can't spare a few more K for reasonable sglist
sizes, but if that is really the case, then we could use iommus
and/or just attempt to put physically contiguous pages in pagecache,
rather than require it."
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 12:14 ` Paul Mackerras
@ 2007-04-27 12:36 ` Nick Piggin
2007-04-27 13:42 ` Christoph Hellwig
1 sibling, 0 replies; 235+ messages in thread
From: Nick Piggin @ 2007-04-27 12:36 UTC (permalink / raw)
To: Paul Mackerras
Cc: Andrew Morton, Christoph Lameter, David Chinner, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
Paul Mackerras wrote:
> Nick Piggin writes:
>
>
>>For the TLB issue, higher order pagecache doesn't help. If distros
>
>
> Oh? Assuming your hardware is capable of supporting a variety of page
> sizes, and of putting a page at any address that is a multiple of its
> size, it should help, potentially a great deal, as far as I can see.
> I'm thinking in particular of machines that have software-loaded
> fully-associative TLBs and support a lot of page sizes, e.g.
> 4kB * 4^n for n = 0 up to 8 or so, like some embedded powerpc chips.
That's a little bit more than just the higher order pagecache patch.
But I don't know if that would be impossible to do with the "attempt
to allocate contiguous pagecache" approach either. Or if it would be
worthwhile to support.
>>ship with a 4K page size on powerpc, and use some larger pages in
>>the pagecache, some people are still going to get angry because
>>they wanted to use 64K pages... But I agree 64K pages is too big
>>for most things anyway, and 16 would be better as a default (which
>>hopefully x86-64 will get one day).
>
>
> Even 16k is going to bloat the page cache, and some people will
> complain. One way that x86-64 could do 16k pages is by still indexing
> the PTE page in units of 4k, but then have an indicator in the PTE
> that this is a 16k page. Thus a 16k page would occupy 4 consecutive
> PTEs, but once it was loaded into the TLB, a single TLB entry would
> map the whole 16k. That would give the expanded TLB reach and allow
> 4k and 16k pages to be intermixed freely.
I guess any page size bloats the pagecache relative to something
smaller :) But 4K doesn't seem to be proving too much of a problem for
x86, and I'm not talking about an actual implementation coming up,
but just a size that would make sense in future (and probably last
for a long time).
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 10:22 ` Nick Piggin
@ 2007-04-27 12:58 ` William Lee Irwin III
2007-04-27 13:06 ` Nick Piggin
0 siblings, 1 reply; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-27 12:58 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, Christoph Lameter, linux-kernel, Mel Gorman,
David Chinner, Jens Axboe, Badari Pulavarty, Maxim Levitsky
William Lee Irwin III wrote:
>> I readily concede that seeks are most costly. Yet memory contiguity
>> remains rather influential.
>> Witness the fact that I'm now being called upon a second time to
>> adjust the order in which mm/page_alloc.c returns pages for the
>> sake of implicitly establishing IO contiguity (or otherwise
>> determining why things are coming out backward now).
On Fri, Apr 27, 2007 at 08:22:07PM +1000, Nick Piggin wrote:
> Just a random aside question... doesn't Oracle db do direct IO from
> hugepages?
If and when configured to use direct IO and hugepages, yes. It's also
noteworthy that Oracle has more code than its database.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 12:58 ` William Lee Irwin III
@ 2007-04-27 13:06 ` Nick Piggin
2007-04-27 14:49 ` William Lee Irwin III
0 siblings, 1 reply; 235+ messages in thread
From: Nick Piggin @ 2007-04-27 13:06 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Eric W. Biederman, Christoph Lameter, linux-kernel, Mel Gorman,
David Chinner, Jens Axboe, Badari Pulavarty, Maxim Levitsky
William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>
>>>I readily concede that seeks are most costly. Yet memory contiguity
>>>remains rather influential.
>>>Witness the fact that I'm now being called upon a second time to
>>>adjust the order in which mm/page_alloc.c returns pages for the
>>>sake of implicitly establishing IO contiguity (or otherwise
>>>determining why things are coming out backward now).
>
>
> On Fri, Apr 27, 2007 at 08:22:07PM +1000, Nick Piggin wrote:
>
>>Just a random aside question... doesn't Oracle db do direct IO from
>>hugepages?
>
>
> If and when configured to use direct IO and hugepages, yes.
Sweet. I wonder if you would see much improvement from allowing more
than 128 sglist entries, then.
> It's also
> noteworthy that Oracle has more code than its database.
Anything noteworthy in this context, that you care to note?
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 10:05 ` Nick Piggin
@ 2007-04-27 13:06 ` Mel Gorman
0 siblings, 0 replies; 235+ messages in thread
From: Mel Gorman @ 2007-04-27 13:06 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Hellwig, Christoph Lameter, Eric W. Biederman,
linux-kernel, William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On (27/04/07 20:05), Nick Piggin didst pronounce:
> Christoph Hellwig wrote:
> >On Thu, Apr 26, 2007 at 05:48:12PM +1000, Nick Piggin wrote:
> >
> >>>Well maybe you could explain what you want. Preferably without
> >>>redefining the established terms?
> >>
> >>Support for larger buffers than page cache pages.
> >
> >
> >I don't think you really want this :) The whole non-pagecache I/O
> >path before 2.3 was a total pain just because it used buffers to drive
> >I/O. Add to that buffers bigger than a page and you add another
> >two magnitudes of complexity. If you want to see a mess like that
> >download one of the early XFS/Linux releases that had an I/O path
> >like that. I _really_ _really_ don't want to go there.
>
> I'm not actually suggesting to add anything like that. But I think
> larger blocks can be doable while retaining the "buffer" layer as a
> relatively simple pagecache to block translation.
>
> Anyway, I'm working on patches... they might crash and burn, but we
> might have something to talk about later.
>
>
> >Linux has a long tradition of trading a tiny bit of efficiency for
> >much cleaner code, and I'd for 100% go down Christoph's route here.
> >Then again I'd actually be rather surprised if > page buffers
> >were more efficient - you'd run into shitloads of overhead due to
> >them being non-contiguous, like calling vmap all over the place,
> >reprogramming iommus to at least make them look virtually contiguous [1],
> >etc..
>
> I still think hardware should work reasonably well with 4K pages. The
> SGI io controllers and/or the Linux block layer that doesn't allow more
> than 128 sg entries is clearly suboptimal if the hardware runs twice as
> fast with 2MB submissions.
>
>
> >I also don't quite get what your problem with higher order allocations
> >is. order 1 allocations are generally just fine, and in fact
> >thread stacks are >= order 1 on most architectures. And if the pagecache
> >uses higher order allocations that means we'll finally fix our problems
> >with them, which we have to do anyway. Workloads continue to grow and
> >with them the kernel overhead to manage them, while the pagesize for
> >many architectures is fixed. So we'll have to deal with order 1
> >and order 2 allocations better just for backing kmalloc and co.
>
> The pagecache is much bigger and often has a lot more activity than these
> other things though. Also, the more things you add to higher order
> allocations, the more pressure you have.
>
> I like PAGE_SIZE pagecache, because it is reliable and really fast; if
> you need to reclaim a page it should be almost O(1).
>
>
> >Or think jumboframes for that matter.
>
> They can actually run into problems if the hardware wants contiguous
> memory.
>
> I don't know why you think the fragmentation issues are just magically
> fixed. It is hard and inefficient to reclaim larger order blocks (even
> with lumpy reclaim), and Mel's patches aren't perfect. Actually, last
> time I looked, they needed to keep at least 16MB of pages free to be
> reasonably effective (or do we just say that people with less than XMB
> of memory shouldn't be accessing these filesystems anyway?)
It'll work without adjusting min_free_kbytes at all. Keeping 16MB free gave
better results in the fragmentation stress tests, but the difference was a few
percent of memory allocatable as huge pages, not the whole thing falling apart.
The success rates were still way, way higher than with the vanilla kernel.
>, and I'm
> not sure if they have been tested for long term stability in the
> presence of a reasonable amount of higher order allocations.
>
I don't have a sample workload that has a reasonable amount of higher order
allocations over a longer period of time. When the next -mm comes out, SLUB will
be able to use high-order pages so I'll boot my machine with less memory to
pressure it more. Assuming the kernel boots on my desktop machine, I should
get some idea of what its long-term behaviour looks like.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 12:12 ` Christoph Lameter
2007-04-27 12:25 ` Nick Piggin
@ 2007-04-27 13:37 ` Christoph Hellwig
1 sibling, 0 replies; 235+ messages in thread
From: Christoph Hellwig @ 2007-04-27 13:37 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Paul Mackerras, Andrew Morton, David Chinner,
linux-kernel, Mel Gorman, William Lee Irwin III, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Fri, Apr 27, 2007 at 05:12:05AM -0700, Christoph Lameter wrote:
> Powerpc supports multiple pagesizes. Maybe we could make mmap use those
> page sizes some day if we had a variable order page cache. Your stance on
> the issue means that powerpc will be forever crippled and not be able to
> use its full potential.
You can already mmap using different pagesize on ppc64. A certain
infinibad adapter with interesting design choices requires 4k mappings
even on 64k kernels, and for some spu-related mappings on Cell we
want the same. I'm no expert on the powerpc mmu, but I'd be surprised
if 64k pagecache mappings wouldn't work on 4k base page size kernels.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 12:25 ` Nick Piggin
@ 2007-04-27 13:39 ` Christoph Hellwig
2007-04-28 2:27 ` Nick Piggin
2007-04-27 16:48 ` Christoph Lameter
1 sibling, 1 reply; 235+ messages in thread
From: Christoph Hellwig @ 2007-04-27 13:39 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, Paul Mackerras, Andrew Morton, David Chinner,
linux-kernel, Mel Gorman, William Lee Irwin III, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Fri, Apr 27, 2007 at 10:25:44PM +1000, Nick Piggin wrote:
> Linus's favourite jokes about powerpc mmu being crippled forever, aside ;)
Different mmu. The desktop 32bit mmu Linus referred to has almost nothing
in common with the mmu on 64bit systems.
> >Right, this could help, but it is not addressing the basic requirement for
> >devices that need large contiguous chunks of memory for I/O.
>
> Did you read the last paragraph? Or anything Andrew's been writing?
>
> "After that, I'd find it amusing if HBAs worth thousands of $ have
> trouble looking up sglists at the relatively glacial pace that IO
> requires, and/or can't spare a few more K for reasonable sglist
> sizes, but if that is really the case, then we could use iommus
> and/or just attempt to put physically contiguous pages in pagecache,
> rather than require it."
Real highend HBAs don't have that problem. But for example aacraid,
which is very common on mid-end servers, is a _lot_ faster when it
gets contiguous memory. Some benchmark was 10 or more percent faster
on Windows due to this.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 12:14 ` Paul Mackerras
2007-04-27 12:36 ` Nick Piggin
@ 2007-04-27 13:42 ` Christoph Hellwig
1 sibling, 0 replies; 235+ messages in thread
From: Christoph Hellwig @ 2007-04-27 13:42 UTC (permalink / raw)
To: Paul Mackerras
Cc: Nick Piggin, Andrew Morton, Christoph Lameter, David Chinner,
linux-kernel, Mel Gorman, William Lee Irwin III, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Fri, Apr 27, 2007 at 10:14:20PM +1000, Paul Mackerras wrote:
> It's not as simple on 64-bit powerpc with the hash table of course,
> because the page size is chosen at the segment (256MB) level,
> restricting where we can put 64k and 16M pages to some degree.
I think Christoph's variable order pagecache should be perfectly
fine on ppc64. We're selecting the pagesize on a per-file basis,
and the page size selection would choose which segment this mmap
gets into. Ben's get_unmapped_area changes are very helpful for that.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 6:55 ` Andrew Morton
` (2 preceding siblings ...)
2007-04-27 11:05 ` Paul Mackerras
@ 2007-04-27 13:44 ` William Lee Irwin III
2007-04-27 19:15 ` Andrew Morton
3 siblings, 1 reply; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-27 13:44 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, David Chinner, linux-kernel, Mel Gorman,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26, 2007 at 11:55:42PM -0700, Andrew Morton wrote:
> Please address my point: if in five years time x86 has larger or varible
> pagesize, this code will be a permanent millstone around our necks which we
> *should not have merged*.
> And if in five years time x86 does not have larger pagesize support then
> the manufacturers would have decided that 4k pages are not a performance
> problem, so we again should not have merged this code.
So the verdict is wait 5 years, see if x86 did anything, and so on.
Has anyone else noticed our embedded arch installed base dwarfs our
x86 and "enterprise" installed bases combined? What are our priorities
that make designing the core around x86 meaningful again? Pipe dreams
of competing with Windows on the desktop? Optimizing for kernel
compiles on kernel hackers' workstations? Something should seriously
be reevaluated there at some point.
As x86 is now the priority regardless, maybe checking in with Intel and
AMD as far as what they'd like to see happen would be enlightening. It
may be that some things are deadlocked on OS use cases.
Also, is there something in particular that should be done for the case
of x86 acquiring a variable pagesize?
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 10:26 ` Nick Piggin
@ 2007-04-27 13:51 ` Eric W. Biederman
0 siblings, 0 replies; 235+ messages in thread
From: Eric W. Biederman @ 2007-04-27 13:51 UTC (permalink / raw)
To: Nick Piggin
Cc: Jens Axboe, Christoph Lameter, Christoph Hellwig, David Chinner,
linux-kernel, Mel Gorman, William Lee Irwin III,
Badari Pulavarty, Maxim Levitsky
Nick Piggin <nickpiggin@yahoo.com.au> writes:
> Eric W. Biederman wrote:
>> Jens Axboe <jens.axboe@oracle.com> writes:
>
>>>Yes, that is exactly the problem. Once you have that, pktcdvd is pretty
>>>much reduced to setup and init code, the actual data handling can be
>>>done by sr or ide-cd directly. You could merge it into cdrom.c, it would
>>>not be very different from mt-rainier handling (which basically does RMW
>>>in firmware, so it works for any write, but performance is of course
>>>horrible if you don't do it right).
>>
>>
>> Thanks for the clarification.
>>
>> So we do have a clear problem that we do not have generic support for
>> large sector sizes residing in the page cache.
>
> Well, it is a clear limitation. It hasn't mattered too much until
> now, but it is one of the other issues that SGI hit (aside from
> io efficiency) because they have 16K filesystems created on ia64
> systems that I believe they want to access with x86-64 systems.
I think the current pktcdvd story is a better argument. There is real
hardware with a > 4K sector size. Of course, once we support that
class of hardware, supporting filesystems with a large block size will
also be straightforward.
> I'm slowly looking at patches in the background, but I'm hoping to
> be able to spend a decent chunk of time working on them again soon.
>
> It isn't trivial :)
I guess it depends on how you look at it.
If we can drop the assumption that large sector sizes are virtually
contiguous I expect things will be closer to trivial.
If we can do a page group thing where we keep all of the I/O state on
the first cache page I expect things won't be too bad.
I do seem to see some VM changes being needed for allocating and freeing
several pages together.
I also see an opportunity in allocating several pages at once. We
could make it one call that returns a vector of pages, and the page
allocator could satisfy our request with a high order page split into
individual pages if one was available. Then the I/O layer would have
to notice that we are giving it several page structs that are
physically contiguous.
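A rough sketch of what such a call might look like (a hypothetical helper,
not something from this patch set; it leans on the existing alloc_pages()
and split_page() primitives and simply prefers one contiguous high-order
run before falling back to single pages):

#include <linux/mm.h>
#include <linux/gfp.h>

static int alloc_page_vector(struct page **pages, unsigned int nr, gfp_t gfp)
{
	unsigned int order = get_order(nr << PAGE_SHIFT);
	struct page *page;
	unsigned int i;

	/* First try one physically contiguous high-order allocation. */
	page = alloc_pages(gfp | __GFP_NOWARN, order);
	if (page) {
		split_page(page, order);	/* each 4k page becomes independently freeable */
		for (i = 0; i < nr; i++)
			pages[i] = page + i;
		for (; i < (1U << order); i++)
			__free_page(page + i);	/* trim the power-of-two round up */
		return 0;
	}

	/* Fall back to order-0 pages; contiguity becomes best effort only. */
	for (i = 0; i < nr; i++) {
		pages[i] = alloc_page(gfp);
		if (!pages[i])
			goto fail;
	}
	return 0;

fail:
	while (i--)
		__free_page(pages[i]);
	return -ENOMEM;
}

The I/O layer could then check page_to_pfn(pages[i]) + 1 ==
page_to_pfn(pages[i + 1]) while building the sglist and merge physically
adjacent entries into a single segment.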
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 13:06 ` Nick Piggin
@ 2007-04-27 14:49 ` William Lee Irwin III
0 siblings, 0 replies; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-27 14:49 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, Christoph Lameter, linux-kernel, Mel Gorman,
David Chinner, Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Fri, Apr 27, 2007 at 08:22:07PM +1000, Nick Piggin wrote:
>>> Just a random aside question... doesn't Oracle db do direct IO from
>>> hugepages?
William Lee Irwin III wrote:
>> If and when configured to use direct IO and hugepages, yes.
On Fri, Apr 27, 2007 at 11:06:36PM +1000, Nick Piggin wrote:
> Sweet. I wonder if you would see much improvement from allowing more
> than 128 sglist entries, then.
There should be some. Oracle does its own IO scheduling, so it's able
to submit relatively large IO's (IIRC also configurable). I'm not sure
what's done for the non-aio case, but with aio it might be tricky to catch
the vectoring, with all the blocking on get_request_wait() and the way the
vector elements are handled there. So turning off aio may also help.
Another useful idea is to look at the results from pagemap patches.
William Lee Irwin III wrote:
> >It's also
> >noteworthy that Oracle has more code than its database.
>
On Fri, Apr 27, 2007 at 11:06:36PM +1000, Nick Piggin wrote:
> Anything noteworthy in this context, that you care to note?
The RMAN backup utility is often used to back up Oracle databases
and there are always complaints when backups disturb anything.
EM (enterprise management stuff), RAC (clustering affairs), various
database applications (typically in-depth for whatever field they're
pertinent to), ASM middleware (abstraction layer for database file
access), and more are out there.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 7:26 ` Andrew Morton
2007-04-27 8:37 ` David Chinner
2007-04-27 12:01 ` Christoph Lameter
@ 2007-04-27 16:36 ` David Chinner
2007-04-27 17:34 ` David Chinner
2 siblings, 1 reply; 235+ messages in thread
From: David Chinner @ 2007-04-27 16:36 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, David Chinner, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Fri, Apr 27, 2007 at 12:26:40AM -0700, Andrew Morton wrote:
> On Fri, 27 Apr 2007 00:19:49 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
>
> > The page cache handling in the various layers is significantly
> > simplified which reduces maintenance cost.
>
> How on earth can the *addition* of variable pagecache size simplify the
> existing code?
>
> What cleanups are in this patchset which cannot be made *without* the
> addition of variable pagecache size?
I think this is the cleanup of all the open coded masking and offset
to index type of operations that get done over and over again everywhere.
> > Dave, where are we with the performance tests?
>
> Well yes.
Backed up behind real work ;)
The test was writing a single 50GB file to a fresh filesystem, and
then reading it back. Run on two different dm stripes - a 4-disk
RAID0 and an 8-disk RAID0 stripe, with a stripe unit of 512k. Disks
are 10krpm SAS, external jbod on PCI-X good for ~850MB/s read and
~750MB/s write. Server is 4p intel x86_64 with 16GB RAM.
                   READ          WRITE
blksz  disks    tput   sys    tput   sys
-----  -----   -----  ----   -----  ----
   4k      4     332   35s     203   76s
  64k      4     173   20s     273   21s
   4k      8     403   35s     443   76s
  64k      8     634   21s     540   21s
Throughput in MB/s.
So, there's some interaction between reads and large block size;
it may be related to readahead, but I haven't looked into the I/O
patterns at this stage.
More results soonish.....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 8:48 ` Andrew Morton
@ 2007-04-27 16:45 ` Theodore Tso
2007-05-04 13:33 ` Eric W. Biederman
2007-05-04 12:57 ` Eric W. Biederman
1 sibling, 1 reply; 235+ messages in thread
From: Theodore Tso @ 2007-04-27 16:45 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Fri, Apr 27, 2007 at 01:48:49AM -0700, Andrew Morton wrote:
> And other filesystems (ie: ext4) _might_ use it. But ext4 is extent-based,
> so perhaps it's not worth churning the on-disk format to get a bit of a
> boost in the block allocator.
Well, ext3 could definitely use it; there are people using 8k and 16k
blocksizes on ia64 systems today. Those filesystems can't be mounted
on x86 or x86_64 systems because our pagesize is 4k, though.
And I imagine that ext4 might want to use a large blocksize too ---
after all, XFS is extent based as well, and not _all_ of the
advantages of using a larger blocksize are related to brain-damaged
storage subsystems with short SG list support. Whether the advantages
offset the internal fragmentation overhead or the complexity of adding
fragments support is a different question, of course.
So while the jury is out about how many other filesystems might use
it, I suspect it's more than you might think. At the very least,
there may be some IA64 users who might be trying to transition their
way to x86_64, and have existing filesystems using an 8k or 16k
block size. :-)
- Ted
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 12:25 ` Nick Piggin
2007-04-27 13:39 ` Christoph Hellwig
@ 2007-04-27 16:48 ` Christoph Lameter
1 sibling, 0 replies; 235+ messages in thread
From: Christoph Lameter @ 2007-04-27 16:48 UTC (permalink / raw)
To: Nick Piggin
Cc: Paul Mackerras, Andrew Morton, David Chinner, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Fri, 27 Apr 2007, Nick Piggin wrote:
> Linus's favourite jokes about powerpc mmu being crippled forever, aside ;)
>
> This seems like just speculation. I would not be against something without
> which some relevant hardware would be "crippled", but you are just handwaving
> at this point. And you are still ignoring the alternatives.
This crippling also applies to IA64. The crippling is a concern of
multiple arches and its effects have been well documented. Talk to Peter
Chubb for example. The requirement is to be able to do I/O to large
contiguous sections of memory. Your proposals are ignoring the
requirements.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 5:15 ` Andrew Morton
2007-04-27 5:49 ` Christoph Lameter
2007-04-27 6:09 ` David Chinner
@ 2007-04-27 16:55 ` Theodore Tso
2007-04-27 17:32 ` Nicholas Miell
2 siblings, 1 reply; 235+ messages in thread
From: Theodore Tso @ 2007-04-27 16:55 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Thu, Apr 26, 2007 at 10:15:28PM -0700, Andrew Morton wrote:
> And hardware gets better. If Intel & AMD come out with a 16k pagesize
> option in a couple of years we'll look pretty dumb. If the problems which
> you're presently having with that controller get sorted out in the next
> generation of the hardware, we'll also look pretty dumb.
Unfortunately, this isn't a problem with hardware getting better, but
a willingness to break backwards compatibility.
x86_64 uses a 4k page size to avoid breaking 32-bit applications. And
unfortunately, iirc, even 64-bit applications are continuing to depend
on 4k page alignments for things like the text and bss segments. If
the userspace ELF and other compiler/linker specifications were
appropriately written so they could handle 16k pagesizes, maybe 5 years
from now we could move to a 16k pagesize. But this is going to
require some coordination between the userspace binutils folks and
AMD/Intel in order to plan such a migration.
- Ted
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 16:55 ` Theodore Tso
@ 2007-04-27 17:32 ` Nicholas Miell
2007-04-27 18:12 ` William Lee Irwin III
0 siblings, 1 reply; 235+ messages in thread
From: Nicholas Miell @ 2007-04-27 17:32 UTC (permalink / raw)
To: Theodore Tso
Cc: Andrew Morton, David Chinner, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Fri, 2007-04-27 at 12:55 -0400, Theodore Tso wrote:
> On Thu, Apr 26, 2007 at 10:15:28PM -0700, Andrew Morton wrote:
> > And hardware gets better. If Intel & AMD come out with a 16k pagesize
> > option in a couple of years we'll look pretty dumb. If the problems which
> > you're presently having with that controller get sorted out in the next
> > generation of the hardware, we'll also look pretty dumb.
>
> Unfortunately, this isn't a problem with hardware getting better, but
> a willingness to break backwards compatibility.
>
> x86_64 uses a 4k page size to avoid breaking 32-bit applications. And
> unfortunately, iirc, even 64-bit applications are continuing to depend
> on 4k page alignments for things like the text and bss segments. If
> the userspace ELF and other compiler/linker specifications were
> appropriately written so they could handle 16k pagesizes, maybe 5 years
> from now we could move to a 16k pagesize. But this is going to
> require some coordination between the userspace binutils folks and
> AMD/Intel in order to plan such a migration.
>
> - Ted
The AMD64 psABI requires binaries to work with any page size up to 64k.
Whether that's true in practice is another matter entirely, of course.
--
Nicholas Miell <nmiell@comcast.net>
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 16:36 ` David Chinner
@ 2007-04-27 17:34 ` David Chinner
2007-04-27 19:11 ` Andrew Morton
0 siblings, 1 reply; 235+ messages in thread
From: David Chinner @ 2007-04-27 17:34 UTC (permalink / raw)
To: David Chinner
Cc: Andrew Morton, Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Sat, Apr 28, 2007 at 02:36:20AM +1000, David Chinner wrote:
> The test was writing a single 50GB file to a fresh filesystem, and
> then reading it back. Run on two different dm stripes - a 4-disk
> RAID0 and an 8-disk RAID0 stripe, with a stripe unit of 512k. Disks
> are 10krpm SAS, external jbod on PCI-X good for ~850MB/s read and
> ~750MB/s write. Server is 4p intel x86_64 with 16GB RAM.
>
> READ WRITE
> blksz disks tput sys tput sys
> ----- ----- ----- ---- ----- ----
> 4k 4 332 35s 203 76s
> 64k 4 173 20s 273 21s
> 4k 8 403 35s 443 76s
> 64k 8 634 21s 540 21s
>
> Throughput in MB/s.
Some more information - stripe unit on the dm raid0 is 512k.
I have not attempted to increase I/O sizes at all yet - these tests are
just demonstrating efficiency improvements in the filesystem.
These numbers for 32GB files.
                   READ          WRITE
disks  blksz    tput   sys    tput   sys
-----  -----   -----  ----   -----  ----
    1     4k      89   18s      57   44s
    1    16k      46   13s      67   18s
    1    64k      75   12s      68   12s
    2     4k     179   20s     114   43s
    2    16k      55   13s     132   18s
    2    64k     126   12s     126   12s
    4     4k     350   20s     214   43s
    4    16k     350   14s     264   19s
    4    64k     176   11s     266   12s
    8     4k     415   21s     446   41s
    8    16k     655   13s     518   19s
    8    64k     664   12s     552   12s
   12     4k     413   20s     633   33s
   12    16k     736   14s     741   19s
   12    64k     836   12s     743   12s
Throughput in MB/s.
Consistent improvement across the write results, first time
I've hit the limits of the PCI-X bus with a single buffered
I/O thread doing either reads or writes.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 17:32 ` Nicholas Miell
@ 2007-04-27 18:12 ` William Lee Irwin III
0 siblings, 0 replies; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-27 18:12 UTC (permalink / raw)
To: Nicholas Miell
Cc: Theodore Tso, Andrew Morton, David Chinner, clameter,
linux-kernel, Mel Gorman, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Fri, 2007-04-27 at 12:55 -0400, Theodore Tso wrote:
>> Unfortunately, this isn't a problem with hardware getting better, but
>> a willingness to break backwards compatibility.
>> x86_64 uses a 4k page size to avoid breaking 32-bit applications. And
>> unfortunately, iirc, even 64-bit applications are continuing to depend
>> on 4k page alignments for things like the text and bss segments. If
>> the userspace ELF and other compiler/linker specifications were
>> appropriately written so they could handle 16k pagesizes, maybe 5 years
>> from now we could move to a 16k pagesize. But this is going to
>> require some coordination between the userspace binutils folks and
>> AMD/Intel in order to plan such a migration.
On Fri, Apr 27, 2007 at 10:32:07AM -0700, Nicholas Miell wrote:
> The AMD64 psABI requires binaries to work with any page size up to 64k.
> Whether that's true in practice is another matter entirely, of course.
64-bit applications are a non-issue. The ABI requires them to handle it.
It's 32-bit applications on x86-64 that are the concern. ABI emulation
for them is more involved when 4K 32-bit ABI's are to be emulated on
kernels compiled for larger native pagesizes. In practice, so many use
getpagesize() it may not be much of an issue, but consider WINE and
other sorts of non-Linux-native 32-bit apps, among other issues.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 17:34 ` David Chinner
@ 2007-04-27 19:11 ` Andrew Morton
2007-04-28 1:43 ` Nick Piggin
2007-04-28 3:17 ` David Chinner
0 siblings, 2 replies; 235+ messages in thread
From: Andrew Morton @ 2007-04-27 19:11 UTC (permalink / raw)
To: David Chinner
Cc: Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky, Nick Piggin
On Sat, 28 Apr 2007 03:34:32 +1000 David Chinner <dgc@sgi.com> wrote:
> Some more information - stripe unit on the dm raid0 is 512k.
> I have not attempted to increase I/O sizes at all yet - these tests are
> just demonstrating efficiency improvements in the filesystem.
>
> These numbers for 32GB files.
>
> READ WRITE
> disks blksz tput sys tput sys
> ----- ----- ----- ---- ----- ----
> 1 4k 89 18s 57 44s
> 1 16k 46 13s 67 18s
> 1 64k 75 12s 68 12s
> 2 4k 179 20s 114 43s
> 2 16k 55 13s 132 18s
> 2 64k 126 12s 126 12s
> 4 4k 350 20s 214 43s
> 4 16k 350 14s 264 19s
> 4 64k 176 11s 266 12s
> 8 4k 415 21s 446 41s
> 8 16k 655 13s 518 19s
> 8 64k 664 12s 552 12s
> 12 4k 413 20s 633 33s
> 12 16k 736 14s 741 19s
> 12 64k 836 12s 743 12s
>
> Throughput in MB/s.
>
>
> Consistent improvement across the write results, first time
> I've hit the limits of the PCI-X bus with a single buffered
> I/O thread doing either reads or writes.
1-disk and 2-disk read throughput fell by an improbable amount, which makes
me cautious about the other numbers.
Your annotation says "blocksize". Are you really varying the fs blocksize
here, or did you mean "pagesize"?
What worries me here is that we have inefficient code, and increasing the
pagesize amortises that inefficiency without curing it.
If so, it would be better to fix the inefficiencies, so that 4k pagesize
will also benefit.
For example, see __do_page_cache_readahead(). It does a read_lock() and a
page allocation and a radix-tree lookup for each page. We can vastly
improve that.
Step 1:
- do a read-lock
- do a radix-tree walk to work out how many pages are missing
- read-unlock
- allocate that many pages
- read_lock()
- populate all the pages.
- read_unlock
- if any pages are left over, free them
- if we ended up not having enough pages, redo the whole thing.
that will reduce the number of read_lock()s, read_unlock()s and radix-tree
descents by a factor of 32 or so in this testcase. That's a lot, and it's
something we (Nick ;)) should have done ages ago.
Step 2 is pretty obvious: __do_page_cache_readahead() is now in an ideal
position to parse its list of missing-pgoff_t's and to perform higher-order
allocations to satisfy any power-of-2-sized-and-aligned holes in the
pagecache. Fix up your lameo HBA for reads.
Step 1 is a glaring inefficiency, which large PAGE_CACHE_SIZE attempts to
work around for some subset of cases. It's better to fix the inefficiency
at its core. There are others. Kernel profiles, please.
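A sketch of the first pass, just to make the shape concrete (the locking
shown assumes the current rwlock tree_lock; rechecking for races at
insertion time and the "redo if we ran short" step are left out):

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/radix-tree.h>

static unsigned long count_missing_pages(struct address_space *mapping,
					 pgoff_t start, unsigned long nr)
{
	unsigned long i, missing = 0;

	read_lock_irq(&mapping->tree_lock);
	for (i = 0; i < nr; i++)
		if (!radix_tree_lookup(&mapping->page_tree, start + i))
			missing++;
	read_unlock_irq(&mapping->tree_lock);

	/* The caller then allocates 'missing' pages with
	 * page_cache_alloc_cold(), builds the page list for read_pages(),
	 * frees any leftovers, and redoes the walk if it came up short. */
	return missing;
}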
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 13:44 ` William Lee Irwin III
@ 2007-04-27 19:15 ` Andrew Morton
2007-04-28 2:21 ` William Lee Irwin III
0 siblings, 1 reply; 235+ messages in thread
From: Andrew Morton @ 2007-04-27 19:15 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Christoph Lameter, David Chinner, linux-kernel, Mel Gorman,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Fri, 27 Apr 2007 06:44:51 -0700 William Lee Irwin III <wli@holomorphy.com> wrote:
> On Thu, Apr 26, 2007 at 11:55:42PM -0700, Andrew Morton wrote:
> > Please address my point: if in five years time x86 has larger or varible
> > pagesize, this code will be a permanent millstone around our necks which we
> > *should not have merged*.
> > And if in five years time x86 does not have larger pagesize support then
> > the manufacturers would have decided that 4k pages are not a performance
> > problem, so we again should not have merged this code.
>
> So the verdict is wait 5 years, see if x86 did anything, and so on.
You missed the bit about "evaluate alternatives".
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 19:11 ` Andrew Morton
@ 2007-04-28 1:43 ` Nick Piggin
2007-04-28 8:04 ` Peter Zijlstra
2007-04-28 3:17 ` David Chinner
1 sibling, 1 reply; 235+ messages in thread
From: Nick Piggin @ 2007-04-28 1:43 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
Andrew Morton wrote:
> On Sat, 28 Apr 2007 03:34:32 +1000 David Chinner <dgc@sgi.com> wrote:
>
>
>>Some more information - stripe unit on the dm raid0 is 512k.
>>I have not attempted to increase I/O sizes at all yet - these tests are
>>just demonstrating efficiency improvements in the filesystem.
>>
>>These numbers for 32GB files.
>>
>> READ WRITE
>>disks blksz tput sys tput sys
>>----- ----- ----- ---- ----- ----
>> 1 4k 89 18s 57 44s
>> 1 16k 46 13s 67 18s
>> 1 64k 75 12s 68 12s
>> 2 4k 179 20s 114 43s
>> 2 16k 55 13s 132 18s
>> 2 64k 126 12s 126 12s
>> 4 4k 350 20s 214 43s
>> 4 16k 350 14s 264 19s
>> 4 64k 176 11s 266 12s
>> 8 4k 415 21s 446 41s
>> 8 16k 655 13s 518 19s
>> 8 64k 664 12s 552 12s
>> 12 4k 413 20s 633 33s
>> 12 16k 736 14s 741 19s
>> 12 64k 836 12s 743 12s
>>
>>Throughput in MB/s.
>>
>>
>>Consistent improvement across the write results, first time
>>I've hit the limits of the PCI-X bus with a single buffered
>>I/O thread doing either reads or writes.
>
>
> 1-disk and 2-disk read throughput fell by an improbable amount, which makes
> me cautious about the other numbers.
>
> Your annotation says "blocksize". Are you really varying the fs blocksize
> here, or did you mean "pagesize"?
>
> What worries me here is that we have inefficient code, and increasing the
> pagesize amortises that inefficiency without curing it.
>
> If so, it would be better to fix the inefficiencies, so that 4k pagesize
> will also benefit.
>
> For example, see __do_page_cache_readahead(). It does a read_lock() and a
> page allocation and a radix-tree lookup for each page. We can vastly
> improve that.
>
> Step 1:
>
> - do a read-lock
>
> - do a radix-tree walk to work out how many pages are missing
>
> - read-unlock
>
> - allocate that many pages
>
> - read_lock()
>
> - populate all the pages.
>
> - read_unlock
>
> - if any pages are left over, free them
>
> - if we ended up not having enough pages, redo the whole thing.
>
> that will reduce the number of read_lock()s, read_unlock()s and radix-tree
> descents by a factor of 32 or so in this testcase. That's a lot, and it's
> something we (Nick ;)) should have done ages ago.
We can do pretty well with the lockless radix tree (that is already upstream)
there. I split that stuff out of my most recent lockless pagecache patchset,
because it doesn't require the "scary" speculative refcount stuff of the
lockless pagecache proper. Subject: [patch 5/9] mm: lockless probe.
So that is something we could merge pretty soon.
The other thing is that we can batch up pagecache page insertions for bulk
writes as well (that is, write(2) with a buffer size > page size). I should
have a patch somewhere for that as well if anyone is interested.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 19:15 ` Andrew Morton
@ 2007-04-28 2:21 ` William Lee Irwin III
0 siblings, 0 replies; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-28 2:21 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, David Chinner, linux-kernel, Mel Gorman,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26, 2007 at 11:55:42PM -0700, Andrew Morton wrote:
>>> Please address my point: if in five years time x86 has larger or varible
>>> pagesize, this code will be a permanent millstone around our necks which we
>>> *should not have merged*.
>>> And if in five years time x86 does not have larger pagesize support then
>>> the manufacturers would have decided that 4k pages are not a performance
>>> problem, so we again should not have merged this code.
On Fri, 27 Apr 2007 06:44:51 -0700 William Lee Irwin III <wli@holomorphy.com> wrote:
>> So the verdict is wait 5 years, see if x86 did anything, and so on.
On Fri, Apr 27, 2007 at 12:15:57PM -0700, Andrew Morton wrote:
> You missed the bit about "evaluate alternatives".
No worries. I'm used to being on the wrong side of things. I'll have
no trouble picking out the alternative least likely to be accepted. ;)
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 13:39 ` Christoph Hellwig
@ 2007-04-28 2:27 ` Nick Piggin
2007-04-28 2:39 ` William Lee Irwin III
2007-04-28 8:16 ` Christoph Hellwig
0 siblings, 2 replies; 235+ messages in thread
From: Nick Piggin @ 2007-04-28 2:27 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Christoph Lameter, Paul Mackerras, Andrew Morton, David Chinner,
linux-kernel, Mel Gorman, William Lee Irwin III, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Christoph Hellwig wrote:
> On Fri, Apr 27, 2007 at 10:25:44PM +1000, Nick Piggin wrote:
>
>>Linus's favourite jokes about powerpc mmu being crippled forever, aside ;)
>
>
> Different mmu. The desktop 32bit mmu Linus refered to has almost nothing
> in common with the mmu on 64bit systems.
Well I wasn't trying to make a point there so it isn't a big deal... but
he has been known to say the 64-bit hash table is insane or broken. If he's
since recanted, I'd be interested to read the post :)
>>>Right, this could help, but it is not addressing the basic requirement for
>>>devices that need large contiguous chunks of memory for I/O.
>>
>>Did you read the last paragraph? Or anything Andrew's been writing?
>>
>> "After that, I'd find it amusing if HBAs worth thousands of $ have
>> trouble looking up sglists at the relatively glacial pace that IO
>> requires, and/or can't spare a few more K for reasonable sglist
>> sizes, but if that is really the case, then we could use iommus
>> and/or just attempt to put physically contiguous pages in pagecache,
>> rather than require it."
>
>
> Real highend HBAs don't have that problem. But for example aacraid,
> which is very common on mid-end servers, is a _lot_ faster when it
> gets contiguous memory. Some benchmark was 10 or more percent faster
> on Windows due to this.
And that wasn't due to the 128 sg limit?
I guess 10% isn't a small amount. Though it would be nice to have
before/after numbers for Linux. And, like Andrew was saying, we could
just _attempt_ to put contiguous pages in pagecache rather than
_require_ it. Which is still robust under fragmentation, and benefits
everyone, not just files with a large pagecache size.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 2:27 ` Nick Piggin
@ 2007-04-28 2:39 ` William Lee Irwin III
2007-04-28 2:50 ` Nick Piggin
2007-04-28 8:16 ` Christoph Hellwig
1 sibling, 1 reply; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-28 2:39 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Hellwig, Christoph Lameter, Paul Mackerras,
Andrew Morton, David Chinner, linux-kernel, Mel Gorman,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
On Sat, Apr 28, 2007 at 12:27:45PM +1000, Nick Piggin wrote:
> I guess 10% isn't a small amount. Though it would be nice to have
> before/after numbers for Linux. And, like Andrew was saying, we could
> just _attempt_ to put contiguous pages in pagecache rather than
> _require_ it. Which is still robust under fragmentation, and benefits
> everyone, not just files with a large pagecache size.
What sort of strategy do you intend to use to speculatively populate
the pagecache with contiguous pages?
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 2:39 ` William Lee Irwin III
@ 2007-04-28 2:50 ` Nick Piggin
2007-04-28 3:16 ` William Lee Irwin III
0 siblings, 1 reply; 235+ messages in thread
From: Nick Piggin @ 2007-04-28 2:50 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Christoph Hellwig, Christoph Lameter, Paul Mackerras,
Andrew Morton, David Chinner, linux-kernel, Mel Gorman,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
William Lee Irwin III wrote:
> On Sat, Apr 28, 2007 at 12:27:45PM +1000, Nick Piggin wrote:
>
>>I guess 10% isn't a small amount. Though it would be nice to have
>>before/after numbers for Linux. And, like Andrew was saying, we could
>>just _attempt_ to put contiguous pages in pagecache rather than
>>_require_ it. Which is still robust under fragmentation, and benefits
>>everyone, not just files with a large pagecache size.
>
>
> What sort of strategy do you intend to use to speculatively populate
> the pagecache with contiguous pages?
Andrew outlined it.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 2:50 ` Nick Piggin
@ 2007-04-28 3:16 ` William Lee Irwin III
0 siblings, 0 replies; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-28 3:16 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Hellwig, Christoph Lameter, Paul Mackerras,
Andrew Morton, David Chinner, linux-kernel, Mel Gorman,
Jens Axboe, Badari Pulavarty, Maxim Levitsky
William Lee Irwin III wrote:
>> What sort of strategy do you intend to use to speculatively populate
>> the pagecache with contiguous pages?
On Sat, Apr 28, 2007 at 12:50:26PM +1000, Nick Piggin wrote:
> Andrew outlined it.
I'd like to suggest a few straightforward additions to the proposal:
(1) the interface to the page allocator tries to allocate N pages where
(a) N is a power of 2
(b) some effort is made to get contiguity
(c) some effort is made to fall back to lesser contiguity
(d) some effort is made to get N pages even with no contiguity
(2) a corresponding group freeing interface to the page allocator
(3) Pass the pages around in a list or similar so that O(1) instead of
O(pages) splice operations under the lock suffice for passing
them around. Dissecting compound pages outside locks helps.
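A minimal sketch of (2) and (3) (helper names are hypothetical): pages
ride on their page->lru list heads, so a whole group can be handed off
with one O(1) splice and freed as a group afterwards:

#include <linux/list.h>
#include <linux/mm.h>

/* O(1) handoff of a whole group, regardless of how many pages it holds. */
static void hand_off_page_group(struct list_head *from, struct list_head *to)
{
	list_splice_init(from, to);
}

static void free_page_group(struct list_head *group)
{
	struct page *page, *next;

	list_for_each_entry_safe(page, next, group, lru) {
		list_del(&page->lru);
		__free_page(page);
	}
}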
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 19:11 ` Andrew Morton
2007-04-28 1:43 ` Nick Piggin
@ 2007-04-28 3:17 ` David Chinner
2007-04-28 3:49 ` Christoph Lameter
2007-04-28 4:56 ` Andrew Morton
1 sibling, 2 replies; 235+ messages in thread
From: David Chinner @ 2007-04-28 3:17 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky, Nick Piggin
On Fri, Apr 27, 2007 at 12:11:08PM -0700, Andrew Morton wrote:
> On Sat, 28 Apr 2007 03:34:32 +1000 David Chinner <dgc@sgi.com> wrote:
>
> > Some more information - stripe unit on the dm raid0 is 512k.
> > I have not attempted to increase I/O sizes at all yet - these tests are
> > just demonstrating efficiency improvements in the filesystem.
> >
> > These numbers for 32GB files.
> >
> > READ WRITE
> > disks blksz tput sys tput sys
> > ----- ----- ----- ---- ----- ----
> > 1 4k 89 18s 57 44s
> > 1 16k 46 13s 67 18s
> > 1 64k 75 12s 68 12s
> > 2 4k 179 20s 114 43s
> > 2 16k 55 13s 132 18s
> > 2 64k 126 12s 126 12s
> > 4 4k 350 20s 214 43s
> > 4 16k 350 14s 264 19s
> > 4 64k 176 11s 266 12s
> > 8 4k 415 21s 446 41s
> > 8 16k 655 13s 518 19s
> > 8 64k 664 12s 552 12s
> > 12 4k 413 20s 633 33s
> > 12 16k 736 14s 741 19s
> > 12 64k 836 12s 743 12s
> >
> > Throughput in MB/s.
> >
> >
> > Consistent improvement across the write results, first time
> > I've hit the limits of the PCI-X bus with a single buffered
> > I/O thread doing either reads or writes.
>
> 1-disk and 2-disk read throughput fell by an improbable amount, which makes
> me cautious about the other numbers.
For read, yes, and it's because something is going wrong with the
I/O size - it looks like readahead thrashing of some kind, even
with the 4k page tests.
When I bumped the block device readahead from 256 -> 2048,
the single disk read numbers went 60, 75, 75MB/s for 4->64k block sizes
and were repeatable, so we definitely have some interaction with readahead.
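For reference, that bump is the same knob blockdev --setra turns, i.e. the
BLKRASET ioctl; the value is in 512-byte sectors, so 2048 means 1MB of
readahead. A small standalone example:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	unsigned long ra;
	int fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <blockdev> <readahead-sectors>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	ra = strtoul(argv[2], NULL, 0);
	if (ioctl(fd, BLKRASET, ra) < 0) {
		perror("BLKRASET");
		return 1;
	}
	return 0;
}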
> Your annotation says "blocksize". Are you really varying the fs blocksize
> here, or did you mean "pagesize"?
Filesystem blocksize, as specified by mkfs.xfs. Which, in turn,
changes the page cache order.
> What worries me here is that we have inefficient code, and increasing the
> pagesize amortises that inefficiency without curing it.
Increasing the filesystem block size also reduces the overhead of
the filesystem, not just the page cache. A lot of the overhead (write
especially) reductions are going to be filesystem block size
related, so I wouldn't start assuming that it's just the page cache
changes that have brought about these system time reductions.
> If so, it would be better to fix the inefficiencies, so that 4k pagesize
> will also benefit.
>
> For example, see __do_page_cache_readahead(). It does a read_lock() and a
> page allocation and a radix-tree lookup for each page. We can vastly
> improve that.
....
Sure but that's a different problem to what we are trying to solve
now. Even with this in place, I think we'd still realise
improvements with the compound pages....
> Fix up your lameo HBA for reads.
Where did that come from? You spent 20 lines describing the inefficiencies
of the readahead in the page cache and how it should be fixed, but then you
turn around and say fix the HBA?
This test was constructed to keep the I/O sizes within the current
bounds, so the HBA sees no difference in I/O sizes as the filesystem
block size changes, i.e. the HBA is a constant factor during the
tests. IOWs, the changes in numbers above are purely a result of the
page cache and filesystem changes....
And besides, the "lameo HBA" I'm using is clearly limited by the
PCI-X bus it's on, not the size and type of pages being thrown at it
by the I/O layers. The hardware is pretty much irrelevant in these
tests....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 3:17 ` David Chinner
@ 2007-04-28 3:49 ` Christoph Lameter
2007-04-28 4:56 ` Andrew Morton
1 sibling, 0 replies; 235+ messages in thread
From: Christoph Lameter @ 2007-04-28 3:49 UTC (permalink / raw)
To: David Chinner
Cc: Andrew Morton, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky, Nick Piggin
On Sat, 28 Apr 2007, David Chinner wrote:
> > 1-disk and 2-disk read throughput fell by an improbable amount, which makes
> > me cautious about the other numbers.
>
> For read, yes, and it's because something is going wrong with the
> I/O size - it looks like readahead thrashing of some kind even
> with 4k pages tests.
Yup. I seem to have a problem in that area with my patches. Somehow the
number of pages is shifted by the page order. I do not completely understand
what is going on there yet.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 3:17 ` David Chinner
2007-04-28 3:49 ` Christoph Lameter
@ 2007-04-28 4:56 ` Andrew Morton
2007-04-28 5:08 ` Christoph Lameter
2007-04-28 9:43 ` Alan Cox
1 sibling, 2 replies; 235+ messages in thread
From: Andrew Morton @ 2007-04-28 4:56 UTC (permalink / raw)
To: David Chinner
Cc: Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky, Nick Piggin
On Sat, 28 Apr 2007 13:17:40 +1000 David Chinner <dgc@sgi.com> wrote:
> > Fix up your lameo HBA for reads.
>
> Where did that come from? You spent 20 lines describing the inefficiencies
> of the readahead in the page cache and how it should be fixed, but then you
> turn around and say fix the HBA?
My (repeated) point is that if we populate pagecache with physically-contiguous 4k
pages in this manner then bio+block will be able to create much larger SG lists.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 4:56 ` Andrew Morton
@ 2007-04-28 5:08 ` Christoph Lameter
2007-04-28 5:36 ` Andrew Morton
2007-04-28 9:43 ` Alan Cox
1 sibling, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-28 5:08 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky, Nick Piggin
On Fri, 27 Apr 2007, Andrew Morton wrote:
> My (repeated) point is that if we populate pagecache with physically-contiguous 4k
> pages in this manner then bio+block will be able to create much larger SG lists.
True but the "if" becomes exceedingly rare the longer the system was in
operation. 64k implies 16 pages in sequence. This is going to be a bit
difficult to get. Then there is the overhead of handling these pages.
Which may be not significant given growing processor capabilities in some
usage cases. In others like a synchronized application running on a large
number of nodes this is likely introduce random delays between processor
to processor communication that will significantly impair performance.
And then there is the long list of features that cannot be accomplished
with such an approach like mounting a volume with large block size,
handling CD/DVDs, getting rid of various shim layers etc.
I'd also like to have much higher orders of allocations for scientific
applications that require an extremely large I/O rate. For those we
could f.e. dedicate memory nodes that will only use a very high page
order to prevent fragmentation. E.g. 1G pages is certainly something that
lots of our customers would find beneficial (and they are actually
already using those types of pages in the form of huge pages but with
limited capabilities).
But then we are sadly again trying to find another workaround that
will not get us there and will not allow the flexibility in the
VM that would make things much easier for lots of usage scenarios.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 5:08 ` Christoph Lameter
@ 2007-04-28 5:36 ` Andrew Morton
2007-04-28 6:24 ` Christoph Lameter
0 siblings, 1 reply; 235+ messages in thread
From: Andrew Morton @ 2007-04-28 5:36 UTC (permalink / raw)
To: Christoph Lameter
Cc: David Chinner, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky, Nick Piggin
On Fri, 27 Apr 2007 22:08:17 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> On Fri, 27 Apr 2007, Andrew Morton wrote:
>
> > My (repeated) point is that if we populate pagecache with physically-contiguous 4k
> > pages in this manner then bio+block will be able to create much larger SG lists.
>
> True, but the "if" becomes exceedingly rare the longer the system has been
> in operation. 64k implies 16 pages in sequence, which is going to be a bit
> difficult to get.
Nonsense. We need higher-order allocations whichever scheme is used.
And lumpy reclaim in the moveable zone should be extremely reliable. It
_should_ be the case that it can only be defeated by excessive use of
mlock. But we've seen no testing to either confirm or refute that.
> Then there is the overhead of handling these pages, which may not be
> significant in some usage cases given growing processor capabilities. In
> others, like a synchronized application running on a large number of
> nodes, it is likely to introduce random delays in processor-to-processor
> communication that will significantly impair performance.
Well, who knows.
> And then there is the long list of features that cannot be accomplished
> with such an approach like mounting a volume with large block size,
> handling CD/DVDs, getting rid of various shim layers etc.
There are disadvantages against which this must be traded off.
And if the volume which is mounted with the large page option also has a
lot of small files on it, we've gone and dramatically deoptimised the
user's machine. It would have been better to make the 4k-page
implementation faster, rather than working around existing inefficiencies.
> I'd also like to have much higher orders of allocations for scientific
> applications that require an extremely large I/O rate. For those we
> could f.e. dedicate memory nodes that will only use a very high page
> order to prevent fragmentation. E.g. 1G pages is certainly something that
> lots of our customers would find beneficial (and they are actually
> already using those types of pages in the form of huge pages but with
> limited capabilities).
>
> But then we are sadly again trying to find another workaround that
> will not get us there and will not allow the flexibility in the
> VM that would make things much easier for lots of usage scenarios.
Your patch *is* a workaround. It's a workaround for small CPU pagesize.
It's a workaround for suboptimal VFS and filesystem implementations. It's
a workaround for a disk adapter which has suboptimal readahead and
writeback caching implementations.
See? I can spin too.
Fact is, this change has *costs*. And you're completely ignoring them,
trying to spin them away. It ain't working and it never will. I'm seeing
no serious attempt to think about how we can reduce those costs while
retaining most of the benefits.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 5:36 ` Andrew Morton
@ 2007-04-28 6:24 ` Christoph Lameter
2007-04-28 6:52 ` Andrew Morton
0 siblings, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-04-28 6:24 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky, Nick Piggin
On Fri, 27 Apr 2007, Andrew Morton wrote:
> Your patch *is* a workaround. It's a workaround for small CPU pagesize.
> It's a workaround for suboptimal VFS and filesystem implementations. It's
> a workaround for a disk adapter which has suboptimal readahead and
> writeback caching implementations.
>
> See? I can spin too.
Of course anyone can spin. The veracity of the statements is the important
thing.
> Fact is, this change has *costs*. And you're completely ignoring them,
> trying to spin them away. It ain't working and it never will. I'm seeing
> no serious attempt to think about how we can reduce those costs while
> retaining most of the benefits.
Well okay, my work is "no serious attempt". How encouraging. Then "It does
not work", despite the fact that we were able to run benchmarks on it after
only a week's worth of work (where most of my time was actually spent on
something else).
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 6:24 ` Christoph Lameter
@ 2007-04-28 6:52 ` Andrew Morton
2007-04-30 5:30 ` Christoph Lameter
0 siblings, 1 reply; 235+ messages in thread
From: Andrew Morton @ 2007-04-28 6:52 UTC (permalink / raw)
To: Christoph Lameter
Cc: David Chinner, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky, Nick Piggin
On Fri, 27 Apr 2007 23:24:05 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> > Fact is, this change has *costs*. And you're completely ignoring them,
> > trying to spin them away. It ain't working and it never will. I'm seeing
> > no serious attempt to think about how we can reduce those costs while
> > retaining most of the benefits.
>
> Well okay my work is "no serious attempt".
No. My point is, you have resisted all attempts to explore less costly
*alternatives* to this work. Alternatives.
By misunderstanding any suggestions, misrepresenting them, making incorrect
statements about them, by not suggesting any alternatives yourself, all of
it buttressed by a stolid refusal to recognise that this patch has any
costs.
This effectively leaves it up to others to find time to think about and to
implement possible alternative solutions to the problems which you're
observing.
The alternative which is on the table (and there may be others) is
populating pagecache with physically contiguous pages. This will fix the
HBA problem and is much less costly in terms of maintenance and will
improve all workloads on all machines and doesn't have the additional
runtime costs of pagecache wastage and more memset() overhead with small
files and it doesn't require administrator intervention.
OTOH (yes! there are tradeoffs!) it will consume an unknown amount more
CPU and it doesn't address the large-fs-blocksize requirement, but I don't
know how important the latter is and given the unrelenting advocacy storm
coming from the SGI direction I don't know how to find that out, frankly.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 1:43 ` Nick Piggin
@ 2007-04-28 8:04 ` Peter Zijlstra
2007-04-28 8:22 ` Andrew Morton
0 siblings, 1 reply; 235+ messages in thread
From: Peter Zijlstra @ 2007-04-28 8:04 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, David Chinner, Christoph Lameter, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Sat, 2007-04-28 at 11:43 +1000, Nick Piggin wrote:
> Andrew Morton wrote:
> > For example, see __do_page_cache_readahead(). It does a read_lock() and a
> > page allocation and a radix-tree lookup for each page. We can vastly
> > improve that.
> >
> > Step 1:
> >
> > - do a read-lock
> >
> > - do a radix-tree walk to work out how many pages are missing
> >
> > - read-unlock
> >
> > - allocate that many pages
> >
> > - read_lock()
> >
> > - populate all the pages.
> >
> > - read_unlock
> >
> > - if any pages are left over, free them
> >
> > - if we ended up not having enough pages, redo the whole thing.
> >
> > that will reduce the number of read_lock()s, read_unlock()s and radix-tree
> > descents by a factor of 32 or so in this testcase. That's a lot, and it's
> > something we (Nick ;)) should have done ages ago.
>
> We can do pretty well with the lockless radix tree (that is already upstream)
> there. I split that stuff out of my most recent lockless pagecache patchset,
> because it doesn't require the "scary" speculative refcount stuff of the
> lockless pagecache proper. Subject: [patch 5/9] mm: lockless probe.
>
> So that is something we could merge pretty soon.
>
> The other thing is that we can batch up pagecache page insertions for bulk
> writes as well (that is, write(2) with buffer size > page size). I should
> have a patch somewhere for that as well if anyone is interested.
Together with the optimistic locking from my concurrent pagecache that
should bring most of the gains:
sequential insert of 8388608 items:
CONFIG_RADIX_TREE_CONCURRENT=n
[ffff81007d7f60c0] insert 0 done in 15286 ms
CONFIG_RADIX_TREE_OPTIMISTIC=y
[ffff81006b36e040] insert 0 done in 3443 ms
only 4.4 times faster, and more scalable, since we don't bounce the
upper level locks around.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 2:27 ` Nick Piggin
2007-04-28 2:39 ` William Lee Irwin III
@ 2007-04-28 8:16 ` Christoph Hellwig
1 sibling, 0 replies; 235+ messages in thread
From: Christoph Hellwig @ 2007-04-28 8:16 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Hellwig, Christoph Lameter, Paul Mackerras,
Andrew Morton, David Chinner, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Sat, Apr 28, 2007 at 12:27:45PM +1000, Nick Piggin wrote:
> And that wasn't due to the 128 sg limit?
No, that was due to aacraid really liking sg lists as small as possible
where every entry covers areas as big as possible. The driver really
liked physical merging once wli changed the page allocator to return
pages in the right order.
> I guess 10% isn't a small amount. Though it would be nice to have
> before/after numbers for Linux.
I'll try to find the old thread.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 8:04 ` Peter Zijlstra
@ 2007-04-28 8:22 ` Andrew Morton
2007-04-28 8:32 ` Peter Zijlstra
2007-04-28 14:09 ` William Lee Irwin III
0 siblings, 2 replies; 235+ messages in thread
From: Andrew Morton @ 2007-04-28 8:22 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nick Piggin, David Chinner, Christoph Lameter, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Sat, 28 Apr 2007 10:04:08 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > The other thing is that we can batch up pagecache page insertions for bulk
> > writes as well (that is, write(2) with buffer size > page size). I should
> > have a patch somewhere for that as well if anyone is interested.
>
> Together with the optimistic locking from my concurrent pagecache that
> should bring most of the gains:
>
> sequential insert of 8388608 items:
>
> CONFIG_RADIX_TREE_CONCURRENT=n
>
> [ffff81007d7f60c0] insert 0 done in 15286 ms
>
> CONFIG_RADIX_TREE_OPTIMISTIC=y
>
> [ffff81006b36e040] insert 0 done in 3443 ms
>
> only 4.4 times faster, and more scalable, since we don't bounce the
> upper level locks around.
I'm not sure what we're looking at here. radix-tree changes? Locking
changes? Both?
If we have a whole pile of pages to insert then there are obvious gains
from not taking the lock once per page (gang insert). But I expect there
will also be gains from not walking down the radix tree once per page too:
walk all the way down and populate all the way to the end of the node.
The implementation could get a bit tricky, handling pages which a racer
instantiated when we dropped the lock, and suitably adjusting ->index. Not
rocket science though.
The depth of the radix tree matters (ie, the file size). 'twould be useful
to always describe the tree's size when publishing microbenchmark results
like this.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 8:22 ` Andrew Morton
@ 2007-04-28 8:32 ` Peter Zijlstra
2007-04-28 8:55 ` Andrew Morton
2007-04-28 14:09 ` William Lee Irwin III
1 sibling, 1 reply; 235+ messages in thread
From: Peter Zijlstra @ 2007-04-28 8:32 UTC (permalink / raw)
To: Andrew Morton
Cc: Nick Piggin, David Chinner, Christoph Lameter, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Sat, 2007-04-28 at 01:22 -0700, Andrew Morton wrote:
> On Sat, 28 Apr 2007 10:04:08 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> > >
> > > The other thing is that we can batch up pagecache page insertions for bulk
> > > writes as well (that is, write(2) with buffer size > page size). I should
> > > have a patch somewhere for that as well if anyone is interested.
> >
> > Together with the optimistic locking from my concurrent pagecache that
> > should bring most of the gains:
> >
> > sequential insert of 8388608 items:
> >
> > CONFIG_RADIX_TREE_CONCURRENT=n
> >
> > [ffff81007d7f60c0] insert 0 done in 15286 ms
> >
> > CONFIG_RADIX_TREE_OPTIMISTIC=y
> >
> > [ffff81006b36e040] insert 0 done in 3443 ms
> >
> > only 4.4 times faster, and more scalable, since we don't bounce the
> > upper level locks around.
>
> I'm not sure what we're looking at here. radix-tree changes? Locking
> changes? Both?
Both, the radix tree is basically node locked, and all modifying
operations are taught to node lock their way around the radix tree.
This, as Christoph pointed out, will suck on SMP because all the node
locks will bounce around like mad.
So what I did was couple that with an 'optimistic' RCU lookup of the
highest node that _needs_ to be locked for the operation to succeed,
lock that node, verify it is indeed still the node we need, and proceed
as normal (node locked). This avoids taking most/all of the upper level
node locks; esp for insert, which typically only needs one lock.
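For illustration only (this is not Peter's actual patch): a minimal sketch of
that optimistic scheme. optimistic_find_lock_target(), node_is_dead() and
node_still_covers() are hypothetical helpers, and the per-node ->lock is an
assumed addition to the radix-tree node; only rcu_read_lock()/spin_lock() are
stock kernel primitives.

/*
 * Hypothetical sketch of the optimistic node locking described above:
 * RCU-walk down to the deepest node whose lock suffices for this
 * modification, lock only that node, and re-validate before going on.
 * Node freeing is assumed to be RCU-deferred and flagged by a per-node
 * dead bit, so a locked, non-dead node is guaranteed to still be in
 * the tree.
 */
static struct radix_tree_node *lock_subtree(struct radix_tree_root *root,
					    unsigned long index)
{
	struct radix_tree_node *node;

	for (;;) {
		rcu_read_lock();
		node = optimistic_find_lock_target(root, index);
		spin_lock(&node->lock);		/* assumed per-node lock */
		if (!node_is_dead(node) && node_still_covers(node, index))
			return node;		/* caller modifies the subtree, then
						 * unlocks and leaves the RCU section */
		/* raced with a split, extend or delete: back off and retry */
		spin_unlock(&node->lock);
		rcu_read_unlock();
	}
}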
> If we have a whole pile of pages to insert then there are obvious gains
> from not taking the lock once per page (gang insert). But I expect there
> will also be gains from not walking down the radix tree once per page too:
> walk all the way down and populate all the way to the end of the node.
Yes, it will be even faster if we could batch insert a whole leaf node.
That would save 2^6 - 1 tree traversals and lock/unlock cycles.
Certainly not unwanted.
> The implementation could get a bit tricky, handling pages which a racer
> instantiated when we dropped the lock, and suitably adjusting ->index. Not
> rocket science though.
*nod*
> The depth of the radix tree matters (ie, the file size). 'twould be useful
> to always describe the tree's size when publishing microbenchmark results
> like this.
Right, this was with two such sequences of 2^23, so 2^24 elements in
total, with 0 offset, which gives: 24/6 = 4 levels.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 8:32 ` Peter Zijlstra
@ 2007-04-28 8:55 ` Andrew Morton
2007-04-28 9:36 ` Peter Zijlstra
0 siblings, 1 reply; 235+ messages in thread
From: Andrew Morton @ 2007-04-28 8:55 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nick Piggin, David Chinner, Christoph Lameter, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Sat, 28 Apr 2007 10:32:56 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Sat, 2007-04-28 at 01:22 -0700, Andrew Morton wrote:
> > On Sat, 28 Apr 2007 10:04:08 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > > >
> > > > The other thing is that we can batch up pagecache page insertions for bulk
> > > > writes as well (that is, write(2) with buffer size > page size). I should
> > > > have a patch somewhere for that as well if anyone is interested.
> > >
> > > Together with the optimistic locking from my concurrent pagecache that
> > > should bring most of the gains:
> > >
> > > sequential insert of 8388608 items:
> > >
> > > CONFIG_RADIX_TREE_CONCURRENT=n
> > >
> > > [ffff81007d7f60c0] insert 0 done in 15286 ms
> > >
> > > CONFIG_RADIX_TREE_OPTIMISTIC=y
> > >
> > > [ffff81006b36e040] insert 0 done in 3443 ms
> > >
> > > only 4.4 times faster, and more scalable, since we don't bounce the
> > > upper level locks around.
> >
> > I'm not sure what we're looking at here. radix-tree changes? Locking
> > changes? Both?
>
> Both, the radix tree is basically node locked, and all modifying
> operations are taught to node lock their way around the radix tree.
> This, as Christoph pointed out, will suck on SMP because all the node
> locks will bounce around like mad.
>
> So what I did was couple that with an 'optimistic' RCU lookup of the
> highest node that _needs_ to be locked for the operation to succeed,
> lock that node, verify it is indeed still the node we need, and proceed
> as normal (node locked). This avoids taking most/all of the upper level
> node locks; esp for insert, which typically only needs one lock.
hm. Maybe if we were taking the lock once per N pages rather than once per
page, we could avoid such trickiness.
> > If we have a whole pile of pages to insert then there are obvious gains
> > from not taking the lock once per page (gang insert). But I expect there
> > will also be gains from not walking down the radix tree once per page too:
> > walk all the way down and populate all the way to the end of the node.
>
> Yes, it will be even faster if we could batch insert a whole leaf node.
> That would save 2^6 - 1 tree traversals and lock/unlock cycles.
> Certainly not unwanted.
>
> > The implementation could get a bit tricky, handling pages which a racer
> > instantiated when we dropped the lock, and suitably adjusting ->index. Not
> > rocket science though.
>
> *nod*
>
> > The depth of the radix tree matters (ie, the file size). 'twould be useful
> > to always describe the tree's size when publishing microbenchmark results
> > like this.
>
> Right, this was with two such sequences of 2^23, so 2^24 elements in
> total, with 0 offset, which gives: 24/6 = 4 levels.
Where an element would be a page, so we're talking about a 64GB file with
4k pagesize?
Gosh that tree has huge fanout. We fiddled around a bit with
RADIX_TREE_MAP_SHIFT in the early days. Changing it by quite large amounts
didn't appear to have much effect on anything, actually.
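As a quick sanity check of the figures above (illustrative only; assumes the
stock RADIX_TREE_MAP_SHIFT of 6, i.e. 64 slots per node):

/* Height needed to cover nr_slots pagecache slots with 64-way nodes. */
static unsigned int radix_height_for(unsigned long nr_slots)
{
	unsigned int height = 0;
	unsigned long reach = 1;

	while (reach < nr_slots) {
		reach <<= 6;	/* each extra level multiplies the reach by 64 */
		height++;
	}
	return height;	/* 1UL << 24 slots -> 4 levels; at 4k/page that is a 64GB file */
}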
btw, regarding __do_page_cache_readahead(): that function is presently
optimised for file rereads - if all pages are present it'll only take the
lock once. The first version of it was optimised the other way around - it
took the lock once for uncached file reads (no pages present), but once per
page for rereads. I made this change because Anton Blanchard had some
workload which hurt, and we figured that rereads are the common case, and
uncached reads are limited by disk speed anyway.
Thinking about it, I guess that was a correct decision, but we did
pessimise these read-a-gargantuan-file cases.
But we could actually implement both flavours and pick the right one to
call based upon the hit-rate metrics which the readahead code is
maintaining (or add new ones).
Of course, gang-populate would be better.
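For reference, a minimal sketch of the "one locked walk to count the holes"
step discussed here and earlier in the thread. This is not code from any
posted patch: radix_tree_lookup() and mapping->tree_lock are real, but the
function itself is hypothetical.

static unsigned long count_missing_pages(struct address_space *mapping,
					 pgoff_t start, unsigned long nr)
{
	unsigned long i, missing = 0;

	/* one locked radix-tree walk instead of a lock round trip per page */
	read_lock_irq(&mapping->tree_lock);
	for (i = 0; i < nr; i++)
		if (!radix_tree_lookup(&mapping->page_tree, start + i))
			missing++;
	read_unlock_irq(&mapping->tree_lock);

	return missing;	/* the caller then allocates this many pages in one go,
			 * inserts them, and frees any that a racer beat it to */
}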
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 8:55 ` Andrew Morton
@ 2007-04-28 9:36 ` Peter Zijlstra
0 siblings, 0 replies; 235+ messages in thread
From: Peter Zijlstra @ 2007-04-28 9:36 UTC (permalink / raw)
To: Andrew Morton
Cc: Nick Piggin, David Chinner, Christoph Lameter, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Sat, 2007-04-28 at 01:55 -0700, Andrew Morton wrote:
> On Sat, 28 Apr 2007 10:32:56 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> > On Sat, 2007-04-28 at 01:22 -0700, Andrew Morton wrote:
> > > On Sat, 28 Apr 2007 10:04:08 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > >
> > > > >
> > > > > The other thing is that we can batch up pagecache page insertions for bulk
> > > > > writes as well (that is, write(2) with buffer size > page size). I should
> > > > > have a patch somewhere for that as well if anyone is interested.
> > > >
> > > > Together with the optimistic locking from my concurrent pagecache that
> > > > should bring most of the gains:
> > > >
> > > > sequential insert of 8388608 items:
> > > >
> > > > CONFIG_RADIX_TREE_CONCURRENT=n
> > > >
> > > > [ffff81007d7f60c0] insert 0 done in 15286 ms
> > > >
> > > > CONFIG_RADIX_TREE_OPTIMISTIC=y
> > > >
> > > > [ffff81006b36e040] insert 0 done in 3443 ms
> > > >
> > > > only 4.4 times faster, and more scalable, since we don't bounce the
> > > > upper level locks around.
> > >
> > > I'm not sure what we're looking at here. radix-tree changes? Locking
> > > changes? Both?
> >
> > Both, the radix tree is basically node locked, and all modifying
> > operations are taught to node lock their way around the radix tree.
> > This, as Christoph pointed out, will suck on SMP because all the node
> > locks will bounce around like mad.
> >
> > So what I did was couple that with an 'optimistic' RCU lookup of the
> > highest node that _needs_ to be locked for the operation to succeed,
> > lock that node, verify it is indeed still the node we need, and proceed
> > as normal (node locked). This avoids taking most/all of the upper level
> > node locks; esp for insert, which typically only needs one lock.
>
> hm. Maybe if we were taking the lock once per N pages rather than once per
> page, we could avoid such trickiness.
For insertion, maybe, not all insertion is coupled with strong
readahead.
But this also works for the other modifiers, like radix_tree_tag_set(),
which is typically called on single pages.
The whole idea is to entirely remove mapping->tree_lock, and serialize
the pagecache on page level.
> > > If we have a whole pile of pages to insert then there are obvious gains
> > > from not taking the lock once per page (gang insert). But I expect there
> > > will also be gains from not walking down the radix tree once per page too:
> > > walk all the way down and populate all the way to the end of the node.
> >
> > Yes, it will be even faster if we could batch insert a whole leaf node.
> > That would save 2^6 - 1 tree traversals and lock/unlock cycles.
> > Certainly not unwanted.
> >
> > > The implementation could get a bit tricky, handling pages which a racer
> > > instantiated when we dropped the lock, and suitably adjusting ->index. Not
> > > rocket science though.
> >
> > *nod*
> >
> > > The depth of the radix tree matters (ie, the file size). 'twould be useful
> > > to always describe the tree's size when publishing microbenchmark results
> > > like this.
> >
> > Right, this was with two such sequences of 2^23, so 2^24 elements in
> > total, with 0 offset, which gives: 24/6 = 4 levels.
>
> Where an element would be a page, so we're talking about a 64GB file with
> 4k pagesize?
>
> Gosh that tree has huge fanout. We fiddled around a bit with
> RADIX_TREE_MAP_SHIFT in the early days. Changing it by quite large amounts
> didn't appear to have much effect on anything, actually.
It might make sense to decrease it a bit with this work; it gives more
opportunity to parallelize.
Currently the tree_lock is pretty much the most expensive operation in
this area; it happily bounces around the machine.
The lockless work showed there is lots of room for improvement here.
> btw, regarding __do_page_cache_readahead(): that function is presently
> optimised for file rereads - if all pages are present it'll only take the
> lock once. The first version of it was optimised the other way around - it
> took the lock once for uncached file reads (no pages present), but once per
> page for rereads. I made this change because Anton Blanchard had some
> workload which hurt, and we figured that rereads are the common case, and
> uncached reads are limited by disk speed anyway.
>
> Thinking about it, I guess that was a correct decision, but we did
> pessimise these read-a-gargantuan-file cases.
>
> But we could actually implement both flavours and pick the right one to
> call based upon the hit-rate metrics which the readahead code is
> maintaining (or add new ones).
>
> Of course, gang-populate would be better.
I could try to do a gang insert operation on top of my concurrent
pagecache tree.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 4:56 ` Andrew Morton
2007-04-28 5:08 ` Christoph Lameter
@ 2007-04-28 9:43 ` Alan Cox
2007-04-28 9:58 ` Andrew Morton
1 sibling, 1 reply; 235+ messages in thread
From: Alan Cox @ 2007-04-28 9:43 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky, Nick Piggin
On Fri, 27 Apr 2007 21:56:34 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Sat, 28 Apr 2007 13:17:40 +1000 David Chinner <dgc@sgi.com> wrote:
>
> > > Fix up your lameo HBA for reads.
> >
> > Where did that come from? You spent 20 lines describing the inefficiencies
> > of readahead in the page cache and how it should be fixed, but then you
> > turn around and say fix the HBA?
>
> My (repeated) point is that if we populate pagecache with physically-contiguous 4k
> pages in this manner then bio+block will be able to create much larger SG lists.
Also remember that even if you do larger pages by using virtual pairs or
quads of real pages because it helps on some systems you end up needing
the same sized sglist as before so you don't make anything worse for
half-assed controllers as you get the same I/O size providing they have
the minimal 2 or 4 sg list entries (and those that don't are genuinely
beyond saving and nowadays very rare)
Alan
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 9:43 ` Alan Cox
@ 2007-04-28 9:58 ` Andrew Morton
2007-04-28 10:21 ` Alan Cox
0 siblings, 1 reply; 235+ messages in thread
From: Andrew Morton @ 2007-04-28 9:58 UTC (permalink / raw)
To: Alan Cox
Cc: David Chinner, Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky, Nick Piggin
On Sat, 28 Apr 2007 10:43:28 +0100 Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> On Fri, 27 Apr 2007 21:56:34 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > On Sat, 28 Apr 2007 13:17:40 +1000 David Chinner <dgc@sgi.com> wrote:
> >
> > > > Fix up your lameo HBA for reads.
> > >
> > > Where did that come from? You spent 20 lines describing the inefficiencies
> > > of readahead in the page cache and how it should be fixed, but then you
> > > turn around and say fix the HBA?
> >
> > My (repeated) point is that if we populate pagecache with physically-contiguous 4k
> > pages in this manner then bio+block will be able to create much larger SG lists.
>
> Also remember that even if you do larger pages by using virtual pairs or
> quads of real pages because it helps on some systems you end up needing
> the same sized sglist as before so you don't make anything worse for
> half-assed controllers as you get the same I/O size providing they have
> the minimal 2 or 4 sg list entries (and those that don't are genuinely
> beyond saving and nowadays very rare)
>
Could you expand on that a bit please? I don't get it.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 9:58 ` Andrew Morton
@ 2007-04-28 10:21 ` Alan Cox
2007-04-28 10:25 ` Andrew Morton
0 siblings, 1 reply; 235+ messages in thread
From: Alan Cox @ 2007-04-28 10:21 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky, Nick Piggin
> > Also remember that even if you do larger pages by using virtual pairs or
> > quads of real pages because it helps on some systems you end up needing
> > the same sized sglist as before so you don't make anything worse for
> > half-assed controllers as you get the same I/O size providing they have
> > the minimal 2 or 4 sg list entries (and those that don't are genuinely
> > beyond saving and nowadays very rare)
> >
>
> Could you expand on that a bit please? I don't get it.
Put a physically contiguous 16K "page" into the page cache and you need to
allocate 1 sg entry and you get a clear benefit, IFF you can allocate the
pages.
Put a 16K "page" into the page cache made up of 4 x real 4K pages which
are not physically contiguous and you need 4 sg list entries - which is
no worse than if you were using 4K pages
4 per 16K page cache "logical page" -> 4 per 16K
1 per 4K physical page for 4K page cache -> 4 per 16K
The only ugly case for the latter is if you are reading something like a
16K page ext3fs from an old IA64 box onto a real computer and you have a
controller with insufficient sg list entries to read a 16K logical page.
At that point the block layer is going to have kittens.
Alan
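A tiny sketch of the arithmetic Alan lays out (illustrative only; it ignores
per-segment size limits): physically adjacent pages collapse into one
scatter/gather segment, while discontiguous ones each need their own.

static unsigned int sg_segments_needed(struct page **pages, unsigned int nr)
{
	unsigned int i, segs = 0;

	for (i = 0; i < nr; i++)
		/* start a new segment unless this page physically follows the last */
		if (i == 0 || page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1)
			segs++;

	return segs;	/* one contiguous 16K "page" -> 1; four scattered 4K pages -> 4 */
}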
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 10:21 ` Alan Cox
@ 2007-04-28 10:25 ` Andrew Morton
2007-04-28 11:29 ` Alan Cox
0 siblings, 1 reply; 235+ messages in thread
From: Andrew Morton @ 2007-04-28 10:25 UTC (permalink / raw)
To: Alan Cox
Cc: David Chinner, Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky, Nick Piggin
On Sat, 28 Apr 2007 11:21:17 +0100 Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > > Also remember that even if you do larger pages by using virtual pairs or
> > > quads of real pages because it helps on some systems you end up needing
> > > the same sized sglist as before so you don't make anything worse for
> > > half-assed controllers as you get the same I/O size providing they have
> > > the minimal 2 or 4 sg list entries (and those that don't are genuinely
> > > beyond saving and nowadays very rare)
> > >
> >
> > Could you expand on that a bit please? I don't get it.
>
> Put a physically contiguous 16K "page" into the page cache and you need to
> allocate 1 sg entry and you get a clear benefit, IFF you can allocate the
> pages.
>
> Put a 16K "page" into the page cache made up of 4 x real 4K pages which
> are not physically contiguous and you need 4 sg list entries - which is
> no worse than if you were using 4K pages
>
> 4 per 16K page cache "logical page" -> 4 per 16K
>
> 1 per 4K physical page for 4K page cache -> 4 per 16K
>
> The only ugly case for the latter is if you are reading something like a
> 16K page ext3fs from an old IA64 box onto a real computer and you have a
> controller with insufficient sg list entries to read a 16K logical page.
> At that point the block layer is going to have kittens.
>
OK.
But all (both) the proposals we're (ahem) discussing do involve 4x
physically contiguous pages going into those four contiguous pagecache
slots.
So we're improving things for the half-assed controllers, aren't we?
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 5:44 ` Eric W. Biederman
` (2 preceding siblings ...)
2007-04-26 13:28 ` Alan Cox
@ 2007-04-28 10:55 ` Pierre Ossman
2007-04-28 15:39 ` Eric W. Biederman
3 siblings, 1 reply; 235+ messages in thread
From: Pierre Ossman @ 2007-04-28 10:55 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Eric W. Biederman wrote:
>
> I have a hard time believe that device hardware limits don't allow them
> to have enough space to handle larger requests. If so it was a poor
> design by the hardware manufacturers.
>
In the MMC layer, the block size is a major bottleneck. None of the currently
supported hardware supports scatter/gather, so we're restricted to servicing a
single contiguous chunk of memory at a time. And since latency is substantial
for MMC/SD, good performance needs transfer sizes several orders of magnitude
above 4k. We get ~8 MB/s for cards which are supposed to do 20 MB/s (which has
been tested against other systems where we can get larger memory chunks), and
the peasants are getting a bit unruly.
I plan to experiment with some bounce buffer scheme to get performance up, but
getting large blocks directly would make such hacks unnecessary.
Just my two cents.
Rgds
--
-- Pierre Ossman
Linux kernel, MMC maintainer http://www.kernel.org
PulseAudio, core developer http://pulseaudio.org
rdesktop, core developer http://www.rdesktop.org
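A hedged sketch of the bounce-buffer idea Pierre mentions above (purely
illustrative, not the eventual MMC implementation; it uses the sg->page,
->offset and ->length fields of that era's struct scatterlist, and assumes
lowmem pages and an unbounded kmalloc() for brevity): gather the scatterlist
into one contiguous buffer so a controller without scatter/gather support
still sees a single large transfer.

static void *mmc_bounce_coalesce(struct scatterlist *sg, unsigned int nents,
				 unsigned int *total)
{
	unsigned int i, len = 0;
	char *buf, *p;

	for (i = 0; i < nents; i++)
		len += sg[i].length;

	buf = kmalloc(len, GFP_KERNEL);		/* one physically contiguous chunk */
	if (!buf)
		return NULL;

	for (p = buf, i = 0; i < nents; i++) {	/* gather the fragments */
		memcpy(p, (char *)page_address(sg[i].page) + sg[i].offset,
		       sg[i].length);
		p += sg[i].length;
	}

	*total = len;
	return buf;	/* hand this single buffer to the no-scatter/gather controller */
}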
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 10:25 ` Andrew Morton
@ 2007-04-28 11:29 ` Alan Cox
2007-04-28 14:37 ` William Lee Irwin III
0 siblings, 1 reply; 235+ messages in thread
From: Alan Cox @ 2007-04-28 11:29 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky, Nick Piggin
> But all (both) the proposals we're (ahem) discussing do involve 4x
> physically contiguous pages going into those four contiguous pagecache
> slots.
>
> So we're improving things for the half-assed controllers, aren't we?
Not necessarily. If you use 16K contiguous pages you have to do
more work to get memory contiguously and you have less cache efficiency,
both of which will do serious damage to performance with poor I/O
subsystems from all the extra paging and I/O, and probably way more than
the sglist stuff on PC class boxes. (Yes, it's probably a win on your 32GB
SGI monster.)
That aside, you are improving things for most but not all half-assed
controllers (some sglists are simply page pointers for each 4K page, or
have a different, efficient encoding for page sized chunks (I2O)). If you
are reading large chunks on a really crap controller (eg IDE in PIO) then
the fact you pull 16K when you need 4K will also wipe out any gains.
Alan
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 8:22 ` Andrew Morton
2007-04-28 8:32 ` Peter Zijlstra
@ 2007-04-28 14:09 ` William Lee Irwin III
2007-04-28 18:26 ` Andrew Morton
1 sibling, 1 reply; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-28 14:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Zijlstra, Nick Piggin, David Chinner, Christoph Lameter,
linux-kernel, Mel Gorman, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Sat, 28 Apr 2007 10:04:08 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> only 4.4 times faster, and more scalable, since we don't bounce the
>> upper level locks around.
On Sat, Apr 28, 2007 at 01:22:51AM -0700, Andrew Morton wrote:
> I'm not sure what we're looking at here. radix-tree changes? Locking
> changes? Both?
> If we have a whole pile of pages to insert then there are obvious gains
> from not taking the lock once per page (gang insert). But I expect there
> will also be gains from not walking down the radix tree once per page too:
> walk all the way down and populate all the way to the end of the node.
The gang allocation affair may also want to make the calls into
the page allocator batched. For instance, grab enough compound pages to
build the gang under the lock, since we're going to blow the per-cpu
lists with so many pages, then break the compound pages up outside the
zone->lock.
I think it'd be good to have some corresponding tactics for freeing as
well.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 11:29 ` Alan Cox
@ 2007-04-28 14:37 ` William Lee Irwin III
0 siblings, 0 replies; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-28 14:37 UTC (permalink / raw)
To: Alan Cox
Cc: Andrew Morton, David Chinner, Christoph Lameter, linux-kernel,
Mel Gorman, Jens Axboe, Badari Pulavarty, Maxim Levitsky,
Nick Piggin
On Sat, Apr 28, 2007 at 12:29:08PM +0100, Alan Cox wrote:
> Not necessarily. If you use 16K contiguous pages you have to do
> more work to get memory contiguously and you have less cache efficiency,
> both of which will do serious damage to performance with poor I/O
> subsystems from all the extra paging and I/O, and probably way more than
> the sglist stuff on PC class boxes. (Yes, it's probably a win on your 32GB
> SGI monster.)
Some of this happens anyway with readahead window sizes and the like.
jejb's also seen gains from my fiddling with mm/page_alloc.c so that
pages come out in the right order (which has since regressed and they
don't anymore; odds are s/list_add/list_add_tail/ in one or two places
will fix it), though that didn't involve crappy controllers.
It may be that readahead needs to be less aggressive for PC class boxes,
but I expect that's already been dealt with.
On Sat, Apr 28, 2007 at 12:29:08PM +0100, Alan Cox wrote:
> That aside, you are improving things for most but not all half-assed
> controllers (some sglists are simply page pointers for each 4K page, or
> have a different, efficient encoding for page sized chunks (I2O)). If you
> are reading large chunks on a really crap controller (eg IDE in PIO) then
> the fact you pull 16K when you need 4K will also wipe out any gains.
Sorry about the "back to basics" bit here, but...
It might help me to follow the discussion if what's meant by a "crappy
controller" could be clarified. The notion I had in mind to date was
poor or absent support for scatter gather, where deficiencies can take
the form of limited sglist element size, limited sglist length, or gross
inefficiency in processing large sglists (e.g. implemented internally as
a series of requests, one for each sglist element).
I would think that the gains from cramming larger amounts of data into
an individual command would largely be due to avoiding artificially
serializing IO so the speedup comes about from concurrent processing of
the IO within the controller or disk as opposed to, say, reducing the
overhead of issuing commands. Maybe clarifying why we expect larger IO's
to be faster would help, too.
Those clarifications might help more precisely define what issue all
this is meant to address.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 10:55 ` Pierre Ossman
@ 2007-04-28 15:39 ` Eric W. Biederman
0 siblings, 0 replies; 235+ messages in thread
From: Eric W. Biederman @ 2007-04-28 15:39 UTC (permalink / raw)
To: Pierre Ossman
Cc: Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
Pierre Ossman <drzeus-list@drzeus.cx> writes:
> Eric W. Biederman wrote:
>>
>> I have a hard time believe that device hardware limits don't allow them
>> to have enough space to handle larger requests. If so it was a poor
>> design by the hardware manufacturers.
>>
>
> In the MMC layer, the block size is a major bottle neck. None of the currently
> supported hardware supports scatter/gather so we're restricted to servicing a
> single continuous chunk of memory at a time. And since latency is substantial
> for MMC/SD, good performance is several orders above 4k. We get ~8 MB/s for
> cards which are supposed to do 20 MB/s (which has been tested against other
> systems where we can get larger memory chunks), and the peasants are getting a
> bit unruly.
>
> I plan to experiment with some bounce buffer scheme to get performance up, but
> getting large blocks directly would make such hacks unnecessary.
My problem with the proposed scheme is not that it uses large pages,
but rather that it requires large pages. So in the mmc case you would
go from getting 20MB/s soon after the system booted to failing to be
able to do I/O at all a couple of days later when memory gets
fragmented.
I think a reliable 8MB/s is much better than an unreliable 20MB/s.
With your bounce buffer scheme it seems probable that we can even
get a reliable 20MB/s.
So I'm interested to hear that we have several in-tree users that
could benefit.
I do think large block support makes sense if we don't require large
pages.
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
` (25 preceding siblings ...)
2007-04-27 2:04 ` Andrew Morton
@ 2007-04-28 16:39 ` Maxim Levitsky
2007-04-30 5:23 ` Christoph Lameter
26 siblings, 1 reply; 235+ messages in thread
From: Maxim Levitsky @ 2007-04-28 16:39 UTC (permalink / raw)
To: clameter
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Jens Axboe, Badari Pulavarty
On Wednesday 25 April 2007 01:21, clameter@sgi.com wrote:
> Rationales:
>
> 1. We have problems supporting devices with a higher blocksize than
> page size. This is for example important to support CD and DVDs that
> can only read and write 32k or 64k blocks. We currently have a shim
> layer in there to deal with this situation which limits the speed
> of I/O. The developers are currently looking for ways to completely
> bypass the page cache because of this deficiency.
>
> 2. 32/64k blocksize is also used in flash devices. Same issues.
>
> 3. Future harddisks will support bigger block sizes that Linux cannot
> support since we are limited to PAGE_SIZE. Ok the on board cache
> may buffer this for us but what is the point of handling smaller
> page sizes than what the drive supports?
>
> 4. Reduce fsck times. Larger block sizes mean faster file system checking.
>
> 5. Performance. If we look at IA64 vs. x86_64 then it seems that the
> faster interrupt handling on x86_64 compensate for the speed loss due to
> a smaller page size (4k vs 16k on IA64). Supporting larger block sizes
> sizes on all allows a significant reduction in I/O overhead and
> increases the size of I/O that can be performed by hardware in a single
> request since the number of scatter gather entries are typically limited
> for one request. This is going to become increasingly important to support
> the ever growing memory sizes since we may have to handle excessively large
> amounts of 4k requests for data sizes that may become common soon. For
> example to write a 1 terabyte file the kernel would have to handle 256
> million 4k chunks.
>
> 6. Cross arch compatibility: It is currently not possible to mount
> an 16k blocksize ext2 filesystem created on IA64 on an x86_64 system.
> With this patch this becoems possible.
>
Hi.
I have a few questions about that patchset:
1) Is it possible for a block device to assume that it will always get big
requests (aligned to the big blocksize)?
2) Does metadata reading/writing also occur using the same big blocksize?
3) If so, how are __bread/__getblk affected? Does the returned buffer_head
point to the whole block?
And what do you think about my design?
I want to link the parts of a compound page through buffer_heads, so the
head page's bh points to the second page's (tail page's) bhs, and from such
a bh it is possible to reference the page itself, and so on.
(This would allow a compound page to be physically fragmented.)
Best regards,
Maxim Levitsky
PS:
I ask these questions since this patchset matters to me; I would really like
to see this <= 4K limit lifted (all software limits are bad) and to finally
get good packet writing... I miss DirectCD a lot...
Although >4K blocksizes are really only the first step: to get really fast
packet writing, the UDF filesystem would have to be rewritten as well.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 14:09 ` William Lee Irwin III
@ 2007-04-28 18:26 ` Andrew Morton
2007-04-28 19:19 ` William Lee Irwin III
0 siblings, 1 reply; 235+ messages in thread
From: Andrew Morton @ 2007-04-28 18:26 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Peter Zijlstra, Nick Piggin, David Chinner, Christoph Lameter,
linux-kernel, Mel Gorman, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Sat, 28 Apr 2007 07:09:07 -0700 William Lee Irwin III <wli@holomorphy.com> wrote:
> On Sat, 28 Apr 2007 10:04:08 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >> only 4.4 times faster, and more scalable, since we don't bounce the
> >> upper level locks around.
>
> On Sat, Apr 28, 2007 at 01:22:51AM -0700, Andrew Morton wrote:
> > I'm not sure what we're looking at here. radix-tree changes? Locking
> > changes? Both?
> > If we have a whole pile of pages to insert then there are obvious gains
> > from not taking the lock once per page (gang insert). But I expect there
> > will also be gains from not walking down the radix tree once per page too:
> > walk all the way down and populate all the way to the end of the node.
>
> The gang allocation affair may also want to make the calls into
> the page allocator batched. For instance, grab enough compound pages to
> build the gang under the lock, since we're going to blow the per-cpu
> lists with so many pages, then break the compound pages up outside the
> zone->lock.
Sure, but...
Allocating a single order-3 (say) page _is_ a form of batching
We don't want compound pages here: just higher-order ones
Higher-order allocations bypass the per-cpu lists
> I think it'd be good to have some corresponding tactics for freeing as
> well.
hm, hadn't thought about that - would need to peek at contiguous pages in
the pagecache and see if we can gang-free them as higher-order pages.
The place to do that is perhaps inside the per-cpu magazines: it's more
general. Dunno if it would be a net advantage though.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 18:26 ` Andrew Morton
@ 2007-04-28 19:19 ` William Lee Irwin III
2007-04-28 21:28 ` Andrew Morton
0 siblings, 1 reply; 235+ messages in thread
From: William Lee Irwin III @ 2007-04-28 19:19 UTC (permalink / raw)
To: Andrew Morton
Cc: Peter Zijlstra, Nick Piggin, David Chinner, Christoph Lameter,
linux-kernel, Mel Gorman, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Sat, 28 Apr 2007 07:09:07 -0700 William Lee Irwin III <wli@holomorphy.com> wrote:
>> The gang allocation affair may also want to make the calls into
>> the page allocator batched. For instance, grab enough compound pages to
>> build the gang under the lock, since we're going to blow the per-cpu
>> lists with so many pages, then break the compound pages up outside the
>> zone->lock.
On Sat, Apr 28, 2007 at 11:26:40AM -0700, Andrew Morton wrote:
> Sure, but...
> Allocating a single order-3 (say) page _is_ a form of batching
Sorry, I should clarify here. If we fall back, we may still want to
get all the pages together. For instance, if we can't get an order 3,
grab an order 2, then if a second order 2 doesn't pan out, an order
1, and so on, until as many pages as requested are allocated or an
allocation failure occurs.
Also, passing around the results linked together into a list vs.
e.g. filling an array has the advantage of splice operations under
the lock, though arrays can catch up for the most part if their
elements are allowed to vary in terms of the orders of the pages.
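A hedged sketch of that fallback, combined with Andrew's earlier point about
using plain (non-compound) higher-order pages: alloc_pages() and split_page()
are real kernel helpers, while the function itself and its policy are
illustrative only.

static unsigned int gang_alloc_pages(struct page **pages, unsigned int want,
				     int order)
{
	unsigned int got = 0;

	while (got < want && order >= 0) {
		struct page *page;
		unsigned int i;

		if ((1U << order) > want - got) {	/* don't overshoot the request */
			order--;
			continue;
		}
		page = alloc_pages(GFP_KERNEL, order);	/* higher-order, not compound */
		if (!page) {				/* this order failed: step down */
			order--;
			continue;
		}
		split_page(page, order);		/* now 1 << order order-0 pages */
		for (i = 0; i < (1U << order); i++)
			pages[got++] = page + i;	/* a contiguous run by construction */
	}

	return got;	/* may be less than 'want' if memory is badly fragmented */
}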
On Sat, Apr 28, 2007 at 11:26:40AM -0700, Andrew Morton wrote:
> We don't want compound pages here: just higher-order ones
> Higher-order allocations bypass the per-cpu lists
Sorry again. I conflated the two, and failed to take the use of
higher-order pages as an assumption as I should've.
On Sat, 28 Apr 2007 07:09:07 -0700 William Lee Irwin III <wli@holomorphy.com> wrote:
>> I think it'd be good to have some corresponding tactics for freeing as
>> well.
On Sat, Apr 28, 2007 at 11:26:40AM -0700, Andrew Morton wrote:
> hm, hadn't thought about that - would need to peek at contiguous pages in
> the pagecache and see if we can gang-free them as higher-order pages.
> The place to do that is perhaps inside the per-cpu magazines: it's more
> general. Dunno if it would be a net advantage though.
What I was hoping for was an interface to hand back groups of pages at
a time which would then do contiguity detection if advantageous, and if
not, just assemble the pages into something that can be slung around
more quickly under the lock. Essentially doing small bits of the buddy
system's work for it outside the lock. Arrays make more sense here, as
it's relatively easy to do contiguity detection by heapifying them
and dequeueing in order in preparation for work under the lock.
There is an issue in that reclaim is not organized in such a fashion as
to issue calls to such freeing functions. An implicit effect of this
sort could be achieved by maintaining the pcp lists as an array-based
deque via duelling heap arrays with reversed comparators if an
appropriate deque structure for sets as small as the pcp arrays can't
be dredged up, or an auxiliary adjacency detection structure.
I'm skeptical, however, that the contiguity gains will compensate for
the CPU required to do such with the pcp lists. I think, rather, that
users of an interface for likely-contiguous batched freeing would be
better placed to arrange that, provided reclaim in such manners makes
sense from the standpoint of IO. Gang freeing in general could do adjacency
detection without disturbing the characteristics of the pcp lists,
though it, too, may not be productive without some specific notion
of whether contiguity is likely. For instance, quicklist_trim()
could readily use gang freeing, but it's not likely to have much in
the way of contiguity.
These sorts of algorithmic concerns are probably not quite as pressing
as the general notion of trying to establish some sort of contiguity,
so I'm by no means insistent on any of this.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 19:19 ` William Lee Irwin III
@ 2007-04-28 21:28 ` Andrew Morton
0 siblings, 0 replies; 235+ messages in thread
From: Andrew Morton @ 2007-04-28 21:28 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Peter Zijlstra, Nick Piggin, David Chinner, Christoph Lameter,
linux-kernel, Mel Gorman, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Sat, 28 Apr 2007 12:19:56 -0700 William Lee Irwin III <wli@holomorphy.com> wrote:
> I'm skeptical, however, that the contiguity gains will compensate for
> the CPU required to do such with the pcp lists.
It wouldn't surprise me if approximate contiguity is a pretty common case
in the pcp lists. Reclaim isn't very important here: most pages get freed
in truncate and particularly unmap_vmas. If the allocator is handing out
pages in reasonably contiguous fashion (and it does, and we're talking
about strengthening that) then I'd expect that very often we end up freeing
pages which have a lot of locality too. So the sort of tricks which you're
discussing might get a pretty good hit rate.
otoh, it's not obvious to me that there's a lot to be gained here. If we
repeatedly call the buddy allocator freeing contiguous order-0 pages, all
the data structures which are needed to handle those should be in L1 cache
and the buddy itself becomes our point-of-collection, if you see what I
mean.
Dunno. Profiling should tell?
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-26 13:28 ` Alan Cox
2007-04-26 13:30 ` Jens Axboe
@ 2007-04-29 14:12 ` Matt Mackall
1 sibling, 0 replies; 235+ messages in thread
From: Matt Mackall @ 2007-04-29 14:12 UTC (permalink / raw)
To: Alan Cox
Cc: Eric W. Biederman, Christoph Lameter, linux-kernel, Mel Gorman,
William Lee Irwin III, David Chinner, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Thu, Apr 26, 2007 at 02:28:46PM +0100, Alan Cox wrote:
> > > Oh we have scores of these hacks around. Look at the dvd/cd layer. The
> > > point is to get rid of those.
> >
> > Perhaps this is just a matter of cleaning them up so they are no
> > longer hacks?
>
> CD and DVD media support various non power-of-two block sizes. Supporting
> more block sizes would also be useful as we could then read older smart
> media (256byte/sector) without the SCSI layer objecting and the like.
Non-power-of-two block sizes are also desirable for FLASH, where
things like UBI want to stash metadata in parts of eraseblocks.
(I've suggested to the MTD folks that it could be made to work with the
block layer, but their imaginations couldn't seem to get much beyond
"erase blocks aren't sector-sized".)
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 16:39 ` Maxim Levitsky
@ 2007-04-30 5:23 ` Christoph Lameter
0 siblings, 0 replies; 235+ messages in thread
From: Christoph Lameter @ 2007-04-30 5:23 UTC (permalink / raw)
To: Maxim Levitsky
Cc: linux-kernel, Mel Gorman, William Lee Irwin III, David Chinner,
Jens Axboe, Badari Pulavarty
On Sat, 28 Apr 2007, Maxim Levitsky wrote:
> 1) Is it possible for a block device to assume that it will always get big
> requests (aligned to the big blocksize)?
That is one of the key problems. We hope that Mel Gorman's antifrag work
will get us there.
> 2) Does metadata reading/writing also occur using the same big blocksize?
It can, if the filesystem decides to and sets up the order for the mapping.
> 3) If so, how are __bread/__getblk affected? Does the returned buffer_head
> point to the whole block?
Correct.
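For illustration, a hedged usage sketch of what that answer implies (assuming
large blocksize support is configured and the device has been set up with a
64k blocksize; error handling trimmed):

static void read_one_large_block(struct block_device *bdev, sector_t blocknr)
{
	/* with large blocksize support the buffer_head would cover the whole
	 * block: bh->b_size == 64k and bh->b_data spans all of it */
	struct buffer_head *bh = __bread(bdev, blocknr, 64 * 1024);

	if (!bh)
		return;
	/* ... use bh->b_data ... */
	brelse(bh);
}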
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-28 6:52 ` Andrew Morton
@ 2007-04-30 5:30 ` Christoph Lameter
0 siblings, 0 replies; 235+ messages in thread
From: Christoph Lameter @ 2007-04-30 5:30 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, linux-kernel, Mel Gorman, William Lee Irwin III,
Jens Axboe, Badari Pulavarty, Maxim Levitsky, Nick Piggin
On Fri, 27 Apr 2007, Andrew Morton wrote:
> By misunderstanding any suggestions, misrepresenting them, making incorrect
> statements about them, by not suggesting any alternatives yourself, all of
> it buttressed by a stolid refusal to recognise that this patch has any
> costs.
That was even mentioned in the initial post.... Definitely it would
require significant changes but getting there is fairly straightforward
with the use of compound pages.
> This effectively leaves it up to others to find time to think about and to
> implement possible alternative solutions to the problems which you're
> observing.
They are working on other problems like radix tree scalability it seems.
> The alternative which is on the table (and there may be others) is
> populating pagecache with physically contiguous pages. This will fix the
> HBA problem and is much less costly in terms of maintenance and will
> improve all workloads on all machines and doesn't have the additional
> runtime costs of pagecache wastage and more memset() overhead with small
> files and it doesn't require administrator intervention.
>
> OTOH (yes! there are tradeoffs!) it will consume an unknown amount more
> CPU and it doesn't address the large-fs-blocksize requirement, but I don't
> know how important the latter is and given the unrelenting advocacy storm
> coming from the SGI direction I don't know how to find that out, frankly.
This is certainly a nice approach if it works, and it may address one
issue that motivated this patchset, but it does not address all of them.
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 8:48 ` Andrew Morton
2007-04-27 16:45 ` Theodore Tso
@ 2007-05-04 12:57 ` Eric W. Biederman
1 sibling, 0 replies; 235+ messages in thread
From: Eric W. Biederman @ 2007-05-04 12:57 UTC (permalink / raw)
To: Andrew Morton
Cc: David Chinner, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
Andrew Morton <akpm@linux-foundation.org> writes:
> On Fri, 27 Apr 2007 18:03:21 +1000 David Chinner <dgc@sgi.com> wrote:
>
>> > > > > You basically have to
>> > > > > jump through nasty, nasty hoops, to handle corner cases that are
>> > > > > introduced
>> > > > > because the generic code can no longer reliably lock out access to a
>> > > > > filesystem block.
>> > >
>> > > This way lies insanity.
>> >
>> > You're addressing Christoph's straw man here.
>>
>> No, I'm speaking from years of experience working on a
>> page/buffer/chunk cache capable of using both large pages and
>> aggregating multiple pages. It has, at times, almost driven me
>> insane and I don't want to go back there.
>
> We're talking about two separate things here - let us not conflate them.
>
> 1: The arguably-crippled HBA which wants bigger SG lists.
>
> 2: The late-breaking large-blocksizes-in-the-fs thing.
Well, from other parts of the conversation there is a third issue.
3: large-sectorsize-on-disk.
There are a handful of devices in the kernel that could benefit
and be cleaned up a great deal if they could assume that the data in
their sg lists always covered full sectors. Nothing needs to be
physically contiguous to handle that case, though.
If we support large sector sizes for raw block devices we would
still have an issue of what to do with filesystems that want
to live on them directly.
> None of this multiple-page-locking stuff we're discussing here is relevant
> to the HBA performance problem. It's pretty simple (I think) for us to
> ensure that, for the great majority of the time, contiguous pages in a file
> are also physically contiguous. Problem solved, HBA go nice and quick,
> move on.
I suspect we will still need Jens' >128-page Linux scatter-gather list
work to take full advantage of this.
> Now, we have this the second and completely unrelated requirement:
> supporting fs-blocksize > PAGE_SIZE. One way to address this is via the
> mangle-multiple-pages-into-one approach. And it's obviously the best way
> to do it, if mangle-multiple-pages is already available.
Yep.
> But I don't know how important requirement 2 is. XFS already has
> presumably-working private code to do it, and there is simplification and
> perhaps modest performance gain in the block allocator to be had here.
>
> And other filesystems (ie: ext4) _might_ use it. But ext4 is extent-based,
> so perhaps it's not worth churning the on-disk format to get a bit of a
> boost in the block allocator.
>
> So I _think_ what this boils down to is offering some simplifications in
> XFS, by adding complexications to core VFS and MM. I dunno if that's a
> good deal.
Agreed.
When we are doing things optimistically and don't absolutely require large
pages this approach seems pretty sane. When we start absolutely requiring
large 64k pages I get nervous.
> So... tell us why you want feature 2?
A good question.
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 8:03 ` David Chinner
2007-04-27 8:48 ` Andrew Morton
@ 2007-05-04 13:31 ` Eric W. Biederman
2007-05-04 16:11 ` Christoph Lameter
2007-05-07 4:58 ` David Chinner
1 sibling, 2 replies; 235+ messages in thread
From: Eric W. Biederman @ 2007-05-04 13:31 UTC (permalink / raw)
To: David Chinner
Cc: Andrew Morton, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
David Chinner <dgc@sgi.com> writes:
> On Fri, Apr 27, 2007 at 12:04:03AM -0700, Andrew Morton wrote:
>
> I've looked at all this but I'm trying to work out if anyone
> else has looked at the impact of doing this. I have direct experience
> with this form of block aggregation - this is pretty much what is
> done in irix - and it's full of nasty, ugly corner cases.
>
> I've got several year-old Irix bugs assigned that are hit every so
> often where one page in the aggregated set has the wrong state, and
> it's simply not possible to either reproduce the problem or work out
> how it happened. The code has grown too complex and convoluted, and
> by the time the problem is noticed (either by hang, panic or bug
> check) the cause of it is long gone.
>
> I don't want to go back to having to deal with this sort of problem
> - I'd much prefer to have a design that does not make the same
> mistakes that lead to these sorts of problem.
So the practical question is: was it a high-level design problem, or
was it simply an implementation choice?
Until we code review an implementation that does page aggregation for
Linux we can't say how nasty it would be.
Of course, what gets confusing is that you refer to the previous
implementation as a buffer cache, because that isn't at all what Linux
had for a buffer cache. The Linux buffer cache was the same as the
current page cache except that it was indexed by block number rather
than by offset into a file.
>>
>> You're addressing Christoph's straw man here.
>
> No, I'm speaking from years of experience working on a
> page/buffer/chunk cache capable of using both large pages and
> aggregating multiple pages. It has, at times, almost driven me
> insane and I don't want to go back there.
The suggestion seems to be to always aggregate pages (to handle
PAGE_SIZE < block size), and not to even worry about the fact
that it happens that the pages you are aggregating are physically
contiguous. The memory allocator and the block layer can worry
about that. It isn't something the page cache or filesystems
need to pay attention to.
I suspect the implementation in linux would be sufficiently different
that it would not be prone to the same problems. Among other things
we already do most things on a range of page addresses, so we
would seem to have most of the infrastructure already.
It looks like we need to extend the current batching a little more so
that it covers all of the interesting cases, namely:
- Ensure the dirty bit is set on all pages in the group when we set it
  on one page.
- Add a re-read when we dirty the group if we don't have it all present.
- Round the range we operate on up so we cleanly hit the beginning
  and end of the group size.
- Only issue the mapping operations on the first page in the group.
That is about what we would have to do to handle multiple pages in one
block in the page cache (a rough sketch of the dirtying step follows
below). There are clearly more details, but as a first approximation I
don't see this being fundamentally more complex than what we are
currently doing - it just takes a few more details into account.
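A rough, hedged sketch of that dirtying step - none of this is from an
actual patch, and the helper name and pages_per_block parameter are
invented for illustration. The point is only that dirtying one page of
a block marks every page backing that block dirty, so they get written
out together:

#include <linux/mm.h>
#include <linux/pagemap.h>

static void set_block_group_dirty(struct address_space *mapping,
				  pgoff_t index, unsigned int pages_per_block)
{
	/* assume pages_per_block is a power of two */
	pgoff_t start = index & ~((pgoff_t)pages_per_block - 1);
	unsigned int i;

	for (i = 0; i < pages_per_block; i++) {
		struct page *page = find_get_page(mapping, start + i);

		if (!page)
			continue;	/* the "re-read" step would go here */
		set_page_dirty(page);
		page_cache_release(page);
	}
}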
The whole physical contiguity thing seems to come cleanly out of a
speculative page allocator, and that would seem to work and provide
improvements for smaller block size filesystems too, so it looks like
a larger general improvement.
Likewise, Jens' increase of the Linux scatter-gather list size seems like
a more general independent improvement.
So if we can also handle groups of pages that make up a single block
as an independent change, we have all of the benefits of large block
sizes, with most of them applying to small sector size filesystems
as well.
Given that small block sizes give us better storage efficiency,
which means less disk bandwidth used, which means less time
to get the data off of a slow disk (especially if you can
put multiple files you want simultaneously in that same space).
I'm not convinced that large block sizes are a clear disk performance
advantage, so we should not neglect the small file sizes.
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-04-27 16:45 ` Theodore Tso
@ 2007-05-04 13:33 ` Eric W. Biederman
2007-05-07 4:29 ` David Chinner
0 siblings, 1 reply; 235+ messages in thread
From: Eric W. Biederman @ 2007-05-04 13:33 UTC (permalink / raw)
To: Theodore Tso
Cc: Andrew Morton, David Chinner, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
Theodore Tso <tytso@mit.edu> writes:
> On Fri, Apr 27, 2007 at 01:48:49AM -0700, Andrew Morton wrote:
>> And other filesystems (ie: ext4) _might_ use it. But ext4 is extent-based,
>> so perhaps it's not worth churning the on-disk format to get a bit of a
>> boost in the block allocator.
>
> Well, ext3 could definitely use it; there are people using 8k and 16k
> blocksizes on ia64 systems today. Those filesystems can't be mounted
> on x86 or x86_64 systems because our pagesize is 4k, though.
>
> And I imagine that ext4 might want to use a large blocksize too ---
> after all, XFS is extent based as well, and not _all_ of the
> advantages of using a larger blocksize are related to brain-damaged
> storage subsystems with short SG list support. Whether the advantages
> offset the internal fragmentation overhead or the complexity of adding
> fragments support is a different question, of course.
>
> So while the jury is out about how many other filesystems might use
> it, I suspect it's more than you might think. At the very least,
> there may be some IA64 users who might be trying to transition their
> way to x86_64, and have existing filesystems using a 8k or 16k
> block filesystems. :-)
How much of a problem would it be if those blocks were not necessarily
contiguous in RAM, but placed in normal 4K pages in the page cache?
I expect meta data operations would have to be modified but that otherwise
you would not care.
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-05-04 13:31 ` Eric W. Biederman
@ 2007-05-04 16:11 ` Christoph Lameter
2007-05-07 4:58 ` David Chinner
1 sibling, 0 replies; 235+ messages in thread
From: Christoph Lameter @ 2007-05-04 16:11 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Chinner, Andrew Morton, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Fri, 4 May 2007, Eric W. Biederman wrote:
> Given that small block sizes give us better storage efficiency,
> which means less disk bandwidth used, which means less time
> to get the data off of a slow disk (especially if you can
> put multiple files you want simultaneously in that same space).
> I'm not convinced that large block sizes are a clear disk performance
> advantage, so we should not neglect the small file sizes.
And the 50% gain in the benchmarks means what? That the device
manufacturers have to redesign their chips?
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-05-04 13:33 ` Eric W. Biederman
@ 2007-05-07 4:29 ` David Chinner
2007-05-07 4:48 ` Eric W. Biederman
0 siblings, 1 reply; 235+ messages in thread
From: David Chinner @ 2007-05-07 4:29 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Theodore Tso, Andrew Morton, David Chinner, clameter,
linux-kernel, Mel Gorman, William Lee Irwin III, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Fri, May 04, 2007 at 07:33:54AM -0600, Eric W. Biederman wrote:
> >
> > So while the jury is out about how many other filesystems might use
> > it, I suspect it's more than you might think. At the very least,
> > there may be some IA64 users who might be trying to transition their
> > way to x86_64, and have existing filesystems using a 8k or 16k
> > block filesystems. :-)
>
> How much of a problem would it be if those blocks were not necessarily
> contiguous in RAM, but placed in normal 4K pages in the page cache?
If you need to treat the block as a contiguous range, then you need to
vmap() the discontiguous pages. That has substantial overhead if you
have to do it regularly.
We do this in xfs_buf.c for > page size blocks - the overhead that this
caused when operating on inode clusters resulted in us doing some
pointer fiddling and directly addressing the contents of each page
to avoid the vmap overhead. See xfs_buf_offset() and friends....
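For illustration only - this is not the actual xfs_buf_offset() code,
just a sketch of the pointer-fiddling idea: translate a byte offset
within a multi-page buffer into a pointer inside the right page, so
small structures can be touched without vmap()ing the whole block.

#include <linux/mm.h>

static void *buf_offset(struct page **pages, size_t offset)
{
	/* only valid for lowmem pages (no highmem/kmap handling here),
	 * and only if the structure does not cross a page boundary */
	struct page *page = pages[offset >> PAGE_SHIFT];

	return page_address(page) + (offset & (PAGE_SIZE - 1));
}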
> I expect meta data operations would have to be modified but that otherwise
> you would not care.
I think you might need to modify the copy-in and copy-out operations
substantially (e.g. prepare_/commit_write()) as they assume a buffer doesn't
> span multiple pages.....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-05-07 4:29 ` David Chinner
@ 2007-05-07 4:48 ` Eric W. Biederman
2007-05-07 5:27 ` David Chinner
0 siblings, 1 reply; 235+ messages in thread
From: Eric W. Biederman @ 2007-05-07 4:48 UTC (permalink / raw)
To: David Chinner
Cc: Theodore Tso, Andrew Morton, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
David Chinner <dgc@sgi.com> writes:
> On Fri, May 04, 2007 at 07:33:54AM -0600, Eric W. Biederman wrote:
>> >
>> > So while the jury is out about how many other filesystems might use
>> > it, I suspect it's more than you might think. At the very least,
>> > there may be some IA64 users who might be trying to transition their
>> > way to x86_64, and have existing filesystems using a 8k or 16k
>> > block filesystems. :-)
>>
>> How much of a problem would it be if those blocks were not necessarily
>> contiguous in RAM, but placed in normal 4K pages in the page cache?
>
> If you need to treat the block in a contiguous range, then you need to
> vmap() the discontiguous pages. That has substantial overhead if you
> have to do it regularly.
Which is why I would prefer not to do it. I think vmap is not really
compatible with the design of the Linux page cache.
Although we can't even count on the pages being mapped into low
memory right now, and have to call kmap if we want to access them,
so things might not be that bad - even if it was a multipage kmap
type operation.
> We do this in xfs_buf.c for > page size blocks - the overhead that
> caused when operating on inode clusters resulted in us doing some
> pointer fiddling and directly addressing the contents of each page
> to avoid the vmap overhead. See xfs_buf_offset() and friends....
>
>> I expect meta data operations would have to be modified but that otherwise
>> you would not care.
>
> I think you might need to modify the copy-in and copy-out operations
> substantially (e.g. prepare_/commit_write()) as they assume a buffer doesn't
> span multiple pages.....
But in a filesystem like ext2, except for zeroing some unused hunks
of the page, all that really happens is that you set up for DMA straight
out of the page cache. So this is primarily an issue for meta-data.
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-05-04 13:31 ` Eric W. Biederman
2007-05-04 16:11 ` Christoph Lameter
@ 2007-05-07 4:58 ` David Chinner
2007-05-07 6:56 ` Eric W. Biederman
1 sibling, 1 reply; 235+ messages in thread
From: David Chinner @ 2007-05-07 4:58 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Chinner, Andrew Morton, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Fri, May 04, 2007 at 07:31:37AM -0600, Eric W. Biederman wrote:
> David Chinner <dgc@sgi.com> writes:
>
> > On Fri, Apr 27, 2007 at 12:04:03AM -0700, Andrew Morton wrote:
> > I've got several year-old Irix bugs assigned that are hit every so
> > often where one page in the aggregated set has the wrong state, and
> > it's simply not possible to either reproduce the problem or work out
> > how it happened. The code has grown too complex and convoluted, and
> > by the time the problem is noticed (either by hang, panic or bug
> > check) the cause of it is long gone.
> >
> > I don't want to go back to having to deal with this sort of problem
> > - I'd much prefer to have a design that does not make the same
> > mistakes that lead to these sorts of problem.
>
> So the practical question is. Was it a high level design problem or
> was it simply a choice of implementation issue.
Both. Too many things can happen asynchronously to a page, which
makes it just about impossible to predict all the potential race
conditions that are involved. Complexity arose from trying to fix
the races that were uncovered without breaking everything else...
> Until we code review an implementation that does page aggregation for
> Linux we can't say how nasty it would be.
We already have an implementation - I've pointed it out several times
now: see fs/xfs/linux-2.6/xfs_buf.[ch].
There are a lot of nasties in there....
> >> You're addressing Christoph's straw man here.
> >
> > No, I'm speaking from years of experience working on a
> > page/buffer/chunk cache capable of using both large pages and
> > aggregating multiple pages. It has, at times, almost driven me
> > insane and I don't want to go back there.
>
> The suggestion seems to be to always aggregate pages (to handle
> PAGE_SIZE < block size), and not to even worry about the fact
> that it happens that the pages you are aggregating are physically
> contiguous. The memory allocator and the block layer can worry
> about that. It isn't something the page cache or filesystems
> need to pay attention to.
performance problems in using discontiguous pages and needing to
vmap() them say otherwise....
> I suspect the implementation in linux would be sufficiently different
> that it would not be prone to the same problems. Among other things
>> we already do most things on a range of page addresses, so we
> would seem to have most of the infrastructure already.
Filesystems don't typically do this - they work on blocks and assume
that a block can be directly referenced.
> Given that small block sizes give us better storage efficiency,
> which means less disk bandwidth used, which means less time
> to get the data off of a slow disk (especially if you can
> put multiple files you want simultaneously in that same space).
> I'm not convinced that large block sizes are a clear disk performance
> advantage, so we should not neglect the small file sizes.
Hmmm - we're not talking about using 64k block size filesystems to
store lots of little files or using them on small, slow disks.
We're looking at optimising for multi-petabyte filesystems with
multi-terabyte sized files sustaining throughput of tens to hundreds
of GB/s to/from hundreds to thousands of disks.
I certainly don't consider 64k block size filesystems as something
suitable for desktop use - maybe PVRs would benefit, but this
is not something you'd use for your kernel build environment on a
single disk in a desktop system....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-05-07 4:48 ` Eric W. Biederman
@ 2007-05-07 5:27 ` David Chinner
2007-05-07 6:43 ` Eric W. Biederman
0 siblings, 1 reply; 235+ messages in thread
From: David Chinner @ 2007-05-07 5:27 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Chinner, Theodore Tso, Andrew Morton, clameter,
linux-kernel, Mel Gorman, William Lee Irwin III, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
On Sun, May 06, 2007 at 10:48:23PM -0600, Eric W. Biederman wrote:
> David Chinner <dgc@sgi.com> writes:
>
> > On Fri, May 04, 2007 at 07:33:54AM -0600, Eric W. Biederman wrote:
> >> >
> >> > So while the jury is out about how many other filesystems might use
> >> > it, I suspect it's more than you might think. At the very least,
> >> > there may be some IA64 users who might be trying to transition their
> >> > way to x86_64, and have existing filesystems using a 8k or 16k
> >> > block filesystems. :-)
> >>
> >> How much of a problem would it be if those blocks were not necessarily
> >> contiguous in RAM, but placed in normal 4K pages in the page cache?
> >
> > If you need to treat the block in a contiguous range, then you need to
> > vmap() the discontiguous pages. That has substantial overhead if you
> > have to do it regularly.
>
> Which is why I would prefer not to do it. I think vmap is not really
> compatible with the design of the linux page cache.
Right - so how do we efficiently manipulate data inside a large
block that spans multiple discontiguous pages if we don't vmap
it?
> Although we can't even count on the pages being mapped into low
> memory right now and have to call kmap if we want to access them
> so things might not be that bad. Even if it was a multipage kmap
> type operation.
Except when your structures span page boundaries. Then you can't directly
reference the structure - it needs to be copied out elsewhere, modified
and copied back. That's messy and will require significant modification
to any filesystem that wants large block sizes....
> > We do this in xfs_buf.c for > page size blocks - the overhead that
> > caused when operating on inode clusters resulted in us doing some
> > pointer fiddling and directly addressing the contents of each page
> > to avoid the vmap overhead. See xfs_buf_offset() and friends....
> >
> >> I expect meta data operations would have to be modified but that otherwise
> >> you would not care.
> >
> > I think you might need to modify the copy-in and copy-out operations
> > substantially (e.g. prepare_/commit_write()) as they assume a buffer doesn't
> > span multiple pages.....
>
> But in a filesystem like ext2 except for a zeroing some unused hunks
> of the page all that really happens is you setup for DMA straight out
> of the page cache. So this is primarily an issue for meta-data.
I'm not sure I follow you here - copyin/copyout is to userspace and
has to handle things like RMW cycles to a filesystem block. e.g. if
we get a partial block over-write, we need to read in all the bits
around it and that will span multiple discontiguous pages. Currently
these functions only handle RMW operations on something up to a
single page in size - to handle a RMW cycle on a block larger than a
page they are going to need substantial modification or entirely
new interfaces.
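A hedged sketch of the read-modify-write problem being described here -
this is not kernel code from any patch, just an outline of the logic.
For a write covering [from, to) into a block backed by pages_per_block
pages starting at page index "first", it computes which backing pages
are not completely overwritten and therefore would have to be read in
first (assuming pages_per_block <= BITS_PER_LONG for simplicity):

#include <linux/types.h>
#include <linux/mm.h>

static unsigned long pages_needing_read(pgoff_t first,
					unsigned int pages_per_block,
					loff_t from, loff_t to)
{
	unsigned long mask = 0;
	unsigned int i;

	for (i = 0; i < pages_per_block; i++) {
		loff_t pg_start = (loff_t)(first + i) << PAGE_SHIFT;
		loff_t pg_end = pg_start + PAGE_SIZE;

		/* a page fully inside the write is overwritten anyway */
		if (pg_start >= from && pg_end <= to)
			continue;
		mask |= 1UL << i;
	}
	return mask;
}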
The high order page cache avoids the need to redesign interfaces
because it doesn't change the interfaces between the filesystem
and the page cache - everything still effectively operates
on single pages and the filesystem block size never exceeds the
size of a single page.....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-05-07 5:27 ` David Chinner
@ 2007-05-07 6:43 ` Eric W. Biederman
2007-05-07 6:49 ` William Lee Irwin III
2007-05-07 16:06 ` Christoph Lameter
0 siblings, 2 replies; 235+ messages in thread
From: Eric W. Biederman @ 2007-05-07 6:43 UTC (permalink / raw)
To: David Chinner
Cc: Eric W. Biederman, Theodore Tso, Andrew Morton, clameter,
linux-kernel, Mel Gorman, William Lee Irwin III, Jens Axboe,
Badari Pulavarty, Maxim Levitsky
David Chinner <dgc@sgi.com> writes:
> On Sun, May 06, 2007 at 10:48:23PM -0600, Eric W. Biederman wrote:
>> David Chinner <dgc@sgi.com> writes:
>>
>> > On Fri, May 04, 2007 at 07:33:54AM -0600, Eric W. Biederman wrote:
>> >> >
>> >> > So while the jury is out about how many other filesystems might use
>> >> > it, I suspect it's more than you might think. At the very least,
>> >> > there may be some IA64 users who might be trying to transition their
>> >> > way to x86_64, and have existing filesystems using a 8k or 16k
>> >> > block filesystems. :-)
>> >>
>> >> How much of a problem would it be if those blocks were not necessarily
>> >> contiguous in RAM, but placed in normal 4K pages in the page cache?
>> >
>> > If you need to treat the block in a contiguous range, then you need to
>> > vmap() the discontiguous pages. That has substantial overhead if you
>> > have to do it regularly.
>>
>> Which is why I would prefer not to do it. I think vmap is not really
>> compatible with the design of the linux page cache.
>
> Right - so how do we efficiently manipulate data inside a large
> block that spans multiple discontiguous pages if we don't vmap
> it?
You don't manipulate data except for copy_from_user, copy_to_user.
That is comparatively easy to deal with, and certainly doesn't
need vmap.
Meta-data may be trickier, but a lot of that depends on your
individual filesystem and how it organizes its meta-data.
>> Although we can't even count on the pages being mapped into low
>> memory right now and have to call kmap if we want to access them
>> so things might not be that bad. Even if it was a multipage kmap
>> type operation.
>
> Except when your structures span page boundaries. Then you can't directly
> reference the structure - it needs to be copied out elsewhere, modified
> and copied back. That's messy and will require significant modification
> to any filesystem that wants large block sizes....
Potentially. This is just a meta-data problem, and possibly we
solve it with something like vmap. Possibly the filesystem won't
cross those kinds of boundaries and we simply never care.
The fact that it is a meta-data problem suggests it isn't the fast
path and we can incur a little more cost, especially if filesystems
with large block sizes are rare.
> I'm not sure I follow you here - copyin/copyout is to userspace and
> has to handle things like RMW cycles to a filesystem block. e.g. if
> we get a partial block over-write, we need to read in all the bits
> around it and that will span multiple discontiguous pages. Currently
> these functions only handle RMW operations on something up to a
> single page in size - to handle a RMW cycle on a block larger than a
> page they are going to need substantial modification or entirely
> new interfaces.
Bleh. It has been too many days since I last hacked that code, and I
forgot which piece that was. Yes - prepare_write() is called before
we write to the page cache from the filesystem.
We already handle multiple page writes fairly well in that context.
prepare_write/commit_write may need page cache changes, but maybe not.
All that really needs to happen is that all of the pages that
are part of the block get marked dirty in the page cache so one
won't get written without the others.
> The high order page cache avoids the need to redesign interfaces
> because it doesn't change the interfaces between the filesystem
> and the page cache - everything still effectively operates
> on single pages and the filesystem block size never exceeds the
> size of a single page.....
Yes, instead of having to redesign the interface between the
fs and the page cache for those filesystems that handle large
blocks we instead need to redesign significant parts of the VM interface.
Shift the redesign work to another group of people and call it trivial.
That is hardly a gain when it looks like you can have the same effect
with some moderately simple changes to mm/filemap.c and the existing
interfaces.
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-05-07 6:43 ` Eric W. Biederman
@ 2007-05-07 6:49 ` William Lee Irwin III
2007-05-07 7:06 ` William Lee Irwin III
2007-05-07 16:06 ` Christoph Lameter
1 sibling, 1 reply; 235+ messages in thread
From: William Lee Irwin III @ 2007-05-07 6:49 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Chinner, Theodore Tso, Andrew Morton, clameter,
linux-kernel, Mel Gorman, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
David Chinner <dgc@sgi.com> writes:
>> Right - so how do we efficiently manipulate data inside a large
>> block that spans multiple discontiguous pages if we don't vmap
>> it?
On Mon, May 07, 2007 at 12:43:19AM -0600, Eric W. Biederman wrote:
> You don't manipulate data except for copy_from_user, copy_to_user.
> That is easy comparatively to deal with, and certainly doesn't
> need vmap.
> Meta-data may be trickier, but a lot of that depends on your
> individual filesystem and how it organizes its meta-data.
I wonder what happened to my pagearray patches.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-05-07 4:58 ` David Chinner
@ 2007-05-07 6:56 ` Eric W. Biederman
2007-05-07 15:17 ` Weigert, Daniel
0 siblings, 1 reply; 235+ messages in thread
From: Eric W. Biederman @ 2007-05-07 6:56 UTC (permalink / raw)
To: David Chinner
Cc: Andrew Morton, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
David Chinner <dgc@sgi.com> writes:
> Both. Too many things can happen asynchronously to a page, which
> makes it just about impossible to predict all the potential race
> conditions that are involved. Complexity arose from trying to fix
> the races that were uncovered without breaking everything else...
Ok.
>> Until we code review an implementation that does page aggregation for
>> Linux we can't say how nasty it would be.
>
> We already have an implementation - I've pointed it out several times
> now: see fs/xfs/linux-2.6/xfs_buf.[ch].
>
> There are a lot of nasties in there....
Yes, but it isn't a generic implementation in mm/filemap.c,
it is a compatibility layer. It lives with the current deficiencies
instead of removing them.
>> >> You're addressing Christoph's straw man here.
>> >
>> > No, I'm speaking from years of experience working on a
>> > page/buffer/chunk cache capable of using both large pages and
>> > aggregating multiple pages. It has, at times, almost driven me
>> > insane and I don't want to go back there.
>>
>> The suggestion seems to be to always aggregate pages (to handle
>> PAGE_SIZE < block size), and not to even worry about the fact
>> that it happens that the pages you are aggregating are physically
>> contiguous. The memory allocator and the block layer can worry
>> about that. It isn't something the page cache or filesystems
>> need to pay attention to.
>
> performance problems in using discontiguous pages
Small scatter lists?
> and needing to vmap() them says otherwise....
Always?
Ugh. I just realized looking at the xfs code that it doesn't
work in the presence of high memory, at least not with 4K pages.
>> I suspect the implementation in linux would be sufficiently different
>> that it would not be prone to the same problems. Among other things
>> we already do most things on a range of page addresses, so we
>> would seem to have most of the infrastructure already.
>
> Filesystems don't typically do this - they work on blocks and assume
> that a block can be directly referenced.
But that is how mm/filemap.c works. The calls into the filesystem
can be per multi-page group just as they are currently per page. The
point is that the existing in-kernel abstraction is already larger than
a page for doing the work.
>> Given that small block sizes give us better storage efficiency,
>> which means less disk bandwidth used, which means less time
>> to get the data off of a slow disk (especially if you can
>> put multiple files you want simultaneously in that same space).
>> I'm not convinced that large block sizes are a clear disk performance
>> advantage, so we should not neglect the small file sizes.
>
> Hmmm - we're not talking about using 64k block size filesystems to
> store lots of little files or using them on small, slow disks.
> We're looking at optimising for multi-petabyte filesystems with
> multi-terabyte sized files sustaining throughput of tens to hundreds
> of GB/s to/from hundreds to thousands of disks.
>
> I certainly don't consider 64k block size filesystems as something
> suitable for desktop use - maybe PVRs would benefit, but this
> is not something you'd use for your kernel build environment on a
> single disk in a desktop system....
Yes. You are talking about only fixing the kernel for your giant
64K-block filesystems that are only interesting on petabyte arrays.
I am pointing out that the other fixes that have been discussed -
optimistic contiguous page allocation and a larger Linux scatter-gather
list - are interesting at much smaller filesystem and machine sizes
where small files still matter, making them generally better
improvements for Linux.
If you only improve the giant petabyte RAID cases, 99% of Linux users
simply don't care, and so the code isn't very interesting.
Eric
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-05-07 6:49 ` William Lee Irwin III
@ 2007-05-07 7:06 ` William Lee Irwin III
2007-05-08 8:49 ` William Lee Irwin III
0 siblings, 1 reply; 235+ messages in thread
From: William Lee Irwin III @ 2007-05-07 7:06 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Chinner, Theodore Tso, Andrew Morton, clameter,
linux-kernel, Mel Gorman, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
David Chinner <dgc@sgi.com> writes:
>>> Right - so how do we efficiently manipulate data inside a large
>>> block that spans multiple discontiguous pages if we don't vmap
>>> it?
On Mon, May 07, 2007 at 12:43:19AM -0600, Eric W. Biederman wrote:
>> You don't manipulate data except for copy_from_user, copy_to_user.
>> That is easy comparatively to deal with, and certainly doesn't
>> need vmap.
>> Meta-data may be trickier, but a lot of that depends on your
>> individual filesystem and how it organizes its meta-data.
On Sun, May 06, 2007 at 11:49:25PM -0700, William Lee Irwin III wrote:
> I wonder what happened to my pagearray patches.
I never really got the thing working, but I had an idea for a sort of
library to do this. This is/was probably against something like 2.6.5
but I honestly have no idea. Maybe this makes it something of an API
proposal.
-- wli
Index: linux-2.6/include/linux/pagearray.h
===================================================================
--- linux-2.6.orig/include/linux/pagearray.h 2004-04-06 10:56:48.000000000 -0700
+++ linux-2.6/include/linux/pagearray.h 2005-04-22 06:06:02.677494584 -0700
@@ -0,0 +1,24 @@
+#ifndef _LINUX_PAGEARRAY_H
+#define _LINUX_PAGEARRAY_H
+
+struct scatterlist;
+struct vm_area_struct;
+struct page;
+
+struct pagearray {
+ struct page **pages;
+ int nr_pages;
+ size_t length;
+};
+
+int alloc_page_array(struct pagearray *, const int, const size_t);
+void free_page_array(struct pagearray *);
+void zero_page_array(struct pagearray *);
+struct page *nopage_page_array(const struct vm_area_struct *, unsigned long, unsigned long, int *, struct pagearray *);
+int mmap_page_array(const struct vm_area_struct *, struct pagearray *, const size_t, const size_t);
+int copy_page_array_to_user(struct pagearray *, void __user *, const size_t, const size_t);
+int copy_page_array_from_user(struct pagearray *, void __user *, const size_t, const size_t);
+struct scatterlist *pagearray_to_scatterlist(struct pagearray *, size_t, size_t, int *);
+void *vmap_pagearray(struct pagearray *);
+
+#endif /* _LINUX_PAGEARRAY_H */
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile 2005-04-22 06:01:29.786980248 -0700
+++ linux-2.6/mm/Makefile 2005-04-22 06:06:02.677494584 -0700
@@ -10,7 +10,7 @@
obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
page_alloc.o page-writeback.o pdflush.o \
readahead.o slab.o swap.o truncate.o vmscan.o \
- prio_tree.o $(mmu-y)
+ prio_tree.o pagearray.o $(mmu-y)
obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
Index: linux-2.6/mm/pagearray.c
===================================================================
--- linux-2.6.orig/mm/pagearray.c 2004-04-06 10:56:48.000000000 -0700
+++ linux-2.6/mm/pagearray.c 2005-04-22 06:20:26.154226168 -0700
@@ -0,0 +1,293 @@
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/module.h>
+#include <linux/highmem.h>
+#include <linux/pagearray.h>
+#include <asm/uaccess.h>
+#include <asm/scatterlist.h>
+
+/**
+ * alloc_page_array - allocate an array of pages
+ * @pages: the array of pages to be allocated
+ * @gfp_mask: the GFP flags to be passed to the allocator
+ * @length: the amount of data the array needs to hold
+ *
+ * Allocate an array of page pointers long enough so that when full of
+ * pages, the amount of data in length may be stored, then allocate the
+ * pages for each position in the array.
+ */
+int alloc_page_array(struct pagearray *pages, const int gfp_mask, const size_t length)
+{
+ int k;
+ pages->length = PAGE_ALIGN(length);
+ pages->nr_pages = PAGE_ALIGN(length) >> PAGE_SHIFT;
+ pages->pages = kmalloc(pages->nr_pages*sizeof(struct page *), gfp_mask);
+ if (!pages->pages)
+ return -ENOMEM;
+ memset(pages->pages, 0, pages->nr_pages*sizeof(struct page *));
+ for (k = 0; k < pages->nr_pages; ++k) {
+ pages->pages[k] = alloc_page(gfp_mask);
+ if (!pages->pages[k])
+ goto enomem;
+ }
+ return 0;
+enomem:
+ for (--k; k >= 0; --k)
+ __free_page(pages->pages[k]);
+ kfree(pages->pages);
+ memset(pages, 0, sizeof(struct pagearray));
+ return -ENOMEM;
+}
+EXPORT_SYMBOL(alloc_page_array);
+
+/**
+ * free_page_array - free an array of pages
+ * @pages: the array of pages to be freed
+ *
+ * Free an array of pages, including the pages pointed to by the array.
+ */
+void free_page_array(struct pagearray *pages)
+{
+ int k;
+ for (k = 0; k < pages->nr_pages; ++k)
+ __free_page(pages->pages[k]);
+ kfree(pages->pages);
+ memset(pages, 0, sizeof(struct pagearray));
+}
+EXPORT_SYMBOL(free_page_array);
+
+/**
+ * zero_page_array - zero an array of pages
+ * @pages: the array of pages
+ *
+ * Zero out a set of pages pointed to by an array of page pointers.
+ */
+void zero_page_array(struct pagearray *pages)
+{
+ int k;
+ for (k = 0; k < pages->nr_pages; ++k)
+ clear_highpage(pages->pages[k]);
+}
+EXPORT_SYMBOL(zero_page_array);
+
+/**
+ * nopage_page_array - retrieve the page to satisfy a fault with
+ * @vma: the user virtual memory area the fault occurred on
+ * @pgoff: an offset into the underlying array to add to ->vm_pgoff
+ * @vaddr: the user virtual address the fault occurred on
+ * @type: the type of fault that occurred, to be returned
+ * @pages: the array of page pointers
+ *
+ * This is a trivial helper for ->nopage() methods. Simply return the
+ * result of this function after retrieving the page array and its
+ * descriptive parameters from vma->vm_private_data, for instance:
+ * return nopage_page_array(vma, pgoff, vaddr, type, pages);
+ * as the last thing in the ->nopage() method after fetching the
+ * parameters from vma->vm_private_data.
+ */
+struct page *nopage_page_array(const struct vm_area_struct *vma, unsigned long pgoff, unsigned long vaddr, int *type, struct pagearray *pages)
+{
+ if (vaddr >= vma->vm_end)
+ goto sigbus;
+ pgoff += vma->vm_pgoff + ((vaddr - vma->vm_start) >> PAGE_SHIFT);
+ if (pgoff > PAGE_ALIGN(pages->length)/PAGE_SIZE)
+ goto sigbus;
+ if (pgoff > pages->nr_pages)
+ goto sigbus;
+ get_page(pages->pages[pgoff]);
+ if (type)
+ *type = VM_FAULT_MINOR;
+ return pages->pages[pgoff];
+sigbus:
+ if (type)
+ *type = VM_FAULT_SIGBUS;
+ return NOPAGE_SIGBUS;
+}
+EXPORT_SYMBOL(nopage_page_array);
+
+/**
+ * mmap_page_array - mmap an array of pages
+ * @vma: the vma where the mmapping is done
+ * @pages: the array of page pointers
+ * @offset: the offset into the vma in bytes where mmapping should be done
+ * @length: the amount of data that should be mmap'd, in bytes
+ *
+ * vma->vm_pgoff specifies how far out into the page array mmapping
+ * should be done. The page array is treated as a list of the pieces
+ * of an object and vma->vm_pgoff the offset into that object.
+ * vma->vm_page_prot in turn specifies the protections to map with.
+ * offset says where in userspace relative to vma->vm_start to put
+ * the mappings of the pieces of the page array. length specifies how
+ * much data should be mapped into userspace.
+ */
+#ifdef CONFIG_MMU
+int mmap_page_array(const struct vm_area_struct *vma, struct pagearray *pages, const size_t offset, const size_t length)
+{
+ int k, ret = 0;
+ unsigned long end, off, vaddr = vma->vm_start + offset;
+ off = (vma->vm_pgoff << PAGE_SHIFT) + offset;
+ end = vaddr + length;
+ if (vaddr >= end)
+ return -EINVAL;
+ else if (offset != PAGE_ALIGN(offset))
+ return -EINVAL;
+ else if (offset + length > pages->length)
+ return -EINVAL;
+ k = off >> PAGE_SHIFT;
+ while (vaddr < end && !ret) {
+ pgd_t *pgd;
+ pud_t *pud;
+
+ spin_lock(&vma->vm_mm->page_table_lock);
+ pgd = pgd_offset(vma->vm_mm, vaddr);
+ pud = pud_alloc(vma->vm_mm, pgd, vaddr);
+ if (!pud) {
+ ret = -ENOMEM;
+ break;
+ } else {
+ pmd_t *pmd = pmd_alloc(vma->vm_mm, pud, vaddr);
+ if (!pmd) {
+ ret = -ENOMEM;
+ break;
+ } else {
+ pte_t val, *pte;
+
+ pte = pte_alloc_map(vma->vm_mm, pmd, vaddr);
+ if (!pte) {
+ ret = -ENOMEM;
+ break;
+ } else {
+ val = mk_pte(pages->pages[k], vma->vm_page_prot);
+ set_pte(pte, val);
+ pte_unmap(pte);
+ update_mmu_cache(vma, vaddr, val);
+ }
+ }
+ }
+ spin_unlock(&vma->vm_mm->page_table_lock);
+ vaddr += PAGE_SIZE;
+ off += PAGE_SIZE;
+ ++k;
+ }
+ return ret;
+}
+#else
+int mmap_page_array(const struct vm_area_struct *vma, struct pagearray *pages, const size_t offset, const size_t length)
+{
+ return -ENOSYS;
+}
+#endif
+EXPORT_SYMBOL(mmap_page_array);
+
+static int copy_page_array(struct pagearray *pages, char __user *buf, const size_t offset, const size_t length, const int rw)
+{
+ size_t pos = 0, off = offset, remaining = length;
+ int k;
+
+ if (length > pages->length)
+ return -EFAULT;
+ else if (length > MM_VM_SIZE(current->mm))
+ return -EFAULT;
+ else if ((unsigned long)buf > MM_VM_SIZE(current->mm) - length)
+ return -EFAULT;
+
+ for (k = off >> PAGE_SHIFT; k < pages->nr_pages && remaining > 0; ++k) {
+ unsigned long left, tail, suboff = off & PAGE_MASK;
+ char *kbuf = kmap_atomic(pages->pages[k], KM_USER0);
+ tail = min(PAGE_SIZE - suboff, (unsigned long)remaining);
+ if (rw)
+ left = __copy_to_user(&buf[pos], &kbuf[suboff], tail);
+ else
+ left = __copy_from_user(&kbuf[suboff], &buf[pos], tail);
+ kunmap_atomic(kbuf, KM_USER0);
+ if (left) {
+ kbuf = kmap(pages->pages[k]);
+ if (rw)
+ left = __copy_to_user(&buf[pos], &kbuf[suboff], tail);
+ else
+ left = __copy_from_user(&kbuf[suboff], &buf[pos], tail);
+ kunmap(pages->pages[k]);
+ }
+ BUG_ON(tail - left > remaining);
+ remaining -= tail - left;
+ pos += tail - left;
+ off = (off + PAGE_SIZE) & PAGE_MASK;
+ if (left)
+ break;
+ }
+ return remaining;
+}
+
+/**
+ * copy_page_array_to_user - copy data from a page array to userspace
+ * @pages: the array of page pointers holding the data
+ * @buf: the user virtual address to start depositing the data at
+ * @offset: the offset into the page array to start copying data from
+ * @length: how much data to copy
+ *
+ * Copy data from a page array, starting offset bytes into the array
+ * when it's treated as a list of the pieces of an object in order,
+ * to userspace.
+ */
+int copy_page_array_to_user(struct pagearray *pages, void __user *buf, const size_t offset, const size_t length)
+{
+ return copy_page_array(pages, buf, offset, length, 1);
+}
+EXPORT_SYMBOL(copy_page_array_to_user);
+
+/**
+ * copy_page_array_from_user - copy data from userspace to a page array
+ * @pages: the array of page pointers holding the data
+ * @buf: the user virtual address to start reading the data from
+ * @offset: the offset into the page array to start copying data to
+ * @length: how much data to copy
+ *
+ * Copy data to a page array, starting offset bytes into the array
+ * when it's treated as a list of the pieces of an object in order,
+ * from userspace.
+ */
+int copy_page_array_from_user(struct pagearray *pages, void __user *buf, const size_t offset, const size_t length)
+{
+ return copy_page_array(pages, buf, offset, length, 0);
+}
+EXPORT_SYMBOL(copy_page_array_from_user);
+
+/**
+ * pagearray_to_scatterlist - generate a scatterlist for a slice of a pagearray
+ * @pages: the pagearray to make a scatterlist for
+ * @offset: the offset into the pagearray of the start of the slice
+ * @length: the length of the slice of the pagearray
+ * @sglist_len: the size of the generated scatterlist
+ *
+ * Set up a scatterlist covering a slice of a pagearray, starting at offset
+ * bytes into the pagearray, with length length.
+ */
+struct scatterlist *pagearray_to_scatterlist(struct pagearray *pages, size_t offset, size_t length, int *sglist_len)
+{
+ struct scatterlist *sg;
+ int i, nr_pages =
+ (PAGE_ALIGN(offset + length) - (offset & PAGE_MASK))/PAGE_SIZE;
+ sg = kmalloc(nr_pages * sizeof(struct scatterlist), GFP_KERNEL);
+ if (!sg)
+ return NULL;
+ memset(sg, 0, nr_pages * sizeof(struct scatterlist));
+ sg[0].page = pages->pages[offset >> PAGE_SHIFT];
+ sg[0].offset = offset & ~PAGE_MASK;
+ sg[0].length = PAGE_SIZE - sg[0].offset;
+ offset = (offset + PAGE_SIZE) & PAGE_MASK;
+ for (i = 1; i < nr_pages - 1; ++i) {
+ sg[i].page = pages->pages[i];
+ sg[i].length = PAGE_SIZE;
+ }
+ sg[i].page = pages->pages[i];
+ sg[i].length = (offset + length) & ~PAGE_MASK;
+ *sglist_len = nr_pages;
+ return sg;
+}
+EXPORT_SYMBOL(pagearray_to_scatterlist);
+
+void *vmap_pagearray(struct pagearray *pages)
+{
+ return vmap(pages->pages, pages->nr_pages, VM_MAP, PAGE_KERNEL);
+}
+EXPORT_SYMBOL(vmap_pagearray);
^ permalink raw reply [flat|nested] 235+ messages in thread
* RE: [00/17] Large Blocksize Support V3
2007-05-07 6:56 ` Eric W. Biederman
@ 2007-05-07 15:17 ` Weigert, Daniel
0 siblings, 0 replies; 235+ messages in thread
From: Weigert, Daniel @ 2007-05-07 15:17 UTC (permalink / raw)
To: Eric W. Biederman, David Chinner
Cc: Andrew Morton, clameter, linux-kernel, Mel Gorman,
William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
Just to stick my two cents in here:
The definition of what is meant by "large" filesystems has to change
with the advances in disk drive technology. In the not too distant
past, a "large" single filesystem was 100 GB. There are now consumer
grade disks on the market with 1 TB available in a single unit. I don't
know about you guys, but that scares the crap out of me, in terms of
dealing with that much space on a desktop machine. Efficiently dealing
with transferring that much data on a desktop (never mind server) means
re-thinking the limitations of the I/O subsystems. What was once the
realm of the data center is now the realm of the living room. Large
data sets are becoming more commonplace (HD Movies, audio files, etc)
with each passing day, and there is no end in sight in the progression.
In addition, with the release of specs recently about larger sector
sizes for disk drives (2048 bytes, or larger), this is going to become a
pressing need for the general case, not just the extremely large
servers, or HPC machines and clusters. Already there is no efficient
way to back up that much space, in a reasonable time, except to have
another disk of a similar or larger size to back up to. Anything we can
do to make disk I/O *Faster* is a win.
I recognize that there is a huge issue in dealing with sub-block-size
files. The trade-off of small files vs. large blocks is now a non-trivial
problem. Once disk sector sizes increase, the problems will have to be
dealt with in a more intelligent manner - possibly dividing sectors into
smaller logical blocks for small files? Maybe filesystems that can
understand multiple block sizes?
Well, we do live in interesting times; we just have to make the most of
it.
Dan Weigert
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-05-07 6:43 ` Eric W. Biederman
2007-05-07 6:49 ` William Lee Irwin III
@ 2007-05-07 16:06 ` Christoph Lameter
2007-05-07 17:29 ` William Lee Irwin III
1 sibling, 1 reply; 235+ messages in thread
From: Christoph Lameter @ 2007-05-07 16:06 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Chinner, Theodore Tso, Andrew Morton, linux-kernel,
Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Mon, 7 May 2007, Eric W. Biederman wrote:
> Yes, instead of having to redesign the interface between the
> fs and the page cache for those filesystems that handle large
> blocks we instead need to redesign significant parts of the VM interface.
> Shift the redesign work to another group of people and call it trivial.
To some extent that is true. But then there will also be additional
gain: we can likely get the VM to handle larger pages too, which may get
rid of hugetlbfs etc. The work is pretty straightforward - no locking
changes, for example - so hardly a redesign. I think the crucial point is
the antifrag/defrag issue if we want to generalize it.
I have an updated patch here that relies on page reservations. It adds
something called page pools: on bootup you specify how many pages of
each size you want, and the page cache will then use those pages for
filesystems that need a larger blocksize.
The interesting thing about that one is that it actually enables support
for multiple blocksizes with a single larger pagesize. If, for example, we
set up a pool of 64k pages then the block layer can segment them into 16k
pieces, so one can actually use 16k, 32k and 64k block sizes with a single
larger page size.
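Christoph's updated patch is not shown in this thread, so the following
is a purely hypothetical illustration of the kind of boot-time
reservation being described - e.g. "pagepool=64k:1024" reserving 1024
pages of 64k each. The parameter name and format are invented:

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/string.h>

static unsigned long pagepool_64k_count;

static int __init pagepool_setup(char *str)
{
	/* accept "64k:<count>"; real code would parse multiple sizes */
	if (!strncmp(str, "64k:", 4))
		pagepool_64k_count = simple_strtoul(str + 4, NULL, 0);
	return 1;
}
__setup("pagepool=", pagepool_setup);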
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-05-07 16:06 ` Christoph Lameter
@ 2007-05-07 17:29 ` William Lee Irwin III
0 siblings, 0 replies; 235+ messages in thread
From: William Lee Irwin III @ 2007-05-07 17:29 UTC (permalink / raw)
To: Christoph Lameter
Cc: Eric W. Biederman, David Chinner, Theodore Tso, Andrew Morton,
linux-kernel, Mel Gorman, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Mon, 7 May 2007, Eric W. Biederman wrote:
>> Yes, instead of having to redesign the interface between the
>> fs and the page cache for those filesystems that handle large
>> blocks we instead need to redesign significant parts of the VM interface.
>> Shift the redesign work to another group of people and call it trivial.
On Mon, May 07, 2007 at 09:06:05AM -0700, Christoph Lameter wrote:
> To some extent that is true. But then there will also be additional
> gain: We can likely get the VM to handle larger pages too which may get
> rid of hugetlb fs etc. The work is pretty straightforward: No locking
> changes f.e. So hardly a redesign. I think the crucial point is the
> antifrag/defrag issue if we want to generalize it.
Sadly, a backward compatibility stub must be retained in perpetuity.
It should be able to be reduced to the point it doesn't need its own
dedicated source files or config options, but it'll need something to
deal with the arch code.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
* Re: [00/17] Large Blocksize Support V3
2007-05-07 7:06 ` William Lee Irwin III
@ 2007-05-08 8:49 ` William Lee Irwin III
0 siblings, 0 replies; 235+ messages in thread
From: William Lee Irwin III @ 2007-05-08 8:49 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Chinner, Theodore Tso, Andrew Morton, clameter,
linux-kernel, Mel Gorman, Jens Axboe, Badari Pulavarty,
Maxim Levitsky
On Mon, May 07, 2007 at 12:06:38AM -0700, William Lee Irwin III wrote:
> +int alloc_page_array(struct pagearray *, const int, const size_t);
> +void free_page_array(struct pagearray *);
> +void zero_page_array(struct pagearray *);
> +struct page *nopage_page_array(const struct vm_area_struct *, unsigned long, unsigned long, int *, struct pagearray *);
> +int mmap_page_array(const struct vm_area_struct *, struct pagearray *, const size_t, const size_t);
> +int copy_page_array_to_user(struct pagearray *, void __user *, const size_t, const size_t);
> +int copy_page_array_from_user(struct pagearray *, void __user *, const size_t, const size_t);
> +struct scatterlist *pagearray_to_scatterlist(struct pagearray *, size_t, size_t, int *);
> +void *vmap_pagearray(struct pagearray *);
This should probably have memcpy to/from pagearrays. Whole-hog read
and write f_op implementations would be good, too, since ISTR some
drivers basically do little besides that on their internal buffers.
vmap_pagearray() should take flags, esp. VM_IOREMAP but perhaps also
protections besides PAGE_KERNEL in case uncachedness is desirable. I'm
not entirely sure what it'd be used for if discontiguity is so heavily
supported. My wild guess is drivers that do things that are just too
weird to support with the discontig API, since that's how I used it.
It should support vmap()'ing interior sub-ranges, too.
The pagearray mmap() support is schizophrenic as to whether it prefills
or faults and not all that complete as far as manipulating the mmap()
goes. Shooting down ptes, flipping pages, or whatever drivers actually
do with the things should have helpers arranged. Coherent sets of
helpers for faulting vs. mmap()'ing idioms would be good.
pagearray_to_scatterlist() should probably take the scatterlist as an
argument instead of allocating the scatterlist itself.
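A sketch of that reworked signature, using the kernel's sg_init_table() and
sg_set_page() helpers (which postdate this thread); the function name and the
pagearray layout are the same assumptions as above.

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

struct pagearray {                      /* assumed layout, as above */
        struct page **pages;
        size_t npages;
};

/* Fill a caller-provided scatterlist instead of allocating one. */
static int pagearray_to_sg(struct pagearray *pa, struct scatterlist *sgl,
                           unsigned int nents)
{
        unsigned int i;

        if (nents < pa->npages)
                return -EINVAL;

        sg_init_table(sgl, pa->npages);
        for (i = 0; i < pa->npages; i++)
                sg_set_page(&sgl[i], pa->pages[i], PAGE_SIZE, 0);

        return 0;
}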
Something to construct bios from pagearrays might help.
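A sketch of what such a helper might look like; the name and the flat
pagearray layout are assumptions, while bio_add_page() is the existing
block-layer interface.

#include <linux/bio.h>
#include <linux/errno.h>
#include <linux/mm.h>

struct pagearray {                      /* assumed layout, as above */
        struct page **pages;
        size_t npages;
};

/* Append every page of the pagearray to an already-allocated bio. */
static int pagearray_add_to_bio(struct bio *bio, struct pagearray *pa)
{
        size_t i;

        for (i = 0; i < pa->npages; i++)
                if (bio_add_page(bio, pa->pages[i], PAGE_SIZE, 0) != PAGE_SIZE)
                        return -EIO;    /* bio full: submit and continue with a new one */

        return 0;
}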
s/page_array/pagearray/g should probably be done. Prefixing with
pagearray_ instead of randomly positioning it within the name would
be good, too.
Some working API conversions of drivers sound like a good idea. I had
a large number of API conversions around at one point, now lost, but
they'd be bitrotted by now anyway.
struct pagearray is better off as an opaque type so that large-pagearray
handling can be added later via radix trees or some such, and likewise
for expansion and contraction. Keeping drivers' hands off the internals
is just a good idea in general.
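One way to sketch that split (all file and function names here are
illustrative): only a forward declaration and accessors live in the public
header, and the definition stays in a single .c file so it could later switch
to a radix tree without touching drivers.

/* public header, e.g. include/linux/pagearray.h (illustrative) */
#include <linux/types.h>

struct pagearray;                       /* opaque to drivers */
struct page *pagearray_page(struct pagearray *pa, size_t index);
size_t pagearray_count(const struct pagearray *pa);

/* single implementation file, e.g. mm/pagearray.c (illustrative) */
#include <linux/mm.h>

struct pagearray {                      /* layout known only here */
        struct page **pages;            /* could become a radix tree later */
        size_t npages;
};

struct page *pagearray_page(struct pagearray *pa, size_t index)
{
        return index < pa->npages ? pa->pages[index] : NULL;
}

size_t pagearray_count(const struct pagearray *pa)
{
        return pa->npages;
}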
I'm somewhat less clear on what filesystems need to do here, or whether
it would be useful for them to efficiently manipulate data inside a
large block that spans multiple discontiguous pages. I expect some
changes are needed at the very least to fill a pagearray with whatever
predetermined pages are needed. Filesystems probably need other changes
to handle sparse pagearrays and refilling pages within them via IO.
-- wli
^ permalink raw reply [flat|nested] 235+ messages in thread
Thread overview: 235+ messages
2007-04-24 22:21 [00/17] Large Blocksize Support V3 clameter
2007-04-24 22:21 ` [01/17] Remove open coded implementation of memclear_highpage flush clameter
2007-04-24 22:21 ` [02/17] Fix page allocation flags in grow_dev_page() clameter
2007-04-24 22:21 ` [03/17] Fix: find_or_create_page does not spread memory clameter
2007-04-24 22:21 ` [04/17] Free up page->private for compound pages clameter
2007-04-24 22:21 ` [05/17] More compound page features clameter
2007-04-24 22:21 ` [06/17] Fix up handling of Compound head pages clameter
2007-04-24 22:21 ` [07/17] vmstat.c: Support accounting for compound pages clameter
2007-04-24 22:21 ` [08/17] Define functions for page cache handling clameter
2007-04-24 23:00 ` Eric Dumazet
2007-04-25 6:27 ` Christoph Lameter
2007-04-24 22:21 ` [09/17] Convert PAGE_CACHE_xxx -> page_cache_xxx function calls clameter
2007-04-24 22:21 ` [10/17] Variable Order Page Cache: Add clearing and flushing function clameter
2007-04-26 7:02 ` Christoph Lameter
2007-04-26 8:14 ` David Chinner
2007-04-24 22:21 ` [11/17] Readahead support for the variable order page cache clameter
2007-04-24 22:21 ` [12/17] Variable Page Cache Size: Fix up reclaim counters clameter
2007-04-24 22:21 ` [13/17] set_blocksize: Allow to set a larger block size than PAGE_SIZE clameter
2007-04-24 22:21 ` [14/17] Add VM_BUG_ONs to check for correct page order clameter
2007-04-24 22:21 ` [15/17] ramfs: Variable order page cache support clameter
2007-04-24 22:21 ` [16/17] ext2: " clameter
2007-04-24 22:21 ` [17/17] xfs: " clameter
2007-04-25 0:46 ` [00/17] Large Blocksize Support V3 Jörn Engel
2007-04-25 0:47 ` H. Peter Anvin
2007-04-25 3:11 ` William Lee Irwin III
2007-04-25 11:35 ` Jens Axboe
2007-04-25 15:36 ` Christoph Lameter
2007-04-25 17:53 ` Jens Axboe
2007-04-25 18:03 ` Christoph Lameter
2007-04-25 18:05 ` Jens Axboe
2007-04-25 18:14 ` Christoph Lameter
2007-04-25 18:16 ` Jens Axboe
2007-04-25 13:28 ` Mel Gorman
2007-04-25 15:23 ` Christoph Lameter
2007-04-25 22:46 ` Badari Pulavarty
2007-04-26 1:14 ` David Chinner
2007-04-26 1:17 ` David Chinner
2007-04-26 4:51 ` Eric W. Biederman
2007-04-26 5:05 ` Christoph Lameter
2007-04-26 5:44 ` Eric W. Biederman
2007-04-26 6:37 ` Christoph Lameter
2007-04-26 9:16 ` Mel Gorman
2007-04-26 6:38 ` Nick Piggin
2007-04-26 6:46 ` Christoph Lameter
2007-04-26 6:57 ` Nick Piggin
2007-04-26 7:10 ` Christoph Lameter
2007-04-26 7:22 ` Nick Piggin
2007-04-26 7:34 ` Christoph Lameter
2007-04-26 7:48 ` Nick Piggin
2007-04-26 9:20 ` David Chinner
2007-04-26 13:53 ` Avi Kivity
2007-04-26 14:33 ` David Chinner
2007-04-26 14:56 ` Avi Kivity
2007-04-26 15:20 ` Nick Piggin
2007-04-26 17:42 ` Jens Axboe
2007-04-26 18:59 ` Eric W. Biederman
2007-04-26 16:07 ` Christoph Hellwig
2007-04-27 10:05 ` Nick Piggin
2007-04-27 13:06 ` Mel Gorman
2007-04-26 13:50 ` William Lee Irwin III
2007-04-26 18:09 ` Eric W. Biederman
2007-04-26 23:34 ` William Lee Irwin III
2007-04-26 7:48 ` Questions on printk and console_drivers gshan
2007-04-26 10:06 ` [00/17] Large Blocksize Support V3 Mel Gorman
2007-04-26 14:47 ` Nick Piggin
2007-04-26 15:58 ` Christoph Hellwig
2007-04-26 16:05 ` Jens Axboe
2007-04-26 16:16 ` Christoph Hellwig
2007-04-26 13:28 ` Alan Cox
2007-04-26 13:30 ` Jens Axboe
2007-04-29 14:12 ` Matt Mackall
2007-04-28 10:55 ` Pierre Ossman
2007-04-28 15:39 ` Eric W. Biederman
2007-04-26 5:37 ` Nick Piggin
2007-04-26 6:38 ` David Chinner
2007-04-26 6:50 ` Nick Piggin
2007-04-26 8:40 ` Mel Gorman
2007-04-26 8:55 ` Nick Piggin
2007-04-26 10:30 ` Mel Gorman
2007-04-26 10:54 ` Eric W. Biederman
2007-04-26 12:23 ` Mel Gorman
2007-04-26 17:58 ` Christoph Lameter
2007-04-26 18:02 ` Jens Axboe
2007-04-26 16:11 ` Christoph Hellwig
2007-04-26 17:49 ` Eric W. Biederman
2007-04-26 18:03 ` Christoph Lameter
2007-04-26 18:03 ` Jens Axboe
2007-04-26 18:09 ` Christoph Hellwig
2007-04-26 18:12 ` Jens Axboe
2007-04-26 18:24 ` Christoph Hellwig
2007-04-26 18:24 ` Jens Axboe
2007-04-26 18:28 ` Christoph Lameter
2007-04-26 18:29 ` Jens Axboe
2007-04-26 18:35 ` Christoph Lameter
2007-04-26 18:39 ` Jens Axboe
2007-04-26 19:35 ` Eric W. Biederman
2007-04-26 19:42 ` Jens Axboe
2007-04-27 4:05 ` Eric W. Biederman
2007-04-27 10:26 ` Nick Piggin
2007-04-27 13:51 ` Eric W. Biederman
2007-04-26 20:22 ` Mel Gorman
2007-04-27 0:21 ` William Lee Irwin III
2007-04-27 5:16 ` Jens Axboe
2007-04-27 10:38 ` Nick Piggin
2007-04-26 10:10 ` Eric W. Biederman
2007-04-26 13:50 ` David Chinner
2007-04-26 14:40 ` William Lee Irwin III
2007-04-26 15:38 ` Nick Piggin
2007-04-26 15:58 ` William Lee Irwin III
2007-04-27 9:46 ` Nick Piggin
2007-04-27 0:19 ` Jeremy Higdon
2007-04-26 18:07 ` Christoph Lameter
2007-04-26 18:45 ` Eric W. Biederman
2007-04-26 18:59 ` Christoph Lameter
2007-04-26 19:21 ` Eric W. Biederman
2007-04-26 6:40 ` Christoph Lameter
2007-04-26 6:53 ` Nick Piggin
2007-04-26 7:04 ` David Chinner
2007-04-26 7:07 ` Nick Piggin
2007-04-26 7:11 ` Christoph Lameter
2007-04-26 7:17 ` Nick Piggin
2007-04-26 7:28 ` Christoph Lameter
2007-04-26 7:45 ` Nick Piggin
2007-04-26 18:10 ` Christoph Lameter
2007-04-27 10:08 ` Nick Piggin
2007-04-26 7:07 ` Christoph Lameter
2007-04-26 7:15 ` Nick Piggin
2007-04-26 7:22 ` Christoph Lameter
2007-04-26 7:42 ` Nick Piggin
2007-04-26 10:48 ` Mel Gorman
2007-04-26 12:37 ` Andy Whitcroft
2007-04-26 14:18 ` David Chinner
2007-04-26 15:08 ` Nick Piggin
2007-04-26 15:19 ` William Lee Irwin III
2007-04-26 15:28 ` David Chinner
2007-04-26 14:53 ` William Lee Irwin III
2007-04-26 18:16 ` Christoph Lameter
2007-04-26 18:21 ` Eric W. Biederman
2007-04-27 0:32 ` William Lee Irwin III
2007-04-27 10:22 ` Nick Piggin
2007-04-27 12:58 ` William Lee Irwin III
2007-04-27 13:06 ` Nick Piggin
2007-04-27 14:49 ` William Lee Irwin III
2007-04-26 18:13 ` Christoph Lameter
2007-04-27 10:15 ` Nick Piggin
2007-04-26 14:49 ` William Lee Irwin III
2007-04-26 18:50 ` Maxim Levitsky
2007-04-27 2:04 ` Andrew Morton
2007-04-27 2:27 ` David Chinner
2007-04-27 2:53 ` Andrew Morton
2007-04-27 3:47 ` [00/17] Large Blocksize Support V3 (mmap conceptual discussion) Christoph Lameter
2007-04-27 4:20 ` [00/17] Large Blocksize Support V3 David Chinner
2007-04-27 5:15 ` Andrew Morton
2007-04-27 5:49 ` Christoph Lameter
2007-04-27 6:55 ` Andrew Morton
2007-04-27 7:19 ` Christoph Lameter
2007-04-27 7:26 ` Andrew Morton
2007-04-27 8:37 ` David Chinner
2007-04-27 12:01 ` Christoph Lameter
2007-04-27 16:36 ` David Chinner
2007-04-27 17:34 ` David Chinner
2007-04-27 19:11 ` Andrew Morton
2007-04-28 1:43 ` Nick Piggin
2007-04-28 8:04 ` Peter Zijlstra
2007-04-28 8:22 ` Andrew Morton
2007-04-28 8:32 ` Peter Zijlstra
2007-04-28 8:55 ` Andrew Morton
2007-04-28 9:36 ` Peter Zijlstra
2007-04-28 14:09 ` William Lee Irwin III
2007-04-28 18:26 ` Andrew Morton
2007-04-28 19:19 ` William Lee Irwin III
2007-04-28 21:28 ` Andrew Morton
2007-04-28 3:17 ` David Chinner
2007-04-28 3:49 ` Christoph Lameter
2007-04-28 4:56 ` Andrew Morton
2007-04-28 5:08 ` Christoph Lameter
2007-04-28 5:36 ` Andrew Morton
2007-04-28 6:24 ` Christoph Lameter
2007-04-28 6:52 ` Andrew Morton
2007-04-30 5:30 ` Christoph Lameter
2007-04-28 9:43 ` Alan Cox
2007-04-28 9:58 ` Andrew Morton
2007-04-28 10:21 ` Alan Cox
2007-04-28 10:25 ` Andrew Morton
2007-04-28 11:29 ` Alan Cox
2007-04-28 14:37 ` William Lee Irwin III
2007-04-27 7:22 ` Christoph Lameter
2007-04-27 7:29 ` Andrew Morton
2007-04-27 7:35 ` Christoph Lameter
2007-04-27 7:43 ` Andrew Morton
2007-04-27 11:05 ` Paul Mackerras
2007-04-27 11:41 ` Nick Piggin
2007-04-27 12:12 ` Christoph Lameter
2007-04-27 12:25 ` Nick Piggin
2007-04-27 13:39 ` Christoph Hellwig
2007-04-28 2:27 ` Nick Piggin
2007-04-28 2:39 ` William Lee Irwin III
2007-04-28 2:50 ` Nick Piggin
2007-04-28 3:16 ` William Lee Irwin III
2007-04-28 8:16 ` Christoph Hellwig
2007-04-27 16:48 ` Christoph Lameter
2007-04-27 13:37 ` Christoph Hellwig
2007-04-27 12:14 ` Paul Mackerras
2007-04-27 12:36 ` Nick Piggin
2007-04-27 13:42 ` Christoph Hellwig
2007-04-27 11:58 ` Christoph Lameter
2007-04-27 13:44 ` William Lee Irwin III
2007-04-27 19:15 ` Andrew Morton
2007-04-28 2:21 ` William Lee Irwin III
2007-04-27 6:09 ` David Chinner
2007-04-27 7:04 ` Andrew Morton
2007-04-27 8:03 ` David Chinner
2007-04-27 8:48 ` Andrew Morton
2007-04-27 16:45 ` Theodore Tso
2007-05-04 13:33 ` Eric W. Biederman
2007-05-07 4:29 ` David Chinner
2007-05-07 4:48 ` Eric W. Biederman
2007-05-07 5:27 ` David Chinner
2007-05-07 6:43 ` Eric W. Biederman
2007-05-07 6:49 ` William Lee Irwin III
2007-05-07 7:06 ` William Lee Irwin III
2007-05-08 8:49 ` William Lee Irwin III
2007-05-07 16:06 ` Christoph Lameter
2007-05-07 17:29 ` William Lee Irwin III
2007-05-04 12:57 ` Eric W. Biederman
2007-05-04 13:31 ` Eric W. Biederman
2007-05-04 16:11 ` Christoph Lameter
2007-05-07 4:58 ` David Chinner
2007-05-07 6:56 ` Eric W. Biederman
2007-05-07 15:17 ` Weigert, Daniel
2007-04-27 16:55 ` Theodore Tso
2007-04-27 17:32 ` Nicholas Miell
2007-04-27 18:12 ` William Lee Irwin III
2007-04-28 16:39 ` Maxim Levitsky
2007-04-30 5:23 ` Christoph Lameter