LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [patch] rewrite rd
@ 2007-12-04  4:26 Nick Piggin
  2007-12-04  6:29 ` Andrew Morton
                   ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Nick Piggin @ 2007-12-04  4:26 UTC (permalink / raw)
  To: Linux Kernel Mailing List, linux-fsdevel, Christian Borntraeger,
	Eric W. Biederman, Andrew Morton
  Cc: rob, Jens Axboe

Hi,

This is my proposal for a (hopefully) backwards compatible rd driver.
The motivation for me is not pressing, except that I have this code
sitting here that is either going to rot or get merged. I'm happy to
make myself maintainer of this code, but if any real block device
driver writer would like to, that would be fine by me ;)

Comments?

--

This is a rewrite of the ramdisk block device driver.

The old one is really difficult because it effectively implements a block
device which serves data out of its own buffer cache. It relies on the dirty
bit being set, to pin its backing store in cache, however there are non
trivial paths which can clear the dirty bit (eg. try_to_free_buffers()),
which had recently lead to data corruption. And in general it is completely
wrong for a block device driver to do this.

The new one is more like a regular block device driver. It has no idea
about vm/vfs stuff. It's backing store is similar to the buffer cache
(a simple radix-tree of pages), but it doesn't know anything about page
cache (the pages in the radix tree are not pagecache pages).

There is one slight downside -- direct block device access and filesystem
metadata access goes through an extra copy and gets stored in RAM twice.
However, this downside is only slight, because the real buffercache of the
device is now reclaimable (because we're not playing crazy games with it),
so under memory intensive situations, footprint should effectively be the
same -- maybe even a slight advantage to the new driver because it can also
reclaim buffer heads.

The fact that it now goes through all the regular vm/fs paths makes it
much more useful for testing, too.

   text    data     bss     dec     hex filename
   2837     849     384    4070     fe6 drivers/block/rd.o
   3528     371      12    3911     f47 drivers/block/brd.o

Text is larger, but data and bss are smaller, making total size smaller.

A few other nice things about it:
- Similar structure and layout to the new loop device handlinag.
- Dynamic ramdisk creation.
- Runtime flexible buffer head size (because it is no longer part of the
  ramdisk code).
- Boot / load time flexible ramdisk size, which could easily be extended
  to a per-ramdisk runtime changeable size (eg. with an ioctl).

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 MAINTAINERS            |    5 
 drivers/block/Kconfig  |   12 -
 drivers/block/Makefile |    2 
 drivers/block/brd.c    |  500 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/block/rd.c     |  536 -------------------------------------------------
 fs/buffer.c            |    1 
 6 files changed, 508 insertions(+), 548 deletions(-)

Index: linux-2.6/drivers/block/Kconfig
===================================================================
--- linux-2.6.orig/drivers/block/Kconfig
+++ linux-2.6/drivers/block/Kconfig
@@ -311,7 +311,7 @@ config BLK_DEV_UB
 	  If unsure, say N.
 
 config BLK_DEV_RAM
-	tristate "RAM disk support"
+	tristate "RAM block device support"
 	---help---
 	  Saying Y here will allow you to use a portion of your RAM memory as
 	  a block device, so that you can make file systems on it, read and
@@ -346,16 +346,6 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
 
-config BLK_DEV_RAM_BLOCKSIZE
-	int "Default RAM disk block size (bytes)"
-	depends on BLK_DEV_RAM
-	default "1024"
-	help
-	  The default value is 1024 bytes.  PAGE_SIZE is a much more
-	  efficient choice however.  The default is kept to ensure initrd
-	  setups function - apparently needed by the rd_load_image routine
-	  that supposes the filesystem in the image uses a 1024 blocksize.
-
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
Index: linux-2.6/drivers/block/Makefile
===================================================================
--- linux-2.6.orig/drivers/block/Makefile
+++ linux-2.6/drivers/block/Makefile
@@ -11,7 +11,7 @@ obj-$(CONFIG_AMIGA_FLOPPY)	+= amiflop.o
 obj-$(CONFIG_PS3_DISK)		+= ps3disk.o
 obj-$(CONFIG_ATARI_FLOPPY)	+= ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
-obj-$(CONFIG_BLK_DEV_RAM)	+= rd.o
+obj-$(CONFIG_BLK_DEV_RAM)	+= brd.o
 obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
 obj-$(CONFIG_BLK_DEV_PS2)	+= ps2esdi.o
 obj-$(CONFIG_BLK_DEV_XD)	+= xd.o
Index: linux-2.6/drivers/block/brd.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/block/brd.c
@@ -0,0 +1,536 @@
+/*
+ * Ram backed block device driver.
+ *
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ *
+ * Parts derived from drivers/block/rd.c, and drivers/block/loop.c, copyright
+ * of their respective owners.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/major.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/highmem.h>
+#include <linux/gfp.h>
+#include <linux/radix-tree.h>
+#include <linux/buffer_head.h> /* invalidate_bh_lrus() */
+
+#include <asm/uaccess.h>
+
+#define SECTOR_SHIFT		9
+#define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
+#define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
+
+/*
+ * Each block ramdisk device has a radix_tree brd_pages of pages that stores
+ * the pages containing the block device's contents. A brd page's ->index is
+ * its offset in PAGE_SIZE units. This is similar to, but in no way connected
+ * with, the kernel's pagecache or buffer cache (which sit above our block
+ * device).
+ */
+struct brd_device {
+	int		brd_number;
+	int		brd_refcnt;
+	loff_t		brd_offset;
+	loff_t		brd_sizelimit;
+	unsigned	brd_blocksize;
+
+	struct request_queue	*brd_queue;
+	struct gendisk		*brd_disk;
+	struct list_head	brd_list;
+
+	/*
+	 * Backing store of pages and lock to protect it. This is the contents
+	 * of the block device.
+	 */
+	spinlock_t		brd_lock;
+	struct radix_tree_root	brd_pages;
+};
+
+/*
+ * Look up and return a brd's page for a given sector.
+ */
+static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
+{
+	unsigned long idx;
+	struct page *page;
+
+	/*
+	 * The page lifetime is protected by the fact that we have opened the
+	 * device node -- brd pages will never be deleted under us, so we
+	 * don't need any further locking or refcounting.
+	 *
+	 * This is strictly true for the radix-tree nodes as well (ie. we
+	 * don't actually need the rcu_read_lock()), however that is not a
+	 * documented feature of the radix-tree API so it is better to be
+	 * safe here (we don't have total exclusion from radix tree updates
+	 * here, only deletes).
+	 */
+	rcu_read_lock();
+	idx = sector >> PAGE_SECTORS_SHIFT; /* sector to page index */
+	page = radix_tree_lookup(&brd->brd_pages, idx);
+	rcu_read_unlock();
+
+	BUG_ON(page && page->index != idx);
+
+	return page;
+}
+
+/*
+ * Look up and return a brd's page for a given sector.
+ * If one does not exist, allocate an empty page, and insert that. Then
+ * return it.
+ */
+static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
+{
+	unsigned long idx;
+	struct page *page;
+
+	page = brd_lookup_page(brd, sector);
+	if (page)
+		return page;
+
+	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
+	if (!page)
+		return NULL;
+
+	if (radix_tree_preload(GFP_NOIO)) {
+		__free_page(page);
+		return NULL;
+	}
+
+	spin_lock(&brd->brd_lock);
+	idx = sector >> PAGE_SECTORS_SHIFT;
+	if (radix_tree_insert(&brd->brd_pages, idx, page)) {
+		__free_page(page);
+		page = radix_tree_lookup(&brd->brd_pages, idx);
+		BUG_ON(!page);
+		BUG_ON(page->index != idx);
+	} else
+		page->index = idx;
+	spin_unlock(&brd->brd_lock);
+
+	radix_tree_preload_end();
+
+	return page;
+}
+
+/*
+ * Free all backing store pages and radix tree. This must only be called when
+ * there are no other users of the device.
+ */
+#define FREE_BATCH 16
+static void brd_free_pages(struct brd_device *brd)
+{
+	unsigned long pos = 0;
+	struct page *pages[FREE_BATCH];
+	int nr_pages;
+
+	do {
+		int i;
+
+		nr_pages = radix_tree_gang_lookup(&brd->brd_pages,
+				(void **)pages, pos, FREE_BATCH);
+
+		for (i = 0; i < nr_pages; i++) {
+			void *ret;
+
+			BUG_ON(pages[i]->index < pos);
+			pos = pages[i]->index;
+			ret = radix_tree_delete(&brd->brd_pages, pos);
+			BUG_ON(!ret || ret != pages[i]);
+			__free_page(pages[i]);
+		}
+
+		pos++;
+
+	} while (nr_pages == FREE_BATCH);
+}
+
+/*
+ * copy_to_brd_setup must be called before copy_to_brd. It may sleep.
+ */
+static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n)
+{
+	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
+	size_t copy;
+
+	copy = min((unsigned long)n, PAGE_SIZE - offset);
+	if (!brd_insert_page(brd, sector))
+		return -ENOMEM;
+	if (copy < n) {
+		sector += copy >> SECTOR_SHIFT;
+		if (!brd_insert_page(brd, sector))
+			return -ENOMEM;
+	}
+	return 0;
+}
+
+/*
+ * Copy n bytes from src to the brd starting at sector. Does not sleep.
+ */
+static void copy_to_brd(struct brd_device *brd, const void *src,
+			sector_t sector, size_t n)
+{
+	struct page *page;
+	void *dst;
+	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
+	size_t copy;
+
+	copy = min((unsigned long)n, PAGE_SIZE - offset);
+	page = brd_lookup_page(brd, sector);
+	BUG_ON(!page);
+
+	dst = kmap_atomic(page, KM_USER1);
+	memcpy(dst + offset, src, copy);
+	kunmap_atomic(dst, KM_USER1);
+
+	if (copy < n) {
+		src += copy;
+		sector += copy >> SECTOR_SHIFT;
+		copy = n - copy;
+		page = brd_lookup_page(brd, sector);
+		BUG_ON(!page);
+
+		dst = kmap_atomic(page, KM_USER1);
+		memcpy(dst, src, copy);
+		kunmap_atomic(dst, KM_USER1);
+	}
+}
+
+/*
+ * Copy n bytes to dst from the brd starting at sector. Does not sleep.
+ */
+static void copy_from_brd(void *dst, struct brd_device *brd,
+			sector_t sector, size_t n)
+{
+	struct page *page;
+	const void *src;
+	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
+	size_t copy;
+
+	copy = min((unsigned long)n, PAGE_SIZE - offset);
+	page = brd_lookup_page(brd, sector);
+	if (page) {
+		src = kmap_atomic(page, KM_USER1);
+		memcpy(dst, src + offset, copy);
+		kunmap_atomic(src, KM_USER1);
+	} else
+		memset(dst, 0, copy);
+
+	if (copy < n) {
+		dst += copy;
+		sector += copy >> SECTOR_SHIFT;
+		copy = n - copy;
+		page = brd_lookup_page(brd, sector);
+		if (page) {
+			src = kmap_atomic(page, KM_USER1);
+			memcpy(dst, src, copy);
+			kunmap_atomic(src, KM_USER1);
+		} else
+			memset(dst, 0, copy);
+	}
+}
+
+/*
+ * Process a single bvec of a bio.
+ */
+static int brd_do_bvec(struct brd_device *brd, struct page *page,
+			unsigned int len, unsigned int off, int rw,
+			sector_t sector)
+{
+	void *mem;
+	int err = 0;
+
+	if (rw != READ) {
+		err = copy_to_brd_setup(brd, sector, len);
+		if (err)
+			goto out;
+	}
+
+	mem = kmap_atomic(page, KM_USER0);
+	if (rw == READ) {
+		copy_from_brd(mem + off, brd, sector, len);
+		flush_dcache_page(page);
+	} else
+		copy_to_brd(brd, mem + off, sector, len);
+	kunmap_atomic(mem, KM_USER0);
+
+out:
+	return err;
+}
+
+static int brd_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	struct brd_device *brd = bdev->bd_disk->private_data;
+	int rw;
+	struct bio_vec *bvec;
+	sector_t sector;
+	int i;
+	int err = -EIO;
+
+	sector = bio->bi_sector;
+	if (sector + (bio->bi_size >> SECTOR_SHIFT) >
+						get_capacity(bdev->bd_disk))
+		goto out;
+
+	rw = bio_rw(bio);
+	if (rw == READA)
+		rw = READ;
+
+	bio_for_each_segment(bvec, bio, i) {
+		unsigned int len = bvec->bv_len;
+		err = brd_do_bvec(brd, bvec->bv_page, len,
+					bvec->bv_offset, rw, sector);
+		if (err)
+			break;
+		sector += len >> SECTOR_SHIFT;
+	}
+
+out:
+	bio_endio(bio, err);
+
+	return 0;
+}
+
+static int brd_ioctl(struct inode *inode, struct file *file,
+			unsigned int cmd, unsigned long arg)
+{
+	int error;
+	struct block_device *bdev = inode->i_bdev;
+	struct brd_device *brd = bdev->bd_disk->private_data;
+
+	if (cmd != BLKFLSBUF)
+		return -ENOTTY;
+
+	/*
+	 * ram device BLKFLSBUF has special semantics, we want to actually
+	 * release and destroy the ramdisk data.
+	 */
+	mutex_lock(&bdev->bd_mutex);
+	error = -EBUSY;
+	if (bdev->bd_openers <= 1) {
+		/*
+		 * Invalidate the cache first, so it isn't written
+		 * back to the device.
+		 */
+		invalidate_bh_lrus();
+		truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
+		brd_free_pages(brd);
+		error = 0;
+	}
+	mutex_unlock(&bdev->bd_mutex);
+
+	return error;
+}
+
+static struct block_device_operations brd_fops = {
+	.owner =	THIS_MODULE,
+	.ioctl =	brd_ioctl,
+};
+
+/*
+ * And now the modules code and kernel interface.
+ */
+static int rd_nr;
+static int rd_size = CONFIG_BLK_DEV_RAM_SIZE;
+module_param(rd_nr, int, 0);
+MODULE_PARM_DESC(rd_nr, "Maximum number of brd devices");
+module_param(rd_size, int, 0);
+MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes.");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
+
+#ifndef MODULE
+/* Legacy boot options - nonmodular */
+static int __init ramdisk_size(char *str)
+{
+	rd_size = simple_strtol(str, NULL, 0);
+	return 1;
+}
+static int __init ramdisk_size2(char *str)
+{
+	return ramdisk_size(str);
+}
+__setup("ramdisk=", ramdisk_size);
+__setup("ramdisk_size=", ramdisk_size2);
+#endif
+
+/*
+ * The device scheme is derived from loop.c. Keep them in synch where possible
+ * (should share code eventually).
+ */
+static LIST_HEAD(brd_devices);
+static DEFINE_MUTEX(brd_devices_mutex);
+
+static struct brd_device *brd_alloc(int i)
+{
+	struct brd_device *brd;
+	struct gendisk *disk;
+
+	brd = kzalloc(sizeof(*brd), GFP_KERNEL);
+	if (!brd)
+		goto out;
+	brd->brd_number		= i;
+	spin_lock_init(&brd->brd_lock);
+	INIT_RADIX_TREE(&brd->brd_pages, GFP_ATOMIC);
+
+	brd->brd_queue = blk_alloc_queue(GFP_KERNEL);
+	if (!brd->brd_queue)
+		goto out_free_dev;
+	blk_queue_make_request(brd->brd_queue, brd_make_request);
+	blk_queue_max_sectors(brd->brd_queue, 1024);
+	blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);
+
+	disk = brd->brd_disk = alloc_disk(1);
+	if (!disk)
+		goto out_free_queue;
+	disk->major		= RAMDISK_MAJOR;
+	disk->first_minor	= i;
+	disk->fops		= &brd_fops;
+	disk->private_data	= brd;
+	disk->queue		= brd->brd_queue;
+	sprintf(disk->disk_name, "ram%d", i);
+	set_capacity(disk, rd_size * 2);
+
+	return brd;
+
+out_free_queue:
+	blk_cleanup_queue(brd->brd_queue);
+out_free_dev:
+	kfree(brd);
+out:
+	return NULL;
+}
+
+static void brd_free(struct brd_device *brd)
+{
+	put_disk(brd->brd_disk);
+	blk_cleanup_queue(brd->brd_queue);
+	brd_free_pages(brd);
+	kfree(brd);
+}
+
+static struct brd_device *brd_init_one(int i)
+{
+	struct brd_device *brd;
+
+	list_for_each_entry(brd, &brd_devices, brd_list) {
+		if (brd->brd_number == i)
+			goto out;
+	}
+
+	brd = brd_alloc(i);
+	if (brd) {
+		add_disk(brd->brd_disk);
+		list_add_tail(&brd->brd_list, &brd_devices);
+	}
+out:
+	return brd;
+}
+
+static void brd_del_one(struct brd_device *brd)
+{
+	list_del(&brd->brd_list);
+	del_gendisk(brd->brd_disk);
+	brd_free(brd);
+}
+
+static struct kobject *brd_probe(dev_t dev, int *part, void *data)
+{
+	struct brd_device *brd;
+	struct kobject *kobj;
+
+	mutex_lock(&brd_devices_mutex);
+	brd = brd_init_one(dev & MINORMASK);
+	kobj = brd ? get_disk(brd->brd_disk) : ERR_PTR(-ENOMEM);
+	mutex_unlock(&brd_devices_mutex);
+
+	*part = 0;
+	return kobj;
+}
+
+static int __init brd_init(void)
+{
+	int i, nr;
+	unsigned long range;
+	struct brd_device *brd, *next;
+
+	/*
+	 * brd module now has a feature to instantiate underlying device
+	 * structure on-demand, provided that there is an access dev node.
+	 * However, this will not work well with user space tool that doesn't
+	 * know about such "feature".  In order to not break any existing
+	 * tool, we do the following:
+	 *
+	 * (1) if rd_nr is specified, create that many upfront, and this
+	 *     also becomes a hard limit.
+	 * (2) if rd_nr is not specified, create 1 rd device on module
+	 *     load, user can further extend brd device by create dev node
+	 *     themselves and have kernel automatically instantiate actual
+	 *     device on-demand.
+	 */
+	if (rd_nr > 1UL << MINORBITS)
+		return -EINVAL;
+
+	if (rd_nr) {
+		nr = rd_nr;
+		range = rd_nr;
+	} else {
+		nr = CONFIG_BLK_DEV_RAM_COUNT;
+		range = 1UL << MINORBITS;
+	}
+
+	if (register_blkdev(RAMDISK_MAJOR, "ramdisk"))
+		return -EIO;
+
+	for (i = 0; i < nr; i++) {
+		brd = brd_alloc(i);
+		if (!brd)
+			goto out_free;
+		list_add_tail(&brd->brd_list, &brd_devices);
+	}
+
+	/* point of no return */
+
+	list_for_each_entry(brd, &brd_devices, brd_list)
+		add_disk(brd->brd_disk);
+
+	blk_register_region(MKDEV(RAMDISK_MAJOR, 0), range,
+				  THIS_MODULE, brd_probe, NULL, NULL);
+
+	printk(KERN_INFO "brd: module loaded\n");
+	return 0;
+
+out_free:
+	list_for_each_entry_safe(brd, next, &brd_devices, brd_list) {
+		list_del(&brd->brd_list);
+		brd_free(brd);
+	}
+
+	unregister_blkdev(RAMDISK_MAJOR, "brd");
+	return -ENOMEM;
+}
+
+static void __exit brd_exit(void)
+{
+	unsigned long range;
+	struct brd_device *brd, *next;
+
+	range = rd_nr ? rd_nr :  1UL << MINORBITS;
+
+	list_for_each_entry_safe(brd, next, &brd_devices, brd_list)
+		brd_del_one(brd);
+
+	blk_unregister_region(MKDEV(RAMDISK_MAJOR, 0), range);
+	unregister_blkdev(RAMDISK_MAJOR, "ramdisk");
+}
+
+module_init(brd_init);
+module_exit(brd_exit);
+
Index: linux-2.6/drivers/block/rd.c
===================================================================
--- linux-2.6.orig/drivers/block/rd.c
+++ /dev/null
@@ -1,536 +0,0 @@
-/*
- * ramdisk.c - Multiple RAM disk driver - gzip-loading version - v. 0.8 beta.
- *
- * (C) Chad Page, Theodore Ts'o, et. al, 1995.
- *
- * This RAM disk is designed to have filesystems created on it and mounted
- * just like a regular floppy disk.
- *
- * It also does something suggested by Linus: use the buffer cache as the
- * RAM disk data.  This makes it possible to dynamically allocate the RAM disk
- * buffer - with some consequences I have to deal with as I write this.
- *
- * This code is based on the original ramdisk.c, written mostly by
- * Theodore Ts'o (TYT) in 1991.  The code was largely rewritten by
- * Chad Page to use the buffer cache to store the RAM disk data in
- * 1995; Theodore then took over the driver again, and cleaned it up
- * for inclusion in the mainline kernel.
- *
- * The original CRAMDISK code was written by Richard Lyons, and
- * adapted by Chad Page to use the new RAM disk interface.  Theodore
- * Ts'o rewrote it so that both the compressed RAM disk loader and the
- * kernel decompressor uses the same inflate.c codebase.  The RAM disk
- * loader now also loads into a dynamic (buffer cache based) RAM disk,
- * not the old static RAM disk.  Support for the old static RAM disk has
- * been completely removed.
- *
- * Loadable module support added by Tom Dyas.
- *
- * Further cleanups by Chad Page (page0588@sundance.sjsu.edu):
- *	Cosmetic changes in #ifdef MODULE, code movement, etc.
- * 	When the RAM disk module is removed, free the protected buffers
- * 	Default RAM disk size changed to 2.88 MB
- *
- *  Added initrd: Werner Almesberger & Hans Lermen, Feb '96
- *
- * 4/25/96 : Made RAM disk size a parameter (default is now 4 MB)
- *		- Chad Page
- *
- * Add support for fs images split across >1 disk, Paul Gortmaker, Mar '98
- *
- * Make block size and block size shift for RAM disks a global macro
- * and set blk_size for -ENOSPC,     Werner Fink <werner@suse.de>, Apr '99
- */
-
-#include <linux/string.h>
-#include <linux/slab.h>
-#include <asm/atomic.h>
-#include <linux/bio.h>
-#include <linux/module.h>
-#include <linux/moduleparam.h>
-#include <linux/init.h>
-#include <linux/pagemap.h>
-#include <linux/blkdev.h>
-#include <linux/genhd.h>
-#include <linux/buffer_head.h>		/* for invalidate_bdev() */
-#include <linux/backing-dev.h>
-#include <linux/blkpg.h>
-#include <linux/writeback.h>
-
-#include <asm/uaccess.h>
-
-/* Various static variables go here.  Most are used only in the RAM disk code.
- */
-
-static struct gendisk *rd_disks[CONFIG_BLK_DEV_RAM_COUNT];
-static struct block_device *rd_bdev[CONFIG_BLK_DEV_RAM_COUNT];/* Protected device data */
-static struct request_queue *rd_queue[CONFIG_BLK_DEV_RAM_COUNT];
-
-/*
- * Parameters for the boot-loading of the RAM disk.  These are set by
- * init/main.c (from arguments to the kernel command line) or from the
- * architecture-specific setup routine (from the stored boot sector
- * information).
- */
-int rd_size = CONFIG_BLK_DEV_RAM_SIZE;		/* Size of the RAM disks */
-/*
- * It would be very desirable to have a soft-blocksize (that in the case
- * of the ramdisk driver is also the hardblocksize ;) of PAGE_SIZE because
- * doing that we'll achieve a far better MM footprint. Using a rd_blocksize of
- * BLOCK_SIZE in the worst case we'll make PAGE_SIZE/BLOCK_SIZE buffer-pages
- * unfreeable. With a rd_blocksize of PAGE_SIZE instead we are sure that only
- * 1 page will be protected. Depending on the size of the ramdisk you
- * may want to change the ramdisk blocksize to achieve a better or worse MM
- * behaviour. The default is still BLOCK_SIZE (needed by rd_load_image that
- * supposes the filesystem in the image uses a BLOCK_SIZE blocksize).
- */
-static int rd_blocksize = CONFIG_BLK_DEV_RAM_BLOCKSIZE;
-
-/*
- * Copyright (C) 2000 Linus Torvalds.
- *               2000 Transmeta Corp.
- * aops copied from ramfs.
- */
-
-/*
- * If a ramdisk page has buffers, some may be uptodate and some may be not.
- * To bring the page uptodate we zero out the non-uptodate buffers.  The
- * page must be locked.
- */
-static void make_page_uptodate(struct page *page)
-{
-	if (page_has_buffers(page)) {
-		struct buffer_head *bh = page_buffers(page);
-		struct buffer_head *head = bh;
-
-		do {
-			if (!buffer_uptodate(bh)) {
-				memset(bh->b_data, 0, bh->b_size);
-				/*
-				 * akpm: I'm totally undecided about this.  The
-				 * buffer has just been magically brought "up to
-				 * date", but nobody should want to be reading
-				 * it anyway, because it hasn't been used for
-				 * anything yet.  It is still in a "not read
-				 * from disk yet" state.
-				 *
-				 * But non-uptodate buffers against an uptodate
-				 * page are against the rules.  So do it anyway.
-				 */
-				 set_buffer_uptodate(bh);
-			}
-		} while ((bh = bh->b_this_page) != head);
-	} else {
-		memset(page_address(page), 0, PAGE_CACHE_SIZE);
-	}
-	flush_dcache_page(page);
-	SetPageUptodate(page);
-}
-
-static int ramdisk_readpage(struct file *file, struct page *page)
-{
-	if (!PageUptodate(page))
-		make_page_uptodate(page);
-	unlock_page(page);
-	return 0;
-}
-
-static int ramdisk_prepare_write(struct file *file, struct page *page,
-				unsigned offset, unsigned to)
-{
-	if (!PageUptodate(page))
-		make_page_uptodate(page);
-	return 0;
-}
-
-static int ramdisk_commit_write(struct file *file, struct page *page,
-				unsigned offset, unsigned to)
-{
-	set_page_dirty(page);
-	return 0;
-}
-
-/*
- * ->writepage to the blockdev's mapping has to redirty the page so that the
- * VM doesn't go and steal it.  We return AOP_WRITEPAGE_ACTIVATE so that the VM
- * won't try to (pointlessly) write the page again for a while.
- *
- * Really, these pages should not be on the LRU at all.
- */
-static int ramdisk_writepage(struct page *page, struct writeback_control *wbc)
-{
-	if (!PageUptodate(page))
-		make_page_uptodate(page);
-	SetPageDirty(page);
-	if (wbc->for_reclaim)
-		return AOP_WRITEPAGE_ACTIVATE;
-	unlock_page(page);
-	return 0;
-}
-
-/*
- * This is a little speedup thing: short-circuit attempts to write back the
- * ramdisk blockdev inode to its non-existent backing store.
- */
-static int ramdisk_writepages(struct address_space *mapping,
-				struct writeback_control *wbc)
-{
-	return 0;
-}
-
-/*
- * ramdisk blockdev pages have their own ->set_page_dirty() because we don't
- * want them to contribute to dirty memory accounting.
- */
-static int ramdisk_set_page_dirty(struct page *page)
-{
-	if (!TestSetPageDirty(page))
-		return 1;
-	return 0;
-}
-
-/*
- * releasepage is called by pagevec_strip/try_to_release_page if
- * buffers_heads_over_limit is true. Without a releasepage function
- * try_to_free_buffers is called instead. That can unset the dirty
- * bit of our ram disk pages, which will be eventually freed, even
- * if the page is still in use.
- */
-static int ramdisk_releasepage(struct page *page, gfp_t dummy)
-{
-	return 0;
-}
-
-static const struct address_space_operations ramdisk_aops = {
-	.readpage	= ramdisk_readpage,
-	.prepare_write	= ramdisk_prepare_write,
-	.commit_write	= ramdisk_commit_write,
-	.writepage	= ramdisk_writepage,
-	.set_page_dirty	= ramdisk_set_page_dirty,
-	.writepages	= ramdisk_writepages,
-	.releasepage	= ramdisk_releasepage,
-};
-
-static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector,
-				struct address_space *mapping)
-{
-	pgoff_t index = sector >> (PAGE_CACHE_SHIFT - 9);
-	unsigned int vec_offset = vec->bv_offset;
-	int offset = (sector << 9) & ~PAGE_CACHE_MASK;
-	int size = vec->bv_len;
-	int err = 0;
-
-	do {
-		int count;
-		struct page *page;
-		char *src;
-		char *dst;
-
-		count = PAGE_CACHE_SIZE - offset;
-		if (count > size)
-			count = size;
-		size -= count;
-
-		page = grab_cache_page(mapping, index);
-		if (!page) {
-			err = -ENOMEM;
-			goto out;
-		}
-
-		if (!PageUptodate(page))
-			make_page_uptodate(page);
-
-		index++;
-
-		if (rw == READ) {
-			src = kmap_atomic(page, KM_USER0) + offset;
-			dst = kmap_atomic(vec->bv_page, KM_USER1) + vec_offset;
-		} else {
-			src = kmap_atomic(vec->bv_page, KM_USER0) + vec_offset;
-			dst = kmap_atomic(page, KM_USER1) + offset;
-		}
-		offset = 0;
-		vec_offset += count;
-
-		memcpy(dst, src, count);
-
-		kunmap_atomic(src, KM_USER0);
-		kunmap_atomic(dst, KM_USER1);
-
-		if (rw == READ)
-			flush_dcache_page(vec->bv_page);
-		else
-			set_page_dirty(page);
-		unlock_page(page);
-		put_page(page);
-	} while (size);
-
- out:
-	return err;
-}
-
-/*
- *  Basically, my strategy here is to set up a buffer-head which can't be
- *  deleted, and make that my Ramdisk.  If the request is outside of the
- *  allocated size, we must get rid of it...
- *
- * 19-JAN-1998  Richard Gooch <rgooch@atnf.csiro.au>  Added devfs support
- *
- */
-static int rd_make_request(struct request_queue *q, struct bio *bio)
-{
-	struct block_device *bdev = bio->bi_bdev;
-	struct address_space * mapping = bdev->bd_inode->i_mapping;
-	sector_t sector = bio->bi_sector;
-	unsigned long len = bio->bi_size >> 9;
-	int rw = bio_data_dir(bio);
-	struct bio_vec *bvec;
-	int ret = 0, i;
-
-	if (sector + len > get_capacity(bdev->bd_disk))
-		goto fail;
-
-	if (rw==READA)
-		rw=READ;
-
-	bio_for_each_segment(bvec, bio, i) {
-		ret |= rd_blkdev_pagecache_IO(rw, bvec, sector, mapping);
-		sector += bvec->bv_len >> 9;
-	}
-	if (ret)
-		goto fail;
-
-	bio_endio(bio, 0);
-	return 0;
-fail:
-	bio_io_error(bio);
-	return 0;
-} 
-
-static int rd_ioctl(struct inode *inode, struct file *file,
-			unsigned int cmd, unsigned long arg)
-{
-	int error;
-	struct block_device *bdev = inode->i_bdev;
-
-	if (cmd != BLKFLSBUF)
-		return -ENOTTY;
-
-	/*
-	 * special: we want to release the ramdisk memory, it's not like with
-	 * the other blockdevices where this ioctl only flushes away the buffer
-	 * cache
-	 */
-	error = -EBUSY;
-	mutex_lock(&bdev->bd_mutex);
-	if (bdev->bd_openers <= 2) {
-		truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
-		error = 0;
-	}
-	mutex_unlock(&bdev->bd_mutex);
-	return error;
-}
-
-/*
- * This is the backing_dev_info for the blockdev inode itself.  It doesn't need
- * writeback and it does not contribute to dirty memory accounting.
- */
-static struct backing_dev_info rd_backing_dev_info = {
-	.ra_pages	= 0,	/* No readahead */
-	.capabilities	= BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK | BDI_CAP_MAP_COPY,
-	.unplug_io_fn	= default_unplug_io_fn,
-};
-
-/*
- * This is the backing_dev_info for the files which live atop the ramdisk
- * "device".  These files do need writeback and they do contribute to dirty
- * memory accounting.
- */
-static struct backing_dev_info rd_file_backing_dev_info = {
-	.ra_pages	= 0,	/* No readahead */
-	.capabilities	= BDI_CAP_MAP_COPY,	/* Does contribute to dirty memory */
-	.unplug_io_fn	= default_unplug_io_fn,
-};
-
-static int rd_open(struct inode *inode, struct file *filp)
-{
-	unsigned unit = iminor(inode);
-
-	if (rd_bdev[unit] == NULL) {
-		struct block_device *bdev = inode->i_bdev;
-		struct address_space *mapping;
-		unsigned bsize;
-		gfp_t gfp_mask;
-
-		inode = igrab(bdev->bd_inode);
-		rd_bdev[unit] = bdev;
-		bdev->bd_openers++;
-		bsize = bdev_hardsect_size(bdev);
-		bdev->bd_block_size = bsize;
-		inode->i_blkbits = blksize_bits(bsize);
-		inode->i_size = get_capacity(bdev->bd_disk)<<9;
-
-		mapping = inode->i_mapping;
-		mapping->a_ops = &ramdisk_aops;
-		mapping->backing_dev_info = &rd_backing_dev_info;
-		bdev->bd_inode_backing_dev_info = &rd_file_backing_dev_info;
-
-		/*
-		 * Deep badness.  rd_blkdev_pagecache_IO() needs to allocate
-		 * pagecache pages within a request_fn.  We cannot recur back
-		 * into the filesystem which is mounted atop the ramdisk, because
-		 * that would deadlock on fs locks.  And we really don't want
-		 * to reenter rd_blkdev_pagecache_IO when we're already within
-		 * that function.
-		 *
-		 * So we turn off __GFP_FS and __GFP_IO.
-		 *
-		 * And to give this thing a hope of working, turn on __GFP_HIGH.
-		 * Hopefully, there's enough regular memory allocation going on
-		 * for the page allocator emergency pools to keep the ramdisk
-		 * driver happy.
-		 */
-		gfp_mask = mapping_gfp_mask(mapping);
-		gfp_mask &= ~(__GFP_FS|__GFP_IO);
-		gfp_mask |= __GFP_HIGH;
-		mapping_set_gfp_mask(mapping, gfp_mask);
-	}
-
-	return 0;
-}
-
-static struct block_device_operations rd_bd_op = {
-	.owner =	THIS_MODULE,
-	.open =		rd_open,
-	.ioctl =	rd_ioctl,
-};
-
-/*
- * Before freeing the module, invalidate all of the protected buffers!
- */
-static void __exit rd_cleanup(void)
-{
-	int i;
-
-	for (i = 0; i < CONFIG_BLK_DEV_RAM_COUNT; i++) {
-		struct block_device *bdev = rd_bdev[i];
-		rd_bdev[i] = NULL;
-		if (bdev) {
-			invalidate_bdev(bdev);
-			blkdev_put(bdev);
-		}
-		del_gendisk(rd_disks[i]);
-		put_disk(rd_disks[i]);
-		blk_cleanup_queue(rd_queue[i]);
-	}
-	unregister_blkdev(RAMDISK_MAJOR, "ramdisk");
-
-	bdi_destroy(&rd_file_backing_dev_info);
-	bdi_destroy(&rd_backing_dev_info);
-}
-
-/*
- * This is the registration and initialization section of the RAM disk driver
- */
-static int __init rd_init(void)
-{
-	int i;
-	int err;
-
-	err = bdi_init(&rd_backing_dev_info);
-	if (err)
-		goto out2;
-
-	err = bdi_init(&rd_file_backing_dev_info);
-	if (err) {
-		bdi_destroy(&rd_backing_dev_info);
-		goto out2;
-	}
-
-	err = -ENOMEM;
-
-	if (rd_blocksize > PAGE_SIZE || rd_blocksize < 512 ||
-			(rd_blocksize & (rd_blocksize-1))) {
-		printk("RAMDISK: wrong blocksize %d, reverting to defaults\n",
-		       rd_blocksize);
-		rd_blocksize = BLOCK_SIZE;
-	}
-
-	for (i = 0; i < CONFIG_BLK_DEV_RAM_COUNT; i++) {
-		rd_disks[i] = alloc_disk(1);
-		if (!rd_disks[i])
-			goto out;
-
-		rd_queue[i] = blk_alloc_queue(GFP_KERNEL);
-		if (!rd_queue[i]) {
-			put_disk(rd_disks[i]);
-			goto out;
-		}
-	}
-
-	if (register_blkdev(RAMDISK_MAJOR, "ramdisk")) {
-		err = -EIO;
-		goto out;
-	}
-
-	for (i = 0; i < CONFIG_BLK_DEV_RAM_COUNT; i++) {
-		struct gendisk *disk = rd_disks[i];
-
-		blk_queue_make_request(rd_queue[i], &rd_make_request);
-		blk_queue_hardsect_size(rd_queue[i], rd_blocksize);
-
-		/* rd_size is given in kB */
-		disk->major = RAMDISK_MAJOR;
-		disk->first_minor = i;
-		disk->fops = &rd_bd_op;
-		disk->queue = rd_queue[i];
-		disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;
-		sprintf(disk->disk_name, "ram%d", i);
-		set_capacity(disk, rd_size * 2);
-		add_disk(rd_disks[i]);
-	}
-
-	/* rd_size is given in kB */
-	printk("RAMDISK driver initialized: "
-		"%d RAM disks of %dK size %d blocksize\n",
-		CONFIG_BLK_DEV_RAM_COUNT, rd_size, rd_blocksize);
-
-	return 0;
-out:
-	while (i--) {
-		put_disk(rd_disks[i]);
-		blk_cleanup_queue(rd_queue[i]);
-	}
-	bdi_destroy(&rd_backing_dev_info);
-	bdi_destroy(&rd_file_backing_dev_info);
-out2:
-	return err;
-}
-
-module_init(rd_init);
-module_exit(rd_cleanup);
-
-/* options - nonmodular */
-#ifndef MODULE
-static int __init ramdisk_size(char *str)
-{
-	rd_size = simple_strtol(str,NULL,0);
-	return 1;
-}
-static int __init ramdisk_blocksize(char *str)
-{
-	rd_blocksize = simple_strtol(str,NULL,0);
-	return 1;
-}
-__setup("ramdisk_size=", ramdisk_size);
-__setup("ramdisk_blocksize=", ramdisk_blocksize);
-#endif
-
-/* options - modular */
-module_param(rd_size, int, 0);
-MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes.");
-module_param(rd_blocksize, int, 0);
-MODULE_PARM_DESC(rd_blocksize, "Blocksize of each RAM disk in bytes.");
-MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
-
-MODULE_LICENSE("GPL");
Index: linux-2.6/MAINTAINERS
===================================================================
--- linux-2.6.orig/MAINTAINERS
+++ linux-2.6/MAINTAINERS
@@ -3163,6 +3163,11 @@ W:	http://rt2x00.serialmonkey.com/
 S:	Maintained
 F:	drivers/net/wireless/rt2x00/
 
+RAMDISK RAM BLOCK DEVICE DRIVER
+P:	Nick Piggin
+M:	npiggin@suse.de
+S:	Maintained
+
 RANDOM NUMBER DRIVER
 P:	Matt Mackall
 M:	mpm@selenic.com
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1436,6 +1436,7 @@ void invalidate_bh_lrus(void)
 {
 	on_each_cpu(invalidate_bh_lru, NULL, 1, 1);
 }
+EXPORT_SYMBOL_GPL(invalidate_bh_lrus);
 
 void set_bh_page(struct buffer_head *bh,
 		struct page *page, unsigned long offset)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] rewrite rd
  2007-12-04  4:26 [patch] rewrite rd Nick Piggin
@ 2007-12-04  6:29 ` Andrew Morton
  2007-12-04  7:01   ` Nick Piggin
  2007-12-04  7:55 ` Rob Landley
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2007-12-04  6:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Kernel Mailing List, linux-fsdevel, Christian Borntraeger,
	Eric W. Biederman, rob, Jens Axboe

On Tue, 4 Dec 2007 05:26:28 +0100 Nick Piggin <npiggin@suse.de> wrote:

> Hi,
> 
> This is my proposal for a (hopefully) backwards compatible rd driver.
> The motivation for me is not pressing, except that I have this code
> sitting here that is either going to rot or get merged. I'm happy to
> make myself maintainer of this code, but if any real block device
> driver writer would like to, that would be fine by me ;)
> 
> Comments?
> 
> --
> 
> This is a rewrite of the ramdisk block device driver.
> 
> The old one is really difficult because it effectively implements a block
> device which serves data out of its own buffer cache. It relies on the dirty
> bit being set, to pin its backing store in cache, however there are non
> trivial paths which can clear the dirty bit (eg. try_to_free_buffers()),
> which had recently lead to data corruption. And in general it is completely
> wrong for a block device driver to do this.
> 
> The new one is more like a regular block device driver. It has no idea
> about vm/vfs stuff. It's backing store is similar to the buffer cache
> (a simple radix-tree of pages), but it doesn't know anything about page
> cache (the pages in the radix tree are not pagecache pages).
> 
> There is one slight downside -- direct block device access and filesystem
> metadata access goes through an extra copy and gets stored in RAM twice.
> However, this downside is only slight, because the real buffercache of the
> device is now reclaimable (because we're not playing crazy games with it),
> so under memory intensive situations, footprint should effectively be the
> same -- maybe even a slight advantage to the new driver because it can also
> reclaim buffer heads.

what about mmap(/dev/ram0)?

> The fact that it now goes through all the regular vm/fs paths makes it
> much more useful for testing, too.
> 
>    text    data     bss     dec     hex filename
>    2837     849     384    4070     fe6 drivers/block/rd.o
>    3528     371      12    3911     f47 drivers/block/brd.o
> 
> Text is larger, but data and bss are smaller, making total size smaller.
> 
> A few other nice things about it:
> - Similar structure and layout to the new loop device handlinag.
> - Dynamic ramdisk creation.
> - Runtime flexible buffer head size (because it is no longer part of the
>   ramdisk code).
> - Boot / load time flexible ramdisk size, which could easily be extended
>   to a per-ramdisk runtime changeable size (eg. with an ioctl).

This ramdisk driver can use highmem whereas the existing one can't (yes?). 
That's a pretty major difference.  Plus look at the revoltingness in rd.c's
mapping_set_gfp_mask()s.

> +++ linux-2.6/drivers/block/brd.c
> @@ -0,0 +1,536 @@
> +/*
> + * Ram backed block device driver.
> + *
> + * Copyright (C) 2007 Nick Piggin
> + * Copyright (C) 2007 Novell Inc.
> + *
> + * Parts derived from drivers/block/rd.c, and drivers/block/loop.c, copyright
> + * of their respective owners.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/moduleparam.h>
> +#include <linux/major.h>
> +#include <linux/blkdev.h>
> +#include <linux/bio.h>
> +#include <linux/highmem.h>
> +#include <linux/gfp.h>
> +#include <linux/radix-tree.h>
> +#include <linux/buffer_head.h> /* invalidate_bh_lrus() */
> +
> +#include <asm/uaccess.h>
> +
> +#define SECTOR_SHIFT		9

That's our third definition of SECTOR_SHIFT.

> +#define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
> +#define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
> +
> +/*
> + * Each block ramdisk device has a radix_tree brd_pages of pages that stores
> + * the pages containing the block device's contents. A brd page's ->index is
> + * its offset in PAGE_SIZE units. This is similar to, but in no way connected
> + * with, the kernel's pagecache or buffer cache (which sit above our block
> + * device).
> + */
> +struct brd_device {
> +	int		brd_number;
> +	int		brd_refcnt;
> +	loff_t		brd_offset;
> +	loff_t		brd_sizelimit;
> +	unsigned	brd_blocksize;
> +
> +	struct request_queue	*brd_queue;
> +	struct gendisk		*brd_disk;
> +	struct list_head	brd_list;
> +
> +	/*
> +	 * Backing store of pages and lock to protect it. This is the contents
> +	 * of the block device.
> +	 */
> +	spinlock_t		brd_lock;
> +	struct radix_tree_root	brd_pages;
> +};
> +
> +/*
> + * Look up and return a brd's page for a given sector.
> + */
> +static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
> +{
> +	unsigned long idx;

Could use pgoff_t here if you think that's clearer.

> +	struct page *page;
> +
> +	/*
> +	 * The page lifetime is protected by the fact that we have opened the
> +	 * device node -- brd pages will never be deleted under us, so we
> +	 * don't need any further locking or refcounting.
> +	 *
> +	 * This is strictly true for the radix-tree nodes as well (ie. we
> +	 * don't actually need the rcu_read_lock()), however that is not a
> +	 * documented feature of the radix-tree API so it is better to be
> +	 * safe here (we don't have total exclusion from radix tree updates
> +	 * here, only deletes).
> +	 */
> +	rcu_read_lock();
> +	idx = sector >> PAGE_SECTORS_SHIFT; /* sector to page index */
> +	page = radix_tree_lookup(&brd->brd_pages, idx);
> +	rcu_read_unlock();
> +
> +	BUG_ON(page && page->index != idx);
> +
> +	return page;
> +}
> +
> +/*
> + * Look up and return a brd's page for a given sector.
> + * If one does not exist, allocate an empty page, and insert that. Then
> + * return it.
> + */
> +static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
> +{
> +	unsigned long idx;
> +	struct page *page;
> +
> +	page = brd_lookup_page(brd, sector);
> +	if (page)
> +		return page;
> +
> +	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);

Why GFP_NOIO?

Have you thought about __GFP_MOVABLE treatment here?  Seems pretty
desirable as this sucker can be large.

> +	if (!page)
> +		return NULL;
> +
> +	if (radix_tree_preload(GFP_NOIO)) {
> +		__free_page(page);
> +		return NULL;
> +	}
> +
> +	spin_lock(&brd->brd_lock);
> +	idx = sector >> PAGE_SECTORS_SHIFT;
> +	if (radix_tree_insert(&brd->brd_pages, idx, page)) {
> +		__free_page(page);
> +		page = radix_tree_lookup(&brd->brd_pages, idx);
> +		BUG_ON(!page);
> +		BUG_ON(page->index != idx);
> +	} else
> +		page->index = idx;
> +	spin_unlock(&brd->brd_lock);
> +
> +	radix_tree_preload_end();
> +
> +	return page;
> +}
> +
> +/*
> + * Free all backing store pages and radix tree. This must only be called when
> + * there are no other users of the device.
> + */
> +#define FREE_BATCH 16
> +static void brd_free_pages(struct brd_device *brd)
> +{
> +	unsigned long pos = 0;
> +	struct page *pages[FREE_BATCH];
> +	int nr_pages;
> +
> +	do {
> +		int i;
> +
> +		nr_pages = radix_tree_gang_lookup(&brd->brd_pages,
> +				(void **)pages, pos, FREE_BATCH);
> +
> +		for (i = 0; i < nr_pages; i++) {
> +			void *ret;
> +
> +			BUG_ON(pages[i]->index < pos);
> +			pos = pages[i]->index;
> +			ret = radix_tree_delete(&brd->brd_pages, pos);
> +			BUG_ON(!ret || ret != pages[i]);
> +			__free_page(pages[i]);
> +		}
> +
> +		pos++;
> +
> +	} while (nr_pages == FREE_BATCH);
> +}

I have vague memories that radix_tree_gang_lookup()'s "naive"
implementation may return fewer items than you asked for even when there
are more items remaining - when it hits certain boundaries.

> +/*
> + * copy_to_brd_setup must be called before copy_to_brd. It may sleep.
> + */
> +static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n)
> +{
> +	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
> +	size_t copy;
> +
> +	copy = min((unsigned long)n, PAGE_SIZE - offset);

use min_t.  Or, better, sort out the types.

> +	if (!brd_insert_page(brd, sector))
> +		return -ENOMEM;
> +	if (copy < n) {
> +		sector += copy >> SECTOR_SHIFT;
> +		if (!brd_insert_page(brd, sector))
> +			return -ENOMEM;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Copy n bytes from src to the brd starting at sector. Does not sleep.
> + */
> +static void copy_to_brd(struct brd_device *brd, const void *src,
> +			sector_t sector, size_t n)
> +{
> +	struct page *page;
> +	void *dst;
> +	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
> +	size_t copy;
> +
> +	copy = min((unsigned long)n, PAGE_SIZE - offset);

Ditto.

> +	page = brd_lookup_page(brd, sector);
> +	BUG_ON(!page);
> +
> +	dst = kmap_atomic(page, KM_USER1);
> +	memcpy(dst + offset, src, copy);
> +	kunmap_atomic(dst, KM_USER1);

Might need flush_dcache_page() if mmap gets sorted out.

> +	if (copy < n) {
> +		src += copy;
> +		sector += copy >> SECTOR_SHIFT;
> +		copy = n - copy;
> +		page = brd_lookup_page(brd, sector);
> +		BUG_ON(!page);
> +
> +		dst = kmap_atomic(page, KM_USER1);
> +		memcpy(dst, src, copy);
> +		kunmap_atomic(dst, KM_USER1);
> +	}
> +}
> +
> +/*
> + * Copy n bytes to dst from the brd starting at sector. Does not sleep.
> + */
> +static void copy_from_brd(void *dst, struct brd_device *brd,
> +			sector_t sector, size_t n)
> +{
> +	struct page *page;
> +	const void *src;
> +	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
> +	size_t copy;
> +
> +	copy = min((unsigned long)n, PAGE_SIZE - offset);

tritto.

> +	page = brd_lookup_page(brd, sector);
> +	if (page) {
> +		src = kmap_atomic(page, KM_USER1);
> +		memcpy(dst, src + offset, copy);
> +		kunmap_atomic(src, KM_USER1);
> +	} else
> +		memset(dst, 0, copy);
> +
> +	if (copy < n) {
> +		dst += copy;
> +		sector += copy >> SECTOR_SHIFT;
> +		copy = n - copy;
> +		page = brd_lookup_page(brd, sector);
> +		if (page) {
> +			src = kmap_atomic(page, KM_USER1);
> +			memcpy(dst, src, copy);
> +			kunmap_atomic(src, KM_USER1);
> +		} else
> +			memset(dst, 0, copy);
> +	}
> +}
> +
> +/*
> + * Process a single bvec of a bio.
> + */
> +static int brd_do_bvec(struct brd_device *brd, struct page *page,
> +			unsigned int len, unsigned int off, int rw,
> +			sector_t sector)
> +{
> +	void *mem;
> +	int err = 0;
> +
> +	if (rw != READ) {
> +		err = copy_to_brd_setup(brd, sector, len);
> +		if (err)
> +			goto out;
> +	}
> +
> +	mem = kmap_atomic(page, KM_USER0);
> +	if (rw == READ) {
> +		copy_from_brd(mem + off, brd, sector, len);
> +		flush_dcache_page(page);

hm, there's a flush_dcache_page().  I guess you've throught it through ;)

> +	} else
> +		copy_to_brd(brd, mem + off, sector, len);
> +	kunmap_atomic(mem, KM_USER0);
> +
> +out:
> +	return err;
> +}
> +
>
> ...
>
> +static int brd_ioctl(struct inode *inode, struct file *file,
> +			unsigned int cmd, unsigned long arg)
> +{
> +	int error;
> +	struct block_device *bdev = inode->i_bdev;
> +	struct brd_device *brd = bdev->bd_disk->private_data;
> +
> +	if (cmd != BLKFLSBUF)
> +		return -ENOTTY;
> +
> +	/*
> +	 * ram device BLKFLSBUF has special semantics, we want to actually
> +	 * release and destroy the ramdisk data.
> +	 */
> +	mutex_lock(&bdev->bd_mutex);
> +	error = -EBUSY;
> +	if (bdev->bd_openers <= 1) {
> +		/*
> +		 * Invalidate the cache first, so it isn't written
> +		 * back to the device.
> +		 */
> +		invalidate_bh_lrus();
> +		truncate_inode_pages(bdev->bd_inode->i_mapping, 0);

hm, some other thread can instantiate pagecache here.  I guess it's always
been like that and there's not a lot we can (or should) do about it.

> +		brd_free_pages(brd);
> +		error = 0;
> +	}
> +	mutex_unlock(&bdev->bd_mutex);
> +
> +	return error;
> +}
> +
>
> ...
>
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -1436,6 +1436,7 @@ void invalidate_bh_lrus(void)
>  {
>  	on_each_cpu(invalidate_bh_lru, NULL, 1, 1);
>  }
> +EXPORT_SYMBOL_GPL(invalidate_bh_lrus);

Maybe create a new helper function which does
invalidate_bh_lrus()+truncate_inode_pages(), call that from kill_bdev() and
here, make invalidate_bh_lrus() static.

That's a separate patch, I guess.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] rewrite rd
  2007-12-04  6:29 ` Andrew Morton
@ 2007-12-04  7:01   ` Nick Piggin
  2007-12-04  7:08     ` Nick Piggin
  0 siblings, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2007-12-04  7:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel Mailing List, linux-fsdevel, Christian Borntraeger,
	Eric W. Biederman, rob, Jens Axboe

On Mon, Dec 03, 2007 at 10:29:03PM -0800, Andrew Morton wrote:
> On Tue, 4 Dec 2007 05:26:28 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > There is one slight downside -- direct block device access and filesystem
> > metadata access goes through an extra copy and gets stored in RAM twice.
> > However, this downside is only slight, because the real buffercache of the
> > device is now reclaimable (because we're not playing crazy games with it),
> > so under memory intensive situations, footprint should effectively be the
> > same -- maybe even a slight advantage to the new driver because it can also
> > reclaim buffer heads.
> 
> what about mmap(/dev/ram0)?

Same thing I guess -- it will use more memory in the common case, but should
actually be able to reclaim slightly more memory under pressure, so the
tiny systems guys shouldn't have too much trouble.

And we're avoiding the whole class of aliasing problems where mmap on
the old rd driver means that you're creating new mappings to your backing
store pages.

 
> > Text is larger, but data and bss are smaller, making total size smaller.
> > 
> > A few other nice things about it:
> > - Similar structure and layout to the new loop device handlinag.
> > - Dynamic ramdisk creation.
> > - Runtime flexible buffer head size (because it is no longer part of the
> >   ramdisk code).
> > - Boot / load time flexible ramdisk size, which could easily be extended
> >   to a per-ramdisk runtime changeable size (eg. with an ioctl).
> 
> This ramdisk driver can use highmem whereas the existing one can't (yes?). 
> That's a pretty major difference.  Plus look at the revoltingness in rd.c's
> mapping_set_gfp_mask()s.

Ah yep, there is the highmem advantage too.


> > +#define SECTOR_SHIFT		9
> 
> That's our third definition of SECTOR_SHIFT.

Or 7th, depend on how you count ;) I always thought redefining it is a
prerequsite to getting anything merged into the block layer, so I'm too
scared to put it in include/linux/blkdev.h ;)


> > +/*
> > + * Look up and return a brd's page for a given sector.
> > + */
> > +static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
> > +{
> > +	unsigned long idx;
> 
> Could use pgoff_t here if you think that's clearer.

I guess it is. Although on one hand the radix-tree uses unsigned long, but
on the other hand, page->index is pgoff.

> > +{
> > +	unsigned long idx;
> > +	struct page *page;
> > +
> > +	page = brd_lookup_page(brd, sector);
> > +	if (page)
> > +		return page;
> > +
> > +	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
> 
> Why GFP_NOIO?

I guess it has similar issues to rd.c -- we can't risk recursing
into the block layer here. However unlike rd.c, we _could_ easily
add some mode or ioctl to allocate the backing store upfront, with
full reclaim flags...

 
> Have you thought about __GFP_MOVABLE treatment here?  Seems pretty
> desirable as this sucker can be large.

AFAIK that only applies to pagecache but I haven't been paying much
attention to that stuff lately. It wouldn't be hard to move these
pages around, if we had a hook from the vm for it.
 

> > +static void brd_free_pages(struct brd_device *brd)
> > +{
> > +	unsigned long pos = 0;
> > +	struct page *pages[FREE_BATCH];
> > +	int nr_pages;
> > +
> > +	do {
> > +		int i;
> > +
> > +		nr_pages = radix_tree_gang_lookup(&brd->brd_pages,
> > +				(void **)pages, pos, FREE_BATCH);
> > +
> > +		for (i = 0; i < nr_pages; i++) {
> > +			void *ret;
> > +
> > +			BUG_ON(pages[i]->index < pos);
> > +			pos = pages[i]->index;
> > +			ret = radix_tree_delete(&brd->brd_pages, pos);
> > +			BUG_ON(!ret || ret != pages[i]);
> > +			__free_page(pages[i]);
> > +		}
> > +
> > +		pos++;
> > +
> > +	} while (nr_pages == FREE_BATCH);
> > +}
> 
> I have vague memories that radix_tree_gang_lookup()'s "naive"
> implementation may return fewer items than you asked for even when there
> are more items remaining - when it hits certain boundaries.

Good memory, but it's the low level leaf traversal that bales out at
boundaries. The higher level code then retries, so we should be OK
here.

> > +/*
> > + * copy_to_brd_setup must be called before copy_to_brd. It may sleep.
> > + */
> > +static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n)
> > +{
> > +	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
> > +	size_t copy;
> > +
> > +	copy = min((unsigned long)n, PAGE_SIZE - offset);
> 
> use min_t.  Or, better, sort out the types.

I guess the API is using size_t, so that would be the more approprite type
to convert to. And I'll use min_t too.


> > +static void copy_to_brd(struct brd_device *brd, const void *src,
> > +			sector_t sector, size_t n)
> > +{
> > +	struct page *page;
> > +	void *dst;
> > +	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
> > +	size_t copy;
> > +
> > +	copy = min((unsigned long)n, PAGE_SIZE - offset);
> 
> Ditto.
> 
> > +	page = brd_lookup_page(brd, sector);
> > +	BUG_ON(!page);
> > +
> > +	dst = kmap_atomic(page, KM_USER1);
> > +	memcpy(dst + offset, src, copy);
> > +	kunmap_atomic(dst, KM_USER1);
> 
> Might need flush_dcache_page() if mmap gets sorted out.

The brd backing store never gets any userspace aliases, so I think it should
be OK when copying into it.

 
> > +static int brd_do_bvec(struct brd_device *brd, struct page *page,
> > +			unsigned int len, unsigned int off, int rw,
> > +			sector_t sector)
> > +{
> > +	void *mem;
> > +	int err = 0;
> > +
> > +	if (rw != READ) {
> > +		err = copy_to_brd_setup(brd, sector, len);
> > +		if (err)
> > +			goto out;
> > +	}
> > +
> > +	mem = kmap_atomic(page, KM_USER0);
> > +	if (rw == READ) {
> > +		copy_from_brd(mem + off, brd, sector, len);
> > +		flush_dcache_page(page);
> 
> hm, there's a flush_dcache_page().  I guess you've throught it through ;)

Copying into the pagecache yes needs to flush I think. Although strictly it
also needs a write barrier to prevent these stores being passed by the store
to set PG_uptodate (but that's another story, which I think should be fixed
in the PageUptodate macros rather than device drivers and filesystems...)

> > +	mutex_lock(&bdev->bd_mutex);
> > +	error = -EBUSY;
> > +	if (bdev->bd_openers <= 1) {
> > +		/*
> > +		 * Invalidate the cache first, so it isn't written
> > +		 * back to the device.
> > +		 */
> > +		invalidate_bh_lrus();
> > +		truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
> 
> hm, some other thread can instantiate pagecache here.  I guess it's always
> been like that and there's not a lot we can (or should) do about it.
 
Yeah, another thread could do that...


> > +		brd_free_pages(brd);
> > +		error = 0;
> > +	}
> > +	mutex_unlock(&bdev->bd_mutex);
> > +
> > +	return error;
> > +}
> > +
> >
> > ...
> >
> > --- linux-2.6.orig/fs/buffer.c
> > +++ linux-2.6/fs/buffer.c
> > @@ -1436,6 +1436,7 @@ void invalidate_bh_lrus(void)
> >  {
> >  	on_each_cpu(invalidate_bh_lru, NULL, 1, 1);
> >  }
> > +EXPORT_SYMBOL_GPL(invalidate_bh_lrus);
> 
> Maybe create a new helper function which does
> invalidate_bh_lrus()+truncate_inode_pages(), call that from kill_bdev() and
> here, make invalidate_bh_lrus() static.
> 
> That's a separate patch, I guess.

I was thinking also that perhaps the buffer cache layer could intercept
the ioctl on the way through and flush the buffer cache before going to
the device -- so device drivers don't have to have _any_ knowledge
about the buffer cache...?

Thanks for the review, I'll post an incremental patch in a sec.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] rewrite rd
  2007-12-04  7:01   ` Nick Piggin
@ 2007-12-04  7:08     ` Nick Piggin
  0 siblings, 0 replies; 34+ messages in thread
From: Nick Piggin @ 2007-12-04  7:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel Mailing List, linux-fsdevel, Christian Borntraeger,
	Eric W. Biederman, rob, Jens Axboe

On Tue, Dec 04, 2007 at 08:01:31AM +0100, Nick Piggin wrote:
> Thanks for the review, I'll post an incremental patch in a sec.
 

Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -56,7 +56,7 @@ struct brd_device {
  */
 static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
 {
-	unsigned long idx;
+	pgoff_t idx;
 	struct page *page;
 
 	/*
@@ -87,13 +87,17 @@ static struct page *brd_lookup_page(stru
  */
 static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
 {
-	unsigned long idx;
+	pgoff_t idx;
 	struct page *page;
 
 	page = brd_lookup_page(brd, sector);
 	if (page)
 		return page;
 
+	/*
+	 * Must use NOIO because we don't want to recurse back into the
+	 * block or filesystem layers from page reclaim.
+	 */
 	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
 	if (!page)
 		return NULL;
@@ -148,6 +152,11 @@ static void brd_free_pages(struct brd_de
 
 		pos++;
 
+		/*
+		 * This assumes radix_tree_gang_lookup always returns as
+		 * many pages as possible. If the radix-tree code changes,
+		 * so will this have to.
+		 */
 	} while (nr_pages == FREE_BATCH);
 }
 
@@ -159,7 +168,7 @@ static int copy_to_brd_setup(struct brd_
 	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
 	size_t copy;
 
-	copy = min((unsigned long)n, PAGE_SIZE - offset);
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
 	if (!brd_insert_page(brd, sector))
 		return -ENOMEM;
 	if (copy < n) {
@@ -181,7 +190,7 @@ static void copy_to_brd(struct brd_devic
 	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
 	size_t copy;
 
-	copy = min((unsigned long)n, PAGE_SIZE - offset);
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
 	page = brd_lookup_page(brd, sector);
 	BUG_ON(!page);
 
@@ -213,7 +222,7 @@ static void copy_from_brd(void *dst, str
 	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
 	size_t copy;
 
-	copy = min((unsigned long)n, PAGE_SIZE - offset);
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
 	page = brd_lookup_page(brd, sector);
 	if (page) {
 		src = kmap_atomic(page, KM_USER1);
@@ -318,6 +327,9 @@ static int brd_ioctl(struct inode *inode
 		/*
 		 * Invalidate the cache first, so it isn't written
 		 * back to the device.
+		 *
+		 * Another thread might instantiate more buffercache here,
+		 * but there is not much we can do to close that race.
 		 */
 		invalidate_bh_lrus();
 		truncate_inode_pages(bdev->bd_inode->i_mapping, 0);

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] rewrite rd
  2007-12-04  4:26 [patch] rewrite rd Nick Piggin
  2007-12-04  6:29 ` Andrew Morton
@ 2007-12-04  7:55 ` Rob Landley
  2007-12-04  9:29   ` Nick Piggin
  2007-12-04  9:54 ` Christian Borntraeger
  2008-01-14 16:47 ` [patch] rewrite rd Matthew Wilcox
  3 siblings, 1 reply; 34+ messages in thread
From: Rob Landley @ 2007-12-04  7:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Kernel Mailing List, linux-fsdevel, Christian Borntraeger,
	Eric W. Biederman, Andrew Morton, Jens Axboe

On Monday 03 December 2007 22:26:28 Nick Piggin wrote:
> There is one slight downside -- direct block device access and filesystem
> metadata access goes through an extra copy and gets stored in RAM twice.
> However, this downside is only slight, because the real buffercache of the
> device is now reclaimable (because we're not playing crazy games with it),
> so under memory intensive situations, footprint should effectively be the
> same -- maybe even a slight advantage to the new driver because it can also
> reclaim buffer heads.

For the embedded world, initramfs has pretty much rendered initrd obsolete, 
and that was the biggest user of the ramdisk code I know of.  Beyond that, 
loopback mounts give you more flexible transient block devices than ramdisks 
do.  (In fact, ramdisks are such an amazing pain to use/size/free that if I 
really needed something like that I'd just make a loopback mount in a ramfs 
instance.)

Embedded users who still want a block interface for memory are generally 
trying to use a cramfs or squashfs image out of ROM or flash, although there 
are flash-specific filesystems for this and I dunno if they're actually 
mounting /dev/mem at an offset or something (md?  losetup -o?  Beats me, I 
haven't tried that myself yet...)

Rob
-- 
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] rewrite rd
  2007-12-04  7:55 ` Rob Landley
@ 2007-12-04  9:29   ` Nick Piggin
  2007-12-04 19:53     ` Rob Landley
  0 siblings, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2007-12-04  9:29 UTC (permalink / raw)
  To: Rob Landley
  Cc: Linux Kernel Mailing List, linux-fsdevel, Christian Borntraeger,
	Eric W. Biederman, Andrew Morton, Jens Axboe

On Tue, Dec 04, 2007 at 01:55:17AM -0600, Rob Landley wrote:
> On Monday 03 December 2007 22:26:28 Nick Piggin wrote:
> > There is one slight downside -- direct block device access and filesystem
> > metadata access goes through an extra copy and gets stored in RAM twice.
> > However, this downside is only slight, because the real buffercache of the
> > device is now reclaimable (because we're not playing crazy games with it),
> > so under memory intensive situations, footprint should effectively be the
> > same -- maybe even a slight advantage to the new driver because it can also
> > reclaim buffer heads.
> 
> For the embedded world, initramfs has pretty much rendered initrd obsolete, 
> and that was the biggest user of the ramdisk code I know of.  Beyond that, 
> loopback mounts give you more flexible transient block devices than ramdisks 
> do.  (In fact, ramdisks are such an amazing pain to use/size/free that if I 
> really needed something like that I'd just make a loopback mount in a ramfs 
> instance.)

They are, although we could easily hook up a couple more ioctls for them
now (if anybody is interested).

The rd driver can potentially be a _lot_ more scalable than the loop driver.
It's completely lockless in the fastpath and doesn't even dirty any
cachelines except for the actual destination memory. OK, this is only
really important for testing purposes (eg. testing scalability of a
filesystem etc.) But it is one reason why I (personally) want brd...
 

> Embedded users who still want a block interface for memory are generally 
> trying to use a cramfs or squashfs image out of ROM or flash, although there 
> are flash-specific filesystems for this and I dunno if they're actually 
> mounting /dev/mem at an offset or something (md?  losetup -o?  Beats me, I 
> haven't tried that myself yet...)

OK, it would be interesting to hear from anyone using rd for things like
cramfs. I don't think we can get rid of the code entirely, but it sounds
like the few downsides of the new code won't be big problems.

Thanks for the feedback.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] rewrite rd
  2007-12-04  4:26 [patch] rewrite rd Nick Piggin
  2007-12-04  6:29 ` Andrew Morton
  2007-12-04  7:55 ` Rob Landley
@ 2007-12-04  9:54 ` Christian Borntraeger
  2007-12-04 10:10   ` Nick Piggin
  2008-01-14 16:47 ` [patch] rewrite rd Matthew Wilcox
  3 siblings, 1 reply; 34+ messages in thread
From: Christian Borntraeger @ 2007-12-04  9:54 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Kernel Mailing List, linux-fsdevel, Eric W. Biederman,
	Andrew Morton, rob, Jens Axboe

Am Dienstag, 4. Dezember 2007 schrieb Nick Piggin:
[...]
> There is one slight downside -- direct block device access and filesystem
> metadata access goes through an extra copy and gets stored in RAM twice.
> However, this downside is only slight, because the real buffercache of the
> device is now reclaimable (because we're not playing crazy games with it),
> so under memory intensive situations, footprint should effectively be the
> same -- maybe even a slight advantage to the new driver because it can also
> reclaim buffer heads.

This is just an idea, I dont know if it is worth the trouble, but have you
though about implementing direct_access for brd? That would allow 
execute-in-place (xip) on brd eliminating the extra copy.

Christian

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] rewrite rd
  2007-12-04  9:54 ` Christian Borntraeger
@ 2007-12-04 10:10   ` Nick Piggin
  2007-12-04 11:21     ` [patch] rd: support XIP Nick Piggin
  0 siblings, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2007-12-04 10:10 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Linux Kernel Mailing List, linux-fsdevel, Eric W. Biederman,
	Andrew Morton, rob, Jens Axboe

On Tue, Dec 04, 2007 at 10:54:51AM +0100, Christian Borntraeger wrote:
> Am Dienstag, 4. Dezember 2007 schrieb Nick Piggin:
> [...]
> > There is one slight downside -- direct block device access and filesystem
> > metadata access goes through an extra copy and gets stored in RAM twice.
> > However, this downside is only slight, because the real buffercache of the
> > device is now reclaimable (because we're not playing crazy games with it),
> > so under memory intensive situations, footprint should effectively be the
> > same -- maybe even a slight advantage to the new driver because it can also
> > reclaim buffer heads.
> 
> This is just an idea, I dont know if it is worth the trouble, but have you
> though about implementing direct_access for brd? That would allow 
> execute-in-place (xip) on brd eliminating the extra copy.

Actually that's a pretty good idea. It would allow xip to be tested
without special hardware as well...

I'll see what the patch looks like. Thanks


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [patch] rd: support XIP
  2007-12-04 10:10   ` Nick Piggin
@ 2007-12-04 11:21     ` Nick Piggin
  2007-12-04 11:23       ` [patch] ext2: xip check fix Nick Piggin
                         ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Nick Piggin @ 2007-12-04 11:21 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Linux Kernel Mailing List, linux-fsdevel, Eric W. Biederman,
	Andrew Morton, rob, Jens Axboe

On Tue, Dec 04, 2007 at 11:10:09AM +0100, Nick Piggin wrote:
> > 
> > This is just an idea, I dont know if it is worth the trouble, but have you
> > though about implementing direct_access for brd? That would allow 
> > execute-in-place (xip) on brd eliminating the extra copy.
> 
> Actually that's a pretty good idea. It would allow xip to be tested
> without special hardware as well...
> 
> I'll see what the patch looks like. Thanks

This seems to work, although I needed another patch for ext2 too.
Never run XIP before, but I'm able to execute files out of the -oxip
mount, so I'm guessing it's working ;)

---

Support direct_access XIP method with brd.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/drivers/block/Kconfig
===================================================================
--- linux-2.6.orig/drivers/block/Kconfig
+++ linux-2.6/drivers/block/Kconfig
@@ -346,6 +346,16 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
 
+config BLK_DEV_XIP
+	bool "Support XIP filesystems on RAM block device"
+	depends on BLK_DEV_RAM
+	default n
+	help
+	  Support XIP filesystems (such as ext2 with XIP support on) on
+	  top of block ram device. This will slightly enlarge the kernel, and
+	  will prevent RAM block device backing store memory from being
+	  allocated from highmem (only a problem for highmem systems).
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -89,6 +89,7 @@ static struct page *brd_insert_page(stru
 {
 	pgoff_t idx;
 	struct page *page;
+	gfp_t gfp_flags;
 
 	page = brd_lookup_page(brd, sector);
 	if (page)
@@ -97,7 +98,16 @@ static struct page *brd_insert_page(stru
 	/*
 	 * Must use NOIO because we don't want to recurse back into the
 	 * block or filesystem layers from page reclaim.
+	 *
+	 * Cannot support XIP and highmem, because our ->direct_access
+	 * routine for XIP must return memory that is always addressable.
+	 * If XIP was reworked to use pfns and kmap throughout, this
+	 * restriction might be able to be lifted.
 	 */
+	gfp_flags = GFP_NOIO | __GFP_ZERO;
+#ifndef CONFIG_BLK_DEV_XIP
+	gfp_flags |= __GFP_HIGHMEM;
+#endif
 	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
 	if (!page)
 		return NULL;
@@ -307,6 +317,28 @@ out:
 	return 0;
 }
 
+#ifdef CONFIG_BLK_DEV_XIP
+static int brd_direct_access (struct block_device *bdev, sector_t sector,
+			unsigned long *data)
+{
+	struct brd_device *brd = bdev->bd_disk->private_data;
+	struct page *page;
+
+	if (!brd)
+		return -ENODEV;
+	if (sector & (PAGE_SECTORS-1))
+		return -EINVAL;
+	if (sector + PAGE_SECTORS > get_capacity(bdev->bd_disk))
+		return -ERANGE;
+	page = brd_insert_page(brd, sector);
+	if (!page)
+		return -ENOMEM;
+	*data = (unsigned long)page_address(page);
+
+	return 0;
+}
+#endif
+
 static int brd_ioctl(struct inode *inode, struct file *file,
 			unsigned int cmd, unsigned long arg)
 {
@@ -342,8 +374,11 @@ static int brd_ioctl(struct inode *inode
 }
 
 static struct block_device_operations brd_fops = {
-	.owner =	THIS_MODULE,
-	.ioctl =	brd_ioctl,
+	.owner =		THIS_MODULE,
+	.ioctl =		brd_ioctl,
+#ifdef CONFIG_BLK_DEV_XIP
+	.direct_access =	brd_direct_access,
+#endif
 };
 
 /*

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [patch] ext2: xip check fix
  2007-12-04 11:21     ` [patch] rd: support XIP Nick Piggin
@ 2007-12-04 11:23       ` Nick Piggin
  2007-12-05 15:43         ` Carsten Otte
  2007-12-04 11:26       ` [patch] rd: support XIP Andrew Morton
  2007-12-04 12:06       ` [patch] rd: support XIP Duane Griffin
  2 siblings, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2007-12-04 11:23 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Linux Kernel Mailing List, linux-fsdevel, Eric W. Biederman,
	Andrew Morton, rob, Jens Axboe

Am I missing something here? I wonder how s390 works without this change?

--
ext2 should not worry about checking sb->s_blocksize for XIP before the
sb's blocksize actually gets set.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/fs/ext2/super.c
===================================================================
--- linux-2.6.orig/fs/ext2/super.c
+++ linux-2.6/fs/ext2/super.c
@@ -844,8 +844,7 @@ static int ext2_fill_super(struct super_
 
 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
 
-	if ((ext2_use_xip(sb)) && ((blocksize != PAGE_SIZE) ||
-				  (sb->s_blocksize != blocksize))) {
+	if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
 		if (!silent)
 			printk("XIP: Unsupported blocksize\n");
 		goto failed_mount;

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] rd: support XIP
  2007-12-04 11:21     ` [patch] rd: support XIP Nick Piggin
  2007-12-04 11:23       ` [patch] ext2: xip check fix Nick Piggin
@ 2007-12-04 11:26       ` Andrew Morton
  2007-12-04 11:35         ` Nick Piggin
  2007-12-04 12:06       ` [patch] rd: support XIP Duane Griffin
  2 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2007-12-04 11:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christian Borntraeger, Linux Kernel Mailing List, linux-fsdevel,
	Eric W. Biederman, rob, Jens Axboe

On Tue, 4 Dec 2007 12:21:00 +0100 Nick Piggin <npiggin@suse.de> wrote:

> +	 *
> +	 * Cannot support XIP and highmem, because our ->direct_access
> +	 * routine for XIP must return memory that is always addressable.
> +	 * If XIP was reworked to use pfns and kmap throughout, this
> +	 * restriction might be able to be lifted.
>  	 */
> +	gfp_flags = GFP_NOIO | __GFP_ZERO;
> +#ifndef CONFIG_BLK_DEV_XIP
> +	gfp_flags |= __GFP_HIGHMEM;
> +#endif

A dubious tradeoff?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] rd: support XIP
  2007-12-04 11:26       ` [patch] rd: support XIP Andrew Morton
@ 2007-12-04 11:35         ` Nick Piggin
  2007-12-04 13:00           ` [patch] mm: fix XIP file writes Nick Piggin
  0 siblings, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2007-12-04 11:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Borntraeger, Linux Kernel Mailing List, linux-fsdevel,
	Eric W. Biederman, rob, Jens Axboe

On Tue, Dec 04, 2007 at 03:26:20AM -0800, Andrew Morton wrote:
> On Tue, 4 Dec 2007 12:21:00 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > +	 *
> > +	 * Cannot support XIP and highmem, because our ->direct_access
> > +	 * routine for XIP must return memory that is always addressable.
> > +	 * If XIP was reworked to use pfns and kmap throughout, this
> > +	 * restriction might be able to be lifted.
> >  	 */
> > +	gfp_flags = GFP_NOIO | __GFP_ZERO;
> > +#ifndef CONFIG_BLK_DEV_XIP
> > +	gfp_flags |= __GFP_HIGHMEM;
> > +#endif
> 
> A dubious tradeoff?

On big highmem machines certainly. It may be somewhat useful on small
memory systems... but having the config option there is nice for a VM
developer without an s390 easily available ;)

But don't apply these XIP patches yet -- after a bit more testing I'm
seeing some data corruption, so I'll have to work out what's going
wrong with that first.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] rd: support XIP
  2007-12-04 11:21     ` [patch] rd: support XIP Nick Piggin
  2007-12-04 11:23       ` [patch] ext2: xip check fix Nick Piggin
  2007-12-04 11:26       ` [patch] rd: support XIP Andrew Morton
@ 2007-12-04 12:06       ` Duane Griffin
  2007-12-04 13:03         ` [patch] rd: support XIP (updated) Nick Piggin
  2 siblings, 1 reply; 34+ messages in thread
From: Duane Griffin @ 2007-12-04 12:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christian Borntraeger, Linux Kernel Mailing List, linux-fsdevel,
	Eric W. Biederman, Andrew Morton, rob, Jens Axboe

On 04/12/2007, Nick Piggin <npiggin@suse.de> wrote:
> +       gfp_flags = GFP_NOIO | __GFP_ZERO;
> +#ifndef CONFIG_BLK_DEV_XIP
> +       gfp_flags |= __GFP_HIGHMEM;
> +#endif
>         page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);

I think that should be alloc_page(gfp_flags), no?

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [patch] mm: fix XIP file writes
  2007-12-04 11:35         ` Nick Piggin
@ 2007-12-04 13:00           ` Nick Piggin
  2007-12-10 14:38             ` Christian Borntraeger
  0 siblings, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2007-12-04 13:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Borntraeger, Linux Kernel Mailing List, linux-fsdevel,
	Eric W. Biederman, rob, Jens Axboe

On Tue, Dec 04, 2007 at 12:35:49PM +0100, Nick Piggin wrote:
> On Tue, Dec 04, 2007 at 03:26:20AM -0800, Andrew Morton wrote:
> > On Tue, 4 Dec 2007 12:21:00 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > +	 *
> > > +	 * Cannot support XIP and highmem, because our ->direct_access
> > > +	 * routine for XIP must return memory that is always addressable.
> > > +	 * If XIP was reworked to use pfns and kmap throughout, this
> > > +	 * restriction might be able to be lifted.
> > >  	 */
> > > +	gfp_flags = GFP_NOIO | __GFP_ZERO;
> > > +#ifndef CONFIG_BLK_DEV_XIP
> > > +	gfp_flags |= __GFP_HIGHMEM;
> > > +#endif
> > 
> > A dubious tradeoff?
> 
> On big highmem machines certainly. It may be somewhat useful on small
> memory systems... but having the config option there is nice for a VM
> developer without an s390 easily available ;)
> 
> But don't apply these XIP patches yet -- after a bit more testing I'm
> seeing some data corruption, so I'll have to work out what's going
> wrong with that first.

Here we go. See, brd already found a bug ;)

Can you apply the ext2 XIP patch too? And I'll resend the brd XIP patch.

---

Writing to XIP files at a non-page-aligned offset results in data corruption
because the writes were always sent to the start of the page.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -314,7 +314,7 @@ __xip_file_write(struct file *filp, cons
 		fault_in_pages_readable(buf, bytes);
 		kaddr = kmap_atomic(page, KM_USER0);
 		copied = bytes -
-			__copy_from_user_inatomic_nocache(kaddr, buf, bytes);
+			__copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
 		kunmap_atomic(kaddr, KM_USER0);
 		flush_dcache_page(page);
 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [patch] rd: support XIP (updated)
  2007-12-04 12:06       ` [patch] rd: support XIP Duane Griffin
@ 2007-12-04 13:03         ` Nick Piggin
  0 siblings, 0 replies; 34+ messages in thread
From: Nick Piggin @ 2007-12-04 13:03 UTC (permalink / raw)
  To: Duane Griffin
  Cc: Christian Borntraeger, Linux Kernel Mailing List, linux-fsdevel,
	Eric W. Biederman, Andrew Morton, rob, Jens Axboe

On Tue, Dec 04, 2007 at 12:06:23PM +0000, Duane Griffin wrote:
> On 04/12/2007, Nick Piggin <npiggin@suse.de> wrote:
> > +       gfp_flags = GFP_NOIO | __GFP_ZERO;
> > +#ifndef CONFIG_BLK_DEV_XIP
> > +       gfp_flags |= __GFP_HIGHMEM;
> > +#endif
> >         page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
> 
> I think that should be alloc_page(gfp_flags), no?

Yes. Here is a resend. Andrew, please apply this one (has passed some
testing with ext2 XIP).

--

Support direct_access XIP method with brd.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/drivers/block/Kconfig
===================================================================
--- linux-2.6.orig/drivers/block/Kconfig
+++ linux-2.6/drivers/block/Kconfig
@@ -346,6 +346,16 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
 
+config BLK_DEV_XIP
+	bool "Support XIP filesystems on RAM block device"
+	depends on BLK_DEV_RAM
+	default n
+	help
+	  Support XIP filesystems (such as ext2 with XIP support on) on
+	  top of block ram device. This will slightly enlarge the kernel, and
+	  will prevent RAM block device backing store memory from being
+	  allocated from highmem (only a problem for highmem systems).
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -89,6 +89,7 @@ static struct page *brd_insert_page(stru
 {
 	pgoff_t idx;
 	struct page *page;
+	gfp_t gfp_flags;
 
 	page = brd_lookup_page(brd, sector);
 	if (page)
@@ -97,8 +98,17 @@ static struct page *brd_insert_page(stru
 	/*
 	 * Must use NOIO because we don't want to recurse back into the
 	 * block or filesystem layers from page reclaim.
+	 *
+	 * Cannot support XIP and highmem, because our ->direct_access
+	 * routine for XIP must return memory that is always addressable.
+	 * If XIP was reworked to use pfns and kmap throughout, this
+	 * restriction might be able to be lifted.
 	 */
-	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
+	gfp_flags = GFP_NOIO | __GFP_ZERO;
+#ifndef CONFIG_BLK_DEV_XIP
+	gfp_flags |= __GFP_HIGHMEM;
+#endif
+	page = alloc_page(gfp_flags);
 	if (!page)
 		return NULL;
 
@@ -307,6 +317,28 @@ out:
 	return 0;
 }
 
+#ifdef CONFIG_BLK_DEV_XIP
+static int brd_direct_access (struct block_device *bdev, sector_t sector,
+			unsigned long *data)
+{
+	struct brd_device *brd = bdev->bd_disk->private_data;
+	struct page *page;
+
+	if (!brd)
+		return -ENODEV;
+	if (sector & (PAGE_SECTORS-1))
+		return -EINVAL;
+	if (sector + PAGE_SECTORS > get_capacity(bdev->bd_disk))
+		return -ERANGE;
+	page = brd_insert_page(brd, sector);
+	if (!page)
+		return -ENOMEM;
+	*data = (unsigned long)page_address(page);
+
+	return 0;
+}
+#endif
+
 static int brd_ioctl(struct inode *inode, struct file *file,
 			unsigned int cmd, unsigned long arg)
 {
@@ -342,8 +374,11 @@ static int brd_ioctl(struct inode *inode
 }
 
 static struct block_device_operations brd_fops = {
-	.owner =	THIS_MODULE,
-	.ioctl =	brd_ioctl,
+	.owner =		THIS_MODULE,
+	.ioctl =		brd_ioctl,
+#ifdef CONFIG_BLK_DEV_XIP
+	.direct_access =	brd_direct_access,
+#endif
 };
 
 /*

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] rewrite rd
  2007-12-04  9:29   ` Nick Piggin
@ 2007-12-04 19:53     ` Rob Landley
  0 siblings, 0 replies; 34+ messages in thread
From: Rob Landley @ 2007-12-04 19:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Kernel Mailing List, linux-fsdevel, Christian Borntraeger,
	Eric W. Biederman, Andrew Morton, Jens Axboe

On Tuesday 04 December 2007 03:29:40 Nick Piggin wrote:
> On Tue, Dec 04, 2007 at 01:55:17AM -0600, Rob Landley wrote:
> > On Monday 03 December 2007 22:26:28 Nick Piggin wrote:
> > > There is one slight downside -- direct block device access and
> > > filesystem metadata access goes through an extra copy and gets stored
> > > in RAM twice. However, this downside is only slight, because the real
> > > buffercache of the device is now reclaimable (because we're not playing
> > > crazy games with it), so under memory intensive situations, footprint
> > > should effectively be the same -- maybe even a slight advantage to the
> > > new driver because it can also reclaim buffer heads.
> >
> > For the embedded world, initramfs has pretty much rendered initrd
> > obsolete, and that was the biggest user of the ramdisk code I know of. 
> > Beyond that, loopback mounts give you more flexible transient block
> > devices than ramdisks do.  (In fact, ramdisks are such an amazing pain to
> > use/size/free that if I really needed something like that I'd just make a
> > loopback mount in a ramfs instance.)
>
> They are, although we could easily hook up a couple more ioctls for them
> now (if anybody is interested).

>From a usability perspective, an ioctl is no substitute for being able to 
resize a device with "dd".  (Or for that matter, make it sparse.)

> The rd driver can potentially be a _lot_ more scalable than the loop
> driver. It's completely lockless in the fastpath and doesn't even dirty any
> cachelines except for the actual destination memory. OK, this is only
> really important for testing purposes (eg. testing scalability of a
> filesystem etc.) But it is one reason why I (personally) want brd...

I wouldn't say not important: The "database in RAM" people will probably love 
you, at least if they insist on using Oracle.  (Filesystem schmilesystem, 
gimme a raw block device and let me implement my own filesystem from 
userspace...)

But I'm not personally likely to care. :)

> > Embedded users who still want a block interface for memory are generally
> > trying to use a cramfs or squashfs image out of ROM or flash, although
> > there are flash-specific filesystems for this and I dunno if they're
> > actually mounting /dev/mem at an offset or something (md?  losetup -o? 
> > Beats me, I haven't tried that myself yet...)
>
> OK, it would be interesting to hear from anyone using rd for things like
> cramfs.

I'd be interested in that too.  I've heard it proposed a lot but not actually 
seen anyone bother to implement it yet...

Rob
-- 
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] ext2: xip check fix
  2007-12-04 11:23       ` [patch] ext2: xip check fix Nick Piggin
@ 2007-12-05 15:43         ` Carsten Otte
  2007-12-05 23:33           ` Nick Piggin
  0 siblings, 1 reply; 34+ messages in thread
From: Carsten Otte @ 2007-12-05 15:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christian Borntraeger, Linux Kernel Mailing List, linux-fsdevel,
	Eric W. Biederman, Andrew Morton, rob, Jens Axboe

Nick Piggin wrote:
> Am I missing something here? I wonder how s390 works without this change?
> 
> --
> ext2 should not worry about checking sb->s_blocksize for XIP before the
> sb's blocksize actually gets set.
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> Index: linux-2.6/fs/ext2/super.c
> ===================================================================
> --- linux-2.6.orig/fs/ext2/super.c
> +++ linux-2.6/fs/ext2/super.c
> @@ -844,8 +844,7 @@ static int ext2_fill_super(struct super_
> 
>  	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
> 
> -	if ((ext2_use_xip(sb)) && ((blocksize != PAGE_SIZE) ||
> -				  (sb->s_blocksize != blocksize))) {
> +	if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
>  		if (!silent)
>  			printk("XIP: Unsupported blocksize\n");
>  		goto failed_mount;
"blocksize" contains the blocksize of the device here, and 
sb->s_blocksize does contain the filesystem block size as saved in the 
super block. Xip does only work, if both do match PAGE_SIZE because it 
does'nt support multiple calls to direct_access in the get_xip_page 
address space operation. Thus we check both here, actually this was 
changed from how it looks after your patch as a bugfix where our 
tester tried a 4k filesystem on a 2k blockdev.
Did I miss something?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] ext2: xip check fix
  2007-12-05 15:43         ` Carsten Otte
@ 2007-12-05 23:33           ` Nick Piggin
  2007-12-06  8:43             ` Carsten Otte
  0 siblings, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2007-12-05 23:33 UTC (permalink / raw)
  To: carsteno
  Cc: Christian Borntraeger, Linux Kernel Mailing List, linux-fsdevel,
	Eric W. Biederman, Andrew Morton, rob, Jens Axboe

On Wed, Dec 05, 2007 at 04:43:16PM +0100, Carsten Otte wrote:
> Nick Piggin wrote:
> >Am I missing something here? I wonder how s390 works without this change?
> >
> >--
> >ext2 should not worry about checking sb->s_blocksize for XIP before the
> >sb's blocksize actually gets set.
> >
> >Signed-off-by: Nick Piggin <npiggin@suse.de>
> >---
> >Index: linux-2.6/fs/ext2/super.c
> >===================================================================
> >--- linux-2.6.orig/fs/ext2/super.c
> >+++ linux-2.6/fs/ext2/super.c
> >@@ -844,8 +844,7 @@ static int ext2_fill_super(struct super_
> >
> > 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
> >
> >-	if ((ext2_use_xip(sb)) && ((blocksize != PAGE_SIZE) ||
> >-				  (sb->s_blocksize != blocksize))) {
> >+	if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
> > 		if (!silent)
> > 			printk("XIP: Unsupported blocksize\n");
> > 		goto failed_mount;
> "blocksize" contains the blocksize of the device here, and 
> sb->s_blocksize does contain the filesystem block size as saved in the 
> super block.

You mean blocksize is the size of the ext2 block, and s_blocksize is the
default size of the block device.


> Xip does only work, if both do match PAGE_SIZE because it 
> does'nt support multiple calls to direct_access in the get_xip_page 
> address space operation. Thus we check both here, actually this was 
> changed from how it looks after your patch as a bugfix where our 
> tester tried a 4k filesystem on a 2k blockdev.
> Did I miss something?

However, the bdev block size may be changed with sb_set_blocksize. It
doesn't actually have to match the hardware sector size -- if this
does matter for XIP, then I think you need some other check here.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] ext2: xip check fix
  2007-12-05 23:33           ` Nick Piggin
@ 2007-12-06  8:43             ` Carsten Otte
  2007-12-06  8:52               ` Nick Piggin
  0 siblings, 1 reply; 34+ messages in thread
From: Carsten Otte @ 2007-12-06  8:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: carsteno, Christian Borntraeger, Linux Kernel Mailing List,
	linux-fsdevel, Eric W. Biederman, Andrew Morton, rob, Jens Axboe

Nick Piggin wrote:
>> Xip does only work, if both do match PAGE_SIZE because it 
>> does'nt support multiple calls to direct_access in the get_xip_page 
>> address space operation. Thus we check both here, actually this was 
>> changed from how it looks after your patch as a bugfix where our 
>> tester tried a 4k filesystem on a 2k blockdev.
>> Did I miss something?
> 
> However, the bdev block size may be changed with sb_set_blocksize. It
> doesn't actually have to match the hardware sector size -- if this
> does matter for XIP, then I think you need some other check here.
Hmmmmhh. For a bdev with PAGE_SIZE hardsect size, there is no other 
valid value then PAGE_SIZE that one could set it to. Or can it indeed 
be changed to a value greater then PAGE_SIZE or smaller then hardsect 
size?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] ext2: xip check fix
  2007-12-06  8:43             ` Carsten Otte
@ 2007-12-06  8:52               ` Nick Piggin
  2007-12-06  9:59                 ` Carsten Otte
  0 siblings, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2007-12-06  8:52 UTC (permalink / raw)
  To: carsteno
  Cc: Christian Borntraeger, Linux Kernel Mailing List, linux-fsdevel,
	Eric W. Biederman, Andrew Morton, rob, Jens Axboe

On Thu, Dec 06, 2007 at 09:43:27AM +0100, Carsten Otte wrote:
> Nick Piggin wrote:
> >>Xip does only work, if both do match PAGE_SIZE because it 
> >>does'nt support multiple calls to direct_access in the get_xip_page 
> >>address space operation. Thus we check both here, actually this was 
> >>changed from how it looks after your patch as a bugfix where our 
> >>tester tried a 4k filesystem on a 2k blockdev.
> >>Did I miss something?
> >
> >However, the bdev block size may be changed with sb_set_blocksize. It
> >doesn't actually have to match the hardware sector size -- if this
> >does matter for XIP, then I think you need some other check here.
> Hmmmmhh. For a bdev with PAGE_SIZE hardsect size, there is no other 
> valid value then PAGE_SIZE that one could set it to. Or can it indeed 
> be changed to a value greater then PAGE_SIZE or smaller then hardsect 
> size?

It can't be made smaller (or larger, in current kernels). But you
already get all that checking done for you -- both by checking that
the filesystem blocksize == PAGE_SIZE, and by the error checking in
sb_set_blocksize.

After my patch, we can do XIP in a hardsect size < PAGE_SIZE block
device -- this seems to be a fine thing to do at least for the
ramdisk code. Would this situation be problematic for existing drivers,
and if so, in what way?

Thanks,
Nick


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] ext2: xip check fix
  2007-12-06  8:52               ` Nick Piggin
@ 2007-12-06  9:59                 ` Carsten Otte
  2007-12-06 10:18                   ` Nick Piggin
  0 siblings, 1 reply; 34+ messages in thread
From: Carsten Otte @ 2007-12-06  9:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: carsteno, Christian Borntraeger, Linux Kernel Mailing List,
	linux-fsdevel, Eric W. Biederman, Andrew Morton, rob, Jens Axboe

Nick Piggin wrote:
> After my patch, we can do XIP in a hardsect size < PAGE_SIZE block
> device -- this seems to be a fine thing to do at least for the
> ramdisk code. Would this situation be problematic for existing drivers,
> and if so, in what way?
I have done some archeology, and our ancient CVS logs show this check 
was introduced in early 2005 into our 2.6.x. codebase. However, it 
existed way before, and was copied from our prehistorical ext2 split 
named xip2 back in the old days of 2.4.x where we did not really have 
a block device behind because that one was scamped into the file 
system in a very queer way.
After all, I don't see any risk in removing the check. The only driver 
we have that does direct_access is drivers/s390/block/dcssblk.c, and 
that one only supports block_size == PAGE_SIZE. I think the patch 
should go into mainline.

Acked-by: Carsten Otte <cotte@de.ibm.com>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] ext2: xip check fix
  2007-12-06  9:59                 ` Carsten Otte
@ 2007-12-06 10:18                   ` Nick Piggin
  2007-12-06 10:24                     ` Carsten Otte
  0 siblings, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2007-12-06 10:18 UTC (permalink / raw)
  To: carsteno
  Cc: Christian Borntraeger, Linux Kernel Mailing List, linux-fsdevel,
	Eric W. Biederman, Andrew Morton, rob, Jens Axboe

On Thu, Dec 06, 2007 at 10:59:02AM +0100, Carsten Otte wrote:
> Nick Piggin wrote:
> >After my patch, we can do XIP in a hardsect size < PAGE_SIZE block
> >device -- this seems to be a fine thing to do at least for the
> >ramdisk code. Would this situation be problematic for existing drivers,
> >and if so, in what way?
> I have done some archeology, and our ancient CVS logs show this check 
> was introduced in early 2005 into our 2.6.x. codebase. However, it 
> existed way before, and was copied from our prehistorical ext2 split 
> named xip2 back in the old days of 2.4.x where we did not really have 
> a block device behind because that one was scamped into the file 
> system in a very queer way.

OK, thanks for taking a look at that. It will be helpful for testing
XIP with my new ramdisk driver (did you see the patch?).


> After all, I don't see any risk in removing the check. The only driver 
> we have that does direct_access is drivers/s390/block/dcssblk.c, and 
> that one only supports block_size == PAGE_SIZE. I think the patch 
> should go into mainline.

Actually another one's recently sprung up too (arch/powerpc/sysdev/axonram.c)
but it looks like that one should be fine as it looks to be all simply memory
mapped.

 
> Acked-by: Carsten Otte <cotte@de.ibm.com>

Thanks! Andrew, would you queue this up for 2.6.25 please?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] ext2: xip check fix
  2007-12-06 10:18                   ` Nick Piggin
@ 2007-12-06 10:24                     ` Carsten Otte
  2007-12-06 18:11                       ` Rob Landley
  0 siblings, 1 reply; 34+ messages in thread
From: Carsten Otte @ 2007-12-06 10:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: carsteno, Christian Borntraeger, Linux Kernel Mailing List,
	linux-fsdevel, Eric W. Biederman, Andrew Morton, rob, Jens Axboe

Nick Piggin wrote:
> OK, thanks for taking a look at that. It will be helpful for testing
> XIP with my new ramdisk driver (did you see the patch?).
I have'nt looked at it yet. I do appreciate it, I think it might 
broaden the user-base of this feature which is up to now s390 only due 
to the fact that the flash memory extensions have not been implemented 
(yet?). And it enables testing xip on other platforms. The patch is on 
my must-read list.

> Actually another one's recently sprung up too (arch/powerpc/sysdev/axonram.c)
> but it looks like that one should be fine as it looks to be all simply memory
> mapped.
Oh I know, the author's office is aproximately 500 meters away from 
mine, next building. I have'nt heared of anyone actually using that 
one lately.

cheers,
Carsten

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] ext2: xip check fix
  2007-12-06 10:24                     ` Carsten Otte
@ 2007-12-06 18:11                       ` Rob Landley
  2007-12-07  3:22                         ` Jared Hulbert
  0 siblings, 1 reply; 34+ messages in thread
From: Rob Landley @ 2007-12-06 18:11 UTC (permalink / raw)
  To: carsteno
  Cc: Nick Piggin, Christian Borntraeger, Linux Kernel Mailing List,
	linux-fsdevel, Eric W. Biederman, Andrew Morton, Jens Axboe

On Thursday 06 December 2007 04:24:20 Carsten Otte wrote:
> Nick Piggin wrote:
> > OK, thanks for taking a look at that. It will be helpful for testing
> > XIP with my new ramdisk driver (did you see the patch?).
>
> I have'nt looked at it yet. I do appreciate it, I think it might
> broaden the user-base of this feature which is up to now s390 only due
> to the fact that the flash memory extensions have not been implemented
> (yet?). And it enables testing xip on other platforms. The patch is on
> my must-read list.

query: which feature is currently s390 only?  (Execute In Place?)

Rob
-- 
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] ext2: xip check fix
  2007-12-06 18:11                       ` Rob Landley
@ 2007-12-07  3:22                         ` Jared Hulbert
  2007-12-07  4:17                           ` Rob Landley
  2007-12-07  8:59                           ` Carsten Otte
  0 siblings, 2 replies; 34+ messages in thread
From: Jared Hulbert @ 2007-12-07  3:22 UTC (permalink / raw)
  To: Rob Landley
  Cc: carsteno, Nick Piggin, Christian Borntraeger,
	Linux Kernel Mailing List, linux-fsdevel, Eric W. Biederman,
	Andrew Morton, Jens Axboe

> > I have'nt looked at it yet. I do appreciate it, I think it might
> > broaden the user-base of this feature which is up to now s390 only due
> > to the fact that the flash memory extensions have not been implemented
> > (yet?). And it enables testing xip on other platforms. The patch is on
> > my must-read list.
>
> query: which feature is currently s390 only?  (Execute In Place?)

I think so.  The filemap_xip.c functionality doesn't work for Flash
memory yet.  Flash memory doesn't have struct pages to back it up with
which this stuff depends on.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] ext2: xip check fix
  2007-12-07  3:22                         ` Jared Hulbert
@ 2007-12-07  4:17                           ` Rob Landley
  2007-12-07  4:23                             ` Nick Piggin
  2007-12-07  4:40                             ` Jared Hulbert
  2007-12-07  8:59                           ` Carsten Otte
  1 sibling, 2 replies; 34+ messages in thread
From: Rob Landley @ 2007-12-07  4:17 UTC (permalink / raw)
  To: Jared Hulbert
  Cc: carsteno, Nick Piggin, Christian Borntraeger,
	Linux Kernel Mailing List, linux-fsdevel, Eric W. Biederman,
	Andrew Morton, Jens Axboe

On Thursday 06 December 2007 21:22:25 Jared Hulbert wrote:
> > > I have'nt looked at it yet. I do appreciate it, I think it might
> > > broaden the user-base of this feature which is up to now s390 only due
> > > to the fact that the flash memory extensions have not been implemented
> > > (yet?). And it enables testing xip on other platforms. The patch is on
> > > my must-read list.
> >
> > query: which feature is currently s390 only?  (Execute In Place?)
>
> I think so.  The filemap_xip.c functionality doesn't work for Flash
> memory yet.  Flash memory doesn't have struct pages to back it up with
> which this stuff depends on.

Um, trying to clarify:  S390.  Also known as zSeries, big iron machine, uses 
its own weird processor design rather than x86, x86-64, arm, or mips 
processors.

How does "struct page" enter into this?

What I want to know is, are you saying execute in place doesn't work on things 
like arm and mips?  (I so, I was unaware of this.  I heard about somebody 
getting it to work on a Nintendo DS: 
http://forums.maxconsole.net/showthread.php?t=18668 )

Rob
-- 
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] ext2: xip check fix
  2007-12-07  4:17                           ` Rob Landley
@ 2007-12-07  4:23                             ` Nick Piggin
  2007-12-07  4:40                             ` Jared Hulbert
  1 sibling, 0 replies; 34+ messages in thread
From: Nick Piggin @ 2007-12-07  4:23 UTC (permalink / raw)
  To: Rob Landley
  Cc: Jared Hulbert, carsteno, Christian Borntraeger,
	Linux Kernel Mailing List, linux-fsdevel, Eric W. Biederman,
	Andrew Morton, Jens Axboe

On Thu, Dec 06, 2007 at 10:17:39PM -0600, Rob Landley wrote:
> On Thursday 06 December 2007 21:22:25 Jared Hulbert wrote:
> > > > I have'nt looked at it yet. I do appreciate it, I think it might
> > > > broaden the user-base of this feature which is up to now s390 only due
> > > > to the fact that the flash memory extensions have not been implemented
> > > > (yet?). And it enables testing xip on other platforms. The patch is on
> > > > my must-read list.
> > >
> > > query: which feature is currently s390 only?  (Execute In Place?)
> >
> > I think so.  The filemap_xip.c functionality doesn't work for Flash
> > memory yet.  Flash memory doesn't have struct pages to back it up with
> > which this stuff depends on.
> 
> Um, trying to clarify:  S390.  Also known as zSeries, big iron machine, uses 
> its own weird processor design rather than x86, x86-64, arm, or mips 
> processors.
> 
> How does "struct page" enter into this?
> 
> What I want to know is, are you saying execute in place doesn't work on things 
> like arm and mips?  (I so, I was unaware of this.  I heard about somebody 
> getting it to work on a Nintendo DS: 
> http://forums.maxconsole.net/showthread.php?t=18668 )

It's just that the only device driver with XIP support for the moment is
s390 (and an obscure powerpc one).

Now with my ramdisk patch, anybody can use it. That's just using regular
RAM, but there is nothing preventing XIP backed by more exotic storage,
provided the CPU can natively address and execute from it.
 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] ext2: xip check fix
  2007-12-07  4:17                           ` Rob Landley
  2007-12-07  4:23                             ` Nick Piggin
@ 2007-12-07  4:40                             ` Jared Hulbert
  1 sibling, 0 replies; 34+ messages in thread
From: Jared Hulbert @ 2007-12-07  4:40 UTC (permalink / raw)
  To: Rob Landley
  Cc: carsteno, Nick Piggin, Christian Borntraeger,
	Linux Kernel Mailing List, linux-fsdevel, Eric W. Biederman,
	Andrew Morton, Jens Axboe

> Um, trying to clarify:  S390.  Also known as zSeries, big iron machine, uses
> its own weird processor design rather than x86, x86-64, arm, or mips
> processors.

Right.  filemap_xip.c allows for an XIP filesystem.  The only
filesystem that is supported is ext2.  Even that requires a block
device driver thingy, which I don't understand, that's specific to the
s390.

> How does "struct page" enter into this?

Don't sweat it, it has to do with the way filemap_xip.c works.

> What I want to know is, are you saying execute in place doesn't work on things
> like arm and mips?  (I so, I was unaware of this.  I heard about somebody
> getting it to work on a Nintendo DS:
> http://forums.maxconsole.net/showthread.php?t=18668 )

XIP works fine on things like arm and mips.  However there is mixed
support in the mainline kernel for it.  For example, you can build an
XiP kernel image for arm since like 2.6.10 or 12.  Also MTD has an XiP
aware mode that protects XiP objects in flash from get screwed up
during programs and erases.  But there is no mainlined solution for
XiP of applications from the filesystem.  However there have been
patches for cramfs to do this for years.  They are kind of messy and
keep getting rejected.  I do have a solution in the works for this
part of it - http://axfs.sf.net.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] ext2: xip check fix
  2007-12-07  3:22                         ` Jared Hulbert
  2007-12-07  4:17                           ` Rob Landley
@ 2007-12-07  8:59                           ` Carsten Otte
  2007-12-07  9:52                             ` Jared Hulbert
  1 sibling, 1 reply; 34+ messages in thread
From: Carsten Otte @ 2007-12-07  8:59 UTC (permalink / raw)
  To: Jared Hulbert
  Cc: Rob Landley, carsteno, Nick Piggin, Christian Borntraeger,
	Linux Kernel Mailing List, linux-fsdevel, Eric W. Biederman,
	Andrew Morton, Jens Axboe

Jared Hulbert wrote:
> I think so.  The filemap_xip.c functionality doesn't work for Flash
> memory yet.  Flash memory doesn't have struct pages to back it up with
> which this stuff depends on.
Struct page is not the major issue. The primary problem is writing to 
the media (and I am not a flash expert at all, just relaying here): 
For some period of time, the flash memory is not usable and thus we 
need to make sure we can nuke the page table entries that we have in 
userland page tables. For that, we need a callback from the device so 
that it can ask to get its references back. Oh, and a put_xip_page 
counterpart to get_xip_page, so that the driver knows when it's safe 
to erase.

cheers,
Carsten

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] ext2: xip check fix
  2007-12-07  8:59                           ` Carsten Otte
@ 2007-12-07  9:52                             ` Jared Hulbert
  0 siblings, 0 replies; 34+ messages in thread
From: Jared Hulbert @ 2007-12-07  9:52 UTC (permalink / raw)
  To: carsteno
  Cc: Rob Landley, Nick Piggin, Christian Borntraeger,
	Linux Kernel Mailing List, linux-fsdevel, Eric W. Biederman,
	Andrew Morton, Jens Axboe

> > I think so.  The filemap_xip.c functionality doesn't work for Flash
> > memory yet.  Flash memory doesn't have struct pages to back it up with
> > which this stuff depends on.
>
> Struct page is not the major issue. The primary problem is writing to
> the media (and I am not a flash expert at all, just relaying here):
> For some period of time, the flash memory is not usable and thus we
> need to make sure we can nuke the page table entries that we have in
> userland page tables. For that, we need a callback from the device so
> that it can ask to get its references back. Oh, and a put_xip_page
> counterpart to get_xip_page, so that the driver knows when it's safe
> to erase.

Well... That's the biggest/hardest problem, yes.  But not the first.
First we got to tackle the easy read only case, which doesn't require
any of that unpleasantness, yet which is used in a bunch of out of
tree hacks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] mm: fix XIP file writes
  2007-12-04 13:00           ` [patch] mm: fix XIP file writes Nick Piggin
@ 2007-12-10 14:38             ` Christian Borntraeger
  2007-12-12  4:03               ` Nick Piggin
  0 siblings, 1 reply; 34+ messages in thread
From: Christian Borntraeger @ 2007-12-10 14:38 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linux Kernel Mailing List, linux-fsdevel,
	Eric W. Biederman, rob, Jens Axboe, cotte

Hi Nick,

> Here we go. See, brd already found a bug ;)
> Can you apply the ext2 XIP patch too? And I'll resend the brd XIP patch.
[...]
> Writing to XIP files at a non-page-aligned offset results in data corruption
> because the writes were always sent to the start of the page.
[...]
> @@ -314,7 +314,7 @@ __xip_file_write(struct file *filp, cons
>  		fault_in_pages_readable(buf, bytes);
>  		kaddr = kmap_atomic(page, KM_USER0);
>  		copied = bytes -
> -			__copy_from_user_inatomic_nocache(kaddr, buf, bytes);
> +			__copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
>  		kunmap_atomic(kaddr, KM_USER0);
>  		flush_dcache_page(page);

I asked myself why this problem never happened before. So I asked our testers
to reproduce this problem on 2.6.23 and service levels. As the testcase did
not trigger, I looked into the 2.6.23 code. This problem was introduced by
commit 4a9e5ef1f4f15205e477817a5cefc34bd3f65f55 (mm: write iovec cleanup from
Nick Piggin) during 2.6.24-rc:
--------snip-------
-		copied = filemap_copy_from_user(page, offset, buf, bytes);
[...]
+		copied = bytes -
+			__copy_from_user_inatomic_nocache(kaddr, buf, bytes);
-------------------

So yes, its good to have xip on brd. It even tests your changes ;-)
Good news is, that we dont need anything for stable.

Christian

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] mm: fix XIP file writes
  2007-12-10 14:38             ` Christian Borntraeger
@ 2007-12-12  4:03               ` Nick Piggin
  0 siblings, 0 replies; 34+ messages in thread
From: Nick Piggin @ 2007-12-12  4:03 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Andrew Morton, Linux Kernel Mailing List, linux-fsdevel,
	Eric W. Biederman, rob, Jens Axboe, cotte

On Mon, Dec 10, 2007 at 03:38:20PM +0100, Christian Borntraeger wrote:
> Hi Nick,
> 
> > Here we go. See, brd already found a bug ;)
> > Can you apply the ext2 XIP patch too? And I'll resend the brd XIP patch.
> [...]
> > Writing to XIP files at a non-page-aligned offset results in data corruption
> > because the writes were always sent to the start of the page.
> [...]
> > @@ -314,7 +314,7 @@ __xip_file_write(struct file *filp, cons
> >  		fault_in_pages_readable(buf, bytes);
> >  		kaddr = kmap_atomic(page, KM_USER0);
> >  		copied = bytes -
> > -			__copy_from_user_inatomic_nocache(kaddr, buf, bytes);
> > +			__copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
> >  		kunmap_atomic(kaddr, KM_USER0);
> >  		flush_dcache_page(page);
> 
> I asked myself why this problem never happened before. So I asked our testers
> to reproduce this problem on 2.6.23 and service levels. As the testcase did
> not trigger, I looked into the 2.6.23 code. This problem was introduced by
> commit 4a9e5ef1f4f15205e477817a5cefc34bd3f65f55 (mm: write iovec cleanup from
> Nick Piggin) during 2.6.24-rc:
> --------snip-------
> -		copied = filemap_copy_from_user(page, offset, buf, bytes);
> [...]
> +		copied = bytes -
> +			__copy_from_user_inatomic_nocache(kaddr, buf, bytes);
> -------------------
> 
> So yes, its good to have xip on brd. It even tests your changes ;-)
> Good news is, that we dont need anything for stable.

Heh ;) that explains a lot. Thanks!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] rewrite rd
  2007-12-04  4:26 [patch] rewrite rd Nick Piggin
                   ` (2 preceding siblings ...)
  2007-12-04  9:54 ` Christian Borntraeger
@ 2008-01-14 16:47 ` Matthew Wilcox
  2008-01-14 17:21   ` Jens Axboe
  3 siblings, 1 reply; 34+ messages in thread
From: Matthew Wilcox @ 2008-01-14 16:47 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Kernel Mailing List, linux-fsdevel, Christian Borntraeger,
	Eric W. Biederman, Andrew Morton, rob, Jens Axboe

On Tue, Dec 04, 2007 at 05:26:28AM +0100, Nick Piggin wrote:
> +static void copy_to_brd(struct brd_device *brd, const void *src,
> +			sector_t sector, size_t n)
> +{
> +	struct page *page;
> +	void *dst;
> +	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
> +	size_t copy;
> +
> +	copy = min((unsigned long)n, PAGE_SIZE - offset);
> +	page = brd_lookup_page(brd, sector);
> +	BUG_ON(!page);
> +
> +	dst = kmap_atomic(page, KM_USER1);
> +	memcpy(dst + offset, src, copy);
> +	kunmap_atomic(dst, KM_USER1);

You're using kmap_atomic, but I see no reason you can't be preempted.
Don't you need to at least disable preemption while you have stuff
atomically kmapped?

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] rewrite rd
  2008-01-14 16:47 ` [patch] rewrite rd Matthew Wilcox
@ 2008-01-14 17:21   ` Jens Axboe
  0 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2008-01-14 17:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nick Piggin, Linux Kernel Mailing List, linux-fsdevel,
	Christian Borntraeger, Eric W. Biederman, Andrew Morton, rob

On Mon, Jan 14 2008, Matthew Wilcox wrote:
> On Tue, Dec 04, 2007 at 05:26:28AM +0100, Nick Piggin wrote:
> > +static void copy_to_brd(struct brd_device *brd, const void *src,
> > +			sector_t sector, size_t n)
> > +{
> > +	struct page *page;
> > +	void *dst;
> > +	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
> > +	size_t copy;
> > +
> > +	copy = min((unsigned long)n, PAGE_SIZE - offset);
> > +	page = brd_lookup_page(brd, sector);
> > +	BUG_ON(!page);
> > +
> > +	dst = kmap_atomic(page, KM_USER1);
> > +	memcpy(dst + offset, src, copy);
> > +	kunmap_atomic(dst, KM_USER1);
> 
> You're using kmap_atomic, but I see no reason you can't be preempted.
> Don't you need to at least disable preemption while you have stuff
> atomically kmapped?

kmap_atomic() disables preemption through pagefault_disable().

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2008-01-14 17:21 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-12-04  4:26 [patch] rewrite rd Nick Piggin
2007-12-04  6:29 ` Andrew Morton
2007-12-04  7:01   ` Nick Piggin
2007-12-04  7:08     ` Nick Piggin
2007-12-04  7:55 ` Rob Landley
2007-12-04  9:29   ` Nick Piggin
2007-12-04 19:53     ` Rob Landley
2007-12-04  9:54 ` Christian Borntraeger
2007-12-04 10:10   ` Nick Piggin
2007-12-04 11:21     ` [patch] rd: support XIP Nick Piggin
2007-12-04 11:23       ` [patch] ext2: xip check fix Nick Piggin
2007-12-05 15:43         ` Carsten Otte
2007-12-05 23:33           ` Nick Piggin
2007-12-06  8:43             ` Carsten Otte
2007-12-06  8:52               ` Nick Piggin
2007-12-06  9:59                 ` Carsten Otte
2007-12-06 10:18                   ` Nick Piggin
2007-12-06 10:24                     ` Carsten Otte
2007-12-06 18:11                       ` Rob Landley
2007-12-07  3:22                         ` Jared Hulbert
2007-12-07  4:17                           ` Rob Landley
2007-12-07  4:23                             ` Nick Piggin
2007-12-07  4:40                             ` Jared Hulbert
2007-12-07  8:59                           ` Carsten Otte
2007-12-07  9:52                             ` Jared Hulbert
2007-12-04 11:26       ` [patch] rd: support XIP Andrew Morton
2007-12-04 11:35         ` Nick Piggin
2007-12-04 13:00           ` [patch] mm: fix XIP file writes Nick Piggin
2007-12-10 14:38             ` Christian Borntraeger
2007-12-12  4:03               ` Nick Piggin
2007-12-04 12:06       ` [patch] rd: support XIP Duane Griffin
2007-12-04 13:03         ` [patch] rd: support XIP (updated) Nick Piggin
2008-01-14 16:47 ` [patch] rewrite rd Matthew Wilcox
2008-01-14 17:21   ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).