LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
@ 2004-05-06 13:17 Jörn Engel
  2004-05-06 13:18 ` [PATCH COW] generic_sendpage Jörn Engel
                   ` (8 more replies)
  0 siblings, 9 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-06 13:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jamie Lokier, Pavel Machek, Eric W. Biederman, Denis Vlasenko

Couldn't sleep last night and finished a first complete version of
cowlinks, code-named MAD COW.  It is still based on the stupid old
design with a flag to distinguish between regular hard links and
cowlinks.  Please don't comment on that design, it's just a proof of
concept.

Patches are against 2.6.5 but most things should apply to other 2.6
kernel without too much trouble.

1 generic_sendpage	- allow sendfile with ext[23] files as target
2 sendfile		- introduce vfs_sendfile for in-kernel use
3 copyfile		- new copyfile() system call
4 lock_flags		- old cruft, just ignore it
5 madcow		- the MAD COW itself

Patches 1-3 will stay, 4 will be remove and 5 replaced by a better
design over time.  I've also set up a webpage for it:
http://wohnheim.fh-wedel.de/~joern/cowlink/

Maybe that should be converted into a wiki so someone with better
taste than myself can improve it.

Jörn

-- 
This above all: to thine own self be true.
-- Shakespeare

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH COW] generic_sendpage
  2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
@ 2004-05-06 13:18 ` Jörn Engel
  2004-05-06 13:19 ` [PATCH COW] sendfile Jörn Engel
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-06 13:18 UTC (permalink / raw)
  To: linux-kernel

Patch 1.

Jörn

-- 
When you close your hand, you own nothing. When you open it up, you
own the whole world.
-- Li Mu Bai in Tiger & Dragon

o Add sendfile() support for file targets to normal mm/filemap.c.
o Have ext[23] use that support.

 fs/ext2/file.c     |    1 
 fs/ext3/file.c     |    1 
 include/linux/fs.h |    1 
 mm/filemap.c       |  216 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 219 insertions(+)


--- linux-2.6.5cow/include/linux/fs.h~generic_sendpage	2004-04-14 16:22:24.000000000 +0200
+++ linux-2.6.5cow/include/linux/fs.h	2004-04-27 16:34:32.000000000 +0200
@@ -1346,6 +1346,7 @@
 ssize_t generic_file_write_nolock(struct file *file, const struct iovec *iov,
 				unsigned long nr_segs, loff_t *ppos);
 extern ssize_t generic_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void __user *);
+extern ssize_t generic_file_sendpage(struct file *, struct page *, int, size_t, loff_t *, int);
 extern void do_generic_mapping_read(struct address_space *mapping,
 				    struct file_ra_state *, struct file *,
 				    loff_t *, read_descriptor_t *, read_actor_t);
--- linux-2.6.5cow/mm/filemap.c~generic_sendpage	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/mm/filemap.c	2004-04-20 16:20:21.000000000 +0200
@@ -896,6 +896,31 @@
 	return written;
 }
 
+/* FIXME: It would be as simple as this, if we had a (void __user*) to write.
+ * We already have a kernel buffer, so it should be even simpler, right? ;)
+ *
+ * Yes, sorta.  After duplicating the complete path of generic_file_write(),
+ * at least some special cases could be removed, so the copy is simpler than
+ * the original.  But it remains a copy, so overall complexity increases.
+ */
+static ssize_t
+generic_kernel_file_write(struct file *, const char *, size_t, loff_t *);
+
+ssize_t generic_file_sendpage(struct file *file, struct page *page,
+		int offset, size_t size, loff_t *ppos, int more)
+{
+	ssize_t ret;
+	char *kaddr;
+
+	kaddr = kmap(page);
+	ret = generic_kernel_file_write(file, kaddr + offset, size, ppos);
+	kunmap(page);
+
+	return ret;
+}
+
+EXPORT_SYMBOL(generic_file_sendpage);
+
 ssize_t generic_file_sendfile(struct file *in_file, loff_t *ppos,
 			 size_t count, read_actor_t actor, void __user *target)
 {
@@ -1554,6 +1579,19 @@
 	return bytes - left;
 }
 
+static inline size_t
+filemap_copy_from_kernel(struct page *page, unsigned long offset,
+			 const char *buf, unsigned bytes)
+{
+	char *kaddr;
+
+	kaddr = kmap(page);
+	memcpy(kaddr + offset, buf, bytes);
+	kunmap(page);
+
+	return bytes;
+}
+
 static size_t
 __filemap_copy_from_user_iovec(char *vaddr, 
 			const struct iovec *iov, size_t base, size_t bytes)
@@ -1910,6 +1948,155 @@
 
 EXPORT_SYMBOL(generic_file_aio_write_nolock);
 
+/*
+ * TODO:
+ * This largely tries to copy generic_file_aio_write_nolock(), although it
+ * doesn't have to be nearly as generic.  A real cleanup should either
+ * merge this into generic_file_aio_write_nolock() as well or keep it special
+ * and remove as much code as possible.
+ */
+static ssize_t
+generic_kernel_file_aio_write_nolock(struct kiocb *iocb, const struct iovec*iov,
+				     unsigned long nr_segs, loff_t *ppos)
+{
+	struct file *file = iocb->ki_filp;
+	struct address_space * mapping = file->f_mapping;
+	struct address_space_operations *a_ops = mapping->a_ops;
+	size_t ocount;		/* original count */
+	size_t count;		/* after file limit checks */
+	struct inode 	*inode = mapping->host;
+	long		status = 0;
+	loff_t		pos;
+	struct page	*page;
+	struct page	*cached_page = NULL;
+	const int	isblk = S_ISBLK(inode->i_mode);
+	ssize_t		written;
+	ssize_t		err;
+	size_t		bytes;
+	struct pagevec	lru_pvec;
+	const struct iovec *cur_iov = iov; /* current iovec */
+	size_t		iov_base = 0;	   /* offset in the current iovec */
+	unsigned long	seg;
+	char		*buf;
+
+	ocount = 0;
+	for (seg = 0; seg < nr_segs; seg++) {
+		const struct iovec *iv = &iov[seg];
+
+		/*
+		 * If any segment has a negative length, or the cumulative
+		 * length ever wraps negative then return -EINVAL.
+		 */
+		ocount += iv->iov_len;
+		if (unlikely((ssize_t)(ocount|iv->iov_len) < 0))
+			return -EINVAL;
+	}
+
+	count = ocount;
+	pos = *ppos;
+	pagevec_init(&lru_pvec, 0);
+
+	/* We can write back this queue in page reclaim */
+	current->backing_dev_info = mapping->backing_dev_info;
+	written = 0;
+
+	err = generic_write_checks(file, &pos, &count, isblk);
+	if (err)
+		goto out;
+
+
+	if (count == 0)
+		goto out;
+
+	remove_suid(file->f_dentry);
+	inode_update_time(inode, 1);
+
+	/* There is no sane reason to use O_DIRECT */
+	BUG_ON(file->f_flags & O_DIRECT);
+
+	buf = (char *)iov->iov_base;
+	do {
+		unsigned long index;
+		unsigned long offset;
+		size_t copied;
+
+		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+		index = pos >> PAGE_CACHE_SHIFT;
+		bytes = PAGE_CACHE_SIZE - offset;
+		if (bytes > count)
+			bytes = count;
+
+		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
+		if (!page) {
+			status = -ENOMEM;
+			break;
+		}
+
+		status = a_ops->prepare_write(file, page, offset, offset+bytes);
+		if (unlikely(status)) {
+			loff_t isize = i_size_read(inode);
+			/*
+			 * prepare_write() may have instantiated a few blocks
+			 * outside i_size.  Trim these off again.
+			 */
+			unlock_page(page);
+			page_cache_release(page);
+			if (pos + bytes > isize)
+				vmtruncate(inode, isize);
+			break;
+		}
+
+		BUG_ON(nr_segs != 1);
+		copied = filemap_copy_from_kernel(page, offset, buf, bytes);
+
+		flush_dcache_page(page);
+		status = a_ops->commit_write(file, page, offset, offset+bytes);
+		if (likely(copied > 0)) {
+			if (!status)
+				status = copied;
+
+			if (status >= 0) {
+				written += status;
+				count -= status;
+				pos += status;
+				buf += status;
+				if (unlikely(nr_segs > 1))
+					filemap_set_next_iovec(&cur_iov,
+							&iov_base, status);
+			}
+		}
+		if (unlikely(copied != bytes))
+			if (status >= 0)
+				status = -EFAULT;
+		unlock_page(page);
+		mark_page_accessed(page);
+		page_cache_release(page);
+		if (status < 0)
+			break;
+		balance_dirty_pages_ratelimited(mapping);
+		cond_resched();
+	} while (count);
+	*ppos = pos;
+
+	if (cached_page)
+		page_cache_release(cached_page);
+
+	/*
+	 * For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
+	 */
+	if (status >= 0) {
+		if ((file->f_flags & O_SYNC) || IS_SYNC(inode))
+			status = generic_osync_inode(inode, mapping,
+					OSYNC_METADATA|OSYNC_DATA);
+	}
+	
+	err = written ? written : status;
+out:
+	pagevec_lru_add(&lru_pvec);
+	current->backing_dev_info = 0;
+	return err;
+}
+
 ssize_t
 generic_file_write_nolock(struct file *file, const struct iovec *iov,
 				unsigned long nr_segs, loff_t *ppos)
@@ -1924,6 +2111,20 @@
 	return ret;
 }
 
+static ssize_t
+generic_kernel_file_write_nolock(struct file *file, const struct iovec *iov,
+				 unsigned long nr_segs, loff_t *ppos)
+{
+	struct kiocb kiocb;
+	ssize_t ret;
+
+	init_sync_kiocb(&kiocb, file);
+	ret = generic_kernel_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos);
+	if (ret == -EIOCBQUEUED)
+		ret = wait_on_sync_kiocb(&kiocb);
+	return ret;
+}
+
 EXPORT_SYMBOL(generic_file_write_nolock);
 
 ssize_t generic_file_aio_write(struct kiocb *iocb, const char __user *buf,
@@ -1962,6 +2163,21 @@
 
 EXPORT_SYMBOL(generic_file_write);
 
+static ssize_t generic_kernel_file_write(struct file *file, const char *buf,
+					 size_t count, loff_t *ppos)
+{
+	struct inode	*inode = file->f_mapping->host;
+	ssize_t		err;
+	struct iovec local_iov = {.iov_base = (void __user *)buf,
+				  .iov_len = count };
+
+	down(&inode->i_sem);
+	err = generic_kernel_file_write_nolock(file, &local_iov, 1, ppos);
+	up(&inode->i_sem);
+
+	return err;
+}
+
 ssize_t generic_file_readv(struct file *filp, const struct iovec *iov,
 			unsigned long nr_segs, loff_t *ppos)
 {
--- linux-2.6.5cow/fs/ext2/file.c~generic_sendpage	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/ext2/file.c	2004-04-14 16:22:24.000000000 +0200
@@ -53,6 +53,7 @@
 	.readv		= generic_file_readv,
 	.writev		= generic_file_writev,
 	.sendfile	= generic_file_sendfile,
+	.sendpage	= generic_file_sendpage,
 };
 
 struct inode_operations ext2_file_inode_operations = {
--- linux-2.6.5cow/fs/ext3/file.c~generic_sendpage	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/ext3/file.c	2004-04-14 16:22:24.000000000 +0200
@@ -127,6 +127,7 @@
 	.release	= ext3_release_file,
 	.fsync		= ext3_sync_file,
 	.sendfile	= generic_file_sendfile,
+	.sendpage	= generic_file_sendpage,
 };
 
 struct inode_operations ext3_file_inode_operations = {

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH COW] sendfile
  2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
  2004-05-06 13:18 ` [PATCH COW] generic_sendpage Jörn Engel
@ 2004-05-06 13:19 ` Jörn Engel
  2004-05-06 13:19 ` [PATCH COW] copyfile Jörn Engel
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-06 13:19 UTC (permalink / raw)
  To: linux-kernel

Patch 2

Jörn

-- 
Simplicity is prerequisite for reliability.
-- Edsger W. Dijkstra

Creates vfs_sendfile(), which can be called from other places within
the kernel.  Such other places include copyfile() and cowlinks.

In principle, this just removes code from do_sendfile() to
vfs_sendfile().  On top of that, it adds a check to out_inode,
identical to the one on in_inode.  True, the check for out_inode was
never needed, maybe that tells you something about the check to
in_inode as well. ;)

 fs/read_write.c          |   99 +++++++++++++++++++++++++++--------------------
 include/linux/fs.h       |    1 
 include/linux/syscalls.h |    2 
 3 files changed, 60 insertions(+), 42 deletions(-)

--- linux-2.6.5cow/fs/read_write.c~copyfile	2004-04-06 18:50:59.000000000 +0200
+++ linux-2.6.5cow/fs/read_write.c	2004-04-25 03:21:41.000000000 +0200
@@ -537,12 +537,66 @@
 	return ret;
 }
 
+ssize_t vfs_sendfile(struct file *out_file, struct file *in_file, loff_t *ppos,
+		     size_t count, loff_t max)
+{
+	struct inode * in_inode, * out_inode;
+	loff_t pos;
+	ssize_t ret;
+
+	/* verify in_file */
+	in_inode = in_file->f_dentry->d_inode;
+	if (!in_inode)
+		return -EINVAL;
+	if (!in_file->f_op || !in_file->f_op->sendfile)
+		return -EINVAL;
+
+	if (!ppos)
+		ppos = &in_file->f_pos;
+	ret = locks_verify_area(FLOCK_VERIFY_READ, in_inode, in_file, *ppos, count);
+	if (ret)
+		return ret;
+
+	/* verify out_file */
+	out_inode = out_file->f_dentry->d_inode;
+	if (!out_inode)
+		return -EINVAL;
+	if (!out_file->f_op || !out_file->f_op->sendpage)
+		return -EINVAL;
+
+	ret = locks_verify_area(FLOCK_VERIFY_WRITE, out_inode, out_file, out_file->f_pos, count);
+	if (ret)
+		return ret;
+
+	ret = security_file_permission (out_file, MAY_WRITE);
+	if (ret)
+		return ret;
+
+	if (!max)
+		max = min(in_inode->i_sb->s_maxbytes, out_inode->i_sb->s_maxbytes);
+
+	pos = *ppos;
+	if (unlikely(pos < 0))
+		return -EINVAL;
+	if (unlikely(pos + count > max)) {
+		if (pos >= max)
+			return -EOVERFLOW;
+		count = max - pos;
+	}
+
+	ret = in_file->f_op->sendfile(in_file, ppos, count, file_send_actor, out_file);
+
+	if (*ppos > max)
+		return -EOVERFLOW;
+	return ret;
+}
+
+EXPORT_SYMBOL(vfs_sendfile);
+
 static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 			   size_t count, loff_t max)
 {
 	struct file * in_file, * out_file;
-	struct inode * in_inode, * out_inode;
-	loff_t pos;
 	ssize_t retval;
 	int fput_needed_in, fput_needed_out;
 
@@ -555,17 +609,6 @@
 		goto out;
 	if (!(in_file->f_mode & FMODE_READ))
 		goto fput_in;
-	retval = -EINVAL;
-	in_inode = in_file->f_dentry->d_inode;
-	if (!in_inode)
-		goto fput_in;
-	if (!in_file->f_op || !in_file->f_op->sendfile)
-		goto fput_in;
-	if (!ppos)
-		ppos = &in_file->f_pos;
-	retval = locks_verify_area(FLOCK_VERIFY_READ, in_inode, in_file, *ppos, count);
-	if (retval)
-		goto fput_in;
 
 	retval = security_file_permission (in_file, MAY_READ);
 	if (retval)
@@ -580,36 +623,8 @@
 		goto fput_in;
 	if (!(out_file->f_mode & FMODE_WRITE))
 		goto fput_out;
-	retval = -EINVAL;
-	if (!out_file->f_op || !out_file->f_op->sendpage)
-		goto fput_out;
-	out_inode = out_file->f_dentry->d_inode;
-	retval = locks_verify_area(FLOCK_VERIFY_WRITE, out_inode, out_file, out_file->f_pos, count);
-	if (retval)
-		goto fput_out;
-
-	retval = security_file_permission (out_file, MAY_WRITE);
-	if (retval)
-		goto fput_out;
-
-	if (!max)
-		max = min(in_inode->i_sb->s_maxbytes, out_inode->i_sb->s_maxbytes);
-
-	pos = *ppos;
-	retval = -EINVAL;
-	if (unlikely(pos < 0))
-		goto fput_out;
-	if (unlikely(pos + count > max)) {
-		retval = -EOVERFLOW;
-		if (pos >= max)
-			goto fput_out;
-		count = max - pos;
-	}
-
-	retval = in_file->f_op->sendfile(in_file, ppos, count, file_send_actor, out_file);
 
-	if (*ppos > max)
-		retval = -EOVERFLOW;
+	retval = vfs_sendfile(out_file, in_file, ppos, count, max);
 
 fput_out:
 	fput_light(out_file, fput_needed_out);
--- linux-2.6.5cow/include/linux/fs.h~copyfile	2004-04-20 16:20:21.000000000 +0200
+++ linux-2.6.5cow/include/linux/fs.h	2004-04-27 16:34:23.000000000 +0200
@@ -877,6 +877,7 @@
 		unsigned long, loff_t *);
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
+ssize_t vfs_sendfile(struct file *, struct file *, loff_t *, size_t, loff_t);
 
 /*
  * NOTE: write_inode, delete_inode, clear_inode, put_inode can be called
--- linux-2.6.5cow/include/linux/syscalls.h~copyfile	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/include/linux/syscalls.h	2004-04-27 16:34:23.000000000 +0200
@@ -278,6 +278,8 @@
 asmlinkage long sys_unlink(const char __user *pathname);
 asmlinkage long sys_rename(const char __user *oldname,
 				const char __user *newname);
+asmlinkage long sys_copyfile(const char __user *from, const char __user *to,
+				umode_t mode);
 asmlinkage long sys_chmod(const char __user *filename, mode_t mode);
 asmlinkage long sys_fchmod(unsigned int fd, mode_t mode);
 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH COW] copyfile
  2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
  2004-05-06 13:18 ` [PATCH COW] generic_sendpage Jörn Engel
  2004-05-06 13:19 ` [PATCH COW] sendfile Jörn Engel
@ 2004-05-06 13:19 ` Jörn Engel
  2004-05-06 13:20 ` [PATCH COW] lock_flags Jörn Engel
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-06 13:19 UTC (permalink / raw)
  To: linux-kernel

Patch 3

Jörn

-- 
More computing sins are committed in the name of efficiency (without
necessarily achieving it) than for any other single reason - including
blind stupidity.
-- W. A. Wulf 

Adds a new syscall, copyfile() which does as the name sais.

This syscall is a preparation for real cowlinks, but it might make sense on
it's own as well.  For cowlinks, the real copy is not done until the first
write (or similar operation) occurs, but at that point, both operations
should be the same.

 arch/i386/kernel/entry.S |    1 
 fs/namei.c               |  140 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 141 insertions(+)


--- linux-2.6.5cow/arch/i386/kernel/entry.S~copyfile	2004-04-27 16:34:32.000000000 +0200
+++ linux-2.6.5cow/arch/i386/kernel/entry.S	2004-04-30 08:40:04.000000000 +0200
@@ -882,5 +882,6 @@
 	.long sys_utimes
  	.long sys_fadvise64_64
 	.long sys_ni_syscall	/* sys_vserver */
+	.long sys_copyfile
 
 syscall_table_size=(.-sys_call_table)
--- linux-2.6.5cow/fs/namei.c~copyfile	2004-04-27 16:48:55.000000000 +0200
+++ linux-2.6.5cow/fs/namei.c	2004-05-06 01:08:53.000000000 +0200
@@ -2139,6 +2139,146 @@
 	return error;
 }
 
+/* note how we open() two struct file *, just to do the sendfile().  this
+ * should not be necessary, only nfs, smbfs and cifs depend on the struct
+ * file *, while afs and all local filesystems can do without.
+ * but changing all interfaces and fixing up three filesystems is a little
+ * too much work for now.
+ */
+#include <linux/file.h> /* for fput */
+static int copy_data(struct dentry *old_dentry, struct vfsmount *old_mnt,
+		     struct dentry *new_dentry, struct vfsmount *new_mnt)
+{
+	int ret;
+	loff_t size;
+	ssize_t n;
+	struct file *old_file;
+	struct file *new_file;
+
+	old_file = dentry_open(old_dentry, old_mnt, O_RDONLY);
+	if (IS_ERR(old_file))
+		return PTR_ERR(old_file);
+
+	new_file = dentry_open(new_dentry, new_mnt, O_WRONLY);
+	ret = PTR_ERR(new_file);
+	if (IS_ERR(new_file))
+		goto out1;
+
+	size = i_size_read(old_file->f_dentry->d_inode);
+	if (((size_t)size != size) || ((ssize_t)size != size)) {
+		ret = -EFBIG;
+		goto out2;
+	}
+	n = vfs_sendfile(new_file, old_file, NULL, size, 0);
+	if (n == size)
+		ret = 0;
+	else if (n < 0)
+		ret = n;
+	else
+		ret = -ENOSPC;
+
+out2:
+	dget(new_file->f_dentry); /* fput calls dput */
+	fput(new_file);
+out1:
+	dget(old_file->f_dentry); /* fput calls dput */
+	fput(old_file);
+	return ret;
+}
+
+int vfs_copyfile(struct nameidata *old_nd, struct nameidata *new_nd,
+		 struct dentry *new_dentry, int mode)
+{
+	int ret;
+
+	if (!old_nd->dentry->d_inode)
+		return -ENOENT;
+	if (!S_ISREG(old_nd->dentry->d_inode->i_mode))
+		return -EINVAL;
+	/* FIXME: replace with proper permission check */
+	if (new_dentry->d_inode)
+		return -EEXIST;
+
+	if (mode == -1)
+		mode = old_nd->dentry->d_inode->i_mode;
+
+	ret = vfs_create(new_nd->dentry->d_inode, new_dentry, mode, new_nd);
+	if (ret)
+		return ret;
+
+	ret = copy_data(old_nd->dentry, old_nd->mnt, new_dentry, new_nd->mnt);
+
+	if (ret) {
+		int error = vfs_unlink(new_nd->dentry->d_inode, new_dentry);
+		BUG_ON(error);
+		/* FIXME: not sure if there are return value we should not BUG()
+		 * on */
+	}
+	return ret;
+}
+
+int do_copyfile(const char *from, const char *to, int mode)
+{
+	int ret;
+	struct dentry *new_dentry;
+	struct nameidata old_nd, new_nd;
+
+	ret = path_lookup(from, 0, &old_nd);
+	if (ret)
+		return ret;
+
+	ret = path_lookup(to, LOOKUP_PARENT|LOOKUP_OPEN|LOOKUP_CREATE, &new_nd);
+	if (ret)
+		goto out1;
+
+	new_dentry = lookup_create(&new_nd, 0);
+	ret = PTR_ERR(new_dentry);
+	if (IS_ERR(new_dentry))
+		goto out2;
+
+	ret = vfs_copyfile(&old_nd, &new_nd, new_dentry, mode);
+
+	dput(new_dentry);
+	up(&new_nd.dentry->d_inode->i_sem);
+out2:
+	path_release(&new_nd);
+out1:
+	path_release(&old_nd);
+	return ret;
+}
+
+/* copyfile() - create a new file and copy the old one's contents into it, all
+ * inside the kernel
+ *
+ * TODO:
+ * quota
+ * notification
+ * security
+ * unlock directory during data copy
+ * loop for data copy
+ */
+int sys_copyfile(const char __user *ufrom, const char __user *uto, umode_t mode)
+{
+	char *from, *to;
+	int ret;
+
+	from = getname(ufrom);
+	if (IS_ERR(from))
+		return PTR_ERR(from);
+
+	to = getname(uto);
+	ret = PTR_ERR(to);
+	if (IS_ERR(to))
+		goto err;
+
+	ret = do_copyfile(from, to, mode);
+
+	putname(to);
+err:
+	putname(from);
+	return ret;
+}
+
 int vfs_readlink(struct dentry *dentry, char __user *buffer, int buflen, const char *link)
 {
 	int len;

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH COW] lock_flags
  2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
                   ` (2 preceding siblings ...)
  2004-05-06 13:19 ` [PATCH COW] copyfile Jörn Engel
@ 2004-05-06 13:20 ` Jörn Engel
  2004-05-06 13:21 ` [PATCH COW] MAD COW Jörn Engel
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-06 13:20 UTC (permalink / raw)
  To: linux-kernel

Patch 4 - just for completeness

Jörn

-- 
Optimizations always bust things, because all optimizations are, in
the long haul, a form of cheating, and cheaters eventually get caught.
-- Larry Wall 

spins on i_lock before accessing i_flags

Omissions:
o fs/befs/linuxvfs.c	inode->i_flags should be befs_ino->i_flags
o fs/dquot.c		vfs_quota_on() is racy.  the race is now limited to
			S_NOATIME, S_QUOTA and S_NOQUOTA, but it still exists.

 drivers/usb/core/inode.c |    2 ++
 fs/afs/inode.c           |    2 ++
 fs/dquot.c               |   21 ++++++++++++++++++---
 fs/ext2/ialloc.c         |    2 ++
 fs/ext2/inode.c          |    2 ++
 fs/ext3/ialloc.c         |    2 ++
 fs/ext3/inode.c          |    2 ++
 fs/fat/inode.c           |    5 ++++-
 fs/hfs/inode.c           |    2 +-
 fs/hfsplus/catalog.c     |    4 ++--
 fs/hfsplus/dir.c         |   10 ++++++++--
 fs/hfsplus/inode.c       |    6 +++++-
 fs/hfsplus/ioctl.c       |    2 ++
 fs/intermezzo/vfs.c      |    7 ++++++-
 fs/namei.c               |   10 ++++++++--
 fs/nfs/inode.c           |    2 ++
 fs/proc/base.c           |    4 ++++
 fs/reiserfs/inode.c      |    9 ++++++++-
 fs/smbfs/file.c          |    2 +-
 fs/udf/ialloc.c          |    2 ++
 fs/ufs/ialloc.c          |    2 ++
 fs/xfs/linux/xfs_ioctl.c |    2 +-
 fs/xfs/linux/xfs_super.c |    2 ++
 fs/xfs/linux/xfs_vnode.c |    2 ++
 fs/xfs/xfs_acl.c         |    2 +-
 fs/xfs/xfs_cap.c         |    2 +-
 include/linux/fs.h       |   32 +++++++++++++++++++++-----------
 27 files changed, 113 insertions(+), 29 deletions(-)


--- linux-2.6.5cow/include/linux/fs.h~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/include/linux/fs.h	2004-04-27 16:34:42.000000000 +0200
@@ -155,24 +155,25 @@
  */
 #define __IS_FLG(inode,flg) ((inode)->i_sb->s_flags & (flg))
 
-#define IS_RDONLY(inode) ((inode)->i_sb->s_flags & MS_RDONLY)
+#define IS_RDONLY(inode)	(__IS_FLG(inode, MS_RDONLY))
 #define IS_SYNC(inode)		(__IS_FLG(inode, MS_SYNCHRONOUS) || \
-					((inode)->i_flags & S_SYNC))
+					inode_flags((inode), S_SYNC))
 #define IS_DIRSYNC(inode)	(__IS_FLG(inode, MS_SYNCHRONOUS|MS_DIRSYNC) || \
-					((inode)->i_flags & (S_SYNC|S_DIRSYNC)))
+				       inode_flags((inode), (S_SYNC|S_DIRSYNC)))
 #define IS_MANDLOCK(inode)	__IS_FLG(inode, MS_MANDLOCK)
 
-#define IS_QUOTAINIT(inode)	((inode)->i_flags & S_QUOTA)
-#define IS_NOQUOTA(inode)	((inode)->i_flags & S_NOQUOTA)
-#define IS_APPEND(inode)	((inode)->i_flags & S_APPEND)
-#define IS_IMMUTABLE(inode)	((inode)->i_flags & S_IMMUTABLE)
-#define IS_NOATIME(inode)	(__IS_FLG(inode, MS_NOATIME) || ((inode)->i_flags & S_NOATIME))
+#define IS_QUOTAINIT(inode)	(inode_flags((inode), S_QUOTA))
+#define IS_NOQUOTA(inode)	(inode_flags((inode), S_NOQUOTA))
+#define IS_APPEND(inode)	(inode_flags((inode), S_APPEND))
+#define IS_IMMUTABLE(inode)	(inode_flags((inode), S_IMMUTABLE))
+#define IS_NOATIME(inode)	(__IS_FLG(inode, MS_NOATIME) || \
+					inode_flags((inode), S_NOATIME))
 #define IS_NODIRATIME(inode)	__IS_FLG(inode, MS_NODIRATIME)
 #define IS_POSIXACL(inode)	__IS_FLG(inode, MS_POSIXACL)
 #define IS_ONE_SECOND(inode)	__IS_FLG(inode, MS_ONE_SECOND)
 
-#define IS_DEADDIR(inode)	((inode)->i_flags & S_DEAD)
-#define IS_NOCMTIME(inode)	((inode)->i_flags & S_NOCMTIME)
+#define IS_DEADDIR(inode)	(inode_flags((inode), S_DEAD))
+#define IS_NOCMTIME(inode)	(inode_flags((inode), S_NOCMTIME))
 
 /* the read-only stuff doesn't really belong here, but any other place is
    probably as bad and I don't want to create yet another include file. */
@@ -417,7 +418,7 @@
 	unsigned long		i_state;
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 
-	unsigned int		i_flags;
+	unsigned int		i_flags;	/* protected by i_lock */
 	unsigned char		i_sock;
 
 	atomic_t		i_writecount;
@@ -431,6 +432,15 @@
 #endif
 };
 
+static inline unsigned int inode_flags(struct inode *inode, unsigned int mask)
+{
+	int ret;
+	spin_lock(inode->i_lock);
+	ret = inode->i_flags & mask;
+	spin_unlock(inode->i_lock);
+	return ret;
+}
+
 /*
  * NOTE: in a 32bit arch with a preemptable kernel and
  * an UP compile the i_size_read/write must be atomic
--- linux-2.6.5cow/drivers/usb/core/inode.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/drivers/usb/core/inode.c	2004-04-14 16:22:24.000000000 +0200
@@ -277,7 +277,9 @@
 	if (usbfs_empty(dentry)) {
 		dentry->d_inode->i_nlink -= 2;
 		dput(dentry);
+		spin_lock(inode->i_lock);
 		inode->i_flags |= S_DEAD;
+		spin_unlock(inode->i_lock);
 		dir->i_nlink--;
 		error = 0;
 	}
--- linux-2.6.5cow/fs/xfs/linux/xfs_vnode.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/xfs/linux/xfs_vnode.c	2004-04-14 16:22:24.000000000 +0200
@@ -213,6 +213,7 @@
 		inode->i_mtime	    = va.va_mtime;
 		inode->i_ctime	    = va.va_ctime;
 		inode->i_atime	    = va.va_atime;
+		spin_lock(inode->i_lock);
 		if (va.va_xflags & XFS_XFLAG_IMMUTABLE)
 			inode->i_flags |= S_IMMUTABLE;
 		else
@@ -229,6 +230,7 @@
 			inode->i_flags |= S_NOATIME;
 		else
 			inode->i_flags &= ~S_NOATIME;
+		spin_unlock(inode->i_lock);
 		VUNMODIFY(vp);
 	}
 	return -error;
--- linux-2.6.5cow/fs/xfs/linux/xfs_ioctl.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/xfs/linux/xfs_ioctl.c	2004-04-14 16:22:24.000000000 +0200
@@ -885,7 +885,7 @@
 	int			attr_flags = 0;
 	int			error;
 
-	if (vp->v_inode.i_flags & (S_IMMUTABLE|S_APPEND))
+	if (inode_flags(vp->v_inode, (S_IMMUTABLE|S_APPEND)))
 		return -XFS_ERROR(EPERM);
 
 	if (!(filp->f_flags & FMODE_WRITE))
--- linux-2.6.5cow/fs/xfs/linux/xfs_super.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/xfs/linux/xfs_super.c	2004-04-14 16:22:24.000000000 +0200
@@ -187,6 +187,7 @@
 	inode->i_mtime.tv_nsec	= ip->i_d.di_mtime.t_nsec;
 	inode->i_ctime.tv_sec	= ip->i_d.di_ctime.t_sec;
 	inode->i_ctime.tv_nsec	= ip->i_d.di_ctime.t_nsec;
+	spin_lock(inode->i_lock);
 	if (ip->i_d.di_flags & XFS_DIFLAG_IMMUTABLE)
 		inode->i_flags |= S_IMMUTABLE;
 	else
@@ -203,6 +204,7 @@
 		inode->i_flags |= S_NOATIME;
 	else
 		inode->i_flags &= ~S_NOATIME;
+	spin_unlock(inode->i_lock);
 	vp->v_flag &= ~VMODIFIED;
 }
 
--- linux-2.6.5cow/fs/xfs/xfs_acl.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/xfs/xfs_acl.c	2004-04-14 16:22:24.000000000 +0200
@@ -388,7 +388,7 @@
 	vattr_t		va;
 	int		error;
 
-	if (vp->v_inode.i_flags & (S_IMMUTABLE|S_APPEND))
+	if (inode_flags(&vp->v_inode, (S_IMMUTABLE|S_APPEND)))
 		return EPERM;
 	if (kind == _ACL_TYPE_DEFAULT && vp->v_type != VDIR)
 		return ENOTDIR;
--- linux-2.6.5cow/fs/xfs/xfs_cap.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/xfs/xfs_cap.c	2004-04-14 16:22:24.000000000 +0200
@@ -192,7 +192,7 @@
 
 	if (vp->v_vfsp->vfs_flag & VFS_RDONLY)
 		return EROFS;
-	if (vp->v_inode.i_flags & (S_IMMUTABLE|S_APPEND))
+	if (inode_flags(&vp->v_inode, (S_IMMUTABLE|S_APPEND)))
 		return EPERM;
 	if ((error = _MAC_VACCESS(vp, NULL, VWRITE)))
 		return error;
--- linux-2.6.5cow/fs/nfs/inode.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/nfs/inode.c	2004-04-14 16:22:24.000000000 +0200
@@ -695,7 +695,9 @@
 		inode->i_ino = hash;
 
 		/* We can't support update_atime(), since the server will reset it */
+		spin_lock(inode->i_lock);
 		inode->i_flags |= S_NOATIME|S_NOCMTIME;
+		spin_unlock(inode->i_lock);
 		inode->i_mode = fattr->mode;
 		/* Why so? Because we want revalidate for devices/FIFOs, and
 		 * that's precisely what we have in nfs_file_inode_operations.
--- linux-2.6.5cow/fs/ufs/ialloc.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/ufs/ialloc.c	2004-04-14 16:22:24.000000000 +0200
@@ -283,7 +283,9 @@
 
 	if (DQUOT_ALLOC_INODE(inode)) {
 		DQUOT_DROP(inode);
+		spin_lock(inode->i_lock);
 		inode->i_flags |= S_NOQUOTA;
+		spin_unlock(inode->i_lock);
 		inode->i_nlink = 0;
 		iput(inode);
 		return ERR_PTR(-EDQUOT);
--- linux-2.6.5cow/fs/hfs/inode.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/hfs/inode.c	2004-04-20 16:20:21.000000000 +0200
@@ -540,7 +540,7 @@
 	if (atomic_dec_and_test(&HFS_I(inode)->opencnt)) {
 		down(&inode->i_sem);
 		hfs_file_truncate(inode);
-		//if (inode->i_flags & S_DEAD) {
+		//if (IS_DEADDIR(inode) {
 		//	hfs_delete_cat(inode->i_ino, HFSPLUS_SB(sb).hidden_dir, NULL);
 		//	hfs_delete_inode(inode);
 		//}
--- linux-2.6.5cow/fs/intermezzo/vfs.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/intermezzo/vfs.c	2004-04-14 16:22:24.000000000 +0200
@@ -1441,7 +1441,9 @@
                 error = iops->rmdir(dir->d_inode, dentry);
                 unlock_kernel();
                 if (!error) {
+			spin_lock(dentry->d_inode->i_lock);
                         dentry->d_inode->i_flags |= S_DEAD;
+			spin_unlock(dentry->d_inode->i_lock);
                         error = presto_settime(fset, NULL, NULL, dir, info,
                                                ATTR_CTIME | ATTR_MTIME);
                 }
@@ -1838,8 +1840,11 @@
                 error = do_rename(fset, old_parent, old_dentry,
                                          new_parent, new_dentry, info);
         if (target) {
-                if (!error)
+                if (!error) {
+			spin_lock(target);
                         target->i_flags |= S_DEAD;
+			spin_unlock(target);
+		}
                 //                triple_up(&old_dir->i_zombie,
                 //                          &new_dir->i_zombie,
                 //                          &target->i_zombie);
--- linux-2.6.5cow/fs/udf/ialloc.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/udf/ialloc.c	2004-04-14 16:22:24.000000000 +0200
@@ -165,7 +165,9 @@
 	if (DQUOT_ALLOC_INODE(inode))
 	{
 		DQUOT_DROP(inode);
+		spin_lock(inode->i_lock);
 		inode->i_flags |= S_NOQUOTA;
+		spin_unlock(inode->i_lock);
 		inode->i_nlink = 0;
 		iput(inode);
 		*err = -EDQUOT;
--- linux-2.6.5cow/fs/reiserfs/inode.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/reiserfs/inode.c	2004-04-20 16:20:21.000000000 +0200
@@ -1612,8 +1612,11 @@
     /* uid and gid must already be set by the caller for quota init */
 
     /* symlink cannot be immutable or append only, right? */
-    if( S_ISLNK( inode -> i_mode ) )
+    if( S_ISLNK( inode -> i_mode ) ) {
+	    spin_lock(inode->i_lock);
 	    inode -> i_flags &= ~ ( S_IMMUTABLE | S_APPEND );
+	    spin_unlock(inode->i_lock);
+    }
 
     inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
     inode->i_size = i_size;
@@ -2287,6 +2290,7 @@
 void sd_attrs_to_i_attrs( __u16 sd_attrs, struct inode *inode )
 {
 	if( reiserfs_attrs( inode -> i_sb ) ) {
+		spin_lock(inode->i_lock);
 		if( sd_attrs & REISERFS_SYNC_FL )
 			inode -> i_flags |= S_SYNC;
 		else
@@ -2307,12 +2311,14 @@
 			REISERFS_I(inode)->i_flags |= i_nopack_mask;
 		else
 			REISERFS_I(inode)->i_flags &= ~i_nopack_mask;
+		spin_unlock(inode->i_lock);
 	}
 }
 
 void i_attrs_to_sd_attrs( struct inode *inode, __u16 *sd_attrs )
 {
 	if( reiserfs_attrs( inode -> i_sb ) ) {
+		spin_lock(inode->i_lock);
 		if( inode -> i_flags & S_IMMUTABLE )
 			*sd_attrs |= REISERFS_IMMUTABLE_FL;
 		else
@@ -2329,6 +2335,7 @@
 			*sd_attrs |= REISERFS_NOTAIL_FL;
 		else
 			*sd_attrs &= ~REISERFS_NOTAIL_FL;
+		spin_unlock(inode->i_lock);
 	}
 }
 
--- linux-2.6.5cow/fs/fat/inode.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/fat/inode.c	2004-04-20 16:20:21.000000000 +0200
@@ -1193,8 +1193,11 @@
 		MSDOS_I(inode)->mmu_private = inode->i_size;
 	}
 	if(de->attr & ATTR_SYS)
-		if (sbi->options.sys_immutable)
+		if (sbi->options.sys_immutable) {
+			spin_lock(inode->i_lock);
 			inode->i_flags |= S_IMMUTABLE;
+			spin_unlock(inode->i_lock);
+		}
 	MSDOS_I(inode)->i_attrs = de->attr & ATTR_UNUSED;
 	/* this is as close to the truth as we can get ... */
 	inode->i_blksize = sbi->cluster_size;
--- linux-2.6.5cow/fs/ext3/ialloc.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/ext3/ialloc.c	2004-04-14 16:22:24.000000000 +0200
@@ -627,7 +627,9 @@
 	return ret;
 
 fail2:
+	spin_lock(inode->i_lock);
 	inode->i_flags |= S_NOQUOTA;
+	spin_unlock(inode->i_lock);
 	inode->i_nlink = 0;
 	iput(inode);
 	brelse(bitmap_bh);
--- linux-2.6.5cow/fs/ext3/inode.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/ext3/inode.c	2004-04-27 16:34:42.000000000 +0200
@@ -2447,6 +2447,7 @@
 {
 	unsigned int flags = EXT3_I(inode)->i_flags;
 
+	spin_lock(inode->i_lock);
 	inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
 	if (flags & EXT3_SYNC_FL)
 		inode->i_flags |= S_SYNC;
@@ -2458,6 +2459,7 @@
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT3_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
+	spin_unlock(inode->i_lock);
 }
 
 void ext3_read_inode(struct inode * inode)
--- linux-2.6.5cow/fs/proc/base.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/proc/base.c	2004-04-14 16:22:24.000000000 +0200
@@ -1594,7 +1594,9 @@
 	inode->i_op = &proc_tgid_base_inode_operations;
 	inode->i_fop = &proc_tgid_base_operations;
 	inode->i_nlink = 3;
+	spin_lock(inode->i_lock);
 	inode->i_flags|=S_IMMUTABLE;
+	spin_unlock(inode->i_lock);
 
 	dentry->d_op = &pid_base_dentry_operations;
 
@@ -1649,7 +1651,9 @@
 	inode->i_op = &proc_tid_base_inode_operations;
 	inode->i_fop = &proc_tid_base_operations;
 	inode->i_nlink = 3;
+	spin_lock(inode->i_lock);
 	inode->i_flags|=S_IMMUTABLE;
+	spin_unlock(inode->i_lock);
 
 	dentry->d_op = &pid_base_dentry_operations;
 
--- linux-2.6.5cow/fs/ext2/inode.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/ext2/inode.c	2004-04-27 16:34:42.000000000 +0200
@@ -1020,6 +1020,7 @@
 {
 	unsigned int flags = EXT2_I(inode)->i_flags;
 
+	spin_lock(inode->i_lock);
 	inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
 	if (flags & EXT2_SYNC_FL)
 		inode->i_flags |= S_SYNC;
@@ -1031,6 +1032,7 @@
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT2_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
+	spin_unlock(inode->i_lock);
 }
 
 void ext2_read_inode (struct inode * inode)
--- linux-2.6.5cow/fs/ext2/ialloc.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/ext2/ialloc.c	2004-04-14 16:22:24.000000000 +0200
@@ -620,7 +620,9 @@
 	return inode;
 
 fail2:
+	spin_lock(inode->i_lock);
 	inode->i_flags |= S_NOQUOTA;
+	spin_unlock(inode->i_lock);
 	inode->i_nlink = 0;
 	iput(inode);
 	return ERR_PTR(err);
--- linux-2.6.5cow/fs/namei.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/namei.c	2004-04-27 16:34:42.000000000 +0200
@@ -1609,8 +1609,11 @@
 		error = security_inode_rmdir(dir, dentry);
 		if (!error) {
 			error = dir->i_op->rmdir(dir, dentry);
-			if (!error)
+			if (!error) {
+				spin_lock(dentry->d_inode->i_lock);
 				dentry->d_inode->i_flags |= S_DEAD;
+				spin_unlock(dentry->d_inode->i_lock);
+			}
 		}
 	}
 	up(&dentry->d_inode->i_sem);
@@ -1952,8 +1955,11 @@
 	else 
 		error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
 	if (target) {
-		if (!error)
+		if (!error) {
+			spin_lock(target->i_lock);
 			target->i_flags |= S_DEAD;
+			spin_unlock(target->i_lock);
+		}
 		up(&target->i_sem);
 		if (d_unhashed(new_dentry))
 			d_rehash(new_dentry);
--- linux-2.6.5cow/fs/dquot.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/dquot.c	2004-04-14 16:22:24.000000000 +0200
@@ -573,7 +573,9 @@
 		if (inode->i_dquot[cnt] != NODQUOT)
 			goto put_it;
 	}
+	spin_lock(inode->i_lock);
 	inode->i_flags &= ~S_QUOTA;
+	spin_unlock(inode->i_lock);
 put_it:
 	if (dquot != NODQUOT) {
 		if (dqput_blocks(dquot)) {
@@ -824,8 +826,11 @@
 					break;
 			}
 			inode->i_dquot[cnt] = dqget(inode->i_sb, id, cnt);
-			if (inode->i_dquot[cnt])
+			if (inode->i_dquot[cnt]) {
+				spin_lock(inode->i_lock);
 				inode->i_flags |= S_QUOTA;
+				spin_unlock(inode->i_lock);
+			}
 		}
 	}
 	up_write(&sb_dqopt(inode->i_sb)->dqptr_sem);
@@ -839,7 +844,9 @@
 {
 	int cnt;
 
+	spin_lock(inode->i_lock);
 	inode->i_flags &= ~S_QUOTA;
+	spin_unlock(inode->i_lock);
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
 		to_drop[cnt] = inode->i_dquot[cnt];
 		inode->i_dquot[cnt] = NODQUOT;
@@ -1174,6 +1181,7 @@
 	return 0;
 }
 
+#define SAFE_FLAGS (S_QUOTA | S_NOQUOTA | S_NOATIME)
 int vfs_quota_on(struct super_block *sb, int type, int format_id, char *path)
 {
 	struct file *f;
@@ -1207,14 +1215,18 @@
 		error = -EBUSY;
 		goto out_lock;
 	}
-	oldflags = inode->i_flags;
+	/* FIXME: this is a race for i_flags.  not sure how to fix it properly,
+	 * but at least it only affects the three SAFE_FLAGS now */
+	oldflags = inode_flags(inode, SAFE_FLAGS);
 	dqopt->files[type] = f;
 	error = -EINVAL;
 	if (!fmt->qf_ops->check_quota_file(sb, type))
 		goto out_file_init;
 	/* We don't want quota and atime on quota files (deadlocks possible) */
 	dquot_drop_nolock(inode);
+	spin_lock(inode->i_lock);
 	inode->i_flags |= S_NOQUOTA | S_NOATIME;
+	spin_unlock(inode->i_lock);
 
 	dqopt->ops[type] = fmt->qf_ops;
 	dqopt->info[type].dqi_format = fmt;
@@ -1233,7 +1245,10 @@
 	return 0;
 
 out_file_init:
-	inode->i_flags = oldflags;
+	spin_lock(inode->i_lock);
+	inode->i_flags &= ~SAFE_FLAGS;
+	inode->i_flags |= oldflags;
+	spin_unlock(inode->i_lock);
 	dqopt->files[type] = NULL;
 out_lock:
 	up_write(&dqopt->dqptr_sem);
--- linux-2.6.5cow/fs/hfsplus/catalog.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/hfsplus/catalog.c	2004-04-14 16:22:24.000000000 +0200
@@ -54,11 +54,11 @@
 
 static void hfsplus_set_perms(struct inode *inode, struct hfsplus_perm *perms)
 {
-	if (inode->i_flags & S_IMMUTABLE)
+	if (IS_IMMUTABLE(inode))
 		perms->rootflags |= HFSPLUS_FLG_IMMUTABLE;
 	else
 		perms->rootflags &= ~HFSPLUS_FLG_IMMUTABLE;
-	if (inode->i_flags & S_APPEND)
+	if (IS_APPEND(inode))
 		perms->rootflags |= HFSPLUS_FLG_APPEND;
 	else
 		perms->rootflags &= ~HFSPLUS_FLG_APPEND;
--- linux-2.6.5cow/fs/hfsplus/dir.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/hfsplus/dir.c	2004-04-14 16:22:24.000000000 +0200
@@ -315,8 +315,11 @@
 		res = hfsplus_rename_cat(inode->i_ino,
 					 dir, &dentry->d_name,
 					 HFSPLUS_SB(sb).hidden_dir, &str);
-		if (!res)
+		if (!res) {
+			spin_lock(inode->i_lock);
 			inode->i_flags |= S_DEAD;
+			spin_unlock(inode->i_lock);
+		}
 		return res;
 	}
 	res = hfsplus_delete_cat(cnid, dir, &dentry->d_name);
@@ -330,8 +333,11 @@
 			res = hfsplus_delete_cat(inode->i_ino, HFSPLUS_SB(sb).hidden_dir, NULL);
 			if (!res)
 				hfsplus_delete_inode(inode);
-		} else
+		} else {
+			spin_lock(inode->i_lock);
 			inode->i_flags |= S_DEAD;
+			spin_unlock(inode->i_lock);
+		}
 	}
 	inode->i_ctime = CURRENT_TIME;
 	mark_inode_dirty(inode);
--- linux-2.6.5cow/fs/hfsplus/inode.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/hfsplus/inode.c	2004-04-20 16:20:21.000000000 +0200
@@ -225,6 +225,7 @@
 
 	HFSPLUS_I(inode).rootflags = perms->rootflags;
 	HFSPLUS_I(inode).userflags = perms->userflags;
+	spin_lock(inode->i_lock);
 	if (perms->rootflags & HFSPLUS_FLG_IMMUTABLE)
 		inode->i_flags |= S_IMMUTABLE;
 	else
@@ -233,10 +234,12 @@
 		inode->i_flags |= S_APPEND;
 	else
 		inode->i_flags &= ~S_APPEND;
+	spin_unlock(inode->i_lock);
 }
 
 static void hfsplus_set_perms(struct inode *inode, struct hfsplus_perm *perms)
 {
+	spin_lock(inode->i_lock);
 	if (inode->i_flags & S_IMMUTABLE)
 		perms->rootflags |= HFSPLUS_FLG_IMMUTABLE;
 	else
@@ -245,6 +248,7 @@
 		perms->rootflags |= HFSPLUS_FLG_APPEND;
 	else
 		perms->rootflags &= ~HFSPLUS_FLG_APPEND;
+	spin_unlock(inode->i_lock);
 	perms->userflags = HFSPLUS_I(inode).userflags;
 	perms->mode = cpu_to_be16(inode->i_mode);
 	perms->owner = cpu_to_be32(inode->i_uid);
@@ -285,7 +289,7 @@
 	if (atomic_dec_and_test(&HFSPLUS_I(inode).opencnt)) {
 		down(&inode->i_sem);
 		hfsplus_file_truncate(inode);
-		if (inode->i_flags & S_DEAD) {
+		if (IS_DEADDIR(inode)) {
 			hfsplus_delete_cat(inode->i_ino, HFSPLUS_SB(sb).hidden_dir, NULL);
 			hfsplus_delete_inode(inode);
 		}
--- linux-2.6.5cow/fs/hfsplus/ioctl.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/hfsplus/ioctl.c	2004-04-14 16:22:24.000000000 +0200
@@ -53,6 +53,7 @@
 			      EXT2_FLAG_NODUMP))
 			return -EOPNOTSUPP;
 
+		spin_lock(inode->i_lock);
 		if (flags & EXT2_FLAG_IMMUTABLE) { /* EXT2_IMMUTABLE_FL */
 			inode->i_flags |= S_IMMUTABLE;
 			HFSPLUS_I(inode).rootflags |= HFSPLUS_FLG_IMMUTABLE;
@@ -67,6 +68,7 @@
 			inode->i_flags &= ~S_APPEND;
 			HFSPLUS_I(inode).rootflags &= ~HFSPLUS_FLG_APPEND;
 		}
+		spin_unlock(inode->i_lock);
 		if (flags & EXT2_FLAG_NODUMP) /* EXT2_NODUMP_FL */
 			HFSPLUS_I(inode).userflags |= HFSPLUS_FLG_NODUMP;
 		else
--- linux-2.6.5cow/fs/smbfs/file.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/smbfs/file.c	2004-04-15 22:03:45.000000000 +0200
@@ -230,7 +230,7 @@
 
 	VERBOSE("before read, size=%ld, flags=%x, atime=%ld\n",
 		(long)dentry->d_inode->i_size,
-		dentry->d_inode->i_flags, dentry->d_inode->i_atime);
+		inode_flags(dentry->d_inode, -1U), dentry->d_inode->i_atime);
 
 	status = generic_file_read(file, buf, count, ppos);
 out:
--- linux-2.6.5cow/fs/afs/inode.c~lock_flags	2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/afs/inode.c	2004-04-14 16:22:24.000000000 +0200
@@ -188,7 +188,9 @@
 #endif
 
 	/* okay... it's a new inode */
+	spin_lock(inode->i_lock);
 	inode->i_flags |= S_NOATIME;
+	spin_unlock(inode->i_lock);
 	vnode->flags |= AFS_VNODE_CHANGED;
 	ret = afs_inode_fetch_status(inode);
 	if (ret<0)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH COW] MAD COW
  2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
                   ` (3 preceding siblings ...)
  2004-05-06 13:20 ` [PATCH COW] lock_flags Jörn Engel
@ 2004-05-06 13:21 ` Jörn Engel
  2004-05-08 13:45 ` [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Denis Vlasenko
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-06 13:21 UTC (permalink / raw)
  To: linux-kernel

Patch 5

Jörn

-- 
A victorious army first wins and then seeks battle.
-- Sun Tzu

Allow COW behaviour for hard links, depending on an inode flag.

Semantics:
o Files with S_COWLINK do cow. (yes, really ;)
o Directories with S_COWLINK inherit flag to new files.
o If in doubt, return -EMLINK and let the user sort it out:
  - When linking non-cow files to cow directories.
  - When moving non-cow files/directories to cow directories.
  - When moving cow files/directories to non-cow directories.

Thanks to Sytse and Andrew for tips.

 fs/ext2/inode.c         |   11 ++++++
 fs/ext2/super.c         |    2 -
 fs/ext3/inode.c         |   11 ++++++
 fs/ext3/super.c         |    2 -
 fs/fcntl.c              |   24 ++++++++++++++
 fs/namei.c              |   76 +++++++++++++++++++++++++++++++++++++++++++++-
 fs/open.c               |   79 +++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/ext2_fs.h |    1 
 include/linux/ext3_fs.h |    1 
 include/linux/fcntl.h   |    3 +
 include/linux/fs.h      |    4 ++
 11 files changed, 205 insertions(+), 9 deletions(-)


--- linux-2.6.5cow/fs/open.c~break_madcow	2004-05-04 16:31:50.000000000 +0200
+++ linux-2.6.5cow/fs/open.c	2004-05-06 01:02:57.000000000 +0200
@@ -723,6 +723,71 @@
 	return error;
 }
 
+char *index(const char *s, int c)
+{
+	while (*s) {
+		if (*s == c)
+			return (char *) s;
+		s++;
+	}
+	return NULL;
+}
+
+char *rindex(const char *s, int c)
+{
+	char *ret = NULL;
+	while (*s) {
+		if (*s == c)
+			ret = (char *) s;
+		s++;
+	}
+	return ret;
+}
+
+char *madcow_temp_name(const char *name)
+{
+	const char temp[] = "__MADCOW_BREAK_LINK"; /* FIXME: add random part */
+	char *last_slash = rindex(name, '/');
+	size_t dir_len = last_slash ? last_slash + 1 - name : 0;
+	size_t len = sizeof(temp) + dir_len;
+	char *ret;
+
+	ret = kmalloc(len, GFP_KERNEL);
+	if (!ret)
+		return NULL;
+
+	strncpy(ret, name, dir_len);
+	strcpy(ret+dir_len, temp);
+	return ret;
+}
+
+int vfs_rename_other(struct inode *, struct dentry *,
+		struct inode *, struct dentry *);
+int do_copyfile(const char *from, const char *to, int mode);
+int do_rename(const char * oldname, const char * newname);
+
+int madcow_break_link(const char *from)
+{
+	int err, ret = -EMLINK;
+	char *to = madcow_temp_name(from);
+
+	printk("break link '%s' -> '%s'\n", from, to);
+	err = do_copyfile(from, to, -1);
+	printk("do_copyfile returned %d\n", err);
+	if (err)
+		goto out;
+
+	err = do_rename(to, from);
+	printk("do_rename returned %d\n", err);
+	if (err)
+		goto out;
+
+	ret = 0;
+out:
+	kfree(to);
+	return ret;
+}
+
 /*
  * Note that while the flag value (low two bits) for sys_open means:
  *	00 - read-only
@@ -746,13 +811,19 @@
 	if ((namei_flags+1) & O_ACCMODE)
 		namei_flags++;
 	if (namei_flags & O_TRUNC)
-		namei_flags |= 2;
+		namei_flags |= FMODE_WRITE;
 
 	error = open_namei(filename, namei_flags, mode, &nd);
-	if (!error)
-		return dentry_open(nd.dentry, nd.mnt, flags);
 
-	return ERR_PTR(error);
+	if (error == -EMLINK) {
+		error = madcow_break_link(filename);
+		if (!error) /*retry*/
+			error = open_namei(filename, namei_flags, mode, &nd);
+	}
+	if (error)
+		return ERR_PTR(error);
+
+	return dentry_open(nd.dentry, nd.mnt, flags);
 }
 
 EXPORT_SYMBOL(filp_open);
--- linux-2.6.5cow/fs/ext2/inode.c~madcow	2004-04-27 16:48:55.000000000 +0200
+++ linux-2.6.5cow/fs/ext2/inode.c	2004-04-27 17:02:33.000000000 +0200
@@ -1021,7 +1021,8 @@
 	unsigned int flags = EXT2_I(inode)->i_flags;
 
 	spin_lock(inode->i_lock);
-	inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+	inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME
+			| S_DIRSYNC | S_MADCOWLINK);
 	if (flags & EXT2_SYNC_FL)
 		inode->i_flags |= S_SYNC;
 	if (flags & EXT2_APPEND_FL)
@@ -1032,6 +1033,8 @@
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT2_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
+	if (flags & EXT2_MADCOWLINK_FL)
+		inode->i_flags |= S_MADCOWLINK;
 	spin_unlock(inode->i_lock);
 }
 
@@ -1159,6 +1162,12 @@
 	if (IS_ERR(raw_inode))
  		return -EIO;
 
+	/* vfs inode holds the current MADCOWLINK flag, so we have to update
+	 * ei->i_flags first */
+	ei->i_flags &= ~EXT2_MADCOWLINK_FL;
+	if (inode_flags(inode, S_MADCOWLINK))
+		ei->i_flags |= EXT2_MADCOWLINK_FL;
+
 	/* For fields not not tracking in the in-memory inode,
 	 * initialise them to zero for new inodes. */
 	if (ei->i_state & EXT2_STATE_NEW)
--- linux-2.6.5cow/fs/ext2/super.c~madcow	2004-04-27 16:34:42.000000000 +0200
+++ linux-2.6.5cow/fs/ext2/super.c	2004-04-27 17:03:10.000000000 +0200
@@ -1015,7 +1015,7 @@
 	.name		= "ext2",
 	.get_sb		= ext2_get_sb,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_MADCOW,
 };
 
 static int __init init_ext2_fs(void)
--- linux-2.6.5cow/fs/ext3/inode.c~madcow	2004-04-27 16:48:55.000000000 +0200
+++ linux-2.6.5cow/fs/ext3/inode.c	2004-04-27 17:04:41.000000000 +0200
@@ -2448,7 +2448,8 @@
 	unsigned int flags = EXT3_I(inode)->i_flags;
 
 	spin_lock(inode->i_lock);
-	inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+	inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME
+			| S_DIRSYNC | S_MADCOWLINK);
 	if (flags & EXT3_SYNC_FL)
 		inode->i_flags |= S_SYNC;
 	if (flags & EXT3_APPEND_FL)
@@ -2459,6 +2460,8 @@
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT3_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
+	if (flags & EXT3_MADCOWLINK_FL)
+		inode->i_flags |= S_MADCOWLINK;
 	spin_unlock(inode->i_lock);
 }
 
@@ -2594,6 +2597,12 @@
 	struct buffer_head *bh = iloc->bh;
 	int err = 0, rc, block;
 
+	/* vfs inode holds the current MADCOWLINK flag, so we have to update
+	 * ei->i_flags first */
+	ei->i_flags &= ~EXT3_MADCOWLINK_FL;
+	if (inode_flags(inode, S_MADCOWLINK))
+		ei->i_flags |= EXT3_MADCOWLINK_FL;
+
 	/* For fields not not tracking in the in-memory inode,
 	 * initialise them to zero for new inodes. */
 	if (ei->i_state & EXT3_STATE_NEW)
--- linux-2.6.5cow/fs/ext3/super.c~madcow	2004-04-27 16:34:42.000000000 +0200
+++ linux-2.6.5cow/fs/ext3/super.c	2004-04-27 17:05:22.000000000 +0200
@@ -2004,7 +2004,7 @@
 	.name		= "ext3",
 	.get_sb		= ext3_get_sb,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_MADCOW,
 };
 
 static int __init init_ext3_fs(void)
--- linux-2.6.5cow/fs/fcntl.c~madcow	2004-04-27 16:34:42.000000000 +0200
+++ linux-2.6.5cow/fs/fcntl.c	2004-04-27 16:55:42.000000000 +0200
@@ -282,6 +282,24 @@
 
 EXPORT_SYMBOL(f_delown);
 
+static long fcntl_setmadcow(struct file *filp, unsigned long arg)
+{
+	struct inode *inode = filp->f_dentry->d_inode;
+
+	if (!(inode->i_sb->s_type->fs_flags & FS_MADCOW))
+		return -EINVAL;
+	//FIXME: -EPERM?
+
+	spin_lock(&inode->i_lock);
+	if (arg)
+		inode->i_flags |= S_MADCOWLINK;
+	else
+		inode->i_flags &= ~S_MADCOWLINK;
+	spin_unlock(&inode->i_lock);
+	mark_inode_dirty(inode);
+	return 0;
+}
+
 static long do_fcntl(unsigned int fd, unsigned int cmd,
 		     unsigned long arg, struct file * filp)
 {
@@ -346,6 +364,12 @@
 		case F_NOTIFY:
 			err = fcntl_dirnotify(fd, filp, arg);
 			break;
+		case F_SETMADCOW:
+			err = fcntl_setmadcow(filp, arg);
+			break;
+		case F_GETMADCOW:
+			err = !!inode_flags(filp->f_dentry->d_inode, S_MADCOWLINK);
+			break;
 		default:
 			break;
 	}
--- linux-2.6.5cow/fs/namei.c~madcow	2004-04-27 16:48:55.000000000 +0200
+++ linux-2.6.5cow/fs/namei.c	2004-05-02 15:07:57.000000000 +0200
@@ -223,6 +223,41 @@
 	return security_inode_permission(inode, mask, nd);
 }
 
+static inline void set_madcowflag(struct inode *inode)
+{
+	spin_lock(&inode->i_lock);
+	inode->i_flags |= S_MADCOWLINK;
+	spin_unlock(&inode->i_lock);
+	mark_inode_dirty(inode);
+}
+
+/*
+ * Files with the S_MADCOWLINK flag set cannot be written to, if more
+ * than one hard link to them exists.  Ultimately, this function
+ * should copy the inode, assign the copy to the dentry and lower use
+ * count of the old inode - one day.
+ * For now, it is sufficient to return an error and let userspace
+ * deal with the messy part.  Not exactly the meaning of
+ * copy-on-write, but much better than writing to fifty files at once
+ * and noticing month later.
+ *
+ * Yes, this breaks the kernel interface and is simply wrong.  This
+ * is intended behaviour, so Linus will not merge the code before
+ * it is complete.  Or will he?
+ */
+static int break_madcow_link(struct inode *inode)
+{
+	if (!inode_flags(inode, S_MADCOWLINK))
+		return 0;
+	if (!S_ISREG(inode->i_mode))
+		return 0;
+	if (inode->i_nlink < 2)
+		return 0;
+	/* TODO: As soon as sendfile can do normal file copies, use that
+	 * and always return 0 */
+	return -EMLINK;
+}
+
 /*
  * get_write_access() gets write permission for a file.
  * put_write_access() releases this write permission.
@@ -243,6 +278,10 @@
 
 int get_write_access(struct inode * inode)
 {
+	int error = break_madcow_link(inode);
+	if (error)
+		return error;
+
 	spin_lock(&inode->i_lock);
 	if (atomic_read(&inode->i_writecount) < 0) {
 		spin_unlock(&inode->i_lock);
@@ -1146,6 +1185,8 @@
 	DQUOT_INIT(dir);
 	error = dir->i_op->create(dir, dentry, mode, nd);
 	if (!error) {
+		if (inode_flags(dir, S_MADCOWLINK))
+			set_madcowflag(dentry->d_inode);
 		inode_dir_notify(dir, DN_CREATE);
 		security_inode_post_create(dir, dentry, mode);
 	}
@@ -1520,6 +1561,8 @@
 	DQUOT_INIT(dir);
 	error = dir->i_op->mkdir(dir, dentry, mode);
 	if (!error) {
+		if (inode_flags(dir, S_MADCOWLINK))
+			set_madcowflag(dentry->d_inode);
 		inode_dir_notify(dir, DN_CREATE);
 		security_inode_post_mkdir(dir,dentry, mode);
 	}
@@ -1823,6 +1866,13 @@
 		return -EXDEV;
 
 	/*
+	 * Madcowlink attribute is inherited from directory, but here,
+	 * the inode already has one.  If they don't match, bail out.
+	 */
+	if (inode_flags(dir, S_MADCOWLINK) != inode_flags(inode, S_MADCOWLINK))
+		return -EMLINK;
+
+	/*
 	 * A link to an append-only or immutable file cannot be created.
 	 */
 	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
@@ -2003,6 +2053,26 @@
 	return error;
 }
 
+static int madcow_allow_rename(struct inode *old_dir, struct dentry *old_dentry,
+			       struct inode *new_dir)
+{
+	struct inode *old_inode = old_dentry->d_inode;
+
+	/* source and target share directory: allow */
+	if (old_dir == new_dir)
+		return 0;
+	/* source and target directory have identical madcowlink flag: allow */
+	if (inode_flags(old_inode, S_MADCOWLINK) == inode_flags(new_dir, S_MADCOWLINK))
+		return 0;
+	/* We could always fail here, but madcowlink flag is only defined for
+	 * files and directories, so let's allow special files */
+	if (!S_ISREG(old_inode->i_mode))
+		return -EMLINK;
+	if (!S_ISDIR(old_inode->i_mode))
+		return -EMLINK;
+	return 0;
+}
+
 int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 	       struct inode *new_dir, struct dentry *new_dentry)
 {
@@ -2026,6 +2096,10 @@
 	if (!old_dir->i_op || !old_dir->i_op->rename)
 		return -EPERM;
 
+	error = madcow_allow_rename(old_dir, old_dentry, new_dir);
+	if (error)
+		return error;
+
 	DQUOT_INIT(old_dir);
 	DQUOT_INIT(new_dir);
 
@@ -2118,7 +2118,7 @@
 	return error;
 }
 
-static inline int do_rename(const char * oldname, const char * newname)
+int do_rename(const char * oldname, const char * newname)
 {
 	int error = 0;
 	struct dentry * old_dir, * new_dir;
--- linux-2.6.5cow/include/linux/ext2_fs.h~madcow	2004-04-27 16:34:42.000000000 +0200
+++ linux-2.6.5cow/include/linux/ext2_fs.h	2004-04-27 17:01:30.000000000 +0200
@@ -192,6 +192,7 @@
 #define EXT2_NOTAIL_FL			0x00008000 /* file tail should not be merged */
 #define EXT2_DIRSYNC_FL			0x00010000 /* dirsync behaviour (directories only) */
 #define EXT2_TOPDIR_FL			0x00020000 /* Top of directory hierarchies*/
+#define EXT2_MADCOWLINK_FL		0x00040000 /* COW behaviour for hard links */
 #define EXT2_RESERVED_FL		0x80000000 /* reserved for ext2 lib */
 
 #define EXT2_FL_USER_VISIBLE		0x0003DFFF /* User visible flags */
--- linux-2.6.5cow/include/linux/ext3_fs.h~madcow	2004-04-27 16:34:42.000000000 +0200
+++ linux-2.6.5cow/include/linux/ext3_fs.h	2004-04-27 17:03:46.000000000 +0200
@@ -185,6 +185,7 @@
 #define EXT3_NOTAIL_FL			0x00008000 /* file tail should not be merged */
 #define EXT3_DIRSYNC_FL			0x00010000 /* dirsync behaviour (directories only) */
 #define EXT3_TOPDIR_FL			0x00020000 /* Top of directory hierarchies*/
+#define EXT3_MADCOWLINK_FL		0x00040000 /* COW behaviour for hard links */
 #define EXT3_RESERVED_FL		0x80000000 /* reserved for ext3 lib */
 
 #define EXT3_FL_USER_VISIBLE		0x0003DFFF /* User visible flags */
--- linux-2.6.5cow/include/linux/fcntl.h~madcow	2004-04-27 16:34:42.000000000 +0200
+++ linux-2.6.5cow/include/linux/fcntl.h	2004-04-27 16:49:45.000000000 +0200
@@ -23,6 +23,9 @@
 #define DN_ATTRIB	0x00000020	/* File changed attibutes */
 #define DN_MULTISHOT	0x80000000	/* Don't remove notifier */
 
+#define F_SETMADCOW	(F_LINUX_SPECIFIC_BASE+3)
+#define F_GETMADCOW	(F_LINUX_SPECIFIC_BASE+4)
+
 #ifdef __KERNEL__
 
 #if BITS_PER_LONG == 32
--- linux-2.6.5cow/include/linux/fs.h~madcow	2004-04-27 16:48:55.000000000 +0200
+++ linux-2.6.5cow/include/linux/fs.h	2004-04-30 08:40:04.000000000 +0200
@@ -90,6 +90,7 @@
 /* public flags for file_system_type */
 #define FS_REQUIRES_DEV 1 
 #define FS_BINARY_MOUNTDATA 2
+#define FS_MADCOW	4
 #define FS_REVAL_DOT	16384	/* Check the paths ".", ".." for staleness */
 #define FS_ODD_RENAME	32768	/* Temporary stuff; will go away as soon
 				  * as nfs_rename() will be cleaned up
@@ -139,6 +140,9 @@
 #define S_NOQUOTA	64	/* Inode is not counted to quota */
 #define S_DIRSYNC	128	/* Directory modifications are synchronous */
 #define S_NOCMTIME	256	/* Do not update file c/mtime */
+#define S_MADCOWLINK	512	/* Hard links have copy on write semantics.
+				 * This flag has no meaning for directories,
+				 * but is inherited to directory children */
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
                   ` (4 preceding siblings ...)
  2004-05-06 13:21 ` [PATCH COW] MAD COW Jörn Engel
@ 2004-05-08 13:45 ` Denis Vlasenko
  2004-05-08 22:10   ` Pavel Machek
  2004-05-08 22:48 ` Pavel Machek
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: Denis Vlasenko @ 2004-05-08 13:45 UTC (permalink / raw)
  To: Jörn Engel, linux-kernel
  Cc: Jamie Lokier, Pavel Machek, Eric W. Biederman

On Thursday 06 May 2004 16:17, Jörn Engel wrote:
> Couldn't sleep last night and finished a first complete version of
> cowlinks, code-named MAD COW.  It is still based on the stupid old
> design with a flag to distinguish between regular hard links and
> cowlinks.  Please don't comment on that design, it's just a proof of
> concept.
>
> Patches are against 2.6.5 but most things should apply to other 2.6
> kernel without too much trouble.
>
> 1 generic_sendpage	- allow sendfile with ext[23] files as target
> 2 sendfile		- introduce vfs_sendfile for in-kernel use
> 3 copyfile		- new copyfile() system call
> 4 lock_flags		- old cruft, just ignore it
> 5 madcow		- the MAD COW itself
>
> Patches 1-3 will stay, 4 will be remove and 5 replaced by a better
> design over time.  I've also set up a webpage for it:
> http://wohnheim.fh-wedel.de/~joern/cowlink/
>
> Maybe that should be converted into a wiki so someone with better
> taste than myself can improve it.

Hi,

Glad to see this happening. My filesystems are mostly reiser
these days, but if I have some time/occasion, I will try your patch.

Regarding semantics etc. I have read http://lwn.net/Articles/77972/.
I think most difficulties arose from inode numbers as a concept.

Usage of inode number is a historic UNIX misfeature.
AFAIK it is almost exclusively used for determining whether
two files are really the same: same (dev,ino) => same file.
Usage:
* diff wants this to avoid diffing file against itself.
* du wants this in order to not count file twice.
* cp/tar wants this in order to preserve hardlinks.
For cowlinks it is a bit different:
* diff still dont need to compare cowlinked file.
* cp/tar aren't required to detect cowlinks, may do this
  as an optimization.
* du: I am not sure... analogous to sparse files?

In essense, diff wants to ask kernel "are these files have
identical contents?" while tar/cp ask "are these files
hardlinked together?". du asks "are these share storage"?

Original UNIX folks had to make inode number only an advisory
indicator, meaning "if ino1 != ino2, files are _definitely_ not
linked", and a syscall is_hardlinked(fd1,fd2), intended
for use when inodes are equal.

That is probably unfixable now, but you can avoid making similar
error. Provide is_cowlinked(fd1,fd2) syscall. Pity you will
have to use different inode numbers for cowlinks (due to tar/cp),
and this won't fly:

diff.c:
...
if(ino1==ino2 && is_cowlinked(fd1,fd2))
	skip_diff();
...

diff will have to use if(is_cowlinked(fd1,fd2)), i.e. one
extra syscall, always. Anyway, for diff that's not tragic,
savings from not doing diff are still very substantial.
For du, that would hurt.

Per-block cow filesystems will open another can of worms
for diff and du. ;)

P.S. If is_hardlinked() is introduced too, tar/cp can be
updated to use that, and we can slowly drift into right
direction (of not assuming "equal inos => hardlinked file").
I think too much people will disagree with me, and this
won't happen.
--
vda


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-08 13:45 ` [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Denis Vlasenko
@ 2004-05-08 22:10   ` Pavel Machek
  2004-05-09 14:09     ` Denis Vlasenko
  0 siblings, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2004-05-08 22:10 UTC (permalink / raw)
  To: Denis Vlasenko
  Cc: Jörn Engel, linux-kernel, Jamie Lokier, Eric W. Biederman

Hi!

> That is probably unfixable now, but you can avoid making similar
> error. Provide is_cowlinked(fd1,fd2) syscall. Pity you will
> have to use different inode numbers for cowlinks (due to tar/cp),
> and this won't fly:

is_cowlinked does not fly, either. For n files, you have to do O(n^2)
calls to find those that are linked.

You want get_cowlinked_id which can return -1 "I do not know".

								Pavel
-- 
Horseback riding is like software...
...vgf orggre jura vgf serr.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
                   ` (5 preceding siblings ...)
  2004-05-08 13:45 ` [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Denis Vlasenko
@ 2004-05-08 22:48 ` Pavel Machek
  2004-05-10 15:53   ` Jörn Engel
  2004-05-10  5:15 ` Eric W. Biederman
  2004-05-12 16:39 ` Rob Landley
  8 siblings, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2004-05-08 22:48 UTC (permalink / raw)
  To: Jörn Engel
  Cc: linux-kernel, Jamie Lokier, Eric W. Biederman, Denis Vlasenko

Hi!

> Couldn't sleep last night and finished a first complete version of
> cowlinks, code-named MAD COW.  It is still based on the stupid old
> design with a flag to distinguish between regular hard links and
> cowlinks.  Please don't comment on that design, it's just a proof of
> concept.

> Patches are against 2.6.5 but most things should apply to other 2.6
> kernel without too much trouble.
> 
> 1 generic_sendpage	- allow sendfile with ext[23] files as target
> 2 sendfile		- introduce vfs_sendfile for in-kernel use
> 3 copyfile		- new copyfile() system call

Well, up to "3" it seems usefull on its own. You might attempt to
merge that.

namei.c: you realy don't want to #include in the middle of .c file.

vfs_unlink followed by BUG_ON(error)... that's definitely wrong. In
case of disk error, you might get error on unlink; but you should not
BUG() on that. Perhaps copyfile() should be specified as "may leave
part of copy of target on disk in case of error"?

								Pavel
-- 
Horseback riding is like software...
...vgf orggre jura vgf serr.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-08 22:10   ` Pavel Machek
@ 2004-05-09 14:09     ` Denis Vlasenko
  2004-05-09 21:53       ` Pavel Machek
  0 siblings, 1 reply; 36+ messages in thread
From: Denis Vlasenko @ 2004-05-09 14:09 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Jörn Engel, linux-kernel, Jamie Lokier, Eric W. Biederman

On Sunday 09 May 2004 01:10, Pavel Machek wrote:
> Hi!
>
> > That is probably unfixable now, but you can avoid making similar
> > error. Provide is_cowlinked(fd1,fd2) syscall. Pity you will
> > have to use different inode numbers for cowlinks (due to tar/cp),
> > and this won't fly:
>
> is_cowlinked does not fly, either. For n files, you have to do O(n^2)
> calls to find those that are linked.

Hm, let me think about it. diff does not need to check all permutations,
it checks only those two which it needs to compare.

IMHO "inodes done right" is something like this: if inode numbers are
different, then files are not hardlinked/cowlinked. If they are the same,
check is_hardlinked(a,b) or is_cowlinked(a,b) to find out.

This beats O(n^2) argument.

But this is non-POSIX, would not be accepted.

I don't know how to handle this now. Introducing cow-inode number
with semantic "cowino1==cowino2 => files are cowlinked" is
ugly and won't deal with per-block cow. Sooner or later someone
will want to have per block cow. Think about cow'ing multi-gigabyte
database files for checkpointing/backup purposes...

> You want get_cowlinked_id which can return -1 "I do not know".

Is this the same to my "cow-inode number" concept above?
--
vda


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-09 14:09     ` Denis Vlasenko
@ 2004-05-09 21:53       ` Pavel Machek
  2004-05-10 15:44         ` Jörn Engel
  2004-05-12 20:29         ` Rob Landley
  0 siblings, 2 replies; 36+ messages in thread
From: Pavel Machek @ 2004-05-09 21:53 UTC (permalink / raw)
  To: Denis Vlasenko
  Cc: Jörn Engel, linux-kernel, Jamie Lokier, Eric W. Biederman

Hi!

> > > That is probably unfixable now, but you can avoid making similar
> > > error. Provide is_cowlinked(fd1,fd2) syscall. Pity you will
> > > have to use different inode numbers for cowlinks (due to tar/cp),
> > > and this won't fly:
> >
> > is_cowlinked does not fly, either. For n files, you have to do O(n^2)
> > calls to find those that are linked.
> 
> Hm, let me think about it. diff does not need to check all permutations,
> it checks only those two which it needs to compare.

And tar?

> IMHO "inodes done right" is something like this: if inode numbers are
> different, then files are not hardlinked/cowlinked. If they are the same,
> check is_hardlinked(a,b) or is_cowlinked(a,b) to find out.
> 
> This beats O(n^2) argument.
> 
> But this is non-POSIX, would not be accepted.

I do not see how this matters. cowlinks are already non-POSIX. If
is_hardlinked() always returns 1, but is_cowlinked() sometimes returns
something else, you are still "as much POSIX as possible".

> I don't know how to handle this now. Introducing cow-inode number
> with semantic "cowino1==cowino2 => files are cowlinked" is
> ugly and won't deal with per-block cow. Sooner or later someone
> will want to have per block cow. Think about cow'ing multi-gigabyte
> database files for checkpointing/backup purposes...

Well, if only block 17 is cowlink-shared between two files, I guess
userspace does not want to know... And I think that cow-inode number
*can* handle all other cases.

Oh, get_cow_inode() should really be allowed to fail in some usefull
way, so that filesystems do not have to implement it if its hard for
them.

> > You want get_cowlinked_id which can return -1 "I do not know".
> 
> Is this the same to my "cow-inode number" concept above?

Yes.

-- 
Horseback riding is like software...
...vgf orggre jura vgf serr.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
                   ` (6 preceding siblings ...)
  2004-05-08 22:48 ` Pavel Machek
@ 2004-05-10  5:15 ` Eric W. Biederman
  2004-05-10 15:59   ` Jörn Engel
  2004-05-12 16:39 ` Rob Landley
  8 siblings, 1 reply; 36+ messages in thread
From: Eric W. Biederman @ 2004-05-10  5:15 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-kernel, Jamie Lokier, Pavel Machek, Denis Vlasenko

Jörn Engel <joern@wohnheim.fh-wedel.de> writes:
> 3 copyfile		- new copyfile() system call
> http://wohnheim.fh-wedel.de/~joern/cowlink/

Question about sys_copyfile.

Is the intention that a new file with completely new permissions
be created?

Some people have wanted a copyfile that copies all of the extra
metadata user/group/acls.

I currently see technical merit in both approaches.

Looking at the CIFS information it appears that the CopyFILE RPC
copies the permissions.  It is not at all clear about that, and
the fact it appears to copy permissions may simply be a specification
bug.  Given that FAT does not really have permissions, let alone
extended attributes it would be an easy mistake to make.

In the general case you cannot copy permissions from one file to
another, either you don't have those permissions yourself or the
target file system may not support them all.  Not copying
permissions leads to a simpler implementation with the burden
of the work left to user space.  What is not done is a loop through
the extended attributes.

Eric


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-09 21:53       ` Pavel Machek
@ 2004-05-10 15:44         ` Jörn Engel
  2004-05-10 15:51           ` Pavel Machek
  2004-05-12  0:26           ` Jamie Lokier
  2004-05-12 20:29         ` Rob Landley
  1 sibling, 2 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-10 15:44 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Denis Vlasenko, linux-kernel, Jamie Lokier, Eric W. Biederman

On Sun, 9 May 2004 23:53:51 +0200, Pavel Machek wrote:
> 
> I do not see how this matters. cowlinks are already non-POSIX. If
> is_hardlinked() always returns 1, but is_cowlinked() sometimes returns
> something else, you are still "as much POSIX as possible".

Assuming that POSIX is generally pretty sane, "as sane as possible" is
good enough for me.  If POSIX is insane in one way or another,
breaking POSIX is actually a feature, imho.

> > I don't know how to handle this now. Introducing cow-inode number
> > with semantic "cowino1==cowino2 => files are cowlinked" is
> > ugly and won't deal with per-block cow. Sooner or later someone
> > will want to have per block cow. Think about cow'ing multi-gigabyte
> > database files for checkpointing/backup purposes...
> 
> Well, if only block 17 is cowlink-shared between two files, I guess
> userspace does not want to know... And I think that cow-inode number
> *can* handle all other cases.

Correct.

> Oh, get_cow_inode() should really be allowed to fail in some usefull
> way, so that filesystems do not have to implement it if its hard for
> them.

Also correct.

And while we're at it, I need a brain extention again.  Pavel? ;)

The real cowlink patch (too ugly to show it yet) is something like
this:

inode 1: real file
inode 2: cow->inode 1
inode 3: cow->inode 1

Stat() on inode 2 is implemented here:

static void cowlink_fillattr(struct inode *link, struct kstat *stat)
{
	struct super_block *sb = link->i_sb;
	ino_t ino = link->i_op->readcow(link);
	struct inode *inode = iget(sb, ino);

	stat->dev       = link->i_sb->s_dev;
	stat->ino       = ino;
	stat->mode      = link->i_mode - S_IFCOW + S_IFREG;
	stat->nlink     = link->i_nlink;
	stat->uid       = link->i_uid;
	stat->gid       = link->i_gid;
	stat->rdev      = 0;
	stat->atime     = link->i_atime;
	stat->mtime     = link->i_mtime;
	stat->ctime     = link->i_ctime;
	stat->size      = i_size_read(inode);
	stat->blocks    = inode->i_blocks;
	stat->blksize   = inode->i_blksize;
}

size, blocks and blksize has to be taken from inode 1, sure.  uid,
gid, mode and *time are from inode 2, also obvious.  dev doesn't
matter yet, but we might do cross-filesystem cowlinks someday.  nlink
should be from inode 2, so several hard links on it work, also agreed.

What about ino?  I currently return 1, so diff remains fast without
any changes.  If someone really needs the difference between inode 2
and 3, I would introduce a cstat() system call similar to lstat(),
which would return ino=2.

Is this sane?  Should it be reversed and cstat() return ino=1, while
stat returns ino=2?  I can imagine that "tar -x" would create hard
links for every cowlink that "tar -c" saw, but I'm not sure yet.

Jörn

-- 
They laughed at Galileo.  They laughed at Copernicus.  They laughed at
Columbus. But remember, they also laughed at Bozo the Clown.
-- unknown

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-10 15:44         ` Jörn Engel
@ 2004-05-10 15:51           ` Pavel Machek
  2004-05-10 15:56             ` Jörn Engel
  2004-05-12  0:26           ` Jamie Lokier
  1 sibling, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2004-05-10 15:51 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Denis Vlasenko, linux-kernel, Jamie Lokier, Eric W. Biederman

Hi!

> > Oh, get_cow_inode() should really be allowed to fail in some usefull
> > way, so that filesystems do not have to implement it if its hard for
> > them.
> 
> Also correct.
> 
> And while we're at it, I need a brain extention again.  Pavel? ;)
> 
> The real cowlink patch (too ugly to show it yet) is something like
> this:
> 
> inode 1: real file
> inode 2: cow->inode 1
> inode 3: cow->inode 1
> 
> Stat() on inode 2 is implemented here:
> 
> static void cowlink_fillattr(struct inode *link, struct kstat *stat)
> {
> 	struct super_block *sb = link->i_sb;
> 	ino_t ino = link->i_op->readcow(link);
> 	struct inode *inode = iget(sb, ino);
> 
> 	stat->dev       = link->i_sb->s_dev;
> 	stat->ino       = ino;
> 	stat->mode      = link->i_mode - S_IFCOW + S_IFREG;
> 	stat->nlink     = link->i_nlink;
> 	stat->uid       = link->i_uid;
> 	stat->gid       = link->i_gid;
> 	stat->rdev      = 0;
> 	stat->atime     = link->i_atime;
> 	stat->mtime     = link->i_mtime;
> 	stat->ctime     = link->i_ctime;
> 	stat->size      = i_size_read(inode);
> 	stat->blocks    = inode->i_blocks;
> 	stat->blksize   = inode->i_blksize;
> }
> 
> size, blocks and blksize has to be taken from inode 1, sure.  uid,
> gid, mode and *time are from inode 2, also obvious.  dev doesn't
> matter yet, but we might do cross-filesystem cowlinks someday.  nlink
> should be from inode 2, so several hard links on it work, also agreed.
> 
> What about ino?  I currently return 1, so diff remains fast without
> any changes.  If someone really needs the difference between inode 2
> and 3, I would introduce a cstat() system call similar to lstat(),
> which would return ino=2.

I think you need to return 2, otherwise inode numbers change when
cowling is broken.... and that would be bad.

Aha.. That is another neccessary property for get_cow_inode():
cow_inode can change, any time, unlike normal inode.

								Pavel
-- 
Horseback riding is like software...
...vgf orggre jura vgf serr.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-08 22:48 ` Pavel Machek
@ 2004-05-10 15:53   ` Jörn Engel
  2004-05-10 19:26     ` Jan Harkes
  0 siblings, 1 reply; 36+ messages in thread
From: Jörn Engel @ 2004-05-10 15:53 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-kernel, Jamie Lokier, Eric W. Biederman, Denis Vlasenko

On Sun, 9 May 2004 00:48:35 +0200, Pavel Machek wrote:
> 
> > Couldn't sleep last night and finished a first complete version of
> > cowlinks, code-named MAD COW.  It is still based on the stupid old
> > design with a flag to distinguish between regular hard links and
> > cowlinks.  Please don't comment on that design, it's just a proof of
> > concept.
> 
> > Patches are against 2.6.5 but most things should apply to other 2.6
> > kernel without too much trouble.
> > 
> > 1 generic_sendpage	- allow sendfile with ext[23] files as target
> > 2 sendfile		- introduce vfs_sendfile for in-kernel use
> > 3 copyfile		- new copyfile() system call
> 
> Well, up to "3" it seems usefull on its own. You might attempt to
> merge that.

True.  Right now I'm just lazy and prefer to keep them under my
control.  Maybe if 2.7 was already opened.

> namei.c: you realy don't want to #include in the middle of .c file.

Actually, I do.  I'm still not convinced that all the low-level
read/write functions need a struct file* as an argument, that looks
like a really ugly hack to help nfs and some others.  And if those can
go, the #include will go as well.  So the ugly line serves as a
reminder for a different problem. :)

On the other hand, if you can convince me that there is no way to
avoid passing the struct file*, then I agree and the #include will
move up.

> vfs_unlink followed by BUG_ON(error)... that's definitely wrong. In
> case of disk error, you might get error on unlink; but you should not
> BUG() on that. Perhaps copyfile() should be specified as "may leave
> part of copy of target on disk in case of error"?

Yes, makes sense.  I will keep the BUG_ON for a while as a reminder to
really think through all the consequences, but most likely, I will
just follow your suggestion.

A real problem is that copyfile() has all errno's from create(),
sendfile() and unlink() combined, which doesn't make error handling in
userspace easy.  "It could be this, that or another error" is the kind
of mess I always hated about Windows, so I should try to do a little
better.

Jörn

-- 
It's not whether you win or lose, it's how you place the blame.
-- unknown

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-10 15:51           ` Pavel Machek
@ 2004-05-10 15:56             ` Jörn Engel
  0 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-10 15:56 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Denis Vlasenko, linux-kernel, Jamie Lokier, Eric W. Biederman

On Mon, 10 May 2004 17:51:27 +0200, Pavel Machek wrote:
> > 
> > What about ino?  I currently return 1, so diff remains fast without
> > any changes.  If someone really needs the difference between inode 2
> > and 3, I would introduce a cstat() system call similar to lstat(),
> > which would return ino=2.
> 
> I think you need to return 2, otherwise inode numbers change when
> cowling is broken.... and that would be bad.

Makes sense.  Damn, now I gotta touch diff as well. :(

> Aha.. That is another neccessary property for get_cow_inode():
> cow_inode can change, any time, unlike normal inode.

Correct.  Still, I prefer cstat(), even though almost all information
is identical to stat.  But the device might change as well, one day.

Jörn

-- 
With a PC, I always felt limited by the software available. On Unix, 
I am limited only by my knowledge.
-- Peter J. Schoenster

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-10  5:15 ` Eric W. Biederman
@ 2004-05-10 15:59   ` Jörn Engel
  0 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-10 15:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Jamie Lokier, Pavel Machek, Denis Vlasenko

On Sun, 9 May 2004 23:15:33 -0600, Eric W. Biederman wrote:
> Jörn Engel <joern@wohnheim.fh-wedel.de> writes:
> > 3 copyfile		- new copyfile() system call
> > http://wohnheim.fh-wedel.de/~joern/cowlink/
> 
> Question about sys_copyfile.
> 
> Is the intention that a new file with completely new permissions
> be created?
> 
> Some people have wanted a copyfile that copies all of the extra
> metadata user/group/acls.
> 
> I currently see technical merit in both approaches.
> 
> Looking at the CIFS information it appears that the CopyFILE RPC
> copies the permissions.  It is not at all clear about that, and
> the fact it appears to copy permissions may simply be a specification
> bug.  Given that FAT does not really have permissions, let alone
> extended attributes it would be an easy mistake to make.
> 
> In the general case you cannot copy permissions from one file to
> another, either you don't have those permissions yourself or the
> target file system may not support them all.  Not copying
> permissions leads to a simpler implementation with the burden
> of the work left to user space.  What is not done is a loop through
> the extended attributes.

Unless someone finds a good reason to change this, I prefer to create
new permissions and not copy the acl's.  That way, I can also
copyfile() a file belonging to someone else, as long as I have read
access to it.

Jörn

-- 
Don't worry about people stealing your ideas. If your ideas are any good,
you'll have to ram them down people's throats.
-- Howard Aiken quoted by Ken Iverson quoted by Jim Horning quoted by
   Raph Levien, 1979

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-10 15:53   ` Jörn Engel
@ 2004-05-10 19:26     ` Jan Harkes
  2004-05-11 10:02       ` Jörn Engel
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Harkes @ 2004-05-10 19:26 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Pavel Machek, linux-kernel, Jamie Lokier, Eric W. Biederman,
	Denis Vlasenko

On Mon, May 10, 2004 at 05:53:59PM +0200, Jörn Engel wrote:
> A real problem is that copyfile() has all errno's from create(),
> sendfile() and unlink() combined, which doesn't make error handling in
> userspace easy.  "It could be this, that or another error" is the kind
> of mess I always hated about Windows, so I should try to do a little
> better.

Well, if you leave the create and unlink up to the application and
simply pass open filedescriptors to copyfile... But then it would be
equivalent to your new sendfile.

Copyfile can trivially be implemented in libc. I don't see why it would
have to be a system call. If a network filesystem wants to optimize the
file copying it could do this based on the sendfile data. If source and
destination are within the same filesystem and we're copying the whole
file starting at offset 0, send a copyfile RPC.

Jan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-10 19:26     ` Jan Harkes
@ 2004-05-11 10:02       ` Jörn Engel
  2004-05-11 14:08         ` Jan Harkes
  2004-05-11 15:40         ` Steve French
  0 siblings, 2 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-11 10:02 UTC (permalink / raw)
  To: Jan Harkes; +Cc: Steve French, linux-kernel

On Mon, 10 May 2004 15:26:02 -0400, Jan Harkes wrote:
> On Mon, May 10, 2004 at 05:53:59PM +0200, Jörn Engel wrote:
> > A real problem is that copyfile() has all errno's from create(),
> > sendfile() and unlink() combined, which doesn't make error handling in
> > userspace easy.  "It could be this, that or another error" is the kind
> > of mess I always hated about Windows, so I should try to do a little
> > better.
> 
> Well, if you leave the create and unlink up to the application and
> simply pass open filedescriptors to copyfile... But then it would be
> equivalent to your new sendfile.
> 
> Copyfile can trivially be implemented in libc. I don't see why it would
> have to be a system call. If a network filesystem wants to optimize the
> file copying it could do this based on the sendfile data. If source and
> destination are within the same filesystem and we're copying the whole
> file starting at offset 0, send a copyfile RPC.

Can you explain this to Steve?  I'm still quite clueless about network
filesystems, but it sounded as if such an optimization was impossible
to do in cifs without a combined create/copy/unlink_on_error system
call.

If your suggestion works and the network filesystems can be changed to
work independently of a struct file*, I agree with you that copyfile()
is a stupid idea and should be forgotten.

Jörn

-- 
Courage is not the absence of fear, but rather the judgement that
something else is more important than fear.
-- Ambrose Redmoon

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-11 10:02       ` Jörn Engel
@ 2004-05-11 14:08         ` Jan Harkes
  2004-05-11 14:18           ` Jan Harkes
                             ` (2 more replies)
  2004-05-11 15:40         ` Steve French
  1 sibling, 3 replies; 36+ messages in thread
From: Jan Harkes @ 2004-05-11 14:08 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Steve French, linux-kernel

On Tue, May 11, 2004 at 12:02:32PM +0200, Jörn Engel wrote:
> > Copyfile can trivially be implemented in libc. I don't see why it would
> > have to be a system call. If a network filesystem wants to optimize the
> > file copying it could do this based on the sendfile data. If source and
> > destination are within the same filesystem and we're copying the whole
> > file starting at offset 0, send a copyfile RPC.
> 
> Can you explain this to Steve?  I'm still quite clueless about network
> filesystems, but it sounded as if such an optimization was impossible
> to do in cifs without a combined create/copy/unlink_on_error system
> call.
> 
> If your suggestion works and the network filesystems can be changed to
> work independently of a struct file*, I agree with you that copyfile()
> is a stupid idea and should be forgotten.

I would probably do it by overriding the file_operations.sendfile
function. A first approximation of a possible implementation follows. I
went a bit crazy on the comments. The only problem is that the type of
target is unknown, block/loop.c and nfsd/vfs.c are using sendfile to
to send to something that is not a struct file.

Jan

int my_file_sendfile(struct file *in_file, loff_t *ppos, size_t count,
		     read_actor_t actor, void __user *target)
{
    struct file *out_file = NULL;

    /* We have to check the read_actor callback function to see if the
     * target actually points at a struct file. */
    if (actor != file_send_actor)
	goto copy_local;

    /* are both source and destination within the same file system
     * mountpoint? */
    if (in_file->f_dentry->d_inode->i_sb != out_file->f_dentry->d_inode->i_sb)
	goto copy_local;

    /* are we copying the entire source file? */
    if (*ppos != 0 || count != in_file->f_dentry->d_inode->i_size)
	goto copy_local;

    /* and are we at least overwriting the complete destination file?
       (alternatively we could check that the out_file is currently empty) */
    if (out_file->f_pos != 0 || out_file->f_dentry->d_inode->i_size > count)
	goto copy_local;

    /* we are in luck and can send a copyfile rpc */
    int err = make_copyfile_rpc(in_file, out_file);

    /* we probably should force a refresh of out_file as it changed on
     * the server and not locally */
    MY_I(out_file->f_dentry->d_inode)->time = 0;
    return err;

copy_local:
    /* no luck, we're sending to something that isn't a file or in the
     * same filesystem, or we only copy a partial file. any case, we
     * have to perform the copy locally. */
    return generic_file_sendfile(in_file, ppos, count, actor, target);
}


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-11 14:08         ` Jan Harkes
@ 2004-05-11 14:18           ` Jan Harkes
  2004-05-11 14:33           ` Jörn Engel
  2004-05-21 23:23           ` Rob Landley
  2 siblings, 0 replies; 36+ messages in thread
From: Jan Harkes @ 2004-05-11 14:18 UTC (permalink / raw)
  To: Jörn Engel, Steve French, linux-kernel

Forgot a somewhat useful line there,

On Tue, May 11, 2004 at 10:08:53AM -0400, Jan Harkes wrote:
> int my_file_sendfile(struct file *in_file, loff_t *ppos, size_t count,
> 		     read_actor_t actor, void __user *target)
> {
>     struct file *out_file = NULL;
> 
>     /* We have to check the read_actor callback function to see if the
>      * target actually points at a struct file. */
>     if (actor != file_send_actor)
> 	goto copy_local;

      out_file = (struct file *)target;

>     /* are both source and destination within the same file system
>      * mountpoint? */
>     if (in_file->f_dentry->d_inode->i_sb != out_file->f_dentry->d_inode->i_sb)
> 	goto copy_local;


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-11 14:08         ` Jan Harkes
  2004-05-11 14:18           ` Jan Harkes
@ 2004-05-11 14:33           ` Jörn Engel
  2004-05-21 23:23           ` Rob Landley
  2 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-11 14:33 UTC (permalink / raw)
  To: Steve French, linux-kernel

On Tue, 11 May 2004 10:08:53 -0400, Jan Harkes wrote:
> On Tue, May 11, 2004 at 12:02:32PM +0200, Jörn Engel wrote:
> > > Copyfile can trivially be implemented in libc. I don't see why it would
> > > have to be a system call. If a network filesystem wants to optimize the
> > > file copying it could do this based on the sendfile data. If source and
> > > destination are within the same filesystem and we're copying the whole
> > > file starting at offset 0, send a copyfile RPC.
> > 
> > Can you explain this to Steve?  I'm still quite clueless about network
> > filesystems, but it sounded as if such an optimization was impossible
> > to do in cifs without a combined create/copy/unlink_on_error system
> > call.
> > 
> > If your suggestion works and the network filesystems can be changed to
> > work independently of a struct file*, I agree with you that copyfile()
> > is a stupid idea and should be forgotten.
> 
> I would probably do it by overriding the file_operations.sendfile
> function. A first approximation of a possible implementation follows. I
> went a bit crazy on the comments. The only problem is that the type of
> target is unknown, block/loop.c and nfsd/vfs.c are using sendfile to
> to send to something that is not a struct file.

Looks as if it could work.  Steve, can you verify this?

Jörn

-- 
A surrounded army must be given a way out.
-- Sun Tzu

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-11 10:02       ` Jörn Engel
  2004-05-11 14:08         ` Jan Harkes
@ 2004-05-11 15:40         ` Steve French
  2004-05-11 15:58           ` Jörn Engel
  1 sibling, 1 reply; 36+ messages in thread
From: Steve French @ 2004-05-11 15:40 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Jan Harkes, linux-kernel

It would not be helpful to take a userspace request to perform a file
(or directory) copy operation and break it into open/sendfile/close by
passing file handles to the network filesystem and have this work for
SMB/CIFS - there is no equivalent network protocol operation.  It also
makes the operation much, much harder to make atomic (since two systems
are involved) and makes error handling and recovery for network
filesystems much harder since inconsistent client and server state have
to be considered if the copy operation is broken into pieces on the
clien (it is also slower - a single copy operation on the network is the
absolute optimal case - minimizes the expensive network latency - the
roundtrip delay - open/sendfile/close sends at a minimum three times as
many but likely four with an extra lookup or two)

Currently copy file (or copy directory for that matter) as both speced
(and is implemented in various servers) in the SMB/CIFS network
filesystem protocol takes in effect four parameters (there is no handle
based file copy):

a source pathname,  and source export (actually SMB tree identifier for
a share, but in effect the same thing) 
a target pathname, and target export (actually SMB tree identifier for a
share, but in effect the same thing) 
And the shares (exports) referenced have to be on the same server

Trying to ignore the 1st file open

On Tue, 2004-05-11 at 05:02, Jörn Engel wrote:
> On Mon, 10 May 2004 15:26:02 -0400, Jan Harkes wrote:
> > On Mon, May 10, 2004 at 05:53:59PM +0200, Jörn Engel wrote:
> > > A real problem is that copyfile() has all errno's from create(),
> > > sendfile() and unlink() combined, which doesn't make error handling in
> > > userspace easy.  "It could be this, that or another error" is the kind
> > > of mess I always hated about Windows, so I should try to do a little
> > > better.
> > 
> > Well, if you leave the create and unlink up to the application and
> > simply pass open filedescriptors to copyfile... But then it would be
> > equivalent to your new sendfile.
> > 
> > Copyfile can trivially be implemented in libc. I don't see why it would
> > have to be a system call. If a network filesystem wants to optimize the
> > file copying it could do this based on the sendfile data. If source and
> > destination are within the same filesystem and we're copying the whole
> > file starting at offset 0, send a copyfile RPC.
> 
> Can you explain this to Steve?  I'm still quite clueless about network
> filesystems, but it sounded as if such an optimization was impossible
> to do in cifs without a combined create/copy/unlink_on_error system
> call.
> 
> If your suggestion works and the network filesystems can be changed to
> work independently of a struct file*, I agree with you that copyfile()
> is a stupid idea and should be forgotten.
> 
> Jörn


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-11 15:40         ` Steve French
@ 2004-05-11 15:58           ` Jörn Engel
  0 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-11 15:58 UTC (permalink / raw)
  To: Steve French; +Cc: Jan Harkes, linux-kernel

On Tue, 11 May 2004 10:40:49 -0500, Steve French wrote:
> 
> It would not be helpful to take a userspace request to perform a file
> (or directory) copy operation and break it into open/sendfile/close by
> passing file handles to the network filesystem and have this work for
> SMB/CIFS - there is no equivalent network protocol operation.  It also
> makes the operation much, much harder to make atomic (since two systems
> are involved) and makes error handling and recovery for network
> filesystems much harder since inconsistent client and server state have
> to be considered if the copy operation is broken into pieces on the
> clien (it is also slower - a single copy operation on the network is the
> absolute optimal case - minimizes the expensive network latency - the
> roundtrip delay - open/sendfile/close sends at a minimum three times as
> many but likely four with an extra lookup or two)
> 
> Currently copy file (or copy directory for that matter) as both speced
> (and is implemented in various servers) in the SMB/CIFS network
> filesystem protocol takes in effect four parameters (there is no handle
> based file copy):
> 
> a source pathname,  and source export (actually SMB tree identifier for
> a share, but in effect the same thing) 
> a target pathname, and target export (actually SMB tree identifier for a
> share, but in effect the same thing) 
> And the shares (exports) referenced have to be on the same server

Or in short, copyfile makes sense for smbfs/cifs.  The question
whether my hack can be cleaned up enough to get past Al Viro is still
open, though. :)

Jörn

-- 
Homo Sapiens is a goal, not a description.
-- unknown

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-10 15:44         ` Jörn Engel
  2004-05-10 15:51           ` Pavel Machek
@ 2004-05-12  0:26           ` Jamie Lokier
  2004-05-13 10:56             ` Jörn Engel
  1 sibling, 1 reply; 36+ messages in thread
From: Jamie Lokier @ 2004-05-12  0:26 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Pavel Machek, Denis Vlasenko, linux-kernel, Eric W. Biederman

Jörn Engel wrote:
> What about ino?  I currently return 1, so diff remains fast without
> any changes.  If someone really needs the difference between inode 2
> and 3, I would introduce a cstat() system call similar to lstat(),
> which would return ino=2.
> 
> Is this sane?  Should it be reversed and cstat() return ino=1, while
> stat returns ino=2?  I can imagine that "tar -x" would create hard
> links for every cowlink that "tar -c" saw, but I'm not sure yet.

I think it should be reversed.

One very useful application for cowlinks is for virtual machine (UML)
and chroot jail setups, where an entire filesystem tree is copied
perhaps hundreds of times on a single disk.  I'm surprised we didn't
think of this earlier, as it's potentially one of the most useful
applications for cowlinks.

In that scenario, cowlinks would save enormous amounts of storage and
potentially save memory too.  However to be useful at all, they'd need
to have accurate POSIX semantics: that is, cowlinks must behave very
much as a storage optimisation only.

That means stat() should return ino==2.

-- Jamie

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
                   ` (7 preceding siblings ...)
  2004-05-10  5:15 ` Eric W. Biederman
@ 2004-05-12 16:39 ` Rob Landley
  2004-05-20 13:49   ` Pavel Machek
  8 siblings, 1 reply; 36+ messages in thread
From: Rob Landley @ 2004-05-12 16:39 UTC (permalink / raw)
  To: Jörn Engel, linux-kernel

On Thursday 06 May 2004 08:17, Jörn Engel wrote:
> Couldn't sleep last night and finished a first complete version of
> cowlinks, code-named MAD COW.  It is still based on the stupid old
> design with a flag to distinguish between regular hard links and
> cowlinks.  Please don't comment on that design, it's just a proof of
> concept.

Catching up on some really old mail, I thought I'd ask:

For years now I've wanted to use a sendfile variant to tell the system to 
connect two filehandles from userspace.  Not just web servers want to 
marshall data from one filehandle into another, things like netcat want to do 
it between a pipe and a network connection, and I've wrote a couple of data 
dispatcher daemons that wanted to do it between two network connections.

Unfortunately, sendfile didn't work generically when I tried it (back under 
2.4).  Would this infrastructure be a step in the right direction to 
eliminate gratuitous poll loops (where nobody but me EVER seems to get the 
"shutdown just one half of the connection" thing right.  My netcat can handle 
"echo 'GET /' | netcat www.slashdot.org 80".  The standard netcat can't.  
Yes, I plan to fix the one in busybox eventually...)

Rob



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-09 21:53       ` Pavel Machek
  2004-05-10 15:44         ` Jörn Engel
@ 2004-05-12 20:29         ` Rob Landley
  1 sibling, 0 replies; 36+ messages in thread
From: Rob Landley @ 2004-05-12 20:29 UTC (permalink / raw)
  To: Pavel Machek, Denis Vlasenko
  Cc: Jörn Engel, linux-kernel, Jamie Lokier, Eric W. Biederman

On Sunday 09 May 2004 16:53, Pavel Machek wrote:

> > I don't know how to handle this now. Introducing cow-inode number
> > with semantic "cowino1==cowino2 => files are cowlinked" is
> > ugly and won't deal with per-block cow. Sooner or later someone
> > will want to have per block cow. Think about cow'ing multi-gigabyte
> > database files for checkpointing/backup purposes...
>
> Well, if only block 17 is cowlink-shared between two files, I guess
> userspace does not want to know... And I think that cow-inode number
> *can* handle all other cases.

I remember somebody had a toy to find out the file block ranges (on ext2/ext3 
anyway), which could detect fragmentation and holes and such.  Presumably, 
whatever they did would already detect per-block cowlinks, if anybody 
actually cared...

Rob



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-12  0:26           ` Jamie Lokier
@ 2004-05-13 10:56             ` Jörn Engel
  0 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-13 10:56 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Pavel Machek, Denis Vlasenko, linux-kernel, Eric W. Biederman

On Wed, 12 May 2004 01:26:06 +0100, Jamie Lokier wrote:
> Jörn Engel wrote:
> > What about ino?  I currently return 1, so diff remains fast without
> > any changes.  If someone really needs the difference between inode 2
> > and 3, I would introduce a cstat() system call similar to lstat(),
> > which would return ino=2.
> > 
> > Is this sane?  Should it be reversed and cstat() return ino=1, while
> > stat returns ino=2?  I can imagine that "tar -x" would create hard
> > links for every cowlink that "tar -c" saw, but I'm not sure yet.
> 
> I think it should be reversed.

Aye.

> One very useful application for cowlinks is for virtual machine (UML)
> and chroot jail setups, where an entire filesystem tree is copied
> perhaps hundreds of times on a single disk.  I'm surprised we didn't
> think of this earlier, as it's potentially one of the most useful
> applications for cowlinks.
> 
> In that scenario, cowlinks would save enormous amounts of storage and
> potentially save memory too.  However to be useful at all, they'd need
> to have accurate POSIX semantics: that is, cowlinks must behave very
> much as a storage optimisation only.
> 
> That means stat() should return ino==2.

Hmm, true.  Up 'till now, that was done with disk images and block
based diffs/snapshots.  Nice application.

Jörn

-- 
He who knows others is wise.
He who knows himself is enlightened.
-- Lao Tsu

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-12 16:39 ` Rob Landley
@ 2004-05-20 13:49   ` Pavel Machek
  2004-05-25 21:55     ` Rob Landley
  0 siblings, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2004-05-20 13:49 UTC (permalink / raw)
  To: Rob Landley; +Cc: Jörn Engel, linux-kernel

Hi!

> Catching up on some really old mail, I thought I'd ask:
> 
> For years now I've wanted to use a sendfile variant to tell the system to 
> connect two filehandles from userspace.  Not just web servers want to 
> marshall data from one filehandle into another, things like netcat want to do 
> it between a pipe and a network connection, and I've wrote a couple of data 
> dispatcher daemons that wanted to do it between two network connections.
> 
> Unfortunately, sendfile didn't work generically when I tried it (back under 
> 2.4).  Would this infrastructure be a step in the right direction to 
> eliminate gratuitous poll loops (where nobody but me EVER seems to get the 
> "shutdown just one half of the connection" thing right.  My netcat can handle 
> "echo 'GET /' | netcat www.slashdot.org 80".  The standard netcat can't.  
> Yes, I plan to fix the one in busybox eventually...)
> 

Ugh. Yes, some syscalls like that were proposed... but to
make programming easier, you'd need asynchronous
sendfile to help you with programming, right?

				Pavel

-- 
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms         


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-11 14:08         ` Jan Harkes
  2004-05-11 14:18           ` Jan Harkes
  2004-05-11 14:33           ` Jörn Engel
@ 2004-05-21 23:23           ` Rob Landley
  2004-05-25 22:46             ` Jan Harkes
  2 siblings, 1 reply; 36+ messages in thread
From: Rob Landley @ 2004-05-21 23:23 UTC (permalink / raw)
  To: Jan Harkes, Jörn Engel; +Cc: Steve French, linux-kernel

On Tuesday 11 May 2004 09:08, Jan Harkes wrote:

> int my_file_sendfile(struct file *in_file, loff_t *ppos, size_t count,
> 		     read_actor_t actor, void __user *target)
> {
>     struct file *out_file = NULL;
>
>     /* We have to check the read_actor callback function to see if the
>      * target actually points at a struct file. */
>     if (actor != file_send_actor)
> 	goto copy_local;
>
>     /* are both source and destination within the same file system
>      * mountpoint? */
>     if (in_file->f_dentry->d_inode->i_sb !=
> out_file->f_dentry->d_inode->i_sb) goto copy_local;
>
>     /* are we copying the entire source file? */
>     if (*ppos != 0 || count != in_file->f_dentry->d_inode->i_size)
> 	goto copy_local;

Is there a race condition for i_size to change between the api getting called 
and the copy being done?  More to the point, is there some way to specify a 
count of -1 or something to easily say "to end of file"?

Rob
-- 
www.linucon.org: Linux Expo and Science Fiction Convention
October 8-10, 2004 in Austin Texas.  (I'm the con chair.)



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-20 13:49   ` Pavel Machek
@ 2004-05-25 21:55     ` Rob Landley
  2004-05-25 22:08       ` Pavel Machek
  0 siblings, 1 reply; 36+ messages in thread
From: Rob Landley @ 2004-05-25 21:55 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Jörn Engel, linux-kernel

On Thursday 20 May 2004 08:49, Pavel Machek wrote:
> Hi!

> > For years now I've wanted to use a sendfile variant to tell the system to
> > connect two filehandles from userspace.  Not just web servers want to
> > marshall data from one filehandle into another, things like netcat want
> > to do it between a pipe and a network connection, and I've wrote a couple
> > of data dispatcher daemons that wanted to do it between two network
> > connections.
> >
> > Unfortunately, sendfile didn't work generically when I tried it (back
> > under 2.4).  Would this infrastructure be a step in the right direction
> > to eliminate gratuitous poll loops (where nobody but me EVER seems to get
> > the "shutdown just one half of the connection" thing right.  My netcat
> > can handle "echo 'GET /' | netcat www.slashdot.org 80".  The standard
> > netcat can't. Yes, I plan to fix the one in busybox eventually...)
>
> Ugh. Yes, some syscalls like that were proposed... but to
> make programming easier, you'd need asynchronous
> sendfile to help you with programming, right?
>
> 				Pavel

Doesn't asynchronous sendfile has the little problem your process can exit 
before the sendfile is complete?

I'm not sure how much of a help it really is, since fork() isn't brain surgery 
if you want it to be asynchronous, and the lifetime rules are really explicit 
then.  (With a ps that does thread grouping, this isn't too bad from a 
clutter standpoint, even.  And you automatically get a SIGCHLD when the 
sendfile is complete, too...)

Of course if the syscall can make the sendfile outlive the process that fired 
it off, then by all means it sounds good.  I dunno how much extra work that 
is for the kernel, though.

I had a "star server" daemon that was allowing machines behind firewalls to 
talk to each other by ssh-ing in to a machine that wasn't behind a firewall, 
which would then route traffic between them.  It boiled down to connecting 
together incoming network sockets, which meant that each connection went 
through five processes on the star server: incoming ssh, redirector program 
that ssh ran (which sent it the data to a loopback network connection), the 
dispatcher daemon that handled the loopback connections and figured out who 
should connect to who, and then the redirector and the ssh session for the 
"outgoing" connection.  Both redirectors basically wanted to exit after 
connecting their stdin and stdout to a network socket on loopback, but 
couldn't because they had to run a poll loop to shovel data back and forth.  
Similarly the dispatcher daemon had the mother of all poll loops that 
basically implemented bidirectional sendfile by hand between pairs of network 
sockets.

Anyway, real world example of something ugly that made me want sendfile to be 
more generic.  (Netcat's another.)  I doubt the star server could have been 
cleaned up _too_ much since it was an ugly idea to begin with (on many 
levels), but my boss wanted it, and customers wanted it, so...

Rob

-- 
www.linucon.org: Linux Expo and Science Fiction Convention
October 8-10, 2004 in Austin Texas.  (I'm the con chair.)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-25 21:55     ` Rob Landley
@ 2004-05-25 22:08       ` Pavel Machek
  2004-05-25 23:16         ` Rob Landley
  0 siblings, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2004-05-25 22:08 UTC (permalink / raw)
  To: Rob Landley; +Cc: Jörn Engel, linux-kernel

Hi!

> > > For years now I've wanted to use a sendfile variant to tell the system to
> > > connect two filehandles from userspace.  Not just web servers want to
> > > marshall data from one filehandle into another, things like netcat want
> > > to do it between a pipe and a network connection, and I've wrote a couple
> > > of data dispatcher daemons that wanted to do it between two network
> > > connections.
> > >
> > > Unfortunately, sendfile didn't work generically when I tried it (back
> > > under 2.4).  Would this infrastructure be a step in the right direction
> > > to eliminate gratuitous poll loops (where nobody but me EVER seems to get
> > > the "shutdown just one half of the connection" thing right.  My netcat
> > > can handle "echo 'GET /' | netcat www.slashdot.org 80".  The standard
> > > netcat can't. Yes, I plan to fix the one in busybox eventually...)
> >
> > Ugh. Yes, some syscalls like that were proposed... but to
> > make programming easier, you'd need asynchronous
> > sendfile to help you with programming, right?
> 
> Doesn't asynchronous sendfile has the little problem your process can exit 
> before the sendfile is complete?

Hmm, it has...

> I'm not sure how much of a help it really is, since fork() isn't brain surgery 
> if you want it to be asynchronous, and the lifetime rules are really explicit 
> then.  (With a ps that does thread grouping, this isn't too bad from a 
> clutter standpoint, even.  And you automatically get a SIGCHLD when the 
> sendfile is complete, too...)

Right.

> Of course if the syscall can make the sendfile outlive the process that fired 
> it off, then by all means it sounds good.  I dunno how much extra work that 
> is for the kernel, though.

Well, it would be "interesting" to stop that sendfile then. You could
not kill it etc.

I guess async sendfile is bad idea after all.
								Pavel
-- 
When do you have heart between your knees?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-21 23:23           ` Rob Landley
@ 2004-05-25 22:46             ` Jan Harkes
  0 siblings, 0 replies; 36+ messages in thread
From: Jan Harkes @ 2004-05-25 22:46 UTC (permalink / raw)
  To: Rob Landley; +Cc: Jan Harkes, Jörn Engel, Steve French, linux-kernel

On Fri, May 21, 2004 at 06:23:12PM -0500, Rob Landley wrote:
> >     /* are we copying the entire source file? */
> >     if (*ppos != 0 || count != in_file->f_dentry->d_inode->i_size)
> > 	goto copy_local;
> 
> Is there a race condition for i_size to change between the api getting called 
> and the copy being done?  More to the point, is there some way to specify a 
> count of -1 or something to easily say "to end of file"?

I don't think so, what does the existing sendfile to a socket
implementation do?

Sure there is something you could call a race, but it doesn't seem all
that serious to me. If someone writes to the source file a couple of
seconds after the copy completes we would get the exact same situation.

The only problem here is that it is possible for the write to arrive
between the stat of the source file to get the number of bytes we want
to copy and the actual sendfile call. That would invalidate the cached
inode data and by the time call sendfile the example code falls back to
a local copy operation because count != i_size.

But if the filesystem provided a more powerful copy operation than just
copy whole file A to file B it could actually do what the user asked for
without needing to fall back on the local copy operation.

An application that wants to make sure the source file is not modified
before, during or after the copy operation, both by the local client and
possibly by any remote clients, probably should lock it with flock or
fcntl.

Jan


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-25 22:08       ` Pavel Machek
@ 2004-05-25 23:16         ` Rob Landley
  2004-05-26  0:16           ` Ian Stirling
  2004-05-26  9:52           ` Jörn Engel
  0 siblings, 2 replies; 36+ messages in thread
From: Rob Landley @ 2004-05-25 23:16 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Jörn Engel, linux-kernel

On Tuesday 25 May 2004 17:08, Pavel Machek wrote:

> > Doesn't asynchronous sendfile has the little problem your process can
> > exit before the sendfile is complete?
>
> Hmm, it has...
>
> > I'm not sure how much of a help it really is, since fork() isn't brain
> > surgery if you want it to be asynchronous, and the lifetime rules are
> > really explicit then.  (With a ps that does thread grouping, this isn't
> > too bad from a clutter standpoint, even.  And you automatically get a
> > SIGCHLD when the sendfile is complete, too...)
>
> Right.
>
> > Of course if the syscall can make the sendfile outlive the process that
> > fired it off, then by all means it sounds good.  I dunno how much extra
> > work that is for the kernel, though.
>
> Well, it would be "interesting" to stop that sendfile then. You could
> not kill it etc.

Well, logically what you're doing is redirecting an existing filehandle so it 
points to something else.  Instead of reading from this pipe, you're now 
reading from this file or from this network socket, or this other process's 
stdout.  So any intermediate processes going away is theoretically okay as 
long as the anchors at each end remain (process/filesystem/network connection 
generating the data, process/filesystem/network connection receiving the 
data).

The easy way to make the semantics work out right is that such an asynchronous 
sendfile would effectively close the file in question from the point of view 
of the process that did the sendfile.  It would pretty much have to be part 
of the semantics of any asynchronous sendfile call: welding together the two 
filehandles would behave like a single direction shutdown(2) as far as the 
process that called sendfile is concerned.  (That way, if you do an async 
sendfile in each direction, the filehandle is closed both ways, but you don't 
HAVE to if you don't want to.  You can feed data to a child process from a 
script file or something, and just deal with the responses coming back.)

This would mean that in theory the process that did the sendfile could go away 
without too much ambiguity about what should happen.  (The bits that are 
already closed from the process's point of view are unaffected by the process 
exiting.)  I dunno what's needed to clean that up in the kernel.

I also don't know if it's a good idea, because as you noticed the fire and 
forget nature of the thing means that killing it afterwards is something we 
haven't got a semantic for if the thing at the other end is NOT a pipe to a 
processes.  (We kill processes.  We don't have a kill for network connections 
or for files in the filesystem that are no longer associated with any 
process.  This is theoretically existing problem, by the way.  Check out 
SO_LINGER in man 7 socket...)

There are a number of interesting "no process involved" cases, actually.  Do 
an asynchronous sendfile from one file to another and the disk fills up; who 
do you report the error to?  And if you do a sendfile from one network socket 
to another, both of which are outbound connections to the internet, then how 
do you interrupt it later if those other systems decide to keep talking 
through your system for two days?

Possibly the rule would have to be that a filehandle must remain open in SOME 
process context somewhere, or it gets closed (even if it was in the middle of 
an asynchronous sendfile).  That would make sense, really; that's how we 
handle incoming data from a network connection when the process in question 
exits.  We don't care that the datastream had data in transit, coming from 
the network into the receive buffer.  It go bye-bye when the last process 
that had it open closes it, and the in-flight data is discarded.

That sounds like a workable semantic to me.  Opinions?


It might even be possible for other I/O on the filehandle to block until the 
async sendfile finishes, but that's getting a bit fancy in the absence of 
known real world use cases asking for that feature.  It would let you add in 
something like SIGIO to let you know when you need to close the filehandle 
and get any error code (like "disk full" or whatever).  But I'd probably hold 
off on that until somebody actually came up with a real world use case for 
it.

> I guess async sendfile is bad idea after all.

Well, one thread per sendfile does smell a little bit like the java way of 
doing things, but synchronous sendfile is the 90% solution.  I can always 
make my own wrapper to do the fork() and an atexit() hook to detach and 
daemonify if I really care.  The kernel doesn't really have to do this for 
me.  And since 2.6 already has some infrastructure to group threads so ps 
doesn't look so terrible...

My opinion is that a synchronous sendfile that works in a generic context 
would be an improvement over what's there now, and something I actually have 
a couple existing uses for.  Cleaning it up to work async can remain a to-do 
item for later (including working out under what circumstances, if any, it 
might be a good idea).

Low hanging fruit, and all that.

> 								Pavel

Rob

-- 
www.linucon.org: Linux Expo and Science Fiction Convention
October 8-10, 2004 in Austin Texas.  (I'm the con chair.)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-25 23:16         ` Rob Landley
@ 2004-05-26  0:16           ` Ian Stirling
  2004-05-26  9:52           ` Jörn Engel
  1 sibling, 0 replies; 36+ messages in thread
From: Ian Stirling @ 2004-05-26  0:16 UTC (permalink / raw)
  To: Rob Landley; +Cc: Pavel Machek, Jörn Engel, linux-kernel

Rob Landley wrote:
> On Tuesday 25 May 2004 17:08, Pavel Machek wrote:
> 
> 
>>>Doesn't asynchronous sendfile has the little problem your process can
>>>exit before the sendfile is complete?
>>
>>Hmm, it has...
>>
>>
>>>I'm not sure how much of a help it really is, since fork() isn't brain
>>>surgery if you want it to be asynchronous, and the lifetime rules are
>>>really explicit then.  (With a ps that does thread grouping, this isn't
>>>too bad from a clutter standpoint, even.  And you automatically get a
>>>SIGCHLD when the sendfile is complete, too...)
>>
>>Right.
>>
>>
>>>Of course if the syscall can make the sendfile outlive the process that
>>>fired it off, then by all means it sounds good.  I dunno how much extra
>>>work that is for the kernel, though.
>>
>>Well, it would be "interesting" to stop that sendfile then. You could
>>not kill it etc.
> 
> 
> Well, logically what you're doing is redirecting an existing filehandle so it 
> points to something else.  Instead of reading from this pipe, you're now 
> reading from this file or from this network socket, or this other process's 
> stdout.  So any intermediate processes going away is theoretically okay as 
> long as the anchors at each end remain (process/filesystem/network connection 
> generating the data, process/filesystem/network connection receiving the 
> data).
> 
> The easy way to make the semantics work out right is that such an asynchronous 
> sendfile would effectively close the file in question from the point of view 
> of the process that did the sendfile.  It would pretty much have to be part 
> of the semantics of any asynchronous sendfile call: welding together the two 
> filehandles would behave like a single direction shutdown(2) as far as the 
> process that called sendfile is concerned.  (That way, if you do an async 
> sendfile in each direction, the filehandle is closed both ways, but you don't 
> HAVE to if you don't want to.  You can feed data to a child process from a 
> script file or something, and just deal with the responses coming back.)
> 
> This would mean that in theory the process that did the sendfile could go away 
> without too much ambiguity about what should happen.  (The bits that are 
> already closed from the process's point of view are unaffected by the process 
> exiting.)  I dunno what's needed to clean that up in the kernel.
> 
> I also don't know if it's a good idea, because as you noticed the fire and 
> forget nature of the thing means that killing it afterwards is something we 
> haven't got a semantic for if the thing at the other end is NOT a pipe to a 
> processes.  (We kill processes.  We don't have a kill for network connections 
> or for files in the filesystem that are no longer associated with any 
> process.  This is theoretically existing problem, by the way.  Check out 
> SO_LINGER in man 7 socket...)
> 

A while back (2.2?) I was using a util that came with netpipes, sockdown.
Give it a socket FD, and it would close the socket.
It wasn't intended for it, but I found that it was really handy to
be able to shut other programs network connections.
For example, tin gets wedged talking to a server that won't do something, and
you just do
sockdown </proc/tin/fd/nn
And it shuts.

Mildly annoyingly, this stopped working in 2.4.
I suppose it'd be totally insane for init to inherit descriptors for open network
connections.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
  2004-05-25 23:16         ` Rob Landley
  2004-05-26  0:16           ` Ian Stirling
@ 2004-05-26  9:52           ` Jörn Engel
  1 sibling, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-26  9:52 UTC (permalink / raw)
  To: Rob Landley; +Cc: Pavel Machek, linux-kernel

On Tue, 25 May 2004 18:16:05 -0500, Rob Landley wrote:
> 
> [a lot]

Sorry, you didn't have the time to make it short and I don't have the
time to read it all. ;)

Anyway, it sounds as if you really want a connect_files() or similar
that shortens a chain of pipes.  Similar to removing an element from a
doubly linked list.  Should be possible and not too hard either, iff
you can point to a decent user.

Right now, I don't care at all since it is just an optimization for
what's possible in userspace already and is rarely used.  Root of all
evil an such...

Jörn

-- 
The strong give up and move away, while the weak give up and stay.
-- unknown

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2004-05-26  9:52 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
2004-05-06 13:18 ` [PATCH COW] generic_sendpage Jörn Engel
2004-05-06 13:19 ` [PATCH COW] sendfile Jörn Engel
2004-05-06 13:19 ` [PATCH COW] copyfile Jörn Engel
2004-05-06 13:20 ` [PATCH COW] lock_flags Jörn Engel
2004-05-06 13:21 ` [PATCH COW] MAD COW Jörn Engel
2004-05-08 13:45 ` [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Denis Vlasenko
2004-05-08 22:10   ` Pavel Machek
2004-05-09 14:09     ` Denis Vlasenko
2004-05-09 21:53       ` Pavel Machek
2004-05-10 15:44         ` Jörn Engel
2004-05-10 15:51           ` Pavel Machek
2004-05-10 15:56             ` Jörn Engel
2004-05-12  0:26           ` Jamie Lokier
2004-05-13 10:56             ` Jörn Engel
2004-05-12 20:29         ` Rob Landley
2004-05-08 22:48 ` Pavel Machek
2004-05-10 15:53   ` Jörn Engel
2004-05-10 19:26     ` Jan Harkes
2004-05-11 10:02       ` Jörn Engel
2004-05-11 14:08         ` Jan Harkes
2004-05-11 14:18           ` Jan Harkes
2004-05-11 14:33           ` Jörn Engel
2004-05-21 23:23           ` Rob Landley
2004-05-25 22:46             ` Jan Harkes
2004-05-11 15:40         ` Steve French
2004-05-11 15:58           ` Jörn Engel
2004-05-10  5:15 ` Eric W. Biederman
2004-05-10 15:59   ` Jörn Engel
2004-05-12 16:39 ` Rob Landley
2004-05-20 13:49   ` Pavel Machek
2004-05-25 21:55     ` Rob Landley
2004-05-25 22:08       ` Pavel Machek
2004-05-25 23:16         ` Rob Landley
2004-05-26  0:16           ` Ian Stirling
2004-05-26  9:52           ` Jörn Engel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).