LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
@ 2004-05-06 13:17 Jörn Engel
2004-05-06 13:18 ` [PATCH COW] generic_sendpage Jörn Engel
` (8 more replies)
0 siblings, 9 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-06 13:17 UTC (permalink / raw)
To: linux-kernel
Cc: Jamie Lokier, Pavel Machek, Eric W. Biederman, Denis Vlasenko
Couldn't sleep last night and finished a first complete version of
cowlinks, code-named MAD COW. It is still based on the stupid old
design with a flag to distinguish between regular hard links and
cowlinks. Please don't comment on that design, it's just a proof of
concept.
Patches are against 2.6.5 but most things should apply to other 2.6
kernel without too much trouble.
1 generic_sendpage - allow sendfile with ext[23] files as target
2 sendfile - introduce vfs_sendfile for in-kernel use
3 copyfile - new copyfile() system call
4 lock_flags - old cruft, just ignore it
5 madcow - the MAD COW itself
Patches 1-3 will stay, 4 will be remove and 5 replaced by a better
design over time. I've also set up a webpage for it:
http://wohnheim.fh-wedel.de/~joern/cowlink/
Maybe that should be converted into a wiki so someone with better
taste than myself can improve it.
Jörn
--
This above all: to thine own self be true.
-- Shakespeare
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH COW] generic_sendpage
2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
@ 2004-05-06 13:18 ` Jörn Engel
2004-05-06 13:19 ` [PATCH COW] sendfile Jörn Engel
` (7 subsequent siblings)
8 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-06 13:18 UTC (permalink / raw)
To: linux-kernel
Patch 1.
Jörn
--
When you close your hand, you own nothing. When you open it up, you
own the whole world.
-- Li Mu Bai in Tiger & Dragon
o Add sendfile() support for file targets to normal mm/filemap.c.
o Have ext[23] use that support.
fs/ext2/file.c | 1
fs/ext3/file.c | 1
include/linux/fs.h | 1
mm/filemap.c | 216 +++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 219 insertions(+)
--- linux-2.6.5cow/include/linux/fs.h~generic_sendpage 2004-04-14 16:22:24.000000000 +0200
+++ linux-2.6.5cow/include/linux/fs.h 2004-04-27 16:34:32.000000000 +0200
@@ -1346,6 +1346,7 @@
ssize_t generic_file_write_nolock(struct file *file, const struct iovec *iov,
unsigned long nr_segs, loff_t *ppos);
extern ssize_t generic_file_sendfile(struct file *, loff_t *, size_t, read_actor_t, void __user *);
+extern ssize_t generic_file_sendpage(struct file *, struct page *, int, size_t, loff_t *, int);
extern void do_generic_mapping_read(struct address_space *mapping,
struct file_ra_state *, struct file *,
loff_t *, read_descriptor_t *, read_actor_t);
--- linux-2.6.5cow/mm/filemap.c~generic_sendpage 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/mm/filemap.c 2004-04-20 16:20:21.000000000 +0200
@@ -896,6 +896,31 @@
return written;
}
+/* FIXME: It would be as simple as this, if we had a (void __user*) to write.
+ * We already have a kernel buffer, so it should be even simpler, right? ;)
+ *
+ * Yes, sorta. After duplicating the complete path of generic_file_write(),
+ * at least some special cases could be removed, so the copy is simpler than
+ * the original. But it remains a copy, so overall complexity increases.
+ */
+static ssize_t
+generic_kernel_file_write(struct file *, const char *, size_t, loff_t *);
+
+ssize_t generic_file_sendpage(struct file *file, struct page *page,
+ int offset, size_t size, loff_t *ppos, int more)
+{
+ ssize_t ret;
+ char *kaddr;
+
+ kaddr = kmap(page);
+ ret = generic_kernel_file_write(file, kaddr + offset, size, ppos);
+ kunmap(page);
+
+ return ret;
+}
+
+EXPORT_SYMBOL(generic_file_sendpage);
+
ssize_t generic_file_sendfile(struct file *in_file, loff_t *ppos,
size_t count, read_actor_t actor, void __user *target)
{
@@ -1554,6 +1579,19 @@
return bytes - left;
}
+static inline size_t
+filemap_copy_from_kernel(struct page *page, unsigned long offset,
+ const char *buf, unsigned bytes)
+{
+ char *kaddr;
+
+ kaddr = kmap(page);
+ memcpy(kaddr + offset, buf, bytes);
+ kunmap(page);
+
+ return bytes;
+}
+
static size_t
__filemap_copy_from_user_iovec(char *vaddr,
const struct iovec *iov, size_t base, size_t bytes)
@@ -1910,6 +1948,155 @@
EXPORT_SYMBOL(generic_file_aio_write_nolock);
+/*
+ * TODO:
+ * This largely tries to copy generic_file_aio_write_nolock(), although it
+ * doesn't have to be nearly as generic. A real cleanup should either
+ * merge this into generic_file_aio_write_nolock() as well or keep it special
+ * and remove as much code as possible.
+ */
+static ssize_t
+generic_kernel_file_aio_write_nolock(struct kiocb *iocb, const struct iovec*iov,
+ unsigned long nr_segs, loff_t *ppos)
+{
+ struct file *file = iocb->ki_filp;
+ struct address_space * mapping = file->f_mapping;
+ struct address_space_operations *a_ops = mapping->a_ops;
+ size_t ocount; /* original count */
+ size_t count; /* after file limit checks */
+ struct inode *inode = mapping->host;
+ long status = 0;
+ loff_t pos;
+ struct page *page;
+ struct page *cached_page = NULL;
+ const int isblk = S_ISBLK(inode->i_mode);
+ ssize_t written;
+ ssize_t err;
+ size_t bytes;
+ struct pagevec lru_pvec;
+ const struct iovec *cur_iov = iov; /* current iovec */
+ size_t iov_base = 0; /* offset in the current iovec */
+ unsigned long seg;
+ char *buf;
+
+ ocount = 0;
+ for (seg = 0; seg < nr_segs; seg++) {
+ const struct iovec *iv = &iov[seg];
+
+ /*
+ * If any segment has a negative length, or the cumulative
+ * length ever wraps negative then return -EINVAL.
+ */
+ ocount += iv->iov_len;
+ if (unlikely((ssize_t)(ocount|iv->iov_len) < 0))
+ return -EINVAL;
+ }
+
+ count = ocount;
+ pos = *ppos;
+ pagevec_init(&lru_pvec, 0);
+
+ /* We can write back this queue in page reclaim */
+ current->backing_dev_info = mapping->backing_dev_info;
+ written = 0;
+
+ err = generic_write_checks(file, &pos, &count, isblk);
+ if (err)
+ goto out;
+
+
+ if (count == 0)
+ goto out;
+
+ remove_suid(file->f_dentry);
+ inode_update_time(inode, 1);
+
+ /* There is no sane reason to use O_DIRECT */
+ BUG_ON(file->f_flags & O_DIRECT);
+
+ buf = (char *)iov->iov_base;
+ do {
+ unsigned long index;
+ unsigned long offset;
+ size_t copied;
+
+ offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+ index = pos >> PAGE_CACHE_SHIFT;
+ bytes = PAGE_CACHE_SIZE - offset;
+ if (bytes > count)
+ bytes = count;
+
+ page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
+ if (!page) {
+ status = -ENOMEM;
+ break;
+ }
+
+ status = a_ops->prepare_write(file, page, offset, offset+bytes);
+ if (unlikely(status)) {
+ loff_t isize = i_size_read(inode);
+ /*
+ * prepare_write() may have instantiated a few blocks
+ * outside i_size. Trim these off again.
+ */
+ unlock_page(page);
+ page_cache_release(page);
+ if (pos + bytes > isize)
+ vmtruncate(inode, isize);
+ break;
+ }
+
+ BUG_ON(nr_segs != 1);
+ copied = filemap_copy_from_kernel(page, offset, buf, bytes);
+
+ flush_dcache_page(page);
+ status = a_ops->commit_write(file, page, offset, offset+bytes);
+ if (likely(copied > 0)) {
+ if (!status)
+ status = copied;
+
+ if (status >= 0) {
+ written += status;
+ count -= status;
+ pos += status;
+ buf += status;
+ if (unlikely(nr_segs > 1))
+ filemap_set_next_iovec(&cur_iov,
+ &iov_base, status);
+ }
+ }
+ if (unlikely(copied != bytes))
+ if (status >= 0)
+ status = -EFAULT;
+ unlock_page(page);
+ mark_page_accessed(page);
+ page_cache_release(page);
+ if (status < 0)
+ break;
+ balance_dirty_pages_ratelimited(mapping);
+ cond_resched();
+ } while (count);
+ *ppos = pos;
+
+ if (cached_page)
+ page_cache_release(cached_page);
+
+ /*
+ * For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
+ */
+ if (status >= 0) {
+ if ((file->f_flags & O_SYNC) || IS_SYNC(inode))
+ status = generic_osync_inode(inode, mapping,
+ OSYNC_METADATA|OSYNC_DATA);
+ }
+
+ err = written ? written : status;
+out:
+ pagevec_lru_add(&lru_pvec);
+ current->backing_dev_info = 0;
+ return err;
+}
+
ssize_t
generic_file_write_nolock(struct file *file, const struct iovec *iov,
unsigned long nr_segs, loff_t *ppos)
@@ -1924,6 +2111,20 @@
return ret;
}
+static ssize_t
+generic_kernel_file_write_nolock(struct file *file, const struct iovec *iov,
+ unsigned long nr_segs, loff_t *ppos)
+{
+ struct kiocb kiocb;
+ ssize_t ret;
+
+ init_sync_kiocb(&kiocb, file);
+ ret = generic_kernel_file_aio_write_nolock(&kiocb, iov, nr_segs, ppos);
+ if (ret == -EIOCBQUEUED)
+ ret = wait_on_sync_kiocb(&kiocb);
+ return ret;
+}
+
EXPORT_SYMBOL(generic_file_write_nolock);
ssize_t generic_file_aio_write(struct kiocb *iocb, const char __user *buf,
@@ -1962,6 +2163,21 @@
EXPORT_SYMBOL(generic_file_write);
+static ssize_t generic_kernel_file_write(struct file *file, const char *buf,
+ size_t count, loff_t *ppos)
+{
+ struct inode *inode = file->f_mapping->host;
+ ssize_t err;
+ struct iovec local_iov = {.iov_base = (void __user *)buf,
+ .iov_len = count };
+
+ down(&inode->i_sem);
+ err = generic_kernel_file_write_nolock(file, &local_iov, 1, ppos);
+ up(&inode->i_sem);
+
+ return err;
+}
+
ssize_t generic_file_readv(struct file *filp, const struct iovec *iov,
unsigned long nr_segs, loff_t *ppos)
{
--- linux-2.6.5cow/fs/ext2/file.c~generic_sendpage 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/ext2/file.c 2004-04-14 16:22:24.000000000 +0200
@@ -53,6 +53,7 @@
.readv = generic_file_readv,
.writev = generic_file_writev,
.sendfile = generic_file_sendfile,
+ .sendpage = generic_file_sendpage,
};
struct inode_operations ext2_file_inode_operations = {
--- linux-2.6.5cow/fs/ext3/file.c~generic_sendpage 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/ext3/file.c 2004-04-14 16:22:24.000000000 +0200
@@ -127,6 +127,7 @@
.release = ext3_release_file,
.fsync = ext3_sync_file,
.sendfile = generic_file_sendfile,
+ .sendpage = generic_file_sendpage,
};
struct inode_operations ext3_file_inode_operations = {
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH COW] sendfile
2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
2004-05-06 13:18 ` [PATCH COW] generic_sendpage Jörn Engel
@ 2004-05-06 13:19 ` Jörn Engel
2004-05-06 13:19 ` [PATCH COW] copyfile Jörn Engel
` (6 subsequent siblings)
8 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-06 13:19 UTC (permalink / raw)
To: linux-kernel
Patch 2
Jörn
--
Simplicity is prerequisite for reliability.
-- Edsger W. Dijkstra
Creates vfs_sendfile(), which can be called from other places within
the kernel. Such other places include copyfile() and cowlinks.
In principle, this just removes code from do_sendfile() to
vfs_sendfile(). On top of that, it adds a check to out_inode,
identical to the one on in_inode. True, the check for out_inode was
never needed, maybe that tells you something about the check to
in_inode as well. ;)
fs/read_write.c | 99 +++++++++++++++++++++++++++--------------------
include/linux/fs.h | 1
include/linux/syscalls.h | 2
3 files changed, 60 insertions(+), 42 deletions(-)
--- linux-2.6.5cow/fs/read_write.c~copyfile 2004-04-06 18:50:59.000000000 +0200
+++ linux-2.6.5cow/fs/read_write.c 2004-04-25 03:21:41.000000000 +0200
@@ -537,12 +537,66 @@
return ret;
}
+ssize_t vfs_sendfile(struct file *out_file, struct file *in_file, loff_t *ppos,
+ size_t count, loff_t max)
+{
+ struct inode * in_inode, * out_inode;
+ loff_t pos;
+ ssize_t ret;
+
+ /* verify in_file */
+ in_inode = in_file->f_dentry->d_inode;
+ if (!in_inode)
+ return -EINVAL;
+ if (!in_file->f_op || !in_file->f_op->sendfile)
+ return -EINVAL;
+
+ if (!ppos)
+ ppos = &in_file->f_pos;
+ ret = locks_verify_area(FLOCK_VERIFY_READ, in_inode, in_file, *ppos, count);
+ if (ret)
+ return ret;
+
+ /* verify out_file */
+ out_inode = out_file->f_dentry->d_inode;
+ if (!out_inode)
+ return -EINVAL;
+ if (!out_file->f_op || !out_file->f_op->sendpage)
+ return -EINVAL;
+
+ ret = locks_verify_area(FLOCK_VERIFY_WRITE, out_inode, out_file, out_file->f_pos, count);
+ if (ret)
+ return ret;
+
+ ret = security_file_permission (out_file, MAY_WRITE);
+ if (ret)
+ return ret;
+
+ if (!max)
+ max = min(in_inode->i_sb->s_maxbytes, out_inode->i_sb->s_maxbytes);
+
+ pos = *ppos;
+ if (unlikely(pos < 0))
+ return -EINVAL;
+ if (unlikely(pos + count > max)) {
+ if (pos >= max)
+ return -EOVERFLOW;
+ count = max - pos;
+ }
+
+ ret = in_file->f_op->sendfile(in_file, ppos, count, file_send_actor, out_file);
+
+ if (*ppos > max)
+ return -EOVERFLOW;
+ return ret;
+}
+
+EXPORT_SYMBOL(vfs_sendfile);
+
static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
size_t count, loff_t max)
{
struct file * in_file, * out_file;
- struct inode * in_inode, * out_inode;
- loff_t pos;
ssize_t retval;
int fput_needed_in, fput_needed_out;
@@ -555,17 +609,6 @@
goto out;
if (!(in_file->f_mode & FMODE_READ))
goto fput_in;
- retval = -EINVAL;
- in_inode = in_file->f_dentry->d_inode;
- if (!in_inode)
- goto fput_in;
- if (!in_file->f_op || !in_file->f_op->sendfile)
- goto fput_in;
- if (!ppos)
- ppos = &in_file->f_pos;
- retval = locks_verify_area(FLOCK_VERIFY_READ, in_inode, in_file, *ppos, count);
- if (retval)
- goto fput_in;
retval = security_file_permission (in_file, MAY_READ);
if (retval)
@@ -580,36 +623,8 @@
goto fput_in;
if (!(out_file->f_mode & FMODE_WRITE))
goto fput_out;
- retval = -EINVAL;
- if (!out_file->f_op || !out_file->f_op->sendpage)
- goto fput_out;
- out_inode = out_file->f_dentry->d_inode;
- retval = locks_verify_area(FLOCK_VERIFY_WRITE, out_inode, out_file, out_file->f_pos, count);
- if (retval)
- goto fput_out;
-
- retval = security_file_permission (out_file, MAY_WRITE);
- if (retval)
- goto fput_out;
-
- if (!max)
- max = min(in_inode->i_sb->s_maxbytes, out_inode->i_sb->s_maxbytes);
-
- pos = *ppos;
- retval = -EINVAL;
- if (unlikely(pos < 0))
- goto fput_out;
- if (unlikely(pos + count > max)) {
- retval = -EOVERFLOW;
- if (pos >= max)
- goto fput_out;
- count = max - pos;
- }
-
- retval = in_file->f_op->sendfile(in_file, ppos, count, file_send_actor, out_file);
- if (*ppos > max)
- retval = -EOVERFLOW;
+ retval = vfs_sendfile(out_file, in_file, ppos, count, max);
fput_out:
fput_light(out_file, fput_needed_out);
--- linux-2.6.5cow/include/linux/fs.h~copyfile 2004-04-20 16:20:21.000000000 +0200
+++ linux-2.6.5cow/include/linux/fs.h 2004-04-27 16:34:23.000000000 +0200
@@ -877,6 +877,7 @@
unsigned long, loff_t *);
extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
unsigned long, loff_t *);
+ssize_t vfs_sendfile(struct file *, struct file *, loff_t *, size_t, loff_t);
/*
* NOTE: write_inode, delete_inode, clear_inode, put_inode can be called
--- linux-2.6.5cow/include/linux/syscalls.h~copyfile 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/include/linux/syscalls.h 2004-04-27 16:34:23.000000000 +0200
@@ -278,6 +278,8 @@
asmlinkage long sys_unlink(const char __user *pathname);
asmlinkage long sys_rename(const char __user *oldname,
const char __user *newname);
+asmlinkage long sys_copyfile(const char __user *from, const char __user *to,
+ umode_t mode);
asmlinkage long sys_chmod(const char __user *filename, mode_t mode);
asmlinkage long sys_fchmod(unsigned int fd, mode_t mode);
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH COW] copyfile
2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
2004-05-06 13:18 ` [PATCH COW] generic_sendpage Jörn Engel
2004-05-06 13:19 ` [PATCH COW] sendfile Jörn Engel
@ 2004-05-06 13:19 ` Jörn Engel
2004-05-06 13:20 ` [PATCH COW] lock_flags Jörn Engel
` (5 subsequent siblings)
8 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-06 13:19 UTC (permalink / raw)
To: linux-kernel
Patch 3
Jörn
--
More computing sins are committed in the name of efficiency (without
necessarily achieving it) than for any other single reason - including
blind stupidity.
-- W. A. Wulf
Adds a new syscall, copyfile() which does as the name sais.
This syscall is a preparation for real cowlinks, but it might make sense on
it's own as well. For cowlinks, the real copy is not done until the first
write (or similar operation) occurs, but at that point, both operations
should be the same.
arch/i386/kernel/entry.S | 1
fs/namei.c | 140 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 141 insertions(+)
--- linux-2.6.5cow/arch/i386/kernel/entry.S~copyfile 2004-04-27 16:34:32.000000000 +0200
+++ linux-2.6.5cow/arch/i386/kernel/entry.S 2004-04-30 08:40:04.000000000 +0200
@@ -882,5 +882,6 @@
.long sys_utimes
.long sys_fadvise64_64
.long sys_ni_syscall /* sys_vserver */
+ .long sys_copyfile
syscall_table_size=(.-sys_call_table)
--- linux-2.6.5cow/fs/namei.c~copyfile 2004-04-27 16:48:55.000000000 +0200
+++ linux-2.6.5cow/fs/namei.c 2004-05-06 01:08:53.000000000 +0200
@@ -2139,6 +2139,146 @@
return error;
}
+/* note how we open() two struct file *, just to do the sendfile(). this
+ * should not be necessary, only nfs, smbfs and cifs depend on the struct
+ * file *, while afs and all local filesystems can do without.
+ * but changing all interfaces and fixing up three filesystems is a little
+ * too much work for now.
+ */
+#include <linux/file.h> /* for fput */
+static int copy_data(struct dentry *old_dentry, struct vfsmount *old_mnt,
+ struct dentry *new_dentry, struct vfsmount *new_mnt)
+{
+ int ret;
+ loff_t size;
+ ssize_t n;
+ struct file *old_file;
+ struct file *new_file;
+
+ old_file = dentry_open(old_dentry, old_mnt, O_RDONLY);
+ if (IS_ERR(old_file))
+ return PTR_ERR(old_file);
+
+ new_file = dentry_open(new_dentry, new_mnt, O_WRONLY);
+ ret = PTR_ERR(new_file);
+ if (IS_ERR(new_file))
+ goto out1;
+
+ size = i_size_read(old_file->f_dentry->d_inode);
+ if (((size_t)size != size) || ((ssize_t)size != size)) {
+ ret = -EFBIG;
+ goto out2;
+ }
+ n = vfs_sendfile(new_file, old_file, NULL, size, 0);
+ if (n == size)
+ ret = 0;
+ else if (n < 0)
+ ret = n;
+ else
+ ret = -ENOSPC;
+
+out2:
+ dget(new_file->f_dentry); /* fput calls dput */
+ fput(new_file);
+out1:
+ dget(old_file->f_dentry); /* fput calls dput */
+ fput(old_file);
+ return ret;
+}
+
+int vfs_copyfile(struct nameidata *old_nd, struct nameidata *new_nd,
+ struct dentry *new_dentry, int mode)
+{
+ int ret;
+
+ if (!old_nd->dentry->d_inode)
+ return -ENOENT;
+ if (!S_ISREG(old_nd->dentry->d_inode->i_mode))
+ return -EINVAL;
+ /* FIXME: replace with proper permission check */
+ if (new_dentry->d_inode)
+ return -EEXIST;
+
+ if (mode == -1)
+ mode = old_nd->dentry->d_inode->i_mode;
+
+ ret = vfs_create(new_nd->dentry->d_inode, new_dentry, mode, new_nd);
+ if (ret)
+ return ret;
+
+ ret = copy_data(old_nd->dentry, old_nd->mnt, new_dentry, new_nd->mnt);
+
+ if (ret) {
+ int error = vfs_unlink(new_nd->dentry->d_inode, new_dentry);
+ BUG_ON(error);
+ /* FIXME: not sure if there are return value we should not BUG()
+ * on */
+ }
+ return ret;
+}
+
+int do_copyfile(const char *from, const char *to, int mode)
+{
+ int ret;
+ struct dentry *new_dentry;
+ struct nameidata old_nd, new_nd;
+
+ ret = path_lookup(from, 0, &old_nd);
+ if (ret)
+ return ret;
+
+ ret = path_lookup(to, LOOKUP_PARENT|LOOKUP_OPEN|LOOKUP_CREATE, &new_nd);
+ if (ret)
+ goto out1;
+
+ new_dentry = lookup_create(&new_nd, 0);
+ ret = PTR_ERR(new_dentry);
+ if (IS_ERR(new_dentry))
+ goto out2;
+
+ ret = vfs_copyfile(&old_nd, &new_nd, new_dentry, mode);
+
+ dput(new_dentry);
+ up(&new_nd.dentry->d_inode->i_sem);
+out2:
+ path_release(&new_nd);
+out1:
+ path_release(&old_nd);
+ return ret;
+}
+
+/* copyfile() - create a new file and copy the old one's contents into it, all
+ * inside the kernel
+ *
+ * TODO:
+ * quota
+ * notification
+ * security
+ * unlock directory during data copy
+ * loop for data copy
+ */
+int sys_copyfile(const char __user *ufrom, const char __user *uto, umode_t mode)
+{
+ char *from, *to;
+ int ret;
+
+ from = getname(ufrom);
+ if (IS_ERR(from))
+ return PTR_ERR(from);
+
+ to = getname(uto);
+ ret = PTR_ERR(to);
+ if (IS_ERR(to))
+ goto err;
+
+ ret = do_copyfile(from, to, mode);
+
+ putname(to);
+err:
+ putname(from);
+ return ret;
+}
+
int vfs_readlink(struct dentry *dentry, char __user *buffer, int buflen, const char *link)
{
int len;
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH COW] lock_flags
2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
` (2 preceding siblings ...)
2004-05-06 13:19 ` [PATCH COW] copyfile Jörn Engel
@ 2004-05-06 13:20 ` Jörn Engel
2004-05-06 13:21 ` [PATCH COW] MAD COW Jörn Engel
` (4 subsequent siblings)
8 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-06 13:20 UTC (permalink / raw)
To: linux-kernel
Patch 4 - just for completeness
Jörn
--
Optimizations always bust things, because all optimizations are, in
the long haul, a form of cheating, and cheaters eventually get caught.
-- Larry Wall
spins on i_lock before accessing i_flags
Omissions:
o fs/befs/linuxvfs.c inode->i_flags should be befs_ino->i_flags
o fs/dquot.c vfs_quota_on() is racy. the race is now limited to
S_NOATIME, S_QUOTA and S_NOQUOTA, but it still exists.
drivers/usb/core/inode.c | 2 ++
fs/afs/inode.c | 2 ++
fs/dquot.c | 21 ++++++++++++++++++---
fs/ext2/ialloc.c | 2 ++
fs/ext2/inode.c | 2 ++
fs/ext3/ialloc.c | 2 ++
fs/ext3/inode.c | 2 ++
fs/fat/inode.c | 5 ++++-
fs/hfs/inode.c | 2 +-
fs/hfsplus/catalog.c | 4 ++--
fs/hfsplus/dir.c | 10 ++++++++--
fs/hfsplus/inode.c | 6 +++++-
fs/hfsplus/ioctl.c | 2 ++
fs/intermezzo/vfs.c | 7 ++++++-
fs/namei.c | 10 ++++++++--
fs/nfs/inode.c | 2 ++
fs/proc/base.c | 4 ++++
fs/reiserfs/inode.c | 9 ++++++++-
fs/smbfs/file.c | 2 +-
fs/udf/ialloc.c | 2 ++
fs/ufs/ialloc.c | 2 ++
fs/xfs/linux/xfs_ioctl.c | 2 +-
fs/xfs/linux/xfs_super.c | 2 ++
fs/xfs/linux/xfs_vnode.c | 2 ++
fs/xfs/xfs_acl.c | 2 +-
fs/xfs/xfs_cap.c | 2 +-
include/linux/fs.h | 32 +++++++++++++++++++++-----------
27 files changed, 113 insertions(+), 29 deletions(-)
--- linux-2.6.5cow/include/linux/fs.h~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/include/linux/fs.h 2004-04-27 16:34:42.000000000 +0200
@@ -155,24 +155,25 @@
*/
#define __IS_FLG(inode,flg) ((inode)->i_sb->s_flags & (flg))
-#define IS_RDONLY(inode) ((inode)->i_sb->s_flags & MS_RDONLY)
+#define IS_RDONLY(inode) (__IS_FLG(inode, MS_RDONLY))
#define IS_SYNC(inode) (__IS_FLG(inode, MS_SYNCHRONOUS) || \
- ((inode)->i_flags & S_SYNC))
+ inode_flags((inode), S_SYNC))
#define IS_DIRSYNC(inode) (__IS_FLG(inode, MS_SYNCHRONOUS|MS_DIRSYNC) || \
- ((inode)->i_flags & (S_SYNC|S_DIRSYNC)))
+ inode_flags((inode), (S_SYNC|S_DIRSYNC)))
#define IS_MANDLOCK(inode) __IS_FLG(inode, MS_MANDLOCK)
-#define IS_QUOTAINIT(inode) ((inode)->i_flags & S_QUOTA)
-#define IS_NOQUOTA(inode) ((inode)->i_flags & S_NOQUOTA)
-#define IS_APPEND(inode) ((inode)->i_flags & S_APPEND)
-#define IS_IMMUTABLE(inode) ((inode)->i_flags & S_IMMUTABLE)
-#define IS_NOATIME(inode) (__IS_FLG(inode, MS_NOATIME) || ((inode)->i_flags & S_NOATIME))
+#define IS_QUOTAINIT(inode) (inode_flags((inode), S_QUOTA))
+#define IS_NOQUOTA(inode) (inode_flags((inode), S_NOQUOTA))
+#define IS_APPEND(inode) (inode_flags((inode), S_APPEND))
+#define IS_IMMUTABLE(inode) (inode_flags((inode), S_IMMUTABLE))
+#define IS_NOATIME(inode) (__IS_FLG(inode, MS_NOATIME) || \
+ inode_flags((inode), S_NOATIME))
#define IS_NODIRATIME(inode) __IS_FLG(inode, MS_NODIRATIME)
#define IS_POSIXACL(inode) __IS_FLG(inode, MS_POSIXACL)
#define IS_ONE_SECOND(inode) __IS_FLG(inode, MS_ONE_SECOND)
-#define IS_DEADDIR(inode) ((inode)->i_flags & S_DEAD)
-#define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
+#define IS_DEADDIR(inode) (inode_flags((inode), S_DEAD))
+#define IS_NOCMTIME(inode) (inode_flags((inode), S_NOCMTIME))
/* the read-only stuff doesn't really belong here, but any other place is
probably as bad and I don't want to create yet another include file. */
@@ -417,7 +418,7 @@
unsigned long i_state;
unsigned long dirtied_when; /* jiffies of first dirtying */
- unsigned int i_flags;
+ unsigned int i_flags; /* protected by i_lock */
unsigned char i_sock;
atomic_t i_writecount;
@@ -431,6 +432,15 @@
#endif
};
+static inline unsigned int inode_flags(struct inode *inode, unsigned int mask)
+{
+ int ret;
+ spin_lock(inode->i_lock);
+ ret = inode->i_flags & mask;
+ spin_unlock(inode->i_lock);
+ return ret;
+}
+
/*
* NOTE: in a 32bit arch with a preemptable kernel and
* an UP compile the i_size_read/write must be atomic
--- linux-2.6.5cow/drivers/usb/core/inode.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/drivers/usb/core/inode.c 2004-04-14 16:22:24.000000000 +0200
@@ -277,7 +277,9 @@
if (usbfs_empty(dentry)) {
dentry->d_inode->i_nlink -= 2;
dput(dentry);
+ spin_lock(inode->i_lock);
inode->i_flags |= S_DEAD;
+ spin_unlock(inode->i_lock);
dir->i_nlink--;
error = 0;
}
--- linux-2.6.5cow/fs/xfs/linux/xfs_vnode.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/xfs/linux/xfs_vnode.c 2004-04-14 16:22:24.000000000 +0200
@@ -213,6 +213,7 @@
inode->i_mtime = va.va_mtime;
inode->i_ctime = va.va_ctime;
inode->i_atime = va.va_atime;
+ spin_lock(inode->i_lock);
if (va.va_xflags & XFS_XFLAG_IMMUTABLE)
inode->i_flags |= S_IMMUTABLE;
else
@@ -229,6 +230,7 @@
inode->i_flags |= S_NOATIME;
else
inode->i_flags &= ~S_NOATIME;
+ spin_unlock(inode->i_lock);
VUNMODIFY(vp);
}
return -error;
--- linux-2.6.5cow/fs/xfs/linux/xfs_ioctl.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/xfs/linux/xfs_ioctl.c 2004-04-14 16:22:24.000000000 +0200
@@ -885,7 +885,7 @@
int attr_flags = 0;
int error;
- if (vp->v_inode.i_flags & (S_IMMUTABLE|S_APPEND))
+ if (inode_flags(vp->v_inode, (S_IMMUTABLE|S_APPEND)))
return -XFS_ERROR(EPERM);
if (!(filp->f_flags & FMODE_WRITE))
--- linux-2.6.5cow/fs/xfs/linux/xfs_super.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/xfs/linux/xfs_super.c 2004-04-14 16:22:24.000000000 +0200
@@ -187,6 +187,7 @@
inode->i_mtime.tv_nsec = ip->i_d.di_mtime.t_nsec;
inode->i_ctime.tv_sec = ip->i_d.di_ctime.t_sec;
inode->i_ctime.tv_nsec = ip->i_d.di_ctime.t_nsec;
+ spin_lock(inode->i_lock);
if (ip->i_d.di_flags & XFS_DIFLAG_IMMUTABLE)
inode->i_flags |= S_IMMUTABLE;
else
@@ -203,6 +204,7 @@
inode->i_flags |= S_NOATIME;
else
inode->i_flags &= ~S_NOATIME;
+ spin_unlock(inode->i_lock);
vp->v_flag &= ~VMODIFIED;
}
--- linux-2.6.5cow/fs/xfs/xfs_acl.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/xfs/xfs_acl.c 2004-04-14 16:22:24.000000000 +0200
@@ -388,7 +388,7 @@
vattr_t va;
int error;
- if (vp->v_inode.i_flags & (S_IMMUTABLE|S_APPEND))
+ if (inode_flags(&vp->v_inode, (S_IMMUTABLE|S_APPEND)))
return EPERM;
if (kind == _ACL_TYPE_DEFAULT && vp->v_type != VDIR)
return ENOTDIR;
--- linux-2.6.5cow/fs/xfs/xfs_cap.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/xfs/xfs_cap.c 2004-04-14 16:22:24.000000000 +0200
@@ -192,7 +192,7 @@
if (vp->v_vfsp->vfs_flag & VFS_RDONLY)
return EROFS;
- if (vp->v_inode.i_flags & (S_IMMUTABLE|S_APPEND))
+ if (inode_flags(&vp->v_inode, (S_IMMUTABLE|S_APPEND)))
return EPERM;
if ((error = _MAC_VACCESS(vp, NULL, VWRITE)))
return error;
--- linux-2.6.5cow/fs/nfs/inode.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/nfs/inode.c 2004-04-14 16:22:24.000000000 +0200
@@ -695,7 +695,9 @@
inode->i_ino = hash;
/* We can't support update_atime(), since the server will reset it */
+ spin_lock(inode->i_lock);
inode->i_flags |= S_NOATIME|S_NOCMTIME;
+ spin_unlock(inode->i_lock);
inode->i_mode = fattr->mode;
/* Why so? Because we want revalidate for devices/FIFOs, and
* that's precisely what we have in nfs_file_inode_operations.
--- linux-2.6.5cow/fs/ufs/ialloc.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/ufs/ialloc.c 2004-04-14 16:22:24.000000000 +0200
@@ -283,7 +283,9 @@
if (DQUOT_ALLOC_INODE(inode)) {
DQUOT_DROP(inode);
+ spin_lock(inode->i_lock);
inode->i_flags |= S_NOQUOTA;
+ spin_unlock(inode->i_lock);
inode->i_nlink = 0;
iput(inode);
return ERR_PTR(-EDQUOT);
--- linux-2.6.5cow/fs/hfs/inode.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/hfs/inode.c 2004-04-20 16:20:21.000000000 +0200
@@ -540,7 +540,7 @@
if (atomic_dec_and_test(&HFS_I(inode)->opencnt)) {
down(&inode->i_sem);
hfs_file_truncate(inode);
- //if (inode->i_flags & S_DEAD) {
+ //if (IS_DEADDIR(inode) {
// hfs_delete_cat(inode->i_ino, HFSPLUS_SB(sb).hidden_dir, NULL);
// hfs_delete_inode(inode);
//}
--- linux-2.6.5cow/fs/intermezzo/vfs.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/intermezzo/vfs.c 2004-04-14 16:22:24.000000000 +0200
@@ -1441,7 +1441,9 @@
error = iops->rmdir(dir->d_inode, dentry);
unlock_kernel();
if (!error) {
+ spin_lock(dentry->d_inode->i_lock);
dentry->d_inode->i_flags |= S_DEAD;
+ spin_unlock(dentry->d_inode->i_lock);
error = presto_settime(fset, NULL, NULL, dir, info,
ATTR_CTIME | ATTR_MTIME);
}
@@ -1838,8 +1840,11 @@
error = do_rename(fset, old_parent, old_dentry,
new_parent, new_dentry, info);
if (target) {
- if (!error)
+ if (!error) {
+ spin_lock(target);
target->i_flags |= S_DEAD;
+ spin_unlock(target);
+ }
// triple_up(&old_dir->i_zombie,
// &new_dir->i_zombie,
// &target->i_zombie);
--- linux-2.6.5cow/fs/udf/ialloc.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/udf/ialloc.c 2004-04-14 16:22:24.000000000 +0200
@@ -165,7 +165,9 @@
if (DQUOT_ALLOC_INODE(inode))
{
DQUOT_DROP(inode);
+ spin_lock(inode->i_lock);
inode->i_flags |= S_NOQUOTA;
+ spin_unlock(inode->i_lock);
inode->i_nlink = 0;
iput(inode);
*err = -EDQUOT;
--- linux-2.6.5cow/fs/reiserfs/inode.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/reiserfs/inode.c 2004-04-20 16:20:21.000000000 +0200
@@ -1612,8 +1612,11 @@
/* uid and gid must already be set by the caller for quota init */
/* symlink cannot be immutable or append only, right? */
- if( S_ISLNK( inode -> i_mode ) )
+ if( S_ISLNK( inode -> i_mode ) ) {
+ spin_lock(inode->i_lock);
inode -> i_flags &= ~ ( S_IMMUTABLE | S_APPEND );
+ spin_unlock(inode->i_lock);
+ }
inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
inode->i_size = i_size;
@@ -2287,6 +2290,7 @@
void sd_attrs_to_i_attrs( __u16 sd_attrs, struct inode *inode )
{
if( reiserfs_attrs( inode -> i_sb ) ) {
+ spin_lock(inode->i_lock);
if( sd_attrs & REISERFS_SYNC_FL )
inode -> i_flags |= S_SYNC;
else
@@ -2307,12 +2311,14 @@
REISERFS_I(inode)->i_flags |= i_nopack_mask;
else
REISERFS_I(inode)->i_flags &= ~i_nopack_mask;
+ spin_unlock(inode->i_lock);
}
}
void i_attrs_to_sd_attrs( struct inode *inode, __u16 *sd_attrs )
{
if( reiserfs_attrs( inode -> i_sb ) ) {
+ spin_lock(inode->i_lock);
if( inode -> i_flags & S_IMMUTABLE )
*sd_attrs |= REISERFS_IMMUTABLE_FL;
else
@@ -2329,6 +2335,7 @@
*sd_attrs |= REISERFS_NOTAIL_FL;
else
*sd_attrs &= ~REISERFS_NOTAIL_FL;
+ spin_unlock(inode->i_lock);
}
}
--- linux-2.6.5cow/fs/fat/inode.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/fat/inode.c 2004-04-20 16:20:21.000000000 +0200
@@ -1193,8 +1193,11 @@
MSDOS_I(inode)->mmu_private = inode->i_size;
}
if(de->attr & ATTR_SYS)
- if (sbi->options.sys_immutable)
+ if (sbi->options.sys_immutable) {
+ spin_lock(inode->i_lock);
inode->i_flags |= S_IMMUTABLE;
+ spin_unlock(inode->i_lock);
+ }
MSDOS_I(inode)->i_attrs = de->attr & ATTR_UNUSED;
/* this is as close to the truth as we can get ... */
inode->i_blksize = sbi->cluster_size;
--- linux-2.6.5cow/fs/ext3/ialloc.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/ext3/ialloc.c 2004-04-14 16:22:24.000000000 +0200
@@ -627,7 +627,9 @@
return ret;
fail2:
+ spin_lock(inode->i_lock);
inode->i_flags |= S_NOQUOTA;
+ spin_unlock(inode->i_lock);
inode->i_nlink = 0;
iput(inode);
brelse(bitmap_bh);
--- linux-2.6.5cow/fs/ext3/inode.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/ext3/inode.c 2004-04-27 16:34:42.000000000 +0200
@@ -2447,6 +2447,7 @@
{
unsigned int flags = EXT3_I(inode)->i_flags;
+ spin_lock(inode->i_lock);
inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
if (flags & EXT3_SYNC_FL)
inode->i_flags |= S_SYNC;
@@ -2458,6 +2459,7 @@
inode->i_flags |= S_NOATIME;
if (flags & EXT3_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
+ spin_unlock(inode->i_lock);
}
void ext3_read_inode(struct inode * inode)
--- linux-2.6.5cow/fs/proc/base.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/proc/base.c 2004-04-14 16:22:24.000000000 +0200
@@ -1594,7 +1594,9 @@
inode->i_op = &proc_tgid_base_inode_operations;
inode->i_fop = &proc_tgid_base_operations;
inode->i_nlink = 3;
+ spin_lock(inode->i_lock);
inode->i_flags|=S_IMMUTABLE;
+ spin_unlock(inode->i_lock);
dentry->d_op = &pid_base_dentry_operations;
@@ -1649,7 +1651,9 @@
inode->i_op = &proc_tid_base_inode_operations;
inode->i_fop = &proc_tid_base_operations;
inode->i_nlink = 3;
+ spin_lock(inode->i_lock);
inode->i_flags|=S_IMMUTABLE;
+ spin_unlock(inode->i_lock);
dentry->d_op = &pid_base_dentry_operations;
--- linux-2.6.5cow/fs/ext2/inode.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/ext2/inode.c 2004-04-27 16:34:42.000000000 +0200
@@ -1020,6 +1020,7 @@
{
unsigned int flags = EXT2_I(inode)->i_flags;
+ spin_lock(inode->i_lock);
inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
if (flags & EXT2_SYNC_FL)
inode->i_flags |= S_SYNC;
@@ -1031,6 +1032,7 @@
inode->i_flags |= S_NOATIME;
if (flags & EXT2_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
+ spin_unlock(inode->i_lock);
}
void ext2_read_inode (struct inode * inode)
--- linux-2.6.5cow/fs/ext2/ialloc.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/ext2/ialloc.c 2004-04-14 16:22:24.000000000 +0200
@@ -620,7 +620,9 @@
return inode;
fail2:
+ spin_lock(inode->i_lock);
inode->i_flags |= S_NOQUOTA;
+ spin_unlock(inode->i_lock);
inode->i_nlink = 0;
iput(inode);
return ERR_PTR(err);
--- linux-2.6.5cow/fs/namei.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/namei.c 2004-04-27 16:34:42.000000000 +0200
@@ -1609,8 +1609,11 @@
error = security_inode_rmdir(dir, dentry);
if (!error) {
error = dir->i_op->rmdir(dir, dentry);
- if (!error)
+ if (!error) {
+ spin_lock(dentry->d_inode->i_lock);
dentry->d_inode->i_flags |= S_DEAD;
+ spin_unlock(dentry->d_inode->i_lock);
+ }
}
}
up(&dentry->d_inode->i_sem);
@@ -1952,8 +1955,11 @@
else
error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
if (target) {
- if (!error)
+ if (!error) {
+ spin_lock(target->i_lock);
target->i_flags |= S_DEAD;
+ spin_unlock(target->i_lock);
+ }
up(&target->i_sem);
if (d_unhashed(new_dentry))
d_rehash(new_dentry);
--- linux-2.6.5cow/fs/dquot.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/dquot.c 2004-04-14 16:22:24.000000000 +0200
@@ -573,7 +573,9 @@
if (inode->i_dquot[cnt] != NODQUOT)
goto put_it;
}
+ spin_lock(inode->i_lock);
inode->i_flags &= ~S_QUOTA;
+ spin_unlock(inode->i_lock);
put_it:
if (dquot != NODQUOT) {
if (dqput_blocks(dquot)) {
@@ -824,8 +826,11 @@
break;
}
inode->i_dquot[cnt] = dqget(inode->i_sb, id, cnt);
- if (inode->i_dquot[cnt])
+ if (inode->i_dquot[cnt]) {
+ spin_lock(inode->i_lock);
inode->i_flags |= S_QUOTA;
+ spin_unlock(inode->i_lock);
+ }
}
}
up_write(&sb_dqopt(inode->i_sb)->dqptr_sem);
@@ -839,7 +844,9 @@
{
int cnt;
+ spin_lock(inode->i_lock);
inode->i_flags &= ~S_QUOTA;
+ spin_unlock(inode->i_lock);
for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
to_drop[cnt] = inode->i_dquot[cnt];
inode->i_dquot[cnt] = NODQUOT;
@@ -1174,6 +1181,7 @@
return 0;
}
+#define SAFE_FLAGS (S_QUOTA | S_NOQUOTA | S_NOATIME)
int vfs_quota_on(struct super_block *sb, int type, int format_id, char *path)
{
struct file *f;
@@ -1207,14 +1215,18 @@
error = -EBUSY;
goto out_lock;
}
- oldflags = inode->i_flags;
+ /* FIXME: this is a race for i_flags. not sure how to fix it properly,
+ * but at least it only affects the three SAFE_FLAGS now */
+ oldflags = inode_flags(inode, SAFE_FLAGS);
dqopt->files[type] = f;
error = -EINVAL;
if (!fmt->qf_ops->check_quota_file(sb, type))
goto out_file_init;
/* We don't want quota and atime on quota files (deadlocks possible) */
dquot_drop_nolock(inode);
+ spin_lock(inode->i_lock);
inode->i_flags |= S_NOQUOTA | S_NOATIME;
+ spin_unlock(inode->i_lock);
dqopt->ops[type] = fmt->qf_ops;
dqopt->info[type].dqi_format = fmt;
@@ -1233,7 +1245,10 @@
return 0;
out_file_init:
- inode->i_flags = oldflags;
+ spin_lock(inode->i_lock);
+ inode->i_flags &= ~SAFE_FLAGS;
+ inode->i_flags |= oldflags;
+ spin_unlock(inode->i_lock);
dqopt->files[type] = NULL;
out_lock:
up_write(&dqopt->dqptr_sem);
--- linux-2.6.5cow/fs/hfsplus/catalog.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/hfsplus/catalog.c 2004-04-14 16:22:24.000000000 +0200
@@ -54,11 +54,11 @@
static void hfsplus_set_perms(struct inode *inode, struct hfsplus_perm *perms)
{
- if (inode->i_flags & S_IMMUTABLE)
+ if (IS_IMMUTABLE(inode))
perms->rootflags |= HFSPLUS_FLG_IMMUTABLE;
else
perms->rootflags &= ~HFSPLUS_FLG_IMMUTABLE;
- if (inode->i_flags & S_APPEND)
+ if (IS_APPEND(inode))
perms->rootflags |= HFSPLUS_FLG_APPEND;
else
perms->rootflags &= ~HFSPLUS_FLG_APPEND;
--- linux-2.6.5cow/fs/hfsplus/dir.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/hfsplus/dir.c 2004-04-14 16:22:24.000000000 +0200
@@ -315,8 +315,11 @@
res = hfsplus_rename_cat(inode->i_ino,
dir, &dentry->d_name,
HFSPLUS_SB(sb).hidden_dir, &str);
- if (!res)
+ if (!res) {
+ spin_lock(inode->i_lock);
inode->i_flags |= S_DEAD;
+ spin_unlock(inode->i_lock);
+ }
return res;
}
res = hfsplus_delete_cat(cnid, dir, &dentry->d_name);
@@ -330,8 +333,11 @@
res = hfsplus_delete_cat(inode->i_ino, HFSPLUS_SB(sb).hidden_dir, NULL);
if (!res)
hfsplus_delete_inode(inode);
- } else
+ } else {
+ spin_lock(inode->i_lock);
inode->i_flags |= S_DEAD;
+ spin_unlock(inode->i_lock);
+ }
}
inode->i_ctime = CURRENT_TIME;
mark_inode_dirty(inode);
--- linux-2.6.5cow/fs/hfsplus/inode.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/hfsplus/inode.c 2004-04-20 16:20:21.000000000 +0200
@@ -225,6 +225,7 @@
HFSPLUS_I(inode).rootflags = perms->rootflags;
HFSPLUS_I(inode).userflags = perms->userflags;
+ spin_lock(inode->i_lock);
if (perms->rootflags & HFSPLUS_FLG_IMMUTABLE)
inode->i_flags |= S_IMMUTABLE;
else
@@ -233,10 +234,12 @@
inode->i_flags |= S_APPEND;
else
inode->i_flags &= ~S_APPEND;
+ spin_unlock(inode->i_lock);
}
static void hfsplus_set_perms(struct inode *inode, struct hfsplus_perm *perms)
{
+ spin_lock(inode->i_lock);
if (inode->i_flags & S_IMMUTABLE)
perms->rootflags |= HFSPLUS_FLG_IMMUTABLE;
else
@@ -245,6 +248,7 @@
perms->rootflags |= HFSPLUS_FLG_APPEND;
else
perms->rootflags &= ~HFSPLUS_FLG_APPEND;
+ spin_unlock(inode->i_lock);
perms->userflags = HFSPLUS_I(inode).userflags;
perms->mode = cpu_to_be16(inode->i_mode);
perms->owner = cpu_to_be32(inode->i_uid);
@@ -285,7 +289,7 @@
if (atomic_dec_and_test(&HFSPLUS_I(inode).opencnt)) {
down(&inode->i_sem);
hfsplus_file_truncate(inode);
- if (inode->i_flags & S_DEAD) {
+ if (IS_DEADDIR(inode)) {
hfsplus_delete_cat(inode->i_ino, HFSPLUS_SB(sb).hidden_dir, NULL);
hfsplus_delete_inode(inode);
}
--- linux-2.6.5cow/fs/hfsplus/ioctl.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/hfsplus/ioctl.c 2004-04-14 16:22:24.000000000 +0200
@@ -53,6 +53,7 @@
EXT2_FLAG_NODUMP))
return -EOPNOTSUPP;
+ spin_lock(inode->i_lock);
if (flags & EXT2_FLAG_IMMUTABLE) { /* EXT2_IMMUTABLE_FL */
inode->i_flags |= S_IMMUTABLE;
HFSPLUS_I(inode).rootflags |= HFSPLUS_FLG_IMMUTABLE;
@@ -67,6 +68,7 @@
inode->i_flags &= ~S_APPEND;
HFSPLUS_I(inode).rootflags &= ~HFSPLUS_FLG_APPEND;
}
+ spin_unlock(inode->i_lock);
if (flags & EXT2_FLAG_NODUMP) /* EXT2_NODUMP_FL */
HFSPLUS_I(inode).userflags |= HFSPLUS_FLG_NODUMP;
else
--- linux-2.6.5cow/fs/smbfs/file.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/smbfs/file.c 2004-04-15 22:03:45.000000000 +0200
@@ -230,7 +230,7 @@
VERBOSE("before read, size=%ld, flags=%x, atime=%ld\n",
(long)dentry->d_inode->i_size,
- dentry->d_inode->i_flags, dentry->d_inode->i_atime);
+ inode_flags(dentry->d_inode, -1U), dentry->d_inode->i_atime);
status = generic_file_read(file, buf, count, ppos);
out:
--- linux-2.6.5cow/fs/afs/inode.c~lock_flags 2004-04-14 16:22:15.000000000 +0200
+++ linux-2.6.5cow/fs/afs/inode.c 2004-04-14 16:22:24.000000000 +0200
@@ -188,7 +188,9 @@
#endif
/* okay... it's a new inode */
+ spin_lock(inode->i_lock);
inode->i_flags |= S_NOATIME;
+ spin_unlock(inode->i_lock);
vnode->flags |= AFS_VNODE_CHANGED;
ret = afs_inode_fetch_status(inode);
if (ret<0)
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH COW] MAD COW
2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
` (3 preceding siblings ...)
2004-05-06 13:20 ` [PATCH COW] lock_flags Jörn Engel
@ 2004-05-06 13:21 ` Jörn Engel
2004-05-08 13:45 ` [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Denis Vlasenko
` (3 subsequent siblings)
8 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-06 13:21 UTC (permalink / raw)
To: linux-kernel
Patch 5
Jörn
--
A victorious army first wins and then seeks battle.
-- Sun Tzu
Allow COW behaviour for hard links, depending on an inode flag.
Semantics:
o Files with S_COWLINK do cow. (yes, really ;)
o Directories with S_COWLINK inherit flag to new files.
o If in doubt, return -EMLINK and let the user sort it out:
- When linking non-cow files to cow directories.
- When moving non-cow files/directories to cow directories.
- When moving cow files/directories to non-cow directories.
Thanks to Sytse and Andrew for tips.
fs/ext2/inode.c | 11 ++++++
fs/ext2/super.c | 2 -
fs/ext3/inode.c | 11 ++++++
fs/ext3/super.c | 2 -
fs/fcntl.c | 24 ++++++++++++++
fs/namei.c | 76 +++++++++++++++++++++++++++++++++++++++++++++-
fs/open.c | 79 +++++++++++++++++++++++++++++++++++++++++++++---
include/linux/ext2_fs.h | 1
include/linux/ext3_fs.h | 1
include/linux/fcntl.h | 3 +
include/linux/fs.h | 4 ++
11 files changed, 205 insertions(+), 9 deletions(-)
--- linux-2.6.5cow/fs/open.c~break_madcow 2004-05-04 16:31:50.000000000 +0200
+++ linux-2.6.5cow/fs/open.c 2004-05-06 01:02:57.000000000 +0200
@@ -723,6 +723,71 @@
return error;
}
+char *index(const char *s, int c)
+{
+ while (*s) {
+ if (*s == c)
+ return (char *) s;
+ s++;
+ }
+ return NULL;
+}
+
+char *rindex(const char *s, int c)
+{
+ char *ret = NULL;
+ while (*s) {
+ if (*s == c)
+ ret = (char *) s;
+ s++;
+ }
+ return ret;
+}
+
+char *madcow_temp_name(const char *name)
+{
+ const char temp[] = "__MADCOW_BREAK_LINK"; /* FIXME: add random part */
+ char *last_slash = rindex(name, '/');
+ size_t dir_len = last_slash ? last_slash + 1 - name : 0;
+ size_t len = sizeof(temp) + dir_len;
+ char *ret;
+
+ ret = kmalloc(len, GFP_KERNEL);
+ if (!ret)
+ return NULL;
+
+ strncpy(ret, name, dir_len);
+ strcpy(ret+dir_len, temp);
+ return ret;
+}
+
+int vfs_rename_other(struct inode *, struct dentry *,
+ struct inode *, struct dentry *);
+int do_copyfile(const char *from, const char *to, int mode);
+int do_rename(const char * oldname, const char * newname);
+
+int madcow_break_link(const char *from)
+{
+ int err, ret = -EMLINK;
+ char *to = madcow_temp_name(from);
+
+ printk("break link '%s' -> '%s'\n", from, to);
+ err = do_copyfile(from, to, -1);
+ printk("do_copyfile returned %d\n", err);
+ if (err)
+ goto out;
+
+ err = do_rename(to, from);
+ printk("do_rename returned %d\n", err);
+ if (err)
+ goto out;
+
+ ret = 0;
+out:
+ kfree(to);
+ return ret;
+}
+
/*
* Note that while the flag value (low two bits) for sys_open means:
* 00 - read-only
@@ -746,13 +811,19 @@
if ((namei_flags+1) & O_ACCMODE)
namei_flags++;
if (namei_flags & O_TRUNC)
- namei_flags |= 2;
+ namei_flags |= FMODE_WRITE;
error = open_namei(filename, namei_flags, mode, &nd);
- if (!error)
- return dentry_open(nd.dentry, nd.mnt, flags);
- return ERR_PTR(error);
+ if (error == -EMLINK) {
+ error = madcow_break_link(filename);
+ if (!error) /*retry*/
+ error = open_namei(filename, namei_flags, mode, &nd);
+ }
+ if (error)
+ return ERR_PTR(error);
+
+ return dentry_open(nd.dentry, nd.mnt, flags);
}
EXPORT_SYMBOL(filp_open);
--- linux-2.6.5cow/fs/ext2/inode.c~madcow 2004-04-27 16:48:55.000000000 +0200
+++ linux-2.6.5cow/fs/ext2/inode.c 2004-04-27 17:02:33.000000000 +0200
@@ -1021,7 +1021,8 @@
unsigned int flags = EXT2_I(inode)->i_flags;
spin_lock(inode->i_lock);
- inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+ inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME
+ | S_DIRSYNC | S_MADCOWLINK);
if (flags & EXT2_SYNC_FL)
inode->i_flags |= S_SYNC;
if (flags & EXT2_APPEND_FL)
@@ -1032,6 +1033,8 @@
inode->i_flags |= S_NOATIME;
if (flags & EXT2_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
+ if (flags & EXT2_MADCOWLINK_FL)
+ inode->i_flags |= S_MADCOWLINK;
spin_unlock(inode->i_lock);
}
@@ -1159,6 +1162,12 @@
if (IS_ERR(raw_inode))
return -EIO;
+ /* vfs inode holds the current MADCOWLINK flag, so we have to update
+ * ei->i_flags first */
+ ei->i_flags &= ~EXT2_MADCOWLINK_FL;
+ if (inode_flags(inode, S_MADCOWLINK))
+ ei->i_flags |= EXT2_MADCOWLINK_FL;
+
/* For fields not not tracking in the in-memory inode,
* initialise them to zero for new inodes. */
if (ei->i_state & EXT2_STATE_NEW)
--- linux-2.6.5cow/fs/ext2/super.c~madcow 2004-04-27 16:34:42.000000000 +0200
+++ linux-2.6.5cow/fs/ext2/super.c 2004-04-27 17:03:10.000000000 +0200
@@ -1015,7 +1015,7 @@
.name = "ext2",
.get_sb = ext2_get_sb,
.kill_sb = kill_block_super,
- .fs_flags = FS_REQUIRES_DEV,
+ .fs_flags = FS_REQUIRES_DEV | FS_MADCOW,
};
static int __init init_ext2_fs(void)
--- linux-2.6.5cow/fs/ext3/inode.c~madcow 2004-04-27 16:48:55.000000000 +0200
+++ linux-2.6.5cow/fs/ext3/inode.c 2004-04-27 17:04:41.000000000 +0200
@@ -2448,7 +2448,8 @@
unsigned int flags = EXT3_I(inode)->i_flags;
spin_lock(inode->i_lock);
- inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+ inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME
+ | S_DIRSYNC | S_MADCOWLINK);
if (flags & EXT3_SYNC_FL)
inode->i_flags |= S_SYNC;
if (flags & EXT3_APPEND_FL)
@@ -2459,6 +2460,8 @@
inode->i_flags |= S_NOATIME;
if (flags & EXT3_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
+ if (flags & EXT3_MADCOWLINK_FL)
+ inode->i_flags |= S_MADCOWLINK;
spin_unlock(inode->i_lock);
}
@@ -2594,6 +2597,12 @@
struct buffer_head *bh = iloc->bh;
int err = 0, rc, block;
+ /* vfs inode holds the current MADCOWLINK flag, so we have to update
+ * ei->i_flags first */
+ ei->i_flags &= ~EXT3_MADCOWLINK_FL;
+ if (inode_flags(inode, S_MADCOWLINK))
+ ei->i_flags |= EXT3_MADCOWLINK_FL;
+
/* For fields not not tracking in the in-memory inode,
* initialise them to zero for new inodes. */
if (ei->i_state & EXT3_STATE_NEW)
--- linux-2.6.5cow/fs/ext3/super.c~madcow 2004-04-27 16:34:42.000000000 +0200
+++ linux-2.6.5cow/fs/ext3/super.c 2004-04-27 17:05:22.000000000 +0200
@@ -2004,7 +2004,7 @@
.name = "ext3",
.get_sb = ext3_get_sb,
.kill_sb = kill_block_super,
- .fs_flags = FS_REQUIRES_DEV,
+ .fs_flags = FS_REQUIRES_DEV | FS_MADCOW,
};
static int __init init_ext3_fs(void)
--- linux-2.6.5cow/fs/fcntl.c~madcow 2004-04-27 16:34:42.000000000 +0200
+++ linux-2.6.5cow/fs/fcntl.c 2004-04-27 16:55:42.000000000 +0200
@@ -282,6 +282,24 @@
EXPORT_SYMBOL(f_delown);
+static long fcntl_setmadcow(struct file *filp, unsigned long arg)
+{
+ struct inode *inode = filp->f_dentry->d_inode;
+
+ if (!(inode->i_sb->s_type->fs_flags & FS_MADCOW))
+ return -EINVAL;
+ //FIXME: -EPERM?
+
+ spin_lock(&inode->i_lock);
+ if (arg)
+ inode->i_flags |= S_MADCOWLINK;
+ else
+ inode->i_flags &= ~S_MADCOWLINK;
+ spin_unlock(&inode->i_lock);
+ mark_inode_dirty(inode);
+ return 0;
+}
+
static long do_fcntl(unsigned int fd, unsigned int cmd,
unsigned long arg, struct file * filp)
{
@@ -346,6 +364,12 @@
case F_NOTIFY:
err = fcntl_dirnotify(fd, filp, arg);
break;
+ case F_SETMADCOW:
+ err = fcntl_setmadcow(filp, arg);
+ break;
+ case F_GETMADCOW:
+ err = !!inode_flags(filp->f_dentry->d_inode, S_MADCOWLINK);
+ break;
default:
break;
}
--- linux-2.6.5cow/fs/namei.c~madcow 2004-04-27 16:48:55.000000000 +0200
+++ linux-2.6.5cow/fs/namei.c 2004-05-02 15:07:57.000000000 +0200
@@ -223,6 +223,41 @@
return security_inode_permission(inode, mask, nd);
}
+static inline void set_madcowflag(struct inode *inode)
+{
+ spin_lock(&inode->i_lock);
+ inode->i_flags |= S_MADCOWLINK;
+ spin_unlock(&inode->i_lock);
+ mark_inode_dirty(inode);
+}
+
+/*
+ * Files with the S_MADCOWLINK flag set cannot be written to, if more
+ * than one hard link to them exists. Ultimately, this function
+ * should copy the inode, assign the copy to the dentry and lower use
+ * count of the old inode - one day.
+ * For now, it is sufficient to return an error and let userspace
+ * deal with the messy part. Not exactly the meaning of
+ * copy-on-write, but much better than writing to fifty files at once
+ * and noticing month later.
+ *
+ * Yes, this breaks the kernel interface and is simply wrong. This
+ * is intended behaviour, so Linus will not merge the code before
+ * it is complete. Or will he?
+ */
+static int break_madcow_link(struct inode *inode)
+{
+ if (!inode_flags(inode, S_MADCOWLINK))
+ return 0;
+ if (!S_ISREG(inode->i_mode))
+ return 0;
+ if (inode->i_nlink < 2)
+ return 0;
+ /* TODO: As soon as sendfile can do normal file copies, use that
+ * and always return 0 */
+ return -EMLINK;
+}
+
/*
* get_write_access() gets write permission for a file.
* put_write_access() releases this write permission.
@@ -243,6 +278,10 @@
int get_write_access(struct inode * inode)
{
+ int error = break_madcow_link(inode);
+ if (error)
+ return error;
+
spin_lock(&inode->i_lock);
if (atomic_read(&inode->i_writecount) < 0) {
spin_unlock(&inode->i_lock);
@@ -1146,6 +1185,8 @@
DQUOT_INIT(dir);
error = dir->i_op->create(dir, dentry, mode, nd);
if (!error) {
+ if (inode_flags(dir, S_MADCOWLINK))
+ set_madcowflag(dentry->d_inode);
inode_dir_notify(dir, DN_CREATE);
security_inode_post_create(dir, dentry, mode);
}
@@ -1520,6 +1561,8 @@
DQUOT_INIT(dir);
error = dir->i_op->mkdir(dir, dentry, mode);
if (!error) {
+ if (inode_flags(dir, S_MADCOWLINK))
+ set_madcowflag(dentry->d_inode);
inode_dir_notify(dir, DN_CREATE);
security_inode_post_mkdir(dir,dentry, mode);
}
@@ -1823,6 +1866,13 @@
return -EXDEV;
/*
+ * Madcowlink attribute is inherited from directory, but here,
+ * the inode already has one. If they don't match, bail out.
+ */
+ if (inode_flags(dir, S_MADCOWLINK) != inode_flags(inode, S_MADCOWLINK))
+ return -EMLINK;
+
+ /*
* A link to an append-only or immutable file cannot be created.
*/
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
@@ -2003,6 +2053,26 @@
return error;
}
+static int madcow_allow_rename(struct inode *old_dir, struct dentry *old_dentry,
+ struct inode *new_dir)
+{
+ struct inode *old_inode = old_dentry->d_inode;
+
+ /* source and target share directory: allow */
+ if (old_dir == new_dir)
+ return 0;
+ /* source and target directory have identical madcowlink flag: allow */
+ if (inode_flags(old_inode, S_MADCOWLINK) == inode_flags(new_dir, S_MADCOWLINK))
+ return 0;
+ /* We could always fail here, but madcowlink flag is only defined for
+ * files and directories, so let's allow special files */
+ if (!S_ISREG(old_inode->i_mode))
+ return -EMLINK;
+ if (!S_ISDIR(old_inode->i_mode))
+ return -EMLINK;
+ return 0;
+}
+
int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
struct inode *new_dir, struct dentry *new_dentry)
{
@@ -2026,6 +2096,10 @@
if (!old_dir->i_op || !old_dir->i_op->rename)
return -EPERM;
+ error = madcow_allow_rename(old_dir, old_dentry, new_dir);
+ if (error)
+ return error;
+
DQUOT_INIT(old_dir);
DQUOT_INIT(new_dir);
@@ -2118,7 +2118,7 @@
return error;
}
-static inline int do_rename(const char * oldname, const char * newname)
+int do_rename(const char * oldname, const char * newname)
{
int error = 0;
struct dentry * old_dir, * new_dir;
--- linux-2.6.5cow/include/linux/ext2_fs.h~madcow 2004-04-27 16:34:42.000000000 +0200
+++ linux-2.6.5cow/include/linux/ext2_fs.h 2004-04-27 17:01:30.000000000 +0200
@@ -192,6 +192,7 @@
#define EXT2_NOTAIL_FL 0x00008000 /* file tail should not be merged */
#define EXT2_DIRSYNC_FL 0x00010000 /* dirsync behaviour (directories only) */
#define EXT2_TOPDIR_FL 0x00020000 /* Top of directory hierarchies*/
+#define EXT2_MADCOWLINK_FL 0x00040000 /* COW behaviour for hard links */
#define EXT2_RESERVED_FL 0x80000000 /* reserved for ext2 lib */
#define EXT2_FL_USER_VISIBLE 0x0003DFFF /* User visible flags */
--- linux-2.6.5cow/include/linux/ext3_fs.h~madcow 2004-04-27 16:34:42.000000000 +0200
+++ linux-2.6.5cow/include/linux/ext3_fs.h 2004-04-27 17:03:46.000000000 +0200
@@ -185,6 +185,7 @@
#define EXT3_NOTAIL_FL 0x00008000 /* file tail should not be merged */
#define EXT3_DIRSYNC_FL 0x00010000 /* dirsync behaviour (directories only) */
#define EXT3_TOPDIR_FL 0x00020000 /* Top of directory hierarchies*/
+#define EXT3_MADCOWLINK_FL 0x00040000 /* COW behaviour for hard links */
#define EXT3_RESERVED_FL 0x80000000 /* reserved for ext3 lib */
#define EXT3_FL_USER_VISIBLE 0x0003DFFF /* User visible flags */
--- linux-2.6.5cow/include/linux/fcntl.h~madcow 2004-04-27 16:34:42.000000000 +0200
+++ linux-2.6.5cow/include/linux/fcntl.h 2004-04-27 16:49:45.000000000 +0200
@@ -23,6 +23,9 @@
#define DN_ATTRIB 0x00000020 /* File changed attibutes */
#define DN_MULTISHOT 0x80000000 /* Don't remove notifier */
+#define F_SETMADCOW (F_LINUX_SPECIFIC_BASE+3)
+#define F_GETMADCOW (F_LINUX_SPECIFIC_BASE+4)
+
#ifdef __KERNEL__
#if BITS_PER_LONG == 32
--- linux-2.6.5cow/include/linux/fs.h~madcow 2004-04-27 16:48:55.000000000 +0200
+++ linux-2.6.5cow/include/linux/fs.h 2004-04-30 08:40:04.000000000 +0200
@@ -90,6 +90,7 @@
/* public flags for file_system_type */
#define FS_REQUIRES_DEV 1
#define FS_BINARY_MOUNTDATA 2
+#define FS_MADCOW 4
#define FS_REVAL_DOT 16384 /* Check the paths ".", ".." for staleness */
#define FS_ODD_RENAME 32768 /* Temporary stuff; will go away as soon
* as nfs_rename() will be cleaned up
@@ -139,6 +140,9 @@
#define S_NOQUOTA 64 /* Inode is not counted to quota */
#define S_DIRSYNC 128 /* Directory modifications are synchronous */
#define S_NOCMTIME 256 /* Do not update file c/mtime */
+#define S_MADCOWLINK 512 /* Hard links have copy on write semantics.
+ * This flag has no meaning for directories,
+ * but is inherited to directory children */
/*
* Note that nosuid etc flags are inode-specific: setting some file-system
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
` (4 preceding siblings ...)
2004-05-06 13:21 ` [PATCH COW] MAD COW Jörn Engel
@ 2004-05-08 13:45 ` Denis Vlasenko
2004-05-08 22:10 ` Pavel Machek
2004-05-08 22:48 ` Pavel Machek
` (2 subsequent siblings)
8 siblings, 1 reply; 36+ messages in thread
From: Denis Vlasenko @ 2004-05-08 13:45 UTC (permalink / raw)
To: Jörn Engel, linux-kernel
Cc: Jamie Lokier, Pavel Machek, Eric W. Biederman
On Thursday 06 May 2004 16:17, Jörn Engel wrote:
> Couldn't sleep last night and finished a first complete version of
> cowlinks, code-named MAD COW. It is still based on the stupid old
> design with a flag to distinguish between regular hard links and
> cowlinks. Please don't comment on that design, it's just a proof of
> concept.
>
> Patches are against 2.6.5 but most things should apply to other 2.6
> kernel without too much trouble.
>
> 1 generic_sendpage - allow sendfile with ext[23] files as target
> 2 sendfile - introduce vfs_sendfile for in-kernel use
> 3 copyfile - new copyfile() system call
> 4 lock_flags - old cruft, just ignore it
> 5 madcow - the MAD COW itself
>
> Patches 1-3 will stay, 4 will be remove and 5 replaced by a better
> design over time. I've also set up a webpage for it:
> http://wohnheim.fh-wedel.de/~joern/cowlink/
>
> Maybe that should be converted into a wiki so someone with better
> taste than myself can improve it.
Hi,
Glad to see this happening. My filesystems are mostly reiser
these days, but if I have some time/occasion, I will try your patch.
Regarding semantics etc. I have read http://lwn.net/Articles/77972/.
I think most difficulties arose from inode numbers as a concept.
Usage of inode number is a historic UNIX misfeature.
AFAIK it is almost exclusively used for determining whether
two files are really the same: same (dev,ino) => same file.
Usage:
* diff wants this to avoid diffing file against itself.
* du wants this in order to not count file twice.
* cp/tar wants this in order to preserve hardlinks.
For cowlinks it is a bit different:
* diff still dont need to compare cowlinked file.
* cp/tar aren't required to detect cowlinks, may do this
as an optimization.
* du: I am not sure... analogous to sparse files?
In essense, diff wants to ask kernel "are these files have
identical contents?" while tar/cp ask "are these files
hardlinked together?". du asks "are these share storage"?
Original UNIX folks had to make inode number only an advisory
indicator, meaning "if ino1 != ino2, files are _definitely_ not
linked", and a syscall is_hardlinked(fd1,fd2), intended
for use when inodes are equal.
That is probably unfixable now, but you can avoid making similar
error. Provide is_cowlinked(fd1,fd2) syscall. Pity you will
have to use different inode numbers for cowlinks (due to tar/cp),
and this won't fly:
diff.c:
...
if(ino1==ino2 && is_cowlinked(fd1,fd2))
skip_diff();
...
diff will have to use if(is_cowlinked(fd1,fd2)), i.e. one
extra syscall, always. Anyway, for diff that's not tragic,
savings from not doing diff are still very substantial.
For du, that would hurt.
Per-block cow filesystems will open another can of worms
for diff and du. ;)
P.S. If is_hardlinked() is introduced too, tar/cp can be
updated to use that, and we can slowly drift into right
direction (of not assuming "equal inos => hardlinked file").
I think too much people will disagree with me, and this
won't happen.
--
vda
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-08 13:45 ` [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Denis Vlasenko
@ 2004-05-08 22:10 ` Pavel Machek
2004-05-09 14:09 ` Denis Vlasenko
0 siblings, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2004-05-08 22:10 UTC (permalink / raw)
To: Denis Vlasenko
Cc: Jörn Engel, linux-kernel, Jamie Lokier, Eric W. Biederman
Hi!
> That is probably unfixable now, but you can avoid making similar
> error. Provide is_cowlinked(fd1,fd2) syscall. Pity you will
> have to use different inode numbers for cowlinks (due to tar/cp),
> and this won't fly:
is_cowlinked does not fly, either. For n files, you have to do O(n^2)
calls to find those that are linked.
You want get_cowlinked_id which can return -1 "I do not know".
Pavel
--
Horseback riding is like software...
...vgf orggre jura vgf serr.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
` (5 preceding siblings ...)
2004-05-08 13:45 ` [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Denis Vlasenko
@ 2004-05-08 22:48 ` Pavel Machek
2004-05-10 15:53 ` Jörn Engel
2004-05-10 5:15 ` Eric W. Biederman
2004-05-12 16:39 ` Rob Landley
8 siblings, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2004-05-08 22:48 UTC (permalink / raw)
To: Jörn Engel
Cc: linux-kernel, Jamie Lokier, Eric W. Biederman, Denis Vlasenko
Hi!
> Couldn't sleep last night and finished a first complete version of
> cowlinks, code-named MAD COW. It is still based on the stupid old
> design with a flag to distinguish between regular hard links and
> cowlinks. Please don't comment on that design, it's just a proof of
> concept.
> Patches are against 2.6.5 but most things should apply to other 2.6
> kernel without too much trouble.
>
> 1 generic_sendpage - allow sendfile with ext[23] files as target
> 2 sendfile - introduce vfs_sendfile for in-kernel use
> 3 copyfile - new copyfile() system call
Well, up to "3" it seems usefull on its own. You might attempt to
merge that.
namei.c: you realy don't want to #include in the middle of .c file.
vfs_unlink followed by BUG_ON(error)... that's definitely wrong. In
case of disk error, you might get error on unlink; but you should not
BUG() on that. Perhaps copyfile() should be specified as "may leave
part of copy of target on disk in case of error"?
Pavel
--
Horseback riding is like software...
...vgf orggre jura vgf serr.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-08 22:10 ` Pavel Machek
@ 2004-05-09 14:09 ` Denis Vlasenko
2004-05-09 21:53 ` Pavel Machek
0 siblings, 1 reply; 36+ messages in thread
From: Denis Vlasenko @ 2004-05-09 14:09 UTC (permalink / raw)
To: Pavel Machek
Cc: Jörn Engel, linux-kernel, Jamie Lokier, Eric W. Biederman
On Sunday 09 May 2004 01:10, Pavel Machek wrote:
> Hi!
>
> > That is probably unfixable now, but you can avoid making similar
> > error. Provide is_cowlinked(fd1,fd2) syscall. Pity you will
> > have to use different inode numbers for cowlinks (due to tar/cp),
> > and this won't fly:
>
> is_cowlinked does not fly, either. For n files, you have to do O(n^2)
> calls to find those that are linked.
Hm, let me think about it. diff does not need to check all permutations,
it checks only those two which it needs to compare.
IMHO "inodes done right" is something like this: if inode numbers are
different, then files are not hardlinked/cowlinked. If they are the same,
check is_hardlinked(a,b) or is_cowlinked(a,b) to find out.
This beats O(n^2) argument.
But this is non-POSIX, would not be accepted.
I don't know how to handle this now. Introducing cow-inode number
with semantic "cowino1==cowino2 => files are cowlinked" is
ugly and won't deal with per-block cow. Sooner or later someone
will want to have per block cow. Think about cow'ing multi-gigabyte
database files for checkpointing/backup purposes...
> You want get_cowlinked_id which can return -1 "I do not know".
Is this the same to my "cow-inode number" concept above?
--
vda
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-09 14:09 ` Denis Vlasenko
@ 2004-05-09 21:53 ` Pavel Machek
2004-05-10 15:44 ` Jörn Engel
2004-05-12 20:29 ` Rob Landley
0 siblings, 2 replies; 36+ messages in thread
From: Pavel Machek @ 2004-05-09 21:53 UTC (permalink / raw)
To: Denis Vlasenko
Cc: Jörn Engel, linux-kernel, Jamie Lokier, Eric W. Biederman
Hi!
> > > That is probably unfixable now, but you can avoid making similar
> > > error. Provide is_cowlinked(fd1,fd2) syscall. Pity you will
> > > have to use different inode numbers for cowlinks (due to tar/cp),
> > > and this won't fly:
> >
> > is_cowlinked does not fly, either. For n files, you have to do O(n^2)
> > calls to find those that are linked.
>
> Hm, let me think about it. diff does not need to check all permutations,
> it checks only those two which it needs to compare.
And tar?
> IMHO "inodes done right" is something like this: if inode numbers are
> different, then files are not hardlinked/cowlinked. If they are the same,
> check is_hardlinked(a,b) or is_cowlinked(a,b) to find out.
>
> This beats O(n^2) argument.
>
> But this is non-POSIX, would not be accepted.
I do not see how this matters. cowlinks are already non-POSIX. If
is_hardlinked() always returns 1, but is_cowlinked() sometimes returns
something else, you are still "as much POSIX as possible".
> I don't know how to handle this now. Introducing cow-inode number
> with semantic "cowino1==cowino2 => files are cowlinked" is
> ugly and won't deal with per-block cow. Sooner or later someone
> will want to have per block cow. Think about cow'ing multi-gigabyte
> database files for checkpointing/backup purposes...
Well, if only block 17 is cowlink-shared between two files, I guess
userspace does not want to know... And I think that cow-inode number
*can* handle all other cases.
Oh, get_cow_inode() should really be allowed to fail in some usefull
way, so that filesystems do not have to implement it if its hard for
them.
> > You want get_cowlinked_id which can return -1 "I do not know".
>
> Is this the same to my "cow-inode number" concept above?
Yes.
--
Horseback riding is like software...
...vgf orggre jura vgf serr.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
` (6 preceding siblings ...)
2004-05-08 22:48 ` Pavel Machek
@ 2004-05-10 5:15 ` Eric W. Biederman
2004-05-10 15:59 ` Jörn Engel
2004-05-12 16:39 ` Rob Landley
8 siblings, 1 reply; 36+ messages in thread
From: Eric W. Biederman @ 2004-05-10 5:15 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-kernel, Jamie Lokier, Pavel Machek, Denis Vlasenko
Jörn Engel <joern@wohnheim.fh-wedel.de> writes:
> 3 copyfile - new copyfile() system call
> http://wohnheim.fh-wedel.de/~joern/cowlink/
Question about sys_copyfile.
Is the intention that a new file with completely new permissions
be created?
Some people have wanted a copyfile that copies all of the extra
metadata user/group/acls.
I currently see technical merit in both approaches.
Looking at the CIFS information it appears that the CopyFILE RPC
copies the permissions. It is not at all clear about that, and
the fact it appears to copy permissions may simply be a specification
bug. Given that FAT does not really have permissions, let alone
extended attributes it would be an easy mistake to make.
In the general case you cannot copy permissions from one file to
another, either you don't have those permissions yourself or the
target file system may not support them all. Not copying
permissions leads to a simpler implementation with the burden
of the work left to user space. What is not done is a loop through
the extended attributes.
Eric
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-09 21:53 ` Pavel Machek
@ 2004-05-10 15:44 ` Jörn Engel
2004-05-10 15:51 ` Pavel Machek
2004-05-12 0:26 ` Jamie Lokier
2004-05-12 20:29 ` Rob Landley
1 sibling, 2 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-10 15:44 UTC (permalink / raw)
To: Pavel Machek
Cc: Denis Vlasenko, linux-kernel, Jamie Lokier, Eric W. Biederman
On Sun, 9 May 2004 23:53:51 +0200, Pavel Machek wrote:
>
> I do not see how this matters. cowlinks are already non-POSIX. If
> is_hardlinked() always returns 1, but is_cowlinked() sometimes returns
> something else, you are still "as much POSIX as possible".
Assuming that POSIX is generally pretty sane, "as sane as possible" is
good enough for me. If POSIX is insane in one way or another,
breaking POSIX is actually a feature, imho.
> > I don't know how to handle this now. Introducing cow-inode number
> > with semantic "cowino1==cowino2 => files are cowlinked" is
> > ugly and won't deal with per-block cow. Sooner or later someone
> > will want to have per block cow. Think about cow'ing multi-gigabyte
> > database files for checkpointing/backup purposes...
>
> Well, if only block 17 is cowlink-shared between two files, I guess
> userspace does not want to know... And I think that cow-inode number
> *can* handle all other cases.
Correct.
> Oh, get_cow_inode() should really be allowed to fail in some usefull
> way, so that filesystems do not have to implement it if its hard for
> them.
Also correct.
And while we're at it, I need a brain extention again. Pavel? ;)
The real cowlink patch (too ugly to show it yet) is something like
this:
inode 1: real file
inode 2: cow->inode 1
inode 3: cow->inode 1
Stat() on inode 2 is implemented here:
static void cowlink_fillattr(struct inode *link, struct kstat *stat)
{
struct super_block *sb = link->i_sb;
ino_t ino = link->i_op->readcow(link);
struct inode *inode = iget(sb, ino);
stat->dev = link->i_sb->s_dev;
stat->ino = ino;
stat->mode = link->i_mode - S_IFCOW + S_IFREG;
stat->nlink = link->i_nlink;
stat->uid = link->i_uid;
stat->gid = link->i_gid;
stat->rdev = 0;
stat->atime = link->i_atime;
stat->mtime = link->i_mtime;
stat->ctime = link->i_ctime;
stat->size = i_size_read(inode);
stat->blocks = inode->i_blocks;
stat->blksize = inode->i_blksize;
}
size, blocks and blksize has to be taken from inode 1, sure. uid,
gid, mode and *time are from inode 2, also obvious. dev doesn't
matter yet, but we might do cross-filesystem cowlinks someday. nlink
should be from inode 2, so several hard links on it work, also agreed.
What about ino? I currently return 1, so diff remains fast without
any changes. If someone really needs the difference between inode 2
and 3, I would introduce a cstat() system call similar to lstat(),
which would return ino=2.
Is this sane? Should it be reversed and cstat() return ino=1, while
stat returns ino=2? I can imagine that "tar -x" would create hard
links for every cowlink that "tar -c" saw, but I'm not sure yet.
Jörn
--
They laughed at Galileo. They laughed at Copernicus. They laughed at
Columbus. But remember, they also laughed at Bozo the Clown.
-- unknown
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-10 15:44 ` Jörn Engel
@ 2004-05-10 15:51 ` Pavel Machek
2004-05-10 15:56 ` Jörn Engel
2004-05-12 0:26 ` Jamie Lokier
1 sibling, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2004-05-10 15:51 UTC (permalink / raw)
To: Jörn Engel
Cc: Denis Vlasenko, linux-kernel, Jamie Lokier, Eric W. Biederman
Hi!
> > Oh, get_cow_inode() should really be allowed to fail in some usefull
> > way, so that filesystems do not have to implement it if its hard for
> > them.
>
> Also correct.
>
> And while we're at it, I need a brain extention again. Pavel? ;)
>
> The real cowlink patch (too ugly to show it yet) is something like
> this:
>
> inode 1: real file
> inode 2: cow->inode 1
> inode 3: cow->inode 1
>
> Stat() on inode 2 is implemented here:
>
> static void cowlink_fillattr(struct inode *link, struct kstat *stat)
> {
> struct super_block *sb = link->i_sb;
> ino_t ino = link->i_op->readcow(link);
> struct inode *inode = iget(sb, ino);
>
> stat->dev = link->i_sb->s_dev;
> stat->ino = ino;
> stat->mode = link->i_mode - S_IFCOW + S_IFREG;
> stat->nlink = link->i_nlink;
> stat->uid = link->i_uid;
> stat->gid = link->i_gid;
> stat->rdev = 0;
> stat->atime = link->i_atime;
> stat->mtime = link->i_mtime;
> stat->ctime = link->i_ctime;
> stat->size = i_size_read(inode);
> stat->blocks = inode->i_blocks;
> stat->blksize = inode->i_blksize;
> }
>
> size, blocks and blksize has to be taken from inode 1, sure. uid,
> gid, mode and *time are from inode 2, also obvious. dev doesn't
> matter yet, but we might do cross-filesystem cowlinks someday. nlink
> should be from inode 2, so several hard links on it work, also agreed.
>
> What about ino? I currently return 1, so diff remains fast without
> any changes. If someone really needs the difference between inode 2
> and 3, I would introduce a cstat() system call similar to lstat(),
> which would return ino=2.
I think you need to return 2, otherwise inode numbers change when
cowling is broken.... and that would be bad.
Aha.. That is another neccessary property for get_cow_inode():
cow_inode can change, any time, unlike normal inode.
Pavel
--
Horseback riding is like software...
...vgf orggre jura vgf serr.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-08 22:48 ` Pavel Machek
@ 2004-05-10 15:53 ` Jörn Engel
2004-05-10 19:26 ` Jan Harkes
0 siblings, 1 reply; 36+ messages in thread
From: Jörn Engel @ 2004-05-10 15:53 UTC (permalink / raw)
To: Pavel Machek
Cc: linux-kernel, Jamie Lokier, Eric W. Biederman, Denis Vlasenko
On Sun, 9 May 2004 00:48:35 +0200, Pavel Machek wrote:
>
> > Couldn't sleep last night and finished a first complete version of
> > cowlinks, code-named MAD COW. It is still based on the stupid old
> > design with a flag to distinguish between regular hard links and
> > cowlinks. Please don't comment on that design, it's just a proof of
> > concept.
>
> > Patches are against 2.6.5 but most things should apply to other 2.6
> > kernel without too much trouble.
> >
> > 1 generic_sendpage - allow sendfile with ext[23] files as target
> > 2 sendfile - introduce vfs_sendfile for in-kernel use
> > 3 copyfile - new copyfile() system call
>
> Well, up to "3" it seems usefull on its own. You might attempt to
> merge that.
True. Right now I'm just lazy and prefer to keep them under my
control. Maybe if 2.7 was already opened.
> namei.c: you realy don't want to #include in the middle of .c file.
Actually, I do. I'm still not convinced that all the low-level
read/write functions need a struct file* as an argument, that looks
like a really ugly hack to help nfs and some others. And if those can
go, the #include will go as well. So the ugly line serves as a
reminder for a different problem. :)
On the other hand, if you can convince me that there is no way to
avoid passing the struct file*, then I agree and the #include will
move up.
> vfs_unlink followed by BUG_ON(error)... that's definitely wrong. In
> case of disk error, you might get error on unlink; but you should not
> BUG() on that. Perhaps copyfile() should be specified as "may leave
> part of copy of target on disk in case of error"?
Yes, makes sense. I will keep the BUG_ON for a while as a reminder to
really think through all the consequences, but most likely, I will
just follow your suggestion.
A real problem is that copyfile() has all errno's from create(),
sendfile() and unlink() combined, which doesn't make error handling in
userspace easy. "It could be this, that or another error" is the kind
of mess I always hated about Windows, so I should try to do a little
better.
Jörn
--
It's not whether you win or lose, it's how you place the blame.
-- unknown
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-10 15:51 ` Pavel Machek
@ 2004-05-10 15:56 ` Jörn Engel
0 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-10 15:56 UTC (permalink / raw)
To: Pavel Machek
Cc: Denis Vlasenko, linux-kernel, Jamie Lokier, Eric W. Biederman
On Mon, 10 May 2004 17:51:27 +0200, Pavel Machek wrote:
> >
> > What about ino? I currently return 1, so diff remains fast without
> > any changes. If someone really needs the difference between inode 2
> > and 3, I would introduce a cstat() system call similar to lstat(),
> > which would return ino=2.
>
> I think you need to return 2, otherwise inode numbers change when
> cowling is broken.... and that would be bad.
Makes sense. Damn, now I gotta touch diff as well. :(
> Aha.. That is another neccessary property for get_cow_inode():
> cow_inode can change, any time, unlike normal inode.
Correct. Still, I prefer cstat(), even though almost all information
is identical to stat. But the device might change as well, one day.
Jörn
--
With a PC, I always felt limited by the software available. On Unix,
I am limited only by my knowledge.
-- Peter J. Schoenster
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-10 5:15 ` Eric W. Biederman
@ 2004-05-10 15:59 ` Jörn Engel
0 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-10 15:59 UTC (permalink / raw)
To: Eric W. Biederman
Cc: linux-kernel, Jamie Lokier, Pavel Machek, Denis Vlasenko
On Sun, 9 May 2004 23:15:33 -0600, Eric W. Biederman wrote:
> Jörn Engel <joern@wohnheim.fh-wedel.de> writes:
> > 3 copyfile - new copyfile() system call
> > http://wohnheim.fh-wedel.de/~joern/cowlink/
>
> Question about sys_copyfile.
>
> Is the intention that a new file with completely new permissions
> be created?
>
> Some people have wanted a copyfile that copies all of the extra
> metadata user/group/acls.
>
> I currently see technical merit in both approaches.
>
> Looking at the CIFS information it appears that the CopyFILE RPC
> copies the permissions. It is not at all clear about that, and
> the fact it appears to copy permissions may simply be a specification
> bug. Given that FAT does not really have permissions, let alone
> extended attributes it would be an easy mistake to make.
>
> In the general case you cannot copy permissions from one file to
> another, either you don't have those permissions yourself or the
> target file system may not support them all. Not copying
> permissions leads to a simpler implementation with the burden
> of the work left to user space. What is not done is a loop through
> the extended attributes.
Unless someone finds a good reason to change this, I prefer to create
new permissions and not copy the acl's. That way, I can also
copyfile() a file belonging to someone else, as long as I have read
access to it.
Jörn
--
Don't worry about people stealing your ideas. If your ideas are any good,
you'll have to ram them down people's throats.
-- Howard Aiken quoted by Ken Iverson quoted by Jim Horning quoted by
Raph Levien, 1979
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-10 15:53 ` Jörn Engel
@ 2004-05-10 19:26 ` Jan Harkes
2004-05-11 10:02 ` Jörn Engel
0 siblings, 1 reply; 36+ messages in thread
From: Jan Harkes @ 2004-05-10 19:26 UTC (permalink / raw)
To: Jörn Engel
Cc: Pavel Machek, linux-kernel, Jamie Lokier, Eric W. Biederman,
Denis Vlasenko
On Mon, May 10, 2004 at 05:53:59PM +0200, Jörn Engel wrote:
> A real problem is that copyfile() has all errno's from create(),
> sendfile() and unlink() combined, which doesn't make error handling in
> userspace easy. "It could be this, that or another error" is the kind
> of mess I always hated about Windows, so I should try to do a little
> better.
Well, if you leave the create and unlink up to the application and
simply pass open filedescriptors to copyfile... But then it would be
equivalent to your new sendfile.
Copyfile can trivially be implemented in libc. I don't see why it would
have to be a system call. If a network filesystem wants to optimize the
file copying it could do this based on the sendfile data. If source and
destination are within the same filesystem and we're copying the whole
file starting at offset 0, send a copyfile RPC.
Jan
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-10 19:26 ` Jan Harkes
@ 2004-05-11 10:02 ` Jörn Engel
2004-05-11 14:08 ` Jan Harkes
2004-05-11 15:40 ` Steve French
0 siblings, 2 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-11 10:02 UTC (permalink / raw)
To: Jan Harkes; +Cc: Steve French, linux-kernel
On Mon, 10 May 2004 15:26:02 -0400, Jan Harkes wrote:
> On Mon, May 10, 2004 at 05:53:59PM +0200, Jörn Engel wrote:
> > A real problem is that copyfile() has all errno's from create(),
> > sendfile() and unlink() combined, which doesn't make error handling in
> > userspace easy. "It could be this, that or another error" is the kind
> > of mess I always hated about Windows, so I should try to do a little
> > better.
>
> Well, if you leave the create and unlink up to the application and
> simply pass open filedescriptors to copyfile... But then it would be
> equivalent to your new sendfile.
>
> Copyfile can trivially be implemented in libc. I don't see why it would
> have to be a system call. If a network filesystem wants to optimize the
> file copying it could do this based on the sendfile data. If source and
> destination are within the same filesystem and we're copying the whole
> file starting at offset 0, send a copyfile RPC.
Can you explain this to Steve? I'm still quite clueless about network
filesystems, but it sounded as if such an optimization was impossible
to do in cifs without a combined create/copy/unlink_on_error system
call.
If your suggestion works and the network filesystems can be changed to
work independently of a struct file*, I agree with you that copyfile()
is a stupid idea and should be forgotten.
Jörn
--
Courage is not the absence of fear, but rather the judgement that
something else is more important than fear.
-- Ambrose Redmoon
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-11 10:02 ` Jörn Engel
@ 2004-05-11 14:08 ` Jan Harkes
2004-05-11 14:18 ` Jan Harkes
` (2 more replies)
2004-05-11 15:40 ` Steve French
1 sibling, 3 replies; 36+ messages in thread
From: Jan Harkes @ 2004-05-11 14:08 UTC (permalink / raw)
To: Jörn Engel; +Cc: Steve French, linux-kernel
On Tue, May 11, 2004 at 12:02:32PM +0200, Jörn Engel wrote:
> > Copyfile can trivially be implemented in libc. I don't see why it would
> > have to be a system call. If a network filesystem wants to optimize the
> > file copying it could do this based on the sendfile data. If source and
> > destination are within the same filesystem and we're copying the whole
> > file starting at offset 0, send a copyfile RPC.
>
> Can you explain this to Steve? I'm still quite clueless about network
> filesystems, but it sounded as if such an optimization was impossible
> to do in cifs without a combined create/copy/unlink_on_error system
> call.
>
> If your suggestion works and the network filesystems can be changed to
> work independently of a struct file*, I agree with you that copyfile()
> is a stupid idea and should be forgotten.
I would probably do it by overriding the file_operations.sendfile
function. A first approximation of a possible implementation follows. I
went a bit crazy on the comments. The only problem is that the type of
target is unknown, block/loop.c and nfsd/vfs.c are using sendfile to
to send to something that is not a struct file.
Jan
int my_file_sendfile(struct file *in_file, loff_t *ppos, size_t count,
read_actor_t actor, void __user *target)
{
struct file *out_file = NULL;
/* We have to check the read_actor callback function to see if the
* target actually points at a struct file. */
if (actor != file_send_actor)
goto copy_local;
/* are both source and destination within the same file system
* mountpoint? */
if (in_file->f_dentry->d_inode->i_sb != out_file->f_dentry->d_inode->i_sb)
goto copy_local;
/* are we copying the entire source file? */
if (*ppos != 0 || count != in_file->f_dentry->d_inode->i_size)
goto copy_local;
/* and are we at least overwriting the complete destination file?
(alternatively we could check that the out_file is currently empty) */
if (out_file->f_pos != 0 || out_file->f_dentry->d_inode->i_size > count)
goto copy_local;
/* we are in luck and can send a copyfile rpc */
int err = make_copyfile_rpc(in_file, out_file);
/* we probably should force a refresh of out_file as it changed on
* the server and not locally */
MY_I(out_file->f_dentry->d_inode)->time = 0;
return err;
copy_local:
/* no luck, we're sending to something that isn't a file or in the
* same filesystem, or we only copy a partial file. any case, we
* have to perform the copy locally. */
return generic_file_sendfile(in_file, ppos, count, actor, target);
}
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-11 14:08 ` Jan Harkes
@ 2004-05-11 14:18 ` Jan Harkes
2004-05-11 14:33 ` Jörn Engel
2004-05-21 23:23 ` Rob Landley
2 siblings, 0 replies; 36+ messages in thread
From: Jan Harkes @ 2004-05-11 14:18 UTC (permalink / raw)
To: Jörn Engel, Steve French, linux-kernel
Forgot a somewhat useful line there,
On Tue, May 11, 2004 at 10:08:53AM -0400, Jan Harkes wrote:
> int my_file_sendfile(struct file *in_file, loff_t *ppos, size_t count,
> read_actor_t actor, void __user *target)
> {
> struct file *out_file = NULL;
>
> /* We have to check the read_actor callback function to see if the
> * target actually points at a struct file. */
> if (actor != file_send_actor)
> goto copy_local;
out_file = (struct file *)target;
> /* are both source and destination within the same file system
> * mountpoint? */
> if (in_file->f_dentry->d_inode->i_sb != out_file->f_dentry->d_inode->i_sb)
> goto copy_local;
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-11 14:08 ` Jan Harkes
2004-05-11 14:18 ` Jan Harkes
@ 2004-05-11 14:33 ` Jörn Engel
2004-05-21 23:23 ` Rob Landley
2 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-11 14:33 UTC (permalink / raw)
To: Steve French, linux-kernel
On Tue, 11 May 2004 10:08:53 -0400, Jan Harkes wrote:
> On Tue, May 11, 2004 at 12:02:32PM +0200, Jörn Engel wrote:
> > > Copyfile can trivially be implemented in libc. I don't see why it would
> > > have to be a system call. If a network filesystem wants to optimize the
> > > file copying it could do this based on the sendfile data. If source and
> > > destination are within the same filesystem and we're copying the whole
> > > file starting at offset 0, send a copyfile RPC.
> >
> > Can you explain this to Steve? I'm still quite clueless about network
> > filesystems, but it sounded as if such an optimization was impossible
> > to do in cifs without a combined create/copy/unlink_on_error system
> > call.
> >
> > If your suggestion works and the network filesystems can be changed to
> > work independently of a struct file*, I agree with you that copyfile()
> > is a stupid idea and should be forgotten.
>
> I would probably do it by overriding the file_operations.sendfile
> function. A first approximation of a possible implementation follows. I
> went a bit crazy on the comments. The only problem is that the type of
> target is unknown, block/loop.c and nfsd/vfs.c are using sendfile to
> to send to something that is not a struct file.
Looks as if it could work. Steve, can you verify this?
Jörn
--
A surrounded army must be given a way out.
-- Sun Tzu
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-11 10:02 ` Jörn Engel
2004-05-11 14:08 ` Jan Harkes
@ 2004-05-11 15:40 ` Steve French
2004-05-11 15:58 ` Jörn Engel
1 sibling, 1 reply; 36+ messages in thread
From: Steve French @ 2004-05-11 15:40 UTC (permalink / raw)
To: Jörn Engel; +Cc: Jan Harkes, linux-kernel
It would not be helpful to take a userspace request to perform a file
(or directory) copy operation and break it into open/sendfile/close by
passing file handles to the network filesystem and have this work for
SMB/CIFS - there is no equivalent network protocol operation. It also
makes the operation much, much harder to make atomic (since two systems
are involved) and makes error handling and recovery for network
filesystems much harder since inconsistent client and server state have
to be considered if the copy operation is broken into pieces on the
clien (it is also slower - a single copy operation on the network is the
absolute optimal case - minimizes the expensive network latency - the
roundtrip delay - open/sendfile/close sends at a minimum three times as
many but likely four with an extra lookup or two)
Currently copy file (or copy directory for that matter) as both speced
(and is implemented in various servers) in the SMB/CIFS network
filesystem protocol takes in effect four parameters (there is no handle
based file copy):
a source pathname, and source export (actually SMB tree identifier for
a share, but in effect the same thing)
a target pathname, and target export (actually SMB tree identifier for a
share, but in effect the same thing)
And the shares (exports) referenced have to be on the same server
Trying to ignore the 1st file open
On Tue, 2004-05-11 at 05:02, Jörn Engel wrote:
> On Mon, 10 May 2004 15:26:02 -0400, Jan Harkes wrote:
> > On Mon, May 10, 2004 at 05:53:59PM +0200, Jörn Engel wrote:
> > > A real problem is that copyfile() has all errno's from create(),
> > > sendfile() and unlink() combined, which doesn't make error handling in
> > > userspace easy. "It could be this, that or another error" is the kind
> > > of mess I always hated about Windows, so I should try to do a little
> > > better.
> >
> > Well, if you leave the create and unlink up to the application and
> > simply pass open filedescriptors to copyfile... But then it would be
> > equivalent to your new sendfile.
> >
> > Copyfile can trivially be implemented in libc. I don't see why it would
> > have to be a system call. If a network filesystem wants to optimize the
> > file copying it could do this based on the sendfile data. If source and
> > destination are within the same filesystem and we're copying the whole
> > file starting at offset 0, send a copyfile RPC.
>
> Can you explain this to Steve? I'm still quite clueless about network
> filesystems, but it sounded as if such an optimization was impossible
> to do in cifs without a combined create/copy/unlink_on_error system
> call.
>
> If your suggestion works and the network filesystems can be changed to
> work independently of a struct file*, I agree with you that copyfile()
> is a stupid idea and should be forgotten.
>
> Jörn
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-11 15:40 ` Steve French
@ 2004-05-11 15:58 ` Jörn Engel
0 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-11 15:58 UTC (permalink / raw)
To: Steve French; +Cc: Jan Harkes, linux-kernel
On Tue, 11 May 2004 10:40:49 -0500, Steve French wrote:
>
> It would not be helpful to take a userspace request to perform a file
> (or directory) copy operation and break it into open/sendfile/close by
> passing file handles to the network filesystem and have this work for
> SMB/CIFS - there is no equivalent network protocol operation. It also
> makes the operation much, much harder to make atomic (since two systems
> are involved) and makes error handling and recovery for network
> filesystems much harder since inconsistent client and server state have
> to be considered if the copy operation is broken into pieces on the
> clien (it is also slower - a single copy operation on the network is the
> absolute optimal case - minimizes the expensive network latency - the
> roundtrip delay - open/sendfile/close sends at a minimum three times as
> many but likely four with an extra lookup or two)
>
> Currently copy file (or copy directory for that matter) as both speced
> (and is implemented in various servers) in the SMB/CIFS network
> filesystem protocol takes in effect four parameters (there is no handle
> based file copy):
>
> a source pathname, and source export (actually SMB tree identifier for
> a share, but in effect the same thing)
> a target pathname, and target export (actually SMB tree identifier for a
> share, but in effect the same thing)
> And the shares (exports) referenced have to be on the same server
Or in short, copyfile makes sense for smbfs/cifs. The question
whether my hack can be cleaned up enough to get past Al Viro is still
open, though. :)
Jörn
--
Homo Sapiens is a goal, not a description.
-- unknown
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-10 15:44 ` Jörn Engel
2004-05-10 15:51 ` Pavel Machek
@ 2004-05-12 0:26 ` Jamie Lokier
2004-05-13 10:56 ` Jörn Engel
1 sibling, 1 reply; 36+ messages in thread
From: Jamie Lokier @ 2004-05-12 0:26 UTC (permalink / raw)
To: Jörn Engel
Cc: Pavel Machek, Denis Vlasenko, linux-kernel, Eric W. Biederman
Jörn Engel wrote:
> What about ino? I currently return 1, so diff remains fast without
> any changes. If someone really needs the difference between inode 2
> and 3, I would introduce a cstat() system call similar to lstat(),
> which would return ino=2.
>
> Is this sane? Should it be reversed and cstat() return ino=1, while
> stat returns ino=2? I can imagine that "tar -x" would create hard
> links for every cowlink that "tar -c" saw, but I'm not sure yet.
I think it should be reversed.
One very useful application for cowlinks is for virtual machine (UML)
and chroot jail setups, where an entire filesystem tree is copied
perhaps hundreds of times on a single disk. I'm surprised we didn't
think of this earlier, as it's potentially one of the most useful
applications for cowlinks.
In that scenario, cowlinks would save enormous amounts of storage and
potentially save memory too. However to be useful at all, they'd need
to have accurate POSIX semantics: that is, cowlinks must behave very
much as a storage optimisation only.
That means stat() should return ino==2.
-- Jamie
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
` (7 preceding siblings ...)
2004-05-10 5:15 ` Eric W. Biederman
@ 2004-05-12 16:39 ` Rob Landley
2004-05-20 13:49 ` Pavel Machek
8 siblings, 1 reply; 36+ messages in thread
From: Rob Landley @ 2004-05-12 16:39 UTC (permalink / raw)
To: Jörn Engel, linux-kernel
On Thursday 06 May 2004 08:17, Jörn Engel wrote:
> Couldn't sleep last night and finished a first complete version of
> cowlinks, code-named MAD COW. It is still based on the stupid old
> design with a flag to distinguish between regular hard links and
> cowlinks. Please don't comment on that design, it's just a proof of
> concept.
Catching up on some really old mail, I thought I'd ask:
For years now I've wanted to use a sendfile variant to tell the system to
connect two filehandles from userspace. Not just web servers want to
marshall data from one filehandle into another, things like netcat want to do
it between a pipe and a network connection, and I've wrote a couple of data
dispatcher daemons that wanted to do it between two network connections.
Unfortunately, sendfile didn't work generically when I tried it (back under
2.4). Would this infrastructure be a step in the right direction to
eliminate gratuitous poll loops (where nobody but me EVER seems to get the
"shutdown just one half of the connection" thing right. My netcat can handle
"echo 'GET /' | netcat www.slashdot.org 80". The standard netcat can't.
Yes, I plan to fix the one in busybox eventually...)
Rob
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-09 21:53 ` Pavel Machek
2004-05-10 15:44 ` Jörn Engel
@ 2004-05-12 20:29 ` Rob Landley
1 sibling, 0 replies; 36+ messages in thread
From: Rob Landley @ 2004-05-12 20:29 UTC (permalink / raw)
To: Pavel Machek, Denis Vlasenko
Cc: Jörn Engel, linux-kernel, Jamie Lokier, Eric W. Biederman
On Sunday 09 May 2004 16:53, Pavel Machek wrote:
> > I don't know how to handle this now. Introducing cow-inode number
> > with semantic "cowino1==cowino2 => files are cowlinked" is
> > ugly and won't deal with per-block cow. Sooner or later someone
> > will want to have per block cow. Think about cow'ing multi-gigabyte
> > database files for checkpointing/backup purposes...
>
> Well, if only block 17 is cowlink-shared between two files, I guess
> userspace does not want to know... And I think that cow-inode number
> *can* handle all other cases.
I remember somebody had a toy to find out the file block ranges (on ext2/ext3
anyway), which could detect fragmentation and holes and such. Presumably,
whatever they did would already detect per-block cowlinks, if anybody
actually cared...
Rob
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-12 0:26 ` Jamie Lokier
@ 2004-05-13 10:56 ` Jörn Engel
0 siblings, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-13 10:56 UTC (permalink / raw)
To: Jamie Lokier
Cc: Pavel Machek, Denis Vlasenko, linux-kernel, Eric W. Biederman
On Wed, 12 May 2004 01:26:06 +0100, Jamie Lokier wrote:
> Jörn Engel wrote:
> > What about ino? I currently return 1, so diff remains fast without
> > any changes. If someone really needs the difference between inode 2
> > and 3, I would introduce a cstat() system call similar to lstat(),
> > which would return ino=2.
> >
> > Is this sane? Should it be reversed and cstat() return ino=1, while
> > stat returns ino=2? I can imagine that "tar -x" would create hard
> > links for every cowlink that "tar -c" saw, but I'm not sure yet.
>
> I think it should be reversed.
Aye.
> One very useful application for cowlinks is for virtual machine (UML)
> and chroot jail setups, where an entire filesystem tree is copied
> perhaps hundreds of times on a single disk. I'm surprised we didn't
> think of this earlier, as it's potentially one of the most useful
> applications for cowlinks.
>
> In that scenario, cowlinks would save enormous amounts of storage and
> potentially save memory too. However to be useful at all, they'd need
> to have accurate POSIX semantics: that is, cowlinks must behave very
> much as a storage optimisation only.
>
> That means stat() should return ino==2.
Hmm, true. Up 'till now, that was done with disk images and block
based diffs/snapshots. Nice application.
Jörn
--
He who knows others is wise.
He who knows himself is enlightened.
-- Lao Tsu
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-12 16:39 ` Rob Landley
@ 2004-05-20 13:49 ` Pavel Machek
2004-05-25 21:55 ` Rob Landley
0 siblings, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2004-05-20 13:49 UTC (permalink / raw)
To: Rob Landley; +Cc: Jörn Engel, linux-kernel
Hi!
> Catching up on some really old mail, I thought I'd ask:
>
> For years now I've wanted to use a sendfile variant to tell the system to
> connect two filehandles from userspace. Not just web servers want to
> marshall data from one filehandle into another, things like netcat want to do
> it between a pipe and a network connection, and I've wrote a couple of data
> dispatcher daemons that wanted to do it between two network connections.
>
> Unfortunately, sendfile didn't work generically when I tried it (back under
> 2.4). Would this infrastructure be a step in the right direction to
> eliminate gratuitous poll loops (where nobody but me EVER seems to get the
> "shutdown just one half of the connection" thing right. My netcat can handle
> "echo 'GET /' | netcat www.slashdot.org 80". The standard netcat can't.
> Yes, I plan to fix the one in busybox eventually...)
>
Ugh. Yes, some syscalls like that were proposed... but to
make programming easier, you'd need asynchronous
sendfile to help you with programming, right?
Pavel
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-11 14:08 ` Jan Harkes
2004-05-11 14:18 ` Jan Harkes
2004-05-11 14:33 ` Jörn Engel
@ 2004-05-21 23:23 ` Rob Landley
2004-05-25 22:46 ` Jan Harkes
2 siblings, 1 reply; 36+ messages in thread
From: Rob Landley @ 2004-05-21 23:23 UTC (permalink / raw)
To: Jan Harkes, Jörn Engel; +Cc: Steve French, linux-kernel
On Tuesday 11 May 2004 09:08, Jan Harkes wrote:
> int my_file_sendfile(struct file *in_file, loff_t *ppos, size_t count,
> read_actor_t actor, void __user *target)
> {
> struct file *out_file = NULL;
>
> /* We have to check the read_actor callback function to see if the
> * target actually points at a struct file. */
> if (actor != file_send_actor)
> goto copy_local;
>
> /* are both source and destination within the same file system
> * mountpoint? */
> if (in_file->f_dentry->d_inode->i_sb !=
> out_file->f_dentry->d_inode->i_sb) goto copy_local;
>
> /* are we copying the entire source file? */
> if (*ppos != 0 || count != in_file->f_dentry->d_inode->i_size)
> goto copy_local;
Is there a race condition for i_size to change between the api getting called
and the copy being done? More to the point, is there some way to specify a
count of -1 or something to easily say "to end of file"?
Rob
--
www.linucon.org: Linux Expo and Science Fiction Convention
October 8-10, 2004 in Austin Texas. (I'm the con chair.)
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-20 13:49 ` Pavel Machek
@ 2004-05-25 21:55 ` Rob Landley
2004-05-25 22:08 ` Pavel Machek
0 siblings, 1 reply; 36+ messages in thread
From: Rob Landley @ 2004-05-25 21:55 UTC (permalink / raw)
To: Pavel Machek; +Cc: Jörn Engel, linux-kernel
On Thursday 20 May 2004 08:49, Pavel Machek wrote:
> Hi!
> > For years now I've wanted to use a sendfile variant to tell the system to
> > connect two filehandles from userspace. Not just web servers want to
> > marshall data from one filehandle into another, things like netcat want
> > to do it between a pipe and a network connection, and I've wrote a couple
> > of data dispatcher daemons that wanted to do it between two network
> > connections.
> >
> > Unfortunately, sendfile didn't work generically when I tried it (back
> > under 2.4). Would this infrastructure be a step in the right direction
> > to eliminate gratuitous poll loops (where nobody but me EVER seems to get
> > the "shutdown just one half of the connection" thing right. My netcat
> > can handle "echo 'GET /' | netcat www.slashdot.org 80". The standard
> > netcat can't. Yes, I plan to fix the one in busybox eventually...)
>
> Ugh. Yes, some syscalls like that were proposed... but to
> make programming easier, you'd need asynchronous
> sendfile to help you with programming, right?
>
> Pavel
Doesn't asynchronous sendfile has the little problem your process can exit
before the sendfile is complete?
I'm not sure how much of a help it really is, since fork() isn't brain surgery
if you want it to be asynchronous, and the lifetime rules are really explicit
then. (With a ps that does thread grouping, this isn't too bad from a
clutter standpoint, even. And you automatically get a SIGCHLD when the
sendfile is complete, too...)
Of course if the syscall can make the sendfile outlive the process that fired
it off, then by all means it sounds good. I dunno how much extra work that
is for the kernel, though.
I had a "star server" daemon that was allowing machines behind firewalls to
talk to each other by ssh-ing in to a machine that wasn't behind a firewall,
which would then route traffic between them. It boiled down to connecting
together incoming network sockets, which meant that each connection went
through five processes on the star server: incoming ssh, redirector program
that ssh ran (which sent it the data to a loopback network connection), the
dispatcher daemon that handled the loopback connections and figured out who
should connect to who, and then the redirector and the ssh session for the
"outgoing" connection. Both redirectors basically wanted to exit after
connecting their stdin and stdout to a network socket on loopback, but
couldn't because they had to run a poll loop to shovel data back and forth.
Similarly the dispatcher daemon had the mother of all poll loops that
basically implemented bidirectional sendfile by hand between pairs of network
sockets.
Anyway, real world example of something ugly that made me want sendfile to be
more generic. (Netcat's another.) I doubt the star server could have been
cleaned up _too_ much since it was an ugly idea to begin with (on many
levels), but my boss wanted it, and customers wanted it, so...
Rob
--
www.linucon.org: Linux Expo and Science Fiction Convention
October 8-10, 2004 in Austin Texas. (I'm the con chair.)
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-25 21:55 ` Rob Landley
@ 2004-05-25 22:08 ` Pavel Machek
2004-05-25 23:16 ` Rob Landley
0 siblings, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2004-05-25 22:08 UTC (permalink / raw)
To: Rob Landley; +Cc: Jörn Engel, linux-kernel
Hi!
> > > For years now I've wanted to use a sendfile variant to tell the system to
> > > connect two filehandles from userspace. Not just web servers want to
> > > marshall data from one filehandle into another, things like netcat want
> > > to do it between a pipe and a network connection, and I've wrote a couple
> > > of data dispatcher daemons that wanted to do it between two network
> > > connections.
> > >
> > > Unfortunately, sendfile didn't work generically when I tried it (back
> > > under 2.4). Would this infrastructure be a step in the right direction
> > > to eliminate gratuitous poll loops (where nobody but me EVER seems to get
> > > the "shutdown just one half of the connection" thing right. My netcat
> > > can handle "echo 'GET /' | netcat www.slashdot.org 80". The standard
> > > netcat can't. Yes, I plan to fix the one in busybox eventually...)
> >
> > Ugh. Yes, some syscalls like that were proposed... but to
> > make programming easier, you'd need asynchronous
> > sendfile to help you with programming, right?
>
> Doesn't asynchronous sendfile has the little problem your process can exit
> before the sendfile is complete?
Hmm, it has...
> I'm not sure how much of a help it really is, since fork() isn't brain surgery
> if you want it to be asynchronous, and the lifetime rules are really explicit
> then. (With a ps that does thread grouping, this isn't too bad from a
> clutter standpoint, even. And you automatically get a SIGCHLD when the
> sendfile is complete, too...)
Right.
> Of course if the syscall can make the sendfile outlive the process that fired
> it off, then by all means it sounds good. I dunno how much extra work that
> is for the kernel, though.
Well, it would be "interesting" to stop that sendfile then. You could
not kill it etc.
I guess async sendfile is bad idea after all.
Pavel
--
When do you have heart between your knees?
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-21 23:23 ` Rob Landley
@ 2004-05-25 22:46 ` Jan Harkes
0 siblings, 0 replies; 36+ messages in thread
From: Jan Harkes @ 2004-05-25 22:46 UTC (permalink / raw)
To: Rob Landley; +Cc: Jan Harkes, Jörn Engel, Steve French, linux-kernel
On Fri, May 21, 2004 at 06:23:12PM -0500, Rob Landley wrote:
> > /* are we copying the entire source file? */
> > if (*ppos != 0 || count != in_file->f_dentry->d_inode->i_size)
> > goto copy_local;
>
> Is there a race condition for i_size to change between the api getting called
> and the copy being done? More to the point, is there some way to specify a
> count of -1 or something to easily say "to end of file"?
I don't think so, what does the existing sendfile to a socket
implementation do?
Sure there is something you could call a race, but it doesn't seem all
that serious to me. If someone writes to the source file a couple of
seconds after the copy completes we would get the exact same situation.
The only problem here is that it is possible for the write to arrive
between the stat of the source file to get the number of bytes we want
to copy and the actual sendfile call. That would invalidate the cached
inode data and by the time call sendfile the example code falls back to
a local copy operation because count != i_size.
But if the filesystem provided a more powerful copy operation than just
copy whole file A to file B it could actually do what the user asked for
without needing to fall back on the local copy operation.
An application that wants to make sure the source file is not modified
before, during or after the copy operation, both by the local client and
possibly by any remote clients, probably should lock it with flock or
fcntl.
Jan
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-25 22:08 ` Pavel Machek
@ 2004-05-25 23:16 ` Rob Landley
2004-05-26 0:16 ` Ian Stirling
2004-05-26 9:52 ` Jörn Engel
0 siblings, 2 replies; 36+ messages in thread
From: Rob Landley @ 2004-05-25 23:16 UTC (permalink / raw)
To: Pavel Machek; +Cc: Jörn Engel, linux-kernel
On Tuesday 25 May 2004 17:08, Pavel Machek wrote:
> > Doesn't asynchronous sendfile has the little problem your process can
> > exit before the sendfile is complete?
>
> Hmm, it has...
>
> > I'm not sure how much of a help it really is, since fork() isn't brain
> > surgery if you want it to be asynchronous, and the lifetime rules are
> > really explicit then. (With a ps that does thread grouping, this isn't
> > too bad from a clutter standpoint, even. And you automatically get a
> > SIGCHLD when the sendfile is complete, too...)
>
> Right.
>
> > Of course if the syscall can make the sendfile outlive the process that
> > fired it off, then by all means it sounds good. I dunno how much extra
> > work that is for the kernel, though.
>
> Well, it would be "interesting" to stop that sendfile then. You could
> not kill it etc.
Well, logically what you're doing is redirecting an existing filehandle so it
points to something else. Instead of reading from this pipe, you're now
reading from this file or from this network socket, or this other process's
stdout. So any intermediate processes going away is theoretically okay as
long as the anchors at each end remain (process/filesystem/network connection
generating the data, process/filesystem/network connection receiving the
data).
The easy way to make the semantics work out right is that such an asynchronous
sendfile would effectively close the file in question from the point of view
of the process that did the sendfile. It would pretty much have to be part
of the semantics of any asynchronous sendfile call: welding together the two
filehandles would behave like a single direction shutdown(2) as far as the
process that called sendfile is concerned. (That way, if you do an async
sendfile in each direction, the filehandle is closed both ways, but you don't
HAVE to if you don't want to. You can feed data to a child process from a
script file or something, and just deal with the responses coming back.)
This would mean that in theory the process that did the sendfile could go away
without too much ambiguity about what should happen. (The bits that are
already closed from the process's point of view are unaffected by the process
exiting.) I dunno what's needed to clean that up in the kernel.
I also don't know if it's a good idea, because as you noticed the fire and
forget nature of the thing means that killing it afterwards is something we
haven't got a semantic for if the thing at the other end is NOT a pipe to a
processes. (We kill processes. We don't have a kill for network connections
or for files in the filesystem that are no longer associated with any
process. This is theoretically existing problem, by the way. Check out
SO_LINGER in man 7 socket...)
There are a number of interesting "no process involved" cases, actually. Do
an asynchronous sendfile from one file to another and the disk fills up; who
do you report the error to? And if you do a sendfile from one network socket
to another, both of which are outbound connections to the internet, then how
do you interrupt it later if those other systems decide to keep talking
through your system for two days?
Possibly the rule would have to be that a filehandle must remain open in SOME
process context somewhere, or it gets closed (even if it was in the middle of
an asynchronous sendfile). That would make sense, really; that's how we
handle incoming data from a network connection when the process in question
exits. We don't care that the datastream had data in transit, coming from
the network into the receive buffer. It go bye-bye when the last process
that had it open closes it, and the in-flight data is discarded.
That sounds like a workable semantic to me. Opinions?
It might even be possible for other I/O on the filehandle to block until the
async sendfile finishes, but that's getting a bit fancy in the absence of
known real world use cases asking for that feature. It would let you add in
something like SIGIO to let you know when you need to close the filehandle
and get any error code (like "disk full" or whatever). But I'd probably hold
off on that until somebody actually came up with a real world use case for
it.
> I guess async sendfile is bad idea after all.
Well, one thread per sendfile does smell a little bit like the java way of
doing things, but synchronous sendfile is the 90% solution. I can always
make my own wrapper to do the fork() and an atexit() hook to detach and
daemonify if I really care. The kernel doesn't really have to do this for
me. And since 2.6 already has some infrastructure to group threads so ps
doesn't look so terrible...
My opinion is that a synchronous sendfile that works in a generic context
would be an improvement over what's there now, and something I actually have
a couple existing uses for. Cleaning it up to work async can remain a to-do
item for later (including working out under what circumstances, if any, it
might be a good idea).
Low hanging fruit, and all that.
> Pavel
Rob
--
www.linucon.org: Linux Expo and Science Fiction Convention
October 8-10, 2004 in Austin Texas. (I'm the con chair.)
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-25 23:16 ` Rob Landley
@ 2004-05-26 0:16 ` Ian Stirling
2004-05-26 9:52 ` Jörn Engel
1 sibling, 0 replies; 36+ messages in thread
From: Ian Stirling @ 2004-05-26 0:16 UTC (permalink / raw)
To: Rob Landley; +Cc: Pavel Machek, Jörn Engel, linux-kernel
Rob Landley wrote:
> On Tuesday 25 May 2004 17:08, Pavel Machek wrote:
>
>
>>>Doesn't asynchronous sendfile has the little problem your process can
>>>exit before the sendfile is complete?
>>
>>Hmm, it has...
>>
>>
>>>I'm not sure how much of a help it really is, since fork() isn't brain
>>>surgery if you want it to be asynchronous, and the lifetime rules are
>>>really explicit then. (With a ps that does thread grouping, this isn't
>>>too bad from a clutter standpoint, even. And you automatically get a
>>>SIGCHLD when the sendfile is complete, too...)
>>
>>Right.
>>
>>
>>>Of course if the syscall can make the sendfile outlive the process that
>>>fired it off, then by all means it sounds good. I dunno how much extra
>>>work that is for the kernel, though.
>>
>>Well, it would be "interesting" to stop that sendfile then. You could
>>not kill it etc.
>
>
> Well, logically what you're doing is redirecting an existing filehandle so it
> points to something else. Instead of reading from this pipe, you're now
> reading from this file or from this network socket, or this other process's
> stdout. So any intermediate processes going away is theoretically okay as
> long as the anchors at each end remain (process/filesystem/network connection
> generating the data, process/filesystem/network connection receiving the
> data).
>
> The easy way to make the semantics work out right is that such an asynchronous
> sendfile would effectively close the file in question from the point of view
> of the process that did the sendfile. It would pretty much have to be part
> of the semantics of any asynchronous sendfile call: welding together the two
> filehandles would behave like a single direction shutdown(2) as far as the
> process that called sendfile is concerned. (That way, if you do an async
> sendfile in each direction, the filehandle is closed both ways, but you don't
> HAVE to if you don't want to. You can feed data to a child process from a
> script file or something, and just deal with the responses coming back.)
>
> This would mean that in theory the process that did the sendfile could go away
> without too much ambiguity about what should happen. (The bits that are
> already closed from the process's point of view are unaffected by the process
> exiting.) I dunno what's needed to clean that up in the kernel.
>
> I also don't know if it's a good idea, because as you noticed the fire and
> forget nature of the thing means that killing it afterwards is something we
> haven't got a semantic for if the thing at the other end is NOT a pipe to a
> processes. (We kill processes. We don't have a kill for network connections
> or for files in the filesystem that are no longer associated with any
> process. This is theoretically existing problem, by the way. Check out
> SO_LINGER in man 7 socket...)
>
A while back (2.2?) I was using a util that came with netpipes, sockdown.
Give it a socket FD, and it would close the socket.
It wasn't intended for it, but I found that it was really handy to
be able to shut other programs network connections.
For example, tin gets wedged talking to a server that won't do something, and
you just do
sockdown </proc/tin/fd/nn
And it shuts.
Mildly annoyingly, this stopped working in 2.4.
I suppose it'd be totally insane for init to inherit descriptors for open network
connections.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks
2004-05-25 23:16 ` Rob Landley
2004-05-26 0:16 ` Ian Stirling
@ 2004-05-26 9:52 ` Jörn Engel
1 sibling, 0 replies; 36+ messages in thread
From: Jörn Engel @ 2004-05-26 9:52 UTC (permalink / raw)
To: Rob Landley; +Cc: Pavel Machek, linux-kernel
On Tue, 25 May 2004 18:16:05 -0500, Rob Landley wrote:
>
> [a lot]
Sorry, you didn't have the time to make it short and I don't have the
time to read it all. ;)
Anyway, it sounds as if you really want a connect_files() or similar
that shortens a chain of pipes. Similar to removing an element from a
doubly linked list. Should be possible and not too hard either, iff
you can point to a decent user.
Right now, I don't care at all since it is just an optimization for
what's possible in userspace already and is rarely used. Root of all
evil an such...
Jörn
--
The strong give up and move away, while the weak give up and stay.
-- unknown
^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2004-05-26 9:52 UTC | newest]
Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-05-06 13:17 [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Jörn Engel
2004-05-06 13:18 ` [PATCH COW] generic_sendpage Jörn Engel
2004-05-06 13:19 ` [PATCH COW] sendfile Jörn Engel
2004-05-06 13:19 ` [PATCH COW] copyfile Jörn Engel
2004-05-06 13:20 ` [PATCH COW] lock_flags Jörn Engel
2004-05-06 13:21 ` [PATCH COW] MAD COW Jörn Engel
2004-05-08 13:45 ` [ANNOUNCEMENT PATCH COW] proof of concept impementation of cowlinks Denis Vlasenko
2004-05-08 22:10 ` Pavel Machek
2004-05-09 14:09 ` Denis Vlasenko
2004-05-09 21:53 ` Pavel Machek
2004-05-10 15:44 ` Jörn Engel
2004-05-10 15:51 ` Pavel Machek
2004-05-10 15:56 ` Jörn Engel
2004-05-12 0:26 ` Jamie Lokier
2004-05-13 10:56 ` Jörn Engel
2004-05-12 20:29 ` Rob Landley
2004-05-08 22:48 ` Pavel Machek
2004-05-10 15:53 ` Jörn Engel
2004-05-10 19:26 ` Jan Harkes
2004-05-11 10:02 ` Jörn Engel
2004-05-11 14:08 ` Jan Harkes
2004-05-11 14:18 ` Jan Harkes
2004-05-11 14:33 ` Jörn Engel
2004-05-21 23:23 ` Rob Landley
2004-05-25 22:46 ` Jan Harkes
2004-05-11 15:40 ` Steve French
2004-05-11 15:58 ` Jörn Engel
2004-05-10 5:15 ` Eric W. Biederman
2004-05-10 15:59 ` Jörn Engel
2004-05-12 16:39 ` Rob Landley
2004-05-20 13:49 ` Pavel Machek
2004-05-25 21:55 ` Rob Landley
2004-05-25 22:08 ` Pavel Machek
2004-05-25 23:16 ` Rob Landley
2004-05-26 0:16 ` Ian Stirling
2004-05-26 9:52 ` Jörn Engel
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).