LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [patch 10/13] signal/timer/event fds v8 - eventfd core ...
@ 2007-03-20 18:37 Davide Libenzi
  2007-03-30 19:40 ` Andrew Morton
  0 siblings, 1 reply; 6+ messages in thread
From: Davide Libenzi @ 2007-03-20 18:37 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Andrew Morton, Linus Torvalds, Ingo Molnar, Suparna Bhattacharya,
	Zach Brown, Benjamin LaHaise

This is a very simple and light file descriptor, that can be used as
event wait/dispatch by userspace (both wait and dispatch) and by the
kernel (dispatch only). It can be used instead of pipe(2) in all cases
where those would simply be used to signal events. Their kernel overhead
is much lower than pipes, and they do not consume two fds. When used in
the kernel, it can offer an fd-bridge to enable, for example, functionalities
like KAIO or syslets/threadlets to signal to an fd the completion of certain
operations. But more in general, an eventfd can be used by the kernel to
signal readiness, in a POSIX poll/select way, of interfaces that would
otherwise be incompatible with it. The API is:

int eventfd(unsigned int count);

The eventfd API accepts an initial "count" parameter, and returns an
eventfd fd. It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).
The POLLIN flag is raised when the internal counter is greater than zero.
The POLLOUT flag is raised when at least a value of "1" can be written to
the internal counter.
The POLLERR flag is raised when an overflow in the counter value is detected.
The write(2) operation can never overflow the counter, since it blocks
(unless O_NONBLOCK is set, in which case -EAGAIN is returned).
But the eventfd_signal() function can do it, since it's supposed to not
sleep during its operation.
The read(2) function reads the __u64 counter value, and reset the internal
value to zero. If the value read is equal to (__u64) -1, an overflow
happened on the internal counter (due to 2^64 eventfd_signal() posts
that has never been retired - unlickely, but possible).
The write(2) call writes an __u64 count value, and adds it
to the current counter. The eventfd fd supports O_NONBLOCK also.
On the kernel side, we have:

struct file *eventfd_fget(int fd);
int eventfd_signal(struct file *file, unsigned int n);

The eventfd_fget() should be called to get a struct file* from an eventfd
fd (this is an fget() + check of f_op being an eventfd fops pointer).
The kernel can then call eventfd_signal() every time it wants to post
an event to userspace. The eventfd_signal() function can be called from any
context.
An eventfd() simple test and bench is available here:

http://www.xmailserver.org/eventfd-bench.c

This is the eventfd-based version of pipetest-4 (pipe(2) based):

http://www.xmailserver.org/pipetest-4.c

Not that performance matters much in the eventfd case, but eventfd-bench
shows almost as double as performance than pipetest-4.




Signed-off-by: Davide Libenzi <davidel@xmailserver.org>



- Davide



Index: linux-2.6.21-rc3.quilt/fs/Makefile
===================================================================
--- linux-2.6.21-rc3.quilt.orig/fs/Makefile	2007-03-19 19:01:52.000000000 -0700
+++ linux-2.6.21-rc3.quilt/fs/Makefile	2007-03-19 19:01:58.000000000 -0700
@@ -11,7 +11,7 @@
 		attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o drop_caches.o splice.o sync.o utimes.o \
-		stack.o anon_inodes.o signalfd.o timerfd.o
+		stack.o anon_inodes.o signalfd.o timerfd.o eventfd.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
Index: linux-2.6.21-rc3.quilt/include/linux/syscalls.h
===================================================================
--- linux-2.6.21-rc3.quilt.orig/include/linux/syscalls.h	2007-03-19 19:01:52.000000000 -0700
+++ linux-2.6.21-rc3.quilt/include/linux/syscalls.h	2007-03-19 19:01:58.000000000 -0700
@@ -605,6 +605,7 @@
 asmlinkage long sys_signalfd(int ufd, sigset_t __user *user_mask, size_t sizemask);
 asmlinkage long sys_timerfd(int ufd, int clockid, int flags,
 			    const struct itimerspec __user *utmr);
+asmlinkage long sys_eventfd(unsigned int count);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
Index: linux-2.6.21-rc3.quilt/fs/eventfd.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.21-rc3.quilt/fs/eventfd.c	2007-03-19 19:01:58.000000000 -0700
@@ -0,0 +1,237 @@
+/*
+ *  fs/eventfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *
+ */
+
+#include <linux/file.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/jiffies.h>
+#include <linux/anon_inodes.h>
+#include <linux/eventfd.h>
+
+#include <asm/uaccess.h>
+
+
+
+struct eventfd_ctx {
+	spinlock_t lock;
+	wait_queue_head_t wqh;
+	__u64 count;
+};
+
+
+static int eventfd_close(struct inode *inode, struct file *file);
+static unsigned int eventfd_poll(struct file *file, poll_table *wait);
+static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
+			    loff_t *ppos);
+static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
+			     loff_t *ppos);
+
+
+
+static const struct file_operations eventfd_fops = {
+	.release	= eventfd_close,
+	.poll		= eventfd_poll,
+	.read		= eventfd_read,
+	.write		= eventfd_write,
+};
+
+
+
+struct file *eventfd_fget(int fd)
+{
+	struct file *file;
+
+	file = fget(fd);
+	if (!file)
+		return ERR_PTR(-EBADF);
+	if (file->f_op != &eventfd_fops) {
+		fput(file);
+		return ERR_PTR(-EINVAL);
+	}
+
+	return file;
+}
+
+/*
+ * This function is supposed to be called by the kernel in paths
+ * that do not allow to sleep. In this function we allow the counter
+ * to reach the ULLONG_MAX value, and we signal this as overflow
+ * condition by returining a POLLERR to poll(2).
+ */
+int eventfd_signal(struct file *file, int n)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	unsigned long flags;
+
+	if (n < 0)
+		return -EINVAL;
+	spin_lock_irqsave(&ctx->lock, flags);
+	if (ULLONG_MAX - ctx->count < n)
+		n = (int) (ULLONG_MAX - ctx->count);
+	ctx->count += n;
+	if (waitqueue_active(&ctx->wqh))
+		wake_up_locked(&ctx->wqh);
+	spin_unlock_irqrestore(&ctx->lock, flags);
+
+	return n;
+}
+
+asmlinkage long sys_eventfd(unsigned int count)
+{
+	int error, fd;
+	struct eventfd_ctx *ctx;
+	struct file *file;
+	struct inode *inode;
+
+	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	init_waitqueue_head(&ctx->wqh);
+	spin_lock_init(&ctx->lock);
+	ctx->count = count;
+
+	/*
+	 * When we call this, the initialization must be complete, since
+	 * aino_getfd() will install the fd.
+	 */
+	error = aino_getfd(&fd, &inode, &file, "[eventfd]",
+			   &eventfd_fops, ctx);
+	if (!error)
+		return fd;
+
+	kfree(ctx);
+	return error;
+}
+
+static int eventfd_close(struct inode *inode, struct file *file)
+{
+	kfree(file->private_data);
+	return 0;
+}
+
+static unsigned int eventfd_poll(struct file *file, poll_table *wait)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	unsigned int events = 0;
+	unsigned long flags;
+
+	poll_wait(file, &ctx->wqh, wait);
+
+	spin_lock_irqsave(&ctx->lock, flags);
+	if (ctx->count > 0)
+		events |= POLLIN;
+	if (ctx->count == ULLONG_MAX)
+		events |= POLLERR;
+	if (ULLONG_MAX - 1 > ctx->count)
+		events |= POLLOUT;
+	spin_unlock_irqrestore(&ctx->lock, flags);
+
+	return events;
+}
+
+static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
+			    loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 ucnt;
+	DECLARE_WAITQUEUE(wait, current);
+
+	if (count < sizeof(ucnt))
+		return -EINVAL;
+	spin_lock_irq(&ctx->lock);
+	res = -EAGAIN;
+	ucnt = ctx->count;
+	if (ucnt > 0)
+		res = sizeof(ucnt);
+	else if (!(file->f_flags & O_NONBLOCK)) {
+		__add_wait_queue(&ctx->wqh, &wait);
+		for (res = 0;;) {
+			set_current_state(TASK_INTERRUPTIBLE);
+			if (ctx->count > 0) {
+				ucnt = ctx->count;
+				res = sizeof(ucnt);
+				break;
+			}
+			if (signal_pending(current)) {
+				res = -ERESTARTSYS;
+				break;
+			}
+			spin_unlock_irq(&ctx->lock);
+			schedule();
+			spin_lock_irq(&ctx->lock);
+		}
+		__remove_wait_queue(&ctx->wqh, &wait);
+		__set_current_state(TASK_RUNNING);
+	}
+	if (res > 0) {
+		ctx->count = 0;
+		if (waitqueue_active(&ctx->wqh))
+			wake_up_locked(&ctx->wqh);
+	}
+	spin_unlock_irq(&ctx->lock);
+	if (res > 0 && put_user(ucnt, (__u64 __user *) buf))
+		return -EFAULT;
+
+	return res;
+}
+
+static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
+			     loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 ucnt;
+	DECLARE_WAITQUEUE(wait, current);
+
+	if (count < sizeof(ucnt))
+		return -EINVAL;
+	if (get_user(ucnt, (const __u64 __user *) buf))
+		return -EFAULT;
+	if (ucnt == ULLONG_MAX)
+		return -EINVAL;
+	spin_lock_irq(&ctx->lock);
+	res = -EAGAIN;
+	if (ULLONG_MAX - ctx->count > ucnt)
+		res = sizeof(ucnt);
+	else if (!(file->f_flags & O_NONBLOCK)) {
+		__add_wait_queue(&ctx->wqh, &wait);
+		for (res = 0;;) {
+			set_current_state(TASK_INTERRUPTIBLE);
+			if (ULLONG_MAX - ctx->count > ucnt) {
+				res = sizeof(ucnt);
+				break;
+			}
+			if (signal_pending(current)) {
+				res = -ERESTARTSYS;
+				break;
+			}
+			spin_unlock_irq(&ctx->lock);
+			schedule();
+			spin_lock_irq(&ctx->lock);
+		}
+		__remove_wait_queue(&ctx->wqh, &wait);
+		__set_current_state(TASK_RUNNING);
+	}
+	if (res > 0) {
+		ctx->count += ucnt;
+		if (waitqueue_active(&ctx->wqh))
+			wake_up_locked(&ctx->wqh);
+	}
+	spin_unlock_irq(&ctx->lock);
+
+	return res;
+}
+
Index: linux-2.6.21-rc3.quilt/include/linux/eventfd.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.21-rc3.quilt/include/linux/eventfd.h	2007-03-19 19:01:58.000000000 -0700
@@ -0,0 +1,20 @@
+/*
+ *  include/linux/eventfd.h
+ *
+ *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *
+ */
+
+#ifndef _LINUX_EVENTFD_H
+#define _LINUX_EVENTFD_H
+
+
+#ifdef __KERNEL__
+
+struct file *eventfd_fget(int fd);
+int eventfd_signal(struct file *file, int n);
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_EVENTFD_H */
+


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch 10/13] signal/timer/event fds v8 - eventfd core ...
  2007-03-20 18:37 [patch 10/13] signal/timer/event fds v8 - eventfd core Davide Libenzi
@ 2007-03-30 19:40 ` Andrew Morton
  2007-03-31  1:11   ` Davide Libenzi
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2007-03-30 19:40 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linux Kernel Mailing List, Linus Torvalds, Ingo Molnar,
	Suparna Bhattacharya, Zach Brown, Benjamin LaHaise

On Tue, 20 Mar 2007 11:37:14 -0700
Davide Libenzi <davidel@xmailserver.org> wrote:

> This is a very simple and light file descriptor, that can be used as
> event wait/dispatch by userspace (both wait and dispatch) and by the
> kernel (dispatch only). It can be used instead of pipe(2) in all cases
> where those would simply be used to signal events. Their kernel overhead
> is much lower than pipes, and they do not consume two fds. When used in
> the kernel, it can offer an fd-bridge to enable, for example, functionalities
> like KAIO or syslets/threadlets to signal to an fd the completion of certain
> operations. But more in general, an eventfd can be used by the kernel to
> signal readiness, in a POSIX poll/select way, of interfaces that would
> otherwise be incompatible with it. The API is:
> 
> int eventfd(unsigned int count);
>
> The eventfd API accepts an initial "count" parameter, and returns an
> eventfd fd. It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).
> The POLLIN flag is raised when the internal counter is greater than zero.
> The POLLOUT flag is raised when at least a value of "1" can be written to
> the internal counter.
> The POLLERR flag is raised when an overflow in the counter value is detected.
> The write(2) operation can never overflow the counter, since it blocks
> (unless O_NONBLOCK is set, in which case -EAGAIN is returned).
> But the eventfd_signal() function can do it, since it's supposed to not
> sleep during its operation.
> The read(2) function reads the __u64 counter value, and reset the internal
> value to zero. If the value read is equal to (__u64) -1, an overflow
> happened on the internal counter (due to 2^64 eventfd_signal() posts
> that has never been retired - unlickely, but possible).
> The write(2) call writes an __u64 count value, and adds it
> to the current counter. The eventfd fd supports O_NONBLOCK also.
> On the kernel side, we have:
> 
> struct file *eventfd_fget(int fd);
> int eventfd_signal(struct file *file, unsigned int n);
> 
> The eventfd_fget() should be called to get a struct file* from an eventfd
> fd (this is an fget() + check of f_op being an eventfd fops pointer).
> The kernel can then call eventfd_signal() every time it wants to post
> an event to userspace. The eventfd_signal() function can be called from any
> context.
> An eventfd() simple test and bench is available here:
> 
> http://www.xmailserver.org/eventfd-bench.c
> 
> This is the eventfd-based version of pipetest-4 (pipe(2) based):
> 
> http://www.xmailserver.org/pipetest-4.c
> 
> Not that performance matters much in the eventfd case, but eventfd-bench
> shows almost as double as performance than pipetest-4.
> 
>...
> 
> Index: linux-2.6.21-rc3.quilt/fs/eventfd.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.21-rc3.quilt/fs/eventfd.c	2007-03-19 19:01:58.000000000 -0700
> @@ -0,0 +1,237 @@
> +/*
> + *  fs/eventfd.c
> + *
> + *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
> + *
> + */
> +
> +#include <linux/file.h>
> +#include <linux/poll.h>
> +#include <linux/slab.h>
> +#include <linux/init.h>
> +#include <linux/fs.h>
> +#include <linux/mount.h>
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/jiffies.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/eventfd.h>
> +
> +#include <asm/uaccess.h>
> +
> +
> +
> +struct eventfd_ctx {
> +	spinlock_t lock;
> +	wait_queue_head_t wqh;
> +	__u64 count;
> +};

Again, can we borrow wqh.lock?

`count' needs documentation - these things are key to understanding the
code.

> +
> +
> +static int eventfd_close(struct inode *inode, struct file *file);
> +static unsigned int eventfd_poll(struct file *file, poll_table *wait);
> +static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
> +			    loff_t *ppos);
> +static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
> +			     loff_t *ppos);

dittos

> +
> +
> +

dittos

> +static const struct file_operations eventfd_fops = {
> +	.release	= eventfd_close,

dittos

> +	.poll		= eventfd_poll,
> +	.read		= eventfd_read,
> +	.write		= eventfd_write,
> +};
> +
> +
> +
> +struct file *eventfd_fget(int fd)
> +{
> +	struct file *file;
> +
> +	file = fget(fd);
> +	if (!file)
> +		return ERR_PTR(-EBADF);
> +	if (file->f_op != &eventfd_fops) {
> +		fput(file);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	return file;
> +}
> +
> +/*
> + * This function is supposed to be called by the kernel in paths
> + * that do not allow to sleep. In this function we allow the counter

"that do not allow sleeping"

> + * to reach the ULLONG_MAX value, and we signal this as overflow
> + * condition by returining a POLLERR to poll(2).

typo

> + */

So it is the caller's responsibility to ensure that *file refers to an
eventfd file?

> +int eventfd_signal(struct file *file, int n)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	unsigned long flags;
> +
> +	if (n < 0)
> +		return -EINVAL;
> +	spin_lock_irqsave(&ctx->lock, flags);
> +	if (ULLONG_MAX - ctx->count < n)
> +		n = (int) (ULLONG_MAX - ctx->count);
> +	ctx->count += n;
> +	if (waitqueue_active(&ctx->wqh))
> +		wake_up_locked(&ctx->wqh);
> +	spin_unlock_irqrestore(&ctx->lock, flags);
> +
> +	return n;
> +}

Neither the incoming arg (usefully named "n") nor the return value are
documented.


> +asmlinkage long sys_eventfd(unsigned int count)
> +{
> +	int error, fd;
> +	struct eventfd_ctx *ctx;
> +	struct file *file;
> +	struct inode *inode;
> +
> +	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
> +	if (!ctx)
> +		return -ENOMEM;
> +
> +	init_waitqueue_head(&ctx->wqh);
> +	spin_lock_init(&ctx->lock);
> +	ctx->count = count;
> +
> +	/*
> +	 * When we call this, the initialization must be complete, since
> +	 * aino_getfd() will install the fd.
> +	 */
> +	error = aino_getfd(&fd, &inode, &file, "[eventfd]",
> +			   &eventfd_fops, ctx);
> +	if (!error)
> +		return fd;
> +
> +	kfree(ctx);
> +	return error;
> +}

Documentation?

> +static int eventfd_close(struct inode *inode, struct file *file)
> +{
> +	kfree(file->private_data);
> +	return 0;
> +}
> +
> +static unsigned int eventfd_poll(struct file *file, poll_table *wait)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	unsigned int events = 0;
> +	unsigned long flags;
> +
> +	poll_wait(file, &ctx->wqh, wait);
> +
> +	spin_lock_irqsave(&ctx->lock, flags);
> +	if (ctx->count > 0)
> +		events |= POLLIN;
> +	if (ctx->count == ULLONG_MAX)
> +		events |= POLLERR;
> +	if (ULLONG_MAX - 1 > ctx->count)
> +		events |= POLLOUT;
> +	spin_unlock_irqrestore(&ctx->lock, flags);
> +
> +	return events;
> +}
> +
> +static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
> +			    loff_t *ppos)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	ssize_t res;
> +	__u64 ucnt;
> +	DECLARE_WAITQUEUE(wait, current);
> +
> +	if (count < sizeof(ucnt))
> +		return -EINVAL;
> +	spin_lock_irq(&ctx->lock);
> +	res = -EAGAIN;
> +	ucnt = ctx->count;
> +	if (ucnt > 0)
> +		res = sizeof(ucnt);
> +	else if (!(file->f_flags & O_NONBLOCK)) {
> +		__add_wait_queue(&ctx->wqh, &wait);
> +		for (res = 0;;) {
> +			set_current_state(TASK_INTERRUPTIBLE);
> +			if (ctx->count > 0) {
> +				ucnt = ctx->count;
> +				res = sizeof(ucnt);
> +				break;
> +			}
> +			if (signal_pending(current)) {
> +				res = -ERESTARTSYS;
> +				break;
> +			}
> +			spin_unlock_irq(&ctx->lock);
> +			schedule();
> +			spin_lock_irq(&ctx->lock);
> +		}
> +		__remove_wait_queue(&ctx->wqh, &wait);
> +		__set_current_state(TASK_RUNNING);
> +	}
> +	if (res > 0) {
> +		ctx->count = 0;
> +		if (waitqueue_active(&ctx->wqh))
> +			wake_up_locked(&ctx->wqh);
> +	}
> +	spin_unlock_irq(&ctx->lock);
> +	if (res > 0 && put_user(ucnt, (__u64 __user *) buf))
> +		return -EFAULT;
> +
> +	return res;
> +}

Needs interface documentation, please.  Even the changelog doesn't tell us
what an EAGAIN return from read() means.

> +static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
> +			     loff_t *ppos)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	ssize_t res;
> +	__u64 ucnt;
> +	DECLARE_WAITQUEUE(wait, current);
> +
> +	if (count < sizeof(ucnt))
> +		return -EINVAL;
> +	if (get_user(ucnt, (const __u64 __user *) buf))
> +		return -EFAULT;

Some architectures do not implement 64-bit get_user()

> +	if (ucnt == ULLONG_MAX)
> +		return -EINVAL;
> +	spin_lock_irq(&ctx->lock);
> +	res = -EAGAIN;
> +	if (ULLONG_MAX - ctx->count > ucnt)
> +		res = sizeof(ucnt);
> +	else if (!(file->f_flags & O_NONBLOCK)) {
> +		__add_wait_queue(&ctx->wqh, &wait);
> +		for (res = 0;;) {
> +			set_current_state(TASK_INTERRUPTIBLE);
> +			if (ULLONG_MAX - ctx->count > ucnt) {
> +				res = sizeof(ucnt);
> +				break;
> +			}
> +			if (signal_pending(current)) {
> +				res = -ERESTARTSYS;
> +				break;
> +			}
> +			spin_unlock_irq(&ctx->lock);
> +			schedule();
> +			spin_lock_irq(&ctx->lock);
> +		}
> +		__remove_wait_queue(&ctx->wqh, &wait);
> +		__set_current_state(TASK_RUNNING);
> +	}
> +	if (res > 0) {
> +		ctx->count += ucnt;
> +		if (waitqueue_active(&ctx->wqh))
> +			wake_up_locked(&ctx->wqh);
> +	}
> +	spin_unlock_irq(&ctx->lock);
> +
> +	return res;
> +}
> +


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch 10/13] signal/timer/event fds v8 - eventfd core ...
  2007-03-30 19:40 ` Andrew Morton
@ 2007-03-31  1:11   ` Davide Libenzi
  2007-03-31  1:14     ` Davide Libenzi
  2007-03-31  1:26     ` Andrew Morton
  0 siblings, 2 replies; 6+ messages in thread
From: Davide Libenzi @ 2007-03-31  1:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel Mailing List, Linus Torvalds, Ingo Molnar,
	Suparna Bhattacharya, Zach Brown, Benjamin LaHaise

On Fri, 30 Mar 2007, Andrew Morton wrote:

> > +struct eventfd_ctx {
> > +	spinlock_t lock;
> > +	wait_queue_head_t wqh;
> > +	__u64 count;
> > +};
> 
> Again, can we borrow wqh.lock?
> 
> `count' needs documentation - these things are key to understanding the
> code.

Added.



> > + */
> 
> So it is the caller's responsibility to ensure that *file refers to an
> eventfd file?

In which function? I lost you ...



> > +int eventfd_signal(struct file *file, int n)
> > +{
> > +	struct eventfd_ctx *ctx = file->private_data;
> > +	unsigned long flags;
> > +
> > +	if (n < 0)
> > +		return -EINVAL;
> > +	spin_lock_irqsave(&ctx->lock, flags);
> > +	if (ULLONG_MAX - ctx->count < n)
> > +		n = (int) (ULLONG_MAX - ctx->count);
> > +	ctx->count += n;
> > +	if (waitqueue_active(&ctx->wqh))
> > +		wake_up_locked(&ctx->wqh);
> > +	spin_unlock_irqrestore(&ctx->lock, flags);
> > +
> > +	return n;
> > +}
> 
> Neither the incoming arg (usefully named "n") nor the return value are
> documented.

Documented now.




> Needs interface documentation, please.  Even the changelog doesn't tell us
> what an EAGAIN return from read() means.

I'll be adding the errno documentation to all of them.




> > +static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
> > +			     loff_t *ppos)
> > +{
> > +	struct eventfd_ctx *ctx = file->private_data;
> > +	ssize_t res;
> > +	__u64 ucnt;
> > +	DECLARE_WAITQUEUE(wait, current);
> > +
> > +	if (count < sizeof(ucnt))
> > +		return -EINVAL;
> > +	if (get_user(ucnt, (const __u64 __user *) buf))
> > +		return -EFAULT;
> 
> Some architectures do not implement 64-bit get_user()

copy_from_user it is, then ...



- Davide



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch 10/13] signal/timer/event fds v8 - eventfd core ...
  2007-03-31  1:11   ` Davide Libenzi
@ 2007-03-31  1:14     ` Davide Libenzi
  2007-03-31  1:26     ` Andrew Morton
  1 sibling, 0 replies; 6+ messages in thread
From: Davide Libenzi @ 2007-03-31  1:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel Mailing List, Linus Torvalds, Ingo Molnar,
	Suparna Bhattacharya, Zach Brown, Benjamin LaHaise

On Fri, 30 Mar 2007, Davide Libenzi wrote:

> > Some architectures do not implement 64-bit get_user()
> 
> copy_from_user it is, then ...

That's messed up though. We do have put_user and we miss get_user. Bah...



- Davide



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch 10/13] signal/timer/event fds v8 - eventfd core ...
  2007-03-31  1:11   ` Davide Libenzi
  2007-03-31  1:14     ` Davide Libenzi
@ 2007-03-31  1:26     ` Andrew Morton
  2007-03-31 19:50       ` Davide Libenzi
  1 sibling, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2007-03-31  1:26 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linux Kernel Mailing List, Linus Torvalds, Ingo Molnar,
	Suparna Bhattacharya, Zach Brown, Benjamin LaHaise

On Fri, 30 Mar 2007 18:11:55 -0700 (PDT) Davide Libenzi <davidel@xmailserver.org> wrote:

> 
> > > + */
> > 
> > So it is the caller's responsibility to ensure that *file refers to an
> > eventfd file?
> 
> In which function? I lost you ...
> 

eventfd_signal() assumes that the passed in file* refers to an eventfd
file.  So if a caller passes in a file* for /etc/passwd, the kernel will go
splat.

I guess that's caveat emptor, and any violations of that will show up
quickly in testing.  My main concern would be that there might be some way
for a naughty user to force the kernel to pass a non-eventfd file* into
this function.  That depends upon as-yet-unwritten code - is there a risk
of this happening, and how do we prevent it?

> 
> > > +int eventfd_signal(struct file *file, int n)
> > > +{
> > > +	struct eventfd_ctx *ctx = file->private_data;
> > > +	unsigned long flags;
> > > +
> > > +	if (n < 0)
> > > +		return -EINVAL;
> > > +	spin_lock_irqsave(&ctx->lock, flags);
> > > +	if (ULLONG_MAX - ctx->count < n)
> > > +		n = (int) (ULLONG_MAX - ctx->count);
> > > +	ctx->count += n;
> > > +	if (waitqueue_active(&ctx->wqh))
> > > +		wake_up_locked(&ctx->wqh);
> > > +	spin_unlock_irqrestore(&ctx->lock, flags);
> > > +
> > > +	return n;
> > > +}
>
>
>
> > > +	DECLARE_WAITQUEUE(wait, current);
> > > +
> > > +	if (count < sizeof(ucnt))
> > > +		return -EINVAL;
> > > +	if (get_user(ucnt, (const __u64 __user *) buf))
> > > +		return -EFAULT;
> > 
> > Some architectures do not implement 64-bit get_user()
> 
> copy_from_user it is, then ...
> 

spose so.  I think architectures _should_ implement 64-bit get_user() and
put_user() nowadays.  So you could leave the code as-is and inform the arch
maintainers, if you're feeling keen.

If all this code has its own Kconfig options then the architectures won't
break until their maintainers come along to enable the new features, so
they'll implement 64-bit get_user() at that time and things will all unfold
in a nicely non-chaotic fashion.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [patch 10/13] signal/timer/event fds v8 - eventfd core ...
  2007-03-31  1:26     ` Andrew Morton
@ 2007-03-31 19:50       ` Davide Libenzi
  0 siblings, 0 replies; 6+ messages in thread
From: Davide Libenzi @ 2007-03-31 19:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel Mailing List, Linus Torvalds, Ingo Molnar,
	Suparna Bhattacharya, Zach Brown, Benjamin LaHaise

On Fri, 30 Mar 2007, Andrew Morton wrote:

> On Fri, 30 Mar 2007 18:11:55 -0700 (PDT) Davide Libenzi <davidel@xmailserver.org> wrote:
> 
> > 
> > > > + */
> > > 
> > > So it is the caller's responsibility to ensure that *file refers to an
> > > eventfd file?
> > 
> > In which function? I lost you ...
> > 
> 
> eventfd_signal() assumes that the passed in file* refers to an eventfd
> file.  So if a caller passes in a file* for /etc/passwd, the kernel will go
> splat.
> 
> I guess that's caveat emptor, and any violations of that will show up
> quickly in testing.  My main concern would be that there might be some way
> for a naughty user to force the kernel to pass a non-eventfd file* into
> this function.  That depends upon as-yet-unwritten code - is there a risk
> of this happening, and how do we prevent it?

That's the kernel side of the API. The userspace->kernel fd validation is 
done in eventfd_fget(), that the kernel uses to get an eventfd file* from 
an eventfd file descriptor. The eventfd_fget() does the file operations 
check and returns proper error.
At that point, if the kernel feeds eventfd_signal() with crap, it gets 
crap back. Like calling internal kernel functions with bogus mm_struct or 
task_struct.



> > > > +	DECLARE_WAITQUEUE(wait, current);
> > > > +
> > > > +	if (count < sizeof(ucnt))
> > > > +		return -EINVAL;
> > > > +	if (get_user(ucnt, (const __u64 __user *) buf))
> > > > +		return -EFAULT;
> > > 
> > > Some architectures do not implement 64-bit get_user()
> > 
> > copy_from_user it is, then ...
> > 
> 
> spose so.  I think architectures _should_ implement 64-bit get_user() and
> put_user() nowadays.  So you could leave the code as-is and inform the arch
> maintainers, if you're feeling keen.
> 
> If all this code has its own Kconfig options then the architectures won't
> break until their maintainers come along to enable the new features, so
> they'll implement 64-bit get_user() at that time and things will all unfold
> in a nicely non-chaotic fashion.

Agreed. I'll leave the get_user(u64).



- Davide



^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2007-03-31 19:50 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-20 18:37 [patch 10/13] signal/timer/event fds v8 - eventfd core Davide Libenzi
2007-03-30 19:40 ` Andrew Morton
2007-03-31  1:11   ` Davide Libenzi
2007-03-31  1:14     ` Davide Libenzi
2007-03-31  1:26     ` Andrew Morton
2007-03-31 19:50       ` Davide Libenzi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).