LKML Archive on lore.kernel.org
* [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
@ 2015-03-10 1:49 Fam Zheng
2015-03-10 1:49 ` [PATCH v4 1/9] epoll: Extract epoll_wait_do and epoll_pwait_do Fam Zheng
` (9 more replies)
0 siblings, 10 replies; 16+ messages in thread
From: Fam Zheng @ 2015-03-10 1:49 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
linux-api, Josh Triplett, Michael Kerrisk (man-pages),
Paolo Bonzini, Omar Sandoval, Jonathan Corbet, shane.seymour,
dan.j.rosenberg
Changes from v3:
- Add "size" field in epoll_wait_params. [Jon, Ingo, Seymour]
- Input validation for ncmds in epoll_ctl_batch. [Dan]
- Return -EFAULT if copy_to_user failed in epoll_ctl_batch. [Omar, Michael]
- Change "timeout" in epoll_wait_params to pointer, to get the same
convention of 'no wait', 'wait indefinitely' and 'wait for specified time'
with epoll_pwait. [Seymour]
- Add compat implementation of epoll_pwait1.
Justification
=============
QEMU, among many select/poll based applications, considers epoll as an
alternative when its event loop needs to handle a big number of FDs. However,
there are currently two concerns with epoll which prevent the switch:
The major one is the timeout precision. For example in QEMU, the main loop
takes care of calling callbacks at a specific timeout - the QEMU timer API. The
timeout value in ppoll depends on the next firing timer. epoll_pwait's
millisecond timeout is so coarse that rounding up the timeout will hurt
performance badly.
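To make the cost of the rounding concrete, here is a minimal sketch of the
arithmetic an event loop has to do when it only has a millisecond timeout.
The helper names are hypothetical and are not taken from QEMU or from this
series.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000LL

/* Round a nanosecond timeout up to whole milliseconds, as a caller of
 * epoll_wait()/epoll_pwait() has to do so that it never wakes up before
 * the next timer is due. (Hypothetical helper, for illustration only.) */
int ns_to_ms_round_up(int64_t timeout_ns)
{
    assert(timeout_ns >= 0);
    return (int)((timeout_ns + NSEC_PER_MSEC - 1) / NSEC_PER_MSEC);
}

/* Extra time, in nanoseconds, that the caller may sleep past the timer
 * deadline purely because of the rounding. */
int64_t rounding_overshoot_ns(int64_t timeout_ns)
{
    return (int64_t)ns_to_ms_round_up(timeout_ns) * NSEC_PER_MSEC
           - timeout_ns;
}
```

With a timer due in 50 us, the rounded timeout is 1 ms, so the loop may
sleep up to 950 us longer than the timer asked for.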
The minor one is the number of system calls needed to update the fd set. While
epoll can handle a large number of fds quickly, it still requires one epoll_ctl
per fd update, compared to the one-shot call to select/poll with an fd array.
This may well make epoll inferior to ppoll in cases where a small but
frequently changing set of fds is polled by the event loop.
This series introduces two new epoll system calls to address these concerns:
epoll_pwait1 for the first and epoll_ctl_batch for the second. The idea of
epoll_ctl_batch was suggested by Andy Lutomirski in [1], who also suggested
clockid as a parameter in epoll_pwait1.
[1]: http://lists.openwall.net/linux-kernel/2015/01/08/542
Benchmark for epoll_pwait1
==========================
By running fio tests inside VM with both original and modified QEMU, we can
compare their difference in performance.
With a small VM setup [t1], the original QEMU (ppoll based) has a 4k read
latency overhead of around 37 us. In this setup, the main loop polls 10~20 fds.
With a slightly larger VM instance [t2], with a virtio-serial device attached
so that there are 80~90 fds in the main loop, the original QEMU has a latency
overhead of around 49 us. By adding more such devices [t3], we can see the
latency go even higher: 83 us with ~200 FDs.
With QEMU modified to use epoll_pwait1, the latency numbers are
respectively 36 us, 37 us and 47 us for t1, t2 and t3.
Previous Changelogs
===================
Changes from v2 (https://lkml.org/lkml/2015/2/4/105)
----------------------------------------------------
- Rename epoll_ctl_cmd.error_hint to "result". [Michael]
- Add background introduction in cover letter. [Michael]
- Expand the last struct of epoll_pwait1, add clockid and timespec.
- Update man page in cover letter accordingly:
* "error_hint" -> "result".
* The result field's caveat in "RETURN VALUE" section of epoll_ctl_batch.
Please review!
Changes from v1 (https://lkml.org/lkml/2015/1/20/189)
-----------------------------------------------------
- As discussed in previous thread [1], split the call to epoll_ctl_batch and
epoll_pwait. [Michael]
- Fix memory leaks. [Omar]
- Add a short comment about the ignored copy_to_user failure. [Omar]
- Cover letter rewritten.
Documentation of the new system calls
=====================================
1) epoll_ctl_batch
------------------
NAME
epoll_ctl_batch - batch control interface for an epoll descriptor
SYNOPSIS
#include <sys/epoll.h>
int epoll_ctl_batch(int epfd, int flags,
int ncmds, struct epoll_ctl_cmd *cmds);
DESCRIPTION
This system call is an extension of epoll_ctl(). The primary difference
is that this system call allows you to batch multiple operations in a
single system call. This provides a more efficient interface for
updating events on the epoll file descriptor epfd.
The flags argument is reserved and must be 0.
The argument ncmds is the number of cmds entries being passed in.
This number must be greater than 0.
Each operation is specified as an element in the cmds array, defined as:
struct epoll_ctl_cmd {
/* Reserved flags for future extension, must be 0. */
int flags;
/* The same as epoll_ctl() op parameter. */
int op;
/* The same as epoll_ctl() fd parameter. */
int fd;
/* The same as the "events" field in struct epoll_event. */
uint32_t events;
/* The same as the "data" field in struct epoll_event. */
uint64_t data;
/* Output field, will be set to the return code after this
* command is executed by kernel */
int result;
};
This system call is not atomic when updating the epoll descriptor. All
entries in cmds are executed in the provided order. If any cmds entry
fails to be processed, no further entries are processed and the number
of successfully processed entries is returned.
Each single operation defined by a struct epoll_ctl_cmd has the same
semantics as an epoll_ctl(2) call. See the epoll_ctl() manual page for
more information about how to correctly set up the members of a struct
epoll_ctl_cmd.
Upon completion of the call the result member of each struct
epoll_ctl_cmd may be set to 0 (successfully completed) or an error code
depending on the result of the command. If the kernel fails to change
the result (for example, if the location of the cmds argument is fully
or partly read-only), the result member of each struct epoll_ctl_cmd may
be unchanged.
RETURN VALUE
epoll_ctl_batch() returns a number greater than 0 to indicate the number
of cmds entries processed. If all entries have been processed this will
equal the ncmds parameter passed in.
If one or more parameters are incorrect the value returned is -1 with
errno set appropriately - no cmds entries have been processed when this
happens.
If processing any entry in the cmds argument results in an error, the
number returned is the index of the failing entry - this number will be
less than ncmds. Since ncmds must be greater than 0, a return value of 0
indicates an error associated with the very first cmds entry. A return
value of 0 does not indicate a successful system call.
To correctly test the return value from epoll_ctl_batch() use code
similar to the following:
ret = epoll_ctl_batch(epfd, flags, ncmds, cmds);
if (ret < ncmds) {
    if (ret == -1) {
        /* An argument was invalid */
    } else {
        /* ret contains the number of entries successfully
         * processed. Used as a C index, it points directly at
         * the failing entry: cmds[ret].result may contain the
         * negative errno value associated with that entry.
         */
    }
} else {
    /* Success */
}
ERRORS
EINVAL flags is non-zero; ncmds is less than or equal to zero, or
greater than (INT_MAX / sizeof(struct epoll_ctl_cmd)); cmds is
NULL.
ENOMEM There was insufficient memory to handle the requested op control
operation.
EFAULT The memory area pointed to by cmds is not accessible.
In the event that the return value is not the same as the ncmds
parameter, the result member of the failing struct epoll_ctl_cmd will
contain a negative errno value related to the error, unless the memory
area is not writable (EFAULT returned). The errno values that can be set
are those documented on the epoll_ctl(2) manual page.
CONFORMING TO
epoll_ctl_batch() is Linux-specific.
SEE ALSO
epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)
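The semantics described above (in-order execution, stop at the first failure,
a negative errno in the failing entry's result, and a return value equal to
the number of successful entries) can be sketched in userspace on top of
epoll_ctl(2). This is only an emulation for illustration, not the proposed
kernel implementation: the function name emulated_epoll_ctl_batch is
hypothetical, the struct is a local copy of the proposed layout, and the upper
bound on ncmds is not checked.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Local copy of the proposed structure; it is not in any released header. */
struct epoll_ctl_cmd {
    int flags;
    int op;
    int fd;
    uint32_t events;
    uint64_t data;
    int result;
};

/* Hypothetical emulation: one epoll_ctl(2) call per entry, so it saves no
 * system calls. It only mirrors the documented return-value convention. */
int emulated_epoll_ctl_batch(int epfd, int flags, int ncmds,
                             struct epoll_ctl_cmd *cmds)
{
    int done = 0;

    if (flags != 0 || ncmds <= 0 || cmds == NULL) {
        errno = EINVAL;
        return -1;              /* no entry has been processed */
    }
    for (int i = 0; i < ncmds; i++) {
        struct epoll_event ev = { .events = cmds[i].events };
        ev.data.u64 = cmds[i].data;

        if (cmds[i].flags != 0) {
            cmds[i].result = -EINVAL;
            break;
        }
        if (epoll_ctl(epfd, cmds[i].op, cmds[i].fd, &ev) == -1) {
            cmds[i].result = -errno;   /* negative errno, as documented */
            break;
        }
        cmds[i].result = 0;
        done++;
    }
    assert(done >= 0 && done <= ncmds);
    return done;                /* index of the failing entry, or ncmds */
}
```

Entries after the failing one are left untouched, so a caller can resume from
cmds[ret + 1] after handling cmds[ret].result.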
2) epoll_pwait1
---------------
NAME
epoll_pwait1 - wait for an I/O event on an epoll file descriptor
SYNOPSIS
#include <sys/epoll.h>
int epoll_pwait1(int epfd, int flags,
struct epoll_event *events,
int maxevents,
struct epoll_wait_params *params);
DESCRIPTION
The epoll_pwait1() syscall has more elaborate parameters compared to
epoll_pwait(), in order to allow fine control of the wait.
The epfd, events and maxevents parameters are the same
as in epoll_wait() and epoll_pwait(). The flags and params are new.
The flags argument is reserved and must be zero.
The params is a pointer to a struct epoll_wait_params which is
defined as:
struct epoll_wait_params {
int clockid;
struct timespec *timeout;
sigset_t *sigmask;
size_t sigsetsize;
};
The clockid member must be either CLOCK_REALTIME or CLOCK_MONOTONIC.
This will choose the clock type to use for timeout. This differs from
epoll_pwait(2) which has an implicit clock type of CLOCK_MONOTONIC.
The timeout member specifies the minimum time that epoll_pwait1(2) will
block. The time spent waiting will be rounded up to the clock
granularity. Kernel scheduling delays mean that the blocking
interval may overrun by a small amount. Specifying NULL causes
epoll_pwait1(2) to block indefinitely. Specifying a timeout
equal to zero (both tv_sec and tv_nsec are zero) causes epoll_pwait1(2)
to return immediately, even if no events are available.
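The three timeout cases can be sketched as a small classifier. The names
classify_timeout and enum wait_kind are hypothetical. The validity rule used
here (non-negative tv_sec, tv_nsec in the range 0 to 999999999) is an
assumption borrowed from the usual timespec convention; the text above only
says that invalid timespec data yields EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

enum wait_kind { WAIT_FOREVER, WAIT_NONE, WAIT_TIMED };

/* Classify a timeout pointer following the convention described above.
 * Returns 0 and stores the result in *kind, or returns -1 with errno set
 * to EINVAL when the timespec is out of range (assumed validity rule). */
int classify_timeout(const struct timespec *ts, enum wait_kind *kind)
{
    assert(kind != NULL);

    if (ts == NULL) {
        *kind = WAIT_FOREVER;   /* NULL: block indefinitely */
        return 0;
    }
    if (ts->tv_sec < 0 || ts->tv_nsec < 0 || ts->tv_nsec >= 1000000000L) {
        errno = EINVAL;
        return -1;
    }
    if (ts->tv_sec == 0 && ts->tv_nsec == 0)
        *kind = WAIT_NONE;      /* zero: return immediately */
    else
        *kind = WAIT_TIMED;     /* otherwise: block for at least this long */
    return 0;
}
```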
Both sigmask and sigsetsize have the same semantics as epoll_pwait(2).
The sigmask field may be specified as NULL, in which case
epoll_pwait1(2) will behave like epoll_wait(2).
User visibility of sigsetsize
In epoll_pwait(2) and other syscalls, sigsetsize is not visible to
an application developer as glibc has a wrapper around epoll_pwait(2).
Now we pack several parameters in epoll_wait_params. In
order to hide sigsetsize from application code this system call also
needs to be wrapped either by expanding parameters and building the
structure in the wrapper function, or by only asking application to
provide this part of the structure:
struct epoll_wait_params_user {
int clockid;
struct timespec *timeout;
sigset_t *sigmask;
};
In the wrapper function it would be copied to a full structure with
sigsetsize filled in.
RETURN VALUE
When successful, epoll_pwait1() returns the number of file descriptors
ready for the requested I/O, or zero if no file descriptor became ready
before the requested timeout expired. When an error occurs,
epoll_pwait1() returns -1 and errno is set appropriately.
ERRORS
This system call can set errno to the same values as epoll_pwait(2),
as well as the following additional reasons:
EINVAL flags is not zero, or clockid is not one of CLOCK_REALTIME or
CLOCK_MONOTONIC, or the timespec data pointed to by timeout is
not valid.
EFAULT The memory area pointed to by params, params.sigmask or
params.timeout is not accessible.
CONFORMING TO
epoll_pwait1() is Linux-specific.
SEE ALSO
epoll_create(2), epoll_ctl(2), epoll_wait(2), epoll_pwait(2), epoll(7)
Fam Zheng (9):
epoll: Extract epoll_wait_do and epoll_pwait_do
epoll: Specify clockid explicitly
epoll: Extract ep_ctl_do
epoll: Add implementation for epoll_ctl_batch
x86: Hook up epoll_ctl_batch syscall
epoll: Add implementation for epoll_pwait1
x86: Hook up epoll_pwait1 syscall
epoll: Add compat version implementation of epoll_pwait1
x86: Hook up 32 bit compat epoll_pwait1 syscall
arch/x86/syscalls/syscall_32.tbl | 2 +
arch/x86/syscalls/syscall_64.tbl | 2 +
fs/eventpoll.c | 308 ++++++++++++++++++++++++++++-----------
include/linux/compat.h | 6 +
include/linux/syscalls.h | 9 ++
include/uapi/linux/eventpoll.h | 19 +++
6 files changed, 262 insertions(+), 84 deletions(-)
--
1.9.3
* [PATCH v4 1/9] epoll: Extract epoll_wait_do and epoll_pwait_do
2015-03-10 1:49 [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
@ 2015-03-10 1:49 ` Fam Zheng
2015-03-10 1:49 ` [PATCH v4 2/9] epoll: Specify clockid explicitly Fam Zheng
` (8 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Fam Zheng @ 2015-03-10 1:49 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
linux-api, Josh Triplett, Michael Kerrisk (man-pages),
Paolo Bonzini, Omar Sandoval, Jonathan Corbet, shane.seymour,
dan.j.rosenberg
In preparation for the new epoll syscalls, this patch allows reusing the code
from the epoll_pwait implementation. The new functions use ktime_t for more
accuracy.
Signed-off-by: Fam Zheng <famz@redhat.com>
---
fs/eventpoll.c | 154 ++++++++++++++++++++++++++-------------------------------
1 file changed, 71 insertions(+), 83 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1e009ca..7dfabeb 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1554,17 +1554,6 @@ static int ep_send_events(struct eventpoll *ep,
return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
}
-static inline struct timespec ep_set_mstimeout(long ms)
-{
- struct timespec now, ts = {
- .tv_sec = ms / MSEC_PER_SEC,
- .tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC),
- };
-
- ktime_get_ts(&now);
- return timespec_add_safe(now, ts);
-}
-
/**
* ep_poll - Retrieves ready events, and delivers them to the caller supplied
* event buffer.
@@ -1573,17 +1562,15 @@ static inline struct timespec ep_set_mstimeout(long ms)
* @events: Pointer to the userspace buffer where the ready events should be
* stored.
* @maxevents: Size (in terms of number of events) of the caller event buffer.
- * @timeout: Maximum timeout for the ready events fetch operation, in
- * milliseconds. If the @timeout is zero, the function will not block,
- * while if the @timeout is less than zero, the function will block
- * until at least one event has been retrieved (or an error
- * occurred).
+ * @timeout: Maximum timeout for the ready events fetch operation. If 0, the
+ * function will not block. If negative, the function will block until
+ * at least one event has been retrieved (or an error occurred).
*
* Returns: Returns the number of ready events which have been fetched, or an
* error code, in case of error.
*/
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
- int maxevents, long timeout)
+ int maxevents, const ktime_t timeout)
{
int res = 0, eavail, timed_out = 0;
unsigned long flags;
@@ -1591,13 +1578,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
wait_queue_t wait;
ktime_t expires, *to = NULL;
- if (timeout > 0) {
- struct timespec end_time = ep_set_mstimeout(timeout);
-
- slack = select_estimate_accuracy(&end_time);
- to = &expires;
- *to = timespec_to_ktime(end_time);
- } else if (timeout == 0) {
+ if (!ktime_to_ns(timeout)) {
/*
* Avoid the unnecessary trip to the wait queue loop, if the
* caller specified a non blocking operation.
@@ -1605,6 +1586,15 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
timed_out = 1;
spin_lock_irqsave(&ep->lock, flags);
goto check_events;
+ } else if (ktime_to_ns(timeout) > 0) {
+ struct timespec now, end_time;
+
+ ktime_get_ts(&now);
+ end_time = timespec_add_safe(now, ktime_to_timespec(timeout));
+
+ slack = select_estimate_accuracy(&end_time);
+ to = &expires;
+ *to = timespec_to_ktime(end_time);
}
fetch_events:
@@ -1954,12 +1944,8 @@ error_return:
return error;
}
-/*
- * Implement the event wait interface for the eventpoll file. It is the kernel
- * part of the user space epoll_wait(2).
- */
-SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
- int, maxevents, int, timeout)
+static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
+ int maxevents, const ktime_t timeout)
{
int error;
struct fd f;
@@ -2002,46 +1988,70 @@ error_fput:
/*
* Implement the event wait interface for the eventpoll file. It is the kernel
+ * part of the user space epoll_wait(2).
+ */
+SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
+ int, maxevents, int, timeout)
+{
+ ktime_t kt = ms_to_ktime(timeout);
+ return epoll_wait_do(epfd, events, maxevents, kt);
+}
+
+static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
+ int maxevents, ktime_t timeout,
+ sigset_t *sigmask, size_t sigsetsize)
+{
+ int error;
+ sigset_t sigsaved;
+
+ /*
+ * If the caller wants a certain signal mask to be set during the wait,
+ * we apply it here.
+ */
+ if (sigmask) {
+ sigsaved = current->blocked;
+ set_current_blocked(sigmask);
+ }
+
+ error = epoll_wait_do(epfd, events, maxevents, timeout);
+
+ /*
+ * If we changed the signal mask, we need to restore the original one.
+ * In case we've got a signal while waiting, we do not restore the
+ * signal mask yet, and we allow do_signal() to deliver the signal on
+ * the way back to userspace, before the signal mask is restored.
+ */
+ if (sigmask) {
+ if (error == -EINTR) {
+ memcpy(&current->saved_sigmask, &sigsaved,
+ sizeof(sigsaved));
+ set_restore_sigmask();
+ } else
+ set_current_blocked(&sigsaved);
+ }
+
+ return error;
+}
+
+/*
+ * Implement the event wait interface for the eventpoll file. It is the kernel
* part of the user space epoll_pwait(2).
*/
SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
int, maxevents, int, timeout, const sigset_t __user *, sigmask,
size_t, sigsetsize)
{
- int error;
- sigset_t ksigmask, sigsaved;
+ ktime_t kt = ms_to_ktime(timeout);
+ sigset_t ksigmask;
- /*
- * If the caller wants a certain signal mask to be set during the wait,
- * we apply it here.
- */
if (sigmask) {
if (sigsetsize != sizeof(sigset_t))
return -EINVAL;
if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
return -EFAULT;
- sigsaved = current->blocked;
- set_current_blocked(&ksigmask);
}
-
- error = sys_epoll_wait(epfd, events, maxevents, timeout);
-
- /*
- * If we changed the signal mask, we need to restore the original one.
- * In case we've got a signal while waiting, we do not restore the
- * signal mask yet, and we allow do_signal() to deliver the signal on
- * the way back to userspace, before the signal mask is restored.
- */
- if (sigmask) {
- if (error == -EINTR) {
- memcpy(&current->saved_sigmask, &sigsaved,
- sizeof(sigsaved));
- set_restore_sigmask();
- } else
- set_current_blocked(&sigsaved);
- }
-
- return error;
+ return epoll_pwait_do(epfd, events, maxevents, kt,
+ sigmask ? &ksigmask : NULL, sigsetsize);
}
#ifdef CONFIG_COMPAT
@@ -2051,42 +2061,20 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
const compat_sigset_t __user *, sigmask,
compat_size_t, sigsetsize)
{
- long err;
compat_sigset_t csigmask;
- sigset_t ksigmask, sigsaved;
+ sigset_t ksigmask;
+ ktime_t kt = ms_to_ktime(timeout);
- /*
- * If the caller wants a certain signal mask to be set during the wait,
- * we apply it here.
- */
if (sigmask) {
if (sigsetsize != sizeof(compat_sigset_t))
return -EINVAL;
if (copy_from_user(&csigmask, sigmask, sizeof(csigmask)))
return -EFAULT;
sigset_from_compat(&ksigmask, &csigmask);
- sigsaved = current->blocked;
- set_current_blocked(&ksigmask);
}
- err = sys_epoll_wait(epfd, events, maxevents, timeout);
-
- /*
- * If we changed the signal mask, we need to restore the original one.
- * In case we've got a signal while waiting, we do not restore the
- * signal mask yet, and we allow do_signal() to deliver the signal on
- * the way back to userspace, before the signal mask is restored.
- */
- if (sigmask) {
- if (err == -EINTR) {
- memcpy(&current->saved_sigmask, &sigsaved,
- sizeof(sigsaved));
- set_restore_sigmask();
- } else
- set_current_blocked(&sigsaved);
- }
-
- return err;
+ return epoll_pwait_do(epfd, events, maxevents, kt,
+ sigmask ? &ksigmask : NULL, sigsetsize);
}
#endif
--
1.9.3
* [PATCH v4 2/9] epoll: Specify clockid explicitly
2015-03-10 1:49 [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
2015-03-10 1:49 ` [PATCH v4 1/9] epoll: Extract epoll_wait_do and epoll_pwait_do Fam Zheng
@ 2015-03-10 1:49 ` Fam Zheng
2015-03-10 1:49 ` [PATCH v4 3/9] epoll: Extract ep_ctl_do Fam Zheng
` (7 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Fam Zheng @ 2015-03-10 1:49 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
linux-api, Josh Triplett, Michael Kerrisk (man-pages),
Paolo Bonzini, Omar Sandoval, Jonathan Corbet, shane.seymour,
dan.j.rosenberg
Later we will add clockid in the interface, so let's start using explicit
clockid internally. Now we specify CLOCK_MONOTONIC, which is the same as before.
Signed-off-by: Fam Zheng <famz@redhat.com>
---
fs/eventpoll.c | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 7dfabeb..957d1d0 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1570,7 +1570,7 @@ static int ep_send_events(struct eventpoll *ep,
* error code, in case of error.
*/
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
- int maxevents, const ktime_t timeout)
+ int maxevents, int clockid, const ktime_t timeout)
{
int res = 0, eavail, timed_out = 0;
unsigned long flags;
@@ -1578,6 +1578,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
wait_queue_t wait;
ktime_t expires, *to = NULL;
+ if (clockid != CLOCK_MONOTONIC && clockid != CLOCK_REALTIME)
+ return -EINVAL;
if (!ktime_to_ns(timeout)) {
/*
* Avoid the unnecessary trip to the wait queue loop, if the
@@ -1624,7 +1626,8 @@ fetch_events:
}
spin_unlock_irqrestore(&ep->lock, flags);
- if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
+ if (!schedule_hrtimeout_range_clock(to, slack,
+ HRTIMER_MODE_ABS, clockid))
timed_out = 1;
spin_lock_irqsave(&ep->lock, flags);
@@ -1945,7 +1948,8 @@ error_return:
}
static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
- int maxevents, const ktime_t timeout)
+ int maxevents, int clockid,
+ const ktime_t timeout)
{
int error;
struct fd f;
@@ -1979,7 +1983,7 @@ static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
ep = f.file->private_data;
/* Time to fish for events ... */
- error = ep_poll(ep, events, maxevents, timeout);
+ error = ep_poll(ep, events, maxevents, clockid, timeout);
error_fput:
fdput(f);
@@ -1994,12 +1998,13 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
int, maxevents, int, timeout)
{
ktime_t kt = ms_to_ktime(timeout);
- return epoll_wait_do(epfd, events, maxevents, kt);
+ return epoll_wait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt);
}
static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
- int maxevents, ktime_t timeout,
- sigset_t *sigmask, size_t sigsetsize)
+ int maxevents,
+ int clockid, ktime_t timeout,
+ sigset_t *sigmask)
{
int error;
sigset_t sigsaved;
@@ -2013,7 +2018,7 @@ static inline int epoll_pwait_do(int epfd, struct epoll_event __user *events,
set_current_blocked(sigmask);
}
- error = epoll_wait_do(epfd, events, maxevents, timeout);
+ error = epoll_wait_do(epfd, events, maxevents, clockid, timeout);
/*
* If we changed the signal mask, we need to restore the original one.
@@ -2050,8 +2055,8 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask)))
return -EFAULT;
}
- return epoll_pwait_do(epfd, events, maxevents, kt,
- sigmask ? &ksigmask : NULL, sigsetsize);
+ return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt,
+ sigmask ? &ksigmask : NULL);
}
#ifdef CONFIG_COMPAT
@@ -2073,8 +2078,8 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
sigset_from_compat(&ksigmask, &csigmask);
}
- return epoll_pwait_do(epfd, events, maxevents, kt,
- sigmask ? &ksigmask : NULL, sigsetsize);
+ return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt,
+ sigmask ? &ksigmask : NULL);
}
#endif
--
1.9.3
* [PATCH v4 3/9] epoll: Extract ep_ctl_do
2015-03-10 1:49 [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
2015-03-10 1:49 ` [PATCH v4 1/9] epoll: Extract epoll_wait_do and epoll_pwait_do Fam Zheng
2015-03-10 1:49 ` [PATCH v4 2/9] epoll: Specify clockid explicitly Fam Zheng
@ 2015-03-10 1:49 ` Fam Zheng
2015-03-10 1:49 ` [PATCH v4 4/9] epoll: Add implementation for epoll_ctl_batch Fam Zheng
` (6 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Fam Zheng @ 2015-03-10 1:49 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
linux-api, Josh Triplett, Michael Kerrisk (man-pages),
Paolo Bonzini, Omar Sandoval, Jonathan Corbet, shane.seymour,
dan.j.rosenberg
This is the common part from epoll_ctl implementation which will be
shared with the new syscall.
Signed-off-by: Fam Zheng <famz@redhat.com>
---
fs/eventpoll.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 957d1d0..7909c88 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1810,22 +1810,15 @@ SYSCALL_DEFINE1(epoll_create, int, size)
* the eventpoll file that enables the insertion/removal/change of
* file descriptors inside the interest set.
*/
-SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
- struct epoll_event __user *, event)
+int ep_ctl_do(int epfd, int op, int fd, struct epoll_event epds)
{
int error;
int full_check = 0;
struct fd f, tf;
struct eventpoll *ep;
struct epitem *epi;
- struct epoll_event epds;
struct eventpoll *tep = NULL;
- error = -EFAULT;
- if (ep_op_has_event(op) &&
- copy_from_user(&epds, event, sizeof(struct epoll_event)))
- goto error_return;
-
error = -EBADF;
f = fdget(epfd);
if (!f.file)
@@ -1947,6 +1940,23 @@ error_return:
return error;
}
+/*
+ * The following function implements the controller interface for
+ * the eventpoll file that enables the insertion/removal/change of
+ * file descriptors inside the interest set.
+ */
+SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
+ struct epoll_event __user *, event)
+{
+ struct epoll_event epds;
+
+ if (ep_op_has_event(op) &&
+ copy_from_user(&epds, event, sizeof(struct epoll_event)))
+ return -EFAULT;
+
+ return ep_ctl_do(epfd, op, fd, epds);
+}
+
static inline int epoll_wait_do(int epfd, struct epoll_event __user *events,
int maxevents, int clockid,
const ktime_t timeout)
--
1.9.3
* [PATCH v4 4/9] epoll: Add implementation for epoll_ctl_batch
2015-03-10 1:49 [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
` (2 preceding siblings ...)
2015-03-10 1:49 ` [PATCH v4 3/9] epoll: Extract ep_ctl_do Fam Zheng
@ 2015-03-10 1:49 ` Fam Zheng
2015-03-10 13:59 ` Dan Rosenberg
2015-03-10 1:49 ` [PATCH v4 5/9] x86: Hook up epoll_ctl_batch syscall Fam Zheng
` (5 subsequent siblings)
9 siblings, 1 reply; 16+ messages in thread
From: Fam Zheng @ 2015-03-10 1:49 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
linux-api, Josh Triplett, Michael Kerrisk (man-pages),
Paolo Bonzini, Omar Sandoval, Jonathan Corbet, shane.seymour,
dan.j.rosenberg
This new syscall is a batched version of epoll_ctl. It will execute each
command as specified in cmds in given order, and stop at first failure
or upon completion of all commands.
Signed-off-by: Fam Zheng <famz@redhat.com>
---
fs/eventpoll.c | 50 ++++++++++++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 4 ++++
include/uapi/linux/eventpoll.h | 11 ++++++++++
3 files changed, 65 insertions(+)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 7909c88..54dc63f 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -99,6 +99,8 @@
#define EP_MAX_EVENTS (INT_MAX / sizeof(struct epoll_event))
+#define EP_MAX_BATCH (INT_MAX / sizeof(struct epoll_ctl_cmd))
+
#define EP_UNACTIVE_PTR ((void *) -1L)
#define EP_ITEM_COST (sizeof(struct epitem) + sizeof(struct eppoll_entry))
@@ -2069,6 +2071,54 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
sigmask ? &ksigmask : NULL);
}
+SYSCALL_DEFINE4(epoll_ctl_batch, int, epfd, int, flags,
+ int, ncmds, struct epoll_ctl_cmd __user *, cmds)
+{
+ struct epoll_ctl_cmd *kcmds = NULL;
+ int i, ret = 0;
+ size_t cmd_size;
+
+ if (flags)
+ return -EINVAL;
+ if (!cmds || ncmds <= 0 || ncmds > EP_MAX_BATCH)
+ return -EINVAL;
+ cmd_size = sizeof(struct epoll_ctl_cmd) * ncmds;
+ /* TODO: optimize for small arguments like select/poll with a stack
+ * allocated buffer */
+
+ kcmds = kmalloc(cmd_size, GFP_KERNEL);
+ if (!kcmds)
+ return -ENOMEM;
+ if (copy_from_user(kcmds, cmds, cmd_size)) {
+ ret = -EFAULT;
+ goto out;
+ }
+ for (i = 0; i < ncmds; i++) {
+ struct epoll_event ev = (struct epoll_event) {
+ .events = kcmds[i].events,
+ .data = kcmds[i].data,
+ };
+ if (kcmds[i].flags) {
+ kcmds[i].result = -EINVAL;
+ goto copy;
+ }
+ kcmds[i].result = ep_ctl_do(epfd, kcmds[i].op,
+ kcmds[i].fd, ev);
+ if (kcmds[i].result)
+ goto copy;
+ ret++;
+ }
+copy:
+ /* We lose the number of succeeded commands in favor of returning
+ * -EFAULT, but in this case the application will want to fix the
+ * memory bug first. */
+ if (copy_to_user(cmds, kcmds, cmd_size))
+ ret = -EFAULT;
+out:
+ kfree(kcmds);
+ return ret;
+}
+
#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 76d1e38..7d784e3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -12,6 +12,7 @@
#define _LINUX_SYSCALLS_H
struct epoll_event;
+struct epoll_ctl_cmd;
struct iattr;
struct inode;
struct iocb;
@@ -634,6 +635,9 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
int maxevents, int timeout,
const sigset_t __user *sigmask,
size_t sigsetsize);
+asmlinkage long sys_epoll_ctl_batch(int epfd, int flags,
+ int ncmds,
+ struct epoll_ctl_cmd __user *cmds);
asmlinkage long sys_gethostname(char __user *name, int len);
asmlinkage long sys_sethostname(char __user *name, int len);
asmlinkage long sys_setdomainname(char __user *name, int len);
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..4e18b17 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -18,6 +18,8 @@
#include <linux/fcntl.h>
#include <linux/types.h>
+#include <linux/signal.h>
+
/* Flags for epoll_create1. */
#define EPOLL_CLOEXEC O_CLOEXEC
@@ -61,6 +63,15 @@ struct epoll_event {
__u64 data;
} EPOLL_PACKED;
+struct epoll_ctl_cmd {
+ int flags;
+ int op;
+ int fd;
+ __u32 events;
+ __u64 data;
+ int result;
+} EPOLL_PACKED;
+
#ifdef CONFIG_PM_SLEEP
static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
{
--
1.9.3
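The return convention of the batch loop in this patch can be modelled in userspace. This is only a sketch: epoll_ctl_batch is a proposed syscall, so the struct below is a local copy of the proposed layout, and mock_ctl() is a hypothetical stand-in for ep_ctl_do() that fails for negative fds. It shows that commands run in order, processing stops at the first failure, and the return value counts the commands that succeeded.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Local copy of the layout proposed in the patch (illustration only). */
struct epoll_ctl_cmd_model {
	int flags;
	int op;
	int fd;
	uint32_t events;
	uint64_t data;
	int result;
};

/* Hypothetical stand-in for ep_ctl_do(): rejects negative fds. */
static int mock_ctl(int op, int fd)
{
	(void)op;
	return fd < 0 ? -EBADF : 0;
}

/*
 * Model of the batch loop: run the commands in order, record each
 * result, stop at the first failure, and return how many succeeded.
 */
static int batch_model(struct epoll_ctl_cmd_model *cmds, int ncmds)
{
	int done = 0;

	for (int i = 0; i < ncmds; i++) {
		if (cmds[i].flags) {
			/* Reserved flags must be zero. */
			cmds[i].result = -EINVAL;
			break;
		}
		cmds[i].result = mock_ctl(cmds[i].op, cmds[i].fd);
		if (cmds[i].result)
			break;
		done++;
	}
	return done;
}
```

A caller that gets back fewer than ncmds can read cmds[ret].result to find out why the first failing command was rejected; later entries are left untouched.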
* [PATCH v4 5/9] x86: Hook up epoll_ctl_batch syscall
2015-03-10 1:49 [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
` (3 preceding siblings ...)
2015-03-10 1:49 ` [PATCH v4 4/9] epoll: Add implementation for epoll_ctl_batch Fam Zheng
@ 2015-03-10 1:49 ` Fam Zheng
2015-03-10 1:49 ` [PATCH v4 6/9] epoll: Add implementation for epoll_pwait1 Fam Zheng
` (4 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Fam Zheng @ 2015-03-10 1:49 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
linux-api, Josh Triplett, Michael Kerrisk (man-pages),
Paolo Bonzini, Omar Sandoval, Jonathan Corbet, shane.seymour,
dan.j.rosenberg
Signed-off-by: Fam Zheng <famz@redhat.com>
---
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..fe809f6 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
356 i386 memfd_create sys_memfd_create
357 i386 bpf sys_bpf
358 i386 execveat sys_execveat stub32_execveat
+359 i386 epoll_ctl_batch sys_epoll_ctl_batch
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..67b2ea4 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
320 common kexec_file_load sys_kexec_file_load
321 common bpf sys_bpf
322 64 execveat stub_execveat
+323 64 epoll_ctl_batch sys_epoll_ctl_batch
#
# x32-specific system call numbers start at 512 to avoid cache impact
--
1.9.3
* [PATCH v4 6/9] epoll: Add implementation for epoll_pwait1
2015-03-10 1:49 [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
` (4 preceding siblings ...)
2015-03-10 1:49 ` [PATCH v4 5/9] x86: Hook up epoll_ctl_batch syscall Fam Zheng
@ 2015-03-10 1:49 ` Fam Zheng
2015-03-10 1:49 ` [PATCH v4 7/9] x86: Hook up epoll_pwait1 syscall Fam Zheng
` (3 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Fam Zheng @ 2015-03-10 1:49 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
linux-api, Josh Triplett, Michael Kerrisk (man-pages),
Paolo Bonzini, Omar Sandoval, Jonathan Corbet, shane.seymour,
dan.j.rosenberg
This is the new implementation for epoll which has a flags parameter and
packs a number of parameters into a structure.
Its main advantage over the existing epoll_pwait is the timeout:
epoll_pwait expects a relative millisecond value, while epoll_pwait1
accepts 1) a timespec, which has nanosecond granularity; 2) a clockid,
to allow using a clock other than CLOCK_MONOTONIC.
The 'flags' argument is reserved for now and must be zero. The
next step would be to allow an absolute timeout value.
Signed-off-by: Fam Zheng <famz@redhat.com>
---
fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++++++++-
include/linux/syscalls.h | 5 +++++
include/uapi/linux/eventpoll.h | 8 ++++++++
3 files changed, 51 insertions(+), 1 deletion(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 54dc63f..06a59fc 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2085,7 +2085,6 @@ SYSCALL_DEFINE4(epoll_ctl_batch, int, epfd, int, flags,
cmd_size = sizeof(struct epoll_ctl_cmd) * ncmds;
/* TODO: optimize for small arguments like select/poll with a stack
* allocated buffer */
-
kcmds = kmalloc(cmd_size, GFP_KERNEL);
if (!kcmds)
return -ENOMEM;
@@ -2119,6 +2118,44 @@ out:
return ret;
}
+SYSCALL_DEFINE5(epoll_pwait1, int, epfd, int, flags,
+ struct epoll_event __user *, events,
+ int, maxevents,
+ struct epoll_wait_params __user *, params)
+{
+ struct epoll_wait_params p;
+ ktime_t kt = { 0 };
+ sigset_t sigmask;
+ struct timespec timeout;
+
+ if (flags)
+ return -EINVAL;
+ if (!params)
+ return -EINVAL;
+ if (copy_from_user(&p, params, sizeof(p)))
+ return -EFAULT;
+ if (p.size != sizeof(p))
+ return -EINVAL;
+ if (p.sigmask) {
+ if (copy_from_user(&sigmask, p.sigmask, sizeof(sigmask)))
+ return -EFAULT;
+ if (p.sigsetsize != sizeof(sigmask))
+ return -EINVAL;
+ }
+ if (p.timeout) {
+ if (copy_from_user(&timeout, p.timeout, sizeof(timeout)))
+ return -EFAULT;
+ if (!timespec_valid(&timeout))
+ return -EINVAL;
+ kt = timespec_to_ktime(timeout);
+ } else {
+ kt = ns_to_ktime(-1);
+ }
+
+ return epoll_pwait_do(epfd, events, maxevents, p.clockid,
+ kt, p.sigmask ? &sigmask : NULL);
+}
+
#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
struct epoll_event __user *, events,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 7d784e3..a4823d9 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -13,6 +13,7 @@
struct epoll_event;
struct epoll_ctl_cmd;
+struct epoll_wait_params;
struct iattr;
struct inode;
struct iocb;
@@ -635,6 +636,10 @@ asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events,
int maxevents, int timeout,
const sigset_t __user *sigmask,
size_t sigsetsize);
+asmlinkage long sys_epoll_pwait1(int epfd, int flags,
+ struct epoll_event __user *events,
+ int maxevents,
+ struct epoll_wait_params __user *params);
asmlinkage long sys_epoll_ctl_batch(int epfd, int flags,
int ncmds,
struct epoll_ctl_cmd __user *cmds);
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 4e18b17..05ae035 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -72,6 +72,14 @@ struct epoll_ctl_cmd {
int result;
} EPOLL_PACKED;
+struct epoll_wait_params {
+ int size;
+ int clockid;
+ struct timespec *timeout;
+ sigset_t *sigmask;
+ size_t sigsetsize;
+} EPOLL_PACKED;
+
#ifdef CONFIG_PM_SLEEP
static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev)
{
--
1.9.3
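The three timeout conventions handled by epoll_pwait1 in this patch ('wait indefinitely' for a NULL pointer, 'no wait' for a zero timespec, and a bounded wait otherwise) can be sketched in plain C. The helper name timeout_to_ns() and the WAIT_FOREVER_NS constant are made up for this illustration; the validity check loosely mirrors what timespec_valid() does in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define WAIT_FOREVER_NS ((int64_t)-1)	/* hypothetical marker, like ns_to_ktime(-1) */

/*
 * Convert a user-supplied timeout to nanoseconds.
 * Returns 0 on success and -1 if the timespec is not valid.
 */
static int timeout_to_ns(const struct timespec *ts, int64_t *out_ns)
{
	if (!ts) {
		/* NULL pointer: wait indefinitely. */
		*out_ns = WAIT_FOREVER_NS;
		return 0;
	}
	if (ts->tv_sec < 0 || ts->tv_nsec < 0 || ts->tv_nsec >= 1000000000L)
		return -1;
	/* {0, 0} yields 0, which means "do not wait at all". */
	*out_ns = (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
	return 0;
}
```

Passing the timeout by pointer is what makes the 'wait indefinitely' case distinguishable from a zero-length wait without reserving a magic value inside the timespec itself.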
* [PATCH v4 7/9] x86: Hook up epoll_pwait1 syscall
2015-03-10 1:49 [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
` (5 preceding siblings ...)
2015-03-10 1:49 ` [PATCH v4 6/9] epoll: Add implementation for epoll_pwait1 Fam Zheng
@ 2015-03-10 1:49 ` Fam Zheng
2015-03-10 1:49 ` [PATCH v4 8/9] epoll: Add compat version implementation of epoll_pwait1 Fam Zheng
` (2 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Fam Zheng @ 2015-03-10 1:49 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
linux-api, Josh Triplett, Michael Kerrisk (man-pages),
Paolo Bonzini, Omar Sandoval, Jonathan Corbet, shane.seymour,
dan.j.rosenberg
Signed-off-by: Fam Zheng <famz@redhat.com>
---
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index fe809f6..bf912d8 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -366,3 +366,4 @@
357 i386 bpf sys_bpf
358 i386 execveat sys_execveat stub32_execveat
359 i386 epoll_ctl_batch sys_epoll_ctl_batch
+360 i386 epoll_pwait1 sys_epoll_pwait1
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 67b2ea4..9246ad5 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -330,6 +330,7 @@
321 common bpf sys_bpf
322 64 execveat stub_execveat
323 64 epoll_ctl_batch sys_epoll_ctl_batch
+324 64 epoll_pwait1 sys_epoll_pwait1
#
# x32-specific system call numbers start at 512 to avoid cache impact
--
1.9.3
* [PATCH v4 8/9] epoll: Add compat version implementation of epoll_pwait1
2015-03-10 1:49 [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
` (6 preceding siblings ...)
2015-03-10 1:49 ` [PATCH v4 7/9] x86: Hook up epoll_pwait1 syscall Fam Zheng
@ 2015-03-10 1:49 ` Fam Zheng
2015-03-10 1:49 ` [PATCH v4 9/9] x86: Hook up 32 bit compat epoll_pwait1 syscall Fam Zheng
2015-03-12 15:02 ` [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Jason Baron
9 siblings, 0 replies; 16+ messages in thread
From: Fam Zheng @ 2015-03-10 1:49 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
linux-api, Josh Triplett, Michael Kerrisk (man-pages),
Paolo Bonzini, Omar Sandoval, Jonathan Corbet, shane.seymour,
dan.j.rosenberg
Signed-off-by: Fam Zheng <famz@redhat.com>
---
fs/eventpoll.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/compat.h | 6 ++++++
2 files changed, 56 insertions(+)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 06a59fc..b837ea4 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2178,6 +2178,56 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
return epoll_pwait_do(epfd, events, maxevents, CLOCK_MONOTONIC, kt,
sigmask ? &ksigmask : NULL);
}
+
+struct compat_epoll_wait_params {
+ int size;
+ int clockid;
+ compat_uptr_t timeout;
+ compat_uptr_t sigmask;
+ compat_size_t sigsetsize;
+} EPOLL_PACKED;
+
+COMPAT_SYSCALL_DEFINE5(epoll_pwait1, int, epfd, int, flags,
+ struct epoll_event __user *, events,
+ int, maxevents,
+ struct compat_epoll_wait_params __user *, params)
+{
+ struct compat_epoll_wait_params p;
+
+ ktime_t kt = { 0 };
+ sigset_t sigmask;
+ compat_sigset_t compat_sigmask;
+ struct timespec timeout;
+
+ if (flags)
+ return -EINVAL;
+ if (!params)
+ return -EINVAL;
+ if (copy_from_user(&p, params, sizeof(p)))
+ return -EFAULT;
+ if (p.size != sizeof(p))
+ return -EINVAL;
+ if (p.sigmask) {
+ if (copy_from_user(&compat_sigmask, compat_ptr(p.sigmask),
+ sizeof(compat_sigmask)))
+ return -EFAULT;
+ if (p.sigsetsize != sizeof(compat_sigmask))
+ return -EINVAL;
+ sigset_from_compat(&sigmask, &compat_sigmask);
+ }
+ if (p.timeout) {
+ if (compat_get_timespec(&timeout, compat_ptr(p.timeout)))
+ return -EFAULT;
+ if (!timespec_valid(&timeout))
+ return -EINVAL;
+ kt = timespec_to_ktime(timeout);
+ } else {
+ kt = ns_to_ktime(-1);
+ }
+
+ return epoll_pwait_do(epfd, events, maxevents, p.clockid,
+ kt, p.sigmask ? &sigmask : NULL);
+}
#endif
static int __init eventpoll_init(void)
diff --git a/include/linux/compat.h b/include/linux/compat.h
index ab25814..649c5b2 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -452,6 +452,12 @@ asmlinkage long compat_sys_epoll_pwait(int epfd,
const compat_sigset_t __user *sigmask,
compat_size_t sigsetsize);
+struct compat_epoll_wait_params;
+asmlinkage long compat_sys_epoll_pwait1(int epfd, int flags,
+ struct epoll_event __user *events,
+ int maxevents,
+ struct compat_epoll_wait_params __user *params);
+
asmlinkage long compat_sys_utime(const char __user *filename,
struct compat_utimbuf __user *t);
asmlinkage long compat_sys_utimensat(unsigned int dfd,
--
1.9.3
* [PATCH v4 9/9] x86: Hook up 32 bit compat epoll_pwait1 syscall
2015-03-10 1:49 [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
` (7 preceding siblings ...)
2015-03-10 1:49 ` [PATCH v4 8/9] epoll: Add compat version implementation of epoll_pwait1 Fam Zheng
@ 2015-03-10 1:49 ` Fam Zheng
2015-03-12 15:02 ` [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Jason Baron
9 siblings, 0 replies; 16+ messages in thread
From: Fam Zheng @ 2015-03-10 1:49 UTC (permalink / raw)
To: linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Fam Zheng, Peter Zijlstra, linux-fsdevel,
linux-api, Josh Triplett, Michael Kerrisk (man-pages),
Paolo Bonzini, Omar Sandoval, Jonathan Corbet, shane.seymour,
dan.j.rosenberg
Signed-off-by: Fam Zheng <famz@redhat.com>
---
arch/x86/syscalls/syscall_32.tbl | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index bf912d8..5728c2e 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -366,4 +366,4 @@
357 i386 bpf sys_bpf
358 i386 execveat sys_execveat stub32_execveat
359 i386 epoll_ctl_batch sys_epoll_ctl_batch
-360 i386 epoll_pwait1 sys_epoll_pwait1
+360 i386 epoll_pwait1 sys_epoll_pwait1 compat_sys_epoll_pwait1
--
1.9.3
* Re: [PATCH v4 4/9] epoll: Add implementation for epoll_ctl_batch
2015-03-10 1:49 ` [PATCH v4 4/9] epoll: Add implementation for epoll_ctl_batch Fam Zheng
@ 2015-03-10 13:59 ` Dan Rosenberg
2015-03-11 2:23 ` Fam Zheng
0 siblings, 1 reply; 16+ messages in thread
From: Dan Rosenberg @ 2015-03-10 13:59 UTC (permalink / raw)
To: Fam Zheng, linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
Josh Triplett, Michael Kerrisk (man-pages),
Paolo Bonzini, Omar Sandoval, Jonathan Corbet, shane.seymour
On 03/09/2015 09:49 PM, Fam Zheng wrote:
> + if (!cmds || ncmds <= 0 || ncmds > EP_MAX_BATCH)
> + return -EINVAL;
> + cmd_size = sizeof(struct epoll_ctl_cmd) * ncmds;
> + /* TODO: optimize for small arguments like select/poll with a stack
> + * allocated buffer */
> +
> + kcmds = kmalloc(cmd_size, GFP_KERNEL);
> + if (!kcmds)
> + return -ENOMEM;
You probably want to define EP_MAX_BATCH as some sane value much less
than INT_MAX/(sizeof(struct epoll_ctl_cmd)). While this avoids the
integer overflow from before, any user can cause the kernel to kmalloc
up to INT_MAX bytes. Probably not a huge deal because it's freed at the
end of the syscall, but generally not a great idea.
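The kind of bound being suggested here could look roughly like the sketch below. The cap of 1024 is purely a hypothetical value picked for illustration (no number is proposed in this thread), and struct cmd_model is a local stand-in for struct epoll_ctl_cmd. The point is that with a small cap, both the multiplication overflow and the oversized allocation are ruled out by the same check.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the proposed struct epoll_ctl_cmd. */
struct cmd_model {
	int flags;
	int op;
	int fd;
	uint32_t events;
	uint64_t data;
	int result;
};

#define EP_MAX_BATCH_SKETCH 1024	/* hypothetical cap, not from the patch */

/*
 * Return the buffer size needed for ncmds commands, or 0 if ncmds is
 * out of range. With the cap in place the product cannot overflow.
 */
static size_t batch_alloc_size(int ncmds)
{
	if (ncmds <= 0 || ncmds > EP_MAX_BATCH_SKETCH)
		return 0;
	return (size_t)ncmds * sizeof(struct cmd_model);
}
```

With this cap the largest request is 1024 commands, a few tens of kilobytes, instead of anything up to INT_MAX bytes.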
* Re: [PATCH v4 4/9] epoll: Add implementation for epoll_ctl_batch
2015-03-10 13:59 ` Dan Rosenberg
@ 2015-03-11 2:23 ` Fam Zheng
0 siblings, 0 replies; 16+ messages in thread
From: Fam Zheng @ 2015-03-11 2:23 UTC (permalink / raw)
To: Dan Rosenberg; +Cc: linux-kernel, famz
On Tue, 03/10 09:59, Dan Rosenberg wrote:
> On 03/09/2015 09:49 PM, Fam Zheng wrote:
> > + if (!cmds || ncmds <= 0 || ncmds > EP_MAX_BATCH)
> > + return -EINVAL;
> > + cmd_size = sizeof(struct epoll_ctl_cmd) * ncmds;
> > + /* TODO: optimize for small arguments like select/poll with a stack
> > + * allocated buffer */
> > +
> > + kcmds = kmalloc(cmd_size, GFP_KERNEL);
> > + if (!kcmds)
> > + return -ENOMEM;
> You probably want to define EP_MAX_BATCH as some sane value much less
> than INT_MAX/(sizeof(struct epoll_ctl_cmd)). While this avoids the
> integer overflow from before, any user can cause the kernel to kmalloc
> up to INT_MAX bytes. Probably not a huge deal because it's freed at the
> end of the syscall, but generally not a great idea.
>
Yeah, makes sense, any suggested value?
Thanks,
Fam
* Re: [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
2015-03-10 1:49 [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
` (8 preceding siblings ...)
2015-03-10 1:49 ` [PATCH v4 9/9] x86: Hook up 32 bit compat epoll_pwait1 syscall Fam Zheng
@ 2015-03-12 15:02 ` Jason Baron
2015-03-13 11:31 ` Fam Zheng
9 siblings, 1 reply; 16+ messages in thread
From: Jason Baron @ 2015-03-12 15:02 UTC (permalink / raw)
To: Fam Zheng, linux-kernel
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
Josh Triplett, Michael Kerrisk (man-pages),
Paolo Bonzini, Omar Sandoval, Jonathan Corbet, shane.seymour,
dan.j.rosenberg
On 03/09/2015 09:49 PM, Fam Zheng wrote:
>
> Benchmark for epoll_pwait1
> ==========================
>
> By running fio tests inside VM with both original and modified QEMU, we can
> compare their difference in performance.
>
> With a small VM setup [t1], the original QEMU (ppoll based) has a 4k read
> latency overhead around 37 us. In this setup, the main loop polls 10~20 fds.
>
> With a slightly larger VM instance [t2] - attached a virtio-serial device so
> that there are 80~90 fds in the main loop - the original QEMU has a latency
> overhead around 49 us. By adding more such devices [t3], we can see the latency
> go even higher - 83 us with ~200 FDs.
>
> After modifying QEMU to use epoll_pwait1 and testing again, the latency
> numbers are 36 us, 37 us and 47 us for t1, t2 and t3, respectively.
>
>
Hi,
So it sounds like you are comparing original qemu code (which was using
ppoll) vs. using epoll with these new syscalls. Curious if you have numbers
comparing the existing epoll (with say the timerfd in your epoll set), so
we can see the improvement relative to epoll.
Thanks,
-Jason
* Re: [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
2015-03-12 15:02 ` [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Jason Baron
@ 2015-03-13 11:31 ` Fam Zheng
2015-03-13 14:46 ` Jason Baron
0 siblings, 1 reply; 16+ messages in thread
From: Fam Zheng @ 2015-03-13 11:31 UTC (permalink / raw)
To: Jason Baron
Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
Josh Triplett, Michael Kerrisk (man-pages),
Paolo Bonzini, Omar Sandoval, Jonathan Corbet, shane.seymour,
dan.j.rosenberg
On Thu, 03/12 11:02, Jason Baron wrote:
> On 03/09/2015 09:49 PM, Fam Zheng wrote:
> >
> > Benchmark for epoll_pwait1
> > ==========================
> >
> > By running fio tests inside VM with both original and modified QEMU, we can
> > compare their difference in performance.
> >
> > With a small VM setup [t1], the original QEMU (ppoll based) has a 4k read
> > latency overhead around 37 us. In this setup, the main loop polls 10~20 fds.
> >
> > With a slightly larger VM instance [t2] - attached a virtio-serial device so
> > that there are 80~90 fds in the main loop - the original QEMU has a latency
> > overhead around 49 us. By adding more such devices [t3], we can see the latency
> > go even higher - 83 us with ~200 FDs.
> >
> > After modifying QEMU to use epoll_pwait1 and testing again, the latency
> > numbers are 36 us, 37 us and 47 us for t1, t2 and t3, respectively.
> >
> >
>
> Hi,
>
> So it sounds like you are comparing original qemu code (which was using
> ppoll) vs. using epoll with these new syscalls. Curious if you have numbers
> comparing the existing epoll (with say the timerfd in your epoll set), so
> we can see the improvement relative to epoll.
I did compare them, but they are too close to see differences. The improvements
in epoll_pwait1 don't really help the hot path of guest IO, but they do
affect the program's timer precision, which is used in various device emulations
in QEMU.
Although it's kind of subtle and difficult to summarize here, I can give an
example in the IO throttling implementation in QEMU, to show the significance:
The throttling algorithm computes a duration for the next IO, which is used to
arm a timer in order to delay the request a bit. As timers are always rounded
*UP* to the effective granularity, the timeout being 1ms in epoll_pwait is just
too coarse and will lead to severe inaccuracy. With epoll_pwait1, we can avoid
the rounding-up.
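To put a number on the rounding, here is a small sketch of what happens when a nanosecond duration has to be expressed as the millisecond timeout of epoll_pwait. The helper names are made up for this illustration and assume a non-negative duration.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000LL

/* Round a non-negative nanosecond duration up to whole milliseconds. */
static int64_t ns_to_ms_round_up(int64_t ns)
{
	return (ns + NSEC_PER_MSEC - 1) / NSEC_PER_MSEC;
}

/* Extra delay, in nanoseconds, introduced by the rounding. */
static int64_t rounding_overshoot_ns(int64_t ns)
{
	return ns_to_ms_round_up(ns) * NSEC_PER_MSEC - ns;
}
```

A throttling delay of 100 us becomes a 1 ms timeout, so the request is held back 900 us longer than computed; with a nanosecond timespec the overshoot from this conversion is zero.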
I think this idea could be pretty generally desired by other applications, too.
Regarding the epoll_ctl_batch improvement, again, it is not going to disrupt
the numbers in the small workload I managed to test.
Of course, if you have a specific application scenario in mind, I will try it. :)
Thanks,
Fam
* Re: [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
2015-03-13 11:31 ` Fam Zheng
@ 2015-03-13 14:46 ` Jason Baron
2015-03-13 14:56 ` Paolo Bonzini
0 siblings, 1 reply; 16+ messages in thread
From: Jason Baron @ 2015-03-13 14:46 UTC (permalink / raw)
To: Fam Zheng
Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
Josh Triplett, Michael Kerrisk (man-pages),
Paolo Bonzini, Omar Sandoval, Jonathan Corbet, shane.seymour,
dan.j.rosenberg
On 03/13/2015 07:31 AM, Fam Zheng wrote:
> On Thu, 03/12 11:02, Jason Baron wrote:
>> On 03/09/2015 09:49 PM, Fam Zheng wrote:
>>
>> Hi,
>>
>> So it sounds like you are comparing original qemu code (which was using
>> ppoll) vs. using epoll with these new syscalls. Curious if you have numbers
>> comparing the existing epoll (with say the timerfd in your epoll set), so
>> we can see the improvement relative to epoll.
> I did compare them, but they are too close to see differences. The improvements
> in epoll_pwait1 don't really help the hot path of guest IO, but they do
> affect the program's timer precision, which is used in various device emulations
> in QEMU.
>
> Although it's kind of subtle and difficult to summarize here, I can give an
> example in the IO throttling implementation in QEMU, to show the significance:
>
> The throttling algorithm computes a duration for the next IO, which is used to
> arm a timer in order to delay the request a bit. As timers are always rounded
> *UP* to the effective granularity, the timeout being 1ms in epoll_pwait is just
> too coarse and will lead to severe inaccuracy. With epoll_pwait1, we can avoid
> the rounding-up.
right, but we could use the timerfd here to get the desired precision.
> I think this idea could be pretty generally desired by other applications, too.
>
> Regarding the epoll_ctl_batch improvement, again, it is not going to disrupt
> the numbers in the small workload I managed to test.
>
> Of course, if you have a specific application scenario in mind, I will try it. :)
I want to understand what new functionality these syscalls offer over
what we have now. We could show a micro-benchmark where these matter,
but is that enough to justify new syscalls, given that I think we could
implement library wrappers around what we have now to do what you are
proposing here?
Thanks,
-Jason
* Re: [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
2015-03-13 14:46 ` Jason Baron
@ 2015-03-13 14:56 ` Paolo Bonzini
0 siblings, 0 replies; 16+ messages in thread
From: Paolo Bonzini @ 2015-03-13 14:56 UTC (permalink / raw)
To: Jason Baron, Fam Zheng
Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Alexander Viro, Andrew Morton, Kees Cook, Andy Lutomirski,
David Herrmann, Alexei Starovoitov, Miklos Szeredi,
David Drysdale, Oleg Nesterov, David S. Miller, Vivek Goyal,
Mike Frysinger, Theodore Ts'o, Heiko Carstens,
Rasmus Villemoes, Rashika Kheria, Hugh Dickins,
Mathieu Desnoyers, Peter Zijlstra, linux-fsdevel, linux-api,
Josh Triplett, Michael Kerrisk (man-pages),
Omar Sandoval, Jonathan Corbet, shane.seymour, dan.j.rosenberg
On 13/03/2015 15:46, Jason Baron wrote:
> > The throttling algorithm computes a duration for the next IO, which is used to
> > arm a timer in order to delay the request a bit. As timers are always rounded
> > *UP* to the effective granularity, the timeout being 1ms in epoll_pwait is just
> > too coarse and will lead to severe inaccuracy. With epoll_pwait1, we can avoid
> > the rounding-up.
>
> right, but we could use the timerfd here to get the desired precision.
Fam, didn't you see slowdowns with few file descriptors when using
epoll_ctl+epoll_wait+timerfd compared to ppoll?
Do they disappear or improve with epoll_ctl_batch and epoll_pwait1?
Paolo
Thread overview: 16+ messages
2015-03-10 1:49 [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Fam Zheng
2015-03-10 1:49 ` [PATCH v4 1/9] epoll: Extract epoll_wait_do and epoll_pwait_do Fam Zheng
2015-03-10 1:49 ` [PATCH v4 2/9] epoll: Specify clockid explicitly Fam Zheng
2015-03-10 1:49 ` [PATCH v4 3/9] epoll: Extract ep_ctl_do Fam Zheng
2015-03-10 1:49 ` [PATCH v4 4/9] epoll: Add implementation for epoll_ctl_batch Fam Zheng
2015-03-10 13:59 ` Dan Rosenberg
2015-03-11 2:23 ` Fam Zheng
2015-03-10 1:49 ` [PATCH v4 5/9] x86: Hook up epoll_ctl_batch syscall Fam Zheng
2015-03-10 1:49 ` [PATCH v4 6/9] epoll: Add implementation for epoll_pwait1 Fam Zheng
2015-03-10 1:49 ` [PATCH v4 7/9] x86: Hook up epoll_pwait1 syscall Fam Zheng
2015-03-10 1:49 ` [PATCH v4 8/9] epoll: Add compat version implementation of epoll_pwait1 Fam Zheng
2015-03-10 1:49 ` [PATCH v4 9/9] x86: Hook up 32 bit compat epoll_pwait1 syscall Fam Zheng
2015-03-12 15:02 ` [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 Jason Baron
2015-03-13 11:31 ` Fam Zheng
2015-03-13 14:46 ` Jason Baron
2015-03-13 14:56 ` Paolo Bonzini