LKML Archive on lore.kernel.org
* [PATCH v3 0/6] futex2: Add wait on multiple futexes syscall
@ 2021-09-13 17:52 André Almeida
  2021-09-13 17:52 ` [PATCH v3 1/6] futex: Prepare for futex_wait_multiple() André Almeida
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: André Almeida @ 2021-09-13 17:52 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior
  Cc: kernel, krisman, linux-api, libc-alpha, mtk.manpages,
	Davidlohr Bueso, Arnd Bergmann, André Almeida

Hi,

This patchset introduces the futex_waitv syscall. It reuses as much of
the original futex code as possible for the new operation, so the first
commit moves some definitions to a futex header to make them accessible
to futex2.

* Use case

The use case of this syscall is to allow low-level locking libraries to
wait for multiple locks at the same time. This is especially useful for
emulating Windows' WaitForMultipleObjects. A futex_waitv()-based solution
has been used for some time in Proton's Wine (a compatibility layer to
run Windows games on Linux). Compared to a solution that uses eventfd(),
futex was able to reduce CPU utilization for games, and even increase
frames per second for some games. This happens because eventfd doesn't
scale very well for a huge number of read, write and poll calls compared
to futex. Native game engines will benefit from this as well, given that
this wait pattern is common in games.

* The interface

This is what the interface looks like:

  futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes,
              unsigned int flags, struct timespec *timo)

  struct futex_waitv {
          __u64 val;
          __u64 uaddr;
          __u32 flags;
          __u32 __reserved;
  };

struct futex_waitv uses explicit padding, so we can use it on all
architectures. The __reserved field is used for the padding and should
always be 0, but it may be repurposed in the future for some extension. If
userspace has 32-bit pointers, it should do an explicit cast to make sure
the upper bits are zeroed. uintptr_t does the trick and it works for
32/64-bit pointers.
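
As an illustration of the layout and the cast described above, here is a
minimal userspace sketch. The struct name waitv_entry and the helper
waitv_entry_init() are hypothetical and only mirror the proposed struct
futex_waitv; they are not part of this patchset.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace mirror of the proposed struct futex_waitv. */
struct waitv_entry {
	uint64_t val;		/* expected value at uaddr */
	uint64_t uaddr;		/* address of the futex word */
	uint32_t flags;		/* per-waiter flags */
	uint32_t reserved;	/* explicit padding, must be 0 */
};

/* The explicit padding gives the same size on 32-bit and 64-bit ABIs. */
static_assert(sizeof(struct waitv_entry) == 24, "unexpected struct size");

/* Fill one waiter; the uintptr_t cast zero-extends 32-bit pointers. */
static void waitv_entry_init(struct waitv_entry *w, const uint32_t *futex_word,
			     uint32_t expected, uint32_t flags)
{
	memset(w, 0, sizeof(*w));	/* also clears the reserved field */
	w->val = expected;
	w->uaddr = (uintptr_t)futex_word;
	w->flags = flags;
}
```

Going through uintptr_t first means the pointer is converted to an
unsigned integer of pointer width and only then widened to 64 bits, so
the upper half is zero on 32-bit userspace.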

* Why u64?

Although futex() supports only 32-bit integers, while I was researching
feedback about a new futex interface, developers made some points in
favor of variable size support:

- At Boost Libraries, futex is used as back end to implement atomic
primitives for some architectures. It works fine for 32-bit futexes, but
for other sizes it "must use an internal lock pool to implement waiting
and notifying operations, which increases thread contention. For
inter-process atomics, this means that waiting must be done using a spin
loop, which is terribly inefficient."[1]

- glibc’s rwlock implementation "uses a torn 32-bit futex read which is
part of an atomically updated 64-bit word".[2]

- Peter Oskolkov[3] pointed out that for 64-bit platforms it would be
useful to do atomic operations on pointer values: "imagine a simple
producer/consumer scenario, with the producer updating some shared
memory data and waking the consumer. Storing the pointer in the futex
makes it so that only one shared memory location needs to be accessed
atomically".

- The original proposal[4] to support 8-bit and 16-bit futexes had some
use cases as well: "Having mutexes that are only one byte in size was
the first reason WebKit mentioned for re-implementing futexes in a
library" and "The C++ standard added futexes to the standard library in
C++20 under the name atomic_wait and atomic_notify. The C++20 version
supports this for atomic variables of any size. The more sizes we can
support, the better the implementation can be in the standard library."

* Testing

Through Proton, I've tested futex_waitv() with modern games that issue
more than 40k futex calls per second. Selftests are provided as part of this
patchset. However, those selftests aren't really reliable on 32-bit
platforms, given that glibc doesn't expose a way to have a 64-bit timespec
gettime(). In the past I implemented a gettime64() myself as part of
the selftest, but I'm not sure if this is the best approach:
https://lore.kernel.org/lkml/20210805190405.59110-4-andrealmeid@collabora.com/

* Changelog

Changes from v2:
v2: https://lore.kernel.org/lkml/20210904231159.13292-1-andrealmeid@collabora.com/
- In the last version, I made compat and non-compat use the same code,
but failed to remove the compat entry point. This is fixed now.
- Add ARM support

Changes from v1:
v1: https://lore.kernel.org/lkml/20210805190405.59110-1-andrealmeid@collabora.com/
- Tons of code and comment improvements and fixes (thanks Thomas!)
- Changed the struct to have explicit padding (thanks Arnd!)
- Created a kernel/futex.h
- Split syscall table changes from the implementation
- Compat and non-compat entry points now use the same code and same
  struct
- Added test for timeout

More info about futex2: https://lore.kernel.org/lkml/20210709001328.329716-1-andrealmeid@collabora.com/

[1] https://lists.boost.org/Archives/boost/2021/05/251508.php

[2]
https://lore.kernel.org/lkml/20210603195924.361327-1-andrealmeid@collabora.com/T/#m37bfbbd6ac76c121941defd1daea774389552674

[3]
https://lore.kernel.org/lkml/CAFTs51XAr2b3DmcSM4=qeU5cNuh0mTxUbhG66U6bc63YYzkzYA@mail.gmail.com/

[4]
https://lore.kernel.org/lkml/20191204235238.10764-1-malteskarupke@web.de/

André Almeida (6):
  futex: Prepare for futex_wait_multiple()
  futex2: Implement vectorized wait
  futex2: wire up syscall for x86
  futex2: wire up syscall for ARM
  selftests: futex2: Add waitv test
  selftests: futex2: Test futex_waitv timeout

 MAINTAINERS                                   |   3 +-
 arch/arm/tools/syscall.tbl                    |   1 +
 arch/arm64/include/asm/unistd.h               |   2 +-
 arch/arm64/include/asm/unistd32.h             |   2 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 include/linux/syscalls.h                      |   6 +
 include/uapi/asm-generic/unistd.h             |   5 +-
 include/uapi/linux/futex.h                    |  25 ++
 init/Kconfig                                  |   7 +
 kernel/Makefile                               |   1 +
 kernel/futex.c                                | 335 +++++++++++-------
 kernel/futex.h                                | 155 ++++++++
 kernel/futex2.c                               | 117 ++++++
 kernel/sys_ni.c                               |   3 +
 .../selftests/futex/functional/.gitignore     |   1 +
 .../selftests/futex/functional/Makefile       |   3 +-
 .../futex/functional/futex_wait_timeout.c     |  21 +-
 .../selftests/futex/functional/futex_waitv.c  | 158 +++++++++
 .../testing/selftests/futex/functional/run.sh |   3 +
 .../selftests/futex/include/futex2test.h      |  31 ++
 21 files changed, 744 insertions(+), 137 deletions(-)
 create mode 100644 kernel/futex.h
 create mode 100644 kernel/futex2.c
 create mode 100644 tools/testing/selftests/futex/functional/futex_waitv.c
 create mode 100644 tools/testing/selftests/futex/include/futex2test.h

-- 
2.33.0



* [PATCH v3 1/6] futex: Prepare for futex_wait_multiple()
  2021-09-13 17:52 [PATCH v3 0/6] futex2: Add wait on multiple futexes syscall André Almeida
@ 2021-09-13 17:52 ` André Almeida
  2021-09-13 17:52 ` [PATCH v3 2/6] futex2: Implement vectorized wait André Almeida
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: André Almeida @ 2021-09-13 17:52 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior
  Cc: kernel, krisman, linux-api, libc-alpha, mtk.manpages,
	Davidlohr Bueso, Arnd Bergmann, André Almeida

Make public the functions and defines that will be used by the
futex_wait_multiple() function in the next commit.

Signed-off-by: André Almeida <andrealmeid@collabora.com>
---
 kernel/futex.c | 134 +---------------------------------------------
 kernel/futex.h | 140 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 142 insertions(+), 132 deletions(-)
 create mode 100644 kernel/futex.h

diff --git a/kernel/futex.c b/kernel/futex.c
index c15ad276fd15..32c91f9d7385 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -34,14 +34,11 @@
 #include <linux/compat.h>
 #include <linux/jhash.h>
 #include <linux/pagemap.h>
-#include <linux/syscalls.h>
 #include <linux/freezer.h>
 #include <linux/memblock.h>
 #include <linux/fault-inject.h>
-#include <linux/time_namespace.h>
-
-#include <asm/futex.h>
 
+#include "futex.h"
 #include "locking/rtmutex_common.h"
 
 /*
@@ -150,22 +147,6 @@
 static int  __read_mostly futex_cmpxchg_enabled;
 #endif
 
-/*
- * Futex flags used to encode options to functions and preserve them across
- * restarts.
- */
-#ifdef CONFIG_MMU
-# define FLAGS_SHARED		0x01
-#else
-/*
- * NOMMU does not have per process address space. Let the compiler optimize
- * code away.
- */
-# define FLAGS_SHARED		0x00
-#endif
-#define FLAGS_CLOCKRT		0x02
-#define FLAGS_HAS_TIMEOUT	0x04
-
 /*
  * Priority Inheritance state:
  */
@@ -187,103 +168,6 @@ struct futex_pi_state {
 	union futex_key key;
 } __randomize_layout;
 
-/**
- * struct futex_q - The hashed futex queue entry, one per waiting task
- * @list:		priority-sorted list of tasks waiting on this futex
- * @task:		the task waiting on the futex
- * @lock_ptr:		the hash bucket lock
- * @key:		the key the futex is hashed on
- * @pi_state:		optional priority inheritance state
- * @rt_waiter:		rt_waiter storage for use with requeue_pi
- * @requeue_pi_key:	the requeue_pi target futex key
- * @bitset:		bitset for the optional bitmasked wakeup
- * @requeue_state:	State field for futex_requeue_pi()
- * @requeue_wait:	RCU wait for futex_requeue_pi() (RT only)
- *
- * We use this hashed waitqueue, instead of a normal wait_queue_entry_t, so
- * we can wake only the relevant ones (hashed queues may be shared).
- *
- * A futex_q has a woken state, just like tasks have TASK_RUNNING.
- * It is considered woken when plist_node_empty(&q->list) || q->lock_ptr == 0.
- * The order of wakeup is always to make the first condition true, then
- * the second.
- *
- * PI futexes are typically woken before they are removed from the hash list via
- * the rt_mutex code. See unqueue_me_pi().
- */
-struct futex_q {
-	struct plist_node list;
-
-	struct task_struct *task;
-	spinlock_t *lock_ptr;
-	union futex_key key;
-	struct futex_pi_state *pi_state;
-	struct rt_mutex_waiter *rt_waiter;
-	union futex_key *requeue_pi_key;
-	u32 bitset;
-	atomic_t requeue_state;
-#ifdef CONFIG_PREEMPT_RT
-	struct rcuwait requeue_wait;
-#endif
-} __randomize_layout;
-
-/*
- * On PREEMPT_RT, the hash bucket lock is a 'sleeping' spinlock with an
- * underlying rtmutex. The task which is about to be requeued could have
- * just woken up (timeout, signal). After the wake up the task has to
- * acquire hash bucket lock, which is held by the requeue code.  As a task
- * can only be blocked on _ONE_ rtmutex at a time, the proxy lock blocking
- * and the hash bucket lock blocking would collide and corrupt state.
- *
- * On !PREEMPT_RT this is not a problem and everything could be serialized
- * on hash bucket lock, but aside of having the benefit of common code,
- * this allows to avoid doing the requeue when the task is already on the
- * way out and taking the hash bucket lock of the original uaddr1 when the
- * requeue has been completed.
- *
- * The following state transitions are valid:
- *
- * On the waiter side:
- *   Q_REQUEUE_PI_NONE		-> Q_REQUEUE_PI_IGNORE
- *   Q_REQUEUE_PI_IN_PROGRESS	-> Q_REQUEUE_PI_WAIT
- *
- * On the requeue side:
- *   Q_REQUEUE_PI_NONE		-> Q_REQUEUE_PI_INPROGRESS
- *   Q_REQUEUE_PI_IN_PROGRESS	-> Q_REQUEUE_PI_DONE/LOCKED
- *   Q_REQUEUE_PI_IN_PROGRESS	-> Q_REQUEUE_PI_NONE (requeue failed)
- *   Q_REQUEUE_PI_WAIT		-> Q_REQUEUE_PI_DONE/LOCKED
- *   Q_REQUEUE_PI_WAIT		-> Q_REQUEUE_PI_IGNORE (requeue failed)
- *
- * The requeue side ignores a waiter with state Q_REQUEUE_PI_IGNORE as this
- * signals that the waiter is already on the way out. It also means that
- * the waiter is still on the 'wait' futex, i.e. uaddr1.
- *
- * The waiter side signals early wakeup to the requeue side either through
- * setting state to Q_REQUEUE_PI_IGNORE or to Q_REQUEUE_PI_WAIT depending
- * on the current state. In case of Q_REQUEUE_PI_IGNORE it can immediately
- * proceed to take the hash bucket lock of uaddr1. If it set state to WAIT,
- * which means the wakeup is interleaving with a requeue in progress it has
- * to wait for the requeue side to change the state. Either to DONE/LOCKED
- * or to IGNORE. DONE/LOCKED means the waiter q is now on the uaddr2 futex
- * and either blocked (DONE) or has acquired it (LOCKED). IGNORE is set by
- * the requeue side when the requeue attempt failed via deadlock detection
- * and therefore the waiter q is still on the uaddr1 futex.
- */
-enum {
-	Q_REQUEUE_PI_NONE		=  0,
-	Q_REQUEUE_PI_IGNORE,
-	Q_REQUEUE_PI_IN_PROGRESS,
-	Q_REQUEUE_PI_WAIT,
-	Q_REQUEUE_PI_DONE,
-	Q_REQUEUE_PI_LOCKED,
-};
-
-static const struct futex_q futex_q_init = {
-	/* list gets initialized in queue_me()*/
-	.key		= FUTEX_KEY_INIT,
-	.bitset		= FUTEX_BITSET_MATCH_ANY,
-	.requeue_state	= ATOMIC_INIT(Q_REQUEUE_PI_NONE),
-};
 
 /*
  * Hash buckets are shared by all the futex_keys that hash to the same
@@ -453,7 +337,7 @@ enum futex_access {
  * Return: Initialized hrtimer_sleeper structure or NULL if no timeout
  *	   value given
  */
-static inline struct hrtimer_sleeper *
+inline struct hrtimer_sleeper *
 futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
 		  int flags, u64 range_ns)
 {
@@ -4005,20 +3889,6 @@ static __always_inline bool futex_cmd_has_timeout(u32 cmd)
 	return false;
 }
 
-static __always_inline int
-futex_init_timeout(u32 cmd, u32 op, struct timespec64 *ts, ktime_t *t)
-{
-	if (!timespec64_valid(ts))
-		return -EINVAL;
-
-	*t = timespec64_to_ktime(*ts);
-	if (cmd == FUTEX_WAIT)
-		*t = ktime_add_safe(ktime_get(), *t);
-	else if (cmd != FUTEX_LOCK_PI && !(op & FUTEX_CLOCK_REALTIME))
-		*t = timens_ktime_to_host(CLOCK_MONOTONIC, *t);
-	return 0;
-}
-
 SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 		const struct __kernel_timespec __user *, utime,
 		u32 __user *, uaddr2, u32, val3)
diff --git a/kernel/futex.h b/kernel/futex.h
new file mode 100644
index 000000000000..c914e0080cf1
--- /dev/null
+++ b/kernel/futex.h
@@ -0,0 +1,140 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _FUTEX_H
+#define _FUTEX_H
+
+#include <asm/futex.h>
+
+#include <linux/syscalls.h>
+#include <linux/time_namespace.h>
+
+/*
+ * Futex flags used to encode options to functions and preserve them across
+ * restarts.
+ */
+#ifdef CONFIG_MMU
+# define FLAGS_SHARED		0x01
+#else
+/*
+ * NOMMU does not have per process address space. Let the compiler optimize
+ * code away.
+ */
+# define FLAGS_SHARED		0x00
+#endif
+#define FLAGS_CLOCKRT		0x02
+#define FLAGS_HAS_TIMEOUT	0x04
+
+/**
+ * struct futex_q - The hashed futex queue entry, one per waiting task
+ * @list:		priority-sorted list of tasks waiting on this futex
+ * @task:		the task waiting on the futex
+ * @lock_ptr:		the hash bucket lock
+ * @key:		the key the futex is hashed on
+ * @pi_state:		optional priority inheritance state
+ * @rt_waiter:		rt_waiter storage for use with requeue_pi
+ * @requeue_pi_key:	the requeue_pi target futex key
+ * @bitset:		bitset for the optional bitmasked wakeup
+ *
+ * We use this hashed waitqueue, instead of a normal wait_queue_entry_t, so
+ * we can wake only the relevant ones (hashed queues may be shared).
+ *
+ * A futex_q has a woken state, just like tasks have TASK_RUNNING.
+ * It is considered woken when plist_node_empty(&q->list) || q->lock_ptr == 0.
+ * The order of wakeup is always to make the first condition true, then
+ * the second.
+ *
+ * PI futexes are typically woken before they are removed from the hash list via
+ * the rt_mutex code. See unqueue_me_pi().
+ */
+struct futex_q {
+	struct plist_node list;
+
+	struct task_struct *task;
+	spinlock_t *lock_ptr;
+	union futex_key key;
+	struct futex_pi_state *pi_state;
+	struct rt_mutex_waiter *rt_waiter;
+	union futex_key *requeue_pi_key;
+	u32 bitset;
+	atomic_t requeue_state;
+#ifdef CONFIG_PREEMPT_RT
+	struct rcuwait requeue_wait;
+#endif
+} __randomize_layout;
+
+/*
+ * On PREEMPT_RT, the hash bucket lock is a 'sleeping' spinlock with an
+ * underlying rtmutex. The task which is about to be requeued could have
+ * just woken up (timeout, signal). After the wake up the task has to
+ * acquire hash bucket lock, which is held by the requeue code.  As a task
+ * can only be blocked on _ONE_ rtmutex at a time, the proxy lock blocking
+ * and the hash bucket lock blocking would collide and corrupt state.
+ *
+ * On !PREEMPT_RT this is not a problem and everything could be serialized
+ * on hash bucket lock, but aside of having the benefit of common code,
+ * this allows to avoid doing the requeue when the task is already on the
+ * way out and taking the hash bucket lock of the original uaddr1 when the
+ * requeue has been completed.
+ *
+ * The following state transitions are valid:
+ *
+ * On the waiter side:
+ *   Q_REQUEUE_PI_NONE		-> Q_REQUEUE_PI_IGNORE
+ *   Q_REQUEUE_PI_IN_PROGRESS	-> Q_REQUEUE_PI_WAIT
+ *
+ * On the requeue side:
+ *   Q_REQUEUE_PI_NONE		-> Q_REQUEUE_PI_INPROGRESS
+ *   Q_REQUEUE_PI_IN_PROGRESS	-> Q_REQUEUE_PI_DONE/LOCKED
+ *   Q_REQUEUE_PI_IN_PROGRESS	-> Q_REQUEUE_PI_NONE (requeue failed)
+ *   Q_REQUEUE_PI_WAIT		-> Q_REQUEUE_PI_DONE/LOCKED
+ *   Q_REQUEUE_PI_WAIT		-> Q_REQUEUE_PI_IGNORE (requeue failed)
+ *
+ * The requeue side ignores a waiter with state Q_REQUEUE_PI_IGNORE as this
+ * signals that the waiter is already on the way out. It also means that
+ * the waiter is still on the 'wait' futex, i.e. uaddr1.
+ *
+ * The waiter side signals early wakeup to the requeue side either through
+ * setting state to Q_REQUEUE_PI_IGNORE or to Q_REQUEUE_PI_WAIT depending
+ * on the current state. In case of Q_REQUEUE_PI_IGNORE it can immediately
+ * proceed to take the hash bucket lock of uaddr1. If it set state to WAIT,
+ * which means the wakeup is interleaving with a requeue in progress it has
+ * to wait for the requeue side to change the state. Either to DONE/LOCKED
+ * or to IGNORE. DONE/LOCKED means the waiter q is now on the uaddr2 futex
+ * and either blocked (DONE) or has acquired it (LOCKED). IGNORE is set by
+ * the requeue side when the requeue attempt failed via deadlock detection
+ * and therefore the waiter q is still on the uaddr1 futex.
+ */
+enum {
+	Q_REQUEUE_PI_NONE		=  0,
+	Q_REQUEUE_PI_IGNORE,
+	Q_REQUEUE_PI_IN_PROGRESS,
+	Q_REQUEUE_PI_WAIT,
+	Q_REQUEUE_PI_DONE,
+	Q_REQUEUE_PI_LOCKED,
+};
+
+static const struct futex_q futex_q_init = {
+	/* list gets initialized in queue_me()*/
+	.key            = FUTEX_KEY_INIT,
+	.bitset         = FUTEX_BITSET_MATCH_ANY,
+	.requeue_state  = ATOMIC_INIT(Q_REQUEUE_PI_NONE),
+};
+
+inline struct hrtimer_sleeper *
+futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
+		  int flags, u64 range_ns);
+
+static __always_inline int
+futex_init_timeout(u32 cmd, u32 op, struct timespec64 *ts, ktime_t *t)
+{
+	if (!timespec64_valid(ts))
+		return -EINVAL;
+
+	*t = timespec64_to_ktime(*ts);
+	if (cmd == FUTEX_WAIT)
+		*t = ktime_add_safe(ktime_get(), *t);
+	else if (cmd != FUTEX_LOCK_PI && !(op & FUTEX_CLOCK_REALTIME))
+		*t = timens_ktime_to_host(CLOCK_MONOTONIC, *t);
+	return 0;
+}
+
+#endif
-- 
2.33.0



* [PATCH v3 2/6] futex2: Implement vectorized wait
  2021-09-13 17:52 [PATCH v3 0/6] futex2: Add wait on multiple futexes syscall André Almeida
  2021-09-13 17:52 ` [PATCH v3 1/6] futex: Prepare for futex_wait_multiple() André Almeida
@ 2021-09-13 17:52 ` André Almeida
  2021-09-14  1:03   ` Gabriel Krisman Bertazi
  2021-09-13 17:52 ` [PATCH v3 3/6] futex2: wire up syscall for x86 André Almeida
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: André Almeida @ 2021-09-13 17:52 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior
  Cc: kernel, krisman, linux-api, libc-alpha, mtk.manpages,
	Davidlohr Bueso, Arnd Bergmann, André Almeida

Add support to wait on multiple futexes. This is the interface
implemented by this syscall:

futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes,
	    unsigned int flags, struct timespec *timo)

struct futex_waitv {
	__u64 val;
	__u64 uaddr;
	__u32 flags;
	__u32 __reserved;
};

Given an array of struct futex_waitv, wait on each uaddr. The thread
wakes if a futex_wake() is performed on any uaddr. The syscall returns
immediately if any waiter has *uaddr != val. *timo is an optional
absolute timeout value for the operation. This syscall supports only
64-bit sized timeout structs. The flags argument of the syscall should be
used solely for specifying the timeout clock as realtime, if needed.
Flags for shared futexes, sizes, etc. should be set in the individual
flags of each waiter.
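
Since the timeout is absolute, a caller that thinks in terms of "wait at
most N milliseconds" has to turn that into a deadline first. A minimal
userspace sketch follows; it assumes the default timeout clock is
CLOCK_MONOTONIC (the text above only says realtime is opt-in), and the
helpers timespec_add_ms() and deadline_from_now() are hypothetical names
used for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Return base + ms milliseconds, keeping tv_nsec within [0, 1e9). */
static struct timespec timespec_add_ms(struct timespec base, int64_t ms)
{
	assert(ms >= 0);	/* only forward deadlines are handled here */
	base.tv_sec += ms / 1000;
	base.tv_nsec += (ms % 1000) * 1000000L;
	if (base.tv_nsec >= NSEC_PER_SEC) {
		base.tv_sec += 1;
		base.tv_nsec -= NSEC_PER_SEC;
	}
	return base;
}

/* Build an absolute deadline ms milliseconds from now (assumed clock). */
static int deadline_from_now(struct timespec *out, int64_t ms)
{
	struct timespec now;

	if (clock_gettime(CLOCK_MONOTONIC, &now))
		return -1;
	*out = timespec_add_ms(now, ms);
	return 0;
}
```

The resulting struct timespec is what would be passed as timo. If a wait
is retried after a spurious return, the same deadline can be reused
without recomputing it.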

__reserved is used for explicit padding and should be 0, but it might be
used for future extensions. If userspace uses 32-bit pointers, it should
make sure to explicitly cast them when assigning to waitv::uaddr.

Returns the array index of one of the awakened futexes. No information
is given about how many were awakened, or about any particular attribute
of the returned one (whether it was the first awakened, whether it has
the smallest index...).
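
Because the returned index is only a hint, a caller will usually want to
re-check every futex word after a wakeup. The sketch below shows one way
to do that in userspace. It assumes a protocol where the waker changes
the futex word before calling futex_wake(), which is a convention of the
caller and not something this syscall enforces; first_changed() is a
hypothetical helper name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Scan all futex words, starting at the hinted index and wrapping around.
 * Returns the index of the first word that no longer holds its expected
 * value, or -1 if none changed (for instance after a spurious wakeup).
 */
static int first_changed(_Atomic uint32_t *const words[],
			 const uint32_t expected[], size_t n, size_t hint)
{
	assert(n == 0 || hint < n);

	for (size_t k = 0; k < n; k++) {
		size_t i = (hint + k) % n;

		if (atomic_load(words[i]) != expected[i])
			return (int)i;
	}
	return -1;
}
```

Starting at the hint keeps the common case cheap, while the wrap-around
still covers futexes that were woken at the same time but not reported.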

Signed-off-by: André Almeida <andrealmeid@collabora.com>
---
 MAINTAINERS                       |   3 +-
 include/linux/syscalls.h          |   6 +
 include/uapi/asm-generic/unistd.h |   5 +-
 include/uapi/linux/futex.h        |  25 ++++
 init/Kconfig                      |   7 ++
 kernel/Makefile                   |   1 +
 kernel/futex.c                    | 201 ++++++++++++++++++++++++++++++
 kernel/futex.h                    |  15 +++
 kernel/futex2.c                   | 117 +++++++++++++++++
 kernel/sys_ni.c                   |   3 +
 10 files changed, 381 insertions(+), 2 deletions(-)
 create mode 100644 kernel/futex2.c

diff --git a/MAINTAINERS b/MAINTAINERS
index eeb4c70b3d5b..7b756d96f09f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7718,6 +7718,7 @@ M:	Ingo Molnar <mingo@redhat.com>
 R:	Peter Zijlstra <peterz@infradead.org>
 R:	Darren Hart <dvhart@infradead.org>
 R:	Davidlohr Bueso <dave@stgolabs.net>
+R:	André Almeida <andrealmeid@collabora.com>
 L:	linux-kernel@vger.kernel.org
 S:	Maintained
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/core
@@ -7725,7 +7726,7 @@ F:	Documentation/locking/*futex*
 F:	include/asm-generic/futex.h
 F:	include/linux/futex.h
 F:	include/uapi/linux/futex.h
-F:	kernel/futex.c
+F:	kernel/futex*
 F:	tools/perf/bench/futex*
 F:	tools/testing/selftests/futex/
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 252243c7783d..a30083ec4bd5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -58,6 +58,7 @@ struct mq_attr;
 struct compat_stat;
 struct old_timeval32;
 struct robust_list_head;
+struct futex_waitv;
 struct getcpu_cache;
 struct old_linux_dirent;
 struct perf_event_attr;
@@ -623,6 +624,11 @@ asmlinkage long sys_get_robust_list(int pid,
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 
+/* kernel/futex2.c */
+asmlinkage long sys_futex_waitv(struct futex_waitv __user *waiters,
+				unsigned int nr_futexes, unsigned int flags,
+				struct __kernel_timespec __user *timo);
+
 /* kernel/hrtimer.c */
 asmlinkage long sys_nanosleep(struct __kernel_timespec __user *rqtp,
 			      struct __kernel_timespec __user *rmtp);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1c5fb86d455a..ebafbb27cc41 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -880,8 +880,11 @@ __SYSCALL(__NR_memfd_secret, sys_memfd_secret)
 #define __NR_process_mrelease 448
 __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 
+#define __NR_futex_waitv 449
+__SC_COMP(__NR_futex_waitv, sys_futex_waitv)
+
 #undef __NR_syscalls
-#define __NR_syscalls 449
+#define __NR_syscalls 450
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 235e5b2facaa..71a5df8d2689 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -43,6 +43,31 @@
 #define FUTEX_CMP_REQUEUE_PI_PRIVATE	(FUTEX_CMP_REQUEUE_PI | \
 					 FUTEX_PRIVATE_FLAG)
 
+/*
+ * Flags to specify the bit length of the futex word for futex2 syscalls.
+ * Currently, only 32 is supported.
+ */
+#define FUTEX_32		2
+
+/*
+ * Max numbers of elements in a futex_waitv array
+ */
+#define FUTEX_WAITV_MAX		128
+
+/**
+ * struct futex_waitv - A waiter for vectorized wait
+ * @val:	Expected value at uaddr
+ * @uaddr:	User address to wait on
+ * @flags:	Flags for this waiter
+ * @__reserved:	Reserved member to preserve data alignment. Should be 0.
+ */
+struct futex_waitv {
+	__u64 val;
+	__u64 uaddr;
+	__u32 flags;
+	__u32 __reserved;
+};
+
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
  * thread exit time.
diff --git a/init/Kconfig b/init/Kconfig
index 11f8a845f259..a5c9300f9000 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1581,6 +1581,13 @@ config FUTEX
 	  support for "fast userspace mutexes".  The resulting kernel may not
 	  run glibc-based applications correctly.
 
+config FUTEX2
+	bool "Enable futex2 support" if EXPERT
+	depends on FUTEX
+	default y
+	help
+	  Support for futex2 interface.
+
 config FUTEX_PI
 	bool
 	depends on FUTEX && RT_MUTEXES
diff --git a/kernel/Makefile b/kernel/Makefile
index 4df609be42d0..1eaf2af50283 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_PROFILING) += profile.o
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
 obj-y += time/
 obj-$(CONFIG_FUTEX) += futex.o
+obj-$(CONFIG_FUTEX2) += futex2.o
 obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
 obj-$(CONFIG_SMP) += smp.o
 ifneq ($(CONFIG_SMP),y)
diff --git a/kernel/futex.c b/kernel/futex.c
index 32c91f9d7385..858465f97d9b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2739,6 +2739,207 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
 	__set_current_state(TASK_RUNNING);
 }
 
+/**
+ * unqueue_multiple - Remove various futexes from their hash bucket
+ * @v:	   The list of futexes to unqueue
+ * @count: Number of futexes in the list
+ *
+ * Helper to unqueue a list of futexes. This can't fail.
+ *
+ * Return:
+ *  - >=0 - Index of the last futex that was awoken;
+ *  - -1  - No futex was awoken
+ */
+static int unqueue_multiple(struct futex_vector *v, int count)
+{
+	int ret = -1, i;
+
+	for (i = 0; i < count; i++) {
+		if (!unqueue_me(&v[i].q))
+			ret = i;
+	}
+
+	return ret;
+}
+
+/**
+ * futex_wait_multiple_setup - Prepare to wait and enqueue multiple futexes
+ * @vs:		The futex list to wait on
+ * @count:	The size of the list
+ * @awaken:	Index of the last awoken futex, if any. Used to notify the
+ *		caller that it can return this index to userspace (return parameter)
+ *
+ * Prepare multiple futexes in a single step and enqueue them. This may fail if
+ * the futex list is invalid or if any futex was already awoken. On success the
+ * task is ready for interruptible sleep.
+ *
+ * Return:
+ *  -  1 - One of the futexes was woken by another thread
+ *  -  0 - Success
+ *  - <0 - -EFAULT, -EWOULDBLOCK or -EINVAL
+ */
+static int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *awaken)
+{
+	struct futex_hash_bucket *hb;
+	bool retry = false;
+	int ret, i;
+	u32 uval;
+
+	/*
+	 * Enqueuing multiple futexes is tricky, because we need to enqueue
+	 * each futex in the list before dealing with the next one to avoid
+	 * deadlocking on the hash bucket. But, before enqueuing, we need to
+	 * make sure that current->state is TASK_INTERRUPTIBLE, so we don't
+	 * absorb any awake events, which cannot be done before the
+	 * get_futex_key of the next key, because it calls get_user_pages,
+	 * which can sleep. Thus, we fetch the list of futexes keys in two
+	 * steps, by first pinning all the memory keys in the futex key, and
+	 * only then we read each key and queue the corresponding futex.
+	 *
+	 * Private futexes don't need to recalculate the hash on retry, so skip
+	 * get_futex_key() when retrying.
+	 */
+retry:
+	for (i = 0; i < count; i++) {
+		if ((vs[i].w.flags & FUTEX_PRIVATE_FLAG) && retry)
+			continue;
+
+		ret = get_futex_key(u64_to_user_ptr(vs[i].w.uaddr),
+				    !(vs[i].w.flags & FUTEX_PRIVATE_FLAG),
+				    &vs[i].q.key, FUTEX_READ);
+
+		if (unlikely(ret))
+			return ret;
+	}
+
+	set_current_state(TASK_INTERRUPTIBLE);
+
+	for (i = 0; i < count; i++) {
+		u32 __user *uaddr = u64_to_user_ptr(vs[i].w.uaddr);
+		struct futex_q *q = &vs[i].q;
+		u32 val = (u32)vs[i].w.val;
+
+		hb = queue_lock(q);
+		ret = get_futex_value_locked(&uval, uaddr);
+
+		if (!ret && uval == val) {
+			/*
+			 * The bucket lock can't be held while dealing with the
+			 * next futex. Queue each futex at this moment so hb can
+			 * be unlocked.
+			 */
+			queue_me(q, hb);
+			continue;
+		}
+
+		queue_unlock(hb);
+		__set_current_state(TASK_RUNNING);
+
+		/*
+		 * Even if something went wrong, if we find out that a futex
+		 * was woken, we don't return error and return this index to
+		 * userspace
+		 */
+		*awaken = unqueue_multiple(vs, i);
+		if (*awaken >= 0)
+			return 1;
+
+		if (ret) {
+			/*
+			 * If we need to handle a page fault, we need to do so
+			 * without any lock and any enqueued futex (otherwise
+			 * we could lose some wakeup). So we do it here, after
+			 * undoing all the work done so far. On success, we
+			 * retry all the work.
+			 */
+			if (get_user(uval, uaddr))
+				return -EFAULT;
+
+			retry = true;
+			goto retry;
+		}
+
+		if (uval != val)
+			return -EWOULDBLOCK;
+	}
+
+	return 0;
+}
+
+/**
+ * futex_sleep_multiple - Check sleeping conditions and sleep
+ * @vs:    List of futexes to wait for
+ * @count: Length of vs
+ * @to:    Timeout
+ *
+ * Sleep if and only if the timeout hasn't expired and no futex on the list has
+ * been woken.
+ */
+static void futex_sleep_multiple(struct futex_vector *vs, unsigned int count,
+				 struct hrtimer_sleeper *to)
+{
+	if (to && !to->task)
+		return;
+
+	for (; count; count--, vs++) {
+		if (!READ_ONCE(vs->q.lock_ptr))
+			return;
+	}
+
+	freezable_schedule();
+}
+
+/**
+ * futex_wait_multiple - Prepare to wait on and enqueue several futexes
+ * @vs:		The list of futexes to wait on
+ * @count:	The number of objects
+ * @to:		Timeout before giving up and returning to userspace
+ *
+ * Entry point for the FUTEX_WAIT_MULTIPLE futex operation, this function
+ * sleeps on a group of futexes and returns on the first futex that is
+ * woken, or after the timeout has elapsed.
+ *
+ * Return:
+ *  - >=0 - Hint to the futex that was awoken
+ *  - <0  - On error
+ */
+int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
+			struct hrtimer_sleeper *to)
+{
+	int ret, hint = 0;
+
+	if (to)
+		hrtimer_sleeper_start_expires(to, HRTIMER_MODE_ABS);
+
+	while (1) {
+		ret = futex_wait_multiple_setup(vs, count, &hint);
+		if (ret) {
+			if (ret > 0) {
+				/* A futex was woken during setup */
+				ret = hint;
+			}
+			return ret;
+		}
+
+		futex_sleep_multiple(vs, count, to);
+
+		__set_current_state(TASK_RUNNING);
+
+		ret = unqueue_multiple(vs, count);
+		if (ret >= 0)
+			return ret;
+
+		if (to && !to->task)
+			return -ETIMEDOUT;
+		else if (signal_pending(current))
+			return -ERESTARTSYS;
+		/*
+		 * The final case is a spurious wakeup, for
+		 * which just retry.
+		 */
+	}
+}
+
 /**
  * futex_wait_setup() - Prepare to wait on a futex
  * @uaddr:	the futex userspace address
diff --git a/kernel/futex.h b/kernel/futex.h
index c914e0080cf1..bcd0142c3f6e 100644
--- a/kernel/futex.h
+++ b/kernel/futex.h
@@ -137,4 +137,19 @@ futex_init_timeout(u32 cmd, u32 op, struct timespec64 *ts, ktime_t *t)
 	return 0;
 }
 
+/**
+ * struct futex_vector - Auxiliary struct for futex_waitv()
+ * @w: Userspace provided data
+ * @q: Kernel side data
+ *
+ * Struct used to build an array with all data needed for futex_waitv()
+ */
+struct futex_vector {
+	struct futex_waitv w;
+	struct futex_q q;
+};
+
+int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
+			struct hrtimer_sleeper *to);
+
 #endif
diff --git a/kernel/futex2.c b/kernel/futex2.c
new file mode 100644
index 000000000000..f724ecf40f3e
--- /dev/null
+++ b/kernel/futex2.c
@@ -0,0 +1,117 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * futex2 interface system calls
+ *
+ * futex_waitv by André Almeida <andrealmeid@collabora.com>
+ *
+ * Copyright 2021 Collabora Ltd.
+ */
+
+#include "futex.h"
+
+/* Mask of available flags for each futex in futex_waitv list */
+#define FUTEXV_WAITER_MASK (FUTEX_32 | FUTEX_PRIVATE_FLAG)
+
+/* Mask of available flags for sys_futex_waitv flag */
+#define FUTEXV_MASK (FUTEX_CLOCK_REALTIME)
+
+/**
+ * futex_parse_waitv - Parse a waitv array from userspace
+ * @futexv:	Kernel side list of waiters to be filled
+ * @uwaitv:     Userspace list to be parsed
+ * @nr_futexes: Length of futexv
+ *
+ * Return: Error code on failure, 0 on success
+ */
+static int futex_parse_waitv(struct futex_vector *futexv,
+			     struct futex_waitv __user *uwaitv,
+			     unsigned int nr_futexes)
+{
+	struct futex_waitv aux;
+	unsigned int i;
+
+	for (i = 0; i < nr_futexes; i++) {
+		if (copy_from_user(&aux, &uwaitv[i], sizeof(aux)))
+			return -EFAULT;
+
+		if ((aux.flags & ~FUTEXV_WAITER_MASK) || aux.__reserved)
+			return -EINVAL;
+
+		futexv[i].w.flags = aux.flags;
+		futexv[i].w.val = aux.val;
+		futexv[i].w.uaddr = aux.uaddr;
+		futexv[i].q = futex_q_init;
+	}
+
+	return 0;
+}
+
+/**
+ * sys_futex_waitv - Wait on a list of futexes
+ * @waiters:    List of futexes to wait on
+ * @nr_futexes: Length of waiters
+ * @flags:      Flag for timeout (monotonic/realtime)
+ * @timo:	Optional absolute timeout.
+ *
+ * Given an array of `struct futex_waitv`, wait on each uaddr. The thread wakes
+ * if a futex_wake() is performed at any uaddr. The syscall returns immediately
+ * if any waiter has *uaddr != val. *timo is an optional timeout value for the
+ * operation. Each waiter has individual flags. The `flags` argument for the
+ * syscall should be used solely for specifying the timeout as realtime, if
+ * needed. Flags for shared futexes, sizes, etc. should be used on the
+ * individual flags of each waiter.
+ *
+ * Returns the array index of one of the awoken futexes. No information is
+ * given about how many were awoken, or about any particular attribute of the
+ * returned one (whether it was the first awoken, has the smallest index...).
+ */
+
+SYSCALL_DEFINE4(futex_waitv, struct futex_waitv __user *, waiters,
+		unsigned int, nr_futexes, unsigned int, flags,
+		struct __kernel_timespec __user *, timo)
+{
+	struct hrtimer_sleeper to;
+	struct futex_vector *futexv;
+	struct timespec64 ts;
+	ktime_t time;
+	int ret;
+
+	if (flags & ~FUTEXV_MASK)
+		return -EINVAL;
+
+	if (!nr_futexes || nr_futexes > FUTEX_WAITV_MAX || !waiters)
+		return -EINVAL;
+
+	if (timo) {
+		int flag_clkid = (flags & FUTEX_CLOCK_REALTIME) ? FLAGS_CLOCKRT : 0;
+
+		if (get_timespec64(&ts, timo))
+			return -EFAULT;
+
+		/*
+		 * Since there's no opcode for futex_waitv, use
+		 * FUTEX_WAIT_BITSET that uses absolute timeout as well
+		 */
+		ret = futex_init_timeout(FUTEX_WAIT_BITSET, flags, &ts, &time);
+		if (ret)
+			return ret;
+
+		futex_setup_timer(&time, &to, flag_clkid, 0);
+	}
+
+	futexv = kcalloc(nr_futexes, sizeof(*futexv), GFP_KERNEL);
+	if (!futexv)
+		return -ENOMEM;
+
+	ret = futex_parse_waitv(futexv, waiters, nr_futexes);
+	if (!ret)
+		ret = futex_wait_multiple(futexv, nr_futexes, timo ? &to : NULL);
+
+	if (timo) {
+		hrtimer_cancel(&to.timer);
+		destroy_hrtimer_on_stack(&to.timer);
+	}
+
+	kfree(futexv);
+	return ret;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index f43d89d92860..3d0b94f6b88d 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -151,6 +151,9 @@ COND_SYSCALL_COMPAT(set_robust_list);
 COND_SYSCALL(get_robust_list);
 COND_SYSCALL_COMPAT(get_robust_list);
 
+/* kernel/futex2.c */
+COND_SYSCALL(futex_waitv);
+
 /* kernel/hrtimer.c */
 
 /* kernel/itimer.c */
-- 
2.33.0



* [PATCH v3 3/6] futex2: wire up syscall for x86
  2021-09-13 17:52 [PATCH v3 0/6] futex2: Add wait on multiple futexes syscall André Almeida
  2021-09-13 17:52 ` [PATCH v3 1/6] futex: Prepare for futex_wait_multiple() André Almeida
  2021-09-13 17:52 ` [PATCH v3 2/6] futex2: Implement vectorized wait André Almeida
@ 2021-09-13 17:52 ` André Almeida
  2021-09-13 17:52 ` [PATCH v3 4/6] futex2: wire up syscall for ARM André Almeida
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: André Almeida @ 2021-09-13 17:52 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior
  Cc: kernel, krisman, linux-api, libc-alpha, mtk.manpages,
	Davidlohr Bueso, Arnd Bergmann, André Almeida

Wire up syscall entry point for x86 arch, for both i386 and x86_64.

Signed-off-by: André Almeida <andrealmeid@collabora.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 960a021d543e..7e25543693de 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -453,3 +453,4 @@
 446	i386	landlock_restrict_self	sys_landlock_restrict_self
 447	i386	memfd_secret		sys_memfd_secret
 448	i386	process_mrelease	sys_process_mrelease
+449	i386	futex_waitv		sys_futex_waitv
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 18b5500ea8bf..fe8f8dd157b4 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -370,6 +370,7 @@
 446	common	landlock_restrict_self	sys_landlock_restrict_self
 447	common	memfd_secret		sys_memfd_secret
 448	common	process_mrelease	sys_process_mrelease
+449	common	futex_waitv		sys_futex_waitv
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
-- 
2.33.0



* [PATCH v3 4/6] futex2: wire up syscall for ARM
  2021-09-13 17:52 [PATCH v3 0/6] futex2: Add wait on multiple futexes syscall André Almeida
                   ` (2 preceding siblings ...)
  2021-09-13 17:52 ` [PATCH v3 3/6] futex2: wire up syscall for x86 André Almeida
@ 2021-09-13 17:52 ` André Almeida
  2021-09-13 17:52 ` [PATCH v3 5/6] selftests: futex2: Add waitv test André Almeida
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: André Almeida @ 2021-09-13 17:52 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior
  Cc: kernel, krisman, linux-api, libc-alpha, mtk.manpages,
	Davidlohr Bueso, Arnd Bergmann, André Almeida

Wire up syscall entry point for ARM architectures, for both 32 and 64-bit.

Signed-off-by: André Almeida <andrealmeid@collabora.com>
---
 arch/arm/tools/syscall.tbl        | 1 +
 arch/arm64/include/asm/unistd.h   | 2 +-
 arch/arm64/include/asm/unistd32.h | 2 ++
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index e842209e135d..543100151f2b 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -462,3 +462,4 @@
 446	common	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
+449	common	futex_waitv			sys_futex_waitv
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 3cb206aea3db..6bdb5f5db438 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		449
+#define __NR_compat_syscalls		450
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 844f6ae58662..41ea1195e44b 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -903,6 +903,8 @@ __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
 __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
 #define __NR_process_mrelease 448
 __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
+#define __NR_futex_waitv 449
+__SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 
 /*
  * Please add new compat syscalls above this comment and update
-- 
2.33.0



* [PATCH v3 5/6] selftests: futex2: Add waitv test
  2021-09-13 17:52 [PATCH v3 0/6] futex2: Add wait on multiple futexes syscall André Almeida
                   ` (3 preceding siblings ...)
  2021-09-13 17:52 ` [PATCH v3 4/6] futex2: wire up syscall for ARM André Almeida
@ 2021-09-13 17:52 ` André Almeida
  2021-09-14  1:11   ` Gabriel Krisman Bertazi
  2021-09-13 17:52 ` [PATCH v3 6/6] selftests: futex2: Test futex_waitv timeout André Almeida
  2021-09-14  1:05 ` [PATCH v3 0/6] futex2: Add wait on multiple futexes syscall Gabriel Krisman Bertazi
  6 siblings, 1 reply; 17+ messages in thread
From: André Almeida @ 2021-09-13 17:52 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior
  Cc: kernel, krisman, linux-api, libc-alpha, mtk.manpages,
	Davidlohr Bueso, Arnd Bergmann, André Almeida

Create a new file to test the waitv mechanism. Test both private and
shared futexes. Wake the last futex in the array, and check if the
return value from futex_waitv() is the right index.
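
As an illustration of the setup this test performs (a sketch, not the
selftest code), the waiter array can be filled as below. The struct is a
local userspace mirror of the proposed uapi struct, and the names
futex_waitv_sketch, fill_private_waiters and the SKETCH_* constants are
hypothetical. The value 2 for FUTEX_32 is the one proposed in this series,
and 128 is assumed to match FUTEX_PRIVATE_FLAG in <linux/futex.h>. No
syscall is issued here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed flag values, see the note above; illustration only. */
#define SKETCH_FUTEX_32			2
#define SKETCH_FUTEX_PRIVATE_FLAG	128

/* Local mirror of the proposed struct futex_waitv. */
struct futex_waitv_sketch {
	uint64_t val;
	uint64_t uaddr;
	uint32_t flags;
	uint32_t reserved;	/* mirrors __reserved, must be 0 */
};

/*
 * Describe n private 32-bit futex words. Each waiter expects the value
 * currently stored in its word, so a later wait would block until woken.
 */
static void fill_private_waiters(struct futex_waitv_sketch *w,
				 uint32_t *words, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		memset(&w[i], 0, sizeof(w[i]));
		/* Cast through uintptr_t so 32-bit pointers widen cleanly. */
		w[i].uaddr = (uintptr_t)&words[i];
		w[i].val = words[i];
		w[i].flags = SKETCH_FUTEX_32 | SKETCH_FUTEX_PRIVATE_FLAG;
	}
}
```

A shared variant would only drop SKETCH_FUTEX_PRIVATE_FLAG from the flags,
which is what the second half of the test does.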

Signed-off-by: André Almeida <andrealmeid@collabora.com>
---
 .../selftests/futex/functional/.gitignore     |   1 +
 .../selftests/futex/functional/Makefile       |   3 +-
 .../selftests/futex/functional/futex_waitv.c  | 158 ++++++++++++++++++
 .../testing/selftests/futex/functional/run.sh |   3 +
 .../selftests/futex/include/futex2test.h      |  31 ++++
 5 files changed, 195 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/futex/functional/futex_waitv.c
 create mode 100644 tools/testing/selftests/futex/include/futex2test.h

diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore
index 0e78b49d0f2f..fbcbdb6963b3 100644
--- a/tools/testing/selftests/futex/functional/.gitignore
+++ b/tools/testing/selftests/futex/functional/.gitignore
@@ -8,3 +8,4 @@ futex_wait_uninitialized_heap
 futex_wait_wouldblock
 futex_wait
 futex_requeue
+futex_waitv
diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
index bd1fec59e010..5cc38de9d8ea 100644
--- a/tools/testing/selftests/futex/functional/Makefile
+++ b/tools/testing/selftests/futex/functional/Makefile
@@ -17,7 +17,8 @@ TEST_GEN_FILES := \
 	futex_wait_uninitialized_heap \
 	futex_wait_private_mapped_file \
 	futex_wait \
-	futex_requeue
+	futex_requeue \
+	futex_waitv
 
 TEST_PROGS := run.sh
 
diff --git a/tools/testing/selftests/futex/functional/futex_waitv.c b/tools/testing/selftests/futex/functional/futex_waitv.c
new file mode 100644
index 000000000000..567667dfa7cf
--- /dev/null
+++ b/tools/testing/selftests/futex/functional/futex_waitv.c
@@ -0,0 +1,158 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/******************************************************************************
+ *
+ *   Copyright Collabora Ltd., 2021
+ *
+ * DESCRIPTION
+ *	Test waitv/wake mechanism of futex2, using 32bit sized futexes.
+ *
+ * AUTHOR
+ *	André Almeida <andrealmeid@collabora.com>
+ *
+ * HISTORY
+ *      2021-Feb-5: Initial version by André <andrealmeid@collabora.com>
+ *
+ *****************************************************************************/
+
+#include <errno.h>
+#include <error.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include <pthread.h>
+#include <stdint.h>
+#include <sys/shm.h>
+#include "futextest.h"
+#include "futex2test.h"
+#include "logging.h"
+
+#define TEST_NAME "futex-wait"
+#define WAKE_WAIT_US 10000
+#define NR_FUTEXES 30
+static struct futex_waitv waitv[NR_FUTEXES];
+u_int32_t futexes[NR_FUTEXES] = {0};
+
+void usage(char *prog)
+{
+	printf("Usage: %s\n", prog);
+	printf("  -c	Use color\n");
+	printf("  -h	Display this help message\n");
+	printf("  -v L	Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+	       VQUIET, VCRITICAL, VINFO);
+}
+
+void *waiterfn(void *arg)
+{
+	struct timespec to;
+	int res;
+
+	/* setting absolute timeout for futex2 */
+	if (clock_gettime(CLOCK_MONOTONIC, &to))
+		error("gettime64 failed\n", errno);
+
+	to.tv_sec++;
+
+	res = futex_waitv(waitv, NR_FUTEXES, 0, &to);
+	if (res < 0) {
+		ksft_test_result_fail("futex_waitv returned: %d %s\n",
+				      errno, strerror(errno));
+	} else if (res != NR_FUTEXES - 1) {
+		ksft_test_result_fail("futex_waitv returned: %d, expecting %d\n",
+				      res, NR_FUTEXES - 1);
+	}
+
+	return NULL;
+}
+
+int main(int argc, char *argv[])
+{
+	pthread_t waiter;
+	int res, ret = RET_PASS;
+	int c, i;
+
+	while ((c = getopt(argc, argv, "cht:v:")) != -1) {
+		switch (c) {
+		case 'c':
+			log_color(1);
+			break;
+		case 'h':
+			usage(basename(argv[0]));
+			exit(0);
+		case 'v':
+			log_verbosity(atoi(optarg));
+			break;
+		default:
+			usage(basename(argv[0]));
+			exit(1);
+		}
+	}
+
+	ksft_print_header();
+	ksft_set_plan(2);
+	ksft_print_msg("%s: Test FUTEX_WAITV\n",
+		       basename(argv[0]));
+
+	for (i = 0; i < NR_FUTEXES; i++) {
+		waitv[i].uaddr = (uintptr_t)&futexes[i];
+		waitv[i].flags = FUTEX_32 | FUTEX_PRIVATE_FLAG;
+		waitv[i].val = 0;
+		waitv[i].__reserved = 0;
+	}
+
+	/* Private waitv */
+	if (pthread_create(&waiter, NULL, waiterfn, NULL))
+		error("pthread_create failed\n", errno);
+
+	usleep(WAKE_WAIT_US);
+
+	res = futex_wake(u64_to_ptr(waitv[NR_FUTEXES - 1].uaddr), 1, FUTEX_PRIVATE_FLAG);
+	if (res != 1) {
+		ksft_test_result_fail("futex_wake private returned: %d %s\n",
+				      res ? errno : res,
+				      res ? strerror(errno) : "");
+		ret = RET_FAIL;
+	} else {
+		ksft_test_result_pass("futex_waitv private\n");
+	}
+
+	/* Shared waitv */
+	for (i = 0; i < NR_FUTEXES; i++) {
+		int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666);
+
+		if (shm_id < 0) {
+			perror("shmget");
+			exit(1);
+		}
+
+		unsigned int *shared_data = shmat(shm_id, NULL, 0);
+
+		*shared_data = 0;
+		waitv[i].uaddr = (uintptr_t)shared_data;
+		waitv[i].flags = FUTEX_32;
+		waitv[i].val = 0;
+		waitv[i].__reserved = 0;
+	}
+
+	if (pthread_create(&waiter, NULL, waiterfn, NULL))
+		error("pthread_create failed\n", errno);
+
+	usleep(WAKE_WAIT_US);
+
+	res = futex_wake(u64_to_ptr(waitv[NR_FUTEXES - 1].uaddr), 1, 0);
+	if (res != 1) {
+		ksft_test_result_fail("futex_wake shared returned: %d %s\n",
+				      res ? errno : res,
+				      res ? strerror(errno) : "");
+		ret = RET_FAIL;
+	} else {
+		ksft_test_result_pass("futex_waitv shared\n");
+	}
+
+	for (i = 0; i < NR_FUTEXES; i++)
+		shmdt(u64_to_ptr(waitv[i].uaddr));
+
+	ksft_print_cnts();
+	return ret;
+}
diff --git a/tools/testing/selftests/futex/functional/run.sh b/tools/testing/selftests/futex/functional/run.sh
index 11a9d62290f5..5ccd599da6c3 100755
--- a/tools/testing/selftests/futex/functional/run.sh
+++ b/tools/testing/selftests/futex/functional/run.sh
@@ -79,3 +79,6 @@ echo
 
 echo
 ./futex_requeue $COLOR
+
+echo
+./futex_waitv $COLOR
diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h
new file mode 100644
index 000000000000..51a3b9356c9b
--- /dev/null
+++ b/tools/testing/selftests/futex/include/futex2test.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/******************************************************************************
+ *
+ *   Copyright Collabora Ltd., 2021
+ *
+ * DESCRIPTION
+ *	Futex2 library addons for futex tests
+ *
+ * AUTHOR
+ *	André Almeida <andrealmeid@collabora.com>
+ *
+ * HISTORY
+ *      2021-Feb-5: Initial version by André <andrealmeid@collabora.com>
+ *
+ *****************************************************************************/
+#include <stdint.h>
+
+#define u64_to_ptr(x) ((void *)(uintptr_t)(x))
+
+/**
+ * futex_waitv - Wait at multiple futexes, wake on any
+ * @waiters:    Array of waiters
+ * @nr_waiters: Length of waiters array
+ * @flags: Operation flags
+ * @timo:  Optional timeout for operation
+ */
+static inline int futex_waitv(volatile struct futex_waitv *waiters, unsigned long nr_waiters,
+			      unsigned long flags, struct timespec *timo)
+{
+	return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo);
+}
-- 
2.33.0



* [PATCH v3 6/6] selftests: futex2: Test futex_waitv timeout
  2021-09-13 17:52 [PATCH v3 0/6] futex2: Add wait on multiple futexes syscall André Almeida
                   ` (4 preceding siblings ...)
  2021-09-13 17:52 ` [PATCH v3 5/6] selftests: futex2: Add waitv test André Almeida
@ 2021-09-13 17:52 ` André Almeida
  2021-09-14  1:05 ` [PATCH v3 0/6] futex2: Add wait on multiple futexes syscall Gabriel Krisman Bertazi
  6 siblings, 0 replies; 17+ messages in thread
From: André Almeida @ 2021-09-13 17:52 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior
  Cc: kernel, krisman, linux-api, libc-alpha, mtk.manpages,
	Davidlohr Bueso, Arnd Bergmann, André Almeida

Test if the futex_waitv timeout is working as expected, using the
supported clockid options.
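
Since futex_waitv() takes an absolute timeout, the caller has to turn a
relative duration into a point in time on the chosen clock. A minimal
sketch of that conversion follows. abs_timeout_after is a hypothetical
standalone helper written for illustration, not the selftests'
futex_get_abs_timeout(), and it makes no futex call itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define SKETCH_NSEC_PER_SEC 1000000000L

/*
 * Hypothetical helper: store "now on clockid + rel_ns" in *to.
 * Returns 0 on success, -1 if rel_ns is negative or the clock read fails.
 */
static int abs_timeout_after(clockid_t clockid, int64_t rel_ns,
			     struct timespec *to)
{
	if (rel_ns < 0 || clock_gettime(clockid, to))
		return -1;

	to->tv_sec += rel_ns / SKETCH_NSEC_PER_SEC;
	to->tv_nsec += rel_ns % SKETCH_NSEC_PER_SEC;

	/* Keep tv_nsec normalized to [0, 1e9). */
	if (to->tv_nsec >= SKETCH_NSEC_PER_SEC) {
		to->tv_sec++;
		to->tv_nsec -= SKETCH_NSEC_PER_SEC;
	}

	return 0;
}
```

The clock passed here has to match the one selected through the syscall
flags: CLOCK_MONOTONIC when flags is 0, CLOCK_REALTIME when
FUTEX_CLOCK_REALTIME is set, as in the two cases added by this patch.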

Signed-off-by: André Almeida <andrealmeid@collabora.com>
---
 .../futex/functional/futex_wait_timeout.c     | 21 ++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/futex/functional/futex_wait_timeout.c b/tools/testing/selftests/futex/functional/futex_wait_timeout.c
index 1f8f6daaf1e7..4cdada0fcb81 100644
--- a/tools/testing/selftests/futex/functional/futex_wait_timeout.c
+++ b/tools/testing/selftests/futex/functional/futex_wait_timeout.c
@@ -17,6 +17,7 @@
 
 #include <pthread.h>
 #include "futextest.h"
+#include "futex2test.h"
 #include "logging.h"
 
 #define TEST_NAME "futex-wait-timeout"
@@ -96,6 +97,12 @@ int main(int argc, char *argv[])
 	struct timespec to;
 	pthread_t thread;
 	int c;
+	struct futex_waitv waitv = {
+			.uaddr = (uintptr_t)&f1,
+			.val = f1,
+			.flags = FUTEX_32,
+			.__reserved = 0
+		};
 
 	while ((c = getopt(argc, argv, "cht:v:")) != -1) {
 		switch (c) {
@@ -118,7 +125,7 @@ int main(int argc, char *argv[])
 	}
 
 	ksft_print_header();
-	ksft_set_plan(7);
+	ksft_set_plan(9);
 	ksft_print_msg("%s: Block on a futex and wait for timeout\n",
 	       basename(argv[0]));
 	ksft_print_msg("\tArguments: timeout=%ldns\n", timeout_ns);
@@ -175,6 +182,18 @@ int main(int argc, char *argv[])
 	res = futex_lock_pi(&futex_pi, NULL, 0, FUTEX_CLOCK_REALTIME);
 	test_timeout(res, &ret, "futex_lock_pi invalid timeout flag", ENOSYS);
 
+	/* futex_waitv with CLOCK_MONOTONIC */
+	if (futex_get_abs_timeout(CLOCK_MONOTONIC, &to, timeout_ns))
+		return RET_FAIL;
+	res = futex_waitv(&waitv, 1, 0, &to);
+	test_timeout(res, &ret, "futex_waitv monotonic", ETIMEDOUT);
+
+	/* futex_waitv with CLOCK_REALTIME */
+	if (futex_get_abs_timeout(CLOCK_REALTIME, &to, timeout_ns))
+		return RET_FAIL;
+	res = futex_waitv(&waitv, 1, FUTEX_CLOCK_REALTIME, &to);
+	test_timeout(res, &ret, "futex_waitv realtime", ETIMEDOUT);
+
 	ksft_print_cnts();
 	return ret;
 }
-- 
2.33.0



* Re: [PATCH v3 2/6] futex2: Implement vectorized wait
  2021-09-13 17:52 ` [PATCH v3 2/6] futex2: Implement vectorized wait André Almeida
@ 2021-09-14  1:03   ` Gabriel Krisman Bertazi
  2021-09-14 17:18     ` André Almeida
  0 siblings, 1 reply; 17+ messages in thread
From: Gabriel Krisman Bertazi @ 2021-09-14  1:03 UTC (permalink / raw)
  To: André Almeida
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior, kernel,
	linux-api, libc-alpha, mtk.manpages, Davidlohr Bueso,
	Arnd Bergmann

André Almeida <andrealmeid@collabora.com> writes:

> Add support to wait on multiple futexes. This is the interface
> implemented by this syscall:
>
> futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes,
> 	    unsigned int flags, struct timespec *timo)
>
> struct futex_waitv {
> 	__u64 val;
> 	__u64 uaddr;
> 	__u32 flags;
> 	__u32 __reserved;
> };
>
> Given an array of struct futex_waitv, wait on each uaddr. The thread
> wakes if a futex_wake() is performed at any uaddr. The syscall returns
> immediately if any waiter has *uaddr != val. *timo is an optional
> absolute timeout value for the operation. This syscall supports only
> 64bit sized timeout structs. The flags argument of the syscall should be
> used solely for specifying the timeout clock as realtime, if needed.
> Flags for shared futexes, sizes, etc. should be used on the individual
> flags of each waiter.
>
> __reserved is used for explicit padding and should be 0, but it might be
> used for future extensions. If the userspace uses 32-bit pointers, it
> should make sure to explicitly cast it when assigning to waitv::uaddr.
>
> Returns the array index of one of the awakened futexes. There’s no given
> information of how many were awakened, or any particular attribute of it
> (if it’s the first awakened, if it is of the smaller index...).
>
> Signed-off-by: André Almeida <andrealmeid@collabora.com>
> ---
>  MAINTAINERS                       |   3 +-
>  include/linux/syscalls.h          |   6 +
>  include/uapi/asm-generic/unistd.h |   5 +-
>  include/uapi/linux/futex.h        |  25 ++++
>  init/Kconfig                      |   7 ++
>  kernel/Makefile                   |   1 +
>  kernel/futex.c                    | 201 ++++++++++++++++++++++++++++++
>  kernel/futex.h                    |  15 +++
>  kernel/futex2.c                   | 117 +++++++++++++++++

Hi,

If you were to keep this implementation inside futex.c, your patchset
would be much simpler, and patch 1 would immediately disappear.  Since we
are just talking about a multiple wait operation and the code is tiny, I
don't see why it couldn't go inside futex.c.


>  kernel/sys_ni.c                   |   3 +
>  10 files changed, 381 insertions(+), 2 deletions(-)
>  create mode 100644 kernel/futex2.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index eeb4c70b3d5b..7b756d96f09f 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7718,6 +7718,7 @@ M:	Ingo Molnar <mingo@redhat.com>
>  R:	Peter Zijlstra <peterz@infradead.org>
>  R:	Darren Hart <dvhart@infradead.org>
>  R:	Davidlohr Bueso <dave@stgolabs.net>
> +R:	André Almeida <andrealmeid@collabora.com>
>  L:	linux-kernel@vger.kernel.org
>  S:	Maintained
>  T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
> locking/core

This goes in a separate commit.

> @@ -7725,7 +7726,7 @@ F:	Documentation/locking/*futex*
>  F:	include/asm-generic/futex.h
>  F:	include/linux/futex.h
>  F:	include/uapi/linux/futex.h
> -F:	kernel/futex.c
> +F:	kernel/futex*
>  F:	tools/perf/bench/futex*
>  F:	tools/testing/selftests/futex/
>  
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 252243c7783d..a30083ec4bd5 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -58,6 +58,7 @@ struct mq_attr;
>  struct compat_stat;
>  struct old_timeval32;
>  struct robust_list_head;
> +struct futex_waitv;
>  struct getcpu_cache;
>  struct old_linux_dirent;
>  struct perf_event_attr;
> @@ -623,6 +624,11 @@ asmlinkage long sys_get_robust_list(int pid,
>  asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
>  				    size_t len);
>  
> +/* kernel/futex2.c */
> +asmlinkage long sys_futex_waitv(struct futex_waitv *waiters,
> +				unsigned int nr_futexes, unsigned int flags,
> +				struct __kernel_timespec __user *timo);
> +
>  /* kernel/hrtimer.c */
>  asmlinkage long sys_nanosleep(struct __kernel_timespec __user *rqtp,
>  			      struct __kernel_timespec __user *rmtp);
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 1c5fb86d455a..ebafbb27cc41 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -880,8 +880,11 @@ __SYSCALL(__NR_memfd_secret, sys_memfd_secret)
>  #define __NR_process_mrelease 448
>  __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
>  
> +#define __NR_futex_waitv 449
> +__SC_COMP(__NR_futex_waitv, sys_futex_waitv)
> +
>  #undef __NR_syscalls
> -#define __NR_syscalls 449
> +#define __NR_syscalls 450
>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
> index 235e5b2facaa..71a5df8d2689 100644
> --- a/include/uapi/linux/futex.h
> +++ b/include/uapi/linux/futex.h
> @@ -43,6 +43,31 @@
>  #define FUTEX_CMP_REQUEUE_PI_PRIVATE	(FUTEX_CMP_REQUEUE_PI | \
>  					 FUTEX_PRIVATE_FLAG)
>  
> +/*
> + * Flags to specify the bit length of the futex word for futex2 syscalls.
> + * Currently, only 32 is supported.
> + */
> +#define FUTEX_32		2

Why start at 2?

> +
> +/*
> + * Max numbers of elements in a futex_waitv array
> + */
> +#define FUTEX_WAITV_MAX		128
> +
> +/**
> + * struct futex_waitv - A waiter for vectorized wait
> + * @val:	Expected value at uaddr
> + * @uaddr:	User address to wait on
> + * @flags:	Flags for this waiter
> + * @__reserved:	Reserved member to preserve data alignment. Should be 0.
> + */
> +struct futex_waitv {
> +	__u64 val;
> +	__u64 uaddr;
> +	__u32 flags;
> +	__u32 __reserved;
> +};

Why force uaddr to be __u64, even for 32-bit?  uaddr could be a (void *) for
all we care, no?  Also, by adding a reserved field, you are wasting 32
bits even on 32-bit architectures.
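
For reference, a minimal sketch of the layout trade-off in question, not
part of the patch. It assumes a userspace mirror of the proposed struct
and a variant with a native pointer and no reserved field; both struct
names and waitv_padding_cost are hypothetical. The fixed-width layout
should be 24 bytes on common 32-bit and 64-bit ABIs, while the size of the
native-pointer variant depends on the ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of the proposed layout: fixed-width fields, explicit padding. */
struct futex_waitv_sketch {
	uint64_t val;
	uint64_t uaddr;
	uint32_t flags;
	uint32_t reserved;	/* mirrors __reserved */
};

/* Alternative raised in review: native pointer, no reserved field. */
struct waitv_native_sketch {
	uint64_t val;
	void *uaddr;
	uint32_t flags;
};

/* The fixed-width layout is expected to be the same on every ABI. */
_Static_assert(sizeof(struct futex_waitv_sketch) == 24, "unexpected size");
_Static_assert(offsetof(struct futex_waitv_sketch, uaddr) == 8, "uaddr offset");
_Static_assert(offsetof(struct futex_waitv_sketch, flags) == 16, "flags offset");
_Static_assert(offsetof(struct futex_waitv_sketch, reserved) == 20, "reserved offset");

/* Extra bytes per waiter that the fixed-width layout costs on this ABI. */
static size_t waitv_padding_cost(void)
{
	return sizeof(struct futex_waitv_sketch) -
	       sizeof(struct waitv_native_sketch);
}
```

On a typical 64-bit ABI the two structs have the same size because the
compiler pads the native variant anyway. On a typical 32-bit ABI the
fixed-width one is larger, but it needs no separate compat definition.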

> +
>  /*
>   * Support for robust futexes: the kernel cleans up held futexes at
>   * thread exit time.
> diff --git a/init/Kconfig b/init/Kconfig
> index 11f8a845f259..a5c9300f9000 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1581,6 +1581,13 @@ config FUTEX
>  	  support for "fast userspace mutexes".  The resulting kernel may not
>  	  run glibc-based applications correctly.
>  
> +config FUTEX2
> +	bool "Enable futex2 support" if EXPERT
> +	depends on FUTEX
> +	default y
> +	help
> +	  Support for futex2 interface.
> +

This also seems unnecessary.  Why not just reuse CONFIG_FUTEX?  You aren't
really adding much code anyway.

>  config FUTEX_PI
>  	bool
>  	depends on FUTEX && RT_MUTEXES
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 4df609be42d0..1eaf2af50283 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -60,6 +60,7 @@ obj-$(CONFIG_PROFILING) += profile.o
>  obj-$(CONFIG_STACKTRACE) += stacktrace.o
>  obj-y += time/
>  obj-$(CONFIG_FUTEX) += futex.o
> +obj-$(CONFIG_FUTEX2) += futex2.o
>  obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
>  obj-$(CONFIG_SMP) += smp.o
>  ifneq ($(CONFIG_SMP),y)
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 32c91f9d7385..858465f97d9b 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -2739,6 +2739,207 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
>  	__set_current_state(TASK_RUNNING);
>  }
>  
> +/**
> + * unqueue_multiple - Remove various futexes from their hash bucket

What about: "Remove an array of futexes from the hash table."

> + * @v:	   The list of futexes to unqueue
> + * @count: Number of futexes in the list
> + *
> + * Helper to unqueue a list of futexes. This can't fail.
> + *
> + * Return:
> + *  - >=0 - Index of the last futex that was awoken;
> + *  - -1  - No futex was awoken
> + */
> +static int unqueue_multiple(struct futex_vector *v, int count)
> +{
> +	int ret = -1, i;
> +
> +	for (i = 0; i < count; i++) {
> +		if (!unqueue_me(&v[i].q))
> +			ret = i;
> +	}
> +
> +	return ret;
> +}
> +
> +/**
> + * futex_wait_multiple_setup - Prepare to wait and enqueue multiple futexes
> + * @vs:		The futex list to wait on
> + * @count:	The size of the list
> + * @awaken:	Index of the last awoken futex, if any. Used to notify the
> + *		caller that it can return this index to userspace (return parameter)
> + *
> + * Prepare multiple futexes in a single step and enqueue them. This may fail if
> + * the futex list is invalid or if any futex was already awoken. On success the
> + * task is ready to interruptible sleep.
> + *
> + * Return:
> + *  -  1 - One of the futexes was awaken by another thread
> + *  -  0 - Success
> + *  - <0 - -EFAULT, -EWOULDBLOCK or -EINVAL
> + */
> +static int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *awaken)
> +{
> +	struct futex_hash_bucket *hb;
> +	bool retry = false;
> +	int ret, i;
> +	u32 uval;
> +
> +	/*
> +	 * Enqueuing multiple futexes is tricky, because we need to enqueue
> +	 * each futex in the list before dealing with the next one to avoid
> +	 * deadlocking on the hash bucket. But, before enqueuing, we need to
> +	 * make sure that current->state is TASK_INTERRUPTIBLE, so we don't
> +	 * absorb any awake events, which cannot be done before the
> +	 * get_futex_key of the next key, because it calls get_user_pages,
> +	 * which can sleep. Thus, we fetch the list of futexes keys in two
> +	 * steps, by first pinning all the memory keys in the futex key, and
> +	 * only then we read each key and queue the corresponding futex.
> +	 *
> +	 * Private futexes doesn't need to recalculate hash in retry, so skip
> +	 * get_futex_key() when retrying.
> +	 */
> +retry:
> +	for (i = 0; i < count; i++) {
> +		if ((vs[i].w.flags & FUTEX_PRIVATE_FLAG) && retry)
> +			continue;
> +
> +		ret = get_futex_key(u64_to_user_ptr(vs[i].w.uaddr),
> +				    !(vs[i].w.flags & FUTEX_PRIVATE_FLAG),
> +				    &vs[i].q.key, FUTEX_READ);
> +
> +		if (unlikely(ret))
> +			return ret;
> +	}
> +
> +	set_current_state(TASK_INTERRUPTIBLE);
> +
> +	for (i = 0; i < count; i++) {
> +		u32 __user *uaddr = (u32 __user *)vs[i].w.uaddr;
> +		struct futex_q *q = &vs[i].q;
> +		u32 val = (u32)vs[i].w.val;
> +
> +		hb = queue_lock(q);
> +		ret = get_futex_value_locked(&uval, uaddr);
> +
> +		if (!ret && uval == val) {
> +			/*
> +			 * The bucket lock can't be held while dealing with the
> +			 * next futex. Queue each futex at this moment so hb can
> +			 * be unlocked.
> +			 */
> +			queue_me(q, hb);
> +			continue;
> +		}
> +
> +		queue_unlock(hb);
> +		__set_current_state(TASK_RUNNING);
> +
> +		/*
> +		 * Even if something went wrong, if we find out that a futex
> +		 * was awaken, we don't return error and return this index to
> +		 * userspace
> +		 */
> +		*awaken = unqueue_multiple(vs, i);
> +		if (*awaken >= 0)
> +			return 1;

If the user feeds us a bogus key and get_futex_value_locked() returns
-EFAULT, I think we should error out (after the get_user() also fails)
instead of ignoring it because a futex was awoken.  Otherwise, we are
helping to hide application errors.

This means you need to do the get_user() below before returning.

> +
> +		if (uval != val)
> +			return -EWOULDBLOCK;
> +
> +		if (ret) {
> +			/*
> +			 * If we need to handle a page fault, we need to do so
> +			 * without any lock and any enqueued futex (otherwise
> +			 * we could lose some wakeup). So we do it here, after
> +			 * undoing all the work done so far. In success, we
> +			 * retry all the work.
> +			 */
> +			if (get_user(uval, uaddr))
> +				return -EFAULT;
> +
> +			retry = true;
> +			goto retry;
> +		}

My nit is that this is an error path, but it doesn't look like one.  It
could benefit from being made more obvious.

> +	}
> +
> +	return 0;
> +}
> +
> +/**

...

> diff --git a/kernel/futex.h b/kernel/futex.h
> index c914e0080cf1..bcd0142c3f6e 100644
> --- a/kernel/futex.h
> +++ b/kernel/futex.h
> @@ -137,4 +137,19 @@ futex_init_timeout(u32 cmd, u32 op, struct timespec64 *ts, ktime_t *t)
>  	return 0;
>  }
>  
> +/**
> + * struct futex_vector - Auxiliary struct for futex_waitv()
> + * @w: Userspace provided data
> + * @q: Kernel side data
> + *
> + * Struct used to build an array with all data need for futex_waitv()
> + */
> +struct futex_vector {
> +	struct futex_waitv w;
> +	struct futex_q q;
> +};
> +
> +int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
> +			struct hrtimer_sleeper *to);
> +
>  #endif
> diff --git a/kernel/futex2.c b/kernel/futex2.c
> new file mode 100644
> index 000000000000..f724ecf40f3e
> --- /dev/null
> +++ b/kernel/futex2.c

Now I'm confused.  Why is the implementation split across two files?  I feel
this just messes with a bunch of declarations and headers, making it much
harder to review.

> @@ -0,0 +1,117 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * futex2 interface system calls
> + *
> + * futex_waitv by André Almeida <andrealmeid@collabora.com>
> + *
> + * Copyright 2021 Collabora Ltd.
> + */
> +
> +#include "futex.h"
> +
> +/* Mask of available flags for each futex in futex_waitv list */
> +#define FUTEXV_WAITER_MASK (FUTEX_32 | FUTEX_PRIVATE_FLAG)
> +
> +/* Mask of available flags for sys_futex_waitv flag */
> +#define FUTEXV_MASK (FUTEX_CLOCK_REALTIME)
> +
> +/**
> + * futex_parse_waitv - Parse a waitv array from userspace
> + * @futexv:	Kernel side list of waiters to be filled
> + * @uwaitv:     Userspace list to be parsed
> + * @nr_futexes: Length of futexv
> + *
> + * Return: Error code on failure, 0 on success
> + */
> +static int futex_parse_waitv(struct futex_vector *futexv,
> +			     struct futex_waitv __user *uwaitv,
> +			     unsigned int nr_futexes)
> +{
> +	struct futex_waitv aux;
> +	unsigned int i;
> +
> +	for (i = 0; i < nr_futexes; i++) {
> +		if (copy_from_user(&aux, &uwaitv[i], sizeof(aux)))
> +			return -EFAULT;
> +
> +		if ((aux.flags & ~FUTEXV_WAITER_MASK) || aux.__reserved)
> +			return -EINVAL;
> +
> +		futexv[i].w.flags = aux.flags;
> +		futexv[i].w.val = aux.val;
> +		futexv[i].w.uaddr = aux.uaddr;
> +		futexv[i].q = futex_q_init;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * sys_futex_waitv - Wait on a list of futexes
> + * @waiters:    List of futexes to wait on
> + * @nr_futexes: Length of futexv
> + * @flags:      Flag for timeout (monotonic/realtime)
> + * @timo:	Optional absolute timeout.
> + *
> + * Given an array of `struct futex_waitv`, wait on each uaddr. The thread wakes
> + * if a futex_wake() is performed at any uaddr. The syscall returns immediately
> + * if any waiter has *uaddr != val. *timo is an optional timeout value for the
> + * operation. Each waiter has individual flags. The `flags` argument for the
> + * syscall should be used solely for specifying the timeout as realtime, if
> + * needed. Flags for shared futexes, sizes, etc. should be used on the
> + * individual flags of each waiter.
> + *
> + * Returns the array index of one of the awaken futexes. There's no given
> + * information of how many were awakened, or any particular attribute of it (if
> + * it's the first awakened, if it is of the smaller index...).
> + */
> +
> +SYSCALL_DEFINE4(futex_waitv, struct futex_waitv __user *, waiters,
> +		unsigned int, nr_futexes, unsigned int, flags,
> +		struct __kernel_timespec __user *, timo)
> +{
> +	struct hrtimer_sleeper to;
> +	struct futex_vector *futexv;
> +	struct timespec64 ts;
> +	ktime_t time;
> +	int ret;
> +
> +	if (flags & ~FUTEXV_MASK)
> +		return -EINVAL;
> +
> +	if (!nr_futexes || nr_futexes > FUTEX_WAITV_MAX || !waiters)
> +		return -EINVAL;
> +
> +	if (timo) {
> +		int flag_clkid = (flags & FUTEX_CLOCK_REALTIME) ? FLAGS_CLOCKRT : 0;
> +
> +		if (get_timespec64(&ts, timo))
> +			return -EFAULT;
> +
> +		/*
> +		 * Since there's no opcode for futex_waitv, use
> +		 * FUTEX_WAIT_BITSET that uses absolute timeout as well
> +		 */
> +		ret = futex_init_timeout(FUTEX_WAIT_BITSET, flags, &ts, &time);
> +		if (ret)
> +			return ret;
> +
> +		futex_setup_timer(&time, &to, flag_clkid, 0);
> +	}
> +
> +	futexv = kcalloc(nr_futexes, sizeof(*futexv), GFP_KERNEL);
> +	if (!futexv)
> +		return -ENOMEM;
> +
> +	ret = futex_parse_waitv(futexv, waiters, nr_futexes);
> +	if (!ret)
> +		ret = futex_wait_multiple(futexv, nr_futexes, timo ? &to : NULL);
> +
> +	if (timo) {
> +		hrtimer_cancel(&to.timer);
> +		destroy_hrtimer_on_stack(&to.timer);
> +	}
> +
> +	kfree(futexv);
> +	return ret;
> +}
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index f43d89d92860..3d0b94f6b88d 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -151,6 +151,9 @@ COND_SYSCALL_COMPAT(set_robust_list);
>  COND_SYSCALL(get_robust_list);
>  COND_SYSCALL_COMPAT(get_robust_list);
>  
> +/* kernel/futex2.c */
> +COND_SYSCALL(futex_waitv);
> +

This should go into a syscall wiring patch, if possible.

>  /* kernel/hrtimer.c */
>  
>  /* kernel/itimer.c */
> -- 
>
> 2.33.0

-- 
Gabriel Krisman Bertazi


* Re: [PATCH v3 0/6] futex2: Add wait on multiple futexes syscall
  2021-09-13 17:52 [PATCH v3 0/6] futex2: Add wait on multiple futexes syscall André Almeida
                   ` (5 preceding siblings ...)
  2021-09-13 17:52 ` [PATCH v3 6/6] selftests: futex2: Test futex_waitv timeout André Almeida
@ 2021-09-14  1:05 ` Gabriel Krisman Bertazi
  2021-09-14  3:07   ` André Almeida
  6 siblings, 1 reply; 17+ messages in thread
From: Gabriel Krisman Bertazi @ 2021-09-14  1:05 UTC (permalink / raw)
  To: André Almeida
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior, kernel,
	linux-api, libc-alpha, mtk.manpages, Davidlohr Bueso,
	Arnd Bergmann

André Almeida <andrealmeid@collabora.com> writes:

> Hi,
>
> This patchset introduce the futex_waitv syscall. It reuses as much as
> possible of original futex code for the new operation, so the first
> commit move some stuff to futex header to make accessible for futex2.
>

In general, this series is missing a Documentation/ patch at the end,
in particular since it adds a new interface.  Much of what you describe
in the cover letter should go there...

> * Use case
>
> The use case of this syscall is to allow low level locking libraries to
> wait for multiple locks at the same time. This is specially useful for
> emulating Windows' WaitForMultipleObjects. A futex_waitv()-based solution
> has been used for some time at Proton's Wine (a compatibility layer to
> run Windows games on Linux). Compared to a solution that uses eventfd(),
> futex was able to reduce CPU utilization for games, and even increase
> frames per second for some games. This happens because eventfd doesn't
> scale very well for a huge number of read, write and poll calls compared
> to futex. Native game engines will benefit of this as well, given that
> this wait pattern is common for games.
>
> * The interface
>
> This is how the interface looks like:
>
>   futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes,
>               unsigned int flags, struct timespec *timo)
>
>   struct futex_waitv {
>           __u64 val;
>           __u64 uaddr;
>           __u32 flags;
>           __u32 __reserved;
>   };
>
> struct futex_waitv uses explicit padding, so we can use it in all
> architectures. The __reserved is used for the padding and should always
> be 0, but it may be repurposed in the future for some extension. If
> userspace has 32-bit pointers, it should do a explicit cast to make sure
> the upper bits are zeroed. uintptr_t does the tricky and it works for
> 32/64-bit pointers.
>
> * Why u64?
>
> Although futex() supports only 32-bit long integers, while researching
> about feedback around a new futex interface, developers made some points
> for variable size support:
>
> - At Boost Libraries, futex is used as back end to implement atomic
> primitives for some architectures. It works fine for 32-bit futexes, but
> for other sizes it "must use an internal lock pool to implement waiting
> and notifying operations, which increases thread contention. For
> inter-process atomics, this means that waiting must be done using a spin
> loop, which is terribly inefficient."[1]
>
> - glibc’s rwlock implementation "uses a torn 32-bit futex read which is
> part of an atomically updated 64-bit word".[2]
>
> - Peter Oskolkov[3] pointed out that for 64-bit platforms it would be
> useful to do atomic operations in pointer values: "imagine a simple
> producer/consumer scenario, with the producer updating some shared
> memory data and waking the consumer. Storing the pointer in the futex
> makes it so that only one shared memory location needs to be accessed
> atomically".
>
> - The original proposal[4] to support 8-bit and 16-bit futexes had some
> use cases as well: "Having mutexes that are only one byte in size was
> the first reason WebKit mentioned for re-implementing futexes in a
> library" and "The C++ standard added futexes to the standard library in
> C++20 under the name atomic_wait and atomic_notify. The C++20 version
> supports this for atomic variables of any size. The more sizes we can
> support, the better the implementation can be in the standard library."
>
>  Testing
>
> Through Proton, I've tested futex_waitv() with modern games that issue
> more than 40k futex calls per second. Selftest are provided as part of this
> patchset. However, those selftests aren't really reliable in 32-bit
> platforms giving that glibc doesn't expose a way to have a 64-bit timespec 
> gettime(). In the past I implemented a gettime64() by myself as part of
> the selftest, but I'm not sure if this the best approach:
> https://lore.kernel.org/lkml/20210805190405.59110-4-andrealmeid@collabora.com/
>
>  Changelog
>
> Changes from v2:
> v2: https://lore.kernel.org/lkml/20210904231159.13292-1-andrealmeid@collabora.com/
> - Last version, I made compat and non-compat use the same code, but
> failed to remove the compat entry point. This is fixed now.
> - Add ARM support
>
> Changes from v1:
> v1: https://lore.kernel.org/lkml/20210805190405.59110-1-andrealmeid@collabora.com/
> - Tons of code and comment improvements and fixes (thanks Thomas!)
> - Changed the struct to have explicit padding (thanks Arnd!)
> - Created a kernel/futex.h
> - Splitted syscall table changes from the implementation
> - Compat and non-compat entry point now uses the same code and same
>   struct
> - Added test for timeout
>
> More info about futex2: https://lore.kernel.org/lkml/20210709001328.329716-1-andrealmeid@collabora.com/
>
> [1] https://lists.boost.org/Archives/boost/2021/05/251508.php
>
> [2]
> https://lore.kernel.org/lkml/20210603195924.361327-1-andrealmeid@collabora.com/T/#m37bfbbd6ac76c121941defd1daea774389552674
>
> [3]
> https://lore.kernel.org/lkml/CAFTs51XAr2b3DmcSM4=qeU5cNuh0mTxUbhG66U6bc63YYzkzYA@mail.gmail.com/
>
> [4]
> https://lore.kernel.org/lkml/20191204235238.10764-1-malteskarupke@web.de/
>
> André Almeida (6):
>   futex: Prepare for futex_wait_multiple()
>   futex2: Implement vectorized wait
>   futex2: wire up syscall for x86
>   futex2: wire up syscall for ARM
>   selftests: futex2: Add waitv test
>   selftests: futex2: Test futex_waitv timeout
>
>  MAINTAINERS                                   |   3 +-
>  arch/arm/tools/syscall.tbl                    |   1 +
>  arch/arm64/include/asm/unistd.h               |   2 +-
>  arch/arm64/include/asm/unistd32.h             |   2 +
>  arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
>  include/linux/syscalls.h                      |   6 +
>  include/uapi/asm-generic/unistd.h             |   5 +-
>  include/uapi/linux/futex.h                    |  25 ++
>  init/Kconfig                                  |   7 +
>  kernel/Makefile                               |   1 +
>  kernel/futex.c                                | 335 +++++++++++-------
>  kernel/futex.h                                | 155 ++++++++
>  kernel/futex2.c                               | 117 ++++++
>  kernel/sys_ni.c                               |   3 +
>  .../selftests/futex/functional/.gitignore     |   1 +
>  .../selftests/futex/functional/Makefile       |   3 +-
>  .../futex/functional/futex_wait_timeout.c     |  21 +-
>  .../selftests/futex/functional/futex_waitv.c  | 158 +++++++++
>  .../testing/selftests/futex/functional/run.sh |   3 +
>  .../selftests/futex/include/futex2test.h      |  31 ++
>  21 files changed, 744 insertions(+), 137 deletions(-)
>  create mode 100644 kernel/futex.h
>  create mode 100644 kernel/futex2.c
>  create mode 100644 tools/testing/selftests/futex/functional/futex_waitv.c
>  create mode 100644 tools/testing/selftests/futex/include/futex2test.h
>
> -- 
> 2.33.0

-- 
Gabriel Krisman Bertazi


* Re: [PATCH v3 5/6] selftests: futex2: Add waitv test
  2021-09-13 17:52 ` [PATCH v3 5/6] selftests: futex2: Add waitv test André Almeida
@ 2021-09-14  1:11   ` Gabriel Krisman Bertazi
  0 siblings, 0 replies; 17+ messages in thread
From: Gabriel Krisman Bertazi @ 2021-09-14  1:11 UTC (permalink / raw)
  To: André Almeida
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior, kernel,
	linux-api, libc-alpha, mtk.manpages, Davidlohr Bueso,
	Arnd Bergmann

André Almeida <andrealmeid@collabora.com> writes:

> Create a new file to test the waitv mechanism. Test both private and
> shared futexes. Wake the last futex in the array, and check if the
> return value from futex_waitv() is the right index.
>
> Signed-off-by: André Almeida <andrealmeid@collabora.com>
> ---
>  .../selftests/futex/functional/.gitignore     |   1 +
>  .../selftests/futex/functional/Makefile       |   3 +-
>  .../selftests/futex/functional/futex_waitv.c  | 158 ++++++++++++++++++
>  .../testing/selftests/futex/functional/run.sh |   3 +
>  .../selftests/futex/include/futex2test.h      |  31 ++++
>  5 files changed, 195 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/futex/functional/futex_waitv.c
>  create mode 100644 tools/testing/selftests/futex/include/futex2test.h
>
> diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore
> index 0e78b49d0f2f..fbcbdb6963b3 100644
> --- a/tools/testing/selftests/futex/functional/.gitignore
> +++ b/tools/testing/selftests/futex/functional/.gitignore
> @@ -8,3 +8,4 @@ futex_wait_uninitialized_heap
>  futex_wait_wouldblock
>  futex_wait
>  futex_requeue
> +futex_waitv
> diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
> index bd1fec59e010..5cc38de9d8ea 100644
> --- a/tools/testing/selftests/futex/functional/Makefile
> +++ b/tools/testing/selftests/futex/functional/Makefile
> @@ -17,7 +17,8 @@ TEST_GEN_FILES := \
>  	futex_wait_uninitialized_heap \
>  	futex_wait_private_mapped_file \
>  	futex_wait \
> -	futex_requeue
> +	futex_requeue \
> +	futex_waitv
>  
>  TEST_PROGS := run.sh
>  
> diff --git a/tools/testing/selftests/futex/functional/futex_waitv.c b/tools/testing/selftests/futex/functional/futex_waitv.c
> new file mode 100644
> index 000000000000..567667dfa7cf
> --- /dev/null
> +++ b/tools/testing/selftests/futex/functional/futex_waitv.c
> @@ -0,0 +1,158 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/******************************************************************************
> + *
> + *   Copyright Collabora Ltd., 2021
> + *
> + * DESCRIPTION
> + *	Test waitv/wake mechanism of futex2, using 32bit sized futexes.
> + *
> + * AUTHOR
> + *	André Almeida <andrealmeid@collabora.com>
> + *
> + * HISTORY
> + *      2021-Feb-5: Initial version by André <andrealmeid@collabora.com>

We have git to keep history nowadays.  This type of changelog is a relic
from less civilized times and adds no extra information.  :)



-- 
Gabriel Krisman Bertazi


* Re: [PATCH v3 0/6] futex2: Add wait on multiple futexes syscall
  2021-09-14  1:05 ` [PATCH v3 0/6] futex2: Add wait on multiple futexes syscall Gabriel Krisman Bertazi
@ 2021-09-14  3:07   ` André Almeida
  0 siblings, 0 replies; 17+ messages in thread
From: André Almeida @ 2021-09-14  3:07 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior, kernel,
	linux-api, libc-alpha, mtk.manpages, Davidlohr Bueso,
	Arnd Bergmann

Às 22:05 de 13/09/21, Gabriel Krisman Bertazi escreveu:
> André Almeida <andrealmeid@collabora.com> writes:
> 
>> Hi,
>>
>> This patchset introduce the futex_waitv syscall. It reuses as much as
>> possible of original futex code for the new operation, so the first
>> commit move some stuff to futex header to make accessible for futex2.
>>
> 
> In general, this series is missing a Documentation/ patch at the end.
> In particular since it adds a new interface.  Much of what you describe
> in the cover letter should go there...
> 

Oops, it seems I left that commit behind. I'll add it for v4:
https://lore.kernel.org/lkml/20210805190405.59110-5-andrealmeid@collabora.com/

Thanks!


* Re: [PATCH v3 2/6] futex2: Implement vectorized wait
  2021-09-14  1:03   ` Gabriel Krisman Bertazi
@ 2021-09-14 17:18     ` André Almeida
  2021-09-16  4:10       ` Gabriel Krisman Bertazi
  0 siblings, 1 reply; 17+ messages in thread
From: André Almeida @ 2021-09-14 17:18 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior, kernel,
	linux-api, libc-alpha, mtk.manpages, Davidlohr Bueso,
	Arnd Bergmann

Hi Gabriel, thanks for the feedback! A few clarifications:

Às 22:03 de 13/09/21, Gabriel Krisman Bertazi escreveu:
> André Almeida <andrealmeid@collabora.com> writes:
> 
>> Add support to wait on multiple futexes. This is the interface
>> implemented by this syscall:
>>

[...]

>>  
>> +/*
>> + * Flags to specify the bit length of the futex word for futex2 syscalls.
>> + * Currently, only 32 is supported.
>> + */
>> +#define FUTEX_32		2
> 
> Why start at 2?

I was planning to do:

FUTEX_8		0
FUTEX_16	1
FUTEX_32	2
FUTEX_64	3
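
A minimal sketch of how such an encoding could be decoded, assuming the low
two flag bits carry log2 of the futex word size in bytes. FUTEX_SIZE_MASK
and futex_size_bytes() are names made up for this illustration; the patch
itself only defines FUTEX_32.

```c
#include <stdint.h>

/* Planned size encoding from the list above; only FUTEX_32 exists in v3. */
#define FUTEX_8		0
#define FUTEX_16	1
#define FUTEX_32	2
#define FUTEX_64	3

/* Hypothetical mask covering the two size bits (not part of the patch). */
#define FUTEX_SIZE_MASK	0x3u

/* Hypothetical helper: width of the futex word in bytes, i.e. 1 << size. */
static inline unsigned int futex_size_bytes(uint32_t flags)
{
	return 1u << (flags & FUTEX_SIZE_MASK);
}
```

With this numbering, other per-waiter flags such as FUTEX_PRIVATE_FLAG sit
in higher bits and do not disturb the size lookup.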

> 
>> +
>> +/*
>> + * Max numbers of elements in a futex_waitv array
>> + */
>> +#define FUTEX_WAITV_MAX		128
>> +
>> +/**
>> + * struct futex_waitv - A waiter for vectorized wait
>> + * @val:	Expected value at uaddr
>> + * @uaddr:	User address to wait on
>> + * @flags:	Flags for this waiter
>> + * @__reserved:	Reserved member to preserve data alignment. Should be 0.
>> + */
>> +struct futex_waitv {
>> +	__u64 val;
>> +	__u64 uaddr;
>> +	__u32 flags;
>> +	__u32 __reserved;
>> +};
> 
> why force uaddr  to be __u64, even for 32-bit?  uaddr could be a (void*) for
> all we care, no?  Also, by adding a reserved field, you are wasting 32
> bits even on 32-bit architectures.
> 

We do that to make the structure layout compatible with both entry
points and to remove the need for special casts and duplicated code, as
suggested by Thomas and Arnd:

https://lore.kernel.org/lkml/87v94310gm.ffs@tglx/

https://lore.kernel.org/lkml/CAK8P3a0MO1qJLRkCH8KrZ3+=L66KOsMRmcbrUvYdMoKykdKoyQ@mail.gmail.com/


* Re: [PATCH v3 2/6] futex2: Implement vectorized wait
  2021-09-14 17:18     ` André Almeida
@ 2021-09-16  4:10       ` Gabriel Krisman Bertazi
  2021-09-16 11:20         ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Gabriel Krisman Bertazi @ 2021-09-16  4:10 UTC (permalink / raw)
  To: André Almeida
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior, kernel,
	linux-api, libc-alpha, mtk.manpages, Davidlohr Bueso,
	Arnd Bergmann

André Almeida <andrealmeid@collabora.com> writes:

>>> +/**
>>> + * struct futex_waitv - A waiter for vectorized wait
>>> + * @val:	Expected value at uaddr
>>> + * @uaddr:	User address to wait on
>>> + * @flags:	Flags for this waiter
>>> + * @__reserved:	Reserved member to preserve data alignment. Should be 0.
>>> + */
>>> +struct futex_waitv {
>>> +	__u64 val;
>>> +	__u64 uaddr;
>>> +	__u32 flags;
>>> +	__u32 __reserved;
>>> +};
>> 
>> why force uaddr  to be __u64, even for 32-bit?  uaddr could be a (void*) for
>> all we care, no?  Also, by adding a reserved field, you are wasting 32
>> bits even on 32-bit architectures.
>> 
>
> We do that to make the structure layout compatible with both entry
> points, remove the need for special cast and duplicated code, as
> suggested by Thomas and Arnd:
>
> https://lore.kernel.org/lkml/87v94310gm.ffs@tglx/
>
> https://lore.kernel.org/lkml/CAK8P3a0MO1qJLRkCH8KrZ3+=L66KOsMRmcbrUvYdMoKykdKoyQ@mail.gmail.com/

I find this weird.  I'm not even just talking about compat, but even on
native 32-bit. But also, 32-bit applications on 64-bit, which is a big use
case for games.

The structure mandates a 64-bit uaddr field and has an unnecessary
pad.  You are wasting 20% of the space in what will be the elements of a
vector coming from user space.  Worst case, you are doing a copy_from_user()
of an extra 1k bytes in the critical path of futex_waitv for no good
reason.
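
To make the worst case concrete, here is a small userspace sketch comparing
the layout from the patch with a narrower one. struct futex_waitv_compat32
and waitv_extra_bytes() are hypothetical, introduced only for this
comparison; they assume a 32-bit uaddr and no pad.

```c
#include <stddef.h>
#include <stdint.h>

#define FUTEX_WAITV_MAX	128	/* value from the patch */

/* Layout from the patch: 24 bytes on every architecture. */
struct futex_waitv {
	uint64_t val;
	uint64_t uaddr;
	uint32_t flags;
	uint32_t __reserved;
};

/* Hypothetical 32-bit layout: 32-bit uaddr, no pad (not in the patch). */
struct futex_waitv_compat32 {
	uint64_t val;
	uint32_t uaddr;
	uint32_t flags;
};

/* Extra bytes copied from user space for nr entries of the wide layout. */
static inline size_t waitv_extra_bytes(size_t nr)
{
	return nr * (sizeof(struct futex_waitv) -
		     sizeof(struct futex_waitv_compat32));
}
```

With FUTEX_WAITV_MAX entries the difference is 128 * 8 = 1024 bytes, which
matches the 1k figure.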

Also, if I understand correctly, Arnd's suggestion, at least, was to have
two parser functions and a single syscall entry point that would do the
translation:

if (in_compat_syscall())
   futex_parse_waitv_compat(futexv, waiters, nr_futexes);
else
   futex_parse_waitv(futexv, waiters, nr_futexes);
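
Spelled out a bit more, as a userspace model rather than kernel code: both
parsers fill the same internal representation, and only the source layout
differs. struct futex_waitv_compat, struct waiter and the is_compat
parameter are assumptions made for this sketch; is_compat stands in for
in_compat_syscall(), and plain pointers stand in for copy_from_user().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Native layout, as in the patch. */
struct futex_waitv {
	uint64_t val;
	uint64_t uaddr;
	uint32_t flags;
	uint32_t __reserved;
};

/* Hypothetical compat layout with a 32-bit uaddr (not in the patch). */
struct futex_waitv_compat {
	uint64_t val;
	uint32_t uaddr;
	uint32_t flags;
};

/* Stand-in for the kernel-side copy kept in struct futex_vector. */
struct waiter {
	uint64_t val;
	uint64_t uaddr;
	uint32_t flags;
};

static int parse_native(struct waiter *out, const struct futex_waitv *in,
			size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		if (in[i].__reserved)
			return -EINVAL;
		out[i] = (struct waiter){ in[i].val, in[i].uaddr, in[i].flags };
	}
	return 0;
}

static int parse_compat(struct waiter *out,
			const struct futex_waitv_compat *in, size_t nr)
{
	/* The 32-bit address is zero-extended into the common form. */
	for (size_t i = 0; i < nr; i++)
		out[i] = (struct waiter){ in[i].val, in[i].uaddr, in[i].flags };
	return 0;
}

/* Single entry point that picks the parser for the caller's ABI. */
static int parse_waitv(struct waiter *out, const void *uwaitv, size_t nr,
		       bool is_compat)
{
	if (is_compat)
		return parse_compat(out, uwaitv, nr);
	return parse_native(out, uwaitv, nr);
}
```

Everything after the parse step would then work on struct waiter only, so
the wait path itself needs no compat handling.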

-- 
Gabriel Krisman Bertazi


* Re: [PATCH v3 2/6] futex2: Implement vectorized wait
  2021-09-16  4:10       ` Gabriel Krisman Bertazi
@ 2021-09-16 11:20         ` Peter Zijlstra
  2021-09-16 11:50           ` Arnd Bergmann
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2021-09-16 11:20 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: André Almeida, Thomas Gleixner, Ingo Molnar, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior, kernel,
	linux-api, libc-alpha, mtk.manpages, Davidlohr Bueso,
	Arnd Bergmann

On Thu, Sep 16, 2021 at 12:10:25AM -0400, Gabriel Krisman Bertazi wrote:

> I find this weird.  I'm not even juts talking about compat, but even on
> native 32-bit. But also, 32 applications on 64, which is a big use
> case for games.

Seriously, people still make 32bit applications today? And for legacy
games, I would think the speed increase of modern CPUs would far offset
this little inefficiency.


* Re: [PATCH v3 2/6] futex2: Implement vectorized wait
  2021-09-16 11:20         ` Peter Zijlstra
@ 2021-09-16 11:50           ` Arnd Bergmann
  2021-09-16 13:37             ` Steven Rostedt
  2021-09-16 16:36             ` Gabriel Krisman Bertazi
  0 siblings, 2 replies; 17+ messages in thread
From: Arnd Bergmann @ 2021-09-16 11:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gabriel Krisman Bertazi, André Almeida, Thomas Gleixner,
	Ingo Molnar, Darren Hart, Linux Kernel Mailing List,
	Steven Rostedt, Sebastian Andrzej Siewior, Collabora kernel ML,
	Linux API, GNU C Library, Michael Kerrisk, Davidlohr Bueso,
	Arnd Bergmann

On Thu, Sep 16, 2021 at 1:22 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Sep 16, 2021 at 12:10:25AM -0400, Gabriel Krisman Bertazi wrote:
>
> > I find this weird.  I'm not even juts talking about compat, but even on
> > native 32-bit. But also, 32 applications on 64, which is a big use
> > case for games.
>
> Seriously, people still make 32bit applications today? And for legacy
> games, I would think the speed increase of modern CPUs would far offset
> this little inefficiency.

There are 32-bit Windows games apparently, because it's easier to build it
that way than having both 32-bit and 64-bit versions.
There may be native 32-bit games built for Linux from the same sources when
that is not written portably, not sure if that's a thing.

One important reason to use compat mode is for cost savings when you can
ship an embedded system with slightly less RAM by running 32-bit user space
on it. We even still see people running 32-bit kernels on Arm boxes that have
entry-level 64-bit chips, though I hope that those will migrate the kernel to
arm64 even when they ship 32-bit user space.

Similar logic applies to cloud instances or containers. Running a 32-bit
Alpine Linux in a container means you can often go to a lower memory
instance on the host compared to a full 64-bit distro.

        Arnd


* Re: [PATCH v3 2/6] futex2: Implement vectorized wait
  2021-09-16 11:50           ` Arnd Bergmann
@ 2021-09-16 13:37             ` Steven Rostedt
  2021-09-16 16:36             ` Gabriel Krisman Bertazi
  1 sibling, 0 replies; 17+ messages in thread
From: Steven Rostedt @ 2021-09-16 13:37 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Peter Zijlstra, Gabriel Krisman Bertazi, André Almeida,
	Thomas Gleixner, Ingo Molnar, Darren Hart,
	Linux Kernel Mailing List, Sebastian Andrzej Siewior,
	Collabora kernel ML, Linux API, GNU C Library, Michael Kerrisk,
	Davidlohr Bueso

On Thu, 16 Sep 2021 13:50:14 +0200
Arnd Bergmann <arnd@arndb.de> wrote:

> Similar logic applies to cloud instances or containers. Running a 32-bit
> Alpine Linux in a container means you can often go to a lower memory
> instance on the host compared to a full 64-bit distro.

I also found that running a 32-bit version of Chrome or Firefox keeps them
from taking up all the memory in your system ;-)  The most they can use is
4 gigs.

-- Steve


* Re: [PATCH v3 2/6] futex2: Implement vectorized wait
  2021-09-16 11:50           ` Arnd Bergmann
  2021-09-16 13:37             ` Steven Rostedt
@ 2021-09-16 16:36             ` Gabriel Krisman Bertazi
  1 sibling, 0 replies; 17+ messages in thread
From: Gabriel Krisman Bertazi @ 2021-09-16 16:36 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Peter Zijlstra, André Almeida, Thomas Gleixner, Ingo Molnar,
	Darren Hart, Linux Kernel Mailing List, Steven Rostedt,
	Sebastian Andrzej Siewior, Collabora kernel ML, Linux API,
	GNU C Library, Michael Kerrisk, Davidlohr Bueso

Arnd Bergmann <arnd@arndb.de> writes:

> On Thu, Sep 16, 2021 at 1:22 PM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Thu, Sep 16, 2021 at 12:10:25AM -0400, Gabriel Krisman Bertazi wrote:
>>
>> > I find this weird.  I'm not even juts talking about compat, but even on
>> > native 32-bit. But also, 32 applications on 64, which is a big use
>> > case for games.
>>
>> Seriously, people still make 32bit applications today? And for legacy
>> games, I would think the speed increase of modern CPUs would far offset
>> this little inefficiency.
>
> There are 32-bit Windows games apparently, because it's easier to build it
> that way than having both 32-bit and 64-bit versions.

Yes, many modern, recently released, triple-A Windows games running over
Proton/Wine are published only in 32-bit.  We also keep a 32-bit Proton
for that reason.

> There may be native 32-bit games built for Linux from the same sources when
> that is not written portably, not sure if that's a thing.
>
> One important reason to use compat mode is for cost savings when you can
> ship an embedded system with slightly less RAM by running 32-bit user space
> on it. We even still see people running 32-bit kernels on Arm boxes that have
> entry-level 64-bit chips, though I hope that those will migrate the
> kernel to arm64
> even when they ship 32-bit user space.
>
> Similar logic applies to cloud instances or containers. Running a 32-bit
> Alpine Linux in a container means you can often go to a lower memory
> instance on the host compared to a full 64-bit distro.
>
>         Arnd

-- 
Gabriel Krisman Bertazi


end of thread, other threads:[~2021-09-16 16:42 UTC | newest]

Thread overview: 17+ messages
2021-09-13 17:52 [PATCH v3 0/6] futex2: Add wait on multiple futexes syscall André Almeida
2021-09-13 17:52 ` [PATCH v3 1/6] futex: Prepare for futex_wait_multiple() André Almeida
2021-09-13 17:52 ` [PATCH v3 2/6] futex2: Implement vectorized wait André Almeida
2021-09-14  1:03   ` Gabriel Krisman Bertazi
2021-09-14 17:18     ` André Almeida
2021-09-16  4:10       ` Gabriel Krisman Bertazi
2021-09-16 11:20         ` Peter Zijlstra
2021-09-16 11:50           ` Arnd Bergmann
2021-09-16 13:37             ` Steven Rostedt
2021-09-16 16:36             ` Gabriel Krisman Bertazi
2021-09-13 17:52 ` [PATCH v3 3/6] futex2: wire up syscall for x86 André Almeida
2021-09-13 17:52 ` [PATCH v3 4/6] futex2: wire up syscall for ARM André Almeida
2021-09-13 17:52 ` [PATCH v3 5/6] selftests: futex2: Add waitv test André Almeida
2021-09-14  1:11   ` Gabriel Krisman Bertazi
2021-09-13 17:52 ` [PATCH v3 6/6] selftests: futex2: Test futex_waitv timeout André Almeida
2021-09-14  1:05 ` [PATCH v3 0/6] futex2: Add wait on multiple futexes syscall Gabriel Krisman Bertazi
2021-09-14  3:07   ` André Almeida
