LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 0/4] futex2: Add wait on multiple futexes syscall
@ 2021-08-05 19:04 André Almeida
  2021-08-05 19:04 ` [PATCH 1/4] futex: Prepare for futex_wait_multiple() André Almeida
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: André Almeida @ 2021-08-05 19:04 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior
  Cc: kernel, krisman, linux-api, libc-alpha, mtk.manpages,
	Davidlohr Bueso, André Almeida

Hi,

As opposed to previous futex2 patchsets, this one adds only one syscall:
futex_waitv(). This way we can focus on this operation that already have
a well defined use case and has been tested for months now.

The patchset reuses as much as possible of original futex code for the new
operation, so the first commit move some stuff to futex header to make
accessible for futex2.

Ideally, I would completely replace futex_wait_setup() with
futex_wait_multiple(): it is basic the same logic, but for n futexes,
so for existing operations it was a matter of calling it with nr_futexes=1.
This worked pretty well for futex_wait(): I tested with glibc tests,
tested with a complete distro running on top of it and perf benchs
presented no performance difference. However, it didn't work for
futex_wait_requeue_pi(), since the wait path for it is slightly different
of the normal wait, that would require some refactor to get it in a way to
be easily replaced. So I decided to not replace it at all.

 Use case

The use case of this syscall is to allow low level locking libraries to
wait for multiple locks at the same time. This is specially useful for
emulating Windows' WaitForMultipleObjects. A futex_waitv()-based solution
has been used for some time at Proton's Wine (a compatibility layer to
run Windows games on Linux). Compared to a solution that uses eventfd(),
futex was able to reduce CPU utilization for games, and even increase
frames per second for some games. This happens because eventfd doesn't
scale very well for a huge number of read, write and poll calls compared
to futex. Native game engines will benefit of this as well, given that
this wait pattern is common for games.

 Testing

Selftest is provided as part of this patchset. As stated above, I used
the futex_wait_multiple() in FUTEX_WAIT path and it worked fine in a
full distro. Throught Proton, I've tested futex_waitv() with modern games
that issue more than 40k futex calls per second.

André Almeida (4):
  futex: Prepare for futex_wait_multiple()
  futex2: Implement vectorized wait
  selftests: futex2: Add waitv test
  futex2: Documentation: Document futex_waitv() uAPI

 Documentation/userspace-api/futex2.rst        |  79 ++++++
 Documentation/userspace-api/index.rst         |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 include/linux/compat.h                        |   9 +
 include/linux/futex.h                         |  75 ++++++
 include/uapi/asm-generic/unistd.h             |   5 +-
 include/uapi/linux/futex.h                    |  17 ++
 init/Kconfig                                  |   7 +
 kernel/Makefile                               |   1 +
 kernel/futex.c                                | 224 ++++++++++++++----
 kernel/futex2.c                               | 198 ++++++++++++++++
 kernel/sys_ni.c                               |   4 +
 .../selftests/futex/functional/.gitignore     |   1 +
 .../selftests/futex/functional/Makefile       |   3 +-
 .../selftests/futex/functional/futex2_waitv.c | 154 ++++++++++++
 .../testing/selftests/futex/functional/run.sh |   3 +
 .../selftests/futex/include/futex2test.h      |  72 ++++++
 18 files changed, 812 insertions(+), 43 deletions(-)
 create mode 100644 Documentation/userspace-api/futex2.rst
 create mode 100644 kernel/futex2.c
 create mode 100644 tools/testing/selftests/futex/functional/futex2_waitv.c
 create mode 100644 tools/testing/selftests/futex/include/futex2test.h

-- 
2.32.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/4] futex: Prepare for futex_wait_multiple()
  2021-08-05 19:04 [PATCH 0/4] futex2: Add wait on multiple futexes syscall André Almeida
@ 2021-08-05 19:04 ` André Almeida
  2021-08-18  8:22   ` Thomas Gleixner
  2021-08-05 19:04 ` [PATCH 2/4] futex2: Implement vectorized wait André Almeida
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: André Almeida @ 2021-08-05 19:04 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior
  Cc: kernel, krisman, linux-api, libc-alpha, mtk.manpages,
	Davidlohr Bueso, André Almeida

Make public functions and defines that will be used for
futex_wait_multiple() function in next commit.

Signed-off-by: André Almeida <andrealmeid@collabora.com>
---
 include/linux/futex.h | 60 +++++++++++++++++++++++++++++++++++++++++++
 kernel/futex.c        | 42 +-----------------------------
 2 files changed, 61 insertions(+), 41 deletions(-)

diff --git a/include/linux/futex.h b/include/linux/futex.h
index b70df27d7e85..f7a0f4a4b5f0 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -11,6 +11,22 @@ struct inode;
 struct mm_struct;
 struct task_struct;
 
+/*
+ * Futex flags used to encode options to functions and preserve them across
+ * restarts.
+ */
+#ifdef CONFIG_MMU
+# define FLAGS_SHARED		0x01
+#else
+/*
+ * NOMMU does not have per process address space. Let the compiler optimize
+ * code away.
+ */
+# define FLAGS_SHARED		0x00
+#endif
+#define FLAGS_CLOCKRT		0x02
+#define FLAGS_HAS_TIMEOUT	0x04
+
 /*
  * Futexes are matched on equal values of this key.
  * The key type depends on whether it's a shared or private mapping.
@@ -50,8 +66,52 @@ union futex_key {
 	} both;
 };
 
+/**
+ * struct futex_q - The hashed futex queue entry, one per waiting task
+ * @list:		priority-sorted list of tasks waiting on this futex
+ * @task:		the task waiting on the futex
+ * @lock_ptr:		the hash bucket lock
+ * @key:		the key the futex is hashed on
+ * @pi_state:		optional priority inheritance state
+ * @rt_waiter:		rt_waiter storage for use with requeue_pi
+ * @requeue_pi_key:	the requeue_pi target futex key
+ * @bitset:		bitset for the optional bitmasked wakeup
+ *
+ * We use this hashed waitqueue, instead of a normal wait_queue_entry_t, so
+ * we can wake only the relevant ones (hashed queues may be shared).
+ *
+ * A futex_q has a woken state, just like tasks have TASK_RUNNING.
+ * It is considered woken when plist_node_empty(&q->list) || q->lock_ptr == 0.
+ * The order of wakeup is always to make the first condition true, then
+ * the second.
+ *
+ * PI futexes are typically woken before they are removed from the hash list via
+ * the rt_mutex code. See unqueue_me_pi().
+ */
+struct futex_q {
+	struct plist_node list;
+
+	struct task_struct *task;
+	spinlock_t *lock_ptr;
+	union futex_key key;
+	struct futex_pi_state *pi_state;
+	struct rt_mutex_waiter *rt_waiter;
+	union futex_key *requeue_pi_key;
+	u32 bitset;
+} __randomize_layout;
+
 #define FUTEX_KEY_INIT (union futex_key) { .both = { .ptr = 0ULL } }
 
+static const struct futex_q futex_q_init = {
+	/* list gets initialized in queue_me()*/
+	.key = FUTEX_KEY_INIT,
+	.bitset = FUTEX_BITSET_MATCH_ANY
+};
+
+inline struct hrtimer_sleeper *
+futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
+		  int flags, u64 range_ns);
+
 #ifdef CONFIG_FUTEX
 enum {
 	FUTEX_STATE_OK,
diff --git a/kernel/futex.c b/kernel/futex.c
index 2ecb07575055..c07cb0f747ac 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -187,46 +187,6 @@ struct futex_pi_state {
 	union futex_key key;
 } __randomize_layout;
 
-/**
- * struct futex_q - The hashed futex queue entry, one per waiting task
- * @list:		priority-sorted list of tasks waiting on this futex
- * @task:		the task waiting on the futex
- * @lock_ptr:		the hash bucket lock
- * @key:		the key the futex is hashed on
- * @pi_state:		optional priority inheritance state
- * @rt_waiter:		rt_waiter storage for use with requeue_pi
- * @requeue_pi_key:	the requeue_pi target futex key
- * @bitset:		bitset for the optional bitmasked wakeup
- *
- * We use this hashed waitqueue, instead of a normal wait_queue_entry_t, so
- * we can wake only the relevant ones (hashed queues may be shared).
- *
- * A futex_q has a woken state, just like tasks have TASK_RUNNING.
- * It is considered woken when plist_node_empty(&q->list) || q->lock_ptr == 0.
- * The order of wakeup is always to make the first condition true, then
- * the second.
- *
- * PI futexes are typically woken before they are removed from the hash list via
- * the rt_mutex code. See unqueue_me_pi().
- */
-struct futex_q {
-	struct plist_node list;
-
-	struct task_struct *task;
-	spinlock_t *lock_ptr;
-	union futex_key key;
-	struct futex_pi_state *pi_state;
-	struct rt_mutex_waiter *rt_waiter;
-	union futex_key *requeue_pi_key;
-	u32 bitset;
-} __randomize_layout;
-
-static const struct futex_q futex_q_init = {
-	/* list gets initialized in queue_me()*/
-	.key = FUTEX_KEY_INIT,
-	.bitset = FUTEX_BITSET_MATCH_ANY
-};
-
 /*
  * Hash buckets are shared by all the futex_keys that hash to the same
  * location.  Each key may have multiple futex_q structures, one for each task
@@ -395,7 +355,7 @@ enum futex_access {
  * Return: Initialized hrtimer_sleeper structure or NULL if no timeout
  *	   value given
  */
-static inline struct hrtimer_sleeper *
+inline struct hrtimer_sleeper *
 futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
 		  int flags, u64 range_ns)
 {
-- 
2.32.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2/4] futex2: Implement vectorized wait
  2021-08-05 19:04 [PATCH 0/4] futex2: Add wait on multiple futexes syscall André Almeida
  2021-08-05 19:04 ` [PATCH 1/4] futex: Prepare for futex_wait_multiple() André Almeida
@ 2021-08-05 19:04 ` André Almeida
  2021-08-18 11:00   ` Thomas Gleixner
  2021-08-05 19:04 ` [PATCH 3/4] selftests: futex2: Add waitv test André Almeida
  2021-08-05 19:04 ` [PATCH 4/4] futex2: Documentation: Document futex_waitv() uAPI André Almeida
  3 siblings, 1 reply; 11+ messages in thread
From: André Almeida @ 2021-08-05 19:04 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior
  Cc: kernel, krisman, linux-api, libc-alpha, mtk.manpages,
	Davidlohr Bueso, André Almeida

Add support to wait on multiple futexes. This is the interface
implemented by this syscall:

futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes,
	    unsigned int flags, struct timespec *timo)

struct futex_waitv {
	__u64 val;
	void *uaddr;
	unsigned int flags;
};

Given an array of struct futex_waitv, wait on each uaddr. The thread
wakes if a futex_wake() is performed at any uaddr. The syscall returns
immediately if any waiter has *uaddr != val. *timo is an optional
timeout value for the operation. The flags argument of the syscall
should be used solely for specifying the timeout clock as realtime, if
needed.  Flags for shared futexes, sizes, etc. should be used on the
individual flags of each waiter.

Returns the array index of one of the awakened futexes. There’s no given
information of how many were awakened, or any particular attribute of it
(if it’s the first awakened, if it is of the smaller index...).

Signed-off-by: André Almeida <andrealmeid@collabora.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/compat.h                 |   9 ++
 include/linux/futex.h                  |  15 ++
 include/uapi/asm-generic/unistd.h      |   5 +-
 include/uapi/linux/futex.h             |  17 +++
 init/Kconfig                           |   7 +
 kernel/Makefile                        |   1 +
 kernel/futex.c                         | 182 +++++++++++++++++++++++
 kernel/futex2.c                        | 192 +++++++++++++++++++++++++
 kernel/sys_ni.c                        |   4 +
 11 files changed, 433 insertions(+), 1 deletion(-)
 create mode 100644 kernel/futex2.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ce763a12311c..de4053104ffa 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -452,3 +452,4 @@
 445	i386	landlock_add_rule	sys_landlock_add_rule
 446	i386	landlock_restrict_self	sys_landlock_restrict_self
 447	i386	memfd_secret		sys_memfd_secret
+448	i386	futex_waitv		sys_futex_waitv			compat_sys_futex_waitv
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f6b57799c1ea..ec8659eba1f3 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -369,6 +369,7 @@
 445	common	landlock_add_rule	sys_landlock_add_rule
 446	common	landlock_restrict_self	sys_landlock_restrict_self
 447	common	memfd_secret		sys_memfd_secret
+448	common	futex_waitv		sys_futex_waitv
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/compat.h b/include/linux/compat.h
index c270124e4402..0c38adfc40a2 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -368,6 +368,12 @@ struct compat_robust_list_head {
 	compat_uptr_t			list_op_pending;
 };
 
+struct compat_futex_waitv {
+	compat_u64 val;
+	compat_uptr_t uaddr;
+	compat_uint_t flags;
+};
+
 #ifdef CONFIG_COMPAT_OLD_SIGACTION
 struct compat_old_sigaction {
 	compat_uptr_t			sa_handler;
@@ -690,6 +696,9 @@ asmlinkage long
 compat_sys_get_robust_list(int pid, compat_uptr_t __user *head_ptr,
 			   compat_size_t __user *len_ptr);
 
+asmlinkage long compat_sys_futex_waitv(struct compat_futex_waitv *waiters,
+				       compat_uint_t nr_futexes, compat_uint_t flags,
+				       struct __kernel_timespec __user *timo);
 /* kernel/itimer.c */
 asmlinkage long compat_sys_getitimer(int which,
 				     struct old_itimerval32 __user *it);
diff --git a/include/linux/futex.h b/include/linux/futex.h
index f7a0f4a4b5f0..f7ba0577e0e8 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -100,6 +100,18 @@ struct futex_q {
 	u32 bitset;
 } __randomize_layout;
 
+/**
+ * struct futex_vector - Auxiliary struct for futex_waitv()
+ * @w: Userspace provided data
+ * @q: Kernel side data
+ *
+ * Struct used to build an array with all data need for futex_waitv()
+ */
+struct futex_vector {
+	struct futex_waitv w;
+	struct futex_q q;
+};
+
 #define FUTEX_KEY_INIT (union futex_key) { .both = { .ptr = 0ULL } }
 
 static const struct futex_q futex_q_init = {
@@ -112,6 +124,9 @@ inline struct hrtimer_sleeper *
 futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
 		  int flags, u64 range_ns);
 
+int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
+			struct hrtimer_sleeper *to);
+
 #ifdef CONFIG_FUTEX
 enum {
 	FUTEX_STATE_OK,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a9d6fcd95f42..17cebd1e9384 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -878,8 +878,11 @@ __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
 __SYSCALL(__NR_memfd_secret, sys_memfd_secret)
 #endif
 
+#define __NR_futex_waitv 448
+__SC_COMP(__NR_futex_waitv, sys_futex_waitv, compat_sys_futex_waitv)
+
 #undef __NR_syscalls
-#define __NR_syscalls 448
+#define __NR_syscalls 449
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 235e5b2facaa..daa135bdedda 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -42,6 +42,23 @@
 					 FUTEX_PRIVATE_FLAG)
 #define FUTEX_CMP_REQUEUE_PI_PRIVATE	(FUTEX_CMP_REQUEUE_PI | \
 					 FUTEX_PRIVATE_FLAG)
+#define FUTEX_32	2
+#define FUTEX_SHARED_FLAG 8
+#define FUTEX_SIZE_MASK	0x3
+
+#define FUTEX_WAITV_MAX 128
+
+/**
+ * struct futex_waitv - A waiter for vectorized wait
+ * @val:   Expected value at uaddr
+ * @uaddr: User address to wait on
+ * @flags: Flags for this waiter
+ */
+struct futex_waitv {
+	__u64 val;
+	void __user *uaddr;
+	unsigned int flags;
+};
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
diff --git a/init/Kconfig b/init/Kconfig
index 55f9f7738ebb..4cfc315182ac 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1558,6 +1558,13 @@ config FUTEX
 	  support for "fast userspace mutexes".  The resulting kernel may not
 	  run glibc-based applications correctly.
 
+config FUTEX2
+	bool "Enable futex2 support" if EXPERT
+	depends on FUTEX
+	default y
+	help
+	  Support for futex2 interface.
+
 config FUTEX_PI
 	bool
 	depends on FUTEX && RT_MUTEXES
diff --git a/kernel/Makefile b/kernel/Makefile
index 4df609be42d0..1eaf2af50283 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_PROFILING) += profile.o
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
 obj-y += time/
 obj-$(CONFIG_FUTEX) += futex.o
+obj-$(CONFIG_FUTEX2) += futex2.o
 obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
 obj-$(CONFIG_SMP) += smp.o
 ifneq ($(CONFIG_SMP),y)
diff --git a/kernel/futex.c b/kernel/futex.c
index c07cb0f747ac..a3d4d9f52a3a 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2564,6 +2564,188 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
 	__set_current_state(TASK_RUNNING);
 }
 
+/**
+ * unqueue_multiple() - Remove various futexes from their futex_hash_bucket
+ * @v:	   The list of futexes to unqueue
+ * @count: Number of futexes in the list
+ *
+ * Helper to unqueue a list of futexes. This can't fail.
+ *
+ * Return:
+ *  - >=0 - Index of the last futex that was awoken;
+ *  - -1  - No futex was awoken
+ */
+static int unqueue_multiple(struct futex_vector *v, int count)
+{
+	int ret = -1, i;
+
+	for (i = 0; i < count; i++) {
+		if (!unqueue_me(&v[i].q))
+			ret = i;
+	}
+
+	return ret;
+}
+
+/**
+ * futex_wait_multiple_setup() - Prepare to wait and enqueue multiple futexes
+ * @vs:		The corresponding futex list
+ * @count:	The size of the list
+ * @awaken:	Index of the last awoken futex (return parameter)
+ *
+ * Prepare multiple futexes in a single step and enqueue them. This may fail if
+ * the futex list is invalid or if any futex was already awoken. On success the
+ * task is ready to interruptible sleep.
+ *
+ * Return:
+ *  -  1 - One of the futexes was awaken by another thread
+ *  -  0 - Success
+ *  - <0 - -EFAULT, -EWOULDBLOCK or -EINVAL
+ */
+static int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *awaken)
+{
+	struct futex_hash_bucket *hb;
+	int ret, i;
+	u32 uval;
+
+	/*
+	 * Enqueuing multiple futexes is tricky, because we need to
+	 * enqueue each futex in the list before dealing with the next
+	 * one to avoid deadlocking on the hash bucket.  But, before
+	 * enqueuing, we need to make sure that current->state is
+	 * TASK_INTERRUPTIBLE, so we don't absorb any awake events, which
+	 * cannot be done before the get_futex_key of the next key,
+	 * because it calls get_user_pages, which can sleep.  Thus, we
+	 * fetch the list of futexes keys in two steps, by first pinning
+	 * all the memory keys in the futex key, and only then we read
+	 * each key and queue the corresponding futex.
+	 */
+retry:
+	for (i = 0; i < count; i++) {
+		ret = get_futex_key(vs[i].w.uaddr,
+				    vs[i].w.flags & FUTEX_SHARED_FLAG,
+				    &vs[i].q.key, FUTEX_READ);
+		if (unlikely(ret))
+			return ret;
+	}
+
+	set_current_state(TASK_INTERRUPTIBLE);
+
+	for (i = 0; i < count; i++) {
+		struct futex_q *q = &vs[i].q;
+		struct futex_waitv *waitv = &vs[i].w;
+
+		hb = queue_lock(q);
+		ret = get_futex_value_locked(&uval, waitv->uaddr);
+		if (ret) {
+			/*
+			 * We need to try to handle the fault, which
+			 * cannot be done without sleep, so we need to
+			 * undo all the work already done, to make sure
+			 * we don't miss any wake ups.  Therefore, clean
+			 * up, handle the fault and retry from the
+			 * beginning.
+			 */
+			queue_unlock(hb);
+			__set_current_state(TASK_RUNNING);
+
+			*awaken = unqueue_multiple(vs, i);
+			if (*awaken >= 0)
+				return 1;
+
+			if (get_user(uval, (u32 __user *)waitv->uaddr))
+				return -EINVAL;
+
+			goto retry;
+		}
+
+		if (uval != waitv->val) {
+			queue_unlock(hb);
+			__set_current_state(TASK_RUNNING);
+
+			/*
+			 * If something was already awaken, we can
+			 * safely ignore the error and succeed.
+			 */
+			*awaken = unqueue_multiple(vs, i);
+			if (*awaken >= 0)
+				return 1;
+
+			return -EWOULDBLOCK;
+		}
+
+		/*
+		 * The bucket lock can't be held while dealing with the
+		 * next futex. Queue each futex at this moment so hb can
+		 * be unlocked.
+		 */
+		queue_me(&vs[i].q, hb);
+	}
+	return 0;
+}
+
+/**
+ * futex_wait_multiple() - Prepare to wait on and enqueue several futexes
+ * @vs:		The list of futexes to wait on
+ * @count:	The number of objects
+ * @to:		Timeout before giving up and returning to userspace
+ *
+ * Entry point for the FUTEX_WAIT_MULTIPLE futex operation, this function
+ * sleeps on a group of futexes and returns on the first futex that
+ * triggered, or after the timeout has elapsed.
+ *
+ * Return:
+ *  - >=0 - Hint to the futex that was awoken
+ *  - <0  - On error
+ */
+int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
+			struct hrtimer_sleeper *to)
+{
+	int ret, hint = 0;
+	unsigned int i;
+
+	while (1) {
+		ret = futex_wait_multiple_setup(vs, count, &hint);
+		if (ret) {
+			if (ret > 0) {
+				/* A futex was awaken during setup */
+				ret = hint;
+			}
+			return ret;
+		}
+
+		if (to)
+			hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
+
+		/*
+		 * Avoid sleeping if another thread already tried to
+		 * wake us.
+		 */
+		for (i = 0; i < count; i++) {
+			if (plist_node_empty(&vs[i].q.list))
+				break;
+		}
+
+		if (i == count && (!to || to->task))
+			freezable_schedule();
+
+		__set_current_state(TASK_RUNNING);
+
+		ret = unqueue_multiple(vs, count);
+		if (ret >= 0)
+			return ret;
+
+		if (to && !to->task)
+			return -ETIMEDOUT;
+		else if (signal_pending(current))
+			return -ERESTARTSYS;
+		/*
+		 * The final case is a spurious wakeup, for
+		 * which just retry.
+		 */
+	}
+}
+
 /**
  * futex_wait_setup() - Prepare to wait on a futex
  * @uaddr:	the futex userspace address
diff --git a/kernel/futex2.c b/kernel/futex2.c
new file mode 100644
index 000000000000..19bbd4bf7187
--- /dev/null
+++ b/kernel/futex2.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * futex2 system call interface by André Almeida <andrealmeid@collabora.com>
+ *
+ * Copyright 2021 Collabora Ltd.
+ */
+
+#include <asm/futex.h>
+
+#include <linux/freezer.h>
+#include <linux/syscalls.h>
+
+/* Mask for each futex in futex_waitv list */
+#define FUTEXV_WAITER_MASK (FUTEX_SIZE_MASK | FUTEX_SHARED_FLAG)
+
+/* Mask for sys_futex_waitv flag */
+#define FUTEXV_MASK (FUTEX_CLOCK_REALTIME)
+
+#ifdef CONFIG_COMPAT
+/**
+ * compat_futex_parse_waitv - Parse a waitv array from userspace
+ * @futexv:	Kernel side list of waiters to be filled
+ * @uwaitv:     Userspace list to be parsed
+ * @nr_futexes: Length of futexv
+ *
+ * Return: Error code on failure, pointer to a prepared futexv otherwise
+ */
+static int compat_futex_parse_waitv(struct futex_vector *futexv,
+				    struct compat_futex_waitv __user *uwaitv,
+				    unsigned int nr_futexes)
+{
+	struct compat_futex_waitv aux;
+	unsigned int i;
+
+	for (i = 0; i < nr_futexes; i++) {
+		if (copy_from_user(&aux, &uwaitv[i], sizeof(aux)))
+			return -EFAULT;
+
+		if ((aux.flags & ~FUTEXV_WAITER_MASK) ||
+		    (aux.flags & FUTEX_SIZE_MASK) != FUTEX_32)
+			return -EINVAL;
+
+		futexv[i].w.flags = aux.flags;
+		futexv[i].w.val = aux.val;
+		futexv[i].w.uaddr = compat_ptr(aux.uaddr);
+		futexv[i].q = futex_q_init;
+	}
+
+	return 0;
+}
+
+COMPAT_SYSCALL_DEFINE4(futex_waitv, struct compat_futex_waitv __user *, waiters,
+		       unsigned int, nr_futexes, unsigned int, flags,
+		       struct __kernel_timespec __user *, timo)
+{
+	struct hrtimer_sleeper to;
+	struct futex_vector *futexv;
+	struct timespec64 ts;
+	ktime_t time;
+	int ret;
+
+	if (flags & ~FUTEXV_MASK)
+		return -EINVAL;
+
+	if (!nr_futexes || nr_futexes > FUTEX_WAITV_MAX || !waiters)
+		return -EINVAL;
+
+	if (timo) {
+		int flag_clkid = 0;
+
+		if (get_timespec64(&ts, timo))
+			return -EFAULT;
+
+		if (!timespec64_valid(&ts))
+			return -EINVAL;
+
+		if (flags & FUTEX_CLOCK_REALTIME)
+			flag_clkid = FLAGS_CLOCKRT;
+
+		time = timespec64_to_ktime(ts);
+		futex_setup_timer(&time, &to, flag_clkid, 0);
+	}
+
+	futexv = kcalloc(nr_futexes, sizeof(*futexv), GFP_KERNEL);
+	if (!futexv)
+		return -ENOMEM;
+
+	ret = compat_futex_parse_waitv(futexv, waiters, nr_futexes);
+	if (!ret)
+		ret = futex_wait_multiple(futexv, nr_futexes, timo ? &to : NULL);
+
+	if (timo) {
+		hrtimer_cancel(&to.timer);
+		destroy_hrtimer_on_stack(&to.timer);
+	}
+
+	kfree(futexv);
+	return ret;
+}
+#endif
+
+static int futex_parse_waitv(struct futex_vector *futexv,
+			     struct futex_waitv __user *uwaitv,
+			     unsigned int nr_futexes)
+{
+	struct futex_waitv aux;
+	unsigned int i;
+
+	for (i = 0; i < nr_futexes; i++) {
+		if (copy_from_user(&aux, &uwaitv[i], sizeof(aux)))
+			return -EFAULT;
+
+		if ((aux.flags & ~FUTEXV_WAITER_MASK) ||
+		    (aux.flags & FUTEX_SIZE_MASK) != FUTEX_32)
+			return -EINVAL;
+
+		futexv[i].w.flags = aux.flags;
+		futexv[i].w.val = aux.val;
+		futexv[i].w.uaddr = aux.uaddr;
+		futexv[i].q = futex_q_init;
+	}
+
+	return 0;
+}
+
+/**
+ * sys_futex_waitv - Wait on a list of futexes
+ * @waiters:    List of futexes to wait on
+ * @nr_futexes: Length of futexv
+ * @flags:      Flag for timeout (monotonic/realtime)
+ * @timo:	Optional absolute timeout.
+ *
+ * Given an array of `struct futex_waitv`, wait on each uaddr. The thread wakes
+ * if a futex_wake() is performed at any uaddr. The syscall returns immediately
+ * if any waiter has *uaddr != val. *timo is an optional timeout value for the
+ * operation. Each waiter has individual flags. The `flags` argument for the
+ * syscall should be used solely for specifying the timeout as realtime, if
+ * needed. Flags for shared futexes, sizes, etc. should be used on the
+ * individual flags of each waiter.
+ *
+ * Returns the array index of one of the awaken futexes. There's no given
+ * information of how many were awakened, or any particular attribute of it (if
+ * it's the first awakened, if it is of the smaller index...).
+ */
+SYSCALL_DEFINE4(futex_waitv, struct futex_waitv __user *, waiters,
+		unsigned int, nr_futexes, unsigned int, flags,
+		struct __kernel_timespec __user *, timo)
+{
+	struct hrtimer_sleeper to;
+	struct futex_vector *futexv;
+	struct timespec64 ts;
+	ktime_t time;
+	int ret;
+
+	if (flags & ~FUTEXV_MASK)
+		return -EINVAL;
+
+	if (!nr_futexes || nr_futexes > FUTEX_WAITV_MAX || !waiters)
+		return -EINVAL;
+
+	if (timo) {
+		int flag_clkid = 0;
+
+		if (get_timespec64(&ts, timo))
+			return -EFAULT;
+
+		if (!timespec64_valid(&ts))
+			return -EINVAL;
+
+		if (flags & FUTEX_CLOCK_REALTIME)
+			flag_clkid = FLAGS_CLOCKRT;
+
+		time = timespec64_to_ktime(ts);
+		futex_setup_timer(&time, &to, flag_clkid, 0);
+	}
+
+	futexv = kcalloc(nr_futexes, sizeof(*futexv), GFP_KERNEL);
+	if (!futexv)
+		return -ENOMEM;
+
+	ret = futex_parse_waitv(futexv, waiters, nr_futexes);
+	if (!ret)
+		ret = futex_wait_multiple(futexv, nr_futexes, timo ? &to : NULL);
+
+	if (timo) {
+		hrtimer_cancel(&to.timer);
+		destroy_hrtimer_on_stack(&to.timer);
+	}
+
+	kfree(futexv);
+	return ret;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 30971b1dd4a9..03062eb03669 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -151,6 +151,10 @@ COND_SYSCALL_COMPAT(set_robust_list);
 COND_SYSCALL(get_robust_list);
 COND_SYSCALL_COMPAT(get_robust_list);
 
+/* kernel/futex2.c */
+COND_SYSCALL(futex_waitv);
+COND_SYSCALL_COMPAT(futex_waitv);
+
 /* kernel/hrtimer.c */
 
 /* kernel/itimer.c */
-- 
2.32.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 3/4] selftests: futex2: Add waitv test
  2021-08-05 19:04 [PATCH 0/4] futex2: Add wait on multiple futexes syscall André Almeida
  2021-08-05 19:04 ` [PATCH 1/4] futex: Prepare for futex_wait_multiple() André Almeida
  2021-08-05 19:04 ` [PATCH 2/4] futex2: Implement vectorized wait André Almeida
@ 2021-08-05 19:04 ` André Almeida
  2021-08-05 19:04 ` [PATCH 4/4] futex2: Documentation: Document futex_waitv() uAPI André Almeida
  3 siblings, 0 replies; 11+ messages in thread
From: André Almeida @ 2021-08-05 19:04 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior
  Cc: kernel, krisman, linux-api, libc-alpha, mtk.manpages,
	Davidlohr Bueso, André Almeida

Create a new file to test the waitv mechanism. Test both private and
shared futexes. Wake the last futex in the array, and check if the
return value from futex_waitv() is the right index.

Signed-off-by: André Almeida <andrealmeid@collabora.com>
---
 .../selftests/futex/functional/.gitignore     |   1 +
 .../selftests/futex/functional/Makefile       |   3 +-
 .../selftests/futex/functional/futex2_waitv.c | 154 ++++++++++++++++++
 .../testing/selftests/futex/functional/run.sh |   3 +
 .../selftests/futex/include/futex2test.h      |  72 ++++++++
 5 files changed, 232 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/futex/functional/futex2_waitv.c
 create mode 100644 tools/testing/selftests/futex/include/futex2test.h

diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore
index 0e78b49d0f2f..913ce6f91bb0 100644
--- a/tools/testing/selftests/futex/functional/.gitignore
+++ b/tools/testing/selftests/futex/functional/.gitignore
@@ -8,3 +8,4 @@ futex_wait_uninitialized_heap
 futex_wait_wouldblock
 futex_wait
 futex_requeue
+futex2_waitv
diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
index bd1fec59e010..cb5f1bdd6678 100644
--- a/tools/testing/selftests/futex/functional/Makefile
+++ b/tools/testing/selftests/futex/functional/Makefile
@@ -17,7 +17,8 @@ TEST_GEN_FILES := \
 	futex_wait_uninitialized_heap \
 	futex_wait_private_mapped_file \
 	futex_wait \
-	futex_requeue
+	futex_requeue \
+	futex2_waitv
 
 TEST_PROGS := run.sh
 
diff --git a/tools/testing/selftests/futex/functional/futex2_waitv.c b/tools/testing/selftests/futex/functional/futex2_waitv.c
new file mode 100644
index 000000000000..562eec6e946c
--- /dev/null
+++ b/tools/testing/selftests/futex/functional/futex2_waitv.c
@@ -0,0 +1,154 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/******************************************************************************
+ *
+ *   Copyright Collabora Ltd., 2021
+ *
+ * DESCRIPTION
+ *	Test waitv/wake mechanism of futex2, using 32bit sized futexes.
+ *
+ * AUTHOR
+ *	André Almeida <andrealmeid@collabora.com>
+ *
+ * HISTORY
+ *      2021-Feb-5: Initial version by André <andrealmeid@collabora.com>
+ *
+ *****************************************************************************/
+
+#include <errno.h>
+#include <error.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include <pthread.h>
+#include <sys/shm.h>
+#include "futex2test.h"
+#include "logging.h"
+
+#define TEST_NAME "futex2-wait"
+#define WAKE_WAIT_US 10000
+#define NR_FUTEXES 30
+struct futex_waitv waitv[NR_FUTEXES];
+u_int32_t futexes[NR_FUTEXES] = {0};
+
+void usage(char *prog)
+{
+	printf("Usage: %s\n", prog);
+	printf("  -c	Use color\n");
+	printf("  -h	Display this help message\n");
+	printf("  -v L	Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+	       VQUIET, VCRITICAL, VINFO);
+}
+
+void *waiterfn(void *arg)
+{
+	struct timespec64 to64;
+	int res;
+
+	/* setting absolute timeout for futex2 */
+	if (gettime64(CLOCK_MONOTONIC, &to64))
+		error("gettime64 failed\n", errno);
+
+	to64.tv_sec++;
+
+	res = futex2_waitv(waitv, NR_FUTEXES, 0, &to64);
+	if (res < 0) {
+		ksft_test_result_fail("futex2_waitv returned: %d %s\n",
+				      errno, strerror(errno));
+	} else if (res != NR_FUTEXES - 1) {
+		ksft_test_result_fail("futex2_waitv returned: %d, expecting %d\n",
+				      res, NR_FUTEXES - 1);
+	}
+
+	return NULL;
+}
+
+int main(int argc, char *argv[])
+{
+	pthread_t waiter;
+	int res, ret = RET_PASS;
+	int c, i;
+
+	while ((c = getopt(argc, argv, "cht:v:")) != -1) {
+		switch (c) {
+		case 'c':
+			log_color(1);
+			break;
+		case 'h':
+			usage(basename(argv[0]));
+			exit(0);
+		case 'v':
+			log_verbosity(atoi(optarg));
+			break;
+		default:
+			usage(basename(argv[0]));
+			exit(1);
+		}
+	}
+
+	ksft_print_header();
+	ksft_set_plan(2);
+	ksft_print_msg("%s: Test FUTEX2_WAITV\n",
+		       basename(argv[0]));
+
+	for (i = 0; i < NR_FUTEXES; i++) {
+		waitv[i].uaddr = &futexes[i];
+		waitv[i].flags = FUTEX_32;
+		waitv[i].val = 0;
+	}
+
+	/* Private waitv */
+	if (pthread_create(&waiter, NULL, waiterfn, NULL))
+		error("pthread_create failed\n", errno);
+
+	usleep(WAKE_WAIT_US);
+
+	res = futex_wake(waitv[NR_FUTEXES - 1].uaddr, 1, FUTEX_PRIVATE_FLAG);
+	if (res != 1) {
+		ksft_test_result_fail("futex2_waitv private returned: %d %s\n",
+				      res ? errno : res,
+				      res ? strerror(errno) : "");
+		ret = RET_FAIL;
+	} else {
+		ksft_test_result_pass("futex2_waitv private\n");
+	}
+
+	/* Shared waitv */
+	for (i = 0; i < NR_FUTEXES; i++) {
+		int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666);
+
+		if (shm_id < 0) {
+			perror("shmget");
+			exit(1);
+		}
+
+		unsigned int *shared_data = shmat(shm_id, NULL, 0);
+
+		*shared_data = 0;
+		waitv[i].uaddr = shared_data;
+		waitv[i].flags = FUTEX_32 | FUTEX_SHARED_FLAG;
+		waitv[i].val = 0;
+	}
+
+	if (pthread_create(&waiter, NULL, waiterfn, NULL))
+		error("pthread_create failed\n", errno);
+
+	usleep(WAKE_WAIT_US);
+
+	res = futex_wake(waitv[NR_FUTEXES - 1].uaddr, 1, 0);
+	if (res != 1) {
+		ksft_test_result_fail("futex2_waitv shared returned: %d %s\n",
+				      res ? errno : res,
+				      res ? strerror(errno) : "");
+		ret = RET_FAIL;
+	} else {
+		ksft_test_result_pass("futex2_waitv shared\n");
+	}
+
+	for (i = 0; i < NR_FUTEXES; i++)
+		shmdt(waitv[i].uaddr);
+
+	ksft_print_cnts();
+	return ret;
+}
diff --git a/tools/testing/selftests/futex/functional/run.sh b/tools/testing/selftests/futex/functional/run.sh
index 11a9d62290f5..e2b4b4747368 100755
--- a/tools/testing/selftests/futex/functional/run.sh
+++ b/tools/testing/selftests/futex/functional/run.sh
@@ -79,3 +79,6 @@ echo
 
 echo
 ./futex_requeue $COLOR
+
+echo
+./futex2_waitv $COLOR
diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h
new file mode 100644
index 000000000000..dbc29d8e8187
--- /dev/null
+++ b/tools/testing/selftests/futex/include/futex2test.h
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/******************************************************************************
+ *
+ *   Copyright Collabora Ltd., 2021
+ *
+ * DESCRIPTION
+ *	Futex2 library addons for old futex library
+ *
+ * AUTHOR
+ *	André Almeida <andrealmeid@collabora.com>
+ *
+ * HISTORY
+ *      2021-Feb-5: Initial version by André <andrealmeid@collabora.com>
+ *
+ *****************************************************************************/
+#include "futextest.h"
+#include <stdio.h>
+
+#define NSEC_PER_SEC	1000000000L
+
+#ifndef FUTEX_8
+# define FUTEX_8	0
+#endif
+#ifndef FUTEX_16
+# define FUTEX_16	1
+#endif
+#ifndef FUTEX_32
+# define FUTEX_32	2
+#endif
+
+#ifndef FUTEX_SHARED_FLAG
+#define FUTEX_SHARED_FLAG 8
+#endif
+
+/*
+ * - Y2038 section for 32-bit applications -
+ *
+ * Remove this when glibc is ready for y2038. Then, always compile with
+ * `-DTIME_BITS=64` or `-D__USE_TIME_BITS64`. glibc will provide both
+ * timespec64 and clock_gettime64 so we won't need to define here.
+ */
+#if defined(__i386__) || __TIMESIZE == 32
+# define NR_gettime __NR_clock_gettime64
+#else
+# define NR_gettime __NR_clock_gettime
+#endif
+
+struct timespec64 {
+	long long tv_sec;	/* seconds */
+	long long tv_nsec;	/* nanoseconds */
+};
+
+int gettime64(clock_t clockid, struct timespec64 *tv)
+{
+	return syscall(NR_gettime, clockid, tv);
+}
+/*
+ * - End of Y2038 section -
+ */
+
+/**
+ * futex2_waitv - Wait at multiple futexes, wake on any
+ * @waiters:    Array of waiters
+ * @nr_waiters: Length of waiters array
+ * @flags: Operation flags
+ * @timo:  Optional timeout for operation
+ */
+static inline int futex2_waitv(volatile struct futex_waitv *waiters, unsigned long nr_waiters,
+			      unsigned long flags, struct timespec64 *timo)
+{
+	return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo);
+}
-- 
2.32.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 4/4] futex2: Documentation: Document futex_waitv() uAPI
  2021-08-05 19:04 [PATCH 0/4] futex2: Add wait on multiple futexes syscall André Almeida
                   ` (2 preceding siblings ...)
  2021-08-05 19:04 ` [PATCH 3/4] selftests: futex2: Add waitv test André Almeida
@ 2021-08-05 19:04 ` André Almeida
  3 siblings, 0 replies; 11+ messages in thread
From: André Almeida @ 2021-08-05 19:04 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior
  Cc: kernel, krisman, linux-api, libc-alpha, mtk.manpages,
	Davidlohr Bueso, André Almeida

Create userspace documentation for futex_waitv() syscall, detailing how
the arguments are used.

Signed-off-by: André Almeida <andrealmeid@collabora.com>
---
 Documentation/userspace-api/futex2.rst | 79 ++++++++++++++++++++++++++
 Documentation/userspace-api/index.rst  |  1 +
 2 files changed, 80 insertions(+)
 create mode 100644 Documentation/userspace-api/futex2.rst

diff --git a/Documentation/userspace-api/futex2.rst b/Documentation/userspace-api/futex2.rst
new file mode 100644
index 000000000000..829805c24a51
--- /dev/null
+++ b/Documentation/userspace-api/futex2.rst
@@ -0,0 +1,79 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======
+futex2
+======
+
+:Author: André Almeida <andrealmeid@collabora.com>
+
+
+futex, or fast user mutex, is a set of syscalls to allow userspace to create
+performant synchronization mechanisms, such as mutexes, semaphores and
+conditional variables in userspace. C standard libraries, like glibc, uses it
+as a means to implement more high level interfaces like pthreads.
+
+futex2 is a followup version of the initial futex syscall, designed to overcome
+limitations of the original interface.
+
+User API
+========
+
+``futex_waitv()``
+-----------------
+
+Wait on an array of futexes, wake on any::
+
+  futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes, unsigned int flags, struct timespec *timo)
+
+  struct futex_waitv {
+          uint64_t val;
+          void *uaddr;
+          unsigned int flags;
+  };
+
+Userspace set an array of struct futex_waitv (up to a max of 128 entries),
+using ``uaddr`` for the address to wait for, ``val`` for the expected value
+and ``flags`` to specify the type (shared, private) and size of futex. The
+pointer for the first item of the array is passed as ``waiters``. An invalid
+address for ``waiters`` or for any ``uaddr`` returns ``-EFAULT``.
+
+``nr_futexes`` specifies the size of the array. Numbers out of [1, 128]
+interval will make the syscall return ``-EINVAL``.
+
+``flags`` can be used to set the timeout clock.
+
+For each entry in ``waiters`` array, the current value at ``uaddr`` is compared
+to ``val``. If it's different, the syscall undo all the work done so far and
+return ``-EAGAIN``. If all tests and verifications succeeds, syscall waits until
+one of the following happens:
+
+- The timeout expires, returning ``-ETIMEOUT``.
+- A signal was sent to the sleeping task, returning ``-ERESTARTSYS``.
+- Some futex at the list was awaken, returning the index of some waked futex.
+
+An example of how to use the interface can be found at ``tools/testing/selftests/futex/futenctional/futex2_waitv.c``.
+
+Timeout
+-------
+
+For every operation that has a ``struct timespec timo`` argument, it is an
+optional argument that points to an absolute timeout. By default, it's measured
+against ``CLOCK_MONOTONIC``, but the flag ``FUTEX_CLOCK_REALTIME`` can be used
+to measure it against ``CLOCK_REALTIME``. This syscall accepts only 64bit
+timespec structs.
+
+Types of futex
+--------------
+
+A futex can be either private or shared. Private is used for processes that
+shares the same memory space and the virtual address of the futex will be the
+same for all processes. This allows for optimizations in the kernel and is the
+default type of a futex. For processes that doesn't share the same memory space
+and therefore can have different virtual addresses for the same futex (using,
+for instance, a file-backed shared memory) requires different internal
+mechanisms to be get properly enqueued in the kernel and need to set the flag
+``FUTEX_SHARED_FLAG``.
+
+Futexes can be of different sizes: 8, 16, 32 or 64 bits. Currently, the only
+supported one is 32 bit sized futex, and it need to be specified using
+``FUTEX_32`` flag.
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 0b5eefed027e..dac7189d9752 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -27,6 +27,7 @@ place where this information is gathered.
    iommu
    media/index
    sysfs-platform_profile
+   futex2
 
 .. only::  subproject and html
 
-- 
2.32.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/4] futex: Prepare for futex_wait_multiple()
  2021-08-05 19:04 ` [PATCH 1/4] futex: Prepare for futex_wait_multiple() André Almeida
@ 2021-08-18  8:22   ` Thomas Gleixner
  0 siblings, 0 replies; 11+ messages in thread
From: Thomas Gleixner @ 2021-08-18  8:22 UTC (permalink / raw)
  To: André Almeida, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior
  Cc: kernel, krisman, linux-api, libc-alpha, mtk.manpages,
	Davidlohr Bueso, André Almeida

Andre,

On Thu, Aug 05 2021 at 16:04, André Almeida wrote:
> +/*
> + * Futex flags used to encode options to functions and preserve them across
> + * restarts.
> + */
> +#ifdef CONFIG_MMU
> +# define FLAGS_SHARED		0x01
> +#else
> +/*
> + * NOMMU does not have per process address space. Let the compiler optimize
> + * code away.
> + */
> +# define FLAGS_SHARED		0x00
> +#endif
> +#define FLAGS_CLOCKRT		0x02
> +#define FLAGS_HAS_TIMEOUT	0x04
> +
>  /*
>   * Futexes are matched on equal values of this key.
>   * The key type depends on whether it's a shared or private mapping.
> @@ -50,8 +66,52 @@ union futex_key {
>  	} both;
>  };
>  
> +/**
> + * struct futex_q - The hashed futex queue entry, one per waiting task
> + * @list:		priority-sorted list of tasks waiting on this futex
> + * @task:		the task waiting on the futex
> + * @lock_ptr:		the hash bucket lock
> + * @key:		the key the futex is hashed on
> + * @pi_state:		optional priority inheritance state
> + * @rt_waiter:		rt_waiter storage for use with requeue_pi
> + * @requeue_pi_key:	the requeue_pi target futex key
> + * @bitset:		bitset for the optional bitmasked wakeup
> + *
> + * We use this hashed waitqueue, instead of a normal wait_queue_entry_t, so
> + * we can wake only the relevant ones (hashed queues may be shared).
> + *
> + * A futex_q has a woken state, just like tasks have TASK_RUNNING.
> + * It is considered woken when plist_node_empty(&q->list) || q->lock_ptr == 0.
> + * The order of wakeup is always to make the first condition true, then
> + * the second.
> + *
> + * PI futexes are typically woken before they are removed from the hash list via
> + * the rt_mutex code. See unqueue_me_pi().
> + */
> +struct futex_q {
> +	struct plist_node list;
> +
> +	struct task_struct *task;
> +	spinlock_t *lock_ptr;
> +	union futex_key key;
> +	struct futex_pi_state *pi_state;
> +	struct rt_mutex_waiter *rt_waiter;
> +	union futex_key *requeue_pi_key;
> +	u32 bitset;
> +} __randomize_layout;
> +
>  #define FUTEX_KEY_INIT (union futex_key) { .both = { .ptr = 0ULL } }
>  
> +static const struct futex_q futex_q_init = {
> +	/* list gets initialized in queue_me()*/
> +	.key = FUTEX_KEY_INIT,
> +	.bitset = FUTEX_BITSET_MATCH_ANY
> +};
> +
> +inline struct hrtimer_sleeper *
> +futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
> +		  int flags, u64 range_ns);
> +

None of these things belong into the global header. Please move them to
kernel/futex.h.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/4] futex2: Implement vectorized wait
  2021-08-05 19:04 ` [PATCH 2/4] futex2: Implement vectorized wait André Almeida
@ 2021-08-18 11:00   ` Thomas Gleixner
  2021-08-18 16:20     ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Gleixner @ 2021-08-18 11:00 UTC (permalink / raw)
  To: André Almeida, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior
  Cc: kernel, krisman, linux-api, libc-alpha, mtk.manpages,
	Davidlohr Bueso, André Almeida

Andre,

On Thu, Aug 05 2021 at 16:04, André Almeida wrote:
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/compat.h                 |   9 ++
>  include/linux/futex.h                  |  15 ++
>  include/uapi/asm-generic/unistd.h      |   5 +-
>  include/uapi/linux/futex.h             |  17 +++
>  init/Kconfig                           |   7 +
>  kernel/Makefile                        |   1 +
>  kernel/futex.c                         | 182 +++++++++++++++++++++++
>  kernel/futex2.c                        | 192 +++++++++++++++++++++++++
>  kernel/sys_ni.c                        |   4 +

please split this in implementation and enabling on x86. 

> index c270124e4402..0c38adfc40a2 100644
> --- a/include/linux/compat.h
> +++ b/include/linux/compat.h
> @@ -368,6 +368,12 @@ struct compat_robust_list_head {
>  	compat_uptr_t			list_op_pending;
>  };
>  
> +struct compat_futex_waitv {
> +	compat_u64 val;

Why do we need a u64 here? u32 is what futexes are based on.

> +/**
> + * struct futex_vector - Auxiliary struct for futex_waitv()
> + * @w: Userspace provided data
> + * @q: Kernel side data
> + *
> + * Struct used to build an array with all data need for futex_waitv()
> + */
> +struct futex_vector {
> +	struct futex_waitv w;
> +	struct futex_q q;
> +};

No point in exposing this globaly.

> diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
> index 235e5b2facaa..daa135bdedda 100644
> --- a/include/uapi/linux/futex.h
> +++ b/include/uapi/linux/futex.h
> @@ -42,6 +42,23 @@
>  					 FUTEX_PRIVATE_FLAG)
>  #define FUTEX_CMP_REQUEUE_PI_PRIVATE	(FUTEX_CMP_REQUEUE_PI | \
>  					 FUTEX_PRIVATE_FLAG)
> +#define FUTEX_32	2
> +#define FUTEX_SHARED_FLAG 8
> +#define FUTEX_SIZE_MASK	0x3
> +
> +#define FUTEX_WAITV_MAX 128

Style nitpick. All the defines above this are layed out tabular. Please
keep that.

Aside of that these constants look like random numbers and lack any form
of explanation.

Plus I don't see a reason why this new stuff wants FUTEX_SHARED_FLAG
which is the opposite of the FUTEX_PRIVATE_FLAG for the existing futex
interface. Can we please make this stuff consistent instead of creating
more confusion?

> +
> +/**
> + * struct futex_waitv - A waiter for vectorized wait
> + * @val:   Expected value at uaddr
> + * @uaddr: User address to wait on
> + * @flags: Flags for this waiter
> + */
> +struct futex_waitv {
> +	__u64 val;

Again. Why u64?

> +	void __user *uaddr;
> +	unsigned int flags;
> +};
>  
> +/**
> + * unqueue_multiple() - Remove various futexes from their futex_hash_bucket

s/()// and what are the underscores in futex_hash_bucket for?

> +/**
> + * futex_wait_multiple_setup() - Prepare to wait and enqueue multiple futexes
> + * @vs:		The corresponding futex list

To what is this corresponding?

> + * @count:	The size of the list
> + * @awaken:	Index of the last awoken futex (return parameter)

What's the purpose of this?

> +static int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *awaken)
> +{
> +	struct futex_hash_bucket *hb;
> +	int ret, i;
> +	u32 uval;
> +
> +	/*
> +	 * Enqueuing multiple futexes is tricky, because we need to
> +	 * enqueue each futex in the list before dealing with the next
> +	 * one to avoid deadlocking on the hash bucket.  But, before
> +	 * enqueuing, we need to make sure that current->state is
> +	 * TASK_INTERRUPTIBLE, so we don't absorb any awake events, which
> +	 * cannot be done before the get_futex_key of the next key,
> +	 * because it calls get_user_pages, which can sleep.  Thus, we
> +	 * fetch the list of futexes keys in two steps, by first pinning
> +	 * all the memory keys in the futex key, and only then we read
> +	 * each key and queue the corresponding futex.
> +	 */
> +retry:
> +	for (i = 0; i < count; i++) {
> +		ret = get_futex_key(vs[i].w.uaddr,
> +				    vs[i].w.flags & FUTEX_SHARED_FLAG,
> +				    &vs[i].q.key, FUTEX_READ);
> +		if (unlikely(ret))
> +			return ret;
> +	}
> +
> +	set_current_state(TASK_INTERRUPTIBLE);
> +
> +	for (i = 0; i < count; i++) {
> +		struct futex_q *q = &vs[i].q;
> +		struct futex_waitv *waitv = &vs[i].w;

Please order them reverse.

> +
> +		hb = queue_lock(q);
> +		ret = get_futex_value_locked(&uval, waitv->uaddr);
> +		if (ret) {
> +			/*
> +			 * We need to try to handle the fault, which
> +			 * cannot be done without sleep, so we need to
> +			 * undo all the work already done, to make sure
> +			 * we don't miss any wake ups.  Therefore, clean
> +			 * up, handle the fault and retry from the
> +			 * beginning.
> +			 */
> +			queue_unlock(hb);
> +			__set_current_state(TASK_RUNNING);
> +
> +			*awaken = unqueue_multiple(vs, i);
> +			if (*awaken >= 0)
> +				return 1;
> +
> +			if (get_user(uval, (u32 __user *)waitv->uaddr))

This type cast is horrible.

> +				return -EINVAL;

-EFAULT

> +
> +			goto retry;

Why a full retry if the futexes are private?

> +		}
> +
> +		if (uval != waitv->val) {

Comparison between u32 and u64 ...

> +			queue_unlock(hb);
> +			__set_current_state(TASK_RUNNING);
> +
> +			/*
> +			 * If something was already awaken, we can
> +			 * safely ignore the error and succeed.
> +			 */
> +			*awaken = unqueue_multiple(vs, i);
> +			if (*awaken >= 0)
> +				return 1;
> +
> +			return -EWOULDBLOCK;
> +		}
> +
> +		/*
> +		 * The bucket lock can't be held while dealing with the
> +		 * next futex. Queue each futex at this moment so hb can
> +		 * be unlocked.
> +		 */
> +		queue_me(&vs[i].q, hb);

So the two error pathes are doing both

> +			queue_unlock(hb);
> +			__set_current_state(TASK_RUNNING);
> +
> +			*awaken = unqueue_multiple(vs, i);
> +			if (*awaken >= 0)
> +				return 1;

This can be consolidated into:

	if (!ret && uval == waitv->val) {
        	queue_me();
                continue;
        }
                
	queue_unlock(hb);
	__set_current_state(TASK_RUNNING);
	*awaken = unqueue_multiple(vs, i);
	if (*awaken >= 0)
		return 1;

        if (uval != waitv->val)
        	return -EWOULDBLOCK;
        ....

> +	}
> +	return 0;
> +}
> +
> +/**
> + * futex_wait_multiple() - Prepare to wait on and enqueue several futexes
> + * @vs:		The list of futexes to wait on
> + * @count:	The number of objects
> + * @to:		Timeout before giving up and returning to userspace
> + *
> + * Entry point for the FUTEX_WAIT_MULTIPLE futex operation, this function
> + * sleeps on a group of futexes and returns on the first futex that
> + * triggered, or after the timeout has elapsed.

futexes can't trigger.

> + * Return:
> + *  - >=0 - Hint to the futex that was awoken
> + *  - <0  - On error
> + */
> +int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
> +			struct hrtimer_sleeper *to)
> +{
> +	int ret, hint = 0;
> +	unsigned int i;
> +
> +	while (1) {
> +		ret = futex_wait_multiple_setup(vs, count, &hint);
> +		if (ret) {
> +			if (ret > 0) {
> +				/* A futex was awaken during setup */
> +				ret = hint;
> +			}
> +			return ret;
> +		}
> +
> +		if (to)
> +			hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);

hrtimer_sleeper_start_expires() and also why is this inside of the loop?

> +
> +		/*
> +		 * Avoid sleeping if another thread already tried to
> +		 * wake us.
> +		 */
> +		for (i = 0; i < count; i++) {
> +			if (plist_node_empty(&vs[i].q.list))
> +				break;
> +		}
> +
> +		if (i == count && (!to || to->task))
> +			freezable_schedule();

TBH, this sleeping condition along with the loop above is
unreadable. It can be nicely split out:

static void futex_sleep_multiple(struct futex_vector *vs, unsigned int count,
				 struct hrtimer_sleeper *to)
{
	if (to && !to->task)
        	return;
                
	for (; count; count--, vs++) {
		if (!READ_ONCE(vs->q.lock_ptr))
			return;
	}

	freezable_schedule();
}

> +
> +		__set_current_state(TASK_RUNNING);
> +
> +		ret = unqueue_multiple(vs, count);
> +		if (ret >= 0)
> +			return ret;
> +		if (to && !to->task)
> +			return -ETIMEDOUT;
> +		else if (signal_pending(current))
> +			return -ERESTARTSYS;
> +		/*
> +		 * The final case is a spurious wakeup, for
> +		 * which just retry.
> +		 */
> +	}
> +}
> +

>  /**
>   * futex_wait_setup() - Prepare to wait on a futex
>   * @uaddr:	the futex userspace address
> diff --git a/kernel/futex2.c b/kernel/futex2.c
> new file mode 100644
> index 000000000000..19bbd4bf7187
> --- /dev/null
> +++ b/kernel/futex2.c
> @@ -0,0 +1,192 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * futex2 system call interface by André Almeida <andrealmeid@collabora.com>

I don't see a futex2 system call anywhere

> + * Copyright 2021 Collabora Ltd.
> + */
> +
> +#include <asm/futex.h>
> +
> +#include <linux/freezer.h>
> +#include <linux/syscalls.h>
> +
> +/* Mask for each futex in futex_waitv list */
> +#define FUTEXV_WAITER_MASK (FUTEX_SIZE_MASK | FUTEX_SHARED_FLAG)

This piggy packs on the existing futex code, so what is this size thing
going to help?

> +/* Mask for sys_futex_waitv flag */
> +#define FUTEXV_MASK (FUTEX_CLOCK_REALTIME)
> +
> +#ifdef CONFIG_COMPAT
> +/**
> + * compat_futex_parse_waitv - Parse a waitv array from userspace
> + * @futexv:	Kernel side list of waiters to be filled
> + * @uwaitv:     Userspace list to be parsed
> + * @nr_futexes: Length of futexv
> + *
> + * Return: Error code on failure, pointer to a prepared futexv otherwise

The int return value becomes magically a pointer ?

> + */
> +static int compat_futex_parse_waitv(struct futex_vector *futexv,
> +				    struct compat_futex_waitv __user *uwaitv,
> +				    unsigned int nr_futexes)
> +{
> +	struct compat_futex_waitv aux;
> +	unsigned int i;
> +
> +	for (i = 0; i < nr_futexes; i++) {
> +		if (copy_from_user(&aux, &uwaitv[i], sizeof(aux)))
> +			return -EFAULT;
> +
> +		if ((aux.flags & ~FUTEXV_WAITER_MASK) ||
> +		    (aux.flags & FUTEX_SIZE_MASK) != FUTEX_32)
> +			return -EINVAL;
> +
> +		futexv[i].w.flags = aux.flags;
> +		futexv[i].w.val = aux.val;
> +		futexv[i].w.uaddr = compat_ptr(aux.uaddr);
> +		futexv[i].q = futex_q_init;
> +	}
> +
> +	return 0;
> +}
> +
> +COMPAT_SYSCALL_DEFINE4(futex_waitv, struct compat_futex_waitv __user *, waiters,
> +		       unsigned int, nr_futexes, unsigned int, flags,
> +		       struct __kernel_timespec __user *, timo)
> +{
> +	struct hrtimer_sleeper to;
> +	struct futex_vector *futexv;
> +	struct timespec64 ts;
> +	ktime_t time;
> +	int ret;
> +
> +	if (flags & ~FUTEXV_MASK)
> +		return -EINVAL;
> +
> +	if (!nr_futexes || nr_futexes > FUTEX_WAITV_MAX || !waiters)
> +		return -EINVAL;
> +
> +	if (timo) {
> +		int flag_clkid = 0;
> +
> +		if (get_timespec64(&ts, timo))
> +			return -EFAULT;
> +
> +		if (!timespec64_valid(&ts))
> +			return -EINVAL;
> +
> +		if (flags & FUTEX_CLOCK_REALTIME)
> +			flag_clkid = FLAGS_CLOCKRT;
> +
> +		time = timespec64_to_ktime(ts);

What's the point of open coding futex_init_timeout() and thereby failing to
do the namespace adjustment for CLOCK_MONOTONIC?

> +		futex_setup_timer(&time, &to, flag_clkid, 0);
> +	}
> +
> +	futexv = kcalloc(nr_futexes, sizeof(*futexv), GFP_KERNEL);
> +	if (!futexv)
> +		return -ENOMEM;
> +
> +	ret = compat_futex_parse_waitv(futexv, waiters, nr_futexes);
> +	if (!ret)
> +		ret = futex_wait_multiple(futexv, nr_futexes, timo ? &to : NULL);
> +
> +	if (timo) {
> +		hrtimer_cancel(&to.timer);
> +		destroy_hrtimer_on_stack(&to.timer);
> +	}
> +
> +	kfree(futexv);
> +	return ret;
> +}
> +#endif
> +
> +static int futex_parse_waitv(struct futex_vector *futexv,
> +			     struct futex_waitv __user *uwaitv,
> +			     unsigned int nr_futexes)
> +{
> +	struct futex_waitv aux;
> +	unsigned int i;
> +
> +	for (i = 0; i < nr_futexes; i++) {
> +		if (copy_from_user(&aux, &uwaitv[i], sizeof(aux)))
> +			return -EFAULT;
> +
> +		if ((aux.flags & ~FUTEXV_WAITER_MASK) ||
> +		    (aux.flags & FUTEX_SIZE_MASK) != FUTEX_32)
> +			return -EINVAL;
> +
> +		futexv[i].w.flags = aux.flags;
> +		futexv[i].w.val = aux.val;
> +		futexv[i].w.uaddr = aux.uaddr;
> +		futexv[i].q = futex_q_init;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * sys_futex_waitv - Wait on a list of futexes
> + * @waiters:    List of futexes to wait on
> + * @nr_futexes: Length of futexv
> + * @flags:      Flag for timeout (monotonic/realtime)
> + * @timo:	Optional absolute timeout.
> + *
> + * Given an array of `struct futex_waitv`, wait on each uaddr. The thread wakes
> + * if a futex_wake() is performed at any uaddr. The syscall returns immediately
> + * if any waiter has *uaddr != val. *timo is an optional timeout value for the
> + * operation. Each waiter has individual flags. The `flags` argument for the
> + * syscall should be used solely for specifying the timeout as realtime, if
> + * needed. Flags for shared futexes, sizes, etc. should be used on the
> + * individual flags of each waiter.
> + *
> + * Returns the array index of one of the awaken futexes. There's no given
> + * information of how many were awakened, or any particular attribute of it (if
> + * it's the first awakened, if it is of the smaller index...).
> + */
> +SYSCALL_DEFINE4(futex_waitv, struct futex_waitv __user *, waiters,
> +		unsigned int, nr_futexes, unsigned int, flags,
> +		struct __kernel_timespec __user *, timo)
> +{
> +	struct hrtimer_sleeper to;
> +	struct futex_vector *futexv;
> +	struct timespec64 ts;
> +	ktime_t time;
> +	int ret;
> +
> +	if (flags & ~FUTEXV_MASK)
> +		return -EINVAL;
> +
> +	if (!nr_futexes || nr_futexes > FUTEX_WAITV_MAX || !waiters)
> +		return -EINVAL;
> +
> +	if (timo) {
> +		int flag_clkid = 0;
> +
> +		if (get_timespec64(&ts, timo))
> +			return -EFAULT;
> +
> +		if (!timespec64_valid(&ts))
> +			return -EINVAL;
> +
> +		if (flags & FUTEX_CLOCK_REALTIME)
> +			flag_clkid = FLAGS_CLOCKRT;
> +
> +		time = timespec64_to_ktime(ts);
> +		futex_setup_timer(&time, &to, flag_clkid, 0);

And of course we need a copy of the same here again.

> +	}
> +
> +	futexv = kcalloc(nr_futexes, sizeof(*futexv), GFP_KERNEL);
> +	if (!futexv)
> +		return -ENOMEM;
> +
> +	ret = futex_parse_waitv(futexv, waiters, nr_futexes);
> +	if (!ret)
> +		ret = futex_wait_multiple(futexv, nr_futexes, timo ? &to : NULL);
> +
> +	if (timo) {
> +		hrtimer_cancel(&to.timer);
> +		destroy_hrtimer_on_stack(&to.timer);
> +	}
> +
> +	kfree(futexv);
> +	return ret;

So the only difference of the compat code and the non compat version is
the pointer size in struct futex_waitv.

struct futex_waitv {
       __u32	val;
       __u32	flags;
       __u64	uaddr;
};

which gets rid of all the code duplication and special casing of compat.

Thanks,

        tglx
    

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/4] futex2: Implement vectorized wait
  2021-08-18 11:00   ` Thomas Gleixner
@ 2021-08-18 16:20     ` Peter Zijlstra
  2021-08-18 16:34       ` André Almeida
  2021-08-18 19:45       ` Thomas Gleixner
  0 siblings, 2 replies; 11+ messages in thread
From: Peter Zijlstra @ 2021-08-18 16:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: André Almeida, Ingo Molnar, Darren Hart, linux-kernel,
	Steven Rostedt, Sebastian Andrzej Siewior, kernel, krisman,
	linux-api, libc-alpha, mtk.manpages, Davidlohr Bueso

On Wed, Aug 18, 2021 at 01:00:57PM +0200, Thomas Gleixner wrote:
> > +/**
> > + * struct futex_waitv - A waiter for vectorized wait
> > + * @val:   Expected value at uaddr
> > + * @uaddr: User address to wait on
> > + * @flags: Flags for this waiter
> > + */
> > +struct futex_waitv {
> > +	__u64 val;
> 
> Again. Why u64?

So I think the idea was that if we're going to do new syscalls, we
should cater for future extentions, one of which was 64bit futexes (for
64bit archs) (along with u{8,16,32})

The previous set of patches implemented a more complete replacement ABI
-- which I rather liked, however the implementation was completely
disjoint from the existing futexes, which was a non-starter for me.

Anyway, yes, current futexes are u32, but if we want to ever do u64
futexes, we should either do this syscall with a u64, or already plan to
retire the whole syscall.

Obiously this would've made good Changelog material, but alas it wasn't
there.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/4] futex2: Implement vectorized wait
  2021-08-18 16:20     ` Peter Zijlstra
@ 2021-08-18 16:34       ` André Almeida
  2021-08-18 19:45       ` Thomas Gleixner
  1 sibling, 0 replies; 11+ messages in thread
From: André Almeida @ 2021-08-18 16:34 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner
  Cc: Ingo Molnar, Darren Hart, linux-kernel, Steven Rostedt,
	Sebastian Andrzej Siewior, kernel, krisman, linux-api,
	libc-alpha, mtk.manpages, Davidlohr Bueso

Às 13:20 de 18/08/21, Peter Zijlstra escreveu:
> On Wed, Aug 18, 2021 at 01:00:57PM +0200, Thomas Gleixner wrote:
>>> +/**
>>> + * struct futex_waitv - A waiter for vectorized wait
>>> + * @val:   Expected value at uaddr
>>> + * @uaddr: User address to wait on
>>> + * @flags: Flags for this waiter
>>> + */
>>> +struct futex_waitv {
>>> +	__u64 val;
>>
>> Again. Why u64?
> 
> So I think the idea was that if we're going to do new syscalls, we
> should cater for future extentions, one of which was 64bit futexes (for
> 64bit archs) (along with u{8,16,32})
> 
> The previous set of patches implemented a more complete replacement ABI
> -- which I rather liked, however the implementation was completely
> disjoint from the existing futexes, which was a non-starter for me.
> 
> Anyway, yes, current futexes are u32, but if we want to ever do u64
> futexes, we should either do this syscall with a u64, or already plan to
> retire the whole syscall.
> 
> Obiously this would've made good Changelog material, but alas it wasn't
> there.
> 

Ops, I forgot to add the reasoning behind the 64 futexes. The idea is
that futex users want to be able to properly do 64bit atomic operations
on top of futex values:

[0]
https://lore.kernel.org/lkml/CAFTs51XAr2b3DmcSM4=qeU5cNuh0mTxUbhG66U6bc63YYzkzYA@mail.gmail.com/

[1]
https://lore.kernel.org/lkml/20210603195924.361327-1-andrealmeid@collabora.com/T/#m37bfbbd6ac76c121941defd1daea774389552674

[2] https://lists.boost.org/Archives/boost/2021/05/251508.php

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/4] futex2: Implement vectorized wait
  2021-08-18 16:20     ` Peter Zijlstra
  2021-08-18 16:34       ` André Almeida
@ 2021-08-18 19:45       ` Thomas Gleixner
  2021-08-19  3:38         ` André Almeida
  1 sibling, 1 reply; 11+ messages in thread
From: Thomas Gleixner @ 2021-08-18 19:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: André Almeida, Ingo Molnar, Darren Hart, linux-kernel,
	Steven Rostedt, Sebastian Andrzej Siewior, kernel, krisman,
	linux-api, libc-alpha, mtk.manpages, Davidlohr Bueso

On Wed, Aug 18 2021 at 18:20, Peter Zijlstra wrote:
> On Wed, Aug 18, 2021 at 01:00:57PM +0200, Thomas Gleixner wrote:
>> > +/**
>> > + * struct futex_waitv - A waiter for vectorized wait
>> > + * @val:   Expected value at uaddr
>> > + * @uaddr: User address to wait on
>> > + * @flags: Flags for this waiter
>> > + */
>> > +struct futex_waitv {
>> > +	__u64 val;
>> 
>> Again. Why u64?
>
> So I think the idea was that if we're going to do new syscalls, we
> should cater for future extentions, one of which was 64bit futexes (for
> 64bit archs) (along with u{8,16,32})
>
> The previous set of patches implemented a more complete replacement ABI
> -- which I rather liked, however the implementation was completely
> disjoint from the existing futexes, which was a non-starter for me.
>
> Anyway, yes, current futexes are u32, but if we want to ever do u64
> futexes, we should either do this syscall with a u64, or already plan to
> retire the whole syscall.
>
> Obiously this would've made good Changelog material, but alas it wasn't
> there.

Fair enough, but OTOH 64bit futexes for 64bit architectures: What's
exactly the point? Just because 64bit architectures can implement it is
not really a good reason. Where is the use case and the benefit and
what's the workaround for 32bit user space / architectures?

I'm not opposed against variable sized futexes in principle, but they
come with limitations and we end up with tons of sanity checks and
exclusions all over the place.

The 32bit futexes have a charm as they just work for all architectures
and the interaction with PI and robust list is trivial and well
established.

I serioulsy doubt that 8 and 16 bit futexes can be actually used for
locking in a meaningful way. If they are purely wait/wake then the
question is whether they actually fit into futex in the first place or
just happen to be implementable via futexes.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/4] futex2: Implement vectorized wait
  2021-08-18 19:45       ` Thomas Gleixner
@ 2021-08-19  3:38         ` André Almeida
  0 siblings, 0 replies; 11+ messages in thread
From: André Almeida @ 2021-08-19  3:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Peter Zijlstra, Darren Hart, linux-kernel,
	Steven Rostedt, Sebastian Andrzej Siewior, kernel, krisman,
	linux-api, libc-alpha, mtk.manpages, Davidlohr Bueso,
	malteskarupke, malteskarupke, Peter Oskolkov, Andrey Semashev,
	Florian Weimer

Às 16:45 de 18/08/21, Thomas Gleixner escreveu:
> On Wed, Aug 18 2021 at 18:20, Peter Zijlstra wrote:
>> On Wed, Aug 18, 2021 at 01:00:57PM +0200, Thomas Gleixner wrote:
>>>> +/**
>>>> + * struct futex_waitv - A waiter for vectorized wait
>>>> + * @val:   Expected value at uaddr
>>>> + * @uaddr: User address to wait on
>>>> + * @flags: Flags for this waiter
>>>> + */
>>>> +struct futex_waitv {
>>>> +	__u64 val;
>>>
>>> Again. Why u64?
>>
>> So I think the idea was that if we're going to do new syscalls, we
>> should cater for future extentions, one of which was 64bit futexes (for
>> 64bit archs) (along with u{8,16,32})
>>
>> The previous set of patches implemented a more complete replacement ABI
>> -- which I rather liked, however the implementation was completely
>> disjoint from the existing futexes, which was a non-starter for me.
>>
>> Anyway, yes, current futexes are u32, but if we want to ever do u64
>> futexes, we should either do this syscall with a u64, or already plan to
>> retire the whole syscall.
>>
>> Obiously this would've made good Changelog material, but alas it wasn't
>> there.
> 
> Fair enough, but OTOH 64bit futexes for 64bit architectures: What's
> exactly the point? Just because 64bit architectures can implement it is
> not really a good reason. Where is the use case and the benefit and
> what's the workaround for 32bit user space / architectures?
> 
> I'm not opposed against variable sized futexes in principle, but they
> come with limitations and we end up with tons of sanity checks and
> exclusions all over the place.
> 
> The 32bit futexes have a charm as they just work for all architectures
> and the interaction with PI and robust list is trivial and well
> established.
> 
> I serioulsy doubt that 8 and 16 bit futexes can be actually used for
> locking in a meaningful way. If they are purely wait/wake then the
> question is whether they actually fit into futex in the first place or
> just happen to be implementable via futexes.

This is the feedback that I have collected from the community about
variable sized futex:

- At Boost Libraries, futex is used as back end to implement atomic
primitives for some architectures. It works fine for 32-bit futexes, but
for other sizes it "must use an internal lock pool to implement waiting
and notifying operations, which increases thread contention. For
inter-process atomics, this means that waiting must be done using a spin
loop, which is terribly inefficient."[0]

- glibc’s rwlock implementation "uses a torn 32-bit futex read which is
part of an atomically updated 64-bit word".[1]

- Peter Oskolkov[2] pointed out that for 64-bit platforms it would be
useful to do atomic operations in pointer values: "imagine a simple
producer/consumer scenario, with the producer updating some shared
memory data and waking the consumer. Storing the pointer in the futex
makes it so that only one shared memory location needs to be accessed
atomically".

- The original proposal[3] to support 8-bit and 16-bit futexes had some
use cases as well: "Having mutexes that are only one byte in size was
the first reason WebKit mentioned for re-implementing futexes in a
library" and "The C++ standard added futexes to the standard library in
C++20 under the name atomic_wait and atomic_notify. The C++20 version
supports this for atomic variables of any size. The more sizes we can
support, the better the implementation can be in the standard library."

[0] https://lists.boost.org/Archives/boost/2021/05/251508.php

[1]
https://lore.kernel.org/lkml/20210603195924.361327-1-andrealmeid@collabora.com/T/#m37bfbbd6ac76c121941defd1daea774389552674

[2]
https://lore.kernel.org/lkml/CAFTs51XAr2b3DmcSM4=qeU5cNuh0mTxUbhG66U6bc63YYzkzYA@mail.gmail.com/

[3]
https://lore.kernel.org/lkml/20191204235238.10764-1-malteskarupke@web.de/


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-08-19  3:39 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-05 19:04 [PATCH 0/4] futex2: Add wait on multiple futexes syscall André Almeida
2021-08-05 19:04 ` [PATCH 1/4] futex: Prepare for futex_wait_multiple() André Almeida
2021-08-18  8:22   ` Thomas Gleixner
2021-08-05 19:04 ` [PATCH 2/4] futex2: Implement vectorized wait André Almeida
2021-08-18 11:00   ` Thomas Gleixner
2021-08-18 16:20     ` Peter Zijlstra
2021-08-18 16:34       ` André Almeida
2021-08-18 19:45       ` Thomas Gleixner
2021-08-19  3:38         ` André Almeida
2021-08-05 19:04 ` [PATCH 3/4] selftests: futex2: Add waitv test André Almeida
2021-08-05 19:04 ` [PATCH 4/4] futex2: Documentation: Document futex_waitv() uAPI André Almeida

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).