LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 00/10] kernel/locking: qspinlock improvements
@ 2018-04-05 16:58 Will Deacon
  2018-04-05 16:58 ` [PATCH 01/10] locking/qspinlock: Don't spin on pending->locked transition in slowpath Will Deacon
                   ` (10 more replies)
  0 siblings, 11 replies; 47+ messages in thread
From: Will Deacon @ 2018-04-05 16:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck,
	catalin.marinas, Will Deacon

Hi all,

I've been kicking the tyres further on qspinlock and with this set of patches
I'm happy with the performance and fairness properties. In particular, the
locking algorithm now guarantees forward progress whereas the implementation
in mainline can starve threads indefinitely in cmpxchg loops.

Catalin has also implemented a model of this using TLA to prove that the
lock is fair, although this doesn't take the memory model into account:

https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/commit/

I'd still like to get more benchmark numbers and wider exposure before
enabling this for arm64, but my current testing is looking very promising.
This series, along with the arm64-specific patches, is available at:

https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=qspinlock

Cheers,

Will

--->8

Jason Low (1):
  locking/mcs: Use smp_cond_load_acquire() in mcs spin loop

Will Deacon (9):
  locking/qspinlock: Don't spin on pending->locked transition in
    slowpath
  locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  locking/qspinlock: Kill cmpxchg loop when claiming lock from head of
    queue
  locking/qspinlock: Use atomic_cond_read_acquire
  barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed
  locking/qspinlock: Use smp_cond_load_relaxed to wait for next node
  locking/qspinlock: Merge struct __qspinlock into struct qspinlock
  locking/qspinlock: Make queued_spin_unlock use smp_store_release
  locking/qspinlock: Elide back-to-back RELEASE operations with
    smp_wmb()

 arch/x86/include/asm/qspinlock.h          |  19 ++-
 arch/x86/include/asm/qspinlock_paravirt.h |   3 +-
 include/asm-generic/barrier.h             |  27 ++++-
 include/asm-generic/qspinlock.h           |   2 +-
 include/asm-generic/qspinlock_types.h     |  32 ++++-
 include/linux/atomic.h                    |   2 +
 kernel/locking/mcs_spinlock.h             |  10 +-
 kernel/locking/qspinlock.c                | 191 ++++++++++--------------------
 kernel/locking/qspinlock_paravirt.h       |  34 ++----
 9 files changed, 141 insertions(+), 179 deletions(-)

-- 
2.1.4

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 01/10] locking/qspinlock: Don't spin on pending->locked transition in slowpath
  2018-04-05 16:58 [PATCH 00/10] kernel/locking: qspinlock improvements Will Deacon
@ 2018-04-05 16:58 ` Will Deacon
  2018-04-05 16:58 ` [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 47+ messages in thread
From: Will Deacon @ 2018-04-05 16:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck,
	catalin.marinas, Will Deacon

If a locker taking the qspinlock slowpath reads a lock value indicating
that only the pending bit is set, then it will spin whilst the
concurrent pending->locked transition takes effect.

Unfortunately, there is no guarantee that such a transition will ever be
observed since concurrent lockers could continuously set pending and
hand over the lock amongst themselves, leading to starvation. Whilst
this would probably resolve in practice, it means that it is not
possible to prove liveness properties about the lock and means that lock
acquisition time is unbounded.

Remove the pending->locked spinning from the slowpath and instead
queue explicitly if pending is set.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index d880296245c5..a192af2fe378 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -306,16 +306,6 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		return;
 
 	/*
-	 * wait for in-progress pending->locked hand-overs
-	 *
-	 * 0,1,0 -> 0,0,1
-	 */
-	if (val == _Q_PENDING_VAL) {
-		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
-			cpu_relax();
-	}
-
-	/*
 	 * trylock || pending
 	 *
 	 * 0,0,0 -> 0,0,1 ; trylock
-- 
2.1.4

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-05 16:58 [PATCH 00/10] kernel/locking: qspinlock improvements Will Deacon
  2018-04-05 16:58 ` [PATCH 01/10] locking/qspinlock: Don't spin on pending->locked transition in slowpath Will Deacon
@ 2018-04-05 16:58 ` Will Deacon
  2018-04-05 17:07   ` Peter Zijlstra
                     ` (4 more replies)
  2018-04-05 16:59 ` [PATCH 03/10] locking/qspinlock: Kill cmpxchg loop when claiming lock from head of queue Will Deacon
                   ` (8 subsequent siblings)
  10 siblings, 5 replies; 47+ messages in thread
From: Will Deacon @ 2018-04-05 16:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck,
	catalin.marinas, Will Deacon

The qspinlock locking slowpath utilises a "pending" bit as a simple form
of an embedded test-and-set lock that can avoid the overhead of explicit
queuing in cases where the lock is held but uncontended. This bit is
managed using a cmpxchg loop which tries to transition the uncontended
lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).

Unfortunately, the cmpxchg loop is unbounded and lockers can be starved
indefinitely if the lock word is seen to oscillate between unlocked
(0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
able to take the lock in the cmpxchg loop without queuing and pass it
around amongst themselves.

This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
using atomic_fetch_or, and then inspecting the old value to see whether
we need to spin on the current lock owner, or whether we now effectively
hold the lock. The tricky scenario is when concurrent lockers end up
queuing on the lock and the lock becomes available, causing us to see
a lockword of (n,0,0). With pending now set, simply queuing could lead
to deadlock as the head of the queue may not have observed the pending
flag being cleared. Conversely, if the head of the queue did observe
pending being cleared, then it could transition the lock from (n,0,0) ->
(0,0,1) meaning that any attempt to "undo" our setting of the pending
bit could race with a concurrent locker trying to set it.

We handle this race by preserving the pending bit when taking the lock
after reaching the head of the queue and leaving the tail entry intact
if we saw pending set, because we know that the tail is going to be
updated shortly.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 80 ++++++++++++++++++++--------------------------
 1 file changed, 35 insertions(+), 45 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index a192af2fe378..b75361d23ea5 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -294,7 +294,7 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
 void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
 	struct mcs_spinlock *prev, *next, *node;
-	u32 new, old, tail;
+	u32 old, tail;
 	int idx;
 
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
@@ -306,58 +306,48 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		return;
 
 	/*
+	 * If we observe any contention; queue.
+	 */
+	if (val & ~_Q_LOCKED_MASK)
+		goto queue;
+
+	/*
 	 * trylock || pending
 	 *
 	 * 0,0,0 -> 0,0,1 ; trylock
 	 * 0,0,1 -> 0,1,1 ; pending
 	 */
-	for (;;) {
+	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+	if (!(val & ~_Q_LOCKED_MASK)) {
 		/*
-		 * If we observe any contention; queue.
+		 * we're pending, wait for the owner to go away.
+		 *
+		 * *,1,1 -> *,1,0
+		 *
+		 * this wait loop must be a load-acquire such that we match the
+		 * store-release that clears the locked bit and create lock
+		 * sequentiality; this is because not all
+		 * clear_pending_set_locked() implementations imply full
+		 * barriers.
 		 */
-		if (val & ~_Q_LOCKED_MASK)
-			goto queue;
-
-		new = _Q_LOCKED_VAL;
-		if (val == new)
-			new |= _Q_PENDING_VAL;
-
+		if (val & _Q_LOCKED_MASK)
+			smp_cond_load_acquire(&lock->val.counter,
+					      !(VAL & _Q_LOCKED_MASK));
 		/*
-		 * Acquire semantic is required here as the function may
-		 * return immediately if the lock was free.
+		 * take ownership and clear the pending bit.
+		 *
+		 * *,1,0 -> *,0,1
 		 */
-		old = atomic_cmpxchg_acquire(&lock->val, val, new);
-		if (old == val)
-			break;
-
-		val = old;
-	}
-
-	/*
-	 * we won the trylock
-	 */
-	if (new == _Q_LOCKED_VAL)
+		clear_pending_set_locked(lock);
 		return;
+	}
 
 	/*
-	 * we're pending, wait for the owner to go away.
-	 *
-	 * *,1,1 -> *,1,0
-	 *
-	 * this wait loop must be a load-acquire such that we match the
-	 * store-release that clears the locked bit and create lock
-	 * sequentiality; this is because not all clear_pending_set_locked()
-	 * implementations imply full barriers.
-	 */
-	smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_MASK));
-
-	/*
-	 * take ownership and clear the pending bit.
-	 *
-	 * *,1,0 -> *,0,1
+	 * If pending was clear but there are waiters in the queue, then
+	 * we need to undo our setting of pending before we queue ourselves.
 	 */
-	clear_pending_set_locked(lock);
-	return;
+	if (!(val & _Q_PENDING_MASK))
+		atomic_andnot(_Q_PENDING_VAL, &lock->val);
 
 	/*
 	 * End of pending bit optimistic spinning and beginning of MCS
@@ -461,15 +451,15 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * claim the lock:
 	 *
 	 * n,0,0 -> 0,0,1 : lock, uncontended
-	 * *,0,0 -> *,0,1 : lock, contended
+	 * *,*,0 -> *,*,1 : lock, contended
 	 *
-	 * If the queue head is the only one in the queue (lock value == tail),
-	 * clear the tail code and grab the lock. Otherwise, we only need
-	 * to grab the lock.
+	 * If the queue head is the only one in the queue (lock value == tail)
+	 * and nobody is pending, clear the tail code and grab the lock.
+	 * Otherwise, we only need to grab the lock.
 	 */
 	for (;;) {
 		/* In the PV case we might already have _Q_LOCKED_VAL set */
-		if ((val & _Q_TAIL_MASK) != tail) {
+		if ((val & _Q_TAIL_MASK) != tail || (val & _Q_PENDING_MASK)) {
 			set_locked(lock);
 			break;
 		}
-- 
2.1.4

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 03/10] locking/qspinlock: Kill cmpxchg loop when claiming lock from head of queue
  2018-04-05 16:58 [PATCH 00/10] kernel/locking: qspinlock improvements Will Deacon
  2018-04-05 16:58 ` [PATCH 01/10] locking/qspinlock: Don't spin on pending->locked transition in slowpath Will Deacon
  2018-04-05 16:58 ` [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
@ 2018-04-05 16:59 ` Will Deacon
  2018-04-05 17:19   ` Peter Zijlstra
  2018-04-05 16:59 ` [PATCH 04/10] locking/qspinlock: Use atomic_cond_read_acquire Will Deacon
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 47+ messages in thread
From: Will Deacon @ 2018-04-05 16:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck,
	catalin.marinas, Will Deacon

When a queued locker reaches the head of the queue, it claims the lock
by setting _Q_LOCKED_VAL in the lockword. If there isn't contention, it
must also clear the tail as part of this operation so that subsequent
lockers can avoid taking the slowpath altogether.

Currently this is expressed as a cmpxchg loop that practically only
runs up to two iterations. This is confusing to the reader and unhelpful
to the compiler. Rewrite the cmpxchg loop without the loop, so that a
failed cmpxchg implies that there is contention and we just need to
write to _Q_LOCKED_VAL without considering the rest of the lockword.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index b75361d23ea5..cdfa7b7328a8 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -457,24 +457,21 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * and nobody is pending, clear the tail code and grab the lock.
 	 * Otherwise, we only need to grab the lock.
 	 */
-	for (;;) {
-		/* In the PV case we might already have _Q_LOCKED_VAL set */
-		if ((val & _Q_TAIL_MASK) != tail || (val & _Q_PENDING_MASK)) {
-			set_locked(lock);
-			break;
-		}
+
+	/* In the PV case we might already have _Q_LOCKED_VAL set */
+	if ((val & _Q_TAIL_MASK) == tail) {
 		/*
 		 * The smp_cond_load_acquire() call above has provided the
-		 * necessary acquire semantics required for locking. At most
-		 * two iterations of this loop may be ran.
+		 * necessary acquire semantics required for locking.
 		 */
 		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
 		if (old == val)
-			goto release;	/* No contention */
-
-		val = old;
+			goto release; /* No contention */
 	}
 
+	/* Either somebody is queued behind us or _Q_PENDING_VAL is set */
+	set_locked(lock);
+
 	/*
 	 * contended path; wait for next if not observed yet, release.
 	 */
-- 
2.1.4

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 04/10] locking/qspinlock: Use atomic_cond_read_acquire
  2018-04-05 16:58 [PATCH 00/10] kernel/locking: qspinlock improvements Will Deacon
                   ` (2 preceding siblings ...)
  2018-04-05 16:59 ` [PATCH 03/10] locking/qspinlock: Kill cmpxchg loop when claiming lock from head of queue Will Deacon
@ 2018-04-05 16:59 ` Will Deacon
  2018-04-05 16:59 ` [PATCH 05/10] locking/mcs: Use smp_cond_load_acquire() in mcs spin loop Will Deacon
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 47+ messages in thread
From: Will Deacon @ 2018-04-05 16:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck,
	catalin.marinas, Will Deacon

Rather than dig into the counter field of the atomic_t inside the
qspinlock structure so that we can call smp_cond_load_acquire, use
atomic_cond_read_acquire instead, which operates on the atomic_t
directly.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index cdfa7b7328a8..291e1526d27b 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -331,8 +331,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		 * barriers.
 		 */
 		if (val & _Q_LOCKED_MASK)
-			smp_cond_load_acquire(&lock->val.counter,
-					      !(VAL & _Q_LOCKED_MASK));
+			atomic_cond_read_acquire(&lock->val,
+						 !(VAL & _Q_LOCKED_MASK));
 		/*
 		 * take ownership and clear the pending bit.
 		 *
@@ -433,8 +433,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 *
 	 * The PV pv_wait_head_or_lock function, if active, will acquire
 	 * the lock and return a non-zero value. So we have to skip the
-	 * smp_cond_load_acquire() call. As the next PV queue head hasn't been
-	 * designated yet, there is no way for the locked value to become
+	 * atomic_cond_read_acquire() call. As the next PV queue head hasn't
+	 * been designated yet, there is no way for the locked value to become
 	 * _Q_SLOW_VAL. So both the set_locked() and the
 	 * atomic_cmpxchg_relaxed() calls will be safe.
 	 *
@@ -444,7 +444,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if ((val = pv_wait_head_or_lock(lock, node)))
 		goto locked;
 
-	val = smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_PENDING_MASK));
+	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
 
 locked:
 	/*
@@ -461,7 +461,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	/* In the PV case we might already have _Q_LOCKED_VAL set */
 	if ((val & _Q_TAIL_MASK) == tail) {
 		/*
-		 * The smp_cond_load_acquire() call above has provided the
+		 * The atomic_cond_read_acquire() call above has provided the
 		 * necessary acquire semantics required for locking.
 		 */
 		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
-- 
2.1.4

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 05/10] locking/mcs: Use smp_cond_load_acquire() in mcs spin loop
  2018-04-05 16:58 [PATCH 00/10] kernel/locking: qspinlock improvements Will Deacon
                   ` (3 preceding siblings ...)
  2018-04-05 16:59 ` [PATCH 04/10] locking/qspinlock: Use atomic_cond_read_acquire Will Deacon
@ 2018-04-05 16:59 ` Will Deacon
  2018-04-05 16:59 ` [PATCH 06/10] barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed Will Deacon
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 47+ messages in thread
From: Will Deacon @ 2018-04-05 16:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck,
	catalin.marinas, Jason Low, Will Deacon

From: Jason Low <jason.low2@hp.com>

For qspinlocks on ARM64, we would like to use WFE instead
of purely spinning. Qspinlocks internally have lock
contenders spin on an MCS lock.

Update arch_mcs_spin_lock_contended() such that it uses
the new smp_cond_load_acquire() so that ARM64 can also
override this spin loop with its own implementation using WFE.

On x86, this can also be cheaper than spinning on
smp_load_acquire().

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/mcs_spinlock.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index f046b7ce9dd6..5e10153b4d3c 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -23,13 +23,15 @@ struct mcs_spinlock {
 
 #ifndef arch_mcs_spin_lock_contended
 /*
- * Using smp_load_acquire() provides a memory barrier that ensures
- * subsequent operations happen after the lock is acquired.
+ * Using smp_cond_load_acquire() provides the acquire semantics
+ * required so that subsequent operations happen after the
+ * lock is acquired. Additionally, some architectures such as
+ * ARM64 would like to do spin-waiting instead of purely
+ * spinning, and smp_cond_load_acquire() provides that behavior.
  */
 #define arch_mcs_spin_lock_contended(l)					\
 do {									\
-	while (!(smp_load_acquire(l)))					\
-		cpu_relax();						\
+	smp_cond_load_acquire(l, VAL);					\
 } while (0)
 #endif
 
-- 
2.1.4

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 06/10] barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed
  2018-04-05 16:58 [PATCH 00/10] kernel/locking: qspinlock improvements Will Deacon
                   ` (4 preceding siblings ...)
  2018-04-05 16:59 ` [PATCH 05/10] locking/mcs: Use smp_cond_load_acquire() in mcs spin loop Will Deacon
@ 2018-04-05 16:59 ` Will Deacon
  2018-04-05 17:22   ` Peter Zijlstra
  2018-04-05 16:59 ` [PATCH 07/10] locking/qspinlock: Use smp_cond_load_relaxed to wait for next node Will Deacon
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 47+ messages in thread
From: Will Deacon @ 2018-04-05 16:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck,
	catalin.marinas, Will Deacon

Whilst we currently provide smp_cond_load_acquire and
atomic_cond_read_acquire, there are cases where the ACQUIRE semantics are
not required because of a subsequent fence or release operation once the
conditional loop has exited.

This patch adds relaxed versions of the conditional spinning primitives
to avoid unnecessary barrier overhead on architectures such as arm64.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 include/asm-generic/barrier.h | 27 +++++++++++++++++++++------
 include/linux/atomic.h        |  2 ++
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index fe297b599b0a..305e03b19a26 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -221,18 +221,17 @@ do {									\
 #endif
 
 /**
- * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
+ * smp_cond_load_relaxed() - (Spin) wait for cond with no ordering guarantees
  * @ptr: pointer to the variable to wait on
  * @cond: boolean expression to wait for
  *
- * Equivalent to using smp_load_acquire() on the condition variable but employs
- * the control dependency of the wait to reduce the barrier on many platforms.
+ * Equivalent to using READ_ONCE() on the condition variable.
  *
  * Due to C lacking lambda expressions we load the value of *ptr into a
  * pre-named variable @VAL to be used in @cond.
  */
-#ifndef smp_cond_load_acquire
-#define smp_cond_load_acquire(ptr, cond_expr) ({		\
+#ifndef smp_cond_load_relaxed
+#define smp_cond_load_relaxed(ptr, cond_expr) ({		\
 	typeof(ptr) __PTR = (ptr);				\
 	typeof(*ptr) VAL;					\
 	for (;;) {						\
@@ -241,10 +240,26 @@ do {									\
 			break;					\
 		cpu_relax();					\
 	}							\
-	smp_acquire__after_ctrl_dep();				\
 	VAL;							\
 })
 #endif
 
+/**
+ * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ *
+ * Equivalent to using smp_load_acquire() on the condition variable but employs
+ * the control dependency of the wait to reduce the barrier on many platforms.
+ */
+#ifndef smp_cond_load_acquire
+#define smp_cond_load_acquire(ptr, cond_expr) ({		\
+	typeof(*ptr) _val;					\
+	_val = smp_cond_load_relaxed(ptr, cond_expr);		\
+	smp_acquire__after_ctrl_dep();				\
+	_val;							\
+})
+#endif
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_GENERIC_BARRIER_H */
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 8b276fd9a127..01ce3997cb42 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -654,6 +654,7 @@ static inline int atomic_dec_if_positive(atomic_t *v)
 }
 #endif
 
+#define atomic_cond_read_relaxed(v, c)	smp_cond_load_relaxed(&(v)->counter, (c))
 #define atomic_cond_read_acquire(v, c)	smp_cond_load_acquire(&(v)->counter, (c))
 
 #ifdef CONFIG_GENERIC_ATOMIC64
@@ -1075,6 +1076,7 @@ static inline long long atomic64_fetch_andnot_release(long long i, atomic64_t *v
 }
 #endif
 
+#define atomic64_cond_read_relaxed(v, c)	smp_cond_load_relaxed(&(v)->counter, (c))
 #define atomic64_cond_read_acquire(v, c)	smp_cond_load_acquire(&(v)->counter, (c))
 
 #include <asm-generic/atomic-long.h>
-- 
2.1.4

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 07/10] locking/qspinlock: Use smp_cond_load_relaxed to wait for next node
  2018-04-05 16:58 [PATCH 00/10] kernel/locking: qspinlock improvements Will Deacon
                   ` (5 preceding siblings ...)
  2018-04-05 16:59 ` [PATCH 06/10] barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed Will Deacon
@ 2018-04-05 16:59 ` Will Deacon
  2018-04-05 16:59 ` [PATCH 08/10] locking/qspinlock: Merge struct __qspinlock into struct qspinlock Will Deacon
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 47+ messages in thread
From: Will Deacon @ 2018-04-05 16:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck,
	catalin.marinas, Will Deacon

When a locker reaches the head of the queue and takes the lock, a
concurrent locker may enqueue and force the lock holder to spin
whilst its node->next field is initialised. Rather than open-code
a READ_ONCE/cpu_relax() loop, this can be implemented using
smp_cond_load_relaxed instead.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 291e1526d27b..c8b57d375b49 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -475,10 +475,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	/*
 	 * contended path; wait for next if not observed yet, release.
 	 */
-	if (!next) {
-		while (!(next = READ_ONCE(node->next)))
-			cpu_relax();
-	}
+	if (!next)
+		next = smp_cond_load_relaxed(&node->next, (VAL));
 
 	arch_mcs_spin_unlock_contended(&next->locked);
 	pv_kick_node(lock, next);
-- 
2.1.4

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 08/10] locking/qspinlock: Merge struct __qspinlock into struct qspinlock
  2018-04-05 16:58 [PATCH 00/10] kernel/locking: qspinlock improvements Will Deacon
                   ` (6 preceding siblings ...)
  2018-04-05 16:59 ` [PATCH 07/10] locking/qspinlock: Use smp_cond_load_relaxed to wait for next node Will Deacon
@ 2018-04-05 16:59 ` Will Deacon
  2018-04-07  5:23   ` Boqun Feng
  2018-04-05 16:59 ` [PATCH 09/10] locking/qspinlock: Make queued_spin_unlock use smp_store_release Will Deacon
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 47+ messages in thread
From: Will Deacon @ 2018-04-05 16:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck,
	catalin.marinas, Will Deacon

struct __qspinlock provides a handy union of fields so that
subcomponents of the lockword can be accessed by name, without having to
manage shifts and masks explicitly and take endianness into account.

This is useful in qspinlock.h and also potentially in arch headers, so
move the struct __qspinlock into struct qspinlock and kill the extra
definition.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/x86/include/asm/qspinlock.h          |  2 +-
 arch/x86/include/asm/qspinlock_paravirt.h |  3 +-
 include/asm-generic/qspinlock_types.h     | 32 +++++++++++++++++++--
 kernel/locking/qspinlock.c                | 46 ++-----------------------------
 kernel/locking/qspinlock_paravirt.h       | 34 ++++++++---------------
 5 files changed, 46 insertions(+), 71 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 5e16b5d40d32..90b0b0ed8161 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -16,7 +16,7 @@
  */
 static inline void native_queued_spin_unlock(struct qspinlock *lock)
 {
-	smp_store_release((u8 *)lock, 0);
+	smp_store_release(&lock->locked, 0);
 }
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index 923307ea11c7..9ef5ee03d2d7 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -22,8 +22,7 @@ PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
  *
  * void __pv_queued_spin_unlock(struct qspinlock *lock)
  * {
- *	struct __qspinlock *l = (void *)lock;
- *	u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+ *	u8 lockval = cmpxchg(&lock->locked, _Q_LOCKED_VAL, 0);
  *
  *	if (likely(lockval == _Q_LOCKED_VAL))
  *		return;
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index 034acd0c4956..0763f065b975 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -29,13 +29,41 @@
 #endif
 
 typedef struct qspinlock {
-	atomic_t	val;
+	union {
+		atomic_t val;
+
+		/*
+		 * By using the whole 2nd least significant byte for the
+		 * pending bit, we can allow better optimization of the lock
+		 * acquisition for the pending bit holder.
+		 */
+#ifdef __LITTLE_ENDIAN
+		struct {
+			u8	locked;
+			u8	pending;
+		};
+		struct {
+			u16	locked_pending;
+			u16	tail;
+		};
+#else
+		struct {
+			u16	tail;
+			u16	locked_pending;
+		};
+		struct {
+			u8	reserved[2];
+			u8	pending;
+			u8	locked;
+		};
+#endif
+	};
 } arch_spinlock_t;
 
 /*
  * Initializier
  */
-#define	__ARCH_SPIN_LOCK_UNLOCKED	{ ATOMIC_INIT(0) }
+#define	__ARCH_SPIN_LOCK_UNLOCKED	{ .val = ATOMIC_INIT(0) }
 
 /*
  * Bitfields in the atomic value:
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index c8b57d375b49..3ad8786a47e2 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -114,40 +114,6 @@ static inline __pure struct mcs_spinlock *decode_tail(u32 tail)
 
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
-/*
- * By using the whole 2nd least significant byte for the pending bit, we
- * can allow better optimization of the lock acquisition for the pending
- * bit holder.
- *
- * This internal structure is also used by the set_locked function which
- * is not restricted to _Q_PENDING_BITS == 8.
- */
-struct __qspinlock {
-	union {
-		atomic_t val;
-#ifdef __LITTLE_ENDIAN
-		struct {
-			u8	locked;
-			u8	pending;
-		};
-		struct {
-			u16	locked_pending;
-			u16	tail;
-		};
-#else
-		struct {
-			u16	tail;
-			u16	locked_pending;
-		};
-		struct {
-			u8	reserved[2];
-			u8	pending;
-			u8	locked;
-		};
-#endif
-	};
-};
-
 #if _Q_PENDING_BITS == 8
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
@@ -159,9 +125,7 @@ struct __qspinlock {
  */
 static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
+	WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
 }
 
 /*
@@ -176,13 +140,11 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
  */
 static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 {
-	struct __qspinlock *l = (void *)lock;
-
 	/*
 	 * Use release semantics to make sure that the MCS node is properly
 	 * initialized before changing the tail code.
 	 */
-	return (u32)xchg_release(&l->tail,
+	return (u32)xchg_release(&lock->tail,
 				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 }
 
@@ -237,9 +199,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
  */
 static __always_inline void set_locked(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
+	WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
 }
 
 
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 6ee477765e6c..2711940429f5 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -87,8 +87,6 @@ struct pv_node {
 #define queued_spin_trylock(l)	pv_hybrid_queued_unfair_trylock(l)
 static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
 	/*
 	 * Stay in unfair lock mode as long as queued mode waiters are
 	 * present in the MCS wait queue but the pending bit isn't set.
@@ -97,7 +95,7 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
 		int val = atomic_read(&lock->val);
 
 		if (!(val & _Q_LOCKED_PENDING_MASK) &&
-		   (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
+		   (cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0)) {
 			qstat_inc(qstat_pv_lock_stealing, true);
 			return true;
 		}
@@ -117,16 +115,12 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
 #if _Q_PENDING_BITS == 8
 static __always_inline void set_pending(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	WRITE_ONCE(l->pending, 1);
+	WRITE_ONCE(lock->pending, 1);
 }
 
 static __always_inline void clear_pending(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	WRITE_ONCE(l->pending, 0);
+	WRITE_ONCE(lock->pending, 0);
 }
 
 /*
@@ -136,10 +130,8 @@ static __always_inline void clear_pending(struct qspinlock *lock)
  */
 static __always_inline int trylock_clear_pending(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
-
-	return !READ_ONCE(l->locked) &&
-	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
+	return !READ_ONCE(lock->locked) &&
+	       (cmpxchg_acquire(&lock->locked_pending, _Q_PENDING_VAL,
 				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
 }
 #else /* _Q_PENDING_BITS == 8 */
@@ -384,7 +376,6 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
 static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
-	struct __qspinlock *l = (void *)lock;
 
 	/*
 	 * If the vCPU is indeed halted, advance its state to match that of
@@ -413,7 +404,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 	 * the hash table later on at unlock time, no atomic instruction is
 	 * needed.
 	 */
-	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
+	WRITE_ONCE(lock->locked, _Q_SLOW_VAL);
 	(void)pv_hash(lock, pn);
 }
 
@@ -428,7 +419,6 @@ static u32
 pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
-	struct __qspinlock *l = (void *)lock;
 	struct qspinlock **lp = NULL;
 	int waitcnt = 0;
 	int loop;
@@ -479,13 +469,13 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 			 *
 			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
 			 */
-			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
+			if (xchg(&lock->locked, _Q_SLOW_VAL) == 0) {
 				/*
 				 * The lock was free and now we own the lock.
 				 * Change the lock value back to _Q_LOCKED_VAL
 				 * and unhash the table.
 				 */
-				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
+				WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
 				WRITE_ONCE(*lp, NULL);
 				goto gotlock;
 			}
@@ -493,7 +483,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 		WRITE_ONCE(pn->state, vcpu_hashed);
 		qstat_inc(qstat_pv_wait_head, true);
 		qstat_inc(qstat_pv_wait_again, waitcnt);
-		pv_wait(&l->locked, _Q_SLOW_VAL);
+		pv_wait(&lock->locked, _Q_SLOW_VAL);
 
 		/*
 		 * Because of lock stealing, the queue head vCPU may not be
@@ -518,7 +508,6 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 __visible void
 __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 {
-	struct __qspinlock *l = (void *)lock;
 	struct pv_node *node;
 
 	if (unlikely(locked != _Q_SLOW_VAL)) {
@@ -547,7 +536,7 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 	 * Now that we have a reference to the (likely) blocked pv_node,
 	 * release the lock.
 	 */
-	smp_store_release(&l->locked, 0);
+	smp_store_release(&lock->locked, 0);
 
 	/*
 	 * At this point the memory pointed at by lock can be freed/reused,
@@ -573,7 +562,6 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
 #ifndef __pv_queued_spin_unlock
 __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
 {
-	struct __qspinlock *l = (void *)lock;
 	u8 locked;
 
 	/*
@@ -581,7 +569,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
 	 * unhash. Otherwise it would be possible to have multiple @lock
 	 * entries, which would be BAD.
 	 */
-	locked = cmpxchg_release(&l->locked, _Q_LOCKED_VAL, 0);
+	locked = cmpxchg_release(&lock->locked, _Q_LOCKED_VAL, 0);
 	if (likely(locked == _Q_LOCKED_VAL))
 		return;
 
-- 
2.1.4

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 09/10] locking/qspinlock: Make queued_spin_unlock use smp_store_release
  2018-04-05 16:58 [PATCH 00/10] kernel/locking: qspinlock improvements Will Deacon
                   ` (7 preceding siblings ...)
  2018-04-05 16:59 ` [PATCH 08/10] locking/qspinlock: Merge struct __qspinlock into struct qspinlock Will Deacon
@ 2018-04-05 16:59 ` Will Deacon
  2018-04-05 16:59 ` [PATCH 10/10] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb() Will Deacon
  2018-04-06 13:22 ` [PATCH 00/10] kernel/locking: qspinlock improvements Andrea Parri
  10 siblings, 0 replies; 47+ messages in thread
From: Will Deacon @ 2018-04-05 16:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck,
	catalin.marinas, Will Deacon

A qspinlock can be unlocked simply by writing zero to the locked byte.
This can be implemented in the generic code, so do that and remove the
arch-specific override for x86 in the !PV case.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/x86/include/asm/qspinlock.h | 17 ++++++-----------
 include/asm-generic/qspinlock.h  |  2 +-
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 90b0b0ed8161..cc77cbb01432 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -7,6 +7,12 @@
 #include <asm-generic/qspinlock_types.h>
 #include <asm/paravirt.h>
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __pv_init_lock_hash(void);
+extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __raw_callee_save___pv_queued_spin_unlock(struct qspinlock *lock);
+
 #define	queued_spin_unlock queued_spin_unlock
 /**
  * queued_spin_unlock - release a queued spinlock
@@ -19,12 +25,6 @@ static inline void native_queued_spin_unlock(struct qspinlock *lock)
 	smp_store_release(&lock->locked, 0);
 }
 
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
-extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-extern void __pv_init_lock_hash(void);
-extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-extern void __raw_callee_save___pv_queued_spin_unlock(struct qspinlock *lock);
-
 static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
 	pv_queued_spin_lock_slowpath(lock, val);
@@ -40,11 +40,6 @@ static inline bool vcpu_is_preempted(long cpu)
 {
 	return pv_vcpu_is_preempted(cpu);
 }
-#else
-static inline void queued_spin_unlock(struct qspinlock *lock)
-{
-	native_queued_spin_unlock(lock);
-}
 #endif
 
 #ifdef CONFIG_PARAVIRT
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index b37b4ad7eb94..a8ed0a352d75 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -100,7 +100,7 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
 	/*
 	 * unlock() needs release semantics:
 	 */
-	(void)atomic_sub_return_release(_Q_LOCKED_VAL, &lock->val);
+	smp_store_release(&lock->locked, 0);
 }
 #endif
 
-- 
2.1.4

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 10/10] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb()
  2018-04-05 16:58 [PATCH 00/10] kernel/locking: qspinlock improvements Will Deacon
                   ` (8 preceding siblings ...)
  2018-04-05 16:59 ` [PATCH 09/10] locking/qspinlock: Make queued_spin_unlock use smp_store_release Will Deacon
@ 2018-04-05 16:59 ` Will Deacon
  2018-04-05 17:28   ` Peter Zijlstra
  2018-04-07  5:47   ` Boqun Feng
  2018-04-06 13:22 ` [PATCH 00/10] kernel/locking: qspinlock improvements Andrea Parri
  10 siblings, 2 replies; 47+ messages in thread
From: Will Deacon @ 2018-04-05 16:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck,
	catalin.marinas, Will Deacon

The qspinlock slowpath must ensure that the MCS node is fully initialised
before it can be reached by another other CPU. This is currently enforced
by using a RELEASE operation when updating the tail and also when linking
the node into the waitqueue (since the control dependency off xchg_tail
is insufficient to enforce sufficient ordering -- see 95bcade33a8a
("locking/qspinlock: Ensure node is initialised before updating prev->next")).

Back-to-back RELEASE operations may be expensive on some architectures,
particularly those that implement them using fences under the hood. We
can replace the two RELEASE operations with a single smp_wmb() fence and
use RELAXED operations for the subsequent publishing of the node.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 kernel/locking/qspinlock.c | 32 +++++++++++++++-----------------
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 3ad8786a47e2..42c61f7b37c5 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -141,10 +141,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 {
 	/*
-	 * Use release semantics to make sure that the MCS node is properly
-	 * initialized before changing the tail code.
+	 * We can use relaxed semantics since the caller ensures that the
+	 * MCS node is properly initialized before updating the tail.
 	 */
-	return (u32)xchg_release(&lock->tail,
+	return (u32)xchg_relaxed(&lock->tail,
 				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 }
 
@@ -178,10 +178,11 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 	for (;;) {
 		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
 		/*
-		 * Use release semantics to make sure that the MCS node is
-		 * properly initialized before changing the tail code.
+		 * We can use relaxed semantics since the caller ensures that
+		 * the MCS node is properly initialized before updating the
+		 * tail.
 		 */
-		old = atomic_cmpxchg_release(&lock->val, val, new);
+		old = atomic_cmpxchg_relaxed(&lock->val, val, new);
 		if (old == val)
 			break;
 
@@ -340,12 +341,17 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		goto release;
 
 	/*
+	 * Ensure that the initialisation of @node is complete before we
+	 * publish the updated tail and potentially link @node into the
+	 * waitqueue.
+	 */
+	smp_wmb();
+
+	/*
 	 * We have already touched the queueing cacheline; don't bother with
 	 * pending stuff.
 	 *
 	 * p,*,* -> n,*,*
-	 *
-	 * RELEASE, such that the stores to @node must be complete.
 	 */
 	old = xchg_tail(lock, tail);
 	next = NULL;
@@ -356,15 +362,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 */
 	if (old & _Q_TAIL_MASK) {
 		prev = decode_tail(old);
-
-		/*
-		 * We must ensure that the stores to @node are observed before
-		 * the write to prev->next. The address dependency from
-		 * xchg_tail is not sufficient to ensure this because the read
-		 * component of xchg_tail is unordered with respect to the
-		 * initialisation of @node.
-		 */
-		smp_store_release(&prev->next, node);
+		WRITE_ONCE(prev->next, node);
 
 		pv_wait_node(node, prev);
 		arch_mcs_spin_lock_contended(&node->locked);
-- 
2.1.4

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-05 16:58 ` [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
@ 2018-04-05 17:07   ` Peter Zijlstra
  2018-04-06 15:08     ` Will Deacon
  2018-04-05 17:13   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2018-04-05 17:07 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, mingo, boqun.feng, paulmck,
	catalin.marinas, Waiman Long

On Thu, Apr 05, 2018 at 05:58:59PM +0100, Will Deacon wrote:
> The qspinlock locking slowpath utilises a "pending" bit as a simple form
> of an embedded test-and-set lock that can avoid the overhead of explicit
> queuing in cases where the lock is held but uncontended. This bit is
> managed using a cmpxchg loop which tries to transition the uncontended
> lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).
> 
> Unfortunately, the cmpxchg loop is unbounded and lockers can be starved
> indefinitely if the lock word is seen to oscillate between unlocked
> (0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
> able to take the lock in the cmpxchg loop without queuing and pass it
> around amongst themselves.
> 
> This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
> using atomic_fetch_or, 

Of course, LL/SC or cmpxchg implementations of fetch_or do not in fact
get anything from this ;-)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-05 16:58 ` [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
  2018-04-05 17:07   ` Peter Zijlstra
@ 2018-04-05 17:13   ` Peter Zijlstra
  2018-04-05 21:16   ` Waiman Long
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2018-04-05 17:13 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, mingo, boqun.feng, paulmck,
	catalin.marinas, Waiman Long

On Thu, Apr 05, 2018 at 05:58:59PM +0100, Will Deacon wrote:
> @@ -306,58 +306,48 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  		return;
>  
>  	/*
> +	 * If we observe any contention; queue.
> +	 */
> +	if (val & ~_Q_LOCKED_MASK)
> +		goto queue;
> +
> +	/*
>  	 * trylock || pending
>  	 *
>  	 * 0,0,0 -> 0,0,1 ; trylock
>  	 * 0,0,1 -> 0,1,1 ; pending
>  	 */
> +	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
> +	if (!(val & ~_Q_LOCKED_MASK)) {
>  		/*
> +		 * we're pending, wait for the owner to go away.
> +		 *
> +		 * *,1,1 -> *,1,0
> +		 *
> +		 * this wait loop must be a load-acquire such that we match the
> +		 * store-release that clears the locked bit and create lock
> +		 * sequentiality; this is because not all
> +		 * clear_pending_set_locked() implementations imply full
> +		 * barriers.
>  		 */
> +		if (val & _Q_LOCKED_MASK)
> +			smp_cond_load_acquire(&lock->val.counter,
> +					      !(VAL & _Q_LOCKED_MASK));

I much prefer { } for multi-line statements like this.

>  		/*
> +		 * take ownership and clear the pending bit.
> +		 *
> +		 * *,1,0 -> *,0,1
>  		 */
> +		clear_pending_set_locked(lock);
>  		return;
> +	}

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/10] locking/qspinlock: Kill cmpxchg loop when claiming lock from head of queue
  2018-04-05 16:59 ` [PATCH 03/10] locking/qspinlock: Kill cmpxchg loop when claiming lock from head of queue Will Deacon
@ 2018-04-05 17:19   ` Peter Zijlstra
  2018-04-06 10:54     ` Will Deacon
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2018-04-05 17:19 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, mingo, boqun.feng, paulmck,
	catalin.marinas, Waiman Long

On Thu, Apr 05, 2018 at 05:59:00PM +0100, Will Deacon wrote:
> +
> +	/* In the PV case we might already have _Q_LOCKED_VAL set */
> +	if ((val & _Q_TAIL_MASK) == tail) {
>  		/*
>  		 * The smp_cond_load_acquire() call above has provided the
> +		 * necessary acquire semantics required for locking.
>  		 */
>  		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
>  		if (old == val)
> +			goto release; /* No contention */
>  	}

--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -464,8 +464,7 @@ void queued_spin_lock_slowpath(struct qs
 		 * The smp_cond_load_acquire() call above has provided the
 		 * necessary acquire semantics required for locking.
 		 */
-		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
-		if (old == val)
+		if (atomic_try_cmpxchg_release(&lock->val, &val, _Q_LOCKED_VAL))
 			goto release; /* No contention */
 	}

Does that also work for you? It would generate slightly better code for
x86 (not that it would matter much on this path).

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 06/10] barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed
  2018-04-05 16:59 ` [PATCH 06/10] barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed Will Deacon
@ 2018-04-05 17:22   ` Peter Zijlstra
  2018-04-06 10:55     ` Will Deacon
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2018-04-05 17:22 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, mingo, boqun.feng, paulmck,
	catalin.marinas, Waiman Long

On Thu, Apr 05, 2018 at 05:59:03PM +0100, Will Deacon wrote:
> diff --git a/include/linux/atomic.h b/include/linux/atomic.h
> index 8b276fd9a127..01ce3997cb42 100644
> --- a/include/linux/atomic.h
> +++ b/include/linux/atomic.h
> @@ -654,6 +654,7 @@ static inline int atomic_dec_if_positive(atomic_t *v)
>  }
>  #endif
>  
> +#define atomic_cond_read_relaxed(v, c)	smp_cond_load_relaxed(&(v)->counter, (c))
>  #define atomic_cond_read_acquire(v, c)	smp_cond_load_acquire(&(v)->counter, (c))
>  
>  #ifdef CONFIG_GENERIC_ATOMIC64
> @@ -1075,6 +1076,7 @@ static inline long long atomic64_fetch_andnot_release(long long i, atomic64_t *v
>  }
>  #endif
>  
> +#define atomic64_cond_read_relaxed(v, c)	smp_cond_load_relaxed(&(v)->counter, (c))
>  #define atomic64_cond_read_acquire(v, c)	smp_cond_load_acquire(&(v)->counter, (c))
>  
>  #include <asm-generic/atomic-long.h>

Did we again forget atomic_long glue ? ;-)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/10] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb()
  2018-04-05 16:59 ` [PATCH 10/10] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb() Will Deacon
@ 2018-04-05 17:28   ` Peter Zijlstra
  2018-04-06 11:34     ` Will Deacon
  2018-04-07  5:47   ` Boqun Feng
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2018-04-05 17:28 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, mingo, boqun.feng, paulmck,
	catalin.marinas, Waiman Long

On Thu, Apr 05, 2018 at 05:59:07PM +0100, Will Deacon wrote:
> @@ -340,12 +341,17 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  		goto release;
>  
>  	/*
> +	 * Ensure that the initialisation of @node is complete before we
> +	 * publish the updated tail and potentially link @node into the
> +	 * waitqueue.
> +	 */
> +	smp_wmb();

Maybe an explicit note to where the matching barrier lives..

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-05 16:58 ` [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
  2018-04-05 17:07   ` Peter Zijlstra
  2018-04-05 17:13   ` Peter Zijlstra
@ 2018-04-05 21:16   ` Waiman Long
  2018-04-06 15:08     ` Will Deacon
  2018-04-06 20:50   ` Waiman Long
  2018-04-10 13:49   ` Sasha Levin
  4 siblings, 1 reply; 47+ messages in thread
From: Waiman Long @ 2018-04-05 21:16 UTC (permalink / raw)
  To: Will Deacon, linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, catalin.marinas

On 04/05/2018 12:58 PM, Will Deacon wrote:
> The qspinlock locking slowpath utilises a "pending" bit as a simple form
> of an embedded test-and-set lock that can avoid the overhead of explicit
> queuing in cases where the lock is held but uncontended. This bit is
> managed using a cmpxchg loop which tries to transition the uncontended
> lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).
>
> Unfortunately, the cmpxchg loop is unbounded and lockers can be starved
> indefinitely if the lock word is seen to oscillate between unlocked
> (0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
> able to take the lock in the cmpxchg loop without queuing and pass it
> around amongst themselves.
>
> This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
> using atomic_fetch_or, and then inspecting the old value to see whether
> we need to spin on the current lock owner, or whether we now effectively
> hold the lock. The tricky scenario is when concurrent lockers end up
> queuing on the lock and the lock becomes available, causing us to see
> a lockword of (n,0,0). With pending now set, simply queuing could lead
> to deadlock as the head of the queue may not have observed the pending
> flag being cleared. Conversely, if the head of the queue did observe
> pending being cleared, then it could transition the lock from (n,0,0) ->
> (0,0,1) meaning that any attempt to "undo" our setting of the pending
> bit could race with a concurrent locker trying to set it.
>
> We handle this race by preserving the pending bit when taking the lock
> after reaching the head of the queue and leaving the tail entry intact
> if we saw pending set, because we know that the tail is going to be
> updated shortly.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  kernel/locking/qspinlock.c | 80 ++++++++++++++++++++--------------------------
>  1 file changed, 35 insertions(+), 45 deletions(-)
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index a192af2fe378..b75361d23ea5 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -294,7 +294,7 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
>  void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  {
>  	struct mcs_spinlock *prev, *next, *node;
> -	u32 new, old, tail;
> +	u32 old, tail;
>  	int idx;
>  
>  	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
> @@ -306,58 +306,48 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  		return;
>  
>  	/*
> +	 * If we observe any contention; queue.
> +	 */
> +	if (val & ~_Q_LOCKED_MASK)
> +		goto queue;
> +
> +	/*
>  	 * trylock || pending
>  	 *
>  	 * 0,0,0 -> 0,0,1 ; trylock
>  	 * 0,0,1 -> 0,1,1 ; pending
>  	 */
> -	for (;;) {
> +	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
> +	if (!(val & ~_Q_LOCKED_MASK)) {
>  		/*
> -		 * If we observe any contention; queue.
> +		 * we're pending, wait for the owner to go away.
> +		 *
> +		 * *,1,1 -> *,1,0
> +		 *
> +		 * this wait loop must be a load-acquire such that we match the
> +		 * store-release that clears the locked bit and create lock
> +		 * sequentiality; this is because not all
> +		 * clear_pending_set_locked() implementations imply full
> +		 * barriers.
>  		 */
> -		if (val & ~_Q_LOCKED_MASK)
> -			goto queue;
> -
> -		new = _Q_LOCKED_VAL;
> -		if (val == new)
> -			new |= _Q_PENDING_VAL;
> -
> +		if (val & _Q_LOCKED_MASK)
> +			smp_cond_load_acquire(&lock->val.counter,
> +					      !(VAL & _Q_LOCKED_MASK));
>  		/*
> -		 * Acquire semantic is required here as the function may
> -		 * return immediately if the lock was free.
> +		 * take ownership and clear the pending bit.
> +		 *
> +		 * *,1,0 -> *,0,1
>  		 */
> -		old = atomic_cmpxchg_acquire(&lock->val, val, new);
> -		if (old == val)
> -			break;
> -
> -		val = old;
> -	}
> -
> -	/*
> -	 * we won the trylock
> -	 */
> -	if (new == _Q_LOCKED_VAL)
> +		clear_pending_set_locked(lock);
>  		return;
> +	}
>  
>  	/*
> -	 * we're pending, wait for the owner to go away.
> -	 *
> -	 * *,1,1 -> *,1,0
> -	 *
> -	 * this wait loop must be a load-acquire such that we match the
> -	 * store-release that clears the locked bit and create lock
> -	 * sequentiality; this is because not all clear_pending_set_locked()
> -	 * implementations imply full barriers.
> -	 */
> -	smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_MASK));
> -
> -	/*
> -	 * take ownership and clear the pending bit.
> -	 *
> -	 * *,1,0 -> *,0,1
> +	 * If pending was clear but there are waiters in the queue, then
> +	 * we need to undo our setting of pending before we queue ourselves.
>  	 */
> -	clear_pending_set_locked(lock);
> -	return;
> +	if (!(val & _Q_PENDING_MASK))
> +		atomic_andnot(_Q_PENDING_VAL, &lock->val);
Can we add a clear_pending() helper that will just clear the byte if
_Q_PENDING_BITS == 8? That will eliminate one atomic instruction from
the failure path.

-Longman

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/10] locking/qspinlock: Kill cmpxchg loop when claiming lock from head of queue
  2018-04-05 17:19   ` Peter Zijlstra
@ 2018-04-06 10:54     ` Will Deacon
  0 siblings, 0 replies; 47+ messages in thread
From: Will Deacon @ 2018-04-06 10:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-arm-kernel, mingo, boqun.feng, paulmck,
	catalin.marinas, Waiman Long

On Thu, Apr 05, 2018 at 07:19:12PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 05, 2018 at 05:59:00PM +0100, Will Deacon wrote:
> > +
> > +	/* In the PV case we might already have _Q_LOCKED_VAL set */
> > +	if ((val & _Q_TAIL_MASK) == tail) {
> >  		/*
> >  		 * The smp_cond_load_acquire() call above has provided the
> > +		 * necessary acquire semantics required for locking.
> >  		 */
> >  		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
> >  		if (old == val)
> > +			goto release; /* No contention */
> >  	}
> 
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -464,8 +464,7 @@ void queued_spin_lock_slowpath(struct qs
>  		 * The smp_cond_load_acquire() call above has provided the
>  		 * necessary acquire semantics required for locking.
>  		 */
> -		old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
> -		if (old == val)
> +		if (atomic_try_cmpxchg_release(&lock->val, &val, _Q_LOCKED_VAL))
>  			goto release; /* No contention */
>  	}
> 
> Does that also work for you? It would generate slightly better code for
> x86 (not that it would matter much on this path).

Assuming you meant to use atomic_try_cmpxchg_relaxed, then that works for
me too.

Will

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 06/10] barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed
  2018-04-05 17:22   ` Peter Zijlstra
@ 2018-04-06 10:55     ` Will Deacon
  0 siblings, 0 replies; 47+ messages in thread
From: Will Deacon @ 2018-04-06 10:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-arm-kernel, mingo, boqun.feng, paulmck,
	catalin.marinas, Waiman Long

On Thu, Apr 05, 2018 at 07:22:26PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 05, 2018 at 05:59:03PM +0100, Will Deacon wrote:
> > diff --git a/include/linux/atomic.h b/include/linux/atomic.h
> > index 8b276fd9a127..01ce3997cb42 100644
> > --- a/include/linux/atomic.h
> > +++ b/include/linux/atomic.h
> > @@ -654,6 +654,7 @@ static inline int atomic_dec_if_positive(atomic_t *v)
> >  }
> >  #endif
> >  
> > +#define atomic_cond_read_relaxed(v, c)	smp_cond_load_relaxed(&(v)->counter, (c))
> >  #define atomic_cond_read_acquire(v, c)	smp_cond_load_acquire(&(v)->counter, (c))
> >  
> >  #ifdef CONFIG_GENERIC_ATOMIC64
> > @@ -1075,6 +1076,7 @@ static inline long long atomic64_fetch_andnot_release(long long i, atomic64_t *v
> >  }
> >  #endif
> >  
> > +#define atomic64_cond_read_relaxed(v, c)	smp_cond_load_relaxed(&(v)->counter, (c))
> >  #define atomic64_cond_read_acquire(v, c)	smp_cond_load_acquire(&(v)->counter, (c))
> >  
> >  #include <asm-generic/atomic-long.h>
> 
> Did we again forget atomic_long glue ? ;-)

Bah! I'll add it for v2, thanks.

Will

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/10] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb()
  2018-04-05 17:28   ` Peter Zijlstra
@ 2018-04-06 11:34     ` Will Deacon
  2018-04-06 13:05       ` Andrea Parri
  0 siblings, 1 reply; 47+ messages in thread
From: Will Deacon @ 2018-04-06 11:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-arm-kernel, mingo, boqun.feng, paulmck,
	catalin.marinas, Waiman Long

On Thu, Apr 05, 2018 at 07:28:08PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 05, 2018 at 05:59:07PM +0100, Will Deacon wrote:
> > @@ -340,12 +341,17 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >  		goto release;
> >  
> >  	/*
> > +	 * Ensure that the initialisation of @node is complete before we
> > +	 * publish the updated tail and potentially link @node into the
> > +	 * waitqueue.
> > +	 */
> > +	smp_wmb();
> 
> Maybe an explicit note to where the matching barrier lives..

Oh man, that's not a simple thing to write: there isn't a matching barrier!

Instead, we rely on dependency ordering for two cases:

  * We access a node by decoding the tail we get back from the xchg

- or -

  * We access a node by following our own ->next pointer

I could say something like:

  "Pairs with dependency ordering from both xchg_tail and explicit
   dereferences of node->next"

but it's a bit cryptic :(

Will

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/10] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb()
  2018-04-06 11:34     ` Will Deacon
@ 2018-04-06 13:05       ` Andrea Parri
  2018-04-06 15:27         ` Will Deacon
  0 siblings, 1 reply; 47+ messages in thread
From: Andrea Parri @ 2018-04-06 13:05 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, linux-kernel, linux-arm-kernel, mingo,
	boqun.feng, paulmck, catalin.marinas, Waiman Long

Hi Will,

On Fri, Apr 06, 2018 at 12:34:36PM +0100, Will Deacon wrote:
> On Thu, Apr 05, 2018 at 07:28:08PM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 05, 2018 at 05:59:07PM +0100, Will Deacon wrote:
> > > @@ -340,12 +341,17 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > >  		goto release;
> > >  
> > >  	/*
> > > +	 * Ensure that the initialisation of @node is complete before we
> > > +	 * publish the updated tail and potentially link @node into the
> > > +	 * waitqueue.
> > > +	 */
> > > +	smp_wmb();
> > 
> > Maybe an explicit note to where the matching barrier lives..
> 
> Oh man, that's not a simple thing to write: there isn't a matching barrier!
> 
> Instead, we rely on dependency ordering for two cases:
> 
>   * We access a node by decoding the tail we get back from the xchg
> 
> - or -
> 
>   * We access a node by following our own ->next pointer
> 
> I could say something like:
> 
>   "Pairs with dependency ordering from both xchg_tail and explicit
>    dereferences of node->next"
> 
> but it's a bit cryptic :(

Agreed. ;)  It might be helpful to instead include a snippet to highlight
the interested memory accesses/dependencies; IIUC,

/*
 * Pairs with dependency ordering from both xchg_tail and explicit/?
 * dereferences of node->next:
 *
 *   CPU0
 *
 *   /* get node0, encode node0 in tail */
 *   pv_init_node(node0);
 *     ((struct pv_node *)node0)->cpu   = smp_processor_id();
 *     ((struct pv_node *)node0)->state = vcpu_running;

 *   smp_wmb();
 *   old = xchg_tail(lock, tail);
 *
 *   CPU1:
 *
 *   /* get node1, encode tail from node1 */
 *   old = xchg_tail(lock, tail);   // = tail corresponding to node0
 *                                  // head an addr. dependency
 *   /* decode old in prev */
 *   pv_wait_node(node1, prev);
 *     READ ((struct pv_node *)prev)->cpu   // addr. dependent read
 *     READ ((struct pv_node *)prev)->state // addr. dependend read
 *
 * [More details for the case "following our own ->next pointer" you
 *  mentioned dabove.]
 */

CPU1 would also have:

   WRITE_ONCE(prev->next, node1); // addr. dependent write

but I'm not sure how this pairs: does this belong to the the second
case above? can you elaborate on that?

  Andrea


> 
> Will

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] kernel/locking: qspinlock improvements
  2018-04-05 16:58 [PATCH 00/10] kernel/locking: qspinlock improvements Will Deacon
                   ` (9 preceding siblings ...)
  2018-04-05 16:59 ` [PATCH 10/10] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb() Will Deacon
@ 2018-04-06 13:22 ` Andrea Parri
  2018-04-11 10:20   ` Catalin Marinas
  10 siblings, 1 reply; 47+ messages in thread
From: Andrea Parri @ 2018-04-06 13:22 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, peterz, mingo, boqun.feng,
	paulmck, catalin.marinas

On Thu, Apr 05, 2018 at 05:58:57PM +0100, Will Deacon wrote:
> Hi all,
> 
> I've been kicking the tyres further on qspinlock and with this set of patches
> I'm happy with the performance and fairness properties. In particular, the
> locking algorithm now guarantees forward progress whereas the implementation
> in mainline can starve threads indefinitely in cmpxchg loops.
> 
> Catalin has also implemented a model of this using TLA to prove that the
> lock is fair, although this doesn't take the memory model into account:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/commit/

Nice!  I'll dig into this formalization, but my guess is that our model
(and axiomatic models "a-la-herd", in general) are not well-suited when
it comes to study properties such as fairness, liveness...

Did you already think about this?

  Andrea


> 
> I'd still like to get more benchmark numbers and wider exposure before
> enabling this for arm64, but my current testing is looking very promising.
> This series, along with the arm64-specific patches, is available at:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=qspinlock
> 
> Cheers,
> 
> Will
> 
> --->8
> 
> Jason Low (1):
>   locking/mcs: Use smp_cond_load_acquire() in mcs spin loop
> 
> Will Deacon (9):
>   locking/qspinlock: Don't spin on pending->locked transition in
>     slowpath
>   locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
>   locking/qspinlock: Kill cmpxchg loop when claiming lock from head of
>     queue
>   locking/qspinlock: Use atomic_cond_read_acquire
>   barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed
>   locking/qspinlock: Use smp_cond_load_relaxed to wait for next node
>   locking/qspinlock: Merge struct __qspinlock into struct qspinlock
>   locking/qspinlock: Make queued_spin_unlock use smp_store_release
>   locking/qspinlock: Elide back-to-back RELEASE operations with
>     smp_wmb()
> 
>  arch/x86/include/asm/qspinlock.h          |  19 ++-
>  arch/x86/include/asm/qspinlock_paravirt.h |   3 +-
>  include/asm-generic/barrier.h             |  27 ++++-
>  include/asm-generic/qspinlock.h           |   2 +-
>  include/asm-generic/qspinlock_types.h     |  32 ++++-
>  include/linux/atomic.h                    |   2 +
>  kernel/locking/mcs_spinlock.h             |  10 +-
>  kernel/locking/qspinlock.c                | 191 ++++++++++--------------------
>  kernel/locking/qspinlock_paravirt.h       |  34 ++----
>  9 files changed, 141 insertions(+), 179 deletions(-)
> 
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-05 17:07   ` Peter Zijlstra
@ 2018-04-06 15:08     ` Will Deacon
  0 siblings, 0 replies; 47+ messages in thread
From: Will Deacon @ 2018-04-06 15:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-arm-kernel, mingo, boqun.feng, paulmck,
	catalin.marinas, Waiman Long

On Thu, Apr 05, 2018 at 07:07:06PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 05, 2018 at 05:58:59PM +0100, Will Deacon wrote:
> > The qspinlock locking slowpath utilises a "pending" bit as a simple form
> > of an embedded test-and-set lock that can avoid the overhead of explicit
> > queuing in cases where the lock is held but uncontended. This bit is
> > managed using a cmpxchg loop which tries to transition the uncontended
> > lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).
> > 
> > Unfortunately, the cmpxchg loop is unbounded and lockers can be starved
> > indefinitely if the lock word is seen to oscillate between unlocked
> > (0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
> > able to take the lock in the cmpxchg loop without queuing and pass it
> > around amongst themselves.
> > 
> > This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
> > using atomic_fetch_or, 
> 
> Of course, LL/SC or cmpxchg implementations of fetch_or do not in fact
> get anything from this ;-)

Whilst it's true that they would still be unfair, the window is at least
reduced and moves a lot more of the fairness burden onto hardware itself.
ARMv8.1 has an instruction for atomic_fetch_or, so we can make good use
of it here.

Will

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-05 21:16   ` Waiman Long
@ 2018-04-06 15:08     ` Will Deacon
  0 siblings, 0 replies; 47+ messages in thread
From: Will Deacon @ 2018-04-06 15:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-kernel, linux-arm-kernel, peterz, mingo, boqun.feng,
	paulmck, catalin.marinas

On Thu, Apr 05, 2018 at 05:16:16PM -0400, Waiman Long wrote:
> On 04/05/2018 12:58 PM, Will Deacon wrote:
> >  	/*
> > -	 * we're pending, wait for the owner to go away.
> > -	 *
> > -	 * *,1,1 -> *,1,0
> > -	 *
> > -	 * this wait loop must be a load-acquire such that we match the
> > -	 * store-release that clears the locked bit and create lock
> > -	 * sequentiality; this is because not all clear_pending_set_locked()
> > -	 * implementations imply full barriers.
> > -	 */
> > -	smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_MASK));
> > -
> > -	/*
> > -	 * take ownership and clear the pending bit.
> > -	 *
> > -	 * *,1,0 -> *,0,1
> > +	 * If pending was clear but there are waiters in the queue, then
> > +	 * we need to undo our setting of pending before we queue ourselves.
> >  	 */
> > -	clear_pending_set_locked(lock);
> > -	return;
> > +	if (!(val & _Q_PENDING_MASK))
> > +		atomic_andnot(_Q_PENDING_VAL, &lock->val);
> Can we add a clear_pending() helper that will just clear the byte if
> _Q_PENDING_BITS == 8? That will eliminate one atomic instruction from
> the failure path.

Good idea!

Will

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/10] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb()
  2018-04-06 13:05       ` Andrea Parri
@ 2018-04-06 15:27         ` Will Deacon
  2018-04-06 15:49           ` Andrea Parri
  0 siblings, 1 reply; 47+ messages in thread
From: Will Deacon @ 2018-04-06 15:27 UTC (permalink / raw)
  To: Andrea Parri
  Cc: Peter Zijlstra, linux-kernel, linux-arm-kernel, mingo,
	boqun.feng, paulmck, catalin.marinas, Waiman Long

Hi Andrea,

On Fri, Apr 06, 2018 at 03:05:12PM +0200, Andrea Parri wrote:
> On Fri, Apr 06, 2018 at 12:34:36PM +0100, Will Deacon wrote:
> > I could say something like:
> > 
> >   "Pairs with dependency ordering from both xchg_tail and explicit
> >    dereferences of node->next"
> > 
> > but it's a bit cryptic :(
> 
> Agreed. ;)  It might be helpful to instead include a snippet to highlight
> the interested memory accesses/dependencies; IIUC,
> 
> /*
>  * Pairs with dependency ordering from both xchg_tail and explicit/?
>  * dereferences of node->next:
>  *
>  *   CPU0
>  *
>  *   /* get node0, encode node0 in tail */
>  *   pv_init_node(node0);
>  *     ((struct pv_node *)node0)->cpu   = smp_processor_id();
>  *     ((struct pv_node *)node0)->state = vcpu_running;

I'd probably ignore the PV case here and just focus on the native init
of count/locked/next.

>  *   smp_wmb();
>  *   old = xchg_tail(lock, tail);
>  *
>  *   CPU1:
>  *
>  *   /* get node1, encode tail from node1 */
>  *   old = xchg_tail(lock, tail);   // = tail corresponding to node0
>  *                                  // head an addr. dependency
>  *   /* decode old in prev */
>  *   pv_wait_node(node1, prev);

Similarly here -- the dependency is through decode_tail.

>  *     READ ((struct pv_node *)prev)->cpu   // addr. dependent read
>  *     READ ((struct pv_node *)prev)->state // addr. dependend read
>  *
>  * [More details for the case "following our own ->next pointer" you
>  *  mentioned dabove.]
>  */
> 
> CPU1 would also have:
> 
>    WRITE_ONCE(prev->next, node1); // addr. dependent write
> 
> but I'm not sure how this pairs: does this belong to the the second
> case above? can you elaborate on that?

This is dependent on the result of decode_tail, so it's still the first
case. The second case is when we queued into an empty tail but somebody
later queued behind us, so we don't find them until we're claiming the
lock:

  if (!next)
  	next = smp_cond_load_relaxed(&node->next, (VAL));

  arch_mcs_spin_unlock_contended(&next->locked);

here, this is all straightforward address dependencies rather than the
arithmetic in decode_tail.

Will

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/10] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb()
  2018-04-06 15:27         ` Will Deacon
@ 2018-04-06 15:49           ` Andrea Parri
  0 siblings, 0 replies; 47+ messages in thread
From: Andrea Parri @ 2018-04-06 15:49 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, linux-kernel, linux-arm-kernel, mingo,
	boqun.feng, paulmck, catalin.marinas, Waiman Long

On Fri, Apr 06, 2018 at 04:27:45PM +0100, Will Deacon wrote:
> Hi Andrea,
> 
> On Fri, Apr 06, 2018 at 03:05:12PM +0200, Andrea Parri wrote:
> > On Fri, Apr 06, 2018 at 12:34:36PM +0100, Will Deacon wrote:
> > > I could say something like:
> > > 
> > >   "Pairs with dependency ordering from both xchg_tail and explicit
> > >    dereferences of node->next"
> > > 
> > > but it's a bit cryptic :(
> > 
> > Agreed. ;)  It might be helpful to instead include a snippet to highlight
> > the interested memory accesses/dependencies; IIUC,
> > 
> > /*
> >  * Pairs with dependency ordering from both xchg_tail and explicit/?
> >  * dereferences of node->next:
> >  *
> >  *   CPU0
> >  *
> >  *   /* get node0, encode node0 in tail */
> >  *   pv_init_node(node0);
> >  *     ((struct pv_node *)node0)->cpu   = smp_processor_id();
> >  *     ((struct pv_node *)node0)->state = vcpu_running;
> 
> I'd probably ignore the PV case here and just focus on the native init
> of count/locked/next.
> 
> >  *   smp_wmb();
> >  *   old = xchg_tail(lock, tail);
> >  *
> >  *   CPU1:
> >  *
> >  *   /* get node1, encode tail from node1 */
> >  *   old = xchg_tail(lock, tail);   // = tail corresponding to node0
> >  *                                  // head an addr. dependency
> >  *   /* decode old in prev */
> >  *   pv_wait_node(node1, prev);
> 
> Similarly here -- the dependency is through decode_tail.
> 
> >  *     READ ((struct pv_node *)prev)->cpu   // addr. dependent read
> >  *     READ ((struct pv_node *)prev)->state // addr. dependend read
> >  *
> >  * [More details for the case "following our own ->next pointer" you
> >  *  mentioned dabove.]
> >  */
> > 
> > CPU1 would also have:
> > 
> >    WRITE_ONCE(prev->next, node1); // addr. dependent write
> > 
> > but I'm not sure how this pairs: does this belong to the the second
> > case above? can you elaborate on that?
> 
> This is dependent on the result of decode_tail, so it's still the first
> case. The second case is when we queued into an empty tail but somebody
> later queued behind us, so we don't find them until we're claiming the
> lock:
> 
>   if (!next)
>   	next = smp_cond_load_relaxed(&node->next, (VAL));
> 
>   arch_mcs_spin_unlock_contended(&next->locked);
> 
> here, this is all straightforward address dependencies rather than the
> arithmetic in decode_tail.

Got it. Thanks!

  Andrea


> 
> Will

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-05 16:58 ` [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
                     ` (2 preceding siblings ...)
  2018-04-05 21:16   ` Waiman Long
@ 2018-04-06 20:50   ` Waiman Long
  2018-04-06 21:09     ` Paul E. McKenney
                       ` (2 more replies)
  2018-04-10 13:49   ` Sasha Levin
  4 siblings, 3 replies; 47+ messages in thread
From: Waiman Long @ 2018-04-06 20:50 UTC (permalink / raw)
  To: Will Deacon, linux-kernel
  Cc: linux-arm-kernel, peterz, mingo, boqun.feng, paulmck, catalin.marinas

On 04/05/2018 12:58 PM, Will Deacon wrote:
> The qspinlock locking slowpath utilises a "pending" bit as a simple form
> of an embedded test-and-set lock that can avoid the overhead of explicit
> queuing in cases where the lock is held but uncontended. This bit is
> managed using a cmpxchg loop which tries to transition the uncontended
> lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).
>
> Unfortunately, the cmpxchg loop is unbounded and lockers can be starved
> indefinitely if the lock word is seen to oscillate between unlocked
> (0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
> able to take the lock in the cmpxchg loop without queuing and pass it
> around amongst themselves.
>
> This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
> using atomic_fetch_or, and then inspecting the old value to see whether
> we need to spin on the current lock owner, or whether we now effectively
> hold the lock. The tricky scenario is when concurrent lockers end up
> queuing on the lock and the lock becomes available, causing us to see
> a lockword of (n,0,0). With pending now set, simply queuing could lead
> to deadlock as the head of the queue may not have observed the pending
> flag being cleared. Conversely, if the head of the queue did observe
> pending being cleared, then it could transition the lock from (n,0,0) ->
> (0,0,1) meaning that any attempt to "undo" our setting of the pending
> bit could race with a concurrent locker trying to set it.
>
> We handle this race by preserving the pending bit when taking the lock
> after reaching the head of the queue and leaving the tail entry intact
> if we saw pending set, because we know that the tail is going to be
> updated shortly.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---

The pending bit was added to the qspinlock design to counter performance
degradation compared with ticket lock for workloads with light
spinlock contention. I run my spinlock stress test on a Intel Skylake
server running the vanilla 4.16 kernel vs a patched kernel with this
patchset. The locking rates with different number of locking threads
were as follows:

  # of threads  4.16 kernel     patched 4.16 kernel
  ------------  -----------     -------------------
        1       7,417 kop/s         7,408 kop/s
        2       5,755 kop/s         4,486 kop/s
        3       4,214 kop/s         4,169 kop/s
        4       4,396 kop/s         4,383 kop/s
       
The 2 contending threads case is the one that exercise the pending bit
code path the most. So it is obvious that this is the one that is most
impacted by this patchset. The differences in the other cases are mostly
noise or maybe just a little bit on the 3 contending threads case.

I am not against this patch, but we certainly need to find out a way to
bring the performance number up closer to what it is before applying
the patch.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-06 20:50   ` Waiman Long
@ 2018-04-06 21:09     ` Paul E. McKenney
  2018-04-07  8:47       ` Peter Zijlstra
  2018-04-07  9:07     ` Peter Zijlstra
  2018-04-09 10:58     ` Will Deacon
  2 siblings, 1 reply; 47+ messages in thread
From: Paul E. McKenney @ 2018-04-06 21:09 UTC (permalink / raw)
  To: Waiman Long
  Cc: Will Deacon, linux-kernel, linux-arm-kernel, peterz, mingo,
	boqun.feng, catalin.marinas

On Fri, Apr 06, 2018 at 04:50:19PM -0400, Waiman Long wrote:
> On 04/05/2018 12:58 PM, Will Deacon wrote:
> > The qspinlock locking slowpath utilises a "pending" bit as a simple form
> > of an embedded test-and-set lock that can avoid the overhead of explicit
> > queuing in cases where the lock is held but uncontended. This bit is
> > managed using a cmpxchg loop which tries to transition the uncontended
> > lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).
> >
> > Unfortunately, the cmpxchg loop is unbounded and lockers can be starved
> > indefinitely if the lock word is seen to oscillate between unlocked
> > (0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
> > able to take the lock in the cmpxchg loop without queuing and pass it
> > around amongst themselves.
> >
> > This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
> > using atomic_fetch_or, and then inspecting the old value to see whether
> > we need to spin on the current lock owner, or whether we now effectively
> > hold the lock. The tricky scenario is when concurrent lockers end up
> > queuing on the lock and the lock becomes available, causing us to see
> > a lockword of (n,0,0). With pending now set, simply queuing could lead
> > to deadlock as the head of the queue may not have observed the pending
> > flag being cleared. Conversely, if the head of the queue did observe
> > pending being cleared, then it could transition the lock from (n,0,0) ->
> > (0,0,1) meaning that any attempt to "undo" our setting of the pending
> > bit could race with a concurrent locker trying to set it.
> >
> > We handle this race by preserving the pending bit when taking the lock
> > after reaching the head of the queue and leaving the tail entry intact
> > if we saw pending set, because we know that the tail is going to be
> > updated shortly.
> >
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > ---
> 
> The pending bit was added to the qspinlock design to counter performance
> degradation compared with ticket lock for workloads with light
> spinlock contention. I run my spinlock stress test on a Intel Skylake
> server running the vanilla 4.16 kernel vs a patched kernel with this
> patchset. The locking rates with different number of locking threads
> were as follows:
> 
>   # of threads  4.16 kernel     patched 4.16 kernel
>   ------------  -----------     -------------------
>         1       7,417 kop/s         7,408 kop/s
>         2       5,755 kop/s         4,486 kop/s
>         3       4,214 kop/s         4,169 kop/s
>         4       4,396 kop/s         4,383 kop/s
>        
> The 2 contending threads case is the one that exercise the pending bit
> code path the most. So it is obvious that this is the one that is most
> impacted by this patchset. The differences in the other cases are mostly
> noise or maybe just a little bit on the 3 contending threads case.
> 
> I am not against this patch, but we certainly need to find out a way to
> bring the performance number up closer to what it is before applying
> the patch.

It would indeed be good to not be in the position of having to trade off
forward-progress guarantees against performance, but that does appear to
be where we are at the moment.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 08/10] locking/qspinlock: Merge struct __qspinlock into struct qspinlock
  2018-04-05 16:59 ` [PATCH 08/10] locking/qspinlock: Merge struct __qspinlock into struct qspinlock Will Deacon
@ 2018-04-07  5:23   ` Boqun Feng
  0 siblings, 0 replies; 47+ messages in thread
From: Boqun Feng @ 2018-04-07  5:23 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, peterz, mingo, paulmck, catalin.marinas

[-- Attachment #1: Type: text/plain, Size: 11070 bytes --]

On Thu, Apr 05, 2018 at 05:59:05PM +0100, Will Deacon wrote:
> struct __qspinlock provides a handy union of fields so that
> subcomponents of the lockword can be accessed by name, without having to
> manage shifts and masks explicitly and take endianness into account.
> 
> This is useful in qspinlock.h and also potentially in arch headers, so
> move the struct __qspinlock into struct qspinlock and kill the extra
> definition.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>

As I said in the IRC, it's glad to see such a merge ;-)

Acked-by: Boqun Feng <boqun.feng@gmail.com>

Regards,
Boqun

> ---
>  arch/x86/include/asm/qspinlock.h          |  2 +-
>  arch/x86/include/asm/qspinlock_paravirt.h |  3 +-
>  include/asm-generic/qspinlock_types.h     | 32 +++++++++++++++++++--
>  kernel/locking/qspinlock.c                | 46 ++-----------------------------
>  kernel/locking/qspinlock_paravirt.h       | 34 ++++++++---------------
>  5 files changed, 46 insertions(+), 71 deletions(-)
> 
> diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
> index 5e16b5d40d32..90b0b0ed8161 100644
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -16,7 +16,7 @@
>   */
>  static inline void native_queued_spin_unlock(struct qspinlock *lock)
>  {
> -	smp_store_release((u8 *)lock, 0);
> +	smp_store_release(&lock->locked, 0);
>  }
>  
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
> diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
> index 923307ea11c7..9ef5ee03d2d7 100644
> --- a/arch/x86/include/asm/qspinlock_paravirt.h
> +++ b/arch/x86/include/asm/qspinlock_paravirt.h
> @@ -22,8 +22,7 @@ PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
>   *
>   * void __pv_queued_spin_unlock(struct qspinlock *lock)
>   * {
> - *	struct __qspinlock *l = (void *)lock;
> - *	u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
> + *	u8 lockval = cmpxchg(&lock->locked, _Q_LOCKED_VAL, 0);
>   *
>   *	if (likely(lockval == _Q_LOCKED_VAL))
>   *		return;
> diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
> index 034acd0c4956..0763f065b975 100644
> --- a/include/asm-generic/qspinlock_types.h
> +++ b/include/asm-generic/qspinlock_types.h
> @@ -29,13 +29,41 @@
>  #endif
>  
>  typedef struct qspinlock {
> -	atomic_t	val;
> +	union {
> +		atomic_t val;
> +
> +		/*
> +		 * By using the whole 2nd least significant byte for the
> +		 * pending bit, we can allow better optimization of the lock
> +		 * acquisition for the pending bit holder.
> +		 */
> +#ifdef __LITTLE_ENDIAN
> +		struct {
> +			u8	locked;
> +			u8	pending;
> +		};
> +		struct {
> +			u16	locked_pending;
> +			u16	tail;
> +		};
> +#else
> +		struct {
> +			u16	tail;
> +			u16	locked_pending;
> +		};
> +		struct {
> +			u8	reserved[2];
> +			u8	pending;
> +			u8	locked;
> +		};
> +#endif
> +	};
>  } arch_spinlock_t;
>  
>  /*
>   * Initializier
>   */
> -#define	__ARCH_SPIN_LOCK_UNLOCKED	{ ATOMIC_INIT(0) }
> +#define	__ARCH_SPIN_LOCK_UNLOCKED	{ .val = ATOMIC_INIT(0) }
>  
>  /*
>   * Bitfields in the atomic value:
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index c8b57d375b49..3ad8786a47e2 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -114,40 +114,6 @@ static inline __pure struct mcs_spinlock *decode_tail(u32 tail)
>  
>  #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
>  
> -/*
> - * By using the whole 2nd least significant byte for the pending bit, we
> - * can allow better optimization of the lock acquisition for the pending
> - * bit holder.
> - *
> - * This internal structure is also used by the set_locked function which
> - * is not restricted to _Q_PENDING_BITS == 8.
> - */
> -struct __qspinlock {
> -	union {
> -		atomic_t val;
> -#ifdef __LITTLE_ENDIAN
> -		struct {
> -			u8	locked;
> -			u8	pending;
> -		};
> -		struct {
> -			u16	locked_pending;
> -			u16	tail;
> -		};
> -#else
> -		struct {
> -			u16	tail;
> -			u16	locked_pending;
> -		};
> -		struct {
> -			u8	reserved[2];
> -			u8	pending;
> -			u8	locked;
> -		};
> -#endif
> -	};
> -};
> -
>  #if _Q_PENDING_BITS == 8
>  /**
>   * clear_pending_set_locked - take ownership and clear the pending bit.
> @@ -159,9 +125,7 @@ struct __qspinlock {
>   */
>  static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
>  {
> -	struct __qspinlock *l = (void *)lock;
> -
> -	WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
> +	WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
>  }
>  
>  /*
> @@ -176,13 +140,11 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
>   */
>  static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
>  {
> -	struct __qspinlock *l = (void *)lock;
> -
>  	/*
>  	 * Use release semantics to make sure that the MCS node is properly
>  	 * initialized before changing the tail code.
>  	 */
> -	return (u32)xchg_release(&l->tail,
> +	return (u32)xchg_release(&lock->tail,
>  				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
>  }
>  
> @@ -237,9 +199,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
>   */
>  static __always_inline void set_locked(struct qspinlock *lock)
>  {
> -	struct __qspinlock *l = (void *)lock;
> -
> -	WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
> +	WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
>  }
>  
>  
> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> index 6ee477765e6c..2711940429f5 100644
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -87,8 +87,6 @@ struct pv_node {
>  #define queued_spin_trylock(l)	pv_hybrid_queued_unfair_trylock(l)
>  static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
>  {
> -	struct __qspinlock *l = (void *)lock;
> -
>  	/*
>  	 * Stay in unfair lock mode as long as queued mode waiters are
>  	 * present in the MCS wait queue but the pending bit isn't set.
> @@ -97,7 +95,7 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
>  		int val = atomic_read(&lock->val);
>  
>  		if (!(val & _Q_LOCKED_PENDING_MASK) &&
> -		   (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> +		   (cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0)) {
>  			qstat_inc(qstat_pv_lock_stealing, true);
>  			return true;
>  		}
> @@ -117,16 +115,12 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
>  #if _Q_PENDING_BITS == 8
>  static __always_inline void set_pending(struct qspinlock *lock)
>  {
> -	struct __qspinlock *l = (void *)lock;
> -
> -	WRITE_ONCE(l->pending, 1);
> +	WRITE_ONCE(lock->pending, 1);
>  }
>  
>  static __always_inline void clear_pending(struct qspinlock *lock)
>  {
> -	struct __qspinlock *l = (void *)lock;
> -
> -	WRITE_ONCE(l->pending, 0);
> +	WRITE_ONCE(lock->pending, 0);
>  }
>  
>  /*
> @@ -136,10 +130,8 @@ static __always_inline void clear_pending(struct qspinlock *lock)
>   */
>  static __always_inline int trylock_clear_pending(struct qspinlock *lock)
>  {
> -	struct __qspinlock *l = (void *)lock;
> -
> -	return !READ_ONCE(l->locked) &&
> -	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
> +	return !READ_ONCE(lock->locked) &&
> +	       (cmpxchg_acquire(&lock->locked_pending, _Q_PENDING_VAL,
>  				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
>  }
>  #else /* _Q_PENDING_BITS == 8 */
> @@ -384,7 +376,6 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
>  static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>  {
>  	struct pv_node *pn = (struct pv_node *)node;
> -	struct __qspinlock *l = (void *)lock;
>  
>  	/*
>  	 * If the vCPU is indeed halted, advance its state to match that of
> @@ -413,7 +404,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>  	 * the hash table later on at unlock time, no atomic instruction is
>  	 * needed.
>  	 */
> -	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
> +	WRITE_ONCE(lock->locked, _Q_SLOW_VAL);
>  	(void)pv_hash(lock, pn);
>  }
>  
> @@ -428,7 +419,6 @@ static u32
>  pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>  {
>  	struct pv_node *pn = (struct pv_node *)node;
> -	struct __qspinlock *l = (void *)lock;
>  	struct qspinlock **lp = NULL;
>  	int waitcnt = 0;
>  	int loop;
> @@ -479,13 +469,13 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>  			 *
>  			 * Matches the smp_rmb() in __pv_queued_spin_unlock().
>  			 */
> -			if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
> +			if (xchg(&lock->locked, _Q_SLOW_VAL) == 0) {
>  				/*
>  				 * The lock was free and now we own the lock.
>  				 * Change the lock value back to _Q_LOCKED_VAL
>  				 * and unhash the table.
>  				 */
> -				WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
> +				WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
>  				WRITE_ONCE(*lp, NULL);
>  				goto gotlock;
>  			}
> @@ -493,7 +483,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>  		WRITE_ONCE(pn->state, vcpu_hashed);
>  		qstat_inc(qstat_pv_wait_head, true);
>  		qstat_inc(qstat_pv_wait_again, waitcnt);
> -		pv_wait(&l->locked, _Q_SLOW_VAL);
> +		pv_wait(&lock->locked, _Q_SLOW_VAL);
>  
>  		/*
>  		 * Because of lock stealing, the queue head vCPU may not be
> @@ -518,7 +508,6 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>  __visible void
>  __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
>  {
> -	struct __qspinlock *l = (void *)lock;
>  	struct pv_node *node;
>  
>  	if (unlikely(locked != _Q_SLOW_VAL)) {
> @@ -547,7 +536,7 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
>  	 * Now that we have a reference to the (likely) blocked pv_node,
>  	 * release the lock.
>  	 */
> -	smp_store_release(&l->locked, 0);
> +	smp_store_release(&lock->locked, 0);
>  
>  	/*
>  	 * At this point the memory pointed at by lock can be freed/reused,
> @@ -573,7 +562,6 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
>  #ifndef __pv_queued_spin_unlock
>  __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
>  {
> -	struct __qspinlock *l = (void *)lock;
>  	u8 locked;
>  
>  	/*
> @@ -581,7 +569,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
>  	 * unhash. Otherwise it would be possible to have multiple @lock
>  	 * entries, which would be BAD.
>  	 */
> -	locked = cmpxchg_release(&l->locked, _Q_LOCKED_VAL, 0);
> +	locked = cmpxchg_release(&lock->locked, _Q_LOCKED_VAL, 0);
>  	if (likely(locked == _Q_LOCKED_VAL))
>  		return;
>  
> -- 
> 2.1.4
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/10] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb()
  2018-04-05 16:59 ` [PATCH 10/10] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb() Will Deacon
  2018-04-05 17:28   ` Peter Zijlstra
@ 2018-04-07  5:47   ` Boqun Feng
  2018-04-09 10:47     ` Will Deacon
  1 sibling, 1 reply; 47+ messages in thread
From: Boqun Feng @ 2018-04-07  5:47 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, peterz, mingo, paulmck, catalin.marinas

[-- Attachment #1: Type: text/plain, Size: 4158 bytes --]

On Thu, Apr 05, 2018 at 05:59:07PM +0100, Will Deacon wrote:
> The qspinlock slowpath must ensure that the MCS node is fully initialised
> before it can be reached by another other CPU. This is currently enforced
> by using a RELEASE operation when updating the tail and also when linking
> the node into the waitqueue (since the control dependency off xchg_tail
> is insufficient to enforce sufficient ordering -- see 95bcade33a8a
> ("locking/qspinlock: Ensure node is initialised before updating prev->next")).
> 
> Back-to-back RELEASE operations may be expensive on some architectures,
> particularly those that implement them using fences under the hood. We
> can replace the two RELEASE operations with a single smp_wmb() fence and
> use RELAXED operations for the subsequent publishing of the node.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  kernel/locking/qspinlock.c | 32 +++++++++++++++-----------------
>  1 file changed, 15 insertions(+), 17 deletions(-)
> 
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 3ad8786a47e2..42c61f7b37c5 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -141,10 +141,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
>  static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
>  {
>  	/*
> -	 * Use release semantics to make sure that the MCS node is properly
> -	 * initialized before changing the tail code.
> +	 * We can use relaxed semantics since the caller ensures that the
> +	 * MCS node is properly initialized before updating the tail.
>  	 */
> -	return (u32)xchg_release(&lock->tail,
> +	return (u32)xchg_relaxed(&lock->tail,
>  				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
>  }
>  
> @@ -178,10 +178,11 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
>  	for (;;) {
>  		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
>  		/*
> -		 * Use release semantics to make sure that the MCS node is
> -		 * properly initialized before changing the tail code.
> +		 * We can use relaxed semantics since the caller ensures that
> +		 * the MCS node is properly initialized before updating the
> +		 * tail.
>  		 */
> -		old = atomic_cmpxchg_release(&lock->val, val, new);
> +		old = atomic_cmpxchg_relaxed(&lock->val, val, new);
>  		if (old == val)
>  			break;
>  
> @@ -340,12 +341,17 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  		goto release;
>  
>  	/*
> +	 * Ensure that the initialisation of @node is complete before we
> +	 * publish the updated tail and potentially link @node into the

I think it might be better if we mention exactly where we "publish the
updated tail" and "link @node", how about:

	* publish the update tail via xchg_tail() and potentially link
	* @node into the waitqueue via WRITE_ONCE(->next,..) below.

and also add comments below like:

> +	 * waitqueue.
> +	 */
> +	smp_wmb();
> +
> +	/*
>  	 * We have already touched the queueing cacheline; don't bother with
>  	 * pending stuff.
>  	 *
>  	 * p,*,* -> n,*,*
> -	 *
> -	 * RELEASE, such that the stores to @node must be complete.

	* publish the updated tail

>  	 */
>  	old = xchg_tail(lock, tail);
>  	next = NULL;
> @@ -356,15 +362,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	 */
>  	if (old & _Q_TAIL_MASK) {
>  		prev = decode_tail(old);
> -
> -		/*
> -		 * We must ensure that the stores to @node are observed before
> -		 * the write to prev->next. The address dependency from
> -		 * xchg_tail is not sufficient to ensure this because the read
> -		 * component of xchg_tail is unordered with respect to the
> -		 * initialisation of @node.
> -		 */
> -		smp_store_release(&prev->next, node);

		/* Eventually link @node to the wait queue */
	
Thoughts?

Regards,
Boqun

> +		WRITE_ONCE(prev->next, node);
>  
>  		pv_wait_node(node, prev);
>  		arch_mcs_spin_lock_contended(&node->locked);
> -- 
> 2.1.4
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-06 21:09     ` Paul E. McKenney
@ 2018-04-07  8:47       ` Peter Zijlstra
  2018-04-07 23:37         ` Paul E. McKenney
  2018-04-09 10:58         ` Will Deacon
  0 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2018-04-07  8:47 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Waiman Long, Will Deacon, linux-kernel, linux-arm-kernel, mingo,
	boqun.feng, catalin.marinas

On Fri, Apr 06, 2018 at 02:09:53PM -0700, Paul E. McKenney wrote:
> It would indeed be good to not be in the position of having to trade off
> forward-progress guarantees against performance, but that does appear to
> be where we are at the moment.

Depends of course on how unfair cmpxchg is. On x86 we trade one cmpxchg
loop for another so the patch doesn't cure anything at all there. And
our cmpxchg has 'some' hardware fairness to it.

So while the patch is 'good' for platforms that have native fetch-or,
it doesn't help (or in our case even hurts) those that do not.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-06 20:50   ` Waiman Long
  2018-04-06 21:09     ` Paul E. McKenney
@ 2018-04-07  9:07     ` Peter Zijlstra
  2018-04-09 10:58     ` Will Deacon
  2 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2018-04-07  9:07 UTC (permalink / raw)
  To: Waiman Long
  Cc: Will Deacon, linux-kernel, linux-arm-kernel, mingo, boqun.feng,
	paulmck, catalin.marinas

On Fri, Apr 06, 2018 at 04:50:19PM -0400, Waiman Long wrote:
>   # of threads  4.16 kernel     patched 4.16 kernel
>   ------------  -----------     -------------------
>         1       7,417 kop/s         7,408 kop/s
>         2       5,755 kop/s         4,486 kop/s
>         3       4,214 kop/s         4,169 kop/s
>         4       4,396 kop/s         4,383 kop/s
>        

Interesting, I didn't see that dip in my userspace tests.. I'll have to
try again.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-07  8:47       ` Peter Zijlstra
@ 2018-04-07 23:37         ` Paul E. McKenney
  2018-04-09 10:58         ` Will Deacon
  1 sibling, 0 replies; 47+ messages in thread
From: Paul E. McKenney @ 2018-04-07 23:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Will Deacon, linux-kernel, linux-arm-kernel, mingo,
	boqun.feng, catalin.marinas

On Sat, Apr 07, 2018 at 10:47:32AM +0200, Peter Zijlstra wrote:
> On Fri, Apr 06, 2018 at 02:09:53PM -0700, Paul E. McKenney wrote:
> > It would indeed be good to not be in the position of having to trade off
> > forward-progress guarantees against performance, but that does appear to
> > be where we are at the moment.
> 
> Depends of course on how unfair cmpxchg is. On x86 we trade one cmpxchg
> loop for another so the patch doesn't cure anything at all there. And
> our cmpxchg has 'some' hardware fairness to it.
> 
> So while the patch is 'good' for platforms that have native fetch-or,
> it doesn't help (or in our case even hurts) those that do not.

Might need different implementations for different architectures, then.
Or take advantage of the fact that x86 can do a native fetch-or to the
topmost bit, if that helps.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/10] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb()
  2018-04-07  5:47   ` Boqun Feng
@ 2018-04-09 10:47     ` Will Deacon
  0 siblings, 0 replies; 47+ messages in thread
From: Will Deacon @ 2018-04-09 10:47 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, linux-arm-kernel, peterz, mingo, paulmck, catalin.marinas

Hi Boqun,

On Sat, Apr 07, 2018 at 01:47:11PM +0800, Boqun Feng wrote:
> On Thu, Apr 05, 2018 at 05:59:07PM +0100, Will Deacon wrote:
> > @@ -340,12 +341,17 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >  		goto release;
> >  
> >  	/*
> > +	 * Ensure that the initialisation of @node is complete before we
> > +	 * publish the updated tail and potentially link @node into the
> 
> I think it might be better if we mention exactly where we "publish the
> updated tail" and "link @node", how about:
> 
> 	* publish the update tail via xchg_tail() and potentially link
> 	* @node into the waitqueue via WRITE_ONCE(->next,..) below.
> 
> and also add comments below like:
> 
> > +	 * waitqueue.
> > +	 */
> > +	smp_wmb();
> > +
> > +	/*
> >  	 * We have already touched the queueing cacheline; don't bother with
> >  	 * pending stuff.
> >  	 *
> >  	 * p,*,* -> n,*,*
> > -	 *
> > -	 * RELEASE, such that the stores to @node must be complete.
> 
> 	* publish the updated tail
> 
> >  	 */
> >  	old = xchg_tail(lock, tail);
> >  	next = NULL;
> > @@ -356,15 +362,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >  	 */
> >  	if (old & _Q_TAIL_MASK) {
> >  		prev = decode_tail(old);
> > -
> > -		/*
> > -		 * We must ensure that the stores to @node are observed before
> > -		 * the write to prev->next. The address dependency from
> > -		 * xchg_tail is not sufficient to ensure this because the read
> > -		 * component of xchg_tail is unordered with respect to the
> > -		 * initialisation of @node.
> > -		 */
> > -		smp_store_release(&prev->next, node);
> 
> 		/* Eventually link @node to the wait queue */
> 	
> Thoughts?

I'll make some changes along these lines for v2. Thanks!

Will

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-06 20:50   ` Waiman Long
  2018-04-06 21:09     ` Paul E. McKenney
  2018-04-07  9:07     ` Peter Zijlstra
@ 2018-04-09 10:58     ` Will Deacon
  2018-04-09 14:54       ` Will Deacon
  2018-04-09 17:55       ` Waiman Long
  2 siblings, 2 replies; 47+ messages in thread
From: Will Deacon @ 2018-04-09 10:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-kernel, linux-arm-kernel, peterz, mingo, boqun.feng,
	paulmck, catalin.marinas

Hi Waiman,

Thanks for taking this lot for a spin. Comments and questions below.

On Fri, Apr 06, 2018 at 04:50:19PM -0400, Waiman Long wrote:
> On 04/05/2018 12:58 PM, Will Deacon wrote:
> > The qspinlock locking slowpath utilises a "pending" bit as a simple form
> > of an embedded test-and-set lock that can avoid the overhead of explicit
> > queuing in cases where the lock is held but uncontended. This bit is
> > managed using a cmpxchg loop which tries to transition the uncontended
> > lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).
> >
> > Unfortunately, the cmpxchg loop is unbounded and lockers can be starved
> > indefinitely if the lock word is seen to oscillate between unlocked
> > (0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
> > able to take the lock in the cmpxchg loop without queuing and pass it
> > around amongst themselves.
> >
> > This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
> > using atomic_fetch_or, and then inspecting the old value to see whether
> > we need to spin on the current lock owner, or whether we now effectively
> > hold the lock. The tricky scenario is when concurrent lockers end up
> > queuing on the lock and the lock becomes available, causing us to see
> > a lockword of (n,0,0). With pending now set, simply queuing could lead
> > to deadlock as the head of the queue may not have observed the pending
> > flag being cleared. Conversely, if the head of the queue did observe
> > pending being cleared, then it could transition the lock from (n,0,0) ->
> > (0,0,1) meaning that any attempt to "undo" our setting of the pending
> > bit could race with a concurrent locker trying to set it.
> >
> > We handle this race by preserving the pending bit when taking the lock
> > after reaching the head of the queue and leaving the tail entry intact
> > if we saw pending set, because we know that the tail is going to be
> > updated shortly.
> >
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > ---
> 
> The pending bit was added to the qspinlock design to counter performance
> degradation compared with ticket lock for workloads with light
> spinlock contention. I run my spinlock stress test on a Intel Skylake
> server running the vanilla 4.16 kernel vs a patched kernel with this
> patchset. The locking rates with different number of locking threads
> were as follows:
> 
>   # of threads  4.16 kernel     patched 4.16 kernel
>   ------------  -----------     -------------------
>         1       7,417 kop/s         7,408 kop/s
>         2       5,755 kop/s         4,486 kop/s
>         3       4,214 kop/s         4,169 kop/s
>         4       4,396 kop/s         4,383 kop/s
>        
> The 2 contending threads case is the one that exercise the pending bit
> code path the most. So it is obvious that this is the one that is most
> impacted by this patchset. The differences in the other cases are mostly
> noise or maybe just a little bit on the 3 contending threads case.

That is bizarre. A few questions:

  1. Is this with my patches as posted, or also with your WRITE_ONCE change?
  2. Could you try to bisect my series to see which patch is responsible
     for this degradation, please?
  3. Could you point me at your stress test, so I can try to reproduce these
     numbers on arm64 systems, please?

> I am not against this patch, but we certainly need to find out a way to
> bring the performance number up closer to what it is before applying
> the patch.

We certainly need to *understand* where the drop is coming from, because
the two-threaded case is still just a CAS on x86 with and without this
patch series. Generally, there's a throughput cost when ensuring fairness
and forward-progress otherwise we'd all be using test-and-set.

Thanks,

Will

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-07  8:47       ` Peter Zijlstra
  2018-04-07 23:37         ` Paul E. McKenney
@ 2018-04-09 10:58         ` Will Deacon
  1 sibling, 0 replies; 47+ messages in thread
From: Will Deacon @ 2018-04-09 10:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Waiman Long, linux-kernel, linux-arm-kernel,
	mingo, boqun.feng, catalin.marinas

On Sat, Apr 07, 2018 at 10:47:32AM +0200, Peter Zijlstra wrote:
> On Fri, Apr 06, 2018 at 02:09:53PM -0700, Paul E. McKenney wrote:
> > It would indeed be good to not be in the position of having to trade off
> > forward-progress guarantees against performance, but that does appear to
> > be where we are at the moment.
> 
> Depends of course on how unfair cmpxchg is. On x86 we trade one cmpxchg
> loop for another so the patch doesn't cure anything at all there. And
> our cmpxchg has 'some' hardware fairness to it.
> 
> So while the patch is 'good' for platforms that have native fetch-or,
> it doesn't help (or in our case even hurts) those that do not.

We need to get to the bottom of this, otherwise we're just relying on
Waiman's testing to validate any changes to this code!

Will

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-09 10:58     ` Will Deacon
@ 2018-04-09 14:54       ` Will Deacon
  2018-04-09 15:54         ` Peter Zijlstra
  2018-04-09 19:33         ` Waiman Long
  2018-04-09 17:55       ` Waiman Long
  1 sibling, 2 replies; 47+ messages in thread
From: Will Deacon @ 2018-04-09 14:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-kernel, linux-arm-kernel, peterz, mingo, boqun.feng,
	paulmck, catalin.marinas

On Mon, Apr 09, 2018 at 11:58:35AM +0100, Will Deacon wrote:
> On Fri, Apr 06, 2018 at 04:50:19PM -0400, Waiman Long wrote:
> > The pending bit was added to the qspinlock design to counter performance
> > degradation compared with ticket lock for workloads with light
> > spinlock contention. I run my spinlock stress test on a Intel Skylake
> > server running the vanilla 4.16 kernel vs a patched kernel with this
> > patchset. The locking rates with different number of locking threads
> > were as follows:
> > 
> >   # of threads  4.16 kernel     patched 4.16 kernel
> >   ------------  -----------     -------------------
> >         1       7,417 kop/s         7,408 kop/s
> >         2       5,755 kop/s         4,486 kop/s
> >         3       4,214 kop/s         4,169 kop/s
> >         4       4,396 kop/s         4,383 kop/s
> >        
> > The 2 contending threads case is the one that exercise the pending bit
> > code path the most. So it is obvious that this is the one that is most
> > impacted by this patchset. The differences in the other cases are mostly
> > noise or maybe just a little bit on the 3 contending threads case.
> 
> That is bizarre. A few questions:
> 
>   1. Is this with my patches as posted, or also with your WRITE_ONCE change?
>   2. Could you try to bisect my series to see which patch is responsible
>      for this degradation, please?
>   3. Could you point me at your stress test, so I can try to reproduce these
>      numbers on arm64 systems, please?
> 
> > I am not against this patch, but we certainly need to find out a way to
> > bring the performance number up closer to what it is before applying
> > the patch.
> 
> We certainly need to *understand* where the drop is coming from, because
> the two-threaded case is still just a CAS on x86 with and without this
> patch series. Generally, there's a throughput cost when ensuring fairness
> and forward-progress otherwise we'd all be using test-and-set.

Whilst I think we still need to address my questions above, I've had a
crack at the diff below. Please can you give it a spin? It sticks a trylock
on the slowpath before setting pending and replaces the CAS-based set
with an xchg (which I *think* is safe, but will need to ponder it some
more).

Thanks,

Will

--->8

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 19261af9f61e..71eb5e3a3d91 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -139,6 +139,20 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 	WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
 }
 
+/**
+ * set_pending_fetch_acquire - set the pending bit and return the old lock
+ *                             value with acquire semantics.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,*,* -> *,1,*
+ */
+static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
+{
+	u32 val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;
+	val |= (atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK);
+	return val;
+}
+
 /*
  * xchg_tail - Put in the new queue tail code word & retrieve previous one
  * @lock : Pointer to queued spinlock structure
@@ -184,6 +198,18 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 }
 
 /**
+ * set_pending_fetch_acquire - set the pending bit and return the old lock
+ *                             value with acquire semantics.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,*,* -> *,1,*
+ */
+static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
+{
+	return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+}
+
+/**
  * xchg_tail - Put in the new queue tail code word & retrieve previous one
  * @lock : Pointer to queued spinlock structure
  * @tail : The new queue tail code word
@@ -289,18 +315,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		return;
 
 	/*
-	 * If we observe any contention; queue.
+	 * If we observe queueing, then queue ourselves.
 	 */
-	if (val & ~_Q_LOCKED_MASK)
+	if (val & _Q_TAIL_MASK)
 		goto queue;
 
 	/*
+	 * We didn't see any queueing, so have one more try at snatching
+	 * the lock in case it became available whilst we were taking the
+	 * slow path.
+	 */
+	if (queued_spin_trylock(lock))
+		return;
+
+	/*
 	 * trylock || pending
 	 *
 	 * 0,0,0 -> 0,0,1 ; trylock
 	 * 0,0,1 -> 0,1,1 ; pending
 	 */
-	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+	val = set_pending_fetch_acquire(lock);
 	if (!(val & ~_Q_LOCKED_MASK)) {
 		/*
 		 * we're pending, wait for the owner to go away.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-09 14:54       ` Will Deacon
@ 2018-04-09 15:54         ` Peter Zijlstra
  2018-04-09 17:19           ` Will Deacon
  2018-04-09 19:33         ` Waiman Long
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2018-04-09 15:54 UTC (permalink / raw)
  To: Will Deacon
  Cc: Waiman Long, linux-kernel, linux-arm-kernel, mingo, boqun.feng,
	paulmck, catalin.marinas

On Mon, Apr 09, 2018 at 03:54:09PM +0100, Will Deacon wrote:

> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 19261af9f61e..71eb5e3a3d91 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -139,6 +139,20 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
>  	WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
>  }
>  
> +/**
> + * set_pending_fetch_acquire - set the pending bit and return the old lock
> + *                             value with acquire semantics.
> + * @lock: Pointer to queued spinlock structure
> + *
> + * *,*,* -> *,1,*
> + */
> +static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
> +{
> +	u32 val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;
> +	val |= (atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK);
> +	return val;
> +}

> @@ -289,18 +315,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  		return;
>  
>  	/*
> -	 * If we observe any contention; queue.
> +	 * If we observe queueing, then queue ourselves.
>  	 */
> -	if (val & ~_Q_LOCKED_MASK)
> +	if (val & _Q_TAIL_MASK)
>  		goto queue;
>  
>  	/*
> +	 * We didn't see any queueing, so have one more try at snatching
> +	 * the lock in case it became available whilst we were taking the
> +	 * slow path.
> +	 */
> +	if (queued_spin_trylock(lock))
> +		return;
> +
> +	/*
>  	 * trylock || pending
>  	 *
>  	 * 0,0,0 -> 0,0,1 ; trylock
>  	 * 0,0,1 -> 0,1,1 ; pending
>  	 */
> +	val = set_pending_fetch_acquire(lock);
>  	if (!(val & ~_Q_LOCKED_MASK)) {

So, if I remember that partial paper correctly, the atomc_read_acquire()
can see 'arbitrary' old values for everything except the pending byte,
which it just wrote and will fwd into our load, right?

But I think coherence requires the read to not be older than the one
observed by the trylock before (since it uses c-cas its acquire can be
elided).

I think this means we can miss a concurrent unlock vs the fetch_or. And
I think that's fine, if we still see the lock set we'll needlessly 'wait'
for it go become unlocked.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-09 15:54         ` Peter Zijlstra
@ 2018-04-09 17:19           ` Will Deacon
  2018-04-10  9:35             ` Peter Zijlstra
  2018-09-20 16:08             ` Peter Zijlstra
  0 siblings, 2 replies; 47+ messages in thread
From: Will Deacon @ 2018-04-09 17:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, linux-kernel, linux-arm-kernel, mingo, boqun.feng,
	paulmck, catalin.marinas

On Mon, Apr 09, 2018 at 05:54:20PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 09, 2018 at 03:54:09PM +0100, Will Deacon wrote:
> > @@ -289,18 +315,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >  		return;
> >  
> >  	/*
> > -	 * If we observe any contention; queue.
> > +	 * If we observe queueing, then queue ourselves.
> >  	 */
> > -	if (val & ~_Q_LOCKED_MASK)
> > +	if (val & _Q_TAIL_MASK)
> >  		goto queue;
> >  
> >  	/*
> > +	 * We didn't see any queueing, so have one more try at snatching
> > +	 * the lock in case it became available whilst we were taking the
> > +	 * slow path.
> > +	 */
> > +	if (queued_spin_trylock(lock))
> > +		return;
> > +
> > +	/*
> >  	 * trylock || pending
> >  	 *
> >  	 * 0,0,0 -> 0,0,1 ; trylock
> >  	 * 0,0,1 -> 0,1,1 ; pending
> >  	 */
> > +	val = set_pending_fetch_acquire(lock);
> >  	if (!(val & ~_Q_LOCKED_MASK)) {
> 
> So, if I remember that partial paper correctly, the atomc_read_acquire()
> can see 'arbitrary' old values for everything except the pending byte,
> which it just wrote and will fwd into our load, right?
> 
> But I think coherence requires the read to not be older than the one
> observed by the trylock before (since it uses c-cas its acquire can be
> elided).
> 
> I think this means we can miss a concurrent unlock vs the fetch_or. And
> I think that's fine, if we still see the lock set we'll needlessly 'wait'
> for it go become unlocked.

Ah, but there is a related case that doesn't work. If the lock becomes
free just before we set pending, then another CPU can succeed on the
fastpath. We'll then set pending, but the lockword we get back may still
have the locked byte of 0, so two people end up holding the lock.

I think it's worth giving this a go with the added trylock, but I can't
see a way to avoid the atomic_fetch_or at the moment.

Will

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-09 10:58     ` Will Deacon
  2018-04-09 14:54       ` Will Deacon
@ 2018-04-09 17:55       ` Waiman Long
  1 sibling, 0 replies; 47+ messages in thread
From: Waiman Long @ 2018-04-09 17:55 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, peterz, mingo, boqun.feng,
	paulmck, catalin.marinas

On 04/09/2018 06:58 AM, Will Deacon wrote:
> Hi Waiman,
>
> Thanks for taking this lot for a spin. Comments and questions below.
>
> On Fri, Apr 06, 2018 at 04:50:19PM -0400, Waiman Long wrote:
>> On 04/05/2018 12:58 PM, Will Deacon wrote:
>>> The qspinlock locking slowpath utilises a "pending" bit as a simple form
>>> of an embedded test-and-set lock that can avoid the overhead of explicit
>>> queuing in cases where the lock is held but uncontended. This bit is
>>> managed using a cmpxchg loop which tries to transition the uncontended
>>> lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).
>>>
>>> Unfortunately, the cmpxchg loop is unbounded and lockers can be starved
>>> indefinitely if the lock word is seen to oscillate between unlocked
>>> (0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
>>> able to take the lock in the cmpxchg loop without queuing and pass it
>>> around amongst themselves.
>>>
>>> This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
>>> using atomic_fetch_or, and then inspecting the old value to see whether
>>> we need to spin on the current lock owner, or whether we now effectively
>>> hold the lock. The tricky scenario is when concurrent lockers end up
>>> queuing on the lock and the lock becomes available, causing us to see
>>> a lockword of (n,0,0). With pending now set, simply queuing could lead
>>> to deadlock as the head of the queue may not have observed the pending
>>> flag being cleared. Conversely, if the head of the queue did observe
>>> pending being cleared, then it could transition the lock from (n,0,0) ->
>>> (0,0,1) meaning that any attempt to "undo" our setting of the pending
>>> bit could race with a concurrent locker trying to set it.
>>>
>>> We handle this race by preserving the pending bit when taking the lock
>>> after reaching the head of the queue and leaving the tail entry intact
>>> if we saw pending set, because we know that the tail is going to be
>>> updated shortly.
>>>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Ingo Molnar <mingo@kernel.org>
>>> Signed-off-by: Will Deacon <will.deacon@arm.com>
>>> ---
>> The pending bit was added to the qspinlock design to counter performance
>> degradation compared with ticket lock for workloads with light
>> spinlock contention. I run my spinlock stress test on a Intel Skylake
>> server running the vanilla 4.16 kernel vs a patched kernel with this
>> patchset. The locking rates with different number of locking threads
>> were as follows:
>>
>>   # of threads  4.16 kernel     patched 4.16 kernel
>>   ------------  -----------     -------------------
>>         1       7,417 kop/s         7,408 kop/s
>>         2       5,755 kop/s         4,486 kop/s
>>         3       4,214 kop/s         4,169 kop/s
>>         4       4,396 kop/s         4,383 kop/s
>>        
>> The 2 contending threads case is the one that exercise the pending bit
>> code path the most. So it is obvious that this is the one that is most
>> impacted by this patchset. The differences in the other cases are mostly
>> noise or maybe just a little bit on the 3 contending threads case.
> That is bizarre. A few questions:
>
>   1. Is this with my patches as posted, or also with your WRITE_ONCE change?

This is just the with your patches as posted.
 
>   2. Could you try to bisect my series to see which patch is responsible
>      for this degradation, please?

I have done further analysis with the help of CONFIG_QUEUED_LOCK_STAT
with another patch to enable counting the pending and the queuing code
paths.

Running the 2-thread test with the original qspinlock code on a Haswell
server, the performance data were

pending count = 3,265,220
queuing count = 22
locking rate = 11,648 kop/s

With your posted patches,

pending count = 330
queuing count = 9,965,127
locking rate = 4,178 kop/s

I believe that my test case has heavy dependency on _Q_PENDING_VAL
spinning loop. When I added back the loop, the performance data became:

pending count = 3,278,320
queuing count = 0
locking rate = 11,884 kop/s

Instead of an infinite loop, I also tried a limited spin with loop count
of 0x200 and I got similar performance data as the infinite loop case.

>   3. Could you point me at your stress test, so I can try to reproduce these
>      numbers on arm64 systems, please?

I will send you the test that I  used in a separate email.

>> I am not against this patch, but we certainly need to find out a way to
>> bring the performance number up closer to what it is before applying
>> the patch.
> We certainly need to *understand* where the drop is coming from, because
> the two-threaded case is still just a CAS on x86 with and without this
> patch series. Generally, there's a throughput cost when ensuring fairness
> and forward-progress otherwise we'd all be using test-and-set.

As stated above, the drop comes mainly from skipping the _Q_PENDING_VAL
spinning loop. I supposed that if we just do a limited spin, we can
still ensure forward progress while preserving the performance profile
of the original qspinlock code.

I don't think other codes in your patches cause any performance
regression as far as my testing is concerned.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-09 14:54       ` Will Deacon
  2018-04-09 15:54         ` Peter Zijlstra
@ 2018-04-09 19:33         ` Waiman Long
  1 sibling, 0 replies; 47+ messages in thread
From: Waiman Long @ 2018-04-09 19:33 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, linux-arm-kernel, peterz, mingo, boqun.feng,
	paulmck, catalin.marinas

On 04/09/2018 10:54 AM, Will Deacon wrote:
>
>>> I am not against this patch, but we certainly need to find out a way to
>>> bring the performance number up closer to what it is before applying
>>> the patch.
>> We certainly need to *understand* where the drop is coming from, because
>> the two-threaded case is still just a CAS on x86 with and without this
>> patch series. Generally, there's a throughput cost when ensuring fairness
>> and forward-progress otherwise we'd all be using test-and-set.
> Whilst I think we still need to address my questions above, I've had a
> crack at the diff below. Please can you give it a spin? It sticks a trylock
> on the slowpath before setting pending and replaces the CAS-based set
> with an xchg (which I *think* is safe, but will need to ponder it some
> more).
>
> Thanks,
>
> Will
>

Unfortunately, this patch didn't help.

pending count = 777
queuing count = 9,991,272
locking rate = 4,087 kop/s

-Longman

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-09 17:19           ` Will Deacon
@ 2018-04-10  9:35             ` Peter Zijlstra
  2018-09-20 16:08             ` Peter Zijlstra
  1 sibling, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2018-04-10  9:35 UTC (permalink / raw)
  To: Will Deacon
  Cc: Waiman Long, linux-kernel, linux-arm-kernel, mingo, boqun.feng,
	paulmck, catalin.marinas

On Mon, Apr 09, 2018 at 06:19:59PM +0100, Will Deacon wrote:
> On Mon, Apr 09, 2018 at 05:54:20PM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 09, 2018 at 03:54:09PM +0100, Will Deacon wrote:
> > > @@ -289,18 +315,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > >  		return;
> > >  
> > >  	/*
> > > -	 * If we observe any contention; queue.
> > > +	 * If we observe queueing, then queue ourselves.
> > >  	 */
> > > -	if (val & ~_Q_LOCKED_MASK)
> > > +	if (val & _Q_TAIL_MASK)
> > >  		goto queue;
> > >  
> > >  	/*
> > > +	 * We didn't see any queueing, so have one more try at snatching
> > > +	 * the lock in case it became available whilst we were taking the
> > > +	 * slow path.
> > > +	 */
> > > +	if (queued_spin_trylock(lock))
> > > +		return;
> > > +
> > > +	/*
> > >  	 * trylock || pending
> > >  	 *
> > >  	 * 0,0,0 -> 0,0,1 ; trylock
> > >  	 * 0,0,1 -> 0,1,1 ; pending
> > >  	 */
> > > +	val = set_pending_fetch_acquire(lock);
> > >  	if (!(val & ~_Q_LOCKED_MASK)) {
> > 
> > So, if I remember that partial paper correctly, the atomc_read_acquire()
> > can see 'arbitrary' old values for everything except the pending byte,
> > which it just wrote and will fwd into our load, right?
> > 
> > But I think coherence requires the read to not be older than the one
> > observed by the trylock before (since it uses c-cas its acquire can be
> > elided).
> > 
> > I think this means we can miss a concurrent unlock vs the fetch_or. And
> > I think that's fine, if we still see the lock set we'll needlessly 'wait'
> > for it go become unlocked.
> 
> Ah, but there is a related case that doesn't work. If the lock becomes
> free just before we set pending, then another CPU can succeed on the
> fastpath. We'll then set pending, but the lockword we get back may still
> have the locked byte of 0, so two people end up holding the lock.
> 
> I think it's worth giving this a go with the added trylock, but I can't
> see a way to avoid the atomic_fetch_or at the moment.

Oh yikes, indeed. Yeah, I don't see how we'd be able to fix that one.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-05 16:58 ` [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
                     ` (3 preceding siblings ...)
  2018-04-06 20:50   ` Waiman Long
@ 2018-04-10 13:49   ` Sasha Levin
  4 siblings, 0 replies; 47+ messages in thread
From: Sasha Levin @ 2018-04-10 13:49 UTC (permalink / raw)
  To: Sasha Levin, Will Deacon, linux-kernel
  Cc: linux-arm-kernel, peterz, Peter Zijlstra, Ingo Molnar, stable

Hi,

[This is an automated email]

This commit has been processed by the -stable helper bot and determined
to be a high probability candidate for -stable trees. (score: 32.4825)

The bot has tested the following trees: v4.16.1, v4.15.16, v4.14.33, v4.9.93, v4.4.127.

v4.16.1: Failed to apply! Possible dependencies:
    Unable to calculate

v4.15.16: Failed to apply! Possible dependencies:
    Unable to calculate

v4.14.33: Failed to apply! Possible dependencies:
    Unable to calculate

v4.9.93: Failed to apply! Possible dependencies:
    Unable to calculate

v4.4.127: Failed to apply! Possible dependencies:
    1c4941fd53af ("locking/pvqspinlock: Allow limited lock stealing")
    1f03e8d29192 ("locking/barriers: Replace smp_cond_acquire() with smp_cond_load_acquire()")
    64d816cba06c ("locking/qspinlock: Use _acquire/_release() versions of cmpxchg() & xchg()")


Please let us know if you'd like to have this patch included in a stable tree.

--
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] kernel/locking: qspinlock improvements
  2018-04-06 13:22 ` [PATCH 00/10] kernel/locking: qspinlock improvements Andrea Parri
@ 2018-04-11 10:20   ` Catalin Marinas
  2018-04-11 15:39     ` Andrea Parri
  0 siblings, 1 reply; 47+ messages in thread
From: Catalin Marinas @ 2018-04-11 10:20 UTC (permalink / raw)
  To: Andrea Parri
  Cc: Will Deacon, peterz, boqun.feng, linux-kernel, paulmck, mingo,
	linux-arm-kernel

On Fri, Apr 06, 2018 at 03:22:49PM +0200, Andrea Parri wrote:
> On Thu, Apr 05, 2018 at 05:58:57PM +0100, Will Deacon wrote:
> > I've been kicking the tyres further on qspinlock and with this set of patches
> > I'm happy with the performance and fairness properties. In particular, the
> > locking algorithm now guarantees forward progress whereas the implementation
> > in mainline can starve threads indefinitely in cmpxchg loops.
> > 
> > Catalin has also implemented a model of this using TLA to prove that the
> > lock is fair, although this doesn't take the memory model into account:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/commit/
> 
> Nice!  I'll dig into this formalization, but my guess is that our model
> (and axiomatic models "a-la-herd", in general) are not well-suited when
> it comes to study properties such as fairness, liveness...

Maybe someone with a background in formal methods could give a better
answer. How TLA+ works is closer to rmem [1] (operational model,
exhaustive memoised state search) than herd. Liveness verification
requires checking that, under certain fairness properties, some state is
eventually reached. IOW, it tries to show that either all state change
graphs lead to (go through) such state or that there are cycles in the
graph and the state is never reached. I don't know whether herd could be
modified to check liveness. I'm not sure it can handle infinite loops
either (the above model checks an infinite lock/unlock loop on each
CPU and that's easier to implement in a tool with memoised states).

The TLA+ model above assumes sequential consistency, so no memory
ordering taken into account. One could build an operational model in
TLA+ that's equivalent to the axiomatic one (e.g. following the Flat
model equivalence as in [2]), however, liveness checking (at least with
TLA+) is orders of magnitude slower than safety. Any small variation has
an exponential impact on the state space, so likely to be impractical.
For specific parts of the algorithm, you may be able to use a poor man's
ordering by e.g. writing two accesses in two different orders so the
model checks both combinations.

There are papers (e.g. [3]) on how to convert liveness checking to
safety checking but I haven't dug further. I think it's easier/faster if
you do liveness checking with a simplified model and separately check
the safety with respect to memory ordering on tools like herd.

[1] http://www.cl.cam.ac.uk/~sf502/regressions/rmem/
[2] http://www.cl.cam.ac.uk/~pes20/armv8-mca/armv8-mca-draft.pdf
[3] https://www.sciencedirect.com/science/article/pii/S1571066104804109

-- 
Catalin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/10] kernel/locking: qspinlock improvements
  2018-04-11 10:20   ` Catalin Marinas
@ 2018-04-11 15:39     ` Andrea Parri
  0 siblings, 0 replies; 47+ messages in thread
From: Andrea Parri @ 2018-04-11 15:39 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, peterz, boqun.feng, linux-kernel, paulmck, mingo,
	linux-arm-kernel

On Wed, Apr 11, 2018 at 11:20:04AM +0100, Catalin Marinas wrote:
> On Fri, Apr 06, 2018 at 03:22:49PM +0200, Andrea Parri wrote:
> > On Thu, Apr 05, 2018 at 05:58:57PM +0100, Will Deacon wrote:
> > > I've been kicking the tyres further on qspinlock and with this set of patches
> > > I'm happy with the performance and fairness properties. In particular, the
> > > locking algorithm now guarantees forward progress whereas the implementation
> > > in mainline can starve threads indefinitely in cmpxchg loops.
> > > 
> > > Catalin has also implemented a model of this using TLA to prove that the
> > > lock is fair, although this doesn't take the memory model into account:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/commit/
> > 
> > Nice!  I'll dig into this formalization, but my guess is that our model
> > (and axiomatic models "a-la-herd", in general) are not well-suited when
> > it comes to study properties such as fairness, liveness...
> 
> Maybe someone with a background in formal methods could give a better
> answer. How TLA+ works is closer to rmem [1] (operational model,
> exhaustive memoised state search) than herd. Liveness verification
> requires checking that, under certain fairness properties, some state is
> eventually reached. IOW, it tries to show that either all state change
> graphs lead to (go through) such state or that there are cycles in the
> graph and the state is never reached. I don't know whether herd could be
> modified to check liveness. I'm not sure it can handle infinite loops
> either (the above model checks an infinite lock/unlock loop on each
> CPU and that's easier to implement in a tool with memoised states).
> 
> The TLA+ model above assumes sequential consistency, so no memory
> ordering taken into account. One could build an operational model in
> TLA+ that's equivalent to the axiomatic one (e.g. following the Flat
> model equivalence as in [2]), however, liveness checking (at least with
> TLA+) is orders of magnitude slower than safety. Any small variation has
> an exponential impact on the state space, so likely to be impractical.
> For specific parts of the algorithm, you may be able to use a poor man's
> ordering by e.g. writing two accesses in two different orders so the
> model checks both combinations.
> 
> There are papers (e.g. [3]) on how to convert liveness checking to
> safety checking but I haven't dug further. I think it's easier/faster if
> you do liveness checking with a simplified model and separately check
> the safety with respect to memory ordering on tools like herd.

Indeed.  A fundamental problem, AFAICT, is to formalize that concept of
'[it] will _eventually_ happen'.  Consider a simple example:

	{ x = 0}

	P0	|   P1
		|
	x = 1	|   while (!x)
		|  	 ;

herd 'knows' that:

	- on the 1st iteration of the 'while' loop, the load from x
	  can return the value 0 or 1 (only);

	- on the 2nd iteration of the 'while' loop, the load from x
	  can return the value 0 or 1;

	- [ ... and 'so on'! ]

but this is pretty much all herd knows about this snippet by now ... ;)

Thanks,
  Andrea


> 
> [1] http://www.cl.cam.ac.uk/~sf502/regressions/rmem/
> [2] http://www.cl.cam.ac.uk/~pes20/armv8-mca/armv8-mca-draft.pdf
> [3] https://www.sciencedirect.com/science/article/pii/S1571066104804109
> 
> -- 
> Catalin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-04-09 17:19           ` Will Deacon
  2018-04-10  9:35             ` Peter Zijlstra
@ 2018-09-20 16:08             ` Peter Zijlstra
  2018-09-20 16:22               ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2018-09-20 16:08 UTC (permalink / raw)
  To: Will Deacon
  Cc: Waiman Long, linux-kernel, linux-arm-kernel, mingo, boqun.feng,
	paulmck, catalin.marinas

On Mon, Apr 09, 2018 at 06:19:59PM +0100, Will Deacon wrote:
> On Mon, Apr 09, 2018 at 05:54:20PM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 09, 2018 at 03:54:09PM +0100, Will Deacon wrote:

> > > +/**
> > > + * set_pending_fetch_acquire - set the pending bit and return the old lock
> > > + *                             value with acquire semantics.
> > > + * @lock: Pointer to queued spinlock structure
> > > + *
> > > + * *,*,* -> *,1,*
> > > + */
> > > +static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
> > > +{
> > > +       u32 val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;

	smp_mb();

> > > +       val |= (atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK);
> > > +       return val;
> > > +}

> > > @@ -289,18 +315,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > >  		return;
> > >  
> > >  	/*
> > > -	 * If we observe any contention; queue.
> > > +	 * If we observe queueing, then queue ourselves.
> > >  	 */
> > > -	if (val & ~_Q_LOCKED_MASK)
> > > +	if (val & _Q_TAIL_MASK)
> > >  		goto queue;
> > >  
> > >  	/*
> > > +	 * We didn't see any queueing, so have one more try at snatching
> > > +	 * the lock in case it became available whilst we were taking the
> > > +	 * slow path.
> > > +	 */
> > > +	if (queued_spin_trylock(lock))
> > > +		return;
> > > +
> > > +	/*
> > >  	 * trylock || pending
> > >  	 *
> > >  	 * 0,0,0 -> 0,0,1 ; trylock
> > >  	 * 0,0,1 -> 0,1,1 ; pending
> > >  	 */
> > > +	val = set_pending_fetch_acquire(lock);
> > >  	if (!(val & ~_Q_LOCKED_MASK)) {
> > 
> > So, if I remember that partial paper correctly, the atomc_read_acquire()
> > can see 'arbitrary' old values for everything except the pending byte,
> > which it just wrote and will fwd into our load, right?
> > 
> > But I think coherence requires the read to not be older than the one
> > observed by the trylock before (since it uses c-cas its acquire can be
> > elided).
> > 
> > I think this means we can miss a concurrent unlock vs the fetch_or. And
> > I think that's fine, if we still see the lock set we'll needlessly 'wait'
> > for it go become unlocked.
> 
> Ah, but there is a related case that doesn't work. If the lock becomes
> free just before we set pending, then another CPU can succeed on the
> fastpath. We'll then set pending, but the lockword we get back may still
> have the locked byte of 0, so two people end up holding the lock.
> 
> I think it's worth giving this a go with the added trylock, but I can't
> see a way to avoid the atomic_fetch_or at the moment.

So IIRC the addition of the smp_mb() above should ensure the @val load
is later than the @pending store.

Which makes the thing work again, right?

Now, obviously you don't actually want that on ARM64, but I can do that
on x86 just fine (our xchg() implies smp_mb() after all).


Another approach might be to use something like:

	val = xchg_relaxed(&lock->locked_pending, _Q_PENDING_VAL | _Q_LOCKED_VAL);
	val |= atomic_read_acquire(&lock->val) & _Q_TAIL_MASK;

combined with something like:

	/* 0,0,0 -> 0,1,1 - we won trylock */
	if (!(val & _Q_LOCKED_MASK)) {
		clear_pending(lock);
		return;
	}

	/* 0,0,1 -> 0,1,1 - we won pending */
	if (!(val & ~_Q_LOCKED_MASK)) {
		...
	}

	/* *,0,1 -> *,1,1 - we won pending, but there's queueing */
	if (!(val & _Q_PENDING_VAL))
		clear_pending(lock);

	...


Hmmm?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath
  2018-09-20 16:08             ` Peter Zijlstra
@ 2018-09-20 16:22               ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2018-09-20 16:22 UTC (permalink / raw)
  To: Will Deacon
  Cc: Waiman Long, linux-kernel, linux-arm-kernel, mingo, boqun.feng,
	paulmck, catalin.marinas

On Thu, Sep 20, 2018 at 06:08:32PM +0200, Peter Zijlstra wrote:
> Another approach might be to use something like:
> 
> 	val = xchg_relaxed(&lock->locked_pending, _Q_PENDING_VAL | _Q_LOCKED_VAL);
> 	val |= atomic_read_acquire(&lock->val) & _Q_TAIL_MASK;
> 
> combined with something like:
> 
> 	/* 0,0,0 -> 0,1,1 - we won trylock */
> 	if (!(val & _Q_LOCKED_MASK)) {

That one doesn't actually work... let me think about this more.

> 		clear_pending(lock);
> 		return;
> 	}
> 
> 	/* 0,0,1 -> 0,1,1 - we won pending */
> 	if (!(val & ~_Q_LOCKED_MASK)) {
> 		...
> 	}
> 
> 	/* *,0,1 -> *,1,1 - we won pending, but there's queueing */
> 	if (!(val & _Q_PENDING_VAL))
> 		clear_pending(lock);
> 
> 	...
> 
> 
> Hmmm?

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2018-09-20 16:22 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-05 16:58 [PATCH 00/10] kernel/locking: qspinlock improvements Will Deacon
2018-04-05 16:58 ` [PATCH 01/10] locking/qspinlock: Don't spin on pending->locked transition in slowpath Will Deacon
2018-04-05 16:58 ` [PATCH 02/10] locking/qspinlock: Remove unbounded cmpxchg loop from locking slowpath Will Deacon
2018-04-05 17:07   ` Peter Zijlstra
2018-04-06 15:08     ` Will Deacon
2018-04-05 17:13   ` Peter Zijlstra
2018-04-05 21:16   ` Waiman Long
2018-04-06 15:08     ` Will Deacon
2018-04-06 20:50   ` Waiman Long
2018-04-06 21:09     ` Paul E. McKenney
2018-04-07  8:47       ` Peter Zijlstra
2018-04-07 23:37         ` Paul E. McKenney
2018-04-09 10:58         ` Will Deacon
2018-04-07  9:07     ` Peter Zijlstra
2018-04-09 10:58     ` Will Deacon
2018-04-09 14:54       ` Will Deacon
2018-04-09 15:54         ` Peter Zijlstra
2018-04-09 17:19           ` Will Deacon
2018-04-10  9:35             ` Peter Zijlstra
2018-09-20 16:08             ` Peter Zijlstra
2018-09-20 16:22               ` Peter Zijlstra
2018-04-09 19:33         ` Waiman Long
2018-04-09 17:55       ` Waiman Long
2018-04-10 13:49   ` Sasha Levin
2018-04-05 16:59 ` [PATCH 03/10] locking/qspinlock: Kill cmpxchg loop when claiming lock from head of queue Will Deacon
2018-04-05 17:19   ` Peter Zijlstra
2018-04-06 10:54     ` Will Deacon
2018-04-05 16:59 ` [PATCH 04/10] locking/qspinlock: Use atomic_cond_read_acquire Will Deacon
2018-04-05 16:59 ` [PATCH 05/10] locking/mcs: Use smp_cond_load_acquire() in mcs spin loop Will Deacon
2018-04-05 16:59 ` [PATCH 06/10] barriers: Introduce smp_cond_load_relaxed and atomic_cond_read_relaxed Will Deacon
2018-04-05 17:22   ` Peter Zijlstra
2018-04-06 10:55     ` Will Deacon
2018-04-05 16:59 ` [PATCH 07/10] locking/qspinlock: Use smp_cond_load_relaxed to wait for next node Will Deacon
2018-04-05 16:59 ` [PATCH 08/10] locking/qspinlock: Merge struct __qspinlock into struct qspinlock Will Deacon
2018-04-07  5:23   ` Boqun Feng
2018-04-05 16:59 ` [PATCH 09/10] locking/qspinlock: Make queued_spin_unlock use smp_store_release Will Deacon
2018-04-05 16:59 ` [PATCH 10/10] locking/qspinlock: Elide back-to-back RELEASE operations with smp_wmb() Will Deacon
2018-04-05 17:28   ` Peter Zijlstra
2018-04-06 11:34     ` Will Deacon
2018-04-06 13:05       ` Andrea Parri
2018-04-06 15:27         ` Will Deacon
2018-04-06 15:49           ` Andrea Parri
2018-04-07  5:47   ` Boqun Feng
2018-04-09 10:47     ` Will Deacon
2018-04-06 13:22 ` [PATCH 00/10] kernel/locking: qspinlock improvements Andrea Parri
2018-04-11 10:20   ` Catalin Marinas
2018-04-11 15:39     ` Andrea Parri

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).