LKML Archive on lore.kernel.org
* [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2
@ 2019-04-28 21:25 Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count before increment Waiman Long
                   ` (20 more replies)
  0 siblings, 21 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

 v7:
  - Fix a bug in patch 1 and include changes suggested by Linus.
  - Refresh the other patches accordingly.

 v6:
  - Add a new patch 1 to fix an existing rwsem bug that allows both
    readers and a writer to become rwsem owners simultaneously.
  - Fix a missed wakeup bug (in patch 9) of the v5 series.
  - Fix a boot-up hang on some kernel configurations caused by the
    patch that merges the owner into the count.
  - Make the adaptive disabling of reader optimistic spinning more
    aggressive to further improve workloads that don't benefit from
    it.
  - Add another patch to disable preemption during down_read() when
    the owner is merged into the count to prevent the possibility of
    reader count overflow.
  - Don't allow merging of owner into count if NR_CPUS is greater
    than the max supported reader count.
  - Allow readers that drop out of the optimistic spinning loop to
    attempt a trylock if they are still in the right reader phase.
  - Fix incorrect DEBUG_RWSEMS_WARN_ON() warning.

 v5:
  - Drop v4 patch 1 as it is merged into tip's locking/core branch.
  - Integrate the 2 follow-up patches into the series. The first
    follow-up patch is broken into 2 pieces: the first piece comes in
    before the "Enable readers spinning on writer" patch and the 2nd
    piece is merged into the "Enable time-based spinning on reader-owned
    rwsem" patch. The 2nd follow-up patch is added after that.
  - Add a new patch to make all wake_up_q() calls happen after
    dropping the wait_lock, as suggested by PeterZ.
  - Incorporate numerous suggestions by PeterZ and Davidlohr.

 v4:
  - Fix the missing initialization bug with !CONFIG_RWSEM_SPIN_ON_OWNER
    in patch 2.
  - Move the "Remove rwsem_wake() wakeup optimization" patch before
    the "Implement a new locking scheme" patch.
  - Add two new patches to merge the relevant content of rwsem.h and
    rwsem-xadd.c into rwsem.c as suggested by PeterZ.
  - Refactor the lock handoff patch to make all setting and clearing
    of the handoff bit serialized by wait_lock to ensure correctness.
  - Adapt the rest of the patches to the new code base.

This is part 2 of a 3-part (0/1/2) series to rearchitect the internal
operation of rwsem. Both part 0 and part 1 are merged into tip.

This patchset revamps the current rwsem-xadd implementation to make
it saner and easier to work with. It also implements the following 3
new features:

 1) Waiter lock handoff
 2) Reader optimistic spinning with adaptive disabling
 3) Store write-lock owner in the atomic count (x86-64 only for now)

Waiter lock handoff is similar to the mechanism currently in the mutex
code. This ensures that lock starvation won't happen.
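
As a rough sketch (not the actual code of the handoff patch), a writer
trylock that honors the handoff bit could look like the following.
RWSEM_FLAG_HANDOFF, RWSEM_LOCK_MASK and RWSEM_WRITER_LOCKED are defined
in the series; the helper name and the first_waiter argument are purely
illustrative:

  static inline bool rwsem_try_write_lock_sketch(struct rw_semaphore *sem,
                                                 bool first_waiter)
  {
          long count = atomic_long_read(&sem->count);

          do {
                  /* A set handoff bit reserves the lock for the first waiter. */
                  if ((count & RWSEM_FLAG_HANDOFF) && !first_waiter)
                          return false;
                  /* The lock is still held by readers or a writer. */
                  if (count & RWSEM_LOCK_MASK)
                          return false;
          } while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count,
                                                    count | RWSEM_WRITER_LOCKED));
          return true;
  }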

Reader optimistic spinning enables readers to acquire the lock more
quickly.  So workloads that use a mix of readers and writers should
see an increase in performance as long as the reader critical sections
are short. For workloads with long reader critical sections, reader
optimistic spinning may hurt performance, so an adaptive disabling
mechanism is also implemented to disable it when spinning timeouts on
reader-owned locks happen.

Finally, storing the write-lock owner in the count allows optimistic
spinners to get to the lock holder's task structure more quickly and
eliminates the timing gap where the write lock is acquired
but the owner isn't known yet. This is important for RT tasks where
spinning on a lock with an unknown owner is not allowed.
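
On 64-bit, the owner task pointer can be compressed because task
structures are at least L1-cache aligned and normally live in the
linearly mapped region. A minimal sketch of the encode/decode helpers,
mirroring rwsem_owner_count()/rwsem_count_owner() from the diffs below
(without the init_task special case added in v6):

  /* Compress the owner task pointer into the writer bits of the count. */
  static inline unsigned long rwsem_owner_count(struct task_struct *owner)
  {
          return ((unsigned long)owner - PAGE_OFFSET) >> (L1_CACHE_SHIFT - 2);
  }

  /* Recover the owner task pointer from the count word. */
  static inline unsigned long rwsem_count_owner(long count)
  {
          unsigned long writer = (unsigned long)count & RWSEM_WRITER_MASK;

          return writer ? (writer << (L1_CACHE_SHIFT - 2)) + PAGE_OFFSET : 0;
  }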

Because multiple readers can share the same lock, there is a natural
preference for readers when measuring locking throughput, as more
readers are likely to get into the locking fast path than writers.
With waiter lock handoff, however, writers will not be starved.

On a 2-socket 40-core 80-thread Skylake system with 40 reader and 40
writer locking threads, the min/mean/max locking operations done in a
5-second testing window before the patchset were:

  40 readers, Iterations Min/Mean/Max = 2,017/2,018/2,022
  40 writers, Iterations Min/Mean/Max = 2,018/60,936/142,804

After the patchset, they became:

  40 readers, Iterations Min/Mean/Max = 33,648/35,442/37,748
  40 writers, Iterations Min/Mean/Max = 91,033/92,342/93,474

It can be seen that the performance improves for both readers and writers.
It also makes rwsem fairer to readers as well as among different threads
within the reader and writer groups.

Patch 1 fixes an existing rwsem bug that allows both readers and a
writer to become rwsem owners simultaneously, which may lead to data
corruption.

Patch 2 makes owner a permanent member of the rw_semaphore structure and
sets it irrespective of CONFIG_RWSEM_SPIN_ON_OWNER.

Patch 3 removes rwsem_wake() wakeup optimization as it doesn't work
with lock handoff.

Patch 4 implements a new rwsem locking scheme similar to what qrwlock
is currently doing. The write lock is acquired by atomic_cmpxchg() while
the read lock is still acquired by atomic_add().
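
A rough sketch of the two fast paths under the new scheme (the helper
names are made up for illustration; the real code also has to deal with
the waiter, handoff and read-fail bits in the slowpath):

  /* Writer fast path: one cmpxchg from fully unlocked to write-locked. */
  static inline bool rwsem_write_trylock_sketch(struct rw_semaphore *sem)
  {
          long old = RWSEM_UNLOCKED_VALUE;

          return atomic_long_try_cmpxchg_acquire(&sem->count, &old,
                                                 RWSEM_WRITER_LOCKED);
  }

  /* Reader fast path: add the reader bias; any failure bit in the old
   * count value sends the reader into the slowpath.
   */
  static inline bool rwsem_read_trylock_sketch(struct rw_semaphore *sem)
  {
          long count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
                                                     &sem->count);

          return !(count & RWSEM_READ_FAILED_MASK);
  }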

Patch 5 merges the content of rwsem.h and rwsem-xadd.c into rwsem.c,
just like the mutex code. rwsem-xadd.c is removed and a bare-bones
rwsem.h is left for the internal function declarations needed by
percpu-rwsem.c.

Patch 6 optimizes the merged rwsem.c file to generate a smaller object
file and performs other miscellaneous code cleanups.

Patch 7 makes rwsem_spin_on_owner() return the owner state.

Patch 8 implements lock handoff to prevent lock starvation. Throughput
is expected to be lower on workloads with highly contended rwsems, in
exchange for better fairness.

Patch 9 makes sure that all wake_up_q() calls happen after dropping
the wait_lock.

Patch 10 makes RT task's handling of NULL owner more optimal.

Patch 11 makes reader wakeup wake up almost all the readers in the
wait queue instead of just those at the front.

Patch 12 renames the RWSEM_ANONYMOUSLY_OWNED bit to RWSEM_NONSPINNABLE.

Patch 13 enables readers to spin on a writer-owned rwsem.

Patch 14 enables a writer to spin on a reader-owned rwsem for at most
25us and extends the RWSEM_NONSPINNABLE bit to 2 separate ones - one
for readers and one for writers.

Patch 15 implements the adaptive disabling of reader optimistic spinning
when the reader-owned rwsem spinning timeouts happen.
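
A simplified sketch of how patches 14 and 15 interact.
is_rwsem_reader_owned() and rwsem_set_nonspinnable() come from the
series; the helper name and the bare-bones timeout handling are only
illustrative:

  /* Spin on a reader-owned rwsem for a bounded time (~25us in the series).
   * On timeout, adaptively disable reader optimistic spinning on this
   * rwsem so that future readers go to sleep instead of spinning.
   */
  static bool rwsem_spin_on_readers_sketch(struct rw_semaphore *sem)
  {
          u64 deadline = sched_clock() + 25 * NSEC_PER_USEC;

          while (is_rwsem_reader_owned(sem)) {
                  if (sched_clock() >= deadline) {
                          rwsem_set_nonspinnable(sem);
                          return false;   /* give up and queue up */
                  }
                  cpu_relax();
          }
          return true;    /* reader phase ended, retry the trylock */
  }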

Patch 16 adds some new rwsem owner access helper functions.

Patch 17 handles the case of too many readers by reserving the sign
bit: when it is set, a reader lock attempt will fail and the locking
reader will be put to sleep. This ensures that the reader count will
not overflow.
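
The guard is essentially the rwsem_read_trylock() helper that appears in
the v5=>v6 diff attached below:

  static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
  {
          long adjustment = -RWSEM_READER_BIAS;

          *cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
          if (unlikely(*cnt < 0)) {
                  /* Sign bit set: too many readers. Back the increment
                   * out right away and let the caller sleep in the
                   * slowpath.
                   */
                  atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
                  adjustment = 0;
          }
          return adjustment;
  }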

Patch 18 merges the write-lock owner task pointer into the count.
Only a 64-bit count has enough space to provide a reasonable number of
bits for the reader count. This is for x86-64 only for the time being.

Patch 19 eliminates redundant computation of the merged owner-count.

Patch 20 disables preemption during the down_read() call to eliminate
the remote possibility that the reader count may overflow.
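
Conceptually (a simplified version of the __down_read() change in the
v5=>v6 diff below, with the debug check omitted), the read-lock fast
path becomes:

  inline void __down_read(struct rw_semaphore *sem)
  {
          long tmp, adjustment;

          /* Preemption is only really disabled when the owner is merged
           * into the count, so the narrow inc-check-dec window in
           * rwsem_read_trylock() cannot be stretched by preemption.
           */
          rwsem_preempt_disable();
          adjustment = rwsem_read_trylock(sem, &tmp);
          if (unlikely(tmp & RWSEM_READ_FAILED_MASK))
                  rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, adjustment);
          else
                  rwsem_set_reader_owned(sem);
          rwsem_preempt_enable();
  }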

With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on a 2-socket Skylake system with equal numbers
of readers and writers (mixed) before and after this patchset were:

   # of Threads   Before Patch      After Patch
   ------------   ------------      -----------
        2            2,394             4,541
        4            1,184             4,083
        8              817             4,070
       16              747             3,786
       32              319             3,173
       64               81             3,125

On workloads where the rwsem reader critical section is relatively long
(longer than the spinning period), optimistic spinning of writers on a
reader-owned rwsem may not be that helpful. In fact, performance may
regress in some cases, such as the will-it-scale page_fault1
microbenchmark. This is likely because larger reader groups, where the
readers acquire the lock together, are broken into smaller ones. So more
work will be needed to better tune the rwsem code to that kind of
workload.

The v5=>v6 and v6=>v7 diffs are attached at the end.

Waiman Long (20):
  locking/rwsem: Prevent decrement of reader count before increment
  locking/rwsem: Make owner available even if
    !CONFIG_RWSEM_SPIN_ON_OWNER
  locking/rwsem: Remove rwsem_wake() wakeup optimization
  locking/rwsem: Implement a new locking scheme
  locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c
  locking/rwsem: Code cleanup after files merging
  locking/rwsem: Make rwsem_spin_on_owner() return owner state
  locking/rwsem: Implement lock handoff to prevent lock starvation
  locking/rwsem: Always release wait_lock before waking up tasks
  locking/rwsem: More optimal RT task handling of null owner
  locking/rwsem: Wake up almost all readers in wait queue
  locking/rwsem: Clarify usage of owner's nonspinnable bit
  locking/rwsem: Enable readers spinning on writer
  locking/rwsem: Enable time-based spinning on reader-owned rwsem
  locking/rwsem: Adaptive disabling of reader optimistic spinning
  locking/rwsem: Add more rwsem owner access helpers
  locking/rwsem: Guard against making count negative
  locking/rwsem: Merge owner into count on x86-64
  locking/rwsem: Remove redundant computation of writer lock word
  locking/rwsem: Disable preemption in down_read*() if owner in count

 arch/x86/Kconfig                  |    6 +
 include/linux/rwsem.h             |   11 +-
 include/linux/sched/wake_q.h      |    5 +
 kernel/Kconfig.locks              |   12 +
 kernel/locking/Makefile           |    2 +-
 kernel/locking/lock_events_list.h |   12 +-
 kernel/locking/rwsem-xadd.c       |  729 -------------
 kernel/locking/rwsem.c            | 1678 ++++++++++++++++++++++++++++-
 kernel/locking/rwsem.h            |  306 +-----
 lib/Kconfig.debug                 |    8 +-
 10 files changed, 1699 insertions(+), 1070 deletions(-)
 delete mode 100644 kernel/locking/rwsem-xadd.c

 v6=>v7 diff
 -----------
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 97a2334d9cd3..60783267b50d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -693,7 +693,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
                atomic_long_add(adjustment, &sem->count);
 
        /* 2nd pass */
-       list_for_each_entry(waiter, &wlist, list) {
+       list_for_each_entry_safe(waiter, tmp, &wlist, list) {
                struct task_struct *tsk;
 
                tsk = waiter->task;

 v5=>v6 diff
 -----------

diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 3370ea21407b..818e595831a1 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -253,13 +253,13 @@ config QUEUED_RWLOCKS
 	depends on SMP
 
 #
-# An architecture that want to merge rwsem write-owner into count should
-# select ARCH_USE_RWSEM_OWNER_COUNT and define RWSEM_OWNER_COUNT_PA_BITS
-# as the correct number of physical address bits.
+# An 64-bit architecture that wants to merge rwsem write-owner into
+# count should select ARCH_USE_RWSEM_OWNER_COUNT and define
+# RWSEM_OWNER_COUNT_PA_BITS as the correct number of physical address
+# bits. In addition, the number of bits available for reader count
+# should allow all the CPUs as defined in NR_CPUS to acquire the same
+# read lock without overflowing it.
 #
 config ARCH_USE_RWSEM_OWNER_COUNT
 	bool
-
-config RWSEM_OWNER_COUNT
-	def_bool y if ARCH_USE_RWSEM_OWNER_COUNT
 	depends on SMP && 64BIT
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 661326885800..239039d0ce21 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -61,6 +61,7 @@ LOCK_EVENT(rwsem_opt_wlock)	/* # of opt-acquired write locks	*/
 LOCK_EVENT(rwsem_opt_fail)	/* # of failed optspins			*/
 LOCK_EVENT(rwsem_opt_nospin)	/* # of disabled optspins		*/
 LOCK_EVENT(rwsem_opt_norspin)	/* # of disabled reader-only optspins	*/
+LOCK_EVENT(rwsem_opt_rlock2)	/* # of opt-acquired 2ndary read locks	*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 19d8fbd50d17..97a2334d9cd3 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -83,10 +83,18 @@
  * Therefore, any writers that had observed the setting of the writer
  * nonspinnable bit for a given rwsem after they fail to acquire the lock
  * via optimistic spinning will set the reader nonspinnable bit once they
- * acquire the write lock. This is to discourage reader optmistic spinning
- * on that particular rwsem and make writers more preferred. This adaptive
- * disabling of reader optimistic spinning will alleviate the negative
- * side effect of this feature.
+ * acquire the write lock. Similarly, readers that observe the setting
+ * of reader nonspinnable bit at slowpath entry will set the reader
+ * nonspinnable bits when they acquire the read lock via the wakeup path.
+ *
+ * Once the reader nonspinnable bit is on, it will only be reset when
+ * a writer is able to acquire the rwsem in the fast path or somehow a
+ * reader or writer in the slowpath doesn't observe the nonspinable bit.
+ *
+ * This is to discourage reader optmistic spinning on that particular
+ * rwsem and make writers more preferred. This adaptive disabling of reader
+ * optimistic spinning will alleviate the negative side effect of this
+ * feature.
  */
 #define RWSEM_READER_OWNED	(1UL << 0)
 #define RWSEM_RD_NONSPINNABLE	(1UL << 1)
@@ -129,8 +137,8 @@
  * to (32k-1) active readers. If 5-level page table support isn't
  * configured, we can supports up to (2M-1) active readers.
  *
- * On x86-64 with CONFIG_X86_5LEVEL and CONFIG_RWSEM_OWNER_COUNT, the bit
- * definitions of the count are:
+ * On x86-64 with CONFIG_X86_5LEVEL and CONFIG_ARCH_USE_RWSEM_OWNER_COUNT,
+ * the bit definitions of the count are:
  *
  * Bit   0    - waiters present bit
  * Bit   1    - lock handoff bit
@@ -138,7 +146,7 @@
  * Bits 48-62 - 15-bit reader counts
  * Bit  63    - read fail bit
  *
- * On other 64-bit architectures without CONFIG_RWSEM_OWNER_COUNT, the bit
+ * On other 64-bit architectures without MERGE_OWNER_INTO_COUNT, the bit
  * definitions are:
  *
  * Bit  0    - writer locked bit
@@ -178,16 +186,34 @@
 #define RWSEM_FLAG_HANDOFF	(1UL << 1)
 #define RWSEM_FLAG_READFAIL	(1UL << (BITS_PER_LONG - 1))
 
-#ifdef CONFIG_RWSEM_OWNER_COUNT
-#define RWSEM_READER_SHIFT	(CONFIG_RWSEM_OWNER_COUNT_PA_BITS -\
+/*
+ * The MERGE_OWNER_INTO_COUNT macro will only be defined if the following
+ * conditions are true:
+ *  1) Both CONFIG_ARCH_USE_RWSEM_OWNER_COUNT and
+ *     CONFIG_RWSEM_OWNER_COUNT_PA_BITS are defined.
+ *  2) The number of reader count bits available is able to hold the
+ *     maximum number of CPUs as defined in NR_CPUS.
+ */
+#if	defined(CONFIG_ARCH_USE_RWSEM_OWNER_COUNT) && \
+	defined(CONFIG_RWSEM_OWNER_COUNT_PA_BITS)
+# define __READER_SHIFT		(CONFIG_RWSEM_OWNER_COUNT_PA_BITS -\
 				 L1_CACHE_SHIFT + 2)
+# define __READER_COUNT_BITS	(BITS_PER_LONG - __READER_SHIFT - 1)
+# define __READER_COUNT_MAX	((1UL << __READER_COUNT_BITS) - 1)
+# if (NR_CPUS <= __READER_COUNT_MAX)
+#  define MERGE_OWNER_INTO_COUNT
+# endif
+#endif
+
+#ifdef MERGE_OWNER_INTO_COUNT
+#define RWSEM_READER_SHIFT	__READER_SHIFT
 #define RWSEM_WRITER_MASK	((1UL << RWSEM_READER_SHIFT) - 4)
 #define RWSEM_WRITER_LOCKED	rwsem_owner_count(current)
-#else /* !CONFIG_RWSEM_OWNER_COUNT */
+#else /* !MERGE_OWNER_INTO_COUNT */
 #define RWSEM_READER_SHIFT	8
 #define RWSEM_WRITER_MASK	(1UL << 7)
 #define RWSEM_WRITER_LOCKED	RWSEM_WRITER_MASK
-#endif /* CONFIG_RWSEM_OWNER_COUNT */
+#endif /* MERGE_OWNER_INTO_COUNT */
 
 #define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
 #define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
@@ -198,9 +224,15 @@
 /*
  * Task structure pointer compression (64-bit only):
  * (owner - PAGE_OFFSET) >> (L1_CACHE_SHIFT - 2)
+ *
+ * However, init_task may lie outside of the linearly mapped physical
+ * to virtual memory range and so has to be handled separately.
  */
 static inline unsigned long rwsem_owner_count(struct task_struct *owner)
 {
+	if (unlikely(owner == &init_task))
+		return RWSEM_WRITER_MASK;
+
 	return ((unsigned long)owner - PAGE_OFFSET) >> (L1_CACHE_SHIFT - 2);
 }
 
@@ -208,6 +240,9 @@ static inline unsigned long rwsem_count_owner(long count)
 {
 	unsigned long writer = (unsigned long)count & RWSEM_WRITER_MASK;
 
+	if (unlikely(writer == RWSEM_WRITER_MASK))
+		return (unsigned long)&init_task;
+
 	return writer ? (writer << (L1_CACHE_SHIFT - 2)) + PAGE_OFFSET : 0;
 }
 
@@ -218,9 +253,8 @@ static inline unsigned long rwsem_count_owner(long count)
  * may not need READ_ONCE() as long as the pointer value is only used
  * for comparison and isn't being dereferenced.
  *
- * On 32-bit architectures, the owner and count are separate. On 64-bit
- * architectures, however, the writer task structure pointer is written
- * to the count as well in addition to the owner field.
+ * With MERGE_OWNER_INTO_COUNT defined, the writer task structure pointer
+ * is written to the count as well in addition to the owner field.
  */
 
 static inline void rwsem_set_owner(struct rw_semaphore *sem)
@@ -233,27 +267,6 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 	WRITE_ONCE(sem->owner, NULL);
 }
 
-#ifdef CONFIG_RWSEM_OWNER_COUNT
-/*
- * Get the owner value from count to have early access to the task structure.
- * Owner from sem->count should includes the RWSEM_NONSPINNABLE bits
- * from sem->owner.
- */
-static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
-{
-	unsigned long cowner = rwsem_count_owner(atomic_long_read(&sem->count));
-	unsigned long sowner = (unsigned long)READ_ONCE(sem->owner);
-
-	return (struct task_struct *) (cowner
-		? cowner | (sowner & RWSEM_NONSPINNABLE) : sowner);
-}
-#else /* !CONFIG_RWSEM_OWNER_COUNT */
-static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
-{
-	return READ_ONCE(sem->owner);
-}
-#endif /* CONFIG_RWSEM_OWNER_COUNT */
-
 /*
  * The task_struct pointer of the last owning reader will be left in
  * the owner field.
@@ -261,11 +274,15 @@ static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
  * Note that the owner value just indicates the task has owned the rwsem
  * previously, it may not be the real owner or one of the real owners
  * anymore when that field is examined, so take it with a grain of salt.
+ *
+ * The reader non-spinnable bit is preserved.
  */
 static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
 					    struct task_struct *owner)
 {
-	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED;
+	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED |
+			   ((unsigned long)READ_ONCE(sem->owner) &
+			     RWSEM_RD_NONSPINNABLE);
 
 	WRITE_ONCE(sem->owner, (struct task_struct *)val);
 }
@@ -295,6 +312,14 @@ static inline bool is_rwsem_spinnable(struct rw_semaphore *sem, bool wr)
 	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner), wr);
 }
 
+/*
+ * Remove all the flag bits from owner.
+ */
+static inline struct task_struct *owner_without_flags(struct task_struct *owner)
+{
+	return (struct task_struct *)((long)owner & ~RWSEM_OWNER_FLAGS_MASK);
+}
+
 /*
  * Return true if the rwsem is owned by a reader.
  */
@@ -349,6 +374,98 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
 	}
 }
 
+#ifdef MERGE_OWNER_INTO_COUNT
+/*
+ * It is very unlikely that successive preemption at the middle of
+ * down_read's inc-check-dec sequence will cause the reader count to
+ * overflow, For absolute correctness, we still need to prevent
+ * that possibility from happening. So preemption will be disabled
+ * during the down_read*() call.
+ *
+ * For PREEMPT=n kernels, there isn't much overhead in doing that.
+ * For PREEMPT=y kernels, there will be some additional cost.
+ *
+ * If MERGE_OWNER_INTO_COUNT isn't defined, we don't need to worry
+ * about reader count overflow and so we don't need to disable
+ * preemption.
+ */
+#define rwsem_preempt_disable()			preempt_disable()
+#define rwsem_preempt_enable()			preempt_enable()
+#define rwsem_schedule_preempt_disabled()	schedule_preempt_disabled()
+
+/*
+ * Get the owner value from count to have early access to the task structure.
+ * Owner from sem->count should includes the RWSEM_NONSPINNABLE bits
+ * from sem->owner.
+ */
+static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
+{
+	unsigned long cowner = rwsem_count_owner(atomic_long_read(&sem->count));
+	unsigned long sowner = (unsigned long)READ_ONCE(sem->owner);
+
+	return (struct task_struct *) (cowner
+		? cowner | (sowner & RWSEM_NONSPINNABLE) : sowner);
+}
+
+/*
+ * This function does a read trylock by incrementing the reader count
+ * and then decrementing it immediately if too many readers are present
+ * (count becomes negative) in order to prevent the remote possibility
+ * of overflowing the count with minimal delay between the increment
+ * and decrement.
+ *
+ * When the owner task structure pointer is merged into couunt, less bits
+ * will be available for readers (down to 15 bits for x86-64). There is a
+ * very slight chance that preemption may happen in the middle of the
+ * inc-check-dec sequence leaving the reader count incremented for a
+ * certain period of time until the reader wakes up and move on. Still
+ * the chance of having enough of these unfortunate sequence of events to
+ * overflow the reader count is infinitesimally small.
+ *
+ * If MERGE_OWNER_INTO_COUNT isn't defined, we don't really need to
+ * worry about the possibility of overflowing the reader counts even
+ * for 32-bit architectures which can support up to 8M readers.
+ *
+ * It returns the adjustment that should be added back to the count
+ * in the slowpath.
+ */
+static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
+{
+	long adjustment = -RWSEM_READER_BIAS;
+
+	*cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
+	if (unlikely(*cnt < 0)) {
+		atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+		adjustment = 0;
+	}
+	return adjustment;
+}
+
+static int __init rwsem_show_count_status(void)
+{
+	pr_info("RW Semaphores: Write-owner in count & %d bits for readers.\n",
+		__READER_COUNT_BITS);
+	return 0;
+}
+late_initcall(rwsem_show_count_status);
+#else /* !MERGE_OWNER_INTO_COUNT */
+
+#define rwsem_preempt_disable()
+#define rwsem_preempt_enable()
+#define rwsem_schedule_preempt_disabled()	schedule()
+
+static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
+{
+	return READ_ONCE(sem->owner);
+}
+
+static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
+{
+	*cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
+	return -RWSEM_READER_BIAS;
+}
+#endif /* MERGE_OWNER_INTO_COUNT */
+
 /*
  * Guide to the rw_semaphore's count field.
  *
@@ -372,11 +489,6 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
 void __init_rwsem(struct rw_semaphore *sem, const char *name,
 		  struct lock_class_key *key)
 {
-	/*
-	 * We should support at least (8k-1) concurrent readers
-	 */
-	BUILD_BUG_ON(sizeof(long) * 8 - RWSEM_READER_SHIFT < 14);
-
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	/*
 	 * Make sure we are not reinitializing a held semaphore:
@@ -404,6 +516,7 @@ struct rwsem_waiter {
 	struct task_struct *task;
 	enum rwsem_waiter_type type;
 	unsigned long timeout;
+	unsigned long last_rowner;
 };
 #define rwsem_first_waiter(sem) \
 	list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
@@ -453,6 +566,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 {
 	struct rwsem_waiter *waiter, *tmp;
 	long oldcount, woken = 0, adjustment = 0;
+	struct list_head wlist;
 
 	lockdep_assert_held(&sem->wait_lock);
 
@@ -478,12 +592,20 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		return;
 	}
 
+	/*
+	 * No reader wakeup if there are too many of them already.
+	 */
+	if (unlikely(atomic_long_read(&sem->count) < 0))
+		return;
+
 	/*
 	 * Writers might steal the lock before we grant it to the next reader.
 	 * We prefer to do the first reader grant before counting readers
 	 * so we can bail out early if a writer stole the lock.
 	 */
 	if (wake_type != RWSEM_WAKE_READ_OWNED) {
+		struct task_struct *owner;
+
 		adjustment = RWSEM_READER_BIAS;
 		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
 		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
@@ -504,8 +626,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		/*
 		 * Set it to reader-owned to give spinners an early
 		 * indication that readers now have the lock.
+		 * The reader nonspinnable bit seen at slowpath entry of
+		 * the reader is copied over.
 		 */
-		__rwsem_set_reader_owned(sem, waiter->task);
+		owner = waiter->task;
+		if (waiter->last_rowner & RWSEM_RD_NONSPINNABLE) {
+			owner = (void *)((long)owner | RWSEM_RD_NONSPINNABLE);
+			lockevent_inc(rwsem_opt_norspin);
+		}
+		__rwsem_set_reader_owned(sem, owner);
 	}
 
 	/*
@@ -519,30 +648,25 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 	 * to acquire the lock at the same time irrespective of their order
 	 * in the queue. The writers acquire the lock according to their
 	 * order in the queue.
+	 *
+	 * We have to do wakeup in 2 passes to prevent the possibility that
+	 * the reader count may be decremented before it is incremented. It
+	 * is because the to-be-woken waiter may not have slept yet. So it
+	 * may see waiter->task got cleared, finish its critical section and
+	 * do an unlock before the reader count increment.
+	 *
+	 * 1) Collect the read-waiters in a separate list, count them and
+	 *    fully increment the reader count in rwsem.
+	 * 2) For each waiters in the new list, clear waiter->task and
+	 *    put them into wake_q to be woken up later.
 	 */
+	INIT_LIST_HEAD(&wlist);
 	list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
-		struct task_struct *tsk;
-
 		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
 			continue;
 
 		woken++;
-		tsk = waiter->task;
-
-		get_task_struct(tsk);
-		list_del(&waiter->list);
-		/*
-		 * Ensure calling get_task_struct() before setting the reader
-		 * waiter to nil such that rwsem_down_read_slowpath() cannot
-		 * race with do_exit() by always holding a reference count
-		 * to the task to wakeup.
-		 */
-		smp_store_release(&waiter->task, NULL);
-		/*
-		 * Ensure issuing the wakeup (either by us or someone else)
-		 * after setting the reader waiter to nil.
-		 */
-		wake_q_add_safe(wake_q, tsk);
+		list_move_tail(&waiter->list, &wlist);
 
 		/*
 		 * Limit # of readers that can be woken up per wakeup call.
@@ -567,6 +691,27 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 
 	if (adjustment)
 		atomic_long_add(adjustment, &sem->count);
+
+	/* 2nd pass */
+	list_for_each_entry(waiter, &wlist, list) {
+		struct task_struct *tsk;
+
+		tsk = waiter->task;
+		get_task_struct(tsk);
+
+		/*
+		 * Ensure calling get_task_struct() before setting the reader
+		 * waiter to nil such that rwsem_down_read_slowpath() cannot
+		 * race with do_exit() by always holding a reference count
+		 * to the task to wakeup.
+		 */
+		smp_store_release(&waiter->task, NULL);
+		/*
+		 * Ensure issuing the wakeup (either by us or someone else)
+		 * after setting the reader waiter to nil.
+		 */
+		wake_q_add_safe(wake_q, tsk);
+	}
 }
 
 /*
@@ -662,7 +807,7 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 	/*
 	 * Clear all the flag bits in owner
 	 */
-	*((unsigned long *)&owner) &= ~RWSEM_OWNER_FLAGS_MASK;
+	owner = owner_without_flags(owner);
 
 	/*
 	 * As lock holder preemption issue, we both skip spinning if
@@ -685,15 +830,17 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, bool wr)
 
 	preempt_disable();
 	rcu_read_lock();
+
 	owner = rwsem_get_owner(sem);
-	if (owner) {
-		ret = is_rwsem_owner_spinnable(owner, wr) &&
-		      (((unsigned long)owner & RWSEM_READER_OWNED) ||
-		       owner_on_cpu(owner));
-	}
+	if (!is_rwsem_owner_spinnable(owner, wr))
+		ret = false;
+	else if ((unsigned long)owner & RWSEM_READER_OWNED)
+		ret = true;
+	else if ((owner = owner_without_flags(owner)))
+		ret = owner_on_cpu(owner);
+
 	rcu_read_unlock();
 	preempt_enable();
-
 	lockevent_cond_inc(rwsem_opt_fail, !ret);
 	return ret;
 }
@@ -719,15 +866,15 @@ enum owner_state {
 
 static inline enum owner_state rwsem_owner_state(unsigned long owner, bool wr)
 {
-	if (!owner)
-		return OWNER_NULL;
-
 	if (!is_rwsem_owner_spinnable((void *)owner, wr))
 		return OWNER_NONSPINNABLE;
 
 	if (owner & RWSEM_READER_OWNED)
 		return OWNER_READER;
 
+	if (!(owner & ~RWSEM_OWNER_FLAGS_MASK))
+		return OWNER_NULL;
+
 	return OWNER_WRITER;
 }
 
@@ -952,6 +1099,43 @@ static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
 		atomic_long_andnot(RWSEM_WR_NONSPINNABLE,
 				  (atomic_long_t *)&sem->owner);
 }
+
+/*
+ * This function is called when the reader fails to acquire the lock via
+ * optimistic spinning. In this case we will still attempt to do a trylock
+ * when comparing the rwsem state right now with the state when entering
+ * the slowpath indicates that the reader is still in a valid reader phase.
+ * This happens when the following conditions are true:
+ *
+ * 1) The lock is currently reader owned, and
+ * 2) The lock is previously not reader-owned or the last read owner changes.
+ *
+ * In the former case, we have transitioned from a writer phase to a
+ * reader-phase while spinning. In the latter case, it means the reader
+ * phase hasn't ended when we entered the optimistic spinning loop. In
+ * both cases, the reader is eligible to acquire the lock. This is the
+ * secondary path where a read lock is acquired optimistically.
+ *
+ * The reader non-spinnable bit wasn't set at time of entry or it will
+ * not be here at all.
+ */
+static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
+					      unsigned long last_rowner)
+{
+	unsigned long owner = (unsigned long)READ_ONCE(sem->owner);
+
+	if (!(owner & RWSEM_READER_OWNED))
+		return false;
+
+	owner	    &= ~RWSEM_OWNER_FLAGS_MASK;
+	last_rowner &= ~RWSEM_OWNER_FLAGS_MASK;
+	if ((owner != last_rowner) && rwsem_try_read_lock_unqueued(sem)) {
+		lockevent_inc(rwsem_opt_rlock2);
+		lockevent_add(rwsem_opt_fail, -1);
+		return true;
+	}
+	return false;
+}
 #else
 static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, bool wr)
 {
@@ -965,42 +1149,50 @@ static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem,
 }
 
 static inline void clear_wr_nonspinnable(struct rw_semaphore *sem) { }
+
+static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
+					      unsigned long last_rowner)
+{
+	return false;
+}
 #endif
 
 /*
  * Wait for the read lock to be granted
  */
 static struct rw_semaphore __sched *
-rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long count)
+rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
 {
-	long adjustment = -RWSEM_READER_BIAS;
+	long count;
 	bool wake = false;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
-	if (unlikely(count < 0)) {
+	if (unlikely(!adjustment)) {
 		/*
-		 * The sign bit has been set meaning that too many active
-		 * readers are present. We need to decrement reader count &
-		 * enter wait queue immediately to avoid overflowing the
-		 * reader count.
-		 *
-		 * As preemption is not disabled, there is a remote
-		 * possibility that preemption can happen in the narrow
-		 * timing window between incrementing and decrementing
-		 * the reader count and the task is put to sleep for a
-		 * considerable amount of time. If sufficient number
-		 * of such unfortunate sequence of events happen, we
-		 * may still overflow the reader count. It is extremely
-		 * unlikey, though. If this is a concern, we should consider
-		 * disable preemption during this timing window to make
-		 * sure that such unfortunate event will not happen.
+		 * This shouldn't happen. If it does, there is probably
+		 * something wrong in the system.
 		 */
-		atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
-		adjustment = 0;
+		WARN_ON_ONCE(1);
+
+		/*
+		 * An adjustment of 0 means that there are too many readers
+		 * holding or trying to acquire the lock. So disable
+		 * optimistic spinning and go directly into the wait list.
+		 */
+		if (is_rwsem_spinnable(sem, false))
+			rwsem_set_nonspinnable(sem);
 		goto queue;
 	}
 
+	/*
+	 * Save the current read-owner of rwsem, if available, and the
+	 * reader nonspinnable bit.
+	 */
+	waiter.last_rowner = (long)READ_ONCE(sem->owner);
+	if (!(waiter.last_rowner & RWSEM_READER_OWNED))
+		waiter.last_rowner &= RWSEM_RD_NONSPINNABLE;
+
 	if (!rwsem_can_spin_on_owner(sem, false))
 		goto queue;
 
@@ -1011,11 +1203,11 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long count)
 	adjustment = 0;
 	if (rwsem_optimistic_spin(sem, false)) {
 		/*
-		 * Opportunistically wake up other readers in the wait queue.
-		 * It has another chance of wakeup at unlock time.
+		 * Wake up other readers in the wait list if the front
+		 * waiter is a reader.
 		 */
-		if ((atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS) &&
-		    raw_spin_trylock_irq(&sem->wait_lock)) {
+		if ((atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS)) {
+			raw_spin_lock_irq(&sem->wait_lock);
 			if (!list_empty(&sem->wait_list))
 				__rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED,
 						  &wake_q);
@@ -1023,6 +1215,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long count)
 			wake_up_q(&wake_q);
 		}
 		return sem;
+	} else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) {
+		return sem;
 	}
 
 queue:
@@ -1084,7 +1278,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long count)
 			raw_spin_unlock_irq(&sem->wait_lock);
 			break;
 		}
-		schedule();
+		rwsem_schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_reader);
 	}
 
@@ -1185,8 +1379,11 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 			 * This waiter may have become first in the wait
 			 * list after re-acquring the wait_lock. The
 			 * rwsem_first_waiter() test in the main while
-			 * loop below will correctly detect that.
+			 * loop below will correctly detect that. We do
+			 * need to reload count to perform proper trylock
+			 * and avoid missed wakeup.
 			 */
+			count = atomic_long_read(&sem->count);
 		}
 	} else {
 		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
@@ -1312,29 +1509,36 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
  */
 inline void __down_read(struct rw_semaphore *sem)
 {
-	long tmp = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-						 &sem->count);
+	long tmp, adjustment;
 
+	rwsem_preempt_disable();
+	adjustment = rwsem_read_trylock(sem, &tmp);
 	if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
-		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, tmp);
+		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, adjustment);
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
+	rwsem_preempt_enable();
 }
 
 static inline int __down_read_killable(struct rw_semaphore *sem)
 {
-	long tmp = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-						 &sem->count);
+	long tmp, adjustment;
 
+	rwsem_preempt_disable();
+	adjustment = rwsem_read_trylock(sem, &tmp);
 	if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
-		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE, tmp)))
+		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE,
+						    adjustment))) {
+			rwsem_preempt_enable();
 			return -EINTR;
+		}
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
+	rwsem_preempt_enable();
 	return 0;
 }
 
@@ -1367,7 +1571,7 @@ static inline void __down_write(struct rw_semaphore *sem)
 		rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE);
 	else
 		rwsem_set_owner(sem);
-#ifdef CONFIG_RWSEM_OWNER_COUNT
+#ifdef MERGE_OWNER_INTO_COUNT
 	DEBUG_RWSEMS_WARN_ON(sem->owner != rwsem_get_owner(sem), sem);
 #endif
 }
@@ -1408,6 +1612,7 @@ inline void __up_read(struct rw_semaphore *sem)
 	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+	DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
 	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
 		      RWSEM_FLAG_WAITERS)) {
 		clear_wr_nonspinnable(sem);
@@ -1448,7 +1653,7 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 	 * read-locked region is ok to be re-ordered into the
 	 * write side. As such, rely on RELEASE semantics.
 	 */
-	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+	DEBUG_RWSEMS_WARN_ON(owner_without_flags(sem->owner) != current, sem);
 	tmp = atomic_long_fetch_add_release(
 		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
 	rwsem_set_reader_owned(sem);

-- 
2.18.1



* [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count before increment
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-05-03 12:06   ` Peter Zijlstra
  2019-05-07  7:07   ` [tip:locking/urgent] " tip-bot for Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 02/20] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER Waiman Long
                   ` (19 subsequent siblings)
  20 siblings, 2 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long, stable

During my rwsem testing, it was found that after a down_read(), the
reader count may occasionally become 0 or even negative. Consequently,
a writer may steal the lock at that time and execute in parallel with
the reader, thus breaking the mutual exclusion guarantee of the write
lock. In other words, both readers and a writer can become rwsem owners
simultaneously.

The current reader wakeup code does it in one pass: it clears
waiter->task and puts the waiters into wake_q before fully incrementing
the reader count. Once waiter->task is cleared, the corresponding reader
may see it, finish the critical section and do an unlock that decrements
the count before the count is incremented. This is not a problem if
there is only one reader to wake up, as the count has been
pre-incremented by 1. It is a problem if there is more than one reader
to be woken up, as a writer can then steal the lock.

The wakeup was actually done in 2 passes before the v4.9 commit
70800c3c0cc5 ("locking/rwsem: Scan the wait_list for readers only
once"). To fix this problem, the wakeup is now done in two passes
again. In the first pass, we collect the readers and count them. The
reader count is then fully incremented. In the second pass,
waiter->task is cleared and the waiters are put into wake_q to be woken
up later.

Fixes: 70800c3c0cc5 ("locking/rwsem: Scan the wait_list for readers only once")
Cc: <stable@vger.kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem-xadd.c | 46 +++++++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 15 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 6b3ee9948bf1..0b1f77957240 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -130,6 +130,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 {
 	struct rwsem_waiter *waiter, *tmp;
 	long oldcount, woken = 0, adjustment = 0;
+	struct list_head wlist;
 
 	/*
 	 * Take a peek at the queue head waiter such that we can determine
@@ -188,18 +189,43 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 	 * of the queue. We know that woken will be at least 1 as we accounted
 	 * for above. Note we increment the 'active part' of the count by the
 	 * number of readers before waking any processes up.
+	 *
+	 * We have to do wakeup in 2 passes to prevent the possibility that
+	 * the reader count may be decremented before it is incremented. It
+	 * is because the to-be-woken waiter may not have slept yet. So it
+	 * may see waiter->task got cleared, finish its critical section and
+	 * do an unlock before the reader count increment.
+	 *
+	 * 1) Collect the read-waiters in a separate list, count them and
+	 *    fully increment the reader count in rwsem.
+	 * 2) For each waiters in the new list, clear waiter->task and
+	 *    put them into wake_q to be woken up later.
 	 */
-	list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
-		struct task_struct *tsk;
-
+	list_for_each_entry(waiter, &sem->wait_list, list) {
 		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
 			break;
 
 		woken++;
-		tsk = waiter->task;
+	}
+	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
+
+	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
+	lockevent_cond_inc(rwsem_wake_reader, woken);
+	if (list_empty(&sem->wait_list)) {
+		/* hit end of list above */
+		adjustment -= RWSEM_WAITING_BIAS;
+	}
+
+	if (adjustment)
+		atomic_long_add(adjustment, &sem->count);
+
+	/* 2nd pass */
+	list_for_each_entry_safe(waiter, tmp, &wlist, list) {
+		struct task_struct *tsk;
 
+		tsk = waiter->task;
 		get_task_struct(tsk);
-		list_del(&waiter->list);
+
 		/*
 		 * Ensure calling get_task_struct() before setting the reader
 		 * waiter to nil such that rwsem_down_read_failed() cannot
@@ -213,16 +239,6 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		 */
 		wake_q_add_safe(wake_q, tsk);
 	}
-
-	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
-	lockevent_cond_inc(rwsem_wake_reader, woken);
-	if (list_empty(&sem->wait_list)) {
-		/* hit end of list above */
-		adjustment -= RWSEM_WAITING_BIAS;
-	}
-
-	if (adjustment)
-		atomic_long_add(adjustment, &sem->count);
 }
 
 /*
-- 
2.18.1



* [PATCH-tip v7 02/20] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count before increment Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 03/20] locking/rwsem: Remove rwsem_wake() wakeup optimization Waiman Long
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

The owner field in the rw_semaphore structure is used primarily for
optimistic spinning. However, identifying the rwsem owner can also be
helpful for debugging as well as for tracing locking-related issues when
analyzing a crash dump. The owner field may also store state information
that can be important to the operation of the rwsem.

So the owner field is now made a permanent member of the rw_semaphore
structure irrespective of CONFIG_RWSEM_SPIN_ON_OWNER.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/rwsem.h       |  9 +++++----
 kernel/locking/rwsem-xadd.c |  2 +-
 kernel/locking/rwsem.h      | 23 -----------------------
 lib/Kconfig.debug           |  8 ++++----
 4 files changed, 10 insertions(+), 32 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 2ea18a3def04..148983e21d47 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -34,12 +34,12 @@
  */
 struct rw_semaphore {
 	atomic_long_t count;
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	/*
-	 * Write owner. Used as a speculative check to see
-	 * if the owner is running on the cpu.
+	 * Write owner or one of the read owners. Can be used as a
+	 * speculative check to see if the owner is running on the cpu.
 	 */
 	struct task_struct *owner;
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	struct optimistic_spin_queue osq; /* spinner MCS lock */
 #endif
 	raw_spinlock_t wait_lock;
@@ -73,13 +73,14 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 #endif
 
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED, .owner = NULL
+#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED
 #else
 #define __RWSEM_OPT_INIT(lockname)
 #endif
 
 #define __RWSEM_INITIALIZER(name)				\
 	{ __RWSEM_INIT_COUNT(name),				\
+	  .owner = NULL,					\
 	  .wait_list = LIST_HEAD_INIT((name).wait_list),	\
 	  .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock)	\
 	  __RWSEM_OPT_INIT(name)				\
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 0b1f77957240..c0500679fd2f 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -86,8 +86,8 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 	atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
 	raw_spin_lock_init(&sem->wait_lock);
 	INIT_LIST_HEAD(&sem->wait_list);
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	sem->owner = NULL;
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 	osq_lock_init(&sem->osq);
 #endif
 }
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 64877f5294e3..eb9c8534299b 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -61,7 +61,6 @@
 #define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
 #define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
 
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
  * store tearing can't happen as optimistic spinners may read and use
@@ -126,7 +125,6 @@ static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
  * real owner or one of the real owners. The only exception is when the
  * unlock is done by up_read_non_owner().
  */
-#define rwsem_clear_reader_owned rwsem_clear_reader_owned
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
 	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
@@ -135,28 +133,7 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
 				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
 }
-#endif
-
 #else
-static inline void rwsem_set_owner(struct rw_semaphore *sem)
-{
-}
-
-static inline void rwsem_clear_owner(struct rw_semaphore *sem)
-{
-}
-
-static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
-					   struct task_struct *owner)
-{
-}
-
-static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
-{
-}
-#endif
-
-#ifndef rwsem_clear_reader_owned
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
 }
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 0d9e81779e37..2047f3884540 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1067,7 +1067,7 @@ config PROVE_LOCKING
 	select DEBUG_SPINLOCK
 	select DEBUG_MUTEXES
 	select DEBUG_RT_MUTEXES if RT_MUTEXES
-	select DEBUG_RWSEMS if RWSEM_SPIN_ON_OWNER
+	select DEBUG_RWSEMS
 	select DEBUG_WW_MUTEX_SLOWPATH
 	select DEBUG_LOCK_ALLOC
 	select TRACE_IRQFLAGS
@@ -1171,10 +1171,10 @@ config DEBUG_WW_MUTEX_SLOWPATH
 
 config DEBUG_RWSEMS
 	bool "RW Semaphore debugging: basic checks"
-	depends on DEBUG_KERNEL && RWSEM_SPIN_ON_OWNER
+	depends on DEBUG_KERNEL
 	help
-	  This debugging feature allows mismatched rw semaphore locks and unlocks
-	  to be detected and reported.
+	  This debugging feature allows mismatched rw semaphore locks
+	  and unlocks to be detected and reported.
 
 config DEBUG_LOCK_ALLOC
 	bool "Lock debugging: detect incorrect freeing of live locks"
-- 
2.18.1



* [PATCH-tip v7 03/20] locking/rwsem: Remove rwsem_wake() wakeup optimization
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count before increment Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 02/20] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 04/20] locking/rwsem: Implement a new locking scheme Waiman Long
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

With the commit 59aabfc7e959 ("locking/rwsem: Reduce spinlock contention
in wakeup after up_read()/up_write()"), the rwsem_wake() forgoes doing
a wakeup if the wait_lock cannot be directly acquired and an optimistic
spinning locker is present.  This can help performance by avoiding
spinning on the wait_lock when it is contended.

With the later commit 133e89ef5ef3 ("locking/rwsem: Enable lockless
waiter wakeup(s)"), the performance advantage of the above optimization
diminishes as the average wait_lock hold time becomes much shorter.

With a later patch that supports rwsem lock handoff, we can no
longer rely on the fact that the presence of an optimistic spinning
locker will ensure that the lock will be acquired by a task soon and
that rwsem_wake() will be called later on to wake up waiters. This can
lead to missed wakeups and application hangs. So the commit 59aabfc7e959
("locking/rwsem: Reduce spinlock contention in wakeup after
up_read()/up_write()") will have to be reverted.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem-xadd.c | 72 -------------------------------------
 1 file changed, 72 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index c0500679fd2f..3083fdf50447 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -411,25 +411,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	lockevent_cond_inc(rwsem_opt_fail, !taken);
 	return taken;
 }
-
-/*
- * Return true if the rwsem has active spinner
- */
-static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
-{
-	return osq_is_locked(&sem->osq);
-}
-
 #else
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 {
 	return false;
 }
-
-static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
-{
-	return false;
-}
 #endif
 
 /*
@@ -651,65 +637,7 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
 
-	/*
-	* __rwsem_down_write_failed_common(sem)
-	*   rwsem_optimistic_spin(sem)
-	*     osq_unlock(sem->osq)
-	*   ...
-	*   atomic_long_add_return(&sem->count)
-	*
-	*      - VS -
-	*
-	*              __up_write()
-	*                if (atomic_long_sub_return_release(&sem->count) < 0)
-	*                  rwsem_wake(sem)
-	*                    osq_is_locked(&sem->osq)
-	*
-	* And __up_write() must observe !osq_is_locked() when it observes the
-	* atomic_long_add_return() in order to not miss a wakeup.
-	*
-	* This boils down to:
-	*
-	* [S.rel] X = 1                [RmW] r0 = (Y += 0)
-	*         MB                         RMB
-	* [RmW]   Y += 1               [L]   r1 = X
-	*
-	* exists (r0=1 /\ r1=0)
-	*/
-	smp_rmb();
-
-	/*
-	 * If a spinner is present, it is not necessary to do the wakeup.
-	 * Try to do wakeup only if the trylock succeeds to minimize
-	 * spinlock contention which may introduce too much delay in the
-	 * unlock operation.
-	 *
-	 *    spinning writer		up_write/up_read caller
-	 *    ---------------		-----------------------
-	 * [S]   osq_unlock()		[L]   osq
-	 *	 MB			      RMB
-	 * [RmW] rwsem_try_write_lock() [RmW] spin_trylock(wait_lock)
-	 *
-	 * Here, it is important to make sure that there won't be a missed
-	 * wakeup while the rwsem is free and the only spinning writer goes
-	 * to sleep without taking the rwsem. Even when the spinning writer
-	 * is just going to break out of the waiting loop, it will still do
-	 * a trylock in rwsem_down_write_failed() before sleeping. IOW, if
-	 * rwsem_has_spinner() is true, it will guarantee at least one
-	 * trylock attempt on the rwsem later on.
-	 */
-	if (rwsem_has_spinner(sem)) {
-		/*
-		 * The smp_rmb() here is to make sure that the spinner
-		 * state is consulted before reading the wait_lock.
-		 */
-		smp_rmb();
-		if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
-			return sem;
-		goto locked;
-	}
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
-locked:
 
 	if (!list_empty(&sem->wait_list))
 		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-- 
2.18.1



* [PATCH-tip v7 04/20] locking/rwsem: Implement a new locking scheme
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (2 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 03/20] locking/rwsem: Remove rwsem_wake() wakeup optimization Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 05/20] locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c Waiman Long
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

The current way of using various reader, writer and waiting biases
in the rwsem code is confusing and hard to understand. I have to
reread the rwsem count guide in the rwsem-xadd.c file from time to
time to remind myself how this whole thing works. It also makes the
rwsem code harder to optimize.

To make rwsem more sane, a new locking scheme similar to the one in
qrwlock is now being used.  The atomic long count has the following
bit definitions:

  Bit  0   - writer locked bit
  Bit  1   - waiters present bit
  Bits 2-7 - reserved for future extension
  Bits 8-X - reader count (24/56 bits)

The cmpxchg instruction is now used to acquire the write lock. The read
lock is still acquired with the xadd instruction, so there is no change
here. This scheme will allow up to 16M/64P active readers, which should
be more than enough. We can always use some more reserved bits if
necessary.

With that change, we can deterministically know if a rwsem has been
write-locked. Looking at the count alone, however, one cannot determine
for certain if a rwsem is owned by readers or not as the readers that
set the reader count bits may be in the process of backing out. So we
still need the reader-owned bit in the owner field to be sure.

With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) of the benchmark on an 8-socket 120-core
IvyBridge-EX system before and after the patch were as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        30,659   31,341   31,055   31,283
        2         8,909   16,457    9,884   17,659
        4         9,028   15,823    8,933   20,233
        8         8,410   14,212    7,230   17,140
       16         8,217   25,240    7,479   24,607

The locking rates of the benchmark on a Power8 system were as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        12,963   13,647   13,275   13,601
        2         7,570   11,569    7,902   10,829
        4         5,232    5,516    5,466    5,435
        8         5,233    3,386    5,467    3,168

The locking rates of the benchmark on a 2-socket ARM64 system were
as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        21,495   21,046   21,524   21,074
        2         5,293   10,502    5,333   10,504
        4         5,325   11,463    5,358   11,631
        8         5,391   11,712    5,470   11,680

The performance is roughly the same before and after the patch. There
are run-to-run variations in performance; runs with higher variance
usually have higher throughput.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem-xadd.c | 147 ++++++++++++------------------------
 kernel/locking/rwsem.h      |  74 +++++++++---------
 2 files changed, 85 insertions(+), 136 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 3083fdf50447..7d537b50a849 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -9,6 +9,8 @@
  *
  * Optimistic spinning by Tim Chen <tim.c.chen@intel.com>
  * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes.
+ *
+ * Rwsem count bit fields re-definition by Waiman Long <longman@redhat.com>.
  */
 #include <linux/rwsem.h>
 #include <linux/init.h>
@@ -22,52 +24,20 @@
 #include "rwsem.h"
 
 /*
- * Guide to the rw_semaphore's count field for common values.
- * (32-bit case illustrated, similar for 64-bit)
- *
- * 0x0000000X	(1) X readers active or attempting lock, no writer waiting
- *		    X = #active_readers + #readers attempting to lock
- *		    (X*ACTIVE_BIAS)
- *
- * 0x00000000	rwsem is unlocked, and no one is waiting for the lock or
- *		attempting to read lock or write lock.
- *
- * 0xffff000X	(1) X readers active or attempting lock, with waiters for lock
- *		    X = #active readers + # readers attempting lock
- *		    (X*ACTIVE_BIAS + WAITING_BIAS)
- *		(2) 1 writer attempting lock, no waiters for lock
- *		    X-1 = #active readers + #readers attempting lock
- *		    ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *		(3) 1 writer active, no waiters for lock
- *		    X-1 = #active readers + #readers attempting lock
- *		    ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *
- * 0xffff0001	(1) 1 reader active or attempting lock, waiters for lock
- *		    (WAITING_BIAS + ACTIVE_BIAS)
- *		(2) 1 writer active or attempting lock, no waiters for lock
- *		    (ACTIVE_WRITE_BIAS)
+ * Guide to the rw_semaphore's count field.
  *
- * 0xffff0000	(1) There are writers or readers queued but none active
- *		    or in the process of attempting lock.
- *		    (WAITING_BIAS)
- *		Note: writer can attempt to steal lock for this count by adding
- *		ACTIVE_WRITE_BIAS in cmpxchg and checking the old count
+ * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
+ * by a writer.
  *
- * 0xfffe0001	(1) 1 writer active, or attempting lock. Waiters on queue.
- *		    (ACTIVE_WRITE_BIAS + WAITING_BIAS)
- *
- * Note: Readers attempt to lock by adding ACTIVE_BIAS in down_read and checking
- *	 the count becomes more than 0 for successful lock acquisition,
- *	 i.e. the case where there are only readers or nobody has lock.
- *	 (1st and 2nd case above).
- *
- *	 Writers attempt to lock by adding ACTIVE_WRITE_BIAS in down_write and
- *	 checking the count becomes ACTIVE_WRITE_BIAS for successful lock
- *	 acquisition (i.e. nobody else has lock or attempts lock).  If
- *	 unsuccessful, in rwsem_down_write_failed, we'll check to see if there
- *	 are only waiters but none active (5th case above), and attempt to
- *	 steal the lock.
+ * The lock is owned by readers when
+ * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (2) some of the reader bits are set in count, and
+ * (3) the owner field has RWSEM_READER_OWNED bit set.
  *
+ * Having some reader bits set is not enough to guarantee a reader-owned
+ * lock as the readers may be in the process of backing out from the count
+ * and a writer has just released the lock. So another writer may steal
+ * the lock immediately after that.
  */
 
 /*
@@ -113,9 +83,8 @@ enum rwsem_wake_type {
 
 /*
  * handle the lock release when processes blocked on it that can now run
- * - if we come here from up_xxxx(), then:
- *   - the 'active part' of count (&0x0000ffff) reached 0 (but may have changed)
- *   - the 'waiting part' of count (&0xffff0000) is -ve (and will still be so)
+ * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
+ *   have been set.
  * - there must be someone on the queue
  * - the wait_lock must be held by the caller
  * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
@@ -160,22 +129,11 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 	 * so we can bail out early if a writer stole the lock.
 	 */
 	if (wake_type != RWSEM_WAKE_READ_OWNED) {
-		adjustment = RWSEM_ACTIVE_READ_BIAS;
- try_reader_grant:
+		adjustment = RWSEM_READER_BIAS;
 		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
-		if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
-			/*
-			 * If the count is still less than RWSEM_WAITING_BIAS
-			 * after removing the adjustment, it is assumed that
-			 * a writer has stolen the lock. We have to undo our
-			 * reader grant.
-			 */
-			if (atomic_long_add_return(-adjustment, &sem->count) <
-			    RWSEM_WAITING_BIAS)
-				return;
-
-			/* Last active locker left. Retry waking readers. */
-			goto try_reader_grant;
+		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+			atomic_long_sub(adjustment, &sem->count);
+			return;
 		}
 		/*
 		 * Set it to reader-owned to give spinners an early
@@ -209,11 +167,11 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 	}
 	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
 
-	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
+	adjustment = woken * RWSEM_READER_BIAS - adjustment;
 	lockevent_cond_inc(rwsem_wake_reader, woken);
 	if (list_empty(&sem->wait_list)) {
 		/* hit end of list above */
-		adjustment -= RWSEM_WAITING_BIAS;
+		adjustment -= RWSEM_FLAG_WAITERS;
 	}
 
 	if (adjustment)
@@ -248,22 +206,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
  */
 static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
 {
-	/*
-	 * Avoid trying to acquire write lock if count isn't RWSEM_WAITING_BIAS.
-	 */
-	if (count != RWSEM_WAITING_BIAS)
+	long new;
+
+	if (count & RWSEM_LOCK_MASK)
 		return false;
 
-	/*
-	 * Acquire the lock by trying to set it to ACTIVE_WRITE_BIAS. If there
-	 * are other tasks on the wait list, we need to add on WAITING_BIAS.
-	 */
-	count = list_is_singular(&sem->wait_list) ?
-			RWSEM_ACTIVE_WRITE_BIAS :
-			RWSEM_ACTIVE_WRITE_BIAS + RWSEM_WAITING_BIAS;
+	new = count + RWSEM_WRITER_LOCKED -
+	     (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
 
-	if (atomic_long_cmpxchg_acquire(&sem->count, RWSEM_WAITING_BIAS, count)
-							== RWSEM_WAITING_BIAS) {
+	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
 		rwsem_set_owner(sem);
 		return true;
 	}
@@ -279,9 +230,9 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
 {
 	long count = atomic_long_read(&sem->count);
 
-	while (!count || count == RWSEM_WAITING_BIAS) {
+	while (!(count & RWSEM_LOCK_MASK)) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
-					count + RWSEM_ACTIVE_WRITE_BIAS)) {
+					count + RWSEM_WRITER_LOCKED)) {
 			rwsem_set_owner(sem);
 			lockevent_inc(rwsem_opt_wlock);
 			return true;
@@ -424,7 +375,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 static inline struct rw_semaphore __sched *
 __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 {
-	long count, adjustment = -RWSEM_ACTIVE_READ_BIAS;
+	long count, adjustment = -RWSEM_READER_BIAS;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
@@ -436,16 +387,16 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 		/*
 		 * In case the wait queue is empty and the lock isn't owned
 		 * by a writer, this reader can exit the slowpath and return
-		 * immediately as its RWSEM_ACTIVE_READ_BIAS has already
-		 * been set in the count.
+		 * immediately as its RWSEM_READER_BIAS has already been
+		 * set in the count.
 		 */
-		if (atomic_long_read(&sem->count) >= 0) {
+		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
 			lockevent_inc(rwsem_rlock_fast);
 			return sem;
 		}
-		adjustment += RWSEM_WAITING_BIAS;
+		adjustment += RWSEM_FLAG_WAITERS;
 	}
 	list_add_tail(&waiter.list, &sem->wait_list);
 
@@ -458,9 +409,8 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 	 * If there are no writers and we are first in the queue,
 	 * wake our own waiter to join the existing active readers !
 	 */
-	if (count == RWSEM_WAITING_BIAS ||
-	    (count > RWSEM_WAITING_BIAS &&
-	     adjustment != -RWSEM_ACTIVE_READ_BIAS))
+	if (!(count & RWSEM_LOCK_MASK) ||
+	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
 		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irq(&sem->wait_lock);
@@ -488,7 +438,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 out_nolock:
 	list_del(&waiter.list);
 	if (list_empty(&sem->wait_list))
-		atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
 	raw_spin_unlock_irq(&sem->wait_lock);
 	__set_current_state(TASK_RUNNING);
 	lockevent_inc(rwsem_rlock_fail);
@@ -521,9 +471,6 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 	struct rw_semaphore *ret = sem;
 	DEFINE_WAKE_Q(wake_q);
 
-	/* undo write bias from down_write operation, stop active locking */
-	count = atomic_long_sub_return(RWSEM_ACTIVE_WRITE_BIAS, &sem->count);
-
 	/* do optimistic spinning and steal lock if possible */
 	if (rwsem_optimistic_spin(sem))
 		return sem;
@@ -543,16 +490,18 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 
 	list_add_tail(&waiter.list, &sem->wait_list);
 
-	/* we're now waiting on the lock, but no longer actively locking */
+	/* we're now waiting on the lock */
 	if (waiting) {
 		count = atomic_long_read(&sem->count);
 
 		/*
 		 * If there were already threads queued before us and there are
-		 * no active writers, the lock must be read owned; so we try to
-		 * wake any read locks that were queued ahead of us.
+		 * no active writers and some readers, the lock must be read
+		 * owned; so we try to wake any read locks that were queued ahead
+		 * of us.
 		 */
-		if (count > RWSEM_WAITING_BIAS) {
+		if (!(count & RWSEM_WRITER_MASK) &&
+		     (count & RWSEM_READER_MASK)) {
 			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
 			/*
 			 * The wakeup is normally called _after_ the wait_lock
@@ -569,8 +518,9 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 			wake_q_init(&wake_q);
 		}
 
-	} else
-		count = atomic_long_add_return(RWSEM_WAITING_BIAS, &sem->count);
+	} else {
+		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+	}
 
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
@@ -587,7 +537,8 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 			schedule();
 			lockevent_inc(rwsem_sleep_writer);
 			set_current_state(state);
-		} while ((count = atomic_long_read(&sem->count)) & RWSEM_ACTIVE_MASK);
+			count = atomic_long_read(&sem->count);
+		} while (count & RWSEM_LOCK_MASK);
 
 		raw_spin_lock_irq(&sem->wait_lock);
 	}
@@ -603,7 +554,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 	raw_spin_lock_irq(&sem->wait_lock);
 	list_del(&waiter.list);
 	if (list_empty(&sem->wait_list))
-		atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
 	else
 		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 	raw_spin_unlock_irq(&sem->wait_lock);
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index eb9c8534299b..499a9b2bda82 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -42,24 +42,24 @@
 #endif
 
 /*
- * R/W semaphores originally for PPC using the stuff in lib/rwsem.c.
- * Adapted largely from include/asm-i386/rwsem.h
- * by Paul Mackerras <paulus@samba.org>.
- */
-
-/*
- * the semaphore definition
+ * The definition of the atomic counter in the semaphore:
+ *
+ * Bit  0   - writer locked bit
+ * Bit  1   - waiters present bit
+ * Bits 2-7 - reserved
+ * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ *
+ * atomic_long_fetch_add() is used to obtain reader lock, whereas
+ * atomic_long_cmpxchg() will be used to obtain writer lock.
  */
-#ifdef CONFIG_64BIT
-# define RWSEM_ACTIVE_MASK		0xffffffffL
-#else
-# define RWSEM_ACTIVE_MASK		0x0000ffffL
-#endif
-
-#define RWSEM_ACTIVE_BIAS		0x00000001L
-#define RWSEM_WAITING_BIAS		(-RWSEM_ACTIVE_MASK-1)
-#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#define RWSEM_WRITER_LOCKED	(1UL << 0)
+#define RWSEM_FLAG_WAITERS	(1UL << 1)
+#define RWSEM_READER_SHIFT	8
+#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
+#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
+#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
+#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
+#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
 
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -151,7 +151,8 @@ extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
  */
 static inline void __down_read(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
+	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		rwsem_down_read_failed(sem);
 		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
 					RWSEM_READER_OWNED), sem);
@@ -162,7 +163,8 @@ static inline void __down_read(struct rw_semaphore *sem)
 
 static inline int __down_read_killable(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
+	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
 			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
@@ -183,11 +185,11 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
 	lockevent_inc(rwsem_rtrylock);
 	do {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
-					tmp + RWSEM_ACTIVE_READ_BIAS)) {
+					tmp + RWSEM_READER_BIAS)) {
 			rwsem_set_reader_owned(sem);
 			return 1;
 		}
-	} while (tmp >= 0);
+	} while (!(tmp & RWSEM_READ_FAILED_MASK));
 	return 0;
 }
 
@@ -196,22 +198,16 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
  */
 static inline void __down_write(struct rw_semaphore *sem)
 {
-	long tmp;
-
-	tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
-					     &sem->count);
-	if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+						 RWSEM_WRITER_LOCKED)))
 		rwsem_down_write_failed(sem);
 	rwsem_set_owner(sem);
 }
 
 static inline int __down_write_killable(struct rw_semaphore *sem)
 {
-	long tmp;
-
-	tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
-					     &sem->count);
-	if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+						 RWSEM_WRITER_LOCKED)))
 		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
 			return -EINTR;
 	rwsem_set_owner(sem);
@@ -224,7 +220,7 @@ static inline int __down_write_trylock(struct rw_semaphore *sem)
 
 	lockevent_inc(rwsem_wtrylock);
 	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
-		      RWSEM_ACTIVE_WRITE_BIAS);
+					  RWSEM_WRITER_LOCKED);
 	if (tmp == RWSEM_UNLOCKED_VALUE) {
 		rwsem_set_owner(sem);
 		return true;
@@ -242,8 +238,9 @@ static inline void __up_read(struct rw_semaphore *sem)
 	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
 				sem);
 	rwsem_clear_reader_owned(sem);
-	tmp = atomic_long_dec_return_release(&sem->count);
-	if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
+	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
+			== RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
 }
 
@@ -254,8 +251,8 @@ static inline void __up_write(struct rw_semaphore *sem)
 {
 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
 	rwsem_clear_owner(sem);
-	if (unlikely(atomic_long_sub_return_release(RWSEM_ACTIVE_WRITE_BIAS,
-						    &sem->count) < 0))
+	if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
+			&sem->count) & RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
 }
 
@@ -274,8 +271,9 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 	 * write side. As such, rely on RELEASE semantics.
 	 */
 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
-	tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS, &sem->count);
+	tmp = atomic_long_fetch_add_release(
+		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
 	rwsem_set_reader_owned(sem);
-	if (tmp < 0)
+	if (tmp & RWSEM_FLAG_WAITERS)
 		rwsem_downgrade_wake(sem);
 }
-- 
2.18.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH-tip v7 05/20] locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (3 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 04/20] locking/rwsem: Implement a new locking scheme Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 06/20] locking/rwsem: Code cleanup after files merging Waiman Long
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

Now we only have one implementation of rwsem. Even though we still use
xadd to handle reader locking, we use cmpxchg for the writer instead.
So the filename rwsem-xadd.c is no longer strictly correct. Also, no
one outside of the rwsem code needs to know the internal implementation
other than the function prototypes of the two internal functions that
are called directly from percpu-rwsem.c.

So the rwsem-xadd.c and rwsem.h files are now merged into rwsem.c in
the following order:

  <upper part of rwsem.h>
  <rwsem-xadd.c>
  <lower part of rwsem.h>
  <rwsem.c>

The rwsem.h file now contains only two function declarations, for
__up_read() and __down_read().

This is a code relocation patch with no code change at all except
making __up_read() and __down_read() non-static functions so they
can be used by percpu-rwsem.c.
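
For reference, the trimmed rwsem.h is expected to be little more than
those two prototypes (a sketch; the exact include-guard name is an
assumption):

  /* SPDX-License-Identifier: GPL-2.0 */
  #ifndef __INTERNAL_RWSEM_H
  #define __INTERNAL_RWSEM_H

  #include <linux/rwsem.h>

  extern void __down_read(struct rw_semaphore *sem);
  extern void __up_read(struct rw_semaphore *sem);

  #endif /* __INTERNAL_RWSEM_H */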

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/Makefile     |   2 +-
 kernel/locking/rwsem-xadd.c | 624 -------------------------
 kernel/locking/rwsem.c      | 884 ++++++++++++++++++++++++++++++++++++
 kernel/locking/rwsem.h      | 281 +-----------
 4 files changed, 891 insertions(+), 900 deletions(-)
 delete mode 100644 kernel/locking/rwsem-xadd.c

diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 6fe2f333aecb..45452facff3b 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -3,7 +3,7 @@
 # and is generally not a function of system call inputs.
 KCOV_INSTRUMENT		:= n
 
-obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o rwsem-xadd.o
+obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
deleted file mode 100644
index 7d537b50a849..000000000000
--- a/kernel/locking/rwsem-xadd.c
+++ /dev/null
@@ -1,624 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/* rwsem.c: R/W semaphores: contention handling functions
- *
- * Written by David Howells (dhowells@redhat.com).
- * Derived from arch/i386/kernel/semaphore.c
- *
- * Writer lock-stealing by Alex Shi <alex.shi@intel.com>
- * and Michel Lespinasse <walken@google.com>
- *
- * Optimistic spinning by Tim Chen <tim.c.chen@intel.com>
- * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes.
- *
- * Rwsem count bit fields re-definition by Waiman Long <longman@redhat.com>.
- */
-#include <linux/rwsem.h>
-#include <linux/init.h>
-#include <linux/export.h>
-#include <linux/sched/signal.h>
-#include <linux/sched/rt.h>
-#include <linux/sched/wake_q.h>
-#include <linux/sched/debug.h>
-#include <linux/osq_lock.h>
-
-#include "rwsem.h"
-
-/*
- * Guide to the rw_semaphore's count field.
- *
- * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
- * by a writer.
- *
- * The lock is owned by readers when
- * (1) the RWSEM_WRITER_LOCKED isn't set in count,
- * (2) some of the reader bits are set in count, and
- * (3) the owner field has RWSEM_READER_OWNED bit set.
- *
- * Having some reader bits set is not enough to guarantee a reader-owned
- * lock as the readers may be in the process of backing out from the count
- * and a writer has just released the lock. So another writer may steal
- * the lock immediately after that.
- */
-
-/*
- * Initialize an rwsem:
- */
-void __init_rwsem(struct rw_semaphore *sem, const char *name,
-		  struct lock_class_key *key)
-{
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-	/*
-	 * Make sure we are not reinitializing a held semaphore:
-	 */
-	debug_check_no_locks_freed((void *)sem, sizeof(*sem));
-	lockdep_init_map(&sem->dep_map, name, key, 0);
-#endif
-	atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
-	raw_spin_lock_init(&sem->wait_lock);
-	INIT_LIST_HEAD(&sem->wait_list);
-	sem->owner = NULL;
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-	osq_lock_init(&sem->osq);
-#endif
-}
-
-EXPORT_SYMBOL(__init_rwsem);
-
-enum rwsem_waiter_type {
-	RWSEM_WAITING_FOR_WRITE,
-	RWSEM_WAITING_FOR_READ
-};
-
-struct rwsem_waiter {
-	struct list_head list;
-	struct task_struct *task;
-	enum rwsem_waiter_type type;
-};
-
-enum rwsem_wake_type {
-	RWSEM_WAKE_ANY,		/* Wake whatever's at head of wait list */
-	RWSEM_WAKE_READERS,	/* Wake readers only */
-	RWSEM_WAKE_READ_OWNED	/* Waker thread holds the read lock */
-};
-
-/*
- * handle the lock release when processes blocked on it that can now run
- * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
- *   have been set.
- * - there must be someone on the queue
- * - the wait_lock must be held by the caller
- * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
- *   to actually wakeup the blocked task(s) and drop the reference count,
- *   preferably when the wait_lock is released
- * - woken process blocks are discarded from the list after having task zeroed
- * - writers are only marked woken if downgrading is false
- */
-static void __rwsem_mark_wake(struct rw_semaphore *sem,
-			      enum rwsem_wake_type wake_type,
-			      struct wake_q_head *wake_q)
-{
-	struct rwsem_waiter *waiter, *tmp;
-	long oldcount, woken = 0, adjustment = 0;
-	struct list_head wlist;
-
-	/*
-	 * Take a peek at the queue head waiter such that we can determine
-	 * the wakeup(s) to perform.
-	 */
-	waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
-
-	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
-		if (wake_type == RWSEM_WAKE_ANY) {
-			/*
-			 * Mark writer at the front of the queue for wakeup.
-			 * Until the task is actually awoken later by
-			 * the caller, other writers are able to steal it.
-			 * Readers, on the other hand, will block as they
-			 * will notice the queued writer.
-			 */
-			wake_q_add(wake_q, waiter->task);
-			lockevent_inc(rwsem_wake_writer);
-		}
-
-		return;
-	}
-
-	/*
-	 * Writers might steal the lock before we grant it to the next reader.
-	 * We prefer to do the first reader grant before counting readers
-	 * so we can bail out early if a writer stole the lock.
-	 */
-	if (wake_type != RWSEM_WAKE_READ_OWNED) {
-		adjustment = RWSEM_READER_BIAS;
-		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
-		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
-			atomic_long_sub(adjustment, &sem->count);
-			return;
-		}
-		/*
-		 * Set it to reader-owned to give spinners an early
-		 * indication that readers now have the lock.
-		 */
-		__rwsem_set_reader_owned(sem, waiter->task);
-	}
-
-	/*
-	 * Grant an infinite number of read locks to the readers at the front
-	 * of the queue. We know that woken will be at least 1 as we accounted
-	 * for above. Note we increment the 'active part' of the count by the
-	 * number of readers before waking any processes up.
-	 *
-	 * We have to do wakeup in 2 passes to prevent the possibility that
-	 * the reader count may be decremented before it is incremented. It
-	 * is because the to-be-woken waiter may not have slept yet. So it
-	 * may see waiter->task got cleared, finish its critical section and
-	 * do an unlock before the reader count increment.
-	 *
-	 * 1) Collect the read-waiters in a separate list, count them and
-	 *    fully increment the reader count in rwsem.
-	 * 2) For each waiters in the new list, clear waiter->task and
-	 *    put them into wake_q to be woken up later.
-	 */
-	list_for_each_entry(waiter, &sem->wait_list, list) {
-		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
-			break;
-
-		woken++;
-	}
-	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
-
-	adjustment = woken * RWSEM_READER_BIAS - adjustment;
-	lockevent_cond_inc(rwsem_wake_reader, woken);
-	if (list_empty(&sem->wait_list)) {
-		/* hit end of list above */
-		adjustment -= RWSEM_FLAG_WAITERS;
-	}
-
-	if (adjustment)
-		atomic_long_add(adjustment, &sem->count);
-
-	/* 2nd pass */
-	list_for_each_entry_safe(waiter, tmp, &wlist, list) {
-		struct task_struct *tsk;
-
-		tsk = waiter->task;
-		get_task_struct(tsk);
-
-		/*
-		 * Ensure calling get_task_struct() before setting the reader
-		 * waiter to nil such that rwsem_down_read_failed() cannot
-		 * race with do_exit() by always holding a reference count
-		 * to the task to wakeup.
-		 */
-		smp_store_release(&waiter->task, NULL);
-		/*
-		 * Ensure issuing the wakeup (either by us or someone else)
-		 * after setting the reader waiter to nil.
-		 */
-		wake_q_add_safe(wake_q, tsk);
-	}
-}
-
-/*
- * This function must be called with the sem->wait_lock held to prevent
- * race conditions between checking the rwsem wait list and setting the
- * sem->count accordingly.
- */
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
-{
-	long new;
-
-	if (count & RWSEM_LOCK_MASK)
-		return false;
-
-	new = count + RWSEM_WRITER_LOCKED -
-	     (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
-
-	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
-		rwsem_set_owner(sem);
-		return true;
-	}
-
-	return false;
-}
-
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-/*
- * Try to acquire write lock before the writer has been put on wait queue.
- */
-static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
-{
-	long count = atomic_long_read(&sem->count);
-
-	while (!(count & RWSEM_LOCK_MASK)) {
-		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
-					count + RWSEM_WRITER_LOCKED)) {
-			rwsem_set_owner(sem);
-			lockevent_inc(rwsem_opt_wlock);
-			return true;
-		}
-	}
-	return false;
-}
-
-static inline bool owner_on_cpu(struct task_struct *owner)
-{
-	/*
-	 * As lock holder preemption issue, we both skip spinning if
-	 * task is not on cpu or its cpu is preempted
-	 */
-	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
-}
-
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
-{
-	struct task_struct *owner;
-	bool ret = true;
-
-	BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
-
-	if (need_resched())
-		return false;
-
-	rcu_read_lock();
-	owner = READ_ONCE(sem->owner);
-	if (owner) {
-		ret = is_rwsem_owner_spinnable(owner) &&
-		      owner_on_cpu(owner);
-	}
-	rcu_read_unlock();
-	return ret;
-}
-
-/*
- * Return true only if we can still spin on the owner field of the rwsem.
- */
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
-{
-	struct task_struct *owner = READ_ONCE(sem->owner);
-
-	if (!is_rwsem_owner_spinnable(owner))
-		return false;
-
-	rcu_read_lock();
-	while (owner && (READ_ONCE(sem->owner) == owner)) {
-		/*
-		 * Ensure we emit the owner->on_cpu, dereference _after_
-		 * checking sem->owner still matches owner, if that fails,
-		 * owner might point to free()d memory, if it still matches,
-		 * the rcu_read_lock() ensures the memory stays valid.
-		 */
-		barrier();
-
-		/*
-		 * abort spinning when need_resched or owner is not running or
-		 * owner's cpu is preempted.
-		 */
-		if (need_resched() || !owner_on_cpu(owner)) {
-			rcu_read_unlock();
-			return false;
-		}
-
-		cpu_relax();
-	}
-	rcu_read_unlock();
-
-	/*
-	 * If there is a new owner or the owner is not set, we continue
-	 * spinning.
-	 */
-	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
-}
-
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
-{
-	bool taken = false;
-
-	preempt_disable();
-
-	/* sem->wait_lock should not be held when doing optimistic spinning */
-	if (!rwsem_can_spin_on_owner(sem))
-		goto done;
-
-	if (!osq_lock(&sem->osq))
-		goto done;
-
-	/*
-	 * Optimistically spin on the owner field and attempt to acquire the
-	 * lock whenever the owner changes. Spinning will be stopped when:
-	 *  1) the owning writer isn't running; or
-	 *  2) readers own the lock as we can't determine if they are
-	 *     actively running or not.
-	 */
-	while (rwsem_spin_on_owner(sem)) {
-		/*
-		 * Try to acquire the lock
-		 */
-		if (rwsem_try_write_lock_unqueued(sem)) {
-			taken = true;
-			break;
-		}
-
-		/*
-		 * When there's no owner, we might have preempted between the
-		 * owner acquiring the lock and setting the owner field. If
-		 * we're an RT task that will live-lock because we won't let
-		 * the owner complete.
-		 */
-		if (!sem->owner && (need_resched() || rt_task(current)))
-			break;
-
-		/*
-		 * The cpu_relax() call is a compiler barrier which forces
-		 * everything in this loop to be re-loaded. We don't need
-		 * memory barriers as we'll eventually observe the right
-		 * values at the cost of a few extra spins.
-		 */
-		cpu_relax();
-	}
-	osq_unlock(&sem->osq);
-done:
-	preempt_enable();
-	lockevent_cond_inc(rwsem_opt_fail, !taken);
-	return taken;
-}
-#else
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
-{
-	return false;
-}
-#endif
-
-/*
- * Wait for the read lock to be granted
- */
-static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
-{
-	long count, adjustment = -RWSEM_READER_BIAS;
-	struct rwsem_waiter waiter;
-	DEFINE_WAKE_Q(wake_q);
-
-	waiter.task = current;
-	waiter.type = RWSEM_WAITING_FOR_READ;
-
-	raw_spin_lock_irq(&sem->wait_lock);
-	if (list_empty(&sem->wait_list)) {
-		/*
-		 * In case the wait queue is empty and the lock isn't owned
-		 * by a writer, this reader can exit the slowpath and return
-		 * immediately as its RWSEM_READER_BIAS has already been
-		 * set in the count.
-		 */
-		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
-			raw_spin_unlock_irq(&sem->wait_lock);
-			rwsem_set_reader_owned(sem);
-			lockevent_inc(rwsem_rlock_fast);
-			return sem;
-		}
-		adjustment += RWSEM_FLAG_WAITERS;
-	}
-	list_add_tail(&waiter.list, &sem->wait_list);
-
-	/* we're now waiting on the lock, but no longer actively locking */
-	count = atomic_long_add_return(adjustment, &sem->count);
-
-	/*
-	 * If there are no active locks, wake the front queued process(es).
-	 *
-	 * If there are no writers and we are first in the queue,
-	 * wake our own waiter to join the existing active readers !
-	 */
-	if (!(count & RWSEM_LOCK_MASK) ||
-	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-
-	raw_spin_unlock_irq(&sem->wait_lock);
-	wake_up_q(&wake_q);
-
-	/* wait to be given the lock */
-	while (true) {
-		set_current_state(state);
-		if (!waiter.task)
-			break;
-		if (signal_pending_state(state, current)) {
-			raw_spin_lock_irq(&sem->wait_lock);
-			if (waiter.task)
-				goto out_nolock;
-			raw_spin_unlock_irq(&sem->wait_lock);
-			break;
-		}
-		schedule();
-		lockevent_inc(rwsem_sleep_reader);
-	}
-
-	__set_current_state(TASK_RUNNING);
-	lockevent_inc(rwsem_rlock);
-	return sem;
-out_nolock:
-	list_del(&waiter.list);
-	if (list_empty(&sem->wait_list))
-		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
-	raw_spin_unlock_irq(&sem->wait_lock);
-	__set_current_state(TASK_RUNNING);
-	lockevent_inc(rwsem_rlock_fail);
-	return ERR_PTR(-EINTR);
-}
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed(struct rw_semaphore *sem)
-{
-	return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed_killable(struct rw_semaphore *sem)
-{
-	return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed_killable);
-
-/*
- * Wait until we successfully acquire the write lock
- */
-static inline struct rw_semaphore *
-__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
-{
-	long count;
-	bool waiting = true; /* any queued threads before us */
-	struct rwsem_waiter waiter;
-	struct rw_semaphore *ret = sem;
-	DEFINE_WAKE_Q(wake_q);
-
-	/* do optimistic spinning and steal lock if possible */
-	if (rwsem_optimistic_spin(sem))
-		return sem;
-
-	/*
-	 * Optimistic spinning failed, proceed to the slowpath
-	 * and block until we can acquire the sem.
-	 */
-	waiter.task = current;
-	waiter.type = RWSEM_WAITING_FOR_WRITE;
-
-	raw_spin_lock_irq(&sem->wait_lock);
-
-	/* account for this before adding a new element to the list */
-	if (list_empty(&sem->wait_list))
-		waiting = false;
-
-	list_add_tail(&waiter.list, &sem->wait_list);
-
-	/* we're now waiting on the lock */
-	if (waiting) {
-		count = atomic_long_read(&sem->count);
-
-		/*
-		 * If there were already threads queued before us and there are
-		 * no active writers and some readers, the lock must be read
-		 * owned; so we try to wake any read locks that were queued ahead
-		 * of us.
-		 */
-		if (!(count & RWSEM_WRITER_MASK) &&
-		     (count & RWSEM_READER_MASK)) {
-			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
-			/*
-			 * The wakeup is normally called _after_ the wait_lock
-			 * is released, but given that we are proactively waking
-			 * readers we can deal with the wake_q overhead as it is
-			 * similar to releasing and taking the wait_lock again
-			 * for attempting rwsem_try_write_lock().
-			 */
-			wake_up_q(&wake_q);
-
-			/*
-			 * Reinitialize wake_q after use.
-			 */
-			wake_q_init(&wake_q);
-		}
-
-	} else {
-		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
-	}
-
-	/* wait until we successfully acquire the lock */
-	set_current_state(state);
-	while (true) {
-		if (rwsem_try_write_lock(count, sem))
-			break;
-		raw_spin_unlock_irq(&sem->wait_lock);
-
-		/* Block until there are no active lockers. */
-		do {
-			if (signal_pending_state(state, current))
-				goto out_nolock;
-
-			schedule();
-			lockevent_inc(rwsem_sleep_writer);
-			set_current_state(state);
-			count = atomic_long_read(&sem->count);
-		} while (count & RWSEM_LOCK_MASK);
-
-		raw_spin_lock_irq(&sem->wait_lock);
-	}
-	__set_current_state(TASK_RUNNING);
-	list_del(&waiter.list);
-	raw_spin_unlock_irq(&sem->wait_lock);
-	lockevent_inc(rwsem_wlock);
-
-	return ret;
-
-out_nolock:
-	__set_current_state(TASK_RUNNING);
-	raw_spin_lock_irq(&sem->wait_lock);
-	list_del(&waiter.list);
-	if (list_empty(&sem->wait_list))
-		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
-	else
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-	raw_spin_unlock_irq(&sem->wait_lock);
-	wake_up_q(&wake_q);
-	lockevent_inc(rwsem_wlock_fail);
-
-	return ERR_PTR(-EINTR);
-}
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed(struct rw_semaphore *sem)
-{
-	return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed_killable(struct rw_semaphore *sem)
-{
-	return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed_killable);
-
-/*
- * handle waking up a waiter on the semaphore
- * - up_read/up_write has decremented the active part of count if we come here
- */
-__visible
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
-{
-	unsigned long flags;
-	DEFINE_WAKE_Q(wake_q);
-
-	raw_spin_lock_irqsave(&sem->wait_lock, flags);
-
-	if (!list_empty(&sem->wait_list))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-
-	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
-	wake_up_q(&wake_q);
-
-	return sem;
-}
-EXPORT_SYMBOL(rwsem_wake);
-
-/*
- * downgrade a write lock into a read lock
- * - caller incremented waiting part of count and discovered it still negative
- * - just wake up any readers at the front of the queue
- */
-__visible
-struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
-{
-	unsigned long flags;
-	DEFINE_WAKE_Q(wake_q);
-
-	raw_spin_lock_irqsave(&sem->wait_lock, flags);
-
-	if (!list_empty(&sem->wait_list))
-		__rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
-
-	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
-	wake_up_q(&wake_q);
-
-	return sem;
-}
-EXPORT_SYMBOL(rwsem_downgrade_wake);
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ccbf18f560ff..8317bcdf063b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -3,17 +3,901 @@
  *
  * Written by David Howells (dhowells@redhat.com).
  * Derived from asm-i386/semaphore.h
+ *
+ * Writer lock-stealing by Alex Shi <alex.shi@intel.com>
+ * and Michel Lespinasse <walken@google.com>
+ *
+ * Optimistic spinning by Tim Chen <tim.c.chen@intel.com>
+ * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes.
+ *
+ * Rwsem count bit fields re-definition and rwsem rearchitecture
+ * by Waiman Long <longman@redhat.com>.
  */
 
 #include <linux/types.h>
 #include <linux/kernel.h>
 #include <linux/sched.h>
+#include <linux/sched/rt.h>
+#include <linux/sched/task.h>
 #include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
+#include <linux/sched/signal.h>
 #include <linux/export.h>
 #include <linux/rwsem.h>
 #include <linux/atomic.h>
 
 #include "rwsem.h"
+#include "lock_events.h"
+
+/*
+ * The least significant 2 bits of the owner value has the following
+ * meanings when set.
+ *  - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
+ *  - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
+ *    i.e. the owner(s) cannot be readily determined. It can be reader
+ *    owned or the owning writer is indeterminate.
+ *
+ * When a writer acquires a rwsem, it puts its task_struct pointer
+ * into the owner field. It is cleared after an unlock.
+ *
+ * When a reader acquires a rwsem, it will also put its task_struct
+ * pointer into the owner field with both the RWSEM_READER_OWNED and
+ * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
+ * largely be left untouched. So for a free or reader-owned rwsem,
+ * the owner value may contain information about the last reader that
+ * acquires the rwsem. The anonymous bit is set because that particular
+ * reader may or may not still own the lock.
+ *
+ * That information may be helpful in debugging cases where the system
+ * seems to hang on a reader owned rwsem especially if only one reader
+ * is involved. Ideally we would like to track all the readers that own
+ * a rwsem, but the overhead is simply too big.
+ */
+#define RWSEM_READER_OWNED	(1UL << 0)
+#define RWSEM_ANONYMOUSLY_OWNED	(1UL << 1)
+
+#ifdef CONFIG_DEBUG_RWSEMS
+# define DEBUG_RWSEMS_WARN_ON(c, sem)	do {			\
+	if (!debug_locks_silent &&				\
+	    WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
+		#c, atomic_long_read(&(sem)->count),		\
+		(long)((sem)->owner), (long)current,		\
+		list_empty(&(sem)->wait_list) ? "" : "not "))	\
+			debug_locks_off();			\
+	} while (0)
+#else
+# define DEBUG_RWSEMS_WARN_ON(c, sem)
+#endif
+
+/*
+ * The definition of the atomic counter in the semaphore:
+ *
+ * Bit  0   - writer locked bit
+ * Bit  1   - waiters present bit
+ * Bits 2-7 - reserved
+ * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ *
+ * atomic_long_fetch_add() is used to obtain reader lock, whereas
+ * atomic_long_cmpxchg() will be used to obtain writer lock.
+ */
+#define RWSEM_WRITER_LOCKED	(1UL << 0)
+#define RWSEM_FLAG_WAITERS	(1UL << 1)
+#define RWSEM_READER_SHIFT	8
+#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
+#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
+#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
+#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
+#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
+
+/*
+ * All writes to owner are protected by WRITE_ONCE() to make sure that
+ * store tearing can't happen as optimistic spinners may read and use
+ * the owner value concurrently without lock. Read from owner, however,
+ * may not need READ_ONCE() as long as the pointer value is only used
+ * for comparison and isn't being dereferenced.
+ */
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+	WRITE_ONCE(sem->owner, current);
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+	WRITE_ONCE(sem->owner, NULL);
+}
+
+/*
+ * The task_struct pointer of the last owning reader will be left in
+ * the owner field.
+ *
+ * Note that the owner value just indicates the task has owned the rwsem
+ * previously, it may not be the real owner or one of the real owners
+ * anymore when that field is examined, so take it with a grain of salt.
+ */
+static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
+					    struct task_struct *owner)
+{
+	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
+						 | RWSEM_ANONYMOUSLY_OWNED;
+
+	WRITE_ONCE(sem->owner, (struct task_struct *)val);
+}
+
+static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
+{
+	__rwsem_set_reader_owned(sem, current);
+}
+
+/*
+ * Return true if a rwsem waiter can spin on the rwsem's owner
+ * and steal the lock, i.e. the lock is not anonymously owned.
+ * N.B. !owner is considered spinnable.
+ */
+static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
+{
+	return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
+}
+
+/*
+ * Return true if rwsem is owned by an anonymous writer or readers.
+ */
+static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
+{
+	return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
+}
+
+#ifdef CONFIG_DEBUG_RWSEMS
+/*
+ * With CONFIG_DEBUG_RWSEMS configured, it will make sure that if there
+ * is a task pointer in owner of a reader-owned rwsem, it will be the
+ * real owner or one of the real owners. The only exception is when the
+ * unlock is done by up_read_non_owner().
+ */
+static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
+{
+	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
+						   | RWSEM_ANONYMOUSLY_OWNED;
+	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
+		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
+				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
+}
+#else
+static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
+{
+}
+#endif
+
+/*
+ * Guide to the rw_semaphore's count field.
+ *
+ * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
+ * by a writer.
+ *
+ * The lock is owned by readers when
+ * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (2) some of the reader bits are set in count, and
+ * (3) the owner field has RWSEM_READER_OWNED bit set.
+ *
+ * Having some reader bits set is not enough to guarantee a reader-owned
+ * lock as the readers may be in the process of backing out from the count
+ * and a writer has just released the lock. So another writer may steal
+ * the lock immediately after that.
+ */
+
+/*
+ * Initialize an rwsem:
+ */
+void __init_rwsem(struct rw_semaphore *sem, const char *name,
+		  struct lock_class_key *key)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	/*
+	 * Make sure we are not reinitializing a held semaphore:
+	 */
+	debug_check_no_locks_freed((void *)sem, sizeof(*sem));
+	lockdep_init_map(&sem->dep_map, name, key, 0);
+#endif
+	atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
+	raw_spin_lock_init(&sem->wait_lock);
+	INIT_LIST_HEAD(&sem->wait_list);
+	sem->owner = NULL;
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+	osq_lock_init(&sem->osq);
+#endif
+}
+
+EXPORT_SYMBOL(__init_rwsem);
+
+enum rwsem_waiter_type {
+	RWSEM_WAITING_FOR_WRITE,
+	RWSEM_WAITING_FOR_READ
+};
+
+struct rwsem_waiter {
+	struct list_head list;
+	struct task_struct *task;
+	enum rwsem_waiter_type type;
+};
+
+enum rwsem_wake_type {
+	RWSEM_WAKE_ANY,		/* Wake whatever's at head of wait list */
+	RWSEM_WAKE_READERS,	/* Wake readers only */
+	RWSEM_WAKE_READ_OWNED	/* Waker thread holds the read lock */
+};
+
+/*
+ * handle the lock release when processes blocked on it that can now run
+ * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
+ *   have been set.
+ * - there must be someone on the queue
+ * - the wait_lock must be held by the caller
+ * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
+ *   to actually wakeup the blocked task(s) and drop the reference count,
+ *   preferably when the wait_lock is released
+ * - woken process blocks are discarded from the list after having task zeroed
+ * - writers are only marked woken if downgrading is false
+ */
+static void __rwsem_mark_wake(struct rw_semaphore *sem,
+			      enum rwsem_wake_type wake_type,
+			      struct wake_q_head *wake_q)
+{
+	struct rwsem_waiter *waiter, *tmp;
+	long oldcount, woken = 0, adjustment = 0;
+	struct list_head wlist;
+
+	/*
+	 * Take a peek at the queue head waiter such that we can determine
+	 * the wakeup(s) to perform.
+	 */
+	waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
+
+	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
+		if (wake_type == RWSEM_WAKE_ANY) {
+			/*
+			 * Mark writer at the front of the queue for wakeup.
+			 * Until the task is actually awoken later by
+			 * the caller, other writers are able to steal it.
+			 * Readers, on the other hand, will block as they
+			 * will notice the queued writer.
+			 */
+			wake_q_add(wake_q, waiter->task);
+			lockevent_inc(rwsem_wake_writer);
+		}
+
+		return;
+	}
+
+	/*
+	 * Writers might steal the lock before we grant it to the next reader.
+	 * We prefer to do the first reader grant before counting readers
+	 * so we can bail out early if a writer stole the lock.
+	 */
+	if (wake_type != RWSEM_WAKE_READ_OWNED) {
+		adjustment = RWSEM_READER_BIAS;
+		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
+		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+			atomic_long_sub(adjustment, &sem->count);
+			return;
+		}
+		/*
+		 * Set it to reader-owned to give spinners an early
+		 * indication that readers now have the lock.
+		 */
+		__rwsem_set_reader_owned(sem, waiter->task);
+	}
+
+	/*
+	 * Grant an infinite number of read locks to the readers at the front
+	 * of the queue. We know that woken will be at least 1 as we accounted
+	 * for above. Note we increment the 'active part' of the count by the
+	 * number of readers before waking any processes up.
+	 *
+	 * We have to do wakeup in 2 passes to prevent the possibility that
+	 * the reader count may be decremented before it is incremented. It
+	 * is because the to-be-woken waiter may not have slept yet. So it
+	 * may see waiter->task got cleared, finish its critical section and
+	 * do an unlock before the reader count increment.
+	 *
+	 * 1) Collect the read-waiters in a separate list, count them and
+	 *    fully increment the reader count in rwsem.
+	 * 2) For each waiters in the new list, clear waiter->task and
+	 *    put them into wake_q to be woken up later.
+	 */
+	list_for_each_entry(waiter, &sem->wait_list, list) {
+		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
+			break;
+
+		woken++;
+	}
+	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
+
+	adjustment = woken * RWSEM_READER_BIAS - adjustment;
+	lockevent_cond_inc(rwsem_wake_reader, woken);
+	if (list_empty(&sem->wait_list)) {
+		/* hit end of list above */
+		adjustment -= RWSEM_FLAG_WAITERS;
+	}
+
+	if (adjustment)
+		atomic_long_add(adjustment, &sem->count);
+
+	/* 2nd pass */
+	list_for_each_entry_safe(waiter, tmp, &wlist, list) {
+		struct task_struct *tsk;
+
+		tsk = waiter->task;
+		get_task_struct(tsk);
+
+		/*
+		 * Ensure calling get_task_struct() before setting the reader
+		 * waiter to nil such that rwsem_down_read_failed() cannot
+		 * race with do_exit() by always holding a reference count
+		 * to the task to wakeup.
+		 */
+		smp_store_release(&waiter->task, NULL);
+		/*
+		 * Ensure issuing the wakeup (either by us or someone else)
+		 * after setting the reader waiter to nil.
+		 */
+		wake_q_add_safe(wake_q, tsk);
+	}
+}
+
+/*
+ * This function must be called with the sem->wait_lock held to prevent
+ * race conditions between checking the rwsem wait list and setting the
+ * sem->count accordingly.
+ */
+static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+{
+	long new;
+
+	if (count & RWSEM_LOCK_MASK)
+		return false;
+
+	new = count + RWSEM_WRITER_LOCKED -
+	     (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
+
+	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
+		rwsem_set_owner(sem);
+		return true;
+	}
+
+	return false;
+}
+
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * Try to acquire write lock before the writer has been put on wait queue.
+ */
+static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
+{
+	long count = atomic_long_read(&sem->count);
+
+	while (!(count & RWSEM_LOCK_MASK)) {
+		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
+					count + RWSEM_WRITER_LOCKED)) {
+			rwsem_set_owner(sem);
+			lockevent_inc(rwsem_opt_wlock);
+			return true;
+		}
+	}
+	return false;
+}
+
+static inline bool owner_on_cpu(struct task_struct *owner)
+{
+	/*
+	 * As lock holder preemption issue, we both skip spinning if
+	 * task is not on cpu or its cpu is preempted
+	 */
+	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
+}
+
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+	struct task_struct *owner;
+	bool ret = true;
+
+	BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
+
+	if (need_resched())
+		return false;
+
+	rcu_read_lock();
+	owner = READ_ONCE(sem->owner);
+	if (owner) {
+		ret = is_rwsem_owner_spinnable(owner) &&
+		      owner_on_cpu(owner);
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
+/*
+ * Return true only if we can still spin on the owner field of the rwsem.
+ */
+static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
+{
+	struct task_struct *owner = READ_ONCE(sem->owner);
+
+	if (!is_rwsem_owner_spinnable(owner))
+		return false;
+
+	rcu_read_lock();
+	while (owner && (READ_ONCE(sem->owner) == owner)) {
+		/*
+		 * Ensure we emit the owner->on_cpu, dereference _after_
+		 * checking sem->owner still matches owner, if that fails,
+		 * owner might point to free()d memory, if it still matches,
+		 * the rcu_read_lock() ensures the memory stays valid.
+		 */
+		barrier();
+
+		/*
+		 * abort spinning when need_resched or owner is not running or
+		 * owner's cpu is preempted.
+		 */
+		if (need_resched() || !owner_on_cpu(owner)) {
+			rcu_read_unlock();
+			return false;
+		}
+
+		cpu_relax();
+	}
+	rcu_read_unlock();
+
+	/*
+	 * If there is a new owner or the owner is not set, we continue
+	 * spinning.
+	 */
+	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+}
+
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+{
+	bool taken = false;
+
+	preempt_disable();
+
+	/* sem->wait_lock should not be held when doing optimistic spinning */
+	if (!rwsem_can_spin_on_owner(sem))
+		goto done;
+
+	if (!osq_lock(&sem->osq))
+		goto done;
+
+	/*
+	 * Optimistically spin on the owner field and attempt to acquire the
+	 * lock whenever the owner changes. Spinning will be stopped when:
+	 *  1) the owning writer isn't running; or
+	 *  2) readers own the lock as we can't determine if they are
+	 *     actively running or not.
+	 */
+	while (rwsem_spin_on_owner(sem)) {
+		/*
+		 * Try to acquire the lock
+		 */
+		if (rwsem_try_write_lock_unqueued(sem)) {
+			taken = true;
+			break;
+		}
+
+		/*
+		 * When there's no owner, we might have preempted between the
+		 * owner acquiring the lock and setting the owner field. If
+		 * we're an RT task that will live-lock because we won't let
+		 * the owner complete.
+		 */
+		if (!sem->owner && (need_resched() || rt_task(current)))
+			break;
+
+		/*
+		 * The cpu_relax() call is a compiler barrier which forces
+		 * everything in this loop to be re-loaded. We don't need
+		 * memory barriers as we'll eventually observe the right
+		 * values at the cost of a few extra spins.
+		 */
+		cpu_relax();
+	}
+	osq_unlock(&sem->osq);
+done:
+	preempt_enable();
+	lockevent_cond_inc(rwsem_opt_fail, !taken);
+	return taken;
+}
+#else
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+{
+	return false;
+}
+#endif
+
+/*
+ * Wait for the read lock to be granted
+ */
+static inline struct rw_semaphore __sched *
+__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+{
+	long count, adjustment = -RWSEM_READER_BIAS;
+	struct rwsem_waiter waiter;
+	DEFINE_WAKE_Q(wake_q);
+
+	waiter.task = current;
+	waiter.type = RWSEM_WAITING_FOR_READ;
+
+	raw_spin_lock_irq(&sem->wait_lock);
+	if (list_empty(&sem->wait_list)) {
+		/*
+		 * In case the wait queue is empty and the lock isn't owned
+		 * by a writer, this reader can exit the slowpath and return
+		 * immediately as its RWSEM_READER_BIAS has already been
+		 * set in the count.
+		 */
+		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
+			raw_spin_unlock_irq(&sem->wait_lock);
+			rwsem_set_reader_owned(sem);
+			lockevent_inc(rwsem_rlock_fast);
+			return sem;
+		}
+		adjustment += RWSEM_FLAG_WAITERS;
+	}
+	list_add_tail(&waiter.list, &sem->wait_list);
+
+	/* we're now waiting on the lock, but no longer actively locking */
+	count = atomic_long_add_return(adjustment, &sem->count);
+
+	/*
+	 * If there are no active locks, wake the front queued process(es).
+	 *
+	 * If there are no writers and we are first in the queue,
+	 * wake our own waiter to join the existing active readers !
+	 */
+	if (!(count & RWSEM_LOCK_MASK) ||
+	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
+		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+
+	raw_spin_unlock_irq(&sem->wait_lock);
+	wake_up_q(&wake_q);
+
+	/* wait to be given the lock */
+	while (true) {
+		set_current_state(state);
+		if (!waiter.task)
+			break;
+		if (signal_pending_state(state, current)) {
+			raw_spin_lock_irq(&sem->wait_lock);
+			if (waiter.task)
+				goto out_nolock;
+			raw_spin_unlock_irq(&sem->wait_lock);
+			break;
+		}
+		schedule();
+		lockevent_inc(rwsem_sleep_reader);
+	}
+
+	__set_current_state(TASK_RUNNING);
+	lockevent_inc(rwsem_rlock);
+	return sem;
+out_nolock:
+	list_del(&waiter.list);
+	if (list_empty(&sem->wait_list))
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+	raw_spin_unlock_irq(&sem->wait_lock);
+	__set_current_state(TASK_RUNNING);
+	lockevent_inc(rwsem_rlock_fail);
+	return ERR_PTR(-EINTR);
+}
+
+__visible struct rw_semaphore * __sched
+rwsem_down_read_failed(struct rw_semaphore *sem)
+{
+	return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(rwsem_down_read_failed);
+
+__visible struct rw_semaphore * __sched
+rwsem_down_read_failed_killable(struct rw_semaphore *sem)
+{
+	return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
+}
+EXPORT_SYMBOL(rwsem_down_read_failed_killable);
+
+/*
+ * Wait until we successfully acquire the write lock
+ */
+static inline struct rw_semaphore *
+__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
+{
+	long count;
+	bool waiting = true; /* any queued threads before us */
+	struct rwsem_waiter waiter;
+	struct rw_semaphore *ret = sem;
+	DEFINE_WAKE_Q(wake_q);
+
+	/* do optimistic spinning and steal lock if possible */
+	if (rwsem_optimistic_spin(sem))
+		return sem;
+
+	/*
+	 * Optimistic spinning failed, proceed to the slowpath
+	 * and block until we can acquire the sem.
+	 */
+	waiter.task = current;
+	waiter.type = RWSEM_WAITING_FOR_WRITE;
+
+	raw_spin_lock_irq(&sem->wait_lock);
+
+	/* account for this before adding a new element to the list */
+	if (list_empty(&sem->wait_list))
+		waiting = false;
+
+	list_add_tail(&waiter.list, &sem->wait_list);
+
+	/* we're now waiting on the lock */
+	if (waiting) {
+		count = atomic_long_read(&sem->count);
+
+		/*
+		 * If there were already threads queued before us and there are
+		 * no active writers and some readers, the lock must be read
+		 * owned; so we try to wake any read locks that were queued ahead
+		 * of us.
+		 */
+		if (!(count & RWSEM_WRITER_MASK) &&
+		     (count & RWSEM_READER_MASK)) {
+			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
+			/*
+			 * The wakeup is normally called _after_ the wait_lock
+			 * is released, but given that we are proactively waking
+			 * readers we can deal with the wake_q overhead as it is
+			 * similar to releasing and taking the wait_lock again
+			 * for attempting rwsem_try_write_lock().
+			 */
+			wake_up_q(&wake_q);
+
+			/*
+			 * Reinitialize wake_q after use.
+			 */
+			wake_q_init(&wake_q);
+		}
+
+	} else {
+		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+	}
+
+	/* wait until we successfully acquire the lock */
+	set_current_state(state);
+	while (true) {
+		if (rwsem_try_write_lock(count, sem))
+			break;
+		raw_spin_unlock_irq(&sem->wait_lock);
+
+		/* Block until there are no active lockers. */
+		do {
+			if (signal_pending_state(state, current))
+				goto out_nolock;
+
+			schedule();
+			lockevent_inc(rwsem_sleep_writer);
+			set_current_state(state);
+			count = atomic_long_read(&sem->count);
+		} while (count & RWSEM_LOCK_MASK);
+
+		raw_spin_lock_irq(&sem->wait_lock);
+	}
+	__set_current_state(TASK_RUNNING);
+	list_del(&waiter.list);
+	raw_spin_unlock_irq(&sem->wait_lock);
+	lockevent_inc(rwsem_wlock);
+
+	return ret;
+
+out_nolock:
+	__set_current_state(TASK_RUNNING);
+	raw_spin_lock_irq(&sem->wait_lock);
+	list_del(&waiter.list);
+	if (list_empty(&sem->wait_list))
+		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+	else
+		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+	raw_spin_unlock_irq(&sem->wait_lock);
+	wake_up_q(&wake_q);
+	lockevent_inc(rwsem_wlock_fail);
+
+	return ERR_PTR(-EINTR);
+}
+
+__visible struct rw_semaphore * __sched
+rwsem_down_write_failed(struct rw_semaphore *sem)
+{
+	return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(rwsem_down_write_failed);
+
+__visible struct rw_semaphore * __sched
+rwsem_down_write_failed_killable(struct rw_semaphore *sem)
+{
+	return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
+}
+EXPORT_SYMBOL(rwsem_down_write_failed_killable);
+
+/*
+ * handle waking up a waiter on the semaphore
+ * - up_read/up_write has decremented the active part of count if we come here
+ */
+__visible
+struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+{
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	raw_spin_lock_irqsave(&sem->wait_lock, flags);
+
+	if (!list_empty(&sem->wait_list))
+		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+
+	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+	wake_up_q(&wake_q);
+
+	return sem;
+}
+EXPORT_SYMBOL(rwsem_wake);
+
+/*
+ * downgrade a write lock into a read lock
+ * - caller incremented waiting part of count and discovered it still negative
+ * - just wake up any readers at the front of the queue
+ */
+__visible
+struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
+{
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	raw_spin_lock_irqsave(&sem->wait_lock, flags);
+
+	if (!list_empty(&sem->wait_list))
+		__rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
+
+	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+	wake_up_q(&wake_q);
+
+	return sem;
+}
+EXPORT_SYMBOL(rwsem_downgrade_wake);
+
+/*
+ * lock for reading
+ */
+inline void __down_read(struct rw_semaphore *sem)
+{
+	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+			&sem->count) & RWSEM_READ_FAILED_MASK)) {
+		rwsem_down_read_failed(sem);
+		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
+					RWSEM_READER_OWNED), sem);
+	} else {
+		rwsem_set_reader_owned(sem);
+	}
+}
+
+static inline int __down_read_killable(struct rw_semaphore *sem)
+{
+	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+			&sem->count) & RWSEM_READ_FAILED_MASK)) {
+		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
+			return -EINTR;
+		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
+					RWSEM_READER_OWNED), sem);
+	} else {
+		rwsem_set_reader_owned(sem);
+	}
+	return 0;
+}
+
+static inline int __down_read_trylock(struct rw_semaphore *sem)
+{
+	/*
+	 * Optimize for the case when the rwsem is not locked at all.
+	 */
+	long tmp = RWSEM_UNLOCKED_VALUE;
+
+	lockevent_inc(rwsem_rtrylock);
+	do {
+		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+					tmp + RWSEM_READER_BIAS)) {
+			rwsem_set_reader_owned(sem);
+			return 1;
+		}
+	} while (!(tmp & RWSEM_READ_FAILED_MASK));
+	return 0;
+}
+
+/*
+ * lock for writing
+ */
+static inline void __down_write(struct rw_semaphore *sem)
+{
+	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+						 RWSEM_WRITER_LOCKED)))
+		rwsem_down_write_failed(sem);
+	rwsem_set_owner(sem);
+}
+
+static inline int __down_write_killable(struct rw_semaphore *sem)
+{
+	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+						 RWSEM_WRITER_LOCKED)))
+		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
+			return -EINTR;
+	rwsem_set_owner(sem);
+	return 0;
+}
+
+static inline int __down_write_trylock(struct rw_semaphore *sem)
+{
+	long tmp;
+
+	lockevent_inc(rwsem_wtrylock);
+	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
+					  RWSEM_WRITER_LOCKED);
+	if (tmp == RWSEM_UNLOCKED_VALUE) {
+		rwsem_set_owner(sem);
+		return true;
+	}
+	return false;
+}
+
+/*
+ * unlock after reading
+ */
+inline void __up_read(struct rw_semaphore *sem)
+{
+	long tmp;
+
+	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
+				sem);
+	rwsem_clear_reader_owned(sem);
+	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
+			== RWSEM_FLAG_WAITERS))
+		rwsem_wake(sem);
+}
+
+/*
+ * unlock after writing
+ */
+static inline void __up_write(struct rw_semaphore *sem)
+{
+	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+	rwsem_clear_owner(sem);
+	if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
+			&sem->count) & RWSEM_FLAG_WAITERS))
+		rwsem_wake(sem);
+}
+
+/*
+ * downgrade write lock to read lock
+ */
+static inline void __downgrade_write(struct rw_semaphore *sem)
+{
+	long tmp;
+
+	/*
+	 * When downgrading from exclusive to shared ownership,
+	 * anything inside the write-locked region cannot leak
+	 * into the read side. In contrast, anything in the
+	 * read-locked region is ok to be re-ordered into the
+	 * write side. As such, rely on RELEASE semantics.
+	 */
+	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+	tmp = atomic_long_fetch_add_release(
+		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
+	rwsem_set_reader_owned(sem);
+	if (tmp & RWSEM_FLAG_WAITERS)
+		rwsem_downgrade_wake(sem);
+}
 
 /*
  * lock for reading
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 499a9b2bda82..2534ce49f648 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -1,279 +1,10 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-/*
- * The least significant 2 bits of the owner value has the following
- * meanings when set.
- *  - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
- *  - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
- *    i.e. the owner(s) cannot be readily determined. It can be reader
- *    owned or the owning writer is indeterminate.
- *
- * When a writer acquires a rwsem, it puts its task_struct pointer
- * into the owner field. It is cleared after an unlock.
- *
- * When a reader acquires a rwsem, it will also puts its task_struct
- * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
- * largely be left untouched. So for a free or reader-owned rwsem,
- * the owner value may contain information about the last reader that
- * acquires the rwsem. The anonymous bit is set because that particular
- * reader may or may not still own the lock.
- *
- * That information may be helpful in debugging cases where the system
- * seems to hang on a reader owned rwsem especially if only one reader
- * is involved. Ideally we would like to track all the readers that own
- * a rwsem, but the overhead is simply too big.
- */
-#include "lock_events.h"
 
-#define RWSEM_READER_OWNED	(1UL << 0)
-#define RWSEM_ANONYMOUSLY_OWNED	(1UL << 1)
+#ifndef __INTERNAL_RWSEM_H
+#define __INTERNAL_RWSEM_H
+#include <linux/rwsem.h>
 
-#ifdef CONFIG_DEBUG_RWSEMS
-# define DEBUG_RWSEMS_WARN_ON(c, sem)	do {			\
-	if (!debug_locks_silent &&				\
-	    WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
-		#c, atomic_long_read(&(sem)->count),		\
-		(long)((sem)->owner), (long)current,		\
-		list_empty(&(sem)->wait_list) ? "" : "not "))	\
-			debug_locks_off();			\
-	} while (0)
-#else
-# define DEBUG_RWSEMS_WARN_ON(c, sem)
-#endif
+extern void __down_read(struct rw_semaphore *sem);
+extern void __up_read(struct rw_semaphore *sem);
 
-/*
- * The definition of the atomic counter in the semaphore:
- *
- * Bit  0   - writer locked bit
- * Bit  1   - waiters present bit
- * Bits 2-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
- *
- * atomic_long_fetch_add() is used to obtain reader lock, whereas
- * atomic_long_cmpxchg() will be used to obtain writer lock.
- */
-#define RWSEM_WRITER_LOCKED	(1UL << 0)
-#define RWSEM_FLAG_WAITERS	(1UL << 1)
-#define RWSEM_READER_SHIFT	8
-#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
-#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
-#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
-#define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
-#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
-
-/*
- * All writes to owner are protected by WRITE_ONCE() to make sure that
- * store tearing can't happen as optimistic spinners may read and use
- * the owner value concurrently without lock. Read from owner, however,
- * may not need READ_ONCE() as long as the pointer value is only used
- * for comparison and isn't being dereferenced.
- */
-static inline void rwsem_set_owner(struct rw_semaphore *sem)
-{
-	WRITE_ONCE(sem->owner, current);
-}
-
-static inline void rwsem_clear_owner(struct rw_semaphore *sem)
-{
-	WRITE_ONCE(sem->owner, NULL);
-}
-
-/*
- * The task_struct pointer of the last owning reader will be left in
- * the owner field.
- *
- * Note that the owner value just indicates the task has owned the rwsem
- * previously, it may not be the real owner or one of the real owners
- * anymore when that field is examined, so take it with a grain of salt.
- */
-static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
-					    struct task_struct *owner)
-{
-	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
-						 | RWSEM_ANONYMOUSLY_OWNED;
-
-	WRITE_ONCE(sem->owner, (struct task_struct *)val);
-}
-
-static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
-{
-	__rwsem_set_reader_owned(sem, current);
-}
-
-/*
- * Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock, i.e. the lock is not anonymously owned.
- * N.B. !owner is considered spinnable.
- */
-static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
-{
-	return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
-}
-
-/*
- * Return true if rwsem is owned by an anonymous writer or readers.
- */
-static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
-{
-	return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
-}
-
-#ifdef CONFIG_DEBUG_RWSEMS
-/*
- * With CONFIG_DEBUG_RWSEMS configured, it will make sure that if there
- * is a task pointer in owner of a reader-owned rwsem, it will be the
- * real owner or one of the real owners. The only exception is when the
- * unlock is done by up_read_non_owner().
- */
-static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
-{
-	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
-						   | RWSEM_ANONYMOUSLY_OWNED;
-	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
-		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
-				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
-}
-#else
-static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
-{
-}
-#endif
-
-extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_read_failed_killable(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed_killable(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
-
-/*
- * lock for reading
- */
-static inline void __down_read(struct rw_semaphore *sem)
-{
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		rwsem_down_read_failed(sem);
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
-	} else {
-		rwsem_set_reader_owned(sem);
-	}
-}
-
-static inline int __down_read_killable(struct rw_semaphore *sem)
-{
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
-			return -EINTR;
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
-	} else {
-		rwsem_set_reader_owned(sem);
-	}
-	return 0;
-}
-
-static inline int __down_read_trylock(struct rw_semaphore *sem)
-{
-	/*
-	 * Optimize for the case when the rwsem is not locked at all.
-	 */
-	long tmp = RWSEM_UNLOCKED_VALUE;
-
-	lockevent_inc(rwsem_rtrylock);
-	do {
-		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
-					tmp + RWSEM_READER_BIAS)) {
-			rwsem_set_reader_owned(sem);
-			return 1;
-		}
-	} while (!(tmp & RWSEM_READ_FAILED_MASK));
-	return 0;
-}
-
-/*
- * lock for writing
- */
-static inline void __down_write(struct rw_semaphore *sem)
-{
-	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
-						 RWSEM_WRITER_LOCKED)))
-		rwsem_down_write_failed(sem);
-	rwsem_set_owner(sem);
-}
-
-static inline int __down_write_killable(struct rw_semaphore *sem)
-{
-	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
-						 RWSEM_WRITER_LOCKED)))
-		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
-			return -EINTR;
-	rwsem_set_owner(sem);
-	return 0;
-}
-
-static inline int __down_write_trylock(struct rw_semaphore *sem)
-{
-	long tmp;
-
-	lockevent_inc(rwsem_wtrylock);
-	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
-					  RWSEM_WRITER_LOCKED);
-	if (tmp == RWSEM_UNLOCKED_VALUE) {
-		rwsem_set_owner(sem);
-		return true;
-	}
-	return false;
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
-	long tmp;
-
-	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
-				sem);
-	rwsem_clear_reader_owned(sem);
-	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
-	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
-			== RWSEM_FLAG_WAITERS))
-		rwsem_wake(sem);
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
-	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
-	rwsem_clear_owner(sem);
-	if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
-			&sem->count) & RWSEM_FLAG_WAITERS))
-		rwsem_wake(sem);
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void __downgrade_write(struct rw_semaphore *sem)
-{
-	long tmp;
-
-	/*
-	 * When downgrading from exclusive to shared ownership,
-	 * anything inside the write-locked region cannot leak
-	 * into the read side. In contrast, anything in the
-	 * read-locked region is ok to be re-ordered into the
-	 * write side. As such, rely on RELEASE semantics.
-	 */
-	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
-	tmp = atomic_long_fetch_add_release(
-		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
-	rwsem_set_reader_owned(sem);
-	if (tmp & RWSEM_FLAG_WAITERS)
-		rwsem_downgrade_wake(sem);
-}
+#endif /* __INTERNAL_RWSEM_H */
-- 
2.18.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH-tip v7 06/20] locking/rwsem: Code cleanup after files merging
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (4 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 05/20] locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 07/20] locking/rwsem: Make rwsem_spin_on_owner() return owner state Waiman Long
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

After merging all the relevant rwsem code into a single file, there
are a number of optimizations and cleanups that can be done:

 1) Remove all the EXPORT_SYMBOL() calls for functions that are not
    accessed elsewhere.
 2) Remove all the __visible tags as none of the functions will be
    called from assembly code anymore.
 3) Make all the internal functions static.
 4) Remove some unneeded blank lines.
 5) Remove the intermediate rwsem_down_{read|write}_failed*() functions
    and rename __rwsem_down_{read|write}_failed_common() to
    rwsem_down_{read|write}_slowpath().
 6) Use atomic_long_try_cmpxchg_acquire() as much as possible.
 7) Remove the rwsem_rtrylock and rwsem_wtrylock lock events as they
    are not that useful.

That enables the compiler to do better optimization and reduce code
size. The text+data size of rwsem.o on an x86-64 machine with gcc8 was
reduced from 10237 bytes to 5030 bytes with this change.
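
Item 6 above is worth a brief illustration. Unlike plain cmpxchg, the
try_cmpxchg variant returns a boolean and, on failure, writes the value it
actually found back into the caller's "expected" variable, so retry loops
need no separate re-read. The user-space sketch below mimics that behaviour
with the GCC/Clang __atomic builtins; the constants and names are made up
for the example and are not the kernel API.

#include <stdbool.h>
#include <stdio.h>

/* User-space stand-in for the kernel's atomic_long_try_cmpxchg_acquire(). */
static bool try_cmpxchg_acquire(long *v, long *old, long new)
{
	/* On failure, *old is refreshed with the value currently in *v. */
	return __atomic_compare_exchange_n(v, old, new, false,
					   __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

#define READER_BIAS		256	/* illustrative, in the spirit of RWSEM_READER_BIAS */
#define READ_FAILED_MASK	0x7	/* illustrative failure bits                         */

int main(void)
{
	long count = 0;		/* pretend this is sem->count     */
	long expected = 0;	/* start from the unlocked value  */

	do {
		if (try_cmpxchg_acquire(&count, &expected, expected + READER_BIAS)) {
			printf("read lock acquired, count=%ld\n", count);
			return 0;
		}
		/* 'expected' already holds the latest count; no re-read needed. */
	} while (!(expected & READ_FAILED_MASK));

	printf("trylock failed, count=%ld\n", expected);
	return 1;
}

The do/while shape mirrors the reworked __down_read_trylock() in the diff:
retry as long as none of the failure bits are set, without ever reloading
the counter by hand.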

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h |   2 -
 kernel/locking/rwsem.c            | 119 +++++++++---------------------
 2 files changed, 34 insertions(+), 87 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index ad7668cfc9da..11187a1d40b8 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -61,7 +61,5 @@ LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
-LOCK_EVENT(rwsem_rtrylock)	/* # of read trylock calls		*/
 LOCK_EVENT(rwsem_wlock)		/* # of write locks acquired		*/
 LOCK_EVENT(rwsem_wlock_fail)	/* # of failed write lock acquisitions	*/
-LOCK_EVENT(rwsem_wtrylock)	/* # of write trylock calls		*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8317bcdf063b..15480aa6d442 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -205,7 +205,6 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 	osq_lock_init(&sem->osq);
 #endif
 }
-
 EXPORT_SYMBOL(__init_rwsem);
 
 enum rwsem_waiter_type {
@@ -330,7 +329,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 
 		/*
 		 * Ensure calling get_task_struct() before setting the reader
-		 * waiter to nil such that rwsem_down_read_failed() cannot
+		 * waiter to nil such that rwsem_down_read_slowpath() cannot
 		 * race with do_exit() by always holding a reference count
 		 * to the task to wakeup.
 		 */
@@ -516,8 +515,8 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 /*
  * Wait for the read lock to be granted
  */
-static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+static struct rw_semaphore __sched *
+rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count, adjustment = -RWSEM_READER_BIAS;
 	struct rwsem_waiter waiter;
@@ -589,25 +588,11 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
 	return ERR_PTR(-EINTR);
 }
 
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed(struct rw_semaphore *sem)
-{
-	return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed_killable(struct rw_semaphore *sem)
-{
-	return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed_killable);
-
 /*
  * Wait until we successfully acquire the write lock
  */
-static inline struct rw_semaphore *
-__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
+static struct rw_semaphore *
+rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count;
 	bool waiting = true; /* any queued threads before us */
@@ -708,26 +693,11 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
 	return ERR_PTR(-EINTR);
 }
 
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed(struct rw_semaphore *sem)
-{
-	return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed_killable(struct rw_semaphore *sem)
-{
-	return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed_killable);
-
 /*
  * handle waking up a waiter on the semaphore
  * - up_read/up_write has decremented the active part of count if we come here
  */
-__visible
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 {
 	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
@@ -742,15 +712,13 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 
 	return sem;
 }
-EXPORT_SYMBOL(rwsem_wake);
 
 /*
  * downgrade a write lock into a read lock
  * - caller incremented waiting part of count and discovered it still negative
  * - just wake up any readers at the front of the queue
  */
-__visible
-struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
 {
 	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
@@ -765,7 +733,6 @@ struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
 
 	return sem;
 }
-EXPORT_SYMBOL(rwsem_downgrade_wake);
 
 /*
  * lock for reading
@@ -774,7 +741,7 @@ inline void __down_read(struct rw_semaphore *sem)
 {
 	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		rwsem_down_read_failed(sem);
+		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
 		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
 					RWSEM_READER_OWNED), sem);
 	} else {
@@ -786,7 +753,7 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 {
 	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
+		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
 			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
 					RWSEM_READER_OWNED), sem);
@@ -803,7 +770,6 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
 	 */
 	long tmp = RWSEM_UNLOCKED_VALUE;
 
-	lockevent_inc(rwsem_rtrylock);
 	do {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
 					tmp + RWSEM_READER_BIAS)) {
@@ -819,30 +785,33 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
  */
 static inline void __down_write(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
-						 RWSEM_WRITER_LOCKED)))
-		rwsem_down_write_failed(sem);
+	long tmp = RWSEM_UNLOCKED_VALUE;
+
+	if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+						      RWSEM_WRITER_LOCKED)))
+		rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE);
 	rwsem_set_owner(sem);
 }
 
 static inline int __down_write_killable(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
-						 RWSEM_WRITER_LOCKED)))
-		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
+	long tmp = RWSEM_UNLOCKED_VALUE;
+
+	if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+						      RWSEM_WRITER_LOCKED))) {
+		if (IS_ERR(rwsem_down_write_slowpath(sem, TASK_KILLABLE)))
 			return -EINTR;
+	}
 	rwsem_set_owner(sem);
 	return 0;
 }
 
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
-	long tmp;
+	long tmp = RWSEM_UNLOCKED_VALUE;
 
-	lockevent_inc(rwsem_wtrylock);
-	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
-					  RWSEM_WRITER_LOCKED);
-	if (tmp == RWSEM_UNLOCKED_VALUE) {
+	if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+					    RWSEM_WRITER_LOCKED)) {
 		rwsem_set_owner(sem);
 		return true;
 	}
@@ -856,12 +825,11 @@ inline void __up_read(struct rw_semaphore *sem)
 {
 	long tmp;
 
-	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
-				sem);
+	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED), sem);
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
-	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
-			== RWSEM_FLAG_WAITERS))
+	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
+		      RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
 }
 
@@ -870,10 +838,12 @@ inline void __up_read(struct rw_semaphore *sem)
  */
 static inline void __up_write(struct rw_semaphore *sem)
 {
+	long tmp;
+
 	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
 	rwsem_clear_owner(sem);
-	if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
-			&sem->count) & RWSEM_FLAG_WAITERS))
+	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
+	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem);
 }
 
@@ -909,7 +879,6 @@ void __sched down_read(struct rw_semaphore *sem)
 
 	LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
 }
-
 EXPORT_SYMBOL(down_read);
 
 int __sched down_read_killable(struct rw_semaphore *sem)
@@ -924,7 +893,6 @@ int __sched down_read_killable(struct rw_semaphore *sem)
 
 	return 0;
 }
-
 EXPORT_SYMBOL(down_read_killable);
 
 /*
@@ -938,7 +906,6 @@ int down_read_trylock(struct rw_semaphore *sem)
 		rwsem_acquire_read(&sem->dep_map, 0, 1, _RET_IP_);
 	return ret;
 }
-
 EXPORT_SYMBOL(down_read_trylock);
 
 /*
@@ -948,10 +915,8 @@ void __sched down_write(struct rw_semaphore *sem)
 {
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
-
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
 }
-
 EXPORT_SYMBOL(down_write);
 
 /*
@@ -962,14 +927,14 @@ int __sched down_write_killable(struct rw_semaphore *sem)
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
 
-	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock, __down_write_killable)) {
+	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock,
+				  __down_write_killable)) {
 		rwsem_release(&sem->dep_map, 1, _RET_IP_);
 		return -EINTR;
 	}
 
 	return 0;
 }
-
 EXPORT_SYMBOL(down_write_killable);
 
 /*
@@ -984,7 +949,6 @@ int down_write_trylock(struct rw_semaphore *sem)
 
 	return ret;
 }
-
 EXPORT_SYMBOL(down_write_trylock);
 
 /*
@@ -993,10 +957,8 @@ EXPORT_SYMBOL(down_write_trylock);
 void up_read(struct rw_semaphore *sem)
 {
 	rwsem_release(&sem->dep_map, 1, _RET_IP_);
-
 	__up_read(sem);
 }
-
 EXPORT_SYMBOL(up_read);
 
 /*
@@ -1005,10 +967,8 @@ EXPORT_SYMBOL(up_read);
 void up_write(struct rw_semaphore *sem)
 {
 	rwsem_release(&sem->dep_map, 1, _RET_IP_);
-
 	__up_write(sem);
 }
-
 EXPORT_SYMBOL(up_write);
 
 /*
@@ -1017,10 +977,8 @@ EXPORT_SYMBOL(up_write);
 void downgrade_write(struct rw_semaphore *sem)
 {
 	lock_downgrade(&sem->dep_map, _RET_IP_);
-
 	__downgrade_write(sem);
 }
-
 EXPORT_SYMBOL(downgrade_write);
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -1029,40 +987,32 @@ void down_read_nested(struct rw_semaphore *sem, int subclass)
 {
 	might_sleep();
 	rwsem_acquire_read(&sem->dep_map, subclass, 0, _RET_IP_);
-
 	LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
 }
-
 EXPORT_SYMBOL(down_read_nested);
 
 void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
 {
 	might_sleep();
 	rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
-
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
 }
-
 EXPORT_SYMBOL(_down_write_nest_lock);
 
 void down_read_non_owner(struct rw_semaphore *sem)
 {
 	might_sleep();
-
 	__down_read(sem);
 	__rwsem_set_reader_owned(sem, NULL);
 }
-
 EXPORT_SYMBOL(down_read_non_owner);
 
 void down_write_nested(struct rw_semaphore *sem, int subclass)
 {
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
-
 	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
 }
-
 EXPORT_SYMBOL(down_write_nested);
 
 int __sched down_write_killable_nested(struct rw_semaphore *sem, int subclass)
@@ -1070,14 +1020,14 @@ int __sched down_write_killable_nested(struct rw_semaphore *sem, int subclass)
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
 
-	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock, __down_write_killable)) {
+	if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock,
+				  __down_write_killable)) {
 		rwsem_release(&sem->dep_map, 1, _RET_IP_);
 		return -EINTR;
 	}
 
 	return 0;
 }
-
 EXPORT_SYMBOL(down_write_killable_nested);
 
 void up_read_non_owner(struct rw_semaphore *sem)
@@ -1086,7 +1036,6 @@ void up_read_non_owner(struct rw_semaphore *sem)
 				sem);
 	__up_read(sem);
 }
-
 EXPORT_SYMBOL(up_read_non_owner);
 
 #endif
-- 
2.18.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH-tip v7 07/20] locking/rwsem: Make rwsem_spin_on_owner() return owner state
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (5 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 06/20] locking/rwsem: Code cleanup after files merging Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 08/20] locking/rwsem: Implement lock handoff to prevent lock starvation Waiman Long
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

This patch modifies rwsem_spin_on_owner() to return four possible
values to better reflect the state of the lock holder, which enables us
to make a better decision about what to do next.
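
One small point worth spelling out: the four values are defined as distinct
bits rather than 0-3, so a caller can classify a whole group of states with
a single mask test (OWNER_SPINNABLE) instead of chained comparisons. The
standalone sketch below copies the enum and the mask from the patch; the
rest is made up purely for illustration.

#include <stdio.h>

enum owner_state {
	OWNER_NULL		= 1 << 0,
	OWNER_WRITER		= 1 << 1,
	OWNER_READER		= 1 << 2,
	OWNER_NONSPINNABLE	= 1 << 3,
};
#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)

static const char *spin_decision(enum owner_state s)
{
	/* One mask test covers both of the spinnable states. */
	return (s & OWNER_SPINNABLE) ? "keep spinning" : "stop spinning";
}

int main(void)
{
	const enum owner_state states[] = {
		OWNER_NULL, OWNER_WRITER, OWNER_READER, OWNER_NONSPINNABLE
	};

	for (unsigned int i = 0; i < sizeof(states) / sizeof(states[0]); i++)
		printf("state 0x%x -> %s\n", (unsigned int)states[i],
		       spin_decision(states[i]));
	return 0;
}

rwsem_optimistic_spin() in the diff below uses exactly this kind of test:
it keeps looping while rwsem_spin_on_owner(sem) & OWNER_SPINNABLE is
non-zero.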

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 65 ++++++++++++++++++++++++++++++------------
 1 file changed, 47 insertions(+), 18 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 15480aa6d442..9ec25ef4e7b7 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -414,17 +414,54 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 }
 
 /*
- * Return true only if we can still spin on the owner field of the rwsem.
+ * The rwsem_spin_on_owner() function returns the following 4 values
+ * depending on the lock owner state.
+ *   OWNER_NULL  : owner is currently NULL
+ *   OWNER_WRITER: when owner changes and is a writer
+ *   OWNER_READER: when owner changes and the new owner may be a reader.
+ *   OWNER_NONSPINNABLE:
+ *		   when optimistic spinning has to stop because either the
+ *		   owner stops running, is unknown, or its timeslice has
+ *		   been used up.
  */
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
+enum owner_state {
+	OWNER_NULL		= 1 << 0,
+	OWNER_WRITER		= 1 << 1,
+	OWNER_READER		= 1 << 2,
+	OWNER_NONSPINNABLE	= 1 << 3,
+};
+#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
+
+static inline enum owner_state rwsem_owner_state(unsigned long owner)
 {
-	struct task_struct *owner = READ_ONCE(sem->owner);
+	if (!owner)
+		return OWNER_NULL;
 
-	if (!is_rwsem_owner_spinnable(owner))
-		return false;
+	if (owner & RWSEM_ANONYMOUSLY_OWNED)
+		return OWNER_NONSPINNABLE;
+
+	if (owner & RWSEM_READER_OWNED)
+		return OWNER_READER;
+
+	return OWNER_WRITER;
+}
+
+static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
+{
+	struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
+	enum owner_state state = rwsem_owner_state((unsigned long)owner);
+
+	if (state != OWNER_WRITER)
+		return state;
 
 	rcu_read_lock();
-	while (owner && (READ_ONCE(sem->owner) == owner)) {
+	for (;;) {
+		tmp = READ_ONCE(sem->owner);
+		if (tmp != owner) {
+			state = rwsem_owner_state((unsigned long)tmp);
+			break;
+		}
+
 		/*
 		 * Ensure we emit the owner->on_cpu, dereference _after_
 		 * checking sem->owner still matches owner, if that fails,
@@ -433,24 +470,16 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 		 */
 		barrier();
 
-		/*
-		 * abort spinning when need_resched or owner is not running or
-		 * owner's cpu is preempted.
-		 */
 		if (need_resched() || !owner_on_cpu(owner)) {
-			rcu_read_unlock();
-			return false;
+			state = OWNER_NONSPINNABLE;
+			break;
 		}
 
 		cpu_relax();
 	}
 	rcu_read_unlock();
 
-	/*
-	 * If there is a new owner or the owner is not set, we continue
-	 * spinning.
-	 */
-	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+	return state;
 }
 
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
@@ -473,7 +502,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	 *  2) readers own the lock as we can't determine if they are
 	 *     actively running or not.
 	 */
-	while (rwsem_spin_on_owner(sem)) {
+	while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
 		/*
 		 * Try to acquire the lock
 		 */
-- 
2.18.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH-tip v7 08/20] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (6 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 07/20] locking/rwsem: Make rwsem_spin_on_owner() return owner state Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-05-03 13:10   ` Peter Zijlstra
  2019-04-28 21:25 ` [PATCH-tip v7 09/20] locking/rwsem: Always release wait_lock before waking up tasks Waiman Long
                   ` (12 subsequent siblings)
  20 siblings, 1 reply; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

Because of writer lock stealing, it is possible that a constant
stream of incoming writers will cause a waiting writer or reader to
wait indefinitely, leading to lock starvation.

This patch implements a lock handoff mechanism to disable lock stealing
and force lock handoff to the first waiter or waiters (for readers) in
the queue after at least a 4ms waiting period, unless it is an RT writer
task which doesn't need to wait. The waiting period is there so that
lock stealing is not discouraged so much that performance suffers.
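
The 4ms figure comes from the RWSEM_WAIT_TIMEOUT macro added further down,
defined as DIV_ROUND_UP(HZ, 250) jiffies. The small worked example below is
only an illustration (not part of the patch) of what that resolves to for
common CONFIG_HZ settings.

#include <stdio.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

int main(void)
{
	const int hz_values[] = { 100, 250, 300, 1000 };

	for (unsigned int i = 0; i < sizeof(hz_values) / sizeof(hz_values[0]); i++) {
		int hz = hz_values[i];
		int timeout_jiffies = DIV_ROUND_UP(hz, 250);	/* RWSEM_WAIT_TIMEOUT */

		/* One jiffy lasts 1000/HZ milliseconds. */
		printf("HZ=%4d: timeout = %d jiffies ~= %d ms\n",
		       hz, timeout_jiffies, timeout_jiffies * 1000 / hz);
	}
	return 0;
}

With HZ=250 or HZ=1000 the handoff wait works out to 4ms; with HZ=100 it is
one 10ms jiffy, since a single jiffy is already longer than 4ms.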

The setting and clearing of the handoff bit is serialized by the
wait_lock. So racing is not possible.

A rwsem microbenchmark was run for 5 seconds on a 2-socket 40-core
80-thread Skylake system with a v5.1 based kernel and 240 write_lock
threads with 5us sleep critical section.

Before the patch, the min/mean/max numbers of locking operations for
the locking threads were 1/7,792/173,696. After the patch, the figures
became 5,842/6,542/7,458. It can be seen that the rwsem became much
fairer, though the mean number of locking operations done dropped by
about 16%, which is the tradeoff for better fairness.

Making the waiter set the handoff bit right after the first wakeup can
impact performance, especially with a mixed reader/writer workload. With
the same microbenchmark, a short critical section, and an equal number of
reader and writer threads (40/40), the reader/writer locking operation
counts with the current patch were:

  40 readers, Iterations Min/Mean/Max = 1,793/1,794/1,796
  40 writers, Iterations Min/Mean/Max = 1,793/34,956/86,081

By making waiter set handoff bit immediately after wakeup:

  40 readers, Iterations Min/Mean/Max = 43/44/46
  40 writers, Iterations Min/Mean/Max = 43/1,263/3,191

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h |   2 +
 kernel/locking/rwsem.c            | 221 ++++++++++++++++++++++--------
 2 files changed, 169 insertions(+), 54 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 11187a1d40b8..634b47fd8b5e 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -61,5 +61,7 @@ LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
+LOCK_EVENT(rwsem_rlock_handoff)	/* # of read lock handoffs		*/
 LOCK_EVENT(rwsem_wlock)		/* # of write locks acquired		*/
 LOCK_EVENT(rwsem_wlock_fail)	/* # of failed write lock acquisitions	*/
+LOCK_EVENT(rwsem_wlock_handoff)	/* # of write lock handoffs		*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 9ec25ef4e7b7..dd90884a80b8 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -10,8 +10,9 @@
  * Optimistic spinning by Tim Chen <tim.c.chen@intel.com>
  * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes.
  *
- * Rwsem count bit fields re-definition and rwsem rearchitecture
- * by Waiman Long <longman@redhat.com>.
+ * Rwsem count bit fields re-definition and rwsem rearchitecture by
+ * Waiman Long <longman@redhat.com> and
+ * Peter Zijlstra <peterz@infradead.org>.
  */
 
 #include <linux/types.h>
@@ -74,20 +75,33 @@
  *
  * Bit  0   - writer locked bit
  * Bit  1   - waiters present bit
- * Bits 2-7 - reserved
+ * Bit  2   - lock handoff bit
+ * Bits 3-7 - reserved
  * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
  *
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
+ *
+ * There are three places where the lock handoff bit may be set or cleared.
+ * 1) __rwsem_mark_wake() for readers.
+ * 2) rwsem_try_write_lock() for writers.
+ * 3) Error path of rwsem_down_write_slowpath().
+ *
+ * For all the above cases, wait_lock will be held. A writer must also
+ * be the first one in the wait_list to be eligible for setting the handoff
+ * bit. So concurrent setting/clearing of handoff bit is not possible.
  */
 #define RWSEM_WRITER_LOCKED	(1UL << 0)
 #define RWSEM_FLAG_WAITERS	(1UL << 1)
+#define RWSEM_FLAG_HANDOFF	(1UL << 2)
+
 #define RWSEM_READER_SHIFT	8
 #define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
 #define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
 #define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
 #define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
-#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
+#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
+				 RWSEM_FLAG_HANDOFF)
 
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -216,7 +230,10 @@ struct rwsem_waiter {
 	struct list_head list;
 	struct task_struct *task;
 	enum rwsem_waiter_type type;
+	unsigned long timeout;
 };
+#define rwsem_first_waiter(sem) \
+	list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
 
 enum rwsem_wake_type {
 	RWSEM_WAKE_ANY,		/* Wake whatever's at head of wait list */
@@ -224,6 +241,19 @@ enum rwsem_wake_type {
 	RWSEM_WAKE_READ_OWNED	/* Waker thread holds the read lock */
 };
 
+enum writer_wait_state {
+	WRITER_NOT_FIRST,	/* Writer is not first in wait list */
+	WRITER_FIRST,		/* Writer is first in wait list     */
+	WRITER_HANDOFF		/* Writer is first & handoff needed */
+};
+
+/*
+ * The typical HZ value is either 250 or 1000. So set the minimum waiting
+ * time to at least 4ms or 1 jiffy (if it is higher than 4ms) in the wait
+ * queue before initiating the handoff protocol.
+ */
+#define RWSEM_WAIT_TIMEOUT	DIV_ROUND_UP(HZ, 250)
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
@@ -244,11 +274,13 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 	long oldcount, woken = 0, adjustment = 0;
 	struct list_head wlist;
 
+	lockdep_assert_held(&sem->wait_lock);
+
 	/*
 	 * Take a peek at the queue head waiter such that we can determine
 	 * the wakeup(s) to perform.
 	 */
-	waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
+	waiter = rwsem_first_waiter(sem);
 
 	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
 		if (wake_type == RWSEM_WAKE_ANY) {
@@ -275,7 +307,18 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		adjustment = RWSEM_READER_BIAS;
 		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
 		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
-			atomic_long_sub(adjustment, &sem->count);
+			/*
+			 * When we've been waiting "too" long (for writers
+			 * to give up the lock), request a HANDOFF to
+			 * force the issue.
+			 */
+			if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
+			    time_after(jiffies, waiter->timeout)) {
+				adjustment -= RWSEM_FLAG_HANDOFF;
+				lockevent_inc(rwsem_rlock_handoff);
+			}
+
+			atomic_long_add(-adjustment, &sem->count);
 			return;
 		}
 		/*
@@ -317,6 +360,13 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		adjustment -= RWSEM_FLAG_WAITERS;
 	}
 
+	/*
+	 * When we've woken a reader, we no longer need to force writers
+	 * to give up the lock and we can clear HANDOFF.
+	 */
+	if (woken && (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF))
+		adjustment -= RWSEM_FLAG_HANDOFF;
+
 	if (adjustment)
 		atomic_long_add(adjustment, &sem->count);
 
@@ -346,23 +396,44 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
  * This function must be called with the sem->wait_lock held to prevent
  * race conditions between checking the rwsem wait list and setting the
  * sem->count accordingly.
+ *
+ * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
+ * bit is set or the lock is acquired with handoff bit cleared.
  */
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+					enum writer_wait_state wstate)
 {
 	long new;
 
-	if (count & RWSEM_LOCK_MASK)
-		return false;
+	lockdep_assert_held(&sem->wait_lock);
+	do {
+		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
 
-	new = count + RWSEM_WRITER_LOCKED -
-	     (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
+		if (has_handoff && wstate == WRITER_NOT_FIRST)
+			return false;
 
-	if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
-		rwsem_set_owner(sem);
-		return true;
-	}
+		if (count & RWSEM_LOCK_MASK) {
+			if (has_handoff || (wstate != WRITER_HANDOFF))
+				return false;
+			new = count | RWSEM_FLAG_HANDOFF;
+		} else {
+			new = (count | RWSEM_WRITER_LOCKED) &
+				~RWSEM_FLAG_HANDOFF;
 
-	return false;
+			if (list_is_singular(&sem->wait_list))
+				new &= ~RWSEM_FLAG_WAITERS;
+		}
+	} while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
+
+	/*
+	 * We have either acquired the lock with handoff bit cleared or
+	 * set the handoff bit.
+	 */
+	if (new & RWSEM_FLAG_HANDOFF)
+		return false;
+
+	rwsem_set_owner(sem);
+	return true;
 }
 
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
@@ -373,9 +444,9 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
 {
 	long count = atomic_long_read(&sem->count);
 
-	while (!(count & RWSEM_LOCK_MASK)) {
+	while (!(count & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
-					count + RWSEM_WRITER_LOCKED)) {
+					count | RWSEM_WRITER_LOCKED)) {
 			rwsem_set_owner(sem);
 			lockevent_inc(rwsem_opt_wlock);
 			return true;
@@ -456,6 +527,11 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 
 	rcu_read_lock();
 	for (;;) {
+		if (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF) {
+			state = OWNER_NONSPINNABLE;
+			break;
+		}
+
 		tmp = READ_ONCE(sem->owner);
 		if (tmp != owner) {
 			state = rwsem_owner_state((unsigned long)tmp);
@@ -553,16 +629,18 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_READ;
+	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
 
 	raw_spin_lock_irq(&sem->wait_lock);
 	if (list_empty(&sem->wait_list)) {
 		/*
 		 * In case the wait queue is empty and the lock isn't owned
-		 * by a writer, this reader can exit the slowpath and return
-		 * immediately as its RWSEM_READER_BIAS has already been
-		 * set in the count.
+		 * by a writer or has the handoff bit set, this reader can
+		 * exit the slowpath and return immediately as its
+		 * RWSEM_READER_BIAS has already been set in the count.
 		 */
-		if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
+		if (!(atomic_long_read(&sem->count) &
+		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
 			lockevent_inc(rwsem_rlock_fast);
@@ -609,8 +687,10 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	return sem;
 out_nolock:
 	list_del(&waiter.list);
-	if (list_empty(&sem->wait_list))
-		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+	if (list_empty(&sem->wait_list)) {
+		atomic_long_andnot(RWSEM_FLAG_WAITERS|RWSEM_FLAG_HANDOFF,
+				   &sem->count);
+	}
 	raw_spin_unlock_irq(&sem->wait_lock);
 	__set_current_state(TASK_RUNNING);
 	lockevent_inc(rwsem_rlock_fail);
@@ -624,7 +704,7 @@ static struct rw_semaphore *
 rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count;
-	bool waiting = true; /* any queued threads before us */
+	enum writer_wait_state wstate;
 	struct rwsem_waiter waiter;
 	struct rw_semaphore *ret = sem;
 	DEFINE_WAKE_Q(wake_q);
@@ -639,66 +719,95 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	 */
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_WRITE;
+	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
 
 	raw_spin_lock_irq(&sem->wait_lock);
 
 	/* account for this before adding a new element to the list */
-	if (list_empty(&sem->wait_list))
-		waiting = false;
+	wstate = list_empty(&sem->wait_list) ? WRITER_FIRST : WRITER_NOT_FIRST;
 
 	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock */
-	if (waiting) {
+	if (wstate == WRITER_NOT_FIRST) {
 		count = atomic_long_read(&sem->count);
 
 		/*
-		 * If there were already threads queued before us and there are
-		 * no active writers and some readers, the lock must be read
-		 * owned; so we try to wake any read locks that were queued ahead
-		 * of us.
+		 * If there were already threads queued before us and:
+		 *  1) there are no active locks, wake the front
+		 *     queued process(es) as the handoff bit might be set.
+		 *  2) there are no active writers and some readers, the lock
+		 *     must be read owned; so we try to wake any read lock
+		 *     waiters that were queued ahead of us.
 		 */
-		if (!(count & RWSEM_WRITER_MASK) &&
-		     (count & RWSEM_READER_MASK)) {
-			__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
-			/*
-			 * The wakeup is normally called _after_ the wait_lock
-			 * is released, but given that we are proactively waking
-			 * readers we can deal with the wake_q overhead as it is
-			 * similar to releasing and taking the wait_lock again
-			 * for attempting rwsem_try_write_lock().
-			 */
-			wake_up_q(&wake_q);
+		if (count & RWSEM_WRITER_MASK)
+			goto wait;
 
-			/*
-			 * Reinitialize wake_q after use.
-			 */
-			wake_q_init(&wake_q);
-		}
+		__rwsem_mark_wake(sem, (count & RWSEM_READER_MASK)
+					? RWSEM_WAKE_READERS
+					: RWSEM_WAKE_ANY, &wake_q);
 
+		/*
+		 * The wakeup is normally called _after_ the wait_lock
+		 * is released, but given that we are proactively waking
+		 * readers we can deal with the wake_q overhead as it is
+		 * similar to releasing and taking the wait_lock again
+		 * for attempting rwsem_try_write_lock().
+		 */
+		wake_up_q(&wake_q);
+
+		/* We need wake_q again below, reinitialize */
+		wake_q_init(&wake_q);
 	} else {
 		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
 	}
 
+wait:
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
 	while (true) {
-		if (rwsem_try_write_lock(count, sem))
+		if (rwsem_try_write_lock(count, sem, wstate))
 			break;
+
 		raw_spin_unlock_irq(&sem->wait_lock);
 
 		/* Block until there are no active lockers. */
-		do {
+		for (;;) {
 			if (signal_pending_state(state, current))
 				goto out_nolock;
 
 			schedule();
 			lockevent_inc(rwsem_sleep_writer);
 			set_current_state(state);
+			/*
+			 * If HANDOFF bit is set, unconditionally do
+			 * a trylock.
+			 */
+			if (wstate == WRITER_HANDOFF)
+				break;
+
+			if ((wstate == WRITER_NOT_FIRST) &&
+			    (rwsem_first_waiter(sem) == &waiter))
+				wstate = WRITER_FIRST;
+
 			count = atomic_long_read(&sem->count);
-		} while (count & RWSEM_LOCK_MASK);
+			if (!(count & RWSEM_LOCK_MASK))
+				break;
+
+			/*
+			 * The setting of the handoff bit is deferred
+			 * until rwsem_try_write_lock() is called.
+			 */
+			if ((wstate == WRITER_FIRST) && (rt_task(current) ||
+			    time_after(jiffies, waiter.timeout))) {
+				wstate = WRITER_HANDOFF;
+				lockevent_inc(rwsem_wlock_handoff);
+				break;
+			}
+		}
 
 		raw_spin_lock_irq(&sem->wait_lock);
+		count = atomic_long_read(&sem->count);
 	}
 	__set_current_state(TASK_RUNNING);
 	list_del(&waiter.list);
@@ -711,6 +820,10 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	__set_current_state(TASK_RUNNING);
 	raw_spin_lock_irq(&sem->wait_lock);
 	list_del(&waiter.list);
+
+	if (unlikely(wstate == WRITER_HANDOFF))
+		atomic_long_add(-RWSEM_FLAG_HANDOFF,  &sem->count);
+
 	if (list_empty(&sem->wait_list))
 		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
 	else
@@ -726,7 +839,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
  * handle waking up a waiter on the semaphore
  * - up_read/up_write has decremented the active part of count if we come here
  */
-static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem, long count)
 {
 	unsigned long flags;
 	DEFINE_WAKE_Q(wake_q);
@@ -859,7 +972,7 @@ inline void __up_read(struct rw_semaphore *sem)
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
 	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
 		      RWSEM_FLAG_WAITERS))
-		rwsem_wake(sem);
+		rwsem_wake(sem, tmp);
 }
 
 /*
@@ -873,7 +986,7 @@ static inline void __up_write(struct rw_semaphore *sem)
 	rwsem_clear_owner(sem);
 	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
 	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
-		rwsem_wake(sem);
+		rwsem_wake(sem, tmp);
 }
 
 /*
-- 
2.18.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH-tip v7 09/20] locking/rwsem: Always release wait_lock before waking up tasks
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (7 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 08/20] locking/rwsem: Implement lock handoff to prevent lock starvation Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-05-03 13:37   ` Peter Zijlstra
  2019-04-28 21:25 ` [PATCH-tip v7 10/20] locking/rwsem: More optimal RT task handling of null owner Waiman Long
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

With the use of wake_q, we can do task wakeups without holding the
wait_lock. There is one exception in the rwsem code, though. It is
when the writer in the slowpath detects that there are waiters ahead
but the rwsem is not held by a writer. This can lead to a long wait_lock
hold time, especially when a large number of readers are to be woken up.

Remediate this situation by releasing the wait_lock before waking up
tasks and re-acquiring it afterward.
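
The shape of the change is the usual wake_q pattern: decide whom to wake
while holding wait_lock, drop the lock, deliver the wakeups, then re-take
the lock to continue. The user-space sketch below is only an analogy (POSIX
semaphores standing in for task wakeups, a mutex standing in for wait_lock);
it is not the kernel code in the diff.

#include <pthread.h>
#include <semaphore.h>

#define MAX_WAKE 64

/* Minimal analogue of the kernel's wake_q: record now, wake later. */
struct wake_q {
	sem_t *to_wake[MAX_WAKE];
	int n;
};

static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;

static void wake_q_add(struct wake_q *q, sem_t *s)
{
	if (q->n < MAX_WAKE)
		q->to_wake[q->n++] = s;		/* record only, no wakeup yet */
}

static void wake_up_q(struct wake_q *q)
{
	for (int i = 0; i < q->n; i++)
		sem_post(q->to_wake[i]);	/* deliver without wait_lock held */
	q->n = 0;
}

void wake_queued_readers(sem_t **readers, int nr)
{
	struct wake_q q = { .n = 0 };

	pthread_mutex_lock(&wait_lock);
	for (int i = 0; i < nr; i++)		/* pick the wakeups under the lock */
		wake_q_add(&q, readers[i]);
	pthread_mutex_unlock(&wait_lock);	/* drop the lock first...          */

	wake_up_q(&q);				/* ...then wake everybody          */

	pthread_mutex_lock(&wait_lock);		/* re-take it for the next step    */
	/* the writer's rwsem_try_write_lock() equivalent would run here */
	pthread_mutex_unlock(&wait_lock);
}

The lock is re-taken at the end because the writer still needs wait_lock to
retry its trylock, which is exactly what the hunk below does around
rwsem_try_write_lock().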

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/sched/wake_q.h |  5 +++++
 kernel/locking/rwsem.c       | 30 +++++++++++++++++++-----------
 2 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index ad826d2a4557..26a2013ac39c 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -51,6 +51,11 @@ static inline void wake_q_init(struct wake_q_head *head)
 	head->lastp = &head->first;
 }
 
+static inline bool wake_q_empty(struct wake_q_head *head)
+{
+	return head->first == WAKE_Q_TAIL;
+}
+
 extern void wake_q_add(struct wake_q_head *head, struct task_struct *task);
 extern void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task);
 extern void wake_up_q(struct wake_q_head *head);
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index dd90884a80b8..750f407b83cf 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -747,17 +747,25 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 					? RWSEM_WAKE_READERS
 					: RWSEM_WAKE_ANY, &wake_q);
 
-		/*
-		 * The wakeup is normally called _after_ the wait_lock
-		 * is released, but given that we are proactively waking
-		 * readers we can deal with the wake_q overhead as it is
-		 * similar to releasing and taking the wait_lock again
-		 * for attempting rwsem_try_write_lock().
-		 */
-		wake_up_q(&wake_q);
-
-		/* We need wake_q again below, reinitialize */
-		wake_q_init(&wake_q);
+		if (!wake_q_empty(&wake_q)) {
+			/*
+			 * We want to minimize wait_lock hold time especially
+			 * when a large number of readers are to be woken up.
+			 */
+			raw_spin_unlock_irq(&sem->wait_lock);
+			wake_up_q(&wake_q);
+			wake_q_init(&wake_q);	/* Used again, reinit */
+			raw_spin_lock_irq(&sem->wait_lock);
+			/*
+			 * This waiter may have become first in the wait
+			 * list after re-acquring the wait_lock. The
+			 * list after re-acquiring the wait_lock. The
+			 * loop below will correctly detect that. We do
+			 * need to reload count to perform proper trylock
+			 * and avoid missed wakeup.
+			 */
+			count = atomic_long_read(&sem->count);
+		}
 	} else {
 		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
 	}
-- 
2.18.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH-tip v7 10/20] locking/rwsem: More optimal RT task handling of null owner
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (8 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 09/20] locking/rwsem: Always release wait_lock before waking up tasks Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 11/20] locking/rwsem: Wake up almost all readers in wait queue Waiman Long
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

An RT task can do optimistic spinning only if the lock holder is
actually running. If the state of the lock holder isn't known, the
high-priority RT task may block forward progress of the lock holder
if it happens to reside on the same CPU, leading to deadlock. So we
have to make sure that an RT task will not spin on a reader-owned
rwsem.

When the owner is temporarily set to NULL, there are two cases
where we may want to continue spinning:

 1) The lock owner is in the process of releasing the lock, sem->owner
    is cleared but the lock has not been released yet.

 2) The lock was free and owner cleared, but another task just comes
    in and acquires the lock before we try to get it. The new owner may
    be a spinnable writer.

So an RT task is now made to retry one more time to see if it can
acquire the lock or continue spinning on the new owning writer.
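
Condensed into a sketch (illustrative only; the actual hunk is below),
the spin loop now remembers the previous owner state and lets an RT
task survive exactly one NULL-owner observation that immediately
follows a writer phase:

	int prev_owner_state = OWNER_NULL;

	for (;;) {
		enum owner_state owner_state = rwsem_spin_on_owner(sem);

		if (!(owner_state & OWNER_SPINNABLE))
			break;
		if (rwsem_try_write_lock_unqueued(sem))
			break;				/* lock acquired */

		if (owner_state != OWNER_WRITER) {
			if (need_resched())
				break;
			/* RT task: quit unless the previous owner was a writer */
			if (rt_task(current) && prev_owner_state != OWNER_WRITER)
				break;
		}
		prev_owner_state = owner_state;
		cpu_relax();
	}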

When testing on an 8-socket IvyBridge-EX system, the one additional retry
seems to improve locking performance of RT write locking threads under
heavy contention. The table below shows the locking rates (in kops/s)
with various write locking threads before and after the patch.

    Locking threads     Pre-patch     Post-patch
    ---------------     ---------     -----------
            4             2,753          2,608
            8             2,529          2,520
           16             1,727          1,918
           32             1,263          1,956
           64               889          1,343

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 51 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 44 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 750f407b83cf..62ca834d59d0 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -561,6 +561,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 {
 	bool taken = false;
+	int prev_owner_state = OWNER_NULL;
 
 	preempt_disable();
 
@@ -578,7 +579,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	 *  2) readers own the lock as we can't determine if they are
 	 *     actively running or not.
 	 */
-	while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
+	for (;;) {
+		enum owner_state owner_state = rwsem_spin_on_owner(sem);
+
+		if (!(owner_state & OWNER_SPINNABLE))
+			break;
+
 		/*
 		 * Try to acquire the lock
 		 */
@@ -588,13 +594,44 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 		}
 
 		/*
-		 * When there's no owner, we might have preempted between the
-		 * owner acquiring the lock and setting the owner field. If
-		 * we're an RT task that will live-lock because we won't let
-		 * the owner complete.
+		 * An RT task cannot do optimistic spinning if it cannot
+		 * be sure the lock holder is running or live-lock may
+		 * happen if the current task and the lock holder happen
+		 * to run in the same CPU. However, aborting optimistic
+		 * spinning while a NULL owner is detected may miss some
+		 * opportunity where spinning can continue without causing
+		 * problem.
+		 *
+		 * There are 2 possible cases where an RT task may be able
+		 * to continue spinning.
+		 *
+		 * 1) The lock owner is in the process of releasing the
+		 *    lock, sem->owner is cleared but the lock has not
+		 *    been released yet.
+		 * 2) The lock was free and owner cleared, but another
+		 *    task just comes in and acquires the lock before
+		 *    we try to get it. The new owner may be a spinnable
+		 *    writer.
+		 *
+		 * To take advantage of the two scenarios listed above, the RT
+		 * task is made to retry one more time to see if it can
+		 * acquire the lock or continue spinning on the new owning
+		 * writer. Of course, if the time lag is long enough or the
+		 * new owner is not a writer or spinnable, the RT task will
+		 * quit spinning.
+		 *
+		 * If the owner is a writer, the need_resched() check is
+		 * done inside rwsem_spin_on_owner(). If the owner is not
+		 * a writer, need_resched() check needs to be done here.
 		 */
-		if (!sem->owner && (need_resched() || rt_task(current)))
-			break;
+		if (owner_state != OWNER_WRITER) {
+			if (need_resched())
+				break;
+			if (rt_task(current) &&
+			   (prev_owner_state != OWNER_WRITER))
+				break;
+		}
+		prev_owner_state = owner_state;
 
 		/*
 		 * The cpu_relax() call is a compiler barrier which forces
-- 
2.18.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH-tip v7 11/20] locking/rwsem: Wake up almost all readers in wait queue
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (9 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 10/20] locking/rwsem: More optimal RT task handling of null owner Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-05-03 16:51   ` Peter Zijlstra
  2019-04-28 21:25 ` [PATCH-tip v7 12/20] locking/rwsem: Clarify usage of owner's nonspinaable bit Waiman Long
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

When the front of the wait queue is a reader, the other readers
immediately following it will also be woken up at the same time.
However, if there is a writer in between, the readers behind the
writer will not be woken up.

Because of optimistic spinning, the lock acquisition order is not FIFO
anyway. The lock handoff mechanism will ensure that lock starvation
will not happen.

Assuming that the lock hold times of the other readers still in the
queue will be about the same as those of the readers being woken up,
there is really not much additional cost other than the extra latency
of waking up more tasks. Therefore all the readers in the queue, up to
a maximum of 256, are woken up when the first waiter is a reader in
order to improve reader throughput. This is somewhat similar in concept
to a phase-fair R/W lock.

With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:

   # of Threads  Pre-Patch   Post-patch
   ------------  ---------   ----------
        4          1,641        1,674
        8            731        1,062
       16            564          924
       32             78          300
       64             38          195
      240             50          149

There is no performance gain at low contention level. At high contention
level, however, this patch gives a pretty decent performance boost.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 31 ++++++++++++++++++++++++++-----
 1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 62ca834d59d0..8c212b658854 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -254,6 +254,14 @@ enum writer_wait_state {
  */
 #define RWSEM_WAIT_TIMEOUT	DIV_ROUND_UP(HZ, 250)
 
+/*
+ * Magic number to batch-wakeup waiting readers, even when writers are
+ * also present in the queue. This both limits the amount of work the
+ * waking thread must do and also prevents any potential counter overflow,
+ * however unlikely.
+ */
+#define MAX_READERS_WAKEUP	0x100
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
@@ -329,11 +337,17 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 	}
 
 	/*
-	 * Grant an infinite number of read locks to the readers at the front
-	 * of the queue. We know that woken will be at least 1 as we accounted
+	 * Grant up to MAX_READERS_WAKEUP read locks to all the readers in the
+	 * queue. We know that the woken will be at least 1 as we accounted
 	 * for above. Note we increment the 'active part' of the count by the
 	 * number of readers before waking any processes up.
 	 *
+	 * This is an adaptation of the phase-fair R/W locks where at the
+	 * reader phase (first waiter is a reader), all readers are eligible
+	 * to acquire the lock at the same time irrespective of their order
+	 * in the queue. The writers acquire the lock according to their
+	 * order in the queue.
+	 *
 	 * We have to do wakeup in 2 passes to prevent the possibility that
 	 * the reader count may be decremented before it is incremented. It
 	 * is because the to-be-woken waiter may not have slept yet. So it
@@ -345,13 +359,20 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 	 * 2) For each waiters in the new list, clear waiter->task and
 	 *    put them into wake_q to be woken up later.
 	 */
-	list_for_each_entry(waiter, &sem->wait_list, list) {
+	INIT_LIST_HEAD(&wlist);
+	list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
 		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
-			break;
+			continue;
 
 		woken++;
+		list_move_tail(&waiter->list, &wlist);
+
+		/*
+		 * Limit # of readers that can be woken up per wakeup call.
+		 */
+		if (woken >= MAX_READERS_WAKEUP)
+			break;
 	}
-	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
 
 	adjustment = woken * RWSEM_READER_BIAS - adjustment;
 	lockevent_cond_inc(rwsem_wake_reader, woken);
-- 
2.18.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH-tip v7 12/20] locking/rwsem: Clarify usage of owner's nonspinaable bit
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (10 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 11/20] locking/rwsem: Wake up almost all readers in wait queue Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-05-03 15:21   ` Peter Zijlstra
  2019-04-28 21:25 ` [PATCH-tip v7 13/20] locking/rwsem: Enable readers spinning on writer Waiman Long
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

Bit 1 of sem->owner (RWSEM_ANONYMOUSLY_OWNED) is used to designate an
anonymous owner - readers or an anonymous writer. The setting of this
anonymous bit is used as an indicator that optimistic spinning cannot
be done on this rwsem.

With the upcoming reader optimistic spinning patches, a reader-owned
rwsem can be spun on for a limited period of time. We still need
this bit to indicate that a rwsem is nonspinnable, but the bit no
longer implies that the owner is anonymous. So rename the bit to
RWSEM_NONSPINNABLE to clarify its meaning.

This patch also fixes a DEBUG_RWSEMS_WARN_ON() bug in __up_write().

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/rwsem.h  |  2 +-
 kernel/locking/rwsem.c | 43 +++++++++++++++++++++---------------------
 2 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 148983e21d47..bb76e82398b2 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -50,7 +50,7 @@ struct rw_semaphore {
 };
 
 /*
- * Setting bit 1 of the owner field but not bit 0 will indicate
+ * Setting all bits of the owner field except bit 0 will indicate
  * that the rwsem is writer-owned with an unknown owner.
  */
 #define RWSEM_OWNER_UNKNOWN	((struct task_struct *)-2L)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8c212b658854..a12a1676e905 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -33,17 +33,18 @@
 /*
  * The least significant 2 bits of the owner value has the following
  * meanings when set.
- *  - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
- *  - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
- *    i.e. the owner(s) cannot be readily determined. It can be reader
- *    owned or the owning writer is indeterminate.
+ *  - Bit 0: RWSEM_READER_OWNED - The rwsem is owned by readers
+ *  - Bit 1: RWSEM_NONSPINNABLE - Waiters cannot spin on the rwsem
+ *    The rwsem is anonymously owned, i.e. the owner(s) cannot be
+ *    readily determined. It can be reader owned or the owning writer
+ *    is indeterminate.
  *
  * When a writer acquires a rwsem, it puts its task_struct pointer
  * into the owner field. It is cleared after an unlock.
  *
  * When a reader acquires a rwsem, it will also puts its task_struct
  * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
+ * RWSEM_NONSPINNABLE bits set. On unlock, the owner field will
  * largely be left untouched. So for a free or reader-owned rwsem,
  * the owner value may contain information about the last reader that
  * acquires the rwsem. The anonymous bit is set because that particular
@@ -55,7 +56,8 @@
  * a rwsem, but the overhead is simply too big.
  */
 #define RWSEM_READER_OWNED	(1UL << 0)
-#define RWSEM_ANONYMOUSLY_OWNED	(1UL << 1)
+#define RWSEM_NONSPINNABLE	(1UL << 1)
+#define RWSEM_OWNER_FLAGS_MASK	(RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)
 
 #ifdef CONFIG_DEBUG_RWSEMS
 # define DEBUG_RWSEMS_WARN_ON(c, sem)	do {			\
@@ -132,7 +134,7 @@ static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
 					    struct task_struct *owner)
 {
 	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
-						 | RWSEM_ANONYMOUSLY_OWNED;
+						 | RWSEM_NONSPINNABLE;
 
 	WRITE_ONCE(sem->owner, (struct task_struct *)val);
 }
@@ -144,20 +146,12 @@ static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
 
 /*
  * Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock, i.e. the lock is not anonymously owned.
+ * and steal the lock.
  * N.B. !owner is considered spinnable.
  */
 static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
 {
-	return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
-}
-
-/*
- * Return true if rwsem is owned by an anonymous writer or readers.
- */
-static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
-{
-	return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
+	return !((unsigned long)owner & RWSEM_NONSPINNABLE);
 }
 
 #ifdef CONFIG_DEBUG_RWSEMS
@@ -170,10 +164,10 @@ static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
 	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
-						   | RWSEM_ANONYMOUSLY_OWNED;
+						   | RWSEM_NONSPINNABLE;
 	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
 		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
-				RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
+				RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
 }
 #else
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
@@ -490,7 +484,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 	struct task_struct *owner;
 	bool ret = true;
 
-	BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
+	BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
 
 	if (need_resched())
 		return false;
@@ -529,7 +523,7 @@ static inline enum owner_state rwsem_owner_state(unsigned long owner)
 	if (!owner)
 		return OWNER_NULL;
 
-	if (owner & RWSEM_ANONYMOUSLY_OWNED)
+	if (owner & RWSEM_NONSPINNABLE)
 		return OWNER_NONSPINNABLE;
 
 	if (owner & RWSEM_READER_OWNED)
@@ -1048,7 +1042,12 @@ static inline void __up_write(struct rw_semaphore *sem)
 {
 	long tmp;
 
-	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+	/*
+	 * sem->owner may differ from current if the ownership is transferred
+	 * to an anonymous writer by setting the RWSEM_NONSPINNABLE bits.
+	 */
+	DEBUG_RWSEMS_WARN_ON((sem->owner != current) &&
+			    !((long)sem->owner & RWSEM_NONSPINNABLE), sem);
 	rwsem_clear_owner(sem);
 	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
 	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
-- 
2.18.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH-tip v7 13/20] locking/rwsem: Enable readers spinning on writer
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (11 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 12/20] locking/rwsem: Clarify usage of owner's nonspinaable bit Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 14/20] locking/rwsem: Enable time-based spinning on reader-owned rwsem Waiman Long
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

This patch enables readers to optimistically spin on a
rwsem when it is owned by a writer instead of going to sleep
directly.  The rwsem_can_spin_on_owner() function is extracted
out of rwsem_optimistic_spin() and is called directly by
__rwsem_down_read_failed_common() and __rwsem_down_write_failed_common().

With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal
numbers of readers and writers before and after the patch were as
follows:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        4          1,674        1,684
        8          1,062        1,074
       16            924          900
       32            300          458
       64            195          208
      128            164          168
      240            149          143

The performance change wasn't significant in this case, but this change
is required by a follow-on patch.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h |  1 +
 kernel/locking/rwsem.c            | 86 ++++++++++++++++++++++++++-----
 2 files changed, 75 insertions(+), 12 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 634b47fd8b5e..ca954e4e00e4 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -56,6 +56,7 @@ LOCK_EVENT(rwsem_sleep_reader)	/* # of reader sleeps			*/
 LOCK_EVENT(rwsem_sleep_writer)	/* # of writer sleeps			*/
 LOCK_EVENT(rwsem_wake_reader)	/* # of reader wakeups			*/
 LOCK_EVENT(rwsem_wake_writer)	/* # of writer wakeups			*/
+LOCK_EVENT(rwsem_opt_rlock)	/* # of read locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_wlock)	/* # of write locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index a12a1676e905..6b11dce3139d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -452,6 +452,30 @@ static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
 }
 
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * Try to acquire read lock before the reader is put on wait queue.
+ * Lock acquisition isn't allowed if the rwsem is locked or a writer handoff
+ * is ongoing.
+ */
+static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
+{
+	long count = atomic_long_read(&sem->count);
+
+	if (count & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))
+		return false;
+
+	count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
+	if (!(count & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+		rwsem_set_reader_owned(sem);
+		lockevent_inc(rwsem_opt_rlock);
+		return true;
+	}
+
+	/* Back out the change */
+	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+	return false;
+}
+
 /*
  * Try to acquire write lock before the writer has been put on wait queue.
  */
@@ -486,9 +510,12 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 
 	BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
 
-	if (need_resched())
+	if (need_resched()) {
+		lockevent_inc(rwsem_opt_fail);
 		return false;
+	}
 
+	preempt_disable();
 	rcu_read_lock();
 	owner = READ_ONCE(sem->owner);
 	if (owner) {
@@ -496,6 +523,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 		      owner_on_cpu(owner);
 	}
 	rcu_read_unlock();
+	preempt_enable();
+
+	lockevent_cond_inc(rwsem_opt_fail, !ret);
 	return ret;
 }
 
@@ -573,7 +603,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 	return state;
 }
 
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	bool taken = false;
 	int prev_owner_state = OWNER_NULL;
@@ -581,9 +611,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	preempt_disable();
 
 	/* sem->wait_lock should not be held when doing optimistic spinning */
-	if (!rwsem_can_spin_on_owner(sem))
-		goto done;
-
 	if (!osq_lock(&sem->osq))
 		goto done;
 
@@ -603,10 +630,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 		/*
 		 * Try to acquire the lock
 		 */
-		if (rwsem_try_write_lock_unqueued(sem)) {
-			taken = true;
+		taken = wlock ? rwsem_try_write_lock_unqueued(sem)
+			      : rwsem_try_read_lock_unqueued(sem);
+
+		if (taken)
 			break;
-		}
 
 		/*
 		 * An RT task cannot do optimistic spinning if it cannot
@@ -663,7 +691,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	return taken;
 }
 #else
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+	return false;
+}
+
+static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	return false;
 }
@@ -679,6 +712,31 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
+	if (!rwsem_can_spin_on_owner(sem))
+		goto queue;
+
+	/*
+	 * Undo read bias from down_read() and do optimistic spinning.
+	 */
+	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+	adjustment = 0;
+	if (rwsem_optimistic_spin(sem, false)) {
+		/*
+		 * Wake up other readers in the wait list if the front
+		 * waiter is a reader.
+		 */
+		if ((atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS)) {
+			raw_spin_lock_irq(&sem->wait_lock);
+			if (!list_empty(&sem->wait_list))
+				__rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED,
+						  &wake_q);
+			raw_spin_unlock_irq(&sem->wait_lock);
+			wake_up_q(&wake_q);
+		}
+		return sem;
+	}
+
+queue:
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_READ;
 	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
@@ -691,7 +749,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 		 * exit the slowpath and return immediately as its
 		 * RWSEM_READER_BIAS has already been set in the count.
 		 */
-		if (!(atomic_long_read(&sem->count) &
+		if (adjustment && !(atomic_long_read(&sem->count) &
 		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
@@ -703,7 +761,10 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock, but no longer actively locking */
-	count = atomic_long_add_return(adjustment, &sem->count);
+	if (adjustment)
+		count = atomic_long_add_return(adjustment, &sem->count);
+	else
+		count = atomic_long_read(&sem->count);
 
 	/*
 	 * If there are no active locks, wake the front queued process(es).
@@ -762,7 +823,8 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	DEFINE_WAKE_Q(wake_q);
 
 	/* do optimistic spinning and steal lock if possible */
-	if (rwsem_optimistic_spin(sem))
+	if (rwsem_can_spin_on_owner(sem) &&
+	    rwsem_optimistic_spin(sem, true))
 		return sem;
 
 	/*
-- 
2.18.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH-tip v7 14/20] locking/rwsem: Enable time-based spinning on reader-owned rwsem
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (12 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 13/20] locking/rwsem: Enable readers spinning on writer Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-05-06 15:47   ` Peter Zijlstra
  2019-04-28 21:25 ` [PATCH-tip v7 15/20] locking/rwsem: Adaptive disabling of reader optimistic spinning Waiman Long
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

When the rwsem is owned by readers, writers stop optimistic spinning
simply because there is no easy way to figure out whether all the
readers are actively running or not. However, there are scenarios where
the readers are unlikely to sleep and optimistic spinning can help
performance.

This patch provides a simple mechanism for a writer to spin on a
reader-owned rwsem. It is time-based spinning where the allowable
spinning time can vary from 10us to 25us depending on the condition
of the rwsem.
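
As a concrete example of the threshold calculation used in the hunk
below: with RWSEM_FLAG_WAITERS set and 8 readers holding the lock, the
writer is allowed to spin for (10 + 8/2)us = 14us; with 40 readers the
reader count is clamped at 30, giving the 25us maximum; with no waiters
queued, the flat 25us limit applies.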

When the time threshold is exceeded, the nonspinnable bits will be set
in the owner field to indicate that no more optimistic spinning will
be allowed on this rwsem until it becomes writer-owned again. For
fairness, not even readers are then allowed to acquire the
reader-locked rwsem by optimistic spinning.

We also want a writer to acquire the lock after the readers have held
the lock for a relatively long time. In order to give preference to
writers under such a circumstance, the single RWSEM_NONSPINNABLE bit is
now split into two - one for readers and one for writers. When optimistic
spinning is disabled, both bits will be set. When the reader count drops
down to 0, the writer nonspinnable bit will be cleared to allow writers
to spin on the lock, but not the readers. When a writer acquires the
lock, it will write its own task_struct pointer into sem->owner and
clear the reader nonspinnable bit in the process.

The time taken for each iteration of the reader-owned rwsem spinning
loop varies. Below are sample minimum elapsed times for 16 iterations
of the loop.

      System                 Time for 16 Iterations
      ------                 ----------------------
  1-socket Skylake                  ~800ns
  4-socket Broadwell                ~300ns
  2-socket ThunderX2 (arm64)        ~250ns

When the lock cacheline is contended, we can see up to almost a 10X
increase in elapsed time. So 25us corresponds to at most 500, 1300 and
1600 iterations for each of the above systems.

With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        2          1,759        6,684
        4          1,684        6,738
        8          1,074        7,222
       16            900        7,163
       32            458        7,316
       64            208          520
      128            168          425
      240            143          474

This patch gives a big boost in performance for mixed reader/writer
workloads.

With 32 locking threads, the rwsem lock event data were:

rwsem_opt_fail=79850
rwsem_opt_nospin=5069
rwsem_opt_rlock=597484
rwsem_opt_wlock=957339
rwsem_sleep_reader=57782
rwsem_sleep_writer=55663

With 64 locking threads, the data looked like:

rwsem_opt_fail=346723
rwsem_opt_nospin=6293
rwsem_opt_rlock=1127119
rwsem_opt_wlock=1400628
rwsem_sleep_reader=308201
rwsem_sleep_writer=72281

So a lot more threads acquired the lock in the slowpath and more threads
went to sleep.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h |   1 +
 kernel/locking/rwsem.c            | 242 ++++++++++++++++++++++++------
 2 files changed, 197 insertions(+), 46 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index ca954e4e00e4..baa998401052 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -59,6 +59,7 @@ LOCK_EVENT(rwsem_wake_writer)	/* # of writer wakeups			*/
 LOCK_EVENT(rwsem_opt_rlock)	/* # of read locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_wlock)	/* # of write locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
+LOCK_EVENT(rwsem_opt_nospin)	/* # of disabled reader opt-spinnings	*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 6b11dce3139d..62fcb8bacfab 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -23,6 +23,7 @@
 #include <linux/sched/debug.h>
 #include <linux/sched/wake_q.h>
 #include <linux/sched/signal.h>
+#include <linux/sched/clock.h>
 #include <linux/export.h>
 #include <linux/rwsem.h>
 #include <linux/atomic.h>
@@ -31,24 +32,28 @@
 #include "lock_events.h"
 
 /*
- * The least significant 2 bits of the owner value has the following
+ * The least significant 3 bits of the owner value has the following
  * meanings when set.
  *  - Bit 0: RWSEM_READER_OWNED - The rwsem is owned by readers
- *  - Bit 1: RWSEM_NONSPINNABLE - Waiters cannot spin on the rwsem
- *    The rwsem is anonymously owned, i.e. the owner(s) cannot be
- *    readily determined. It can be reader owned or the owning writer
- *    is indeterminate.
+ *  - Bit 1: RWSEM_RD_NONSPINNABLE - Readers cannot spin on this lock.
+ *  - Bit 2: RWSEM_WR_NONSPINNABLE - Writers cannot spin on this lock.
  *
+ * When the rwsem is either owned by an anonymous writer, or it is
+ * reader-owned, but a spinning writer has timed out, both nonspinnable
+ * bits will be set to disable optimistic spinning by readers and writers.
+ * In the latter case, the last unlocking reader should then check the
+ * writer nonspinnable bit and clear it only to give writers preference
+ * to acquire the lock via optimistic spinning, but not readers. Similar
+ * action is also done in the reader slowpath.
+ *
  * When a writer acquires a rwsem, it puts its task_struct pointer
  * into the owner field. It is cleared after an unlock.
  *
  * When a reader acquires a rwsem, it will also puts its task_struct
- * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_NONSPINNABLE bits set. On unlock, the owner field will
- * largely be left untouched. So for a free or reader-owned rwsem,
- * the owner value may contain information about the last reader that
- * acquires the rwsem. The anonymous bit is set because that particular
- * reader may or may not still own the lock.
+ * pointer into the owner field with the RWSEM_READER_OWNED bit set.
+ * On unlock, the owner field will largely be left untouched. So
+ * for a free or reader-owned rwsem, the owner value may contain
+ * information about the last reader that acquires the rwsem.
  *
  * That information may be helpful in debugging cases where the system
  * seems to hang on a reader owned rwsem especially if only one reader
@@ -56,7 +61,9 @@
  * a rwsem, but the overhead is simply too big.
  */
 #define RWSEM_READER_OWNED	(1UL << 0)
-#define RWSEM_NONSPINNABLE	(1UL << 1)
+#define RWSEM_RD_NONSPINNABLE	(1UL << 1)
+#define RWSEM_WR_NONSPINNABLE	(1UL << 2)
+#define RWSEM_NONSPINNABLE	(RWSEM_RD_NONSPINNABLE | RWSEM_WR_NONSPINNABLE)
 #define RWSEM_OWNER_FLAGS_MASK	(RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)
 
 #ifdef CONFIG_DEBUG_RWSEMS
@@ -133,8 +140,7 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
 					    struct task_struct *owner)
 {
-	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
-						 | RWSEM_NONSPINNABLE;
+	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED;
 
 	WRITE_ONCE(sem->owner, (struct task_struct *)val);
 }
@@ -149,9 +155,27 @@ static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
  * and steal the lock.
  * N.B. !owner is considered spinnable.
  */
-static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
+static inline bool is_rwsem_owner_spinnable(struct task_struct *owner, bool wr)
 {
-	return !((unsigned long)owner & RWSEM_NONSPINNABLE);
+	unsigned long bit = wr ? RWSEM_WR_NONSPINNABLE : RWSEM_RD_NONSPINNABLE;
+
+	return !((unsigned long)owner & bit);
+}
+
+/*
+ * Return true if the rwsem is spinnable.
+ */
+static inline bool is_rwsem_spinnable(struct rw_semaphore *sem, bool wr)
+{
+	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner), wr);
+}
+
+/*
+ * Remove all the flag bits from owner.
+ */
+static inline struct task_struct *owner_without_flags(struct task_struct *owner)
+{
+	return (struct task_struct *)((long)owner & ~RWSEM_OWNER_FLAGS_MASK);
 }
 
 #ifdef CONFIG_DEBUG_RWSEMS
@@ -163,11 +187,11 @@ static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
  */
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
-	unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
-						   | RWSEM_NONSPINNABLE;
-	if (READ_ONCE(sem->owner) == (struct task_struct *)val)
-		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
-				RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
+	unsigned long owner = (unsigned long)READ_ONCE(sem->owner);
+
+	if ((owner & ~RWSEM_OWNER_FLAGS_MASK) == (unsigned long)current)
+		cmpxchg_relaxed((unsigned long *)&sem->owner, owner,
+				owner & RWSEM_OWNER_FLAGS_MASK);
 }
 #else
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
@@ -175,6 +199,22 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 }
 #endif
 
+/*
+ * Set the RWSEM_NONSPINNABLE bits if the RWSEM_READER_OWNED flag
+ * remains set. Otherwise, the operation will be aborted.
+ */
+static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
+{
+	long owner = (long)READ_ONCE(sem->owner);
+
+	while (owner & RWSEM_READER_OWNED) {
+		if (!is_rwsem_owner_spinnable((void *)owner, false))
+			break;
+		owner = cmpxchg((long *)&sem->owner, owner,
+				owner | RWSEM_NONSPINNABLE);
+	}
+}
+
 /*
  * Guide to the rw_semaphore's count field.
  *
@@ -496,6 +536,11 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
 
 static inline bool owner_on_cpu(struct task_struct *owner)
 {
+	/*
+	 * Clear all the flag bits in owner
+	 */
+	owner = owner_without_flags(owner);
+
 	/*
 	 * As lock holder preemption issue, we both skip spinning if
 	 * task is not on cpu or its cpu is preempted
@@ -503,12 +548,12 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 }
 
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, bool wr)
 {
 	struct task_struct *owner;
 	bool ret = true;
 
-	BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
+	BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN, true));
 
 	if (need_resched()) {
 		lockevent_inc(rwsem_opt_fail);
@@ -517,14 +562,17 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 
 	preempt_disable();
 	rcu_read_lock();
+
 	owner = READ_ONCE(sem->owner);
-	if (owner) {
-		ret = is_rwsem_owner_spinnable(owner) &&
-		      owner_on_cpu(owner);
-	}
+	if (!is_rwsem_owner_spinnable(owner, wr))
+		ret = false;
+	else if ((unsigned long)owner & RWSEM_READER_OWNED)
+		ret = true;
+	else if ((owner = owner_without_flags(owner)))
+		ret = owner_on_cpu(owner);
+
 	rcu_read_unlock();
 	preempt_enable();
-
 	lockevent_cond_inc(rwsem_opt_fail, !ret);
 	return ret;
 }
@@ -546,26 +594,27 @@ enum owner_state {
 	OWNER_READER		= 1 << 2,
 	OWNER_NONSPINNABLE	= 1 << 3,
 };
-#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
+#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER | OWNER_READER)
 
-static inline enum owner_state rwsem_owner_state(unsigned long owner)
+static inline enum owner_state rwsem_owner_state(unsigned long owner, bool wr)
 {
-	if (!owner)
-		return OWNER_NULL;
-
-	if (owner & RWSEM_NONSPINNABLE)
+	if (!is_rwsem_owner_spinnable((void *)owner, wr))
 		return OWNER_NONSPINNABLE;
 
 	if (owner & RWSEM_READER_OWNED)
 		return OWNER_READER;
 
+	if (!(owner & ~RWSEM_OWNER_FLAGS_MASK))
+		return OWNER_NULL;
+
 	return OWNER_WRITER;
 }
 
-static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
+static noinline enum owner_state
+rwsem_spin_on_owner(struct rw_semaphore *sem, bool wr)
 {
 	struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
-	enum owner_state state = rwsem_owner_state((unsigned long)owner);
+	enum owner_state state = rwsem_owner_state((unsigned long)owner, wr);
 
 	if (state != OWNER_WRITER)
 		return state;
@@ -579,7 +628,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 
 		tmp = READ_ONCE(sem->owner);
 		if (tmp != owner) {
-			state = rwsem_owner_state((unsigned long)tmp);
+			state = rwsem_owner_state((unsigned long)tmp, wr);
 			break;
 		}
 
@@ -603,10 +652,53 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 	return state;
 }
 
+/*
+ * Calculate reader-owned rwsem spinning threshold for writer
+ *
+ * It is assumed that the more readers own the rwsem, the longer it will
+ * take for them to wind down and free the rwsem. So the formula to
+ * determine the actual spinning time limit is:
+ *
+ * 1) RWSEM_FLAG_WAITERS set
+ *    Spinning threshold = (10 + nr_readers/2)us
+ *
+ * 2) RWSEM_FLAG_WAITERS not set
+ *    Spinning threshold = 25us
+ *
+ * In the first case when RWSEM_FLAG_WAITERS is set, no new reader can
+ * become rwsem owner. It is assumed that the more readers own the rwsem,
+ * the longer it will take for them to wind down and free the rwsem. In
+ * addition, if it happens that a previous task that releases the lock
+ * is in the process of waking up readers one-by-one, the process will
+ * take longer when more readers need to be woken up. This is subject
+ * to a maximum value of 25us.
+ *
+ * In the second case with RWSEM_FLAG_WAITERS off, new readers can join
+ * and become one of the owners. So assuming for the worst case and spin
+ * for at most 25us.
+ */
+static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
+{
+	long count = atomic_long_read(&sem->count);
+	u64 delta = 25 * NSEC_PER_USEC;
+
+	if (count & RWSEM_FLAG_WAITERS) {
+		int readers = count >> RWSEM_READER_SHIFT;
+
+		if (readers > 30)
+			readers = 30;
+		delta = (20 + readers) * NSEC_PER_USEC / 2;
+	}
+
+	return sched_clock() + delta;
+}
+
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	bool taken = false;
 	int prev_owner_state = OWNER_NULL;
+	int loop = 0;
+	u64 rspin_threshold = 0;
 
 	preempt_disable();
 
@@ -618,11 +710,10 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 	 * Optimistically spin on the owner field and attempt to acquire the
 	 * lock whenever the owner changes. Spinning will be stopped when:
 	 *  1) the owning writer isn't running; or
-	 *  2) readers own the lock as we can't determine if they are
-	 *     actively running or not.
+	 *  2) readers own the lock and spinning time has exceeded limit.
 	 */
 	for (;;) {
-		enum owner_state owner_state = rwsem_spin_on_owner(sem);
+		enum owner_state owner_state = rwsem_spin_on_owner(sem, wlock);
 
 		if (!(owner_state & OWNER_SPINNABLE))
 			break;
@@ -636,6 +727,39 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 		if (taken)
 			break;
 
+		/*
+		 * Time-based reader-owned rwsem optimistic spinning
+		 */
+		if (wlock && (owner_state == OWNER_READER)) {
+			/*
+			 * Re-initialize rspin_threshold every time when
+			 * the owner state changes from non-reader to reader.
+			 * This allows a writer to steal the lock in between
+			 * 2 reader phases and have the threshold reset at
+			 * the beginning of the 2nd reader phase.
+			 */
+			if (prev_owner_state != OWNER_READER) {
+				if (!is_rwsem_spinnable(sem, wlock))
+					break;
+				rspin_threshold = rwsem_rspin_threshold(sem);
+				loop = 0;
+			}
+
+			/*
+			 * Check time threshold once every 16 iterations to
+			 * avoid calling sched_clock() too frequently so
+			 * as to reduce the average latency between the times
+			 * when the lock becomes free and when the spinner
+			 * is ready to do a trylock.
+			 */
+			else if (!(++loop & 0xf) &&
+				 (sched_clock() > rspin_threshold)) {
+				rwsem_set_nonspinnable(sem);
+				lockevent_inc(rwsem_opt_nospin);
+				break;
+			}
+		}
+
 		/*
 		 * An RT task cannot do optimistic spinning if it cannot
 		 * be sure the lock holder is running or live-lock may
@@ -690,8 +814,25 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 	lockevent_cond_inc(rwsem_opt_fail, !taken);
 	return taken;
 }
+
+/*
+ * Clear the owner's RWSEM_WR_NONSPINNABLE bit if it is set. This should
+ * only be called when the reader count reaches 0.
+ *
+ * This give writers better chance to acquire the rwsem first before
+ * readers when the rwsem was being held by readers for a relatively long
+ * period of time. Race can happen that an optimistic spinner may have
+ * just stolen the rwsem and set the owner, but just clearing the
+ * RWSEM_WR_NONSPINNABLE bit will do no harm anyway.
+ */
+static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
+{
+	if (!is_rwsem_spinnable(sem, true))
+		atomic_long_andnot(RWSEM_WR_NONSPINNABLE,
+				  (atomic_long_t *)&sem->owner);
+}
 #else
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, bool wr)
 {
 	return false;
 }
@@ -700,6 +841,8 @@ static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	return false;
 }
+
+static inline void clear_wr_nonspinnable(struct rw_semaphore *sem) { }
 #endif
 
 /*
@@ -709,10 +852,11 @@ static struct rw_semaphore __sched *
 rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count, adjustment = -RWSEM_READER_BIAS;
+	bool wake = false;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
-	if (!rwsem_can_spin_on_owner(sem))
+	if (!rwsem_can_spin_on_owner(sem, false))
 		goto queue;
 
 	/*
@@ -772,8 +916,12 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	 * If there are no writers and we are first in the queue,
 	 * wake our own waiter to join the existing active readers !
 	 */
-	if (!(count & RWSEM_LOCK_MASK) ||
-	   (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
+	if (!(count & RWSEM_LOCK_MASK)) {
+		clear_wr_nonspinnable(sem);
+		wake = true;
+	}
+	if (wake || (!(count & RWSEM_WRITER_MASK) &&
+		    (adjustment & RWSEM_FLAG_WAITERS)))
 		__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
 	raw_spin_unlock_irq(&sem->wait_lock);
@@ -823,7 +971,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	DEFINE_WAKE_Q(wake_q);
 
 	/* do optimistic spinning and steal lock if possible */
-	if (rwsem_can_spin_on_owner(sem) &&
+	if (rwsem_can_spin_on_owner(sem, true) &&
 	    rwsem_optimistic_spin(sem, true))
 		return sem;
 
@@ -1093,8 +1241,10 @@ inline void __up_read(struct rw_semaphore *sem)
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
 	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
-		      RWSEM_FLAG_WAITERS))
+		      RWSEM_FLAG_WAITERS)) {
+		clear_wr_nonspinnable(sem);
 		rwsem_wake(sem, tmp);
+	}
 }
 
 /*
@@ -1130,7 +1280,7 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 	 * read-locked region is ok to be re-ordered into the
 	 * write side. As such, rely on RELEASE semantics.
 	 */
-	DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+	DEBUG_RWSEMS_WARN_ON(owner_without_flags(sem->owner) != current, sem);
 	tmp = atomic_long_fetch_add_release(
 		-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
 	rwsem_set_reader_owned(sem);
-- 
2.18.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH-tip v7 15/20] locking/rwsem: Adaptive disabling of reader optimistic spinning
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (13 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 14/20] locking/rwsem: Enable time-based spinning on reader-owned rwsem Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 16/20] locking/rwsem: Add more rwsem owner access helpers Waiman Long
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

Reader optimistic spinning is helpful when the reader critical section
is short and there aren't that many readers around. It makes readers
relatively more preferred than writers. When a writer times out spinning
on a reader-owned lock and sets the nonspinnable bits, there are two main
reasons for that.

 1) The reader critical section is long, perhaps the task sleeps after
    acquiring the read lock.
 2) There are just too many readers contending for the lock, causing it
    to take a while to service all of them.

In the former case, a long reader critical section will impede the
progress of writers, which is usually more important for system
performance. In the latter case, reader optimistic spinning tends to
make the groups of readers that acquire the lock together smaller,
leading to more of them. That may hurt performance in some cases. In
other words, the setting of the nonspinnable bits indicates that reader
optimistic spinning may not be helpful for the workloads that cause it.

Therefore, any writers that have observed the setting of the writer
nonspinnable bit for a given rwsem after they fail to acquire the lock
via optimistic spinning will set the reader nonspinnable bit once they
acquire the write lock. Similarly, readers that observe the setting
of reader nonspinnable bit at slowpath entry will also set the reader
nonspinnable bit when they acquire the read lock via the wakeup path.
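
In condensed form (an illustrative sketch only; the real hunks follow
below), the writer-side handshake looks like:

	/* writer slowpath, after optimistic spinning has failed */
	disable_rspin = (long)READ_ONCE(sem->owner) & RWSEM_NONSPINNABLE;

	/* ... queue, sleep, eventually acquire the write lock ... */

	if (unlikely(disable_rspin)) {
		/* disable reader optimistic spinning on this rwsem */
		*((unsigned long *)&sem->owner) |= RWSEM_RD_NONSPINNABLE;
		lockevent_inc(rwsem_opt_norspin);
	}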

Once the reader nonspinnable bit is on, it will only be reset when
a writer is able to acquire the rwsem in the fast path or somehow a
reader or writer in the slowpath doesn't observe the nonspinnable bit.

This is to discourage reader optimistic spinning on that particular
rwsem and make writers more preferred. This adaptive disabling of reader
optimistic spinning will alleviate some of the negative side effects of
this feature.

In addition, this patch tries to make readers in the spinning queue
follow the phase-fair principle after quitting optimistic spinning
by checking if another reader has somehow acquired a read lock after
this reader enters the optimistic spinning queue. If so and the rwsem
is still reader-owned, this reader is in the right read-phase and can
attempt to acquire the lock.
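
Condensed into a sketch (illustrative only; the actual helper
rwsem_reader_phase_trylock() is in the hunk below), the check is:

	/* reader failed to opt-spin; last_rowner was sampled at slowpath entry */
	unsigned long owner = (unsigned long)READ_ONCE(sem->owner);

	if ((owner & RWSEM_READER_OWNED) &&
	    ((owner & ~RWSEM_OWNER_FLAGS_MASK) !=
	     (last_rowner & ~RWSEM_OWNER_FLAGS_MASK)))
		return rwsem_try_read_lock_unqueued(sem);	/* still in a read phase */
	return false;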

On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of
the will-it-scale benchmark was run with various numbers of threads. The
numbers of operations done before the reader optimistic spinning patches,
before this patch, and after this patch were:

  Threads  Before rspin  Before patch  After patch    %change
  -------  ------------  ------------  -----------    -------
    20        5541068      5345484       5455667    -3.5%/ +2.1%
    40       10185150      7292313       9219276   -28.5%/+26.4%
    60        8196733      6460517       7181209   -21.2%/+11.2%
    80        9508864      6739559       8107025   -29.1%/+20.3%

This patch doesn't recover all of the lost performance, but it recovers
more than half of it. Given that reader optimistic spinning does benefit
some workloads, this is a good compromise.

Using the rwsem locking microbenchmark with a very short critical section,
this patch doesn't have too much impact on locking performance as shown
by the locking rates (kops/s) below with equal numbers of readers and
writers before and after this patch:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        2          4,730        4,969
        4          4,814        4,786
        8          4,866        4,815
       16          4,715        4,511
       32          3,338        3,500
       64          3,212        3,389
       80          3,110        3,044

When running the locking microbenchmark with 40 dedicated reader and writer
threads, however, the reader performance is curtailed to favor the writer.

Before patch:

  40 readers, Iterations Min/Mean/Max = 204,026/234,309/254,816
  40 writers, Iterations Min/Mean/Max = 88,515/95,884/115,644

After patch:

  40 readers, Iterations Min/Mean/Max = 33,813/35,260/36,791
  40 writers, Iterations Min/Mean/Max = 95,368/96,565/97,798

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h |  10 ++-
 kernel/locking/rwsem.c            | 131 +++++++++++++++++++++++++++++-
 2 files changed, 133 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index baa998401052..239039d0ce21 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -56,10 +56,12 @@ LOCK_EVENT(rwsem_sleep_reader)	/* # of reader sleeps			*/
 LOCK_EVENT(rwsem_sleep_writer)	/* # of writer sleeps			*/
 LOCK_EVENT(rwsem_wake_reader)	/* # of reader wakeups			*/
 LOCK_EVENT(rwsem_wake_writer)	/* # of writer wakeups			*/
-LOCK_EVENT(rwsem_opt_rlock)	/* # of read locks opt-spin acquired	*/
-LOCK_EVENT(rwsem_opt_wlock)	/* # of write locks opt-spin acquired	*/
-LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
-LOCK_EVENT(rwsem_opt_nospin)	/* # of disabled reader opt-spinnings	*/
+LOCK_EVENT(rwsem_opt_rlock)	/* # of opt-acquired read locks		*/
+LOCK_EVENT(rwsem_opt_wlock)	/* # of opt-acquired write locks	*/
+LOCK_EVENT(rwsem_opt_fail)	/* # of failed optspins			*/
+LOCK_EVENT(rwsem_opt_nospin)	/* # of disabled optspins		*/
+LOCK_EVENT(rwsem_opt_norspin)	/* # of disabled reader-only optspins	*/
+LOCK_EVENT(rwsem_opt_rlock2)	/* # of opt-acquired 2ndary read locks	*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 62fcb8bacfab..a5025a63b7c3 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -59,6 +59,42 @@
  * seems to hang on a reader owned rwsem especially if only one reader
  * is involved. Ideally we would like to track all the readers that own
  * a rwsem, but the overhead is simply too big.
+ *
+ * Reader optimistic spinning is helpful when the reader critical section
+ * is short and there aren't that many readers around. It makes readers
+ * relatively more preferred than writers. When a writer times out spinning
+ * on a reader-owned lock and sets the nonspinnable bits, there are two main
+ * reasons for that.
+ *
+ *  1) The reader critical section is long, perhaps the task sleeps after
+ *     acquiring the read lock.
+ *  2) There are just too many readers contending for the lock, causing it
+ *     take a while to service all of them.
+ *
+ * In the former case, a long reader critical section will impede the progress
+ * of writers, which is usually more important for system performance. In
+ * the latter case, reader optimistic spinning tends to make the reader
+ * groups that contain readers that acquire the lock together smaller
+ * leading to more of them. That may hurt performance in some cases. In
+ * other words, the setting of nonspinnable bits indicates that reader
+ * optimistic spinning may not be helpful for those workloads that cause
+ * it.
+ *
+ * Therefore, any writers that had observed the setting of the writer
+ * nonspinnable bit for a given rwsem after they fail to acquire the lock
+ * via optimistic spinning will set the reader nonspinnable bit once they
+ * acquire the write lock. Similarly, readers that observe the setting
+ * of reader nonspinnable bit at slowpath entry will set the reader
+ * nonspinnable bits when they acquire the read lock via the wakeup path.
+ *
+ * Once the reader nonspinnable bit is on, it will only be reset when
+ * a writer is able to acquire the rwsem in the fast path or somehow a
+ * reader or writer in the slowpath doesn't observe the nonspinnable bit.
+ *
+ * This is to discourage reader optimistic spinning on that particular
+ * rwsem and make writers more preferred. This adaptive disabling of reader
+ * optimistic spinning will alleviate the negative side effect of this
+ * feature.
  */
 #define RWSEM_READER_OWNED	(1UL << 0)
 #define RWSEM_RD_NONSPINNABLE	(1UL << 1)
@@ -136,11 +172,15 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
  * Note that the owner value just indicates the task has owned the rwsem
  * previously, it may not be the real owner or one of the real owners
  * anymore when that field is examined, so take it with a grain of salt.
+ *
+ * The reader non-spinnable bit is preserved.
  */
 static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
 					    struct task_struct *owner)
 {
-	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED;
+	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED |
+			   ((unsigned long)READ_ONCE(sem->owner) &
+			     RWSEM_RD_NONSPINNABLE);
 
 	WRITE_ONCE(sem->owner, (struct task_struct *)val);
 }
@@ -265,6 +305,7 @@ struct rwsem_waiter {
 	struct task_struct *task;
 	enum rwsem_waiter_type type;
 	unsigned long timeout;
+	unsigned long last_rowner;
 };
 #define rwsem_first_waiter(sem) \
 	list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
@@ -346,6 +387,8 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 	 * so we can bail out early if a writer stole the lock.
 	 */
 	if (wake_type != RWSEM_WAKE_READ_OWNED) {
+		struct task_struct *owner;
+
 		adjustment = RWSEM_READER_BIAS;
 		oldcount = atomic_long_fetch_add(adjustment, &sem->count);
 		if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
@@ -366,8 +409,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		/*
 		 * Set it to reader-owned to give spinners an early
 		 * indication that readers now have the lock.
+		 * The reader nonspinnable bit seen at slowpath entry of
+		 * the reader is copied over.
 		 */
-		__rwsem_set_reader_owned(sem, waiter->task);
+		owner = waiter->task;
+		if (waiter->last_rowner & RWSEM_RD_NONSPINNABLE) {
+			owner = (void *)((long)owner | RWSEM_RD_NONSPINNABLE);
+			lockevent_inc(rwsem_opt_norspin);
+		}
+		__rwsem_set_reader_owned(sem, owner);
 	}
 
 	/*
@@ -831,6 +881,43 @@ static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
 		atomic_long_andnot(RWSEM_WR_NONSPINNABLE,
 				  (atomic_long_t *)&sem->owner);
 }
+
+/*
+ * This function is called when the reader fails to acquire the lock via
+ * optimistic spinning. In this case we will still attempt to do a trylock
+ * when comparing the rwsem state right now with the state when entering
+ * the slowpath indicates that the reader is still in a valid reader phase.
+ * This happens when the following conditions are true:
+ *
+ * 1) The lock is currently reader owned, and
+ * 2) The lock is previously not reader-owned or the last read owner changes.
+ *
+ * In the former case, we have transitioned from a writer phase to a
+ * reader-phase while spinning. In the latter case, it means the reader
+ * phase hasn't ended when we entered the optimistic spinning loop. In
+ * both cases, the reader is eligible to acquire the lock. This is the
+ * secondary path where a read lock is acquired optimistically.
+ *
+ * The reader non-spinnable bit wasn't set at time of entry or it will
+ * not be here at all.
+ */
+static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
+					      unsigned long last_rowner)
+{
+	unsigned long owner = (unsigned long)READ_ONCE(sem->owner);
+
+	if (!(owner & RWSEM_READER_OWNED))
+		return false;
+
+	owner	    &= ~RWSEM_OWNER_FLAGS_MASK;
+	last_rowner &= ~RWSEM_OWNER_FLAGS_MASK;
+	if ((owner != last_rowner) && rwsem_try_read_lock_unqueued(sem)) {
+		lockevent_inc(rwsem_opt_rlock2);
+		lockevent_add(rwsem_opt_fail, -1);
+		return true;
+	}
+	return false;
+}
 #else
 static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, bool wr)
 {
@@ -843,6 +930,12 @@ static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 }
 
 static inline void clear_wr_nonspinnable(struct rw_semaphore *sem) { }
+
+static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
+					      unsigned long last_rowner)
+{
+	return false;
+}
 #endif
 
 /*
@@ -856,6 +949,14 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
+	/*
+	 * Save the current read-owner of rwsem, if available, and the
+	 * reader nonspinnable bit.
+	 */
+	waiter.last_rowner = (long)READ_ONCE(sem->owner);
+	if (!(waiter.last_rowner & RWSEM_READER_OWNED))
+		waiter.last_rowner &= RWSEM_RD_NONSPINNABLE;
+
 	if (!rwsem_can_spin_on_owner(sem, false))
 		goto queue;
 
@@ -878,6 +979,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 			wake_up_q(&wake_q);
 		}
 		return sem;
+	} else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) {
+		return sem;
 	}
 
 queue:
@@ -958,6 +1061,15 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	return ERR_PTR(-EINTR);
 }
 
+static inline void rwsem_disable_reader_optspin(struct rw_semaphore *sem,
+						bool disable)
+{
+	if (unlikely(disable)) {
+		*((unsigned long *)&sem->owner) |= RWSEM_RD_NONSPINNABLE;
+		lockevent_inc(rwsem_opt_norspin);
+	}
+}
+
 /*
  * Wait until we successfully acquire the write lock
  */
@@ -965,6 +1077,7 @@ static struct rw_semaphore *
 rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count;
+	bool disable_rspin;
 	enum writer_wait_state wstate;
 	struct rwsem_waiter waiter;
 	struct rw_semaphore *ret = sem;
@@ -975,6 +1088,13 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	    rwsem_optimistic_spin(sem, true))
 		return sem;
 
+	/*
+	 * Disable reader optimistic spinning for this rwsem after
+	 * acquiring the write lock when the nonspinnable bits are
+	 * observed to be set.
+	 */
+	disable_rspin = (long)READ_ONCE(sem->owner) & RWSEM_NONSPINNABLE;
+
 	/*
 	 * Optimistic spinning failed, proceed to the slowpath
 	 * and block until we can acquire the sem.
@@ -1081,6 +1201,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	}
 	__set_current_state(TASK_RUNNING);
 	list_del(&waiter.list);
+	rwsem_disable_reader_optspin(sem, disable_rspin);
 	raw_spin_unlock_irq(&sem->wait_lock);
 	lockevent_inc(rwsem_wlock);
 
@@ -1202,7 +1323,8 @@ static inline void __down_write(struct rw_semaphore *sem)
 	if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
 						      RWSEM_WRITER_LOCKED)))
 		rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE);
-	rwsem_set_owner(sem);
+	else
+		rwsem_set_owner(sem);
 }
 
 static inline int __down_write_killable(struct rw_semaphore *sem)
@@ -1213,8 +1335,9 @@ static inline int __down_write_killable(struct rw_semaphore *sem)
 						      RWSEM_WRITER_LOCKED))) {
 		if (IS_ERR(rwsem_down_write_slowpath(sem, TASK_KILLABLE)))
 			return -EINTR;
+	} else {
+		rwsem_set_owner(sem);
 	}
-	rwsem_set_owner(sem);
 	return 0;
 }
 
-- 
2.18.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH-tip v7 16/20] locking/rwsem: Add more rwsem owner access helpers
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (14 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 15/20] locking/rwsem: Adaptive disabling of reader optimistic spinning Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 17/20] locking/rwsem: Guard against making count negative Waiman Long
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

Before combining owner and count, we are adding two new helpers for
accessing the owner value in the rwsem.

 1) struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
 2) bool is_rwsem_reader_owned(struct rw_semaphore *sem)

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 41 ++++++++++++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index a5025a63b7c3..1a4f2c1c27c4 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -165,6 +165,11 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 	WRITE_ONCE(sem->owner, NULL);
 }
 
+static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
+{
+	return READ_ONCE(sem->owner);
+}
+
 /*
  * The task_struct pointer of the last owning reader will be left in
  * the owner field.
@@ -218,6 +223,23 @@ static inline struct task_struct *owner_without_flags(struct task_struct *owner)
 	return (struct task_struct *)((long)owner & ~RWSEM_OWNER_FLAGS_MASK);
 }
 
+/*
+ * Return true if the rwsem is owned by a reader.
+ */
+static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
+{
+#ifdef CONFIG_DEBUG_RWSEMS
+	/*
+	 * Check the count to see if it is write-locked.
+	 */
+	long count = atomic_long_read(&sem->count);
+
+	if (count & RWSEM_WRITER_MASK)
+		return false;
+#endif
+	return (unsigned long)sem->owner & RWSEM_READER_OWNED;
+}
+
 #ifdef CONFIG_DEBUG_RWSEMS
 /*
  * With CONFIG_DEBUG_RWSEMS configured, it will make sure that if there
@@ -230,7 +252,7 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 	unsigned long owner = (unsigned long)READ_ONCE(sem->owner);
 
 	if ((owner & ~RWSEM_OWNER_FLAGS_MASK) == (unsigned long)current)
-		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
+		cmpxchg_relaxed((unsigned long *)&sem->owner, owner,
 				owner & RWSEM_OWNER_FLAGS_MASK);
 }
 #else
@@ -613,7 +635,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, bool wr)
 	preempt_disable();
 	rcu_read_lock();
 
-	owner = READ_ONCE(sem->owner);
+	owner = rwsem_get_owner(sem);
 	if (!is_rwsem_owner_spinnable(owner, wr))
 		ret = false;
 	else if ((unsigned long)owner & RWSEM_READER_OWNED)
@@ -663,7 +685,7 @@ static inline enum owner_state rwsem_owner_state(unsigned long owner, bool wr)
 static noinline enum owner_state
 rwsem_spin_on_owner(struct rw_semaphore *sem, bool wr)
 {
-	struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
+	struct task_struct *tmp, *owner = rwsem_get_owner(sem);
 	enum owner_state state = rwsem_owner_state((unsigned long)owner, wr);
 
 	if (state != OWNER_WRITER)
@@ -676,7 +698,7 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, bool wr)
 			break;
 		}
 
-		tmp = READ_ONCE(sem->owner);
+		tmp = rwsem_get_owner(sem);
 		if (tmp != owner) {
 			state = rwsem_owner_state((unsigned long)tmp, wr);
 			break;
@@ -1275,8 +1297,7 @@ inline void __down_read(struct rw_semaphore *sem)
 	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
+		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
@@ -1288,8 +1309,7 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 			&sem->count) & RWSEM_READ_FAILED_MASK)) {
 		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
 			return -EINTR;
-		DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-					RWSEM_READER_OWNED), sem);
+		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
@@ -1360,7 +1380,7 @@ inline void __up_read(struct rw_semaphore *sem)
 {
 	long tmp;
 
-	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED), sem);
+	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
 	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
@@ -1574,8 +1594,7 @@ EXPORT_SYMBOL(down_write_killable_nested);
 
 void up_read_non_owner(struct rw_semaphore *sem)
 {
-	DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
-				sem);
+	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	__up_read(sem);
 }
 EXPORT_SYMBOL(up_read_non_owner);
-- 
2.18.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH-tip v7 17/20] locking/rwsem: Guard against making count negative
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (15 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 16/20] locking/rwsem: Add more rwsem owner access helpers Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 18/20] locking/rwsem: Merge owner into count on x86-64 Waiman Long
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

The upper bits of the count field are used as the reader count. When a
sufficient number of active readers are present, the most significant
bit will be set and the count becomes negative. If active readers keep
piling up, we may eventually overflow the reader count. This is not
likely to happen unless the number of bits reserved for the reader count
is reduced because those bits are needed for other purposes.

To prevent this count overflow from happening, the most significant
bit is now treated as a guard bit (RWSEM_FLAG_READFAIL). Read-lock
attempts will now fail for both the fast and slow paths whenever this
bit is set. So all those extra readers will be put to sleep in the wait
list. Wakeup will not happen until the reader count reaches 0.
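
To illustrate the guard-bit idea, here is a minimal userspace sketch
(the names and constants are made up for illustration; the real logic is
the rwsem_read_trylock() helper added in the diff below):

  #include <stdatomic.h>
  #include <stdbool.h>

  #define READER_BIAS  (1L << 8)   /* illustrative, mirrors RWSEM_READER_BIAS */

  /* Take a read lock unless the guard bit (the MSB) is already set. */
  static bool read_trylock_guarded(atomic_long *count)
  {
          long old = atomic_fetch_add_explicit(count, READER_BIAS,
                                               memory_order_acquire);

          if (old < 0) {            /* guard bit set: too many readers */
                  atomic_fetch_sub_explicit(count, READER_BIAS,
                                            memory_order_relaxed);
                  return false;     /* caller must queue in the wait list */
          }
          return true;
  }

  int main(void)
  {
          atomic_long count = ATOMIC_VAR_INIT(0);

          return read_trylock_guarded(&count) ? 0 : 1;
  }

Note that the check is on the value returned by the fetch-add (the
pre-increment count), matching the convention used by rwsem_read_trylock().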

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 95 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 80 insertions(+), 15 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 1a4f2c1c27c4..1ab1762e9988 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -116,13 +116,28 @@
 #endif
 
 /*
- * The definition of the atomic counter in the semaphore:
+ * On 64-bit architectures, the bit definitions of the count are:
  *
- * Bit  0   - writer locked bit
- * Bit  1   - waiters present bit
- * Bit  2   - lock handoff bit
- * Bits 3-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ * Bit  0    - writer locked bit
+ * Bit  1    - waiters present bit
+ * Bit  2    - lock handoff bit
+ * Bits 3-7  - reserved
+ * Bits 8-62 - 55-bit reader count
+ * Bit  63   - read fail bit
+ *
+ * On 32-bit architectures, the bit definitions of the count are:
+ *
+ * Bit  0    - writer locked bit
+ * Bit  1    - waiters present bit
+ * Bit  2    - lock handoff bit
+ * Bits 3-7  - reserved
+ * Bits 8-30 - 23-bit reader count
+ * Bit  31   - read fail bit
+ *
+ * It is not likely that the most significant bit (read fail bit) will ever
+ * be set. This guard bit is still checked anyway in the down_read() fastpath
+ * just in case we need to use up more of the reader bits for other purposes
+ * in the future.
  *
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
@@ -139,6 +154,7 @@
 #define RWSEM_WRITER_LOCKED	(1UL << 0)
 #define RWSEM_FLAG_WAITERS	(1UL << 1)
 #define RWSEM_FLAG_HANDOFF	(1UL << 2)
+#define RWSEM_FLAG_READFAIL	(1UL << (BITS_PER_LONG - 1))
 
 #define RWSEM_READER_SHIFT	8
 #define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
@@ -146,7 +162,7 @@
 #define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
 #define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
 #define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
-				 RWSEM_FLAG_HANDOFF)
+				 RWSEM_FLAG_HANDOFF|RWSEM_FLAG_READFAIL)
 
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -277,6 +293,28 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
 	}
 }
 
+/*
+ * This function does a read trylock by incrementing the reader count
+ * and then decrementing it immediately if too many readers are present
+ * (count becomes negative) in order to prevent the remote possibility
+ * of overflowing the count with minimal delay between the increment
+ * and decrement.
+ *
+ * It returns the adjustment that should be added back to the count
+ * in the slowpath.
+ */
+static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
+{
+	long adjustment = -RWSEM_READER_BIAS;
+
+	*cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
+	if (unlikely(*cnt < 0)) {
+		atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+		adjustment = 0;
+	}
+	return adjustment;
+}
+
 /*
  * Guide to the rw_semaphore's count field.
  *
@@ -403,6 +441,12 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		return;
 	}
 
+	/*
+	 * No reader wakeup if there are too many of them already.
+	 */
+	if (unlikely(atomic_long_read(&sem->count) < 0))
+		return;
+
 	/*
 	 * Writers might steal the lock before we grant it to the next reader.
 	 * We prefer to do the first reader grant before counting readers
@@ -964,13 +1008,30 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
  * Wait for the read lock to be granted
  */
 static struct rw_semaphore __sched *
-rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
+rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
 {
-	long count, adjustment = -RWSEM_READER_BIAS;
+	long count;
 	bool wake = false;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
+	if (unlikely(!adjustment)) {
+		/*
+		 * This shouldn't happen. If it does, there is probably
+		 * something wrong in the system.
+		 */
+		WARN_ON_ONCE(1);
+
+		/*
+		 * An adjustment of 0 means that there are too many readers
+		 * holding or trying to acquire the lock. So disable
+		 * optimistic spinning and go directly into the wait list.
+		 */
+		if (is_rwsem_spinnable(sem, false))
+			rwsem_set_nonspinnable(sem);
+		goto queue;
+	}
+
 	/*
 	 * Save the current read-owner of rwsem, if available, and the
 	 * reader nonspinnable bit.
@@ -1294,9 +1355,10 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
  */
 inline void __down_read(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
+	long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
+
+	if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
+		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, adjustment);
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
@@ -1305,9 +1367,11 @@ inline void __down_read(struct rw_semaphore *sem)
 
 static inline int __down_read_killable(struct rw_semaphore *sem)
 {
-	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
-			&sem->count) & RWSEM_READ_FAILED_MASK)) {
-		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
+	long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
+
+	if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
+		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE,
+						    adjustment)))
 			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
@@ -1383,6 +1447,7 @@ inline void __up_read(struct rw_semaphore *sem)
 	DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	rwsem_clear_reader_owned(sem);
 	tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+	DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
 	if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
 		      RWSEM_FLAG_WAITERS)) {
 		clear_wr_nonspinnable(sem);
-- 
2.18.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH-tip v7 18/20] locking/rwsem: Merge owner into count on x86-64
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (16 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 17/20] locking/rwsem: Guard against making count negative Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 19/20] locking/rwsem: Remove redundant computation of writer lock word Waiman Long
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

With separate count and owner, there are timing windows where the two
values are inconsistent. That can cause problems when trying to figure
out the exact state of the rwsem. For instance, a RT task will stop
optimistic spinning if the lock is acquired by a writer but the owner
field isn't set yet. That can be solved by combining the count and
owner together in a single atomic value.

On 32-bit architectures, there aren't enough bits to hold both.
64-bit architectures, however, can have enough bits to do that. For
x86-64, the physical address can use up to 52 bits. That is 4PB of
memory. That leaves 12 bits available for other use. The task structure
pointer is aligned to the L1 cache size. That means another 6 bits
(64-byte cacheline) will be available. Reserving 2 bits for status
flags, we will have 16 bits for the reader count and the read fail bit.
That supports up to (32k-1) readers. Without a 5-level page table, we
can support up to (2M-1) readers.
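
For anyone checking those numbers, the arithmetic can be written down as
a small standalone C sketch (the names are illustrative stand-ins for the
__READER_SHIFT and __READER_COUNT_BITS macros added later in this patch):

  #define PA_BITS_5LEVEL  52
  #define PA_BITS_4LEVEL  46
  #define L1_SHIFT        6                       /* 64-byte cacheline */

  #define READER_SHIFT(pa)  ((pa) - L1_SHIFT + 2)
  #define READER_BITS(pa)   (64 - READER_SHIFT(pa) - 1)  /* minus read fail bit */

  _Static_assert(READER_BITS(PA_BITS_5LEVEL) == 15, "up to 32k-1 readers");
  _Static_assert(READER_BITS(PA_BITS_4LEVEL) == 21, "up to 2M-1 readers");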

The owner value will still be duplicated in the owner field as that will
ease debugging when looking at a core dump. There may be a slight overhead
in transforming the task pointer to fit into a smaller number of bits,
but that shouldn't be noticeable in real workloads.
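
A minimal userspace model of that transformation is sketched below; the
constants are stand-ins for PAGE_OFFSET and L1_CACHE_SHIFT, and the real
helpers are rwsem_owner_count() and rwsem_count_owner() in the diff:

  #include <assert.h>

  #define FAKE_PAGE_OFFSET  0xffff888000000000UL  /* stand-in for PAGE_OFFSET */
  #define FAKE_L1_SHIFT     6                     /* 64-byte cacheline */

  /* Compress a cacheline-aligned, direct-mapped pointer into fewer bits. */
  static unsigned long compress(unsigned long task_ptr)
  {
          return (task_ptr - FAKE_PAGE_OFFSET) >> (FAKE_L1_SHIFT - 2);
  }

  /* Recover the original pointer from the compressed value. */
  static unsigned long decompress(unsigned long val)
  {
          return (val << (FAKE_L1_SHIFT - 2)) + FAKE_PAGE_OFFSET;
  }

  int main(void)
  {
          unsigned long task = FAKE_PAGE_OFFSET + 0x12340UL; /* 64-byte aligned */

          assert(decompress(compress(task)) == task);
          return 0;
  }

The shift is (L1_CACHE_SHIFT - 2) rather than the full L1_CACHE_SHIFT so
that the compressed value keeps its two lowest bits clear; those bits
hold the waiters and handoff flags in the count layout introduced by
this patch.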

This change is currently enabled for x86-64 only. Other 64-bit
architectures may be enabled in the future if the need arises.

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on a 8-socket IvyBridge-EX system with
writer-only locking threads and then equal numbers of readers and writers
(mixed) before patch and after this and subsequent related patches were
as follows:

                  Before Patch      After Patch
   # of Threads  wlock    mixed    wlock    mixed
   ------------  -----    -----    -----    -----
        1        30,422   31,034   30,323   30,379
        2         6,427    6,684    7,804    9,436
        4         6,742    6,738    7,568    8,268
        8         7,092    7,222    5,679    7,041
       16         6,882    7,163    6,848    7,652
       32         7,458    7,316    7,975    2,189
       64         7,906      520    8,269      534
      128         1,680      425    8,047      448

In the single thread case, the complex write-locking operation does
introduce a little bit of overhead (about 0.3%). For the contended cases,
except for some anomalies in the data, there is no evidence that this
change will adversely impact performance.

When running the same microbenchmark with RT locking threads instead,
we got the following results:

                  Before Patch      After Patch
   # of Threads  wlock    mixed    wlock    mixed
   ------------  -----    -----    -----    -----
        2         4,065    3,642    4,756    5,062
        4         2,254    1,907    3,460    2,496
        8         2,386      964    3,012    1,964
       16         2,095    1,596    3,083    1,862
       32         2,388      530    3,717      359
       64         1,424      322    4,060      401
      128         1,642      510    4,488      628

It is obvious that RT tasks can benefit pretty significantly with this set
of patches.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 arch/x86/Kconfig       |   6 ++
 kernel/Kconfig.locks   |  12 ++++
 kernel/locking/rwsem.c | 159 +++++++++++++++++++++++++++++++++++++----
 3 files changed, 163 insertions(+), 14 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7d160f58a8f6..82a8c02f1b44 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -80,6 +80,7 @@ config X86
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
+	select ARCH_USE_RWSEM_OWNER_COUNT	if X86_64
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
 	select ARCH_WANTS_THP_SWAP		if X86_64
@@ -350,6 +351,11 @@ config PGTABLE_LEVELS
 	default 3 if X86_PAE
 	default 2
 
+config RWSEM_OWNER_COUNT_PA_BITS
+	int
+	default 52 if X86_5LEVEL
+	default 46 if X86_64
+
 config CC_HAS_SANE_STACKPROTECTOR
 	bool
 	default $(success,$(srctree)/scripts/gcc-x86_64-has-stack-protector.sh $(CC)) if 64BIT
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index e335953fa704..818e595831a1 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -251,3 +251,15 @@ config ARCH_USE_QUEUED_RWLOCKS
 config QUEUED_RWLOCKS
 	def_bool y if ARCH_USE_QUEUED_RWLOCKS
 	depends on SMP
+
+#
+# A 64-bit architecture that wants to merge rwsem write-owner into
+# count should select ARCH_USE_RWSEM_OWNER_COUNT and define
+# RWSEM_OWNER_COUNT_PA_BITS as the correct number of physical address
+# bits. In addition, the number of bits available for reader count
+# should allow all the CPUs as defined in NR_CPUS to acquire the same
+# read lock without overflowing it.
+#
+config ARCH_USE_RWSEM_OWNER_COUNT
+	bool
+	depends on SMP && 64BIT
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 1ab1762e9988..9120fde02e0d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -116,7 +116,38 @@
 #endif
 
 /*
- * On 64-bit architectures, the bit definitions of the count are:
+ * With separate count and owner, there are timing windows where the two
+ * values are inconsistent. That can cause problems when trying to figure
+ * out the exact state of the rwsem. That can be solved by combining
+ * the count and owner together in a single atomic value.
+ *
+ * On 64-bit architectures, the owner task structure pointer can be
+ * compressed and combined with reader count and other status flags.
+ * A simple compression method is to map the virtual address back to
+ * the physical address by subtracting PAGE_OFFSET. On 32-bit
+ * architectures, the long integer value just isn't big enough for
+ * combining owner and count. So they remain separate.
+ *
+ * For x86-64, the physical address can use up to 52 bits if
+ * CONFIG_X86_5LEVEL. That is 4PB of memory. That leaves 12 bits
+ * available for other use. The task structure pointer is also aligned
+ * to the L1 cache size. That means another 6 bits (64-byte cacheline)
+ * will be available. Reserving 2 bits for status flags, we will have
+ * 16 bits for the reader count and read fail bit. That supports up
+ * to (32k-1) active readers. If 5-level page table support isn't
+ * configured, we can support up to (2M-1) active readers.
+ *
+ * On x86-64 with CONFIG_X86_5LEVEL and CONFIG_ARCH_USE_RWSEM_OWNER_COUNT,
+ * the bit definitions of the count are:
+ *
+ * Bit   0    - waiters present bit
+ * Bit   1    - lock handoff bit
+ * Bits  2-47 - compressed task structure pointer
+ * Bits 48-62 - 15-bit reader counts
+ * Bit  63    - read fail bit
+ *
+ * On other 64-bit architectures without MERGE_OWNER_INTO_COUNT, the bit
+ * definitions are:
  *
  * Bit  0    - writer locked bit
  * Bit  1    - waiters present bit
@@ -151,26 +182,81 @@
  * be the first one in the wait_list to be eligible for setting the handoff
  * bit. So concurrent setting/clearing of handoff bit is not possible.
  */
-#define RWSEM_WRITER_LOCKED	(1UL << 0)
-#define RWSEM_FLAG_WAITERS	(1UL << 1)
-#define RWSEM_FLAG_HANDOFF	(1UL << 2)
+#define RWSEM_FLAG_WAITERS	(1UL << 0)
+#define RWSEM_FLAG_HANDOFF	(1UL << 1)
 #define RWSEM_FLAG_READFAIL	(1UL << (BITS_PER_LONG - 1))
 
+/*
+ * The MERGE_OWNER_INTO_COUNT macro will only be defined if the following
+ * conditions are true:
+ *  1) Both CONFIG_ARCH_USE_RWSEM_OWNER_COUNT and
+ *     CONFIG_RWSEM_OWNER_COUNT_PA_BITS are defined.
+ *  2) The number of reader count bits available is able to hold the
+ *     maximum number of CPUs as defined in NR_CPUS.
+ */
+#if	defined(CONFIG_ARCH_USE_RWSEM_OWNER_COUNT) && \
+	defined(CONFIG_RWSEM_OWNER_COUNT_PA_BITS)
+# define __READER_SHIFT		(CONFIG_RWSEM_OWNER_COUNT_PA_BITS -\
+				 L1_CACHE_SHIFT + 2)
+# define __READER_COUNT_BITS	(BITS_PER_LONG - __READER_SHIFT - 1)
+# define __READER_COUNT_MAX	((1UL << __READER_COUNT_BITS) - 1)
+# if (NR_CPUS <= __READER_COUNT_MAX)
+#  define MERGE_OWNER_INTO_COUNT
+# endif
+#endif
+
+#ifdef MERGE_OWNER_INTO_COUNT
+#define RWSEM_READER_SHIFT	__READER_SHIFT
+#define RWSEM_WRITER_MASK	((1UL << RWSEM_READER_SHIFT) - 4)
+#define RWSEM_WRITER_LOCKED	rwsem_owner_count(current)
+#else /* !MERGE_OWNER_INTO_COUNT */
 #define RWSEM_READER_SHIFT	8
+#define RWSEM_WRITER_MASK	(1UL << 7)
+#define RWSEM_WRITER_LOCKED	RWSEM_WRITER_MASK
+#endif /* MERGE_OWNER_INTO_COUNT */
+
 #define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
 #define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
-#define RWSEM_WRITER_MASK	RWSEM_WRITER_LOCKED
 #define RWSEM_LOCK_MASK		(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
 #define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
 				 RWSEM_FLAG_HANDOFF|RWSEM_FLAG_READFAIL)
 
+/*
+ * Task structure pointer compression (64-bit only):
+ * (owner - PAGE_OFFSET) >> (L1_CACHE_SHIFT - 2)
+ *
+ * However, init_task may lie outside of the linearly mapped physical
+ * to virtual memory range and so has to be handled separately.
+ */
+static inline unsigned long rwsem_owner_count(struct task_struct *owner)
+{
+	if (unlikely(owner == &init_task))
+		return RWSEM_WRITER_MASK;
+
+	return ((unsigned long)owner - PAGE_OFFSET) >> (L1_CACHE_SHIFT - 2);
+}
+
+static inline unsigned long rwsem_count_owner(long count)
+{
+	unsigned long writer = (unsigned long)count & RWSEM_WRITER_MASK;
+
+	if (unlikely(writer == RWSEM_WRITER_MASK))
+		return (unsigned long)&init_task;
+
+	return writer ? (writer << (L1_CACHE_SHIFT - 2)) + PAGE_OFFSET : 0;
+}
+
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
  * store tearing can't happen as optimistic spinners may read and use
  * the owner value concurrently without lock. Read from owner, however,
  * may not need READ_ONCE() as long as the pointer value is only used
  * for comparison and isn't being dereferenced.
+ *
+ * With MERGE_OWNER_INTO_COUNT defined, the writer task structure pointer
+ * is written to the count in addition to the owner field.
  */
+
 static inline void rwsem_set_owner(struct rw_semaphore *sem)
 {
 	WRITE_ONCE(sem->owner, current);
@@ -181,11 +267,6 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 	WRITE_ONCE(sem->owner, NULL);
 }
 
-static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
-{
-	return READ_ONCE(sem->owner);
-}
-
 /*
  * The task_struct pointer of the last owning reader will be left in
  * the owner field.
@@ -293,6 +374,21 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
 	}
 }
 
+#ifdef MERGE_OWNER_INTO_COUNT
+/*
+ * Get the owner value from count to have early access to the task structure.
+ * Owner from sem->count should include the RWSEM_NONSPINNABLE bits
+ * from sem->owner.
+ */
+static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
+{
+	unsigned long cowner = rwsem_count_owner(atomic_long_read(&sem->count));
+	unsigned long sowner = (unsigned long)READ_ONCE(sem->owner);
+
+	return (struct task_struct *) (cowner
+		? cowner | (sowner & RWSEM_NONSPINNABLE) : sowner);
+}
+
 /*
  * This function does a read trylock by incrementing the reader count
  * and then decrementing it immediately if too many readers are present
@@ -300,6 +396,18 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
  * of overflowing the count with minimal delay between the increment
  * and decrement.
  *
+ * When the owner task structure pointer is merged into count, fewer bits
+ * will be available for readers (down to 15 bits for x86-64). There is a
+ * very slight chance that preemption may happen in the middle of the
+ * inc-check-dec sequence, leaving the reader count incremented for a
+ * certain period of time until the reader wakes up and moves on. Still,
+ * the chance of having enough of these unfortunate sequences of events to
+ * overflow the reader count is infinitesimally small.
+ *
+ * If MERGE_OWNER_INTO_COUNT isn't defined, we don't really need to
+ * worry about the possibility of overflowing the reader count even
+ * for 32-bit architectures which can support up to 8M readers.
+ *
  * It returns the adjustment that should be added back to the count
  * in the slowpath.
  */
@@ -315,14 +423,34 @@ static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
 	return adjustment;
 }
 
+static int __init rwsem_show_count_status(void)
+{
+	pr_info("RW Semaphores: Write-owner in count & %d bits for readers.\n",
+		__READER_COUNT_BITS);
+	return 0;
+}
+late_initcall(rwsem_show_count_status);
+#else /* !MERGE_OWNER_INTO_COUNT */
+static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
+{
+	return READ_ONCE(sem->owner);
+}
+
+static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
+{
+	*cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
+	return -RWSEM_READER_BIAS;
+}
+#endif /* MERGE_OWNER_INTO_COUNT */
+
 /*
  * Guide to the rw_semaphore's count field.
  *
- * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
- * by a writer.
+ * When any of the RWSEM_WRITER_MASK bits in count is set, the lock is
+ * owned by a writer.
  *
  * The lock is owned by readers when
- * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (1) none of the RWSEM_WRITER_MASK bits is set in count,
  * (2) some of the reader bits are set in count, and
  * (3) the owner field has RWSEM_READ_OWNED bit set.
  *
@@ -1409,6 +1537,9 @@ static inline void __down_write(struct rw_semaphore *sem)
 		rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE);
 	else
 		rwsem_set_owner(sem);
+#ifdef MERGE_OWNER_INTO_COUNT
+	DEBUG_RWSEMS_WARN_ON(sem->owner != rwsem_get_owner(sem), sem);
+#endif
 }
 
 static inline int __down_write_killable(struct rw_semaphore *sem)
@@ -1469,7 +1600,7 @@ static inline void __up_write(struct rw_semaphore *sem)
 	DEBUG_RWSEMS_WARN_ON((sem->owner != current) &&
 			    !((long)sem->owner & RWSEM_NONSPINNABLE), sem);
 	rwsem_clear_owner(sem);
-	tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
+	tmp = atomic_long_fetch_and_release(~RWSEM_WRITER_MASK, &sem->count);
 	if (unlikely(tmp & RWSEM_FLAG_WAITERS))
 		rwsem_wake(sem, tmp);
 }
-- 
2.18.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH-tip v7 19/20] locking/rwsem: Remove redundant computation of writer lock word
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (17 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 18/20] locking/rwsem: Merge owner into count on x86-64 Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-04-28 21:25 ` [PATCH-tip v7 20/20] locking/rwsem: Disable preemption in down_read*() if owner in count Waiman Long
  2019-04-28 22:46 ` [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Linus Torvalds
  20 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

On 64-bit architectures, each rwsem writer will have its unique lock
word for acquiring the lock. Right now, the writer code recomputes the
lock word every time it tries to acquire the lock. This is a waste of
time. The lock word is now cached and reused when it is needed.

When MERGE_OWNER_INTO_COUNT isn't defined, the extra constant argument
to rwsem_try_write_lock() and rwsem_try_write_lock_unqueued() should
be optimized out by the compiler.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 9120fde02e0d..cd7fdb8b02ed 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -700,6 +700,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
  * bit is set or the lock is acquired with handoff bit cleared.
  */
 static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+					const long wlock,
 					enum writer_wait_state wstate)
 {
 	long new;
@@ -716,8 +717,7 @@ static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
 				return false;
 			new = count | RWSEM_FLAG_HANDOFF;
 		} else {
-			new = (count | RWSEM_WRITER_LOCKED) &
-				~RWSEM_FLAG_HANDOFF;
+			new = (count | wlock) & ~RWSEM_FLAG_HANDOFF;
 
 			if (list_is_singular(&sem->wait_list))
 				new &= ~RWSEM_FLAG_WAITERS;
@@ -763,13 +763,14 @@ static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
 /*
  * Try to acquire write lock before the writer has been put on wait queue.
  */
-static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
+static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem,
+						 const long wlock)
 {
 	long count = atomic_long_read(&sem->count);
 
 	while (!(count & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
-					count | RWSEM_WRITER_LOCKED)) {
+					count | wlock)) {
 			rwsem_set_owner(sem);
 			lockevent_inc(rwsem_opt_wlock);
 			return true;
@@ -937,7 +938,7 @@ static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
 	return sched_clock() + delta;
 }
 
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
 {
 	bool taken = false;
 	int prev_owner_state = OWNER_NULL;
@@ -965,7 +966,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 		/*
 		 * Try to acquire the lock
 		 */
-		taken = wlock ? rwsem_try_write_lock_unqueued(sem)
+		taken = wlock ? rwsem_try_write_lock_unqueued(sem, wlock)
 			      : rwsem_try_read_lock_unqueued(sem);
 
 		if (taken)
@@ -1118,7 +1119,8 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, bool wr)
 	return false;
 }
 
-static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
+static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem,
+					 const long wlock)
 {
 	return false;
 }
@@ -1293,10 +1295,11 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	struct rwsem_waiter waiter;
 	struct rw_semaphore *ret = sem;
 	DEFINE_WAKE_Q(wake_q);
+	const long wlock = RWSEM_WRITER_LOCKED;
 
 	/* do optimistic spinning and steal lock if possible */
 	if (rwsem_can_spin_on_owner(sem, true) &&
-	    rwsem_optimistic_spin(sem, true))
+	    rwsem_optimistic_spin(sem, wlock))
 		return sem;
 
 	/*
@@ -1367,7 +1370,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
 	while (true) {
-		if (rwsem_try_write_lock(count, sem, wstate))
+		if (rwsem_try_write_lock(count, sem, wlock, wstate))
 			break;
 
 		raw_spin_unlock_irq(&sem->wait_lock);
-- 
2.18.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH-tip v7 20/20] locking/rwsem: Disable preemption in down_read*() if owner in count
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (18 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 19/20] locking/rwsem: Remove redundant computation of writer lock word Waiman Long
@ 2019-04-28 21:25 ` Waiman Long
  2019-04-28 22:46 ` [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Linus Torvalds
  20 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 21:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, x86, Davidlohr Bueso, Linus Torvalds, Tim Chen,
	huang ying, Waiman Long

It is very unlikely that successive preemption in the middle of
down_read's inc-check-dec sequence will cause the reader count to
overflow. For absolute correctness, however, we still need to prevent
that possibility from happening. So preemption will be disabled during
the down_read*() call.

For PREEMPT=n kernels, there isn't much overhead in doing that.
For PREEMPT=y kernels, there will be some additional cost. RT kernels
have their own rwsem code, so it will not be a problem for them.

If MERGE_OWNER_INTO_COUNT isn't defined, we don't need to worry about
reader count overflow and so we don't need to disable preemption.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index cd7fdb8b02ed..60783267b50d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -375,6 +375,24 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
 }
 
 #ifdef MERGE_OWNER_INTO_COUNT
+/*
+ * It is very unlikely that successive preemption in the middle of
+ * down_read's inc-check-dec sequence will cause the reader count to
+ * overflow. For absolute correctness, we still need to prevent
+ * that possibility from happening. So preemption will be disabled
+ * during the down_read*() call.
+ *
+ * For PREEMPT=n kernels, there isn't much overhead in doing that.
+ * For PREEMPT=y kernels, there will be some additional cost.
+ *
+ * If MERGE_OWNER_INTO_COUNT isn't defined, we don't need to worry
+ * about reader count overflow and so we don't need to disable
+ * preemption.
+ */
+#define rwsem_preempt_disable()			preempt_disable()
+#define rwsem_preempt_enable()			preempt_enable()
+#define rwsem_schedule_preempt_disabled()	schedule_preempt_disabled()
+
 /*
  * Get the owner value from count to have early access to the task structure.
+ * Owner from sem->count should include the RWSEM_NONSPINNABLE bits
@@ -431,6 +449,11 @@ static int __init rwsem_show_count_status(void)
 }
 late_initcall(rwsem_show_count_status);
 #else /* !MERGE_OWNER_INTO_COUNT */
+
+#define rwsem_preempt_disable()
+#define rwsem_preempt_enable()
+#define rwsem_schedule_preempt_disabled()	schedule()
+
 static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
 {
 	return READ_ONCE(sem->owner);
@@ -1255,7 +1278,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
 			raw_spin_unlock_irq(&sem->wait_lock);
 			break;
 		}
-		schedule();
+		rwsem_schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_reader);
 	}
 
@@ -1486,28 +1509,36 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
  */
 inline void __down_read(struct rw_semaphore *sem)
 {
-	long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
+	long tmp, adjustment;
 
+	rwsem_preempt_disable();
+	adjustment = rwsem_read_trylock(sem, &tmp);
 	if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
 		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, adjustment);
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
+	rwsem_preempt_enable();
 }
 
 static inline int __down_read_killable(struct rw_semaphore *sem)
 {
-	long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
+	long tmp, adjustment;
 
+	rwsem_preempt_disable();
+	adjustment = rwsem_read_trylock(sem, &tmp);
 	if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
 		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE,
-						    adjustment)))
+						    adjustment))) {
+			rwsem_preempt_enable();
 			return -EINTR;
+		}
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	} else {
 		rwsem_set_reader_owned(sem);
 	}
+	rwsem_preempt_enable();
 	return 0;
 }
 
-- 
2.18.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2
  2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
                   ` (19 preceding siblings ...)
  2019-04-28 21:25 ` [PATCH-tip v7 20/20] locking/rwsem: Disable preemption in down_read*() if owner in count Waiman Long
@ 2019-04-28 22:46 ` Linus Torvalds
  2019-04-28 23:12   ` Waiman Long
  20 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2019-04-28 22:46 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Linux List Kernel Mailing,
	the arch/x86 maintainers, Davidlohr Bueso, Tim Chen, huang ying

This doesn't seem to be the full diff - looking at that patch 1 you
seem to have taken my suggested list_cut_before() change too.

I'm not against it (it does seem to be simpler and better), I just
hope you double-checked it, since I kind of hand-waved it.

                    Linus

On Sun, Apr 28, 2019 at 2:26 PM Waiman Long <longman@redhat.com> wrote:
>
>  v6=>v7 diff
>  -----------
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 97a2334d9cd3..60783267b50d 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -693,7 +693,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>                 atomic_long_add(adjustment, &sem->count);
>
>         /* 2nd pass */
> -       list_for_each_entry(waiter, &wlist, list) {
> +       list_for_each_entry_safe(waiter, tmp, &wlist, list) {
>                 struct task_struct *tsk;
>
>                 tsk = waiter->task;
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2
  2019-04-28 22:46 ` [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Linus Torvalds
@ 2019-04-28 23:12   ` Waiman Long
  2019-04-28 23:19     ` Waiman Long
  2019-04-29  0:10     ` Linus Torvalds
  0 siblings, 2 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 23:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Linux List Kernel Mailing,
	the arch/x86 maintainers, Davidlohr Bueso, Tim Chen, huang ying

On 4/28/19 6:46 PM, Linus Torvalds wrote:
> This doesn't seem to be the full diff - looking at that patch 1 you
> seem to have taken my suggested list_cut_before() change too.
>
> I'm not against it (it does seem to be simpler and better), I just
> hope you double-checked it, since I kind of hand-waved it.
>
>                     Linus

I implemented your suggestion in patch 1 as it will produce simpler and
faster code. However, one of the changes in my patchset is to wake up
all the readers in the wait list. This means I have to jump over the
writers and wake up the readers behind them as well. See patch 11 for
details. As a result, I have to revert back to using list_add_tail() and
list_for_each_entry_safe() for the first pass. That is why the diff for
the whole patchset is just the below change. It is done on purpose, not
an omission.

Cheers,
Longman

>
> On Sun, Apr 28, 2019 at 2:26 PM Waiman Long <longman@redhat.com> wrote:
>>  v6=>v7 diff
>>  -----------
>> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
>> index 97a2334d9cd3..60783267b50d 100644
>> --- a/kernel/locking/rwsem.c
>> +++ b/kernel/locking/rwsem.c
>> @@ -693,7 +693,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>>                 atomic_long_add(adjustment, &sem->count);
>>
>>         /* 2nd pass */
>> -       list_for_each_entry(waiter, &wlist, list) {
>> +       list_for_each_entry_safe(waiter, tmp, &wlist, list) {
>>                 struct task_struct *tsk;
>>
>>                 tsk = waiter->task;
>>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2
  2019-04-28 23:12   ` Waiman Long
@ 2019-04-28 23:19     ` Waiman Long
  2019-04-29  0:10     ` Linus Torvalds
  1 sibling, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-04-28 23:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Linux List Kernel Mailing,
	the arch/x86 maintainers, Davidlohr Bueso, Tim Chen, huang ying

On 4/28/19 7:12 PM, Waiman Long wrote:
> On 4/28/19 6:46 PM, Linus Torvalds wrote:
>> This doesn't seem to be the full diff - looking at that patch 1 you
>> seem to have taken my suggested list_cut_before() change too.
>>
>> I'm not against it (it does seem to be simpler and better), I just
>> hope you double-checked it, since I kind of hand-waved it.
>>
>>                     Linus
> I implemented your suggestion in patch 1 as it will produce simpler and
> faster code. However, one of the changes in my patchset is to wake up
> all the readers in the wait list. This means I have to jump over the
> writers and wake up the readers behind them as well. See patch 11 for
> details. As a result, I have to revert back to use list_add_tail() and
> list_for_each_entry_safe() for the first pass. That is why the diff for
> the whole patchset is just the below change. It is done on purpose, not
> an omission.

That is also the reason why it was implemented this way in my v6
patchset. I implemented the fix on top of the rwsem patchset first and
then moved it back to the beginning of the patchset for easier backport.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2
  2019-04-28 23:12   ` Waiman Long
  2019-04-28 23:19     ` Waiman Long
@ 2019-04-29  0:10     ` Linus Torvalds
  2019-04-29  0:27       ` Waiman Long
  1 sibling, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2019-04-29  0:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Linux List Kernel Mailing,
	the arch/x86 maintainers, Davidlohr Bueso, Tim Chen, huang ying

On Sun, Apr 28, 2019 at 4:12 PM Waiman Long <longman@redhat.com> wrote:
>
> I implemented your suggestion in patch 1 as it will produce simpler and
> faster code. However, one of the changes in my patchset is to wake up
> all the readers in the wait list. This means I have to jump over the
> writers and wake up the readers behind them as well. See patch 11 for
> details. As a result, I have to revert back to use list_add_tail() and
> list_for_each_entry_safe() for the first pass. That is why the diff for
> the whole patchset is just the below change. It is done on purpose, not
> an omission.

Ahh, ok. In that case I suspect the clever code isn't even worth it,
since it very much depends on just splitting the list in a fixed
place.

                   Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2
  2019-04-29  0:10     ` Linus Torvalds
@ 2019-04-29  0:27       ` Waiman Long
  2019-04-29  2:41         ` Linus Torvalds
  0 siblings, 1 reply; 41+ messages in thread
From: Waiman Long @ 2019-04-29  0:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Linux List Kernel Mailing,
	the arch/x86 maintainers, Davidlohr Bueso, Tim Chen, huang ying

On 4/28/19 8:10 PM, Linus Torvalds wrote:
> On Sun, Apr 28, 2019 at 4:12 PM Waiman Long <longman@redhat.com> wrote:
>> I implemented your suggestion in patch 1 as it will produce simpler and
>> faster code. However, one of the changes in my patchset is to wake up
>> all the readers in the wait list. This means I have to jump over the
>> writers and wake up the readers behind them as well. See patch 11 for
>> details. As a result, I have to revert back to use list_add_tail() and
>> list_for_each_entry_safe() for the first pass. That is why the diff for
>> the whole patchset is just the below change. It is done on purpose, not
>> an omission.
> Ahh, ok. In that case I suspect the clever code isn't even worth it,
> since it very much depends on just splitting the list in a fixed
> place.
>
>                    Linus

Not really, this is a serious problem that has to be backported to
earlier stable releases and downstream. The clever code is helpful in
those cases.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2
  2019-04-29  0:27       ` Waiman Long
@ 2019-04-29  2:41         ` Linus Torvalds
  0 siblings, 0 replies; 41+ messages in thread
From: Linus Torvalds @ 2019-04-29  2:41 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, Linux List Kernel Mailing,
	the arch/x86 maintainers, Davidlohr Bueso, Tim Chen, huang ying

On Sun, Apr 28, 2019 at 5:28 PM Waiman Long <longman@redhat.com> wrote:
>
> Not really, this is a serious problem that have to be backported to
> earlier stable releases and downstream. The clever code is helpful in
> those cases.

Fair enough, I guess the code will live in the stable trees for a longish while.

               Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count before increment
  2019-04-28 21:25 ` [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count before increment Waiman Long
@ 2019-05-03 12:06   ` Peter Zijlstra
  2019-05-03 13:32     ` Waiman Long
  2019-05-07  7:07   ` [tip:locking/urgent] " tip-bot for Waiman Long
  1 sibling, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2019-05-03 12:06 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying, stable

On Sun, Apr 28, 2019 at 05:25:38PM -0400, Waiman Long wrote:
> During my rwsem testing, it was found that after a down_read(), the
> reader count may occasionally become 0 or even negative. Consequently,
> a writer may steal the lock at that time and execute with the reader
> in parallel thus breaking the mutual exclusion guarantee of the write
> lock. In other words, both readers and writer can become rwsem owners
> simultaneously.
> 
> The current reader wakeup code does it in one pass to clear waiter->task
> and put them into wake_q before fully incrementing the reader count.
> Once waiter->task is cleared, the corresponding reader may see it,
> finish the critical section and do unlock to decrement the count before
> the count is incremented. This is not a problem if there is only one
> reader to wake up as the count has been pre-incremented by 1.  It is
> a problem if there are more than one readers to be woken up and writer
> can steal the lock.
> 
> The wakeup was actually done in 2 passes before the v4.9 commit
> 70800c3c0cc5 ("locking/rwsem: Scan the wait_list for readers only
> once"). To fix this problem, the wakeup is now done in two passes
> again. In the first pass, we collect the readers and count them. The
> reader count is then fully incremented. In the second pass, the
> waiter->task is then cleared and they are put into wake_q to be woken
> up later.
> 
> Fixes: 70800c3c0cc5 ("locking/rwsem: Scan the wait_list for readers only once")

It is effectively a revert of that patch, right? Just written more
clever.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 08/20] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-04-28 21:25 ` [PATCH-tip v7 08/20] locking/rwsem: Implement lock handoff to prevent lock starvation Waiman Long
@ 2019-05-03 13:10   ` Peter Zijlstra
  2019-05-03 13:57     ` Waiman Long
  2019-05-03 14:37     ` David Laight
  0 siblings, 2 replies; 41+ messages in thread
From: Peter Zijlstra @ 2019-05-03 13:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Sun, Apr 28, 2019 at 05:25:45PM -0400, Waiman Long wrote:
> +static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
> +					enum writer_wait_state wstate)
>  {
>  	long new;
>  
> +	lockdep_assert_held(&sem->wait_lock);
> +	do {
> +		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
>  
> +		if (has_handoff && wstate == WRITER_NOT_FIRST)
> +			return false;
>  
> +		if (count & RWSEM_LOCK_MASK) {
> +			if (has_handoff || (wstate != WRITER_HANDOFF))
> +				return false;
> +			new = count | RWSEM_FLAG_HANDOFF;
> +		} else {
> +			new = (count | RWSEM_WRITER_LOCKED) &
> +				~RWSEM_FLAG_HANDOFF;
>  
> +			if (list_is_singular(&sem->wait_list))
> +				new &= ~RWSEM_FLAG_WAITERS;
> +		}
> +	} while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
> +
> +	/*
> +	 * We have either acquired the lock with handoff bit cleared or
> +	 * set the handoff bit.
> +	 */
> +	if (new & RWSEM_FLAG_HANDOFF)
> +		return false;
> +
> +	rwsem_set_owner(sem);
> +	return true;
>  }

I've changed that like so.

--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -406,19 +406,23 @@ static inline bool rwsem_try_write_lock(
 	long new;
 
 	lockdep_assert_held(&sem->wait_lock);
+
 	do {
 		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
 
 		if (has_handoff && wstate == WRITER_NOT_FIRST)
 			return false;
 
+		new = count;
+
 		if (count & RWSEM_LOCK_MASK) {
 			if (has_handoff || (wstate != WRITER_HANDOFF))
 				return false;
-			new = count | RWSEM_FLAG_HANDOFF;
+
+			new |= RWSEM_FLAG_HANDOFF;
 		} else {
-			new = (count | RWSEM_WRITER_LOCKED) &
-				~RWSEM_FLAG_HANDOFF;
+			new |= RWSEM_WRITER_LOCKED;
+			new &= ~RWSEM_FLAG_HANDOFF;
 
 			if (list_is_singular(&sem->wait_list))
 				new &= ~RWSEM_FLAG_WAITERS;

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count before increment
  2019-05-03 12:06   ` Peter Zijlstra
@ 2019-05-03 13:32     ` Waiman Long
  0 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-05-03 13:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying, stable

On 5/3/19 8:06 AM, Peter Zijlstra wrote:
> On Sun, Apr 28, 2019 at 05:25:38PM -0400, Waiman Long wrote:
>> During my rwsem testing, it was found that after a down_read(), the
>> reader count may occasionally become 0 or even negative. Consequently,
>> a writer may steal the lock at that time and execute with the reader
>> in parallel thus breaking the mutual exclusion guarantee of the write
>> lock. In other words, both readers and writer can become rwsem owners
>> simultaneously.
>>
>> The current reader wakeup code does it in one pass to clear waiter->task
>> and put them into wake_q before fully incrementing the reader count.
>> Once waiter->task is cleared, the corresponding reader may see it,
>> finish the critical section and do unlock to decrement the count before
>> the count is incremented. This is not a problem if there is only one
>> reader to wake up as the count has been pre-incremented by 1.  It is
>> a problem if there are more than one readers to be woken up and writer
>> can steal the lock.
>>
>> The wakeup was actually done in 2 passes before the v4.9 commit
>> 70800c3c0cc5 ("locking/rwsem: Scan the wait_list for readers only
>> once"). To fix this problem, the wakeup is now done in two passes
>> again. In the first pass, we collect the readers and count them. The
>> reader count is then fully incremented. In the second pass, the
>> waiter->task is then cleared and they are put into wake_q to be woken
>> up later.
>>
>> Fixes: 70800c3c0cc5 ("locking/rwsem: Scan the wait_list for readers only once")
> It is effectively a revert of that patch, right? Just written more
> clever.
>
Yes, it is essentially a revert.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 09/20] locking/rwsem: Always release wait_lock before waking up tasks
  2019-04-28 21:25 ` [PATCH-tip v7 09/20] locking/rwsem: Always release wait_lock before waking up tasks Waiman Long
@ 2019-05-03 13:37   ` Peter Zijlstra
  2019-05-03 13:56     ` Waiman Long
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2019-05-03 13:37 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Sun, Apr 28, 2019 at 05:25:46PM -0400, Waiman Long wrote:

> +			/*
> +			 * This waiter may have become first in the wait
> +			 * list after re-acquring the wait_lock. The
> +			 * rwsem_first_waiter() test in the main while
> +			 * loop below will correctly detect that. We do
> +			 * need to reload count to perform proper trylock
> +			 * and avoid missed wakeup.
> +			 */
> +			count = atomic_long_read(&sem->count);
> +		}
>  	} else {
>  		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
>  	}

I've been eyeing that count usage for the past few patches, and this
here makes me think we should get rid of it.

--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -400,13 +400,14 @@ static void __rwsem_mark_wake(struct rw_
  * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
  * bit is set or the lock is acquired with handoff bit cleared.
  */
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 					enum writer_wait_state wstate)
 {
-	long new;
+	long count, new;
 
 	lockdep_assert_held(&sem->wait_lock);
 
+	count = atomic_long_read(&sem->count);
 	do {
 		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
 
@@ -760,25 +761,16 @@ rwsem_down_write_slowpath(struct rw_sema
 			wake_up_q(&wake_q);
 			wake_q_init(&wake_q);	/* Used again, reinit */
 			raw_spin_lock_irq(&sem->wait_lock);
-			/*
-			 * This waiter may have become first in the wait
-			 * list after re-acquring the wait_lock. The
-			 * rwsem_first_waiter() test in the main while
-			 * loop below will correctly detect that. We do
-			 * need to reload count to perform proper trylock
-			 * and avoid missed wakeup.
-			 */
-			count = atomic_long_read(&sem->count);
 		}
 	} else {
-		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+		atomic_long_or(RWSEM_FLAG_WAITERS, &sem->count);
 	}
 
 wait:
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
 	for (;;) {
-		if (rwsem_try_write_lock(count, sem, wstate))
+		if (rwsem_try_write_lock(sem, wstate))
 			break;
 
 		raw_spin_unlock_irq(&sem->wait_lock);
@@ -819,7 +811,6 @@ rwsem_down_write_slowpath(struct rw_sema
 		}
 
 		raw_spin_lock_irq(&sem->wait_lock);
-		count = atomic_long_read(&sem->count);
 	}
 	__set_current_state(TASK_RUNNING);
 	list_del(&waiter.list);

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 09/20] locking/rwsem: Always release wait_lock before waking up tasks
  2019-05-03 13:37   ` Peter Zijlstra
@ 2019-05-03 13:56     ` Waiman Long
  0 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-05-03 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On 5/3/19 9:37 AM, Peter Zijlstra wrote:
> On Sun, Apr 28, 2019 at 05:25:46PM -0400, Waiman Long wrote:
>
>> +			/*
>> +			 * This waiter may have become first in the wait
>> +			 * list after re-acquring the wait_lock. The
>> +			 * rwsem_first_waiter() test in the main while
>> +			 * loop below will correctly detect that. We do
>> +			 * need to reload count to perform proper trylock
>> +			 * and avoid missed wakeup.
>> +			 */
>> +			count = atomic_long_read(&sem->count);
>> +		}
>>  	} else {
>>  		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
>>  	}
> I've been eyeing that count usage for the past few patches, and this
> here makes me think we should get rid of it.
>
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -400,13 +400,14 @@ static void __rwsem_mark_wake(struct rw_
>   * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
>   * bit is set or the lock is acquired with handoff bit cleared.
>   */
> -static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
> +static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>  					enum writer_wait_state wstate)
>  {
> -	long new;
> +	long count, new;
>  
>  	lockdep_assert_held(&sem->wait_lock);
>  
> +	count = atomic_long_read(&sem->count);
>  	do {
>  		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
>  
> @@ -760,25 +761,16 @@ rwsem_down_write_slowpath(struct rw_sema
>  			wake_up_q(&wake_q);
>  			wake_q_init(&wake_q);	/* Used again, reinit */
>  			raw_spin_lock_irq(&sem->wait_lock);
> -			/*
> -			 * This waiter may have become first in the wait
> -			 * list after re-acquring the wait_lock. The
> -			 * rwsem_first_waiter() test in the main while
> -			 * loop below will correctly detect that. We do
> -			 * need to reload count to perform proper trylock
> -			 * and avoid missed wakeup.
> -			 */
> -			count = atomic_long_read(&sem->count);
>  		}
>  	} else {
> -		count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
> +		atomic_long_or(RWSEM_FLAG_WAITERS, &sem->count);
>  	}
>  
>  wait:
>  	/* wait until we successfully acquire the lock */
>  	set_current_state(state);
>  	for (;;) {
> -		if (rwsem_try_write_lock(count, sem, wstate))
> +		if (rwsem_try_write_lock(sem, wstate))
>  			break;
>  
>  		raw_spin_unlock_irq(&sem->wait_lock);
> @@ -819,7 +811,6 @@ rwsem_down_write_slowpath(struct rw_sema
>  		}
>  
>  		raw_spin_lock_irq(&sem->wait_lock);
> -		count = atomic_long_read(&sem->count);
>  	}
>  	__set_current_state(TASK_RUNNING);
>  	list_del(&waiter.list);

Yes, this is an alternative way of doing it.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 08/20] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-05-03 13:10   ` Peter Zijlstra
@ 2019-05-03 13:57     ` Waiman Long
  2019-05-03 14:37     ` David Laight
  1 sibling, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-05-03 13:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On 5/3/19 9:10 AM, Peter Zijlstra wrote:
> On Sun, Apr 28, 2019 at 05:25:45PM -0400, Waiman Long wrote:
>> +static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
>> +					enum writer_wait_state wstate)
>>  {
>>  	long new;
>>  
>> +	lockdep_assert_held(&sem->wait_lock);
>> +	do {
>> +		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
>>  
>> +		if (has_handoff && wstate == WRITER_NOT_FIRST)
>> +			return false;
>>  
>> +		if (count & RWSEM_LOCK_MASK) {
>> +			if (has_handoff || (wstate != WRITER_HANDOFF))
>> +				return false;
>> +			new = count | RWSEM_FLAG_HANDOFF;
>> +		} else {
>> +			new = (count | RWSEM_WRITER_LOCKED) &
>> +				~RWSEM_FLAG_HANDOFF;
>>  
>> +			if (list_is_singular(&sem->wait_list))
>> +				new &= ~RWSEM_FLAG_WAITERS;
>> +		}
>> +	} while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
>> +
>> +	/*
>> +	 * We have either acquired the lock with handoff bit cleared or
>> +	 * set the handoff bit.
>> +	 */
>> +	if (new & RWSEM_FLAG_HANDOFF)
>> +		return false;
>> +
>> +	rwsem_set_owner(sem);
>> +	return true;
>>  }
> I've changed that like so.
>
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -406,19 +406,23 @@ static inline bool rwsem_try_write_lock(
>  	long new;
>  
>  	lockdep_assert_held(&sem->wait_lock);
> +
>  	do {
>  		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
>  
>  		if (has_handoff && wstate == WRITER_NOT_FIRST)
>  			return false;
>  
> +		new = count;
> +
>  		if (count & RWSEM_LOCK_MASK) {
>  			if (has_handoff || (wstate != WRITER_HANDOFF))
>  				return false;
> -			new = count | RWSEM_FLAG_HANDOFF;
> +
> +			new |= RWSEM_FLAG_HANDOFF;
>  		} else {
> -			new = (count | RWSEM_WRITER_LOCKED) &
> -				~RWSEM_FLAG_HANDOFF;
> +			new |= RWSEM_WRITER_LOCKED;
> +			new &= ~RWSEM_FLAG_HANDOFF;
>  
>  			if (list_is_singular(&sem->wait_list))
>  				new &= ~RWSEM_FLAG_WAITERS;

Will apply the changes in the next version.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH-tip v7 08/20] locking/rwsem: Implement lock handoff to prevent lock starvation
  2019-05-03 13:10   ` Peter Zijlstra
  2019-05-03 13:57     ` Waiman Long
@ 2019-05-03 14:37     ` David Laight
  1 sibling, 0 replies; 41+ messages in thread
From: David Laight @ 2019-05-03 14:37 UTC (permalink / raw)
  To: 'Peter Zijlstra', Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

From: Peter Zijlstra
> Sent: 03 May 2019 14:11
> To: Waiman Long
...
> I've changed that like so.
> 
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -406,19 +406,23 @@ static inline bool rwsem_try_write_lock(
>  	long new;
> 
>  	lockdep_assert_held(&sem->wait_lock);
> +
>  	do {
>  		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
> 
>  		if (has_handoff && wstate == WRITER_NOT_FIRST)
>  			return false;
> 
> +		new = count;
> +
>  		if (count & RWSEM_LOCK_MASK) {
>  			if (has_handoff || (wstate != WRITER_HANDOFF))
>  				return false;
> -			new = count | RWSEM_FLAG_HANDOFF;
> +
> +			new |= RWSEM_FLAG_HANDOFF;
>  		} else {
> -			new = (count | RWSEM_WRITER_LOCKED) &
> -				~RWSEM_FLAG_HANDOFF;
> +			new |= RWSEM_WRITER_LOCKED;
> +			new &= ~RWSEM_FLAG_HANDOFF;
> 
>  			if (list_is_singular(&sem->wait_list))
>  				new &= ~RWSEM_FLAG_WAITERS;

The compiler will probably convert it back :-)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 12/20] locking/rwsem: Clarify usage of owner's nonspinnable bit
  2019-04-28 21:25 ` [PATCH-tip v7 12/20] locking/rwsem: Clarify usage of owner's nonspinnable bit Waiman Long
@ 2019-05-03 15:21   ` Peter Zijlstra
  2019-05-03 15:26     ` Waiman Long
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2019-05-03 15:21 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Sun, Apr 28, 2019 at 05:25:49PM -0400, Waiman Long wrote:
> Bit 1 of sem->owner (RWSEM_ANONYMOUSLY_OWNED) is used to designate an
> anonymous owner - readers or an anonymous writer. The setting of this
> anonymous bit is used as an indicator that optimistic spinning cannot
> be done on this rwsem.
> 
> With the upcoming reader optimistic spinning patches, a reader-owned
> rwsem can be spun on for a limited period of time. We still need
> this bit to indicate that a rwsem is nonspinnable, but not setting this
> bit no longer means that the owner is known. So rename the bit
> to RWSEM_NONSPINNABLE to clarify its meaning.
> 
> This patch also fixes a DEBUG_RWSEMS_WARN_ON() bug in __up_write().
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  include/linux/rwsem.h  |  2 +-
>  kernel/locking/rwsem.c | 43 +++++++++++++++++++++---------------------
>  2 files changed, 22 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
> index 148983e21d47..bb76e82398b2 100644
> --- a/include/linux/rwsem.h
> +++ b/include/linux/rwsem.h
> @@ -50,7 +50,7 @@ struct rw_semaphore {
>  };
>  
>  /*
> - * Setting bit 1 of the owner field but not bit 0 will indicate
> + * Setting all bits of the owner field except bit 0 will indicate
>   * that the rwsem is writer-owned with an unknown owner.
>   */
>  #define RWSEM_OWNER_UNKNOWN	((struct task_struct *)-2L)

As you know, I'm trying to kill that :-)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 12/20] locking/rwsem: Clarify usage of owner's nonspinnable bit
  2019-05-03 15:21   ` Peter Zijlstra
@ 2019-05-03 15:26     ` Waiman Long
  0 siblings, 0 replies; 41+ messages in thread
From: Waiman Long @ 2019-05-03 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On 5/3/19 11:21 AM, Peter Zijlstra wrote:
> On Sun, Apr 28, 2019 at 05:25:49PM -0400, Waiman Long wrote:
>> Bit 1 of sem->owner (RWSEM_ANONYMOUSLY_OWNED) is used to designate an
>> anonymous owner - readers or an anonymous writer. The setting of this
>> anonymous bit is used as an indicator that optimistic spinning cannot
>> be done on this rwsem.
>>
>> With the upcoming reader optimistic spinning patches, a reader-owned
>> rwsem can be spun on for a limited period of time. We still need
>> this bit to indicate that a rwsem is nonspinnable, but not setting this
>> bit no longer means that the owner is known. So rename the bit
>> to RWSEM_NONSPINNABLE to clarify its meaning.
>>
>> This patch also fixes a DEBUG_RWSEMS_WARN_ON() bug in __up_write().
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  include/linux/rwsem.h  |  2 +-
>>  kernel/locking/rwsem.c | 43 +++++++++++++++++++++---------------------
>>  2 files changed, 22 insertions(+), 23 deletions(-)
>>
>> diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
>> index 148983e21d47..bb76e82398b2 100644
>> --- a/include/linux/rwsem.h
>> +++ b/include/linux/rwsem.h
>> @@ -50,7 +50,7 @@ struct rw_semaphore {
>>  };
>>  
>>  /*
>> - * Setting bit 1 of the owner field but not bit 0 will indicate
>> + * Setting all bits of the owner field except bit 0 will indicate
>>   * that the rwsem is writer-owned with an unknown owner.
>>   */
>>  #define RWSEM_OWNER_UNKNOWN	((struct task_struct *)-2L)
> As you know, I'm trying to kill that :-)

I am planning to remove it once your patch is merged.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 11/20] locking/rwsem: Wake up almost all readers in wait queue
  2019-04-28 21:25 ` [PATCH-tip v7 11/20] locking/rwsem: Wake up almost all readers in wait queue Waiman Long
@ 2019-05-03 16:51   ` Peter Zijlstra
  2019-05-03 17:15     ` Waiman Long
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2019-05-03 16:51 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Sun, Apr 28, 2019 at 05:25:48PM -0400, Waiman Long wrote:
> When the front of the wait queue is a reader, other readers
> immediately following the first reader will also be woken up at the
> same time. However, if there is a writer in between, those readers
> behind the writer will not be woken up.

> @@ -345,13 +359,20 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>  	 * 2) For each waiters in the new list, clear waiter->task and
>  	 *    put them into wake_q to be woken up later.
>  	 */
> -	list_for_each_entry(waiter, &sem->wait_list, list) {
> +	INIT_LIST_HEAD(&wlist);
> +	list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
>  		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
> -			break;
> +			continue;
>  
>  		woken++;
> +		list_move_tail(&waiter->list, &wlist);
> +
> +		/*
> +		 * Limit # of readers that can be woken up per wakeup call.
> +		 */
> +		if (woken >= MAX_READERS_WAKEUP)
> +			break;
>  	}
> -	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
>  
>  	adjustment = woken * RWSEM_READER_BIAS - adjustment;
>  	lockevent_cond_inc(rwsem_wake_reader, woken);

An idea for later; maybe we can simplify this by playing silly games
with the queueing.

Writers: always list_add_tail()
Readers: keep a pointer to first_reader in the queue;
	 when NULL: list_add_tail() and set
	 otherwise: list_add_tail(, first_reader);

Possibly also keep a count of first_reader list size, and if 'big' reset
first_reader.

That way we never have to skip over writers.
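
A rough sketch of the above, assuming rw_semaphore grows a (hypothetical)
first_reader pointer -- illustration only, not part of this series:

static void rwsem_queue_waiter(struct rw_semaphore *sem,
			       struct rwsem_waiter *waiter)
{
	lockdep_assert_held(&sem->wait_lock);

	if (waiter->type == RWSEM_WAITING_FOR_WRITE || !sem->first_reader) {
		/* writers, and the first reader, always go to the tail */
		list_add_tail(&waiter->list, &sem->wait_list);
		if (waiter->type == RWSEM_WAITING_FOR_READ)
			sem->first_reader = waiter;
	} else {
		/*
		 * Later readers are queued immediately in front of
		 * first_reader, so all waiting readers stay contiguous
		 * and wakeup never has to skip over writers.
		 */
		list_add_tail(&waiter->list, &sem->first_reader->list);
	}
}

first_reader would still have to be cleared when its group is woken up
(or reset when the group grows too big, as above).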


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 11/20] locking/rwsem: Wake up almost all readers in wait queue
  2019-05-03 16:51   ` Peter Zijlstra
@ 2019-05-03 17:15     ` Waiman Long
  2019-05-06 11:49       ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Waiman Long @ 2019-05-03 17:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On 5/3/19 12:51 PM, Peter Zijlstra wrote:
> On Sun, Apr 28, 2019 at 05:25:48PM -0400, Waiman Long wrote:
>> When the front of the wait queue is a reader, other readers
>> immediately following the first reader will also be woken up at the
>> same time. However, if there is a writer in between, those readers
>> behind the writer will not be woken up.
>> @@ -345,13 +359,20 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>>  	 * 2) For each waiters in the new list, clear waiter->task and
>>  	 *    put them into wake_q to be woken up later.
>>  	 */
>> -	list_for_each_entry(waiter, &sem->wait_list, list) {
>> +	INIT_LIST_HEAD(&wlist);
>> +	list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
>>  		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
>> -			break;
>> +			continue;
>>  
>>  		woken++;
>> +		list_move_tail(&waiter->list, &wlist);
>> +
>> +		/*
>> +		 * Limit # of readers that can be woken up per wakeup call.
>> +		 */
>> +		if (woken >= MAX_READERS_WAKEUP)
>> +			break;
>>  	}
>> -	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
>>  
>>  	adjustment = woken * RWSEM_READER_BIAS - adjustment;
>>  	lockevent_cond_inc(rwsem_wake_reader, woken);
> An idea for later; maybe we can simplify this by playing silly games
> with the queueing.
>
> Writers: always list_add_tail()
> Readers: keep a pointer to first_reader in the queue;
> 	 when NULL: list_add_tail() and set
> 	 otherwise: list_add_tail(, first_reader);
>
> Possibly also keep a count of first_reader list size, and if 'big' reset
> first_reader.
>
> That way we never have to skip over writers.
>
Yes, that can work. However, that will require adding one more pointer
to the rw_semaphore structure. The performance gain with this
optimization may not justify increasing the size of the structure by 4/8
bytes.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 11/20] locking/rwsem: Wake up almost all readers in wait queue
  2019-05-03 17:15     ` Waiman Long
@ 2019-05-06 11:49       ` Peter Zijlstra
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2019-05-06 11:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Fri, May 03, 2019 at 01:15:22PM -0400, Waiman Long wrote:
> Yes, that can work. However, that will require adding one more pointer
> to the rw_semaphore structure. The performance gain with this
> optimization may not justify increasing the size of the structure by 4/8
> bytes.

Indeed; what was I thinking :-)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH-tip v7 14/20] locking/rwsem: Enable time-based spinning on reader-owned rwsem
  2019-04-28 21:25 ` [PATCH-tip v7 14/20] locking/rwsem: Enable time-based spinning on reader-owned rwsem Waiman Long
@ 2019-05-06 15:47   ` Peter Zijlstra
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2019-05-06 15:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Will Deacon, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, x86, Davidlohr Bueso,
	Linus Torvalds, Tim Chen, huang ying

On Sun, Apr 28, 2019 at 05:25:51PM -0400, Waiman Long wrote:

> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 6b11dce3139d..62fcb8bacfab 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c

> @@ -56,7 +61,9 @@
>   * a rwsem, but the overhead is simply too big.
>   */
>  #define RWSEM_READER_OWNED	(1UL << 0)
> -#define RWSEM_NONSPINNABLE	(1UL << 1)
> +#define RWSEM_RD_NONSPINNABLE	(1UL << 1)
> +#define RWSEM_WR_NONSPINNABLE	(1UL << 2)
> +#define RWSEM_NONSPINNABLE	(RWSEM_RD_NONSPINNABLE | RWSEM_WR_NONSPINNABLE)
>  #define RWSEM_OWNER_FLAGS_MASK	(RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)
>  
>  #ifdef CONFIG_DEBUG_RWSEMS

> @@ -149,9 +155,27 @@ static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
>   * and steal the lock.
>   * N.B. !owner is considered spinnable.
>   */
> -static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
> +static inline bool is_rwsem_owner_spinnable(struct task_struct *owner, bool wr)
>  {
> -	return !((unsigned long)owner & RWSEM_NONSPINNABLE);
> +	unsigned long bit = wr ? RWSEM_WR_NONSPINNABLE : RWSEM_RD_NONSPINNABLE;
> +
> +	return !((unsigned long)owner & bit);
> +}

You _could_ write that as:

  return !((unsigned long)owner & (1UL << (1 + wr)));
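
(That works because RWSEM_RD_NONSPINNABLE is bit 1 and RWSEM_WR_NONSPINNABLE
is bit 2, so '1 + wr' selects the right bit for the read/write case.)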

> +/*
> + * Return true if the rwsem is spinnable.
> + */
> +static inline bool is_rwsem_spinnable(struct rw_semaphore *sem, bool wr)
> +{
> +	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner), wr);
> +}

I think we can do without this; see below.

> +/*
> + * Remove all the flag bits from owner.
> + */
> +static inline struct task_struct *owner_without_flags(struct task_struct *owner)
> +{
> +	return (struct task_struct *)((long)owner & ~RWSEM_OWNER_FLAGS_MASK);
>  }

I hates the name on this one...

> @@ -163,11 +187,11 @@ static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
>   */
>  static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
>  {
> +	unsigned long owner = (unsigned long)READ_ONCE(sem->owner);
> +
> +	if ((owner & ~RWSEM_OWNER_FLAGS_MASK) == (unsigned long)current)
>  		cmpxchg_relaxed((unsigned long *)&sem->owner, val,
> -				RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
> +				owner & RWSEM_OWNER_FLAGS_MASK);

Missing { }, also, I think this is all broken, this sem->owner stuff
mixes regular writes and cmpxchg() which is known broken on a number of
platforms :/ Also the cmpxchg here can fail due to the cmpxchg in
rwsem_set_nonspinnable().

Also; below you've used atomic_long_andnot(); it seems to me you can do
the same here:

	atomic_long_andnot(RWSEM_OWNER_FLAGS_MASK,
			   (atomic_long_t *)&sem->owner);

(ideally without the cast, because you've made sem->owner atomic_long_t)

>  }
>  #else
>  static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
> @@ -175,6 +199,22 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
>  }
>  #endif
>  
> +/*
> + * Set the RWSEM_NONSPINNABLE bits if the RWSEM_READER_OWNED flag
> + * remains set. Otherwise, the operation will be aborted.
> + */
> +static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
> +{
> +	long owner = (long)READ_ONCE(sem->owner);
> +
> +	while (owner & RWSEM_READER_OWNED) {
> +		if (!is_rwsem_owner_spinnable((void *)owner, false))
> +			break;

		if (owner & RWSEM_RD_NONSPINNABLE)
			break;

is _much_ more readable. Also, afaict, this function doesn't do what the
comment on top says it does. If it has RD_NONSPINNABLE set, it will
never set WR_NONSPINNABLE.

> +		owner = cmpxchg((long *)&sem->owner, owner,
> +				owner | RWSEM_NONSPINNABLE);

Same comment for mixing normal stores and cmpxchg.

> +	}
> +}
> +
>  /*
>   * Guide to the rw_semaphore's count field.
>   *
> @@ -496,6 +536,11 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
>  
>  static inline bool owner_on_cpu(struct task_struct *owner)
>  {
> +	/*
> +	 * Clear all the flag bits in owner
> +	 */
> +	owner = owner_without_flags(owner);

Pointless comment; it says the exact same thing the function name says.

Why is this added now? How come we could never have bits set before this
patch?

Also note how you're doing this twice, the only callsite already does
owner_without_flags() on it before calling this.

>  	/*
>  	 * As lock holder preemption issue, we both skip spinning if
>  	 * task is not on cpu or its cpu is preempted
> @@ -503,12 +548,12 @@ static inline bool owner_on_cpu(struct task_struct *owner)
>  	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
>  }
>  
> -static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
> +static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, bool wr)
>  {
>  	struct task_struct *owner;
>  	bool ret = true;
>  
> -	BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
> +	BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN, true));

	BUILD_BUG_ON(!((unsigned long)RWSEM_OWNER_UNKNOWN & RWSEM_WR_NONSPINNABLE));

>  
>  	if (need_resched()) {
>  		lockevent_inc(rwsem_opt_fail);
> @@ -517,14 +562,17 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
>  
>  	preempt_disable();
>  	rcu_read_lock();
> +
>  	owner = READ_ONCE(sem->owner);
> +	if (!is_rwsem_owner_spinnable(owner, wr))
> +		ret = false;
> +	else if ((unsigned long)owner & RWSEM_READER_OWNED)
> +		ret = true;
> +	else if ((owner = owner_without_flags(owner)))
> +		ret = owner_on_cpu(owner);
> +

	state = rwsem_owner_state(owner, wr);
	if (state & OWNER_NONSPINNABLE)
		ret = false;
	else if (state & OWNER_READER)
		ret = true;
	else
		ret = owner_on_cpu(owner);

>  	rcu_read_unlock();
>  	preempt_enable();
> -
>  	lockevent_cond_inc(rwsem_opt_fail, !ret);
>  	return ret;
>  }

> @@ -603,10 +652,53 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
>  	return state;
>  }
>  
> +/*
> + * Calculate reader-owned rwsem spinning threshold for writer
> + *
> + * It is assumed that the more readers own the rwsem, the longer it will

No assumptions needed; it's a direct consequence of the propagation of
uncertainty:

  https://en.wikipedia.org/wiki/Propagation_of_uncertainty

Basically, if we can characterize a single read-side critical section as
a mean and stdev error, then for N readers the mean hold-time will be
identical, but the error will be sqrt(N*s^2).

And therefore the more readers, the more likely we'll see longer hold
times.

> + * take for them to wind down and free the rwsem. So the formula to
> + * determine the actual spinning time limit is:
> + *
> + * 1) RWSEM_FLAG_WAITERS set
> + *    Spinning threshold = (10 + nr_readers/2)us
> + *
> + * 2) RWSEM_FLAG_WAITERS not set
> + *    Spinning threshold = 25us
> + *
> + * In the first case when RWSEM_FLAG_WAITERS is set, no new reader can
> + * become rwsem owner. It is assumed that the more readers own the rwsem,
> + * the longer it will take for them to wind down and free the rwsem. In
> + * addition, if it happens that a previous task that releases the lock
> + * is in the process of waking up readers one-by-one, the process will
> + * take longer when more readers needed to be woken up. This is subjected
> + * to a maximum value of 25us.
> + *
> + * In the second case with RWSEM_FLAG_WAITERS off, new readers can join
> + * and become one of the owners. So assuming for the worst case and spin
> + * for at most 25us.
> + */
> +static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
> +{
> +	long count = atomic_long_read(&sem->count);
> +	u64 delta = 25 * NSEC_PER_USEC;
> +
> +	if (count & RWSEM_FLAG_WAITERS) {
> +		int readers = count >> RWSEM_READER_SHIFT;
> +
> +		if (readers > 30)
> +			readers = 30;
> +		delta = (20 + readers) * NSEC_PER_USEC / 2;
> +	}
> +
> +	return sched_clock() + delta;
> +}

Which, to me, would suggest we just use the active reader count. That
is, I'm still somewhat puzzled by the WAITERS thing.

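Just for illustration, a sketch of what that could look like, scaling the
threshold with the active reader count only (same constants as the patch
above; not an actual proposal):

static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
{
	long readers = atomic_long_read(&sem->count) >> RWSEM_READER_SHIFT;
	u64 delta;

	/* 10us base plus 0.5us per active reader, capped at 25us */
	readers = min_t(long, readers, 30);
	delta = (20 + readers) * NSEC_PER_USEC / 2;

	return sched_clock() + delta;
}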

> @@ -636,6 +727,39 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
>  		if (taken)
>  			break;
>  
> +		/*
> +		 * Time-based reader-owned rwsem optimistic spinning
> +		 */
> +		if (wlock && (owner_state == OWNER_READER)) {
> +			/*
> +			 * Re-initialize rspin_threshold every time when
> +			 * the owner state changes from non-reader to reader.
> +			 * This allows a writer to steal the lock in between
> +			 * 2 reader phases and have the threshold reset at
> +			 * the beginning of the 2nd reader phase.
> +			 */
> +			if (prev_owner_state != OWNER_READER) {
> +				if (!is_rwsem_spinnable(sem, wlock))

would that not be the same as: state & OWNER_NONSPINNABLE ?
Which then removes the last user of that function.

> +					break;
> +				rspin_threshold = rwsem_rspin_threshold(sem);
> +				loop = 0;
> +			}
> +
> +			/*
> +			 * Check time threshold once every 16 iterations to
> +			 * avoid calling sched_clock() too frequently so
> +			 * as to reduce the average latency between the times
> +			 * when the lock becomes free and when the spinner
> +			 * is ready to do a trylock.
> +			 */
> +			else if (!(++loop & 0xf) &&
> +				 (sched_clock() > rspin_threshold)) {
> +				rwsem_set_nonspinnable(sem);
> +				lockevent_inc(rwsem_opt_nospin);
> +				break;
> +			}
> +		}
> +
>  		/*
>  		 * An RT task cannot do optimistic spinning if it cannot
>  		 * be sure the lock holder is running or live-lock may
> @@ -690,8 +814,25 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
>  	lockevent_cond_inc(rwsem_opt_fail, !taken);
>  	return taken;
>  }
> +
> +/*
> + * Clear the owner's RWSEM_WR_NONSPINNABLE bit if it is set. This should
> + * only be called when the reader count reaches 0.
> + *
> + * This give writers better chance to acquire the rwsem first before
> + * readers when the rwsem was being held by readers for a relatively long
> + * period of time. Race can happen that an optimistic spinner may have
> + * just stolen the rwsem and set the owner, but just clearing the
> + * RWSEM_WR_NONSPINNABLE bit will do no harm anyway.
> + */
> +static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
> +{
> +	if (!is_rwsem_spinnable(sem, true))
> +		atomic_long_andnot(RWSEM_WR_NONSPINNABLE,
> +				  (atomic_long_t *)&sem->owner);
> +}

That's atrocious... and another clue that ->owner should maybe also be
atomic_long_t.

Something like:

static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
{
	if ((unsigned long)READ_ONCE(sem->owner) & RWSEM_WR_NONSPINNABLE) {
		atomic_long_andnot(RWSEM_WR_NONSPINNABLE,
				   (atomic_long_t *)&sem->owner);
	}
}

>  #else
> -static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
> +static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, bool wr)
>  {
>  	return false;
>  }

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [tip:locking/urgent] locking/rwsem: Prevent decrement of reader count before increment
  2019-04-28 21:25 ` [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count before increment Waiman Long
  2019-05-03 12:06   ` Peter Zijlstra
@ 2019-05-07  7:07   ` tip-bot for Waiman Long
  1 sibling, 0 replies; 41+ messages in thread
From: tip-bot for Waiman Long @ 2019-05-07  7:07 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: huang.ying.caritas, dave, hpa, bp, will.deacon, tglx, tim.c.chen,
	peterz, longman, torvalds, mingo, linux-kernel

Commit-ID:  a9e9bcb45b1525ba7aea26ed9441e8632aeeda58
Gitweb:     https://git.kernel.org/tip/a9e9bcb45b1525ba7aea26ed9441e8632aeeda58
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Sun, 28 Apr 2019 17:25:38 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 7 May 2019 08:46:46 +0200

locking/rwsem: Prevent decrement of reader count before increment

During my rwsem testing, it was found that after a down_read(), the
reader count may occasionally become 0 or even negative. Consequently,
a writer may steal the lock at that time and execute with the reader
in parallel thus breaking the mutual exclusion guarantee of the write
lock. In other words, both readers and writer can become rwsem owners
simultaneously.

The current reader wakeup code does it in one pass to clear waiter->task
and put them into wake_q before fully incrementing the reader count.
Once waiter->task is cleared, the corresponding reader may see it,
finish the critical section and do unlock to decrement the count before
the count is incremented. This is not a problem if there is only one
reader to wake up as the count has been pre-incremented by 1.  It is
a problem if there is more than one reader to be woken up and the writer
can steal the lock.

The wakeup was actually done in 2 passes before the following v4.9 commit:

  70800c3c0cc5 ("locking/rwsem: Scan the wait_list for readers only once")

To fix this problem, the wakeup is now done in two passes
again. In the first pass, we collect the readers and count them.
The reader count is then fully incremented. In the second pass, the
waiter->task is then cleared and they are put into wake_q to be woken
up later.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Fixes: 70800c3c0cc5 ("locking/rwsem: Scan the wait_list for readers only once")
Link: http://lkml.kernel.org/r/20190428212557.13482-2-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/rwsem-xadd.c | 46 ++++++++++++++++++++++++++++++---------------
 1 file changed, 31 insertions(+), 15 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 6b3ee9948bf1..0b1f77957240 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -130,6 +130,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 {
 	struct rwsem_waiter *waiter, *tmp;
 	long oldcount, woken = 0, adjustment = 0;
+	struct list_head wlist;
 
 	/*
 	 * Take a peek at the queue head waiter such that we can determine
@@ -188,18 +189,43 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 	 * of the queue. We know that woken will be at least 1 as we accounted
 	 * for above. Note we increment the 'active part' of the count by the
 	 * number of readers before waking any processes up.
+	 *
+	 * We have to do wakeup in 2 passes to prevent the possibility that
+	 * the reader count may be decremented before it is incremented. It
+	 * is because the to-be-woken waiter may not have slept yet. So it
+	 * may see waiter->task got cleared, finish its critical section and
+	 * do an unlock before the reader count increment.
+	 *
+	 * 1) Collect the read-waiters in a separate list, count them and
+	 *    fully increment the reader count in rwsem.
+	 * 2) For each waiters in the new list, clear waiter->task and
+	 *    put them into wake_q to be woken up later.
 	 */
-	list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
-		struct task_struct *tsk;
-
+	list_for_each_entry(waiter, &sem->wait_list, list) {
 		if (waiter->type == RWSEM_WAITING_FOR_WRITE)
 			break;
 
 		woken++;
-		tsk = waiter->task;
+	}
+	list_cut_before(&wlist, &sem->wait_list, &waiter->list);
+
+	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
+	lockevent_cond_inc(rwsem_wake_reader, woken);
+	if (list_empty(&sem->wait_list)) {
+		/* hit end of list above */
+		adjustment -= RWSEM_WAITING_BIAS;
+	}
+
+	if (adjustment)
+		atomic_long_add(adjustment, &sem->count);
+
+	/* 2nd pass */
+	list_for_each_entry_safe(waiter, tmp, &wlist, list) {
+		struct task_struct *tsk;
 
+		tsk = waiter->task;
 		get_task_struct(tsk);
-		list_del(&waiter->list);
+
 		/*
 		 * Ensure calling get_task_struct() before setting the reader
 		 * waiter to nil such that rwsem_down_read_failed() cannot
@@ -213,16 +239,6 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 		 */
 		wake_q_add_safe(wake_q, tsk);
 	}
-
-	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
-	lockevent_cond_inc(rwsem_wake_reader, woken);
-	if (list_empty(&sem->wait_list)) {
-		/* hit end of list above */
-		adjustment -= RWSEM_WAITING_BIAS;
-	}
-
-	if (adjustment)
-		atomic_long_add(adjustment, &sem->count);
 }
 
 /*

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2019-05-07  7:07 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-28 21:25 [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Waiman Long
2019-04-28 21:25 ` [PATCH-tip v7 01/20] locking/rwsem: Prevent decrement of reader count before increment Waiman Long
2019-05-03 12:06   ` Peter Zijlstra
2019-05-03 13:32     ` Waiman Long
2019-05-07  7:07   ` [tip:locking/urgent] " tip-bot for Waiman Long
2019-04-28 21:25 ` [PATCH-tip v7 02/20] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER Waiman Long
2019-04-28 21:25 ` [PATCH-tip v7 03/20] locking/rwsem: Remove rwsem_wake() wakeup optimization Waiman Long
2019-04-28 21:25 ` [PATCH-tip v7 04/20] locking/rwsem: Implement a new locking scheme Waiman Long
2019-04-28 21:25 ` [PATCH-tip v7 05/20] locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c Waiman Long
2019-04-28 21:25 ` [PATCH-tip v7 06/20] locking/rwsem: Code cleanup after files merging Waiman Long
2019-04-28 21:25 ` [PATCH-tip v7 07/20] locking/rwsem: Make rwsem_spin_on_owner() return owner state Waiman Long
2019-04-28 21:25 ` [PATCH-tip v7 08/20] locking/rwsem: Implement lock handoff to prevent lock starvation Waiman Long
2019-05-03 13:10   ` Peter Zijlstra
2019-05-03 13:57     ` Waiman Long
2019-05-03 14:37     ` David Laight
2019-04-28 21:25 ` [PATCH-tip v7 09/20] locking/rwsem: Always release wait_lock before waking up tasks Waiman Long
2019-05-03 13:37   ` Peter Zijlstra
2019-05-03 13:56     ` Waiman Long
2019-04-28 21:25 ` [PATCH-tip v7 10/20] locking/rwsem: More optimal RT task handling of null owner Waiman Long
2019-04-28 21:25 ` [PATCH-tip v7 11/20] locking/rwsem: Wake up almost all readers in wait queue Waiman Long
2019-05-03 16:51   ` Peter Zijlstra
2019-05-03 17:15     ` Waiman Long
2019-05-06 11:49       ` Peter Zijlstra
2019-04-28 21:25 ` [PATCH-tip v7 12/20] locking/rwsem: Clarify usage of owner's nonspinnable bit Waiman Long
2019-05-03 15:21   ` Peter Zijlstra
2019-05-03 15:26     ` Waiman Long
2019-04-28 21:25 ` [PATCH-tip v7 13/20] locking/rwsem: Enable readers spinning on writer Waiman Long
2019-04-28 21:25 ` [PATCH-tip v7 14/20] locking/rwsem: Enable time-based spinning on reader-owned rwsem Waiman Long
2019-05-06 15:47   ` Peter Zijlstra
2019-04-28 21:25 ` [PATCH-tip v7 15/20] locking/rwsem: Adaptive disabling of reader optimistic spinning Waiman Long
2019-04-28 21:25 ` [PATCH-tip v7 16/20] locking/rwsem: Add more rwsem owner access helpers Waiman Long
2019-04-28 21:25 ` [PATCH-tip v7 17/20] locking/rwsem: Guard against making count negative Waiman Long
2019-04-28 21:25 ` [PATCH-tip v7 18/20] locking/rwsem: Merge owner into count on x86-64 Waiman Long
2019-04-28 21:25 ` [PATCH-tip v7 19/20] locking/rwsem: Remove redundant computation of writer lock word Waiman Long
2019-04-28 21:25 ` [PATCH-tip v7 20/20] locking/rwsem: Disable preemption in down_read*() if owner in count Waiman Long
2019-04-28 22:46 ` [PATCH-tip v7 00/20] locking/rwsem: Rwsem rearchitecture part 2 Linus Torvalds
2019-04-28 23:12   ` Waiman Long
2019-04-28 23:19     ` Waiman Long
2019-04-29  0:10     ` Linus Torvalds
2019-04-29  0:27       ` Waiman Long
2019-04-29  2:41         ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).