LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 0/8] latched RB-trees and __module_address()
@ 2015-03-18 13:36 Peter Zijlstra
  2015-03-18 13:36 ` [PATCH 1/8] module: Sanitize RCU usage and locking Peter Zijlstra
                   ` (8 more replies)
  0 siblings, 9 replies; 28+ messages in thread
From: Peter Zijlstra @ 2015-03-18 13:36 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz

This series is aimed at making __module_address() go fast(er).

On the way there is:
 - annotates and sanitizes module locking
 - introduces the latched RB-tree
 - employs it to make __module_address() go fast.

I've build and boot tested this on x86_64 with modules and lockdep
enabled.  Performance numbers (below) are done with lockdep disabled.

As mentioned in the previous posting; the reason for writing the latched
RB-tree as generic code is mostly for clarity/documentation purposes; as
there are a number of separate and non trivial bits to the complete
solution.

As measued on my ivb-ep system with 84 modules loaded; prior to patching
the test module (below) reports:

          avg +- stdev
Before:  1689 +- 287 [ns] per __module_address() call
After:    137 +-  38 [ns] per __module_address() call

Note; I have also tested things like: perf record -a -g modprobe
mod_test, to make 'sure' to hit some of the more interesting paths.

---
 kernel/Makefile   |    2 ++
 kernel/mod_test.c |   30 ++++++++++++++++++++++++++++++
 kernel/module.c   |    3 ++-
 3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/kernel/Makefile b/kernel/Makefile
index 1408b33..ec69606 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -28,6 +28,8 @@ obj-y += irq/
 obj-y += rcu/
 obj-y += livepatch/
 
+obj-m += mod_test.o
+
 obj-$(CONFIG_CHECKPOINT_RESTORE) += kcmp.o
 obj-$(CONFIG_FREEZER) += freezer.o
 obj-$(CONFIG_PROFILING) += profile.o
diff --git a/kernel/mod_test.c b/kernel/mod_test.c
index e69de29..cf32cfc 100644
--- a/kernel/mod_test.c
+++ b/kernel/mod_test.c
@@ -0,0 +1,30 @@
+
+#include <linux/module.h>
+#include <linux/sched.h>
+
+MODULE_LICENSE("GPL");
+
+extern unsigned long module_addr_max;
+
+static int __init test_init(void)
+{
+	u64 t1, t2;
+
+	local_irq_disable();
+	t1 = sched_clock();
+	barrier();
+	t1 = sched_clock() - t1;
+
+	barrier();
+
+	t2 = sched_clock();
+	(void)__module_address(module_addr_max);
+	t2 = sched_clock() - t2;
+	local_irq_enable();
+
+	printk("time: %Lu %Lu %Lu\n", t2 - t1, t1, t2);
+
+	return -EINVAL;
+}
+
+module_init(test_init);
diff --git a/kernel/module.c b/kernel/module.c
index b3d634e..6423dc2 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -155,7 +155,8 @@ static BLOCKING_NOTIFIER_HEAD(module_notify_list);
 
 /* Bounds of module allocation, for speeding __module_address.
  * Protected by module_mutex. */
-static unsigned long module_addr_min = -1UL, module_addr_max = 0;
+unsigned long module_addr_min = -1UL, module_addr_max = 0;
+EXPORT_SYMBOL(module_addr_max);
 
 int register_module_notifier(struct notifier_block *nb)
 {


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/8] module: Sanitize RCU usage and locking
  2015-03-18 13:36 [PATCH 0/8] latched RB-trees and __module_address() Peter Zijlstra
@ 2015-03-18 13:36 ` Peter Zijlstra
  2015-03-20  3:36   ` Rusty Russell
  2015-03-18 13:36 ` [PATCH 2/8] module: Annotate module version magic Peter Zijlstra
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2015-03-18 13:36 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz

[-- Attachment #1: peterz-module-sanitize-rcu.patch --]
[-- Type: text/plain, Size: 5648 bytes --]

Currently the RCU usage in module is an inconsistent mess of RCU and
RCU-sched, this is broken for CONFIG_PREEMPT where synchronize_rcu()
does not imply synchronize_sched().

Most usage sites use preempt_{dis,en}able() which is RCU-sched, but
(most of) the modification sites use synchronize_rcu(). With the
exception of the module bug list, which actually uses RCU.

Convert everything over to RCU-sched.

Furthermore add lockdep asserts to all sites, because its not at all
clear to me the required locking is observed, esp. on exported
functions.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/module.h |   12 ++++++++++--
 kernel/module.c        |   42 ++++++++++++++++++++++++++++++++++--------
 lib/bug.c              |    7 +++++--
 3 files changed, 49 insertions(+), 12 deletions(-)

--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -415,14 +415,22 @@ struct symsearch {
 	bool unused;
 };
 
-/* Search for an exported symbol by name. */
+/*
+ * Search for an exported symbol by name.
+ *
+ * Must be called with module_mutex held or preemption disabled.
+ */
 const struct kernel_symbol *find_symbol(const char *name,
 					struct module **owner,
 					const unsigned long **crc,
 					bool gplok,
 					bool warn);
 
-/* Walk the exported symbol table */
+/*
+ * Walk the exported symbol table
+ *
+ * Must be called with module_mutex held or preemption disabled.
+ */
 bool each_symbol_section(bool (*fn)(const struct symsearch *arr,
 				    struct module *owner,
 				    void *data), void *data);
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -106,6 +106,24 @@ static LIST_HEAD(modules);
 struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
 #endif /* CONFIG_KGDB_KDB */
 
+static void module_assert_mutex(void)
+{
+	lockdep_assert_held(&module_mutex);
+}
+
+static void module_assert_mutex_or_preempt(void)
+{
+#ifdef CONFIG_LOCKDEP
+	int rcu_held = rcu_read_lock_sched_held();
+	int mutex_held = 1;
+
+	if (debug_locks)
+		mutex_held = lockdep_is_held(&module_mutex);
+
+	WARN_ON(!rcu_held && !mutex_held);
+#endif
+}
+
 #ifdef CONFIG_MODULE_SIG
 #ifdef CONFIG_MODULE_SIG_FORCE
 static bool sig_enforce = true;
@@ -319,6 +337,8 @@ bool each_symbol_section(bool (*fn)(cons
 #endif
 	};
 
+	module_assert_mutex_or_preempt();
+
 	if (each_symbol_in_section(arr, ARRAY_SIZE(arr), NULL, fn, data))
 		return true;
 
@@ -458,6 +478,8 @@ static struct module *find_module_all(co
 {
 	struct module *mod;
 
+	module_assert_mutex();
+
 	list_for_each_entry(mod, &modules, list) {
 		if (!even_unformed && mod->state == MODULE_STATE_UNFORMED)
 			continue;
@@ -1856,8 +1878,8 @@ static void free_module(struct module *m
 	list_del_rcu(&mod->list);
 	/* Remove this module from bug list, this uses list_del_rcu */
 	module_bug_cleanup(mod);
-	/* Wait for RCU synchronizing before releasing mod->list and buglist. */
-	synchronize_rcu();
+	/* Wait for RCU-sched synchronizing before releasing mod->list and buglist. */
+	synchronize_sched();
 	mutex_unlock(&module_mutex);
 
 	/* This may be NULL, but that's OK */
@@ -3106,11 +3128,11 @@ static noinline int do_init_module(struc
 	mod->init_text_size = 0;
 	/*
 	 * We want to free module_init, but be aware that kallsyms may be
-	 * walking this with preempt disabled.  In all the failure paths,
-	 * we call synchronize_rcu/synchronize_sched, but we don't want
-	 * to slow down the success path, so use actual RCU here.
+	 * walking this with preempt disabled.  In all the failure paths, we
+	 * call synchronize_sched, but we don't want to slow down the success
+	 * path, so use actual RCU here.
 	 */
-	call_rcu(&freeinit->rcu, do_free_init);
+	call_rcu_sched(&freeinit->rcu, do_free_init);
 	mutex_unlock(&module_mutex);
 	wake_up_all(&module_wq);
 
@@ -3368,8 +3390,8 @@ static int load_module(struct load_info
 	/* Unlink carefully: kallsyms could be walking list. */
 	list_del_rcu(&mod->list);
 	wake_up_all(&module_wq);
-	/* Wait for RCU synchronizing before releasing mod->list. */
-	synchronize_rcu();
+	/* Wait for RCU-sched synchronizing before releasing mod->list. */
+	synchronize_sched();
 	mutex_unlock(&module_mutex);
  free_module:
 	/* Free lock-classes; relies on the preceding sync_rcu() */
@@ -3636,6 +3658,8 @@ int module_kallsyms_on_each_symbol(int (
 	unsigned int i;
 	int ret;
 
+	module_assert_mutex();
+
 	list_for_each_entry(mod, &modules, list) {
 		if (mod->state == MODULE_STATE_UNFORMED)
 			continue;
@@ -3810,6 +3834,8 @@ struct module *__module_address(unsigned
 	if (addr < module_addr_min || addr > module_addr_max)
 		return NULL;
 
+	module_assert_mutex_or_preempt();
+
 	list_for_each_entry_rcu(mod, &modules, list) {
 		if (mod->state == MODULE_STATE_UNFORMED)
 			continue;
--- a/lib/bug.c
+++ b/lib/bug.c
@@ -66,7 +66,7 @@ static const struct bug_entry *module_fi
 	struct module *mod;
 	const struct bug_entry *bug = NULL;
 
-	rcu_read_lock();
+	rcu_read_lock_sched();
 	list_for_each_entry_rcu(mod, &module_bug_list, bug_list) {
 		unsigned i;
 
@@ -77,7 +77,7 @@ static const struct bug_entry *module_fi
 	}
 	bug = NULL;
 out:
-	rcu_read_unlock();
+	rcu_read_unlock_sched();
 
 	return bug;
 }
@@ -88,6 +88,8 @@ void module_bug_finalize(const Elf_Ehdr
 	char *secstrings;
 	unsigned int i;
 
+	lockdep_assert_held(&module_mutex);
+
 	mod->bug_table = NULL;
 	mod->num_bugs = 0;
 
@@ -113,6 +115,7 @@ void module_bug_finalize(const Elf_Ehdr
 
 void module_bug_cleanup(struct module *mod)
 {
+	lockdep_assert_held(&module_mutex);
 	list_del_rcu(&mod->bug_list);
 }
 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/8] module: Annotate module version magic
  2015-03-18 13:36 [PATCH 0/8] latched RB-trees and __module_address() Peter Zijlstra
  2015-03-18 13:36 ` [PATCH 1/8] module: Sanitize RCU usage and locking Peter Zijlstra
@ 2015-03-18 13:36 ` Peter Zijlstra
  2015-03-18 13:36 ` [PATCH 3/8] module, jump_label: Fix module locking Peter Zijlstra
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2015-03-18 13:36 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz

[-- Attachment #1: peterz-module-fix-load_module.patch --]
[-- Type: text/plain, Size: 2807 bytes --]

Due to the new lockdep checks we go:

[    9.759380] ------------[ cut here ]------------
[    9.759389] WARNING: CPU: 31 PID: 597 at ../kernel/module.c:216 each_symbol_section+0x121/0x130()
[    9.759391] Modules linked in:
[    9.759393] CPU: 31 PID: 597 Comm: modprobe Not tainted 4.0.0-rc1+ #65
[    9.759393] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[    9.759396]  ffffffff817d8676 ffff880424567ca8 ffffffff8157e98b 0000000000000001
[    9.759398]  0000000000000000 ffff880424567ce8 ffffffff8105fbc7 ffff880424567cd8
[    9.759400]  0000000000000000 ffffffff810ec160 ffff880424567d40 0000000000000000
[    9.759400] Call Trace:
[    9.759407]  [<ffffffff8157e98b>] dump_stack+0x4f/0x7b
[    9.759410]  [<ffffffff8105fbc7>] warn_slowpath_common+0x97/0xe0
[    9.759412]  [<ffffffff810ec160>] ? section_objs+0x60/0x60
[    9.759414]  [<ffffffff8105fc2a>] warn_slowpath_null+0x1a/0x20
[    9.759415]  [<ffffffff810ed9c1>] each_symbol_section+0x121/0x130
[    9.759417]  [<ffffffff810eda01>] find_symbol+0x31/0x70
[    9.759420]  [<ffffffff810ef5bf>] load_module+0x20f/0x2660
[    9.759422]  [<ffffffff8104ef10>] ? __do_page_fault+0x190/0x4e0
[    9.759426]  [<ffffffff815880ec>] ? retint_restore_args+0x13/0x13
[    9.759427]  [<ffffffff815880ec>] ? retint_restore_args+0x13/0x13
[    9.759433]  [<ffffffff810ae73d>] ? trace_hardirqs_on_caller+0x11d/0x1e0
[    9.759437]  [<ffffffff812fcc0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[    9.759439]  [<ffffffff815880ec>] ? retint_restore_args+0x13/0x13
[    9.759441]  [<ffffffff810f1ade>] SyS_init_module+0xce/0x100
[    9.759443]  [<ffffffff81587429>] system_call_fastpath+0x12/0x17
[    9.759445] ---[ end trace 9294429076a9c644 ]---

As per the comment this site should be fine, but lets wrap it in
preempt_disable() anyhow to placate lockdep.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/module.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1191,11 +1191,17 @@ static inline int check_modstruct_versio
 {
 	const unsigned long *crc;
 
-	/* Since this should be found in kernel (which can't be removed),
-	 * no locking is necessary. */
+	/*
+	 * Since this should be found in kernel (which can't be removed), no
+	 * locking is necessary -- use preempt_disable() to placate lockdep.
+	 */
+	preempt_disable();
 	if (!find_symbol(VMLINUX_SYMBOL_STR(module_layout), NULL,
-			 &crc, true, false))
+			 &crc, true, false)) {
+		preempt_enable();
 		BUG();
+	}
+	preempt_enable();
 	return check_version(sechdrs, versindex,
 			     VMLINUX_SYMBOL_STR(module_layout), mod, crc,
 			     NULL);



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3/8] module, jump_label: Fix module locking
  2015-03-18 13:36 [PATCH 0/8] latched RB-trees and __module_address() Peter Zijlstra
  2015-03-18 13:36 ` [PATCH 1/8] module: Sanitize RCU usage and locking Peter Zijlstra
  2015-03-18 13:36 ` [PATCH 2/8] module: Annotate module version magic Peter Zijlstra
@ 2015-03-18 13:36 ` Peter Zijlstra
  2015-03-20  4:26   ` Rusty Russell
  2015-03-18 13:36 ` [PATCH 4/8] rbtree: Make lockless searches non-fatal Peter Zijlstra
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2015-03-18 13:36 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz, Jason Baron

[-- Attachment #1: peterz-module-fix-jump_label.patch --]
[-- Type: text/plain, Size: 3127 bytes --]

As per the module core lockdep annotations:

[   18.034047] ---[ end trace 9294429076a9c673 ]---
[   18.047760] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[   18.059228]  ffffffff817d8676 ffff880036683c38 ffffffff8157e98b 0000000000000001
[   18.067541]  0000000000000000 ffff880036683c78 ffffffff8105fbc7 ffff880036683c68
[   18.075851]  ffffffffa0046b08 0000000000000000 ffffffffa0046d00 ffffffffa0046cc8
[   18.084173] Call Trace:
[   18.086906]  [<ffffffff8157e98b>] dump_stack+0x4f/0x7b
[   18.092649]  [<ffffffff8105fbc7>] warn_slowpath_common+0x97/0xe0
[   18.099361]  [<ffffffff8105fc2a>] warn_slowpath_null+0x1a/0x20
[   18.105880]  [<ffffffff810ee502>] __module_address+0x1d2/0x1e0
[   18.112400]  [<ffffffff81161153>] jump_label_module_notify+0x143/0x1e0
[   18.119710]  [<ffffffff810814bf>] notifier_call_chain+0x4f/0x70
[   18.126326]  [<ffffffff8108160e>] __blocking_notifier_call_chain+0x5e/0x90
[   18.134009]  [<ffffffff81081656>] blocking_notifier_call_chain+0x16/0x20
[   18.141490]  [<ffffffff810f0f00>] load_module+0x1b50/0x2660
[   18.147720]  [<ffffffff810f1ade>] SyS_init_module+0xce/0x100
[   18.154045]  [<ffffffff81587429>] system_call_fastpath+0x12/0x17
[   18.160748] ---[ end trace 9294429076a9c674 ]---

Jump labels is not doing it right; fix this.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jason Baron <jbaron@akamai.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/jump_label.c |   21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -280,6 +280,17 @@ void jump_label_apply_nops(struct module
 	}
 }
 
+static inline bool address_in_module(unsigned long addr, struct module *mod)
+{
+	bool ret;
+
+	preempt_disable();
+	ret = __module_address(addr) == mod;
+	preempt_enable();
+
+	return ret;
+}
+
 static int jump_label_add_module(struct module *mod)
 {
 	struct jump_entry *iter_start = mod->jump_entries;
@@ -302,7 +313,7 @@ static int jump_label_add_module(struct
 			continue;
 
 		key = iterk;
-		if (__module_address(iter->key) == mod) {
+		if (address_in_module(iter->key, mod)) {
 			/*
 			 * Set key->entries to iter, but preserve JUMP_LABEL_TRUE_BRANCH.
 			 */
@@ -339,7 +350,7 @@ static void jump_label_del_module(struct
 
 		key = (struct static_key *)(unsigned long)iter->key;
 
-		if (__module_address(iter->key) == mod)
+		if (address_in_module(iter->key, mod))
 			continue;
 
 		prev = &key->next;
@@ -443,14 +454,16 @@ static void jump_label_update(struct sta
 {
 	struct jump_entry *stop = __stop___jump_table;
 	struct jump_entry *entry = jump_label_get_entries(key);
-
 #ifdef CONFIG_MODULES
-	struct module *mod = __module_address((unsigned long)key);
+	struct module *mod;
 
 	__jump_label_mod_update(key, enable);
 
+	preempt_disable();
+	mod = __module_address((unsigned long)key);
 	if (mod)
 		stop = mod->jump_entries + mod->num_jump_entries;
+	preempt_enable();
 #endif
 	/* if there are no users, entry can be NULL */
 	if (entry)



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 4/8] rbtree: Make lockless searches non-fatal
  2015-03-18 13:36 [PATCH 0/8] latched RB-trees and __module_address() Peter Zijlstra
                   ` (2 preceding siblings ...)
  2015-03-18 13:36 ` [PATCH 3/8] module, jump_label: Fix module locking Peter Zijlstra
@ 2015-03-18 13:36 ` Peter Zijlstra
  2015-03-18 13:36 ` [PATCH 5/8] seqlock: Better document raw_write_seqcount_latch() Peter Zijlstra
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2015-03-18 13:36 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz, David Woodhouse,
	Rik van Riel, Andrea Arcangeli, Michel Lespinasse

[-- Attachment #1: peterz-rbtree-lockless.patch --]
[-- Type: text/plain, Size: 10107 bytes --]

Change the insert and erase code such that lockless searches are
non-fatal.

In and of itself an rbtree cannot be correctly searched while
in-modification, we can however provide weaker guarantees that will
allow the rbtree to be used in conjunction with other techniques, such
as latches; see 9b0fd802e8c0 ("seqcount: Add raw_write_seqcount_latch()").

For this to work we need the following guarantees from the rbtree
code:

 1) a lockless reader must not see partial stores, this would allow it
    to observe nodes that are invalid memory.

 2) there must not be (temporary) loops in the tree structure in the
    modifier's program order, this would cause a lookup which
    interrupts the modifier to get stuck indefinitely.

For 1) we must use WRITE_ONCE() for all updates to the tree structure;
in particular this patch only does rb_{left,right} as those are the
only element required for simple searches.

It generates slightly worse code, probably because volatile. But in
pointer chasing heavy code a few instructions more should not matter.

For 2) I have carefully audited the code and drawn every intermediate
link state and not found a loop.

Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/rbtree.h           |   10 +++++
 include/linux/rbtree_augmented.h |   21 +++++++---
 lib/rbtree.c                     |   76 +++++++++++++++++++++++++++------------
 3 files changed, 78 insertions(+), 29 deletions(-)

--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -31,6 +31,7 @@
 
 #include <linux/kernel.h>
 #include <linux/stddef.h>
+#include <linux/rcupdate.h>
 
 struct rb_node {
 	unsigned long  __rb_parent_color;
@@ -85,6 +86,15 @@ static inline void rb_link_node(struct r
 	*rb_link = node;
 }
 
+static inline void rb_link_node_rcu(struct rb_node * node, struct rb_node * parent,
+				    struct rb_node ** rb_link)
+{
+	node->__rb_parent_color = (unsigned long)parent;
+	node->rb_left = node->rb_right = NULL;
+
+	rcu_assign_pointer(*rb_link, node);
+}
+
 #define rb_entry_safe(ptr, type, member) \
 	({ typeof(ptr) ____ptr = (ptr); \
 	   ____ptr ? rb_entry(____ptr, type, member) : NULL; \
--- a/include/linux/rbtree_augmented.h
+++ b/include/linux/rbtree_augmented.h
@@ -123,11 +123,11 @@ __rb_change_child(struct rb_node *old, s
 {
 	if (parent) {
 		if (parent->rb_left == old)
-			parent->rb_left = new;
+			WRITE_ONCE(parent->rb_left, new);
 		else
-			parent->rb_right = new;
+			WRITE_ONCE(parent->rb_right, new);
 	} else
-		root->rb_node = new;
+		WRITE_ONCE(root->rb_node, new);
 }
 
 extern void __rb_erase_color(struct rb_node *parent, struct rb_root *root,
@@ -137,7 +137,8 @@ static __always_inline struct rb_node *
 __rb_erase_augmented(struct rb_node *node, struct rb_root *root,
 		     const struct rb_augment_callbacks *augment)
 {
-	struct rb_node *child = node->rb_right, *tmp = node->rb_left;
+	struct rb_node *child = node->rb_right;
+	struct rb_node *tmp = node->rb_left;
 	struct rb_node *parent, *rebalance;
 	unsigned long pc;
 
@@ -167,6 +168,7 @@ __rb_erase_augmented(struct rb_node *nod
 		tmp = parent;
 	} else {
 		struct rb_node *successor = child, *child2;
+
 		tmp = child->rb_left;
 		if (!tmp) {
 			/*
@@ -180,6 +182,7 @@ __rb_erase_augmented(struct rb_node *nod
 			 */
 			parent = successor;
 			child2 = successor->rb_right;
+
 			augment->copy(node, successor);
 		} else {
 			/*
@@ -201,19 +204,23 @@ __rb_erase_augmented(struct rb_node *nod
 				successor = tmp;
 				tmp = tmp->rb_left;
 			} while (tmp);
-			parent->rb_left = child2 = successor->rb_right;
-			successor->rb_right = child;
+			child2 = successor->rb_right;
+			WRITE_ONCE(parent->rb_left, child2);
+			WRITE_ONCE(successor->rb_right, child);
 			rb_set_parent(child, successor);
+
 			augment->copy(node, successor);
 			augment->propagate(parent, successor);
 		}
 
-		successor->rb_left = tmp = node->rb_left;
+		tmp = node->rb_left;
+		WRITE_ONCE(successor->rb_left, tmp);
 		rb_set_parent(tmp, successor);
 
 		pc = node->__rb_parent_color;
 		tmp = __rb_parent(pc);
 		__rb_change_child(node, successor, tmp, root);
+
 		if (child2) {
 			successor->__rb_parent_color = pc;
 			rb_set_parent_color(child2, parent, RB_BLACK);
--- a/lib/rbtree.c
+++ b/lib/rbtree.c
@@ -44,6 +44,30 @@
  *  parentheses and have some accompanying text comment.
  */
 
+/*
+ * Notes on lockless lookups:
+ *
+ * All stores to the tree structure (rb_left and rb_right) must be done using
+ * WRITE_ONCE(). And we must not inadvertently cause (temporary) loops in the
+ * tree structure as seen in program order.
+ *
+ * These two requirements will allow lockless iteration of the tree -- not
+ * correct iteration mind you, tree rotations are not atomic so a lookup might
+ * miss entire subtrees.
+ *
+ * But they do guarantee that any such traversal will only see valid elements
+ * and that it will indeed complete -- does not get stuck in a loop.
+ *
+ * It also guarantees that if the lookup returns an element it is the 'correct'
+ * one. But not returning an element does _NOT_ mean its not present.
+ *
+ * NOTE:
+ *
+ * Stores to __rb_parent_color are not important for simple lookups so those
+ * are left undone as of now. Nor did I check for loops involving parent
+ * pointers.
+ */
+
 static inline void rb_set_black(struct rb_node *rb)
 {
 	rb->__rb_parent_color |= RB_BLACK;
@@ -129,8 +153,9 @@ __rb_insert(struct rb_node *node, struct
 				 * This still leaves us in violation of 4), the
 				 * continuation into Case 3 will fix that.
 				 */
-				parent->rb_right = tmp = node->rb_left;
-				node->rb_left = parent;
+				tmp = node->rb_left;
+				WRITE_ONCE(parent->rb_right, tmp);
+				WRITE_ONCE(node->rb_left, parent);
 				if (tmp)
 					rb_set_parent_color(tmp, parent,
 							    RB_BLACK);
@@ -149,8 +174,8 @@ __rb_insert(struct rb_node *node, struct
 			 *     /                 \
 			 *    n                   U
 			 */
-			gparent->rb_left = tmp;  /* == parent->rb_right */
-			parent->rb_right = gparent;
+			WRITE_ONCE(gparent->rb_left, tmp); /* == parent->rb_right */
+			WRITE_ONCE(parent->rb_right, gparent);
 			if (tmp)
 				rb_set_parent_color(tmp, gparent, RB_BLACK);
 			__rb_rotate_set_parents(gparent, parent, root, RB_RED);
@@ -171,8 +196,9 @@ __rb_insert(struct rb_node *node, struct
 			tmp = parent->rb_left;
 			if (node == tmp) {
 				/* Case 2 - right rotate at parent */
-				parent->rb_left = tmp = node->rb_right;
-				node->rb_right = parent;
+				tmp = node->rb_right;
+				WRITE_ONCE(parent->rb_left, tmp);
+				WRITE_ONCE(node->rb_right, parent);
 				if (tmp)
 					rb_set_parent_color(tmp, parent,
 							    RB_BLACK);
@@ -183,8 +209,8 @@ __rb_insert(struct rb_node *node, struct
 			}
 
 			/* Case 3 - left rotate at gparent */
-			gparent->rb_right = tmp;  /* == parent->rb_left */
-			parent->rb_left = gparent;
+			WRITE_ONCE(gparent->rb_right, tmp); /* == parent->rb_left */
+			WRITE_ONCE(parent->rb_left, gparent);
 			if (tmp)
 				rb_set_parent_color(tmp, gparent, RB_BLACK);
 			__rb_rotate_set_parents(gparent, parent, root, RB_RED);
@@ -224,8 +250,9 @@ ____rb_erase_color(struct rb_node *paren
 				 *      / \         / \
 				 *     Sl  Sr      N   Sl
 				 */
-				parent->rb_right = tmp1 = sibling->rb_left;
-				sibling->rb_left = parent;
+				tmp1 = sibling->rb_left;
+				WRITE_ONCE(parent->rb_right, tmp1);
+				WRITE_ONCE(sibling->rb_left, parent);
 				rb_set_parent_color(tmp1, parent, RB_BLACK);
 				__rb_rotate_set_parents(parent, sibling, root,
 							RB_RED);
@@ -275,9 +302,10 @@ ____rb_erase_color(struct rb_node *paren
 				 *                       \
 				 *                        Sr
 				 */
-				sibling->rb_left = tmp1 = tmp2->rb_right;
-				tmp2->rb_right = sibling;
-				parent->rb_right = tmp2;
+				tmp1 = tmp2->rb_right;
+				WRITE_ONCE(sibling->rb_left, tmp1);
+				WRITE_ONCE(tmp2->rb_right, sibling);
+				WRITE_ONCE(parent->rb_right, tmp2);
 				if (tmp1)
 					rb_set_parent_color(tmp1, sibling,
 							    RB_BLACK);
@@ -297,8 +325,9 @@ ____rb_erase_color(struct rb_node *paren
 			 *        / \         / \
 			 *      (sl) sr      N  (sl)
 			 */
-			parent->rb_right = tmp2 = sibling->rb_left;
-			sibling->rb_left = parent;
+			tmp2 = sibling->rb_left;
+			WRITE_ONCE(parent->rb_right, tmp2);
+			WRITE_ONCE(sibling->rb_left, parent);
 			rb_set_parent_color(tmp1, sibling, RB_BLACK);
 			if (tmp2)
 				rb_set_parent(tmp2, parent);
@@ -310,8 +339,9 @@ ____rb_erase_color(struct rb_node *paren
 			sibling = parent->rb_left;
 			if (rb_is_red(sibling)) {
 				/* Case 1 - right rotate at parent */
-				parent->rb_left = tmp1 = sibling->rb_right;
-				sibling->rb_right = parent;
+				tmp1 = sibling->rb_right;
+				WRITE_ONCE(parent->rb_left, tmp1);
+				WRITE_ONCE(sibling->rb_right, parent);
 				rb_set_parent_color(tmp1, parent, RB_BLACK);
 				__rb_rotate_set_parents(parent, sibling, root,
 							RB_RED);
@@ -336,9 +366,10 @@ ____rb_erase_color(struct rb_node *paren
 					break;
 				}
 				/* Case 3 - right rotate at sibling */
-				sibling->rb_right = tmp1 = tmp2->rb_left;
-				tmp2->rb_left = sibling;
-				parent->rb_left = tmp2;
+				tmp1 = tmp2->rb_left;
+				WRITE_ONCE(sibling->rb_right, tmp1);
+				WRITE_ONCE(tmp2->rb_left, sibling);
+				WRITE_ONCE(parent->rb_left, tmp2);
 				if (tmp1)
 					rb_set_parent_color(tmp1, sibling,
 							    RB_BLACK);
@@ -347,8 +378,9 @@ ____rb_erase_color(struct rb_node *paren
 				sibling = tmp2;
 			}
 			/* Case 4 - left rotate at parent + color flips */
-			parent->rb_left = tmp2 = sibling->rb_right;
-			sibling->rb_right = parent;
+			tmp2 = sibling->rb_right;
+			WRITE_ONCE(parent->rb_left, tmp2);
+			WRITE_ONCE(sibling->rb_right, parent);
 			rb_set_parent_color(tmp1, sibling, RB_BLACK);
 			if (tmp2)
 				rb_set_parent(tmp2, parent);



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 5/8] seqlock: Better document raw_write_seqcount_latch()
  2015-03-18 13:36 [PATCH 0/8] latched RB-trees and __module_address() Peter Zijlstra
                   ` (3 preceding siblings ...)
  2015-03-18 13:36 ` [PATCH 4/8] rbtree: Make lockless searches non-fatal Peter Zijlstra
@ 2015-03-18 13:36 ` Peter Zijlstra
  2015-03-18 14:29   ` Mathieu Desnoyers
  2015-03-18 13:36 ` [PATCH 6/8] rbtree: Implement generic latch_tree Peter Zijlstra
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2015-03-18 13:36 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz, David Woodhouse,
	Rik van Riel, Andrea Arcangeli, Michel Lespinasse

[-- Attachment #1: peterz-latch-comment.patch --]
[-- Type: text/plain, Size: 4959 bytes --]

Improve the documentation of the latch technique as used in the
current timekeeping code, such that it can be readily employed
elsewhere.

Borrow from the comments in timekeeping and replace those with a
reference to this more generic comment.

Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/seqlock.h   |   77 +++++++++++++++++++++++++++++++++++++++++++++-
 kernel/time/timekeeping.c |   27 ----------------
 2 files changed, 77 insertions(+), 27 deletions(-)

--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -233,9 +233,84 @@ static inline void raw_write_seqcount_en
 	s->sequence++;
 }
 
-/*
+/**
  * raw_write_seqcount_latch - redirect readers to even/odd copy
  * @s: pointer to seqcount_t
+ *
+ * The latch technique is a multiversion concurrency control method that allows
+ * queries during non atomic modifications. If you can guarantee queries never
+ * interrupt the modification -- e.g. the concurrency is strictly between CPUs
+ * -- you most likely do not need this.
+ *
+ * Where the traditional RCU/lockless data structures rely on atomic
+ * modifications to ensure queries observe either the old or the new state the
+ * latch allows the same for non atomic updates. The trade-off is doubling the
+ * cost of storage; we have to maintain two copies of the entire data
+ * structure.
+ *
+ * Very simply put: we first modify one copy and then the other. This ensures
+ * there is always one copy in a stable state, ready to give us an answer.
+ *
+ * The basic form is a data structure like:
+ *
+ * struct latch_struct {
+ *	seqcount_t		seq;
+ *	struct data_struct	data[2];
+ * };
+ *
+ * Where a modification, which is assumed to be externally serialized, does the
+ * following:
+ *
+ * void latch_modify(struct latch_struct *latch, ...)
+ * {
+ *	smp_wmb();	<- Ensure that the last data[1] update is visible
+ *	latch->seq++;
+ *	smp_wmb();	<- Ensure that the seqcount update is visible
+ *
+ *	modify(latch->data[0], ...);
+ *
+ *	smp_wmb();	<- Ensure that the data[0] update is visible
+ *	latch->seq++;
+ *	smp_wmb();	<- Ensure that the seqcount update is visible
+ *
+ *	modify(latch->data[1], ...);
+ * }
+ *
+ * The query will have a form like:
+ *
+ * struct entry *latch_query(struct latch_struct *latch, ...)
+ * {
+ *	struct entry *entry;
+ *	unsigned seq;
+ *	int idx;
+ *
+ *	do {
+ *		seq = latch->seq;
+ *		smp_rmb();
+ *
+ *		idx = seq & 0x01;
+ *		entry = data_query(latch->data[idx], ...);
+ *
+ *		smp_rmb();
+ *	} while (seq != latch->seq);
+ *
+ *	return entry;
+ * }
+ *
+ * So during the modification, queries are first redirected to data[1]. Then we
+ * modify data[0]. When that is complete, we redirect queries back to data[0]
+ * and we can modify data[1].
+ *
+ * NOTE: The non-requirement for atomic modifications does _NOT_ include
+ *       the publishing of new entries in the case where data is a dynamic
+ *       data structure.
+ *
+ *       An iteration might start in data[0] and get suspended long enough
+ *       to miss an entire modification sequence, once it resumes it might
+ *       observe the new entry.
+ *
+ * NOTE: When data is a dynamic data structure; one should use regular RCU
+ *       patterns to manage the lifetimes of the objects within.
  */
 static inline void raw_write_seqcount_latch(seqcount_t *s)
 {
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -339,32 +339,7 @@ static inline s64 timekeeping_get_ns_raw
  * We want to use this from any context including NMI and tracing /
  * instrumenting the timekeeping code itself.
  *
- * So we handle this differently than the other timekeeping accessor
- * functions which retry when the sequence count has changed. The
- * update side does:
- *
- * smp_wmb();	<- Ensure that the last base[1] update is visible
- * tkf->seq++;
- * smp_wmb();	<- Ensure that the seqcount update is visible
- * update(tkf->base[0], tkr);
- * smp_wmb();	<- Ensure that the base[0] update is visible
- * tkf->seq++;
- * smp_wmb();	<- Ensure that the seqcount update is visible
- * update(tkf->base[1], tkr);
- *
- * The reader side does:
- *
- * do {
- *	seq = tkf->seq;
- *	smp_rmb();
- *	idx = seq & 0x01;
- *	now = now(tkf->base[idx]);
- *	smp_rmb();
- * } while (seq != tkf->seq)
- *
- * As long as we update base[0] readers are forced off to
- * base[1]. Once base[0] is updated readers are redirected to base[0]
- * and the base[1] update takes place.
+ * Employ the latch technique; see @raw_write_seqcount_latch.
  *
  * So if a NMI hits the update of base[0] then it will use base[1]
  * which is still consistent. In the worst case this can result is a



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 6/8] rbtree: Implement generic latch_tree
  2015-03-18 13:36 [PATCH 0/8] latched RB-trees and __module_address() Peter Zijlstra
                   ` (4 preceding siblings ...)
  2015-03-18 13:36 ` [PATCH 5/8] seqlock: Better document raw_write_seqcount_latch() Peter Zijlstra
@ 2015-03-18 13:36 ` Peter Zijlstra
  2015-03-18 16:20   ` Ingo Molnar
  2015-03-19  5:14   ` Andrew Morton
  2015-03-18 13:36 ` [PATCH 7/8] module: Optimize __module_address() using a latched RB-tree Peter Zijlstra
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2015-03-18 13:36 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

[-- Attachment #1: peterz-rbtree-latch.patch --]
[-- Type: text/plain, Size: 8018 bytes --]

Implement a latched RB-tree in order to get unconditional RCU/lockless
lookups.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/rbtree_latch.h |  223 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 223 insertions(+)

--- /dev/null
+++ b/include/linux/rbtree_latch.h
@@ -0,0 +1,223 @@
+/*
+ * Latched RB-trees
+ *
+ * Copyright (C) 2015 Intel Corp., Peter Zijlstra <peterz@infradead.org>
+ *
+ * Since RB-trees have non atomic modifications they're not immediately suited
+ * for RCU/lockless queries. Even though we made RB tree lookups non-fatal for
+ * lockless lookups; we cannot guarantee they return a correct result.
+ *
+ * The simplest solution is a seqlock + rb-tree, this will allow lockless
+ * lookups; but has the constraint (inherent to the seqlock) that read sides
+ * cannot nest in write sides.
+ *
+ * If we need to allow unconditional lookups (say as required for NMI context
+ * usage) we need a more complex setup; this data structure provides this by
+ * employing the latch technique -- see @raw_write_seqcount_latch -- to
+ * implement a latched RB-tree which does allow for unconditional lookups by
+ * virtue of always having (at least) one stable copy of the tree.
+ *
+ * However, while we have the guarantee that there is at all times one stable
+ * copy, this does not guarantee an iteration will not observe modifications.
+ * What might have been a stable copy at the start of the iteration, need not
+ * remain so for the duration of the iteration.
+ *
+ * Therefore, this does require a lockless RB-tree iteration to be non-fatal;
+ * see the comment in lib/rbtree.c. Note however that we only require the first
+ * condition -- not seeing partial stores -- because the latch thing isolates
+ * us from loops. If we were to interrupt a modification the lookup would be
+ * pointed at the stable tree and complete while the modification was halted.
+ */
+
+#ifndef RB_TREE_LATCH_H
+#define RB_TREE_LATCH_H
+
+#include <linux/rbtree.h>
+#include <linux/seqlock.h>
+
+struct latch_tree_node {
+	/*
+	 * Because we have an array of two entries in struct latch_tree_nodes
+	 * its not possible to use container_of() to get back to the
+	 * encapsulating structure; therefore we have to put in a back pointer.
+	 */
+	void		*priv;
+	struct rb_node	node;
+};
+
+struct latch_tree_nodes {
+	struct latch_tree_node node[2];
+};
+
+struct latch_tree_root {
+	seqcount_t	seq;
+	struct rb_root	tree[2];
+};
+
+/**
+ * latch_tree_ops - operators to define the tree order
+ * @less: used for insertion; provides the (partial) order between two elements.
+ * @comp: used for lookups; provides the order between the search key and an element.
+ *
+ * The operators are related like:
+ *
+ *	comp(a->key,b) < 0  := less(a,b)
+ *	comp(a->key,b) > 0  := less(b,a)
+ *	comp(a->key,b) == 0 := !less(a,b) && !less(b,a)
+ *
+ * If these operators define a partial order on the elements we make no
+ * guarantee on which of the elements matching the key is found. See
+ * latch_tree_find().
+ */
+struct latch_tree_ops {
+	bool (*less)(struct latch_tree_node *a, struct latch_tree_node *b);
+	int  (*comp)(void *key,                 struct latch_tree_node *b);
+};
+
+static __always_inline void
+__lt_insert(struct latch_tree_node *ltn, struct rb_root *root,
+	    bool (*less)(struct latch_tree_node *a, struct latch_tree_node *b))
+{
+	struct rb_node **link = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct latch_tree_node *ltp;
+
+	while (*link) {
+		parent = *link;
+		ltp = container_of(parent, struct latch_tree_node, node);
+
+		if (less(ltn, ltp))
+			link = &parent->rb_left;
+		else
+			link = &parent->rb_right;
+	}
+
+	rb_link_node_rcu(&ltn->node, parent, link);
+	rb_insert_color(&ltn->node, root);
+}
+
+static __always_inline void
+__lt_erase(struct latch_tree_node *ltn, struct rb_root *root)
+{
+	rb_erase(&ltn->node, root);
+}
+
+static __always_inline struct latch_tree_node *
+__lt_find(void *key, struct rb_root *root,
+	  int (*comp)(void *key, struct latch_tree_node *ltn))
+{
+	struct rb_node *n = rcu_dereference_raw(root->rb_node);
+	struct latch_tree_node *ltn;
+	int c;
+
+	while (n) {
+		ltn = container_of(n, struct latch_tree_node, node);
+		c = comp(key, ltn);
+
+		if (c < 0)
+			n = rcu_dereference_raw(n->rb_left);
+		else if (c > 0)
+			n = rcu_dereference_raw(n->rb_right);
+		else
+			return ltn;
+	}
+
+	return NULL;
+}
+
+/**
+ * latch_tree_insert() - insert @nodes into the trees @root
+ * @nodes: nodes to insert
+ * @root: trees to insert @nodes into
+ * @priv: pointer to node private data
+ * @ops: operators defining the node order
+ *
+ * Initializes @nodes private pointer with @priv; which typically is a back
+ * pointer to the containing structure, used by @ops to find the search key.
+ *
+ * Then it inserts @nodes into @root in an ordered fashion such that we can
+ * always observe one complete tree. See the comment for
+ * raw_write_seqcount_latch().
+ *
+ * The inserts use rcu_assign_pointer() to publish the element such that
+ * both the @priv pointer values in @nodes as the tree structure is stored
+ * before we can observe the new @nodes.
+ *
+ * All modifications (latch_tree_insert, latch_tree_remove) are assumed to be
+ * serialized.
+ */
+static __always_inline void
+latch_tree_insert(struct latch_tree_nodes *nodes,
+		  struct latch_tree_root *root,
+		  void *priv,
+		  const struct latch_tree_ops *ops)
+{
+	nodes->node[0].priv = nodes->node[1].priv = priv;
+
+	raw_write_seqcount_latch(&root->seq);
+	__lt_insert(&nodes->node[0], &root->tree[0], ops->less);
+	raw_write_seqcount_latch(&root->seq);
+	__lt_insert(&nodes->node[1], &root->tree[1], ops->less);
+}
+
+/**
+ * latch_tree_erase() - removes @nodes from the trees @root
+ * @nodes: nodes to remote
+ * @root: trees to remove @nodes from
+ * @ops: operators defining the node order
+ *
+ * Removes @nodes from the trees @root in an ordered fashion such that we can
+ * always observe one complete tree. See the comment for
+ * raw_write_seqcount_latch().
+ *
+ * It is assumed that @nodes will observe one RCU quiesent state before being
+ * reused of freed.
+ *
+ * All modifications (latch_tree_insert, latch_tree_remove) are assumed to be
+ * serialized.
+ */
+static __always_inline void
+latch_tree_erase(struct latch_tree_nodes *nodes,
+		 struct latch_tree_root *root,
+		 const struct latch_tree_ops *ops)
+{
+	raw_write_seqcount_latch(&root->seq);
+	__lt_erase(&nodes->node[0], &root->tree[0]);
+	raw_write_seqcount_latch(&root->seq);
+	__lt_erase(&nodes->node[1], &root->tree[1]);
+}
+
+/**
+ * latch_tree_find() - find the node matching @key in the trees @root
+ * @key: search key
+ * @root: trees to search for @key
+ * @ops: operators defining the node order
+ *
+ * Does a lockless lookup in the trees @root for the node matching @key.
+ *
+ * It is assumed that this is called while holding the appropriate RCU read
+ * side lock.
+ *
+ * If the operators define a partial order on the elements (there are multiple
+ * elements which have the same key value) it is undefined which of these
+ * elements will be found. Nor is it possible to iterate the tree to find
+ * further elements with the same key value.
+ *
+ * Returns: a pointer to the node matching @key or NULL.
+ */
+static __always_inline struct latch_tree_node *
+latch_tree_find(void *key, struct latch_tree_root *root,
+		const struct latch_tree_ops *ops)
+{
+	struct latch_tree_node *node;
+	unsigned int seq;
+
+	do {
+		seq = raw_read_seqcount(&root->seq);
+		node = __lt_find(key, &root->tree[seq & 1], ops->comp);
+	} while (read_seqcount_retry(&root->seq, seq));
+
+	return node;
+}
+
+#endif /* RB_TREE_LATCH_H */



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 7/8] module: Optimize __module_address() using a latched RB-tree
  2015-03-18 13:36 [PATCH 0/8] latched RB-trees and __module_address() Peter Zijlstra
                   ` (5 preceding siblings ...)
  2015-03-18 13:36 ` [PATCH 6/8] rbtree: Implement generic latch_tree Peter Zijlstra
@ 2015-03-18 13:36 ` Peter Zijlstra
  2015-03-18 13:36 ` [PATCH 8/8] module: Use __module_address() for module_address_lookup() Peter Zijlstra
  2015-03-18 14:21 ` [PATCH 0/8] latched RB-trees and __module_address() Christoph Hellwig
  8 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2015-03-18 13:36 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz

[-- Attachment #1: peterz-module-latch-tree.patch --]
[-- Type: text/plain, Size: 6945 bytes --]

Currently __module_address() is using a linear search through all
modules in order to find the module corresponding to the provided
address. With a lot of modules this can take a lot of time.

One of the users of this is kernel_text_address() which is employed
in many stack unwinders; which in turn are used by perf-callchain and
ftrace (possibly from NMI context).

So by optimizing __module_address() we optimize many stack unwinders
which are used by both perf and tracing in performance sensitive code.

Note that because (un)module loading is very rare, we will typically
iterate the tree[0] which will contain only core.node[0] entries --
seeing how the init entries will get removed once the module is
completely loaded. This is why we place the core.node[0] entry in the
same cacheline as the search key data.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/module.h |   22 ++++++++--
 kernel/module.c        |  107 ++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 121 insertions(+), 8 deletions(-)

--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -17,6 +17,7 @@
 #include <linux/moduleparam.h>
 #include <linux/jump_label.h>
 #include <linux/export.h>
+#include <linux/rbtree_latch.h>
 
 #include <linux/percpu.h>
 #include <asm/module.h>
@@ -269,8 +270,15 @@ struct module {
 	/* Startup function. */
 	int (*init)(void);
 
-	/* If this is non-NULL, vfree after init() returns */
-	void *module_init;
+	/*
+	 * If this is non-NULL, vfree after init() returns.
+	 *
+	 * Cacheline align here, such that:
+	 *   module_init, module_core, init_size, core_size,
+	 *   init_text_size, core_text_size and ltn_core.node[0]
+	 * are on the same cacheline.
+	 */
+	void *module_init	____cacheline_aligned;
 
 	/* Here is the actual code + data, vfree'd on unload. */
 	void *module_core;
@@ -281,6 +289,14 @@ struct module {
 	/* The size of the executable code in each section.  */
 	unsigned int init_text_size, core_text_size;
 
+	/*
+	 * We rely on the order of these two entries; not only do we want
+	 * core.node[0] to be in the same cacheline as the above entries,
+	 * we also assume ltn_init has a higher address than ltn_core.
+	 */
+	struct latch_tree_nodes	ltn_core;
+	struct latch_tree_nodes	ltn_init;
+
 	/* Size of RO sections of the module (text+rodata) */
 	unsigned int init_ro_size, core_ro_size;
 
@@ -361,7 +377,7 @@ struct module {
 	ctor_fn_t *ctors;
 	unsigned int num_ctors;
 #endif
-};
+} ____cacheline_aligned;
 #ifndef MODULE_ARCH_INIT
 #define MODULE_ARCH_INIT {}
 #endif
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -102,6 +102,99 @@
 DEFINE_MUTEX(module_mutex);
 EXPORT_SYMBOL_GPL(module_mutex);
 static LIST_HEAD(modules);
+
+/*
+ * Use a latched RB-tree for __module_address(); this allows us to use
+ * RCU-sched lookups of the address from any context.
+ *
+ * Because modules have two address ranges: init and core, we need two
+ * latch_tree_nodes entries. We use the order they appear in struct module to
+ * determine if we need to use the init or core values for the comparisons.
+ */
+
+static __always_inline unsigned long __mod_tree_val(struct latch_tree_node *n)
+{
+	struct module *mod = n->priv;
+
+	if (n >= mod->ltn_init.node)
+		return (unsigned long)mod->module_init;
+	else
+		return (unsigned long)mod->module_core;
+}
+
+static __always_inline unsigned long __mod_tree_size(struct latch_tree_node *n)
+{
+	struct module *mod = n->priv;
+
+	if (n >= mod->ltn_init.node)
+		return (unsigned long)mod->init_size;
+	else
+		return (unsigned long)mod->core_size;
+}
+
+static __always_inline bool
+mod_tree_less(struct latch_tree_node *a, struct latch_tree_node *b)
+{
+	return __mod_tree_val(a) < __mod_tree_val(b);
+}
+
+static __always_inline int
+mod_tree_comp(void *key, struct latch_tree_node *n)
+{
+	unsigned long val = (unsigned long)key;
+	unsigned long start, end;
+
+	end = start = __mod_tree_val(n);
+	end += __mod_tree_size(n);
+
+	if (val < start)
+		return -1;
+
+	if (val >= end)
+		return 1;
+
+	return 0;
+}
+
+static const struct latch_tree_ops mod_tree_ops = {
+	.less = mod_tree_less,
+	.comp = mod_tree_comp,
+};
+
+static struct latch_tree_root mod_tree __cacheline_aligned;
+
+/*
+ * These modifications: insert, remove_init and remove; are serialized by the
+ * module_mutex.
+ */
+static void mod_tree_insert(struct module *mod)
+{
+	latch_tree_insert(&mod->ltn_core, &mod_tree, mod, &mod_tree_ops);
+	if (mod->init_size)
+		latch_tree_insert(&mod->ltn_init, &mod_tree, mod, &mod_tree_ops);
+}
+
+static void mod_tree_remove_init(struct module *mod)
+{
+	if (mod->init_size)
+		latch_tree_erase(&mod->ltn_init, &mod_tree, &mod_tree_ops);
+}
+
+static void mod_tree_remove(struct module *mod)
+{
+	latch_tree_erase(&mod->ltn_core, &mod_tree, &mod_tree_ops);
+	mod_tree_remove_init(mod);
+}
+
+static struct module *mod_tree_find(unsigned long addr)
+{
+	struct latch_tree_node *ltn;
+
+	ltn = latch_tree_find((void *)addr, &mod_tree, &mod_tree_ops);
+
+	return ltn ? ltn->priv : NULL;
+}
+
 #ifdef CONFIG_KGDB_KDB
 struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
 #endif /* CONFIG_KGDB_KDB */
@@ -1876,6 +1969,7 @@ static void free_module(struct module *m
 	mutex_lock(&module_mutex);
 	/* Unlink carefully: kallsyms could be walking list. */
 	list_del_rcu(&mod->list);
+	mod_tree_remove(mod);
 	/* Remove this module from bug list, this uses list_del_rcu */
 	module_bug_cleanup(mod);
 	/* Wait for RCU-sched synchronizing before releasing mod->list and buglist. */
@@ -3120,6 +3214,7 @@ static noinline int do_init_module(struc
 	mod->symtab = mod->core_symtab;
 	mod->strtab = mod->core_strtab;
 #endif
+	mod_tree_remove_init(mod);
 	unset_module_init_ro_nx(mod);
 	module_arch_freeing_init(mod);
 	mod->module_init = NULL;
@@ -3190,6 +3285,7 @@ static int add_unformed_module(struct mo
 		goto out;
 	}
 	list_add_rcu(&mod->list, &modules);
+	mod_tree_insert(mod);
 	err = 0;
 
 out:
@@ -3389,6 +3485,7 @@ static int load_module(struct load_info
 	mutex_lock(&module_mutex);
 	/* Unlink carefully: kallsyms could be walking list. */
 	list_del_rcu(&mod->list);
+	mod_tree_remove(mod);
 	wake_up_all(&module_wq);
 	/* Wait for RCU-sched synchronizing before releasing mod->list. */
 	synchronize_sched();
@@ -3833,13 +3930,13 @@ struct module *__module_address(unsigned
 
 	module_assert_mutex_or_preempt();
 
-	list_for_each_entry_rcu(mod, &modules, list) {
+	mod = mod_tree_find(addr);
+	if (mod) {
+		BUG_ON(!within_module(addr, mod));
 		if (mod->state == MODULE_STATE_UNFORMED)
-			continue;
-		if (within_module(addr, mod))
-			return mod;
+			mod = NULL;
 	}
-	return NULL;
+	return mod;
 }
 EXPORT_SYMBOL_GPL(__module_address);
 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 8/8] module: Use __module_address() for module_address_lookup()
  2015-03-18 13:36 [PATCH 0/8] latched RB-trees and __module_address() Peter Zijlstra
                   ` (6 preceding siblings ...)
  2015-03-18 13:36 ` [PATCH 7/8] module: Optimize __module_address() using a latched RB-tree Peter Zijlstra
@ 2015-03-18 13:36 ` Peter Zijlstra
  2015-03-18 14:21 ` [PATCH 0/8] latched RB-trees and __module_address() Christoph Hellwig
  8 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2015-03-18 13:36 UTC (permalink / raw)
  To: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz

[-- Attachment #1: peterz-module-use-__module_address.patch --]
[-- Type: text/plain, Size: 1127 bytes --]

Use the generic __module_address() addr to struct module lookup
instead of open coding it once more.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/module.c |   17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3515,19 +3515,15 @@ const char *module_address_lookup(unsign
 			    char **modname,
 			    char *namebuf)
 {
-	struct module *mod;
 	const char *ret = NULL;
+	struct module *mod;
 
 	preempt_disable();
-	list_for_each_entry_rcu(mod, &modules, list) {
-		if (mod->state == MODULE_STATE_UNFORMED)
-			continue;
-		if (within_module(addr, mod)) {
-			if (modname)
-				*modname = mod->name;
-			ret = get_ksymbol(mod, addr, size, offset);
-			break;
-		}
+	mod = __module_address(addr);
+	if (mod) {
+		if (modname)
+			*modname = mod->name;
+		ret = get_ksymbol(mod, addr, size, offset);
 	}
 	/* Make a copy in here where it's safe */
 	if (ret) {
@@ -3535,6 +3531,7 @@ const char *module_address_lookup(unsign
 		ret = namebuf;
 	}
 	preempt_enable();
+
 	return ret;
 }
 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/8] latched RB-trees and __module_address()
  2015-03-18 13:36 [PATCH 0/8] latched RB-trees and __module_address() Peter Zijlstra
                   ` (7 preceding siblings ...)
  2015-03-18 13:36 ` [PATCH 8/8] module: Use __module_address() for module_address_lookup() Peter Zijlstra
@ 2015-03-18 14:21 ` Christoph Hellwig
  2015-03-18 14:38   ` Peter Zijlstra
  8 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2015-03-18 14:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx

On Wed, Mar 18, 2015 at 02:36:26PM +0100, Peter Zijlstra wrote:
> This series is aimed at making __module_address() go fast(er).

What users do hit this so hard that it matters?  Seems like the jump
label code uses it directly and throgh __module_text_address, and krobes
use it throug __module_text_address.  Is it the pageattr code or
lockdep?

Also seems interesting that both __module_address and __module_text_address
are exported, but don't seem to have modular users.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/8] seqlock: Better document raw_write_seqcount_latch()
  2015-03-18 13:36 ` [PATCH 5/8] seqlock: Better document raw_write_seqcount_latch() Peter Zijlstra
@ 2015-03-18 14:29   ` Mathieu Desnoyers
  2015-03-18 14:50     ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Mathieu Desnoyers @ 2015-03-18 14:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rusty, oleg, paulmck, torvalds, linux-kernel, andi,
	rostedt, tglx, David Woodhouse, Rik van Riel, Andrea Arcangeli,
	Michel Lespinasse

----- Original Message -----
> Improve the documentation of the latch technique as used in the
> current timekeeping code, such that it can be readily employed
> elsewhere.
> 
> Borrow from the comments in timekeeping and replace those with a
> reference to this more generic comment.
> 
> Cc: David Woodhouse <David.Woodhouse@intel.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Michel Lespinasse <walken@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/seqlock.h   |   77
>  +++++++++++++++++++++++++++++++++++++++++++++-
>  kernel/time/timekeeping.c |   27 ----------------
>  2 files changed, 77 insertions(+), 27 deletions(-)
> 
> --- a/include/linux/seqlock.h
> +++ b/include/linux/seqlock.h
> @@ -233,9 +233,84 @@ static inline void raw_write_seqcount_en
>  	s->sequence++;
>  }
>  
> -/*
> +/**
>   * raw_write_seqcount_latch - redirect readers to even/odd copy
>   * @s: pointer to seqcount_t
> + *
> + * The latch technique is a multiversion concurrency control method that
> allows
> + * queries during non atomic modifications. If you can guarantee queries
> never
> + * interrupt the modification -- e.g. the concurrency is strictly between
> CPUs
> + * -- you most likely do not need this.
> + *
> + * Where the traditional RCU/lockless data structures rely on atomic
> + * modifications to ensure queries observe either the old or the new state
> the
> + * latch allows the same for non atomic updates. The trade-off is doubling
> the
> + * cost of storage; we have to maintain two copies of the entire data
> + * structure.
> + *
> + * Very simply put: we first modify one copy and then the other. This
> ensures
> + * there is always one copy in a stable state, ready to give us an answer.
> + *
> + * The basic form is a data structure like:
> + *
> + * struct latch_struct {
> + *	seqcount_t		seq;
> + *	struct data_struct	data[2];
> + * };
> + *
> + * Where a modification, which is assumed to be externally serialized, does
> the
> + * following:
> + *
> + * void latch_modify(struct latch_struct *latch, ...)
> + * {
> + *	smp_wmb();	<- Ensure that the last data[1] update is visible
> + *	latch->seq++;
> + *	smp_wmb();	<- Ensure that the seqcount update is visible
> + *
> + *	modify(latch->data[0], ...);
> + *
> + *	smp_wmb();	<- Ensure that the data[0] update is visible
> + *	latch->seq++;
> + *	smp_wmb();	<- Ensure that the seqcount update is visible
> + *
> + *	modify(latch->data[1], ...);
> + * }
> + *
> + * The query will have a form like:
> + *
> + * struct entry *latch_query(struct latch_struct *latch, ...)
> + * {
> + *	struct entry *entry;
> + *	unsigned seq;
> + *	int idx;

very minor nit: why is seq unsigned, but idx a signed int ?
Could we do:

  unsigned seq, idx;   instead ?

Other than that:

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> + *
> + *	do {
> + *		seq = latch->seq;
> + *		smp_rmb();
> + *
> + *		idx = seq & 0x01;
> + *		entry = data_query(latch->data[idx], ...);
> + *
> + *		smp_rmb();
> + *	} while (seq != latch->seq);
> + *
> + *	return entry;
> + * }
> + *
> + * So during the modification, queries are first redirected to data[1]. Then
> we
> + * modify data[0]. When that is complete, we redirect queries back to
> data[0]
> + * and we can modify data[1].
> + *
> + * NOTE: The non-requirement for atomic modifications does _NOT_ include
> + *       the publishing of new entries in the case where data is a dynamic
> + *       data structure.
> + *
> + *       An iteration might start in data[0] and get suspended long enough
> + *       to miss an entire modification sequence, once it resumes it might
> + *       observe the new entry.
> + *
> + * NOTE: When data is a dynamic data structure; one should use regular RCU
> + *       patterns to manage the lifetimes of the objects within.
>   */
>  static inline void raw_write_seqcount_latch(seqcount_t *s)
>  {
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -339,32 +339,7 @@ static inline s64 timekeeping_get_ns_raw
>   * We want to use this from any context including NMI and tracing /
>   * instrumenting the timekeeping code itself.
>   *
> - * So we handle this differently than the other timekeeping accessor
> - * functions which retry when the sequence count has changed. The
> - * update side does:
> - *
> - * smp_wmb();	<- Ensure that the last base[1] update is visible
> - * tkf->seq++;
> - * smp_wmb();	<- Ensure that the seqcount update is visible
> - * update(tkf->base[0], tkr);
> - * smp_wmb();	<- Ensure that the base[0] update is visible
> - * tkf->seq++;
> - * smp_wmb();	<- Ensure that the seqcount update is visible
> - * update(tkf->base[1], tkr);
> - *
> - * The reader side does:
> - *
> - * do {
> - *	seq = tkf->seq;
> - *	smp_rmb();
> - *	idx = seq & 0x01;
> - *	now = now(tkf->base[idx]);
> - *	smp_rmb();
> - * } while (seq != tkf->seq)
> - *
> - * As long as we update base[0] readers are forced off to
> - * base[1]. Once base[0] is updated readers are redirected to base[0]
> - * and the base[1] update takes place.
> + * Employ the latch technique; see @raw_write_seqcount_latch.
>   *
>   * So if a NMI hits the update of base[0] then it will use base[1]
>   * which is still consistent. In the worst case this can result is a
> 
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/8] latched RB-trees and __module_address()
  2015-03-18 14:21 ` [PATCH 0/8] latched RB-trees and __module_address() Christoph Hellwig
@ 2015-03-18 14:38   ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2015-03-18 14:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx

On Wed, Mar 18, 2015 at 07:21:14AM -0700, Christoph Hellwig wrote:
> On Wed, Mar 18, 2015 at 02:36:26PM +0100, Peter Zijlstra wrote:
> > This series is aimed at making __module_address() go fast(er).
> 
> What users do hit this so hard that it matters?  Seems like the jump
> label code uses it directly and throgh __module_text_address, and krobes
> use it throug __module_text_address.  Is it the pageattr code or
> lockdep?

>From 7/8:

"One of the users of this is kernel_text_address() which is employed
in many stack unwinders; which in turn are used by perf-callchain and
ftrace (possibly from NMI context).

So by optimizing __module_address() we optimize many stack unwinders
which are used by both perf and tracing in performance sensitive code."

> Also seems interesting that both __module_address and __module_text_address
> are exported, but don't seem to have modular users.

Yeah and a few others too iirc.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/8] seqlock: Better document raw_write_seqcount_latch()
  2015-03-18 14:29   ` Mathieu Desnoyers
@ 2015-03-18 14:50     ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2015-03-18 14:50 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: mingo, rusty, oleg, paulmck, torvalds, linux-kernel, andi,
	rostedt, tglx, David Woodhouse, Rik van Riel, Andrea Arcangeli,
	Michel Lespinasse

On Wed, Mar 18, 2015 at 02:29:35PM +0000, Mathieu Desnoyers wrote:
> > + * struct entry *latch_query(struct latch_struct *latch, ...)
> > + * {
> > + *	struct entry *entry;
> > + *	unsigned seq;
> > + *	int idx;
> 
> very minor nit: why is seq unsigned, but idx a signed int ?
> Could we do:
> 
>   unsigned seq, idx;   instead ?
> 
> Other than that:
> 
> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Done and thanks!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/8] rbtree: Implement generic latch_tree
  2015-03-18 13:36 ` [PATCH 6/8] rbtree: Implement generic latch_tree Peter Zijlstra
@ 2015-03-18 16:20   ` Ingo Molnar
  2015-03-19  5:14   ` Andrew Morton
  1 sibling, 0 replies; 28+ messages in thread
From: Ingo Molnar @ 2015-03-18 16:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rusty, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, Michel Lespinasse, Andrea Arcangeli,
	David Woodhouse, Rik van Riel


* Peter Zijlstra <peterz@infradead.org> wrote:

> Implement a latched RB-tree in order to get unconditional RCU/lockless
> lookups.

Two very minor nits:

> +struct latch_tree_node {
> +	/*
> +	 * Because we have an array of two entries in struct latch_tree_nodes
> +	 * its not possible to use container_of() to get back to the
> +	 * encapsulating structure; therefore we have to put in a back pointer.
> +	 */
> +	void		*priv;
> +	struct rb_node	node;
> +};

s/its/it's


> +/**
> + * latch_tree_erase() - removes @nodes from the trees @root
> + * @nodes: nodes to remote
> + * @root: trees to remove @nodes from
> + * @ops: operators defining the node order
> + *
> + * Removes @nodes from the trees @root in an ordered fashion such that we can
> + * always observe one complete tree. See the comment for
> + * raw_write_seqcount_latch().
> + *
> + * It is assumed that @nodes will observe one RCU quiesent state before being
> + * reused of freed.

s/quiesent/quiescent

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/8] rbtree: Implement generic latch_tree
  2015-03-18 13:36 ` [PATCH 6/8] rbtree: Implement generic latch_tree Peter Zijlstra
  2015-03-18 16:20   ` Ingo Molnar
@ 2015-03-19  5:14   ` Andrew Morton
  2015-03-19  7:25     ` Peter Zijlstra
  1 sibling, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2015-03-19  5:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Wed, 18 Mar 2015 14:36:32 +0100 Peter Zijlstra <peterz@infradead.org> wrote:

>  include/linux/rbtree_latch.h |  223 +++++++++++++++++++++++++++++++++++++++++++

Did it really need to all be inlined?

How much of this code is unneeded on uniprocessor?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/8] rbtree: Implement generic latch_tree
  2015-03-19  5:14   ` Andrew Morton
@ 2015-03-19  7:25     ` Peter Zijlstra
  2015-03-19 20:58       ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2015-03-19  7:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Wed, Mar 18, 2015 at 10:14:46PM -0700, Andrew Morton wrote:
> On Wed, 18 Mar 2015 14:36:32 +0100 Peter Zijlstra <peterz@infradead.org> wrote:
> 
> >  include/linux/rbtree_latch.h |  223 +++++++++++++++++++++++++++++++++++++++++++
> 
> Did it really need to all be inlined?

Without that you get actual function calls to the less() and comp()
operators. This way GCC can inline the lot even though its function
pointers.

The typical RB tree user open-codes all this every single time.

> How much of this code is unneeded on uniprocessor?

None, UP has NMIs too.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/8] rbtree: Implement generic latch_tree
  2015-03-19  7:25     ` Peter Zijlstra
@ 2015-03-19 20:58       ` Andrew Morton
  2015-03-19 21:36         ` Steven Rostedt
                           ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Andrew Morton @ 2015-03-19 20:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Thu, 19 Mar 2015 08:25:02 +0100 Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Mar 18, 2015 at 10:14:46PM -0700, Andrew Morton wrote:
> > On Wed, 18 Mar 2015 14:36:32 +0100 Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > >  include/linux/rbtree_latch.h |  223 +++++++++++++++++++++++++++++++++++++++++++
> > 
> > Did it really need to all be inlined?
> 
> Without that you get actual function calls to the less() and comp()
> operators. This way GCC can inline the lot even though its function
> pointers.
> 
> The typical RB tree user open-codes all this every single time.

Is it a good tradeoff?

> > How much of this code is unneeded on uniprocessor?
> 
> None, UP has NMIs too.

OK.  This code is basically required to support perf/ftrace and
modules, yes?  Presumably small and space-constrained systems aren't
using either, so they don't take the hit.

However CONFIG_MODULES systems which aren't using perf/ftrace _do_ take
a hit.  How many systems are we talking here?  All non-x86?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/8] rbtree: Implement generic latch_tree
  2015-03-19 20:58       ` Andrew Morton
@ 2015-03-19 21:36         ` Steven Rostedt
  2015-03-19 21:38           ` Steven Rostedt
                             ` (2 more replies)
  2015-03-19 22:04         ` Peter Zijlstra
  2015-03-19 22:10         ` Peter Zijlstra
  2 siblings, 3 replies; 28+ messages in thread
From: Steven Rostedt @ 2015-03-19 21:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, mingo, rusty, mathieu.desnoyers, oleg, paulmck,
	torvalds, linux-kernel, andi, tglx, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Thu, 19 Mar 2015 13:58:33 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> OK.  This code is basically required to support perf/ftrace and
> modules, yes?  Presumably small and space-constrained systems aren't
> using either, so they don't take the hit.
> 
> However CONFIG_MODULES systems which aren't using perf/ftrace _do_ take
> a hit.  How many systems are we talking here?  All non-x86?

Compromise... (Totally untested)

-- Steve

diff --git a/include/linux/rbtree_latch.h b/include/linux/rbtree_latch.h
index bd4a9700c06c..e22f758ca433 100644
--- a/include/linux/rbtree_latch.h
+++ b/include/linux/rbtree_latch.h
@@ -74,150 +74,24 @@ struct latch_tree_ops {
 	int  (*comp)(void *key,                 struct latch_tree_node *b);
 };
 
-static __always_inline void
-__lt_insert(struct latch_tree_node *ltn, struct rb_root *root,
-	    bool (*less)(struct latch_tree_node *a, struct latch_tree_node *b))
-{
-	struct rb_node **link = &root->rb_node;
-	struct rb_node *parent = NULL;
-	struct latch_tree_node *ltp;
-
-	while (*link) {
-		parent = *link;
-		ltp = container_of(parent, struct latch_tree_node, node);
-
-		if (less(ltn, ltp))
-			link = &parent->rb_left;
-		else
-			link = &parent->rb_right;
-	}
-
-	rb_link_node_rcu(&ltn->node, parent, link);
-	rb_insert_color(&ltn->node, root);
-}
-
-static __always_inline void
-__lt_erase(struct latch_tree_node *ltn, struct rb_root *root)
-{
-	rb_erase(&ltn->node, root);
-}
-
-static __always_inline struct latch_tree_node *
-__lt_find(void *key, struct rb_root *root,
-	  int (*comp)(void *key, struct latch_tree_node *ltn))
-{
-	struct rb_node *n = rcu_dereference_raw(root->rb_node);
-	struct latch_tree_node *ltn;
-	int c;
-
-	while (n) {
-		ltn = container_of(n, struct latch_tree_node, node);
-		c = comp(key, ltn);
-
-		if (c < 0)
-			n = rcu_dereference_raw(n->rb_left);
-		else if (c > 0)
-			n = rcu_dereference_raw(n->rb_right);
-		else
-			return ltn;
-	}
-
-	return NULL;
-}
-
-/**
- * latch_tree_insert() - insert @nodes into the trees @root
- * @nodes: nodes to insert
- * @root: trees to insert @nodes into
- * @priv: pointer to node private data
- * @ops: operators defining the node order
- *
- * Initializes @nodes private pointer with @priv; which typically is a back
- * pointer to the containing structure, used by @ops to find the search key.
- *
- * Then it inserts @nodes into @root in an ordered fashion such that we can
- * always observe one complete tree. See the comment for
- * raw_write_seqcount_latch().
- *
- * The inserts use rcu_assign_pointer() to publish the element such that
- * both the @priv pointer values in @nodes as the tree structure is stored
- * before we can observe the new @nodes.
- *
- * All modifications (latch_tree_insert, latch_tree_remove) are assumed to be
- * serialized.
- */
-static __always_inline void
-latch_tree_insert(struct latch_tree_nodes *nodes,
-		  struct latch_tree_root *root,
-		  void *priv,
-		  const struct latch_tree_ops *ops)
-{
-	nodes->node[0].priv = nodes->node[1].priv = priv;
-
-	raw_write_seqcount_latch(&root->seq);
-	__lt_insert(&nodes->node[0], &root->tree[0], ops->less);
-	raw_write_seqcount_latch(&root->seq);
-	__lt_insert(&nodes->node[1], &root->tree[1], ops->less);
-}
-
-/**
- * latch_tree_erase() - removes @nodes from the trees @root
- * @nodes: nodes to remote
- * @root: trees to remove @nodes from
- * @ops: operators defining the node order
- *
- * Removes @nodes from the trees @root in an ordered fashion such that we can
- * always observe one complete tree. See the comment for
- * raw_write_seqcount_latch().
- *
- * It is assumed that @nodes will observe one RCU quiesent state before being
- * reused of freed.
- *
- * All modifications (latch_tree_insert, latch_tree_remove) are assumed to be
- * serialized.
- */
-static __always_inline void
-latch_tree_erase(struct latch_tree_nodes *nodes,
-		 struct latch_tree_root *root,
-		 const struct latch_tree_ops *ops)
-{
-	raw_write_seqcount_latch(&root->seq);
-	__lt_erase(&nodes->node[0], &root->tree[0]);
-	raw_write_seqcount_latch(&root->seq);
-	__lt_erase(&nodes->node[1], &root->tree[1]);
-}
-
-/**
- * latch_tree_find() - find the node matching @key in the trees @root
- * @key: search key
- * @root: trees to search for @key
- * @ops: operators defining the node order
- *
- * Does a lockless lookup in the trees @root for the node matching @key.
- *
- * It is assumed that this is called while holding the appropriate RCU read
- * side lock.
- *
- * If the operators define a partial order on the elements (there are multiple
- * elements which have the same key value) it is undefined which of these
- * elements will be found. Nor is it possible to iterate the tree to find
- * further elements with the same key value.
- *
- * Returns: a pointer to the node matching @key or NULL.
- */
-static __always_inline struct latch_tree_node *
-latch_tree_find(void *key, struct latch_tree_root *root,
-		const struct latch_tree_ops *ops)
-{
-	struct latch_tree_node *node;
-	unsigned int seq;
-
-	do {
-		seq = raw_read_seqcount(&root->seq);
-		node = __lt_find(key, &root->tree[seq & 1], ops->comp);
-	} while (read_seqcount_retry(&root->seq, seq));
-
-	return node;
-}
+#ifdef CONFIG_RBTREE_LATCH_INLINE
+#define RBTREE_LATCH_ANNOTATE static __always_inline
+#include <linux/rbtree_latch_code.h>
+#else
+void __lt_insert(struct latch_tree_node *ltn, struct rb_root *root,
+	bool (*less)(struct latch_tree_node *a, struct latch_tree_node *b));
+void __lt_erase(struct latch_tree_node *ltn, struct rb_root *root);
+struct latch_tree_node *__lt_find(void *key, struct rb_root *root,
+	  int (*comp)(void *key, struct latch_tree_node *ltn));
+void latch_tree_insert(struct latch_tree_nodes *nodes,
+		       struct latch_tree_root *root,
+		       void *priv,
+		       const struct latch_tree_ops *ops);
+void latch_tree_erase(struct latch_tree_nodes *nodes,
+		      struct latch_tree_root *root,
+		      const struct latch_tree_ops *ops);
+struct latch_tree_node *latch_tree_find(void *key, struct latch_tree_root *root,
+					const struct latch_tree_ops *ops);
+#endif
 
 #endif /* RB_TREE_LATCH_H */
diff --git a/include/linux/rbtree_latch_code.h b/include/linux/rbtree_latch_code.h
new file mode 100644
index 000000000000..940fd763ee97
--- /dev/null
+++ b/include/linux/rbtree_latch_code.h
@@ -0,0 +1,146 @@
+
+RBTREE_LATCH_ANNOTATE void
+__lt_insert(struct latch_tree_node *ltn, struct rb_root *root,
+	    bool (*less)(struct latch_tree_node *a, struct latch_tree_node *b))
+{
+	struct rb_node **link = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct latch_tree_node *ltp;
+
+	while (*link) {
+		parent = *link;
+		ltp = container_of(parent, struct latch_tree_node, node);
+
+		if (less(ltn, ltp))
+			link = &parent->rb_left;
+		else
+			link = &parent->rb_right;
+	}
+
+	rb_link_node_rcu(&ltn->node, parent, link);
+	rb_insert_color(&ltn->node, root);
+}
+
+RBTREE_LATCH_ANNOTATE void
+__lt_erase(struct latch_tree_node *ltn, struct rb_root *root)
+{
+	rb_erase(&ltn->node, root);
+}
+
+RBTREE_LATCH_ANNOTATE struct latch_tree_node *
+__lt_find(void *key, struct rb_root *root,
+	  int (*comp)(void *key, struct latch_tree_node *ltn))
+{
+	struct rb_node *n = rcu_dereference_raw(root->rb_node);
+	struct latch_tree_node *ltn;
+	int c;
+
+	while (n) {
+		ltn = container_of(n, struct latch_tree_node, node);
+		c = comp(key, ltn);
+
+		if (c < 0)
+			n = rcu_dereference_raw(n->rb_left);
+		else if (c > 0)
+			n = rcu_dereference_raw(n->rb_right);
+		else
+			return ltn;
+	}
+
+	return NULL;
+}
+
+/**
+ * latch_tree_insert() - insert @nodes into the trees @root
+ * @nodes: nodes to insert
+ * @root: trees to insert @nodes into
+ * @priv: pointer to node private data
+ * @ops: operators defining the node order
+ *
+ * Initializes @nodes private pointer with @priv; which typically is a back
+ * pointer to the containing structure, used by @ops to find the search key.
+ *
+ * Then it inserts @nodes into @root in an ordered fashion such that we can
+ * always observe one complete tree. See the comment for
+ * raw_write_seqcount_latch().
+ *
+ * The inserts use rcu_assign_pointer() to publish the element such that
+ * both the @priv pointer values in @nodes as the tree structure is stored
+ * before we can observe the new @nodes.
+ *
+ * All modifications (latch_tree_insert, latch_tree_remove) are assumed to be
+ * serialized.
+ */
+RBTREE_LATCH_ANNOTATE void
+latch_tree_insert(struct latch_tree_nodes *nodes,
+		  struct latch_tree_root *root,
+		  void *priv,
+		  const struct latch_tree_ops *ops)
+{
+	nodes->node[0].priv = nodes->node[1].priv = priv;
+
+	raw_write_seqcount_latch(&root->seq);
+	__lt_insert(&nodes->node[0], &root->tree[0], ops->less);
+	raw_write_seqcount_latch(&root->seq);
+	__lt_insert(&nodes->node[1], &root->tree[1], ops->less);
+}
+
+/**
+ * latch_tree_erase() - removes @nodes from the trees @root
+ * @nodes: nodes to remote
+ * @root: trees to remove @nodes from
+ * @ops: operators defining the node order
+ *
+ * Removes @nodes from the trees @root in an ordered fashion such that we can
+ * always observe one complete tree. See the comment for
+ * raw_write_seqcount_latch().
+ *
+ * It is assumed that @nodes will observe one RCU quiesent state before being
+ * reused of freed.
+ *
+ * All modifications (latch_tree_insert, latch_tree_remove) are assumed to be
+ * serialized.
+ */
+RBTREE_LATCH_ANNOTATE void
+latch_tree_erase(struct latch_tree_nodes *nodes,
+		 struct latch_tree_root *root,
+		 const struct latch_tree_ops *ops)
+{
+	raw_write_seqcount_latch(&root->seq);
+	__lt_erase(&nodes->node[0], &root->tree[0]);
+	raw_write_seqcount_latch(&root->seq);
+	__lt_erase(&nodes->node[1], &root->tree[1]);
+}
+
+/**
+ * latch_tree_find() - find the node matching @key in the trees @root
+ * @key: search key
+ * @root: trees to search for @key
+ * @ops: operators defining the node order
+ *
+ * Does a lockless lookup in the trees @root for the node matching @key.
+ *
+ * It is assumed that this is called while holding the appropriate RCU read
+ * side lock.
+ *
+ * If the operators define a partial order on the elements (there are multiple
+ * elements which have the same key value) it is undefined which of these
+ * elements will be found. Nor is it possible to iterate the tree to find
+ * further elements with the same key value.
+ *
+ * Returns: a pointer to the node matching @key or NULL.
+ */
+RBTREE_LATCH_ANNOTATE struct latch_tree_node *
+latch_tree_find(void *key, struct latch_tree_root *root,
+		const struct latch_tree_ops *ops)
+{
+	struct latch_tree_node *node;
+	unsigned int seq;
+
+	do {
+		seq = raw_read_seqcount(&root->seq);
+		node = __lt_find(key, &root->tree[seq & 1], ops->comp);
+	} while (read_seqcount_retry(&root->seq, seq));
+
+	return node;
+}
diff --git a/lib/Kconfig b/lib/Kconfig
index 87da53bb1fef..6d515f8d8fb5 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -527,4 +527,8 @@ source "lib/fonts/Kconfig"
 config ARCH_HAS_SG_CHAIN
 	def_bool n
 
+config RBTREE_LATCH_INLINE
+       def_bool y
+       depends on PERF_EVENTS || TRACING
+
 endmenu
diff --git a/lib/Makefile b/lib/Makefile
index 58f74d2dd396..3dfb4f697239 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 sha1.o md5.o irq_regs.o argv_split.o \
 	 proportions.o flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
-	 earlycpio.o seq_buf.o
+	 earlycpio.o seq_buf.o rbtree_latch.o
 
 obj-$(CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS) += usercopy.o
 lib-$(CONFIG_MMU) += ioremap.o
diff --git a/lib/rbtree_latch.c b/lib/rbtree_latch.c
new file mode 100644
index 000000000000..13867da4f27f
--- /dev/null
+++ b/lib/rbtree_latch.c
@@ -0,0 +1,5 @@
+
+#ifndef CONFIG_RBTREE_LATCH_INLINE
+#define RBTREE_LATCH_ANNOTATE
+#include <linux/rbtree_latch_code.h>
+#endif

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/8] rbtree: Implement generic latch_tree
  2015-03-19 21:36         ` Steven Rostedt
@ 2015-03-19 21:38           ` Steven Rostedt
  2015-03-19 21:47           ` Andrew Morton
  2015-03-19 22:12           ` Peter Zijlstra
  2 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2015-03-19 21:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, mingo, rusty, mathieu.desnoyers, oleg, paulmck,
	torvalds, linux-kernel, andi, tglx, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Thu, 19 Mar 2015 17:36:36 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> diff --git a/lib/rbtree_latch.c b/lib/rbtree_latch.c
> new file mode 100644
> index 000000000000..13867da4f27f
> --- /dev/null
> +++ b/lib/rbtree_latch.c
> @@ -0,0 +1,5 @@
> +
> +#ifndef CONFIG_RBTREE_LATCH_INLINE
> +#define RBTREE_LATCH_ANNOTATE
> +#include <linux/rbtree_latch_code.h>
> +#endif

As I said, totally untested, so the above still requires at a minimum:


#include <linux/rbtree_latch.h>

at the start.

-- Steve

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/8] rbtree: Implement generic latch_tree
  2015-03-19 21:36         ` Steven Rostedt
  2015-03-19 21:38           ` Steven Rostedt
@ 2015-03-19 21:47           ` Andrew Morton
  2015-03-19 21:54             ` Steven Rostedt
  2015-03-19 22:12           ` Peter Zijlstra
  2 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2015-03-19 21:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, mingo, rusty, mathieu.desnoyers, oleg, paulmck,
	torvalds, linux-kernel, andi, tglx, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Thu, 19 Mar 2015 17:36:36 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 19 Mar 2015 13:58:33 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > OK.  This code is basically required to support perf/ftrace and
> > modules, yes?  Presumably small and space-constrained systems aren't
> > using either, so they don't take the hit.
> > 
> > However CONFIG_MODULES systems which aren't using perf/ftrace _do_ take
> > a hit.  How many systems are we talking here?  All non-x86?
> 
> Compromise... (Totally untested)
> 
> ...
>
> +config RBTREE_LATCH_INLINE
> +       def_bool y
> +       depends on PERF_EVENTS || TRACING

Should this be PERF_EVENTS, or PERF_EVENTS_NMI?

Could we just keep the old __module_address() for these configs?  It's
only 10 lines..


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/8] rbtree: Implement generic latch_tree
  2015-03-19 21:47           ` Andrew Morton
@ 2015-03-19 21:54             ` Steven Rostedt
  0 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2015-03-19 21:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, mingo, rusty, mathieu.desnoyers, oleg, paulmck,
	torvalds, linux-kernel, andi, tglx, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Thu, 19 Mar 2015 14:47:32 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> Should this be PERF_EVENTS, or PERF_EVENTS_NMI?

No idea. I just whipped up a quick patch and guessed at the configs.

I was trying to show other options for the solution more than to show a
working patch ;-)

> 
> Could we just keep the old __module_address() for these configs?  It's
> only 10 lines..

I'm fine with that solution too.

-- Steve

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/8] rbtree: Implement generic latch_tree
  2015-03-19 20:58       ` Andrew Morton
  2015-03-19 21:36         ` Steven Rostedt
@ 2015-03-19 22:04         ` Peter Zijlstra
  2015-03-20  9:50           ` Peter Zijlstra
  2015-03-19 22:10         ` Peter Zijlstra
  2 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2015-03-19 22:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Thu, Mar 19, 2015 at 01:58:33PM -0700, Andrew Morton wrote:
> OK.  This code is basically required to support perf/ftrace and
> modules, yes?  Presumably small and space-constrained systems aren't
> using either, so they don't take the hit.
> 
> However CONFIG_MODULES systems which aren't using perf/ftrace _do_ take
> a hit.  How many systems are we talking here?  All non-x86?

No, there's plenty !x86 systems that have NMI and perf/ftrace, in fact,
ARM has them (FIQ) and there's event some ARM chips that use them for
perf (i.MX6 is one).

But sure, we could make this depend on CONFIG_PERF_EVENTS ||
CONFIG_TRACING.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/8] rbtree: Implement generic latch_tree
  2015-03-19 20:58       ` Andrew Morton
  2015-03-19 21:36         ` Steven Rostedt
  2015-03-19 22:04         ` Peter Zijlstra
@ 2015-03-19 22:10         ` Peter Zijlstra
  2 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2015-03-19 22:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Thu, Mar 19, 2015 at 01:58:33PM -0700, Andrew Morton wrote:
> On Thu, 19 Mar 2015 08:25:02 +0100 Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, Mar 18, 2015 at 10:14:46PM -0700, Andrew Morton wrote:
> > > On Wed, 18 Mar 2015 14:36:32 +0100 Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > >  include/linux/rbtree_latch.h |  223 +++++++++++++++++++++++++++++++++++++++++++
> > > 
> > > Did it really need to all be inlined?
> > 
> > Without that you get actual function calls to the less() and comp()
> > operators. This way GCC can inline the lot even though its function
> > pointers.
> > 
> > The typical RB tree user open-codes all this every single time.
> 
> Is it a good tradeoff?

Is what? Writing it like this or open-coding it all?

For many archs (indirect) function calls are far more expensive than the
typical comparison.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/8] rbtree: Implement generic latch_tree
  2015-03-19 21:36         ` Steven Rostedt
  2015-03-19 21:38           ` Steven Rostedt
  2015-03-19 21:47           ` Andrew Morton
@ 2015-03-19 22:12           ` Peter Zijlstra
  2 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2015-03-19 22:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrew Morton, mingo, rusty, mathieu.desnoyers, oleg, paulmck,
	torvalds, linux-kernel, andi, tglx, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Thu, Mar 19, 2015 at 05:36:36PM -0400, Steven Rostedt wrote:
> On Thu, 19 Mar 2015 13:58:33 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > OK.  This code is basically required to support perf/ftrace and
> > modules, yes?  Presumably small and space-constrained systems aren't
> > using either, so they don't take the hit.
> > 
> > However CONFIG_MODULES systems which aren't using perf/ftrace _do_ take
> > a hit.  How many systems are we talking here?  All non-x86?
> 
> Compromise... (Totally untested)

It only makes sense to out-of-line __lt_insert() (to avoid it being
inlined twice). And for that we can create an instantiation macro such
that it can still inline the comparison.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/8] module: Sanitize RCU usage and locking
  2015-03-18 13:36 ` [PATCH 1/8] module: Sanitize RCU usage and locking Peter Zijlstra
@ 2015-03-20  3:36   ` Rusty Russell
  0 siblings, 0 replies; 28+ messages in thread
From: Rusty Russell @ 2015-03-20  3:36 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz

Peter Zijlstra <peterz@infradead.org> writes:
> Currently the RCU usage in module is an inconsistent mess of RCU and
> RCU-sched, this is broken for CONFIG_PREEMPT where synchronize_rcu()
> does not imply synchronize_sched().
>
> Most usage sites use preempt_{dis,en}able() which is RCU-sched, but
> (most of) the modification sites use synchronize_rcu(). With the
> exception of the module bug list, which actually uses RCU.
>
> Convert everything over to RCU-sched.
>
> Furthermore add lockdep asserts to all sites, because its not at all
> clear to me the required locking is observed, esp. on exported
> functions.
>
> Cc: Rusty Russell <rusty@rustcorp.com.au>
> Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

I'm happy to take these via my modules-next tree, once you're ready.

If you prefer a different carrier, I'll just Ack.

Cheers,
Rusty.

> ---
>  include/linux/module.h |   12 ++++++++++--
>  kernel/module.c        |   42 ++++++++++++++++++++++++++++++++++--------
>  lib/bug.c              |    7 +++++--
>  3 files changed, 49 insertions(+), 12 deletions(-)
>
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -415,14 +415,22 @@ struct symsearch {
>  	bool unused;
>  };
>  
> -/* Search for an exported symbol by name. */
> +/*
> + * Search for an exported symbol by name.
> + *
> + * Must be called with module_mutex held or preemption disabled.
> + */
>  const struct kernel_symbol *find_symbol(const char *name,
>  					struct module **owner,
>  					const unsigned long **crc,
>  					bool gplok,
>  					bool warn);
>  
> -/* Walk the exported symbol table */
> +/*
> + * Walk the exported symbol table
> + *
> + * Must be called with module_mutex held or preemption disabled.
> + */
>  bool each_symbol_section(bool (*fn)(const struct symsearch *arr,
>  				    struct module *owner,
>  				    void *data), void *data);
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -106,6 +106,24 @@ static LIST_HEAD(modules);
>  struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
>  #endif /* CONFIG_KGDB_KDB */
>  
> +static void module_assert_mutex(void)
> +{
> +	lockdep_assert_held(&module_mutex);
> +}
> +
> +static void module_assert_mutex_or_preempt(void)
> +{
> +#ifdef CONFIG_LOCKDEP
> +	int rcu_held = rcu_read_lock_sched_held();
> +	int mutex_held = 1;
> +
> +	if (debug_locks)
> +		mutex_held = lockdep_is_held(&module_mutex);
> +
> +	WARN_ON(!rcu_held && !mutex_held);
> +#endif
> +}
> +
>  #ifdef CONFIG_MODULE_SIG
>  #ifdef CONFIG_MODULE_SIG_FORCE
>  static bool sig_enforce = true;
> @@ -319,6 +337,8 @@ bool each_symbol_section(bool (*fn)(cons
>  #endif
>  	};
>  
> +	module_assert_mutex_or_preempt();
> +
>  	if (each_symbol_in_section(arr, ARRAY_SIZE(arr), NULL, fn, data))
>  		return true;
>  
> @@ -458,6 +478,8 @@ static struct module *find_module_all(co
>  {
>  	struct module *mod;
>  
> +	module_assert_mutex();
> +
>  	list_for_each_entry(mod, &modules, list) {
>  		if (!even_unformed && mod->state == MODULE_STATE_UNFORMED)
>  			continue;
> @@ -1856,8 +1878,8 @@ static void free_module(struct module *m
>  	list_del_rcu(&mod->list);
>  	/* Remove this module from bug list, this uses list_del_rcu */
>  	module_bug_cleanup(mod);
> -	/* Wait for RCU synchronizing before releasing mod->list and buglist. */
> -	synchronize_rcu();
> +	/* Wait for RCU-sched synchronizing before releasing mod->list and buglist. */
> +	synchronize_sched();
>  	mutex_unlock(&module_mutex);
>  
>  	/* This may be NULL, but that's OK */
> @@ -3106,11 +3128,11 @@ static noinline int do_init_module(struc
>  	mod->init_text_size = 0;
>  	/*
>  	 * We want to free module_init, but be aware that kallsyms may be
> -	 * walking this with preempt disabled.  In all the failure paths,
> -	 * we call synchronize_rcu/synchronize_sched, but we don't want
> -	 * to slow down the success path, so use actual RCU here.
> +	 * walking this with preempt disabled.  In all the failure paths, we
> +	 * call synchronize_sched, but we don't want to slow down the success
> +	 * path, so use actual RCU here.
>  	 */
> -	call_rcu(&freeinit->rcu, do_free_init);
> +	call_rcu_sched(&freeinit->rcu, do_free_init);
>  	mutex_unlock(&module_mutex);
>  	wake_up_all(&module_wq);
>  
> @@ -3368,8 +3390,8 @@ static int load_module(struct load_info
>  	/* Unlink carefully: kallsyms could be walking list. */
>  	list_del_rcu(&mod->list);
>  	wake_up_all(&module_wq);
> -	/* Wait for RCU synchronizing before releasing mod->list. */
> -	synchronize_rcu();
> +	/* Wait for RCU-sched synchronizing before releasing mod->list. */
> +	synchronize_sched();
>  	mutex_unlock(&module_mutex);
>   free_module:
>  	/* Free lock-classes; relies on the preceding sync_rcu() */
> @@ -3636,6 +3658,8 @@ int module_kallsyms_on_each_symbol(int (
>  	unsigned int i;
>  	int ret;
>  
> +	module_assert_mutex();
> +
>  	list_for_each_entry(mod, &modules, list) {
>  		if (mod->state == MODULE_STATE_UNFORMED)
>  			continue;
> @@ -3810,6 +3834,8 @@ struct module *__module_address(unsigned
>  	if (addr < module_addr_min || addr > module_addr_max)
>  		return NULL;
>  
> +	module_assert_mutex_or_preempt();
> +
>  	list_for_each_entry_rcu(mod, &modules, list) {
>  		if (mod->state == MODULE_STATE_UNFORMED)
>  			continue;
> --- a/lib/bug.c
> +++ b/lib/bug.c
> @@ -66,7 +66,7 @@ static const struct bug_entry *module_fi
>  	struct module *mod;
>  	const struct bug_entry *bug = NULL;
>  
> -	rcu_read_lock();
> +	rcu_read_lock_sched();
>  	list_for_each_entry_rcu(mod, &module_bug_list, bug_list) {
>  		unsigned i;
>  
> @@ -77,7 +77,7 @@ static const struct bug_entry *module_fi
>  	}
>  	bug = NULL;
>  out:
> -	rcu_read_unlock();
> +	rcu_read_unlock_sched();
>  
>  	return bug;
>  }
> @@ -88,6 +88,8 @@ void module_bug_finalize(const Elf_Ehdr
>  	char *secstrings;
>  	unsigned int i;
>  
> +	lockdep_assert_held(&module_mutex);
> +
>  	mod->bug_table = NULL;
>  	mod->num_bugs = 0;
>  
> @@ -113,6 +115,7 @@ void module_bug_finalize(const Elf_Ehdr
>  
>  void module_bug_cleanup(struct module *mod)
>  {
> +	lockdep_assert_held(&module_mutex);
>  	list_del_rcu(&mod->bug_list);
>  }
>  

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/8] module, jump_label: Fix module locking
  2015-03-18 13:36 ` [PATCH 3/8] module, jump_label: Fix module locking Peter Zijlstra
@ 2015-03-20  4:26   ` Rusty Russell
  2015-03-20  9:11     ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Rusty Russell @ 2015-03-20  4:26 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, mathieu.desnoyers, oleg, paulmck, torvalds
  Cc: linux-kernel, andi, rostedt, tglx, peterz, Jason Baron

Peter Zijlstra <peterz@infradead.org> writes:
> As per the module core lockdep annotations:
>
> [   18.034047] ---[ end trace 9294429076a9c673 ]---
> [   18.047760] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
> [   18.059228]  ffffffff817d8676 ffff880036683c38 ffffffff8157e98b 0000000000000001
> [   18.067541]  0000000000000000 ffff880036683c78 ffffffff8105fbc7 ffff880036683c68
> [   18.075851]  ffffffffa0046b08 0000000000000000 ffffffffa0046d00 ffffffffa0046cc8
> [   18.084173] Call Trace:
> [   18.086906]  [<ffffffff8157e98b>] dump_stack+0x4f/0x7b
> [   18.092649]  [<ffffffff8105fbc7>] warn_slowpath_common+0x97/0xe0
> [   18.099361]  [<ffffffff8105fc2a>] warn_slowpath_null+0x1a/0x20
> [   18.105880]  [<ffffffff810ee502>] __module_address+0x1d2/0x1e0
> [   18.112400]  [<ffffffff81161153>] jump_label_module_notify+0x143/0x1e0
> [   18.119710]  [<ffffffff810814bf>] notifier_call_chain+0x4f/0x70
> [   18.126326]  [<ffffffff8108160e>] __blocking_notifier_call_chain+0x5e/0x90
> [   18.134009]  [<ffffffff81081656>] blocking_notifier_call_chain+0x16/0x20
> [   18.141490]  [<ffffffff810f0f00>] load_module+0x1b50/0x2660
> [   18.147720]  [<ffffffff810f1ade>] SyS_init_module+0xce/0x100
> [   18.154045]  [<ffffffff81587429>] system_call_fastpath+0x12/0x17
> [   18.160748] ---[ end trace 9294429076a9c674 ]---
>
> Jump labels is not doing it right; fix this.
>
> Cc: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Jason Baron <jbaron@akamai.com>
> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/jump_label.c |   21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
>
> --- a/kernel/jump_label.c
> +++ b/kernel/jump_label.c
> @@ -280,6 +280,17 @@ void jump_label_apply_nops(struct module
>  	}
>  }
>  
> +static inline bool address_in_module(unsigned long addr, struct module *mod)
> +{
> +	bool ret;
> +
> +	preempt_disable();
> +	ret = __module_address(addr) == mod;
> +	preempt_enable();
> +
> +	return ret;
> +}
> +
>  static int jump_label_add_module(struct module *mod)
>  {
>  	struct jump_entry *iter_start = mod->jump_entries;
> @@ -302,7 +313,7 @@ static int jump_label_add_module(struct
>  			continue;
>  
>  		key = iterk;
> -		if (__module_address(iter->key) == mod) {
> +		if (address_in_module(iter->key, mod)) {
>  			/*
>  			 * Set key->entries to iter, but preserve JUMP_LABEL_TRUE_BRANCH.
>  			 */
> @@ -339,7 +350,7 @@ static void jump_label_del_module(struct
>  
>  		key = (struct static_key *)(unsigned long)iter->key;
>  
> -		if (__module_address(iter->key) == mod)
> +		if (address_in_module(iter->key, mod))
>  			continue;
>  
>  		prev = &key->next;

Using __module_address() for this was just plain weird in the first
place.  How about:

        if (within_module(iter->key, mod))

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/8] module, jump_label: Fix module locking
  2015-03-20  4:26   ` Rusty Russell
@ 2015-03-20  9:11     ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2015-03-20  9:11 UTC (permalink / raw)
  To: Rusty Russell
  Cc: mingo, mathieu.desnoyers, oleg, paulmck, torvalds, linux-kernel,
	andi, rostedt, tglx, Jason Baron

On Fri, Mar 20, 2015 at 02:56:25PM +1030, Rusty Russell wrote:
> > +static inline bool address_in_module(unsigned long addr, struct module *mod)

> Using __module_address() for this was just plain weird in the first
> place.  How about:
> 
>         if (within_module(iter->key, mod))
> 

Oh hey, look at that ;-) /me fixes.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 6/8] rbtree: Implement generic latch_tree
  2015-03-19 22:04         ` Peter Zijlstra
@ 2015-03-20  9:50           ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2015-03-20  9:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mingo, rusty, mathieu.desnoyers, oleg, paulmck, torvalds,
	linux-kernel, andi, rostedt, tglx, Michel Lespinasse,
	Andrea Arcangeli, David Woodhouse, Rik van Riel

On Thu, Mar 19, 2015 at 11:04:33PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 19, 2015 at 01:58:33PM -0700, Andrew Morton wrote:
> > OK.  This code is basically required to support perf/ftrace and
> > modules, yes?  Presumably small and space-constrained systems aren't
> > using either, so they don't take the hit.
> > 
> > However CONFIG_MODULES systems which aren't using perf/ftrace _do_ take
> > a hit.  How many systems are we talking here?  All non-x86?
> 
> No, there's plenty !x86 systems that have NMI and perf/ftrace, in fact,
> ARM has them (FIQ) and there's event some ARM chips that use them for
> perf (i.MX6 is one).
> 
> But sure, we could make this depend on CONFIG_PERF_EVENTS ||
> CONFIG_TRACING.

A little something like so.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/module.c |   30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

--- a/kernel/module.c
+++ b/kernel/module.c
@@ -102,6 +102,8 @@ DEFINE_MUTEX(module_mutex);
 EXPORT_SYMBOL_GPL(module_mutex);
 static LIST_HEAD(modules);
 
+#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_TRACING)
+
 /*
  * Use a latched RB-tree for __module_address(); this allows us to use
  * RCU-sched lookups of the address from any context.
@@ -109,6 +111,10 @@ static LIST_HEAD(modules);
  * Because modules have two address ranges: init and core, we need two
  * latch_tree_nodes entries. We use the order they appear in struct module to
  * determine if we need to use the init or core values for the comparisons.
+ *
+ * This is conditional on PERF_EVENTS || TRACING because those can really hit
+ * __module_address() hard by doing a lot of stack unwinding; potentially from
+ * NMI context.
  */
 
 static __always_inline unsigned long __mod_tree_val(struct latch_tree_node *n)
@@ -185,7 +191,7 @@ static void mod_tree_remove(struct modul
 	mod_tree_remove_init(mod);
 }
 
-static struct module *mod_tree_find(unsigned long addr)
+static struct module *mod_find(unsigned long addr)
 {
 	struct latch_tree_node *ltn;
 
@@ -194,6 +200,26 @@ static struct module *mod_tree_find(unsi
 	return ltn ? ltn->priv : NULL;
 }
 
+#else /* !(PERF_EVENTS || TRACING) */
+
+static void mod_tree_insert(struct module *mod) { }
+static void mod_tree_remove_init(struct module *mod) { }
+static void mod_tree_remove(struct module *mod) { }
+
+static struct module *mod_find(unsigned long addr)
+{
+	struct module *mod;
+
+	list_for_each_entry_rcu(mod, &modules, list) {
+		if (within_module(addr, mod))
+			return mod;
+	}
+
+	return NULL;
+}
+
+#endif /* !(PERF_EVENTS || TRACING) */
+
 #ifdef CONFIG_KGDB_KDB
 struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
 #endif /* CONFIG_KGDB_KDB */
@@ -3939,7 +3965,7 @@ struct module *__module_address(unsigned
 
 	module_assert_mutex_or_preempt();
 
-	mod = mod_tree_find(addr);
+	mod = mod_find(addr);
 	if (mod) {
 		BUG_ON(!within_module(addr, mod));
 		if (mod->state == MODULE_STATE_UNFORMED)

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2015-03-20  9:50 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-18 13:36 [PATCH 0/8] latched RB-trees and __module_address() Peter Zijlstra
2015-03-18 13:36 ` [PATCH 1/8] module: Sanitize RCU usage and locking Peter Zijlstra
2015-03-20  3:36   ` Rusty Russell
2015-03-18 13:36 ` [PATCH 2/8] module: Annotate module version magic Peter Zijlstra
2015-03-18 13:36 ` [PATCH 3/8] module, jump_label: Fix module locking Peter Zijlstra
2015-03-20  4:26   ` Rusty Russell
2015-03-20  9:11     ` Peter Zijlstra
2015-03-18 13:36 ` [PATCH 4/8] rbtree: Make lockless searches non-fatal Peter Zijlstra
2015-03-18 13:36 ` [PATCH 5/8] seqlock: Better document raw_write_seqcount_latch() Peter Zijlstra
2015-03-18 14:29   ` Mathieu Desnoyers
2015-03-18 14:50     ` Peter Zijlstra
2015-03-18 13:36 ` [PATCH 6/8] rbtree: Implement generic latch_tree Peter Zijlstra
2015-03-18 16:20   ` Ingo Molnar
2015-03-19  5:14   ` Andrew Morton
2015-03-19  7:25     ` Peter Zijlstra
2015-03-19 20:58       ` Andrew Morton
2015-03-19 21:36         ` Steven Rostedt
2015-03-19 21:38           ` Steven Rostedt
2015-03-19 21:47           ` Andrew Morton
2015-03-19 21:54             ` Steven Rostedt
2015-03-19 22:12           ` Peter Zijlstra
2015-03-19 22:04         ` Peter Zijlstra
2015-03-20  9:50           ` Peter Zijlstra
2015-03-19 22:10         ` Peter Zijlstra
2015-03-18 13:36 ` [PATCH 7/8] module: Optimize __module_address() using a latched RB-tree Peter Zijlstra
2015-03-18 13:36 ` [PATCH 8/8] module: Use __module_address() for module_address_lookup() Peter Zijlstra
2015-03-18 14:21 ` [PATCH 0/8] latched RB-trees and __module_address() Christoph Hellwig
2015-03-18 14:38   ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).