LKML Archive on lore.kernel.org
* [RFC patch 00/18] Trace Clock v2
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Hi,
I've cleaned up the LTTng timestamping code, renamed it to "trace clock",
ripped apart the tsc_sync.c x86 code, and added documentation (a printk to the
console when the tracing clock is used) about what to do when an unsynchronized
TSC is detected.
I kept the cache-line bouncing workaround for now. However, the counters are
now resynchronized every jiffy with a per-cpu timer, which puts an upper bound
on the time imprecision.
The trace clock is used through a get_trace_clock()/put_trace_clock() pair, so
all the machinery and overhead that might be required to provide correct
timestamps on weird systems is *only* enabled when tracing is active.
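As an illustration, here is a minimal usage sketch (not part of this patchset;
the my_tracer_*() hooks are hypothetical) showing how a tracer is expected to
bracket its use of the clock with the API introduced later in the series:

#include <linux/trace-clock.h>

static void my_tracer_start(void)
{
	/* Enable the extra machinery (synthetic TSC, per-cpu timers) if needed. */
	get_trace_clock();
}

static void my_tracer_record_event(void)
{
	u64 ts = trace_clock_read64();	/* timestamp readable from any context */

	/* ... store ts in the trace buffer ... */
}

static void my_tracer_stop(void)
{
	/* Tear the machinery back down when the last user goes away. */
	put_trace_clock();
}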
I plan to stick with this simple solution for now so we can get reliable tracing
for ~95% of the systems out there, and keep room for improvement (nice NTP-like
schemes) for a later version. (Sadly, given this is actually v2, I cannot say
"let's keep that for v2".) ;)
This patchset applies on top of 2.6.28-rc3.
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* [RFC patch 01/18] get_cycles() : kconfig HAVE_GET_CYCLES
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Mathieu Desnoyers, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
[-- Attachment #1: get-cycles-kconfig-have-get-cycles.patch --]
[-- Type: text/plain, Size: 1935 bytes --]
Create a new "HAVE_GET_CYCLES" architecture option to specify which
architectures provide a 64-bit TSC counter readable with get_cycles(). It is
mainly useful to enable the high-precision tracing code only on such
architectures, and to avoid even building it on architectures which lack such
support.
It also requires architectures to provide get_cycles_barrier() and
get_cycles_rate().
I mainly use it for the "priority-sifting rwlock" latency tracing code, which
traces the worst-case latency induced by the locking. It also provides the basic
changes needed for the LTTng timestamping infrastructure.
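For illustration, a sketch of the per-architecture contract (for a hypothetical
architecture; the following patches show the real x86, sparc64 and powerpc
implementations):

/* In the architecture's asm/timex.h, next to its 64-bit get_cycles(): */
static inline void get_cycles_barrier(void)
{
	/* Instruction synchronization barrier, if the architecture needs one. */
}

static inline cycles_t get_cycles_rate(void)
{
	/*
	 * Cycle counter rate, in HZ. Return 0 if the TSCs are not synchronized
	 * across CPUs or if their frequency may vary with frequency scaling.
	 */
	return 0;
}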
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: David Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-arch@vger.kernel.org
---
init/Kconfig | 10 ++++++++++
1 file changed, 10 insertions(+)
Index: linux.trees.git/init/Kconfig
===================================================================
--- linux.trees.git.orig/init/Kconfig 2008-11-07 00:06:07.000000000 -0500
+++ linux.trees.git/init/Kconfig 2008-11-07 00:07:23.000000000 -0500
@@ -330,6 +330,16 @@ config CPUSETS
config HAVE_UNSTABLE_SCHED_CLOCK
bool
+#
+# Architectures with a 64-bit get_cycles() should select this.
+# They should also define
+# get_cycles_barrier() : instruction synchronization barrier if required
+# get_cycles_rate() : cycle counter rate, in HZ. If 0, TSCs are not synchronized
+# across CPUs or their frequency may vary due to frequency scaling.
+#
+config HAVE_GET_CYCLES
+ def_bool n
+
config GROUP_SCHED
bool "Group CPU scheduler"
depends on EXPERIMENTAL
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* [RFC patch 02/18] get_cycles() : x86 HAVE_GET_CYCLES
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Mathieu Desnoyers, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
[-- Attachment #1: get-cycles-x86-have-get-cycles.patch --]
[-- Type: text/plain, Size: 1875 bytes --]
This patch selects HAVE_GET_CYCLES and makes sure get_cycles_barrier() and
get_cycles_rate() are implemented.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: David Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-arch@vger.kernel.org
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/tsc.h | 12 ++++++++++++
2 files changed, 13 insertions(+)
Index: linux.trees.git/arch/x86/Kconfig
===================================================================
--- linux.trees.git.orig/arch/x86/Kconfig 2008-11-07 00:06:06.000000000 -0500
+++ linux.trees.git/arch/x86/Kconfig 2008-11-07 00:09:33.000000000 -0500
@@ -20,6 +20,7 @@ config X86
def_bool y
select HAVE_AOUT if X86_32
select HAVE_UNSTABLE_SCHED_CLOCK
+ select HAVE_GET_CYCLES
select HAVE_IDE
select HAVE_OPROFILE
select HAVE_IOREMAP_PROT
Index: linux.trees.git/arch/x86/include/asm/tsc.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/tsc.h 2008-10-30 20:22:50.000000000 -0400
+++ linux.trees.git/arch/x86/include/asm/tsc.h 2008-11-07 00:09:33.000000000 -0500
@@ -50,6 +50,18 @@ extern void mark_tsc_unstable(char *reas
extern int unsynchronized_tsc(void);
int check_tsc_unstable(void);
+static inline cycles_t get_cycles_rate(void)
+{
+ if (check_tsc_unstable())
+ return 0;
+ return tsc_khz;
+}
+
+static inline void get_cycles_barrier(void)
+{
+ rdtsc_barrier();
+}
+
/*
* Boot-time check whether the TSCs are synchronized across
* all CPUs/cores:
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* [RFC patch 03/18] get_cycles() : sparc64 HAVE_GET_CYCLES
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Mathieu Desnoyers, David S. Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
[-- Attachment #1: get-cycles-sparc64-have-get-cycles.patch --]
[-- Type: text/plain, Size: 2763 bytes --]
This patch selects HAVE_GET_CYCLES and makes sure get_cycles_barrier() and
get_cycles_rate() are implemented.
Changelog:
- Use tb_ticks_per_usec * 1000000 in get_cycles_rate().
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: David S. Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-arch@vger.kernel.org
---
arch/sparc/include/asm/timex_64.h | 19 ++++++++++++++++++-
arch/sparc64/Kconfig | 1 +
arch/sparc64/kernel/time.c | 3 ++-
3 files changed, 21 insertions(+), 2 deletions(-)
Index: linux.trees.git/arch/sparc64/Kconfig
===================================================================
--- linux.trees.git.orig/arch/sparc64/Kconfig 2008-10-30 20:22:50.000000000 -0400
+++ linux.trees.git/arch/sparc64/Kconfig 2008-11-07 00:09:35.000000000 -0500
@@ -13,6 +13,7 @@ config SPARC64
default y
select HAVE_FUNCTION_TRACER
select HAVE_IDE
+ select HAVE_GET_CYCLES
select HAVE_LMB
select HAVE_ARCH_KGDB
select USE_GENERIC_SMP_HELPERS if SMP
Index: linux.trees.git/arch/sparc/include/asm/timex_64.h
===================================================================
--- linux.trees.git.orig/arch/sparc/include/asm/timex_64.h 2008-09-30 11:38:51.000000000 -0400
+++ linux.trees.git/arch/sparc/include/asm/timex_64.h 2008-11-07 00:09:35.000000000 -0500
@@ -12,7 +12,24 @@
/* Getting on the cycle counter on sparc64. */
typedef unsigned long cycles_t;
-#define get_cycles() tick_ops->get_tick()
+
+static inline cycles_t get_cycles(void)
+{
+ return tick_ops->get_tick();
+}
+
+/* get_cycles instruction is synchronized on sparc64 */
+static inline void get_cycles_barrier(void)
+{
+ return;
+}
+
+extern unsigned long tb_ticks_per_usec;
+
+static inline cycles_t get_cycles_rate(void)
+{
+ return tb_ticks_per_usec * 1000000UL;
+}
#define ARCH_HAS_READ_CURRENT_TIMER
Index: linux.trees.git/arch/sparc64/kernel/time.c
===================================================================
--- linux.trees.git.orig/arch/sparc64/kernel/time.c 2008-11-07 00:06:06.000000000 -0500
+++ linux.trees.git/arch/sparc64/kernel/time.c 2008-11-07 00:09:35.000000000 -0500
@@ -793,7 +793,8 @@ static void __init setup_clockevent_mult
sparc64_clockevent.mult = mult;
}
-static unsigned long tb_ticks_per_usec __read_mostly;
+unsigned long tb_ticks_per_usec __read_mostly;
+EXPORT_SYMBOL_GPL(tb_ticks_per_usec);
void __delay(unsigned long loops)
{
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* [RFC patch 04/18] get_cycles() : powerpc64 HAVE_GET_CYCLES
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Mathieu Desnoyers, benh, paulus, David Miller, Ingo Molnar,
Thomas Gleixner, Steven Rostedt, linux-arch
[-- Attachment #1: get-cycles-powerpc-have-get-cycles.patch --]
[-- Type: text/plain, Size: 1979 bytes --]
This patch selects HAVE_GET_CYCLES and makes sure get_cycles_barrier() and
get_cycles_rate() are implemented.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: benh@kernel.crashing.org
CC: paulus@samba.org
CC: David Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-arch@vger.kernel.org
---
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/timex.h | 11 +++++++++++
2 files changed, 12 insertions(+)
Index: linux.trees.git/arch/powerpc/Kconfig
===================================================================
--- linux.trees.git.orig/arch/powerpc/Kconfig 2008-11-07 00:09:44.000000000 -0500
+++ linux.trees.git/arch/powerpc/Kconfig 2008-11-07 00:09:46.000000000 -0500
@@ -121,6 +121,7 @@ config PPC
select HAVE_DMA_ATTRS if PPC64
select USE_GENERIC_SMP_HELPERS if SMP
select HAVE_OPROFILE
+ select HAVE_GET_CYCLES if PPC64
config EARLY_PRINTK
bool
Index: linux.trees.git/arch/powerpc/include/asm/timex.h
===================================================================
--- linux.trees.git.orig/arch/powerpc/include/asm/timex.h 2008-11-07 00:09:44.000000000 -0500
+++ linux.trees.git/arch/powerpc/include/asm/timex.h 2008-11-07 00:09:46.000000000 -0500
@@ -7,6 +7,7 @@
* PowerPC architecture timex specifications
*/
+#include <linux/time.h>
#include <asm/cputable.h>
#include <asm/reg.h>
@@ -46,5 +47,15 @@ static inline cycles_t get_cycles(void)
#endif
}
+static inline cycles_t get_cycles_rate(void)
+{
+ return tb_ticks_per_sec;
+}
+
+static inline void get_cycles_barrier(void)
+{
+ isync();
+}
+
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_TIMEX_H */
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* [RFC patch 05/18] get_cycles() : MIPS HAVE_GET_CYCLES_32
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Mathieu Desnoyers, Ralf Baechle, David Miller, Ingo Molnar,
Thomas Gleixner, Steven Rostedt, linux-arch
[-- Attachment #1: get-cycles-mips-have-get-cycles.patch --]
[-- Type: text/plain, Size: 2784 bytes --]
This partly reverts commit efb9ca08b5a2374b29938cdcab417ce4feb14b54. It selects
HAVE_GET_CYCLES_32 only on CPUs where it is safe to use it.
It currently considers the "_WORKAROUND" cases for the R4000 and R4400 to be
unsafe, but other sub-architectures should probably be added to the blacklist.
HAVE_GET_CYCLES is not defined because MIPS does not provide a 64-bit TSC (only
32 bits).
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Ralf Baechle <ralf@linux-mips.org>
CC: David Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-arch@vger.kernel.org
---
arch/mips/Kconfig | 4 ++++
arch/mips/include/asm/timex.h | 25 +++++++++++++++++++++++++
2 files changed, 29 insertions(+)
Index: linux.trees.git/arch/mips/include/asm/timex.h
===================================================================
--- linux.trees.git.orig/arch/mips/include/asm/timex.h 2008-10-30 20:22:50.000000000 -0400
+++ linux.trees.git/arch/mips/include/asm/timex.h 2008-11-07 00:10:10.000000000 -0500
@@ -29,14 +29,39 @@
* which isn't an evil thing.
*
* We know that all SMP capable CPUs have cycle counters.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ * HAVE_GET_CYCLES makes sure that this case is handled properly:
+ *
+ * Ralf Baechle <ralf@linux-mips.org> :
+ * This avoids us executing an mfc0 c0_count instruction on processors which
+ * don't have one, but also on certain R4000 and R4400 versions where reading from
+ * the count register just in the very moment when its value equals c0_compare
+ * will result in the timer interrupt getting lost.
*/
typedef unsigned int cycles_t;
+#ifdef HAVE_GET_CYCLES_32
+static inline cycles_t get_cycles(void)
+{
+ return read_c0_count();
+}
+
+static inline void get_cycles_barrier(void)
+{
+}
+
+static inline cycles_t get_cycles_rate(void)
+{
+ return CLOCK_TICK_RATE;
+}
+#else
static inline cycles_t get_cycles(void)
{
return 0;
}
+#endif
#endif /* __KERNEL__ */
Index: linux.trees.git/arch/mips/Kconfig
===================================================================
--- linux.trees.git.orig/arch/mips/Kconfig 2008-11-07 00:06:06.000000000 -0500
+++ linux.trees.git/arch/mips/Kconfig 2008-11-07 00:10:10.000000000 -0500
@@ -1611,6 +1611,10 @@ config CPU_R4000_WORKAROUNDS
config CPU_R4400_WORKAROUNDS
bool
+config HAVE_GET_CYCLES_32
+ def_bool y
+ depends on !CPU_R4400_WORKAROUNDS
+
#
# Use the generic interrupt handling code in kernel/irq/:
#
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* [RFC patch 06/18] Trace clock generic
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Mathieu Desnoyers, Ralf Baechle, benh, paulus, David Miller,
Ingo Molnar, Thomas Gleixner, Steven Rostedt, linux-arch
[-- Attachment #1: trace-clock-generic.patch --]
[-- Type: text/plain, Size: 4568 bytes --]
Wrapper around the lower-level clock sources available on the system. It falls
back on jiffies or'd with a logical clock for architectures lacking CPU
timestamp counters, and on a mixed TSC-logical clock on architectures lacking
synchronized TSCs on SMP.
A generic fallback based on a logical clock and the timer interrupt is
available.
generic - Uses jiffies or'd with a logical clock extended to 64 bits by
ltt-timestamp.c.
i386 - Uses the TSC. If a non-synchronized TSC is detected, uses a mixed
TSC-logical clock.
mips - Uses the TSC extended atomically from 32 to 64 bits by ltt-heartbeat.c.
powerpc - Uses the TSC or the generic ltt clock.
x86_64 - Uses the TSC. If a non-synchronized TSC is detected, uses a mixed
TSC-logical clock.
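As a worked example of the generic fallback's encoding (derived from the
asm-generic header below; the HZ value is arbitrary): with TRACE_CLOCK_SHIFT = 13
and HZ = 250, each timer tick adds 1 << 13 = 8192 to the logical clock and clears
its low 13 bits, while every trace_clock_read32() call increments it by one, so
up to 8191 events between two ticks still get distinct, monotonically increasing
timestamps; trace_clock_frequency() accordingly reports HZ << 13 = 2048000.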
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Ralf Baechle <ralf@linux-mips.org>
CC: benh@kernel.crashing.org
CC: paulus@samba.org
CC: David Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-arch@vger.kernel.org
---
include/asm-generic/trace-clock.h | 69 ++++++++++++++++++++++++++++++++++++++
include/linux/trace-clock.h | 17 +++++++++
kernel/timer.c | 2 +
3 files changed, 88 insertions(+)
Index: linux.trees.git/kernel/timer.c
===================================================================
--- linux.trees.git.orig/kernel/timer.c 2008-10-06 10:23:39.000000000 -0400
+++ linux.trees.git/kernel/timer.c 2008-11-07 00:10:13.000000000 -0500
@@ -37,6 +37,7 @@
#include <linux/delay.h>
#include <linux/tick.h>
#include <linux/kallsyms.h>
+#include <linux/trace-clock.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -1067,6 +1068,7 @@ void do_timer(unsigned long ticks)
{
jiffies_64 += ticks;
update_times(ticks);
+ trace_clock_add_timestamp(ticks);
}
#ifdef __ARCH_WANT_SYS_ALARM
Index: linux.trees.git/include/asm-generic/trace-clock.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/include/asm-generic/trace-clock.h 2008-11-07 00:10:13.000000000 -0500
@@ -0,0 +1,69 @@
+#ifndef _ASM_GENERIC_TRACE_CLOCK_H
+#define _ASM_GENERIC_TRACE_CLOCK_H
+
+/*
+ * include/asm-generic/trace-clock.h
+ *
+ * Copyright (C) 2007 - Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca)
+ *
+ * Generic tracing clock for architectures without TSC.
+ */
+
+#include <linux/param.h> /* For HZ */
+#include <asm/atomic.h>
+
+#define TRACE_CLOCK_SHIFT 13
+
+u64 trace_clock_read_synthetic_tsc(void);
+
+extern atomic_t trace_clock;
+
+static inline u32 trace_clock_read32(void)
+{
+ return atomic_add_return(1, &trace_clock);
+}
+
+static inline u64 trace_clock_read64(void)
+{
+ return trace_clock_read_synthetic_tsc();
+}
+
+static inline void trace_clock_add_timestamp(unsigned long ticks)
+{
+ int old_clock, new_clock;
+
+ do {
+ old_clock = atomic_read(&trace_clock);
+ new_clock = (old_clock + (ticks << TRACE_CLOCK_SHIFT))
+ & (~((1 << TRACE_CLOCK_SHIFT) - 1));
+ } while (atomic_cmpxchg(&trace_clock, old_clock, new_clock)
+ != old_clock);
+}
+
+static inline unsigned int trace_clock_frequency(void)
+{
+ return HZ << TRACE_CLOCK_SHIFT;
+}
+
+static inline u32 trace_clock_freq_scale(void)
+{
+ return 1;
+}
+
+extern void get_synthetic_tsc(void);
+extern void put_synthetic_tsc(void);
+
+static inline void get_trace_clock(void)
+{
+ get_synthetic_tsc();
+}
+
+static inline void put_trace_clock(void)
+{
+ put_synthetic_tsc();
+}
+
+static inline void set_trace_clock_is_sync(int state)
+{
+}
+#endif /* _ASM_GENERIC_TRACE_CLOCK_H */
Index: linux.trees.git/include/linux/trace-clock.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/include/linux/trace-clock.h 2008-11-07 00:10:13.000000000 -0500
@@ -0,0 +1,17 @@
+#ifndef _LINUX_TRACE_CLOCK_H
+#define _LINUX_TRACE_CLOCK_H
+
+/*
+ * Trace clock
+ *
+ * Chooses between an architecture specific clock or an atomic logical clock.
+ *
+ * Copyright (C) 2007,2008 Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca)
+ */
+
+#ifdef CONFIG_HAVE_TRACE_CLOCK
+#include <asm/trace-clock.h>
+#else
+#include <asm-generic/trace-clock.h>
+#endif /* CONFIG_HAVE_TRACE_CLOCK */
+#endif /* _LINUX_TRACE_CLOCK_H */
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* [RFC patch 07/18] Trace clock core
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Mathieu Desnoyers, Nicolas Pitre, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
[-- Attachment #1: trace-clock-core.patch --]
[-- Type: text/plain, Size: 13958 bytes --]
32-to-64-bit clock extension. Extracts a 64-bit TSC from a [1..32]-bit counter,
kept up to date by a periodic timer interrupt. Lockless.
It is actually a specialized version of cnt_32_to_63.h which does the following
in addition:
- Uses per-cpu data to keep track of the counters.
- Limits cache-line bouncing.
- Supports machines with non-synchronized TSCs.
- Does not require read barriers, which can be slow on some architectures.
- Supports a full 64-bit counter (well, just one bit more than 63 is not really
a big deal when we talk about timestamp counters; if 2^64 is considered long
enough between overflows, 2^63 is normally considered long enough too).
- The periodic update of the value is ensured by the infrastructure. There is
no assumption that the counter is read frequently, because we cannot assume
that, given that the events for which tracing is enabled can be dynamically
selected.
- Supports counters of various widths (32 bits and below) by changing the
HW_BITS define.
What cnt_32_to_63.h does that this patch doesn't do:
- It has a global counter, which removes the need to do a periodic update on
_each_ CPU. This can be important on a dynamic tick system where CPUs need
to sleep to save power. It is therefore well suited to systems reading a
global clock expected to be _exactly_ synchronized across cores (where time
can never ever go backward).
Q:
> do you actually use the RCU internals? or do you just reimplement an RCU
> algorithm?
>
A:
Nope, I don't use RCU internals in this code. Preempt disable seemed
like the best way to handle this utterly short code path and I wanted
the write side to be fast enough to be called periodically. What I do is:
- Disable preemption at the read-side:
it makes sure the pointer I get will point to a data structure that
will never change while I am in the preempt-disabled code. (see *)
- Use per-cpu data to allow the read-side to be as fast as possible
(it only needs to disable preemption, does not race against other CPUs and
won't generate cache-line bouncing). It also allows dealing with
unsynchronized TSCs if needed.
- Periodic write side: it is called from an IPI running on each CPU.
(*) We expect the read-side (preempt-off region) to be shorter than
the interval between IPI updates, so we can guarantee the data structure
it uses won't be modified underneath it. Since the IPI update is
launched every second or so (depending on the frequency of the counter we
are trying to extend), it's more than ok.
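A worked example of the overflow handling in the code below (numbers are
hypothetical, HW_BITS = 32): if the current synthetic TSC value is
0x00000002ffffff00 and the next hardware read returns 0x00000010, the new low
word is smaller than the stored one, so a wrap is detected and the non-current
slot receives (0x0000000200000000 | 0x10) + (1ULL << 32) = 0x0000000300000010
before the index is flipped; when no wrap occurred, only the low 32 bits are
rewritten, which is why the update stays atomic with respect to readers.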
Changelog:
- Support [1..32] bits -> 64 bits.
I voluntarily limit the code to use at most 32 bits of the hardware clock for
performance considerations. If this is a problem it could be changed. Also, the
algorithm is aimed at 32-bit architectures; the code becomes much simpler on a
64-bit arch, since we can do the updates atomically.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Nicolas Pitre <nico@cam.org>
CC: Ralf Baechle <ralf@linux-mips.org>
CC: benh@kernel.crashing.org
CC: paulus@samba.org
CC: David Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-arch@vger.kernel.org
---
init/Kconfig | 14 +
kernel/Makefile | 3
kernel/trace/Makefile | 1
kernel/trace/trace-clock-32-to-64.c | 286 ++++++++++++++++++++++++++++++++++++
4 files changed, 302 insertions(+), 2 deletions(-)
Index: linux.trees.git/kernel/trace/Makefile
===================================================================
--- linux.trees.git.orig/kernel/trace/Makefile 2008-10-30 20:22:52.000000000 -0400
+++ linux.trees.git/kernel/trace/Makefile 2008-11-07 00:11:23.000000000 -0500
@@ -24,5 +24,6 @@ obj-$(CONFIG_NOP_TRACER) += trace_nop.o
obj-$(CONFIG_STACK_TRACER) += trace_stack.o
obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
obj-$(CONFIG_BOOT_TRACER) += trace_boot.o
+obj-$(CONFIG_HAVE_TRACE_CLOCK_32_TO_64) += trace-clock-32-to-64.o
libftrace-y := ftrace.o
Index: linux.trees.git/kernel/trace/trace-clock-32-to-64.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/kernel/trace/trace-clock-32-to-64.c 2008-11-07 00:11:06.000000000 -0500
@@ -0,0 +1,286 @@
+/*
+ * kernel/trace/trace-clock-32-to-64.c
+ *
+ * (C) Copyright 2006,2007,2008 -
+ * Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca)
+ *
+ * Extends a 32 bits clock source to a full 64 bits count, readable atomically
+ * from any execution context.
+ *
+ * notes :
+ * - trace clock 32->64 bits extended timer-based clock cannot be used for early
+ * tracing in the boot process, as it depends on timer interrupts.
+ * - The timer is only on one CPU to support hotplug.
+ * - We have the choice between schedule_delayed_work_on and an IPI to get each
+ * CPU to write the heartbeat. IPI has been chosen because it is considered
+ * faster than passing through the timer to get the work scheduled on all the
+ * CPUs.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/delay.h>
+#include <linux/timer.h>
+#include <linux/workqueue.h>
+#include <linux/cpu.h>
+#include <linux/timex.h>
+#include <linux/bitops.h>
+#include <linux/trace-clock.h>
+#include <linux/smp.h>
+#include <linux/sched.h> /* FIX for m68k local_irq_enable in on_each_cpu */
+
+/*
+ * Number of hardware clock bits. The higher order bits are expected to be 0.
+ * If the hardware clock source has more than 32 bits, the bits higher than the
+ * 32nd will be truncated by a cast to a 32 bits unsigned. Range : 1 - 32.
+ * (too few bits would be unrealistic though, since we depend on the timer to
+ * detect the overflows).
+ */
+#define HW_BITS 32
+
+#define HW_BITMASK ((1ULL << HW_BITS) - 1)
+#define HW_LSB(hw) ((hw) & HW_BITMASK)
+#define SW_MSB(sw) ((sw) & ~HW_BITMASK)
+
+/* Expected maximum interrupt latency in ms : 15ms, *2 for security */
+#define EXPECTED_INTERRUPT_LATENCY 30
+
+static DEFINE_MUTEX(synthetic_tsc_mutex);
+static int synthetic_tsc_refcount; /* Number of readers */
+static int synthetic_tsc_enabled; /* synth. TSC enabled on all online CPUs */
+
+atomic_t trace_clock;
+EXPORT_SYMBOL(trace_clock);
+
+static DEFINE_PER_CPU(struct timer_list, tsc_timer);
+static unsigned int precalc_expire;
+
+struct synthetic_tsc_struct {
+ union {
+ u64 val;
+ struct {
+#ifdef __BIG_ENDIAN
+ u32 msb;
+ u32 lsb;
+#else
+ u32 lsb;
+ u32 msb;
+#endif
+ } sel;
+ } tsc[2];
+ unsigned int index; /* Index of the current synth. tsc. */
+};
+
+static DEFINE_PER_CPU(struct synthetic_tsc_struct, synthetic_tsc);
+
+/* Called from IPI : either in interrupt or process context */
+static void update_synthetic_tsc(void)
+{
+ struct synthetic_tsc_struct *cpu_synth;
+ u32 tsc;
+
+ preempt_disable();
+ cpu_synth = &per_cpu(synthetic_tsc, smp_processor_id());
+ tsc = trace_clock_read32(); /* Hardware clocksource read */
+
+ if (tsc < HW_LSB(cpu_synth->tsc[cpu_synth->index].sel.lsb)) {
+ unsigned int new_index = 1 - cpu_synth->index; /* 0 <-> 1 */
+ /*
+ * Overflow
+ * Non atomic update of the non current synthetic TSC, followed
+ * by an atomic index change. There is no write concurrency,
+ * so the index read/write does not need to be atomic.
+ */
+ cpu_synth->tsc[new_index].val =
+ (SW_MSB(cpu_synth->tsc[cpu_synth->index].val)
+ | (u64)tsc) + (1ULL << HW_BITS);
+ cpu_synth->index = new_index; /* atomic change of index */
+ } else {
+ /*
+ * No overflow : We know that the only bits changed are
+ * contained in the 32 LSBs, which can be written to atomically.
+ */
+ cpu_synth->tsc[cpu_synth->index].sel.lsb =
+ SW_MSB(cpu_synth->tsc[cpu_synth->index].sel.lsb) | tsc;
+ }
+ preempt_enable();
+}
+
+/* Called from buffer switch : in _any_ context (even NMI) */
+u64 notrace trace_clock_read_synthetic_tsc(void)
+{
+ struct synthetic_tsc_struct *cpu_synth;
+ u64 ret;
+ unsigned int index;
+ u32 tsc;
+
+ preempt_disable_notrace();
+ cpu_synth = &per_cpu(synthetic_tsc, smp_processor_id());
+ index = cpu_synth->index; /* atomic read */
+ tsc = trace_clock_read32(); /* Hardware clocksource read */
+
+ /* Overflow detection */
+ if (unlikely(tsc < HW_LSB(cpu_synth->tsc[index].sel.lsb)))
+ ret = (SW_MSB(cpu_synth->tsc[index].val) | (u64)tsc)
+ + (1ULL << HW_BITS);
+ else
+ ret = SW_MSB(cpu_synth->tsc[index].val) | (u64)tsc;
+ preempt_enable_notrace();
+ return ret;
+}
+EXPORT_SYMBOL_GPL(trace_clock_read_synthetic_tsc);
+
+static void synthetic_tsc_ipi(void *info)
+{
+ update_synthetic_tsc();
+}
+
+/*
+ * tsc_timer_fct : - Timer function synchronizing synthetic TSC.
+ * @data: unused
+ *
+ * Guarantees at least 1 execution before low word of TSC wraps.
+ */
+static void tsc_timer_fct(unsigned long data)
+{
+ update_synthetic_tsc();
+
+ per_cpu(tsc_timer, smp_processor_id()).expires =
+ jiffies + precalc_expire;
+ add_timer_on(&per_cpu(tsc_timer, smp_processor_id()),
+ smp_processor_id());
+}
+
+/*
+ * precalc_stsc_interval: - Precalculates the interval between the clock
+ * wraparounds.
+ */
+static int __init precalc_stsc_interval(void)
+{
+ precalc_expire =
+ (HW_BITMASK / ((trace_clock_frequency() / HZ
+ * trace_clock_freq_scale()) << 1)
+ - 1 - (EXPECTED_INTERRUPT_LATENCY * HZ / 1000)) >> 1;
+ WARN_ON(precalc_expire == 0);
+ printk(KERN_DEBUG "Synthetic TSC timer will fire each %u jiffies.\n",
+ precalc_expire);
+ return 0;
+}
+
+static void prepare_synthetic_tsc(int cpu)
+{
+ struct synthetic_tsc_struct *cpu_synth;
+ u64 local_count;
+
+ cpu_synth = &per_cpu(synthetic_tsc, cpu);
+ local_count = trace_clock_read_synthetic_tsc();
+ cpu_synth->tsc[0].val = local_count;
+ cpu_synth->index = 0;
+ smp_wmb(); /* Writing in data of CPU about to come up */
+ init_timer(&per_cpu(tsc_timer, cpu));
+ per_cpu(tsc_timer, cpu).function = tsc_timer_fct;
+ per_cpu(tsc_timer, cpu).expires = jiffies + precalc_expire;
+}
+
+static void enable_synthetic_tsc(int cpu)
+{
+ smp_call_function_single(cpu, synthetic_tsc_ipi, NULL, 1);
+ add_timer_on(&per_cpu(tsc_timer, cpu), cpu);
+}
+
+static void disable_synthetic_tsc(int cpu)
+{
+ del_timer_sync(&per_cpu(tsc_timer, cpu));
+}
+
+/*
+ * hotcpu_callback - CPU hotplug callback
+ * @nb: notifier block
+ * @action: hotplug action to take
+ * @hcpu: CPU number
+ *
+ * Sets the new CPU's current synthetic TSC to the same value as the
+ * currently running CPU.
+ *
+ * Returns the success/failure of the operation. (NOTIFY_OK, NOTIFY_BAD)
+ */
+static int __cpuinit hotcpu_callback(struct notifier_block *nb,
+ unsigned long action,
+ void *hcpu)
+{
+ unsigned int hotcpu = (unsigned long)hcpu;
+
+ switch (action) {
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ if (synthetic_tsc_refcount)
+ prepare_synthetic_tsc(hotcpu);
+ break;
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ if (synthetic_tsc_refcount)
+ enable_synthetic_tsc(hotcpu);
+ break;
+#ifdef CONFIG_HOTPLUG_CPU
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ if (synthetic_tsc_refcount)
+ disable_synthetic_tsc(hotcpu);
+ break;
+#endif /* CONFIG_HOTPLUG_CPU */
+ }
+ return NOTIFY_OK;
+}
+
+void get_synthetic_tsc(void)
+{
+ int cpu;
+
+ get_online_cpus();
+ mutex_lock(&synthetic_tsc_mutex);
+ if (synthetic_tsc_refcount++)
+ goto end;
+
+ synthetic_tsc_enabled = 1;
+ for_each_online_cpu(cpu) {
+ prepare_synthetic_tsc(cpu);
+ enable_synthetic_tsc(cpu);
+ }
+end:
+ mutex_unlock(&synthetic_tsc_mutex);
+ put_online_cpus();
+}
+EXPORT_SYMBOL_GPL(get_synthetic_tsc);
+
+void put_synthetic_tsc(void)
+{
+ int cpu;
+
+ get_online_cpus();
+ mutex_lock(&synthetic_tsc_mutex);
+ WARN_ON(synthetic_tsc_refcount <= 0);
+ if (synthetic_tsc_refcount != 1 || !synthetic_tsc_enabled)
+ goto end;
+
+ for_each_online_cpu(cpu)
+ disable_synthetic_tsc(cpu);
+ synthetic_tsc_enabled = 0;
+end:
+ synthetic_tsc_refcount--;
+ mutex_unlock(&synthetic_tsc_mutex);
+ put_online_cpus();
+}
+EXPORT_SYMBOL_GPL(put_synthetic_tsc);
+
+/* Called from CPU 0, before any tracing starts, to init each structure */
+static int __init init_synthetic_tsc(void)
+{
+ precalc_stsc_interval();
+ hotcpu_notifier(hotcpu_callback, 3);
+ return 0;
+}
+
+/* Before SMP is up */
+early_initcall(init_synthetic_tsc);
Index: linux.trees.git/init/Kconfig
===================================================================
--- linux.trees.git.orig/init/Kconfig 2008-11-07 00:07:23.000000000 -0500
+++ linux.trees.git/init/Kconfig 2008-11-07 00:11:06.000000000 -0500
@@ -340,6 +340,20 @@ config HAVE_UNSTABLE_SCHED_CLOCK
config HAVE_GET_CYCLES
def_bool n
+#
+# Architectures with a specialized tracing clock should select this.
+#
+config HAVE_TRACE_CLOCK
+ def_bool n
+
+#
+# Architectures with only a 32-bit clock source should select this.
+#
+config HAVE_TRACE_CLOCK_32_TO_64
+ bool
+ default y if (!HAVE_TRACE_CLOCK)
+ default n if HAVE_TRACE_CLOCK
+
config GROUP_SCHED
bool "Group CPU scheduler"
depends on EXPERIMENTAL
Index: linux.trees.git/kernel/Makefile
===================================================================
--- linux.trees.git.orig/kernel/Makefile 2008-10-30 20:22:52.000000000 -0400
+++ linux.trees.git/kernel/Makefile 2008-11-07 00:11:52.000000000 -0500
@@ -88,8 +88,7 @@ obj-$(CONFIG_MARKERS) += marker.o
obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
obj-$(CONFIG_LATENCYTOP) += latencytop.o
obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
-obj-$(CONFIG_FUNCTION_TRACER) += trace/
-obj-$(CONFIG_TRACING) += trace/
+obj-y += trace/
obj-$(CONFIG_SMP) += sched_cpupri.o
ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Mathieu Desnoyers, Nicolas Pitre, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
[-- Attachment #1: cnt32_to_63-use-barriers.patch --]
[-- Type: text/plain, Size: 3143 bytes --]
Assume the time source is a global clock which ensures that time will never
*ever* go backward. Use a smp_rmb() to make sure __m_cnt_hi is read before the
cnt_lo value.
Remove the now-unnecessary volatile qualifier; the barrier takes care of memory
ordering.
Mathieu:
> Yup, you are right. However, the case where one CPU sees the clock source
> a little bit off-sync (late) still poses a problem. Example follows :
>
> CPU A:
>   read __m_cnt_hi (0x80000000)
>   read hw cnt low (0x00000001)
>   (wrap detected: (s32)(0x80000000 ^ 0x1) < 0)
>   write __m_cnt_hi = 0x00000001
>   return 0x0000000100000001
>
> CPU B:
>   read __m_cnt_hi (0x00000001)
>   (late) read hw cnt low (0xFFFFFFFA)
>   (wrap detected: (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
>   write __m_cnt_hi = 0x80000001
>   return 0x80000001FFFFFFFA
>   (time jumps)
> A similar situation can be generated by out-of-order hi/low bits reads.
Nicolas:
This, of course, should and can be prevented. No big deal.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Nicolas Pitre <nico@cam.org>
CC: Ralf Baechle <ralf@linux-mips.org>
CC: benh@kernel.crashing.org
CC: paulus@samba.org
CC: David Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-arch@vger.kernel.org
---
include/linux/cnt32_to_63.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
Index: linux-2.6-lttng/include/linux/cnt32_to_63.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/cnt32_to_63.h 2008-11-04 01:39:03.000000000 -0500
+++ linux-2.6-lttng/include/linux/cnt32_to_63.h 2008-11-04 01:48:50.000000000 -0500
@@ -65,12 +65,17 @@ union cnt32_to_63 {
* implicitly by making the multiplier even, therefore saving on a runtime
* clear-bit instruction. Otherwise caller must remember to clear the top
* bit explicitly.
+ *
+ * Assume the time source is a global clock read from memory mapped I/O which
+ * insures that time will never *ever* go backward. Using a smp_rmb() to make
+ * sure the __m_cnt_hi value is read before the cnt_lo mmio read.
*/
#define cnt32_to_63(cnt_lo) \
({ \
- static volatile u32 __m_cnt_hi; \
+ static u32 __m_cnt_hi; \
union cnt32_to_63 __x; \
__x.hi = __m_cnt_hi; \
+ smp_rmb(); /* read __m_cnt_hi before mmio cnt_lo */ \
__x.lo = (cnt_lo); \
if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
__m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* [RFC patch 09/18] Powerpc : Trace clock
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Mathieu Desnoyers, benh, paulus, linux-arch
[-- Attachment #1: powerpc-trace-clock.patch --]
[-- Type: text/plain, Size: 2142 bytes --]
Powerpc implementation of trace clock with get_tb().
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: benh@kernel.crashing.org
CC: paulus@samba.org
CC: linux-arch@vger.kernel.org
---
arch/powerpc/Kconfig | 1
arch/powerpc/include/asm/trace-clock.h | 50 +++++++++++++++++++++++++++++++++
2 files changed, 51 insertions(+)
Index: linux.trees.git/arch/powerpc/Kconfig
===================================================================
--- linux.trees.git.orig/arch/powerpc/Kconfig 2008-11-07 00:09:46.000000000 -0500
+++ linux.trees.git/arch/powerpc/Kconfig 2008-11-07 00:11:59.000000000 -0500
@@ -114,6 +114,7 @@ config PPC
select HAVE_IOREMAP_PROT
select HAVE_EFFICIENT_UNALIGNED_ACCESS
select HAVE_KPROBES
+ select HAVE_TRACE_CLOCK
select HAVE_ARCH_KGDB
select HAVE_KRETPROBES
select HAVE_ARCH_TRACEHOOK
Index: linux.trees.git/arch/powerpc/include/asm/trace-clock.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/arch/powerpc/include/asm/trace-clock.h 2008-11-07 00:11:59.000000000 -0500
@@ -0,0 +1,50 @@
+/*
+ * Copyright (C) 2005,2008 Mathieu Desnoyers
+ *
+ * Trace clock PowerPC definitions.
+ *
+ * Use get_tb() directly to insure reading a 64-bits value on powerpc 32.
+ */
+
+#ifndef _ASM_TRACE_CLOCK_H
+#define _ASM_TRACE_CLOCK_H
+
+#include <linux/timex.h>
+#include <linux/time.h>
+#include <asm/processor.h>
+
+static inline u32 trace_clock_read32(void)
+{
+ return get_tbl();
+}
+
+static inline u64 trace_clock_read64(void)
+{
+ return get_tb();
+}
+
+static inline void trace_clock_add_timestamp(unsigned long ticks)
+{ }
+
+static inline unsigned int trace_clock_frequency(void)
+{
+ return get_cycles_rate();
+}
+
+static inline u32 trace_clock_freq_scale(void)
+{
+ return 1;
+}
+
+static inline void get_trace_clock(void)
+{
+}
+
+static inline void put_trace_clock(void)
+{
+}
+
+static inline void set_trace_clock_is_sync(int state)
+{
+}
+#endif /* _ASM_TRACE_CLOCK_H */
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* [RFC patch 10/18] Sparc64 : Trace clock
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Mathieu Desnoyers, David Miller, linux-arch
[-- Attachment #1: sparc64-trace-clock.patch --]
[-- Type: text/plain, Size: 2002 bytes --]
Implement sparc64 trace clock.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: David Miller <davem@davemloft.net>
CC: linux-arch@vger.kernel.org
---
arch/sparc/include/asm/trace-clock.h | 46 +++++++++++++++++++++++++++++++++++
arch/sparc64/Kconfig | 1
2 files changed, 47 insertions(+)
Index: linux.trees.git/arch/sparc64/Kconfig
===================================================================
--- linux.trees.git.orig/arch/sparc64/Kconfig 2008-11-07 00:09:35.000000000 -0500
+++ linux.trees.git/arch/sparc64/Kconfig 2008-11-07 00:12:26.000000000 -0500
@@ -16,6 +16,7 @@ config SPARC64
select HAVE_GET_CYCLES
select HAVE_LMB
select HAVE_ARCH_KGDB
+ select HAVE_TRACE_CLOCK
select USE_GENERIC_SMP_HELPERS if SMP
select HAVE_ARCH_TRACEHOOK
select ARCH_WANT_OPTIONAL_GPIOLIB
Index: linux.trees.git/arch/sparc/include/asm/trace-clock.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/arch/sparc/include/asm/trace-clock.h 2008-11-07 00:12:04.000000000 -0500
@@ -0,0 +1,46 @@
+/*
+ * Copyright (C) 2008, Mathieu Desnoyers
+ *
+ * Trace clock definitions for Sparc64.
+ */
+
+#ifndef _ASM_SPARC_TRACE_CLOCK_H
+#define _ASM_SPARC_TRACE_CLOCK_H
+
+#include <linux/timex.h>
+
+static inline u32 trace_clock_read32(void)
+{
+ return get_cycles();
+}
+
+static inline u64 trace_clock_read64(void)
+{
+ return get_cycles();
+}
+
+static inline void trace_clock_add_timestamp(unsigned long ticks)
+{ }
+
+static inline unsigned int trace_clock_frequency(void)
+{
+ return get_cycles_rate();
+}
+
+static inline u32 trace_clock_freq_scale(void)
+{
+ return 1;
+}
+
+static inline void get_trace_clock(void)
+{
+}
+
+static inline void put_trace_clock(void)
+{
+}
+
+static inline void set_trace_clock_is_sync(int state)
+{
+}
+#endif /* _ASM_SPARC_TRACE_CLOCK_H */
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* [RFC patch 11/18] LTTng timestamp sh
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Giuseppe Cavallaro, Mathieu Desnoyers, Paul Mundt, linux-sh
[-- Attachment #1: sh-trace-clock.patch --]
[-- Type: text/plain, Size: 3779 bytes --]
This patch adds the timestamping mechanism to the trace-clock.h arch header
file. The new timestamp functions use TMU channel 1.
This code only works if TMU channel 1 is initialized during the kernel boot.
Big fat warning(TM) from Mathieu Desnoyers:
This patch seems to assume TMU channel 1 is set up at boot. Is that always true
on all SuperH boards? Is there some Kconfig selection that should be done here?
Make sure this patch does not break get_cycles() on SuperH before merging.
From: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Paul Mundt <lethal@linux-sh.org>
CC: linux-sh@vger.kernel.org
---
arch/sh/Kconfig | 2 +
arch/sh/include/asm/timex.h | 7 +++-
arch/sh/include/asm/trace-clock.h | 61 ++++++++++++++++++++++++++++++++++++++
3 files changed, 68 insertions(+), 2 deletions(-)
Index: linux.trees.git/arch/sh/include/asm/timex.h
===================================================================
--- linux.trees.git.orig/arch/sh/include/asm/timex.h 2008-09-30 11:38:51.000000000 -0400
+++ linux.trees.git/arch/sh/include/asm/timex.h 2008-11-07 00:12:47.000000000 -0500
@@ -6,13 +6,16 @@
#ifndef __ASM_SH_TIMEX_H
#define __ASM_SH_TIMEX_H
-#define CLOCK_TICK_RATE (CONFIG_SH_PCLK_FREQ / 4) /* Underlying HZ */
+#include <linux/io.h>
+#include <asm/cpu/timer.h>
+
+#define CLOCK_TICK_RATE (HZ * 100000UL)
typedef unsigned long long cycles_t;
static __inline__ cycles_t get_cycles (void)
{
- return 0;
+ return 0xffffffff - ctrl_inl(TMU1_TCNT);
}
#endif /* __ASM_SH_TIMEX_H */
Index: linux.trees.git/arch/sh/Kconfig
===================================================================
--- linux.trees.git.orig/arch/sh/Kconfig 2008-11-07 00:06:06.000000000 -0500
+++ linux.trees.git/arch/sh/Kconfig 2008-11-07 00:12:47.000000000 -0500
@@ -11,6 +11,8 @@ config SUPERH
select HAVE_CLK
select HAVE_IDE
select HAVE_OPROFILE
+ select HAVE_TRACE_CLOCK
+ select HAVE_TRACE_CLOCK_32_TO_64
select HAVE_GENERIC_DMA_COHERENT
select HAVE_IOREMAP_PROT if MMU
help
Index: linux.trees.git/arch/sh/include/asm/trace-clock.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/arch/sh/include/asm/trace-clock.h 2008-11-07 00:12:47.000000000 -0500
@@ -0,0 +1,61 @@
+/*
+ * Copyright (C) 2007,2008 Giuseppe Cavallaro <peppe.cavallaro@st.com>
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * Trace clock definitions for SuperH.
+ */
+
+#ifndef _ASM_SH_TRACE_CLOCK_H
+#define _ASM_SH_TRACE_CLOCK_H
+
+#include <linux/timer.h>
+#include <asm/clock.h>
+
+extern u64 trace_clock_read_synthetic_tsc(void);
+
+static inline u32 trace_clock_get_timestamp32(void)
+{
+ return get_cycles();
+}
+
+static inline u64 trace_clock_get_timestamp64(void)
+{
+ return trace_clock_read_synthetic_tsc();
+}
+
+static inline void trace_clock_add_timestamp(unsigned long ticks)
+{ }
+
+static inline unsigned int trace_clock_frequency(void)
+{
+ unsigned long rate;
+ struct clk *tmu1_clk;
+
+ tmu1_clk = clk_get(NULL, "tmu1_clk");
+ rate = clk_get_rate(tmu1_clk);
+
+ return (unsigned int)rate;
+}
+
+static inline u32 trace_clock_freq_scale(void)
+{
+ return 1;
+}
+
+extern void get_synthetic_tsc(void);
+extern void put_synthetic_tsc(void);
+
+static inline void get_trace_clock(void)
+{
+ get_synthetic_tsc();
+}
+
+static inline void put_trace_clock(void)
+{
+ put_synthetic_tsc();
+}
+
+static inline void set_trace_clock_is_sync(int state)
+{
+}
+#endif /* _ASM_SH_TRACE_CLOCK_H */
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* [RFC patch 12/18] LTTng - TSC synchronicity test
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Mathieu Desnoyers, Ingo Molnar, Jan Kiszka, Thomas Gleixner,
Steven Rostedt
[-- Attachment #1: test-tsc-sync.patch --]
[-- Type: text/plain, Size: 13490 bytes --]
Test TSC synchronization across CPUs. Architecture-independent, so it can be
used on various architectures. It aims at testing TSC synchronization on a
running system (not only at early boot), with minimal impact on interrupt
latency.
I wrote this code before the x86 tsc_sync.c existed and, given that it worked
well for my needs, I never switched to tsc_sync.c. Although it has the same
goal, it does it a bit differently:
tsc_sync.c looks at the cycle counters on two CPUs to see if one, compared to
the other, goes backward when read in a loop. The LTTng code synchronizes both
cores with a counter used as a memory barrier and then reads the two TSCs at a
delta equal to the cache-line exchange. Instruction and data caches are primed.
The test is repeated in loops to ensure we deal with MCEs and NMIs which could
skew the results.
The problem I see with tsc_sync.c is that if one of the two CPUs is delayed by
an interrupt handler (for way too long) while the other CPU is doing its
check_tsc_warp() execution, and if the CPU with the lowest TSC values runs
first, that code will fail to detect unsynchronized CPUs.
This sync test code does not have this problem.
A following patch replaces the x86 tsc_sync.c code with this
architecture-independent code.
This code also adds the kernel parameter
force_tsc_sync=1
which forces resynchronization of the CPU TSCs when a CPU is hotplugged.
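Concretely (a worked reading of the code below): each of the NR_LOOPS = 10
rounds makes the master and the target CPU rendez-vous through the sync_data
counters and record their TSCs; the smallest absolute delta over the 10 rounds
is kept for each target CPU, the largest of those per-CPU best deltas is then
compared against MAX_CYCLES_DELTA (1000 cycles), and if it is larger the TSC
clocksource is marked unstable and _tsc_is_sync is cleared.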
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Ingo Molnar <mingo@redhat.com>
CC: Jan Kiszka <jan.kiszka@siemens.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
---
Documentation/kernel-parameters.txt | 4
init/Kconfig | 7
kernel/time/Makefile | 1
kernel/time/tsc-sync.c | 313 ++++++++++++++++++++++++++++++++++++
4 files changed, 325 insertions(+)
Index: linux.trees.git/kernel/time/tsc-sync.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/kernel/time/tsc-sync.c 2008-11-07 00:13:01.000000000 -0500
@@ -0,0 +1,313 @@
+/*
+ * kernel/time/tsc-sync.c
+ *
+ * Test TSC synchronization
+ *
+ * marks the tsc as unstable _and_ keep a simple "_tsc_is_sync" variable, which
+ * is fast to read when a simple test must determine which clock source to use
+ * for kernel tracing.
+ *
+ * - CPU init :
+ *
+ * We check whether all boot CPUs have their TSC's synchronized,
+ * print a warning if not and turn off the TSC clock-source.
+ *
+ * Only two CPUs may participate - they can enter in any order.
+ * ( The serial nature of the boot logic and the CPU hotplug lock
+ * protects against more than 2 CPUs entering this code.
+ *
+ * - When CPUs are up :
+ *
+ * TSC synchronicity of all CPUs can be checked later at run-time by calling
+ * test_tsc_synchronization().
+ *
+ * Copyright 2007, 2008
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ */
+#include <linux/module.h>
+#include <linux/timer.h>
+#include <linux/timex.h>
+#include <linux/jiffies.h>
+#include <linux/trace-clock.h>
+#include <linux/cpu.h>
+#include <linux/kthread.h>
+#include <linux/mutex.h>
+#include <linux/cpu.h>
+
+#define MAX_CYCLES_DELTA 1000ULL
+
+/*
+ * Number of loops to take care of MCE, NMIs, SMIs.
+ */
+#define NR_LOOPS 10
+
+static DEFINE_MUTEX(tscsync_mutex);
+
+struct sync_data {
+ int nr_waits;
+ int wait_sync;
+ cycles_t tsc_count;
+} ____cacheline_aligned;
+
+/* 0 is master, 1 is slave */
+static struct sync_data sync_data[2] = {
+ [0 ... 1] = {
+ .nr_waits = 3 * NR_LOOPS + 1,
+ .wait_sync = 3 * NR_LOOPS + 1,
+ },
+};
+
+int _tsc_is_sync = 1;
+EXPORT_SYMBOL(_tsc_is_sync);
+
+static int force_tsc_sync;
+static cycles_t slave_offset;
+static int slave_offset_ready; /* for 32-bits architectures */
+
+static int __init force_tsc_sync_setup(char *str)
+{
+ force_tsc_sync = simple_strtoul(str, NULL, 0);
+ return 1;
+}
+__setup("force_tsc_sync=", force_tsc_sync_setup);
+
+/*
+ * Mark it noinline so we make sure it is not unrolled.
+ * Wait until value is reached.
+ */
+static noinline void tsc_barrier(long this_cpu)
+{
+ sync_core();
+ sync_data[this_cpu].wait_sync--;
+ smp_mb(); /* order master/slave sync_data read/write */
+ while (unlikely(sync_data[1 - this_cpu].wait_sync >=
+ sync_data[this_cpu].nr_waits))
+ barrier(); /*
+ * barrier is used because faster and
+ * more predictable than cpu_idle().
+ */
+ smp_mb(); /* order master/slave sync_data read/write */
+ sync_data[this_cpu].nr_waits--;
+ get_cycles_barrier();
+ sync_data[this_cpu].tsc_count = get_cycles();
+ get_cycles_barrier();
+}
+
+/*
+ * Worker thread called on each CPU.
+ * First wait with interrupts enabled, then wait with interrupt disabled,
+ * for precision. We are already bound to one CPU.
+ * this_cpu 0 : master
+ * this_cpu 1 : slave
+ */
+static void test_sync(void *arg)
+{
+ long this_cpu = (long)arg;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ /* Make sure the instructions are in I-CACHE */
+ tsc_barrier(this_cpu);
+ tsc_barrier(this_cpu);
+ sync_data[this_cpu].wait_sync--;
+ smp_mb(); /* order master/slave sync_data read/write */
+ while (unlikely(sync_data[1 - this_cpu].wait_sync >=
+ sync_data[this_cpu].nr_waits))
+ barrier(); /*
+ * barrier is used because faster and
+ * more predictable than cpu_idle().
+ */
+ smp_mb(); /* order master/slave sync_data read/write */
+ sync_data[this_cpu].nr_waits--;
+ /*
+ * Here, only the master will wait for the slave to reach this barrier.
+ * This makes sure that the master, which holds the mutex and will reset
+ * the barriers, waits for the slave to stop using the barrier values
+ * before it continues. This is only done at the complete end of all the
+ * loops. This is why there is a + 1 in original wait_sync value.
+ */
+ if (sync_data[this_cpu].nr_waits == 1)
+ sync_data[this_cpu].wait_sync--;
+ local_irq_restore(flags);
+}
+
+/*
+ * Each CPU (master and target) must decrement the wait_sync value twice (once
+ * for priming in cache), and also once after the get_cycles. After all the
+ * loops, one last synchronization is required to make sure the master waits
+ * for the slave before resetting the barriers.
+ */
+static void reset_barriers(void)
+{
+ int i;
+
+ /*
+ * Wait until slave is done so that we don't overwrite
+ * wait_end_sync prematurely.
+ */
+ smp_mb(); /* order master/slave sync_data read/write */
+ while (unlikely(sync_data[1].wait_sync >= sync_data[0].nr_waits))
+ barrier(); /*
+ * barrier is used because faster and
+ * more predictable than cpu_idle().
+ */
+ smp_mb(); /* order master/slave sync_data read/write */
+
+ for (i = 0; i < 2; i++) {
+ WARN_ON(sync_data[i].wait_sync != 0);
+ WARN_ON(sync_data[i].nr_waits != 1);
+ sync_data[i].wait_sync = 3 * NR_LOOPS + 1;
+ sync_data[i].nr_waits = 3 * NR_LOOPS + 1;
+ }
+}
+
+/*
+ * Do loops (making sure no unexpected event changes the timing), keep the best
+ * one. The result of each loop is the highest tsc delta between the master CPU
+ * and the slaves. Stop CPU hotplug when this code is executed to make sure we
+ * are concurrency-safe wrt CPU hotplug also using this code. Test TSC
+ * synchronization even if we already "know" CPUs were not synchronized. This
+ * can be used as a test to check if, for some reason, the CPUs eventually got
+ * in sync after a CPU has been unplugged. This code is kept separate from the
+ * CPU hotplug code because the slave CPU executes in an IPI, which we want to
+ * keep as short as possible (this is happening while the system is running).
+ * Therefore, we do not send a single IPI for all the test loops, but rather
+ * send one IPI per loop.
+ */
+int test_tsc_synchronization(void)
+{
+ long cpu, master;
+ cycles_t max_diff = 0, diff, best_loop, worse_loop = 0;
+ int i;
+
+ mutex_lock(&tscsync_mutex);
+ get_online_cpus();
+
+ printk(KERN_INFO
+ "checking TSC synchronization across all online CPUs:");
+
+ preempt_disable();
+ master = smp_processor_id();
+ for_each_online_cpu(cpu) {
+ if (master == cpu)
+ continue;
+ best_loop = (cycles_t)ULLONG_MAX;
+ for (i = 0; i < NR_LOOPS; i++) {
+ smp_call_function_single(cpu, test_sync,
+ (void *)1UL, 0);
+ test_sync((void *)0UL);
+ diff = abs(sync_data[1].tsc_count
+ - sync_data[0].tsc_count);
+ best_loop = min(best_loop, diff);
+ worse_loop = max(worse_loop, diff);
+ }
+ reset_barriers();
+ max_diff = max(best_loop, max_diff);
+ }
+ preempt_enable();
+ if (max_diff >= MAX_CYCLES_DELTA) {
+ printk(KERN_WARNING
+ "Measured %llu cycles TSC offset between CPUs,"
+ " turning off TSC clock.\n", (u64)max_diff);
+ mark_tsc_unstable("check_tsc_sync_source failed");
+ _tsc_is_sync = 0;
+ } else {
+ printk(" passed.\n");
+ }
+ put_online_cpus();
+ mutex_unlock(&tscsync_mutex);
+ return max_diff < MAX_CYCLES_DELTA;
+}
+EXPORT_SYMBOL_GPL(test_tsc_synchronization);
+
+/*
+ * Test synchronicity of a single core when it is hotplugged.
+ * Source CPU calls into this - waits for the freshly booted target CPU to
+ * arrive and then start the measurement:
+ */
+void __cpuinit check_tsc_sync_source(int cpu)
+{
+ cycles_t diff, abs_diff,
+ best_loop = (cycles_t)ULLONG_MAX, worse_loop = 0;
+ int i;
+
+ /*
+ * No need to check if we already know that the TSC is not synchronized:
+ */
+ if (!force_tsc_sync && unsynchronized_tsc()) {
+ /*
+		 * Make sure we set _tsc_is_sync to 0 if the TSC is found
+		 * to be unsynchronized for causes other than non-synchronized
+ * TSCs across CPUs.
+ */
+ _tsc_is_sync = 0;
+ set_trace_clock_is_sync(0);
+ return;
+ }
+
+ printk(KERN_INFO "checking TSC synchronization [CPU#%d -> CPU#%d]:",
+ smp_processor_id(), cpu);
+
+ for (i = 0; i < NR_LOOPS; i++) {
+ test_sync((void *)0UL);
+ diff = sync_data[1].tsc_count - sync_data[0].tsc_count;
+ abs_diff = abs(diff);
+ best_loop = min(best_loop, abs_diff);
+ worse_loop = max(worse_loop, abs_diff);
+ if (force_tsc_sync && best_loop == abs_diff)
+ slave_offset = diff;
+ }
+ reset_barriers();
+
+ if (!force_tsc_sync && best_loop >= MAX_CYCLES_DELTA) {
+ printk(" failed.\n");
+ printk(KERN_WARNING
+ "Measured %llu cycles TSC offset between CPUs,"
+ " turning off TSC clock.\n", (u64)best_loop);
+ mark_tsc_unstable("check_tsc_sync_source failed");
+ _tsc_is_sync = 0;
+ set_trace_clock_is_sync(0);
+ } else {
+ printk(" %s.\n", !force_tsc_sync ? "passed" : "forced");
+ }
+ if (force_tsc_sync) {
+ /* order slave_offset and slave_offset_ready writes */
+ smp_wmb();
+ slave_offset_ready = 1;
+ }
+}
+
+/*
+ * Freshly booted CPUs call into this:
+ */
+void __cpuinit check_tsc_sync_target(void)
+{
+ int i;
+
+ if (!force_tsc_sync && unsynchronized_tsc())
+ return;
+
+ for (i = 0; i < NR_LOOPS; i++)
+ test_sync((void *)1UL);
+
+ /*
+ * Force slave synchronization if requested.
+ */
+ if (force_tsc_sync) {
+ unsigned long flags;
+ cycles_t new_tsc;
+
+ while (!slave_offset_ready)
+ cpu_relax();
+ /* order slave_offset and slave_offset_ready reads */
+ smp_rmb();
+ local_irq_save(flags);
+ /*
+ * slave_offset is read when master has finished writing to it,
+ * and is protected by cpu hotplug serialization.
+ */
+ new_tsc = get_cycles() - slave_offset;
+ write_tsc((u32)new_tsc, (u32)((u64)new_tsc >> 32));
+ local_irq_restore(flags);
+ }
+}
Index: linux.trees.git/kernel/time/Makefile
===================================================================
--- linux.trees.git.orig/kernel/time/Makefile 2008-11-07 00:12:55.000000000 -0500
+++ linux.trees.git/kernel/time/Makefile 2008-11-07 00:13:01.000000000 -0500
@@ -6,3 +6,4 @@ obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCA
obj-$(CONFIG_TICK_ONESHOT) += tick-oneshot.o
obj-$(CONFIG_TICK_ONESHOT) += tick-sched.o
obj-$(CONFIG_TIMER_STATS) += timer_stats.o
+obj-$(CONFIG_HAVE_UNSYNCHRONIZED_TSC) += tsc-sync.o
Index: linux.trees.git/init/Kconfig
===================================================================
--- linux.trees.git.orig/init/Kconfig 2008-11-07 00:12:55.000000000 -0500
+++ linux.trees.git/init/Kconfig 2008-11-07 00:13:01.000000000 -0500
@@ -354,6 +354,13 @@ config HAVE_TRACE_CLOCK_32_TO_64
default y if (!HAVE_TRACE_CLOCK)
default n if HAVE_TRACE_CLOCK
+#
+# Architectures which need to dynamically detect if their TSC is unsynchronized
+# across cpus should select this.
+#
+config HAVE_UNSYNCHRONIZED_TSC
+ def_bool n
+
config GROUP_SCHED
bool "Group CPU scheduler"
depends on EXPERIMENTAL
Index: linux.trees.git/Documentation/kernel-parameters.txt
===================================================================
--- linux.trees.git.orig/Documentation/kernel-parameters.txt 2008-11-07 00:12:55.000000000 -0500
+++ linux.trees.git/Documentation/kernel-parameters.txt 2008-11-07 00:13:01.000000000 -0500
@@ -765,6 +765,10 @@ and is between 256 and 4096 characters.
parameter will force ia64_sal_cache_flush to call
ia64_pal_cache_flush instead of SAL_CACHE_FLUSH.
+ force_tsc_sync
+ Force TSC resynchronization when SMP CPUs go online.
+ See also idle=poll and disable frequency scaling.
+
gamecon.map[2|3]=
[HW,JOY] Multisystem joystick and NES/SNES/PSX pad
support via parallel port (up to 5 devices per port)
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* [RFC patch 13/18] x86 : remove arch-specific tsc_sync.c
2008-11-07 5:23 [RFC patch 00/18] Trace Clock v2 Mathieu Desnoyers
` (11 preceding siblings ...)
2008-11-07 5:23 ` [RFC patch 12/18] LTTng - TSC synchronicity test Mathieu Desnoyers
@ 2008-11-07 5:23 ` Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 14/18] MIPS use tsc_sync.c Mathieu Desnoyers
` (5 subsequent siblings)
18 siblings, 0 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Mathieu Desnoyers, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
Steven Rostedt
[-- Attachment #1: x86-remove-arch-specific-tsc_sync.patch --]
[-- Type: text/plain, Size: 7873 bytes --]
Depends on the new arch-independent kernel/time/tsc-sync.c
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Steven Rostedt <rostedt@goodmis.org>
---
arch/x86/Kconfig | 2
arch/x86/include/asm/tsc.h | 9 +-
arch/x86/kernel/Makefile | 4
arch/x86/kernel/tsc_sync.c | 189 ---------------------------------------------
4 files changed, 12 insertions(+), 192 deletions(-)
Index: linux.trees.git/arch/x86/kernel/Makefile
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/Makefile 2008-11-07 00:06:06.000000000 -0500
+++ linux.trees.git/arch/x86/kernel/Makefile 2008-11-07 00:15:13.000000000 -0500
@@ -56,9 +56,9 @@ obj-$(CONFIG_PCI) += early-quirks.o
apm-y := apm_32.o
obj-$(CONFIG_APM) += apm.o
obj-$(CONFIG_X86_SMP) += smp.o
-obj-$(CONFIG_X86_SMP) += smpboot.o tsc_sync.o ipi.o tlb_$(BITS).o
+obj-$(CONFIG_X86_SMP) += smpboot.o ipi.o tlb_$(BITS).o
obj-$(CONFIG_X86_32_SMP) += smpcommon.o
-obj-$(CONFIG_X86_64_SMP) += tsc_sync.o smpcommon.o
+obj-$(CONFIG_X86_64_SMP) += smpcommon.o
obj-$(CONFIG_X86_TRAMPOLINE) += trampoline_$(BITS).o
obj-$(CONFIG_X86_MPPARSE) += mpparse.o
obj-$(CONFIG_X86_LOCAL_APIC) += apic.o nmi.o
Index: linux.trees.git/arch/x86/kernel/tsc_sync.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/tsc_sync.c 2008-09-30 11:38:51.000000000 -0400
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,189 +0,0 @@
-/*
- * check TSC synchronization.
- *
- * Copyright (C) 2006, Red Hat, Inc., Ingo Molnar
- *
- * We check whether all boot CPUs have their TSC's synchronized,
- * print a warning if not and turn off the TSC clock-source.
- *
- * The warp-check is point-to-point between two CPUs, the CPU
- * initiating the bootup is the 'source CPU', the freshly booting
- * CPU is the 'target CPU'.
- *
- * Only two CPUs may participate - they can enter in any order.
- * ( The serial nature of the boot logic and the CPU hotplug lock
- * protects against more than 2 CPUs entering this code. )
- */
-#include <linux/spinlock.h>
-#include <linux/kernel.h>
-#include <linux/init.h>
-#include <linux/smp.h>
-#include <linux/nmi.h>
-#include <asm/tsc.h>
-
-/*
- * Entry/exit counters that make sure that both CPUs
- * run the measurement code at once:
- */
-static __cpuinitdata atomic_t start_count;
-static __cpuinitdata atomic_t stop_count;
-
-/*
- * We use a raw spinlock in this exceptional case, because
- * we want to have the fastest, inlined, non-debug version
- * of a critical section, to be able to prove TSC time-warps:
- */
-static __cpuinitdata raw_spinlock_t sync_lock = __RAW_SPIN_LOCK_UNLOCKED;
-static __cpuinitdata cycles_t last_tsc;
-static __cpuinitdata cycles_t max_warp;
-static __cpuinitdata int nr_warps;
-
-/*
- * TSC-warp measurement loop running on both CPUs:
- */
-static __cpuinit void check_tsc_warp(void)
-{
- cycles_t start, now, prev, end;
- int i;
-
- start = get_cycles();
- /*
- * The measurement runs for 20 msecs:
- */
- end = start + tsc_khz * 20ULL;
- now = start;
-
- for (i = 0; ; i++) {
- /*
- * We take the global lock, measure TSC, save the
- * previous TSC that was measured (possibly on
- * another CPU) and update the previous TSC timestamp.
- */
- __raw_spin_lock(&sync_lock);
- prev = last_tsc;
- now = get_cycles();
- last_tsc = now;
- __raw_spin_unlock(&sync_lock);
-
- /*
- * Be nice every now and then (and also check whether
- * measurement is done [we also insert a 10 million
- * loops safety exit, so we dont lock up in case the
- * TSC readout is totally broken]):
- */
- if (unlikely(!(i & 7))) {
- if (now > end || i > 10000000)
- break;
- cpu_relax();
- touch_nmi_watchdog();
- }
- /*
- * Outside the critical section we can now see whether
- * we saw a time-warp of the TSC going backwards:
- */
- if (unlikely(prev > now)) {
- __raw_spin_lock(&sync_lock);
- max_warp = max(max_warp, prev - now);
- nr_warps++;
- __raw_spin_unlock(&sync_lock);
- }
- }
- WARN(!(now-start),
- "Warning: zero tsc calibration delta: %Ld [max: %Ld]\n",
- now-start, end-start);
-}
-
-/*
- * Source CPU calls into this - it waits for the freshly booted
- * target CPU to arrive and then starts the measurement:
- */
-void __cpuinit check_tsc_sync_source(int cpu)
-{
- int cpus = 2;
-
- /*
- * No need to check if we already know that the TSC is not
- * synchronized:
- */
- if (unsynchronized_tsc())
- return;
-
- printk(KERN_INFO "checking TSC synchronization [CPU#%d -> CPU#%d]:",
- smp_processor_id(), cpu);
-
- /*
- * Reset it - in case this is a second bootup:
- */
- atomic_set(&stop_count, 0);
-
- /*
- * Wait for the target to arrive:
- */
- while (atomic_read(&start_count) != cpus-1)
- cpu_relax();
- /*
- * Trigger the target to continue into the measurement too:
- */
- atomic_inc(&start_count);
-
- check_tsc_warp();
-
- while (atomic_read(&stop_count) != cpus-1)
- cpu_relax();
-
- if (nr_warps) {
- printk("\n");
- printk(KERN_WARNING "Measured %Ld cycles TSC warp between CPUs,"
- " turning off TSC clock.\n", max_warp);
- mark_tsc_unstable("check_tsc_sync_source failed");
- } else {
- printk(" passed.\n");
- }
-
- /*
- * Reset it - just in case we boot another CPU later:
- */
- atomic_set(&start_count, 0);
- nr_warps = 0;
- max_warp = 0;
- last_tsc = 0;
-
- /*
- * Let the target continue with the bootup:
- */
- atomic_inc(&stop_count);
-}
-
-/*
- * Freshly booted CPUs call into this:
- */
-void __cpuinit check_tsc_sync_target(void)
-{
- int cpus = 2;
-
- if (unsynchronized_tsc())
- return;
-
- /*
- * Register this CPU's participation and wait for the
- * source CPU to start the measurement:
- */
- atomic_inc(&start_count);
- while (atomic_read(&start_count) != cpus)
- cpu_relax();
-
- check_tsc_warp();
-
- /*
- * Ok, we are done:
- */
- atomic_inc(&stop_count);
-
- /*
- * Wait for the source CPU to print stuff:
- */
- while (atomic_read(&stop_count) != cpus)
- cpu_relax();
-}
-#undef NR_LOOPS
-
Index: linux.trees.git/arch/x86/Kconfig
===================================================================
--- linux.trees.git.orig/arch/x86/Kconfig 2008-11-07 00:09:33.000000000 -0500
+++ linux.trees.git/arch/x86/Kconfig 2008-11-07 00:15:13.000000000 -0500
@@ -169,6 +169,7 @@ config X86_SMP
bool
depends on SMP && ((X86_32 && !X86_VOYAGER) || X86_64)
select USE_GENERIC_SMP_HELPERS
+ select HAVE_UNSYNCHRONIZED_TSC
default y
config X86_32_SMP
@@ -178,6 +179,7 @@ config X86_32_SMP
config X86_64_SMP
def_bool y
depends on X86_64 && SMP
+ select HAVE_UNSYNCHRONIZED_TSC
config X86_HT
bool
Index: linux.trees.git/arch/x86/include/asm/tsc.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/tsc.h 2008-11-07 00:09:33.000000000 -0500
+++ linux.trees.git/arch/x86/include/asm/tsc.h 2008-11-07 00:15:40.000000000 -0500
@@ -48,7 +48,7 @@ static __always_inline cycles_t vget_cyc
extern void tsc_init(void);
extern void mark_tsc_unstable(char *reason);
extern int unsynchronized_tsc(void);
-int check_tsc_unstable(void);
+extern int check_tsc_unstable(void);
static inline cycles_t get_cycles_rate(void)
{
@@ -71,4 +71,11 @@ extern void check_tsc_sync_target(void);
extern int notsc_setup(char *);
+extern int test_tsc_synchronization(void);
+extern int _tsc_is_sync;
+static inline int tsc_is_sync(void)
+{
+ return _tsc_is_sync;
+}
+
#endif /* _ASM_X86_TSC_H */
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* [RFC patch 14/18] MIPS use tsc_sync.c
2008-11-07 5:23 [RFC patch 00/18] Trace Clock v2 Mathieu Desnoyers
` (12 preceding siblings ...)
2008-11-07 5:23 ` [RFC patch 13/18] x86 : remove arch-specific tsc_sync.c Mathieu Desnoyers
@ 2008-11-07 5:23 ` Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 15/18] MIPS : export hpt frequency for trace_clock Mathieu Desnoyers
` (4 subsequent siblings)
18 siblings, 0 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Mathieu Desnoyers, Ralf Baechle
[-- Attachment #1: mips-use-tsc_sync.patch --]
[-- Type: text/plain, Size: 2120 bytes --]
tsc-sync.c is now available to test whether the TSC is synchronized across
cores. Since I currently don't have access to a MIPS board myself, help with
using it when CPUs go online and with testing the implementation would be
welcome.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Ralf Baechle <ralf@linux-mips.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
arch/mips/include/asm/timex.h | 26 ++++++++++++++++++++++++++
arch/mips/kernel/smp.c | 1 +
2 files changed, 27 insertions(+)
Index: linux.trees.git/arch/mips/kernel/smp.c
===================================================================
--- linux.trees.git.orig/arch/mips/kernel/smp.c 2008-11-07 00:06:06.000000000 -0500
+++ linux.trees.git/arch/mips/kernel/smp.c 2008-11-07 00:16:05.000000000 -0500
@@ -178,6 +178,7 @@ void __init smp_cpus_done(unsigned int m
{
mp_ops->cpus_done();
synchronise_count_master();
+ test_tsc_synchronization();
}
/* called from main before smp_init() */
Index: linux.trees.git/arch/mips/include/asm/timex.h
===================================================================
--- linux.trees.git.orig/arch/mips/include/asm/timex.h 2008-11-07 00:10:10.000000000 -0500
+++ linux.trees.git/arch/mips/include/asm/timex.h 2008-11-07 00:16:05.000000000 -0500
@@ -56,13 +56,39 @@ static inline cycles_t get_cycles_rate(v
{
return CLOCK_TICK_RATE;
}
+
+extern int test_tsc_synchronization(void);
+extern int _tsc_is_sync;
+static inline int tsc_is_sync(void)
+{
+ return _tsc_is_sync;
+}
#else
static inline cycles_t get_cycles(void)
{
return 0;
}
+static inline int test_tsc_synchronization(void)
+{
+ return 0;
+}
+static inline int tsc_is_sync(void)
+{
+ return 0;
+}
#endif
+#define DELAY_INTERRUPT 100
+/*
+ * Only updates 32 LSB.
+ */
+static inline void write_tsc(u32 val1, u32 val2)
+{
+ write_c0_count(val1);
+ /* Arrange for an interrupt in a short while */
+ write_c0_compare(read_c0_count() + DELAY_INTERRUPT);
+}
+
#endif /* __KERNEL__ */
#endif /* _ASM_TIMEX_H */
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* [RFC patch 15/18] MIPS : export hpt frequency for trace_clock.
2008-11-07 5:23 [RFC patch 00/18] Trace Clock v2 Mathieu Desnoyers
` (13 preceding siblings ...)
2008-11-07 5:23 ` [RFC patch 14/18] MIPS use tsc_sync.c Mathieu Desnoyers
@ 2008-11-07 5:23 ` Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 16/18] MIPS create empty sync_core() Mathieu Desnoyers
` (3 subsequent siblings)
18 siblings, 0 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Mathieu Desnoyers, Ralf Baechle
[-- Attachment #1: mips-export-hpt-frequency-for-trace-clock.patch --]
[-- Type: text/plain, Size: 1393 bytes --]
Trace_clock needs to export the hpt frequency to modules (e.g. LTTng).
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Ralf Baechle <ralf@linux-mips.org>
---
arch/mips/include/asm/timex.h | 2 ++
arch/mips/kernel/time.c | 1 +
2 files changed, 3 insertions(+)
Index: linux.trees.git/arch/mips/include/asm/timex.h
===================================================================
--- linux.trees.git.orig/arch/mips/include/asm/timex.h 2008-11-07 00:16:05.000000000 -0500
+++ linux.trees.git/arch/mips/include/asm/timex.h 2008-11-07 00:16:17.000000000 -0500
@@ -89,6 +89,8 @@ static inline void write_tsc(u32 val1, u
write_c0_compare(read_c0_count() + DELAY_INTERRUPT);
}
+extern unsigned int mips_hpt_frequency;
+
#endif /* __KERNEL__ */
#endif /* _ASM_TIMEX_H */
Index: linux.trees.git/arch/mips/kernel/time.c
===================================================================
--- linux.trees.git.orig/arch/mips/kernel/time.c 2008-07-19 09:18:07.000000000 -0400
+++ linux.trees.git/arch/mips/kernel/time.c 2008-11-07 00:16:17.000000000 -0500
@@ -70,6 +70,7 @@ EXPORT_SYMBOL(perf_irq);
*/
unsigned int mips_hpt_frequency;
+EXPORT_SYMBOL(mips_hpt_frequency);
void __init clocksource_set_clock(struct clocksource *cs, unsigned int clock)
{
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* [RFC patch 16/18] MIPS create empty sync_core()
2008-11-07 5:23 [RFC patch 00/18] Trace Clock v2 Mathieu Desnoyers
` (14 preceding siblings ...)
2008-11-07 5:23 ` [RFC patch 15/18] MIPS : export hpt frequency for trace_clock Mathieu Desnoyers
@ 2008-11-07 5:23 ` Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 17/18] MIPS : Trace clock Mathieu Desnoyers
` (2 subsequent siblings)
18 siblings, 0 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Mathieu Desnoyers, Ralf Baechle
[-- Attachment #1: mips-create-empty-sync_core.patch --]
[-- Type: text/plain, Size: 1009 bytes --]
Needed by the architecture-independent tsc-sync.c.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Ralf Baechle <ralf@linux-mips.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
arch/mips/include/asm/barrier.h | 6 ++++++
1 file changed, 6 insertions(+)
Index: linux.trees.git/arch/mips/include/asm/barrier.h
===================================================================
--- linux.trees.git.orig/arch/mips/include/asm/barrier.h 2008-10-30 20:22:49.000000000 -0400
+++ linux.trees.git/arch/mips/include/asm/barrier.h 2008-11-07 00:16:28.000000000 -0500
@@ -152,4 +152,10 @@
#define smp_llsc_rmb() __asm__ __volatile__(__WEAK_LLSC_MB : : :"memory")
#define smp_llsc_wmb() __asm__ __volatile__(__WEAK_LLSC_MB : : :"memory")
+/*
+ * MIPS does not have any instruction to serialize instruction execution on the
+ * core.
+ */
+#define sync_core()
+
#endif /* __ASM_BARRIER_H */
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* [RFC patch 17/18] MIPS : Trace clock
2008-11-07 5:23 [RFC patch 00/18] Trace Clock v2 Mathieu Desnoyers
` (15 preceding siblings ...)
2008-11-07 5:23 ` [RFC patch 16/18] MIPS create empty sync_core() Mathieu Desnoyers
@ 2008-11-07 5:23 ` Mathieu Desnoyers
2008-11-07 11:53 ` Peter Zijlstra
2008-11-07 5:23 ` [RFC patch 18/18] x86 trace clock Mathieu Desnoyers
2008-11-07 10:55 ` [RFC patch 08/18] cnt32_to_63 should use smp_rmb() David Howells
18 siblings, 1 reply; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Mathieu Desnoyers, Ralf Baechle
[-- Attachment #1: mips-trace-clock.patch --]
[-- Type: text/plain, Size: 10611 bytes --]
MIPS get_cycles() only returns a 32-bit TSC (see timex.h). The assumption there
is that a reschedule is done every 8 seconds or so. Given that tracing needs
to detect delays longer than 8 seconds, we need a full 64-bit TSC, which is
provided by trace-clock-32-to-64.
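(For instance, assuming a count register incrementing at roughly 500 MHz, an
illustrative figure only since the real rate is platform-dependent, the 32-bit
counter wraps after 2^32 / 500e6 ~= 8.6 seconds; that is where the 8 second
figure above comes from.)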
I leave the "depends on !CPU_R4400_WORKAROUNDS" in Kconfig because the solution
proposed by Ralf to deal with the R4400 bug is racy, so let's just not support
this broken architecture. :(
Note for Peter Zijlstra :
You should probably have a look at lockdep.c raw_spinlock_t lockdep_lock usage.
I suspect it may be used with preemption enabled in graph_lock(). (Not sure,
but it's worth double-checking.)
This patch uses the same cache-line bouncing algorithm used for x86. This is a
best-effort to support architectures lacking synchronized TSC without adding a
lot of complexity too soon. This keeps room for improvement in a second phase.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Ralf Baechle <ralf@linux-mips.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
arch/mips/Kconfig | 3
arch/mips/include/asm/timex.h | 17 +++
arch/mips/include/asm/trace-clock.h | 68 ++++++++++++++
arch/mips/kernel/Makefile | 2
arch/mips/kernel/trace-clock.c | 172 ++++++++++++++++++++++++++++++++++++
5 files changed, 261 insertions(+), 1 deletion(-)
Index: linux.trees.git/arch/mips/Kconfig
===================================================================
--- linux.trees.git.orig/arch/mips/Kconfig 2008-11-07 00:10:10.000000000 -0500
+++ linux.trees.git/arch/mips/Kconfig 2008-11-07 00:16:42.000000000 -0500
@@ -1614,6 +1614,9 @@ config CPU_R4400_WORKAROUNDS
config HAVE_GET_CYCLES_32
def_bool y
depends on !CPU_R4400_WORKAROUNDS
+ select HAVE_TRACE_CLOCK
+ select HAVE_TRACE_CLOCK_32_TO_64
+ select HAVE_UNSYNCHRONIZED_TSC
#
# Use the generic interrupt handling code in kernel/irq/:
Index: linux.trees.git/arch/mips/include/asm/trace-clock.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/arch/mips/include/asm/trace-clock.h 2008-11-07 00:16:42.000000000 -0500
@@ -0,0 +1,68 @@
+/*
+ * Copyright (C) 2005,2008 Mathieu Desnoyers
+ *
+ * Trace clock MIPS definitions.
+ */
+
+#ifndef _ASM_MIPS_TRACE_CLOCK_H
+#define _ASM_MIPS_TRACE_CLOCK_H
+
+#include <linux/timex.h>
+#include <asm/processor.h>
+
+#define TRACE_CLOCK_MIN_PROBE_DURATION 200
+
+extern u64 trace_clock_read_synthetic_tsc(void);
+
+/*
+ * MIPS get_cycles() only returns a 32-bit TSC (see timex.h). The assumption
+ * there is that the reschedule is done every 8 seconds or so. Given that
+ * tracing needs to detect delays longer than 8 seconds, we need a full 64-bit
+ * TSC, which is provided by trace-clock-32-to-64.
+ */
+extern u64 trace_clock_async_tsc_read(void);
+
+static inline u32 trace_clock_read32(void)
+{
+ u32 cycles;
+
+ if (likely(tsc_is_sync()))
+ cycles = (u32)get_cycles(); /* only need the 32 LSB */
+ else
+ cycles = (u32)trace_clock_async_tsc_read();
+ return cycles;
+}
+
+static inline u64 trace_clock_read64(void)
+{
+ u64 cycles;
+
+ if (likely(tsc_is_sync()))
+ cycles = trace_clock_read_synthetic_tsc();
+ else
+ cycles = trace_clock_async_tsc_read();
+ return cycles;
+}
+
+static inline void trace_clock_add_timestamp(unsigned long ticks)
+{ }
+
+static inline unsigned int trace_clock_frequency(void)
+{
+ return mips_hpt_frequency;
+}
+
+static inline u32 trace_clock_freq_scale(void)
+{
+ return 1;
+}
+
+extern void get_trace_clock(void);
+extern void put_trace_clock(void);
+extern void get_synthetic_tsc(void);
+extern void put_synthetic_tsc(void);
+
+static inline void set_trace_clock_is_sync(int state)
+{
+}
+#endif /* _ASM_MIPS_TRACE_CLOCK_H */
Index: linux.trees.git/arch/mips/kernel/trace-clock.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/arch/mips/kernel/trace-clock.c 2008-11-07 00:16:42.000000000 -0500
@@ -0,0 +1,172 @@
+/*
+ * arch/mips/kernel/trace-clock.c
+ *
+ * Trace clock for mips.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>, October 2008
+ */
+
+#include <linux/module.h>
+#include <linux/trace-clock.h>
+#include <linux/jiffies.h>
+#include <linux/mutex.h>
+#include <linux/timer.h>
+#include <linux/spinlock.h>
+
+static u64 trace_clock_last_tsc;
+static DEFINE_PER_CPU(struct timer_list, update_timer);
+static DEFINE_MUTEX(async_tsc_mutex);
+static int async_tsc_refcount; /* Number of readers */
+static int async_tsc_enabled; /* Async TSC enabled on all online CPUs */
+
+/*
+ * Support for architectures with non-sync TSCs.
+ * When the local TSC is discovered to lag behind the highest TSC counter, we
+ * increment the TSC count of an amount that should be, ideally, lower than the
+ * execution time of this routine, in cycles : this is the granularity we look
+ * for : we must be able to order the events.
+ */
+
+#if BITS_PER_LONG == 64
+notrace u64 trace_clock_async_tsc_read(void)
+{
+ u64 new_tsc, last_tsc;
+
+ WARN_ON(!async_tsc_refcount || !async_tsc_enabled);
+ new_tsc = trace_clock_read_synthetic_tsc();
+ do {
+ last_tsc = trace_clock_last_tsc;
+ if (new_tsc < last_tsc)
+ new_tsc = last_tsc + TRACE_CLOCK_MIN_PROBE_DURATION;
+ /*
+ * If cmpxchg fails with a value higher than the new_tsc, don't
+ * retry : the value has been incremented and the events
+ * happened almost at the same time.
+ * We must retry if cmpxchg fails with a lower value :
+ * it means that we are the CPU with highest frequency and
+ * therefore MUST update the value.
+ */
+ } while (cmpxchg64(&trace_clock_last_tsc, last_tsc, new_tsc) < new_tsc);
+ return new_tsc;
+}
+EXPORT_SYMBOL_GPL(trace_clock_async_tsc_read);
+#else
+/*
+ * Emulate an atomic 64-bits update with a spinlock.
+ * Note : preempt_disable or irq save must be explicit with raw_spinlock_t.
+ * Given we use a spinlock for this time base, we should never be called from
+ * NMI context.
+ */
+static raw_spinlock_t trace_clock_lock =
+ (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
+
+static inline u64 trace_clock_cmpxchg64(u64 *ptr, u64 old, u64 new)
+{
+ u64 val;
+
+ val = *ptr;
+ if (likely(val == old))
+ *ptr = val = new;
+ return val;
+}
+
+notrace u64 trace_clock_async_tsc_read(void)
+{
+ u64 new_tsc, last_tsc;
+ unsigned long flags;
+
+ WARN_ON(!async_tsc_refcount || !async_tsc_enabled);
+ local_irq_save(flags);
+ __raw_spin_lock(&trace_clock_lock);
+ new_tsc = trace_clock_read_synthetic_tsc();
+ do {
+ last_tsc = trace_clock_last_tsc;
+ if (new_tsc < last_tsc)
+ new_tsc = last_tsc + TRACE_CLOCK_MIN_PROBE_DURATION;
+ /*
+ * If cmpxchg fails with a value higher than the new_tsc, don't
+ * retry : the value has been incremented and the events
+ * happened almost at the same time.
+ * We must retry if cmpxchg fails with a lower value :
+ * it means that we are the CPU with highest frequency and
+ * therefore MUST update the value.
+ */
+ } while (trace_clock_cmpxchg64(&trace_clock_last_tsc, last_tsc,
+ new_tsc) < new_tsc);
+ __raw_spin_unlock(&trace_clock_lock);
+ local_irq_restore(flags);
+ return new_tsc;
+}
+EXPORT_SYMBOL_GPL(trace_clock_async_tsc_read);
+#endif
+
+
+static void update_timer_ipi(void *info)
+{
+ (void)trace_clock_async_tsc_read();
+}
+
+/*
+ * update_timer_fct : - Timer function to resync the clocks
+ * @data: unused
+ *
+ * Fires every jiffy.
+ */
+static void update_timer_fct(unsigned long data)
+{
+ (void)trace_clock_async_tsc_read();
+
+ per_cpu(update_timer, smp_processor_id()).expires = jiffies + 1;
+ add_timer_on(&per_cpu(update_timer, smp_processor_id()),
+ smp_processor_id());
+}
+
+static void enable_trace_clock(int cpu)
+{
+ init_timer(&per_cpu(update_timer, cpu));
+ per_cpu(update_timer, cpu).function = update_timer_fct;
+ per_cpu(update_timer, cpu).expires = jiffies + 1;
+ smp_call_function_single(cpu, update_timer_ipi, NULL, 1);
+ add_timer_on(&per_cpu(update_timer, cpu), cpu);
+}
+
+static void disable_trace_clock(int cpu)
+{
+ del_timer_sync(&per_cpu(update_timer, cpu));
+}
+
+void get_trace_clock(void)
+{
+ int cpu;
+
+ mutex_lock(&async_tsc_mutex);
+ if (async_tsc_refcount++ || tsc_is_sync())
+ goto end;
+
+ async_tsc_enabled = 1;
+ for_each_online_cpu(cpu)
+ enable_trace_clock(cpu);
+end:
+ mutex_unlock(&async_tsc_mutex);
+ get_synthetic_tsc();
+}
+EXPORT_SYMBOL_GPL(get_trace_clock);
+
+void put_trace_clock(void)
+{
+ int cpu;
+
+ put_synthetic_tsc();
+ mutex_lock(&async_tsc_mutex);
+ WARN_ON(async_tsc_refcount <= 0);
+ if (async_tsc_refcount != 1 || !async_tsc_enabled)
+ goto end;
+
+ for_each_online_cpu(cpu)
+ disable_trace_clock(cpu);
+ async_tsc_enabled = 0;
+end:
+ async_tsc_refcount--;
+ mutex_unlock(&async_tsc_mutex);
+}
+EXPORT_SYMBOL_GPL(put_trace_clock);
Index: linux.trees.git/arch/mips/include/asm/timex.h
===================================================================
--- linux.trees.git.orig/arch/mips/include/asm/timex.h 2008-11-07 00:16:17.000000000 -0500
+++ linux.trees.git/arch/mips/include/asm/timex.h 2008-11-07 00:16:42.000000000 -0500
@@ -42,7 +42,7 @@
typedef unsigned int cycles_t;
-#ifdef HAVE_GET_CYCLES_32
+#ifdef CONFIG_HAVE_GET_CYCLES_32
static inline cycles_t get_cycles(void)
{
return read_c0_count();
@@ -91,6 +91,21 @@ static inline void write_tsc(u32 val1, u
extern unsigned int mips_hpt_frequency;
+/*
+ * Currently unused, should update internal tsc-related timekeeping sources.
+ */
+static inline void mark_tsc_unstable(char *reason)
+{
+}
+
+/*
+ * Currently simply use the tsc_is_sync value.
+ */
+static inline int unsynchronized_tsc(void)
+{
+ return !tsc_is_sync();
+}
+
#endif /* __KERNEL__ */
#endif /* _ASM_TIMEX_H */
Index: linux.trees.git/arch/mips/kernel/Makefile
===================================================================
--- linux.trees.git.orig/arch/mips/kernel/Makefile 2008-10-30 20:22:50.000000000 -0400
+++ linux.trees.git/arch/mips/kernel/Makefile 2008-11-07 00:16:42.000000000 -0500
@@ -85,6 +85,8 @@ obj-$(CONFIG_GPIO_TXX9) += gpio_txx9.o
obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o
obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
+obj-$(CONFIG_HAVE_GET_CYCLES_32) += trace-clock.o
+
CFLAGS_cpu-bugs64.o = $(shell if $(CC) $(KBUILD_CFLAGS) -Wa,-mdaddi -c -o /dev/null -xc /dev/null >/dev/null 2>&1; then echo "-DHAVE_AS_SET_DADDI"; fi)
obj-$(CONFIG_HAVE_STD_PC_SERIAL_PORT) += 8250-platform.o
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* [RFC patch 18/18] x86 trace clock
2008-11-07 5:23 [RFC patch 00/18] Trace Clock v2 Mathieu Desnoyers
` (16 preceding siblings ...)
2008-11-07 5:23 ` [RFC patch 17/18] MIPS : Trace clock Mathieu Desnoyers
@ 2008-11-07 5:23 ` Mathieu Desnoyers
2008-11-07 10:55 ` [RFC patch 08/18] cnt32_to_63 should use smp_rmb() David Howells
18 siblings, 0 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 5:23 UTC (permalink / raw)
To: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Mathieu Desnoyers, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
Steven Rostedt
[-- Attachment #1: x86-trace-clock.patch --]
[-- Type: text/plain, Size: 11306 bytes --]
X86 trace clock. Depends on tsc_sync to detect if timestamp counters are
synchronized on the machine.
I am leaving this poorly scalable solution in place for now because it is the
simplest working solution I found (compared to using the HPET, which also
scales very poorly, probably due to bus contention). This should be a good
start and lets us trace a good number of the machines out there.
A "Big Fat" (TM) warning is shown on the console when the trace clock is used on
systems without synchronized TSCs to tell the user to
- use force_tsc_sync=1
- use idle=poll
- disable Powernow or Speedstep
In order to get accurate and fast timestamps.
This keeps room for further improvement in a second phase.
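As an illustration of how a tracer is expected to consume this clock, a
hypothetical module could do something like the following sketch (the
"example_tracer" names are made up, and linux/trace-clock.h is the header
introduced earlier in this patchset):

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/trace-clock.h>

static int __init example_tracer_init(void)
{
	u64 t;

	/* Take a reference : the unsync-TSC workaround runs only while held. */
	get_trace_clock();
	/* 64-bit timestamp, ordered across CPUs even with unsynchronized TSCs. */
	t = trace_clock_read64();
	printk(KERN_INFO "example tracer: first timestamp %llu\n",
	       (unsigned long long)t);
	return 0;
}

static void __exit example_tracer_exit(void)
{
	/* Drop the reference : per-cpu resync timers stop when unused. */
	put_trace_clock();
}

module_init(example_tracer_init);
module_exit(example_tracer_exit);
MODULE_LICENSE("GPL");

This way the cache-line bouncing cost described above is only paid while a
tracer actually holds the clock.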
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Steven Rostedt <rostedt@goodmis.org>
---
arch/x86/Kconfig | 1
arch/x86/include/asm/trace-clock.h | 73 ++++++++++
arch/x86/kernel/Makefile | 1
arch/x86/kernel/trace-clock.c | 248 +++++++++++++++++++++++++++++++++++++
4 files changed, 323 insertions(+)
Index: linux.trees.git/arch/x86/Kconfig
===================================================================
--- linux.trees.git.orig/arch/x86/Kconfig 2008-11-07 00:15:13.000000000 -0500
+++ linux.trees.git/arch/x86/Kconfig 2008-11-07 00:17:17.000000000 -0500
@@ -30,6 +30,7 @@ config X86
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_DYNAMIC_FTRACE
select HAVE_FUNCTION_TRACER
+ select HAVE_TRACE_CLOCK
select HAVE_KVM if ((X86_32 && !X86_VOYAGER && !X86_VISWS && !X86_NUMAQ) || X86_64)
select HAVE_ARCH_KGDB if !X86_VOYAGER
select HAVE_ARCH_TRACEHOOK
Index: linux.trees.git/arch/x86/kernel/Makefile
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/Makefile 2008-11-07 00:15:13.000000000 -0500
+++ linux.trees.git/arch/x86/kernel/Makefile 2008-11-07 00:16:57.000000000 -0500
@@ -36,6 +36,7 @@ obj-y += bootflag.o e820.o
obj-y += pci-dma.o quirks.o i8237.o topology.o kdebugfs.o
obj-y += alternative.o i8253.o pci-nommu.o
obj-y += tsc.o io_delay.o rtc.o
+obj-y += trace-clock.o
obj-$(CONFIG_X86_TRAMPOLINE) += trampoline.o
obj-y += process.o
Index: linux.trees.git/arch/x86/kernel/trace-clock.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/arch/x86/kernel/trace-clock.c 2008-11-07 00:16:57.000000000 -0500
@@ -0,0 +1,248 @@
+/*
+ * arch/x86/kernel/trace-clock.c
+ *
+ * Trace clock for x86.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>, October 2008
+ */
+
+#include <linux/module.h>
+#include <linux/trace-clock.h>
+#include <linux/jiffies.h>
+#include <linux/mutex.h>
+#include <linux/timer.h>
+#include <linux/cpu.h>
+
+static cycles_t trace_clock_last_tsc;
+static DEFINE_PER_CPU(struct timer_list, update_timer);
+static DEFINE_MUTEX(async_tsc_mutex);
+static int async_tsc_refcount; /* Number of readers */
+static int async_tsc_enabled; /* Async TSC enabled on all online CPUs */
+
+int _trace_clock_is_sync = 1;
+EXPORT_SYMBOL_GPL(_trace_clock_is_sync);
+
+/*
+ * Called by check_tsc_sync_source from CPU hotplug.
+ */
+void set_trace_clock_is_sync(int state)
+{
+ _trace_clock_is_sync = state;
+}
+
+#if BITS_PER_LONG == 64
+static cycles_t read_last_tsc(void)
+{
+ return trace_clock_last_tsc;
+}
+#else
+/*
+ * A cmpxchg64 update can happen concurrently. Based on the assumption that
+ * two cmpxchg64 will never update it to the same value (the count always
+ * increases), reading it twice ensures that we read a coherent value with the
+ * same "sequence number".
+ */
+static cycles_t read_last_tsc(void)
+{
+ cycles_t val1, val2;
+
+ val1 = trace_clock_last_tsc;
+ for (;;) {
+ val2 = val1;
+ barrier();
+ val1 = trace_clock_last_tsc;
+ if (likely(val1 == val2))
+ break;
+ }
+ return val1;
+}
+#endif
+
+/*
+ * Support for architectures with non-sync TSCs.
+ * When the local TSC is discovered to lag behind the highest TSC counter, we
+ * increment the TSC count of an amount that should be, ideally, lower than the
+ * execution time of this routine, in cycles : this is the granularity we look
+ * for : we must be able to order the events.
+ */
+notrace cycles_t trace_clock_async_tsc_read(void)
+{
+ cycles_t new_tsc, last_tsc;
+
+ WARN_ON(!async_tsc_refcount || !async_tsc_enabled);
+ rdtsc_barrier();
+ new_tsc = get_cycles();
+ rdtsc_barrier();
+ last_tsc = read_last_tsc();
+ do {
+ if (new_tsc < last_tsc)
+ new_tsc = last_tsc + TRACE_CLOCK_MIN_PROBE_DURATION;
+ /*
+ * If cmpxchg fails with a value higher than the new_tsc, don't
+ * retry : the value has been incremented and the events
+ * happened almost at the same time.
+ * We must retry if cmpxchg fails with a lower value :
+ * it means that we are the CPU with highest frequency and
+ * therefore MUST update the value.
+ */
+ last_tsc = cmpxchg64(&trace_clock_last_tsc, last_tsc, new_tsc);
+ } while (unlikely(last_tsc < new_tsc));
+ return new_tsc;
+}
+EXPORT_SYMBOL_GPL(trace_clock_async_tsc_read);
+
+static void update_timer_ipi(void *info)
+{
+ (void)trace_clock_async_tsc_read();
+}
+
+/*
+ * update_timer_fct : - Timer function to resync the clocks
+ * @data: unused
+ *
+ * Fires every jiffy.
+ */
+static void update_timer_fct(unsigned long data)
+{
+ (void)trace_clock_async_tsc_read();
+
+ per_cpu(update_timer, smp_processor_id()).expires = jiffies + 1;
+ add_timer_on(&per_cpu(update_timer, smp_processor_id()),
+ smp_processor_id());
+}
+
+static void enable_trace_clock(int cpu)
+{
+ init_timer(&per_cpu(update_timer, cpu));
+ per_cpu(update_timer, cpu).function = update_timer_fct;
+ per_cpu(update_timer, cpu).expires = jiffies + 1;
+ smp_call_function_single(cpu, update_timer_ipi, NULL, 1);
+ add_timer_on(&per_cpu(update_timer, cpu), cpu);
+}
+
+static void disable_trace_clock(int cpu)
+{
+ del_timer_sync(&per_cpu(update_timer, cpu));
+}
+
+/*
+ * hotcpu_callback - CPU hotplug callback
+ * @nb: notifier block
+ * @action: hotplug action to take
+ * @hcpu: CPU number
+ *
+ * Returns the success/failure of the operation. (NOTIFY_OK, NOTIFY_BAD)
+ */
+static int __cpuinit hotcpu_callback(struct notifier_block *nb,
+ unsigned long action,
+ void *hcpu)
+{
+ unsigned int hotcpu = (unsigned long)hcpu;
+ int cpu;
+
+ mutex_lock(&async_tsc_mutex);
+ switch (action) {
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ break;
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ /*
+ * trace_clock_is_sync() is updated by set_trace_clock_is_sync()
+ * code, protected by cpu hotplug disable.
+ * It is ok to let the hotplugged CPU read the timebase before
+ * the CPU_ONLINE notification. It's just there to give a
+ * maximum bound to the TSC error.
+ */
+ if (async_tsc_refcount && !trace_clock_is_sync()) {
+ if (!async_tsc_enabled) {
+ async_tsc_enabled = 1;
+ for_each_online_cpu(cpu)
+ enable_trace_clock(cpu);
+ } else {
+ enable_trace_clock(hotcpu);
+ }
+ }
+ break;
+#ifdef CONFIG_HOTPLUG_CPU
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ if (!async_tsc_refcount && num_online_cpus() == 1)
+ set_trace_clock_is_sync(1);
+ break;
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ /*
+ * We cannot stop the trace clock on other CPUs when readers are
+ * active even if we go back to a synchronized state (1 CPU)
+ * because the CPU left could be the one lagging behind.
+ */
+ if (async_tsc_refcount && async_tsc_enabled)
+ disable_trace_clock(hotcpu);
+ if (!async_tsc_refcount && num_online_cpus() == 1)
+ set_trace_clock_is_sync(1);
+ break;
+#endif /* CONFIG_HOTPLUG_CPU */
+ }
+ mutex_unlock(&async_tsc_mutex);
+
+ return NOTIFY_OK;
+}
+
+void get_trace_clock(void)
+{
+ int cpu;
+
+ if (!trace_clock_is_sync()) {
+ printk(KERN_WARNING
+ "Trace clock falls back on cache-line bouncing\n"
+ "workaround due to non-synchronized TSCs.\n"
+ "This workaround preserves event order across CPUs.\n"
+ "Please consider disabling Speedstep or PowerNow and\n"
+ "using kernel parameters "
+ "\"force_tsc_sync=1 idle=poll\"\n"
+ "for accurate and fast tracing clock source.\n");
+ }
+
+ get_online_cpus();
+ mutex_lock(&async_tsc_mutex);
+ if (async_tsc_refcount++ || trace_clock_is_sync())
+ goto end;
+
+ async_tsc_enabled = 1;
+ for_each_online_cpu(cpu)
+ enable_trace_clock(cpu);
+end:
+ mutex_unlock(&async_tsc_mutex);
+ put_online_cpus();
+}
+EXPORT_SYMBOL_GPL(get_trace_clock);
+
+void put_trace_clock(void)
+{
+ int cpu;
+
+ get_online_cpus();
+ mutex_lock(&async_tsc_mutex);
+ WARN_ON(async_tsc_refcount <= 0);
+ if (async_tsc_refcount != 1 || !async_tsc_enabled)
+ goto end;
+
+ for_each_online_cpu(cpu)
+ disable_trace_clock(cpu);
+ async_tsc_enabled = 0;
+end:
+ async_tsc_refcount--;
+ if (!async_tsc_refcount && num_online_cpus() == 1)
+ set_trace_clock_is_sync(1);
+ mutex_unlock(&async_tsc_mutex);
+ put_online_cpus();
+}
+EXPORT_SYMBOL_GPL(put_trace_clock);
+
+static __init int init_unsync_trace_clock(void)
+{
+ hotcpu_notifier(hotcpu_callback, 4);
+ return 0;
+}
+early_initcall(init_unsync_trace_clock);
Index: linux.trees.git/arch/x86/include/asm/trace-clock.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/arch/x86/include/asm/trace-clock.h 2008-11-07 00:16:57.000000000 -0500
@@ -0,0 +1,73 @@
+#ifndef _ASM_X86_TRACE_CLOCK_H
+#define _ASM_X86_TRACE_CLOCK_H
+
+/*
+ * linux/arch/x86/include/asm/trace-clock.h
+ *
+ * Copyright (C) 2005,2006,2008
+ * Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca)
+ *
+ * Trace clock definitions for x86.
+ */
+
+#include <linux/timex.h>
+#include <asm/system.h>
+#include <asm/processor.h>
+#include <asm/atomic.h>
+
+/* Minimum duration of a probe, in cycles */
+#define TRACE_CLOCK_MIN_PROBE_DURATION 200
+
+extern cycles_t trace_clock_async_tsc_read(void);
+
+extern int _trace_clock_is_sync;
+static inline int trace_clock_is_sync(void)
+{
+ return _trace_clock_is_sync;
+}
+
+static inline u32 trace_clock_read32(void)
+{
+ u32 cycles;
+
+ if (likely(trace_clock_is_sync())) {
+ get_cycles_barrier();
+ cycles = (u32)get_cycles(); /* only need the 32 LSB */
+ get_cycles_barrier();
+ } else
+ cycles = (u32)trace_clock_async_tsc_read();
+ return cycles;
+}
+
+static inline u64 trace_clock_read64(void)
+{
+ u64 cycles;
+
+ if (likely(trace_clock_is_sync())) {
+ get_cycles_barrier();
+ cycles = get_cycles();
+ get_cycles_barrier();
+ } else
+ cycles = trace_clock_async_tsc_read();
+ return cycles;
+}
+
+static inline void trace_clock_add_timestamp(unsigned long ticks)
+{ }
+
+static inline unsigned int trace_clock_frequency(void)
+{
+ return cpu_khz;
+}
+
+static inline u32 trace_clock_freq_scale(void)
+{
+ return 1000;
+}
+
+extern void get_trace_clock(void);
+extern void put_trace_clock(void);
+
+extern void set_trace_clock_is_sync(int state);
+
+#endif /* _ASM_X86_TRACE_CLOCK_H */
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 10/18] Sparc64 : Trace clock
2008-11-07 5:23 ` [RFC patch 10/18] Sparc64 " Mathieu Desnoyers
@ 2008-11-07 5:45 ` David Miller
0 siblings, 0 replies; 118+ messages in thread
From: David Miller @ 2008-11-07 5:45 UTC (permalink / raw)
To: mathieu.desnoyers
Cc: torvalds, akpm, mingo, a.p.zijlstra, linux-kernel, linux-arch
From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Date: Fri, 07 Nov 2008 00:23:46 -0500
> Implement sparc64 trace clock.
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> CC: David Miller <davem@davemloft.net>
> CC: linux-arch@vger.kernel.org
Acked-by: David S. Miller <davem@davemloft.net>
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 07/18] Trace clock core
2008-11-07 5:23 ` [RFC patch 07/18] Trace clock core Mathieu Desnoyers
@ 2008-11-07 5:52 ` Andrew Morton
2008-11-07 6:16 ` Mathieu Desnoyers
0 siblings, 1 reply; 118+ messages in thread
From: Andrew Morton @ 2008-11-07 5:52 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
Nicolas Pitre, Ralf Baechle, benh, paulus, David Miller,
Ingo Molnar, Thomas Gleixner, Steven Rostedt, linux-arch
On Fri, 07 Nov 2008 00:23:43 -0500 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 32 to 64 bits clock extension. Extracts 64 bits tsc from a [1..32]
> bits counter, kept up to date by periodical timer interrupt. Lockless.
>
> ...
>
> +#include <linux/sched.h> /* FIX for m68k local_irq_enable in on_each_cpu */
What's going on here?
> +struct synthetic_tsc_struct {
> + union {
> + u64 val;
> + struct {
> +#ifdef __BIG_ENDIAN
> + u32 msb;
> + u32 lsb;
> +#else
> + u32 lsb;
> + u32 msb;
> +#endif
One would expect an identifier called "msb" to mean "most significant bit" or
possibly "most significant byte".
Maybe ms32 and ls32?
> + } sel;
> + } tsc[2];
> + unsigned int index; /* Index of the current synth. tsc. */
> +};
> +
> +static DEFINE_PER_CPU(struct synthetic_tsc_struct, synthetic_tsc);
> +
> +/* Called from IPI : either in interrupt or process context */
IPI handlers should always be called with local interrupts disabled.
> +static void update_synthetic_tsc(void)
> +{
> + struct synthetic_tsc_struct *cpu_synth;
> + u32 tsc;
> +
> + preempt_disable();
which would make this unnecessary.
> + cpu_synth = &per_cpu(synthetic_tsc, smp_processor_id());
> + tsc = trace_clock_read32(); /* Hardware clocksource read */
> +
> + if (tsc < HW_LSB(cpu_synth->tsc[cpu_synth->index].sel.lsb)) {
> + unsigned int new_index = 1 - cpu_synth->index; /* 0 <-> 1 */
> + /*
> + * Overflow
> + * Non atomic update of the non current synthetic TSC, followed
> + * by an atomic index change. There is no write concurrency,
> + * so the index read/write does not need to be atomic.
> + */
> + cpu_synth->tsc[new_index].val =
> + (SW_MSB(cpu_synth->tsc[cpu_synth->index].val)
> + | (u64)tsc) + (1ULL << HW_BITS);
> + cpu_synth->index = new_index; /* atomic change of index */
> + } else {
> + /*
> + * No overflow : We know that the only bits changed are
> + * contained in the 32 LSBs, which can be written to atomically.
> + */
> + cpu_synth->tsc[cpu_synth->index].sel.lsb =
> + SW_MSB(cpu_synth->tsc[cpu_synth->index].sel.lsb) | tsc;
> + }
> + preempt_enable();
> +}
Is there something we should be fixing in m68k?
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 5:23 ` [RFC patch 08/18] cnt32_to_63 should use smp_rmb() Mathieu Desnoyers
@ 2008-11-07 6:05 ` Andrew Morton
2008-11-07 8:12 ` Nicolas Pitre
2008-11-07 11:03 ` David Howells
2008-11-07 10:59 ` David Howells
1 sibling, 2 replies; 118+ messages in thread
From: Andrew Morton @ 2008-11-07 6:05 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
Nicolas Pitre, Ralf Baechle, benh, paulus, David Miller,
Ingo Molnar, Thomas Gleixner, Steven Rostedt, linux-arch
On Fri, 07 Nov 2008 00:23:44 -0500 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> #define cnt32_to_63(cnt_lo) \
> ({ \
> - static volatile u32 __m_cnt_hi; \
> + static u32 __m_cnt_hi; \
> union cnt32_to_63 __x; \
> __x.hi = __m_cnt_hi; \
> + smp_rmb(); /* read __m_cnt_hi before mmio cnt_lo */ \
> __x.lo = (cnt_lo); \
> if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
> __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
Oh dear. We have a macro which secretly maintains
per-instantiation-site global state? And doesn't even implement locking
to protect that state?
I mean, the darned thing is called from sched_clock(), which can be
concurrently called on separate CPUs and which can be called from
interrupt context (with an arbitrary nesting level!) while it was running
in process context.
Who let that thing into Linux?
Look:
/*
* Caller must provide locking to protect *caller_state
*/
u32 cnt32_to_63(u32 *caller_state, u32 cnt_lo);
But even that looks pretty crappy.
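For illustration, the suggested shape could look roughly like this (a sketch
only; the _stateful name is made up, the existing union from
linux/cnt32_to_63.h is reused, and the value is returned as u64; whether the
caller provides a lock or the ordering is done lock-free is exactly the point
under discussion):

#include <linux/cnt32_to_63.h>

static inline u64 cnt32_to_63_stateful(u32 *state_hi, u32 cnt_lo)
{
	union cnt32_to_63 x;

	/* Caller owns *state_hi and serializes accesses to it. */
	x.hi = *state_hi;
	x.lo = cnt_lo;
	if (unlikely((s32)(x.hi ^ x.lo) < 0))
		*state_hi = x.hi = (x.hi ^ 0x80000000) + (x.hi >> 31);
	return x.val;
}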
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 07/18] Trace clock core
2008-11-07 5:52 ` Andrew Morton
@ 2008-11-07 6:16 ` Mathieu Desnoyers
2008-11-07 6:26 ` Andrew Morton
0 siblings, 1 reply; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 6:16 UTC (permalink / raw)
To: Andrew Morton
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
Nicolas Pitre, Ralf Baechle, benh, paulus, David Miller,
Ingo Molnar, Thomas Gleixner, Steven Rostedt, linux-arch
* Andrew Morton (akpm@linux-foundation.org) wrote:
> On Fri, 07 Nov 2008 00:23:43 -0500 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
>
> > 32 to 64 bits clock extension. Extracts 64 bits tsc from a [1..32]
> > bits counter, kept up to date by periodical timer interrupt. Lockless.
> >
> > ...
> >
> > +#include <linux/sched.h> /* FIX for m68k local_irq_enable in on_each_cpu */
>
> What's going on here?
>
Hi Andrew,
When I wrote this comment (kernel ~2.6.25), the situation was (and it
still looks valid, although I haven't tried to fix it since then) :
linux/smp.h :
on_each_cpu() (!SMP)
local_irq_enable()
but, on m68k :
asm-m68k/system.h defines this ugly macro :
/* interrupt control.. */
#if 0
#define local_irq_enable() asm volatile ("andiw %0,%%sr": : "i" (ALLOWINT) : "memory")
#else
#include <linux/hardirq.h>
#define local_irq_enable() ({ \
if (MACH_IS_Q40 || !hardirq_count()) \
asm volatile ("andiw %0,%%sr": : "i" (ALLOWINT) : "memory"); \
})
#endif
Which uses !hardirq_count(), which is defined by sched.h. However, I did
try in the past to include sched.h in asm-m68k/system.h, but it ended up
doing a recursive inclusion.
> > +struct synthetic_tsc_struct {
> > + union {
> > + u64 val;
> > + struct {
> > +#ifdef __BIG_ENDIAN
> > + u32 msb;
> > + u32 lsb;
> > +#else
> > + u32 lsb;
> > + u32 msb;
> > +#endif
>
> One would expect an identifier called "msb" to mean "most significant
> bit" or possible "most significant byte".
>
> Maybe ms32 and ls32?
>
Yep, seems clearer.
> > + } sel;
> > + } tsc[2];
> > + unsigned int index; /* Index of the current synth. tsc. */
> > +};
> > +
> > +static DEFINE_PER_CPU(struct synthetic_tsc_struct, synthetic_tsc);
> > +
> > +/* Called from IPI : either in interrupt or process context */
>
> IPI handlers should always be called with local interrupts disabled.
>
> > +static void update_synthetic_tsc(void)
> > +{
> > + struct synthetic_tsc_struct *cpu_synth;
> > + u32 tsc;
> > +
> > + preempt_disable();
>
> which would make this unnecessary.
>
Ah, yes, right.
> > + cpu_synth = &per_cpu(synthetic_tsc, smp_processor_id());
> > + tsc = trace_clock_read32(); /* Hardware clocksource read */
> > +
> > + if (tsc < HW_LSB(cpu_synth->tsc[cpu_synth->index].sel.lsb)) {
> > + unsigned int new_index = 1 - cpu_synth->index; /* 0 <-> 1 */
> > + /*
> > + * Overflow
> > + * Non atomic update of the non current synthetic TSC, followed
> > + * by an atomic index change. There is no write concurrency,
> > + * so the index read/write does not need to be atomic.
> > + */
> > + cpu_synth->tsc[new_index].val =
> > + (SW_MSB(cpu_synth->tsc[cpu_synth->index].val)
> > + | (u64)tsc) + (1ULL << HW_BITS);
> > + cpu_synth->index = new_index; /* atomic change of index */
> > + } else {
> > + /*
> > + * No overflow : We know that the only bits changed are
> > + * contained in the 32 LSBs, which can be written to atomically.
> > + */
> > + cpu_synth->tsc[cpu_synth->index].sel.lsb =
> > + SW_MSB(cpu_synth->tsc[cpu_synth->index].sel.lsb) | tsc;
> > + }
> > + preempt_enable();
> > +}
>
> Is there something we should be fixing in m68k?
>
Yes, but I fear it's going to go deep into include hell :-(
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 07/18] Trace clock core
2008-11-07 6:16 ` Mathieu Desnoyers
@ 2008-11-07 6:26 ` Andrew Morton
2008-11-07 16:12 ` Mathieu Desnoyers
0 siblings, 1 reply; 118+ messages in thread
From: Andrew Morton @ 2008-11-07 6:26 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
Nicolas Pitre, Ralf Baechle, benh, paulus, David Miller,
Ingo Molnar, Thomas Gleixner, Steven Rostedt, linux-arch
On Fri, 7 Nov 2008 01:16:43 -0500 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > Is there something we should be fixing in m68k?
> >
>
> Yes, but I fear it's going to go deep into include hell :-(
Oh, OK. I thought that the comment meant that m68k's on_each_cpu()
behaves differently at runtime from other architectures (and wrongly).
If it's just some compile-time #include snafu then that's far less
of a concern.
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 6:05 ` Andrew Morton
@ 2008-11-07 8:12 ` Nicolas Pitre
2008-11-07 8:38 ` Andrew Morton
2008-11-07 11:20 ` David Howells
2008-11-07 11:03 ` David Howells
1 sibling, 2 replies; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-07 8:12 UTC (permalink / raw)
To: Andrew Morton
Cc: Mathieu Desnoyers, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
linux-kernel, Ralf Baechle, benh, paulus, David Miller,
Ingo Molnar, Thomas Gleixner, Steven Rostedt, linux-arch
On Thu, 6 Nov 2008, Andrew Morton wrote:
> On Fri, 07 Nov 2008 00:23:44 -0500 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
>
> > #define cnt32_to_63(cnt_lo) \
> > ({ \
> > - static volatile u32 __m_cnt_hi; \
> > + static u32 __m_cnt_hi; \
> > union cnt32_to_63 __x; \
> > __x.hi = __m_cnt_hi; \
> > + smp_rmb(); /* read __m_cnt_hi before mmio cnt_lo */ \
> > __x.lo = (cnt_lo); \
> > if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
> > __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
>
> Oh dear. We have a macro which secretly maintains
> per-instantiation-site global state? And doesn't even implement locking
> to protect that state?
Please do me a favor and look for those very infrequent posts I've sent
to lkml lately. I've explained it all at least 3 times so far, to Peter
Zijlstra, to David Howells, to Mathieu Desnoyers, and now to you.
> I mean, the darned thing is called from sched_clock(), which can be
> concurrently called on separate CPUs and which can be called from
> interrupt context (with an arbitrary nesting level!) while it was running
> in process context.
Yes! And this is so on *purpose*. Please take some time to read the
comment that goes along with it, and if you're still not convinced then
look for those explanation emails I've already posted.
> /*
> * Caller must provide locking to protect *caller_state
> */
NO! This is meant to be LOCK FREE!
Nicolas
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 8:12 ` Nicolas Pitre
@ 2008-11-07 8:38 ` Andrew Morton
2008-11-07 11:20 ` David Howells
1 sibling, 0 replies; 118+ messages in thread
From: Andrew Morton @ 2008-11-07 8:38 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Mathieu Desnoyers, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
linux-kernel, Ralf Baechle, benh, paulus, David Miller,
Ingo Molnar, Thomas Gleixner, Steven Rostedt, linux-arch
On Fri, 07 Nov 2008 03:12:18 -0500 (EST) Nicolas Pitre <nico@cam.org> wrote:
> On Thu, 6 Nov 2008, Andrew Morton wrote:
>
> > On Fri, 07 Nov 2008 00:23:44 -0500 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> >
> > > #define cnt32_to_63(cnt_lo) \
> > > ({ \
> > > - static volatile u32 __m_cnt_hi; \
> > > + static u32 __m_cnt_hi; \
> > > union cnt32_to_63 __x; \
> > > __x.hi = __m_cnt_hi; \
> > > + smp_rmb(); /* read __m_cnt_hi before mmio cnt_lo */ \
> > > __x.lo = (cnt_lo); \
> > > if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
> > > __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
> >
> > Oh dear. We have a macro which secretly maintains
> > per-instantiation-site global state? And doesn't even implement locking
> > to protect that state?
>
> Please do me a favor and look for those very infrequent posts I've sent
> to lkml lately.
No. Reading the kernel code (and, at a pinch, the changelogs) should
suffice. If it does not suffice, the kernel code is in error.
> I've explained it all at least 3 times so far, to Peter
> Zijlstra, to David Howells, to Mathieu Desnoyers, and now to you.
If four heads have exploded (thus far) over one piece of code, perhaps
the blame doesn't lie with those heads.
> > I mean, the darned thing is called from sched_clock(), which can be
> > concurrently called on separate CPUs and which can be called from
> > interrupt context (with an arbitrary nesting level!) while it was running
> > in process context.
>
> Yes! And this is so on *purpose*. Please take some time to read the
> comment that goes along with it,
OK.
> and if you're still not convinced then
> look for those explanation emails I've already posted.
No.
> > /*
> > * Caller must provide locking to protect *caller_state
> > */
>
> NO! This is meant to be LOCK FREE!
We have a macro which must only have a single usage in any particular
kernel build (and nothing to detect a violation of that).
We have a macro which secretly hides internal state, on a per-expansion-site
basis, no less.
It apparently tries to avoid races via ordering tricks, as long
as it is called with sufficient frequency. But nothing guarantees
that it _is_ called sufficiently frequently?
There is absolutely no reason why the first two of these quite bad things
needed to be done. In fact there is no reason why it needed to be
implemented as a macro at all.
As I said in the text which you deleted and ignored, this would be
better if it was implemented as a C function which requires that the
caller explicitly pass in a reference to the state storage.
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 5:23 [RFC patch 00/18] Trace Clock v2 Mathieu Desnoyers
` (17 preceding siblings ...)
2008-11-07 5:23 ` [RFC patch 18/18] x86 trace clock Mathieu Desnoyers
@ 2008-11-07 10:55 ` David Howells
2008-11-07 17:09 ` Mathieu Desnoyers
18 siblings, 1 reply; 118+ messages in thread
From: David Howells @ 2008-11-07 10:55 UTC (permalink / raw)
To: Mathieu Desnoyers, Paul E. McKenney
Cc: dhowells, Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra,
linux-kernel, Nicolas Pitre, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> Assume the time source is a global clock which ensures that time will never
> *ever* go backward. Use an smp_rmb() to make sure __m_cnt_hi is read before
> the cnt_lo value.
If you have an smp_rmb(), then don't you need an smp_wmb()/smp_mb() to match
it to make it work? And is your assumption valid that smp_rmb() will affect
memory vs the I/O access to read the clock? There's no requirement that
cnt_lo will have been read from an MMIO location at all.
David
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 5:23 ` [RFC patch 08/18] cnt32_to_63 should use smp_rmb() Mathieu Desnoyers
2008-11-07 6:05 ` Andrew Morton
@ 2008-11-07 10:59 ` David Howells
1 sibling, 0 replies; 118+ messages in thread
From: David Howells @ 2008-11-07 10:59 UTC (permalink / raw)
To: Andrew Morton
Cc: dhowells, Mathieu Desnoyers, Linus Torvalds, Ingo Molnar,
Peter Zijlstra, linux-kernel, Nicolas Pitre, Ralf Baechle, benh,
paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
Andrew Morton <akpm@linux-foundation.org> wrote:
> I mean, the darned thing is called from sched_clock(), which can be
> concurrently called on separate CPUs and which can be called from
> interrupt context (with an arbitrary nesting level!) while it was running
> in process context.
>
> Who let that thing into Linux?
Having crawled all over it, and argued with Nicolas and Panasonic about it, I
think it's safe in sched_clock(), provided sched_clock() never gets preempted -
which appears to be the case.
David
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 6:05 ` Andrew Morton
2008-11-07 8:12 ` Nicolas Pitre
@ 2008-11-07 11:03 ` David Howells
2008-11-07 16:51 ` Mathieu Desnoyers
1 sibling, 1 reply; 118+ messages in thread
From: David Howells @ 2008-11-07 11:03 UTC (permalink / raw)
To: Nicolas Pitre, Andrew Morton
Cc: dhowells, Mathieu Desnoyers, Linus Torvalds, Ingo Molnar,
Peter Zijlstra, linux-kernel, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
Nicolas Pitre <nico@cam.org> wrote:
> > I mean, the darned thing is called from sched_clock(), which can be
> > concurrently called on separate CPUs and which can be called from
> > interrupt context (with an arbitrary nesting level!) while it was running
> > in process context.
>
> Yes! And this is so on *purpose*. Please take some time to read the
> comment that goes along with it, and if you're still not convinced then
> look for those explanation emails I've already posted.
I agree with Nicolas on this. It's abominably clever, but I think he's right.
The one place I remain unconvinced is over the issue of preemption of a process
that is in the middle of cnt32_to_63(), where if the preempted process is
asleep for long enough, I think it can wind time backwards when it resumes, but
that's not a problem for the one place I want to use it (sched_clock()) because
that is (almost) always called with preemption disabled in one way or another.
The one place it isn't is a debugging case that I'm not too worried about.
> > /*
> > * Caller must provide locking to protect *caller_state
> > */
>
> NO! This is meant to be LOCK FREE!
Absolutely.
David
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 8:12 ` Nicolas Pitre
2008-11-07 8:38 ` Andrew Morton
@ 2008-11-07 11:20 ` David Howells
2008-11-07 15:01 ` Nicolas Pitre
` (3 more replies)
1 sibling, 4 replies; 118+ messages in thread
From: David Howells @ 2008-11-07 11:20 UTC (permalink / raw)
To: Andrew Morton
Cc: dhowells, Nicolas Pitre, Mathieu Desnoyers, Linus Torvalds,
Ingo Molnar, Peter Zijlstra, linux-kernel, Ralf Baechle, benh,
paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
Andrew Morton <akpm@linux-foundation.org> wrote:
> We have a macro which must only have a single usage in any particular
> kernel build (and nothing to detect a violation of that).
That's not true. It's a macro containing a _static_ local variable, therefore
the macro may be used multiple times, and each time it's used the compiler
will allocate a new piece of storage.
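For illustration, a minimal user-space sketch of that behaviour (GNU C
statement expressions, illustrative names only - this is not the kernel
macro itself): each textual expansion site of a macro containing a static
local gets its own, independent storage.

#include <stdio.h>

/* Every expansion site of this macro gets its own static __n. */
#define COUNT_CALLS_HERE()                      \
({                                              \
        static unsigned int __n;                \
        ++__n;                                  \
})

int main(void)
{
        unsigned int a = 0, b = 0;
        int i;

        for (i = 0; i < 3; i++)
                a = COUNT_CALLS_HERE(); /* one expansion site, one counter: 1, 2, 3 */

        b = COUNT_CALLS_HERE();         /* a second site, a second counter: 1 */

        printf("a=%u b=%u\n", a, b);    /* prints "a=3 b=1" */
        return 0;
}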
> It apparently tries to avoid races via ordering tricks, as long
> as it is called with sufficient frequency. But nothing guarantees
> that it _is_ called sufficiently frequently?
The comment attached to it clearly states this restriction. Therefore the
caller must guarantee it. That is something Mathieu's code and my code must
deal with, not Nicolas's.
> There is absolutely no reason why the first two of these quite bad things
> needed to be done. In fact there is no reason why it needed to be
> implemented as a macro at all.
There's a very good reason to implement it as either a macro or an inline
function: it's faster. Moving the whole thing out of line would impose an
additional function call overhead - with a 64-bit return value on 32-bit
platforms. For my case - sched_clock() - I'm willing to burn a bit of extra
space to get the extra speed.
> As I said in the text which you deleted and ignored, this would be
> better if it was implemented as a C function which requires that the
> caller explicitly pass in a reference to the state storage.
I'd be quite happy if it was:
static inline u64 cnt32_to_63(u32 cnt_lo, u32 *__m_cnt_hi)
{
        union cnt32_to_63 __x;

        __x.hi = *__m_cnt_hi;
        __x.lo = cnt_lo;
        if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
                *__m_cnt_hi =
                        __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
        return __x.val;
}
I imagine this would compile pretty much the same as the macro. I think it
would make it more obvious about the independence of the storage.
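A hedged usage sketch of that function form (read_timer_a()/read_timer_b()
and the *_cnt_hi variables are illustrative names, not existing code): each
clock carries its own explicit high-word state, so the independence of the
storage is visible at the call site.

static u32 timer_a_cnt_hi;      /* high 32 bits carried for clock A */
static u32 timer_b_cnt_hi;      /* ...and, independently, for clock B */

static u64 read_clock_a(void)
{
        return cnt32_to_63(read_timer_a(), &timer_a_cnt_hi);
}

static u64 read_clock_b(void)
{
        return cnt32_to_63(read_timer_b(), &timer_b_cnt_hi);
}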
Alternatively, perhaps Nicolas just needs to mention this in the comment more
clearly.
David
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 17/18] MIPS : Trace clock
2008-11-07 5:23 ` [RFC patch 17/18] MIPS : Trace clock Mathieu Desnoyers
@ 2008-11-07 11:53 ` Peter Zijlstra
2008-11-07 17:44 ` Mathieu Desnoyers
0 siblings, 1 reply; 118+ messages in thread
From: Peter Zijlstra @ 2008-11-07 11:53 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, Ralf Baechle
On Fri, 2008-11-07 at 00:23 -0500, Mathieu Desnoyers wrote:
> Note for Peter Zijlstra :
> You should probably have a look at lockdep.c raw_spinlock_t lockdep_lock usage.
> I suspect it may be used with preemption enabled in graph_lock(). (not sure
> though, but it's worth double-checking.)
Are you worried about the graph_lock() instance in
lookup_chain_cache() ?
That is the locking for validate_chain,
__lock_acquire()
validate_chain()
lookup_chain_cache()
graph_lock()
check_prevs_add()
check_prev_add()
graph_unlock()
graph_lock()
graph_unlock()
which is all done without modifying IRQ state.
However, __lock_acquire() is only called with IRQs disabled:
lock_acquire()
raw_local_irq_save()
__lock_acquire()
lock_release()
raw_local_irq_save()
__lock_release()
lock_release_nested()
__lock_acquire()
lock_release_non_nested()
__lock_acquire()
lock_set_subclass()
raw_local_irq_save()
__lock_set_subclass()
__lock_acquire()
So I think we're good.
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 04/18] get_cycles() : powerpc64 HAVE_GET_CYCLES
2008-11-07 5:23 ` [RFC patch 04/18] get_cycles() : powerpc64 HAVE_GET_CYCLES Mathieu Desnoyers
@ 2008-11-07 14:56 ` Josh Boyer
2008-11-07 18:14 ` Mathieu Desnoyers
0 siblings, 1 reply; 118+ messages in thread
From: Josh Boyer @ 2008-11-07 14:56 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel,
benh, paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
On Fri, Nov 07, 2008 at 12:23:40AM -0500, Mathieu Desnoyers wrote:
>This patch selects HAVE_GET_CYCLES and makes sure get_cycles_barrier() and
>get_cycles_rate() are implemented.
>
>Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
>CC: benh@kernel.crashing.org
>CC: paulus@samba.org
>CC: David Miller <davem@davemloft.net>
>CC: Linus Torvalds <torvalds@linux-foundation.org>
>CC: Andrew Morton <akpm@linux-foundation.org>
>CC: Ingo Molnar <mingo@redhat.com>
>CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
>CC: Thomas Gleixner <tglx@linutronix.de>
>CC: Steven Rostedt <rostedt@goodmis.org>
>CC: linux-arch@vger.kernel.org
>---
> arch/powerpc/Kconfig | 1 +
> arch/powerpc/include/asm/timex.h | 11 +++++++++++
> 2 files changed, 12 insertions(+)
>
>Index: linux.trees.git/arch/powerpc/Kconfig
>===================================================================
>--- linux.trees.git.orig/arch/powerpc/Kconfig 2008-11-07 00:09:44.000000000 -0500
>+++ linux.trees.git/arch/powerpc/Kconfig 2008-11-07 00:09:46.000000000 -0500
>@@ -121,6 +121,7 @@ config PPC
> select HAVE_DMA_ATTRS if PPC64
> select USE_GENERIC_SMP_HELPERS if SMP
> select HAVE_OPROFILE
>+ select HAVE_GET_CYCLES if PPC64
So maybe it's just me because it's Friday and I'm on vacation, but I don't
see anything overly specific to ppc64 here. In fact, you use get_cycles_rate
for all of powerpc in a later patch in the series.
Is there something special about HAVE_GET_CYCLES that I'm missing that would
make it only apply to ppc64 and not ppc32?
josh
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 11:20 ` David Howells
@ 2008-11-07 15:01 ` Nicolas Pitre
2008-11-07 15:50 ` Andrew Morton
2008-11-07 16:21 ` David Howells
2008-11-07 16:07 ` David Howells
` (2 subsequent siblings)
3 siblings, 2 replies; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-07 15:01 UTC (permalink / raw)
To: David Howells
Cc: Andrew Morton, Mathieu Desnoyers, Linus Torvalds, Ingo Molnar,
Peter Zijlstra, linux-kernel, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
On Fri, 7 Nov 2008, David Howells wrote:
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > As I said in the text which you deleted and ignored, this would be
> > better if it was implemented as a C function which requires that the
> > caller explicitly pass in a reference to the state storage.
The whole purpose of that thing is to be utterly fast and lightweight.
Having an out of line C call would trash the major advantage of this
code.
> I'd be quite happy if it was:
>
> static inline u64 cnt32_to_63(u32 cnt_lo, u32 *__m_cnt_hi)
> {
> union cnt32_to_63 __x;
> __x.hi = *__m_cnt_hi;
> __x.lo = cnt_lo;
> if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
> *__m_cnt_hi =
> __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
> return __x.val;
> }
>
> I imagine this would compile pretty much the same as the macro.
Depends. As everybody has noticed now, the read ordering is important,
and if gcc decides to not inline this for whatever reason then the
ordering is lost. This is why this was a macro to start with.
> I think it
> would make it more obvious about the independence of the storage.
I don't think having the associated storage be outside the macro makes
any sense either. There is simply no valid reason for having it shared
between multiple invocations of the macro, as well as making its
interface more complex for no gain.
> Alternatively, perhaps Nicolas just needs to mention this in the comment more
> clearly.
I wrote that code so to me it is crystal clear already. Any suggestions
as to how this could be improved?
Nicolas
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 15:01 ` Nicolas Pitre
@ 2008-11-07 15:50 ` Andrew Morton
2008-11-07 16:47 ` Nicolas Pitre
2008-11-07 16:55 ` David Howells
2008-11-07 16:21 ` David Howells
1 sibling, 2 replies; 118+ messages in thread
From: Andrew Morton @ 2008-11-07 15:50 UTC (permalink / raw)
To: Nicolas Pitre
Cc: David Howells, Mathieu Desnoyers, Linus Torvalds, Ingo Molnar,
Peter Zijlstra, linux-kernel, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
On Fri, 07 Nov 2008 10:01:01 -0500 (EST) Nicolas Pitre <nico@cam.org> wrote:
> On Fri, 7 Nov 2008, David Howells wrote:
>
> > Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > > As I said in the text which you deleted and ignored, this would be
> > > better if it was implemented as a C function which requires that the
> > > caller explicitly pass in a reference to the state storage.
>
> The whole purpose of that thing is to be utterly fast and lightweight.
Well I'm glad it wasn't designed to demonstrate tastefulness.
btw, do you know how damned irritating and frustrating it is for a code
reviewer to have his comments deliberately ignored and deleted in
replies?
> Having an out of line C call would trash the major advantage of this
> code.
Not really.
> > I'd be quite happy if it was:
> >
> > static inline u64 cnt32_to_63(u32 cnt_lo, u32 *__m_cnt_hi)
> > {
> > union cnt32_to_63 __x;
> > __x.hi = *__m_cnt_hi;
> > __x.lo = cnt_lo;
> > if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
> > *__m_cnt_hi =
> > __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
> > return __x.val;
> > }
> >
> > I imagine this would compile pretty much the same as the macro.
>
> Depends. As everybody has noticed now, the read ordering is important,
> and if gcc decides to not inline this
If gcc did that then it would need to generate static instances of
inlined functions within individual compilation units. It would be a
disaster for the kernel. For a start, functions which are "inlined" in kernel
modules wouldn't be able to access their static storage and modprobing
them would fail.
> for whatever reason then the
> ordering is lost.
Uninlining won't affect any ordering I can see.
> This is why this was a macro to start with.
>
> > I think it
> > would make it more obvious about the independence of the storage.
>
> I don't think having the associated storage be outside the macro makes
> any sense either. There is simply no valid reason for having it shared
> between multiple invocations of the macro, as well as making its
> interface more complex for no gain.
oh god.
> > Alternatively, perhaps Nicolas just needs to mention this in the comment more
> > clearly.
>
> I wrote that code so to me it is crystal clear already. Any suggestions
> as to how this could be improved?
>
Does mn10300's get_cycles() really count backwards? The first two
callsites I looked at (crypto/tcrypt.c and fs/ext4/mballoc.c) assume
that it is an upcounter.
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 11:20 ` David Howells
2008-11-07 15:01 ` Nicolas Pitre
@ 2008-11-07 16:07 ` David Howells
2008-11-07 16:47 ` Mathieu Desnoyers
2008-11-07 17:04 ` David Howells
3 siblings, 0 replies; 118+ messages in thread
From: David Howells @ 2008-11-07 16:07 UTC (permalink / raw)
To: Nicolas Pitre
Cc: dhowells, Andrew Morton, Mathieu Desnoyers, Linus Torvalds,
Ingo Molnar, Peter Zijlstra, linux-kernel, Ralf Baechle, benh,
paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
Nicolas Pitre <nico@cam.org> wrote:
> The whole purpose of that thing is to be utterly fast and lightweight.
> Having an out of line C call would trash the major advantage of this
> code.
No argument there.
> > I imagine this would compile pretty much the same as the macro.
Having said that, I realise it's wrong. The macro potentially takes a h/w
read operation (cnt_lo) and does it at a place of its choosing - which the
compiler may not be permitted to move if cnt_lo resolves to a bit of volatile
inline asm with memory constraints. Converting it to an inline function
forces cnt_lo to be resolved first.
> Depends. As everybody has noticed now, the read ordering is important,
> and if gcc decides to not inline this for whatever reason then the
> ordering is lost. This is why this was a macro to start with.
Good point. I wonder if you should've put a compiler barrier in there to at
least make the point.
> I don't think having the associated storage be outside the macro make any
> sense either.
It can have a comment attached to it to say what it represents. On the other
hand, it'd probably need further comments attaching to it to fend off people
who start thinking they can make use of this variable in other ways...
> > Alternatively, perhaps Nicolas just needs to mention this in the comment
> > more clearly.
>
> I wrote that code so to me it is crystal clear already. Any suggestions
> as to how this could be improved?
It ought to be, but clearly it isn't. Sometimes the obvious is all too easy to
overlook. I'll think about it, but perhaps something like:
* This macro uses a static internal variable to retain the upper counter.
* This has two consequences: firstly, it may be used in multiple ways by
* different callers for different things without interference; and secondly,
* each caller will get its own, independent counter, and so an out of line
* wrapper must be used if multiple callers want to use the same pair of
* counters.
It's a bit heavy-handed, but...
David
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 07/18] Trace clock core
2008-11-07 6:26 ` Andrew Morton
@ 2008-11-07 16:12 ` Mathieu Desnoyers
2008-11-07 16:19 ` Andrew Morton
0 siblings, 1 reply; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 16:12 UTC (permalink / raw)
To: Andrew Morton
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
Nicolas Pitre, Ralf Baechle, benh, paulus, David Miller,
Ingo Molnar, Thomas Gleixner, Steven Rostedt, linux-arch
* Andrew Morton (akpm@linux-foundation.org) wrote:
> On Fri, 7 Nov 2008 01:16:43 -0500 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
>
> > > Is there something we should be fixing in m68k?
> > >
> >
> > Yes, but I fear it's going to go deep into include hell :-(
>
> Oh, OK. I thought that the comment meant that m68k's on_each_cpu()
> behaves differently at runtime from other architectures (and wrongly).
>
> If it's just some compile-time #include snafu then that's far less
> of a concern.
>
Should I simply remove this comment then ?
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 07/18] Trace clock core
2008-11-07 16:12 ` Mathieu Desnoyers
@ 2008-11-07 16:19 ` Andrew Morton
2008-11-07 18:16 ` Mathieu Desnoyers
0 siblings, 1 reply; 118+ messages in thread
From: Andrew Morton @ 2008-11-07 16:19 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
Nicolas Pitre, Ralf Baechle, benh, paulus, David Miller,
Ingo Molnar, Thomas Gleixner, Steven Rostedt, linux-arch
On Fri, 7 Nov 2008 11:12:38 -0500 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> * Andrew Morton (akpm@linux-foundation.org) wrote:
> > On Fri, 7 Nov 2008 01:16:43 -0500 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> >
> > > > Is there something we should be fixing in m68k?
> > > >
> > >
> > > Yes, but I fear it's going to go deep into include hell :-(
> >
> > Oh, OK. I thought that the comment meant that m68k's on_each_cpu()
> > behaves differently at runtime from other architectures (and wrongly).
> >
> > If it's just some compile-time #include snafu then that's far less
> > of a concern.
> >
>
> Should I simply remove this comment then ?
>
umm, it could perhaps be clarified - mention that it's needed for an
include order problem.
It's a bit odd. Surely by the time we've included these:
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/delay.h>
+#include <linux/timer.h>
+#include <linux/workqueue.h>
+#include <linux/cpu.h>
+#include <linux/timex.h>
+#include <linux/bitops.h>
+#include <linux/trace-clock.h>
+#include <linux/smp.h>
someone has already included sched.h, and the definition of
_LINUX_SCHED_H will cause the later inclusion to not change anything?
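For reference, the guard being referred to is the usual include-guard
pattern - a sketch only, with the real contents of linux/sched.h elided:

#ifndef _LINUX_SCHED_H
#define _LINUX_SCHED_H
/* ... declarations ... */
#endif /* _LINUX_SCHED_H */

Once an earlier header has pulled sched.h in, _LINUX_SCHED_H is defined and
any later #include <linux/sched.h> expands to nothing, so an explicit
include would only matter if it is there to fix an include-order problem.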
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 15:01 ` Nicolas Pitre
2008-11-07 15:50 ` Andrew Morton
@ 2008-11-07 16:21 ` David Howells
2008-11-07 16:29 ` Andrew Morton
2008-11-07 17:10 ` David Howells
1 sibling, 2 replies; 118+ messages in thread
From: David Howells @ 2008-11-07 16:21 UTC (permalink / raw)
To: Andrew Morton
Cc: dhowells, Nicolas Pitre, Mathieu Desnoyers, Linus Torvalds,
Ingo Molnar, Peter Zijlstra, linux-kernel, Ralf Baechle, benh,
paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
Andrew Morton <akpm@linux-foundation.org> wrote:
> If gcc did that then it would need to generate static instances of
> inlined functions within individual compilation units. It would be a
> disaster for the kernel. For a start, functions which are "inlined" in kernel
> modules wouldn't be able to access their static storage and modprobing
> them would fail.
Do you expect a static inline function that lives in a header file and that
has a static variable in it to share that static variable over all instances
of that function in a program? Or do you expect the static variable to be
limited at the file level? Or just at the invocation level?
> Does mn10300's get_cycles() really count backwards?
Yes, because the value is generated by a pair of cascaded 16-bit hardware
down-counters.
> The first two callsites I looked at (crypto/tcrypt.c and fs/ext4/mballoc.c)
> assume that it is an upcounter.
Hmmm... It didn't occur to me that get_cycles() was available for use outside
of arch code. Possibly it wasn't so used when I first came up with the code.
I should probably make it count the other way.
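A sketch of what counting "the other way" could look like - inverting the
cascaded down-counter so callers see an up-counter; read_cascaded_timers()
is an illustrative stand-in for the real mn10300 register reads, not an
existing function:

static inline cycles_t get_cycles(void)
{
        /*
         * ~x == 0xFFFFFFFF - x for a 32-bit value, so a monotonic
         * down-count is presented to callers as a monotonic up-count.
         */
        return ~read_cascaded_timers();
}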
David
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 16:21 ` David Howells
@ 2008-11-07 16:29 ` Andrew Morton
2008-11-07 17:10 ` David Howells
1 sibling, 0 replies; 118+ messages in thread
From: Andrew Morton @ 2008-11-07 16:29 UTC (permalink / raw)
To: David Howells
Cc: Nicolas Pitre, Mathieu Desnoyers, Linus Torvalds, Ingo Molnar,
Peter Zijlstra, linux-kernel, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
On Fri, 07 Nov 2008 16:21:55 +0000 David Howells <dhowells@redhat.com> wrote:
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > If gcc did that then it would need to generate static instances of
> > inlined functions within individual compilation units. It would be a
> > disaster for the kernel. For a start, functions which are "inlined" in kernel
> > modules wouldn't be able to access their static storage and modprobing
> > them would fail.
>
> Do you expect a static inline function that lives in a header file and that
> has a static variable in it to share that static variable over all instances
> of that function in a program? Or do you expect the static variable to be
> limited at the file level? Or just at the invocation level?
I'd expect it to behave in the same way as it would if the function was
implemented out-of-line.
But it occurs to me that the modprobe-doesn't-work thing would happen if
the function _is_ inlined anyway, so we won't be doing that.
Whatever. Killing this many puppies because gcc may do something so
bizarrely wrong isn't justifiable.
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 15:50 ` Andrew Morton
@ 2008-11-07 16:47 ` Nicolas Pitre
2008-11-07 17:21 ` Andrew Morton
2008-11-07 16:55 ` David Howells
1 sibling, 1 reply; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-07 16:47 UTC (permalink / raw)
To: Andrew Morton
Cc: David Howells, Mathieu Desnoyers, Linus Torvalds, Ingo Molnar,
Peter Zijlstra, linux-kernel, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
On Fri, 7 Nov 2008, Andrew Morton wrote:
> On Fri, 07 Nov 2008 10:01:01 -0500 (EST) Nicolas Pitre <nico@cam.org> wrote:
>
> > On Fri, 7 Nov 2008, David Howells wrote:
> >
> > > Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > > As I said in the text which you deleted and ignored, this would be
> > > > better if it was implemented as a C function which requires that the
> > > > caller explicitly pass in a reference to the state storage.
> >
> > The whole purpose of that thing is to be utterly fast and lightweight.
>
> Well I'm glad it wasn't designed to demonstrate tastefulness.
Fast tricks aren't always meant to be beautiful. That's why we have
abstraction layers.
> btw, do you know how damned irritating and frustrating it is for a code
> reviewer to have his comments deliberately ignored and deleted in
> replies?
Do you know how irritating and frustrating it is when reviewers don't
care to read the damn comments along with the code? Maybe it wasn't the
case, but you gave the impression of jumping to conclusions without even
bothering about the associated explanation in the code until I directed
you at it.
> > Having an out of line C call would trash the major advantage of this
> > code.
>
> Not really.
Your opinion.
> > > I imagine this would compile pretty much the same as the macro.
> >
> > Depends. As everybody has noticed now, the read ordering is important,
> > and if gcc decides to not inline this
>
> If gcc did that then it would need to generate static instances of
> inlined functions within individual compilation units. It would be a
> disaster for the kernel. For a start, functions which are "inlined" in kernel
> modules wouldn't be able to access their static storage and modprobing
> them would fail.
That doesn't mean that access ordering is preserved within the function,
unless the interface is pointer based, but that won't work if the
counter is accessed through some special instruction rather than a
memory location.
> > for whatever reason then the
> > ordering is lost.
>
> Uninlining won't affect any ordering I can see.
See above.
> > This is why this was a macro to start with.
> >
> > > I think it
> > > would make it more obvious about the independence of the storage.
> >
> > I don't think having the associated storage be outside the macro makes
> > any sense either. There is simply no valid reason for having it shared
> > between multiple invocations of the macro, as well as making its
> > interface more complex for no gain.
>
> oh god.
Thank you. ;-)
> > > Alternatively, perhaps Nicolas just needs to mention this in the comment more
> > > clearly.
> >
> > I wrote that code so to me it is crystal clear already. Any suggestions
> > as to how this could be improved?
> >
>
> Does mn10300's get_cycles() really count backwards? The first two
> callsites I looked at (crypto/tcrypt.c and fs/ext4/mballoc.c) assume
> that it is an upcounter.
I know nothing about mn10300.
Nicolas
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 11:20 ` David Howells
2008-11-07 15:01 ` Nicolas Pitre
2008-11-07 16:07 ` David Howells
@ 2008-11-07 16:47 ` Mathieu Desnoyers
2008-11-07 20:11 ` Russell King
2008-11-07 17:04 ` David Howells
3 siblings, 1 reply; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 16:47 UTC (permalink / raw)
To: David Howells
Cc: Andrew Morton, Nicolas Pitre, Linus Torvalds, Ingo Molnar,
Peter Zijlstra, linux-kernel, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
* David Howells (dhowells@redhat.com) wrote:
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > We have a macro which must only have a single usage in any particular
> > kernel build (and nothing to detect a violation of that).
>
> That's not true. It's a macro containing a _static_ local variable, therefore
> the macro may be used multiple times, and each time it's used the compiler
> will allocate a new piece of storage.
>
> > It apparently tries to avoid races via ordering tricks, as long
> > as it is called with sufficient frequency. But nothing guarantees
> > that it _is_ called sufficiently frequently?
>
> The comment attached to it clearly states this restriction. Therefore the
> caller must guarantee it. That is something Mathieu's code and my code must
> deal with, not Nicolas's.
>
> > There is absolutely no reason why the first two of these quite bad things
> > needed to be done. In fact there is no reason why it needed to be
> > implemented as a macro at all.
>
> There's a very good reason to implement it as either a macro or an inline
> function: it's faster. Moving the whole thing out of line would impose an
> additional function call overhead - with a 64-bit return value on 32-bit
> platforms. For my case - sched_clock() - I'm willing to burn a bit of extra
> space to get the extra speed.
>
> > As I said in the text which you deleted and ignored, this would be
> > better if it was implemented as a C function which requires that the
> > caller explicitly pass in a reference to the state storage.
>
> I'd be quite happy if it was:
>
> static inline u64 cnt32_to_63(u32 cnt_lo, u32 *__m_cnt_hi)
> {
> union cnt32_to_63 __x;
> __x.hi = *__m_cnt_hi;
> __x.lo = cnt_lo;
> if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
> *__m_cnt_hi =
> __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
> return __x.val;
> }
>
Almost there. At least, with this kind of implementation, we would not
have to resort to various tricks to make sure a single code path is
called at a certain frequency. We would simply have to make sure the
__m_cnt_hi value is updated at a certain frequency. Thinking about
"data" rather than "code" makes much more sense.
The only missing thing here is the correct ordering. The problem is, as
I presented in more depth in my previous discussion with Nicolas, that
the __m_cnt_hi value has to be read before cnt_lo. First off, using this
macro with get_cycles() is simply buggy, because the macro expects
_perfect_ order of timestamps, no skew whatsoever, or otherwise time
could jump. This macro is therefore good only for mmio reads. One should
use per-cpu variables to keep the state of get_cycles() reads (as I did
in my other patch).
The following approach should work :
static inline u64 cnt32_to_63(u32 io_addr, u32 *__m_cnt_hi)
{
        union cnt32_to_63 __x;

        __x.hi = *__m_cnt_hi;   /* memory read for high bits internal state */
        smp_rmb();      /*
                         * read high bits before low bits ensures time
                         * does not go backward.
                         */
        __x.lo = readl(io_addr);        /* mmio read */
        if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
                *__m_cnt_hi =
                        __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
        return __x.val;
}
But any get_cycles() user of cnt32_to_63() should be shot down. The
bright side is : there is no way get_cycles() can be used with this
new code. :)
Examples of incorrect users for arm (unless they are UP only, but that seems
like a weird design argument) :
mach-sa1100/include/mach/SA-1100.h:#define OSCR __REG(0x90000010)
/* OS timer Counter Reg. */
mach-sa1100/generic.c: unsigned long long v = cnt32_to_63(OSCR);
mach-pxa/include/mach/pxa-regs.h:#define OSCR __REG(0x40A00010) /* OS
Timer Counter Register */
mach-pxa/time.c: unsigned long long v = cnt32_to_63(OSCR);
Correct user :
mach-versatile/core.c: unsigned long long v =
cnt32_to_63(readl(VERSATILE_REFCOUNTER));
The new call would look like :
/* Hi 32-bits of versatile refcounter state, kept for cnt32_to_63. */
static u32 versatile_refcounter_hi;
unsigned long long v = cnt32_to_63(VERSATILE_REFCOUNTER, &versatile_refcounter_hi);
Mathieu
> I imagine this would compile pretty much the same as the macro. I think it
> would make it more obvious about the independence of the storage.
>
> Alternatively, perhaps Nicolas just needs to mention this in the comment more
> clearly.
>
> David
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 11:03 ` David Howells
@ 2008-11-07 16:51 ` Mathieu Desnoyers
2008-11-07 20:18 ` Nicolas Pitre
2008-11-07 23:55 ` David Howells
0 siblings, 2 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 16:51 UTC (permalink / raw)
To: David Howells
Cc: Nicolas Pitre, Andrew Morton, Linus Torvalds, Ingo Molnar,
Peter Zijlstra, linux-kernel, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
* David Howells (dhowells@redhat.com) wrote:
> Nicolas Pitre <nico@cam.org> wrote:
>
> > > I mean, the darned thing is called from sched_clock(), which can be
> > > concurrently called on separate CPUs and which can be called from
> > > interrupt context (with an arbitrary nesting level!) while it was running
> > > in process context.
> >
> > Yes! And this is so on *purpose*. Please take some time to read the
> > comment that goes along with it, and if you're still not convinced then
> > look for those explanation emails I've already posted.
>
> I agree with Nicolas on this. It's abominably clever, but I think he's right.
>
> The one place I remain unconvinced is over the issue of preemption of a process
> that is in the middle of cnt32_to_63(), where if the preempted process is
> asleep for long enough, I think it can wind time backwards when it resumes, but
> that's not a problem for the one place I want to use it (sched_clock()) because
> that is (almost) always called with preemption disabled in one way or another.
>
> The one place it isn't is a debugging case that I'm not too worried about.
>
I am also concerned about the case where preemption is not disabled.
I think the function should then document that it must be called with
preemption disabled.
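For instance, a caller could make that requirement explicit along these
lines - a sketch only, using the two-argument form proposed earlier in the
thread; my_read_cnt_lo() and my_cnt_hi are illustrative names, not existing
code:

static u32 my_cnt_hi;           /* high-word state for this clock */

static u64 my_clock_read(void)
{
        u64 cycles;

        preempt_disable();      /* don't get preempted between the reads
                                 * and the my_cnt_hi update */
        cycles = cnt32_to_63(my_read_cnt_lo(), &my_cnt_hi);
        preempt_enable();

        return cycles;
}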
Mathieu
> > > /*
> > > * Caller must provide locking to protect *caller_state
> > > */
> >
> > NO! This is meant to be LOCK FREE!
>
> Absolutely.
>
> David
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 15:50 ` Andrew Morton
2008-11-07 16:47 ` Nicolas Pitre
@ 2008-11-07 16:55 ` David Howells
1 sibling, 0 replies; 118+ messages in thread
From: David Howells @ 2008-11-07 16:55 UTC (permalink / raw)
To: Nicolas Pitre
Cc: dhowells, Andrew Morton, Mathieu Desnoyers, Linus Torvalds,
Ingo Molnar, Peter Zijlstra, linux-kernel, Ralf Baechle, benh,
paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
Nicolas Pitre <nico@cam.org> wrote:
> > Well I'm glad it wasn't designed to demonstrate tastefulness.
>
> Fast tricks aren't always meant to be beautiful. That,s why we have
> abstraction layers.
And comments. The comment attached to cnt32_to_63() is well written and
informative.
> I know nothing about mn10300.
That's aimed my way.
David
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 11:20 ` David Howells
` (2 preceding siblings ...)
2008-11-07 16:47 ` Mathieu Desnoyers
@ 2008-11-07 17:04 ` David Howells
2008-11-07 17:17 ` Mathieu Desnoyers
2008-11-07 23:27 ` David Howells
3 siblings, 2 replies; 118+ messages in thread
From: David Howells @ 2008-11-07 17:04 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: dhowells, Andrew Morton, Nicolas Pitre, Linus Torvalds,
Ingo Molnar, Peter Zijlstra, linux-kernel, Ralf Baechle, benh,
paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> First off, using this macro with get_cycles() is simply buggy, because the
> macro expects _perfect_ order of timestamps, no skew whatsoever, or
> otherwise time could jump.
Erm... Why can't I pass it get_cycles()? Are you saying that sched_clock()
in MN10300 is wrong for it's use of get_cycles() with cnt32_to_63()?
> __x.lo = readl(io_addr); /* mmio read */
readl() might insert an extra barrier instruction. Not only that, io_addr
must be unsigned long.
David
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 10:55 ` [RFC patch 08/18] cnt32_to_63 should use smp_rmb() David Howells
@ 2008-11-07 17:09 ` Mathieu Desnoyers
2008-11-07 17:33 ` Steven Rostedt
` (3 more replies)
0 siblings, 4 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 17:09 UTC (permalink / raw)
To: David Howells
Cc: Paul E. McKenney, Linus Torvalds, akpm, Ingo Molnar,
Peter Zijlstra, linux-kernel, Nicolas Pitre, Ralf Baechle, benh,
paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
* David Howells (dhowells@redhat.com) wrote:
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
>
> > Assume the time source is a global clock which ensures that time will never
> > *ever* go backward. Use an smp_rmb() to make sure __m_cnt_hi is read before
> > the cnt_lo value.
>
> If you have an smp_rmb(), then don't you need an smp_wmb()/smp_mb() to match
> it to make it work? And is your assumption valid that smp_rmb() will affect
> memory vs the I/O access to read the clock? There's no requirement that
> cnt_lo will have been read from an MMIO location at all.
>
> David
I want to make sure
__m_cnt_hi
is read before
mmio cnt_lo read
for the detailed reasons explained in my previous discussion with
Nicolas here :
http://lkml.org/lkml/2008/10/21/1
I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
be required so it works also on UP systems safely wrt interrupts).
The write side is between the hardware counter, which is assumed to
increment monotonically between each read, and the value __m_cnt_hi
updated by the CPU. I don't see where we could put a wmb() there.
Without the barrier, the SMP race looks as follows :

CPU A: read hw cnt low (0xFFFFFFFA)
CPU B: read __m_cnt_hi (0x80000000)
CPU B: read hw cnt low (0x00000001)
CPU B: (wrap detected : (s32)(0x80000000 ^ 0x1) < 0)
CPU B: write __m_cnt_hi = 0x00000001
CPU A: read __m_cnt_hi (0x00000001)
CPU B: return 0x0000000100000001
CPU A: (wrap detected : (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
CPU A: write __m_cnt_hi = 0x80000001
CPU A: return 0x80000001FFFFFFFA
       (time jumps)
And UP interrupt race :
Thread context:    read hw cnt low (0xFFFFFFFA)
Interrupt handler: read __m_cnt_hi (0x80000000)
Interrupt handler: read hw cnt low (0x00000001)
Interrupt handler: (wrap detected : (s32)(0x80000000 ^ 0x1) < 0)
Interrupt handler: write __m_cnt_hi = 0x00000001
Interrupt handler: return 0x0000000100000001
Thread context:    read __m_cnt_hi (0x00000001)
Thread context:    (wrap detected : (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
Thread context:    write __m_cnt_hi = 0x80000001
Thread context:    return 0x80000001FFFFFFFA
                   (time jumps)
New code to fix it here with full rmb() :
static inline u64 cnt32_to_63(u32 io_addr, u32 *__m_cnt_hi)
{
        union cnt32_to_63 __x;

        __x.hi = *__m_cnt_hi;   /* memory read for high bits internal state */
        rmb();  /*
                 * read high bits before low bits ensures time
                 * does not go backward. Sync across
                 * CPUs and for interrupts.
                 */
        __x.lo = readl(io_addr);        /* mmio read */
        if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
                *__m_cnt_hi =
                        __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
        return __x.val;
}
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 16:21 ` David Howells
2008-11-07 16:29 ` Andrew Morton
@ 2008-11-07 17:10 ` David Howells
2008-11-07 17:26 ` Andrew Morton
1 sibling, 1 reply; 118+ messages in thread
From: David Howells @ 2008-11-07 17:10 UTC (permalink / raw)
To: Andrew Morton
Cc: dhowells, Nicolas Pitre, Mathieu Desnoyers, Linus Torvalds,
Ingo Molnar, Peter Zijlstra, linux-kernel, Ralf Baechle, benh,
paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
Andrew Morton <akpm@linux-foundation.org> wrote:
> I'd expect it to behave in the same way as it would if the function was
> implemented out-of-line.
>
> But it occurs to me that the modprobe-doesn't-work thing would happen if
> the function _is_ inlined anyway, so we won't be doing that.
>
> Whatever. Killing this many puppies because gcc may do something so
> bizarrely wrong isn't justifiable.
With gcc, you get one instance of the static variable from inside a static
(inline or outofline) function per .o file that invokes it, and these do not
merge even though they're common symbols. I asked around and the opinion
seems to be that this is correct C. I suppose it's the equivalent of cutting
and pasting a function between several files - why should the compiler assume
it's the same function in each?
David
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 17:04 ` David Howells
@ 2008-11-07 17:17 ` Mathieu Desnoyers
2008-11-07 23:27 ` David Howells
1 sibling, 0 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 17:17 UTC (permalink / raw)
To: David Howells
Cc: Andrew Morton, Nicolas Pitre, Linus Torvalds, Ingo Molnar,
Peter Zijlstra, linux-kernel, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
* David Howells (dhowells@redhat.com) wrote:
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
>
> > First off, using this macro with get_cycles() is simply buggy, because the
> > macro expects _perfect_ order of timestamps, no skew whatsoever, or
> > otherwise time could jump.
>
> Erm... Why can't I pass it get_cycles()? Are you saying that sched_clock()
> in MN10300 is wrong for it's use of get_cycles() with cnt32_to_63()?
>
Yes. Do you think the synchronization of the cycles counters is
_perfect_ across CPUs so that there is no possible way whatsoever that
two cycle counter values appear to go backward between CPUs? (also
taking into account delays in __m_cnt_hi write-back...)
As I showed in my previous example, if you are unlucky enough to hit the
spot where the cycle counters go backward at the time warp edge, time
will jump by 2^32 cycles, so about 4.29s at 1GHz.
> > __x.lo = readl(cnt_lo); /* mmio read */
>
> readl() might insert an extra barrier instruction. Not only that, io_addr
> must be unsigned long.
If we expect the only correct use-case to be with readl(), I don't see
the problem with added synchronization.
>
Ah, right, then the parameters should be updated accordingly.
static inline u64 cnt32_to_63(unsigned long io_addr, u32 *__m_cnt_hi)
{
        union cnt32_to_63 __x;

        __x.hi = *__m_cnt_hi;   /* memory read for high bits internal state */
        rmb();  /*
                 * read high bits before low bits ensures time
                 * does not go backward. Sync across
                 * CPUs and for interrupts.
                 */
        __x.lo = readl(io_addr);        /* mmio read */
        if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
                *__m_cnt_hi =
                        __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
        return __x.val;
}
Mathieu
> David
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 16:47 ` Nicolas Pitre
@ 2008-11-07 17:21 ` Andrew Morton
2008-11-07 20:03 ` Nicolas Pitre
0 siblings, 1 reply; 118+ messages in thread
From: Andrew Morton @ 2008-11-07 17:21 UTC (permalink / raw)
To: Nicolas Pitre
Cc: David Howells, Mathieu Desnoyers, Linus Torvalds, Ingo Molnar,
Peter Zijlstra, linux-kernel, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
On Fri, 07 Nov 2008 11:47:47 -0500 (EST) Nicolas Pitre <nico@cam.org> wrote:
> > btw, do you know how damned irritating and frustrating it is for a code
> > reviewer to have his comments deliberately ignored and deleted in
> > replies?
>
> Do you know how irritating and frustrating it is when reviewers don't
> care to read the damn comments along with the code?
As you still seek to ignore it, I shall repeat my earlier question.
Please do not delete it again.
It apparently tries to avoid races via ordering tricks, as long
as it is called with sufficient frequency. But nothing guarantees
that it _is_ called sufficiently frequently?
Things like tickless kernels and SCHED_RR can surely cause
sched_clock() to not be called for arbitrary periods.
Userspace cli() will definitely do this, but it is expected to break
stuff and is not as legitimate a thing to do.
I'm just giving up on the tastefulness issue.
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 17:10 ` David Howells
@ 2008-11-07 17:26 ` Andrew Morton
2008-11-07 18:00 ` Mathieu Desnoyers
0 siblings, 1 reply; 118+ messages in thread
From: Andrew Morton @ 2008-11-07 17:26 UTC (permalink / raw)
To: David Howells
Cc: Nicolas Pitre, Mathieu Desnoyers, Linus Torvalds, Ingo Molnar,
Peter Zijlstra, linux-kernel, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
On Fri, 07 Nov 2008 17:10:00 +0000 David Howells <dhowells@redhat.com> wrote:
>
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > I'd expect it to behave in the same way as it would if the function was
> > implemented out-of-line.
> >
> > But it occurs to me that the modprobe-doesn't-work thing would happen if
> > the function _is_ inlined anyway, so we won't be doing that.
> >
> > Whatever. Killing this many puppies because gcc may do something so
> > bizarrely wrong isn't justifiable.
>
> With gcc, you get one instance of the static variable from inside a static
> (inline or outofline) function per .o file that invokes it, and these do not
> merge even though they're common symbols. I asked around and the opinion
> seems to be that this is correct C. I suppose it's the equivalent of cutting
> and pasting a function between several files - why should the compiler assume
> it's the same function in each?
>
OK, thanks, I guess that makes sense. For static inline. I wonder if
`extern inline' or plain old `inline' should change it.
It's one of those things I hope I never need to know about, but perhaps
we do somewhere have static storage in an inline. Wouldn't surprise
me, and I bet that if we do, it's a bug.
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 17:09 ` Mathieu Desnoyers
@ 2008-11-07 17:33 ` Steven Rostedt
2008-11-07 19:18 ` Mathieu Desnoyers
2008-11-07 20:08 ` Steven Rostedt
` (2 subsequent siblings)
3 siblings, 1 reply; 118+ messages in thread
From: Steven Rostedt @ 2008-11-07 17:33 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: David Howells, Paul E. McKenney, Linus Torvalds, akpm,
Ingo Molnar, Peter Zijlstra, linux-kernel, Nicolas Pitre,
Ralf Baechle, benh, paulus, David Miller, Ingo Molnar,
Thomas Gleixner, linux-arch
On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
>
> __m_cnt_hi
> is read before
> mmio cnt_lo read
>
> for the detailed reasons explained in my previous discussion with
> Nicolas here :
> http://lkml.org/lkml/2008/10/21/1
>
> I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> be required so it works also on UP systems safely wrt interrupts).
smp_rmb turns into a compiler barrier on UP and should prevent the below
description.
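For reference, the general shape of the distinction being discussed (the
real definitions live in the per-architecture barrier headers; this is only
a sketch):

#ifdef CONFIG_SMP
#define smp_rmb()       rmb()           /* real read barrier: orders reads against other CPUs */
#else
#define smp_rmb()       barrier()       /* compiler-only barrier on UP */
#endif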
-- Steve
>
> The write side is between the hardware counter, which is assumed to
> increment monotonically between each read, and the value __m_cnt_hi
> updated by the CPU. I don't see where we could put a wmb() there.
>
> Without the barrier, the SMP race looks as follows :
>
>
> CPU A: read hw cnt low (0xFFFFFFFA)
> CPU B: read __m_cnt_hi (0x80000000)
> CPU B: read hw cnt low (0x00000001)
> CPU B: (wrap detected : (s32)(0x80000000 ^ 0x1) < 0)
> CPU B: write __m_cnt_hi = 0x00000001
> CPU A: read __m_cnt_hi (0x00000001)
> CPU B: return 0x0000000100000001
> CPU A: (wrap detected : (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
> CPU A: write __m_cnt_hi = 0x80000001
> CPU A: return 0x80000001FFFFFFFA
>        (time jumps)
>
> And UP interrupt race :
>
> Thread context:    read hw cnt low (0xFFFFFFFA)
> Interrupt handler: read __m_cnt_hi (0x80000000)
> Interrupt handler: read hw cnt low (0x00000001)
> Interrupt handler: (wrap detected : (s32)(0x80000000 ^ 0x1) < 0)
> Interrupt handler: write __m_cnt_hi = 0x00000001
> Interrupt handler: return 0x0000000100000001
> Thread context:    read __m_cnt_hi (0x00000001)
> Thread context:    (wrap detected : (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
> Thread context:    write __m_cnt_hi = 0x80000001
> Thread context:    return 0x80000001FFFFFFFA
>                    (time jumps)
>
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 17/18] MIPS : Trace clock
2008-11-07 11:53 ` Peter Zijlstra
@ 2008-11-07 17:44 ` Mathieu Desnoyers
0 siblings, 0 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 17:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, akpm, Ingo Molnar, linux-kernel, Ralf Baechle
* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Fri, 2008-11-07 at 00:23 -0500, Mathieu Desnoyers wrote:
> > Note for Peter Zijlstra :
> > You should probably have a look at lockdep.c raw_spinlock_t lockdep_lock usage.
> > I suspect it may be used with preemption enabled in graph_lock(). (not sure
> > though, but it's worth double-checking.)
>
> Are you worried about the graph_lock() instance in
> lookup_chain_cache() ?
>
> That is the locking for validate_chain,
>
> __lock_acquire()
> validate_chain()
> lookup_chain_cache()
> graph_lock()
> check_prevs_add()
> check_prev_add()
> graph_unlock()
> graph_lock()
> graph_unlock()
>
> which is all done without modifying IRQ state.
>
> However, __lock_acquire() is only called with IRQs disabled:
>
> lock_acquire()
> raw_local_irq_save()
> __lock_acquire()
>
> lock_release()
> raw_local_irq_save()
> __lock_release()
> lock_release_nested()
> __lock_acquire()
> lock_release_non_nested()
> __lock_acquire()
>
> lock_set_subclass()
> raw_local_irq_save()
> __lock_set_subclass()
> __lock_acquire()
>
> So I think we're good.
>
Yep, looks good.
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 17:26 ` Andrew Morton
@ 2008-11-07 18:00 ` Mathieu Desnoyers
2008-11-07 18:21 ` Andrew Morton
0 siblings, 1 reply; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 18:00 UTC (permalink / raw)
To: Andrew Morton
Cc: David Howells, Nicolas Pitre, Linus Torvalds, Ingo Molnar,
Peter Zijlstra, linux-kernel, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
* Andrew Morton (akpm@linux-foundation.org) wrote:
> On Fri, 07 Nov 2008 17:10:00 +0000 David Howells <dhowells@redhat.com> wrote:
>
> >
> > Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > > I'd expect it to behave in the same way as it would if the function was
> > > implemented out-of-line.
> > >
> > > But it occurs to me that the modprobe-doesn't-work thing would happen if
> > > the function _is_ inlined anyway, so we won't be doing that.
> > >
> > > Whatever. Killing this many puppies because gcc may do something so
> > > bizarrely wrong isn't justifiable.
> >
> > With gcc, you get one instance of the static variable from inside a static
> > (inline or outofline) function per .o file that invokes it, and these do not
> > merge even though they're common symbols. I asked around and the opinion
> > seems to be that this is correct C. I suppose it's the equivalent of cutting
> > and pasting a function between several files - why should the compiler assume
> > it's the same function in each?
> >
>
> OK, thanks, I guess that makes sense. For static inline. I wonder if
> `extern inline' or plain old `inline' should change it.
>
> It's one of those things I hope I never need to know about, but perhaps
> we do somewhere have static storage in an inline. Wouldn't surprise
> me, and I bet that if we do, it's a bug.
Tracepoints actually use that. It could be changed so they use :
DECLARE_TRACE() (in include/trace/group.h)
DEFINE_TRACE() (in the appropriate kernel c file)
trace_somename(); (in the code)
instead. That would actually make more sense and remove the need for
multiple declarations when the same tracepoint name is used in many
spots (this is a problem kmemtrace has, it generates a lot of tracepoint
declarations).
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 04/18] get_cycles() : powerpc64 HAVE_GET_CYCLES
2008-11-07 14:56 ` Josh Boyer
@ 2008-11-07 18:14 ` Mathieu Desnoyers
0 siblings, 0 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 18:14 UTC (permalink / raw)
To: Josh Boyer
Cc: Linus Torvalds, akpm, Ingo Molnar, Peter Zijlstra, linux-kernel,
benh, paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
* Josh Boyer (jwboyer@linux.vnet.ibm.com) wrote:
> On Fri, Nov 07, 2008 at 12:23:40AM -0500, Mathieu Desnoyers wrote:
> >This patch selects HAVE_GET_CYCLES and makes sure get_cycles_barrier() and
> >get_cycles_rate() are implemented.
> >
> >Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> >CC: benh@kernel.crashing.org
> >CC: paulus@samba.org
> >CC: David Miller <davem@davemloft.net>
> >CC: Linus Torvalds <torvalds@linux-foundation.org>
> >CC: Andrew Morton <akpm@linux-foundation.org>
> >CC: Ingo Molnar <mingo@redhat.com>
> >CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> >CC: Thomas Gleixner <tglx@linutronix.de>
> >CC: Steven Rostedt <rostedt@goodmis.org>
> >CC: linux-arch@vger.kernel.org
> >---
> > arch/powerpc/Kconfig | 1 +
> > arch/powerpc/include/asm/timex.h | 11 +++++++++++
> > 2 files changed, 12 insertions(+)
> >
> >Index: linux.trees.git/arch/powerpc/Kconfig
> >===================================================================
> >--- linux.trees.git.orig/arch/powerpc/Kconfig 2008-11-07 00:09:44.000000000 -0500
> >+++ linux.trees.git/arch/powerpc/Kconfig 2008-11-07 00:09:46.000000000 -0500
> >@@ -121,6 +121,7 @@ config PPC
> > select HAVE_DMA_ATTRS if PPC64
> > select USE_GENERIC_SMP_HELPERS if SMP
> > select HAVE_OPROFILE
> >+ select HAVE_GET_CYCLES if PPC64
>
> So maybe it's just me because it's Friday and I'm on vacation, but I don't
> see anything overly specific to ppc64 here. In fact, you use get_cycles_rate
> for all of powerpc in a later patch in the series.
>
> Is there something special about HAVE_GET_CYCLES that I'm missing that would
> make it only apply to ppc64 and not ppc32?
>
> josh
Hi Josh,
powerpc32 only uses the 32 LSBs for the TSC in the current get_cycles()
implementation. We could either define HAVE_GET_CYCLES_32 like I did on
mips32, or change get_cycles so it also reads the 32 MSBs in a loop like
this (it does not take care of the CPU_FTR_CELL_TB_BUG though) :
typedef unsigned long long cycles_t;

cycles_t get_cycles_ppc32(void)
{
	union {
		cycles_t v;
		struct {
			u32 ms32, ls32;	/* powerpc is big endian */
		} s;
	} cycles;

	do {
		cycles.s.ms32 = mftbu();	/* upper 32 bits of the timebase */
		cycles.s.ls32 = mftbl();	/* lower 32 bits */
	} while (cycles.s.ms32 != mftbu());	/* retry if the upper word changed */
	return cycles.v;
}
I'd prefer this second solution. If one needs a specific get_cycles() to
be only 32-bits (but really really fast) for the scheduler, then this
could be a get_cycles_sched() or something like this which does not
guarantee that it returns full 64-bits...
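(For illustration, such a fast 32-bit variant could be as simple as the
sketch below; get_cycles_sched() is a hypothetical name.)
static inline u32 get_cycles_sched(void)
{
	return mftbl();		/* lower 32 bits only, no coherency loop */
}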
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 07/18] Trace clock core
2008-11-07 16:19 ` Andrew Morton
@ 2008-11-07 18:16 ` Mathieu Desnoyers
0 siblings, 0 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 18:16 UTC (permalink / raw)
To: Andrew Morton
Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
Nicolas Pitre, Ralf Baechle, benh, paulus, David Miller,
Ingo Molnar, Thomas Gleixner, Steven Rostedt, linux-arch
* Andrew Morton (akpm@linux-foundation.org) wrote:
> On Fri, 7 Nov 2008 11:12:38 -0500 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
>
> > * Andrew Morton (akpm@linux-foundation.org) wrote:
> > > On Fri, 7 Nov 2008 01:16:43 -0500 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > >
> > > > > Is there something we should be fixing in m68k?
> > > > >
> > > >
> > > > Yes, but I fear it's going to go deep into include hell :-(
> > >
> > > Oh, OK. I thought that the comment meant that m68k's on_each_cpu()
> > > behaves differently at runtime from other architectures (and wrongly).
> > >
> > > If it's just some compile-time #include snafu then that's far less
> > > of a concern.
> > >
> >
> > Should I simply remove this comment then ?
> >
>
> umm, it could perhaps be clarified - mention that it's needed for an
> include order problem.
>
> It's a bit odd. Surely by the time we've included these:
>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/delay.h>
> +#include <linux/timer.h>
> +#include <linux/workqueue.h>
> +#include <linux/cpu.h>
> +#include <linux/timex.h>
> +#include <linux/bitops.h>
> +#include <linux/trace-clock.h>
> +#include <linux/smp.h>
>
> someone has already included sched.h, and the definition of
> _LINUX_SCHED_H will cause the later inclusion to not change anything?
>
Maybe now it's ok, but in the past, sched.h was not included..
surprisingly.
I'll just write a clearer comment.
Thanks,
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 18:00 ` Mathieu Desnoyers
@ 2008-11-07 18:21 ` Andrew Morton
2008-11-07 18:30 ` Harvey Harrison
` (2 more replies)
0 siblings, 3 replies; 118+ messages in thread
From: Andrew Morton @ 2008-11-07 18:21 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: dhowells, nico, torvalds, mingo, a.p.zijlstra, linux-kernel,
ralf, benh, paulus, davem, mingo, tglx, rostedt, linux-arch
On Fri, 7 Nov 2008 13:00:41 -0500
Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> * Andrew Morton (akpm@linux-foundation.org) wrote:
> > On Fri, 07 Nov 2008 17:10:00 +0000 David Howells <dhowells@redhat.com> wrote:
> >
> > >
> > > Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > > I'd expect it to behave in the same way as it would if the function was
> > > > implemented out-of-line.
> > > >
> > > > But it occurs to me that the modrobe-doesnt-work thing would happen if
> > > > the function _is_ inlined anyway, so we won't be doing that.
> > > >
> > > > Whatever. Killing this many puppies because gcc may do something so
> > > > bizarrely wrong isn't justifiable.
> > >
> > > With gcc, you get one instance of the static variable from inside a static
> > > (inline or outofline) function per .o file that invokes it, and these do not
> > > merge even though they're common symbols. I asked around and the opinion
> > > seems to be that this is correct C. I suppose it's the equivalent of cutting
> > > and pasting a function between several files - why should the compiler assume
> > > it's the same function in each?
> > >
> >
> > OK, thanks, I guess that makes sense. For static inline. I wonder if
> > `extern inline' or plain old `inline' should change it.
> >
> > It's one of those things I hope I never need to know about, but perhaps
> > we do somewhere have static storage in an inline. Wouldn't surprise
> > me, and I bet that if we do, it's a bug.
>
> Tracepoints actually use that.
Referring to include/linux/tracepoint.h:DEFINE_TRACE()?
It does look a bit fragile. Does every .c file which included
include/trace/block.h get a copy of __tracepoint_block_rq_issue,
whether or not it used that tracepoint? Hopefully not.
> It could be changed so they use :
>
> DECLARE_TRACE() (in include/trace/group.h)
> DEFINE_TRACE() (in the appropriate kernel c file)
> trace_somename(); (in the code)
>
> instead. That would actually make more sense and remove the need for
> multiple declarations when the same tracepoint name is used in many
> spots (this is a problem kmemtrace has, it generates a lot of tracepoint
> declarations).
I'm unsure of the requirements here. Do you _want_ each call to
trace_block_rq_issue() to share some in-memory state? If so then yes,
there's a problem with calls to trace_block_rq_issue() from within
separate compilation units.
otoh, if all calls to trace_block_rq_issue() are supposed to have
independent state (which seems to be the case) then that could be
addressed by making trace_block_rq_issue() a macro which defines
static storage, as cnt32_to_63() shouldn't have done ;)
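As a plain C illustration of the duplication being discussed (hypothetical
file and symbol names, just to show the gcc behaviour described above) :
/* counter.h */
static inline int bump(void)
{
	static int count;	/* one distinct copy per .c file that uses bump() */
	return ++count;
}

/* If a.c and b.c both #include "counter.h" and call bump(), each
 * translation unit gets its own 'count', so the two files do not share
 * state -- the same thing happens with static storage in cnt32_to_63(). */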
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 18:21 ` Andrew Morton
@ 2008-11-07 18:30 ` Harvey Harrison
2008-11-07 18:42 ` Mathieu Desnoyers
2008-11-07 18:33 ` Mathieu Desnoyers
2008-11-07 18:36 ` Linus Torvalds
2 siblings, 1 reply; 118+ messages in thread
From: Harvey Harrison @ 2008-11-07 18:30 UTC (permalink / raw)
To: Andrew Morton
Cc: Mathieu Desnoyers, dhowells, nico, torvalds, mingo, a.p.zijlstra,
linux-kernel, ralf, benh, paulus, davem, mingo, tglx, rostedt,
linux-arch
On Fri, 2008-11-07 at 10:21 -0800, Andrew Morton wrote:
> On Fri, 7 Nov 2008 13:00:41 -0500
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > > It's one of those things I hope I never need to know about, but perhaps
> > > we do somewhere have static storage in an inline. Wouldn't surprise
> > > me, and I bet that if we do, it's a bug.
> >
> > Tracepoints actually use that.
>
> Referring to include/linux/tracepoint.h:DEFINE_TRACE()?
>
> It does look a bit fragile. Does every .c file which included
> include/trace/block.h get a copy of __tracepoint_block_rq_issue,
> whether or not it used that tracepoint? Hopefully not.
>
> > It could be changed so they use :
> >
> > DECLARE_TRACE() (in include/trace/group.h)
> > DEFINE_TRACE() (in the appropriate kernel c file)
> > trace_somename(); (in the code)
> >
> > instead. That would actually make more sense and remove the need for
> > multiple declarations when the same tracepoint name is used in many
> > spots (this is a problem kmemtrace has, it generates a lot of tracepoint
> > declarations).
Could this scheme also help with the thousands of sparse warnings that
kmemtrace produces because of the current arrangement, all of the form:
include/linux/kmemtrace.h:33:2: warning: Initializer entry defined twice
include/linux/kmemtrace.h:33:2: also defined here
As you could have unique names for the tracepoints now, rather than the
'unique' static storage? Or am I off-base here?
Harvey
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 18:21 ` Andrew Morton
2008-11-07 18:30 ` Harvey Harrison
@ 2008-11-07 18:33 ` Mathieu Desnoyers
2008-11-07 18:36 ` Linus Torvalds
2 siblings, 0 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 18:33 UTC (permalink / raw)
To: Andrew Morton
Cc: dhowells, nico, torvalds, mingo, a.p.zijlstra, linux-kernel,
ralf, benh, paulus, davem, mingo, tglx, rostedt, linux-arch
* Andrew Morton (akpm@linux-foundation.org) wrote:
> On Fri, 7 Nov 2008 13:00:41 -0500
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
>
> > * Andrew Morton (akpm@linux-foundation.org) wrote:
> > > On Fri, 07 Nov 2008 17:10:00 +0000 David Howells <dhowells@redhat.com> wrote:
> > >
> > > >
> > > > Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >
> > > > > I'd expect it to behave in the same way as it would if the function was
> > > > > implemented out-of-line.
> > > > >
> > > > > But it occurs to me that the modrobe-doesnt-work thing would happen if
> > > > > the function _is_ inlined anyway, so we won't be doing that.
> > > > >
> > > > > Whatever. Killing this many puppies because gcc may do something so
> > > > > bizarrely wrong isn't justifiable.
> > > >
> > > > With gcc, you get one instance of the static variable from inside a static
> > > > (inline or outofline) function per .o file that invokes it, and these do not
> > > > merge even though they're common symbols. I asked around and the opinion
> > > > seems to be that this is correct C. I suppose it's the equivalent of cutting
> > > > and pasting a function between several files - why should the compiler assume
> > > > it's the same function in each?
> > > >
> > >
> > > OK, thanks, I guess that makes sense. For static inline. I wonder if
> > > `extern inline' or plain old `inline' should change it.
> > >
> > > It's one of those things I hope I never need to know about, but perhaps
> > > we do somewhere have static storage in an inline. Wouldn't surprise
> > > me, and I bet that if we do, it's a bug.
> >
> > Tracepoints actually use that.
>
> Referring to include/linux/tracepoint.h:DEFINE_TRACE()?
>
> It does look a bit fragile. Does every .c file which included
> include/trace/block.h get a copy of __tracepoint_block_rq_issue,
> whether or not it used that tracepoint? Hopefully not.
>
No, __tracepoint_block_rq_issue is only instantiated if the static
inline function is used. One instance per use.
> > It could be changed so they use :
> >
> > DECLARE_TRACE() (in include/trace/group.h)
> > DEFINE_TRACE() (in the appropriate kernel c file)
> > trace_somename(); (in the code)
> >
> > instead. That would actually make more sense and remove the need for
> > multiple declarations when the same tracepoint name is used in many
> > spots (this is a problem kmemtrace has, it generates a lot of tracepoint
> > declarations).
>
> I'm unsure of the requirements here. Do you _want_ each call to
> trace_block_rq_issue() to share some in-memory state? If so then yes,
> there's a problem with calls to trace_block_rq_issue() from within
> separate compilation units.
>
> otoh, if all calls to trace_block_rq_issue() are supposed to have
> independent state (which seems to be the case) then that could be
> addressed by making trace_block_rq_issue() a macro which defines
> static storage, as cnt32_to_63() shouldn't have done ;)
>
They could share the same data, given it *has* to be the same. I'll
try to fix this.
Mathieu
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 18:21 ` Andrew Morton
2008-11-07 18:30 ` Harvey Harrison
2008-11-07 18:33 ` Mathieu Desnoyers
@ 2008-11-07 18:36 ` Linus Torvalds
2008-11-07 18:45 ` Andrew Morton
2 siblings, 1 reply; 118+ messages in thread
From: Linus Torvalds @ 2008-11-07 18:36 UTC (permalink / raw)
To: Andrew Morton
Cc: Mathieu Desnoyers, dhowells, nico, mingo, a.p.zijlstra,
linux-kernel, ralf, benh, paulus, davem, mingo, tglx, rostedt,
linux-arch
On Fri, 7 Nov 2008, Andrew Morton wrote:
>
> Referring to include/linux/tracepoint.h:DEFINE_TRACE()?
>
> It does look a bit fragile. Does every .c file which included
> include/trace/block.h get a copy of __tracepoint_block_rq_issue,
> whether or not it used that tracepoint? Hopefully not.
Look at "ratelimit()" too. Broken, broken. Of course, I don't think it's
actually _used_ anywhere, so..
Linus
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 18:30 ` Harvey Harrison
@ 2008-11-07 18:42 ` Mathieu Desnoyers
0 siblings, 0 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 18:42 UTC (permalink / raw)
To: Harvey Harrison
Cc: Andrew Morton, dhowells, nico, torvalds, mingo, a.p.zijlstra,
linux-kernel, ralf, benh, paulus, davem, mingo, tglx, rostedt,
linux-arch
* Harvey Harrison (harvey.harrison@gmail.com) wrote:
> On Fri, 2008-11-07 at 10:21 -0800, Andrew Morton wrote:
> > On Fri, 7 Nov 2008 13:00:41 -0500
> > Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > > > It's one of those things I hope I never need to know about, but perhaps
> > > > we do somewhere have static storage in an inline. Wouldn't surprise
> > > > me, and I bet that if we do, it's a bug.
> > >
> > > Tracepoints actually use that.
> >
> > Referring to include/linux/tracepoint.h:DEFINE_TRACE()?
> >
> > It does look a bit fragile. Does every .c file which included
> > include/trace/block.h get a copy of __tracepoint_block_rq_issue,
> > whether or not it used that tracepoint? Hopefully not.
> >
> > > It could be changed so they use :
> > >
> > > DECLARE_TRACE() (in include/trace/group.h)
> > > DEFINE_TRACE() (in the appropriate kernel c file)
> > > trace_somename(); (in the code)
> > >
> > > instead. That would actually make more sense and remove the need for
> > > multiple declarations when the same tracepoint name is used in many
> > > spots (this is a problem kmemtrace has, it generates a lot of tracepoint
> > > declarations).
>
> Could this scheme also help with the thousands of sparse warnings that
> kmemtrace produces because of the current arrangement, all of the form:
>
> include/linux/kmemtrace.h:33:2: warning: Initializer entry defined twice
> include/linux/kmemtrace.h:33:2: also defined here
>
> As you could have unique names for the tracepoints now, rather than the
> 'unique' static storage? Or am I off-base here?
>
Exactly.
Mathieu
> Harvey
>
>
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 18:36 ` Linus Torvalds
@ 2008-11-07 18:45 ` Andrew Morton
0 siblings, 0 replies; 118+ messages in thread
From: Andrew Morton @ 2008-11-07 18:45 UTC (permalink / raw)
To: Linus Torvalds
Cc: mathieu.desnoyers, dhowells, nico, mingo, a.p.zijlstra,
linux-kernel, ralf, benh, paulus, davem, mingo, tglx, rostedt,
linux-arch, Dave Young
On Fri, 7 Nov 2008 10:36:27 -0800 (PST)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Fri, 7 Nov 2008, Andrew Morton wrote:
> >
> > Referring to include/linux/tracepoint.h:DEFINE_TRACE()?
> >
> > It does look a bit fragile. Does every .c file which included
> > include/trace/block.h get a copy of __tracepoint_block_rq_issue,
> > whether or not it used that tracepoint? Hopefully not.
>
> Look at "ratelimit()" too. Broken, broken.
Yup. Easy enough to fix, but...
> Of course, I don't think it's
> actually _used_ anywhere, so..
>
removing it altogether would be best, I think. It's a bit of an odd
thing.
I'll see what this
--- a/include/linux/ratelimit.h~a
+++ a/include/linux/ratelimit.h
@@ -17,11 +17,4 @@ struct ratelimit_state {
struct ratelimit_state name = {interval, burst,}
extern int __ratelimit(struct ratelimit_state *rs);
-
-static inline int ratelimit(void)
-{
- static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
- return __ratelimit(&rs);
-}
#endif
breaks.
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 17:33 ` Steven Rostedt
@ 2008-11-07 19:18 ` Mathieu Desnoyers
2008-11-07 19:32 ` Peter Zijlstra
0 siblings, 1 reply; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 19:18 UTC (permalink / raw)
To: Steven Rostedt
Cc: David Howells, Paul E. McKenney, Linus Torvalds, akpm,
Ingo Molnar, Peter Zijlstra, linux-kernel, Nicolas Pitre,
Ralf Baechle, benh, paulus, David Miller, Ingo Molnar,
Thomas Gleixner, linux-arch
* Steven Rostedt (rostedt@goodmis.org) wrote:
>
> On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> >
> > __m_cnt_hi
> > is read before
> > mmio cnt_lo read
> >
> > for the detailed reasons explained in my previous discussion with
> > Nicolas here :
> > http://lkml.org/lkml/2008/10/21/1
> >
> > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > be required so it works also on UP systems safely wrt interrupts).
>
> smp_rmb turns into a compiler barrier on UP and should prevent the below
> description.
>
Ah, right, preserving program order on UP should be enough. smp_rmb()
then.
Thanks,
Mathieu
> -- Steve
>
> >
> > The write side is between the hardware counter, which is assumed to
> > increment monotonically between each read, and the value __m_cnt_hi
> > updated by the CPU. I don't see where we could put a wmb() there.
> >
> > Without barrier, the smp race looks as follow :
> >
> >
> > CPU A                                   B
> > read hw cnt low (0xFFFFFFFA)
> >                                         read __m_cnt_hi (0x80000000)
> >                                         read hw cnt low (0x00000001)
> >                                         (wrap detected :
> >                                          (s32)(0x80000000 ^ 0x1) < 0)
> >                                         write __m_cnt_hi = 0x00000001
> > read __m_cnt_hi (0x00000001)
> >                                         return 0x0000000100000001
> > (wrap detected :
> >  (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
> > write __m_cnt_hi = 0x80000001
> > return 0x80000001FFFFFFFA
> > (time jumps)
> >
> > And UP interrupt race :
> >
> > Thread context                          Interrupts
> > read hw cnt low (0xFFFFFFFA)
> >                                         read __m_cnt_hi (0x80000000)
> >                                         read hw cnt low (0x00000001)
> >                                         (wrap detected :
> >                                          (s32)(0x80000000 ^ 0x1) < 0)
> >                                         write __m_cnt_hi = 0x00000001
> > read __m_cnt_hi (0x00000001)
> >                                         return 0x0000000100000001
> > (wrap detected :
> >  (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
> > write __m_cnt_hi = 0x80000001
> > return 0x80000001FFFFFFFA
> > (time jumps)
> >
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 19:18 ` Mathieu Desnoyers
@ 2008-11-07 19:32 ` Peter Zijlstra
2008-11-07 20:02 ` Mathieu Desnoyers
0 siblings, 1 reply; 118+ messages in thread
From: Peter Zijlstra @ 2008-11-07 19:32 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, David Howells, Paul E. McKenney, Linus Torvalds,
akpm, Ingo Molnar, linux-kernel, Nicolas Pitre, Ralf Baechle,
benh, paulus, David Miller, Ingo Molnar, Thomas Gleixner,
linux-arch
On Fri, 2008-11-07 at 14:18 -0500, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:
> >
> > On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> > >
> > > __m_cnt_hi
> > > is read before
> > > mmio cnt_lo read
> > >
> > > for the detailed reasons explained in my previous discussion with
> > > Nicolas here :
> > > http://lkml.org/lkml/2008/10/21/1
> > >
> > > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > > be required so it works also on UP systems safely wrt interrupts).
> >
> > smp_rmb turns into a compiler barrier on UP and should prevent the below
> > description.
> >
>
> Ah, right, preserving program order on UP should be enough. smp_rmb()
> then.
I'm not quite sure I'm following here. Is this a global hardware clock
you're reading from multiple cpus? If so, are you sure smp_rmb() will
indeed be enough to sync the read?
(In which case the smp_wmb() is provided by the hardware increasing the
clock?)
If these are per-cpu clocks then even in the smp case we'd be good with
a plain barrier() because you'd only ever want to read your own cpu's
clock (and have a separate __m_cnt_hi per cpu).
Or am I totally missing out on something?
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 19:32 ` Peter Zijlstra
@ 2008-11-07 20:02 ` Mathieu Desnoyers
2008-11-07 20:45 ` Mathieu Desnoyers
0 siblings, 1 reply; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 20:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, David Howells, Paul E. McKenney, Linus Torvalds,
akpm, Ingo Molnar, linux-kernel, Nicolas Pitre, Ralf Baechle,
benh, paulus, David Miller, Ingo Molnar, Thomas Gleixner,
linux-arch
* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Fri, 2008-11-07 at 14:18 -0500, Mathieu Desnoyers wrote:
> > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > >
> > > On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> > > >
> > > > __m_cnt_hi
> > > > is read before
> > > > mmio cnt_lo read
> > > >
> > > > for the detailed reasons explained in my previous discussion with
> > > > Nicolas here :
> > > > http://lkml.org/lkml/2008/10/21/1
> > > >
> > > > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > > > be required so it works also on UP systems safely wrt interrupts).
> > >
> > > smp_rmb turns into a compiler barrier on UP and should prevent the below
> > > description.
> > >
> >
> > Ah, right, preserving program order on UP should be enough. smp_rmb()
> > then.
>
>
> I'm not quite sure I'm following here. Is this a global hardware clock
> you're reading from multiple cpus, if so, are you sure smp_rmb() will
> indeed be enough to sync the read?
>
> (In which case the smp_wmb() is provided by the hardware increasing the
> clock?)
>
> If these are per-cpu clocks then even in the smp case we'd be good with
> a plain barrier() because you'd only ever want to read your own cpu's
> clock (and have a separate __m_cnt_hi per cpu).
>
> Or am I totally missing out on something?
>
This is the global hardware clock scenario.
We have to order an uncached mmio read wrt a cached variable read/write.
The uncached mmio read vs smp_rmb() barrier (e.g. lfence instruction)
should be ensured by program order because the read will skip the cache
and go directly to the bus. Luckily we only do a mmio read and no mmio
write, so mmiowb() is not required.
You might be right in that it could require more barriers.
Given adequate program order, we can assume that the mmio read will
happen "on the spot", but that the cached read may be delayed.
What we want is :
readl(io_addr)
read __m_cnt_hi
write __m_cnt_hi
With the two reads in the correct order. If we consider two consecutive
executions on the same CPU :
readl(io_addr)
read __m_cnt_hi
write __m_cnt_hi
readl(io_addr)
read __m_cnt_hi
write __m_cnt_hi
We might have to order the read/write pair wrt the following readl, such
as :
smp_rmb(); /* Waits for every cached memory reads to complete */
readl(io_addr);
barrier(); /* Make sure the compiler leaves mmio read before cached read */
read __m_cnt_hi
write __m_cnt_hi
smp_rmb(); /* Waits for every cached memory reads to complete */
readl(io_addr)
barrier(); /* Make sure the compiler leaves mmio read before cached read */
read __m_cnt_hi
write __m_cnt_hi
Would that make more sense ?
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 17:21 ` Andrew Morton
@ 2008-11-07 20:03 ` Nicolas Pitre
0 siblings, 0 replies; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-07 20:03 UTC (permalink / raw)
To: Andrew Morton
Cc: David Howells, Mathieu Desnoyers, Linus Torvalds, Ingo Molnar,
Peter Zijlstra, linux-kernel, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
On Fri, 7 Nov 2008, Andrew Morton wrote:
> On Fri, 07 Nov 2008 11:47:47 -0500 (EST) Nicolas Pitre <nico@cam.org> wrote:
>
> > > btw, do you know how damned irritating and frustrating it is for a code
> > > reviewer to have his comments deliberately ignored and deleted in
> > > replies?
> >
> > Do you know how irritating and frustrating it is when reviewers don't
> > care reading the damn comments along with the code?
>
> As you still seek to ignore it, I shall repeat my earlier question.
> Please do not delete it again.
>
> It apparently tries to avoid races via ordering tricks, as long
> as it is called with sufficient frequency. But nothing guarantees
> that it _is_ called sufficiently frequently?
>
> Things like tickless kernels and SCHED_RR can surely cause
> sched_clock() to not be called for arbitrary periods.
On the machines this was initially written for, the critical period is
in the order of minutes. And if you're afraid you might lack enough
scheduling activities for that long, you simply have to keep the
algorithm "warm" with a simple kernel timer which only purpose is to
ensure it is called often enough.
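A minimal sketch of such a keep-warm timer (hypothetical names; it assumes
sched_clock() is the cnt32_to_63() caller that must be kept warm) :
static struct timer_list keepwarm_timer;

static void keepwarm(unsigned long data)
{
	(void)sched_clock();	/* forces cnt32_to_63() to sample the counter */
	/* re-arm well within the counter half-period */
	mod_timer(&keepwarm_timer, jiffies + HZ);
}

static void keepwarm_init(void)
{
	setup_timer(&keepwarm_timer, keepwarm, 0);
	mod_timer(&keepwarm_timer, jiffies + HZ);
}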
> Userspace cli() will definitely do this, but it is expected to break
> stuff and is not as legitimate a thing to do.
Why do you bring it on then?
> I'm just giving up on the tastefulness issue.
Taste is a pretty subjective matter.
Nicolas
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 17:09 ` Mathieu Desnoyers
2008-11-07 17:33 ` Steven Rostedt
@ 2008-11-07 20:08 ` Steven Rostedt
2008-11-07 20:55 ` Paul E. McKenney
2008-11-07 21:27 ` Mathieu Desnoyers
2008-11-07 20:36 ` Nicolas Pitre
2008-11-07 23:50 ` David Howells
3 siblings, 2 replies; 118+ messages in thread
From: Steven Rostedt @ 2008-11-07 20:08 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: David Howells, Paul E. McKenney, Linus Torvalds, akpm,
Ingo Molnar, Peter Zijlstra, linux-kernel, Nicolas Pitre,
Ralf Baechle, benh, paulus, David Miller, Ingo Molnar,
Thomas Gleixner, linux-arch
On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
>
> I want to make sure
>
> __m_cnt_hi
> is read before
> mmio cnt_lo read
Hmm, let me make sure I understand why there is no wmb.
Paul, can you verify this?
Mathieu, you do the following:
read a
smp_rmb
reab b
if (test b)
write a
So the idea is that you must read b to test it. And since we must read a
before reading b we can see that we write a before either?
The question remains, can the write happen before either of the reads?
But since the read b is reading the hw clock, perhaps that just implies a
wmb on the hardware side?
-- Steve
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 16:47 ` Mathieu Desnoyers
@ 2008-11-07 20:11 ` Russell King
2008-11-07 21:36 ` Mathieu Desnoyers
0 siblings, 1 reply; 118+ messages in thread
From: Russell King @ 2008-11-07 20:11 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: David Howells, Andrew Morton, Nicolas Pitre, Linus Torvalds,
Ingo Molnar, Peter Zijlstra, linux-kernel, Ralf Baechle, benh,
paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
On Fri, Nov 07, 2008 at 11:47:58AM -0500, Mathieu Desnoyers wrote:
> But any get_cycles() user of cnt32_to_63() should be shot down. The
> bright side is : there is no way get_cycles() can be used with this
> new code. :)
>
> e.g. of incorrect users for arm (unless they are UP only, but that seems
> like a weird design argument) :
>
> mach-sa1100/include/mach/SA-1100.h:#define OSCR __REG(0x90000010)
> /* OS timer Counter Reg. */
> mach-sa1100/generic.c: unsigned long long v = cnt32_to_63(OSCR);
> mach-pxa/include/mach/pxa-regs.h:#define OSCR __REG(0x40A00010) /* OS
> Timer Counter Register */
> mach-pxa/time.c: unsigned long long v = cnt32_to_63(OSCR);
It's strange for you to make that assertion when PXA was the exact
platform that Nicolas created this code for - and that's a platform
where preempt has been widely used.
The two you mention are both ARMv5 or older architectures, and the
first real SMP ARM architecture is ARMv6. So architecturally they
are UP only.
So, tell me why you say "unless they are UP only, but that seems like
a weird design argument"? If the platforms can only ever be UP only,
what's wrong with UP only code being used with them? (Not that I'm
saying anything there about cnt32_to_63.)
I'd like to see you modify the silicon of a PXA or SA11x0 SoC to add
more than one processor to the chip - maybe you could use evostick to
glue two dies together and a microscope to aid bonding wires between
the two? (Of course, you'd need to design something to ensure cache
coherence as well, and arbitrate the internal bus between the two
dies.) ;)
Personally, I think that's highly unlikely.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 16:51 ` Mathieu Desnoyers
@ 2008-11-07 20:18 ` Nicolas Pitre
2008-11-07 23:55 ` David Howells
1 sibling, 0 replies; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-07 20:18 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: David Howells, Andrew Morton, Linus Torvalds, Ingo Molnar,
Peter Zijlstra, linux-kernel, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> * David Howells (dhowells@redhat.com) wrote:
> > Nicolas Pitre <nico@cam.org> wrote:
> >
> > > > I mean, the darned thing is called from sched_clock(), which can be
> > > > concurrently called on separate CPUs and which can be called from
> > > > interrupt context (with an arbitrary nesting level!) while it was running
> > > > in process context.
> > >
> > > Yes! And this is so on *purpose*. Please take some time to read the
> > > comment that goes along with it, and if you're still not convinced then
> > > look for those explanation emails I've already posted.
> >
> > I agree with Nicolas on this. It's abominably clever, but I think he's right.
> >
> > The one place I remain unconvinced is over the issue of preemption of a process
> > that is in the middle of cnt32_to_63(), where if the preempted process is
> > asleep for long enough, I think it can wind time backwards when it resumes, but
> > that's not a problem for the one place I want to use it (sched_clock()) because
> > that is (almost) always called with preemption disabled in one way or another.
> >
> > The one place it isn't is a debugging case that I'm not too worried about.
> >
>
> I am also concerned about the non-preemption off case.
>
> Then I think the function should document that it must be called with
> preempt disabled.
I explained several times already why I disagree. Preemption is not a
problem unless you're preempted away for long enough, or IOW if your
counter is too fast.
And no, ^Z on a process doesn't create preemption. This is a signal that
gets acted upon far away from the middle of cnt32_to_63().
Nicolas
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 17:09 ` Mathieu Desnoyers
2008-11-07 17:33 ` Steven Rostedt
2008-11-07 20:08 ` Steven Rostedt
@ 2008-11-07 20:36 ` Nicolas Pitre
2008-11-07 20:55 ` Mathieu Desnoyers
2008-11-07 23:50 ` David Howells
3 siblings, 1 reply; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-07 20:36 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: David Howells, Paul E. McKenney, Linus Torvalds, akpm,
Ingo Molnar, Peter Zijlstra, lkml, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> I want to make sure
>
> __m_cnt_hi
> is read before
> mmio cnt_lo read
>
> for the detailed reasons explained in my previous discussion with
> Nicolas here :
> http://lkml.org/lkml/2008/10/21/1
>
> I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> be required so it works also on UP systems safely wrt interrupts).
>
> The write side is between the hardware counter, which is assumed to
> increment monotonically between each read, and the value __m_cnt_hi
> updated by the CPU. I don't see where we could put a wmb() there.
>
> Without barrier, the smp race looks as follow :
>
>
> > CPU A                                   B
> > read hw cnt low (0xFFFFFFFA)
> >                                         read __m_cnt_hi (0x80000000)
> >                                         read hw cnt low (0x00000001)
> >                                         (wrap detected :
> >                                          (s32)(0x80000000 ^ 0x1) < 0)
> >                                         write __m_cnt_hi = 0x00000001
> > read __m_cnt_hi (0x00000001)
> >                                         return 0x0000000100000001
> > (wrap detected :
> >  (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
> > write __m_cnt_hi = 0x80000001
> > return 0x80000001FFFFFFFA
> > (time jumps)
Could you have hardware doing such things? You would get a non cached
and more expensive read on CPU B which is not in program order with the
read that should have happened before, and before that second out of
order read could be performed, you'd have a full sequence in program
order performed on CPU A.
Nicolas
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 20:02 ` Mathieu Desnoyers
@ 2008-11-07 20:45 ` Mathieu Desnoyers
2008-11-07 20:54 ` Paul E. McKenney
0 siblings, 1 reply; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 20:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, David Howells, Paul E. McKenney, Linus Torvalds,
akpm, Ingo Molnar, linux-kernel, Nicolas Pitre, Ralf Baechle,
benh, paulus, David Miller, Ingo Molnar, Thomas Gleixner,
linux-arch
* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > On Fri, 2008-11-07 at 14:18 -0500, Mathieu Desnoyers wrote:
> > > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > >
> > > > On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> > > > >
> > > > > __m_cnt_hi
> > > > > is read before
> > > > > mmio cnt_lo read
> > > > >
> > > > > for the detailed reasons explained in my previous discussion with
> > > > > Nicolas here :
> > > > > http://lkml.org/lkml/2008/10/21/1
> > > > >
> > > > > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > > > > be required so it works also on UP systems safely wrt interrupts).
> > > >
> > > > smp_rmb turns into a compiler barrier on UP and should prevent the below
> > > > description.
> > > >
> > >
> > > Ah, right, preserving program order on UP should be enough. smp_rmb()
> > > then.
> >
> >
> > I'm not quite sure I'm following here. Is this a global hardware clock
> > you're reading from multiple cpus, if so, are you sure smp_rmb() will
> > indeed be enough to sync the read?
> >
> > (In which case the smp_wmb() is provided by the hardware increasing the
> > clock?)
> >
> > If these are per-cpu clocks then even in the smp case we'd be good with
> > a plain barrier() because you'd only ever want to read your own cpu's
> > clock (and have a separate __m_cnt_hi per cpu).
> >
> > Or am I totally missing out on something?
> >
>
> This is the global hardware clock scenario.
>
> We have to order an uncached mmio read wrt a cached variable read/write.
> The uncached mmio read vs smp_rmb() barrier (e.g. lfence instruction)
> should be insured by program order because the read will skip the cache
> and go directly to the bus. Luckily we only do a mmio read and no mmio
> write, so mmiowb() is not required.
>
> You might be right in that it could require more barriers.
>
> Given adequate program order, we can assume the the mmio read will
> happen "on the spot", but that the cached read may be delayed.
>
> What we want is :
>
> readl(io_addr)
> read __m_cnt_hi
> write __m_cnt_hi
>
> With the two reads in the correct order. If we consider two consecutive
> executions on the same CPU :
>
> readl(io_addr)
> read __m_cnt_hi
> write __m_cnt_hi
>
> readl(io_addr)
> read __m_cnt_hi
> write __m_cnt_hi
>
> We might have to order the read/write pair wrt the following readl, such
> as :
>
> smp_rmb(); /* Waits for every cached memory reads to complete */
> readl(io_addr);
> barrier(); /* Make sure the compiler leaves mmio read before cached read */
> read __m_cnt_hi
> write __m_cnt_hi
>
> smp_rmb(); /* Waits for every cached memory reads to complete */
> readl(io_addr)
> barrier(); /* Make sure the compiler leaves mmio read before cached read */
> read __m_cnt_hi
> write __m_cnt_hi
>
> Would that make more sense ?
>
Oh, actually, I got things reversed in this email : the readl(io_addr)
must be done _after_ the __m_cnt_hi read.
Therefore, two consecutive executions would look like :
barrier(); /* Make sure the compiler does not reorder __m_cnt_hi and
previous mmio read. */
read __m_cnt_hi
smp_rmb(); /* Waits for every cached memory reads to complete */
readl(io_addr);
write __m_cnt_hi
barrier(); /* Make sure the compiler does not reorder __m_cnt_hi and
previous mmio read. */
read __m_cnt_hi
smp_rmb(); /* Waits for every cached memory reads to complete */
readl(io_addr);
write __m_cnt_hi
Mathieu
> Mathieu
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 20:45 ` Mathieu Desnoyers
@ 2008-11-07 20:54 ` Paul E. McKenney
2008-11-07 21:04 ` Steven Rostedt
2008-11-07 21:16 ` Mathieu Desnoyers
0 siblings, 2 replies; 118+ messages in thread
From: Paul E. McKenney @ 2008-11-07 20:54 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Peter Zijlstra, Steven Rostedt, David Howells, Linus Torvalds,
akpm, Ingo Molnar, linux-kernel, Nicolas Pitre, Ralf Baechle,
benh, paulus, David Miller, Ingo Molnar, Thomas Gleixner,
linux-arch
On Fri, Nov 07, 2008 at 03:45:46PM -0500, Mathieu Desnoyers wrote:
> * Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> > * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > > On Fri, 2008-11-07 at 14:18 -0500, Mathieu Desnoyers wrote:
> > > > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > >
> > > > > On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> > > > > >
> > > > > > __m_cnt_hi
> > > > > > is read before
> > > > > > mmio cnt_lo read
> > > > > >
> > > > > > for the detailed reasons explained in my previous discussion with
> > > > > > Nicolas here :
> > > > > > http://lkml.org/lkml/2008/10/21/1
> > > > > >
> > > > > > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > > > > > be required so it works also on UP systems safely wrt interrupts).
> > > > >
> > > > > smp_rmb turns into a compiler barrier on UP and should prevent the below
> > > > > description.
> > > > >
> > > >
> > > > Ah, right, preserving program order on UP should be enough. smp_rmb()
> > > > then.
> > >
> > >
> > > I'm not quite sure I'm following here. Is this a global hardware clock
> > > you're reading from multiple cpus, if so, are you sure smp_rmb() will
> > > indeed be enough to sync the read?
> > >
> > > (In which case the smp_wmb() is provided by the hardware increasing the
> > > clock?)
> > >
> > > If these are per-cpu clocks then even in the smp case we'd be good with
> > > a plain barrier() because you'd only ever want to read your own cpu's
> > > clock (and have a separate __m_cnt_hi per cpu).
> > >
> > > Or am I totally missing out on something?
> > >
> >
> > This is the global hardware clock scenario.
> >
> > We have to order an uncached mmio read wrt a cached variable read/write.
> > The uncached mmio read vs smp_rmb() barrier (e.g. lfence instruction)
> > should be insured by program order because the read will skip the cache
> > and go directly to the bus. Luckily we only do a mmio read and no mmio
> > write, so mmiowb() is not required.
> >
> > You might be right in that it could require more barriers.
> >
> > Given adequate program order, we can assume the the mmio read will
> > happen "on the spot", but that the cached read may be delayed.
> >
> > What we want is :
> >
> > readl(io_addr)
> > read __m_cnt_hi
> > write __m_cnt_hi
> >
> > With the two reads in the correct order. If we consider two consecutive
> > executions on the same CPU :
> >
> > readl(io_addr)
> > read __m_cnt_hi
> > write __m_cnt_hi
> >
> > readl(io_addr)
> > read __m_cnt_hi
> > write __m_cnt_hi
> >
> > We might have to order the read/write pair wrt the following readl, such
> > as :
> >
> > smp_rmb(); /* Waits for every cached memory reads to complete */
> > readl(io_addr);
> > barrier(); /* Make sure the compiler leaves mmio read before cached read */
> > read __m_cnt_hi
> > write __m_cnt_hi
> >
> > smp_rmb(); /* Waits for every cached memory reads to complete */
> > readl(io_addr)
> > barrier(); /* Make sure the compiler leaves mmio read before cached read */
> > read __m_cnt_hi
> > write __m_cnt_hi
> >
> > Would that make more sense ?
> >
>
> Oh, actually, I got things reversed in this email : the readl(io_addr)
> must be done _after_ the __m_cnt_hi read.
>
> Therefore, two consecutive executions would look like :
>
> barrier(); /* Make sure the compiler does not reorder __m_cnt_hi and
> previous mmio read. */
> read __m_cnt_hi
> smp_rmb(); /* Waits for every cached memory reads to complete */
If these are MMIO reads, then you need rmb() rather than smp_rmb(),
at least on architectures that can reorder writes (Power, Itanium,
and I believe also ARM, ...).
Thanx, Paul
> readl(io_addr);
> write __m_cnt_hi
>
>
> barrier(); /* Make sure the compiler does not reorder __m_cnt_hi and
> previous mmio read. */
> read __m_cnt_hi
> smp_rmb(); /* Waits for every cached memory reads to complete */
> readl(io_addr);
> write __m_cnt_hi
>
> Mathieu
>
> > Mathieu
> >
> > --
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 20:08 ` Steven Rostedt
@ 2008-11-07 20:55 ` Paul E. McKenney
2008-11-07 21:27 ` Mathieu Desnoyers
1 sibling, 0 replies; 118+ messages in thread
From: Paul E. McKenney @ 2008-11-07 20:55 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, David Howells, Linus Torvalds, akpm,
Ingo Molnar, Peter Zijlstra, linux-kernel, Nicolas Pitre,
Ralf Baechle, benh, paulus, David Miller, Ingo Molnar,
Thomas Gleixner, linux-arch
On Fri, Nov 07, 2008 at 03:08:12PM -0500, Steven Rostedt wrote:
>
> On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> >
> > I want to make sure
> >
> > __m_cnt_hi
> > is read before
> > mmio cnt_lo read
>
> Hmm, let me make sure I understand why there is no wmb.
>
> Paul, can you verify this?
>
> Mathieu, you do the following:
>
> read a
> smp_rmb
> reab b
> if (test b)
> write a
>
> So the idea is that you must read b to test it. And since we must read a
> before reading b we can see that we write a before either?
>
> The question remains, can the write happen before either of the reads?
>
> But since the read b is reading the hw clock, perhaps that just implies a
> wmb on the hardware side?
The hardware must do an equivalent of a wmb(), but this might well be
done in logic or firmware on the device itself.
Thanx, Paul
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 20:36 ` Nicolas Pitre
@ 2008-11-07 20:55 ` Mathieu Desnoyers
2008-11-07 21:22 ` Nicolas Pitre
0 siblings, 1 reply; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 20:55 UTC (permalink / raw)
To: Nicolas Pitre
Cc: David Howells, Paul E. McKenney, Linus Torvalds, akpm,
Ingo Molnar, Peter Zijlstra, lkml, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
* Nicolas Pitre (nico@cam.org) wrote:
> On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
>
> > I want to make sure
> >
> > __m_cnt_hi
> > is read before
> > mmio cnt_lo read
> >
> > for the detailed reasons explained in my previous discussion with
> > Nicolas here :
> > http://lkml.org/lkml/2008/10/21/1
> >
> > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > be required so it works also on UP systems safely wrt interrupts).
> >
> > The write side is between the hardware counter, which is assumed to
> > increment monotonically between each read, and the value __m_cnt_hi
> > updated by the CPU. I don't see where we could put a wmb() there.
> >
> > Without barrier, the smp race looks as follow :
> >
> >
> > CPU A                                   B
> > read hw cnt low (0xFFFFFFFA)
> >                                         read __m_cnt_hi (0x80000000)
> >                                         read hw cnt low (0x00000001)
> >                                         (wrap detected :
> >                                          (s32)(0x80000000 ^ 0x1) < 0)
> >                                         write __m_cnt_hi = 0x00000001
> > read __m_cnt_hi (0x00000001)
> >                                         return 0x0000000100000001
> > (wrap detected :
> >  (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
> > write __m_cnt_hi = 0x80000001
> > return 0x80000001FFFFFFFA
> > (time jumps)
>
> Could you have hardware doing such things? You would get a non cached
> and more expensive read on CPU B which is not in program order with the
> read that should have happened before, and before that second out of
> order read could be performed, you'd have a full sequence in program
> order performed on CPU A.
>
Hrm, yes ? Well, it's the whole point of barriers/cache coherency
mechanisms, out-of-order reads... etc.
First off, read hw cnt low _is_ an uncached memory read (this is the
mmio read). The __m_cnt_hi read is a cached read, and therefore can be
delayed if the cache-line is busy. And we have no control over how much
time can pass between the two reads given the CPU may stall waiting for
a cache-line.
So the scenario above happens if CPU A has __m_cnt_hi in its cacheline,
but for some reason CPU B has to defer the cacheline read of __m_cnt_hi
due to heavy cacheline traffic and decides to proceed with the mmio read
before the cacheline has been brought to the CPU because "hey, there is
no data dependency between those two reads !".
See Documentation/memory-barriers.txt.
Mathieu
>
> Nicolas
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 20:54 ` Paul E. McKenney
@ 2008-11-07 21:04 ` Steven Rostedt
2008-11-08 0:34 ` Paul E. McKenney
2008-11-07 21:16 ` Mathieu Desnoyers
1 sibling, 1 reply; 118+ messages in thread
From: Steven Rostedt @ 2008-11-07 21:04 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Mathieu Desnoyers, Peter Zijlstra, David Howells, Linus Torvalds,
akpm, Ingo Molnar, linux-kernel, Nicolas Pitre, Ralf Baechle,
benh, paulus, David Miller, Ingo Molnar, Thomas Gleixner,
linux-arch
On Fri, 7 Nov 2008, Paul E. McKenney wrote:
> > > Would that make more sense ?
> > >
> >
> > Oh, actually, I got things reversed in this email : the readl(io_addr)
> > must be done _after_ the __m_cnt_hi read.
> >
> > Therefore, two consecutive executions would look like :
> >
> > barrier(); /* Make sure the compiler does not reorder __m_cnt_hi and
> > previous mmio read. */
> > read __m_cnt_hi
> > smp_rmb(); /* Waits for every cached memory reads to complete */
>
> If these are MMIO reads, then you need rmb() rather than smp_rmb(),
> at least on architectures that can reorder writes (Power, Itanium,
> and I believe also ARM, ...).
The read is from a clock source. The only writes that are happening is
by the clock itself.
On a UP system, is a rmb still needed? That is, can you have two reads on
the same CPU from the clock source that will produce a backwards clock?
That to me sounds like the clock interface is broken.
-- Steve
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 20:54 ` Paul E. McKenney
2008-11-07 21:04 ` Steven Rostedt
@ 2008-11-07 21:16 ` Mathieu Desnoyers
1 sibling, 0 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 21:16 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Peter Zijlstra, Steven Rostedt, David Howells, Linus Torvalds,
akpm, Ingo Molnar, linux-kernel, Nicolas Pitre, Ralf Baechle,
benh, paulus, David Miller, Ingo Molnar, Thomas Gleixner,
linux-arch
* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Fri, Nov 07, 2008 at 03:45:46PM -0500, Mathieu Desnoyers wrote:
> > * Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> > > * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > > > On Fri, 2008-11-07 at 14:18 -0500, Mathieu Desnoyers wrote:
> > > > > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > > > >
> > > > > > On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> > > > > > >
> > > > > > > __m_cnt_hi
> > > > > > > is read before
> > > > > > > mmio cnt_lo read
> > > > > > >
> > > > > > > for the detailed reasons explained in my previous discussion with
> > > > > > > Nicolas here :
> > > > > > > http://lkml.org/lkml/2008/10/21/1
> > > > > > >
> > > > > > > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > > > > > > be required so it works also on UP systems safely wrt interrupts).
> > > > > >
> > > > > > smp_rmb turns into a compiler barrier on UP and should prevent the below
> > > > > > description.
> > > > > >
> > > > >
> > > > > Ah, right, preserving program order on UP should be enough. smp_rmb()
> > > > > then.
> > > >
> > > >
> > > > I'm not quite sure I'm following here. Is this a global hardware clock
> > > > you're reading from multiple cpus, if so, are you sure smp_rmb() will
> > > > indeed be enough to sync the read?
> > > >
> > > > (In which case the smp_wmb() is provided by the hardware increasing the
> > > > clock?)
> > > >
> > > > If these are per-cpu clocks then even in the smp case we'd be good with
> > > > a plain barrier() because you'd only ever want to read your own cpu's
> > > > clock (and have a separate __m_cnt_hi per cpu).
> > > >
> > > > Or am I totally missing out on something?
> > > >
> > >
> > > This is the global hardware clock scenario.
> > >
> > > We have to order an uncached mmio read wrt a cached variable read/write.
> > > The uncached mmio read vs smp_rmb() barrier (e.g. lfence instruction)
> > > should be insured by program order because the read will skip the cache
> > > and go directly to the bus. Luckily we only do a mmio read and no mmio
> > > write, so mmiowb() is not required.
> > >
> > > You might be right in that it could require more barriers.
> > >
> > > Given adequate program order, we can assume the the mmio read will
> > > happen "on the spot", but that the cached read may be delayed.
> > >
> > > What we want is :
> > >
> > > readl(io_addr)
> > > read __m_cnt_hi
> > > write __m_cnt_hi
> > >
> > > With the two reads in the correct order. If we consider two consecutive
> > > executions on the same CPU :
> > >
> > > readl(io_addr)
> > > read __m_cnt_hi
> > > write __m_cnt_hi
> > >
> > > readl(io_addr)
> > > read __m_cnt_hi
> > > write __m_cnt_hi
> > >
> > > We might have to order the read/write pair wrt the following readl, such
> > > as :
> > >
> > > smp_rmb(); /* Waits for every cached memory reads to complete */
> > > readl(io_addr);
> > > barrier(); /* Make sure the compiler leaves mmio read before cached read */
> > > read __m_cnt_hi
> > > write __m_cnt_hi
> > >
> > > smp_rmb(); /* Waits for every cached memory reads to complete */
> > > readl(io_addr)
> > > barrier(); /* Make sure the compiler leaves mmio read before cached read */
> > > read __m_cnt_hi
> > > write __m_cnt_hi
> > >
> > > Would that make more sense ?
> > >
> >
> > Oh, actually, I got things reversed in this email : the readl(io_addr)
> > must be done _after_ the __m_cnt_hi read.
> >
> > Therefore, two consecutive executions would look like :
> >
> > barrier(); /* Make sure the compiler does not reorder __m_cnt_hi and
> > previous mmio read. */
> > read __m_cnt_hi
> > smp_rmb(); /* Waits for every cached memory reads to complete */
>
> If these are MMIO reads, then you need rmb() rather than smp_rmb(),
> at least on architectures that can reorder writes (Power, Itanium,
> and I believe also ARM, ...).
>
> Thanx, Paul
>
I just dug into the barrier() question at the beginning of the code. I
think it's not necessary after all, because the worst a compiler could
do is probably the following :
Read nr | code
   1      read a
   1      rmb()
   2      read a  <------ ugh. Compiler could decide to prefetch the a value
                          and only update it if the test is true :(
   1      read b
   1      if (test b) {
   1              write a
   2              read a
          }
   2      rmb()
   2      read b
   2      if (test b)
   2              write a
But it would not mix the order of a/b reads. So I think just the rmb()
would be enough.
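Put together, a minimal sketch of the resulting read sequence (illustration
only : it assumes a hypothetical free-running 32-bit mmio counter at io_addr
and the __m_cnt_hi/xor wrap detection discussed in this thread, not the
actual cnt32_to_63() code) :
static u32 __m_cnt_hi;

static u64 read_cnt_63(void __iomem *io_addr)
{
	u32 hi, lo;

	hi = __m_cnt_hi;	/* cached read, issued first */
	rmb();			/* order it before the uncached mmio read */
	lo = readl(io_addr);	/* mmio read of cnt_lo */
	if ((s32)(hi ^ lo) < 0)	/* bit 31 of lo flipped : half-period wrap */
		__m_cnt_hi = hi = (hi ^ 0x80000000) + (hi >> 31);
	return ((u64)hi << 32) | lo;	/* bit 63 is a flag, masked by the caller */
}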
Mathieu
> > readl(io_addr);
> > write __m_cnt_hi
> >
> >
> > barrier(); /* Make sure the compiler does not reorder __m_cnt_hi and
> > previous mmio read. */
> > read __m_cnt_hi
> > smp_rmb(); /* Waits for every cached memory reads to complete */
> > readl(io_addr);
> > write __m_cnt_hi
> >
> > Mathieu
> >
> > > Mathieu
> > >
> > > --
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
> >
> > --
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 20:55 ` Mathieu Desnoyers
@ 2008-11-07 21:22 ` Nicolas Pitre
0 siblings, 0 replies; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-07 21:22 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: David Howells, Paul E. McKenney, Linus Torvalds, akpm,
Ingo Molnar, Peter Zijlstra, lkml, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> First off, read hw cnt low _is_ an uncached memory read (this is the
> mmio read). __m_cnt_hi is a cached read, and therefore can be delayed if
> the cache-line is busy. And we have no control on how much time can pass
> between the two reads given the CPU may stall waiting for a cache-line.
>
> So the scenario above happens if CPU A have __m_cnt_hi in its cacheline,
> but for come reason CPU B have to defer the cacheline read of __m_cnt_hi
> due to heavy cacheline traffic and decides to proceed to mmio read
> before the cacheline has been brought to the CPU because "hey, there is
> no data dependency between those two reads !".
OK that makes sense.
Nicolas
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 20:08 ` Steven Rostedt
2008-11-07 20:55 ` Paul E. McKenney
@ 2008-11-07 21:27 ` Mathieu Desnoyers
1 sibling, 0 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 21:27 UTC (permalink / raw)
To: Steven Rostedt
Cc: David Howells, Paul E. McKenney, Linus Torvalds, akpm,
Ingo Molnar, Peter Zijlstra, linux-kernel, Nicolas Pitre,
Ralf Baechle, benh, paulus, David Miller, Ingo Molnar,
Thomas Gleixner, linux-arch
* Steven Rostedt (rostedt@goodmis.org) wrote:
>
> On Fri, 7 Nov 2008, Mathieu Desnoyers wrote:
> >
> > I want to make sure
> >
> > __m_cnt_hi
> > is read before
> > mmio cnt_lo read
>
> Hmm, let me make sure I understand why there is no wmb.
>
> Paul, can you verify this?
>
> Mathieu, you do the following:
>
> read a
> smp_rmb
> read b
> if (test b)
> write a
>
> So the idea is that you must read b to test it. And since we must read a
> before reading b we can see that we write a before either?
>
> The question remains, can the write happen before either of the reads?
>
write a cannot happen before read a (same variable).
write a must happen after read b because it depends on the b value. It
makes sure that the side-effect of "write a" is seen by other CPUs
*after* we have read the b value.
> But since the read b is reading the hw clock, perhaps that just implies a
> wmb on the hardware side?
>
It makes sense. The hardware clock has no cache coherency problem, so
it could be seen as doing wmb() after each data update.
Mathieu
> -- Steve
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 20:11 ` Russell King
@ 2008-11-07 21:36 ` Mathieu Desnoyers
2008-11-07 22:18 ` Russell King
2008-11-07 23:41 ` David Howells
0 siblings, 2 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 21:36 UTC (permalink / raw)
To: David Howells, Andrew Morton, Nicolas Pitre, Linus Torvalds,
Ingo Molnar, Peter Zijlstra, linux-kernel, Ralf Baechle, benh,
paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
* Russell King (rmk+lkml@arm.linux.org.uk) wrote:
> On Fri, Nov 07, 2008 at 11:47:58AM -0500, Mathieu Desnoyers wrote:
> > But any get_cycles() user of cnt32_to_63() should be shot down. The
> > bright side is : there is no way get_cycles() can be used with this
> > new code. :)
> >
> > e.g. of incorrect users for arm (unless they are UP only, but that seems
> > like a weird design argument) :
> >
> > mach-sa1100/include/mach/SA-1100.h:#define OSCR __REG(0x90000010)
> > /* OS timer Counter Reg. */
> > mach-sa1100/generic.c: unsigned long long v = cnt32_to_63(OSCR);
> > mach-pxa/include/mach/pxa-regs.h:#define OSCR __REG(0x40A00010) /* OS
> > Timer Counter Register */
> > mach-pxa/time.c: unsigned long long v = cnt32_to_63(OSCR);
>
> It's strange for you to make that assertion when PXA was the exact
> platform that Nicolas created this code for - and that's a platform
> where preempt has been widely used.
>
> The two you mention are both ARMv5 or older architectures, and the
> first real SMP ARM architecture is ARMv6. So architecturally they
> are UP only.
>
Ok. And hopefully they do not execute instructions speculatively ?
Because then an instruction sync would be required between the __m_cnt_hi
read and get_cycles.
If you design such stuff with portability in mind, you'd use per-cpu
variables, which ends up being a single variable in the single-cpu
special-case.
> So, tell me why you say "unless they are UP only, but that seems like
> a weird design argument"? If the platforms can only ever be UP only,
> what's wrong with UP only code being used with them? (Not that I'm
> saying anything there about cnt32_to_63.)
That's fine, as long as the code does not end up in include/linux and
stays in arch/arm/up-only-subarch/.
When one tries to create architecture agnostic code (which is what is
likely to be palatable to arch agnostic headers), designing with UP in
mind does not make much sense.
>
> I'd like to see you modify the silicon of a PXA or SA11x0 SoC to add
> more than one processor to the chip - maybe you could use evostick to
> glue two dies together and a microscope to aid bonding wires between
> the two? (Of course, you'd need to design something to ensure cache
> coherence as well, and arbitrate the internal bus between the two
> dies.) ;)
>
> Personally, I think that's highly unlikely.
>
Very unlikely indeed. ;)
Mathieu
> --
> Russell King
> Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
> maintainer of:
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 21:36 ` Mathieu Desnoyers
@ 2008-11-07 22:18 ` Russell King
2008-11-07 22:36 ` Mathieu Desnoyers
2008-11-07 23:41 ` David Howells
1 sibling, 1 reply; 118+ messages in thread
From: Russell King @ 2008-11-07 22:18 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: David Howells, Andrew Morton, Nicolas Pitre, Linus Torvalds,
Ingo Molnar, Peter Zijlstra, linux-kernel, Ralf Baechle, benh,
paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
On Fri, Nov 07, 2008 at 04:36:10PM -0500, Mathieu Desnoyers wrote:
> * Russell King (rmk+lkml@arm.linux.org.uk) wrote:
> > On Fri, Nov 07, 2008 at 11:47:58AM -0500, Mathieu Desnoyers wrote:
> > > But any get_cycles() user of cnt32_to_63() should be shot down. The
> > > bright side is : there is no way get_cycles() can be used with this
> > > new code. :)
> > >
> > > e.g. of incorrect users for arm (unless they are UP only, but that seems
> > > like a weird design argument) :
> > >
> > > mach-sa1100/include/mach/SA-1100.h:#define OSCR __REG(0x90000010)
> > > /* OS timer Counter Reg. */
> > > mach-sa1100/generic.c: unsigned long long v = cnt32_to_63(OSCR);
> > > mach-pxa/include/mach/pxa-regs.h:#define OSCR __REG(0x40A00010) /* OS
> > > Timer Counter Register */
> > > mach-pxa/time.c: unsigned long long v = cnt32_to_63(OSCR);
> >
> > It's strange for you to make that assertion when PXA was the exact
> > platform that Nicolas created this code for - and that's a platform
> > where preempt has been widely used.
> >
> > The two you mention are both ARMv5 or older architectures, and the
> > first real SMP ARM architecture is ARMv6. So architecturally they
> > are UP only.
>
> Ok. And hopefully they do not execute instructions speculatively ?
Again, that's ARMv6 and later.
> Because then an instruction sync would be required between the __m_cnt_hi
> read and get_cycles.
What get_cycles? This is the ARM implementation of get_cycles():
static inline cycles_t get_cycles (void)
{
        return 0;
}
Maybe you're using a name for one thing which means something else to
other people? Please don't use confusing vocabulary.
> If you design such stuff with portability in mind, you'd use per-cpu
> variables, which ends up being a single variable in the single-cpu
> special-case.
Explain how and why sched_clock(), which is a global time source, should
use per-cpu variables.
> > So, tell me why you say "unless they are UP only, but that seems like
> > a weird design argument"? If the platforms can only ever be UP only,
> > what's wrong with UP only code being used with them? (Not that I'm
> > saying anything there about cnt32_to_63.)
>
> That's fine, as long as the code does not end up in include/linux and
> stays in arch/arm/up-only-subarch/.
Well, that's where it was - private to ARM. Then David Howells came
along and unilaterally - and without reference to anyone as far as I
can see - moved it to include/linux.
Neither Nicolas, nor me had any idea that it was going to move into
include/linux - the first we knew of it was when pulling the change
from Linus' tree.
Look, if people in the kernel community can't or won't communicate
with others (either through malice, purpose or accident), you can
expect this kind of crap to happen.
> When one tries to create architecture agnostic code (which is what is
> likely to be palatable to arch agnostic headers), designing with UP in
> mind does not make much sense.
It wasn't architecture agnostic code. It was ARM specific. We have
a "version control system" which stores "comments" for changes to the
kernel tree. Please use it to find out the true story. I'll save
you the trouble, here's the commits with full comments:
$ git log include/linux/cnt32_to_63.h

commit b4f151ff899362fec952c45d166252c9912c041f
Author: David Howells <dhowells@redhat.com>
Date:   Wed Sep 24 17:48:26 2008 +0100

    MN10300: Move asm-arm/cnt32_to_63.h to include/linux/

    Move asm-arm/cnt32_to_63.h to include/linux/ so that MN10300 can make
    use of it too.

    Signed-off-by: David Howells <dhowells@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

$ git log -- arch/arm/include/asm/cnt32_to_63.h include/asm-arm/cnt32_to_63.h

commit bc173c5789e1fc6065fd378edc815914b40ee86b
Author: David Howells <dhowells@redhat.com>
Date:   Fri Sep 26 16:22:58 2008 +0100

    ARM: Delete ARM's own cnt32_to_63.h

    Delete ARM's own cnt32_to_63.h as the copy in include/linux/ should now be
    used instead.

    Signed-off-by: David Howells <dhowells@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit 4baa9922430662431231ac637adedddbb0cfb2d7
Author: Russell King <rmk@dyn-67.arm.linux.org.uk>
Date:   Sat Aug 2 10:55:55 2008 +0100

    [ARM] move include/asm-arm to arch/arm/include/asm

    Move platform independent header files to arch/arm/include/asm, leaving
    those in asm/arch* and asm/plat* alone.

    Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

commit 838ccbc35eae5b44d47724e5f694dbec4a26d269
Author: Nicolas Pitre <nico@cam.org>
Date:   Mon Dec 4 20:19:31 2006 +0100

    [ARM] 3978/1: macro to provide a 63-bit value from a 32-bit hardware counter

    This is done in a completely lockless fashion. Bits 0 to 31 of the count
    are provided by the hardware while bits 32 to 62 are stored in memory.
    The top bit in memory is used to synchronize with the hardware count
    half-period. When the top bit of both counters (hardware and in memory)
    differ then the memory is updated with a new value, incrementing it when
    the hardware counter wraps around. Because a word store in memory is
    atomic then the incremented value will always be in synch with the top
    bit indicating to any potential concurrent reader if the value in memory
    is up to date or not wrt the needed increment. And any race in updating
    the value in memory is harmless as the same value would be stored more
    than once.

    Signed-off-by: Nicolas Pitre <nico@cam.org>
    Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
So, stop slinging mud onto Nicolas and me over this. The resulting
mess is clearly not our creation.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 22:18 ` Russell King
@ 2008-11-07 22:36 ` Mathieu Desnoyers
0 siblings, 0 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-07 22:36 UTC (permalink / raw)
To: David Howells, Andrew Morton, Nicolas Pitre, Linus Torvalds,
Ingo Molnar, Peter Zijlstra, linux-kernel, Ralf Baechle, benh,
paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
* Russell King (rmk+lkml@arm.linux.org.uk) wrote:
> On Fri, Nov 07, 2008 at 04:36:10PM -0500, Mathieu Desnoyers wrote:
> > * Russell King (rmk+lkml@arm.linux.org.uk) wrote:
> > > On Fri, Nov 07, 2008 at 11:47:58AM -0500, Mathieu Desnoyers wrote:
> > > > But any get_cycles() user of cnt32_to_63() should be shot down. The
> > > > bright side is : there is no way get_cycles() can be used with this
> > > > new code. :)
> > > >
> > > > e.g. of incorrect users for arm (unless they are UP only, but that seems
> > > > like a weird design argument) :
> > > >
> > > > mach-sa1100/include/mach/SA-1100.h:#define OSCR __REG(0x90000010)
> > > > /* OS timer Counter Reg. */
> > > > mach-sa1100/generic.c: unsigned long long v = cnt32_to_63(OSCR);
> > > > mach-pxa/include/mach/pxa-regs.h:#define OSCR __REG(0x40A00010) /* OS
> > > > Timer Counter Register */
> > > > mach-pxa/time.c: unsigned long long v = cnt32_to_63(OSCR);
> > >
> > > It's strange for you to make that assertion when PXA was the exact
> > > platform that Nicolas created this code for - and that's a platform
> > > where preempt has been widely used.
> > >
> > > The two you mention are both ARMv5 or older architectures, and the
> > > first real SMP ARM architecture is ARMv6. So architecturally they
> > > are UP only.
> >
> > Ok. And hopefully they do not execute instructions speculatively ?
>
> Again, that's ARMv6 and later.
>
> > Because then an instruction sync would be required between the __m_cnt_hi
> > read and get_cycles.
>
> What get_cycles? This is the ARM implementation of get_cycles():
>
> static inline cycles_t get_cycles (void)
> {
>         return 0;
> }
>
> Maybe you're using a name for one thing which means something else to
> other people? Please don't use confusing vocabulary.
>
get_cycles() is expected to be a cpu register read which reads a cycle
counter. As far as I can tell,
#define OSCR __REG(0x90000010) /* OS timer Counter Reg. */
seems to fit this definition pretty closely. Or maybe not ?
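Should someone wire it up, a get_cycles() for those platforms would
presumably be little more than the following (a hypothetical sketch only;
this is not what the ARM tree does today):

        /* hypothetical: return the free-running 32-bit OS timer count */
        static inline cycles_t get_cycles(void)
        {
                return OSCR;
        }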
I am personally not trying to find someone to blame. Just trying to
figure out how to fix the existing code or, at the very least, to make
sure nobody will ask me to use a piece of code not suitable for
tracing clock source purposes.
Mathieu
> > If you design such stuff with portability in mind, you'd use per-cpu
> > variables, which ends up being a single variable in the single-cpu
> > special-case.
>
> Explain how and why sched_clock(), which is a global time source, should
> use per-cpu variables.
>
> > > So, tell me why you say "unless they are UP only, but that seems like
> > > a weird design argument"? If the platforms can only ever be UP only,
> > > what's wrong with UP only code being used with them? (Not that I'm
> > > saying anything there about cnt32_to_63.)
> >
> > That's fine, as long as the code does not end up in include/linux and
> > stays in arch/arm/up-only-subarch/.
>
> Well, that's where it was - private to ARM. Then David Howells came
> along and unilaterally - and without reference to anyone as far as I
> can see - moved it to include/linux.
>
> Neither Nicolas, nor me had any idea that it was going to move into
> include/linux - the first we knew of it was when pulling the change
> from Linus' tree.
>
> Look, if people in the kernel community can't or won't communicate
> with others (either through malice, purpose or accident), you can
> expect this kind of crap to happen.
>
> > When one tries to create architecture agnostic code (which is what is
> > likely to be palatable to arch agnostic headers), designing with UP in
> > mind does not make much sense.
>
> It wasn't architecture agnostic code. It was ARM specific. We have
> a "version control system" which stores "comments" for changes to the
> kernel tree. Please use it to find out the true story. I'll save
> you the trouble, here's the commits with full comments:
>
> $ git log include/linux/cnt32_to_63.h
> commit b4f151ff899362fec952c45d166252c9912c041f
> Author: David Howells <dhowells@redhat.com>
> Date: Wed Sep 24 17:48:26 2008 +0100
>
> MN10300: Move asm-arm/cnt32_to_63.h to include/linux/
>
> Move asm-arm/cnt32_to_63.h to include/linux/ so that MN10300 can make
> use of it too.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> $ git log -- arch/arm/include/asm/cnt32_to_63.h include/asm-arm/cnt32_to_63.h
> commit bc173c5789e1fc6065fd378edc815914b40ee86b
> Author: David Howells <dhowells@redhat.com>
> Date: Fri Sep 26 16:22:58 2008 +0100
>
> ARM: Delete ARM's own cnt32_to_63.h
>
> Delete ARM's own cnt32_to_63.h as the copy in include/linux/ should now be
> used instead.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> commit 4baa9922430662431231ac637adedddbb0cfb2d7
> Author: Russell King <rmk@dyn-67.arm.linux.org.uk>
> Date: Sat Aug 2 10:55:55 2008 +0100
>
> [ARM] move include/asm-arm to arch/arm/include/asm
>
> Move platform independent header files to arch/arm/include/asm, leaving
> those in asm/arch* and asm/plat* alone.
>
> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
>
> commit 838ccbc35eae5b44d47724e5f694dbec4a26d269
> Author: Nicolas Pitre <nico@cam.org>
> Date: Mon Dec 4 20:19:31 2006 +0100
>
> [ARM] 3978/1: macro to provide a 63-bit value from a 32-bit hardware counter
>
> This is done in a completely lockless fashion. Bits 0 to 31 of the count
> are provided by the hardware while bits 32 to 62 are stored in memory.
> The top bit in memory is used to synchronize with the hardware count
> half-period. When the top bit of both counters (hardware and in memory)
> differ then the memory is updated with a new value, incrementing it when
> the hardware counter wraps around. Because a word store in memory is
> atomic then the incremented value will always be in synch with the top
> bit indicating to any potential concurrent reader if the value in memory
> is up to date or not wrt the needed increment. And any race in updating
> the value in memory is harmless as the same value would be stored more
> than once.
>
> Signed-off-by: Nicolas Pitre <nico@cam.org>
> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
>
> So, stop slinging mud onto Nicolas and me over this. The resulting
> mess is clearly not our creation.
>
> --
> Russell King
> Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
> maintainer of:
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 17:04 ` David Howells
2008-11-07 17:17 ` Mathieu Desnoyers
@ 2008-11-07 23:27 ` David Howells
1 sibling, 0 replies; 118+ messages in thread
From: David Howells @ 2008-11-07 23:27 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: dhowells, Andrew Morton, Nicolas Pitre, Linus Torvalds,
Ingo Molnar, Peter Zijlstra, linux-kernel, Ralf Baechle, benh,
paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> Yes. Do you think the synchronization of the cycles counters is
> _perfect_ across CPUs so that there is no possible way whatsoever that
> two cycle counter values appear to go backward between CPUs ? (also
> taking in account delays in __m_cnt_hi write-back...)
Given there's currently only one CPU allowed, yes, I think it's perfect:-)
It's something to re-evaluate should Panasonic decide to do SMP.
> If we expect the only correct use-case to be with readl(), I don't see
> the problem with added synchronization.
It might be expensive if you don't actually want to call readl(). But that's
on a par with using funky instructions to read the TSC, I guess.
David
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 21:36 ` Mathieu Desnoyers
2008-11-07 22:18 ` Russell King
@ 2008-11-07 23:41 ` David Howells
2008-11-08 0:15 ` Russell King
2008-11-08 0:45 ` [RFC patch 08/18] cnt32_to_63 should use smp_rmb() David Howells
1 sibling, 2 replies; 118+ messages in thread
From: David Howells @ 2008-11-07 23:41 UTC (permalink / raw)
To: Russell King
Cc: dhowells, Mathieu Desnoyers, Andrew Morton, Nicolas Pitre,
Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
Ralf Baechle, benh, paulus, David Miller, Ingo Molnar,
Thomas Gleixner, Steven Rostedt, linux-arch
Russell King <rmk+lkml@arm.linux.org.uk> wrote:
> Well, that's where it was - private to ARM. Then David Howells came
> along and unilaterally - and without reference to anyone as far as I
> can see - moved it to include/linux.
>
> Neither Nicolas, nor me had any idea that it was going to move into
> include/linux - the first we knew of it was when pulling the change
> from Linus' tree.
>
> Look, if people in the kernel community can't or won't communicate
> with others (either through malice, purpose or accident), you can
> expect this kind of crap to happen.
Excuse me, Russell, but I sent Nicolas an email prior to doing so asking him
if he had any objections:
To: Nicolas Pitre <nico@cam.org>
cc: dhowells@redhat.com
Subject: Moving asm-arm/cnt32_to_63.h to include/linux/
Date: Thu, 31 Jul 2008 16:04:04 +0100
Hi Nicolas,
Mind if I move include/asm-arm/cnt32_to_63.h to include/linux/?
I need to use it for MN10300.
David
He didn't respond. Not only that, but I copied Nicolas on the patch to make
the move and the patch to make MN10300 use it when I submitted it to Linus on
the 24th September, so it's not like he didn't have plenty of time. He
certainly saw that because he joined in the discussion of the second patch.
Furthermore, he could've asked Linus to refuse the patch, or to revert it if
it had already gone in.
I suppose I should've cc'd the ARM list too... but why should it adversely
affect ARM?
David
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 17:09 ` Mathieu Desnoyers
` (2 preceding siblings ...)
2008-11-07 20:36 ` Nicolas Pitre
@ 2008-11-07 23:50 ` David Howells
2008-11-08 0:55 ` Steven Rostedt
2008-11-09 11:51 ` David Howells
3 siblings, 2 replies; 118+ messages in thread
From: David Howells @ 2008-11-07 23:50 UTC (permalink / raw)
To: Steven Rostedt
Cc: dhowells, Mathieu Desnoyers, Paul E. McKenney, Linus Torvalds,
akpm, Ingo Molnar, Peter Zijlstra, linux-kernel, Nicolas Pitre,
Ralf Baechle, benh, paulus, David Miller, Ingo Molnar,
Thomas Gleixner, linux-arch
Steven Rostedt <rostedt@goodmis.org> wrote:
> > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > be required so it works also on UP systems safely wrt interrupts).
>
> smp_rmb turns into a compiler barrier on UP and should prevent the below
> description.
Note that that does not guarantee that the two reads will be done in the order
you want. The compiler barrier _only_ affects the compiler. It does not stop
the CPU from doing the reads in any order it wants. You need something
stronger than smp_rmb() if you need the reads to be so ordered.
David
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 16:51 ` Mathieu Desnoyers
2008-11-07 20:18 ` Nicolas Pitre
@ 2008-11-07 23:55 ` David Howells
1 sibling, 0 replies; 118+ messages in thread
From: David Howells @ 2008-11-07 23:55 UTC (permalink / raw)
To: Nicolas Pitre
Cc: dhowells, Mathieu Desnoyers, Andrew Morton, Linus Torvalds,
Ingo Molnar, Peter Zijlstra, linux-kernel, Ralf Baechle, benh,
paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
Nicolas Pitre <nico@cam.org> wrote:
> I explained several times already why I disagree. Preemption is not a
> problem unless you're preempted away for long enough, or IOW if your
> counter is too fast.
That's my point. Say a nice -19 process gets preempted... what guarantees
that it will resume within the time it takes the counter to wrap? Even if
the preempting process goes back to sleep, in that time a bunch of other
processes could have woken up and could starve it for a long period of time.
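To put a rough number on it (assuming the 3.6864 MHz OS timer of the
SA11x0/PXA examples mentioned earlier): the counter wraps every
2^32 / 3686400 Hz ~= 1165 s, so its half period is about 580 s. The
preempted task would have to stay off the CPU for close to ten minutes
before an update could be missed; that is a long time, but nothing in the
scheduler actually guarantees a shorter bound.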
David
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 23:41 ` David Howells
@ 2008-11-08 0:15 ` Russell King
2008-11-08 15:24 ` Nicolas Pitre
2008-11-08 0:45 ` [RFC patch 08/18] cnt32_to_63 should use smp_rmb() David Howells
1 sibling, 1 reply; 118+ messages in thread
From: Russell King @ 2008-11-08 0:15 UTC (permalink / raw)
To: David Howells
Cc: Mathieu Desnoyers, Andrew Morton, Nicolas Pitre, Linus Torvalds,
Ingo Molnar, Peter Zijlstra, linux-kernel, Ralf Baechle, benh,
paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
On Fri, Nov 07, 2008 at 11:41:55PM +0000, David Howells wrote:
> Russell King <rmk+lkml@arm.linux.org.uk> wrote:
>
> > Well, that's where it was - private to ARM. Then David Howells came
> > along and unilaterally - and without reference to anyone as far as I
> > can see - moved it to include/linux.
> >
> > Neither Nicolas, nor me had any idea that it was going to move into
> > include/linux - the first we knew of it was when pulling the change
> > from Linus' tree.
> >
> > Look, if people in the kernel community can't or won't communicate
> > with others (either through malice, purpose or accident), you can
> > expect this kind of crap to happen.
>
> Excuse me, Russell, but I sent Nicolas an email prior to doing so asking him
> if he had any objections:
>
> To: Nicolas Pitre <nico@cam.org>
> cc: dhowells@redhat.com
> Subject: Moving asm-arm/cnt32_to_63.h to include/linux/
> Date: Thu, 31 Jul 2008 16:04:04 +0100
>
> Hi Nicolas,
>
> Mind if I move include/asm-arm/cnt32_to_63.h to include/linux/?
>
> I need to use it for MN10300.
>
> David
>
> He didn't respond. Not only that, but I copied Nicolas on the patch to make
> the move and the patch to make MN10300 use it when I submitted it to Linus on
> the 24th September, so it's not like he didn't have plenty of time. He
> certainly saw that because he joined in the discussion of the second patch.
> Furthermore, he could've asked Linus to refuse the patch, or to revert it if
> it had already gone in.
>
> I suppose I should've cc'd the ARM list too... but why should it adversely
> affect ARM?
I take back the "Neither Nicolas" bit but the rest of my comment stands
and remains valid.
In light of akpm's demands to know how this got into the kernel, I decided
I'd put the story forward, especially as people in this thread are confused
about what it was designed for, and making random unfounded claiming that
its existing ARM uses are buggy when they aren't.
It sounds to me as if the right answer is for it to move back to being an
ARM private thing with a MN10300 private copy, rather than it pretending
to be something everyone can use.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 21:04 ` Steven Rostedt
@ 2008-11-08 0:34 ` Paul E. McKenney
0 siblings, 0 replies; 118+ messages in thread
From: Paul E. McKenney @ 2008-11-08 0:34 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Peter Zijlstra, David Howells, Linus Torvalds,
akpm, Ingo Molnar, linux-kernel, Nicolas Pitre, Ralf Baechle,
benh, paulus, David Miller, Ingo Molnar, Thomas Gleixner,
linux-arch
On Fri, Nov 07, 2008 at 04:04:48PM -0500, Steven Rostedt wrote:
>
> On Fri, 7 Nov 2008, Paul E. McKenney wrote:
> > > > Would that make more sense ?
> > > >
> > >
> > > Oh, actually, I got things reversed in this email : the readl(io_addr)
> > > must be done _after_ the __m_cnt_hi read.
> > >
> > > Therefore, two consecutive executions would look like :
> > >
> > > barrier(); /* Make sure the compiler does not reorder __m_cnt_hi and
> > > previous mmio read. */
> > > read __m_cnt_hi
> > > smp_rmb(); /* Waits for every cached memory reads to complete */
> >
> > If these are MMIO reads, then you need rmb() rather than smp_rmb(),
> > at least on architectures that can reorder writes (Power, Itanium,
> > and I believe also ARM, ...).
>
> The read is from a clock source. The only writes that are happening are
> by the clock itself.
>
> On a UP system, is a rmb still needed? That is, can you have two reads on
> the same CPU from the clock source that will produce a backwards clock?
> That to me sounds like the clock interface is broken.
I do not believe that all CPUs are guaranteed to execute a sequence
of MMIO reads in order.
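Hence the suggestion of rmb() rather than smp_rmb() when device reads are
involved. As a generic sketch (not code from this patch set), with io_a and
io_b being hypothetical registers in a relaxed I/O window:

        a = readl(io_a);        /* first device read */
        rmb();                  /* mandatory barrier: keep the device reads ordered */
        b = readl(io_b);        /* second device read */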
Thanx, Paul
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 23:41 ` David Howells
2008-11-08 0:15 ` Russell King
@ 2008-11-08 0:45 ` David Howells
1 sibling, 0 replies; 118+ messages in thread
From: David Howells @ 2008-11-08 0:45 UTC (permalink / raw)
To: Russell King
Cc: dhowells, Mathieu Desnoyers, Andrew Morton, Nicolas Pitre,
Linus Torvalds, Ingo Molnar, Peter Zijlstra, linux-kernel,
Ralf Baechle, benh, paulus, David Miller, Ingo Molnar,
Thomas Gleixner, Steven Rostedt, linux-arch
Russell King <rmk+lkml@arm.linux.org.uk> wrote:
> It sounds to me as if the right answer is for it to move back to being an
> ARM private thing with a MN10300 private copy, rather than it pretending
> to be something everyone can use.
The only problem I have with that is that there are then two independent
copies of it:-(
David
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 23:50 ` David Howells
@ 2008-11-08 0:55 ` Steven Rostedt
2008-11-09 11:51 ` David Howells
1 sibling, 0 replies; 118+ messages in thread
From: Steven Rostedt @ 2008-11-08 0:55 UTC (permalink / raw)
To: David Howells
Cc: Mathieu Desnoyers, Paul E. McKenney, Linus Torvalds, akpm,
Ingo Molnar, Peter Zijlstra, linux-kernel, Nicolas Pitre,
Ralf Baechle, benh, paulus, David Miller, Ingo Molnar,
Thomas Gleixner, linux-arch
On Fri, 7 Nov 2008, David Howells wrote:
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > > I use smp_rmb() to do this on SMP systems (hrm, actually, a rmb() could
> > > be required so it works also on UP systems safely wrt interrupts).
> >
> > smp_rmb turns into a compiler barrier on UP and should prevent the below
> > description.
>
> Note that that does not guarantee that the two reads will be done in the order
> you want. The compiler barrier _only_ affects the compiler. It does not stop
> the CPU from doing the reads in any order it wants. You need something
> stronger than smp_rmb() if you need the reads to be so ordered.
For reading hardware devices that can indeed be correct. But for normal
memory access on a uniprocessor, if the CPU were to reorder the reads that
would affect the actual algorithm then that CPU is broken.
read a
<--- interrupt - should see read a here before read b is done.
read b
Now the fact that one of the reads is a hardware clock, then this
statement might not be too strong. But the fact that it is a clock, and
not some memory mapped device register, I still think smp_rmb is
sufficient.
-- Steve
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-08 0:15 ` Russell King
@ 2008-11-08 15:24 ` Nicolas Pitre
2008-11-08 23:20 ` [PATCH] clarify usage expectations for cnt32_to_63() Nicolas Pitre
0 siblings, 1 reply; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-08 15:24 UTC (permalink / raw)
To: Russell King
Cc: David Howells, Mathieu Desnoyers, Andrew Morton, Linus Torvalds,
Ingo Molnar, Peter Zijlstra, linux-kernel, Ralf Baechle, benh,
paulus, David Miller, Ingo Molnar, Thomas Gleixner,
Steven Rostedt, linux-arch
On Sat, 8 Nov 2008, Russell King wrote:
> On Fri, Nov 07, 2008 at 11:41:55PM +0000, David Howells wrote:
> > Russell King <rmk+lkml@arm.linux.org.uk> wrote:
> >
> > > Well, that's where it was - private to ARM. Then David Howells came
> > > along and unilaterally - and without reference to anyone as far as I
> > > can see - moved it to include/linux.
> > >
> > > Neither Nicolas, nor me had any idea that it was going to move into
> > > include/linux - the first we knew of it was when pulling the change
> > > from Linus' tree.
> > >
> > > Look, if people in the kernel community can't or won't communicate
> > > with others (either through malice, purpose or accident), you can
> > > expect this kind of crap to happen.
> >
> > Excuse me, Russell, but I sent Nicolas an email prior to doing so asking him
> > if he had any objections:
> >
> > To: Nicolas Pitre <nico@cam.org>
> > cc: dhowells@redhat.com
> > Subject: Moving asm-arm/cnt32_to_63.h to include/linux/
> > Date: Thu, 31 Jul 2008 16:04:04 +0100
> >
> > Hi Nicolas,
> >
> > Mind if I move include/asm-arm/cnt32_to_63.h to include/linux/?
> >
> > I need to use it for MN10300.
> >
> > David
> >
> > He didn't respond. Not only that, but I copied Nicolas on the patch to make
> > the move and the patch to make MN10300 use it when I submitted it to Linus on
> > the 24th September, so it's not like he didn't have plenty of time. He
> > certainly saw that because he joined in the discussion of the second patch.
> > Furthermore, he could've asked Linus to refuse the patch, or to revert it if
> > it had already gone in.
I was OK with the patch moving that code and I think I told you so as
well. But...
> > I suppose I should've cc'd the ARM list too... but why should it adversely
> > affect ARM?
>
> I take back the "Neither Nicolas" bit but the rest of my comment stands
> and remains valid.
>
> In light of akpm's demands to know how this got into the kernel, I decided
> I'd put the story forward, especially as people in this thread are confused
> about what it was designed for, and making random unfounded claims that
> its existing ARM uses are buggy when they aren't.
... I must agree with Russell that this is apparently creating more
confusion with people than anything else.
> It sounds to me as if the right answer is for it to move back to being an
> ARM private thing with a MN10300 private copy, rather than it pretending
> to be something everyone can use.
I think this is OK if not everyone can use this. The main purpose for
this code was to provide much increased accuracy for sched_clock() on
processors with only a 32-bit hardware counter.
Given that sched_clock() is already used in contexts where preemption is
disabled, I don't mind the addition of a precision to the associated
comment mentioning that it must be called at least once per
half period of the base counter ***and*** not be preempted
away for longer than the half period of the counter minus the longest
period between two calls. The comment already mentions a kernel timer
which can be used to control the longest period between two calls.
Implicit disabling of preemption is _NOT_ the goal of this code.
I also don't mind having a real barrier for this code to be useful on
other platforms. On the platform this was written for, any kind of
barrier is defined as a compiler barrier which is perfectly fine and
achieves the same effect as the current usage of volatile.
I also don't mind making the high part of the counter always be a per
CPU variable. Again this won't change anything on the target this was
intended for and this would make this code useful for more usages, and
possibly help making the needed barrier on SMP more lightweight. The
usage requirement therefore becomes per CPU even if the base counter is
global. There are per CPU timers with add_timer_on() so this can be
ensured pretty easily.
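As a rough sketch of what I mean (hypothetical names; KICK_PERIOD would be
chosen well below the counter half period):

        static DEFINE_PER_CPU(struct timer_list, kick_timer);

        /* runs on the CPU the timer was pinned to */
        static void kick_fn(unsigned long data)
        {
                sched_clock();          /* forces the cnt32_to_63() path to run */
                mod_timer(&__get_cpu_var(kick_timer), jiffies + KICK_PERIOD);
        }

        static void start_kick_timer(int cpu)
        {
                struct timer_list *t = &per_cpu(kick_timer, cpu);

                setup_timer(t, kick_fn, 0);
                t->expires = jiffies + KICK_PERIOD;
                add_timer_on(t, cpu);   /* pin the timer to that CPU */
        }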
And if after all this the code doesn't suit your needs then just don't
use it. Its documentation should be clear enough so if people start
using it in contexts where it isn't appropriate then it's not the code's
fault.
Nicolas
^ permalink raw reply [flat|nested] 118+ messages in thread
* [PATCH] clarify usage expectations for cnt32_to_63()
2008-11-08 15:24 ` Nicolas Pitre
@ 2008-11-08 23:20 ` Nicolas Pitre
2008-11-09 2:25 ` Mathieu Desnoyers
0 siblings, 1 reply; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-08 23:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: Russell King, David Howells, Mathieu Desnoyers, Andrew Morton,
Ingo Molnar, Peter Zijlstra, lkml, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
Currently, all existing users of cnt32_to_63() are fine since the CPU
architectures where it is used don't do read access reordering, and user
mode preemption is disabled already. It is nevertheless a good idea to
better elaborate usage requirements wrt preemption, and use an explicit
memory barrier for proper results on CPUs that may perform instruction
reordering.
Signed-off-by: Nicolas Pitre <nico@marvell.com>
---
On Sat, 8 Nov 2008, Nicolas Pitre wrote:
> I think this is OK if not everyone can use this. The main purpose for
> this code was to provide much increased accuracy for sched_clock() on
> processors with only a 32-bit hardware counter.
>
> Given that sched_clock() is already used in contexts where preemption is
> disabled, I don't mind the addition of a precision to the associated
> comment mentioning that it must be called at least once per
> half period of the base counter ***and*** not be preempted
> away for longer than the half period of the counter minus the longest
> period between two calls. The comment already mentions a kernel timer
> which can be used to control the longest period between two calls.
> Implicit disabling of preemption is _NOT_ the goal of this code.
>
> I also don't mind having a real barrier for this code to be useful on
> other platforms. On the platform this was written for, any kind of
> barrier is defined as a compiler barrier which is perfectly fine and
> achieves the same effect as the current usage of volatile.
So here it is.
I used a rmb() so this is also safe for mixed usages in and out of
interrupt context. On the architecture I care about this is turned into
a simple compiler barrier and therefore doesn't make a difference, while
smp_rmb() is a noop which isn't right.
I won't make a per_cpu_cnt32_to_63() version myself as I have no need
for that. But if someone wants to use this with a per CPU counter which
is not coherent between CPUs then be my guest. The usage requirements
would be the same but on each used CPU instead of globally.
diff --git a/include/linux/cnt32_to_63.h b/include/linux/cnt32_to_63.h
index 8c0f950..584289d 100644
--- a/include/linux/cnt32_to_63.h
+++ b/include/linux/cnt32_to_63.h
@@ -53,11 +53,19 @@ union cnt32_to_63 {
* needed increment. And any race in updating the value in memory is harmless
* as the same value would simply be stored more than once.
*
- * The only restriction for the algorithm to work properly is that this
- * code must be executed at least once per each half period of the 32-bit
- * counter to properly update the state bit in memory. This is usually not a
- * problem in practice, but if it is then a kernel timer could be scheduled
- * to manage for this code to be executed often enough.
+ * The restrictions for the algorithm to work properly are:
+ *
+ * 1) this code must be called at least once per each half period of the
+ * 32-bit counter;
+ *
+ * 2) this code must not be preempted for a duration longer than the
+ * 32-bit counter half period minus the longest period between two
+ * calls to this code.
+ *
+ * Those requirements ensure proper update to the state bit in memory.
+ * This is usually not a problem in practice, but if it is then a kernel
+ * timer should be scheduled to manage for this code to be executed often
+ * enough.
*
* Note that the top bit (bit 63) in the returned value should be considered
* as garbage. It is not cleared here because callers are likely to use a
@@ -68,9 +76,10 @@ union cnt32_to_63 {
*/
 #define cnt32_to_63(cnt_lo) \
 ({ \
-        static volatile u32 __m_cnt_hi; \
+        static u32 __m_cnt_hi; \
         union cnt32_to_63 __x; \
         __x.hi = __m_cnt_hi; \
+        rmb(); \
         __x.lo = (cnt_lo); \
         if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
                 __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
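For reference, a typical caller would then look something like this usage
sketch (not part of this patch; it assumes a hypothetical ioremap'd 32-bit
counter at clk_base ticking at exactly 1 MHz, i.e. 1 us per tick):

        unsigned long long sched_clock(void)
        {
                /* extend the 32-bit mmio counter to 63 bits */
                unsigned long long v = cnt32_to_63(readl(clk_base));

                /* bit 63 is documented above as garbage; mask it off and
                   scale 1 us ticks to nanoseconds */
                return (v & 0x7fffffffffffffffULL) * 1000;
        }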
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH] clarify usage expectations for cnt32_to_63()
2008-11-08 23:20 ` [PATCH] clarify usage expectations for cnt32_to_63() Nicolas Pitre
@ 2008-11-09 2:25 ` Mathieu Desnoyers
2008-11-09 2:54 ` Nicolas Pitre
0 siblings, 1 reply; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-09 2:25 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Linus Torvalds, Russell King, David Howells, Andrew Morton,
Ingo Molnar, Peter Zijlstra, lkml, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
* Nicolas Pitre (nico@cam.org) wrote:
> Currently, all existing users of cnt32_to_63() are fine since the CPU
> architectures where it is used don't do read access reordering, and user
> mode preemption is disabled already. It is nevertheless a good idea to
> better elaborate usage requirements wrt preemption, and use an explicit
> memory barrier for proper results on CPUs that may perform instruction
> reordering.
>
> Signed-off-by: Nicolas Pitre <nico@marvell.com>
> ---
>
> On Sat, 8 Nov 2008, Nicolas Pitre wrote:
>
> > I think this is OK if not everyone can use this. The main purpose for
> > this code was to provide much increased accuracy for sched_clock() on
> > processors with only a 32-bit hardware counter.
> >
> > Given that sched_clock() is already used in contexts where preemption is
> > disabled, I don't mind the addition of a precision to the associated
> > comment mentioning that it must be called at least once per
> > half period of the base counter ***and*** not be preempted
> > away for longer than the half period of the counter minus the longest
> > period between two calls. The comment already mentions a kernel timer
> > which can be used to control the longest period between two calls.
> > Implicit disabling of preemption is _NOT_ the goal of this code.
> >
> > I also don't mind having a real barrier for this code to be useful on
> > other platforms. On the platform this was written for, any kind of
> > barrier is defined as a compiler barrier which is perfectly fine and
> > achieves the same effect as the current usage of volatile.
>
> So here it is.
>
> I used a rmb() so this is also safe for mixed usages in and out of
> interrupt context. On the architecture I care about this is turned into
> a simple compiler barrier and therefore doesn't make a difference, while
> smp_rmb() is a noop which isn't right.
>
Hum ? smp_rmb() is turned into a compiler barrier on !SMP architectures.
Turning it into a NOP would be broken. Actually, ARM defines it as a
barrier().
I *think* that smp_rmb() would be enough, supposing the access to memory
is done in program order wrt local interrupts in UP. This is basically
Steven's question, which has not received any clear answer yet. I'd like
to know what others think about it.
Mathieu
> I won't make a per_cpu_cnt32_to_63() version myself as I have no need
> for that. But if someone wants to use this witha per CPU counter which
> is not coherent between CPUs then be my guest. The usage requirements
> would be the same but on each used CPU instead of globally.
>
> diff --git a/include/linux/cnt32_to_63.h b/include/linux/cnt32_to_63.h
> index 8c0f950..584289d 100644
> --- a/include/linux/cnt32_to_63.h
> +++ b/include/linux/cnt32_to_63.h
> @@ -53,11 +53,19 @@ union cnt32_to_63 {
> * needed increment. And any race in updating the value in memory is harmless
> * as the same value would simply be stored more than once.
> *
> - * The only restriction for the algorithm to work properly is that this
> - * code must be executed at least once per each half period of the 32-bit
> - * counter to properly update the state bit in memory. This is usually not a
> - * problem in practice, but if it is then a kernel timer could be scheduled
> - * to manage for this code to be executed often enough.
> + * The restrictions for the algorithm to work properly are:
> + *
> + * 1) this code must be called at least once per each half period of the
> + * 32-bit counter;
> + *
> + * 2) this code must not be preempted for a duration longer than the
> + * 32-bit counter half period minus the longest period between two
> + * calls to this code.
> + *
> + * Those requirements ensure proper update to the state bit in memory.
> + * This is usually not a problem in practice, but if it is then a kernel
> + * timer should be scheduled to manage for this code to be executed often
> + * enough.
> *
> * Note that the top bit (bit 63) in the returned value should be considered
> * as garbage. It is not cleared here because callers are likely to use a
> @@ -68,9 +76,10 @@ union cnt32_to_63 {
> */
>  #define cnt32_to_63(cnt_lo) \
>  ({ \
> -        static volatile u32 __m_cnt_hi; \
> +        static u32 __m_cnt_hi; \
>          union cnt32_to_63 __x; \
>          __x.hi = __m_cnt_hi; \
> +        rmb(); \
>          __x.lo = (cnt_lo); \
>          if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
>                  __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH] clarify usage expectations for cnt32_to_63()
2008-11-09 2:25 ` Mathieu Desnoyers
@ 2008-11-09 2:54 ` Nicolas Pitre
2008-11-09 5:06 ` Nicolas Pitre
0 siblings, 1 reply; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-09 2:54 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Russell King, David Howells, Andrew Morton,
Ingo Molnar, Peter Zijlstra, lkml, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
On Sat, 8 Nov 2008, Mathieu Desnoyers wrote:
> > I used a rmb() so this is also safe for mixed usages in and out of
> > interrupt context. On the architecture I care about this is turned into
> > a simple compiler barrier and therefore doesn't make a difference, while
> > smp_rmb() is a noop which isn't right.
> >
>
> Hum ? smp_rmb() is turned into a compiler barrier on !SMP architectures.
> Turning it into a NOP would be broken. Actually, ARM defines it as a
> barrier().
Oh, right. I got confused somehow with read_barrier_depends().
> I *think* that smp_rmb() would be enough, supposing the access to memory
> is done in program order wrt local interrupts in UP. This is basically
> Steven's question, which has not received any clear answer yet. I'd like
> to know what others think about it.
In the mean time a pure rmb() is the safest thing to do now. Once we
can convince ourselves that out-of-order reads are always rolled back
upon the arrival of an interrupt then this could be relaxed.
Nicolas
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH] clarify usage expectations for cnt32_to_63()
2008-11-09 2:54 ` Nicolas Pitre
@ 2008-11-09 5:06 ` Nicolas Pitre
2008-11-09 5:27 ` [PATCH v2] " Nicolas Pitre
0 siblings, 1 reply; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-09 5:06 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Russell King, David Howells, Andrew Morton,
Ingo Molnar, Peter Zijlstra, lkml, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
On Sat, 8 Nov 2008, Nicolas Pitre wrote:
> On Sat, 8 Nov 2008, Mathieu Desnoyers wrote:
>
> > I *think* that smp_rmb() would be enough, supposing the access to memory
> > is done in program order wrt local interrupts in UP. This is basically
> > Steven's question, which has not received any clear answer yet. I'd like
> > to know what others think about it.
>
> In the mean time a pure rmb() is the safest thing to do now. Once we
> can convince ourselves that out-of-order reads are always rolled back
> upon the arrival of an interrupt then this could be relaxed.
After thinking about it some more, a smp_rmb() must be correct.
On UP, that would be completely insane if an exception didn't resume
the whole sequence since the CPU cannot presume anything when returning
from it.
If the instruction flows says:
READ A
READ B
And speculative execution makes for B to be read before A, and you get
an interrupt after B was read but before A was read, then the program
counter may only point at READ A upon returning from the exception and B
will be read again. Doing otherwise would require the CPU to remember
any reordering that it might have performed upon every exceptions which
is completely insane.
So smp_rmb() it shall be.
Nicolas
^ permalink raw reply [flat|nested] 118+ messages in thread
* [PATCH v2] clarify usage expectations for cnt32_to_63()
2008-11-09 5:06 ` Nicolas Pitre
@ 2008-11-09 5:27 ` Nicolas Pitre
2008-11-09 6:48 ` Mathieu Desnoyers
0 siblings, 1 reply; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-09 5:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mathieu Desnoyers, Russell King, David Howells, Andrew Morton,
Ingo Molnar, Peter Zijlstra, lkml, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
Currently, all existing users of cnt32_to_63() are fine since the CPU
architectures where it is used don't do read access reordering, and user
mode preemption is disabled already. It is nevertheless a good idea to
better elaborate usage requirements wrt preemption, and use an explicit
memory barrier on SMP to avoid different CPUs accessing the counter
value in the wrong order. On UP a simple compiler barrier is
sufficient.
Signed-off-by: Nicolas Pitre <nico@marvell.com>
---
On Sun, 9 Nov 2008, Nicolas Pitre wrote:
> On Sat, 8 Nov 2008, Nicolas Pitre wrote:
>
> > On Sat, 8 Nov 2008, Mathieu Desnoyers wrote:
> >
> > > I *think* that smp_rmb() would be enough, supposing the access to memory
> > > is done in program order wrt local interrupts in UP. This is basically
> > > Steven's question, which has not received any clear answer yet. I'd like
> > > to know what others think about it.
> >
> > In the mean time a pure rmb() is the safest thing to do now. Once we
> > can convince ourselves that out-of-order reads are always rolled back
> > upon the arrival of an interrupt then this could be relaxed.
>
> After thinking about it some more, a smp_rmb() must be correct.
>
> On UP, that would be completely insane if an exception didn't resume
> the whole sequence since the CPU cannot presume anything when returning
> from it.
>
> If the instruction flows says:
>
> READ A
> READ B
>
> And speculative execution makes for B to be read before A, and you get
> an interrupt after B was read but before A was read, then the program
> counter may only point at READ A upon returning from the exception and B
> will be read again. Doing otherwise would require the CPU to remember
> any reordering that it might have performed upon every exception, which
> is completely insane.
>
> So smp_rmb() it shall be.
diff --git a/include/linux/cnt32_to_63.h b/include/linux/cnt32_to_63.h
index 8c0f950..7605fdd 100644
--- a/include/linux/cnt32_to_63.h
+++ b/include/linux/cnt32_to_63.h
@@ -16,6 +16,7 @@
#include <linux/compiler.h>
#include <linux/types.h>
#include <asm/byteorder.h>
+#include <asm/system.h>
/* this is used only to give gcc a clue about good code generation */
union cnt32_to_63 {
@@ -53,11 +54,19 @@ union cnt32_to_63 {
* needed increment. And any race in updating the value in memory is harmless
* as the same value would simply be stored more than once.
*
- * The only restriction for the algorithm to work properly is that this
- * code must be executed at least once per each half period of the 32-bit
- * counter to properly update the state bit in memory. This is usually not a
- * problem in practice, but if it is then a kernel timer could be scheduled
- * to manage for this code to be executed often enough.
+ * The restrictions for the algorithm to work properly are:
+ *
+ * 1) this code must be called at least once per each half period of the
+ * 32-bit counter;
+ *
+ * 2) this code must not be preempted for a duration longer than the
+ * 32-bit counter half period minus the longest period between two
+ * calls to this code.
+ *
+ * Those requirements ensure proper update to the state bit in memory.
+ * This is usually not a problem in practice, but if it is then a kernel
+ * timer should be scheduled to manage for this code to be executed often
+ * enough.
*
* Note that the top bit (bit 63) in the returned value should be considered
* as garbage. It is not cleared here because callers are likely to use a
@@ -68,9 +77,10 @@ union cnt32_to_63 {
*/
 #define cnt32_to_63(cnt_lo) \
 ({ \
-        static volatile u32 __m_cnt_hi; \
+        static u32 __m_cnt_hi; \
         union cnt32_to_63 __x; \
         __x.hi = __m_cnt_hi; \
+        smp_rmb(); \
         __x.lo = (cnt_lo); \
         if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
                 __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
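One usage note, not part of the change itself: since the cnt_lo argument is
only evaluated after the smp_rmb(), a caller that passes an mmio read, for
example the hypothetical

        u64 v = cnt32_to_63(readl(mmio_clk));

ends up with the barrier sitting between the __m_cnt_hi read and the device
read, which is the ordering discussed earlier in this thread.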
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v2] clarify usage expectations for cnt32_to_63()
2008-11-09 5:27 ` [PATCH v2] " Nicolas Pitre
@ 2008-11-09 6:48 ` Mathieu Desnoyers
2008-11-09 13:34 ` Nicolas Pitre
0 siblings, 1 reply; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-09 6:48 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Linus Torvalds, Russell King, David Howells, Andrew Morton,
Ingo Molnar, Peter Zijlstra, lkml, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
* Nicolas Pitre (nico@cam.org) wrote:
> Currently, all existing users of cnt32_to_63() are fine since the CPU
> architectures where it is used don't do read access reordering, and user
> mode preemption is disabled already. It is nevertheless a good idea to
> better elaborate usage requirements wrt preemption, and use an explicit
> memory barrier on SMP to avoid different CPUs accessing the counter
> value in the wrong order. On UP a simple compiler barrier is
> sufficient.
>
> Signed-off-by: Nicolas Pitre <nico@marvell.com>
> ---
>
...
> @@ -68,9 +77,10 @@ union cnt32_to_63 {
> */
>  #define cnt32_to_63(cnt_lo) \
>  ({ \
> -        static volatile u32 __m_cnt_hi; \
> +        static u32 __m_cnt_hi; \
It's important to get the smp_rmb() here, which this patch provides, so
consider this patch to be acked-by me. The added documentation is needed
too.
But I also think that declaring the static u32 __m_cnt_hi here is
counter-intuitive for developers who wish to use it.
I'd recommend letting the declaration be done outside of cnt32_to_63 so
the same variable can be passed as parameter to more than one execution
site.
Mathieu
>          union cnt32_to_63 __x; \
>          __x.hi = __m_cnt_hi; \
> +        smp_rmb(); \
>          __x.lo = (cnt_lo); \
>          if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
>                  __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-07 23:50 ` David Howells
2008-11-08 0:55 ` Steven Rostedt
@ 2008-11-09 11:51 ` David Howells
2008-11-09 14:31 ` Steven Rostedt
2008-11-09 16:18 ` Mathieu Desnoyers
1 sibling, 2 replies; 118+ messages in thread
From: David Howells @ 2008-11-09 11:51 UTC (permalink / raw)
To: Steven Rostedt
Cc: dhowells, Mathieu Desnoyers, Paul E. McKenney, Linus Torvalds,
akpm, Ingo Molnar, Peter Zijlstra, linux-kernel, Nicolas Pitre,
Ralf Baechle, benh, paulus, David Miller, Ingo Molnar,
Thomas Gleixner, linux-arch
Steven Rostedt <rostedt@goodmis.org> wrote:
> > Note that that does not guarantee that the two reads will be done in the
> > order you want. The compiler barrier _only_ affects the compiler. It
> > does not stop the CPU from doing the reads in any order it wants. You
> > need something stronger than smp_rmb() if you need the reads to be so
> > ordered.
>
> For reading hardware devices that can indeed be correct. But for normal
> memory access on a uniprocessor, if the CPU were to reorder the reads that
> would affect the actual algorithm then that CPU is broken.
>
> read a
> <--- interrupt - should see read a here before read b is done.
> read b
Life isn't that simple. Go and read the section labelled "The things cpus get
up to" in Documentation/memory-barriers.txt.
The two reads we're talking about are independent of each other. Independent
reads and writes can be reordered and merged at will by the CPU, subject to
restrictions imposed by barriers, cacheability attributes, MMIO attributes and
suchlike.
You can get read b happening before read a, but in such a case both
instructions will be in the CPU's execution pipeline. When an interrupt
occurs, the CPU will presumably finish clearing what's in its pipeline before
going and servicing the interrupt handler.
If a CPU is strictly ordered with respect to reads, do you actually need read
barriers?
The fact that a pair of reads might be part of an algorithm that is critically
dependent on the ordering of those reads isn't something the CPU cares about.
It doesn't know there's an algorithm there.
> Now the fact that one of the reads is a hardware clock, then this
> statement might not be too strong. But the fact that it is a clock, and
> not some memory mapped device register, I still think smp_rmb is
> sufficient.
To quote again from memory-barriers.txt, section "CPU memory barriers":
Mandatory barriers should not be used to control SMP effects, since
mandatory barriers unnecessarily impose overhead on UP systems. They
may, however, be used to control MMIO effects on accesses through
relaxed memory I/O windows. These are required even on non-SMP
systems as they affect the order in which memory operations appear to
a device by prohibiting both the compiler and the CPU from reordering
them.
Section "Accessing devices":
(2) If the accessor functions are used to refer to an I/O memory window with
relaxed memory access properties, then _mandatory_ memory barriers are
required to enforce ordering.
David
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v2] clarify usage expectations for cnt32_to_63()
2008-11-09 6:48 ` Mathieu Desnoyers
@ 2008-11-09 13:34 ` Nicolas Pitre
2008-11-09 13:43 ` Russell King
2008-11-09 16:22 ` Mathieu Desnoyers
0 siblings, 2 replies; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-09 13:34 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Russell King, David Howells, Andrew Morton,
Ingo Molnar, Peter Zijlstra, lkml, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
On Sun, 9 Nov 2008, Mathieu Desnoyers wrote:
> * Nicolas Pitre (nico@cam.org) wrote:
> > Currently, all existing users of cnt32_to_63() are fine since the CPU
> > architectures where it is used don't do read access reordering, and user
> > mode preemption is disabled already. It is nevertheless a good idea to
> > better elaborate usage requirements wrt preemption, and use an explicit
> > memory barrier on SMP to avoid different CPUs accessing the counter
> > value in the wrong order. On UP a simple compiler barrier is
> > sufficient.
> >
> > Signed-off-by: Nicolas Pitre <nico@marvell.com>
> > ---
> >
> ...
> > @@ -68,9 +77,10 @@ union cnt32_to_63 {
> > */
> > #define cnt32_to_63(cnt_lo) \
> > ({ \
> > - static volatile u32 __m_cnt_hi; \
> > + static u32 __m_cnt_hi; \
>
> It's important to get the smp_rmb() here, which this patch provides, so
> consider this patch to be acked-by me. The added documentation is needed
> too.
Thanks.
> But I also think that declaring the static u32 __m_cnt_hi here is
> counter-intuitive for developers who wish to use it.
I'm rather not convinced of that. And this is a much bigger change
affecting all callers so I'd defer such change even if I was convinced
of it.
> I'd recommend letting the declaration be done outside of cnt32_to_63 so
> the same variable can be passed as parameter to more than one execution
> site.
Do you really have such instances where multiple call sites are needed?
That sounds even more confusing to me than the current model. Better
encapsulate the usage of this macro within some function which has a
stronger meaning, such as sched_clock(), and call _that_ from multiple
sites instead.
Nicolas
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v2] clarify usage expectations for cnt32_to_63()
2008-11-09 13:34 ` Nicolas Pitre
@ 2008-11-09 13:43 ` Russell King
2008-11-09 16:22 ` Mathieu Desnoyers
1 sibling, 0 replies; 118+ messages in thread
From: Russell King @ 2008-11-09 13:43 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Mathieu Desnoyers, Linus Torvalds, David Howells, Andrew Morton,
Ingo Molnar, Peter Zijlstra, lkml, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
On Sun, Nov 09, 2008 at 08:34:23AM -0500, Nicolas Pitre wrote:
> Do you really have such instances where multiple call sites are needed?
> That sounds even more confusing to me than the current model. Better
> encapsulate the usage of this macro within some function which has a
> stronger meaning, such as sched_clock(), and call _that_ from multiple
> sites instead.
What if sched_clock() is inline and uses cnt32_to_63()? I think that's
where the problem lies. Better add a comment that it shouldn't be used
inside another inline function.
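A sketch of the pitfall being described (the file and register names are made up for illustration):

/* in some shared header -- problematic: */
static inline unsigned long long my_sched_clock(void)
{
	return cnt32_to_63(readl(MY_TIMER_REG));	/* hidden static __m_cnt_hi */
}
/*
 * Every .c file that inlines my_sched_clock() gets its own copy of the
 * static __m_cnt_hi, so the extended high word is no longer shared and a
 * wrap-around observed in one file is invisible to the others.
 */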
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-09 11:51 ` David Howells
@ 2008-11-09 14:31 ` Steven Rostedt
2008-11-09 16:18 ` Mathieu Desnoyers
1 sibling, 0 replies; 118+ messages in thread
From: Steven Rostedt @ 2008-11-09 14:31 UTC (permalink / raw)
To: David Howells
Cc: Mathieu Desnoyers, Paul E. McKenney, Linus Torvalds, akpm,
Ingo Molnar, Peter Zijlstra, linux-kernel, Nicolas Pitre,
Ralf Baechle, benh, paulus, David Miller, Ingo Molnar,
Thomas Gleixner, linux-arch
On Sun, 9 Nov 2008, David Howells wrote:
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > > Note that that does not guarantee that the two reads will be done in the
> > > order you want. The compiler barrier _only_ affects the compiler. It
> > > does not stop the CPU from doing the reads in any order it wants. You
> > > need something stronger than smp_rmb() if you need the reads to be so
> > > ordered.
> >
> > For reading hardware devices that can indeed be correct. But for normal
> > memory access on a uniprocessor, if the CPU were to reorder the reads that
> > would affect the actual algorithm then that CPU is broken.
Please read what I said above again.
"For reading hardware devices that can indeed be correct."
There I agree that accessing devices will require a rmb.
"But for normal memory access on a uniprocessor, if the CPU were to
reorder the reads that would affect the actual algorithm then that CPU is
broken."
Here I'm talking about accessing normal RAM. If the CPU decides to read b
before reading a then that will break the code.
> >
> > read a
> > <--- interrupt - should see read a here before read b is done.
> > read b
>
> Life isn't that simple. Go and read the section labelled "The things cpus get
> up to" in Documentation/memory-barriers.txt.
I've read it. Several times ;-)
>
> The two reads we're talking about are independent of each other. Independent
> reads and writes can be reordered and merged at will by the CPU, subject to
> restrictions imposed by barriers, cacheability attributes, MMIO attributes and
> suchlike.
>
> You can get read b happening before read a, but in such a case both
> instructions will be in the CPU's execution pipeline. When an interrupt
> occurs, the CPU will presumably finish clearing what's in its pipeline before
> going and servicing the interrupt handler.
The above sounds like you just answered my question, and an smp_rmb is
enough. If an interrupt occurs, then the read a and read b will be
completed. It really does not matter in which order, as long as the interrupt
itself does not see the read b before the read a.
>
> If a CPU is strictly ordered with respect to reads, do you actually need read
> barriers?
>
> The fact that a pair of reads might be part of an algorithm that is critically
> dependent on the ordering of those reads isn't something the CPU cares about.
> It doesn't know there's an algorithm there.
>
> > Now the fact that one of the reads is a hardware clock, then this
> > statement might not be too strong. But the fact that it is a clock, and
> > not some memory mapped device register, I still think smp_rmb is
> > sufficient.
>
> To quote again from memory-barriers.txt, section "CPU memory barriers":
>
> Mandatory barriers should not be used to control SMP effects, since
> mandatory barriers unnecessarily impose overhead on UP systems. They
> may, however, be used to control MMIO effects on accesses through
> relaxed memory I/O windows. These are required even on non-SMP
> systems as they affect the order in which memory operations appear to
> a device by prohibiting both the compiler and the CPU from reordering
> them.
>
> Section "Accessing devices":
>
> (2) If the accessor functions are used to refer to an I/O memory window with
> relaxed memory access properties, then _mandatory_ memory barriers are
> required to enforce ordering.
My confidence that an smp_rmb is enough when reading a clock is not as
strong. And it may not be. I'll have to think about this a bit more.
Again, the question arises with:
read a (memory)
<---- interrupt
read b (clock)
Will the b be seen before the interrupt occurred, and before the a is
read? That is what will break the algorithm on UP. If we can not
guarantee this statement, then a rmb is needed.
-- Steve
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb()
2008-11-09 11:51 ` David Howells
2008-11-09 14:31 ` Steven Rostedt
@ 2008-11-09 16:18 ` Mathieu Desnoyers
1 sibling, 0 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-09 16:18 UTC (permalink / raw)
To: David Howells
Cc: Steven Rostedt, Paul E. McKenney, Linus Torvalds, akpm,
Ingo Molnar, Peter Zijlstra, linux-kernel, Nicolas Pitre,
Ralf Baechle, benh, paulus, David Miller, Ingo Molnar,
Thomas Gleixner, linux-arch
* David Howells (dhowells@redhat.com) wrote:
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > > Note that that does not guarantee that the two reads will be done in the
> > > order you want. The compiler barrier _only_ affects the compiler. It
> > > does not stop the CPU from doing the reads in any order it wants. You
> > > need something stronger than smp_rmb() if you need the reads to be so
> > > ordered.
> >
> > For reading hardware devices that can indeed be correct. But for normal
> > memory access on a uniprocessor, if the CPU were to reorder the reads that
> > would affect the actual algorithm then that CPU is broken.
> >
> > read a
> > <--- interrupt - should see read a here before read b is done.
> > read b
>
> Life isn't that simple. Go and read the section labelled "The things cpus get
> up to" in Documentation/memory-barriers.txt.
>
> The two reads we're talking about are independent of each other. Independent
> reads and writes can be reordered and merged at will by the CPU, subject to
> restrictions imposed by barriers, cacheability attributes, MMIO attributes and
> suchlike.
>
> You can get read b happening before read a, but in such a case both
> instructions will be in the CPU's execution pipeline. When an interrupt
> occurs, the CPU will presumably finish clearing what's in its pipeline before
> going and servicing the interrupt handler.
>
> If a CPU is strictly ordered with respect to reads, do you actually need read
> barriers?
>
> The fact that a pair of reads might be part of an algorithm that is critically
> dependent on the ordering of those reads isn't something the CPU cares about.
> It doesn't know there's an algorithm there.
>
> > Now the fact that one of the reads is a hardware clock, then this
> > statement might not be too strong. But the fact that it is a clock, and
> > not some memory mapped device register, I still think smp_rmb is
> > sufficient.
>
> To quote again from memory-barriers.txt, section "CPU memory barriers":
>
> Mandatory barriers should not be used to control SMP effects, since
> mandatory barriers unnecessarily impose overhead on UP systems. They
> may, however, be used to control MMIO effects on accesses through
> relaxed memory I/O windows. These are required even on non-SMP
> systems
<emphasis>
> as they affect the order in which memory operations appear to a device
</emphasis>
In this particular case, we don't care about the order of memory
operations as seen by the device, given we only read the mmio time
source atomically. So considering what you said above about the fact
that the CPU will flush all the pending operations in the pipeline
before proceeding to service an interrupt, a simple barrier() should be
enough to make the two operations appear in correct order wrt local
interrupts. I therefore don't think a full rmb() is required to ensure
correct read order on UP, because, again, in this case we don't need to
order accesses as seen by the device.
Mathieu
> by prohibiting both the compiler and the CPU from reordering
> them.
>
> Section "Accessing devices":
>
> (2) If the accessor functions are used to refer to an I/O memory window with
> relaxed memory access properties, then _mandatory_ memory barriers are
> required to enforce ordering.
>
> David
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v2] clarify usage expectations for cnt32_to_63()
2008-11-09 13:34 ` Nicolas Pitre
2008-11-09 13:43 ` Russell King
@ 2008-11-09 16:22 ` Mathieu Desnoyers
2008-11-10 4:20 ` Nicolas Pitre
1 sibling, 1 reply; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-09 16:22 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Linus Torvalds, Russell King, David Howells, Andrew Morton,
Ingo Molnar, Peter Zijlstra, lkml, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
* Nicolas Pitre (nico@cam.org) wrote:
> On Sun, 9 Nov 2008, Mathieu Desnoyers wrote:
>
> > * Nicolas Pitre (nico@cam.org) wrote:
> > > Currently, all existing users of cnt32_to_63() are fine since the CPU
> > > architectures where it is used don't do read access reordering, and user
> > > mode preemption is disabled already. It is nevertheless a good idea to
> > > better elaborate usage requirements wrt preemption, and use an explicit
> > > memory barrier on SMP to avoid different CPUs accessing the counter
> > > value in the wrong order. On UP a simple compiler barrier is
> > > sufficient.
> > >
> > > Signed-off-by: Nicolas Pitre <nico@marvell.com>
> > > ---
> > >
> > ...
> > > @@ -68,9 +77,10 @@ union cnt32_to_63 {
> > > */
> > > #define cnt32_to_63(cnt_lo) \
> > > ({ \
> > > - static volatile u32 __m_cnt_hi; \
> > > + static u32 __m_cnt_hi; \
> >
> > It's important to get the smp_rmb() here, which this patch provides, so
> > consider this patch to be acked-by me. The added documentation is needed
> > too.
>
> Thanks.
>
> > But I also think that declaring the static u32 __m_cnt_hi here is
> > counter-intuitive for developers who wish to use it.
>
> I'm rather not convinced of that. And this is a much bigger change
> affecting all callers so I'd defer such change even if I was convinced
> of it.
>
> > I'd recommend letting the declaration be done outside of cnt32_to_63 so
> > the same variable can be passed as parameter to more than one execution
> > site.
>
> Do you really have such instances where multiple call sites are needed?
> That sounds even more confusing to me than the current model. Better
> encapsulate the usage of this macro within some function which has a
> stronger meaning, such as sched_clock(), and call _that_ from multiple
> sites instead.
>
>
> Nicolas
I see a few reasons for it:
- If we want to inline the whole read function so we don't pay the extra
runtime cost of a function call, this would become required.
- If we want to create a per-cpu timer which updates the value
periodically without calling the function, we may want to add some
WARN_ON or some sanity tests in this periodic update that would not be
part of the standard read code. If we don't have access to this
variable outside of the macro, this becomes impossible (a sketch follows
below).
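A sketch of that second point, assuming the explicit-state variant suggested above (all names here are illustrative, not actual kernel code):

static u32 sched_clock_cnt_hi;
static u64 sched_clock_last;	/* only used by the periodic check */

/* hypothetical periodic (e.g. per-cpu timer) update and sanity check */
static void sched_clock_poll(unsigned long unused)
{
	u64 now = __cnt32_to_63_state(OSCR, &sched_clock_cnt_hi)
			& 0x7fffffffffffffffULL;

	WARN_ON(now < sched_clock_last);	/* time must not go backwards */
	sched_clock_last = now;
}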
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v2] clarify usage expectations for cnt32_to_63()
2008-11-09 16:22 ` Mathieu Desnoyers
@ 2008-11-10 4:20 ` Nicolas Pitre
2008-11-10 4:42 ` Andrew Morton
0 siblings, 1 reply; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-10 4:20 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Russell King, David Howells, Andrew Morton,
Ingo Molnar, Peter Zijlstra, lkml, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
On Sun, 9 Nov 2008, Mathieu Desnoyers wrote:
> * Nicolas Pitre (nico@cam.org) wrote:
> > Do you really have such instances where multiple call sites are needed?
> > That sounds even more confusing to me than the current model. Better
> > encapsulate the usage of this macro within some function which has a
> > stronger meaning, such as sched_clock(), and call _that_ from multiple
> > sites instead.
>
> I see a few reasons for it :
>
> - If we want to inline the whole read function so we don't pay the extra
> runtime cost of a function call, this would become required.
You can inline it as you want as long as it remains in the same .c file.
The static variable is still shared amongst all call sites in that case.
> - If we want to create a per cpu timer which updates the value
> periodically without calling the function. We may want to add some
> WARN_ON or some sanity tests in this periodic update that would not be
> part of the standard read code. If we don't have access to this
> variable outside of the macro, this becomes impossible.
I don't see how you could update the variable without calling the
function somehow or duplicating it. As to the sanity check argument:
you can perform such checks on the result rather than the internal
variable which IMHO would be more logical.
And if you want a per CPU version, then it's better to create a
per_cpu_cnt32_to_63() variant which could use a relaxed barrier in that
case.
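Such a per-CPU variant might look roughly like this (the name and the relaxed barrier are as suggested above; the rest is an assumption, not code from this thread):

#define per_cpu_cnt32_to_63(cnt_lo, cnt_hi_p) \
({ \
	union cnt32_to_63 __x; \
	__x.hi = *(cnt_hi_p); \
	barrier();	/* per-CPU state: compiler ordering is enough */ \
	__x.lo = (cnt_lo); \
	if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
		*(cnt_hi_p) = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
	__x.val; \
})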
Nicolas
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v2] clarify usage expectations for cnt32_to_63()
2008-11-10 4:20 ` Nicolas Pitre
@ 2008-11-10 4:42 ` Andrew Morton
2008-11-10 21:34 ` Nicolas Pitre
0 siblings, 1 reply; 118+ messages in thread
From: Andrew Morton @ 2008-11-10 4:42 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Mathieu Desnoyers, Linus Torvalds, Russell King, David Howells,
Ingo Molnar, Peter Zijlstra, lkml, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
On Sun, 09 Nov 2008 23:20:00 -0500 (EST) Nicolas Pitre <nico@cam.org> wrote:
> On Sun, 9 Nov 2008, Mathieu Desnoyers wrote:
>
> > * Nicolas Pitre (nico@cam.org) wrote:
> > > Do you really have such instances where multiple call sites are needed?
> > > That sounds even more confusing to me than the current model. Better
> > > encapsulate the usage of this macro within some function which has a
> > > stronger meaning, such as sched_clock(), and call _that_ from multiple
> > > sites instead.
> >
> > I see a few reasons for it :
> >
> > - If we want to inline the whole read function so we don't pay the extra
> > runtime cost of a function call, this would become required.
>
> You can inline it as you want as long as it remains in the same .c file.
> The static variable is still shared amongst all call sites in that case.
Please don't rely upon deep compiler behaviour like that. It is
unobvious to the reader and it might break if someone uses it incorrectly,
or if the compiler implementation changes, or if a non-gcc compiler is
used, etc.
It is far better to make the management of the state explicit and at
the control of the caller. Get the caller to allocate the state and
pass its address into this function. Simple, clear, explicit and
robust.
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v2] clarify usage expectations for cnt32_to_63()
2008-11-10 4:42 ` Andrew Morton
@ 2008-11-10 21:34 ` Nicolas Pitre
2008-11-10 21:58 ` Andrew Morton
0 siblings, 1 reply; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-10 21:34 UTC (permalink / raw)
To: Andrew Morton
Cc: Mathieu Desnoyers, Linus Torvalds, Russell King, David Howells,
Ingo Molnar, Peter Zijlstra, lkml, Ralf Baechle, benh, paulus,
David Miller, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
linux-arch
On Sun, 9 Nov 2008, Andrew Morton wrote:
> On Sun, 09 Nov 2008 23:20:00 -0500 (EST) Nicolas Pitre <nico@cam.org> wrote:
>
> > On Sun, 9 Nov 2008, Mathieu Desnoyers wrote:
> >
> > > * Nicolas Pitre (nico@cam.org) wrote:
> > > > Do you really have such instances where multiple call sites are needed?
> > > > That sounds even more confusing to me than the current model. Better
> > > > encapsulate the usage of this macro within some function which has a
> > > > stronger meaning, such as sched_clock(), and call _that_ from multiple
> > > > sites instead.
> > >
> > > I see a few reasons for it :
> > >
> > > - If we want to inline the whole read function so we don't pay the extra
> > > runtime cost of a function call, this would become required.
> >
> > You can inline it as you want as long as it remains in the same .c file.
> > The static variable is still shared amongst all call sites in that case.
>
> Please don't rely upon deep compiler behaviour like that. It is
> unobvious to the reader and it might break if someone uses it incorrectly,
> or if the compiler implementation changes, or if a non-gcc compiler is
> used, etc.
If a compiler doesn't reference the same storage for a static variable
used by a function that gets inlined in the same compilation unit then
it is utterly broken.
> It is far better to make the management of the state explicit and at
> the control of the caller. Get the caller to allocate the state and
> pass its address into this function. Simple, clear, explicit and
> robust.
Sigh... What about this compromise then?
diff --git a/include/linux/cnt32_to_63.h b/include/linux/cnt32_to_63.h
index 7605fdd..74ce767 100644
--- a/include/linux/cnt32_to_63.h
+++ b/include/linux/cnt32_to_63.h
@@ -32,8 +32,9 @@ union cnt32_to_63 {
/**
- * cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
+ * __cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
* @cnt_lo: The low part of the counter
+ * @cnt_hi_p: Pointer to storage for the extended part of the counter
*
* Many hardware clock counters are only 32 bits wide and therefore have
* a relatively short period making wrap-arounds rather frequent. This
@@ -75,16 +76,31 @@ union cnt32_to_63 {
* clear-bit instruction. Otherwise caller must remember to clear the top
* bit explicitly.
*/
-#define cnt32_to_63(cnt_lo) \
+#define __cnt32_to_63(cnt_lo, cnt_hi_p) \
({ \
- static u32 __m_cnt_hi; \
union cnt32_to_63 __x; \
- __x.hi = __m_cnt_hi; \
+ __x.hi = *(cnt_hi_p); \
smp_rmb(); \
__x.lo = (cnt_lo); \
if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
- __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
+ *(cnt_hi_p) = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
__x.val; \
})
+/**
+ * cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
+ * @cnt_lo: The low part of the counter
+ *
+ * This is the same as __cnt32_to_63() except that the storage for the
+ * extended part of the counter is implicit. Because this uses a static
+ * variable, a user of this code must not be an inline function unless
+ * that function is contained in a single .c file for a given counter.
+ * All usage requirements for __cnt32_to_63() also apply here as well.
+ */
+#define cnt32_to_63(cnt_lo) \
+({ \
+ static u32 __m_cnt_hi; \
+ __cnt32_to_63(cnt_lo, &__m_cnt_hi); \
+})
+
#endif
Nicolas
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v2] clarify usage expectations for cnt32_to_63()
2008-11-10 21:34 ` Nicolas Pitre
@ 2008-11-10 21:58 ` Andrew Morton
2008-11-10 23:15 ` Nicolas Pitre
0 siblings, 1 reply; 118+ messages in thread
From: Andrew Morton @ 2008-11-10 21:58 UTC (permalink / raw)
To: Nicolas Pitre
Cc: mathieu.desnoyers, torvalds, rmk+lkml, dhowells, mingo,
a.p.zijlstra, linux-kernel, ralf, benh, paulus, davem, mingo,
tglx, rostedt, linux-arch
On Mon, 10 Nov 2008 16:34:54 -0500 (EST)
Nicolas Pitre <nico@cam.org> wrote:
> > It is far better to make the management of the state explicit and at
> > the control of the caller. Get the caller to allocate the state and
> > pass its address into this function. Simple, clear, explicit and
> > robust.
>
> Sigh... What about this compromise then?
>
> diff --git a/include/linux/cnt32_to_63.h b/include/linux/cnt32_to_63.h
> index 7605fdd..74ce767 100644
> --- a/include/linux/cnt32_to_63.h
> +++ b/include/linux/cnt32_to_63.h
> @@ -32,8 +32,9 @@ union cnt32_to_63 {
>
>
> /**
> - * cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
> + * __cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
> * @cnt_lo: The low part of the counter
> + * @cnt_hi_p: Pointer to storage for the extended part of the counter
> *
> * Many hardware clock counters are only 32 bits wide and therefore have
> * a relatively short period making wrap-arounds rather frequent. This
> @@ -75,16 +76,31 @@ union cnt32_to_63 {
> * clear-bit instruction. Otherwise caller must remember to clear the top
> * bit explicitly.
> */
> -#define cnt32_to_63(cnt_lo) \
> +#define __cnt32_to_63(cnt_lo, cnt_hi_p) \
> ({ \
> - static u32 __m_cnt_hi; \
> union cnt32_to_63 __x; \
> - __x.hi = __m_cnt_hi; \
> + __x.hi = *(cnt_hi_p); \
> smp_rmb(); \
> __x.lo = (cnt_lo); \
> if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
> - __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
> + *(cnt_hi_p) = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
> __x.val; \
> })
This references its second argument twice, which can cause correctness
or efficiency problems.
There is no reason that this had to be implemented in cpp.
Implementing it in C will fix the above problem.
To the reader of the code, a call to cnt32_to_63() looks exactly like a
plain old function call. Hiding the instantiation of the state storage
inside this macro misleads the reader and hence is bad practice. This
is one of the reasons why the management of that state should be
performed by the caller and made explicit.
I cannot think of any other cases in the kernel where this trick of
instantiating static storage at a macro expansion site is performed.
It is unusual. It will surprise readers. Surprising readers is
undesirable.
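To make the double-evaluation point concrete (a contrived example, not from the patch; pick_hi_ptr() is a hypothetical helper):

/* cnt_hi_p is expanded twice by __cnt32_to_63(), so with */
u64 v = __cnt32_to_63(readl(COUNTER), pick_hi_ptr());
/*
 * the expansion becomes roughly:
 *	__x.hi = *(pick_hi_ptr());
 *	...
 *	*(pick_hi_ptr()) = __x.hi = ...;
 * i.e. pick_hi_ptr() may run twice: wrong if it has side effects,
 * wasteful if it is merely expensive.
 */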
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v2] clarify usage expectations for cnt32_to_63()
2008-11-10 21:58 ` Andrew Morton
@ 2008-11-10 23:15 ` Nicolas Pitre
2008-11-10 23:22 ` Andrew Morton
0 siblings, 1 reply; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-10 23:15 UTC (permalink / raw)
To: Andrew Morton
Cc: mathieu.desnoyers, torvalds, rmk+lkml, dhowells, mingo,
a.p.zijlstra, linux-kernel, ralf, benh, paulus, davem, mingo,
tglx, rostedt, linux-arch
On Mon, 10 Nov 2008, Andrew Morton wrote:
> On Mon, 10 Nov 2008 16:34:54 -0500 (EST)
> Nicolas Pitre <nico@cam.org> wrote:
>
> > > It is far better to make the management of the state explicit and at
> > > the control of the caller. Get the caller to allocate the state and
> > > pass its address into this function. Simple, clear, explicit and
> > > robust.
> >
> > Sigh... What about this compromise then?
> >
> > diff --git a/include/linux/cnt32_to_63.h b/include/linux/cnt32_to_63.h
> > index 7605fdd..74ce767 100644
> > --- a/include/linux/cnt32_to_63.h
> > +++ b/include/linux/cnt32_to_63.h
> > @@ -32,8 +32,9 @@ union cnt32_to_63 {
> >
> >
> > /**
> > - * cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
> > + * __cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
> > * @cnt_lo: The low part of the counter
> > + * @cnt_hi_p: Pointer to storage for the extended part of the counter
> > *
> > * Many hardware clock counters are only 32 bits wide and therefore have
> > * a relatively short period making wrap-arounds rather frequent. This
> > @@ -75,16 +76,31 @@ union cnt32_to_63 {
> > * clear-bit instruction. Otherwise caller must remember to clear the top
> > * bit explicitly.
> > */
> > -#define cnt32_to_63(cnt_lo) \
> > +#define __cnt32_to_63(cnt_lo, cnt_hi_p) \
> > ({ \
> > - static u32 __m_cnt_hi; \
> > union cnt32_to_63 __x; \
> > - __x.hi = __m_cnt_hi; \
> > + __x.hi = *(cnt_hi_p); \
> > smp_rmb(); \
> > __x.lo = (cnt_lo); \
> > if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
> > - __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
> > + *(cnt_hi_p) = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
> > __x.val; \
> > })
>
> This references its second argument twice, which can cause correctness
> or efficiency problems.
>
> There is no reason that this had to be implemented in cpp.
> Implementing it in C will fix the above problem.
No, it won't, for correctness and efficiency reasons.
And I've explained why already.
No need to discuss this further if you can't get it.
Nicolas
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v2] clarify usage expectations for cnt32_to_63()
2008-11-10 23:15 ` Nicolas Pitre
@ 2008-11-10 23:22 ` Andrew Morton
2008-11-10 23:38 ` Steven Rostedt
2008-11-11 0:26 ` Nicolas Pitre
0 siblings, 2 replies; 118+ messages in thread
From: Andrew Morton @ 2008-11-10 23:22 UTC (permalink / raw)
To: Nicolas Pitre
Cc: mathieu.desnoyers, torvalds, rmk+lkml, dhowells, mingo,
a.p.zijlstra, linux-kernel, ralf, benh, paulus, davem, mingo,
tglx, rostedt, linux-arch
On Mon, 10 Nov 2008 18:15:32 -0500 (EST)
Nicolas Pitre <nico@cam.org> wrote:
> On Mon, 10 Nov 2008, Andrew Morton wrote:
>
> > On Mon, 10 Nov 2008 16:34:54 -0500 (EST)
> > Nicolas Pitre <nico@cam.org> wrote:
> >
> > > > It is far better to make the management of the state explicit and at
> > > > the control of the caller. Get the caller to allocate the state and
> > > > pass its address into this function. Simple, clear, explicit and
> > > > robust.
> > >
> > > Sigh... What about this compromise then?
> > >
> > > diff --git a/include/linux/cnt32_to_63.h b/include/linux/cnt32_to_63.h
> > > index 7605fdd..74ce767 100644
> > > --- a/include/linux/cnt32_to_63.h
> > > +++ b/include/linux/cnt32_to_63.h
> > > @@ -32,8 +32,9 @@ union cnt32_to_63 {
> > >
> > >
> > > /**
> > > - * cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
> > > + * __cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
> > > * @cnt_lo: The low part of the counter
> > > + * @cnt_hi_p: Pointer to storage for the extended part of the counter
> > > *
> > > * Many hardware clock counters are only 32 bits wide and therefore have
> > > * a relatively short period making wrap-arounds rather frequent. This
> > > @@ -75,16 +76,31 @@ union cnt32_to_63 {
> > > * clear-bit instruction. Otherwise caller must remember to clear the top
> > > * bit explicitly.
> > > */
> > > -#define cnt32_to_63(cnt_lo) \
> > > +#define __cnt32_to_63(cnt_lo, cnt_hi_p) \
> > > ({ \
> > > - static u32 __m_cnt_hi; \
> > > union cnt32_to_63 __x; \
> > > - __x.hi = __m_cnt_hi; \
> > > + __x.hi = *(cnt_hi_p); \
> > > smp_rmb(); \
> > > __x.lo = (cnt_lo); \
> > > if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
> > > - __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
> > > + *(cnt_hi_p) = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
> > > __x.val; \
> > > })
> >
> > This references its second argument twice, which can cause correctness
> > or efficiency problems.
> >
> > There is no reason that this had to be implemented in cpp.
> > Implementing it in C will fix the above problem.
>
> No, it won't, for correctness and efficiency reasons.
>
> And I've explained why already.
I'd be very surprised if you've really found a case where a macro is
faster than an inlined function. I don't think that has happened
before.
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v2] clarify usage expectations for cnt32_to_63()
2008-11-10 23:22 ` Andrew Morton
@ 2008-11-10 23:38 ` Steven Rostedt
2008-11-11 0:26 ` Nicolas Pitre
1 sibling, 0 replies; 118+ messages in thread
From: Steven Rostedt @ 2008-11-10 23:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Nicolas Pitre, mathieu.desnoyers, torvalds, rmk+lkml, dhowells,
mingo, a.p.zijlstra, linux-kernel, ralf, benh, paulus, davem,
mingo, tglx, linux-arch
On Mon, 10 Nov 2008, Andrew Morton wrote:
> On Mon, 10 Nov 2008 18:15:32 -0500 (EST)
> Nicolas Pitre <nico@cam.org> wrote:
> > >
> > > This references its second argument twice, which can cause correctness
> > > or efficiency problems.
> > >
> > > There is no reason that this had to be implemented in cpp.
> > > Implementing it in C will fix the above problem.
> >
> > No, it won't, for correctness and efficiency reasons.
> >
> > And I've explained why already.
>
> I'd be very surprised if you've really found a case where a macro is
> faster than an inlined function. I don't think that has happened
> before.
But that's the way my Grandpa did it. With macros!
-- Steve
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v2] clarify usage expectations for cnt32_to_63()
2008-11-10 23:22 ` Andrew Morton
2008-11-10 23:38 ` Steven Rostedt
@ 2008-11-11 0:26 ` Nicolas Pitre
2008-11-11 18:28 ` [PATCH] convert cnt32_to_63 to inline Mathieu Desnoyers
2008-11-11 22:31 ` David Howells
1 sibling, 2 replies; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-11 0:26 UTC (permalink / raw)
To: Andrew Morton
Cc: mathieu.desnoyers, torvalds, rmk+lkml, dhowells, mingo,
a.p.zijlstra, linux-kernel, ralf, benh, paulus, davem, mingo,
tglx, rostedt, linux-arch
On Mon, 10 Nov 2008, Andrew Morton wrote:
> On Mon, 10 Nov 2008 18:15:32 -0500 (EST)
> Nicolas Pitre <nico@cam.org> wrote:
>
> > On Mon, 10 Nov 2008, Andrew Morton wrote:
> >
> > > This references its second argument twice, which can cause correctness
> > > or efficiency problems.
> > >
> > > There is no reason that this had to be implemented in cpp.
> > > Implementing it in C will fix the above problem.
> >
> > No, it won't, for correctness and efficiency reasons.
> >
> > And I've explained why already.
>
> I'd be very surprised if you've really found a case where a macro is
> faster than an inlined function. I don't think that has happened
> before.
That hasn't anything to do with "a macro is faster" at all. It's all
about the order used to evaluate provided arguments. And the first one
might be anything like a memory value, an IO operation, an expression,
etc. An inline function would work correctly with pointers only and
therefore totally break apart on x86 for example.
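A sketch of the evaluation-order issue being described (placeholder names; cnt32_to_63_fn stands for a hypothetical inline version, only to illustrate the argument):

static u32 cnt_hi;	/* placeholder for caller-owned high word */

/* Macro form: the cnt_lo expression (an MMIO read, rdtsc, ...) is
 * evaluated inside the body, after the high word has been read: */
u64 a = cnt32_to_63(readl(COUNTER));

/* Inline-function form: the argument is evaluated at the call site,
 * before the function body ever reads the high word, so the required
 * "hi before lo" order is lost unless every caller restores it by hand: */
u64 b = cnt32_to_63_fn(readl(COUNTER), &cnt_hi);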
Nicolas
^ permalink raw reply [flat|nested] 118+ messages in thread
* [PATCH] convert cnt32_to_63 to inline
2008-11-11 0:26 ` Nicolas Pitre
@ 2008-11-11 18:28 ` Mathieu Desnoyers
2008-11-11 19:13 ` Russell King
2008-11-11 21:00 ` Nicolas Pitre
2008-11-11 22:31 ` David Howells
1 sibling, 2 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-11 18:28 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Andrew Morton, torvalds, rmk+lkml, dhowells, mingo, a.p.zijlstra,
linux-kernel, ralf, benh, paulus, davem, mingo, tglx, rostedt,
linux-arch
* Nicolas Pitre (nico@cam.org) wrote:
> On Mon, 10 Nov 2008, Andrew Morton wrote:
>
> > On Mon, 10 Nov 2008 18:15:32 -0500 (EST)
> > Nicolas Pitre <nico@cam.org> wrote:
> >
> > > On Mon, 10 Nov 2008, Andrew Morton wrote:
> > >
> > > > This references its second argument twice, which can cause correctness
> > > > or efficiency problems.
> > > >
> > > > There is no reason that this had to be implemented in cpp.
> > > > Implementing it in C will fix the above problem.
> > >
> > > No, it won't, for correctness and efficiency reasons.
> > >
> > > And I've explained why already.
> >
> > I'd be very surprised if you've really found a case where a macro is
> > faster than an inlined function. I don't think that has happened
> > before.
>
> That hasn't anything to do with "a macro is faster" at all. It's all
> about the order used to evaluate provided arguments. And the first one
> might be anything like a memory value, an IO operation, an expression,
> etc. An inline function would work correctly with pointers only and
> therefore totally break apart on x86 for example.
>
>
> Nicolas
Let's see what it gives once implemented. Only compile-tested. Assumes
pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() are used for
arm versatile.
Turn cnt32_to_63 into an inline function.
Change all callers to new API.
Document barrier usage.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Nicolas Pitre <nico@cam.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: torvalds@linux-foundation.org
CC: rmk+lkml@arm.linux.org.uk
CC: dhowells@redhat.com
CC: paulus@samba.org
CC: a.p.zijlstra@chello.nl
CC: mingo@elte.hu
CC: benh@kernel.crashing.org
CC: rostedt@goodmis.org
CC: tglx@linutronix.de
CC: davem@davemloft.net
CC: ralf@linux-mips.org
---
arch/arm/mach-pxa/time.c | 14 ++++++++++++-
arch/arm/mach-sa1100/generic.c | 15 +++++++++++++-
arch/arm/mach-versatile/core.c | 12 ++++++++++-
arch/mn10300/kernel/time.c | 19 +++++++++++++-----
include/linux/cnt32_to_63.h | 42 +++++++++++++++++++++++++++++------------
5 files changed, 82 insertions(+), 20 deletions(-)
Index: linux.trees.git/arch/arm/mach-pxa/time.c
===================================================================
--- linux.trees.git.orig/arch/arm/mach-pxa/time.c 2008-11-11 12:20:42.000000000 -0500
+++ linux.trees.git/arch/arm/mach-pxa/time.c 2008-11-11 13:05:01.000000000 -0500
@@ -37,6 +37,10 @@
#define OSCR2NS_SCALE_FACTOR 10
static unsigned long oscr2ns_scale;
+static u32 sched_clock_cnt_hi; /*
+ * Shared cnt_hi OK with cycle counter only
+ * for UP systems.
+ */
static void __init set_oscr2ns_scale(unsigned long oscr_rate)
{
@@ -54,7 +58,15 @@ static void __init set_oscr2ns_scale(uns
unsigned long long sched_clock(void)
{
- unsigned long long v = cnt32_to_63(OSCR);
+ u32 cnt_lo, cnt_hi;
+ unsigned long long v;
+
+ preempt_disable_notrace();
+ cnt_hi = sched_clock_cnt_hi;
+ barrier(); /* read cnt_hi before cnt_lo */
+ cnt_lo = OSCR;
+ v = cnt32_to_63(cnt_hi, cnt_lo, &sched_clock_cnt_hi);
+ preempt_enable_notrace();
return (v * oscr2ns_scale) >> OSCR2NS_SCALE_FACTOR;
}
Index: linux.trees.git/include/linux/cnt32_to_63.h
===================================================================
--- linux.trees.git.orig/include/linux/cnt32_to_63.h 2008-11-11 12:20:17.000000000 -0500
+++ linux.trees.git/include/linux/cnt32_to_63.h 2008-11-11 13:10:44.000000000 -0500
@@ -32,7 +32,9 @@ union cnt32_to_63 {
/**
* cnt32_to_63 - Expand a 32-bit counter to a 63-bit counter
- * @cnt_lo: The low part of the counter
+ * @cnt_hi: The high part of the counter (read first)
+ * @cnt_lo: The low part of the counter (read after cnt_hi)
+ * @cnt_hi_ptr: Pointer to high part of the counter
*
* Many hardware clock counters are only 32 bits wide and therefore have
* a relatively short period making wrap-arounds rather frequent. This
@@ -57,7 +59,10 @@ union cnt32_to_63 {
* code must be executed at least once per each half period of the 32-bit
* counter to properly update the state bit in memory. This is usually not a
* problem in practice, but if it is then a kernel timer could be scheduled
- * to manage for this code to be executed often enough.
+ * to manage for this code to be executed often enough. If a per-cpu cnt_hi is
+ * used, the value must be updated at least once per 32-bits half-period on each
+ * CPU. If cnt_hi is shared between CPUs, it suffices to update it once per
+ * 32-bits half-period on any CPU.
*
* Note that the top bit (bit 63) in the returned value should be considered
* as garbage. It is not cleared here because callers are likely to use a
@@ -65,16 +70,29 @@ union cnt32_to_63 {
* implicitly by making the multiplier even, therefore saving on a runtime
* clear-bit instruction. Otherwise caller must remember to clear the top
* bit explicitly.
+ *
+ * Preemption must be disabled when reading the cnt_hi and cnt_lo values and
+ * calling this function.
+ *
+ * The cnt_hi parameter _must_ be read before cnt_lo. This implies using the
+ * proper barriers :
+ * - smp_rmb() if cnt_lo is read from mmio and the cnt_hi variable is shared
+ * across CPUs.
+ * - use a per-cpu variable for cnt_high if cnt_lo is read from per-cpu cycles
+ * counters or to read the counters with only a barrier().
*/
-#define cnt32_to_63(cnt_lo) \
-({ \
- static volatile u32 __m_cnt_hi; \
- union cnt32_to_63 __x; \
- __x.hi = __m_cnt_hi; \
- __x.lo = (cnt_lo); \
- if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
- __m_cnt_hi = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31); \
- __x.val; \
-})
+static inline u64 cnt32_to_63(u32 cnt_hi, u32 cnt_lo, u32 *cnt_hi_ptr)
+{
+ union cnt32_to_63 __x = {
+ {
+ .hi = cnt_hi,
+ .lo = cnt_lo,
+ },
+ };
+
+ if (unlikely((s32)(__x.hi ^ __x.lo) < 0))
+ *cnt_hi_ptr = __x.hi = (__x.hi ^ 0x80000000) + (__x.hi >> 31);
+ return __x.val; /* Remember to clear the top bit in the caller */
+}
#endif
Index: linux.trees.git/arch/arm/mach-sa1100/generic.c
===================================================================
--- linux.trees.git.orig/arch/arm/mach-sa1100/generic.c 2008-11-11 12:20:42.000000000 -0500
+++ linux.trees.git/arch/arm/mach-sa1100/generic.c 2008-11-11 13:05:10.000000000 -0500
@@ -34,6 +34,11 @@
unsigned int reset_status;
EXPORT_SYMBOL(reset_status);
+static u32 sched_clock_cnt_hi; /*
+ * Shared cnt_hi OK with cycle counter only
+ * for UP systems.
+ */
+
#define NR_FREQS 16
/*
@@ -133,7 +138,15 @@ EXPORT_SYMBOL(cpufreq_get);
*/
unsigned long long sched_clock(void)
{
- unsigned long long v = cnt32_to_63(OSCR);
+ u32 cnt_lo, cnt_hi;
+ unsigned long long v;
+
+ preempt_disable_notrace();
+ cnt_hi = sched_clock_cnt_hi;
+ barrier(); /* read cnt_hi before cnt_lo */
+ cnt_lo = OSCR;
+ v = cnt32_to_63(cnt_hi, cnt_lo, &sched_clock_cnt_hi);
+ preempt_enable_notrace();
/* the <<1 gets rid of the cnt_32_to_63 top bit saving on a bic insn */
v *= 78125<<1;
Index: linux.trees.git/arch/arm/mach-versatile/core.c
===================================================================
--- linux.trees.git.orig/arch/arm/mach-versatile/core.c 2008-11-11 12:20:42.000000000 -0500
+++ linux.trees.git/arch/arm/mach-versatile/core.c 2008-11-11 12:57:55.000000000 -0500
@@ -60,6 +60,8 @@
#define VA_VIC_BASE __io_address(VERSATILE_VIC_BASE)
#define VA_SIC_BASE __io_address(VERSATILE_SIC_BASE)
+static u32 sched_clock_cnt_hi;
+
static void sic_mask_irq(unsigned int irq)
{
irq -= IRQ_SIC_START;
@@ -238,7 +240,15 @@ void __init versatile_map_io(void)
*/
unsigned long long sched_clock(void)
{
- unsigned long long v = cnt32_to_63(readl(VERSATILE_REFCOUNTER));
+ u32 cnt_lo, cnt_hi;
+ unsigned long long v;
+
+ preempt_disable_notrace();
+ cnt_hi = sched_clock_cnt_hi;
+ smp_rmb();
+ cnt_lo = readl(VERSATILE_REFCOUNTER);
+ v = cnt32_to_63(cnt_hi, cnt_lo, &sched_clock_cnt_hi);
+ preempt_enable_notrace();
/* the <<1 gets rid of the cnt_32_to_63 top bit saving on a bic insn */
v *= 125<<1;
Index: linux.trees.git/arch/mn10300/kernel/time.c
===================================================================
--- linux.trees.git.orig/arch/mn10300/kernel/time.c 2008-11-11 12:41:42.000000000 -0500
+++ linux.trees.git/arch/mn10300/kernel/time.c 2008-11-11 13:04:42.000000000 -0500
@@ -29,6 +29,11 @@ unsigned long mn10300_iobclk; /* system
unsigned long mn10300_tsc_per_HZ; /* number of ioclks per jiffy */
#endif /* CONFIG_MN10300_RTC */
+static u32 sched_clock_cnt_hi; /*
+ * shared cnt_hi OK with cycle counter only
+ * for UP systems.
+ */
+
static unsigned long mn10300_last_tsc; /* time-stamp counter at last time
* interrupt occurred */
@@ -52,18 +57,22 @@ unsigned long long sched_clock(void)
unsigned long long ll;
unsigned l[2];
} tsc64, result;
- unsigned long tsc, tmp;
+ unsigned long tmp;
unsigned product[3]; /* 96-bit intermediate value */
+ u32 cnt_lo, cnt_hi;
- /* read the TSC value
- */
- tsc = 0 - get_cycles(); /* get_cycles() counts down */
+ preempt_disable_notrace();
+ cnt_hi = sched_clock_cnt_hi;
+ barrier(); /* read cnt_hi before cnt_lo */
+ cnt_lo = 0 - get_cycles(); /* get_cycles() counts down */
/* expand to 64-bits.
* - sched_clock() must be called once a minute or better or the
* following will go horribly wrong - see cnt32_to_63()
*/
- tsc64.ll = cnt32_to_63(tsc) & 0x7fffffffffffffffULL;
+ tsc64.ll = cnt32_to_63(cnt_hi, cnt_lo, &sched_clock_cnt_hi);
+ tsc64.ll &= 0x7fffffffffffffffULL;
+ preempt_enable_notrace();
/* scale the 64-bit TSC value to a nanosecond value via a 96-bit
* intermediate
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH] convert cnt32_to_63 to inline
2008-11-11 18:28 ` [PATCH] convert cnt32_to_63 to inline Mathieu Desnoyers
@ 2008-11-11 19:13 ` Russell King
2008-11-11 20:11 ` Mathieu Desnoyers
2008-11-11 21:00 ` Nicolas Pitre
1 sibling, 1 reply; 118+ messages in thread
From: Russell King @ 2008-11-11 19:13 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Nicolas Pitre, Andrew Morton, torvalds, dhowells, mingo,
a.p.zijlstra, linux-kernel, ralf, benh, paulus, davem, mingo,
tglx, rostedt, linux-arch
On Tue, Nov 11, 2008 at 01:28:00PM -0500, Mathieu Desnoyers wrote:
> Let's see what it gives once implemented. Only compile-tested. Assumes
> pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() are used for
> arm versatile.
Versatile is also UP only.
The following are results from PXA built with gcc 3.4.3:
1. two additional registers used in sched_clock()
2. 8 additional bytes of code (which are needless if gcc were more intelligent)
both of these I put down to inefficiencies in gcc's register allocation.
3. worse instruction scheduling - two inter-dependent loads next to each
other causing a pipeline stall
Actual reading of variables/hardware is unaffected by this patch.
Old code:
c: e59f3050 ldr r3, [pc, #80] ; load address of oscr2ns_scale
10: e59fc050 ldr ip, [pc, #80] ; load address of __m_cnt_hi
14: e5932000 ldr r2, [r3] ; read oscr2ns_scale
18: e59f304c ldr r3, [pc, #76] ; load address of OSCR
1c: e59c1000 ldr r1, [ip] ; read __m_cnt_hi
20: e1a07002 mov r7, r2
24: e3a08000 mov r8, #0 ; 0x0
28: e5933000 ldr r3, [r3] ; read OSCR register
...
58: e1820b04 orr r0, r2, r4, lsl #22
5c: e1a01524 lsr r1, r4, #10
60: e89da9f0 ldm sp, {r4, r5, r6, r7, r8, fp, sp, pc}
New code:
c: e59f0058 ldr r0, [pc, #88] ; load address of oscr2ns_scale
10: e5901000 ldr r1, [r0] ; read oscr2ns_scale <= pipeline stall
14: e59f3054 ldr r3, [pc, #84] ; load address of __m_cnt_hi
18: e1a08001 mov r8, r1
1c: e5932000 ldr r2, [r3] ; read __m_cnt_hi
20: e59f304c ldr r3, [pc, #76] ; load address of OSCR
24: e1a09002 mov r9, r2
28: e3a0a000 mov sl, #0 ; 0x0
2c: e5933000 ldr r3, [r3] ; read OSCR
...
58: e1825b04 orr r5, r2, r4, lsl #22
5c: e1a06524 lsr r6, r4, #10
60: e1a01006 mov r1, r6
64: e1a00005 mov r0, r5
68: e89daff0 ldm sp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc}
Versatile:
1. 12 additional bytes of code
2. same number of registers
3. worse instruction scheduling causing pipeline stall
Actual reading of variables/hardware is unaffected by this patch.
So, we have two platforms where this patch makes things visibly worse
with no material benefit.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH] convert cnt32_to_63 to inline
2008-11-11 19:13 ` Russell King
@ 2008-11-11 20:11 ` Mathieu Desnoyers
2008-11-11 21:51 ` Russell King
0 siblings, 1 reply; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-11 20:11 UTC (permalink / raw)
To: Nicolas Pitre, Andrew Morton, torvalds, dhowells, mingo,
a.p.zijlstra, linux-kernel, ralf, benh, paulus, davem, mingo,
tglx, rostedt, linux-arch
* Russell King (rmk+lkml@arm.linux.org.uk) wrote:
> On Tue, Nov 11, 2008 at 01:28:00PM -0500, Mathieu Desnoyers wrote:
> > Let's see what it gives once implemented. Only compile-tested. Assumes
> > pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() are used for
> > arm versatile.
>
> Versatile is also UP only.
>
> The following are results from PXA built with gcc 3.4.3:
>
> 1. two additional registers used in sched_clock()
> 2. 8 additional bytes of code (which are needless if gcc were more intelligent)
>
> both of these I put down to inefficiencies in gcc's register allocation.
>
> 3. worse instruction scheduling - two inter-dependent loads next to each
> other causing a pipeline stall
>
> Actual reading of variables/hardware is unaffected by this patch.
>
> Old code:
>
> c: e59f3050 ldr r3, [pc, #80] ; load address of oscr2ns_scale
> 10: e59fc050 ldr ip, [pc, #80] ; load address of __m_cnt_hi
> 14: e5932000 ldr r2, [r3] ; read oscr2ns_scale
> 18: e59f304c ldr r3, [pc, #76] ; load address of OSCR
> 1c: e59c1000 ldr r1, [ip] ; read __m_cnt_hi
> 20: e1a07002 mov r7, r2
> 24: e3a08000 mov r8, #0 ; 0x0
> 28: e5933000 ldr r3, [r3] ; read OSCR register
> ...
> 58: e1820b04 orr r0, r2, r4, lsl #22
> 5c: e1a01524 lsr r1, r4, #10
> 60: e89da9f0 ldm sp, {r4, r5, r6, r7, r8, fp, sp, pc}
>
>
> New code:
>
> c: e59f0058 ldr r0, [pc, #88] ; load address of oscr2ns_scale
> 10: e5901000 ldr r1, [r0] ; read oscr2ns_scale <= pipeline stall
> 14: e59f3054 ldr r3, [pc, #84] ; load address of __m_cnt_hi
> 18: e1a08001 mov r8, r1
> 1c: e5932000 ldr r2, [r3] ; read __m_cnt_hi
> 20: e59f304c ldr r3, [pc, #76] ; load address of OSCR
> 24: e1a09002 mov r9, r2
> 28: e3a0a000 mov sl, #0 ; 0x0
> 2c: e5933000 ldr r3, [r3] ; read OSCR
> ...
> 58: e1825b04 orr r5, r2, r4, lsl #22
> 5c: e1a06524 lsr r6, r4, #10
> 60: e1a01006 mov r1, r6
> 64: e1a00005 mov r0, r5
> 68: e89daff0 ldm sp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc}
>
> Versatile:
>
> 1. 12 additional bytes of code
> 2. same number of registers
> 3. worse instruction scheduling causing pipeline stall
>
> Actual reading of variables/hardware is unaffected by this patch.
>
> So, we have two platforms where this patch makes things visibly worse
> with no material benefit.
>
I think the added barrier() calls are causing these pipeline stalls. They
don't allow the compiler to read variables such as oscr2ns_scale before
the barrier because gcc cannot assume it won't be modified. However, to
ensure that the OSCR read is done after the __m_cnt_hi read, this barrier seems
required to be safe against gcc optimizations.
Have you compared my patch to Nicolas' patch, which adds an smp_rmb() in
the macro, or to a vanilla tree?
I wonder if reading those values sooner would help gcc?
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH] convert cnt32_to_63 to inline
2008-11-11 18:28 ` [PATCH] convert cnt32_to_63 to inline Mathieu Desnoyers
2008-11-11 19:13 ` Russell King
@ 2008-11-11 21:00 ` Nicolas Pitre
2008-11-11 21:13 ` Russell King
1 sibling, 1 reply; 118+ messages in thread
From: Nicolas Pitre @ 2008-11-11 21:00 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Andrew Morton, torvalds, rmk+lkml, dhowells, mingo, a.p.zijlstra,
linux-kernel, ralf, benh, paulus, davem, mingo, tglx, rostedt,
linux-arch
On Tue, 11 Nov 2008, Mathieu Desnoyers wrote:
> * Nicolas Pitre (nico@cam.org) wrote:
> > That hasn't anything to do with "a macro is faster" at all. It's all
> > about the order used to evaluate provided arguments. And the first one
> > might be anything like a memory value, an IO operation, an expression,
> > etc. An inline function would work correctly with pointers only and
> > therefore totally break apart on x86 for example.
> >
> >
> > Nicolas
>
> Let's see what it gives once implemented. Only compile-tested. Assumes
> pxa, sa110 and mn10300 are all UP-only. Correct smp_rmb() are used for
> arm versatile.
>
> Turn cnt32_to_63 into an inline function.
> Change all callers to new API.
> Document barrier usage.
Look, I'm not interested at all in this mess.
The _WHOLE_ point of the cnt32_to_63 macro was to abstract and
encapsulate the subtlety of the algorithm. It initially started as an
open coded implementation in PXA's sched_clock(). Then I was asked to
make it a macro that can be reused for other ARM platforms. Then David
wanted to reuse it on other platforms than ARM.
Now you are simply destroying all the value of having that macro in the
first place. The argument is that the macro could be misused because it
has a static variable inside it, etc. etc. The solution: spread the
subtlety all around instead of keeping it in the macro, and risk having
it wrong or broken due to changes surrounding it in the future.
And it _will_ happen due to the increased exposure making the whole idea
even more fragile compared to having it concentrated in only one spot.
This is total nonsense and I can't believe you truly think your patch
makes things better...
You're even disabling preemption where it is really unneeded, making the
whole thing about double its initial cost. Look at the generated
assembly and count the cycles if you don't believe me!
No thank you. If this trend continues I'm going to make it back private
to ARM again so you could pessimize your own code as much as you want.
NACK.
Nicolas
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH] convert cnt32_to_63 to inline
2008-11-11 21:00 ` Nicolas Pitre
@ 2008-11-11 21:13 ` Russell King
0 siblings, 0 replies; 118+ messages in thread
From: Russell King @ 2008-11-11 21:13 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Mathieu Desnoyers, Andrew Morton, torvalds, dhowells, mingo,
a.p.zijlstra, linux-kernel, ralf, benh, paulus, davem, mingo,
tglx, rostedt, linux-arch
On Tue, Nov 11, 2008 at 04:00:46PM -0500, Nicolas Pitre wrote:
> No thank you. If this trend continues I'm going to make it back private
> to ARM again so you could pessimize your own code as much as you want.
As I've already stated several days ago, I think that's the right
course of action. Given all the concerns raised, it's clearly not
something that should have been allowed to become generic.
So, let's just close this discussion off by taking that course of
action.
What's required is (in order):
1. a local copy for MN10300 needs to be created and it converted to that
2. these two commits then need to be reverted:
bc173c5789e1fc6065fd378edc815914b40ee86b
b4f151ff899362fec952c45d166252c9912c041f
Then our usage is again limited to sched_clock() which is well understood
and known to be problem free.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH] convert cnt32_to_63 to inline
2008-11-11 20:11 ` Mathieu Desnoyers
@ 2008-11-11 21:51 ` Russell King
2008-11-12 3:48 ` Mathieu Desnoyers
0 siblings, 1 reply; 118+ messages in thread
From: Russell King @ 2008-11-11 21:51 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Nicolas Pitre, Andrew Morton, torvalds, dhowells, mingo,
a.p.zijlstra, linux-kernel, ralf, benh, paulus, davem, mingo,
tglx, rostedt, linux-arch
On Tue, Nov 11, 2008 at 03:11:30PM -0500, Mathieu Desnoyers wrote:
> I think the added barrier() calls are causing these pipeline stalls. They
> don't allow the compiler to read variables such as oscr2ns_scale before
> the barrier because gcc cannot assume it won't be modified. However, to
> ensure that the OSCR read is done after the __m_cnt_hi read, this barrier seems
> required to be safe against gcc optimizations.
>
> Have you compared my patch to Nicolas' patch, which adds an smp_rmb() in
> the macro, or to a vanilla tree?
Nicolas' patch compared to unmodified - there are fewer side effects,
which come down to two pipeline stalls whereas we had none with
the unmodified code.
One pipeline stall for loading the address of __m_cnt_hi and reading
its value, followed by the same thing for oscr2ns_scale.
I think this is showing the problem of compiler barriers - they are
indiscriminate. They are total and complete barriers - not only do
they act on the data but also on the compiler's ability to emit code for
generating the addresses of the data to be loaded.
Clearly, neither the address of OSCR, __m_cnt_hi nor oscr2ns_scale is ever
going to change at run time - their addresses are all stored in the
literal pool, but by putting compiler barriers in, the compiler is
being prevented from reading from the literal pool at the most
appropriate point.
So, I've tried this:
unsigned long long sched_clock(void)
{
+ unsigned long *oscr2ns_ptr = &oscr2ns_scale;
unsigned long long v = cnt32_to_63(OSCR);
- return (v * oscr2ns_scale) >> OSCR2NS_SCALE_FACTOR;
+ return (v * *oscr2ns_ptr) >> OSCR2NS_SCALE_FACTOR;
}
to try to explicitly code the loads. This unfortunately results in
three pipeline stalls. I also tried swapping the two lines starting
'unsigned long', without any improvement over not having those extra hacks
to work around the barrier.
So, let's summarise this:
1. the existing code works, is correct on ARM, and is efficient.
2. throwing barriers into the function makes it less efficient.
3. re-engineering the code appears to make things worse.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH] convert cnt32_to_63 to inline
2008-11-11 0:26 ` Nicolas Pitre
2008-11-11 18:28 ` [PATCH] convert cnt32_to_63 to inline Mathieu Desnoyers
@ 2008-11-11 22:31 ` David Howells
2008-11-11 22:37 ` Peter Zijlstra
1 sibling, 1 reply; 118+ messages in thread
From: David Howells @ 2008-11-11 22:31 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: dhowells, Nicolas Pitre, Andrew Morton, torvalds, rmk+lkml,
mingo, a.p.zijlstra, linux-kernel, ralf, benh, paulus, davem,
mingo, tglx, rostedt, linux-arch
Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> @@ -52,18 +57,22 @@ unsigned long long sched_clock(void)
> ...
> + preempt_disable_notrace();
Please, no! sched_clock() is called with preemption or interrupts disabled
everywhere except from some debugging code (lock tracing IIRC). If you need
to insert this preemption disablement somewhere, please insert it there. At
least then sched_clock() will be called consistently.
David
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH] convert cnt32_to_63 to inline
2008-11-11 22:31 ` David Howells
@ 2008-11-11 22:37 ` Peter Zijlstra
2008-11-12 1:13 ` Steven Rostedt
0 siblings, 1 reply; 118+ messages in thread
From: Peter Zijlstra @ 2008-11-11 22:37 UTC (permalink / raw)
To: David Howells
Cc: Mathieu Desnoyers, Nicolas Pitre, Andrew Morton, torvalds,
rmk+lkml, mingo, linux-kernel, ralf, benh, paulus, davem, mingo,
tglx, rostedt, linux-arch
On Tue, 2008-11-11 at 22:31 +0000, David Howells wrote:
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
>
> > @@ -52,18 +57,22 @@ unsigned long long sched_clock(void)
> > ...
> > + preempt_disable_notrace();
>
> Please, no! sched_clock() is called with preemption or interrupts disabled
> everywhere except from some debugging code (lock tracing IIRC). If you need
> to insert this preemption disablement somewhere, please insert it there. At
> least then sched_clock() will be called consistently.
Agreed. You could do a WARN_ON(!in_atomic()); in sched_clock(), depending
on DEBUG_PREEMPT or something, to ensure this.
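A rough sketch of such a check at the top of sched_clock() (illustrative
only; whether in_atomic() alone is the right predicate, or whether
irqs_disabled() should be checked as well, depends on how callers
actually enter sched_clock()):
	#ifdef CONFIG_DEBUG_PREEMPT
		/*
		 * Sketch: warn once if sched_clock() is entered while
		 * preemptible.  in_atomic() covers preempt-disabled and
		 * irq/softirq contexts; the irqs_disabled() check covers
		 * callers that only disable interrupts without touching
		 * the preempt count.
		 */
		WARN_ON_ONCE(!in_atomic() && !irqs_disabled());
	#endif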
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH] convert cnt32_to_63 to inline
2008-11-11 22:37 ` Peter Zijlstra
@ 2008-11-12 1:13 ` Steven Rostedt
0 siblings, 0 replies; 118+ messages in thread
From: Steven Rostedt @ 2008-11-12 1:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: David Howells, Mathieu Desnoyers, Nicolas Pitre, Andrew Morton,
torvalds, rmk+lkml, mingo, linux-kernel, ralf, benh, paulus,
davem, mingo, tglx, linux-arch
On Tue, 11 Nov 2008, Peter Zijlstra wrote:
> On Tue, 2008-11-11 at 22:31 +0000, David Howells wrote:
> > Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> >
> > > @@ -52,18 +57,22 @@ unsigned long long sched_clock(void)
> > > ...
> > > + preempt_disable_notrace();
> >
> > Please, no! sched_clock() is called with preemption or interrupts disabled
> > everywhere except from some debugging code (lock tracing IIRC). If you need
> > to insert this preemption disablement somewhere, please insert it there. At
> > least then sched_clock() will be called consistently.
>
> Agreed. You could do a WARN_ON(!in_atomic()); in sched_clock(), depending
> on DEBUG_PREEMPT or something, to ensure this.
It would also be nice if this requirement (calling sched_clock() with
preemption disabled) were documented somewhere more obvious.
Doing as Peter suggested, adding a WARN_ON and documenting that this must
be called with preemption disabled, would be nice.
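For instance, a short comment on the generic definition would do (the
wording below is only a suggestion, not existing kernel documentation):
	/*
	 * sched_clock(): scheduler clock in nanoseconds.
	 *
	 * Must be called with preemption (or interrupts) disabled;
	 * architecture implementations may rely on this to read their
	 * per-CPU or cached state consistently.
	 */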
-- Steve
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH] convert cnt32_to_63 to inline
2008-11-11 21:51 ` Russell King
@ 2008-11-12 3:48 ` Mathieu Desnoyers
0 siblings, 0 replies; 118+ messages in thread
From: Mathieu Desnoyers @ 2008-11-12 3:48 UTC (permalink / raw)
To: Nicolas Pitre, Andrew Morton, torvalds, dhowells, mingo,
a.p.zijlstra, linux-kernel, ralf, benh, paulus, davem, mingo,
tglx, rostedt, linux-arch
* Russell King (rmk+lkml@arm.linux.org.uk) wrote:
> On Tue, Nov 11, 2008 at 03:11:30PM -0500, Mathieu Desnoyers wrote:
> > I think the added barrier() calls are causing these pipeline stalls. They
> > don't allow the compiler to read variables such as oscr2ns_scale before
> > the barrier because gcc cannot assume they won't be modified. However, to
> > ensure that the OSCR read is done after the __m_cnt_hi read, this barrier
> > seems required to be safe against gcc optimizations.
> >
> > Have you compared my patch to Nicolas' patch, which adds an smp_rmb() in
> > the macro, or to a vanilla tree?
>
> Nicolas' patch compared to unmodified - there are fewer side effects,
> which come down to two pipeline stalls, whereas we had none with
> the unmodified code.
>
> One pipeline stall for loading the address of __m_cnt_hi and reading
> its value, followed by the same thing for oscr2ns_scale.
>
> I think this is showing the problem of compiler barriers - they are
> indiscriminate. They are total and complete barriers - they constrain
> not only the data accesses but also the compiler's ability to emit the
> code that generates the addresses of the data to be loaded.
>
> Clearly, none of the addresses of OSCR, __m_cnt_hi or oscr2ns_scale is
> ever going to change at run time - they are all stored in the
> literal pool, but by putting compiler barriers in, the compiler is
> being prevented from reading from the literal pool at the most
> appropriate point.
>
> So, I've tried this:
>
> unsigned long long sched_clock(void)
> {
> + unsigned long *oscr2ns_ptr = &oscr2ns_scale;
> unsigned long long v = cnt32_to_63(OSCR);
> - return (v * oscr2ns_scale) >> OSCR2NS_SCALE_FACTOR;
> + return (v * *oscr2ns_ptr) >> OSCR2NS_SCALE_FACTOR;
> }
>
> to try to explicitly code the loads. This unfortunately results in
> three pipeline stalls. I also tried swapping the two lines starting
> 'unsigned long', without any improvement over not having those extra
> hacks to work around the barrier.
>
> So, let's summarise this:
>
> 1. the existing code works, is correct on ARM, and is efficient.
> 2. throwing barriers into the function makes it less efficient.
> 3. re-engineering the code appears to make things worse.
>
Hrm, if we want to obtain results similar to what gcc currently
generates, casting to (volatile u32) or doing a *(volatile u32 *)
dereference to force gcc to perform specific memory accesses in order
could probably be used. Those are generally discouraged because they
are not suitable for SMP systems, but we are talking here about
UP-specific optimizations that will end up in UP-only code, so why not?
http://gcc.gnu.org/onlinedocs/gcc-4.0.4/gcc/Volatiles.html
"The minimum either standard specifies is that at a sequence point all
previous accesses to volatile objects have stabilized and no subsequent
accesses have occurred. Thus an implementation is free to reorder and
combine volatile accesses which occur between sequence points, but
cannot do so for accesses across a sequence point."
Therefore, regarding program order, each volatile access is ensured to
be performed by the next semicolon (sequence point).
So that would be an argument for leaving the variable read in
architecture-specific code, because it heavily depends on the
architecture context (whether it's UP-only or must support SMP, whether
it has unordered memory reads...).
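For illustration, a UP-only read sequence relying on volatile accesses
rather than barrier() could look like this (a sketch only; __m_cnt_hi and
OSCR are the names used earlier in the thread, the helper itself is
hypothetical, and OSCR is assumed to already be a volatile register
accessor, as it is on PXA):
	/*
	 * Sketch: order the two reads with volatile accesses instead of
	 * a compiler barrier.  Each volatile access must be complete at
	 * the next sequence point, so gcc may not move the OSCR read
	 * before the __m_cnt_hi read, yet it stays free to schedule the
	 * (constant) address loads from the literal pool wherever it
	 * likes.  This only constrains the compiler, not the CPU, hence
	 * it is UP-only.
	 */
	static inline void read_cnt_hi_then_counter(u32 *hi, u32 *lo)
	{
		*hi = *(volatile u32 *)&__m_cnt_hi;
		*lo = OSCR;
	}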
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply [flat|nested] 118+ messages in thread
end of thread, other threads:[~2008-11-12 3:53 UTC | newest]
Thread overview: 118+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-11-07 5:23 [RFC patch 00/18] Trace Clock v2 Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 01/18] get_cycles() : kconfig HAVE_GET_CYCLES Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 02/18] get_cycles() : x86 HAVE_GET_CYCLES Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 03/18] get_cycles() : sparc64 HAVE_GET_CYCLES Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 04/18] get_cycles() : powerpc64 HAVE_GET_CYCLES Mathieu Desnoyers
2008-11-07 14:56 ` Josh Boyer
2008-11-07 18:14 ` Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 05/18] get_cycles() : MIPS HAVE_GET_CYCLES_32 Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 06/18] Trace clock generic Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 07/18] Trace clock core Mathieu Desnoyers
2008-11-07 5:52 ` Andrew Morton
2008-11-07 6:16 ` Mathieu Desnoyers
2008-11-07 6:26 ` Andrew Morton
2008-11-07 16:12 ` Mathieu Desnoyers
2008-11-07 16:19 ` Andrew Morton
2008-11-07 18:16 ` Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 08/18] cnt32_to_63 should use smp_rmb() Mathieu Desnoyers
2008-11-07 6:05 ` Andrew Morton
2008-11-07 8:12 ` Nicolas Pitre
2008-11-07 8:38 ` Andrew Morton
2008-11-07 11:20 ` David Howells
2008-11-07 15:01 ` Nicolas Pitre
2008-11-07 15:50 ` Andrew Morton
2008-11-07 16:47 ` Nicolas Pitre
2008-11-07 17:21 ` Andrew Morton
2008-11-07 20:03 ` Nicolas Pitre
2008-11-07 16:55 ` David Howells
2008-11-07 16:21 ` David Howells
2008-11-07 16:29 ` Andrew Morton
2008-11-07 17:10 ` David Howells
2008-11-07 17:26 ` Andrew Morton
2008-11-07 18:00 ` Mathieu Desnoyers
2008-11-07 18:21 ` Andrew Morton
2008-11-07 18:30 ` Harvey Harrison
2008-11-07 18:42 ` Mathieu Desnoyers
2008-11-07 18:33 ` Mathieu Desnoyers
2008-11-07 18:36 ` Linus Torvalds
2008-11-07 18:45 ` Andrew Morton
2008-11-07 16:07 ` David Howells
2008-11-07 16:47 ` Mathieu Desnoyers
2008-11-07 20:11 ` Russell King
2008-11-07 21:36 ` Mathieu Desnoyers
2008-11-07 22:18 ` Russell King
2008-11-07 22:36 ` Mathieu Desnoyers
2008-11-07 23:41 ` David Howells
2008-11-08 0:15 ` Russell King
2008-11-08 15:24 ` Nicolas Pitre
2008-11-08 23:20 ` [PATCH] clarify usage expectations for cnt32_to_63() Nicolas Pitre
2008-11-09 2:25 ` Mathieu Desnoyers
2008-11-09 2:54 ` Nicolas Pitre
2008-11-09 5:06 ` Nicolas Pitre
2008-11-09 5:27 ` [PATCH v2] " Nicolas Pitre
2008-11-09 6:48 ` Mathieu Desnoyers
2008-11-09 13:34 ` Nicolas Pitre
2008-11-09 13:43 ` Russell King
2008-11-09 16:22 ` Mathieu Desnoyers
2008-11-10 4:20 ` Nicolas Pitre
2008-11-10 4:42 ` Andrew Morton
2008-11-10 21:34 ` Nicolas Pitre
2008-11-10 21:58 ` Andrew Morton
2008-11-10 23:15 ` Nicolas Pitre
2008-11-10 23:22 ` Andrew Morton
2008-11-10 23:38 ` Steven Rostedt
2008-11-11 0:26 ` Nicolas Pitre
2008-11-11 18:28 ` [PATCH] convert cnt32_to_63 to inline Mathieu Desnoyers
2008-11-11 19:13 ` Russell King
2008-11-11 20:11 ` Mathieu Desnoyers
2008-11-11 21:51 ` Russell King
2008-11-12 3:48 ` Mathieu Desnoyers
2008-11-11 21:00 ` Nicolas Pitre
2008-11-11 21:13 ` Russell King
2008-11-11 22:31 ` David Howells
2008-11-11 22:37 ` Peter Zijlstra
2008-11-12 1:13 ` Steven Rostedt
2008-11-08 0:45 ` [RFC patch 08/18] cnt32_to_63 should use smp_rmb() David Howells
2008-11-07 17:04 ` David Howells
2008-11-07 17:17 ` Mathieu Desnoyers
2008-11-07 23:27 ` David Howells
2008-11-07 11:03 ` David Howells
2008-11-07 16:51 ` Mathieu Desnoyers
2008-11-07 20:18 ` Nicolas Pitre
2008-11-07 23:55 ` David Howells
2008-11-07 10:59 ` David Howells
2008-11-07 5:23 ` [RFC patch 09/18] Powerpc : Trace clock Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 10/18] Sparc64 " Mathieu Desnoyers
2008-11-07 5:45 ` David Miller
2008-11-07 5:23 ` [RFC patch 11/18] LTTng timestamp sh Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 12/18] LTTng - TSC synchronicity test Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 13/18] x86 : remove arch-specific tsc_sync.c Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 14/18] MIPS use tsc_sync.c Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 15/18] MIPS : export hpt frequency for trace_clock Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 16/18] MIPS create empty sync_core() Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 17/18] MIPS : Trace clock Mathieu Desnoyers
2008-11-07 11:53 ` Peter Zijlstra
2008-11-07 17:44 ` Mathieu Desnoyers
2008-11-07 5:23 ` [RFC patch 18/18] x86 trace clock Mathieu Desnoyers
2008-11-07 10:55 ` [RFC patch 08/18] cnt32_to_63 should use smp_rmb() David Howells
2008-11-07 17:09 ` Mathieu Desnoyers
2008-11-07 17:33 ` Steven Rostedt
2008-11-07 19:18 ` Mathieu Desnoyers
2008-11-07 19:32 ` Peter Zijlstra
2008-11-07 20:02 ` Mathieu Desnoyers
2008-11-07 20:45 ` Mathieu Desnoyers
2008-11-07 20:54 ` Paul E. McKenney
2008-11-07 21:04 ` Steven Rostedt
2008-11-08 0:34 ` Paul E. McKenney
2008-11-07 21:16 ` Mathieu Desnoyers
2008-11-07 20:08 ` Steven Rostedt
2008-11-07 20:55 ` Paul E. McKenney
2008-11-07 21:27 ` Mathieu Desnoyers
2008-11-07 20:36 ` Nicolas Pitre
2008-11-07 20:55 ` Mathieu Desnoyers
2008-11-07 21:22 ` Nicolas Pitre
2008-11-07 23:50 ` David Howells
2008-11-08 0:55 ` Steven Rostedt
2008-11-09 11:51 ` David Howells
2008-11-09 14:31 ` Steven Rostedt
2008-11-09 16:18 ` Mathieu Desnoyers